Harvester overview
The Fair Data Point harvester in CKAN processes datasets through three stages:
- gather_stage
- fetch_stage
- import_stage
Gather stage
During gather_stage, the harvester requests all available resources from a source and generates a unique guid for each resource.
- GUID generation
The harvester generates the following GUIDs:
- for a catalog: catalog=<FDP link to a catalog>
- for a dataset: catalog=<FDP link to the dataset's parent catalog>;dataset=<FDP link to a dataset>
where "FDP link to a catalog/dataset" is the FDP reference URL (subject URL) of the resource.
For example, for this dataset: https://health-ri.sandbox.semlab-leiden.nl/dataset/d7129d28-b72a-437f-8db0-4f0258dd3c25, the CKAN harvester GUID will be:
catalog=https://health-ri.sandbox.semlab-leiden.nl/catalog/e3faf7ad-050c-475f-8ce4-da7e2faa5cd0;dataset=https://health-ri.sandbox.semlab-leiden.nl/dataset/d7129d28-b72a-437f-8db0-4f0258dd3c25
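The GUID format described above can be illustrated with a minimal Python sketch. The helper name build_guid is hypothetical and is not taken from the harvester's code; it only reproduces the string layout shown in the example.

```python
from typing import Optional


def build_guid(catalog_url: str, dataset_url: Optional[str] = None) -> str:
    """Hypothetical helper: compose a harvester GUID from FDP subject URLs."""
    # A catalog GUID contains only the catalog part.
    guid = f"catalog={catalog_url}"
    # A dataset GUID appends the dataset part after a semicolon.
    if dataset_url is not None:
        guid += f";dataset={dataset_url}"
    return guid


catalog = "https://health-ri.sandbox.semlab-leiden.nl/catalog/e3faf7ad-050c-475f-8ce4-da7e2faa5cd0"
dataset = "https://health-ri.sandbox.semlab-leiden.nl/dataset/d7129d28-b72a-437f-8db0-4f0258dd3c25"

catalog_guid = build_guid(catalog)
dataset_guid = build_guid(catalog, dataset)
print(dataset_guid)
```

Because the parent catalog URL is part of the dataset GUID, the same dataset URL under a different catalog yields a different GUID.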
- Status assignment
The harvester queries the CKAN database for guids previously harvested from the same source:
SELECT harvest_object.guid AS harvest_object_guid, harvest_object.package_id AS harvest_object_package_id
FROM harvest_object
WHERE harvest_object.current = true AND harvest_object.harvest_source_id = %(harvest_source_id_1)s
where harvest_source_id_1 is the harvest source id of the current job.
Based on these two lists of guids (those gathered from the source and those returned by the query), a harvest object is created and assigned the status delete, new or change:
- delete = guids_in_db - guids_in_harvest, where guids_in_db are the guids from the CKAN harvest_object table for the given source (the result of the query above)
- new = resources that appear in the harvest but not in the database
- change = resources that exist in both, with potential updates
For resources marked as delete, the harvester sets the status of harvest objects to 'current': False, so they stay in the database but are not shown.
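The status assignment amounts to set arithmetic on the two guid collections. The following is a minimal sketch, not the harvester's actual code: the function name assign_statuses and the sample guids are assumptions made for illustration.

```python
from typing import Dict, Set


def assign_statuses(guids_in_db: Set[str], guids_in_harvest: Set[str]) -> Dict[str, str]:
    """Hypothetical helper: map each guid to delete, new or change."""
    statuses: Dict[str, str] = {}
    # In the database but no longer offered by the source.
    for guid in guids_in_db - guids_in_harvest:
        statuses[guid] = "delete"
    # Offered by the source but not yet in the database.
    for guid in guids_in_harvest - guids_in_db:
        statuses[guid] = "new"
    # Present on both sides, so the metadata may have changed.
    for guid in guids_in_db & guids_in_harvest:
        statuses[guid] = "change"
    return statuses


# Illustrative guids only; real guids follow the catalog=...;dataset=... format.
in_db = {"guid-a", "guid-b"}
in_harvest = {"guid-b", "guid-c"}
result = assign_statuses(in_db, in_harvest)
```

Here guid-a would be marked delete, guid-c new, and guid-b change.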
Fetch stage
During fetch_stage, data for resources with new or change status are collected and parsed from the source.
Import stage
During import_stage, resources are deleted, updated, or inserted into the CKAN database based on the harvest object status:
- delete: The harvester deletes datasets by calling toolkit.get_action('package_delete')(context, {ID: harvest_object.package_id})
- new: Datasets are inserted into CKAN
- change: Existing datasets are updated with new metadata
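The dispatch on status can be sketched as below. This is an illustration under assumptions, not the harvester's implementation: only the use of package_delete is stated above, while mapping new and change to CKAN's package_create and package_update actions is a guess. The names import_object and fake_get_action are hypothetical; the fake stands in for toolkit.get_action so the sketch runs without a CKAN installation.

```python
from typing import Any, Callable, Dict, List, Tuple

# Assumption: only "delete" -> package_delete is confirmed by the text above;
# the other two mappings are guesses made for illustration.
ACTION_FOR_STATUS = {
    "delete": "package_delete",
    "new": "package_create",
    "change": "package_update",
}


def import_object(get_action: Callable[[str], Callable], context: Dict[str, Any],
                  status: str, data_dict: Dict[str, Any]) -> Any:
    """Hypothetical helper: call the CKAN action that matches the status."""
    if status not in ACTION_FOR_STATUS:
        raise ValueError(f"unknown harvest object status: {status!r}")
    return get_action(ACTION_FOR_STATUS[status])(context, data_dict)


# Stand-in for toolkit.get_action that records calls instead of touching CKAN.
calls: List[Tuple[str, Dict[str, Any]]] = []


def fake_get_action(name: str) -> Callable:
    def action(context: Dict[str, Any], data_dict: Dict[str, Any]) -> bool:
        calls.append((name, data_dict))
        return True
    return action


import_object(fake_get_action, {}, "delete", {"id": "pkg-1"})
import_object(fake_get_action, {}, "new", {"name": "pkg-2"})
```

In a real CKAN extension, toolkit.get_action would be passed in place of fake_get_action.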
According to the CKAN documentation, a CKAN instance can be configured to prevent certain fields from being updated.
- CKAN treats datasets as "new" if you delete and reconfigure their harvester source.
- If deletion fails during the import_stage, the dataset remains hidden in the database permanently.
- When you move a dataset between catalogues in FDP (by updating DCTERMS.isPartOf), CKAN treats it as a new dataset because the GUID includes the catalogue id.