Fair Data Point Harvester Update Strategy
As mentioned earlier, a CKAN harvester must implement gather_stage, fetch_stage and import_stage.
During gather_stage the Fair Data Point harvester requests all the available resources from a source.
For each resource a unique guid is generated. The harvester generates guids as follows:
- for a catalog
catalog=<FDP link to a catalog>
- for a dataset
catalog=<FDP link to the dataset's parent catalog>;dataset=<FDP link to a dataset>
where the FDP link to a catalog/dataset is the FDP reference URL (subject URL) of the resource.
For example, for the dataset https://health-ri.sandbox.semlab-leiden.nl/dataset/d7129d28-b72a-437f-8db0-4f0258dd3c25
the CKAN harvester guid will be catalog=https://health-ri.sandbox.semlab-leiden.nl/catalog/e3faf7ad-050c-475f-8ce4-da7e2faa5cd0;dataset=https://health-ri.sandbox.semlab-leiden.nl/dataset/d7129d28-b72a-437f-8db0-4f0258dd3c25
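The guid format above can be illustrated with a minimal Python sketch. The helper name build_guid is hypothetical and is not taken from the harvester's code; only the resulting string format comes from the description above.

```python
from typing import Optional


def build_guid(catalog_url: str, dataset_url: Optional[str] = None) -> str:
    """Build a guid in the format described above.

    Hypothetical helper for illustration; not the harvester's own function.
    """
    guid = f"catalog={catalog_url}"
    if dataset_url is not None:
        # A dataset guid also carries the URL of its parent catalog.
        guid += f";dataset={dataset_url}"
    return guid


catalog = "https://health-ri.sandbox.semlab-leiden.nl/catalog/e3faf7ad-050c-475f-8ce4-da7e2faa5cd0"
dataset = "https://health-ri.sandbox.semlab-leiden.nl/dataset/d7129d28-b72a-437f-8db0-4f0258dd3c25"

catalog_guid = build_guid(catalog)
dataset_guid = build_guid(catalog, dataset)
```

Because the parent catalog URL is part of a dataset guid, the same dataset URL combined with a different catalog URL yields a different guid.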
Then the harvester queries the CKAN database for guids previously harvested from the same source. The query is as follows:
SELECT harvest_object.guid AS harvest_object_guid, harvest_object.package_id AS harvest_object_package_id
FROM harvest_object
WHERE harvest_object.current = true AND harvest_object.harvest_source_id = %(harvest_source_id_1)s
where harvest_source_id_1 is the harvester source id of the current job.
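To show what this query returns, here is a small self-contained sketch that runs an equivalent query against an in-memory SQLite database. SQLite stands in for the real CKAN database only for illustration, so the parameter style and the boolean literal differ from the query above. The table is reduced to the four columns the query uses, and all row values are made up.

```python
import sqlite3

# In-memory stand-in for the CKAN database (assumption: only the four
# columns used by the query are modelled here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE harvest_object ("
    "guid TEXT, package_id TEXT, current INTEGER, harvest_source_id TEXT)"
)
conn.executemany(
    "INSERT INTO harvest_object VALUES (?, ?, ?, ?)",
    [
        ("catalog=c1;dataset=d1", "pkg-1", 1, "source-a"),
        ("catalog=c1;dataset=d2", "pkg-2", 0, "source-a"),  # not current
        ("catalog=c9;dataset=d9", "pkg-9", 1, "source-b"),  # other source
    ],
)

# Same filter as the query above: current objects of one harvest source.
rows = conn.execute(
    "SELECT harvest_object.guid, harvest_object.package_id "
    "FROM harvest_object "
    "WHERE harvest_object.current = 1 "
    "AND harvest_object.harvest_source_id = :harvest_source_id",
    {"harvest_source_id": "source-a"},
).fetchall()

# Map each previously harvested guid to its CKAN package id.
guid_to_package_id = dict(rows)
conn.close()
```

Only the current harvest object of the requested source ends up in the mapping; objects that are no longer current or that belong to another source are ignored.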
Based on these two lists of guids, a harvest object is created for each guid and assigned one of the statuses delete, new or change.
Data for resources with the last two statuses are collected and parsed during fetch_stage. During import_stage resources
are deleted from, updated in or inserted into the CKAN database according to the harvest object status.
According to the documentation, a CKAN instance can be configured to prevent certain fields from being updated.
During gather_stage, the guids of datasets to delete are computed as delete = guids_in_db - guids_in_harvest, where
guids_in_db are the guids from the CKAN harvest_object table for the given source (the result of the query above) and
guids_in_harvest are the guids generated from the source in the current run. After that, still during gather_stage,
the harvester sets 'current': False on the harvest objects marked for deletion, so they stay in the database but are not shown.
Then, during the import_stage, the harvester actually deletes those datasets by calling toolkit.get_action('package_delete')(context, {ID: harvest_object.package_id}).
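The gather-stage bookkeeping described above can be sketched with plain set operations. The delete rule is the one given in the text. The rules for new and change are assumptions chosen as its natural complement, and the function name classify_guids is hypothetical.

```python
from typing import Dict, Iterable, Set


def classify_guids(
    guids_in_harvest: Iterable[str], guids_in_db: Iterable[str]
) -> Dict[str, Set[str]]:
    """Split guids into new, change and delete groups.

    Hypothetical helper for illustration; not the harvester's own function.
    """
    harvested = set(guids_in_harvest)
    stored = set(guids_in_db)
    return {
        # Assumption: seen in the source now, never harvested before.
        "new": harvested - stored,
        # Assumption: seen in the source now and harvested before.
        "change": harvested & stored,
        # From the text: delete = guids_in_db - guids_in_harvest.
        "delete": stored - harvested,
    }


statuses = classify_guids(
    guids_in_harvest=["catalog=c1;dataset=d1", "catalog=c1;dataset=d3"],
    guids_in_db=["catalog=c1;dataset=d1", "catalog=c1;dataset=d2"],
)
```

In this example d3 is treated as new, d1 as a change and d2 as a deletion. Each guid lands in exactly one group, so every harvest object gets a single status.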
- Datasets will be considered "new" if one configures a harvester source in CKAN, deletes it and then re-configures it.
- If a dataset is set for deletion and something goes wrong during the import_stage, the dataset stays in the database forever as a no-longer-current one.
- If a dataset is moved in FDP from one catalogue to another (by updating the DCTERMS.isPartOf reference on the dataset level) it will be considered a new one, because the guid of a CKAN harvested resource (unlike FDP itself) includes a catalogue id (see above).