Fair Data Point Harvester Update Strategy

As mentioned, a CKAN harvester must implement three stages: gather_stage, fetch_stage and import_stage.

During gather_stage the Fair Data Point harvester requests all available resources from a source and generates a unique guid for each resource. The guids have the following format:

for a catalog: catalog=<FDP link to a catalog>
for a dataset: catalog=<FDP link to the dataset's parent catalog>;dataset=<FDP link to a dataset>

where the FDP link to a catalog/dataset is the FDP reference URL (subject URL) of the resource. For example, for the dataset https://health-ri.sandbox.semlab-leiden.nl/dataset/d7129d28-b72a-437f-8db0-4f0258dd3c25 the CKAN harvester guid is catalog=https://health-ri.sandbox.semlab-leiden.nl/catalog/e3faf7ad-050c-475f-8ce4-da7e2faa5cd0;dataset=https://health-ri.sandbox.semlab-leiden.nl/dataset/d7129d28-b72a-437f-8db0-4f0258dd3c25.
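The guid format above can be illustrated with a minimal Python sketch. The function name build_guid and its signature are hypothetical and are not taken from the harvester code; only the resulting string format comes from the description above.

```python
from typing import Optional


def build_guid(catalog_url: str, dataset_url: Optional[str] = None) -> str:
    """Build a guid in the documented format (hypothetical helper)."""
    guid = f"catalog={catalog_url}"
    if dataset_url is not None:
        # A dataset guid also carries the URL of its parent catalog.
        guid += f";dataset={dataset_url}"
    return guid


catalog = "https://health-ri.sandbox.semlab-leiden.nl/catalog/e3faf7ad-050c-475f-8ce4-da7e2faa5cd0"
dataset = "https://health-ri.sandbox.semlab-leiden.nl/dataset/d7129d28-b72a-437f-8db0-4f0258dd3c25"

catalog_guid = build_guid(catalog)
dataset_guid = build_guid(catalog, dataset)
print(catalog_guid)
print(dataset_guid)
```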

Then the harvester queries the CKAN database for guids previously harvested from the same source. The query is as follows:

SELECT harvest_object.guid AS harvest_object_guid, harvest_object.package_id AS harvest_object_package_id 
FROM harvest_object 
WHERE harvest_object.current = true AND harvest_object.harvest_source_id = %(harvest_source_id_1)s

where harvest_source_id_1 is the harvest source id of the current job.
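To show what the query returns, here is a small sketch that runs an equivalent query against an in-memory SQLite table. It is only an illustration: the real harvester runs against the CKAN database, the table here contains only the four columns used by the query, and the row values, source ids and the function name guids_in_db_for_source are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the CKAN harvest_object table (assumed columns only).
conn.execute(
    """
    CREATE TABLE harvest_object (
        guid TEXT,
        package_id TEXT,
        "current" BOOLEAN,
        harvest_source_id TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO harvest_object VALUES (?, ?, ?, ?)",
    [
        ("catalog=c1;dataset=d1", "pkg-1", True, "source-a"),
        ("catalog=c1;dataset=d2", "pkg-2", False, "source-a"),  # not current
        ("catalog=c9;dataset=d9", "pkg-9", True, "source-b"),  # other source
    ],
)


def guids_in_db_for_source(connection, harvest_source_id):
    """Map guid -> package_id for current harvest objects of one source."""
    rows = connection.execute(
        """
        SELECT guid, package_id
        FROM harvest_object
        WHERE "current" = 1 AND harvest_source_id = ?
        """,
        (harvest_source_id,),
    ).fetchall()
    return {guid: package_id for guid, package_id in rows}


guid_to_package_id = guids_in_db_for_source(conn, "source-a")
print(guid_to_package_id)
```

Only the current harvest object of the requested source is returned; rows that are not current or belong to another source are ignored.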

The harvester compares these two lists of guids, creates a harvest object for each guid and assigns it the status delete, new or change.
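The comparison can be sketched with plain set operations. The delete rule matches the formula given in the deletion section of this document; the rules for new and change are an assumption about how the remaining guids are split, and the function name classify_guids is hypothetical.

```python
def classify_guids(guids_in_harvest, guids_in_db):
    """Split guids into new, change and delete (hypothetical helper)."""
    harvested = set(guids_in_harvest)
    known = set(guids_in_db)
    return {
        # Assumption: guids seen in the source but not yet in CKAN are new.
        "new": harvested - known,
        # Assumption: guids present on both sides are candidates for update.
        "change": harvested & known,
        # Guids known to CKAN but no longer offered by the source.
        "delete": known - harvested,
    }


statuses = classify_guids(
    guids_in_harvest={"catalog=c1;dataset=d1", "catalog=c1;dataset=d3"},
    guids_in_db={"catalog=c1;dataset=d1", "catalog=c1;dataset=d2"},
)
print(statuses)
```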

Data for resources with status new or change is collected and parsed during fetch_stage. During import_stage, resources are deleted from, updated in or inserted into the CKAN database according to the harvest object status.

According to the documentation, a CKAN instance can be configured to prevent certain fields from being updated.

Deletion of a dataset

During gather_stage, the guids of datasets to delete are computed as delete = guids_in_db - guids_in_harvest, where guids_in_db is the set of guids from the CKAN harvest_object table for the given source (the result of the query above) and guids_in_harvest is the set of guids generated for the resources currently available from the source. Still during gather_stage, the harvester sets 'current': False on the harvest objects to be deleted, so they stay in the database but are not shown.

Then, during the import_stage, the harvester actually deletes those datasets by calling toolkit.get_action('package_delete')(context, {ID: harvest_object.package_id}).

Caveats:

If a harvest source is configured in CKAN, deleted and then configured again, its datasets are considered new.
If a dataset is marked for deletion and something goes wrong during import_stage, the dataset is not deleted and its harvest object stays non-current indefinitely. Because the query above selects only current harvest objects, the deletion is not retried on later runs.
If a dataset is moved in the FDP from one catalog to another (by updating the DCTERMS.isPartOf reference at the dataset level), it is considered a new dataset, because the guid of a CKAN harvested resource, unlike the identifier in the FDP itself, includes the catalog id (see above).
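The last caveat can be reproduced with a short sketch. The example.org URLs and the helper build_guid are hypothetical, and treating guids_in_harvest - guids_in_db as the set of new datasets is an assumption that mirrors the documented delete rule.

```python
def build_guid(catalog_url, dataset_url):
    """Build a dataset guid in the documented format (hypothetical helper)."""
    return f"catalog={catalog_url};dataset={dataset_url}"


dataset = "https://example.org/dataset/1"

# Guid stored in CKAN from an earlier run, when the dataset was in catalog a.
guids_in_db = {build_guid("https://example.org/catalog/a", dataset)}
# Guid generated in the current run, after the dataset moved to catalog b.
guids_in_harvest = {build_guid("https://example.org/catalog/b", dataset)}

to_delete = guids_in_db - guids_in_harvest
to_create = guids_in_harvest - guids_in_db  # assumed rule for "new"
to_update = guids_in_harvest & guids_in_db  # assumed rule for "change"

print("delete:", to_delete)
print("new:", to_create)
print("change:", to_update)
```

Although the dataset URL is unchanged, the two guids differ in their catalog part, so the old guid ends up in the delete set and the new guid is treated as a new dataset rather than an update.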