Fair Data Point Harvester Update Strategy
As was mentioned, a CKAN harvester must implement gather_stage, fetch_stage and import_stage.
During gather_stage the Fair Data Point harvester requests all available resources from a source.
For each resource the harvester generates a unique guid as follows:
- for a catalog: catalog=<FDP link to a catalog>
- for a dataset: catalog=<FDP link to the dataset's parent catalog>;dataset=<FDP link to a dataset>
where FDP link to a catalog/dataset is the FDP reference URL (subject URL) of the resource.
For example, for the dataset
https://health-ri.sandbox.semlab-leiden.nl/dataset/d7129d28-b72a-437f-8db0-4f0258dd3c25
the CKAN harvester guid will be
catalog=https://health-ri.sandbox.semlab-leiden.nl/catalog/e3faf7ad-050c-475f-8ce4-da7e2faa5cd0;dataset=https://health-ri.sandbox.semlab-leiden.nl/dataset/d7129d28-b72a-437f-8db0-4f0258dd3c25
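The guid scheme can be sketched as a small helper. This is a minimal illustration, not the harvester's actual code: the function name build_guid and its parameters are hypothetical.

```python
from typing import Optional


def build_guid(catalog_url: str, dataset_url: Optional[str] = None) -> str:
    """Build a harvester guid from FDP subject URLs (hypothetical helper)."""
    # A catalog guid contains only the catalog's subject URL.
    guid = f"catalog={catalog_url}"
    # A dataset guid appends the dataset's subject URL after a semicolon.
    if dataset_url is not None:
        guid += f";dataset={dataset_url}"
    return guid


if __name__ == "__main__":
    print(build_guid(
        "https://health-ri.sandbox.semlab-leiden.nl/catalog/e3faf7ad-050c-475f-8ce4-da7e2faa5cd0",
        "https://health-ri.sandbox.semlab-leiden.nl/dataset/d7129d28-b72a-437f-8db0-4f0258dd3c25",
    ))
```

Because the parent catalog URL is part of a dataset guid, the same dataset under a different catalog yields a different guid.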
Then the harvester queries the CKAN database for guids previously harvested from the same source. The query is the following:
SELECT harvest_object.guid AS harvest_object_guid, harvest_object.package_id AS harvest_object_package_id
FROM harvest_object
WHERE harvest_object.current = true AND harvest_object.harvest_source_id = %(harvest_source_id_1)s
where harvest_source_id_1 is the harvest source id of the current job.
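To see what the query returns, it can be tried against a throwaway database. The sketch below uses an in-memory SQLite database purely as a stand-in for CKAN's database, so the column current is quoted, true is written as 1, and the placeholder is "?". The table layout is reduced to the four columns the query touches, and the helper names and sample rows are invented for illustration.

```python
import sqlite3

# The query from above, adapted to SQLite syntax (stand-in database only).
QUERY = """
SELECT harvest_object.guid AS harvest_object_guid,
       harvest_object.package_id AS harvest_object_package_id
FROM harvest_object
WHERE harvest_object."current" = 1
  AND harvest_object.harvest_source_id = ?
"""


def fetch_current_guids(conn, harvest_source_id):
    """Return a {guid: package_id} mapping for current objects of one source."""
    rows = conn.execute(QUERY, (harvest_source_id,)).fetchall()
    return dict(rows)


def make_demo_db():
    """Create a reduced harvest_object table with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE harvest_object ("
        'guid TEXT, package_id TEXT, "current" INTEGER, harvest_source_id TEXT)'
    )
    conn.executemany(
        "INSERT INTO harvest_object VALUES (?, ?, ?, ?)",
        [
            ("catalog=https://example.org/catalog/1", "pkg-1", 1, "source-a"),
            ("catalog=https://example.org/catalog/1;dataset=https://example.org/dataset/1",
             "pkg-2", 1, "source-a"),
            # Not current any more, so the query skips it.
            ("catalog=https://example.org/catalog/1;dataset=https://example.org/dataset/0",
             "pkg-0", 0, "source-a"),
            # Belongs to a different harvest source, so the query skips it.
            ("catalog=https://example.org/catalog/9", "pkg-9", 1, "source-b"),
        ],
    )
    return conn


if __name__ == "__main__":
    print(fetch_current_guids(make_demo_db(), "source-a"))
```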
Based on these two lists of guids, a harvest object is created for each resource and assigned the status delete, new or change.
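The status assignment can be pictured as plain set arithmetic over the two guid lists. The following is a sketch under assumptions: only the delete rule (guids_in_db - guids_in_harvest) is stated in this document; treating guids present only in the harvest as new and guids present in both as change is an assumption, and the function name classify_guids is hypothetical.

```python
def classify_guids(guids_in_harvest, guids_in_db):
    """Split guids into new / change / delete sets (illustrative sketch)."""
    guids_in_harvest = set(guids_in_harvest)
    guids_in_db = set(guids_in_db)
    return {
        # Assumption: seen at the source but never harvested before.
        "new": guids_in_harvest - guids_in_db,
        # Assumption: seen at the source and already harvested.
        "change": guids_in_harvest & guids_in_db,
        # Stated rule: harvested before but no longer present at the source.
        "delete": guids_in_db - guids_in_harvest,
    }


if __name__ == "__main__":
    statuses = classify_guids(
        guids_in_harvest=["catalog=a", "catalog=a;dataset=2"],
        guids_in_db=["catalog=a", "catalog=a;dataset=1"],
    )
    for status, guids in statuses.items():
        print(status, sorted(guids))
```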
Data for the last two types of resources are collected and parsed during fetch_stage. During import_stage resources are deleted, updated or inserted into the CKAN database according to the harvest object status.
As per documentation it is possible to configure a CKAN instance to prevent updating of certain fields.
During gather_stage the guids of datasets to delete are computed as delete = guids_in_db - guids_in_harvest, where guids_in_db is the set of guids from the CKAN harvest_object table for the given source (the result of the query above) and guids_in_harvest is the set of guids generated from the source during the current run. After that, still during gather_stage, the harvester marks the harvest objects to be deleted with 'current': False, so they stay in the database but are not shown.
Then, during import_stage, the harvester actually deletes those datasets by calling toolkit.get_action('package_delete')(context, {ID: harvest_object.package_id}).
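The two-step deletion can be sketched without a running CKAN instance. Everything here is a simplified stand-in: HarvestObject is a plain dataclass rather than the real model, the function names are hypothetical, and package_delete is any callable taking a data dict. In CKAN that callable would be obtained through toolkit.get_action('package_delete') and would also receive a context.

```python
from dataclasses import dataclass


@dataclass
class HarvestObject:
    """Simplified stand-in for a CKAN harvest object."""
    guid: str
    package_id: str
    current: bool = True


def mark_for_deletion(objects, guids_in_harvest):
    """Gather step: flag objects whose guid is gone from the source."""
    to_delete = []
    for obj in objects:
        if obj.current and obj.guid not in guids_in_harvest:
            obj.current = False  # stays in the database but is no longer shown
            to_delete.append(obj)
    return to_delete


def import_deletions(to_delete, package_delete):
    """Import step: delete the package behind each flagged object."""
    for obj in to_delete:
        package_delete({"id": obj.package_id})


if __name__ == "__main__":
    objects = [
        HarvestObject("catalog=a;dataset=1", "pkg-1"),
        HarvestObject("catalog=a;dataset=2", "pkg-2"),
    ]
    flagged = mark_for_deletion(objects, guids_in_harvest={"catalog=a;dataset=2"})
    import_deletions(flagged, package_delete=lambda data: print("deleting", data["id"]))
```

Since the flag is set in one stage and the delete call happens in another, a failure in the second step leaves the object flagged but its package still present.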
- Datasets will be considered “new” if a harvest source is configured in CKAN, deleted, and then configured again.
- If a dataset is set for deletion and something goes wrong during import_stage, the dataset stays in the database indefinitely, marked as no longer current.
- If a dataset is moved in FDP from one catalogue to another (by updating the DCTERMS.isPartOf reference on the dataset level), it will be considered a new one, because the guid of a CKAN harvested resource (unlike FDP itself) includes the catalogue id (see above).
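The last point follows directly from the guid format, as the small sketch below illustrates. The URLs are invented, dataset_guid is a hypothetical helper, and computing the new set as guids_in_harvest - guids_in_db is an assumption mirroring the delete rule.

```python
def dataset_guid(catalog_url, dataset_url):
    """Build a dataset guid in the format described above (hypothetical helper)."""
    return f"catalog={catalog_url};dataset={dataset_url}"


DATASET = "https://example.org/dataset/1"

# Previous run: the dataset belonged to catalogue "a".
guids_in_db = {dataset_guid("https://example.org/catalog/a", DATASET)}
# Current run: the same dataset now belongs to catalogue "b".
guids_in_harvest = {dataset_guid("https://example.org/catalog/b", DATASET)}

new = guids_in_harvest - guids_in_db     # assumption: rule for "new"
delete = guids_in_db - guids_in_harvest  # rule stated in this document

if __name__ == "__main__":
    print("new:", new)
    print("delete:", delete)
```

The dataset URL is unchanged, yet the two guids share nothing in common, so the old entry is scheduled for deletion and the moved dataset is harvested as a new one.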