Harvester
The basic harvest extension is publicly available and developed by the CKAN community: https://github.com/ckan/ckanext-harvest. It defines the basic infrastructure for harvesting: a process that receives an endpoint, tries to identify how many objects it contains, and sends each individual object to a fetch consumer, which in turn tries to convert each object into a CKAN dataset, resource, etc. CKAN-to-CKAN harvesting is implemented in the core extension, which means a CKAN instance can harvest other CKAN instances.
The setup.py of the core CKAN harvest extension lists the plugins it provides, where:
harvest=ckanext.harvest.plugin:Harvest
is the harvesting infrastructure, and
ckan_harvester=ckanext.harvest.harvesters:CKANHarvester
is a basic harvester responsible for harvesting other CKANs.
These settings determine which source types are offered in the harvester UI when a new harvest source is added.
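To make the `name=module:Class` form of these entries concrete, here is a minimal sketch that parses the two entries quoted above into a plugin name and the class it points to. It assumes the entries live in a `[ckan.plugins]` group of the setup.py entry points; `ENTRY_POINTS` and `parse_plugins` are hypothetical names used only for this illustration and are not part of ckanext-harvest.

```python
import configparser

# Assumed excerpt of the entry points declared in setup.py.
# Only the two entries discussed in the text are included.
ENTRY_POINTS = """
[ckan.plugins]
harvest=ckanext.harvest.plugin:Harvest
ckan_harvester=ckanext.harvest.harvesters:CKANHarvester
"""


def parse_plugins(entry_points_text):
    """Return {plugin_name: (module_path, class_name)} for [ckan.plugins]."""
    # Restrict delimiters to "=" so the ":" inside the target is kept intact.
    parser = configparser.ConfigParser(delimiters=("=",))
    parser.read_string(entry_points_text)
    plugins = {}
    for name, target in parser["ckan.plugins"].items():
        module_path, _, class_name = target.partition(":")
        plugins[name] = (module_path, class_name)
    return plugins
```

The names on the left-hand side (`harvest`, `ckan_harvester`) are the ones that would be listed in the `ckan.plugins` setting of the CKAN configuration file to enable them, for example `ckan.plugins = harvest ckan_harvester`.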
Among the publicly available harvesters, the DCAT harvester is the most widely used. The DCAT RDF harvester is well documented and can serve as an example of harvester configuration and code. There are several harvesters in ckanext-dcat, and they all extend the core harvester base class HarvesterBase.
Three methods of the interface are of particular interest; together they define about 80% of the harvesting process:
gather_stage - goes to the source URL, tries to identify the individual objects, and saves each one to the DB as a harvest object (into the harvest_object table).
fetch_stage - checks what was saved by gather_stage and, if the information is incomplete, goes back to the source to fetch additional data.
import_stage - maps the data to CKAN fields. This is the place to implement hooks for data preprocessing etc.
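The division of labour between the three stages can be sketched with a toy, in-memory harvester. This is not the ckanext-harvest API: `ToyHarvester`, `REMOTE_SOURCE`, `run_harvest` and the dict-based "rows" are hypothetical stand-ins, and the real methods work with harvest job and harvest object model instances rather than plain dicts. The sketch only illustrates which stage does what.

```python
import json
import uuid

# Hypothetical remote catalogue: guid -> full record at the source.
REMOTE_SOURCE = {
    "ds-1": {"title": " Air quality ", "description": "Hourly readings"},
    "ds-2": {"title": "Road works", "description": "Weekly updates"},
}


class ToyHarvester:
    """Toy three-stage harvester; plain dicts stand in for DB rows."""

    def __init__(self):
        self.harvest_objects = {}  # stand-in for the harvest_object table
        self.datasets = []         # stand-in for datasets created in CKAN

    def gather_stage(self, harvest_job):
        """Identify the objects at the source and save one row per object."""
        ids = []
        for guid in sorted(REMOTE_SOURCE):  # harvest_job["source_url"] is unused here
            obj_id = str(uuid.uuid4())
            self.harvest_objects[obj_id] = {"guid": guid, "content": None}
            ids.append(obj_id)
        return ids

    def fetch_stage(self, harvest_object):
        """If gather_stage left the content empty, fetch it from the source."""
        if harvest_object["content"] is None:
            record = REMOTE_SOURCE[harvest_object["guid"]]
            harvest_object["content"] = json.dumps(record)
        return True

    def import_stage(self, harvest_object):
        """Map the fetched record to CKAN-style fields (preprocessing goes here)."""
        record = json.loads(harvest_object["content"])
        self.datasets.append({
            "name": harvest_object["guid"],
            "title": record["title"].strip(),  # example preprocessing hook
            "notes": record["description"],
        })
        return True


def run_harvest(harvester, harvest_job):
    """Drive the stages in order: gather once, then fetch and import per object."""
    for obj_id in harvester.gather_stage(harvest_job):
        harvest_object = harvester.harvest_objects[obj_id]
        if harvester.fetch_stage(harvest_object):
            harvester.import_stage(harvest_object)
    return harvester.datasets
```

Calling `run_harvest(ToyHarvester(), {"source_url": "https://example.org/catalog"})` returns one mapped dataset per gathered object. In the real extension the stages are driven by separate gather and fetch consumers rather than a single loop.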