User Portal - Docs

Harvester


The basic harvest extension is publicly available and developed by the CKAN community: https://github.com/ckan/ckanext-harvest. It provides the basic infrastructure for harvesting: a process that receives an endpoint, tries to identify how many objects it contains, and sends each individual object to a fetch consumer, which in turn tries to convert the object into a CKAN dataset, resource, etc. The core extension implements CKAN-to-CKAN harvesting, which means a CKAN instance can harvest other CKAN instances.

The setup.py of the core CKAN harvest extension lists the enabled plugins, where:

  • harvest=ckanext.harvest.plugin:Harvest is the harvesting infrastructure
  • ckan_harvester=ckanext.harvest.harvesters:CKANHarvester is a basic harvester responsible for harvesting other CKANs

These settings determine which source types are offered in the harvester UI when a new harvester is added.
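
As a rough illustration of how such declarations are read, the sketch below parses entry-point text in the INI form that setup.py uses and splits each target into module and class. The first two entries come from the list above; the my_harvester entry, the ckanext.myext module and the list_plugins helper are hypothetical names added only for this example.

```python
from configparser import ConfigParser

# Entry points in the INI form used by setup.py's entry_points argument.
# The first two entries are the ones listed above; my_harvester is a
# HYPOTHETICAL custom harvester, added only to show where one would go.
ENTRY_POINTS = """
[ckan.plugins]
harvest = ckanext.harvest.plugin:Harvest
ckan_harvester = ckanext.harvest.harvesters:CKANHarvester
my_harvester = ckanext.myext.harvesters:MyHarvester
"""


def list_plugins(text):
    """Return {plugin_name: (module_path, class_name)} for [ckan.plugins]."""
    # Only "=" is treated as a delimiter so the ":" inside each target
    # ("module:Class") is kept as part of the value.
    parser = ConfigParser(delimiters=("=",))
    parser.read_string(text)
    return {
        name: tuple(target.split(":", 1))
        for name, target in parser["ckan.plugins"].items()
    }


plugins = list_plugins(ENTRY_POINTS)
for name, (module, cls) in plugins.items():
    print(f"{name}: class {cls} in module {module}")
```

The name on the left of each entry is the plugin name; the value on the right points to the class that implements it.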

Among the publicly available harvesters, the DCAT harvester is the most widely used. The DCAT RDF harvester is well documented and can serve as a reference example of harvester configuration and code. There are several harvesters in ckanext-dcat, and all of them extend the core harvester interface HarvesterBase.

Three methods of this interface are of particular interest and define about 80% of the harvesting process:

  • gather_stage - goes to the original URL, tries to identify each individual object, saves it to the DB as a harvest object (in the harvest_object table) and sends it on for fetching.
  • fetch_stage - checks what gather_stage saved and, if the information is incomplete, goes back to the source to fetch additional data.
  • import_stage - maps the data to CKAN fields. This is the place to implement hooks for data preprocessing, etc.
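
The flow of the three stages can be sketched in plain Python. This is a simplified stand-in, not the ckanext-harvest API: SketchHarvester, HARVEST_OBJECTS, REMOTE_SOURCE, the preprocess hook and the method signatures are all assumptions made for illustration. A real harvester would extend HarvesterBase, work with harvest jobs and harvest objects stored in the database, and be driven by the gather and fetch consumers rather than a simple loop.

```python
import json
import uuid

# In-memory stand-in for the harvest_object table (assumption for this sketch).
HARVEST_OBJECTS = {}

# Stand-in for the remote endpoint; a real harvester would request a URL.
REMOTE_SOURCE = {
    "ds-1": {"name": "Air Quality", "summary": " Hourly readings "},
    "ds-2": {"name": "Bus Stops", "summary": "Stop locations"},
}


class SketchHarvester:
    """Hypothetical harvester that mirrors the three stages described above."""

    def gather_stage(self, source):
        """Identify each remote object and save one harvest object per item."""
        object_ids = []
        for guid in source:
            obj_id = str(uuid.uuid4())
            HARVEST_OBJECTS[obj_id] = {"guid": guid, "content": None}
            object_ids.append(obj_id)
        return object_ids

    def fetch_stage(self, obj_id, source):
        """Go back to the source only if the saved object is incomplete."""
        obj = HARVEST_OBJECTS[obj_id]
        if obj["content"] is None:
            obj["content"] = json.dumps(source[obj["guid"]])
        return True

    def preprocess(self, record):
        """Hook for data preprocessing; here it only trims whitespace."""
        return {key: value.strip() for key, value in record.items()}

    def import_stage(self, obj_id):
        """Map the fetched record to CKAN-style dataset fields."""
        record = self.preprocess(json.loads(HARVEST_OBJECTS[obj_id]["content"]))
        return {
            "name": record["name"].lower().replace(" ", "-"),
            "title": record["name"],
            "notes": record["summary"],
        }


harvester = SketchHarvester()
datasets = []
for obj_id in harvester.gather_stage(REMOTE_SOURCE):
    if harvester.fetch_stage(obj_id, REMOTE_SOURCE):
        datasets.append(harvester.import_stage(obj_id))

for dataset in datasets:
    print(dataset)
```

Keeping preprocessing in a separate hook called from import_stage keeps the field mapping itself short, which matches the note above that import_stage is the place for such hooks.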