Test harvester locally
Test DCAT harvesting in CKAN by setting up a local nginx web server. This guide assumes you have CKAN running with the previously configured docker-compose.
Use hosted test data
You can also test harvesting with the Health-RI harvesting test data repository.
-
Find the CKAN sysadmin credentials in the
.envfile:CKAN_SYSADMIN_NAME
CKAN_SYSADMIN_PASSWORD -
Log in to CKAN with those credentials.
-
Create an organisation. CKAN requires an organisation before you can create a harvest source. The name can be any local test name, for example
testorg1. -
Create a harvest source for the test data from
Health-RI/harvesting-test-data:Field Value URL https://raw.githubusercontent.com/Health-RI/harvesting-test-data/c1019281532fb1114bd4eba362e94cc04ae18132/dataset_health.ttlTitle Any title you can recognise later, for example harvest_dataset_healthSource type Generic DCAT RDF HarvesterUpdate frequency ManualConfiguration { "profile": "fairdatapoint_dcat_ap", "rdf_format": "text/turtle", "force_all": "true" }The test data can change during development. Ask the team which version to use if you need reproducible results. The URL above pins an earlier commit.
-
Save the harvest source.
-
Trigger the harvest from the running
ckan-devcontainer. Replaceharvest_dataset_healthwith the title, name, or harvest source ID you used. If CKAN generated a different URL name, use the last part of the harvest source URL.docker compose exec -it ckan-dev bash
ckan --config=/srv/app/ckan.ini harvester run-test harvest_dataset_healthThe test data contains non-resolvable URIs, so several
404errors are expected. If you getsqlalchemy.exc.PendingRollbackError, rerun the harvest job and cancel the running job when prompted. This can happen because dataset and dataset series records are created in an order that is not deterministic.
Set up nginx web server
Start an nginx container to serve your DCAT files. Replace ~/Development/xnatdcat/ with the folder containing your test data.
docker run -it -d -p 8888:80 --name rdf -v ~/Development/xnatdcat/:/usr/share/nginx/html nginx
Verify the server is running by opening http://localhost:8888 in your browser.
Connect to CKAN network
Connect the nginx container to CKAN's Docker network so CKAN can access it:
docker network connect ckan-docker_default rdf
Now you can configure the harvest source in CKAN using the URL: http://rdf:80/the_file.ttl
Configure MIME types
CKAN identifies DCAT files by their MIME types. Nginx does not include the correct MIME types for RDF formats by default.
Add the following configuration to /etc/nginx/conf.d/default.conf in the nginx container:
types {
text/turtle ttl;
application/rdf+xml rdf;
}
Add this configuration inside the server block, above the location directive. Restart the container to apply the changes. CKAN can now harvest from your local web server.