Procedure for adding fields to backend
This document outlines the steps required to add, modify, or delete fields across various components of the CKAN ecosystem, including DCAT-AP schema updates, Solr search configuration, SeMPyRO, Discovery Service, and FAIR Data Point (FDP).
When a schema change is covered by DCAT-AP 3 or an earlier version of DCAT-AP but is not yet present in the CKAN DCAT extension (ckanext-dcat), follow these steps:
- Fork and clone the repository:
git clone https://github.com/ckan/ckanext-dcat
- Add the new field to the schema:
- Modify the schema file: <ckanext/dcat/schemas/dcat_ap_full.yaml>
- Use appropriate field types (e.g., text, repeating subfield, URI).
- Follow examples from other fields for consistency. For more information about scheming, see the ckanext-scheming documentation.
- Extend the existing mapping for the relevant DCAT-AP version by modifying the mapping files in the directory <ckanext/dcat/profiles>.
- Fix the corresponding unit tests.
- Create a pull request to the CKAN DCAT extension repository. Ensure that you follow the contributing guidelines for CKAN:
- Include unit tests for the new fields.
- Ensure compatibility across different DCAT-AP versions.
- After a new release, update the development and production Dockerfiles in the following repositories (order is important):
- https://github.com/GenomicDataInfrastructure/gdi-userportal-ckanext-fairdatapoint
- https://github.com/GenomicDataInfrastructure/gdi-userportal-ckan-docker
- Check that CKAN works locally with the newly added fields by harvesting an example FDP.
An example of a missing mapping in CKAN DCAT can be found here:
Multi-valued field creator in CKAN DCAT.
Note: Always take into account the mapping from CKAN → DCAT in addition to DCAT → CKAN.
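The local check after harvesting can be partly automated. The sketch below assumes you have saved the JSON returned by CKAN's `package_show` API action for a harvested dataset; the sample response and the field names `creator` and `custom_field` are illustrative placeholders, not the actual schema.

```python
import json


def missing_fields(package_show_response, expected_fields):
    """Return the expected field names that are absent or empty in the dataset."""
    if not package_show_response.get("success"):
        raise ValueError("CKAN reported an unsuccessful package_show call")
    dataset = package_show_response["result"]
    # Treat None, empty strings and empty lists as "not mapped".
    return [name for name in expected_fields if dataset.get(name) in (None, "", [])]


# Hypothetical, trimmed-down package_show response for a harvested dataset.
sample_response = json.loads("""
{
  "success": true,
  "result": {
    "name": "example-dataset",
    "title": "Example dataset",
    "creator": [{"name": "Example Organisation"}],
    "custom_field": ""
  }
}
""")

print(missing_fields(sample_response, ["title", "creator", "custom_field"]))
```

Any field name that is printed was not filled in by the DCAT → CKAN mapping and is worth checking in the profile code.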
If you’re adding a new field in CKAN and you want it to be searchable via Solr, follow these steps to modify the `schema.xml` file.
Defining the Field Type and Name
In the top part of the `schema.xml` file, define the type and name of the new field. The type specifies how Solr will handle the data in the field (e.g., as text, integers, dates, etc.).
- Navigate to the section in `schema.xml` where other fields are defined.
- Add your new field with its corresponding type.
Example:
<field name="custom_field" type="string" indexed="true" stored="true" />
Here, custom_field is the name of the field, and it’s set as a string type. It is also indexed (which makes it searchable) and stored (so it can be returned in search results).
Adding the Field to Search
In the lower part of the `schema.xml` file, you’ll need to add this field to the list of fields that are searchable by Solr. This is typically done in a section that defines which fields are indexed for searches.
Example:
<copyField source="custom_field" dest="text" />
This example maps the custom_field to the text field, which Solr uses for full-text searches. By adding the copyField directive, you’re instructing Solr to include the contents of custom_field in the search index.
When finished, release a new version and update the development and production Dockerfiles in gdi-userportal-ckan-docker (https://github.com/GenomicDataInfrastructure/gdi-userportal-ckan-docker).
Attributes used in the field definition:
- indexed="true": The field can be used in searches.
- stored="true": The field can be retrieved in search results.
After making these changes, you should restart your Solr instance and reindex your CKAN data to ensure that the new field is indexed and searchable with the command:
ckan -c /etc/ckan/default/ckan.ini search-index rebuild
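Before restarting Solr, it can help to confirm that both additions are in place. The following is a minimal sketch using only Python's standard library; the embedded schema excerpt is a made-up fragment for illustration, and in practice you would read the contents of your own `schema.xml` instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of schema.xml containing only the relevant entries.
SCHEMA_XML = """
<schema name="ckan" version="1.6">
  <field name="text" type="text" indexed="true" stored="false" multiValued="true" />
  <field name="custom_field" type="string" indexed="true" stored="true" />
  <copyField source="custom_field" dest="text" />
</schema>
"""


def check_field(schema_xml, field_name, dest="text"):
    """Report whether a field is defined, indexed, stored and copied to `dest`."""
    root = ET.fromstring(schema_xml)
    # iter() also finds <field> elements nested inside a <fields> wrapper.
    field = next((f for f in root.iter("field") if f.get("name") == field_name), None)
    copied = any(
        c.get("source") == field_name and c.get("dest") == dest
        for c in root.iter("copyField")
    )
    return {
        "defined": field is not None,
        "indexed": field is not None and field.get("indexed") == "true",
        "stored": field is not None and field.get("stored") == "true",
        "copied_to_text": copied,
    }


print(check_field(SCHEMA_XML, "custom_field"))
```

If any value is `False`, the corresponding `<field>` or `<copyField>` entry is missing or configured differently from the examples above.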
Fields are easy to add to SeMPyRO. You’ll need to know a few things:
- The predicate of the field
- Cardinality (single or multiple-valued)
- Range or datatype
Once that’s identified, go to the relevant class and add a property as follows. Here’s an example of the `type` property of `HRIDataset`:
type: List[AnyHttpUrl] = Field(
    default=None,
    description="The nature or genre of the resource. HRI recommended",
    rdf_term=DCTERMS.type,
    rdf_type="uri")
At Line 1, we see `type`, which is the name of the property. Its range is `AnyHttpUrl`, which is a helper for any URL. Other examples of this are `LiteralField` or sometimes even classes like `Agent` or `VCard`. It is multi-valued because it’s in a `List`. If the maximum cardinality is one, it should not be in a `List`.
At Line 2, `default=None` indicates the field is optional and by default undefined. Leave this line out for mandatory fields.
At Line 3, we have a human-readable description of the field.
At Line 4, we define the predicate. In this case, it’s `dcterms:type`. Some common namespaces, like `DCTERMS` and `DCAT`, are imported by default. A full URI can also be defined, for example with `URIRef("http://example.com/range#property")`.
At Line 5, we define the RDF type. There are many possible values here, such as `rdfs_literal`, `xsd:string`, or `uri`. It’s recommended to take a look at other properties to understand what is necessary here.
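The two decisions described above (optional versus mandatory, single versus multi-valued) can be illustrated without SeMPyRO installed. The sketch below uses standard-library dataclasses as a stand-in; it is not SeMPyRO's API, and `ExampleDataset`, `summarise` and the `metadata` keys are hypothetical names that only mirror the `rdf_term` and `rdf_type` arguments shown earlier.

```python
from dataclasses import MISSING, dataclass, field, fields
from typing import List, Optional, Union, get_args, get_origin, get_type_hints

DCTERMS = "http://purl.org/dc/terms/"  # Dublin Core terms namespace


@dataclass
class ExampleDataset:
    # Mandatory and single-valued: no default, not wrapped in a List.
    title: str = field(
        metadata={"rdf_term": DCTERMS + "title", "rdf_type": "rdfs_literal"}
    )
    # Optional and multi-valued: default of None, wrapped in a List.
    type: Optional[List[str]] = field(
        default=None,
        metadata={"rdf_term": DCTERMS + "type", "rdf_type": "uri"},
    )


def summarise(model):
    """Derive (optionality, cardinality) for each field from its declaration."""
    hints = get_type_hints(model)
    summary = {}
    for f in fields(model):
        hint = hints[f.name]
        if get_origin(hint) is Union:
            # Optional[X] is Union[X, None]; keep X.
            hint = next(a for a in get_args(hint) if a is not type(None))
        required = f.default is MISSING and f.default_factory is MISSING
        summary[f.name] = (
            "mandatory" if required else "optional",
            "multiple" if get_origin(hint) is list else "single",
        )
    return summary


print(summarise(ExampleDataset))
```

The same reading applies to a SeMPyRO property: a missing `default` means mandatory, and a `List[...]` wrapper means the maximum cardinality is greater than one.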
Once this is done, the JSON and YAML schemas need to be re-generated. For the `HRIDataset` class, this can be done by running the following command:
hatch run python sempyro/hri_dcat/hri_dataset
From a technical point of view, fields are added to the FDP by updating the appropriate SHACL shapes.
- In the FDP, log in as an admin user and go to the Metadata schemas option.
- Select the resource to update (e.g. Catalog).
- In the Form Definition textarea, add a new entry in the list of `sh:property` values. For example:
[
  sh:path my:new-property ; # the predicate IRI
  sh:nodeKind sh:Literal ; # the value type
  sh:minCount 1 ; # cardinality
  dash:viewer dash:LiteralViewer ; # UI hint for displaying
  dash:editor dash:TextFieldEditor ; # UI hint for editing
]
- Click Save if this is a draft and needs further work, or Save and release if the work is done.
- Add a description and select a version number.
- Click Release.
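When several fields are added at once, the `sh:property` entries can be generated rather than typed by hand. This is a small sketch, not part of the FDP; `property_shape` is a hypothetical helper, and its defaults simply repeat the values from the example entry above. `sh:maxCount` is the standard SHACL counterpart of `sh:minCount` for limiting a field to a single value.

```python
def property_shape(path, node_kind="sh:Literal", min_count=None, max_count=None,
                   viewer="dash:LiteralViewer", editor="dash:TextFieldEditor"):
    """Build one sh:property entry as Turtle text for the Form Definition."""
    lines = [f"sh:path {path} ;", f"sh:nodeKind {node_kind} ;"]
    # Cardinality constraints are only emitted when requested.
    if min_count is not None:
        lines.append(f"sh:minCount {min_count} ;")
    if max_count is not None:
        lines.append(f"sh:maxCount {max_count} ;")
    lines.append(f"dash:viewer {viewer} ;")
    lines.append(f"dash:editor {editor} ;")
    body = "\n".join("  " + line for line in lines)
    return "[\n" + body + "\n]"


# Mandatory field with no upper bound, as in the example entry.
print(property_shape("my:new-property", min_count=1))
```

Paste the generated text into the list of `sh:property` values and review it in the form preview before releasing the schema.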
The Dataset Discovery service requires two parts to be updated: the OpenAPI definitions and the mapping.
Two definitions need to be updated, both located in the `src/main/openapi` folder:
- ckan.yaml: This file contains the API returned by CKAN. Based on this YAML, Java classes are automatically generated corresponding to the API definition. For adding a field to a Dataset, the primary change will likely be in the CkanPackage definition. See the examples there on how to add a property.
- discovery.yaml: This file defines what the Discovery service should return. You can make this definition whatever you want it to be—it does not have to correspond one-to-one with CKAN. To add a property here, modify the RetrievedDataset definition. Again, see the examples in the file.
Once you have changed the definitions, follow these steps:
1. Run the following command:
mvn clean compile
The command will probably report a number of errors, but it will regenerate the classes that reflect the OpenAPI objects.
2. Add the mapping between the CKAN and Discovery service fields. The main place to look is most likely `src/main/java/io/github/genomicdatainfrastructure/discovery/utils/PackageShowMapper.java`, where you modify the `RetrievedDatasetBuilder`. See the code there for examples of how to map fields.
3. Update the test cases, which are found in `src/test/java/io/github/genomicdatainfrastructure/discovery/services/PackageShowMapperTest.java`. Make sure to update both the empty dataset examples and the filled examples. You'll need to update the `CkanPackage` objects (which reflect the CKAN API output) as well as the expected output, which is in the form of a `RetrievedDataset`.
4. Finally, test with the automated tests (`mvn test`), then run the package (`mvn compile quarkus:dev`) and check with Postman that the mapping and output are as expected.
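The shape of the mapping and of the two kinds of test case (empty and filled) is sketched below in Python for brevity. The real mapper and tests are the Java classes named above; `map_package`, `custom_field` and `customField` are hypothetical names used only to show the pattern.

```python
def map_package(ckan_package):
    """Map a CKAN package dict to the shape returned by the Discovery service."""
    return {
        "identifier": ckan_package.get("identifier"),
        "title": ckan_package.get("title"),
        # Hypothetical new field: renamed, and defaulted to an empty list
        # so that an empty CKAN package still maps cleanly.
        "customField": list(ckan_package.get("custom_field") or []),
    }


# Empty example: every input field is missing.
empty = map_package({})

# Filled example: every input field is present.
filled = map_package(
    {"identifier": "abc", "title": "Example", "custom_field": ["a", "b"]}
)

print(empty)
print(filled)
```

Covering both cases catches the common failure where a new field maps correctly when present but raises an error when CKAN omits it.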