RDF data import into CKAN


#1

Dear all,

I have a set of RDF documents (ontologies that describe datasets, metadata, business ontologies, etc.) in Turtle format. I would like to import these RDF documents into my newly set up CKAN data catalog.

Based on my limited knowledge, my understanding is that I need to have the "ckanext-dcat" extension/plugin installed on my CKAN distribution. The specific feature I need is the RDF harvester, which imports RDF serializations from other catalogs to create CKAN datasets (the dcat_rdf_harvester plugin).

My further understanding is that this harvester needs a remote source: it downloads the remote file, extracts all datasets using the parser, and creates or updates the actual CKAN datasets based on that. It also handles deletions, i.e. if a dataset is no longer present in the DCAT dump, it gets deleted from CKAN.

Here are my questions:

  • I have a set of .ttl files on my fileshare. How can I import them using this harvester?
  • Do my Turtle (.ttl) files need to comply with some specific CKAN schema in order to be parsed successfully by the harvester?
  • Any pointers or examples on how I can get started importing some test RDF datasets into my newly set up CKAN instance?

Many thanks.


#2

I’ve never actually used ckanext-harvest. But, from reading the documentation, I think that if your data is described using the DCAT vocabulary, then you should be able to use the harvester to fetch the data if you have the file hosted somewhere and provide the URL to the harvester.

Have you tried that approach? Did you get any errors in the harvester job?
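
If the .ttl files only live on a fileshare, they first need a URL the harvester can fetch. As a rough sketch for local testing only, Python's built-in HTTP server can expose a directory. The function name `serve_directory` and the default port are assumptions made up for this illustration, and the script needs Python 3.7+ (it can run outside the CKAN environment). If the harvester runs on another machine, the bind address would have to change.

```python
import functools
import http.server
import threading


def serve_directory(directory, port=8000):
    """Serve `directory` over HTTP on localhost and return the running server.

    Hypothetical helper for local testing; not part of CKAN or ckanext-dcat.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    # Run in a daemon thread so the caller keeps control of the process.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `serve_directory` with the folder that holds `my.ttl` would make the file reachable at `http://127.0.0.1:8000/my.ttl` for as long as the calling process stays alive, and that URL could then be given to the harvest source.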


#3

Many thanks for your reply.

I am using CKAN version 2.5.2. I have downloaded and installed the extension “ckan/ckanext-dcat”. I also installed a couple of other modules, e.g. rdflib and pylon. I also added the following configuration to my CKAN .ini file:
ckan.plugins = dcat_rdf_harvester

Finally, when I tried to parse the Turtle document with this CLI command:

`python ckanext-dcat/ckanext/dcat/processors.py consume my.ttl`

I got this error:

Traceback (most recent call last):
  File "C:\Downloads\src\ckan\ckanext\dcat_rdf_harvester\dcat\processors.py", line 16, in <module>
    import ckan.plugins as p
ImportError: No module named ckan.plugins

I also tried:

`python ckanext/dcat/processors.py produce my.ttl`

But I got the same error as above.

This is what is in the processors.py file at line 16:

…
import ckan.plugins as p

from ckanext.dcat.utils import catalog_uri, dataset_uri, url_to_rdflib_format, DCAT_EXPOSE_SUBCATALOGS
from ckanext.dcat.profiles import DCAT, DCT, FOAF

…

I am stuck on how to fix this error. Why is there no module ckan.plugins?


#4

If you’re starting a new CKAN from scratch, as you mentioned in your first message, might I ask whether there is any particular reason you are using version 2.5.2 of CKAN, considering that the latest stable release is 2.8.1? Is it because of compatibility with the ckanext-dcat extension? It seems to have been tested with versions up to CKAN 2.7.

From the error you’re getting, it seems that the Python virtual environment where CKAN was installed might not have been activated. Did you remember to run the following command beforehand?

. /usr/lib/ckan/default/bin/activate

You can find out more about it in the documentation about the Command Line Interface. You might also want to take a look at running paster commands provided by extensions.


#5

Thanks for the reply!

Yes, the problem could be due to some compatibility issue. I will try to set up the latest version.

Regarding the activation, I actually did activate the virtual environment with the command `workon ckan`,
and I do have (ckan) in front of my command prompt:
(ckan) C:\Office_Documents…

The ckan application is also running.

Now I have another, more fundamental question. CKAN offers the “ckanext-harvest” plugin, where RDF documents get imported only if they use the DCAT vocabulary, but only a small portion of our RDF models/ontologies actually use this vocabulary. So my understanding is that the harvester would not work out of the box in this case, and we would need to develop something on our own to consume and ingest the data into CKAN…


#6

From the workon command you described, I assume you are using virtualenvwrapper. There might be an issue there too, as CKAN’s documentation describes a way to install CKAN without using it. Or perhaps the workon command is activating a Python virtual environment other than the one CKAN uses. Try just running:

python -c "import ckan"

If it fails with any error message at all, you can be sure that you are not using the same virtual environment as CKAN is.
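
A slightly more verbose variant of the same check can also show which interpreter and environment are actually active. This is only a sketch; the function name `describe_environment` is invented for this illustration, and the code is written to run on both Python 2.7 and Python 3.

```python
import importlib
import sys


def describe_environment(module_name="ckan"):
    """Report which interpreter is running and whether `module_name` imports."""
    info = {"python": sys.executable, "prefix": sys.prefix}
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # Same failure as in the traceback above: the module is not on the path.
        info["importable"] = False
        info["detail"] = str(exc)
    else:
        info["importable"] = True
        info["detail"] = getattr(module, "__file__", None)
    return info


if __name__ == "__main__":
    for key, value in sorted(describe_environment().items()):
        print("%s: %s" % (key, value))
```

If `prefix` does not point at the environment where CKAN was installed, or `importable` is False, the wrong environment is active.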

I can also tell that you are using Windows, from the paths in your error messages. So the command I mentioned before, which comes from CKAN’s documentation for activating the virtual environment, will not work either. Maybe someone who has experience using CKAN on Windows (I don’t) can help you.


#7

Thanks Augusto for your valuable feedback.

I need a bit of your knowledge regarding the following.

We need to import our data (annotated in RDF) into CKAN. Since we are not using the DCAT vocabulary to define our RDF, we can’t use ckanext-harvest to automate the import. So my question is: what would be the best way to import RDF data (e.g. held in a Virtuoso datastore) into the CKAN datastore? What are our options for building a connector/adapter for RDF data import into CKAN from Virtuoso? In other words, does CKAN offer any other plugin(s) to facilitate this kind of task? Our idea is to build this connector/adapter not in Python but in some other language (e.g. Java, C#). That would require some support from CKAN: either a way to insert data into CKAN, or a plugin that can query Virtuoso directly in SPARQL. Any pointers in this regard?

Does anyone have experience with any of the following?
SPARQL endpoint analyzer and metadata generator for CKAN
SPARQL endpoint for CKAN
SPARQL Interface for CKAN
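
On the connector question, one relevant fact is that CKAN's Action API is plain JSON over HTTP, so a client in Java or C# could create datasets the same way. The sketch below only builds the request for the `package_create` action and does not send it. The instance URL, API key and dataset fields are placeholders, and the helper name `build_package_create_request` is invented for this illustration. Depending on the instance configuration, an `owner_org` field may also be required.

```python
import json
import urllib.request


def build_package_create_request(ckan_url, api_key, dataset):
    """Build (but do not send) a POST request to CKAN's package_create action."""
    return urllib.request.Request(
        url=ckan_url.rstrip("/") + "/api/3/action/package_create",
        data=json.dumps(dataset).encode("utf-8"),
        headers={
            "Authorization": api_key,  # placeholder API key of a CKAN user
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; `name` must be a unique, URL-safe dataset identifier.
request = build_package_create_request(
    "http://localhost:5000",
    "my-api-key",
    {"name": "my-test-dataset", "title": "My test dataset"},
)
```

Passing the request to `urllib.request.urlopen` would send it. Note that this creates dataset metadata in the catalog, not rows in the DataStore.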


#8

Since you’re not using the DCAT vocabulary to describe the data, I would suggest you first try to transform your data to fit into DCAT, considering it is not only the standard for this kind of data, but also the vocabulary that is best supported by tools.

You can do this by running CONSTRUCT and INSERT queries in SPARQL to transform the data and store the results in another graph. See this example from Virtuoso’s documentation.
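
To give a rough idea of what such a transformation could look like, here is a sketch that builds a CONSTRUCT query mapping a source vocabulary onto `dcat:Dataset`, `dct:title` and `dct:description`. The source class and property IRIs are entirely hypothetical (I don't know your ontology), and the helper name `build_dcat_construct_query` is invented for this illustration.

```python
QUERY_TEMPLATE = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>

CONSTRUCT {
  ?item a dcat:Dataset ;
        dct:title ?title ;
        dct:description ?description .
}
WHERE {
  ?item a <%(source_class)s> ;
        <%(title_property)s> ?title .
  OPTIONAL { ?item <%(description_property)s> ?description . }
}
"""


def build_dcat_construct_query(source_class, title_property, description_property):
    """Return a SPARQL CONSTRUCT query that re-describes resources as DCAT datasets."""
    return QUERY_TEMPLATE.strip() % {
        "source_class": source_class,
        "title_property": title_property,
        "description_property": description_property,
    }


# Hypothetical IRIs standing in for the terms of your own ontology.
query = build_dcat_construct_query(
    "http://example.org/ns#DataCollection",
    "http://example.org/ns#label",
    "http://example.org/ns#summary",
)
```

The resulting triples would still have to be serialized (e.g. as Turtle) and made available at a URL before the harvester could pick them up.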