Using CKAN for Research Data Management

This blog post provides examples of current uses of CKAN for RDM and an overview of a CKAN pilot that has started within the University of St Andrews.

University of St Andrews CKAN pilot

The CKAN pilot within the University started as part of an extension to the JISC-funded Cerif for Datasets (C4D) project and is currently being continued with existing University resources. At the outset of the project the Open Knowledge Foundation provided a vanilla install of CKAN 2.1 to allow for an initial evaluation of the software.

The initial goal of the pilot project is to investigate the functionality offered by CKAN with a view to potential integration within our local RDM infrastructure, in particular with our existing Pure CRIS, which is based on the CERIF metadata standard. The expectation is that Pure will be developed as our research data catalogue, linking through to research datasets stored externally or locally. Thus, unlike many of the other projects we have heard of so far, our particular interest lies in using CKAN as a research data repository rather than as a data catalogue. We are keen to find out whether it is possible to run preservation tools on data that is stored in CKAN and thus to implement basic digital preservation workflows. We are also interested in determining how straightforward or otherwise it is to develop customised metadata fields and to achieve interoperability with our Pure CRIS using CERIF-XML.

We will be publishing periodic updates on our CKAN pilot in this space and are keen to hear from others who are also working on implementing CKAN for RDM. To get in touch, please leave a comment or email research-computing [AT] st-andrews.ac.uk.

Background of the St Andrews CKAN pilot

In February 2013 the JISC-funded projects Orbital and data.bris organised a workshop on their specific uses of CKAN as part of their respective institutional RDM infrastructures. The workshop culminated in a requirements-gathering exercise during which representatives from the different organisations fed back their expectations of CKAN if it were to form part of their institutional RDM solution. Requirements identified by the University of St Andrews and fed back into this exercise can be found in a separate blog post.

In February the Universities of Lincoln and Bristol appeared to be the only universities within the UK that had been working with CKAN with a view to implementing it as part of their institutional RDM infrastructures. Since then CKAN appears to have attracted greater interest, and a recent request sent to the CKAN for Research Data Management mailing list returned the following results:

  • CKAN has already been implemented at the University of Newcastle, the University for the Creative Arts / VADS, the University of Lincoln, the University of Bristol, the University of Oxford and the AHRC / EPSRC-funded DART project.
  • In addition to St Andrews, Cardiff University is in the early stages of implementing CKAN on a pilot basis. The Marine Research Monitoring group in Western Australia is also in the process of implementing CKAN for research data.
  • The University of Leicester is considering a trial of CKAN.

Use cases

The first four use cases were implemented by projects that received funding under the JISC Managing Research Data Programme. The University of Oxford was the only UK university that responded and that has made progress in experimenting with CKAN outside a publicly funded project. Current activities in developing CKAN for RDM centre on two main forms of implementation:

  • Using the full CKAN package: (1), (4), (5), (6), and (7)
  • Using CKAN as a data catalogue: (2), (3), and (4)

(1)  Irdium project (University of Newcastle)

The Irdium project has a small pilot running in which CKAN is used by one research group as a data repository. Information on CKAN at Newcastle University is available here: http://research.ncl.ac.uk/rdm/tools/ckan/

(2)  KAPTUR (University for the Creative Arts / VADS)

There was a requirement to integrate the RDM system with the existing institutional repository running on EPrints. CKAN has therefore not been used as a repository but rather as an RDM system.

(3)  Orbital (University of Lincoln)

The project developed an application that works with the CKAN API. The RDM system built on CKAN does not use CKAN as a research data repository.
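
To give a sense of what working with the CKAN API involves, here is a minimal sketch in Python. It is not Orbital's code. It assumes a CKAN instance at a hypothetical address (ckan.example.org) that exposes version 3 of the Action API, and the helper names build_search_url and extract_titles are made up for illustration. The first function builds a package_search request and the second reads dataset titles from the JSON that CKAN returns.

```python
import json
from urllib.parse import urlencode, urljoin


def build_search_url(base_url, query, rows=10):
    """Build an Action API URL that searches datasets (hypothetical helper)."""
    params = urlencode({"q": query, "rows": rows})
    action = urljoin(base_url.rstrip("/") + "/", "api/3/action/package_search")
    return action + "?" + params


def extract_titles(response_text):
    """Return the dataset titles found in a package_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported a failed request")
    # Fall back to the dataset's URL name when no title has been set.
    return [d.get("title") or d["name"] for d in payload["result"]["results"]]


# Canned response for illustration only, shaped like a package_search reply.
sample = '{"success": true, "result": {"count": 1, "results": [{"name": "soil-survey", "title": "Soil Survey"}]}}'
url = build_search_url("https://ckan.example.org", "soil survey")
titles = extract_titles(sample)
```

An external application of this kind can use CKAN purely as a metadata service, which fits an arrangement where the data itself is held elsewhere.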

(4)  data.bris (University of Bristol)

Data.bris is working on implementing two instances of CKAN as part of Bristol’s RDM system: one public read-only catalogue of data publications and one instance with controlled access for active research data.

(5)  DART project

The DART project uses CKAN as its data repository. The project has developed an ingest framework which will allow it to streamline the ingest and metadata markup of the thousands of ‘research objects’ (data) which will be hosted in the repository. Content will be exposed via OAI-PMH so that it can be consumed by organisations like the ADS and Europeana.
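
For readers unfamiliar with OAI-PMH, the sketch below shows roughly what a consuming organisation does: it sends a ListRecords request and reads Dublin Core fields from the XML that comes back. This is an illustration only. The endpoint address (repo.example.org/oai) and the helper names are hypothetical, and nothing here is taken from the DART project's own setup.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace used by Dublin Core elements inside oai_dc records.
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(endpoint, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for an OAI-PMH endpoint (hypothetical helper)."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return endpoint + "?" + query


def record_titles(xml_text):
    """Return every dc:title found in an OAI-PMH ListRecords response."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(DC_NS + "title")]


# Hypothetical endpoint, for illustration only.
request_url = list_records_url("https://repo.example.org/oai")
```

A real harvester would also follow the resumptionToken that OAI-PMH uses to page through large result sets, which is left out here for brevity.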

(6)  University of Oxford

Oxford University IT Services is experimenting with using CKAN as a vehicle for rapid prototyping.

(7)  Marine Research Monitoring group in Western Australia

The Marine Research and Monitoring group is within a state government department in Western Australia and is setting up CKAN for its research data management. The project has additional constraints as it is dealing with sensitive data about threatened species and communities. For this reason the project is looking at maintaining proper ISO 19139 / ANZLIC MCP metadata on a CKAN-harvested GeoNetwork catalogue for spatially referenced datasets.

Issues encountered by projects that are currently using CKAN

Feedback received has pointed to three main technical challenges in the implementation of CKAN, and these challenges are echoed by on-going discussions on the ckan-dev mailing list:

  • While the installation of CKAN is generally described as fairly straightforward, the same is not true of getting the integrated data visualisation tool for .csv files to work, especially when CKAN is installed on an institutional intranet. The Irdium project identified an undocumented feature of CKAN as the cause of this problem: CKAN relies on external data processing web services for some of its functionality. Initial experience with the software in St Andrews confirmed the issue with visualisation of data uploaded into CKAN. Like the Irdium server, the server used for the St Andrews project is currently behind the University’s firewall. At the time of writing, we are working on resolving the problem by replacing CKAN’s dataproxy with datapusher, a fix suggested by the Open Knowledge Foundation that, we believe, is going to be included in future releases of CKAN. The expectation is that using datapusher will remove CKAN’s dependence on external data processing web services.
  • The CKAN datastore has been described as requiring time to understand.
  • Shibboleth integration is possible with some local adaptations made to the code. The Irdium project implemented the extension developed for the Finnish Science Data Catalogue (https://github.com/kata-csc/ckanext-shibboleth).

In addition, respondents pointed to cultural and support issues within their respective organisations that need to be addressed if a CKAN implementation is going to be successful:

  • CKAN requires Python skills to be available to the organisation. To allow for a successful RDM service these skills need to be available beyond the duration of the initial implementation project. Where this has not been the case, RDM service development has slowed down or stopped.
  • The development of a research data repository needs to consider academic workflows within the different disciplines, which will impact on the technical implementation and on the features that are made available.

Note:

The CKAN Project has provided additional information to the above in their own blog post: http://ckan.org/2013/11/28/ckan4rdm-st-andrews/

That blog post identifies an additional project that uses CKAN for RDM, EDaWaX (European Data Watch Extended).

29 November 2013