Using CKAN for Research Data Management

This blog post provides examples of current uses of CKAN for RDM and an overview of a CKAN pilot that has started within the University of St Andrews.

University of St Andrews CKAN pilot

The CKAN pilot within the University started as part of an extension to the JISC-funded CERIF for Datasets (C4D) project and is currently being continued with existing University resources. At the outset of the project the Open Knowledge Foundation provided a vanilla install of CKAN 2.1 to allow for an initial evaluation of the software.

The initial goal of the pilot project is to investigate the functionality offered by CKAN with a view to potential integration within our local RDM infrastructure, in particular with our existing Pure CRIS, which is based on the CERIF metadata standard. The expectation is that Pure will be developed as our research data catalogue, linking through to research datasets stored externally or locally. Thus, unlike many of the other projects we have heard of so far, our particular interest is in using CKAN as a research data repository rather than as a data catalogue. We are keen to find out whether it is possible to run preservation tools on data that is stored in CKAN and thus to implement basic digital preservation workflows. We are also interested in determining how straightforward or otherwise it is to develop customised metadata fields and interoperability with our Pure CRIS using CERIF-XML.

We will be publishing periodic updates on our CKAN pilot in this space and are keen to hear from others who are also working on implementing CKAN for RDM. To get in touch, please leave a comment or email research-computing [AT] st-andrews.ac.uk.

Background of the St Andrews CKAN pilot

In February 2013 the JISC-funded projects Orbital and data.bris organised a workshop on their specific uses of CKAN as part of their respective institutional RDM infrastructures. The workshop culminated in a requirements-gathering exercise during which representatives from the different organisations fed back their expectations of CKAN if it were to form part of their institutional RDM solution. Requirements identified by the University of St Andrews and fed back into this exercise can be found in a separate blog post.

In February the Universities of Lincoln and Bristol appeared to be the only universities within the UK that had been working with CKAN with a view to implementing it as part of their institutional RDM infrastructures. Since then CKAN appears to have attracted greater interest, and a recent request sent to the CKAN for Research Data Management mailing list returned the following results:

  • CKAN has already been implemented at the University of Newcastle, the University for the Creative Arts / VADS, the University of Lincoln, the University of Bristol, the University of Oxford and the AHRC / EPSRC-funded DART project.
  • In addition to St Andrews, Cardiff University is in the early stages of implementing CKAN on a pilot basis. The Marine Research Monitoring group in Western Australia is also in the process of implementing CKAN for research data.
  • The University of Leicester is considering a trial of CKAN.

Use cases

The first four use cases were implemented by projects that received funding under the JISC Managing Research Data Programme. The University of Oxford was the only university within the UK that responded and that has made progress in experimenting with CKAN outside a publicly funded project. Current activities in developing CKAN for RDM centre around two main forms of implementation:

  • Using the full CKAN package: (1), (4), (5), (6), and (7)
  • Using CKAN as data catalogue: (2), (3), and (4)

(1)  Irdium project (University of Newcastle)

The Irdium project has a small pilot running whereby CKAN is used by one research group as a data repository. Information on CKAN at Newcastle University is available here: http://research.ncl.ac.uk/rdm/tools/ckan/

(2)  KAPTUR (University for the Creative Arts / VADS)

There was a requirement to integrate the RDM system with the existing institutional repository running on EPrints. Therefore CKAN has not been used as a repository but rather as an RDM system.

(3)  Orbital (University of Lincoln)

The project developed an application that works with the CKAN API. The RDM system built on CKAN does not use CKAN as a research data repository.
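To give a flavour of what working with the CKAN API can look like, here is a minimal sketch in Python. It is not Orbital's code: it only assumes CKAN's Action API (version 3) and its package_search action, and the helper names and the host ckan.example.org are made up for illustration.

```python
import json
from typing import List
from urllib.parse import urlencode, urljoin


def build_search_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build a package_search request URL for CKAN's Action API (v3)."""
    params = urlencode({"q": query, "rows": rows})
    endpoint = urljoin(base_url.rstrip("/") + "/", "api/3/action/package_search")
    return endpoint + "?" + params


def dataset_titles(response_text: str) -> List[str]:
    """Extract dataset titles from the JSON body of a package_search response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise RuntimeError("CKAN reported an unsuccessful request")
    # Fall back to the dataset's URL name when no title has been set.
    return [pkg.get("title") or pkg["name"] for pkg in payload["result"]["results"]]


# Hypothetical host; replace with the address of a real CKAN instance.
print(build_search_url("https://ckan.example.org", "sea grass", rows=5))
```

The URL can be fetched with any HTTP client, and the response body passed to dataset_titles. Keeping the network call out of the helpers makes them easy to try against a saved response.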

(4)  data.bris (University of Bristol)

Data.bris is working on implementing two instances of CKAN as part of Bristol’s RDM system: one public read-only catalogue of data publications and one instance with controlled access for active research data.

(5)  DART project

The DART project uses CKAN as its data repository. The project has developed an ingest framework which will allow it to streamline the ingest and metadata markup of the thousands of ‘research objects’ (data) which will be hosted in the repository. Content will be exposed via OAI-PMH so that it can be consumed by organisations like the ADS and Europeana.
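For readers unfamiliar with OAI-PMH, the sketch below shows roughly what a harvester does: it sends a ListRecords request and reads record identifiers out of the XML that comes back. It is an illustration under stated assumptions, not the DART project's implementation; the endpoint URL is hypothetical and only the standard OAI-PMH verb, the oai_dc metadata prefix and the OAI-PMH 2.0 namespace are taken as given.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlencode

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(endpoint: str, metadata_prefix: str = "oai_dc") -> str:
    """Build a ListRecords request for an OAI-PMH endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return endpoint + "?" + query


def record_identifiers(xml_text: str) -> List[str]:
    """Return the identifier of every record header in a ListRecords response."""
    root = ET.fromstring(xml_text)
    headers = root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    return [element.text for element in headers]


# Hypothetical endpoint, used only to show the shape of the request.
print(list_records_url("https://repository.example.org/oai"))
```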

(6)  University of Oxford

Oxford University IT Services is experimenting with using CKAN as a vehicle for rapid prototyping.

(7)  Marine Research Monitoring group in Western Australia

The Marine Research and Monitoring group is within a state government department in Western Australia and is setting up CKAN for its research data management. The project has additional constraints as it is dealing with sensitive data about threatened species and communities. For this reason the project is looking at maintaining proper ISO 19139 / ANZLIC MCP metadata on a CKAN-harvested GeoNetwork catalogue for spatially referenced datasets.

Issues encountered by projects that are currently using CKAN

Feedback received has pointed to three main technical challenges in the implementation of CKAN, and these challenges are echoed by on-going discussions on the ckan-dev mailing list:

  • While the installation of CKAN is generally described as fairly straightforward, this is not true for getting the integrated data visualisation tool for .csv files to work, especially when installations are made on institutional intranets. The Irdium project identified an undocumented feature of CKAN as the cause of this problem: CKAN makes use of external data processing web services for some of its functionality. Initial experience with the software in St Andrews confirmed the issue with visualisation of data uploaded into CKAN. Like the Irdium server, the server used for the St Andrews project is currently behind the University’s firewall. At the time of writing, we are working on resolving the problem by replacing CKAN’s dataproxy with datapusher, a fix suggested by the Open Knowledge Foundation that, we believe, is going to be implemented in future releases of CKAN. The expectation is that use of datapusher will eliminate the dependence of CKAN on external data processing web services.
  • The CKAN datastore has been described as requiring time to understand.
  • Shibboleth integration is possible with some local adaptations made to the code. The Irdium project implemented the extension developed for the Finnish Science Data Catalogue (https://github.com/kata-csc/ckanext-shibboleth).

In addition, respondents pointed to cultural and support issues within the respective organisations that need to be addressed if the CKAN implementation is going to be successful:

  • CKAN requires Python skills to be available to the organisation. To allow for a successful RDM service these skills need to be available beyond the duration of the initial implementation project. Where this has not been the case, RDM service development has slowed down or stopped.
  • The development of a research data repository needs to consider academic workflows within the different disciplines, which will impact on the technical implementation and on the features that are made available.

Note:

The CKAN Project has provided additional information to the above in their own blog post: http://ckan.org/2013/11/28/ckan4rdm-st-andrews/

That blog post identifies an additional project that uses CKAN for RDM: EDaWaX (European Data Watch Extended).

29 November 2013

CKAN for Research Data Management

CKAN is an open source data management system that has been developed by the Open Knowledge Foundation for the past six years. CKAN provides tools to streamline the processes of publishing, sharing, finding and using data. Initially CKAN was aimed at data publishers such as national and regional governments, companies and organisations that want to make their data publicly available. For example, the UK Government’s data portal (data.gov.uk) runs on CKAN, and in Australia, where CKAN is widely used by government agencies at both national and regional levels, it has become accepted as the de facto standard for data management.

As part of the JISC Managing Research Data programme two projects, Orbital (University of Lincoln) and data.bris (University of Bristol), have adopted CKAN as a component of their institutional RDM solutions. The experience of both projects using CKAN was discussed at the workshop “CKAN for Research Data Management” that was held in London on 18th February 2013. In addition to representatives from both projects, members of the Open Knowledge Foundation and staff from other UK universities took part in the workshop. Summaries of the workshop can be found on the data.bris and Orbital project blogs.

As part of the workshop a requirements-gathering exercise was started to investigate the wider needs of the RDM community and to match these needs against the functionality of CKAN. For this exercise, the following RDM roles were decided on: researcher, developer, curator/manager, re-user, IT support, and data subjects.

The requirements were gathered through expressions in the following format: “As a [RDM role], I want [what, requirement], so that [why, reason]”.
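As a small illustration of this format, the sketch below renders a requirement as a sentence following the template. The class and field names are our own invention for this example and were not part of the workshop exercise; the sample values are taken from requirement 9 in the list further down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    """One requirement expressed as role, want and reason (hypothetical structure)."""

    role: str
    want: str
    reason: str

    def as_story(self) -> str:
        # Mirrors the template: "As a [RDM role], I want [what], so that [why]".
        return f"As a {self.role}, I want {self.want}, so that {self.reason}."


harvesting = Requirement(
    role="developer",
    want="CKAN to support a range of accepted protocols for metadata harvesting (e.g. OAI-PMH)",
    reason="catalogues for a number of different data stores can be integrated and searched via a single point of access",
)
print(harvesting.as_story())
```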

Below is a list of requirements that we would hope to find in an RDM solution and that we have fed back into the requirements-gathering exercise. CKAN already meets a number of these requirements.

| No | RDM role | Requirement | Reason |
|----|----------|-------------|--------|
| 1 | Curator / manager | Digital objects not to be stored within a database | So that where necessary / desirable preservation tools (e.g. JHOVE, DROID) can be used to continuously validate file integrity |
| 2 | Curator / manager | Ability to integrate with Research Information Systems (e.g. Pure) | To allow for efficiency of institutional processes, e.g. in relation to REF |
| 3 | Curator / manager | Usage metrics to be available | To support gathering of information on research impact, e.g. for REF |
| 4 | Curator / manager | The availability of records management functionality, e.g. for the administration of retention / life cycle management periods | To support institutional life cycle management processes |
| 5 | Curator / manager | Ability to link publications (in the publications repository) to associated datasets (in CKAN) | To allow for ease of access to publications and related data, to support transparency and openness |
| 6 | Curator / manager | Ability to draw metadata from Pure and/or to export metadata into Pure | To assist the integration of RDM solutions with the Research Information System |
| 7 | Curator / manager, researcher | Support of variety of academic subject-specific metadata standards | To ensure CKAN is useful and adaptable to a wide range of academic disciplines |
| 8 | Curator / manager, researcher | To keep track of versions of data | To allow for ease of tracking modifications made to individual files |
| 9 | Developer | CKAN to support a range of accepted protocols for metadata harvesting (e.g. OAI-PMH) | So that catalogues for a number of different data stores can be integrated and searched via a single point of access |
| 10 | Developer | To harvest metadata from Fedora Commons, possibly via OAI-PMH | CKAN can present data that is kept in other repositories |
| 11 | Developer | To use Fedora Commons as a FileStore (http://docs.ckan.org/en/ckan-1.8/filestore.html) | CKAN can access digital objects in Fedora (as an alternative to harvesting) |
| 12 | Developer | APIs to support the development of alternative methods of data ingest into CKAN and the development of tools for data analysis | Subject-specific RDM needs can be met |
| 13 | IT Support | CKAN to be able to support a number of different data and database structures | Subject-specific RDM needs can be supported and existing department-level RDM solutions can be integrated into an institutional CKAN RDM solution |
| 14 | IT Support | CKAN to support Shibboleth and other single sign on protocols | Institutional sign on mechanisms can be used to authenticate to CKAN |
| 15 | IT Support | CKAN to be able to integrate with institutional identity management (IDM) systems | Existing institutional IDM can be used to define roles and levels of access to individual datasets within CKAN |
| 16 | IT Support | The availability of a documented mechanism of running several customisable instances of CKAN from the same codebase | Institutional support for potentially a multitude of CKAN instances to cater for various subject-specific needs can be done efficiently |
| 17 | IT Support | The availability of maintenance agreements | So that institutions adopting CKAN can get expert technical support when needed |
| 18 | IT Support | Commitment from CKAN developers to keeping code base up-to-date | So that business continuity can be ensured |
| 19 | IT Support | Any necessary security fixes to be developed quickly | So that system security can be ensured |
| 20 | IT Support | Metadata extraction upon ingest (including subject-specific metadata standards, e.g. TEI & VRA) | To avoid, where possible, manual metadata entry (reduction of typos, efficient use of staff time) |
| 21 | IT Support | The ability to run admin reports (e.g. storage space used in individual collections; file types contained in individual collections; access to / usage of (parts of) individual collections) | To allow for efficient support provision (e.g. planning of storage requirements, charging, etc.) |
| 22 | IT Support | CKAN to be designed in a modular fashion | So that it is possible to select individual components of the software and to integrate these with existing systems and technical infrastructure |
| 23 | IT Support | The availability of relevant and accessible user documentation that can be modified for local use | To reduce the number of CKAN users contacting the IT Service Desk for advice on how to use the system |
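Requirement 1 above asks for digital objects to be kept as ordinary files so that preservation tools can validate their integrity over time. The sketch below shows the basic idea with a plain checksum comparison. It is a much simpler stand-in for tools such as JHOVE or DROID, not a replacement for them, and it assumes a hypothetical manifest that maps relative file paths to previously recorded SHA-256 checksums.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(manifest: Dict[str, str], root: Path) -> List[str]:
    """Return the manifest entries that are missing or no longer match their checksum."""
    problems = []
    for relative_path, expected in sorted(manifest.items()):
        target = root / relative_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(relative_path)
    return problems
```

A check like this only works if the stored files are reachable on a file system, which is the point of the requirement: objects held inside a database would first have to be exported before any such tool could inspect them.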

All requirements gathered will be summarised by the Orbital project and will feed into wider discussions on how CKAN can be developed further to support institutional RDM processes better.