From now on we’ll be blogging (hopefully more frequently than has been usual here) on the Library Blog.
It’s been some time since we last posted anything, but that should be changing. The Research Computing team is now part of The Library and we intend to freshen things up and start blogging more regularly. Watch this space.
I rounded off my time at the 2015 International Digital Curation Conference this week with a workshop on Islandora. The Digital Humanities site, through which items from the University’s Special Collections are made available online, uses Islandora 6, and we have plans to upgrade it to Islandora 7, the current version.
The open-source Islandora framework is based principally on three widely used and well-established open-source applications:
- The Drupal content management system
- The Fedora Commons repository software
- The Solr search platform
The Islandora extensions to Drupal make it much easier to interact with the rock-solid, secure but somewhat user-unfriendly Fedora software and make use of Solr to ease the discovery of information. A range of “solution packs” allow users to work effectively with a range of data types, both when ingesting them into the repository (e.g. capture of metadata, creation of derivatives) and when viewing them. For example, when a TIFF image is imported via the Image Solution Pack, copies in other formats, such as JPEG, are stored alongside the original TIFF. This is because TIFFs are excellent for archival purposes but not as well suited for display on the web as JPEGs. The Image Solution Pack also allows users to do things like annotate the image and interact with it in a zoomable image viewer.
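To give a rough idea of what the Solr layer contributes to discovery, here is a minimal Python sketch that builds a query URL for Solr's standard select handler. The base URL and the search text are assumptions made purely for illustration, and a real Islandora site would normally run its queries through its own search module rather than by hand.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, text, rows=10):
    """Return a Solr select URL that searches for `text` and asks for JSON."""
    params = {
        "q": text,      # the query string typed by the user
        "rows": rows,   # maximum number of results to return
        "wt": "json",   # response format
    }
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


# Hypothetical Solr location; adjust to wherever the index actually lives.
url = build_solr_query_url("http://localhost:8080/solr", "illuminated manuscript")
print(url)
```

Fetching that URL against a running Solr instance would return the matching documents as JSON, which a front end such as Drupal can then render.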
The Islandora framework is managed by the Islandora Foundation, but much of the work, particularly on the core components, is carried out by Discovery Garden, who provide commercial services around Islandora. Alan Stanley of Discovery Garden was one of the original developers of Islandora and he led the workshop after giving an introductory demonstration on the first day of the conference. Alan will, I believe, be making his slides available through the conference web pages but they are not there for me to link to at the time of writing.
I inherited an Islandora instance that was already in place when I took up my position here, and I had not worked with Islandora before, so the overview and the explanation of the architecture were very useful. My work on Islandora has been focused on certain key areas, principally ensuring that content will display correctly following an upgrade, so coverage of other features, such as the Form Builder and the way in which roles can be used within the system, was particularly useful.
Unfortunately, upgrading from 6 to 7 has thus far proved more difficult than I anticipated. I have previously forced content to work in the new version using techniques that I wasn’t entirely happy with (e.g. directly editing XML in the repository) and sought some further guidance here. (The documentation can sometimes be less than helpful – see the rather unhelpful comment at the end of the Overview of the Book Solution Pack. Some detail on the migration script would be useful.) While I now have some understanding of the (admittedly good) reasons why the upgrade is tricky, there was not adequate time to get into any detail on it. Alan has agreed to help me out if I get in touch with him, however, and I will be taking him up on that offer.
We have some other projects in the pipeline that would seem to be a good fit for Islandora, so we are considering moving to a multi-site setup, with several Drupal instances drawing on a single Fedora Commons instance, as described by a poster from the University of Toronto Libraries at the conference. The workshop included some useful tips on how to go about doing that and it continues to look like a good way forward.
I’m spending most of this week at the DCC’s 10th International Digital Curation Conference in London. Research data management (RDM) is a key subject throughout the conference, but I have a particular interest in the Islandora software, covered by a poster, demonstration and finally a workshop on Thursday. The Digital Humanities site at St Andrews is an Islandora site, and I’ll come back to it in another post following the workshop.
On the wider subject of RDM (or Digital Curation, if you prefer), a number of key themes kept cropping up. There has been a lot of talk about carrots vs. sticks (and, at one point, carrot sticks) – having funders require researchers to undertake data management tasks and make certain commitments is effective in getting things moving in the right direction, but there is a real need for a change of culture so that good data management practices are embedded within research processes because they deliver real benefits to researchers. Part of achieving this is providing services that are usable enough for a cost/benefit analysis to make good practices attractive, and so to change behaviours and attitudes. However, sustained funding is required to provide those services, whether at institutional or cross-institutional levels. Meanwhile, project funding needs to allow for data management costs.
While it would be wonderful if providing suitable incentives were sufficient to engender good practices, an example of why requirements still matter emerged in discussions around funder requirements for data management plans. The EPSRC is unusual in the UK in not requiring researchers to submit a DMP (funders may use other terminology, e.g. technical plan) with their bids. They do expect a DMP to exist, but do not ask to see it, and it is incumbent on the institution to make sure that it is in place. The upshot is that it can be difficult to get researchers applying to the EPSRC to write DMPs – it doesn’t affect whether or not their bid will succeed, so why bother? While having researchers do this as a box-ticking exercise to ensure compliance with funder requirements is less than ideal, it is better than it not being done at all and can help to create good habits.
The sessions have ranged widely, and presentations are available from the programme page. The posters are also available online. Watch out for papers in the next edition of the International Journal of Digital Curation.
A couple of positive observations – they may seem faint, but are worth bearing in mind for those of us trying to move things forward:
- Slow progress is still progress
- Poor metadata is better than no metadata
Finally, I’d like to give a plug to DMPonline, partly because I was a developer working on it before coming to St Andrews but mainly because it is the best way to draft a DMP (even if you’re EPSRC-funded!), with templates based on funder requirements and guidance from both the University and the funder.
This blog post provides examples of current uses of CKAN for RDM and an overview of a CKAN pilot that has started within the University of St Andrews.
University of St Andrews CKAN pilot
The CKAN pilot within the University started as part of an extension to the JISC-funded Cerif for Datasets (C4D) project and is currently being continued with existing University resources. At the outset of the project the Open Knowledge Foundation provided a vanilla install of CKAN 2.1 to allow for an initial evaluation of the software.
The initial goal of the pilot project is to investigate the functionality offered by CKAN with a view to potential integration within our local RDM infrastructure, in particular with our existing Pure CRIS, which is based on the CERIF metadata standard. The expectation is that Pure will be developed as our research data catalogue, linking through to research datasets stored externally or locally. Thus, unlike many of the other projects we have heard of so far, our particular interest is in using CKAN as a research data repository, rather than as a data catalogue. We are keen to find out whether it is possible to run preservation tools on data that is stored in CKAN and thus to implement basic digital preservation workflows. We are also interested in determining how straightforward or otherwise it is to develop customised metadata fields and interoperability with our Pure CRIS using CERIF-XML.
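As a rough illustration of the simplest route to customised metadata fields, CKAN's action API lets a dataset carry arbitrary key/value pairs in an `extras` list when it is created through `package_create`. The minimal Python sketch below only builds the JSON body for such a call. The field names `pure_uuid` and `cerif_type` and their values are hypothetical, invented here for illustration, and are not fields that the pilot has defined.

```python
import json


def build_dataset_payload(name, title, custom_fields):
    """Build a package_create body with custom fields stored as CKAN extras."""
    # CKAN stores each extra as a {"key": ..., "value": ...} pair.
    extras = [
        {"key": key, "value": str(value)}
        for key, value in sorted(custom_fields.items())
    ]
    return {"name": name, "title": title, "extras": extras}


# Hypothetical custom fields linking the dataset back to a CRIS record.
payload = build_dataset_payload(
    "sample-dataset",
    "Sample dataset",
    {"pure_uuid": "hypothetical-id", "cerif_type": "dataset"},
)
body = json.dumps(payload)
print(body)
```

The resulting body would be POSTed to the site's `/api/3/action/package_create` endpoint with an API key. Extras are free-form, so validated or structured fields would probably need a CKAN extension instead.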
Background of the St Andrews CKAN pilot
In February 2013 the JISC-funded projects Orbital and data.bris organised a workshop on their specific uses of CKAN as part of their respective institutional RDM infrastructures. The workshop culminated in a requirements-gathering exercise during which representatives from the different organisations fed back what they would expect of CKAN if it were to form part of their institutional RDM solution. Requirements identified by the University of St Andrews and fed back into this exercise can be found in a separate blog post.
In February the Universities of Lincoln and Bristol appeared to be the only universities within the UK that had been working with CKAN with a view to implementing it as part of their institutional RDM infrastructures. Since then CKAN appears to have attracted greater interest, and a recent request sent to the CKAN for Research Data Management mailing list returned the following results:
- CKAN has already been implemented at the University of Newcastle, the University for the Creative Arts / VADS, the University of Lincoln, the University of Bristol, the University of Oxford and the AHRC / EPSRC-funded DART project.
- In addition to St Andrews, Cardiff University is in the early stages of implementing CKAN on a pilot basis. The Marine Research Monitoring group in Western Australia is also in the process of implementing CKAN for research data.
- The University of Leicester is considering a trial of CKAN.
The first four use cases were implemented by projects that received funding under the JISC Managing Research Data Programme. The University of Oxford was the only university within the UK that responded and that has made progress in experimenting with CKAN outside a publicly funded project. Current efforts to develop CKAN for RDM centre on two main forms of implementation:
- Using the full CKAN package: (1), (4), (5), (6), and (7)
- Using CKAN as data catalogue: (2), (3), and (4)
(1) Iridium project (University of Newcastle)
The Iridium project has a small pilot running in which CKAN is used by one research group as a data repository. Information on CKAN at Newcastle University is available here: http://research.ncl.ac.uk/rdm/tools/ckan/
- Iridium project report: http://research.ncl.ac.uk/media/sites/researchwebsites/iridium/iridium_CKAN_case_study_12_6_2013_v1_BA.pdf
(2) KAPTUR (University for the Creative Arts / VADS)
There was a requirement to integrate the RDM system with the existing institutional repository running on EPrints. CKAN has therefore not been used as a repository but rather as the RDM system.
- Project web site: http://vads.ac.uk/kaptur/about.html
- KAPTUR project report: http://vads.ac.uk/kaptur/outputs/KAPTUR_final_report.pdf
(3) Orbital (University of Lincoln)
The project developed an application that works with the CKAN API. The RDM system built on CKAN does not use CKAN as a research data repository.
- Project web site: https://orbital.blogs.lincoln.ac.uk/2013/05/03/the-researcher-dashboard/
- Code: https://github.com/lncd/Orbital-Bridge
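For readers who have not seen the CKAN API, the following minimal Python sketch prepares a dataset search against the `package_search` action. It shows the general shape of an API client and is not how Orbital's own application works. The site URL is a placeholder and the function name is made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_package_search(site_url, query, api_key=None, rows=5):
    """Prepare (but do not send) a CKAN package_search request."""
    params = urlencode({"q": query, "rows": rows})
    url = f"{site_url.rstrip('/')}/api/3/action/package_search?{params}"
    # An API key is only needed for private datasets; CKAN reads it
    # from the Authorization header.
    headers = {"Authorization": api_key} if api_key else {}
    return Request(url, headers=headers)


# Placeholder site URL; replace with a real CKAN instance before sending.
request = build_package_search("https://ckan.example.org", "marine survey")
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would return a JSON document whose `result` key holds the matching datasets.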
(4) data.bris (University of Bristol)
Data.bris is working on implementing two instances of CKAN as part of Bristol’s RDM system: one public read-only catalogue of data publications and one instance with controlled access for active research data.
- Web site: http://data.bris.ac.uk/
(5) DART project
The DART project uses CKAN as their data repository. The project has developed an ingest framework which will allow them to streamline the ingest and metadata markup of the thousands of ‘research objects’ (data) which will be hosted in the repository. Content will be exposed as OAI-PMH so that it can be consumed by organisations like the ADS and Europeana.
- Project web site: http://dartproject.info/WPBlog/
- Repository web site: http://dartportal.leeds.ac.uk/
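To make the OAI-PMH point concrete, here is a minimal Python sketch of what a harvester does: it requests records with the `ListRecords` verb and reads the identifiers out of the XML response. The endpoint URL and the sample response are invented for illustration and say nothing about how the DART repository is actually configured.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace used by every OAI-PMH 2.0 response.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    params = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{params}"


def record_identifiers(response_xml):
    """Extract the record identifiers from a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]


# Cut-down, made-up response containing two record headers.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

harvest_url = list_records_url("https://repository.example.org/oai")
identifiers = record_identifiers(SAMPLE)
print(harvest_url, identifiers)
```

A real harvester would also follow the `resumptionToken` that the protocol uses to page through large result sets.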
(6) University of Oxford
Oxford University IT Services is experimenting with using CKAN as a vehicle for rapid prototyping.
(7) Marine Research Monitoring group in Western Australia
The Marine Research and Monitoring group is within a state government department in Western Australia and is setting up CKAN for their research data management. The project has additional constraints as it is dealing with sensitive data about threatened species and communities. For this reason the project is looking at maintaining proper ISO 19139 / ANZLIC MCP metadata on a CKAN-harvested GeoNetwork catalogue for spatially referenced datasets.
Issues encountered by projects that are currently using CKAN
Feedback received has pointed to three main technical challenges in the implementation of CKAN, and these challenges are echoed by on-going discussions on the ckan-dev mailing list:
- While the installation of CKAN is generally described as fairly straightforward, the same is not true of getting the integrated data visualisation tool for .csv files to work, especially when CKAN is installed on an institutional intranet. The Iridium project identified an undocumented feature of CKAN as the cause of this problem: CKAN makes use of external data processing web services for some of its functionality. Our first experience with the software in St Andrews confirmed the issue with visualisation of data uploaded into CKAN. Like the Iridium server, the server used for the St Andrews project is currently behind the University’s firewall. At the time of writing, we are working on resolving the problem, which involves replacing CKAN’s dataproxy with datapusher, a fix suggested by the Open Knowledge Foundation that, we believe, is going to be implemented in future releases of CKAN. The expectation is that use of datapusher will eliminate the dependence of CKAN on external data processing web services.
- The CKAN datastore has been described as requiring time to understand.
- Shibboleth integration is possible with some local adaptations made to the code. The Iridium project implemented the extension developed for the Finnish Science Data Catalogue (https://github.com/kata-csc/ckanext-shibboleth).
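On the first of these issues, a quick sanity check is to confirm that a CKAN configuration file actually enables datapusher. The minimal Python sketch below does this with the standard library. It assumes the setting names `ckan.plugins` and `ckan.datapusher.url` used in CKAN releases that ship datapusher, so an older install such as our 2.1 pilot may well differ. The sample configuration is invented for illustration.

```python
import configparser


def datapusher_status(ini_text):
    """Report whether a CKAN ini file enables the datastore and datapusher."""
    # CKAN ini files contain %(here)s placeholders, so disable interpolation.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    section = parser["app:main"]
    plugins = section.get("ckan.plugins", "").split()
    return {
        "plugin_enabled": "datapusher" in plugins,
        "datastore_enabled": "datastore" in plugins,
        "url": section.get("ckan.datapusher.url"),
    }


# Invented example configuration; the real file is usually much longer.
SAMPLE_INI = """
[app:main]
ckan.site_url = http://ckan.example.org
ckan.plugins = stats datastore datapusher
ckan.datapusher.url = http://127.0.0.1:8800/
"""

status = datapusher_status(SAMPLE_INI)
print(status)
```

If the datapusher URL points at a service running on the same network as CKAN, no traffic needs to cross the firewall to process uploaded files.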
In addition, respondents pointed to cultural and support issues within their respective organisations that need to be addressed if the CKAN implementation is going to be successful:
- CKAN requires Python skills to be available to the organisation. To allow for a successful RDM service these skills need to be available beyond the duration of the initial implementation project. Where this has not been the case, RDM service development has slowed down or stopped.
- The development of a research data repository needs to consider academic workflows within the different disciplines, which will impact on the technical implementation and on the features that are made available.