How much (more) research data do we have, and where do we store it?

In January and February 2013 we asked Computing Officers across the University to provide us with feedback about the storage requirements for research data within their respective schools.

15 out of 19 schools and 3 research centres responded to the survey. The format in which responses were received varied from a summary response per school or research centre to receiving a compilation of forms completed by researchers. For a number of schools Computing Officers have pointed out that participation from researchers in the survey was low, for example 5 researchers out of 61 for one school, or 11 out of 30 for another. Given the response rate, it is inevitable that the survey outcomes presented here are unable to provide a full picture of the University’s research data storage need.

After we received the feedback from Computing Officers we compared it to the outcome of the Data Asset Framework audits carried out in 2012, and where we had identified additional needs, these were added. As a final step we looked at the requests for research data storage that reached IT Services over the last year and at data that is already stored on existing Research Computing infrastructure. Where we found gaps, we added the figures to the storage requirements for each school.

The survey identified a storage need of 1.36 PB for active data with an annual increase of approximately 10%, and a long-term storage need of approximately 600 TB with an annual increase of 20%. The feedback that we received also suggests that most researchers do not currently differentiate between active data and long-term storage, which will almost certainly have had an effect on the figures presented here. The 1.36 PB of active data storage includes a requirement of 600 TB for working copies of data produced by a High Performance Computing Cluster in the School of Mathematics and Statistics and 150 TB of scratch space for HPC users.

The table below provides a breakdown of current data storage by storage option. Only a very small proportion of the University’s research data (0.14%) is stored on centrally provided systems. Worryingly, approximately 19% of research data is currently stored on staff computers, and a further 15% is on external storage media. Only 5% of research data is stored in the public cloud. None of the respondents used an external organisation or data centre to look after their data. Only one respondent indicated that some of their research data is stored on a privately owned computer at home. It is to be expected that the true figure for research data stored on home computers is much higher than the few GB that the survey identified in the category “other”.

| Storage option | Active data: current need (GB) | Active data: annual increase (GB) | Long-term storage: current need (GB) | Long-term storage: annual increase (GB) |
|---|---:|---:|---:|---:|
| ITS Central Filestore | 369.6 | 89.6 | 232.0 | 0.0 |
| ITS web servers | 2,066.0 | 1,219.5 | 0.0 | 1,000.0 |
| Networked file store (non-ITS) | 938,721.1 | 37,040.1 | 153,000.0 | 27,600.2 |
| non-ITS web servers | 39,510.0 | 19,542.0 | 50,004.5 | 16,500.0 |
| staff computers | 208,253.0 | 28,993.0 | 164,030.0 | 732.5 |
| external storage media | 114,635.0 | 38,292.0 | 181,361.0 | 47,812.5 |
| external organisation or datacentre | 0.0 | 0.0 | 0.0 | 0.0 |
| use of external cloud services | 54,593.0 | 1,273.8 | 50,000.0 | 0.0 |
| other | 32.0 | 6.2 | 0.2 | 0.0 |
| Total | 1,358,179.7 | 126,456.2 | 598,627.7 | 93,645.2 |
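
The percentages quoted in the text can be reproduced from the table. The sketch below is an illustration only: it assumes that each percentage is the share of a storage option in the combined current need (active data plus long-term storage), which is our reading of the figures rather than something stated in the survey. The function name `combined_share` is made up for this example.

```python
# Current need per storage option, copied from the table:
# (active data, long-term storage).
current_need = {
    "ITS Central Filestore": (369.6, 232.0),
    "ITS web servers": (2066.0, 0.0),
    "Networked file store (non-ITS)": (938721.1, 153000.0),
    "non-ITS web servers": (39510.0, 50004.5),
    "staff computers": (208253.0, 164030.0),
    "external storage media": (114635.0, 181361.0),
    "external organisation or datacentre": (0.0, 0.0),
    "use of external cloud services": (54593.0, 50000.0),
    "other": (32.0, 0.2),
}

# Assumption: shares are taken against active + long-term need combined.
total_need = sum(active + long_term for active, long_term in current_need.values())


def combined_share(options):
    """Percentage of the combined current need held by the given options."""
    part = sum(sum(current_need[name]) for name in options)
    return 100.0 * part / total_need


shares = {
    "centrally provided (ITS)": combined_share(
        ["ITS Central Filestore", "ITS web servers"]
    ),
    "staff computers": combined_share(["staff computers"]),
    "external storage media": combined_share(["external storage media"]),
    "external cloud services": combined_share(["use of external cloud services"]),
}

for label, percent in shares.items():
    print(f"{label}: {percent:.2f}%")
```

Under this assumption the results round to the 0.14%, 19%, 15% and 5% given in the text.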

The charts that follow provide a breakdown of the various types of data storage across the different schools and research centres. It is clear from those charts that there are some rather significant gaps, and that the results presented here only provide an incomplete overview of the University’s research data storage requirement.

Some of the charts suggest that, compared to the storage needs in other schools and research centres, there is some especially data-intensive research in the School of Biology and in CREEM. While this is likely to be one factor, it is worth noting that responses from Biology and CREEM were much fuller than those received from other parts of the University. It is likely, therefore, that a higher proportion of their research data has been identified than in other schools and research centres.

IT Services central file store

ITS web servers

Networked file store (non-ITS)

The category of non-ITS networked file storage contains 600 TB of storage for working copies of HPC data (School of Mathematics and Statistics) and 150 TB of scratch space for HPC users (School of Chemistry). A further 62 TB for active data and 100 TB for long-term storage were added as a result of requests received by IT Services.

Non-ITS web servers

Staff computers

External storage media

External cloud services

CKAN for Research Data Management

CKAN is an open source data management system that for the past six years has been developed by the Open Knowledge Foundation. CKAN provides tools to streamline the processes of publishing, sharing, finding and using data. Initially CKAN was aimed at data publishers such as national and regional governments, companies and organisations that want to make their data publicly available. For example, the UK Government’s data portal (data.gov.uk) runs on CKAN, and in Australia, where CKAN is widely used by government agencies at both national and regional levels, it has become accepted as the de facto standard for data management.

As part of the JISC Managing Research Data programme two projects, Orbital (University of Lincoln) and data.bris (University of Bristol), have adopted CKAN as a component of their institutional RDM solutions. The experience of both projects using CKAN was discussed at the workshop “CKAN for Research Data Management” that was held in London on 18th February 2013. In addition to representatives from both projects, members from the Open Knowledge Foundation and staff from other UK Universities took part in the workshop. Summaries of the workshop can be found on the data.bris and Orbital project blogs.

As part of the workshop a requirements gathering exercise was started to investigate the wider needs of the RDM community and to match these needs against the functionality of CKAN. For this exercise, the following RDM roles were agreed: researcher, developer, curator/manager, re-user, IT support, and data subjects.

The requirements were gathered through expressions in the following format: “As a [RDM role], I want [what, requirement], so that [why, reason]”.
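
The format above can be sketched in a few lines of code. This is a minimal illustration only: the class name `Requirement` and its fields are our own invention, not part of the workshop material, and the example paraphrases requirement 8 from the list further down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    """One requirement in the 'As a ..., I want ..., so that ...' format."""

    role: str    # RDM role, e.g. "researcher"
    want: str    # what is required
    reason: str  # why it is required

    def as_story(self) -> str:
        # Fill the three slots of the workshop template.
        return f"As a {self.role}, I want {self.want}, so that {self.reason}."


# Paraphrase of requirement 8 from the list of requirements.
example = Requirement(
    role="researcher",
    want="to keep track of versions of data",
    reason="modifications made to individual files can be tracked easily",
)

print(example.as_story())
```

Keeping role, requirement and reason as separate fields makes it easy to group the collected statements by role later on.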

Below is the list of requirements that we would hope an RDM solution to meet and that we have fed back into the requirements gathering exercise. CKAN already meets a number of these requirements.

| No | RDM role | Requirement | Reason |
|---|---|---|---|
| 1 | Curator / manager | Digital objects not to be stored within a database | So that where necessary / desirable preservation tools (e.g. JHOVE, DROID) can be used to continuously validate file integrity |
| 2 | Curator / manager | Ability to integrate with Research Information Systems (e.g. Pure) | To allow for efficiency of institutional processes, e.g. in relation to REF |
| 3 | Curator / manager | Usage metrics to be available | To support gathering of information on research impact, e.g. for REF |
| 4 | Curator / manager | The availability of records management functionality, e.g. for the administration of retention / life cycle management periods | To support institutional life cycle management processes |
| 5 | Curator / manager | Ability to link publications (in the publications repository) to associated datasets (in CKAN) | To allow for ease of access to publications and related data, to support transparency and openness |
| 6 | Curator / manager | Ability to draw metadata from Pure and/or to export metadata into Pure | To assist the integration of RDM solutions with the Research Information System |
| 7 | Curator / manager, researcher | Support of variety of academic subject-specific metadata standards | To ensure CKAN is useful and adaptable to a wide range of academic disciplines |
| 8 | Curator / manager, researcher | To keep track of versions of data | To allow for ease of tracking modifications made to individual files |
| 9 | Developer | CKAN to support a range of accepted protocols for metadata harvesting (e.g. OAI-PMH) | So that catalogues for a number of different data stores can be integrated and searched via a single point of access |
| 10 | Developer | To harvest metadata from Fedora Commons, possibly via OAI-PMH | CKAN can present data that is kept in other repositories |
| 11 | Developer | To use Fedora Commons as a FileStore (http://docs.ckan.org/en/ckan-1.8/filestore.html) | CKAN can access digital objects in Fedora (as an alternative to harvesting) |
| 12 | Developer | APIs to support the development of alternative methods of data ingest into CKAN and the development of tools for data analysis | Subject-specific RDM needs can be met |
| 13 | IT Support | CKAN to be able to support a number of different data and database structures | Subject-specific RDM needs can be supported and existing department-level RDM solutions can be integrated into an institutional CKAN RDM solution |
| 14 | IT Support | CKAN to support Shibboleth and other single sign on protocols | Institutional sign on mechanisms can be used to authenticate to CKAN |
| 15 | IT Support | CKAN to be able to integrate with institutional identity management (IDM) systems | Existing institutional IDM can be used to define roles and levels of access to individual datasets within CKAN |
| 16 | IT Support | The availability of a documented mechanism of running several customisable instances of CKAN from the same codebase | Institutional support for potentially a multitude of CKAN instances to cater for various subject-specific needs can be done efficiently |
| 17 | IT Support | The availability of maintenance agreements | So that institutions adopting CKAN can get expert technical support when needed |
| 18 | IT Support | Commitment from CKAN developers to keeping code base up-to-date | So that business continuity can be ensured |
| 19 | IT Support | Any necessary security fixes to be developed quickly | So that system security can be ensured |
| 20 | IT Support | Metadata extraction upon ingest (including subject-specific metadata standards, e.g. TEI & VRA) | To avoid, where possible, manual metadata entry (reduction of typos, efficient use of staff time) |
| 21 | IT Support | The ability to run admin reports (e.g. storage space used in individual collections; file types contained in individual collections, access to / usage of (parts of) individual collections) | To allow for efficient support provision (e.g. planning of storage requirements, charging, etc.) |
| 22 | IT Support | CKAN to be designed in a modular fashion | So that it is possible to select individual components of the software and to integrate these with existing systems and technical infrastructure |
| 23 | IT Support | The availability of relevant and accessible user documentation that can be modified for local use | To reduce the number of CKAN users contacting the IT Service Desk for advice on how to use the system |
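
Requirements 9 and 10 refer to metadata harvesting via OAI-PMH. To give an idea of what a harvesting request looks like, here is a minimal sketch that builds a `ListRecords` request URL. The `verb`, `metadataPrefix` and `set` parameters and the `oai_dc` prefix come from the OAI-PMH protocol; the endpoint URL and the function name `list_records_url` are hypothetical and only serve as an illustration. The sketch says nothing about how CKAN or Fedora Commons implement harvesting.

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint, used for illustration only.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    metadata_prefix selects the metadata format (oai_dc is unqualified
    Dublin Core); set_spec optionally limits the harvest to one set.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


print(list_records_url(BASE_URL))
```

A harvester would send a request of this kind to each repository and merge the returned records into one catalogue, which is the single point of access that requirement 9 asks for.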

All requirements gathered will be summarised by the Orbital project and will feed into wider discussions on how CKAN can be developed further to support institutional RDM processes better.

Online Research Database Service (ORDS)

It is not uncommon for us to get requests from researchers for the setup of databases. Especially when such requests relate to unfunded research, we have often had to decline them. As a result, a number of research datasets have remained unpublished.

The problem of not having the resources to provide bespoke technical solutions to every research project that applies for our help is not unique to St Andrews. But then, not every project requires a bespoke solution. Over the past few years Oxford University IT Services undertook the VIDaaS (Virtual Infrastructure with Database as a Service) project to develop a technical solution that allows researchers to build and publish online databases quickly and without the need for programming skills.

VIDaaS runs on the DaaS (Database as a Service) software also developed by Oxford. DaaS is an online solution that provides users with a WYSIWYG (what you see is what you get) interface. DaaS allows for version control and, if necessary, it permits users to make changes to the database schema via a drag and drop mechanism. There is no need to make the modifications to scripts or user interfaces that normally become necessary as a result of a change to the database schema. These changes are made automatically by the DaaS software when the database schema is changed.

Oxford has been working on DaaS to support different types of database, including relational databases, XML databases and document databases. Plans also included the ability to upload MS Access databases and for DaaS to convert these into online PostgreSQL databases.

DaaS supports various levels of access restriction to the data. The ability to support institutional single sign-on mechanisms is being developed. Several members of research groups can be given permission to modify the project database.

The VIDaaS project came to an end in 2012, and since then Oxford University IT Services has migrated the DaaS software to a more secure technical environment.

The Online Research Database Service (ORDS) that is currently under development uses and builds on the outcome of the VIDaaS project. Oxford University IT Services is currently undertaking an ORDS maturity project with a view to both supporting researchers at Oxford and making the DaaS software available to other institutions. As part of the ORDS maturity project the Research Computing Service will be looking at the software and its features with a view to investigating the feasibility of providing the ORDS locally to researchers within the University of St Andrews.