IDCC 2015

I’m spending most of this week at the DCC’s 10th International Digital Curation Conference in London. Research data management (RDM) is a key subject throughout the conference, but I have a particular interest in the Islandora software, covered by a poster, demonstration and finally a workshop on Thursday. The Digital Humanities site at St Andrews is an Islandora site, and I’ll come back to it in another post following the workshop.

On the wider subject of RDM (or Digital Curation, if you prefer), a number of key themes kept cropping up. There has been a lot of talk about carrots vs. sticks (and, at one point, carrot sticks) – having funders require researchers to undertake data management tasks and make certain commitments is effective in getting things moving in the right direction, but there is a real need for a change of culture so that good data management practices are embedded within research processes because they deliver real benefits to researchers. Part of achieving this is providing services that are usable enough for a cost/benefit analysis to make good practices attractive, and so change behaviours and attitudes. However, sustained funding is required to provide those services, whether at institutional or cross-institutional levels. Meanwhile, project funding needs to allow for data management costs.

While it would be wonderful if suitable incentives were sufficient to engender good practices, an example of why requirements still matter emerged in discussions around funder requirements for data management plans. The EPSRC is unusual in the UK in not requiring researchers to submit a DMP (funders may use other terminology, e.g. technical plan) with their bids. They do expect a DMP to exist, but do not ask to see it, and it is incumbent on the institution to make sure that it is in place. The upshot is that it can be difficult to get researchers applying to the EPSRC to write DMPs – it doesn’t affect whether or not their bid will succeed, so why bother? While having researchers do this as a box-ticking exercise to ensure compliance with funder requirements is less than ideal, it is better than it not being done at all and can help to create good habits.

The sessions have ranged widely, and presentations are available from the programme page. The posters are also available online. Watch out for papers in the next edition of the International Journal of Digital Curation.

A couple of positive observations – they may seem faint, but are worth bearing in mind for those of us trying to move things forward:

  • Slow progress is still progress
  • Poor metadata is better than no metadata

Finally, I’d like to give a plug to DMPonline partly because I was a developer working on it before coming to St Andrews but mainly because it is the best way to draft a DMP (even if you’re EPSRC-funded!), with templates based on funder requirements and guidance from both the University and the funder.

Using CKAN for Research Data Management

This blog post provides examples of current uses of CKAN for RDM and an overview of a CKAN pilot that has started within the University of St Andrews.

University of St Andrews CKAN pilot

The CKAN pilot within the University started as part of an extension to the JISC-funded Cerif for Datasets (C4D) project and is currently being continued with existing University resources. At the outset of the project the Open Knowledge Foundation provided a vanilla install of CKAN 2.1 to allow for an initial evaluation of the software.

The initial goal of the pilot project is to investigate the functionality offered by CKAN with a view to potential integration within our local RDM infrastructure, in particular with our existing Pure CRIS, which is based on the CERIF metadata standard. The expectation is that Pure will be developed as our research data catalogue, linking through to research datasets stored externally or locally. Thus, unlike many of the other projects we have heard of so far, our particular interest is in using CKAN as a research data repository, rather than as a data catalogue. We are keen to find out whether it is possible to run preservation tools on data that is stored in CKAN and thus to implement basic digital preservation workflows. We are also interested in determining how straightforward or otherwise it is to develop customised metadata fields and interoperability with our Pure CRIS using CERIF-XML.

We will be publishing periodic updates of our CKAN pilot in this space and are keen to hear from others who are also working at implementing CKAN for RDM. To get in touch, please leave a comment or email research-computing [AT] st-andrews.ac.uk.

Background of the St Andrews CKAN pilot

In February 2013 the JISC-funded projects Orbital and data.bris organised a workshop on their specific uses of CKAN as part of their respective institutional RDM infrastructure. The workshop culminated in a requirements-gathering exercise during which representatives from the different organisations fed back their expectations on CKAN, if it was to form part of their institutional RDM solution. Requirements identified by the University of St Andrews and fed back into this exercise can be found in a separate blog post.

In February the Universities of Lincoln and Bristol appeared to be the only universities within the UK that had been working on CKAN with a view to implementing it as part of their institutional RDM infrastructures. Since then CKAN appears to have attracted greater interest, and a recent request sent to the CKAN for Research Data Management mailing list returned the following results:

  • CKAN has already been implemented at the University of Newcastle, the University for the Creative Arts / VADS, the University of Lincoln, the University of Bristol, the University of Oxford and the AHRC / EPSRC-funded DART project.
  • In addition to St Andrews, Cardiff University is in the early stages of implementing CKAN on a pilot basis. The Marine Research Monitoring group in Western Australia is also in the process of implementing CKAN for research data.
  • The University of Leicester is considering a trial of CKAN.

Use cases

The first four use cases were implemented by projects that received funding under the JISC Managing Research Data Programme. The University of Oxford was the only university within the UK that responded and that has made progress in experimenting with CKAN outside a publicly funded project. Current activities in developing CKAN for RDM centre around two main forms of implementation:

  • Using the full CKAN package: (1), (4), (5), (6), and (7)
  • Using CKAN as data catalogue: (2), (3), and (4)

(1)  Irdium project (University of Newcastle)

The Irdium project has a small pilot running whereby CKAN is used by one research group as a data repository. Information on CKAN at Newcastle University is available here: http://research.ncl.ac.uk/rdm/tools/ckan/

(2)  KAPTUR (University for the Creative Arts / VADS)

There was a requirement to integrate the RDM system with the existing institutional repository running on EPrints. Therefore CKAN has not been used as a repository but rather as an RDM system.

(3)  Orbital (University of Lincoln)

The project developed an application that works with the CKAN API. The RDM system built on CKAN does not use CKAN as a research data repository.
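To give a flavour of what working with the CKAN API can involve, here is a minimal Python sketch. It is not Orbital's code. It builds a request URL for the `package_search` call of CKAN's Action API and reads dataset titles from a response. The site URL `https://ckan.example.org` and the sample response are made up for illustration, so the sketch runs without network access.

```python
import json
from urllib.parse import urlencode

# Hypothetical CKAN instance; replace with the URL of a real site.
CKAN_BASE = "https://ckan.example.org"


def package_search_url(query, rows=10, base=CKAN_BASE):
    """Build a URL for CKAN's Action API package_search call."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base}/api/3/action/package_search?{params}"


def dataset_titles(response_text):
    """Extract dataset titles from a package_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    # Fall back to the dataset name when no title has been set.
    return [pkg.get("title") or pkg["name"] for pkg in payload["result"]["results"]]


# Made-up response in the general shape CKAN returns (success flag plus result).
SAMPLE_RESPONSE = json.dumps({
    "success": True,
    "result": {
        "count": 1,
        "results": [{"name": "tide-gauge-2012", "title": "Tide gauge readings 2012"}],
    },
})

print(package_search_url("tide gauge", rows=5))
print(dataset_titles(SAMPLE_RESPONSE))
```

In a real application the URL would be fetched with an HTTP client and the response body passed to `dataset_titles`.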

(4)  data.bris (University of Bristol)

Data.bris is working on implementing two instances of CKAN as part of Bristol’s RDM system: one public read-only catalogue of data publications and one instance with controlled access for active research data.

(5)  DART project

The DART project uses CKAN as its data repository. The project has developed an ingest framework which will allow it to streamline the ingest and metadata markup of the thousands of ‘research objects’ (data) which will be hosted in the repository. Content will be exposed via OAI-PMH so that it can be consumed by organisations like the ADS and Europeana.
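For readers unfamiliar with OAI-PMH, the sketch below shows roughly what a consuming organisation does: issue a `ListRecords` request and read Dublin Core titles from the XML that comes back. This is an illustration only, not DART's implementation. The endpoint `https://repository.example.org/oai` and the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL."""
    params = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{params}"


def record_titles(xml_text):
    """Return the dc:title values found in each harvested record."""
    root = ET.fromstring(xml_text.strip())
    titles = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        for title in record.iter(f"{{{DC_NS}}}title"):
            titles.append(title.text)
    return titles


# Made-up response containing a single record, so the sketch runs offline.
SAMPLE_XML = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample research object</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

print(list_records_url("https://repository.example.org/oai"))
print(record_titles(SAMPLE_XML))
```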

(6)  University of Oxford

Oxford University IT Services is experimenting with using CKAN as a vehicle for rapid prototyping.

(7)  Marine Research Monitoring group in Western Australia

The Marine Research and Monitoring group is within a state government department in Western Australia and is setting up CKAN for its research data management. The project has additional constraints as it is dealing with sensitive data about threatened species and communities. For this reason the project is looking at maintaining proper ISO 19139 / ANZLIC MCP metadata on a CKAN-harvested GeoNetwork catalogue for spatially referenced datasets.

Issues encountered by projects that are currently using CKAN

Feedback received has pointed to three main technical challenges in the implementation of CKAN, and these challenges are echoed by on-going discussions on the ckan-dev mailing list:

  • While the installation of CKAN is generally described as fairly straightforward, this is not true of getting the integrated data visualisation tool for .csv files to work, especially when installations are made on institutional intranets. The Irdium project identified an undocumented feature of CKAN as the cause of this problem: CKAN makes use of external data processing web services for some of its functionality. Our first experience with the software in St Andrews confirmed the issue with visualisation of data uploaded into CKAN. Like the Irdium server, the server used for the St Andrews project is currently behind the University’s firewall. At the time of writing, we are working on resolving the problem, which involves replacing CKAN’s dataproxy with datapusher, a fix suggested by the Open Knowledge Foundation that, we believe, is going to be implemented in future releases of CKAN. The expectation is that use of datapusher will eliminate the dependence of CKAN on external data processing web services.
  • The CKAN datastore has been described as requiring time to understand.
  • Shibboleth integration is possible with some local adaptations made to the code. The Irdium project implemented the extension developed for the Finnish Science Data Catalogue (https://github.com/kata-csc/ckanext-shibboleth).

In addition, respondents pointed to cultural and support issues within their respective organisations that need to be addressed if the CKAN implementation is going to be successful:

  • CKAN requires Python skills to be available to the organisation. To allow for a successful RDM service these skills need to be available beyond the duration of the initial implementation project. Where this has not been the case, RDM service development has slowed down or stopped.
  • The development of a research data repository needs to consider academic workflows within the different disciplines, which will impact on the technical implementation and on the features that are made available.

Note:

The CKAN Project has provided additional information to the above in their own blog post: http://ckan.org/2013/11/28/ckan4rdm-st-andrews/

That blog post identifies an additional project that uses CKAN for RDM, EDaWaX (European Data Watch Extended).

29 November 2013

Funding Research Data Management

What follows is a summary of the Research Data Management Forum special event on Funding RDM services which took place at Lakeside Conference Centre, Aston University on 25th April 2013.

During the morning session there were four presentations:

  • Caretakers of the Present, Guardians of the Future: The digital Research Data Challenge (Jeff Haywood, University of Edinburgh)
  • Research Facilities and Equipment Sharing database developments / DataPool – Collaboration as Foundation for Sustainability (Adrian Cox & Wendy White, University of Southampton)
  • Research Data Management at Oxford Brookes (Sarah Taylor, Research Support Manager, Oxford Brookes University)
  • Ongoing RDM Support at the University of Bristol (Stephen Gray, Project Manager data.bris)

During the afternoon a panel session of funders with representatives from BBSRC, EPSRC, MRC, NERC, STFC, and the Wellcome Trust took place. The slides for the above presentations and the questions put to research funders are available from the Digital Curation Centre web site.

Caretakers of the Present, Guardians of the Future: The digital Research Data Challenge

Research data can mean different things, such as scratch data (e.g. raw data that is collected straight from a machine), analysed data, and final data sets that represent underlying data for publications. Each researcher’s collection of data is liable to contain a mixture of the different data types.

Over the past years the University of Edinburgh has provided the following Research Computing Services: data services, storage, sharing, and curation. Haywood talked about the difficulty in scoping such services. As part of the development of RDM services a number of issues had to be addressed, such as inadequate storage provision, a lack of formal institutional policies for the creation and management of data, and a lack of training and best practice guidance for researchers on RDM.

The initial move of the University to set up and develop RDM services was not triggered by funder requirements but was based on internal motivators that recognised the value of such services to research. Within the context of an increasing formalisation of RDM from funders, Haywood expects the relative importance of UK funders to diminish over the next years in favour of European and other international funders. This will increase the complexity of the RDM policy frameworks that institutions will have to comply with.

Edinburgh currently provides researchers with 0.5TB of data storage by default and is expecting to increase this to 1TB. As part of JISC-funded projects it has developed the Research Mantra RDM training suite and the DMPonline data management planning tool. Both are available free of charge to researchers in other institutions.

Over the next couple of years the University of Edinburgh plans to invest at least £2m in its development of research data services.

Research Facilities and Equipment Sharing database developments / DataPool – Collaboration as Foundation for Sustainability

Adrian Cox talked about the EPSRC-funded UNIQUIP project, a national equipment database that provided a mechanism for an effective national agenda for sharing (http://equipment.data.ac.uk). This project is now part of the data.ac.uk portal that is aimed at sharing different types of data, including research data.

Wendy White continued by talking about the DataPool project and the development of shared, scalable and sustainable services. Southampton had developed Tweepository, a Twitter archiving service. As with any other service, the long-term institutional commitment to Tweepository depends on use and take-up. White went on to talk about institutional policies and underlined the need for exemplars of the innovative use of linked / open data to help with establishing data retention guidance.

Research Data Management at Oxford Brookes

Sarah Taylor spoke about the experience of developing RDM services at Oxford Brookes University. The University had formed part of the Digital Curation Centre Institutional Engagement project. Until two years ago IT support at Oxford Brookes had been largely devolved. Since then a new faculty structure had been implemented and centralised IT support provision introduced. Oxford Brookes agreed its institutional RDM policy in February 2013.

Ongoing RDM Support at the University of Bristol

Within the University of Bristol there are 2,288 researchers submitting approximately 1,700 applications for funding per annum. 42% of submitted applications are successful, currently generating a research income of £106m.

The development of RDM services in Bristol builds on a £2m investment in research data storage infrastructure made prior to the RCUK RDM agenda. As part of standard service provision, each researcher can apply for 5TB of storage free of charge. RDM infrastructure was developed by the data.bris project, which in addition to university resources had received £250k of JISC funding and a further £65k from EPSRC.

In Bristol RDM support is split across three units: IT Services, the Library, and Research, Enterprise and Development (RED) services. Prior to the start of the data.bris project there was a 0.2FTE digital humanities support officer within IT Services and ad-hoc research computing work carried out in individual academic departments. Within the Library there was 0.2FTE support for a small institutional repository. RED provided assistance for research funding applications at pre-award stage only as well as training in grant writing, which excluded RDM training.

The data.bris project developed RDM training provision, software to deposit data, and an RDM policy that is similar to Edinburgh’s policy. Bristol also adopted the RDM principles centering around excellence and impact, world-class infrastructure, skills and knowledge, integrity and professionalism, and leadership and collaboration that are set out in Monash University’s Research Data Management Strategy and Strategic Plan 2012-2015.

An additional outcome of the data.bris project was a business case to the university, which was successful. As a result, the University of Bristol has approved a pilot data service project running from 1 August 2013 until 31 July 2015 during which the requirements for permanent service provision will be established. The project is going to be resourced with 1 director, 3 subject data librarians, and 1 member of technical support staff. While the data.bris project was owned by IT Services working with the Library and RED, responsibility for RDM service provision during the pilot project lies with the Library in collaboration with IT Services.

Panel session

The afternoon session comprised a panel discussion with representatives from BBSRC, EPSRC, MRC, NERC, STFC, and the Wellcome Trust. Attendees had been asked to submit questions to funders in advance. Questions had been categorised as follows:

  • A. What RDM services and infrastructure are in scope?
  • B. Open Access and data sharing
  • C. Specific and allowable RDM cost elements
  • D. Long term storage, preservation and archiving
  • E. What is the application process to recover RDM services?
  • F. Guidance and support
  • G. Compliance

A. What RDM services and infrastructure are in scope?

RDM services and infrastructure are within the scope of two RCUK Common Principles on Data Policy, principles 2 and 7:

  • Principle 7: “It is appropriate to use public funds to support the management and sharing of publicly-funded research data. To maximise the research benefit which can be gained from limited budgets, the mechanisms for these activities should be both efficient and cost-effective in the use of public funds.” There are two ways of looking at the statement of wanting to maximise the research benefits from the use of such funds: Firstly such money should be used efficiently, and secondly funders are looking for research benefits where money is spent.
  • Principle 2: “Institutional and project specific data management policies and plans should be in accordance with relevant standards and community best practice. Data with acknowledged long-term value should be preserved and remain accessible and usable for future research.” This means that not all data should remain accessible or be preserved.

Costs

There is a need to distinguish between the costs that are incurred during a project and those that arise afterwards. Costs incurred during a project should be included in the direct costs for a project, including hardware, staff, expenses, etc. RDM costs will include preparation of the data for access and curation (incl. the addition of metadata).

After the project there are three possible cases:

  • Provision through the research funder:
    • Some research councils (NERC, ESRC) have their own data centres.
    • Some disciplines have their own disciplinary repositories.
    • In STFC’s programme big projects (e.g. the Large Hadron Collider) already have an infrastructure in place that can be used by other projects.
      Where such services are available, grant holders are expected to use them. MRC have a mixed model, where for some areas of the programme such services exist.
  • Provision through the research organisation: When the institution is going to provide a service in the long term, the obvious way to finance it is through FEC.
  • Outsourcing to a third party: If a third party is providing the service, there is the need for due diligence to be done by the institution.

The principles under which the data is going to be made available are expected to be explained in the Data Management Plan (DMP) at the project level. The institutional infrastructure that is provided for making data available is expected to be outlined within an institutional plan, policy or operational document. Researchers need to make it clear within the Justification of Resources of their application for funding what exactly it is that they expect funders to pay for. Justifications of Resources should, where possible, separate out the following RDM cost elements: the cost of collecting data, the cost of curating data, the cost of analysing data, and the cost of preservation and sharing.

As concerns funding, there are some differences in the way the Wellcome Trust operates. Owing to its charity status, the Wellcome Trust in general only pays directly incurred costs.

Acknowledged long-term value

Acknowledged long-term value involves a value judgement that needs to be made on a case-by-case basis. Research funders ask for the justification of what data to keep to be made as part of the DMP.

Funders acknowledge that researchers in all areas of Science need to be selective. It is necessary to obtain experience to help with these judgements. Such experience can be gained through the peer review process that assesses the likely research benefits to be gained. Furthermore, researchers can develop their own experience by keeping a record of who has asked for data, what kind of data was provided, and what publications and further outputs arose from that activity.

EPSRC finds it challenging to come up with a one-size-fits-all solution, which is what RDM is trying to do. EPSRC policy states that apart from data that underpins publications, which must be kept, EPSRC leaves it up to the researchers to decide what to keep. If a researcher decides that something should be kept, then this data should be stored properly. Once a decision to keep data has been made, the policy states that the data should be kept for 10 years. EPSRC assumes that it does not cost much more to store data for 10 years than it costs to store it in the short term. If in that 10-year period the data has not been accessed, questions need to be asked as to whether this data should really be kept. If people have accessed the data, the 10-year period starts anew. There is obvious data (e.g. climate readings) that will need to be kept in the long term regardless of whether or not such data has been accessed during the 10-year period. It is not up to EPSRC to determine at the level of each discipline what to support. It is the researcher who knows the data best and who needs to decide whether it has long-term value.
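The rolling 10-year period described above can be sketched in a few lines of Python. This is our reading of the rule as summarised here, not an official EPSRC algorithm: the clock starts when the decision to keep the data is made and restarts whenever the data is accessed. The function names are hypothetical, and data that must be kept regardless of access (e.g. climate readings) is not modelled.

```python
from datetime import date

RETENTION_YEARS = 10  # the period named in the summary above


def add_years(d, years):
    """Return the same calendar day a number of years later."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # 29 February in a non-leap target year: fall back to 28 February.
        return d.replace(year=d.year + years, day=28)


def retention_review_date(decision_date, access_dates=()):
    """Date on which to ask whether the data should still be kept.

    The period runs from the decision to keep the data, or from the most
    recent access if the data has been accessed since then.
    """
    last_event = max([decision_date, *access_dates])
    return add_years(last_event, RETENTION_YEARS)


# Data kept from 25 April 2013 and never accessed.
print(retention_review_date(date(2013, 4, 25)))
# The same data accessed on 1 June 2019: the period starts anew.
print(retention_review_date(date(2013, 4, 25), [date(2019, 6, 1)]))
```

Keeping a record of access dates, as suggested earlier for building experience about long-term value, is what makes this kind of check possible.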

B. Open Access and Data Sharing

Research data is like any other cost of a grant. If the cost is incurred during the grant period, it counts like any other direct cost. Every good application for funding should in its overall general work plan have time set aside for dealing with data management activities.

Making data open access incurs a cost that is separate from APCs.

Funders point out that research data should be in a similar position to publications as concerns access. From MRC’s perspective, however, researchers are not ready to have their data put up for anybody to look at. This is due to the complexity of data and the need for confidentiality. The research team who created the data needs to be involved in the process of preparing the data for publication. MRC could not afford to put up all the data from 60-80 years of medical research for anybody to use. Such a job is not achievable at the moment.

EPSRC does not require a DMP as part of the application. In EPSRC applications the Management Section within the Case for Support should increasingly be used to include references to data and its uses. There is a possible move towards formalising this aspect and this way to include data management within the peer review process. Jeff Haywood pointed out that EPSRC’s approach is from an institutional perspective inconsistent with the requirements of other funders. EPSRC require every funded project to have a DMP as part of good research practice, but the DMP itself is not subject to peer review.

C. Specific and allowable RDM cost elements

The research community should develop a better understanding of the importance of using metadata to describe their datasets. Advocacy for metadata is considered part of normal institutional service provision and should not be funded by Research Councils. Universities have a role to ensure that custodians of data have a sufficient understanding of metadata.

It is not possible to include RDM costs in applications for funding based on a rule-of-thumb estimate. For some projects, RDM may be within institutional provision; in other projects RDM may be the majority of the cost. The cost of RDM is project-specific and entirely depends on the type of work.

Within NERC the cost of RDM represents approximately 2% of the overall cost of funding, but the distribution of this cost varies greatly from project to project.

D. Long term storage, preservation and archiving

Long term storage, preservation and archiving costs are part of the overhead of each HEI and should be met by research funders. If this cost cannot be covered through the direct cost as part of the project, then indirect and quality-related costs are considered obvious funding streams.

EPSRC expects institutions to call upon these sources of funding and to set aside funds to ensure that institutions have good data management infrastructure in place to support researchers. It is up to the institution to decide whether RDM infrastructure is provided in-house or via a third party. An organisation that sets itself up as a research organisation needs to take on board the importance and necessity of providing data management infrastructure to researchers as part of its central provisioning.

The pay-once-store-forever (POSF) model is a difficult concept for funders. The assumption made by the POSF model is storage over 20 years or forever, whichever comes first. Although most of the curation costs are upfront costs, if data needs to be kept in the long term, there is likely to be an ongoing cost for storage and access.

The cost of storage increases where multiple copies of data need to be kept unless it is acceptable to allow data to decay. According to funders, judgements need to be made about opportunities and risks. When managing a large resource, some parts will receive a lot of attention and other parts less. It is about taking a proportionate approach to opportunities and risks.

EPSRC point out that RDM takes place within the context of the UK legislative framework (e.g. FoI). Public bodies such as universities are under the obligation to make data available unless there are exemptions to be applied. Therefore, organisations need to know what data they hold. A process of appraising whether or not to keep data is a valid part of institutional policy. There should be a cost / benefit analysis: if the cost of appraisal is higher than the cost of just keeping the data, then there is no benefit in carrying out the appraisal.

E. What is the application process to recover RDM services?

The question was asked whether institutions can set up small research facilities to recover the cost of RDM services. It was pointed out that there is already a model for doing this in the provision of HPC services. Simon Hodgson clarified that the origin of small research facilities lies in the TRAC terminology. Small research facilities are chargeable above cost, i.e. not as part of the indirect cost nor as part of the university infrastructure. Instead, they are chargeable as a specific infrastructure that can be charged against a research grant. A small research facility is provided by the university but it is used by research groups.

A small research facility needs to be approved by funders as such. For example, such a facility might be possible in the area of big data, if big data is generated in a new way. Added value needs to be shown. Funders recognise that we are on a journey, and that it is not impossible to make the case. The added value would have to be demonstrated to funders and to the research community. If a facility was funded, it would be out of the research budget, possibly as a cross-institutional service. A small research facility needs to be very close to Science. It is about creating highly specialised services.

F. Guidance and support

RCUK is developing guidance for RDM. It is hoped that this guidance will become available within a few months.

G. Compliance

Councils hope that universities would want to take on the development of support for RDM and prove that they are doing this well. After all, RDM is about producing good research. In Bristol the EPSRC policy has carried some weight, but there were other drivers. RDM brings kudos.

EPSRC does not want to act as a police officer. There are expectations that organisations have good research management practices in place. If an organisation is becoming obviously non-compliant or if there are systemic failures in RDM, then the organisation risks losing its eligibility for research funding.

EPSRC does not expect to have to go out and check institutional compliance. A case could be made that there could be cross-council visits to institutions rather than just the EPSRC checking compliance with their own policies.

MRC does not believe that councils should ask institutions lots of questions, but there are areas, e.g. security where a few intelligent questions could be asked to check compliance. More concern is over the compliance of Principal Investigators with MRC’s policy to share data. It was felt that we are still not realising the benefits of sharing to the extent that we should do. This is where peer reviewers can help, because they do understand the data. The MRC does not want to use the compliance argument too much, but would prefer to argue that RDM is good for outputs and good for collaboration.

How much (more) research data do we have, and where do we store it?

In January and February 2013 we asked Computing Officers across the University to provide us with feedback about the storage requirements for research data within their respective school.

15 out of 19 schools and 3 research centres responded to the survey. The format in which responses were received varied from a summary response per school or research centre to receiving a compilation of forms completed by researchers. For a number of schools Computing Officers have pointed out that participation from researchers in the survey was low, for example 5 researchers out of 61 for one school, or 11 out of 30 for another. Given the response rate, it is inevitable that the survey outcomes presented here are unable to provide a full picture of the University’s research data storage need.

After we received the feedback from Computing Officers we compared it to the outcome of the Data Asset Framework audits carried out in 2012, and where we had identified additional needs, these were added. As a final step we looked at the requests for research data storage that reached IT Services over the last year and at data that is already stored on existing Research Computing infrastructure. Where we found gaps, we added the figures to the storage requirements for each school.

The survey identified a storage need of 1.36Pb for active data with an annual increase of approximately 10%, and a long-term storage need of approximately 600Tb with an annual increase of 20%. The feedback that we received also suggests that most researchers do not currently differentiate between active data and long-term storage, which will almost certainly have had an effect on the figures presented here. The 1.36Pb of active data storage includes a requirement of 600Tb for working copies of data produced by a High Performance Computing Cluster in the School of Mathematics and Statistics and 150Tb of scratch space for HPC users.
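To get a feel for what these growth rates imply, the rough sketch below projects both figures forward. It assumes that the percentage increases compound annually, which the survey does not state, and it treats 1.36Pb as 1,360Tb. The function name `project_storage` and the five-year horizon are illustrative only.

```python
def project_storage(current_tb: float, annual_rate: float, years: int) -> list[float]:
    """Return the projected need for year 0..years, assuming compound annual growth."""
    return [current_tb * (1 + annual_rate) ** year for year in range(years + 1)]


# Figures from the survey summary: 1.36Pb active (+10%/year), 600Tb long-term (+20%/year).
active = project_storage(1360.0, 0.10, 5)
long_term = project_storage(600.0, 0.20, 5)

for year, (a, lt) in enumerate(zip(active, long_term)):
    print(f"year {year}: active {a:,.0f} Tb, long-term {lt:,.0f} Tb")
```

Given the low response rates, any such projection is best read as a lower bound rather than a forecast.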

The table below provides a breakdown of current data storage by storage option. It is obvious that only a very small proportion of the University’s research data (0.14%) is stored on centrally provided systems. Worryingly, approximately 19% of research data is currently stored on staff computers, and a further 15% is on external storage media. Only 5% of research data is stored in the public cloud. None of the respondents used an external organisation or data centre to look after their data. Only one respondent indicated that some of their research data is stored on a privately owned computer at home. It is to be expected that the true figure for research data storage on home computers is much higher than the few Gb in the category “other” that the survey identified.

| Storage option | Active data: current need | Active data: annual increase | Long-term storage: current need | Long-term storage: annual increase |
| --- | --- | --- | --- | --- |
| ITS Central Filestore | 369.6 | 89.6 | 232.0 | 0.0 |
| ITS web servers | 2,066.0 | 1,219.5 | 0.0 | 1,000.0 |
| Networked file store (non-ITS) | 938,721.1 | 37,040.1 | 153,000.0 | 27,600.2 |
| non-ITS web servers | 39,510.0 | 19,542.0 | 50,004.5 | 16,500.0 |
| staff computers | 208,253.0 | 28,993.0 | 164,030.0 | 732.5 |
| external storage media | 114,635.0 | 38,292.0 | 181,361.0 | 47,812.5 |
| external organisation or datacentre | 0.0 | 0.0 | 0.0 | 0.0 |
| use of external cloud services | 54,593.0 | 1,273.8 | 50,000.0 | 0.0 |
| other | 32.0 | 6.2 | 0.2 | 0.0 |
| Total (in Gb) | 1,358,179.7 | 126,456.2 | 598,627.7 | 93,645.2 |
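The percentages quoted in the text can be reproduced from the current-need columns of the table. The short sketch below does so, assuming (the post does not say this explicitly) that each share is calculated against the combined total of active and long-term data. The names `CURRENT_NEED_GB` and `share_percent` are illustrative.

```python
# Current need in Gb per storage option as (active data, long-term storage),
# copied from the table above.
CURRENT_NEED_GB = {
    "ITS Central Filestore": (369.6, 232.0),
    "ITS web servers": (2066.0, 0.0),
    "Networked file store (non-ITS)": (938721.1, 153000.0),
    "non-ITS web servers": (39510.0, 50004.5),
    "staff computers": (208253.0, 164030.0),
    "external storage media": (114635.0, 181361.0),
    "external organisation or datacentre": (0.0, 0.0),
    "use of external cloud services": (54593.0, 50000.0),
    "other": (32.0, 0.2),
}


def share_percent(options: list[str]) -> float:
    """Share of all identified data (active + long-term) held on the given options."""
    total = sum(active + long_term for active, long_term in CURRENT_NEED_GB.values())
    part = sum(sum(CURRENT_NEED_GB[name]) for name in options)
    return 100.0 * part / total


central = share_percent(["ITS Central Filestore", "ITS web servers"])
staff = share_percent(["staff computers"])
external_media = share_percent(["external storage media"])
cloud = share_percent(["use of external cloud services"])

print(f"centrally provided systems: {central:.2f}%")
print(f"staff computers: {staff:.1f}%")
print(f"external storage media: {external_media:.1f}%")
print(f"external cloud services: {cloud:.1f}%")
```

Under that assumption the results round to the 0.14%, 19%, 15% and 5% quoted above.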

The charts that follow provide a breakdown of the various types of data storage across the different schools and research centres. It is clear from those charts that there are some rather significant gaps, and that the results presented here only provide an incomplete overview of the University’s research data storage requirement.

Some of the charts suggest that, compared to the storage needs in other schools and research centres, there is some especially data-intensive research in the School of Biology and in CREEM. While this is likely to be one factor, it is worth noting that responses from Biology and CREEM were much fuller than those received from other parts of the University. It is likely, therefore, that a higher proportion of their research data has been identified than was the case for other schools and research centres.

[Chart: IT Services central file store]

[Chart: ITS web servers]

[Chart: Networked file store (non-ITS)]

The non-ITS networked file storage category includes 600Tb of storage for working copies of HPC data (School of Mathematics and Statistics) and 150Tb of scratch space for HPC users (School of Chemistry). A further 62Tb for active data and 100Tb for long-term storage were added as a result of requests received by IT Services.

[Chart: Non-ITS web servers]

[Chart: Staff computers]

[Chart: External storage media]

[Chart: External cloud services]

CKAN for Research Data Management

CKAN is an open source data management system that has been developed by the Open Knowledge Foundation over the past six years. CKAN provides tools to streamline the processes of publishing, sharing, finding and using data. Initially CKAN was aimed at data publishers such as national and regional governments, companies and organisations that want to make their data publicly available. For example, the UK Government’s data portal (data.gov.uk) runs on CKAN, and in Australia, where CKAN is widely used by government agencies at both national and regional levels, it has become accepted as the de facto standard for data management.
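For readers who have not used CKAN, searching for datasets is available through its Action API as well as through the web interface. The sketch below builds a `package_search` request and reads the dataset titles out of a response. The base URL `https://ckan.example.org` is a placeholder, the helper names are invented for illustration, and the sample response is hand-written to mimic the general shape of a CKAN reply rather than captured from a real server.

```python
import json
from urllib.parse import urlencode


def build_search_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build a CKAN Action API URL that searches datasets for `query`."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response_text: str) -> list[str]:
    """Extract dataset titles from the JSON body of a package_search reply."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    return [d.get("title", d.get("name", "")) for d in payload["result"]["results"]]


# Hand-written stand-in for a server reply (hypothetical dataset).
sample = json.dumps({
    "success": True,
    "result": {
        "count": 1,
        "results": [{"name": "rainfall-2012", "title": "Rainfall 2012"}],
    },
})

url = build_search_url("https://ckan.example.org/", "rainfall")
print(url)
print(dataset_titles(sample))
```

Keeping the URL building and the response parsing separate from the network call makes both easy to check without access to a live CKAN instance.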

As part of the JISC Managing Research Data programme, two projects, Orbital (University of Lincoln) and data.bris (University of Bristol), have adopted CKAN as a component of their institutional RDM solutions. The experience of both projects using CKAN was discussed at the workshop “CKAN for Research Data Management” that was held in London on 18th February 2013. In addition to representatives from both projects, members from the Open Knowledge Foundation and staff from other UK universities took part in the workshop. Summaries of the workshop can be found on the data.bris and Orbital project blogs.

During the workshop a requirements gathering exercise was started to investigate the wider needs of the RDM community and to match these needs against the functionality of CKAN. For this exercise, the following RDM roles were agreed: researcher, developer, curator/manager, re-user, IT support, and data subjects.

The requirements were gathered through expressions in the following format: “As a [RDM role], I want [what, requirement], so that [why, reason]”.

Below is a list of the requirements that we would hope to see met by an RDM solution, and that we have fed back into the requirements gathering exercise. CKAN already meets a number of these requirements.

| No | RDM role | Requirement | Reason |
| --- | --- | --- | --- |
| 1 | Curator / manager | Digital objects not to be stored within a database | So that where necessary / desirable preservation tools (e.g. JHOVE, DROID) can be used to continuously validate file integrity |
| 2 | Curator / manager | Ability to integrate with Research Information Systems (e.g. Pure) | To allow for efficiency of institutional processes, e.g. in relation to REF |
| 3 | Curator / manager | Usage metrics to be available | To support gathering of information on research impact, e.g. for REF |
| 4 | Curator / manager | The availability of records management functionality, e.g. for the administration of retention / life cycle management periods | To support institutional life cycle management processes |
| 5 | Curator / manager | Ability to link publications (in the publications repository) to associated datasets (in CKAN) | To allow for ease of access to publications and related data, to support transparency and openness |
| 6 | Curator / manager | Ability to draw metadata from Pure and/or to export metadata into Pure | To assist the integration of RDM solutions with the Research Information System |
| 7 | Curator / manager, researcher | Support of variety of academic subject-specific metadata standards | To ensure CKAN is useful and adaptable to a wide range of academic disciplines |
| 8 | Curator / manager, researcher | To keep track of versions of data | To allow for ease of tracking modifications made to individual files |
| 9 | Developer | CKAN to support a range of accepted protocols for metadata harvesting (e.g. OAI-PMH) | So that catalogues for a number of different data stores can be integrated and searched via a single point of access |
| 10 | Developer | To harvest metadata from Fedora Commons, possibly via OAI-PMH | CKAN can present data that is kept in other repositories |
| 11 | Developer | To use Fedora Commons as a FileStore (http://docs.ckan.org/en/ckan-1.8/filestore.html) | CKAN can access digital objects in Fedora (as an alternative to harvesting) |
| 12 | Developer | APIs to support the development of alternative methods of data ingest into CKAN and the development of tools for data analysis | Subject-specific RDM needs can be met |
| 13 | IT Support | CKAN to be able to support a number of different data and database structures | Subject-specific RDM needs can be supported and existing department-level RDM solutions can be integrated into an institutional CKAN RDM solution |
| 14 | IT Support | CKAN to support Shibboleth and other single sign on protocols | Institutional sign on mechanisms can be used to authenticate to CKAN |
| 15 | IT Support | CKAN to be able to integrate with institutional identity management (IDM) systems | Existing institutional IDM can be used to define roles and levels of access to individual datasets within CKAN |
| 16 | IT Support | The availability of a documented mechanism of running several customisable instances of CKAN from the same codebase | Institutional support for potentially a multitude of CKAN instances to cater for various subject-specific needs can be done efficiently |
| 17 | IT Support | The availability of maintenance agreements | So that institutions adopting CKAN can get expert technical support when needed |
| 18 | IT Support | Commitment from CKAN developers to keeping code base up-to-date | So that business continuity can be ensured |
| 19 | IT Support | Any necessary security fixes to be developed quickly | So that system security can be ensured |
| 20 | IT Support | Metadata extraction upon ingest (including subject-specific metadata standards, e.g. TEI & VRA) | To avoid, where possible, manual metadata entry (reduction of typos, efficient use of staff time) |
| 21 | IT Support | The ability to run admin reports (e.g. storage space used in individual collections; file types contained in individual collections, access to / usage of (parts of) individual collections) | To allow for efficient support provision (e.g. planning of storage requirements, charging, etc.) |
| 22 | IT Support | CKAN to be designed in a modular fashion | So that it is possible to select individual components of the software and to integrate these with existing systems and technical infrastructure |
| 23 | IT Support | The availability of relevant and accessible user documentation that can be modified for local use | To reduce the number of CKAN users contacting the IT Service Desk for advice on how to use the system |
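As an illustration of how rows like these map onto the “As a [RDM role], I want [requirement], so that [reason]” format used in the exercise, here is a small sketch. The `Requirement` class and its field names are invented for this example, and the wording of the sample row is lightly adapted from requirement 9 above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    number: int
    role: str    # one of the RDM roles agreed at the workshop
    want: str    # the requirement itself
    reason: str  # why the role needs it

    def as_story(self) -> str:
        """Render the requirement in the workshop's user-story format."""
        article = "an" if self.role[0].lower() in "aeiou" else "a"
        return f"As {article} {self.role}, I want {self.want}, so that {self.reason}."


harvesting = Requirement(
    number=9,
    role="developer",
    want="CKAN to support a range of accepted protocols for metadata harvesting (e.g. OAI-PMH)",
    reason=(
        "catalogues for a number of different data stores can be integrated "
        "and searched via a single point of access"
    ),
)

print(harvesting.as_story())
```

A requirement shared by several roles, such as numbers 7 and 8, would simply become one `Requirement` per role.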

All requirements gathered will be summarised by the Orbital project and will feed into wider discussions on how CKAN can be developed further to support institutional RDM processes better.