How much (more) research data do we have, and where do we store it?

In January and February 2013 we asked Computing Officers across the University to provide us with feedback about the storage requirements for research data within their respective schools.

15 out of 19 schools and 3 research centres responded to the survey. The format of the responses varied from a single summary per school or research centre to a compilation of forms completed by individual researchers. For a number of schools, Computing Officers pointed out that participation from researchers in the survey was low, for example 5 researchers out of 61 for one school, or 11 out of 30 for another. Given this response rate, the survey outcomes presented here inevitably cannot provide a full picture of the University’s research data storage needs.

After we received the feedback from Computing Officers we compared it to the outcome of the Data Asset Framework audits carried out in 2012, and where we had identified additional needs, these were added. As a final step we looked at the requests for research data storage that reached IT Services over the last year and at data that is already stored on existing Research Computing infrastructure. Where we found gaps, we added the figures to the storage requirements for each school.

The survey identified a storage need of 1.36 PB for active data with an annual increase of approximately 10%, and a long-term storage need of approximately 600 TB with an annual increase of 20%. The feedback that we received also suggests that most researchers do not currently differentiate between active data and long-term storage, which will almost certainly have had an effect on the figures presented here. The 1.36 PB of active data storage includes a requirement of 600 TB for working copies of data produced by a High Performance Computing Cluster in the School of Mathematics and Statistics and 150 TB of scratch space for HPC users.

The table below provides a breakdown of current data storage by the storage option used. It is clear that only a very small proportion of the University’s research data (0.14%) is stored on centrally provided systems. Worryingly, approximately 19% of research data is currently stored on staff computers, and a further 15% is on external storage media. Only 5% of research data is stored in the public cloud. None of the respondents used an external organisation or data centre to look after their data. Only one respondent indicated that some of their research data is stored on a privately owned computer at home. It is to be expected that the true figure for research data stored on home computers is much higher than the few GB in the category “other” that the survey identified.

| Storage type | Active data: current need | Active data: annual increase | Long-term storage: current need | Long-term storage: annual increase |
|---|---|---|---|---|
| ITS Central Filestore | 369.6 | 89.6 | 232.0 | 0.0 |
| ITS web servers | 2,066.0 | 1,219.5 | 0.0 | 1,000.0 |
| Networked file store (non-ITS) | 938,721.1 | 37,040.1 | 153,000.0 | 27,600.2 |
| non-ITS web servers | 39,510.0 | 19,542.0 | 50,004.5 | 16,500.0 |
| staff computers | 208,253.0 | 28,993.0 | 164,030.0 | 732.5 |
| external storage media | 114,635.0 | 38,292.0 | 181,361.0 | 47,812.5 |
| external organisation or datacentre | 0.0 | 0.0 | 0.0 | 0.0 |
| use of external cloud services | 54,593.0 | 1,273.8 | 50,000.0 | 0.0 |
| other | 32.0 | 6.2 | 0.2 | 0.0 |
| Total (in GB) | 1,358,179.7 | 126,456.2 | 598,627.7 | 93,645.2 |
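As a rough check, the percentages quoted above the table can be recomputed from the table's figures. The sketch below assumes that each percentage is the category's share of the combined current need (active data plus long-term storage); that reading is our assumption, as are the names `CURRENT_NEED_GB` and `combined_share`, which are purely illustrative.

```python
# Current need in GB, copied from the table above:
# (active data, long-term storage) per storage category.
CURRENT_NEED_GB = {
    "ITS Central Filestore": (369.6, 232.0),
    "ITS web servers": (2066.0, 0.0),
    "Networked file store (non-ITS)": (938721.1, 153000.0),
    "non-ITS web servers": (39510.0, 50004.5),
    "staff computers": (208253.0, 164030.0),
    "external storage media": (114635.0, 181361.0),
    "external organisation or datacentre": (0.0, 0.0),
    "use of external cloud services": (54593.0, 50000.0),
    "other": (32.0, 0.2),
}


def combined_share(categories):
    """Percentage of all data (active + long-term) held in the given categories.

    Assumption: shares are taken over the combined current need,
    not over active data alone.
    """
    total = sum(active + long_term for active, long_term in CURRENT_NEED_GB.values())
    part = sum(sum(CURRENT_NEED_GB[name]) for name in categories)
    return 100.0 * part / total


shares = {
    "centrally provided systems": combined_share(
        ["ITS Central Filestore", "ITS web servers"]
    ),
    "staff computers": combined_share(["staff computers"]),
    "external storage media": combined_share(["external storage media"]),
    "public cloud": combined_share(["use of external cloud services"]),
}

for label, percentage in shares.items():
    print(f"{label}: {percentage:.2f}%")
```

Under this assumption the computed shares round to the figures quoted in the text, which suggests the percentages refer to active and long-term data taken together.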

The charts that follow provide a breakdown of the various types of data storage across the different schools and research centres. It is clear from those charts that there are some rather significant gaps, and that the results presented here only provide an incomplete overview of the University’s research data storage requirement.

Some of the charts suggest that, compared to the storage needs in other schools and research centres, there is some especially data-intensive research in the School of Biology and in CREEM. While this is likely to be one factor, it is worth noting that responses from Biology and CREEM were much fuller than those received from other parts of the University. It is likely, therefore, that a higher proportion of their research data has been identified than in other schools and research centres.

IT Services central file store

ITS web servers

Networked file store (non-ITS)

The category for non-ITS networked file storage contains 600 TB of storage for working copies of HPC data (School of Mathematics and Statistics) and 150 TB of scratch space for HPC users (School of Chemistry). A further 62 TB for active data and 100 TB for long-term storage were added as a result of requests received by IT Services.

Non-ITS web servers

Staff computers

External storage media

External cloud services

Online Research Database Service (ORDS)

It is not uncommon for us to get requests from researchers for the setup of databases. Especially when such requests relate to unfunded research, we have often had to decline them. As a result, a number of research datasets have remained unpublished.

The problem of not having the resources to provide bespoke technical solutions to every research project that applies for our help is not unique to St Andrews. But then, not every project requires a bespoke solution. Over the past few years Oxford University IT Services undertook the VIDaaS (Virtual Infrastructure with Database as a Service) project to develop a technical solution that allows researchers to build and publish online databases quickly and without the need for programming skills.

VIDaaS runs on the DaaS (Database as a Service) software also developed by Oxford. DaaS is an online solution that provides users with a WYSIWYG (what you see is what you get) interface. DaaS allows for version control and, if necessary, it permits users to make changes to the database schema via a drag-and-drop mechanism. There is no need to manually modify the scripts or user interfaces that would normally have to change as a result of a change to the database schema. The DaaS software makes these changes automatically when the database schema is changed.

Oxford has been working on DaaS to support different types of database, including relational databases, XML databases and document databases. Plans also included the ability to upload MS Access databases and for DaaS to convert these into online PostgreSQL databases.

DaaS supports various levels of access restriction to the data. The ability to support institutional single sign-on mechanisms is being developed. Several members of research groups can be given permission to modify the project database.

The VIDaaS project came to an end in 2012, and since then Oxford University IT Services has migrated the DaaS software to a more secure technical environment.

The Online Research Database Service (ORDS) that is currently under development uses and builds on the outcome of the VIDaaS project. Oxford University IT Services is currently undertaking an ORDS maturity project with a view to both supporting researchers at Oxford and making the DaaS software available to other institutions. As part of the ORDS maturity project the Research Computing Service will be looking at the software and its features with a view to investigating the feasibility of providing the ORDS locally to researchers within the University of St Andrews.

New Research Computing posts

We are looking to recruit an Applications Developer (Research Computing) and a Research Computing Advisor to join the Research Computing team. Both posts will allow the service to meet increased requirements for support and to expand to cater for new needs.

Research Computing Advisor

The Research Computing Advisor plays a liaison role across the University that helps identify generic research support needs and that supports individual research projects. In addition, electronic research skills development and the management of the electronic component of research projects form important parts of this role.

To aid compliance with the RCUK RDM agenda, more generic infrastructure-type solutions are required that can be customised to meet specific academic needs. It is the role of the Research Computing Advisor to gather and monitor these requirements and to assist Applications Developers in the flexible implementation of technical solutions. After the end of such projects, requirements need to be monitored to ensure that support provision remains aligned with academic and funder expectations.

The Research Computing Advisor will allow the service to provide technical contributions to applications for funding from across the University. Such contributions consist of:

  1. researching suitable technical solutions,
  2. identifying the resources needed for the implementation of such solutions and for post-project maintenance of research outputs, and
  3. developing data management and technical project plans.

There is an increased demand for electronic research computing skills development as a result of the RDM agenda. There is also academic interest in developing digital methodologies that provide new forms of academic expression and/or allow new research questions to be asked.

Since 2011 the Research Computing Service has offered a successful student internship scheme, primarily for postgraduate (PG) students, who assist on different projects in exchange for the opportunity to develop new skills and gain experience. The Research Computing Advisor will help to expand these activities.

Applications Developer (Research Computing)

The Applications Developer (Research Computing) develops, tests and deploys technical solutions for both individual research projects and for more generic, infrastructure-type solutions. Furthermore, the role holder produces all relevant technical documentation and is responsible for versioning of developed software and for post-project maintenance to ensure security and that research outputs remain available online for as long as is needed.

Requirements placed on the Applications Developer are varied, ranging from the development of research databases to text encoding, metadata standards and the processing of geospatial information. There is also an increasing interest among researchers in making their research outcomes available via mobile apps.

Technical solutions must be designed in such a way that the academic requirements with which they interface are met. Examples of recent academic needs include the development of XML vocabularies to describe structurally diverse MSS in a way that allows consistent results to be retrieved when this data is manipulated, and the setup of mechanisms that allow for quantitative analysis of information contained in qualitative data.

Research data storage needs

Please help us plan our support provision for research in general and for funder requirements in relation to research data management in particular. For this reason we are looking to obtain an overview of research data storage needs within each School. At this stage we are looking for approximate cumulative figures of the volume of research data that is stored using different types of storage. Please provide an estimate for each type of storage used. We are only interested in research data, and not in any other type of data that is stored by researchers within your School.

We would be grateful at this stage to receive input from each School via Computing Officers or Data Managers to get an overview of current and emerging research data storage and management needs and of the types of support required from IT Services for such activities. Getting as accurate an overview of such needs from within each School as is possible will assist us greatly in the planning of Research Computing support that is available to staff within the University.

Please download the Research Data Survey and return it to Birgit Plietzsch (bp10 [AT] st-andrews.ac.uk). We would very much appreciate responses by Friday, 15th February 2013.

Definitions:

Data relates to any file that is stored on a computer.

Research data is data that is created or re-used by academics as part of research processes, and for which the institution and its researchers have the primary long-term responsibility.

Active data is data that is used in current research and that needs to be readily accessible without modification or reconstruction.

Long-term storage relates to data that needs to be preserved digitally and kept accessible in the long term. Data that is kept in long-term storage is potentially re-usable, but not necessarily immediately accessible or easy to use.

Service expansion

We’ve had an excellent start to the New Year and are well on the way to being able to expand our service, which is one of our strategic aims.

The University of St Andrews has approved two new permanent posts for the Research Computing team, and we will shortly be recruiting an Application Developer (Research Computing) and a Research Computing Advisor to join our team.

The Application Developer (Research Computing) will help to meet an increased demand from researchers within the University for bespoke online applications to support their research. The Research Computing Advisor will fill a liaison, training and project management role. The focus of both roles is to support research in general and to assist with research data management in particular.

We will make further details of both posts available via this blog shortly.