3D Laser Scanning: Seeking a New Standard in Documentation

I attended a one-day conference at the Royal Commission on the Ancient and Historical Monuments of Scotland (RCAHMS) in Edinburgh on 1 May 2013. It was organised by Emily Nimmo (RCAHMS) and the Digital Preservation Coalition (DPC). The topic of the conference was 3D laser scanning, with an emphasis on storing, reusing and archiving the data from scanners.

There was a good mix of speakers and informal panel sessions, with speakers coming from a variety of organisations. Between them, they covered the subject from different perspectives, much as an accurate point cloud is built up from scans taken from multiple angles.

My main reason for attending was not so much to learn about 3D laser scanning in particular, but to abstract more general issues about working with scientific research data. The Research Computing team is hoping to expand, to cover scientific, as well as humanities, research, so I wanted to pick up as much information as I could.

James Hepher, a surveyor and archaeologist from Historic Scotland, spoke about the challenges faced by those working on the Scottish Ten project. He was very recently back from scanning the Sydney Opera House. This involved a large team scanning the exterior and interior of the building, from the ground, from rigs on top of the sails, and by abseiling down the sides.

Five different scanners were used. These devices output raw data in different formats, which has to be merged to produce the point cloud. We were shown some amazing fly-throughs of the Opera House, of Sydney Harbour full of boats, and of the skyscrapers behind. The area was scanned at a resolution of 5mm, so one can imagine the volumes of data involved.
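To get a feel for the scale, here is a rough back-of-envelope calculation. The surface area and bytes-per-point figures are my own hypothetical assumptions, not figures from the survey itself:

```python
# Back-of-envelope estimate of point-cloud data volume at 5 mm resolution.
# The scanned surface area and bytes-per-point are hypothetical assumptions.
surface_area_m2 = 30_000      # assumed total scanned surface (hypothetical)
spacing_m = 0.005             # one point every 5 mm
bytes_per_point = 26          # e.g. XYZ as doubles plus a 16-bit intensity

points = surface_area_m2 / spacing_m ** 2
gigabytes = points * bytes_per_point / 1e9

print(f"{points:.2e} points, about {gigabytes:.0f} GB of raw points")
```

Even a modest assumed surface area yields over a billion points before any processing, registration copies or renderings are counted.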

English Heritage (EH) is also using 3D laser scanning to record important monuments and buildings such as Stonehenge, Harmondsworth Barn (aka the Cathedral of Middlesex) and Iron Bridge.

Stonehenge has recently been scanned, to a resolution of 0.5mm. This will allow a baseline to be taken of the condition of the stones. New scans in the future can then be used to measure any erosion and damage. This has parallels with a project on which I have been working, documenting the scenes in the frieze on Trajan’s Column. In the 19th century, plaster casts were taken of the column. These are now, in many ways, a more accurate source of information than the column itself, as the casts have been preserved inside, while the column has endured another century of Roman pollution.

Laser scans don’t always record colour very well, but in renderings of scans, colour is often used to indicate other properties of the objects being scanned. Images generated from the scan of Harmondsworth Barn showed up timbers damaged in a recent fire. Laser scans of buildings (and the spaces inside them) form an important part of Building Information Modelling (BIM). BIM is the construction of digital representations of buildings that can be used and contributed to by all parties involved, from drawing board to demolition. It works best for new builds, but can also be applied to 15th-century buildings, although factors such as costs and building codes and practices can only be guessed at in these cases.

Iron Bridge has been of interest ever since it was built in 1779, and in 1934 was declared an Ancient Monument. Various surveys have been taken, including a 3D model from 1999. Just as the bridge itself demonstrated the use of iron in bridge building, the 3D survey hoped to demonstrate the utility of new technologies. Most of the work was done using photogrammetry – a technique where points in space are calculated from multiple photos.
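As a toy illustration of the principle behind photogrammetry, here is a minimal 2D triangulation: two cameras at known positions each record the bearing to the same point, and its position is recovered by intersecting the two rays. Real photogrammetry works in 3D from many photographs with bundle adjustment; the numbers here are invented:

```python
import math

def triangulate_2d(baseline, angle_a, angle_b):
    """Intersect two bearing rays: camera A at the origin, camera B at
    (baseline, 0). Angles are bearings to the same point, in radians.
    A simplified 2D sketch of photogrammetric triangulation."""
    # Solve t*cos(a) - s*cos(b) = baseline and t*sin(a) - s*sin(b) = 0
    det = math.sin(angle_a - angle_b)
    t = -baseline * math.sin(angle_b) / det
    return (t * math.cos(angle_a), t * math.sin(angle_a))

# Two cameras 4 m apart both sight a point that is actually at (2, 3)
a = math.atan2(3, 2)     # bearing from camera A at (0, 0)
b = math.atan2(3, -2)    # bearing from camera B at (4, 0)
x, y = triangulate_2d(4.0, a, b)
print(x, y)              # recovers approximately (2.0, 3.0)
```

The accuracy of the result depends on how precisely the camera positions and bearings are known, which is why older analogue photos remain usable: the geometry they encode does not degrade.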

Technology in this area has improved so much that this recent survey is already out of date, and a new survey is needed. Interestingly, analogue photos of the bridge from many years ago are still reliable sources of information.

Laser scans can be used to help protect and conserve ancient monuments, by measuring changes and by providing insights into construction methods. But what about preserving the data from laser scans? Documenting and archiving 3D data was the subject of a talk by Catherine Hardman from the Archaeology Data Service (ADS).

3D modelling (both with laser scanning and photogrammetry) is a new and evolving area of research and practice. The ADS work extensively with 3D modelling of archaeological artefacts, so are keen to document the best methods for obtaining and retaining 3D data. Initially the ADS produced hard copies of the standards on which they were working. But, realising how often these standards were being revised, they switched to using a wiki.

Vast amounts of data are produced in 3D modelling, not least because there are many steps between the hardware (scanner or camera) and the final rendering of the model. Raw data from the device, which could be millions of points from a laser scanner or multiple high-resolution photos, has to be processed to produce the 3D model, and then there are many different ways to visualise the model.

The methods and workflows used to obtain the raw data need to be documented, so that future surveys can follow the same methodology. Given the number of steps involved, decisions must be made about which stages’ data to keep (and so, which stages to redo as needed). There was a lively discussion about how much data and metadata was needed, and when it becomes overkill.

The basic principles of digital archiving apply – use plain text formats, make sure they are open and non-proprietary and don’t compress data. One problem with laser scan data is that it may not be possible to losslessly convert raw binary data to ASCII. And one may not have the luxury of being able to keep both.
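A small sketch of why the binary-to-ASCII issue bites: a double-precision coordinate exported with a fixed number of decimal places does not survive the round trip, whereas printing enough significant digits (as Python's repr does for doubles) can. This illustrates the general point rather than any particular scanner format:

```python
import struct

# A coordinate as it might sit in a binary scan file (IEEE 754 double)
raw = struct.pack('<d', 0.1 + 0.2)
value = struct.unpack('<d', raw)[0]

# A naive ASCII export with 6 decimal places is lossy
ascii_6dp = f"{value:.6f}"
assert float(ascii_6dp) != value

# Emitting enough significant digits round-trips the double exactly
ascii_full = repr(value)
assert float(ascii_full) == value
```

Even when a lossless text encoding exists, the text files can be several times larger than the binary originals, which is where "one may not have the luxury of keeping both" starts to bite.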

Archiving workflows requires a means of formally describing each step in the process, so that someone else can faithfully reproduce it later. Software used in the process may also need to be archived, requiring emulation, virtual machines or hardware preservation.
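One lightweight way to describe each step is a machine-readable provenance record. The field and tool names in this sketch are invented, purely to illustrate the kind of information worth capturing:

```python
import json

# Hypothetical provenance record for one step of a scan-processing
# workflow; field and tool names are illustrative, not from any standard.
step = {
    "step": "register_scans",
    "tool": "example-registration-tool",   # hypothetical tool name
    "version": "2.1.0",
    "parameters": {"max_error_mm": 2.0, "iterations": 50},
    "inputs": ["scan_north.e57", "scan_south.e57"],
    "outputs": ["merged_cloud.e57"],
}
manifest = json.dumps(step, indent=2)
print(manifest)
```

Recording the tool version and parameters alongside inputs and outputs is what makes it feasible to discard an intermediate stage and regenerate it later, rather than archiving every byte.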

Some of the participants who create 3D data felt that the metadata requirements of the ADS were overly onerous, and were an obstacle to depositing data with them. It was suggested that the ADS provide 3D data with metadata in varying amounts and qualities to other archives, to see what level of metadata is sufficient for 3D data.

An argument for erring on the side of caution when collecting metadata is that the quality of one’s data is equal to the quality of one’s metadata. And one doesn’t know the true value (or range of re-uses) of one’s data at the point of creation or deposition in an archive.

Although much has been made of the lack of a non-proprietary file format for 3D data, there might be a solution. Faraz Ravi (Bentley Systems, and chairman of the ASTM Committee E57 on 3D Imaging Systems) presented the E57 format. It has been developed for interoperability. If one wants to losslessly convert from one proprietary file format to another, E57 can be used as the intermediate file format. Faraz made it clear that E57 is not designed to be a working format, nor was it designed to be an archival format. It was designed to be an open interchange format.

In the E57 format, metadata is encoded as XML. The 3D data is mostly stored as binary, to make it more compact. Source code for programs (with an open source license) which implement the format can be downloaded. The format is also supported by many of the major software vendors in the industry.
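The combination of XML metadata and a compact binary payload can be illustrated with a toy file layout. To be clear, this is not the actual E57 structure (the real specification defines paged binary sections, checksums and much more); it only demonstrates the general idea:

```python
import struct
import xml.etree.ElementTree as ET

# Toy illustration of XML metadata alongside binary point records.
# NOT the real E57 layout, just the general design idea.
points = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]

meta = ET.Element("pointCloud", count=str(len(points)), fields="x y z")
xml_bytes = ET.tostring(meta)
binary = b"".join(struct.pack("<3d", *p) for p in points)
blob = xml_bytes + binary

# Reading it back: parse the XML, then decode the fixed-size records
header = ET.fromstring(blob[:len(xml_bytes)])
n = int(header.get("count"))
decoded = [struct.unpack_from("<3d", blob, len(xml_bytes) + 24 * i)
           for i in range(n)]
print(decoded)
```

The design choice is the interesting part: the metadata stays human-readable and extensible, while the bulk of the data stays compact.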

A major feature of E57 is its extensibility. Similar to the TIFF format, extensions specific to particular hardware can be incorporated without invalidating the files. For some people this is a strength. For others, mainly concerned with archiving such files, this can lead to problems similar to those encountered when working with TIFFs from multiple sources. Different pieces of software have the ability to read and write different extensions. TIFFs with extensions need to be interrogated, to see what exactly they contain.

It was suggested that, as with TIFFs, a baseline E57 format could be specified, which would be recommended for archiving files of this format.

A talk by Joe Beeching of 3D Laser Mapping Ltd illustrated the complexities of documenting and making sense of data from scanners. Scans can be taken from aircraft, or from cars, or from handheld devices wobbling on the end of a spring (working title: the Wobulator). To get a point cloud, one must not only know the distance from the scanner to each point on the object, but also the location and orientation of the scanner.
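Placing the points is essentially a rigid-body transform: each scanner-local point is rotated by the scanner's orientation and translated by its position. A minimal sketch, reduced to a rotation about the vertical axis (real systems use a full six-degree-of-freedom pose, often from GPS and inertial sensors):

```python
import math

def to_world(point, scanner_pos, yaw):
    """Transform a scanner-local point into world coordinates given the
    scanner's position and heading (rotation about the vertical axis).
    A sketch only: real systems use a full 6-DoF pose."""
    x, y, z = point
    c, s = math.cos(yaw), math.sin(yaw)
    wx = scanner_pos[0] + c * x - s * y
    wy = scanner_pos[1] + s * x + c * y
    wz = scanner_pos[2] + z
    return (wx, wy, wz)

# A point 10 m straight ahead of a scanner at (100, 50, 2) facing 90 degrees
p = to_world((10.0, 0.0, 0.0), (100.0, 50.0, 2.0), math.radians(90))
print(p)   # approximately (100.0, 60.0, 2.0)
```

Any error in the recorded pose shifts every point in that scan, which is why the scanner's location and orientation are as much a part of the data as the range measurements themselves.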

One interesting feature of some scanners is the ability to pick up multiple returns from a scan. A laser beam can pass through vegetation and the scanner can pick up both the external vegetation and the object underneath (e.g. plants on a cliff face). With software, one can make the vegetation visible or hidden.
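Filtering by return number is one common way to separate vegetation from the surface beneath it. The record layout below is illustrative rather than taken from any particular format:

```python
# Each laser pulse may yield several returns; a point typically carries
# its return number and the pulse's total return count. A last return
# through vegetation often comes from the surface underneath.
# Field names here are illustrative, not tied to a specific format.
points = [
    {"z": 12.4, "return_num": 1, "num_returns": 2},  # foliage
    {"z": 10.1, "return_num": 2, "num_returns": 2},  # cliff face below
    {"z": 10.3, "return_num": 1, "num_returns": 1},  # bare rock
]

# Hide vegetation: keep only the last return of each pulse
surface = [p for p in points if p["return_num"] == p["num_returns"]]
print([p["z"] for p in surface])   # [10.1, 10.3]
```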

Steven Ramsey of Leica Geosystems Ltd had some eye-opening anecdotes from his years working in the scanning and surveying industry. There was a time when each project’s scans could fit on a floppy disk. Now each project needs its own RAID array. Before people used scanners, a surveyor could record 2000 points in a day. Now a scanner can record a million points in a second. This triggered an interesting conversation about what constituted a point cloud, and when a collection of points became a cloud. There is no real answer to this, though I felt that at some point, from the sheer quantity of points, new qualities of the object being scanned emerge.

This was an informative day, spent in the company of some knowledgeable people. I picked up some valuable insights into workflows involving the output from digital instruments, and questions to ask when looking at workflows from an archiving perspective. Is the raw data in a proprietary or binary format? Is there data loss when converting to ASCII? What are the different stages? Do the outputs of different stages need archiving too? Can the workflow be documented, so that experiments can be re-run with different input data?

Funding Research Data Management

What follows is a summary of the Research Data Management Forum special event on Funding RDM services which took place at Lakeside Conference Centre, Aston University on 25th April 2013.

During the morning session there were four presentations:

  • Caretakers of the Present, Guardians of the Future: The digital Research Data Challenge (Jeff Haywood, University of Edinburgh)
  • Research Facilities and Equipment Sharing database developments / DataPool – Collaboration as Foundation for Sustainability (Adrian Cox & Wendy White, University of Southampton)
  • Research Data Management at Oxford Brookes (Sarah Taylor, Research Support Manager, Oxford Brookes University)
  • Ongoing RDM Support at the University of Bristol (Stephen Gray, Project Manager data.bris)

During the afternoon a panel session of funders with representatives from BBSRC, EPSRC, MRC, NERC, STFC, and the Wellcome Trust took place. The slides for the above presentations and the questions put to research funders are available from the Digital Curation Centre web site.

Caretakers of the Present, Guardians of the Future: The digital Research Data Challenge

Research data can mean different things, such as scratch data (e.g. raw data that is collected straight from a machine), analysed data, and final data sets that represent underlying data for publications. Each researcher’s collection of data is liable to contain a mixture of the different data types.

Over the past years the University of Edinburgh has provided the following Research Computing Services: data services, storage, sharing, and curation. Haywood talked about the difficulty in scoping such services. As part of the development of RDM services a number of issues had to be addressed, such as inadequate storage provision, a lack of formal institutional policies for the creation and management of data, and a lack of training and best practice guidance for researchers on RDM.

The initial move of the University to set up and develop RDM services was not triggered by funder requirements but was based on internal motivators that recognised the value of such services to research. Within the context of an increasing formalisation of RDM from funders, Haywood expects the relative importance of UK funders to diminish over the next years in favour of European and other international funders. This will increase the complexity of the RDM policy frameworks that institutions will have to comply with.

Edinburgh currently provides researchers with 0.5TB of data storage by default and is expecting to increase this to 1TB. As part of JISC-funded projects it has developed the Research Mantra RDM training suite and the DMPonline data management planning tool. Both are available free of charge to researchers in other institutions.

Over the next couple of years the University of Edinburgh plans to invest at least £2m on its development of research data services.

Research Facilities and Equipment Sharing database developments / DataPool – Collaboration as Foundation for Sustainability

Adrian Cox talked about the EPSRC-funded UNIQUIP project, a national equipment database that provided a mechanism for an effective national agenda for sharing (http://equipment.data.ac.uk). This project is now part of the data.ac.uk portal that is aimed at sharing different types of data, including research data.

Wendy White continued with the DataPool project and the development of shared, scalable and sustainable services. Southampton had developed Tweepository, a Twitter archiving service. As with any other service, the long-term institutional commitment to Tweepository depends on use and take-up. White went on to talk about institutional policies and underlined the need for exemplars of the innovative use of linked / open data to help with establishing data retention guidance.

Research Data Management at Oxford Brookes

Sarah Taylor spoke about the experience of developing RDM services at Oxford Brookes University. The University had formed part of the Digital Curation Centre Institutional Engagement project. Until two years ago IT support in Oxford Brookes had been much devolved. Since then a new faculty structure had been implemented and centralised IT support provision introduced. Oxford Brookes agreed their institutional RDM policy in February 2013.

Ongoing RDM Support at the University of Bristol

Within the University of Bristol there are 2,288 researchers submitting approximately 1,700 applications for funding per annum. 42% of submitted applications are successful, currently generating a research income of £106m.

The development of RDM services in Bristol builds on a £2m investment in research data storage infrastructure made prior to the RCUK RDM agenda. As part of standard service provision, each researcher can apply for 5TB of storage free of charge. RDM infrastructure was developed by the data.bris project, which, in addition to university resources, had received £250k of JISC funding and a further £65k from EPSRC.

In Bristol RDM support is split over three units: IT Services, the Library, and Research, Enterprise and Development (RED) services. Prior to the start of the data.bris project there was a 0.2FTE digital humanities support officer within IT Services and ad-hoc research computing work carried out in individual academic departments. Within the Library there was 0.2FTE support for a small institutional repository. RED provided assistance for research funding applications at pre-award stage only, as well as training in grant writing, which excluded RDM training.

The data.bris project developed RDM training provision, software to deposit data, and an RDM policy that is similar to Edinburgh’s policy. Bristol also adopted the RDM principles centred on excellence and impact, world-class infrastructure, skills and knowledge, integrity and professionalism, and leadership and collaboration that are set out in Monash University’s Research Data Management Strategy and Strategic Plan 2012-2015.

An additional outcome of the data.bris project was a successful business case to the university. As a result, the University of Bristol has approved a pilot data service project running from 1 August 2013 until 31 July 2015, during which the requirements for permanent service provision will be established. The project is going to be resourced with 1 director, 3 subject data librarians, and 1 technical support staff member. While the data.bris project was owned by IT Services working with the Library and RED, responsibility for RDM service provision during the pilot project lies with the Library in collaboration with IT Services.

Panel session

The afternoon session comprised a panel discussion with representatives from BBSRC, EPSRC, MRC, NERC, STFC, and the Wellcome Trust. Attendants had been asked to submit questions to funders in advance. Questions had been categorised as follows:

  • A. What RDM services and infrastructure are in scope?
  • B. Open Access and data sharing
  • C. Specific and allowable RDM cost elements
  • D. Long term storage, preservation and archiving
  • E. What is the application process to recover the cost of RDM services?
  • F. Guidance and support
  • G. Compliance

A. What RDM services and infrastructure are in scope?

RDM services and infrastructure are within the scope of two RCUK Common Principles on Data Policy, principles 2 and 7:

  • Principle 7: “It is appropriate to use public funds to support the management and sharing of publicly-funded research data. To maximise the research benefit which can be gained from limited budgets, the mechanisms for these activities should be both efficient and cost-effective in the use of public funds.” There are two ways of looking at the statement of wanting to maximise the research benefits from the use of such funds: Firstly such money should be used efficiently, and secondly funders are looking for research benefits where money is spent.
  • Principle 2: “Institutional and project specific data management policies and plans should be in accordance with relevant standards and community best practice. Data with acknowledged long-term value should be preserved and remain accessible and usable for future research.” This means that not all data should remain accessible or be preserved.

There is a need to distinguish between the costs that are incurred during a project and those that arise afterwards. Costs incurred during a project should be included in the direct costs for a project, including hardware, staff, expenses, etc. RDM costs will include preparation of the data for access and curation (incl. the addition of metadata).

After the project there are three possible cases:

  • Provision through the research funder:
    • Some research councils (NERC, ESRC) have their own data centres.
    • Some disciplines have their own disciplinary repositories.
    • In STFC’s programme, big projects (e.g. the Large Hadron Collider) already have an infrastructure in place that can be used by other projects.
      Where such services are available, grant holders are expected to use them. MRC have a mixed model, where for some areas of the programme such services exist.
  • Provision through the research organisation: When the institution is going to provide a service in the long term, the obvious way to finance it is through full economic costing (FEC).
  • Outsourcing to a third party: If a third party is providing the service, there is the need for due diligence to be done by the institution.

The principles under which the data is going to be made available are expected to be explained in the Data Management Plan (DMP) at the project level. The institutional infrastructure that is provided for making data available is expected to be outlined within an institutional plan, policy or operational document. Researchers need to make it clear within the Justification of Resources of their application for funding what exactly it is that they expect funders to pay. Justifications of Resources should, where possible, separate out the following RDM cost elements: cost of collecting data, the cost of curating data, the cost of analysing data, the cost of preservation and sharing.

As concerns funding, there are some differences in the way the Wellcome Trust operates. Owing to its charity status, the Wellcome Trust in general only pays directly incurred costs.

Acknowledged long-term value

Acknowledged long-term value involves a value judgement that needs to be made on a case-by-case basis. Research funders ask for the justification of what data to keep to be made as part of the DMP.

Funders acknowledge that researchers in all areas of Science need to be selective. It is necessary to obtain experience to help with these judgements. Such experience can be gained through the peer review process that assesses the likely research benefits to be gained. Furthermore, researchers can develop their own experience by keeping a record of who has asked for data, what kind of data was provided, and what publications and further outputs arose from that activity.

EPSRC finds it challenging to come up with a one-size-fits-all solution for RDM. EPSRC policy states that apart from data that underpins publications, which must be kept, it is up to the researchers to decide what to keep. If a researcher decides that something should be kept, then this data should be stored properly. Once a decision to keep data has been made, the policy states that the data should be kept for 10 years. EPSRC assumes that it does not cost much more to store data for 10 years than it costs to store it in the short term. If in that 10-year period the data has not been accessed, questions need to be asked as to whether this data should really be kept. If people have accessed the data, the 10-year period starts anew. There is obvious data (e.g. climate readings) that will need to be kept in the long term regardless of whether or not such data has been accessed during the 10-year period. It is not up to EPSRC to determine at the level of each discipline what to support. It is the researcher who knows the data best and who needs to decide whether it has long-term value.

B. Open Access and Data Sharing

Research data is like any other cost of a grant. If the cost is incurred during the grant period, it counts like any other direct cost. Every good application for funding should in its overall general work plan have time set aside for dealing with data management activities.

Making data open access incurs a cost that is separate from APCs.

Funders point out that research data should be in a similar position to publications as concerns access. From MRC’s perspective, however, researchers are not ready to have their data put up for anybody to look at. This is due to the complexity of data and the need for confidentiality. The research team who created the data needs to be involved in the process of preparing the data for publication. MRC could not afford to put up all the data from 60-80 years of medical research for anybody to use. Such a job is not achievable at the moment.

EPSRC does not require a DMP as part of the application, but does expect every funded project to have one as part of good research practice; the DMP itself is not subject to peer review. In EPSRC applications the Management Section within the Case for Support should increasingly be used to include references to data and its uses. There is a possible move towards formalising this aspect and in this way including data management within the peer review process. Jeff Haywood pointed out that EPSRC’s approach is, from an institutional perspective, inconsistent with the requirements of other funders.

C. Specific and allowable RDM cost elements

The research community should develop a better understanding of the importance of using metadata to describe their datasets. Advocacy for metadata is considered part of normal institutional service provision and should not be funded by Research Councils. Universities have a role to ensure that custodians of data have a sufficient understanding of metadata.

It is not possible to include RDM costs in applications for funding based on a rule-of-thumb estimate. For some projects, RDM may be within institutional provision, in other projects RDM may be the majority of the cost. The cost of RDM is project-specific and entirely depends on the type of work.

Within NERC the cost of RDM represents approximately 2% of the overall cost of funding, but the distribution of this cost varies greatly from project to project.

D. Long term storage, preservation and archiving

Long term storage, preservation and archiving costs are part of the overhead of each HEI and should be met by research funders. If this cost cannot be covered through the direct cost as part of the project, then indirect and quality-related costs are considered obvious funding streams.

EPSRC expects institutions to call upon these sources of funding and to set aside funds to ensure that institutions have good data management infrastructure in place to support researchers. It is up to the institution to decide whether RDM infrastructure is provided in-house or via a third party. An organisation that sets itself up as a research organisation needs to take on board the importance and necessity of providing data management infrastructure to researchers as part of its central provisioning.

The pay-once-store-forever (POSF) model is a difficult concept for funders. The assumption made by the POSF model is storage over 20 years or forever, whichever comes first. Although most of the curation costs are upfront costs, if data needs to be kept in the long term, there is likely to be an ongoing cost for storage and access.

The cost of storage increases where multiple copies of data need to be kept, unless it is acceptable to allow data to decay. According to funders, judgements need to be made about opportunities and risks. When managing a large resource, some parts will receive a lot of attention and others less. It is about taking a proportionate approach to opportunities and risks.

EPSRC point out that RDM takes place within the context of the UK legislative framework (e.g. FoI). Public bodies such as universities are under an obligation to make data available unless exemptions apply. Therefore, organisations need to know what data they hold. A process of appraising whether or not to keep data is a valid part of institutional policy. There should be a cost / benefit analysis: if the cost of appraisal is higher than that of just keeping the data, then there is no benefit in the appraisal process.

E. What is the application process to recover the cost of RDM services?

The question was asked whether institutions can set up small research facilities to recover the cost of RDM services. It was pointed out that there is already a model for doing this in the provision of HPC services. Simon Hodgson clarified that the origin of small research facilities lies in the TRAC terminology. Small research facilities are chargeable above cost, i.e. not as part of the indirect cost nor as part of the university infrastructure. Instead, they are chargeable as a specific infrastructure that can be charged against a research grant. A small research facility is provided by the university but it is used by research groups.

A small research facility needs to be approved by funders as such. For example, such a facility might be possible in the area of big data, if big data is generated in a new way. Added value needs to be shown. Funders recognise that we are on a journey, and that it is not impossible to make the case. The added value would have to be demonstrated to funders and to the research community. If a facility was funded, it would be out of the research budget, possibly as a cross-institutional service. A small research facility needs to be very close to Science. It is about creating highly specialised services.

F. Guidance and support

RCUK is developing guidance for RDM. It is hoped that this guidance will become available within a few months.

G. Compliance

Councils hope that universities would want to take on the development of support for RDM and prove that they are doing this well. After all, RDM is about producing good research. In Bristol the EPSRC policy has carried some weight, but there were other drivers. RDM brings kudos.

EPSRC does not want to act as a police officer. There are expectations that organisations have good research management practices in place. If an organisation becomes obviously non-compliant, or if there are systemic failures in RDM, then the organisation risks losing its eligibility for research funding.

EPSRC does not expect to have to go out and check institutional compliance. A case could be made that there could be cross-council visits to institutions rather than just the EPSRC checking compliance with their own policies.

MRC does not believe that councils should ask institutions lots of questions, but there are areas, e.g. security, where a few intelligent questions could be asked to check compliance. More concern is over the compliance of Principal Investigators with MRC’s policy to share data. It was felt that we are still not realising the benefits of sharing to the extent that we should. This is where peer reviewers can help, because they do understand the data. The MRC does not want to use the compliance argument too much, but would prefer to argue that RDM is good for outputs and good for collaboration.