
Discussion on standardising and sharing data among Arabic projects

I attended the 2013 Deutscher Orientalistentag in Münster to co-present a talk on the Arab Cultural Semantics in Transition (ARSEM) project, along with the principal investigator, Kirill Dmitriev. My half of the presentation, available here, focused on the technical aspects of the project.

Our presentation clashed with the start of the Digital Humanities (New methods of text analysis in Arabic and Islamic studies) panel, but I was still able to make it to most of the talks and the discussion afterwards.

The presenters in the Digital Humanities panel were:

The discussion, led by Andreas Kaplony, focused on the need for the diverse range of projects represented in the room to agree on standards. He told the story of how Hellenistic scholars had already done this, by sitting together for an afternoon.

Examples of what can and should be agreed upon include when centuries start and stop (did the 20th century end on 31 December 1999 or 31 December 2000?), citation standards and more domain-specific issues such as transliteration from Arabic to Latin script (DIN 31635, Hans Wehr, Buckwalter etc.).

These aren’t problems which affect human readers much – if accuracy to the year is important, it can often be inferred from context, and we are fairly forgiving when it comes to citation styles. And from what I understand, scholars can usually read many different Romanised forms of Arabic.

But computers are more simple-minded, and need to be told explicitly which date range to search within or which transliteration method has been used.

It was suggested that many of these types of issue could be resolved by using a schema to describe the standards used. Academics wouldn’t have to all use the same standards, as long as they documented the standards that they did use. As has been said, “the nice thing about standards is that you have so many to choose from”.

Less equivocal meanings could then be obtained for fuzzy phrases such as “eighth century”. Crosswalks could translate data from one standard to another.

This solves some issues, but a crosswalk won’t always be available for converting transliterated text between methods (without access to the original vocalised Arabic). Some Romanisation methods are lossy, and not all lose the same information.
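
To make the idea concrete, here is a minimal sketch of how a project might declare the standards it uses and how a consumer could then resolve a fuzzy phrase like “eighth century” against that declaration. The field names and the metadata block are hypothetical, not taken from any of the projects mentioned above.

```python
# A minimal sketch of "document the standards you use": each dataset ships a
# small metadata block naming its conventions, and consumers interpret values
# against it. The field names here are hypothetical, not from any real schema.

DATASET_METADATA = {
    "transliteration": "DIN 31635",      # could equally be "Hans Wehr", "Buckwalter", ...
    "calendar": "Gregorian",
    # Convention question from above: does the Nth century end in year N*100 or N*100 - 1?
    "century_convention": "ends_in_00",  # i.e. the 20th century ended on 31 Dec 2000
}

def century_to_years(century: int, convention: str) -> tuple:
    """Turn a fuzzy phrase like 'eighth century' (century=8) into an explicit
    year range, according to the declared convention."""
    if convention == "ends_in_00":
        return (century - 1) * 100 + 1, century * 100      # 701-800
    if convention == "ends_in_99":
        return (century - 1) * 100, century * 100 - 1      # 700-799
    raise ValueError(f"Unknown century convention: {convention}")

start, end = century_to_years(8, DATASET_METADATA["century_convention"])
print(f"'Eighth century' here means {start}-{end}")   # 701-800
```

A crosswalk between two projects would then amount to reading each project’s declaration and mapping values from one convention to the other, which works for dates and citations but, as noted above, not always for lossy transliterations.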

One of the main motivations for agreeing on standards is to facilitate the sharing of data. Although most people in the room were in favour of opening their research data, there wasn’t agreement on when in a project’s life cycle this should be done. To some, one’s data represented hard work and reputation, and wasn’t to be given away before the rewards had been obtained. Others were more confident that attribution could be preserved and a wide dissemination and use of one’s data would not be a waste of one’s effort.

Guidelines which apply to data accompanying published articles don’t necessarily apply to a long research project. Which, if any, milestones in a research project are analogous to publication and might mandate the opening of one’s research data?

Currently, email is used to share data on an ad hoc basis between researchers, and trust is a key component here. Another suggested alternative to simply offering up one’s data for download was to make it available via an API. This would control access, and would also mean that the data being made available was processed rather than raw.

One thing which would be possible if researchers opened their data up via an API would be federated searches across multiple databases. Many projects are producing Arabic dictionaries focusing on different uses of Arabic in different periods of time and in different places, e.g. merchants’ shopping lists from the middle ages, Arabic words used in Occitan medical books or pre-Islamic poetry. With an API, it would be possible to search across projects using a single interface.
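
As a sketch of what such federation might look like, a single query could simply be fanned out to each project’s API and the results merged. The endpoint URLs and the JSON response shape below are invented for illustration; none of the projects mentioned here actually exposes this interface.

```python
# A hypothetical federated search: one query fanned out to several project APIs.
# The endpoint URLs and the JSON response shape are invented for illustration.
import concurrent.futures
import requests

PROJECT_APIS = [
    "https://example.org/arabic-dictionary-a/api/search",
    "https://example.org/arabic-dictionary-b/api/search",
]

def search_one(endpoint: str, term: str) -> list:
    """Query one project's API; assume it returns a JSON list of matching entries."""
    response = requests.get(endpoint, params={"q": term}, timeout=10)
    response.raise_for_status()
    return [{"source": endpoint, **hit} for hit in response.json()]

def federated_search(term: str) -> list:
    """Run the same query against every project and merge the results."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        per_project = list(pool.map(lambda url: search_one(url, term), PROJECT_APIS))
    return [hit for hits in per_project for hit in hits]

# federated_search("qalb") would return matches from every participating project.
```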

Each project, however, focuses on different features of the language, and has different strengths and weaknesses. But hopefully enough data will be common across projects to make federated search a useful facility.

In addition to the projects linked to above, Kirill has been compiling a list of related projects.

With the number of projects I’m working on that have a large Arabic component increasing, it was very useful to learn more about the subject.

Scottish Health Informatics Programme (SHIP) Conference

The Scottish Health Informatics Programme (SHIP) conference was held at the University of St Andrews on 28–30 August 2013. The conference provided a forum to discuss the use of linked electronic healthcare datasets. Attendees came predominantly from Scotland and Australia (with even some Scots working in Australia to confuse us all!). The cover of the book of abstracts was a clever word cloud created from all the abstracts submitted; ‘Data’, ‘Linkage’ and ‘Health’ featured very prominently, with ‘Electronic’, ‘Administrative’, ‘Care’, ‘Research’ and ‘Governance’ following closely. The conference was vast, with seven parallel sessions to choose from. I tried to attend as diverse a range of sessions as I could to get a complete picture of linked healthcare data, and this is reflected in my summary of the event.

The opening address, entitled “Big Data meets Healthcare: the case for comparability and consistency”, was given by Christopher Chute, Professor of Biomedical Informatics at the Mayo Clinic. He talked about the “Chasm of Semantic Despair” and how standards can often feel like “lead in shoes”.

Following on from this I attended a session entitled “Data transparency, access and public engagement…”. Throughout the session the key message became obvious: the data is there, but researchers can’t always gain access to it in a timely manner. One speaker explained how her Canadian study showed that a funding lifecycle can be around two years, yet it can take 1–18 months to obtain the data required to complete the study. She strongly questioned the risks of building a career on administrative health data. Speakers spoke of “the process” involved in obtaining data: the application, where the data requested is matched against the research questions; the review, where ethics, risks, privacy and legislation are all carefully examined, followed by adjudication and grant approval; and finally data preparation, with all the complexities involved in data extraction, linkage and so on.

Interestingly (or perhaps not), a theme emerged time and time again, not only during this session but throughout the entire conference: it is not the technical components that are causing a headache, but red tape and extensive, overlapping review processes. One speaker from Australia shared that 30 individual approvals were required for their Population Health Research Network. As a technical person I am always conscious that emerging technologies provide useful tools for researchers, enabling them to focus on their research rather than having to work around technology limitations; however, this was not the take-home message. Has too much time been spent focusing on the exciting, challenging technical intricacies, and has data governance been neglected as a result? Is the fear of getting it wrong, and the resulting belt-and-braces approach, stifling research creativity and discovery?

Access platforms were discussed, as were the complexities of web-based delivery of such datasets. Emerging platforms include SURE (Secure Unified Research Environments) in Australia and SAIL (the Secure Anonymised Information Linkage Databank) at Swansea University. Both offer (albeit with a little healthy competitiveness) secure computing environments that researchers can log into remotely to access linked datasets.

SAIL and SURE were repeatedly mentioned as real solutions to the problem of datasets that are anonymised at source becoming re-identifiable through data linkage. Coming from a spatial background, I was struck by one clever feature of SURE: the ability to use de-identified and randomised spatial data to obtain new data within SURE, with the results reconnected to the initial data without any breaches of confidentiality. Essentially, SURE carries out all the linkages behind the scenes, resulting in an enhanced spatial dataset without any of the complexities of linking the data.

SeRP, the Secure eResearch Platform (also from Swansea University), was described in detail by Simon Thompson as SAIL, only bigger and more generic. It is due to go live in March 2014 and has a very impressive (self-provisioning) menu on offer, including SQL Server 2012, Hadoop, an R cluster, an LSM server, dataset management, free-text handling, service management and so on. When questioned on the costs, no clear figures were offered, just a reference to the platform being publicly funded and a comment that it would be a great shame if the setup were not used. But I get the sense that this service will be in huge demand once its capabilities are more widely advertised.

Other presentations explored the work involved in sharing and reusing data. Amongst the researchers there was a fear that data quality will decline unless research data outputs are treated as assets and financially supported. Funders are currently asking researchers to share their data, but without any financial support or incentives. There was some discussion of the Wellcome Trust looking into this further, which was met with interest.

One very inspirational keynote speaker was Professor Ian Deary from the University of Edinburgh. His talk was entitled “Reusing historical data: the Scottish Mental Surveys of 1932 and 1947”. These surveys were carried out nationally to assess the mental ability of Scottish children. He talked about rediscovering the paper ledgers, with his colleague Professor Lawrence Whalley, in storage in the basement of the Scottish Council for Research in Education in Glasgow, and how cohorts were subsequently set up to reassess some of the participants, now in their 80s and 90s. It was all very exciting: so many influential studies have been made possible as a direct result of this rediscovery and reuse of the data that he was only able to focus on the top ten peer-reviewed outputs.

The final session I attended was entitled “The role of computer science in e-Health research”. This was a special session and I hoped to walk away with a clear vision of the future of the two disciplines; however, this was not the case. Nobody can doubt the role computer science has to play in deciphering all this administrative healthcare data, but the biggest limiting factor with regard to the data is governance.

If I were to make my own word cloud for the diverse sessions I attended, I think I would put ‘Governance’ ten times bigger than any other word. One thing was very clear: there are some real cutting-edge and important studies being carried out using linked healthcare data, both nationally and internationally. These studies demand, and will continue to demand, access to elastic computation capabilities including storage, assistance with anonymisation and encryption, secure data transportation, access control and reliable record linkage, but most importantly they require immediate verification of compliance with data governance.

Is enough time being devoted to ‘cryptographic security’ – addressing the standards around the legal, political and ethical protocols for using administrative datasets in research? Are funds being appropriately spent on enhancing these studies? Has too much been invested in fixing data programmatically once it has been collected? Should those inputting the data be trained and made aware of how the data they input may be used in the future? Are those who consent to such datasets being gathered fully informed of the intended usage? For researchers to harness the benefits of these datasets more fully, it is important that these questions and the challenges they represent are addressed.

The Trajan’s Column Project

[Photograph of the column frieze, figures 052.8–053.42]

Trajan’s Column, a 35m-tall structure in Rome, is an important historical object that is decorated from top to bottom with a series of elaborate engravings. The Trajan’s Column project aimed to create a digital representation of the column with a searchable online database. Other digital databases of the column already exist but the novel aspect of this project is the exhaustive cataloguing of each individual figure. Study of the column necessitates access to detailed, high-quality images and the project meets this need by making a catalogue of over 1,700 images available. Similarly, precise analysis of individual figures is made possible by the provision of academic information on each figure. The finished website therefore contains a mixture of images (divided into diagrams, cast images and shaft images) and textual descriptions of the figures.

What is Trajan’s Column?

Trajan’s Column has emerged as an object of intense academic interest partly because of the level of detail in its engravings. Other examples of architecture from the Roman Empire have survived but the column is remarkable for the precision and intricacy its figures exhibit. Furthermore, the column has a balcony from which visitors can survey Rome from above.  As such, the column is an important historical artefact which is appealing as a subject for further study.

Positioned within the Forum Traiani, the column is an epic monument to the two Dacian wars of 101-102 and 105-106 AD.  The carvings document the expansion of the Roman Empire into Eastern Europe under the command of the emperor Trajan. The two wars are given approximately the same amount of space, although there is more fighting within the first half and more travelling within the second half. The column serves to glorify the emperor and publicise his military successes. By contrast, the Dacian king Decebalus is represented as a nemesis who is eventually beheaded.

There is a quite dazzling array of figures depicted on the column, including both Roman and Dacian soldiers, civilians, deities, standard bearers, musicians, ceremonial attendants and archers. A great variety of different styles of armour, clothing and weaponry are evident amongst the figures and it has been suggested that variations can be attributed to the individual techniques of different sculptors.

Casts of the column were made in the 1860s, which have helped to preserve the detail of the engravings in a different material. A comparative analysis of both the casts and the original is necessary for fruitful academic study. As a result, the project catalogued images of both the casts and the ‘shaft’ (the original column) and clearly demarcated the different types of image. These images were primarily photos of the column taken by the project’s Principal Investigator, Dr Jon Coulston of the School of Classics.

Other digital resources of the column already exist, including that of the German Archaeological Institute (GAI). The GAI archive is particularly useful because it features photographs of the column, categorised by scene, rather than drawings or artistic representations. The St Andrews project differs from others in that it is primarily composed of full-colour photographs. The most distinctive aspect, however, is the categorisation of figures within scenes.

The Process: The Database

The raw data (described in the next section) next had to be put into a database format. The University’s image database was used as the system into which to input the data. The image database conforms to the Visual Resources Association’s (VRA) Core 4 standard; specifically, it differentiates between works and images of works, and specifies the relationships between images. Individual scenes/figures or groups of scenes/figures were designated as works. Diagrams were designated as ‘parent images’ of other diagrams and a link formed between them.

There were two basic types of entry: those relating to images and those relating to figures. Image entries typically covered multiple figures or scenes and were divided into five categories: figure, figures, scene, scenes and detail. ‘Scene’ was used for images which contained all the figures from within a scene, and ‘scenes’ for images which featured figures from more than one scene. ‘Detail’ was used for images which did not feature one figure in its entirety but which perhaps focused on a particular helmet, shield or building. Figure entries were always designated as ‘figure’ and contained the figure description. Whilst images may be uploaded for each individual figure in the future, these entries currently remain without images.
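
As an illustration of the two entry types, the distinction can be captured with a very small data model. The field names below are my own shorthand for the purposes of this sketch, not the image database’s actual VRA Core 4 fields.

```python
# An illustrative model of the two entry types described above; the field names
# are my own shorthand, not the image database's actual VRA Core 4 fields.
from dataclasses import dataclass, field
from typing import Optional, List

IMAGE_CATEGORIES = {"figure", "figures", "scene", "scenes", "detail"}

@dataclass
class ImageEntry:
    filename: str
    category: str                       # one of IMAGE_CATEGORIES
    material: str                       # "cast", "shaft" or "diagram"
    parent_image: Optional[str] = None  # e.g. a scene diagram's parent diagram

    def __post_init__(self):
        if self.category not in IMAGE_CATEGORIES:
            raise ValueError(f"Unknown image category: {self.category}")

@dataclass
class FigureEntry:
    reference: str                      # e.g. "053.01" (scene 53, figure 1)
    description: str                    # the academic description of the figure
    images: List[str] = field(default_factory=list)  # currently usually empty
```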

The Process: Types of Data

The column has traditionally been divided into 155 scenes so that the engravings can be referenced simply. The project maintained these scene divisions but added the further specification of numbering the individual figures within each scene. The figures were numbered from left to right in each scene, so that the leftmost figure in scene 1 was designated as figure 1. Each figure was given a five-digit reference code, so that scene 1 figure 1 became 001.01. Some scenes (such as scene 40) contained as many as 72 figures, whereas others (such as scenes 3 and 78) contained only one. It is hoped that this unique system of designating figures will enable scholars to refer to specific figures of interest with increased clarity.
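
A trivial sketch of the reference scheme (my own formatting helper, not code from the project itself):

```python
def figure_reference(scene: int, figure: int) -> str:
    """Build a reference like '001.01' (scene 1, figure 1): three digits for the
    scene, a dot, then two digits for the figure within that scene."""
    return f"{scene:03d}.{figure:02d}"

assert figure_reference(1, 1) == "001.01"
assert figure_reference(53, 42) == "053.42"
```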

Three types of data were involved, and this data formed the basis for the searchable database. The first type was images. These were divided into three categories: diagrams, cast photos and shaft photos. The diagrams were simple black and white drawings produced for each scene, with each figure clearly numbered within the diagram. The photos were labelled with a file name indicating the scene(s) and figure(s) they referred to and whether the photo was of the cast or the shaft. The difference between the two photo types is important, so there was a paramount need for accuracy when inputting the data and for clarifying the correct designation when it was unclear. Other information, such as the date when a photo was taken, would have been useful in order to comply more fully with the VRA Core 4 standard, but it was necessary to recognise the limitations of both the supplied data and the scope of the project.

Secondly, data was supplied by Dr Coulston relating to each individual figure, describing the figure type, armour details and so on. This information appears on the website in such a way that users can see, for example, that figure 1 from scene 5 has lorica hamata armour, and clicking on this armour type displays the other figures with the same type of armour. Thirdly, image coordinates were generated by a programme that analysed the SVG images of the diagrams. The coordinates create links between the diagram of the column and the diagrams of the individual scenes, and between the diagrams of scenes and the entries for individual figures. This meant that scenes could be clearly located within the column as a whole and that there was a visual element to navigating through the website.
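
In outline, the coordinate extraction step could look something like the sketch below, which uses Python’s standard XML parser and assumes each figure is drawn as a labelled element in the scene diagram. The real programme and the diagrams’ internal structure may well differ.

```python
# A sketch of extracting clickable-region coordinates from a scene diagram SVG.
# It assumes each figure is drawn as a labelled <circle> element; the real
# diagrams and the project's own programme may be structured quite differently.
import xml.etree.ElementTree as ET

SVG_NS = {"svg": "http://www.w3.org/2000/svg"}

def figure_coordinates(svg_path: str) -> dict:
    """Map each labelled figure in a scene diagram to its (x, y) position."""
    tree = ET.parse(svg_path)
    coords = {}
    for circle in tree.getroot().iterfind(".//svg:circle", SVG_NS):
        label = circle.get("id")            # e.g. "053.01"
        if label:
            coords[label] = (float(circle.get("cx", 0)), float(circle.get("cy", 0)))
    return coords

# figure_coordinates("scene_053.svg") might return {"053.01": (120.0, 88.5), ...}
```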

The Outcome: The Trajan’s Column Project Website

The website was designed by Mary Woodcock-Kroble and Swithun Crowe. A PHP script pulls XML data out of the image database, transforms it and adds it to a Solr index. This index is the basis for search queries. Emma Lewsley was charged with checking the entries for any errors or inconsistencies. The finished website allows users to search the column by scene or figure. There is also the option to ‘zoom in’ or ‘zoom out’ through scenes and figures.
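
The indexing step itself amounts to transforming each exported record and posting it to Solr’s update endpoint. The project used a PHP script; the sketch below shows the same idea in Python, with a made-up Solr core name, URL and document fields.

```python
# The same transform-and-index idea as the project's PHP script, sketched in
# Python. The Solr URL, core name and document fields are made up for illustration.
import requests
import xml.etree.ElementTree as ET

SOLR_UPDATE_URL = "http://localhost:8983/solr/trajan/update?commit=true"

def records_to_solr_docs(xml_text: str) -> list:
    """Flatten exported <record> elements into simple Solr documents."""
    docs = []
    for record in ET.fromstring(xml_text).iterfind("record"):
        docs.append({
            "id": record.findtext("reference"),          # e.g. "053.01"
            "type": record.findtext("type"),             # figure / scene / detail ...
            "description": record.findtext("description", default=""),
        })
    return docs

def index(xml_text: str) -> None:
    """Post the documents to Solr as a JSON array."""
    response = requests.post(SOLR_UPDATE_URL, json=records_to_solr_docs(xml_text))
    response.raise_for_status()
```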

The website may be accessed at the following address: http://arts.st-andrews.ac.uk/trajans-column/. I would like to thank Dr Jon Coulston for allowing me to reproduce his photo here and for contributing information about the column.

How to Search the Column

Figure 1: Users may search the column in two main ways. Firstly, they may identify a scene of interest from a diagram of the column in its entirety that is labelled in Roman numerals. For example, hovering over ‘LIII’ reveals a link to Scene 53.

Figure 2: A diagram of the relevant scene is included, allowing the user to select an individual figure. Again, links are included in the diagram so that clicking on ‘1’ provides access to the entry for scene 53 figure 1.

Figure 3: Each figure entry contains information relating to the figure. Clicking on ‘zoom out’ shows the user the images in which figure 1 features.

Figure 4: Alternatively, users can search by typing a scene/figure number into the search boxes on the left.

Figure 5: Here a search for ‘053’ brings up all the figure entries for that scene and all the images that relate to scene 53.

Figure 6: Each image entry contains information on the material (cast or shaft) and an image type (figure, figures, scene, scenes or detail).

My Summer With the Trajan’s Column Project

During June 2013 I took part in an internship with the Research Computing service in IT Services, working on the Trajan’s Column project on behalf of the School of Classics. Essentially, my work involved cataloguing academic data and images related to the column in a searchable online database. A more in-depth explanation of the project, the column and the implementation process is available in another blog post. By contrast, this post offers a more personal reflection on my role within the project, the challenges that I encountered and other projects that are ongoing within the remit of the Research Computing team.

Trajan’s Column: The Challenges

One of the initial hurdles to overcome at the project’s outset was entering and becoming familiar with two worlds with which I was previously unfamiliar: Classics and digital cataloguing. Whilst a high degree of academic knowledge of Roman history was not necessary for my role, I nevertheless had to engage to a certain extent with the subject matter I was cataloguing.

Sometimes I found that scene diagrams had been uploaded incorrectly, so that they referred to a different scene altogether, and sometimes the photos were labelled with the wrong scene or figure number. The occasions on which I was able to spot errors such as these helped the project achieve a higher degree of accuracy. Similarly, I had to learn how to operate the University’s image database. I already had some previous experience of using a different piece of software, but it was nevertheless necessary to get to grips with the image database’s format and layout. Fortunately, this was not a difficult task.

Once I had orientated myself within these new worlds, I was faced with the challenges involved in actually completing the work. These were compounded by the importance of the project, which had been selected by the School of Classics to be connected to a case study for the Research Excellence Framework (REF) in 2014. The REF is an exercise that assesses the quality of research in higher education institutions and, amongst other things, has implications for the levels of funding that these institutions receive. The completed part of the Trajan’s Column project is also envisaged as a pilot for potential future funding. Given the value attached to the project, the need for accuracy and efficiency in my work was key.

Much of my role involved moving data from one digital location (for example, a word-processing document) to another (the image database). The repetitiveness of this process meant that there was the possibility of errors entering the data as it was transferred. A close eye for detail was required to spot mistakes as and when they occurred and to guard against their inclusion in the finished website. Emma Lewsley, an intern working on a Biographical Register project with the Library, gave a great deal of assistance in this area by checking the work that I had uploaded to the image database. The final pressure was that of time, given that my internship lasted only four weeks and that it was preferable for the project to be completed in that time.

The Biographical Register

A secondary responsibility of my role was to check the work done on the Biographical Register. The Biographical Register contains data on University of St Andrews alumni, and the work on this project was undertaken by Emma Lewsley. I was responsible for checking Emma’s work for any errors or inconsistencies, and she similarly proofread my own work. The Register was compiled by Robert N. Smart and in its revised format features a variety of sections: a name section, a qualifications section (although members of staff are also included), a birth/baptism section, a careers section, a death section and a references section. Not every section is present in every entry, and some entries contain no data except the name! An editorial decision was made to break the data up into these sections, and this mark-up was applied by a programme. Nevertheless, human input was required to confirm that the entries had been marked up correctly. In addition, data such as names, dates and occupations were tagged within each section in order to make the database searchable.

The challenges of the Trajan’s Column project were in many ways applicable to my work on the Biographical Register as well. There was a need for accuracy and a close eye for detail in order to spot mistakes in the database. Balancing my time between my own project and the Biographical Register was also important. Since the first part of the Register contains the records of 11,744 people the magnitude of the task (which was not ultimately completed) facing Emma and myself was significant!

Other Projects Within IT Services

In the course of my internship I gained a certain degree of insight into the workings of IT Services within the University. I worked closely with Swithun Crowe, the Applications Developer in the Research Computing team. He served as a useful source of counsel on the more technical aspects of the project, and also introduced me to some of the previous and current projects he has worked on. The ‘Records of the Scottish Parliament to 1707’ project is a complex one which enables users to search the manuscripts of the Scottish parliament both in the original languages (such as Latin) and in English. This project stretched over more than a decade after its genesis with the School of History in 1996 and involved an extensive period of transcribing the original documents into a digital format. Swithun was heavily involved in creating the online database for the website. Other projects included a corpus of Scottish medieval parish churches and a website devoted to Arabic semantics. It was apparent that the work undertaken by the Research Computing team is diverse and intriguing.

Conclusion

I am grateful to have had the opportunity to be involved professionally with digital cataloguing and the challenges that it contains. It was particularly satisfying to witness the Trajan’s Column project progress from a partially completed concept to the finished website as it currently appears.

3D Laser Scanning: Seeking a New Standard in Documentation

I attended a one-day conference at the Royal Commission on the Ancient and Historical Monuments of Scotland (RCAHMS) in Edinburgh on 1 May 2013. This was organised by Emily Nimmo (RCAHMS) and the Digital Preservation Coalition (DPC). The topic of the conference was 3D laser scanning, with an emphasis on storing, reusing and archiving the data from scanners.

There was a good mix of talks and informal panel sessions, with speakers coming from a variety of organisations. Between them, they covered the subject from different perspectives, much in the way that an accurate point cloud is built up from scans taken from multiple angles.

My main reason for attending was not so much to learn about 3D laser scanning in particular as to abstract more general issues about working with scientific research data. The Research Computing team is hoping to expand to cover scientific as well as humanities research, so I wanted to pick up as much information as I could.

James Hepher, a surveyor and archaeologist from Historic Scotland, spoke about the challenges faced by those working on the Scottish Ten project. He was very recently back from scanning the Sydney Opera House. This involved a large team scanning the exterior and interior of the building, from the ground, from rigs on top of the sails, and by abseiling down the sides.

Five different scanners were used. These devices output raw data in different formats, which have to be merged to produce the point cloud. We were shown some amazing fly-throughs of the Opera House, with Sydney harbour full of boats and the skyscrapers behind. The area was scanned at a resolution of 5mm, so one can imagine the volumes of data involved.
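
As a rough back-of-envelope illustration (the surface area figure below is an invented round number, used purely to show the arithmetic), the point count grows with the inverse square of the scan spacing:

```python
# Back-of-envelope only: the surface area is an invented round number, used
# solely to show how quickly point counts grow as scan spacing shrinks.
spacing_m = 0.005                      # 5 mm scan resolution
surface_area_m2 = 100_000              # hypothetical total scanned surface area

points = surface_area_m2 / spacing_m**2          # ~40,000 points per square metre
bytes_per_point = 24                             # e.g. three 8-byte coordinates
print(f"{points:.2e} points, roughly {points * bytes_per_point / 1e9:.0f} GB raw")
# 4.00e+09 points, roughly 96 GB raw
```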

Similarly, English Heritage (EH) are also using 3D laser scanning to record important monuments and buildings such as Stonehenge, Harmondsworth Barn (aka ‘The Cathedral of Middlesex’) and Iron Bridge.

Stonehenge has recently been scanned, to a resolution of 0.5mm. This will allow a baseline to be taken of the condition of the stones. New scans in the future can then be used to measure any erosion and damage. This has parallels with a project on which I have been working, documenting the scenes in the frieze on Trajan’s Column. In the 19th century, plaster casts were taken of the column. These are now, in many ways, a more accurate source of information than the column itself, as the casts have been preserved inside, while the column has endured another century of Roman pollution.

Laser scans don’t always record colour very well. But in renderings of scans, colour is often used to indicate other properties of the objects being scanned. Images generated from the scan of Harmondsworth Barn showed up timbers damaged in a recent fire. Laser scans of buildings (and the spaces inside them) form an important part of Building Information Modeling (BIM). BIM is the construction of digital representations of buildings that can be used and contributed to by all parties involved, from drawing board to demolition. It works best for new builds, but can also be applied to 15th century buildings, although factors such as costs and building codes and practices can only be guessed at in these cases.

Iron Bridge has been of interest ever since it was built in 1779, and in 1934 was declared an Ancient Monument. Various surveys have been taken, including a 3D model from 1999. Just as the bridge itself demonstrated the use of iron in bridge building, the 3D survey hoped to demonstrate the utility of new technologies. Most of the work was done using photogrammetry – a technique where points in space are calculated from multiple photos.

Technology in this area has improved so much that this recent survey is already out of date, and a new survey is needed. Interestingly, analogue photos of the bridge from many years ago are still reliable sources of information.

Laser scans can be used to help protect and conserve ancient monuments, by measuring changes and by providing insights into construction methods. But what about preserving the data from laser scans? Documenting and archiving 3D data was the subject of a talk by Catherine Hardman from the Archaeology Data Service (ADS).

3D modelling (both with laser scanning and photogrammetry) is a new and evolving area of research and practice. The ADS work extensively with 3D modelling of archaeological artefacts, so are keen to document the best methods for obtaining and retaining 3D data. Initially the ADS produced hard copies of the standards on which they were working. But, realising how often these standards were being revised, they switched to using a wiki.

Vast amounts of data are produced in 3D modelling, not least because there are many steps between the hardware (scanner or camera) and the final rendering of the model. The raw data from the device, which could be millions of points from a laser scanner or multiple high-resolution photos, has to be processed to produce the 3D model, and then there are many different ways to visualise that model.

The methods and workflows used to obtain the raw data need to be documented, so that future surveys can follow the same methodology. Given the number of steps involved, decisions must be made about which stages’ data to keep (and so, which stages to redo as needed). There was a lively discussion about how much data and metadata was needed, and when it becomes overkill.

The basic principles of digital archiving apply – use plain text formats, make sure they are open and non-proprietary and don’t compress data. One problem with laser scan data is that it may not be possible to losslessly convert raw binary data to ASCII. And one may not have the luxury of being able to keep both.
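
One way to check whether a particular binary-to-ASCII conversion is actually lossless is a simple round-trip test. The sketch below does this for bare 32-bit float coordinates; real scanner formats typically carry additional per-point attributes and structure, which is where lossless conversion becomes difficult.

```python
# A round-trip test for converting binary point data to ASCII. This only covers
# bare 32-bit float XYZ triples; real scanner formats usually carry extra
# per-point attributes, which is where lossless conversion breaks down.
import struct

def binary_points_to_ascii(blob: bytes) -> str:
    """Decode packed little-endian float32 XYZ triples into text lines, using
    enough significant digits (9) to preserve every float32 value exactly."""
    points = struct.iter_unpack("<3f", blob)
    return "\n".join(" ".join(f"{value:.9g}" for value in xyz) for xyz in points)

def ascii_points_to_binary(text: str) -> bytes:
    """Re-encode the text lines back into packed float32 triples."""
    return b"".join(
        struct.pack("<3f", *(float(v) for v in line.split()))
        for line in text.splitlines()
    )

blob = struct.pack("<3f", 1.2345678, -0.000123, 987654.3)
assert ascii_points_to_binary(binary_points_to_ascii(blob)) == blob  # lossless here
```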

Archiving workflows requires a means of formally describing each step in the process, so that someone else can faithfully reproduce it later. Software used in the process may also need to be archived, requiring emulation, virtual machines or hardware preservation.

Some of the participants who create 3D data felt that the metadata requirements of the ADS were overly onerous, and were an obstacle to depositing data with them. It was suggested that the ADS provide 3D data with metadata in varying amounts and qualities to other archives, to see what level of metadata is sufficient for 3D data.

An argument for erring on the side of caution when collecting metadata is that the quality of one’s data is equal to the quality of one’s metadata. And one doesn’t know the true value (or range of re-uses) of one’s data at the point of creation or deposition in an archive.

Although much has been made of the lack of a non-proprietary file format for 3D data, there might be a solution. Faraz Ravi (Bentley Systems, and chairman of the ASTM Committee E57 on 3D Imaging Systems) presented the E57 format. It has been developed for interoperability. If one wants to losslessly convert from one proprietary file format to another, E57 can be used as the intermediate file format. Faraz made it clear that E57 is not designed to be a working format, nor was it designed to be an archival format. It was designed to be an open interchange format.

In the E57 format, metadata is encoded as XML. The 3D data is mostly stored as binary, to make it more compact. Source code for programs (with an open source license) which implement the format can be downloaded. The format is also supported by many of the major software vendors in the industry.

A major feature of E57 is its extensibility. Similar to the TIFF format, extensions specific to particular hardware can be incorporated without invalidating the files. For some people this is a strength. For others, mainly concerned with archiving such files, this can lead to problems similar to those encountered when working with TIFFs from multiple sources. Different pieces of software have the ability to read and write different extensions. TIFFs with extensions need to be interrogated, to see what exactly they contain.

It was suggested that, as with TIFFs, a baseline E57 format could be specified, which would be recommended for archiving files of this format.

A talk by Joe Beeching of 3D Laser Mapping Ltd illustrated the complexities of documenting and making sense of data from scanners. Scans can be taken from aircraft, or from cars, or from handheld devices wobbling on the end of a spring (working title: the Wobulator). To get a point cloud, one must not only know the distance from the scanner to each point on the object, but also the location and orientation of the scanner.
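
In essence, registering a scan means transforming each point from the scanner’s own frame into a shared world frame using the scanner’s position and orientation. A minimal sketch, with made-up numbers, is:

```python
# Registering a scan: transform points from the scanner's local frame into a
# shared world frame using the scanner's orientation (rotation) and position
# (translation). The numbers below are made up for illustration.
import numpy as np

def register_scan(local_points: np.ndarray, rotation: np.ndarray, position: np.ndarray) -> np.ndarray:
    """local_points: (N, 3) points in the scanner's frame;
    rotation: (3, 3) scanner orientation; position: (3,) scanner location."""
    return local_points @ rotation.T + position

# A scanner rotated 90 degrees about the vertical axis, standing 2 m above the origin.
theta = np.pi / 2
rotation = np.array([[np.cos(theta), -np.sin(theta), 0],
                     [np.sin(theta),  np.cos(theta), 0],
                     [0,              0,             1]])
position = np.array([0.0, 0.0, 2.0])
local = np.array([[1.0, 0.0, 0.0]])             # one point, 1 m in front of the scanner
print(register_scan(local, rotation, position))  # ≈ [[0. 1. 2.]] (up to rounding)
```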

One interesting feature of some scanners is the ability to pick up multiple returns from a scan. A laser beam can pass through vegetation and the scanner can pick up both the external vegetation and the object underneath (e.g. plants on a cliff face). With software, one can make the vegetation visible or hidden.

Steven Ramsey of Leica Geosystems Ltd had some eye-opening anecdotes from his years working in the scanning and surveying industry. There was a time when each project’s scans could fit on a floppy disk; now each project needs its own RAID array. Before people used scanners, a surveyor could record 2,000 points in a day; now a scanner can record a million points in a second. This triggered an interesting conversation about what constitutes a point cloud, and when a collection of points becomes a cloud. There is no real answer to this, though I felt that at some point, from the sheer quantity of points, new qualities of the scanned object emerge.

This was an informative day, spent in the company of some knowledgeable people. I picked up some valuable insights into workflows involving the output from digital instruments, and questions to ask when looking at workflows from an archiving perspective. Is the raw data in a proprietary or binary format? Is there data loss when converting to ASCII? What are the different stages? Do the outputs of different stages need archiving too? Can the workflow be documented, so that experiments can be re-run with different input data?