Discussion on standardising and sharing data among Arabic projects

I attended the 2013 Deutscher Orientalistentag in Münster to co-present a talk on the Arab Cultural Semantics in Transition (ARSEM) project, along with the principal investigator, Kirill Dmitriev. My half of the presentation, available here, focused on the technical aspects of the project.

Our presentation clashed with the start of the Digital Humanities (New methods of text analysis in Arabic and Islamic studies) panel, but I was still able to make it to most of the talks and the discussion afterwards.

The presenters in the Digital Humanities panel were:

The discussion, led by Andreas Kaplony, focused on the need for the diverse range of projects represented in the room to agree on standards. He told the story of how Hellenistic scholars had already done this, by sitting together for an afternoon.

Examples of what can and should be agreed upon are when centuries stop and start (did the 20th century end on 31 December 1999 or 31 December 2000?), citation standards and more domain-specific issues, such as transliteration from Arabic into Latin script (DIN 31635, Hans Wehr, Buckwalter, etc.).

These aren’t problems which affect humans – if accuracy to the year is important, it can often be inferred from the context, and we are fairly forgiving when it comes to citation styles. And from what I understand, scholars can often read many different Romanised forms of Arabic.

But computers are more simple-minded, and need to be told explicitly which date range to search within, or which transliteration scheme has been used.

It was suggested that many of these types of issue could be resolved by using a schema to describe the standards used. Academics wouldn’t have to all use the same standards, as long as they documented the standards that they did use. As has been said, “the nice thing about standards is that you have so many to choose from”.
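As a rough sketch of what such a declaration might look like (the field names below are invented for illustration, not taken from any agreed schema), a project could publish something as simple as:

```python
# Hypothetical description of the standards a project has adopted.
# None of these field names come from a real schema; they only illustrate the idea.
dataset_profile = {
    "transliteration": "DIN 31635",        # which Romanisation scheme the texts use
    "calendar": "Gregorian",               # calendar system used for all dates
    "century_convention": "starts_in_01",  # i.e. the 8th century runs 701-800
    "citation_style": "author-date",
}
```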

Less ambiguous meanings could then be derived for fuzzy phrases such as “eighth century”. Crosswalks could translate data from one standard to another.
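For instance, once a project has documented which convention it follows, a phrase like “eighth century” can be resolved mechanically. A minimal sketch, assuming two made-up convention names:

```python
def century_to_years(n, convention="starts_in_01"):
    """Resolve the nth century CE to an explicit year range, given a declared convention."""
    if convention == "starts_in_01":    # 8th century = 701-800
        return (100 * (n - 1) + 1, 100 * n)
    if convention == "starts_in_00":    # 8th century = 700-799
        return (100 * (n - 1), 100 * n - 1)
    raise ValueError(f"unknown convention: {convention}")

print(century_to_years(8))                  # (701, 800)
print(century_to_years(8, "starts_in_00"))  # (700, 799)
```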

Schemas and crosswalks solve some of these issues, but a crosswalk won’t always be available for converting transliterated text between methods (without access to the original vocalised Arabic). Some Romanisation methods are lossy, and not all lose the same information.

One of the main motivations for agreeing on standards is to facilitate the sharing of data. Although most people in the room were in favour of opening their research data, there wasn’t agreement on when in a project’s life cycle this should be done. To some, one’s data represented hard work and reputation, and wasn’t to be given away before the rewards had been obtained. Others were more confident that attribution could be preserved and a wide dissemination and use of one’s data would not be a waste of one’s effort.

Guidelines which apply to data accompanying published articles don’t necessarily apply to a long research project. Which, if any, milestones in a research project are analogous to publication and might mandate the opening of one’s research data?

Currently, email is used to share data on an ad hoc basis between researchers, with trust as a key component. One suggested alternative to simply offering up one’s data for download was to make it available via an API. This would control access, and would also mean that the data one made available was worked or processed rather than raw.

One thing which would be possible if researchers opened their data up via an API would be federated searches across multiple databases. Many projects are producing Arabic dictionaries focusing on different uses of Arabic in different periods of time and in different places, e.g. merchants’ shopping lists from the middle ages, Arabic words used in Occitan medical books or pre-Islamic poetry. With an API, it would be possible to search across projects using a single interface.
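In principle a federated search need not be complicated: a single query is fanned out to each project’s API and the results merged. The endpoints and response format below are entirely hypothetical; this is only a sketch of the idea.

```python
import requests

# Entirely hypothetical endpoints: no such APIs exist (yet).
ENDPOINTS = [
    "https://example.org/dictionary-a/api/search",
    "https://example.org/dictionary-b/api/search",
]

def federated_search(lemma):
    """Query each project's (hypothetical) search API and merge the results."""
    merged = []
    for url in ENDPOINTS:
        response = requests.get(url, params={"q": lemma}, timeout=10)
        response.raise_for_status()
        # Assumes each project returns JSON with a shared core of fields
        # (e.g. lemma, gloss, source, period) under a "results" key.
        merged.extend(response.json()["results"])
    return merged
```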

Each project, however, focuses on different features of the language, and has different strengths and weaknesses. But hopefully a subset of data will be present across projects, to make search federation a useful facility.

In addition to the projects linked to above, Kirill has been compiling a list of related projects.

As the number of projects I’m working on with a large Arabic component increases, I found it very useful to learn more about the subject.

3D Laser Scanning: Seeking a New Standard in Documentation

I attended a one-day conference at the Royal Commission on the Ancient and Historical Monuments of Scotland (RCAHMS) in Edinburgh on 1 May 2013. This was organised by Emily Nimmo (RCAHMS) and the Digital Preservation Coalition (DPC). The topic of the conference was 3D laser scanning, with an emphasis on storing, reusing and archiving the data from scanners.

There was a good mix of talks and informal panel sessions, with speakers coming from a variety of organisations. Between them, they covered the subject from different perspectives, much as an accurate point cloud is built up from scans taken from multiple angles.

My main reason for attending was not so much to learn about 3D laser scanning in particular, but to abstract more general issues about working with scientific research data. The Research Computing team is hoping to expand to cover scientific as well as humanities research, so I wanted to pick up as much information as I could.

James Hepher, a surveyor and archaeologist from Historic Scotland, spoke about the challenges faced by those working on the Scottish Ten project. He was very recently back from scanning the Sydney Opera House. This involved a large team scanning the exterior and interior of the building, from the ground, from rigs on top of the sails, and by abseiling down the sides.

Five different scanners were used. These devices output raw data in different formats, which must be merged to produce a single point cloud. We were shown some amazing fly-throughs of the Opera House, with Sydney Harbour full of boats and the skyscrapers behind. The area was scanned at a resolution of 5mm, so one can imagine the volumes of data involved.
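Some back-of-envelope arithmetic gives a sense of scale (the surface area and per-point size below are my own guesses, not figures from the talk):

```python
# A 5 mm grid means (1000 / 5)^2 = 40,000 points per square metre.
points_per_m2 = (1000 // 5) ** 2            # 40,000
surface_m2 = 50_000                         # guessed total scanned surface area
bytes_per_point = 3 * 4 + 3 * 2             # e.g. three float32 coords + RGB as uint16
total_bytes = points_per_m2 * surface_m2 * bytes_per_point
print(f"{total_bytes / 1e9:.0f} GB")        # ~36 GB, before intensities, overlap, backups...
```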

Similarly, English Heritage (EH) are also using 3D laser scanning to record important monuments and buildings such as Stonehenge, Harmondsworth Barn (aka The Cathedral of Middlesex) and Iron Bridge.

Stonehenge has recently been scanned, to a resolution of 0.5mm. This will allow a baseline to be taken of the condition of the stones. New scans in the future can then be used to measure any erosion and damage. This has parallels with a project on which I have been working, documenting the scenes in the frieze on Trajan’s Column. In the 19th century, plaster casts were taken of the column. These are now, in many ways, a more accurate source of information than the column itself, as the casts have been preserved inside, while the column has endured another century of Roman pollution.
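A crude sketch of how two such scans might be compared (this is the general idea only, not English Heritage’s actual pipeline; the file names are placeholders):

```python
import numpy as np
from scipy.spatial import cKDTree

def change_per_point(baseline_xyz, new_xyz):
    """For each point in the new scan, the distance to the nearest baseline point.

    A rough proxy for surface change: a real comparison would first register
    (align) the two clouds and account for differing point densities.
    """
    tree = cKDTree(baseline_xyz)
    distances, _ = tree.query(new_xyz)
    return distances

# baseline = np.loadtxt("stonehenge_baseline.xyz")  # placeholder file names
# latest = np.loadtxt("stonehenge_resurvey.xyz")
# print(change_per_point(baseline, latest).max())
```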

Laser scans don’t always record colour very well, but in renderings of scans, colour is often used to indicate other properties of the objects being scanned. Images generated from the scan of Harmondsworth Barn showed up timbers damaged in a recent fire.

Laser scans of buildings (and the spaces inside them) form an important part of Building Information Modelling (BIM). BIM is the construction of digital representations of buildings that can be used and contributed to by all parties involved, from drawing board to demolition. It works best for new builds, but can also be applied to 15th-century buildings, although factors such as costs and building codes and practices can only be guessed at in these cases.

Iron Bridge has been of interest ever since it was built in 1779, and in 1934 was declared an Ancient Monument. Various surveys have been taken, including a 3D model from 1999. Just as the bridge itself demonstrated the use of iron in bridge building, the 3D survey hoped to demonstrate the utility of new technologies. Most of the work was done using photogrammetry – a technique where points in space are calculated from multiple photos.

Technology in this area has improved so much that this recent survey is already out of date, and a new survey is needed. Interestingly, analogue photos of the bridge from many years ago are still reliable sources of information.

Laser scans can be used to help protect and conserve ancient monuments, by measuring changes and by providing insights into construction methods. But what about preserving the data from laser scans? Documenting and archiving 3D data was the subject of a talk by Catherine Hardman from the Archaeology Data Service (ADS).

3D modelling (both with laser scanning and photogrammetry) is a new and evolving area of research and practice. The ADS work extensively with 3D modelling of archaeological artefacts, so are keen to document the best methods for obtaining and retaining 3D data. Initially the ADS produced hard copies of the standards on which they were working. But, realising how often these standards were being revised, they switched to using a wiki.

Vast amounts of data are produced in 3D modelling, not least because there are many steps between the hardware (scanner or camera) and the final rendering of the model. The raw data from the device – which could be millions of points from a laser scanner, or multiple high-resolution photos – has to be processed to produce the 3D model, and then there are many different ways to visualise the model.

The methods and workflows used to obtain the raw data need to be documented, so that future surveys can follow the same methodology. Given the number of steps involved, decisions must be made about which stages’ data to keep (and so, which stages to redo as needed). There was a lively discussion about how much data and metadata was needed, and when it becomes overkill.

The basic principles of digital archiving apply – use plain text formats, make sure they are open and non-proprietary and don’t compress data. One problem with laser scan data is that it may not be possible to losslessly convert raw binary data to ASCII. And one may not have the luxury of being able to keep both.
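A small illustration of the binary-to-ASCII point (my own example, not one from the talk): a single-precision coordinate can survive conversion to text, but only if enough decimal digits are written out.

```python
import struct

# A coordinate as it would be stored in a 32-bit binary format.
x = struct.unpack("<f", struct.pack("<f", 123.456789))[0]

short_text = f"{x:.3f}"   # '123.457' - rounded, the original value is lost
full_text = f"{x:.9g}"    # nine significant digits are enough to round-trip a float32

assert struct.unpack("<f", struct.pack("<f", float(full_text)))[0] == x
assert float(short_text) != x
```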

Archiving workflows requires a means of formally describing each step in the process, so that someone else can faithfully reproduce it later. Software used in the process may also need to be archived, requiring emulation, virtual machines or hardware preservation.
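As a sketch of the sort of record each step might need (the fields are my own guesses at a minimum, not a published metadata standard):

```python
# One processing step in a scanning workflow, described formally enough
# that someone else could repeat it. Field names and values are illustrative.
workflow_step = {
    "step": "registration",
    "inputs": ["scan_001.raw", "scan_002.raw"],
    "output": "merged_cloud.e57",
    "software": {"name": "ExampleRegistrationTool", "version": "4.2.1"},
    "parameters": {"method": "target-based", "target_error_mm": 2.0},
    "operator": "surveyor initials",
    "date": "2013-05-01",
}
```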

Some of the participants who create 3D data felt that the metadata requirements of the ADS were overly onerous, and were an obstacle to depositing data with them. It was suggested that the ADS provide 3D data with metadata in varying amounts and qualities to other archives, to see what level of metadata is sufficient for 3D data.

An argument for erring on the side of caution when collecting metadata is that one’s data is only as good as its metadata. And one doesn’t know the true value (or range of re-uses) of one’s data at the point of creation or deposition in an archive.

Although much has been made of the lack of a non-proprietary file format for 3D data, there might be a solution. Faraz Ravi (Bentley Systems, and chairman of the ASTM Committee E57 on 3D Imaging Systems) presented the E57 format. It has been developed for interoperability. If one wants to losslessly convert from one proprietary file format to another, E57 can be used as the intermediate file format. Faraz made it clear that E57 is not designed to be a working format, nor was it designed to be an archival format. It was designed to be an open interchange format.

In the E57 format, metadata is encoded as XML. The 3D data is mostly stored as binary, to make it more compact. Source code for programs (with an open source license) which implement the format can be downloaded. The format is also supported by many of the major software vendors in the industry.

A major feature of E57 is its extensibility. Similar to the TIFF format, extensions specific to particular hardware can be incorporated without invalidating the files. For some people this is a strength. For others, mainly concerned with archiving such files, this can lead to problems similar to those encountered when working with TIFFs from multiple sources. Different pieces of software have the ability to read and write different extensions. TIFFs with extensions need to be interrogated, to see what exactly they contain.
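By way of illustration, interrogating a TIFF can be as simple as dumping its tag dictionary (here with the Pillow library; the file name is a placeholder), where private or extension tags show up as unfamiliar numeric IDs:

```python
from PIL import Image

with Image.open("scan.tif") as img:   # placeholder file name
    for tag_id, value in img.tag_v2.items():
        print(tag_id, value)
```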

It was suggested that, as with TIFFs, a baseline E57 format could be specified, which would be recommended for archiving files of this format.

A talk by Joe Beeching of 3D Laser Mapping Ltd illustrated the complexities of documenting and making sense of data from scanners. Scans can be taken from aircraft, or from cars, or from handheld devices wobbling on the end of a spring (working title: the Wobulator). To get a point cloud, one must not only know the distance from the scanner to each point on the object, but also the location and orientation of the scanner.
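A minimal sketch of the geometry involved (not any particular vendor’s software): the scanner’s own measurements are transformed into world coordinates using its position and orientation.

```python
import numpy as np

def to_world(points_local, rotation, translation):
    """Transform points from the scanner's frame into world coordinates.

    points_local: (N, 3) points as measured by the scanner
    rotation:     (3, 3) rotation matrix giving the scanner's orientation
    translation:  (3,) position of the scanner in the world frame
    """
    return points_local @ rotation.T + translation

# Example: a scanner 2 m above the origin, rotated 90 degrees about the z axis.
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
print(np.round(to_world(np.array([[1.0, 0.0, 0.0]]), R, np.array([0.0, 0.0, 2.0])), 6))
# [[0. 1. 2.]]
```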

One interesting feature of some scanners is the ability to pick up multiple returns from a scan. A laser beam can pass through vegetation and the scanner can pick up both the external vegetation and the object underneath (e.g. plants on a cliff face). With software, one can make the vegetation visible or hidden.
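A toy example of what that filtering amounts to, assuming each point carries a return number (the arrays here are made up):

```python
import numpy as np

points = np.array([[1.0, 2.0, 5.0],    # first return: foliage
                   [1.0, 2.0, 3.2],    # last return: the surface beneath
                   [1.5, 2.1, 3.1]])   # single return: bare rock
return_number = np.array([1, 2, 1])
num_returns = np.array([2, 2, 1])

# Hide the vegetation by keeping only each pulse's last return.
surface_only = points[return_number == num_returns]
```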

Steven Ramsey of Leica Geosystems Ltd had some eye-opening anecdotes from his years working in the scanning and surveying industry. There was a time when each project’s scans could fit on a floppy disk; now each project needs its own RAID array. Before people used scanners, a surveyor could record 2,000 points in a day; now a scanner can record a million points in a second. This triggered an interesting conversation about what constitutes a point cloud, and when a collection of points becomes a cloud. There is no real answer to this, though I felt that, at some point, new qualities of the object being scanned emerge from the sheer quantity of points.

This was an informative day, spent in the company of some knowledgeable people. I picked up some valuable insights into workflows involving the output from digital instruments, and questions to ask when looking at workflows from an archiving perspective. Is the raw data in a proprietary or binary format? Is there data loss when converting to ASCII? What are the different stages? Do the outputs of different stages need archiving too? Can the workflow be documented, so that experiments can be re-run with different input data?