Discussion on standardising and sharing data among Arabic projects

I attended the 2013 Deutscher Orientalistentag in Münster to co-present a talk on the Arab Cultural Semantics in Transition (ARSEM) project, along with the principal investigator, Kirill Dmitriev. My half of the presentation, available here, focused on the technical aspects of the project.

Our presentation clashed with the start of the Digital Humanities (New methods of text analysis in Arabic and Islamic studies) panel, but I was still able to make it to most of the talks and the discussion afterwards.

The presenters in the Digital Humanities panel were:

The discussion, led by Andreas Kaplony, focused on the need for the diverse range of projects represented in the room to agree on standards. He told the story of how Hellenistic scholars had already done this, by sitting together for an afternoon.

Examples of what can and should be agreed upon include when centuries start and end (did the 20th century end on 31 December 1999 or 31 December 2000?), citation standards and more domain-specific issues, such as the transliteration of Arabic into Latin script (DIN 31635, Hans Wehr, Buckwalter etc.).

These aren’t problems which affect humans – if accuracy to the year is important, it can often be inferred from the context, and we are fairly forgiving when it comes to citation styles. And from what I understand, scholars can often read many different Romanised forms of Arabic.

But computers are more simple-minded, and need to be told explicitly which date range to search within and which transliteration method has been used.

It was suggested that many of these types of issue could be resolved by using a schema to describe the standards used. Academics wouldn’t have to all use the same standards, as long as they documented the standards that they did use. As has been said, “the nice thing about standards is that you have so many to choose from”.

Less equivocal meanings could then be obtained for fuzzy phrases such as “eighth century”. Crosswalks could translate data from one standard to another.
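As a rough illustration of how a documented convention turns a fuzzy phrase into an explicit date range, here is a minimal Python sketch. The convention names "strict" and "popular" and the function `century_range` are my own hypothetical choices, not part of any schema discussed on the panel. It also assumes CE years in the proleptic Gregorian calendar used by Python's `datetime.date`; Hijri centuries would need a separate conversion step.

```python
from datetime import date

# Hypothetical convention names; a real schema would define its own vocabulary.
CONVENTIONS = ("strict", "popular")


def century_range(century: int, convention: str) -> tuple[date, date]:
    """Return the first and last day of a CE century under a named convention."""
    if century < 1:
        raise ValueError("century must be 1 or greater")
    if convention == "strict":
        # The 20th century runs from 1901 to 2000.
        first_year, last_year = 100 * (century - 1) + 1, 100 * century
    elif convention == "popular":
        # The 20th century runs from 1900 to 1999.
        # datetime.date has no year 0, so the first century is clamped to year 1.
        first_year, last_year = max(1, 100 * (century - 1)), 100 * century - 1
    else:
        raise ValueError(f"unknown convention: {convention!r}")
    return date(first_year, 1, 1), date(last_year, 12, 31)


for name in CONVENTIONS:
    start, end = century_range(8, name)
    print(name, start.isoformat(), end.isoformat())
```

Under these assumptions, a record tagged "eighth century, strict" would be searched as 701 to 800, while the same phrase tagged "popular" would be searched as 700 to 799. The dataset only needs to say which convention it follows.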

This solves some issues, but a crosswalk won’t always be available for converting transliterated text between methods (without access to the original vocalised Arabic). Some Romanisation methods are lossy, and not all lose the same information.
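The difference between a lossless and a lossy crosswalk can be sketched in a few lines of Python. The first table below is a small excerpt of the Buckwalter to DIN 31635 consonant correspondences. The second, `BUCKWALTER_TO_PLAIN`, is a hypothetical diacritic-free scheme I have made up purely to show information loss. The function names are likewise illustrative assumptions.

```python
# Partial excerpt only: six consonants, no vowels or other signs.
BUCKWALTER_TO_DIN = {"b": "b", "r": "r", "s": "s", "S": "ṣ", "h": "h", "H": "ḥ"}

# Hypothetical plain-ASCII scheme that drops the emphatic/pharyngeal distinction.
BUCKWALTER_TO_PLAIN = {"b": "b", "r": "r", "s": "s", "S": "s", "h": "h", "H": "h"}


def crosswalk(text: str, table: dict[str, str]) -> str:
    """Map each character through the table; unknown characters raise KeyError."""
    return "".join(table[ch] for ch in text)


def is_lossless(table: dict[str, str]) -> bool:
    """A table can be reversed only if no two inputs share an output."""
    return len(set(table.values())) == len(table)


def invert(table: dict[str, str]) -> dict[str, str]:
    """Build the reverse crosswalk, refusing tables that lose information."""
    if not is_lossless(table):
        raise ValueError("table is lossy and cannot be inverted")
    return {target: source for source, target in table.items()}


print(crosswalk("Hrb", BUCKWALTER_TO_DIN), crosswalk("hrb", BUCKWALTER_TO_DIN))
print(crosswalk("Hrb", BUCKWALTER_TO_PLAIN), crosswalk("hrb", BUCKWALTER_TO_PLAIN))
print(is_lossless(BUCKWALTER_TO_DIN), is_lossless(BUCKWALTER_TO_PLAIN))
```

In this sketch the two distinct skeletons "Hrb" and "hrb" stay distinct under the DIN 31635 excerpt but collapse to the same string under the plain scheme, so no reverse crosswalk can recover which one was meant without going back to the Arabic.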

One of the main motivations for agreeing on standards is to facilitate the sharing of data. Although most people in the room were in favour of opening their research data, there wasn’t agreement on when in a project’s life cycle this should be done. To some, one’s data represented hard work and reputation, and wasn’t to be given away before the rewards had been obtained. Others were more confident that attribution could be preserved, and that wide dissemination and use of one’s data would not be a waste of one’s effort.

Guidelines which apply to data accompanying published articles don’t necessarily apply to a long research project. Which, if any, milestones in a research project are analogous to publication and might mandate the opening of one’s research data?

Currently, email is used to share data on an ad hoc basis between researchers. Trust is a key component here. A suggested alternative to simply offering up one’s data for download was to make it available via an API. This would allow access to be controlled, and would also mean that the data made available was processed rather than raw.

One thing which would be possible if researchers opened their data up via an API would be federated searches across multiple databases. Many projects are producing Arabic dictionaries focusing on different uses of Arabic in different periods of time and in different places, e.g. merchants’ shopping lists from the Middle Ages, Arabic words used in Occitan medical books or pre-Islamic poetry. With an API, it would be possible to search across projects using a single interface.

Each project, however, focuses on different features of the language, and has different strengths and weaknesses. Hopefully enough of a common subset of data will be present across projects to make search federation a useful facility.
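To make the idea of federation over a shared subset more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the field names `lemma` and `gloss`, the project names, and the sample records are placeholders, and each project's API is stood in for by an in-memory function rather than a real network call.

```python
from typing import Callable, Iterable

Record = dict[str, str]
Backend = Callable[[str], Iterable[Record]]

# Hypothetical fields that every participating project is assumed to expose.
SHARED_FIELDS = ("lemma", "gloss")


def make_backend(records: list[Record]) -> Backend:
    """Stand-in for one project's API: exact match on the lemma field."""
    def search(query: str) -> list[Record]:
        return [r for r in records if r.get("lemma") == query]
    return search


def federated_search(query: str, backends: dict[str, Backend]) -> list[Record]:
    """Query every backend and keep only the shared fields, tagged by project."""
    results = []
    for project, search in backends.items():
        for record in search(query):
            # Skip records that lack any of the shared fields.
            if all(field in record for field in SHARED_FIELDS):
                row = {field: record[field] for field in SHARED_FIELDS}
                row["project"] = project
                results.append(row)
    return results


# Placeholder data, not taken from any real project.
backends = {
    "project_a": make_backend(
        [{"lemma": "ktb", "gloss": "placeholder gloss A", "period": "placeholder"}]
    ),
    "project_b": make_backend(
        [{"lemma": "ktb", "gloss": "placeholder gloss B"}, {"lemma": "ktb"}]
    ),
}

hits = federated_search("ktb", backends)
for hit in hits:
    print(hit)
```

Project-specific fields such as `period` are dropped from the merged results, and records missing a shared field are skipped. That is one possible design choice; a real service might instead return the extra fields alongside the common ones.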

In addition to the projects linked to above, Kirill has been compiling a list of related projects.

As the number of projects with a large Arabic component on which I’m working increases, I found it very useful to learn more about the subject.