Towards Digital Coptic: Searching and Visualizing Coptic Manuscript Data

Posted on 14 January 2014

Talk: Amir Zeldes (HU), “Towards Digital Coptic: Searching and Visualizing Coptic Manuscript Data”.

Permalink: http://hdl.handle.net/11858/00-1780-0000-0022-D54A-7

Date: Tuesday, 14 January 2014

Time: starting at 18:00 c.t. (i.e. 18:15)

Venue: TOPOI Building Dahlem, Hittorfstraße 18, Freie Universität Berlin, 14195 Berlin


Abstract

The Coptic language of Hellenistic Egypt has been chronically underrepresented in the digital humanities landscape despite the masses of material available and general multidisciplinary interest in Egypt in the first millennium for ancient history, religious studies and the classics. In this talk we present some of the first efforts to offer an open infrastructure for the online representation of Coptic texts using automated tools which have become commonplace for Classical Greek and Latin in the last decade. The talk is divided into two parts, discussing searching and visualizing Coptic data respectively.

Searching through Coptic data using available DH architectures is complex for several reasons. Firstly, most texts come from potentially damaged, incomplete and physically disjoint papyri and codices. Texts must be reconstructed from pieces of manuscripts stored in various libraries and museums worldwide, complicating both metadata management and the concept of a corpus or text. Are resources grouped according to ‘works’, physical repositories or codices? How do we establish what constitutes a ‘work’? Secondly, in Coptic many concepts taken for granted for Greek and Latin are non-trivial. Coptic manuscripts are written in scriptio continua, without spaces, and modern conventions on word division differ substantially (see Layton 2004: 19–20). Additionally, since Coptic is an agglutinative language, the relevant unit for linguistic analysis is the morpheme, below the ‘word’ level. This means that segmentation guidelines must be developed for both levels of resolution. Thirdly, in order to search through multiple manuscripts, it is necessary to develop guidelines and tools for normalization, part-of-speech tagging and lemmatization of Coptic, none of which exist yet, as opposed to the rich morphological, syntactic and lexical analyses readily available for Greek and Latin (e.g. in the Perseus Digital Library, Crane et al. 2009).
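To make the two levels of resolution concrete, here is a minimal Python sketch of a text segmented both into word-level units and into morphemes. It is an illustration only, not the data model used in the talk: the class names, the part-of-speech labels and the particular word division of the toy phrase (roughly "he heard in the house") are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morph:
    form: str  # the morpheme as written
    pos: str   # hypothetical part-of-speech label, not an established tagset


@dataclass(frozen=True)
class WordUnit:
    morphs: tuple  # the morphemes that make up one word-level unit

    @property
    def form(self) -> str:
        # The word form is simply the concatenation of its morphemes.
        return "".join(m.form for m in self.morphs)


def morpheme_layer(words):
    """Flatten to (word_index, Morph) pairs so both levels stay aligned."""
    return [(i, m) for i, w in enumerate(words) for m in w.morphs]


# One possible word division of a toy phrase (an assumption; conventions differ).
words = [
    WordUnit((Morph("ⲁ", "PAST"), Morph("ϥ", "PRON"), Morph("ⲥⲱⲧⲙ", "VERB"))),
    WordUnit((Morph("ϩⲙ", "PREP"),)),
    WordUnit((Morph("ⲡ", "ART"), Morph("ⲏⲓ", "NOUN"))),
]

for index, morph in morpheme_layer(words):
    print(index, words[index].form, morph.form, morph.pos)
```

Keeping the word index next to each morpheme means a query can be phrased at either level without segmenting the text again, and a different word-division convention only changes how morphemes are grouped, not the morphemes themselves.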

Visualizing Coptic manuscript data is also non-trivial. The need for diplomatic transcriptions including line breaks, columns, pages and scribal or manuscript idiosyncrasies often conflicts with the exigencies of a consistent, normalized view of linguistic elements across documents. Typically, structural markup is done in TEI XML, while linguistic annotations (part-of-speech tagging, coreference analyses and more) are done using computational linguistics formats (e.g. PAULA XML, Dipper 2005). To tackle the problem of reconciling conflicting layers of representation we present an implementation of two Sahidic Coptic corpora taken from the writings of Shenoute of Atripe (4th–5th century) and the Coptic Apophthegmata Patrum, using the multilayer search and visualization platform ANNIS (Zeldes et al. 2009). Users can select whether searches are oriented to word forms, morphemes or diplomatic forms for purposes of setting context size (e.g. ±5 words), distance between search elements (e.g. two verbs within 3–6 morphemes) and visualization of the text. We will show how visualizations can be generated dynamically from the data using simple style sheets, including edition-like diplomatic views, normalized text and aligned translations with which users can interact (Figure 1). The talk will also suggest some directions for future work on representing the relationship between overlapping and conflicting manuscripts of the same or related works.
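The kind of distance-restricted search mentioned above (two verbs within 3–6 morphemes, shown with some surrounding context) can be sketched in a few lines of Python. This is not how ANNIS implements or expresses such queries; it is a toy illustration over a flat morpheme layer. The function names, the placeholder forms m0 to m9, the tag labels and the choice to measure distance as the difference between token positions are all assumptions made for the example.

```python
def pairs_within(tokens, pos, min_dist, max_dist):
    """Return index pairs (i, j), i < j, where both tokens carry `pos`
    and j - i lies between min_dist and max_dist (inclusive)."""
    hits = [i for i, (_, tag) in enumerate(tokens) if tag == pos]
    return [
        (i, j)
        for a, i in enumerate(hits)
        for j in hits[a + 1:]
        if min_dist <= j - i <= max_dist
    ]


def context(tokens, i, j, size):
    """Forms from `size` tokens before i up to `size` tokens after j."""
    lo = max(0, i - size)
    hi = min(len(tokens), j + size + 1)
    return [form for form, _ in tokens[lo:hi]]


# Placeholder morpheme layer: (form, hypothetical tag). Not real corpus data.
tokens = [
    ("m0", "V"), ("m1", "N"), ("m2", "PREP"), ("m3", "ART"), ("m4", "V"),
    ("m5", "N"), ("m6", "N"), ("m7", "N"), ("m8", "N"), ("m9", "V"),
]

matches = pairs_within(tokens, "V", 3, 6)
for i, j in matches:
    print((i, j), context(tokens, i, j, size=2))
```

Running the same two functions over a word-form layer or a diplomatic layer instead of the morpheme layer changes what "distance" and "context" mean, which is the choice the search interface exposes to the user.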

References

Crane, Gregory, Babeu, Alison, Bamman, David, Breuel, Thomas, Cerrato, Lisa, Deckers, Daniel, Lüdeling, Anke, Mimno, David, Singhal, Rashmi, Smith, David A., and Zeldes, Amir (2009). ‘Classics in the Million Book Library’, Digital Humanities Quarterly 3(1). Available at: http://www.digitalhumanities.org/dhq/vol/003/1/000034/000034.html

Dipper, Stefanie (2005). ‘XML-based Stand-off Representation and Exploitation of Multi-Level Linguistic Annotation’, in Proceedings of Berliner XML Tage 2005 (BXML 2005). Berlin, Germany, 39–50.

Layton, Bentley (2004). A Coptic Grammar. Second Edition, Revised and Expanded. (Porta linguarum orientalium 20.) Wiesbaden: Harrassowitz.

Zeldes, Amir, Ritz, Julia, Lüdeling, Anke, and Chiarcos, Christian (2009). ‘ANNIS: A Search Tool for Multi-Layer Annotated Corpora’, in Proceedings of Corpus Linguistics 2009. Liverpool, UK.

Slides

Or you can download the slides from here.

Video

Talk

Discussion

Or you can just download the video files (.mp4): talk (560 MB) ; discussion (205 MB).