Deep Learning and Computational Authorship Attribution for Ancient Greek Texts. The case of the Attic Orators.
Posted on 16 February 2016
Talk: Mike Kestemont (University of Antwerp), Francesco Mambrini (German Archaeological Institute) and Marco Passarotti (Università Cattolica del Sacro Cuore, Milan), “Deep Learning and Computational Authorship Attribution for Ancient Greek Texts. The case of the Attic Orators”.
Date: Tuesday, 16 February 2016
Time: starting at 17:00 c.t. (i.e. 17:15)
The debate over the authorship of Ancient Greek texts is as old as Greek literary culture itself, and it has always had an important methodological dimension. Although the systematic study of the language and style of the Greek classics has played a pioneering role in philology, attribution research on Greek texts has so far remained relatively isolated from developments in computational stylometry, apart from a few scattered studies in the recent past (e.g. [13, 10, 2, 5]).
In this seminar we present the results of an ongoing collaboration in which we apply a broad range of state-of-the-art stylometric techniques to the corpus of the Attic orators, using the digitized texts included in the Perseus Digital Library [1]. The corpus encompasses the surviving works of 10 authors who were active in Athens from the last quarter of the 5th to the end of the 4th century BC. These authors remained the most canonized representatives of the genre of oratory until the end of Antiquity. The corpus is highly suitable as a test bed for stylometric experiments because of its considerable size (ca. 600K words), its uniformity in terms of genre and chronology, and the differences in personality and background between the authors it includes. It also presents a number of long-standing attribution problems that remain unsolved, such as the authorship of the “Funeral Speech” (2) or the oration “Against Andokides” (6) attributed to Lysias, or of Demosthenes’ “On the Halonnesus” (7) (see e.g. [12, 26-31] and [4, 65-98]).
In our exposition, we formalize authorship attribution as a text classification task in which an anonymous text has to be attributed to one of a series of candidate authors [11, 6, 8]. In terms of textual features, we focus on token unigrams and character unigrams, which are easy to extract but have nevertheless shown excellent performance in previous research. An important novelty is that we compare a number of established approaches in stylometry (e.g. Principal Components Analysis, Burrows’s Delta, Support Vector Machines) to a set of novel neural network approaches from Representation Learning.
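To make the classification setup concrete, here is a minimal sketch of one of the established baselines named above, Burrows’s Delta, computed over character unigram frequencies. It is an illustration only and not the pipeline used in the talk: the function names, the toy training texts and the author labels "A" and "B" are all hypothetical, and a real experiment would use many texts per author and a fixed list of the most frequent features.

```python
from collections import Counter
from statistics import mean, pstdev


def char_unigram_freqs(text):
    """Relative frequency of each alphabetic character in a text."""
    chars = [c for c in text.lower() if c.isalpha()]
    counts = Counter(chars)
    total = len(chars)
    return {c: n / total for c, n in counts.items()}


def burrows_delta(train, anonymous):
    """Rank candidate authors by Burrows's Delta (lower = closer in style).

    `train` maps a (hypothetical) author label to one training text.
    """
    profiles = {a: char_unigram_freqs(t) for a, t in train.items()}
    target = char_unigram_freqs(anonymous)
    features = sorted({c for p in profiles.values() for c in p})

    # Mean and standard deviation of each feature over the author profiles.
    stats = {}
    for f in features:
        values = [p.get(f, 0.0) for p in profiles.values()]
        sd = pstdev(values)
        if sd > 0:  # skip features that do not vary between authors
            stats[f] = (mean(values), sd)

    def z(profile, f):
        mu, sd = stats[f]
        return (profile.get(f, 0.0) - mu) / sd

    # Delta = mean absolute difference between z-scored feature frequencies.
    scores = {
        a: mean(abs(z(p, f) - z(target, f)) for f in stats)
        for a, p in profiles.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1])


# Toy usage: the anonymous text is attributed to the top-ranked candidate.
ranking = burrows_delta(
    {"A": "aaaa bbbb aaaa", "B": "cccc dddd cccc"},
    "aaab",
)
best_author = ranking[0][0]
```

The same function works on Greek text, since `str.isalpha` accepts Greek letters. Replacing `char_unigram_freqs` with a word-level counter would give the token unigram variant.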
Representation Learning, or Deep Learning, is an increasingly popular branch of Machine Learning in which neural networks are used to map input data (such as texts or images) to the correct output labels [3, 9]. Neural networks have shown outstanding performance in a variety of classification tasks (e.g. face detection in computer vision), but so far they have rarely been applied in computational authorship studies. We will show that deep nets achieve excellent performance across multiple experimental setups for this corpus, and that they rival (and often outperform) the best attribution algorithms currently available. This is true in terms of plain performance, but also in terms of interpretability, because there are many intuitive methods to visualize which latent representations a network has learned during training (compare Figs. 1 and 2). The conclusions will discuss how to read these results in light of the history of the corpus, and what new insights and research perspectives emerge on the authors and texts.
[1] Perseus Digital Library. Canonical – Greek Literature. https://github.com/PerseusDL/canonical-greekLit.
[2] C. Belcastro and Paolo Ruffolo. A mathematical classification of the Platonic corpus. Linguistica Computazionale, 20-21:1–19, 2000.
[3] Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. Unsupervised feature learning and deep learning: A review and new perspectives. CoRR, abs/1206.5538, 2012.
[4] Luciano Canfora. Discorsi e lettere di Demostene. Discorsi all’assemblea, volume 1. Utet, Torino, 1974.
[5] Christopher W. Forstall and Walter J. Scheirer. Features from frequency: authorship and stylistic analysis using repetitive sound. Journal of the Chicago Colloquium on Digital Humanities and Computer Science, 1(2), 2010.
[6] Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006.
[7] Mike Kestemont. Function words in authorship attribution. From black magic to theory? In Proceedings of the 3rd Workshop on Computational Linguistics for Literature (CLFL), pages 59–66, Gothenburg, Sweden, 2014. Association for Computational Linguistics.
[8] Moshe Koppel, Jonathan Schler, and Shlomo Argamon. Computational methods in authorship attribution. Journal of the American Society for Information Science and Technology, 60(1):9–26, 2009.
[9] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[10] Bernd Ludwig. A contribution to the question of authenticity of Rhesus using part-of-speech tagging. In Gerhard Brewka, Christopher Habel, and Bernhard Nebel, editors, KI-97: Advances in Artificial Intelligence, pages 231–242. Springer, Berlin and Heidelberg, 1997.
[11] Efstathios Stamatatos. A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556, 2009.
[12] Stephen Charles Todd. A Commentary on Lysias, speeches 1-11. Oxford University Press, Oxford, 2007.
[13] Stephen Usher and Dietmar Najock. A statistical study of authorship in the Corpus Lysiacum. Computers and the Humanities, 16(2):85–105, October 1982.
[14] Laurens van der Maaten and Geoffrey E. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.