This wiki space contains archival documentation of Project Bamboo, April 2008 - March 2013.
Please fill in the following metadata about this story (and delete this line when finished!):
This narrative is excerpted from a pre-publication draft of Conclusion: Cyberinfrastructure, the Scaife Digital Library and Classics in a Digital age, Christopher Blackwell (Furman University) and Gregory Crane (Tufts University). This paper has now been published in Digital Humanities Quarterly v3 n1 - Winter 2009. The draft was provided in response to an e-mail request from the collector that referenced Professor Crane's 4/6 presentation at the Project Bamboo Workshop 1d (Princeton) workshop and other published materials.
The scope section is provided by the collector, with input from the scholar(s), and attempts to estimate the scope of the group that performs the processes described: How broadly do the practices described in this story apply to others in same field, in related fields, etc?
Please provide some keywords that will allow us to group or cluster related stories--or aspects of stories.
1. Was this story collected for a particular Bamboo working group? If so, please include, as keywords, the appropriate group(s).
2. Suggested keywords: Does this story contain elements that could be mapped to these keywords? If so, please indicate which ones and briefly describe the mapping. Add any additional keywords in #3. (These are global keywords from this page keywords)
3. Please list additional keywords here:
4. Related Stories: Are there parts of the story that relate to other collected stories? Please provide title(s) and link to the story page.
[...] On the one hand, ePhilology emphasizes the role of the linguistic record in producing and organizing ideas and information about the ancient world. We use eClassics, by contrast, to describe Greek and Latin languages and literatures, wherever and whenever produced, as they live within our physical brains, touch our less tangible hearts and shape our actions in the world around us. [...]
[...] after decades of collection development within the field of classics, a number of services have begun to emerge, some of them actively used for years. The following services represent a core set, and should be components in any cyberinfrastructure for classical studies.
The following list offers a minimal set of services, each of which can be built with the technologies available today and each of which addresses established problems relevant to classicists in particular and many humanists. The services below largely address the problem of classification, i.e., applying a set of criteria to find and/or to label materials. Different annotation tasks admit of different levels of certainty: human readers can identify the correct transcription for print on a modern page but lexicographers will disagree on the senses of a given word. Nevertheless, these services aim at more or less deterministic, right-or-wrong answers. We do not include below clustering and other techniques that can detect patterns that require new categories. The services below reflect basic tools on which more open-ended research depends.
2.1.1 Canonical Text Services (CTS)
Canonical text services allow us to call up canonical texts by standard chapter/verse citation schemes. Christopher Blackwell and Neel Smith, working in conjunction with Harvard's Center for Hellenic Studies (CHS), have developed a general protocol for canonical text services that provides essential functions for any system that serves classicists - or any scholarly community working with canonical texts. Early modern books or MSS that defy current OCR technology can be indexed by conventional citation (e.g., this page of the Venetus A manuscript contains the following lines of the Iliad).
2.1.2 Optical Character Recognition and Page Layout Analysis
Transcription captures the keystrokes. Page layout analysis captures the logical structures implicit in the page. These logical structures include not only header, footnote, chapter title, encyclopedia/index/lexicon entry etc., but more scholarly forms such as commentary and textual notes. All disciplines have used tables to represent structured data and we need much better tools with which to convert tabular data into semantically analyzed machine actionable data. Much of the work in the Mellon funded Cybereditions Project will focus on this stage of the workflow, focusing on the problem of mining highly accurate data from OCR output of scholarly editions in Greek and Latin.
2.1.3. Morphological Analysis
Morphological analysis takes an inflected form (e.g, fecit) and identifies its possible morphological analyses (e.g., 3rd sg perfect indicative active) and dictionary entries (e.g., Latin facio, "to do, make"). David Packard developed the first morphological analyzer for classical Greek, Morph, over a generation ago. Gregory Crane began the initial work on what would become the core morphological analyzer for Greek and Latin in Perseus in 1984. Neel Smith and Joshua Kosman, then graduate students at Berkeley, extended this work and created a library of subroutines that remain part of the current code base for Morpheus. Morpheus is written in C, has been compiled on a range of Unix systems over the course of more than twenty years, and contains extensive databases of Greek and Latin inflections and stems. Of all the classics specific services with which we are familar, Morpheus is the most mature and well developed. The goal has long been to create an open source version of Morpheus. Desiderata include new documentation, modern XML formats for the stems and endings and a distributed environment whereby users can add new stems and endings.
2.1.4. Syntactic Analysis
Syntactic analysis identifies the syntactic relationships between words in a sentence; it allows us to provide quantitative data about lexicography (e.g., which nouns are the subjects and objects of particular verbs), word usage (e.g., which verbs take dative indirect objects? Where do we have indirect discourse using the infinitive vs. a participle vs. a conjunction?), style (e.g., hyperbaton, periodic composition), and linguistics (e.g., changes from SOV to SVO word order). Even relatively coarse syntactic analysis can yield valuable results when applied to a large corpus: working with our morphological analyzer and a tiny Latin Treebank of 30,000 words with which to train a syntactic analyzer, we were able to tag 54% of the untagged words correctly, but the correct analyses provided a strong enough signal for us to detect larger lexical patterns. More robust syntactic analysis based on very large treebanks can yield accuracies of 80 and 85%. Human annotators can build upon preliminary automated analysis to create treebanks, where every word's function has been examined and accounted for. Treebanks provide not only training data for automated parsing but also explanatory data whereby readers can see the underlying structure of complex sentences - a valuable instrument to support interdisciplinary researchers from fields such as Philosophy or the History of Science who are not specialists in Latin and Greek.
2.1.5. Word sense discovery
Word sense discovery automatically identifies distinctive word usage in electronic corpora. Even without syntactic analysis, collocation analysis can reveal words that are closely associated (e.g., phrases such as the English "ham and eggs") and thus identify idiomatic expressions. Jeff Rydberg Cox developed collocational analysis for the Greek and Latin texts in Perseus and the results are visible as part of the on-line Greek and Latin lexica in Perseus 3.0. Access to translations aligned to the original allows us to identify distinct senses: e.g., oratio corresponds both to English "oration" but in other instances to English "prayer." At Perseus, we have been experimenting with this technique since 2005 and have begun a project, funded by the NEH Research and Development Program, to explore methods for a Dynamic Lexicon for Greek and Latin.
2.1.6. Named entity Identification
Named entity identification provides semantic classification (e.g., is Salamis a place or a Greek nymph by that name) and then associates names with particular entities in the real world (e.g., if Salamis is a place, is it the Salamis near Athens, Salamis in Cyprus or some other Salamis?). We have developed a serviceable named entity identification system for English and have support from the Advancing Knowledge IMLS/NEH Digital Partnership to extend this work to documents about Greco-Roman antiquity. We expect more general named entity systems to supersede the system that we developed and we are therefore focusing our efforts on creating knowledge sources that will allow these more general systems to perform effective named entity identification on classical materials. Our work focuses on creating (1) a labeled training set, based on print indices, with place and personal names identified, (2) a multilingual list of 60,000 Greek and Latin names in Greek, Latin, English, French, German, Italian, and Spanish, and (3) contextual information, or in other words, which authors mention which people and places in which passages, extracted from the 19th century encyclopedias of biography and geography edited by William Smith.
2.1.7. Metrical Analysis
Metrical analysis both discovers and analyzes the underlying metrical forms of digital texts. Metrical analysis provides information about vowel quantity that can improve performance of morphological, syntactic and named entity analysis. Metrical analysis is particularly important for areas such as post-classical Latin, which have very large bodies of poetic materials that will never receive the manual analysis applied to Homer, the Athenian Dramatists, Vergil and other canonical authors.
2.1.8. Translation Support
Translation support aims at fluent translation of full text but can provide useful results at a much earlier stage of development. Thus, word sense disambiguation, a component within machine translation, helps translate words and phrases: e.g., given an instance of the Latin word oratio, word sense disambiguation identifies when that word most likely corresponds to "oration," "prayer" or some other English word or phrase. The same service also supports semantic queries such as "list all Latin words that correspond to the English word 'prayer' in particular contexts." [cf. Gist - approximate translation - an activity defined by Bamboo's Shared Services working group]
2.1.9. Cross Language Information Retrieval (CLIR)
Cross language information retrieval (CLIR) allows users to pose a query in one language (e.g., English) and retrieve results in other languages (e.g., Arabic or Chinese). For classics, CLIR is an extremely important technology because classicists are expected to work with materials not only in Greek and Latin but, at a minimum, in English, French, German and Italian. CLIR is a mature technology where the cross language queries in some competitions perform better than the monolingual baseline systems (e.g., you get better results searching Arabic with an English query than if you searched with Arabic). Classicists should be able to type queries for secondary sources in various languages such as English, French, German or Italian.
2.1.10. Citation Identification
Citation identification is a particular case of named entity identification that focuses on recognizing particular: e.g., determining whether the string "Th. 1.33" refers to book 1, chapter 33 of Thucydides, line 33 of the first Idyll of Theocritus or something else? Are numbers floating in the text such as "333" or "1.33" partial citations and, if so, what are the full citations? Primary source citations tend to be shorter and more variable in form from the bibliographic citations found in scientific publications. Perseus has, over the course of more than twenty years, extracted millions of citations from thousands of documents but the citation extractors tend to be ad hoc systems tuned for the subtly different formats by which publications represent these already brief and cryptic abbreviations. In the million book world, we need citation extractors that can recognize the underlying citation conventions of arbitrary documents and then match them to known citations on the fly (e.g, observe numerous references to Thucydides and then infer that strings such as "T. 1,33" describe Thucydides, Book 1, Chapter 33).
2.1.11. Quotation Identification
Quotation identification can recognize where one text quotes - either precisely or with small modifications - another even when there is no explicit machine actionable citation information: e.g., it can recognize "arma virumque cano" as a quotation from the first line of the Aeneid. The fundamental problem is analogous to plagiarism detection. Support from the Mellon-funded Classics in the Million Book Library study allowed us to begin work on exploring quotation identification techniques.
2.1.12. Translation identification
Translation identification builds on both CLIR and quotation identification to identify translations, primary but not exclusively, of Greek and Latin texts that are on-line in large digital collections. These translations may be of entire works or of small excerpts.
2.1.13. Text Alignment
Text alignment services most commonly align translations with their source texts and are components of word sense disambiguation systems. Text alignment, however, serves also to create human readable links between source texts and translations that do not have machine actionable book/chapter/section/verse or other citation markers or between source texts that are tagged with different citation schemes. Text alignment is one of the priorities of the Mellon-funded Cybereditions Project at Tufts University.
2.1.14. Version Analysis
Version analysis services can collate transcriptions of manuscript sources or of different printed editions of the same work. Such services allow readers to identify which versions of a work are closest to one another, which differences are most influential, and, on a smaller scale, how the text in one passage varies in multiple editions. Version analysis can also be used for automated error correction: when two versions of a text differ and one version contains a word that does not generate a valid Greek and Latin morphological analysis, we flag that word as a possible error and associate the parseable word from the other text with it as a possible correction.
2.1.15. Markup Projection
Markup projection services, implicit in many of the services above, automatically associate machine actionable data from one source with the same passage in another source. Thus, an index might state that a reference to Salamis in passage A describes Salamis near Athens but that the reference in passage B is to Salamis of Cyprus. Markup projection services would associate those statements with all references to Salamis in various versions of passages A or B, including not only full scholarly editions but also quotations of those passages that appear in journal articles or monographs.