This wiki space contains archival documentation of Project Bamboo, April 2008 - March 2013.
TO BE COMPLETED
The scope section is provided by the collector, with input from the scholar(s), and attempts to estimate the scope of the group that performs the processes described: How broadly do the practices described in this story apply to others in the same field, in related fields, etc.?
Aggregate, Annotate, Consider
A story from the Mellon proposal, section III.2 COMPUTER SCIENCE:
(Preface) We have, as part of the preparation for this planning proposal, polled faculty in Berkeley's Computer Science Division and the School of Information to gauge their interest in computing for humanities research. We received detailed replies, many describing ongoing projects, from over 20 faculty members. The range of computer science research topics covered is too broad to enumerate here, so we provide one example scenario of what would be possible in the future if we can find better ways to structure and sustain partnerships:
Over the centuries Buddhist monasteries have housed thousands of "books" of
Tibetan Buddhist literature, each composed of the print from several hundred
woodblocks. As was common in the early phases of the digital revolution in the
humanities, many of these have been scanned - approximately four million page
images are available at www.tbrc.org - but many have not. First attempts at transcribing
and collating these vast collections and creating a simple index, all done
by hand, have proved expensive and time-consuming. With an investment estimated
to be in excess of $1M, less than 2% of Tibetan texts have been input.
Now in our scenario, faculty members in Computer Science, East Asian Languages
and Cultures, and South and Southeast Asian Studies team up to address
the problem. First, the digital images are stored in the campus archiving repository,
which provides improved speed of access, reduced costs, and a guarantee of
permanence. Achieving the requisite level of accuracy will itself require the development
of new OCR techniques by Computer Science Professor 1 (CS-Prof1)
guided by syntactic and semantic models co-developed with East Asian Languages
and Cultures Professor 1 (EALC-Prof1). Metadata on authorship, woodblock
location, etc., is added to the corpus.
Then, CS-Prof2 and EALC-Prof2 work together to develop a digital lexicon for
the various styles of Tibetan used in the corpus. CS-Prof2's automated grammar
learning system is used to create a probabilistic context-free grammar
for Tibetan. As the proper semantic rendering of Tibetan is highly dependent on
the mastery of a vast number of contexts and idiomatic usages, CS-Prof2's automated
system enables the development of translation tools that dramatically reduce
the amount of time required for scholars to master the language, and thus
significantly increases the quality and quantity of translations.
EALC-Prof1 is excited to discover systematic patterns in the evolution of grammatical
styles over time. Longstanding debates regarding the existence of "Old
Tibetan" dialects are resolved by means of grammatical analysis of thousands of texts
in the corpus. With the help of a number of Sanskrit-Tibetan, Tibetan-English,
and Tibetan-Chinese parallel texts, CS-Prof2, EALC-Prof1, and a linguistics researcher
use machine learning techniques to create rough translation systems that
enable automated translation among most of the canonical languages of the Buddhist
tradition. These make possible the identification of several thousand cases
where passages from the literature in one language (e.g. Tibetan Buddhist literature)
turn out to have been borrowed wholesale from another (e.g. Sanskrit Saiva
literature). These borrowings clarify many previously unresolved questions in the
development of the Buddhist tradition in Asia, as well as providing a much larger
set of parallel texts that enable more accurate translations.
The resolution of many pertinent historical questions lies in the identification and
cross-correlation of key historical figures across a range of literature. EALC-Prof2
is interested in using the texts as a historical source, and works with CS-Prof2
and CS-Prof3 to apply their information-extraction technology to pull out
basic historical assertions from the corpus, as well as from other related corpora
including those containing the writings of diverse ethnic groups of Silk Route
travelers in Central Asia from the same period.
The technology is able to create a multilingual glossary of names and places, giving
reliable identification of the many different ways in which the same name is
rendered at different times and in different languages. For example, by compiling
a large collection of assertions made about Padmasambhava, who is popularly
thought to have "brought" Buddhism to Tibet in the 8th century, EALC-Prof2 is
able to resolve long-standing questions as to the historicity and influence of this
individual. Examination of other historical sources further contextualizes the development
of Buddhism in Tibet.
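The glossary step described above, grouping the different renderings of one name, can be illustrated with a minimal sketch. It assumes names are already romanized and differ only in spacing, hyphenation, case, or diacritics; the function names `normalize_name` and `build_glossary` and the sample spellings are hypothetical and are not taken from the proposal or the corpus. Matching names across scripts or across translated forms would need far more than this.

```python
import unicodedata
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Reduce a romanized name to a crude comparison key (hypothetical helper)."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Keep letters only, so spacing, hyphens, and case do not matter.
    return "".join(ch for ch in without_marks.casefold() if ch.isalpha())


def build_glossary(mentions):
    """Group (source, name) pairs that share a normalized key."""
    glossary = defaultdict(set)
    for source, name in mentions:
        glossary[normalize_name(name)].add((source, name))
    return dict(glossary)


# Illustrative input only: these spellings are assumptions, not corpus data.
mentions = [
    ("source A", "Padmasambhava"),
    ("source B", "Padma-sambhava"),
    ("source C", "Padmasaṃbhava"),
]
glossary = build_glossary(mentions)
```

Here all three sample spellings collapse to a single glossary entry. A real system of the kind the scenario imagines would likely replace the normalization key with learned transliteration and alignment models.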
The scenario was created by the authors of the Mellon proposal (see the Preface paragraph above).