This wiki space contains archival documentation of Project Bamboo, April 2008 - March 2013.

This was written by Bruce Barton in December 2011.


Towards the goals of scalable reading and collaborative curation, we will build an application in BTP Phase II to support the curation of text corpora. We aim to facilitate corpora curation in two ways: 1) through the application of text exploration — scalable reading — techniques to assist curators in discovering locations in texts that merit attention; and 2) through crowd sourcing when human judgment is needed in curatorial decision making.

Executing curation workflows in a Text Curation Shop

The curation of texts to produce what Martin Mueller has called diplomatic quality editions involves a number of steps: reading, closely or at a distance; annotation to mark features or errors; collaboration or consultation to agree on emendations; the preparation of revisions; placing a revision for others to access; and documenting what has been done and the relations between the now several distinct manifestations of the text. Each of these steps may involve the use of one or more tools and may produce a number of artifacts: annotations, assertions, discussions, manifestations of the text, and so on. Think of the orchestration of these steps and artifacts as a scholarly workflow. Different workflows involving different tools will be appropriate to different curation strategies.

Think of the application we will build in Phase II as a Text Curation Shop, in which tools are available to perform various tasks and in which the arrangement of tools makes it easy to move a piece — a text — through the shop. As in a custom furniture shop, the tools are provided by different tool makers and specialized tools may be brought into the shop when needed. Efficiency comes through the organization and arrangement of the tools and staging areas for work in progress. In the Text Curation Shop, the arrangement of tools can be adjusted to suit the workflow, and scholarly data management performs the staging area function. Workflows may be executed by hand for small jobs; large scale, repetitive jobs can be automated.

How Phase I capabilities contribute

Much of the infrastructure for the Text Curation Shop is provided through the capabilities of the BTP Phase I ecosystem. Scholars bring raw materials — raw OCR, page images, and so on — into the shop either by uploading them from local sources or by retrieving them through the Collections Interoperability Hub. Tools deployed in a Bamboo Work Space or on the Bamboo Services Platform are applied to the materials to produce derivative objects. (The BSP may also mediate access to remote tools.) Coordination among workers is facilitated through the collaboration tools present in a Work Space. The Bamboo Persons, Profiles, Policies, and Groups services help to sort out questions of who and whose: who did or who may do what sorts of things, and to whom do objects belong.

What is the Shop? It is easy to push a metaphor too far, but we could say that the Bamboo ecosystem as a whole is the Shop.

While we're pushing a metaphor, we might note that some tasks can be farmed out to be performed using tools for which deployment in the ecosystem provides no advantage, or no advantage relative to costs of that deployment. Bamboo doesn't have to own every aspect of the job, and shouldn't.

Data management capabilities added in Phase II

Phase I infrastructure does not provide the capability to manage scholarly data. Phase I infrastructure does provide for storing objects in local object stores. But managing objects for scholarly purposes requires more than merely storing them. We must also record how objects are related: when a text has been derived from another or several others through some process, we need to record the relationships between the texts and to document the process. And when the process produces a text with particular attributes, say "diplomatic quality edition" or "variant 1b", we need to express those attributes as metadata. Metadata of all sorts are curated and need to be managed as well. Expressions of relationships between texts, descriptions of processes used, and metadata form the chains of evidence that establish a particular manifestation of a text as authoritative or as suitable for some purpose.
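The record keeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Derivation`, `ProvenanceStore`, `register`, `derive`, and `chain_of_evidence` are hypothetical and do not correspond to any Bamboo service interface. The sketch shows metadata attached to each text, a record of which texts a new text was derived from and by what process, and a walk back through those records.

```python
from dataclasses import dataclass, field


@dataclass
class Derivation:
    """Records that `target` was produced from `sources` by `process`."""
    target: str
    sources: list
    process: str


@dataclass
class ProvenanceStore:
    """Hypothetical in-memory store; a real service would persist these records."""
    metadata: dict = field(default_factory=dict)
    derivations: dict = field(default_factory=dict)

    def register(self, ident, **attributes):
        # Attach descriptive metadata (e.g. label="variant 1b") to a text.
        self.metadata[ident] = dict(attributes)

    def derive(self, target, sources, process, **attributes):
        # Register the new text and document how it was produced.
        self.register(target, **attributes)
        self.derivations[target] = Derivation(target, list(sources), process)

    def chain_of_evidence(self, ident):
        """Walk back from `ident` toward its raw sources, one step per derivation."""
        steps, pending, seen = [], [ident], set()
        while pending:
            current = pending.pop()
            if current in seen:
                continue
            seen.add(current)
            step = self.derivations.get(current)
            if step is not None:
                steps.append(step)
                pending.extend(step.sources)
        return steps


store = ProvenanceStore()
store.register("ocr-raw", kind="raw OCR")
store.derive("text-v1", ["ocr-raw"], "manual correction", label="variant 1b")
store.derive("text-v2", ["text-v1"], "collation", label="diplomatic quality edition")
chain = store.chain_of_evidence("text-v2")
```

Here `chain` lists the collation step followed by the manual correction step, which is the kind of trail a reader would follow from an edition back to the raw OCR.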

In the course of curating documents and metadata about them, scholars produce a number of related artifacts: annotations of and assertions about texts or segments within texts, and records of curation events, e.g. a text correction. These, like the texts themselves, can be mined for patterns that could serve, for example, as training data for tools that automate error detection and the generation of proposed corrections.

Reputation is a third category of data to manage. Here we want to mine the records of a curator's proposals to determine how accurate the proposals are thought to be and what level of trust the curator has earned in a crowd-sourced curation application.
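One simple way to turn a record of proposals into a trust measure is a smoothed acceptance rate. The sketch below is an assumption for illustration, not a scoring rule chosen by the project: the function name `trust_scores`, the `(curator, accepted)` record shape, and the smoothing constants are all hypothetical.

```python
from collections import Counter


def trust_scores(proposals, prior_accepts=1, prior_total=2):
    """Return a smoothed acceptance rate per curator.

    `proposals` is an iterable of (curator, was_accepted) pairs. The priors
    keep a curator with very few proposals from scoring exactly 0 or 1.
    """
    accepted, total = Counter(), Counter()
    for curator, was_accepted in proposals:
        total[curator] += 1
        if was_accepted:
            accepted[curator] += 1
    return {
        curator: (accepted[curator] + prior_accepts) / (total[curator] + prior_total)
        for curator in total
    }


# Hypothetical history: one curator with four accepted proposals, one with a single accepted proposal.
history = [("ana", True)] * 4 + [("ben", True)]
scores = trust_scores(history)
```

With these inputs both curators have a perfect raw record, but the longer track record earns the higher smoothed score, which is the behavior one would likely want from a reputation measure.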

A scholarly data management approach to the problem of addressability

What Mike Witmore has called the massive addressability of texts presents both technical and scholarly/social challenges. References into texts undergoing active curation are unstable, making addressing small units within the text impractical. And then there is the scholarly/social question of what the units should be. Our work on addressability will not take a position on what units there are or should be; we aim to support whatever units of measure or structure scholars find useful by mapping units onto stable references into a text. Stability, then, is the focus of our work on addressability.

We believe that a technical solution to providing stable references into a text can be found through combining several of the capabilities that we plan to provide in Phase II. Disciplined version control for texts and thorough record keeping in managing scholarly data can yield stable texts, and as a by-product, stable references into texts. This is so because references are made to a particular FRBR manifestation and manifestations in our practice will be fixed. Text alignment tools provide a means of matching locations in distinct manifestations of a text when that can be done algorithmically. Naturally, attempts through text alignment to map referenced locations from one manifestation to another can fail in a number of obvious ways. But the failure is graceful: the reference to the original manifestation, which we have retained, still succeeds.
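The graceful failure described above can be illustrated with a small sketch. It uses Python's standard `difflib.SequenceMatcher` as a stand-in for a text alignment tool; the functions `map_span` and `resolve`, the manifestation identifiers, and the `(manifestation, start, end)` reference shape are assumptions made for the example, not part of any Bamboo design.

```python
from difflib import SequenceMatcher


def map_span(old_text, new_text, start, end):
    """Map old_text[start:end] to offsets in new_text.

    Returns None when the span does not lie wholly inside an unchanged region.
    """
    matcher = SequenceMatcher(None, old_text, new_text, autojunk=False)
    for block in matcher.get_matching_blocks():
        if block.a <= start and end <= block.a + block.size:
            shift = block.b - block.a
            return (start + shift, end + shift)
    return None


def resolve(reference, texts, target_id):
    """Carry a reference over to `target_id`, or fall back to the original."""
    source_id, start, end = reference
    mapped = map_span(texts[source_id], texts[target_id], start, end)
    if mapped is None:
        return reference  # graceful failure: the original reference still resolves
    return (target_id, *mapped)


# Three fixed manifestations of one line (hypothetical identifiers m1, m2, m3).
texts = {
    "m1": "Arma virumque cano Troiae qui primus ab oris",
    "m2": "Arma virumque cano, Troiae qui primus ab oris",   # comma inserted
    "m3": "Arma virumque canto Troiae qui primus ab oris",   # word altered
}
start = texts["m1"].index("Troiae")
troiae_ref = ("m1", start, start + len("Troiae"))
start = texts["m1"].index("cano")
cano_ref = ("m1", start, start + len("cano"))

moved = resolve(troiae_ref, texts, "m2")   # maps onto m2 with shifted offsets
kept = resolve(cano_ref, texts, "m3")      # cannot be mapped, stays on m1
```

The reference to "Troiae" carries over to the second manifestation with shifted offsets, while the reference to "cano" cannot be mapped onto the third and simply continues to point at the retained original.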

Does this imply that we will retain everything, every manifestation constructed during a workflow? Likely not. Some will be disposed of as the work proceeds and references to them are no longer useful. The end products of a scholar's work, manifestations at major milestones, will be blessed as published and given durable handles that can be used to create durable references at any level of granularity into the text.

How these capabilities support a non-corpora and non-curatorial focused scholarship

Although a final list of tools has not been selected for deployment in the Text Curation Shop — arguably there is never a final list in an open ended shop — we expect that many of the tools will not be specifically designed for curation. Text mining and data visualization tools support scalable reading, and produce outputs that can aid in prioritizing texts for curation or segments within texts to examine. Such tools can also be used to answer other scholarly questions. Indeed, interest in those other scholarly questions may motivate curation.

Similarly, transformations used to prepare texts for analysis will leave a trail of evidence in scholarly data management services. This trail of evidence, suitably constructed, puts us in a position to drill down from a high level analytical result to underlying source texts.

Differences in scale vs. differences in kind and broad support for humanities research

Some corpora-scale work relies on mass: a sample must contain enough texts for patterns in them to be significant. But some corpora work is really work on individual texts just applied to a great many of them. For these latter cases, corpora work is a matter of scale rather than a difference in kind.

Groups working on the cutting edge of text visualization still do the basics: they share documents; they rely on wikis and discussion forums to coordinate their efforts; their members reside at different institutions on different continents or, perhaps, have no institutional affiliation, but all can have an identity in Bamboo. The needs of a cutting-edge group overlap substantially with the needs of a group preparing a critical edition of the works of Ben Jonson. And of course, preparing a critical edition is curation on a smaller and more focused scale. The Text Curation Shop can be used at this scale as well.

What does the Text Curation Shop look like? What is the user's experience?

To catch a glimpse of the user's experience of the Text Curation Shop, let's take a look at a simple workflow we have constructed as a demonstrator. The workflow involves a small number of steps. A scholar identifies a Latin text in a remote repository and uses the CI Hub to obtain the text from the repository and to store it in the Bamboo Work Space's local object store. She can inspect the text in the Work Space's repository browser interface, and having identified a section of text she would like to work on, she can launch the workflow from the repository browser interface. At the beginning of the workflow, the text is displayed in a text editing window where light editing can be performed.

When she's satisfied with the text and is ready to have it annotated, she submits the text to an annotation service running on the Bamboo Services Platform. The service returns a document containing markup that identifies sentences present in the text and the words and punctuation in them. This document is then submitted to the Alpheios Treebank Editor. The Treebank Editor returns a list of sentence identifiers to the workflow, and a web page containing a table of the sentences is constructed.

At least this is what happens behind the scenes. From the scholar's point of view, she has submitted a web page and what was returned is a table listing each sentence in the text and a pair of links for each sentence related to the Alpheios Treebank Editor. One link launches the Treebank Editor with the sentence loaded in a new browser window. This editor is not hosted by Bamboo; it runs wherever Alpheios is hosting it. The other link initiates a call to the Treebank Editor to retrieve the results of the editing session launched earlier; these results are displayed in a workflow interface in the Bamboo Work Space and stored in the Work Space's local object store.

Several different steps, several different tools, each with its own interface, are orchestrated in an intuitive workflow. In practice, workflows may be considerably more involved than this. They may invoke many tools. They may iterate through a series of steps a number of times.

This simple workflow models what the user experience of more elaborate workflows could be in the Text Curation Shop. Different tools are presented at appropriate stages in the workflow. Some of the tools present a user interface. Some do not. The scholar can track the movement of the text through the shop and intervene when necessary. This style of loose integration allows us to adjust workflows to match the needs of the current job, adding or removing tools, and reordering steps. The capability to tinker with workflows is under the scholar's control. (Control in that wonderful sense in which household projects are under the home owner's control. For some plumbing jobs you call in the expert.)
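The loose integration described above can be pictured as an ordered list of interchangeable steps. The sketch below is a toy under stated assumptions: the step functions are hypothetical local stand-ins (the naive sentence splitter is not the annotation service, and nothing here calls the Alpheios Treebank Editor or any Bamboo interface). It shows only how a workflow might pass a text from step to step while keeping a log.

```python
import re


def split_sentences(state):
    # Stand-in for the annotation step: naive split after ., ! or ?
    parts = re.split(r"(?<=[.!?])\s+", state["text"])
    return {**state, "sentences": [p.strip() for p in parts if p.strip()]}


def assign_ids(state):
    # Stand-in for the step that returns sentence identifiers.
    ids = [f"s{i}" for i, _ in enumerate(state["sentences"], start=1)]
    return {**state, "sentence_ids": ids}


def build_table(state):
    # Stand-in for constructing the table of sentences shown to the scholar.
    return {**state, "table": list(zip(state["sentence_ids"], state["sentences"]))}


def run_workflow(steps, state):
    """Apply each step in order and record which steps ran."""
    log = []
    for step in steps:
        state = step(state)
        log.append(step.__name__)
    return state, log


steps = [split_sentences, assign_ids, build_table]
text = "Gallia est omnis divisa in partes tres. Quarum unam incolunt Belgae."
result, log = run_workflow(steps, {"text": text})
```

Because the workflow is just a list, adding, removing, or reordering tools amounts to editing `steps`, and the log gives the scholar a record of how the text moved through the shop.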

A critical component of the scholar's experience will be a view into the data we are amassing as we manage the scholarly artifacts produced in the course of the workflow. A scholar can inspect the artifacts themselves, process logs produced by tools in the course of creating the artifacts, workflow logs, the documented relationships between objects, and metadata produced or carried forward at each step of the workflow. In short, she can view the chain of evidence we have built up to document her work.
