This wiki space contains archival documentation of Project Bamboo, April 2008 - March 2013.
The description of curation workflows was proposed by Martin Mueller and Greg Crane, with participation from Robert Morrissey, Neil Fraistat, and Trevor Muñoz. The workflows were completed on 12 Aug 2011, with minor amendments on 15 Aug 2011. They are an analysis and 'remediation' of an essay written by Martin Mueller describing curation of Early Modern English texts.
The discussion of integration patterns and implied capabilities that follows was developed by Bob Taylor, Bruce Barton, Steve Masover, Tim Cole and Travis Brown, as part of planning for a second phase of technical development, in early September 2011.
On this page:
This recipe lets contributors collaborate on curation of digital surrogates of printed books. These digital surrogates have been mass-produced either through re-keying or optical character recognition (OCR) and the transcriptions are often imperfect or incomplete, sometimes both. The curation tasks in this workflow allow contributors to:
Completion or correction tasks involve direct changes to published data which are reviewed and incorporated by contributors with appropriate qualifications. Other kinds of annotations may be incorporated into the published data but may also remain private to individual researchers.
Allowing contributors to collaboratively curate imperfectly or incompletely transcribed old books helps researchers by raising the quality of digital surrogates to a level that will gain acceptance by scholarly communities, while reducing the time cost of curatorial labor to any individual contributor. A well-designed workflow directs the crowd, or more accurately individuals in the crowd, to problems that need attention and match their skills. This reduces the time cost of getting to a particular problem and focuses the attention of curators on the tasks that require human intervention, while relieving them of the burden of keeping track of what they have done: the system does the work of recording each curatorial act.
This recipe covers curation of full-text transcriptions of books whether transcribed by manual re-keying or OCR. This workflow does not address the re-integration of curated texts with the holdings of contributing repositories or the long-term storage of materials after that point. Contributing back curated data is an area where further work is required.
Texts will be tokenized and tagged with part-of-speech (PoS) information on demand. If a contributor works on a text that has not been previously curated, pre-processing will occur first. If a text has been previously curated, tokenization and PoS tagging will not be repeated; instead, the prepared text will be served to the curation interface.
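The tokenize-once-then-serve behavior described above can be sketched as a simple cache keyed by text ID. Everything here is illustrative and not part of any Bamboo API: the function names (`tokenize`, `pos_tag`, `serve_text`) and the trivial tagger stand in for a real linguistic pipeline.

```python
CACHE = {}  # text_id -> prepared (tokenized, PoS-tagged) text

def tokenize(raw_text):
    # Naive whitespace tokenization; a real pipeline would use a
    # linguistically informed tokenizer.
    return raw_text.split()

def pos_tag(tokens):
    # Placeholder tagger: marks every token as unknown ("UNK").
    return [(tok, "UNK") for tok in tokens]

def serve_text(text_id, raw_text):
    """Return the prepared text, pre-processing only on first request."""
    if text_id not in CACHE:
        CACHE[text_id] = pos_tag(tokenize(raw_text))
    return CACHE[text_id]
```

On a second request for the same text ID, the cached prepared text is served without re-running tokenization or tagging.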
The workflows described in this document will require a variety of interfaces:
The key to quality assurance lies in some basic features of modern computers that are very familiar to IT professionals but are still likely to strike ordinary scholars as quite magical. It is easy for computers to keep track of thousands of users and the millions of curatorial acts they perform, logging each one carefully: who did what, and when. It is equally possible for such computers to divide a large corpus into its individual words and treat each word as a distinct token with a unique ID. This can be done with corpora running into billions of words.
From the computer’s perspective, the engagement of a user with a piece of text is a transaction that results in a log entry to the effect that, at a particular moment in time, a userID with certain properties changed, deleted, or added properties associated with one or more wordIDs. Algorithmically produced curation can also be logged in this fashion; the userID in that case is that of a machine running an algorithm. The record of such transactions is a curation log that may run into many millions of records. Think of a vast digital expansion of the multivolume Berichtigungsliste or “correction list” that Greek papyrologists have kept of their editorial work for almost a century.
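A minimal sketch of what one such transaction record might look like. The field names here are hypothetical and assume nothing about Bamboo's actual data model; they merely reflect the userID / wordID / timestamp structure described above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CurationRecord:
    """One entry in the curation log: who changed what, and when."""
    user_id: str       # a human curator, or a machine running an algorithm
    word_ids: list     # token IDs affected by this curatorial act
    action: str        # "changed", "deleted", or "added"
    payload: dict      # e.g. {"old": "teh", "new": "the"}
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

log = []

def record(user_id, word_ids, action, payload):
    """Append one curatorial act to the curation log."""
    entry = CurationRecord(user_id, word_ids, action, payload)
    log.append(entry)
    return entry
```

An append-only list of such records is exactly the "curation log" the document treats as the fundamental management tool for quality control.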
The curation log is thus the fundamental management tool for maintaining quality control. It can in principle support quite different organizational choices for editorial review, whether highly centralized or widely distributed. In practice, it is likely that the best results will be found in distributed systems that give substantial control over editorial decisions to existing scholarly “data communities” in various scholarly societies and their sub-committees or interest groups. There are substantial technical problems that need to be solved in order to maintain a robust and flexible infrastructure that will enable such data communities to work independently while staying in touch. Such an infrastructure might be backed by a version-control system like git to avoid reimplementing parts of this functionality.
Integration here refers to the presentation and coordination of functionality to the scholar. We describe several patterns that vary along a scale ranging from a highly integrated but narrowly specialized curation application (Workbench Pattern), to a bundle of applications largely selected by the individual scholar but including a specialized tool for creating, storing, and retrieving annotations (Notebook Pattern), and on to infrastructure for managing the information used in and the scholarly artifacts produced through curation while placing the responsibility to identify suitable curation tools entirely with scholars (Plumbing Pattern). See below for a fuller description of each pattern and notes about their strengths and weaknesses.
In our discussion of curation workflows we will occasionally refer to a capability needed to support a curation step. Later in this document we list general capabilities of a text curation application realized as some combination of tools and services, along with notes about how these might be realized or supported in Bamboo's infrastructure. These include, but are not limited to, annotation and related processes, document management, and permissions management. Some capabilities, like "document editing", are not explicitly listed. Our aim is to call out those capabilities that are assumed but not described in the workflows below.
In all workflows we assume a curatorial review process in which the reporter's reputation plays a role in the evaluation of candidate error and acceptance of proposed corrections. Of course, for "errors" we should read "annotations" of whatever kind the curation process is managing.
In the Workbench Pattern, a specialized book viewer application allows a suitably privileged reader who spots a suspected error in a page transcript to verify immediately that the error has occurred by consulting the facsimile image of the page and to flag the error or "make" the correction in the transcript. The correction is posted in an annotation service, and curators watching the annotation service evaluate the annotation and make the correction.
In the Notebook Pattern, a reader is working with a document with some level of scholarly adornment using tools appropriate to the task at hand, when he spots a suspected error. He may or may not be able to verify that error in the transcript underlying the document. Still, he records an annotation in the Scholarly Notebook he keeps open on his desktop, using an addressing scheme shared between the local tools and the Scholarly Notebook. The annotation is posted in an annotation service. Curators watching the annotation service evaluate the annotation and make the correction.
In the Plumbing Pattern, developers of specialized scholarly text tools have added functionality that reports errors to the annotation service as they are identified. Such tools have a "mark as possible error" function. As in the Notebook Pattern, curators watch the annotation service.
In cascading curation, the Workbench application (the specialized book viewer used in intermittent curation) has been extended to invoke analytical tools. These tools include keyword-in-context searching for error candidates and automated candidate-error detection.
In cascading curation in the Notebook Pattern, the scholar would use his or her own local text search or concordance tools (or the search capabilities of the source collection) to identify "batches" of similar errors. It is not desirable to require the user to copy and paste the address and correction for every error individually in a large batch correction task, so cascading curation in the Notebook Pattern would rely on support in the local tools for harvesting addresses from search results.
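Harvesting addresses from search results might look like the following sketch, which turns keyword-in-context hits into a batch of correction annotations. The hit structure (`text_id`, `token_id`, `form`) and the target-address format are assumptions for illustration, not an actual Bamboo addressing scheme.

```python
def batch_corrections(kwic_hits, replacement):
    """Turn a list of keyword-in-context search hits into a batch of
    correction annotations, so the user need not copy each address by hand.

    Each hit is assumed to carry the text's ID, the token's ID within
    that text, and the erroneous surface form.
    """
    return [{
        "target": f'{hit["text_id"]}#tok-{hit["token_id"]}',
        "class": "correction-candidate",
        "body": {"old": hit["form"], "new": replacement},
    } for hit in kwic_hits]
```

A batch produced this way could be submitted to the annotation service in a single transaction, with each element remaining individually reviewable by curators.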
In the Plumbing Pattern, the developer of a corpus query or concordance tool would add functionality to their tool to submit batches of correction annotations to the Bamboo annotation services, which would be monitored by curators as in the other workflows.
In the Workbench Pattern the specialized book viewer would provide facilities (possibly through a plugin architecture) that would support specific kinds of more advanced annotation (identification of multi-word expressions, named entities, quotations, etc.).
In the Notebook Pattern, the scholar would annotate the phenomena of interest (such as named entities) with his or her chosen tools. These may include automated natural language processing tools or an XML editor such as oXygen. The Scholar's Notebook application would read the edited document and submit the annotations to the Bamboo annotation service.
In the Plumbing Pattern, the developer of an annotation tool (which may be an automated NLP tool or an interface that allows users to create annotations, for example) would incorporate functionality that would allow users to submit annotations directly to the Bamboo annotation service.
In the Workbench Pattern a specialized version of a Work Space local collection management tool would allow users to provide corrections to text metadata, indicate duplicates, or create links to an external ontology or authority file.
A simplified form of collection curation in the Scholarly Notebook Pattern might look very similar to intermittent curation: the user notices a metadata error while working with a text and enters the correction in the Notebook application. More advanced forms of collection curation (duplicate identification and reconciliation to external ontologies) would present challenges for this pattern.
In the Plumbing Pattern the developer of a tool such as Eighteenth-Century Book Tracker would add functionality that would allow the user to submit metadata corrections and links to external ontologies or authorities directly to the Bamboo annotation service.
As we plan our approach to implementing support for these workflows, we should also decide at what level we want to tackle hard problems. Are we looking to build a perfectly general solution to a problem? Are we looking to build a demonstration of one or more approaches to the problem? We list here several of the challenges that have become apparent in our discussions around these workflows.
When texts are stable, it is practical to address sections of text, e.g. a word, a page, a chapter, at any level of scale, as Mike Witmore says. Texts being curated are by definition unstable. Methods for pointing into a document to particular locations degrade in performance, as Phelps and Wilensky note, as the document changes. Yet, curation requires precision, especially if the work is done collectively where, for example, changes are suggested by citizen curators working at a distance and approved by others at a later time.
Similarly, annotation through reference to points in a text, as in the curation-by-linked-reference pattern, implies rigid designation of fixed points in a text, however much the context around those points may shift.
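One mitigation, in the spirit of Phelps and Wilensky's robust locations, is to record both a position and its surrounding context, so a pointer can be re-anchored when the text shifts. The sketch below is illustrative only; the anchor structure and context width are assumptions, not a proposed Bamboo format.

```python
def make_anchor(tokens, index, context=3):
    """Address a token both by position and by surrounding context."""
    return {
        "index": index,
        "token": tokens[index],
        "before": tokens[max(0, index - context):index],
        "after": tokens[index + 1:index + 1 + context],
    }

def resolve_anchor(tokens, anchor):
    """Try the stored index first; if the text has shifted, fall back
    to searching for the token with matching preceding context.
    Returns the token's current index, or None if it cannot be found."""
    i = anchor["index"]
    if i < len(tokens) and tokens[i] == anchor["token"]:
        return i
    width = len(anchor["before"])
    for j, tok in enumerate(tokens):
        if tok == anchor["token"] and tokens[max(0, j - width):j] == anchor["before"]:
            return j
    return None
```

An anchor made against one revision of a text can thus still resolve after tokens are inserted earlier in the document, which is precisely the degradation problem noted above.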
In a riff on "you break it, you buy it," "you fix it, you own it." But, Bamboo does not intend to become the alternative repository for high quality texts. In addition to clean texts, curatorial scholarship produces annotations expressing unresolved assertions about variant readings and so on. Where should the enriched texts and scholarly apparatus live?
See Schmitz, Patrick. The CONCUR framework for community maintenance of curated resources.
Here we mean grading as in sorting apples by quality and size. As the volume of candidate error corrections grows — as we hope it would with citizen curation — we anticipate the need to sort candidates algorithmically into priority queues for review and acceptance. We list reputation management among the capabilities implied in curation. Reputation is one dimension one might consider. Utility, however that is defined, is another. How much effort should we expend on tuning queue management, and how generalizable is this, given that utility is specific to a domain?
In this pattern all functionality in the curation workflow is available within a single application. The scholar is reading a document to curate it or for some other purpose, notices an error, pops up the facsimile, annotates the error and the proposed correction, returns to reading .... A curator is just someone who, using the same tool, can review proposed corrections (along with citizen curator reputations) and can accept the correction. The tool writes to the curation log.
The workbench calls the backing services that make collaboration possible.
With respect to user experience this is tight integration. The scholar is given little choice in which tools to use. The tool is likely limited to a narrow range of curatorial functions.
A single application presents a rich environment in which to provide user interfaces well-tuned to specific curation tasks. For a user who finds this single environment well-suited to her task, it is convenient to perform the full range of curation tasks within a single, familiar context. Writers use environments of this type when they compose using a fully-featured word processing program; software developers are familiar with environments of this kind if they are frequent users of Integrated Development Environments (IDEs).
It is difficult (and therefore expensive) to generalize an environment to support a variety of tasks and functions. It is even more difficult to do so while maintaining a low 'barrier to entry' for those who must become familiar with an environment in order to use it.
A 'deep and narrow' application suited to operate on a single corpus obtained from a single repository, and to fully support a tightly constrained set of simple curation tasks (e.g., suggest corrections to algorithmically-generated OCR), would be more easily realized than one suited to operate on multiple corpora pulled from multiple repositories in the service of a variety of simple and complex curation tasks.
To implement a generalizable environment 'from scratch' would be far beyond Project Bamboo's means (we are not equipped to build applications as complex as MS-Word or Eclipse).
To realize a 'single application' as a set of plug-ins or widgets that can be deployed to a container – such as a Work Space platform – begins to bleed into other integration patterns: particularly, a "Scholar's Notebook Pattern," as described below, with some 'under the hood' benefit realized by running diverse components in a single application framework. However, these 'benefits' would not deliver on the substantive promise of a single application: the 'seamless' user experience of a cohesively designed, fully-implemented, soup-to-nuts curation environment. As Martin Mueller put it in his essay Collaboratively Curating Early Modern English Texts, "It is a challenging task for interface designer to build an environment that supports 'curation en passant.'"
The sum of these constraints suggests that achievable results following this pattern are likely to promise more than Project Bamboo can actually deliver.
In this pattern, the user goes about her scholarly business using best-of-breed tools (or, more likely, the tools she can use most effectively). She notices an error. She has been using her Bamboo work space to manage her content, and it may be that some of the tools she is using run in the work space. She has her Scholar's Notebook open in her Bamboo work space. She records the error, the location of the error, the encoding level, the error type, and the proposed correction. Scholar's Notebook is backed by an annotation/assertion store. Let's imagine that the location can be harvested easily from the tools she is using.
In another context curators are working with texts to correct them. The backing annotation/assertion store has a recommendation engine that assigns value to each correction based on a calculation of the importance of the source in the corpus, the type of correction, and the recommender's reputation. Possible corrections are organized by value rank and gathered into bundles by source text. The curator loads Scholar's Notebook and selects curation mode. A queue of suggestions tuned to her expertise is presented. The curator picks off suggestions from the queue, makes the correction, and grades the recommendation; that grade contributes to the recommender's reputation. One could imagine levels of approval.
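The recommendation engine's value calculation and the resulting review queue can be sketched as a weighted score fed into a priority queue. The weights and field names below are hypothetical; tuning them is exactly the open "grading" question raised later in this document.

```python
import heapq

# Hypothetical weights; how these should be tuned is an open question.
WEIGHTS = {"source_importance": 0.5, "correction_type": 0.2, "reputation": 0.3}

def score(suggestion):
    """Value of a suggested correction: a weighted combination of the
    source's importance in the corpus, the type of correction, and the
    recommender's reputation."""
    return (WEIGHTS["source_importance"] * suggestion["source_importance"]
            + WEIGHTS["correction_type"] * suggestion["type_weight"]
            + WEIGHTS["reputation"] * suggestion["reputation"])

def build_queue(suggestions):
    """Yield suggestions highest-value first, as a curator's review queue."""
    heap = [(-score(s), i, s) for i, s in enumerate(suggestions)]
    heapq.heapify(heap)
    while heap:
        _, _, s = heapq.heappop(heap)
        yield s
```

A real queue would additionally be filtered by the curator's expertise and bundled by source text, as described above.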
The correction tool is tuned to making corrections at the encoding level at which the curator is working.
Workflow integration in this pattern occurs in the heads of the participants.
A wide range of tools already familiar to scholars (and citizen-scholars) remain at front-and-center in the work of obtaining, analyzing, examining, addressing, and emending elements of texts of interest, presenting no added 'barrier to entry' in any aspect of a curation workflow other than recording a curation event. This allows the fullest possible freedom in selection of tools best suited to particular scholarly and exploratory tasks, as well as user preferences, except insofar as participants in curation workflows need to adjust to new and 'out of flow' steps and tools in order to accomplish the pivotal task of curating.
This pattern has the further advantage to Project Bamboo of clearly limiting responsibility and investment to a value-adding set of functionality that is closely bound to modeling, storing, and harvesting records of curation events.
The need to adapt to a new tool and/or process that takes a scholar, reader, or curator out of her familiar contexts is a disincentive to participation in collaborative curation. As Martin Mueller put it in his essay Collaboratively Curating Early Modern English Texts, "The easier it is to switch between exploration and curation the easier it will be to engage scholars in the work of collaborative curation."
Another weakness is that there are limits on how loosely integrated the tools and the Notebook can be; for example, they need to share at least an addressing scheme, since the Notebook must be able to understand the "location of the error" that the user reports.
In this pattern, the key piece of glue that drives workflow is an annotation reference resolver. Given a text context, the resolver can gather annotations/assertions in the neighborhood of that context by calling a method on the backing annotation service's API.
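A resolver of this kind might behave like the following sketch. The store here is a local list of annotation dicts standing in for the backing annotation service's API, and the field names (`target`, `start`, `end`) are assumptions for illustration.

```python
def annotations_near(store, target_uri, start, end, margin=50):
    """Gather annotations in the neighborhood of a text context.

    `store` is any iterable of annotation dicts with `target`, `start`,
    and `end` keys; a real resolver would call a method on the backing
    annotation service's API instead of scanning a local list.
    """
    lo, hi = start - margin, end + margin
    return [a for a in store
            if a["target"] == target_uri
            and a["start"] <= hi and a["end"] >= lo]
```

Any tool that can express "the span the user is currently looking at" as a target plus offsets could use such a call to pull in nearby annotations.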
Our work in this pattern is to provide the backing service and simple demonstration clients.
Integration in this pattern amounts to passing around annotation references among services and tools. This is a looser approach to integration. Functionality is distributed across tools and services. Some of the tools run in work spaces or use work spaces functionality to manage content.
This pattern permits annotations managed by Bamboo services to be put to a nearly boundless range of uses. Some of these uses can be manifested in Bamboo services and research environments, but none need be.
Also, this pattern offers the greatest advantage to Project Bamboo with respect to limiting responsibility and investment to a value-adding set of integratable functionality that is (a) closely bound to modeling, storing, and harvesting records of curation events; yet, (b) not bound to any particular application or interface, with the exception of simple demonstration clients. It enables developers of a wide range of tools already familiar to scholars to add value to their own software by presenting Bamboo-enabled annotation management within their own interfaces and workflows.
The utility of Project Bamboo's investment depends not only on adoption by interested scholars, citizen-scholars, and curators, but also on adoption by the developers of the tools these individuals employ. While the risk implied by this dependency can be mitigated through adoption by Bamboo-built services and research environments, it is important to note that this form of hedging may begin to bleed into the Workbench or Scholar's Notebook integration patterns described above.
In this pattern the user is reading a text or viewing metadata for a text. He recognizes an error or variant spelling/reference to an entity. Rather than propose a correction, he inserts a link from the erroneous or variant text or metadata to an authoritative ontology or reference supporting the proposed correction.
Subsequently a curator determines whether to make a correction or to formalize the link. The latter action, for example, could be taken in the instance that the variant spelling of a name matches what was printed but is not the generally recognized (normalized) way of spelling the name. A link could also be left without further correction if there remains uncertainty in the community as to whether the current transcription is in error.
Links might also be used to collocate or address relationship issues useful for organizing and facilitating discovery/use, such as to relate a manifestation to an expression or an item (instance) to a manifestation. Though a bit beyond the scope of the reference documents, these are in actuality important facets of curation.
Same as Plumbing Pattern or annotation reference resolution, above.
Same as Plumbing Pattern or annotation reference resolution, above.
Delivery in Bamboo
An application to support scholarly annotation. Minimally, there is a backend annotation store that holds annotations generated and consumed by a group of collaborating scholars. We imagine clients of this backend annotation store that allow scholars (whether faculty or citizen-scholars) to add annotations of texts, and clients that allow other scholars to review annotations. Annotations may be applied to addressable elements of a textual object. The annotation store must support classing (RDF typing) of annotations, structured annotation bodies, annotation bodies by reference (URI), and annotation properties such as creator, time/date created, etc.
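A minimal annotation record covering those requirements might look like the sketch below: an RDF-style class, a structured body or a body by reference (URI), and provenance properties. The field names are illustrative, not a proposed Bamboo schema.

```python
from datetime import datetime, timezone

def make_annotation(target, anno_class, body=None, body_uri=None,
                    creator=None):
    """Build an annotation record with exactly one of a structured body
    or a body by reference (URI), plus class and provenance properties."""
    if (body is None) == (body_uri is None):
        raise ValueError("provide exactly one of body or body_uri")
    return {
        "target": target,       # addressable element of a textual object
        "class": anno_class,    # RDF-style typing, e.g. a correction candidate
        "body": body,           # structured body (dict) ...
        "bodyURI": body_uri,    # ... or body by reference
        "creator": creator,
        "created": datetime.now(timezone.utc).isoformat(),
    }
```

Enforcing the body/bodyURI alternative at creation time keeps every record unambiguous for clients that filter or render annotations.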
The annotation store could be implemented or proxied as a BSP service; it could be hosted on a Bamboo work space; it will be called by clients running on a Bamboo work space. Bamboo must support multiple annotation stores (most probably external to Bamboo) and must support retrieval/filtering of annotations by target identity, annotation class, creator identity, created date, ....
This is a mechanism for marking a suggested correction as accepted. A client UI and a backing store are implied. Approval/acceptance has a context: what is accepted by a scholar for her working set of materials may not be accepted by the curators of an 'authoritative' repository (or vice versa).
Possibly an extension of the annotation capability.
A reputation is calculated on accepted vs rejected suggested corrections. Reputation, like approval/acceptance, also has context, e.g., reputation for making corrections acceptable to a given repository or set of curators. Should attributes available in the Bamboo profile also be available to curators viewing suggested corrections?
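One simple way to calculate such a reputation is a smoothed acceptance rate, regularized toward a prior so a contributor's first few suggestions do not swing the score wildly. The prior and its weight are hypothetical parameters, not values the document specifies.

```python
def reputation(accepted, rejected, prior=0.5, prior_weight=10):
    """Smoothed acceptance rate: the fraction of a contributor's suggested
    corrections that curators accepted, pulled toward `prior` so that a
    handful of early decisions does not dominate the score.

    As noted above, reputation has context; in practice one such score
    would be kept per repository or per set of curators.
    """
    total = accepted + rejected
    return (accepted + prior * prior_weight) / (total + prior_weight)
```

A new contributor with no history scores exactly the prior, and the score converges on the raw acceptance rate as the record grows.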
Linked to Bamboo Person and Profile Services. Open question: are reputations managed at the ecosystem level or in a curation application that is limited to a particular domain?
Reference to external authorities, ontologies, participation in virtual collections that span Bamboo boundaries, instantiation of relationships (e.g., between expression and manifestation).
Bamboo document stores and CI connectors must accommodate linked data attributes. Work Space and Services should leverage values.
Where the working copies of documents live.
In the Phase I model, documents are stored in a Work Space object store. The CI cache is like an HTTP cache: transparent to users, and done merely in the service of efficiency with respect to communication and network traffic between the CI Hub and the repositories to which the CI Hub mediates access.
How documents are retrieved from a source repository.
Via CI Hub to a WorkSpace object store
This is an application. (Nothing in the work flows implies real time collaborative editing at a distance.)
This application may or may not run directly in the Bamboo ecosystem. However, it obtains the objects it works on through the ecosystem and stores the revised versions of the document back to the ecosystem.
Who can view, edit, approve, etc. Likely this involves groups of users, with permissions associated with a group.
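Group-associated permissions of this kind reduce to a small lookup. The group names and actions below are illustrative placeholders, not an actual Bamboo policy vocabulary.

```python
# Hypothetical group-to-permission mapping.
GROUP_PERMISSIONS = {
    "readers": {"view"},
    "contributors": {"view", "edit"},
    "curators": {"view", "edit", "approve"},
}

def can(user_groups, action):
    """True if any of the user's groups grants the action."""
    return any(action in GROUP_PERMISSIONS.get(g, set())
               for g in user_groups)
```

In the Bamboo context, the mapping itself would live in the BSP Persons, Policies, and Groups services rather than in application code.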
BSP IAM: Persons, Policies, Groups services. Object level permissions (policies) in Work Spaces may be managed locally. However, Work Spaces users and groups are synced to BSP Persons and Groups services. Permissions and policies applicable to a repository whose owners / managers / curators evaluate and may accept artifacts of Bamboo-enabled curation are managed locally to that repository; modes by which a repository may participate in a 'curation mining' process of this kind are TBD and are likely to depend on a repository's particular requirements.
This is implemented on the local Work Space object store.
Pattern search across corpus
The application of analytic tools to a corpus to identify instances of an error pattern.
External services proxied by the BSP (in Phase One). In Phase Two analytic tools may be hosted on the BSP.