Child pages
  • Collection Interoperability - Documentation Summary 2013

This wiki space contains archival documentation of Project Bamboo, April 2008 - March 2013.


As of 1 Feb 2013, an initial effort to mine prior documents and reports – which is NOT IN ANY WAY MEANT TO BE COMPLETE OR DEFINITIVE – was begun by Steve Masover.

These initial notes within the tentative outline structure on this page are meant as ideas thrown on the wall. It will be up to those who directed and performed the work of the Collections Interoperability area in BTP Phase One to determine what ought to stick and what ought to be omitted as inappropriate to the final CI documentation.

 

 

Humanist Scholars' Use of Digital Materials

Bamboo Planning Phase

From Sec 4.4 of the final (funded) BTP Phase One proposal to the Mellon Foundation:

To support a broad range of scholarship in the humanities, scholars desperately need for distinct digital collections of research materials to become interoperable [7]. Interoperability must extend beyond support simply for resource discovery; scholars must be able to deploy tools and services across widely distributed collections without needing to be expert in every digital format used and every brand and version of repository software extant in academia today. Digital object descriptions must be rich enough and precise enough to support scholarly reference and allow the implementation of transformation and remediation tools and services that can facilitate digital information resource reuse and recombination, while simultaneously maintaining resource provenance adequate for scholarship.

Despite a variety of technical and policy challenges, we believe that Bamboo can make a significant contribution in this area, by defining standard methods for making digital content available to web services. Where existing protocols, practices or ontologies can be leveraged, we will do so, extending and profiling current community standards as required to meet the rigorous requirements of scholars. Simultaneously we will identify gaps in existing standards, and define new technical approaches, application profiles, and best practices as necessary. We will thus develop, adopt, and publish a set of guidelines, protocols, and specifications that will help content providers enhance interoperability by taking advantage of the Bamboo platform. We will also develop services that will gather usage data from collections. In this way, we can use the platform to track scholarly activity and the ways in which scholars use content collections. This in turn will allow us to understand where the efforts of Bamboo should be focused after the initial three-year building projects.

We will work with libraries, museums, and special collections, and we will work closely with such initiatives as Fedora / DuraSpace, campus enterprise content management activities, Hathi Trust, and CollectionSpace. Many of the Bamboo partners have extensive experience in meeting challenges in this arena, and we believe they will be able to integrate content collections with various content-management technologies in new and valuable ways. [...]

 

[7] In workshop 1 of the Bamboo Planning Project, at which participants were asked about current and future needs in the digital humanities, a significant number of participants raised the issue of collections interoperability and the related theme of content and tool interoperability. Because of the importance of this issue, the Bamboo Planning Project team created a strategic focus area on “Content Interoperability Partnerships” as one of the 11 major elements of Bamboo’s 7-10 year program document, which was presented at workshop 4 of the Bamboo Planning Project. See: https://wiki.projectbamboo.org/x/jQGK. At workshop 4, approximately 30 institutional teams formally voted on which elements of this Program document Bamboo should focus on in the short term and which elements institutions were willing to lead. In both rounds of voting “Content Interoperability Partnerships” was ranked in the top 2-3 categories. See: https://wiki.projectbamboo.org/x/xYCR.

 

 

2012 Survey

Cf. BambooSurveyAndInterviews.pdf (via Tim Cole)

Bamboo Book Model

Purpose

Here, from the 31 May 2012 draft proposal to the Mellon Foundation, sec. A.3.1 (the beginning of the longer section repeated below under Future Evolution):

Collections Interoperability work during phase one made important progress towards normalizing interactions with content repositories to meet the input requirements of tools by defining a Bamboo Book Model, a flexible yet predictable organization of the media and text assets a digital book comprises, including page images, raw OCR-transcribed text, DejaVu XML, and TEI marked-up text.
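As an editorial illustration only (not taken from the proposal or from the Bamboo code base), the organization described above can be sketched as a small data structure. The class names, field names, manifestation labels, and file paths below are all assumptions chosen for the example; the actual Bamboo Book Model is defined in the Book Model documents referenced under Scope.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical manifestation kinds, named after the asset types listed above.
PAGE_IMAGE = "page_image"
OCR_TEXT = "ocr_text"
OCR_XML = "ocr_xml"
TEI = "tei"


@dataclass
class Page:
    """One page of a digital book, with whatever manifestations exist for it."""
    sequence: int
    # Maps a manifestation kind to the location of the corresponding asset.
    manifestations: Dict[str, str] = field(default_factory=dict)


@dataclass
class Book:
    """A digital book: a title plus a collection of pages."""
    title: str
    pages: List[Page] = field(default_factory=list)

    def assets(self, kind: str) -> List[str]:
        """Return assets of one kind in page order, skipping pages that lack it."""
        ordered = sorted(self.pages, key=lambda p: p.sequence)
        return [p.manifestations[kind] for p in ordered if kind in p.manifestations]


# Invented sample data: pages may arrive out of order and with uneven coverage.
book = Book(
    title="Example Book",
    pages=[
        Page(2, {PAGE_IMAGE: "pages/2.png"}),
        Page(1, {PAGE_IMAGE: "pages/1.png", OCR_TEXT: "pages/1.txt"}),
    ],
)
```

A tool that needs only page images can call `book.assets(PAGE_IMAGE)` without knowing which other manifestations a given repository supplied, which is the sense in which the organization is "flexible yet predictable".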

 

Scope

Cf. Book Model (Draft) and Book Model UML – are these the latest and correct diagrams and documentation?

Cf. July 2011 Work Report for potentially useful summary text

Also cf. CMIS Types and Paths Map for Book Model -- WORKING DRAFT (Tim will work from this – a later document and closer approximation to what was actually implemented)

 

Description

Limitations

E.g., inapplicability of "pages" to texts that pre-date the invention of the codex.

Future Evolution

 

Cf. May 2012 proposal section draft, which purportedly incorporates some feedback from Tim Cole: Sec 3_1_3 - Collections Interoperability - 2012-05-01 Draft for Comment

Cf. e-mail subject headed "CI in Bamboo Mellon Proposal: draft for QUICK review (please)" – thread extended from 5/1/2012 to 5/3/2012 for discussion on future evolution from that time, including text that might be repurposed.

Here, from the 31 May 2012 draft proposal to the Mellon Foundation, sec. A.3.1:

Collections Interoperability work during phase one made important progress towards normalizing interactions with content repositories to meet the input requirements of tools by defining a Bamboo Book Model, a flexible yet predictable organization of the media and text assets a digital book comprises, including page images, raw OCR-transcribed text, DejaVu XML, and TEI marked-up text. In phase two, this model will be extended to include algorithmically derived or normalized structural markup typical of books, poems, and plays---e.g., to recognize and make available structures such as chapters, stanzas, and acts. The model will be extended to handle variant derivatives such as alternate OCR, page images, transcriptions, morphologically-annotated TEI. In addition, sibling models of textual objects other than books – such as manuscripts, tablets, scrolls, and correspondence – will be developed to accommodate the holdings of repositories whose content will be the focus of TextShop curation and exploration functionality.

In defining and refining CI Hub adapters, Project Bamboo will also explore appropriate models for non-textual objects associated with the principal, textual focus of phase two work. These may include images (e.g., page images); musical scores (e.g., of a musical piece grounded in a textual work); geolocation data (e.g., maps and datasets that associate places with textual references to places); and digitized audio (e.g., recordings of readings, recordings of performances of associated musical works). It is worth noting that these models will, like the Bamboo Book Model which they extend, also model the relationship of a core scholarly object to associated manifestations, representations, and annotations. The Bamboo models are structured for compatibility with the CMIS standard, used by a large and increasing number of vendors and open-source projects as a basis for interoperable storage and management of content objects. It may be illustrative to compare content modeling employed by Shared Canvas. Shared Canvas models physical documents as abstract "canvases" that may be "annotated" with associated content such as digital facsimile images or text transcriptions. As the Shared Canvas data model specification states, its "modeling requirements are drawn from presentation systems use cases only," and it does not attempt to model "the intellectual work embodied within" the physical object. It therefore covers a subset of the phenomena that can be described in the Bamboo Book Model or the proposed Bamboo Textual Object Model, and is not designed to support the full range of Bamboo use cases. Where these models overlap, it is simple to convert between them; Bamboo will support such conversion with transformation services where in-scope tool interoperability requires. It is also worth noting that artifacts of scholarship (scholarly derivatives) created, recorded, or stored by researchers using Text Shop will be exposed and discoverable as Linked Open Data [...]


 

CI Hub

Purpose and Function

Cf. July 2011 Work Report for potentially useful summary text

Here, from the 31 May 2012 draft proposal to the Mellon Foundation, sec. A.3: "Adapters to external repositories in phase one were built as extensions to the Apache Chemistry Open CMIS (Content Management Interoperability Services) client and server libraries at the core of the CI Hub. These adapters form the core of the current CI Hub, which facilitates retrieval of digitized texts, components of digitized texts, and metadata describing these. The differences between content retrieved from multiple repositories are smoothed by transforming repository specific content models to a single normalized model of the book, the phase one Bamboo Book Model. This normalization makes content interoperable from the point of view of tools that are aware of the model. At the same time, the CI Hub preserves the source repository's content model and can deliver that to tools able to work with the repository's native content."

On creation of CI Hub adapters, from the 31 May 2012 draft proposal to the Mellon Foundation, sec. A.3.4 (virtually the same in the 15 Nov 2012 draft, sec 11.3): 

While the costs and benefits of the former approach will differ among repositories and may depend on their extant APIs for content access, it may be helpful to describe the process of creating CI Hub adapters in phase one in order to illustrate the scope and participants in the work. Each of the 3 CI Hub adapters created in phase one performs 2 tasks. First, each adapter maps a particular repository's content model to the Bamboo Book Model. We found that the 3 repositories integrated in phase one presented substantially different content models from each other and from the Bamboo Book Model. Second, each adapter exposes mapped content to Apache Chemistry components of the CI Hub, which enable access to the content in response to HTTP requests that conform to the CMIS specification. The CMIS specification defines a domain model plus Web Services and RESTful AtomPub (RFC 5023) bindings. By accessing, paginating, aggregating, and transforming each repository's source content to match the Bamboo Book Model and then providing access to this mapped content through a standard, already well defined set of CMIS services, the phase one CI Hub adapters simplify the job of consuming Bamboo tools and services.

The process starts with someone familiar with the candidate repository providing a mapping of its content models and access methods to corresponding features of the Bamboo Book Model. This mapping would also describe use of Bamboo properties in resulting Bamboo Book Model content. Because the process requires an understanding of the Bamboo Book Model as well, it likely involves collaboration with someone in the Bamboo Project familiar with that model. In the course of this work, participants may conclude that the candidate repository or its content will suggest models not well supported by the phase one Bamboo Book Model. In our phase one experience, cases involving modeling Classical texts from Perseus have stimulated initial discussion around evolution of the Bamboo Book Model to a more general and expansive Bamboo Text Model. For example, the poor fit of a 'pagination' model, developed to describe a manuscript or printed book, to texts that predate codex manifestations has pointed to the need for a more general Text Object Model to more fully support the range of corpora on which phase two use cases will operate.

The task of implementing an adapter involves understanding the mappings produced in the design phase of adapter development, retrieving relevant repository content, and transforming its files to Bamboo Book Model directories and files with associated property files. The three existing adapters serve as models for how to structure this program and integrate it into the CI Hub request processing. Those adapters also include useful working code for accessing Fedora repository datastreams and disseminators, parsing zip files for page images, parsing MARC bibliographic records, converting image formats, and other operations that may be called in different combinations for different content models or repositories.
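As an editorial illustration only, the adapter pattern described in the preceding paragraphs (map a repository's native content model to a normalized book model, while keeping the native content available) can be sketched as follows. This is not the CI Hub implementation, which was built on the Java Apache Chemistry libraries; `RepositoryAdapter`, `ExampleRepoAdapter`, and the record layout are hypothetical names invented for the sketch, and the CMIS exposure step is omitted.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class RepositoryAdapter(ABC):
    """Hypothetical adapter interface: fetch native content, then normalize it."""

    @abstractmethod
    def fetch_native(self, item_id: str) -> Dict[str, Any]:
        """Retrieve the item in the repository's own content model."""

    @abstractmethod
    def to_book_model(self, native: Dict[str, Any]) -> Dict[str, Any]:
        """Map the native record to the normalized book structure."""

    def retrieve(self, item_id: str) -> Dict[str, Any]:
        # Return both views so model-aware tools and native-aware tools are served.
        native = self.fetch_native(item_id)
        return {"book": self.to_book_model(native), "native": native}


class ExampleRepoAdapter(RepositoryAdapter):
    """Toy adapter for an invented repository that stores scans as a flat list."""

    def __init__(self, records: Dict[str, Dict[str, Any]]):
        self._records = records

    def fetch_native(self, item_id: str) -> Dict[str, Any]:
        return self._records[item_id]

    def to_book_model(self, native: Dict[str, Any]) -> Dict[str, Any]:
        # Paginate: give each scan an explicit 1-based page sequence number.
        pages = [
            {"sequence": i, "page_image": scan}
            for i, scan in enumerate(native["scans"], start=1)
        ]
        return {"title": native["name"], "pages": pages}


adapter = ExampleRepoAdapter({"x1": {"name": "Example Book", "scans": ["a.tif", "b.tif"]}})
result = adapter.retrieve("x1")
```

Each new repository would supply its own `fetch_native` and `to_book_model`, while consumers see the same normalized structure regardless of source.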

In phase two, Project Bamboo will welcome and provide guidance to repository owners who express interest in collection interoperability with the Project Bamboo ecosystem, where repository holdings align with community interest and the TextShop environment. We will build adapters to integrate with the following repositories to augment the phase one set: AustLit, the Oxford Text Archive, the Folger Digital Folio of Renaissance Drama, and the Shelley-Godwin Archive. The content models of such repositories will inform the evolution of the Bamboo Book Model into the Bamboo Text Model. We anticipate that some repository owners will choose to leverage the opportunity for greater standardization and expose their content via CMIS and/or in accord with the Bamboo Text Model, simplifying or in a few cases obviating all together the need for a custom, repository-specific CI Hub adapter.

The last paragraph of the text above was modified for the 15 Nov 2012 draft of the Mellon proposal to read:

We anticipate that some repository owners will choose to leverage the opportunity for greater standardization and expose their content via CMIS and/or in accord with the Bamboo Text Model, simplifying or in a few cases obviating altogether the need for a custom, repository-specific CI Hub adapter.

 

Architecture

Cf. Bill Parod's architecture document, W.I.P. in Dec 2012.

Implementation

Cf. Bill Parod's architecture document, W.I.P. in Dec 2012.

Cf. CMIS Object Models

Future Evolution

Here, from the 31 May 2012 draft proposal to the Mellon Foundation, sec. A.3.5:

We propose to refactor CI Hub adapters to external repositories in order to leverage the Apache Camel Enterprise Integration Pattern support built into the Shared Services Platform. Enterprise Integration Patterns are broadly-accepted solutions to integration requirements grounded in experience gained by senior integration developers and architects, and Apache Camel provides a framework for implementing them, permitting integration of applications that use different protocols and technologies. This framework fits the CI Hub role in the Bamboo ecosystem in that the CI Hub seeks to connect to widely heterogeneous repositories in order to access and aggregate content. Refactoring the CI Hub to Camel will better manage routing of messages and processes by relying on robust infrastructure developed and maintained by a vigorous open-source community external to Project Bamboo. This improvement in CI Hub communication with external repositories will also support uploading annotation sets on and curated versions of textual content back to the repository that owns the source content.

The phase one mode for user-specification of materials to be retrieved by the CI Hub is to construct a Zotero bookmarks file containing URLs of the desired content. This mode will be generalized in phase two, so that we can retrieve texts associated with English Short Title Catalogue (ESTC) entries and those of other catalogs.
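As an editorial illustration only, collecting the URLs of desired content from a bookmarks file might look like the sketch below. It assumes the file is an HTML bookmarks export in which each item is an anchor element with an `HREF` attribute; the source does not specify the file format, and the sample URLs are invented placeholders.

```python
from html.parser import HTMLParser
from typing import List


class BookmarkLinkParser(HTMLParser):
    """Collects the href of every anchor element, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.urls.append(value)


def extract_urls(bookmarks_html: str) -> List[str]:
    """Return all bookmark URLs found in an HTML bookmarks export."""
    parser = BookmarkLinkParser()
    parser.feed(bookmarks_html)
    parser.close()
    return parser.urls


# Invented sample in the assumed bookmarks-export layout.
sample = """
<DL><p>
<DT><A HREF="https://repository.example.org/book/1">Book one</A>
<DT><A HREF="https://repository.example.org/book/2">Book two</A>
</DL><p>
"""
urls = extract_urls(sample)
```

The resulting list is what a retrieval step would iterate over; generalizing to catalog entries, as the proposal anticipates, would replace this URL list with catalog identifiers.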

 

Client and User Experience

Repository Browser

From the 31 May 2012 draft proposal to the Mellon Foundation, sec. C.3 (Appendix):

The repository browser and local object store developed in phase one – for the HUBzero platform initially and then ported to Drupal – comprise a user interface built in JavaScript, a CMIS client built in PHP and accessed by the UI through a Drupal module API, a Java-based CMIS server that acts as a middleware in front of the object store component, and a Fedora Commons repository. The CMIS layers are built with PHP and Java releases of the open source Apache Chemistry project. The CMIS client and server implement a CMIS binding of the Bamboo Book Model. This binding is also implemented in the CMIS layer of the Collections Interoperability Hub. The repository browser provides a simple preview for local objects and objects stored in the CI Hub cache, including the various manifestations of object elements such as page images, corresponding raw OCR text, OCR marked-up DejaVu, morphologically annotated TEI, etc. The CMIS layers also support permissions management based on a user's principals: BambooPersonId and identifiers for groups to which the user belongs. Principals are gathered from the Drupal AuthNZ context noted above. Finally, in phase one we support simple editing of Dublin Core metadata for local objects.

CMIS gives us a layer of abstraction between the UI and the backing object store. Bamboo object models are bound at the CMIS layer rather than natively in the object store. In principle, then, one could swap in another backing object store, Alfresco ECM, for example, to replace Fedora Commons so long as the CMIS bindings for Bamboo object models are preserved. We will not undertake this work in phase two. Here, we only call out this possibility. This is worth noting because of the value of aligning repository technologies with local institutional practices and investments particularly in the area of the long-term preservation of research data and digital scholarly artifacts. While the Fedora Commons repository is a reasonable candidate for a backing object store and has been adopted by some universities as the backbone of their institutional repository infrastructure, it is by no means the only viable repository software.
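As an editorial illustration only, the abstraction argument above can be sketched with a minimal storage interface: code written against the interface keeps working when the backing store is swapped. The `ObjectStore` protocol and both store classes are hypothetical stand-ins, not CMIS, Fedora Commons, or Alfresco APIs.

```python
from typing import Dict, Protocol


class ObjectStore(Protocol):
    """Hypothetical minimal contract that any backing store must satisfy."""

    def get(self, object_id: str) -> bytes: ...

    def put(self, object_id: str, content: bytes) -> None: ...


class DictStore:
    """Stand-in backing store that keeps objects in a plain dictionary."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]

    def put(self, object_id: str, content: bytes) -> None:
        self._objects[object_id] = content


class NamespacedStore:
    """Second stand-in with a different internal layout (namespaced keys)."""

    def __init__(self, namespace: str) -> None:
        self._namespace = namespace
        self._objects: Dict[str, bytes] = {}

    def get(self, object_id: str) -> bytes:
        return self._objects[f"{self._namespace}:{object_id}"]

    def put(self, object_id: str, content: bytes) -> None:
        self._objects[f"{self._namespace}:{object_id}"] = content


class ContentLayer:
    """Plays the role of the abstraction layer: it sees only the interface."""

    def __init__(self, store: ObjectStore) -> None:
        self._store = store

    def save_text(self, object_id: str, text: str) -> None:
        self._store.put(object_id, text.encode("utf-8"))

    def load_text(self, object_id: str) -> str:
        return self._store.get(object_id).decode("utf-8")


# The same client code runs unchanged against either backing store.
results = []
for store in (DictStore(), NamespacedStore("bamboo")):
    layer = ContentLayer(store)
    layer.save_text("page-1", "raw OCR text")
    results.append(layer.load_text("page-1"))
```

The design point is that object models are bound at the interface rather than in any one store, so replacing the store requires no change to callers.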

 

 
