
This wiki space contains archival documentation of Project Bamboo, April 2008 - March 2013.


Collection Interoperability Development Tasks

For Bamboo Phase One

 

Changes represented in the latest update:  

April 13, 2011 updates by Taylor

April 14, 2011 updates by Jonathan A. Smith (JAS)

April 15, 2011 updates by Claire Stewart

April 17, 2011 updates by Tim Cole

Staffing item: NU has added software developer Xin Xiang to the NU CI effort, for the months of April-June 2011. (Taylor)
General forecast item: CMIS adapters work for Hathi content will be done by end of May. (Taylor)
General forecast item: CMIS adapters work for Fedora/Perseus content will be completed during June. Fedora adapter work depends on final preparations of the Perseus/Fedora repository at Tufts being completed before the end of May. (Taylor)
Indiana participation in CI still to be added, but expected for some subset of tasks 7-11 (based on recent conversations with Jon Dunn). 

PREAMBLE: 

This document describes the software development and related objectives for the CI teams (NU, UIUC, UW-Madison, ANU) through Fall 2011. Tasks during winter and early spring 2012 are not fully described. During 2011, CI teams will develop CMIS-based content interoperability services that will enable managed use of the following four collections:

  (1) Hathi Trust content,

  (2) selected Perseus (Fedora-based) content,

  (3) selected TCP texts, and

  (4) selected AustLit texts 

in the Work Spaces applications being developed at UW-Madison and UC-Berkeley.  

These initial, CMIS-based content interoperability services will be extensible (with additional work in adapters) to other target collections of importance to the Bamboo community of scholars and students.    However, the effort required in developing other collections adapters (beyond the above four) is probably beyond the resources available to the CI teams during Bamboo Phase One.    

The overall content interoperability architecture, as well as the collections adapters, are being developed by CI in consultation with the Bamboo Services Platform team at UC-Berkeley. The Bamboo Services Platform team has ultimate responsibility for deciding where the CI software components should reside within the overall deployment architecture of the Bamboo system.

CI teams also will collaborate with Workspaces and Tool & Services Registry teams to: 

  • model / profile descriptive & provenance item-level (i.e., work and digitized volume-level) and collection-level metadata that will be used in context of Bamboo;
  • create and iterate standard class model(s) for books & texts;
  • engage humanities scholars and enlist their help in iterating and refining the list of priority collections targeted for interoperability with Bamboo; and
  • identify content normalization and remediation transformation tools and services needed to facilitate the application of analytical tools across boundaries of distributed and disparate collections and repositories. 

Terms and Conditions:  Targets provided are approximate. Effort required to attend/support meetings, coordinate/participate in conference calls, participate in workshops, and similar is not fully accounted for in the following estimates since not fully under control of CI. 

1. Content Ingest for Hathi Trust content

 

Primary development team:                                   NU A&RT (Smith)

Total Bamboo Phase One effort level:                    3.5 weeks

Effort level made to date:                                       2.5 weeks

Anticipated start date:                                            February 14, 2011

Anticipated delivery date:                                      April 22, 2011

 

Our first objective is to use METS data and content from Hathi Trust. We will develop a utility to read METS metadata XML and upload this data as CMIS folders and documents with broken-out property values. As far as possible we should use the existing JCR-Connect XTF connector source and maintain support for CDL as well as Hathi METS files.

We have now received a sample of 20 or more items, including METS and content files, from Hathi. The initial goal is to run the program to upload the broken-out content to a CMIS repository. This repository and data will be made available to the workspace development group for testing their development work.

[1 week] Evaluation and review of Apache Chemistry / OpenCMIS code base 

[1.5 weeks] Parser for Hathi metadata files, creating CMIS folders, documents and properties via OpenCMIS client code. Unit tests. 

[0.5 weeks] Create sample repository for Workspace developers 

[0.5 weeks] Design of CMIS content types and coordinate mapping needed for raw Hathi and CDL content
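
As a rough illustration of the ingest utility described above, the following sketch reads the structMap of a METS file and builds an in-memory tree of folders and documents. It is not the project's code. The IngestNode class is a hypothetical stand-in for CMIS folders and documents (no OpenCMIS calls are made), and the rule that a div with child divs becomes a folder while a leaf div becomes a document is an assumption made for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical stand-in for a CMIS folder or document; not an OpenCMIS class. */
class IngestNode {
    final String name;
    final boolean isFolder;
    final List<IngestNode> children = new ArrayList<>();

    IngestNode(String name, boolean isFolder) {
        this.name = name;
        this.isFolder = isFolder;
    }
}

class MetsStructMapReader {
    static final String METS_NS = "http://www.loc.gov/METS/";

    /** Parses METS XML and returns the tree rooted at the first div of the first structMap. */
    static IngestNode read(String metsXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(metsXml.getBytes(StandardCharsets.UTF_8)));
            Node structMap = doc.getElementsByTagNameNS(METS_NS, "structMap").item(0);
            if (structMap == null) {
                throw new IllegalArgumentException("no structMap found");
            }
            List<Element> top = childElements((Element) structMap, "div");
            if (top.isEmpty()) {
                throw new IllegalArgumentException("structMap has no div");
            }
            return toNode(top.get(0));
        } catch (IllegalArgumentException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse METS", e);
        }
    }

    /** Assumption: a div with child divs maps to a folder, a leaf div to a document. */
    static IngestNode toNode(Element div) {
        List<Element> childDivs = childElements(div, "div");
        String label = div.getAttribute("LABEL"); // empty string when the attribute is absent
        String name = label.isEmpty() ? div.getAttribute("TYPE") : label;
        IngestNode node = new IngestNode(name, !childDivs.isEmpty());
        for (Element child : childDivs) {
            node.children.add(toNode(child));
        }
        return node;
    }

    /** Returns the direct METS-namespace child elements with the given local name. */
    static List<Element> childElements(Element parent, String localName) {
        List<Element> result = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element
                    && METS_NS.equals(n.getNamespaceURI())
                    && localName.equals(n.getLocalName())) {
                result.add((Element) n);
            }
        }
        return result;
    }
}
```

In an actual connector, each IngestNode would presumably be written to the repository through OpenCMIS client calls, with the div attributes broken out into CMIS properties.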

2. Standard Model DSL for Hathi Content

 

Primary development team:                                     NU A&RT (JAS)

Total Bamboo Phase One effort level:                     3.5 weeks

Effort level made to date:                                        2 weeks

Anticipated start date:                                             28 March 2011

Anticipated delivery date:                                       20 May 2011

 

The Standard Model is a class model that defines a set of common fields and organization for content loaded from contributing repositories. The Standard Model is a Java class model with well defined translations to and from CMIS. 

A Standard Model plug-in will be developed for each contributing repository. Each summarizing plug-in will create Standard Model objects and CMIS data from raw content fields provided by a connector. The code for each repository will be based on a shared class library and domain-specific language (DSL) designed to simplify this task. 

Claire Stewart of Northwestern Library is organizing a group to design a standard model for books and texts. 

[0.5 weeks] Initial draft of Standard Model for Hathi content (UML, Interface definitions) Note that this is a sketch, to be passed on to Claire's group for refinement and reorganization. 

[2 weeks] Code for generating Standard Model objects, including initial version of matching / object generation DSL and unit tests. 

[1 week] Second version of standard model generator reflecting input from Claire's group and others. 
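
To make the plug-in idea above more concrete, here is a minimal sketch of a Standard Model object with a translation to CMIS-style properties, plus one repository-specific plug-in. All class names, the raw field names ("title", "author", "htid"), and the "bamboo:" property ids are hypothetical and chosen only for illustration. The property ids "cmis:name" and "cmis:objectTypeId" and the base type "cmis:document" come from the CMIS specification. The DSL mentioned above is not modeled here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical Standard Model object for a book or text. */
class StandardText {
    final String title;
    final String creator;
    final String sourceId;

    StandardText(String title, String creator, String sourceId) {
        this.title = title;
        this.creator = creator;
        this.sourceId = sourceId;
    }

    /** Translates this object to CMIS-style properties keyed by property id. */
    Map<String, Object> toCmisProperties() {
        Map<String, Object> props = new LinkedHashMap<>();
        props.put("cmis:objectTypeId", "cmis:document");
        props.put("cmis:name", title);
        // Hypothetical custom property ids; a real repository would need a
        // custom object type that declares them.
        props.put("bamboo:creator", creator);
        props.put("bamboo:sourceId", sourceId);
        return props;
    }
}

/** One plug-in per contributing repository turns raw connector fields into model objects. */
interface StandardModelPlugin {
    StandardText summarize(Map<String, String> rawFields);
}

/** Illustrative plug-in; the raw field names are assumptions, not Hathi's actual schema. */
class HathiPlugin implements StandardModelPlugin {
    @Override
    public StandardText summarize(Map<String, String> rawFields) {
        return new StandardText(
                rawFields.getOrDefault("title", "(untitled)"),
                rawFields.getOrDefault("author", "(unknown)"),
                rawFields.get("htid"));
    }
}
```

Keeping the translation to CMIS properties on the model object means each repository plug-in only has to map raw fields onto the shared model.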

3. Content Locator Objects (on-demand content loading)

 

Primary development team:                                     NU A&RT (Xiang)

Total Bamboo Phase One effort level:                     3.0 weeks

Effort level made to date:                                        0.5 weeks

Anticipated start date:                                             4 April 2011

Anticipated delivery date:                                       27 May 2011

 

Content locator objects will be virtual CMIS folders and documents that will download content, and invoke translation code in response to CMIS requests. Content will be cached as needed, but will be downloaded the first time in response to a request. Content locator classes will be defined for each type of contributing repository. This code will be developed by implementing a specialized OpenCMIS server. 

[0.5 week] Design of interface for Locator Folder 

[2 weeks] Implementation of Locator Folder, including specialization for Hathi content 

[0.5 week] Setup and deployment of OpenCMIS server (for development activities only) 
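
The download-on-first-request behavior described above can be sketched as follows. This is a simplified illustration, not the planned OpenCMIS server extension: LocatorDocument is a hypothetical class, the downloader is passed in as a plain Supplier, and translation code and cache eviction are left out.

```java
import java.util.function.Supplier;

/** Hypothetical virtual document that fetches its content only when first requested. */
class LocatorDocument {
    private final String name;
    private final Supplier<byte[]> downloader;
    private byte[] cached;   // null until the first request
    private int downloads;   // counts actual downloads, for illustration

    LocatorDocument(String name, Supplier<byte[]> downloader) {
        this.name = name;
        this.downloader = downloader;
    }

    String getName() {
        return name;
    }

    /** Downloads the content on the first call and serves the cached copy afterwards. */
    synchronized byte[] getContent() {
        if (cached == null) {
            cached = downloader.get();
            downloads++;
        }
        return cached;
    }

    synchronized int downloadCount() {
        return downloads;
    }
}
```

A caller would create the object with a downloader for the remote item, for example `new LocatorDocument("page-1", () -> fetchFromRepository(id))`, where `fetchFromRepository` is a placeholder for repository-specific code.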

4. Zotero Bookmark Parsing

 

Primary development team:                                   NU A&RT (Xiang and JAS)

Total Bamboo Phase One effort level:                    1.0 weeks

Effort level made to date:                                        0 weeks

Anticipated start date:                                            23 May 2011

Anticipated delivery date (first version):                 31 May 2011

The Zotero RDF parser will process Zotero bookmark files uploaded to our hub (via a CMIS client). The software will find links in the file to known repositories and create content locator objects that link to the specified content. 

We will also explore connecting to Zotero servers. However, that work is probably beyond the resources available to CI teams during Bamboo Phase One. 

[1.0 week] Zotero bookmark file parser / URL matcher 

Note: This Zotero service is an experiment of sorts. Currently only volumes can be collected into Zotero, not bib records, which may point to multiple volumes (e.g., the volumes of a triple-decker novel). Thus, there may need to be a subsequent or related task (not sure if this is Phase One or Phase Two) to write a Zotero translator for Hathi bib splash pages. Also, Zotero RDF exports tend to provide differing data according to what the collection repository has allowed or tagged for collection into Zotero (e.g., through the use of COinS). A parser that works with Hathi may not work for Perseus, etc. 
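
The URL-matching half of this task might look something like the sketch below. It is deliberately simplified: instead of parsing RDF, it scans the uploaded text for anything that looks like an http(s) URL and checks each against registered patterns. The class names and the example patterns are assumptions for illustration, and a real parser would use an RDF/XML library and handle XML escaping such as `&amp;` inside URLs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** A link found in a bookmark file, paired with the repository it matched. */
class RepositoryLink {
    final String repository;
    final String url;

    RepositoryLink(String repository, String url) {
        this.repository = repository;
        this.url = url;
    }
}

/** Hypothetical matcher; scans raw text rather than parsing the RDF structure. */
class BookmarkUrlMatcher {
    // Any http(s) URL up to whitespace, a quote, or an angle bracket.
    private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");
    private final Map<String, Pattern> known = new LinkedHashMap<>();

    /** Registers a repository name with a regular expression that its URLs match. */
    void register(String repository, String urlRegex) {
        known.put(repository, Pattern.compile(urlRegex));
    }

    /** Returns every URL in the text that matches a registered repository, in order. */
    List<RepositoryLink> match(String exportText) {
        List<RepositoryLink> links = new ArrayList<>();
        Matcher m = URL.matcher(exportText);
        while (m.find()) {
            String url = m.group();
            for (Map.Entry<String, Pattern> entry : known.entrySet()) {
                if (entry.getValue().matcher(url).find()) {
                    links.add(new RepositoryLink(entry.getKey(), url));
                    break; // first matching repository wins
                }
            }
        }
        return links;
    }
}
```

Each RepositoryLink returned would then be the input for creating a content locator object for the matching repository.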

5. OSGi Service Development

 

Primary development team:                                     NU A&RT (JAS and/or Xiang)

Total Bamboo Phase One effort level:                     4.0 weeks

Effort level made to date:                                          0 weeks

Anticipated start date:                                                xxx

Anticipated delivery date:                                           xxx 

We will build OSGi services that will embed our connectors (obtaining and translating content into raw nodes and properties) and Standard Model generators. Our code will build on the OpenCMIS server code base to incorporate our extensions for Zotero file parsing, locator folders, content ingest, and standard model generation. 

Additional time will be required to implement authentication and authorization beyond a minimal system based on IP-based (or similar) institutional affiliation.   This additional work is not currently included in Bamboo Phase One. 

[2 weeks] Become familiar with the technologies needed to build our service: OSGi, authentication libraries, etc. 

[2 weeks] Create an OSGi bundle that adds our code to OpenCMIS server. Integrate connectors for supported repositories. Develop unit tests. 

6. Adapt JCR-Connect Fedora Connector for Perseus

 

Primary development team:                                     NU A&RT (JAS and/or Xiang)

Total Bamboo Phase One effort level:                     4.0 weeks

Effort level made to date:                                          0 weeks

Anticipated start date:                                                xxx (Start date dependent upon completion of Fedora work at Tufts)

Anticipated delivery date:                                          Sometime before June 30, 2011 (Delivery date dependent upon completion date of Fedora work at Tufts)

We will adapt the Fedora / JCR Connector developed by Northwestern for the JCR-Connect project (for Chicago History Museum content) to provide the Bamboo community with access to Fedora repositories via OSGi. The initial target collection for the Bamboo Fedora Connector is Perseus. The CI team intends to develop the Fedora Connector so that it can easily be adapted to other Fedora-based collections. 

[4 weeks] Conversion of JCR-Connect Fedora connector to use the OpenCMIS server code, including new unit tests. 

7. Create Web accessible data store of TEI P5 XML derived from TCP texts

 

Primary development team:                                     UIUC and NU Library

Total Bamboo Phase One effort level:                     6.0 weeks

Effort level made to date:                                          1 week

Anticipated start date:                                               1 March 2011

Anticipated delivery date:                                         16 May 2011 (revised from 30 April 2011; initial samples of Abbotized TEI texts were delayed)

Using small-scale sub-collections of TCP texts, prior work (MONK, SEASR, etc.) has established the benefit of encoding TCP texts in TEI-A P5 XML. As a precursor to enabling Bamboo interoperability with TCP collections, CI teams will create a persistent, Web-accessible data store containing most TCP texts encoded in TEI-A P5 XML. This will start by obtaining TCP SGML texts and a current copy of Abbot configured for use with TCP collections (from UNL). This assumes support from Martin Mueller (NU) and Brian Pytlik Zillig (Nebraska) and the agreement of TCP rights holders; we have every indication that this support and agreement will be forthcoming. Remaining work: 

[1.5 weeks] "Transform" TCP collections into P5 TEI-A XML; anticipated success rate is greater than 80% with minimal information loss. 

[1 week] Deploy a TCP collections server (data store) for TCP texts.   (Fedora?) 

[1.5 weeks] Generate and associate metadata, including provenance & linkages to TCP SGML and EBSCO, GALE, and/or Newsbank page images with data store objects. 

[1 week] Implement RESTful functionality to support TCP CMIS connector (next task), including exposure of links to raw SGML & page images. 

8. Develop CMIS Connector (Adapter) for TCP text collections

 

Primary development team:                                     UIUC

Total Bamboo Phase One effort level:                     6.0 weeks

Effort level made to date:                                          0 weeks

Anticipated start date:                                               11 May 2011 (revised from 15 April 2011; cascade delay from previous item)

Anticipated delivery date:                                         30 June 2011

We will leverage work done by NU A&RT on Hathi connector, Perseus connector, and locator service as available and as applicable. 

[1.5 weeks] Implement Apache Chemistry / OpenCMIS over TCP XML data store (take advantage of NU experience with application) 

[1.5 weeks] Parser for TCP objects & metadata files, creating CMIS folders, documents and properties via OpenCMIS client code. Unit tests. 

[2 weeks] Allow TCP folders to include (virtually) TCP HTML and TCP or EBSCO / Gale / Newsbank page images of components of objects as well as local TEI-A P5 XML representations; coordinate with NU work on Content Locator Objects where it makes sense. 

[1 week] Generate identifier mappings. Create & iterate repository access via CMIS for Workspace developers 

9. Develop CMIS Connector (Adapter) for AustLit

Primary development team:                                     UIUC & ANU

Total Bamboo Phase One effort level:                     4.0 weeks

Effort level made to date:                                          0 weeks

Anticipated start date:                                                May 2011 (revised from April 2011)

Anticipated delivery date:                                          July 2011 

This work will be coordinated with, and follow in lock-step with, the TCP Connector work described above. It is smaller in scope and simpler than the TCP work because the data is newer, slightly more homogeneous, and under slightly more consistent management (i.e., mostly U of Queensland). 

10. Develop ancillary services to facilitate interoperability / utility of TCP resources

 

Primary development team:                                     UIUC & NUL

Total Bamboo Phase One effort level:                     8.0 weeks

Effort level made to date:                                          0 weeks

Anticipated start date:                                               June 2011

Anticipated delivery date:                                         August 2011 

Texts in TCP collections are known by a range of different identifiers, in some cases at multiple levels of granularity. Thus a TCP text may be known by an EEBO, ECCO, or Evans identifier, or by an instance of a bibliographic work identified in ESTC, or .... It is likely that scholars using Bamboo will want to retrieve links to TCP texts in response to queries created using a range of bibliographic or identifier metadata. Resolving such queries would require an identifier mapping service. 

Similarly, some TCP texts remain under license. Agreements with TCP require that texts only be furnished to authenticated users affiliated with TCP member institutions or institutions having appropriate subscription access. To assure conformance, an InCommon interface (e.g., one like that built for MONK) or similar may need to be built at either the CI or Workspace level. 

Other services may also need to be built to facilitate interoperability and use of TCP resources. This task is tentatively sized on the assumption that 2 ancillary services like those described above will be required. 
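
As a small illustration of the identifier mapping idea above, the sketch below resolves an identifier in a named scheme to a single internal text id. It is an in-memory toy under stated assumptions: the class name, the way schemes are normalized, and the notion of one internal id per text are all hypothetical, and an actual service would need persistent storage and would have to deal with multiple levels of granularity.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical in-memory identifier map; a real service would be backed by a data store. */
class IdentifierMap {
    private final Map<String, String> toTextId = new HashMap<>();

    /** Builds a lookup key; scheme names are treated as case-insensitive. */
    private static String key(String scheme, String id) {
        return scheme.trim().toLowerCase(Locale.ROOT) + ":" + id.trim();
    }

    /** Records that the given identifier in the given scheme refers to textId. */
    void add(String scheme, String id, String textId) {
        toTextId.put(key(scheme, id), textId);
    }

    /** Looks up the internal text id, or returns empty when the identifier is unknown. */
    Optional<String> resolve(String scheme, String id) {
        return Optional.ofNullable(toTextId.get(key(scheme, id)));
    }
}
```

Several identifiers from different schemes can be registered against the same internal id, so a query using any of them returns the same text.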

11. Services for remediation & normalization of TCP and other texts

 

Primary development team:                                     UIUC and NU

Total Bamboo Phase One effort level:                     8.0 weeks

Effort level made to date:                                          0 weeks

Anticipated start date:                                               August 2011

Anticipated delivery date:                                         November 2011 

As demonstrated during Corpora Space Camp (March 2011), even well curated data (e.g., examples of TCP and Perseus texts in XML) can require significant normalization to be useable by certain analytical tools (e.g., Topic Map Feature Analyzer), especially in combination with texts from other dissimilar collections. To support integrated analysis across boundaries of collections and repositories, we anticipate the need for multiple such services -- for example: 

  • to chunk texts;
  • to transform (e.g., via MorphAdorner) to British National Corpus P5 TEI to facilitate certain kinds of analysis / integration with texts from other collections;
  • to normalize spelling / lemmatization;
  • to correct errors.

This task is a placeholder for the resources required to implement up to 3 such services. 
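
Of the candidate services listed above, text chunking is the simplest to illustrate. The following sketch splits plain text into chunks of a fixed number of words. The class name and the fixed-word-count strategy are assumptions for this example; chunking on structural units of the TEI markup (divisions, paragraphs) would be an equally plausible design.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical chunking helper that splits on whitespace and ignores markup. */
class TextChunker {
    /** Splits text into chunks of at most wordsPerChunk whitespace-separated words. */
    static List<String> chunk(String text, int wordsPerChunk) {
        if (wordsPerChunk <= 0) {
            throw new IllegalArgumentException("wordsPerChunk must be positive");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start += wordsPerChunk) {
            int end = Math.min(start + wordsPerChunk, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
        }
        return chunks;
    }
}
```

The last chunk may be shorter than the others when the word count is not a multiple of the chunk size.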

12. Document core item-level metadata profile

Primary development team:                                     NU Library, UW-Madison, UIUC

Total Bamboo Phase One effort level:                     6.0 weeks

Effort level made to date:                                          0.5 weeks

Anticipated start date:                                               1 January 2011

Anticipated delivery date:                                          June 2011

This encompasses: the CMIS folder object & document object model, which levies certain object-level attributes; Jonathan’s proxy object implementation incorporating that model; Workspace requirements for item-level metadata; and best practice options for associating additional metadata with objects as they are ingested and used (e.g., MARC, MODS, Dublin Core, etc.).

Progress so far is encompassed under the preliminary work to develop a standard content model for CMIS work (see item 2, above): mapping the hierarchy in Hathi METS and TCP TEI texts to a set of trees ('bag o trees') in Atom Syndication Format for the CMIS connector.

Future work: discussion of content model/Atom mapping with CI team, revision of model for texts, preliminary modeling for other content types. Develop recommendations for core metadata sharing and placement of metadata within content chunks exposed via Atom/CMIS.

13. Document collection-level metadata & collection-level service descriptions

Primary development team:                                     UIUC, NUL, ANU

Total Bamboo Phase One effort level:                     6.0 weeks

Effort level made to date:                                          2.0 weeks

Anticipated start date:                                               1 January 2011

Anticipated delivery date:                                         June 2011 

This task would include integration of collection-level descriptions into the Tools & Services Registry and could include some experimentation with semi-automating generation of RIF-CS records. 

 

14. Scholar engagement to update and refine collections list for Bamboo Phase II

Primary development team:                                     UIUC, NUL, IU Lib, others?

Total Bamboo Phase One effort level:                     8+ weeks (plus some non Bamboo resources from UIUC, now confirmed)

Effort level made to date:                                          23.0 weeks

Anticipated start date:                                               1 Jan 2011

Anticipated delivery date:                                         Sept 2011 

Many existing digital collections contain potentially rich content for humanities scholarship, some well known, some relatively unknown. The proliferation of such resources suggests that usage of digital content in humanities research is increasing. However, in leveraging these collections and in facilitating their interoperability and use in the context of Bamboo, we need to understand more about the ways in which scholars want to use these resources, the importance they place on these resources, and the scope and reach of these resources. 

To ascertain this information, we will undertake a needs assessment. This will inform ongoing collection development during Bamboo Phase I, lay the groundwork for a robust and well-received Phase II of work, and allow CI to continue to refine and augment the prioritized list of collections to target for Bamboo CI Standards adoption (a Phase I deliverable). 

The main thrust of this task will be to reach out to identified Bamboo Partner (and CIC) scholars via Web surveys, one-on-one interviews, and focus group interactions. We will identify scholars to contact through librarians, DH Centers, and other Bamboo participants. We will leverage prior work in this area, e.g., outcomes of Bamboo Project Planning Workshops, the Oxford BVREH User Survey (http://bvreh.humanities.ox.ac.uk/news/Survey_outcomes.html), etc. We plan to supplement committed in-kind and funded Bamboo time with local resources (e.g., funding from UIUC Library's Research & Publication Committee). 

[2-4 weeks] Establish team to do this work (e.g., librarians at Illinois, Indiana, NUL, ...) & draft necessary IRB paperwork (multiple institutions).   

[3 weeks] Draft and iterate instruments (overlapping with previous task) 

[1 week] Implement Web survey form. 

[2-4 weeks] Follow-up interviews & small-scale focus groups (includes supplemental resources) 

[4 weeks] Analyze and report out data (includes supplemental resources) 
