

This wiki space contains archival documentation of Project Bamboo, April 2008 - March 2013.


Collection Interoperability Development Tasks

For Bamboo Phase One

 

Changes represented in the latest update (May 13 edition):  

May 13, 2011 CI updates by Taylor in RED.

General Notes

1. Staffing item: NU has added software developer Xin Xiang to the NU CI effort, for the months of April-June 2011. (Taylor 5/13)
2. General forecast item: CMIS adapters work for Hathi content will be done by end of May. (Taylor 5/13)
3. General forecast item: CMIS adapters work for Fedora/Perseus content will be completed at NU during the month of June.  Bridget Almas (Tufts) is on track to have an initial Fedora implementation of Perseus texts online at Tufts on June 1. (Dev server, probably not a production/public server.)   In the meantime, Bridget has given NU access to a "preview implementation" of the Fedora Perseus from her work site at home. (Taylor 5/18)
4. Indiana participation in CI still to be added, but expected for some subset of tasks 7-11 (based on recent conversations with Jon Dunn).  (Cole)
5. CI conference call to be scheduled for week of June 1 (doodle forthcoming) to discuss preliminary content model and metadata requirements for texts; JAS to upload CMIS/content model drafts to wiki. (Stewart 5/19)

General Challenges for Bamboo

1. Comprehensive use of Hathi content by scholars (in Bamboo Work Spaces) is predicated upon support from HathiTrust for Bamboo's caching of content in the CI Hub. (Taylor 5/13)
2. Bamboo Phase One will not offer fine-grained access control to protected content. (Taylor 5/13)

Jonathan Smith (JAS) Updates in Green

1. We should be on target for release of the Hathi connector, the Standard Model implementation for Hathi (current draft), and locator objects on 27 May. Since there are significant interdependencies, I have adjusted the delivery date for all three items.

2. I have been devoting time to #12, since it is a major requirement for Bamboo's work.  But I am not part of the designated work team(s) for this development. Proposed revisions are welcome.

PREAMBLE: 

This document describes the software development and related objectives for the CI teams (NU, UIUC, UW-Madison, ANU) through Fall 2011. Tasks for winter and early spring 2012 are not fully described. During 2011, CI teams will develop CMIS-based content interoperability services that will enable managed use of the following four collections 

  (1) Hathi Trust content,

  (2) selected Perseus (Fedora-based) content,

  (3) selected TCP texts, and

  (4) selected AustLit texts 

in the Work Spaces applications being developed at UW-Madison and UC-Berkeley.  

These initial, CMIS-based content interoperability services will be extensible (with additional work in adapters) to other target collections of importance to the Bamboo community of scholars and students.    However, the effort required in developing other collections adapters (beyond the above four) is probably beyond the resources available to the CI teams during Bamboo Phase One.    

The overall content interoperability architecture, as well as the collections adapters, are being developed by CI in consultation with the Bamboo Services Platform team at UC-Berkeley.   The Bamboo Services Platform team has ultimate responsibility for deciding where the CI software components should reside within the overall deployment architecture of the Bamboo system. 

CI teams also will collaborate with Workspaces and Tool & Services Registry teams to: 

  • model / profile descriptive & provenance item-level (i.e., work and digitized volume-level) and collection-level metadata that will be used in context of Bamboo;
  • create and iterate standard class model(s) for books & texts;
  • engage humanities scholars and enlist their help in iterating and refining list of priority collections targeted for interoperability with Bamboo; and
  • identify content normalization and remediation transformation tools and services needed to facilitate the application of analytical tools across boundaries of distributed and disparate collections and repositories. 

Terms and Conditions:  Targets provided are approximate. Effort required to attend/support meetings, coordinate/participate in conference calls, participate in workshops, and similar is not fully accounted for in the following estimates, since it is not fully under CI's control. 

1. Content Ingest for Hathi Trust content

 

Primary development team:                                   NU A&RT (Smith)

Total Bamboo Phase One effort level:                    3.5 weeks

Effort level made to date:                                       2.5 weeks

Anticipated start date:                                            February 14, 2011

Anticipated delivery date:                                      27 May 2011 (with #2 and #3 below)

 

Our first objective is to use METS data and content from HathiTrust. We will develop a utility to read METS metadata XML and upload this data as CMIS folders and documents with broken-out property values. As far as possible we should use the existing JCR-Connect XTF connector source and maintain support for CDL as well as Hathi METS files. 

We have now received a sample of 20+ items, including METS and content files, from Hathi. The initial goal is to run the program to upload the broken-out content to a CMIS repository. This repository and data will be made available to the workspace development group to use to test their development work. 

[1 week] Evaluation and review of Apache Chemistry / OpenCMIS code base 

[1.5 weeks] Parser for Hathi metadata files, creating CMIS folders, documents and properties via OpenCMIS client code. Unit tests. 

[0.5 weeks] Create sample repository for Workspace developers 

[0.5 weeks] Design of CMIS content types and coordinate mapping needed for raw Hathi and CDL content
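The METS-parsing step above can be sketched with the JDK's own DOM parser. This is a simplified stand-in, not the connector itself: a real HathiTrust METS file is namespaced and far richer, and the extracted entries would be created as CMIS documents via OpenCMIS client calls rather than returned as a list.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class MetsSketch {
    // Collect the ID of every METS <file> element; in the connector each of
    // these would become a CMIS document with broken-out property values.
    public static List<String> extractFileIds(String metsXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(metsXml.getBytes("UTF-8")));
        List<String> ids = new ArrayList<>();
        NodeList files = doc.getElementsByTagName("file");
        for (int i = 0; i < files.getLength(); i++) {
            ids.add(((Element) files.item(i)).getAttribute("ID"));
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        // Toy METS fragment, not real HathiTrust data.
        String mets = "<mets><fileSec><fileGrp USE='ocr'>"
                + "<file ID='TXT00001'/><file ID='TXT00002'/>"
                + "</fileGrp></fileSec></mets>";
        System.out.println(extractFileIds(mets)); // [TXT00001, TXT00002]
    }
}
```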

2. Standard Model DSL for Hathi Content

 

Primary development team:                                     NU A&RT (JAS)

Total Bamboo Phase One effort level:                     3.5 weeks

Effort level made to date:                                        2.5 weeks

Anticipated start date:                                             28 March 2011

Anticipated delivery date:                                       27 May 2011

 

The Standard Model is a class model that defines a set of common fields and organization for content loaded from contributing repositories. The Standard Model is a Java class model with well defined translations to and from CMIS. 

A Standard Model plug-in will be developed for each contributing repository. Each summarizing plug-in will create Standard Model objects and CMIS data from raw content fields provided by a connector. The code for each repository will be based on a shared class library and domain-specific language (DSL) designed to simplify this task. 

Claire Stewart of Northwestern Library is organizing a group to design a standard model for books and texts. 

[0.5 weeks] Initial draft of Standard Model for Hathi content (UML, Interface definitions) Note that this is a sketch, to be passed on to Claire's group for refinement and reorganization. 

[2 weeks] Code for generating Standard Model objects, including initial version of matching / object generation DSL and unit tests. 

[1 week] Second version of standard model generator reflecting input from Claire's group and others. 
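The per-repository mapping the plug-ins perform can be illustrated as below. The field names and raw Hathi property keys here are hypothetical; the actual Standard Model is the one being iterated by Claire's group, and the real plug-ins use the shared DSL rather than hand-written getters.

```java
import java.util.HashMap;
import java.util.Map;

public class StandardModelSketch {
    // Toy Standard Model class: a common shape for content from any repository.
    static class BookVolume {
        String title;
        String printedDate;
    }

    // Stand-in for a repository-specific plug-in rule: translate raw connector
    // property names (hypothetical "MDP.*" keys) into Standard Model fields.
    static BookVolume fromHathiRaw(Map<String, String> raw) {
        BookVolume v = new BookVolume();
        v.title = raw.getOrDefault("MDP.title", "");
        v.printedDate = raw.getOrDefault("MDP.pubDate", "");
        return v;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new HashMap<>();
        raw.put("MDP.title", "Middlemarch");
        raw.put("MDP.pubDate", "1871");
        BookVolume v = fromHathiRaw(raw);
        System.out.println(v.title + " / " + v.printedDate); // Middlemarch / 1871
    }
}
```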

3. Content Locator Objects (on-demand content loading)

 

Primary development team:                                     NU A&RT (Xiang)

Total Bamboo Phase One effort level:                     3.0 weeks

Effort level made to date:                                        2.0 weeks

Anticipated start date:                                             4 April 2011

Anticipated delivery date:                                       27 May 2011

 

Content locator objects will be virtual CMIS folders and documents that will download content, and invoke translation code in response to CMIS requests. Content will be cached as needed, but will be downloaded the first time in response to a request. Content locator classes will be defined for each type of contributing repository. This code will be developed by implementing a specialized OpenCMIS server. 

[0.5 weeks] Design of interface for Locator Folder 

[2 weeks] Implementation of Locator Folder, including specialization for Hathi content 

[0.5 weeks] Setup and deployment of OpenCMIS server (for development activities only) 
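The download-on-first-request behaviour of a locator object can be sketched as a lazily cached document. The Supplier stands in for the repository-specific download and translation code; in the real system this logic would sit behind the specialized OpenCMIS server.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class LocatorSketch {
    static class LocatorDocument {
        private final Supplier<byte[]> downloader;
        private byte[] cached;

        LocatorDocument(Supplier<byte[]> downloader) {
            this.downloader = downloader;
        }

        // First CMIS content request triggers the download; later requests
        // are served from the cache.
        synchronized byte[] getContent() {
            if (cached == null) {
                cached = downloader.get();
            }
            return cached;
        }
    }

    public static void main(String[] args) {
        AtomicInteger downloads = new AtomicInteger();
        LocatorDocument doc = new LocatorDocument(() -> {
            downloads.incrementAndGet(); // simulate hitting the repository
            return "page text".getBytes();
        });
        doc.getContent();
        doc.getContent(); // cached, no second download
        System.out.println(downloads.get()); // 1
    }
}
```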

4. Zotero Bookmark Parsing

Primary development team:                                   NU A&RT (Xiang and JAS)

Total Bamboo Phase One effort level:                    1.0 weeks

Effort level made to date:                                      0.5 weeks

Anticipated start date:                                            23 May 2011

Anticipated delivery date (first version):                 31 May 2011

The Zotero RDF parser will process Zotero bookmark files uploaded to our hub (via a CMIS client). The software will find links in the file to known repositories, and create content locator objects to link to the specified content. 

We will also explore connecting to Zotero servers.   However, that work is probably beyond the resources available to CI teams during Bamboo phase One. 

[1.0 week] Zotero bookmark file parser / URL matcher 

Note:  this Zotero service is an experiment of sorts. Currently only volumes can be collected into Zotero, not bib records that may point to multiple volumes (e.g., the multiple volumes of a triple-decker novel).  Thus, there may need to be a subsequent or related task (not sure if this is Phase One or Phase Two) to write a Zotero translator for Hathi bib splash pages.   Also, Zotero RDF exports tend to provide differing data according to what the collection repository has allowed/tagged to be collected into a Zotero collection (e.g., through the use of COinS). A parser that works with Hathi may not work for Perseus, etc. 
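The URL-matching half of the parser might look like the sketch below, which scans exported text for links to a known repository. The HathiTrust page-turner pattern is illustrative only; the real parser would read the RDF properly and recognise each supported repository's link syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ZoteroLinkSketch {
    // Illustrative pattern for HathiTrust page-turner links; other known
    // repositories would each contribute their own pattern.
    private static final Pattern HATHI =
            Pattern.compile("https?://babel\\.hathitrust\\.org/cgi/pt\\?id=([\\w./$]+)");

    static List<String> findHathiIds(String exportText) {
        List<String> ids = new ArrayList<>();
        Matcher m = HATHI.matcher(exportText);
        while (m.find()) {
            ids.add(m.group(1)); // each volume id becomes a locator target
        }
        return ids;
    }

    public static void main(String[] args) {
        String rdf = "see https://babel.hathitrust.org/cgi/pt?id=mdp.39015012345678 in list";
        System.out.println(findHathiIds(rdf)); // [mdp.39015012345678]
    }
}
```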

5. OSGi Service Development

Primary development team:                                     NU A&RT (JAS and/or Xiang)

Total Bamboo Phase One effort level:                     4.0 weeks

Effort level made to date:                                          0 weeks

Anticipated start date:                                                xxx

Anticipated delivery date:                                           xxx 

We will build OSGi services that will embed our connectors (obtaining and translating content into raw nodes and properties) and Standard Model generators. Our code will build on the OpenCMIS server code base to incorporate our extensions for Zotero file parsing, locator folders, content ingest, and standard model generation. 

Additional time would be required to implement authentication and authorization beyond a minimal system based on IP-based (or similar) institutional affiliation.   This additional work is NOT currently included in Bamboo Phase One. 

[2 weeks] Become familiar with technologies needed to build our service: OSGi, authentication libraries, etc. 

[2 weeks] Create an OSGi bundle that adds our code to OpenCMIS server. Integrate connectors for supported repositories. Develop unit tests. 

6. Adapt JCR-Connect Fedora Connector for Perseus

 

Primary development team:                                     NU A&RT (JAS and/or Xiang)

Total Bamboo Phase One effort level:                     4.0 weeks

Effort level made to date:                                          0 weeks

Anticipated start date:                                                June 1, 2011 (Start date will be June 1, based upon recent assurances from Tufts that their Fedora work for Perseus is on track for first uses for Bamboo by end of May.) (Taylor)

Anticipated delivery date:                                          June 30, 2011 (Growing confidence at NU of being able to meet this delivery date.) (Taylor)

We will adapt the Fedora / JCR Connector developed by Northwestern for the JCR-Connect project (for Chicago History Museum content) to provide the Bamboo community with access to Fedora repositories via OSGi.   The initial target collection for the Bamboo Fedora Connector is Perseus.   The CI team intends to develop the Fedora Connector so that it can easily be adapted to other Fedora-based collections. 

[4 weeks] Conversion of JCR-Connect Fedora connector to use the OpenCMIS server code, including new unit tests. 

7. Create Web accessible data store of TEI P5 XML derived from TCP texts

 

Primary development team:                                     UIUC and NU Library

Total Bamboo Phase One effort level:                     6.0 weeks

Effort level made to date:                                          1 week

Anticipated start date:                                               1 March 2011

Anticipated delivery date:                                         16 May 2011 (revised from 30 April and 30 May 2011; initial samples of Abbotized TEI texts were delayed)

Using small-scale sub-collections of TCP texts, prior work (MONK, SEASR, etc.) has established the benefit of encoding TCP texts in TEI-A P5 XML. As a precursor to enabling Bamboo interoperability with TCP collections, CI teams will create a persistent, Web-accessible data store containing most TCP texts encoded in TEI-A P5 XML. This will start by obtaining TCP SGML texts and a current copy of Abbot configured for use with TCP collections (from UNL) -- this assumes support from Martin Mueller (NU) and Brian Pytlik Zillig (Nebraska) and agreement of TCP rights holders; we have every indication this support and agreement will be forthcoming. Remaining work: 

[1.5 weeks] "Transform" TCP collections into P5 TEI-A XML; anticipated success rate is greater than 80% with minimal information loss. Done for 1400+ ECCO texts

[1 week] Deploy a TCP collections server (data store) for TCP texts.   (Fedora?) Done

[1.5 weeks] Generate and associate metadata, including provenance & linkages to TCP SGML and EBSCO, GALE, and/or Newsbank page images with data store objects. Initial set of disseminators and content models created (using ECCO sample); need to refine and coordinate with outcomes of task 12

[1 week] Implement functionality in Fedora to support TCP CMIS connector (next task), including exposure of links to raw sgml & page images. Partially done

- Parod update 2011-05-13: Added text extraction Servlet and related Fedora Content Model, Service Definition, and Service deployment objects to support efficient TEI-based text extraction on TCP texts for header, TOC, page, and other transcription structure retrieval. Next steps will be to harmonize this access with page image access in the context of Bamboo Book Model for CMIS access. 

- Cole update 2011-05-18: Starting conversation with TCP (Michigan & Oxford) on longer term issues / division of what goes where; also with Phil Burns on adding morphadorned versions of texts (see also tasks 10 and 11).

- Parod update 2011-06-17: Integrated TagSoup in TEI Page access so that all TEI page fragments are well formed XML regardless of page cutoff. Added support for PageByFacs which provides well formed TEI page fragment from <pb facs=""/> based on facs value. Next steps will be to harmonize this access with page image access in the context of Bamboo Book Model for CMIS access.
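The page-access behaviour described in the updates above can be sketched as splitting a TEI stream into fragments at <pb/> milestones, keyed by facs value. This is a toy illustration using plain string scanning; the actual implementation is the text extraction Servlet with TagSoup handling namespaced, non-well-formed input.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TeiPageSketch {
    // Match a simplified <pb facs="..."/> milestone (real TEI is namespaced).
    private static final Pattern PB = Pattern.compile("<pb facs=\"([^\"]+)\"/>");

    // Return the text between consecutive page-break milestones, keyed by the
    // facs value of the milestone that opens each page.
    static Map<String, String> pagesByFacs(String tei) {
        Map<String, String> pages = new LinkedHashMap<>();
        Matcher m = PB.matcher(tei);
        String facs = null;
        int start = 0;
        while (m.find()) {
            if (facs != null) {
                pages.put(facs, tei.substring(start, m.start()).trim());
            }
            facs = m.group(1);
            start = m.end();
        }
        if (facs != null) {
            pages.put(facs, tei.substring(start).trim());
        }
        return pages;
    }

    public static void main(String[] args) {
        String tei = "<pb facs=\"p1\"/>First page text<pb facs=\"p2\"/>Second page";
        System.out.println(pagesByFacs(tei)); // {p1=First page text, p2=Second page}
    }
}
```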

8. Develop CMIS Connector (Adapter) for TCP text collections

 

Primary development team:                                     UIUC

Total Bamboo Phase One effort level:                     6.0 weeks

Effort level made to date:                                          0 weeks

Anticipated start date:                                               11 May 2011 (revised from 15 April 2011; cascade delay from previous item)   

Anticipated delivery date:                                         15 July 2011 (revised from 30 June 2011)

We will leverage work done by NU A&RT on the Hathi connector, Perseus connector, and locator service as available and as applicable. We have received the first output of the Standard Model DSL for Hathi (Book Model) and begun analysis of how to interpret it for TCP collections.

[1.5 weeks] Implement Apache Chemistry / OpenCMIS over TCP XML data store (take advantage of NU experience with application) 

[1.5 weeks] Parser for TCP objects & metadata files, creating CMIS folders, documents and properties via OpenCMIS client code. Unit tests. 

[2 weeks] Allow TCP folders to include (virtually) TCP HTML and TCP or EBSCO / Gale / Newsbank page images of components of objects as well as local TEI-A P5 XML representations; coordinate with NU work on Content Object Locators work as makes sense. 

[1 week] Generate identifier mappings. Create & iterate repository access via CMIS for Workspace developers 

- Parod update 2011-06-17: With help from Xin Xiang of A&RT, built OpenCMIS release and skeletal CMIS Connector on TCP server at UIUC, as first step in building a TCP-based CMIS Connector.

9. Develop CMIS Connector (Adapter) for AustLit

Primary development team:                                     UIUC & ANU

Total Bamboo Phase One effort level:                     4.0 weeks

Effort level made to date:                                          0 weeks

Anticipated start date:                                                June 2011 (revised from April and May 2011)

Anticipated delivery date:                                          August 2011 (revised from July 2011)

Will coordinate and follow in lock-step with the TCP Connector work described above, but with smaller scope and simpler than TCP, since the data is newer, slightly more homogeneous, and under slightly more consistent management (i.e., mostly U of Queensland). 

10. Develop ancillary services to facilitate interoperability / utility of TCP resources

 

Primary development team:                                     UIUC & NUL

Total Bamboo Phase One effort level:                     8.0 weeks

Effort level made to date:                                          0 weeks

Anticipated start date:                                               July 2011 (revised from June 2011)

Anticipated delivery date:                                         September 2011 (revised from August 2011)

Texts in TCP collections are known by a range of different identifiers, in some cases at multiple levels of granularity. Thus a TCP text may be known by an EEBO, ECCO, or Evans identifier, or by an instance of a bibliographic work identified in ESTC, or .... It is likely that scholars using Bamboo will want to retrieve links to TCP texts in response to queries created using a range of bibliographic or identifier metadata. To resolve such queries would require an identifier mapping service. Identifiers are on the ECCO TCP objects and have been flagged for indexing in SOLR. 
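The identifier mapping service described above amounts to resolving an external (EEBO / ECCO / Evans / ESTC) identifier to the canonical TCP text id. The sketch below uses a hard-coded in-memory table with made-up identifiers; a real service would be backed by the SOLR index mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

public class IdMapSketch {
    // scheme + external id -> canonical TCP text id
    private final Map<String, String> toTcp = new HashMap<>();

    void register(String scheme, String externalId, String tcpId) {
        toTcp.put(scheme + ":" + externalId, tcpId);
    }

    // Returns the TCP id, or null when the external id is unknown.
    String resolve(String scheme, String externalId) {
        return toTcp.get(scheme + ":" + externalId);
    }

    public static void main(String[] args) {
        IdMapSketch svc = new IdMapSketch();
        // Hypothetical identifiers, for illustration only.
        svc.register("ECCO", "CW0123456789", "K012345.000");
        System.out.println(svc.resolve("ECCO", "CW0123456789")); // K012345.000
    }
}
```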

Similarly some TCP texts remain under license. Agreements with TCP require that texts only be furnished to authenticated users affiliated with TCP member institutions or institutions having appropriate subscription access. To assure conformance an InCommon interface (e.g., such as built for MONK) or similar may need to be built at either CI or Workspace level. 

Other services may also need to be built to facilitate interoperability and use of TCP resources. This task is tentatively sized based on the assumption that 2 such ancillary services like those described above will be required. Phil Burns has begun looking at the ECCO TCP sample for morphadorning -- a preliminary sample has been received for test ingest into Fedora alongside unadorned texts. 

11. Services for remediation & normalization of TCP and other texts

 

Primary development team:                                     UIUC and NU

Total Bamboo Phase One effort level:                     8.0 weeks

Effort level made to date:                                          0 weeks

Anticipated start date:                                               August 2011

Anticipated delivery date:                                         November 2011 

As demonstrated during Corpora Space Camp (March 2011), even well-curated data (e.g., examples of TCP and Perseus texts in XML) can require significant normalization to be usable by certain analytical tools (e.g., Topic Map Feature Analyzer), especially in combination with texts from other dissimilar collections. To support integrated analysis across boundaries of collections and repositories, we anticipate the need for multiple such services -- for example: 

  • to chunk texts;
  • to transform (e.g., via MorphAdorner) to British National Corpus P5 TEI to facilitate certain kinds of analysis / integration with texts from other collections;
  •  to normalize spelling / lemmatization;
  • to correct errors 

This task is a placeholder for resources required to implement up to 3 such services. 
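One such service, spelling normalization, can be sketched as a substitution pass over early-modern text. The rule table here is a toy example (long s, initial v-for-u) and would mangle real texts at scale; a production service would use MorphAdorner or similar rather than a lookup table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SpellingNormSketch {
    // Toy normalization rules; ordered so long-s is replaced first.
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("\u017F", "s"); // long s (ſ) -> s
        RULES.put("vn", "un");    // crude early-modern v-for-u approximation
    }

    static String normalize(String text) {
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            text = text.replace(rule.getKey(), rule.getValue());
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(normalize("vnto the \u017Fea")); // unto the sea
    }
}
```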

12. Document core item-level metadata profile

Primary development team:                                     NU Library, UW-Madison, UIUC

Total Bamboo Phase One effort level:                     6.0 weeks

Effort level made to date:                                          2.5 weeks (JAS added +2 week)

Anticipated start date:                                               1 January 2011

Anticipated delivery date:                                          June 2011

This encompasses: the CMIS folder object & document object model, which levies certain object-level attributes; Jonathan’s proxy object implementation incorporating that model; Workspace requirements for item-level metadata; and best practice options for associating additional metadata with objects as they are ingested and used (e.g., MARC, MODS, Dublin Core, etc.).

Progress so far falls under preliminary work to develop a standard content model for CMIS (see item 2, above): mapping hierarchy in Hathi METS and TCP TEI texts to a set of trees ('bag o' trees') in Atom Syndication Format for the CMIS connector.

Future work: discussion of content model/Atom mapping with CI team, revision of model for texts, preliminary modeling for other content types. Develop recommendations for core metadata sharing and placement of metadata within content chunks exposed via Atom/CMIS.

13. Document collection-level metadata & collection-level service descriptions

Primary development team:                                     UIUC, NUL, ANU

Total Bamboo Phase One effort level:                     6.0 weeks

Effort level made to date:                                          2.0 weeks

Anticipated start date:                                               1 January 2011

Anticipated delivery date:                                         July 2011 (revised from June 2011) 

Would include integration of collection-level descriptions into Tools & Services Registry and could include some experimentation with semi-automating generation of RIF-CS records. 

 

14. Scholar engagement to update and refine collections list for Bamboo Phase II

Primary development team:                                     UIUC, NUL, IU Lib, others?

Total Bamboo Phase One effort level:                     8+ weeks (plus some non-Bamboo resources from UIUC, now confirmed)

Effort level made to date:                                          23.0 weeks

Anticipated start date:                                               1 Jan 2011

Anticipated delivery date:                                         Sept 2011 

Many existing digital collections contain potentially rich content for humanities scholarship, some well known, some relatively unknown. The proliferation of such resources suggests that usage of digital content in humanities research is increasing. However, in leveraging these collections and in facilitating their interoperability and use in the context of Bamboo, we need to understand more about the ways in which scholars want to use these resources, the importance they place on these resources, and the scope and reach of these resources. 

To ascertain this information, we will undertake a needs assessment; this will inform ongoing collection development during Bamboo Phase I, lay groundwork for a robust and well received Phase II of work, and allow CI to continue to refine and augment prioritized list of collections to target for Bamboo CI Standards adoption (a Phase I deliverable). 

The main thrust of this task will be to reach out to identified Bamboo Partner (and CIC) scholars via Web surveys, one-on-one interviews, and focus group interactions. We will identify scholars to contact through librarians, DH Centers, and other Bamboo participants. We will leverage prior work in this area, e.g., outcomes of Bamboo Project Planning Workshops, the Oxford [BVREH User Survey|http://bvreh.humanities.ox.ac.uk/news/Survey_outcomes.html], etc. We plan to supplement committed in-kind and funded Bamboo time with local resources (e.g., funding from UIUC Library's Research & Publication Committee) 

[2-4 weeks] Establish team to do this work (e.g., librarians at Illinois, Indiana, NUL, ...) & draft necessary IRB paperwork (multiple institutions).   IRB submittals in

[3 weeks] Draft and iterate instruments (overlapping with previous task) Drafts done

[1 week] Implement Web survey form. 

[2-4 weeks] Follow-up interviews & small-scale focus groups (includes supplemental resources) 

[4 weeks] Analyze and report out data (includes supplemental resources)  
