Future Development Directions for the CI Hub

This wiki space contains archival documentation of Project Bamboo, April 2008 - March 2013.


A number of team members were invited to contribute their thoughts on future development directions for the Collections Interoperability Hub (CI Hub). These are included below, in the order submitted.

Bill Parod, Northwestern University

Resource Caching

When CI-Hub processes citations in an uploaded Zotero file, it retrieves the referenced external repository content and builds a resource with that content: a CMIS folder and file structure defined by the Bamboo Book Model. CI-Hub does not maintain a directory of the resources it has processed. It does not check whether a resource request has been processed before, where in the CI-Hub folder structure the resource was saved, how old the saved version is, or whether it has changed since, and it applies no other logic appropriate for resource caching.

When considering caching behavior and the potential of shared resources generally, one may need to consider overall repository structure and operation. The current CI-Hub behaves like a file system with an ad-hoc, user-driven organization: users create folders, upload Zotero files into them, and the associated Bamboo Book Model / CMIS folders and files are created in place as a result. A simple incremental improvement towards caching might be a policy of not building a resource in a given folder where that resource already exists. With this policy, the organization of CI-Hub resources would remain up to individual users, but redundant processing might be reduced. A more comprehensive caching policy, though, might motivate a repository of shared resources that are requested by researchers but centrally managed, registered, and discoverable. Any discussion of caching options would therefore likely open a broader resource management discussion, with its own policy and implementation options.
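The simple incremental policy (do not rebuild a resource in a folder where it already exists) could be sketched roughly as follows. This is illustrative only, not CI-Hub code: `Folder`, `ensure_resource`, `fake_build`, and the resource identifier are hypothetical names, and an in-memory dictionary stands in for a CMIS folder.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Folder:
    """Hypothetical stand-in for a user-created CMIS folder in CI-Hub."""
    name: str
    resources: Dict[str, dict] = field(default_factory=dict)


def ensure_resource(folder: Folder, resource_id: str,
                    build_resource: Callable[[str], dict]) -> bool:
    """Build the resource in this folder only if it is not already there.

    Returns True if the resource was built, False if the existing copy was kept.
    """
    if resource_id in folder.resources:
        return False  # already present in this folder: skip redundant processing
    folder.resources[resource_id] = build_resource(resource_id)
    return True


# Usage: record how often the (expensive) build step actually runs.
build_calls: List[str] = []


def fake_build(resource_id: str) -> dict:
    """Placeholder for retrieving content and building Book Model folders."""
    build_calls.append(resource_id)
    return {"id": resource_id, "pages": []}


folder = Folder("my-readings")
first = ensure_resource(folder, "hathitrust:example-volume", fake_build)
second = ensure_resource(folder, "hathitrust:example-volume", fake_build)
```

Because the check is scoped to one folder, the same resource requested in two different folders would still be built twice. Avoiding that is what the more comprehensive, centrally managed option would address.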

Resource Ownership

As described above, the resources present in CI-Hub and their organization in CMIS folders are ad hoc, determined by individual user behavior. The current CI-Hub implements no mechanism for sharing resources among users or, conversely, for enforcing access restrictions on resources for users or groups of users. Consequently, integration with Bamboo Identity and Access Management (IAM), along with the associated internal repository policies and mechanisms for resource ownership and permissions management, has not been implemented in CI-Hub. Such an implementation would naturally follow consideration of the resource management issues described in the caching discussion above and, more broadly, of Bamboo IAM efforts.

HathiTrust Connector

HathiTrust API Changes

As described above, the CI-Hub includes repository connectors/adapters/locators for three external repositories. One of those repositories, the HathiTrust, has revised its Application Programming Interface (API) to enforce access key security. The CI-Hub HathiTrust connector has not been updated to accommodate this change. As a result, we have commented out its inclusion in the main cihub.properties configuration file. Work on the class referenced below will be necessary to accommodate the new HathiTrust API.

# repository.cihub.hathi.locator = org.projectbamboo.cihub.northwestern.domain.HathiLocatorService

 

JPEG2000 Conversion

Additionally, the HathiTrust connector converts HathiTrust JPEG2000 page image files to JPEG, using a locally deployed Djatoka servlet to perform the conversion. Djatoka and its use by the HathiTrust connector have not been brought into the Bamboo Services Platform (BSP) architecture. That is, Djatoka has not been cast or proxied as an OSGi service, and so is not part of a standard BSP deployment. This will also need to be addressed for the HathiTrust connector to support page images.

 

Bruce Barton, University of Wisconsin - Madison

CI HUB as Cache

The CI HUB as constructed behaved as a simple remote public directory tree. The organization of directories in the tree and the placement of objects in them were not regulated in the prototype versions of the HUB. AuthNZ had yet to be integrated. Object ownership was not formally expressed. We had no means of controlling how long objects placed in the HUB remained there. None of these is fatal in a prototype, but all would need to be remedied in a full production implementation.

As work on the CI HUB proceeded, there were several discussions about whether to think of the HUB not as a directory of objects but as a cache. Objects requested from a repository such as HathiTrust would be mapped onto a predictable path in the HUB. The path would be the same for all users of the HUB. If a requested object has previously been retrieved and is still in the cache, the HUB verifies that the object has not been altered in the source repository since the HUB retrieved it and then serves the normalized object to the requestor. If the object is not in the cache, the HUB retrieves it from the source repository and normalizes it, storing the result and mapping it to the predictable path. From the requestor’s point of view, the result in either case is ultimately the same, although cached objects may be available sooner.
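The request flow described here can be sketched as a small lookup function. Everything in the sketch is an assumption made for illustration: the `/cache/<repository>/<identifier>` path layout, the `version_of` and `fetch` methods on the repository object, the placeholder `normalize`, and the choice to re-retrieve an object that has changed at the source (the discussion did not settle that case).

```python
from dataclasses import dataclass
from typing import Dict
from urllib.parse import quote


@dataclass
class CacheEntry:
    version: str       # version marker reported by the source at retrieval time
    normalized: bytes  # the normalized object served to requestors


def cache_path(repo_name: str, object_id: str) -> str:
    """Same path for every user: a function of repository and identifier only."""
    return f"/cache/{quote(repo_name, safe='')}/{quote(object_id, safe='')}"


def normalize(raw: bytes) -> bytes:
    """Placeholder for normalization into a Bamboo Book Object."""
    return raw.strip()


def get_object(cache: Dict[str, CacheEntry], repo, object_id: str) -> bytes:
    path = cache_path(repo.name, object_id)
    entry = cache.get(path)
    current_version = repo.version_of(object_id)
    if entry is not None and entry.version == current_version:
        return entry.normalized  # cache hit, unchanged at the source
    # Miss, or (assumption) stale entry: retrieve, normalize, store at the path.
    normalized = normalize(repo.fetch(object_id))
    cache[path] = CacheEntry(current_version, normalized)
    return normalized


class FakeRepo:
    """Hypothetical source repository used only to exercise the sketch."""
    name = "hathitrust"

    def __init__(self) -> None:
        self.version = "v1"
        self.fetches = 0

    def version_of(self, object_id: str) -> str:
        return self.version

    def fetch(self, object_id: str) -> bytes:
        self.fetches += 1
        return f"  content of {object_id} at {self.version}  ".encode()


repo = FakeRepo()
cache: Dict[str, CacheEntry] = {}
a = get_object(cache, repo, "vol/123")  # miss: fetched and stored
b = get_object(cache, repo, "vol/123")  # hit: served from the cache
repo.version = "v2"                     # the source object changes
c = get_object(cache, repo, "vol/123")  # stale: fetched again
```

The requestor receives the same normalized bytes on a hit or a miss, and only the number of trips to the source repository differs.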

Treating the CI HUB as a cache has several convenient features:

  1. There is no concept of ownership for objects. Permission to retrieve an object is evaluated on each request. We don’t have to track ownership or wrestle with concepts like sharing an object among colleagues in a work group.
  2. The TTL for objects in the cache is under the control of the HUB administrators. They are free to implement whatever cache management scheme makes sense given patterns of traffic, say, FIFO or a resource utilization scheme.
  3. The path structure of the cache is a function of source repository and object identifier and therefore predictable.
  4. One could imagine making it possible to browse the contents of the cache. This is useful for cases in which it matters less what the objects are than that there are objects convenient to hand or where one is interested in patterns of retrieval.
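As one possible reading of points 2 and 4, the sketch below combines a TTL with FIFO eviction and exposes the cached paths for browsing. `FifoTtlCache` and its parameters are hypothetical. The clock is passed in explicitly so that the behaviour is easy to follow, and the paths are invented examples.

```python
from collections import OrderedDict
from typing import List, Optional, Tuple


class FifoTtlCache:
    """Hypothetical administrator-configured cache: FIFO eviction plus a TTL."""

    def __init__(self, max_entries: int, ttl_seconds: float) -> None:
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self._entries: "OrderedDict[str, Tuple[float, bytes]]" = OrderedDict()

    def put(self, path: str, obj: bytes, now: float) -> None:
        if path in self._entries:
            del self._entries[path]  # re-storing resets the entry's age
        self._entries[path] = (now, obj)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the oldest insertion first

    def get(self, path: str, now: float) -> Optional[bytes]:
        item = self._entries.get(path)
        if item is None:
            return None
        stored_at, obj = item
        if now - stored_at > self.ttl_seconds:
            del self._entries[path]  # expired: treat as a miss
            return None
        return obj  # reads do not reorder entries (FIFO, not LRU)

    def paths(self) -> List[str]:
        """Browse the cache contents, oldest entry first."""
        return list(self._entries)


cache = FifoTtlCache(max_entries=2, ttl_seconds=60)
cache.put("/cache/hathitrust/a", b"A", now=0)
cache.put("/cache/hathitrust/b", b"B", now=10)
cache.put("/cache/example-repo/c", b"C", now=20)    # over capacity: "a" is evicted
evicted = cache.get("/cache/hathitrust/a", now=25)  # None
fresh = cache.get("/cache/hathitrust/b", now=25)    # still within the TTL
expired = cache.get("/cache/hathitrust/b", now=71)  # 61 seconds old: None
```

A resource utilization scheme would replace the entry-count limit with, for example, a total-size budget, and the rest of the interface could stay the same.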

Status Reporting

Retrieving an object from a remote repository and normalizing it to create a Bamboo Book Object is a long-running job. In the prototype versions of CI HUB, we did not notify users when a job completed or, when a job failed, that it had failed and why. Notification was on the TODO list.

We proposed that notification be expressed in XML or JSON in a predictable schema easily consumable by software. The status report would indicate, for each object requested, the status of the request (e.g. complete, pending, not authorized, not found, timed out) and, for completed requests, the path to the object in the HUB and statistics on the retrieval and the resulting object, such as retrieval time, processing time, and size of the object. The notification would also include human-consumable metadata such as dc:title. Finally, the notification would indicate the status of the job as a whole.
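A JSON rendering of such a status report might look like the sketch below. No schema was ever fixed, so the field names, the job identifier, and the job-level status values ("pending", "complete", "complete with errors") are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ObjectStatus:
    object_id: str
    status: str                   # e.g. "complete", "pending", "not found"
    title: Optional[str] = None   # human-consumable metadata such as dc:title
    path: Optional[str] = None    # path in the HUB, for completed requests
    retrieval_seconds: Optional[float] = None
    processing_seconds: Optional[float] = None
    size_bytes: Optional[int] = None


def job_status(objects: List[ObjectStatus]) -> str:
    """Hypothetical roll-up of per-object statuses into one job status."""
    if any(o.status == "pending" for o in objects):
        return "pending"
    if all(o.status == "complete" for o in objects):
        return "complete"
    return "complete with errors"


def status_report(job_id: str, objects: List[ObjectStatus]) -> str:
    return json.dumps(
        {
            "job_id": job_id,
            "job_status": job_status(objects),
            "objects": [asdict(o) for o in objects],
        },
        indent=2,
    )


objects = [
    ObjectStatus("hathitrust:example-1", "complete",
                 title="An Example Volume",
                 path="/cache/hathitrust/example-1",
                 retrieval_seconds=12.5, processing_seconds=3.0,
                 size_bytes=1048576),
    ObjectStatus("hathitrust:example-2", "not found"),
]
report = json.loads(status_report("job-42", objects))
```

The same structure maps directly onto XML if a client prefers it. What matters for software consumers is that the per-object and job-level fields stay predictable.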

One could imagine various mechanisms for delivering the status report, such as polling or messaging, as suits the capabilities of the CI HUB client.

Packaging

As we noted in the discussion of the CMIS binding for the Bamboo Book Objects, copying an object from the CI HUB to the client is quite chatty and consequently prone to error.

One improvement might be to package Bamboo Book Objects for transmission from the HUB to the client. Such a package would include a manifest that describes the package contents, and provenance and technical metadata about the object(s).  The metadata would include the version number of the normalization software used to create the Bamboo Book Object(s), retrieval metadata such as the date/time of retrieval and object statistics, and fixity information.
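A minimal sketch of such a package follows, assuming a ZIP archive with a `manifest.json` at its root and SHA-256 digests as the fixity information. The archive format, the manifest field names, the file paths, and the version string are illustrative assumptions, not a Bamboo specification.

```python
import hashlib
import io
import json
import zipfile
from typing import Dict, List


def build_manifest(files: Dict[str, bytes], normalizer_version: str,
                   retrieved_at: str) -> dict:
    """Describe the package contents, with provenance and fixity metadata."""
    return {
        "normalizer_version": normalizer_version,
        "retrieved_at": retrieved_at,
        "files": [
            {"path": path, "size_bytes": len(data),
             "sha256": hashlib.sha256(data).hexdigest()}
            for path, data in sorted(files.items())
        ],
    }


def build_package(files: Dict[str, bytes], manifest: dict) -> bytes:
    """Bundle the manifest and all files into one archive for a single transfer."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
        for path, data in files.items():
            archive.writestr(path, data)
    return buffer.getvalue()


def verify_package(package: bytes) -> List[str]:
    """Return the paths whose content is missing or fails its fixity check."""
    failed = []
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        manifest = json.loads(archive.read("manifest.json"))
        names = set(archive.namelist())
        for entry in manifest["files"]:
            if entry["path"] not in names:
                failed.append(entry["path"])
                continue
            digest = hashlib.sha256(archive.read(entry["path"])).hexdigest()
            if digest != entry["sha256"]:
                failed.append(entry["path"])
    return failed


files = {"book/metadata.xml": b"<book/>", "book/pages/0001.txt": b"page one"}
manifest = build_manifest(files, normalizer_version="0.0-example",
                          retrieved_at="2013-03-01T12:00:00Z")
intact = verify_package(build_package(files, manifest))

# Simulate corruption in transit: the content no longer matches the manifest.
tampered = dict(files)
tampered["book/pages/0001.txt"] = b"page 0ne"
damaged = verify_package(build_package(tampered, manifest))
```

With a package like this, the client makes one request per object rather than one per folder and file, and can confirm on receipt that nothing was lost or altered.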

Transformations

In the Bamboo architecture, the CI HUB sits alongside transformation and analytical services. Applying transformations prior to the delivery of an object is an obvious extension of the CI HUB. A request for an object could include a recipe for the desired chain of transformations. The delivery package would then be supplemented with transformation outputs, the provenance and technical metadata that describe them, e.g. transformation tool identifier and version number, process status, error reports, and so on, and of course, the recipe itself.
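One way to picture a recipe is as an ordered list of transformation names applied in sequence, with each step recorded alongside its output. The registry, the two toy transformations, and their version numbers below are hypothetical stand-ins for real Bamboo transformation services.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: transformation name -> (tool version, function).
TRANSFORMS: Dict[str, Tuple[str, Callable[[str], str]]] = {
    "collapse-whitespace": ("1.0", lambda text: " ".join(text.split())),
    "lowercase": ("1.0", str.lower),
}


def apply_recipe(text: str, recipe: List[str]) -> dict:
    """Apply each named transformation in order and record what happened."""
    steps = []
    current = text
    for name in recipe:
        if name not in TRANSFORMS:
            steps.append({"tool": name, "status": "error",
                          "error": "unknown transformation"})
            break  # later steps depend on this one, so stop the chain
        version, transform = TRANSFORMS[name]
        current = transform(current)
        steps.append({"tool": name, "version": version,
                      "status": "ok", "output": current})
    # The recipe itself travels with the outputs, as the text suggests.
    return {"recipe": list(recipe), "steps": steps}


result = apply_recipe("  Call me   Ishmael. ",
                      ["collapse-whitespace", "lowercase"])
failed = apply_recipe("text", ["lowercase", "tokenize"])
```

Keeping every intermediate output and error report in the result lets the delivery package describe exactly how each derived artifact was produced.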

 

Bridget Almas, Tufts University

Externalizing parsing and transformation algorithms

The CI HUB's parsing and transformation algorithms for the source XML and Zotero bookmarks are embedded directly in the Scala code. I feel that these should be externalized via templates and/or a rule-based system. This would facilitate:

  • participation of new collection providers by allowing them to just define their content model and how it transforms to the book model according to whatever method was adopted, and not requiring that they code a connector/adaptor in Scala
  • support for multiple and, more importantly, continually evolving schemas and bookmarkable URLs for different types of content from a single provider
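To make the idea concrete, the sketch below keeps the mapping from a provider's XML to book-model fields in a small rules document rather than in code. The rule format, the target field names (`title`, `author`, `pages`), and the sample record are invented for illustration. They are not the Bamboo Book Model or the format of any existing tool.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical rules a collection provider could supply instead of writing
# a connector: target field -> path into the provider's own XML.
RULES_JSON = """
{
  "title":  {"path": "./header/title"},
  "author": {"path": "./header/author"},
  "pages":  {"path": "./body/page", "many": true}
}
"""

SOURCE = """<record>
  <header><title>An Example Volume</title><author>A. Writer</author></header>
  <body><page>First page.</page><page>Second page.</page></body>
</record>"""


def apply_rules(source_xml: str, rules: dict) -> dict:
    """Build a book-like dictionary by evaluating each rule against the XML."""
    root = ET.fromstring(source_xml)
    book = {}
    for target_field, rule in rules.items():
        values = [(el.text or "").strip() for el in root.findall(rule["path"])]
        if rule.get("many"):
            book[target_field] = values
        else:
            book[target_field] = values[0] if values else None
    return book


book = apply_rules(SOURCE, json.loads(RULES_JSON))
```

When a provider's schema evolves, only the rules document changes. A real system would need a richer rule language than simple paths, which is where services of the kind discussed next could help.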

Early in the Bamboo Technology Project I suggested looking at the Abbot service to see if it could help with this functionality. I think this is still an avenue worth exploring, although there may also be other services or implementations out there that could be of help here. XSugar, for example, provides support for rule-based transformations, and the TEI-Integrator prototype from CLARIN also showed some potential, particularly in the user interface it provided for configuring and checking the transformations.

Content retrieval from repositories that support the CTS Protocol and others with RESTful interfaces

In the end, there was also some overlap between the CI HUB and the Repository Service I coded to support retrieval of content for analysis by the Morphology and Syntactic Annotation Services. The Repository Service supports, in particular, retrieval of content from remote repositories that support the CTS protocol, but also from non-CMIS repositories with RESTful interfaces whose URL syntax can be defined in a configuration setting. I thought that if development continued, this functionality might eventually be merged with the CI HUB to avoid redundancy and to let it take advantage of the caching and other features planned for the CI HUB.
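Defining URL syntax in configuration can be as simple as a template per repository, as in this sketch. The repository names, hosts, and template syntax are hypothetical and do not reproduce the Repository Service's actual settings. The second template follows the general shape of a CTS GetPassage request, and the URN is only an example.

```python
from urllib.parse import quote

# Hypothetical configuration: repository name -> URL template with an {id} slot.
REPOSITORY_URLS = {
    "example-rest": "https://repo.example.org/objects/{id}?format=xml",
    "example-cts": "https://cts.example.org/api?request=GetPassage&urn={id}",
}


def build_url(repository: str, object_id: str) -> str:
    """Fill the configured template with a percent-encoded identifier."""
    template = REPOSITORY_URLS[repository]  # KeyError if not configured
    return template.format(id=quote(object_id, safe=""))


rest_url = build_url("example-rest", "vol 1/a")
cts_url = build_url("example-cts", "urn:cts:greekLit:tlg0012.tlg001:1.1")
```

Adding a repository then means adding one configuration entry, and the resulting URL can feed the same retrieval and caching path as any other source.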

 

 

 
