
This wiki space contains archival documentation of Project Bamboo, April 2008 - March 2013.


This is a wiki rendering of the attached MS-Word document, CIHUB-Architecture-20130512a.docx, written by Bill Parod of Northwestern University. The document is also attached as a PDF.

Team members' suggestions for future development of the CI Hub are included in a child page to this one, Future Development Directions for the CI Hub.

 

Apache Chemistry OpenCMIS

Apache Chemistry OpenCMIS is an open source implementation of the Content Management Interoperability Services (CMIS) standard. OpenCMIS includes a server framework for layering CMIS over other content repositories and a client framework for integrating consuming applications with CMIS-compliant repositories. The Apache Chemistry distribution comes with two example repository implementations: "InMemory" and "FileShare".

The Bamboo CI-Hub is an extended version of the FileShare implementation. The FileShare implementation, as the name implies, persists CMIS folders, documents, and properties as filesystem folders and files. It declares its own package but also leverages other Apache Chemistry server modules.

Apache Chemistry exposes the core CMIS domain services through three binding options: Atom Publishing Protocol (APP), Web Services, and local Java class bindings. Bamboo uses the AtomPub binding to provide HTTP access to CI-Hub. A simplified diagram of the AtomPub over FileShareRepository processing is shown below.

Figure 1: AtomPub over FileShareRepository processing (simplified)

Figure 1 above shows the main classes involved in AtomPub request processing. The org.apache.chemistry.opencmis.server.impl.atompub package includes the CMISAtomPubServlet servlet class for handling HTTP requests. Its initialization establishes a dispatch table that maps CMIS requests to their associated CMIS service classes (RepositoryService, NavigationService, ObjectService, VersioningService, and DiscoveryService) within the package. In handling an HTTP request, the servlet forwards the incoming HttpServletRequest and HttpServletResponse objects, together with a CMIS CallContext object, through this dispatcher to the class serving the specific CMIS request. These AtomPub classes in turn invoke the associated methods on the configured repository's CMIS service classes and write the return values as AtomPub XML to the forwarded HttpServletResponse object. This is how AtomPub request parsing and reply formatting are accomplished.
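The dispatch-table idea described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the OpenCMIS code: the Handler interface, the resource names ("children", "entry"), and the strings returned by the handlers are hypothetical stand-ins for the AtomPub service classes and their XML replies.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Minimal sketch of a servlet-style dispatch table (all names are hypothetical). */
final class DispatchSketch {

    /** Stand-in for a binding-level service method that renders a reply. */
    interface Handler {
        String handle(String repositoryId);
    }

    private final Map<String, Handler> table = new HashMap<>();

    /** Register a handler under an HTTP method and a resource name. */
    void register(String httpMethod, String resource, Handler handler) {
        table.put(key(httpMethod, resource), handler);
    }

    /** Look up the handler for a request and invoke it. */
    String dispatch(String httpMethod, String resource, String repositoryId) {
        Handler handler = table.get(key(httpMethod, resource));
        if (handler == null) {
            throw new IllegalArgumentException("No handler for " + httpMethod + " " + resource);
        }
        return handler.handle(repositoryId);
    }

    /** Keys are case-insensitive in the HTTP method, e.g. "GET children". */
    private static String key(String httpMethod, String resource) {
        return httpMethod.toUpperCase(Locale.ROOT) + " " + resource;
    }

    /** Build a table resembling what a servlet's initialization might establish. */
    static DispatchSketch example() {
        DispatchSketch d = new DispatchSketch();
        d.register("GET", "children", repo -> "NavigationService.getChildren(" + repo + ")");
        d.register("GET", "entry", repo -> "ObjectService.getObject(" + repo + ")");
        d.register("POST", "children", repo -> "ObjectService.create(" + repo + ")");
        return d;
    }

    public static void main(String[] args) {
        System.out.println(example().dispatch("get", "children", "cihub"));
    }
}
```

Registering handlers once at initialization keeps request handling to a single map lookup, which is the property the text attributes to the servlet's dispatch table.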

The underlying CMIS implementation is responsible for managing content within the repository. It invokes implementations of core CMIS service classes like those described in the AtomPub binding (RepositoryService, NavigationService, ObjectService, VersioningService, and DiscoveryService).

Bamboo Collection Interoperability Hub (CI-Hub)

As mentioned above, the Bamboo CI-Hub is based on the OpenCMIS FileShare Repository. Adapting OpenCMIS for Bamboo involved two major efforts: 1) extending OpenCMIS to normalize content from external repositories and 2) refactoring OpenCMIS for OSGi deployments on the Bamboo Services Platform.

Bamboo Services Platform (BSP)

Apache Chemistry OpenCMIS is typically deployed as a web application. Deploying the CI-Hub on the Bamboo Services Platform (BSP) required separating HTTP request processing from core service functionality, placing them in separate OSGi bundles that function as BSP Resources and a BSP Service respectively.

Resource Oriented Architecture (ROA)

The CI-Hub ROA layer defines a service for the BSP’s CXF servlet, exposing the AtomPub binding to CMIS services described above. ROA uses a Spring beans.xml file to declare and configure its service, and Java annotations in the implementing class to expose Java methods implementing HTTP methods.

The ROA layer beans.xml file (fragments shown below) defines a single bean implemented by the CIHubResource class bound to the root (“/cihub”) CXF path. This class provides a functional replacement, in the BSP, for the CMISAtomPubServlet class used in the Apache Chemistry webapp implementation. CIHubResource is modeled on the CMISAtomPubServlet class, but is invoked by BSP CXF, rather than directly by a servlet container. Instead of a web.xml file to configure servlet properties for initialization, our ROA bean obtains properties from its beans.xml file.

ICMISRepositoryServiceFactory is the CI-Hub SOA layer interface that our ROA layer uses to obtain an SOA CmisServiceFactory. The CmisServiceFactory creates a CmisService based on the specific factory class configured in the Apache Chemistry FileShare repository configuration file, cihub.properties. CI-Hub configures that property to use its own org.projectbamboo.cihub.northwestern.domain.FileShareServiceFactory class, replacing the default Apache Chemistry class. FileShareServiceFactory can then substitute the custom org.projectbamboo.cihub.northwestern.domain.FileShareService class to achieve custom CI-Hub behavior.

Another important configuration property in the CI-Hub is atomPubAddedPath. The OpenCMIS FileShare Repository is implemented to run as a servlet in a servlet container. It therefore forms URLs in AtomPub replies based on the servlet context. However, in our BSP deployment, we are executing as a JAX-RS service on our own path (/cihub) under a CXF servlet in an OSGi container. Consequently, in the BSP environment we need to provide a more extensive URL path in AtomPub replies. The extra path information is configured in the ROA bean definition’s atomPubAddedPath property, reflecting the BSP ROA deployment and the specific path of our ROA service’s JAX-RS address. These are shown below in the CI-Hub ROA bean file.

<jaxrs:server id="cihub" address="/cihub">
   <jaxrs:serviceBeans><ref bean="ciHubResource"/></jaxrs:serviceBeans>
</jaxrs:server>

<bean id="ciHubResource" init-method="create"
      class="org.projectbamboo.cihub.northwestern.resources.CIHubResource">
  <property name="callContextHandlerClass" value="org.apache.chemistry.opencmis.server.shared.BasicAuthCallContextHandler"/>
  <property name="cmisRepositoryServiceFactory" ref="CMISRepositoryServiceFactory"/>
  <property name="serviceCatalog" ref="serviceCatalog"/>
  <!-- this reflects the jaxrs address above -->
  <property name="atomPubAddedPath" value="/services/bsp/cihub/"/>
  <!-- username for fileShare login -->
  <property name="fileShareUsername" value="test"/>
  <!-- fileShare password -->
  <property name="fileSharePassword" value="test"/>
</bean>

Major elements of the ROA beans.xml file.
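To illustrate why the added path matters, the following Java sketch composes a base URL for links in AtomPub replies from the request's scheme, host, and port plus a configured extra path. It is a simplified illustration, not the CI-Hub's AtomPubUtils: the method names, the host and port in the example, and the assumption that the repository identifier directly follows the added path are all hypothetical.

```java
/** Sketch: composing a base URL for AtomPub links with an extra configured path. */
final class AtomPubBaseUrlSketch {

    /**
     * Join scheme, host, port, the configured added path, and the repository id.
     * Default ports (80 for http, 443 for https) are left out of the result.
     */
    static String baseUrl(String scheme, String host, int port,
                          String addedPath, String repositoryId) {
        StringBuilder url = new StringBuilder(scheme).append("://").append(host);
        boolean defaultPort = ("http".equals(scheme) && port == 80)
                || ("https".equals(scheme) && port == 443);
        if (!defaultPort) {
            url.append(':').append(port);
        }
        // Assumption: the repository id is appended right after the added path.
        url.append(normalize(addedPath)).append(repositoryId).append('/');
        return url.toString();
    }

    /** Ensure the path starts and ends with exactly one slash. */
    static String normalize(String path) {
        String trimmed = path.replaceAll("^/+|/+$", "");
        return trimmed.isEmpty() ? "/" : "/" + trimmed + "/";
    }

    public static void main(String[] args) {
        // Host and port are made up; the added path is the value from the bean file.
        System.out.println(baseUrl("http", "localhost", 8181, "/services/bsp/cihub/", "content"));
    }
}
```

Normalizing the configured path in one place means the property value can be written with or without leading and trailing slashes without producing doubled or missing separators in the generated links.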

The SOA service instance obtained by the ROA layer is passed through ROA level AtomPub package classes. Deep in the AtomPub package classes, binding-neutral CMIS methods are invoked on the passed-in CmisService class, which in our case is an SOA layer service.

Service Oriented Architecture (SOA)

The CI-Hub SOA layer defines a service used by other BSP services and by the CI-Hub ROA layer resource. The beans.xml file (fragments shown below) for the SOA layer defines a single bean for the CMISRepositoryServiceFactory class and exposes it as an OSGi service supporting the ICMISRepositoryServiceFactory interface. This class implements a single method: getCMISRepositoryService(CallContext context). This method simply returns an Apache Chemistry CmisService instance. Consumers of this service then use the CmisService API defined by that class. ROA and SOA processing is shown below in Figure 2.

<osgi:service ref="cmisRepositoryServiceFactory"
interface="org.projectbamboo.cihub.northwestern.service.ICMISRepositoryServiceFactory"
    ranking="1">
        
  <osgi:service-properties>
    <entry key="service.pid" value="urn:uuid:418E1B99-5ABE-4693-8AAD-FC9DA164A581"/>
    <entry key="serviceDescriptionLocation" value="https://wikihub.berkeley.edu/display/pbamboo/CI+Hub+Service+Contract+Description+-+v0.9-alpha"/>
    <entry key="service.description" value="CMIS Service Factory"/>
    <entry key="service.vendor" value="Northwestern University"/>
    <entry key="serviceProviderName" value="Bamboo CI Hub OSGi Service Implementation"/>
    <entry key="serviceVersion" value="1.0"/>
    <entry key="serviceProviderType" value="functional"/>
    <entry key="defaultServiceProvider" value="true"/>
    <entry key="serviceProviderSupportedVersionsRange" value="[1.0.0,2.0.0)"/>
    <entry key="serviceProviderContact" value="bill-parod@northwestern.edu"/>
  </osgi:service-properties>
</osgi:service>
    
<bean id ="cmisRepositoryServiceFactory" 
      class="org.projectbamboo.cihub.northwestern.service.CMISRepo…">
   <property name="repositoryConfigFile" value="/cihub.properties"/> 
   <property name="repositoryId" value="content"/> 
</bean>

Major elements of the SOA beans.xml file.

Figure 2: ROA and SOA Processing.
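The single-method factory contract of the SOA layer can be sketched as follows. The interface name and method name come from the text above; everything else is an assumption made so the example is self-contained. In particular, CallContext and CmisService here are reduced, hypothetical stand-ins for the Apache Chemistry types, and the implementation body is illustrative only.

```java
/** Sketch of the SOA factory contract, using stand-in types instead of the OpenCMIS ones. */
final class SoaFactorySketch {

    /** Stand-in for the OpenCMIS CallContext (hypothetical, reduced to a user name). */
    static final class CallContext {
        final String username;

        CallContext(String username) {
            this.username = username;
        }
    }

    /** Stand-in for the OpenCMIS CmisService (hypothetical, one illustrative method). */
    interface CmisService {
        String describe();
    }

    /** The single-method factory contract described in the text. */
    interface ICMISRepositoryServiceFactory {
        CmisService getCMISRepositoryService(CallContext context);
    }

    /** A trivial implementation that binds the returned service to the caller. */
    static final class CMISRepositoryServiceFactory implements ICMISRepositoryServiceFactory {
        private final String repositoryId;

        CMISRepositoryServiceFactory(String repositoryId) {
            this.repositoryId = repositoryId;
        }

        @Override
        public CmisService getCMISRepositoryService(CallContext context) {
            return () -> "CmisService for repository '" + repositoryId
                    + "' and user '" + context.username + "'";
        }
    }

    public static void main(String[] args) {
        ICMISRepositoryServiceFactory factory = new CMISRepositoryServiceFactory("content");
        System.out.println(factory.getCMISRepositoryService(new CallContext("test")).describe());
    }
}
```

A consumer such as the ROA resource only depends on the interface, so the implementation behind the OSGi service can change without affecting callers.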

CI-Hub FileShare Repository

The Apache Chemistry FileShare Repository supports configuration of the factory class to use for the CMIS FileShare service implementation. This setting is used by the CI-Hub to substitute its own custom service factory class (FileShareServiceFactory). This custom class in turn instantiates custom CI-Hub versions of the FileShareService and FileShareRepository classes in order to achieve custom CI-Hub behavior, as shown above.

The CI-Hub FileShareRepository is provided to customize behavior when new files are submitted to the CMIS repository. FileShareRepository examines submitted CMIS files for indication that they are Zotero format bibliography files. When a Zotero file is detected, FileShareRepository presents the Zotero references to each configured “Locator” class in the CI-Hub. Locator or “connector” classes are used to provide the specific processing needed to import content from external repositories. These class relationships are shown below in Figure 3.

CI-Hub Locator Extensions

In addition to repository-specific reference detection, CI-Hub “Locator” processing involves retrieval of content from the referenced repository and conversion and placement of that content into a Bamboo Book Model structure in the CI-Hub. An item’s citation reference alone is usually not sufficient for retrieval of an item’s complete content. Each specific “Locator” must understand the specific repository’s application programming interface (API) and form additional URL references based on the cited URL’s identifier to retrieve additional description and content for the item. The Bamboo Book Model defines the folder structure, file naming conventions, and set of CMIS properties for all content files constituting a Bamboo “Book”. Each locator class understands its respective repository’s content model and its mapping to the Bamboo Book Model. The CI-Hub provides such locator services for Perseus, HathiTrust, and a Fedora implementation of selected Text Creation Partnership (TCP) texts running at the University of Illinois.

Figure 3: Relationships between FileShareRepository and Locator classes

In addition to the Bamboo Book Model conversion, the connectors save source files from their respective repositories in the CMIS repository. The specific behavior of each connector is described in more detail in the sections below. The URL pattern for each repository recognized by its respective locator is listed.
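The idea of presenting each cited URL to every configured locator can be sketched as below. This is a hedged illustration rather than the CI-Hub's Scala code: the Locator interface is hypothetical, the match is a plain prefix test (the actual classes may use regular expressions), and the example reference URLs are made up. The three prefixes are derived from the URL patterns listed for the connectors in this document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: offering cited URLs to each configured locator (names are hypothetical). */
final class LocatorDispatchSketch {

    /** Hypothetical locator contract: can this locator handle the cited URL? */
    interface Locator {
        boolean recognizes(String url);
        String name();
    }

    /** A locator that claims every URL starting with the given prefix. */
    static Locator prefixLocator(String name, String prefix) {
        return new Locator() {
            @Override public boolean recognizes(String url) { return url.startsWith(prefix); }
            @Override public String name() { return name; }
        };
    }

    /** Locators for the three repositories, keyed on prefixes of their URL patterns. */
    static List<Locator> configured() {
        List<Locator> locators = new ArrayList<>();
        locators.add(prefixLocator("perseus", "http://www.perseus.tufts.edu/hopper/text?doc="));
        locators.add(prefixLocator("hathi", "http://hdl.handle.net/2027/"));
        locators.add(prefixLocator("tcp", "http://ramman.grainger.uiuc.edu"));
        return locators;
    }

    /** Present each reference to the locators in order; the first match claims it. */
    static Map<String, String> resolve(List<String> references, List<Locator> locators) {
        Map<String, String> claimed = new LinkedHashMap<>();
        for (String url : references) {
            for (Locator locator : locators) {
                if (locator.recognizes(url)) {
                    claimed.put(url, locator.name());
                    break;
                }
            }
        }
        return claimed; // references no locator recognizes are simply left out
    }

    public static void main(String[] args) {
        List<String> refs = new ArrayList<>();
        refs.add("http://hdl.handle.net/2027/example.123"); // made-up identifier
        refs.add("http://example.org/other");
        System.out.println(resolve(refs, configured()));
    }
}
```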

Perseus Connector

http://www.perseus.tufts.edu/hopper/text?doc=.*

The Perseus connector is based on a generic Fedora connector. As such it obtains all the referenced Fedora object’s datastreams and stores them in the Bamboo Book Model’s book/source directory. It then forms a URL to the object’s TEI datastream using URL pattern heuristics coded in the connector to obtain the object’s TEI transcript.

The connector also uses the Fedora object’s “MODS” datastream to obtain basic bibliographic metadata for the object. This basic descriptive metadata is used in CMIS property files within the Book Model.

The Bamboo Book Model is a page-based model; that is, it organizes book content into separate constituent pages. Perseus transcripts, however, as is characteristic of classics texts, are not naturally paginated, so their TEI transcripts do not contain any page break markup. In order to supply “pages” for the Bamboo Book Model, the CI-Hub Perseus connector performs its own pagination by breaking the TEI datastream at its <div/> elements, creating a Bamboo page for each <div/>.

For each of these “pages”, the connector creates files for its TEI XML representation, an xhtml representation, and a plain text representation. The connector also creates a .cmis file for each of these page representation files containing relevant properties.

If the specific Perseus TEI transcript does not contain <div/> elements, the connector provides the same representation types above, using <l/> elements for pagination. The connector also creates a plain text “volume level” version of the entire transcript by concatenating plain text pages into a single file.
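The pagination strategy described above (one page per <div/>, falling back to <l/> when no <div/> is present, then concatenating pages into a volume-level text) can be sketched with the JDK's DOM parser. This is a simplified illustration and not the connector's code: it only produces the plain text representation, treats every <div/> in the document as a page without considering nesting, and the newline used to join pages is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Sketch: splitting a TEI transcript into plain text "pages". */
final class TeiPaginationSketch {

    /** Return the text of each page: one per div element, else one per l element. */
    static List<String> paginate(String teiXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(teiXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse TEI input", e);
        }
        NodeList units = doc.getElementsByTagName("div");
        if (units.getLength() == 0) {
            // No div elements: fall back to line elements as page units.
            units = doc.getElementsByTagName("l");
        }
        List<String> pages = new ArrayList<>();
        for (int i = 0; i < units.getLength(); i++) {
            pages.add(units.item(i).getTextContent().trim());
        }
        return pages;
    }

    /** Build a volume-level plain text by concatenating the pages. */
    static String volumeText(List<String> pages) {
        return String.join("\n", pages);
    }

    public static void main(String[] args) {
        String tei = "<text><body><div>Book one</div><div>Book two</div></body></text>";
        System.out.println(volumeText(paginate(tei)));
    }
}
```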

HathiTrust Connector

http://hdl.handle.net/2027/*

The HathiTrust connector uses the HathiTrust API to obtain the referenced item’s bibliographic description and a ZIP file containing its scanned pages. It uses two HathiTrust API calls to do this, one to obtain the item’s bibliographic description as JSON formatted MARC, and the other to obtain a ZIP file containing the full set of scanned pages.

CMIS bibliographic descriptive properties are obtained by parsing the JSON formatted string for MARC fields, using the following MARC codes and subcodes for basic description:

title = 245$a
creator = 100$a
publisher = 260
issued = 260$c

The Bamboo Book Model also accommodates JPEG page images at various widths. However, many HathiTrust volumes provide their page images in the JPEG2000 format. Consequently, the HathiTrust connector must reformat JPEG2000 images as JPEG for Bamboo Book Model conformance. The HathiTrust connector uses a Djatoka JPEG2000 server to decode HathiTrust JPEG2000 files as JPEG images at the various resolutions defined by the Bamboo Book Model.

TCP/Fedora Connector

http://ramman.grainger.uiuc.edu

The Text Creation Partnership (TCP) connector, like the Perseus connector, leverages the Fedora repository access classes included in the CI-Hub. Its Fedora object model, however, is different from that of Perseus. It holds separate Fedora objects for the main bibliographic entity, the TEI transcript, and the MorphAdorned TEI transcript. Page-level representations, whether TEI, MorphAdorned TEI, plain text, or JPEG at various resolutions, are all obtained using parameterized methods in Fedora disseminations. The TCP connector forms all relevant URLs internally to access the desired Bamboo Book page representations and, as with HathiTrust and Perseus content, forms all desired representations that are possible for the source materials. The Bamboo Book Model representations provided by each connector are summarized in the table below.

For each repository column, "Repository" means that particular CMIS Type is obtained directly from the repository. "Connector" means that the CMIS Type is manufactured by the connector from other content obtained from the repository. For example, the Perseus connector creates plaintext, xhtml, and TEI xml pages from the Perseus source TEI transcript.

CMIS Type              mime-type         TCP (UIUC)  Hathi       Perseus
page-plaintext         text/plain        repository  connector   connector
page-xhtml             text/html         repository  -           connector
page-tei               text/xml          repository  connector   connector
page-morphadorned      text/xml          repository  -           -
page-image             image/jpeg        repository  connector   -
page-thumb150          image/jpeg        repository  connector   -
book-tei               text/xml          repository  -           repository
book-plaintext         text/plain        connector   connector   connector
source-mets            text/xml          -           -           -
source-aggregate       application/zip   -           repository  -
source-bib-marc        application/json  -           repository  -
source-page-image-jp2  image/jp2         -           repository  -
source-page-xml        text/xml          -           -           -
source-page-ocr        text/plain        -           -           -

Configuration

The Apache Chemistry OpenCMIS distribution requires minor configuration but offers considerable flexibility and customization. Its main configuration file (cihub.properties) requires local deployment settings for file system paths to the repository content area and initial account credentials. Beyond that it allows local substitution of the major factory class for the repository, facilitating custom implementations and local extensibility of CMIS content types. These properties and their Bamboo settings are given below.

The cihub.properties file is maintained outside of the OSGi container and discovered at runtime by the CI-Hub, which forms a path by combining the $BSPLOCALSTORE_HOME environment variable and the repositoryConfigFile property defined in the SOA beans.xml file.

Care should be taken to define the BSPLOCALSTORE_HOME shell variable in the BSP execution environment, to define the repositoryConfigFile path relative to $BSPLOCALSTORE_HOME, and to place the cihub.properties file in that location.
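As a rough sketch of the path discovery described above, the following Java combines a home directory with the configured file name. It is an assumption that the two are joined as a simple directory-plus-relative-path; the class and method names are hypothetical, and the real CI-Hub code may combine them differently.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Sketch: locating cihub.properties relative to BSPLOCALSTORE_HOME. */
final class ConfigPathSketch {

    /** Combine the local store home with the configured file, treated as relative. */
    static Path resolveConfig(String localStoreHome, String repositoryConfigFile) {
        if (localStoreHome == null || localStoreHome.isEmpty()) {
            throw new IllegalStateException("BSPLOCALSTORE_HOME is not set");
        }
        // Strip leading slashes so "/cihub.properties" resolves under the home directory.
        String relative = repositoryConfigFile.replaceAll("^/+", "");
        return Paths.get(localStoreHome).resolve(relative).normalize();
    }

    public static void main(String[] args) {
        String home = System.getenv("BSPLOCALSTORE_HOME");
        System.out.println(home == null
                ? "BSPLOCALSTORE_HOME is not set"
                : resolveConfig(home, "/cihub.properties").toString());
    }
}
```

Failing early when the variable is missing makes a misconfigured execution environment visible at startup rather than as a later file-not-found error.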

Service factory

This property is used to declare the class responsible for creating the principal class used for CMIS processing, as described above.

class=org.projectbamboo.cihub.northwestern.domain.FileShareServiceFactory
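Substituting a factory through a property of this kind typically relies on reflection. The sketch below shows the general technique only and is not taken from Apache Chemistry: the loadFactory helper is hypothetical, it assumes the named class has a no-argument constructor, and the example uses java.util.ArrayList as a stand-in because the CI-Hub class is not available here.

```java
import java.util.Properties;

/** Sketch: instantiating the class named by a "class" property via reflection. */
final class FactoryLoadingSketch {

    /** Create an instance of the configured class using its no-argument constructor. */
    static Object loadFactory(Properties config) {
        String className = config.getProperty("class");
        if (className == null) {
            throw new IllegalArgumentException("Missing 'class' property");
        }
        try {
            return Class.forName(className.trim()).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        // Stand-in class name; a deployment would name its service factory here.
        config.setProperty("class", "java.util.ArrayList");
        System.out.println(loadFactory(config).getClass().getName());
    }
}
```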

 

Accounts

Apache Chemistry performs security checks on each CMIS request. Accounts, passwords, and readwrite or readonly permissions are managed with the properties below.

login.1 = test:PASSWORD
login.2 = cmisuser:PASSWORD
login.3 = reader:PASSWORD
repository.cihub.readwrite = test, cmisuser
repository.cihub.readonly = reader
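The account properties above follow a simple convention that can be read with java.util.Properties. The sketch below is an illustration of that convention, not the FileShare implementation: the class and method names are hypothetical, and the rule that readwrite takes precedence when a user appears in both lists is an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Sketch: reading login and permission properties of the form shown above. */
final class AccountsSketch {

    /** Map each user to "readwrite" or "readonly" for the given repository id. */
    static Map<String, String> permissions(Properties config, String repositoryId) {
        Map<String, String> result = new TreeMap<>();
        // readonly is read first so that readwrite wins if a user is listed twice (assumption).
        for (String mode : new String[] {"readonly", "readwrite"}) {
            String users = config.getProperty("repository." + repositoryId + "." + mode, "");
            for (String user : users.split(",")) {
                if (!user.trim().isEmpty()) {
                    result.put(user.trim(), mode);
                }
            }
        }
        return result;
    }

    /** Collect user -> password from the login.N = user:password entries. */
    static Map<String, String> logins(Properties config) {
        Map<String, String> result = new TreeMap<>();
        for (String name : config.stringPropertyNames()) {
            if (!name.startsWith("login.")) {
                continue;
            }
            String value = config.getProperty(name).trim();
            int colon = value.indexOf(':');
            if (colon > 0) {
                result.put(value.substring(0, colon), value.substring(colon + 1));
            }
        }
        return result;
    }

    /** Parse properties text, converting the checked I/O exception. */
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    public static void main(String[] args) {
        Properties p = parse("login.1 = test:PASSWORD\nlogin.3 = reader:PASSWORD\n"
                + "repository.cihub.readwrite = test, cmisuser\n"
                + "repository.cihub.readonly = reader\n");
        System.out.println(permissions(p, "cihub"));
    }
}
```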

 

Locator classes

This is where “Locator” classes or repository connectors are declared for the CI-Hub. By associating a class name here with a property name that ends with “.locator”, that class will be included in the list of potential repository locator services associated with the local CMIS repository having repository identifier = REPOSITORY_ID, where REPOSITORY_ID is obtained from the property name with this regular expression: ‘/repository\.REPOSITORY_ID\..*\.locator’.

# repository.cihub.hathi.locator = org.projectbamboo.cihub.northwestern.domain.HathiLocatorService
repository.cihub.perseus.locator = org.projectbamboo.cihub.northwestern.domain.PerseusLocatorService
repository.cihub.tcp.locator = org.projectbamboo.cihub.northwestern.domain.TCPLocatorService
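The naming convention for locator properties can be applied with a regular expression along the lines of the one quoted above. The following sketch is illustrative only: the class and method names are hypothetical, and capturing the middle segment of the property name as a short locator key is an assumption made for the example.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: collecting locator classes declared for one repository id. */
final class LocatorConfigSketch {

    /** Return locator key -> class name for properties named repository.ID.*.locator. */
    static Map<String, String> locatorClasses(Properties config, String repositoryId) {
        Pattern pattern = Pattern.compile(
                "repository\\." + Pattern.quote(repositoryId) + "\\.(.*)\\.locator");
        Map<String, String> result = new TreeMap<>();
        for (String name : config.stringPropertyNames()) {
            Matcher m = pattern.matcher(name);
            if (m.matches()) {
                result.put(m.group(1), config.getProperty(name).trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty("repository.cihub.perseus.locator",
                "org.projectbamboo.cihub.northwestern.domain.PerseusLocatorService");
        p.setProperty("repository.cihub.tcp.locator",
                "org.projectbamboo.cihub.northwestern.domain.TCPLocatorService");
        p.setProperty("repository.cihub.readonly", "reader"); // ignored: not a locator
        System.out.println(locatorClasses(p, "cihub").keySet());
    }
}
```

Because commented-out lines are never loaded as properties, a locator disabled with a leading # (as the HathiTrust entry is above) does not appear in the result.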

 

Locator configurations

This property gives the location of the properties file that provides connection information for the various locator classes listed above.

repository.cihub.connectorConfig = /config/connector.properties

 

Bamboo CMIS types

In CMIS, folders and documents can be assigned properties. Properties are contained in a separate file with a special naming convention. One of the properties associated with all Bamboo content is its type. Each type is defined by a repository-wide configuration file that describes the properties for objects of that type. The type definitions are enumerated below, along with the paths to their definition files.

type.01 = /CMISTypes{file.separator}bamboo-page-document.xml
type.02 = /CMISTypes{file.separator}book.xml
type.03 = /CMISTypes{file.separator}contents.xml
type.04 = /CMISTypes{file.separator}example-type.xml
type.05 = /CMISTypes{file.separator}metadata.xml
type.06 = /CMISTypes{file.separator}page.xml
type.07 = /CMISTypes{file.separator}page-image.xml
type.08 = /CMISTypes{file.separator}page-tei.xml
type.09 = /CMISTypes{file.separator}page-thumb150.xml
type.10 = /CMISTypes{file.separator}page-xhtml.xml
type.11 = /CMISTypes{file.separator}source-mets.xml
type.12 = /CMISTypes{file.separator}source-page-image.xml
type.13 = /CMISTypes{file.separator}source-page-ocr.xml
type.14 = /CMISTypes{file.separator}source-page-xml.xml
type.15 = /CMISTypes{file.separator}bamboo-folder.xml
type.16 = /CMISTypes{file.separator}page-morphadorned.xml
type.17 = /CMISTypes{file.separator}page-plaintext.xml
type.18 = /CMISTypes{file.separator}volume-plaintext.xml
type.19 = /CMISTypes{file.separator}source-aggregate.xml
type.20 = /CMISTypes{file.separator}userfolder.xml
type.21 = /CMISTypes{file.separator}page-image-jp2.xml

 

Repository root file path

This is the path on the file system where CI-Hub will store and retrieve the folders and files it manages as a CMIS repository. Care should be taken to coordinate file ownership and permissions on this directory with the process owner of CI-Hub’s execution, e.g. the BSP process owner.

repository.cihub = /var/bamboo/cmis/content

 

Request and Response folders

The initial design of the CI-Hub intended submitted files (that is, Zotero files containing references that represent Bamboo Book creation/resolution requests) to arrive in the req folder, and processing status responses to be written by the CI-Hub into an associated file in the res folder. Zotero file submission can occur anywhere in the CMIS folder structure. Book processing status messages are written to a similarly named file in the res directory.

repository.cihub.request = req
repository.cihub.response = res

 

Source Code

The CI-Hub source code is organized into ROA and SOA source trees and includes Java, Scala, and Spring XML files. This section describes that source code organization. All source code, with the exception of the cihub.properties configuration file, is relative to ci-hub-service/ (its location in Bamboo's SourceForge code repository).

The cihub.properties file is not compiled and deployed in the OSGi bundles but is placed in the BSP file system relative to the path indicated in the $BSPLOCALSTORE_HOME environment variable. It is found in the ci-hub-config/ directory, at the same level as ci-hub-service/ in the source tree.

Resource Oriented Architecture (ROA) Source

The ROA source is found under ci-hub-service/resource. It contains Java and Spring beans.xml source files.

/ci-hub-service/resource/src/main/

 

ROA BSP Classes

The CI-Hub ROA layer contains only the two Java class files needed for the BSP, one for the ROA interface class and one for the ROA implementation class:

./java/org/projectbamboo/bsp/services/cihub/resources/
  ICIHubResource.java
  CIHubResource.java

 

The ROA resource is declared and configured with its associated Spring beans file:

./resources/META-INF/spring/
  beans.xml

 

Service Oriented Architecture (SOA) Source

Source code for the CI-Hub SOA layer is found under /ci-hub-service/service/.

/ci-hub-service/service/src/main/

It provides four major aspects of the CI-Hub customization of the Apache Chemistry FileShare implementation:

  1. SOA interface and implementation classes for BSP architecture
  2. Apache Chemistry class overrides and custom configuration
  3. Locator extensions for external repositories
  4. Custom CMIS type definitions and bindings

There are two Java packages, a Scala source package, a folder structure of resource files for configuration and CMIS types, and a webapp folder for deployment as a web application in a servlet container.

SOA BSP Classes

Like the CI-Hub ROA layer, the SOA layer contains two Java class files referenced in the layer’s beans.xml, one for the SOA interface class and one for the SOA implementation class. Both are implemented in Java:

./java/org/projectbamboo/cihub/northwestern/service/
  CMISRepositoryServiceFactory.java
  ICMISRepositoryServiceFactory.java

 

Apache Chemistry FileShare Repository Override

As described in the first section above, the CI-Hub extends the Apache Chemistry FileShare Repository with custom processing of Zotero files. The Java source files that override similar Chemistry classes are found in:

./java/org/projectbamboo/cihub/northwestern/domain/
  FileShareRepository.java
  FileShareService.java
  FileShareServiceFactory.java
  MIMETypes.java
  RepositoryMap.java
  RepositoryService.java
  TypeManager.java

CI-Hub also overrides org.apache.chemistry.opencmis.server.impl.atompub classes. The modification here is slight but affects several classes, which in turn must be overridden to reference the new class. This Apache Chemistry package contains a utility class, AtomPubUtils, which is used by several other classes in the package to form URLs in AtomPub replies. AtomPubUtils by default assumes that the execution environment is a servlet container running CMISAtomPubServlet. It therefore forms URLs based on the current domain and servlet context. However, in our BSP deployment, we are executing as a JAX-RS service on our own path (/cihub) under a CXF servlet in an OSGi container. Consequently, in the BSP environment we need to provide a different URL path in AtomPub replies, as the default URLs formed in AtomPubUtils would be incorrect. We override AtomPubUtils in CI-Hub, as well as the other classes in its package that use AtomPubUtils, to provide appropriate paths in AtomPub replies. The extra path is passed in the HttpServletRequest object and configured at the ROA layer in its bean definition’s atomPubAddedPath property. These custom classes are found in:

./java/org/projectbamboo/cihub/northwestern/domain/atompub
  AtomPubUtils.java
  NavigationService.java
  ObjectService.java
  RepositoryService.java

Locator Extensions for External Repositories

The CI-Hub source code is mixed-language, though both languages (Java and Scala) are JVM-based. All of it is part of the same org.projectbamboo.cihub.northwestern.domain package. The classes that implement interactions with external repositories, along with some utility classes, are in Java. The external repository “locator” classes and their support classes are in Scala. The language choice here is likely historical, reflecting different developers’ preferences over the life of the project rather than any specific advantage for the task.

Java utility classes for external repository API encapsulation:

./java/org/projectbamboo/cihub/northwestern/domain/
  fedora
    DataStream.java
    FedoraConnector.java
    FedoraConnectorREST.java
    HttpInputStream.java

  hathi
    HathiConnector.java

Scala “locator” and support classes:

./scala/org/projectbamboo/cihub/northwestern/domain/
  BambooRequest.scala
  BambooRequestImpl.scala
  BambooType.scala
  ConnectionException.scala
  HathiLocatorService.scala
  InvalidIDException.scala
  LocatorServiceAPI.scala
  PerseusLocatorService.scala
  RenderingType.scala
  TCPLocatorService.scala
  TextImageConverter.scala
  TimeoutException.scala
  ZipProcessor.scala
  ZoteroFileParser.scala

 

Custom CMIS Types and Configuration

./resources

These files provide definitions of custom CMIS objects. They serve as XML templates that the “locator” classes use to create CMIS metadata files for the Bamboo Book Model folders and files holding external repository content.

./CMISTemplates
  cmis.xml
  cmis.xml.folder
  cmis.xml.item
  cmis.xml.locator
  cmis.xml.page
  cmis.xml.tcp



These files provide CMIS definitions of custom Bamboo types.

./CMISTypes
  bamboo-folder.xml
  bamboo-page-document.xml
  book.xml
  contents.xml
  example-type.xml
  metadata.xml
  page-image-jp2.xml
  page-image.xml
  page-morphadorned.xml
  page-plaintext.xml
  page-tei.xml
  page-thumb150.xml
  page-xhtml.xml
  page.xml
  source-aggregate.xml
  source-mets.xml
  source-page-image.xml
  source-page-ocr.xml
  source-page-xml.xml
  userfolder.xml
  volume-plaintext.xml

The configuration file for external repositories referenced by CI-Hub connectors is found in the config directory:

./config
  connector.properties

 

 

 
