This wiki space contains archival documentation of Project Bamboo, April 2008 - March 2013.

(Drafted with Claire Stewart and Karen Miller)

Book Model (Working Draft)

The Bamboo Book Model is a set of recommendations for providing a core set of standard content for any text that is represented in Bamboo and accessed via Collections Interoperability Services.

The Book Model is built on the Content Management Interoperability Services (CMIS) standard. A book will be accessed as a set of CMIS documents, folders, and references. Since CMIS provides a web-service interface, many of the details for access are provided by the CMIS standard and need not be repeated here.

In this recommendation, a package of original content documents retrieved from a source repository is always provided without modification. Additional metadata properties and generated content are added to ensure that tools can use a single set of conventions for navigation and display functions.

The software, under development by the Bamboo CI group, extends the OpenCMIS server library to provide a CMIS interface to clients. This CMIS interface is not an actual repository, but provides a virtual representation of the content objects that will be available from the source repository on request.

For performance purposes, content that has been acquired and processed will be cached for a limited period of time.
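
The caching behavior described above could be sketched as a small time-limited cache. This is only an illustration: the class name `ExpiringCache`, the `ttl_seconds` parameter, and the injectable `clock` are assumptions made for the example, and nothing here reflects the actual Bamboo CI implementation.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class ExpiringCache:
    """Hypothetical helper: keeps processed content for a limited period."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def put(self, key: str, value: Any) -> None:
        # Store the value together with the time at which it expires.
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry so the caller re-fetches from the source.
            del self._entries[key]
            return None
        return value
```

A caller would look up an item by its identifier first and only contact the source repository when `get` returns `None`.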

Source Content

At a minimum, source content should include:

1. Source repository id and an identifier for the item within that repository.

2. Metadata to supply the required item-level fields: title, creator, and date.

3. Full text of the book. This might be TEI, HTML, or OCR text files.

Optionally a book might include:

4. Additional item-level metadata: publisher, publication date(s)

5. Identifier for another book if this is a version. If this is a volume in a series, an identifier for that series.

6. Structural information about the book, divisions (chapters, acts, etc.)

7. Scanned images of individual pages
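
As a rough illustration of the minimum requirements listed above, the following sketch checks a source package for the required pieces. The `SourceContent` class and its field names are hypothetical and chosen only for this example; they are not Bamboo property names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SourceContent:
    """Hypothetical container for what a source repository supplies."""
    repository_id: Optional[str] = None
    item_id: Optional[str] = None
    title: Optional[str] = None
    creator: Optional[str] = None
    date: Optional[str] = None
    full_text_files: List[str] = field(default_factory=list)  # TEI, HTML, or OCR


def missing_required(content: SourceContent) -> List[str]:
    """Return the names of required pieces that are absent or empty."""
    missing = []
    for name in ("repository_id", "item_id", "title", "creator", "date"):
        if not getattr(content, name):
            missing.append(name)
    if not content.full_text_files:
        missing.append("full_text_files")
    return missing
```

An empty result means the package meets the minimum; optional material such as page scans or structural information is deliberately not checked.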

Organizing Book Materials


The organization of a book is shown in the diagram "Bamboo CI Book Model". A book is represented as a CMIS folder of type bamboo:book. Within this folder are subfolders that contain the original source material and derived page HTML, TEI, page and thumbnail images, and indexes. Item-level metadata is attached as property values on the item folder. Item-level metadata fields include:


  • dc:title: Dublin Core title (required)

  • dc:creator: Dublin Core creator (required)

  • dc:date: Dublin Core date (required, ISO 8601 date range)

  • dc:publisher: Dublin Core publisher

  • dc:issued: Dublin Core publication or issue date

  • dc:isPartOf: Identifier for volume that includes this item

  • dc:isVersionOf: Identifier for text that this is a version of

  • dc:identifier: Set only if there is a permanent URL for the item. The value of dc:identifier is the permanent URL in the contributing repository

  • Short identifier of contributing repository

  • bamboo:uri: Bamboo-assigned, unique URI for the item. The URI includes a short identifier for the contributing repository and an identifier for the item in that repository.

The "dc:" prefix represents the Dublin Core namespace, and the "bamboo:" prefix represents the Bamboo namespace (URI to be determined).

The values of the bamboo:uri, dc:isPartOf, and dc:isVersionOf properties are Bamboo identifiers. Bamboo identifiers are URIs in the form "<source>/<item-id>" where <source> is a short Bamboo-assigned URI for the contributing repository and <item-id> is the same as, or a slightly modified version of, the contributing repository's identifier for the item. The source identifier is modified only so far as is needed to make it a legal component of the URI.
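
The "<source>/<item-id>" form can be sketched as follows. The page does not say how an item identifier is made URI-legal, so percent-encoding is an assumption here, and the function name `bamboo_identifier` is hypothetical.

```python
from urllib.parse import quote


def bamboo_identifier(source: str, item_id: str) -> str:
    """Build a "<source>/<item-id>" identifier.

    Assumption: characters that are not legal in a URI path component
    (including "/") are percent-encoded; identifiers that are already
    legal pass through unchanged.
    """
    return f"{source}/{quote(item_id, safe='')}"
```

With this choice, an item identifier containing only letters, digits, and "-", ".", "_", or "~" is left exactly as the contributing repository supplied it.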

Dates (dc:date and dc:issued) are expressed as ISO 8601 dates or as a range of ISO 8601 dates in the format "<d1> to <d2>". A date may be followed by a " ?" string if the original date was marked as uncertain. For example:

    1099 to 1100 ?

This represents an uncertain year between 1099 and 1100 inclusive.
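
A minimal parsing sketch for this date format follows. It assumes that only the calendar forms YYYY, YYYY-MM, and YYYY-MM-DD occur, that the " ?" marker may follow either date, and that a single uncertainty flag is enough; the names `BambooDate` and `parse_bamboo_date` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class BambooDate(NamedTuple):
    start: str
    end: Optional[str]   # None when the value is a single date
    uncertain: bool      # True if a " ?" marker follows either date


# Assumption: only YYYY, YYYY-MM, and YYYY-MM-DD forms are handled.
_DATE = r"\d{4}(?:-\d{2}(?:-\d{2})?)?"
_PATTERN = re.compile(rf"^({_DATE})( \?)?(?: to ({_DATE})( \?)?)?$")


def parse_bamboo_date(value: str) -> BambooDate:
    """Parse "<d1>", "<d1> to <d2>", each optionally followed by " ?"."""
    match = _PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"not a recognized date or date range: {value!r}")
    start, mark1, end, mark2 = match.groups()
    return BambooDate(start, end, mark1 is not None or mark2 is not None)
```

Under these assumptions, the example value "1099 to 1100 ?" parses to a start of "1099", an end of "1100", and the uncertainty flag set.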

Page Content 

Book pages are rendered in a number of formats depending on the materials available. Book pages will always be available as XHTML 1.0 and TEI page files. If page scan images are available, each page will also include a full image and a thumbnail.

Every representation of a page appears in a sub-folder of the item folder according to its type. Page sub-folders are named: XHTML, TEI, images, and thumbnails. The following CMIS types are used to represent page content:

bamboo:page-image (name "<sequence>.jpg")

A high-quality JPEG image of a book page with a width of 800 pixels. Aspect ratio is the same as the original scan.

bamboo:page-tei (name "<sequence>.tei.xml")

TEI representation of book page

bamboo:page-xhtml (name "<sequence>.html")

Semantic XHTML (that is, without formatting tags) of a book's page text. CSS classes will be mappable to TEI elements.

bamboo:page-thumb150 (name "<sequence>w150.jpg")

Thumbnail for the page as a JPEG image with a width of 150 pixels. Aspect ratio is the same as the original scan.
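
The naming conventions above can be summarized in a short sketch. The name patterns come from the type list; which sub-folder holds each type is an assumption inferred from the folder names, the format of the sequence value is not specified here, and the helper names `page_document_paths` and `scaled_size` are hypothetical.

```python
from typing import Dict, Tuple

# CMIS type -> (sub-folder, name pattern). Patterns are from the list above;
# the type-to-folder mapping is an assumption based on the folder names.
PAGE_TYPES: Dict[str, Tuple[str, str]] = {
    "bamboo:page-image": ("images", "{seq}.jpg"),
    "bamboo:page-tei": ("TEI", "{seq}.tei.xml"),
    "bamboo:page-xhtml": ("XHTML", "{seq}.html"),
    "bamboo:page-thumb150": ("thumbnails", "{seq}w150.jpg"),
}


def page_document_paths(sequence: str) -> Dict[str, str]:
    """Return the path, relative to the item folder, for each page type."""
    return {
        cmis_type: f"{folder}/{pattern.format(seq=sequence)}"
        for cmis_type, (folder, pattern) in PAGE_TYPES.items()
    }


def scaled_size(width: int, height: int, target_width: int) -> Tuple[int, int]:
    """Scale to a fixed width while keeping the original aspect ratio."""
    return target_width, round(height * target_width / width)
```

For a 1600 by 2400 pixel scan, `scaled_size` gives 800 by 1200 for the page image and 150 by 225 for the thumbnail.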

Each page document (any of the above types) has the following properties:


  • Assigned sequence number

  • Page identifier as it appears in the source content

  • Identifier for the division (chapter, act, etc.) that includes the page. This may have multiple values if the page includes content belonging to more than one division.


These properties are attached to each page representation as well as to any separate source object (OCR text, image, etc.) that was used in generating the derived content, so that the original source objects can be matched with generated content via a join.
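
The join mentioned above can be sketched by grouping objects on their shared sequence number. The dictionary keys "sequence" and "name" are placeholders invented for this example and are not the actual Bamboo property names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_sequence(objects: Iterable[dict]) -> Dict[str, List[str]]:
    """Group source and generated objects that describe the same page.

    Each object is a dict with hypothetical keys "sequence" (the assigned
    sequence number) and "name" (the document name).
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for obj in objects:
        grouped[obj["sequence"]].append(obj["name"])
    return dict(grouped)
```

Passing in both the generated page documents and the source objects yields, for each sequence number, every document that relates to that page.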

Original Source Content

The original content as obtained from the contributing repository is located in a subfolder of the item folder named "source". This content is exactly as obtained from the contributing repository. Additional property fields are added to the source objects to allow them to be matched with generated content (see page properties above). The following CMIS types are used to identify source content:

bamboo:source-mets (name as original)

METS document describing the book as supplied by the contributing repository

bamboo:source-page-image (name as original)

A scanned image of a page (any image format) as supplied by the contributing repository

bamboo:source-page-ocr (name as original)

Raw OCR text of a book page as supplied by the contributing repository

bamboo:source-page-xml (name as original)

XML text and/or metadata for a book page as supplied by the contributing repository (usually TEI).

Additional types will be added as needed.

Contents Folder and Indexing

Each item folder will contain a "contents" sub-folder, with multiple index folders. Each index provides an alternative way of navigating generated content. Indexes are (possibly nested) folders containing documents that reference content documents.

Indexes are sets of annotations of source and/or generated content, similar to (and available as) an Atom feed. Even though generated content is organized by page, index items may refer to a selected portion of a range of pages so as to allow divisions and other indexed units to overlap page boundaries. Indexes are described in the diagram labeled "Contents / Indexing".
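
To illustrate how index items can overlap page boundaries, the sketch below models an index entry as a labeled range of page sequence numbers. The `IndexEntry` class and its fields are hypothetical, and a fuller model would also record the selected portion within the first and last page.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IndexEntry:
    """Hypothetical index item covering an inclusive range of pages."""
    label: str
    first_page: int
    last_page: int


def entries_for_page(entries: List[IndexEntry], sequence: int) -> List[str]:
    """Return the labels of all index entries that include the given page."""
    return [e.label for e in entries if e.first_page <= sequence <= e.last_page]
```

A page on which one chapter ends and the next begins is returned for both entries, which is the overlap the index design is meant to allow.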

Page Index

The page index provides a way of navigating to each page. The page index provides an easy way to examine all of the content (generated, and original source) that is related to a page.

Div Index

The div index provides a structural view of book content (for example, by chapter, act, and scene). This index is built using the structure map in a METS file or the TEI div structure, depending on the content source.
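
As a sketch of how a div index might be derived from a METS structure map, the example below walks nested `div` elements and collects their file pointers. The sample document is invented for illustration, real METS files from contributing repositories will differ in detail, and the output dictionary shape is an assumption rather than the Bamboo index format.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Invented sample; note that page "p2" belongs to both chapters.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="logical">
    <mets:div TYPE="book" LABEL="Sample Book">
      <mets:div TYPE="chapter" LABEL="Chapter 1">
        <mets:fptr FILEID="p1"/>
        <mets:fptr FILEID="p2"/>
      </mets:div>
      <mets:div TYPE="chapter" LABEL="Chapter 2">
        <mets:fptr FILEID="p2"/>
        <mets:fptr FILEID="p3"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def div_to_index(div: ET.Element) -> dict:
    """Convert one METS div, and its nested divs, into a nested dict."""
    return {
        "type": div.get("TYPE"),
        "label": div.get("LABEL"),
        "files": [f.get("FILEID") for f in div.findall("mets:fptr", METS_NS)],
        "children": [div_to_index(c) for c in div.findall("mets:div", METS_NS)],
    }


def build_div_index(mets_xml: str) -> dict:
    """Build a nested index from the first structMap in a METS document."""
    root = ET.fromstring(mets_xml)
    top = root.find("mets:structMap/mets:div", METS_NS)
    if top is None:
        raise ValueError("no structMap div found")
    return div_to_index(top)
```

Calling `build_div_index(SAMPLE_METS)` returns a tree whose two chapter entries both reference "p2", mirroring a page that contains content from more than one division.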

Genre Indexes

Items may include one or more Genre Indexes. Each Genre Index resembles the div index, but will organize content using a standardized set of div types that will be consistent across source repositories. The types used will depend on the type of content (book, drama, essay, etc.).

Genre indexes should be considered a research topic and may or may not exist for any particular book.

1 Comment

  1. Unknown User

    Google Books / HathiTrust Book Pagination Alignment

    This issue relates to the identification of pages of books from the Google Books corpus (and presumably the HathiTrust). Book page identifications may not always have a consistent relationship with the identifiers used on different files of OCR'd text. In general, a given HTML file in Google Books represents a given OCR'd page of a book. However, because book scanning is not perfect and because books often contain complex pagination, it is often difficult to align book pages with file identifiers. This may be an issue when attempting to implement or use the "Bamboo Book Model". It will probably give users many headaches as they try to align results from different text analysis services (including Places-Text).

    It may be very difficult to solve this issue without manual effort in identifying how book pagination aligns with file identifiers in problematic books. It may be very useful for Bamboo to offer a client for researchers to share mappings, contributing metadata about books to augment institutionally created metadata and metadata presented in the Bamboo Book Model.