This wiki space contains archival documentation of Project Bamboo, April 2008 - March 2013.

Some Candidate Collections from Past Discussions, and from Corpora Space wiki

  • Some ideas tossed about in May in conversations around Collections Interoperability included HathiTrust, EEBO, ECCO, TCP, JSTOR, ARTstor, NINES, TextGrid (Germany), Perseus, and local collections (e.g., UW Digital Collections at Wisconsin). [Jim Muehlenberg]
  • Oxford Text Archive (OTA) has numerous corpora and electronic texts, and is currently undergoing technical enhancements to be part of the CLARIN infrastructure. All resources are free to the end user, but there are some licensing restrictions on their use. Contact: Martin Wynne, Oxford. [http://www.ota.ox.ac.uk]
  • CLARIN is a major European initiative building an infrastructure to offer access to language resources and tools. Interoperating with CLARIN services would offer a bridge to significant collections. [http://www.clarin.eu/vlo/]
  • British National Corpus. A major resource for the modern English language, used by linguists, language teachers and learners, lexicographers, etc. There is a licence fee to use the corpus. A major project to revise, enhance, anonymize, align and make available the digital audio of the 10m words of the spoken part of the corpus is underway (ending 2011). Contacts: Martin Wynne and John Coleman. [http://www.natcorp.ox.ac.uk/]
  • Oxford English Corpus: a major English language research resource, up to date, with various tools. Subscription required. [http://www.oxforddictionaries.com/page/oec]
  • Beazley Archive, Oxford: an important collection of resources relating to Classical Art, part of a collaborative venture with other archives to offer advanced discovery and access services in the CLAROS initiative. Contact: Donna Kurtz (or Bamboo Oxford participants). [http://www.beazley.ox.ac.uk/, http://www.clarosnet.org/]
  • Brigham Young University corpora, offering online access to BNC, and corpora of modern and historical American English, plus Spanish and Portuguese corpora. Contact: Mark Davies. [http://corpus.byu.edu/]
  • Early English Books Online (EEBO). Institutional subscription required. [http://eebo.chadwyck.com/]
  • Dot Porter (Indiana) reported at the 9/21/10 Work Spaces teleconference on the meeting she and Martin Mueller attended the prior week for ARC (Advanced? Research Collaboration) about extending NINES infrastructure to other corpora (e.g., medieval studies, renaissance, modernism, Canadian literature, etc., as various "nodes" in the ARC federation / set of corpora). Would this be another set of collections to consider? [Jim Muehlenberg]
  • Approximately 30,000 texts from before 1800 exist now in the EEBO-TCP, ECCO-TCP, and Evans-TCP projects. They will pass into the public domain in 2015-16. By then there are expected to be about 70,000 texts before 1700. For planning purposes it makes sense to think of this archive as already public. Brian Pytlik Zillig at Nebraska has been working on XSLT stylesheets that convert these texts (as well as texts from other TEI "Level 4" library archives) into a standard TEI P5 format that gives you "cross-walk" capabilities across all these texts. This work originated in the MONK project and is going very well: a recent run on 1,000 randomly chosen EEBO texts converted 950 without any difficulties, and a handful of fixable issues accounted for the fifty texts that didn't parse. (Added by Martin Mueller, October 20)
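
The kind of batch check described in the last item (run many converted texts through a parser and count how many come out clean) could be sketched roughly as below. This is only an illustration using Python's standard library. It tests whether each text is well-formed XML, not whether it is valid TEI P5, and it is not the XSLT tooling from MONK or Nebraska. The function name `tally_parse_results` and the three sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def tally_parse_results(documents):
    """Split a mapping of text id -> XML string into parsed and failed texts.

    Returns (ok_ids, failed), where `failed` maps each id that did not
    parse to the parser's error message.
    """
    ok_ids = []
    failed = {}
    for text_id, xml_text in documents.items():
        try:
            ET.fromstring(xml_text)  # well-formedness check only
            ok_ids.append(text_id)
        except ET.ParseError as err:
            failed[text_id] = str(err)
    return ok_ids, failed


# Hypothetical sample: two well-formed texts and one with a missing </p>.
samples = {
    "text-001": "<TEI><text><body><p>First text</p></body></text></TEI>",
    "text-002": "<TEI><text><body><p>Second text</p></body></text></TEI>",
    "text-003": "<TEI><text><body><p>Broken text</body></text></TEI>",
}

ok_ids, failed = tally_parse_results(samples)
print(f"{len(ok_ids)} of {len(samples)} parsed; failures: {sorted(failed)}")
```

Keeping the error message for each failed text makes it easier to see whether the failures share a small number of fixable causes, as was reported for the fifty EEBO texts.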

Other Candidate Collections

  • From John Coleman (Nov 6, 2010): The main collections I'd like to think about including, or negotiating to include, are Google Books (not just the public service, but an improved academic-oriented version using the scans that Google have provided to the participating libraries, with improved metadata); JSTOR (again, with the value-added service of being able to dig into their data for content analysis, citation threading, etc.); and HathiTrust (Oxford is not a member, I think, but if Bamboo opened up access through collaborations with participating institutions on some sort of reciprocal basis, that would be a powerful example of how Bamboo enables institutions not to have to individually buy in to everything separately).

Gridded version of candidate collections, for heuristic evaluation

| Collection | Proposed by/in | Type | URL | Value (to scholars? provides important variety to the test set? etc.) | How encoded/stored | Readiness/Accessibility (can we get at it for experimenting? rights impediments? OAI? APIs? Asset actions? etc.) | Comments/notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Google Books | John Coleman | Digitized books | http://code.google.com/apis/books/, http://books.google.com | | | Google Book search APIs Use terms: | From John Coleman (Nov 6, 2010): The main collections I'd like to think about including, or negotiating to include, are Google Books (not just the public service, but an improved academic-oriented version using the scans that Google have provided to the participating libraries, with improved metadata); JSTOR (again, with the value-added service of being able to dig into their data for content analysis, citation threading, etc.); and HathiTrust (Oxford is not a member, I think, but if Bamboo opened up access through collaborations with participating institutions on some sort of reciprocal basis, that would be a powerful example of how Bamboo enables institutions not to have to individually buy in to everything separately). |
| HathiTrust | Various | Primarily digitized books | http://www.hathitrust.org/ | High. | Metadata + facsimile scans + OCR'ed text | Variety of APIs for accessing search, bib data, rights data, etc. | |
| EEBO | | | http://eebo.chadwyck.com/ | | | Subscription currently required. MONK and possibly others have copies that can be available to BTP. | Approximately 30,000 texts from before 1800 exist now in the EEBO-TCP, ECCO-TCP, and Evans-TCP projects. They will pass into the public domain in 2015-16. By then there are expected to be about 70,000 texts before 1700. For planning purposes it makes sense to think of this archive as already public. Brian Pytlik Zillig at Nebraska has been working on XSLT stylesheets that convert these texts (as well as texts from other TEI "Level 4" library archives) into a standard TEI P5 format that gives you "cross-walk" capabilities across all these texts. This work originated in the MONK project and is going very well: a recent run on 1,000 randomly chosen EEBO texts converted 950 without any difficulties, and a handful of fixable issues accounted for the fifty texts that didn't parse. (Added by Martin Mueller, October 20) |
| ECCO | | | | | | | |
| TCP | | | | | | | |
| JSTOR | | | | | | | |
| NINES | | | | | | | Dot Porter (Indiana) reported at the 9/21/10 Work Spaces teleconference on the meeting she and Martin Mueller attended the prior week for ARC (Advanced? Research Collaboration) about extending NINES infrastructure to other corpora (e.g., medieval studies, renaissance, modernism, Canadian literature, etc., as various "nodes" in the ARC federation / set of corpora). Would this be another set of collections to consider? [Jim Muehlenberg] |
| ARTstor | | | | | | | |
| TextGrid | | | | | | | |
| Perseus | | | | | | | |
| UW digital collections | | Digitized texts, images, audio | http://uwdc.library.wisc.edu/ | | TIFF, JPEG, JP2, TEI, MODS; stored in Fedora repository | Fedora APIs | Transitioning to new Fedora repository, so there's room for experimentation/reconfiguring to meet Bamboo requirements |
| Winterton photographs | Northwestern | Digitized high resolution photographs, inventory/descriptive data | http://digital.library.northwestern.edu/winterton/ | | JP2 images, metadata available both as EAD and expressed as MODS | Most datastreams available via disseminators roughly corresponding to asset actions. Readily available for BTP demonstrators | |
| Northwestern Books | Northwestern | Digitized books | http://books.northwestern.edu | | Facsimile pages as JP2, OCR text as plain txt and ABBYY XML | Most datastreams available via disseminators roughly corresponding to asset actions. Readily available for BTP demonstrators | |
| Oxford Text Archive | | | http://www.ota.ox.ac.uk | | | | |
| British National Corpus | | | http://www.natcorp.ox.ac.uk/ | | | | |
| Oxford English Corpus | | | http://www.oxforddictionaries.com/page/oec | | | | |
| Beazley Archive | | | http://www.beazley.ox.ac.uk/ | | | | |
| BYU | | | http://corpus.byu.edu | | | | |
| Humboldt digital archive | Scholarly Narrative 1 | | http://www.avhumboldt.net/index.php?page=138 | | | | |
