January 2011 Meeting - Pre-Conference Workshop Session 2 (CI) Notes

This wiki space contains archival documentation of Project Bamboo, April 2008 - March 2013.

CI Areas of Focus:

This morning we are talking about two areas of focus of the four that were identified for the CI group.

1. Prioritization, collection profiling
2. Interoperability operations

[Tim's presentation will go up online?]

Issues implicated in interoperability at the item level:
-item-level metadata profiles
-getting item representations out of and back into a repository

Identifying and describing collections that support scholarly activities, including construction of specialized corpora.

List of possible candidate collections, many of these have been identified by the Corpora Space group
-EEBO, ECCO, Perseus, Hathi, Google

Broad question about identifying collections: most of those currently on the list have been identified by the scholarly narratives gathered in the Bamboo meetings, and by

Many available digital collections found on the web fall short as scholarly corpora:
-too opportunistic (Hathi is what was in library collections)
-inconsistent in object structure
-many insufficiently described

Existing collections should be thought of not just as corpora but as sources for new corpora that scholars will create as they work with Bamboo tools. We need to look at what scholars do, what they want to do, ad hoc collections, etc.

Example of MONK, which is a corpus built from other corpora. In order to make the MONK texts usable, had to modify and transform them, and in effect have created a new corpus.

----
Harriett Green - English and Digital Humanities Librarian at University of Illinois

Presentation on engaging scholars and identifying collections to be used in Bamboo.

ATTACHMENTS: Bibliography document

Handout document

Many digital collections that we might use fall short as corpora, and we should think of them more as resources that a scholar might use to build a corpus.

Proposing we do a needs assessment with the scholars to identify what they are interested in, from the perspective of modern library collection development.

Humanities scholars constantly create personal collections and corpora. Ref. the Brockman article.  Useful digital collections can't just be mass aggregations. Palmer presentation at ASIS meeting: size and search are baseline

How do we retain the identity of the individual collections? How do we systematically aggregate objects from collections?

Concept of a dynamic user and type of collection. Emerging user-centered principles in library collection development are relevant for this Bamboo work.

Library collection development: we no longer have analog collections, we are managing a hybrid of print and digital. Access, content and place are dispersed. There are many collections that we have no control over. We now call it "knowledge/content management"

Information seeking is contextual and interactive. Access must be seamless and as simple as possible.

This concept extends to Bamboo collections as well
'contextual mass' rather than 'critical mass'

User is central to how collection is developed. Same perspective can be applied to the Bamboo collections.

Humanities corpora:
collections and aggregations. Two types: traditional, and digital humanities corpora. One: literary body. Second: materials on which some

Humanities corpora defined by scholarly work patterns. Browsing, researching, chaining. Discovering, referring, annotating, comparing, etc. Use in teaching/pedagogical implications. Also defined by heterogeneity of objects (not just texts, also using paintings, other media), very interdisciplinary in nature. Gathering all different kinds of resources.

DIGITAL humanities corpora have typically been either: corpus linguistics, or ____ . Very much driven by what scholars are seeking.

These two types of collections are the starting points for the things that Tim mentioned. All built from user's perspective. Will all be different from the start. We do know they will have rich content, need to have discovery mechanisms, and they must be authoritative. Need to know that they have been carefully curated.

What collection attributes make a collection a good foundation for building corpora?
-rich metadata
-normalized into viable formats
-comprehensive in axes of research interest for scholars
-content framed in terms of 'contextual mass'
[Reference the Palmer paper]

If we examine existing use cases that reveal what scholars want from a digital collection, and supplement with a survey that tells us more about what they need, we can form a scholarly rubric that will tell us

Neil: it won't do to have lots of material that we can cut across, because we can only cut across it in a very shallow way. Important not just to be able to cut across it horizontally, but also have a way to deeply engage with it.

Tim: Bamboo as just plumbing, there may be validity to this approach, but it presupposes that stuff out there on the web is already in a condition to be used. We are going to have to be able to do remediation, make derivatives, etc. so that scholars can do the things they need to do.

Commenter: Offering a dissenting view: history of corpora space computational linguistics goes against this principle. Has gone beyond just aggregation at scale. Sometimes it is incredibly valuable just to be able to throw very large volumes of material. Important to future-proof so that we don't just serve today's scholars who want to do very specific things with very well formed content in specific forms.

Tim: yes, we will probably need to serve both communities.

John: ability to aggregate very very large collections is a new kind of scholarship, and this is the future, not very small collections of very carefully curated content.

Martin: it's different strokes for different folks. You could argue that in some disciplines, very large collections of texts a la Hathi and Google are exceedingly valuable. For others, smaller collections curated in a specific way will be more valuable. It may be that people who need large undistinguished masses of texts are being served very well by what is happening already anyway. Two models: a pyramid model where you could curate up from the bottom level. The other is "I and you", as with scientists: I have my data, I contribute it to a gene bank or some other common data set, and the nuisance of doing this (preparing it, normalizing it, etc.) pays off because there is a downstream benefit when others also contribute the data they have produced.

Tim: let's include in the spring, summer, fall, in preparation for a Corpora Space phase 2 proposal, more information about what real scholars say that they need. Do they feel like they can't connect to Google without some mediation? Or is the problem that they want to be able to do some curation and contribute it back and feel like they can't do that? We have anecdotal evidence for what we think they want, we need to test it a bit more.

Back to Harriett...
Need to learn their research agenda and learn more about the tools they are using. Ultimately we need to be able to connect with our scholars and understand their needs and their activities better.

Proposing a three part process for assessing needs
-scholarly narratives
-targeting surveys
-one-on-one interviews

Scholarly narratives: examining use cases currently on wiki. We may not need to gather new scholarly narratives?

Surveys: identify scholars and faculty who are working with digital collections. Won't be a mass survey, would be a very targeted survey.

Marlita: will we also include people who aren't yet putting their toes in the water with digital collections?

Harriett: yes, we do need to find out who is interested but hasn't yet engaged.

What are the services that will make it attractive for them to engage with a collection?

Jim: could we consider potential consortium members with this? We don't just need to limit to our current Bamboo partners.

Tim: Need to think about how we will be successful engaging with these faculty. Will we be more successful

Bill: questions are relatively high level. If we are asking high level questions about how they would use, and then we want to translate into Bamboo services, how do we harmonize the two?

Tim: we don't really know that yet. The in-depth interviews might help

Jonathan: scholars working with digital materials today might be a very different set from those who will be working with them in 10 years.

Interviews slide
These have been tried with various degrees of success. If we are skeptical about how effective these techniques will be, how else can we engage with scholars about what they need?

Lee: There are huge digital collection efforts underway that are grappling with these problems as well, we should

Tim: it's been more typical that librarians have dominated the digital collection development, based on funding or for other reasons. We are starting to change our approach to this in some ways, how do we make those activities more focussed on what faculty actually need and want?

Martin: It's a very difficult problem. University of Toronto: very deep network of conversation, but largely informal, about what kinds of books to buy. That network has largely disappeared because we have new procedures for print collection development. At Northwestern, very little conversation about patterns of collection building, collection use, this seems like it is typical.

Neil: Another way to approach the issue would be to look at what large collections are actually getting the largest use right now (published papers, etc.), might actually interview people who are running those collections, and what makes them heavily used? [DOUBLE STARS ON THIS COMMENT]

Robert: Moray experiments to gather ? [DIDN'T QUITE CATCH THIS]

John: Scholars are consumers, but are also producers.

Neil: we can work with Corpora Space on this work, because for CS to be successful, it has to connect with scholars who are producing.

Tim: we have to present a better case about why we are interacting with particular content

John: There is a tension between a pragmatic goal of what we can get going reasonably quickly and well, and getting a broad set of materials and collections. We have to start somewhere, starting with text makes sense, but people also use different kinds of content. If we keep skirting it, we may be doing ourselves a disservice.

Tim: yes, there is a tension, partly the problem is the 18 month timeframe.

David: yes, eventually we will need to get into other types of content, but we had to be pragmatic and choose some things to work with, but building in the conversations about these other things early, it was a way to balance. Design/planning work going hand-in-hand from the start (Corpora Space) with the other work of doing Work Spaces, etc.

Back to Harriett
Result would be to have developed a rubric for phase two that includes specific and relatively objective parameters for identifying collections for the next phase of Bamboo work.

Back to Tim
We have a limited amount of time, we are going to have to recruit help from elsewhere.

Jim: NITLE folks have a digital humanities contact, she may be able to coordinate a call to them, this might help to promote thinking about the value of Bamboo.

David: out of this meeting, can we take an approach where we make some first phase decisions about the collections we will be focusing on for the first phase of work, in addition to laying this groundwork for the next phase?

Tim: Yes, Corpora Space has already identified 3 collections they want to work with. If they can expand that this week to identify a set of 5 or 6 collections to work with, then we can solidify it and focus our work on it.

Neil: CS and CI should be thinking about: what are the questions we are hoping to answer by working with these collections? So that we come out of it with a sense of what's working and what's not, and know what to focus on in phase two.

Tim: one question that's come up is what are the object models we will be working with? What will the local proxy look like at the item level? We also need to be able to describe a collection at the aggregation level, so that we can say what we know about the collection. Also so that the tools and services component will know something about how to interact with the collections and how to make use of the objects. This blurs over into the service registry, so there are questions about whether this will be met by the TSR, or if a separate collection registry will be needed?

Goals:
-Facilitate discovery of collections. Users should be able to see the same kinds of things about collections that Bamboo can interact with
-Facilitate use and access
-Eventually enable bi-directional access (put modified objects back into a collection)

Prior work on descriptions of collections (incomplete listing)
-RSLP Collection Description
-Dublin Core Collection description
-IMLS Digital Collections and Content Collection Description
-IESR (JISC) Collection app profile. Had some
-ANDS RIF-CS App profile. Includes some description of services
-Ockham initiative out of NSDL. Picked up a bit out of DLF

Prior work
-Core elements for collection description
-Focus on discovery of collections
-Focus on human readable descriptions, used in collection registry
-Some machine actionable information, controlled vocab of language, etc. but generally limited semantics for service URLs. Generally insufficient for machine action. Question for Bamboo is what information we are going to record.

EXAMPLE of what a collection description might look like in RDF using the Dublin Core schema. When providing URLs, might have a collection of these, each of which does different things (OAI provider, search URL, etc.)
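
The slide itself is not reproduced in these notes. As a rough illustration only, a collection description of this kind could be generated as below. The choice of dc:relation for the service URLs, the function name, and the example.org addresses are all assumptions made for the sketch, not what was shown in the session. It also shows the limitation discussed under "Prior work": plain Dublin Core has no obvious place to say what each URL does.

```python
import xml.etree.ElementTree as ET

# Real namespace URIs for RDF and the Dublin Core element set.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
DCMITYPE_COLLECTION = "http://purl.org/dc/dcmitype/Collection"


def describe_collection(uri, title, description, service_urls):
    """Return an RDF/XML string describing one collection.

    service_urls is a list of URLs (OAI provider, search URL, ...).
    Assumption: each one is recorded as a bare dc:relation link, so the
    description does not say which URL offers which service.
    """
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("dc", DC)
    root = ET.Element(f"{{{RDF}}}RDF")
    node = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": uri})
    ET.SubElement(node, f"{{{RDF}}}type", {f"{{{RDF}}}resource": DCMITYPE_COLLECTION})
    ET.SubElement(node, f"{{{DC}}}title").text = title
    ET.SubElement(node, f"{{{DC}}}description").text = description
    for url in service_urls:
        ET.SubElement(node, f"{{{DC}}}relation", {f"{{{RDF}}}resource": url})
    return ET.tostring(root, encoding="unicode")


# Hypothetical collection and service addresses, for illustration only.
example = describe_collection(
    "http://example.org/collections/demo",
    "Demo collection",
    "A made-up collection used to illustrate the record shape.",
    ["http://example.org/demo/oai", "http://example.org/demo/search"],
)
print(example)
```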

Example of RIF-CS, where you have a few more affordances for URLs that point to specific services. Still a bit crude and preliminary but it is a start.

Example XML from RIF-CS, showing how you would describe an RSS feed.

RIF-CS example expressed in RDF, a page from Hathi.  

Open issues:
-XML vs RDF?
-Information required for Bamboo WS, SP, CS? Will all of the information be hidden in the CMIS connectors? Do we need to be concerned about this? Will the collection registry need to provide the information that CMIS needs?
-Potential need for Bamboo specific extensions:
    -temporal
    -spatial
    -expand serviceType vocab for Bamboo
    -vocabularies for art attributes: type, etc.

Question for Jonathan, for CMIS to work
-Do all the collections need to have nice clean names for objects? what if I want to be able to differentiate between a page image or the text of a page

Jonathan: CMIS is like XML, it doesn't restrict, it's more of a framework for creating expressions

Tim: So Bamboo will probably need to think about expanding vocabulary, and that will need to be a group activity.

If we can get, by the end of the week, some ideas about who from the other groups needs to be involved, we can put together some crosswalking groups.

Steve: one of the things we need to be pretty clear about up front is that the TSR is about providing information to humans, not information to machines to create automated collections. In Big SOA space, there is an idea of service registries, where machines can figure out where the profiles for the services live. There hasn't been tremendous success in this area. Doesn't want to set us up to try to tackle this same problem. So think about who the descriptions are for. Are they for machines? Are they for technologists who will integrate services? Are they for scholars who can find out which collections I'm interested in can be connected to which tools I'm interested in?

Tim: Collection descriptions do very well with human-readable descriptions. But what about people who want to go beyond to search, so that, for example, a search can be limited to things that have some temporal limitations.
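
A minimal sketch of what Tim is describing, assuming a registry record carries a machine-readable temporal range alongside its human-readable text. The record shape, the field names, and the sample collections are invented for illustration; nothing here reflects an agreed Bamboo schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CollectionDescription:
    name: str
    summary: str                                # human-readable description
    temporal: Optional[Tuple[int, int]] = None  # hypothetical (start_year, end_year)


def covering_period(collections: List[CollectionDescription],
                    start: int, end: int) -> List[CollectionDescription]:
    """Return collections whose recorded coverage overlaps [start, end].

    Collections with no machine-readable coverage are skipped: they can
    still be read about by a person, but they cannot be filtered on.
    """
    hits = []
    for c in collections:
        if c.temporal is None:
            continue
        c_start, c_end = c.temporal
        if c_start <= end and start <= c_end:
            hits.append(c)
    return hits


# Invented sample records.
sample = [
    CollectionDescription("Early print sample", "Printed books.", (1475, 1700)),
    CollectionDescription("Eighteenth-century sample", "Printed books.", (1701, 1800)),
    CollectionDescription("Undated sample", "Described in prose only."),
]
matches = [c.name for c in covering_period(sample, 1650, 1720)]
print(matches)
```

The third record is the case Steve's caution applies to: a description written only for humans is still useful, but it drops out of any search that depends on structured fields.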

Doug: this is a big problem. Can take the user to the corpora or the data set that may be relevant to you. At ANDS, the goal is to get the data out there for re-use, including for data sets that may be living in some analog format, filed away in a box, etc. Goal is to put it out there, get people to reuse it, etc.

Tim: Part of CI is important

It is speculative to talk about machine actionable collection description, but there has been work lately that is making progress.

To have a machine actionable registry is not out of scope, but we need to be on guard against going to the polar extremes. We also need to be careful about classification schemes that we claim are authoritative; a different scholar from a different discipline might see such a claim differently. We might want to consider

Commenter2: how high a priority is it to deal with persistent identifiers? Was surprised to see us using URLs. Registering persistence is very important to avoid brittleness.

Tim: yes, this is a good point, the URL we were using is being used as part of the RIF-CS standard. For smaller collections it will be important but we will always be dependent on what the collection supports. Where Hathi or others will use HDLs or ARKs, we should support them.

Doug: pragmatic choice to use URLs rather than URIs. A reference point could just as easily be a URI.
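
To make the identifier discussion concrete, here is a small sketch of how a registry might record which kind of identifier a collection supplies, so that Handles and ARKs can be treated as persistent and plain URLs flagged as potentially brittle. The patterns are deliberately simplified assumptions (for example, the ARK pattern only accepts an all-digit name assigning authority number) and the sample identifiers are made up.

```python
import re

# Simplified patterns; real ARK and Handle syntax allows more variation.
ARK_PATTERN = re.compile(r"ark:/?\d{5,}/\S+")
HANDLE_PATTERN = re.compile(r"(?:^hdl:|://hdl\.handle\.net/)[^/\s]+/\S+")
URL_PATTERN = re.compile(r"^https?://")


def identifier_kind(identifier: str) -> str:
    """Classify an identifier string as 'ark', 'handle', 'url' or 'unknown'."""
    if ARK_PATTERN.search(identifier):
        return "ark"
    if HANDLE_PATTERN.search(identifier):
        return "handle"
    if URL_PATTERN.match(identifier):
        return "url"
    return "unknown"


# Made-up identifiers for illustration.
for candidate in [
    "ark:/12345/abc123",
    "http://hdl.handle.net/1234/abc",
    "hdl:1234/abc",
    "http://example.org/collection/42",
]:
    print(candidate, "->", identifier_kind(candidate))
```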

Tim: Is IESR completely quiet? Commenter3: doesn't know but will find out.

Robin: collection discovery. Wondering how important it is? Scholars generally already know about the collections they need.

Tim: doesn't think it's necessarily true that scholars know about the collections they might be interested in, but probably true that they won't come to Bamboo to look for it. Some collection description will be useful but don't know if

Commenter1: if all Bamboo does is provide a gateway to resources they already know about, it's not that valuable.

Robin: tool discovery is more important

Martin: need to know what tools will work with which collections.

Neil: this also exists as a UI collection. Scholars may know about their collections that they use, but they might not know everything about its characteristics. Maybe a recommender system, or something that could tell people that if you are interested in this, you might also be interested in this thing, or people who use these things also use these other things...
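
Neil's "people who use these things also use these other things" idea can be sketched as simple co-occurrence counting over usage records. This is only one naive way to do it, and the session data, the item names, and the function names below are all invented for the example.

```python
from collections import Counter
from itertools import combinations


def co_use_counts(sessions):
    """Count how often each pair of items appears in the same session."""
    counts = Counter()
    for used in sessions:
        for a, b in combinations(sorted(set(used)), 2):
            counts[(a, b)] += 1
    return counts


def also_used_with(item, sessions, top_n=3):
    """Return up to top_n items most often used alongside `item`."""
    scores = Counter()
    for (a, b), n in co_use_counts(sessions).items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [name for name, _ in scores.most_common(top_n)]


# Invented usage records: each inner list is what one scholar used together.
sessions = [
    ["collection-a", "tool-x"],
    ["collection-a", "tool-x", "collection-b"],
    ["collection-b", "tool-y"],
]
suggestions = also_used_with("collection-a", sessions)
print(suggestions)
```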

Martin: Spenser list...even people with deep interest in very specific corpora don't actually know about all of its capabilities or the tools that can be applied.

Tim: where is TSR work located in the project?

Bruce: It's in Work Spaces domain. Within the next 9 months, need to have a platform available for information about tools that can operate in the work spaces. Recently the discussion in Work Spaces has focused on whether the TSR should include information about collections, which are actionable, which can be used using the tools in front of me right now. Another is sort of advertising ... this is what Bamboo will allow you to do.

David: this is a marker, we put TSR in, it's important but underresourced, sort of a stepchild. We may want to call it out and focus on it more, across the project focus more on things that need more attention.

Tim: handing it over to Bill to talk about CI's work

Bill: Has been working with Scott trying to define the actual actions we would want to perform on these actionable collections. If we have adapters that can talk to different collections, we can obtain information, perform searches, etc. Get metadata: this one gives us MARC, this one Dublin Core, and then we can apply transformations between them. The answer to whether we can perform actions depends on what the actions are. Are we doing full text search? If we want to add Georefs, do we have geographic information anywhere? Working on breaking the use cases down to fairly atomic operations, and also looking at the specific collections that have been nominated to understand what the API affords. Very much in harmony with the work that Harriett has outlined. Don't want to privilege any specific use case, like the abstraction that we have been exposing so far in the CI profiling discussion. Understanding what the services are, either directly from the provider or via one of the remediation services.
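
The adapter idea Bill describes might look roughly like the sketch below: each collection sits behind an adapter with the same "get metadata" operation, and a crosswalk normalizes MARC to Dublin Core. The class names, the in-memory records, and the three-field mapping are assumptions for illustration; the mapping is a tiny subset of the usual MARC to Dublin Core correspondences, not a complete crosswalk.

```python
from typing import Dict, Protocol, Tuple

DublinCore = Dict[str, str]
MarcRecord = Dict[Tuple[str, str], str]  # (tag, subfield) -> value

# Small illustrative subset of a MARC to Dublin Core crosswalk.
MARC_TO_DC = {
    ("245", "a"): "title",
    ("100", "a"): "creator",
    ("260", "c"): "date",
}


def marc_to_dc(record: MarcRecord) -> DublinCore:
    """Keep only the mapped fields; unmapped MARC fields are dropped."""
    return {dc: record[key] for key, dc in MARC_TO_DC.items() if key in record}


class CollectionAdapter(Protocol):
    """What every adapter is assumed to offer, whatever the source format."""
    def get_metadata(self, item_id: str) -> DublinCore: ...


class MarcCollection:
    """Hypothetical collection whose native metadata is MARC."""
    def __init__(self, records: Dict[str, MarcRecord]):
        self._records = records

    def get_metadata(self, item_id: str) -> DublinCore:
        return marc_to_dc(self._records[item_id])


class DublinCoreCollection:
    """Hypothetical collection that already supplies Dublin Core."""
    def __init__(self, records: Dict[str, DublinCore]):
        self._records = records

    def get_metadata(self, item_id: str) -> DublinCore:
        return dict(self._records[item_id])


marc_side = MarcCollection(
    {"m1": {("245", "a"): "A Title", ("100", "a"): "An Author", ("650", "a"): "A Subject"}}
)
dc_side = DublinCoreCollection({"d1": {"title": "Another Title"}})
print(marc_side.get_metadata("m1"))
print(dc_side.get_metadata("d1"))
```

The dropped subject field in the first record is the kind of loss a real crosswalk would need to decide about explicitly.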

Three use cases right now: 1. geomashup (someone has texts and they'd like to map placenames in Google). Scott mapped this: operation, input, output. 2. Neil pointed out Travis Brown's use case, drilled down into this one. 3. Martin's prelim. report on various features, a list with some mnemonics, "unique sort", etc. Are now working on taking the abstractions against these three use cases, working on trying to see what this would tell us we'd need in a collection description. Not a separate, ongoing activity, more like a field trip to explore a little more what are the "operations" in "interoperability".
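
Scott's operation/input/output mapping is not included in these notes. As a guess at the general shape, the geomashup case could be broken into atomic operations like this, which then makes it possible to ask what a given collection (or a remediation service) would have to afford. The step names and data kinds are assumptions, not the actual mapping.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Operation:
    name: str
    consumes: str  # kind of input the operation needs
    produces: str  # kind of output it yields


# Hypothetical decomposition of the geomashup use case.
GEOMASHUP = [
    Operation("fetch_text", "item_id", "plain_text"),
    Operation("extract_place_names", "plain_text", "place_names"),
    Operation("geocode", "place_names", "coordinates"),
    Operation("plot_on_map", "coordinates", "map"),
]


def chain_is_connected(ops: List[Operation]) -> bool:
    """True if each operation's output is the next operation's input."""
    return all(a.produces == b.consumes for a, b in zip(ops, ops[1:]))


def missing_operations(ops: List[Operation], afforded: Set[str]) -> List[str]:
    """Names of operations that nothing in `afforded` provides."""
    return [op.name for op in ops if op.name not in afforded]


print(chain_is_connected(GEOMASHUP))
print(missing_operations(GEOMASHUP, {"fetch_text", "plot_on_map"}))
```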

Tim: point is, not sufficient just to get the object out. We know we are going to want to perform operations on an object before we can use it.

Oren: are you looking at all at the services that are offered within the corpora already? For example, in Philologic, a set of actions that can be performed on texts and this is based on things scholars want to do.

Bill/Tim: you need both to know what the operations are, and also what the tool requires. So for Philologic, for example, if it has specific requirements.

Neil: Prosopography as a use case would be very valuable. Also has a broader question: when CS did its workshop with Google: it's not about getting objects out and down, it's about bringing the computing to the collections.

Closing down for lunch?

Jim: to what extent will the work that Bill and Scott are doing inform the work that Work Spaces is doing?

Bruce: Understanding operations will help us do the implementation.

Noah: very helpful ... looking at a generalizable approach for implementing ... important to be provide

-----

1 Comment

  1. Unknown User (sdenbo@gmail.com)

    In relation to the discussion on the tools and services registry, it is crucial that we engage with other infrastructure initiatives because there is a lot of good work going on in various places on providing information about tools (it is also one of the ways of dealing with the fact that it is under resourced within Bamboo). By way of example a couple of links:

    CenterNet page on Scholarly Tools for Digital Humanities

    www.arts-humanities.net/tools