Cody Hennesy of The Library and Patrick Schmitz (Research IT) will lead a discussion about non-consumptive research. We'll specifically explore a few different models for the non-consumptive research of text corpora in the HathiTrust Digital Library, the legal restrictions that necessitate limited access of those collections, a few use cases, and the technical infrastructure behind HathiTrust's data capsule. We invite researchers using non-consumptive methods to join us and share their experiences, describe unmet needs, and/or suggest what the campus can do to support their future work.
When: Thursday, February 23, 2017 from 12 - 1pm
Presenting: Cody Hennesy (Library); Patrick Schmitz (Research IT)
Attending:
Aaron Culich, Research IT
Chris Hoffman, Research IT
Deb McCaffrey, Research IT
Jason Christopher, Research IT
Kelly Rowland, Nuclear Engineering & BRC
Krishna Muriki, LBNL & BRC (via Zoom)
Miles Lincoln, ETS
Rick Jaffe, Research IT
Rachel Samberg, Library
Ronald Sprouse, Linguistics
Scott Peterson, Library
Steve Masover, Research IT
See slides (PDF)
Cody: Context re: strict copyright law governing modern published content. What scholars do with text analysis in literary & linguistic studies: topic modeling, text mining (e.g., tracking how gender occurs across a corpus)...
Ron S: [linguistic uses of corpora-level analysis, e.g., frame analysis]
Cody: Domains interested in doing this work also include sociology, political science. Nick Adams' work (PhD, Sociology) as example, analyzing news reports of Occupy protests. Newspapers are a popular source of content/corpora.
Cody: Non-consumptive research means analysis over work that the analyst (researcher) is not permitted to view, such that the analysis does not produce data from which pages can be reconstructed. Two models: the bag-of-words model (you get word counts & some metadata, but not the text itself); and the Data Capsule. The Data Capsule is a Linux VM: you can load the software you want and compute over HathiTrust data, but what you export is controlled. (It's new and still a hard sell: setting up the environment is non-trivial, and the review of exportable results hasn't been exercised much, so people aren't sure what they'll get back for their effort.)
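The bag-of-words idea above can be sketched in a few lines (a minimal illustration only, not HathiTrust's actual extraction pipeline; the sample page text and volume identifier below are invented for the example):

```python
from collections import Counter

def bag_of_words(page_text):
    """Reduce a page to unordered token counts.

    Word order is discarded, so the original running text
    cannot be reconstructed from the output.
    """
    tokens = page_text.lower().split()
    return dict(Counter(tokens))

# Hypothetical page from an in-copyright volume (invented sample text).
page = "the quick brown fox jumps over the lazy dog"

# What a researcher might receive: counts plus metadata, never the text.
release = {
    "volume_id": "example.001",  # invented identifier, not a real HathiTrust ID
    "page_seq": 1,
    "token_counts": bag_of_words(page),
}
```

The key property is that `token_counts` (e.g., `{'the': 2, 'quick': 1, ...}`) supports frequency-based analysis while making the page itself unrecoverable.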
Patrick: Wharton School of Business ...
Patrick: So the real question is how do we establish patterns for how to set up environments that will be useful for people.
Chris: It's kind of a Wild West era for this kind of research right now ... no standards, likely to be in flux for a while.
Patrick: So concerns include (1) how do you architect these environments so that they meet constraints imposed by data/copyright owners; and (2) how do you offer such a service in an affordable, sustainable way across domains and data sets (the business model problem).
Patrick: resiliency issues implicit in the data enclave model ... if data exists in only one place, what happens when that data store is threatened -- whether by politics (e.g., concerns about gov't data disappearing under the current U.S. administration), financial collapse, etc.?
Rick: Tension in cleaning up data ... how do you clean it and how does that affect a researcher's use of it, what inferences on the data are skewed by how the data was cleaned, etc.
Scott: Nick Adams has ideas about making data sets research-ready, as a service that researchers will prefer to use over performing all the cleanup, structuring, metadata-linking themselves.
Patrick/Chris/Rachel: looking at these kinds of cleaned data as derivative works. That's what the research is done over. Licensing depends in large part on how one is licensed to use the underlying material. Gov't data that is public domain but cleaned up by a commercial outfit (e.g., ProQuest) is governed by one's contract with ProQuest.
Cody: U Michigan LexisNexis example: invested in API-based access to newspapers, but now not sure they can release the code ... possibly a contractual constraint.
Rachel/Patrick/Cody: imposition of contractual constraints in order to recoup costs is an issue, yes; but there's also an opportunity for another entity (HathiTrust? Internet Archive?) to "redo" the work, perhaps funded by a grant, making the result public domain or less constrained.
Rick/Patrick: What people get credit for ... is creating a data set creditworthy in academic / research contexts? This raises the question of the library's role in creating data sets over which researchers can then do the work they have an incentive to do. Also the question of whether major investment (NSF, et al.) ought to be made in producing robust, re-usable data sets and tools to operate over them: something in between the raw data and the research over the data.