Today's reading group is a last-minute topic change, as our originally scheduled presenter was unable to attend. We will reschedule Prof. David Bamman to talk about natural language processing over a fanfiction corpus later in the year.
Quinn Dombrowski of Research IT will present on text analysis resources on the national XSEDE compute infrastructure, and a potential model for making national high-performance compute resources accessible for humanists, including those at institutions without local support for research computing.
When: Thursday, March 9, 2017 from 12 - 1pm
Background on the Text Analysis Gateway on XSEDE: https://wrathematics.github.io/_pages/files/papers/2015/tag.pdf
Presenting: Quinn Dombrowski
Aaron Culich, Research IT
Anna Sackmann, Library
Aron Roberts, Research IT
Bill Allison, IST-API & Chief Technology Officer
Brandon Eltiste, Library
Cody Hennessey, Library
Deb McCaffrey, Research IT (BRC Domain Consultant)
Emilia Malachowski, Research IT
Larry Conrad, CIO
Michael Campos-Quinn, Pacific Film Archive
Richard Katz, SAIT
Rick Jaffe, Research IT
Ron Sprouse, Linguistics
Scott Peterson, Library
Steve Masover, Research IT
[See slides, PDF]
DH use cases that call for an HPC cluster: large photogrammetry jobs (Photoscan); OCR at scale (Tesseract); text analysis (no single software solution applies; there are many, depending on the task).
XSEDE Text Analysis Gateway
Presentation of an XSEDE GUI that gives digital humanists a non-command-line gateway for text analysis. It only accepts ASCII text, a limitation for humanists working with languages written in other scripts (Cyrillic, Chinese, etc.). There is other awkwardness in the UI, which is still under development. What's in place is the first set of 'easiest to implement' functionality.
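Illustrative aside (not from the presentation): a minimal Python sketch of what an ASCII-only restriction means in practice. The function name `is_ascii_only` and the sample strings are assumptions made for illustration; they are not taken from the gateway's code, and the gateway may enforce its restriction differently.

```python
def is_ascii_only(text: str) -> bool:
    """Return True if every character in `text` fits in 7-bit ASCII.

    Hypothetical helper for illustration only; not part of the
    XSEDE Text Analysis Gateway.
    """
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# Made-up sample inputs in three scripts.
samples = {
    "English": "Call me Ishmael.",
    "Russian": "Все счастливые семьи похожи друг на друга",
    "Chinese": "道可道，非常道",
}

for label, text in samples.items():
    status = "ASCII only" if is_ascii_only(text) else "contains non-ASCII characters"
    print(f"{label}: {status}")
```

A check like this could be run over a corpus before upload to see which files an ASCII-only tool would reject.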
Voyant (voyant-tools.org): widely used, supports Unicode, and useful even though the various panes visible in this online tool can't be regenerated via exported underlying R code. It can be downloaded and run locally; one option under consideration is to run it on XSEDE infrastructure.
Larry: How broad is interest in this kind of tool?
Quinn: The text-analysis working group, but also people who do literary studies and other text-analysis research, even if those researchers don't consider themselves digital humanists.
Patrick: Text analysis courses?
Cody: Laura Nelson teaching DH course in history this term. Not using these tools, but these concepts are getting treated.
Aaron: Interesting tension between the nice web interface provided by some of these tools, which constrains users to what the developer offers, vs. having to learn to code in R, Python, etc., to generate what they want.
Richard: Data Stages of the XSEDE Text Analysis Gateway suggests a plugin kind of architecture might be possible.
Patrick: SEASR, Wordseer. SEASR is a pipelining framework; not sure how active it is now. Wordseer was developed at the School of Information; not sure if anyone is using it at this time.
Quinn: Voyant has broad adoption.
Larry: Does this need outreach/marketing?
Quinn: Definitely need to do this if we want to reach literary scholars who wouldn't currently think of themselves as having any need for big computational resources. May pursue funding to make some of these tools available on national infrastructure in the US and Canada, and from there consider whether these kinds of tools can be run by HathiTrust over their corpus, etc.
Rick: Had a consultation with a mechanical engineer who wanted to run some analysis over a body of bibliographic data on papers in his field. I wonder whether he can use that data in a public resource.
Patrick: He can work with them if he doesn't make them into a publicly accessible data set that violates copyright.
Quinn: But he may also be able to create a derivative that would not be subject to copyright.
Patrick: Wonder about the language models associated with these tools: can they be swapped out, or is the analysis bound to whatever language model the tool or service has adopted?
Ron: Might we consider hosting the 1TB Google N-Gram data set, so that researchers don't have to find storage for it themselves?
Aaron: This is something that can be hosted on XSEDE's Bridges data collections service. You can even get an engineer to help leverage the resources.
Patrick: We could find a place to park 1TB if need be. Maybe we can fiddle with it, prove its use, and then engage Bridges in wider hosting.
Anna: Patents as a data set (patent classification): it would be interesting to do text analysis over large corpora of patents.
Aaron: Some of this is happening in IEOR.
Anna/Aaron: Lee Fleming
Patrick: Might check with the legal informatics group in the School of Law.