We're planning to submit a workshop paper for an IEEE BigData 2013 workshop on Big Humanities.

The deadline for the full paper is July 30, 2013.

Topics in the Call that seem to relate to or resonate with our work:

  • Text- and data-mining of historical and archival material.
    • The limited NLP we are doing
    • Patrick can do something short about this
  • Social media analysis, including sentiment analysis
    • Nothing here
  • Cultural analytics
    • This is where we are, I think, in terms of the application
    • This is covered in our intro

  • Big data and the construction of memory and identity
    • Laurie, what do you think?
    • LEP: My initial ramble is to think about Steve Tinney's comments early on about the potential for prosopographic tools to help with understanding the cult organization and operation as evidenced in the literary texts, and how data derived from prosopographical research would inform the investigation of an essentially literary corpus, which might be studied from the perspective of entirely different sets of attributes (what features construct the narrative; cf. Alison Booth's proposal for women's biographies). Grinding over data from different domains (administrative and literary), with different approaches (tagging literary features, disambiguation for prosopography), enables exploration of seemingly different environments in the sources, all of which the specialists recognize as contributing to the modern understanding of the construction of (memory and) identity. I think this is about more than handling more data: computation provides the potential for those discrete data realms to be brought together. In terms of the construction of identity, this may support the integration of the multiple contexts in which an individual or individuals appear in the texts. PLS: This is a use-case in cultural analytics.
  • Crowd-sourcing and big data
    • Sort of, in the application of the assertion model and sharing.
    • Cuneiform studies as one model for accepting and sharing conventions in data
    • Laurie talked about an idea to connect data published in journal articles, etc., to a large onomasticon, place-name thesauri, and so on
  • Cyber-infrastructures for the humanities
    • We are making a pitch for re-use, and good architecture. E.g., the SNA services, and the assertions.

    • Corpus management as a shared (and reusable) resource
  • Relationship between ‘small data’ and big data
    • Would like to discuss the issues of data-based research and computational research as the central questions (more than a count of bytes). Note that the sciences have made this transition (although issues remain around repeatability, provenance, data curation, etc.), and so for them the cultural shift in how research is done is almost forgotten. For humanists, the shift from a reading of individual texts to build up a model, to the computational analysis of whole corpora, is just getting under way. The implications and challenges for reproducibility are still very real in the sciences. (Impressions of how the humanities, using cuneiform as a case study, handle "big paper data.") The parallel issues of peer review and evaluation in the humanities are in early discussion (how many humanists are qualified to review the NLP algorithms used by others to analyze a corpus?).
    • This is not a big or particularly significant point, but the question of big-data problems arising in the processing of small data calls to mind our focus on emulating, in the assertion model, the way humanities researchers think about analyzing their data.
    • A common theme is the difference between the construction of a data resource and the synthesis of new ideas or novel insights based upon such a resource. This is central to the evaluation of what constitutes a research result, a publication, etc. "Engineers who build telescopes don't get tenure in Astronomy," yet what was often a publication a few short decades ago (e.g., a prosopography of a given corpus, or place and time) would now just be a data resource used for research.
  • NoSQL databases and their application, e.g. document and graph databases
    • Nothing here
  • Big data and archival practice
    • Not sure what to say here. Seems like it is in our domain, but I suspect it goes more to data curation and data-management practices than to our research
    • Does this embrace provenance/authority citation/archiving? Short note here. Note that across the domain, metadata practices are quite uneven: e.g., there are no broad name lists spanning years and corpora. There is a basic level of standardization at the document level (standard identifiers for a given tablet), but little standard content metadata, and few finding aids. MODS/METS/EAD etc. are not in common use.
  • Construction of big data
    • W.r.t. humanities, there is a story to tell here about how cuneiform studies "went digital."


    • LEP to contribute the cuneiform user story: bits of clay, bytes of data. The field has more raw data than anyone can access in a lifetime once the inaccessibility of sources is taken into account. How do the physical attributes/condition of the data invite or discourage computational models of research? This would make a good blog post independent of the rest of the paper; once that is done, we can extract and include it here.


  • Big data in Heritage
    • Would be nice to talk about how big data is transforming (or at least has the potential to transform) research in this area. Not clear what "Heritage" means here. Do we have a separate story in this aspect?
Areas we will include/discuss

Much of this draws on what we worked out on the whiteboard, but I have added some stuff on the areas I know something about.


We need to review all this w.r.t. the ACM DocEng paper


  • Project profile, drawing from other work. Include something about our collaboration as a model, etc.
    • Cuneiform studies and digitization
    • DH, combining humanist and IT cultures to really collaborate
    • Goals of project (we should aim high here, and perhaps look forward more than back). (LEP: revisit the wish-lists)
  • Project context - peers, relationship to field
    • Tradition of prosopography
    • Evolution of the domain, becoming digital, etc.
      • 0.5 million documents total, across N years.
      • Moving from objects and paper books, to scans and Unicode transliterations 
        • Small point: digitization must deal with the problems of OCR, but this is further complicated with 3D tablets; geodesic dome photography may contribute to solutions?
        • EpiDoc?
      • Moving from silos/hoards to shared repositories
      • How this is changing research
    • Traditions of NLP - minor point? May not be a good idea to say much about this, other than to note that we are not innovating on NLP.
    • Projects like
    • SNA toolkits - note and cite our libs, and describe how we wrapped them in GraphML and built them as a RESTful service.
    • ??OAC as representation with possibility of provenance. Does not solve problem of dynamism in the text. Cite robust linking by Wilensky, et al.?
    • Scholarly workspaces - Perseus, etc. 
  • Exploration and analysis - What If? and Aha! moments
    • PLS: I forgot what you want to say here...
    • LEP: Freedom to change assumptions (assertions) and track their provenance opens the domain to once-unlikely investigations. The results may be totally unexpected: bridges (in SNA) that might not otherwise have been discernible.
  • Evaluation and peer review in Big Humanities
    • Transition from review with Human Eyes and individual intellect, to review of data sources, collection methods, filtering, processing, algorithms, etc.
    • Voytek's three laws (the more stats and big math, the fewer people understand; the more the research hides behind the math, etc.)
    • Data Provenance, Analytic Provenance, and modeling scholarly workflows.
  • Workflows around annotation
    • Lemmatization with Oracc tools
    • NLP markup for roles, activities, etc.
    • Assertions on dates, on people, etc.
    • Workspaces, settings and rules for model, what-if scenarios
  • Our assertion model, and proposed functionality  (as Cyber-infrastructure, and Collaborative Analysis)
    • Basic functionality and purpose
      • Accept/reject automated conjectures
      • Add additional information
      • Formalize model for disagreement and discussion
    • Publish/subscribe support. 
      • Use-cases for research
      • Use-cases for teaching
      • Provenance of ideas, tracking influence, metering
  • Pluggable, abstract rules for disambiguation (as Cyber-infrastructure)
    • Expose UI for user to control, expose parameters
    • Conform to simple model for disambiguation
      • shift, boost, or discount
      • intra-doc or inter-doc (cross-corpus)
    • Core generic types (parameterized, but general across many corpora)
      • Name qualification
      • Date comparisons
      • Role matrix
    • Base rule classes can be extended to other specific features of roles, activities, etc. 
      • Place of activity
      • Life-roles
      • Custom knowledge, e.g., a given family/clan might have a focus, or a taboo on certain activities
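The SNA services noted under project context could be illustrated with a short sketch of the GraphML serialization step behind a RESTful endpoint. This is a minimal, stdlib-only illustration, not the project's actual code: the function names, the co-occurrence weighting, and any endpoint path are assumptions for the sake of the example.

```python
# Hypothetical sketch: build a person co-occurrence network from tablets
# and serialize it as GraphML, as a RESTful GET on the service might return.
import xml.etree.ElementTree as ET
from itertools import combinations

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

def cooccurrence_edges(tablets):
    """Count how often each pair of persons is attested on the same tablet.
    `tablets` maps a tablet id to the list of person names on it."""
    counts = {}
    for persons in tablets.values():
        for a, b in combinations(sorted(set(persons)), 2):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts

def to_graphml(edge_counts):
    """Serialize the co-occurrence network as a GraphML document string."""
    ET.register_namespace("", GRAPHML_NS)
    root = ET.Element("{%s}graphml" % GRAPHML_NS)
    # Declare an edge attribute for the co-occurrence weight.
    ET.SubElement(root, "{%s}key" % GRAPHML_NS,
                  {"id": "w", "for": "edge",
                   "attr.name": "weight", "attr.type": "int"})
    graph = ET.SubElement(root, "{%s}graph" % GRAPHML_NS,
                          {"id": "G", "edgedefault": "undirected"})
    for name in sorted({n for pair in edge_counts for n in pair}):
        ET.SubElement(graph, "{%s}node" % GRAPHML_NS, {"id": name})
    for (a, b), w in sorted(edge_counts.items()):
        edge = ET.SubElement(graph, "{%s}edge" % GRAPHML_NS,
                             {"source": a, "target": b})
        data = ET.SubElement(edge, "{%s}data" % GRAPHML_NS, {"key": "w"})
        data.text = str(w)
    return ET.tostring(root, encoding="unicode")
```

Returning GraphML from the service keeps the network consumable by standard toolkits (Gephi, yEd, igraph) without binding clients to our internal graph representation.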
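The accept/reject functionality of the assertion model could be sketched as a small record type: an automated conjecture that a researcher accepts, rejects, or annotates, with each decision retained for provenance and discussion. All class, field, and identifier names here are hypothetical illustrations, not the implemented model.

```python
# Hypothetical sketch of an assertion record in the assertion model.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Assertion:
    subject: str                 # e.g. a name instance in a tablet
    claim: str                   # e.g. "same person as PN-42"
    source: str                  # "conjecture" (automated) or a researcher id
    confidence: float = 0.5
    status: str = "open"         # open / accepted / rejected
    notes: List[str] = field(default_factory=list)  # provenance of decisions

    def accept(self, reviewer: str, note: Optional[str] = None) -> None:
        """Researcher accepts the automated conjecture."""
        self.status = "accepted"
        self.notes.append("accepted by %s" % reviewer)
        if note:
            self.notes.append(note)

    def reject(self, reviewer: str, note: Optional[str] = None) -> None:
        """Researcher rejects it; the disagreement stays on record."""
        self.status = "rejected"
        self.notes.append("rejected by %s" % reviewer)
        if note:
            self.notes.append(note)
```

Keeping rejected assertions (rather than deleting them) is what formalizes disagreement: a later reviewer sees both the conjecture and the recorded dissent, which also underpins the publish/subscribe and idea-provenance use-cases.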
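The pluggable rule model above (conform to a simple interface; boost or discount candidates; expose parameters to the user) can be sketched as follows. This is a speculative illustration under assumed names: the base class, the two example rules, and the multiplicative scoring are all placeholders for whatever the real rule engine does.

```python
# Hedged sketch of pluggable disambiguation rules: each rule returns a
# multiplicative score adjustment (>1 boost, <1 discount) for a candidate
# identification, and exposes its parameters for UI control.
from abc import ABC, abstractmethod

class DisambiguationRule(ABC):
    """Base class all rules conform to; parameters live on the instance
    so a UI can expose them to the researcher."""
    @abstractmethod
    def adjust(self, candidate: dict, context: dict) -> float:
        """Return the score adjustment for this candidate in this context."""

class NameQualificationRule(DisambiguationRule):
    """Boost a candidate whose recorded qualifier (here, a patronymic)
    matches the one attested in the document context."""
    def __init__(self, boost: float = 2.0):
        self.boost = boost  # user-tunable parameter

    def adjust(self, candidate, context):
        if candidate.get("patronymic") and \
           candidate["patronymic"] == context.get("patronymic"):
            return self.boost
        return 1.0

class DateProximityRule(DisambiguationRule):
    """Discount candidates attested far from the document's date."""
    def __init__(self, window: int = 10, discount: float = 0.5):
        self.window = window
        self.discount = discount

    def adjust(self, candidate, context):
        gap = abs(candidate.get("year", 0) - context.get("year", 0))
        return 1.0 if gap <= self.window else self.discount

def score(candidate: dict, context: dict, rules, base: float = 1.0) -> float:
    """Combine all rule adjustments multiplicatively into a final score."""
    s = base
    for rule in rules:
        s *= rule.adjust(candidate, context)
    return s
```

Because every rule conforms to the same `adjust` interface, corpus-specific rules (place of activity, life-roles, a clan's known focus or taboo) can be added by subclassing without touching the scoring loop.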