Prof. David Bamman (School of Information) will discuss the application of Natural Language Processing (NLP) to corpora of literary texts. Bamman will present a project undertaken by EECS senior Smitha Milli, in which a large corpus of fanfiction was run through an NLP pipeline that Bamman developed (BookNLP). The analysis surfaced patterns in which fanfiction allocates significantly different attention to secondary characters and to female characters than the original works on which it is based. To run the BookNLP pipeline, Milli used computational resources on Berkeley's shared cluster, Savio, becoming one of the first undergraduate students at Berkeley to conduct research on the cluster; she then presented the results in a conference paper at EMNLP 2016 this past November in Austin, Texas.

When: Thursday, 15 June from 12 - 1pm

Where: 200C Warren Hall, 2195 Hearst St (see building access instructions on parent page).
What: Fan Fiction and NLP
Presenting: Prof. David Bamman (School of Information)

Prior to the meeting, please review:

CRA gives out 4 outstanding undergraduate achievement awards for the whole country; Smitha Milli received one
Starting PhD at Berkeley in the fall
Computational text analysis: applying machine learning, NLP methods to answer empirical questions about literary texts
Genre, emotions, pattern recognition, money, geographic locations, themes
Holst Katsma — loudness in the novel (quantifiers for verbs introducing dialogue: shouting vs. saying)
Most of this work applies to the standard literary canon: Project Gutenberg, or large-scale digital libraries like Internet Archive and HathiTrust
Books that are in print
Fan fiction: different kind of literary creation
If you like the universe, want to see more stories with same characters, this happens in fan fiction
fanfiction.net — getting more scholarly attention
Fan fiction: not just collection of stories, it’s an ecosystem of people writing and commenting on stories, feedback between reviews that authors get influencing next chapters
Every chapter has collection of responses
9k different canons (universes) of stories
Almost 6M stories, 55B tokens
Entire Google Books collection: 550B tokens (10% of all the books Google has digitized)
159M reviews, 2M users
BookNLP: works on book-length documents; most NLP is optimized for contemporary newswire (1980s Wall Street Journal)
Simple sentences get bad syntactic analyses when the parser is trained on newswire
A noun phrase after the verb (as in dialogue tags): a newswire-trained parser doesn't recognize it as the subject
Standard tools don't account for the scale of novels: ~200k words
Phrase-structure parsing: complexity is cubic in sentence length
Pronominal anaphora resolution: quadratic in the number of mentions
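A quick way to see why these complexities bite at novel scale (a sketch; the cubic and quadratic figures come from the notes above, the sentence lengths are illustrative):

```python
def parse_cost(n_words):
    # Chart-style phrase-structure parsing is cubic in sentence length.
    return n_words ** 3

def coref_cost(n_mentions):
    # Pairwise pronominal anaphora resolution is quadratic in mentions.
    return n_mentions ** 2

# Tripling sentence length (20 -> 60 words) multiplies parse cost by
# (60/20)^3 = 27x, not 3x — long literary sentences get expensive fast.
ratio = parse_cost(60) / parse_cost(20)
```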
Literary novels: trying to understand which references to names are referring to same character
Associate characters with whatever actions they’re affiliated with
Aligning characters between fan fiction stories and original stories
Needed Savio to be able to do this, even with optimizations of BookNLP
2-3 minutes per book, across tens or hundreds of thousands of books: parallelization is needed
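Back of the envelope: at ~2.5 minutes per book, 100,000 books is roughly 170 days of serial compute, which is why a cluster is needed. The job is embarrassingly parallel, since each book is an independent document. A minimal sketch of that pattern (`process_book` is a hypothetical stand-in for the real BookNLP run, not its actual API):

```python
from concurrent.futures import ThreadPoolExecutor

def process_book(path):
    # Hypothetical stand-in for running BookNLP on one book
    # (parse, resolve coreference, return per-book statistics).
    return path, len(path)  # dummy result

def run_corpus(paths, workers=4):
    # Books are independent, so a worker pool simply maps over the
    # file list. Threads keep this sketch self-contained; real
    # CPU-bound parsing would use processes or a cluster scheduler's
    # job arrays (as on Savio).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process_book, paths))
```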
Fan fiction allocates more attention to secondary characters
Lots of people writing these stories to change something about original work; changing the attention to make it more “about” secondary characters
Fan fiction allocates more attention to women
Lots of writers and readers are women
43% vs 40% in original stories — statistically significant, but not huge
Something like 80% of authors are women
Haven’t done breakdown of female authors vs. female characters
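Why a 3-point gap can be "statistically significant, but not huge": with corpus-scale samples, even small differences in proportions dwarf the standard error. A minimal two-proportion z-test sketch (the sample sizes below are hypothetical, not the study's actual counts):

```python
from math import sqrt, erf

def two_proportion_z(p1, n1, p2, n2):
    # Pooled two-proportion z-test for H0: p1 == p2.
    x1, x2 = p1 * n1, p2 * n2
    p = (x1 + x2) / (n1 + n2)                   # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # standard error under H0
    z = (p1 - p2) / se
    # two-sided p-value from the normal CDF
    pval = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, pval

# 43% vs 40% with a hypothetical 100k observations per side: the
# difference is many standard errors wide; with only 100 per side,
# the same 3-point gap is indistinguishable from noise.
z_big, p_big = two_proportion_z(0.43, 100_000, 0.40, 100_000)
z_small, p_small = two_proportion_z(0.43, 100, 0.40, 100)
```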
Fraction of attention given to female characters in 100k books in HathiTrust. Stories written by women give equal distribution to men and women. Male-authored books: attention to women much less.
Additional allocation to women: as a correction to canonical stories?
Story reviews: author encouragement, requests for stories, emotional reactions
Predict reader response
Had readers judge whether a response was positive/negative/neutral
Want to predict response based just on fiction text
Different ways of targeting individual characters: agent/patient/predicative/possessive relations, unigrams, character gender, character frequency of mention
Overall, reader response can't be predicted much better than chance
Some syntactic features do have predictive power: whether a character is pregnant, a mother, a boy carries some signal
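A sketch of how the per-character feature classes listed above might be assembled into a sparse feature dict for a classifier (the names and structure here are illustrative assumptions, not BookNLP's actual output format):

```python
from collections import Counter

def character_features(context_tokens, relations, mention_count, gender=None):
    # Sparse features for one character: unigrams from surrounding text,
    # syntactic relation features (agent/patient/predicative/possessive),
    # character gender, and frequency of mention.
    feats = Counter()
    for tok in context_tokens:
        feats[f"uni={tok.lower()}"] += 1
    for rel, word in relations:               # e.g. ("agent", "shouted")
        feats[f"{rel}={word.lower()}"] += 1
    if gender:
        feats[f"gender={gender}"] = 1
    feats["mentions"] = mention_count
    return dict(feats)
```

A linear classifier over such dicts is the usual baseline; per the notes above, the syntactic relation features carried some signal even though overall prediction stayed near chance.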
Definitely some noise
More attention to fan fiction in last couple years
Text reuse appearing in these collections
Modeling reader reaction to text
Current pipeline assumes all text in the novel happens at the same time; it may be better to process texts sequentially, as readers do
Looking at what exactly the things are that make a given character hated by design 