  • SN-0069 Searching Within a Digital Text Corpus

This wiki space contains archival documentation of Project Bamboo, April 2008 - March 2013.


Searching Within a Digital Text Corpus

Collection Date: March 9, 2009
Scholar #1 Info: (if more than one scholar's process is described, copy this set for each scholar)

  • Name: Eric Greene
  • Email:
  • Title: Doctoral Student, Buddhist Studies
  • Institution/Organization: University of California Berkeley
  • Field of Study/Creative Endeavor: Chinese Buddhism

Collector Info (can be the same as "Scholar" above):

  • Name: Rich Meyer
  • Email: rtmeyer@berkeley.edu
  • Title: Project Bamboo Program Manager
  • Institution/Organization: University of California, Berkeley
  • Name: Connor Riley
  • Email:
  • Title: Graduate Student Researcher, School of Information
  • Institution/Organization: University of California, Berkeley

Notes on Methodology:

The collectors recorded this interview, delineated the various workflows discussed in it, and wrote them up using quotes from the interview. These write-ups were then reviewed and edited by the interviewee before being posted.

Scope

The scope section is provided by the collector, with input from the scholar(s), and attempts to estimate the scope of the group that performs the processes described: How broadly do the practices described in this narrative apply to others in the same field, in related fields, etc.?

  1. In the opinion of the scholar, who participates in the process the story describes?
    (e.g. "just this scholar", "many people in the scholar's field of inquiry", "all academics", etc.)
  2. What is this process intended to accomplish for the scholar?
  3. Who is the intended audience of the processes described?
  4. Is this the only process the scholar uses to accomplish his/her goals?
  5. What "shared services" would help transform the story into something of more benefit for the scholar or his/her audience?  What process or processes in the story could be automated?

Keywords

Please provide some keywords that will allow us to group or cluster related stories--or aspects of stories.

1. Was this story collected for a particular Bamboo working group?  If so, please include, as keywords, the appropriate group(s).

  • Scholarly Narratives

2. Suggested keywords: Does this narrative contain elements that could be mapped to these keywords?  If so, please indicate which ones and briefly describe the mapping.  Add any additional keywords in #3. (These are global keywords from this page keywords)

3. Please list additional keywords here:

Narrative

When conducting research in early Chinese Buddhism, most scholars make use of digital corpora that collect and make searchable some portion of the Buddhist canon. Although all of the collected works are written in Chinese, corpora differ in what works they collect (i.e., secular or religious), their access permissions, what versions of traditional texts they use, and whether the texts themselves have been updated to replace more obscure characters with their modern counterparts.

For research purposes, searching within a digital text corpus is especially useful when trying to find a reference to a particular person or character combination. I primarily use the CBETA project's search tools because CBETA has digitized the bulk of the texts I need to reference. When I search for a person's name or a phrase using the CBETA tools, the program returns a comprehensive list of all uses of that combination of characters in the corpus. The documents returned can then be read within the program; it's up to me to determine which search results are worth reading. For instance, if I'm searching for a person's name, I would be more likely to read through a well-known set of biographies for more information if it was returned as a search result.

When searching for a unique phrase, I have to keep in mind that, because written characters have changed over time, several characters may be interchangeable in meaning. Some existing Chinese literature search engines, such as the one for searching within the digitized Siku Quanshu, recognize characters with interchangeable forms and include those other forms in the returned search results. The CBETA search engine does not do this, however, so when using that search engine I must manually search for any other interchangeable forms I can think of.
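As a rough illustration of the variant-aware searching described above, the following Python sketch expands a query into every spelling that can be formed from a table of interchangeable characters and then looks for each spelling in a small in-memory corpus. It is not how CBETA or the Siku Quanshu search engine works internally. The variant table, the function names, and the toy corpus are all hypothetical and exist only for illustration; a real variant table would be far larger and curated by specialists.

```python
from itertools import product

# Hypothetical, hand-made variant table (illustration only).
# Each set groups character forms treated as interchangeable.
VARIANT_GROUPS = [
    {"爲", "為"},
    {"眞", "真"},
    {"說", "説"},
]


def build_variant_map(groups):
    """Map each character to the sorted list of all forms in its group."""
    variant_map = {}
    for group in groups:
        forms = sorted(group)
        for ch in group:
            variant_map[ch] = forms
    return variant_map


def expand_query(query, variant_map):
    """Return every spelling of `query` obtained by swapping variant forms."""
    # Characters without known variants contribute only themselves.
    choices = [variant_map.get(ch, [ch]) for ch in query]
    return ["".join(combo) for combo in product(*choices)]


def search(corpus, query, variant_map):
    """Return (title, matched_form) pairs for each variant spelling found."""
    hits = []
    for title, text in corpus.items():
        for form in expand_query(query, variant_map):
            if form in text:
                hits.append((title, form))
    return hits


if __name__ == "__main__":
    # Toy corpus with made-up titles, standing in for a real digital corpus.
    toy_corpus = {"text A": "無爲法", "text B": "無為法", "text C": "佛法"}
    variant_map = build_variant_map(VARIANT_GROUPS)
    for title, form in search(toy_corpus, "無為", variant_map):
        print(title, form)
```

With a table like this, a single query finds both spellings without the searcher having to think of each interchangeable form by hand. The number of expanded spellings grows multiplicatively with the number of variant characters in the query, so a production search engine would more likely normalize variants at indexing time than enumerate them at query time.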

Needs

There are many different digitized Chinese text corpora available, each with its own search engine. Within the scope of Chinese Buddhism I generally only use CBETA, as it really is the only useful digital corpus in the field. However, there are search capabilities within other Chinese-language corpora that would be helpful to my research. The ability to have the search engine recognize and search for interchangeable character forms would be very useful, for instance. Federating the content of Chinese Buddhist corpora and secular Chinese literature corpora could conceivably be helpful to me in certain situations as well. More important to me than the federation of this content is the ability to have manuscript and woodblock versions of Buddhist texts indexed and readable alongside what is already digitized in CBETA.
 

Other Comments:

The information below was compiled when transcribing the interview, to make sure pieces were not missing. If it is unhelpful, please disregard.

Recipe

Discover item of interest (name, unusual character or word)
Select appropriate search engine (CBETA, SAT, etc.)
Search for desired term
Parse search results
Read returned texts
Edit search term and repeat search if necessary

Ingredients: Tools and Content

CBETA
SAT
Other digital corpora search tools
Chinese language Input Method Editor for Windows
Siku Quanshu or other Chinese literature collections

