Reading Genres: Exploring Massive Digital Collections From the Top Down
Schmidt, Ben
Abstract
At what scale can digital analysis address live questions in the humanities? On the one hand, humanists have long cultivated expertise in elucidating meaning from a single text or author; on the other, increasing numbers of scientists are drawn to massive digital corpora by the appeal of describing ‘culture’ writ large. While digital reading promises only modest improvements to traditional techniques, the scientific approach rightfully causes many humanists discomfort for simplifying the variegated worlds of historical experience out of existence. This paper proposes that the most fruitful applications of ‘big data’ will come from a scale only slightly smaller—the analysis of categories of authorship that can encompass hundreds of thousands of texts. The most important of these categories—academic disciplines and geographic regions, ethnicities and genders—have themselves long been central objects of humanistic research. But to fully realize the benefits offered to humanists by digitization requires developing strategies, infrastructure, and vocabularies for reading digital libraries from the top down.
This paper will address the technical and intellectual challenges this sort of reading presents. As a historian at the Harvard Cultural Observatory, I have helped design and build some of the largest collections of text-as-data intended for historical research. The Observatory previously collaborated with Google to build the Google Ngram Viewer; since my arrival we have cultivated several terabytes of new textual data at a much more granular level, supported by cloud infrastructure and storage. These collections make trends in massive digital corpora with millions of texts and their metadata (newspapers from the Library of Congress, books from the Internet Archive, journal articles from JSTOR) available both for quick visualization (through a public website, Bookworm) and for more intensive statistical research.
Drawing on these collections, my paper will demonstrate how digital reading opens two specific massive cultural fields to new sorts of analysis. The first is geography. Millions of historical newspaper pages have been digitized and placed in the public domain; properly structured, this data can show subtle geographic variations that neither keyword search nor close reading could unearth. By tracking the impact of a simple federally mandated practice—spelling—across the late 19th and early 20th centuries, I will show how aggregate behaviors can map onto historiographical questions of center and periphery. The second is academic discipline. Metaphors for the operation of the mind let us explore questions of intellectual history and recenter the subject from the individual to the discipline. While the individual-centered approaches of intellectual history privilege psychologists or philosophers, genre-based analysis suggests that fields like pedagogy are, perhaps, more influential.
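To make the geographic argument concrete, the kind of aggregation involved can be sketched in a few lines of Python. This is illustrative only: the variant pair ("labor"/"labour"), the state and year fields, and the tally functions are assumptions for exposition, not the paper's actual code or the schema of its newspaper corpus. The point is simply that counting variant spellings per place and year, rather than reading individual pages, is what lets subtle regional differences surface.

import re
from collections import defaultdict

# Assumed example pair of variant spellings; any standardized/non-standard
# pair tracked across the corpus would work the same way.
VARIANTS = {"labor": "american", "labour": "british"}

# (state, year) -> spelling class -> count
counts = defaultdict(lambda: defaultdict(int))

def tally(state, year, text):
    """Count each variant spelling appearing on one OCR'd newspaper page."""
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in VARIANTS:
            counts[(state, year)][VARIANTS[token]] += 1

def adoption_share(state, year):
    """Share of the standardized spelling among all variant occurrences."""
    c = counts[(state, year)]
    total = c["american"] + c["british"]
    return c["american"] / total if total else None

# Hypothetical usage with two invented page texts:
tally("Kansas", 1895, "The labor question dominated the campaign.")
tally("Kansas", 1895, "Demand for labour remained high in the east.")
print(adoption_share("Kansas", 1895))  # 0.5

Mapped over millions of pages and grouped by place and year, a ratio like this is the sort of aggregate behavior the paper reads against questions of center and periphery.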
The difficulties and uncertainties in establishing claims like these are not statistical. Rather, they involve questions about the coherence of metadata categories, the provenance of records, and the subtle biases of separately collected sources. These difficulties are enormous. But they are also traditional questions of source interpretation, ones that must be adapted, not abandoned, if new techniques of reading are to be effective.
Description
Presented at “Big Data & Uncertainty in the Humanities”, University of Kansas, September 22, 2012. Institute for Digital Research in the Humanities: http://idrh.ku.edu
Ben Schmidt is a PhD candidate in History at Princeton University, and a Graduate Fellow at the Cultural Observatory at Harvard.
Date
2012-09-22
Keywords
Digital Humanities, Text Visualization, Text Analysis, Big Data, Google Ngram