
dc.contributor.author  Schmidt, Ben
dc.description  Presented at “Big Data & Uncertainty in the Humanities”, University of Kansas, September 22, 2012. Hosted by the Institute for Digital Research in the Humanities.

Ben Schmidt is a PhD candidate in History at Princeton University and a Graduate Fellow at the Cultural Observatory at Harvard.
dc.description.abstract  At what scale can digital analysis address live questions in the humanities? On the one hand, humanists have long cultivated expertise in elucidating meaning from a single text or author; on the other, increasing numbers of scientists are drawn to massive digital corpora by the appeal of describing ‘culture’ writ large. While digital reading promises only modest improvements to traditional techniques, the scientific approach rightfully causes many humanists discomfort for simplifying the variegated worlds of historical experience out of existence. This paper proposes that the most fruitful applications of ‘big data’ will come from a scale only slightly smaller: the analysis of categories of authorship that can encompass hundreds of thousands of texts. The most important of these categories (academic disciplines and geographic regions, ethnicities and genders) have themselves long been central objects of humanistic research. But to fully realize the benefits that digitization offers humanists requires developing strategies, infrastructure, and vocabularies for reading digital libraries from the top down.

This paper will address the technical and intellectual challenges this sort of reading presents. As a historian at the Harvard Cultural Observatory, I have helped design and build some of the largest collections of text-as-data created for historical research. The Observatory previously collaborated with Google to build the Google Ngram Viewer; since my arrival, we have assembled several terabytes of new textual data at a much more granular level, supported by cloud infrastructure and storage. These collections, which comprise millions of texts with metadata (newspapers from the Library of Congress, books from the Internet Archive, journal articles from JSTOR), make trends in massive digital corpora available both for quick visualization (through a public website, Bookworm) and for more intensive statistical research.
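The kind of faceted trend query this infrastructure supports can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in, not the Observatory's or Bookworm's actual code or schema: a tiny in-memory "corpus" of records pairing text with metadata, and a function that computes a word's relative frequency grouped by any metadata facet.

```python
from collections import Counter

# Hypothetical mini-corpus: each record carries text plus metadata
# facets (year, region), mimicking the text-with-metadata structure
# described above. The records and field names are invented.
corpus = [
    {"year": 1890, "region": "Kansas", "text": "the labour of the harvest"},
    {"year": 1890, "region": "New York", "text": "the labor of industry"},
    {"year": 1910, "region": "Kansas", "text": "honest labor and labor again"},
    {"year": 1910, "region": "New York", "text": "the labour question"},
]

def trend(corpus, word, facet="year"):
    """Relative frequency of `word` per facet value (a Bookworm-style query)."""
    hits = Counter()    # occurrences of `word` per facet value
    totals = Counter()  # total tokens per facet value
    for doc in corpus:
        tokens = doc["text"].lower().split()
        totals[doc[facet]] += len(tokens)
        hits[doc[facet]] += tokens.count(word)
    return {k: hits[k] / totals[k] for k in totals}

print(trend(corpus, "labor"))          # frequency of "labor" by year
print(trend(corpus, "labor", "region"))  # same word, faceted by region
```

Grouping by an arbitrary metadata facet, rather than only by year, is what separates this style of query from a plain ngram timeline: the same counts can be sliced by region, genre, or discipline.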

Drawing on these collections, my paper will demonstrate how digital reading opens two massive cultural fields to new sorts of analysis. The first is geography. Millions of historical newspaper pages have been digitized and placed in the public domain; properly structured, this data can show subtle geographic variations that neither keyword search nor close reading could unearth. By tracking the impact of a simple federally mandated practice, spelling, across the late 19th and early 20th centuries, I will show how aggregate behaviors can map onto historiographical questions of center and periphery. The second is academic discipline. Metaphors for the operation of the mind let us explore questions of intellectual history and recenter the subject from the individual to the discipline. While the individual-centered approaches of intellectual history privilege psychologists or philosophers, genre-based analysis suggests that fields like pedagogy are, perhaps, more influential.
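The geographic spelling analysis can be sketched in the same spirit. This is an illustrative invention, not the paper's actual method or data: given a pair of competing spellings, compute what share of all occurrences uses one variant, grouped by region, which is the shape of measurement that lets aggregate orthographic behavior be mapped.

```python
from collections import Counter

# Invented example records: region-tagged newspaper snippets containing
# a competing spelling pair ("labor" vs. "labour").
corpus = [
    {"region": "Kansas", "text": "the labour of the harvest"},
    {"region": "Kansas", "text": "honest labor and labor again"},
    {"region": "New York", "text": "the labor of industry"},
    {"region": "New York", "text": "the labour question"},
]

def variant_share(corpus, variant_a, variant_b, facet="region"):
    """Share of variant_a among all uses of either spelling, per facet value."""
    a_counts = Counter()    # uses of variant_a per facet value
    pair_counts = Counter()  # uses of either variant per facet value
    for doc in corpus:
        tokens = doc["text"].lower().split()
        a_counts[doc[facet]] += tokens.count(variant_a)
        pair_counts[doc[facet]] += tokens.count(variant_a) + tokens.count(variant_b)
    # Skip facet values where neither variant appears.
    return {k: a_counts[k] / pair_counts[k] for k in pair_counts if pair_counts[k]}

print(variant_share(corpus, "labor", "labour"))
```

Comparing shares rather than raw counts controls for regions simply publishing more text; differences between regions then reflect orthographic preference rather than corpus size.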

The difficulties and uncertainties in establishing claims like these are not statistical. Rather, they involve questions about the coherence of metadata categories, the provenance of records, and the subtle biases of separately collected sources. These difficulties are enormous. But they are also traditional questions of source interpretation: ones that must be adapted, not abandoned, if new techniques of reading are to be effective.
dc.subject  Digital Humanities  en_US
dc.subject  Text Visualization  en_US
dc.subject  Text Analysis  en_US
dc.subject  Big Data  en_US
dc.subject  Google Ngram  en_US
dc.title  Reading Genres: Exploring Massive Digital Collections From the Top Down  en_US
