HathiTrust Research Center (HTRC) UnCamp Fellowships

The UCB Libraries are delighted to offer a limited number of general fellowships for free admission to the 2018 HTRC UnCamp at UC Berkeley.
These fellowships are open to current UC Berkeley students and staff. Qualified applicants will be accepted in order of application while fellowships remain available, though priority will be given to student applicants. Fellowship applications are due by November 13.
Apply here:
Note: Those who do not receive fellowship awards will be informed in time to register at the UnCamp Early Bird price.
About the HathiTrust Research Center (HTRC) 2018 UnCamp
Location: University of California Libraries, Berkeley, CA
Dates: January 25-26, 2018
HTRC UnCamp 2018 aims to facilitate the creation of a national community focused on improving research use of the HathiTrust corpus through computational analysis. The UnCamp will discuss topics relevant to understanding and using the HathiTrust Digital Library corpus within the modern computational research ecosystem. These include practices and experiences in mass-scale data mining, visualization, and analysis of the HT collection, with the goal of improving the quality of access to and use of the collection by means of the HTRC Data Capsule and other affiliated research tools.
Stacy Reardon
Literatures and Digital Humanities Librarian
438 Doe Library | University of California, Berkeley | Berkeley, CA 94720
sreardon@berkeley.edu

Where to Find the Texts for Text Mining

Sketch for Monotype Digital Type Wall
frame1351170437122. Marcin Ignac, CC BY-NC-ND 2.0

Text mining, the process of computationally analyzing large swaths of natural language texts, is an important digital humanities method: it can illuminate patterns and trends in literature, journalism, and other forms of textual culture that are sometimes discernible only at scale. If text mining interests you, then finding the right tool (whether you turn to an entry-level system like Voyant or master a programming language like Python) is only part of the solution. Your analyses are only as strong as the texts you’re working with, after all, and finding authoritative text corpora can be difficult because of paywalls and licensing restrictions. The good news is that the UC Berkeley Libraries offer a range of text corpora for you to analyze, and we can help you get your hands on things we don’t already have access to.
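To make the idea concrete, here is a minimal sketch of one of the simplest text mining operations, counting term frequencies across a set of documents, using only Python's standard library. The two-sentence corpus and the function names are hypothetical stand-ins for illustration; a real project would load texts you have obtained through one of the sources described in this post, and would likely need more careful tokenization.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def term_frequencies(documents):
    """Count how often each token appears across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts


# Hypothetical two-document corpus, standing in for texts you have obtained.
sample_corpus = [
    "The whale surfaced, and the crew watched the whale.",
    "A whale is not a fish, the naturalist insisted.",
]

frequencies = term_frequencies(sample_corpus)

# Show the three most frequent tokens with their counts.
print(frequencies.most_common(3))
```

Even at this tiny scale, the most frequent tokens are mostly function words like "the" and "a," which is why many analyses filter out such stop words before looking for patterns.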

The first step in your exploration should be the library’s Text Mining Guide, which lists text corpora that are either publicly accessible (e.g., the Library of Congress’s Chronicling America newspaper collection) or available to UCB faculty, students, and staff (e.g., JSTOR Data for Research). The content of these sources is available in a variety of formats: you may be able to download the texts in bulk, use an API, or make use of a content provider’s in-platform tools. In other cases (e.g., ProQuest Historical Newspapers), the library may be able to arrange access upon request. While the scope of the corpora we have access to is wide, we are particularly strong in newspaper collections, pre-20th century English literature collections, and scholarly texts.

What happens if the library doesn’t have what you need? We regularly facilitate the acquisition of text corpora upon request, and you can always email your subject librarian with specific requests or questions. The library will deal with licensing questions so you don’t have to, and we’ll work with you to figure out the best way to make the texts available for your work, often with the help of our friends in the D-Lab or Research IT. We also offer the Data Acquisition and Access Program to provide special funding for one-time data set purchases, including text corpora. Your requests and suggestions help the library develop our collection, making text mining easier for the next researcher who comes along.

Important caveats:

  • Unless explicitly stated otherwise, our contracts for most library databases and resources (e.g., Scopus, Project MUSE) don’t allow bulk downloading. Please avoid web scraping licensed library resources on your own: content providers notice scraping quickly, and they react by shutting down access for our entire campus. Ask your subject librarian for help instead.
  • Keep in mind that many of the vendors themselves are limited in how, and how much, access they can provide to a particular resource, based on their own contractual agreements. It’s not uncommon for specific contemporary newspapers and journals to be unavailable for analysis at scale, even when library funding for access may be available.

Stacy Reardon and Cody Hennesy
Contact us at sreardon [at] berkeley.edu; chennesy [at] berkeley.edu