Team Awarded Grant to Help Digital Humanities Scholars Navigate Legal Issues of Text Data Mining

We are thrilled to share that the National Endowment for the Humanities (NEH) has awarded a $165,000 grant to a UC Berkeley-led team of legal experts, librarians, and scholars who will help humanities researchers and staff navigate complex legal questions in cutting-edge digital research.

What is this grant all about?

If you were to crack open some popular English-language novels written in the 1850s (say, ones by Brontë, Hawthorne, Dickens, and Melville), you would find that they describe men and women in very different terms. While a male character might be said to “get” something, a female character is more likely to have “felt” it. Whereas the word “mind” might be used when describing a man, the word “heart” is more likely to be used about a woman. Yet, as the 19th Century became the 20th, these descriptive differences between genders actually diminished. How do we know all this? We confess we have not actually read every novel written between the 19th and 21st Centuries (though we’d love to envision a world in which we could). Instead, we can make this assertion because researchers (including David Bamman, of UC Berkeley’s School of Information) used automated techniques to extract information from the novels and analyzed these word usage trends at scale. They crafted algorithms to turn the language of those novels into data about the novels.
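To give a rough sense of what “turning language into data” can look like, here is a minimal Python sketch. It is not the method used in the study mentioned above; the sample text and the function name are invented for illustration, and it simply counts which word immediately follows “he” or “she” in a short passage.

```python
import re
from collections import Counter

# Hypothetical toy passage; a real project would load full novel texts.
SAMPLE = (
    "He got the letter and he read it twice. "
    "She felt the cold and she felt afraid. "
    "He got up early. She read the letter."
)

def words_after_pronoun(text):
    """Count the word that immediately follows 'he' and 'she'."""
    counts = {"he": Counter(), "she": Counter()}
    # Lowercase the text and split it into simple word tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Walk through adjacent word pairs and tally the second word
    # whenever the first word is one of the tracked pronouns.
    for current, following in zip(tokens, tokens[1:]):
        if current in counts:
            counts[current][following] += 1
    return counts

counts = words_after_pronoun(SAMPLE)
print(counts["he"].most_common(2))   # [('got', 2), ('read', 1)]
print(counts["she"].most_common(2))  # [('felt', 2), ('read', 1)]
```

Real studies rely on far larger corpora and more careful linguistic processing (for example, identifying which character a verb actually refers to), but the basic move is the same: text goes in, and counts that can be compared across time come out.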

In fields of inquiry like the digital humanities, the application of such automated techniques and methods for identifying, extracting, and analyzing patterns, trends, and relationships across large volumes of unstructured or thinly structured digital content is called “text data mining.” (You may also see it referred to as “text and data mining” or “computational text analysis.”) Text data mining provides humanists and social scientists with invaluable frameworks for sifting, organizing, and analyzing vast amounts of material. For instance, these methods make it possible to:

The Problem

Until now, humanities researchers conducting text data mining have had to navigate a thicket of legal issues without much guidance or assistance. For instance, imagine that researchers need to scrape content about Egyptian artifacts from online sites or databases, or download videos about Egyptian tomb excavations, in order to conduct their automated analysis. Then imagine that the researchers also want to share these content-rich data sets with others to encourage research reproducibility or to enable other researchers to query the data sets with new questions. This kind of work can raise issues of copyright, contract, and privacy law, not to mention ethics when, say, Indigenous knowledge or cultural heritage materials are plausibly at risk. Indeed, in a recent study of humanities scholars’ text analysis needs, participants noted that access to and use of copyright-protected texts was a “frequent obstacle” to their ability to select appropriate texts for text data mining.

Potential legal hurdles do not just deter text data mining research; they also bias it toward particular topics and sources of data. In response to confusion over copyright, website terms of use, and other perceived legal roadblocks, some digital humanities researchers have gravitated to low-friction research questions and texts to avoid decision-making about rights-protected data. They use texts that have entered the public domain or materials that have been flexibly licensed through initiatives such as Creative Commons or Open Data Commons. When researchers limit themselves to such sources, their research is inevitably skewed, leaving important questions unanswered and making the findings less broadly applicable. A growing body of research also demonstrates how race, gender, and other biases found in openly available texts have contributed to and exacerbated bias in developing artificial intelligence tools.

The Solution

The good news is that the NEH has agreed to support an Institute for Advanced Topics in the Digital Humanities to help key stakeholders learn to better navigate legal issues in text data mining. Thanks to the NEH’s $165,000 grant, Rachael Samberg of UC Berkeley Library’s Office of Scholarly Communication Services will lead a national team (identified below) from more than a dozen institutions and organizations to teach humanities researchers, librarians, and research staff how to confidently navigate the major legal issues that arise in text data mining research.

Our institute is aptly called Building Legal Literacies for Text Data Mining (Building LLTDM), and will run June 23-26, 2020, in Berkeley, California. Institute instructors are legal experts, humanities scholars, and librarians immersed in text data mining research services, who will co-lead experiential meeting sessions empowering participants to put the curriculum’s concepts into action.

In October, we will issue a call for participants, who will receive stipends to support their attendance. We will also publish all of our training materials in an openly available online book so that researchers and librarians around the globe can help build academic communities that extend these skills.

Building LLTDM team member Matthew Sag, a law professor at Loyola University Chicago School of Law and leading expert on copyright issues in the digital humanities, said he is “excited to have the chance to help the next generation of text data mining researchers open up new horizons in knowledge discovery. We have learned so much in the past ten years working on HathiTrust [a text-minable digital library] and related issues. I’m looking forward to sharing that knowledge and learning from others in the text data mining community.” 

Team member Brandon Butler, a copyright lawyer and library policy expert at the University of Virginia, said, “In my experience there’s a lot of interest in these research methods among graduate students and early-career scholars, a population that may not feel empowered to engage in ‘risky’ research. I’ve also seen that digital humanities practitioners have a strong commitment to equity, and they are working to build technical literacies outside the walls of elite institutions. Building legal literacies helps ease the burden of uncertainty and smooth the way toward wider, more equitable engagement with these research methods.”

Kyle K. Courtney of Harvard University serves as Copyright Advisor at Harvard Library’s Office for Scholarly Communication, and is also a Building LLTDM team member. Courtney added, “We are seeing more and more questions from scholars of all disciplines around these text data mining issues. The wealth of full-text online materials and new research tools provide scholars the opportunity to analyze large sets of data, but they also bring new challenges having to do with the use and sharing not only of the data but also of the technological tools researchers develop to study them. I am excited to join the Building LLTDM team and help clarify these issues and empower humanities scholars and librarians working in this field.”

Megan Senseney, Head of the Office of Digital Innovation and Stewardship at the University of Arizona Libraries, reflected on the opportunities for ongoing library engagement that extend beyond the initial institute. Senseney said, “Establishing a shared understanding of the legal landscape for TDM is vital to supporting research in the digital humanities and developing a new suite of library services in digital scholarship. I’m honored to work and learn alongside a team of legal experts, librarians, and researchers to create this institute, and I look forward to integrating these materials into instruction and outreach initiatives at our respective universities.”

Next Steps

The Building LLTDM team is excited to begin supporting humanities researchers, staff, and librarians as they create important new knowledge. Stay tuned if you are interested in participating in the institute.

In the meantime, please join us in congratulating all the members of the project team:

  • Rachael G. Samberg (University of California, Berkeley) (Project Director)
  • Scott Althaus (University of Illinois, Urbana-Champaign)
  • David Bamman (University of California, Berkeley)
  • Sara Benson (University of Illinois, Urbana-Champaign)
  • Brandon Butler (University of Virginia)
  • Beth Cate (Indiana University, Bloomington)
  • Kyle K. Courtney (Harvard University)
  • Maria Gould (California Digital Library)
  • Cody Hennesy (University of Minnesota, Twin Cities)
  • Eleanor Koehl (University of Michigan)
  • Thomas Padilla (University of Nevada, Las Vegas; OCLC Research)
  • Stacy Reardon (University of California, Berkeley)
  • Matthew Sag (Loyola University Chicago)
  • Brianna Schofield (Authors Alliance)
  • Megan Senseney (University of Arizona)
  • Glen Worthey (Stanford University)

Workshop: Publish Digital Books & Open Educational Resources with Pressbooks

Digital Publishing Workshop Series

Publish Digital Books & Open Educational Resources with Pressbooks
Monday, May 6, 11:10am-12:30pm
Academic Innovation Studio, Dwinelle Hall 117 (Level D)

If you’re looking to self-publish work of any length and want an easy-to-use tool that offers a high degree of customization, allows flexibility with publishing formats (EPUB, MOBI, PDF), and provides web-hosting options, Pressbooks may be great for you. Pressbooks is often the tool of choice for academics creating digital books, open textbooks, and open educational resources, since you can license your materials for reuse however you desire. Learn why and how to use Pressbooks for publishing your original books or course materials. You’ll leave the workshop with a project already under way! Register at bit.ly/dp-berk



Workshop: By Design: Graphics & Images Basics

Digital Publishing Workshop Series

By Design: Graphics & Images Basics
Monday, April 22, 4:10-5:00pm
D-Lab, 350 Barrows Hall

In this hands-on workshop, you will learn how to create web graphics for your digital publishing projects and websites. We will cover topics such as image editing tools in Photoshop, image resolution for the web, sources for free public domain and Creative Commons images, and image upload to publishing tools such as WordPress. If possible, please bring a laptop with Photoshop installed. All UCB faculty and students can receive a free Adobe Creative Suite license. Register at bit.ly/dp-berk

Upcoming Workshops in this Series 2018-2019:

    • Publish Digital Books & Open Educational Resources with Pressbooks

Please see bit.ly/dp-berk for details.



DH+Lib: Building and Preserving Collections for Digital Humanities Research

An English stage showing Sir John Falstaff and Mrs. Quickly, ca. 1662

DH+LIB: BUILDING AND PRESERVING COLLECTIONS FOR DIGITAL HUMANITIES RESEARCH

Wednesday, April 17th, 9:30 – 11:00 AM
Doe 180

This session will feature panelists building collections and tools for local digital humanities projects. Kathryn Stine, manager for digital content development and strategy at the California Digital Library, will talk about building web archive collections through collaboration, preparing these collections for discovery and use, and tapping the research potential of the resulting captured content and data. Mary Elings, Head of Technical Services for The Bancroft Library, will talk about the role libraries can play in developing research-ready digital collections to facilitate emerging research methods. And Gisèle Tanasse, Film & Media Services Librarian at the Library, will discuss her role in Shakespeare’s Staging, a DH project to help digitize, preserve, and make accessible Shakespeare performances from UC Berkeley students.

DH Fair 2019
http://ucberk.li/dhfair

 

2019 DH Fair Library Committee
Stacy Reardon, Chair
Lynn Cunningham
Mary Elings
Jeremy Ott
Liladhar Pendse
Claude Potts


Workshop: Text Data Mining and Publishing

Digital Publishing Workshop Series

Text Data Mining and Publishing
Monday, April 8, 11:10am-12:30pm
D-Lab, 350 Barrows Hall

If you are working on a computational text analysis project and have wondered how to legally acquire, use, and publish text and data, this workshop is for you! We will teach you five legal literacies (copyright, contracts, privacy, ethics, and special use cases) that will empower you to make well-informed decisions about compiling, using, and sharing your corpus. By the end of this workshop, and with a useful checklist in hand, you will be able to confidently design lawful text analysis projects or be well positioned to help others design such projects. Consider taking it alongside Copyright and Fair Use for Digital Projects. Register at bit.ly/dp-berk

Upcoming Workshops in this Series 2018-2019:

    • By Design: Graphics & Images Basics
    • Publish Digital Books & Open Educational Resources with Pressbooks

Please see bit.ly/dp-berk for details.



Workshop: Copyright and Fair Use for Digital Projects

Digital Publishing Workshop Series

Copyright and Fair Use for Digital Projects
Thursday, March 7, 1:10-2:30pm
D-Lab, 350 Barrows Hall

This training will help you navigate copyright, fair use, and usage rights when including third-party content in your digital project. Whether you seek to embed video from other sources for analysis, post material you scanned during a visit to the archives, add images, upload documents, or more, understanding the basics of copyright and discovering a workflow for answering copyright-related digital scholarship questions will make you more confident in your publication. We will also provide an overview of your intellectual property rights as a creator and ways to license your own work. Register at bit.ly/dp-berk

Upcoming Workshops in this Series 2018-2019:

    • Text Data Mining and Publishing
    • By Design: Graphics & Images Basics
    • Publish Digital Books & Open Educational Resources with Pressbooks

Please see bit.ly/dp-berk for details.



Workshop: HTML/CSS Toolkit for Digital Projects

Digital Publishing Workshop Series

HTML/CSS Toolkit for Digital Projects
Tuesday, November 13th, 3:40-5:00pm
D-Lab, 350 Barrows Hall

If you’ve tinkered in WordPress, Google Sites, or other web publishing tools, chances are you’ve wanted more control over the placement and appearance of your content. With a little HTML and CSS under your belt, you’ll know how to edit “under the hood” so you can place an image exactly where you want it, customize the formatting of text, or troubleshoot copy & paste issues. By the end of this workshop, interested learners will be well prepared for a deeper dive into the world of web design. Please bring a laptop if possible. Register at bit.ly/dp-berk

Please see bit.ly/dp-berk for details.



Workshop: The Long Haul: Best Practices for Making Your Digital Project Last

Digital Publishing Workshop Series

The Long Haul: Best Practices for Making Your Digital Project Last
Tuesday, October 30th, 1:10-2:00pm
Academic Innovation Studio, Dwinelle Hall 117 (Level D)

You’ve invested a lot of work in creating a digital project, but how do you ensure it has staying power? We’ll look at choices you can make at the beginning of project development to influence sustainability, best practices for documentation and asset management, and how to sunset your project in a way that ensures long-term access for future researchers. Register at bit.ly/dp-berk

Upcoming Workshops in this Series:

  • HTML/CSS Toolkit for Digital Projects

Please see bit.ly/dp-berk for details.



Workshop: Publish Digital Books & Open Educational Resources with Pressbooks

Digital Publishing Workshop Series

Publish Digital Books & Open Educational Resources with Pressbooks
Wednesday, September 26, 11:10am-12:30pm
Academic Innovation Studio, Dwinelle Hall 117 (Level D)

If you’re looking to self-publish work of any length and want an easy-to-use tool that offers a high degree of customization, allows flexibility with publishing formats (EPUB, MOBI, PDF), and provides web-hosting options, Pressbooks may be great for you. Pressbooks is often the tool of choice for academics creating digital books, open textbooks, and open educational resources, since you can license your materials for reuse however you desire. Learn why and how to use Pressbooks for publishing your original books or course materials. You’ll leave the workshop with a project already under way! Register at bit.ly/dp-berk

Upcoming Workshops in this Series 2018-2019:

  • The Long Haul: Best Practices for Making Your Digital Project Last
  • HTML/CSS Toolkit for Digital Projects

Please see bit.ly/dp-berk for details.



Workshop: Omeka, Scalar, WordPress, Oh My!

Digital Publishing Workshop Series

Omeka, Scalar, WordPress, Oh My!: Web Platforms for Digital Projects
Tuesday, September 25th, 3:40-5:00pm
D-Lab, 350 Barrows Hall

How do you go about publishing a digital book, a multimedia project, a digital exhibit, or another kind of digital project? In this workshop, we’ll take a look at use cases for common open-source web platforms like WordPress, Drupal, Omeka, and Scalar, and we’ll talk about hosting, storage, and asset management. There will be time for hands-on work in the platform most suited to your needs. No coding experience is necessary. Please bring a laptop if possible. Register at bit.ly/dp-berk

Upcoming Workshops in this Series:

  • Publish Digital Books & Open Educational Resources with Pressbooks
  • The Long Haul: Best Practices for Making Your Digital Project Last
  • HTML/CSS Toolkit for Digital Projects

Please see bit.ly/dp-berk for details.