Before you scrape and before you train…

A person's hands holding a white stylus pen over a tablet screen displaying a digital to-do list. The tablet shows 'PLAN' at the top with several checkboxes below it, some checked and some unchecked. The scene is set on a desk with a small potted plant visible in the background.
Photo by Jakub Żerdzicki on Unsplash

Using AI and Text Mining with Library Resources: What Every UC Berkeley Researcher Needs to Know

Planning to scrape a website or database? Train an AI tool? Before you scrape and before you train, there are steps you need to take!

First, consult the general terms and conditions that you need to comply with for all Library electronic resources (journal articles, books, databases, and more) in the Conditions of Use for Electronic Resources policy.

Second, if you also intend to use any Library electronic resources with AI tools or for text and data mining research, check what’s allowed under our license agreements by looking at the AI & TDM guide. If you don’t see your resource or database listed, please e-mail tdm-access@berkeley.edu and we’ll check the license agreement and tell you what’s permitted.

Violating license agreements can result in the entire campus losing access to critical research resources, and potentially expose you and the University to legal liability.

Below we answer some FAQs.


Understanding the Basics

Where does library content come from?

Most of the digital research materials you access through the UC Berkeley Library aren’t owned by the University. Instead, they’re owned by commercial publishers, academic societies, and other content providers who create and distribute scholarly resources.

Think of using the Library’s electronic resources like watching Netflix: you can watch movies and shows on Netflix, but Netflix doesn’t own most of that content—they pay licensing fees to film studios and content creators for the right to make it available to you. So does the Library.

In fact, each year, the Library signs license agreements and pays substantial licensing fees (millions of dollars annually) to publishers like Elsevier, Springer, Wiley, and hundreds of other content providers so that you can access their journals, books, and databases for your research and coursework.

What is a library license agreement?

A license agreement is a legal contract between UC Berkeley and each publisher that spells out exactly how the UC Berkeley community can use that publisher’s content. These contracts typically cover:

  • Who can access the content (usually current faculty, students, researchers, and staff)
  • How you can use it (reading, downloading, printing individual articles)
  • What you can’t do (automated mass downloading, sharing with unauthorized users, making commercial uses of it)
  • Special restrictions (including rules about AI tools and text and data mining)

Any time you access a database or use your Berkeley credentials to log in to a resource, you must comply with the terms of the license agreement that the Library has signed. All the agreements are different.

Why are all the agreements different? Can’t the Library just sign the same agreement with everyone?

Unfortunately, no. Each publisher has their own standard contract terms, and they rarely agree to identical language. Here’s why:

  • Different business models: Some publishers focus on journals, others on books or datasets—each has different concerns
  • Varying attitudes toward artificial intelligence: Some publishers embrace AI research, others are more restrictive
  • Disciplinary variations: Publishers that license content in some fields (e.g. business, data) typically impose different restrictions than publishers in other disciplines
  • Legal complexity: Text data mining and AI are relatively new, so contract language is still evolving

Can’t you negotiate better terms?

The good news is that the UC Berkeley Library is among the global leaders in negotiating the best possible terms for text and data mining and AI uses on your behalf. We’ve set the stage for the world in terms of AI rights in license agreements, and the UC President has recognized the efforts of our Library in this regard.

Still, we can’t force publishers to accept uniform language, and we can’t guarantee that every resource allows AI usage. This is why we need to check each agreement individually when you want to use content with AI tools.

But my research is a “fair use”!

We agree. (And we’re glad you’re staying up-to-speed on copyright and fair use.) But there’s a distinction between what copyright law allows and how license agreements (which are contracts) affect your rights under copyright law.

Copyright law gives you certain rights, including fair use for research and education.

Contract law can override those rights when you agree to specific terms. When UC Berkeley signs a license agreement with a publisher so you can use content, both the University and its users (that’s you) must comply with those contract terms.

Therefore, even if your AI training or text mining would normally qualify as fair use, the license agreement you’re bound by might explicitly prohibit it, or place specific qualifications on how AI might be used (e.g. use of AI permissible; training of AI prohibited).

Your responsibilities

What do I have to do?

You should consult the general terms and conditions that you need to comply with for all Library electronic resources (journal articles, books, databases, and more) in the Conditions of Use for Electronic Resources policy.

If you also intend to use any Library electronic resources with AI tools or for text and data mining research, check what’s allowed under our license agreements by looking at the AI & TDM guide. If you don’t see your resource or database listed, then e-mail tdm-access@berkeley.edu and we’ll check the license agreement and tell you what’s permitted.
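As a rough illustration only, the check described above can be sketched as a small Python helper. Everything in it is an assumption made for the example: the AI & TDM guide is a web page rather than a data file, and the resource name, the use labels ("tdm", "ai_use", "ai_training"), and the function name are hypothetical. The sketch simply encodes the rule "if the resource is listed, follow what the guide says; if it is not listed, ask us."

```python
from __future__ import annotations

# Contact address given in this post.
TDM_CONTACT = "tdm-access@berkeley.edu"

# Hypothetical, hand-entered notes taken from the AI & TDM guide.
# Keys are lowercased resource names; values are the uses the guide permits.
SAMPLE_GUIDE: dict[str, set[str]] = {
    "example database a": {"tdm", "ai_use"},  # made-up entry
}


def next_step(resource: str, planned_use: str, guide: dict[str, set[str]]) -> str:
    """Return 'proceed', 'stop', or 'ask' for a planned use of a resource."""
    permitted = guide.get(resource.strip().lower())
    if permitted is None:
        # Not listed in the guide: e-mail TDM_CONTACT before doing anything.
        return "ask"
    # Listed: follow what the guide says for this specific use.
    return "proceed" if planned_use in permitted else "stop"
```

For example, `next_step("Example Database A", "ai_training", SAMPLE_GUIDE)` returns `"stop"`, while any resource missing from your notes returns `"ask"`, which means e-mailing tdm-access@berkeley.edu. A "proceed" result still leaves you bound by the Conditions of Use for Electronic Resources policy.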

Do I have to comply? What’s the big deal?

Violating license agreements can result in losing access to critical research resources for the entire UC Berkeley community—and potentially expose you and the University to legal liability and lawsuits.

For the University:

  • Loss of access: Publishers can immediately cut off access to critical research resources for everyone on campus
  • Legal liability: The University could face costly lawsuits. Some publishers might claim millions of dollars in damages
  • Damaged relationships: Violations can harm the Library’s ability to negotiate future agreements, or prevent us from getting you access to key scholarly content

This doesn’t just affect the University—it also affects you. Violating the agreements can result in:

  • Immediate suspension of your access to all Library electronic resources
  • Legal exposure: You could potentially be held personally liable for damages in a lawsuit
  • Research disruption: Loss of access to essential materials for your work

How do I know if I’m using Library-licensed content?

The following kinds of materials are typically governed by Library license agreements:

  • Materials you access through UC Library Search (the Library’s online catalog)
  • Articles from academic journals accessed through the Library
  • E-books available through Library databases
  • Research datasets licensed by the Library
  • Any content accessed through Library database subscriptions
  • Materials that require you to log in with your UC Berkeley credentials

What if I get the content from a website not licensed by the Library?

If you’re downloading or mining content from a website that is not licensed by the Library, you should read the website’s terms of use, sometimes called “terms of service.” They are usually linked at the bottom of the web page. Reading the terms carefully can help you make informed decisions about how to proceed.

Even if the terms of use for the website or database restrict or prohibit text mining or AI, the provider may offer an application programming interface, or API, with its own set of terms that allows scraping and AI. You could also try contacting the provider and requesting permission for the research you want to do.

What if I’m using a campus-licensed AI platform?

Even when using UC Berkeley’s own AI platforms (like Gemini or River), you still need to check on whether you can upload Library-licensed content to that platform. The fact that the University provides the tool doesn’t automatically make all Library-licensed content okay to upload to it.

What if I’m using my own ChatGPT, Anthropic, or other generative AI account?

Again, you still need to check on whether you can upload Library-licensed content to that platform. The fact that you subscribe to the tool doesn’t mean you can upload Library-licensed content to it.

Do I really have to contact you? Can’t I just look up the license terms somewhere?

We wish it were that simple, but the Library signs thousands of agreements each year with highly complex terms. We’re working to make the terms more visible to you, though. Stay tuned.

In the meantime, check out the AI & TDM guide. If you don’t see your resource or database listed, then e-mail tdm-access@berkeley.edu and we’ll tell you what’s permitted.

Best practices are to:

  • Plan ahead: Contact us early in your research planning process
  • Be specific: The more details you provide, the faster we can give you guidance
  • Ask questions: We’re here to help, not to block your research

Get Help

For text mining and AI questions: tdm-access@berkeley.edu

For other licensing questions: acq-licensing@lists.berkeley.edu

For copyright and fair use guidance: schol-comm@berkeley.edu


This guidance is for informational purposes and should not be construed as legal advice. When in doubt, always contact library staff for assistance with specific situations.


Wrapping up our NEH-funded project to help text and data mining researchers navigate cross-border legal and ethical issues

Black and white photograph with grass and concrete with the word "finish" painted on the concrete in large capitalized letters.
Image via rawpixel, public domain

In August 2022, the UC Berkeley Library and Internet Archive were awarded a grant from the National Endowment for the Humanities (NEH) to study legal and ethical issues in cross-border text and data mining (TDM).

The project, entitled Legal Literacies for Text Data Mining – Cross-Border (“LLTDM-X”), supported research and analysis to address law and policy issues faced by U.S. digital humanities practitioners whose text data mining research and practice intersects with foreign-held or -licensed content, or involves international research collaborations.

LLTDM-X is now complete, resulting in the publication of an instructive case study for researchers and a white paper. Both resources are explained in greater detail below.

Project Origins

LLTDM-X built upon the previous NEH-sponsored institute, Building Legal Literacies for Text Data Mining. That institute provided training, guidance, and strategies to digital humanities TDM researchers on navigating legal literacies for text data mining (including copyright, contracts, privacy, and ethics) within a U.S. context.

A common challenge highlighted during the institute was the fact that TDM practitioners encounter expanding and increasingly complex cross-border legal problems. These include situations in which: (i) the materials they want to mine are housed in a foreign jurisdiction, or are otherwise subject to foreign database licensing or laws; (ii) the human subjects they are studying or who created the underlying content reside in another country; or, (iii) the colleagues with whom they are collaborating reside abroad, yielding uncertainty about which country’s laws, agreements, and policies apply.

Project Design

We designed LLTDM-X to identify and better understand the cross-border issues that digital humanities TDM practitioners face, with the aim of using these issues to inform prospective research and education. Secondarily, we hoped that LLTDM-X would also suggest preliminary guidance to include in future educational materials. In early 2023, we hosted a series of three online round tables with U.S.-based cross-border TDM practitioners and law and ethics experts from six countries. 

The round table conversations were structured to illustrate the empirical issues that researchers face, and also for the practitioners to benefit from preliminary advice on legal and ethical challenges. Upon the completion of the round tables, the LLTDM-X project team created a hypothetical case study that (i) reflects the observed cross-border LLTDM issues and (ii) contains preliminary analysis to facilitate the development of future instructional materials.

We also charged the experts with providing responsive and tailored written feedback to the practitioners about how they might address specific cross-border issues relevant to each of their projects.

Guidance & Analysis

Case Study

Extrapolating from the issues analyzed in the round tables, the practitioners’ statements, and the experts’ written analyses, the Project Team developed a hypothetical case study reflective of “typical” cross-border LLTDM issues that U.S.-based practitioners encounter. The case study provides basic guidance to support U.S. researchers in navigating cross-border TDM issues, while also highlighting questions that would benefit from further research. 

The case study examines cross-border copyright, contracts, and privacy & ethics variables across two distinct paradigms: first, a situation where U.S.-based researchers perform all TDM acts in the U.S., and second, a situation where U.S.-based researchers engage with collaborators abroad, or otherwise perform TDM acts both in the U.S. and abroad.

White Paper

The LLTDM-X white paper provides a comprehensive description of the project, including origins and goals, contributors, activities, and outcomes. Of particular note are several project takeaways and recommendations, which we hope will help inform future research and action to support cross-border text data mining. Our project takeaways touched on seven key themes: 

  1. Uncertainty about cross-border LLTDM issues indeed hinders U.S. TDM researchers, confirming the need for education about cross-border legal issues; 
  2. The expansion of education regarding U.S. LLTDM literacies remains essential, and should continue in parallel to cross-border education; 
  3. Disparities in national copyright, contracts, and privacy laws may incentivize TDM researcher “forum shopping” and exacerbate research bias;
  4. License agreements (and the concept of “contractual override”) often dominate the overall analysis of cross-border TDM permissibility;
  5. Emerging lawsuits about generative artificial intelligence may impact future understanding of fair use and other research exceptions; 
  6. Research is needed into issues of foreign jurisdiction, likelihood of lawsuits in foreign countries, and likelihood of enforcement of foreign judgments in the U.S. However, the overall “risk” of proceeding with cross-border TDM research may remain difficult to quantify; and
  7. Institutional review boards (IRBs) have an opportunity to explore a new role or build partnerships to support researchers engaged in cross-border TDM.

Gratitude & Next Steps

Thank you to the practitioners, experts, and project team, and to the National Endowment for the Humanities for its generous funding, for making this project a success.

We aim to broadly share our project outputs to continue helping U.S.-based TDM researchers navigate cross-border LLTDM hurdles. We will continue to speak publicly to educate researchers and the TDM community regarding project takeaways, and to advocate for legal and ethical experts to take up the essential research questions and begin developing much-needed educational materials. We will also continue to encourage the integration of LLTDM literacies into digital humanities curricula, to facilitate both domestic and cross-border TDM research.

[Note: this content is cross-posted on the LLTDM blog.]


Upcoming Workshop: Can I Mine That? Should I Mine That? A Clinic for Copyright, Ethics & More in TDM Research

computer keyboard and mouse with title of the Digital publishing Workshop Series

Workshop Date/Time: Wednesday, March 8, 2023, 11:00am–12:30pm

Register to receive Zoom link

If you are working on a computational text analysis project and have wondered how to legally acquire, use, and publish text and data, this workshop is for you! We will teach you 5 legal literacies (copyright, contracts, privacy, ethics, and special use cases) that will empower you to make well-informed decisions about compiling, using, and sharing your corpus. By the end of this workshop, and with a useful checklist in hand, you will be able to confidently design lawful text analysis projects or be well positioned to help others design such projects. Consider taking it alongside Copyright and Fair Use for Digital Projects.

Please sign up today and join us online on March 8.


UC Berkeley Library and Internet Archive co-directing project to help text data mining researchers navigate cross-border legal and ethical issues

We are excited to announce that the National Endowment for the Humanities (NEH) has awarded nearly $50,000 to UC Berkeley Library and Internet Archive to study legal and ethical issues in cross-border text data mining. The funding was made possible through NEH’s Digital Humanities Advancement Grant program.

NEH funding for the project, entitled Legal Literacies for Text Data Mining – Cross-Border (“LLTDM-X”), will support research and analysis to address law and policy issues faced by U.S. digital humanities practitioners whose text data mining research and practice intersects with foreign-held or -licensed content, or involves international research collaborations.

LLTDM-X builds upon the highly successful Building Legal Literacies for Text Data Mining Institute (Building LLTDM), previously funded by the NEH in 2019. UC Berkeley Library directed Building LLTDM in June 2020, bringing together expert faculty from across the country to train 32 digital humanities researchers on how to navigate law, policy, ethics, and risk within text data mining projects. (All of the results and impacts are summarized in the white paper here.) 

In Building LLTDM’s instructional sessions and post-workshop evaluations, participants identified cross-border research collaborations as an ongoing and critical legal and policy problem, and they also noted that foreign law and ethics issues pervaded their research. UC Berkeley Library’s Office of Scholarly Communication Services partnered with Internet Archive to begin to address these essential needs, and LLTDM-X sprang to life.

Why is LLTDM-X needed?

Text data mining, or TDM, is an increasingly essential and widespread research approach. TDM relies on automated techniques and algorithms to extract revelatory information from large sets of unstructured or thinly structured digital content. These methodologies allow scholars to identify and analyze critical social, scientific, and literary patterns, trends, and relationships across volumes of data that would otherwise be impossible to sift through.

While TDM methodologies offer great potential, they also present scholars with nettlesome law and policy challenges that can prevent them from understanding how to move forward with their research. Building LLTDM trained TDM researchers and professionals on essential principles of copyright, licensing, and privacy law, as well as ethics—thereby helping them move forward with impactful digital humanities research.

As Building LLTDM revealed, United States digital humanities scholars do not conduct text data mining research only in or about the U.S. Further, digital humanities research in particular is marked by collaboration across institutions and geographical boundaries. Yet, U.S. practitioners encounter expanding and increasingly complex cross-border problems. 

For example, U.S. contract law may supersede rights under copyright, such that a U.S. database license agreement may prohibit text data mining and other fair uses, whereas UK licenses cannot. Therefore, U.S. TDM practitioners collaborating with UK-based colleagues face impactful choices about which agreements to apply, as this may determine whether text data mining is permitted. In the U.S., “breaking” technological protection measures to conduct text data mining is now authorized within certain parameters, yet other jurisdictions prohibit such work or apply different conditions. U.S. text data mining researchers must accordingly consider how they work with internationally-held or -licensed materials or collaborators.

There are at least three scenarios that scholars must parse: (i) the materials they want to mine are housed in a foreign jurisdiction, or are otherwise subject to foreign database licensing or laws; (ii) the human subjects they are studying or who created the underlying content reside in another country; or (iii) the colleagues with whom they are collaborating reside abroad, yielding uncertainty about which country’s laws, agreements, and policies apply. These may collectively be considered the “cross-border” TDM scenarios.

U.S. researchers are uncertain about how to navigate each of these scenarios. In an informal survey that we conducted with digital humanities scholars, 70% of respondents reported cross-border copyright questions, 72% reported uncertainty about cross-border licensing terms, 52% noted privacy issues, and 48% identified ethical concerns. This confusion greatly impacted their TDM research: 28% of respondents confirmed that these cross-border copyright, licensing, privacy, or ethical issues impeded their project or prevented it entirely. Of equal concern, 40% of responding practitioners reported hesitation to share their workflows, methodology, or sources because of possible cross-border LLTDM issues. Without transparency, findings may be deemed unreliable and scholarship may be rejected for publication. These problems will only mount as research becomes more collaborative, given the substantial amount of cross-border research already occurring.

How will LLTDM-X help the world? 

Our long-term goal is to design instructional materials and institutes to support digital humanities TDM scholars facing cross-border issues, but our first step with LLTDM-X is getting a better handle on the specific law and policy challenges they face.

Through a series of virtual roundtable discussions, and accompanying legal research and analysis, LLTDM-X will surface these cross-border issues and begin to distill preliminary guidance to help scholars in navigating them. 

The first roundtable will engage U.S. digital humanities text data mining practitioners in sharing their cross-border TDM experiences. U.S. and global law and ethics experts will help guide the roundtable discussion to elicit the contours of practitioner experiences. During two subsequent roundtables—one focusing on cross-border copyright and licensing, and another on cross-border privacy and ethics—the experts will discuss practitioners’ hurdles in depth, and begin to develop customized guidance. 

After the roundtables, we will work with the law and ethics experts to create instructive case studies that reflect the types of cross-border TDM issues practitioners encountered. These case studies will incorporate recommendations to help a broad audience of U.S. digital humanities text data mining practitioners navigate LLTDM-X concerns. Case studies, guidance, and recommendations will be widely disseminated via an open access report to be published at the completion of the project. And most importantly, they will be used to inform our future educational offerings.

An experienced team

The team for LLTDM-X (introduced below) is eager to get started. The project is co-directed by Thomas Padilla, Deputy Director, Archiving and Data Services at Internet Archive, who explains:

“LLTDM-X responds strategically to a pervasive challenge that needlessly complicates, inhibits, and weakens the fullest potential of research. This work paves a critical path toward building future training institutes that address cross-border legal issues in TDM. At Internet Archive we’re committed to supporting universal access to all knowledge; LLTDM-X couldn’t be more clearly aligned with what we hope to achieve. We look forward to working with our partners at UC Berkeley Library and the wider community to advance this work.”

Rachael Samberg, who leads UC Berkeley Library’s Office of Scholarly Communication Services and oversaw Building LLTDM, joins Thomas as co-director and explains:

“We are ready to begin analyzing and sorting out the complex legal challenges for digital humanities TDM researchers. We’ve already secured an incredible group of international legal and ethics experts to conduct the analyses, and will share more on that soon. In the meantime, we are gearing up to build out an even larger group of participating scholars whose experiences will help us create case studies.”

On behalf of the entire project team, we would like to thank NEH’s Office of Digital Humanities again for funding this important work. We invite you to contact us with any questions you may have. 

Thomas Padilla (Project Director): Thomas is Deputy Director, Archiving and Data Services at Internet Archive, and has deep experience cultivating the ability of libraries, archives, and museums to support TDM research. He has previously served as Principal Investigator of the Andrew W. Mellon Foundation-supported Collections as Data: Part to Whole and the Institute of Museum and Library Services-supported Always Already Computational: Collections as Data, and as author of the library community research agenda, Responsible Operations: Data Science, Machine Learning, and AI in Libraries. In addition, Padilla served as expert faculty for Building LLTDM, the precursor to LLTDM-X.

Rachael Samberg (Project Co-Director): Rachael is Scholarly Communication Officer & Program Director of the University of California, Berkeley Library’s Office of Scholarly Communication Services. She served as Project Director and legal expert for Building LLTDM. A Duke Law graduate, Rachael practiced intellectual property litigation at Fenwick & West LLP for seven years before spending six years at Stanford Law School’s library, where she was Head of Reference & Instructional Services and a Lecturer in Law. Rachael speaks throughout the country about copyright and TDM issues, about which she is widely published. Her chapter, Law & Literacy in Non-Consumptive Text Mining, was published in Copyright Conversations (ALA, 2019).

Stacy Reardon (Project Team Member): Stacy Reardon is Literatures and Digital Humanities Librarian at the University of California, Berkeley Library, where she provides guidance and instruction on digital humanities projects and methods. Stacy served as a library expert on the Project Team for the NEH-funded Building Legal Literacies for Text Data Mining. She is co-chair of UC Berkeley’s Digital Humanities Working Group, and received her Ph.D. in literature from the University of Massachusetts, Amherst.

Timothy Vollmer (Project Manager): Timothy Vollmer is Scholarly Communication and Copyright Librarian at UC Berkeley Library. He served as Project Manager for the NEH-funded Building Legal Literacies for Text Data Mining. Tim worked as a senior public policy manager for Creative Commons, and contributed to writing and advocacy on the text data mining exceptions in the EU’s Directive on Copyright in the Digital Single Market. He formerly was the Assistant Director to the Program on Public Access to Information at the American Library Association.


Upcoming digital publishing workshops with the Office of Scholarly Communication Services

computer keyboard and mouse with title of the Digital publishing Workshop Series
Photo by Damian Zaleski on Unsplash

It’s 2022, and we’re right back at it with supporting your scholarship and publishing. This spring, the Office of Scholarly Communication Services has some practical workshops for you as part of the Library’s Digital Publishing Series. Here’s what’s coming up over the next few months.

Workshops

Publish Digital Books and Open Educational Resources with Pressbooks

February 8, 2022
11:00am–12:30pm
Online: Register to receive the Zoom link

If you’re looking to self-publish work of any length and want an easy-to-use tool that offers a high degree of customization, allows flexibility with publishing formats (EPUB, PDF), and provides web-hosting options, Pressbooks may be great for you. Pressbooks is often the tool of choice for academics creating digital books, open textbooks, and open educational resources, since you can license your materials for reuse however you desire. Learn why and how to use Pressbooks for publishing your original books or course materials. You’ll leave the workshop with a project already under way! Sign up at the link above and the Zoom login details will be emailed to you.

Can I Mine That? Should I Mine That?: A Clinic for Copyright, Ethics & More in TDM Research

March 9, 2022
11:00am–12:30pm
Online: Register to receive the Zoom link

If you are working on a computational text analysis project and have wondered how to legally acquire, use, and publish text and data, this workshop is for you! We will teach you 5 legal literacies (copyright, contracts, privacy, ethics, and special use cases) that will empower you to make well-informed decisions about compiling, using, and sharing your corpus. By the end of this workshop, and with a useful checklist in hand, you will be able to confidently design lawful text analysis projects or be well positioned to help others design such projects. Sign up at the link above and the Zoom login details will be emailed to you.

Other ways we can help

In addition to the workshops, we’re here to help answer a variety of questions you might have on intellectual property, digital publishing, and information policy. 

Want help or more information? Send us an email. We can provide individualized support and personal consultations, online class instruction, presentations and workshops for small or large groups & classes, and customized support and training for departments and disciplines.