text data mining

Before you scrape and before you train…

Posted on September 2, 2025 (updated September 17, 2025) by Timothy Vollmer

A person's hands holding a white stylus pen over a tablet screen displaying a digital to-do list. The tablet shows 'PLAN' at the top with several checkboxes below it, some checked and some unchecked. The scene is set on a desk with a small potted plant visible in the background. — Photo by Jakub Żerdzicki on Unsplash

Using AI and Text Mining with Library Resources: What Every UC Berkeley Researcher Needs to Know

Planning to scrape a website or database? Train an AI tool? Before you scrape and before you train, there are steps you need to take!

First, consult the general terms and conditions that you need to comply with for all Library electronic resources (journal articles, books, databases, and more) in the Conditions of Use for Electronic Resources policy.

Second, if you also intend to use any Library electronic resources with AI tools or for text and data mining research, check what’s allowed under our license agreements by looking at the AI & TDM guide. If you don’t see your resource or database listed, please e-mail tdm-access@berkeley.edu and we’ll check the license agreement and tell you what’s permitted.

Violating license agreements can result in the entire campus losing access to critical research resources, and potentially expose you and the University to legal liability.

Below we answer some FAQs.

Understanding the Basics

Where does library content come from?

Most of the digital research materials you access through the UC Berkeley Library aren’t owned by the University. Instead, they’re owned by commercial publishers, academic societies, and other content providers who create and distribute scholarly resources.

Think of using the Library’s electronic resources like watching Netflix: you can watch movies and shows on Netflix, but Netflix doesn’t own most of that content—they pay licensing fees to film studios and content creators for the right to make it available to you. So does the Library.

In fact, each year, the Library signs license agreements and pays substantial licensing fees (millions of dollars annually) to publishers like Elsevier, Springer, Wiley, and hundreds of other content providers so that you can access their journals, books, and databases for your research and coursework.

What is a library license agreement?

A license agreement is a legal contract between UC Berkeley and each publisher that spells out exactly how the UC Berkeley community can use that publisher’s content. These contracts typically cover:

Who can access the content (usually current faculty, students, researchers, and staff)
How you can use it (reading, downloading, printing individual articles)
What you can’t do (automated mass downloading, sharing with unauthorized users, making commercial uses of it)
Special restrictions (including rules about AI tools and text and data mining)

Any time you access a database or use your Berkeley credentials to log in to a resource, you must comply with the terms of the license agreement that the Library has signed. All the agreements are different.

Why are all the agreements different? Can’t the Library just sign the same agreement with everyone?

Unfortunately, no. Each publisher has its own standard contract terms, and publishers rarely agree to identical language. Here’s why:

Different business models: Some publishers focus on journals, others on books or datasets—each has different concerns
Varying attitudes toward artificial intelligence: Some publishers embrace AI research, others are more restrictive
Disciplinary variations: Publishers that license content in certain fields (e.g. business, data) typically impose different restrictions than publishers in other disciplines
Legal complexity: Text data mining and AI are relatively new, so contract language is still evolving

Can’t you negotiate better terms?

The good news is that the UC Berkeley Library is among the global leaders in negotiating the best possible terms for text and data mining and AI uses on your behalf. We’ve set the stage for the world in terms of AI rights in license agreements, and the UC President has recognized the efforts of our Library in this regard.

Still, we can’t force publishers to accept uniform language, and we can’t guarantee that every resource allows AI usage. This is why we need to check each agreement individually when you want to use content with AI tools.

But my research is a “fair use”!

We agree. (And we’re glad you’re staying up to speed on copyright and fair use.) But there’s a distinction between what copyright law allows and how license agreements (which are contracts) affect your rights under copyright law.

Copyright law gives you certain rights, including fair use for research and education.

Contract law can override those rights when you agree to specific terms. When UC Berkeley signs a license agreement with a publisher so you can use content, both the University and its users (that’s you) must comply with those contract terms.

Therefore, even if your AI training or text mining would normally qualify as fair use, the license agreement you’re bound by might explicitly prohibit it, or place specific qualifications on how AI might be used (e.g. use of AI permissible; training of AI prohibited).

Your responsibilities

What do I have to do?

You should consult the general terms and conditions that you need to comply with for all Library electronic resources (journal articles, books, databases, and more) in the Conditions of Use for Electronic Resources policy.

If you also intend to use any Library electronic resources with AI tools or for text and data mining research, check what’s allowed under our license agreements by looking at the AI & TDM guide. If you don’t see your resource or database listed, then e-mail tdm-access@berkeley.edu and we’ll check the license agreement and tell you what’s permitted.

Do I have to comply? What’s the big deal?

Violating license agreements can result in losing access to critical research resources for the entire UC Berkeley community—and potentially expose you and the University to legal liability and lawsuits.

For the University:

Loss of access: Publishers can immediately cut off access to critical research resources for everyone on campus
Legal liability: The University could face costly lawsuits. Some publishers might claim millions of dollars in damages
Damaged relationships: Violations can harm the Library’s ability to negotiate future agreements, or prevent us from getting you access to key scholarly content

This doesn’t just affect the University—it also affects you. Violating the agreements can result in:

Immediate suspension of your access to all Library electronic resources
Legal exposure: You could potentially be held personally liable for damages in a lawsuit
Research disruption: Loss of access to essential materials for your work

How do I know if I’m using Library-licensed content?

The following kinds of materials are typically governed by Library license agreements:

Materials you access through the UC Library Search (the Library’s online catalog)
Articles from academic journals accessed through the Library
E-books available through Library databases
Research datasets licensed by the Library
Any content accessed through Library database subscriptions
Materials that require you to log in with your UC Berkeley credentials

What if I get the content from a website not licensed by the Library?

If you’re downloading or mining content from a website that is not licensed by the Library, you should read the website’s terms of use, sometimes called “terms of service.” They can usually be found through a link at the bottom of the web page. Reading the terms of service carefully can help you make informed decisions about how to proceed.

Even if the terms of use for the website or database restrict or prohibit text mining or AI, the provider may offer an application programming interface, or API, with its own set of terms that allows scraping and AI use. You could also try contacting the provider and requesting permission for the research you want to do.

What if I’m using a campus-licensed AI platform?

Even when using UC Berkeley’s own AI platforms (like Gemini or River), you still need to check on whether you can upload Library-licensed content to that platform. The fact that the University provides the tool doesn’t automatically make all Library-licensed content okay to upload to it.

What if I’m using my own ChatGPT, Anthropic, or other generative AI account?

Again, you still need to check on whether you can upload Library-licensed content to that platform. The fact that you subscribe to the tool doesn’t mean you can upload Library-licensed content to it.

Do I really have to contact you? Can’t I just look up the license terms somewhere?

We wish it were that simple, but the Library signs thousands of agreements each year with highly complex terms. We’re working to make the terms more visible to you, though. Stay tuned.

In the meantime, check out the AI & TDM guide. If you don’t see your resource or database listed, then e-mail tdm-access@berkeley.edu and we’ll tell you what’s permitted.

Best practices are to:

Plan ahead: Contact us early in your research planning process
Be specific: The more details you provide, the faster we can give you guidance
Ask questions: We’re here to help, not to block your research

Get Help

For text mining and AI questions: tdm-access@berkeley.edu

For other licensing questions: acq-licensing@lists.berkeley.edu

For copyright and fair use guidance: schol-comm@berkeley.edu

This guidance is for informational purposes and should not be construed as legal advice. When in doubt, always contact library staff for assistance with specific situations.

Wrapping up our NEH-funded project to help text and data mining researchers navigate cross-border legal and ethical issues

Posted on October 2, 2023 (updated October 4, 2023) by Timothy Vollmer

Black and white photograph with grass and concrete with the word "finish" painted on the concrete in large capitalized letters. — Image via rawpixel, public domain

In August 2022, the UC Berkeley Library and Internet Archive were awarded a grant from the National Endowment for the Humanities (NEH) to study legal and ethical issues in cross-border text and data mining (TDM).

The project, entitled Legal Literacies for Text Data Mining – Cross-Border (“LLTDM-X”), supported research and analysis to address law and policy issues faced by U.S. digital humanities practitioners whose text data mining research and practice intersects with foreign-held or -licensed content, or involves international research collaborations.

LLTDM-X is now complete, resulting in the publication of an instructive case study for researchers and a white paper. Both resources are explained in greater detail below.

Project Origins

LLTDM-X built upon the previous NEH-sponsored institute, Building Legal Literacies for Text Data Mining. That institute provided training, guidance, and strategies to digital humanities TDM researchers on navigating legal literacies for text data mining (including copyright, contracts, privacy, and ethics) within a U.S. context.

A common challenge highlighted during the institute was the fact that TDM practitioners encounter expanding and increasingly complex cross-border legal problems. These include situations in which: (i) the materials they want to mine are housed in a foreign jurisdiction, or are otherwise subject to foreign database licensing or laws; (ii) the human subjects they are studying or who created the underlying content reside in another country; or, (iii) the colleagues with whom they are collaborating reside abroad, yielding uncertainty about which country’s laws, agreements, and policies apply.

Project design

We designed LLTDM-X to identify and better understand the cross-border issues that digital humanities TDM practitioners face, with the aim of using these issues to inform prospective research and education. Secondarily, we hoped that LLTDM-X would also suggest preliminary guidance to include in future educational materials. In early 2023, we hosted a series of three online round tables with U.S.-based cross-border TDM practitioners and law and ethics experts from six countries.

The round table conversations were structured to illustrate the empirical issues that researchers face, and also for the practitioners to benefit from preliminary advice on legal and ethical challenges. Upon the completion of the round tables, the LLTDM-X project team created a hypothetical case study that (i) reflects the observed cross-border LLTDM issues and (ii) contains preliminary analysis to facilitate the development of future instructional materials.

We also charged the experts with providing responsive and tailored written feedback to the practitioners about how they might address specific cross-border issues relevant to each of their projects.

Guidance & Analysis

Case Study

Extrapolating from the issues analyzed in the round tables, the practitioners’ statements, and the experts’ written analyses, the Project Team developed a hypothetical case study reflective of “typical” cross-border LLTDM issues that U.S.-based practitioners encounter. The case study provides basic guidance to support U.S. researchers in navigating cross-border TDM issues, while also highlighting questions that would benefit from further research.

The case study examines cross-border copyright, contracts, and privacy & ethics variables across two distinct paradigms: first, a situation where U.S.-based researchers perform all TDM acts in the U.S., and second, a situation where U.S.-based researchers engage with collaborators abroad, or otherwise perform TDM acts both in the U.S. and abroad.

White Paper

The LLTDM-X white paper provides a comprehensive description of the project, including origins and goals, contributors, activities, and outcomes. Of particular note are several project takeaways and recommendations, which we hope will help inform future research and action to support cross-border text data mining. Our project takeaways touched on seven key themes:

Uncertainty about cross-border LLTDM issues indeed hinders U.S. TDM researchers, confirming the need for education about cross-border legal issues;
The expansion of education regarding U.S. LLTDM literacies remains essential, and should continue in parallel to cross-border education;
Disparities in national copyright, contracts, and privacy laws may incentivize TDM researcher “forum shopping” and exacerbate research bias;
License agreements (and the concept of “contractual override”) often dominate the overall analysis of cross-border TDM permissibility;
Emerging lawsuits about generative artificial intelligence may impact future understanding of fair use and other research exceptions;
Research is needed into issues of foreign jurisdiction, likelihood of lawsuits in foreign countries, and likelihood of enforcement of foreign judgments in the U.S. However, the overall “risk” of proceeding with cross-border TDM research may remain difficult to quantify; and
Institutional review boards (IRBs) have an opportunity to explore a new role or build partnerships to support researchers engaged in cross-border TDM.

Gratitude & Next Steps

Thank you to the practitioners, experts, and project team for making this project a success, and to the National Endowment for the Humanities for its generous funding.

We aim to broadly share our project outputs to continue helping U.S.-based TDM researchers navigate cross-border LLTDM hurdles. We will continue to speak publicly to educate researchers and the TDM community regarding project takeaways, and to advocate for legal and ethical experts to take up the essential research questions and begin developing much-needed educational materials. We will also continue to encourage the integration of LLTDM literacies into digital humanities curricula, to facilitate both domestic and cross-border TDM research.

[Note: this content is cross-posted on the LLTDM blog.]

Upcoming Workshop: Can I Mine That? Should I Mine That? A Clinic for Copyright, Ethics & More in TDM Research

Posted on January 12, 2023 by Timothy Vollmer

Computer keyboard and mouse with the title of the Digital Publishing Workshop Series

Workshop Date/Time: Wednesday, March 8, 2023, 11:00am–12:30pm

Register to receive Zoom link

If you are working on a computational text analysis project and have wondered how to legally acquire, use, and publish text and data, this workshop is for you! We will teach you five legal literacies (copyright, contracts, privacy, ethics, and special use cases) that will empower you to make well-informed decisions about compiling, using, and sharing your corpus. By the end of this workshop, and with a useful checklist in hand, you will be able to confidently design lawful text analysis projects or be well positioned to help others design such projects. Consider taking this workshop alongside Copyright and Fair Use for Digital Projects.

Please sign up today and join us online on March 8.