Before you scrape and before you train… - UC Berkeley Library Update

Image: a person's hands holding a stylus over a tablet displaying a to-do list headed 'PLAN'. Photo by Jakub Żerdzicki on Unsplash

Using AI and Text Mining with Library Resources: What Every UC Berkeley Researcher Needs to Know

Planning to scrape a website or database? Train an AI tool? Before you scrape and before you train, there are steps you need to take!

First, consult the general terms and conditions that you need to comply with for all Library electronic resources (journal articles, books, databases, and more) in the Conditions of Use for Electronic Resources policy.

Second, if you also intend to use any Library electronic resources with AI tools or for text and data mining research, check what’s allowed under our license agreements by looking at the AI & TDM guide. If you don’t see your resource or database listed, please e-mail tdm-access@berkeley.edu and we’ll check the license agreement and tell you what’s permitted.

Violating license agreements can result in the entire campus losing access to critical research resources, and potentially expose you and the University to legal liability.

Below we answer some FAQs.

Understanding the Basics

Where does library content come from?

Most of the digital research materials you access through the UC Berkeley Library aren’t owned by the University. Instead, they’re owned by commercial publishers, academic societies, and other content providers who create and distribute scholarly resources.

Think of using the Library’s electronic resources like watching Netflix: you can watch movies and shows on Netflix, but Netflix doesn’t own most of that content. Instead, it pays licensing fees to film studios and content creators for the right to make it available to you. The Library does the same with publishers.

In fact, each year, the Library signs license agreements and pays substantial licensing fees (millions of dollars annually) to publishers like Elsevier, Springer, Wiley, and hundreds of other content providers so that you can access their journals, books, and databases for your research and coursework.

What is a library license agreement?

A license agreement is a legal contract between UC Berkeley and each publisher that spells out exactly how the UC Berkeley community can use that publisher’s content. These contracts typically cover:

Who can access the content (usually current faculty, students, researchers, and staff)
How you can use it (reading, downloading, printing individual articles)
What you can’t do (automated mass downloading, sharing with unauthorized users, making commercial uses of it)
Special restrictions (including rules about AI tools and text and data mining)

Any time you access a database or use your Berkeley credentials to log in to a resource, you must comply with the terms of the license agreement that the Library has signed. All the agreements are different.

Why are all the agreements different? Can’t the Library just sign the same agreement with everyone?

Unfortunately, no. Each publisher has its own standard contract terms, and publishers rarely agree to identical language. Here’s why:

Different business models: Some publishers focus on journals, others on books or datasets—each has different concerns
Varying attitudes toward artificial intelligence: Some publishers embrace AI research, others are more restrictive
Disciplinary variations: Publishers in some fields (e.g., business, data) typically impose different restrictions than publishers in other disciplines
Legal complexity: Text and data mining and AI are relatively new, so contract language is still evolving

Can’t you negotiate better terms?

The good news is that the UC Berkeley Library is among the global leaders in negotiating the best possible terms for text and data mining and AI uses on your behalf. We’ve set the standard worldwide for AI rights in license agreements, and the UC President has recognized our Library’s efforts in this regard.

Still, we can’t force publishers to accept uniform language, and we can’t guarantee that every resource allows AI usage. This is why we need to check each agreement individually when you want to use content with AI tools.

But my research is a “fair use”!

We agree. (And we’re glad you’re staying up to speed on copyright and fair use.) But there’s a distinction between what copyright law allows and how license agreements (which are contracts) affect your rights under copyright law.

Copyright law gives you certain rights, including fair use for research and education.

Contract law can override those rights when you agree to specific terms. When UC Berkeley signs a license agreement with a publisher so you can use content, both the University and its users (that’s you) must comply with those contract terms.

Therefore, even if your AI training or text mining would normally qualify as fair use, the license agreement you’re bound by might explicitly prohibit it, or place specific qualifications on how AI might be used (e.g. use of AI permissible; training of AI prohibited).

Your responsibilities

What do I have to do?

You should consult the general terms and conditions that you need to comply with for all Library electronic resources (journal articles, books, databases, and more) in the Conditions of Use for Electronic Resources policy.

If you also intend to use any Library electronic resources with AI tools or for text and data mining research, check what’s allowed under our license agreements by looking at the AI & TDM guide. If you don’t see your resource or database listed, then e-mail tdm-access@berkeley.edu and we’ll check the license agreement and tell you what’s permitted.
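
If you like to script your research setup, the two steps above can be sketched as a small pre-flight check. This is only an illustration, not a Library tool: the `GuideEntry` class, the `preflight` function, and both "Example Database" entries are hypothetical, and you would fill in the table by hand from what the AI & TDM guide actually says about your resource.

```python
from dataclasses import dataclass

TDM_CONTACT = "tdm-access@berkeley.edu"  # contact address given in this guidance

# Activities this sketch distinguishes (an assumption; a real license may
# draw the lines differently).
ACTIVITIES = ("tdm", "ai_use", "ai_training")


@dataclass(frozen=True)
class GuideEntry:
    """Your own notes on one resource, copied by hand from the AI & TDM guide."""
    tdm: bool
    ai_use: bool
    ai_training: bool


# Placeholder entries only: these names and values are invented for the
# sketch and do not describe any real license agreement.
GUIDE_NOTES = {
    "Example Database A": GuideEntry(tdm=True, ai_use=True, ai_training=False),
    "Example Database B": GuideEntry(tdm=False, ai_use=False, ai_training=False),
}


def preflight(resource, activity, guide=GUIDE_NOTES):
    """Return 'allowed', 'not allowed', or an 'ask: ...' reminder."""
    if activity not in ACTIVITIES:
        raise ValueError(f"unknown activity: {activity!r}")
    entry = guide.get(resource)
    if entry is None:
        # Not listed in your notes from the guide: ask before doing anything.
        return f"ask: e-mail {TDM_CONTACT}"
    return "allowed" if getattr(entry, activity) else "not allowed"


if __name__ == "__main__":
    for name in ("Example Database A", "Unlisted Database"):
        print(name, "->", preflight(name, "ai_training"))
```

Anything that comes back as "ask" means e-mailing tdm-access@berkeley.edu before you scrape or train.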

Do I have to comply? What’s the big deal?

Violating license agreements can result in losing access to critical research resources for the entire UC Berkeley community—and potentially expose you and the University to legal liability and lawsuits.

For the University:

Loss of access: Publishers can immediately cut off access to critical research resources for everyone on campus
Legal liability: The University could face costly lawsuits. Some publishers might claim millions of dollars in damages
Damaged relationships: Violations can harm the Library’s ability to negotiate future agreements, or prevent us from getting you access to key scholarly content

This doesn’t just affect the University—it also affects you. Violating the agreements can result in:

Immediate suspension of your access to all library electronic resources
Legal exposure: You could potentially be held personally liable for damages in a lawsuit
Research disruption: Loss of access to essential materials for your work

How do I know if I’m using Library-licensed content?

The following kinds of materials are typically governed by Library license agreements:

Materials you access through UC Library Search (the Library’s online catalog)
Articles from academic journals accessed through the Library
E-books available through Library databases
Research datasets licensed by the Library
Any content accessed through Library database subscriptions
Materials that require you to log in with your UC Berkeley credentials

What if I get the content from a website not licensed by the Library?

If you’re downloading or mining content from a website that is not licensed by the Library, you should read the website’s terms of use, sometimes called “terms of service.” These are usually linked at the bottom of the web page. Reading the terms carefully can help you make informed decisions about how to proceed.

Even if the terms of use for the website or database restrict or prohibit text mining or AI, the provider may offer an application programming interface, or API, with its own set of terms that allows scraping and AI use. You could also try contacting the provider and requesting permission for the research you want to do.

What if I’m using a campus-licensed AI platform?

Even when using UC Berkeley’s own AI platforms (like Gemini or River), you still need to check whether you can upload Library-licensed content to that platform. The fact that the University provides the tool doesn’t automatically make it okay to upload Library-licensed content to it.

What if I’m using my own ChatGPT, Anthropic, or other generative AI account?

Again, you still need to check whether you can upload Library-licensed content to that platform. Subscribing to the tool yourself doesn’t mean you can upload Library-licensed content to it.

Do I really have to contact you? Can’t I just look up the license terms somewhere?

We wish it were that simple, but the Library signs thousands of agreements each year with highly complex terms. We’re working to make the terms more visible to you, though. Stay tuned.

In the meantime, check out the AI & TDM guide. If you don’t see your resource or database listed, then e-mail tdm-access@berkeley.edu and we’ll tell you what’s permitted.

Best practices are to:

Plan ahead: Contact us early in your research planning process
Be specific: The more details you provide, the faster we can give you guidance
Ask questions: We’re here to help, not to block your research

Get Help

For text mining and AI questions: tdm-access@berkeley.edu

For other licensing questions: acq-licensing@lists.berkeley.edu

For copyright and fair use guidance: schol-comm@berkeley.edu

This guidance is for informational purposes and should not be construed as legal advice. When in doubt, always contact library staff for assistance with specific situations.