Before you scrape and before you train…

A person's hands holding a white stylus pen over a tablet screen displaying a digital to-do list. The tablet shows 'PLAN' at the top with several checkboxes below it, some checked and some unchecked. The scene is set on a desk with a small potted plant visible in the background.
Photo by Jakub Żerdzicki on Unsplash

Using AI and Text Mining with Library Resources: What Every UC Berkeley Researcher Needs to Know

Planning to scrape a website or database? Train an AI tool? Before you scrape and before you train, there are steps you need to take!

First, consult the general terms and conditions that you need to comply with for all Library electronic resources (journal articles, books, databases, and more) in the Conditions of Use for Electronic Resources policy.

Second, if you also intend to use any Library electronic resources with AI tools or for text and data mining research, check what’s allowed under our license agreements by looking at the AI & TDM guide. If you don’t see your resource or database listed, please e-mail tdm-access@berkeley.edu and we’ll check the license agreement and tell you what’s permitted.

Violating license agreements can result in the entire campus losing access to critical research resources, and potentially expose you and the University to legal liability.

Below we answer some FAQs.


Understanding the Basics

Where does library content come from?

Most of the digital research materials you access through the UC Berkeley Library aren’t owned by the University. Instead, they’re owned by commercial publishers, academic societies, and other content providers who create and distribute scholarly resources.

Think of using the Library’s electronic resources like watching Netflix: you can watch movies and shows on Netflix, but Netflix doesn’t own most of that content—they pay licensing fees to film studios and content creators for the right to make it available to you. So does the Library.

In fact, each year, the Library signs license agreements and pays substantial licensing fees (millions of dollars annually) to publishers like Elsevier, Springer, Wiley, and hundreds of other content providers so that you can access their journals, books, and databases for your research and coursework.

What is a library license agreement?

A license agreement is a legal contract between UC Berkeley and each publisher that spells out exactly how the UC Berkeley community can use that publisher’s content. These contracts typically cover:

  • Who can access the content (usually current faculty, students, researchers, and staff)
  • How you can use it (reading, downloading, printing individual articles)
  • What you can’t do (automated mass downloading, sharing with unauthorized users, making commercial uses of it)
  • Special restrictions (including rules about AI tools and text and data mining)

Any time you access a database or use your Berkeley credentials to log in to a resource, you must comply with the terms of the license agreement that the Library has signed. All the agreements are different.

Why are all the agreements different? Can’t the Library just sign the same agreement with everyone?

Unfortunately, no. Each publisher has their own standard contract terms, and they rarely agree to identical language. Here’s why:

  • Different business models: Some publishers focus on journals, others on books or datasets—each has different concerns
  • Varying attitudes toward artificial intelligence: Some publishers embrace AI research, others are more restrictive
  • Disciplinary variations: Publishers licensing content in some fields (e.g. business, data) typically impose different restrictions than those in other disciplines
  • Legal complexity: Text data mining and AI are relatively new, so contract language is still evolving

Can’t you negotiate better terms?

The good news is that the UC Berkeley Library is among global leaders in negotiating the very best possible terms of text and data mining and AI uses for you. We’ve set the stage for the world in terms of AI rights in license agreements, and the UC President has recognized the efforts of our Library in this regard.

Still, we can’t force publishers to accept uniform language, and we can’t guarantee that every resource allows AI usage. This is why we need to check each agreement individually when you want to use content with AI tools.

But my research is a “fair use”!

We agree. (And we’re glad you’re staying up-to-speed on copyright and fair use.) But there’s a distinction between what copyright law allows and how license agreements (which are contracts) affect your rights under copyright law.

Copyright law gives you certain rights, including fair use for research and education.

Contract law can override those rights when you agree to specific terms. When UC Berkeley signs a license agreement with a publisher so you can use content, both the University and its users (that’s you) must comply with those contract terms.

Therefore, even if your AI training or text mining would normally qualify as fair use, the license agreement you’re bound by might explicitly prohibit it, or place specific qualifications on how AI might be used (e.g. use of AI permissible; training of AI prohibited).

Your responsibilities

What do I have to do?

You should consult the general terms and conditions that you need to comply with for all Library electronic resources (journal articles, books, databases, and more) in the Conditions of Use for Electronic Resources policy.

If you also intend to use any Library electronic resources with AI tools or for text and data mining research, check what’s allowed under our license agreements by looking at the AI & TDM guide. If you don’t see your resource or database listed, then e-mail tdm-access@berkeley.edu and we’ll check the license agreement and tell you what’s permitted.

Do I have to comply? What’s the big deal?

Violating license agreements can result in losing access to critical research resources for the entire UC Berkeley community—and potentially expose you and the University to legal liability and lawsuits.

For the University:

  • Loss of access: Publishers can immediately cut off access to critical research resources for everyone on campus
  • Legal liability: The University could face costly lawsuits. Some publishers might claim millions of dollars’ worth of damages
  • Damaged relationships: Violations can harm the library’s ability to negotiate future agreements, or prevent us from getting you access to key scholarly content

This doesn’t just affect the University—it also affects you. Violating the agreements can result in:

  • Immediate suspension of your access to all library electronic resources
  • Legal exposure: You could potentially be held personally liable for damages in a lawsuit
  • Research disruption: Loss of access to essential materials for your work

How do I know if I’m using Library-licensed content?

The following kinds of materials are typically governed by Library license agreements:

  • Materials you access through the UC Library Search (the Library’s online catalog)
  • Articles from academic journals accessed through the Library
  • E-books available through Library databases
  • Research datasets licensed by the Library
  • Any content accessed through Library database subscriptions
  • Materials that require you to log in with your UC Berkeley credentials

What if I get the content from a website not licensed by the Library?

If you’re downloading or mining content from a website that is not licensed by the Library, you should read the website’s terms of use, sometimes called “terms of service.” They will usually be found through a link at the bottom of the web page. Carefully understanding the terms of service can help you make informed decisions about how to proceed.

Even if the terms of use for the website or database restrict or prohibit text mining or AI, the provider may offer an application programming interface, or API, with its own set of terms that allows scraping and AI. You could also try contacting the provider and requesting permission for the research you want to do.

What if I’m using a campus-licensed AI platform?

Even when using UC Berkeley’s own AI platforms (like Gemini or River), you still need to check on whether you can upload Library-licensed content to that platform. The fact that the University provides the tool doesn’t automatically make all Library-licensed content okay to upload to it.

What if I’m using my own ChatGPT, Anthropic, or other generative AI account?

Again, you still need to check on whether you can upload Library-licensed content to that platform. The fact that you subscribe to the tool doesn’t mean you can upload Library-licensed content to it.

Do I really have to contact you? Can’t I just look up the license terms somewhere?

We wish it were that simple, but the Library signs thousands of agreements each year with highly complex terms. We’re working to make the terms more visible to you, though. Stay tuned.

In the meantime, check out the AI & TDM guide. If you don’t see your resource or database listed, then e-mail tdm-access@berkeley.edu and we’ll tell you what’s permitted.

Best practices are to:

  • Plan ahead: Contact us early in your research planning process
  • Be specific: The more details you provide, the faster we can give you guidance
  • Ask questions: We’re here to help, not to block your research

Get Help

For text mining and AI questions: tdm-access@berkeley.edu

For other licensing questions: acq-licensing@lists.berkeley.edu

For copyright and fair use guidance: schol-comm@berkeley.edu


This guidance is for informational purposes and should not be construed as legal advice. When in doubt, always contact library staff for assistance with specific situations.


When Copyright and Contracts Collide: Advocacy for Library and User Rights

A dramatic scene depicting a large copyright symbol exploding in a burst of energy, surrounded by flying pages and debris. The background features a stormy sky and a mountainous landscape.
AI-generated image via ChatGPT

In the ever-evolving landscape of digital access to scholarly research, libraries face new challenges as they navigate the intersection of copyright law and contractual agreements. Academic institutions increasingly rely on digital content, and understanding how copyright exceptions and contract law interact is crucial for protecting the rights of libraries and our users.

Tim Vollmer (Scholarly Communication & Copyright Librarian, UC Berkeley), Sara Benson (Copyright Librarian and Associate Professor, University of Illinois Urbana-Champaign), Jonathan Band (copyright attorney and counsel to the Library Copyright Alliance), and Jim Neal (University Librarian Emeritus, Columbia University) presented on these issues at the 2024 American Library Association Annual Conference in San Diego. Our panel was titled When Copyright and Contracts Collide: Advocacy for Library and User Rights.

The Role of Copyright Exceptions

Sara set the stage for our discussion by describing the importance of limitations and exceptions to copyright that empower libraries, research, and teaching. For example, Section 108 of the U.S. Copyright Act allows libraries and archives to make limited copies of copyrighted materials for preservation, replacement, fulfilling interlibrary loan requests, and more. Fair use—Section 107 of the Act—permits limited use of copyrighted works without having to seek the copyright holder’s permission when the use is for purposes such as teaching, research, scholarship, reporting, criticism, or parody. Faculty, students, and academic authors leverage fair use when they incorporate copyrighted materials for teaching, research, and publishing. And the fair use exception has played an increasingly important role in facilitating new types of scholarly research, including text and data mining.

The Threat of Contractual Override

Despite these protections, contractual agreements can sometimes override copyright exceptions. Vendor licensing terms may include clauses that restrict activities such as text and data mining. And even though fair use is a statutory right (meaning it’s in the law) in the U.S., and even though court cases have confirmed that activities such as text and data mining fall under fair use, there is no protection against the practice where private parties such as academic publishers “contract around” fair use for actions that already are lawful.

As a result, academic libraries are forced to negotiate and often pay significant sums each year to try to preserve fair use rights for campus scholars through the database and electronic content license agreements that they sign.

Jonathan discussed alternative international approaches to the problem of contractual override. The European Union, for example, has implemented directives that nullify contract terms which override specific copyright exceptions. Countries like Australia, New Zealand, and Norway have also adopted similar measures. However, the United States and Canada lack comprehensive contract override prevention laws, making it challenging to protect copyright exceptions at the national level.

Advocating for Fair Contracts in Library Licensing

Tim discussed how academic libraries are demanding license agreements that preserve fair use rights. But at the same time, libraries are already starting to see contract amendments put forth by scholarly publishers that attempt to impose outright bans on any use of artificial intelligence (AI) tools for the content we’re licensing from them. The challenge is that we know that researchers are using library-licensed materials for many AI uses in the context of nonprofit scholarship and research, and these uses should be fair use, just as it’s fair use for researchers to conduct text and data mining on licensed resources.

Library workers can smartly negotiate to protect the rights of instructors, students, and other academic community members to use library-licensed resources in the ways they need to conduct their teaching and research while simultaneously taking into consideration the concerns of publishers.

Moving Forward: A Coordinated Approach

To address the issue of contractual override, Jim suggested several approaches, including educating library stakeholders such as administrators and faculty, building constructive relationships with publishers, monitoring international developments, and pursuing legislative change to protect copyright exceptions.

The University of California Libraries are already collaborating on this and related issues with our colleagues. After outreach to several library and faculty committees, the UC’s Academic Senate sent a letter to UC President Michael Drake to advocate that the UC Libraries need to be able to negotiate to preserve fair use rights when licensing electronic resources—including the rights to conduct computational research and utilize AI tools in academic studies and scholarship. President Drake and UC System Provost and Executive Vice President for Academic Affairs Katherine S. Newman affirmed this commitment.

Please reach out to schol-comm@berkeley.edu with any questions. For more information, please see the links below.


UC Berkeley Library to Copyright Office: Protect fair uses in AI training for research and education

Madison Building, Library of Congress
Copyright Matt H. Wade, licensed CC-BY-NC-SA 3.0

We are pleased to share the UC Berkeley Library’s response to the U.S. Copyright Office’s Notice of Inquiry regarding artificial intelligence and copyright. Our response addresses the essential fair use right relied upon by UC Berkeley scholars in undertaking groundbreaking research, and the need to preserve access to the underlying copyright-protected content so that scholars using AI systems can conduct research inquiries.

In this blog post, we explain what the Copyright Office is studying, and why it was important for the Library to make scholars’ voices heard.

What the Copyright Office is studying and why

Loosely speaking, the Copyright Office wants to understand how to set policy for copyright issues raised by artificial intelligence (“AI”) systems.

Over the last year, AI systems and the rapid growth of their capabilities have attracted significant attention. One type of AI, referred to as “generative AI”, is capable of producing outputs such as text, images, video, or audio (including emulating a human voice) that would be considered copyrightable if created by a human author. These systems include, for instance, the chatbot ChatGPT, and text-to-image generators like DALL·E, Midjourney, and Stable Diffusion. A user can prompt ChatGPT to write a short story that features a duck and a frog who are best friends, or prompt DALL·E to create an abstract image in the style of a Jackson Pollock painting. Generative AI systems are relevant to and impact many educational activities on a campus like UC Berkeley, but (at least to date) have not been the key facilitator of campus research methodologies. 

Instead, in the context of research, scholars have been relying on AI systems to support a set of research methodologies referred to as “text and data mining” (or TDM). TDM utilizes computational tools, algorithms, and automated techniques to extract revelatory information from large sets of unstructured or thinly-structured digital content. Imagine you have a book like “Pride and Prejudice.” There are nearly infinite volumes of information stored inside that book, depending on your scholarly inquiry, such as how many female vs. male characters there are, what types of words the female characters use as opposed to the male characters, what types of behaviors the female characters display relative to the males, etc. TDM allows researchers to identify and analyze patterns, trends, and relationships across volumes of data that would otherwise be impossible to sift through on a close examination of one book or item at a time. 

Not all TDM research methodologies necessitate the usage of AI systems to extract this information. For instance, as in the “Pride and Prejudice” example above, sometimes TDM can be performed by developing algorithms to detect the frequency of certain words within a corpus, or to parse sentiments based on the proximity of various words to each other. In other cases, though, scholars must employ machine learning techniques to train AI models before the models can make a variety of assessments. 
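To make the non-AI kind of TDM concrete, here is a minimal Python sketch of the two techniques just mentioned: counting how often chosen words occur in a corpus, and counting which words appear near a target word. The sample sentences, the term list, and the function names (tokenize, term_frequencies, near_counts) are invented for illustration; they are not taken from any Library tool or from the novel itself.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def term_frequencies(corpus, terms):
    """Count how often each term of interest appears across all documents."""
    wanted = set(terms)
    counts = Counter()
    for text in corpus:
        counts.update(tok for tok in tokenize(text) if tok in wanted)
    return counts

def near_counts(corpus, target, window=3):
    """Count words appearing within `window` tokens of `target`."""
    neighbors = Counter()
    for text in corpus:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == target:
                start = max(0, i - window)
                context = tokens[start:i] + tokens[i + 1:i + 1 + window]
                neighbors.update(context)
    return neighbors

# Made-up sentences standing in for a real corpus (assumption for illustration).
corpus = [
    "Her pride was wounded, yet pride kept her silent.",
    "His prejudice softened, and little pride remained in his voice.",
]
freq = term_frequencies(corpus, ["pride", "prejudice", "vanity"])
neighbors = near_counts(corpus, "pride", window=2)
print(freq["pride"], freq["prejudice"], freq["vanity"])
print(neighbors.most_common(2))
```

A real project would run the same kind of counting over thousands of full texts, which is exactly why the license terms for automated downloading and mining matter.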

Here is an illustration of the distinction: Imagine a scholar wishes to assess the prevalence with which 20th century fiction authors write about notions of happiness. The scholar likely would compile a corpus of thousands or tens of thousands of works of fiction, and then run a search algorithm across the corpus to detect the occurrence or frequency of words like “happiness,” “joy,” “mirth,” “contentment,” and synonyms and variations thereof. But if a scholar instead wanted to establish the presence of fictional characters who embody or display characteristics of being happy, the scholar would need to employ discriminative modeling (a classification and regression technique) that can train AI to recognize the appearance of happiness by looking for recurring indicia of character psychology, behavior, attitude, conversational tone, demeanor, appearance, and more. This is not using a generative AI system to create new outputs, but rather training a non-generative AI system to predict or detect existing content. And to undertake this type of non-generative AI training, a scholar would need to use a large volume of often copyright-protected works.
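The second half of that illustration, training a non-generative model to detect happy characters, can be sketched in a few lines of Python. This is only a toy under stated assumptions: the four labeled passages are invented, the model is a plain logistic regression over word counts (one simple kind of discriminative model, not necessarily what any given scholar would use), and the names features, predict_proba, and train are hypothetical.

```python
import math
import re
from collections import Counter

def features(text):
    """Bag-of-words features: each word mapped to its count."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def predict_proba(weights, bias, feats):
    """Estimated probability that a passage depicts a happy character."""
    z = bias + sum(weights.get(w, 0.0) * c for w, c in feats.items())
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=200, lr=0.5):
    """Fit logistic regression weights by stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in examples:
            feats = features(text)
            error = label - predict_proba(weights, bias, feats)
            bias += lr * error
            for w, c in feats.items():
                weights[w] = weights.get(w, 0.0) + lr * error * c
    return weights, bias

# Hypothetical hand-labeled passages: 1 = character appears happy, 0 = not.
labeled = [
    ("She laughed and danced across the room", 1),
    ("He smiled warmly and hummed a tune", 1),
    ("She wept alone and would not eat", 0),
    ("He scowled and slammed the door", 0),
]
weights, bias = train(labeled)
p_happy = predict_proba(weights, bias, features("The girl laughed and hummed"))
p_sad = predict_proba(weights, bias, features("The man wept and scowled"))
print(round(p_happy, 2), round(p_sad, 2))
```

The model never produces new text; it only learns which cues predict a label. A useful version would need far more labeled passages than four, which is why this kind of training depends on access to large volumes of works.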

The Copyright Office is studying both of these kinds of AI systems—that is, both generative AI and non-generative AI. They are asking a variety of questions in response to having been contacted by stakeholders across sectors and industries with diverse views about how AI systems should be regulated. Some of the concerns expressed by stakeholders include: 

  • Who is the “author” of generative AI outputs?
  • Should people whose voices or images are used to train generative AI systems have a say in how their voices or images are used? 
  • Should the creator of an AI system (whether generative or non-generative) need permission from copyright holders to use copyright-protected materials in training the AI to predict and detect things?
  • Should copyright owners get to opt out of having their content used to train AI? Should ethics be considered within copyright regulation?

Several of these questions are already the subject of pending litigation. While these questions are being explored by the courts, the Copyright Office wants to understand the entire landscape better as it considers what kinds of AI copyright regulations to enact.

The copyright law and policy landscape underpinning the use of AI models is complex, and whatever regulatory decisions the Copyright Office makes will bear ramifications for global enterprise, innovation, and trade. The Copyright Office’s inquiry thus raises significant and timely legal questions, many of which we are only beginning to understand.

For these reasons, the Library has taken a cautious and narrow approach in its response to the inquiry: we address only two key principles known about fair use and licensing, as these issues bear upon the nonprofit education, research, and scholarship undertaken by scholars who rely on (typically non-generative) AI models. In brief, the Library wants to ensure that (1) scholars’ voices, and those of the academic libraries who support them, are heard to preserve fair use in training AI, and that (2) copyright-protected content remains available for AI training to support nonprofit education and research.

Why the study matters for fair use

Previous court cases like Authors Guild v. HathiTrust, Authors Guild v. Google, and A.V. ex rel. Vanderhye v. iParadigms have addressed fair use in the context of TDM and determined that the reproduction of copyrighted works to create and text mine a collection of copyright-protected works is a fair use. These cases further hold that making derived data, results, abstractions, metadata, or analysis from the copyright-protected corpus available to the public is also fair use, as long as the research methodologies or data distribution processes do not re-express the underlying works to the public in a way that could supplant the market for the originals. Performing all of this work is essential for TDM-reliant research studies.

For the same reasons that the TDM process is fair use of copyrighted works, the training of AI tools to do that TDM should also be fair use, in large part because training does not reproduce or communicate the underlying copyrighted works to the public. Here, there is an important distinction to make between training inputs and outputs, in that the overall fair use of generative AI outputs cannot always be predicted in advance: The mechanics of generative models’ operations suggest that there are limited instances in which generative AI outputs could indeed be substantially similar to (and potentially infringing of) the underlying works used for training; this substantial similarity is possible typically only when a training corpus is rife with numerous copies of the same work. However, the training of AI models by using copyright-protected inputs falls squarely within what courts have determined to be a transformative fair use, especially when that training is for nonprofit educational or research purposes. And it is essential to protect the fair use rights of scholars and researchers to make these uses of copyright-protected works when training AI.

Further, were these fair use rights overridden by limiting AI training access to only “safe” materials (like public domain works or works for which training permission has been granted via license), this would exacerbate bias in the nature of research questions able to be studied and the methodologies available to study them, and amplify the views of an unrepresentative set of creators given the limited types of materials available with which to conduct the studies.

Why access to AI training content should be preserved

For the same reasons, it is important that scholars’ ability to access the underlying content to conduct AI training be preserved. The fair use provision of the Copyright Act does not afford copyright owners a right to opt out of allowing other people to use their works, and for good reason: if content creators were able to opt out, the provision for fair use would be undermined, and little content would be available to build upon for the advancement of science and the useful arts. Accordingly, to the extent that the Copyright Office is considering creating a regulatory right for creators to opt out of having their works included in AI training, it is paramount that such an opt-out provision not be extended to any AI training or activities that constitute fair use, particularly in the nonprofit educational and research contexts.

AI training opt-outs would be a particular threat for research and education because fair use in these contexts is already becoming an out-of-reach luxury even for the wealthiest institutions. Academic libraries are forced to pay significant sums each year to try to preserve fair use rights for campus scholars through the database and electronic content license agreements that libraries sign. In the U.S., the prospect of “contractual override” means that, although fair use is statutorily provided for, private parties (like publishers) may “contract around” fair use by requiring libraries to negotiate for otherwise lawful activities (such as conducting TDM or training AI for research), and often to pay additional fees for the right to conduct these lawful activities on top of the cost of licensing the content itself. When such costs are beyond institutional reach, the publisher or vendor may then offer similar contractual terms directly to research teams, who may feel obliged to agree in order to get access to the content they need. Vendors may charge tens or even hundreds of thousands of dollars for this type of access.

This “pay-to-play” landscape of charging institutions for the opportunity to rely on existing statutory rights is particularly detrimental for TDM research methodologies, because TDM research often requires use of massive datasets with works from many publishers, including copyright owners that cannot be identified or who are unwilling to grant licenses. If the Copyright Office were to enable rightsholders to opt out of having their works fairly used for training AI, then academic institutions and scholars would face even greater hurdles in licensing content for research purposes.

First, it would be operationally difficult for academic publishers and content aggregators to amass and license the “leftover” body of copyrighted works that remain eligible for AI training. Costs associated with publishers’ efforts in compiling “AI-training-eligible” content would be passed along as additional fees charged to academic libraries. In addition, rightsholders might opt out of allowing their work to be used for AI training fair uses, and then turn around and charge AI usage fees to scholars (or libraries)—essentially licensing back fair uses for research. These scenarios would impede scholarship by or for research teams who lack grant or institutional funds to cover these additional expenses; penalize research in or about underfunded disciplines or geographical regions; and result in bias as to the topics and regions studied. 

Scholars need to be able to utilize existing knowledge resources to create new knowledge goods. Congress and the Copyright Office clearly understand the importance of facilitating access and usage rights, having implemented the statutory fair use provision without any exclusions or opt-outs. This status quo should be preserved for fair use AI training—and particularly in the nonprofit educational or research contexts. 

Our office is here to help

No matter what happens with the Copyright Office’s inquiry and any regulations that ultimately may be established, the UCB Library’s Office of Scholarly Communication Services is here to help you. We are a team of copyright law and information policy (licensing, privacy, and ethics) experts who help UC Berkeley scholars navigate legal, ethical, and policy considerations in utilizing resources in their research and teaching. And we are national and international leaders in supporting TDM research, offering online tools, trainings, and individual consultations to support your scholarship. Please feel free to reach out to us with any questions at schol-comm@berkeley.edu.


Making it easier to reuse and share Thérèse Bonney photography

Woman holding a child wrapped in a blanket at Parroquia Del Dulce Nombre de Maria in Doña Carlota, Madrid
Spain: Parroquia Del Dulce Nombre de Maria in Doña Carlota, Madrid.
BANC PIC 1982.111.03.0287–NNEG
Thérèse Bonney, © The Regents of the University of California, The Bancroft Library, University of California, Berkeley. This work is made available under a Creative Commons Attribution 4.0 license.

As part of UC Berkeley Library’s trend-setting efforts to make all our collections easier to use, reuse, and publish from, we are excited to announce that: 

We’ve just eliminated hurdles to the reuse of renowned photographer Thérèse Bonney’s photographs. Every photograph ever taken by Bonney is licensed under a Creative Commons Attribution 4.0 license (CC BY 4.0). This means anyone around the world can incorporate Bonney’s photos into papers, projects, and productions—even commercial ones—without ever getting further permission or another license from us.

Thérèse Bonney Copyright

Thérèse Bonney (1894-1978) was a documentary photographer and war correspondent. She concentrated much of her work on documenting conditions in Europe during World War II. Prior to her work as a war correspondent, Bonney extensively photographed French architecture and design, as well as writers and artists such as Joan Miró, Fernand Léger, and Gertrude Stein. Her photographs have been exhibited at the Museum of Modern Art, the Library of Congress, and Carnegie Hall. 

Bonney transferred copyright to all of her work to the UC Regents to be managed by UC Berkeley Library. This includes Bonney materials at the UC Berkeley Library and any Bonney-authored or Bonney-created materials held by other institutions. 

Although people did not previously need the UC Regents’ permission (sometimes called a “license”) to make fair uses of Bonney’s works because of the progressive permissions policy we created, prior to July 2022 people did need a license to reuse Bonney’s works if their intended use exceeded fair use. As a result, hundreds of book publishers, journals, and film-makers sought licenses from the Library each year to publish Bonney’s photos.

The UC Berkeley Library recognized this as an unnecessary barrier for research and scholarship, and has now exercised its authority on behalf of the UC Regents to freely license Bonney’s entire corpus under CC-BY. This license is designed for maximum dissemination and use of the materials. 

How to use Bonney’s works going forward

Now that all Bonney photographs have a CC-BY license applied to them, no additional permission or license from the UC Regents or anyone else is needed to use Bonney’s work, even if you are using the work for commercial purposes. No fees will be charged, and no paperwork is necessary.

The CC-BY license does require attribution to the copyright owner, which in this case is the UC Regents. The Library suggests the following attribution:

Thérèse Bonney, © The Regents of the University of California, The Bancroft Library, University of California, Berkeley. This work is made available under a Creative Commons Attribution 4.0 license.

What’s ahead for the Library

The Library now has some work to do to make our catalog and other information sources about the Bonney photos reflect this application of the CC-BY licenses. This means we have to update things like the Bonney collection finding aid and the metadata for individual photos in the digital versions of the Bonney photos that we make available online. In the meantime, you can rely on written confirmation that we’ve applied the CC-BY license by consulting the Easy to Use Collections page of our permissions guide.

In the coming year, we hope to add many more collections to that Easy to Use Collections page, too. We’ll be spending some time reviewing materials for which the UC Regents own copyright, and seeing what we can “open up” with other CC BY licenses. Stay tuned.

— — — — —

This post was written by the Library’s Office of Scholarly Communication Services