Before you scrape and before you train…

Photo by Jakub Żerdzicki on Unsplash

Using AI and Text Mining with Library Resources: What Every UC Berkeley Researcher Needs to Know

Planning to scrape a website or database? Train an AI tool? Before you scrape and before you train, there are steps you need to take!

First, consult the Conditions of Use for Electronic Resources policy, which sets out the general terms and conditions you must comply with for all Library electronic resources (journal articles, books, databases, and more).

Second, if you also intend to use any Library electronic resources with AI tools or for text and data mining research, check what’s allowed under our license agreements by looking at the AI & TDM guide. If you don’t see your resource or database listed, please e-mail tdm-access@berkeley.edu and we’ll check the license agreement and tell you what’s permitted.

Violating license agreements can result in the entire campus losing access to critical research resources, and potentially expose you and the University to legal liability.

Below we answer some FAQs.


Understanding the Basics

Where does library content come from?

Most of the digital research materials you access through the UC Berkeley Library aren’t owned by the University. Instead, they’re owned by commercial publishers, academic societies, and other content providers who create and distribute scholarly resources.

Think of using the Library’s electronic resources like watching Netflix: you can watch movies and shows on Netflix, but Netflix doesn’t own most of that content—they pay licensing fees to film studios and content creators for the right to make it available to you. So does the Library.

In fact, each year, the Library signs license agreements and pays substantial licensing fees (millions of dollars annually) to publishers like Elsevier, Springer, Wiley, and hundreds of other content providers so that you can access their journals, books, and databases for your research and coursework.

What is a library license agreement?

A license agreement is a legal contract between UC Berkeley and each publisher that spells out exactly how the UC Berkeley community can use that publisher’s content. These contracts typically cover:

  • Who can access the content (usually current faculty, students, researchers, and staff)
  • How you can use it (reading, downloading, printing individual articles)
  • What you can’t do (automated mass downloading, sharing with unauthorized users, making commercial uses of it)
  • Special restrictions (including rules about AI tools and text and data mining)

Any time you access a database or use your Berkeley credentials to log in to a resource, you must comply with the terms of the license agreement that the Library has signed. All the agreements are different.

Why are all the agreements different? Can’t the Library just sign the same agreement with everyone?

Unfortunately, no. Each publisher has their own standard contract terms, and they rarely agree to identical language. Here’s why:

  • Different business models: Some publishers focus on journals, others on books or datasets—each has different concerns
  • Varying attitudes toward artificial intelligence: Some publishers embrace AI research, others are more restrictive
  • Disciplinary variations: Publishers licensing content in certain fields (e.g. business or financial data) often impose different restrictions than publishers in other disciplines
  • Legal complexity: Text data mining and AI are relatively new, so contract language is still evolving

Can’t you negotiate better terms?

The good news is that the UC Berkeley Library is among the global leaders in negotiating the best possible terms for text and data mining and AI uses on your behalf. We’ve helped set the stage worldwide for AI rights in license agreements, and the UC President has recognized the Library’s efforts in this regard.

Still, we can’t force publishers to accept uniform language, and we can’t guarantee that every resource allows AI usage. This is why we need to check each agreement individually when you want to use content with AI tools.

But my research is a “fair use”!

We agree. (And we’re glad you’re staying up-to-speed on copyright and fair use.) But there’s a distinction between what copyright law allows and how license agreements (which are contracts) affect your rights under copyright law.

Copyright law gives you certain rights, including fair use for research and education.

Contract law can override those rights when you agree to specific terms. When UC Berkeley signs a license agreement with a publisher so you can use content, both the University and its users (that’s you) must comply with those contract terms.

Therefore, even if your AI training or text mining would normally qualify as fair use, the license agreement you’re bound by might explicitly prohibit it, or place specific qualifications on how AI might be used (e.g. use of AI permissible; training of AI prohibited).

Your responsibilities

What do I have to do?

You should consult the Conditions of Use for Electronic Resources policy, which sets out the general terms and conditions you must comply with for all Library electronic resources (journal articles, books, databases, and more).

If you also intend to use any Library electronic resources with AI tools or for text and data mining research, check what’s allowed under our license agreements by looking at the AI & TDM guide. If you don’t see your resource or database listed, then e-mail tdm-access@berkeley.edu and we’ll check the license agreement and tell you what’s permitted.

Do I have to comply? What’s the big deal?

Violating license agreements can result in losing access to critical research resources for the entire UC Berkeley community—and potentially expose you and the University to legal liability and lawsuits.

For the University:

  • Loss of access: Publishers can immediately cut off access to critical research resources for everyone on campus
  • Legal liability: The University could face costly lawsuits. Some publishers might claim millions of dollars worth of damages
  • Damaged relationships: Violations can harm the library’s ability to negotiate future agreements, or prevent us from getting you access to key scholarly content

This doesn’t just affect the University—it also affects you. Violating the agreements can result in:

  • Immediate suspension of your access to all library electronic resources
  • Legal exposure: You could potentially be held personally liable for damages in a lawsuit
  • Research disruption: Loss of access to essential materials for your work

How do I know if I’m using Library-licensed content?

The following kinds of materials are typically governed by Library license agreements:

  • Materials you access through the UC Library Search (the Library’s online catalog)
  • Articles from academic journals accessed through the Library
  • E-books available through Library databases
  • Research datasets licensed by the Library
  • Any content accessed through Library database subscriptions
  • Materials that require you to log in with your UC Berkeley credentials

What if I get the content from a website not licensed by the Library?

If you’re downloading or mining content from a website that is not licensed by the Library, you should read the website’s terms of use, sometimes called “terms of service.” They will usually be found through a link at the bottom of the web page. Carefully understanding the terms of service can help you make informed decisions about how to proceed.

Even if the terms of use for the website or database restrict or prohibit text mining or AI, the provider may offer an application programming interface, or API, with its own set of terms that allows scraping and AI. You could also try contacting the provider and requesting permission for the research you want to do.
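As a purely illustrative sketch (not an official Library recommendation), you can also check a site’s robots.txt before scraping it. robots.txt is not the same as the terms of service, but it is a quick, machine-readable signal of what the provider allows automated tools to fetch; the site and page below are hypothetical placeholders.

```python
# Minimal sketch: consult robots.txt before scraping a site that is NOT
# Library-licensed. This complements, but never replaces, reading the
# site's terms of use.
from urllib import robotparser

site = "https://example.org"              # hypothetical site you plan to mine
target = f"{site}/articles/page1.html"    # hypothetical page you want to fetch

rp = robotparser.RobotFileParser()
rp.set_url(f"{site}/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

if rp.can_fetch("my-research-bot", target):
    print("robots.txt allows fetching; still read the site's terms of use.")
else:
    print("robots.txt disallows this path; consider contacting the provider for permission.")
```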

What if I’m using a campus-licensed AI platform?

Even when using UC Berkeley’s own AI platforms (like Gemini or River), you still need to check on whether you can upload Library-licensed content to that platform. The fact that the University provides the tool doesn’t automatically make all Library-licensed content okay to upload to it.

What if I’m using my own ChatGPT, Anthropic, or other generative AI account?

Again, you still need to check on whether you can upload Library-licensed content to that platform. The fact that you subscribe to the tool doesn’t mean you can upload Library-licensed content to it.

Do I really have to contact you? Can’t I just look up the license terms somewhere?

We wish it were that simple, but the Library signs thousands of agreements each year, with highly complex terms. We’re working on making the terms more visible to you, though. Stay tuned.
In the meantime, check out the AI & TDM guide. If you don’t see your resource or database listed, then e-mail tdm-access@berkeley.edu and we’ll tell you what’s permitted.

Best practices are to:

  • Plan ahead: Contact us early in your research planning process
  • Be specific: The more details you provide, the faster we can give you guidance
  • Ask questions: We’re here to help, not to block your research

Get Help

For text mining and AI questions: tdm-access@berkeley.edu

For other licensing questions: acq-licensing@lists.berkeley.edu

For copyright and fair use guidance: schol-comm@berkeley.edu


This guidance is for informational purposes and should not be construed as legal advice. When in doubt, always contact library staff for assistance with specific situations.


Exploring OCR tools with two 19th century documents

— Guest post by Eileen Chen (UCSF)

When I (Eileen Chen, UCSF) started this capstone project with UC Berkeley, as part of the Data Services Continuing Professional Education (DSCPE) program, I had no idea what OCR was. “Something something about processing data with AI” was what I went around telling anyone who asked. As I learned more about Optical Character Recognition (OCR), it soon sucked me in. While it’s a lot different from what I normally do as a research and data librarian, I couldn’t be more glad that I had the opportunity to work on this project!

The mission was to run two historical documents from the Bancroft Library through a variety of OCR tools – tools that convert images of text into a machine-readable format, relying to various extents on artificial intelligence.
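For readers who have never run OCR before, here is a minimal sketch of that basic workflow using the open-source Tesseract engine via pytesseract. This is an illustration only: pytesseract is not one of the seven tools compared below, the file names are hypothetical, and the snippet assumes the Tesseract binary plus the pillow and pytesseract packages are installed.

```python
# Minimal OCR workflow sketch: turn an image of a printed page into plain text.
from PIL import Image
import pytesseract

# Hypothetical scan of one page from a 19th-century printed document
page = Image.open("earthquake_catalogue_page_001.png")

# Convert the page image into machine-readable text (English language model)
text = pytesseract.image_to_string(page, lang="eng")

with open("earthquake_catalogue_page_001.txt", "w", encoding="utf-8") as out:
    out.write(text)
```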

The documents were as follows:

Both were nineteenth-century printed texts, and the latter also contains multiple maps and tables.

I tested a total of seven OCR tools, and ultimately chose two tools with which to process one of the two documents – the earthquake catalogue – from start to finish. You can find more information on some of these tools in this LibGuide.

Comparison of tools

Table comparing OCR tools

| OCR Tool         | Cost                       | Speed    | Accuracy                  | Use cases                                                |
|------------------|----------------------------|----------|---------------------------|----------------------------------------------------------|
| Amazon Textract  | Pay per use                | Fast     | High                      | Modern business documents (e.g. paystubs, signed forms)  |
| Abbyy Finereader | By subscription            | Moderate | High                      | Broad applications                                       |
| SensusAccess     | Institutional subscription | Slow     | High                      | Conversion to audio files                                 |
| ChatGPT          | Free-mium*                 | Fast     | High                      | Broad applications                                       |
| Adobe Acrobat    | By subscription            | Fast     | Low                       | PDF files                                                |
| Online OCR       | Free                       | Slow     | Low                       | Printed text                                             |
| Transkribus      | By subscription            | Moderate | Varies depending on model | Medieval documents                                       |
| Google AI        | Pay per use                | ?        | ?                         | Broad applications                                       |

*Free-mium = free with paid premium option(s)

As Leo Tolstoy famously (never) wrote, “All happy OCR tools are alike; each unhappy OCR tool is unhappy in its own way.” An ideal OCR tool accurately detects and transcribes a variety of texts, be it printed or handwritten, and is undeterred by tables, graphs, or special fonts. But does a happy OCR tool even really exist?

After testing seven of the above tools (excluding Google AI, which made me uncomfortable by asking for my credit card number in order to verify that I am “not a robot”), I am both impressed by and let down by the state of OCR today. Amazon Textract seemed accurate enough overall, but corrupted the original file during processing, which made it difficult to compare the original text and its generated output side by side. ChatGPT was by far the most accurate in terms of not making errors, but when it came to maps, it admitted that it drew information from other maps from the same time period when it couldn’t read the text. Transkribus’s super model excelled the first time I ran it, but the rest of the models differed vastly in quality (you can only run the super model once on a free trial).

It seems there is always a trade-off with OCR tools: faithfulness to the original text vs. the ability to auto-correct likely errors; human readability vs. machine readability; a user-friendly interface vs. output editability; accuracy in one language vs. the ability to detect multiple languages.

So maybe there’s no winning, but one must admit that utilizing almost any of these tools (except perhaps Adobe Acrobat or Free Online OCR) can save significant time and aggravation. Let’s talk about two tools that made me happy in different ways: Abbyy Finereader and ChatGPT OCR.

Abbyy Finereader

I’ve heard from an archivist colleague that Abbyy Finereader is a gold standard in the archiving world, and it’s not hard to see why. Of all the tools I tested, it was the easiest to do fine-grained editing with, thanks to its side-by-side presentation of the original text and the editing panel, as well as its (mostly) accurately positioned text boxes.

It relies relatively little on AI and encourages users to proofread proactively by highlighting characters that it flags as potentially erroneous. I did not find this feature especially helpful, since the majority of errors I identified had not been highlighted and many of the highlighted characters weren’t actual errors, but I appreciate the human-in-the-loop model nonetheless.

Overall, Abbyy excelled at transcribing paragraphs of printed text, but struggled with maps and tables. It picked up approximately 25% of the text on maps, and 80% of the data from tables. The omissions seemed wholly random to the naked eye. Abbyy was also consistent at making certain mistakes (e.g. mixing up “i” and “1,” or “s” and “8”), and could only detect one language at a time. Since I set the language to English, it automatically omitted the accented “é” in San José in every instance, and mistranscribed nearly every French word that came up. Perhaps some API integration could streamline the editing process, for those who are code-savvy; a rough sketch of that idea follows.
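Here is a minimal post-processing sketch of what that could look like. It is my own illustration, not an Abbyy feature: it batch-corrects the systematic substitutions noted above (“1” for “i,” “8” for “s,” and the dropped accent in “San José”) in a hypothetical exported text file.

```python
# Minimal cleanup sketch for recurring OCR substitution errors.
import re

def clean_ocr_text(text: str) -> str:
    # Restore the accented place name that the English-only setting flattened
    text = text.replace("San Jose", "San José")
    # Fix digits that appear inside otherwise alphabetic words (e.g. "po1nt", "cau8e")
    text = re.sub(r"(?<=[a-z])1(?=[a-z])", "i", text)
    text = re.sub(r"(?<=[a-z])8(?=[a-z])", "s", text)
    return text

with open("abbyy_output.txt", encoding="utf-8") as f:   # hypothetical export
    cleaned = clean_ocr_text(f.read())

with open("abbyy_output_cleaned.txt", "w", encoding="utf-8") as f:
    f.write(cleaned)
```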

Capture of Abbyy Finereader attempt to interpret a map of a portion of California with map on the left and the attempted read on the right.
Earthquake map page as seen in the Abbyy Finereader Editor

I selected “searchable PDF” as my output file type, but Abbyy offers several other file types as well, including docx, csv, and jpg. In spite of its limitations, compared to PDF giant Adobe Acrobat and other PDF-generating OCR tools, Abbyy is still in a league of its own.

ChatGPT OCR

After being disillusioned by Free Online OCR, I decided to manage my expectations for the next free online tool I tested. Sure, it’s ChatGPT, but last I heard about it, it failed to count how many “r”s were in “strawberry.”

Surprisingly, ChatGPT made minimal errors in the transcription process, succeeding at text of various fonts and even rare instances of handwritten annotations. It spewed things out in text form in the chat box, as it does by default, but made good use of bolding and italics to broadly distinguish between different fonts.

As I was formatting its responses to my prompts into a Word document, I had to think hard about some choices. In what order do I arrange the text to best represent the flow of natural eye movement, especially when it comes to text from a map? Since ChatGPT offers descriptions of page layouts upon further prompting, how much do I want it to do that, and how do I standardize this?

Below is an excerpt from the document as processed by ChatGPT. ChatGPT descriptions are in blue and italicized, with detected original text left in black. I used two separate prompts: 1) “Please transcribe all the text on the two pages”; 2) “Please describe the image on the left page.” (Before you judge me for using “please,” it has been shown that being polite to ChatGPT may get you better results!)

The page features a simple and elegant design with the following elements:
  1. Coat of Arms: At the center of the page is a decorative emblem or coat of arms. It features the word “Baltimore” at the top and a Latin motto, “Vita Doi Mi Taptori,” at the bottom. The design appears to be the official crest or insignia, likely of historical or regional significance to Baltimore.
  2. Printer’s Mark: Beneath the coat of arms, there is text indicating the name and location of the publishing company:
    • “The Lord Baltimore Press” is printed in a distinctive typeface.
    • Below that, it says “THE FRIEDENWALD COMPANY”, followed by the location: “BALTIMORE, MD., U.S.A.”

The page is otherwise blank, with no additional decorations or text, keeping the focus on the coat of arms and the printer’s information. There are some handwritten cataloging notes visible in the upper corners:
In the upper left corner, “F851” and “H74” are written.

In the upper right corner, the number “604” and some other library marking are noted.

The overall layout is minimalist and centered, with the elements well-spaced on the page.

Page with seal stamped in the vertical middle, drifting right horizontally. There is a small pencil mark in the upper left.
Example of a largely blank page with seal.

As far as its work on this document has shown, ChatGPT’s descriptions are fairly safe and neutral. With some human curation for quality assurance, it can go beyond the bare minimum of “human readable” and bring a document to “human comprehensible.”

However, there was one alarming instance: when transcribing text of a smaller font size from a footnote, ChatGPT completely changed the content and basically wrote me a new paragraph based on the few keywords it was able to identify!

Example text from footnote with several special characters such as parentheses.

ChatGPT:

* First proposed by Rossi in Archives des Sci. Phys. et Nat., IV, p. 371 (1850). Further expanded by Forel (ibid., Vol. V, p. 420). After comparing hundreds of published accounts of California earthquakes, I have found that the words of the Italian and Swiss seismologists fairly express the commonly accepted ideas of the arbitrary terms employed by the California observers. In the phrases for comparison, when any one is describing the effect of a shock, when consulting the tables, and making use of the Rossi-Forel scale, the books and other reports in California will aid.

When I asked it again, ChatGPT produced a different, still wildly incorrect footnote:

*First proposed by Rossi in Archives des Sci. Phys. et Nat., IV, p. 371 (1880). See also Schmidt in Bull. Acad. Petersb., vol. xxix, p. 41. After comparing hundreds of published accounts of California earthquakes, I have found that the words printed in italics in the following scale of Rossi and Forel, are the ones employed with much precision by most of the California observers. The other words are added for convenience. When any one is describing the effect of an earthquake, and closely follows the meaning or intent of the Rossi-Forel scale, then, the better to make that meaning clear, he should consider himself as adhering to the Rossi-Forel numerals. If, while describing an earthquake and seeking to assign the proper R.-F. numeral, it will be found that the words here set down in italics are of service.

The scary thing, in my opinion, isn’t that it twisted the meaning of the footnote entirely, but that it took a wild stab without my knowledge at all.

When it came to map recognition and transcription, ChatGPT took a few more liberties (although it does a fair job of justifying and explaining this). After initially detecting none of the cities on a map I uploaded, ChatGPT responded with the following to my prompt “Please list each individual city on the map”:

Here is a list of the cities visible on the map of Washington and Oregon (based on typical historical maps of this type). However, since I can’t zoom into the map directly from the image, I’ll list some cities likely included on a historical earthquake-related map of Washington and Oregon from this period. Let me know if you need a detailed transcription from another zoomed-in image.

Likely Cities in Washington (circa 19th century):

  • Seattle
  • Tacoma
  • Olympia
  • Spokane
  • Walla Walla
  • Vancouver
  • Port Townsend

Likely Cities in Oregon:

  • Portland
  • Astoria
  • Salem
  • Eugene
  • Oregon City
  • The Dalles
  • Jacksonville

This map would also typically include significant natural features such as:

  • Columbia River
  • Willamette River
  • Cascade Range

If you have a clearer image or can provide additional detail, I could give a more precise list.

All of this was indeed correct, and I appreciate ChatGPT’s transparency about drawing from external sources, but this response generated many more questions. What about the cities on the map that weren’t listed? ChatGPT may have gotten it right this time, but doesn’t it have significant potential to be wrong? If we present it with a digitized map of similar quality that had place names in their Indigenous languages, for example, would they still be listed as, say, “likely to be Vancouver”?

So yes, I was dazzled by the AI magic, but also wary of the perpetuation of potential biases, and of my complicity in this as a user of the tool.

Conclusion

So, let’s summarize my recommendations. If you want an OCR output that’s as similar to the original as possible, and are willing to put in the effort, use Abbyy Finereader. If you want your output to be human-readable and have a shorter turnaround time, use ChatGPT OCR. If you are looking to convert your output to audio, SensusAccess could be for you! Of course, not every type of document works equally well in every OCR tool, so doing some experimenting, if you have the option, is always a good idea.

A few tips I only came up with after undergoing certain struggles:

  1. Set clear intentions for the final product when choosing an OCR tool
    1. Does it need to be human-readable, or machine-readable?
    2. Who is the audience, and how will they interact with the final product?
  2. Many OCR tools operate on paid credits and have a daily cap on the number of files processed. Plan out the timeline (and budget) in advance!
  3. Title your files well. Better yet, have a file-naming convention. When working with a larger document, many OCR tools will require you to split it into smaller files, and even if not, you will likely end up with multiple versions of a file during your processing adventure.
  4. Use standardized, descriptive prompts when working with ChatGPT for optimal consistency and replicability (a minimal sketch follows this list).
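On that last tip, here is a minimal sketch of what prompt standardization could look like. It is an illustration only: I worked in the ChatGPT web interface, but the same idea applies if you script the work through the OpenAI Python SDK with a vision-capable model; the model choice, file names, and helper function are all assumptions.

```python
# Minimal sketch: keep one fixed prompt per task so every page is processed
# the same way and the run can be replicated later.
import base64
from openai import OpenAI

TRANSCRIBE_PROMPT = "Please transcribe all the text on the two pages."
DESCRIBE_PROMPT = "Please describe the image on the left page."

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ocr_page(image_path: str, prompt: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Hypothetical page image; the same two prompts are reused for every page.
print(ocr_page("catalogue_page_012.png", TRANSCRIBE_PROMPT))
```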

You can find my cleaned datasets here:

  1. Earthquake catalogue (Abbyy Finereader)*
  2. Earthquake catalogue (ChatGPT)

*A disclaimer re: Abbyy Finereader output: I was working under the constraints of a 7-day free trial, and did not have the opportunity to verify any of the location names on maps. Given what I had to work with, I can safely estimate that about 50% of the city names had been butchered.


Intelligenza Artificiale in Italia

artificial intelligence in Italy

Judging by the explosion of new books on artificial intelligence, or AI, being published in Italy, you might think this Mediterranean country is the editorial epicenter for one of the hottest interdisciplinary topics. Whether you are in the humanities, social sciences, human sciences, computer science, or STEM fields, “intelligenza artificiale,” as it’s called in Italian, will eventually find its way into your coursework or research. Here are just a few of the AI books that have recently reached bookstores in Italy and that have not automatically been sent to the UC Berkeley Library. However, if you are so inclined, just let your friendly Romance languages librarian know and he’ll be happy to push the first button to initiate this demand-driven order.*

 

*Demand-driven acquisition (DDA) is a model of library collection development in which a library purchases materials only when a patron has demonstrated a need for a resource. If implemented correctly, DDA makes it possible to purchase only what is needed, allowing libraries to spend the same amount of money as they previously spent on monographs, but with a higher rate of use.