UC Berkeley Library and Internet Archive co-directing project to help text data mining researchers navigate cross-border legal and ethical issues

We are excited to announce that the National Endowment for the Humanities (NEH) has awarded nearly $50,000 to UC Berkeley Library and Internet Archive to study legal and ethical issues in cross-border text data mining. The funding was made possible through NEH’s Digital Humanities Advancement Grant program.

NEH funding for the project, entitled Legal Literacies for Text Data Mining – Cross Border (“LLTDM-X”), will support research and analysis to address law and policy issues faced by U.S. digital humanities practitioners whose text data mining research and practice intersects with foreign-held or -licensed content, or involves international research collaborations. 

LLTDM-X builds upon the highly successful Building Legal Literacies for Text Data Mining Institute (Building LLTDM), previously funded by the NEH in 2019. UC Berkeley Library directed Building LLTDM in June 2020, bringing together expert faculty from across the country to train 32 digital humanities researchers on how to navigate law, policy, ethics, and risk within text data mining projects. (All of the results and impacts are summarized in the white paper here.) 

In Building LLTDM’s instructional sessions and post-workshop evaluations, participants identified cross-border research collaborations as an ongoing and critical legal and policy problem, and they also noted that foreign law and ethics issues pervaded their research. UC Berkeley Library’s Office of Scholarly Communication Services partnered with Internet Archive to begin to address these essential needs, and LLTDM-X sprang to life.

Why is LLTDM-X needed?

Text data mining, or TDM, is an increasingly essential and widespread research approach. TDM relies on automated techniques and algorithms to extract revelatory information from large sets of unstructured or thinly structured digital content. These methodologies allow scholars to identify and analyze critical social, scientific, and literary patterns, trends, and relationships across volumes of data that would otherwise be impossible to sift through.

While TDM methodologies offer great potential, they also present scholars with nettlesome law and policy challenges that can prevent them from understanding how to move forward with their research. Building LLTDM trained TDM researchers and professionals on essential principles of copyright, licensing, and privacy law, as well as ethics—thereby helping them move forward with impactful digital humanities research.

As Building LLTDM revealed, United States digital humanities scholars do not conduct text data mining research only in or about the U.S. Further, digital humanities research in particular is marked by collaboration across institutions and geographical boundaries. Yet, U.S. practitioners encounter expanding and increasingly complex cross-border problems. 

For example, U.S. contract law may supersede rights under copyright, such that a U.S. database license agreement may prohibit text data mining and other fair uses, whereas UK licenses cannot. Therefore, U.S. TDM practitioners collaborating with UK-based colleagues face consequential choices about which agreements to apply, as this may determine whether text data mining is permitted. In the U.S., “breaking” technological protection measures to conduct text data mining is now authorized within certain parameters, yet other jurisdictions prohibit such work or apply different conditions. U.S. text data mining researchers must accordingly consider how they work with internationally held or licensed materials or collaborators.

There are at least three such “cross-border” TDM scenarios that scholars must parse: (i) the materials they want to mine are housed in a foreign jurisdiction, or are otherwise subject to foreign database licensing or laws; (ii) the human subjects they are studying, or who created the underlying content, reside in another country; or (iii) the colleagues with whom they are collaborating reside abroad, yielding uncertainty about which country’s laws, agreements, and policies apply.

U.S. researchers are uncertain about how to navigate each of these scenarios. As evidenced in an informal survey that we conducted with digital humanities scholars, 70% of respondents reported cross-border copyright questions, 72% reported uncertainty about cross-border licensing terms, 52% noted privacy issues, and 48% identified ethical concerns. This confusion greatly impacted their TDM research. Twenty-eight percent (28%) of respondents confirmed that these cross-border copyright, licensing, privacy, or ethical issues impeded or prevented their project entirely. Of equal concern is that 40% of responding practitioners reported hesitation to share their workflows, methodology, or sources because of possible cross-border LLTDM issues. Without transparency, findings are deemed unreliable and scholarship may be rejected for publication. These problems will only mount given the increasing collaborativeness of research and the substantial amount of cross-border research occurring.

How will LLTDM-X help the world? 

Our long-term goal is to design instructional materials and institutes to support digital humanities TDM scholars facing cross-border issues, but our first step with LLTDM-X is getting a better handle on the specific law and policy challenges they face.

Through a series of virtual roundtable discussions, and accompanying legal research and analysis, LLTDM-X will surface these cross-border issues and begin to distill preliminary guidance to help scholars in navigating them. 

The first roundtable will engage U.S. digital humanities text data mining practitioners in sharing their cross-border TDM experiences. U.S. and global law and ethics experts will help guide the roundtable discussion to elicit the contours of practitioner experiences. During two subsequent roundtables—one focusing on cross-border copyright and licensing, and another on cross-border privacy and ethics—the experts will discuss practitioners’ hurdles in depth, and begin to develop customized guidance. 

After the roundtables, we will work with the law and ethics experts to create instructive case studies that reflect the types of cross-border TDM issues practitioners encountered. These case studies will incorporate recommendations to help a broad audience of U.S. digital humanities text data mining practitioners navigate LLTDM-X concerns. Case studies, guidance, and recommendations will be widely disseminated via an open access report to be published at the completion of the project. Most importantly, they will be used to inform our future educational offerings.

An experienced team

The team for LLTDM-X (introduced below) is eager to get started. The project is co-directed by Thomas Padilla, Deputy Director, Archiving and Data Services at Internet Archive. 

“LLTDM-X responds strategically to a pervasive challenge that needlessly complicates, inhibits, and weakens the fullest potential of research. This work paves a critical path toward building future training institutes that address cross-border legal issues in TDM. At Internet Archive we’re committed to supporting universal access to all knowledge; LLTDM-X couldn’t be more clearly aligned with what we hope to achieve. We look forward to working with our partners at UC Berkeley Library and the wider community to advance this work.”

Rachael Samberg, who leads UC Berkeley Library’s Office of Scholarly Communication Services and oversaw Building LLTDM, joins Thomas as co-director and explains that: 

“We are ready to begin analyzing and sorting out the complex legal challenges for digital humanities TDM researchers. We’ve already secured an incredible group of international legal and ethics experts to conduct the analyses, and will share more on that soon. In the meantime, we are gearing up to build out an even larger group of participating scholars whose experiences will help us create case studies.”

On behalf of the entire project team, we would like to thank NEH’s Office of Digital Humanities again for funding this important work. We invite you to contact us with any questions you may have. 

Thomas Padilla (Project Director): Thomas is Deputy Director, Archiving and Data Services at Internet Archive, and has deep experience cultivating library, archive, and museum ability to support TDM research. He has previously served as Principal Investigator of the Andrew W. Mellon Foundation-supported Collections as Data: Part to Whole and the Institute of Museum and Library Services-supported Always Already Computational: Collections as Data, and as author of the library community research agenda Responsible Operations: Data Science, Machine Learning, and AI in Libraries. In addition, Padilla was an expert faculty member for Building LLTDM, the precursor to LLTDM-X.

Rachael Samberg (Project Co-Director): Rachael is Scholarly Communication Officer & Program Director of the University of California, Berkeley Library’s Office of Scholarly Communication Services. She served as Project Director and legal expert for Building LLTDM. A Duke Law graduate, Rachael practiced intellectual property litigation at Fenwick & West LLP for seven years before spending six years at Stanford Law School’s library, where she was Head of Reference & Instructional Services and a Lecturer in Law. Rachael speaks throughout the country about copyright and TDM issues, about which she is widely published. Her chapter, Law & Literacy in Non-Consumptive Text Mining, was published in Copyright Conversations (ALA, 2019).

Stacy Reardon (Project Team Member): Stacy Reardon is Literatures and Digital Humanities Librarian at the University of California, Berkeley Library, where she provides guidance and instruction on digital humanities projects and methods. Stacy served as a library expert on the Project Team for the NEH-funded Building Legal Literacies for Text Data Mining. She is co-chair of UC Berkeley’s Digital Humanities Working Group, and received her Ph.D. in literature from the University of Massachusetts, Amherst.

Timothy Vollmer (Project Manager): Timothy Vollmer is Scholarly Communication and Copyright Librarian at UC Berkeley Library. He served as Project Manager for the NEH-funded Building Legal Literacies for Text Data Mining. Tim worked as a senior public policy manager for Creative Commons, and contributed to writing and advocacy on the text data mining exceptions in the EU’s Directive on Copyright in the Digital Single Market. He formerly was the Assistant Director to the Program on Public Access to Information at the American Library Association.


A Library Research Journey (Pandemic Edition)

Screenshot of team members
Association of College and Research Libraries conference poster–screenshot of recorded talk

Even beyond those who believe that librarians sit around and read books all day (which would be delightful but is most definitely not our reality), many are surprised to learn that librarians double as active researchers. This is especially true in settings where librarians are members of the faculty, but even where that isn’t the case, such as at Berkeley, librarians are born investigators, and that carries over into wanting to learn about, and add to knowledge of, our own settings.

What does it look like to conduct library research?  Glad you asked! In our case, it started with a conversation and an idea.  Natalia Estrada (now Berkeley’s Political Science and Public Policy Librarian, then the Social Sciences Collection and Reference Assistant and in library school) and I were talking about how much we admired the work of Kaetrena Davis Kendrick.  Kendrick wrote a foundational work in the study of librarian workplace morale, The Low Morale Experience of Academic Librarians: A Phenomenological Study, and it sparked many more studies on this topic.  But, where were the studies of library staff experiences?  We wanted to find out!

We were lucky to recruit two colleagues who added so much to the team: Bonita Dyess, Circulation/Reserves Supervisor at the Earth Sciences/Map Library, and Celia Emmelhainz, Berkeley’s Anthropology & Qualitative Research Librarian.  First we applied for (and eventually got) funding for the research from LAUC (the Librarians Association of the University of California).  This meant we could pay for transcribing our interviews, give the participants gift cards, and buy qualitative data analysis software.  Then we applied for (and got) approval from the IRB (Institutional Review Board), making sure we were complying with processes for research with human subjects.

Here’s where the “pandemic edition” part comes in. All this planning and applying, starting in November 2019, took time; so, at the point we were actually ready to recruit participants, it was April 2020. We were sheltering in place, and not sure how this all would work (although it was probably better than having to go virtual in mid-stream)! Nevertheless, we hurled out information about the study, and invitations to be part of it, to every listserv, association, and friendly librarian we could think of, nationwide. We ended up doing 34 interviews with academic library staff from a range of locations and institution types (purposefully excluding the UC system), during a three-week period in May-June 2020. Due to COVID these were all remote, either by phone or Google Meet (sort of like Zoom), and we asked a structured list of questions, with room for branching into other topics, or diving deeply. Celia trained a wonderful student to transcribe the interviews, and once we had those transcripts and stripped identifying information from them, we were off, coding away (using MAXQDA software), and drawing themes, quotes, recommendations, and other findings from the surprisingly rich information we’d collected.

Next—we had to start getting the information out into the world!  Our eventual goal is to write a paper, or several, for publication.  There are a number of library and information science journals out there that we are considering… but that takes time as well, and we wanted to start presenting our findings sooner.  So, we did an “initial findings” presentation to the UC Berkeley Library Research Working Group, and then stepped into the big time with acceptance to present a poster at the 2021 Association of College and Research Libraries online conference (our poster got almost 600 views), and with a webinar we did for the Pennsylvania Library Association (both the poster and the webinar slides are available through the UC’s eScholarship portal).  All our work to get to this point is hopefully now helping others.

Screenshot of title slide of PA Library Association webinar

And, a word about connecting with our participants.  We were bowled over by their generosity with us and by all they had to say: much that we didn’t expect, and much that they were grateful someone was even asking about.  It ended up that we had captured one of the last opportunities to get a snapshot of pre-COVID library staff life; people were still in limbo, and talked about their regular jobs before any lockdowns, for the most part. At that point most expected to be back in their libraries and all to be normal by the end of the summer 2020.  We know now that that didn’t happen, and we know that library re-openings and staff roles in them have been challenging and sometimes contentious; we wish we’d known to ask for permission to re-interview our participants—even if only to check in with them.  But how could we have known?  We wonder how they are.

So now we have papers to write, and thinking to do about how to take our questions into new avenues of research. It’s a never-ending and completely exciting process and, we suspect, one that will be very different (easier? or not?) in the post-COVID landscape. Do you have ideas for us? We’d love to hear them! Or do you want to hear more about our morale study? Please get in touch with us at librarystaffmorale@berkeley.edu!