Bancroft to Explore Text Analysis as Aid in Analyzing, Processing, and Providing Access to Text-based Archival Collections

Mary W. Elings, Head of Digital Collections, The Bancroft Library

The Bancroft Library recently began testing a theory discussed at the Radcliffe Workshop on Technology & Archival Processing, held at Harvard’s Radcliffe College in early April 2014. The theory holds that archives can use text analysis tools and topic modelling (a type of statistical model for discovering the abstract “topics” that occur in a collection of documents) to help analyze, process, and describe text-based archival collections, and to improve access to them.

To help us test this theory, the Bancroft welcomed summer intern Janine Heiser from the UC Berkeley School of Information. Over the summer, supported by an ISchool Summer Non-profit Internship Grant, Ms. Heiser worked with digitized analog archival materials to answer specific research questions and define use cases that will help us determine whether text analysis and topic modelling are viable technologies for our archival work. Based on that work, the Bancroft has awarded Ms. Heiser an Archival Technologies Fellowship for 2015 so that she can continue developing and testing what she began over the summer.

During her summer internship, Ms. Heiser created a web-based application called “ArchExtract” that extracts topics and named entities (people, places, subjects, dates, etc.) from a given collection. The application implements and extends natural language processing tools such as MALLET and the Stanford CoreNLP toolkit. To test and refine it, Ms. Heiser used collections with an existing catalog record and/or finding aid, namely the John Muir Correspondence collection, which was digitized in 2009.

For a given collection, an archivist can compare the topics and named entities that ArchExtract outputs with those found in the extant descriptive information, looking at the similarities and differences between the two to gauge ArchExtract’s accuracy. The application can then be improved and refined based on that evaluation.
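As a rough illustration of that kind of comparison, here is a minimal sketch in Python. It is not ArchExtract’s code; the function names and the sample terms are hypothetical, and it assumes both the extracted terms and the finding-aid terms are available as simple lists of strings. It reports which terms appear in both lists, which appear in only one, and a Jaccard overlap score.

```python
def normalize(term):
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(term.lower().split())


def compare_terms(extracted, described):
    """Compare extracted terms against terms from an existing description.

    Both arguments are lists of strings (hypothetical inputs; a real
    workflow would likely need fuzzier matching than exact equality).
    """
    extracted_set = {normalize(t) for t in extracted}
    described_set = {normalize(t) for t in described}
    shared = extracted_set & described_set
    union = extracted_set | described_set
    return {
        "shared": sorted(shared),
        "only_extracted": sorted(extracted_set - described_set),
        "only_described": sorted(described_set - extracted_set),
        # Jaccard overlap: size of intersection divided by size of union
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Illustrative terms only, not actual ArchExtract output
report = compare_terms(
    ["John Muir", "Yosemite", "glaciers"],
    ["john muir", "Yosemite ", "Sierra Club"],
)
```

Terms that appear only in the extracted list may point to content the existing description misses, or to extraction errors; an archivist would still need to review them by hand.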

Ms. Heiser also worked with collections that have minimal or no extant description in order to explore the theory further as we continue testing the tool. Working with Bancroft archivists, Ms. Heiser will determine where the web application succeeds, where it falls short, and what the next steps might be in exploring this and other text analysis tools to aid in processing collections.

The hope is that libraries and archives can use automated text analysis to readily identify the major topics in a collection, and potentially the named entities found in the text and their frequency, giving archivists a good understanding of a collection’s scope and content before it is processed. This could help in identifying processing priorities and funding opportunities and, ultimately, in helping users identify what is in the collection.

Ms. Heiser is a second-year master’s student at the UC Berkeley School of Information, where she is learning the theory and practice of storing, retrieving, and analyzing digital information in a variety of contexts and is currently taking coursework in natural language processing with Marti Hearst. Prior to the ISchool, Ms. Heiser worked at several companies where she helped develop database systems and software for political parties, non-profit organizations, and an online music distributor. In her free time, she likes to go running and hiking around the Bay Area. Ms. Heiser was also one of our participants in the #HackFSM hackathon! She was awarded an ISchool Summer Non-profit Internship Grant to support her work at Bancroft this summer and has been awarded an Archival Technologies Fellowship at Bancroft for 2015.


Vendor visit: Project MUSE

A representative from Project MUSE will be visiting on Thursday, October 24 from 10:00am to 11:00am. He is interested in “discussing [our] eBooks policies and finding out ways to make the Project MUSE eBooks available to [our] community.”

More information on Project MUSE ebook content is available at: http://muse.jhu.edu/about/order/book_title_lists.html

We will meet in the 212/218 Doe conference area. This room is keycode access so just knock to have someone let you in.


Vendor visit: Alexander Street Press

A representative from Alexander Street Press will be here on Tuesday, August 27 from 10-11. We will meet in the 212/218 Doe Library conference area. She’s hoping to get us interested in some of ASP’s new offerings, including VAST: Academic Video Online, LGBT Studies Video and LGBT Thought and Culture, Twentieth Century Religious Thought, Asian Studies in Video, Silent Film Online, New World Cinema, Filmakers Library Online, Video Journal of Counseling and Therapy, Ethnographic Video Online (2nd edition), World Newsreels Online, 1929-1966, and their All Music Package, among other resources.

(212/218 Doe is keycode access so just knock to have someone let you in.)


Springer launches full book download feature

As announced on the LIBLICENSE-L listserv:

“Springer is pleased to introduce the ‘full book download functionality’ on SpringerLink and Springer for R&D in response to the high demand for this feature by our authors, researchers and library customers. This new functionality allows users to download all chapters of a book in one go. In addition to this feature, users can view an eBook on SpringerLink and Springer for R&D through the LookInside, and download the individual chapters as PDF and/or HTML format. The chapter level and full book PDF is available to subscribed users without restrictions. The LookInside always shows some sample pages to unsubscribed users, and the full chapter to subscribed users.”


Wiley ebooks

Wiley is now posting newly published ebooks at the Wiley digital books resource site. Specifically, you can go to that page and download the “Just Published Titles” list.

Says a message from the Wiley sales director: “[We] will have access to all new Wiley Online Books with a 2013 print publication date that go online in 2013 as they are published rather than having to wait each month for activation. A bi-weekly list of the books that are newly published is available for you to download.”

Please note that CDL will no longer be distributing the Wiley ebook title lists or posting the Wiley ebook lists at the CDL website.


ERF Update – April, May, June 2013

Current number of records in the ERF: 1105

ADDED since last update

DELETED since last update

  • Global Road Warrior (cancelled)

CHANGES since last update


Licensing digital resources: Tiers 1-2-3

In response to a request by the Collection Services Council, Margaret Phillips has created an overview of UC’s tiered approach to licensing digital resources including some useful links.

This great information is available directly at http://www.lib.berkeley.edu/Staff/CS/tiers%20123%20splash.html. Or, if you want to follow the breadcrumbs:

  • start at the /CS home page
  • select “UC Tiers 1-2-3 and Shared Print”
  • select “Licensing digital resources : Tiers 1-2-3”

–gail


Reminder about vendor visits in June

Please feel free to join any of the upcoming meetings with vendors.

  • Tuesday, June 4: Gale reps Rob Hoyer and Vince Vessalo. 9am-10am in 251 Doe
  • Wednesday, June 5: Mike Diaz from the ProQuest DAAP (Digital Archiving and Access Program) and Michelle Valani, our ProQuest account manager. 2pm-3pm in 251 Doe
  • Wednesday, June 12: ProQuest CFO Jonathan Collins along with Michelle Valani. 9am-10am in 212/218 Doe Library

Vendor visit: ProQuest CFO – June 12

Jonathan Collins, the newly appointed CFO at ProQuest, will be on campus on Wednesday, June 12 from 9am to 10am. He is on a fact-finding mission to hear from librarians about what is important to us and the challenges we face and, most importantly, to get honest feedback about ProQuest.

Please join us in 212/218 Doe Library. (The room is keycode access so just knock to have someone let you in.)

No need to RSVP. Just show up.