The Library of Congress recently released 25 million metadata records for free bulk download at loc.gov/cds/products/marcDist.php. MARC (Machine-Readable Cataloging) records like these are the foundation of library catalogs such as OskiCat, which have helped users find and access books and other media for decades. As the LOC describes the collection:
“The data covers a wide range of Library items including books, serials, computer files, manuscripts, maps, music and visual materials. The free data sets cover more than 45 years, ranging from 1968, during the early years of MARC, to 2014. Each record provides standardized information about an item, including the title, author, publication date, subject headings, genre, related names, summary and other notes.”
The data is available in UTF-8, MARC8, and XML formats, and is divided by media type, including books, computer files, maps, music, and more.
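For readers who want a feel for the XML version, here is a minimal sketch of pulling a title and author out of a MARCXML record using only Python's standard library. It assumes the files follow the standard MARCXML layout (namespace `http://www.loc.gov/MARC21/slim`, with the title in field 245 subfield a and the main author in field 100 subfield a). The sample record and the helper names `first_subfield` and `summarize` are made up for illustration and do not come from the LOC data set.

```python
import xml.etree.ElementTree as ET

# Standard MARCXML namespace.
MARC_URI = "http://www.loc.gov/MARC21/slim"
MARC_NS = {"marc": MARC_URI}

# A made-up record for illustration only; real LOC files hold many records.
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <leader>00000nam a2200000 a 4500</leader>
    <datafield tag="100" ind1="1" ind2=" ">
      <subfield code="a">Example, Author.</subfield>
    </datafield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">An example title /</subfield>
      <subfield code="c">Author Example.</subfield>
    </datafield>
  </record>
</collection>"""


def first_subfield(record, tag, code):
    """Return the text of the first matching subfield, or None if absent."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    node = record.find(path, MARC_NS)
    return node.text if node is not None else None


def summarize(xml_text):
    """Return a list of {title, author} dicts, one per record in the XML."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": first_subfield(rec, "245", "a"),
            "author": first_subfield(rec, "100", "a"),
        }
        for rec in root.iter(f"{{{MARC_URI}}}record")
    ]


if __name__ == "__main__":
    for row in summarize(SAMPLE):
        print(row)
```

The bulk files are large, so for real use a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be a better fit than loading a whole file into memory as this sketch does.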
Find out more:
- Computational Text Analysis and Text Mining Guide – find many other sources for large-scale text analysis projects
- LOC’s Getting Started Guide (PDF) for details on accessing the data
- Stacy Reardon, Literatures and Digital Humanities Librarian, sreardon [at] berkeley.edu
- Cody Hennesy, E-Learning and Information Studies Librarian, chennesy [at] berkeley.edu