New Data Alert: Web of Science XML Data

Exploring Research at Scale with Web of Science XML Data

Algo-r-(h)-i-(y)-thms, 2018. Installation view at ON AIR, Tomás Saraceno's solo exhibition at Palais de Tokyo, Paris, 2018.

The Web of Science XML dataset is now available for research, teaching, and learning at UC Berkeley.

This dataset is an essential tool for anyone exploring, evaluating, or visualizing global research activity. Drawing from over 12,500 journals across 254 disciplines in the sciences, social sciences, and humanities, this rich dataset includes not only journal articles but also conference proceedings and book metadata, dating back to 1900.

With more than 63 million article records and over 1 billion cited references, the dataset supports large-scale analysis of scholarly communication and impact. Key metadata elements include ORCID identifiers in over 6.2 million records to help disambiguate authors, detailed funding acknowledgments with grant numbers, and full author and institutional affiliations to support accurate attribution and collaboration analysis. Web of Science also standardizes institutional names to resolve naming variations, making cross-institutional analyses more reliable.

Researchers can access this data through flexible XML, allowing them to build complex citation networks, analyze research dynamics, and model trends over time. The dataset can be combined with other datasets for additional insights or used in visualization and statistical tools.
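As a rough illustration of the citation-network idea above, the Python sketch below parses a tiny, made-up XML sample and builds a list of citing-to-cited pairs. The element names (records, REC, UID, references, reference, uid) and the identifiers are assumptions chosen for illustration, not a description of the actual schema. Check the documentation that accompanies the dataset before adapting it.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, heavily simplified sample. The real Web of Science XML
# schema is richer and may use different element names or namespaces.
SAMPLE_XML = """
<records>
  <REC>
    <UID>WOS:A</UID>
    <references>
      <reference><uid>WOS:B</uid></reference>
      <reference><uid>WOS:C</uid></reference>
    </references>
  </REC>
  <REC>
    <UID>WOS:B</UID>
    <references>
      <reference><uid>WOS:C</uid></reference>
    </references>
  </REC>
</records>
"""


def citation_edges(xml_text):
    """Return a list of (citing_id, cited_id) pairs found in xml_text."""
    root = ET.fromstring(xml_text)
    edges = []
    for rec in root.iter("REC"):
        citing = rec.findtext("UID")
        for ref in rec.iter("reference"):
            cited = ref.findtext("uid")
            # Skip references that lack an identifier on either side.
            if citing and cited:
                edges.append((citing, cited))
    return edges


edges = citation_edges(SAMPLE_XML)
# Count how often each record appears as the target of a citation.
times_cited = Counter(cited for _, cited in edges)
print(edges)
print(times_cited.most_common())
```

For files at the scale described here, a streaming parser such as ET.iterparse would likely be a better fit than loading a whole file into memory. The resulting edge list can then be handed to a graph or statistics library of your choice.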

For research offices, the dataset provides an opportunity to gain meaningful insights into the ever-evolving research landscape. With consistent indexing and global coverage, it’s a foundation for informed research strategy, evaluation, and discovery.

The data can be accessed in UC Berkeley Library’s Dataverse, through the Savio computing cluster, or through TDM Studio. For details, please visit the Web of Science XML data section of the Text Mining & Computational Text Analysis research guide. Contact the Library Data Services Program for a Dataverse API token or with questions: librarydataservices@berkeley.edu
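For readers wondering how a Dataverse API token is typically used, here is a minimal Python sketch that builds, but does not send, a metadata request. It relies on the Dataverse native API convention of passing the token in the X-Dataverse-key header and looking up a dataset by persistent identifier. The server address, DOI, and token shown are placeholders, and the helper name build_dataset_request is hypothetical. The research guide has the actual access instructions.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_dataset_request(base_url, persistent_id, api_token):
    """Build (but do not send) a GET request for a dataset's metadata."""
    query = urlencode({"persistentId": persistent_id})
    url = f"{base_url.rstrip('/')}/api/datasets/:persistentId/?{query}"
    # The token travels in a request header rather than in the URL.
    return Request(url, headers={"X-Dataverse-key": api_token}, method="GET")


# Placeholder values: substitute the real server address, dataset DOI, and token.
request = build_dataset_request(
    "https://dataverse.example.edu",
    "doi:10.5072/FK2/EXAMPLE",
    "YOUR-API-TOKEN",
)
print(request.full_url)
```

Passing the request to urllib.request.urlopen would send it. Keep the token out of shared notebooks and version control, for example by reading it from an environment variable.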


UC Berkeley Library’s Dataverse

Library IT and the Library Data Services Program are thrilled to announce the launch of the UC Berkeley Library Dataverse, a new platform designed to streamline access to data licensed by the UC Berkeley Library. This initiative addresses the challenges users have faced in finding and managing data for research, teaching, and learning.

With Dataverse, we have simplified the data acquisition process and created a central hub where users can easily locate datasets and understand their terms of use. Dataverse is an open-source platform managed by Harvard’s Institute for Quantitative Social Science and was selected for its robust features that enhance the user experience.

All licensed data, whether locally stored or vendor-managed, will now be available in Dataverse. While metadata is publicly accessible, users will need to log in to download datasets. This platform is the result of a collaborative effort to support both library staff and users. Anna Sackmann, our Data Services Librarian, will continue to assist with the acquisition process, while Library IT oversees the platform’s maintenance. We are also committed to helping researchers publish their data by guiding them toward the best repository options.

Access the UC Berkeley Library Dataverse via the Library’s website under Data + Digital Scholarship Services. Please email librarydataservices@berkeley.edu with questions.


Love Data Week 2024

The UC-wide Love Data Week, brought to you by UC Libraries, will be a jam-packed week of data talks, presentations, and workshops Feb. 12-16, 2024. With over 30 presentations and workshops, there’s plenty to choose from, with topics such as:

  • Code-free data analysis
  • Open Research
  • How to deal with large datasets
  • Geospatial analysis
  • Drone data
  • Cleaning and coding data for qualitative analysis
  • 3D data
  • Tableau
  • Navigating AI

All members of the UC community are invited to attend these events to gain hands-on experience, learn about resources, and engage in discussions about data needs throughout the research process. To register for workshops during this week and see what other sessions will be offered UC-wide, visit the UC Love Data Week 2024 website.


National Science Foundation Public Access Plan 2.0

Many of you may have already seen, or even read, the NSF Public Access Plan 2.0. This document, disseminated last week, is the National Science Foundation’s response to the OSTP Public Access Memo from August 2022, which requires all federal grant funding agencies to make research publications and their supporting data freely available and accessible, without embargo, no later than December 31, 2025. The public access plan does not itself set out the agency’s new policies; rather, it is the framework for how the agency will improve public access and address the new requirements. The agency states that it will accomplish this by January 31, 2025, ahead of the December deadline. I have highlighted just a few points from the report below.

  • The agency will leverage the existing NSF Public Access Repository (NSF-PAR) to make research papers, either the author’s accepted manuscript (AAM) or the publisher’s version of record (VOR), available immediately. All papers will be available in machine-readable XML, which will make additional research through text and data mining (TDM) possible.
  • The agency will continue to leverage relationships with long-standing disciplinary and generalist data repositories, like Dryad.
  • All data and publications will have persistent identifiers (PIDs). Data PIDs will be included with the article metadata.
  • The agency acknowledges that data vary widely in size, type, and quality of documentation. Publishing a dataset involves far greater technical variability than publishing a manuscript. The agency will continue to explore how best to address data over the next two years.
  • The NSF has long required data management plans (DMPs). DMPs will be renamed to “data management and sharing plans,” or DMSPs, to better describe the required documentation and align with other agencies, like the NIH.

The bullets above cover just five items from a lengthy report. Most importantly, over the next year, the Data Collaboration Team will develop an inreach plan to ensure all librarians and staff know how the OSTP memo and resulting policy will impact them and their researchers. Once that awareness is established within the library, we will develop a coordinated outreach approach to support our researchers as they adapt to the new requirements. This work will be done in coordination with the Office of Scholarly Communication Services, the Research Data Management Program, and other longstanding LDSP partners.

Please let us know if you have any questions by sending them to librarydataservices@berkeley.edu.

Data Analysis Workshop Series: partnering with the CDSS Data Science Discovery Consultants

With the increase in data science across all disciplines, most undergraduates will encounter basic data science concepts and be expected to analyze data at some point during their time at UC Berkeley. To address this growing need, the Library Data Services Program began partnering with the Data Science Discovery Consultants in the Division of Computing, Data Science, and Society (CDSS) on the Introduction to Data Analysis Support workshop series in Fall 2020. The Data Science Discovery Consultants are a group of undergraduates majoring in computer science, math, data science, and related fields who are hired as student employees. They receive training to offer consultation services across a wide range of topics, including Python, R, SQL, and Tableau, and they have existing partnerships with other groups on campus to provide instruction around data as a part of their program. Through the partnership, Data Science Discovery Consultants work with librarians to develop as instructors and gain experience constructing workshops and teaching technical skills. The end result is the creation of a peer-to-peer learning environment for novice undergraduate learners who want to begin working with data. The peer-to-peer learning model lowers the barrier to learning for other undergraduates and enhances motivation and understanding.

The Data Science Discovery Consultants enthusiastically embraced the core values of the Carpentries, through which they empower each other and the audience, collaborate with their community, and create inclusive spaces that welcome and extend empathy and kindness to all learners. In Fall 2022, attendance for the workshop series was opened up to local community college students who may be interested in transferring to UC Berkeley. One of the workshops was taught in Spanish, to provide an environment in which native Spanish speakers could better connect with one another and the content. 

Diego Sotomayor, a former UC Berkeley Library student employee and current Data Science Discovery Consultant, taught the inaugural Introduction to Python in Spanish: Introducción al análisis de datos con Python. Diego comments:

“Languages at events are no longer just a necessity but have gone to the next level of being essential to transmit any relevant information to the interested public. There are many people who only speak Spanish or another language other than English and intend to learn new topics through various platforms including workshops. However, because they are limited by only speaking a language that is not very popular, they get stuck in this desire to progress and learn. Implementing the workshop in different languages, not only in English but in Spanish and even others, is important to give the same opportunities and equal resources to people looking for opportunities.”

The UC Berkeley Library and the Division of Computing, Data Science, and Society hope to continue these offerings for prospective transfer students in Fall 2023. Many thanks to Elliott Smith, Lisa Ngo, Kristina Bush (now at Tufts University), and Misha Coleman in the Library. Anthony Suen is the Library’s staff partner in the Data Science Discovery Program, and Kseniya Usovich assists with outreach.