Exploring Research at Scale with Web of Science XML Data
The Web of Science XML dataset is now available for research, teaching, and learning at UC Berkeley.
This dataset is an essential tool for anyone exploring, evaluating, or visualizing global research activity. Drawing from over 12,500 journals across 254 disciplines in the sciences, social sciences, and humanities, this rich dataset includes not only journal articles but also conference proceedings and book metadata, with coverage dating back to 1900.
With more than 63 million article records and over 1 billion cited references, the dataset supports large-scale analysis of scholarly communication and impact. Key metadata elements include ORCID identifiers in over 6.2 million records to help disambiguate authors, detailed funding acknowledgments with grant numbers, and full author and institutional affiliations to support accurate attribution and collaboration analysis. Web of Science also standardizes institutional names to resolve naming variations, making cross-institutional analyses more reliable.
The data is delivered as XML, which researchers can parse to build complex citation networks, analyze research dynamics, and model trends over time. The dataset can also be combined with other datasets for additional insights or loaded into visualization and statistical tools.
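As a rough illustration of the citation-network idea, the sketch below parses a small XML sample into a "who cites whom" mapping and counts how often each record is cited. The element names (`REC`, `UID`, `reference`, `uid`), the sample identifiers, and the helper functions are assumptions made for this example; the actual Web of Science files nest these elements more deeply and may declare an XML namespace, so the lookups would need adjusting against the real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample. Element names and identifiers are
# assumptions for illustration, not the documented Web of Science schema.
SAMPLE = """
<records>
  <REC>
    <UID>WOS:A</UID>
    <references>
      <reference><uid>WOS:B</uid></reference>
      <reference><uid>WOS:C</uid></reference>
    </references>
  </REC>
  <REC>
    <UID>WOS:B</UID>
    <references>
      <reference><uid>WOS:C</uid></reference>
    </references>
  </REC>
</records>
"""


def build_citation_graph(xml_text):
    """Map each record's identifier to the set of identifiers it cites."""
    root = ET.fromstring(xml_text)
    graph = {}
    for rec in root.iter("REC"):
        uid = rec.findtext("UID")
        if not uid:
            continue  # skip records without an identifier
        cited = graph.setdefault(uid.strip(), set())
        for ref in rec.iter("reference"):
            ref_uid = ref.findtext("uid")
            if ref_uid:
                cited.add(ref_uid.strip())
    return graph


def times_cited(graph):
    """Count how many records in the graph cite each identifier."""
    counts = {}
    for cited_set in graph.values():
        for ref_uid in cited_set:
            counts[ref_uid] = counts.get(ref_uid, 0) + 1
    return counts


if __name__ == "__main__":
    graph = build_citation_graph(SAMPLE)
    print(graph)
    print(times_cited(graph))
```

Loading a whole document with `ET.fromstring` is only practical for small samples. For files of the size described here, a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be a better fit.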
For research offices, the dataset provides an opportunity to gain meaningful insights into the ever-evolving research landscape. With consistent indexing and global coverage, it’s a foundation for informed research strategy, evaluation, and discovery.
The data can be accessed in the UC Berkeley Library’s Dataverse, through the Savio computing cluster, or through TDM Studio. For details, please visit the Web of Science XML data section of the Text Mining & Computational Text Analysis research guide. To request a Dataverse API token or ask questions, contact the Library Data Services Program at librarydataservices@berkeley.edu.
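Once a token has been issued, a file download from a Dataverse installation might look roughly like the sketch below. It relies on the general Dataverse Data Access API pattern (`/api/access/datafile/<id>` with the token sent in an `X-Dataverse-key` header). The base URL, file ID, and token shown are placeholders, not the Library's actual values, and the helper function is hypothetical.

```python
import urllib.request


def build_datafile_request(base_url, file_id, api_token):
    """Build an authenticated request for one Dataverse data file."""
    url = f"{base_url.rstrip('/')}/api/access/datafile/{file_id}"
    # Dataverse reads the API token from the X-Dataverse-key header.
    return urllib.request.Request(url, headers={"X-Dataverse-key": api_token})


if __name__ == "__main__":
    # Placeholder values: substitute the real server, file ID, and token.
    req = build_datafile_request(
        "https://dataverse.example.edu", 12345, "YOUR-API-TOKEN"
    )
    print(req.full_url)
    # To download, uncomment the lines below once real values are in place:
    # with urllib.request.urlopen(req) as resp, open("records.xml", "wb") as out:
    #     out.write(resp.read())
```

Building the request separately from sending it makes it easy to check the URL and header before any data is transferred.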