UC Berkeley Library’s Dataverse

Library IT and the Library Data Services Program are thrilled to announce the launch of the UC Berkeley Library Dataverse, a new platform designed to streamline access to data licensed by the UC Berkeley Library. This initiative addresses the challenges users have faced in finding and managing data for research, teaching, and learning.

With Dataverse, we have simplified the data acquisition process and created a central hub where users can easily locate datasets and understand their terms of use. Dataverse is an open-source platform managed by Harvard’s Institute for Quantitative Social Science, and was selected for its robust features that enhance the user experience.

All licensed data, whether locally stored or vendor-managed, will now be available in Dataverse. While metadata is publicly accessible, users will need to log in to download datasets. This platform is the result of a collaborative effort to support both library staff and users. Anna Sackmann, our Data Services Librarian, will continue to assist with the acquisition process, while Library IT oversees the platform’s maintenance. We are also committed to helping researchers publish their data by guiding them toward the best repository options.
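
Because the metadata is public, one possible way to look for datasets programmatically is the Search API that the Dataverse software provides (`/api/search`, with the `q`, `type`, and `per_page` parameters). The sketch below only builds a query URL. The host name is a placeholder, since the address of the UC Berkeley Library Dataverse is not given in this post, and whether this installation exposes the Search API to users who are not logged in is an assumption.

```python
from urllib.parse import urlencode

# Placeholder host (assumption): substitute the real address of the
# UC Berkeley Library Dataverse, which is not given in this post.
BASE_URL = "https://dataverse.example.edu"


def build_search_url(query: str, per_page: int = 10, base_url: str = BASE_URL) -> str:
    """Build a Dataverse Search API URL that looks for datasets matching `query`."""
    params = {
        "q": query,            # search terms
        "type": "dataset",     # limit results to datasets
        "per_page": per_page,  # number of results per page
    }
    return f"{base_url}/api/search?{urlencode(params)}"


# Hypothetical search term, for illustration only.
print(build_search_url("census"))
```

Opening the resulting URL in a browser, or fetching it with any HTTP client, would return the matching metadata as JSON. Downloading the data files themselves still requires logging in, as described above.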

Access the UC Berkeley Library Dataverse via the Library’s website under Data + Digital Scholarship Services. Please email librarydataservices@berkeley.edu with questions.


Love Data Week 2024

The UC-wide Love Data Week, brought to you by UC Libraries, will be a jam-packed week of data talks, presentations, and workshops Feb. 12-16, 2024. With over 30 presentations and workshops, there’s plenty to choose from, with topics such as:

  • Code-free data analysis
  • Open Research
  • How to deal with large datasets
  • Geospatial analysis
  • Drone data
  • Cleaning and coding data for qualitative analysis
  • 3D data
  • Tableau
  • Navigating AI

All members of the UC community are invited to attend these events to gain hands-on experience, learn about resources, and engage in discussions about data needs throughout the research process. To register for workshops during this week and see what other sessions will be offered UC-wide, visit the UC Love Data Week 2024 website.


National Science Foundation Public Access Plan 2.0

Many of you may have already seen, or even read, the NSF Public Access Plan 2.0. This document, disseminated last week, is the National Science Foundation’s response to the OSTP Public Access Memo from August 2022, which requires all federal grant funding agencies to make research publications and their supporting data freely available and accessible, without embargo, no later than December 31, 2025. The public access plan does not set out the agency’s new policies; rather, it is the framework for how the agency will improve public access and address the new requirements. The agency states that it will accomplish this by January 31, 2025, ahead of the December deadline. I have highlighted just a few points from the report below.
  • The agency will leverage the existing NSF Public Access Repository (NSF-PAR) to make research papers, either the author’s accepted manuscript (AAM) or the publisher’s version of record (VOR), available immediately. All papers will be available in machine-readable XML, which will make additional research through text and data mining (TDM) possible.
  • The agency will continue to leverage relationships with long-standing disciplinary and generalist data repositories, like Dryad.
  • All data and publications will have permanent identifiers (PIDs). Data PIDs will be included with the article metadata.
  • The agency acknowledges the complexity in size, type, and quality of documentation with data. Publishing a dataset has far greater technical variability than publishing a manuscript. The agency will continue to explore how to best address data in the next two years.
  • The NSF has long required data management plans (DMPs). DMPs will be renamed to “data management and sharing plans,” or DMSPs, to better describe the required documentation and align with other agencies, like the NIH.

The bullets above cover just five items from the lengthy report. Most importantly, over this next year, the Data Collaboration Team will develop an inreach plan to ensure all librarians and staff know how the OSTP memo and resulting policy will impact them and their researchers. Following awareness within the library, we will work on developing a coordinated outreach approach to support our researchers as they adapt to new requirements. This work will be in coordination with the Office of Scholarly Communication Services, the Research Data Management Program, and other longstanding LDSP partners.

Please let us know if you have any questions by sending them to librarydataservices@berkeley.edu.

Data Analysis Workshop Series: partnering with the CDSS Data Science Discovery Consultants

With the increase in data science across all disciplines, most undergraduates will encounter basic data science concepts and be expected to analyze data at some point during their time at UC Berkeley. To address this growing need, the Library Data Services Program began partnering with the Data Science Discovery Consultants in the Division of Computing, Data Science, and Society (CDSS) on the Introduction to Data Analysis Support workshop series in Fall 2020. The Data Science Discovery Consultants are a group of undergraduates majoring in computer science, math, data science, and related fields who are hired as student employees. They receive training to offer consultation services across a wide range of topics, including Python, R, SQL, and Tableau, and they have existing partnerships with other groups on campus to provide instruction around data as a part of their program. Through the partnership, Data Science Discovery Consultants work with librarians to develop as instructors and gain experience constructing workshops and teaching technical skills. The end result is the creation of a peer-to-peer learning environment for novice undergraduate learners who want to begin working with data. The peer-to-peer learning model lowers the barrier to learning for other undergraduates and enhances motivation and understanding.

The Data Science Discovery Consultants enthusiastically embraced the core values of the Carpentries, through which they empower each other and the audience, collaborate with their community, and create inclusive spaces that welcome and extend empathy and kindness to all learners. In Fall 2022, attendance for the workshop series was opened up to local community college students who may be interested in transferring to UC Berkeley. One of the workshops was taught in Spanish, to provide an environment in which native Spanish speakers could better connect with one another and the content. 

Diego Sotomayor, a former UC Berkeley Library student employee and current Data Science Discovery Consultant, taught the inaugural Introduction to Python in Spanish: Introducción al análisis de datos con Python. Diego comments that: 

“Languages at events are no longer just a necessity but have gone to the next level of being essential to transmit any relevant information to the interested public. There are many people who only speak Spanish or another language other than English and intend to learn new topics through various platforms including workshops. However, because they are limited by only speaking a language that is not very popular, they get stuck in this desire to progress and learn. Implementing the workshop in different languages, not only in English but in Spanish and even others, is important to give the same opportunities and equal resources to people looking for opportunities.”

The UC Berkeley Library and the Division of Computing, Data Science, and Society hope to further provide these offerings for prospective transfer students in Fall 2023. Many thanks to Elliott Smith, Lisa Ngo, Kristina Bush (now at Tufts University), and Misha Coleman in the Library. Anthony Suen is the Library’s staff partner in the Data Science Discovery Program and Kseniya Usovich assists with outreach. 


University of California Research Data Policy: a few things to know

The University of California Office of the President recently announced an updated Research Data Policy, effective July 15, 2022. The new policy complements the original policy from 1958. It re-confirms that research data are owned by the University but outlines how University Researchers may use the data generated or collected in the course of their research. While most researchers likely will find that the updated policy doesn’t require a complete overhaul of their data stewardship practices, it’s important to understand the key terms, conditions, and permissions enabled by the new policy, which will help them make decisions around management, retention, data publication, and data transfer. Implementation of this policy at a campus level is currently under development. Additional details are forthcoming.

A few key points: 

  • The Regents of the University of California own Research Data generated or collected in the course of University Research. 
    • Research Data include “recorded information embodying facts resulting from a scientific inquiry.” Research Data do not include scholarly & aesthetic works, informal notes, paper drafts, administrative or medical records, and other materials (see policy text for complete list).
    • University Research means “research conducted by a Principal Investigator or University Researcher that is within the course and scope of their assigned duties, uses University resources, and/or is funded by or through the University.”
  • University Researchers may use the Research Data they generate or collect in order to conduct other research, share with collaborators, publish outcomes, and create scholarly works. The University “supports the free and unfettered dissemination of information, knowledge, and discoveries generated by University Researchers.” As such:
    • Principal Investigators (PIs) are the stewards of Research Data, and maintain autonomy about which data should be preserved or dispositioned;
    • Researchers may share data as dictated by scholarly/disciplinary standards or data management plans, or legal, funder, or contractual requirements; 
    • When a University Researcher leaves the UC, they may take copies of the data they generated or collected, as long as it is approved by the PI;
    • Neither the University nor University Researchers may assert ownership of Research Data owned by third parties.

 

Resources and Assistance: 

 

Written by Tim Vollmer, Erin Foster, and Anna Sackmann

 


NIH DMSP Frequently Asked Questions

The Library Data Services Program recently posted about the National Institutes of Health (NIH) Data Management and Sharing Policy and how it will affect UC Berkeley researchers. Please read more about the new policy in this post.

Here is a list of FAQs about the new policy. Please contact the UC Berkeley Library Data Services Program (librarydataservices@berkeley.edu) with questions.

How is UC Berkeley responding to this policy?

The Library Data Services Program is collaborating with the Research Data Management Program to provide guidance and documentation to ensure compliance with the NIH policy.

What is considered “scientific data” for the purposes of this plan?

The final NIH Policy defines Scientific Data as: “The recorded factual material commonly accepted in the scientific community as of sufficient quality to validate and replicate research findings, regardless of whether the data are used to support scholarly publications. Scientific data do not include laboratory notebooks, preliminary analyses, completed case report forms, drafts of scientific papers, plans for future research, peer reviews, communications with colleagues, or physical objects, such as laboratory specimens.” The NIH states that “the final DMS Policy is designed to increase the sharing of scientific data, regardless of whether a publication is produced…Data that do not form the basis of a publication produced during the award period should be shared by the end of the award period.”

What is included in a Data Management and Sharing Plan?

In these documents, which are limited to two pages, researchers will describe their:

  • Data type(s)
  • Related tools, software, and/or code
  • Standards
  • Data preservation, access, and associated timelines
  • Access, distribution, or reuse considerations
  • Oversight of data management and sharing

Read more about Data Management Plans and see sample language.

Can I make my data available upon request?

NIH strongly prefers that scientific data be shared and preserved through repositories or, for datasets up to 2GB, through PubMed Central-deposited supplemental data files, rather than kept by a researcher and provided upon request.

How will the plans be assessed?

NIH program staff will assess the DMS plans, but peer reviewers may comment on the proposed budget for data management and sharing.

What data repository should I use?

NIH encourages the use of established repositories. To select the best repository for your data consider the following:

  • Is there a specific NIH repository named in the funding announcement?
  • Is there a data repository specific to your discipline?
  • If not, is there a general data repository you can use?

To learn more, read the NIH guidance on selecting a data repository.

What is a standard? What standards are relevant to my research?

A standard specifies how exactly data and related materials should be stored, organized, and described. In the context of research data, the term typically refers to the use of specific and well-defined formats, schemas, vocabularies, and ontologies in the description and organization of data. However, for researchers within a community where more formal standards have not been well established, it can also be interpreted more broadly to refer to the adoption of the same (or similar) data management-related activities, conventions, or strategies by different researchers and across different projects.

When do I need to make my data available?

NIH encourages scientific data to be shared as soon as possible, and no later than the time of an associated publication or the end of the performance period, whichever comes first.
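
As a rough illustration of the "whichever comes first" rule, the minimal sketch below computes the latest date by which data would need to be shared. The function name and the example dates are hypothetical and are not taken from NIH guidance.

```python
from datetime import date
from typing import Optional


def latest_sharing_date(performance_period_end: date,
                        publication_date: Optional[date] = None) -> date:
    """Return the latest date by which data should be shared.

    Rule from the answer above: no later than the time of an associated
    publication or the end of the performance period, whichever comes first.
    """
    if publication_date is None:
        # No associated publication: the end of the performance period applies.
        return performance_period_end
    return min(publication_date, performance_period_end)


# Hypothetical award ending June 30, 2026, with a paper published March 15, 2025.
print(latest_sharing_date(date(2026, 6, 30), date(2025, 3, 15)))  # prints 2025-03-15
```

Sharing earlier than this date is always allowed; the function only marks the outer limit.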

What data management and sharing costs can I include in my grant?

Allowable costs can include:

  • data curation and developing documentation (formatting data, de-identifying data, preparing metadata, curating data for a data repository)
  • data management considerations (unique and specialized information infrastructure necessary to provide local management and preservation before depositing in a repository)
  • preserving data in data repositories (data deposit fees)

Read more about allowable costs.

Guidance for data management and sharing costs on NIH budget requests

What happens if I do not comply with the NIH policy or make my data available as described in the DMS policy?

NIH Program Staff will be monitoring compliance with the policy during the funding period. “Noncompliance with Plans may result in the NIH ICO adding special Terms and Conditions of Award or terminating the award. If award recipients are not compliant with Plans at the end of the award, noncompliance may be factored into future funding decisions.”

I work with sensitive topics/populations – how do I protect my participants’ privacy?

NIH strongly encourages researchers who work with sensitive topics and/or populations to address data sharing in the Informed Consent process. See the UC Berkeley Human Research Protection Program’s Informed Consent page, which includes guidelines and appropriate form templates.

Researchers should pay special attention to their de-identification process to ensure that all identifying information has been fully removed. Researchers should consider depositing their data in restricted access repositories that require data use agreements and research plans in order to access the data. Contact librarydataservices@berkeley.edu if you would like guidance on selecting restricted access repositories.

Please view UCSF’s resources on data de-identification and sharing de-identified data for additional guidance.

Do specific NIH Institutes and Centers (ICs) have additional policies or recommendations? 

Yes, NIH ICs may have additional requirements or recommendations. Please identify the institute or center using this table to learn more about requirements.

Supplemental information from the NIH:

Responsible Management and Sharing of American Indian/Alaska Native Participant Data

Protecting Privacy When Sharing Human Research Participant Data

 

Many thanks to Ariel Deardorff at the UCSF Library for allowing us to adapt their list of Frequently Asked Questions and thank you to UC Berkeley’s Elliott Smith, Michael Sholinbeck, and Erin Foster for all of their expertise and contributions. 

 


Forthcoming NIH Data Management and Sharing Policy

On January 25, 2023, the National Institutes of Health (NIH) will implement a new Data Management and Sharing Policy. The Library Data Services Program and Research Data Management Program have several resources to help you adapt to the new policy, including extensive guidance and suggested language for writing a plan. Additionally, the DMPTool, which is free for UC Berkeley users and supported by the Library Data Services Program, walks grant applicants through the plan requirements. We will be offering four drop-in workshops designed for researchers this fall. Please register for the workshops using the links below:

The NIH is a leader in implementing data management plans and was the first federal granting agency to do so with its 2003 NIH Sharing Policy. In the years since, the agency developed two genomic data sharing policies (2008 and 2014), and addressed data sharing in clinical trials in 2016. The new data sharing policy builds on the agency’s existing data management requirements and is sweeping in scope, with the goal to maximize the “…sharing of scientific data generated from NIH-funded or conducted research, with justified limitations or exceptions.”

 

The new policy will apply to all research funded (in whole or in part) by the NIH that produces scientific data. It will apply to grant applications submitted on or after January 25, 2023. The NIH defines scientific data as: “the recorded factual material commonly accepted in the scientific community as of sufficient quality to validate and replicate research findings, regardless of whether the data are used to support scholarly publications…” Please note that the NIH definition of scientific data does NOT include the following: laboratory notebooks, preliminary analyses, completed case report forms, drafts of scientific papers, plans for future research, peer reviews, communications with colleagues, or physical objects, such as laboratory specimens.

 

There are a few aspects that set the new NIH DMSP apart from current policy:

  • Plans will outline how data and metadata will be managed over the course of the project and which of these data will be shared.
  • Grant applicants will need to include details about software or other tools that were used to analyze the data.
  • If generating data derived from human participants, plans will need to outline how confidentiality, privacy, and rights of those individuals will be preserved.
  • Plans must include a selected repository or repositories where the data will be preserved along with a timeline for sharing the data (either as soon as possible, no later than the time of an associated publication, or at the end of performance period if there is no associated publication). 
  • Updates to plans over the course of the project will be reviewed by the NIH ICO (institute, center, or office) during regular reporting intervals.
  • Data management costs may be added to the grant budget including data curation and developing documentation (e.g., formatting data, de-identifying data); data management considerations (e.g., unique and specialized information infrastructure necessary to provide local management and preservation before depositing in a repository); preserving data in data repositories (e.g., data deposit fees).
  • Compliance with plans will be measured during the funding period at regular reporting intervals.

 

Please see our list of frequently asked questions. For additional information about the policy, please see the NIH page on Data Management and Sharing Policy.

Many thanks to Elliott Smith, Michael Sholinbeck, and Erin Foster for all of their expertise and contributions!

If you have any questions or need additional support, please email librarydataservices@berkeley.edu. 


Nexis Data Lab Computing Environment

 

The UC Berkeley Library is pleased to announce access to a new text and data mining platform, Nexis Data Lab from LexisNexis. The cloud-based platform enables users to run computational analysis in a Jupyter notebook on content licensed for use at UC Berkeley. Please take a look at this brief, two-minute video to see how the environment works. Researchers should be familiar with Python or R. Each account may have up to 6 projects (workspaces), with a limit of 100,000 documents per project. The number of seats is limited, so we ask that you have a TDM project in progress.

 

Please view the list of content and titles available to UC Berkeley users. LexisNexis will continue to make additional content available as the platform grows. (Note that the following publications are NOT available: The New York Times (NDL does include NYT International), The New York Times Blogs, Wall Street Journal Abstracts, Information Base Abstracts, and Jane’s Defence Weekly.)

 

If you would like to get started using the platform, request a seat by filling out this form. The Library is holding a training session for the platform (hosted by LexisNexis) on August 15, 2022 at 9:00 AM. Please register here for the event. For more information on text and data mining platforms and resources available at UC Berkeley, please check out our guide to Text Mining and Computational Text Analysis. Contact the Library Data Services Program at librarydataservices@berkeley.edu with questions. 

 


Publisher Data Requirements Revisited

In May and September of 2017, the Library wrote posts (read them here and here) about a number of publisher research data policies. Over the last year, publishers have engaged in conversations with institutions, funders, and not-for-profit organizations to examine how they can better shape and influence the sharing of research data.

Image from Unsplash by Franki Chamaki

To accompany their data sharing policies and recommendations, publishers like Springer Nature and Elsevier recently developed their own research data services to better assist researchers who are preparing their data to be published alongside a manuscript. They now provide individual guidance (for a fee) and repositories in which to deposit and share data. Please talk to a consultant at UC Berkeley’s Research Data Management program about the guidance we can provide along with University of California supported data sharing options.

Elsevier continues to communicate about research data through a series of principles (data should be made available free of charge wherever possible with minimal reuse restrictions; by enabling effective reuse of data we’re finding efficiencies and preventing duplication of effort). These principles map to a series of policies. The policies speak to how Elsevier will support and encourage researchers when sharing data. Elsevier’s research data guidelines, which remain largely unchanged since last year, prescribe how and when researchers will share their data. Elsevier’s journals are assigned to one of five research data guidelines, which have slight variations in language and range from “encouraged to deposit research data” to “required to deposit research data.”

When submitting to an Elsevier journal, be sure to check the individual journal’s Guide for Authors, which is located on the journal homepage. Elsevier does not maintain a master list of journals mapped to the five research data guidelines. Your subject librarian can provide guidance if you need more information about the data publishing policies from a specific Elsevier title. If you don’t know where you will submit your research, it’s best to prepare for the most rigorous data policy by adhering to a data management plan throughout the course of your work.

Springer Nature’s data publishing policies follow the same four-tiered structure they developed in 2017; however, they’ve added more nuanced requirements within each tier for the life sciences and non-life sciences. Check here to see the publisher’s list of journals and their assigned data publishing policy.

Wiley applies one of three data sharing policies to their journals: encourages data sharing; expects data sharing; and mandates data sharing. The publisher has created an author compliance tool, which enables researchers who are submitting papers to one of the publisher’s journals to check what they need to do with their data to be in compliance with their funder, institution, and journal. For example, if your research is funded by the NIH, you work at a University of California institution, and would like to publish in Bioengineering and Translational Medicine, you’ll learn that the journal encourages you to share your data, the NIH requires you to share your data, and the university does not have a policy. In cases like this, you need to default to the entity that requires the most sharing. In this case, you would share your data as stipulated by the NIH.
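
The "default to the entity that requires the most sharing" rule can be sketched as a simple comparison. The ranking of policy levels and all of the names below are assumptions made for illustration (including treating a funder requirement as equivalent to a mandate); this is not how Wiley's author compliance tool is implemented.

```python
from enum import IntEnum
from typing import Dict, Tuple


class SharingLevel(IntEnum):
    """Assumed ordering, from the least to the most data sharing required."""
    NO_POLICY = 0
    ENCOURAGES = 1
    EXPECTS = 2
    MANDATES = 3


def governing_policy(policies: Dict[str, SharingLevel]) -> Tuple[str, SharingLevel]:
    """Return the entity whose policy requires the most sharing, with its level."""
    return max(policies.items(), key=lambda item: item[1])


# The example from the paragraph above: the journal encourages sharing,
# the NIH requires it, and the university has no policy.
example = {
    "journal": SharingLevel.ENCOURAGES,
    "funder (NIH)": SharingLevel.MANDATES,
    "institution": SharingLevel.NO_POLICY,
}

entity, level = governing_policy(example)
print(f"Follow the {entity} policy: {level.name}")  # prints "Follow the funder (NIH) policy: MANDATES"
```

If two entities are tied at the same level, the function returns whichever was listed first, which does not change what the researcher needs to do.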

Wiley’s author compliance tool points out the gaps in policy that exist for researchers, especially in the United States. Data sharing policies differ widely between institutions, publishers, and funders, which leads to confusion for the researcher. In general, when planning research and communicating your results, take the Open Science approach, which advocates for showing your work and sharing your work in the name of advancing science. By thoroughly documenting your data and research process, others are better able to understand your work and potentially utilize the data for another research purpose. The Open Science approach supports transparency and reuse, which results in better science and more rapid advances. If you would like more information about preparing your data to be shared with others, please contact the Research Data Management Program.