Research Data Publishing & Licensing 101 - UC Berkeley Library Update

Please join Science Data & Engineering Librarian Anna Sackmann and Scholarly Communication Officer Rachael Samberg for practical tips about why, where, and how to publish and license your research data.

This workshop will be held from 11 a.m.–12 p.m., Doe Library, Rm. 190 (BIDS) on February 16, 2017 as part of Love Your Data Week. Check out the reservation form!

Why Should We Care About Publishing Research Data?

Sharing research data promotes transparency, reproducibility, and progress. Indeed, it can spur new discoveries on a daily basis. It’s not atypical for geneticists, for example, to sequence by day and post research results the same evening—allowing others to begin using their datasets in nearly real time (see, for example, Pisani & AbouZahr’s paper about this data publishing cycle). The datasets researchers share can, in turn, inform business or regulatory policymaking, legislation, government or social services, and much more.

Publishing your research data can also increase the impact of your research, and with it, your scholarly profile. Depositing datasets in a repository makes them both visible and citable. You can include them in your CV and grant application biosketches. In turn, scholars around the world can begin working with your data and crediting you. Indeed, sharing detailed research data has been associated with increased citation rates (check out this Piwowar et al. study, among others).

Publishing your data may also be required. Federal funders (e.g. National Institutes of Health), grant agencies (e.g. Bill & Melinda Gates Foundation), and journal publishers (e.g. PLoS and other journals listed in this Open Access Directory) increasingly require that datasets be made publicly available to readers—often immediately upon associated article publication.

How Do We Publish Data?

Merely uploading your dataset to a personal or departmental website won’t achieve these aims of promoting knowledge and progress. Datasets should be able to link seamlessly to any research articles they support. Their metadata should be compatible with bibliographic management and citation systems (e.g. CrossRef or RefWorks), and be formatted for crawling by abstracting and indexing services. After all, you want to be able to find other people’s datasets, manage them in your own reference manager, and cite them as appropriate. So, you’d want your own dataset to be positioned for the same discoverability and ease of use.

How can you achieve all this? It sounds daunting, but it’s actually pretty straightforward. You’ll just want to select a data publishing tool or service that is built around both preservation and discoverability: it should offer you a stable location or DOI (which will provide a persistent link to your data’s location), help you create sufficient metadata to facilitate transparency and reproducibility, and optimize the metadata for search engines.
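To make "metadata optimized for search engines" a little more concrete, here is a minimal sketch of the kind of machine-readable description a repository might generate for a dataset. It uses the schema.org Dataset vocabulary serialized as JSON-LD, which is one common way to do this. The function name, the sample title and author, and the DOI below are all hypothetical placeholders, and any particular repository (Dash included) may structure its records differently.

```python
import json


def build_dataset_record(title, description, doi, creator_name, keywords):
    """Return a schema.org Dataset description as a plain dictionary.

    Illustrative helper only; it is not part of any repository's API.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": title,
        "description": description,
        # A DOI expressed as a URL gives readers a persistent link.
        "identifier": f"https://doi.org/{doi}",
        "creator": {"@type": "Person", "name": creator_name},
        "keywords": list(keywords),
    }


record = build_dataset_record(
    title="Example field survey responses",  # placeholder title
    description="Anonymized responses collected for an example study.",
    doi="10.1234/example-doi",  # hypothetical DOI, not a real dataset
    creator_name="Example Researcher",
    keywords=["survey", "example"],
)

# Serialize to JSON-LD text, the form a web crawler would read.
print(json.dumps(record, indent=2))
```

In practice you would rarely write this by hand. A good publishing service collects the same information through its deposit form and produces the machine-readable record for you.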

For instance, UC’s Dash tool is a terrific and easy-to-use solution that preserves and publishes your datasets. At the Feb. 16 workshop we’re hosting, you can learn more about how to prepare, describe, and upload your data for deposit and publishing with Dash and other tools.

We also recommend that, if your chosen publishing tool enables it, you include your ORCID (a persistent digital identifier) with your datasets, just as you would with all your other research. This way, your research and scholarly output will be collected in one place, and it will become easier for others to discover and credit your work.
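If you keep your own records of dataset metadata, it can help to confirm that an ORCID was typed correctly before you attach it to a deposit. The sketch below assumes the publicly documented ORCID format: 16 characters in four hyphen-separated groups, where the last character is an ISO 7064 MOD 11-2 check digit (0 to 9, or X). The function names are our own, chosen for illustration. The check only confirms that an identifier is well formed, not that it belongs to a particular person.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if the string looks like a well-formed ORCID iD."""
    compact = orcid.replace("-", "")
    if len(compact) != 16:
        return False
    base, check = compact[:15], compact[15].upper()
    # Restrict to ASCII digits so int() above cannot fail.
    if not (base.isascii() and base.isdigit()):
        return False
    return orcid_check_digit(base) == check


# 0000-0002-1825-0097 is the sample iD used in ORCID's own documentation.
print(is_valid_orcid("0000-0002-1825-0097"))
# Changing the last digit breaks the checksum.
print(is_valid_orcid("0000-0002-1825-0098"))
```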

What Does it Mean to License Your Data For Reuse?

Uploading a dataset—with good metadata, of course!—to a repository is not the end of the road for shepherding one’s research. We must also consider what we are permitting other researchers to do with our data. And, what rights do we, ourselves, have to grant such permissions—particularly if we got the data from someone else, or the datasets were licensed to us for a particular use?

To better understand these issues, we first have to distinguish between attribution and licensing. Citing datasets is an essential scholarly practice. But the issue of someone citing your data is separate from the question of whether it’s permissible for them to use the data in the first place. That is, what license for reuse have you applied to the dataset?

The type of reuse we can grant depends on whether we own our research data and hold copyright in it. There can be a number of possibilities here. For instance, sometimes the terms of contracts we’ve entered into (e.g. funder/grant agreements, website terms of use, etc.) dictate data ownership and copyright. Sometimes, our employers own our research data under our employment contracts (e.g. the research data is “work-for-hire”). In some cases, the dataset might not be copyrightable to begin with if it does not constitute original expression. We could land in hot water if we try to grant licenses to data for which we don’t actually hold rights. For an excellent summary addressing these “Who owns your data?” questions, including copyright issues, check out this blog post by Katie Fortney written for the UC system-wide Office of Scholarly Communication.

To streamline ownership and copyright questions, and to promote data reuse, data repositories will often simply apply a particular “Creative Commons” license or public domain designation to all deposited datasets. For instance:

Dryad and BioMed Central repositories apply a Creative Commons Zero (CC0) designation to deposited data—meaning that, by depositing in those repositories, you are not reserving any copyright that you might have. Someone using your dataset still should cite the dataset to comply with scholarly norms, but you cannot mandate that they attribute you and cannot pursue copyright claims against them.
UC Dash applies a Creative Commons Attribution (CC-BY) license to datasets deposited by UC researchers. This means that someone using your Dash-deposited dataset not only should cite it to adhere to scholarly norms, but also is required to attribute you as the author.

What’s the Right License or Designation for Your Data?

Well, sometimes you don’t have a say in the matter, as your funding agreement or the repository you choose dictates the license applied. Otherwise, it’s worth considering what your goals are for sharing the data to begin with, and selecting a designation or license that both meets your needs and fits within whatever ownership and use rights you have over the data. Your Scholarly Communication Officer or librarian can help you with this.

Bear in mind that ambiguity surrounding the ability to reuse data inhibits the pace of research. So, try to identify clearly for potential users what rights are being granted in the dataset you publish.

How To Learn More if You’re a UC Berkeley Researcher

Come to the workshop, of course! For data publishing questions, contact the Research Data Management team at researchdata@berkeley.edu. For questions about data ownership, copyright, or licensing, contact the Library’s Scholarly Communication Officer at rsamberg@berkeley.edu. You can also check out the Research Data Management website for more on preserving and disseminating your data. In the meantime, we hope to see you at the workshop next week!

by Rachael Samberg in Scholarly Communications on February 9th, 2017