GitHub: Archiving and Repositories

GitHub has become ubiquitous in the coding world and, with the advent of data science and computation in a slew of other disciplines, researchers are turning to the version control repository and hosting service. Google uses it, Microsoft uses it, and it’s on the list of the top 100 most popular sites on Earth. As a librarian and a member of the Research Data Management team, I often get the question: “Can I archive my code in my GitHub repository?” From the research data management perspective, the answer is a little sticky.

The terms “archive” and “repository” mean something very different on GitHub than they do in research data management. For example, in GitHub, a repository “contains all of the project files…and stores each file’s revision history.” Archiving content on GitHub means that your repository will stay on GitHub until you choose to remove it (or until GitHub receives a DMCA takedown notice, or the content violates their guidelines or terms of service).

For librarians, research data managers, and many funders and publishers, archiving content in a repository means meeting more stringent requirements. For example, Dryad, a commonly known repository, requires those who wish to remove content to go through a lengthy process proving that a work has been infringed or is not in compliance with the law (read more about removing content from Dryad here). Most importantly, Dryad and many other repositories take specific steps to preserve the research materials. For example:
* persistent identification
* fixity checks
* versioning
* multiple copies kept in a variety of storage sites

A good repository provides persistent access to materials, enables discovery, and, while it cannot guarantee against data loss, takes multiple steps to prevent it.
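
To make the idea of a fixity check concrete, here is a minimal sketch in Python. It is not how Dryad or any particular repository implements its checks; it only assumes that a checksum (SHA-256 here) is recorded for every file at deposit time and compared again later. The function names `sha256_of`, `build_manifest`, and `check_fixity` are made up for this illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def check_fixity(root: Path, manifest: dict[str, str]) -> list[str]:
    """Compare the files under root against a previously saved manifest.

    Returns a list of problems; an empty list means every recorded file
    is still present and unchanged.
    """
    current = build_manifest(root)
    problems = []
    for name, expected in manifest.items():
        if name not in current:
            problems.append(f"missing: {name}")
        elif current[name] != expected:
            problems.append(f"changed: {name}")
    return problems
```

In practice you would call `build_manifest` once when the files are deposited, store the result somewhere safe, and run `check_fixity` on a schedule. Any entry in the returned list signals that a file has gone missing or that its contents no longer match what was deposited.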

So, how can you continue to work efficiently through GitHub and adhere to good archival practices? GitHub links up with Zenodo, a repository based out of CERN. Data files are stored at CERN with another site in Budapest. All data is backed up daily, with regular fixity and authenticity checks. Zenodo assigns a digital object identifier (DOI) to your code, making it persistently identifiable and discoverable. Check out this guide on Making Your Code Citable for more information on linking your GitHub with Zenodo. Zenodo isn’t perfect and there are a few limitations, including a maximum file size of 50 GB. Read more about their policies here.

UC Berkeley has its own institutional version of GitHub, which means that Berkeley development teams and individual contributors can now have private repositories (and private, shared repositories within the Berkeley domain). If you’d like access, please email github@berkeley.edu. Additionally, we have institutional subscriptions to Overleaf and ShareLaTeX, both of which integrate with GitHub.

Please contact researchdata@berkeley.edu if you’d like more information about archiving your code on GitHub.