Data & Digital Scholarship
Wikiphiliacs, Unite! (At our Wikipedia Editathon, on Valentine’s Day, 2024)
I am a proud Wikiphiliac. At least, according to the Urban Dictionary, which defines Wikiphilia as “a powerful obsession with Wikipedia”. I have many of the signs it warns of, including “accessing Wikipedia several times a day…spending much more time on Wikipedia than originally intended [and]… compulsively switching to other Wikipedia articles, using the hyperlinks within articles, often without obtaining the originally sought information and leaving a bizarre informational “trail” in his/her browsing history” (but that last part is just normal life as a librarian).
How else do I love Wikipedia? Let me count the ways! As a librarian, I always approach crowd-sourced information with a critical eye, but I also admire that Wikipedia has its own standards for fact-checking, and in fact some topics are locked to public editing. It takes its mission very seriously. It also has an accessible and neutral tone. Especially when I want to learn about a technical topic, it can give me a straightforward and helpful way to approach it. I also use it pretty routinely as a way to look at collections of sources about a topic; when I was a medical librarian, I was asked for data on the condition neurofibromatosis, and at that time the best basic links I found were in the references for the Wikipedia article. Last and maybe most importantly, the fact that anyone can edit is a huge strength…with challenges. Wikipedia openly admits its content is skewed by the gender and racial imbalance of its editors, and knowing this is part of approaching it critically, but it also means that IT CAN CHANGE, and WE CAN CHANGE IT.
Given that philia, a word taken from Ancient Greek (according to the philia Wikipedia article), means affection for or love of something, it’s fitting that our 2024 Wikipedia Editathon is part of UC’s Love Data Week, and happens on Valentine’s Day. If you would like to learn to contribute to this amazing resource, and perhaps even help diversify its editorial pool, we can get you started! There isn’t yet a Wikipedia page on Wikiphilia, but maybe you could create one! There already is a podcast series…
If you’re interested in learning more, we warmly welcome you and invite you to join us on Wednesday, February 14, from 1-2:30 for the 2024 UC Berkeley Libraries Wikipedia Editathon. No experience is required—we will teach you all you need to know about editing! (but, if you want to edit with us in real time, please create a Wikipedia account before the workshop—information on how to do that is on the registration page). The link to register is here, and you can contact any of the workshop leaders with questions. We hope you will join us, and we look forward to editing with you!
NOTE: the Wikipedia Editathon is just one of the programs that’s part of the University of California’s Love Data Week 2024! Don’t forget to check out all the other great UC Love Data Week offerings—this year UC Berkeley Librarians are hosting/co-hosting SIX different sessions! Here are those UCB-led workshop links, and the full calendar is linked here:
Thinking About and Finding Health Statistics & Data
GIS & Mapping: Where to Start
Cultivating Collaboration: Getting Started with Open Research
Code-free Data Analysis
Wikipedia Edit-a-thon
Getting Started with Qualitative Data Analysis
Love Data Week 2024
The UC-wide Love Data Week, brought to you by UC Libraries, will be a jam-packed week of data talks, presentations, and workshops Feb. 12-16, 2024. With over 30 presentations and workshops, there’s plenty to choose from, with topics such as:
- Code-free data analysis
- Open Research
- How to deal with large datasets
- Geospatial analysis
- Drone data
- Cleaning and coding data for qualitative analysis
- 3D data
- Tableau
- Navigating AI
All members of the UC community are invited to attend these events to gain hands-on experience, learn about resources, and engage in discussions about data needs throughout the research process. To register for workshops during this week and see what other sessions will be offered UC-wide, visit the UC Love Data Week 2024 website.
PhiloBiblon 2023 n. 6 (octubre): PhiloBiblon White Paper
A requirement of the NEH Foundation grant for PhiloBiblon, “PhiloBiblon: From Siloed Databases to Linked Open Data via Wikibase: Proof of Concept” (PW-277550-21) was the preparation of a White Paper to summarize its results and provide advice and suggestions for other projects that have enthusiastic volunteers but little money:
White Paper
NEH Grant PW-277550-21
October 10, 2023
The proposal for this grant, “PhiloBiblon: From Siloed Databases to Linked Open Data via Wikibase: Proof of Concept,” submitted to NEH under the Humanities Collections and Reference Resources Foundations grant program, set forth the following goals:
This project will explore the use of the FactGrid: database for Historians Wikibase platform to prototype a low-cost light-weight development model for PhiloBiblon:
(1) show how to map PhiloBiblon’s complex data model to Linked Open Data (LD) / Resource Description Framework (RDF) as instantiated in Wikibase;
(2) evaluate the Wikibase data entry module and create prototype query modules based on the FactGrid Query Service;
(3) study Wikibase’s LD access points to and from libraries and archives;
(4) test the Wikibase data export module for JSON-LD, RDF, and XML on PhiloBiblon data;
(5) train PhiloBiblon staff in the use of the platform;
(6) place the resulting software and documentation on GitHub as the basis for a final “White Paper” and follow-on implementation project.
A Wikibase platform would position PhiloBiblon to take advantage of current and future semantic web developments and decrease long-term sustainability costs. Moreover, we hope to demonstrate that this project can serve as a model for low-cost light-weight database development for similar academic projects with limited resources.
PhiloBiblon is a free internet-based bio-bibliographical database of texts written in the various Romance vernaculars of the Iberian Peninsula during the Middle Ages and the early Renaissance. It does not contain the texts themselves; rather it attempts to catalog all their primary sources, both manuscript and printed, the texts they contain, the individuals involved with the production and transmission of those sources and texts, and the libraries holding them, along with relevant secondary references and authority files for persons, places, and institutions.
It is one of the oldest digital humanities projects in existence, and the oldest in the Hispanic world, starting out as an in-house database for the Dictionary of the Old Spanish Language project (DOSL) at the University of Wisconsin, Madison, in 1972, funded by NEH. Its initial purpose was to locate manuscripts and printed texts physically produced before 1501 to provide a corpus of authentic lexicographical material for DOSL. It soon became evident that the database would also be of interest to scholars elsewhere; and a photo-offset edition of computer printout was published in 1975 as the Bibliography of Old Spanish Texts (BOOST). It contained 977 records, each one listing a given text in a given manuscript or printed edition. A second edition followed in 1977 and a third in 1984.
PhiloBiblon was published in 1992 on CD-ROM, incorporating not only the materials in Spanish but also those in Portuguese and Catalan. By this time BOOST had been re-baptized as BETA (Bibliografía Española de Textos Antiguos), while the Portuguese corpus became BITAGAP (Bibliografia de Textos Antigos Galegos e Portugueses) and the Catalan corpus BITECA (Bibliografia de Textos Antics Catalans, Valencians i Balears). PhiloBiblon was ported to the web in 1997; and the web version was substantially re-designed in 2015. PhiloBiblon’s three databases currently hold over 240,000 records.
All of this data has been input manually by dozens of volunteer staff in the U.S., Spain, and Portugal, either by keyboarding or by cutting-and-pasting, representing thousands of hours of unpaid labor. That unpaid labor has been key to expanding the databases, but just as important, and much more difficult to achieve, has been the effort to keep up with the display and database technology. The initial database management system (DBMS) was FAMULUS, a flat-file DBMS originally developed at Berkeley in 1964, running on the Univac 1110 at Madison. In 1985 the database was mapped to SPIRES (Stanford Public Information Retrieval System) and then, in 1987, to a proprietary relational DBMS, Revelation G, running on an IBM PC.
Today we continue to use Revelation Technology’s OpenInsight on Windows, the lineal descendant of Revelation G. We periodically export data from the Windows database in XML format and upload it to a server at Berkeley, where the XTF (eXtensible Text Framework) program suite parses it into individual records, indexes it, and serves it up on the fly in HTML format in response to queries from users around the world. The California Digital Library developed XTF as open source software ca. 2010, but it is now in the process of being phased out and is no longer supported by the UC Berkeley Library.
The need to find a substitute for XTF caused us to rethink our entire approach to the technologies that make PhiloBiblon possible. Major upgrades to the display and DBMS technology, triggered either by technological change or by a desire to enhance web access, have required significant grant support, primarily from NEH (eleven grants from 1989 to 2021). We applied for the current grant in the hope that it would show us how to get off the technology merry-go-round. Instead of seeking major grant support every five to seven years for bespoke technology, this pilot project was designed to demonstrate that we could solve our technology problems for the foreseeable future by moving PhiloBiblon to Wikibase, the technology underlying Wikipedia and Wikidata. Maintained by Wikimedia Deutschland, the software development arm of the Wikimedia Foundation, Wikibase is made available for free. With Wikibase, we would no longer have to raise money to support our software infrastructure.
We have achieved all of the goals of the pilot project under this current grant and placed all of our software development work on GitHub (see below). We received a follow-on two-year implementation grant from NEH and on 1 July 2023 began work to map all of the PhiloBiblon data from the Windows DBMS to FactGrid.
❧ ❧ ❧
For the purposes of this White Paper, I shall focus on the PhiloBiblon pilot project as a model for institutions with limited resources for technology but dedicated volunteer staff. There are thousands of such institutions in the United States alone, in every part of the country, joined in national and regional associations, e.g., the American Association for State and Local History, Association of African American Museums, Popular Culture Association, Asian / Pacific / American Archives Survey Project, Southeastern Museums Conference. Many of their members are small institutions that depend on volunteer staff and could use the PhiloBiblon model to develop light-weight low-cost databases for their own projects. In the San Francisco Bay Area alone, for example, there are dozens of such small cultural heritage institutions (e.g., The Beat Museum, GLBT Historical Society Archives, Holocaust Center Library and Archives, Berkeley Architectural History Association).
To begin at the beginning: What is Linked Open Data and why is it important?
What is Wikibase, why use it, and how does it work?
Linked Open Data (LOD) is the defining principle of the semantic web: “globally accessible and linked data on the internet based on the standards of the World Wide Web Consortium (W3C), an open environment where data can be created, connected and consumed on internet scale.”
Why use it? Simply, data has more value if it can be connected to other data, if it does not exist in a silo.
Wikibase in turn is the “free software for open data projects. It connects the knowledge of people and organizations and enables them to open their linked data to the world.” It is one of the backbone technologies of the LOD world.
Why use it? The primary reason to use Wikibase is precisely to make local or specialized knowledge easily available to the rest of the world by taking advantage of LOD, the semantic web. Conversely, the semantic web makes it easier for local institutions to take advantage of LOD.
How does Wikibase work? The Wikibase data model is deceptively simple. Each record has a “fingerprint” consisting of a Label, a Description, and an optional Alias. This fingerprint uniquely identifies the record. It can be repeated in multiple languages, although in every case the Label and the Description in the other languages must also be unique. Following the fingerprint header comes a series of three-part statements (triples, the building blocks of triplestores) that link a (1) subject Q to an (2) object Q by means of a (3) property P. The new record itself is the subject, to which Wikibase automatically assigns a unique Q#. There is no limit, except that of practicality, to the number of statements that a record can contain. They can be input in any order, and new statements are simply appended at the end of the record. No formal ontology is necessary, although having one is certainly useful, as librarians have discovered over the past sixty years. Most records start with a statement of identity, e.g.: Jack Kerouac (Q160534) [is an] Instance of (P31) Human (Q5).[1] Each statement can be qualified with sub-statements and footnoted with references. Because Wikibase is part of the LOD world, each record can be linked to the existing rich world of LOD identifiers: Jack Kerouac (Q160534) in the Union List of Artist Names ID (P245) is ID 500290917.
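The record structure just described can be sketched as a small data structure. This is only an illustration of the shape of a record, not Wikibase's actual internal representation: the class and field names are hypothetical, and the English description text is a placeholder. The Q# and P# values are the ones cited above.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """One three-part statement: the record (subject) -> property -> value."""
    prop: str                                        # P#, e.g. "P31"
    value: str                                       # a Q# or a literal such as an external ID
    qualifiers: list = field(default_factory=list)   # optional sub-statements
    references: list = field(default_factory=list)   # optional footnotes


@dataclass
class Record:
    """Hypothetical model of a record: fingerprint plus a list of statements."""
    qid: str                                         # assigned automatically by Wikibase
    labels: dict                                     # language code -> Label
    descriptions: dict                               # language code -> Description
    aliases: dict = field(default_factory=dict)      # language code -> list of aliases
    statements: list = field(default_factory=list)

    def add(self, prop, value):
        """New statements are simply appended at the end of the record."""
        self.statements.append(Statement(prop, value))
        return self


kerouac = Record(
    qid="Q160534",
    labels={"en": "Jack Kerouac"},
    descriptions={"en": "American writer"},          # placeholder description
)
kerouac.add("P31", "Q5")                             # Instance of -> Human
kerouac.add("P245", "500290917")                     # Union List of Artist Names ID

for s in kerouac.statements:
    print(kerouac.qid, s.prop, s.value)
```

Each printed line is one triple: the record's Q#, the property, and the value.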
Another important reason for using Wikibase is the flexibility that it allows in tailoring Q items and P properties to the needs of the individual institution. There is no need to develop an ontology or schema ahead of time; it can be developed on the fly, so to speak. There is no need to establish a hierarchy of subject headings, for example, like that of the Library of Congress as set forth in the Library of Congress Subject Headings (LCSH). LC subject headings can be extended as necessary or entirely ignored. Other kinds of data can also be added:
New P properties to establish categories: nicknames, associates (e.g., other members of a rock band), musical or artistic styles;
New Q items related to the new P properties (e.g., the other members of the band).
There is no need to learn the Resource Description and Access (RDA) rules necessary for highly structured data, such as MARC or its eventual replacement, BIBFRAME. This in turn means that data input does not need persons trained in librarianship.
How would adoption of Wikibase to catalog collections, whether of books, archival materials, or physical objects, work in practice? What decisions must be made? The first decision is simply whether (1) to join Wikidata or (2) set up a separate Wikibase instance (like FactGrid).[2] The former is far simpler. It requires no programming experience at all and very little knowledge of data science. Joining Wikidata simply means mapping the institution’s current database to Wikidata through a careful analysis of the database in comparison with Wikidata. For example, a local music history organization, like the SF Music Hall of Fame, might want to organize an archive of significant San Francisco musicians.
The first statement in the record of rock icon Jerry García might be Instance of (P31) Human (Q5); a second statement might be Sex or Gender (P21) Male (Q6581097); and a third, Occupation (P106) Guitarist (Q855091).
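As a rough illustration, the three statements above could be written out as commands for QuickStatements, the batch-editing tool. The sketch below assumes the tab-separated version 1 command format (a CREATE line followed by LAST lines), and the helper function name is hypothetical. A record for a well-known musician probably already exists in Wikidata, so the CREATE step here is only illustrative of what one might do in a fresh local Wikibase.

```python
def quickstatements_for_new_item(label, statements, lang="en"):
    """Build QuickStatements (v1, tab-separated) commands for one new record.

    `statements` is a list of (P#, Q#) pairs; LAST refers to the item
    created by the preceding CREATE command.
    """
    lines = ["CREATE", f'LAST\tL{lang}\t"{label}"']
    for prop, value in statements:
        lines.append(f"LAST\t{prop}\t{value}")
    return "\n".join(lines)


garcia = quickstatements_for_new_item(
    "Jerry García",
    [
        ("P31", "Q5"),          # Instance of -> Human
        ("P21", "Q6581097"),    # Sex or Gender -> Male
        ("P106", "Q855091"),    # Occupation -> Guitarist
    ],
)
print(garcia)
```

The resulting text could be pasted into a QuickStatements batch window after checking it against the tool's current documentation.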
Once the institutional database properties have been matched to the corresponding Wikidata properties, the original database must be exported as a CSV (comma separated values) file. Its data must then be compared systematically to Wikidata through a process known as reconciliation, using the open source OpenRefine tool. This same reconciliation process can be used to compare the institutional database to a large number of other LOD services through Mix n Match, which lists hundreds of external databases in fields ranging alphabetically from Art to Video games. Thus the putative SF Music Hall of Fame database might be reconciled against the large Grammy Awards (5700 records) database of the Recording Academy.
Reconciliation is important because it establishes links between records in the institutional database and existing records in the LOD world. If there are no such records, the reconciliation process creates new records that automatically become part of the LOD world.
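The idea behind reconciliation can be shown with a deliberately simplified sketch. This is not how OpenRefine works internally; it is a stand-in that uses fuzzy string matching from the Python standard library. The CSV contents, the table of known labels, and the identifiers "Q-A" and "Q-B" are all hypothetical placeholders.

```python
import csv
import difflib
import io

# Hypothetical CSV export from the institutional database.
EXPORT = """id,name
1,Jerry Garcia
2,Janis Joplin
3,Local Garage Band
"""

# Hypothetical slice of labels already present in the target LOD service.
# The identifiers are placeholders, not real Q# values.
KNOWN = {"Jerry García": "Q-A", "Janis Joplin": "Q-B"}


def reconcile(rows, known, cutoff=0.85):
    """Split rows into those matching an existing label and those that do not."""
    matched, unmatched = {}, []
    for row in rows:
        hits = difflib.get_close_matches(row["name"], list(known), n=1, cutoff=cutoff)
        if hits:
            matched[row["id"]] = known[hits[0]]   # link to the existing record
        else:
            unmatched.append(row["name"])         # candidate for a new record
    return matched, unmatched


rows = list(csv.DictReader(io.StringIO(EXPORT)))
matched, unmatched = reconcile(rows, KNOWN)
print("linked to existing records:", matched)
print("would become new records:", unmatched)
```

In practice a person reviews the proposed matches before accepting them, since similar names do not always refer to the same entity.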
One issue to consider is that, like Wikipedia, anyone can edit Wikidata. This has both advantages and disadvantages. The advantage is that outside users can correct or expand records created by the institution. The disadvantage is that a malicious user or simply a well-intentioned but poorly informed one can also damage records by the addition of incorrect information.
In the implementation of the new NEH grant (2023-2025), we hope to have it both ways. Our new user interface will allow, let us say, a graduate student looking at a medieval Spanish manuscript in a library in Poland to add information about that manuscript through a template. However, before that information can be integrated into the master database, it would have to be vetted by a PhiloBiblon editorial committee.
The second option, to set up a separate Wikibase instance, is straightforward but not simple. The Wikibase documentation is a good place to start, but it assumes a fair amount of technical expertise. Matt Miller (currently at the Library of Congress) has provided a useful tutorial, Wikibase for Research Infrastructure, explaining how to set up a Wikibase instance and the steps required to go about it. Our programmer, Josep Formentí, has made this more conveniently available on a public GitHub repository, Wikibase Suite on Docker, which installs a standard collection of Wikibase services via Docker Compose V:
Wikibase
Query Service
QuickStatements
OpenRefine
Reconcile Service
The end result is a local Wikibase instance, like the one created by Formentí on a server at UC Berkeley as part of the new PhiloBiblon implementation grant: PhiloBiblon Wikibase instance. He used as his basis the suite of programs at Wikibase Release Pipeline. Formentí has also made available on GitHub his work on the PhiloBiblon user interface mentioned above. This would serve PhiloBiblon as an alternative to the standard Wikibase interface.
Once the local Wikibase instance has been created, it is essentially a tabula rasa. It has no Properties and no Items. The properties would then have to be created manually, based on the structure of the existing database or on Wikidata. By definition, the first property will be P1. Typically it will be “Instance of,” corresponding to Instance of (P31) in Wikidata.
The Digital Scriptorium project, a union catalog of medieval manuscripts in North American libraries now housed at the University of Pennsylvania, went through precisely this process when it mapped 67 data elements to Wikibase properties created specifically for that project. Thus property P1 states the Digital Scriptorium ID number; P2 states the current holding institution, etc.
Once the properties have been created, the next step is to import the data in a batch process, as described above, by reconciling it with existing databases. Miller explains alternative methods of batch uploads using Python scripts.
Getting the initial upload of institutional data into Wikidata or a local Wikibase instance is the hard part, but once that initial upload has been accomplished, all data input from then on can be handled by non-technical staff. To facilitate the input of new records, properties can be listed in a spreadsheet in the canonical input order, with the P#, the Label, and a short Description. Most records will start with the P1 property “Institutional ID number” followed by the value of the identification number in the institutional database. The Cradle or Shape Expressions tools, with the list of properties in the right order, can generate a ready-made template for the creation of new records. Again, this is something that an IT specialist would implement during the initial setup of a local Wikibase instance.
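The spreadsheet of properties described above can drive a very simple input template. The sketch below does not use Cradle or Shape Expressions; it is a plain stand-in showing the idea. The property numbers, labels, and descriptions other than the P1 “Institutional ID number” are hypothetical.

```python
import csv
import io

# Hypothetical property list, in canonical input order.
PROPERTIES = """pid,label,description
P1,Institutional ID number,Identifier in the institutional database
P2,Instance of,Type of thing the record describes
P3,Occupation,Occupation of a person
"""


def blank_template(properties_csv):
    """Return one empty slot per property, preserving the spreadsheet order."""
    reader = csv.DictReader(io.StringIO(properties_csv))
    return [{"pid": r["pid"], "label": r["label"], "value": ""} for r in reader]


template = blank_template(PROPERTIES)
for slot in template:
    print(f'{slot["pid"]:>4}  {slot["label"]}: ______')
```

Non-technical staff would then fill in the values in the order shown, starting with the institutional identification number.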
New records can be created easily by inputting statements following the canonical order in the list of properties. New properties can also be created if it is found, over time, that relevant data is not being captured. For example, returning to the Jerry García example, it might be useful to specify “rock guitarist” (Q#) as a subclass of “guitarist.”
The institution would then need to decide whether the local Wikibase instance is to be open or closed. If it were entirely open, it would be like Wikidata, making crowd-sourcing possible. If it were closed, only authorized users could add or correct records. PhiloBiblon is exploring a third option for its user interface, crowdsourcing mediated by an editorial committee that would approve additions or changes before they could be added to the database.
One issue remains, searching:
Wikibase has two search modes, one of which is easy to use, and one of which is not.
- The basic search interface is the ubiquitous Google box. As the user types in a request, the potential records show up below it until the user sees and clicks on the requested record. If no match is found, the user can then opt to “Search for pages containing [the search term],” which brings up all the pages in which the search term occurs, although there is no way to sort them. They show up neither in alphabetical order of the Label nor in numerical order of the Q#.
- More precise and targeted searches must make use of the Wikibase Query Service, which opens a “SPARQL endpoint,” a window in which users can program queries using the SPARQL query language. SPARQL, pronounced “sparkle,” is a recursive acronym for “SPARQL Protocol And RDF Query Language,” designed by and for the World Wide Web Consortium (W3C) as the standard language for LOD triplestores, just as SQL (Structured Query Language) is the standard language for relational database tables.
SPARQL is not for the casual user. It requires some knowledge of SPARQL or similar query languages as well as of the specifics of Wikibase items and properties. Many Wikibase installations offer “canned” SPARQL queries. In Wikidata, for example, one can use a canned query to find all of the pictures of the Dutch artist Jan Vermeer and plot their current locations on a map, with images of the pictures themselves. In fact, Wikidata offers over 400 examples of canned queries, each of which can then serve as a model for further queries.
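To give a sense of what such a query involves, here is a sketch along the lines of the Vermeer example, wrapped in a few lines of Python that build the request URL for the Wikidata SPARQL endpoint without sending it. This is not the canned query itself. The identifiers (Q41264 for Vermeer, P170 creator, P276 location, P625 coordinates, P18 image) are given as I recall them and should be verified in Wikidata before use; the function name is hypothetical.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"

# Paintings by Vermeer with the coordinates of their current location.
# All Q# and P# values are assumptions to be checked against Wikidata.
QUERY = """
SELECT ?painting ?paintingLabel ?coord ?image WHERE {
  ?painting wdt:P170 wd:Q41264 ;      # creator: Johannes Vermeer
            wdt:P276 ?location .      # current location of the painting
  ?location wdt:P625 ?coord .         # coordinates of that location
  OPTIONAL { ?painting wdt:P18 ?image . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Encode the query as a GET request URL asking for JSON results."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})


url = build_request_url(QUERY)
print(url[:80] + "...")
```

Even this short query presupposes knowing which properties connect a painting to its creator and its location, which is exactly the knowledge a casual user lacks.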
How, then, to make more sophisticated searches available for those who do not wish to learn SPARQL?
For PhiloBiblon we are developing masks or templates to facilitate searching for, e.g., persons, institutions, works. Thus, the institutions mask allows for searches for free text, the institution, its location, its type (e.g., university), and subject headings.
This mimics the search structure of the PhiloBiblon legacy website.
The use of templates does not, however, address the problem of searching across different types of objects or of providing different kinds of outputs. For example, one could not use such a template to plot the locations and dates of Franciscans active in Spain between 1450 and 1500. For this one needs a query language, i.e., SPARQL.
We have just begun to consider this problem under the new NEH implementation grant. It might be possible to use a Large Language Model query service such as ChatGPT or Bard as an interface to SPARQL. A user might send a prompt like this: “Write a SPARQL query for FactGrid to find all Franciscans active in Spain between 1450 and 1500 and plot their locations and dates on a map and a timeline.” This would automatically invoke the SPARQL query service and return the results to the user in the requested format.
Other questions and considerations will undoubtedly arise for any institution or project contemplating the use of Wikibase for its database needs. Nevertheless, we believe that we have demonstrated that this NEH-funded project can serve as a model for low-cost light-weight database development for small institutions or similar academic projects with limited resources.
Questions may be addressed to Charles Faulhaber (cbf@berkeley.edu).
[1] For the sake of convenience, I use the Wikidata Q# and P# numbers.
[2] For a balanced discussion of whether to join Wikidata or set up a local Wikibase instance, see Lozana Rossenova, Paul Duchesne, and Ina Blümel, “Wikidata and Wikibase as complementary research data management services for cultural heritage data.” The 3rd Wikidata Workshop, Workshop for the scientific Wikidata community, @ ISWC 2022, 24 October 2022. CEUR_WS, vol-3262.
Charles Faulhaber
University of California, Berkeley
Free large-format scanning for UC Berkeley students, faculty, and staff
National Science Foundation Public Access Plan 2.0
- The agency will leverage the existing NSF Public Access Repository (NSF-PAR) to make research papers, either the author’s accepted manuscript (AAM) or the publisher’s version of record (VOR), available immediately. All papers will be available in machine-readable XML, which will make additional research through text and data mining (TDM) possible.
- The agency will continue to leverage relationships with long-standing disciplinary and generalist data repositories, like Dryad.
- All data and publications will have permanent identifiers (PIDs). Data PIDs will be included with the article metadata.
- The agency acknowledges the complexity in size, type, and quality of documentation with data. Publishing a dataset has far greater technical variability than publishing a manuscript. The agency will continue to explore how to best address data in the next two years.
- The NSF has long required data management plans (DMPs). DMPs will be renamed to “data management and sharing plans,” or DMSPs, to better describe the required documentation and align with other agencies, like the NIH.
The above bullets are a mere 5 items in the lengthy report. Most importantly, over this next year, the Data Collaboration Team will develop an inreach plan to ensure all librarians and staff know how the OSTP memo and resulting policy will impact them and their researchers. Following awareness within the library, we will work on developing a coordinated outreach approach to support our researchers as they adapt to new requirements. This work will be in coordination with the Office of Scholarly Communication Services, the Research Data Management Program, and other longstanding LDSP partners.
Data Analysis Workshop Series: partnering with the CDSS Data Science Discovery Consultants
With the increase in data science across all disciplines, most undergraduates will encounter basic data science concepts and be expected to analyze data at some point during their time at UC Berkeley. To address this growing need, the Library Data Services Program began partnering with the Data Science Discovery Consultants in the Division of Computing, Data Science, and Society (CDSS) on the Introduction to Data Analysis Support workshop series in Fall 2020. The Data Science Discovery Consultants are a group of undergraduates majoring in computer science, math, data science, and related fields who are hired as student employees. They receive training to offer consultation services across a wide range of topics, including Python, R, SQL, and Tableau, and they have existing partnerships with other groups on campus to provide instruction around data as a part of their program. Through the partnership, Data Science Discovery Consultants work with librarians to develop as instructors and gain experience constructing workshops and teaching technical skills. The end result is the creation of a peer-to-peer learning environment for novice undergraduate learners who want to begin working with data. The peer-to-peer learning model lowers the barrier to learning for other undergraduates and enhances motivation and understanding.
The Data Science Discovery Consultants enthusiastically embraced the core values of the Carpentries, through which they empower each other and the audience, collaborate with their community, and create inclusive spaces that welcome and extend empathy and kindness to all learners. In Fall 2022, attendance for the workshop series was opened up to local community college students who may be interested in transferring to UC Berkeley. One of the workshops was taught in Spanish, to provide an environment in which native Spanish speakers could better connect with one another and the content.
Diego Sotomayor, a former UC Berkeley Library student employee and current Data Science Discovery Consultant, taught the inaugural Introduction to Python in Spanish: Introducción al análisis de datos con Python. Diego comments that:
“Languages at events are no longer just a necessity but have gone to the next level of being essential to transmit any relevant information to the interested public. There are many people who only speak Spanish or another language other than English and intend to learn new topics through various platforms including workshops. However, because they are limited by only speaking a language that is not very popular, they get stuck in this desire to progress and learn. Implementing the workshop in different languages, not only in English but in Spanish and even others, is important to give the same opportunities and equal resources to people looking for opportunities.”
The UC Berkeley Library and the Division of Computing, Data Science, and Society hope to further provide these offerings for prospective transfer students in Fall 2023. Many thanks to Elliott Smith, Lisa Ngo, Kristina Bush (now at Tufts University), and Misha Coleman in the Library. Anthony Suen is the Library’s staff partner in the Data Science Discovery Program and Kseniya Usovich assists with outreach.
Workshop: HTML/CSS Toolkit for Digital Projects
Wednesday, May 3rd, 2:10-3:30pm
Online: Register to receive the Zoom link
Stacy Reardon and Kiyoko Shiosaki
If you’ve tinkered in WordPress, Google Sites, or other web publishing tools, chances are you’ve wanted more control over the placement and appearance of your content. With a little HTML and CSS under your belt, you’ll know how to edit “under the hood” so you can place an image exactly where you want it, customize the formatting of text, or troubleshoot copy & paste issues. By the end of this workshop, interested learners will be well-prepared for a deeper dive into the world of web design. Register here.
Please see bit.ly/dp-berk for details.
Workshop: By Design: Graphics & Images Basics
Thursday, April 6th, 3:10-4:30pm
Location: Doe 223
Lynn Cunningham
In this hands-on workshop, we will learn how to create web graphics for your digital publishing projects and websites. We will cover topics such as: sources for free public domain and Creative Commons images; image resolution for the web; and basic image editing tools in Photoshop. If possible, please bring a laptop with Photoshop installed. (All UCB faculty and students can receive a free Adobe Creative Suite license: https://software.berkeley.edu/adobe) Register here.
Upcoming Workshops in this Series – Spring 2022:
- HTML/CSS Toolkit for Digital Projects
Please see bit.ly/dp-berk for details.
Workshop: “Can I Mine That? Should I Mine That?”: A Clinic for Copyright, Ethics & More in TDM Research
Wednesday, March 8th, 11:10am-12:30pm
Online: Register to receive the Zoom link
Tim Vollmer and Stacy Reardon
If you are working on a computational text analysis project and have wondered how to legally acquire, use, and publish text and data, this workshop is for you! We will teach you 5 legal literacies (copyright, contracts, privacy, ethics, and special use cases) that will empower you to make well-informed decisions about compiling, using, and sharing your corpus. By the end of this workshop, and with a useful checklist in hand, you will be able to confidently design lawful text analysis projects or be well-positioned to help others design such projects. Consider taking alongside Copyright and Fair Use for Digital Projects. Register here.
Upcoming Workshops in this Series – Spring 2022:
- By Design: Graphics & Images Basics
- HTML/CSS Toolkit for Digital Projects
Please see bit.ly/dp-berk for details.
Text Analysis with Archival Materials: Gale Digital Scholar Lab
Thursday, February 16th, 2:00-3:00pm
Online: Register to receive the Zoom link
The Gale Digital Scholar Lab is a platform that allows researchers to do text data mining on archival collections available through Gale (see list below). During this session we’ll cover the workflow for using the Lab, focusing on the Build, Clean, and Analyze steps. We’ll review curating and creating a content set, developing clean configurations, applying text data mining analysis tools, and exporting your Lab results. We’ll also review new Lab updates and explore the Lab Learning Center.
Primary source collections available in Gale include: American Fiction, 17th and 18th Century Burney Collection, American Civil Liberties Union Papers, 1912-1990, Archives Unbound, Archives of Sexuality & Gender, British Library Newspapers, The Economist Historical Archive, Eighteenth Century Collections Online, Indigenous Peoples: North America, The Making of Modern Law, The Making of the Modern World, Nineteenth Century Collections Online, Nineteenth Century U.S. Newspapers, Sabin Americana, 1500-1926, The Times Digital Archive, The Times Literary Supplement Historical Archive, U.S. Declassified Documents Online.
This event is part of the UC-wide “Love Data Week” series of talks, presentations, and workshops to be held February 13-17, 2023. All events are free to attend and open to any member of the UC community. To see a full list of UC Love Data Week 2023 events, please visit: https://bit.ly/UC-LDW
Related LibGuide: Text Mining & Computational Text Analysis by Stacy Reardon