— Yasmina Anwar (@yasmina_anwar) February 13, 2018
Last week, the University Library, the Berkeley Institute for Data Science (BIDS), the Research Data Management program were delighted to host Love Data Week (LDW) 2018 at UC Berkeley. Love Data Week is a nationwide campaign designed to raise awareness about data visualization, management, sharing, and preservation. The theme of this year’s campaign was data stories to discuss how data is being used in meaningful ways to shape the world around us.
At UC Berkeley, we hosted a series of events designed to help researchers, data specialists, and librarians to better address and plan for research data needs. The events covered issues related to collecting, managing, publishing, and visualizing data. The audiences gained hands-on experience with using APIs, learned about resources that the campus provides for managing and publishing research data, and engaged in discussions around researchers’ data needs at different stages of their research process.
Participants from many campus groups (e.g., LBNL, CSS-IT) were eager to continue the stimulating conversation around data management. Check out the full program and information about the presented topics.
Photographs by Yasmin AlNoamany for the University Library and BIDS.
LDW at UC Berkeley was kicked off by a walkthrough and demos about Scopus APIs (Application Programming Interface), was led by Eric Livingston of the publishing company, Elsevier. Elsevier provides a set of APIs that allow users to access the content of journals and books published by Elsevier.
In the first part of the session, Eric provided a quick introduction to APIs and an overview about Elsevier APIs. He illustrated the purposes of different APIs that Elsevier provides such as DirectScience APIs, SciVal API, Engineering Village API, Embase APIs, and Scopus APIs. As mentioned by Eric, anyone can get free access to Elsevier APIs, and the content published by Elsevier under Open Access licenses is fully available. Eric explained that Scopus APIs allow users to access curated abstracts and citation data from all scholarly journals indexed by Scopus, Elsevier’s abstract and citation database. He detailed multiple popular Scopus APIs such as Search API, Abstract Retrieval API, Citation Count API, Citation Overview API, and Serial Title API. Eric also overviewed the amount of data that Scopus database holds.
In the second half of the workshop, Eric explained how Scopus APIs work, how to get a key to Scopus APIs, and showed different authentication methods. He walked the group through live queries, showed them how to extract data from API and how to debug queries using the advanced search. He talked about the limitations of the APIs and provided tips and tricks for working with Scopus APIs.
Eric left the attendances with actionable and workable code and scripts to pull and retrieve data from Scopus APIs.
On the second day, we hosted a Data Stories and Visualization Panel, featuring Claudia von Vacano (D-Lab), Garret S. Christensen (BIDS and BITSS), Orianna DeMasi (Computer Science and BIDS), and Rita Lucarelli (Department of Near Eastern Studies). The talks and discussions centered upon how data is being used in creative and compelling ways to tell stories, in addition to rewards and challenges of supporting groundbreaking research when the underlying research data is restricted.
Claudia von Vacano, the Director of D-Lab, discussed the Online Hate Index (OHI), a joint initiative of the Anti-Defamation League’s (ADL) Center for Technology and Society that uses crowd-sourcing and machine learning to develop scalable detection of the growing amount of hate speech within social media. In its recently-completed initial phase, the project focused on training a model based on an unbiased dataset collected from Reddit. Claudia explained the process, from identifying the problem, defining hate speech, and establishing rules for human coding, through building, training, and deploying the machine learning model. Going forward, the project team plans to improve the accuracy of the model and extend it to include other social media platforms.
Next, Garret S. Christensen, BIDS and BITSS fellow, talked about his experience with research data. He started by providing a background about his research, then discussed the challenges he faced in collecting his research data. The main research questions that Garret investigated are: How are people responding to military deaths? Do large numbers of, or high-profile, deaths affect people’s decision to enlist in the military?
Garret discussed the challenges of obtaining and working with the Department of Defense data obtained through a Freedom of Information Act request for the purpose of researching war deaths and military recruitment. Despite all the challenges that Garret faced and the time he spent on getting the data, he succeeded in putting the data together into a public repository. Now the information on deaths in the US Military from January 1, 1990 to November 11, 2010 that was obtained through Freedom of Information Act request is available on dataverse. At the end, Garret showed that how deaths and recruits have a negative relationship.
Orianna DeMasi, a graduate student of Computer Science and BIDS Fellow, shared her story of working with human subjects data. The focus of Orianna’s research is on building tools to improve mental healthcare. Orianna framed her story about collecting and working with human subject data as a fairy tale story. She indicated that working with human data makes security and privacy essential. She has learned that it’s easy to get blocked “waiting for data” rather than advancing the project in parallel to collecting or accessing data. At the end, Orianna advised the attendees that “we need to keep our eyes on the big problems and data is only the start.”
Rita Lucarelli, Department of Near Eastern Studies discussed the Book of the Dead in 3D project, which shows how photogrammetry can help visualization and study of different sets of data within their own physical context. According to Rita, the “Book of the Dead in 3D” project aims in particular to create a database of “annotated” models of the ancient Egyptian coffins of the Hearst Museum, which is radically changing the scholarly approach and study of these inscribed objects, at the same time posing a challenge in relation to data sharing and the publication of the artifacts. Rita indicated that metadata is growing and digital data and digitization are challenging.
It was fascinating to hear about Egyptology and how to visualize 3D ancient objects!
We closed out LDW 2018 at UC Berkeley with a session about Research Data Management Planning and Publishing. In the session, Daniella Lowenberg (University of California Curation Center) started by discussing the reasons to manage, publish, and share research data on both practical and theoretical levels.
Daniella shared practical tips about why, where, and how to manage research data and prepare it for publishing. She discussed relevant data repositories that UC Berkeley and other entities offer. Daniela also illustrated how to make data reusable, and highlighted the importance of citing research data and how this maximizes the benefit of research.
At the end, Daniella presented a live demo on using Dash for publishing research data and encouraged UC Berkeley workshop participants to contact her with any question about data publishing. In a lively debate, researchers shared their experiences with Daniella about working with managing research data and highlighted what has worked and what has proved difficult.
We have received overwhelmingly positive feedback from the attendees. Attendees also expressed their interest in having similar workshops to understand the broader perspectives and skills needed to help researchers manage their data.
I would like to thank BIDS and the University Library for sponsoring the events.
A collection of digitized texts marks the start of a research project — or does it?
For many social sciences and humanities researchers, creating searchable, editable, and machine-readable digital texts out of heaps of paper in archival boxes or from books painstakingly sourced from overlooked corners of the library can be a tedious, time-consuming process.
Scholars using traditional methodologies may find it advantageous to have a digital copy of their source material, if only to be able to more easily search through it. For anyone who wants to use computational methods and tools, converting print sources to digital text is a prerequisite. The process of converting an image of scanned text to digital text involves Optical Character Recognition (OCR) software. New developments in campus services are providing additional options for researchers who wish to prepare their texts this way.
What resources does UC Berkeley offer to convert scans to digital text?
- For basic needs, try the Library’s scanners.
- For documents with complex layouts or for additional language support, ABBYY FineReader with Berkeley’s OCR virtual desktop is a solution.
- Finally, Tesseract can handle large scale OCR projects.
Books and simple documents: library scanners with OCR software
All of the UC Berkeley libraries, including the Main (Gardner) Stacks, have at least one Scannx scanner station with built-in OCR software. This software automatically identifies and splits apart pages when you’re scanning a book, and it performs OCR on any text it can identify. You can save your results as a “Searchable PDF” (with embedded OCR output) or as a Microsoft Word document, or you can save page images as TIFF, JPEG, or PDF files (omitting digitized text). For book scanning or simple document scanning, the library scanners can take you from analog to digital in a single step.
Complex layouts or language support: ABBYY FineReader and Berkeley Research Computing’s OCR virtual desktop
If your source material has a complex layout (like irregular columns, embedded images, and/or tables that you want to continue to edit as tables) or uses a non-Latin alphabet, ABBYY FineReader OCR may get you better OCR results. FineReader supports Arabic, Chinese, Cyrillic, Greek, Hebrew, Japanese, and Thai, among other languages.
On campus, FineReader is available on computers in the D-Lab (350 Barrows). From off campus, the OCR virtual research desktop provided through Berkeley Research Computing’s AEoD service (Analytic Environments on Demand, pronounced “A-odd”) allows users to log into a virtual Windows environment from their own laptop or desktop computer anywhere there’s an internet connection. If you’re visiting an archive and aren’t sure that your image capture setup is getting good enough results to use as OCR input, you can log into the OCR virtual research desktop and try out a couple samples, then refine your process as needed. You can also work on your OCR project from home, or on nights and weekends when campus buildings are closed. To use the OCR virtual research desktop, sign up for access at http://research-it.berkeley.edu/ocr.
FineReader is not generally recommended for very large numbers of PDFs because each conversion must be started by hand. However, if you don’t need to differentiate the origin of your various source PDFs (e.g., if your text analysis will treat all text as part of a single corpus, and it doesn’t matter which of the million PDFs any particular bit of text originally came from), you might be able to use FineReader by creating one or more “mega-PDFs” that combine tens or hundreds of source PDFs and letting it run over a long period of time. At a certain point, however, Tesseract might be a better choice.
OCR at scale: Tesseract on the Savio high-performance compute cluster
If you have thousands, hundreds of thousands, or millions of PDFs to OCR, a high-powered, automated solution is usually best. One such option is the open source OCR engine Tesseract. Research IT has installed Tesseract in a container that you can use on the Savio high performance computing (HPC) cluster. For researchers who are less comfortable with the command line, there is also a Jupyter notebook available that provides the necessary commands and “human-readable” documentation, in a form that you can run on the cluster. Any tenure-track faculty member is eligible for a Faculty Computing Allowance for using Savio. For graduate students, talk to your advisor about signing up for an allowance and receiving access.
No matter how large or small your OCR project is, UC Berkeley has the perfect tool for you in scanning equipment, ABBYY FineReader, or Tesseract. Happy converting!
Related Event: From Sources to Data: Using OCR in the Classroom
March 16, 2017
10:30am to 12:00pm
Open to: All faculty, graduate students, and staff
Quinn Dombrowski, Research IT quinnd [at] berkeley.edu
— Yasmina Anwar (@yasmina_anwar) February 14, 2017
This week, the University Library and the Research Data Management program were delighted to participate in the Love Your Data (LYD) Week campaign by hosting a series of workshops designed to help researchers, data specialists, and librarians to better address and plan for research data needs. The workshops covered issues related to managing, securing, publishing, and licensing data. Participants from many campus groups (e.g., LBNL, CSS-IT) were eager to continue the stimulating conversation around data management. Check out the full program and information about the presented topics.
Photographs by Yasmin AlNoamany for the University Library.
The first day of LYD week at UC Berkeley was kicked off by a discussion panel about Securing Research Data, featuring
Jon Stiles (D-Lab, Federal Statistical RDC), Jesse Rothstein (Public Policy and Economics, IRLE), Carl Mason (Demography). The discussion centered upon the rewards and challenges of supporting groundbreaking research when the underlying research data is sensitive or restricted. In a lively debate, various social science researchers detailed their experiences working with sensitive research data and highlighted what has worked and what has proved difficult.
At the end, Chris Hoffman, the Program Director of the Research Data Management program, described a campus-wide project about Securing Research Data. Hoffman said the goals of the project are to improve guidance for researchers, benchmark other institutions’ services, and assess the demand and make recommendations to campus. Hoffman asked the attendees for their input about the services that the campus provides.
On the second day, we hosted a workshop about the best practices for using Box and bDrive to manage documents, files and other digital assets by Rick Jaffe (Research IT) and Anna Sackmann (UC Berkeley Library). The workshop covered multiple issues about using Box and bDrive such as the key characteristics, and personal and collaborative use features and tools (including control permissions, special purpose accounts, pushing and retrieving files, and more). The workshop also covered the difference between the commercial and campus (enterprise) versions of Box and Drive. Check out the RDM Tools and Tips: Box and Drive presentation.
We closed out LYD Week 2017 at UC Berkeley with a workshop about Research Data Publishing and Licensing 101. In the workshop, Anna Sackmann and Rachael Samberg (UC Berkeley’s Scholarly Communication Officer) shared practical tips about why, where, and how to publish and license your research data. (You can also read Samberg & Sackmann’s related blog post about research data publishing and licensing.)
In the first part of the workshop, Anna Sackmann talked about reasons to publish and share research data on both practical and theoretical levels. She discussed relevant data repositories that UC Berkeley and other entities offer, and provided criteria for selecting a repository. Check out Anna Sackmann’s presentation about Data Publishing.
During the second part of the presentation, Rachael Samberg illustrated the importance of licensing data for reuse and how the agreements researchers enter into and copyright affects licensing rights and choices. She also distinguished between data attribution and licensing. Samberg mentioned that data licensing helps resolve ambiguity about permissions to use data sets and incentivizes others to reuse and cite data. At the end, Samberg explained how people can license their data and advised UC Berkeley workshop participants to contact her with any questions about data licensing.
Check out the slides from Rachael Samberg’s presentation about data licensing below.
The workshops received positive feedback from the attendees. Attendees also expressed their interest in having similar workshops to understand the broader perspectives and skills needed to help researchers manage their data.
Special thanks to Rachael Samberg for editing this post.
You probably love maps (who doesn’t?!). They can be beautiful, visually compelling, interesting representations of the world. You might have one hanging on your wall, laugh over one showing how Oakland was the new Brooklyn even back in 1888, or exclaim over one (my new favorite) showing bear concentrations in Norway.
These same qualities that make us love maps are also why they can be excellent research tools. Even if you are not a geographer or urban planner you can use maps to provide context for a place you are describing, explore spatial relationships, or visualize your data in a way that highlights new patterns. For example, a music student used and made maps to trace the locations of 19th century Parisian opera goers!
In addition to the approximately half-a-million physical maps and air photos in the UC Berkeley Library’s collections, the library subscribes to several online databases that let you explore demographic data and create maps that you can share online or print. SimplyMap, Social Explorer, and Policy Map all cover the United States. If you are interested in China, the China Geo-Explorer II database has mappable census data.
There are also many freely available resources available online. My standby is Old Maps Online, which pulls together scanned maps from institutions around the world into a single search screen. Just zoom in to your area of the world, adjust the time slider, and explore!
Contact Susan Powell, GIS & Map Librarian, at smpowell[at]berkeley.edu if you’d like to find out more about how you can use maps — both print and digital — in your research. Or stop by the Earth Sciences & Map Library to explore the collection and find out more!
Wherever you are in your graduate career, a citation management tool is essential to organizing, writing and sharing your research. Two free and highly popular citation managers that run on Windows, Mac OS and Linux are Mendeley and Zotero. In short, Mendeley is frequently used by physical and life scientists and Zotero by social scientists and arts and humanities scholars. Below is a brief comparison.
|Access, edit and insert citations into a document offline||Yes||Yes|
|Microsoft Word plug-in||Yes||Yes|
|Automatic download of citations from OskiCat and the UCB Library discovery tool||No||Yes|
|Insert citations into Google Docs||No||Yes|
|Free Storage for PDFs||2GB||300 MB|
|Annotate PDFs from within the program||Yes||No|
|Attach web pages and screen captures||No||Yes|
|Recommendations of relevant and highly cited articles||Yes||No|
|Connect with a community of scholars (i.e., academic social network)||Yes||Kinda|
|Collaborate with colleagues in the cloud||Yesfree for up to 3 group members||Yesunlimited|
|Automatically create citation records from PDFs||Yes||Yes|
|Easy de-duplication of item entries||Yes||Yes|
Both citation managers allow you to easily download citation information and incorporate citations into your papers and publications. Each has over 7,000 citation styles covering the vast majority of journals you’ll publish in. Focus on research, reading and writing and leave citation management drudgery to either Mendeley or Zotero.
If you’d like to set up a Zotero training session for five or more, please contact David Eifler – deifler [at] berkeley.edu to arrange a convenient time.
If you are a regular user of Google (and at this point, who isn’t?), it may have registered with you that Google is trying to solve all your information needs. Need to convert currency? Type 100 dollars in Euros. Need to identify the location of an area code? Type in the area code. Headed to Seattle and want to know whether you need to bring an umbrella? Type weather in Seattle.
This is all very handy, but the most powerful features of Google are the advanced search operators that allow you to refine your searches to retrieve a more relevant set of results. These operators will let you limit your search to a specific site, type of document, and date range. You can search for phrases, remove results with terms, and look for synonyms. The Library has developed a guide on Advanced Searching in Google that describes the most useful of these tools. It also provides more guidance on constructing a successful search by eliminating superfluous words, like the, an, of, in , where, who, and is. (There are some exceptions to this – try searching for who, then a who, then the who.) You wouldn’t think word order would matter, either, but it does. Search for sky blue and then blue sky. You get completely different results.
In addition to advice on constructing your search, the guide will help you set up Google Scholar to access the online content the Library pays for. Have you been published? You can track who has cited your work using “My Citations.” Setting up alerts will have you emailed every time new items appear related to your search. Google Trends allows you to track how often terms were searched in Google. Can’t remember the name of that book you saw yesterday, but remember it was blue and had a picture of Abraham Lincoln on it? The guide shows you how to use the search Google Images to identify it.
Google changes all the time, adding new features and deleting less popular ones. If you come across a useful feature of Google that you think should be shared, please feel free to contact us and let us know. You can find our contact information under the More Help? Tab.
– Jennifer Dorner, Doe Library
contact me at dorner [at] berkeley.edu
Prefer to receive library notices as text messages instead of emails? Login to My OskiCat with your Calnet ID and passphrase and click on “Update Your Account” to opt-in under ‘Mobile settings.”
To opt out, reply to any received text message with Stop, Quit, Cancel, Unsubscribe, or Stop All. Or, text Stop to either 35143 or 82453 to stop all messages. An alternative method is signing into My OskiCat and set preference to stop all text notifications by removing the check from the “Opt In” box and deleting the mobile phone number.
Browse, read, and monitor thousands of scholarly journals on your tablet or iPhone/iPod Touch with the BrowZine app.
Create a personal bookshelf of favorite journals, be alerted when new issues are published, and save articles to Zotero, Mendeley, Dropbox, and more.
Get started on your tablet (iPad, Android, Kindle Fire HD) or iPhone/iPod Touch in three easy steps:
- Go to your app store, search for BrowZine and download for free.
- Open the app, and select our library, University of California, Berkeley, from the listing.
- Afterwards, use AirBears or set up the campus VPN to begin reading scholarly journals from the Library.
by Jeffery Loo, Cheminformatics Librarian
Contact me at jloo [at] library.berkeley.edu
As the size of your papers lengthen — from term papers to thesis to dissertation — you’ll begin to recognize the value of a citation management tool. Good citation managers allow you to easily capture a variety of citation sources (books, articles, interviews, videos, newspaper articles), and then readily incorporate them into a Word document, ultimately producing a bibliography in any one of a variety of formats (MLA, Chicago, APA, Harvard, etc.) In short, they take the drudgery out of citation collection and bibliography production so you can better focus on the content of your research. A good citation manager will also facilitate group collaboration and cloud-based storage of references and accompanying PDFs.
There are four commonly used academic citation managers on the UC Berkeley campus: EndNote, RefWorks, Zotero, and Mendeley. EndNote, the elder statesman of the group, has been widely used by science faculty for over 25 years, but costs about $100 and the web-based interface leaves quite a bit to be desired. UC Berkeley pays for a subscription to RefWorks so it’s free for students and faculty to use. It’s web-based and can’t be used if you don’t have an internet connection and to this reviewer’s eye the interface is a bit cludgy.
Zotero is a free, open source citation manager that’s been around nearly 10 years and is going strong. Design for new media research, it recognizes a wide variety of citation sources (books and articles as well as maps, computer programs, e-mail, patents, podcasts, theses, reports) and imports citations with a single click from Safari, Chrome and Firefox browsers. It works from your device or the cloud and allows for easy group collaboration.
Mendeley is the newcomer to the citation management crowd. Recently purchased by the Elsevier Corporation, it is cloud-based, allows for easily import of citations and annotation of PDFs and is currently free. It doesn’t recognize new media sources such as interviews, forum posts, and TV broadcasts as Zotero does. It is, however, gaining in popularity among scientists.
If you haven’t already guessed, Zotero, is my favorite — especially for students in the arts and humanities and social sciences. Mendeley is a close contender, but I have concerns that it won’t be free forever. You can find an excellent guide on setting up and using Zotero at http://guides.lib.berkeley.edu/c.php?g=4472&p=15929. The most important thing is to not delay; begin using a citation manager today. You won’t believe the difference it will make in your individual and collaborative research projects.
If you’d like to set up a Zotero training session, please get at least 5 colleagues together and we’ll find an open time on my calendar for a 1 hour training. deifler [@] berkeley.edu.
by David Eifler, Environmental Design Library
Contact me at deifler [at] berkeley.edu