Where to Find the Texts for Text Mining

Sketch for Monotype Digital Type Wall
frame1351170437122. Marcin Ignac, CC BY-NC-ND 2.0

Text mining, the process of computationally analyzing large swaths of natural language texts, can illuminate patterns and trends in literature, journalism, and other forms of textual culture that are sometimes discernible only at scale, and it’s an important digital humanities method. If text mining interests you, then finding the right tool — whether you turn to an entry-level system like Voyant or master a programming language like Python — is only a part of the solution. Your analyses are only as strong as the texts you’re working with, after all, and finding authoritative text corpora can sometimes be difficult due to paywalls and licensing restrictions. The good news is the UC Berkeley Libraries offer a range of text corpora for you to analyze, and we can help you get your hands on things we don’t already have access to.
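To make "patterns discernible only at scale" a little more concrete, here is a minimal sketch in Python of one of the simplest text mining tasks: counting how often a term appears in texts from different years. The three-sentence corpus, the years, and the function names are all made up for illustration. A real project would load texts from one of the corpora described below and would likely use a dedicated library rather than this bare-bones tokenizer.

```python
import re
from collections import Counter

# Hypothetical stand-in corpus: three tiny "documents" keyed by year.
corpus = {
    1850: "The railway reached the town and the town grew.",
    1900: "The telephone reached the town; the railway declined.",
    1950: "Television reached every home in the town.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_counts_by_year(corpus, term):
    """Count how often `term` occurs in each year's text."""
    return {year: Counter(tokenize(text))[term] for year, text in corpus.items()}


railway_trend = term_counts_by_year(corpus, "railway")
town_trend = term_counts_by_year(corpus, "town")
print(railway_trend)
print(town_trend)
```

With thousands of documents instead of three, the same kind of per-year count is what lets a trend emerge that no single text would reveal.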

The first step in your exploration should be the library’s Text Mining Guide, which lists text corpora that are either publicly accessible (e.g., the Library of Congress’s Chronicling America newspaper collection) or are available to UCB faculty, students, and staff (e.g., JSTOR Data for Research). The content of these sources is available in a variety of formats: you may be able to download the texts in bulk, use an API, or make use of a content provider’s in-platform tools. In other cases (e.g., ProQuest Historical Newspapers), the library may be able to arrange access upon request. While the scope of the corpora we have access to is wide, we are particularly strong in newspaper collections, pre-20th century English literature collections, and scholarly texts.

What happens if the library doesn’t have what you need? We regularly facilitate the acquisition of text corpora upon request, and you can always email your subject librarian with specific requests or questions. The library will deal with licensing questions so you don’t have to, and we’ll work with you to figure out the best way to make the texts available for your work, often with the help of our friends in the D-Lab or Research IT. We also offer the Data Acquisition and Access Program, which provides special funding for one-time data set purchases, including text corpora. Your requests and suggestions help the library develop our collection, making text mining easier for the next researcher who comes along.

Important caveats:

  • Unless explicitly stated, our contracts for most Library databases and library resources (e.g., Scopus, Project MUSE) don’t allow for bulk download. Please avoid web scraping licensed library resources on your own: content providers realize what is happening pretty quickly, and they react by shutting down access for our entire campus. Ask your subject librarian for help instead.
  • Keep in mind that many of the vendors themselves are limited in how, and how much, access they can provide to a particular resource, based on their own contractual agreements. It’s not uncommon for specific contemporary newspapers and journals to be unavailable for analysis at scale, even when library funding for access may be available.

Stacy Reardon and Cody Hennesy
Contact us at sreardon [at] berkeley.edu; chennesy [at] berkeley.edu


My Dissertation Is Online! Wait – My Dissertation Is Online?! Copyright & Your Magnum Opus

 

Cross-posted from the UCB Library Scholarly Communications blog

You’ve worked painstakingly for years (we won’t let on how many) on your magnum opus: your dissertation—the scholarly key to completing your graduate degree, securing a possible first book deal, and making inroads toward faculty status somewhere. Then, as you are about to submit your pièce de résistance through ProQuest’s online administration system, you are confronted with the realization that—for students at many institutions—your dissertation is about to be made available open access online to readers all over the world (hurrah! and gulp).

Because your dissertation will be openly available online, there are many questions you need to address—both about what you put in your dissertation, and the choices you’ll need to make as you put it online. If you are a first-time author, facing these concerns can be daunting to say the least. And you definitely don’t want to be thinking about them for the first time when you are scrambling to submit your dissertation to ProQuest.

For instance, you’ll need to consider:

  • Are you using materials created by other people in your dissertation? Perhaps you’re using photos, text excerpts, scientific drawings or diagrams? You might need the authors’ permission to include them.
  • Are you using materials from a library’s special collections or archives? You may have signed agreements or accepted terms of use that affect what you can publish from those materials. (Examples: Archive.org, Harvard’s Houghton Library, Smithsonian, and Niels Bohr Library & Archives.)
  • Are you including information about particular living individuals? You might need to consider their privacy rights (see, for instance, a discussion on p. 15 of a University of Michigan dissertation guide).
  • If you own copyright in your dissertation (as most grad students in the UC campus system do), should you register your copyright?
  • Do you need to embargo your dissertation for privacy, patent, or other concerns?
  • Should you license your dissertation for greater use by others?

At UC Berkeley, we’ve created a workflow and guide for you to tackle these kinds of important copyright and other legal questions. Below, I’ve included highlights from the workflow, but there are plenty more best practices to draw upon in the guide. What follows are, of course, exactly that: best practices, and not legal advice. Your local scholarly communication officer or librarian (see this list for some resources around UC) can help you find additional information as you consider these issues for your own dissertation.

Continue to learn more about Copyright Basics and the Workflow

Rachael G. Samberg
Scholarly Communication Officer

Contact me at rsamberg [at] berkeley.edu.


Who, How Many and Where: Research Using the U.S. Census

While doing an academic research project, you may encounter the need for a demographic or economic statistic (What is the current population of King City, CA? How did people get to work in 1960?). There are many sources of statistics out there, some reliable, and some – well, not so much. Sources may vary by location, time period, types of questions asked, etc. One of the most reliable sources of U.S. demographic statistics and data to become familiar with is the United States Census Bureau.

The work of the U.S. Census Bureau dates back to the founding of the country, though the Census Bureau wasn’t a permanent government office until the early 1900s. Its primary role is mandated by the Constitution: Article 1, Section 2 of the U.S. Constitution stipulates that an enumeration of the population be taken every ten years for apportionment and redistricting of the U.S. House of Representatives. This enumeration is the Decennial Census, which has taken place every 10 years from 1790 to the present.

The Census Bureau runs far too many data programs to cover in this short article, so we’ll look at the two most widely used: the Decennial Census and the American Community Survey. The Decennial Census is what most people think of when they think about the U.S. Census: conducted every ten years, lots of details, etc. Up until the year 2000, the Decennial Census was the main source of detailed statistics. Since 2005, however, the American Community Survey has provided the same level of detailed information in 1-, 3-, and 5-year rolling averages. The shorter the timespan, the more current the information, but 1-year statistics are only available for areas with populations of 65,000 or more. Over a longer timespan, the statistics are less current but are also available for smaller populations.
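The trade-off between short and long timespans can be illustrated with a little arithmetic. The sketch below takes the "rolling averages" description at face value and averages made-up yearly figures over 1-, 3-, and 5-year windows. The numbers and the function name are hypothetical, and this is only an illustration of the idea, not the Census Bureau’s actual estimation method.

```python
def rolling_average(values, window):
    """Return the mean of each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Made-up yearly figures for a hypothetical small town, 2005 through 2011.
yearly = [100, 104, 96, 110, 90, 105, 95]

one_year = rolling_average(yearly, 1)    # same values as the yearly figures
three_year = rolling_average(yearly, 3)  # five overlapping 3-year windows
five_year = rolling_average(yearly, 5)   # three overlapping 5-year windows

print(three_year)
print(five_year)
```

The longer the window, the smoother the series and the less it reflects any single recent year, which mirrors the currency-versus-coverage trade-off described above.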

The Census Bureau gathers a lot of information and makes it available in a number of ways. Over the last two decades, U.S. Census information has become more readily available online. Current information is available through the Census Bureau’s site or via American FactFinder. Other sources include the Library’s subscription to Social Explorer, which covers 1790 to the present and allows for the creation of maps. And if you like maps, try Policy Map. If you want to dig into the numerical data and not just the statistics, the Census Bureau provides that as well. To learn more about these sources and the Census, visit the Library’s Census Guide. The D-Lab also holds training sessions on Census data.

One final point to consider when talking about the Census or any government program: funding. Many times in the recent past, the U.S. Congress has threatened to cut or even kill funding for the Census Bureau. In 2011, Congress succeeded when it cut the funding to the office that provided the Statistical Abstract of the United States, despite a public outcry against this cut. As the Census Bureau gears up for the 2020 Census, will there be cuts to the program? How might that affect your research or research in your field?

–Jesse Silva
Government Information, Political Science and Public Policy Librarian

Contact me at jsilva [@] library.berkeley.edu


Berkeley Services for Digital Scholarship

Every time you download a spreadsheet, use a piece of someone else’s code, share a video, or take photos for a project, you’re working with data. When you are producing, accessing, or sharing data in order to answer a research question, you’re working with research data, and Berkeley has a service that can help you.

Research Data Management at Berkeley is a service that supports researchers in every discipline as they find, generate, store, share, and archive their data. The program addresses current and emerging data management issues, compliance with policy requirements imposed by funders and by the University, and reduction of risk associated with the challenges of data stewardship.

In September 2015, the program launched the RDM Consulting Service, staffed by dedicated consultants with expertise in key aspects of managing research data. The RDM Consulting Service coordinates closely with consulting services in Research IT, the Library, and other researcher-facing support organizations on the campus. Contact a consultant at researchdata@berkeley.edu.

The RDM program also developed an online resource guide. The Guide documents existing services, providing context and use cases from a research perspective. In the rapidly changing landscape of federal funding requirements, archiving tools, electronic lab notebooks, and data repositories, the Guide offers information that directly addresses the needs of researchers at Berkeley. The RDM Guide is available at researchdata.berkeley.edu.

Jamie Wittenberg
Research Data Management Service Design Analyst

Contact me at wittenberg[@]berkeley.edu


Digital Humanities and the Library

Are you a humanist working with digital materials to do your research? Are you carrying out your research or presenting your results using digital methods and tools? Are you teaching using digital tools and content? If you answered yes to any of these questions, then your work might be considered digital humanities.

Digital humanities has been described as “dynamic dialogue between emerging technology and humanistic inquiry” (Varner, 2016). It is a term that is used to describe a domain within the humanities where researchers are doing most of their work using digital tools, content, and/or methods. Whether this work is partially or exclusively digital, this designation is a way to set these emerging practices apart from more traditional or “analog” ones, though there is no clear distinction.

The scope of digital humanities has been a hot topic in recent years, especially in relation to the library’s role in this new domain.  What services does the library provide to digital humanists? What can the library do to support digital humanities on campus?

The Library has always provided services to researchers and will continue to provide those same services, while expanding its offerings to encompass new forms of research, publication, and teaching. It is not a question of libraries supporting one or the other. Digital humanities is still evolving, and the Library is evolving right along with it, continuing to offer collections, research support, and instruction in both traditional ways and new ones as this “dynamic dialogue” expands.

The Library collects and creates digital resources at the same time that it continues to build its analog collections. Myriad databases, data sets, and other digital resources are available through the Library catalog and website. In addition, our digitized special collections are available through Calisphere, which provides access to digital images, texts, and recordings from California’s great libraries, archives, and museums.

While the library is busy collecting and organizing digital resources, reference librarians are ready and willing to provide you with research help. The expertise that librarians have in connecting researchers to materials, designing research, and providing instruction on how to evaluate and use new content and tools continues to grow and expand in this new environment.

In addition, the library provides instruction to help those new to the digital humanities learn about the tools and skills needed to do this work. Many librarians have partnered with the D-Lab in Barrows Hall on campus to provide instruction on citation management, metadata, and research data management. The D-Lab also offers training in various programming languages and data tools, as well as consulting on research design, data analysis, data management, and related techniques and technologies. Library trainings and events are generally posted to the library events calendar.

The Library also works closely with the Digital Humanities @ Berkeley group (a partnership between Research IT and the Office of the Dean of Arts and Humanities), which supports digital humanities events, trainings, course support, and graduate student and faculty projects. Their calendar lists talks, workshops, and other events designed to help move the DH community on campus forward.

Keeping the “dynamic dialogue” of digital humanities moving forward is a campus goal, and the relationship between digital humanities and the Library is an evolving one. We are hiring new librarians with digital humanities skills to further develop this relationship and expect to see more growth in the scope of the library’s involvement in digital humanities as the community on campus continues to expand.

Mary W. Elings
Head of Digital Collection, The Bancroft Library

Contact me at melings [at] berkeley.edu


Maps, Mapping and Your Research

You probably love maps (who doesn’t?!). They can be beautiful, visually compelling, interesting representations of the world. You might have one hanging on your wall, laugh over one showing how Oakland was the new Brooklyn even back in 1888, or exclaim over one (my new favorite) showing bear concentrations in Norway.

These same qualities that make us love maps are also why they can be excellent research tools. Even if you are not a geographer or urban planner you can use maps to provide context for a place you are describing, explore spatial relationships, or visualize your data in a way that highlights new patterns. For example, a music student used and made maps to trace the locations of 19th century Parisian opera goers!

In addition to the approximately half-a-million physical maps and air photos in the UC Berkeley Library’s collections, the library subscribes to several online databases that let you explore demographic data and create maps that you can share online or print. SimplyMap, Social Explorer, and Policy Map all cover the United States. If you are interested in China, the China Geo-Explorer II database has mappable census data.

There are also many resources freely available online. My standby is Old Maps Online, which pulls together scanned maps from institutions around the world into a single search screen. Just zoom in to your area of the world, adjust the time slider, and explore!

What if you want to make a map with your own data? There are many good options for that, too, including ArcGIS Online (a library-subscribed resource), CartoDB, MapBox, and Google, among others.

Contact Susan Powell, GIS & Map Librarian, at smpowell[at]berkeley.edu if you’d like to find out more about how you can use maps — both print and digital — in your research. Or stop by the Earth Sciences & Map Library to explore the collection and find out more!


Open Access News at Berkeley

There are now two official open access (OA) policies at Berkeley:

  • The UC Academic Senate Open Access Policy, adopted in July 2013, covers scholarly articles authored by Academic Senate faculty.
  • On October 23 of this year, UC issued a Presidential Open Access Policy expanding the reach of the Academic Senate policy by including all UC employees and encouraging them to freely share their research publications worldwide. Among those affected by the expanded policy are clinical faculty, lecturers, staff researchers, postdoctoral scholars, graduate students and librarians.

What does this mean for Graduate Students?

The Academic Senate policy was officially launched on November 17 with the implementation of “harvester” software that sends an automated email message to faculty listing eligible articles which they authored or co-authored; faculty are then prompted to verify (or reject) the articles and instructed on how to post their publications to eScholarship, UC’s OA publishing platform. If you, as a graduate student, co-authored a paper with an Academic Senate faculty member any time after July 2013, that article may be posted to eScholarship, which provides free global access to your publication. Wider dissemination of Berkeley research is not only a public good but also results in greater impact and recognition for researchers. Ask your faculty collaborators if they’ve posted their publications and, if not, offer to help them!

The Presidential Open Access Policy covers graduate student work if the student was an employee of the university at the time their article was published. Until eligible UC employees are folded into the harvester software used for Academic Senate faculty, you are encouraged to post your eligible articles using the eScholarship deposit mechanism. See Deposit your content in eScholarship for more details.

Keep in mind that your articles are automatically covered by the policy; you are not required to amend your author agreement and you do not need to pay any additional article processing charges.

Remind me: What is Open Access?

OA literature is free, digital, and available to anyone online. With OA literature, there is the potential for greater access, thus more readers and greater impact. There are two different approaches to open access: Gold and Green. Gold OA provides immediate access on the publisher’s website. In the Green OA model (also known as “self archiving”) authors continue to publish as they always have in all the same journals; once the article has been published in a traditional journal, the author then posts a “final author version” of the article to a repository. The UC Open Access Policy falls under the Green OA model.

For more information

  • Open Access: UC Open Access Policy
  • For individual questions, contact oapolicy@lists.berkeley.edu.
  • For in-person assistance come to a Library “upload-a-thon”
    • Tuesdays and Wednesdays
    • 4pm-5pm
    • Library Data Lab, 189 Doe
    • These drop-in sessions will run from November 17 to December 16 and from January 19 to February 24 (and beyond, if necessary).

Many subject specialty libraries are also offering “upload-a-thons” (see For more help).

Margaret Phillips, Education-Psychology Library
Contact me at mphillip [at] library.berkeley.edu

 


Interlibrary Services – a vital resource for Cal scholars

“ILL is probably my favorite academic service on campus! ILL staff always seem to be warm and friendly and to care about their jobs. I also wouldn’t have been able to write my dissertation without the probably hundreds of books I’ve gotten throughout the years. Thank you!”

–Berkeley grad student

 

Whether you are writing a paper or dissertation or doing postdoctoral research, the UCB Interlibrary Services department can be your best friend.

Even a world-class library like Berkeley’s can’t own everything, and that is where Interlibrary Loan (ILL) comes in. Books, articles, conference papers, dissertations: if your research requires something that isn’t available at Berkeley, we will try to obtain it from another institution.

We borrow from the other UC campuses, from hundreds of libraries nationally, and increasingly from partners around the world. We can’t guarantee that we will get everything you need, but we can guarantee that we’ll try. Recent successes include scans of two very rare items: Reveries Orientales: Poems et Illustrations by Jamil Hamoudi and Maha Vira Charita by Bhava Bhuti (published in Calcutta in 1857).

So, check out our website for information on the services we provide, or stop by our office at 133 Doe Library. Our staff will gladly take the time to walk you through the ILL process, show you how to manage your ILL account, and do all we can to help you succeed.

Our hours are Monday through Friday, 10:00 am to 4:00 pm, and our telephone number is 510-642-7365. You can find us online at http://www.lib.berkeley.edu/using-the-libraries/interlibrary-loan.

Patrick Shannon [Head, Interlibrary Services]  pshannon [at] library.berkeley.edu

 


Citation Managers: A Must-Have Tool in Your Research Arsenal

Wherever you are in your graduate career, a citation management tool is essential to organizing, writing and sharing your research. Two free and highly popular citation managers that run on Windows, Mac OS and Linux are Mendeley and Zotero. In short, Mendeley is frequently used by physical and life scientists and Zotero by social scientists and arts and humanities scholars. Below is a brief comparison.

Feature | Mendeley | Zotero
Access, edit and insert citations into a document offline | Yes | Yes
Microsoft Word plug-in | Yes | Yes
Automatic download of citations from OskiCat and the UCB Library discovery tool | No | Yes
Insert citations into Google Docs | No | Yes
Free storage for PDFs | 2 GB | 300 MB
Annotate PDFs from within the program | Yes | No
Attach web pages and screen captures | No | Yes
Recommendations of relevant and highly cited articles | Yes | No
Connect with a community of scholars (i.e., academic social network) | Yes | Kinda
Collaborate with colleagues in the cloud | Yes (free for up to 3 group members) | Yes (unlimited)
Automatically create citation records from PDFs | Yes | Yes
Easy de-duplication of item entries | Yes | Yes

 

Both citation managers allow you to easily download citation information and incorporate citations into your papers and publications. Each has over 7,000 citation styles covering the vast majority of journals you’ll publish in. Focus on research, reading and writing, and leave the drudgery of citation management to Mendeley or Zotero.

If you’d like to set up a Zotero training session for five or more, please contact David Eifler – deifler [at] berkeley.edu to arrange a convenient time.

David Eifler (Environmental Design Librarian) deifler [at] library.berkeley.edu and Jeffery Loo (Optometry and Health Sciences Librarian) jloo [at] library.berkeley.edu