LibGuides: Gale Digital Scholar Lab: Support: FAQs

What content can I find in Gale Digital Scholar Lab?

Your institution’s library has access to a number of Gale’s primary source collections, and they can be found by clicking on the Available Text links on the Home Page or in the Learning Center. Most of Gale’s archival collections are text mineable, with the exception of those that are primarily manuscript-based or have specific rights restrictions that prevent text mining at this time:

Chatham House Archive
Early Arabic Printed Books
The Financial Times Historical Archive
State Papers Online
National Geographic Magazine Archive

Handwritten texts (including Arabic texts) are difficult to render as plain text because of the limitations of handwritten text recognition. While OCR engines are continuously improving their ability to recognize a wide variety of character sets, the variability of handwriting remains challenging for most platforms today. Even so, Gale has employed a number of new technologies to derive OCR from manuscript collections such as the Crime and Punishment module of Nineteenth Century Collections Online (NCCO), and will continue to advance the state of manuscript OCR in the future.

How much content can I analyze in Gale Digital Scholar Lab?

At present, users are limited to 10,000 documents per content set. The limit was determined through consultation with our source library partners, researchers, beta testers, and programmers. It allows us to monitor the performance of the analysis pipeline and adjust both hardware and software to meet future computational needs.

What digital humanities tools are included in Gale Digital Scholar Lab? And why were they selected?

Gale Digital Scholar Lab includes a variety of tools that support well-known text analysis methods, both qualitative and quantitative. Four of these tools are open source and are widely recognized and used in academia today; the remaining two are built in a similar fashion to their open-source equivalents or use open-source components in the analysis process. Providing these tools along with millions of pages of primary source content and accompanying OCR text lets users move quickly from corpus creation to text analysis in one platform.

Gale Digital Scholar Lab includes the following tools:

Name of Software Tool: What type of tool is it?

Mallet*: Topic Modeling - a widely used toolset for text mining. Having Mallet loaded into the Lab and ready for use will not only support established researchers who are already using these tools, but also support those who are new to text mining and are only beginning to learn about them. (Java based)

scikit-learn*: Clustering - automatic grouping of similar objects into sets. scikit-learn is open source and offers other tools for text mining and data analysis. (Python based)

Gale Custom Tool**: Ngrams - sequences of n words that appear next to one another in a text, a type of collocation. When computing n-grams, you typically slide a window of n words forward one word at a time to show co-occurring words.

spaCy*: Named Entity Recognition (NER) - recognizes and extracts named entities from documents within a content set, and outputs lists of those entities.

spaCy*: Parts of Speech Tagger - The purpose is to parse document sentences into Parts of Speech (PoS) and tag them accordingly. PoS tagging effectively creates a lexicographical index or dictionary of a content set.

Gale Custom Tool***: Sentiment Analysis - tallies the positive and negative words within each document of a content set. It uses the AFINN lexicon (a dictionary of words and their sentiment values) to compute sentiment scores for each phrase, which are then combined to produce a document-level sentiment value.

*open source
**Gale custom tool utilizing Apache’s Lucene Standard Tokenizer
***Gale custom tool utilizing a lexicon released under the Open Database License
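
For readers new to these methods, the Ngrams and Sentiment Analysis entries above can be illustrated with a minimal Python sketch. It is not Gale's implementation: the `tokenize`, `ngrams`, and `sentiment_score` functions and the four-word `TOY_LEXICON` are hypothetical names and values chosen for illustration, the tokenizer is a simple regular expression rather than Lucene's Standard Tokenizer, and scores are summed per word rather than per phrase.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Slide a window of n words across the tokens, one word at a time."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical toy lexicon in the style of AFINN (integer sentiment values).
# The entries are illustrative only and are not taken from the real lexicon.
TOY_LEXICON = {"good": 3, "happy": 3, "bad": -3, "terrible": -3}

def sentiment_score(tokens, lexicon):
    """Sum the lexicon value of every token; words not in the lexicon count as 0."""
    return sum(lexicon.get(token, 0) for token in tokens)

text = "The trial was terrible, but the verdict was good and the family was happy."
tokens = tokenize(text)
bigrams = ngrams(tokens, 2)                    # adjacent word pairs
score = sentiment_score(tokens, TOY_LEXICON)   # document-level tally

print(bigrams[:3])
print(score)
```

Here one negative word and two positive words yield a positive document-level score. A real lexicon contains thousands of entries, so results in the Lab will differ from this toy example.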

Can I clean the content sets I create in Gale Digital Scholar Lab?

Yes. The Clean feature of Gale Digital Scholar Lab lets you strip out blank spaces, punctuation, special characters, and more in order to ensure cleaner, more accurate analytical output. It’s designed to work seamlessly with the included analysis tools, in addition to cleaning content sets before downloading them locally. Cleaning is a critical part of the preparation for any text analysis. Gale Digital Scholar Lab includes the ability to clean content sets as a separate feature, so you can ensure that documents in specific Content Sets are prepared in precisely the same way. Users can decide how they’re altered and make adjustments according to their individual research needs.
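
As a rough illustration of what this kind of cleaning involves, the sketch below applies a few of the steps mentioned above to plain-text documents. It assumes nothing about how the Clean feature is implemented: the `clean_text` function and its option names are hypothetical, and the point is only that the same configuration is applied to every document in a set.

```python
import re
import string

def clean_text(text, lowercase=True, strip_punctuation=True, collapse_whitespace=True):
    """Apply a fixed set of cleaning steps so every document is prepared the same way."""
    if lowercase:
        text = text.lower()
    if strip_punctuation:
        # string.punctuation covers ASCII punctuation only; curly quotes and
        # other Unicode marks would need to be handled separately.
        text = text.translate(str.maketrans("", "", string.punctuation))
    if collapse_whitespace:
        # Replace runs of spaces, tabs, and newlines with one space.
        text = re.sub(r"\s+", " ", text).strip()
    return text

documents = ["  The  Court,   adjourned!! ", "Verdict:\tNOT guilty."]
cleaned = [clean_text(doc) for doc in documents]
print(cleaned)
```

Keeping the options as explicit parameters mirrors the idea that users decide how documents are altered, for example by passing `lowercase=False` when capitalization matters for named entity recognition.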

Can I analyze content outside of the Gale Primary Source Collections provided by my library?

The majority of Gale Primary Sources and Archives Unbound collections can be analyzed within Gale Digital Scholar Lab (please see the exclusion list earlier in this FAQ). While the mission of the platform is to provide access to the OCR text of your institution’s Gale Primary Source collections, we also support the analysis of non-Gale texts in Gale Digital Scholar Lab. We will continue to explore ways to extend our content reach to include outside collections that customers frequently request.

How do I upload my own documents into Gale Digital Scholar Lab?

Users can upload plain text files (.txt) and text in spreadsheets (.csv) by navigating to the Upload feature on the Build page of Gale Digital Scholar Lab. They can select one or more files from their computer to upload, apply metadata, manage, and add to a Content Set.

Watch this 7-minute tutorial video to learn more.

What can I do with my uploaded documents?

Users are the only ones who can access their uploaded documents, and they control the state of those documents. Once a document has been uploaded to the Lab, they can edit its text, apply metadata, and add it to a Content Set. Users can also delete their documents from the Gale Digital Scholar Lab environment at any time. Note that deleted documents are no longer available for inclusion in content sets or analysis. They are also removed from any content set that currently contains them and can no longer be viewed in past analyses.