Data Sets for Text Analytics

This page provides a list of different data sets that you can use to get started with text analytics applications.

Data SetDescriptionTask

Factory Reports

The Factory Reports data set is a table containing approximately 500 reports with various attributes including a plain text description in the variable Description and a categorical label in the variable Category.

Read the Factory Reports data from the file "factoryReports.csv". Extract the text data and the labels from the Description and Category columns, respectively.

filename = "factoryReports.csv";
data = readtable(filename,'TextType','string');

textData = data.Description;
labels = data.Category;

For an example showing how to process this data for deep learning, see Classify Text Data Using Deep Learning (Deep Learning Toolbox).

Text classification, topic modeling

Shakespeare's Sonnets

The file sonnets.txt contains all of Shakespeare's sonnets in a single text file.

Read the Shakespeare's Sonnets data from the file "sonnets.txt".

filename = "sonnets.txt";
textData = extractFileText(filename);

The sonnets are indented by two whitespace characters and are separated by two newline characters. Remove the indentations using replace and split the text into separate sonnets using split. Remove the main title from the first three elements and the sonnet titles, which appear before each sonnet.

textData = replace(textData,"  ","");
textData = split(textData,[newline newline]);
textData = textData(5:2:end);

For an example showing how to process this data for deep learning, see Generate Text Using Deep Learning (Deep Learning Toolbox).

Topic modeling, text generation

ArXiv Metadata

The ArXiv API allows you to access the metadata of scientific e-prints submitted to https://arxiv.org including the abstract and subject areas. For more information, see https://arxiv.org/help/api.

Import a set of abstracts and category labels from math papers using the arXiV API.

url = "https://export.arxiv.org/oai2?verb=ListRecords" + ...
    "&set=math" + ...
    "&metadataPrefix=arXiv";
options = weboptions('Timeout',160);
code = webread(url,options);

For an example showing how to parse the returned XML code and import more records, see Multilabel Text Classification Using Deep Learning.

Text classification, topic modeling

Books from Project Gutenberg

You can download many books from Project Gutenberg. For example, download the text from Alice's Adventures in Wonderland by Lewis Carroll from https://www.gutenberg.org/files/11/11-h/11-h.htm using the webread function.

url = "https://www.gutenberg.org/files/11/11-h/11-h.htm";
code = webread(url);

The HTML code contains the relevant text inside <p> (paragraph) elements. Extract the relevant text by parsing the HTML code using the htmlTree function and then finding all the elements with the element name "p".

tree = htmlTree(code);
selector = "p";
subtrees = findElement(tree,selector);

Extract the text data from the HTML subtrees using the extractHTMLText function and remove the empty elements.

textData = extractHTMLText(subtrees);
textData(textData == "") = [];

For an example showing how to process this data for deep learning, see Word-By-Word Text Generation Using Deep Learning.

Topic modeling, text generation

Weekend updates

The file weekendUpdates.xlsx contains example social media status updates containing the hashtags "#weekend" and "#vacation".

Extract the text data from the file weekendUpdates.xlsx using the readtable function and extract the text data from the variable TextData.

filename = "weekendUpdates.xlsx";
tbl = readtable(filename,'TextType','string');
textData = tbl.TextData;

For an example showing how to process this data, see Create Simple Preprocessing Function.

Sentiment analysis

Roman Numerals

The CSV file "romanNumerals.csv" contains the decimal numbers 1–1000 in the first column and the corresponding Roman numerals in the second column.

Load the decimal-Roman numeral pairs from the CSV file "romanNumerals.csv".

filename = fullfile("romanNumerals.csv");

options = detectImportOptions(filename, ...
    'TextType','string', ...
    'ReadVariableNames',false);
options.VariableNames = ["Source" "Target"];
options.VariableTypes = ["string" "string"];

data = readtable(filename,options);

For an example showing how to process this data for deep learning, see Sequence-to-Sequence Translation Using Attention.

Sequence-to-sequence translation

Related Topics