Data Sets for Text Analytics

This page lists data sets that you can use to get started with text analytics applications.


Factory Reports

Word cloud illustrating the Factory Reports data set.

The Factory Reports data set is a table containing approximately 500 reports with various attributes including a plain text description in the variable Description and a categorical label in the variable Category.

Read the Factory Reports data from the file "factoryReports.csv". Extract the text data and the labels from the Description and Category columns, respectively.

filename = "factoryReports.csv";
data = readtable(filename,'TextType','string');

textData = data.Description;
labels = data.Category;
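
Before training a classifier, it can help to check how the reports are distributed across categories. The following is a minimal sketch, not part of the original example. It uses a small made-up table in place of factoryReports.csv (the three rows and their category names are illustrative assumptions) and counts the observations in each class using categorical, categories, and countcats.

```matlab
% Made-up stand-in for the table returned by readtable. The real file has
% about 500 rows and more variables; these rows are illustrative only.
Description = [ ...
    "Coolant is leaking under the conveyor."
    "Control panel loses power at start up."
    "Coolant pooling near the mixer."];
Category = ["Leak"; "Electronic Failure"; "Leak"];
data = table(Description,Category);

textData = data.Description;
labels = categorical(data.Category);

% List the classes and count the observations in each one.
classNames = categories(labels);
classCounts = countcats(labels);
disp(table(classNames,classCounts))
```

With the real file, the same three calls apply unchanged to the labels extracted above.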

For an example showing how to process this data for deep learning, see Classify Text Data Using Deep Learning (Deep Learning Toolbox).

Task: Text classification, topic modeling

Shakespeare's Sonnets

Word cloud illustrating the Shakespeare's Sonnets data set.

The file sonnets.txt contains all of Shakespeare's sonnets in a single text file.

Read the Shakespeare's Sonnets data from the file "sonnets.txt".

filename = "sonnets.txt";
textData = extractFileText(filename);

The sonnets are indented by two whitespace characters and are separated by two newline characters. Remove the indentations using replace and split the text into separate sonnets using split. Discard the main title, which occupies the first three elements, and the sonnet titles, which appear before each sonnet.

textData = replace(textData,"  ","");
textData = split(textData,[newline newline]);
textData = textData(5:2:end);
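
To see why the index 5:2:end selects only the sonnets, the same three steps can be tried on a miniature string. This is a sketch under stated assumptions: the text below is made up and only mimics the layout described above (three header blocks, then alternating title and sonnet blocks), and the variable names blocks, sample, parts, and sonnets are chosen for illustration.

```matlab
% Made-up miniature of the layout described above: three header blocks,
% then alternating title and sonnet blocks, separated by blank lines.
blocks = [ ...
    "THE SONNETS"
    "by William Shakespeare"
    "(third header block)"
    "  I"
    "  First line of sonnet one," + newline + "  second line."
    "  II"
    "  First line of sonnet two," + newline + "  second line."];
sample = join(blocks,[newline newline]);

sample = replace(sample,"  ","");          % remove the indentation
parts = split(sample,[newline newline]);   % one element per block
sonnets = parts(5:2:end);                  % skip the header and the titles
```

Element 4 of parts is the first sonnet title, so starting at element 5 and stepping by 2 keeps each sonnet and skips the title that precedes the next one.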

For an example showing how to process this data for deep learning, see Generate Text Using Deep Learning (Deep Learning Toolbox).

Task: Topic modeling, text generation

ArXiv Metadata

Three word clouds illustrating the ArXiv Metadata data set. The first word cloud shows words related to Combinatorics. The second shows words related to Statistics Theory. The third shows words from both categories.

The arXiv API allows you to access the metadata of scientific e-prints submitted to https://arxiv.org, including the abstract and subject areas. For more information, see https://arxiv.org/help/api.

Import a set of abstracts and category labels from math papers using the arXiv API.

url = "https://export.arxiv.org/oai2?verb=ListRecords" + ...
    "&set=math" + ...
    "&metadataPrefix=arXiv";
options = weboptions('Timeout',160);
code = webread(url,options);
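
The variable code holds XML text. One possible way to pull fields out of it is to parse it with htmlTree and select elements by name. The sketch below is an assumption-laden illustration, not the method of the linked example: the XML fragment is made up, and the element names abstract and categories are assumed to match the records in the actual response, so check them against the data you receive.

```matlab
% Made-up XML fragment standing in for one record of the response.
code = "<record><metadata>" + ...
    "<abstract>A made-up abstract about graph colorings.</abstract>" + ...
    "<categories>math.CO math.ST</categories>" + ...
    "</metadata></record>";

% Parse the markup and select elements by name.
tree = htmlTree(code);
abstracts = extractHTMLText(findElement(tree,"abstract"));
categoryText = extractHTMLText(findElement(tree,"categories"));

% The categories arrive as one whitespace-separated string.
labels = split(strip(categoryText));
```

Because a paper can belong to several categories, the labels for one record form an array, which is what makes this a multilabel classification problem.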

For an example showing how to parse the returned XML code and import more records, see Multilabel Text Classification Using Deep Learning.

Task: Text classification, topic modeling

Books from Project Gutenberg

Word cloud illustrating the Books from Project Gutenberg data set. The word cloud shows words from "Alice's Adventures in Wonderland."

You can download many books from Project Gutenberg. For example, download the text from Alice's Adventures in Wonderland by Lewis Carroll from https://www.gutenberg.org/files/11/11-h/11-h.htm using the webread function.

url = "https://www.gutenberg.org/files/11/11-h/11-h.htm";
code = webread(url);

The HTML code contains the relevant text inside <p> (paragraph) elements. Extract the relevant text by parsing the HTML code using the htmlTree function and then finding all the elements with the element name "p".

tree = htmlTree(code);
selector = "p";
subtrees = findElement(tree,selector);

Extract the text data from the HTML subtrees using the extractHTMLText function and remove the empty elements.

textData = extractHTMLText(subtrees);
textData(textData == "") = [];
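
The same steps can be tried offline on a small HTML string, which shows where the empty elements come from. This is a minimal sketch with made-up HTML; the real page is much larger and its structure may differ.

```matlab
% Made-up HTML with a heading, two paragraphs, and one empty paragraph.
code = "<html><body><h1>Title</h1>" + ...
    "<p>First paragraph.</p><p></p><p>Second paragraph.</p>" + ...
    "</body></html>";

tree = htmlTree(code);
subtrees = findElement(tree,"p");       % one subtree per <p> element

textData = extractHTMLText(subtrees);   % the empty <p> yields ""
textData(textData == "") = [];          % keep only nonempty paragraphs
```

The heading is not selected because the selector matches only p elements, and the empty paragraph is dropped by the last line.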

For an example showing how to process this data for deep learning, see Word-by-Word Text Generation Using Deep Learning.

Task: Topic modeling, text generation

Weekend Updates

Word cloud illustrating the Weekend Updates data set.

The file weekendUpdates.xlsx contains example social media status updates containing the hashtags "#weekend" and "#vacation".

Read the data from the file weekendUpdates.xlsx using the readtable function and extract the text data from the variable TextData.

filename = "weekendUpdates.xlsx";
tbl = readtable(filename,'TextType','string');
textData = tbl.TextData;
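
As a rough illustration of what sentiment analysis on such updates can look like, the sketch below tokenizes two made-up status updates and scores them with vaderSentimentScores. The texts are assumptions written for this illustration and do not come from weekendUpdates.xlsx, and this is one possible approach rather than a summary of the linked example.

```matlab
% Made-up status updates in the style described above.
textData = [ ...
    "Great weekend at the beach, loved it! #weekend"
    "Worst vacation ever, the hotel was terrible. #vacation"];

% Tokenize the text, then compute one compound score per document.
documents = tokenizedDocument(textData);
scores = vaderSentimentScores(documents);

% Scores lie between -1 (negative) and 1 (positive).
disp(table(textData,scores))
```

With the real data, the same two calls apply to the textData variable extracted above.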

For an example showing how to process this data, see Analyze Sentiment in Text.

Task: Sentiment analysis

Roman Numerals

Table illustrating the Roman Numerals data set. Each entry shows a single Roman numeral character, and each row corresponds to one Roman numeral. Because the numerals vary in length, short rows are padded with empty gray table entries.

The CSV file "romanNumerals.csv" contains the decimal numbers 1–1000 in the first column and the corresponding Roman numerals in the second column.

Load the decimal-Roman numeral pairs from the CSV file "romanNumerals.csv".

filename = fullfile("romanNumerals.csv");

options = detectImportOptions(filename, ...
    'TextType','string', ...
    'ReadVariableNames',false);
options.VariableNames = ["Source" "Target"];
options.VariableTypes = ["string" "string"];

data = readtable(filename,options);
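
Setting both variable types to "string" means the decimal numbers are read as text rather than as doubles, so both columns can be treated as character sequences. The sketch below checks this on a tiny file. It is an illustration under assumptions: the file name romanNumeralsSample.csv and its four rows are made up, and the file is written to the temporary folder and deleted afterward.

```matlab
% Write a tiny made-up CSV file in the same two-column layout.
filename = fullfile(tempdir,"romanNumeralsSample.csv");
fid = fopen(filename,"w");
fprintf(fid,"1,I\n2,II\n3,III\n4,IV\n");
fclose(fid);

% Same import options as above: no header row, both columns as strings.
options = detectImportOptions(filename, ...
    'TextType','string', ...
    'ReadVariableNames',false);
options.VariableNames = ["Source" "Target"];
options.VariableTypes = ["string" "string"];

data = readtable(filename,options);
delete(filename)   % remove the temporary file

% Both columns are string arrays, including the decimal numbers.
disp(class(data.Source))
```

Without the VariableTypes line, the first column would typically be detected as numeric.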

For an example showing how to process this data for deep learning, see Sequence-to-Sequence Translation Using Attention.

Task: Sequence-to-sequence translation

Finance Reports

Word cloud illustrating the Finance Reports data set.

The Securities and Exchange Commission (SEC) allows you to access financial reports via the Electronic Data Gathering, Analysis, and Retrieval (EDGAR) API. For more information, see https://www.sec.gov/search-filings/edgar-search-assistance/accessing-edgar-data.

To download this data, use the function financeReports attached to the example Generate Domain Specific Sentiment Lexicon as a supporting file. To access this function, open the example as a Live Script.

year = 2019;
qtr = 4;
maxLength = 2e6;
textData = financeReports(year,qtr,maxLength);

For an example showing how to process this data, see Generate Domain Specific Sentiment Lexicon.

Task: Sentiment analysis
