
Text Data Preparation

Import text data into MATLAB® and preprocess it for analysis

Text Analytics Toolbox™ includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. Use these tools to extract text from popular file formats, preprocess raw text, extract individual words or multiword phrases (n-grams), convert text into numerical representations, and build statistical models. For an example showing how to get started, see Prepare Text Data for Analysis.
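The workflow described above might be sketched as follows. This is a minimal illustration, not the toolbox's prescribed pipeline: the sample strings are hypothetical, and the choice and order of preprocessing steps are assumptions that depend on your data.

```matlab
% Hypothetical sample text standing in for raw input (assumption).
str = [
    "The pump failed after a loud rattling noise was heard."
    "Operators reported that the mixer tripped the fuses twice!"];

documents = tokenizedDocument(str);                     % split text into tokens
documents = addPartOfSpeechDetails(documents);          % tag parts of speech before removing words
documents = removeStopWords(documents);                 % drop stop words such as "the" and "a"
documents = normalizeWords(documents,'Style','lemma');  % reduce words to their lemma forms
documents = erasePunctuation(documents);                % remove punctuation
documents = removeShortWords(documents,2);              % drop words with 2 or fewer characters

bag = bagOfWords(documents);   % count word occurrences per document
counts = bag.Counts;           % sparse NumDocuments-by-NumWords matrix of counts
```

Part-of-speech tagging comes before stop word removal here because the tagger benefits from seeing complete sentences.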

Text Analytics Toolbox provides language-specific support for English, Japanese, German, and Korean text. Most Text Analytics Toolbox functions also work with text in other languages. For more information, see Language Considerations.
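As a small sketch of working with one of the listed languages, the following tokenizes German text and inspects the language recorded for each token. The sample string is hypothetical, and setting the 'Language' option explicitly is a choice made here for illustration rather than a requirement.

```matlab
% Hypothetical German sample text (assumption).
str = "Guten Morgen. Wie geht es dir?";

% Specify the language explicitly instead of relying on automatic detection.
documents = tokenizedDocument(str,'Language','de');

% The token details table includes a Language variable for each token.
tdetails = tokenDetails(documents);
```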

Live Editor Tasks

Preprocess Text Data - Preprocess and clean up text data for analysis (Since R2023a)

Functions


extractFileText - Read text from PDF, Microsoft Word, HTML, and plain text files
extractHTMLText - Extract text from HTML
readPDFFormData - Read data from PDF forms
pdfinfo - PDF file information (Since R2023a)
writeTextDocument - Write documents to text file
htmlTree - Parsed HTML tree
findElement - Find elements in HTML tree
getAttribute - Read HTML attribute of root node of HTML tree
ismissing - Find HTML trees without values
string - Convert parsed HTML tree to string
tokenizedDocument - Array of tokenized documents for text analysis
erasePunctuation - Erase punctuation from text and documents
eraseTags - Erase HTML and XML tags from text
eraseURLs - Erase HTTP and HTTPS URLs from text
removeStopWords - Remove stop words from documents
removeShortWords - Remove short words from documents or bag-of-words model
removeLongWords - Remove long words from documents or bag-of-words model
removeWords - Remove selected words from documents or bag-of-words model
normalizeWords - Stem or lemmatize words
replaceWords - Replace words in documents
replaceNgrams - Replace n-grams in documents
splitSentences - Split text into sentences
splitParagraphs - Split text into paragraphs (Since R2023a)
stopWords - List of stop words
decodeHTMLEntities - Convert HTML and XML entities into characters
lower - Convert documents to lowercase
upper - Convert documents to uppercase
context - Search documents for word or n-gram occurrences in context
tokenDetails - Details of tokens in tokenized document array
addSentenceDetails - Add sentence numbers to documents
addPartOfSpeechDetails - Add part-of-speech tags to documents
addLemmaDetails - Add lemma forms of tokens to documents
addLanguageDetails - Add language identifiers to documents
addEntityDetails - Add entity tags to documents
addDependencyDetails - Add grammatical dependency details to documents (Since R2022b)
addTypeDetails - Add token type details to documents
corpusLanguage - Detect language of text
abbreviations - Table of common abbreviations
topLevelDomains - List of top-level domains
bagOfWords - Bag-of-words model
bagOfNgrams - Bag-of-n-grams model
addDocument - Add documents to bag-of-words or bag-of-n-grams model
removeDocument - Remove documents from bag-of-words or bag-of-n-grams model
removeInfrequentWords - Remove words with low counts from bag-of-words model
removeInfrequentNgrams - Remove infrequently seen n-grams from bag-of-n-grams model
removeNgrams - Remove n-grams from bag-of-n-grams model
removeEmptyDocuments - Remove empty documents from tokenized document array, bag-of-words model, or bag-of-n-grams model
topkwords - Most important words in bag-of-words model or LDA topic
topkngrams - Most frequent n-grams
encode - Encode documents as matrix of word or n-gram counts
tfidf - Term Frequency–Inverse Document Frequency (tf-idf) matrix
join - Combine multiple bag-of-words or bag-of-n-grams models
correctSpelling - Correct spelling of words (Since R2020a)
editDistance - Find edit distance between two strings or documents
editDistanceSearcher - Edit distance nearest neighbor searcher
knnsearch - Find nearest neighbors by edit distance
rangesearch - Find nearest neighbors by edit distance range
splitGraphemes - Split string into graphemes
docfun - Apply function to words in documents
containsWords - Check if word is member of documents (Since R2022b)
containsNgrams - Check if n-gram is member of documents (Since R2022a)
contains - Check if pattern is substring in documents (Since R2022b)
plus - Append documents
replace - Replace substrings in documents
regexprep - Replace text in words of documents using regular expression
doclength - Length of documents in document array
doc2cell - Convert documents to cell array of string vectors
joinWords - Convert documents to string by joining words
string - Convert scalar document to string vector
textanalytics.unicode.nfc - Unicode composed normalized form (NFC) (Since R2022b)
textanalytics.unicode.nfd - Unicode decomposed normalized form (NFD) (Since R2021a)
textanalytics.unicode.nfkc - Unicode compatibility composed normalized form (NFKC) (Since R2022b)
textanalytics.unicode.nfkd - Unicode compatibility decomposed normalized form (NFKD) (Since R2022b)
textanalytics.unicode.UTF32 - Unicode UTF-32 string representation (Since R2021a)
characterCategories - Unicode character categories (Since R2021a)
hex - Convert UTF-32 representation to hexadecimal values (Since R2021a)
string - Convert UTF-32 representation to string (Since R2021a)
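
As a brief illustration of how the HTML functions listed above can be combined, the following sketch parses a small HTML string, selects its paragraph elements, and extracts their text. The HTML snippet is hypothetical and stands in for content you would normally read from a file or web page.

```matlab
% Hypothetical HTML snippet (assumption); real input would come from a file or URL.
code = "<html><body><h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p></body></html>";

tree = htmlTree(code);                  % parse the HTML code into a tree
paragraphs = findElement(tree,"p");     % select all <p> elements with a CSS selector
str = extractHTMLText(paragraphs);      % extract the text of each selected element
```

The resulting string array can then be passed to tokenizedDocument for further preprocessing.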

Topics

Import

Preprocessing

Language Support