addDocument
Add documents to bag-of-words or bag-of-n-grams model
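Description
newBag = addDocument(bag,documents) adds documents to the bag-of-words or bag-of-n-grams model bag.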
Examples
Add Documents to Bag-of-Words Model
Create a bag-of-words model from an array of tokenized documents.
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents)

bag = 
  bagOfWords with properties:

          Counts: [2x7 double]
      Vocabulary: ["an"    "example"    "of"    "a"    "short"    "sentence"    "second"]
        NumWords: 7
    NumDocuments: 2
Create another array of tokenized documents and add it to the same bag-of-words model.
documents = tokenizedDocument([
    "a third example of a short sentence"
    "another short sentence"]);
newBag = addDocument(bag,documents)

newBag = 
  bagOfWords with properties:

          Counts: [4x9 double]
      Vocabulary: ["an"    "example"    "of"    "a"    "short"    "sentence"    "second"    "third"    "another"]
        NumWords: 9
    NumDocuments: 4
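As a follow-up to the example above, the sketch below checks that addDocument returns an updated model rather than changing the input model, and then inspects the combined counts. The variable names documents1, documents2, and counts are illustrative only, and full is applied on the assumption that Counts may be stored as a sparse matrix.

```matlab
% Build the initial model from two documents.
documents1 = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents1);

% Add two more documents; the result is returned as a new model.
documents2 = tokenizedDocument([
    "a third example of a short sentence"
    "another short sentence"]);
newBag = addDocument(bag,documents2);

% The original model still describes only the first two documents.
disp(bag.NumDocuments)
disp(newBag.NumDocuments)

% Rows of Counts correspond to documents, columns to Vocabulary entries.
% full() converts a sparse matrix to a dense one and leaves a dense one unchanged.
counts = full(newBag.Counts);
disp(newBag.Vocabulary)
disp(counts)
```

Because the updated model is a return value, assign it back to the same variable (bag = addDocument(bag,documents)) when you want to accumulate documents in place.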
Import Text from Multiple Files Using a File Datastore
If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.
Create a file datastore for the example sonnet text files. The example sonnets have file names "exampleSonnetN.txt", where N is the number of the sonnet. Specify the read function to be extractFileText.
readFcn = @extractFileText;
fds = fileDatastore('exampleSonnet*.txt','ReadFcn',readFcn);
Create an empty bag-of-words model.
bag = bagOfWords
bag = 
  bagOfWords with properties:

          Counts: []
      Vocabulary: [1x0 string]
        NumWords: 0
    NumDocuments: 0
Loop over the files in the datastore and read each file. Tokenize the text in each file and add the document to bag.
while hasdata(fds)
    str = read(fds);
    document = tokenizedDocument(str);
    bag = addDocument(bag,document);
end
View the updated bag-of-words model.
bag
bag = 
  bagOfWords with properties:

          Counts: [4x276 double]
      Vocabulary: ["From"    "fairest"    "creatures"    "we"    "desire"    "increase"    ","    "That"    "thereby"    "beauty's"    "rose"    "might"    "never"    "die"    "But"    "as"    "the"    "riper"    "should"    ...    ] (1x276 string)
        NumWords: 276
    NumDocuments: 4
Input Arguments
bag
— Input bag-of-words or bag-of-n-grams model
bagOfWords object | bagOfNgrams object
Input bag-of-words or bag-of-n-grams model, specified as a bagOfWords object or a bagOfNgrams object.
documents
— Input documents
tokenizedDocument array | string array | cell array of character vectors
Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a tokenizedDocument array.
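To illustrate the non-tokenizedDocument input forms described above, here is a minimal sketch that adds one document as a string row vector and another as a cell row vector of character vectors. The variable names words and cellWords are illustrative only.

```matlab
% Start from an empty bag-of-words model.
bag = bagOfWords;

% A 1-by-3 string row vector is treated as a single document of three words.
words = ["a" "short" "sentence"];
bag = addDocument(bag,words);

% A cell row vector of character vectors is also a single document.
cellWords = {'another','short','sentence'};
bag = addDocument(bag,cellWords);

% Two documents have now been added, one per call.
disp(bag.NumDocuments)
disp(bag.Vocabulary)
```

Each call in this form adds exactly one document, so for many documents it is usually simpler to build a tokenizedDocument array and add it in a single call.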
Output Arguments
newBag
— Output model
bagOfWords object | bagOfNgrams object
Output model, returned as a bagOfWords object or a bagOfNgrams object. The type of newBag is the same as the type of bag.
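The following sketch illustrates that the output type follows the input type, using a bag-of-n-grams model. It assumes bagOfNgrams is called with its default n-gram settings; the example sentences and the variable name moreDocuments are illustrative only.

```matlab
% Create a bag-of-n-grams model from one tokenized document.
documents = tokenizedDocument("a short example sentence");
bag = bagOfNgrams(documents);

% Add a second document to the same model.
moreDocuments = tokenizedDocument("another short example sentence");
newBag = addDocument(bag,moreDocuments);

% The returned model has the same class as the input model.
disp(class(bag))
disp(class(newBag))
disp(newBag.NumDocuments)
```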
Version History
Introduced in R2017b
See Also
bagOfWords | bagOfNgrams | removeDocument | removeEmptyDocuments | tokenizedDocument