addLemmaDetails

Add lemma forms of tokens to documents

Syntax

updatedDocuments = addLemmaDetails(documents)

updatedDocuments = addLemmaDetails(documents,'DiscardKnownValues',true)

Description

Use addLemmaDetails to add lemma forms to documents.

The function supports English, Japanese, and Korean text.

updatedDocuments = addLemmaDetails(documents) adds lemma details to documents and updates the token details. To get the lemma details from updatedDocuments, use tokenDetails.

updatedDocuments = addLemmaDetails(documents,'DiscardKnownValues',true) discards previously computed details and recomputes them.
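
As a minimal sketch of the two syntaxes above, the following adds lemma details once and then recomputes them with 'DiscardKnownValues' set to true. The sample sentence is an assumption chosen only for illustration, and the code assumes Text Analytics Toolbox is installed.

```matlab
% Assumed sample text, used only to illustrate the two call forms.
documents = tokenizedDocument("The dogs ran after the cat.");

% First call: compute and store the lemma details.
documents = addLemmaDetails(documents);

% Second call: discard the previously computed details and recompute them.
documents = addLemmaDetails(documents,'DiscardKnownValues',true);

% The lemma details are read back through tokenDetails.
tdetails = tokenDetails(documents);
```

Either call leaves the lemma information in the table returned by tokenDetails.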

Tip

Call addLemmaDetails before using the lower, upper, and normalizeWords functions, because addLemmaDetails relies on information that these functions remove.
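
The ordering described in this tip might look like the following sketch, which adds lemma details first and changes case only afterward. The sample sentence is an assumption chosen for illustration, not part of the original example.

```matlab
% Assumed sample text with mixed casing, chosen only for illustration.
documents = tokenizedDocument("The Dogs Ran After The Cat.");

% Add lemma details while the original casing is still present.
documents = addLemmaDetails(documents);

% Convert to lowercase only after the lemma details have been added.
documents = lower(documents);

% Inspect the token details.
tdetails = tokenDetails(documents);
head(tdetails)
```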

Examples

Add Lemma Details to Documents

Create a tokenized document array.

str = [ ...
    "The dogs ran after the cat."
    "I am building a house."];
documents = tokenizedDocument(str);

Add lemma details to the documents using addLemmaDetails. This function lemmatizes the text and adds the lemma form of each token to the table returned by tokenDetails. View the updated token details of the first few tokens.

documents = addLemmaDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)

     Token     DocumentNumber    LineNumber       Type        Language     Lemma 
    _______    ______________    __________    ___________    ________    _______

    "The"            1               1         letters           en       "the"  
    "dogs"           1               1         letters           en       "dog"  
    "ran"            1               1         letters           en       "run"  
    "after"          1               1         letters           en       "after"
    "the"            1               1         letters           en       "the"  
    "cat"            1               1         letters           en       "cat"  
    "."              1               1         punctuation       en       "."    
    "I"              2               1         letters           en       "i"
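
Because tokenDetails returns a table, the lemmas for a single document can be pulled out with ordinary table indexing. The sketch below repeats the setup from this example so that it runs on its own; the variable names idx and lemmas are illustrative assumptions.

```matlab
% Rebuild the documents from the example so this block is self-contained.
str = [ ...
    "The dogs ran after the cat."
    "I am building a house."];
documents = addLemmaDetails(tokenizedDocument(str));
tdetails = tokenDetails(documents);

% Select the rows that belong to the first document.
idx = tdetails.DocumentNumber == 1;

% Extract the lemma forms of those tokens as a string array.
lemmas = tdetails.Lemma(idx);
```

For the first document, lemmas should contain the values shown in the Lemma column of the table above.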

Input Arguments

`documents` — Input documents
`tokenizedDocument` array

Input documents, specified as a tokenizedDocument array.

Output Arguments

`updatedDocuments` — Updated documents
`tokenizedDocument` array

Updated documents, returned as a tokenizedDocument array. To get the token details from updatedDocuments, use tokenDetails.

Version History

Introduced in R2018b

See Also

tokenDetails | addDependencyDetails | addSentenceDetails | addPartOfSpeechDetails | addLanguageDetails | addTypeDetails | normalizeWords | tokenizedDocument | addEntityDetails

Topics