Extract Keywords from Text Data Using TextRank

This example shows how to extract keywords from text data using TextRank.

The TextRank keyword extraction algorithm identifies candidate keywords using a part-of-speech tag-based approach and scores them using word co-occurrences determined by a sliding window. Keywords can contain multiple tokens. The algorithm also merges keywords when they appear consecutively in a document.
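To make the scoring step more concrete, here is a minimal sketch of a PageRank-style iteration on a small co-occurrence graph, following the update rule described in the TextRank paper by Mihalcea and Tarau. The word list, the adjacency matrix, and the variable names are illustrative assumptions, and nothing here is claimed to match what textrankKeywords does internally, so the resulting values should not be expected to agree with the scores the function returns.

```matlab
% Undirected co-occurrence graph over four candidate words (illustrative).
words = ["easily" "import" "text" "data"];
A = [0 1 0 1
     1 0 1 1
     0 1 0 1
     1 1 1 0];               % A(i,j) = 1 if words i and j share a window

d = 0.85;                    % damping factor used in the TextRank paper
deg = sum(A, 2);             % number of neighbours of each word (all > 0 here)
scores = ones(numel(words), 1);

% Iterate S(i) = (1 - d) + d * sum over neighbours j of S(j) / deg(j).
for iter = 1:100
    newScores = (1 - d) + d * (A * (scores ./ deg));
    converged = max(abs(newScores - scores)) < 1e-6;
    scores = newScores;
    if converged
        break
    end
end

% Words with more co-occurrence partners end up with higher scores.
ranked = table(words', scores, 'VariableNames', {'Word','Score'});
ranked = sortrows(ranked, 'Score', 'descend')
```

In this sketch, "import" and "data" each have three neighbours and therefore rank above "easily" and "text", which have two.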

Extract Keywords

Create an array of tokenized documents containing the text data.

textData = [
    "MATLAB provides really useful tools for engineers. Scientists use many useful MATLAB toolboxes."
    "MATLAB and Simulink have many features. MATLAB and Simulink makes it easy to develop models."
    "You can easily import data in MATLAB. In particular, you can easily import text data."];
documents = tokenizedDocument(textData);

Extract the keywords using the textrankKeywords function.

tbl = textrankKeywords(documents)

tbl=6×3 table
                   Keyword                   DocumentNumber    Score 
    _____________________________________    ______________    ______

    "useful"    "MATLAB"      "toolboxes"          1           4.8695
    "useful"    ""            ""                   1           2.3612
    "MATLAB"    ""            ""                   1           1.6212
    "many"      "features"    ""                   2           4.6152
    "text"      "data"        ""                   3           3.4781
    "data"      ""            ""                   3           1.7391

If a keyword contains multiple words, then the ith element of the string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".
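As a small illustration of this layout, the following sketch builds a padded string array by hand and counts the words in each keyword by counting the non-empty entries in each row. The values are copied from the table above; the variable names keyword and numWords are illustrative.

```matlab
% Padded keyword array laid out like tbl.Keyword in the table above.
keyword = ["useful" "MATLAB"   "toolboxes"
           "useful" ""         ""
           "many"   "features" ""];

% Each row is one keyword; empty strings are padding, not words.
numWords = sum(keyword ~= "", 2)
```

For these three rows, numWords is [3; 1; 2].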

For readability, transform the multi-word keywords into a single string using the join and strip functions.

if size(tbl.Keyword,2) > 1
    tbl.Keyword = strip(join(tbl.Keyword));
end
head(tbl)

ans=6×3 table
             Keyword             DocumentNumber    Score 
    _________________________    ______________    ______

    "useful MATLAB toolboxes"          1           4.8695
    "useful"                           1           2.3612
    "MATLAB"                           1           1.6212
    "many features"                    2           4.6152
    "text data"                        3           3.4781
    "data"                             3           1.7391

Specify Maximum Number of Keywords Per Document

The textrankKeywords function, by default, returns all identified keywords. To reduce the number of keywords, use the 'MaxNumKeywords' option.
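Independently of this option, the returned table can also be trimmed after the fact with ordinary table operations. The sketch below rebuilds a table by hand from the keyword, document number, and score values shown earlier in this example (so that it runs without recomputing anything) and keeps only the highest-scoring keyword of each document. This is a post-hoc filter under those assumptions, not a replacement for 'MaxNumKeywords', and the two need not give identical results in general.

```matlab
% Values copied from the keyword table shown earlier in this example.
Keyword = ["useful MATLAB toolboxes"; "useful"; "MATLAB"; ...
           "many features"; "text data"; "data"];
DocumentNumber = [1; 1; 1; 2; 3; 3];
Score = [4.8695; 2.3612; 1.6212; 4.6152; 3.4781; 1.7391];
tblAll = table(Keyword, DocumentNumber, Score);

% Order by document, then by descending score within each document.
tblSorted = sortrows(tblAll, {'DocumentNumber','Score'}, {'ascend','descend'});

% unique returns the index of the first row of each document by default,
% which after sorting is that document's highest-scoring keyword.
[~, firstRow] = unique(tblSorted.DocumentNumber);
topPerDoc = tblSorted(firstRow, :)
```

With these values, topPerDoc contains "useful MATLAB toolboxes", "many features", and "text data".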

Extract the top two keywords for each document by setting the 'MaxNumKeywords' option to 2.

tbl = textrankKeywords(documents,'MaxNumKeywords',2)

tbl=5×3 table
                   Keyword                   DocumentNumber    Score 
    _____________________________________    ______________    ______

    "useful"    "MATLAB"      "toolboxes"          1           4.8695
    "useful"    ""            ""                   1           2.3612
    "many"      "features"    ""                   2           4.6152
    "text"      "data"        ""                   3           3.4781
    "data"      ""            ""                   3           1.7391

Specify Part-of-Speech Tags

Notice that in the extracted keywords above, the function does not consider the word "import" as a keyword. This is because the TextRank keyword extraction algorithm, by default, uses tokens with the part-of-speech tags "noun", "proper-noun" and "adjective" as candidate keywords. Because the word "import" is a verb, the algorithm does not consider this as a candidate keyword. Similarly, the algorithm does not consider the adverb "easily" as a candidate keyword.
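To check which part-of-speech tag each token receives, one option is to inspect the token details directly. The sketch below assumes the addPartOfSpeechDetails and tokenDetails functions of Text Analytics Toolbox and uses the third document from this example; the variable names are illustrative, and the exact tags can depend on the tagger in your release.

```matlab
% Third document from the example above.
str = "You can easily import data in MATLAB. In particular, you can easily import text data.";
doc = tokenizedDocument(str);

% Add part-of-speech tags, then list each token next to its tag.
doc = addPartOfSpeechDetails(doc);
tdetails = tokenDetails(doc);
tdetails(:, {'Token','PartOfSpeech'})

% Tokens tagged as verbs or adverbs, which the default settings skip.
isSkipped = ismember(string(tdetails.PartOfSpeech), ["verb" "adverb"]);
tdetails.Token(isSkipped)
```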

To specify which part-of-speech tags to use for identifying candidate keywords, use the 'PartOfSpeech' option.

Extract keywords from the same text as before and also specify the part-of-speech tags "adverb" and "verb".

newTags = ["adverb" "verb"];
tags = ["noun" "proper-noun" "adjective" newTags];
tbl = textrankKeywords(documents,'PartOfSpeech', tags)

tbl=7×3 table
                      Keyword                       DocumentNumber    Score 
    ____________________________________________    ______________    ______

    "use"         "many"    "useful"    "MATLAB"          1           5.8839
    "useful"      ""        ""          ""                1           2.0169
    "MATLAB"      ""        ""          ""                1           1.5478
    "Simulink"    "have"    "many"      ""                2           4.5058
    "Simulink"    ""        ""          ""                2           1.5161
    "import"      "text"    "data"      ""                3           4.7921
    "import"      "data"    ""          ""                3           3.4195

Notice here that the function treats the token "import" as a candidate keyword and merges it into the multi-word keywords "import data" and "import text data".

Specify Window Size

Notice that, in the extracted keywords above, the function does not extract the adverb "easily" as a keyword. This is because of the proximity of this word in the text to other candidate keywords.

The TextRank keyword extraction algorithm scores candidate keywords using the number of pairwise co-occurrences within a sliding window. To increase the window size, use the 'Window' option. Increasing the window size enables the function to find more co-occurrences between keywords, which increases the keyword importance scores. This can result in finding more relevant keywords at the cost of potentially over-scoring less relevant keywords.
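The effect of the window size on co-occurrence counts can be sketched in plain MATLAB. The token sequence below is a hand-picked, simplified list loosely based on the third document, and the counting rule (two tokens co-occur when they are fewer than windowSize positions apart) is an assumption made for illustration rather than a description of the function's internals.

```matlab
% Candidate tokens in order of appearance (illustrative, simplified).
tokens = ["easily" "import" "data" "easily" "import" "text" "data"];
[vocab, ~, idx] = unique(tokens, 'stable');   % idx maps tokens to vocab

windowSizes = [2 3];
totalPairs = zeros(size(windowSizes));

for w = 1:numel(windowSizes)
    windowSize = windowSizes(w);
    cooc = zeros(numel(vocab));   % symmetric co-occurrence counts
    for i = 1:numel(tokens)
        % Tokens i+1 .. i+windowSize-1 share a window with token i.
        lastJ = min(i + windowSize - 1, numel(tokens));
        for j = i+1:lastJ
            if idx(i) ~= idx(j)
                cooc(idx(i), idx(j)) = cooc(idx(i), idx(j)) + 1;
                cooc(idx(j), idx(i)) = cooc(idx(j), idx(i)) + 1;
            end
        end
    end
    totalPairs(w) = sum(cooc(:)) / 2;   % each pair is stored twice
end

totalPairs
```

For this token sequence, a window of 2 yields 6 co-occurring pairs and a window of 3 yields 11, which illustrates how a larger window adds connections between candidate keywords.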

Extract keywords from the same text as before and also specify a window size of 3.

tbl = textrankKeywords(documents, ...
    'PartOfSpeech', tags, ...
    'Window',3)

tbl=8×3 table
                      Keyword                       DocumentNumber    Score 
    ____________________________________________    ______________    ______

    "many"        "useful"    "MATLAB"    ""              1           4.2185
    "really"      "useful"    ""          ""              1           2.8851
    "MATLAB"      ""          ""          ""              1           1.3154
    "Simulink"    ""          ""          ""              2           1.4526
    "develop"     ""          ""          ""              2           1.0912
    "features"    ""          ""          ""              2           1.0794
    "easily"      "import"    "text"      "data"          3           5.2989
    "easily"      "import"    "data"      ""              3           4.0842

Notice here that the function treats the token "easily" as a candidate keyword and merges it into the multi-word keywords "easily import text data" and "easily import data".

To learn more about the TextRank keyword extraction algorithm, see TextRank Keyword Extraction.

Alternatives

You can experiment with different keyword extraction algorithms to see what works best with your data. Because the TextRank keyword extraction algorithm uses a part-of-speech tag-based approach to extract candidate keywords, the extracted keywords can be short. Alternatively, you can try extracting keywords using the RAKE algorithm, which extracts sequences of tokens appearing between delimiters as candidate keywords. To extract keywords using RAKE, use the rakeKeywords function. To learn more, see Extract Keywords from Text Data Using RAKE.
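As a starting point for such a comparison, the sketch below runs rakeKeywords with default options on the same text data used in this example and joins multi-word keywords in the same way as before. It assumes that rakeKeywords returns its keywords in the same padded string array layout as textrankKeywords; the variable name tblRake is illustrative, and no output is shown because it has not been reproduced here.

```matlab
textData = [
    "MATLAB provides really useful tools for engineers. Scientists use many useful MATLAB toolboxes."
    "MATLAB and Simulink have many features. MATLAB and Simulink makes it easy to develop models."
    "You can easily import data in MATLAB. In particular, you can easily import text data."];
documents = tokenizedDocument(textData);

% Extract keywords with RAKE using default options.
tblRake = rakeKeywords(documents);

% Join multi-word keywords into single strings for readability.
if size(tblRake.Keyword,2) > 1
    tblRake.Keyword = strip(join(tblRake.Keyword));
end
head(tblRake)
```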

References

[1] Mihalcea, Rada, and Paul Tarau. "TextRank: Bringing Order into Text." In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 404-411. 2004.
