Text Analytics Toolbox: removing Entity details from documents

I have a very large set of documents that I am preprocessing for use in a BERT classification model.
I have tokenized the documents and added the entity details.
Now I want to remove all of the tokens within the documents that have been tagged as "organization".
I have the following variables:
documents: tokenized documents
tdetails: a table of tokens with the document number, sentence number, line number, Type, Language, PartOfSpeech and Entity.
Token                   DocumentNumber  SentenceNumber  LineNumber  Type       Language  PartOfSpeech   Entity
"Astoria"               1               2               3           'letters'  'en'      'proper-noun'  'person'
"Federal Savings Bank"  1               2               3           'other'    'en'      'proper-noun'  'organization'
"settled"               1               2               3           'letters'  'en'      'verb'         'non-entity'
How do I remove all of the tokens in the variable documents whose Entity is organization?
E.g. in documents(1,1).Vocabulary(7) I can find "Federal Savings Bank", which is in row 7 of the example above. I could loop through all of the documents and check tdetails for organization, but that would take quite a while.
I can't seem to figure out how to do this more simply.

Accepted Answer

Cris LaPierre on 18 Nov 2023
I would use removeWords.
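documents = tokenizedDocument(Text(:));
tdetails = tokenDetails(documents);
% The tokens are in the Token variable of the table, and the Entity
% category is spelled "organization" (as in the table above).
documents2 = removeWords(documents,tdetails.Token(tdetails.Entity=="organization"));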
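A slightly fuller sketch of the same approach, in case it helps. The sample Text, the explicit addEntityDetails call, and the variable names isOrg and orgTokens are assumptions made for illustration and are not part of the original post. One caveat: removeWords matches on the token string, so it removes every occurrence of a listed token in every document, including occurrences that were not tagged as an organization.

```matlab
% Hypothetical sample text; replace with your own string array.
Text = [
    "Astoria Federal Savings Bank settled the case."
    "The bank agreed to pay a fine."];

documents = tokenizedDocument(Text);

% Add entity tags explicitly. As far as I know, the default settings merge
% the tokens of a multi-word entity into one token, which matches the
% "Federal Savings Bank" row in the question.
documents = addEntityDetails(documents);
tdetails = tokenDetails(documents);

% Entity is categorical; list the categories to check the exact spelling.
disp(categories(tdetails.Entity))

% Logical mask over the rows of tdetails, then the unique tagged tokens.
isOrg = tdetails.Entity == "organization";
orgTokens = unique(tdetails.Token(isOrg));

% Remove those tokens from all documents in one vectorized call (no loop).
documents2 = removeWords(documents, orgTokens);
```

Comparing documents and documents2 for a handful of rows is a quick way to confirm that only the intended tokens were dropped.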
  1 Comment
david cowan on 19 Nov 2023
Moved: Cris LaPierre on 19 Nov 2023
Really appreciate that.
removeWords !!
I'll not forget that now - I knew there had to be a simple approach I was just missing
