Incomplete reading of MS Word file

At work I have to read some VERY long Word documents (~300 pages) and analyze the text. However, if I use the commands suggested in https://fr.mathworks.com/matlabcentral/answers/348737-how-to-read-ms-word-file-doc-docx :
word = actxserver('Word.Application');
wdoc = word.Documents.Open(filePath);
text = wdoc.Content.text;
wdoc.Close; % close document
word.Quit; % end application
the resulting "text" variable (1x158745 char) only contains ~25% of the document.
How can I read the whole document using this method? I saw that newer releases have dedicated functions/toolboxes for reading Word documents, but I don't have access to them, as my company only provides R2020b and a limited set of toolboxes.

Answers (1)

I haven't tried this on such a huge file, but can you try opening the Word document with fopen and reading the whole text using fread(fid, '*char')? Maybe it will work.

1 Comment

That will not work in the form stated. .docx files are zip files that contain a directory of mostly XML files.
You can unzip the .docx file and go through the directory and try to extract things from the XML files; the XML files would be text files.
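
A rough sketch of that unzip-and-scan idea, in case it helps. It assumes the file is a .docx (not a legacy .doc) and that the main body text is stored in word/document.xml inside the archive, with paragraphs closed by </w:p> and visible text held in <w:t> elements. The function name readDocxText is made up for illustration; it is not a built-in. It uses only base MATLAB functions (unzip, fopen, fread, regexp), so it should not need Word or any extra toolbox, but I have not tried it on a 300-page document.

```matlab
function txt = readDocxText(docxPath)
% readDocxText  Hypothetical helper (not a MATLAB built-in): extract plain
% text from a .docx by unzipping it and scanning word/document.xml.
    tmpDir = tempname;                              % scratch folder for the unzipped parts
    unzip(docxPath, tmpDir);
    cleanup = onCleanup(@() rmdir(tmpDir, 's'));    % delete the scratch folder on exit

    % Read the main document part as UTF-8 text.
    fid = fopen(fullfile(tmpDir, 'word', 'document.xml'), 'r', 'n', 'UTF-8');
    xml = fread(fid, '*char').';
    fclose(fid);

    % Assumption: every paragraph ends with </w:p> and its visible text
    % sits in <w:t> or <w:t xml:space="preserve"> elements.
    chunks = strsplit(xml, '</w:p>');
    paras  = cell(size(chunks));
    for k = 1:numel(chunks)
        tok = regexp(chunks{k}, '<w:t(?:\s[^>]*)?>([^<]*)</w:t>', 'tokens');
        if isempty(tok)
            paras{k} = '';           % paragraph (or trailing XML) without text
        else
            tok = [tok{:}];          % flatten the nested token cells
            paras{k} = [tok{:}];     % concatenate the text runs of this paragraph
        end
    end
    txt = strjoin(paras, newline);

    % Decode the basic XML entities; &amp; must be handled last.
    txt = strrep(txt, '&lt;',   '<');
    txt = strrep(txt, '&gt;',   '>');
    txt = strrep(txt, '&quot;', '"');
    txt = strrep(txt, '&apos;', '''');
    txt = strrep(txt, '&amp;',  '&');
end
```

Usage would be something like text = readDocxText(filePath); with the same filePath as in the question. Known gaps in this sketch: tabs and manual line breaks (<w:tab/>, <w:br/>) are dropped, numeric character references are not decoded, and headers, footers and footnotes live in separate XML parts (for example word/header1.xml or word/footnotes.xml) that it does not read.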

Release

R2020b

Asked:

on 12 Apr 2023

Commented:

on 12 Apr 2023
