How can I get the word count of each line from an extracted PDF file

Hi, I extracted text from a PDF file with many lines/entries of comments. I want to get the word count of each line, the average word count across all lines, and the number of lines that have only one word. Is this possible? Thanks!

Answers (1)

Kiran Felix Robert on 2 Feb 2021
Hi Yao,
I assume that you have extracted the text from the PDF file and saved it as a string variable. You can convert the string to a character array (convertStringsToChars) and then count the words and lines character by character.
Assume that
  1. Every word ends with a space
  2. Every line ending has a carriage return and line feed
Using the built-in MATLAB example file, the following program gives you the total line count and total word count for one section of the file.
str = extractFileText("exampleSonnets.pdf");
% Extract the section between the first "II" and the first "III"
ii = strfind(str,"II");
iii = strfind(str,"III");
start = ii(1);
fin = iii(1);
stringText = extractBetween(str,start,fin-1);
B = convertStringsToChars(stringText);
% Define the space character and end-of-line character explicitly
% instead of reading them from fixed positions in B
SpaceCharacter = ' ';
LineFeedCharacter = newline; % char(10), the last character of a CR+LF pair
lineCount = 0;
wordCount = 0;
for i = 1:length(B)
    if B(i) == LineFeedCharacter
        lineCount = lineCount + 1; % Total line count
    end
    if B(i) == SpaceCharacter
        wordCount = wordCount + 1; % Total word count (one space per word assumed)
    end
end
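The loop above gives totals only. For the per-line word counts, the average, and the number of one-word lines asked about in the question, something like the following sketch may work. It uses a short made-up sample (txt is hypothetical) in place of the extracted PDF text, treats any run of non-whitespace characters as a word, and ignores blank lines. With real data, txt would be the output of extractFileText.

```matlab
% Hypothetical sample standing in for the text extracted from the PDF
txt = join(["Great product overall"; "Good"; ...
            "Would not buy again"; "Excellent"], newline);

% One string per line; splitlines handles LF, CR, and CR+LF endings
textLines = splitlines(txt);

% Count the runs of non-whitespace characters (words) on each line
wordsPerLine = arrayfun(@(s) numel(regexp(s, '\S+', 'match')), textLines);

% Ignore blank lines so they do not pull the average down
wordsPerLine = wordsPerLine(wordsPerLine > 0);

averageWords = mean(wordsPerLine);      % average word count over all lines
oneWordLines = sum(wordsPerLine == 1);  % lines containing exactly one word
```

Unlike counting spaces, this approach does not depend on every word ending with a space, so indentation and repeated spaces do not inflate the counts.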
Kiran
