How to fix my attempt to vectorize counts of strings and regexpPatterns in a text file?

REVISED:
Hello Folks,
I am having difficulty vectorizing the counting of lines in a data file, File_1_rev1.txt, that contain search terms, where each term can be either a plain string or a regular expression pattern. The attached file is small for the purpose of this example. The actual file I want to parse is typically 2 TB in size, so I want to perform the counts as efficiently as possible.
Objective:
Minimize the processing time for counting the lines in File_1_rev1.txt that contain each string or regexpPattern, and output the counts in a table.
Desired output:
Code Issue:
The output I get from the code provided below is incorrect. How do I define the variable <C> correctly to count lines containing regular expression patterns so that I get the desired output, shown above?
clear
clc
SearchTerms = {...
'Term_1', 'Blanket';...
'Term_2', 'blah';...
'Term_3', 'of';...
'Term_4', '(dat|not)\d{1}';...
'Term_5', '(dat|not)\d{23}'...
};
Term_IDs = SearchTerms(:,1); % ID of string/regexpPattern to search for
Term_Patterns = SearchTerms(:,2); % string/regexpPattern to count
Num_SearchTerms = height(SearchTerms);
fid = fopen('File_1_rev1.txt');
Text = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
Lines = Text{1,1};
C = categorical(Lines, Term_Patterns, Term_IDs);
[TermCounts,Categories] = histcounts(C);
Result = cell2table(cell(0,Num_SearchTerms), 'VariableNames', Term_IDs');
Result = [Result; num2cell(TermCounts)]
Result = 1×5 table
    Term_1    Term_2    Term_3    Term_4    Term_5
    ______    ______    ______    ______    ______
      0         0         0         0         0
  1 Comment
Jude on 28 Dec 2023
I have also tried the following, without success:
1. The trouble with the line below is getting regexpPattern to work:
C = categorical(Lines, Term_Patterns,Term_IDs,"Ordinal",true);
2. The line below looked workable, but I am having trouble with the implementation:
C = discretize(Lines, contains(Lines, regexpPattern(Term_Patterns)), 'categorical', Term_IDs')
3. I am currently looking into using the dictionary function to convert <Lines> into a line-by-line representation of <Term_IDs> where applicable, followed by the categorical and histcounts functions to get the counts.
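A likely reason the categorical attempts return all zeros: categorical(A, valueset, catnames) assigns a category only to elements of A that exactly equal an entry of valueset, so it does not do substring or regular expression matching. The sketch below illustrates this on a few hypothetical in-memory lines (not the attached file) and compares the result with contains.

```matlab
% Hypothetical stand-in for a few lines of the data file.
Lines    = ["Blanket Blanket Blanket"; "dat3 field blah"; "blah"];
valueset = ["Blanket"; "blah"];
catnames = ["Term_1"; "Term_2"];

% categorical keeps only elements that exactly equal a valueset entry;
% every other element becomes <undefined>.
C = categorical(Lines, valueset, catnames);
exactCounts = countcats(C);          % Term_1: 0, Term_2: 1 (only the line "blah")
nUndefined  = nnz(isundefined(C));   % lines that matched no valueset entry exactly

% Counting lines that merely contain a term needs contains (or regexp).
substringCounts = [nnz(contains(Lines, "Blanket")); ...
                   nnz(contains(Lines, "blah"))];
```

In this sketch only the third line is assigned a category, while contains finds "Blanket" in one line and "blah" in two.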


Accepted Answer

Stephen23 on 28 Dec 2023
Edited: Stephen23 on 28 Dec 2023
SearchTerms = {...
'Term_1', 'Blanket';...
'Term_2', 'blah';...
'Term_3', 'of';...
'Term_4', '(dat|not)\d{1}';...
'Term_5', '(dat|not)\d{23}'...
};
Term_IDs = SearchTerms(:,1); % ID of string/regexpPattern to search for
Term_Patterns = SearchTerms(:,2); % string/regexpPattern to count
L = readlines('File_1_rev1.txt')
L = 5729×1 string array
    "Blanket Blanket Blanket"
    ""
    "This"
    "is a test"
    "a test Of your"
    "testing system"
    "this text does"
    "not mean anything."
    "! Do not5 mind spe$cial charac7er5~"
    "not mean anything."
    ""
    ""
    "this text does"
    "testing system"
    "a test of your"
    "is a test"
    "This"
    "55 !! && Test"
    "dat3 field blah"
    "blah Blah"
    "case sensitive or not"
    "might want to create counts"
    "for each maybe not. This"
    "is the end oF an example,"
    "instead of having actual"
    "data with millions of lines"
    "of text. "
    ""
    ""
    "This"
P = regexpPattern(Term_Patterns);
F = @(p)nnz(contains(L,p));
V = arrayfun(F,P)
V = 5×1
     1
   424
   848
   424
     0
T = unstack(table(V,Term_IDs),'V','Term_IDs')
T = 1×5 table
    Term_1    Term_2    Term_3    Term_4    Term_5
    ______    ______    ______    ______    ______
      1        424       848       424        0
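For a file in the 2 TB range mentioned in the question, reading everything at once with readlines may not fit in memory. One possible variation, sketched here under assumptions, reads a fixed number of lines at a time with textscan and accumulates the per-term counts. The demo file name, its contents, and chunkSize are hypothetical; a real chunk size would need tuning to the available memory.

```matlab
% Write a small demo file so the sketch is self-contained (hypothetical data).
demoFile  = fullfile(tempdir, 'chunk_demo.txt');
demoLines = {'Blanket Blanket', 'dat3 field blah', 'blah Blah', ...
             'a test of your', 'not5 mind'};
fid = fopen(demoFile, 'w');
fprintf(fid, '%s\n', demoLines{:});
fclose(fid);

Term_Patterns = {'Blanket'; 'blah'; 'of'; '(dat|not)\d{1}'; '(dat|not)\d{23}'};
P = regexpPattern(Term_Patterns);   % same pattern construction as in the answer

chunkSize = 2;                      % lines per block; assumed value for the demo
counts = zeros(numel(P), 1);

fid = fopen(demoFile, 'r');
while ~feof(fid)
    % Read up to chunkSize lines; only newline separates fields.
    block = textscan(fid, '%s', chunkSize, 'Delimiter', '\n', 'Whitespace', '');
    if isempty(block{1})
        break;                      % nothing left to read
    end
    L = string(block{1});
    % Add the number of lines in this block that contain each pattern.
    counts = counts + arrayfun(@(p) nnz(contains(L, p)), P);
end
fclose(fid);
delete(demoFile);
```

As in the answer, a line counts once per pattern no matter how many matches it contains, so the accumulated totals should agree with a single pass over the whole file.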

Release

R2023a