How to fix my attempt to vectorize counts of strings and regexpPatterns in a text file?

Question

Jude on 27 Dec 2023


Commented: Jude on 28 Dec 2023

Accepted Answer: Stephen23


REVISED:

Hello Folks,

I am having difficulty vectorizing the count of lines in a data file, File_1_rev1.txt, that contain search terms, where each term can be either a string or a regular expression pattern. The attached file is small for the purpose of this example. The actual file I want to parse is typically 2 TB in size, so I want to perform the counts as efficiently as possible.

Objective:

Minimize the processing time for counting lines in File_1_rev1.txt that contain occurrences of strings or regexpPatterns, and output the counts in a table.

Desired output:

Code Issue:

The output I get from the code provided below is incorrect. How do I define the variable <C> correctly to count lines containing regular expression patterns so that I get the desired output, shown above?

clear
clc
SearchTerms = {...
                'Term_1', 'Blanket';...
                'Term_2', 'blah';...
                'Term_3', 'of';...
                'Term_4', '(dat|not)\d{1}';...
                'Term_5', '(dat|not)\d{23}'...
              };
Term_IDs = SearchTerms(:,1);       % ID of string/regexpPattern to search for
Term_Patterns = SearchTerms(:,2);  % string/regexpPattern to count
Num_SearchTerms = height(SearchTerms);
fid = fopen('File_1_rev1.txt');
Text = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
Lines = Text{1,1};
C = categorical(Lines, Term_Patterns, Term_IDs);
[TermCounts,Categories] = histcounts(C);
Result = cell2table(cell(0,Num_SearchTerms), 'VariableNames', Term_IDs');
Result = [Result; num2cell(TermCounts)]
Result = 1×5 table
    Term_1    Term_2    Term_3    Term_4    Term_5
    ______    ______    ______    ______    ______

      0         0         0         0         0   
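
A likely reason for the all-zero result, sketched here as an illustration rather than as part of the original post: categorical(A, valueset, catnames) compares each element of A with the value set by exact equality and does not interpret the values as regular expressions. A line that is not identical to one of the patterns becomes <undefined>, and histcounts leaves undefined elements out of the counts. The three demo lines below are made up for the example.

```matlab
% Demo lines (hypothetical); only the second is exactly equal to a pattern.
Lines = {'Blanket Blanket Blanket'; 'blah'; 'dat3 field blah'};
Pats  = {'Blanket'; 'blah'; '(dat|not)\d{1}'};
IDs   = {'Term_1'; 'Term_2'; 'Term_4'};

% Exact matching: a line must equal a pattern to receive its category.
C = categorical(Lines, Pats, IDs);

U = isundefined(C);   % true where a line matched no value exactly
N = histcounts(C);    % counts per category, undefined elements excluded
```

In this sketch only the line 'blah' receives a category, so the regular expression term is never counted even though 'dat3 field blah' contains a match for it.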

1 Comment

Jude on 28 Dec 2023

Unsuccessfully, I have also tried...

1. The trouble with the line below is getting regexpPattern to work:

C = categorical(Lines, Term_Patterns,Term_IDs,"Ordinal",true);

2. The line below looked workable, but I am having trouble with the implementation:

C = discretize(Lines, contains(Lines, regexpPattern(Term_Patterns)), 'categorical', Term_IDs')

3. I am currently looking into using the dictionary function to convert <Lines> into a line-by-line representation of <Term_IDs> where applicable, then following up with the categorical and histcounts functions to get the counts.


Answer 1

Stephen23 on 28 Dec 2023


Edited: Stephen23 on 28 Dec 2023


File_1_rev1.txt

SearchTerms = {...
    'Term_1', 'Blanket';...
    'Term_2', 'blah';...
    'Term_3', 'of';...
    'Term_4', '(dat|not)\d{1}';...
    'Term_5', '(dat|not)\d{23}'...
    };
Term_IDs      = SearchTerms(:,1);  % ID of string/regexpPattern to search for
Term_Patterns = SearchTerms(:,2);  % string/regexpPattern to count
L = readlines('File_1_rev1.txt')
L = 5729×1 string array
    "Blanket Blanket Blanket"
    ""
    "This"
    "is a test"
    "a test Of your"
    "testing system"
    "this text does"
    "not mean anything."
    "! Do not5 mind spe$cial charac7er5~"
    "not mean anything."
    ""
    ""
    "this text does"
    "testing system"
    "a test of your"
    "is a test"
    "This"
    "55 !! && Test"
    "dat3 field blah"
    "blah Blah"
    "case sensitive or not"
    "might want to create counts"
    "for each maybe not.  This"
    "is the end oF an example,"
    "instead of having actual"
    "data with millions of lines"
    "of text. "
    ""
    ""
    "This"
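P = regexpPattern(Term_Patterns);  % convert each search term to a regular expression pattern
F = @(p)nnz(contains(L,p));        % number of lines in L that contain pattern p
V = arrayfun(F,P)                  % one count per search term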
V = 5×1
     1
   424
   848
   424
     0
T = unstack(table(V,Term_IDs),'V','Term_IDs')
T = 1×5 table
    Term_1    Term_2    Term_3    Term_4    Term_5
    ______    ______    ______    ______    ______

      1        424       848       424        0   
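
The question mentions files of around 2 TB, which readlines would have to hold in memory all at once. The following is a minimal sketch, not part of the accepted answer, of the same contains/regexpPattern counting applied block by block. The block size, the temporary demo file, and the shortened term list are assumptions made only so the example is self-contained. It relies on writelines, which requires R2022a or later.

```matlab
% Shortened, hypothetical term list for the demo.
Term_IDs      = {'Term_1'; 'Term_2'; 'Term_4'};
Term_Patterns = {'Blanket'; 'blah'; '(dat|not)\d{1}'};

% Small temporary file standing in for the real data (assumption).
fname = [tempname '.txt'];
writelines(["Blanket Blanket"; "dat3 field blah"; "blah Blah"; "not5 mind"], fname);

P = regexpPattern(Term_Patterns);
counts = zeros(numel(P), 1);
blockSize = 2;   % lines per block; assumed value, far larger for real data

fid = fopen(fname, 'r');
while ~feof(fid)
    % Read at most blockSize lines, keeping leading whitespace.
    block = textscan(fid, '%s', blockSize, 'Delimiter', '\n', 'Whitespace', '');
    L = string(block{1});
    % Add this block's matching-line count to the running total per term.
    for k = 1:numel(P)
        counts(k) = counts(k) + nnz(contains(L, P(k)));
    end
end
fclose(fid);
delete(fname);

T = array2table(counts.', 'VariableNames', Term_IDs.')
```

Only one block of lines is held in memory at a time, so peak memory is set by blockSize rather than by the file size. Whether this is fast enough for a file of that size would need to be measured.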

2 Comments

Dyuman Joshi on 28 Dec 2023

+1 for readlines()

Jude on 28 Dec 2023

@Stephen23, thank you for sharing your solution with me. I like this vectorized approach.

