How do I keep updating and accumulating my arrays as I read multiple files one after the other?
I have multiple .m files, and I have written code that reads one file and does exactly what I need.
However, I need to run this code over multiple files, each containing different words, and at the end find all the words that exist across ALL the files.
How can I do this?
This is my code so far:
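fid = fopen('testing1.m');
out = textscan(fid, '%s', 'Delimiter', '\n');    % one cell per line
fclose(fid);
out = regexp(lower(out{1}), ' ', 'split');       % split each line into words
comb = unique([out{:}]);                         % all distinct words in the file
comb = comb(~cellfun('isempty', comb));
m = size(out, 1);
idx = false(m, size(comb, 2));
for j = 1:m
    idx(j,:) = ismember(comb, out{j});           % which words occur on line j
    if ismember('hello', out{j})
        AL(j,:) = idx(j,:);                      % keep rows of lines containing 'hello'
    end
end
AL(all(AL==0, 2), :) = [];                       % drop rows that stayed all-zero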
To open multiple files I use this:
for i=1:2
fid=fopen(sprintf('testing%d.m',i))
When I use this to open two files, my code fails because the matrix dimensions do not match.
Any ideas on how to output a cell array AL for the two .m files testing1 and testing2? I want to create cell arrays that accumulate each time a file is read.
Accepted Answer
Cedric on 19 Apr 2013 (edited 19 Apr 2013)
You could go for something like:
words = {} ;
for k = 1 : 2
    buffer = fileread(sprintf('testing%d.m', k)) ;
    words = [words, regexp(buffer, '\w*', 'match')] ; % Alphanumerical words.
end
uniqueWords = unique(words) ;
% .. etc.
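To get back to the AL matrix from the question, one option is to accumulate the tokenized lines of all files first and build the word list only once, so that every row has the same width. The following is only a sketch under assumptions: the temporary folder, the sample file contents, and the names allLines and hasHello are made up for illustration, and the lines are tokenized with '\w+' rather than split on spaces.

```matlab
% Write two small sample files so the sketch is self-contained.
% Folder, file contents and variable names are illustrative only.
folder = tempname ;
mkdir(folder) ;
samples = {sprintf('hello world\nfoo bar'), sprintf('Hello again\nbaz foo')} ;
for k = 1 : numel(samples)
    fid = fopen(fullfile(folder, sprintf('testing%d.m', k)), 'w') ;
    fprintf(fid, '%s', samples{k}) ;
    fclose(fid) ;
end

% Accumulate the tokenized lines of every file in one cell array.
allLines = {} ;
for k = 1 : numel(samples)
    buffer = lower(fileread(fullfile(folder, sprintf('testing%d.m', k)))) ;
    fileLines = regexp(buffer, '\r?\n', 'split') ;
    fileLines = fileLines(~cellfun('isempty', fileLines)) ;
    allLines = [allLines, regexp(fileLines, '\w+', 'match')] ; %#ok<AGROW>
end

% Build the vocabulary once, from all files, so all rows have equal width.
comb = unique([allLines{:}]) ;
idx = false(numel(allLines), numel(comb)) ;
for j = 1 : numel(allLines)
    idx(j,:) = ismember(comb, allLines{j}) ;
end

% Keep only the rows of lines that contain 'hello'.
hasHello = cellfun(@(w) any(strcmp('hello', w)), allLines) ;
AL = idx(hasHello,:) ;

rmdir(folder, 's') ;   % remove the sample files again
```

With the sample contents above, comb holds six words and AL has one row per line containing 'hello' (two rows here). For real data, replace the sample-writing loop with your own file names.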
13 Comments
Cedric on 21 Apr 2013 (edited 21 Apr 2013)
I see :) Your computation with idxsd doesn't scale well, actually.
The first thing you can do is remove Asd, which is strictly equivalent to idxsd, and preallocate the latter, since you know its size (n_lines x n_words). You can also try to avoid storing the full idxsd and store counts instead, and get rid of ISMEMBER. I ran a few tests, and I compare each of them with the output of your version of sdmean:
% - Prealloc.
tic ;
idxsd2 = false(length(linessd), length(wordssd)) ;
for j = 1 : length(linessd)
    idxsd2(j,:) = ismember(wordssd, linessd{j});
end
sdmean2 = mean(idxsd2) ;
toc

% - Store in vector; avoid array.
tic ;
idxsd3 = zeros(1, length(wordssd)) ;
for j = 1 : length(linessd)
    idxsd3 = idxsd3 + ismember(wordssd, linessd{j});
end
sdmean3 = idxsd3 / length(linessd) ;
toc

% - Avoid ISMEMBER.
tic ;
idxsd4 = zeros(1, length(wordssd)) ;
for j = 1 : length(linessd)
    pos = arrayfun(@(w)find(strcmp(w, wordssd), 1), linessd{j}) ;
    idxsd4(pos) = idxsd4(pos) + 1 ;
end
sdmean4 = idxsd4 / length(linessd) ;
toc

% - Avoid ISMEMBER and ARRAYFUN.
tic ;
idxsd5 = zeros(1, length(wordssd)) ;
for j = 1 : length(linessd)
    for k = 1 : length(linessd{j})
        idxsd5 = idxsd5 + strcmp(linessd{j}{k}, wordssd) ;
    end
end
sdmean5 = idxsd5 / length(linessd) ;
toc

% Check.
[all(sdmean2==sdmean), all(sdmean3==sdmean), ...
 all(sdmean4==sdmean), all(sdmean5==sdmean)]
Running this comparison outputs:
Elapsed time is 0.475360 seconds.
Elapsed time is 0.324377 seconds.
Elapsed time is 0.316564 seconds.
Elapsed time is 0.301044 seconds.
Elapsed time is 0.121800 seconds.
ans =
1 1 1 0
which indicates that the double FOR loop is the fastest, BUT its result differs from the previous outputs. The reason is that it counts all the occurrences of a word in a line, whereas ISMEMBER and the solution based on ARRAYFUN generate a unit increment even when there are multiple occurrences.
We could correct the behavior of the 5th method so it matches the four previous ones, but before doing that I wanted to raise the question: "which behavior is correct from the point of view of the statistics that you want to compute?" Once you answer this, we can also think about using STRNCMP.
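To make the difference concrete, here is a small sketch on made-up data that computes both statistics with the double-loop approach. Calling UNIQUE on each line before the inner loop is just one possible way to get the unit-increment behavior and is an assumption, not necessarily the correction intended above; the toy contents of linessd and wordssd and the names counts and presence are illustrative only.

```matlab
% Toy data (made up): the word 'a' appears twice on the first line.
linessd = {{'a', 'b', 'a'}, {'b', 'c'}} ;
wordssd = {'a', 'b', 'c'} ;

counts   = zeros(1, length(wordssd)) ;   % every occurrence adds 1
presence = zeros(1, length(wordssd)) ;   % at most 1 per line, like ISMEMBER
for j = 1 : length(linessd)
    lineWords = linessd{j} ;
    for k = 1 : length(lineWords)
        counts = counts + strcmp(lineWords{k}, wordssd) ;
    end
    lineWords = unique(lineWords) ;      % drop repetitions within the line
    for k = 1 : length(lineWords)
        presence = presence + strcmp(lineWords{k}, wordssd) ;
    end
end
meanCounts   = counts   / length(linessd) ;   % [1 1 0.5]
meanPresence = presence / length(linessd) ;   % [0.5 1 0.5]
```

The two results differ only for words repeated within a line ('a' here), which is exactly the case to decide on before optimizing further.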
Just as a side note: your matrix idxsd can be visualized with
spy(idxsd) ;
which is interesting qualitatively speaking.