How do I keep updating and accumulating my arrays as I read multiple files one after the other?

So I have multiple M-files, and I have implemented code that reads one file and does exactly what I need.
However, I need to run this code over multiple files, all with different words in them, and at the end I need to find all words that exist across ALL the files.
How can I do this?
This is my code so far:
fid = fopen('testing1.m');
out = textscan(fid, '%s', 'Delimiter', '\n');
out = regexp(lower(out{1}), ' ', 'split');
fclose(fid);
comb = unique([out{:}]);
comb = comb(~cellfun('isempty', comb));
m = size(out, 1);
idx = false(m, size(comb,2));
for j = 1:m
    idx(j,:) = ismember(comb, out{j});
    if ismember('hello', out{j})
        AL(j,:) = idx(j,:);
    end
end
AL(all(AL==0,2),:) = [];
To open my multiple files I use this:
for i = 1:2
    fid = fopen(sprintf('testing%d.m', i));
When I use this to open 2 files, I can't seem to make my code work because of the matrix dimensions.
Any ideas on how to output a cell array AL for the two M-files testing1.m and testing2.m? I want to create cell arrays that accumulate each time a file is read.

 Accepted Answer

You could go for something like:
words = {} ;
for k = 1 : 2
    buffer = fileread(sprintf('testing%d.m', k)) ;
    words = [words, regexp(buffer, '\w*', 'match')] ; % Alphanumeric words.
end
uniqueWords = unique(words) ;
% ... etc.

13 Comments

That works great for a small number of files, thank you; however, I need to read 1000 files. Also, how can I read each line from my script separately? In your example, everything from both scripts is output into one array, which I do need for unique, but I also need to read each line.
My main aim is to work out the average of the word distribution for sentences containing a certain word, and I need to do this over 1000 files. If I do it this way, I'll get an out-of-memory error. Any ideas? I was thinking the best way was to create the logical vectors for file 1, and then add onto them each time a new word is found in the next file, and so on.
What kind of files are you treating exactly? You named them with extension .m, and I can't imagine that 1000 M-files, even full of code, could lead to an "out of memory" error.
You are talking about "lines" and "sentences". Do you have one sentence per line?
Without talking yet about statistics per sentence, does the following lead to the same error?
words = {} ;
for k = 1 : 2
    buffer = fileread(sprintf('testing%d.m', k)) ;
    words = unique([words, regexp(buffer, '\w*', 'match')]) ;
end
Also, is the '\w*' pattern fine or do you need more than alphanumeric characters plus the underscore?
I'm sorry for the confusion. Your code works just fine; it's just that I'm not able to use it in my code because I can't read each line (one sentence per line) separately.
fid = fopen('testing1.m');
out = textscan(fid, '%s', 'Delimiter', '\n');
out = regexp(lower(out{1}), ' ', 'split');
fclose(fid);
My out reads each line from the script, which I then need to create my cell array idx; I need to compare uniqueWords with each line.
I've tried to combine all my files into one, but when I try to create
idx = false(m, size(comb,2));
I'm told I've exceeded the maximum variable size allowed; I get a 386701-by-42466 array.
My conversations have commas, question marks, percentage signs, etc., and I was able to read the conversations just fine when I ran your code.
What I still don't understand is that you are talking about "distribution of words in sentences that contain a certain word", but nothing in your code identifies sentences unless there is one sentence per line... so is there one sentence per line? If so, here is a second proposal:
nFiles = 2 ;
match = 'hello' ;
words = {} ;
lines = {} ;
for k = 1 : nFiles
    fid = fopen(sprintf('testing%d.m', k), 'r') ;
    content = textscan(fid, '%s', 'Delimiter', '\n') ;
    fclose(fid) ;
    content = regexp(lower(content{1}), '\w*' , 'match') ;
    words = unique([words, [content{:}]]) ;
    id = cellfun(@(line) any(strcmpi(match, line)), content) ;
    lines = [lines; content(id)] ;
end
where words should contain all unique words, and lines all lines that contain the match. It is essentially what you had already done, but it filters relevant words/lines within the "files" loop, so we don't stack irrelevant content.
Yes, each line is a sentence, and it works great now. It worked for 100 files, but it took 5 minutes; I dread to see how long it takes to run the code for 1000 files! I'm just renaming my files now so I can loop over them all.
Thank you Cedric, you were really helpful, I appreciate it. :)
My pleasure! Now you should profile this little bit of code to see what takes that much time ( >> profile viewer ), and we could talk further about optimization.
So I saved my code in an M-file called execute and profiled it as you said, and it shows that
idx(j,:) = ismember(words, lines...)
is called 4785 times with a total time of 185.8 s, half of the time taken to run the files.
The next one is
AA(j,:) = idx(j,:); called 4785 times, taking up 40% of the time. And this is just for 100 of my files. I tried to run it with all 1154 files, but it took too long.
I need to run this code 42 times, and I can't even run it once. Any ideas on optimisation? Also, I was wondering where my output is stored, because in the end I get the average of a matrix 42 different times for 42 different conditions, and I need all the averages to run them in another M-file to measure the angle between each of these vectors.
...I have to do all this by Monday; is that doable?
Ok; I actually don't understand where you need to use ISMEMBER anymore. The version that I gave you outputs the list of all words and a cell array of all lines that contain the match. Why do you still have ISMEMBER, and also the second part with AA(j,:) = idx(j,:)? It looks like you profiled your first version... could you copy/paste your final code?
Also, you mention words in sentences/lines, statistics, 42 conditions, vectors, angles... I have no idea what you are doing indeed ;-) Could you explain the whole thing a little more?
For Monday I have no idea; are the files that you are treating large? Can you send one or two to me by email (using my profile email address), or are they too large?
It's a rather long process, so here goes; I'll try to say it in a few words.
My aim is to identify similarities in dialogue act tags. I have conversations between people, with a dialogue act tag and a sentence associated with it in each line.
Example
sd A.11 I think this looks good on you
qy B.12 Why do you think that ?
sd A.13 I know this is true
where sd, and qy are dialogue act tags, A and B are the speakers.
The cell array idx outputs a logical array that compares each sentence containing 'sd' with the array that holds all of the unique words, outputting 1 if a string in the line containing 'sd' is also in the cell array of unique words, and 0 otherwise. So at the end, I get all these idx matrices for the dialogue act sd, and I work out their mean.
I then run the same code while changing the condition, looking only for sentences with DA qy. I do so for 42 different DAs, so I'm basically going to run the same code 42 times, changing 'match' each time.
Then I get the mean for each DA tag, and I work out the cosine similarity between them all.
I'm going to add my code below and send you a file or two. The files themselves are not too big; it's the matrices I create that are rather large.
Example
sd hello my name is Sam
b hello, my name is Cedric
sd nice to meet you Cedric
OUTPUT: wordssd = hello my name is sam Cedric nice to meet you
Asd = [1 1 1 1 1 0 0 0 0 0 ;
       0 0 0 0 0 1 1 1 1 1]
for the two sentences with 'sd'.
sdmean=[0.5 .......0.5]. I hope it makes more sense this way :)
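For the last cosine-similarity step, what I have in mind is something like this sketch (hypothetical names; it assumes all 42 mean vectors are computed over one common vocabulary so they can be stacked into a single matrix M, which the per-tag wordssd arrays don't give directly):

```matlab
% M : hypothetical 42-by-nWords matrix, one row per DA-tag mean vector.
% Assumes every row is expressed over the same global word list.
Mn = bsxfun(@rdivide, M, sqrt(sum(M.^2, 2))) ;  % normalize each row to unit length
S  = Mn * Mn.' ;                                % S(p,q) = cosine similarity of DA p and DA q
```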
I see :) Your computation with idxsd doesn't scale well, actually.
The first thing you can do is remove Asd, which is strictly equivalent to idxsd, and preallocate the latter since you know its size (n_lines x n_words). You can also avoid storing the full idxsd by storing counts instead, and get rid of ISMEMBER. I performed a few tests, and I compare each of them with the output of your version of sdmean:
% - Prealloc.
tic ;
idxsd2 = false(length(linessd), length(wordssd)) ;
for j = 1 : length(linessd)
    idxsd2(j,:) = ismember(wordssd, linessd{j}) ;
end
sdmean2 = mean(idxsd2) ;
toc
% - Store in vector; avoid full array.
tic ;
idxsd3 = zeros(1, length(wordssd)) ;
for j = 1 : length(linessd)
    idxsd3 = idxsd3 + ismember(wordssd, linessd{j}) ;
end
sdmean3 = idxsd3 / length(linessd) ;
toc
% - Avoid ISMEMBER.
tic ;
idxsd4 = zeros(1, length(wordssd)) ;
for j = 1 : length(linessd)
    pos = arrayfun(@(w) find(strcmp(w, wordssd), 1), linessd{j}) ;
    idxsd4(pos) = idxsd4(pos) + 1 ;
end
sdmean4 = idxsd4 / length(linessd) ;
toc
% - Avoid ISMEMBER and ARRAYFUN.
tic ;
idxsd5 = zeros(1, length(wordssd)) ;
for j = 1 : length(linessd)
    for k = 1 : length(linessd{j})
        idxsd5 = idxsd5 + strcmp(linessd{j}{k}, wordssd) ;
    end
end
sdmean5 = idxsd5 / length(linessd) ;
toc
% - Check against the original sdmean.
[all(sdmean2==sdmean), all(sdmean3==sdmean), ...
    all(sdmean4==sdmean), all(sdmean5==sdmean)]
Running this comparison outputs:
Elapsed time is 0.475360 seconds.
Elapsed time is 0.324377 seconds.
Elapsed time is 0.316564 seconds.
Elapsed time is 0.301044 seconds.
Elapsed time is 0.121800 seconds.
ans =
1 1 1 0
which indicates that the double FOR loop is the fastest, BUT it differs from the previous outputs. The reason is that it counts all occurrences of a word in a line, whereas ISMEMBER and the ARRAYFUN-based solution generate a unit increment even when there are multiple occurrences. We could correct the behavior of the 5th method so it matches the four previous ones, but before doing that I wanted to raise the question: which behavior is correct from the point of view of the statistics that you want to compute? Once you answer this, we can also think about using STRNCMP.
Just as a side note: your matrix idxsd can be visualized with
spy(idxsd) ;
which is interesting qualitatively speaking.
I have just run all your code on 100 files, and I get a different result for the first method with sdmean2, but all the others have the same mean, even the 5th method. idxsd2 is a 2012-by-3956 logical, whereas the others are 1-by-6073 doubles.
When I run your code on an example, though, I see the difference between the 5th method and the rest. I don't think I have any multiple occurrences in the first 100 files, which is why I couldn't see the difference. Is there any way to adapt the 5th method so it behaves like the others? It takes much less time.
I want the code to ignore multiple occurrences: if the string is in the sentence, I just want it to output 1, and the next time it reads the same string in that sentence, it shouldn't do anything.
And thank you for taking the time to show me all these various methods.
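One way to make the 5th method ignore repeated words within a sentence is to deduplicate each line's words before the inner loop; an untested sketch:

```matlab
% Variant of the 5th method: each word contributes at most 1 per line,
% because the line's words are deduplicated with UNIQUE first.
idxsd6 = zeros(1, length(wordssd)) ;
for j = 1 : length(linessd)
    lineWords = unique(linessd{j}) ;                   % drop repeats within the sentence
    for k = 1 : length(lineWords)
        idxsd6 = idxsd6 + strcmp(lineWords{k}, wordssd) ;
    end
end
sdmean6 = idxsd6 / length(linessd) ;
```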


More Answers (1)

nFiles = 1154 ;
match = 'sd' ;
wordssd = {} ;
linessd = {} ;
for k = 1 : nFiles
    fid = fopen(sprintf('sw000%d.m', k), 'r') ;
    contentsd = textscan(fid, '%s', 'Delimiter', '\n') ;
    fclose(fid) ;
    contentsd = regexp(lower(contentsd{1}), '\w*' , 'match') ;
    wordssd = unique([wordssd, [contentsd{:}]]) ;
    id = cellfun(@(line) any(strcmpi(match, line)), contentsd) ;
    linessd = [linessd; contentsd(id)] ;
end
for j = 1 : length(linessd)
    idxsd(j,:) = ismember(wordssd, linessd{j}) ;
    Asd(j,:) = idxsd(j,:) ;
end
sdmean = mean(Asd) ;
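A memory-light sketch of the final loop, following the counting idea discussed in the comments, so the large n_lines-by-n_words idxsd/Asd matrices are never built (it also ignores repeated words within a line):

```matlab
% Accumulate per-word line counts directly instead of a full logical matrix.
counts = zeros(1, length(wordssd)) ;
for j = 1 : length(linessd)
    [tf, pos] = ismember(unique(linessd{j}), wordssd) ;  % each word counted once per line
    counts(pos(tf)) = counts(pos(tf)) + 1 ;
end
sdmean = counts / length(linessd) ;
```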

Asked on 19 Apr 2013