How to use map and reduce efficiently?
Show older comments
I have the following table, which has over 40 million rows and 5 columns:

The first column is irrelevant. The second column is a YYYYMMDD date, and the frequency of the data is quarterly. The third column is a firmID - some firm IDs include letters as well as numbers. The fourth and fifth columns are values assigned to 2 different variables.
I wish to do 2 things:
1) for every rdate-cusip pair, sum shares across all different identifiers of mgrno that exist for that rdate-cusip combination. Call this value A.
2) for every rdate-cusip pair, obtain the mode value of shrout2 across the different identifiers of mgrno that exist for that rdate-cusip combination. Call this value B.
3) divide A by B.
This would normally be straightforward, but due to the big dimensions of the data, I am struggling to do it. I have tried to use the functions map and reduce, without really loading the file into the workspace, but I believe I am mkaing some kind of mistake. I was getting error messages trying to conduct the division inside the mapping phase, so I decided to skip the division and just have as output a table in which the first column is quarter-CUSIP identifier, second column is A, and third column is B.
ds = datastore('myFile.csv');
ds.TextscanFormats{3} = '%q';
ds.TextscanFormats{4} = '%q';
outds = mapreduce(ds, @gvkeyMapFun2, @gvkeyReduceFun2);
output = readall(outds);
where the functions are defined as
function gvkeyMapFun2(data, ~, intermKVStore)
% gets quarter variables
vQuarter = num2str(data.rdate); % char format
% gets cusip in char format
vNCUSIP = cell2mat(data.cusip);
% creates quarter-ncusip identifer
IDnum = strcat(vQuarter,vNCUSIP);
IDnum = cellstr(IDnum);
% finds unique NCUSIPS-quarter
[intermKeys,~,idx] = unique(IDnum, 'stable'); % intermKeys is cell of characters (some cusips have letters), idx is double
% gets variables of intersst
dataOwnership = cellfun(@(x) str2double(x),data.shares);
dataTotalShares = data.shrout2;
for ii = 1:numel(intermKeys)
totalOwnership = sum(dataOwnership(idx==ii));
totalShares = mode(dataTotalShares(idx==ii));
totalOwnershipInfo(ii,1:3) = [repmat(intermKeys(ii),size(totalOwnership,1),1), totalOwnership,repmat(totalShares,size(totalOwnership,1),1) ];
add(intermKVStore, intermKeys{ii}, totalOwnershipInfo);
end
end
and
function gvkeyReduceFun2(intermKey, intermValIter, outKVStore)
databasereducedFinal = array2table([]);
while hasnext(intermValIter)
databasereducedFinal = [databasereducedFinal; getnext(intermValIter)];
end
add(outKVStore, 'output', databasereducedFinal);
end
I then run
output = readall(outds);
c = vertcat(output{:, 2});
tableBig = vertcat(c{:});
to try and get the table because "output" looks like this:

I feel this is still quite inefficient. Is there anyway do this more efficiently? (also, I believe there's some other mistake somewhere, because the final table "tableBig" is larger than I would expect given the possible number of unique CUSIP-quarters.
thank you.
5 Comments
Guillaume
on 8 Aug 2019
Which version of matlab are you using? This is straightforward to do with groupsummary (R2018a or later required) but for the fact that shares is text instead of numbers. In earlier versions, you'd use splitapply which is only slightly more complicated.
So first, I think you need to fix your import of the data so that shares is numeric. It could be done after the fact but I supect it would be extremely slow. Also, you may want to import rdate as a proper datetime as that would make it easier to calculate monthly/quarterly/yearly/etc. stats.
I'm also not entirely clear if the grouping variables are {'rdate', 'cusip'} or {'mgrno', 'rdate', 'cusip'}.
Daniel Pinto
on 8 Aug 2019
Guillaume
on 8 Aug 2019
Then splitapply it is
What about my other comment and question? I've just noticed that you explicitly ask that shares be read as text. Why? You can't sum text.
Daniel Pinto
on 8 Aug 2019
Daniel Pinto
on 8 Aug 2019
Accepted Answer
More Answers (0)
Categories
Find more on Startup and Shutdown in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!