Mapreduce does not seem to use all available cores

Question

Mehrdad Oveisi on 10 Nov 2014

Direct link to this question

https://ch.mathworks.com/matlabcentral/answers/162140-mapreduce-does-not-seem-to-use-all-available-cores

Answered: Rick Amos on 24 Nov 2014

Hello,

I am using mapreduce on a machine with 16 cores. I create a pool with 15 workers (one per core), which works fine. When I run mapreduce, though, it uses only one or two workers: sometimes one for the mapper and one for the reducer. This is how I check which worker is processing the data (in addition to watching CPU/core activity in a system monitor):

tk=getCurrentTask();
disp(tk.ID)

There are tens of files to be processed, and each mapper call loads and processes one file. I expected that while the first mapper call is loading and processing the first file on one worker (core), other mapper calls would run in parallel on other workers to process the next files. That is not what happens: the mapper is called sequentially on the same worker. Sometimes a second worker is used for the reducer calls. So at most two workers are used, while 15 are available in the pool.

What would be a simple way to check whether mapreduce is making use of all the available cores?

EDIT: Actually now I can confirm that the mapper is always run by a single worker, but the reducer may be run by a few different workers, as expected.

Your help is appreciated, Mehrdad

10 Comments

Mehrdad Oveisi on 13 Nov 2014

Edited: Mehrdad Oveisi on 13 Nov 2014

workers_test.m

I have now come up with a simple example that illustrates the problem (adapted from the example in Getting Started with MapReduce). Running the following code (also attached) on my system shows that only one worker runs the mapper function. Note the single value 9 for the key 'MapperTaskID' in the output.

Output:

            Key           Value  
      _______________    ________
      'ReducerTaskID'    [     9]
      'Mean'             [702.16]
      'ReducerTaskID'    [     7]
      'MapperTaskID'     [     9]
      'MapperTaskID'     [     9]
      'MapperTaskID'     [     9]
      'MapperTaskID'     [     9]
      'MapperTaskID'     [     9]
      'MapperTaskID'     [     9]
      ...

The testing code:

function keyvalues = workers_test
    ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
    ds.SelectedVariableNames = 'Distance';
    ds.RowsPerRead = 5000; % smaller values increase the number of mapper calls
    preview(ds)
    outds = mapreduce(ds, @MeanDistMapFun, @MeanDistReduceFun);
    keyvalues = readall(outds);
end

function MeanDistMapFun(data, info, intermKVStore)
    % Record the ID of the worker task that ran this mapper call.
    tk = getCurrentTask();
    add(intermKVStore, 'MapperTaskID', tk.ID);
    distances = data.Distance(~isnan(data.Distance));
    sumLenValue = [sum(distances)  length(distances)];
    add(intermKVStore, 'sumAndLength', sumLenValue);
end

function MeanDistReduceFun(intermKey, intermValIter, outKVStore)
    % Record the ID of the worker task that ran this reducer call.
    tk = getCurrentTask();
    add(outKVStore, 'ReducerTaskID', tk.ID);
    if strcmp(intermKey, 'MapperTaskID')
        while hasnext(intermValIter)  % pass the same key/values along
            add(outKVStore, intermKey, getnext(intermValIter));
        end
        return
    end
    % Otherwise the key is 'sumAndLength': accumulate [sum count] pairs.
    sumLen = [0 0];
    while hasnext(intermValIter)
        sumLen = sumLen + getnext(intermValIter);
    end
    add(outKVStore, 'Mean', sumLen(1)/sumLen(2));
end
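To read the result at a glance, the distinct task IDs can be summarized from the returned table. The following is a minimal sketch, not part of the original test: it assumes the output table has the Key and Value variables shown above, and it builds a small made-up sample table so that it runs on its own. In practice the sample would be replaced by keyvalues = workers_test.

```matlab
% Sketch: count how many distinct workers ran the mapper and the reducer.
% The sample table is made up for illustration; in practice use
%   keyvalues = workers_test;
Key   = {'ReducerTaskID'; 'Mean'; 'ReducerTaskID'; ...
         'MapperTaskID'; 'MapperTaskID'; 'MapperTaskID'};
Value = {9; 702.16; 7; 9; 9; 9};
keyvalues = table(Key, Value);

% Select the rows that carry task IDs.
isMapper  = strcmp(keyvalues.Key, 'MapperTaskID');
isReducer = strcmp(keyvalues.Key, 'ReducerTaskID');

% Value is a cell column, so convert to numeric before calling unique.
mapperIDs  = unique(cell2mat(keyvalues.Value(isMapper)));
reducerIDs = unique(cell2mat(keyvalues.Value(isReducer)));

fprintf('Mapper ran on %d worker(s): %s\n', numel(mapperIDs), mat2str(mapperIDs'));
fprintf('Reducer ran on %d worker(s): %s\n', numel(reducerIDs), mat2str(reducerIDs'));
```

With the sample data, this reports one worker for the mapper and two for the reducer, which matches the symptom described. If mapreduce runs without a pool, getCurrentTask returns empty and no IDs are recorded.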

Mehrdad Oveisi on 13 Nov 2014

> This example hits a separate limitation that the input data currently needs to be "large" to provide meaningful parallelism.

I guess this limitation is behind the problem I am having. I have about 600 files to process. The files are about 40 MB on average (ranging from 5 MB to 130 MB). All of them are in .mat format and contain exactly four structs, which hold the data, metadata, etc. So the actual "data" table in each file is inside a struct in that file. I wasn't sure whether it is possible to create datastores directly from tables that are inside structs in the files. So instead I pass the datastore a text file containing the 600 .mat filenames (and set ds.RowsPerRead=1 to go through the filenames one by one).

Then, as I mentioned in the original post, "each time a mapper is called it loads and processes one file."

Given the limitation you mention, it makes sense that there is no parallelism: the input to the mapper is just a filename.

Is there a setting that changes the assumption that a small input requires only a small amount of processing?
Or is there a way to create a datastore from tables that are inside structs in the input files?

Rick Amos on 17 Nov 2014

Currently, the only form of MAT-file that datastore can read is the very specific one produced as the output of another mapreduce call. An unofficial shortcut that creates such a MAT-file is the following code:

data.Key = {'Test'};
data.Value = {struct('a', 'Hello World!', 'b', 42)};
save('myMatFile.mat', '-struct', 'data');
ds = datastore('myMatFile.mat');
readall(ds)

Mehrdad Oveisi on 19 Nov 2014

Thank you Rick! I found your reply here useful. So I thought it's good to have a separate thread for this tip.

Answer 1

Rick Amos on 24 Nov 2014

Direct link to this answer

https://ch.mathworks.com/matlabcentral/answers/162140-mapreduce-does-not-seem-to-use-all-available-cores#answer_160012

In R2014b, there is a limitation on the minimum size of data that can be parallelized. To get parallelism, the input datastore must contain at least one of the following:

Multiple files, in which case each file is handled in parallel.
Files larger than 32 MB, in which case each 32 MB chunk is handled in parallel.
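As a rough illustration of these two rules, the number of pieces that could be handed out in parallel can be estimated from the sizes of the files the datastore itself reads. This is only a sketch: the file sizes are made-up examples, and treating 32 MB as 32*2^20 bytes is an assumption, not something stated above.

```matlab
% Rough estimate of how many pieces mapreduce could process in parallel,
% based only on the two rules stated above.
chunkBytes = 32 * 2^20;           % 32 MB (assumed to mean 32*2^20 bytes)
fileBytes  = [5 40 130] * 2^20;   % hypothetical file sizes: 5, 40 and 130 MB

% Each file counts at least once; larger files count once per 32 MB chunk.
piecesPerFile = max(1, ceil(fileBytes / chunkBytes));
totalPieces   = sum(piecesPerFile);

fprintf('Pieces per file: %s (total %d)\n', mat2str(piecesPerFile), totalPieces);
```

Only the files passed to the datastore count here. A small text file that merely lists the names of large data files is still a single small file.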

If the input datastore contains a single small file, you will need to find a way to split that file into multiple files. For example, if the input datastore contains a single file that lists many filenames (pointing to the actual data), you can split it into many files, each containing one or a few filenames, to ensure parallelism.
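One possible way to do the split is sketched below using plain file I/O. All file and folder names (filelist.txt, filelist_parts, part_0001.txt, data001.mat and so on) are hypothetical placeholders, and the sketch first writes a small stand-in list so that it runs on its own.

```matlab
% Sketch: split one text file of filenames into many one-line text files.
% All file and folder names here are hypothetical placeholders.
listFile = fullfile(tempdir, 'filelist.txt');
outDir   = fullfile(tempdir, 'filelist_parts');

% Create a small stand-in list so the sketch is self-contained.
fid = fopen(listFile, 'w');
fprintf(fid, '%s\n', 'data001.mat', 'data002.mat', 'data003.mat');
fclose(fid);

if ~exist(outDir, 'dir')
    mkdir(outDir);
end

% Read the list line by line and write each filename to its own file.
fid = fopen(listFile, 'r');
count = 0;
txt = fgetl(fid);
while ischar(txt)                 % fgetl returns -1 at end of file
    name = strtrim(txt);
    if ~isempty(name)             % skip blank lines
        count = count + 1;
        partFile = fullfile(outDir, sprintf('part_%04d.txt', count));
        out = fopen(partFile, 'w');
        fprintf(out, '%s\n', name);
        fclose(out);
    end
    txt = fgetl(fid);
end
fclose(fid);

fprintf('Wrote %d part files to %s\n', count, outDir);
```

The datastore would then be pointed at the folder of part files instead of the single list file. Because each part file has no header line, the datastore would probably also need to be told not to read variable names from the first line; the exact options depend on the release.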
