Issue with Mexfile in parfor loops

RH on 8 Dec 2020
Edited: RH on 9 Dec 2020
To speed up some heavy calculations I wrote a C file, which I compiled with MATLAB's mex compiler. It appears to run smoothly and gives correct results when using only a single thread (no parfor loops), and I have run it more than 100 times without any error.
However, when I run several calculations in parallel, one or two of my workers usually die, which makes the parfor loop restart. After a while, though, all workers are able to finish. These calculations are run through SLURM, so on another machine in our network. Anyone got an idea? Perhaps my MEX file does something illegal that I am not aware of.
My main script has this structure:
parfor i=1:numWorkers
    doWork();
end
and doWork() is basically like
function doWork()
    doSomestuff();
    [a,b,c,d,e,f] = initialize();
    myMexFunc(a,b,c,d,e,f);
    doMoreStuff();
end
and my Mex file is the following:
#include "mex.h"
#include "stdio.h"
void calcModulation(double* A, unsigned int* B, double* C, unsigned int* D, unsigned int L, double* E, unsigned int num_col, double* F)
{
    // First Task
    for(unsigned int n=0;n < L; ++n)
    {
        for(unsigned int m=0; m < 132; ++m)
        {
            A[D[n]+ 22*(B[n]+m)] = A[D[n] + 22*(B[n]+m)] + C[m+132*n];
        }
    }
    // Second Task
    for(unsigned int n=0;n < num_col; ++n)
    {
        for(unsigned int m=0; m < 22; ++m)
        {
            E[n] = E[n] + F[m + 22*(n)] * A[m + 22*(n)];
        }
    }
}
/* The gateway function */
void mexFunction( int nlhs, mxArray *plhs[],
                  int nrhs, const mxArray *prhs[])
{
    // Names changed as part of the original code is secret
    unsigned int num_col = mxGetN(prhs[0]);
    unsigned int L = mxGetN(prhs[2]);
    double* myMatrix_A = mxGetData(prhs[0]); // N x L
    unsigned int *myVector_C, *myVector_D;
    myVector_C = (unsigned int*) mxGetData(prhs[1]); // N x 1
    double* myMatrix_B = mxGetData(prhs[2]); // N x L
    myVector_D = (unsigned int*) mxGetData(prhs[3]); // N x 1
    double* myVector_E = mxGetData(prhs[4]); //1 x L
    double* myMatrix_D = mxGetData(prhs[5]); //N X L
    calcModulation(myMatrix_A, myVector_C, myMatrix_B, myVector_D, L, myVector_E, num_col, myMatrix_D);
}
Is there something wrong about the way I set the pointers in the mex file?
The dimensions of the MATLAB variables are stated next to the "mxGetData" calls. All are double except for those cast to unsigned int*.
  2 Comments
James Tursa on 8 Dec 2020
Are the unsigned int* variables actually uint32 class at the MATLAB m-file level?
There is no way for us to determine if your indexing is correct because you don't show us the inputs, and these input values are actually used as indexing into other variables.
Also, you are modifying variables inplace, which is against the rules. I.e., the A and E in calcModulation come from prhs variables which according to the official rules are const.
And you never check that the prhs inputs are actually the class and sizes you expect before you use them.
We don't really have much else to examine based on what you have posted thus far, but I would start with the above comments.
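A minimal sketch of the kind of index check suggested above, written as plain C so it can be tried outside MATLAB. The helper name first_task_indices_ok and the parameter numel_A are hypothetical and not part of the posted MEX file; the constants 22 and 132 are taken from the posted loops. It only verifies that the largest index the "First Task" loop would write stays inside A, and it assumes B and D already hold 0-based uint32 values.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the original MEX file.
 * Returns 1 if every index written by the "First Task" loop stays inside
 * an array A of numel_A elements, and 0 otherwise.
 * Assumes the hard-coded sizes from the question: stride 22, 132 inner steps. */
int first_task_indices_ok(const uint32_t *B, const uint32_t *D,
                          size_t L, size_t numel_A)
{
    for (size_t n = 0; n < L; ++n) {
        /* The largest index touched for this n is reached at m = 131.
         * 64-bit arithmetic avoids wrap-around in the check itself. */
        uint64_t last = (uint64_t)D[n] + 22u * ((uint64_t)B[n] + 131u);
        if (last >= numel_A) {
            return 0;
        }
    }
    return 1;
}
```

In a MEX gateway, a check like this could run before the loops and report a failure through mexErrMsgIdAndTxt instead of writing out of bounds. Whether the real inputs ever fail it cannot be told from the thread.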
RH on 8 Dec 2020
Thank you!
[Quote]
Are the unsigned int* variables actually uint32 class at the MATLAB m-file level?
[/Quote]
I cast the doubles to uint32. To be doubly safe, I changed my MEX file so that it uses uint32_T as the data type of the unsigned ints.
[Quote]
Also, you are modifying variables inplace, which is against the rules. I.e., the A and E in calcModulation come from prhs variables which according to the official rules are const.
[/Quote]
I see. I thought I was getting a pointer to the actual data that I may modify, which would let me avoid copying and creating large amounts of data. Is it not possible to pass by address without it being a pointer to constant data?
This would probably explain the behavior.
[Quote]
And you never check that the prhs inputs are actually the class and sizes you expect before you use them.
[/Quote]
That is correct but in my code, I can be sure the data is always in the correct format, i.e. the class and size should always fit.
[Quote]
There is no way for us to determine if your indexing is correct because you don't show us the inputs
[/Quote]
Yes, sorry about that but I cannot be sure what part I am allowed to share and what I am not allowed to share.

Accepted Answer

RH on 9 Dec 2020
Edited: RH on 9 Dec 2020
Alright, thank you for your thorough responses, James. I found the problem. As I initially suspected but then dismissed, I did not have enough RAM. The data size was significantly larger than I initially calculated, so the workers could not get enough RAM to satisfy the memory requests in my code.
The solution is of course simple: more RAM or smaller data sizes. We decided to split our data into several parts and process them individually.
I found this out by setting the number of workers to one but keeping the parfor loop in place. Then I got the detailed error message from this worker, whereas previously I only got a simple "Worker has died blabla" message without anything concrete.
edit: To avoid confusion:
The code in my opening post was apparently problematic because I changed the content of the input variables, which should not be done. I therefore changed my MEX file so that dynamic memory was allocated inside it. I then ran into the issue that a worker tried to allocate memory, the server it was running on did not grant the request, and the worker threw an exception and died.
This post addresses that second issue.
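For readers hitting the same wall, here is a hedged sketch of the two ideas in this answer: check every allocation, and work on the data in parts so the temporary memory stays bounded. The actual allocation code was never posted, so the function second_task_chunked, the scratch buffer, and the chunk_cols parameter are all hypothetical. The arithmetic follows the "Second Task" loop from the question, with its hard-coded 22 rows.

```c
#include <stdlib.h>
#include <stddef.h>

#define ROWS 22  /* row count hard-coded in the question's loops */

/* Hypothetical sketch: accumulates E[n] += sum_m F[m + 22*n] * A[m + 22*n]
 * while holding at most chunk_cols columns of temporary data at a time.
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int second_task_chunked(const double *F, const double *A, double *E,
                        size_t num_col, size_t chunk_cols)
{
    if (chunk_cols == 0) return -1;
    double *scratch = malloc(chunk_cols * ROWS * sizeof *scratch);
    if (scratch == NULL) return -1;   /* report failure instead of crashing */

    for (size_t start = 0; start < num_col; start += chunk_cols) {
        size_t cols = num_col - start;
        if (cols > chunk_cols) cols = chunk_cols;

        /* Element-wise products for this chunk only. */
        for (size_t k = 0; k < cols * ROWS; ++k)
            scratch[k] = F[start * ROWS + k] * A[start * ROWS + k];

        /* Column sums of the products, accumulated into E. */
        for (size_t c = 0; c < cols; ++c)
            for (size_t m = 0; m < ROWS; ++m)
                E[start + c] += scratch[m + ROWS * c];
    }
    free(scratch);
    return 0;
}
```

The caller chooses chunk_cols to fit the memory each worker is actually granted; a return value of -1 gives a concrete error to report rather than a silently dying worker.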

More Answers (1)

James Tursa on 8 Dec 2020
Edited: James Tursa on 8 Dec 2020
Regarding the inplace modification in MATLAB, here is the actual situation:
MATLAB uses a system behind the scenes that is often known as "copy-on-write". That is, multiple variables can share the same data memory. A deep copy is only made when changes are made. The actual behaviour varies a bit depending on MATLAB version, but goes something like this in a recent version:
A = 1:10; % variable A is created, but it is sharing the same data area as a background variable you know nothing about
B = A; % variable B is sharing the same data area as A and the background variable.
% at this point in the code, there are actually three variables sharing the same data area
mymexfunction(A) % suppose this mex function changes the values of A inplace
% at this point in the code, variable B and the background variable have been changed inplace, a nasty side effect
C = 1:10; % variable C maybe gets created as a shared copy of the background variable with the changed values!!!
You are screwed at this point. MATLAB saw the 1:10 pattern when creating C so it might use the background variable for this, but you had inadvertently changed the values of that background variable inplace with your mex routine. If you subsequently did the A = 1:10 line again you would definitely be screwed since the variable is the same.
What to do? You can sometimes get away with modifying variables inplace in a mex routine, but only if you really, really know what you are doing and take extra precautions to make sure the variable isn't shared with any other variable prior to calling your mex routine. Since MATLAB gives you no official tools to determine this, it can be a bit of a crap shoot to know if your code is going to work as you want or expect. See this link for a nasty example:
One method that seems to work for making sure a variable is unshared is the following:
A = something potentially shared with other variables
A(1) = A(1); % MATLAB sees the assignment so it will unshare A first.
mymexfunction(A); % modifying A inplace will *probably* work OK now.
Even so, I am not sure what to expect if you are using parfor loops and each thread is trying to write into the same workspace variable inplace.
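The rule-abiding alternative is to leave the input alone and write into a fresh copy. Below is a minimal plain-C sketch of that pattern for the "First Task" loop from the question. The function name first_task_copy and the parameter numel_A are hypothetical, and malloc/memcpy stand in for what a MEX file would more likely do with mxDuplicateArray(prhs[0]) assigned to plhs[0]. It assumes the indices in B and D are 0-based and in range.

```c
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical out-of-place variant of the "First Task" loop.
 * A_in is treated as read-only; the result is written to a fresh copy.
 * Returns NULL if the allocation fails. The caller frees the result. */
double *first_task_copy(const double *A_in, size_t numel_A,
                        const uint32_t *B, const uint32_t *D,
                        const double *C, size_t L)
{
    double *A_out = malloc(numel_A * sizeof *A_out);
    if (A_out == NULL) return NULL;
    memcpy(A_out, A_in, numel_A * sizeof *A_out);

    for (size_t n = 0; n < L; ++n) {
        for (size_t m = 0; m < 132; ++m) {
            /* Same update as in the question, applied to the copy only. */
            A_out[D[n] + 22 * (B[n] + m)] += C[m + 132 * n];
        }
    }
    return A_out;
}
```

The cost is one extra copy of A per call, which is exactly the memory overhead the in-place trick tries to avoid, so it is a trade-off rather than a free fix.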
  2 Comments
RH on 8 Dec 2020
Thanks for the elaborate reply. I changed my code such that I now do not change the input of the mex functions. However, this does not affect the outcome, some workers still crash for some reason.
Is there some reasonable way to debug the workers? This is a rather difficult problem, as it appears to be more or less random whether and which worker crashes. Something like 4 out of 20 crash.
James Tursa on 8 Dec 2020
You can use the crude debugger (i.e., lots of print statements to make sure your indexing is not running off the end of the valid memory areas), or e.g. in Visual Studio you can compile your mex routine in debug mode and then attach the MATLAB process to your Visual Studio session and try to do the debugging there. But I don't have any experience doing this with parfor.
