Mat file empty after saving

I'm training a neural network on an HPCMP system GPU cluster (Linux OS) and I'm saving the trained model parameters in a structure. When I execute the code on my local machine (Windows OS) for debugging, the code executes fine and my model is saved. The saved .mat file containing the model/data structure is about 550 MB and I have no problems importing the structure and its contents back into Matlab on my local machine.
When I execute the same code on the HPCMP GPU cluster, the code executes fine, the model is saved in the designated directory, and the filesize shows it is also about 550 MB; however, when I try to import this file back into Matlab, the import utility says the Mat file is empty. I'm performing transfer learning with my neural network on the HPCMP systems, and when the saved Mat file is reloaded, the code fails because the file doesn't load any data.
I don't understand why this mat file would show a filesize of 550 MB and be empty in the import wizard. If I pull the mat file off HPCMP systems and onto my local machine, the filesize still shows 550 MB but I still can't see or load any data from it; import utility still says the Mat file is empty.
I've saved models on HPCMP systems with similar code without these issues, so I don't know what exactly has changed or why the file won't load any data even though the filesize shows the file isn't empty. I'm using iLauncher to submit communicating jobs to the HPCMP cluster; in the past I used HPC Portal, but I don't see why this would make a difference. Since it's a communicating job, I lose the ability to debug the job after it is submitted to the cluster.
Any help would be greatly appreciated!

7 Comments

Is it currently being saved to a network file system? If so, can you save it to a local file system such as tempdir() and then copy it to the network file system?
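A minimal sketch of that suggestion, assuming the trained model lives in a struct named model; the destination path and file names are hypothetical placeholders, not from the thread:

```matlab
% Save to local scratch first, then copy to the network file system.
% 'model' and the destination path below are hypothetical placeholders.
localFile = fullfile(tempdir, 'trainedModel.mat');   % local disk on the node
save(localFile, 'model');                            % write locally first
copyfile(localFile, '/p/projects/myproj/trainedModel.mat');  % then copy to network FS
```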
Nicholas Hopkins
Nicholas Hopkins on 23 Oct 2022
Edited: Nicholas Hopkins on 23 Oct 2022
Hi Walter, thanks for the response. The file is being saved on HPC's network, as I don't have direct access to the system I'm running on; I have to remote into it and submit jobs (although I can run Matlab on HPC as I would on my local machine, jobs are submitted to run in the background and then Matlab is closed so the license can be freed up for another user). To get the saved Mat file onto my local PC, I use a file-sharing program called FileZilla.
I have a directory on the HPC network that was set up specifically for my project, so the permissions (Linux file system) should be set appropriately. I could try saving the file to my home directory and then copying it to my project directory to see if that yields a difference; however, both my home and project directories in this case are on the HPC's network system, if that makes sense. My workflow is: 1) develop code locally, 2) transfer code via FileZilla to the HPC system project directory, 3) execute code on the HPC system to generate and save data, 4) transfer data via FileZilla from HPC back to my local machine for further processing.
I'll try saving the file to my home directory to see if that makes a difference; thank you for the suggestion.
Nicholas Hopkins
Nicholas Hopkins on 23 Oct 2022
Edited: Nicholas Hopkins on 23 Oct 2022
I tried saving the file to my home directory on the HPC system; however, I encounter the same result when I load the file: the import wizard says the Mat file is empty and my structure doesn't load into the workspace, but the filesize still shows > 500 MB.
If you use tempdir() then that should refer to a local hard drive on the cluster.
It would also be interesting to save with '-v7.3' explicitly and see if that makes a difference.
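An explicit version-7.3 save looks like the following; fileName is from the thread, while the variable name 'model' is an assumed placeholder:

```matlab
% '-v7.3' writes an HDF5-based MAT-file. It is required when a saved
% variable exceeds 2 GB, and per this thread it also worked where the
% default format produced an unreadable file from a communicating job.
save(fileName, 'model', '-v7.3');   % note: the variable is named as a string
```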
I set up my code on the HPC side so I could debug/step through it instead of submitting jobs to the cluster to run in the background. I'm able to save the model to the specified directory and read it back in as I would normally. It appears there is some issue with the communicating job wrapper/submit functions used to submit the job to the cluster that prevents the model from being loaded once it is saved.
Oh, I see what you mean concerning tempdir(); this project has re-introduced me to Linux, so I'm pretty green at interpreting troubleshooting tips on that OS.
I'll try saving with '-v7.3' to see if that changes anything too, although, given that I can save/reload the model while stepping through the code on the HPC side outside of a communicating job, I don't think saving with '-v7.3' will help; but it's worth trying.
An HPC data scientist set up the communicating job code so my algorithm could run appropriately on HPC's system; he's modified it a fair bit this past week to run Matlab in a container, since the Linux OS doesn't have the necessary video codecs for my algorithm's application. I'm hoping to speak to him shortly and get more information.
Apparently passing '-v7.3' to the save function does save the model in a way that allows me to load it back in from the Mat file after saving within a communicating job. I'm not sure what this save version does differently in my case, though; I've only ever used this option when the size of the variables to be saved exceeds 2 GB. I spoke with my HPC counterpart and he thinks the issue was related to another problem, which I'll describe in an answer to my original question. Thank you for your input; I certainly appreciate it!
 Accepted Answer

Nicholas Hopkins
Nicholas Hopkins on 23 Oct 2022
As Walter mentioned in the comments above, passing '-v7.3' to the save function saves the model in a way that can be loaded back in when the code is executed via a communicating job on the HPC GPU cluster. However, the following code is what should have been used when saving a model processed on multiple GPUs in my case:
if labindex == 1
    save(fileName, variableName)
end
I was not using the if statement; adding it ensures that the copies of the model running on the different GPU workers do not all try to save to the same file at the same time. I believe this was my issue, as I only encountered the saving problem when submitting a communicating job that ran across multiple GPUs (i.e. the '-v7.3' save option was not needed outside of multi-GPU processing, such as when running the code locally on my machine).
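Putting the two points together, a hedged sketch of the save step as it might appear inside the communicating job; the spmd context and the variable name 'model' are assumptions, not code from the thread:

```matlab
spmd
    % ... train the network on this worker's GPU ...
    if labindex == 1                       % only worker 1 writes the file,
        save(fileName, 'model', '-v7.3'); % so the workers don't race on it
    end
end
```

For reference, labindex is the worker identifier available in R2021b (the release used here); newer releases rename it spmdIndex.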
Release

R2021b