Shuffle method on custom datastore written for a single binary file

I am writing a custom datastore and am seeking some assistance. My datasets consist of stacks of 2D images (frames) stored sequentially in a single binary file. While it's very straightforward to read in the binary stream using fread, each full dataset can easily be on the order of 50+ GB, making it infeasible to load everything at once on the hardware I have available. This was my original motivation for exploring the use of a datastore.
In addition to the need for managing out-of-memory data, I also would like to partition the data into chunks where each chunk contains a random collection of frames from this binary file. If possible, I would like to use the shuffle method for the datastore superclass to accomplish this, as this seems to be the "proper" approach (although I'm very open to alternatives).
The problem I am currently having is that the default datastore shuffle method appears only to randomize the order of files in a datastore directory. Since I only have one (very large) binary file, it doesn't seem to "shuffle" anything at all: running readall on the shuffled datastore returns exactly the same data as running it on the original datastore. What I need is for it to "shuffle" the frames within the binary file. Presumably, if I were to save each frame as an individual image file on disk, I could get this to work using imageDatastore or fileDatastore. However, I would then have to go through all my files and save them to disk again as individual files, which seems rather silly.
I have written code to load a chunk of the data manually by jumping around the file using fseek. However, then I lose access to the datastore object as well as its built-in functionality. So I thought I would throw this question out there to see if anyone could offer some help.

Answers (1)

Sanjana on 6 Oct 2024
Hi,
You can implement a custom datastore in MATLAB to shuffle frames within a single large binary file while maintaining the benefits of a datastore.
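Custom Datastore class: Create a custom datastore class that extends the matlab.io.Datastore class. This class can be implemented to read and shuffle frames within a binary file. Note that matlab.io.Datastore requires the hasdata, read, reset, and progress methods to be implemented before the class can be instantiated.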
Implementing Custom Read and Shuffle methods:
  1. Read Method: Implement the read method to return the next frame in the shuffled order (the example below reads one frame per call). Use "fseek" to move the file position pointer to that frame's byte offset, (frameIndex-1)*FrameSize, before calling "fread".
  2. Shuffle Method: Implement a custom "shuffle" method that generates a random permutation of frame indices. This method should update the order in which frames are accessed during reading, without altering the binary file.
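Before wrapping these two steps in a class, it may help to see them as a plain script. The following is only a minimal sketch: the temporary demo file, the 4-byte frame size, and the chunk size of 3 are all made-up values for illustration, and it assumes uint8 samples.

```matlab
% Build a small demo file: 6 hypothetical frames of 4 bytes each,
% where every byte of frame k holds the value k.
frameSize = 4;
totalFrames = 6;
fileName = [tempname '.bin'];
fid = fopen(fileName, 'w');
fwrite(fid, repelem(uint8(1:totalFrames), frameSize), 'uint8');
fclose(fid);

% "Shuffle": permute the frame indices, not the file contents.
frameOrder = randperm(totalFrames);

% "Read": fetch one chunk of frames in shuffled order by seeking to each offset.
chunkSize = 3;
chunkIndices = frameOrder(1:chunkSize);
chunk = zeros(frameSize, chunkSize, 'uint8');
fid = fopen(fileName, 'r');
for k = 1:chunkSize
    offset = (chunkIndices(k) - 1) * frameSize;   % byte offset of this frame
    fseek(fid, offset, 'bof');
    chunk(:, k) = fread(fid, frameSize, 'uint8=>uint8');
end
fclose(fid);
delete(fileName);
```

Each column of chunk holds one frame, so only chunkSize frames are ever in memory at a time.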
Example Custom Datastore class definition:
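classdef CustomFrameDatastore < matlab.io.Datastore
    properties
        FileName      % path to the binary file
        FrameSize     % bytes per frame (uint8 samples assumed)
        TotalFrames   % number of frames stored in the file
        CurrentIndex  % position of the next read within FrameOrder
        FrameOrder    % permutation of 1:TotalFrames
    end
    methods
        function ds = CustomFrameDatastore(fileName, frameSize, totalFrames)
            ds.FileName = fileName;
            ds.FrameSize = frameSize;
            ds.TotalFrames = totalFrames;
            ds.CurrentIndex = 1;
            ds.FrameOrder = randperm(totalFrames);
        end
        function [data, info] = read(ds)
            if ~hasdata(ds)
                error('No more data to read.');
            end
            fid = fopen(ds.FileName, 'r');
            if fid < 0
                error('Could not open %s.', ds.FileName);
            end
            % Seek to the start of the next frame in the shuffled order.
            frameIndex = ds.FrameOrder(ds.CurrentIndex);
            fseek(fid, (frameIndex-1)*ds.FrameSize, 'bof');
            % 'uint8=>uint8' keeps the data as uint8 instead of converting to double.
            data = fread(fid, ds.FrameSize, 'uint8=>uint8');
            fclose(fid);
            info = struct('FrameIndex', frameIndex);
            ds.CurrentIndex = ds.CurrentIndex + 1;
        end
        function reset(ds)
            ds.CurrentIndex = 1;
        end
        function tf = hasdata(ds)
            tf = ds.CurrentIndex <= ds.TotalFrames;
        end
        function shuffle(ds)
            % Reshuffles in place; the binary file itself is never modified.
            ds.FrameOrder = randperm(ds.TotalFrames);
            ds.CurrentIndex = 1;
        end
    end
    methods (Hidden = true)
        function frac = progress(ds)
            % Required by matlab.io.Datastore: fraction of frames read so far.
            frac = (ds.CurrentIndex - 1) / ds.TotalFrames;
        end
    end
end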
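Since the question asks about the datastore's own shuffle method, one possible variant is to also inherit from the matlab.io.datastore.Shuffleable mixin, whose shuffle method returns a new, shuffled datastore and leaves the original untouched. The class below is a sketch under assumptions, not a tested drop-in: the class name ShuffleableFrameDatastore is hypothetical, frames are assumed to be uint8, and the frame count is derived from the file size.

```matlab
classdef ShuffleableFrameDatastore < matlab.io.Datastore & matlab.io.datastore.Shuffleable
    properties
        FileName            % path to the binary file
        FrameSize           % bytes per frame (uint8 samples assumed)
        FrameOrder          % order in which frames are returned
        CurrentIndex = 1    % position of the next read within FrameOrder
    end
    methods
        function ds = ShuffleableFrameDatastore(fileName, frameSize)
            fileInfo = dir(fileName);
            ds.FileName = fileName;
            ds.FrameSize = frameSize;
            % Unshuffled by default: frames come back in file order.
            ds.FrameOrder = 1:floor(fileInfo.bytes / frameSize);
        end
        function tf = hasdata(ds)
            tf = ds.CurrentIndex <= numel(ds.FrameOrder);
        end
        function [data, info] = read(ds)
            if ~hasdata(ds)
                error('No more data to read.');
            end
            frameIndex = ds.FrameOrder(ds.CurrentIndex);
            fid = fopen(ds.FileName, 'r');
            if fid < 0
                error('Could not open %s.', ds.FileName);
            end
            fseek(fid, (frameIndex - 1) * ds.FrameSize, 'bof');
            data = fread(fid, ds.FrameSize, 'uint8=>uint8');
            fclose(fid);
            info = struct('FrameIndex', frameIndex);
            ds.CurrentIndex = ds.CurrentIndex + 1;
        end
        function reset(ds)
            ds.CurrentIndex = 1;
        end
        function dsNew = shuffle(ds)
            % Return a shuffled copy; the original datastore keeps its order.
            dsNew = copy(ds);
            dsNew.FrameOrder = ds.FrameOrder(randperm(numel(ds.FrameOrder)));
            reset(dsNew);
        end
    end
    methods (Hidden = true)
        function frac = progress(ds)
            frac = (ds.CurrentIndex - 1) / max(numel(ds.FrameOrder), 1);
        end
    end
end
```

With this variant the call pattern would be dsShuffled = shuffle(ds), which matches how shuffle behaves on the built-in datastores.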
Here is the example code to use the above custom datastore:
% Initialize datastore
frameSize = 1024 * 1024;   % Example frame size in bytes
totalFrames = 50000;       % Example total number of frames
ds = CustomFrameDatastore('largefile.bin', frameSize, totalFrames);

% Shuffle and read frames
ds.shuffle();
while hasdata(ds)
    frameData = ds.read();
    % Process frameData
end
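Two details are left open by the example values above: where totalFrames comes from, and how to turn the flat vector returned by fread back into a 2D image. A rough sketch of both follows. The 4-by-3 frame geometry and the temporary demo file are invented for illustration, and it assumes uint8 pixels stored in column-major order (transpose the result if your file is row-major).

```matlab
% Hypothetical geometry: 4-by-3 uint8 frames, 5 frames in a temporary demo file.
frameHeight = 4;
frameWidth = 3;
numDemoFrames = 5;
fileName = [tempname '.bin'];
fid = fopen(fileName, 'w');
fwrite(fid, uint8(0:frameHeight*frameWidth*numDemoFrames - 1), 'uint8');
fclose(fid);

% Derive the frame count from the file size instead of hard-coding it.
frameSize = frameHeight * frameWidth;   % bytes per frame for uint8 data
fileInfo = dir(fileName);
totalFrames = fileInfo.bytes / frameSize;
assert(totalFrames == floor(totalFrames), ...
    'File size is not a whole number of frames.');

% Read one frame by offset and restore its 2D shape.
frameIndex = 3;
fid = fopen(fileName, 'r');
fseek(fid, (frameIndex - 1) * frameSize, 'bof');
frame = reshape(fread(fid, frameSize, 'uint8=>uint8'), frameHeight, frameWidth);
fclose(fid);
delete(fileName);
```

For pixel types wider than one byte, frameSize in bytes and the fread element count differ, so the offset and the read size would need to be computed separately.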
I hope this helps!

Release

R2020b
