Find data from files that are too large to read in

I have structured data files (each about 30 GB). I need to find all the lines in a file that contain a specific number in one of the fields. I am presently doing this by reading each line in turn and checking the field, but it takes a long time (> 1 hr) to scan through the file. The program HEX FIEND allows me to do this manually in a small fraction of the time. Is there a way to read a file up to the point that some condition is met? If there is, I suspect it will speed up finding and extracting the lines of the file I want.
  2 Comments
Kevin Lehmann on 20 Feb 2024
This solution, using the ds = tabularTextDatastore function call, worked for me. The default read size is 20,000 lines; I got a speed-up by going to a read size of 1,000,000. Putting my code to analyze the data inside a
while hasdata(ds)
end
loop allowed the transition from code that needs a file I could load into memory to code that handles one too large to do so.
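The datastore loop described in this comment might look roughly like the sketch below. The demo file, its three comma-separated numeric fields, the choice of the second field, and the target value 7 are all assumptions made for illustration; only tabularTextDatastore, ReadSize, hasdata, and read come from the approach described.

```matlab
% Hypothetical demo file: three comma-separated numeric fields, no header line.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%d,%d,%d\n', [1 7 10; 2 5 20; 3 7 30; 4 9 40]');
fclose(fid);

ds = tabularTextDatastore(fname, 'Delimiter', ',', 'ReadVariableNames', false);
ds.ReadSize = 2;      % rows per read (tiny for the demo; 1,000,000 in the comment)
target = 7;           % assumed value to look for in the second field

parts = {};           % matching rows from each chunk
while hasdata(ds)
    chunk = read(ds);                                  % table of up to ReadSize rows
    parts{end+1} = chunk(chunk.Var2 == target, :);     %#ok<AGROW>
end
hits = vertcat(parts{:});   % all matching rows as one table
delete(fname);
disp(hits)
```

Only one chunk is held in memory at a time, so the file size is limited by disk rather than RAM; the variable names Var1, Var2, ... are the defaults MATLAB assigns when no header is read.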


Answers (2)

Walter Roberson on 17 Feb 2024
Use buffer-fulls of data for increased efficiency.
fread() a block of data of fixed size. Scan backwards through the block looking for the last newline, keeping a count of how far you go. Truncate the block there, and fseek() backwards by the number of bytes you had to scan backwards to reach the newline. Now process the in-memory block of data.
Repeat until you reach the end of the file. Be careful, because the file might not end in a newline.
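The block-reading steps above can be sketched as follows. This is a minimal illustration, not the answerer's own code: the demo file, the comma delimiter, the second-field match, and the tiny blockSize are assumptions, and the block size is assumed to be larger than the longest line.

```matlab
% Hypothetical demo file; the last line deliberately has no trailing newline.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '1,7,10\n2,5,20\n3,7,30\n4,9,40');
fclose(fid);

blockSize = 16;       % bytes per fread (tiny for the demo; tens of MB in practice)
target = '7';         % assumed text to match in the second field
matches = {};

fid = fopen(fname, 'r');
while true
    block = fread(fid, blockSize, 'uint8=>char')';   % row vector of characters
    if isempty(block), break; end
    if numel(block) == blockSize                     % full block: may end mid-line
        lastNL = find(block == newline, 1, 'last');
        if isempty(lastNL)
            error('blockSize must be larger than the longest line.');
        end
        fseek(fid, lastNL - numel(block), 'cof');    % rewind over the partial line
        block = block(1:lastNL);                     % truncate at the last newline
    end
    lines = strsplit(strtrim(block), newline);       % process the in-memory block
    for k = 1:numel(lines)
        fields = strsplit(lines{k}, ',');
        if numel(fields) >= 2 && strcmp(fields{2}, target)
            matches{end+1} = lines{k};               %#ok<AGROW>
        end
    end
end
fclose(fid);
delete(fname);
```

A short final block is treated as the end of the file and processed whole, which covers the case where the file does not end in a newline. Files with Windows line endings would leave a carriage return at the end of each line, which this sketch does not strip.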
  10 Comments
Kevin Lehmann on 21 Feb 2024
I got my code to work using data = fread(FILEID, [37,1000000], 'int8=>char')' to read from the file, 1 million lines at a time. With the same processing after input, this took a factor of 10 longer (50 min vs 5 min, as reported by tic/toc) than using the tabularTextDatastore to read the same data.
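For reference, a fixed-width read of this kind can be paired with a direct character comparison on the columns of interest, so that no per-line parsing is needed. The sketch below is only an illustration under assumed conditions: the 8-byte record layout, the key position in columns 3 to 5, and the key value are hypothetical, and it makes no claim about where the time went in the run described above.

```matlab
% Hypothetical fixed-width file: every record is 8 bytes including the newline,
% with a 3-character key in columns 3-5.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'A 007 x\nB 123 y\nC 007 z\nD 999 w\n');
fclose(fid);

recLen   = 8;         % bytes per record (37 in the comment above)
nPerRead = 3;         % records per fread (1,000,000 in the comment above)
keyCols  = 3:5;       % assumed position of the field of interest
key      = '007';     % assumed value, padded to the field width
found    = char(zeros(0, recLen - 1));

fid = fopen(fname, 'r');
while true
    data = fread(fid, [recLen, nPerRead], 'uint8=>char')';   % one record per row
    if isempty(data), break; end
    isHit = all(data(:, keyCols) == key, 2);    % compare characters, no parsing
    found = [found; data(isHit, 1:end-1)];      %#ok<AGROW> drop the newline column
end
fclose(fid);
delete(fname);
disp(found)
```

This only works when every record has exactly the same length; a single short or long line shifts all later records out of alignment.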



Image Analyst on 17 Feb 2024
Perhaps memmapfile? I think its purpose is to look at very large files.
help memmapfile
 MEMMAPFILE Construct memory-mapped file object.
    M = MEMMAPFILE(FILENAME) constructs a memmapfile object that maps file
    FILENAME to memory, using default property values. FILENAME can be a
    partial pathname relative to the MATLAB path. If the file is not found
    in or relative to the current working directory, MEMMAPFILE searches
    down the MATLAB search path.

    M = MEMMAPFILE(FILENAME, PROP1, VALUE1, PROP2, VALUE2, ...) constructs a
    memmapfile object, and sets the properties of that object that are named
    in the argument list (PROP1, PROP2, etc.) to the given values (VALUE1,
    VALUE2, etc.). All property name arguments must be quoted character
    vectors or strings (e.g., 'Writable'). Any properties that are not
    specified are given their default values.

    Property/Value pairs and descriptions:

    Format: string scalar or character vector, or Nx3 cell array (defaults
        to 'uint8'). Format of the contents of the mapped region.

        If a string or character vector, Format specifies that the mapped
        data is to be accessed as a single vector of type specified by
        Format's value. Supported values are 'int8', 'int16', 'int32',
        'int64', 'uint8', 'uint16', 'uint32', 'uint64', 'single', and
        'double'.

        If an Nx3 cell array, Format specifies that the mapped data is to be
        accessed as a repeating series of segments of basic types, each with
        specific dimensions and name. The cell array must be of the form
        {TYPE1, DIMS1, NAME1; ...; TYPEn, DIMSn, NAMEn}, where TYPE is one of
        the data types listed above, DIMS is a numeric row vector specifying
        the dimensions of the segment of data to use, and NAME is a field
        name to use to access the data (as a subfield of the Data property).
        See Data property and examples below.

    Repeat: Positive integer or Inf (defaults to Inf). Number of times to
        apply the specified format to the mapped region of the file. If Inf,
        repeat until end of file.

    Offset: Nonnegative integer (defaults to 0). Number of bytes from the
        start of the file to the start of the mapped region. Offset 0
        represents the start of the file.

    Writable: True or false (defaults to false). Access level which
        determines whether or not Data property (see below) may be assigned
        to.

    All the properties above may also be accessed after the memmapfile
    object has been created by dot-subscripting the memmapfile object. For
    example,

        M.Writable = true;

    changes the Writable property of M to true.

    Two properties which may not be specified to the MEMMAPFILE constructor
    as Property/Value pairs are listed below. These may be accessed (with
    dot-subscripting) after the memmapfile object has been created.

    Data: Numeric array or structure array. Contains the actual
        memory-mapped data from FILENAME. If Format is a string or character
        vector, then Data is a simple numeric array of the type specified by
        Format. If Format is a cell array, then Data is a structure array,
        the field names of which are specified by the third column of the
        cell array. The type and shape of each field of Data are determined
        by the first and second columns of the cell array, respectively.
        Changes to the Data field or subfields also change the corresponding
        values in the memory-mapped file.

    Filename: Char array. Contains the name of the file being mapped.

    Note that when a variable containing a memmapfile object goes out of
    scope or is otherwise cleared, the memory map is automatically unmapped.

    Examples:
        % To map the file 'records.dat' to a series of unsigned 32-bit
        % integers and set every other value to zero (in Data and
        % records.dat):
        m = memmapfile('records.dat', 'Format', 'uint32', 'Writable', true);
        m.Data(1:2:end) = 0;

        % To map the file 'records.dat' to a repeating series of 20 singles
        % (as a 5-by-4 matrix) called 'sdata', followed by 10 doubles (as a
        % 1-by-10 vector) called 'ddata':
        m = memmapfile('records.dat', 'Format', {'single' [5 4] 'sdata'; ...
                                                 'double', [1 10] 'ddata'});
        firstSdata = m.Data(1).sdata;
        firstDdata = m.Data(1).ddata;

    See also MEMMAPFILE/DISP, MEMMAPFILE/GET

    Documentation for memmapfile
       doc memmapfile
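As a rough sketch of how the default 'uint8' format described in the help text could be applied to a text file, the mapped bytes can be converted with char and reshaped into records. This assumes every record has the same fixed length; the demo file, the 8-byte record length, and the choice to look at the first two records are hypothetical.

```matlab
% Hypothetical fixed-width text file: 8 bytes per record including the newline.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'A 007 x\nB 123 y\nC 007 z\nD 999 w\n');
fclose(fid);

recLen = 8;                          % assumed fixed record length in bytes
m = memmapfile(fname);               % default Format is 'uint8': one element per byte
nRec = numel(m.Data) / recLen;       % number of records in the file

% Convert only a range of bytes at a time; indexing m.Data copies just that range.
idx = 1:(2 * recLen);                % bytes of the first two records
firstTwo = char(reshape(m.Data(idx), recLen, 2))';   % one record per row

clear m                              % unmap before deleting the file
delete(fname);
disp(firstTwo(:, 1:end-1))           % records without the newline column
```

With variable-length lines the reshape step does not apply, and the mapped bytes would have to be scanned for newline characters instead.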
  1 Comment
Kevin Lehmann on 20 Feb 2024
It appears from what I read that MEMMAPFILE only works for binary files. As I am reading large, pre-existing ASCII files, this did not work for me. If I were generating the files myself, this would probably be a good option, though it also appears that all the data needs to be saved in the same format.


Release

R2022a
