mat file size compression
I have two files (examples from a large dataset) with apparently identical contents. One takes 3.2 MB of disk space, the other 15 MB. I understand that Matlab incorporates an algorithm for compression, but why would it behave so differently in one case than in the other?
Some high-level stats on them follow; the two appear to be identical in all respects, at all levels of the struct. Here rec327 and recPTL are working names in memory for the contents of TVFR-327-G-STV.mat and PTL-293238CA-3.mat: each file was simply loaded into memory and its struct renamed from the internal name rec to distinguish them.
>> whos rec327
  Name        Size       Bytes     Class     Attributes
  rec327      1x1        16127172  struct
>> whos recPTL
  Name        Size       Bytes     Class     Attributes
  recPTL      1x1        16127172  struct
>> whos -file TVFR-327-G-STV.mat
  Name        Size       Bytes     Class     Attributes
  rec         1x1        16127172  struct
>> whos -file PTL-293238CA-3.mat
  Name        Size       Bytes     Class     Attributes
  rec         1x1        16127172  struct
foo =
4x1 struct array with fields:
name
date
bytes
isdir
datenum
>> foo(3)
ans =
name: 'PTL-293238CA-3.mat'
date: '13-Feb-2016 17:27:20'
bytes: 15040532
isdir: 0
datenum: 7.363737273148148e+05
>> foo(4)
ans =
name: 'TVFR-327-G-STV.mat'
date: '01-Mar-2016 19:48:36'
bytes: 3200147
isdir: 0
datenum: 7.363908254166667e+05
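For reference, the byte counts above can be turned into on-disk to in-memory ratios. This is only arithmetic on the numbers already listed; the variable names are made up for illustration.

```matlab
% Byte counts copied from the whos and dir output above.
inMemoryBytes = 16127172;             % size of rec in memory (same for both files)
onDiskBytes   = [15040532, 3200147];  % PTL-293238CA-3.mat, TVFR-327-G-STV.mat

% Fraction of the in-memory size that each file occupies on disk.
ratio = onDiskBytes / inMemoryBytes;
fprintf('PTL:  %.3f\nTVFR: %.3f\n', ratio(1), ratio(2));
% Roughly 0.93 for the older file and 0.20 for the newer one.
```

So the older file is barely smaller than the data in memory, while the newer one is about a fifth of it.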
Answers (1)
Jan
on 5 Mar 2016
Edited: Jan
on 5 Mar 2016
There can be several reasons. For example, the compression works differently according to the entropy of the data. It also matters whether the file was saved all at once or whether data have been appended or partially overwritten. There are different MAT-file formats as well; see the -v4, -v6, -v7 and -v7.3 flags for save.
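One way to check the format explanation is to save the same variable with each version flag and compare the resulting file sizes. The sketch below is only an illustration: the struct rec here is a small made-up stand-in (the real data from the question is not available), and the scratch folder and file names are arbitrary. According to the save documentation, -v6 files are not compressed, while -v7 and -v7.3 compress by default.

```matlab
% Hypothetical stand-in for the struct in the question.
rec.signal = repmat(sin(linspace(0, 2*pi, 1000)), 1, 200);  % repetitive, so compressible
rec.label  = 'example';

tmp = tempname;                      % scratch folder for the test files
mkdir(tmp);

flags = {'-v6', '-v7', '-v7.3'};
sizes = zeros(size(flags));
for k = 1:numel(flags)
    f = fullfile(tmp, sprintf('rec%d.mat', k));
    save(f, 'rec', flags{k});        % same variable, different MAT-file format
    d = dir(f);
    sizes(k) = d.bytes;
    fprintf('%-6s %10d bytes\n', flags{k}, sizes(k));
end

rmdir(tmp, 's');                     % clean up
```

If the two files in the question had been written with different flags, a size gap of this kind would be expected even with identical contents.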
5 Comments
Walter Roberson
on 23 Sep 2022
Suppose that you have a set of all possible files that are 16 bits each, 2^16 = 65536 files. Is it possible to compress all of them by at least one bit each? That would hypothetically reduce them to 15 bits each. But there are only 2^15 = 32768 distinct 15 bit files, so it would not be possible to compress all 65536 files into 32768 different slots, and still be able to recover them all perfectly.
By the pigeonhole principle, any lossless compression algorithm that makes at least one file shorter must make at least one other file longer than it was originally. The only reason that compression turns out to be useful is that we are typically not equally "interested" in all files. For example, in pictures of people, adjacent pixels on the face turn out to be close to each other in value rather than being wildly different. Readings from devices such as thermometers tend to be of the same order of magnitude rather than being equally likely to be literally any representable number. So for practical purposes, the tradeoff that some files must get longer is often acceptable: we are less likely to be interested in such files. But mathematically, no lossless compression algorithm can compress all possible files below any given size limit.
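The effect of the data's redundancy can be seen with two arrays that take the same space in memory but compress very differently. This is a minimal sketch with made-up variable names, using the compressed -v7 format; the exact sizes will vary by release.

```matlab
n = 1e6;
lowEntropy  = zeros(n, 1);     % one value repeated: highly redundant
highEntropy = rand(n, 1);      % pseudorandom doubles: very little redundancy

fLow  = [tempname '.mat'];
fHigh = [tempname '.mat'];
save(fLow,  'lowEntropy',  '-v7');
save(fHigh, 'highEntropy', '-v7');

memInfo = whos('lowEntropy');  % both arrays occupy the same number of bytes in memory
dLow    = dir(fLow);
dHigh   = dir(fHigh);
fprintf('in memory:        %9d bytes each\n', memInfo.bytes);
fprintf('zeros on disk:    %9d bytes\n', dLow.bytes);
fprintf('rand  on disk:    %9d bytes\n', dHigh.bytes);

delete(fLow);
delete(fHigh);
```

The file of zeros should come out far smaller than the file of random values, even though whos reports the same size for both variables.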
jkr
on 24 Sep 2022
The contents are not merely similar; they're the same. In Matlab, you could compare every element of the struct loaded from each file and find them identical.
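A comparison of that kind might look like the sketch below. It is self-contained, so it writes a small made-up struct to two temporary files instead of reading the files from the question; with the real data, the two load calls would point at those files. Note that isequal treats NaN values as unequal, so isequaln is the safer choice, and that neither function checks that numeric classes match.

```matlab
% Made-up struct, saved twice in different formats.
rec.t = (0:0.1:1)';
rec.x = [1 NaN 3];                     % NaN included on purpose

fA = [tempname '.mat'];
fB = [tempname '.mat'];
save(fA, 'rec', '-v6');                % uncompressed
save(fB, 'rec', '-v7');                % compressed

a = load(fA);
b = load(fB);

sameStrict = isequal(a.rec, b.rec);    % false here, because NaN ~= NaN
sameNaN    = isequaln(a.rec, b.rec);   % true: NaN values are treated as equal
fprintf('isequal: %d, isequaln: %d\n', sameStrict, sameNaN);

delete(fA);
delete(fB);
```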
I think Jan's answer at the top probably holds the clue(s). The older file may have been built by accrual, then loaded into Matlab and saved with the new name, all at once and more efficiently.
Version of "save" could contribute, I suppose. I've never bothered with any version flags when saving, but I do keep up to date with releases (not betas), so "save" behavior could have changed without me noticing.
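One way to see which format a file was written in is to read the descriptive text at the start of the file. The sketch below assumes the usual layout, in which a MAT-file begins with a text header of up to 116 bytes; as far as I know, -v6 and -v7 files both start with "MATLAB 5.0 MAT-file" and so cannot be told apart this way, while -v7.3 files start with "MATLAB 7.3 MAT-file". The struct and file names are made up for the demonstration.

```matlab
rec = struct('x', 1:10);               % small made-up variable
versions = {'-v6', '-v7', '-v7.3'};
headers  = cell(size(versions));

for k = 1:numel(versions)
    f = [tempname '.mat'];
    save(f, 'rec', versions{k});

    fid = fopen(f, 'r');
    raw = fread(fid, 116, 'uint8=>char')';   % text header at the start of the file
    fclose(fid);

    headers{k} = strtrim(raw);
    fprintf('%-6s %s\n', versions{k}, headers{k});
    delete(f);
end
```

The header also records a platform and creation date, which may help in working out how an older file was produced.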