
Determine row vector out of matrix with most evenly spaced and distributed values

I have calculated results stored in a 100x20 matrix. Now I want to find the one row of this matrix whose values in columns 2:19 are most evenly spaced between the two boundary values in columns 2 and 19. The first and last values of each row do not need to be considered. The boundary values differ from row to row, but lie in a similar range. The rows are already sorted.
Example with fewer values:
M(2,2:8) = [2.1 2.5 2.9 3.2 3.5 3.7 5.8 8.9] would be a bad one
M(20,2:8) = [2.0 2.9 3.7 4.4 5.3 6.2 7.0 7.9] would be a better one
Does anyone have a good idea how to do that?
  1 Comment
Torsten on 5 May 2024
Edited: Torsten on 5 May 2024
So you are looking for a mathematical expression to measure "most evenly spaced"?
Do your numbers have to cover a certain interval "most evenly"?
And why do you think M(2,2:8) is worse than M(20,2:8)?


Accepted Answer

John D'Errico on 5 May 2024
Edited: John D'Errico on 5 May 2024
Simple. Sort of. But you need to define what equal spacing means to you, and how you will measure the deviation from equal spacing. I'll make up a simple array.
A = sort(randn(12,10),2)
A = 12x10
   -1.3531   -0.2857   -0.1567    0.0245    0.2589    0.3649    0.5003    0.6332    0.9667    1.0794
   -1.7414   -1.7294   -1.0535   -0.1502   -0.0227    0.2560    0.6755    0.8847    0.9662    0.9917
   -1.5032   -0.5306   -0.3986   -0.2131    0.0775    0.2356    0.2688    1.0318    1.8646    2.0014
   -1.0865   -0.9003   -0.8249   -0.8086   -0.6752   -0.6246   -0.0647   -0.0051    0.1303    1.4062
   -1.2505   -0.4619   -0.3885    0.3655    0.5612    0.7090    0.9345    1.1863    1.2119    1.6337
   -1.0211   -0.9281   -0.5193    0.5549    1.0672    1.1005    1.9367    1.9414    1.9444    2.0160
   -1.6603   -1.6308   -0.9662   -0.6644    0.0338    0.0644    0.1684    0.3133    0.8591    1.3865
   -1.8653   -0.5368   -0.3114   -0.1596    0.2480    0.4386    0.7890    0.8707    1.0441    1.3907
   -2.2163   -1.9454   -1.1586   -0.6006   -0.1521   -0.0530    1.2626    1.6694    2.1450    2.4063
   -1.0122   -0.8694   -0.7798   -0.6724   -0.4959   -0.2683    0.1921    0.6581    1.3302    1.3377
So each row of A is increasing in sequence. But some of those rows are probably more uniformly spaced. Start by using diff.
Adiff = diff(A,[],2)
Adiff = 12x9
    1.0674    0.1290    0.1812    0.2344    0.1060    0.1354    0.1329    0.3335    0.1127
    0.0120    0.6758    0.9033    0.1275    0.2787    0.4194    0.2093    0.0815    0.0254
    0.9726    0.1321    0.1854    0.2906    0.1581    0.0333    0.7630    0.8328    0.1368
    0.1863    0.0754    0.0162    0.1334    0.0507    0.5599    0.0596    0.1354    1.2759
    0.7886    0.0734    0.7540    0.1957    0.1478    0.2255    0.2518    0.0256    0.4218
    0.0930    0.4087    1.0742    0.5123    0.0333    0.8362    0.0047    0.0030    0.0716
    0.0295    0.6646    0.3018    0.6982    0.0306    0.1040    0.1449    0.5458    0.5274
    1.3285    0.2254    0.1518    0.4075    0.1906    0.3504    0.0817    0.1734    0.3466
    0.2710    0.7867    0.5581    0.4485    0.0991    1.3155    0.4068    0.4757    0.2613
    0.1428    0.0897    0.1074    0.1765    0.2276    0.4604    0.4660    0.6720    0.0075
We now have a list of differences. Those differences are the stride between each consecutive pair of numbers. Now it is your turn to make a decision.
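To see what these strides look like on the numbers from the question, here is a minimal sketch using base MATLAB only. The variable names (`bad`, `good`, `ratioBad`, `ratioGood`) are mine, chosen just for illustration, and the max/min ratio is only one possible way to compare the two rows.

```matlab
% The two example rows given in the question
bad  = [2.1 2.5 2.9 3.2 3.5 3.7 5.8 8.9];
good = [2.0 2.9 3.7 4.4 5.3 6.2 7.0 7.9];

% Strides between consecutive values in each row
dBad  = diff(bad);
dGood = diff(good);

% One crude comparison: largest stride divided by smallest stride
ratioBad  = max(dBad)/min(dBad);
ratioGood = max(dGood)/min(dGood);
```

The strides of the first row run from about 0.2 to 3.1 (a ratio of roughly 15.5), while those of the second row stay between 0.7 and 0.9 (a ratio of about 1.3), which agrees with the intuition that the second row is the more evenly spaced one.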
T = table(min(Adiff,[],2), ...
          max(Adiff,[],2), ...
          std(Adiff,[],2)./median(Adiff,2), ...
          max(Adiff,[],2)./min(Adiff,[],2), ...
          kurtosis(Adiff,[],2));
T.Properties.VariableNames = {'Min stride','Max stride','Norm std','Max/Min','Kurtosis'}
T = 12x5 table
    Min stride    Max stride    Norm std    Max/Min    Kurtosis
    __________    __________    ________    _______    ________
    0.10599       1.0674        2.2729      10.071     6.3726
    0.012005      0.90333       1.4766      75.245     2.5505
    0.033257      0.97261       1.9425      29.246     1.696
    0.01624       1.2759        3.0606      78.564     5.1301
    0.025609      0.78865       1.2392      30.796     2.2152
    0.0029964     1.0742        4.289       358.51     2.219
    0.029534      0.69815       0.90632     23.639     1.3493
    0.081688      1.3285        1.6771      16.263     6.0903
    0.099122      1.3155        0.80054     13.272     3.8627
    0.0074845     0.67203       1.2515      89.79      2.2265
    0.019656      0.99879       2.9615      50.812     2.6513
    0.01194       0.89351       1.5652      74.835     3.0905
Is a row where some of those strides are REALLY tiny a bad thing? Is a row where ONE of those strides is really large bad? Which is worse? One candidate measure is the standard deviation, normalized by the median stride (column 3). Or (column 4) you might decide to look at the ratio of the largest stride to the smallest stride.
In this array, the 9th row would seem to be the best in terms of the normalized standard deviation of the strides, since it has the smallest value in column 3 (0.80054). But if we compare the maximum stride divided by the minimum stride, then row 1 is better (10.071). Or perhaps kurtosis is a good measure here, since it measures how heavy the tails of a distribution are. (And kurtosis is automatically normalized.)
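Once you settle on one of these measures, picking the winning row is a single call to min. Here is a minimal sketch, assuming that smaller is better for both the normalized standard deviation and the max/min ratio. The fixed seed and the names `normStd`, `ratio`, `bestByStd` and `bestByRatio` are only for illustration, so the numbers will not match the table above.

```matlab
rng(0);                          % fixed seed, only so the sketch is repeatable
A = sort(randn(12,10),2);        % made-up sorted rows, as before
Adiff = diff(A,[],2);            % strides along each row

% Two of the candidate measures, one value per row
normStd = std(Adiff,[],2)./median(Adiff,2);
ratio   = max(Adiff,[],2)./min(Adiff,[],2);

% Row index that minimizes each measure
[~,bestByStd]   = min(normStd);
[~,bestByRatio] = min(ratio);
```

The two indices need not agree, which is the point: each measure encodes a different idea of what "evenly spaced" means.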
Arguably you do want to normalize those strides by their median or average value, since if you doubled all of the numbers in one row, then you might not want it to be measured as worse. That is...
A = sort(rand(1,10));
A = [A;2*A;3*A];
Adiff = diff(A,[],2)
Adiff = 3x9
    0.1191    0.0139    0.0566    0.0480    0.1060    0.3093    0.0800    0.0120    0.0703
    0.2382    0.0278    0.1131    0.0960    0.2121    0.6187    0.1600    0.0241    0.1405
    0.3573    0.0417    0.1697    0.1440    0.3181    0.9280    0.2401    0.0361    0.2108
To me at least, all three rows should count as identical in terms of the goodness of their spacing. And that suggests you want to normalize things in some way.
Adiffnorm = Adiff./median(Adiff,2)
Adiffnorm = 3x9
    1.6951    0.1980    0.8051    0.6830    1.5092    4.4028    1.1390    0.1713    1.0000
    1.6951    0.1980    0.8051    0.6830    1.5092    4.4028    1.1390    0.1713    1.0000
    1.6951    0.1980    0.8051    0.6830    1.5092    4.4028    1.1390    0.1713    1.0000
Now each of those rows is seen to be equivalent, as I think it should be. In the end though, you need to define what is good or bad. The simple example you gave might seem to say it all, but it does not.
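To tie this back to the 100x20 matrix in the question, here is a minimal sketch of one possible scheme: the standard deviation of the strides in columns 2:19, normalized by their mean. That choice of measure is my assumption, not the only reasonable one, and `M` below is random stand-in data, not the real results.

```matlab
rng(1);                          % fixed seed, only so the sketch is repeatable
M = sort(rand(100,20),2);        % stand-in for the real 100x20 sorted results

inner = M(:,2:19);               % ignore the first and last value of each row
d = diff(inner,[],2);            % 17 strides per row

% Scale-invariant spread of the strides; 0 would mean perfectly even spacing
score = std(d,0,2)./mean(d,2);

% Row with the most evenly spaced inner values under this measure
[bestScore,bestRow] = min(score);
```

Because the mean stride is just the distance between the two boundary values divided by 17, this score does not change if a row is multiplied by a constant or shifted, so rows with different boundary values can be compared directly.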
  2 Comments
John D'Errico on 5 May 2024
Edited: John D'Errico on 5 May 2024
Excellent. Odds are there are many measures you could use. My thoughts would be to take some of your data. Maybe plot it. For example, a set with a constant stride would show up as a perfectly linear plot.
A = sort(randn(12,10),2)
A = 12x10
   -1.6724   -1.0698   -0.9837   -0.9073   -0.4877   -0.3854   -0.2887   -0.1449    0.2297    1.0068
   -0.9884   -0.5906   -0.5593   -0.3321   -0.1977    0.0239    0.4641    0.6662    0.7083    2.1204
   -1.5585   -0.8932   -0.7908   -0.7702   -0.7138   -0.5487   -0.2845    0.2091    0.2583    1.9670
   -2.8522   -0.8535   -0.5428   -0.4398    0.1563    0.1574    0.2039    0.3910    0.9089    1.7343
   -2.1003   -1.4418   -1.1302   -0.9000   -0.8716   -0.5205    0.0245    0.2864    0.3186    2.2926
   -1.5674   -0.9739   -0.8680   -0.6090   -0.5666    0.3659    0.5800    0.7947    1.0988    1.4289
   -1.5809   -1.5215   -0.8630   -0.3208   -0.2255   -0.1635    0.3308    0.9677    1.3600    1.4122
   -2.3213   -1.1710   -0.7835   -0.6771   -0.2087   -0.2076    0.8963    1.4512    1.6049    3.7787
   -1.0800   -0.4816   -0.3866   -0.1896    0.0485    0.0797    0.1745    0.4604    0.5195    1.5534
   -2.2878   -1.1352   -0.6783   -0.1652    0.0592    0.1126    0.8074    1.1636    1.4416    1.9326
plot(1:10,A,'-')
And maybe that does not help any. But it suggests another idea. For a row with a constant stride, the correlation coefficient between the row and the index 1:10 would be as large as possible. Since the stride is always positive here, that correlation would be +1.
C = corr([1:10;A]');
C(1,2:end)
ans = 1x12
0.9616 0.9351 0.8847 0.9089 0.9329 0.9857 0.9857 0.9537 0.9392 0.9796 0.9379 0.9730
We want the row of A with the highest correlation coefficient, when compared to a perfectly linear sequence.
[cmax,ind] = max(C(1,2:end))
cmax = 0.9857
ind = 6
So in this case, the curve that is most nearly perfectly linear would be the 6th one.
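The corr function used above comes from the Statistics and Machine Learning Toolbox. If that is not available, the same per-row correlation can be sketched with corrcoef from base MATLAB. This is only an illustrative variant; the names `t`, `r`, `rmax` and `ind` are mine.

```matlab
A = sort(randn(12,10),2);        % made-up sorted rows, as before
t = 1:size(A,2);                 % perfectly linear reference sequence

% Correlation of each row with the linear sequence
r = zeros(size(A,1),1);
for k = 1:size(A,1)
    R = corrcoef(t, A(k,:));     % 2x2 correlation matrix
    r(k) = R(1,2);               % off-diagonal entry is the coefficient
end

% Row that is closest to a straight line
[rmax,ind] = max(r);
```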
plot(diff(A(ind,:)))
What I see there is a set that has a fairly uniform set of strides, with one or two outliers. The problem is, in this case, there is one large outlier in stride near the middle of the row, and that turns out to have only a weak influence on the correlation coefficient. Had that same outlier been near the beginning or end of the row, the result would have changed. As such, I'd suggest the correlation coefficient is probably a poor measure here, since it depends on where in the row the outlier falls.
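That dependence on position is easy to check with a small made-up case: two rows built from exactly the same strides, with the one large stride placed either in the middle or at the start. This is a minimal sketch with invented numbers, using corrcoef from base MATLAB. The names are mine.

```matlab
t = 1:10;                               % perfectly linear reference sequence

% Same nine strides, only the position of the large one differs
strideMid  = [1 1 1 1 5 1 1 1 1];       % large stride in the middle
strideEdge = [5 1 1 1 1 1 1 1 1];       % large stride at the start

% Rebuild the rows from their strides
rowMid  = cumsum([0 strideMid]);
rowEdge = cumsum([0 strideEdge]);

% Correlation with the linear sequence
Rmid  = corrcoef(t,rowMid);   rMid  = Rmid(1,2);
Redge = corrcoef(t,rowEdge);  rEdge = Redge(1,2);

% A stride-based measure for comparison: std normalized by the mean stride
cvMid  = std(strideMid)/mean(strideMid);
cvEdge = std(strideEdge)/mean(strideEdge);
```

Here rMid comes out at about 0.978 and rEdge at about 0.960, so the correlation ranks the two rows differently even though their strides are identical, while the stride-based measure gives the same value for both.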
As I said, there would be many different schemes you could use. And no particular one would be perfect, since we don't have a mathematical definition of what is best. Anyway, I'd be looking for one of the other schemes I suggested, since correlation coefficient will not be robust.


More Answers (0)
