MATLAB Answers

Duplicate x,y in table; get min/max of variables to new table

1 view (last 30 days)
a1
a1 on 2 Nov 2019
Commented: a1 on 2 Nov 2019
I have a large dataset, small subset attached. There are duplicate x,y values. I need a cleaned table with "var1_max, var1_min, var2_max, var2_min" etc. for every duplicate "x,y"
I'm just not sure where to start setting up the problem. Thanks for any pointers to get going. Using Matlab R2018a, no fancy add-ons/packages
file_in = 'example.csv';
data_in = readtable(file_in);
%%table_out.Properties.VariableNames = {'id', 'x', 'y', 'var1_max', 'var1_min', ...
%% 'var2_max', 'var2_min', 'var3_max', 'var3_min', 'var4_max', 'var4_min', ...
%% 'var5_max', 'var5_min', 'var6_max', 'var6_min', 'var7_max', 'var7_min'};

  3 Comments

dpb
dpb on 2 Nov 2019
Are groups by x and y together or by each unique x, y?
Cris LaPierre
Cris LaPierre on 2 Nov 2019
Not sure I understand completely. I have the same questions a dpb. It looks like you've highlighted groups, but rows 12-14 only have duplicate x values, not x and y.
For the min/max values you want, do you want the overall min/max or just the min/max within the group (those that have the same x,y values)?
For example, rows 2 and 3 above do not have values for var1-3. What values should they have?
Rows 6-10 do have values there. What should the min/max be?
And finally, 12-14 have same x but different y. How should this be handled?
a1
a1 on 2 Nov 2019
Please disregard highlight on rows 12-14, it is in error.
I think I should add "var1_mid" ... "var7_mid" to handle cases where x,y are unique. Example in row 1:
  • x = 31.74053 | y = 27.18408 | var1_max = NaN | var1_min = NaN | var1_mid = 90 | var2_max = NaN | var2_min = NaN | var2_mid = 54 | ...etc.
Answers to other questions below:
For the min/max values you want, do you want the overall min/max or just the min/max within the group (those that have the same x,y values)?
  • just the min/max within the group (those that have the same x,y values)
For example, rows 2 and 3 above do not have values for var1-3. What values should they have?
  • NaN
Rows 6-10 do have values there. What should the min/max be?
  • x = 31.79403 | y = 27.26837 | var1_max = 79 | var1_min = 68 | var1_mid = NaN | var2_max = 251 | var2_min = 193 | var2_mid = NaN | ...etc.

Sign in to comment.

Accepted Answer

Cris LaPierre
Cris LaPierre on 2 Nov 2019
One potential solution that is simple to implement is to use groupsummary. You can have your data grouped by x and y, and have it return the within-group min, mean, and max for table variables you specify.
However, you prescribe wanting NaN for mid-value if there are min/max values. This won't do that .It will provide values for all 3 statistics. Also note that group summary will only return one row for each group. You can use the join function if you want to merge the two together.
newData = groupsummary(data_in,{'x','y'},{'min','mean','max'},["var1","var2","var3","var4","var5","var6","var7"])

  1 Comment

a1
a1 on 2 Nov 2019
This is perfect, thank you. The groupsummary returns a GroupCount, so I can just use that such that if > 1 then mean = NaN

Sign in to comment.

More Answers (0)

Sign in to answer this question.

Tags