Removing Error Data from Table

Question

0 votes

scatterData.mat

Hi,

I am working on a project that involves plotting Temperature vs. Sensor Data on a scatter plot. My goal is to calculate the coefficients of a trend line for future predictions. However, there are instances where my sensor outputs false results. Instead of manually removing this data, I am in search of an outlier function that can automatically eliminate these anomalies and provide coefficients. These coefficients would be useful for defining the boundaries of valid data in future datasets.

I am aware of the standard outlier removal functions. However, the scatter plot I've attached here illustrates my challenge. You can observe the general pattern of valid data, interspersed with some errors. Although there's no trend line in the plot, it's possible to visualize how it would look and identify the inaccurate data points.

My question is: Is there a way to configure an outlier function to take into account both temperature and sensor output in identifying false results? Additionally, can this function reference previous data to aid in this evaluation? I have posted the plots below and also attached data samples of some sensor output for your reference.

0 Comments
Show -2 older comments Hide -2 older comments

Sign in to comment.

Sign in to answer this question.

Follow Question

Answer 1

Star Strider on 5 Dec 2023

Edited: Star Strider on 5 Dec 2023

Open in MATLAB Online

0 votes

scatterData.mat

This is definitely an interesting problem, and one I have thought about at other times.

I am not certain that I have completely solved it, since it still may require a bit of interactivity, depending on the data. For these two data sets, it seems to work reasonably well with these parameters. The scatterhist calls are just there as a way to illustrate the data, and are not absolutely necessary.

Try this —

LD = load('scatterData.mat')

LD = struct with fields:

dataTable4: [8321×2 table] dataTable5: [8498×2 table]

% dataTable4 ————————————————————————————————————————————————————————————————————————————————

T4 = LD.dataTable4;

T4 = rmmissing(T4)

T4 = 8135×2 table

Temperature NO2_WE_Voltage ___________ ______________ 10.769 249.58 10.669 249.71 10.991 248.95 11.243 248.65 11.301 248.27 11.414 248.4 11.414 248.22 11.283 247.82 11.258 248.17 11.27 248.1 11.215 248.12 11.228 248.07 11.414 248 11.706 247.52 12.028 246.84 12.801 245.61

VN4 = T4.Properties.VariableNames;

figure

hsh4 = scatterhist(T4{:,1}, T4{:,2}, 'NBins',50, 'Marker','.')

hsh4 =

1×3 Axes array: Axes Axes Axes

idx4 = clusterdata(table2array(T4), 'MaxClust',15, 'Criterion','distance');

[Uidx,~,uix] = unique(idx4); % Unique Cluster Numbers & Indices

ClustCounts = accumarray(idx4, 1); % Count Cluster Members

T4_ClusterCounts = table(Uidx, ClustCounts) % Results Summary Table

T4_ClusterCounts = 15×2 table

Uidx ClustCounts ____ ___________ 1 3 2 136 3 1 4 1 5 1 6 1 7 3526 8 1 9 4 10 2173 11 2273 12 1 13 11 14 2 15 1

[MaxCounts,MaxCidx] = max((T4_ClusterCounts{:,2})) % Largest Cluster & Index

MaxCounts = 3526

MaxCidx = 7

cm = colormap(turbo(numel(unique(idx4))));

hold on

Ax = hsh4(1);

% get(Ax)

Ax.UserData

ans = 1×4 cell array

{'gscatter'} {8135×1 double} {8135×1 double} {0×0 double}

scatter(Ax.UserData{2}, Ax.UserData{3}, 10, idx4)

colormap(turbo(numel(unique(idx4))))

hold off

colorbar

Lv4 = idx4 == MaxCidx;

figure

scatter(Ax.UserData{2}(Lv4), Ax.UserData{3}(Lv4), 10, '.')

grid

% dataTable5 ————————————————————————————————————————————————————————————————————————————————

T5 = LD.dataTable5;

T5 = rmmissing(T5)

T5 = 7487×2 table

Temperature NO2_WE_Voltage ___________ ______________ 12.411 252.73 12.255 252.96 12.255 253.31 12.27 253.18 12.368 252.81 12.356 252.68 12.338 253.21 12.368 252.3 12.338 252.78 12.27 252.33 12.169 252.96 12.283 252.43 12.451 252.63 12.69 251.52 13.224 251.09 13.831 249.76

VN5 = T5.Properties.VariableNames;

figure

hsh5 = scatterhist(T5{:,1}, T5{:,2}, 'NBins',50, 'Marker','.')

hsh5 =

1×3 Axes array: Axes Axes Axes

idx5 = clusterdata(table2array(T5), 'MaxClust',10, 'Criterion','distance');

[Uidx,~,uix] = unique(idx5);

ClustCounts = accumarray(idx5, 1);

T5_ClusterCounts = table(Uidx, ClustCounts)

T5_ClusterCounts = 10×2 table

Uidx ClustCounts ____ ___________ 1 2 2 7411 3 1 4 64 5 1 6 1 7 1 8 3 9 2 10 1

[MaxCounts,MaxCidx] = max((T5_ClusterCounts{:,2}))

MaxCounts = 7411

MaxCidx = 2

cm = colormap(turbo(numel(unique(idx5))));

hold on

Ax = hsh5(1);

% get(Ax)

Ax.UserData

ans = 1×4 cell array

{'gscatter'} {7487×1 double} {7487×1 double} {0×0 double}

scatter(Ax.UserData{2}, Ax.UserData{3}, 10, idx5)

colormap(turbo(numel(unique(idx5))))

hold off

colorbar

Lv5 = idx5 == MaxCidx;

figure

scatter(Ax.UserData{2}(Lv5), Ax.UserData{3}(Lv5), 10, '.')

grid

The clusterdata function works here with 'MaxClust' (the maximum number of clusters) set at 10, however that may not generalise to all data sets, and more clusters may be necessary in those instances. That is where the ‘interactivity’ requirement may arise.

It would likely be possible to create a function from this code (with or without the plots, however I believe the plots are necessary to illustrate the data and how the code works).

I have been working on this for a few hours and will ponder that in the morning.

EDIT — (5 Dec 2023 at 04:35)

Streamlined code, added more explanation text.

.

11 Comments
Show 9 older comments Hide 9 older comments

Star Strider on 5 Dec 2023

Open in MATLAB Online

scatterData.mat

I have corrected my code to use scatter instead of scatterhist (since scatterhist is not actually necessary), and added linear regressions with fitlm and predict and associated plots. All that is necessary to plot these as subplots is to assign them tto the various subplot axes.

Try this —

LD = load('scatterData.mat')

LD = struct with fields:

dataTable4: [8321×2 table] dataTable5: [8498×2 table]

% dataTable4 ————————————————————————————————————————————————————————————————————————————————

T4 = LD.dataTable4;

T4 = rmmissing(T4)

T4 = 8135×2 table

Temperature NO2_WE_Voltage ___________ ______________ 10.769 249.58 10.669 249.71 10.991 248.95 11.243 248.65 11.301 248.27 11.414 248.4 11.414 248.22 11.283 247.82 11.258 248.17 11.27 248.1 11.215 248.12 11.228 248.07 11.414 248 11.706 247.52 12.028 246.84 12.801 245.61

VN4 = T4.Properties.VariableNames;

idx4 = clusterdata(table2array(T4), 'MaxClust',15, 'Criterion','distance');

[Uidx,~,uix] = unique(idx4); % Unique Cluster Numbers & Indices

ClustCounts = accumarray(idx4, 1); % Count Cluster Members

T4_ClusterCounts = table(Uidx, ClustCounts) % Results Summary Table

T4_ClusterCounts = 15×2 table

Uidx ClustCounts ____ ___________ 1 3 2 136 3 1 4 1 5 1 6 1 7 3526 8 1 9 4 10 2173 11 2273 12 1 13 11 14 2 15 1

[MaxCounts,MaxCidx] = max((T4_ClusterCounts{:,2})) % Largest Cluster & Index

MaxCounts = 3526

MaxCidx = 7

cm = colormap(turbo(numel(unique(idx4))));

figure

scatter(T4{:,1}, T4{:,2}, 10, idx4, '.')

grid

colormap(turbo(numel(unique(idx4))))

colorbar

xlabel(VN4{1})

ylabel(strrep(VN4{2},'_','\_'))

title('dataTable4')

Lv4 = idx4 == MaxCidx;

mdl4 = fitlm(T4{Lv4,1}, T4{Lv4,2})

mdl4 =

Linear regression model: y ~ 1 + x1 Estimated Coefficients: Estimate SE tStat pValue ________ _________ _______ ______ (Intercept) 270.78 0.11505 2353.7 0 x1 -1.7755 0.0060972 -291.19 0 Number of observations: 3526, Error degrees of freedom: 3524 Root Mean Squared Error: 2.35 R-squared: 0.96, Adjusted R-Squared: 0.96 F-statistic vs. constant model: 8.48e+04, p-value = 0

xv4 = linspace(min(T4{Lv4,1}), max(T4{Lv4,1}), numel(T4{Lv4,1})).';

[y4,yci4] = predict(mdl4, xv4);

figure

hs = scatter(T4{Lv4,1}, T4{Lv4,2}, 10, '.', 'DisplayName','Data');

hold on

hp1 = plot(xv4, y4, '-r', 'DisplayName','Linear Regression');

hp2 = plot(xv4, yci4, '--r', 'DisplayName','95% Confidence Intervals');

hold off

grid

xlabel(VN4{1})

ylabel(strrep(VN4{2},'_','\_'))

title('dataTable4')

legend([hs,hp1,hp2(1)], 'Location','best')

% dataTable5 ————————————————————————————————————————————————————————————————————————————————

T5 = LD.dataTable5;

T5 = rmmissing(T5)

T5 = 7487×2 table

Temperature NO2_WE_Voltage ___________ ______________ 12.411 252.73 12.255 252.96 12.255 253.31 12.27 253.18 12.368 252.81 12.356 252.68 12.338 253.21 12.368 252.3 12.338 252.78 12.27 252.33 12.169 252.96 12.283 252.43 12.451 252.63 12.69 251.52 13.224 251.09 13.831 249.76

VN5 = T5.Properties.VariableNames;

idx5 = clusterdata(table2array(T5), 'MaxClust',10, 'Criterion','distance');

[Uidx,~,uix] = unique(idx5);

ClustCounts = accumarray(idx5, 1);

T5_ClusterCounts = table(Uidx, ClustCounts)

T5_ClusterCounts = 10×2 table

Uidx ClustCounts ____ ___________ 1 2 2 7411 3 1 4 64 5 1 6 1 7 1 8 3 9 2 10 1

[MaxCounts,MaxCidx] = max((T5_ClusterCounts{:,2}))

MaxCounts = 7411

MaxCidx = 2

figure

scatter(T5{:,1}, T5{:,2}, 10, idx5, '.')

grid

colormap(turbo(numel(unique(idx4))))

colorbar

xlabel(VN5{1})

ylabel(strrep(VN5{2},'_','\_'))

title('dataTable5')

Lv5 = idx5 == MaxCidx;

mdl5 = fitlm(T5{Lv5,1}, T5{Lv5,2})

mdl5 =

Linear regression model: y ~ 1 + x1 Estimated Coefficients: Estimate SE tStat pValue ________ ________ ______ ______ (Intercept) 277.68 0.26206 1059.6 0 x1 -1.7983 0.013717 -131.1 0 Number of observations: 7411, Error degrees of freedom: 7409 Root Mean Squared Error: 8.03 R-squared: 0.699, Adjusted R-Squared: 0.699 F-statistic vs. constant model: 1.72e+04, p-value = 0

xv5 = linspace(min(T5{Lv5,1}), max(T5{Lv5,1}), numel(T5{Lv5,1})).';

[y5,yci5] = predict(mdl5, xv5);

figure

hs = scatter(T5{Lv5,1}, T5{Lv5,2}, 10, '.', 'DisplayName','Data');

hold on

hp1 = plot(xv5, y5, '-r', 'DisplayName','Linear Regression');

hp2 = plot(xv5, yci5, '--r', 'DisplayName','95% Confidence Intervals');

hold off

grid

xlabel(VN5{1})

ylabel(strrep(VN5{2},'_','\_'))

title('dataTable5')

legend([hs,hp1,hp2(1)], 'Location','best')

.

Star Strider on 6 Dec 2023

Open in MATLAB Online

scatterData.mat

That actually looks to be almost linear. That aside, since humidity is also a factor, perhaps doing a multiple linear regression (the regress function and others) with temperature and humidity as predictors could be worthwhile.

Doing a second-order polynomial fit with fitlm is straightforward —

LD = load('scatterData.mat')

LD = struct with fields:

dataTable4: [8321×2 table] dataTable5: [8498×2 table]

% dataTable4 ————————————————————————————————————————————————————————————————————————————————

T4 = LD.dataTable4;

T4 = rmmissing(T4)

T4 = 8135×2 table

Temperature NO2_WE_Voltage ___________ ______________ 10.769 249.58 10.669 249.71 10.991 248.95 11.243 248.65 11.301 248.27 11.414 248.4 11.414 248.22 11.283 247.82 11.258 248.17 11.27 248.1 11.215 248.12 11.228 248.07 11.414 248 11.706 247.52 12.028 246.84 12.801 245.61

VN4 = T4.Properties.VariableNames;

idx4 = clusterdata(table2array(T4), 'MaxClust',15, 'Criterion','distance');

[Uidx,~,uix] = unique(idx4); % Unique Cluster Numbers & Indices

ClustCounts = accumarray(idx4, 1); % Count Cluster Members

T4_ClusterCounts = table(Uidx, ClustCounts) % Results Summary Table

T4_ClusterCounts = 15×2 table

Uidx ClustCounts ____ ___________ 1 3 2 136 3 1 4 1 5 1 6 1 7 3526 8 1 9 4 10 2173 11 2273 12 1 13 11 14 2 15 1

[MaxCounts,MaxCidx] = max((T4_ClusterCounts{:,2})) % Largest Cluster & Index

MaxCounts = 3526

MaxCidx = 7

cm = colormap(turbo(numel(unique(idx4))));

% figure

% scatter(T4{:,1}, T4{:,2}, 10, idx4, '.')

% grid

% colormap(turbo(numel(unique(idx4))))

% colorbar

% xlabel(VN4{1})

% ylabel(strrep(VN4{2},'_','\_'))

% title('dataTable4')

Lv4 = idx4 == MaxCidx;

mdl4 = fitlm([T4{Lv4,1}.^2 T4{Lv4,1}], T4{Lv4,2}) % 'x1' = Temperature^2, 'x2' = Temperature

mdl4 =

Linear regression model: y ~ 1 + x1 + x2 Estimated Coefficients: Estimate SE tStat pValue _________ __________ _______ ___________ (Intercept) 263.79 0.17135 1539.4 0 x1 -0.026384 0.00055086 -47.895 0 x2 -0.85033 0.019891 -42.75 4.6442e-322 Number of observations: 3526, Error degrees of freedom: 3523 Root Mean Squared Error: 1.83 R-squared: 0.976, Adjusted R-Squared: 0.976 F-statistic vs. constant model: 7.11e+04, p-value = 0

xv4 = linspace(min(T4{Lv4,1}), max(T4{Lv4,1}), numel(T4{Lv4,1})).';

[y4,yci4] = predict(mdl4, [xv4.^2, xv4]);

figure

hs = scatter(T4{Lv4,1}, T4{Lv4,2}, 10, '.', 'DisplayName','Data');

hold on

hp1 = plot(xv4, y4, '-r', 'DisplayName','Linear Regression');

hp2 = plot(xv4, yci4, '--r', 'DisplayName','95% Confidence Intervals');

hold off

grid

xlabel(VN4{1})

ylabel(strrep(VN4{2},'_','\_'))

title('dataTable4')

legend([hs,hp1,hp2(1)], 'Location','best')

% dataTable5 ————————————————————————————————————————————————————————————————————————————————

T5 = LD.dataTable5;

T5 = rmmissing(T5)

T5 = 7487×2 table

Temperature NO2_WE_Voltage ___________ ______________ 12.411 252.73 12.255 252.96 12.255 253.31 12.27 253.18 12.368 252.81 12.356 252.68 12.338 253.21 12.368 252.3 12.338 252.78 12.27 252.33 12.169 252.96 12.283 252.43 12.451 252.63 12.69 251.52 13.224 251.09 13.831 249.76

VN5 = T5.Properties.VariableNames;

idx5 = clusterdata(table2array(T5), 'MaxClust',10, 'Criterion','distance');

[Uidx,~,uix] = unique(idx5);

ClustCounts = accumarray(idx5, 1);

T5_ClusterCounts = table(Uidx, ClustCounts)

T5_ClusterCounts = 10×2 table

Uidx ClustCounts ____ ___________ 1 2 2 7411 3 1 4 64 5 1 6 1 7 1 8 3 9 2 10 1

[MaxCounts,MaxCidx] = max((T5_ClusterCounts{:,2}))

MaxCounts = 7411

MaxCidx = 2

% figure

% scatter(T5{:,1}, T5{:,2}, 10, idx5, '.')

% grid

% colormap(turbo(numel(unique(idx4))))

% colorbar

% xlabel(VN5{1})

% ylabel(strrep(VN5{2},'_','\_'))

% title('dataTable5')

Lv5 = idx5 == MaxCidx;

mdl5 = fitlm([T5{Lv5,1}.^2 T5{Lv5,1}], T5{Lv5,2}) % 'x1' = Temperature^2, 'x2' = Temperature

mdl5 =

Linear regression model: y ~ 1 + x1 + x2 Estimated Coefficients: Estimate SE tStat pValue __________ _________ _______ ___________ (Intercept) 276.74 0.48267 573.35 0 x1 -0.0035276 0.0015207 -2.3197 0.020385 x2 -1.6735 0.055514 -30.145 2.0505e-188 Number of observations: 7411, Error degrees of freedom: 7408 Root Mean Squared Error: 8.03 R-squared: 0.699, Adjusted R-Squared: 0.699 F-statistic vs. constant model: 8.6e+03, p-value = 0

xv5 = linspace(min(T5{Lv5,1}), max(T5{Lv5,1}), numel(T5{Lv5,1})).';

[y5,yci5] = predict(mdl5, [xv5.^2 xv5]);

figure

hs = scatter(T5{Lv5,1}, T5{Lv5,2}, 10, '.', 'DisplayName','Data');

hold on

hp1 = plot(xv5, y5, '-r', 'DisplayName','Linear Regression');

hp2 = plot(xv5, yci5, '--r', 'DisplayName','95% Confidence Intervals');

hold off

grid

xlabel(VN5{1})

ylabel(strrep(VN5{2},'_','\_'))

title('dataTable5')

legend([hs,hp1,hp2(1)], 'Location','best')

It would be better to actually model the dynamics of the sensor if you have a model for it. I suspect that the reactions at the catalyst surface are simple first-order kinetics.

.

Dharmesh Joshi on 7 Dec 2023

Just done it

Star Strider on 7 Dec 2023

Thank you!

Sign in to comment.

Removing Error Data from Table

0 Comments
Show -2 older comments Hide -2 older comments

Accepted Answer

11 Comments
Show 9 older comments Hide 9 older comments

More Answers (0)

Categories

Products

Tags

Community Treasure Hunt

Removing Error Data from Table

0 Comments Show -2 older comments Hide -2 older comments

Accepted Answer

11 Comments Show 9 older comments Hide 9 older comments

More Answers (0)

Categories

Products

Tags

See Also

Community Treasure Hunt

0 Comments
Show -2 older comments Hide -2 older comments

11 Comments
Show 9 older comments Hide 9 older comments