Removing Error Data from Table

Hi,
I am working on a project that involves plotting Temperature vs. Sensor Data on a scatter plot. My goal is to calculate the coefficients of a trend line for future predictions. However, there are instances where my sensor outputs false results. Instead of manually removing this data, I am in search of an outlier function that can automatically eliminate these anomalies and provide coefficients. These coefficients would be useful for defining the boundaries of valid data in future datasets.
I am aware of the standard outlier removal functions. However, the scatter plot I've attached here illustrates my challenge. You can observe the general pattern of valid data, interspersed with some errors. Although there's no trend line in the plot, it's possible to visualize how it would look and identify the inaccurate data points.
My question is: Is there a way to configure an outlier function to take into account both temperature and sensor output in identifying false results? Additionally, can this function reference previous data to aid in this evaluation? I have posted the plots below and also attached data samples of some sensor output for your reference.

 Accepted Answer

This is definitely an interesting problem, and one I have thought about at other times.
I am not certain that I have completely solved it, since it still may require a bit of interactivity, depending on the data. For these two data sets, it seems to work reasonably well with these parameters. The scatterhist calls are just there as a way to illustrate the data, and are not absolutely necessary.
Try this —
LD = load('scatterData.mat')
LD = struct with fields:
dataTable4: [8321×2 table] dataTable5: [8498×2 table]
% dataTable4 ————————————————————————————————————————————————————————————————————————————————
T4 = LD.dataTable4;
T4 = rmmissing(T4)
T4 = 8135×2 table
Temperature NO2_WE_Voltage ___________ ______________ 10.769 249.58 10.669 249.71 10.991 248.95 11.243 248.65 11.301 248.27 11.414 248.4 11.414 248.22 11.283 247.82 11.258 248.17 11.27 248.1 11.215 248.12 11.228 248.07 11.414 248 11.706 247.52 12.028 246.84 12.801 245.61
VN4 = T4.Properties.VariableNames;
figure
hsh4 = scatterhist(T4{:,1}, T4{:,2}, 'NBins',50, 'Marker','.')
hsh4 =
1×3 Axes array: Axes Axes Axes
idx4 = clusterdata(table2array(T4), 'MaxClust',15, 'Criterion','distance');
[Uidx,~,uix] = unique(idx4); % Unique Cluster Numbers & Indices
ClustCounts = accumarray(idx4, 1); % Count Cluster Members
T4_ClusterCounts = table(Uidx, ClustCounts) % Results Summary Table
T4_ClusterCounts = 15×2 table
Uidx ClustCounts ____ ___________ 1 3 2 136 3 1 4 1 5 1 6 1 7 3526 8 1 9 4 10 2173 11 2273 12 1 13 11 14 2 15 1
[MaxCounts,MaxCidx] = max((T4_ClusterCounts{:,2})) % Largest Cluster & Index
MaxCounts = 3526
MaxCidx = 7
cm = colormap(turbo(numel(unique(idx4))));
hold on
Ax = hsh4(1);
% get(Ax)
Ax.UserData
ans = 1×4 cell array
{'gscatter'} {8135×1 double} {8135×1 double} {0×0 double}
scatter(Ax.UserData{2}, Ax.UserData{3}, 10, idx4)
colormap(turbo(numel(unique(idx4))))
hold off
colorbar
Lv4 = idx4 == MaxCidx;
figure
scatter(Ax.UserData{2}(Lv4), Ax.UserData{3}(Lv4), 10, '.')
grid
% dataTable5 ————————————————————————————————————————————————————————————————————————————————
T5 = LD.dataTable5;
T5 = rmmissing(T5)
T5 = 7487×2 table
Temperature NO2_WE_Voltage ___________ ______________ 12.411 252.73 12.255 252.96 12.255 253.31 12.27 253.18 12.368 252.81 12.356 252.68 12.338 253.21 12.368 252.3 12.338 252.78 12.27 252.33 12.169 252.96 12.283 252.43 12.451 252.63 12.69 251.52 13.224 251.09 13.831 249.76
VN5 = T5.Properties.VariableNames;
figure
hsh5 = scatterhist(T5{:,1}, T5{:,2}, 'NBins',50, 'Marker','.')
hsh5 =
1×3 Axes array: Axes Axes Axes
idx5 = clusterdata(table2array(T5), 'MaxClust',10, 'Criterion','distance');
[Uidx,~,uix] = unique(idx5);
ClustCounts = accumarray(idx5, 1);
T5_ClusterCounts = table(Uidx, ClustCounts)
T5_ClusterCounts = 10×2 table
Uidx ClustCounts ____ ___________ 1 2 2 7411 3 1 4 64 5 1 6 1 7 1 8 3 9 2 10 1
[MaxCounts,MaxCidx] = max((T5_ClusterCounts{:,2}))
MaxCounts = 7411
MaxCidx = 2
cm = colormap(turbo(numel(unique(idx5))));
hold on
Ax = hsh5(1);
% get(Ax)
Ax.UserData
ans = 1×4 cell array
{'gscatter'} {7487×1 double} {7487×1 double} {0×0 double}
scatter(Ax.UserData{2}, Ax.UserData{3}, 10, idx5)
colormap(turbo(numel(unique(idx5))))
hold off
colorbar
Lv5 = idx5 == MaxCidx;
figure
scatter(Ax.UserData{2}(Lv5), Ax.UserData{3}(Lv5), 10, '.')
grid
The clusterdata function works here with 'MaxClust' (the maximum number of clusters) set at 10, however that may not generalise to all data sets, and more clusters may be necessary in those instances. That is where the ‘interactivity’ requirement may arise.
It would likely be possible to create a function from this code (with or without the plots, however I believe the plots are necessary to illustrate the data and how the code works).
I have been working on this for a few hours and will ponder that in the morning.
EDIT — (5 Dec 2023 at 04:35)
Streamlined code, added more explanation text.
.

11 Comments

Thanks, the `clusterdata` function looks interesting. The issue with my data isn't necessarily a noise-related issue, but rather false data caused by sensor malfunctions due to faulty wiring or power supply issues. So, I think this function might be what I require instead of using something like `removeoutliers`.
From what I have understood, `clusterdata` will segment the data into small blocks, and we select the block that has the maximum samples. However, what determines how each block is created? How close does data need to be to one another to be classified in the same cluster?
I am learning MATLAB as I progress and would eventually like to create a function that uses `clusterdata` and outputs it in a coder for a .NET project. How can I determine if a function such as `clusterdata` will be compatible for such an export?
My pleasure.
The clusterdata function has a number of options, including a number of name-value pair options that govern how it works. (I chose the most common options, because I am not certain exactly what you want.) I encourage you to read the options to decide what works best (I am not certain what criteria you want to use to define the clusters), although the options I chose seem to give quite good results. (I initially considered using histograms until I realised that clusterdata was more efficient and provided the result I wanted. There are other clustering algorithms available and my initial approach used kmeans, however that did not give good results.)
I used an accumarray call to calculate the total number of points in each cluster, and then chose the cluster with the greatest number of points as the ‘winner’. That definitely worked with these data sets.
The last scatter plot for each data set demonstrates how to get the data for the ‘winning’ cluster to plot or to do further analysis.
I will delete my Answer if it does not work for you.
Thanks i will try the different options.
Eventually i need to add a trend line and work out the coeffieients, but need remove the false datam therefore this function seems to work.
Is it possible to plot scatterhist in a subplot ?
I have corrected my code to use scatter instead of scatterhist (since scatterhist is not actually necessary), and added linear regressions with fitlm and predict and associated plots. All that is necessary to plot these as subplots is to assign them tto the various subplot axes.
Try this —
LD = load('scatterData.mat')
LD = struct with fields:
dataTable4: [8321×2 table] dataTable5: [8498×2 table]
% dataTable4 ————————————————————————————————————————————————————————————————————————————————
T4 = LD.dataTable4;
T4 = rmmissing(T4)
T4 = 8135×2 table
Temperature NO2_WE_Voltage ___________ ______________ 10.769 249.58 10.669 249.71 10.991 248.95 11.243 248.65 11.301 248.27 11.414 248.4 11.414 248.22 11.283 247.82 11.258 248.17 11.27 248.1 11.215 248.12 11.228 248.07 11.414 248 11.706 247.52 12.028 246.84 12.801 245.61
VN4 = T4.Properties.VariableNames;
idx4 = clusterdata(table2array(T4), 'MaxClust',15, 'Criterion','distance');
[Uidx,~,uix] = unique(idx4); % Unique Cluster Numbers & Indices
ClustCounts = accumarray(idx4, 1); % Count Cluster Members
T4_ClusterCounts = table(Uidx, ClustCounts) % Results Summary Table
T4_ClusterCounts = 15×2 table
Uidx ClustCounts ____ ___________ 1 3 2 136 3 1 4 1 5 1 6 1 7 3526 8 1 9 4 10 2173 11 2273 12 1 13 11 14 2 15 1
[MaxCounts,MaxCidx] = max((T4_ClusterCounts{:,2})) % Largest Cluster & Index
MaxCounts = 3526
MaxCidx = 7
cm = colormap(turbo(numel(unique(idx4))));
figure
scatter(T4{:,1}, T4{:,2}, 10, idx4, '.')
grid
colormap(turbo(numel(unique(idx4))))
colorbar
xlabel(VN4{1})
ylabel(strrep(VN4{2},'_','\_'))
title('dataTable4')
Lv4 = idx4 == MaxCidx;
mdl4 = fitlm(T4{Lv4,1}, T4{Lv4,2})
mdl4 =
Linear regression model: y ~ 1 + x1 Estimated Coefficients: Estimate SE tStat pValue ________ _________ _______ ______ (Intercept) 270.78 0.11505 2353.7 0 x1 -1.7755 0.0060972 -291.19 0 Number of observations: 3526, Error degrees of freedom: 3524 Root Mean Squared Error: 2.35 R-squared: 0.96, Adjusted R-Squared: 0.96 F-statistic vs. constant model: 8.48e+04, p-value = 0
xv4 = linspace(min(T4{Lv4,1}), max(T4{Lv4,1}), numel(T4{Lv4,1})).';
[y4,yci4] = predict(mdl4, xv4);
figure
hs = scatter(T4{Lv4,1}, T4{Lv4,2}, 10, '.', 'DisplayName','Data');
hold on
hp1 = plot(xv4, y4, '-r', 'DisplayName','Linear Regression');
hp2 = plot(xv4, yci4, '--r', 'DisplayName','95% Confidence Intervals');
hold off
grid
xlabel(VN4{1})
ylabel(strrep(VN4{2},'_','\_'))
title('dataTable4')
legend([hs,hp1,hp2(1)], 'Location','best')
% dataTable5 ————————————————————————————————————————————————————————————————————————————————
T5 = LD.dataTable5;
T5 = rmmissing(T5)
T5 = 7487×2 table
Temperature NO2_WE_Voltage ___________ ______________ 12.411 252.73 12.255 252.96 12.255 253.31 12.27 253.18 12.368 252.81 12.356 252.68 12.338 253.21 12.368 252.3 12.338 252.78 12.27 252.33 12.169 252.96 12.283 252.43 12.451 252.63 12.69 251.52 13.224 251.09 13.831 249.76
VN5 = T5.Properties.VariableNames;
idx5 = clusterdata(table2array(T5), 'MaxClust',10, 'Criterion','distance');
[Uidx,~,uix] = unique(idx5);
ClustCounts = accumarray(idx5, 1);
T5_ClusterCounts = table(Uidx, ClustCounts)
T5_ClusterCounts = 10×2 table
Uidx ClustCounts ____ ___________ 1 2 2 7411 3 1 4 64 5 1 6 1 7 1 8 3 9 2 10 1
[MaxCounts,MaxCidx] = max((T5_ClusterCounts{:,2}))
MaxCounts = 7411
MaxCidx = 2
figure
scatter(T5{:,1}, T5{:,2}, 10, idx5, '.')
grid
colormap(turbo(numel(unique(idx4))))
colorbar
xlabel(VN5{1})
ylabel(strrep(VN5{2},'_','\_'))
title('dataTable5')
Lv5 = idx5 == MaxCidx;
mdl5 = fitlm(T5{Lv5,1}, T5{Lv5,2})
mdl5 =
Linear regression model: y ~ 1 + x1 Estimated Coefficients: Estimate SE tStat pValue ________ ________ ______ ______ (Intercept) 277.68 0.26206 1059.6 0 x1 -1.7983 0.013717 -131.1 0 Number of observations: 7411, Error degrees of freedom: 7409 Root Mean Squared Error: 8.03 R-squared: 0.699, Adjusted R-Squared: 0.699 F-statistic vs. constant model: 1.72e+04, p-value = 0
xv5 = linspace(min(T5{Lv5,1}), max(T5{Lv5,1}), numel(T5{Lv5,1})).';
[y5,yci5] = predict(mdl5, xv5);
figure
hs = scatter(T5{Lv5,1}, T5{Lv5,2}, 10, '.', 'DisplayName','Data');
hold on
hp1 = plot(xv5, y5, '-r', 'DisplayName','Linear Regression');
hp2 = plot(xv5, yci5, '--r', 'DisplayName','95% Confidence Intervals');
hold off
grid
xlabel(VN5{1})
ylabel(strrep(VN5{2},'_','\_'))
title('dataTable5')
legend([hs,hp1,hp2(1)], 'Location','best')
.
Thanks i iwll try this out tommrow. I think the regression line will need to be poly 2 or 3
Is there any special functional relationship between the temperature and sensor voltage? If there is, using fitnlm to fit it to that model would likely be more appropriate. The code is essentially the same except for replacing fitlm with fitnlm.
Yes, i am working on correction factor which changes with temperture and over time. The plot below is theorical plot, which shows how the trend line will need to look and over time as temperture decrease that region get more flat.
That actually looks to be almost linear. That aside, since humidity is also a factor, perhaps doing a multiple linear regression (the regress function and others) with temperature and humidity as predictors could be worthwhile.
Doing a second-order polynomial fit with fitlm is straightforward —
LD = load('scatterData.mat')
LD = struct with fields:
dataTable4: [8321×2 table] dataTable5: [8498×2 table]
% dataTable4 ————————————————————————————————————————————————————————————————————————————————
T4 = LD.dataTable4;
T4 = rmmissing(T4)
T4 = 8135×2 table
Temperature NO2_WE_Voltage ___________ ______________ 10.769 249.58 10.669 249.71 10.991 248.95 11.243 248.65 11.301 248.27 11.414 248.4 11.414 248.22 11.283 247.82 11.258 248.17 11.27 248.1 11.215 248.12 11.228 248.07 11.414 248 11.706 247.52 12.028 246.84 12.801 245.61
VN4 = T4.Properties.VariableNames;
idx4 = clusterdata(table2array(T4), 'MaxClust',15, 'Criterion','distance');
[Uidx,~,uix] = unique(idx4); % Unique Cluster Numbers & Indices
ClustCounts = accumarray(idx4, 1); % Count Cluster Members
T4_ClusterCounts = table(Uidx, ClustCounts) % Results Summary Table
T4_ClusterCounts = 15×2 table
Uidx ClustCounts ____ ___________ 1 3 2 136 3 1 4 1 5 1 6 1 7 3526 8 1 9 4 10 2173 11 2273 12 1 13 11 14 2 15 1
[MaxCounts,MaxCidx] = max((T4_ClusterCounts{:,2})) % Largest Cluster & Index
MaxCounts = 3526
MaxCidx = 7
cm = colormap(turbo(numel(unique(idx4))));
% figure
% scatter(T4{:,1}, T4{:,2}, 10, idx4, '.')
% grid
% colormap(turbo(numel(unique(idx4))))
% colorbar
% xlabel(VN4{1})
% ylabel(strrep(VN4{2},'_','\_'))
% title('dataTable4')
Lv4 = idx4 == MaxCidx;
mdl4 = fitlm([T4{Lv4,1}.^2 T4{Lv4,1}], T4{Lv4,2}) % 'x1' = Temperature^2, 'x2' = Temperature
mdl4 =
Linear regression model: y ~ 1 + x1 + x2 Estimated Coefficients: Estimate SE tStat pValue _________ __________ _______ ___________ (Intercept) 263.79 0.17135 1539.4 0 x1 -0.026384 0.00055086 -47.895 0 x2 -0.85033 0.019891 -42.75 4.6442e-322 Number of observations: 3526, Error degrees of freedom: 3523 Root Mean Squared Error: 1.83 R-squared: 0.976, Adjusted R-Squared: 0.976 F-statistic vs. constant model: 7.11e+04, p-value = 0
xv4 = linspace(min(T4{Lv4,1}), max(T4{Lv4,1}), numel(T4{Lv4,1})).';
[y4,yci4] = predict(mdl4, [xv4.^2, xv4]);
figure
hs = scatter(T4{Lv4,1}, T4{Lv4,2}, 10, '.', 'DisplayName','Data');
hold on
hp1 = plot(xv4, y4, '-r', 'DisplayName','Linear Regression');
hp2 = plot(xv4, yci4, '--r', 'DisplayName','95% Confidence Intervals');
hold off
grid
xlabel(VN4{1})
ylabel(strrep(VN4{2},'_','\_'))
title('dataTable4')
legend([hs,hp1,hp2(1)], 'Location','best')
% dataTable5 ————————————————————————————————————————————————————————————————————————————————
T5 = LD.dataTable5;
T5 = rmmissing(T5)
T5 = 7487×2 table
Temperature NO2_WE_Voltage ___________ ______________ 12.411 252.73 12.255 252.96 12.255 253.31 12.27 253.18 12.368 252.81 12.356 252.68 12.338 253.21 12.368 252.3 12.338 252.78 12.27 252.33 12.169 252.96 12.283 252.43 12.451 252.63 12.69 251.52 13.224 251.09 13.831 249.76
VN5 = T5.Properties.VariableNames;
idx5 = clusterdata(table2array(T5), 'MaxClust',10, 'Criterion','distance');
[Uidx,~,uix] = unique(idx5);
ClustCounts = accumarray(idx5, 1);
T5_ClusterCounts = table(Uidx, ClustCounts)
T5_ClusterCounts = 10×2 table
Uidx ClustCounts ____ ___________ 1 2 2 7411 3 1 4 64 5 1 6 1 7 1 8 3 9 2 10 1
[MaxCounts,MaxCidx] = max((T5_ClusterCounts{:,2}))
MaxCounts = 7411
MaxCidx = 2
% figure
% scatter(T5{:,1}, T5{:,2}, 10, idx5, '.')
% grid
% colormap(turbo(numel(unique(idx4))))
% colorbar
% xlabel(VN5{1})
% ylabel(strrep(VN5{2},'_','\_'))
% title('dataTable5')
Lv5 = idx5 == MaxCidx;
mdl5 = fitlm([T5{Lv5,1}.^2 T5{Lv5,1}], T5{Lv5,2}) % 'x1' = Temperature^2, 'x2' = Temperature
mdl5 =
Linear regression model: y ~ 1 + x1 + x2 Estimated Coefficients: Estimate SE tStat pValue __________ _________ _______ ___________ (Intercept) 276.74 0.48267 573.35 0 x1 -0.0035276 0.0015207 -2.3197 0.020385 x2 -1.6735 0.055514 -30.145 2.0505e-188 Number of observations: 7411, Error degrees of freedom: 7408 Root Mean Squared Error: 8.03 R-squared: 0.699, Adjusted R-Squared: 0.699 F-statistic vs. constant model: 8.6e+03, p-value = 0
xv5 = linspace(min(T5{Lv5,1}), max(T5{Lv5,1}), numel(T5{Lv5,1})).';
[y5,yci5] = predict(mdl5, [xv5.^2 xv5]);
figure
hs = scatter(T5{Lv5,1}, T5{Lv5,2}, 10, '.', 'DisplayName','Data');
hold on
hp1 = plot(xv5, y5, '-r', 'DisplayName','Linear Regression');
hp2 = plot(xv5, yci5, '--r', 'DisplayName','95% Confidence Intervals');
hold off
grid
xlabel(VN5{1})
ylabel(strrep(VN5{2},'_','\_'))
title('dataTable5')
legend([hs,hp1,hp2(1)], 'Location','best')
It would be better to actually model the dynamics of the sensor if you have a model for it. I suspect that the reactions at the catalyst surface are simple first-order kinetics.
.
@Dharmesh Joshi — If this solution works for you, please Accept my Answer.
Just done it

Sign in to comment.

More Answers (0)

Categories

Products

Tags

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!