How to do Data Cleaning in MATLAB
Data analysis is only as good as the data used. Raw data usually has missing values and outliers that could lead to incorrect analysis. Data Cleaning is an iterative process and is often time-consuming. The Data Cleaner app in MATLAB® provides an interactive tool for completing data cleaning workflows more quickly and without needing to write code directly. Learn how you can use the Data Cleaner app to clean your data using simple cleaning steps and prepare it for analysis and model development.
Published: 1 Dec 2022
Data cleaning is the process of removing or correcting data to prepare it for analysis and model development. In this video, we'll see how you can use the Data Cleaner app in MATLAB to clean and prepare your data.
There are two ways to work with the table data in MATLAB, Tables and Timetables. The Data Cleaner app works well for both types of data sets, allowing you to clean, retime, and reshape the data. This video will demonstrate the steps to clean and prepare a timetable data set.
For example, we'll use electricity data from the New York ISO. The data set contains peak electricity load recorded in different zones, with samples taken over a certain period. This data set can be used for load scheduling and forecasting. But before that, the data must be carefully cleaned to account for anomalies such as missing data or outliers.
The Data Cleaner app provides an easy interactive tool for handling such complex data cleaning workflows.
Start with importing the data set into the Data Cleaner app. You can import it from the MATLAB workspace or from a local file on your system. After the data is imported, the app will show a visualization of each variable, along with some descriptive statistics.
We can quickly determine characteristics of the data via the Visualization tab. But if we want to review the actual data set, we can do that from the Data tab. You can get a summary of the whole data set under the Summary tab.
Go to the Visualization tab, and observe that the charts don't make much sense and are not providing any useful insights about the variables. This indicates that our data may need to be reformatted. Let us look at the actual data to understand why that might be.
Observe that the Zones column has categorical data type. It seems that the records for all zones are stacked on top of each other. Putting each zone in its own column would give us a better view of the data set. This can be done using the Unstack method.
Select the data set. And in the Cleaning methods dropdown, click on Unstack variables. This opens the parameter pane, which provides different options for unstacking. The app also shows an instant preview of the transformed data set. The zones which are all in a single column are now in separate columns, giving us a better view of the load data.
Specify the method and parameters for unstacking. And once you are satisfied with the preview, hit Accept. When a cleaning method is accepted, it will be added to the Cleaning Steps section, where it can be edited or deleted any time.
As you can see, this data now makes more sense with each zone as a separate variable. Along with the variables and values, the Data tab also shows a chart to visualize the variable. Let us go to the Visualization tab for a broader look at the charts.
These checkboxes allows you to select the variables you want to focus on, and deselect the others. These stacks on the side provide some additional insight into each variable. For example, it looks like some of the zones have missing values in them. You can also review for outliers by just looking at these visualizations.
For example, look at the chart for North zone. These points give us a sense that there are outliers present in this variable. We need to take care of these missing values and outliers before using this data set for analysis. Let's take care of missing values first.
Select the data set, and click on Clean Missing Data. This opens the parameter pane, along with an instant preview of the cleaning action based on the default cleaning parameters.
These red dots indicate the points where the missing values are filled, based on the default parameters. Different data may need a different way of filling the missing values. Depending on the data you have, you can choose one of these filling options or you can choose to remove them.
Here, we will just remove the missing values. After you are done, accept the changes, and check the data summary to verify that all the missing values are removed.
We saw earlier several outliers in the North zone chart. Select North zone from the Variables, and click on Clean Outlier Data. The app will automatically detect the outliers based on the default detection method. The preview of the cleaning step shows where the outliers will be filled and which outliers will be removed. You can always change the detection, cleaning, and fill methods to the ones which best suit your data set.
For now, use the Linear interpolation as the fill method, and decrease the threshold factor to 2. Hit Accept to save the cleaning step. Deselect the original data to see only the clean data on the chart. It looks like we were able to successfully remove all the outliers from the North zone data.
Cleaning data can be an iterative process. And you might need to change the order of the steps and edit cleaning parameters. Click and drag a cleaning step up or down to change its position in the order. To edit, just click on the cleaning step and make the changes.
Now that your data is ready, export this data to the MATLAB workspace for analysis. You can also export the cleaning steps as a script or as a function to apply them on similar data sets.
For more information on the Data Cleaner app, visit the MATLAB documentation. To learn more about data cleaning, smoothing, and grouping, check out our Preprocessing Data web page.