
Optimizing Interpretability in Gaussian Process Regression Models: A Strategic Approach to Preprocessing and Testing Data

I am using the Regression Learner app to develop a model that corrects my RAW sensor data so that the corrected output is accurate. My questions are mostly about general usage of the tool.
1. When setting my input data, there is an option to reserve a portion of the data for testing. Does this process allocate the learning and testing data randomly, or does it do so sequentially, e.g., using the first few weeks of data for training and the remaining for testing?
2. I have discovered that Gaussian Process Regression (GPR) models yield the best results for my dataset. However, this type of model lacks interpretability. My inputs include Signal Data, Temperature, and Humidity.
Is it possible to assess the individual impact of each input on the overall signal, so that I can apply a linear or polynomial correction before the data reaches the GPR model? Doing so would reduce the amount of data fed into the GPR model, which in turn might make my overall modeling process more interpretable.

Answers (1)

Drew on 3 Nov 2023
  1. Regression Learner partitions the test data randomly. In Classification Learner, the partition is random and stratified. (Stratification is based on the class labels: an attempt is made to keep the class frequencies similar in the training and test sets.) If you want to control your test partition, you could (1) first partition your data into train and test sets outside of the Learner app, (2) load the training data into the Learner app in the new-session dialog, and (3) later load the separate test data into the Learner app.
  2. You can use model-agnostic interpretability techniques such as partial dependence plots (PDP), Shapley values, and LIME on your GPR models. In R2023b, you can use these techniques from the "Explain" tab within the Learner app.
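For point 1, here is a minimal sketch of what a chronological split done outside the app might look like. The data are synthetic, and the table and variable names (`data`, `Time`, `Temperature`, `Humidity`, `Signal`) and the 80/20 ratio are assumptions for illustration only. A random holdout with `cvpartition` is included for comparison.

```matlab
% Synthetic stand-in for a sensor log; all names here are hypothetical.
rng(0);                                      % reproducible synthetic data
n = 200;
Time = datetime(2023,6,1) + hours(0:n-1)';
Temperature = 20 + 5*randn(n,1);
Humidity = 50 + 10*randn(n,1);
Signal = 2 - 0.05*Temperature + 0.01*Humidity + 0.1*randn(n,1);
data = table(Time, Temperature, Humidity, Signal);

% Chronological split: earliest 80% of rows for training, the rest for testing.
data = sortrows(data, 'Time');
nTrain = floor(0.8 * height(data));
trainTbl = data(1:nTrain, :);
testTbl = data(nTrain+1:end, :);

% For comparison: a random 20% holdout.
c = cvpartition(height(data), 'HoldOut', 0.2);
randomTrainTbl = data(training(c), :);
randomTestTbl = data(test(c), :);
```

You would then load `trainTbl` when starting the session and bring in `testTbl` later as the separate test set.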
If this answer helps you, please remember to accept the answer.
Example screenshot from the Regression Learner app, within the Explain tab, for a GPR model on fisheriris data:
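The same kinds of explanations can also be computed at the command line. The sketch below fits a GPR model with `fitrgp` on synthetic data, then calls `partialDependence` and `shapley`. The data, the variable names, and the choice of the first row as the query point are assumptions for illustration, not the poster's actual setup.

```matlab
% Synthetic data; names and coefficients are hypothetical.
rng(0);
n = 150;
Temperature = 20 + 5*randn(n,1);
Humidity = 50 + 10*randn(n,1);
Signal = 2 - 0.05*Temperature + 0.3*sin(Humidity/10) + 0.05*randn(n,1);
tbl = table(Temperature, Humidity, Signal);

% Fit a GPR model with standardized predictors.
gprMdl = fitrgp(tbl, 'Signal', 'Standardize', true);

% Partial dependence of the predicted response on Temperature:
% pd holds the averaged predictions at the query points xq.
[pd, xq] = partialDependence(gprMdl, 'Temperature');

% Shapley values for a single observation (here, the first row).
queryPoint = tbl(1, {'Temperature', 'Humidity'});
explainer = shapley(gprMdl, 'QueryPoint', queryPoint);
disp(explainer.ShapleyValues)
```

`plotPartialDependence(gprMdl, 'Temperature')` draws the same partial dependence curve directly if you only need the figure.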
Dharmesh Joshi on 4 Nov 2023
Okay, let me explain a little more about what I am trying to model, even though this post is more about the actual tool rather than the techniques I need to use. I am trying to model an NO2 sensor, which is affected by both temperature and humidity. Ideally, I would like to correct all temperature issues outside the GPR model, as the issue with temperature is more linear or polynomial, while humidity has more to do with patterns and rise/fall rates, which I would like to be exclusively handled by the GPR model until I fully understand the chemistry behind the sensor and its interaction with humidity.
Yes, as the temperature increases, the signal level of the sensor I am trying to model drops, so the model output has to be increased to compensate. What I was hoping to do was to correct this temperature effect with a linear correction derived from what the GPR model has learned, since the GPR output is very accurate. If I use a linear model on its own, my R-squared is very low, so the temperature part of the correction might not be as accurate as what the GPR model produces.
So to answer your question about why I am adjusting the RAW data, it is because I want the GPR model to exclusively handle humidity-related issues only. Thus, I am trying to break the model into smaller blocks.
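One possible way to split the model into blocks like this is a two-stage fit: a linear temperature term first, then a GPR model on what is left over, using humidity only. This is only a sketch on synthetic data. The variable names, the coefficients, and the two-stage structure itself are assumptions rather than something the app does for you. If temperature and humidity are correlated in the real data, the stage 1 slope will absorb part of the humidity effect.

```matlab
% Synthetic data: linear temperature effect plus a nonlinear humidity effect.
rng(1);
n = 300;
Temperature = 15 + 10*rand(n,1);
Humidity = 30 + 40*rand(n,1);
Target = 5 - 0.08*Temperature + 0.5*sin(Humidity/8) + 0.05*randn(n,1);

% Stage 1: interpretable linear temperature correction.
tempMdl = fitlm(Temperature, Target);
tempSlope = tempMdl.Coefficients.Estimate(2);   % change in Target per degree
residual = Target - predict(tempMdl, Temperature);

% Stage 2: GPR models only what the temperature term did not explain.
humMdl = fitrgp(Humidity, residual, 'Standardize', true);

% Combined prediction is the sum of the two blocks.
yhat = predict(tempMdl, Temperature) + predict(humMdl, Humidity);
rmse = sqrt(mean((Target - yhat).^2));
fprintf('Temperature slope: %.3f, in-sample RMSE: %.3f\n', tempSlope, rmse);
```

The slope from stage 1 is the part you can read off and reason about, while the GPR block stays a black box for humidity.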
So, what is PCA? Am I correct in understanding that it is an optimization option that needs to be enabled? My inputs into my model, in addition to Signal, Temperature, and Humidity, include hourly rate changes from 1 hour to 120 hours, so there are over 100 inputs representing humidity rate changes. Including rate changes gives better results in the model. When I run PCA, it indicates that only 5 inputs are used but does not specify which 5 inputs they are. Have I understood this option correctly?
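On the PCA question: principal components are not a selection of the original inputs. Each component is a weighted combination of all of them, so "5 components kept" does not mean "5 inputs used". The sketch below uses the `pca` function on synthetic data to show where the weights (loadings) live. The data and the 95% variance threshold (`varianceToKeep`) are assumptions for illustration.

```matlab
% Synthetic predictors: 20 correlated inputs driven by 3 hidden factors.
rng(2);
n = 500;
p = 20;
factors = randn(n, 3);
X = factors * randn(3, p) + 0.1*randn(n, p);

% coeff(:,k) holds the weight of every original input in component k;
% explained(k) is the percentage of total variance captured by component k.
[coeff, score, ~, ~, explained] = pca(X);

varianceToKeep = 95;    % assumed threshold, in percent
numComp = find(cumsum(explained) >= varianceToKeep, 1);
fprintf('%d components explain at least %d%% of the variance\n', ...
    numComp, varianceToKeep);

% Rank the original inputs by their absolute weight in the first component.
[~, order] = sort(abs(coeff(:, 1)), 'descend');
disp(order(1:5)')       % column indices of the 5 most heavily weighted inputs
```

Looking at the columns of `coeff` tells you which original inputs dominate each retained component, which is as close as PCA gets to "which inputs were used".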
I am using the Regression Learner app; would I also need to use the Classification Learner app? What would be the advantage?
Dharmesh Joshi on 6 Nov 2023
In addition to my previous post, I have another question related to the Regression Learner app. Would I be better off using the app?
I have sensor data from June to November, which includes sensor signals, humidity, and humidity change rates.
I have divided my data into two tables: one for training and the other for testing. The training table covers data from June to September, while the testing table spans from September to November. These are the results I'm getting; as you can see, the training performance is very good, but the testing performance is fairly low.
Am I correct in understanding that for the Regression Learner app to produce accurate results, the training needs to be conducted with data that has a similar combination of variables as the test data or future incoming data? Is it possible to configure my data in such a way that the Regression Learner app comprehends more of the underlying principles during its training, rather than relying solely on absolute values?
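One quick check related to this question is how much of the test data falls outside the range of values seen during training, since a GPR model has little to go on when predicting far from its training data. The sketch below counts, per predictor, the fraction of test rows outside the training range. The tables are synthetic and the names `trainTbl`, `testTbl`, and `fracOutside` are hypothetical.

```matlab
% Synthetic training and test sets; the test temperatures are shifted lower
% on purpose to mimic a seasonal change between the two periods.
rng(3);
trainTbl = table(20 + 3*randn(200,1), 60 + 5*randn(200,1), ...
    'VariableNames', {'Temperature', 'Humidity'});
testTbl = table(10 + 3*randn(100,1), 60 + 5*randn(100,1), ...
    'VariableNames', {'Temperature', 'Humidity'});

vars = {'Temperature', 'Humidity'};
fracOutside = zeros(1, numel(vars));
for k = 1:numel(vars)
    lo = min(trainTbl.(vars{k}));
    hi = max(trainTbl.(vars{k}));
    % Flag test rows that lie below or above anything seen in training.
    outside = testTbl.(vars{k}) < lo | testTbl.(vars{k}) > hi;
    fracOutside(k) = mean(outside);
    fprintf('%s: %.0f%% of test rows outside training range [%.1f, %.1f]\n', ...
        vars{k}, 100*fracOutside(k), lo, hi);
end
```

A large fraction for any predictor suggests the model is being asked to extrapolate for that variable on the test period.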
