Force coefficients in multivariate and multiple regression
7 views (last 30 days)
Show older comments
I have a dataset from physical experiments where I control 2 independent variables x and y and I measure 2 dependent variables u and v, and according to a physical model, they depend on parameters p and q as follows:
u = p*q^4*x/y
v = p*q^3*x*y
I make my life a lot easier by converting this to logarithmic scales:
log(u) = log(x) - log(y) + A with 10^A = p*q^4
log(v) = log(x) + log(y) + B with 10^B = p*q^3
And then I can use mvregress, the only problem is: mvregress doens't allow me to force the slopes for x and y to a specific value right out of the box. I really want to force these slopes, which will have an impact on my confidence intervals for A and B. And on top of that: I know I can back calculate the variance of A and B by dividing half of the confidence interval by the t-value (which depends on the confidence level and the degrees of freedom for error), but I am unsure whether A and B will have a covariance and how I can calculate that, because I need that covariance in order to calculate my confidence intervals for p and q.
2 Comments
Answers (1)
John D'Errico
on 8 Mar 2023
Edited: John D'Errico
on 8 Mar 2023
There has been much unsaid here that I think you don't understand. Given these models:
u = p*q^4*x/y
v = p*q^3*x*y
where x and y are known independent variables, and u and v are measured, you want to estimate p and q.
You have stated first that you logged the models. In doing so, you made an implicit assumption that you completely missed.
If u and v are measured data, then what is the error structure on the noise in that data? Is the noise additive, Normally distributed? That is, is the true model for these processes:
u + noise = p*q^4*x/y
v + noise = p*q^3*x*y
where the noise is additive gaussian noise? The problem is, when you log the data, it is NOT true that log(a+b) has any simple behavior.
Or, is the noise in u and v really more likely to be proportional noise? If it is, then most likely the noise in your model was something more like lognormally distributed. And that is the good news. It suggests that logging the model is really the proper way to estimate things, because it transforms proportional lognormal noise into additive gaussian noise.
So the very first thing I would suggest is to look at your data. We don't see it, so I cannot help you in that respect. If u and v vary by multiple orders of magnitude, but the noise seems to be roughly the same magitude at all levels of u and v, that would suggest it is additive noise. But if the noise magnitude seems to scale with the size of u and v themselves, that suggests proportional noise, in whic hcase logging the data is a good thing.
The point is, least squares techniques are designed to solve problems where the error structure is gaussian, normally distributed, with a homogeneous variance, so the same at all levels of your data. If that fails, then expect to see poor estimates of your parameters. (Do I need to give an example of what happens, and why logging a model can be right, or wrong? I can, if that would be useful.)
Now. On to the problem at hand. IF you would log those models to estimate p and q...
log(u) = log(p) + 4*log(q) + log(x) - log(y)
log(v) = log(p) + 3*log(q) + log(x) + log(y)
What is known here? We know u,v,x,y. So move EVERYTHING THAT IS KNOWN to one side. We can now write the problem as:
log(u) - log(x) + log(y) = log(p) + 4*log(q) + gaussian noise
log(u) - log(x) - log(y) = log(p) + 3*log(q) + gaussian noise
This reduces to two simple models. The right hand side, IF we knew p and q, would be a simple constant. So the best estimator of that right hand side is just the mean. In fact if we assume the noise if additive normally distributed, we don't need to use any regression tool at all. Just use the mean.
RHS1 = mean(log(u) - log(x) + log(y)); % Best estimatar for: log(p)+4*log(q)
RHS2 = mean(log(u) - log(x) - log(y)); % Best estimatar for: log(p)+3*log(q)
Now we have two estimates. One for log(p)+4*log(q), and a second for log(p)+3*log(q)
That means we can find log(q) as
log(q) = (log(p)+4*log(q)) - (log(p)+3*log(q)) = RHS1 - RHS2
And therefore, the best estimate of p is just
q = exp(RHS1 - RHS2);
Next, we can recover p given that we know the value of q, because we have those same relations. Or, we can get p by taking the linear combination:
4*(log(p)+3*log(q)) = 3*(log(p)+4*log(q)) = log(p)
I'll use the latter approach.
p = exp(4*RHS2 - 3*RHS1);
Having said all of that, We could also have solved thos problem in another way. Consider the ratio:
u./v = q/y.^2
Do you see that p is gone? Only the constant q remains. And, because we used the ratio of those variables, now the proportional error again changes its structure. Still though, the ratio of two lognormal random variables is still lognormally distributed. But now we can estimate q using a regression. Even so, what matters is to understand the error striucture. I'd still probably go back to using the means as I describe above, s long as that is viable.
Again, I NEVER used MVNREGRESS at all. There is absolutely no need, since p and q enter into the model as constant terms there. Again though, the very first thing I would do is to look at the error structure.
See Also
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!