Force coefficients in multivariate and multiple regression

28 views (last 30 days)
I have a dataset from physical experiments where I control 2 independent variables x and y and I measure 2 dependent variables u and v, and according to a physical model, they depend on parameters p and q as follows:
u = p*q^4*x/y
v = p*q^3*x*y
I make my life a lot easier by converting this to logarithmic scales:
log(u) = log(x) - log(y) + A with 10^A = p*q^4
log(v) = log(x) + log(y) + B with 10^B = p*q^3
And then I can use mvregress, the only problem is: mvregress doens't allow me to force the slopes for x and y to a specific value right out of the box. I really want to force these slopes, which will have an impact on my confidence intervals for A and B. And on top of that: I know I can back calculate the variance of A and B by dividing half of the confidence interval by the t-value (which depends on the confidence level and the degrees of freedom for error), but I am unsure whether A and B will have a covariance and how I can calculate that, because I need that covariance in order to calculate my confidence intervals for p and q.
Torsten on 8 Mar 2023
Edited: Torsten on 8 Mar 2023
Just a remark:
If you use "log", p*q^4 = e^A and p*q^3 = e^B. Otherwise, you will have to take log10 instead of log.
And you are aware that the fitting coefficients are distorted by taking the log of your equations ?
Rogier Delporte
Rogier Delporte on 8 Mar 2023
Yeah, for me log implicitly means base 10, the base e log is ln for me. MATLAB is stupid in that regard, just like with array indexing starting at 1 in stead of zero.

Sign in to comment.

Answers (1)

John D'Errico
John D'Errico on 8 Mar 2023
Edited: John D'Errico on 8 Mar 2023
There has been much unsaid here that I think you don't understand. Given these models:
u = p*q^4*x/y
v = p*q^3*x*y
where x and y are known independent variables, and u and v are measured, you want to estimate p and q.
You have stated first that you logged the models. In doing so, you made an implicit assumption that you completely missed.
If u and v are measured data, then what is the error structure on the noise in that data? Is the noise additive, Normally distributed? That is, is the true model for these processes:
u + noise = p*q^4*x/y
v + noise = p*q^3*x*y
where the noise is additive gaussian noise? The problem is, when you log the data, it is NOT true that log(a+b) has any simple behavior.
Or, is the noise in u and v really more likely to be proportional noise? If it is, then most likely the noise in your model was something more like lognormally distributed. And that is the good news. It suggests that logging the model is really the proper way to estimate things, because it transforms proportional lognormal noise into additive gaussian noise.
So the very first thing I would suggest is to look at your data. We don't see it, so I cannot help you in that respect. If u and v vary by multiple orders of magnitude, but the noise seems to be roughly the same magitude at all levels of u and v, that would suggest it is additive noise. But if the noise magnitude seems to scale with the size of u and v themselves, that suggests proportional noise, in whic hcase logging the data is a good thing.
The point is, least squares techniques are designed to solve problems where the error structure is gaussian, normally distributed, with a homogeneous variance, so the same at all levels of your data. If that fails, then expect to see poor estimates of your parameters. (Do I need to give an example of what happens, and why logging a model can be right, or wrong? I can, if that would be useful.)
Now. On to the problem at hand. IF you would log those models to estimate p and q...
log(u) = log(p) + 4*log(q) + log(x) - log(y)
log(v) = log(p) + 3*log(q) + log(x) + log(y)
What is known here? We know u,v,x,y. So move EVERYTHING THAT IS KNOWN to one side. We can now write the problem as:
log(u) - log(x) + log(y) = log(p) + 4*log(q) + gaussian noise
log(u) - log(x) - log(y) = log(p) + 3*log(q) + gaussian noise
This reduces to two simple models. The right hand side, IF we knew p and q, would be a simple constant. So the best estimator of that right hand side is just the mean. In fact if we assume the noise if additive normally distributed, we don't need to use any regression tool at all. Just use the mean.
RHS1 = mean(log(u) - log(x) + log(y)); % Best estimatar for: log(p)+4*log(q)
RHS2 = mean(log(u) - log(x) - log(y)); % Best estimatar for: log(p)+3*log(q)
Now we have two estimates. One for log(p)+4*log(q), and a second for log(p)+3*log(q)
That means we can find log(q) as
log(q) = (log(p)+4*log(q)) - (log(p)+3*log(q)) = RHS1 - RHS2
And therefore, the best estimate of p is just
q = exp(RHS1 - RHS2);
Next, we can recover p given that we know the value of q, because we have those same relations. Or, we can get p by taking the linear combination:
4*(log(p)+3*log(q)) = 3*(log(p)+4*log(q)) = log(p)
I'll use the latter approach.
p = exp(4*RHS2 - 3*RHS1);
Having said all of that, We could also have solved thos problem in another way. Consider the ratio:
u./v = q/y.^2
Do you see that p is gone? Only the constant q remains. And, because we used the ratio of those variables, now the proportional error again changes its structure. Still though, the ratio of two lognormal random variables is still lognormally distributed. But now we can estimate q using a regression. Even so, what matters is to understand the error striucture. I'd still probably go back to using the means as I describe above, s long as that is viable.
Again, I NEVER used MVNREGRESS at all. There is absolutely no need, since p and q enter into the model as constant terms there. Again though, the very first thing I would do is to look at the error structure.
Torsten on 8 Mar 2023
Edited: Torsten on 8 Mar 2023
var(X+Y) = var(X)+var(Y)+2*cov(X,Y)
And I don't think that
log(u) - log(x) + log(y)
log(v) - log(x) - log(y)
are independent.
Rogier Delporte
Rogier Delporte on 9 Mar 2023
1) Clarification of my problem:
It's not that I didn't know how to do what you've explained, it's that I am unsure whether it is the correct way of interpreting statistics or not.
My confusion originates from the following, the fit of:
v/y = p*q^3*x
Yields a worse R² value and visually more spread than:
v = p*q^3*x*y
And that holds both on lin-lin scales as it does on log-log scales. My thinking is that you need to keep the independent variables on the RHS and the dependent variables on the LHS, because least squares regression — whether it be linear or non-linear — assumes no error on the dependent variables and all error on the independent variables. In my experiments, there can be no fluctuation of x during the experiments, because that would violate conservation of mass, there can only be inaccuracy of the values because the balance I weighed on wasn't set up on a sufficiently damped surface. I'd be very amazed if y were to fluctuate during my experiments, but in theory it is possible. There is, however, some inaccuracy in y as well.
My dependent variables u and v, however, are subject to amplifier noise, which is additive. u is the initial value of a time transient, v it the integration of the whole transient. There is a third independent variable z, which determines the gain setting of the amplifier and hence the amount of additive noise.
For large values of z, v kind of saturates, so all the v values are taken under the same conditions of amplifier noise. The inaccuracy of v stems from the fact that I need to guestimate when the transient ends, because there is a second transient which is orders of magnitude smaller in initial value and decay time. The criterion for the upper integration boundary for v scales with y^2/(q*z), which is the reason my residuals on log-log scale show heteroscedascity.
u is actually not the initial value of the time transient, but rather a fitted proportionality of said initial value i with z. There, the noise level for i depends on the gain setting and hence on z, but as I don't have a proper model for the variance to properly weight my points, I just exclude the points where i is too close to the noise floor. The actual inaccuracy for u stems from the relationship of how fast my system decays (which depends nonlinearly on z and p and it depends linearly on q) to how fast my amplifier settles. I actually calculate i by averaging over a couple of time points, in order to reduce the influence of additive noise.
I have already given up on properly calculating the exact amount of gaussian noise for u and v, because it is far too complex and during acquisition there was a running average filter applied to economize on storage and the standard deviation at different points in time was not saved, so i am pretty sure it is impossible to correctly calculate the amount of additive noise on u and v.
That's why I am just approximating the additive noise on u and v to be zero and I'm doing a first order assumption of proportional noise on p and q. There is time dependent noise on z and also a certain amount of inaccuracy, but both are orders of magnitude better than the inaccuracy of x and y, which I can only really guess at.
I've already tried the approach of:
4*log(u) - 3*log(v) = log(x) + 7*log(y) + log(p)
log(u) - log(v) = 2*log(y) - log(q)
But I have a suspicion that in this case, the influence of inaccuracies on y become non-neglible.
2) Correct propagation of variances:
The problem I have with your approach, is that you don't take the covariance of u and v into account. The dead easy part is getting values for p and q, the hard part is the best way of getting accurate estimates of their approximate confidence intervals. I guess the approach with the y^7 is the worst of the two, as it amplifies the errors in y (which I approximate to be zero, as I don't have a model for estimating them) to the seventh power.

Sign in to comment.


Find more on Descriptive Statistics in Help Center and File Exchange




Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!