Force coefficients in multivariate and multiple regression

Question

Rogier Delporte on 8 Mar 2023

https://ch.mathworks.com/matlabcentral/answers/1925145-force-coefficients-in-multivariate-and-multiple-regression

Commented: Rogier Delporte on 9 Mar 2023

I have a dataset from physical experiments where I control 2 independent variables x and y and I measure 2 dependent variables u and v, and according to a physical model, they depend on parameters p and q as follows:

u = p*q^4*x/y

v = p*q^3*x*y

I make my life a lot easier by converting this to logarithmic scales:

log(u) = log(x) - log(y) + A with 10^A = p*q^4

log(v) = log(x) + log(y) + B with 10^B = p*q^3

And then I can use mvregress. The only problem is that mvregress doesn't allow me to force the slopes for x and y to a specific value right out of the box. I really want to force these slopes, which will have an impact on my confidence intervals for A and B. On top of that, I know I can back-calculate the standard error of A and B by dividing half of the confidence interval by the t-value (which depends on the confidence level and the degrees of freedom for error) and square it to get the variance, but I am unsure whether A and B will have a covariance and how I can calculate it. I need that covariance in order to calculate my confidence intervals for p and q.

2 Comments

Torsten on 8 Mar 2023

Edited: Torsten on 8 Mar 2023

Just a remark:

If you use "log", p*q^4 = e^A and p*q^3 = e^B. Otherwise, you will have to take log10 instead of log.
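A quick check of the base convention, as a minimal sketch. The value c is made up and only stands in for a product such as p*q^4:

```matlab
% MATLAB's log is the natural logarithm; log10 is the base-10 logarithm.
c = 1000;                 % hypothetical value standing in for p*q^4
A_e  = log(c);            % natural log: exp(A_e) recovers c
A_10 = log10(c);          % base-10 log: 10^A_10 recovers c

% The two intercepts differ only by the constant factor log(10).
fprintf('A_e = %.4f, A_10 = %.4f, A_10*log(10) = %.4f\n', A_e, A_10, A_10*log(10));
```

Either base works for the fit as long as the same base is used to transform the data and to back-transform the intercepts.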

And are you aware that the fitted coefficients are distorted by taking the log of your equations?

Rogier Delporte on 8 Mar 2023

Yeah, for me log implicitly means base 10, the base e log is ln for me. MATLAB is stupid in that regard, just like with array indexing starting at 1 instead of zero.

Answer 1

John D'Errico on 8 Mar 2023

Edited: John D'Errico on 8 Mar 2023

There has been much unsaid here that I think you don't understand. Given these models:

u = p*q^4*x/y

v = p*q^3*x*y

where x and y are known independent variables, and u and v are measured, you want to estimate p and q.

You have stated first that you logged the models. In doing so, you made an implicit assumption that you completely missed.

If u and v are measured data, then what is the error structure on the noise in that data? Is the noise additive, Normally distributed? That is, is the true model for these processes:

u + noise = p*q^4*x/y

v + noise = p*q^3*x*y

where the noise is additive gaussian noise? The problem is, when you log the data, it is NOT true that log(a+b) has any simple behavior.

Or, is the noise in u and v really more likely to be proportional noise? If it is, then most likely the noise in your model was something more like lognormally distributed. And that is the good news. It suggests that logging the model is really the proper way to estimate things, because it transforms proportional lognormal noise into additive gaussian noise.

So the very first thing I would suggest is to look at your data. We don't see it, so I cannot help you in that respect. If u and v vary by multiple orders of magnitude, but the noise seems to be roughly the same magnitude at all levels of u and v, that would suggest it is additive noise. But if the noise magnitude seems to scale with the size of u and v themselves, that suggests proportional noise, in which case logging the data is a good thing.

The point is, least squares techniques are designed to solve problems where the error structure is gaussian, normally distributed, with a homogeneous variance, so the same at all levels of your data. If that fails, then expect to see poor estimates of your parameters. (Do I need to give an example of what happens, and why logging a model can be right, or wrong? I can, if that would be useful.)
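To make the distinction concrete, here is a small synthetic illustration. It is only a sketch: the signal range, the 10% noise level and all variable names are assumptions for the demonstration, not values from this thread.

```matlab
% Synthetic illustration only: none of these numbers come from the thread.
rng(0);                                   % reproducible random numbers
truth = logspace(0, 3, 2000)';            % noise-free signal spanning 3 decades
sigma = 0.1;                              % assumed 10% proportional noise
meas  = truth .* exp(sigma*randn(size(truth)));   % lognormal (proportional) noise

resLin = meas - truth;                    % residuals on the linear scale
resLog = log(meas) - log(truth);          % residuals on the log scale

lo = truth < 10;                          % lowest decade
hi = truth > 100;                         % highest decade
fprintf('linear-scale spread: %.3g (low) vs %.3g (high)\n', std(resLin(lo)), std(resLin(hi)));
fprintf('log-scale spread:    %.3g (low) vs %.3g (high)\n', std(resLog(lo)), std(resLog(hi)));
```

With proportional noise the linear-scale residuals grow with the signal, while the log-scale residuals have roughly the same spread everywhere, which is the homogeneous variance that least squares wants. With additive noise the picture would be reversed.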

Now. On to the problem at hand. IF you would log those models to estimate p and q...

log(u) = log(p) + 4*log(q) + log(x) - log(y)

log(v) = log(p) + 3*log(q) + log(x) + log(y)

What is known here? We know u,v,x,y. So move EVERYTHING THAT IS KNOWN to one side. We can now write the problem as:

log(u) - log(x) + log(y) = log(p) + 4*log(q) + gaussian noise
log(v) - log(x) - log(y) = log(p) + 3*log(q) + gaussian noise

This reduces to two simple models. The right hand side, IF we knew p and q, would be a simple constant. So the best estimator of that right hand side is just the mean. In fact, if we assume the noise is additive and normally distributed, we don't need to use any regression tool at all. Just use the mean.

RHS1 = mean(log(u) - log(x) + log(y)); % Best estimator for: log(p)+4*log(q)
RHS2 = mean(log(v) - log(x) - log(y)); % Best estimator for: log(p)+3*log(q)

Now we have two estimates. One for log(p)+4*log(q), and a second for log(p)+3*log(q)

That means we can find log(q) as

log(q) = (log(p)+4*log(q)) - (log(p)+3*log(q)) = RHS1 - RHS2

And therefore, the best estimate of q is just

q = exp(RHS1 - RHS2);

Next, we can recover p given that we know the value of q, because we have those same relations. Or, we can get p by taking the linear combination:

4*(log(p)+3*log(q)) - 3*(log(p)+4*log(q)) = log(p)

I'll use the latter approach.

p = exp(4*RHS2 - 3*RHS1);
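The recipe can be tried on synthetic data. This is a minimal sketch: pTrue, qTrue, the ranges of x and y and the 5% proportional noise are all made up for the test, and the second mean uses v, as the model for v requires.

```matlab
% Synthetic check of the mean-based estimates (all numbers are hypothetical).
rng(1);
n     = 200;
pTrue = 2.5;
qTrue = 1.7;
x = 1 + 9*rand(n,1);                      % controlled variable x
y = 0.5 + 2*rand(n,1);                    % controlled variable y
u = pTrue*qTrue^4 .* x ./ y .* exp(0.05*randn(n,1));   % proportional noise on u
v = pTrue*qTrue^3 .* x .* y .* exp(0.05*randn(n,1));   % proportional noise on v

RHS1 = mean(log(u) - log(x) + log(y));    % estimates log(p) + 4*log(q)
RHS2 = mean(log(v) - log(x) - log(y));    % estimates log(p) + 3*log(q)

qHat = exp(RHS1 - RHS2);                  % log(q) = RHS1 - RHS2
pHat = exp(4*RHS2 - 3*RHS1);              % log(p) = 4*RHS2 - 3*RHS1
fprintf('q: true %.3f, estimated %.3f\n', qTrue, qHat);
fprintf('p: true %.3f, estimated %.3f\n', pTrue, pHat);
```

The estimate of p is noticeably noisier than that of q, because the combination 4*RHS2 - 3*RHS1 multiplies the noise in both means.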

Having said all of that, we could also have solved this problem in another way. Consider the ratio:

u./v = q/y.^2

Do you see that p is gone? Only the constant q remains. And, because we used the ratio of those variables, now the proportional error again changes its structure. Still though, the ratio of two lognormal random variables is still lognormally distributed. But now we can estimate q using a regression. Even so, what matters is to understand the error structure. I'd still probably go back to using the means as I describe above, as long as that is viable.
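A short sketch of the ratio idea, again on made-up data (pTrue, qTrue and the noise level are assumptions). Each value of (u./v).*y.^2 is a noisy copy of q, and with lognormal noise the natural summary is the geometric mean:

```matlab
% Ratio-based estimate of q on synthetic data (all numbers are hypothetical).
rng(2);
n     = 200;
pTrue = 2.5;
qTrue = 1.7;
x = 1 + 9*rand(n,1);
y = 0.5 + 2*rand(n,1);
u = pTrue*qTrue^4 .* x ./ y .* exp(0.05*randn(n,1));
v = pTrue*qTrue^3 .* x .* y .* exp(0.05*randn(n,1));

qSamples = (u ./ v) .* y.^2;              % u./v = q./y.^2, so each entry estimates q
qRatio   = exp(mean(log(qSamples)));      % geometric mean of the samples

% Same quantity obtained as the difference of the two means.
qMeans = exp(mean(log(u) - log(x) + log(y)) - mean(log(v) - log(x) - log(y)));
fprintf('ratio estimate %.6f, difference-of-means estimate %.6f\n', qRatio, qMeans);
```

On the log scale the two routes are algebraically the same, since the log(x) terms cancel in the difference, so the ratio form mainly helps to see that p drops out.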

Again, I NEVER used MVREGRESS at all. There is absolutely no need, since p and q enter into the model as constant terms there. Again though, the very first thing I would do is to look at the error structure.

4 Comments

Rogier Delporte on 8 Mar 2023

Yes, my noise does seem to be proportional and not additive. When I do two univariate nonlinear regressions, on u vs x and y and on v vs x and y, there is very clear heteroscedasticity visible in my residuals. The additive noise in my experiments can be kept at a fairly low level with filtering at the very start of the signal processing chain. The variability seems to come from variability in my parameters p and q.

The point is: what about the variances and the confidence intervals/prediction bounds (yes, I know the difference between the two)? It's dead easy to find values for p and q, I just want to know how well I know these values, because for the further processing of my data (other variables than u and v), I have to compare multiple models against one another. If a certain model introduces a new parameter k that I expect to be constant for x and y, I need to know how much variation I can blame on variability in p and q in order to have any idea whether or not k is actually a constant, because if it isn't, my model is flawed and I can eliminate it as a candidate model.

John D'Errico on 8 Mar 2023

You say it is dead easy to estimate p and q, but you did not seem to know how to do that. I'm a little confused. Anyway, IF all you need at this point is to know the uncertainty on p and q, that too is rather simple.

Compute, as I showed you:

RHS1 = mean(log(u) - log(x) + log(y)); % Best estimator for: log(p)+4*log(q)
RHS2 = mean(log(v) - log(x) - log(y)); % Best estimator for: log(p)+3*log(q)

Now, we presume these to be normally distributed. Just compute the variances!

Var1 = var(log(u) - log(x) + log(y)); % The variance of log(p)+4*log(q)
Var2 = var(log(v) - log(x) - log(y)); % The variance of log(p)+3*log(q)

So now, what is the VARIANCE of log(q)? Again, simple. If log(q) is the difference of the two means, then the variance of log(q) is just the sum of those variances. Easy, peasy.

varlogq = Var1 + Var2;

What does this tell us about the variance of q itself? That is more difficult, since exponentiating that will create a lognormal distribution. That is, we see log(q) as having mean = RHS1-RHS2, and we just computed the variance. So now you might compute a 95% confidence interval from the lognormal CDF, based on that mean and variance. At least, this is the easy solution. It would depend on how much data you have. Because really to be statistically appropriate, we need to appreciate why the degrees of freedom come into this problem in confidence intervals. In the end, if you have more than 25 data points or so, I would just use logncdf directly to compute the confidence intervals, since at that point, the t-statistic is not that different from a Normal distribution. But if you have only 8 or 10 data points, or even less, then it will be important to worry about the difference.
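How much the sample size matters can be checked directly by comparing the two-sided 95% critical values of the t and Normal distributions. A minimal sketch, assuming the Statistics and Machine Learning Toolbox is available (tinv and norminv) and taking n - 1 degrees of freedom for a plain mean:

```matlab
% Compare t and Normal critical values for a two-sided 95% interval.
alpha = 0.05;
zCrit = norminv(1 - alpha/2);             % Normal critical value, about 1.96
for n = [8 10 25 100]
    tCrit = tinv(1 - alpha/2, n - 1);     % t critical value with n-1 degrees of freedom
    fprintf('n = %3d: t = %.3f, normal = %.3f, ratio = %.3f\n', n, tCrit, zCrit, tCrit/zCrit);
end
```

At 25 points the t-based interval is only about 5% wider than the Normal one, while at 8 points it is roughly 20% wider.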

We can do the same computations for p. Thus

varlogp = 9*Var1 + 16*Var2; % log(p) = 4*RHS2 - 3*RHS1, so the coefficients enter squared

Note that p will have a wider confidence interval than q. Such is life. It depends on your model. And of course, p and q are not independent of each other, but that dependence is often ignored when computing confidence intervals.

Torsten on 8 Mar 2023

Edited: Torsten on 8 Mar 2023

var(X+Y) = var(X)+var(Y)+2*cov(X,Y)

And I don't think that

log(u) - log(x) + log(y)

and

log(v) - log(x) - log(y)

are independent.
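One way to carry the covariance through is to propagate the full 2-by-2 covariance matrix of the two means. The sketch below is hedged in several ways: the data are synthetic with a deliberately shared noise term, the observations are assumed independent from one experiment to the next, the matrix is divided by n to describe the means rather than single observations, and using n - 1 degrees of freedom for the t-value is an approximation. tinv needs the Statistics and Machine Learning Toolbox.

```matlab
% Propagating the covariance of the two means to log(p) and log(q).
rng(3);
n = 30;  pTrue = 2.5;  qTrue = 1.7;       % hypothetical values
x = 1 + 9*rand(n,1);
y = 0.5 + 2*rand(n,1);
shared = 0.05*randn(n,1);                 % assumed common noise, correlates u and v
u = pTrue*qTrue^4 .* x ./ y .* exp(shared + 0.03*randn(n,1));
v = pTrue*qTrue^3 .* x .* y .* exp(shared + 0.03*randn(n,1));

z1 = log(u) - log(x) + log(y);            % samples of log(p) + 4*log(q)
z2 = log(v) - log(x) - log(y);            % samples of log(p) + 3*log(q)
S  = cov([z1 z2]) / n;                    % covariance matrix of the two means

J   = [-3 4; 1 -1];                       % log(p) = -3*m1 + 4*m2, log(q) = m1 - m2
est = J * [mean(z1); mean(z2)];           % [log(p); log(q)]
C   = J * S * J';                         % covariance of [log(p); log(q)]

tCrit = tinv(0.975, n - 1);               % approximate 95% two-sided t-value
half  = tCrit * sqrt(diag(C));
ci    = exp([est - half, est + half]);    % rows: p, q; columns: lower, upper
rho   = C(1,2) / sqrt(C(1,1)*C(2,2));     % correlation between log(p) and log(q)
fprintf('p in [%.3f, %.3f], q in [%.3f, %.3f], corr = %.2f\n', ci(1,:), ci(2,:), rho);
```

The diagonal of C reproduces var(X+Y) = var(X) + var(Y) + 2*cov(X,Y) with the appropriate coefficients, for example C(2,2) = S(1,1) + S(2,2) - 2*S(1,2) for log(q).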

Rogier Delporte on 9 Mar 2023

John D'Errico :

1) Clarification of my problem:

It's not that I didn't know how to do what you've explained, it's that I am unsure whether it is the correct way of interpreting statistics or not.

My confusion originates from the following, the fit of:

v/y = p*q^3*x

Yields a worse R² value and visually more spread than:

v = p*q^3*x*y

And that holds on lin-lin scales as well as on log-log scales. My thinking is that you need to keep the independent variables on the RHS and the dependent variables on the LHS, because least squares regression, whether linear or non-linear, assumes no error on the independent variables and all error on the dependent variables. In my experiments, there can be no fluctuation of x during the experiments, because that would violate conservation of mass; there can only be inaccuracy in the values, because the balance I weighed on wasn't set up on a sufficiently damped surface. I'd be very amazed if y were to fluctuate during my experiments, but in theory it is possible. There is, however, some inaccuracy in y as well.

My dependent variables u and v, however, are subject to amplifier noise, which is additive. u is the initial value of a time transient, v is the integral of the whole transient. There is a third independent variable z, which determines the gain setting of the amplifier and hence the amount of additive noise.

For large values of z, v kind of saturates, so all the v values are taken under the same conditions of amplifier noise. The inaccuracy of v stems from the fact that I need to guesstimate when the transient ends, because there is a second transient which is orders of magnitude smaller in initial value and decay time. The criterion for the upper integration boundary for v scales with y^2/(q*z), which is the reason my residuals on log-log scale show heteroscedasticity.

u is actually not the initial value of the time transient, but rather a fitted proportionality of said initial value i with z. There, the noise level for i depends on the gain setting and hence on z, but as I don't have a proper model for the variance to properly weight my points, I just exclude the points where i is too close to the noise floor. The actual inaccuracy for u stems from the relationship of how fast my system decays (which depends nonlinearly on z and p and it depends linearly on q) to how fast my amplifier settles. I actually calculate i by averaging over a couple of time points, in order to reduce the influence of additive noise.

I have already given up on properly calculating the exact amount of gaussian noise for u and v, because it is far too complex. During acquisition a running average filter was applied to economize on storage, and the standard deviation at different points in time was not saved, so I am pretty sure it is impossible to correctly calculate the amount of additive noise on u and v.

That's why I am just approximating the additive noise on u and v to be zero and I'm doing a first order assumption of proportional noise on p and q. There is time dependent noise on z and also a certain amount of inaccuracy, but both are orders of magnitude better than the inaccuracy of x and y, which I can only really guess at.

I've already tried the approach of:

4*log(v) - 3*log(u) = log(x) + 7*log(y) + log(p)

log(v) - log(u) = 2*log(y) - log(q)

But I have a suspicion that in this case, the influence of inaccuracies in y becomes non-negligible.

2) Correct propagation of variances:

The problem I have with your approach, is that you don't take the covariance of u and v into account. The dead easy part is getting values for p and q, the hard part is the best way of getting accurate estimates of their approximate confidence intervals. I guess the approach with the y^7 is the worst of the two, as it amplifies the errors in y (which I approximate to be zero, as I don't have a model for estimating them) to the seventh power.
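The amplification can be put into numbers with a noise-free sketch. The parameter values and the 1% error in the recorded y are assumptions chosen for illustration. From the model, 4*log(v) - 3*log(u) = log(p) + log(x) + 7*log(y) and log(u) - log(v) = log(q) - 2*log(y).

```matlab
% Effect of a small error in the recorded y on the recovered p and q.
p = 2.5;  q = 1.7;  x = 3;  y = 1.2;      % hypothetical true values
u = p*q^4*x/y;                            % noise-free model value of u
v = p*q^3*x*y;                            % noise-free model value of v

relErr = 0.01;                            % assumed 1% error in the recorded y
yBad   = y*(1 + relErr);

pEst = exp(4*log(v) - 3*log(u) - log(x) - 7*log(yBad));   % y enters to the power -7
qEst = exp(log(u) - log(v) + 2*log(yBad));                % y enters to the power +2
fprintf('p off by %.2f%%, q off by %.2f%%\n', 100*(pEst/p - 1), 100*(qEst/q - 1));
```

A 1% error in y shifts p by roughly 7% and q by roughly 2%. The same exponents appear when the two means are combined, since 4*RHS2 - 3*RHS1 also contains -7*log(y), so the sensitivity to y comes from the model rather than from the particular rearrangement.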
