Minimize difference between two populations.
5 views (last 30 days)
Show older comments
Hi, I have this problem: I have a dataset composed by 2 vectors, each composed by 100 elements. I need to find the "best fit" of 50 values for each vector in order to minimize the difference between them.
I think I should set up a minimization problem but i don't know how to achieve it in matlab...
0 Comments
Answers (2)
John D'Errico
on 2 Oct 2017
Edited: John D'Errico
on 2 Oct 2017
If I understand you correctly, you have two vectors, call them A and B.
N = 20;
A = randn(N,1);
B = randn(N.1);
Now, you want to extract subsets sA and sB of each vector, each of length J, SUCH THAT
norm(A(sA) - B(sB))
is minimized. YIKES!
This is a highly combinatorial problem. Finding a truly optimal solution will be nasty. If N is at all large, the problem will become highly computational.
For example, suppose that N=20 with J = 10. So you want to reduce each set by 50%. Effectively, this requires us to sort through nchoosek(20,10) possible subsets of A, but also nchoosek(20,10) subsets of B.
Worse, we are not just comparing those subsets in any arbitrary order. So we need to consider all permutations of one of those subsets. So the size of the solution space is...
nchoosek(N,J)^2*factorial(J)
N = 20;
J = 10;
nchoosek(N,J)^2*factorial(J)
ans =
1.23868287980237e+17
So the search space is immense and combinatorial. You cannot use a classical optimization tool. fmincon is out. It cannot be used for combinatorial problems. intlinprog is out, because the objective is not linear. I suppose you could formulate it in the form of a binary nonlinear programming problem. So optimize a binary vector that essentially extracts subsets of each of A and B. Even that fails though, because the order of the subsets are important.
Simple, but sub-optimal would be a greedy algorithm. At each iteration, just find the two closest points of the two sets. Pair them up, and then reduce A and B by those elements. Then repeat.
This would do the work to find the closest pair.
[IDX,D] = knnsearch(A,B);
For example, in the random set I generated...
[dmin,ind] = min(D)
dmin =
0.00148090843713233
ind =
7
[B(ind),A(IDX(ind))]
ans =
0.726885133383238 0.725404224946106
Now just find the next closest pair, after those elements have been removed from A and B. Repeat until you have chosen J elements.
The above is a totally greedy algorithm, that may be reasonable if J is quite small compared to N.
Good luck however, if J is large compared to N, especially if N is large.
2 Comments
John D'Errico
on 2 Oct 2017
I think it is fairly new. Some recent release introduced it, as part of the stats toolbox.
Jan
on 2 Oct 2017
What exactly is "a data set"? Do you mean a [2 x 100] matrix? What is "the difference" you want to minimize? Does the order of the elements matter? Do you mean the Euclidean distance between two subvectors of the size [1, 50]? If so, contemplate. Start with thinking. How many solutions are possible? Would a brute force attack solve the problem or is an exhaustive search impossible in the next billion of years?
There is a huuuge number of ways to choose 50 elements out of 100. So think again: Would an approximation be enough? Perhaps a genetic algorithm or another global optimization technique?
- Start with defining the problem exactly in a mathematical sense.
- Then think about the magnitude of the problem.
- Afterwards it might be important to know why you want to solve this: Is it a homework and you hear a lesson about global optimization or Monte Carlo methods? Did an evil professor ask you to solve this? Did this problem occur in your field of science, you asked your colleagues and they forwarded you to the Matlab forum?
1 Comment
See Also
Categories
Find more on Creating and Concatenating Matrices in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!