Such a thing as relationship sorting?
I have a random array of 10000 7-letter strings. I have a simple algorithm that assesses how similar each element is to all others. For example, SESTINA and SESTINE would have a high similarity, but ZYZZYVA and NEROLIS would have low similarity. The result is a 10000x10000 symmetric matrix.
I want to group/sort the strings based on similarity, where proximity of similar strings is important and order is irrelevant. The higher the similarity of two strings, the more important it is for them to be together. I've imagined some sort of least squares technique?
Anyone have a clue for me?
EDIT: The end goal is a 1D array; I am not looking for a graphical solution.
8 Comments
Matt Fig
on 16 Nov 2012
How do you quantify similarity? In order to proceed we have to know what we are working with.
curran
on 17 Nov 2012
curran
on 17 Nov 2012
Matt J
on 17 Nov 2012
How would you order this?
AA AB AC BA BB BC CA CB CC
Matt Fig
on 17 Nov 2012
curran, you still have not given us any clue as to what the similarity matrix looks like. You showed the initial list of strings and the final list, but the crucial thing we need in order to help you is to see what the similarity matrix looks like.
So, given your initial and final strings posted above, what did the similarity matrix look like?
curran
on 18 Nov 2012
Matt J, your sample array has a very high degree of similarity, which makes it difficult to calculate manually, but since I've ranked anagrams as more similar than substitutions, I would hazard: AA AB BA CA CC AC BC CB BB
I would have expected AA BA BB AB AC CA CB BC CC. If I assume anagrams have zero distance between them and substitutions have a distance given by their alphabetic separation, the above sequence has a length of 6 whereas yours has a length of 9.
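For anyone who wants to check such path lengths by machine, here is a minimal sketch. It assumes one particular reading of the distance described above: compare the sorted letters of two words, so anagrams score 0 and a single substitution scores its alphabetic separation. The names pairDist, seq and pathLen are made up for illustration and are not from the original algorithm.

```matlab
% Hypothetical distance (an assumption, not the poster's actual measure):
% sort the letters of both words, then sum the letter-by-letter separations.
% Anagrams give 0; a single substitution gives its alphabetic separation.
pairDist = @(s, t) sum(abs(double(sort(s)) - double(sort(t))));

% The ordering to evaluate, one word per row
seq = ['AA';'BA';'BB';'AB';'AC';'CA';'CB';'BC';'CC'];

% Length of an ordering = sum of distances between neighbouring words
pathLen = 0;
for k = 1:size(seq,1)-1
    pathLen = pathLen + pairDist(seq(k,:), seq(k+1,:));
end
disp(pathLen)   % 6 for this sequence
```

Swapping a different ordering into seq gives its length under the same assumed distance, which makes it easy to compare candidate orderings.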
curran
on 18 Nov 2012
Accepted Answer
More Answers (1)
Another thing that might help is to try to optimize the way you initialize your current algorithm. If anagrams have lower separation distances than substitutions, for example, I would expect that doing a dictionary sort, but one which groups anagrams together, would provide a pretty good initial guess, e.g.,
X = ['AA';'AB';'AC';'BA';'BB';'BC';'CA';'CB';'CC'];
% Sort the letters within each word so anagrams share a key, then sort by key
[~,idx] = sortrows(sort(X,2));
Xinitial = X(idx,:)
Xinitial =
AA
AB
BA
AC
CA
BB
BC
CB
CC
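If it helps to have something to compare that initial guess against, one simple heuristic (not taken from this thread) is a greedy nearest-neighbour chain: start at one word and repeatedly append the closest word not yet used. The sketch below assumes a distance matrix D built from sorted letters, so anagrams have distance 0; with a real similarity matrix you would substitute your own D, e.g. something like max(S(:)) - S. The variable names are hypothetical.

```matlab
X = ['AA';'AB';'AC';'BA';'BB';'BC';'CA';'CB';'CC'];
n = size(X,1);

% Assumed distance matrix: sum of separations between sorted letters
K = sort(double(X), 2);
D = zeros(n);
for i = 1:n
    for j = 1:n
        D(i,j) = sum(abs(K(i,:) - K(j,:)));
    end
end

% Greedy chain: always move to the nearest word not yet placed
order = zeros(1,n);
order(1) = 1;                 % arbitrary starting word
visited = false(1,n);
visited(1) = true;
for k = 2:n
    d = D(order(k-1), :);
    d(visited) = Inf;         % exclude words already in the chain
    [~, nxt] = min(d);        % ties go to the lowest index
    order(k) = nxt;
    visited(nxt) = true;
end
Xgreedy = X(order,:)
```

This costs O(n^2) time once D is available, so it is feasible for 10000 strings, but it is only a heuristic: the choice of starting word and the tie-breaking both affect the result, and it can leave poor matches at the end of the chain.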