mmrScores

Document scoring with Maximal Marginal Relevance (MMR) algorithm

Since R2020a

Syntax

``scores = mmrScores(documents,queries)``
``scores = mmrScores(bag,queries)``
``scores = mmrScores(___,lambda)``

Description


`scores = mmrScores(documents,queries)` scores `documents` according to their relevance to `queries`, avoiding redundancy, using the MMR algorithm. The score in `scores(i,j)` is the MMR score of `documents(i)` relative to `queries(j)`.

`scores = mmrScores(bag,queries)` scores the documents encoded by the bag-of-words or bag-of-n-grams model `bag` relative to `queries`. The score in `scores(i,j)` is the MMR score of the `i`th document in `bag` relative to `queries(j)`.

`scores = mmrScores(___,lambda)` also specifies the trade-off between relevance and redundancy.
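The MMR criterion from the reference below balances relevance to the query against similarity to already-selected documents. As a sketch of the idea (notation follows [1]; `R` is the candidate document set, `S` the already-selected subset, `Q` the query, and `Sim` is a similarity measure, typically cosine similarity between tf-idf vectors):

```latex
\mathrm{MMR} = \arg\max_{D_i \in R \setminus S}
\left[
\lambda \, \mathrm{Sim}_1(D_i, Q)
\;-\;
(1-\lambda) \max_{D_j \in S} \mathrm{Sim}_2(D_i, D_j)
\right]
```

The `lambda` input corresponds to the weight λ above: values near 0 emphasize diversity, and a value of 1 ranks purely by query relevance.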

Examples


Create an array of input documents.

```str = [ "the quick brown fox jumped over the lazy dog" "the fast fox jumped over the lazy dog" "the dog sat there and did nothing" "the other animals sat there watching"]; documents = tokenizedDocument(str)```
```documents = 4x1 tokenizedDocument: 9 tokens: the quick brown fox jumped over the lazy dog 8 tokens: the fast fox jumped over the lazy dog 7 tokens: the dog sat there and did nothing 6 tokens: the other animals sat there watching ```

Create an array of query documents.

```str = [ "a brown fox leaped over the lazy dog" "another fox leaped over the dog"]; queries = tokenizedDocument(str)```
```queries = 2x1 tokenizedDocument: 8 tokens: a brown fox leaped over the lazy dog 6 tokens: another fox leaped over the dog ```

Calculate MMR scores using the `mmrScores` function. The output is a sparse matrix.

`scores = mmrScores(documents,queries);`

Visualize the MMR scores in a heat map.

```
figure
heatmap(scores);
xlabel("Query Document")
ylabel("Input Document")
title("MMR Scores")
```

Higher scores correspond to stronger relevance to the query documents.

Create an array of input documents.

```str = [ "the quick brown fox jumped over the lazy dog" "the quick brown fox jumped over the lazy dog" "the fast fox jumped over the lazy dog" "the dog sat there and did nothing" "the other animals sat there watching" "the other animals sat there watching"]; documents = tokenizedDocument(str);```

Create a bag-of-words model from the input documents.

`bag = bagOfWords(documents)`
```
bag = 
  bagOfWords with properties:

          Counts: [6x17 double]
      Vocabulary: ["the"    "quick"    "brown"    "fox"    "jumped"    "over"    "lazy"    "dog"    "fast"    "sat"    "there"    "and"    "did"    "nothing"    "other"    "animals"    "watching"]
        NumWords: 17
    NumDocuments: 6
```

Create an array of query documents.

```str = [ "a brown fox leaped over the lazy dog" "another fox leaped over the dog"]; queries = tokenizedDocument(str)```
```queries = 2x1 tokenizedDocument: 8 tokens: a brown fox leaped over the lazy dog 6 tokens: another fox leaped over the dog ```

Calculate the MMR scores. The output is a sparse matrix.

`scores = mmrScores(bag,queries);`

Visualize the MMR scores in a heat map.

```
figure
heatmap(scores);
xlabel("Query Document")
ylabel("Input Document")
title("MMR Scores")
```

Now calculate the scores again, and set the lambda value to 0.01. When the lambda value is close to 0, redundant documents yield lower scores and diverse (but less query-relevant) documents yield higher scores.

```
lambda = 0.01;
scores = mmrScores(bag,queries,lambda);
```

Visualize the MMR scores in a heat map.

```
figure
heatmap(scores);
xlabel("Query Document")
ylabel("Input Document")
title("MMR Scores, lambda = " + lambda)
```

Finally, calculate the scores again and set the lambda value to 1. When the lambda value is 1, query-relevant documents yield higher scores regardless of their redundancy with other documents.

```
lambda = 1;
scores = mmrScores(bag,queries,lambda);
```

Visualize the MMR scores in a heat map.

```
figure
heatmap(scores);
xlabel("Query Document")
ylabel("Input Document")
title("MMR Scores, lambda = " + lambda)
```

Input Arguments


`documents` — Input documents, specified as a `tokenizedDocument` array, a string array of words, or a cell array of character vectors. If `documents` is not a `tokenizedDocument` array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a `tokenizedDocument` array.

`bag` — Input bag-of-words or bag-of-n-grams model, specified as a `bagOfWords` object or a `bagOfNgrams` object. If `bag` is a `bagOfNgrams` object, then the function treats each n-gram as a single word.

`queries` — Set of query documents, specified as one of the following:

• A `tokenizedDocument` array

• A 1-by-N string array representing a single document, where each element is a word

• A 1-by-N cell array of character vectors representing a single document, where each element is a word

To compute term frequency and inverse document frequency statistics, the function encodes `queries` using a bag-of-words model. The model it uses depends on the syntax you call it with. If your syntax specifies the input argument `documents`, then the function uses `bagOfWords(documents)`. If your syntax specifies `bag`, then the function encodes `queries` using `bag` and then uses the resulting tf-idf matrix.
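The two encoding paths can be illustrated with the `bagOfWords` and `encode` functions. This is a sketch of the encoding step only, not the internal implementation of `mmrScores`; note that words absent from the model's vocabulary are dropped when queries are encoded against an existing bag.

```matlab
% Example documents and queries
documents = tokenizedDocument([
    "the quick brown fox jumped over the lazy dog"
    "the fast fox jumped over the lazy dog"]);
queries = tokenizedDocument("a brown fox leaped over the lazy dog");

% Syntax with documents: a model is built from the input documents.
bag = bagOfWords(documents);

% Syntax with bag: queries are encoded against the bag's vocabulary,
% so query words not in bag.Vocabulary (here "a" and "leaped") are ignored.
queryCounts = encode(bag,queries);   % sparse term-count matrix
```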

`lambda` — Trade-off between relevance and redundancy, specified as a nonnegative scalar.

When `lambda` is close to 0, redundant documents yield lower scores and diverse (but less query-relevant) documents yield higher scores. When `lambda` is 1, query-relevant documents yield higher scores regardless of their redundancy with other documents.

Data Types: `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`

Output Arguments


`scores` — MMR scores, returned as an N1-by-N2 matrix, where `scores(i,j)` is the MMR score of `documents(i)` relative to the `j`th query document, and N1 and N2 are the numbers of input and query documents, respectively.

A document has a high MMR score if it is both relevant to the query and has minimal similarity relative to the other documents.

References

[1] Carbonell, Jaime G., and Jade Goldstein. "The use of MMR, diversity-based reranking for reordering documents and producing summaries." In SIGIR, vol. 98, pp. 335-336. 1998.

Version History

Introduced in R2020a