it.unimi.dsi.mg4j.search.score
Class BM25Scorer

java.lang.Object
  extended by it.unimi.dsi.fastutil.ints.AbstractIntIterator
      extended by it.unimi.dsi.mg4j.search.score.AbstractScorer
          extended by it.unimi.dsi.mg4j.search.score.AbstractIndexScorer
              extended by it.unimi.dsi.mg4j.search.score.AbstractWeightedScorer
                  extended by it.unimi.dsi.mg4j.search.score.BM25Scorer
All Implemented Interfaces:
IntIterator, FlyweightPrototype<Scorer>, DelegatingScorer, Scorer, Iterator<Integer>

public class BM25Scorer
extends AbstractWeightedScorer
implements DelegatingScorer

A scorer that implements the BM25 ranking scheme.

BM25 is the name of a ranking scheme for text derived from the probabilistic model. The essential feature of the scheme is that of assigning to each term appearing in a given document a weight depending both on the count (the number of occurrences of the term in the document), on the frequency (the number of the documents in which the term appears) and on the document length (in words). It was devised in the early nineties, and it provides a significant improvement over the classical TF/IDF scheme. Karen Spärck Jones, Steve Walker and Stephen E. Robertson give a full account of BM25 and of the probabilistic model in “A probabilistic model of information retrieval: development and comparative experiments”, Inf. Process. Management 36(6):779−840, 2000.

There are a number of incarnations with small variations of the formula itself. Here, the weight assigned to a term which appears in f documents out of a collection of N documents w.r.t. to a document of length l in which the term appears c times is

log( (Nf + 1/2) / (f + 1/2) ) ( k1 + 1 ) c    ( c + k1 ((1 − b) + bl / L) ),
where L is the average document length, and k1 and b are parameters that default to DEFAULT_K1 and DEFAULT_B: these values were chosen following the suggestions given in “Efficiency vs. effectiveness in Terabyte-scale information retrieval”, by Stefan Büttcher and Charles L. A. Clarke, in Proceedings of the 14th Text REtrieval Conference (TREC 2005). Gaithersburg, USA, November 2005. The logarithmic part (a.k.a. idf (inverse document-frequency) part) is actually maximised with EPSILON_SCORE, so it is never negative (the net effect being that terms appearing in more than half of the documents have almost no weight).

This class uses a CounterCollectionVisitor and related classes (by means of DocumentIterator.acceptOnTruePaths(it.unimi.dsi.mg4j.search.visitor.DocumentIteratorVisitor)) to take into consideration only terms that are actually involved in query semantics for the current document.

Author:
Mauro Mereu, Sebastiano Vigna

Field Summary
 double b
          The parameter b.
static double DEFAULT_B
          The default value used for the parameter b.
static double DEFAULT_K1
          The default value used for the parameter k1.
static double EPSILON_SCORE
          The value of the document-frequency part for terms appearing in more than half of the documents.
 double k1
          The parameter k1.
 
Fields inherited from class it.unimi.dsi.mg4j.search.score.AbstractWeightedScorer
currWeight, index2Weight
 
Fields inherited from class it.unimi.dsi.mg4j.search.score.AbstractIndexScorer
currIndex, n
 
Fields inherited from class it.unimi.dsi.mg4j.search.score.AbstractScorer
documentIterator
 
Constructor Summary
BM25Scorer()
          Creates a BM25 scorer using DEFAULT_K1 and DEFAULT_B as parameters.
BM25Scorer(double k1, double b)
          Creates a BM25 scorer using specified k1 and b parameters.
BM25Scorer(String k1, String b)
          Creates a BM25 scorer using specified k1 and b parameters specified by strings.
 
Method Summary
 BM25Scorer copy()
           
 double score()
          Computes a score by calling Scorer.score(Index) for each index in the current index map, and summing the weighted results.
 double score(Index index)
          Returns a score for the current document of the last document iterator given to Scorer.wrap(DocumentIterator), but considering only a given index (optional operation).
 boolean usesIntervals()
          Whether this scorer uses intervals.
 void wrap(DocumentIterator d)
          Wraps the given document iterator.
 
Methods inherited from class it.unimi.dsi.mg4j.search.score.AbstractWeightedScorer
getWeights, setWeights
 
Methods inherited from class it.unimi.dsi.mg4j.search.score.AbstractScorer
hasNext, nextDocument, nextInt, skip
 
Methods inherited from class it.unimi.dsi.fastutil.ints.AbstractIntIterator
next, remove
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface it.unimi.dsi.mg4j.search.score.Scorer
getWeights, nextDocument, nextInt, setWeights
 
Methods inherited from interface it.unimi.dsi.fastutil.ints.IntIterator
skip
 
Methods inherited from interface java.util.Iterator
hasNext, next, remove
 

Field Detail

DEFAULT_K1

public static final double DEFAULT_K1
The default value used for the parameter k1.

See Also:
Constant Field Values

DEFAULT_B

public static final double DEFAULT_B
The default value used for the parameter b.

See Also:
Constant Field Values

EPSILON_SCORE

public static final double EPSILON_SCORE
The value of the document-frequency part for terms appearing in more than half of the documents.

See Also:
Constant Field Values

k1

public final double k1
The parameter k1.


b

public final double b
The parameter b.

Constructor Detail

BM25Scorer

public BM25Scorer()
Creates a BM25 scorer using DEFAULT_K1 and DEFAULT_B as parameters.


BM25Scorer

public BM25Scorer(double k1,
                  double b)
Creates a BM25 scorer using specified k1 and b parameters.

Parameters:
k1 - the k1 parameter.
b - the b parameter.

BM25Scorer

public BM25Scorer(String k1,
                  String b)
Creates a BM25 scorer using specified k1 and b parameters specified by strings.

Parameters:
k1 - the k1 parameter.
b - the b parameter.
Method Detail

copy

public BM25Scorer copy()
Specified by:
copy in interface FlyweightPrototype<Scorer>
Specified by:
copy in interface Scorer

score

public double score()
             throws IOException
Description copied from class: AbstractWeightedScorer
Computes a score by calling Scorer.score(Index) for each index in the current index map, and summing the weighted results.

Specified by:
score in interface Scorer
Overrides:
score in class AbstractWeightedScorer
Returns:
the combined weighted score.
Throws:
IOException

score

public double score(Index index)
Description copied from interface: Scorer
Returns a score for the current document of the last document iterator given to Scorer.wrap(DocumentIterator), but considering only a given index (optional operation).

Specified by:
score in interface Scorer
Parameters:
index - the only index to be considered.
Returns:
the score.

wrap

public void wrap(DocumentIterator d)
          throws IOException
Description copied from class: AbstractWeightedScorer
Wraps the given document iterator.

Besides the services provided by AbstractIndexScorer.wrap(DocumentIterator), this method sets up AbstractWeightedScorer.currWeight.

Specified by:
wrap in interface Scorer
Overrides:
wrap in class AbstractWeightedScorer
Parameters:
d - the document iterator that will be used in subsequent calls to AbstractWeightedScorer.score() and Scorer.score(Index).
Throws:
IOException

usesIntervals

public boolean usesIntervals()
Description copied from interface: Scorer
Whether this scorer uses intervals.

This method is essential when aggregating scorers, because if several scores need intervals, a CachingDocumentIterator will be necessary.

Specified by:
usesIntervals in interface Scorer
Returns:
true if this scorer uses intervals.