Class RIDFTermPruningPolicy


  • public class RIDFTermPruningPolicy
    extends TermPruningPolicy
    Implementation of TermPruningPolicy that uses the "residual IDF" metric to determine which postings of a term to keep or remove, as defined in http://www.dc.fi.udc.es/~barreiro/publications/blanco_barreiro_ecir2007.pdf.

    Residual IDF measures the difference between the observed collection-wide IDF of a term and the IDF that would be expected from the term's total number of occurrences in all documents if those occurrences were distributed uniformly. Positive values indicate that a term is informative (e.g. a rare term); negative values indicate that a term is not informative (e.g. too popular to offer good selectivity).
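
The calculation described above can be sketched as follows. This is a minimal illustration, not the code of this class: it assumes a common Poisson-based formulation of residual IDF with natural logarithms, and the class name `ResidualIdfSketch` and its methods are hypothetical. It also assumes the term occurs at least once, so `docFreq` and `totalFreq` are both positive.

```java
public final class ResidualIdfSketch {

    /** Observed IDF from document frequency: -ln(docFreq / numDocs). */
    public static double observedIdf(int docFreq, int numDocs) {
        return -Math.log((double) docFreq / numDocs);
    }

    /**
     * IDF expected if totalFreq occurrences were spread at random over
     * numDocs documents (Poisson assumption): -ln(1 - e^(-totalFreq / numDocs)).
     */
    public static double expectedIdf(long totalFreq, int numDocs) {
        return -Math.log(1.0 - Math.exp(-(double) totalFreq / numDocs));
    }

    /** Residual IDF = observed IDF minus expected IDF. */
    public static double residualIdf(int docFreq, long totalFreq, int numDocs) {
        return observedIdf(docFreq, numDocs) - expectedIdf(totalFreq, numDocs);
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        // 20 occurrences concentrated in 10 documents: positive residual IDF.
        System.out.println(residualIdf(10, 20, numDocs));
        // 20 occurrences spread over 20 documents: slightly negative residual IDF.
        System.out.println(residualIdf(20, 20, numDocs));
    }
}
```

A term whose occurrences cluster in fewer documents than chance would predict gets a positive score, which matches the "informative" reading in the description.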

    This metric produces small values, roughly within [-1, 1], so useful thresholds lie somewhere in [0, 1]. The higher the threshold, the more informative (and rarer) the retained terms will be. To filter out common words, a value close to or slightly below 0 (e.g. -0.1) should be a good starting point.
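
As a rough illustration of how a threshold might be applied, the sketch below looks up a per-field threshold with a fallback default and compares a residual IDF value against it. The class `RidfThresholdSketch`, both method names, and the rule "prune when the value is at or below the threshold" are assumptions made for this example; the exact comparison and the key format of the thresholds map used by this class are not stated here.

```java
import java.util.HashMap;
import java.util.Map;

public final class RidfThresholdSketch {

    /** Returns the threshold registered for the field, or the default if none is set. */
    public static double thresholdFor(String field,
                                      Map<String, Double> thresholds,
                                      double defThreshold) {
        Double t = thresholds.get(field);
        return t != null ? t : defThreshold;
    }

    /** Assumed rule: a term is pruned when its residual IDF is at or below the threshold. */
    public static boolean shouldPrune(double ridf, double threshold) {
        return ridf <= threshold;
    }

    public static void main(String[] args) {
        Map<String, Double> thresholds = new HashMap<>();
        thresholds.put("body", 0.3);   // stricter: keep only clearly informative terms
        double defThreshold = -0.1;    // lenient default: drop only very common terms

        double bodyThreshold = thresholdFor("body", thresholds, defThreshold);
        double titleThreshold = thresholdFor("title", thresholds, defThreshold);

        System.out.println(shouldPrune(0.68, bodyThreshold));    // informative term in "body"
        System.out.println(shouldPrune(-0.25, titleThreshold));  // very common term in "title"
    }
}
```

Under these assumptions, raising a field's threshold shrinks the set of retained terms to the more informative ones.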

    • Constructor Detail

      • RIDFTermPruningPolicy

        public RIDFTermPruningPolicy(org.apache.lucene.index.IndexReader in,
                                     Map<String,Integer> fieldFlags,
                                     Map<String,Double> thresholds,
                                     double defThreshold)
    • Method Detail

      • initPositionsTerm

        public void initPositionsTerm(org.apache.lucene.index.TermPositions tp,
                                      org.apache.lucene.index.Term t)
                               throws IOException
        Description copied from class: TermPruningPolicy
        Called when moving TermPositions to a new Term.
        Specified by:
        initPositionsTerm in class TermPruningPolicy
        Parameters:
        tp - input term positions
        t - current term
        Throws:
        IOException
      • pruneTermEnum

        public boolean pruneTermEnum(org.apache.lucene.index.TermEnum te)
                              throws IOException
        Description copied from class: TermPruningPolicy
        Pruning of all postings for a term (invoked once per term).
        Specified by:
        pruneTermEnum in class TermPruningPolicy
        Parameters:
        te - positioned term enum.
        Returns:
        true if all postings for this term should be removed, false otherwise.
        Throws:
        IOException
      • pruneAllPositions

        public boolean pruneAllPositions(org.apache.lucene.index.TermPositions termPositions,
                                         org.apache.lucene.index.Term t)
                                  throws IOException
        Description copied from class: TermPruningPolicy
        Prune all postings per term (invoked once per term per doc).
        Specified by:
        pruneAllPositions in class TermPruningPolicy
        Parameters:
        termPositions - positioned term positions. Implementations MUST NOT advance this by calling TermPositions methods that advance either the position pointer (next, skipTo) or term pointer (seek).
        t - current term
        Returns:
        true if the current posting should be removed, false otherwise.
        Throws:
        IOException
      • pruneTermVectorTerms

        public int pruneTermVectorTerms(int docNumber,
                                        String field,
                                        String[] terms,
                                        int[] freqs,
                                        org.apache.lucene.index.TermFreqVector v)
                                 throws IOException
        Description copied from class: TermPruningPolicy
        Pruning of individual terms in term vectors.
        Specified by:
        pruneTermVectorTerms in class TermPruningPolicy
        Parameters:
        docNumber - document number
        field - field name
        terms - array of terms
        freqs - array of term frequencies
        v - the original term frequency vector
        Returns:
        0 if no terms are to be removed, positive number to indicate how many terms need to be removed. The same number of entries in the terms array must be set to null to indicate which terms to remove.
        Throws:
        IOException
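
The return contract of pruneTermVectorTerms (null out the removed entries, return how many were nulled) can be illustrated with a small standalone sketch. The class `TermVectorPruneSketch`, the method `pruneLowFreqTerms`, and the frequency-based criterion are hypothetical and chosen only to show the contract; they are not the criterion this class applies.

```java
public final class TermVectorPruneSketch {

    /**
     * Marks every term whose frequency is below minFreq as removed by
     * setting its slot in the terms array to null, and returns the number
     * of slots that were nulled (0 if nothing is removed).
     */
    public static int pruneLowFreqTerms(String[] terms, int[] freqs, int minFreq) {
        int removed = 0;
        for (int i = 0; i < terms.length; i++) {
            if (terms[i] != null && freqs[i] < minFreq) {
                terms[i] = null; // a null entry signals "remove this term"
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        String[] terms = {"apache", "lucene", "the"};
        int[] freqs = {1, 4, 9};
        int removed = pruneLowFreqTerms(terms, freqs, 2);
        System.out.println(removed + " term(s) marked for removal");
    }
}
```

The returned count and the number of null entries stay in agreement, which is what the caller relies on.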
      • pruneSomePositions

        public int pruneSomePositions(int docNum,
                                      int[] positions,
                                      org.apache.lucene.index.Term curTerm)
        Description copied from class: TermPruningPolicy
        Prune some postings per term (invoked once per term per doc).
        Specified by:
        pruneSomePositions in class TermPruningPolicy
        Parameters:
        docNum - current document number
        positions - original term positions in the document (and indirectly term frequency)
        curTerm - current term
        Returns:
        0 if no postings are to be removed, or positive number to indicate how many postings need to be removed. The same number of entries in the positions array must be set to -1 to indicate which positions to remove.
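
The pruneSomePositions contract works the same way, with -1 as the removal marker. The sketch below is a hypothetical illustration only: the class `PositionPruneSketch`, the method `pruneBeyond`, and the "keep the first maxKeep positions" rule are assumptions for the example and say nothing about how this class chooses positions.

```java
public final class PositionPruneSketch {

    /**
     * Keeps at most maxKeep leading positions and marks the rest as removed
     * by setting them to -1. Returns the number of positions marked
     * (0 if nothing is removed).
     */
    public static int pruneBeyond(int[] positions, int maxKeep) {
        int removed = 0;
        for (int i = Math.max(0, maxKeep); i < positions.length; i++) {
            if (positions[i] != -1) {
                positions[i] = -1; // -1 signals "remove this position"
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        int[] positions = {3, 17, 42, 58};
        int removed = pruneBeyond(positions, 2);
        System.out.println(removed + " position(s) marked for removal");
    }
}
```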