Class SweetSpotSimilarity

  • All Implemented Interfaces:
    Serializable

    public class SweetSpotSimilarity
    extends DefaultSimilarity
    A similarity with a lengthNorm that provides for a "plateau" of equally good lengths, and tf helper functions.

    For lengthNorm, a global min/max can be specified to define the plateau of lengths that should all have a norm of 1.0. Below the min and above the max, the lengthNorm drops off as an inverse square-root function.

    A per field min/max can be specified if different fields have different sweet spots.

    For tf, baselineTf and hyperbolicTf functions are provided, which subclasses can choose between.

    See Also:
    A Gnuplot file used to generate some of the visualizations referenced from each function, Serialized Form
    • Constructor Detail

      • SweetSpotSimilarity

        public SweetSpotSimilarity()
    • Method Detail

      • setBaselineTfFactors

        public void setBaselineTfFactors​(float base,
                                         float min)
        Sets the baseline and minimum function variables for baselineTf
        See Also:
        baselineTf(float)
      • setHyperbolicTfFactors

        public void setHyperbolicTfFactors​(float min,
                                           float max,
                                           double base,
                                           float xoffset)
        Sets the function variables for the hyperbolicTf functions
        Parameters:
        min - the minimum tf value to ever be returned (default: 0.0)
        max - the maximum tf value to ever be returned (default: 2.0)
        base - the base value to be used in the exponential for the hyperbolic function (default: 1.3)
        xoffset - the midpoint of the hyperbolic function (default: 10.0)
        See Also:
        hyperbolicTf(float)
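
        The four parameters above can be read as shaping an S-curve that runs from min to max and passes through the midpoint at xoffset. The following is a minimal standalone sketch of such a curve, assuming a tanh-like form built from base raised to (freq - xoffset). The class name HyperbolicTfSketch, the exact formula, and the NaN guard are illustrative assumptions for this sketch and are not taken from this page.

```java
/** Standalone illustration of a hyperbolic tf curve; not the library's source. */
final class HyperbolicTfSketch {

    /**
     * Assumed form: min + (max - min) / 2 * (tanhLike(x) + 1),
     * where x = freq - xoffset and
     * tanhLike(x) = (base^x - base^-x) / (base^x + base^-x).
     */
    static float hyperbolicTf(float freq, float min, float max, double base, float xoffset) {
        double x = freq - xoffset;
        double up = Math.pow(base, x);
        double down = Math.pow(base, -x);
        float result = min + (float) ((max - min) / 2.0f * (((up - down) / (up + down)) + 1.0));
        // Guard chosen for this sketch: for very large freq, base^x overflows to
        // Infinity and the ratio becomes NaN, so fall back to the upper bound.
        return Float.isNaN(result) ? max : result;
    }

    public static void main(String[] args) {
        // Documented defaults: min 0.0, max 2.0, base 1.3, xoffset 10.0
        for (float freq : new float[] {1f, 5f, 10f, 20f, 1000f}) {
            System.out.printf("freq=%.0f tf=%.4f%n", freq, hyperbolicTf(freq, 0.0f, 2.0f, 1.3, 10.0f));
        }
    }
}
```

        With the documented defaults, a frequency equal to xoffset lands halfway between min and max, and the value levels off toward max as the frequency grows.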
      • setLengthNormFactors

        public void setLengthNormFactors​(int min,
                                         int max,
                                         float steepness)
        Sets the default function variables used by lengthNorm when no field specific variables have been set.
        See Also:
        Similarity.lengthNorm(java.lang.String, int)
      • setLengthNormFactors

        public void setLengthNormFactors​(String field,
                                         int min,
                                         int max,
                                         float steepness,
                                         boolean discountOverlaps)
        Sets the function variables used by lengthNorm for a specific named field.
        Parameters:
        field - field name
        min - minimum value
        max - maximum value
        steepness - steepness of the curve
        discountOverlaps - if true, numOverlapTokens will be subtracted from numTokens; if false then numOverlapTokens will be assumed to be 0 (see DefaultSimilarity.computeNorm(String, FieldInvertState) for details).
        See Also:
        Similarity.lengthNorm(java.lang.String, int)
      • computeNorm

        public float computeNorm​(String fieldName,
                                 FieldInvertState state)
        Implemented as state.getBoost() * lengthNorm(fieldName, numTokens), where numTokens excludes overlap tokens if discountOverlaps is true, either by default or for this specific field.
        Overrides:
        computeNorm in class DefaultSimilarity
        Parameters:
        fieldName - field name
        state - current processing state for this field
        Returns:
        the calculated float norm
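
        The effect of discountOverlaps on the token count can be illustrated without any library dependency. In the sketch below, FieldState is a hypothetical stand-in for the per-field indexing state (total length, overlap count, boost), and lengthNorm is a simple 1/sqrt placeholder rather than the plateau function of this class. Only the boost * lengthNorm(numTokens) shape comes from the description above.

```java
/** Standalone illustration of boost * lengthNorm(numTokens); not the library's source. */
final class ComputeNormSketch {

    /** Hypothetical stand-in for the per-field state gathered while indexing. */
    static final class FieldState {
        final int length;      // total tokens seen in the field
        final int numOverlap;  // tokens with a position increment of 0
        final float boost;     // field boost

        FieldState(int length, int numOverlap, float boost) {
            this.length = length;
            this.numOverlap = numOverlap;
            this.boost = boost;
        }
    }

    /** Placeholder length norm (plain 1/sqrt); a real one would use the sweet-spot curve. */
    static float lengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    static float computeNorm(FieldState state, boolean discountOverlaps) {
        // Overlap tokens are subtracted only when discountOverlaps is true.
        int numTokens = discountOverlaps ? state.length - state.numOverlap : state.length;
        return state.boost * lengthNorm(numTokens);
    }

    public static void main(String[] args) {
        FieldState state = new FieldState(20, 4, 2.0f);
        System.out.println(computeNorm(state, true));
        System.out.println(computeNorm(state, false));
    }
}
```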
      • computeLengthNorm

        public float computeLengthNorm​(String fieldName,
                                       int numTerms)
        Implemented as: 1/sqrt( steepness * (abs(x-min) + abs(x-max) - (max-min)) + 1 ), where x is numTerms.

        This reduces to 1/sqrt(x) when min and max are both 1 and steepness is 0.5.

        :TODO: potential optimization is to just flat out return 1.0f if numTerms is between min and max.

        See Also:
        setLengthNormFactors(int, int, float), An SVG visualization of this function
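
        To make the plateau behaviour concrete, here is a minimal standalone sketch of the formula given for computeLengthNorm. The class name SweetSpotLengthNormSketch and the sample min, max, and steepness values are hypothetical. The sketch passes the factors as arguments instead of reading them from a configured similarity instance.

```java
/** Standalone illustration of the documented length-norm formula; not the library's source. */
final class SweetSpotLengthNormSketch {

    /** 1 / sqrt(steepness * (|x - min| + |x - max| - (max - min)) + 1), with x = numTerms. */
    static float lengthNorm(int numTerms, int min, int max, float steepness) {
        // Zero anywhere inside [min, max]; grows linearly with distance outside it.
        float distance = Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min);
        return (float) (1.0 / Math.sqrt(steepness * distance + 1.0f));
    }

    public static void main(String[] args) {
        int min = 3;             // hypothetical plateau start
        int max = 10;            // hypothetical plateau end
        float steepness = 0.5f;
        for (int n : new int[] {1, 3, 7, 10, 20, 100}) {
            System.out.printf("numTerms=%d norm=%.4f%n", n, lengthNorm(n, min, max, steepness));
        }
    }
}
```

        Every length between min and max yields exactly 1.0 because the two absolute values sum to max - min there. Outside the plateau the bracketed term equals twice the distance to the nearer bound, which is why a steepness of 0.5 with min = max = 1 gives 1/sqrt(x).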
      • tf

        public float tf​(int freq)
        Delegates to baselineTf
        Overrides:
        tf in class Similarity
        Parameters:
        freq - the frequency of a term within a document
        Returns:
        a score factor based on a term's within-document frequency
        See Also:
        baselineTf(float)