Class ShingleAnalyzerWrapper

    • Constructor Detail

      • ShingleAnalyzerWrapper

        public ShingleAnalyzerWrapper​(Analyzer defaultAnalyzer)
      • ShingleAnalyzerWrapper

        public ShingleAnalyzerWrapper​(Analyzer defaultAnalyzer,
                                      int maxShingleSize)
      • ShingleAnalyzerWrapper

        public ShingleAnalyzerWrapper​(Analyzer defaultAnalyzer,
                                      int minShingleSize,
                                      int maxShingleSize)
      • ShingleAnalyzerWrapper

        public ShingleAnalyzerWrapper​(Analyzer defaultAnalyzer,
                                      int minShingleSize,
                                      int maxShingleSize,
                                      String tokenSeparator,
                                      boolean outputUnigrams,
                                      boolean outputUnigramsIfNoShingles)
        Creates a new ShingleAnalyzerWrapper
        Parameters:
        defaultAnalyzer - Analyzer whose TokenStream is to be filtered
        minShingleSize - Min shingle (token ngram) size
        maxShingleSize - Max shingle size
        tokenSeparator - Used to separate input stream tokens in output shingles
        outputUnigrams - Whether or not the filter shall pass the original tokens to the output stream
        outputUnigramsIfNoShingles - Overrides the behavior of outputUnigrams==false when no shingles are available (because there are fewer than minShingleSize tokens in the input stream). Note that if outputUnigrams==true, then unigrams are always output, regardless of whether any shingles are available.
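        The interaction of these parameters can be illustrated with a minimal plain-Java sketch. It is a simplified model of the documented behavior, not Lucene's implementation: the class ShingleSketch and its shingles method are hypothetical names, the tokens are passed in as an already-analyzed list, and the requirement that sizes be at least 2 is an assumption of the sketch. The output order (each unigram followed by the shingles that start at it) is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: models the constructor parameters on a plain token list.
public class ShingleSketch {

    static List<String> shingles(List<String> tokens,
                                 int minShingleSize,
                                 int maxShingleSize,
                                 String tokenSeparator,
                                 boolean outputUnigrams,
                                 boolean outputUnigramsIfNoShingles) {
        // Assumption of this sketch: shingles span at least two tokens, and min <= max.
        if (minShingleSize < 2 || minShingleSize > maxShingleSize) {
            throw new IllegalArgumentException("require 2 <= minShingleSize <= maxShingleSize");
        }
        List<String> out = new ArrayList<>();
        boolean anyShingle = false;
        for (int start = 0; start < tokens.size(); start++) {
            if (outputUnigrams) {
                out.add(tokens.get(start)); // pass the original token through
            }
            // Emit every shingle of size min..max that starts at this token and fits.
            for (int size = minShingleSize;
                 size <= maxShingleSize && start + size <= tokens.size();
                 size++) {
                out.add(String.join(tokenSeparator, tokens.subList(start, start + size)));
                anyShingle = true;
            }
        }
        // Fall back to unigrams only when unigrams are off and no shingle could be built.
        if (!anyShingle && !outputUnigrams && outputUnigramsIfNoShingles) {
            out.addAll(tokens);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [please, please divide, divide, divide this, this]
        System.out.println(shingles(List.of("please", "divide", "this"), 2, 2, " ", true, false));
    }
}
```

        With a single input token and outputUnigrams==false, the sketch returns an empty list unless outputUnigramsIfNoShingles is true, in which case the lone token is returned.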
      • ShingleAnalyzerWrapper

        public ShingleAnalyzerWrapper​(Version matchVersion,
                                      int minShingleSize,
                                      int maxShingleSize)
    • Method Detail

      • getMaxShingleSize

        public int getMaxShingleSize()
        The max shingle (token ngram) size
        Returns:
        The max shingle (token ngram) size
      • setMaxShingleSize

        @Deprecated
        public void setMaxShingleSize​(int maxShingleSize)
        Deprecated.
        Setting maxShingleSize after Analyzer instantiation prevents reuse. Configure maxShingleSize during construction.
        Set the maximum size of output shingles (default: 2)
        Parameters:
        maxShingleSize - max shingle size
      • getMinShingleSize

        public int getMinShingleSize()
        The min shingle (token ngram) size
        Returns:
        The min shingle (token ngram) size
      • setMinShingleSize

        @Deprecated
        public void setMinShingleSize​(int minShingleSize)
        Deprecated.
        Setting minShingleSize after Analyzer instantiation prevents reuse. Configure minShingleSize during construction.

        Set the min shingle size (default: 2).

        This method requires that the passed in minShingleSize is not greater than maxShingleSize, so make sure that maxShingleSize is set before calling this method.

        Parameters:
        minShingleSize - min size of output shingles
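        The ordering requirement above can be sketched in plain Java. This is an illustrative model only: ShingleSizeOrdering and tryConfigure are hypothetical names, the defaults of 2 are taken from the setter descriptions, and throwing IllegalArgumentException on a violation is an assumption rather than confirmed behavior of this class.

```java
// Hypothetical model of the min/max setter ordering constraint.
public class ShingleSizeOrdering {
    private int minShingleSize = 2; // documented default
    private int maxShingleSize = 2; // documented default

    void setMaxShingleSize(int maxShingleSize) {
        this.maxShingleSize = maxShingleSize;
    }

    void setMinShingleSize(int minShingleSize) {
        // The documented requirement: min must not exceed the current max.
        if (minShingleSize > maxShingleSize) {
            throw new IllegalArgumentException("minShingleSize must not exceed maxShingleSize");
        }
        this.minShingleSize = minShingleSize;
    }

    // Returns true if the two setters succeed when called in the given order.
    static boolean tryConfigure(boolean maxFirst, int min, int max) {
        ShingleSizeOrdering sizes = new ShingleSizeOrdering();
        try {
            if (maxFirst) {
                sizes.setMaxShingleSize(max);
                sizes.setMinShingleSize(min);
            } else {
                sizes.setMinShingleSize(min);
                sizes.setMaxShingleSize(max);
            }
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(tryConfigure(true, 3, 4));  // max first: accepted
        System.out.println(tryConfigure(false, 3, 4)); // min first: rejected against the default max of 2
    }
}
```

        Passing both sizes to a constructor avoids the ordering problem entirely, which matches the deprecation advice to configure them during construction.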
      • getTokenSeparator

        public String getTokenSeparator()
      • setTokenSeparator

        @Deprecated
        public void setTokenSeparator​(String tokenSeparator)
        Deprecated.
        Setting tokenSeparator after Analyzer instantiation prevents reuse. Configure tokenSeparator during construction.
        Sets the string to use when joining adjacent tokens to form a shingle
        Parameters:
        tokenSeparator - used to separate input stream tokens in output shingles
      • isOutputUnigrams

        public boolean isOutputUnigrams()
      • setOutputUnigrams

        @Deprecated
        public void setOutputUnigrams​(boolean outputUnigrams)
        Deprecated.
        Setting outputUnigrams after Analyzer instantiation prevents reuse. Configure outputUnigrams during construction.
        Shall the filter pass the original tokens (the "unigrams") to the output stream?
        Parameters:
        outputUnigrams - Whether or not the filter shall pass the original tokens to the output stream
      • isOutputUnigramsIfNoShingles

        public boolean isOutputUnigramsIfNoShingles()
      • setOutputUnigramsIfNoShingles

        @Deprecated
        public void setOutputUnigramsIfNoShingles​(boolean outputUnigramsIfNoShingles)
        Deprecated.
        Setting outputUnigramsIfNoShingles after Analyzer instantiation prevents reuse. Configure outputUnigramsIfNoShingles during construction.

        Shall we override the behavior of outputUnigrams==false for those times when no shingles are available (because there are fewer than minShingleSize tokens in the input stream)? (default: false.)

        Note that if outputUnigrams==true, then unigrams are always output, regardless of whether any shingles are available.

        Parameters:
        outputUnigramsIfNoShingles - Whether or not to output a single unigram when no shingles are available.
      • tokenStream

        public TokenStream tokenStream​(String fieldName,
                                       Reader reader)
        Description copied from class: Analyzer
        Creates a TokenStream which tokenizes all the text in the provided Reader. Must be able to handle null field name for backward compatibility.
        Specified by:
        tokenStream in class Analyzer
      • reusableTokenStream

        public TokenStream reusableTokenStream​(String fieldName,
                                               Reader reader)
                                        throws IOException
        Description copied from class: Analyzer
        Creates a TokenStream that is allowed to be re-used from the previous time that the same thread called this method. Callers that do not need to use more than one TokenStream at the same time from this analyzer should use this method for better performance.
        Overrides:
        reusableTokenStream in class Analyzer
        Throws:
        IOException