Package | Description |
---|---|
org.apache.lucene.analysis | Text analysis. |
org.apache.lucene.analysis.cn.smart | Analyzer for Simplified Chinese, which indexes words. |
org.apache.lucene.analysis.core | Basic, general-purpose analysis components. |
org.apache.lucene.analysis.ngram | Character n-gram tokenizers and filters. |
org.apache.lucene.analysis.path | Analysis components for path-like strings such as filenames. |
org.apache.lucene.analysis.pattern | Set of components for pattern-based (regex) analysis. |
org.apache.lucene.analysis.standard | Fast, general-purpose, grammar-based tokenizers. |
org.apache.lucene.analysis.standard.std40 | Backwards-compatible implementations to match Version.LUCENE_4_0. |
org.apache.lucene.analysis.th | Analyzer for Thai. |
org.apache.lucene.analysis.util | Utility functions for text analysis. |
org.apache.lucene.analysis.wikipedia | Tokenizer that is aware of Wikipedia syntax. |
Modifier and Type | Field and Description |
---|---|
protected Tokenizer | Analyzer.TokenStreamComponents.source: Original source of the tokens. |
Modifier and Type | Method and Description |
---|---|
Tokenizer | Analyzer.TokenStreamComponents.getTokenizer(): Returns the component's Tokenizer. |
Constructor and Description |
---|
Analyzer.TokenStreamComponents(Tokenizer source): Creates a new Analyzer.TokenStreamComponents instance. |
Analyzer.TokenStreamComponents(Tokenizer source, TokenStream result): Creates a new Analyzer.TokenStreamComponents instance. |
Modifier and Type | Class and Description |
---|---|
class | HMMChineseTokenizer: Tokenizer for Chinese or mixed Chinese-English text. |
class | SentenceTokenizer: Deprecated. Use HMMChineseTokenizer instead. |
Modifier and Type | Method and Description |
---|---|
Tokenizer | HMMChineseTokenizerFactory.create(AttributeFactory factory) |
Modifier and Type | Class and Description |
---|---|
class | KeywordTokenizer: Emits the entire input as a single token. |
class | LetterTokenizer: A tokenizer that divides text at non-letters. |
class | LowerCaseTokenizer: Performs the function of LetterTokenizer and LowerCaseFilter together. |
class | UnicodeWhitespaceTokenizer: A tokenizer that divides text at whitespace. |
class | WhitespaceTokenizer: A tokenizer that divides text at whitespace characters as defined by Character.isWhitespace(int). |
Modifier and Type | Method and Description |
---|---|
Tokenizer | WhitespaceTokenizerFactory.create(AttributeFactory factory) |
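The core tokenizers above all split text on a character-class predicate. As a minimal, stand-alone sketch of what WhitespaceTokenizer does (the class and method names here are illustrative, not part of the Lucene API), dividing at characters for which Character.isWhitespace(int) is true looks like this:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of WhitespaceTokenizer's splitting rule: emit a token
// whenever a run of non-whitespace code points ends. Not the Lucene class.
public class WhitespaceSplit {
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Walk the text by code point so supplementary characters are handled.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isWhitespace(cp)) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("fast  general-purpose\tanalysis"));
    }
}
```

LetterTokenizer follows the same pattern with Character.isLetter(int) as the predicate; in Lucene both are built on the CharTokenizer base class listed further below.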
Modifier and Type | Class and Description |
---|---|
class | EdgeNGramTokenizer: Tokenizes the input from an edge into n-grams of given size(s). |
class | Lucene43EdgeNGramTokenizer: Deprecated. |
class | Lucene43NGramTokenizer: Deprecated. |
class | NGramTokenizer: Tokenizes the input into n-grams of the given size(s). |
Modifier and Type | Method and Description |
---|---|
Tokenizer | NGramTokenizerFactory.create(AttributeFactory factory) |
Tokenizer | EdgeNGramTokenizerFactory.create(AttributeFactory factory) |
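The grams these tokenizers emit for a single term, given minGram/maxGram sizes, can be sketched in plain Java (names are ours, not the Lucene API; the real tokenizers stream from a Reader, work per code point, and handle token boundaries):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of NGramTokenizer- and EdgeNGramTokenizer-style output
// for one term. Not the Lucene classes.
public class Grams {
    // All substrings of length minGram..maxGram (NGramTokenizer-style).
    public static List<String> ngrams(String term, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < term.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= term.length(); len++) {
                out.add(term.substring(start, start + len));
            }
        }
        return out;
    }

    // Prefixes of length minGram..maxGram (EdgeNGramTokenizer-style, front edge).
    public static List<String> edgeNgrams(String term, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int len = minGram; len <= maxGram && len <= term.length(); len++) {
            out.add(term.substring(0, len));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("text", 2, 3));     // bigrams and trigrams
        System.out.println(edgeNgrams("text", 1, 3)); // prefixes of the term
    }
}
```

Edge n-grams are the usual building block for prefix/autocomplete matching, while full n-grams support substring matching at the cost of many more terms.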
Modifier and Type | Class and Description |
---|---|
class | PathHierarchyTokenizer: Tokenizer for path-like hierarchies. |
class | ReversePathHierarchyTokenizer: Tokenizer for domain-like hierarchies. |
Modifier and Type | Method and Description |
---|---|
Tokenizer | PathHierarchyTokenizerFactory.create(AttributeFactory factory) |
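PathHierarchyTokenizer emits a ladder of prefixes, one token per hierarchy level, so that a document at /a/b/c also matches queries for /a and /a/b. A minimal sketch of that ladder (illustrative names, not the Lucene API; the real tokenizer also supports replacement delimiters and a skip count):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of PathHierarchyTokenizer's output: each token is the
// path up to and including one more delimiter-separated component.
public class PathLadder {
    public static List<String> tokenize(String path, char delimiter) {
        List<String> tokens = new ArrayList<>();
        // Emit the prefix ending just before each delimiter (skipping a
        // leading delimiter, which would yield an empty token).
        for (int i = 0; i < path.length(); i++) {
            if (path.charAt(i) == delimiter && i > 0) {
                tokens.add(path.substring(0, i));
            }
        }
        if (!path.isEmpty()) tokens.add(path); // the full path is the last token
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("/usr/share/doc", '/'));
    }
}
```

ReversePathHierarchyTokenizer works from the other end, which suits domain-like strings where the significant component is last (e.g. suffixes of www.example.co.uk).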
Modifier and Type | Class and Description |
---|---|
class | PatternTokenizer: Uses regex pattern matching to construct distinct tokens from the input stream. |
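PatternTokenizer has two modes: with group = -1 the pattern acts as a delimiter to split on, and with group >= 0 each match's capture group becomes a token. A plain-Java illustration of both modes (our own class, not the Lucene one):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of PatternTokenizer's split mode (group = -1) and
// group-extraction mode (group >= 0). Not the Lucene class.
public class RegexTokens {
    public static List<String> tokenize(String text, String regex, int group) {
        List<String> tokens = new ArrayList<>();
        if (group < 0) {
            // Split mode: the pattern marks token boundaries.
            for (String piece : Pattern.compile(regex).split(text)) {
                if (!piece.isEmpty()) tokens.add(piece);
            }
        } else {
            // Group mode: each match contributes the requested capture group.
            Matcher m = Pattern.compile(regex).matcher(text);
            while (m.find()) {
                tokens.add(m.group(group));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a, b; c", "[,;]\\s*", -1));    // split mode
        System.out.println(tokenize("id=42 id=7", "id=(\\d+)", 1)); // group mode
    }
}
```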
Modifier and Type | Class and Description |
---|---|
class | ClassicTokenizer: A grammar-based tokenizer constructed with JFlex. |
class | StandardTokenizer: A grammar-based tokenizer constructed with JFlex. |
class | UAX29URLEmailTokenizer: Implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29; URLs and email addresses are also tokenized according to the relevant RFCs. |
Modifier and Type | Method and Description |
---|---|
Tokenizer | UAX29URLEmailTokenizerFactory.create(AttributeFactory factory) |
Tokenizer | StandardTokenizerFactory.create(AttributeFactory factory) |
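StandardTokenizer implements the UAX #29 word-break rules with a generated JFlex grammar. To illustrate the kind of segmentation those rules produce, the JDK's java.text.BreakIterator applies comparable Unicode word-boundary rules (this is only an analogy, not the Lucene implementation, and edge cases differ):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Rough stand-in for UAX #29-style word segmentation using the JDK's
// BreakIterator. Not Lucene's StandardTokenizer.
public class WordBreaks {
    public static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // The iterator also reports whitespace and punctuation runs as
            // segments; keep only segments containing a letter or digit.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Lucene indexes text."));
    }
}
```

UAX29URLEmailTokenizer extends this word-break behavior with extra grammar rules so that whole URLs and email addresses survive as single tokens.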
Modifier and Type | Class and Description |
---|---|
class | StandardTokenizer40: Deprecated. |
class | UAX29URLEmailTokenizer40: Deprecated. |
Modifier and Type | Class and Description |
---|---|
class | ThaiTokenizer: Tokenizer that uses a BreakIterator to tokenize Thai text. |
Modifier and Type | Method and Description |
---|---|
Tokenizer | ThaiTokenizerFactory.create(AttributeFactory factory) |
Modifier and Type | Class and Description |
---|---|
class | CharTokenizer: An abstract base class for simple, character-oriented tokenizers. |
class | SegmentingTokenizerBase: Breaks text into sentences with a BreakIterator and allows subclasses to decompose these sentences into words. |
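SegmentingTokenizerBase's two-stage approach, first finding sentence boundaries with a BreakIterator and then decomposing each sentence, can be sketched with the JDK's sentence iterator (a stand-alone illustration, not the Lucene base class):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative first stage of SegmentingTokenizerBase-style processing:
// locate sentence boundaries; a subclass would then split each sentence
// into word tokens.
public class Sentences {
    public static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) out.add(s);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sentences("Tokenize this. Then this one."));
    }
}
```

Working a sentence at a time keeps the buffer bounded, which is why ThaiTokenizer (whose word-level BreakIterator needs sentence-sized context) builds on this base class.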
Modifier and Type | Method and Description |
---|---|
Tokenizer | TokenizerFactory.create(): Creates a TokenStream of the specified input using the default attribute factory. |
abstract Tokenizer | TokenizerFactory.create(AttributeFactory factory): Creates a TokenStream of the specified input using the given AttributeFactory. |
Modifier and Type | Class and Description |
---|---|
class | WikipediaTokenizer: Extension of StandardTokenizer that is aware of Wikipedia syntax. |
Copyright © 2000-2018 The Apache Software Foundation. All Rights Reserved.