Package | Description |
---|---|
org.apache.lucene.analysis | API and code to convert text into indexable/searchable tokens. |
org.apache.lucene.analysis.ar | Analyzer for Arabic. |
org.apache.lucene.analysis.cjk | Analyzer for Chinese, Japanese, and Korean, which indexes bigrams (overlapping groups of two adjacent Han characters). |
org.apache.lucene.analysis.cn | Analyzer for Chinese, which indexes unigrams (individual Chinese characters). |
org.apache.lucene.analysis.cn.smart | Analyzer for Simplified Chinese, which indexes words. |
org.apache.lucene.analysis.icu.segmentation | Tokenizer that breaks text into words with the Unicode Text Segmentation algorithm. |
org.apache.lucene.analysis.in | Analysis components for Indian languages. |
org.apache.lucene.analysis.ja | Analyzer for Japanese. |
org.apache.lucene.analysis.ngram | Character n-gram tokenizers and filters. |
org.apache.lucene.analysis.path | Analysis components for path-like strings such as filenames. |
org.apache.lucene.analysis.ru | Analyzer for Russian. |
org.apache.lucene.analysis.standard | Standards-based analyzers implemented with JFlex. |
org.apache.lucene.analysis.wikipedia | Tokenizer that is aware of Wikipedia syntax. |
Modifier and Type | Class | Description |
---|---|---|
class | CharTokenizer | An abstract base class for simple, character-oriented tokenizers. |
class | EmptyTokenizer | Emits no tokens. |
class | KeywordTokenizer | Emits the entire input as a single token. |
class | LetterTokenizer | A tokenizer that divides text at non-letters. |
class | LowerCaseTokenizer | Performs the function of LetterTokenizer and LowerCaseFilter together. |
class | MockTokenizer | Tokenizer for testing. |
class | WhitespaceTokenizer | A tokenizer that divides text at whitespace. |
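
All of these concrete tokenizers share the pull-style API inherited from TokenStream: attach the attributes you want, call reset(), and iterate with incrementToken(). A minimal sketch, assuming the Lucene 3.x API this page documents (the Version constant and sample text are illustrative):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class WhitespaceDemo {
  public static void main(String[] args) throws Exception {
    // WhitespaceTokenizer divides at whitespace; KeywordTokenizer would
    // instead emit the whole input as one token.
    Tokenizer tokenizer = new WhitespaceTokenizer(
        Version.LUCENE_36, new StringReader("convert text into tokens"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString()); // convert, text, into, tokens
    }
    tokenizer.end();
    tokenizer.close();
  }
}
```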
Modifier and Type | Field | Description |
---|---|---|
protected Tokenizer | ReusableAnalyzerBase.TokenStreamComponents.source | |
Constructor | Description |
---|---|
TokenStreamComponents(Tokenizer source) | Creates a new ReusableAnalyzerBase.TokenStreamComponents instance. |
TokenStreamComponents(Tokenizer source, TokenStream result) | Creates a new ReusableAnalyzerBase.TokenStreamComponents instance. |
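
The two-argument constructor pairs the source Tokenizer with the end of the filter chain, so a ReusableAnalyzerBase subclass can reuse both across calls. A hedged sketch of the wiring (the analyzer name and Version constant are illustrative):

```java
import java.io.Reader;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.ReusableAnalyzerBase;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

public class LowercasingWhitespaceAnalyzer extends ReusableAnalyzerBase {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // The tokenizer is the source; the filter chain built on top of it
    // is the result handed back to the analyzer.
    Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_36, reader);
    TokenStream result = new LowerCaseFilter(Version.LUCENE_36, source);
    return new TokenStreamComponents(source, result);
  }
}
```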
Modifier and Type | Class | Description |
---|---|---|
class | ArabicLetterTokenizer | Deprecated. (3.1) Use StandardTokenizer instead. |
Modifier and Type | Class | Description |
---|---|---|
class | CJKTokenizer | Deprecated. Use StandardTokenizer, CJKWidthFilter, CJKBigramFilter, and LowerCaseFilter instead. |
Modifier and Type | Class | Description |
---|---|---|
class | ChineseTokenizer | Deprecated. Use StandardTokenizer instead, which has the same functionality. |
Modifier and Type | Class | Description |
---|---|---|
class | SentenceTokenizer | Tokenizes input text into sentences. |
Modifier and Type | Class | Description |
---|---|---|
class | ICUTokenizer | Breaks text into words according to UAX #29: Unicode Text Segmentation (http://www.unicode.org/reports/tr29/). |
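
A short sketch of the tokenizer, assuming the lucene-icu module's single-Reader constructor; the default configuration picks a per-script break iterator, so mixed-script input is segmented correctly (sample text is illustrative):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ICUDemo {
  public static void main(String[] args) throws Exception {
    // UAX #29 word breaking applied per script run.
    ICUTokenizer tokenizer = new ICUTokenizer(new StringReader("Lucene 检索 ライブラリ"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}
```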
Modifier and Type | Class | Description |
---|---|---|
class | IndicTokenizer | Deprecated. (3.6) Use StandardTokenizer instead. |
Modifier and Type | Class | Description |
---|---|---|
class | JapaneseTokenizer | Tokenizer for Japanese that uses morphological analysis. |
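
A sketch under the assumption that this release exposes the kuromoji-style constructor JapaneseTokenizer(Reader, UserDictionary, boolean, Mode); SEARCH mode additionally splits long compound nouns:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class JapaneseDemo {
  public static void main(String[] args) throws Exception {
    // null = no user dictionary, true = discard punctuation; constructor
    // signature assumed from the 3.6-era kuromoji module.
    JapaneseTokenizer tokenizer = new JapaneseTokenizer(
        new StringReader("関西国際空港"), null, true, JapaneseTokenizer.Mode.SEARCH);
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}
```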
Modifier and Type | Class | Description |
---|---|---|
class | EdgeNGramTokenizer | Tokenizes the input from an edge into n-grams of given size(s). |
class | NGramTokenizer | Tokenizes the input into n-grams of the given size(s). |
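
For example, assuming the (Reader, minGram, maxGram) constructor, NGramTokenizer turns a short string into every substring of the requested lengths (the ordering of the output can vary between versions):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NGramDemo {
  public static void main(String[] args) throws Exception {
    // 2- and 3-character grams of "fox": fo, ox, fox.
    NGramTokenizer tokenizer = new NGramTokenizer(new StringReader("fox"), 2, 3);
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}
```

EdgeNGramTokenizer works the same way but anchors every gram at one edge of the input, which is what makes it useful for prefix-style autocomplete.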
Modifier and Type | Class | Description |
---|---|---|
class | PathHierarchyTokenizer | Tokenizer for path-like hierarchies. |
class | ReversePathHierarchyTokenizer | Tokenizer for domain-like hierarchies. |
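
A sketch assuming the single-Reader constructor with the default '/' delimiter; each emitted token extends the previous one by one path component, which is what makes matching on hierarchy prefixes cheap:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.path.PathHierarchyTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PathDemo {
  public static void main(String[] args) throws Exception {
    // "/usr/local/bin" -> /usr, /usr/local, /usr/local/bin
    PathHierarchyTokenizer tokenizer =
        new PathHierarchyTokenizer(new StringReader("/usr/local/bin"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}
```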
Modifier and Type | Class | Description |
---|---|---|
class | RussianLetterTokenizer | Deprecated. Use StandardTokenizer instead, which has the same functionality. |
Modifier and Type | Class | Description |
---|---|---|
class | ClassicTokenizer | A grammar-based tokenizer constructed with JFlex. |
class | StandardTokenizer | A grammar-based tokenizer constructed with JFlex. |
class | UAX29URLEmailTokenizer | Implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29; URLs and email addresses are also tokenized according to the relevant RFCs. |
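
StandardTokenizer is the usual default and the recommended replacement for most of the deprecated tokenizers above. A minimal sketch (the Version constant and sample text are illustrative):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class StandardDemo {
  public static void main(String[] args) throws Exception {
    StandardTokenizer tokenizer = new StandardTokenizer(
        Version.LUCENE_36, new StringReader("Grammar-based word breaking, e.g. for C.J.K. text."));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}
```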
Modifier and Type | Class | Description |
---|---|---|
class | WikipediaTokenizer | Extension of StandardTokenizer that is aware of Wikipedia syntax. |
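
A sketch assuming the single-Reader constructor; the wiki-syntax awareness shows up in the token type attribute (internal links, for instance, get their own type):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.analysis.wikipedia.WikipediaTokenizer;

public class WikipediaDemo {
  public static void main(String[] args) throws Exception {
    WikipediaTokenizer tokenizer =
        new WikipediaTokenizer(new StringReader("See [[Apache Lucene]] for details."));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString() + " [" + type.type() + "]");
    }
    tokenizer.end();
    tokenizer.close();
  }
}
```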