Class DefaultICUTokenizerConfig
java.lang.Object
  org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig
    org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig

public class DefaultICUTokenizerConfig extends ICUTokenizerConfig
Default ICUTokenizerConfig that is generally applicable to many languages.

Generally tokenizes Unicode text according to UAX#29 (BreakIterator.getWordInstance(ULocale.ROOT)), but with the following tailorings:
- Thai text is broken into words with a DictionaryBasedBreakIterator.
- Lao, Myanmar, and Khmer text is broken into syllables based on custom BreakIterator rules.
- Hebrew text has custom tailorings to handle special cases involving punctuation.

WARNING: This API is experimental and might change in incompatible ways in the next release.

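As a rough illustration of the boundary-walking pattern behind the word segmentation described above, the sketch below iterates word boundaries with the JDK's java.text.BreakIterator. This is a stand-in chosen so the example runs without ICU4J or Lucene on the classpath. It is not the com.ibm.icu.text.BreakIterator that this class returns, its rules may differ from UAX#29 in detail, and it applies none of the Thai, Lao, Myanmar, Khmer, or Hebrew tailorings. The class name WordBreakSketch and the helper words are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: JDK BreakIterator used as a stand-in for ICU's.
class WordBreakSketch {

    // Returns the segments between word boundaries that contain
    // at least one letter or digit (whitespace and punctuation are dropped).
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);

        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world 42"));
    }
}
```

With ICU4J available, the same loop shape should carry over to com.ibm.icu.text.BreakIterator, which additionally exposes a rule status for each boundary; that status is what getType receives as its ruleStatus argument.
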
Field Summary
Fields
Modifier and Type    Field            Description
static String        WORD_HANGUL      Token type for words containing Korean hangul
static String        WORD_HIRAGANA    Token type for words containing Japanese hiragana
static String        WORD_IDEO        Token type for words containing ideographic characters
static String        WORD_KATAKANA    Token type for words containing Japanese katakana
static String        WORD_LETTER      Token type for words that contain letters
static String        WORD_NUMBER      Token type for words that appear to be numbers
-
Constructor Summary
Constructors
Constructor                    Description
DefaultICUTokenizerConfig()
-
Method Summary
Methods
Modifier and Type                  Method                                 Description
com.ibm.icu.text.BreakIterator     getBreakIterator(int script)           Return a BreakIterator capable of processing a given script.
String                             getType(int script, int ruleStatus)    Return a token type value for a given script and BreakIterator rule status.
-
Field Detail
-
WORD_IDEO
public static final String WORD_IDEO
Token type for words containing ideographic characters
-
WORD_HIRAGANA
public static final String WORD_HIRAGANA
Token type for words containing Japanese hiragana
-
WORD_KATAKANA
public static final String WORD_KATAKANA
Token type for words containing Japanese katakana
-
WORD_HANGUL
public static final String WORD_HANGUL
Token type for words containing Korean hangul
-
WORD_LETTER
public static final String WORD_LETTER
Token type for words that contain letters
-
WORD_NUMBER
public static final String WORD_NUMBER
Token type for words that appear to be numbers
-
Method Detail
-
getBreakIterator
public com.ibm.icu.text.BreakIterator getBreakIterator(int script)
Description copied from class: ICUTokenizerConfig
Return a BreakIterator capable of processing a given script.
Specified by: getBreakIterator in class ICUTokenizerConfig
-
getType
public String getType(int script, int ruleStatus)
Description copied from class: ICUTokenizerConfig
Return a token type value for a given script and BreakIterator rule status.
Specified by: getType in class ICUTokenizerConfig
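
To make the idea of deriving a token type from a script and a rule status more concrete, here is a minimal, self-contained sketch. It is an assumption about the general shape of such a mapping, not Lucene's implementation: the enums Script and Status are hypothetical stand-ins for the int script codes and BreakIterator rule-status values, and the returned strings are just the names of the WORD_* fields listed above, not their actual values.

```java
// Hypothetical sketch of a (script, rule status) -> token type mapping.
// Nothing here reproduces Lucene's or ICU's real constants.
class TokenTypeSketch {

    // Stand-in for the int script code passed to getType.
    enum Script { HAN, HIRAGANA, KATAKANA, HANGUL, LATIN }

    // Stand-in for the BreakIterator rule status passed to getType.
    enum Status { NONE, NUMBER, LETTER, KANA, IDEO }

    static String typeFor(Script script, Status status) {
        switch (status) {
            case IDEO:
                return "WORD_IDEO";
            case KANA:
                // Kana words are split by script into hiragana and katakana.
                return script == Script.HIRAGANA ? "WORD_HIRAGANA" : "WORD_KATAKANA";
            case LETTER:
                // Hangul gets its own type; other letters share one.
                return script == Script.HANGUL ? "WORD_HANGUL" : "WORD_LETTER";
            case NUMBER:
                return "WORD_NUMBER";
            default:
                // Placeholder for anything the sketch does not classify.
                return "OTHER";
        }
    }

    public static void main(String[] args) {
        System.out.println(typeFor(Script.HANGUL, Status.LETTER));
        System.out.println(typeFor(Script.LATIN, Status.NUMBER));
    }
}
```

The point of the two-argument signature is that the rule status alone cannot distinguish, for example, hiragana from katakana or hangul from other letters, so the script is consulted as a second key.
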