Apache Lucene is a high-performance, full-featured text search engine library. Here's a simple example of how to use Lucene for indexing and searching (using JUnit to check that the results are what we expect):
```java
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

// Store the index in memory:
Directory directory = new RAMDirectory();
// To store an index on disk, use this instead:
//Directory directory = FSDirectory.open(new File("/tmp/testindex"));
IndexWriter iwriter = new IndexWriter(directory, analyzer, true,
                                      new IndexWriter.MaxFieldLength(25000));
Document doc = new Document();
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, Field.Store.YES, Field.Index.ANALYZED));
iwriter.addDocument(doc);
iwriter.close();

// Now search the index:
IndexReader ireader = IndexReader.open(directory); // read-only=true
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse a simple query that searches for "text":
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
Query query = parser.parse("text");
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
assertEquals(1, hits.length);
// Iterate through the results:
for (int i = 0; i < hits.length; i++) {
  Document hitDoc = isearcher.doc(hits[i].doc);
  assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
}
isearcher.close();
ireader.close();
directory.close();
```
The Lucene API is divided into several packages:
- org.apache.lucene.analysis defines an abstract Analyzer API for converting text from a java.io.Reader into a TokenStream, an enumeration of token Attributes. A TokenStream can be composed by applying TokenFilters to the output of a Tokenizer. Tokenizers and TokenFilters are strung together and applied with an Analyzer. A handful of Analyzer implementations are provided, including StopAnalyzer and the grammar-based StandardAnalyzer; a minimal tokenization sketch follows this package list.
- org.apache.lucene.document provides a simple Document class. A Document is simply a set of named Fields, whose values may be strings or instances of java.io.Reader.
- org.apache.lucene.index provides two primary classes: IndexWriter, which creates and adds documents to indices; and IndexReader, which accesses the data in the index.
- org.apache.lucene.search provides data structures to represent queries (e.g., TermQuery for individual words, PhraseQuery for phrases, and BooleanQuery for boolean combinations of queries) and the abstract Searcher, which turns queries into TopDocs. IndexSearcher implements search over a single IndexReader. A sketch that builds such queries programmatically appears after the list of steps below.
- org.apache.lucene.queryParser uses JavaCC to implement a QueryParser.
- org.apache.lucene.store defines an abstract class for storing persistent data, the Directory, which is a collection of named files written by an IndexOutput and read by an IndexInput. Multiple implementations are provided, including FSDirectory, which uses a file system directory to store files, and RAMDirectory which implements files as memory-resident data structures.
- org.apache.lucene.util contains a few handy data structures and utility classes, e.g., BitVector and PriorityQueue.
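As a small illustration of the analysis package described above, the following sketch tokenizes a string with StandardAnalyzer and prints each surviving term. It assumes a Lucene 3.x classpath (CharTermAttribute is available from 3.1 on); the class name, field name, and sample text are illustrative.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TokenizeExample {
  public static void main(String[] args) throws IOException {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
    // Ask the analyzer for a TokenStream over the text of a (hypothetical) field.
    TokenStream stream = analyzer.tokenStream("fieldname",
        new StringReader("This is the text to be indexed."));
    // Register interest in the term text of each token.
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {   // advance to the next token
      System.out.println(termAtt.toString());
    }
    stream.end();
    stream.close();
  }
}
```

With StandardAnalyzer's default stop set, only "text" and "indexed" should be printed; swapping in a different Analyzer changes the tokens without touching the consuming code.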
To use Lucene, an application should:

- Create Documents by adding Fields;
- Create an IndexWriter and add documents to it with addDocument();
- Call QueryParser.parse() to build a query from a string; and
- Create an IndexSearcher and pass the query to its search() method.
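Queries do not have to come from QueryParser; they can also be assembled from the query classes in org.apache.lucene.search mentioned above. The sketch below is a minimal example under the assumption that an IndexSearcher is already open; the class name, field name, and terms are illustrative.

```java
import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;

public class ProgrammaticQuery {
  /** Finds documents containing the phrase "clam chowder" and the word "manhattan". */
  static ScoreDoc[] chowderHits(IndexSearcher searcher, String field) throws IOException {
    PhraseQuery phrase = new PhraseQuery();          // exact phrase "clam chowder"
    phrase.add(new Term(field, "clam"));
    phrase.add(new Term(field, "chowder"));

    BooleanQuery query = new BooleanQuery();
    query.add(phrase, BooleanClause.Occur.MUST);     // +"clam chowder"
    query.add(new TermQuery(new Term(field, "manhattan")),
              BooleanClause.Occur.MUST);             // +manhattan

    return searcher.search(query, null, 10).scoreDocs;  // top 10 hits, no filter
  }
}
```

Parsing the string `"clam chowder" AND manhattan` with QueryParser and the same analyzer should yield an equivalent query.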
Two sample applications in the demo directory illustrate these steps:

- IndexFiles.java creates an index for all the files contained in a directory.
- SearchFiles.java prompts for queries and searches an index.
> java -cp lucene.jar:lucene-demo.jar:lucene-analyzers-common.jar org.apache.lucene.demo.IndexFiles rec.food.recipes/soups
adding rec.food.recipes/soups/abalone-chowder
[ ... ]

> java -cp lucene.jar:lucene-demo.jar:lucene-analyzers-common.jar org.apache.lucene.demo.SearchFiles
Query: chowder
Searching for: chowder
34 total matching documents
1. rec.food.recipes/soups/spam-chowder
[ ... thirty-four documents contain the word "chowder" ... ]

Query: "clam chowder" AND Manhattan
Searching for: +"clam chowder" +manhattan
2 total matching documents
1. rec.food.recipes/soups/clam-chowder
[ ... two documents contain the phrase "clam chowder" and the word "manhattan" ... ]
[ Note: "+" and "-" are canonical, but "AND", "OR" and "NOT" may be used. ]
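To see how QueryParser rewrites such input into the canonical form shown by the demo, the following sketch parses the second query above and prints its representation. The "contents" field name matches what the demo indexes; Version.LUCENE_CURRENT, StandardAnalyzer, and the class name are assumptions about the setup.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class ParseDemoQuery {
  public static void main(String[] args) throws ParseException {
    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
        new StandardAnalyzer(Version.LUCENE_CURRENT));
    Query query = parser.parse("\"clam chowder\" AND Manhattan");
    // Should print something like: +"clam chowder" +manhattan
    System.out.println(query.toString("contents"));
  }
}
```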
| Package | Description |
|---|---|
| org.apache.lucene | Top-level package. |
| org.apache.lucene.analysis | API and code to convert text into indexable/searchable tokens. |
| org.apache.lucene.analysis.standard | Standards-based analyzers implemented with JFlex. |
| org.apache.lucene.analysis.standard.std31 | Backwards-compatible implementation to match Version.LUCENE_31. |
| org.apache.lucene.analysis.standard.std34 | Backwards-compatible implementation to match Version.LUCENE_34. |
| org.apache.lucene.analysis.tokenattributes | Useful Attributes for text analysis. |
| org.apache.lucene.collation | CollationKeyFilter converts each token into its binary CollationKey using the provided Collator, and then encodes the CollationKey as a String using IndexableBinaryStringTools, allowing it to be stored as an index term. |
| org.apache.lucene.document | The logical representation of a Document for indexing and searching. |
| org.apache.lucene.index | Code to maintain and access indices. |
| org.apache.lucene.queryParser | A simple query parser implemented with JavaCC. |
| org.apache.lucene.search | Code to search indices. |
| org.apache.lucene.search.function | Programmatic control over document scores. |
| org.apache.lucene.search.payloads | Query mechanisms for finding and using payloads. |
| org.apache.lucene.search.spans | The calculus of spans (see the sketch below this table). |
| org.apache.lucene.store | Binary I/O API, used for all index data. |
| org.apache.lucene.util | Some utility classes. |
| org.apache.lucene.util.fst | Finite state transducers. |
| org.apache.lucene.util.packed | Random-access-capable arrays of packed positive longs. |
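As a small illustration of the org.apache.lucene.search.spans package listed above, the sketch below builds a SpanNearQuery that matches "clam" immediately followed by "chowder". The class name, field name, and terms are illustrative.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SpanExample {
  /** Matches "clam" directly followed by "chowder", in that order. */
  static SpanQuery clamChowderSpan(String field) {
    SpanQuery clam = new SpanTermQuery(new Term(field, "clam"));
    SpanQuery chowder = new SpanTermQuery(new Term(field, "chowder"));
    // slop = 0: no intervening positions; inOrder = true: "clam" must precede "chowder"
    return new SpanNearQuery(new SpanQuery[] { clam, chowder }, 0, true);
  }
}
```

A SpanQuery is itself a Query, so it can be handed to IndexSearcher.search() like any other; unlike a plain PhraseQuery, span queries also expose the positions of their matches for further composition.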