search¶
Services for searching and matching of text.
indexing¶
CommonIndexer¶
Base class for interfaces to indexing engines for Pootle.
class translate.search.indexing.CommonIndexer.CommonDatabase(basedir, analyzer=None, create_allowed=True)¶
Base class for indexing support.
Any real implementation must override most methods of this class.
ANALYZER_DEFAULT = 6¶
The default analyzer to be used if nothing is configured.
ANALYZER_EXACT = 0¶
Exact matching: the query string must equal the whole term string.
ANALYZER_PARTIAL = 2¶
Partial matching: a document matches even if the query string only matches the beginning of the term value.
ANALYZER_TOKENIZE = 4¶
Tokenize terms and queries automatically.
INDEX_DIRECTORY_NAME = None¶
Override this with a string to be used as the name of the indexing directory/file in the filesystem.
QUERY_TYPE = None¶
Override this with the query class of the implementation.
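Since the ANALYZER_* values are bit flags, analyzer settings are built and inspected with bitwise operators. A self-contained sketch using the constant values documented above (the local constants and checks are illustrative, not the library's objects):

```python
# The documented analyzer constants are bit flags that can be OR-ed together.
ANALYZER_EXACT = 0      # exact matching (the absence of all other flags)
ANALYZER_PARTIAL = 2    # match only the beginning of the term value
ANALYZER_TOKENIZE = 4   # tokenize terms and queries automatically
ANALYZER_DEFAULT = 6    # i.e. ANALYZER_PARTIAL | ANALYZER_TOKENIZE

# Combine flags for a field that should be tokenized and partially matched:
flags = ANALYZER_PARTIAL | ANALYZER_TOKENIZE
assert flags == ANALYZER_DEFAULT

# Test a single option with a bitwise AND.  Note that ANALYZER_EXACT is 0,
# so "exact" is signalled by the absence of the other bits, not by its own bit.
tokenize_enabled = bool(flags & ANALYZER_TOKENIZE)
assert tokenize_enabled
```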
begin_transaction()¶
Begin a transaction.
You can group multiple modifications of a database into a single transaction. This avoids time-consuming database flushing and ensures that a changeset is committed either completely or not at all. No changes are written to disk until commit_transaction() is called; cancel_transaction() can be used to revert an ongoing transaction.
Database types that do not support transactions may silently ignore these calls.
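The intended usage pattern can be sketched as follows. SketchDatabase is a hypothetical stand-in for a real CommonDatabase subclass, so the example is runnable on its own:

```python
class SketchDatabase:
    """Minimal stand-in for a CommonDatabase subclass (illustration only)."""
    def __init__(self):
        self.committed, self.pending = [], []
    def begin_transaction(self):
        self.pending = []
    def index_document(self, data):
        self.pending.append(data)
    def commit_transaction(self):
        self.committed.extend(self.pending)  # "write changes to disk"
        self.pending = []
    def cancel_transaction(self):
        self.pending = []  # revert everything since begin_transaction

db = SketchDatabase()
db.begin_transaction()
try:
    db.index_document({"source": "hello", "target": "hallo"})
    db.index_document({"source": "world", "target": "Welt"})
    db.commit_transaction()   # either both documents are written ...
except Exception:
    db.cancel_transaction()   # ... or neither of them

assert len(db.committed) == 2
```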
cancel_transaction()¶
Cancel an ongoing transaction.
See begin_transaction() for details.
commit_transaction()¶
Submit the currently ongoing transaction and write changes to disk.
See begin_transaction() for details.
delete_doc(ident)¶
Delete the documents returned by a query.
Parameters: ident (int | list of tuples | dict | list of dicts | query (e.g. xapian.Query) | list of queries) – [list of] document IDs | dict describing a query | query
delete_document_by_id(docid)¶
Delete a specified document.
Parameters: docid (int) – the document ID to be deleted
field_analyzers = {}¶
Mapping of field names and analyzers - see set_field_analyzers().
flush(optimize=False)¶
Flush the content of the database to force changes to be written to disk.
Some databases also support index optimization.
Parameters: optimize (bool) – should the index be optimized if possible?
get_field_analyzers(fieldnames=None)¶
Return the analyzer that was mapped to a specific field.
See set_field_analyzers() for details.
Parameters: fieldnames (str | list of str | None) – the field (or fields) whose analyzer is requested; leave empty (or None) to request all fields.
Returns: the analyzer setting of the field - see CommonDatabase.ANALYZER_??? - or a dict of field names and analyzers
Return type: int | dict
get_query_result(query)¶
Return an object containing the results of a query.
Parameters: query (a query object of the real implementation) – a pre-compiled query
Returns: an object that allows access to the results
Return type: subclass of CommonEnquire
index_document(data)¶
Add the given data to the database.
Parameters: data (dict | list of str) – the data to be indexed. A dictionary is treated as fieldname:value combinations. If the fieldname is None, the value is interpreted as a plain term or a list of plain terms. Lists of terms are indexed separately; lists of strings are treated as plain terms.
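A sketch of how the documented interpretation of the data argument works; interpret_index_data is a hypothetical helper written for illustration, not part of the API:

```python
def interpret_index_data(data):
    """Sketch of index_document's documented handling of data: a dict maps
    fieldname -> value, a fieldname of None (or a plain list of strings)
    yields plain terms that are not tied to a field."""
    pairs = []
    if isinstance(data, dict):
        for field, value in data.items():
            values = value if isinstance(value, list) else [value]
            for term in values:
                pairs.append((field, term))  # field may be None -> plain term
    else:  # a list of strings: each entry is a plain term
        pairs.extend((None, term) for term in data)
    return pairs

# A dict is treated as fieldname:value combinations:
assert interpret_index_data({"source": "hello"}) == [("source", "hello")]
# A list of strings is treated as plain terms:
assert interpret_index_data(["foo", "bar"]) == [(None, "foo"), (None, "bar")]
```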
make_query(args, require_all=True, analyzer=None)¶
Create simple queries (strings or field searches) or combine multiple queries (AND/OR).
To specify rules for field searches, you may want to take a look at set_field_analyzers(). The parameter ‘match_text_partial’ can override the previously defined default setting.
Parameters:
- args (list of queries | single query | str | dict) – queries, a search string, or a description of a field query. Examples: [xapian.Query("foo"), xapian.Query("bar")], xapian.Query("foo"), "bar", {"foo": "bar", "foobar": "foo"}
- require_all (bool) – boolean operator (True -> AND (default) / False -> OR)
- analyzer (int) – (only applicable for ‘dict’ or ‘str’) Define query options (partial matching, exact matching, tokenizing, …) as bitwise combinations of CommonIndexer.ANALYZER_???. This can override previously defined field analyzer settings. If analyzer is None (default), the configured analyzer for the field is used.
Returns: the combined query
Return type: query type of the specific implementation
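The documented call forms can be illustrated with a minimal stand-in query model; the Query class and this make_query are hypothetical sketches, not the toolkit's implementation:

```python
class Query:
    """Hypothetical stand-in for an implementation's query type."""
    def __init__(self, op, parts):
        self.op, self.parts = op, parts
    def __repr__(self):
        return "(%s)" % ((" %s " % self.op).join(map(str, self.parts)))

def make_query(args, require_all=True):
    """Sketch: combine sub-queries with AND (require_all=True) or OR."""
    op = "AND" if require_all else "OR"
    if isinstance(args, dict):    # a description of field queries
        parts = ["%s:%s" % (field, value)
                 for field, value in sorted(args.items())]
    elif isinstance(args, list):  # a list of pre-built queries or strings
        parts = args
    else:                         # a single search string or query
        parts = [args]
    return Query(op, parts)

print(make_query(["foo", "bar"], require_all=False))  # (foo OR bar)
print(make_query({"source": "hello"}))                # (source:hello)
```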
search(query, fieldnames)¶
Return a list of the contents of specified fields for all matches of a query.
Parameters:
- query (a query object of the real implementation) – the query to be issued
- fieldnames (string | list of strings) – the name(s) of a field of the document content
Returns: a list of dicts containing the specified field(s)
Return type: list of dicts
set_field_analyzers(field_analyzers)¶
Set the analyzers for different fields of the database documents.
All bitwise combinations of CommonIndexer.ANALYZER_??? are possible.
Parameters: field_analyzers (dict containing field names and analyzers) – mapping of field names and analyzers
Raises: TypeError – invalid values in field_analyzers
class translate.search.indexing.CommonIndexer.CommonEnquire(enquire)¶
An enquire object contains the information about the result of a request.
get_matches(start, number)¶
Return a specified number of qualified matches of a previous query.
Parameters:
- start (int) – index of the first match to return (starting from zero)
- number (int) – the number of matching entries to return
Returns: a set of matching entries and some statistics
Return type: tuple of (returned number, available number, matches); “matches” is a dictionary of ["rank", "percent", "document", "docid"]
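The (returned number, available number, matches) return shape lends itself to paging through a result set. A sketch against a hypothetical enquire-like object (iter_all_matches and FakeEnquire are illustrative, not part of the API):

```python
def iter_all_matches(enquire, page_size=10):
    """Sketch: page through all matches of a CommonEnquire-style object.
    Each yielded match is a dict with keys such as "rank", "percent",
    "document" and "docid" (per the documented return shape)."""
    start = 0
    while True:
        returned, available, matches = enquire.get_matches(start, page_size)
        for match in matches:
            yield match
        start += returned
        if returned == 0 or start >= available:
            break

# Self-contained demonstration with a fake enquire object:
class FakeEnquire:
    def __init__(self, docids):
        self.docids = docids
    def get_matches(self, start, number):
        chunk = self.docids[start:start + number]
        return (len(chunk), len(self.docids), [{"docid": d} for d in chunk])

docids = [m["docid"]
          for m in iter_all_matches(FakeEnquire(list(range(25))), page_size=10)]
assert docids == list(range(25))
```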
get_matches_count()¶
Return the estimated number of matches.
Use translate.search.indexing.CommonIndexer.search() to retrieve the exact number of matches.
Returns: the estimated number of matches
Return type: int
translate.search.indexing.CommonIndexer.is_available()¶
Check if this indexing engine interface is usable.
This function must exist in every module that contains indexing engine interfaces.
Returns: is this interface usable?
Return type: bool
PyLuceneIndexer¶
Interface for the PyLucene (v2.x) indexing engine.
See PyLuceneIndexer1.py for the PyLucene v1.x interface.
class translate.search.indexing.PyLuceneIndexer.PyLuceneDatabase(basedir, analyzer=None, create_allowed=True)¶
Manage and use a PyLucene indexing database.
begin_transaction()¶
PyLucene does not support transactions, so this function just opens the database for write access. Call cancel_transaction() or commit_transaction() to close write access and remove the exclusive lock from the database directory.
cancel_transaction()¶
PyLucene does not support transactions, so this function just closes the database write access and removes the exclusive lock.
See begin_transaction() for details.
commit_transaction()¶
PyLucene does not support transactions, so this function just closes the database write access and removes the exclusive lock.
See begin_transaction() for details.
delete_doc(ident)¶
Delete the documents returned by a query.
Parameters: ident (int | list of tuples | dict | list of dicts | query (e.g. xapian.Query) | list of queries) – [list of] document IDs | dict describing a query | query
delete_document_by_id(docid)¶
Delete a specified document.
Parameters: docid (int) – the document ID to be deleted
flush(optimize=False)¶
Flush the content of the database to force changes to be written to disk.
Some databases also support index optimization.
Parameters: optimize (bool) – should the index be optimized if possible?
get_field_analyzers(fieldnames=None)¶
Return the analyzer that was mapped to a specific field.
See set_field_analyzers() for details.
Parameters: fieldnames (str | list of str | None) – the field (or fields) whose analyzer is requested; leave empty (or None) to request all fields.
Returns: the analyzer setting of the field - see CommonDatabase.ANALYZER_??? - or a dict of field names and analyzers
Return type: int | dict
get_query_result(query)¶
Return an object containing the results of a query.
Parameters: query (a query object of the real implementation) – a pre-compiled query
Returns: an object that allows access to the results
Return type: subclass of CommonEnquire
index_document(data)¶
Add the given data to the database.
Parameters: data (dict | list of str) – the data to be indexed. A dictionary is treated as fieldname:value combinations. If the fieldname is None, the value is interpreted as a plain term or a list of plain terms. Lists of terms are indexed separately; lists of strings are treated as plain terms.
make_query(*args, **kwargs)¶
Create simple queries (strings or field searches) or combine multiple queries (AND/OR).
To specify rules for field searches, you may want to take a look at set_field_analyzers(). The parameter ‘match_text_partial’ can override the previously defined default setting.
Parameters:
- args (list of queries | single query | str | dict) – queries, a search string, or a description of a field query. Examples: [xapian.Query("foo"), xapian.Query("bar")], xapian.Query("foo"), "bar", {"foo": "bar", "foobar": "foo"}
- require_all (bool) – boolean operator (True -> AND (default) / False -> OR)
- analyzer (int) – (only applicable for ‘dict’ or ‘str’) Define query options (partial matching, exact matching, tokenizing, …) as bitwise combinations of CommonIndexer.ANALYZER_???. This can override previously defined field analyzer settings. If analyzer is None (default), the configured analyzer for the field is used.
Returns: the combined query
Return type: query type of the specific implementation
search(query, fieldnames)¶
Return a list of the contents of specified fields for all matches of a query.
Parameters:
- query (a query object of the real implementation) – the query to be issued
- fieldnames (string | list of strings) – the name(s) of a field of the document content
Returns: a list of dicts containing the specified field(s)
Return type: list of dicts
set_field_analyzers(field_analyzers)¶
Set the analyzers for different fields of the database documents.
All bitwise combinations of CommonIndexer.ANALYZER_??? are possible.
Parameters: field_analyzers (dict containing field names and analyzers) – mapping of field names and analyzers
Raises: TypeError – invalid values in field_analyzers
class translate.search.indexing.PyLuceneIndexer.PyLuceneHits(enquire)¶
An enquire object contains the information about the result of a request.
get_matches(start, number)¶
Return a specified number of qualified matches of a previous query.
Parameters:
- start (int) – index of the first match to return (starting from zero)
- number (int) – the number of matching entries to return
Returns: a set of matching entries and some statistics
Return type: tuple of (returned number, available number, matches); “matches” is a dictionary of ["rank", "percent", "document", "docid"]
get_matches_count()¶
Return the estimated number of matches.
Use translate.search.indexing.CommonIndexer.search() to retrieve the exact number of matches.
Returns: the estimated number of matches
Return type: int
XapianIndexer¶
Interface to the Xapian indexing engine for the Translate Toolkit.
Xapian v1.0 or higher is supported.
If you are interested in writing an interface for Xapian 0.x, you should check out the following:
svn export -r 7235 https://translate.svn.sourceforge.net/svnroot/translate/src/branches/translate-search-indexer-generic-merging/translate/search/indexer/
It is not completely working, but it should give you a good start.
class translate.search.indexing.XapianIndexer.XapianDatabase(basedir, analyzer=None, create_allowed=True)¶
Interface to the Xapian indexer.
begin_transaction()¶
Begin a transaction.
Xapian supports transactions to group multiple database modifications. This avoids intermediate flushing and therefore increases performance.
cancel_transaction()¶
Cancel an ongoing transaction.
No changes since the last execution of begin_transaction() are written.
commit_transaction()¶
Submit the changes of an ongoing transaction.
All changes since the last execution of begin_transaction() are written.
delete_doc(ident)¶
Delete the documents returned by a query.
Parameters: ident (int | list of tuples | dict | list of dicts | query (e.g. xapian.Query) | list of queries) – [list of] document IDs | dict describing a query | query
delete_document_by_id(docid)¶
Delete a specified document.
Parameters: docid (int) – the document ID to be deleted
flush(optimize=False)¶
Force the current changes to be written to disk immediately.
Parameters: optimize (bool) – ignored for Xapian
get_field_analyzers(fieldnames=None)¶
Return the analyzer that was mapped to a specific field.
See set_field_analyzers() for details.
Parameters: fieldnames (str | list of str | None) – the field (or fields) whose analyzer is requested; leave empty (or None) to request all fields.
Returns: the analyzer setting of the field - see CommonDatabase.ANALYZER_??? - or a dict of field names and analyzers
Return type: int | dict
get_query_result(query)¶
Return an object containing the results of a query.
Parameters: query (xapian.Query) – a pre-compiled xapian query
Returns: an object that allows access to the results
Return type: XapianIndexer.CommonEnquire
index_document(data)¶
Add the given data to the database.
Parameters: data (dict | list of str) – the data to be indexed. A dictionary is treated as fieldname:value combinations. If the fieldname is None, the value is interpreted as a plain term or a list of plain terms. Lists of terms are indexed separately; lists of strings are treated as plain terms.
make_query(*args, **kwargs)¶
Create simple queries (strings or field searches) or combine multiple queries (AND/OR).
To specify rules for field searches, you may want to take a look at set_field_analyzers(). The parameter ‘match_text_partial’ can override the previously defined default setting.
Parameters:
- args (list of queries | single query | str | dict) – queries, a search string, or a description of a field query. Examples: [xapian.Query("foo"), xapian.Query("bar")], xapian.Query("foo"), "bar", {"foo": "bar", "foobar": "foo"}
- require_all (bool) – boolean operator (True -> AND (default) / False -> OR)
- analyzer (int) – (only applicable for ‘dict’ or ‘str’) Define query options (partial matching, exact matching, tokenizing, …) as bitwise combinations of CommonIndexer.ANALYZER_???. This can override previously defined field analyzer settings. If analyzer is None (default), the configured analyzer for the field is used.
Returns: the combined query
Return type: query type of the specific implementation
search(query, fieldnames)¶
Return a list of the contents of specified fields for all matches of a query.
Parameters:
- query (xapian.Query) – the query to be issued
- fieldnames (string | list of strings) – the name(s) of a field of the document content
Returns: a list of dicts containing the specified field(s)
Return type: list of dicts
set_field_analyzers(field_analyzers)¶
Set the analyzers for different fields of the database documents.
All bitwise combinations of CommonIndexer.ANALYZER_??? are possible.
Parameters: field_analyzers (dict containing field names and analyzers) – mapping of field names and analyzers
Raises: TypeError – invalid values in field_analyzers
class translate.search.indexing.XapianIndexer.XapianEnquire(enquire)¶
Interface to the Xapian object for storing sets of matches.
get_matches(start, number)¶
Return a specified number of qualified matches of a previous query.
Parameters:
- start (int) – index of the first match to return (starting from zero)
- number (int) – the number of matching entries to return
Returns: a set of matching entries and some statistics
Return type: tuple of (returned number, available number, matches); “matches” is a dictionary of ["rank", "percent", "document", "docid"]
get_matches_count()¶
Return the estimated number of matches.
Use translate.search.indexing.CommonIndexer.search() to retrieve the exact number of matches.
Returns: the estimated number of matches
Return type: int
lshtein¶
A class to calculate a similarity based on the Levenshtein distance.
See http://en.wikipedia.org/wiki/Levenshtein_distance.
If available, the python-Levenshtein library will be used, as it provides better performance through its native implementation.
translate.search.lshtein.distance(a, b, stopvalue=0)¶
Same as python_distance in functionality. This uses the fast C version if we detected it earlier.
Note that this does not support arbitrary sequence types, but only string types.
translate.search.lshtein.native_distance(a, b, stopvalue=0)¶
Same as python_distance in functionality. This uses the fast C version if we detected it earlier.
Note that this does not support arbitrary sequence types, but only string types.
translate.search.lshtein.python_distance(a, b, stopvalue=-1)¶
Calculates the distance for use in similarity calculation. Python version.
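The idea behind python_distance can be sketched as a standard dynamic-programming Levenshtein distance with an early-exit threshold. This is an illustrative re-implementation, not the toolkit's exact code:

```python
def levenshtein(a, b, stopvalue=-1):
    """Sketch of a Levenshtein distance with an early-exit threshold.
    If stopvalue >= 0 and every value in a DP row exceeds it, the final
    distance must also exceed it (row minima never decrease), so we can
    stop early and return the current row minimum."""
    current = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        previous, current = current, [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            current[j] = min(previous[j] + 1,               # deletion
                             current[j - 1] + 1,            # insertion
                             previous[j - 1] + (ca != cb))  # substitution
        if stopvalue >= 0 and min(current) > stopvalue:
            return min(current)   # cannot get back under the threshold
    return current[-1]

assert levenshtein("kitten", "sitting") == 3
assert levenshtein("abc", "abc") == 0
```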
match¶
Class to perform translation memory matching from a store of translation units.
class translate.search.match.matcher(store, max_candidates=10, min_similarity=75, max_length=70, comparer=None, usefuzzy=False)¶
A class that will do matching and store configuration for the matching process.
buildunits(candidates)¶
Builds a list of units conforming to base API, with the score in the comment.
extendtm(units, store=None, sort=True)¶
Extends the memory with extra unit(s).
Parameters:
- units – The units to add to the TM.
- store – Optional store from where some metadata can be retrieved and associated with each unit.
- sort – Optional parameter that can be set to False to suppress sorting of the candidates list. This should probably only be used in matcher.inittm().
getstartlength(min_similarity, text)¶
Calculates the minimum length we are interested in. The extra fat is because we don’t use plain character distance only.
getstoplength(min_similarity, text)¶
Calculates a length beyond which we are not interested. The extra fat is because we don’t use plain character distance only.
inittm(stores, reverse=False)¶
Initialises the memory for later use. We use simple base units for speedup.
matches(text)¶
Returns a list of possible matches for the given source text.
Parameters: text (String) – The text that will be searched for in the translation memory
Return type: list
Returns: a list of units with the source and target strings from the translation memory. If self.addpercentage is True (default) the match quality is given as a percentage in the notes.
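The percentage recorded in the notes reflects a Levenshtein-based similarity. The usual way to normalise an edit distance into a match percentage can be sketched as follows (this is the common formula, not necessarily the toolkit's exact one):

```python
def match_percentage(distance, source, candidate):
    """Sketch: convert an edit distance into a similarity percentage
    by normalising against the longer of the two strings."""
    longest = max(len(source), len(candidate))
    if longest == 0:
        return 100.0  # two empty strings are identical
    return 100.0 * (1 - distance / longest)

# "kitten" vs "sitting": edit distance 3, longest string is 7 characters
assert round(match_percentage(3, "kitten", "sitting"), 1) == 57.1
```

A min_similarity of 75 would therefore reject this pair, since 57.1 < 75.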
setparameters(max_candidates=10, min_similarity=75, max_length=70)¶
Sets the parameters without reinitialising the TM. If a parameter is not specified, it is set to the default, not ignored.
usable(unit)¶
Returns whether this translation unit is usable for TM.
translate.search.match.sourcelen(unit)¶
Returns the length of the source string.
class translate.search.match.terminologymatcher(store, max_candidates=10, min_similarity=75, max_length=500, comparer=None)¶
A matcher with settings specifically for terminology matching.
buildunits(candidates)¶
Builds a list of units conforming to base API, with the score in the comment.
extendtm(units, store=None, sort=True)¶
Extends the memory with extra unit(s).
Parameters:
- units – The units to add to the TM.
- store – Optional store from where some metadata can be retrieved and associated with each unit.
- sort – Optional parameter that can be set to False to suppress sorting of the candidates list. This should probably only be used in matcher.inittm().
getstartlength(min_similarity, text)¶
Calculates the minimum length we are interested in. The extra fat is because we don’t use plain character distance only.
getstoplength(min_similarity, text)¶
Calculates a length beyond which we are not interested. The extra fat is because we don’t use plain character distance only.
inittm(store)¶
Normal initialisation, but converts all source strings to lower case.
matches(text)¶
Normal matching after converting text to lower case. Then replace with the original unit to retain comments, etc.
setparameters(max_candidates=10, min_similarity=75, max_length=70)¶
Sets the parameters without reinitialising the TM. If a parameter is not specified, it is set to the default, not ignored.
usable(unit)¶
Returns whether this translation unit is usable for terminology.
translate.search.match.unit2dict(unit)¶
Converts a pounit to a simple dict structure for use over the web.
segment¶
Module to deal with different types and uses of segmentation.
translate.search.segment.character_iter(text)¶
Returns an iterator over the characters in text.
translate.search.segment.characters(text)¶
Returns a list of characters in text.
translate.search.segment.sentence_iter(text)¶
Returns an iterator over the sentences in text.
translate.search.segment.sentences(text)¶
Returns a list of sentences in text.
translate.search.segment.word_iter(text)¶
Returns an iterator over the words in text.
translate.search.segment.words(text)¶
Returns a list of words in text.
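The three granularities (characters, words, sentences) can be approximated with simple regular expressions; these splitters are illustrative sketches, not the toolkit's actual segmentation rules:

```python
import re

def characters(text):
    """Every character becomes its own segment."""
    return list(text)

def words(text):
    """Word-like runs, keeping simple apostrophe contractions together."""
    return re.findall(r"\w+(?:'\w+)?", text)

def sentences(text):
    """Naive split after sentence-ending punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

assert characters("ab") == ["a", "b"]
assert words("Hello, world!") == ["Hello", "world"]
assert sentences("One. Two! Three?") == ["One.", "Two!", "Three?"]
```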
terminology¶
A class that does terminology matching.