OCR Toolkit: Global Functions

The toolkit defines a number of free functions that are not image methods. They are defined in ocr_toolkit.py and can be imported into a Python script with

from gamera.toolkits.ocr.ocr_toolkit import *

Output text generation

While the class Page splits the image into Textline objects and possibly classifies the characters, it does not generate an output string. For this purpose, you can use the function textline_to_string.

textline_to_string

Returns a unicode string of the text in the given Textline.

Signature:

textline_to_string (textline, heuristic_rules="roman", extra_chars_dict={})

with

textline:
A Textline object containing the glyphs. The glyphs must already be classified.
heuristic_rules:

Depending on the alphabet, some characters can look very similar and need further heuristic rules for disambiguation, such as apostrophe and comma, which have the same shape and differ only in their position relative to the baseline.

When set to "roman", several rules specific to Latin alphabets are applied.

extra_chars_dict:
A dictionary of additional translations of classnames to character codes. This is necessary when you use class names that are not unicode names. Will be passed to return_char.

As this function uses return_char, the class names of the glyphs in textline must correspond to unicode character names, as described in the documentation of return_char.
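The apostrophe/comma case mentioned under heuristic_rules can be illustrated with a small standalone sketch. This is not the toolkit's actual rule set: the function name comma_or_apostrophe, its parameters, and the half-x-height cutoff are all assumptions chosen for illustration.

```python
def comma_or_apostrophe(glyph_bottom_y, baseline_y, x_height):
    """Decide between two glyphs of identical shape by vertical position.

    Hypothetical helper, not part of ocr_toolkit. Image coordinates are
    assumed, so y grows downward and "above" means a smaller y value.
    """
    # Assumed cutoff: halfway between the baseline and the top of the x-height.
    midline_y = baseline_y - x_height / 2.0
    if glyph_bottom_y < midline_y:
        # The glyph ends well above the baseline: treat it as an apostrophe.
        return "apostrophe"
    # The glyph reaches down to (or below) the baseline: treat it as a comma.
    return "comma"


# Baseline at y=100, x-height of 20 pixels, so the cutoff is y=90.
high_glyph = comma_or_apostrophe(glyph_bottom_y=82, baseline_y=100, x_height=20)
low_glyph = comma_or_apostrophe(glyph_bottom_y=104, baseline_y=100, x_height=20)
```

A real rule set would likely also look at neighbouring glyphs, but the sketch shows why position relative to the baseline is enough to separate the two classes in the simple case.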

return_char

Converts a unicode character name to a unicode symbol.

Signature:

return_char (classname, extra_chars_dict={})

with

classname:
A class name derived from a unicode character name. Example: latin.small.letter.a returns the character a.
extra_chars_dict:
A dictionary of additional translations of classnames to character codes. This is necessary when you use class names that are not unicode names. The character 'code' does not need to be an actual code, but can be any string. This can be useful, e.g. for ligatures:
return_char(glyph.get_main_id(), {'latin.small.ligature.st':'st'})

When classname is not listed in extra_chars_dict, it must correspond to a standard unicode character name, as in the examples of the following table:

Character  Unicode Name            Class Name
!          EXCLAMATION MARK        exclamation.mark
2          DIGIT TWO               digit.two
A          LATIN CAPITAL LETTER A  latin.capital.letter.a
a          LATIN SMALL LETTER A    latin.small.letter.a
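The naming convention in the table can be mimicked with Python's standard unicodedata module. The following is a minimal sketch, not the implementation of return_char; the function name classname_to_char is hypothetical, and it assumes that a class name is simply the unicode name in lower case with dots in place of spaces. How the toolkit treats unicode names that contain hyphens is not covered here.

```python
import unicodedata


def classname_to_char(classname, extra_chars_dict=None):
    """Translate a class name such as 'latin.small.letter.a' to a character.

    Hypothetical stand-in for illustration only, not ocr_toolkit.return_char.
    """
    extra = extra_chars_dict or {}
    # User-supplied translations take precedence and may be any string.
    if classname in extra:
        return extra[classname]
    # Assumed convention: dots stand for spaces, case is ignored.
    unicode_name = classname.replace(".", " ").upper()
    # unicodedata.lookup raises KeyError for an unknown character name.
    return unicodedata.lookup(unicode_name)


chars = [classname_to_char(name) for name in
         ("exclamation.mark", "digit.two",
          "latin.capital.letter.a", "latin.small.letter.a")]
ligature = classname_to_char("latin.small.ligature.st",
                             {"latin.small.ligature.st": "st"})
```

Class names that are neither in the dictionary nor valid unicode names raise a KeyError in this sketch, which makes misnamed training classes easy to spot.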

chars_make_words

Groups the given glyphs into words based on the horizontal distance between adjacent glyphs.

Signature:
chars_make_words (glyphs, threshold=None)

with

glyphs:
A list of Cc data types, each of which represents a character. All glyphs must stem from the same single line of text.
threshold:
Horizontal white space greater than threshold is considered a word-separating gap. When None, the threshold value is calculated automatically as 2.5 times the median white space between adjacent glyphs.

The result is a nested list of glyphs, with each sublist representing a word. This is the same data structure as is used in Textline.words.
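The grouping rule can be sketched without Gamera by representing each glyph as a (left_x, right_x) pair. This is an illustration under stated assumptions rather than the toolkit's code: group_into_words is a hypothetical name, the tuples stand in for Cc objects, and overlapping glyphs are assumed to have a gap of zero.

```python
from statistics import median


def group_into_words(boxes, threshold=None):
    """Group glyph extents (left_x, right_x) of one text line into words.

    Hypothetical sketch of the rule described for chars_make_words.
    """
    if not boxes:
        return []
    boxes = sorted(boxes)  # left-to-right reading order
    # White space between each glyph and its right neighbour (never negative).
    gaps = [max(0, nxt[0] - cur[1]) for cur, nxt in zip(boxes, boxes[1:])]
    if threshold is None:
        # Automatic threshold: 2.5 times the median gap, as described above.
        threshold = 2.5 * median(gaps) if gaps else 0
    words = [[boxes[0]]]
    for gap, box in zip(gaps, boxes[1:]):
        if gap > threshold:
            words.append([box])    # wide gap: start a new word
        else:
            words[-1].append(box)  # narrow gap: same word
    return words


# Gaps are 2, 2, 12, 2; the median is 2, so the automatic threshold is 5.
line = [(0, 8), (10, 18), (20, 28), (40, 48), (50, 58)]
words = group_into_words(line)
```

Using the median rather than the mean keeps the automatic threshold stable when a line contains a few very wide gaps.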

Segmentation

These functions are used in the segmentation methods of class Page. You will generally not need to call them, unless you are implementing a custom segmentation method.

get_line_glyphs

Splits image regions representing text lines into characters.

Signature:

get_line_glyphs (image, segments)

with

image:
The document image that is to be segmented further. It must contain the same underlying image data as the second argument, segments.
segments:
A list of Cc data types, each of which represents a text line region. The image views must correspond to image, i.e. each pixel has a value that is the unique label of the text line it belongs to. This is the interface used by the plugins in the "PageSegmentation" section of the Gamera core.

The result is returned as a list of Textline objects.

show_bboxes

Returns an RGB image with bounding boxes of the given glyphs as hollow rects. Useful for visualization and debugging of a segmentation.

Signature:

show_bboxes (image, glyphs)

with

image:
An image of the text document that is to be segmented.
glyphs:
List of rects which will be drawn on image as hollow rects. As all image types are derived from Rect, any image list can be passed.
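What "hollow rects" means can be shown with a plain Python sketch that draws only the outline of each rectangle. It does not use Gamera: draw_hollow_rects is a hypothetical name, the image is a nested list of RGB tuples on a white background, and rectangles are assumed to be (left, top, right, bottom) tuples with inclusive coordinates.

```python
def draw_hollow_rects(width, height, rects, color=(255, 0, 0)):
    """Return a height x width grid of RGB tuples with rectangle outlines.

    Hypothetical illustration only, not ocr_toolkit.show_bboxes.
    """
    white = (255, 255, 255)
    canvas = [[white] * width for _ in range(height)]
    for (x0, y0, x1, y1) in rects:
        # Top and bottom edges.
        for x in range(x0, x1 + 1):
            canvas[y0][x] = color
            canvas[y1][x] = color
        # Left and right edges; the interior is left untouched.
        for y in range(y0, y1 + 1):
            canvas[y][x0] = color
            canvas[y][x1] = color
    return canvas


canvas = draw_hollow_rects(6, 5, [(1, 1, 4, 3)])
```

Because only the border pixels are set, the glyph inside each box stays visible, which is what makes this kind of overlay useful for checking a segmentation by eye.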