Detailed description of the procedures for marking character areas, binarizing them and cleaning them of distractions in the “Cutouts” application.
1. Description of the application
This application supports the preparation of proper training material for an OCR system. By proper training material we mean a set of shapes (areas) separated from the source document that together make up the font used to print that document. Proper training material has the following characteristics:
1. The separated area contains the full shape of a single character and does not include any distractions or fragments of other shapes. Remember that diacritics (e.g. accents or hooks) are an integral part of the character.
2. The separated shape reflects the features of the font used for printing as accurately as possible.
3. The graphic form of the separated area is associated with the corresponding character, entered from the virtual or the computer’s keyboard.
4. The separated shape is marked as either correct or incorrect.
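The four characteristics listed above can be pictured as one record per separated shape. The following is only a minimal sketch: the class name, field names and coordinate layout are hypothetical and do not describe how “Cutouts” actually stores its data.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TrainingSample:
    """One separated character shape (all names here are hypothetical)."""
    # Feature 1: marked area as (left, top, right, bottom) in page pixels.
    box: Tuple[int, int, int, int]
    # Feature 2: binarized shape, 1 = ink, 0 = background.
    bitmap: List[List[int]]
    # Feature 3: character entered from the keyboard.
    character: str
    # Feature 4: True when the shape is marked as incorrect (misprint).
    is_misprint: bool = False


sample = TrainingSample(
    box=(10, 4, 13, 7),
    bitmap=[[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    character="L",
)
print(sample.character, sample.is_misprint)
```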
2. Preparation of the training material
Within the application, the preparation of training material is divided into the following steps:
1. Marking the area that contains the full shape of the character (compare feature 1), based either on the application’s suggestion or on a manually marked area. If the suggested area is incorrect and will not be used, it should be deleted; the application then moves on to the next suggestion (back to step 1).
2. Selecting the binarization level (threshold) that most accurately separates the shape of the character (compare feature 2). In this step the application suggests initial binarization settings, which the user can correct if necessary.
3. Deleting, with the “brush”, elements that do not belong to the genuine shape of the letter (compare feature 1).
4. Entering the character that the separated area represents (compare feature 3).
5. Saving the shape or marking it as incorrectly printed (compare feature 4).
After step 5 the application returns to step 1, until all characters in the document have been processed.
The individual steps and recommendations are described in detail in the examples below.
3.1 Step 1: Marking an area that includes a diacritic
The area suggested by the application does not contain the diacritic that is part of the processed letter. In such a case the area should be widened so that it includes the whole shape of the letter.
3.2 Step 1: Connecting neighbouring areas
The application’s suggestion is wrong because the V character has been divided into two overlapping areas:
One of them should be widened so that it includes the whole shape of the character; the other should be rejected.
The two areas suggested by the application and the third, resulting area are presented below.
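Widening one area until it covers both fragments amounts to taking the smallest rectangle that contains the two suggested areas. The sketch below illustrates that idea only; the `(left, top, right, bottom)` layout with exclusive right and bottom edges, and the function names, are assumptions and not part of the application.

```python
from typing import Tuple

# Hypothetical box layout: (left, top, right, bottom), right/bottom exclusive.
Box = Tuple[int, int, int, int]


def overlaps(a: Box, b: Box) -> bool:
    """Return True when the two rectangles share at least one pixel."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def union(a: Box, b: Box) -> Box:
    """Return the smallest rectangle that contains both rectangles."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


# Two overlapping areas, each covering one stroke of a "V".
left_stroke = (10, 5, 18, 20)
right_stroke = (16, 5, 25, 20)

if overlaps(left_stroke, right_stroke):
    merged = union(left_stroke, right_stroke)
    print(merged)
```

In the application itself the same result is reached by hand: one area is stretched to the merged size and the other is rejected.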
3.3 Step 1: Incorrect area that does not contain a character
The area suggested by the application may not contain a character at all. In such a case the suggested area should be deleted (rejected). Examples of areas containing grime, a decorative graphic element and a fragment of a table border are presented below.
3.4 Step 2: Selecting the binarization level (threshold)
In step 2 the binarization level (threshold) is controlled with the +/- buttons or with the scroll bar:
Binarization is the conversion of a multicolour (or greyscale) picture into a monochromatic (two-colour) one. When correcting the binarization level, aim for a shape that reflects the printed character as accurately as possible (compare feature 2). The example below shows how the choice of binarization level influences the resulting (monochromatic) look of the L character. From the right: the genuine area (greyscale picture) and the binarization results for increasing values of the binarization level.
As the example shows, increasing the binarization level made it possible to remove the horizontal distortions (noise). However, too high a value of the binarization level can lead to “blurring” of the shape. In the example above, the second binarization level is therefore the correct one.
The next example shows how the choice of binarization level can help to separate the character’s area while also removing shapes that do not belong to the character. Again, the genuine area is shown, followed by the results for increasing binarization levels.
The purpose of selecting the binarization level is to separate the most accurate shape of the character in monochromatic form. It is not always possible to remove shapes of other origin, especially when they have the same shade as the character.
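The effect of the binarization level can be illustrated with a simple global threshold. This is a sketch under stated assumptions, not the algorithm used by “Cutouts”: it assumes greyscale values from 0 (black) to 255 (white) and that a pixel darker than the level becomes black, so a higher level produces more black. The sample values are invented for illustration.

```python
from typing import List


def binarize(gray: List[List[int]], level: int) -> List[List[int]]:
    """Return 1 (black) for pixels darker than `level`, otherwise 0 (white)."""
    return [[1 if pixel < level else 0 for pixel in row] for row in gray]


def show(bitmap: List[List[int]]) -> str:
    """Render a bitmap as text: '#' for black, '.' for white."""
    return "\n".join("".join("#" if p else "." for p in row) for row in bitmap)


# Invented greyscale patch of an "L": 0 = black ink, 255 = white paper.
gray = [
    [40, 230, 235, 240],
    [45, 228, 232, 238],
    [150, 225, 231, 236],  # 150: faintly printed part of the stem
    [50, 55, 60, 200],     # 200: light smudge next to the foot
]

for level in (100, 180, 220):
    print(f"level {level}")
    print(show(binarize(gray, level)))
```

With these values the lowest level breaks the stem at the faint pixel, the middle level yields the complete “L”, and the highest level merges the smudge into the foot, which mirrors the advice to stop at the level that gives the most accurate shape.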
3.5 Step 3: Erasing a neighbouring character
When the rectangular area cannot be limited to a single character (see the illustration below), elements belonging to, for example, other characters should be deleted with the “brush”. The example below shows the genuine area and the successive stages of deleting the unwanted element.
The erasing tool should not be used to obtain an “ideal” shape (compare the advice for step 2).
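Conceptually, a brush stroke turns every black pixel under the brush into white. The sketch below assumes a round brush and a bitmap in which 1 means black; the function name and parameters are hypothetical and only illustrate the idea of removing a fragment of a neighbouring character.

```python
from typing import List


def erase(bitmap: List[List[int]], cx: int, cy: int, radius: int) -> None:
    """Set every pixel within `radius` of (cx, cy) to 0 (white), in place."""
    for y, row in enumerate(bitmap):
        for x in range(len(row)):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                row[x] = 0


# An "L" on the left and a stroke of a neighbouring character in column 4.
bitmap = [
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 0, 1],
]

# Drag the brush down along the unwanted stroke.
for y in range(len(bitmap)):
    erase(bitmap, cx=4, cy=y, radius=1)

for row in bitmap:
    print("".join("#" if p else "." for p in row))
```

Only the foreign stroke is removed; the pixels of the “L” itself are left exactly as binarization produced them.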
3.6 Step 5: Incorrectly printed characters
Given the nature of the documents on which the training material is based, single characters (or larger fragments of text) may be incorrectly printed. In such cases it is impossible to separate a shape that accurately represents the character. The shape must be marked as a misprint in step 5 (see feature 4) and, despite the misprint, you should enter the character that the shape represents (if that is possible).
Examples of shapes that should be marked as misprints are presented below. In each case the genuine area and the binarized (monochromatic) shape are given:
1. “Un-driven” (incompletely struck) type for the a character:
2. “Un-driven” (incompletely struck) type for the A character:
3. “Un-driven” (incompletely struck) type for the S character. The example shows how increasing the binarization level (threshold) leads to blurring of the genuine shape of the character.
When the misprint is considerable, it may be uncertain which character the shape represents. In that case, in step 4, enter the character recognized from the broader context (e.g. the word). An example is the “S” character in the word “Sądy” above.
With some misprints the shape of the character is deformed but still correct. An illustration is the “a” character below, which is still a correct shape despite the grime:
Case studies from work with “Cutouts”
1. Changing the binarization level
Is it necessary to mark characters manually in a case like the one below?
In a case like the one circled in the picture above, pressing the plus button twice fills the white dot with black, after which the binarization can be adjusted further. In most cases no adjustment of the binarization is necessary. In the example below, for instance, the binarization level should not be changed, because the letter has a distinct white hole before binarization.
Should the case presented below be treated as a misprint or not?
Unfortunately, no numerical criterion (e.g. the percentage of the character that is missing) can be given for classifying a case as a misprint. This case is not a misprint. The misprint designation was created to indicate characters that are nearly unreadable and that we identify by other means, e.g. from the context.
The examples of “y” and “w” presented above should be marked as misprints, even though the system recognized them correctly.
The illustration below shows the word "rachuukowości" (in English, something like “accouuts”), in which the program recognized the second "u" as an "n", as the correct spelling of the word requires. Should it be marked as correct, because the program recognized the word from the context, or as incorrect, because the program may then recognize "u" as "n"?
The goal is to recognize characters correctly: if the picture shows “u”, then the recognized text should also contain “u”.
Manual modification is crucial especially when neighbouring characters overlap the given sample.
The “i” above should be treated as correct.
Is it an “i” or a misprinted “j”?
This is a “j”, and it should be marked as a misprint.
A few examples that occurred in the corrected texts are presented below. Misprints are marked in red under the illustration.
n, k, i, k, r, d, p, e, s, i, z, a, a, w, m, w, ś, p
What should be entered as the correct character in the examples below?
What should be done with vestigial letters such as those below? It is not always possible to recognize them from the context.
If the word can be recognized from its context, the proper character should be entered and the shape marked as a misprint.
Should the area in the case presented below be considered incorrect, or marked manually as “y”? Can I use binarization? What should I do if the program marks only a small fragment of the letter: skip the character?
This letter should be marked as “y”, but also as a misprint. The character can be slightly binarized in this case. The binarized element does not resemble anything.
Is this a correct character or a misprint?
This is a correct character: a Latin small letter with circumflex.
Is that a misprint or a special character?
This is a misprinted “s”. This follows from the context of the word “czasie” (Eng. “time”).
Do full stops in text printed in cursive also have to be marked as cursive, or does it not matter?
For full stops it does not matter.
3. Division of words
soft (optional) hyphen, U+00AD
A few links and a test illustrating the method used by ABBYY FineReader are presented below.
Word division should be marked with the character “¬”. It is not a soft hyphen, but its graphic form makes it easier to enter the right symbol and helps to prevent mistakes (http://www.fileformat.info/info/unicode/char/ac/index.htm). ABBYY FineReader uses a similar method. Samples illustrating this method are available at http://www.digitalizacja.pl/ocr/soft_hyphen_test.zip. In the DOC file opened in MS Word (with the display of hidden characters switched on) you can see how the characters generated by FineReader are changed into soft hyphens. PDF, ePUB and TXT files were also generated as examples. In the ePUB and TXT files you can see how the text behaves when a plain hyphen is used instead of a soft one. In some places FineReader recognized the hyphen incorrectly and used a plain one.
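The relationship between the “¬” marker (U+00AC) and the soft hyphen (U+00AD) can be shown with a short post-processing sketch. The functions below are hypothetical helpers, not part of “Cutouts” or FineReader, and they assume the marker appears at the end of a line where a word was divided.

```python
NOT_SIGN = "\u00ac"     # the visible marker entered during verification
SOFT_HYPHEN = "\u00ad"  # U+00AD, soft (optional) hyphen


def to_soft_hyphens(text: str) -> str:
    """Replace every visible division marker with a real soft hyphen."""
    return text.replace(NOT_SIGN, SOFT_HYPHEN)


def join_divided_words(text: str) -> str:
    """Remove each marker together with the line break that follows it."""
    return text.replace(NOT_SIGN + "\n", "")


# A word divided across two lines, as it might be entered by the verifier.
divided = "rachun\u00ac\nkowo\u015bci"

print(repr(to_soft_hyphens(divided)))
print(join_divided_words(divided))
```

The first helper keeps the line layout and only swaps the marker; the second restores the undivided word, which is what a reader of reflowed text would expect to see.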
Which character from the virtual keyboard should be chosen in this case?
The best choice is the em dash (Polish “pauza”), U+2014. When you hover the mouse over this character you can check that its code is U+2014, because there are two nearly identical characters there.
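Dash-like characters that look almost the same on screen can be told apart by their code points and Unicode names. The snippet below uses Python's standard `unicodedata` module; the selection of look-alike characters is only an illustration and does not reflect which characters the virtual keyboard actually offers.

```python
import unicodedata

# A few visually similar horizontal strokes (an illustrative selection).
candidates = ["\u002d", "\u2013", "\u2014", "\u2015"]

for ch in candidates:
    # Print the code point in U+XXXX form next to the official name.
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")


def is_em_dash(ch: str) -> bool:
    """Return True only for U+2014, the character recommended here."""
    return ch == "\u2014"
```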
ATTENTION: in documents printed in a gothic font, the gothic button is pressed by default. When normally printed type appears in the text, this option should be switched off.
On the page: http://en.wikipedia.org/wiki/Long_s there is information about the long s character.
On the page: http://en.wikipedia.org/wiki/%C3%9F there is information about the ß character.
Which characters should be used for the quotation marks in this case? Are they plain quotation marks or inverted ones?
These should be standard quotation marks, U+201D and U+201E respectively.
Gothic fonts examples - http://www.digitalizacja.pl/ocr/Gotyk.pdf
5. Skipping characters
In the case of atypically printed headings such as those presented below, do I have to skip the characters or recognize them from the context?
In this case you can skip.
In this case you should recognize.
A typographic ligature is a piece of type or a glyph in which at least two letters are joined to form one new, common character, e.g. the “fi” combination, in which the dot of the “i” is at the same time the terminal ball of the “f”. Another example built this way is the pair of letters "f" and "l" standing on a common foot. Ligatures are characters created specially for the most frequent groups of letters whose shapes would otherwise collide. Each language has its own most frequent pairs of characters, which is why ligatures are characteristic of a given language. In readily available fonts the English ligatures "fi", "fl" and "ff" are standard. Despite its graphic similarity, the ligature “fj” is really rare, probably because it occurs in only one English word (“fjord”), which is moreover a borrowing. Some Polish fonts have a “łł” ligature in the form of a common arc above both letters. The character "&", which arose from combining the letters of the word et (Latin for “and”), is also a ligature. More information about ligatures can be found here: http://en.wikipedia.org/wiki/Typographic_ligature
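Unicode has dedicated code points for the common ligatures mentioned above, and compatibility normalization maps each of them back to its component letters. The sketch below uses Python's standard `unicodedata` module to show that mapping; whether a ligature should be entered in “Cutouts” as one code point or as separate letters is not stated here, so this is an illustration only.

```python
import unicodedata

# Precomposed ligature code points from the Alphabetic Presentation Forms block.
ligatures = ["\ufb00", "\ufb01", "\ufb02"]

for lig in ligatures:
    # NFKC replaces a compatibility character with its plain equivalent.
    letters = unicodedata.normalize("NFKC", lig)
    print(f"U+{ord(lig):04X}  {unicodedata.name(lig)}  ->  {letters}")
```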
This is an example of an atypical ligature; information about it can be found on the pages:
It has no Unicode code point of its own; a COMBINING GRAPHEME JOINER (U+034F, http://unicode.org/faq/char_combmark.html#17) is used here in combination with the c and h characters (<c, CGJ, h>). Interesting examples are available at http://typophile.com/node/33330. Because the ligature derives from the Greek letter chi, it is recommended to use an alternative character: U+03A7. Chi itself should not occur in the group of verified texts.
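The two encodings mentioned above, the sequence <c, CGJ, h> and the single alternative character U+03A7, can be written out explicitly. The variable and function names in this sketch are hypothetical; the code only shows what the strings contain.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

# The ligature written as the three-code-point sequence <c, CGJ, h>.
ch_sequence = "c" + CGJ + "h"

# The recommended single-character alternative.
chi = "\u03a7"

print([f"U+{ord(c):04X}" for c in ch_sequence])
print(unicodedata.name(CGJ))
print(unicodedata.name(chi))


def strip_cgj(text: str) -> str:
    """Remove the joiner, leaving the plain letters behind."""
    return text.replace(CGJ, "")
```

Because the joiner has no visible form, the sequence prints like an ordinary “ch”; checking the code points is the only reliable way to see that it is present.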
7. Practical hints to make the work easier
For texts printed in cursive, such as the one in the picture above, each letter should be marked as cursive in the character verification window by pressing the appropriate button. For single words this is not complicated, but for lengthy texts the additional button presses can be onerous and time-consuming. To make this easier, keyboard shortcuts were introduced; for cursive it is the key with the letter i. Pressing the appropriate key has the same effect as clicking the corresponding on-screen button with the mouse.
Examples of gothic fonts - http://www.digitalizacja.pl/ocr/Gotyk.pdf
Gothic writing - http://en.wikipedia.org/wiki/Gothic_alphabet
Latin Extended-A - http://en.wikipedia.org/wiki/Latin_Extended-A
Hyphen - http://en.wikipedia.org/wiki/Hyphen