n, k, i, k, r, d, p, e, s, i, z, a, a, w, m, w, ś, p
What should be entered as the correct character in the examples below?
What should be done with vestigial letters such as those below? It is not always possible to recognize the meaning from the context.
If the meaning of the word can be recognized from its context, the proper character should be entered and the misprint should be marked.
Should the area in the case presented below be considered incorrect, or marked manually as “y”? Can I use binarization? What should I do if the program marks only a small fragment of the letter: skip the character?
This letter should be marked as “y”, but also as a misprint. The character can be slightly binarized in this case. The binarized element does not resemble anything.
Is this a correct character or a misprint?
This is a correct character: a Latin small letter with circumflex.
Is that a misprint or a special character?
This is an s with a misprint. This follows from the context of the word: czasie (English: time).
Do full stops in text written in cursive also have to be marked as cursive, or does it not matter?
In the case of full stops it does not matter.
3. Division of words
soft (optional) hyphen, U+00AD
There are a few links and a test illustrating the ABBYY FineReader method presented below.
The character “¬” should be used to mark word division. It is not a soft hyphen, but its graphic form makes entering the right symbol easier and helps prevent mistakes (http://www.fileformat.info/info/unicode/char/ac/index.htm). ABBYY FineReader uses a similar method. Samples illustrating this method are available at http://www.digitalizacja.pl/ocr/soft_hyphen_test.zip. In the DOC file opened in MS Word (with the hidden characters view switched on) you can see how the characters generated by FR are converted into soft hyphens. PDF, ePUB and TXT files were also generated as examples. The ePUB and TXT files show how the text behaves when a simple hyphen is used instead of a soft one. In some places FR recognized the hyphen wrongly and used a simple one.
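The conversion described above can be sketched in a few lines of Python. This is only an illustration, not the tool actually used in the project: the function name `to_soft_hyphens` is hypothetical, and the decision to join a word that was split across two lines is an assumption made for the example.

```python
NOT_SIGN = "\u00ac"     # the visible marker typed during verification
SOFT_HYPHEN = "\u00ad"  # soft (optional) hyphen, normally invisible


def to_soft_hyphens(text: str) -> str:
    """Replace every marker with a soft hyphen (hypothetical helper)."""
    # Assumption: a marker at the end of a line means the word continues
    # on the next line, so the line break is removed as well.
    joined = text.replace(NOT_SIGN + "\n", SOFT_HYPHEN)
    # Markers that are not at a line end are converted in place.
    return joined.replace(NOT_SIGN, SOFT_HYPHEN)


if __name__ == "__main__":
    sample = "digi" + NOT_SIGN + "\ntalizacja"
    # repr() makes the otherwise invisible soft hyphen visible as \xad.
    print(repr(to_soft_hyphens(sample)))
```

Because the soft hyphen is invisible in most viewers, printing with `repr()` is a convenient way to confirm that the replacement took place.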
Which character from the virtual keyboard should be chosen in this case?
The best choice is the em dash (Polish: pauza), U+2014. When you move the mouse over this character you can check whether its code is U+2014, because there are two nearly identical characters there.
ATTENTION: in documents printed in a gothic font, the gothic button is pressed by default. When normally printed type appears in the text, this option should be switched off.
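Outside the virtual keyboard, the same check can be made with Python's standard `unicodedata` module, which reports the official name of a code point. The sketch below is only an aid for telling look-alike dashes apart; the helper name `describe` and the list of candidates are assumptions, since the guidelines do not say which second character the keyboard offers.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


# Dash-like characters that look similar in many fonts (illustrative list).
LOOKALIKES = ["\u002d", "\u2013", "\u2014", "\u2015"]

if __name__ == "__main__":
    for ch in LOOKALIKES:
        print(describe(ch))
```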
On the page: http://en.wikipedia.org/wiki/Long_s there is information about the long s character.
On the page: http://en.wikipedia.org/wiki/%C3%9F there is information about the ß character.
Which characters should be used for quotation marks in this case? Are those simple quotation marks or the inverted ones?
These should be standard quotation marks, U+201D and U+201E respectively.
Gothic font examples - http://www.digitalizacja.pl/ocr/Gotyk.pdf
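A verified text can be checked for other quote-like characters with a short script. This is a minimal sketch under stated assumptions: U+201E is taken as the opening mark and U+201D as the closing mark (as in Polish typography), and the set of "suspicious" characters as well as the name `find_suspicious_quotes` are hypothetical choices for the example, not part of the guidelines.

```python
OPENING = "\u201e"  # DOUBLE LOW-9 QUOTATION MARK (assumed to open a quote)
CLOSING = "\u201d"  # RIGHT DOUBLE QUOTATION MARK (assumed to close a quote)

# Assumption: these quote-like characters should not appear in such a text.
SUSPICIOUS = {'"', "\u201c", "\u201f"}


def find_suspicious_quotes(text: str) -> list[tuple[int, str]]:
    """Return (position, code point) for every suspicious quote character."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch in SUSPICIOUS
    ]


if __name__ == "__main__":
    good = OPENING + "czasie" + CLOSING
    bad = OPENING + "czasie" + '"'
    print(find_suspicious_quotes(good))
    print(find_suspicious_quotes(bad))
```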
5. Skipping the characters
Do I have to skip characters or recognize them from context in the case of atypically written headings such as those presented below?
In this case you can skip.
In this case you should recognize.
A typographic ligature is a piece of type or a glyph in which at least two letters are joined into one new, common character, e.g. the “fi” combination, where the dot of the “i” is at the same time the ball at the end of the “f”. Another example built this way is the pair "f" and "l" standing on a common foot. Ligatures are characters specially designed for the most frequently occurring letter groups whose shapes would otherwise collide. Each language has its own most frequent pairs of characters, which is why ligatures are characteristic of each language. In readily available fonts, the English ligatures "fi", "fl" and "ff" are standard. Despite the graphic similarity, the ligature “fj” is really rare, probably because it occurs in only one English word (“fjord”), which is moreover of foreign origin. Some Polish fonts have an “łł" ligature in the form of a common bend above both letters. The character "&", which arose from joining the letters of the word et (Latin: and), is also a ligature. More information about ligatures can be found here: http://en.wikipedia.org/wiki/Typographic_ligature
This is an example of an atypical ligature; information about it can be found on the following pages:
There is no dedicated Unicode code point for it; a COMBINING GRAPHEME JOINER (U+034F, http://unicode.org/faq/char_combmark.html#17) is used here in combination with the c and h characters (<c, CGJ, h>). Interesting examples are available at http://typophile.com/node/33330. Because the ligature comes from the Greek letter chi, it is recommended to use an alternative character: U+03A7. Chi itself should not occur in the group of verified texts.
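Unicode assigns separate code points to a few of these ligatures (U+FB00 to U+FB04: ff, fi, fl, ffi, ffl), and compatibility normalization maps them back to their component letters. The sketch below only illustrates that relationship; it does not state how ligatures should be entered during verification, and the helper name `expand_ligatures` is hypothetical.

```python
import unicodedata

# Latin ligatures that have their own code points: ff, fi, fl, ffi, ffl.
LIGATURES = {chr(cp) for cp in range(0xFB00, 0xFB05)}


def expand_ligatures(text: str) -> str:
    """Replace ligature code points with their component letters."""
    # Only the listed ligatures are decomposed; every other character,
    # including accented Polish letters, is left untouched.
    return "".join(
        unicodedata.normalize("NFKD", ch) if ch in LIGATURES else ch
        for ch in text
    )


if __name__ == "__main__":
    for lig in sorted(LIGATURES):
        print(f"U+{ord(lig):04X}", unicodedata.name(lig), expand_ligatures(lig))
```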
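The two encodings mentioned above can be written out explicitly. This is a small sketch for illustration only; the constant and function names are hypothetical, and which of the two forms the verification tool expects is decided by the guidelines, not by this code.

```python
import unicodedata

CGJ = "\u034f"               # COMBINING GRAPHEME JOINER
CH_LIGATURE = "c" + CGJ + "h"  # the <c, CGJ, h> sequence
CHI = "\u03a7"               # alternative character recommended above


def strip_cgj(text: str) -> str:
    """Remove grapheme joiners, recovering the plain letters."""
    return text.replace(CGJ, "")


if __name__ == "__main__":
    print([f"U+{ord(ch):04X}" for ch in CH_LIGATURE])
    print(unicodedata.name(CGJ))
    print(unicodedata.name(CHI))
    print(strip_cgj(CH_LIGATURE))
```

The sequence is three code points long even though it is meant to represent one printed ligature, which is worth keeping in mind when counting or comparing characters.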
7. Practical hints improving the work
For text printed in cursive (italics), such as in the picture above, each letter should be marked as cursive in the character verification window by pressing the corresponding button. For single words this is not complicated, but for lengthy texts the additional button presses can be onerous and time-consuming. To speed up this activity, keyboard shortcuts were introduced; for cursive it is the i key. Pressing the keyboard key has the same effect as clicking the corresponding on-screen button with the mouse.
Examples of gothic fonts - http://www.digitalizacja.pl/ocr/Gotyk.pdf
Gothic writing - http://en.wikipedia.org/wiki/Gothic_alphabet
Latin Extended-A - http://en.wikipedia.org/wiki/Latin_Extended-A
Hyphen - http://en.wikipedia.org/wiki/Hyphen