Before uploading a scanned image of a document to the Virtual Transcription Laboratory, make sure the image has been brought into a form optimized for optical character recognition (OCR) programs. The aim of OCR is to turn the scan, which web search engines treat solely as an image, into a document in which specific phrases can be searched. It is therefore worth making OCR easier by removing all blemishes, that is, elements which the program may interpret as text although they are not. Scan Tailor is used for this post-processing.
Post-processing steps in Scan Tailor:
- After the scan is loaded into the program, the first step (visible in the upper left corner) is „Fix Orientation”, which lets you rotate the image if it was loaded in the wrong orientation. This step is not always necessary. To rotate the image, click the button found on the same line as the name of the step „Fix Orientation” (Image 1).
Image 1. Fixing the orientation of the image.
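To make the idea of fixing the orientation concrete, here is a minimal Python sketch of a quarter-turn rotation. It is only an illustration and not Scan Tailor's code: the function name `rotate_clockwise` and the representation of a page as a list of pixel rows are assumptions made for this example.

```python
def rotate_clockwise(image):
    """Rotate a raster (a list of pixel rows) by 90 degrees clockwise."""
    # Reversing the row order and transposing makes the left column,
    # read from bottom to top, the new top row.
    return [list(row) for row in zip(*image[::-1])]


# A tiny 2 x 3 "page"; the numbers only label the pixels.
page = [
    [1, 2, 3],
    [4, 5, 6],
]
rotated = rotate_clockwise(page)  # a 3 x 2 raster
```

Applying the function four times returns the original page, which matches clicking the rotation button repeatedly until the scan is upright.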
- The next step is „Split Pages”, in which the program automatically suggests how to divide the scan into pages. The suggestion can be changed as needed using the icons on the left and by adjusting the division line on the image. The operation has two modes:
- Automatic: the program tries to determine the division itself, and the user can adjust it manually for individual scans;
- Manual: the user chooses where the division line goes, and may also specify that a scan contains just one page (menu „Page Layout”), so that no division is needed.
Image 2. Marking the division of pages.
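One simple way to picture an automatic page split is to look for the column with the least ink near the middle of a two-page spread. The sketch below assumes this heuristic; it is not Scan Tailor's actual algorithm, and the names `find_split_column`, `split_pages` and `search_fraction` are hypothetical. Pixels are 1 for ink and 0 for background.

```python
def find_split_column(image, search_fraction=0.5):
    """Return the x position of the emptiest column in the central band."""
    width = len(image[0])
    # Amount of ink in every column of the spread.
    column_ink = [sum(row[x] for row in image) for x in range(width)]
    # Only the central part of the spread is searched for the gutter.
    margin = int(width * (1 - search_fraction) / 2)
    candidates = range(margin, width - margin)
    return min(candidates, key=lambda x: column_ink[x])


def split_pages(image, split_x):
    """Cut the spread into a left and a right page at column split_x."""
    left = [row[:split_x] for row in image]
    right = [row[split_x:] for row in image]
    return left, right


# Two blocks of "text" separated by an empty gutter.
spread = [[0, 1, 1, 0, 0, 1, 1, 0] for _ in range(4)]
split_x = find_split_column(spread)
left_page, right_page = split_pages(spread, split_x)
```

In manual mode the user would supply `split_x` directly instead of relying on the heuristic.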
- The next step is „Deskew”: aligning the text in the image to a grid so that, regardless of the tilt of the scanned paper, the text lines run straight on the screen. The program sets the angle automatically, and it can also be adjusted manually.
Image 3. Deskewing the text from the image.
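A common textbook way to estimate a skew angle is the projection-profile method: try several candidate angles and keep the one for which ink concentrates in the fewest rows. The Python sketch below assumes this method and, for small angles, approximates rotation by a shear. It is not taken from Scan Tailor; the function `estimate_skew` and its parameters are hypothetical.

```python
import math


def estimate_skew(ink_pixels, max_angle=5.0, step=0.5):
    """Return the candidate angle (in degrees) giving the sharpest row profile.

    ink_pixels is a list of (x, y) coordinates of dark pixels, y growing downward.
    """
    steps = int(round(max_angle / step))
    best_angle, best_score = 0.0, -1
    for i in range(-steps, steps + 1):
        angle = i * step
        slope = math.tan(math.radians(angle))
        # Shear every pixel back along the candidate slope and count ink per row.
        rows = {}
        for x, y in ink_pixels:
            key = round(y - x * slope)
            rows[key] = rows.get(key, 0) + 1
        # Straight lines put all their ink into few rows, which maximizes
        # the sum of squared row counts.
        score = sum(count * count for count in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# Synthetic page: three text lines sloping downward to the right by 2 degrees.
true_slope = math.tan(math.radians(2.0))
pixels = [(x, base + round(x * true_slope))
          for base in (20, 40, 60) for x in range(200)]
skew = estimate_skew(pixels)
```

For this synthetic input the estimate equals the 2 degrees used to build it. The page would then be rotated by the opposite angle, and a manual correction corresponds to overriding the returned value.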
- „Select Content” consists of marking the text area that OCR should recognize, so that as little as possible of the rest of the scan falls inside it. The program suggests the area automatically, and it can be changed manually.
Image 4. Selecting content of the text field.
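The automatic suggestion can be thought of as the smallest rectangle that contains all the ink on the page. The following sketch computes such a bounding box for a raster where 1 means ink; it is a simplified assumption about content detection, and `content_box` and `crop` are hypothetical helper names.

```python
def content_box(image):
    """Return (left, top, right, bottom) of the smallest box holding all ink.

    right and bottom are exclusive; None is returned for a blank page.
    """
    ink_rows = [y for y, row in enumerate(image) if any(row)]
    if not ink_rows:
        return None
    ink_cols = [x for x in range(len(image[0])) if any(row[x] for row in image)]
    return ink_cols[0], ink_rows[0], ink_cols[-1] + 1, ink_rows[-1] + 1


def crop(image, box):
    """Keep only the pixels inside the box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


page = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
box = content_box(page)
content = crop(page, box)
```

A stray speck far from the text would enlarge such a box, which is one reason the suggested area sometimes needs manual correction.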
- The next step is setting the „Margins”. Transformations such as page splitting, deskewing, and content selection change the size of the scans, so margins are added to standardize it. After post-processing, the margin area is filled with white.
Image 5. Marking the margins.
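The effect of the margins can be sketched as padding every page with white pixels until all pages share the same size. The example below assumes centered alignment and a target size equal to the largest page; both are assumptions for illustration, as are the names `pad_to_size` and `equalize`. Here 0 stands for white and 1 for ink.

```python
def pad_to_size(image, width, height, white=0):
    """Center the image on a width x height canvas filled with white."""
    img_h, img_w = len(image), len(image[0])
    left = (width - img_w) // 2
    top = (height - img_h) // 2
    canvas = [[white] * width for _ in range(height)]
    for y, row in enumerate(image):
        # Copy one row of the page into its place on the canvas.
        canvas[top + y][left:left + img_w] = row
    return canvas


def equalize(pages):
    """Pad all pages to the width and height of the largest one."""
    width = max(len(p[0]) for p in pages)
    height = max(len(p) for p in pages)
    return [pad_to_size(p, width, height) for p in pages]


pages = [
    [[1, 1], [1, 1]],   # a narrow 2 x 2 page
    [[1, 1, 1, 1]],     # a wide 4 x 1 page
]
uniform = equalize(pages)  # both pages become 4 x 2
```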
- „Output” is the last step, in which we can define details of the output files: their resolution („Output Resolution”) and their color depth (menu „Mode”; by default the images are black-and-white). Additional options are presented below.
Image 6. The output of the preceding processing steps.
„Dewarping” lets us straighten text lines that are curved because the book was scanned on a flatbed scanner.
Image 7. Dewarping the image.
„Despeckling” sets how aggressively blemishes are removed from the image (the bigger the paintbrush on the icon, the more intensively the cleaning algorithm looks for blemishes). Be aware that it may also remove parts of the actual text, not only blemishes; in the tab on the right, the places identified as blemishes are marked in red.
Image 8. Despeckling the image.
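Despeckling can be illustrated as erasing every connected blob of ink that is smaller than some threshold. The sketch below assumes this approach with 4-connected neighbors; it is not Scan Tailor's implementation, and `despeckle` and `min_size` are hypothetical names. A larger `min_size` plays the role of a bigger paintbrush.

```python
def despeckle(image, min_size=3):
    """Return a copy of the image with ink blobs smaller than min_size erased."""
    height, width = len(image), len(image[0])
    result = [list(row) for row in image]
    seen = [[False] * width for _ in range(height)]
    for sy in range(height):
        for sx in range(width):
            if not image[sy][sx] or seen[sy][sx]:
                continue
            # Flood fill to collect one 4-connected blob of ink.
            blob, stack = [], [(sx, sy)]
            seen[sy][sx] = True
            while stack:
                x, y = stack.pop()
                blob.append((x, y))
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and image[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((nx, ny))
            # Blobs below the threshold are treated as specks and whitened.
            if len(blob) < min_size:
                for x, y in blob:
                    result[y][x] = 0
    return result


page = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
]
cleaned = despeckle(page, min_size=3)  # the two isolated pixels are removed
```

With a high threshold, small genuine marks such as dots or commas would also be erased, which is why the removed areas are worth checking before accepting the result.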
The image produced by the process described above is saved in the folder „Out”, which Scan Tailor creates automatically in the location from which the scan was loaded. The post-processed image is ready for optical character recognition (OCR).