Information in this document is subject to change without notice and does not represent any commitment on the part
of ABBYY Software House. The document is supplied as a part of the ABBYY FormReader package under a
license agreement. No part of this document may be reproduced or transmitted in any form or by any means,
electronic or otherwise, without the express written approval of ABBYY Software House.
What is a form?
Questionnaires, social security forms, polling slips, and warranty cards are all different types of form, used to
collect different types of information. How do forms differ from other types of documents?
1. A form has a set number of fields.
2. The content of each field is determined by the field itself, for example by its name. E.g. a “Last Name” field
contains only last names (if completed correctly), a “Date” field only dates, etc.
3. During form processing, only the field contents are of interest; all remaining form elements are
disregarded.
Gathering information can be a long and weary process, involving the input of hundreds if not thousands of forms.
ABBYY FormReader, however, makes life much easier, allowing the whole process to be automated. The inputting
process then consists of the following stages:
1. Application setup – the form to be processed is specified.
A form template is created within the program. It contains the geometrical locations of the fields, the type
of information each field is to contain, and other field parameters.
2. Form processing.
Completed forms are scanned and recognized (i.e. field images are converted into text) by the application.
An existing template is used to identify form field positions and the type of information contained within
them. Recognition results are subsequently verified and exported to a file or database.
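The notion of a form template described above can be pictured as a small data structure. The following is a minimal sketch only: the class names, the millimetre units and the `allowed_chars` parameter are assumptions made for illustration and do not describe the ABBYY FormReader template format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class FieldType(Enum):
    TEXT = "text"
    CHECKBOX = "checkbox"
    RADIO_GROUP = "radio group"


@dataclass(frozen=True)
class Rect:
    """Field position on the blank form, in millimetres from the top-left corner."""
    left: float
    top: float
    width: float
    height: float


@dataclass
class FormField:
    name: str
    kind: FieldType
    area: Rect
    allowed_chars: str = ""  # hypothetical parameter: empty means "any character"


@dataclass
class FormTemplate:
    name: str
    fields: List[FormField] = field(default_factory=list)

    def field_by_name(self, name: str) -> Optional[FormField]:
        """Return the field with the given name, or None if the template has none."""
        for f in self.fields:
            if f.name == name:
                return f
        return None


# Illustrative template with made-up coordinates.
template = FormTemplate(
    name="Warranty card",
    fields=[
        FormField("Last Name", FieldType.TEXT, Rect(20, 40, 120, 8)),
        FormField("Date", FieldType.TEXT, Rect(20, 55, 40, 8), allowed_chars="0123456789."),
        FormField("Agree", FieldType.CHECKBOX, Rect(20, 70, 5, 5)),
    ],
)
```

Keeping geometry and content type together in one record mirrors the two things a template has to answer during processing: where a field is, and what kind of information to expect in it.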
Easy? In theory, yes; in practice, no, as not all forms used to gather information are suitable for automated input.
The aim of this guide is to explain exactly which requirements a form must meet if it is to be suitable for automated
processing, and to show you how to create your own forms using Microsoft Visio 2000, Microsoft Word 2000, and
Corel Draw.
What is a machine-readable form?
Two principal tasks are carried out during form recognition:
1. Locating fields.
This is by no means an easy task, as the scanned form image may be distorted in various ways, e.g. stretched,
skewed, or rotated. In order for these distortions to be corrected, the form must contain what are termed
reference points. For more information on reference points and other form elements, see “Elements of
machine-readable forms” (page 6).
2. Separating field contents from field borders
The information entered in the fields must be clearly separated from other form elements: field borders,
background, service, and explanatory text. In order for the application to do this correctly, the form must meet
certain requirements, which differ depending on the form type. For more information on form types, see
“Types of machine-readable forms” (page 6).
In order for the above two tasks to be carried out successfully, the forms must correspond to the form pattern
exactly, i.e. forms of the same type must be printed using the same source document (pattern) so that the location of
all form elements is identical on each one. If this is not the case, i.e. the location of fields on different copies of the
form varies, the application will be unable to “find” the fields and, consequently, unable to recognize them. The only
way to ensure that copies of the form match the source document (pattern) is to have the forms printed
professionally. For more information regarding print quality, see “Print quality requirements” (page 15).
If the application is able to identify the field locations and separate the field contents from the field borders, the form
in question is deemed to be machine-readable. From now on such forms are simply referred to as forms.
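To illustrate why reference points make field location possible, the sketch below fits an affine transform to three reference points and uses it to map a field position from the template onto a distorted scan. This is an assumption-laden illustration: an affine model covers only linear distortions (shift, stretch, skew, rotation), the function names are hypothetical, and nothing here describes how FormReader itself matches templates.

```python
def _det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def _solve3(rows, rhs):
    """Solve a 3x3 linear system with Cramer's rule."""
    det = _det3(rows)
    if abs(det) < 1e-12:
        raise ValueError("reference points are collinear; choose three that form a triangle")
    solution = []
    for col in range(3):
        m = [list(r) for r in rows]
        for r in range(3):
            m[r][col] = rhs[r]
        solution.append(_det3(m) / det)
    return solution


def fit_affine(template_pts, scanned_pts):
    """Fit u = a*x + b*y + c, v = d*x + e*y + f from three point pairs."""
    rows = [(x, y, 1.0) for x, y in template_pts]
    a, b, c = _solve3(rows, [u for u, _ in scanned_pts])
    d, e, f = _solve3(rows, [v for _, v in scanned_pts])
    return (a, b, c, d, e, f)


def apply_affine(transform, point):
    """Map a template coordinate to the corresponding scan coordinate."""
    a, b, c, d, e, f = transform
    x, y = point
    return (a * x + b * y + c, d * x + e * y + f)


# Reference points on the template (mm) and where they were found on the scan.
# The scan here is shifted by (5, 3) and stretched by 2% (made-up values).
template_refs = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0)]
scanned_refs = [(5.0, 3.0), (107.0, 3.0), (5.0, 105.0)]

transform = fit_affine(template_refs, scanned_refs)
field_corner_on_scan = apply_affine(transform, (50.0, 50.0))
```

If the printed copies do not match the pattern, the template coordinates fed into such a transform are wrong from the start, which is why identical printing matters as much as the reference points themselves.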
Form completion methods
A form may be completed in one of the following ways:
1) By hand (“handprint” completion). Letters, digits and all other characters are written separately, with each
character having its own individual character space.
2) Using a matrix printer.
3) Using a typewriter.
4) Typographically. This refers to the use of inkjet and laser (not matrix) printers with a resolution of no less than
300 dpi.
5) Using a combination of the above.
Elements of machine-readable forms
The following elements may be present on a form:
1) Fields for completion and automatic processing. These contain the information to be gathered.
Field type: a text field for entering letters, digits and other characters.
Comments: see “Form completion methods” (page 5).

Field type: checkboxes to be marked.
Comments: these may take the form of squares, bubbles etc., or fields that must be underlined. They are marked using
various symbols: the standard “tick”, the “period” symbol, the letter “x”, etc.
Example: ☑ Yes, I like to buy it   ☒ Scanner is used   ☐ Agree

Field type: radio group.
Comments: a group of checkboxes in which only one checkbox can be marked.
Example: ☑ Yes  ☐ No  ☐ Don’t know

2) Fields that contain significant information, but which are not recognized automatically. Such fields may
contain, for example, personal signatures, company stamps, photos, etc.
3) Explanatory information, i.e. any textual or graphic information not subject to recognition. For example, field
headers, completion instructions, additional information, page numbers, etc.
4) Service information. A form may contain a field which is only to be completed with some service information,
e.g. document number, date of document acquisition, client identification number, etc. Such information may, for
example, be entered when forms are handed in to the operator, or be entered automatically during the scanning
process.
5) Reference points. These are special form elements necessary for:
• matching the template correctly (determination of field locations);
• compensating for any image skew or distortion (linear and non-linear) that may arise during scanning;
• unambiguous form identification in the case of simultaneous input of forms of different types.
The following form elements may be used as reference points:

Reference point type    Comments
Black squares           Solid black squares.
Lines                   Horizontal or vertical solid lines.
Static text             Any explanatory information, which is usually textual in form.
Barcodes                Barcodes of the following types: Code 39, Check Code 39, Interleaved 25, Check
                        Interleaved 25, EAN 13, EAN 8, Code 128.
                        We recommend that the EAN 13 format be used. An example of barcode creation using
                        the CorelDraw editor is given in Appendix II.
Fig. 1. An example of a blank form containing all types of reference points.
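When an EAN 13 barcode is used as a reference point to identify the form type, the 13th digit is a check digit computed from the first twelve. The sketch below shows the standard EAN-13 check digit calculation. The numbering scheme in `make_form_id_barcode` (a 4-digit form type followed by an 8-digit serial number) is purely hypothetical and is not prescribed by this guide.

```python
def ean13_check_digit(first12: str) -> int:
    """Standard EAN-13 check digit: weights 1,3,1,3,... over the first 12 digits."""
    if len(first12) != 12 or not (first12.isascii() and first12.isdigit()):
        raise ValueError("expected exactly 12 digits")
    total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(first12))
    return (10 - total % 10) % 10


def is_valid_ean13(code: str) -> bool:
    """Check that a 13-digit code carries the correct check digit."""
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    return ean13_check_digit(code[:12]) == int(code[12])


def make_form_id_barcode(form_type: int, serial: int) -> str:
    """Hypothetical scheme: 4-digit form type + 8-digit serial + check digit."""
    payload = f"{form_type:04d}{serial:08d}"
    if len(payload) != 12:
        raise ValueError("form_type must fit in 4 digits and serial in 8 digits")
    return payload + str(ean13_check_digit(payload))


barcode_value = make_form_id_barcode(form_type=12, serial=345)
```

The resulting 13-digit string is what would be typed into the barcode tool of the graphics editor; the check digit lets the reading side reject a barcode that was misread from a poor scan.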
Types of machine-readable forms
There are three different form types depending on the method of separating the field contents from the field
borders:
1. Dropout form
All the fields on the form are white rectangles on a color background. The important thing here is the color
used, as it disappears during the scanning process (see recommendations on color choice in Appendix
III), leaving only the field contents and reference points on the form image for the recognition module to
recognize.
Dropout forms are the preferred choice in terms of recognition quality.
2. Raster Forms
Field borders on raster forms are termed raster lines, i.e. lines made up of a series of dots located at equal
distance from each other. The size and the location of these dots are determined manually (see
"Black&white forms with raster backgrounds" (page 9) and "Black&white forms with raster borders" (page
9)). These dots are retained on the image after scanning, but the system treats them as garbage and
removes them automatically during image cleaning, leaving only field contents for the recognition
module to recognize.
3. Black&white Linear Forms
Field borders in this case take on a normal appearance (i.e. are black solid lines) and remain on the image
after scanning. That means that the block image includes both field borders and field contents, and the field
contents separation task is carried out by the recognition module. Hence recognition quality will depend to
a large extent on how neatly the form was completed (see "Black&white linear forms" (page 10)).
For this reason we do not recommend the use of black&white linear forms for automated processing.
Let’s turn to the advantages and disadvantages of the following types of forms:
• dropout color forms (as well as gray forms);
• raster forms: containing raster lines as field borders and forms with raster backgrounds;
• black&white linear forms.
Dropout color forms
Dropout color forms are forms in which fields are represented by a series of white rectangles (or other white
geometrical shapes) on a color background. The background is usually “red-orange” or “green” in color, and
disappears if the scanner has a special driver which can filter colors (in the case of color scanners), or a
color-filtering lamp (in the case of non-color scanners).
(a) (b)
Ideally, all form elements, with the exception of reference points, disappear during scanning, leaving only field
contents for recognition on the form image.
How is this done? By ensuring that not only the background but also the explanatory information is printed in the
dropout color (see figure (a)).
Scanning
Forms with a “red” or “green” background are scanned in one of the following ways:
a. on a color scanner with color filtering software (red or green);
b. on a non-color scanner using a red or green lamp (hardware color filtering takes place in this case);
c. on a non-color scanner using a white lamp and a red or green filter (filtering quality in this case is
much lower, as the background may not disappear completely, or field contents may be inadvertently
removed).
Notes.
1. Many color scanners also have blue software filtering. We do not recommend the use of blue forms,
however, as forms are likely to be completed using both black and blue ink. Field contents written in blue
ink will disappear in this case.
2. Should you use a standard white lamp with no color filtering to scan your forms, various light colors (not
only “red” or “green”, but light yellow and other similar colors) are also likely to drop out. That means you
can also use forms with such a background color, provided its saturation is low. In this case you should find
the proper color and its saturation by trial, depending on the scanner model used.
Choosing the form color
Red-orange colors are preferable to green as a form color. This is because they offer the greatest possible contrast
to blue, and consequently result in enhanced scanning and recognition quality if the forms are completed using blue
ink.
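The effect of color filtering, and the reason blue forms are discouraged, can be shown with a deliberately simplified model: scanning through a color filter is treated as keeping a single RGB channel, and a pixel survives as ink only if that channel is darker than a threshold. The RGB values, the threshold and the function names below are illustrative assumptions, not measured scanner behavior.

```python
def seen_through_filter(pixel, channel):
    """Simplified model: a color filter keeps one RGB channel as the gray value."""
    r, g, b = pixel
    return {"red": r, "green": g, "blue": b}[channel]


def survives_scan(pixel, channel, threshold=128):
    """True if the pixel stays on the scanned image (is dark through the filter)."""
    return seen_through_filter(pixel, channel) < threshold


# Illustrative colors only; real inks and papers vary.
RED_ORANGE_BACKGROUND = (255, 110, 60)
LIGHT_BLUE_BACKGROUND = (150, 190, 255)
BLUE_INK = (30, 50, 170)
BLACK_INK = (15, 15, 15)

# Red-orange form scanned with red filtering.
red_form = {
    "background": survives_scan(RED_ORANGE_BACKGROUND, "red"),
    "blue ink": survives_scan(BLUE_INK, "red"),
    "black ink": survives_scan(BLACK_INK, "red"),
}

# Blue form scanned with blue filtering.
blue_form = {
    "background": survives_scan(LIGHT_BLUE_BACKGROUND, "blue"),
    "blue ink": survives_scan(BLUE_INK, "blue"),
    "black ink": survives_scan(BLACK_INK, "blue"),
}
```

In this model the red-orange background drops out while both ink colors remain, whereas on the blue form the background drops out together with anything written in blue ink.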
Appendix III lists the recommended colors for form processing, i.e. those most likely to disappear during scanning
with almost any scanner. A “dropout” color list for a particular scanner (in Pantone or any other format) can also be
obtained from your scanner manufacturer/dealer.
The choice of color is up to you; keep in mind, however, that the form color chosen should be pleasant for
those required to complete it.
Advantages and disadvantages
Advantages
Drawing field borders on a color background form results in the highest possible recognition quality because:
1. Only the text image is subject to recognition; all garbage and field borders are removed.
2. Letters/digits overlapping field borders is less of a problem, as the borders themselves are simply
backgrounds which drop out during scanning, leaving only the field contents for recognition.
3. Printing explanatory information in the same color as the background increases recognition quality, as the
information “drops out” from the form image during scanning, and, consequently, does not interfere with
field contents.
4. Printing explanatory information in the same color as the background saves disk space, as the form image
file is smaller; and form processing speed increases.
Disadvantages
1. Creating a color form is complicated. A graphics editor is needed, and color forms have to be printed either
professionally or by using a color Xerox machine. Note that Xerox machines do not guarantee identical
field locations nor can black/color levels be altered.
2. Printing explanatory information in the same color as the form background reduces form readability
significantly, as the contrast between background and explanatory text is poor. This can lead to incorrect
form completion.
Gray forms
Gray forms are a subclass of color dropout forms. Gray forms are those with a shade of gray as a background color,
which again disappears during scanning. A gray background is achieved by printing the field borders in black using
the following parameters:
• saturation of no more than 10%,
• RGB parameters of 222,221,221
The resulting color is light gray due to low color saturation (i.e. black dots are rarefied).
Both field border variants depicted below may be used:
(a) (b)
Scanning
Forms may be scanned using any white lamp scanner. However, in order for the background to drop out, the correct
scanning parameters (contrast and brightness) must be chosen. If the brightness is very low and the contrast is very
high, the gray background may still remain on the image after scanning. The scanning parameters must be set
individually for each scanner.
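The dependence on scanning parameters can be sketched with a simple model in which brightness and contrast are collapsed into a single black/white threshold applied to pixel luminance. This is a simplification made for illustration: real scanner settings do not map onto one number, the ink colors are made up, and the Rec. 601 luminance weights are an assumption about how color is converted to gray.

```python
def luminance(rgb):
    """Approximate perceived brightness (Rec. 601 weights), 0 = black, 255 = white."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def is_black_after_scan(rgb, threshold):
    """A pixel becomes black on the scanned image if it is darker than the threshold."""
    return luminance(rgb) < threshold


def threshold_window(ink_rgb, background_rgb):
    """Thresholds t with luminance(ink) < t <= luminance(background) keep the ink
    and drop the background. Returns (low, high), or None if no such t exists."""
    low, high = luminance(ink_rgb), luminance(background_rgb)
    return (low, high) if low < high else None


GRAY_BACKGROUND = (222, 221, 221)   # value recommended in the text
DARK_INK = (20, 20, 60)             # illustrative
VERY_LIGHT_INK = (218, 218, 225)    # illustrative

background_drops_out = not is_black_after_scan(GRAY_BACKGROUND, threshold=200)
background_remains = is_black_after_scan(GRAY_BACKGROUND, threshold=230)

dark_window = threshold_window(DARK_INK, GRAY_BACKGROUND)
light_window = threshold_window(VERY_LIGHT_INK, GRAY_BACKGROUND)
```

The window for dark ink is wide, so almost any setting works. For very light ink the window shrinks to a few luminance levels, which is why the parameters have to be tuned per scanner.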
Advantages and disadvantages
Advantages
1. The forms are very easy to develop using any graphics editor or word processing application e.g. MS
Word. See: "Developing forms using Microsoft Word" (Page 22).
Disadvantages
1. Scanning parameters (brightness and contrast) can only be altered to a slight degree. This can prove
problematic when scanning forms completed using a very light ink, as decreasing brightness to increase
text image quality can result in the appearance of field borders or the background on the form image, and
consequently, cause a deterioration in the recognition quality.
2. If the printer makes unauthorized changes to the technical print parameters (i.e. different paper, other color
components) then the background may become too dark and could prove difficult to remove regardless of
the scanning parameters chosen.
Black&white forms with raster background
Fields on such forms are simply white spaces (usually rectangles) on a raster background. The background is made
up of individual dots, no more than 0.1 mm in size, with the distance between each dot about 1 mm. This is much
greater than is the case with gray forms, where dot density is such that the eye perceives the background as smooth
gray.
Background filtering
The raster background does not disappear during scanning itself; instead, the raster dots are classified as garbage and
removed from the image during despeckling.
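Despeckling of this kind can be sketched as removing every connected group of black pixels whose area does not exceed a limit. The code below is a generic illustration of that idea on a tiny binary image. The function name, the 8-neighbor connectivity and the area limit are assumptions, and the actual cleaning algorithm used by the application may differ.

```python
from collections import deque


def despeckle(image, max_speck_area):
    """Return a copy of a binary image (1 = black) with every connected
    component of at most max_speck_area pixels cleared to white."""
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search collects one 8-connected component.
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(component) <= max_speck_area:
                for cy, cx in component:
                    result[cy][cx] = 0
    return result


# Two isolated raster dots (left column) and a handwritten "o" (the ring).
rows = [
    "1000000",
    "0001110",
    "0001010",
    "0001110",
    "1000000",
]
scan = [[int(ch) for ch in row] for row in rows]
cleaned = despeckle(scan, max_speck_area=2)
```

The same size test explains two points made in this guide: dots that merge or grow beyond the limit are no longer removed, and handwritten marks smaller than the limit (thin periods or commas) are removed along with the raster.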
Advantages and disadvantages
Advantages
1. If both the scanning parameters and the dot size are chosen correctly, the form image will be despeckled
and the recognition module will acquire the field image free of garbage and superfluous characters.
2. Letters/digits overlapping field borders is less of a problem; field borders are part of the background, and
therefore disappear during image cleaning, leaving only the field contents to be recognized.
Disadvantages
1. Raster forms require periods, commas and other small characters to be written thickly. This is because their
size must be greater than that of the raster dots; otherwise they will be removed as part of the background.
2. Scanning parameters (brightness and contrast) can only be altered to a limited extent. This can prove
problematic when scanning forms completed using a very light ink, as decreasing the brightness to increase
the text image quality can result in the field borders or the background appearing on the form image, and
consequently, worsen the recognition quality.
3. Not all graphic editors and word processors (e.g. MS Word) have the shading style described above (i.e.
raster) in their standard styles palette. In addition, word processors normally only have a limited number
of raster setup tools, leading to difficulties, for example, when trying to change the distance between raster
dots, or their size.
4. A raster background can prove tiring to the eye, and consequently discourage form completion.
5. If printing density is increased, dots may become larger and, as a result, be left on the image as garbage. This,
in turn, can make character recognition impossible.
Black&white forms with raster borders
Field borders here are made up of raster lines, i.e. sequences of small black dots. Raster dot size should be
0.39 to 0.5 pt.
The recommended raster dot size is 0.39 pt, with the distance between the raster dots being at least five times larger
than the dot size.
If the distance is less, the dots may merge during scanning, so that they remain on the image after
despeckling. This, in turn, leads to lower recognition quality.
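The dot size and spacing rule can be checked with a few lines of arithmetic. The sketch below assumes the desktop-publishing point (1 pt = 1/72 inch), which is an assumption rather than something this guide states, and the helper names are hypothetical.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72.0  # assumption: desktop-publishing point


def pt_to_mm(points):
    """Convert a length in points to millimetres."""
    return points * MM_PER_INCH / POINTS_PER_INCH


def pt_to_pixels(points, dpi):
    """Size of a printed length on a scan made at the given resolution."""
    return points * dpi / POINTS_PER_INCH


def min_dot_spacing_pt(dot_size_pt, factor=5.0):
    """Smallest recommended distance between raster dots."""
    return dot_size_pt * factor


def spacing_is_acceptable(dot_size_pt, spacing_pt):
    """True if the spacing is at least five times the dot size."""
    return spacing_pt >= min_dot_spacing_pt(dot_size_pt)


dot_pt = 0.39
dot_mm = pt_to_mm(dot_pt)                    # roughly 0.14 mm
dot_px_at_300dpi = pt_to_pixels(dot_pt, 300)  # between one and two pixels
required_spacing_mm = pt_to_mm(min_dot_spacing_pt(dot_pt))
```

At 300 dpi a 0.39 pt dot covers between one and two pixels, which is small enough to be treated as a speck. Larger dots or closer spacing move the raster toward the size of real pen strokes.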
Acceptable ways of completing fields with raster borders are shown in the figures below: