Abbyy Software FORMREADER Guide To Create Forms

ABBYY FormReader Automatic Form Input System
A Guide to Creating Machine-Readable Forms
ABBYY Software House
Moscow 2001
Information in this document is subject to change without notice and does not represent any commitment on the part of ABBYY Software House. The document is supplied as a part of the ABBYY FormReader package under a license agreement. No part of this document may be reproduced or transmitted in any form or by any means, electronic or otherwise, without the express written approval of ABBYY Software House.
© ABBYY Software House (BIT Software), 1993-2001. All rights reserved. ABBYY, BIT Software, FineReader, “fountain image transformation,” Lingvo, Scan&Read, Scan&Translate, “one button principle,” “Your computer reads by itself,” “Your computer reads and translates by itself” are registered trademarks of ABBYY. ABBYY FormReader, Try&Buy, DOCFLOW are trademarks of ABBYY. All other trademarks are the property of their respective owners. 125015, Moscow, p /b 72. ABBYY Software House.
CONTENTS
WHAT IS A FORM?
WHAT IS A MACHINE-READABLE FORM?
  FORM COMPLETION METHODS
  ELEMENTS OF MACHINE-READABLE FORMS
  TYPES OF MACHINE-READABLE FORMS
    Dropout color forms
      Scanning
      Choosing the form color
      Advantages and disadvantages
    Gray forms
      Scanning
      Advantages and disadvantages
    Black&white forms with raster background
      Background filtering
      Advantages and disadvantages
    Black&white forms with raster borders
    Black&white linear forms
      Advantages and disadvantages
  HOW TO CHOOSE A FORM TYPE
    Criteria for choosing the form type
      Hardware
      Volume, printing method, and form printing cost
      Image size and average form processing speed
      Editors for form creation
    Table: Summary of form types - advantages and disadvantages
  GENERAL REQUIREMENTS FOR MACHINE-READABLE FORMS
    Form background requirements
    Reference point requirements
      Requirements for black squares
      Requirements for static text
      Requirements for lines
      Requirements for barcode
    Requirements for geometric field parameters
      Raster dot size
      Character space size
      Line thickness
    Print quality requirements
    Requirements for form completion
CREATING MACHINE-READABLE FORMS
  FORM CREATION STAGES
  DEVELOPING FORMS IN MICROSOFT VISIO 2000
    Attaching a stencil set
    The form elements provided by the stencil
    Form creation in MS Visio: example
    Creating your own stencils
    Preparing an MS Visio form for professional printing
  DEVELOPING FORMS USING MICROSOFT WORD 2000
    Preparing the workspace
      Paper size
      Page margins
      Grid
    Which is best - background or raster?
    Setting up the background
    MS Word 2000 graphic tools used to develop machine-readable forms
    Positioning form elements
    Protecting the form
  CERTIFICATION
APPENDICES
  USEFUL TIPS
  IDENTIFICATION OF DIFFERENT FORMS PROCESSED IN THE SAME BATCH
  CREATING A BARCODE USING CORELDRAW
  RECOMMENDED COLORS FOR DROPOUT FORMS
What is a form?
Questionnaires, social security forms, polling slips and warranty cards are all different types of forms used to collect different types of information. How do forms differ from other types of documents?
1. A form has a set number of fields.
2. Field content is always determined by, for example, the field name: a “Last Name” field contains only last names (if completed correctly), a “Date” field only dates, etc.
3. During form processing, only the field contents are of interest; all remaining form elements are disregarded.
Gathering information can be a long and weary process, involving the input of hundreds if not thousands of forms. ABBYY FormReader, however, makes life much easier, allowing the whole process to be automated. The inputting process then consists of the following stages:
1. Application setup. The form to be processed is specified: a form template is created within the program, containing the geometrical locations of the fields, the type of information each field is to contain, and other field parameters.
2. Form processing. Completed forms are scanned and recognized (i.e. field images are converted into text) by the application. An existing template is used to identify form field positions and the type of information contained within them. Recognition results are subsequently verified and exported to a file or database.
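The kind of information a template holds can be pictured with a small data structure. The sketch below is purely illustrative: the class names, attribute names and millimetre units are assumptions made for this example and do not describe the template format actually used by ABBYY FormReader.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical structures for illustration only; not FormReader's template format.
@dataclass
class FieldSpec:
    name: str         # e.g. "Last Name"
    kind: str         # "text", "checkbox" or "radio"
    left_mm: float    # geometrical location of the field on the blank form
    top_mm: float
    width_mm: float
    height_mm: float


@dataclass
class FormTemplate:
    title: str
    fields: List[FieldSpec] = field(default_factory=list)

    def field_at(self, x_mm: float, y_mm: float) -> Optional[FieldSpec]:
        """Return the first field whose rectangle contains the point, or None."""
        for spec in self.fields:
            inside_x = spec.left_mm <= x_mm <= spec.left_mm + spec.width_mm
            inside_y = spec.top_mm <= y_mm <= spec.top_mm + spec.height_mm
            if inside_x and inside_y:
                return spec
        return None


template = FormTemplate(
    title="Warranty card",
    fields=[
        FieldSpec("Last Name", "text", 20.0, 30.0, 100.0, 8.0),
        FieldSpec("Date", "text", 20.0, 45.0, 40.0, 8.0),
    ],
)
```

Because every copy of the form is matched against the same rectangles, a template of this kind is only useful if all printed copies have their fields in identical positions.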
Easy? In theory, yes; in practice, no, because not all forms used to gather information are suitable for automated input. The aim of this guide is to explain exactly which requirements a form must meet if it is to be suitable for automated processing, and to show you how to create your own forms using Microsoft Visio 2000, Microsoft Word 2000, and CorelDraw.
What is a machine-readable form?
Two principal tasks are carried out during form recognition:
1. Locating fields. This is by no means an easy task, as the scanned form image may be distorted in various ways, e.g. stretched, skewed, or rotated. In order for these distortions to be corrected, the form must contain what are termed reference points. For more information on reference points and other form elements, see: “Elements of machine-readable forms” (page 6).
2. Separating field contents from field borders. The information entered in the fields must be clearly separated from other form elements: field borders, background, service, and explanatory text. In order for the application to do this correctly, the form must meet certain requirements; these requirements define several form types. For more information on form types, see: “Types of machine-readable forms” (page 6).
In order for the above two tasks to be carried out successfully, the forms must correspond to the form pattern exactly, i.e. forms of the same type must be printed using the same source document (pattern) so that the location of all form elements is identical on each one. If this is not the case, i.e. the location of fields on different copies of the form varies, the application will be unable to “find” the fields and, consequently, unable to recognize them. Copies of the form will only match the source document (pattern) if the forms are printed professionally. For more information regarding print quality, see: “Print quality requirements” (page 15).
If the application is able to identify the field locations and separate the field contents from the field borders, the form in question is deemed to be machine-readable. From now on such forms are simply referred to as forms.
Form completion methods
A form may be completed in one of the following ways:
1) By hand (“handprint” completion). Letters, digits and all other characters are written separately, with each character having its own individual character space.
2) Using a matrix printer.
3) Using a typewriter.
4) Typographically. This refers to the use of inkjet and laser (not matrix) printers with a resolution of no less than 300 dpi.
5) Using a combination of the above.
Elements of machine-readable forms
The following elements may be present on a form:
1) Fields for completion and automatic processing. These contain the information to be gathered.
Field type: A text field for entering letters, digits and other characters
Comments: See “Form completion methods” (page 5).
Field type: Checkboxes to be marked
Comments: These may take the form of squares, bubbles, etc., or fields that must be underlined. They are marked using various symbols: the standard “tick”, the “period” symbol, the letter “x”, etc.
Example: ☑ Yes, I like to buy it   ☒ Scanner is used   ☐ Agree
Field type: Radio group
Comments: A group of checkmarks in which only one checkmark can be marked.
Example: ☑ Yes   ☐ No   ☐ Don’t know
2) Fields that contain significant information, but which are not recognized automatically. Such fields may contain, for example, personal signatures, company stamps, photos, etc.
3) Explanatory information: any textual or graphic information not subject to recognition. For example, field headers, completion instructions, additional information, page numbers, etc.
4) Service information. A form may contain a field which is only to be completed with some service information, e.g. document number, date of document acquisition, client identification number, etc. Such information may, for example, be entered when forms are handed in to the operator, or entered automatically during the scanning process.
5) Reference points. These are special form elements necessary for:
a. matching the template correctly (determination of field locations);
b. compensating for any image skew or distortion (linear and non-linear) that may arise during scanning;
c. unambiguous form identification in the case of simultaneous input of forms of different types.
The following form elements may be used as reference points:
Reference point type: Black squares
Comments: Solid black squares.
Reference point type: Lines
Comments: Horizontal or vertical solid lines.
Reference point type: Static text
Comments: Any explanatory information, which is usually textual in form.
Reference point type: Barcodes
Comments: Barcodes of the following types: Code 39, Check Code 39, Interleaved 25, Check Interleaved 25, EAN 13, EAN 8, Code 128. We recommend that the EAN 13 format be used. An example of barcode creation using the CorelDraw editor is given in Appendix II.
Fig. 1. An example of a blank form containing all types of reference points.
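How reference points make it possible to compensate for linear distortion (shift, stretch, skew, rotation) can be sketched with a little arithmetic. The example below assumes three reference points whose positions are known both on the template and on the scanned image, and fits the affine transformation that maps one set onto the other. The function names are invented for this illustration, the method is a generic one rather than the algorithm used by ABBYY FormReader, and non-linear distortion is not covered.

```python
def solve3(m, v):
    """Solve the 3x3 linear system m * x = v by Cramer's rule."""
    def det(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
                - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
                + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

    d = det(m)
    if abs(d) < 1e-12:
        raise ValueError("reference points must not lie on one straight line")
    solution = []
    for col in range(3):
        replaced = [row[:] for row in m]
        for r in range(3):
            replaced[r][col] = v[r]
        solution.append(det(replaced) / d)
    return solution


def fit_affine(template_pts, scanned_pts):
    """Fit x' = a*x + b*y + c, y' = d*x + e*y + f from three point pairs."""
    m = [[x, y, 1.0] for x, y in template_pts]
    a, b, c = solve3(m, [p[0] for p in scanned_pts])
    d, e, f = solve3(m, [p[1] for p in scanned_pts])
    return (a, b, c, d, e, f)


def apply_affine(t, point):
    """Map a template coordinate to its position on the scanned image."""
    a, b, c, d, e, f = t
    x, y = point
    return (a * x + b * y + c, d * x + e * y + f)


# Illustrative coordinates in millimetres: three black squares on the template
# and the places where they were found on a slightly skewed, shifted scan.
template_squares = [(10.0, 10.0), (200.0, 10.0), (10.0, 280.0)]
scanned_squares = [(12.2, 12.8), (202.2, 9.0), (17.6, 282.8)]

transform = fit_affine(template_squares, scanned_squares)
# Where a field whose template corner is at (20, 30) should be looked for:
field_corner_on_scan = apply_affine(transform, (20.0, 30.0))
```

Three points that lie on one straight line do not determine the transformation, which is one reason to spread reference points over the whole form rather than along a single edge.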
Types of machine-readable forms
There are three different form types depending on the method of separating the field contents from the field borders:
1. Dropout form
All the fields on the form are white rectangles on a color background. The important thing here is the color used, as it disappears during the scanning process (see recommendations on color choice in Appendix III), leaving only the field contents and reference points on the form image for the recognition module to recognize. Dropout forms are the preferred choice in terms of recognition quality.
2. Raster Forms
Field borders on raster forms are termed raster lines, i.e. lines made up of a series of dots located at equal distance from each other. The size and the location of these dots are determined manually (see "Black&white forms with raster background" (page 9) and "Black&white forms with raster borders" (page 9)). These dots are retained on the image after scanning, but the system treats them as garbage and removes them automatically during image cleaning, leaving only field contents for the recognition module to recognize.
3. Black&white Linear Forms
Field borders in this case take on a normal appearance (i.e. are black solid lines) and remain on the image after scanning. That means that the block image includes both field borders and field contents, and the task of separating the field contents is carried out by the recognition module. Hence recognition quality will depend to a large extent on how neatly the form was completed (see "Black&white linear forms" (page 10)). For this reason we do not recommend the use of black&white linear forms for automated processing.
Let’s turn to the advantages and disadvantages of the following types of forms:
dropout color forms (as well as gray forms);
raster forms: containing raster lines as field borders and forms with raster backgrounds;
black&white linear forms.
Dropout color forms
Dropout color forms are forms in which fields are represented by a series of white rectangles (or other white geometrical shapes) on a color background. The background is usually “red-orange” or “green” in color, and disappears if the scanner has a special driver which can filter colors (in the case of color scanners), or a color-filtering lamp (in the case of non-color scanners).
(a) (b)
Ideally, all form elements, with the exception of reference points, disappear during scanning, leaving only field contents for recognition on the form image. How is this done? By ensuring that not only the background but also the explanatory information is printed in the dropout color (see figure (a)).
Scanning
The scanning of forms with a “red” or “green” background is performed either:
a. on a color scanner with color filtering software (red or green);
b. on a non-color scanner using a red or green lamp (hardware color filtering takes place in this case); or
c. on a non-color scanner using a white lamp and a red or green filter (filtering quality in this case is much lower, as the background may not disappear completely, or field contents may be inadvertently removed).
Notes.
1. Many color scanners also have blue software filtering. We do not recommend the use of blue forms, however, as forms are likely to be completed using both black and blue ink. Field contents written in blue ink will disappear in this case.
2. Should you use a standard white lamp with no color filtering to scan your forms, various light colors (not only “red” or “green”, but also light yellow and other similar colors) are also likely to drop out. That means you can also use forms with such a low-saturation background color. In this case you should find the proper color and its saturation manually, depending on the scanner model used.
Choosing the form color
Red-orange colors are preferable to green as a form color. This is because they offer the greatest possible contrast to blue, and consequently result in enhanced scanning and recognition quality if the forms are completed using blue ink. Appendix III lists the recommended colors for form processing, i.e. those most likely to disappear during scanning with almost any scanner. A “dropout” color list for a particular scanner (in Pantone or any other format) can also be obtained from your scanner manufacturer/dealer. The choice of color is yours; keep in mind, however, that the form color chosen should be pleasant for those required to complete it.
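The effect of the filter color on different inks can be illustrated with a deliberately simplified model in which the scanner sees only one color channel of each pixel and then converts it to black or white. The function name, the threshold and the RGB values below are assumptions chosen for illustration; real scanners and real inks behave less cleanly.

```python
def scan_through_filter(rgb, channel, threshold=128):
    """Return "white" or "black" for one pixel seen through a color filter.

    Simplified model: only the chosen channel of the pixel is measured,
    and anything at or above the threshold is treated as blank paper.
    """
    index = {"red": 0, "green": 1, "blue": 2}[channel]
    return "white" if rgb[index] >= threshold else "black"


# Illustrative RGB values, not measured colors.
SAMPLES = {
    "red-orange background": (255, 102, 51),
    "blue ink": (20, 40, 200),
    "black ink": (10, 10, 10),
    "white paper": (255, 255, 255),
}

for channel in ("red", "blue"):
    for name, rgb in SAMPLES.items():
        print(f"{channel} filter, {name}: {scan_through_filter(rgb, channel)}")
```

In this model the red filter turns the red-orange background white while both blue and black ink stay black. The blue filter does the opposite of what is wanted: blue ink turns white and is lost, which is the problem with blue forms and blue filtering.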
Advantages and disadvantages
Advantages
Drawing field borders on a color background form results in the highest possible recognition quality because:
1. Only the text image is subject to recognition; all garbage and field borders are removed.
2. Letters/digits overlapping field borders is less of a problem, as the borders themselves are simply backgrounds which drop out during scanning, leaving only the field contents for recognition.
3. Printing explanatory information in the same color as the background increases recognition quality, as the information “drops out” from the form image during scanning, and, consequently, does not interfere with field contents.
4. Printing explanatory information in the same color as the background saves disk space, as the form image file is smaller; and form processing speed increases.
Disadvantages
1. Creating a color form is complicated. A graphics editor is needed, and color forms have to be printed either professionally or by using a color Xerox machine. Note that Xerox machines do not guarantee identical field locations nor can black/color levels be altered.
2. Printing explanatory information in the same color as the form background reduces form readability significantly, as the contrast between background and explanatory text is poor. This can lead to incorrect form completion.
Gray forms
Gray forms are a subclass of color dropout forms. Gray forms are those with a shade of gray as a background color, which again disappears during scanning. A gray background is achieved by printing the field borders in black using the following parameters:
saturation of no more than 10%,
RGB parameters of 222,221,221
The resulting color is light gray due to low color saturation (i.e. black dots are rarefied). Both field border variants depicted below may be used:
(a) (b)
Scanning
Forms may be scanned using any white lamp scanner. However, in order for the background to drop out, the correct scanning parameters (contrast and brightness) must be chosen. If the brightness is very low and the contrast is very high, the gray background may still remain on the image after scanning. The scanning parameters must be set individually for each scanner.
Advantages and disadvantages
Advantages
1. The forms are very easy to develop using any graphics editor or word processing application e.g. MS Word. See: "Developing forms using Microsoft Word" (Page 22).
Disadvantages
1. Scanning parameters (brightness and contrast) can only be altered to a slight degree. This can prove problematic when scanning forms completed using a very light ink, as decreasing brightness to increase text image quality can result in the appearance of field borders or the background on the form image, and consequently, cause a deterioration in the recognition quality.
2. If the printer makes unauthorized changes to the technical print parameters (i.e. different paper, other color components) then the background may become too dark and could prove difficult to remove regardless of the scanning parameters chosen.
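Why the scanning parameters leave so little room for adjustment can be seen from a simple threshold model. The sketch below treats the combined effect of brightness and contrast as a single cut-off on a 0-255 gray scale, which is an assumption made for illustration; the background level of 222 comes from the RGB value given above, while the two ink levels are invented.

```python
def binarize(gray_values, threshold):
    """Map 0-255 gray levels to 1 (black, kept) or 0 (white, dropped)."""
    return [1 if g < threshold else 0 for g in gray_values]


BACKGROUND = 222  # gray level of the form background (RGB 222,221,221)
DARK_INK = 40     # illustrative value for ordinary dark ink
LIGHT_INK = 215   # illustrative value for a very light ink

# One row of pixels: background, dark stroke, background, light stroke, background.
row = [BACKGROUND, DARK_INK, BACKGROUND, LIGHT_INK, BACKGROUND]

print(binarize(row, 200))  # background dropped, but the light ink is lost too
print(binarize(row, 219))  # narrow window: both inks kept, background dropped
print(binarize(row, 230))  # background now appears on the image
```

With these numbers only cut-offs between 216 and 222 keep the light ink while still dropping the background. If the printed background comes out darker than intended, that window can close completely.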
Black&white forms with raster background
Fields on such forms are simply white spaces (usually rectangles) on a raster background. The background is made up of individual dots, no more than 0.1 mm in size, with the distance between each dot about 1 mm. This is much greater than is the case with gray forms, where dot density is such that the eye perceives the background as smooth gray.
Background filtering
The raster background does not disappear during scanning itself; instead, the raster dots are classified as garbage and removed from the image during despeckling.
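The idea of despeckling can be sketched as follows: every connected group of black pixels is measured, and groups smaller than a chosen size are erased. This is a generic illustration, not the cleaning procedure implemented in ABBYY FormReader; the function name and the `min_pixels` parameter are assumptions.

```python
def despeckle(image, min_pixels):
    """Remove 4-connected groups of black pixels smaller than min_pixels.

    image is a list of rows in which 1 is black and 0 is white.
    A new image is returned; the input is left unchanged.
    """
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Collect one connected component using an explicit stack.
            stack, component = [(y, x)], []
            seen[y][x] = True
            while stack:
                cy, cx = stack.pop()
                component.append((cy, cx))
                neighbours = ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1))
                for ny, nx in neighbours:
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            # Erase the component if it is too small to be part of a character.
            if len(component) < min_pixels:
                for cy, cx in component:
                    result[cy][cx] = 0
    return result


# Isolated raster dots around one vertical pen stroke (column 2, rows 1-4).
page = [
    [1, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 0],
]
cleaned = despeckle(page, min_pixels=3)
```

After cleaning, only the four-pixel stroke remains. Any handwritten mark no larger than a raster dot would be erased in exactly the same way, and dots that touch each other would form a group large enough to survive.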
Advantages and disadvantages
Advantages
1. If both the scanning parameters and the dot size are chosen correctly, the form image will be despeckled and the recognition module will acquire the field image free of garbage and superfluous characters.
2. Letters/digits overlapping field borders is less of a problem; field borders are part of the background, and therefore disappear during image cleaning, leaving only the field contents to be recognized.
Disadvantages
1. Raster forms require periods, commas and other small characters to be written thickly. This is because their size must be greater than that of the raster dots; otherwise they will be removed as part of the background.
2. Scanning parameters (brightness and contrast) can only be altered to a limited extent. This can prove problematic when scanning forms completed using a very light ink, as decreasing the brightness to increase the text image quality can result in the field borders or the background appearing on the form image, and consequently, worsen the recognition quality.
3. Not all graphic editors and word processors (e.g. MS Word) have the shading style described above (i.e. raster) in their standard styles palette. In addition, word processors normally only have a limited number of raster setup tools, leading to difficulties, for example, when trying to change the distance between raster dots, or their size.
4. A raster background can prove tiring to the eye, and consequently discourage form completion.
5. If printing density is increased, dots may become larger and, as a result, be left on the image as garbage. This, in turn, can make character recognition impossible.
Black&white forms with raster borders
Field borders here are made up of raster lines, i.e. sequences of small black dots. Raster dot size should be 0.39–0.5 pt.
The recommended raster dot size is 0.39 pt, with the distance between the raster dots being at least five times larger than the dot size. If the distance is less, the dots may merge during scanning and so remain on the image after despeckling. This, in turn, leads to lower recognition quality. Acceptable ways of completing fields with raster borders are shown in the figures below: