Abbyy Software FORMREADER Automated Forms Processing

AUTOMATED FORMS PROCESSING
Table of Contents
Introduction
Form Types
  What is a form?
  Form structure
  Form types and design elements
  What is forms processing?
  OCR/ICR basics
Automated Forms Processing
  Where data capture should be used?
  Designing a form
  Selecting form type and design
  Drawing a form
  Setting up FormReader
  Selecting a scanner
  Personnel training
  Processing cycles
Ensuring Data Quality
  Defining data quality
  Image preprocessing
  Data type checks
  Verification
  Data format checks
  Controlling logic
  Processing multipage forms
  Operator stress as an important quality factor
Organizing Automated Forms Processing
  Approaches to data capture
  Data capture basics
  Batch processing
  Operator specialization
  Scalability
  Processing queues
  Data flows
  Production capture
Using ABBYY Technologies to Solve Untypical Tasks
  What if FormReader does not support a required language?
  Remote scanning and processing faxed forms
  Distributed verification
  Capturing data from forms that are not machine-readable
Conclusion
Contacts
Introduction
Form Types
What is a form?
A form is a document with blank spaces to be filled in with particulars before it is executed. These blank spaces are called fields and are usually accompanied by explanations or captions that tell people what kind of information to enter into each field and in what format.
Forms are used whenever information must be collected from a large number of people. Government bodies in particular make wide use of all sorts of forms. In Russia, for example, forms are extensively used by the Tax Ministry and the Pension Fund. The former collects and processes tax returns filled in by hand, and the latter collects social security forms.
Forms are also widely used in business. Insurance companies, for example, have to handle thousands of insurance applications and insurance claims, marketing agencies have to deal with opinion polls and customer surveys, and educational institutions make extensive use of forms in all sorts of examinations and formalized tests. The banking industry also uses forms when issuing credit cards or handing out loans to its clients. There are also mail orders, coupons, medical forms, utility bills and many more; the list is practically endless.
In the course of our lives we fill in hundreds of forms: application forms, questionnaires, insurance claims, etc. At the same time computers have become indispensable for collecting and managing information, making the task of extracting data from printed documents even more pressing.
This White Paper presents an overview of the existing data capture technologies used to extract hand-printed text from completed forms and explains in detail the principles behind ABBYY FormReader, a data capture solution that is used to process forms in more than 30 countries.
Different types of paper form.
When completing a form one has to enter information into blank spaces or specially designed fields that make up the structure of the form. This information must then be extracted and processed. Forms from which data can be extracted, or "captured", automatically by computer are called machine-readable. Almost any form can be structured in such a way as to become machine-readable.
Forms can be filled in:
- by hand (such forms are called hand-printed, because information is entered in separate block letters, each letter occupying one character space);
- using a typewriter or printer in a printing house;
- using a combination of all of the above.
Form structure
People filling in a form are sometimes careless or sloppy. For this reason forms are designed in such a way as to make their completion intuitive and self-evident. The following design elements are used to tell people what to write and where:
- Entry (or data) fields. These include:
  - Text fields. Each text field consists of a certain number of character spaces supplied with an explanatory caption. Character spaces stand apart so that the entered letters do not merge.
  - Check boxes. These are fields of various shapes (usually squares, but in practice any geometrical figure with a closed boundary). A person filling in the form makes a mark such as a check, a tick or a cross in this field to select a particular option, or may simply ink over the entire box.
  - Groups of check boxes. These are used for multiple choices. Usually check boxes within one group correspond to mutually exclusive options, i.e. only one of them must be selected.
- Service fields. Service fields contain so-called anchor or reference points that facilitate forms processing. Anchor points are used by a data capture program to detect the top and bottom of a form and to correct distortions introduced by scanning. Anchor points may also be used to identify different forms if mixed types of forms are processed within one batch. The following elements may be used as reference points on forms processed by ABBYY FormReader:
  - black squares, corners and crosses;
  - vertical or horizontal lines;
  - static text, i.e. field captions that remain unchanged from form to form.
- ID fields or identifiers. These fields serve to identify the form. Black squares, corners and crosses can also be used to identify forms, but identification is more reliable if it relies on identifiers such as numbers, bar codes or form titles.
- Image areas. These areas contain objects which are not to be recognized, e.g. seals or signatures, which will be treated as pictures. FormReader can save such images into an ODBC database in the following formats: TIF, BMP, JPG, PCX, and WMF.
- Optional design elements: logos, headers, footers and other formatting elements. In data capture, data contained in these elements can also be used to identify forms, e.g. by analysing text in logos the program can find out which company has issued the invoice.
Examples of form elements: service fields, text fields, check boxes, identifier.
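As a rough illustration of how two reference points can help correct distortions introduced by scanning, the sketch below estimates the rotation, scale and shift between a template and a scanned page from a pair of anchor points, and then maps a template coordinate onto the scan. It is not FormReader's actual algorithm: the function names and coordinates are hypothetical, and a simple similarity transform fitted to two points is assumed.

```python
import cmath
import math


def fit_similarity(template_pts, scanned_pts):
    """Fit z -> a*z + b mapping two template anchors onto their scanned positions.

    Points are (x, y) tuples. The complex number `a` encodes rotation and
    scale, `b` encodes the shift. The two template anchors must be distinct.
    """
    p1, p2 = (complex(*p) for p in template_pts)
    q1, q2 = (complex(*q) for q in scanned_pts)
    a = (q2 - q1) / (p2 - p1)
    b = q1 - a * p1
    return a, b


def to_scan(point, a, b):
    """Map a template coordinate to the corresponding scan coordinate."""
    z = a * complex(*point) + b
    return (z.real, z.imag)


def skew_degrees(a):
    """Rotation angle of the scan relative to the template, in degrees."""
    return math.degrees(cmath.phase(a))


# Hypothetical example: the page was shifted by (10, 5) and rotated by 2 degrees.
angle = math.radians(2.0)
rot = complex(math.cos(angle), math.sin(angle))
template_anchors = [(0.0, 0.0), (200.0, 0.0)]
scanned_anchors = [to_scan(p, rot, complex(10, 5)) for p in template_anchors]

a, b = fit_similarity(template_anchors, scanned_anchors)
field_on_scan = to_scan((50.0, 120.0), a, b)
```

With more than two anchor points a least-squares fit would be more robust to a single badly detected mark; two points are used here only to keep the arithmetic visible.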
Form types and design elements
Forms can be divided into two major classes: structured forms, on which the locations and sizes of all fields are exactly the same for all forms in a batch, and flexible forms, on which the sizes and locations of fields may vary from form to form. In order to capture data from a structured form, a program has to know where to look for data. For this purpose a template is created, which is essentially a skeleton of a form that contains information about the locations of fields and the kind of data the program may expect to find in each of them. The program will then match this template with a completed form and separate the entered data from the field borders and captions. Next, the entered data are "read" or recognized, i.e. converted into text and digits.
All the forms in a batch must conform to one and the same pattern. It is also essential that reference points and ID fields are preserved during scanning.
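The idea of a template as a skeleton of field locations and expected data types can be sketched as a small data structure. The example below is only an assumption about how such a template might be represented; the `Field` class, its attributes and the toy image are hypothetical and do not reflect FormReader's template format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    kind: str      # kind of data expected, e.g. "text" or "checkbox"
    left: int
    top: int
    width: int
    height: int


def crop(image, field):
    """Return the part of a 2D image (a list of pixel rows) covered by a field."""
    return [row[field.left:field.left + field.width]
            for row in image[field.top:field.top + field.height]]


def extract_fields(image, template):
    """Cut every templated field out of a scanned form image."""
    return {f.name: crop(image, f) for f in template}


# Hypothetical 6x8 binary image (1 = black pixel) and a two-field template.
image = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
template = [
    Field("surname", "text", left=1, top=1, width=2, height=2),
    Field("agree", "checkbox", left=5, top=1, width=2, height=2),
]
regions = extract_fields(image, template)
```

Because every form in a batch follows the same pattern, the same list of fields can be applied to each scanned page once the page has been aligned with the template.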
If a form is not structured, it cannot be processed automatically and requires a human operator to read the data from its fields and type them into a database. This is a slow and tedious process that can be avoided by designing a well-structured form that can then be read by computer.
Depending on their design, machine-readable forms can be divided into the following three major types:
- Colour forms. All data fields on such forms consist of white rectangles printed on a colour background. Backgrounds are usually light grey, pink, orange, or green. The colours and saturation are selected so that the background disappears during scanning (this is why they are also known as drop-out colours). Ideally, all elements must disappear during scanning with the exception of reference points and ID fields. Special scanners with red or green lamps are used to scan such forms. Alternatively, the drivers of common scanners may be adjusted so that they become blind to the background. Colour forms provide the best recognition quality.
- Raster forms. Data fields on such forms consist of white rectangles printed on a colour background, but unlike on colour forms, backgrounds are made up of small dots located at regular intervals from one another. These dots do not disappear during scanning, but ABBYY recognition software can remove such dots without losing information entered into the data fields. There is also a subtype of raster form which has no background at all. The borders of data fields on such forms are made up of separate dots which can then be filtered out by ABBYY software.
- Black-and-white linear forms. Field borders on such forms consist of solid black lines which do not disappear during scanning.
The following field designs are available for linear forms:
(a) solid lines
(b) frames for words
(c) isolated frames for characters
(d) conjoined frames for characters
(e) lines with "combs"
(f) frames with "combs"
The recognition engine separates the data from the field borders and then recognizes them. ABBYY FormReader uses information about the field design provided in the template and looks for specific design elements such as vertical lines or the number of character cells. The program then ignores the formatting and recognizes only the data contained within the fields. A form may also contain "garbage", i.e. undesirable artefacts resembling field lines. The program remembers the shape of the fields and distinguishes between the meaningful field borders and the arbitrary "noise", which is removed so that it does not interfere with recognition.
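One simple way to picture the removal of raster dots and stray "noise" is a filter that clears black pixels with no black neighbours. The sketch below is an illustrative despeckling routine only; the source does not describe how ABBYY software filters dots, and the function name and toy image are hypothetical.

```python
def remove_isolated_dots(image):
    """Clear black pixels (1) that have no black pixel among their 8 neighbours."""
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            has_neighbour = any(
                image[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                cleaned[y][x] = 0
    return cleaned


# Hypothetical scan fragment: two isolated dots and one connected pen stroke.
noisy = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
]
clean = remove_isolated_dots(noisy)
```

A filter this crude would also erase genuine single-pixel marks such as a faint full stop, which is one reason a real engine would rely on knowledge of the field design rather than on pixel neighbourhoods alone.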
A blackandwhite form on which characters are to be entered into separate frames.
Colour dropout form.
Raster field borders.
What is forms processing?
Forms processing is a process whereby information entered into data fields is converted into electronic form:
- entered data are "captured" from their respective fields;
- forms themselves are digitised and saved as images.
In most cases forms processing is considered complete when the data from all the forms have been captured, verified and saved into a database. It is also essential that the integrity of the captured data be preserved.
As has been mentioned earlier, forms can be processed manually or using forms processing software. In the sections that follow we consider the advantages and disadvantages of each method.
Many people still prefer to process forms manually, even though this is not the most efficient or reliable method. Manual data entry typically requires the following:
- Each human operator (keyer) must be provided with a workplace. This is the largest expense, since each operator needs a computer connected to the local area network, and the average productivity of a qualified operator is no more than 200 forms per day.
- Forms preprocessing requires sorting operators and input controllers. Controllers make sure that no pages are lost if a form has more than one page and oversee the sorting process. The number of sorting operators and input controllers depends on the expected workload. On average, one sorting operator will sort up to 1,000 forms per day, and one input controller will handle up to 300 forms per day.
- Once the data from forms have been entered into a computer, they must be checked by verifiers. Verifiers check the data entered by keyers and correct any errors that may have occurred.
- Finally, a manager is required to supervise the entire data entry team.
Now suppose you need to enter data from 1,000 forms per day. You will need five keyers, one input controller and one manager. This means seven desks, seven chairs, seven PCs and additional equipment such as network adapters and a UPS.
Item                          Cost, USD   Qty   Total, USD
PC                                1,000     7        7,000
Office furniture                  1,000     7        7,000
Network and other equipment                          1,000
Total                                               15,000
Table 1. Lump-sum costs for manual processing at 1,000 pages per day.
The lumpsum costs stand at around USD 15,000. Now let's count your monthly costs for the same productivity. You will need an office of at least 50 sq. m. which may cost you around 1,000 per month. Labour costs will amount to USD 1200 for the operator and controller and another USD 2000 for the manager.
Item                  Cost, USD   Qty         Total, USD
Operators' salary         1,200   5                6,000
Controller's salary       1,200   1                1,200
Manager's salary          2,000   1                2,000
Office space                 20   50 sq. m.        1,000
Total                                             10,200
Table 2. Monthly costs for manual processing at 1,000 pages per day.
Note that these calculations do not include the cost of electricity, telephone, cleaning, fill-in staff, etc. But even this austere budget stands at around USD 10,200 per month.
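The arithmetic behind Tables 1 and 2 can be retraced in a few lines. The sketch below uses only the figures quoted above (200 forms per keyer per day, USD 1,000 per PC and per set of furniture, and so on); the variable names are illustrative.

```python
import math

FORMS_PER_DAY = 1000
KEYER_RATE = 200          # forms per keyer per day (upper bound from the text)

keyers = math.ceil(FORMS_PER_DAY / KEYER_RATE)
controllers = 1
managers = 1
staff = keyers + controllers + managers   # number of workplaces

# One-off costs (Table 1): a PC and furniture per workplace, plus network gear.
lump_sum = staff * 1000 + staff * 1000 + 1000

# Monthly costs (Table 2): salaries plus 50 sq. m. of office space at USD 20.
monthly = keyers * 1200 + controllers * 1200 + managers * 2000 + 50 * 20
```

Changing `FORMS_PER_DAY` shows how quickly the number of workplaces, and with it both cost figures, grows with the workload.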
The cost of manual processing
In the previous section you saw that the lump-sum and running costs of manual forms processing add up to a considerable sum. This gives us the first conclusion: manual processing is expensive.
But money is not the only problem associated with manual forms processing. You will need additional staff and another tier of management. Obviously it takes some time to set up a team of 8 to 10 employees and buy the necessary equipment. And some of the new staff may not like this tiresome job and leave.
Now suppose your client needs his forms processed by tomorrow or by the day after tomorrow. Obviously, high costs are not the only problem: you simply won't be able to kick-start the whole process within these two days. The second conclusion suggests itself: manual processing takes time to set up.
Another important point is that whatever the size of your processing team, you won't be able to increase its productivity quickly: hiring additional operators is useless unless you provide them with the right equipment. This equipment will require additional office space. Hiring additional staff entails costs which are comparable to the lump-sum costs of setting up the entire team. The third conclusion is that manual processing is not easily scalable.
There is a host of other problems. The most critical of them have to do with the human factor, which is practically unsolvable. Manual data entry is a tedious job (try typing, for example, a newspaper article in your word processor). This means that even experienced keyers will make mistakes, and their number tends to increase towards the end of the working day. Some of these mistakes will be corrected by the output controller, but controllers are also human, and the quality of the output data tends to deteriorate. Typing is also a great strain on the eyes, so you are likely to get complaints from your staff within the first two months.
The quality of the output data is likely to be unacceptably low because a human operator cannot verify data character by character for hours. Your customer will never be happy with an error-ridden database which your team of operators took so long to create. Two further conclusions arise: your staff won't like the job, and you won't like the results of their work.
It follows, then, that manual forms processing is not the best solution, particularly for companies which need to process large numbers of forms regularly.
1. Manual processing is expensive.
2. Manual processing takes time to set up.
3. Manual processing is not easily scalable.
4. Your staff won't like the job.
5. And you won't like the results of their work.
Scheme of manual forms processing: input, keyers, manager, database.
Automated forms processing
Scheme of automated forms processing: between input and database, a data entry operator supervises scanning, recognition, verification and export of data.
An alternative is a data capture solution such as ABBYY FormReader. This is how FormReader works:
1. A batch of completed forms is scanned using a high-speed scanner (usually one that scans at least 10 pages per minute).
2. Most of the data are recognized automatically.
3. A few characters about which the program is uncertain are passed on to a human operator.
4. Verified data are saved into a database.
It is noteworthy that the entire process requires only one human operator since all of the stages, except verification, are fully automated.
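The step in which uncertain characters are passed to a human operator can be pictured as a simple split on a confidence score. The sketch below is an assumption for illustration: the `RecognizedChar` class, the confidence values and the 0.9 threshold are hypothetical, and the source does not say how FormReader measures uncertainty.

```python
from dataclasses import dataclass


@dataclass
class RecognizedChar:
    field: str
    char: str
    confidence: float   # hypothetical score between 0 and 1


def split_for_verification(chars, threshold=0.9):
    """Separate confidently recognized characters from those needing an operator."""
    accepted = [c for c in chars if c.confidence >= threshold]
    uncertain = [c for c in chars if c.confidence < threshold]
    return accepted, uncertain


results = [
    RecognizedChar("surname", "S", 0.99),
    RecognizedChar("surname", "M", 0.97),
    RecognizedChar("surname", "1", 0.55),   # could be "I" or "1"
    RecognizedChar("surname", "T", 0.98),
]
accepted, uncertain = split_for_verification(results)
```

Only the characters in `uncertain` would reach the operator, which is why a single person can keep up with a scanner.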
The operator's workplace must be equipped with one scanner and one PC connected to the local area network. This workplace can be set up within one day and does not require a lot of office space.
Neither manual sorting nor checking for missing pages is required, since FormReader can identify forms and select the matching template.
With ABBYY FormReader 6.0 Desktop Edition, one operator will be able to process from 1,000 to 3,000 forms per day depending on the complexity of their layout.
Now let us estimate the possible one-time and monthly costs for processing the same 1,000 pages per day using ABBYY FormReader.
Item                               Cost, USD   Qty   Total, USD
PC                                     1,000     1        1,000
Scanner                                1,500     1        1,500
Office furniture                       1,000     1        1,000
Software licence                       1,695     1        1,695
Software installation and setup        1,500     1        1,500
Total                                                     6,695
Table 3. Lump-sum costs for FormReader at 1,000 pages per day.
Item                   Cost, USD   Qty         Total, USD
Main operator              1,200   1 person         1,200
Fill-in operator           1,000   1 person         1,000
Office space                  50   10 sq. m.          500
Scanner maintenance                                    50
Total                                               3,250
Table 4. Monthly costs for ABBYY FormReader at 1,000 forms per day.
The costs of manual and automated processing compared:
                  Manual processing,   Form processing with   Money saved,
                  USD                  FormReader, USD        USD
Lump-sum costs    15,000               6,695                  8,305
Monthly costs     10,200               3,250                  6,950
Table 5. Money you can save when processing 1,000 forms per day using ABBYY FormReader.
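To see how the difference accumulates, the totals from Table 5 can be combined into a cumulative cost over any number of months. The sketch below uses only the totals stated in the tables; the function names are illustrative.

```python
MANUAL = {"lump_sum": 15_000, "monthly": 10_200}
FORMREADER = {"lump_sum": 6_695, "monthly": 3_250}


def total_cost(option, months):
    """One-off cost plus running costs over a number of months, in USD."""
    return option["lump_sum"] + option["monthly"] * months


def savings(months):
    """Money saved by the automated option after a number of months, in USD."""
    return total_cost(MANUAL, months) - total_cost(FORMREADER, months)


first_year = savings(12)
```

Under these figures the saving is USD 8,305 at the outset and grows by USD 6,950 with every month of operation.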
These figures speak for themselves, but, more importantly, FormReader will solve all five of the problems discussed above.
ABBYY FormReader is a highly scalable solution: you only need a few more FormReader modules and several additional operators (whom it will take just hours to train). There is no other way to increase productivity tenfold within just one day.
It goes without saying that the quality of output data will be much higher, because the role of the human factor will be reduced to a minimum. Most of the job will be done by computers which never get tired and never make typos. What's more, FormReader can use specially designed validation rules ensuring even higher data integrity and reliability.
OCR/ICR basics
There are two major types of character recognition: Optical Character Recognition (OCR) and Intelligent Character Recognition (ICR). OCR programs recognize characters printed using a printer, a plotter or a typewriter. ICR programs read documents filled in by hand in block letters (so-called handprint recognition). Let us consider the main differences between OCR programs and ICR programs.
An OCR program first analyses the image and divides it into zones which include text, tables, illustrations, etc. Next, it divides these zones into smaller objects: paragraphs, lines, words, and characters. Once the characters have been recognized by the character classifiers, the OCR program will assemble them back into words, lines, paragraphs, etc., until it gets an electronic version of the original paper document.
ICR programs, which are mainly used to process hand-filled forms, work differently. First, an ICR program detects zones that are expected to contain meaningful data entered by the user. These zones are then processed by the program's modules, including the character classifiers. ICR programs do not attempt to recreate the original document. Instead, they extract information from particular fields and save it into a database.
An important feature of an ICR program is mark sense recognition, or recognition of marks in check boxes. Check boxes are widely used on all sorts of forms, because they make completion easier and can increase the reliability of output data up to 99.9%. ABBYY FormReader 6.0 can recognize all sorts of marks. Mark sense recognition is usually referred to as OMR (Optical Mark Recognition) and works as follows: when creating a template, the operator marks out a check-box zone where the program has to look for a mark; the program then analyses these zones on completed forms and calculates the black/white ratio in each of them. If the proportion of black in a check box exceeds a certain threshold, FormReader will consider the check box selected. FormReader can even recognize corrected marks, i.e. boxes ticked by mistake and then inked over.
ABBYY FormReader 6.0 will reliably recognize not only conventional ticks/checks and crosses, but also completely inked-over check boxes if the latter are rectangular in shape or have no borders.
This feature of ABBYY FormReader has a very important practical application. Suppose someone filling in a form makes a mistake and ticks the wrong box. Instead of taking a new blank form and filling it in from scratch, they can just blot out the mark in the check box selected by mistake and put a new mark in the right check box. FormReader will treat the inked-over check box as a mistake and consider it to be unchecked. This method may also be used when recognizing text fields.
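The black/white ratio test and the treatment of inked-over boxes can be sketched as a classification with two thresholds. This is a minimal illustration under stated assumptions: the zone is taken to cover only the interior of the box, and the threshold values 0.15 and 0.8 are guesses chosen for the example, not FormReader settings.

```python
def black_ratio(zone):
    """Fraction of black pixels (1) in a check-box zone given as a list of rows."""
    pixels = [p for row in zone for p in row]
    return sum(pixels) / len(pixels)


def classify_checkbox(zone, mark_threshold=0.15, inked_threshold=0.8):
    """Classify a zone as 'unchecked', 'checked' or 'inked_over'.

    Both thresholds are illustrative guesses, not FormReader settings.
    """
    ratio = black_ratio(zone)
    if ratio >= inked_threshold:
        return "inked_over"      # a corrected mark, to be treated as unchecked
    if ratio >= mark_threshold:
        return "checked"
    return "unchecked"


empty = [[0, 0, 0, 0]] * 4
cross = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
]
blotted = [[1, 1, 1, 1]] * 4

labels = [classify_checkbox(z) for z in (empty, cross, blotted)]
```

In practice the lower threshold would need tuning so that specks of dirt do not count as marks, and borderline zones could be sent to an operator for verification rather than decided automatically.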
Verification of inkedover check boxes in ABBYY FormReader Desktop Edition.