Third-party software is copyrighted and licensed from Kofax’s suppliers.
This product is protected by U.S. Patent No. 5,159,667.
THIS SOFTWARE CONTAINS CONFIDENTIAL INFORMATION AND TRADE
SECRETS OF KOFAX, INC. USE, DISCLOSURE OR REPRODUCTION IS
PROHIBITED WITHOUT THE PRIOR EXPRESS WRITTEN PERMISSION OF
KOFAX, INC.
Kofax, the Kofax logo, INDICIUS, Ascent Capture, Kofax Capture, VirtualReScan, the
“VRS VirtualReScan” logo, and VRS are trademarks or registered trademarks of
Kofax, Inc. in the U.S. and other countries. All other trademarks are the trademarks
or registered trademarks of their respective owners.
U.S. Government Rights Commercial software. Government users are subject to the
Kofax, Inc. standard license agreement and applicable provisions of the FAR and its
supplements.
You agree that you do not intend to and will not, directly or indirectly, export or
transmit the Software or related documentation and technical data to any country to
which such export or transmission is restricted by any applicable U.S. regulation or
statute, without the prior written consent, if required, of the Bureau of Export
Administration of the U.S. Department of Commerce, or such other governmental
entity as may have jurisdiction over such export or transmission. You represent and
warrant that you are not located in, under the control of, or a national or resident of
any such country.
DOCUMENTATION IS PROVIDED “AS IS” AND ALL EXPRESS OR IMPLIED
CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY
IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE
EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.
Contents
How to Use This Guide
Step 8: Publish Batch Class
Step 9: Process Batch
Getting Started Guide (Classification and Separation)
Introduction
This guide introduces INDICIUS and describes how it is used to automatically
separate pages into documents and to classify documents. It starts with brief
installation instructions which are followed by a tutorial. The tutorial will guide you
through processing batches using the pre-installed Mortgage Applications example.
The guide then describes how each of the modules is configured, enabling you to
create a new set of configurations to use to classify mortgage application documents
(and how to assign this configuration to Kofax Capture). The final tutorial steps
through creating an alternative configuration, this time to include automatic
separation of pages into documents.
How to Use This Guide
This guide assumes that you have a thorough understanding of Windows standards,
applications, and interfaces.
This guide is for people who need an introduction to INDICIUS, specifically
automatically classifying and separating documents. It is beneficial to people who
will be:
Configuring INDICIUS to process documents on the Kofax Capture platform.
Administering or supporting an INDICIUS solution.
Read the entire guide sequentially. It includes several tutorials which need to be
completed in order. The tutorials require INDICIUS to be installed including the
INDICIUS examples.
If you need more detailed information on configuring a module, open the INDICIUS Help and read the relevant “How to configure” book. Additional details of all the
documentation provided with INDICIUS are included in the section Related
Documentation.
Related Documentation
The following documentation is included with INDICIUS.
Each PDF guide can be opened by clicking Start on the taskbar to display the menu,
and selecting All Programs | INDICIUS | Documentation.
The INDICIUS Help can be opened from the same menu, but can also be opened from
the Help menu within the tools. Pressing F1 within Definer and Script Editor will
open the topic for the feature being used.
Installation Guide (.pdf)
This guide is written for those installing INDICIUS, either on a development
computer (where a solution is configured or tested) or on a production computer.
The guide explains:
System requirements.
Licensing requirements.
The procedure for installing INDICIUS.
How to customize modules running as dedicated applications.
How to install the unattended modules as Windows services.
User's Guide (.pdf)
The User's Guide (.pdf) is written for keyboard operators (keyers) who will be using
the attended modules on a production computer, and for those using all of the
modules on a development computer.
The guide explains:
What each INDICIUS module is used for.
How to operate each module.
Getting Started Guides
These guides are written for people who need an introduction to INDICIUS. The
guides are useful as a starting point for those who will be configuring or
administering INDICIUS, or those using the keying modules. The guides are
self-contained; however, each focuses on configuring a different document processing
solution.
Getting Started Guide (Fixed-Form) (.pdf)
The Getting Started Guide (Fixed-Form) (.pdf) focuses on configuring a solution to
extract data from fixed-form (structured) documents.
The guide explains:
How to extract data from single page documents of a known document type,
using the installed Order Forms example.
How the tools, concepts and configuration files relate to the setup in Kofax
Capture.
How to replicate the Order Forms configuration by following detailed
procedures.
Getting Started Guide (Free-Form) (.pdf)
The Getting Started Guide (Free-Form) (.pdf) focuses on configuring a solution to extract
data from free-form (semi-structured or unstructured) documents.
The guide explains:
How to extract data from single page documents of a known document type,
using the installed Solicitors Letters example.
How the tools, concepts and configuration files relate to the setup in Kofax
Capture.
How to replicate the Solicitors Letters configuration by following detailed
procedures.
INDICIUS Help
The INDICIUS Help is written for those configuring a solution and for system
administrators, and assumes those reading it have read the Getting Started Guides or
attended an INDICIUS training course. This assumption is made so that the
INDICIUS Help can provide the most accurate and detailed information across every
aspect of the product.
The INDICIUS Help explains:
How to configure the INDICIUS modules to process a document set.
How to use the module setup dialogs to assign a configuration to a batch
class.
The integration of INDICIUS within the Kofax Capture platform.
How to set up and monitor an efficient production environment
(Administration Help).
The INDICIUS Help also contains a reference section which includes:
Definition file parameters used by the Recognition and Correction modules.
Script objects, hooks, methods and properties used by all of the modules.
Visual Basic Scripting Help (.chm)
The Visual Basic Scripting Help (.chm) is provided for further information on VB
scripting.
Chapter 1
Overview
Introduction
This chapter introduces some of the concepts of data capture and key points of
INDICIUS.
What Does INDICIUS Add to Kofax Capture?
INDICIUS is a set of modules that provide additional automatic recognition
(classification, separation and extraction) as well as advanced keying (indexing and
validation) functionality to Kofax Capture.
Kofax Capture scans paper-based documents, creating a series of scanned image
files. Alternatively, Kofax Capture Import Connector – Email can retrieve emails
(including attachments) from a server. Kofax Capture then routes the files through
INDICIUS, a set of modules that separate pages into documents, classify documents
and extract information. Within INDICIUS these classification, separation and
extraction results are presented for review by keyboard operators. The accurate,
validated data and images can then be exported to a back-end system using Kofax
Capture and/or INDICIUS depending on the requirements of the system.
Features of INDICIUS
This guide covers two key features of INDICIUS: classification and separation.
What is Classification?
Classification is the process of assigning a type to each document, either to export to
the final repository or to use during extraction. INDICIUS can be configured to
classify documents directly or as a result of page classification and document
separation.
Classification Methods
Classification can be done using one or more of the following methods:
Image Classification: Classification based on the overall layout and structure
of a page, including lines, boxes, logos and placement of text.
Text Classification: Classification based on detailed analysis of the text
content of a page or document.
Rules-Based Classification: Classification performed by searching for specific
data or keywords, independent of layout.
Templated Classification: Classification determined by the presence of one
or more marks, barcodes or items of text in pre-defined locations.
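As an illustration of the rules-based method only, the following Python sketch assigns a document type by searching the page text for keywords, independent of layout. It is not INDICIUS code: the rule table, the phrases and the function name are hypothetical, and only the document type names are taken from this guide's example.

```python
from typing import Optional

# Hypothetical keyword rules: document type -> phrases that identify it.
# The type names come from this guide's example; the phrases are invented.
KEYWORD_RULES = {
    "Truth In Lending": ["truth in lending", "annual percentage rate"],
    "Appraisal Report": ["appraisal report", "appraised value"],
    "Loan Application": ["loan application", "borrower information"],
}


def classify_by_keywords(text: str) -> Optional[str]:
    """Return the first document type whose keyword appears in the text."""
    lowered = text.lower()
    for doc_type, phrases in KEYWORD_RULES.items():
        if any(phrase in lowered for phrase in phrases):
            return doc_type
    return None  # no rule matched; leave the document unclassified
```

A document that matches no rule is returned as None in this sketch, so that it could be passed to another method or to manual review.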
What is Separation?
Document separation methods provide an automated approach to identifying the
boundaries between multiple documents in a single batch.
Separation Methods
Document separation is determined from the page classification results using either
of the following methods:
Rules-based document separation: One or more rules specify when new
documents are created; for example, if a page of type A is seen, create a
document of type X.
Advanced document separation: A probabilistic method that ascertains the
most likely document structure from the page classifications and their
confidence scores. This method is robust to variation in documents and misclassifications due to its probabilistic nature.
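The rules-based example above (a page of type A creates a document of type X) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the INDICIUS implementation: the rule table, the "Unknown" fallback type and the function name are hypothetical, and pages that match no rule are assumed to continue the current document.

```python
from typing import Dict, List, Tuple

# Hypothetical rules: a page of this type starts a new document of that type.
START_RULES: Dict[str, str] = {
    "A": "X",
    "B": "Y",
}


def separate(page_types: List[str]) -> List[Tuple[str, List[int]]]:
    """Group page indexes into (document type, pages) using the start rules."""
    documents: List[Tuple[str, List[int]]] = []
    for index, page_type in enumerate(page_types):
        if page_type in START_RULES or not documents:
            # Start a new document; pages before any rule fires are "Unknown".
            doc_type = START_RULES.get(page_type, "Unknown")
            documents.append((doc_type, [index]))
        else:
            # Any other page continues the current document.
            documents[-1][1].append(index)
    return documents
```

The sketch covers only the rules-based method. The advanced method also weighs the confidence score of each page classification, which this example does not attempt to model.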
Classification and Separation of Documents in Production
The Recognition and Document Review modules (along with Kofax Capture Scan)
are used to classify and separate documents.
Recognition
Classification and separation are done in the same processing step, in an instance of
the Recognition module. A single solution would do one of the following:
Document Classification
Page Classification and Separation (resulting in document classification)
If extraction is also being done as part of the queue, an additional instance of the
Recognition module named INDICIUS Recognition (Classification and Separation) is
used for classification and separation.
This leaves the standard instance of Recognition available for extraction.
Note Data extraction is generally done in the standard instance of Recognition, once
all document types have been determined (and manually reviewed if needed).
Document Review
Document Review is usually used after Recognition (Classification and Separation)
to review the automatic classification results. Within Document Review, a user can
confirm any types that Recognition is uncertain about, fix any validation failures (for
example, by changing document type) and review the batch.
Scan
As well as obtaining the images, Kofax Capture Scan is used prior to the INDICIUS
modules to do one of the following:
Establish document boundaries using patch code separators or fixed pages
(for a Document Classification solution).
Place imported pages in a single document (for a Page Classification and
Separation solution).
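For the Document Classification case, the idea of establishing document boundaries from separator sheets can be sketched in Python as follows. This is an illustration only and does not reflect how Kofax Capture Scan is implemented: the "PATCH" marker and the function name are hypothetical, and the sketch assumes that separator sheets are discarded rather than kept as pages.

```python
from typing import List

SEPARATOR = "PATCH"  # hypothetical marker for a patch code separator sheet


def split_on_separators(pages: List[str]) -> List[List[str]]:
    """Split a flat page list into documents at each separator sheet."""
    documents: List[List[str]] = []
    current: List[str] = []
    for page in pages:
        if page == SEPARATOR:
            # A separator closes the current document and is itself discarded.
            if current:
                documents.append(current)
            current = []
        else:
            current.append(page)
    if current:
        documents.append(current)
    return documents
```

In the Page Classification and Separation case no such split is made at scan time: all imported pages stay in one document and the boundaries are decided later by Recognition.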
What is the System Architecture for an INDICIUS Solution?
The system architecture is determined by the capacity of the system: high volume or
low volume.
High Volume, Distributed Environment
In high volume environments it is typical to have multiple stations processing
batches, with each station dedicated to running a specific module.
Low Volume, Single Station Environment
In lower volume environments it is possible to run batches through all the modules
on a single station, using Kofax Capture Batch Manager.
Configuring a Classification and Separation Solution
Configuration (that is, setting up the INDICIUS modules to process particular
documents) is a two-step process:
Configure the INDICIUS modules using the INDICIUS configuration tools
and a set of sample documents.
Assign the configuration to a batch class using Kofax Capture
Administration.
The primary INDICIUS tool used for configuration is Transformation Studio. The
tutorial in this guide will step you through the configuration process.
The Tutorial
This guide includes a tutorial on processing documents using the classification and
separation functionality in INDICIUS.
The tutorial works through processing and configuring two solutions:
Document Classification
Page Classification and Separation
The Example Documents
The tutorial uses a set of example mortgage application documents with the
following document types:
Appraisal Report
Header
Funding Transmittal
Redemption
Initial Escrow
Request for Tax Form
Tax Escrow
Truth In Lending
Loan Application
Chapter 2
Installing INDICIUS
Introduction
This chapter provides instructions for installing INDICIUS using the installation
wizard (standard installation).
To install INDICIUS the following items are required:
1 A computer satisfying the system requirements as described in the
Installation Guide (.pdf).
2 An INDICIUS installation CD.
3 A Kofax Capture license hardware key with INDICIUS features enabled.
Note Kofax Capture must be pre-installed and licensed.
INDICIUS is installed to its own program folder. By default this location, referred to
as <Installation Path>, is:
C:\Program Files\INDICIUS\.
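Locations in this guide are written relative to <Installation Path>. If you script anything around the examples, a small helper such as the following Python sketch can expand that placeholder. The helper name is hypothetical, and the default shown is the one stated above; a real installation may use a different folder.

```python
from pathlib import PureWindowsPath

# Default location stated in this guide; a real installation may differ.
INSTALLATION_PATH = PureWindowsPath(r"C:\Program Files\INDICIUS")


def resolve(relative: str, base: PureWindowsPath = INSTALLATION_PATH) -> str:
    """Return the full path for a location given relative to <Installation Path>."""
    return str(base / relative)
```

PureWindowsPath is used so that the path is joined with Windows separators regardless of the computer the script runs on.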
Installing INDICIUS for the First Time
Standard Installation
To install INDICIUS
1 Place the INDICIUS installation CD into the CD-ROM drive.
The main installation screen will display.
2 Select INDICIUS and follow the on-screen instructions.
3 To install Document Review, select INDICIUS Document Review and follow
the on-screen instructions.
4 To install Transformation Studio, select Transformation Studio and follow
the on-screen instructions.
Licensing
The Kofax Capture license hardware key controls:
The number of stations of each module that can be run simultaneously.
Any optional features for each module.
The page throughput of Recognition and Scripted Export.
Note The tools (except Recognition Test Tool) are not licensed through the hardware
key.
For more information on licensing, refer to the Installation Guide (.pdf).
Chapter 3
Processing
Introduction
This chapter will introduce you to the INDICIUS modules as they are used in
production. You will use a pre-configured example solution to experience how the
modules run.
The Mortgage Applications Example
The INDICIUS installation includes an example configuration that demonstrates
some of the processing features of classification and separation in INDICIUS. The
example includes INDICIUS module configurations (assigned to pre-defined batch
classes) for capturing data from a set of example images.
The example contains two different configurations which demonstrate two different
methods for using INDICIUS. The first demonstrates document classification, and
the second demonstrates page classification and separation (resulting in document
classification).
Setting Up the Classification and Separation Instance of
Recognition
This section describes how to set up the classification and separation
instance of Recognition.
Registering the Additional Instance
The following steps will guide you through registering an additional instance of
Recognition to be used for classification and separation.
To register the Classification and Separation instance of Recognition
1 Click Start on the taskbar to display the menu, and select All Programs |
Kofax Capture 8.0 | Administration.
2 Select Tools | Custom Module Manager... to open the Custom Module
Manager window.
3 Click Add to open the file selection window and select the following file:
The Import/Export window will display showing the progress of the
unpacking operation.
5 When unpacking has completed, click OK.
The Import window displays showing “Mortgage Apps” in the Available
Batch Classes list.
6 Double-click “Mortgage Apps” to add it to the Selected Batch Classes list.
7 Click Import.
The Import/Export window will display showing the progress of the import
operation.
8 When import has completed, click OK.
9 Repeat the previous steps to import the following batch class:
<Installation Path>\examples\Mortgage Applications\Mortgage Apps with
Separation.cab.
Important For this batch class (or whichever batch class you import last)
select the “Do not import duplicates” option on the Import window. If you
do not select this option, Kofax Capture will rename the pre-configured
document classes during import and the example will not function correctly.
10 Select File | Publish.
The Publish window will display.
11 Press and hold Ctrl and click to select the following batch classes:
Mortgage Apps
Mortgage Apps with Separation
12 Click Publish.
The progress of the publishing operation will be logged in the Results panel.
Note It is normal for a warning to be generated when the batch class is
published. This is because Index Fields are defined to hold exported data but
neither Kofax Capture Validation nor Kofax Capture Recognition Server are
included in the batch class. This warning can be ignored, as the index fields
will in fact be populated by the INDICIUS modules.
Troubleshooting
In a client-server installation of Kofax Capture, an error may be generated when the
batch class is published.
The batch class is configured to store images in the Kofax Capture images folder, and
an error is raised if this folder exists elsewhere on the network. If this is the case, use
the procedure below to specify a different images folder.
To specify a different images folder
1 On the Batch panel, select the “Mortgage Apps” batch class.
2 Right click on the selection to display the menu, and select Properties.
3 The Batch Class Properties window is displayed.
4 Change the image folder by directly entering a path in the “Image folder:”
box, or click Browse to navigate to a folder on disk.
5 Click OK.
6 Repeat the previous steps for the “Mortgage Apps with Separation” batch
class.
7 Select File | Publish.
The Publish window will display.
8 Press and hold Ctrl and click to select the following batch classes:
Mortgage Apps
Mortgage Apps with Separation
9 Click Publish.
The progress of the publishing operation will be logged in the Results panel.
10 When publishing has been completed, click Close.
Viewing the Modules
To view the modules included in the example batch classes
1 On the Batch panel, select the “Mortgage Apps” batch class.
2 Right click on the selection to display the menu, and select Properties.
3 The Batch Class Properties window is displayed.
4 Select the Queues tab.
The modules included in the batch class are displayed in the Selected Queues
list.
5 Click OK.
6 Optionally, repeat the previous steps for the “Mortgage Apps with
Separation” batch class.
Both batch classes contain the same modules, but these modules are
configured differently in each case.
Document Classification
In the document classification example, INDICIUS Recognition (Classification and
Separation) and INDICIUS Document Review are used to classify mortgage
applications. Document boundaries are established prior to INDICIUS Recognition
(Classification and Separation) using patch code separators in Kofax Capture Scan.
The complete set of Kofax Capture and INDICIUS modules used is:
This section will step you through classifying documents automatically using the
example. This document will first take you through typical processing in a low
volume scenario, using Kofax Capture Batch Manager. It will then step you through
the same process, but opening each module as a dedicated application.
Tutorial: Process the Batch from Batch Manager
Create a Batch
To create an example batch in Batch Manager
1 Start Batch Manager by clicking Start on the taskbar to display the menu, and
selecting:
All Programs | Kofax Capture 8.0 | Batch Manager.
2 Select File | New Batch or click Create Batch on the toolbar.
3 Make sure the “Mortgage Apps” batch class is selected.
Figure 3-1. Create Batch Window
4 Enter a name for the new batch in the “Name:” box, for example “Mortgage
Applications 1”.
5 Click Save.
6 Click Close.
Your batch is displayed in the list. The Queue column indicates that the
batch is ready to be processed by Kofax Capture Scan.
Import Images and Establish Document Boundaries
To import images and establish document boundaries
1 Make sure the name of the new batch is highlighted and select File | Process
Batch or click Process Batch on the toolbar.
The batch is opened in Kofax Capture Scan.
2 Click Scan Batch to display an Import window.
3 Select the images in the following location.
4 Click Open to import the images and establish document boundaries.
Kofax Capture Scan detects the patch code separators placed between each
document and uses this to create the full batch structure (shown in the tree
view).
5 Select Batch | Close and click Yes to the message box.
In Batch Manager, the Queue column indicates that the batch is ready to be
processed by the Classification and Separation instance of Recognition.
Classify the Documents
Use the Classification and Separation instance of the Recognition module to
automatically classify the documents. Any documents that cannot be recognized
confidently will be displayed in the next stage, in Document Review.
To classify the documents, click Process Batch on the toolbar in Batch
Manager.
Recognition will automatically begin processing the batch. Information messages
will be displayed and the “Docs processed” should increment. When “Docs
processed” reaches 12, Recognition will close.
In Batch Manager, the Queue column indicates that the batch is ready to be
processed by INDICIUS Document Review.
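The hand-off described in this section, where only documents that cannot be recognized confidently are shown in Document Review, can be pictured with the following Python sketch. It is an illustration under stated assumptions, not INDICIUS behaviour: the 0.0 to 1.0 confidence scale, the 0.8 threshold and all names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Result(NamedTuple):
    document: int
    doc_type: str
    confidence: float  # hypothetical score in the range 0.0 to 1.0


REVIEW_THRESHOLD = 0.8  # hypothetical cut-off, not an INDICIUS default


def route(results: List[Result]) -> Tuple[List[Result], List[Result]]:
    """Split results into (accepted, needs_review) by confidence."""
    accepted = [r for r in results if r.confidence >= REVIEW_THRESHOLD]
    needs_review = [r for r in results if r.confidence < REVIEW_THRESHOLD]
    return accepted, needs_review
```

Documents in the second list correspond to those an operator would be asked to confirm. Validation failures, which Document Review also presents, are not modelled here.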
Review the Classification Results
Use INDICIUS Document Review to confirm any document types that Recognition is
unsure about, or fix problems with any documents which are failing validation rules.
To review the automatic classification results
1 Click Process Batch on the toolbar in Batch Manager.
2 Wait for the batch to be automatically loaded into Document Review.
Document Review will launch with the batch open in the module’s
Document Classification view. This is used to quickly set any missing
document types or confirm any that are uncertain. If nothing can be quickly
fixed in this view, the problems can be overridden and will then display in
the Review view where you can see all the documents in the batch.
Figure 3-2. The Batch Loaded in Document Review
3 Select Document Classification | Override Problem to ignore the problem for
now.
A message will display stating that there are no more problem documents to
display in Document Classification view so the batch will now be shown in
Review.
Note You can also press F7 to override a problem.
Figure 3-3. Transition from Document Classification view to Review
4 Click OK on the message.
The batch will open in Review, with the overridden document displayed. It
has failed a validation rule (shown in the yellow message); a problem which
must be fixed before the batch can be closed.
Figure 3-4. The Batch in Review
In Review, you can see all the documents in the batch. Although the best
classification result is “Appraisal Report,” the document is poor quality and
appears to be upside down.
5 Press F3 to rotate the image by 180˚.
You can see that this is a very poorly scanned Truth In Lending document.
The next document in the batch is also a Truth In Lending.
6 Click the + buttons to the left of the two documents to expand them and
display the thumbnails.
Figure 3-5. The Two Documents with Thumbnail Images
7 Compare the two documents.
They appear to be the same document (they have the same loan number in
the top left). The first of the two documents can therefore be deleted.
8 Right click on the problem document to display the context menu
Figure 3-6. Deleting a Document
9 Select Delete and click Yes.
As there are no further problems in the batch you will be prompted to close
the batch.
10 Click Yes.
In Batch Manager, the Queue column indicates that the batch is ready to be
processed by INDICIUS Recognition.
Conditionally Extract Data
Having classified the documents and reviewed the results of the classification in
Document Review, extraction can now take place.
The standard instance of the INDICIUS Recognition module is used to extract data
from the documents, resulting in a different set of INDICIUS fields for each
INDICIUS document type.
To extract the data, click Process Batch on the toolbar in Batch Manager.
Recognition will automatically begin processing the batch. Information messages
will be displayed and the “Docs Processed” should increment. When “Docs
Processed” reaches 11, Recognition will close.
In Batch Manager, the Queue column indicates that the batch is ready to be
processed by INDICIUS Completion.
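Conditional extraction, where the set of fields depends on the document type assigned during classification, can be sketched in Python as a simple lookup. This is not how INDICIUS stores its configuration: the table, the field names and the function name are hypothetical, and only the document type names come from this guide's example.

```python
from typing import Dict, List

# Hypothetical field lists per document type; the field names are invented.
FIELDS_BY_TYPE: Dict[str, List[str]] = {
    "Header": [],
    "Tax Escrow": [],
    "Loan Application": ["LoanNumber", "BorrowerName"],
}


def fields_to_extract(doc_type: str) -> List[str]:
    """Return the fields to extract for a classified document type."""
    # Unknown types get no fields in this sketch.
    return FIELDS_BY_TYPE.get(doc_type, [])
```

The lookup only works once every document has a type, which is why classification and review run before the standard instance of Recognition.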
Review the Extraction Results
Use INDICIUS Completion to review the fields extracted for each document.
To review the data
1 Click Process Batch on the toolbar in Batch Manager.
2 Wait for the batch to be automatically loaded into Completion.
Every document is being displayed for this example. In production, only
documents with missing or invalid data would be displayed.
Figure 3-7. Completion Window
This is a “Header” document. No data has been extracted, but the document
type is displayed as a read only field.
3 Press F12 to move to the next document.
Again, no data has been extracted for the “Tax Escrow” document type.
4 Press F12 to move to the next document.
5 Use Tab/Shift+Tab to navigate around the fields on Document 3.
Figure 3-8. Completion Window displaying Data Extracted for Loan Application
Documents
These fields have been specifically extracted for the “Loan Application”
document type. Such conditional extraction would not be possible without
the document classification that ran before.
6 Press F12 after viewing this document and after viewing each of the
remaining documents.
Notice the data that has been conditionally extracted (or deliberately not
extracted) for each of the document types in the document classification
solution.
When all the fields have been completed an End of Batch window is
displayed.
7 Click Exit Completion.
Release the Documents
Kofax Capture Release runs after the last INDICIUS module in the queue. The data
stored in index fields within the Kofax Capture document classes (each of which
corresponds to an INDICIUS document type) is copied to the destination as
configured in the release script.
In a production system this would be a back-end system. In this example, a text file is
output containing the data from the documents. This is located in the following
folder:
To release the documents, click Process Batch on the toolbar in Batch Manager.
Kofax Capture Release will automatically begin processing the batch.
Information messages will be displayed and the progress is displayed in the
“Current Batch Progress” panel. When the text “Document 11 of 11” is displayed,
Kofax Capture Release will close.
In Batch Manager, you can see that the batch has been deleted.
Tutorial: Process the Batch from Dedicated Applications
Kofax Capture and INDICIUS modules may be run from a desktop or Start menu
shortcut, rather than being run from Batch Manager. In most production
environments a module is left to run continuously on a computer, and hence is
referred to as a “dedicated application.”
When running as a dedicated application, the module will usually be set to poll for
batches. Batches will be processed first according to their priority and then by
creation date (oldest first).
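The polling order described above (priority first, then creation date with the oldest first) amounts to a two-key sort. The following Python sketch illustrates it. The Batch record and function name are hypothetical, and the sketch assumes that a lower priority number means a higher priority.

```python
from datetime import datetime
from typing import List, NamedTuple


class Batch(NamedTuple):
    name: str
    priority: int  # assumption: a lower number means a higher priority
    created: datetime


def processing_order(batches: List[Batch]) -> List[Batch]:
    """Order batches by priority, then by creation date (oldest first)."""
    return sorted(batches, key=lambda b: (b.priority, b.created))
```

Because the sort key is a tuple, creation date is only compared when two batches share the same priority.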
Import Images and Establish Document Boundaries
To create an example batch in Scan
1 Start Scan by clicking Start on the taskbar to display the menu, and selecting:
All Programs | Kofax Capture 8.0 | Scan.
When the module opens, the Create Batch window is automatically
displayed.
2 Make sure the “Mortgage Apps” batch class is selected.
3 Enter a name for the new batch in the “Name:” box, for example “Mortgage
Applications 2”.
4 Click Scan to display an Import window.
5 Select the images in the following location.
6 Click Open to import the images and establish document boundaries.
Kofax Capture Scan detects the patch code separators placed between each
document and uses this to create the full batch structure (shown in the tree
view).
7 Select Batch | Close and click Yes to the message box.
The Create Batch window will display again.
8 Click Cancel to finish creating batches.
9 Click Batch | Exit to close Kofax Capture Scan.
Classify Documents
Use the Classification and Separation instance of the Recognition module to
automatically classify the documents. Any documents that cannot be recognized
confidently will be displayed in the next stage, in Document Review.
To classify the documents
1 Open Recognition (Classification and Separation) by double clicking on the
“Recognition (Classification and Separation)” desktop shortcut you created
earlier.
2 Select Session | Select Batch.
3 Select the batch created in Scan from the list.
4 Click OK.
Recognition will begin processing the batch. Information messages will be
displayed and the “Docs Processed” should increment. When “Docs
Processed” reaches 12, the status bar will display “Idle.”
5 Select Session | Exit to close Recognition.
Note Rather than selecting a single batch in Recognition, the module would
normally be started in Wait for any Batch mode to automatically process batches as
they become available. Alternatively, Recognition would be installed as a Windows
service and would process batches automatically.
Review the Classification Results
Use INDICIUS Document Review to confirm any document types that Recognition is
unsure about or that fail validation rules.
To review the automatic classification results
1 Open Document Review by clicking Start on the taskbar to display the menu,
and selecting:
All Programs | INDICIUS | Document Review.
2 Select Session | Select Batch.
3 Select the batch created in Scan from the list.
4 Click OK.
5 Override the problem in the Document Classification view.
6 Click OK on the message box to open the Review view.
7 Delete the poorly scanned document.
8 Click Yes on the message box to close the batch.
9 Click Cancel on the Select Batch window.
10 Select Session | Exit to close Document Review.
Conditionally Extract Data
Having classified the documents and reviewed the results of the classification in
Document Review, extraction can now take place.
The standard instance of the INDICIUS Recognition module is used to extract data
from the documents, resulting in a different set of INDICIUS fields for each
INDICIUS document type.
X To extract the data
1 Open Recognition by clicking Start on the taskbar to display the menu, and
selecting:
All Programs | INDICIUS | Recognition.
2 Select Session | Select Batch.
3 Select the batch created in Scan from the list.
4 Click OK.
Recognition will begin processing the batch. Information messages will be
displayed and the “Docs Processed” should increment. When “Docs
Processed” reaches 11, the status bar will display “Idle.”
5 Select Session | Exit to close Recognition.
Review the Extraction Results
Use INDICIUS Completion to review the fields extracted for each document.
X To review the data
1 Open Completion by clicking Start on the taskbar to display the menu, and
selecting:
All Programs | INDICIUS | Completion.
2 Select Session | Select Batch.
3 Select the batch created in Scan from the list.
4 Click OK.
5 Review the data as you did when running from Batch Manager.
6 When the end of batch window displays, click Exit Completion.
Release the Documents
Kofax Capture Release runs after the last INDICIUS module in the queue. The data
stored in index fields within the Kofax Capture document classes (each of which
corresponds to an INDICIUS document type) is copied to the destination as
configured in the release script.
In a production system this would be a back-end system. In this example, a text file is
output containing the data from the documents. This is located in the following
folder:
X To release the documents
1 Open Kofax Capture Release by clicking Start on the taskbar to display the
menu, and selecting:
All Programs | Kofax Capture 8.0 | Release.
Kofax Capture Release will automatically begin processing the batch.
Information messages will be displayed and the progress is displayed in the
“Current Batch Progress” panel. When the text “Document 11 of 11” is
displayed, Kofax Capture Release will stop processing.
2 Select Batch | Exit to close Kofax Capture Release.
Page Classification and Separation
In the page classification and separation solution, INDICIUS Recognition
(Classification and Separation) and INDICIUS Document Review are used to classify
and separate documents. Document boundaries are established from the
classification of pages in INDICIUS Recognition (Classification and Separation) and
used by the later INDICIUS modules.
The complete set of Kofax Capture and INDICIUS modules used is:
This section is similar to the last tutorial, but will demonstrate automatic separation.
The tutorial only uses the low volume scenario (using Batch Manager), though either
could be used.
Tutorial: Process the Batch from Batch Manager
Create a Batch
X To create an example batch in Batch Manager
1 Start Batch Manager by clicking Start on the taskbar to display the menu, and
selecting:
All Programs | Kofax Capture 8.0 | Batch Manager.
2 Select File | New Batch or click Create Batch on the toolbar.
3 Make sure the “Mortgage Apps with Separation” batch class is selected.
Figure 3-9. Create Batch Window
4 Enter a name for the new batch in the “Name:” box, for example “Mortgage
Applications 3”.
5 Click Save.
6 Click Close.
Your batch is displayed in the list. The Queue column indicates that the
batch is ready to be processed by Kofax Capture Scan.
Import Images
X To import images
1 Make sure the name of the new batch is highlighted and select File | Process
4 Click Open to import the images.
Kofax Capture Scan imports all the pages and places them in a single
temporary document. This document will be replaced later by INDICIUS,
after page classification and separation has run.
5 Select Batch | Close and click Yes to the message box.
In Batch Manager, the Queue column indicates that the batch is ready to be
processed by the Classification and Separation instance of Recognition.
Classify and Separate the Documents
Use the Classification and Separation instance of INDICIUS Recognition to
automatically classify and separate the documents. Any documents that cannot be
classified or separated confidently will be displayed for the user’s attention in the
next stage, in Document Review.
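The idea of passing only low-confidence results to a reviewer can be sketched as follows. This is a minimal illustration under stated assumptions: the `REVIEW_THRESHOLD` value, the `ClassifiedDocument` type and the 0.0 to 1.0 confidence scale are all hypothetical, and the real scoring and thresholds are set in the INDICIUS configuration rather than in code like this.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical threshold; the real value is part of the product configuration.
REVIEW_THRESHOLD = 0.8


@dataclass
class ClassifiedDocument:
    name: str
    document_type: str
    confidence: float  # assumed scale: 0.0 (no confidence) to 1.0 (certain)


def route_documents(
    documents: List[ClassifiedDocument],
    threshold: float = REVIEW_THRESHOLD,
) -> Tuple[List[ClassifiedDocument], List[ClassifiedDocument]]:
    """Split results into those accepted automatically and those needing review."""
    confident = [d for d in documents if d.confidence >= threshold]
    needs_review = [d for d in documents if d.confidence < threshold]
    return confident, needs_review


results = [
    ClassifiedDocument("doc1", "Appraisal Report", 0.95),
    ClassifiedDocument("doc2", "Funding Transmittal", 0.40),
]
confident, needs_review = route_documents(results)
```

Here `doc2` would be the only document presented for confirmation; the operator can then confirm or override its type.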
X To classify and separate the documents, click Process Batch on the toolbar in
Batch Manager.
Recognition will automatically begin processing the batch. Information messages
will be displayed and the “Pages processed” should increment. When “Pages
processed” reaches 20, Recognition will close.
In Batch Manager, the Queue column indicates that the batch is ready to be
processed by INDICIUS Document Review.
Review the Classification and Separation Results
Use INDICIUS Document Review to confirm any document types that Recognition is
unsure about or that fail validation rules.
X To review the automatic classification and separation results
1 Click Process Batch on the toolbar in Batch Manager.
2 Wait for the batch to be automatically loaded into Document Review.
A problem is displayed in the Document Classification view, since the
separation confidence score for a Funding Transmittal document was too
low.
Figure 3-10. Problem in the Document Classification View
You can see that the document is, however, correctly classified. A simple
confirmation is required.
3 Press Enter to confirm the document type.
4 Click OK on the message to open the Review view.
5 Expand the documents to check the automatic separation has been
successful.
6 Select Session | Close Batch and click Yes to close the batch.
In Batch Manager, the Queue column indicates that the batch is ready to be
processed by INDICIUS Recognition.
Conditionally Extract Data
Having classified the documents and reviewed the results of the classification in
Document Review, extraction can now take place.
The standard instance of the INDICIUS Recognition module is used to extract data
from the documents, resulting in a different set of INDICIUS fields for each
INDICIUS document type.
X To extract the data, click Process Batch on the toolbar in Batch Manager.
Recognition will automatically begin processing the batch. Information messages
will be displayed and the “Docs Processed” should increment. When “Docs
Processed” reaches 14, Recognition will close.
In Batch Manager, the Queue column indicates that the batch is ready to be
processed by INDICIUS Completion.
Review the Extraction Results
Use INDICIUS Completion to review the fields extracted for each document.
X To review the data
1 Click Process Batch on the toolbar in Batch Manager.
2 Wait for the batch to be automatically loaded into Completion.
3 Review the data that has been conditionally extracted for each of the
document types in the page classification and separation solution, pressing
F12 to move on to the next document.
4 When you reach the final document and have verified that all the fields are
complete, press F12.
An End of Batch window is displayed.
5 Click Exit Completion.
Release the Documents
Kofax Capture Release runs after the last INDICIUS module in the queue. The data
stored in index fields within the Kofax Capture document classes (each of which
corresponds to an INDICIUS document type) is copied to the destination as
configured in the release script.
In a production system this would be a back-end system. In this example, a text file is
output containing the data from the documents. This is located in the following
folder:
<Installation Path>\examples\Mortgage Applications\Export\Mortgage Apps with Separation
X To release the documents, click Process Batch on the toolbar in Batch Manager.
Kofax Capture Release will automatically begin processing the batch.
Information messages will be displayed and the progress is displayed in the
“Current Batch Progress” panel. When the text “Document 14 of 14” is displayed,
Kofax Capture Release will close.
In Batch Manager, you can see that the batch has been deleted.
Chapter 4
Configuration
Overview
Introduction
To create an INDICIUS solution, you first need to configure the INDICIUS modules
using the INDICIUS configuration tools and a set of sample documents. Once you
have created and tested this configuration, you need to assign it to a batch class in
Kofax Capture Administration.
In these tutorials you will replicate the classification and separation elements of the
Mortgage Applications example configuration.
Sample Documents
In order to build your configuration, you need a set of sample documents that
accurately represent the documents you will process in the final solution. Typically
these documents will be exported from a current archive or repository system or
collected from current incoming documents.
Accuracy
The first step when configuring a solution is to use the Document Set Management
features in Transformation Studio to ensure the accuracy of the sample documents.
This is particularly important when building a configuration automatically – if the
input to the training process for classifiers and separators isn't accurate, the output
won't be accurate. The Document Set Management steps are also useful to ensure a
good understanding of the structure of the document set (and that no document
types are missing).
Representative Documents
It is important that the sample documents are scanned using the production scanner
and represent the variations that are seen in production, for example faxes and
photocopies. If extraction (indexing) is being implemented as well as classification
and separation, it is recommended that the documents are scanned at 300 dpi.
Document Set Management Steps
The following steps are used to create two accurate document sets, which are then
used to configure and test a solution.
Step 1: Create Project: Open Transformation Studio and create a new project.
Step 2: Import Documents: Import documents, optionally with document properties.
Step 3: Initial Analysis: Get an overview of your document set.
Step 4: Select Sample Documents for Configuration: Select a subset of documents to
clean up and use for configuration and testing.
Step 5: Read Page Content: Read (OCR) all the pages in the documents selected for
configuration. Transformation Studio will use the reads in the next step.
Step 6: Cleanup Documents: Within this step you will analyze your document set,
clean up the documents and add more samples until the set is ready to be used for
configuration.
Step 7: Select Documents for Testing: From the clean document set, select a set of
documents to use for testing. These documents must not be used for configuration.
Create Configuration
Recognition
The Recognition module uses classifiers and separators to determine how a batch of
pages is split into documents, and to determine the type of each document.
Classifiers may be based on image (at page-level) or text content and are built from a
set of sample documents. These learn-by-example classifiers can be supplemented
with manually configured templated (including barcode) or rules-based classification
methods.
The advanced document separator is created automatically from the document types
assigned to a set of sample documents and, when run in production, takes into
account the confidence of the page classification results. Rules-based separation is
manually defined using a set of separation rules.
Transformation Studio is used to create the learn-by-example classifiers and the
advanced document separator. For those created automatically, it is particularly
important that the sample documents are accurately defined using the Document Set
Management Steps.
Document Classification Configuration Steps
The following steps are used to create and test a Recognition document classification
configuration using the two accurate document sets.
Step 1: Create Configuration: Create a default configuration from the Document
Classification template.
Step 2: Configure Text Classification: Create a document text classifier, integrate into
the configuration and test.
Step 3: Add in Additional Classification Methods: Optionally, configure templated
and rules-based classification, integrate into the configuration and test.
Step 4: Test: Test the full configuration, analyzing the classification results and
looking for areas to improve.
Page Classification and Separation Configuration Steps
The following steps are used to create and test a Recognition page classification and
separation configuration using the two accurate document sets.
Step 1: Create Configuration: Create a default configuration from the Page
Classification and Separation template.
Step 2: Configure Text Classification: Create a page text classifier, integrate into the
configuration and test.
Step 3: Configure Document Separation: Create a separator, integrate into the
configuration and test.
Step 4: Add in Additional Classification Methods: Optionally, configure image,
templated and rules-based classification, integrate into the configuration and test.
Step 5: Test and Evaluate Performance: Test the full configuration and evaluate the
classification and separation performance.
Figure 4-12. Recognition Configuration Steps (Page Classification and Separation)
Document Review
The Document Review module is configured using a Document Review project file.
This project file contains reasons for displaying documents, validation rules and
window options (for example shortcut keys and text labels).
The Document Review project file is configured using the Document Review Project
Editor.
Document Classification Configuration Steps
There is just one step when configuring Document Review:
Step 1: Configure a Document Review Project File: Create and configure a project file
to include validation rules and interface options.
Page Classification and Separation Configuration Steps
The page classification and separation tutorial will use the Document Review
configuration you created for the document classification tutorial.
Integrate the Configuration with Kofax Capture
Having built the configuration, a batch class must be created in Kofax Capture
Administration. The configuration can then be assigned to the batch class and a batch
can be processed.
Integration Steps
The following steps are used to integrate the configuration with Capture and run a
batch through the solution.
Document Classification Integration Steps
Step 1: Create Batch Class
Step 2: Insert Required Document Classes and Form Types
Step 3: Assign Configuration to the Additional Instance of Recognition
Step 4: Assign Configuration to Document Review
Step 5: Assign Configuration to the Standard Instance of Recognition
Step 6: Assign Configuration to Completion
Step 7: Configure Kofax Capture Release
Step 8: Publish Batch Class
Step 9: Process Batch
Page Classification and Separation Integration Steps
Step 1: Create Batch Class
Step 2: Insert Required Document Classes and Form Types
Step 3: Assign Configuration to the Additional Instance of Recognition
Step 4: Assign Configuration to Document Review
Step 5: Assign Configuration to the Standard Instance of Recognition
Step 6: Assign Configuration to Completion
Step 7: Configure Kofax Capture Release
Step 8: Publish Batch Class
Step 9: Process Batch
Document Classification Tutorial
Document Set Management
Step 1: Create Project
When using Transformation Studio, you will work in a project. Within this project
you can import and organize your sample documents and create one or more
configurations.
X To create a project
1 Open Transformation Studio by clicking Start on the taskbar to display the
menu, and selecting All Programs | INDICIUS | Tools | Transformation
Studio.
2 Click New to open the New Project window.
3 In the window enter the name “Tutorial” for your new project.
Note It is recommended that the project is saved in the default location.
However, you may change the location by clicking Browse and selecting a
new location from the Project Location window.
4 Click Create to create the project and open the Import Documents tab.
Figure 4-13. Transformation Studio after Create New Project
1 Project Explorer showing current document sets and configurations
2 Document Types panel displaying the document types in the current document set
3 Status bar showing the current state of Transformation Studio
4 Tab area, currently showing the Import Documents tab
Step 2: Import Documents
Transformation Studio includes a wizard for importing documents into the current
project. Documents can be imported by selecting files and/or folders containing files.
Files on disk can be mapped into various document structures using the file/folder
structure of the image. In addition, data stored in the text of the filename (for
example document types) can be used for a first attempt at classification. However, if
document types are not known they can be assigned after import. Manual
modifications to document structure can also be made after import.
The import documents wizard is launched automatically when a new project is
created.
Note To launch the Import Documents Wizard manually, select File | Import
Documents, press CTRL+SHIFT+I or click the corresponding toolbar button.
The Example
The example mortgage applications have been exported from an archive system, and
have the following folder architecture:
Each document type is in a folder, named with the document type.
Each document is in a folder.
Each page is a single file, with the last part of the filename indicating the page
number.
During import, you will use values from the filename and path to specify how the
files should be imported into documents and to set document properties:
Each folder will indicate a new document.
The last part of the filename will indicate page order.
The folder name will be imported as the document type.
Note During import, Transformation Studio displays status messages at the bottom
of the window.
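The mapping from folder and filename values to documents can be sketched in a few lines. This is an illustration of the idea only, not the parser used by Transformation Studio: the example paths, the `Doc0001` naming and the underscore before the page number are hypothetical, since the guide does not state the actual filename layout.

```python
from collections import defaultdict
from pathlib import PureWindowsPath
from typing import Dict, List, Tuple


def group_pages(paths: List[str]) -> Dict[Tuple[str, str], List[str]]:
    """Return {(document type, document folder): [page files in page order]}.

    Assumed layout (hypothetical): <type folder>\\<document folder>\\<name>_<page>.tif
    """
    documents = defaultdict(list)
    for raw in paths:
        path = PureWindowsPath(raw)
        document_folder = path.parent.name       # each folder is one document
        document_type = path.parent.parent.name  # folder name gives the type
        # The last part of the filename is taken as the page sequence indicator.
        page_number = int(path.stem.rsplit("_", 1)[-1])
        documents[(document_type, document_folder)].append((page_number, path.name))
    return {
        key: [name for _, name in sorted(pages)]
        for key, pages in documents.items()
    }


example_paths = [
    r"Appraisal Report\Doc0001\Doc0001_02.tif",
    r"Appraisal Report\Doc0001\Doc0001_01.tif",
    r"Header\Doc0002\Doc0002_01.tif",
]
grouped = group_pages(example_paths)
```

Sorting on the numeric page value rather than on the filename keeps page 10 after page 9, which is the reason a page sequence indicator is selected explicitly during import.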
X To import the example mortgage documents
1 Select the images to import.
a Click Select Folders at the top of the Import Documents tab.
b In the Select Folders to Import window, navigate to:
Figure 4-14. Folders to Select for Import
c Select all the folders (there is one for each document type).
d Click Open.
Figure 4-15. Folders to be Imported
e Click Next to display Step 2.
2 Specify the document structure and values to import.
Transformation Studio has already split (parsed) the filenames and paths
into values, as displayed using the example at the top of the tab. For this
example, there is no need to modify the parsing options.
a On the Structure panel, select “Every imported folder is a document.”
The preview on the right will update to show how the files will be
imported into documents.
b From the “Page sequence indicator” list, select the last value “10 - 01.”
This specifies that the tenth value in the filename/path indicates the order
of pages in each document. For the currently selected example document,
the data in this tenth value is “01.”
Note If you have not installed to the default location, the number may be
different. Ensure you select the last item in the list.
c On the Document Properties panel, select “6 – Appraisal Report” from
the “Document type” list.
This specifies that the sixth value in the filename/path is to be imported
as the document type. For the currently selected example document, the
value of this sixth property is “Appraisal Report.”
Note If you have not installed to the default location, the number may be
different. Ensure you select the item with the value “Appraisal Report” in
the list.
Figure 4-16. Options Specifying Document Structure and Properties
d Click Next to go to step 3.
3 Specify import options.
a Select “Copy document files into project folder,” rather than referencing
the images in their current location.
This will copy them into the project folder, making it easier to move your
project at a later time and ensuring no dependency on the images
remaining in their current location.
Note Using this option will slow the import process and require more
disk space.
It is possible to move the project when the images have not been copied
into it, but the images must be accessible from wherever the project is
moved to.
Figure 4-17. Import Options
b Click Import to import the files into the project.
c Click Finish to exit the Import Documents tab.
Assigning Document Types
In the example, the imported documents already have document types. If you
import documents without document types, you would need to do the following:
X To assign document types to completely unclassified documents
1 Read the page content for all documents.
2 Assign document types to 5-10 samples of each type and confirm these
documents.
3 Use Auto Classify.
Auto Classify compares the text content of each unclassified document to the text
content of any documents that do have a type assigned. The most probable document
type is assigned depending on the similarities/differences between the text.
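The principle of assigning the most probable type by text similarity can be sketched as below. This is not the Auto Classify algorithm, which the guide does not describe; it swaps in a plain bag-of-words cosine similarity as a stand-in, and the sample texts are invented for illustration.

```python
import math
import re
from collections import Counter
from typing import List, Optional, Tuple


def bag_of_words(text: str) -> Counter:
    """Count lower-cased alphanumeric tokens in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def auto_classify(
    unclassified_text: str, labelled: List[Tuple[str, str]]
) -> Tuple[Optional[str], float]:
    """Return the type of the most similar labelled document and its score.

    labelled holds (document type, text) pairs for documents already typed.
    """
    target = bag_of_words(unclassified_text)
    best_type, best_score = None, 0.0
    for document_type, text in labelled:
        score = cosine(target, bag_of_words(text))
        if score > best_score:
            best_type, best_score = document_type, score
    return best_type, best_score


labelled = [
    ("Appraisal Report", "appraisal report property value appraiser"),
    ("Tax Escrow", "tax escrow account payment county"),
]
best_type, best_score = auto_classify(
    "the appraiser estimated the property value", labelled
)
```

A document with no text in common with any labelled sample is left without a type in this sketch, which mirrors why several confirmed samples of each type are needed before Auto Classify is run.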
Step 3: Initial Analysis
Once you have imported your documents, the Overview tab displays the
composition of your document set.
Figure 4-18. Overview
You can see how many document types you have and the distribution of documents
across those types. From the Overview you may realize that some types occur rarely
and don’t need configuration or that you have more or fewer document types or
documents than you expected.
X To review the mortgage applications
1 Review the chart for the number of document types (x-axis) and the number
of documents (y-axis).
Note The Tax Escrow has more documents than any other type and the
Header has very few documents.
2 Review the Header documents to see whether you should get more
examples.
a Double-click the Header bar in the chart to open Browse Documents with
a filter to show just the Header documents.
b Scroll through the documents, looking at the amount of variation between
each example.
In fact, all of these documents contain barcodes that will be used to
classify the documents. Configuration of templated (barcode)
classification is simple and requires only a few examples, so no additional
documents are required.
3 Select the Overview tab to return to the chart view.
Note To display more information on a specific document type, hold the mouse over
the bar in the chart and review the information in the tool tip and in the summary
statistics below the chart.
To change the chart display, use the toolbar buttons above the chart.
Step 4: Select Sample Documents for Configuration
In this step you will select a subset of the documents in your project to add into the
Sample Documents set, on which your configuration will be based. Any documents
not selected are put into the Unused Documents set. From this set it is easy to add
more samples later, without accidentally selecting duplicates.
Document Sets
When working in a project it is recommended that you always use standard
document sets to ensure maximum accuracy and efficiency when setting up a
solution. If you have a large number of documents it may be time-consuming and
unnecessary to work on all the documents in your project. It is also important to
separate out a set of documents to use for testing and to ensure these are not used
during configuration. Standard document sets and the tools to move documents into
them support both these options. In addition, you may wish to create subsets of
documents in your project according to your own criteria, in which case you can use
the custom document sets.
Each of the three standard document sets has a specific role.
Sample Documents Documents to use when developing the configuration
(for text and image classification these will be used to train the classifiers).
Test Documents Documents to use when testing the configuration (not used
for training).
Unused Documents Documents that are not currently being used. These may
be additional documents that are not required for configuration or documents
that have not yet been classified.
Table 4-1 gives guidelines for the number of documents required for the different
classification methods (per document type). The figures take into account that some
documents may be misclassified or of poor quality and therefore may be discarded
before starting the configuration process. Although you can use more than the
suggested number of sample documents, this will slow down the configuration
process and may not improve accuracy. However, if your initial document set is poor
you should start with a higher number.
Table 4-1. Guideline Number of Documents per Document Type
Method                                                             Number of Documents
Text classification (or a combination of classification methods)   150
Image, templated or rules-based classification                     10
Note For information on the suitability of documents/pages for a particular
classification method, refer to the INDICIUS Help.
Documents in Multiple Document Sets
Documents in standard or custom document sets are shared with those in the overall
project, that is, a document in a standard or custom document set is the same as that
document in All Documents.
Note The actual image files contained in a document are not duplicated into each
set, only the names of the files are duplicated.
When documents are added to a set, they are members of the original set and the set
they have been added to.
When documents are moved to a set, they are members of the new set, but not the
original set.
Similarly, if pages or documents are modified in one document set they will be
modified in all document sets to which they belong. These modifications may be
deletions of pages or documents, reordering of pages within documents, or the
addition of Confirmed or Extra Page attributes.
Note All documents are always present in the All Documents set. Documents can be
added to another set from All Documents, but cannot be moved to another set from
All Documents.
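The difference between adding and moving a document can be modelled with ordinary sets of document identifiers. The `Project` class below is a hypothetical model written for this illustration; it is not part of Transformation Studio.

```python
from typing import Dict, Iterable, Set

ALL_DOCUMENTS = "All Documents"


class Project:
    """Toy model: each document set is a set of shared document identifiers."""

    def __init__(self, document_ids: Iterable[str]) -> None:
        self.sets: Dict[str, Set[str]] = {ALL_DOCUMENTS: set(document_ids)}

    def add(self, document_id: str, source: str, target: str) -> None:
        """Add: the document stays in the source set and also joins the target."""
        if document_id not in self.sets[source]:
            raise KeyError(document_id)
        self.sets.setdefault(target, set()).add(document_id)

    def move(self, document_id: str, source: str, target: str) -> None:
        """Move: the document leaves the source set and joins the target."""
        if source == ALL_DOCUMENTS:
            raise ValueError("documents cannot be moved out of All Documents")
        self.sets[source].remove(document_id)
        self.sets.setdefault(target, set()).add(document_id)


project = Project(["d1", "d2"])
project.add("d1", ALL_DOCUMENTS, "Sample Documents")
project.move("d1", "Sample Documents", "Test Documents")
```

Because the sets hold identifiers rather than copies, a change made to `d1` through any set would be seen in every set that contains it, which matches the sharing behaviour described above.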
X To select sample documents to use for configuration
1 Select Document Sets | Select Sample Documents or click the corresponding
toolbar button to display the Select Sample Documents window.
Figure 4-19. Select Sample Documents window
Note By default, 150 documents of each type will be added to the Sample
Documents set. If fewer than the specified number of documents exist in a
document type, a warning will display and all the documents in that type
will be added to the Sample Documents set.
2 Click OK.
Once the documents have been successfully added, a message will display.
Figure 4-20. Message After Selecting Documents
3 Click Yes to open the Sample Documents set.
Figure 4-21. Project Explorer after Select Sample Documents
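The selection performed in this step (up to a fixed number of documents per type, with the remainder left unused and a warning for short types) can be sketched as follows. How Transformation Studio chooses which documents to take is not stated in this guide, so taking the first ones in order is an assumption of this sketch, as are the function and variable names.

```python
from collections import defaultdict
from typing import List, Tuple


def select_samples(
    documents: List[Tuple[str, str]], per_type: int = 150
) -> Tuple[List[str], List[str], List[str]]:
    """Split (document id, document type) pairs into sample and unused lists.

    Returns (sample ids, unused ids, warnings). Takes the first per_type
    documents of each type, which is an assumption made for illustration.
    """
    by_type = defaultdict(list)
    for document_id, document_type in documents:
        by_type[document_type].append(document_id)

    sample: List[str] = []
    unused: List[str] = []
    warnings: List[str] = []
    for document_type, ids in by_type.items():
        if len(ids) < per_type:
            warnings.append(
                f"{document_type}: only {len(ids)} of {per_type} documents available"
            )
        sample.extend(ids[:per_type])
        unused.extend(ids[per_type:])
    return sample, unused, warnings


documents = [
    ("t1", "Tax Escrow"),
    ("t2", "Tax Escrow"),
    ("t3", "Tax Escrow"),
    ("h1", "Header"),
]
sample, unused, warnings = select_samples(documents, per_type=2)
```

With a limit of 2 per type, the third Tax Escrow document is left unused and the Header type produces a warning because it has only one document.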
Step 5: Read Page Content
At this point you need to read (OCR) each page of the documents in your sample set.
Using these reads, Transformation Studio can help you analyze the documents with
the aim of finding any that are misclassified or poor quality. These reads will also be
used when you build text classifiers and configure additional classification methods.
Although all the documents in a project could be read, this is time consuming and
often unnecessary. Reading just the documents in the Sample Documents set is
sufficient (as these documents will be used for configuration and testing).
During the read, the status bar will display the number of the page being read and
the estimated time remaining. The read can be stopped at any point; no data will be
lost but you will need to read the remaining pages in order to continue to the next
step.
The read parameters used in the production configuration should match the
parameters used when reading the page content in Transformation Studio. When
you create a new configuration the default parameters will automatically match (the
parameters in the configuration resources folder are the same as those used by
default on the Read Page Content tab). However, if you are updating a production
configuration in which you have customized the full page read, you should use these
customized full page read parameters when reading page content in Transformation
Studio.
In addition, you should use custom read parameters if:
You have non-English language documents.
You only need to read a small section of each page (which will speed up
processing time).
You want to use the read for extraction as well as classification and need a
higher read accuracy.
Note For information on setting custom read parameters refer to the INDICIUS Help.
X To read the pages in the sample document set
1 Select Tools | Read Page Content.
Note As it is the currently open document set, “Sample Documents” will
automatically be selected in the Document Set list.
2 Click Read.
Once the read has finished, the Stop button will be renamed to Finish.
3 Click Finish.
Important Reading all the pages in a document set may take a long time.
Step 6: Cleanup Documents
Within this step you will analyze your document set, clean up the documents and
add more samples until the set is ready to be used for configuration. These three
steps may have knock-on effects on each other, requiring one or more steps to be
done multiple times.
The aim is to:
Have a clean document set (and therefore have no more work to do in
Cleanup Documents)
Have at least 100 clean samples of each document type (this is critical if
configuring page text classification)
The following sections describe each of the individual steps. The tutorial then ties
these together, showing how each step may need to be done more than once.
Step 6.1: Analysis using the Overview tab
The Overview tab was first used in Step 3: Initial Analysis and displays statistical
information on a document set. Once the pages have been read in Step 5: Read Page
Content, the Overview chart is updated to indicate how clean (accurate)
Transformation Studio judges the set to be. Each document type in the chart is
color-coded according to the following criteria:
Table 4-2. Color Coding of Document Types in Overview Chart
Color | Label | Description
Green | Clean | The document type does not have very much variation and needs little or no work in Cleanup Documents.
Orange | Poor | The document type has some variation and will need some work in Cleanup Documents.
Red | Very Poor | The document type has a lot of variation in the text content. It may need a lot of attention within Cleanup Documents or may not be suitable for text classification.
Gray | Unknown | No information is available as the document type has not been read or is “(Unknown),” that is, no type is assigned to the documents.
Note This data is also visible by displaying the tool tip for a document type in the
chart (hover the mouse over the column).
The analysis of the documents is based on the page content (text) reads. This means
that occasionally a document type will appear to be poor, when it is actually clean
but only suitable for a classification method other than text (for example, image or
templated classification).
Step 6.2: Cleanup
Using the Cleanup Documents tab is an efficient way to clean up your document set
with assistance from Transformation Studio. Possible problem documents are
identified automatically and displayed for manual confirmation. In addition,
documents that will help Transformation Studio to refine its analysis of the
document type in the most efficient way are displayed. These documents are
continually updated based on the confirmation (or re-classification) of the last
document.
Within Cleanup Documents there are two steps:
Cleaning up Extra Pages
Cleaning up Document Types
Transformation Studio analyzes the document set and identifies pages it suspects are
extra. These may be blank pages, fax cover sheets, pages with text that isn't found on
other documents in the type or pages that cannot be read properly.
Cleaning up Extra Pages: Within this step you will confirm whether or not
each of the marked pages (those suspected as being extra) is an extra page.
Only confirm that a page is an Extra Page if it is not representative of the type
and will not occur in production. Extra Pages may mislead the process when
building classifiers, reducing the accuracy of the overall solution.
Cleaning up Document Types: When Transformation Studio analyzes the
document set it assigns a confidence state to each document: confident,
unconfident or misclassified. It also identifies documents that will help define
each document type. In this step you will confirm or remove the document
type for each of these identified documents until all the documents in the set
are confident. As you work, Transformation Studio will continually reanalyze the set and adjust the confidence states for other documents.
Note As you confirm pages and documents, you may find that other documents are
affected. Therefore Transformation Studio may cycle you through the Cleanup
Documents process until cleanup is complete, that is, there are no extra pages,
unconfident documents or misclassified documents.
As you work in Cleanup Documents, Transformation Studio will continually reanalyze the documents, adjusting the confidence states based on the information you
provide. Only documents that will significantly affect the states will be shown,
reducing the amount of work you need to do.
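The completion rule described above can be sketched in a few lines of Python. This is only an illustration of the logic, not how Transformation Studio is implemented: the Document class, its fields and the function names are hypothetical, and only the state names are taken from this guide.

```python
from collections import Counter
from dataclasses import dataclass

# States that count as clean; "confirmed" is set manually by the user.
CLEAN_STATES = {"confident", "confirmed"}


@dataclass
class Document:
    """Hypothetical record for one document in a set."""
    doc_type: str
    state: str                      # confident, unconfident, misclassified or confirmed
    suspected_extra_pages: int = 0  # pages still marked as possibly extra


def state_proportions(documents, doc_type):
    """Return the share of each state within one document type."""
    states = Counter(d.state for d in documents if d.doc_type == doc_type)
    total = sum(states.values())
    return {state: count / total for state, count in states.items()} if total else {}


def cleanup_complete(documents):
    """True when no extra pages, unconfident or misclassified documents remain."""
    return all(
        d.state in CLEAN_STATES and d.suspected_extra_pages == 0
        for d in documents
    )


docs = [
    Document("Tax Escrow", "confirmed"),
    Document("Tax Escrow", "misclassified"),
    Document("Redemption", "confident"),
]
print(state_proportions(docs, "Tax Escrow"))
print(cleanup_complete(docs))
```

In this sketch a single misclassified document is enough to keep cleanup open, which mirrors the way the tool keeps cycling through the set until every document is confident or confirmed.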
To clean up your document set
1 Review the Overview chart.
You will see that two document types are red while the others are green.
This indicates that the Redemption and Tax Escrow types will need more
work to clean up than the others.
Note To see more information on a specific document type, hover the mouse
over the bar in the chart.
2 Select the Cleanup Documents tab.
3 Clean up the documents that are displayed by following the on-screen
instructions and answering the questions. This step will vary depending on
which documents were randomly selected by Transformation Studio as
Sample Documents. However, the same process is always used:
Suspected extra pages are displayed for each document type in the set.
Documents needing their type confirmed are displayed for each
document type in the set.
Additional suspected extra pages are displayed.
Additional documents needing their type confirmed are displayed.
Further documents may then be identified that need attention. If this is the
case a message will be displayed.
Between each of the above steps (and each change of document type) a
message is displayed.
Document Type Cleanup If the document displays with a colored title bar
and the question reads “Is the displayed document a <document type>?,” the
document type needs to be confirmed.
Figure 4-22. Confirming Document Types
Only documents that will significantly affect the confidence of the
documents in the set will be displayed. These documents are continually
reassessed as you confirm or remove document types.
The documents are color-coded as described in Table 4-3.
Table 4-3. Color Coding of Documents
Color | Confidence State | Description
Green | Confident | Transformation Studio is confident that this document is correctly classified, that is, it has the correct document type.
Gray | Unconfident | Transformation Studio is not confident that this document is correctly classified.
Red | Misclassified | Transformation Studio believes this document is incorrectly classified.
Blue | Confirmed | The document type has been manually confirmed.
a Look at the currently displayed document using the thumbnails and
Image Viewer and decide whether it has the correct document type.
Note The message above the document (and the color coding in the title
bar) indicates how confident Transformation Studio is about the
document.
Click a thumbnail to display that page in the Image Viewer
b Confirm or remove the document type:
Click Yes (or press ENTER or Y) to confirm the document type is
correct.
Click No (or press N) to remove the document type.
Tips Some of the documents in the Mortgage Applications set are
misclassified or incorrect.
Tax Escrow Approximately half of the documents in the Tax Escrow type are
actually Initial Escrow. Make sure you do not confirm these Initial Escrow
documents: when you see an Initial Escrow, click No to the question “Is the
displayed document a ‘Tax Escrow’?”
Request for Tax Form These documents are all 2 pages long. If you see a 4
page document, right click on page 3 and select “Split Document” from the
context menu. You may also see a document with a very skewed second page.
Select the second page and click the Display Text button at the top of the
Image Viewer; you will see that the page read is very poor. This document
should not be used for configuration and should be deleted from the
document set: right click on the document and select “Delete document
from project” from the context menu.
Redemption Each of the unstructured letters is a Redemption document.
Two of these have second pages which may be seen in production.
Loan Application Some of the loan applications have lots of pages. To wrap
the pages so they all display in the thumbnail viewer without scrolling, click
the Wrap Pages button above the thumbnail viewer.
Note Documents without a type become “Unknown” and can be
automatically classified later.
When configuring a real solution, there are alternatives to having to click
“No” to this many documents during Cleanup. An advanced user could
leave Cleanup after rejecting just a few documents, and use the Auto Classify
feature to distinguish between the Tax Escrow and Initial Escrow documents.
However, this tutorial focuses on the basic Cleanup functionality (Auto
Classify will be used after a full run through Cleanup).
As you confirm each document the status bar will update the proportion of
documents of each state within the type. When this bar is completely blue
and green (that is, all documents are confirmed or confident) the type is
clean and a message will display.
When working on the Cleanup Documents tab, you may wish to close the
Project Explorer and Document Types panels. If you have multiple monitors,
you may find it beneficial to drag the Image Viewer to a separate monitor.
Extra Page Cleanup If the document shows a page highlighted in pink and
the question at the bottom of the tab is “Is the selected page an Extra Page?,”
Transformation Studio is displaying a document which it believes contains
one or more extra pages.
Figure 4-23. Cleaning Up Suspected Extra Pages
a Look at the currently marked page using the Thumbnail Viewer and the
Image Viewer and decide whether or not it is an extra page.
b Confirm or clear the extra page mark:
Click Yes (or press ENTER or Y) to confirm the page is extra to the
document and will not occur in production.
Click No (or press N) to clear the suspected extra page.
Tip The only extra page in the mortgage applications set is a separator sheet
containing the text “NEW DOCUMENT.” All other suspected extra pages
may occur in production.
Continue working in Cleanup Documents until a message displays telling
you that no further work is required.
4 Having cleaned up the documents, select the Overview tab.
Each document type in the graph (except Unknown Documents) will be
green as it will contain only confirmed and confident documents.
5 Review the number of documents in each type.
You need at least 100 documents of each type (except the Header) in order to
create a configuration. You will not have enough documents in the Tax
Escrow type (as the Initial Escrow documents were mixed into this type but were
reclassified as Unknown during cleanup).
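The count check in step 5 can be expressed as a small Python sketch. The function name and the sample counts are hypothetical. The minimum of 100 comes from this step, and the lower figure of 10 for the Header type is the one given later in this tutorial.

```python
# Minimum sample counts taken from this guide: 100 per type, 10 for Header.
DEFAULT_MINIMUM = 100
MINIMUM_BY_TYPE = {"Header": 10}


def types_needing_samples(counts):
    """Return {document type: shortfall} for every type below its minimum."""
    shortfalls = {}
    for doc_type, count in counts.items():
        if doc_type == "(Unknown)":
            continue  # unclassified documents are not a type to configure
        minimum = MINIMUM_BY_TYPE.get(doc_type, DEFAULT_MINIMUM)
        if count < minimum:
            shortfalls[doc_type] = minimum - count
    return shortfalls


# Hypothetical counts, not the actual figures from the tutorial project.
counts = {"Loan Application": 120, "Tax Escrow": 55, "Header": 12, "(Unknown)": 60}
print(types_needing_samples(counts))
```

Any type reported here would need more samples before a configuration is created.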
Step 6.3: Add More Samples
Having completed cleanup and analyzed your document set in Overview, you may
have determined that you need more samples in order to create an accurate
configuration. There are three methods for adding more classified documents into
Sample Documents:
Automatically classify documents of unknown type (that is, documents that
have never been classified or had their type removed during cleanup).
Move more documents from the Unused Documents set (documents were
moved into the Unused Documents set during Step 4: Select Sample
Documents for Configuration).
Import more documents.
Note Whenever you add or move documents to the Sample Documents set, it is
recommended you repeat cleanup. For an indication of the additional work required,
review the Overview chart.
1 Classify the Unknown documents. The majority of these documents are
Initial Escrows that were originally imported with the Tax Escrow type.
a Double-click the bar for Unknown documents in the Overview chart.
Browse Documents will open with just the Unknown documents visible.
b Browse through the thumbnails until you find an Initial Escrow
document.
c Right click on the document to display the context menu.
d Select “Change Document Type” to display the Change Document Type
window.
e In the Document Type box, enter “Initial Escrow”.
f Click OK.
g Click OK to the message box.
The document will now have the type “Initial Escrow” assigned.
h Find the next Initial Escrow document.
i Right click on the document and select “Change Document Type” from
the context menu.
j Select “Initial Escrow” from the list of document types.
k Click OK.
l Assign the “Initial Escrow” type to three more documents, including a
two page Initial Escrow.
m In the list of documents, select all the documents with the type “Initial
Escrow” (hold down CTRL and use the mouse to select multiple
documents at once).
n Right click on the selection to display the context menu.
o Select Confirm Document Type.
p On the Document Types panel in the bottom left of Transformation
Studio’s view, select (Unknown).
q Right-click on the selection to display the context menu.
r Select Auto Classify documents.
Transformation Studio will process the Unknown documents using
information it has learned about the document types in this set. A
window will display the results of the automatic classification, and you
should see that most of the documents were classified.
s Click Close.
2 Having classified the Unknown documents, select the Cleanup Documents
tab to repeat the cleanup process.
Only a few documents should need to be confirmed.
3 Select the Overview tab and check the number of documents in each type.
Tax Escrow and Initial Escrow are both likely to have fewer than 100
documents each (ignore the Header as this can be configured using
just 10 documents).
The Initial Escrow and Tax Escrow documents were all originally part of the
Tax Escrow type.
4 Check whether there are more of this type in the project that are not
currently being used. If there are, move them into the Sample Documents set
so they can be used.
a Double-click the Unused Documents set on the Project Explorer panel.
The Overview chart and the Document Types panel will be updated to
show the composition of this set. There are documents in the Tax Escrow
type that are not currently being used.
b Right click on Tax Escrow in the Document Types panel to display the
context menu.
c Select “Move documents to another document set” to open the Move
Documents to Document Set window.
d From the Move documents to list, select Sample Documents.
e Select the third Move option, “Maximum number of selected documents
per document type.”
f Click OK to move 100 documents into the Sample Documents set.
5 Double-click Sample Documents on the Project Explorer panel.
The Overview chart is no longer color coded as not all the documents in the
set have been read.
6 Read the new pages using the Read Page Content tool.
a Select Tools | Read Page Content.
On the Read Page Content tab, Sample Documents will be selected in the
Document Set list, the “Read only pages that are missing content” option
will be selected in Page Options and the “Use default read parameters”
option will be selected in Read Parameters.
b Click Read.
c When the Stop button is renamed to Finish, click Finish.
7 The Overview will now display Tax Escrow as orange. This is because the
Tax Escrow includes some (misclassified) Initial Escrow documents. Use
Auto Classify to try to reclassify these documents.
a On the Document Types panel, right click on Tax Escrow.
b Select Auto Classify documents from the context menu.
c Click Close.
8 Having added more documents to the set, run cleanup again.
a Select the Cleanup Documents tab.
b Follow the instructions, confirming documents until a message states that
there is no more work to do.
c Select the Overview tab.
9 Review the chart.
There should now be at least 100 documents for each document type (except
for Header) and each bar should be green.
Note It is possible to review the documents that have been automatically classified,
using Browse Documents. For more information refer to the INDICIUS Help.
Step 7: Select Documents for Testing
The Test Documents set is used to store a subset of the clean documents for use in
testing. These are not used during the training process and therefore form an unseen
set of documents to use for testing. As the test documents have been cleaned up, a
comparison between the data in the project and the results of running the
configuration on the documents will provide an accurate indication of performance.
The Test Documents set is populated by moving documents from the Sample
Documents set.
Guidelines for Selecting Test Documents
When selecting test documents you must specify the percentage of documents to
move from the Sample Documents set. You can also specify whether documents that
have had their type manually confirmed may be moved into the test set, or whether
they must remain in the Sample Documents set.
The following table shows guidelines for selecting test documents.
Table 4-4. Test Document Selection Guidelines
Method | Number of Documents in Test Set | Keep Confirmed Documents in Sample Set
Page text classification; Multiple page level classification methods; Document text classification | 30% | Yes
Page image classification; Templated (including barcode) classification; Rules-based classification | 90% | Yes
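As a rough illustration of the two options described above (the percentage to move and whether confirmed documents stay behind), the following Python sketch splits a set of documents. It is not Transformation Studio's selection algorithm: the dictionary layout, the per-type application of the percentage and the random choice of documents are all assumptions made for this example.

```python
import random


def select_test_documents(documents, percent=30, keep_confirmed=True, seed=0):
    """Split documents into (sample_set, test_set).

    documents: list of dicts with "type" and "confirmed" keys (hypothetical layout).
    percent: share of each document type to move into the test set.
    keep_confirmed: when True, manually confirmed documents stay in the sample set.
    """
    rng = random.Random(seed)
    by_type = {}
    for doc in documents:
        by_type.setdefault(doc["type"], []).append(doc)

    sample_set, test_set = [], []
    for docs in by_type.values():
        fixed = [d for d in docs if keep_confirmed and d["confirmed"]]
        movable = [d for d in docs if not (keep_confirmed and d["confirmed"])]
        # Never try to move more documents than are allowed to move.
        target = min(len(movable), round(len(docs) * percent / 100))
        moved = rng.sample(movable, target)
        moved_ids = {id(d) for d in moved}
        test_set.extend(moved)
        sample_set.extend(fixed)
        sample_set.extend(d for d in movable if id(d) not in moved_ids)
    return sample_set, test_set


docs = [{"type": "Redemption", "confirmed": i < 20} for i in range(100)]
sample_set, test_set = select_test_documents(docs)
print(len(sample_set), len(test_set))
```

With keep_confirmed left on, none of the manually confirmed documents reach the test set, so the test set is drawn only from documents whose type was assigned with confidence rather than confirmed by hand.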
To select documents for testing
1 Select Document Sets | Select Test Documents (or click the equivalent
toolbar button) to display the Select Test Documents window.
Figure 4-24. Select Test Documents window
2 Read the warning message; optionally click Show Warnings to see more
details.
3 As the percentage of documents to move is already at 30%, click OK to move
the documents.
Once the documents have been successfully moved, a message will display.
Figure 4-25. Message After Selecting Test Documents
4 Click No to remain in the Sample Documents set rather than opening the
Test Documents set.
The number of documents in each set will be updated in Project Explorer.
Figure 4-26. Project Explorer after Selecting Test Documents
Note If you look at the quality of the Test Documents set in isolation, you
will see that it appears to be of lower quality than the Sample Documents set
before and after the split. This is because Confirmed documents contribute
most to the information used to judge the quality of a document set, and
none of these were moved to Test Documents.
Create Recognition Configuration
Step 1: Create Configuration
A Recognition configuration is a set of files containing information that specifies how
Recognition processes documents for a specific solution. Once a configuration has
been created, these files - known as resources - are accessible from the Project
Explorer panel within Transformation Studio, and are stored in the following folder
on your computer:
Note By default, Project Location is My Documents\Transformation Studio Projects\
Each Recognition configuration configures one instance of the Recognition module.
Separate Recognition configurations are used for:
Page classification and separation (assigned to the Classification and
Separation instance).
Document classification (assigned to the Classification and Separation
instance).
Extraction (assigned to the standard instance).
A Recognition configuration is always based on a configuration template. A template
is a set of resources that form the foundation of your configuration.
Note The resources created will vary depending on the type of template selected.
To create a document classification configuration
1 Select Configuration | Create Configuration... to display the New
Configuration window.
2 Select “Document Classification.”
The Name box will automatically be updated with the default name
“Document Classification.”
3 Click Add.
The configuration will be added into the Configurations list in Project
Explorer.
Figure 4-27. Project Explorer with a Configuration
4 Click the + beside Document Classification to expand the configuration and
view the resources.
Figure 4-28. Project Explorer showing Configuration Resources
Step 2: Configure Text Classification
Document Text Classifier
The classifier is created using the Build Document Text Classifier tab. Typically the
text classifier is trained on the documents in the Sample Documents set (after it has
been cleaned during document set management). Training options are selected
before the build process is started.
It is possible to specify whether training is restricted to documents that have been
confirmed, whether extra pages are trained on, and whether to further limit which
pages within a document are used in the training. Typically the first two options are
not selected (that is, all documents are used while training but not extra pages). The
pages to be used are limited only to save processing time (as the unused pages won't
need to be read in production), and only if the document type can robustly be
identified from a subset of pages.
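The effect of the three training options can be pictured with a short Python sketch. This is an illustration only: the function, the dictionary layout and the use of a simple "first N pages" limit are assumptions, and the real Build Document Text Classifier tab may express the page limit differently.

```python
def training_pages(documents, confirmed_only=False, include_extra_pages=False, max_pages=None):
    """Yield (document type, page text) pairs that would be used for training.

    documents: list of dicts with "type", "confirmed" and "pages" keys, where each
    page is a dict with "text" and "extra" keys (a hypothetical layout).
    """
    for doc in documents:
        if confirmed_only and not doc["confirmed"]:
            continue  # option 1: restrict training to confirmed documents
        # Option 2: extra pages are skipped unless explicitly included.
        pages = [p for p in doc["pages"] if include_extra_pages or not p["extra"]]
        if max_pages is not None:
            pages = pages[:max_pages]  # option 3: limit the pages used
        for page in pages:
            yield doc["type"], page["text"]


docs = [
    {
        "type": "Loan Application",
        "confirmed": False,
        "pages": [
            {"text": "page one", "extra": False},
            {"text": "NEW DOCUMENT", "extra": True},
            {"text": "page three", "extra": False},
        ],
    }
]
# Typical settings from the text: all documents, no extra pages, no page limit.
print(list(training_pages(docs)))
```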
“Document Classification” to display the Build Document Text Classifier tab.
Sample Documents will already be selected in the “Training Document Set”
list and the document types within the set will be listed in the table.
2 Within the table, clear the “Include” check box for the Header document
type, so these documents are not used in training the classifier.
This document type will be accounted for later by configuring templated
(barcode) classification.
3 Click Build.
4 Once the classifier has been built, click Finish.
Integrate Classifier
In production, Recognition runs a Recognition script, which uses the classifier. The
Recognition script (named Document Classification.ifv) is created automatically
when the configuration is created. Two changes may be needed in this script:
The name of the classifier
The pages to be used by the classifier (and therefore that need to be read)
The script will, by default, use a classifier called “Document text classifier.ibc.” This
is the default name of the classifier created using the Build Document Text Classifier
tab. If the name is left unchanged, no modification is needed to the script. For
information on changing the classifier name in the script, refer to the INDICIUS Help.
The script will, by default, run the classification on all pages. For information on
changing the pages to be used, refer to the INDICIUS Help.
To integrate the classifier
No modifications to the script are required for this tutorial.
Test Classification
You will test the configuration on the Test Documents set, that is, the documents that
were not used to build the classifier. These documents require exporting from
Transformation Studio so they can be loaded into the Recognition Test Tool. You will
need to export these documents in the correct file structure for testing document
classification (a multi-page image file for each document).
You will then assign the configuration to a project in Recognition Test Tool, where it
is run on the test documents.
Note Although all testing could be done once the configuration is finished, it is
recommended that testing is done as each classification method is implemented,
ensuring any issues are quickly found and fixed.
To test the configuration
1 Export the Test Documents set from Transformation Studio.
a Select File | Export Documents to display the Export Documents tab.
b Select Test Documents from the “Document Set” list.
c Click Browse and navigate to the following location:
My Documents\Transformation Studio Projects\Tutorial.
d Create a new folder called “Exported Document Sets”.
e In the new folder, create a new subfolder called “Test Documents
(Document Classification).”
f Select the folder Test Documents (Document Classification) and click
Open.
g Make sure the “Create one image file for each document” option is
selected.
h Clear the “Export text files” option (but leave “Export recognition output
files” selected).
i Click Export.
j Once the documents have been exported, click Finish to close the Export
Documents tab.
2 Create a project in Recognition Test Tool and run the test.
a Open Recognition Test Tool by clicking Start on the taskbar to display the
menu, and selecting All Programs | INDICIUS | Tools | Recognition
Test Tool.
b Select File | New Project... to display the New Project window.
c On the Configuration tab, use the “Recognition Script File” button to
assign the script file in your configuration:
My Documents\Transformation Studio
Projects\Tutorial\Configurations\Document
Classification\Resources\Document Classification.ifv.
d Select the Test Properties tab.
e Select the “Display document tree after test” option.
f Click OK.
g Select Documents | Select Batch File to open the Select Batch File
window.
h Select the batch file you exported from Transformation Studio:
My Documents\Transformation Studio Projects\Tutorial\Exported
Document Sets\Test Documents (Document Classification)\All
Document Types.ibf.
i Click Open.
j Press F8 or click the Run Test button to test the configuration.
The batch file will not be altered during this process.
k Select the Summary tab to view the Test Documents set with document
types assigned.
The documents have been sorted by document type, and each set of
documents can be viewed by selecting the tab with a name corresponding
to their document type.
l Select File | Save Project and navigate to the My Documents folder.
m Create a new folder called “Test Projects”.
n Save the project as:
My Documents\Test Projects\Document Classification.rtp.
o Select File | Exit to close Recognition Test Tool.
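The procedure above refers to several files by long paths. The Python sketch below simply assembles those paths from one starting folder so their relationship is easier to see. The folder and file names are the ones used in this tutorial; the location chosen for My Documents is a hypothetical placeholder.

```python
from pathlib import PureWindowsPath

# Hypothetical location of My Documents; substitute your own.
MY_DOCUMENTS = PureWindowsPath(r"C:\Users\me\Documents")

project = MY_DOCUMENTS / "Transformation Studio Projects" / "Tutorial"
config = "Document Classification"

# Recognition script created with the configuration.
script_file = project / "Configurations" / config / "Resources" / f"{config}.ifv"
# Folder the Test Documents set was exported to, and the batch file inside it.
export_folder = project / "Exported Document Sets" / f"Test Documents ({config})"
batch_file = export_folder / "All Document Types.ibf"
# Recognition Test Tool project saved at the end of the procedure.
test_project = MY_DOCUMENTS / "Test Projects" / f"{config}.rtp"

for path in (script_file, batch_file, test_project):
    print(path)
```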
Step 3: Add in Additional Classification Methods
Multiple classification methods can be used together to ensure an accurate and
efficient configuration. For example, the Header document type would be classified
more robustly by reading the barcode than by reading all the text on the page (as
there is very little text and it varies significantly). When processing documents, there
are three classification methods that can be used:
For more information on these methods, refer to Classification Methods or the INDICIUS Help.
Note Image classification is not available in document classification configurations.
Export Documents for Use in Other Tools
Templated and rules-based classification are configured in Definer. As with text
classification, the configuration is based on the Sample Documents. In order to use
these sample documents easily in Definer, they need to be exported from
Transformation Studio. You can export the whole Sample Documents set or, by
creating additional custom sets, just the samples for the document types that you
need to use in Definer.
To export the Header documents in the Sample Documents set
1 In Transformation Studio, if the Sample Documents set isn't open, double-
click “Sample Documents” in Project Explorer.
2 On the Document Types panel, right click on the type “Header.”
3 From the context menu, select “Add documents to another document set.”
Figure 4-29. Add Documents to Document Set
Note By adding the documents rather than moving them, the documents
still exist in the Sample Documents set.
4 In the “Add documents to” box, enter “Sample Header Documents”.
5 Click OK.
6 Select File | Export Documents to display the Export Documents tab.
7 Select “Sample Header Documents” from the “Document Set” list.
8 Click Browse and navigate to the following folder:
My Documents\Transformation Studio Projects\Tutorial\Exported
Document Sets.
9 Create a new folder called “Sample Header Documents”.
10 Select the folder Sample Header Documents and click Open.
11 Make sure the “Create one image file for each document” option is selected.
12 Make sure the “Export text files” and “Export recognition output files”
options are clear.
No recognition output is required for configuring or testing templated
classification.
13 Click Export.
14 Once the documents have been exported, click Finish.
Definition File for Templated Classification
Templated classification is configured in a definition file. The definition file stores
registration marks, fields and barcodes which are used in production to classify
documents of the associated document type. The definition file is created in Definer.
To create a definition file to classify the Header type by barcode
1 Open Definer by clicking Start on the taskbar to display the menu, and
selecting All Programs | INDICIUS | Tools | Definer.
2 Select Image | Open Image to open the Open Sample Image window.
3 Select the first Header image in the location you exported the sample Header
documents to:
My Documents\Transformation Studio Projects\Tutorial\Exported
Document Sets\Sample Header Documents\Header\.
4 Click Open.
5 On the toolbar, click the Barcode button.
6 Draw a rectangle around the top barcode on the image by clicking and
holding the mouse down.
Allow plenty of space around the barcode so it looks like this:
Figure 4-30. Barcode Field
7 On the Properties panel on the right, select the Name property and replace
the default value by entering “Barcode” for the field name.
8 Select File | Save Definition to open the Save As window.
9 Navigate to the location of your Recognition configuration:
My Documents\Transformation Studio
Projects\Tutorial\Configurations\Document Classification\Resources.
10 Enter “Header” as the file name.
Note The name of the definition file will be the document type assigned if
classification is successful.
11 Click Save.
12 Select Tools | Test Definition to open Test Mode. If prompted, click Yes to
save the configuration file each time you run a test.
13 Click Add Image To List to display the Select Images window.
14 Select the other Header images in the location you exported the sample
Header documents to:
My Documents\Transformation Studio Projects\Tutorial\Exported
Document Sets\Sample Header Documents\Header\.
15 Click Open.
16 Click Select All to select all the documents in the list.
17 Click Process Document so you can check the result of each document
during the test.
18 Click Auto Exit at End of Test so it is not selected.
19 Click Run to start the test.
20 Check the field shows the message “Barcode found at ” and that the data
matches the value above the barcode.
21 Click Run Step to test the next document.
22 Repeat the last two steps until all images have been tested.
23 Click Close to exit the Test Mode window.
The barcode should have been found on every document. If needed, resize
the field and retest until all the barcodes are found.
24 In the main Definer view, select the Definition File tab below the image.
25 Press Enter to make space for a new line of code after the line
CORRECT NEVER
and before the line
END.
26 To cause registration to succeed upon finding a barcode, enter the following
lines:
REGREGEXP .+
REGFORMID 1 -1
The complete field will then be:
BEGIN FIELD
COORDS 427 489 1429 652
FORMID 1
NAME Barcode
TYPE CODE39
CORRECT NEVER
REGREGEXP .+
REGFORMID 1 -1
END
When Recognition runs, a successfully registered document is classified with
100% confidence, and assigned the document type given by the name of the
definition file.
Note For information on the two parameters used, press F1 to open the
INDICIUS Help.
27 Select File | Save Definition.
28 Select File | Exit to close Definer.
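To see why the pattern .+ causes registration to succeed for any barcode that is found, the field from step 26 can be examined with a short Python sketch. Two assumptions apply: the parser below is a hypothetical reader for the "KEYWORD value" lines shown above, not the real definition file grammar, and Python's re module stands in for whatever regular expression engine Recognition uses.

```python
import re

FIELD_TEXT = """\
BEGIN FIELD
COORDS 427 489 1429 652
FORMID 1
NAME Barcode
TYPE CODE39
CORRECT NEVER
REGREGEXP .+
REGFORMID 1 -1
END
"""


def parse_field(text):
    """Read 'KEYWORD value' lines between BEGIN FIELD and END into a dict."""
    field = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line in ("BEGIN FIELD", "END"):
            continue
        keyword, _, value = line.partition(" ")
        field[keyword] = value
    return field


field = parse_field(FIELD_TEXT)
pattern = re.compile(field["REGREGEXP"])

# ".+" needs at least one character, so any non-empty value matches
# and an empty value (no barcode data) does not.
for value in ("DOC-0001", "12345", ""):
    print(repr(value), bool(pattern.fullmatch(value)))
```

Under these assumptions, the pattern does not constrain the barcode content at all; it only requires that some data was read.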
Integrate Definition File
As with the text classifier, the definition file is called by the Recognition script in
production. The script will not call a definition file by default, but this can easily be
modified. The name of the definition file must also be updated.
To integrate the definition file into the script
1 In Windows Explorer, navigate to your configuration’s resources folder:
My Documents\Transformation Studio
Projects\Tutorial\Configurations\Document Classification\Resources.
2 Double-click the file Document Classification.ifv to open it in Script Editor.
3 Turn on classification by definition file by changing the following line at the
top of the script:
Const CLASSIFY_BY_DEFINITION_FILE = False
To:
Const CLASSIFY_BY_DEFINITION_FILE = True
4 Set the name of the definition file by changing the following line at the top of
the script:
Const DEFINITION_FILE_FILENAME = "Template.idf"
To:
Const DEFINITION_FILE_FILENAME = "Header.idf"
5 Select File | Save File.
6 Select File | Exit.
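The two edits above amount to changing the values of two Const lines. The following Python sketch shows the same textual change applied to a script held in a string. It is an illustration only: the tutorial makes these changes by hand in Script Editor, the helper name set_const is hypothetical, and the sample text contains just the two lines quoted in this procedure rather than the real script.

```python
import re


def set_const(script_text: str, name: str, new_value: str) -> str:
    """Replace the value on the single 'Const NAME = value' line."""
    pattern = re.compile(
        rf"^(\s*Const\s+{re.escape(name)}\s*=\s*).*$", re.MULTILINE
    )
    # A function replacement avoids any escape handling in new_value.
    updated, count = pattern.subn(lambda m: m.group(1) + new_value, script_text)
    if count != 1:
        raise ValueError(f"expected exactly one Const line for {name}, found {count}")
    return updated


# Only the two lines quoted in the steps above; the real script is longer.
original = (
    'Const CLASSIFY_BY_DEFINITION_FILE = False\n'
    'Const DEFINITION_FILE_FILENAME = "Template.idf"\n'
)
edited = set_const(original, "CLASSIFY_BY_DEFINITION_FILE", "True")
edited = set_const(edited, "DEFINITION_FILE_FILENAME", '"Header.idf"')
print(edited)
```

Raising an error when the constant is missing or duplicated guards against silently editing the wrong line.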
Test Classification
X To test document text and document templated classification together
1 Open Recognition Test Tool by clicking Start on the taskbar to display the
menu, and selecting All Programs | INDICIUS | Tools | Recognition Test
Tool.
2 Open the project you used to test document text classification.
Note You can open the project from the recent projects on the File menu. By
default it will be:
My Documents\Test Projects\Document Classification.rtp
3 Click Run Test.
4 Once the test has finished, select the Header tab.
5 Select one of the documents and click Script Messages in the bottom left
panel.
If templated classification has run, the message will read “Document
<number> classified by template as Header.”
Do not close Recognition Test Tool.
Step 4: Test Performance
You have already tested the configuration when you added each classification
method. Those tests primarily checked that each classification method was called and
no errors occurred when running a test. In this step, you will analyze the
performance of the classification in detail, checking that the classification methods
implemented are picking up the documents as you expect and looking to see where
configuration could be improved (for example, by adding in a new classification
method).
X To analyze the test results
1 In Recognition Test Tool, having run a test, click the Analyse Results button.
2 Select the Document Classification tab to display a table of the percentage of
documents that have been confidently classified into each type.
3 Click the % button so it is no longer selected.
This displays the number of documents classified as each type rather than
the percentage.
4 Check that the number of documents for each type matches the number of
documents in each type in the Test Documents set in Transformation Studio.
a Open the Tutorial project in Transformation Studio.
b Double-click Test Documents to open the set.
c Check the number of examples of each document type on the Document
Types panel and compare with the values in the Results Analysis table in
Recognition Test Tool.
Note The numbers may not be exactly the same, but the closer they are, the
more effective the configuration.
5 Select File | Exit to close the Results Analysis window.
6 Select File | Exit to close Recognition Test Tool.
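The comparison in step 4 can be sketched as a per-type difference between two sets of counts. The Python example below is a minimal illustration, not a feature of Recognition Test Tool: the function name count_differences is hypothetical, and the counts are invented for the example rather than taken from the tutorial data.

```python
def count_differences(expected: dict[str, int], classified: dict[str, int]) -> dict[str, int]:
    """Return classified minus expected for each type whose counts differ."""
    differences = {}
    for doc_type in sorted(set(expected) | set(classified)):
        delta = classified.get(doc_type, 0) - expected.get(doc_type, 0)
        if delta != 0:
            differences[doc_type] = delta
    return differences


# Hypothetical counts, for illustration only.
expected = {"Redemption": 12, "Tax Escrow": 9, "Truth In Lending": 15}
classified = {"Redemption": 12, "Tax Escrow": 8, "Truth In Lending": 15}
print(count_differences(expected, classified))
```

A negative value means fewer documents were confidently classified as that type than exist in the Test Documents set, which points to the types where another classification method may help.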
Create Document Review Configuration
Step 1: Configure a Document Review Project File
In this step you will create and configure a Document Review project file using
Document Review Project Editor. In a document classification solution, Document
Review is used to ensure document types are correctly assigned.
X To create a Document Review project file
1 Open Document Review Project Editor by clicking Start on the taskbar to
display the menu, and selecting All Programs | INDICIUS | Tools |
Document Review Project Editor.
2 Select File | New.
3 Select File | Save to display the Save Project File window.
4 Navigate to the following folder within your Transformation Studio project:
My Documents\Transformation Studio Projects\Tutorial\Configurations.
5 Create a new folder named “Document Review”.
6 Open the folder.
7 Enter the file name “Review”.
8 Click Save.
9 Select the “Use the Document Classification view” option on the General
Options tab.
10 Select the Types tab.
11 Click in the top left cell of the Document Types table.
12 Enter the following types in the table:
Redemption
Request for Tax Form
Tax Escrow
Truth In Lending
Note It is important that the spelling and case of the document types is
exactly as written here, so that the types match those assigned in
Transformation Studio.
You will now specify a validation rule that states that all documents in the
batch must have a type specified in the list you just created. If a document
fails this rule, a problem will display in the Document Review module.
13 Select the Validation tab.
14 Click Add to display the Select Validation Rule window.
15 Select the rule “Every document must have a type specified in the list.”
16 Click OK.
17 Click Add again.
18 Select the rule “Every document must have a confident type.”
19 Click OK.
20 Select the Review tab.
21 In the Review Options panel, select the Behavior property “Automatically go
to next problem.”
22 From the drop down list, select True as the value for this property.
23 Select File | Save to save the project file.
24 Select File | Exit to close Document Review Project Editor.
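The effect of the two validation rules can be sketched as follows. This Python example is an illustration under stated assumptions, not the Document Review implementation: the Document class, the review_problems function, and the problem strings are hypothetical, and the type set holds only the types visible in the list above. The comparison is exact, which is why spelling and case matter.

```python
from dataclasses import dataclass
from typing import Optional

# Only the types visible in the list above; a real project file may hold more.
ALLOWED_TYPES = {"Redemption", "Request for Tax Form", "Tax Escrow", "Truth In Lending"}


@dataclass
class Document:
    doc_type: Optional[str]  # assigned type, or None when unclassified
    confident: bool          # whether the type was assigned confidently


def review_problems(doc: Document, allowed=ALLOWED_TYPES) -> list[str]:
    """Return the rule failures that would need attention during review."""
    problems = []
    if doc.doc_type not in allowed:  # exact, case-sensitive comparison
        problems.append("type not in list")
    if not doc.confident:
        problems.append("type not confident")
    return problems


print(review_problems(Document("Tax Escrow", True)))
print(review_problems(Document("tax escrow", True)))
print(review_problems(Document(None, False)))
```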
Integrate the Configuration with Kofax Capture
Once the configuration has been created and tested, it needs to be assigned to a batch
class in Kofax Capture Administration.
Step 1: Create Batch Class
X To create a batch class
1 Start Kofax Capture Administration by clicking Start on the taskbar to
display the menu, and selecting:
All Programs | Kofax Capture 8.0 | Administration.
2 Select File | New | Batch Class.
3 In the “Name:” box, enter “My Mortgage Apps”.
4 Select the Queues tab.
5 Add the following modules (and re-order them as follows):
82 Getting Started Guide (Classification and Separation)
Configuration
Request for Tax Form
Tax Escrow
Truth In Lending
Unknown
5 Click OK.
The document classes from the installed Mortgage Applications example,
along with their folder classes, are inserted into the batch class you just
created.
Note Batch classes should always contain an “Unknown” document class
and form type. These account for any documents which could not be
classified, since every INDICIUS document type must correspond to a Kofax
Capture form type.
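The requirement in the note can be expressed as a simple consistency check. The Python sketch below assumes that an INDICIUS document type and a Kofax Capture form type correspond when their names are identical; that matching rule is an assumption made for illustration, and the function name missing_form_types is hypothetical.

```python
def missing_form_types(document_types, form_types):
    """Return document types that have no form type of the same name.

    Assumption: correspondence is by identical name.
    """
    return sorted(set(document_types) - set(form_types))


# Types taken from the list above, plus "Unknown" for unclassified documents.
document_types = ["Request for Tax Form", "Tax Escrow", "Truth In Lending", "Unknown"]
form_types_without_unknown = ["Request for Tax Form", "Tax Escrow", "Truth In Lending"]

print(missing_form_types(document_types, form_types_without_unknown))
print(missing_form_types(document_types, form_types_without_unknown + ["Unknown"]))
```

An empty result means every document type, including "Unknown", has somewhere to go in the batch class.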
Step 3: Assign Configuration to the Additional Instance of Recognition
X To assign the configuration for the Classification and Separation instance
of Recognition
1 On the Batch panel, select the “My Mortgage Apps” batch class.
2 Right click on the selection to display the menu, and select INDICIUS
Recognition Setup (Classification and Separation).
The setup dialog for this instance of Recognition is displayed. The
Recognition configuration files are specified (and can be changed) here.
3 On the Recognition Script File panel, click Select Script File... to display a file
selection window.
4 Select the following file:
My Documents\Transformation Studio
Projects\Tutorial\Configurations\Document
Classification\Resources\Document Classification.ifv.
5 Click Open.
6 Click OK.
Step 4: Assign Configuration to Document Review
X To assign the configuration for Document Review
1 On the Batch panel, select the “My Mortgage Apps” batch class.
2 Right click on the selection to display the menu, and select INDICIUS
Document Review Setup.
The Document Review setup dialog is displayed. The Document Review
project file is specified (and can be changed) here.
3 Click Select... to display a file selection window.
4 Select the following file:
My Documents\Transformation Studio
Projects\Tutorial\Configurations\Document Review\Review.drp.
5 Click Open.
6 Click OK.
Step 5: Assign Configuration to the Standard Instance of Recognition
X To assign the configuration for the standard instance of Recognition
1 On the Batch panel, select the “My Mortgage Apps” batch class.
2 Right click on the selection to display the menu, and select INDICIUS
Recognition Setup.
The setup dialog for this instance of Recognition is displayed. The
Recognition configuration files are specified (and can be changed) here.
3 On the Recognition Script File panel, click Select Script File... to display a file
selection window.
4 Browse to the Extraction configuration’s resources folder for the installed
We will not be configuring Completion in this tutorial, so we will assign the
pre-installed configuration files.
5 Select all eight templates in the folder.
6 Click Open.
7 On the left hand panel, select “Input/Output” to display the Input/Output
view.
8 For the “Load document type from:” dropdown list, select “System
Document Type.”
9 For the “Write data to:” options, select both “File” and “Index Fields.”
10 Select the “Display all documents to user” option.
Step 7: Configure Kofax Capture Release
X To configure Kofax Capture Release
1 On the Batch panel, expand the “My Mortgage Apps” batch class to display
the document classes.
2 Select the “Appraisal Report” document class.
3 Right click on the selection to display the menu, and select Release Scripts.
The Release Scripts window is displayed.
Kofax Capture comes with pre-installed release scripts, which control the
method and final location of the data you have captured. We will use the
“Kofax Capture Text” release script to release the data to a text file. In
production, the data would be released to a database or back-end system.
4 From the “Available Release Scripts:” list, select “Kofax Capture Text.”
5 Click Add.
The Text Release Setup window is displayed.
6 On the Index Storage panel, next to the “File name:” box, click Browse.
7 Navigate to the following location:
8 Create a folder called “My Mortgage Apps”.
9 Open the folder and enter the file name “My Mortgage Apps.txt”.
10 Click Open.
11 Copy the path in the “File name:” box to the clipboard.
12 On the Text Release Setup window, select the Document Storage tab.
13 On the Document Storage panel, clear the “Release image files” option.
14 Click OK on the Text Release Setup window.
15 Click Close on the Release Scripts window.
16 Repeat the previous steps for the remaining document classes, pasting the
path into the “File name:” box.
Step 8: Publish Batch Class
X To publish the batch class
1 On the Batch panel, select the “My Mortgage Apps” batch class.
2 Select File | Publish.
The Publish window will display.
3 Click Publish.
The progress of the publishing operation will be logged in the Results panel.
4 When publishing has been completed, click Close.
5 Select File | Exit to close Kofax Capture Administration.
Step 9: Process Batch
Create a new batch using Kofax Capture Batch Manager and then process the images
through the modules (if necessary, refer to the Processing
chapter for instructions).
Page Classification and Separation Tutorial
Summary
In this section, you will modify the current solution to use automatic document
separation. Automatic document separation can save significant time and cost by
removing the need for patch code separators.
Changes needed to make the current solution into a classification and separation
solution are:
• Change the Recognition configuration to run classification at page level (as it
will run before document boundaries are established) and then to call
separation.
• Create a page classification and separation configuration for Recognition.
• Configure page classification methods.
• Add in advanced document separation (which will use the page
classification results to determine the document boundaries and document
types).
• Modify the batch class so Kofax Capture Scan does not detect patch code
separators, but imports every scanned image as a page and places them in a
single temporary document.
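The idea of deriving document boundaries from page classification results can be sketched in a few lines. The Python example below is a deliberately simplified illustration and not the INDICIUS separation algorithm: it assumes each page result carries a suggested document type and a first-page flag, and it starts a new document whenever a page is flagged as a first page or the type changes. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PageResult:
    doc_type: str        # document type suggested by page classification
    is_first_page: bool  # whether the page looks like the start of a document


@dataclass
class SeparatedDocument:
    doc_type: str
    page_indexes: list[int] = field(default_factory=list)


def separate(pages: list[PageResult]) -> list[SeparatedDocument]:
    """Group a single run of pages into documents (simplified rule)."""
    documents: list[SeparatedDocument] = []
    for index, page in enumerate(pages):
        starts_new = (
            not documents
            or page.is_first_page
            or page.doc_type != documents[-1].doc_type
        )
        if starts_new:
            documents.append(SeparatedDocument(page.doc_type))
        documents[-1].page_indexes.append(index)
    return documents


# One temporary document of four scanned pages, as imported without separators.
pages = [
    PageResult("Tax Escrow", True),
    PageResult("Tax Escrow", False),
    PageResult("Redemption", True),
    PageResult("Redemption", True),
]
for doc in separate(pages):
    print(doc.doc_type, doc.page_indexes)
```

The sketch shows why page classification must run first: both the boundaries and the resulting document types come from the per-page results.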
Create Recognition Configuration
Step 1: Create Configuration
As for document classification, the first step is to create a configuration based on a
template.
X To create a page classification and separation configuration
1 Open Transformation Studio.
2 Open your project.
3 Select Configuration | Create Configuration... to display the New
Configuration window.
4 Select “Page Classification and Separation.”
The Name box will automatically be updated with the default name “Page
Classification and Separation.”
5 Click Add.
The configuration will be added to the Configurations list on the Project
Explorer panel.
Step 2: Configure Text Classification
Build Page Text Classifier
The classifier is created on the Build Page Text Classifier tab, where training options
are selected before the build process is started. Typically the text classifier is trained
on the documents in the Sample Documents set (after it has been cleaned during
document set management).
It is possible to specify whether training is restricted to pages within documents that
have been confirmed, whether extra pages are trained on, and whether to further
limit which page types are trained within each document type. Any page types that
are not used for training page text classification need to be classified using an
alternative page classification method (for example, image or templated
classification).
Note In order for a page type to be trained successfully, at least 50 examples of that
page type are required. A warning will display in the table if there are fewer than 50
examples of a page type within the document set.
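The 50-example check described in the note can be sketched as a count per page type. This Python example is an illustration only: Transformation Studio performs the check itself, the function name undertrained_page_types is hypothetical, and the page type labels and counts are invented.

```python
from collections import Counter

MIN_EXAMPLES = 50  # threshold stated in the note above


def undertrained_page_types(page_type_labels):
    """Return page types with fewer than MIN_EXAMPLES examples, with their counts."""
    counts = Counter(page_type_labels)
    return {page_type: n for page_type, n in counts.items() if n < MIN_EXAMPLES}


# Hypothetical labels: one entry per example page in the training set.
labels = ["Tax Escrow / Page 1"] * 60 + ["Redemption / Page 2"] * 12
print(undertrained_page_types(labels))
```

Page types returned by such a check are the ones that either need more examples or should be handled by an alternative page classification method.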
X To build a page text classifier
1 Select Configuration | Build Page Text Classifier into | Configuration “Page
Classification and Separation” to display the Build Page Text Classifier tab.
Sample Documents will be selected by default in the “Training Document
Set” list and the document types within the set will be listed in the table.
Note Some warning triangles may display. Hover the mouse over a specific
triangle to display the warning. The warnings in the tutorial are due to not
having enough examples of some page types. If these warnings were seen on
a project, you would need to ask the customer for more examples of these
page types.
2 Within the table, select “None” in the “Train Using” column for the Header
document type as this will be classified by templated classification (in this
case, using a barcode).
Figure 4-31. Building the Page Text Classifier
3 Click Build.
Once the page text classifier has been built it will display on the Project
Explorer panel.
Figure 4-32. Project Explorer Displaying the New Classifier