Global Match is a stand-alone product that you can use to find duplicates in
Unicode-based text files. It is driven by configuration files that contain various
settings for Global Matching functionality. The configuration files are ASCII-based
text files, which provide a means to tune the matching process and to
benefit from its power and flexibility.
Global Match can match Unicode data in UTF8 or UTF16 format. Global Match,
which can be considered a simplified version of the Match/Consolidate job file,
does the following:
- Identifies duplicate records by matching on data using locale-specific
  business rules.
- Uses the Global Match library to integrate global matching capability into
  your organizational applications.
- Normalizes Chinese, Japanese, Korean, and Taiwanese (CJKT) data into one
  format, and removes spaces and punctuation, as applicable.
- Identifies noise words, which are words that the Global Match product
  ignores during the matching process, so as to match on the more meaningful
  parts of a data string.
Although we recommend using the Global Match product in conjunction with
Match/Consolidate, it is a separate product.
Chinese, Japanese, Korean, and Taiwanese matching
Using Global Match, you can process native CJKT data. Regardless of the
country-specific language, the matching process is the same. For example, Global
Match:
- Considers half-width and full-width characters to be equal.
- Considers native script numerals and Arabic numerals to be equal. Global
  Match can interpret numbers that are written in native script.
- Includes variations for popular, personal, and firm name characters in the
  referential data.
- Considers firm names, such as Corporation or Limited, to be equal to their
  variations (Corp. or Ltd.) during the matching comparison process. To find
  the abbreviations, Global Match uses native script variations of the English
  alphabet during firm name matching.
- Ignores commonly used optional markers for province, city, district, and so
  on, in address data comparison.
- Intelligently handles variations in a building marker.
Global Match User’s Guide
Japanese-specific matching capabilities
In Japanese data, Global Match considers:
- Block data markers, such as chome and banchi, to be equal to those used with
  hyphenated data.
- Words with or without Okurigana to be equal in address data.
- Variations of no marker, ga marker, and so on, to be equal.
- Variations of a hyphen or dashed line to be equal.
Global Match limitations
The Global Match functionality does not:
- Perform conversions of simplified and traditional Chinese data.
- Compare different scripts, such as Kana to Kanji, or Chinese to English.
Introducing Global Match
Set up and run a Global Match job
The following explains how to set up and run a Global Match job.
Create an input file
Begin by creating an input file that conforms to the specifications of the input file
formats supported by Global Match. Global Match can match delimited Unicode
data in UTF8 or UTF16 format.
If you have multi-national, multi-script data, use the Match/Consolidate job file
first. Using the bypass filter in the Match/Consolidate job file, you can segregate
the records that you want to process through Global Match.
If you are using a Relational Database Management System (RDBMS), or any
other database, use database tools to export the data to a text file in UTF8
or UTF16 format. You can use the filters in the database to segregate the records
that you want to process.
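As an illustration of such an export, the following Python sketch writes query
results to a comma-delimited UTF-8 text file. It is not part of Global Match. It
assumes a SQLite database as a stand-in for your RDBMS, a CR/LF record delimiter
(the Windows default described in this guide), and no field framing, which is
why a field that contains a delimiter is rejected rather than quoted. The
function names and the query in the usage note are hypothetical.

```python
import sqlite3


def export_rows(rows, path, encoding="utf-8", field_delim=",", record_delim="\r\n"):
    """Write rows as unframed (unquoted) delimited text in the given encoding."""
    # newline="" stops Python from translating the record delimiter on write.
    with open(path, "w", encoding=encoding, newline="") as out:
        for row in rows:
            fields = ["" if value is None else str(value) for value in row]
            for field in fields:
                # Without framing characters, a field cannot contain a delimiter.
                if field_delim in field or "\r" in field or "\n" in field:
                    raise ValueError(f"field contains a delimiter: {field!r}")
            out.write(field_delim.join(fields) + record_delim)


def export_table(db_path, query, out_path):
    """Run a query against a SQLite database and export the result rows."""
    conn = sqlite3.connect(db_path)
    try:
        export_rows(conn.execute(query), out_path)
    finally:
        conn.close()
```

A call such as export_table("customers.db", "SELECT id, name FROM firms",
"Data.txt") would produce one record per row. Use the WHERE clause of the query
to select only the records that you want to process.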
Set up the DMT file
Create a delimited (DMT) format file that depicts the field layout of the input file
that you created. Consider the following when setting up the DMT file:
- The record delimiter cannot be a string of characters. For example,
  016 017 is not a valid record delimiter, but 016 is a valid record delimiter.
- There cannot be a Topoffset.
- The lengths and the field types, if specified, are ignored.
- If you do not define the field and record delimiters, the default is a comma for
  the field delimiter, and a carriage return (and line feed for Windows) for the
  record delimiter.
- Field framing characters are not allowed.
Configuration file overview
Begin setting up the configuration files. Use the sample configuration files
installed in the \pw\gmatch\samples directory as a reference.
Global Match employs rule-based matching, which is equivalent to extended
matching in the Match/Consolidate job file. Almost all of the parameters in the
configuration files used by Global Match are the same as those used in extended
matching.
Global Match uses the following configuration files, which are preset to the
matcher type, to process input files. When setting up the configuration files, see
the Rule Definition information, which is provided with the Match/Consolidate
job file.
Sample configuration
file name        Description
gmtc.cfg         This is the primary configuration file that you will use to
                 set up your parameters and to run your Global Match jobs.
gfirm.cfg        This is the Overall Match configuration file for the firm
                 match strategy. This file specifies the paths and filenames
                 of the rest of the configuration files to use for this session.
gfirm_key.cfg    This is the Key configuration file for the firm match
                 strategy. This file sets up the fields that the match engine
                 will use as key data for its comparisons of record pairs. The
                 match engine cannot compare any data that is not set up as
                 a key field.
gfirm_gen.cfg    This is the General configuration file for the firm match
                 strategy. This file controls which match options are used in
                 this match process.
gfirm_fld.cfg    This is the Keyfield Match Options configuration file for
                 the firm match strategy. This file controls which key field
                 options are used in this match process. For example, check
                 for transpositions, abbreviations, adjustments, and so on.
gfirm_rul.cfg    This is the Rule Definition configuration file for the firm
                 match strategy. This file sets rules for the match engine to
                 use in determining if two records match.
Edit the primary configuration file
You will use the primary configuration file to set up your parameters and to run
your jobs.
Before making any changes to a configuration file, copy and save it with a
different name.
To edit the primary configuration file:
1. In the \pw\gmatch\samples directory, open the Global Match configuration
   file (gmtc.cfg).
2. In the configuration file, specify the parameters, such as the input file,
   encoding, output file, DMT file, report file, work path, resource path, and so
   on, as appropriate for your job.
3. Verify that the configuration file header is named flgmtc_config.
4. Specify the input file to be processed; you can specify only one input file per
   job. For example: Input_File: Data.txt.
5. Specify the character encoding of the input file, either UTF8 or UTF16. For
   example: File_Encoding: UTF8. The default encoding is UTF16.
   The output file is created in the same encoding as the input file.
6. Specify the DMT file for the input file. For example: Dmt_File: Data.dmt.
7. Specify the output file name. For example: Output_File: Data_output.txt.
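Before running a job, you may want to sanity-check the parameters set in the
steps above. The following Python sketch is one way to do that and is not part of
Global Match. It assumes that every parameter appears on its own line in the
form "Name: value", as in the examples above. The file header and comment
syntax are not described here, so the sketch does not check them. The function
names are hypothetical.

```python
def read_params(text):
    """Collect 'Name: value' lines into a dict of name -> list of values."""
    params = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as C:\data survive.
        name, sep, value = line.partition(":")
        if sep and name.strip():
            params.setdefault(name.strip(), []).append(value.strip())
    return params


def check_primary(params):
    """Return a list of problems found in the primary configuration."""
    problems = []
    # Only one input file can be specified per job.
    if len(params.get("Input_File", [])) != 1:
        problems.append("exactly one Input_File is required")
    # UTF16 is the default when File_Encoding is not specified.
    encoding = params.get("File_Encoding", ["UTF16"])[-1]
    if encoding not in ("UTF8", "UTF16"):
        problems.append("File_Encoding must be UTF8 or UTF16")
    if not params.get("Dmt_File"):
        problems.append("Dmt_File is required")
    return problems
```

An empty list from check_primary means that none of these three checks failed.
It does not guarantee that Global Match will accept the file.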
Edit the Match configuration files
To edit the Match configuration files:
1. Open the Overall Match configuration file (gfirm.cfg) in the \pw\gmatch\samples
   directory.
   The gfirm.cfg file points to the following configuration files, which determine
   the options used to match your data: gfirm_key.cfg, gfirm_gen.cfg,
   gfirm_fld.cfg, and gfirm_rul.cfg.
2. Specify the match configuration file in the Match_Config parameter in the
   gmtc.cfg file. For example: Match_Config: gfirm.cfg
Remember that Global Match supports only rule-based configurations.
Specify the break keys
Break keys specify the break group formed during processing. To specify the
break keys, analyze your input data and decide which field you want to use for
breaking. This concept is very similar to that used with the Match/Consolidate job
file.
The order in which you specify the break keys is important, so choose the break
keys, their offsets (0-based), and lengths carefully. Note that before break keys
are formed, the data is normalized according to the specified field and matcher
type. For information about matcher types, see “Specify the matcher type” on
page 9.
To specify the break keys:
1. Specify up to 42 break keys, which will be used in combination to form break
   groups. To do so, use the following format:
   break_key: input-field-name,offset,break field length,field type
   For example: Post Code,0,3,pc
2. Verify that the input field name is in the DMT file, and that the break field
   starting position and break field length are positive numbers.
The following table specifies the possible values for the field type, which Global
Match uses to normalize the data before forming a match key.
Field type value    Description
name                Personal name data
firm                Firm name data
addr                Address data
pc                  Post code data
other               Use this value when no pre-processing or normalization for
                    the matcher type is required.
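To illustrate how break keys partition the input, the following Python sketch
parses break key specifications in the format shown above and groups records
that share the same key values. It is a simplified model, not the Global Match
implementation. It skips the normalization step that Global Match applies
before forming keys, and it assumes that records with identical key values fall
into the same break group. All function names are hypothetical.

```python
from collections import defaultdict

FIELD_TYPES = {"name", "firm", "addr", "pc", "other"}


def parse_break_key(spec):
    """Parse 'input-field-name,offset,length,field type' into a tuple."""
    parts = [part.strip() for part in spec.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated values: {spec!r}")
    name, offset, length, field_type = parts
    if field_type not in FIELD_TYPES:
        raise ValueError(f"invalid field type: {field_type!r}")
    return name, int(offset), int(length), field_type


def break_key_value(record, specs):
    """Take the 0-based slice of each configured field, in the order specified."""
    return tuple(
        record[name][offset:offset + length]
        for name, offset, length, _field_type in specs
    )


def break_groups(records, specs):
    """Group records (dicts of field name -> text) by their break key value."""
    groups = defaultdict(list)
    for record in records:
        groups[break_key_value(record, specs)].append(record)
    return dict(groups)
```

With the example specification Post Code,0,3,pc, two records whose post codes
begin with the same three characters land in the same group, and only records
within a group are candidates for comparison under this model.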
Specify the match keys
Match keys map the match key fields to the input file fields. Using the following
format, specify up to 42 match keys, which will be used in combination to form
break groups. The format is as follows:
match_key: match-keyfield,input-field-name,field type
For example: MTC_KEYFLD_SPECIAL1, Company_Name, Firm
To edit the Match Key configuration file:
1. Open the Match Key configuration file (gfirm_key.cfg) in the
   \pw\gmatch\samples directory.
2. Specify the match keys that are to be used, the various keys, and the lengths
   to be used for matching. The keys can be in any order. Use the
   MTC_KEYFLD_SPECIAL1 to MTC_KEYFLD_SPECIAL10 fields for
   Global Match.
3. In the Global Match Configuration file, specify the Match_Key to be used.
   This parameter maps the Match keys with the input fields specified in the
   DMT file. For help with completing this parameter list, see the Match Key
   configuration file and the DMT file.
4. Specify the appropriate field type. This field type determines what specific
   matching intelligence will be applied to the match key. The matching
   intelligence applied is matcher specific. For details about how to set the
   matcher type, see “Specify the matcher type” on page 9.
The following table specifies the possible values for the field type, which Global
Match uses to normalize the data before forming a match key.
Field type value    Description
name                Personal name data
firm                Firm name data
addr                Address data
pc                  Post code data
other               Use this value when no pre-processing or normalization for
                    the matcher type is required.
Edit the Match Keyfield configuration file
To edit the Match Keyfield configuration file:
1. Open the Match Keyfield configuration file (gfirm_fld.cfg) in the
   \pw\gmatch\samples directory.
2. Set the options for the different keys on which you are matching.
3. Specify the appropriate options, such as checking for abbreviations, initials,
   transpositions, and so on.
Refer to the documentation in the configuration file for details about each of the
options. For those keys on which you do not specify any options, defaults will be
used, as specified in the documentation in the file.
Edit the Match General Configuration file
To edit the Match General configuration file:
1. Open the Match General Configuration file (gfirm_gen.cfg) in the
   \pw\gmatch\samples directory.
2. Set the options appropriate for your job.
Edit the Match Rule Definition configuration file
To edit the Match Rule Definition configuration file:
1. Open the Match Rule Definition file (gfirm_rul.cfg) in the \pw\gmatch\samples
   directory.
2. Set the rules that the records should pass in order to be considered a match.
3. Verify the results of the trial runs with settings in this file, and adjust them as
   necessary.
Rule settings
The following provides brief descriptions of the rule settings. If you need
additional information, see the documentation in the configuration file.
Rule evaluation
Rules are evaluated in the order of the rule number. For example, Rule Number 1
is evaluated first, based on its scores and settings.
Max No-Dupe Score
The Max No-Dupe Score setting establishes a cutoff for the match engine to conclude
that two records do not match, based solely on the dissimilarity of these two
fields. If the field match score is equal to or less than this value, then the match
engine will conclude that the two records do not match. Thus, a decision of
No-dupe is returned. No further rules are evaluated.
If the score is higher than this, then other criteria are checked. If this value is set
to -1 (or if the line is commented out), then the match engine will not use the field
match score from this rule to conclude that the two records do not match.
Min Dupe Score
The Min Dupe Score setting establishes a cutoff for the match engine to conclude that
two records, in fact, match, based solely on the similarity of these two fields. If
the field match score is equal to or more than this value, then the match engine
will conclude that the two records match. Thus, a decision is made. No further
rules are evaluated.
If this value is set to 101 (or if the line is commented out), then the match engine will
not use the field match score from this rule to conclude that the two records
match. If the decision cannot be made, other criteria are checked.
If no decision is made
If neither score produces a dupe or no-dupe decision, subsequent rules are
evaluated until one of the rules is able to make the decision.
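The decision flow described above can be sketched in Python as follows. This is
a simplified model for reasoning about rule settings, not the match engine
itself. It assumes one field match score per rule on a 0 to 100 scale, and it
ignores the other criteria and blank-field settings that the rule configuration
provides. The function and argument names are hypothetical.

```python
def evaluate_rules(rules, scores):
    """Return (decision, rule_number) for one pair of records.

    rules  -- iterable of (rule_number, max_no_dupe, min_dupe) tuples
    scores -- dict mapping rule_number to that rule's field match score
    """
    for number, max_no_dupe, min_dupe in sorted(rules):
        score = scores[number]
        # A value of -1 can never be reached by a score of 0 or more,
        # so it switches the no-dupe cutoff off for this rule.
        if score <= max_no_dupe:
            return ("no-dupe", number)
        # Likewise, 101 is above any possible score and switches the
        # dupe cutoff off for this rule.
        if score >= min_dupe:
            return ("dupe", number)
    # No rule made the decision; the rule number is reported as zero.
    return ("undecided", 0)
```

For example, with rules [(1, 40, 90), (2, -1, 80)] and scores {1: 60, 2: 85},
Rule 1 cannot decide because 60 lies between its two cutoffs, so Rule 2 decides
"dupe" because 85 is at least 80.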
Blank match
Blank match is the setting that controls the behavior when one of the fields is
blank. If it is set to Yes, then the fields will be considered a match. The Rule
Configurations settings provide many other score settings that you can adjust to
control the behavior in case of blank fields. Refer to the documentation in the
configuration file for details.
Specify the matcher type
Global Match works on individual records to apply specific matching
intelligence. However, you need to specify the matcher fields, which determine
which matcher type to apply.
For example: Matcher_Field: Country Code
Map the matcher identifier input field values to the matcher type. The identifier
text in the matcher field determines which matcher is to be applied. The format
for mapping the match identifier is as follows:
Matcher: “identifier-text”, matcher_type
For example: Matcher: “JPN”, jp
Specify one matcher field value per matcher entry. You can specify up to 32
matcher entries.
Use one of the following matcher types: Chinese, Japanese, Korean, and Taiwanese.
The following table lists the values for each of the matcher types.
Matcher type    Value
Japanese        jp
Korean          ko
Chinese         zh
Taiwanese       tw
If you know your data can be processed with a single matcher type, you can specify
the Default_Matcher parameter. If you do not specify any default matcher, all the
remaining records that cannot be associated with a particular matcher type
specified above will be processed through the generic Unicode matcher.
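The selection logic described in this section can be sketched in Python as
follows. This is an illustrative model, not Global Match code. It assumes that
the identifier text must match the matcher field value exactly, and it uses the
made-up label "generic" for the generic Unicode matcher. The function names are
hypothetical.

```python
MATCHER_TYPES = {"jp", "ko", "zh", "tw"}
MAX_MATCHER_ENTRIES = 32
GENERIC = "generic"  # hypothetical label for the generic Unicode matcher


def build_matcher_map(entries):
    """Validate (identifier_text, matcher_type) pairs and return a lookup dict."""
    if len(entries) > MAX_MATCHER_ENTRIES:
        raise ValueError("you can specify up to 32 matcher entries")
    mapping = {}
    for identifier, matcher_type in entries:
        if matcher_type not in MATCHER_TYPES:
            raise ValueError(f"invalid matcher type: {matcher_type!r}")
        mapping[identifier] = matcher_type
    return mapping


def select_matcher(record, matcher_field, mapping, default_matcher=None):
    """Pick the matcher for one record (a dict of field name -> text)."""
    identifier = record.get(matcher_field, "")
    if identifier in mapping:
        return mapping[identifier]
    # Unmapped or blank identifiers fall back to the default matcher,
    # or to the generic Unicode matcher when no default is specified.
    return default_matcher or GENERIC
```

With the entry ("JPN", "jp") and the matcher field Country Code, a record whose
Country Code is JPN is routed to the Japanese matcher, and a record with any
unmapped value falls through to the default or generic matcher.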
Run Global Match
After you set up the input file, the DMT, and the configuration files, you are ready
to run the job. Run Global Match in the installation directory as follows:
flgmtc mygmtc.cfg
Global Match verifies all of the configuration settings, issues any errors or
warnings, and displays various informational and progress messages.
Analyze the output
Global Match creates two types of output files: an output file and a report file.
Output file
The output file has the same format as the input file, with the following fields
appended to each record:
Field            Description
Record Number    This is the number of the record.
Dupe Group ID    All the records in one dupe group will have the same numeric ID.
                 All the master records will have the rule number and score values
                 as -1.
                 All individual records will have a dupe group ID of 0.
Rule Number      This is the rule number that made the deciding comparison. If a rule
                 did not cause a decision to be made, then the rule number is zero.
Score            This is the score of the rule that made the deciding comparison.
                 If a rule did not cause a decision to be made, then this is the
                 weighted score of the decision.
Global Match does not create output files containing duplicate records, master
records, and so on. Instead, it attaches information to the records, which can be
used to derive those specific sets.
For example, if you want to locate all of the unique and master records, then
segregate all the records with a rule number or rule score value of “-1.” If you
want to locate only unique records, filter the records with a dupe group ID of “0.”
Use the rule number and score information to further fine-tune the match rule
configuration.
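As an illustration of deriving those sets, the following Python sketch classifies
one output record from its appended fields. It is not part of Global Match. It
assumes a comma field delimiter with no framing, and that the four fields are
appended at the end of each record in the order Record Number, Dupe Group ID,
Rule Number, Score. The function name and the sample records in the usage note
are hypothetical.

```python
def classify(line, field_delim=","):
    """Return 'unique', 'master', or 'duplicate' for one output record."""
    fields = line.rstrip("\r\n").split(field_delim)
    _record_number, group_id, rule_number, _score = fields[-4:]
    if int(group_id) == 0:
        # A dupe group ID of 0 marks a record that matched nothing.
        return "unique"
    if int(rule_number) == -1:
        # Inside a dupe group, a rule number of -1 marks the master record.
        return "master"
    return "duplicate"
```

For example, a record ending in 2,7,-1,-1 is the master of dupe group 7, and a
record ending in 3,7,2,95 is a duplicate in the same group that was decided by
Rule 2.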
Report file
Global Match also creates a report file showing an execution summary with
information about the number of dupe break groups, dupe groups, duplicates
found per matcher type, and so on.
To create a report file, specify the report file name. For example: Report_File:
Data_report.txt. The following shows an example summary report.
Global Match Summary Report
Input file: Data.txt
Output file: Data_output.txt
Start time: Fri Feb 25 10:02:07 2005
End time: Fri Feb 25 10:02:23 2005
Number of records processed: 5849
Number of break groups: 348
Number of total duplicates: 33
Number of dupe groups: 30
Generic Unicode Matcher
Number of dupes: 0
Number of dupe groups: 0
Japanese Matcher
Number of dupes: 0
Number of dupe groups: 0
Korean Matcher
Number of dupes: 7
Number of dupe groups: 4
Chinese Matcher
Number of dupes: 26
Number of dupe groups: 26
Taiwanese Matcher
Number of dupes: 0
Number of dupe groups: 0
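If you want to check a report programmatically, the following Python sketch
reads the counts from a summary that follows the layout of the example above. It
is not part of Global Match. It assumes that each matcher section starts with a
line ending in "Matcher" and that every count line starts with "Number of". The
function name is hypothetical.

```python
def parse_report(text):
    """Return (totals, matchers) parsed from a summary report.

    totals   -- dict of the overall 'Number of ...' counts
    matchers -- dict of matcher section name -> dict of its counts
    """
    totals, matchers = {}, {}
    current = None  # None until the first matcher section starts
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith("Matcher"):
            current = matchers.setdefault(line, {})
        elif line.startswith("Number of"):
            label, _sep, value = line.partition(":")
            target = totals if current is None else current
            target[label] = int(value)
    return totals, matchers
```

In the example report, the per-matcher counts add up to the overall totals:
0 + 0 + 7 + 26 + 0 = 33 duplicates and 0 + 0 + 4 + 26 + 0 = 30 dupe groups.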
Error messages
The following table summarizes the Global Match general and configuration file
error messages, and warning and informational messages.
General error messages

Message: Usage: flgmtc ConfigFile
         Usage: flgmtc -rev  Display software version and revision level
Description: Occurs when the incorrect option is supplied on the command line.
Solution: Use a correct command line option.

Message: Failed to open configuration file: name. No such file.
Description: The configuration file is not found or could not be opened.
Solution: Use correct file name and permissions.

Message: Keyfile pwmpg.key not found or corrupted
Description: Occurs when the Keyfile is not present, is the wrong version, or is
corrupted.
Solution: Use the correct Keyfile, and ensure that it is in the current directory.

Message: Record n is too long. Cannot continue processing.
Description: Global Match could not read the record because it was too long.
Solution: Specify a correct value for a record delimiter in the DMT file.

Description: Failed to read any records from the input file. Ensure that the input
file is not empty and the file encoding specified in the configuration file is
correct.
Solution: This error generally occurs if the input file is empty or the actual
encoding of the file does not match with the encoding specified in the
configuration file.

Message: Error opening some_file file. (some_file could be input, output or
report files.)
Description: The file was not found.
Solution: Verify the path of the file in the Global Match Configuration file.

Message: System Error. Not enough memory to initialize Global Match.
         System error: Not enough memory.
Description: Not enough memory.
Solution: Close some applications and try running Global Match again.

Message: System Error. Failed to create temporary files.
Description: The disk is full or write-protected.
Solution: Verify that you have write permissions on the Work_Path specified in the
Global Match configuration file.

Message: Internal match library error. Some_Error_message
Description: This is an unexpected error and should not occur in normal Global
Match functioning.
Solution: Contact Business Objects Customer Support.
Configuration file error messages

Message: Error in Global Match configuration file. Occurred at line number = n
Some_Error_Message
Note: For these kinds of main Global Match configuration file errors, the text in
“Description” will appear as “Some_Error_Message.”

Description: No format file specified.
Solution: Specify a format file in the parameter Dmt_File in the configuration file.

Description: Unable to open the format file.
Solution: The specified format file does not exist, or does not have enough read
permissions.

Description: Error in the format file.
Solution: The DMT file violates one or more exceptions specified in earlier
sections about DMT File specifications.

Description: No match keys specified in the configuration.
Solution: The configuration file must have at least one Match_Key specified.

Description: Invalid File_Encoding.
Solution: The file encoding must be either UTF8 or UTF16. UTF16 is the default.

Description: Invalid Break Key Specification.
             Invalid Match Key Specification.
Solution: There is an error in the specification. One or more fields may have been
omitted or may be incorrect.

Description: Multiple values specified for parameter.
Solution: Some of the parameters in the configuration file can be specified only
once. Ensure that you have not repeated one of these.

Description: Input file name too long.
             DMT file name too long.
             Output file name too long.
             Report file name too long.
             Resource path too long.
             Work path too long.
             Match configuration file name too long.
Solution: The length of the file name(s) exceeds the maximum allowed by the
platform.

Description: No Matcher_Field specified.
Solution: No actual input field is specified as a value for the Matcher_Field
parameter.

Description: Invalid Default_Matcher name (xx).
Solution: The Default_Matcher value must be jp, ko, zh, or tw.

Description: Exceeded the maximum number of break keys.
             Exceeded the maximum number of Match keys.
Solution: You can only specify 42 break or match keys.

Description: Missing input field name in Break Key Specification.
             Missing input field name in Match Key Specification.
Solution: The specified break or match key does not have an input field name.

Description: Invalid input file field (xxx) in Break Key Specification.
             Invalid input file field (xxx) in Match Key Specification.
Solution: Every break or match key must be associated with an input field as
specified in the DMT file. An input field name is specified, but it is not present
in the DMT file.

Description: Missing offset in Break Key Specification.
Solution: Specify a positive value as offset.

Description: Missing length in Break Key Specification.
Solution: Specify a positive value as length.

Description: Missing field_type name in Break Key Specification.
             Missing field_type name in Match Key Specification.
Solution: Specify a field_type for the break key or match key. Valid field types
are name, firm, addr, and pc.

Description: Invalid field_type (xx) in Break Key Specification.
             Invalid field_type (xx) in Match Key Specification.
Solution: A field_type specified for the break or match key is invalid. Valid field
types are name, firm, addr, pc.

Description: Missing match key field name in Match Key Specification.
Solution: Every Match key must have an associated match key field.

Description: Invalid match key field name (xxx) in Match Key Specification.
Solution: Refer to the Match Library Programmer’s Reference for valid values of
match key field names.

Description: Matcher values cannot be specified without matcher field.
Solution: You specified matcher values, but did not specify a matcher field.

Description: Exceeded the maximum number of Matcher values.
Solution: You can specify only 32 matcher values.

Description: Missing string value in Matcher Specification.
Solution: Matcher specification must contain a double-quoted string value.

Description: Matcher string must be double quoted in Matcher Specification.
Solution: Matcher specification must contain a double-quoted string value.

Description: Missing matcher name in Matcher Specification.
Solution: Matcher value must have matcher name specified. Valid values are jp, ko,
zh, or tw.

Description: Invalid matcher name (xxx) in Matcher Specification.
Solution: Valid values are jp, ko, zh or tw.

Message: Error in Match Configuration files. Error: Some_Error_Message
Details: Some_Error_Detail Match Library Error Number: n
Note: For these kinds of Match configuration file errors, Some_Error_Message and
Some_Error_Detail will contain all the details of the actual error.
Description: Refer to the error message and detail. A typical error message will
contain the name and line number of the file in which the error occurred, as well
as the exact nature of the error.
Solution: In most cases, the error detail will provide information about the
problem. In some cases, absence of one or more parameters from the required set of
parameters will cause the error.
For example, in the match key configuration file, if you do not specify all five
parameters for each key, you will receive an error.
Message: Error in configuration file. Match configuration must specify Rule
matching.
Description: Global Match does not support Auto matching. It only supports Rule
matching.
Solution: Verify that the value of MTC_Matching_Type is Rule in the Overall Match
Configuration file.

Message: Error instantiating matcher instances. Please check resource path in the
configuration file.
Description: The resource path is incorrect.
Solution: Ensure that the Resource_Path specified in the Global Match configuration
file contains the root.res file.

Warning and informational messages

Message: Warning: Unable to retrieve matcher field. Record number = n
Description: This is just a warning that the records specified did not contain a
matcher identifier field.
Solution: This may occur because of incorrect field specifications in the DMT file.
Or, the data may be blank for the records specified.
Because this is a warning and not an error, these records will pass through the
default_matcher, if specified, or the generic Unicode matcher.

Message: Creating work file....
         Comparing m record(s) in break group n
         Sorting results....
         Creating Report....
         Processing completed successfully
Description: Progress information messages.
Solution: Progress information messages. Processing time may be lengthy depending
on the number of records and match rules setup, so the command prompt may appear
stationary. This does not mean that the system is hanging.