Business Objects Global Match 7.90c User Manual

Global Match

User’s Guide

Global Match 7.90c
November 2008

Introducing Global Match

Global Match is a stand-alone product that you can use to find duplicates in Unicode-based text files. It is driven by configuration files that contain the settings for Global Match functionality. The configuration files are ASCII-based text files, which let you tune the matching process and benefit from its power and flexibility.
Global Match can match Unicode data in UTF8 or UTF16 format. Global Match, which can be considered a simplified version of the Match/Consolidate job file, does the following:
Identifies duplicate records by matching on data using locale-specific business rules.
Uses the Global Match library to integrate global matching capability into your organizational applications.
Normalizes Chinese, Japanese, Korean, and Taiwanese (CJKT) data into one format, and removes spaces and punctuation, as applicable.
Identifies noise words, which are words that the Global Match product ignores during the matching process, so as to match on the more meaningful parts of a data string.
Although we recommend using the Global Match product in conjunction with Match/Consolidate, it is a separate product.

Chinese, Japanese, Korean, and Taiwanese matching

Using Global Match, you can process native CJKT data. Regardless of the country-specific language, the matching process is the same. For example, Global Match:
Considers half-width and full-width characters to be equal.
Considers native script numerals and Arabic numerals to be equal. Global Match can interpret numbers that are written in native script.
Includes variations for popular, personal, and firm name characters in the referential data.
Considers firm names, such as Corporation or Limited, to be equal to their variations (Corp. or Ltd.) during the matching comparison process. To find the abbreviations, Global Match uses native script variations of the English alphabet during firm name matching.
Ignores commonly used optional markers for province, city, district, and so on, in address data comparison.
Intelligently handles variations in a building marker.

Japanese-specific matching capabilities

In Japanese data, Global Match considers:
Block data markers, such as chome and banchi, to be equal to those used with hyphenated data.
Words with or without Okurigana to be equal in address data.
Variations of no marker, ga marker, and so on, to be equal.
Variations of a hyphen or dashed line to be equal.

Global Match limitations

The Global Match functionality does not:
Perform conversions of simplified and traditional Chinese data.
Compare different scripts, such as Kana to Kanji, or Chinese to English.

Set up and run a Global Match job

The following explains how to set up and run a Global Match job.

Create an input file

Begin by creating an input file that conforms to the specifications of the input file formats supported by Global Match. Global Match can match delimited Unicode data in UTF8 or UTF16 format.
If you have multi-national, multi-script data, use the Match/Consolidate job file first. Using the bypass filter in the Match/Consolidate job file, you can segregate the records that you want to process through Global Match.
If you are using a Relational Database Management System (RDBMS), or any other database, use database tools to export the data to a text file in UTF8 or UTF16 format. You can use the filters in the database to segregate the records that you want to process.
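If no export tool is at hand, the export step can be approximated with a short script. The following Python sketch is only an illustration: the function names, the comma and carriage return/line feed delimiters, and the decision to reject field values that contain a delimiter are assumptions made for this example, not features of Global Match.

```python
# Hypothetical helper for preparing a delimited input file.
FIELD_DELIMITER = ","       # assumed field delimiter
RECORD_DELIMITER = "\r\n"   # assumed record delimiter (Windows style)

# Map the encoding names used in this guide to Python codec names.
CODECS = {"UTF8": "utf-8", "UTF16": "utf-16"}


def format_record(fields):
    """Join the fields of one record, refusing values that contain a delimiter."""
    for value in fields:
        if FIELD_DELIMITER in value or "\r" in value or "\n" in value:
            raise ValueError(f"field contains a delimiter: {value!r}")
    return FIELD_DELIMITER.join(fields) + RECORD_DELIMITER


def write_input_file(path, records, file_encoding="UTF16"):
    """Write every record to 'path' in the requested encoding."""
    codec = CODECS[file_encoding]
    # newline="" stops Python from translating the record delimiter.
    with open(path, "w", encoding=codec, newline="") as handle:
        for fields in records:
            handle.write(format_record(fields))
```

For example, write_input_file("Data.txt", rows, "UTF8") would write one line per record, where rows is a list of lists of strings taken from your database query.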

Set up the DMT file

Create a delimited (DMT) format file that depicts the field layout of the input file that you created. Consider the following when setting up the DMT file:
The record delimiter cannot be a string of characters. For example, 016 017 is not a valid record delimiter, but 016 is a valid record delimiter.
There cannot be a Topoffset.
The lengths and the field types, if specified, are ignored.
If you do not define the field and record delimiters, the default is a comma for the field delimiter, and a carriage return (and line feed for Windows) for the record delimiter.
Field framing characters are not allowed.

Configuration file overview

Begin setting up the configuration files. Use the sample configuration files installed in the \pw\gmatch\samples directory as a reference.
Global Match employs rule-based matching, which is equivalent to extended matching in the Match/Consolidate job file. Almost all of the parameters in the configuration files used by Global Match are the same as those used in extended matching.
Global Match uses the following configuration files, which are preset to the matcher type, to process input files. When setting up the configuration files, see the Rule Definition information, which is provided with the Match/Consolidate job file.
Sample configuration file name    Description
gmtc.cfg          This is the primary configuration file that you will use to set up your parameters and to run your Global Match jobs.
gfirm.cfg         This is the Overall Match configuration file for the firm match strategy. This file specifies the paths and filenames of the rest of the configuration files to use for this session.
gfirm_key.cfg     This is the Key configuration file for the firm match strategy. This file sets up the fields that the match engine will use as key data for its comparisons of record pairs. The match engine cannot compare any data that is not set up as a key field.
gfirm_gen.cfg     This is the General configuration file for the firm match strategy. This file controls which match options are used in this match process.
gfirm_fld.cfg     This is the Keyfield Match Options configuration file for the firm match strategy. This file controls which key field options are used in this match process. For example, check for transpositions, abbreviations, adjustments, and so on.
gfirm_rul.cfg     This is the Rule Definition configuration file for the firm match strategy. This file sets rules for the match engine to use in determining if two records match.

Edit the primary configuration file

You will use the primary configuration file to set up your parameters and to run your jobs.
Before making any changes to a configuration file, copy and save it with a different name.
To edit the primary configuration file:
1. In the \pw\gmatch\samples directory, open the Global Match configuration file (gmtc.cfg).
2. In the configuration file, specify the parameters, such as the input file, encoding, output file, DMT file, report file, work path, resource path, and so on, as appropriate for your job.
3. Verify that the configuration file header is named flgmtc_config.
4. Specify the input file to be processed; you can specify only one input file per job. For example: Input_File: Data.txt.
5. Specify the character encoding of the input file, either UTF8 or UTF16. For example: File_Encoding: UTF8. The default encoding is UTF16.
The output file is created in the same encoding as the input file.
6. Specify the DMT file for the input file. For example: Dmt_File: Data.dmt.
7. Specify the output file name. For example: Output_File: Data_output.txt.
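Before running a job, it can help to sanity-check the values entered in the steps above. The following Python sketch assumes that each single-valued parameter sits on its own line in the form Name: value; the helper names and the treatment of a missing Input_File as a problem are assumptions for illustration, and the real configuration file may contain a header, comments, and repeated parameters that this sketch ignores.

```python
VALID_ENCODINGS = {"UTF8", "UTF16"}
DEFAULT_ENCODING = "UTF16"  # default stated in this guide


def parse_parameters(text):
    """Collect 'Name: value' lines into a dict (assumed file layout)."""
    params = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as C:\data.txt survive.
        name, separator, value = line.partition(":")
        if separator and name.strip():
            params[name.strip()] = value.strip()
    return params


def check_primary_config(params):
    """Return a list of problems found in the parsed parameters."""
    problems = []
    encoding = params.get("File_Encoding", DEFAULT_ENCODING)
    if encoding not in VALID_ENCODINGS:
        problems.append("Invalid File_Encoding.")
    if "Dmt_File" not in params:
        problems.append("No format file specified.")
    if "Input_File" not in params:
        # Assumption: this sketch treats a missing input file as a problem.
        problems.append("No input file specified.")
    return problems
```

An empty list from check_primary_config means that none of the checks in this sketch failed; it does not replace the verification that Global Match itself performs at startup.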

Edit the Match configuration files

To edit the Match configuration files:
1. Open the Overall Match configuration file (gfirm.cfg) in the \pw\gmatch\samples directory.
The gfirm.cfg file points to the following configuration files, which determine the options used to match your data: gfirm_key.cfg, gfirm_gen.cfg, gfirm_fld.cfg, and gfirm_rul.cfg.
2. Specify the match configuration file in the Match_Config parameter in the gmtc.cfg file. For example: Match_Config: gfirm.cfg
Remember that Global Match supports only rule-based configurations.

Specify the break keys

Break keys specify the break group formed during processing. To specify the break keys, analyze your input data and decide which field you want to use for breaking. This concept is very similar to that used with the Match/Consolidate job file.
The order in which you specify the break keys is important, so choose the break keys, their offsets (0-based), and lengths carefully. Note that before break keys are formed, the data is normalized according to the specified field and matcher type. For information about matcher types, see “Specify the matcher type.”
To specify the break keys:
1. Specify up to 42 break keys, which will be used in combination to form break groups. To do so, use the following format: break_key: input-field-name,offset,break field length,field type
For example: Post Code,0,3,pc
2. Verify that the input field name is in the DMT file, and that the break field starting position and break field length are positive numbers.
The following table specifies the possible values for the field type, which Global Match uses to normalize the data before forming a match key.
Field type value    Description
name                Personal name data
firm                Firm name data
addr                Address data
pc                  Post code data
other               Use this value when no pre-processing or normalization for the matcher type is required.
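The break key format and field type values above can be illustrated with a small parser. The following Python sketch is not how Global Match reads its configuration; the class and function names are hypothetical, the offset is assumed to be zero or greater (offsets are 0-based) and the length greater than zero, and the normalization that Global Match applies before forming break keys is left out.

```python
from typing import NamedTuple

FIELD_TYPES = {"name", "firm", "addr", "pc", "other"}
MAX_BREAK_KEYS = 42


class BreakKey(NamedTuple):
    input_field: str
    offset: int
    length: int
    field_type: str


def parse_break_key(spec, dmt_fields):
    """Parse 'input-field-name,offset,break field length,field type'."""
    parts = [part.strip() for part in spec.split(",")]
    if len(parts) != 4:
        raise ValueError("Invalid Break Key Specification.")
    input_field, offset_text, length_text, field_type = parts
    if input_field not in dmt_fields:
        raise ValueError(f"Invalid input file field ({input_field}).")
    if field_type not in FIELD_TYPES:
        raise ValueError(f"Invalid field_type ({field_type}).")
    offset, length = int(offset_text), int(length_text)
    if offset < 0 or length <= 0:  # assumed limits, see the lead-in
        raise ValueError("Invalid Break Key Specification.")
    return BreakKey(input_field, offset, length, field_type)


def parse_break_keys(specs, dmt_fields):
    """Parse all break key specifications, enforcing the limit of 42."""
    if len(specs) > MAX_BREAK_KEYS:
        raise ValueError("Exceeded the maximum number of break keys.")
    return [parse_break_key(spec, dmt_fields) for spec in specs]


def break_group_key(keys, record):
    """Combine the key slices of one record, in the order the keys are given."""
    return tuple(
        record[key.input_field][key.offset:key.offset + key.length]
        for key in keys
    )
```

Records that produce the same tuple from break_group_key would fall into the same break group in this simplified model, which is why the order of the break keys matters.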

Specify the match keys

Match keys map the match key fields to the input file fields. Specify up to 42 match keys, which will be used in combination during matching. The format is as follows: match_key: match-keyfield,input-field-name,field type
For example: MTC_KEYFLD_SPECIAL1, Company_Name, Firm
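The match key format can be checked in a similar way. The following Python sketch is an illustration only: parse_match_key is a hypothetical helper, the field type is compared without regard to case because the example above writes Firm, and only the MTC_KEYFLD_SPECIAL1 to MTC_KEYFLD_SPECIAL10 key fields are accepted, which is an assumption based on the fields this guide recommends for Global Match.

```python
import re

FIELD_TYPES = {"name", "firm", "addr", "pc", "other"}
# Accept MTC_KEYFLD_SPECIAL1 through MTC_KEYFLD_SPECIAL10 only.
KEYFIELD_PATTERN = re.compile(r"MTC_KEYFLD_SPECIAL([1-9]|10)")


def parse_match_key(spec):
    """Parse 'match-keyfield,input-field-name,field type' into a dict."""
    parts = [part.strip() for part in spec.split(",")]
    if len(parts) != 3 or not all(parts):
        raise ValueError("Invalid Match Key Specification.")
    keyfield, input_field, field_type = parts
    if not KEYFIELD_PATTERN.fullmatch(keyfield):
        raise ValueError(f"Invalid match key field name ({keyfield}).")
    if field_type.lower() not in FIELD_TYPES:
        raise ValueError(f"Invalid field_type ({field_type}).")
    return {
        "match_keyfield": keyfield,
        "input_field": input_field,
        "field_type": field_type.lower(),
    }
```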
To edit the Match Key configuration file:
1. Open the Match Key configuration file (gfirm_key.cfg) in the \pw\gmatch\samples directory.
2. Specify the match keys that are to be used, the various keys, and the lengths to be used for matching. The keys can be in any order. Use the MTC_KEYFLD_SPECIAL1 to MTC_KEYFLD_SPECIAL10 fields for Global Match.
3. In the Global Match Configuration file, specify the Match_Key to be used. This parameter maps the Match keys with the input fields specified in the DMT file. For help with completing this parameter list, see the Match Key configuration file and the DMT file.
4. Specify the appropriate field type. This field type determines what specific matching intelligence will be applied to the match key. The matching intelligence applied is matcher specific. For details about how to set the matcher type, see “Specify the matcher type.”

The following table specifies the possible values for the field type, which Global Match uses to normalize the data before forming a match key.
Field type value    Description
name                Personal name data
firm                Firm name data
addr                Address data
pc                  Post code data
other               Use this value when no pre-processing or normalization for the matcher type is required.

Edit the Match Keyfield configuration file

To edit the Match Keyfield configuration file:
1. Open the Match Keyfield configuration file (gfirm_fld.cfg) in the \pw\gmatch\samples directory.
2. Set the options for the different keys on which you are matching.
3. Specify the appropriate options, such as checking for abbreviations, initials, transpositions, and so on.
Refer to the documentation in the configuration file for details about each of the options. For those keys on which you do not specify any options, defaults will be used, as specified in the documentation in the file.

Edit the Match General Configuration file

To edit the Match General configuration file:
1. Open the Match General Configuration file (gfirm_gen.cfg) in the \pw\gmatch\samples directory.
2. Set the options appropriate for your job.

Edit the Match Rule Definition configuration file

To edit the Match Rule Definition configuration file:
1. Open the Match Rule Definition file (gfirm_rul.cfg) in the \pw\gmatch\samples directory.
2. Set the rules that the records should pass in order to be considered a match.
3. Verify the results of the trial runs with settings in this file, and adjust them as necessary.

Rule settings

The following provides brief descriptions of the rule settings. If you need additional information, see the documentation in the configuration file.

Rule evaluation

Rules are evaluated in the order of the rule number. For example, Rule Number 1 will be evaluated first, based on its scores and settings.

Max No-Dupe Score

The Max No-Dupe Score setting establishes a cutoff for the match engine to conclude that two records do not match, based solely on the dissimilarity of these two fields. If the field match score is equal to or less than this value, then the match engine will conclude that the two records do not match. Thus, a decision of No-dupe is returned. No further rules are evaluated.
If the score is higher than this, then other criteria are checked. If this value is set to -1 (or if the line is commented out), then the match engine will not use the field match score from this rule to conclude that the two records do not match.

Min Dupe Score

The Min Dupe Score setting establishes a cutoff for the match engine to conclude that two records, in fact, match, based solely on the similarity of these two fields. If the field match score is equal to or more than this value, then the match engine will conclude that the two records match. Thus, a decision is made. No further rules are evaluated.
If this value is set to 101 (or if the line is commented out), then the match engine will not use the field match score from this rule to conclude that the two records match. If the decision cannot be made, other criteria are checked.

If no decision is made

If neither score can make a decision about dupe or no-dupe, subsequent rules are evaluated until one of the rules is able to make the decision.

Blank match

Blank match is the setting that controls the behavior when one of the fields is blank. If it is set to Yes, then the fields will be considered a match. The Rule Configurations settings provide many other score settings that you can adjust to control the behavior in case of blank fields. Refer to the documentation in the configuration file for details.
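The interaction between the two cutoffs can be sketched in a few lines of Python. This is a simplified model rather than the match engine itself: it assumes field match scores between 0 and 100, one score per rule, and it ignores the blank match setting and the other score settings. The Rule class and the evaluate_rules function are hypothetical names.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    number: int
    max_no_dupe_score: int = -1   # -1 disables the no-dupe cutoff
    min_dupe_score: int = 101     # 101 disables the dupe cutoff


def evaluate_rules(rules, field_scores):
    """Return (decision, rule_number) for one pair of records.

    field_scores maps a rule number to the field match score for that rule.
    The decision is 'dupe', 'no-dupe', or None when no rule decides.
    """
    for rule in sorted(rules, key=lambda r: r.number):
        score = field_scores[rule.number]
        if score <= rule.max_no_dupe_score:
            return "no-dupe", rule.number
        if score >= rule.min_dupe_score:
            return "dupe", rule.number
        # Neither cutoff was reached, so the next rule is evaluated.
    return None, 0
```

With rules = [Rule(1, 20, 101), Rule(2, -1, 85)], a pair scoring 10 on rule 1 is rejected by rule 1 without looking at rule 2, while a pair scoring 50 on rule 1 and 90 on rule 2 is accepted by rule 2.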

Specify the matcher type

Global Match works on individual records to apply specific matching intelligence. However, you need to specify the matcher fields, which determine which matcher type to apply.
For example: Matcher_Field: Country Code
Map the matcher identifier input field values to the matcher type. The identifier text in the matcher field determines which matcher is to be applied. The format for mapping the match identifier is as follows:
Matcher: "identifier-text", matcher_type
For example: Matcher: "JPN", jp
Specify one matcher field value per matcher entry. You can specify up to 32 matcher entries.
Use one of the following matcher types: Chinese, Japanese, Korean, and Taiwanese. The following table lists the values for each of the matcher types.
Matcher type    Value
Japanese        jp
Korean          ko
Chinese         zh
Taiwanese       tw
If you know your data can be processed with a single matcher type, you can specify the Default_Matcher parameter. If you do not specify any default matcher, all the remaining records that cannot be associated with a particular matcher type specified above will be processed through the generic Unicode matcher.
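The selection logic described above can be summarized as a lookup with two fallbacks. The following Python sketch is illustrative only; the function names and the "generic" label for the generic Unicode matcher are assumptions, and records are modeled as dictionaries keyed by input field name.

```python
MATCHER_TYPES = {"jp", "ko", "zh", "tw"}
MAX_MATCHER_ENTRIES = 32
GENERIC_MATCHER = "generic"  # hypothetical label for the generic Unicode matcher


def build_matcher_map(entries):
    """Validate (identifier_text, matcher_type) pairs and return a lookup dict."""
    if len(entries) > MAX_MATCHER_ENTRIES:
        raise ValueError("Exceeded the maximum number of Matcher values.")
    matcher_map = {}
    for identifier_text, matcher_type in entries:
        if matcher_type not in MATCHER_TYPES:
            raise ValueError(f"Invalid matcher name ({matcher_type}).")
        matcher_map[identifier_text] = matcher_type
    return matcher_map


def choose_matcher(record, matcher_field, matcher_map, default_matcher=None):
    """Pick the matcher type for one record."""
    identifier_text = record.get(matcher_field, "")
    if identifier_text in matcher_map:
        return matcher_map[identifier_text]
    # No entry for this value: use the default matcher if one is set,
    # otherwise fall back to the generic Unicode matcher.
    return default_matcher or GENERIC_MATCHER
```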

Run Global Match

After you set up the input file, the DMT, and the configuration files, you are ready to run the job. Run Global Match in the installation directory as follows: flgmtc mygmtc.cfg
Global Match verifies all of the configuration settings, issues any errors or warnings, and displays various informational and progress messages.

Analyze the output

Global Match creates two types of output files: an output file and a report file.

Output file

The output file has the same format as the input file, with the following fields appended to each record:
Field            Description
Record Number    This is the number of the record.
Dupe Group ID    All the records in one dupe group will have the same numeric ID. All the master records will have the rule number and score values as -1. All individual records will have a dupe group ID of 0.
Rule Number      This is the rule number that made the deciding comparison. If a rule did not cause a decision to be made, then the rule number is zero.
Score            This is the score of the rule that made the deciding comparison. If a rule did not cause a decision to be made, then this is the weighted score of the decision.
Global Match does not create output files containing duplicate records, master records, and so on. Instead, it attaches information to the records, which can be used to derive those specific sets.
For example, if you want to locate all of the unique and master records, then segregate all the records with a rule number or rule score value of “-1.” If you want to locate only unique records, filter the records with a dupe group ID of “0.” Use the rule number and score information to further fine-tune the match rule configuration.
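The filtering described above can be sketched as follows. This Python example assumes that each output record has already been split into a list of field strings and that the four appended fields are the last four, in the order Record Number, Dupe Group ID, Rule Number, Score; the function names are hypothetical.

```python
def classify(fields):
    """Classify one output record as 'unique', 'master', or 'duplicate'."""
    record_number, dupe_group_id, rule_number, score = fields[-4:]
    if int(dupe_group_id) == 0:
        return "unique"       # individual record, not part of any dupe group
    if int(rule_number) == -1:
        return "master"       # master record of its dupe group
    return "duplicate"        # matched to a master by the given rule


def group_duplicates(records):
    """Collect the records of each dupe group under its Dupe Group ID."""
    groups = {}
    for fields in records:
        dupe_group_id = int(fields[-3])
        if dupe_group_id != 0:
            groups.setdefault(dupe_group_id, []).append(fields)
    return groups
```

Counting how often each rule number appears among the records classified as duplicate is one way to see which rules make most of the decisions when you fine-tune the match rule configuration.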
Report file

Global Match also creates a report file showing an execution summary with information about the number of dupe break groups, dupe groups, duplicates found per matcher type, and so on.
To create a report file, specify the report file name. For example: Report_File: Data_report.txt. The following shows an example summary report.
Global Match Summary Report
Input file: Data.txt
Output file: Data_output.txt
Start time: Fri Feb 25 10:02:07 2005
End time: Fri Feb 25 10:02:23 2005
Number of records processed: 5849
Number of break groups: 348
Number of total duplicates: 33
Number of dupe groups: 30
Generic Unicode Matcher
Number of dupes: 0
Number of dupe groups: 0
Japanese Matcher
Number of dupes: 0
Number of dupe groups: 0
Korean Matcher
Number of dupes: 7
Number of dupe groups: 4
Chinese Matcher
Number of dupes: 26
Number of dupe groups: 26
Taiwanese Matcher
Number of dupes: 0
Number of dupe groups: 0

Error messages

The following table summarizes the Global Match general and configuration file error messages, and warning and informational messages.
General error messages

Message: Usage: flgmtc ConfigFile
         Usage: flgmtc -rev    Display software version and revision level
Description: Occurs when the incorrect option is supplied on the command line.
Solution: Use a correct command line option.

Message: Failed to open configuration file: name. No such file.
Description: The configuration file is not found or could not be opened.
Solution: Use correct file name and permissions.

Message: Keyfile pwmpg.key not found or corrupted
Description: Occurs when the Keyfile is not present, is the wrong version, or is corrupted.
Solution: Use the correct Keyfile, and ensure that it is in the current directory.

Message: Record n is too long. Cannot continue processing.
Description: Global Match could not read the record because it was too long.
Solution: Specify a correct value for a record delimiter in the DMT file.

Message: Failed to read any records from the input file. Ensure that the input file is not empty and the file encoding specified in the configuration file is correct.
Description: This error generally occurs if the input file is empty or the actual encoding of the file does not match the encoding specified in the configuration file.

Message: Error opening Some_file file. Some_file could be input, output or report files.
Description: The file was not found.
Solution: Verify the path of the file in the Global Match Configuration file.

Message: System Error. Not enough memory to initialize Global Match.
         System error: Not enough memory.
Description: Not enough memory.
Solution: Close some applications and try running Global Match again.

Message: System Error. Failed to create temporary files.
Description: The disk is full or write-protected.
Solution: Verify that you have write permissions on the Work_Path specified in the Global Match configuration file.

Message: Internal match library error. Some_Error_message
Description: This is an unexpected error and should not occur in normal Global Match functioning.
Solution: Contact Business Objects Customer Support.
Configuration file error messages

Message: Error in Global Match configuration file. Occurred at line number = n Some_Error_Message
Note: For these kinds of main Global Match configuration file errors, the text in “Description” will appear as “Some_Error_Message.”

Description: No format file specified.
Solution: Specify a format file in the parameter Dmt_File in the configuration file.

Description: Unable to open the format file.
Solution: The specified format file does not exist, or does not have enough read permissions.

Description: Error in the format file.
Solution: The DMT file violates one or more exceptions specified in earlier sections about DMT File specifications.

Description: No match keys specified in the configuration.
Solution: The configuration file must have at least one Match_Key specified.

Description: Invalid File_Encoding.
Solution: The file encoding must be either UTF8 or UTF16. UTF16 is the default.

Description: Invalid Break Key Specification.
             Invalid Match Key Specification.
Solution: There is an error in the specification. One or more fields may have been omitted or may be incorrect.

Description: Multiple values specified for parameter.
Solution: Some of the parameters in the configuration file can be specified only once. Ensure that you have not repeated one of these.

Description: Input file name too long.
             DMT file name too long.
             Output file name too long.
             Report file name too long.
             Resource path too long.
             Work path too long.
             Match configuration file name too long.
Solution: The length of the file name(s) exceeds the maximum allowed by the platform.

Description: No Matcher_Field specified.
Solution: No actual input field is specified as a value for the Matcher_Field parameter.

Description: Invalid Default_Matcher name (xx).
Solution: The Default_Matcher value must be jp, ko, zh, or tw.

Description: Exceeded the maximum number of break keys.
             Exceeded the maximum number of Match keys.
Solution: You can only specify 42 break or match keys.

Description: Missing input field name in Break Key Specification.
             Missing input field name in Match Key Specification.
Solution: The specified break or match key does not have an input field name.

Description: Invalid input file field (xxx) in Break Key Specification.
             Invalid input file field (xxx) in Match Key Specification.
Solution: Every break or match key must be associated with an input field as specified in the DMT file. An input field name is specified, but it is not present in the DMT file.

Description: Missing offset in Break Key Specification.
Solution: Specify a positive value as offset.

Description: Missing length in Break Key Specification.
Solution: Specify a positive value as length.

Description: Missing field_type name in Break Key Specification.
             Missing field_type name in Match Key Specification.
Solution: Specify a field_type for the break key or match key. Valid field types are name, firm, addr, and pc.

Description: Invalid field_type (xx) in Break Key Specification.
             Invalid field_type (xx) in Match Key Specification.
Solution: A field_type specified for the break or match key is invalid. Valid field types are name, firm, addr, pc.

Description: Missing match key field name in Match Key Specification.
Solution: Every Match key must have an associated match key field.

Description: Invalid match key field name (xxx) in Match Key Specification.
Solution: Refer to the Match Library Programmer’s Reference for valid values of match key field names.

Description: Matcher values cannot be specified without matcher field.
Solution: You specified matcher values, but did not specify a matcher field.

Description: Exceeded the maximum number of Matcher values.
Solution: You can specify only 32 matcher values.

Description: Missing string value in Matcher Specification.
Solution: Matcher specification must contain a double-quoted string value.

Description: Matcher string must be double quoted in Matcher Specification.
Solution: Matcher specification must contain a double-quoted string value.

Description: Missing matcher name in Matcher Specification.
Solution: Matcher value must have matcher name specified. Valid values are jp, ko, zh, or tw.

Description: Invalid matcher name (xxx) in Matcher Specification.
Solution: Valid values are jp, ko, zh or tw.

Message: Error in Match Configuration files. Error: Some_Error_Message Details: Some_Error_Detail Match Library Error Number: n
Note: For these kinds of Match configuration file errors, Some_Error_Message and Some_Error_Detail will contain all the details of the actual error.
Description: Refer to the error message and detail. A typical error message will contain name and line number of the file in which the error occurred as well as the exact nature of the error.
Solution: In most cases, the error detail will provide information about the problem. In some cases, absence of one or more parameters from the required set of parameters will cause the error. For example, in the match key configuration file, if you do not specify all five parameters for each key, you will receive an error.

Message: Error in configuration file. Match configuration must specify Rule matching.
Description: Global Match does not support Auto matching. It only supports Rule matching.
Solution: Verify that the value of MTC_Matching_Type is Rule in the Overall Match Configuration file.

Message: Error instantiating matcher instances. Please check resource path in the configuration file.
Description: The resource path is incorrect.
Solution: Ensure that the Resource_Path specified in the Global Match configuration file contains the root.res file.

Warning and informational messages

Message: Warning: Unable to retrieve matcher field. Record number = n
Description: This is just a warning that the records specified did not contain a matcher identifier field. This may occur because of incorrect field specifications in the DMT file. Or, the data may be blank for the records specified.
Solution: Because this is a warning and not an error, these records will pass through the default_matcher, if specified, or the generic Unicode matcher.

Message: Creating work file....
         Comparing m record(s) in break group n
         Sorting results....
         Creating Report....
         Processing completed successfully
Description: Progress information messages.
Solution: Processing time may be lengthy depending on the number of records and match rules setup, so the command prompt may appear stationary. This does not mean that the system is hanging.