Chapter 4 CGUL Best Practices and Examples
4.1 Best Practices for a Rule Development
4.2 Syntax Errors to Look For When Compiling Rules
4.3 Examples For Writing Extraction Rules
4.3.1 Example: Writing a simple CGUL rule: Hello World
4.3.2 Example: Extracting Names Starting with Z
4.3.3 Example: Extracting Names of Persons and Awards they Won
Chapter 5 Testing Dictionaries and Extraction Rules
Index
2010-12-025
Introduction
1.1 Welcome to SAP BusinessObjects Data Services
1.1.1 Welcome
SAP BusinessObjects Data Services delivers a single enterprise-class solution for data integration,
data quality, data profiling, and text data processing that allows you to integrate, transform, improve,
and deliver trusted data to critical business processes. It provides one development UI, metadata
repository, data connectivity layer, run-time environment, and management console—enabling IT
organizations to lower total cost of ownership and accelerate time to value. With SAP BusinessObjects
Data Services, IT organizations can maximize operational efficiency with a single solution to improve
data quality and gain access to heterogeneous sources and applications.
1.1.2 Documentation set for SAP BusinessObjects Data Services
You should become familiar with all the pieces of documentation that relate to your SAP BusinessObjects
Data Services product.
Document | What this document provides
Administrator's Guide | Information about administrative tasks such as monitoring, lifecycle management, security, and so on.
Customer Issues Fixed | Information about customer issues fixed in this release.
Designer Guide | Information about how to use SAP BusinessObjects Data Services Designer.
Documentation Map | Information about available SAP BusinessObjects Data Services books, languages, and locations.
Installation Guide for Windows | Information about and procedures for installing SAP BusinessObjects Data Services in a Windows environment.
Installation Guide for UNIX | Information about and procedures for installing SAP BusinessObjects Data Services in a UNIX environment.
Integrator's Guide | Information for third-party developers to access SAP BusinessObjects Data Services functionality using web services and APIs.
Management Console Guide | Information about how to use SAP BusinessObjects Data Services Administrator and SAP BusinessObjects Data Services Metadata Reports.
Performance Optimization Guide | Information about how to improve the performance of SAP BusinessObjects Data Services.
Reference Guide | Detailed reference material for SAP BusinessObjects Data Services Designer.
Release Notes | Important information you need before installing and deploying this version of SAP BusinessObjects Data Services.
Technical Manuals | A compiled "master" PDF of core SAP BusinessObjects Data Services books containing a searchable master table of contents and index: Administrator's Guide, Designer Guide, Reference Guide, Management Console Guide, Performance Optimization Guide, Supplement for J.D. Edwards, Supplement for Oracle Applications, Supplement for PeopleSoft, Supplement for Salesforce.com, Supplement for Siebel, Supplement for SAP.
Text Data Processing Extraction Customization Guide | Information about building dictionaries and extraction rules to create your own extraction patterns to use with Text Data Processing transforms.
Text Data Processing Language Reference Guide | Information about the linguistic analysis and extraction processing features that the Text Data Processing component provides, as well as a reference section for each language supported.
Tutorial | A step-by-step introduction to using SAP BusinessObjects Data Services.
Upgrade Guide | Release-specific product behavior changes from earlier versions of SAP BusinessObjects Data Services to the latest release. This manual also contains information about how to migrate from SAP BusinessObjects Data Quality Management to SAP BusinessObjects Data Services.
What's New | Highlights of new key features in this SAP BusinessObjects Data Services release. This document is not updated for support package or patch releases.
In addition, you may need to refer to several Adapter Guides and Supplemental Guides.
Document | What this document provides
Supplement for J.D. Edwards | Information about interfaces between SAP BusinessObjects Data Services and J.D. Edwards World and J.D. Edwards OneWorld.
Supplement for Oracle Applications | Information about the interface between SAP BusinessObjects Data Services and Oracle Applications.
Supplement for PeopleSoft | Information about interfaces between SAP BusinessObjects Data Services and PeopleSoft.
Supplement for Salesforce.com | Information about how to install, configure, and use the SAP BusinessObjects Data Services Salesforce.com Adapter Interface.
Supplement for SAP | Information about interfaces between SAP BusinessObjects Data Services, SAP Applications, and SAP NetWeaver BW.
Supplement for Siebel | Information about the interface between SAP BusinessObjects Data Services and Siebel.
We also include these manuals for information about SAP BusinessObjects Information platform services.
Document | What this document provides
Information platform services Administrator's Guide | Information for administrators who are responsible for configuring, managing, and maintaining an Information platform services installation.
Information platform services Installation Guide for UNIX | Installation procedures for SAP BusinessObjects Information platform services on a UNIX environment.
Information platform services Installation Guide for Windows | Installation procedures for SAP BusinessObjects Information platform services on a Windows environment.
1.1.3 Accessing documentation
You can access the complete documentation set for SAP BusinessObjects Data Services in several
places.
1.1.3.1 Accessing documentation on Windows
After you install SAP BusinessObjects Data Services, you can access the documentation from the Start
menu.
1. Choose Start > Programs > SAP BusinessObjects Data Services XI 4.0 > Data Services Documentation.
Note:
Only a subset of the documentation is available from the Start menu. The documentation set for this
release is available in <LINK_DIR>\Doc\Books\en.
2. Click the appropriate shortcut for the document that you want to view.
1.1.3.2 Accessing documentation on UNIX
After you install SAP BusinessObjects Data Services, you can access the online documentation by
going to the directory where the printable PDF files were installed.
1. Go to <LINK_DIR>/doc/book/en/.
2. Using Adobe Reader, open the PDF file of the document that you want to view.
1.1.3.3 Accessing documentation from the Web
You can access the complete documentation set for SAP BusinessObjects Data Services from the SAP
BusinessObjects Business Users Support site.
1. Go to http://help.sap.com.
2. Click SAP BusinessObjects at the top of the page.
3. Click All Products in the navigation pane on the left.
You can view the PDFs online or save them to your computer.
1.1.4 SAP BusinessObjects information resources
A global network of SAP BusinessObjects technology experts provides customer support, education,
and consulting to ensure maximum information management benefit to your business.
Useful addresses at a glance:
Address | Content
Customer Support, Consulting, and Education services, http://service.sap.com/ | Information about SAP Business User Support programs, as well as links to technical articles, downloads, and online forums. Consulting services can provide you with information about how SAP BusinessObjects can help maximize your information management investment. Education services can provide information about training options and modules. From traditional classroom learning to targeted e-learning seminars, SAP BusinessObjects can offer a training package to suit your learning needs and preferred learning style.
SAP BusinessObjects Data Services Community, http://www.sdn.sap.com/irj/sdn/ds | Get online and timely information about SAP BusinessObjects Data Services, including tips and tricks, additional downloads, samples, and much more. All content is to and from the community, so feel free to join in and contact us if you have a submission.
Forums on SCN (SAP Community Network), http://forums.sdn.sap.com/forum.jspa?forumID=305 | Search the SAP BusinessObjects forums on the SAP Community Network to learn from other SAP BusinessObjects Data Services users and start posting questions or share your knowledge with the community.
Blueprints, http://www.sdn.sap.com/irj/boc/blueprints | Blueprints for you to download and modify to fit your needs. Each blueprint contains the necessary SAP BusinessObjects Data Services project, jobs, data flows, file formats, sample data, template tables, and custom functions to run the data flows in your environment with only a few modifications.
Product documentation, http://help.sap.com/businessobjects/ | SAP BusinessObjects product documentation.
Supported Platforms (Product Availability Matrix), https://service.sap.com/PAM | Get information about supported platforms for SAP BusinessObjects Data Services. Use the search function to search for Data Services. Click the link for the version of Data Services you are searching for.
1.2 Overview of This Guide
Welcome to the Extraction Customization Guide.
SAP BusinessObjects Data Services text data processing software enables you to perform extraction
processing and various types of natural language processing on unstructured text.
The two major features of the software are linguistic analysis and extraction. Linguistic analysis includes
natural-language processing (NLP) capabilities, such as segmentation, stemming, and tagging, among
other things. Extraction processing analyzes unstructured text, in multiple languages and from any text
data source, and automatically identifies and extracts key entity types, including people, dates, places,
organizations, or other information, from the text. It enables the detection and extraction of activities,
events and relationships between entities and gives users a competitive edge with relevant information
for their business needs.
1.2.1 Who Should Read This Guide
This guide is written for dictionary and extraction rule writers. Users of this guide should understand
extraction concepts and have familiarity with linguistic concepts and with regular expressions.
This documentation assumes the following:
•You understand your organization's text analysis extraction needs.
1.2.2 About This Guide
This guide contains the following information:
•Overview and conceptual information about dictionaries and extraction rules.
•How to create, compile, and use dictionaries and extraction rules.
•Examples of sample dictionaries and extraction rules.
•Best practices for writing extraction rules.
Using Dictionaries
A dictionary in the context of the extraction process is a user-defined repository of entities. It can store
customized information about the entities your application must find. You can use a dictionary to store
name variations in a structured way that is accessible through the extraction process. A dictionary
structure can also help standardize references to an entity.
Dictionaries are language-independent. This means that you can use the same dictionary to store all
your entities and that the same patterns are matched in documents of different languages.
You can use a dictionary for:
•name variation management
•disambiguation of unknown entities
•control over entity recognition
2.1 Entity Structure in Dictionaries
This section examines the entity structure in dictionaries. A dictionary contains a number of user-defined
entity types, each of which contains any number of entities. For each entity, the dictionary distinguishes
between a standard form name and variant names:
•Standard form name–The most complete or precise form for a given entity. For example, United
States of America might be the standard form name for that country. A standard form name can
have one or more variant names (also known as source form) embedded under it.
•Variant name–Less standard or complete than a standard form name, and it can include abbreviations,
different spellings, nicknames, and so on. For example, United States, USA and US could be
variant names for the same country. In addition, a dictionary lets you assign variant names to a type.
For example, you might define a variant type ABBREV for abbreviations.
The following figure shows a graphical representation of the dictionary hierarchy and structure of a
dictionary entry for United Parcel Service of America, Inc:
The real-world entity, indicated by the circle in the diagram, is associated with a standard form name
and an entity type ORGANIZATION and subtype COMMERCIAL. Under the standard form name are
name variations, one of which has its own type specified. The dictionary lookup lets you get the standard
form and the variant names given any of the related forms.
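The lookup behavior described above can be sketched in Python. This is only an illustration of the entity structure, not the extraction engine itself: the element and attribute names follow the dictionary syntax described in this guide, while the sample variants, the ABBREV assignment for UPS, and the helper name build_lookup are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# Illustrative dictionary source; the variant list is an assumption for this sketch.
SOURCE = """<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
  <entity_category name="ORGANIZATION@COMMERCIAL">
    <entity_name standard_form="United Parcel Service of America, Inc.">
      <variant name="United Parcel Service"/>
      <variant name="UPS" type="ABBREV"/>
    </entity_name>
  </entity_category>
</dictionary>
"""


def build_lookup(xml_bytes):
    """Map every name form to (category, standard form, variant type)."""
    lookup = {}
    root = ET.fromstring(xml_bytes)
    for category in root.findall("entity_category"):
        category_name = category.get("name")
        for entity in category.findall("entity_name"):
            standard = entity.get("standard_form")
            # The standard form resolves to itself and carries no variant type.
            lookup[standard] = (category_name, standard, None)
            for variant in entity.findall("variant"):
                lookup[variant.get("name")] = (
                    category_name,
                    standard,
                    variant.get("type"),  # None when the variant is untyped
                )
    return lookup


lookup = build_lookup(SOURCE.encode("utf-8"))
print(lookup["UPS"])
```

Any related form, whether the standard form or a variant, resolves to the same standard form and entity category, which is the property the dictionary structure is meant to provide.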
2.1.1 Generating Predictable Variants
The variants United Parcel Service and United Parcel Service of America, Inc. are predictable, and
the dictionary compiler can generate more predictable variants for later use in the extraction
process. Using its variant generation feature, the dictionary compiler can programmatically generate
certain predictable variants while compiling a dictionary.
Variant generation works from a list of designators for entities in the entity type ORGANIZATION in
English. For instance, Corp. designates an organization. Variant generation in languages other than
English covers the standard company designators, such as AG in German and SA in French. The
variant generation facility provides the following functionality:
•Creates or expands abbreviations for specified designators. For example, the abbreviation Inc. is
expanded to Incorporated, and Incorporated is abbreviated to Inc., and so on.
•Handles optional commas and periods.
•Makes optional such company designators as Inc, Corp. and Ltd, as long as the organization name
has more than one word in it.
For example, variants for Microsoft Corporation can include:
•Microsoft Corporation
•Microsoft Corp.
•Microsoft Corp
Single word variant names like Microsoft are not automatically generated as variant organization names,
since they are easily misidentified. One-word variants need to be entered into the dictionary individually.
Variants are not enumerated without the appropriate organization designators.
Note:
Variant generation is supported in English, French, German, and Spanish.
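As a rough illustration of the rules above, the following Python sketch generates designator variants for an organization name. It is not the dictionary compiler's algorithm: the designator table, the function name generate_variants, and the exact set of comma and period combinations are assumptions chosen to mirror the behavior described in this section.

```python
# Hypothetical designator table; the compiler's real list is not reproduced here.
FULL_TO_ABBREVIATION = {
    "Corporation": "Corp",
    "Incorporated": "Inc",
    "Limited": "Ltd",
}
ABBREVIATION_TO_FULL = {abbr: full for full, abbr in FULL_TO_ABBREVIATION.items()}


def generate_variants(standard_form):
    """Return predictable designator variants for an organization name."""
    words = standard_form.replace(",", "").split()
    designator = words[-1].rstrip(".") if words else ""
    full = ABBREVIATION_TO_FULL.get(designator, designator)
    if len(words) < 2 or full not in FULL_TO_ABBREVIATION:
        return [standard_form]  # no known designator, nothing to generate

    base = " ".join(words[:-1])
    abbreviation = FULL_TO_ABBREVIATION[full]
    variants = []
    # Expanded and abbreviated designators, with optional comma and period.
    for separator in (" ", ", "):
        for form in (full, abbreviation + ".", abbreviation):
            variants.append(base + separator + form)
    # The designator is optional only when the remaining name has more than one word.
    if len(words) > 2:
        variants.append(base)
    return variants


print(generate_variants("Microsoft Corporation"))
```

In this sketch, Microsoft on its own is never produced because the name without its designator is a single word, while a multi-word name such as United Parcel Service Inc. also yields the designator-free form.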
Related Topics
• Adding Standard Variant Types
2.1.2 Custom Variant Types
You can also define custom variant types in a dictionary. Custom variant types can contain a list of
variant name pre-modifiers and post-modifiers for a standard form name type. For variant names
of a standard form name to be generated, the standard form name must match at least one of the
patterns defined for that custom variant type.
A variant generation definition can have one or more patterns. For each pattern that matches, the
defined generators are invoked. Patterns can contain the wildcards * and ?, which match zero or more
tokens and exactly one token, respectively. Patterns can also contain one or more capture groups. These are
sub-patterns enclosed in parentheses. After matching, the contents of a capture group are
copied into the generator output wherever the generator references the group by its placeholder (if any). Capture
groups are numbered left to right, starting at 1. A capture group placeholder consists of a backslash
followed by the capture group number.
The pattern always matches the entire string of the standard form name and never only part of that
string. For example,
•The pattern matches forces preceded by one token only. Thus, it matches Afghan forces, but not
U.S. forces, as the latter contains more than one token. To capture variant names with more than
one token, use the pattern (*) forces.
•The single capture group is referenced in all generators by its index: \1. The generated variant
names are Afghan troops, Afghan soldiers, Afghan Army, Afghan military, and Afghan forces. In
principle you do not need the last generator, as the standard form name already matches those
tokens.
The following example shows how to specify the variant generation within the dictionary source:
Standard variants include the base text in the generated variant names, while custom variants do not.
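The pattern and generator behavior described in this section can be sketched in Python as follows. This is a simplified model, not the compiler's implementation: the pattern string "(?) forces" is inferred from the description of the single-token case, the tokenizer (words and punctuation marks as separate tokens) is an assumption, and the function names are hypothetical.

```python
import re


def tokenize(text):
    # Assumption: words and punctuation marks are separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)


def compile_pattern(pattern):
    """Translate a token pattern such as '(?) forces' into a regex
    that runs over space-joined tokens (each token followed by a space)."""
    parts = []
    for element in pattern.split():
        if element == "?":
            parts.append(r"\S+ ")            # exactly one token
        elif element == "*":
            parts.append(r"(?:\S+ )*")       # zero or more tokens
        elif element == "(?)":
            parts.append(r"(\S+) ")          # captured single token
        elif element == "(*)":
            parts.append(r"((?:\S+ )*)")     # captured run of tokens
        else:
            parts.append(re.escape(element) + " ")
    return re.compile("".join(parts))


def generate(standard_form, pattern, generators):
    """Apply generators when the pattern matches the entire standard form."""
    joined = " ".join(tokenize(standard_form)) + " "
    match = compile_pattern(pattern).fullmatch(joined)
    if match is None:
        return []
    results = []
    for generator in generators:
        text = generator
        for index, captured in enumerate(match.groups(), start=1):
            text = text.replace("\\" + str(index), captured.strip())
        results.append(text)
    return results


GENERATORS = [r"\1 troops", r"\1 soldiers", r"\1 Army", r"\1 military", r"\1 forces"]
print(generate("Afghan forces", "(?) forces", GENERATORS))
print(generate("U.S. forces", "(?) forces", GENERATORS))
```

Under these assumptions the first call yields the five Afghan variants listed above, and the second call yields nothing because U.S. is split into more than one token, so the single-token capture group cannot match it.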
Related Topics
• Adding Custom Variant Types
2.1.3 Entity Subtypes
Dictionaries support the use of entity subtypes to enable the distinction between different varieties of
the same entity type. For example, you can distinguish leafy vegetables from starchy vegetables.
To define an entity subtype in a dictionary entry, add an @-delimited extension to the category identifier,
as in VEG@STARCHY. Subtyping is only one level deep, so TYPE@SUBTYPE@SUBTYPE is not valid.
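A minimal Python sketch of the one-level subtype rule is shown below. The function name split_category and the choice to raise ValueError are assumptions for illustration; the dictionary compiler may report invalid identifiers differently.

```python
def split_category(identifier):
    """Split 'TYPE' or 'TYPE@SUBTYPE' into (type, subtype).

    Raises ValueError for empty parts or more than one subtype level.
    """
    parts = identifier.split("@")
    if len(parts) > 2 or any(part == "" for part in parts):
        raise ValueError(f"invalid category identifier: {identifier!r}")
    entity_type = parts[0]
    subtype = parts[1] if len(parts) == 2 else None
    return entity_type, subtype


print(split_category("VEG@STARCHY"))
print(split_category("VEG"))
```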
Related Topics
• Adding an Entity Subtype
2.1.4 Variant Types
Variant names can optionally be associated with a type, meaning that you specify the type of variant
name. For example, one specific type of variant name is an abbreviation, ABBREV. Other examples of
variant types that you could create are ACRONYM, NICKNAME, or PRODUCT-ID.
2.1.5 Wildcards in Entity Names
Dictionary entries support entity names specified with wildcard pattern-matching elements. These are
the Kleene star ("*") and question mark ("?") characters, used to match against a portion of the input
string. For example, either "* University" or "? University" might be used as the name of an
entity belonging to a custom type UNIVERSITY.
These wildcard elements must be restricted to match against only part of the input buffer. Consider a
pattern "Company *" which matches at the beginning of a 500 KB document. If unlimited matching
were allowed, the * wildcard would match against the document's remaining 499+ KB.
Note:
Using wildcards in a dictionary may affect the speed of entity extraction. Performance decreases
proportionally with the number of wildcards in a dictionary. Use this functionality keeping potential
performance degradations in mind.
2.1.5.1 Wildcard Definitions
The * and ? wildcards are described as follows, given a sentence:
•* matches any number of tokens greater than or equal to zero within a sentence.
•? matches only one token within a sentence.
A token is an independent piece of a linguistic expression, such as a word or a punctuation mark. The wildcards
match whole tokens only, not sub-parts of tokens. For both wildcards, any token is eligible to be
a matching element, provided the literal (fixed) portion of the pattern is satisfied.
2.1.5.2 Wildcard Usage
Wildcard characters are used to specify a pattern, normally containing both literal and variable elements,
as the name of an entity. For instance, consider this input:
I once attended Stanford University, though I considered Carnegie Mellon University.
Consider an entity belonging to the category UNIVERSITY with the variant name "* University".
The pattern will match any sentence ending with "University".
If the pattern were "? University", it would match exactly one token preceding "University",
whether that pair forms a whole sentence or part of one. The entire string "Stanford University" would then match as
intended. However, for "Carnegie Mellon University", only the substring "Mellon University" would
match: "Carnegie" would be disregarded, since the question mark matches exactly one token, and this
is probably not the intended result.
If several patterns compete, the extraction process returns the match with the widest scope. Thus if a
competing pattern "* University" were available in the previous example, "Carnegie Mellon University"
would be returned, and "Mellon University" would be ignored.
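The single-token wildcard and the widest-scope rule can be illustrated with a short Python sketch. It handles only literal tokens and ?; the * wildcard is left out because the bound on its scope is not specified in this section. The tokenizer, the function names, and the competing pattern "Carnegie ? University" are assumptions for illustration only.

```python
import re


def tokenize(text):
    # Assumption: words and punctuation marks are separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)


def find_matches(tokens, pattern):
    """Return (start, end) token spans where each pattern element matches one token."""
    elements = pattern.split()
    spans = []
    for start in range(len(tokens) - len(elements) + 1):
        window = tokens[start:start + len(elements)]
        if all(e == "?" or e == t for e, t in zip(elements, window)):
            spans.append((start, start + len(elements)))
    return spans


def widest_matches(tokens, patterns):
    """Collect matches from all patterns and drop spans contained in a wider one."""
    spans = {span for p in patterns for span in find_matches(tokens, p)}
    kept = [
        s for s in spans
        if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in spans)
    ]
    return [" ".join(tokens[a:b]) for a, b in sorted(kept)]


SENTENCE = ("I once attended Stanford University, "
            "though I considered Carnegie Mellon University.")
tokens = tokenize(SENTENCE)
print(widest_matches(tokens, ["? University"]))
print(widest_matches(tokens, ["? University", "Carnegie ? University"]))
```

With only "? University", the sketch returns Stanford University and Mellon University. Once the wider competing pattern is added, Mellon University is dropped in favor of Carnegie Mellon University.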
Since * and ? are special characters, "escape" characters are required to treat the wildcards as literal
elements of fixed patterns. The backslash "\" is the escape character. Thus "\*" represents the literal
asterisk as opposed to the Kleene star. A backslash can itself be made a literal by writing "\\".
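The escape rules can be sketched as a small Python parser that separates wildcards from literal text. The function name parse_pattern and the (kind, text) output format are assumptions for illustration; the sketch shows only how a backslash changes the meaning of the character that follows it.

```python
def parse_pattern(pattern):
    """Split a pattern into ('wildcard', char) and ('literal', text) pieces,
    treating a backslash as an escape for the next character."""
    pieces = []
    literal = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            literal.append(pattern[i + 1])  # escaped character is literal
            i += 2
            continue
        if ch in "*?":
            if literal:
                pieces.append(("literal", "".join(literal)))
                literal = []
            pieces.append(("wildcard", ch))
        else:
            literal.append(ch)
        i += 1
    if literal:
        pieces.append(("literal", "".join(literal)))
    return pieces


print(parse_pattern("* University"))
print(parse_pattern(r"5 \* 3"))
print(parse_pattern(r"a\\b"))
```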
Note:
Use wildcards when defining variant names of an entity instead of using them for defining a standard
form name of an entity.
Related Topics
• Adding Wildcard Variants
2.2 Creating a Dictionary
To create a dictionary, follow these steps:
1. Create an XML file containing your content, formatted according to the dictionary syntax.
2. Run the dictionary compiler on that file.
Note:
For large dictionary source files, make sure the memory available to the compiler is at least five
times the size of the input file, in bytes.
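The memory guideline in the note can be checked before compiling with a few lines of Python. This is a sketch only: the function name minimum_compiler_memory is hypothetical, and the temporary file stands in for a real dictionary source file.

```python
import os
import tempfile


def minimum_compiler_memory(path, factor=5):
    """Return the suggested minimum memory in bytes: factor times the file size."""
    return os.path.getsize(path) * factor


# Demonstration with placeholder bytes standing in for a dictionary source file.
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as handle:
    handle.write(b"x" * 13_000)
    sample_path = handle.name

required = minimum_compiler_memory(sample_path)
print(f"{required} bytes ({required / 1024:.1f} KiB)")
os.remove(sample_path)
```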
Related Topics
• Dictionary XSD
• Compiling a Dictionary
2.3 Dictionary Syntax
2.3.1 Dictionary XSD
The syntax of a dictionary conforms to the following XML Schema Definition (XSD). When creating your
custom dictionary, format your content using the following syntax, making sure to specify the encoding
if the file is not UTF-8.
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright 2010 SAP AG. All rights reserved.
SAP, R/3, SAP NetWeaver, Duet, PartnerEdge, ByDesign, SAP Business ByDesign,
and other SAP products and services mentioned herein as well as their
respective logos are trademarks or registered trademarks of SAP AG in
Germany and other countries.
Business Objects and the Business Objects logo, BusinessObjects, Crystal
Reports, Crystal Decisions, Web Intelligence, Xcelsius, and other Business
Objects products and services mentioned herein as well as their respective
logos are trademarks or registered trademarks of Business Objects S.A. in the
United States and in other countries.
Business Objects is an SAP company.
All other product and service names mentioned are the trademarks of their
respective companies. Data contained in this document serves informational
purposes only. National product specifications
may vary.
These materials are subject to change without notice. These materials are
provided by SAP AG and its affiliated companies ("SAP Group") for informational
purposes only, without representation or warranty of any kind, and SAP Group
shall not be liable for errors or omissions with respect to the materials.
The only warranties for SAP Group products and services are those that are set
forth in the express warranty statements accompanying such products and
services, if any. Nothing herein should be construed as constituting an
additional warranty.
-->
<xsd:element ref="dd:pattern"/>
<xsd:element ref="dd:generate"/>
The following table describes each element and attribute of the dictionary XSD.
Element | Attributes and Description
dictionary | This is the root tag, of which a dictionary may contain only one. Contains one or more embedded entity_category elements.
entity_category | The category (type) to which all embedded entities belong. Contains one or more embedded entity_name elements. Must be explicitly closed.
 | name: The name of the category, such as PEOPLE, COMPANY, PHONE NUMBER, and so on. Note that the entity category name is case sensitive.
entity_name | A named entity in the dictionary. Contains zero or more of the elements variant, query_only and variant_generation. Must be explicitly closed.
 | standard_form: The standard form of the entity_name. The standard form is generally the longest or most common form of a named entity. The standard_form name must be unique within the entity_category but not within the dictionary.
 | uid: A user-defined ID for the standard form name. This is an optional attribute.
variant | A variant name for the entity. The variant name must be unique within the entity_name. Need not be explicitly closed.
 | name: [Required] The name of the variant.
 | type: [Optional] The type of variant, generally a subtype of the larger entity_category.
query_only |
 | name
 | type
variant_generation | Specifies whether the dictionary should automatically generate predictable variants. By default, the standard form name is used as the starting point for variant generation. Need not be explicitly closed.
 | language: [Optional] Specifies the language to use for standard variant generation, in lower case, for example, "english". If this option is not specified in the dictionary, the language specified with the compiler command is used, or it defaults to English when there is no language specified in either the dictionary or the compiler command.
 | type: [Required] Types supported are standard or the name of a custom variant generation defined earlier in the dictionary.
 | base_text: [Optional] Specifies text other than the standard form name to use as the starting point for the computation of variants.
define-variant_generation | Specifies custom variant generation.
pattern | Specifies the pattern that must be matched to generate custom variants.
generate | Specifies the exact pattern for custom variant generation within each generate tag.
Related Topics
• Adding Custom Variant Types
• Formatting Your Source
2.3.2 Guidelines for Naming Entities
This section describes several guidelines for the format of standard form and variant names in a
dictionary:
•You can use any part-of-speech (word class).
•Use only characters that are valid for the specified encoding.
•The symbols used for wildcard pattern matching, "?" and "*", must be escaped using a backslash
character ("\").
•Any other special characters, such as quotation marks, ampersands, and apostrophes, can be
escaped according to the XML specification.
The following table shows some such character entities (also used in HTML), along with the correct
syntax:
Character | Description | Dictionary entry
< | Less than (<) sign | &lt;
> | Greater than (>) sign | &gt;
& | Ampersand (&) sign | &amp;
" | Quotation marks (") | &quot;
' | Apostrophe (') | &apos;
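When dictionary entries are produced by a script, the XML escaping above can be applied automatically. The following Python sketch uses the standard library function xml.sax.saxutils.escape; the helper name escape_name and the sample names are assumptions for illustration. Wildcard characters are a separate matter and still need the backslash escape described in the guidelines.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; quotes are added explicitly.
EXTRA_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def escape_name(name):
    """Escape a name so it can be placed inside an XML attribute value."""
    return escape(name, EXTRA_ENTITIES)


for sample in ["AT&T", "Macy's", '"A<B>"']:
    print(sample, "->", escape_name(sample))

# Round trip: the escaped attribute parses back to the original name.
element = ET.fromstring('<variant name="%s"/>' % escape_name("AT&T"))
print(element.get("name"))
```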
2.3.3 Character Encoding in a Dictionary
A dictionary supports all the character encodings supported by the Xerces-C XML parser. If you are
creating a dictionary to be used for more than one language, use an encoding that supports all required
languages, such as UTF-8. For information on encodings supported by the Xerces-C XML parser, see
the Xerces-C documentation.
The default input encoding assumed by a dictionary is UTF-8. Dictionary input files that are not in UTF-8
must specify their character encoding in an XML declaration to enable proper operation of the configuration
file parser, for example:
<?xml version="1.0" encoding="UTF-16" ?>
If no encoding specification exists, UTF-8 is assumed. For best results, always specify the encoding.
Note:
CP-1252 must be specified as windows-1252 in the XML header element. The encoding names
should follow the IANA-CHARSETS recommendation.
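As an illustration of declaring a non-UTF-8 encoding, the Python sketch below serializes a small dictionary with an explicit windows-1252 declaration using the standard library (the xml_declaration argument requires Python 3.8 or later). The entity name Nestlé S.A. is a hypothetical sample chosen because it contains a non-ASCII character.

```python
import xml.etree.ElementTree as ET

# Build a small dictionary tree; the entity shown is a hypothetical sample.
root = ET.Element("dictionary")
category = ET.SubElement(root, "entity_category", name="ORGANIZATION@COMMERCIAL")
ET.SubElement(category, "entity_name", standard_form="Nestlé S.A.")

# Serialize with the IANA name windows-1252 so the declaration matches the bytes.
data = ET.tostring(root, encoding="windows-1252", xml_declaration=True)
print(data.decode("windows-1252"))
```

The declared encoding and the actual byte encoding of the file must agree; writing the declaration through the serializer, rather than by hand, is one way to keep them consistent.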
Format your source file according to the dictionary XSD. The source file must contain sufficient context
to make the entry unambiguous. The required tags for a dictionary entry are:
•entity_category
•entity_name
Others can be mentioned according to the desired operation. If tags are already in the target dictionary,
they are augmented; if not, they are added. The add operation never removes tags, and the remove
operation never adds them.
Related Topics
• Dictionary XSD
2.3.6 Working with a Dictionary
This section provides details on how to update your dictionary files to add or remove entries as well as
update existing entries.
2.3.6.1 Adding an Entity
To add an entity to a dictionary:
•Specify the entity's standard form under the relevant entity category, and optionally, its variants.
The example below adds two new entities to the ORGANIZATION@COMMERCIAL category: