SAP BusinessObjects Data Services
Text Data Processing Extraction Customization Guide
SAP BusinessObjects Data Services XI 4.0 (14.0.0)
2010-12-02
Copyright
© 2010 SAP AG. All rights reserved. SAP, R/3, SAP NetWeaver, Duet, PartnerEdge, ByDesign, SAP Business ByDesign, and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP AG in Germany and other countries. Business Objects and the Business Objects logo, BusinessObjects, Crystal Reports, Crystal Decisions, Web Intelligence, Xcelsius, and other Business Objects products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of Business Objects S.A. in the United States and in other countries. Business Objects is an SAP company. All other product and service names mentioned are the trademarks of their respective companies. Data contained in this document serves informational purposes only. National product specifications may vary. These materials are subject to change without notice. These materials are provided by SAP AG and its affiliated companies ("SAP Group") for informational purposes only, without representation or warranty of any kind, and SAP Group shall not be liable for errors or omissions with respect to the materials. The only warranties for SAP Group products and services are those that are set forth in the express warranty statements accompanying such products and services, if any. Nothing herein should be construed as constituting an additional warranty.

Contents

Chapter 1 Introduction
1.1 Welcome to SAP BusinessObjects Data Services
1.1.1 Welcome
1.1.2 Documentation set for SAP BusinessObjects Data Services
1.1.3 Accessing documentation
1.1.4 SAP BusinessObjects information resources
1.2 Overview of This Guide
1.2.1 Who Should Read This Guide
1.2.2 About This Guide

Chapter 2 Using Dictionaries
2.1 Entity Structure in Dictionaries
2.1.1 Generating Predictable Variants
2.1.2 Custom Variant Types
2.1.3 Entity Subtypes
2.1.4 Variant Types
2.1.5 Wildcards in Entity Names
2.2 Creating a Dictionary
2.3 Dictionary Syntax
2.3.1 Dictionary XSD
2.3.2 Guidelines for Naming Entities
2.3.3 Character Encoding in a Dictionary
2.3.4 Dictionary Sample File
2.3.5 Formatting Your Source
2.3.6 Working with a Dictionary
2.4 Compiling a Dictionary
2.4.1 Command-line Syntax for Compiling a Dictionary
2.4.2 Adding Dictionary Entries
2.4.3 Removing Dictionary Entries
2.4.4 Removing Standard Form Names from a Dictionary

Chapter 3 Using Extraction Rules
3.1 About Customizing Extraction
3.2 Understanding Extraction Rule Patterns
3.2.1 CGUL Elements
3.2.2 CGUL Conventions
3.3 Including Files in a Rule File
3.3.1 Using Predefined Character Classes
3.4 Including a Dictionary in a Rule File
3.5 CGUL Directives
3.5.1 Writing Directives
3.5.2 Using the #define Directive
3.5.3 Using the #subgroup Directive
3.5.4 Using the #group Directive
3.5.5 Using Items in a Group or Subgroup
3.6 Tokens
3.6.1 Building Tokens
3.7 Expression Markers Supported in CGUL
3.7.1 Paragraph Marker [P]
3.7.2 Sentence Marker [SN]
3.7.3 Noun Phrase Marker [NP]
3.7.4 Verb Phrase Marker [VP]
3.7.5 Clause Marker [CL]
3.7.6 Clause Container [CC]
3.7.7 Context Marker [OD]
3.7.8 Entity Marker [TE]
3.7.9 Unordered List Marker [UL]
3.7.10 Unordered Contiguous List Marker [UC]
3.8 Writing Extraction Rules Using Context Markers
3.9 Regular Expression Operators Supported in CGUL
3.9.1 Standard Operators Valid in CGUL
3.9.2 Iteration Operators Supported in CGUL
3.9.3 Grouping and Containment Operators Supported in CGUL
3.9.4 Operator Precedence Used in CGUL
3.9.5 Special Characters
3.10 Match Filters Supported in CGUL
3.10.1 Longest Match Filter
3.10.2 Shortest Match Filter (?)
3.10.3 List Filter (*)
3.11 Compiling Extraction Rules

Chapter 4 CGUL Best Practices and Examples
4.1 Best Practices for a Rule Development
4.2 Syntax Errors to Look For When Compiling Rules
4.3 Examples For Writing Extraction Rules
4.3.1 Example: Writing a simple CGUL rule: Hello World
4.3.2 Example: Extracting Names Starting with Z
4.3.3 Example: Extracting Names of Persons and Awards they Won

Chapter 5 Testing Dictionaries and Extraction Rules

Index

Introduction

1.1 Welcome to SAP BusinessObjects Data Services
1.1.1 Welcome
SAP BusinessObjects Data Services delivers a single enterprise-class solution for data integration, data quality, data profiling, and text data processing that allows you to integrate, transform, improve, and deliver trusted data to critical business processes. It provides one development UI, metadata repository, data connectivity layer, run-time environment, and management console—enabling IT organizations to lower total cost of ownership and accelerate time to value. With SAP BusinessObjects Data Services, IT organizations can maximize operational efficiency with a single solution to improve data quality and gain access to heterogeneous sources and applications.
1.1.2 Documentation set for SAP BusinessObjects Data Services
You should become familiar with all the pieces of documentation that relate to your SAP BusinessObjects Data Services product.
Document: What this document provides
Administrator's Guide: Information about administrative tasks such as monitoring, lifecycle management, security, and so on.
Customer Issues Fixed: Information about customer issues fixed in this release.
Designer Guide: Information about how to use SAP BusinessObjects Data Services Designer.
Documentation Map: Information about available SAP BusinessObjects Data Services books, languages, and locations.
Installation Guide for Windows: Information about and procedures for installing SAP BusinessObjects Data Services in a Windows environment.
Installation Guide for UNIX: Information about and procedures for installing SAP BusinessObjects Data Services in a UNIX environment.
Integrator's Guide: Information for third-party developers to access SAP BusinessObjects Data Services functionality using web services and APIs.
Management Console Guide: Information about how to use SAP BusinessObjects Data Services Administrator and SAP BusinessObjects Data Services Metadata Reports.
Performance Optimization Guide: Information about how to improve the performance of SAP BusinessObjects Data Services.
Reference Guide: Detailed reference material for SAP BusinessObjects Data Services Designer.
Release Notes: Important information you need before installing and deploying this version of SAP BusinessObjects Data Services.
Technical Manuals: A compiled “master” PDF of core SAP BusinessObjects Data Services books containing a searchable master table of contents and index:
  Administrator's Guide
  Designer Guide
  Reference Guide
  Management Console Guide
  Performance Optimization Guide
  Supplement for J.D. Edwards
  Supplement for Oracle Applications
  Supplement for PeopleSoft
  Supplement for Salesforce.com
  Supplement for Siebel
  Supplement for SAP
Text Data Processing Extraction Customization Guide: Information about building dictionaries and extraction rules to create your own extraction patterns to use with Text Data Processing transforms.
Text Data Processing Language Reference Guide: Information about the linguistic analysis and extraction processing features that the Text Data Processing component provides, as well as a reference section for each language supported.
Tutorial: A step-by-step introduction to using SAP BusinessObjects Data Services.
Upgrade Guide: Release-specific product behavior changes from earlier versions of SAP BusinessObjects Data Services to the latest release. This manual also contains information about how to migrate from SAP BusinessObjects Data Quality Management to SAP BusinessObjects Data Services.
What's New: Highlights of new key features in this SAP BusinessObjects Data Services release. This document is not updated for support package or patch releases.
In addition, you may need to refer to several Adapter Guides and Supplemental Guides.
Document: What this document provides
Supplement for J.D. Edwards: Information about interfaces between SAP BusinessObjects Data Services and J.D. Edwards World and J.D. Edwards OneWorld.
Supplement for Oracle Applications: Information about the interface between SAP BusinessObjects Data Services and Oracle Applications.
Supplement for PeopleSoft: Information about interfaces between SAP BusinessObjects Data Services and PeopleSoft.
Supplement for Salesforce.com: Information about how to install, configure, and use the SAP BusinessObjects Data Services Salesforce.com Adapter Interface.
Supplement for SAP: Information about interfaces between SAP BusinessObjects Data Services, SAP Applications, and SAP NetWeaver BW.
Supplement for Siebel: Information about the interface between SAP BusinessObjects Data Services and Siebel.
We also include these manuals for information about SAP BusinessObjects Information platform services.
Document: What this document provides
Information platform services Administrator's Guide: Information for administrators who are responsible for configuring, managing, and maintaining an Information platform services installation.
Information platform services Installation Guide for UNIX: Installation procedures for SAP BusinessObjects Information platform services on a UNIX environment.
Information platform services Installation Guide for Windows: Installation procedures for SAP BusinessObjects Information platform services on a Windows environment.
1.1.3 Accessing documentation
You can access the complete documentation set for SAP BusinessObjects Data Services in several places.
1.1.3.1 Accessing documentation on Windows
After you install SAP BusinessObjects Data Services, you can access the documentation from the Start menu.
1. Choose Start > Programs > SAP BusinessObjects Data Services XI 4.0 > Data Services Documentation.
Note:
Only a subset of the documentation is available from the Start menu. The documentation set for this release is available in <LINK_DIR>\Doc\Books\en.
2. Click the appropriate shortcut for the document that you want to view.
1.1.3.2 Accessing documentation on UNIX
After you install SAP BusinessObjects Data Services, you can access the online documentation by going to the directory where the printable PDF files were installed.
1. Go to <LINK_DIR>/doc/book/en/.
2. Using Adobe Reader, open the PDF file of the document that you want to view.
1.1.3.3 Accessing documentation from the Web
You can access the complete documentation set for SAP BusinessObjects Data Services from the SAP BusinessObjects Business Users Support site.
1. Go to http://help.sap.com.
2. Click SAP BusinessObjects at the top of the page.
3. Click All Products in the navigation pane on the left.
You can view the PDFs online or save them to your computer.
1.1.4 SAP BusinessObjects information resources
A global network of SAP BusinessObjects technology experts provides customer support, education, and consulting to ensure maximum information management benefit to your business.
Useful addresses at a glance:
Address: Content

Customer Support, Consulting, and Education services
http://service.sap.com/
Information about SAP Business User Support programs, as well as links to technical articles, downloads, and online forums. Consulting services can provide you with information about how SAP BusinessObjects can help maximize your information management investment. Education services can provide information about training options and modules. From traditional classroom learning to targeted e-learning seminars, SAP BusinessObjects can offer a training package to suit your learning needs and preferred learning style.

SAP BusinessObjects Data Services Community
http://www.sdn.sap.com/irj/sdn/ds
Get online and timely information about SAP BusinessObjects Data Services, including tips and tricks, additional downloads, samples, and much more. All content is to and from the community, so feel free to join in and contact us if you have a submission.

Forums on SCN (SAP Community Network)
http://forums.sdn.sap.com/forum.jspa?forumID=305
Search the SAP BusinessObjects forums on the SAP Community Network to learn from other SAP BusinessObjects Data Services users and start posting questions or share your knowledge with the community.

Blueprints
http://www.sdn.sap.com/irj/boc/blueprints
Blueprints for you to download and modify to fit your needs. Each blueprint contains the necessary SAP BusinessObjects Data Services project, jobs, data flows, file formats, sample data, template tables, and custom functions to run the data flows in your environment with only a few modifications.

Product documentation
http://help.sap.com/businessobjects/
SAP BusinessObjects product documentation.

Supported Platforms (Product Availability Matrix)
https://service.sap.com/PAM
Get information about supported platforms for SAP BusinessObjects Data Services. Use the search function to search for Data Services. Click the link for the version of Data Services you are searching for.

1.2 Overview of This Guide
Welcome to the Extraction Customization Guide.
SAP BusinessObjects Data Services text data processing software enables you to perform extraction processing and various types of natural language processing on unstructured text.
The two major features of the software are linguistic analysis and extraction. Linguistic analysis includes natural-language processing (NLP) capabilities, such as segmentation, stemming, and tagging, among other things. Extraction processing analyzes unstructured text, in multiple languages and from any text data source, and automatically identifies and extracts key entity types, including people, dates, places, organizations, or other information, from the text. It enables the detection and extraction of activities, events and relationships between entities and gives users a competitive edge with relevant information for their business needs.
1.2.1 Who Should Read This Guide
This guide is written for dictionary and extraction rule writers. Users of this guide should understand extraction concepts and have familiarity with linguistic concepts and with regular expressions.
This documentation assumes the following:
You understand your organization's text analysis extraction needs.
1.2.2 About This Guide
This guide contains the following information:
Overview and conceptual information about dictionaries and extraction rules.
How to create, compile, and use dictionaries and extraction rules.
Examples of sample dictionaries and extraction rules.
Best practices for writing extraction rules.

Using Dictionaries

A dictionary in the context of the extraction process is a user-defined repository of entities. It can store customized information about the entities your application must find. You can use a dictionary to store name variations in a structured way that is accessible through the extraction process. A dictionary structure can also help standardize references to an entity.
Dictionaries are language-independent. This means that you can use the same dictionary to store all your entities and that the same patterns are matched in documents of different languages.
You can use a dictionary for:
name variation management
disambiguation of unknown entities
control over entity recognition
2.1 Entity Structure in Dictionaries
This section examines the entity structure in dictionaries. A dictionary contains a number of user-defined entity types, each of which contains any number of entities. For each entity, the dictionary distinguishes between a standard form name and variant names:
Standard form name: The most complete or precise form for a given entity. For example, United States of America might be the standard form name for that country. A standard form name can have one or more variant names (also known as source form) embedded under it.
Variant name: Less standard or complete than a standard form name, and it can include abbreviations, different spellings, nicknames, and so on. For example, United States, USA and US could be variant names for the same country.
In addition, a dictionary lets you assign variant names to a type. For example, you might define a variant type ABBREV for abbreviations.
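The relationship between a standard form name and its variant names can be sketched in Python. This is only an illustration of the structure, not the product's lookup mechanism. The `dictionary` and `entity_category` wrapper elements, the `name` attribute on the category, the variant UPS, and the helper `build_lookup` are assumptions introduced for the example. The `entity_name`, `standard_form`, and `variant` names follow the dictionary source samples in this chapter.

```python
import xml.etree.ElementTree as ET

# Sample dictionary source. The <dictionary> and <entity_category> wrappers
# are assumed for illustration; "UPS" is a made-up typed variant.
SOURCE = """
<dictionary>
  <entity_category name="ORGANIZATION@COMMERCIAL">
    <entity_name standard_form="United Parcel Service of America, Inc.">
      <variant name="United Parcel Service" />
      <variant name="UPS" type="ABBREV" />
    </entity_name>
  </entity_category>
</dictionary>
"""

def build_lookup(xml_text):
    """Map every known form (standard or variant) to (standard form, category)."""
    lookup = {}
    root = ET.fromstring(xml_text)
    for category in root.iter("entity_category"):
        for entity in category.iter("entity_name"):
            standard = entity.get("standard_form")
            lookup[standard] = (standard, category.get("name"))
            for variant in entity.iter("variant"):
                lookup[variant.get("name")] = (standard, category.get("name"))
    return lookup

lookup = build_lookup(SOURCE)
print(lookup["UPS"])
```

Given any related form, the lookup returns the standard form together with its type and subtype, which mirrors the behavior the dictionary lookup is described as providing.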
The following figure shows a graphical representation of the dictionary hierarchy and structure of a dictionary entry for United Parcel Service of America, Inc:
The real-world entity, indicated by the circle in the diagram, is associated with a standard form name and an entity type ORGANIZATION and subtype COMMERCIAL. Under the standard form name are name variations, one of which has its own type specified. The dictionary lookup lets you get the standard form and the variant names given any of the related forms.
2.1.1 Generating Predictable Variants
The variants United Parcel Service and United Parcel Service of America, Inc. are predictable, and the dictionary compiler can generate more predictable variants for later use in the extraction process. Using its variant generation feature, the compiler can programmatically generate certain predictable variants while compiling a dictionary.
Variant generation works from a list of designators for entities in the entity type ORGANIZATION in English. For instance, Corp. designates an organization. Variant generation in languages other than English covers the standard company designators, such as AG in German and SA in French. The variant generation facility provides the following functionality:
Creates or expands abbreviations for specified designators. For example, the abbreviation Inc. is expanded to Incorporated, and Incorporated is abbreviated to Inc., and so on.
Handles optional commas and periods.
Makes optional such company designators as Inc, Corp. and Ltd, as long as the organization name has more than one word in it.
For example, variants for Microsoft Corporation can include:
Microsoft Corporation
Microsoft Corp.
Microsoft Corp
Single word variant names like Microsoft are not automatically generated as variant organization names, since they are easily misidentified. One-word variants need to be entered into the dictionary individually. Variants are not enumerated without the appropriate organization designators.
Note:
Variant generation is supported in English, French, German, and Spanish.
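The behavior described in this section can be approximated with a short Python sketch. It is not the compiler's algorithm. The designator list is a small illustrative subset, the function name `generate_variants` is hypothetical, comma handling is reduced to dropping commas, and the "more than one word" condition is interpreted as applying to the name without its designator.

```python
# (full form, abbreviation) pairs; an illustrative subset, not the product's list.
DESIGNATORS = [("Corporation", "Corp"), ("Incorporated", "Inc"), ("Limited", "Ltd")]

def generate_variants(standard_form):
    """Return predictable variants for a name that ends in a company designator."""
    words = standard_form.replace(",", "").split()
    if len(words) < 2:
        return [standard_form]
    base = " ".join(words[:-1])
    last = words[-1].rstrip(".")
    for full, abbrev in DESIGNATORS:
        if last in (full, abbrev):
            # Expand or abbreviate the designator, with and without a period.
            variants = [f"{base} {full}", f"{base} {abbrev}.", f"{base} {abbrev}"]
            # Drop the designator only when the remaining name has several words,
            # so a single word such as "Microsoft" is never generated.
            if len(words) > 2:
                variants.append(base)
            return variants
    return [standard_form]

print(generate_variants("Microsoft Corporation"))
# ['Microsoft Corporation', 'Microsoft Corp.', 'Microsoft Corp']
```

Because one-word names are never produced, a variant such as Microsoft still has to be entered into the dictionary by hand.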
Related Topics
Adding Standard Variant Types
2.1.2 Custom Variant Types
You can also define custom variant types in a dictionary. Custom variant types can contain a list of variant name pre-modifiers and post-modifiers for a standard form name type. For variant names to be generated from a standard form name, the standard form name must match at least one of the patterns defined for that custom variant type.
A variant generation definition can have one or more patterns. For each pattern that matches, the defined generators are invoked. Patterns can contain the wildcards * and ?, which match zero or more tokens and exactly one token, respectively. Patterns can also contain one or more capture groups. These are sub-patterns that are enclosed in parentheses. After matching, the contents of a capture group are copied into the generator output wherever the generator references it by its placeholder (if any). Capture groups are numbered left to right, starting at 1. A capture group placeholder consists of a backslash followed by the capture group number.
The pattern always matches the entire string of the standard form name and never only part of that string. For example,
<define-variant_generation type="ENUM_TROOPS" >
  <pattern string="(?) forces" >
    <generate string="\1 troops" />
    <generate string="\1 soldiers" />
    <generate string="\1 Army" />
    <generate string="\1 military" />
    <generate string="\1 forces" />
  </pattern>
</define-variant_generation>
In this example:
The pattern matches forces preceded by one token only. Thus, it matches Afghan forces, but not U.S. forces, as the latter contains more than one token. To capture variant names with more than one token, use the pattern (*) forces.
The single capture group is referenced in all generators by its index: \1. The generated variant names are Afghan troops, Afghan soldiers, Afghan Army, Afghan military, and Afghan forces. In principle you do not need the last generator, as the standard form name already matches those tokens.
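The matching and substitution described above can be illustrated with a Python sketch. It is not the dictionary compiler's implementation. The regular-expression tokenizer (words and punctuation marks as separate tokens) is an assumption, the function names are hypothetical, and literal pattern elements are limited to single tokens.

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens with character offsets.
    This tokenizer is an assumption made for the sketch."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

def match_pattern(pattern, standard_form):
    """Match the entire standard form; return captured texts, or None."""
    def walk(elements, tokens, captures):
        if not elements:
            # The pattern must consume the whole string, never only part of it.
            return captures if not tokens else None
        head, rest = elements[0], elements[1:]
        is_group = head.startswith("(") and head.endswith(")")
        core = head[1:-1] if is_group else head
        if core == "?":
            lengths = [1]                          # exactly one token
        elif core == "*":
            lengths = range(len(tokens), -1, -1)   # zero or more tokens
        else:
            lengths = [1] if tokens and tokens[0][0] == core else []
        for n in lengths:
            if n > len(tokens):
                continue
            taken = tokens[:n]
            text = standard_form[taken[0][1]:taken[-1][2]] if taken else ""
            result = walk(rest, tokens[n:], captures + ([text] if is_group else []))
            if result is not None:
                return result
        return None
    return walk(pattern.split(), tokenize(standard_form), [])

def generate_custom_variants(pattern, generators, standard_form):
    """Replace each backslash-number placeholder with its capture group."""
    captures = match_pattern(pattern, standard_form)
    if captures is None:
        return []
    return [re.sub(r"\\(\d+)", lambda m: captures[int(m.group(1)) - 1], g)
            for g in generators]

generators = [r"\1 troops", r"\1 soldiers", r"\1 Army", r"\1 military"]
print(generate_custom_variants("(?) forces", generators, "Afghan forces"))
print(generate_custom_variants("(?) forces", generators, "U.S. forces"))  # []
```

Under this tokenizer, U.S. splits into four tokens, so the single-token pattern produces nothing for U.S. forces, while the pattern (*) forces would capture U.S. as a whole.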
The following example shows how to specify the variant generation within the dictionary source:
<entity_name standard_form="Afghan forces">
  <variant_generation type="ENUM_TROOPS" />
  <variant name="Afghanistan's Army" />
</entity_name>
Note:
Standard variants include the base text in the generated variant names, while custom variants do not.
Related Topics
Adding Custom Variant Types
2.1.3 Entity Subtypes
Dictionaries support the use of entity subtypes to enable the distinction between different varieties of the same entity type. For example, you might distinguish leafy vegetables from starchy vegetables.
To define an entity subtype in a dictionary entry, add an @-delimited extension to the category identifier, as in VEG@STARCHY. Subtyping is only one level deep, so TYPE@SUBTYPE@SUBTYPE is not valid.
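As a rough illustration of the one-level rule, the following Python sketch splits a category identifier into its type and optional subtype. The function `parse_category` is hypothetical, and rejecting empty parts (for example, a trailing @) is an assumption rather than documented compiler behavior.

```python
def parse_category(identifier):
    """Split 'TYPE' or 'TYPE@SUBTYPE' into (type, subtype); reject deeper nesting."""
    parts = identifier.split("@")
    # More than one '@' means nested subtypes; empty parts are treated as invalid
    # here, which is an assumption of this sketch.
    if len(parts) > 2 or any(not part for part in parts):
        raise ValueError(f"invalid category identifier: {identifier!r}")
    return parts[0], (parts[1] if len(parts) == 2 else None)

print(parse_category("VEG@STARCHY"))  # ('VEG', 'STARCHY')
print(parse_category("VEG"))          # ('VEG', None)
```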
Related Topics
Adding an Entity Subtype
2.1.4 Variant Types
Variant names can optionally be associated with a type, meaning that you specify the type of variant name. For example, one specific type of variant name is an abbreviation, ABBREV. Other examples of variant types that you could create are ACRONYM, NICKNAME, or PRODUCT-ID.
2.1.5 Wildcards in Entity Names
Dictionary entries support entity names specified with wildcard pattern-matching elements. These are the Kleene star ("*") and question mark ("?") characters, used to match against a portion of the input string. For example, either "* University" or "? University" might be used as the name of an entity belonging to a custom type UNIVERSITY.
These wildcard elements must be restricted to match against only part of the input buffer. Consider a pattern "Company *" which matches at the beginning of a 500 KB document. If unlimited matching were allowed, the * wildcard would match against the document's remaining 499+ KB.
Note:
Using wildcards in a dictionary may affect the speed of entity extraction. Performance decreases proportionally with the number of wildcards in a dictionary. Use this functionality keeping potential performance degradations in mind.
2.1.5.1 Wildcard Definitions
The * and ? wildcards are described as follows, given a sentence:
* matches any number of tokens greater than or equal to zero within a sentence.
? matches only one token within a sentence.
A token is an independent piece of a linguistic expression, such as a word or a punctuation mark. The wildcards match whole tokens only and not sub-parts of tokens. For both wildcards, any tokens are eligible to be matching elements, provided the literal (fixed) portion of the pattern is satisfied.
2.1.5.2 Wildcard Usage
Wildcard characters are used to specify a pattern, normally containing both literal and variable elements, as the name of an entity. For instance, consider this input:
I once attended Stanford University, though I considered Carnegie Mellon University.
Consider an entity belonging to the category UNIVERSITY with the variant name "* University". The pattern will match any sentence ending with "University".
If the pattern were "? University", it would match only a single token preceding "University", whether the match forms a whole sentence or part of one. The entire string "Stanford University" would then match as intended. However, for "Carnegie Mellon University", only the substring "Mellon University" would match: "Carnegie" would be disregarded, since the question mark matches only one token. This is probably not the intended result.
If several patterns compete, the extraction process returns the match with the widest scope. Thus if a competing pattern "* University" were available in the previous example, "Carnegie Mellon University" would be returned, and "Mellon University" would be ignored.
Since * and ? are special characters, "escape" characters are required to treat the wildcards as literal elements of fixed patterns. The backslash "\" is the escape character. Thus "\*" represents the literal asterisk as opposed to the Kleene star. A backslash can itself be made a literal by writing "\\".
Note:
Use wildcards when defining variant names of an entity instead of using them for defining a standard form name of an entity.
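As a minimal sketch of the usage described above, the fragment below gives an entity in the UNIVERSITY category wildcard patterns as variant names and keeps the standard form free of wildcards. The standard form value and the escaped variant are illustrative assumptions, not entries from a shipped dictionary.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
  <entity_category name="UNIVERSITY">
    <!-- Hypothetical standard form; wildcards appear only in the variants -->
    <entity_name standard_form="University">
      <!-- * matches zero or more whole tokens within a sentence -->
      <variant name="* University"/>
      <!-- ? matches exactly one whole token -->
      <variant name="? University"/>
      <!-- \* is a literal asterisk, not a wildcard (made-up name) -->
      <variant name="University \* Press"/>
    </entity_name>
  </entity_category>
</dictionary>
```

With both wildcard variants present, the input "Carnegie Mellon University" would be expected to resolve to the wider "* University" match, as described for competing patterns.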
Related Topics
Adding Wildcard Variants
2.2 Creating a Dictionary
To create a dictionary, follow these steps:
1. Create an XML file containing your content, formatted according to the dictionary syntax.
2. Run the dictionary compiler on that file.
Note:
For large dictionary source files, make sure the memory available to the compiler is at least five times the size of the input file, in bytes.
Related Topics

Dictionary XSD

Compiling a Dictionary
2.3 Dictionary Syntax
2.3.1 Dictionary XSD
The syntax of a dictionary conforms to the following XML Schema Definition (XSD). When creating your custom dictionary, format your content using the following syntax, making sure to specify the encoding if the file is not UTF-8.
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright 2010 SAP AG. All rights reserved.
SAP, R/3, SAP NetWeaver, Duet, PartnerEdge, ByDesign, SAP Business ByDesign, and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP AG in Germany and other countries.
Business Objects and the Business Objects logo, BusinessObjects, Crystal Reports, Crystal Decisions, Web Intelligence, Xcelsius, and other Business Objects products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of Business Objects S.A. in the United States and in other countries.
Business Objects is an SAP company.
All other product and service names mentioned are the trademarks of their respective companies. Data contained in this document serves informational purposes only. National product specifications may vary.
These materials are subject to change without notice. These materials are provided by SAP AG and its affiliated companies ("SAP Group") for informational purposes only, without representation or warranty of any kind, and SAP Group shall not be liable for errors or omissions with respect to the materials. The only warranties for SAP Group products and services are those that are set forth in the express warranty statements accompanying such products and services, if any. Nothing herein should be construed as constituting an additional warranty.
-->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:dd="http://www.sap.com/ta/4.0"
    targetNamespace="http://www.sap.com/ta/4.0"
    elementFormDefault="qualified"
    attributeFormDefault="unqualified">
  <xsd:element name="dictionary">
    <xsd:complexType>
      <xsd:sequence maxOccurs="unbounded">
        <xsd:element ref="dd:define-variant_generation" minOccurs="0" maxOccurs="unbounded"/>
        <xsd:element ref="dd:entity_category" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="entity_category">
    <xsd:complexType>
      <xsd:sequence maxOccurs="unbounded">
        <xsd:element ref="dd:entity_name"/>
      </xsd:sequence>
      <xsd:attribute name="name" type="xsd:string" use="required"/>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="entity_name">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="dd:variant" minOccurs="0" maxOccurs="unbounded"/>
        <xsd:element ref="dd:query_only" minOccurs="0" maxOccurs="unbounded"/>
        <xsd:element ref="dd:variant_generation" minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
      <xsd:attribute name="standard_form" type="xsd:string" use="required"/>
      <xsd:attribute name="uid" type="xsd:string" use="optional"/>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="variant">
    <xsd:complexType>
      <xsd:attribute name="name" type="xsd:string" use="required"/>
      <xsd:attribute name="type" type="xsd:string" use="optional"/>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="query_only">
    <xsd:complexType>
      <xsd:attribute name="name" type="xsd:string" use="required"/>
      <xsd:attribute name="type" type="xsd:string" use="optional"/>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="variant_generation">
    <xsd:complexType>
      <xsd:attribute name="type" type="xsd:string" use="required"/>
      <xsd:attribute name="language" type="xsd:string" use="optional" default="english"/>
      <xsd:attribute name="base_text" type="xsd:string" use="optional"/>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="define-variant_generation">
    <xsd:complexType>
      <xsd:sequence maxOccurs="unbounded">
        <xsd:element ref="dd:pattern"/>
      </xsd:sequence>
      <xsd:attribute name="type" type="xsd:string" use="required"/>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="pattern">
    <xsd:complexType>
      <xsd:sequence maxOccurs="unbounded">
        <xsd:element ref="dd:generate"/>
      </xsd:sequence>
      <xsd:attribute name="string" type="xsd:string" use="required"/>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="generate">
    <xsd:complexType>
      <xsd:attribute name="string" type="xsd:string" use="required"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
The following table describes each element and attribute of the dictionary XSD.
Element
    Attributes and Description

dictionary
    This is the root tag, of which a dictionary may contain only one. Contains one or more embedded entity_category elements.

entity_category
    The category (type) to which all embedded entities belong. Contains one or more embedded entity_name elements. Must be explicitly closed.
    name: The name of the category, such as PEOPLE, COMPANY, PHONE NUMBER, and so on. Note that the entity category name is case sensitive.

entity_name
    A named entity in the dictionary. Contains zero or more of the elements variant, query_only and variant_generation. Must be explicitly closed.
    standard_form: The standard form of the entity_name. The standard form is generally the longest or most common form of a named entity. The standard_form name must be unique within the entity_category but not within the dictionary.
    uid: A user-defined ID for the standard form name. This is an optional attribute.

variant
    A variant name for the entity. The variant name must be unique within the entity_name. Need not be explicitly closed.
    name: [Required] The name of the variant.
    type: [Optional] The type of variant, generally a subtype of the larger entity_category.

query_only
    name
    type

variant_generation
    Specifies whether the dictionary should automatically generate predictable variants. By default, the standard form name is used as the starting point for variant generation. Need not be explicitly closed.
    type: [Required] Types supported are standard or the name of a custom variant generation defined earlier in the dictionary.
    language: [Optional] Specifies the language to use for standard variant generation, in lower case, for example, "english". If this option is not specified in the dictionary, the language specified with the compiler command is used, or it defaults to English when there is no language specified in either the dictionary or the compiler command.
    base_text: [Optional] Specifies text other than the standard form name to use as the starting point for the computation of variants.

define-variant_generation
    Specifies custom variant generation.

pattern
    Specifies the pattern that must be matched to generate custom variants.

generate
    Specifies the exact pattern for custom variant generation within each generate tag.

Related Topics
Adding Custom Variant Types
Formatting Your Source
2.3.2 Guidelines for Naming Entities
This section describes several guidelines for the format of standard form and variant names in a dictionary:
You can use any part-of-speech (word class).
Use only characters that are valid for the specified encoding.
The symbols used for wildcard pattern matching, "?" and "*", must be escaped using a backslash character ("\").
Any other special characters, such as quotation marks, ampersands, and apostrophes, can be escaped according to the XML specification.
The following table shows some such character entities (also used in HTML), along with the correct syntax:
Character    Description              Dictionary Entry
<            Less than (<) sign       &lt;
>            Greater than (>) sign    &gt;
&            Ampersand (&) sign       &amp;
"            Quotation marks (")      &quot;
'            Apostrophe (')           &apos;
2.3.3 Character Encoding in a Dictionary
A dictionary supports all the character encodings supported by the Xerces-C XML parser. If you are creating a dictionary to be used for more than one language, use an encoding that supports all required languages, such as UTF-8. For information on encodings supported by the Xerces-C XML parser, see http://xerces.apache.org/xerces-c/faq-parse-3.html#faq-16.
The default input encoding assumed by a dictionary is UTF-8. Dictionary input files that are not in UTF-8 must specify their character encoding in an XML directive to enable proper operation of the configuration file parser, for example:
<?xml version="1.0" encoding="UTF-16" ?>
If no encoding specification exists, UTF-8 is assumed. For best results, always specify the encoding.
Note:
CP-1252 must be specified as windows-1252 in the XML header element. The encoding names should follow the IANA-CHARSETS recommendation.
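The encoding declaration and the escaping rules for names can be seen together in a short sketch. The file below declares CP-1252 under its IANA name and escapes the special characters in its names. The entity and all of its names are hypothetical and only illustrate the syntax.

```xml
<?xml version="1.0" encoding="windows-1252"?>
<dictionary>
  <entity_category name="ORGANIZATION@COMMERCIAL">
    <!-- Hypothetical entity: an ampersand in a name is written as an XML entity -->
    <entity_name standard_form="Smith &amp; Jones, Incorporated">
      <variant name="Smith &amp; Jones"/>
      <!-- Apostrophe and quotation marks escaped the same way -->
      <variant name="Smith &amp; Jones&apos; &quot;Best&quot;"/>
      <!-- A literal question mark needs the backslash escape, not an XML entity -->
      <variant name="Smith or Jones\?"/>
    </entity_name>
  </entity_category>
</dictionary>
```

The two escaping mechanisms are independent: XML entities keep the file well-formed for the parser, while the backslash tells the dictionary that "?" or "*" is not a wildcard.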
2.3.4 Dictionary Sample File
Here is a sample dictionary file.
<?xml version="1.0" encoding="windows-1252"?>
<dictionary>
  <entity_category name="ORGANIZATION@COMMERCIAL">
    <entity_name standard_form="United Parcel Service of America, Incorporated">
      <variant name="United Parcel Service" />
      <variant name="U.P.S." type="ABBREV" />
      <variant name="UPS" />
      <variant_generation type="standard" language="english" />
    </entity_name>
  </entity_category>
</dictionary>
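The optional attributes listed in the XSD can be sketched in the same way. The following variation on the sample is an assumption-laden illustration: the uid value is made up, and the choice of which name to mark as query_only is arbitrary. Child elements follow the order given by the XSD sequence (variant, then query_only, then variant_generation).

```xml
<?xml version="1.0" encoding="windows-1252"?>
<dictionary>
  <entity_category name="ORGANIZATION@COMMERCIAL">
    <!-- uid is an optional user-defined ID; this value is invented -->
    <entity_name standard_form="United Parcel Service of America, Incorporated" uid="ORG-0001">
      <!-- type is optional and generally names a subtype of the category -->
      <variant name="U.P.S." type="ABBREV" />
      <!-- query_only takes the same name and type attributes as variant -->
      <query_only name="UPS" />
      <!-- base_text replaces the standard form as the starting point for generated variants -->
      <variant_generation type="standard" language="english" base_text="United Parcel Service" />
    </entity_name>
  </entity_category>
</dictionary>
```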
Related Topics
Entity Structure in Dictionaries
2.3.5 Formatting Your Source
Format your source file according to the dictionary XSD. The source file must contain sufficient context to make the entry unambiguous. The required tags for a dictionary entry are:
entity_category
entity_name
Others can be mentioned according to the desired operation. If tags are already in the target dictionary, they are augmented; if not, they are added. The add operation never removes tags, and the remove operation never adds them.
Related Topics
Dictionary XSD
2.3.6 Working with a Dictionary
This section provides details on how to update your dictionary files to add or remove entries as well as update existing entries.
2.3.6.1 Adding an Entity
To add an entity to a dictionary:
Specify the entity's standard form under the relevant entity category, and optionally, its variants.
The example below adds two new entities to the ORGANIZATION@COMMERCIAL category:
<?xml version="1.0" encoding="windows-1252"?>
<dictionary>
  <entity_category name="ORGANIZATION@COMMERCIAL">
    <entity_name standard_form="Seventh Generation Incorporated">
      <variant name="Seventh Generation"/>
      <variant name="SVNG"/>
    </entity_name>
    <entity_name standard_form="United Airlines, Incorporated">
      <variant name="United Airlines, Inc."/>
      <variant name="United Airlines"/>
      <variant name="United"/>
    </entity_name>
  </entity_category>
</dictionary>
2.3.6.2 Adding an Entity Type
To add an entity type to a dictionary:
Include a new entity_category tag.
For example:
<?xml version="1.0" encoding="windows-1252"?>
<dictionary>
  <entity_category name="YOUR_ENTITY_TYPE">
    ...
  </entity_category>
</dictionary>