SAP Business objects DATA SERVICES Text Data Processing Language Reference Guide

Text Data Processing Language Reference Guide
SAP BusinessObjects Data Services XI 4.0 (14.0.0)
2010-12-02
Copyright
© 2010 SAP AG. All rights reserved.SAP, R/3, SAP NetWeaver, Duet, PartnerEdge, ByDesign, SAP Business ByDesign, and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP AG in Germany and other countries. Business Objects and the Business Objects logo, BusinessObjects, Crystal Reports, Crystal Decisions, Web Intelligence, Xcelsius, and other Business Objects products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of Business Objects S.A. in the United States and in other countries. Business Objects is an SAP company.All other product and service names mentioned are the trademarks of their respective companies. Data contained in this document serves informational purposes only. National product specifications may vary.These materials are subject to change without notice. These materials are provided by SAP AG and its affiliated companies ("SAP Group") for informational purposes only, without representation or warranty of any kind, and SAP Group shall not be liable for errors or omissions with respect to the materials. The only warranties for SAP Group products and services are those that are set forth in the express warranty statements accompanying such products and services, if any. Nothing herein should be construed as constituting an additional warranty.
2010-12-02

Contents

Introduction.............................................................................................................................7Chapter 1
1.1
1.1.1
1.1.2
1.1.3
1.1.4
1.2
1.2.1
1.2.2
2.1
2.2
2.2.1
2.3
2.4
3.1
3.2
3.3
3.3.1
3.4
3.5
3.5.1
3.5.2
3.5.3
3.5.4
3.5.5
3.5.6
3.5.7
3.6
Welcome to SAP BusinessObjects Data Services...................................................................7
Welcome.................................................................................................................................7
Documentation set for SAP BusinessObjects Data Services...................................................7
Accessing documentation......................................................................................................10
SAP BusinessObjects information resources.........................................................................11
Overview of This Guide..........................................................................................................12
About This Guide ..................................................................................................................13
Who Should Read This Guide................................................................................................13
Overview of Linguistic Analysis and Extraction....................................................................15Chapter 2
About Linguistic Analysis.......................................................................................................15
About Extraction.....................................................................................................................15
About Customizing Extraction................................................................................................16
Languages Modules Supported..............................................................................................17
Specialized Extraction Content...............................................................................................17
Linguistic Analysis Support...................................................................................................19Chapter 3
Linguistic Analysis Language Feature Matrix..........................................................................20
Segment Generation..............................................................................................................21
Word Segmentation...............................................................................................................21
White Space Languages........................................................................................................22
Case Normalization Rules......................................................................................................23
Stemming..............................................................................................................................24
Standard Inflectional Stemming..............................................................................................25
Expanded Inflectional Stemming.............................................................................................26
Inflectional Stemmer Guesser................................................................................................26
Compound Word Stemming...................................................................................................26
Non-Decompounding Stemming.............................................................................................26
Derivational Stemming...........................................................................................................27
Stemming Unknown Words....................................................................................................27
Part-of-Speech Support.........................................................................................................28
2010-12-023
Contents
3.6.1
3.6.2
3.6.3
3.6.4
4.1
4.1.1
4.2
4.3
4.4
4.4.1
4.4.2
5.1
5.1.1
5.1.2
5.2
5.2.1
5.2.2
5.3
5.3.1
5.3.2
5.4
5.4.1
5.4.2
5.5
5.5.1
5.5.2
5.6
5.6.1
5.6.2
Tag Name Conventions..........................................................................................................28
Unfound Words......................................................................................................................29
Tagged Stemming..................................................................................................................29
Word Breaking.......................................................................................................................29
Extraction Support.................................................................................................................31Chapter 4
Entity and Fact Extraction.......................................................................................................31
Subentities and Subtypes......................................................................................................32
Extraction Resource Files.......................................................................................................32
Levels of Extraction Support for the Language Modules.........................................................33
Predefined Entity Type Support..............................................................................................34
Named Entities.......................................................................................................................35
Common Mentions................................................................................................................42
Language Modules Reference..............................................................................................45Chapter 5
Chinese (Simplified) Language Reference..............................................................................45
Linguistic Processing.............................................................................................................45
Extraction...............................................................................................................................50
English Language Reference..................................................................................................63
Linguistic Processing.............................................................................................................64
Extraction...............................................................................................................................73
French Language Reference.................................................................................................100
Linguistic Processing...........................................................................................................100
Extraction.............................................................................................................................108
German Language Reference...............................................................................................126
Linguistic Processing...........................................................................................................126
Extraction.............................................................................................................................139
Japanese Language Reference............................................................................................157
Linguistic Processing...........................................................................................................157
Extraction.............................................................................................................................167
Spanish Language Reference...............................................................................................167
Linguistic Processing...........................................................................................................168
Extraction.............................................................................................................................176
6.1
6.1.1
6.1.2
6.1.3
6.2
Voice of the Customer Content..........................................................................................193Chapter 6
Extracting Sentiments..........................................................................................................194
English: Sentiment Extraction Examples...............................................................................195
French: Sentiment Extraction Examples................................................................................196
Spanish: Sentiment Extraction Examples..............................................................................197
Extracting Requests.............................................................................................................198
2010-12-024
Contents
6.2.1
6.2.2
6.2.3
7.1
7.2
7.3
7.4
7.5
8.1
8.1.1
8.1.2
8.1.3
8.1.4
8.1.5
8.1.6
8.1.7
8.1.8
8.1.9
8.1.10
8.1.11
8.2
8.2.1
English: Request Extraction Examples..................................................................................199
French: Request Extraction Examples...................................................................................200
Spanish: Request Extraction Examples.................................................................................200
Enterprise Content..............................................................................................................201Chapter 7
Extracting Membership Information......................................................................................202
Extracting Management Change Events...............................................................................204
Extracting Product Release Events.......................................................................................206
Extracting Merger Information..............................................................................................207
Extracting Organizational Information....................................................................................208
Public Sector Content.........................................................................................................211Chapter 8
English: Types of Information Extracted ...............................................................................211
Public Sector Content Rule Sets–English.............................................................................211
Public Sector Content Entities–English.................................................................................213
Extracting Action Events.......................................................................................................219
Extracting Travel Events.......................................................................................................227
Extracting Military Units........................................................................................................236
Extracting Organizational Information....................................................................................237
Extracting a Person's Aliases...............................................................................................240
Extracting Information About a Person's Appearance...........................................................243
Extracting Information About a Person's Attributes...............................................................244
Extracting Information About a Person's Relationships.........................................................249
Extracting Spatial References...............................................................................................250
Simplified Chinese: Types of Information Extracted..............................................................251
Public Sector Entities–Simplified Chinese.............................................................................251
Index 257
2010-12-025
Contents
2010-12-026

Introduction

Introduction
1.1 Welcome to SAP BusinessObjects Data Services
1.1.1 Welcome
SAP BusinessObjects Data Services delivers a single enterprise-class solution for data integration, data quality, data profiling, and text data processing that allows you to integrate, transform, improve, and deliver trusted data to critical business processes. It provides one development UI, metadata repository, data connectivity layer, run-time environment, and management console—enabling IT organizations to lower total cost of ownership and accelerate time to value. With SAP BusinessObjects Data Services, IT organizations can maximize operational efficiency with a single solution to improve data quality and gain access to heterogeneous sources and applications.
1.1.2 Documentation set for SAP BusinessObjects Data Services
You should become familiar with all the pieces of documentation that relate to your SAP BusinessObjects Data Services product.
What this document providesDocument
Administrator's Guide
Customer Issues Fixed
Designer Guide
Information about administrative tasks such as monitoring, lifecycle management, security, and so on.
Information about customer issues fixed in this release.
Information about how to use SAP BusinessObjects Data Services Designer.
Documentation Map
Information about available SAP BusinessObjects Data Ser­vices books, languages, and locations.
2010-12-027
Introduction
What this document providesDocument
Installation Guide for Windows
Installation Guide for UNIX
Integrator's Guide
Management Console Guide
Performance Optimization Guide
Reference Guide
Release Notes
Technical Manuals
Information about and procedures for installing SAP Busines­sObjects Data Services in a Windows environment.
Information about and procedures for installing SAP Busines­sObjects Data Services in a UNIX environment.
Information for third-party developers to access SAP Busines­sObjects Data Services functionality using web services and APIs.
Information about how to use SAP BusinessObjects Data Services Administrator and SAP BusinessObjects Data Ser­vices Metadata Reports.
Information about how to improve the performance of SAP BusinessObjects Data Services.
Detailed reference material for SAP BusinessObjects Data Services Designer.
Important information you need before installing and deploying this version of SAP BusinessObjects Data Services.
A compiled “master” PDF of core SAP BusinessObjects Data Services books containing a searchable master table of con­tents and index:
Administrator's Guide
Designer Guide
Reference Guide
Management Console Guide
Performance Optimization Guide
Supplement for J.D. Edwards
Supplement for Oracle Applications
Supplement for PeopleSoft
Supplement for Salesforce.com
Supplement for Siebel
Supplement for SAP
Text Data Processing Extraction Customiza­tion Guide
Text Data Processing Language Reference Guide
Information about building dictionaries and extraction rules to create your own extraction patterns to use with Text Data Processing transforms.
Information about the linguistic analysis and extraction process­ing features that the Text Data Processing component pro­vides, as well as a reference section for each language sup­ported.
2010-12-028
Introduction
What this document providesDocument
Tutorial
Upgrade Guide
What's New
In addition, you may need to refer to several Adapter Guides and Supplemental Guides.
Supplement for J.D. Edwards
Supplement for Oracle Applica­tions
Supplement for PeopleSoft
A step-by-step introduction to using SAP BusinessObjects Data Services.
Release-specific product behavior changes from earlier ver­sions of SAP BusinessObjects Data Services to the latest re­lease. This manual also contains information about how to migrate from SAP BusinessObjects Data Quality Management to SAP BusinessObjects Data Services.
Highlights of new key features in this SAP BusinessObjects Data Services release. This document is not updated for sup­port package or patch releases.
What this document providesDocument
Information about interfaces between SAP BusinessObjects Data Services and J.D. Edwards World and J.D. Edwards OneWorld.
Information about the interface between SAP BusinessObjects Data Services and Oracle Applications.
Information about interfaces between SAP BusinessObjects Data Services and PeopleSoft.
Supplement for Salesforce.com
Supplement for SAP
Supplement for Siebel
Information about how to install, configure, and use the SAP BusinessObjects Data Services Salesforce.com Adapter Interface.
Information about interfaces between SAP BusinessObjects Data Services, SAP Applications, and SAP NetWeaver BW.
Information about the interface between SAP BusinessObjects Data Services and Siebel.
We also include these manuals for information about SAP BusinessObjects Information platform services.
Information platform services Administrator's Guide
Information platform services Installation Guide for UNIX
What this document providesDocument
Information for administrators who are responsible for configuring, managing, and maintaining an Information platform services installation.
Installation procedures for SAP BusinessObjects Infor­mation platform services on a UNIX environment.
2010-12-029
Introduction
What this document providesDocument
Information platform services Installation Guide for Windows
1.1.3 Accessing documentation
You can access the complete documentation set for SAP BusinessObjects Data Services in several places.
1.1.3.1 Accessing documentation on Windows
After you install SAP BusinessObjects Data Services, you can access the documentation from the Start menu.
1.
Choose Start > Programs > SAP BusinessObjects Data Services XI 4.0 > Data Services Documentation.
Installation procedures for SAP BusinessObjects Infor­mation platform services on a Windows environment.
Note:
Only a subset of the documentation is available from the Start menu. The documentation set for this release is available in <LINK_DIR>\Doc\Books\en.
2.
Click the appropriate shortcut for the document that you want to view.
1.1.3.2 Accessing documentation on UNIX
After you install SAP BusinessObjects Data Services, you can access the online documentation by going to the directory where the printable PDF files were installed.
1.
Go to <LINK_DIR>/doc/book/en/.
2.
Using Adobe Reader, open the PDF file of the document that you want to view.
1.1.3.3 Accessing documentation from the Web
2010-12-0210
Introduction
You can access the complete documentation set for SAP BusinessObjects Data Services from the SAP BusinessObjects Business Users Support site.
1.
Go to http://help.sap.com.
2.
Click SAP BusinessObjects at the top of the page.
3.
Click All Products in the navigation pane on the left.
You can view the PDFs online or save them to your computer.
1.1.4 SAP BusinessObjects information resources
A global network of SAP BusinessObjects technology experts provides customer support, education, and consulting to ensure maximum information management benefit to your business.
Useful addresses at a glance:
2010-12-0211
Introduction
ContentAddress
Customer Support, Consulting, and Education services
http://service.sap.com/
SAP BusinessObjects Data Services Community
http://www.sdn.sap.com/irj/sdn/ds
Forums on SCN (SAP Community Network )
http://forums.sdn.sap.com/forum.jspa?foru mID=305
Blueprints
http://www.sdn.sap.com/irj/boc/blueprints
Information about SAP Business User Support programs, as well as links to technical articles, downloads, and online forums. Consulting services can provide you with information about how SAP BusinessObjects can help maximize your informa­tion management investment. Education services can provide information about training options and modules. From traditional classroom learning to targeted e-learning seminars, SAP BusinessObjects can offer a training package to suit your learning needs and preferred learning style.
Get online and timely information about SAP Busi­nessObjects Data Services, including tips and tricks, additional downloads, samples, and much more. All content is to and from the community, so feel free to join in and contact us if you have a submis­sion.
Search the SAP BusinessObjects forums on the SAP Community Network to learn from other SAP BusinessObjects Data Services users and start posting questions or share your knowledge with the community.
Blueprints for you to download and modify to fit your needs. Each blueprint contains the necessary SAP BusinessObjects Data Services project, jobs, data flows, file formats, sample data, template tables, and custom functions to run the data flows in your environment with only a few modifications.
http://help.sap.com/businessobjects/
Supported Platforms (Product Availability Matrix)
https://service.sap.com/PAM
1.2 Overview of This Guide
SAP BusinessObjects product documentation.Product documentation
Get information about supported platforms for SAP BusinessObjects Data Services.
Use the search function to search for Data Services. Click the link for the version of Data Services you are searching for.
2010-12-0212
Introduction
Welcome to the
SAP BusinessObjects Data Services text data processing software enables you to perform linguistic analysis of and extraction of content from unstructured text.
Linguistic analysis includes natural-language processing (NLP) capabilities, such as segmentation, stemming, and tagging, among other things. Extraction analyzes unstructured text, in multiple languages and from any text data source, and automatically identifies and extracts key entity types, including people, dates, places, organizations, or other information, from the text.
Language Reference Guide
1.2.1 About This Guide
This guide contains two kinds of information:
Overviews and conceptual information about the linguistic analysis and extraction features provided
by the software.
A reference section for each language supported by the software. It describes the behavior of the
supported language modules during extraction and normalization.
.
1.2.2 Who Should Read This Guide
Users of this guide may need to enhance extraction in their text analytics application and should understand text data processing extraction concepts. However, users of this guide are not expected to understand or be familiar with the natural languages of the text being processed by the software. Similarly, users are not required to be familiar with linguistic principles. This document assumes the following:
You are an application developer or consultant working on enhancing text data processing extraction.
You understand your organization's text data processing extraction needs.
2010-12-0213
Introduction
2010-12-0214

Overview of Linguistic Analysis and Extraction

Overview of Linguistic Analysis and Extraction
The software includes language modules for the languages supported. Each language module consists of a set of files that include system dictionaries containing words to support the language processing operations for the given natural language. It is the language modules that enable linguistic analysis and extraction of unstructured text in a given language. Language modules use the following language processing technologies:
Linguistic analysis to handle natural language processing
Extraction to handle entity extraction
Related Topics
Linguistic Analysis Support
Extraction Support
2.1 About Linguistic Analysis
The software provides and uses sophisticated natural language processing capabilities for linguistic analysis of unstructured data. Some of these capabilities include:
Segmentation–the separation of input text into its elements
Stemming–the identification of word stems, or dictionary forms
Tagging–the labeling of words' parts of speech
Related Topics
Linguistic Analysis Support
Language Modules Reference
2.2 About Extraction
2010-12-0215
Overview of Linguistic Analysis and Extraction
Extraction is the process of discovering and presenting specific entities and facts that occur in unstructured text.
Entities denote the names of people, places, things, dates, values, and so forth, that can be extracted
from text. An entity is defined as a pairing of a standard form and its type. For example, Winston Churchill/PERSON is an entity in which Winston Churchill is the standard form and PERSON is the type.
Facts are entities and subentities, found during the extraction process, that represent relationships,
events, sentiments, or requests. Facts are extracted based on extraction rules consisting of patterns that define the expressions to use to extract the information. The specialized voice of the customer content, for example, provides the rules that let you extract facts that represent sentiments and requests.
The language modules included with the software contain system dictionaries and provide an extensive set of predefined entity types. The extraction process can extract entities using these lists of specific entities. It can also discover new entities using linguistic models. Extraction classifies each extracted entity by entity type and presents this metadata in a standardized format.
Related Topics
Extraction Support
Predefined Entity Type Support

About Customizing Extraction

Languages Modules Supported
Language Modules Reference
Specialized Extraction Content
2.2.1 About Customizing Extraction
You can enhance the extraction process by creating and using:
Dictionaries that contain information about entities. You can customize information about the entities
your application must find.
Extraction rules.
For details about enhancing extraction, refer to the
Data Processing Extraction Customization Guide
For certain language modules, you can also enhance extraction by using the specialized extraction content included in them.
Related Topics
Specialized Extraction Content
SAP BusinessObjects Data Services XI 4.0 Text
.
2010-12-0216
Overview of Linguistic Analysis and Extraction
2.3 Languages Modules Supported
The software provides these language modules, which are supported by linguistic analysis and extraction:
English
French
German
Japanese
Simplified Chinese
Spanish
Note:
Not all linguistic analysis and extraction features are supported for all languages.
Related Topics
Linguistic Analysis Language Feature Matrix
Levels of Extraction Support for the Language Modules
Language Modules Reference
2.4 Specialized Extraction Content
Certain language modules include specialized content that provides entity types and sets of rules that address specific needs:
Specialized Extraction Con­tent
Voice of the customer
Description
Extracts specific information about your customers' needs (requests) and perceptions and problems (sentiments)
Included in These Language Modules
English
French
Spanish
Enterprise
Extracts enterprise-specific infor­mation, such as management changes and product releases
English
2010-12-0217
Overview of Linguistic Analysis and Extraction
Specialized Extraction Con­tent
Public Sector
Related Topics
Voice of the Customer Content
Enterprise Content
Public Sector Content
Description
Extracts public-sector-specific information, such as events and relations
Included in These Language Modules
English
Simplified Chinese
2010-12-0218

Linguistic Analysis Support

Linguistic Analysis Support
The software provides and uses these linguistic analysis features for multilingual natural language processing (NLP) of unstructured data:
Language and encoding identification
DescriptionFeature
The automatic recognition of the input language, for example, French or Japanese, and of various character encodings (such as Unicode UTF-8 and Code Page 1252).
Segment generation
Word segmentation
Case normalization
Stemming
Tagging
Document analysis
The breaking of input text into segments of one or more complete paragraphs for more efficient processing.
The separation of input text into its elements, such as words and punctuation.
The normalization of the initial letter of a word to upper or lower case. Used to counteract case changes related to document structure, such as title and heading capitalization.
The identification of word stems, or dictionary forms, for text or single words.
The labeling of words' parts of speech, for exam­ple, noun or verb.
The recognition of a document's major sec­tions–paragraphs and sentences.
Tagged stemming
The identification of word stems for a word of a given part-of-speech.
2010-12-0219
Linguistic Analysis Support
Note:
Not all operations are supported for all languages.
Related Topics

Linguistic Analysis Language Feature Matrix

Segment Generation
Word Segmentation
Case Normalization Rules
Stemming
Part-of-Speech Support
Tagged Stemming
Language Modules Reference
3.1 Linguistic Analysis Language Feature Matrix
Linguistic analysis provides two levels of language support:
Standard–Tagging is not supported
Advanced–Tagging is supported
The following table shows the status of each supported feature for each natural language.
Inflection­al Stem­ming
Tagging
Language
Multiword Units
Word Seg­mentation
Compound Words
Simplified Chinese
Tagged Stemming
XXX**X*X
XXX***XXEnglish
XXXXXFrench
XXXXXXGerman
XXXX*XJapanese
XXXXXSpanish
2010-12-0220
Linguistic Analysis Support
* Compound analysis is supported by the expanded language module for the language.
** Because Chinese words are not inflected, the stems of all Chinese words are identical to their
source forms. Therefore, stemming is not supported for Chinese.
*** For English only, derivational stemming is also supported.
Related Topics
Multiword Units

Word Segmentation

Stemming
Compound Word Stemming
Expanded Inflectional Stemming
Derivational Stemming
Part-of-Speech Support
Tagged Stemming
Language Modules Reference
3.2 Segment Generation
During the analysis of unstructured text, text processing objects operate on one segment of a data stream at a time. Segments are small units of text, including one or more complete paragraphs. Linguistic analysis operations break input streams into chunks. This chunking of the data stream is called segment generation.
Segment generation involves two steps: reading in the input text as a byte stream and breaking it into segments. The resulting segments contain associated metadata markup about the context text. These segments are then passed on for further linguistic analysis from which words, sentences, and paragraphs can then be extracted.
3.3 Word Segmentation
The word segmentation operation performs basic word breaking. It breaks text into the smallest, meaningful syntactic units, such as words or punctuation. The word segmenter also identifies idiomatic phrases, such as "case in point" or "out-of-the-box." These idiomatic phrases are processed as a single unit or word. Hyphenated words are not broken, since they are syntactic units. However, contractions (such as "don't") and elisions (such as "l'abri") are separated into their syntactic units.
2010-12-0221
Linguistic Analysis Support
3.3.1 White Space Languages
White space languages mark word boundaries with white space and punctuation marks. This group includes European, Balkan, and Middle Eastern languages, as well as Korean. Punctuation marks sometimes end a sentence, in which case they are used in sentence detection.
Non-white space languages include the Chinese languages, Japanese, and Thai (CCJT for short). Word segmentation in the CCJT languages occurs with a slightly different algorithm due to their structure. Because complete morphological analysis is required to perform word segmentation in these languages, the word segmentation, stemming, and part-of-speech tagging operations occur in a single step.
3.3.1.1 Multiword Units
By default, multiword units are segmented as a single unit, for example, "to and fro" and "Buenos Aires" are each segmented as one unit. However, you can turn this behavior off. In this case, multiword units are broken into their individual components. For example, "to and fro" is segmented into three units instead of one.
3.3.1.2 Punctuation
Word segmentors generally split off punctuation marks as separate units. This includes periods and commas, sentence-ending punctuation, and various quotation marks.
The following table summarizes punctuation-related segmentation conventions:
If a punctuation mark is followed by a character
No Whitespace
Abbreviations
and not by white space, it is not split off from its surrounding word. For example: "filename.filetype" is segmented as "filename.filetype".
Abbreviations ending in a period are important exceptions to the general rule that splits punctua­tion from their terms; their periods remain with them.
2010-12-0222
Linguistic Analysis Support
Apostrophes
Hyphens
3.4 Case Normalization Rules
Contractions spelled with apostrophes (like can't, don't, etc. in English) are handled via language­specific rules.
Embedded and trailing hyphens are not split off from their words. Leading hyphens are not split off before a digit expression, for example, -1000 is segmented as one unit.
Case normalization provides case-normalized alternatives for words which, by their position in a sentence or because they occur in a title, may or may not appear with their inherent, meaningful capitalization. For instance, a proper noun like SAP is always capitalized, but a common noun like horse is only capitalized if it begins a sentence or occurs in a title. Therefore, if Horse is encountered, the case normalizer provides the lower-case alternative so that later processing will not mistake Horse for a proper noun. The two resulting alternatives can then be passed on to the stemming or tagging operations.
Note:
Case normalization is not relevant to languages that do not distinguish between upper and lower case, for example, the CCJT languages.
Case normalization depends on the type of sentence (normal sentence, title, or query) and the position of the word to be normalized in each sentence type. The important position to consider is the sentence-initial position, where special normalization rules may apply. Words directly following certain punctuation marks are also treated as if they are in sentence-initial position.
Title sentence
All capitalized words are normalized. For example, a newspaper heading would be normalized as:
Cardinals Strike Out( Cardinals | cardinals ) ( Strike | strike ) (Out | out )
Query sentence
Lowercase words are normalized to their upper case variants. Capitalized and all-caps words are not normalized in query sentences.
aaaa: aaaa, Aaaa, AAAA
aaaA: aaaA, AaaA
Normal sentence
2010-12-0223
Linguistic Analysis Support
Capitalized words are normalized when they occur in sentence-initial position. All-caps words in sentence-initial position are also normalized. In other positions of normal sentences, capitalized and all-uppercase words are not normalized. For instance:
Aaaa bbb Cccc:(Aaaa | aaaa) (bbb) (Cccc)
AAAA bbb CCCC: (AAAA | Aaaa | aaaa) (bbb) (CCCC)
3.5 Stemming
Words like speaks or speaking have one stem– speak. Some words have more than one possible stem: spoke, for instance, may turn out, in context, to be the past tense of the verb speak, but it could also be the singular form of the noun spoke. A stem is a base form for one or more variant (source) forms found in text; it is the form referenced in the dictionary.
Stemming a word means finding and returning its stem. For example, rather than redundantly deal with grind, grinds, grinding, ground, and so on, all of these source forms can be recognized as variants of the single verb grind. Ground can also be a noun whose meaning is completely unrelated to the verb grind.
The example of indexing documents according to key words they contain can help to better understand the advantages of working with more abstract forms. If indexing is done naïvely, grind, grinds, grinding, ground will be handled as unrelated words, and a query containing one of these variants will not return documents containing the other variants. With the use of a stemmer, however, all of the variants will be indexed under the base form grind (verb).
The stemmer the software uses receives input of a series of syntactic units (for example, ground ) and associates each unit with one or more base forms (for example, ground , grind ). The stemmer always returns all possible alternative stems for each input term.
The software distinguishes between standard inflectional stemming and derivational stemming. The stemmers are inflectional by default. Derived stemmers are indicated as such.
Inflectional stemming is provided for every supported language. At present, derivational stemming is supported only for English.
For some languages, two different inflectional stemmers are included–the standard inflectional stemmer and an expanded inflectional stemmer that is more permissive of variation in the input text.
The stemmers support several different variants of the stemming operation:
The standard variant returns all possible normalized stems for the input. It also performs compound
analysis in languages like German, such that compound words are broken into their component parts.
The expanded variant covers the same normalization as the standard variant, but it is biased for
recall by allowing wider variation in capitalization, accentuation, and similar features, as found in informal text.
2010-12-0224
Linguistic Analysis Support
In German, the no-split stemmer supports compound stemming without breaking the compound into
separate stems, which provides better browsability.
In English, the derivational variant provides the root stem for morphologically derived words.
Related Topics

Standard Inflectional Stemming

Expanded Inflectional Stemming
Derivational Stemming
3.5.1 Standard Inflectional Stemming
With inflectional stemming, words retain the part of speech (noun, verb, and so on) of the base forms. For example, the verb forms speaks and speaking remain verbs like the base form speak, even while incorporating changes related to person (first, second, third person), number (singular and plural), tense (present, past, future), aspect (progressive) or other grammatical features.
Here are some additional examples:
Stems toExample
{aller, vais, vas, va, allons, allez, vont} [French]
{reach, reaches, reached, reaching}
{big, bigger, biggest}
{balloon, balloons}
{go, goes, going, gone, went}
aller
reach
big
balloon
go
The bold words are the stems (dictionary forms). The characters added to the stem (es in reaches, s in balloons ) are called inflections or affixes.
To handle unknown words such as neologisms, the standard stemmer contains a set of morphological rules that apply to words.
2010-12-0225
Linguistic Analysis Support
3.5.2 Expanded Inflectional Stemming
The expanded inflectional stemming dictionaries provide all the same functionality as the standard stemmers provided, and more. The expanded inflectional stemmer allows for certain non-standard word forms–for example, capitalization errors–as well as standard forms. Thus it can be used to process informal or imperfect text (such as email, online documents, or queries). The variation it handle includes case variation, hyphenation and unaccented characters among others. The expanded variant of the CCJT languages is designed for more granular stemming results suitable for index generation.
3.5.3 Inflectional Stemmer Guesser
The inflectional stemmer guesser contains morphological rules that can be applied to syntactic units that are unknown to the standard or expanded inflectional stemmer and, therefore, cannot be stemmed. The software provides inflectional stemmer guessers for English, French, German, and Spanish.
3.5.4 Compound Word Stemming
Compound words are those like bookmark or birdbath, formed by combining or concatenating several words. German is especially famous for its compounds, for example, Bildungsroman from Bildung "education" and Roman "novel", and Weltanschauung from Welt "world" and Anschauung "view".
The software performs compound analysis for German. In German, compounds are always separated into their component stems.
3.5.5 Non-Decompounding Stemming
The German language module includes a variant no-split stemmer that does not perform de-compounding in the stemmer. This stemmer stems the head of the compound, but does not split the compound into separate stems. For example, the plural compound Bildungsromane is stemmed to Bildungsroman, but is not split into component stems. The returned stem is always a single term; and since there is no compound boundary marker, the term cannot be broken up.
If alternate stems are possible, more than one stem may be returned, as with the standard and expanded stemmers.
2010-12-0226
Linguistic Analysis Support
3.5.6 Derivational Stemming
Derivational stemming involves cases in which words and stems may or may not have the same part of speech: a noun may be derived from a verb stem (as for participation and participate), or an adjective may be derived from a noun (as for boyish and boy). Here are more derivational examples:
{introduction, introductory, introducer} from introduce
{subcategory, categorize, categorization} from category
{useful, usable, unusable} from use
{reenlist} from enlist
Derivational stemming is currently supported for English only.
3.5.7 Stemming Unknown Words
The stemmer identifies the stems of all the standard words of a language. However, an unknown word, such as one not found in the system dictionary, will not have a stem. In general, the stemmer returns the input term as the stem itself. A complicating factor is that, due to case-normalization, the input to the stemmer may include more than one variant term for a given word. This means that one variant might be found while another might not be. By default, the stemmer returns the stems of found terms and removes unfound terms from the results.
For example, at the beginning of a sentence, the word Dogs would be normalized as the disjunction (Dogs | dogs). In such cases, the stemmer considers both members of the disjunction–both Dogs and dogs. Assume that lower-case dogs is in the stemmer dictionary, and that capitalized Dogs is absent. Since Dogs is not in the dictionary (and considered an unfound word), it would stem to Dogs itself. Since dogs is in the dictionary, it stems to dog. By default, the stemmer discards the unknown word Dogs and returns dog as the stem of the found variant. This is the default behavior.
If none of the case-normalized variants is found, then the stemmer returns all the case-normalized variants. For example, suppose the input sentence begins with the unknown word Fbzzz. The case normalizer returns the disjunction (Fbzzz | fbzzz). The stemmer finds neither one in the dictionary and returns both forms as stems.
Related Topics
Case Normalization Rules
2010-12-0227
Linguistic Analysis Support
3.6 Part-of-Speech Support
The part-of-speech tagger identifies and labels the part of speech for each word in context. A word's part-of-speech is the grammatical category it falls into, such as noun or verb, along with subclass attributes of each of these major categories, such as singular or plural for nouns, and present or past tense for verbs.
For certain of its language modules, the software supports the use of two types of parts-of-speech tags. You can also use these tags when creating extraction rules:
Umbrella tags–These tags identify major parts-of-speech at a high level, without breaking down the
part of speech further than its overall function. For example, the Nn tag identifies all nouns, regardless of whether they are singular or plural, feminine or masculine, and so on.
Complete tags–These tags identify the exact part-of-speech, along with its attributes. For example,
the Nn-Pl tag identifies plural nouns, and V-Pres-3-sg identifies present tense, 3rd person singular verbs.
For specific details about the tag sets in each supported language, refer to the chapter for that language in the "
Language Module Reference
3.6.1 Tag Name Conventions
Tags consist of feature names separated by hyphens. The first feature name is called a category tag. It usually specifies the high level part of speech of the word, for example, noun or verb, abbreviated as Nn and V respectively. When the tag contains more than one part-of-speech, as in V/Adj or Det/Pron, this indicates that the part-of-speech can be of either category.
Feature tags classify the word more precisely. They may indicate number (for example, plural and singular), person (for example, first, second or third), or tense (for example, present and past). Thus, the tag V-Pres-3-Sg indicates that the verb is present tense, third person singular.
When a feature appears in all lower case, as in the tag Prep-para from the Spanish tagger, it stands for a word in that language (here, Spanish para), and means that the word's distribution differs enough from that of other words of its category to rate its own feature. Such very specific features are listed in the language-specific tables.
For specific details about the tag sets in each supported language, refer to the chapter for that language in the "
Language Modules Reference
" part of this guide.
" part of this guide.
2010-12-0228
Linguistic Analysis Support
3.6.2 Unfound Words
Words not found in the tagger dictionary are passed to the relevant guesser to be assigned the most likely tag. The guesser assigns tags to unfound words based on a set of rules about the morphology of the given language. Capitalization information may also be used as capitalized words are also proper nouns in many languages. Combinations of alphabetic, numeric and optionally, punctuation characters tend to be guessed as proper nouns as well. Ordinal numbers are tagged either as noun or adjective, depending on the context. Internet and e-mail addresses are assigned the tag Nn-Net.
In the Asian languages, unfound words are assigned the tag Nn by default.
3.6.3 Tagged Stemming
The tagged stemming operation provides complete linguistic analysis of input text, including stemming with respect to part-of-speech information. This operation segments text into words and punctuation, performs document analysis, case normalization, and part-of-speech tagging. Then, given a term and its part-of-speech tag, it performs stemming of the term. For example, for the input term-tag pair children[Nn-Pl], the output is child.
3.6.4 Word Breaking
The word-breaking operation segments text into words and punctuation, performs document analysis, case normalization, and part-of-speech tagging.
2010-12-0229
Linguistic Analysis Support
2010-12-0230
Loading...
+ 230 hidden pages