Simplified Chinese: Types of Information Extracted..............................................................251
Public Sector Entities–Simplified Chinese.............................................................................251
Index257
2010-12-025
Contents
2010-12-026
Introduction
Introduction
1.1 Welcome to SAP BusinessObjects Data Services
1.1.1 Welcome
SAP BusinessObjects Data Services delivers a single enterprise-class solution for data integration,
data quality, data profiling, and text data processing that allows you to integrate, transform, improve,
and deliver trusted data to critical business processes. It provides one development UI, metadata
repository, data connectivity layer, run-time environment, and management console—enabling IT
organizations to lower total cost of ownership and accelerate time to value. With SAP BusinessObjects
Data Services, IT organizations can maximize operational efficiency with a single solution to improve
data quality and gain access to heterogeneous sources and applications.
1.1.2 Documentation set for SAP BusinessObjects Data Services
You should become familiar with all the pieces of documentation that relate to your SAP BusinessObjects
Data Services product.
What this document providesDocument
Administrator's Guide
Customer Issues Fixed
Designer Guide
Information about administrative tasks such as monitoring,
lifecycle management, security, and so on.
Information about customer issues fixed in this release.
Information about how to use SAP BusinessObjects Data
Services Designer.
Documentation Map
Information about available SAP BusinessObjects Data Services books, languages, and locations.
2010-12-027
Introduction
What this document providesDocument
Installation Guide for Windows
Installation Guide for UNIX
Integrator's Guide
Management Console Guide
Performance Optimization Guide
Reference Guide
Release Notes
Technical Manuals
Information about and procedures for installing SAP BusinessObjects Data Services in a Windows environment.
Information about and procedures for installing SAP BusinessObjects Data Services in a UNIX environment.
Information for third-party developers to access SAP BusinessObjects Data Services functionality using web services and
APIs.
Information about how to use SAP BusinessObjects Data
Services Administrator and SAP BusinessObjects Data Services Metadata Reports.
Information about how to improve the performance of SAP
BusinessObjects Data Services.
Detailed reference material for SAP BusinessObjects Data
Services Designer.
Important information you need before installing and deploying
this version of SAP BusinessObjects Data Services.
A compiled “master” PDF of core SAP BusinessObjects Data
Services books containing a searchable master table of contents and index:
•
Administrator's Guide
•
Designer Guide
•
Reference Guide
•
Management Console Guide
•
Performance Optimization Guide
•
Supplement for J.D. Edwards
•
Supplement for Oracle Applications
•
Supplement for PeopleSoft
•
Supplement for Salesforce.com
•
Supplement for Siebel
•
Supplement for SAP
Text Data Processing Extraction Customization Guide
Text Data Processing Language Reference
Guide
Information about building dictionaries and extraction rules to
create your own extraction patterns to use with Text Data
Processing transforms.
Information about the linguistic analysis and extraction processing features that the Text Data Processing component provides, as well as a reference section for each language supported.
2010-12-028
Introduction
What this document providesDocument
Tutorial
Upgrade Guide
What's New
In addition, you may need to refer to several Adapter Guides and Supplemental Guides.
Supplement for J.D. Edwards
Supplement for Oracle Applications
Supplement for PeopleSoft
A step-by-step introduction to using SAP BusinessObjects
Data Services.
Release-specific product behavior changes from earlier versions of SAP BusinessObjects Data Services to the latest release. This manual also contains information about how to
migrate from SAP BusinessObjects Data Quality Management
to SAP BusinessObjects Data Services.
Highlights of new key features in this SAP BusinessObjects
Data Services release. This document is not updated for support package or patch releases.
What this document providesDocument
Information about interfaces between SAP BusinessObjects Data Services
and J.D. Edwards World and J.D. Edwards OneWorld.
Information about the interface between SAP BusinessObjects Data Services
and Oracle Applications.
Information about interfaces between SAP BusinessObjects Data Services
and PeopleSoft.
Supplement for Salesforce.com
Supplement for SAP
Supplement for Siebel
Information about how to install, configure, and use the SAP BusinessObjects
Data Services Salesforce.com Adapter Interface.
Information about interfaces between SAP BusinessObjects Data Services,
SAP Applications, and SAP NetWeaver BW.
Information about the interface between SAP BusinessObjects Data Services
and Siebel.
We also include these manuals for information about SAP BusinessObjects Information platform services.
Information platform services Administrator's Guide
Information platform services Installation Guide for
UNIX
What this document providesDocument
Information for administrators who are responsible for
configuring, managing, and maintaining an Information
platform services installation.
Installation procedures for SAP BusinessObjects Information platform services on a UNIX environment.
2010-12-029
Introduction
What this document providesDocument
Information platform services Installation Guide for
Windows
1.1.3 Accessing documentation
You can access the complete documentation set for SAP BusinessObjects Data Services in several
places.
1.1.3.1 Accessing documentation on Windows
After you install SAP BusinessObjects Data Services, you can access the documentation from the Start
menu.
1.
Choose Start > Programs > SAP BusinessObjects Data Services XI 4.0 > Data Services
Documentation.
Installation procedures for SAP BusinessObjects Information platform services on a Windows environment.
Note:
Only a subset of the documentation is available from the Start menu. The documentation set for this
release is available in <LINK_DIR>\Doc\Books\en.
2.
Click the appropriate shortcut for the document that you want to view.
1.1.3.2 Accessing documentation on UNIX
After you install SAP BusinessObjects Data Services, you can access the online documentation by
going to the directory where the printable PDF files were installed.
1.
Go to <LINK_DIR>/doc/book/en/.
2.
Using Adobe Reader, open the PDF file of the document that you want to view.
1.1.3.3 Accessing documentation from the Web
2010-12-0210
Introduction
You can access the complete documentation set for SAP BusinessObjects Data Services from the SAP
BusinessObjects Business Users Support site.
1.
Go to http://help.sap.com.
2.
Click SAP BusinessObjects at the top of the page.
3.
Click All Products in the navigation pane on the left.
You can view the PDFs online or save them to your computer.
1.1.4 SAP BusinessObjects information resources
A global network of SAP BusinessObjects technology experts provides customer support, education,
and consulting to ensure maximum information management benefit to your business.
Useful addresses at a glance:
2010-12-0211
Introduction
ContentAddress
Customer Support, Consulting, and Education
services
http://service.sap.com/
SAP BusinessObjects Data Services Community
http://www.sdn.sap.com/irj/sdn/ds
Forums on SCN (SAP Community Network )
http://forums.sdn.sap.com/forum.jspa?foru
mID=305
Blueprints
http://www.sdn.sap.com/irj/boc/blueprints
Information about SAP Business User Support
programs, as well as links to technical articles,
downloads, and online forums. Consulting services
can provide you with information about how SAP
BusinessObjects can help maximize your information management investment. Education services
can provide information about training options and
modules. From traditional classroom learning to
targeted e-learning seminars, SAP BusinessObjects
can offer a training package to suit your learning
needs and preferred learning style.
Get online and timely information about SAP BusinessObjects Data Services, including tips and tricks,
additional downloads, samples, and much more.
All content is to and from the community, so feel
free to join in and contact us if you have a submission.
Search the SAP BusinessObjects forums on the
SAP Community Network to learn from other SAP
BusinessObjects Data Services users and start
posting questions or share your knowledge with the
community.
Blueprints for you to download and modify to fit your
needs. Each blueprint contains the necessary SAP
BusinessObjects Data Services project, jobs, data
flows, file formats, sample data, template tables,
and custom functions to run the data flows in your
environment with only a few modifications.
http://help.sap.com/businessobjects/
Supported Platforms (Product Availability Matrix)
https://service.sap.com/PAM
1.2 Overview of This Guide
SAP BusinessObjects product documentation.Product documentation
Get information about supported platforms for SAP
BusinessObjects Data Services.
Use the search function to search for Data Services.
Click the link for the version of Data Services you
are searching for.
2010-12-0212
Introduction
Welcome to the
SAP BusinessObjects Data Services text data processing software enables you to perform linguistic
analysis of and extraction of content from unstructured text.
Linguistic analysis includes natural-language processing (NLP) capabilities, such as segmentation,
stemming, and tagging, among other things. Extraction analyzes unstructured text, in multiple languages
and from any text data source, and automatically identifies and extracts key entity types, including
people, dates, places, organizations, or other information, from the text.
Language Reference Guide
1.2.1 About This Guide
This guide contains two kinds of information:
•Overviews and conceptual information about the linguistic analysis and extraction features provided
by the software.
•A reference section for each language supported by the software. It describes the behavior of the
supported language modules during extraction and normalization.
.
1.2.2 Who Should Read This Guide
Users of this guide may need to enhance extraction in their text analytics application and should
understand text data processing extraction concepts. However, users of this guide are not expected to
understand or be familiar with the natural languages of the text being processed by the software.
Similarly, users are not required to be familiar with linguistic principles. This document assumes the
following:
•You are an application developer or consultant working on enhancing text data processing extraction.
•You understand your organization's text data processing extraction needs.
2010-12-0213
Introduction
2010-12-0214
Overview of Linguistic Analysis and Extraction
Overview of Linguistic Analysis and Extraction
The software includes language modules for the languages supported. Each language module consists
of a set of files that include system dictionaries containing words to support the language processing
operations for the given natural language. It is the language modules that enable linguistic analysis and
extraction of unstructured text in a given language. Language modules use the following language
processing technologies:
•Linguistic analysis to handle natural language processing
•Extraction to handle entity extraction
Related Topics
• Linguistic Analysis Support
• Extraction Support
2.1 About Linguistic Analysis
The software provides and uses sophisticated natural language processing capabilities for linguistic
analysis of unstructured data. Some of these capabilities include:
•Segmentation–the separation of input text into its elements
•Stemming–the identification of word stems, or dictionary forms
•Tagging–the labeling of words' parts of speech
Related Topics
• Linguistic Analysis Support
• Language Modules Reference
2.2 About Extraction
2010-12-0215
Overview of Linguistic Analysis and Extraction
Extraction is the process of discovering and presenting specific entities and facts that occur in
unstructured text.
•Entities denote the names of people, places, things, dates, values, and so forth, that can be extracted
from text. An entity is defined as a pairing of a standard form and its type. For example, WinstonChurchill/PERSON is an entity in which Winston Churchill is the standard form and PERSON
is the type.
•Facts are entities and subentities, found during the extraction process, that represent relationships,
events, sentiments, or requests. Facts are extracted based on extraction rules consisting of patterns
that define the expressions to use to extract the information. The specialized voice of the customer
content, for example, provides the rules that let you extract facts that represent sentiments and
requests.
The language modules included with the software contain system dictionaries and provide an extensive
set of predefined entity types. The extraction process can extract entities using these lists of specific
entities. It can also discover new entities using linguistic models. Extraction classifies each extracted
entity by entity type and presents this metadata in a standardized format.
Related Topics
• Extraction Support
• Predefined Entity Type Support
• About Customizing Extraction
• Languages Modules Supported
• Language Modules Reference
• Specialized Extraction Content
2.2.1 About Customizing Extraction
You can enhance the extraction process by creating and using:
•Dictionaries that contain information about entities. You can customize information about the entities
your application must find.
•Extraction rules.
For details about enhancing extraction, refer to the
Data Processing Extraction Customization Guide
For certain language modules, you can also enhance extraction by using the specialized extraction
content included in them.
Related Topics
• Specialized Extraction Content
SAP BusinessObjects Data Services XI 4.0 Text
.
2010-12-0216
Overview of Linguistic Analysis and Extraction
2.3 Languages Modules Supported
The software provides these language modules, which are supported by linguistic analysis and extraction:
•English
•French
•German
•Japanese
•Simplified Chinese
•Spanish
Note:
Not all linguistic analysis and extraction features are supported for all languages.
Related Topics
• Linguistic Analysis Language Feature Matrix
• Levels of Extraction Support for the Language Modules
• Language Modules Reference
2.4 Specialized Extraction Content
Certain language modules include specialized content that provides entity types and sets of rules that
address specific needs:
Specialized Extraction Content
Voice of the customer
Description
Extracts specific information
about your customers' needs
(requests) and perceptions and
problems (sentiments)
Included in These Language
Modules
English
French
Spanish
Enterprise
Extracts enterprise-specific information, such as management
changes and product releases
English
2010-12-0217
Overview of Linguistic Analysis and Extraction
Specialized Extraction Content
Public Sector
Related Topics
• Voice of the Customer Content
• Enterprise Content
• Public Sector Content
Description
Extracts public-sector-specific
information, such as events and
relations
Included in These Language
Modules
English
Simplified Chinese
2010-12-0218
Linguistic Analysis Support
Linguistic Analysis Support
The software provides and uses these linguistic analysis features for multilingual natural language
processing (NLP) of unstructured data:
Language and encoding identification
DescriptionFeature
The automatic recognition of the input language,
for example, French or Japanese, and of various
character encodings (such as Unicode UTF-8 and
Code Page 1252).
Segment generation
Word segmentation
Case normalization
Stemming
Tagging
Document analysis
The breaking of input text into segments of one
or more complete paragraphs for more efficient
processing.
The separation of input text into its elements, such
as words and punctuation.
The normalization of the initial letter of a word to
upper or lower case. Used to counteract case
changes related to document structure, such as
title and heading capitalization.
The identification of word stems, or dictionary
forms, for text or single words.
The labeling of words' parts of speech, for example, noun or verb.
The recognition of a document's major sections–paragraphs and sentences.
Tagged stemming
The identification of word stems for a word of a
given part-of-speech.
2010-12-0219
Linguistic Analysis Support
Note:
Not all operations are supported for all languages.
Related Topics
• Linguistic Analysis Language Feature Matrix
• Segment Generation
• Word Segmentation
• Case Normalization Rules
• Stemming
• Part-of-Speech Support
• Tagged Stemming
• Language Modules Reference
3.1 Linguistic Analysis Language Feature Matrix
Linguistic analysis provides two levels of language support:
•Standard–Tagging is not supported
•Advanced–Tagging is supported
The following table shows the status of each supported feature for each natural language.
Inflectional Stemming
Tagging
Language
Multiword
Units
Word Segmentation
Compound
Words
Simplified
Chinese
Tagged
Stemming
XXX**X*X
XXX***XXEnglish
XXXXXFrench
XXXXXXGerman
XXXX*XJapanese
XXXXXSpanish
2010-12-0220
Linguistic Analysis Support
•* Compound analysis is supported by the expanded language module for the language.
•** Because Chinese words are not inflected, the stems of all Chinese words are identical to their
source forms. Therefore, stemming is not supported for Chinese.
•*** For English only, derivational stemming is also supported.
Related Topics
• Multiword Units
• Word Segmentation
• Stemming
• Compound Word Stemming
• Expanded Inflectional Stemming
• Derivational Stemming
• Part-of-Speech Support
• Tagged Stemming
• Language Modules Reference
3.2 Segment Generation
During the analysis of unstructured text, text processing objects operate on one segment of a data
stream at a time. Segments are small units of text, including one or more complete paragraphs. Linguistic
analysis operations break input streams into chunks. This chunking of the data stream is called segment
generation.
Segment generation involves two steps: reading in the input text as a byte stream and breaking it into
segments. The resulting segments contain associated metadata markup about the context text. These
segments are then passed on for further linguistic analysis from which words, sentences, and paragraphs
can then be extracted.
3.3 Word Segmentation
The word segmentation operation performs basic word breaking. It breaks text into the smallest,
meaningful syntactic units, such as words or punctuation. The word segmenter also identifies idiomatic
phrases, such as "case in point" or "out-of-the-box." These idiomatic phrases are processed as a single
unit or word. Hyphenated words are not broken, since they are syntactic units. However, contractions
(such as "don't") and elisions (such as "l'abri") are separated into their syntactic units.
2010-12-0221
Linguistic Analysis Support
3.3.1 White Space Languages
White space languages mark word boundaries with white space and punctuation marks. This group
includes European, Balkan, and Middle Eastern languages, as well as Korean. Punctuation marks
sometimes end a sentence, in which case they are used in sentence detection.
Non-white space languages include the Chinese languages, Japanese, and Thai (CCJT for short).
Word segmentation in the CCJT languages occurs with a slightly different algorithm due to their structure.
Because complete morphological analysis is required to perform word segmentation in these languages,
the word segmentation, stemming, and part-of-speech tagging operations occur in a single step.
3.3.1.1 Multiword Units
By default, multiword units are segmented as a single unit, for example, "to and fro" and "Buenos Aires"
are each segmented as one unit. However, you can turn this behavior off. In this case, multiword units
are broken into their individual components. For example, "to and fro" is segmented into three units
instead of one.
3.3.1.2 Punctuation
Word segmentors generally split off punctuation marks as separate units. This includes periods and
commas, sentence-ending punctuation, and various quotation marks.
The following table summarizes punctuation-related segmentation conventions:
If a punctuation mark is followed by a character
No Whitespace
Abbreviations
and not by white space, it is not split off from its
surrounding word. For example: "filename.filetype"
is segmented as "filename.filetype".
Abbreviations ending in a period are important
exceptions to the general rule that splits punctuation from their terms; their periods remain with
them.
2010-12-0222
Linguistic Analysis Support
Apostrophes
Hyphens
3.4 Case Normalization Rules
Contractions spelled with apostrophes (like can't,
don't, etc. in English) are handled via languagespecific rules.
Embedded and trailing hyphens are not split off
from their words. Leading hyphens are not split
off before a digit expression, for example, -1000
is segmented as one unit.
Case normalization provides case-normalized alternatives for words which, by their position in a sentence
or because they occur in a title, may or may not appear with their inherent, meaningful capitalization.
For instance, a proper noun like SAP is always capitalized, but a common noun like horse is only
capitalized if it begins a sentence or occurs in a title. Therefore, if Horse is encountered, the case
normalizer provides the lower-case alternative so that later processing will not mistake Horse for a
proper noun. The two resulting alternatives can then be passed on to the stemming or tagging operations.
Note:
Case normalization is not relevant to languages that do not distinguish between upper and lower case,
for example, the CCJT languages.
Case normalization depends on the type of sentence (normal sentence, title, or query) and the position
of the word to be normalized in each sentence type. The important position to consider is the
sentence-initial position, where special normalization rules may apply. Words directly following certain
punctuation marks are also treated as if they are in sentence-initial position.
•Title sentence
All capitalized words are normalized. For example, a newspaper heading would be normalized as:
Lowercase words are normalized to their upper case variants. Capitalized and all-caps words are
not normalized in query sentences.
•aaaa: aaaa, Aaaa, AAAA
•aaaA: aaaA, AaaA
•Normal sentence
2010-12-0223
Linguistic Analysis Support
Capitalized words are normalized when they occur in sentence-initial position. All-caps words in
sentence-initial position are also normalized. In other positions of normal sentences, capitalized and
all-uppercase words are not normalized. For instance:
•Aaaa bbb Cccc:(Aaaa | aaaa) (bbb) (Cccc)
•AAAA bbb CCCC: (AAAA | Aaaa | aaaa) (bbb) (CCCC)
3.5 Stemming
Words like speaks or speaking have one stem– speak. Some words have more than one possible
stem: spoke, for instance, may turn out, in context, to be the past tense of the verb speak, but it could
also be the singular form of the noun spoke. A stem is a base form for one or more variant (source)
forms found in text; it is the form referenced in the dictionary.
Stemming a word means finding and returning its stem. For example, rather than redundantly deal with
grind, grinds, grinding, ground, and so on, all of these source forms can be recognized as variants
of the single verb grind. Ground can also be a noun whose meaning is completely unrelated to the
verb grind.
The example of indexing documents according to key words they contain can help to better understand
the advantages of working with more abstract forms. If indexing is done naïvely, grind, grinds, grinding,
ground will be handled as unrelated words, and a query containing one of these variants will not return
documents containing the other variants. With the use of a stemmer, however, all of the variants will
be indexed under the base form grind (verb).
The stemmer the software uses receives input of a series of syntactic units (for example, ground ) and
associates each unit with one or more base forms (for example, ground , grind ). The stemmer always
returns all possible alternative stems for each input term.
The software distinguishes between standard inflectional stemming and derivational stemming. The
stemmers are inflectional by default. Derived stemmers are indicated as such.
Inflectional stemming is provided for every supported language. At present, derivational stemming is
supported only for English.
For some languages, two different inflectional stemmers are included–the standard inflectional stemmer
and an expanded inflectional stemmer that is more permissive of variation in the input text.
The stemmers support several different variants of the stemming operation:
•The standard variant returns all possible normalized stems for the input. It also performs compound
analysis in languages like German, such that compound words are broken into their component
parts.
•The expanded variant covers the same normalization as the standard variant, but it is biased for
recall by allowing wider variation in capitalization, accentuation, and similar features, as found in
informal text.
2010-12-0224
Linguistic Analysis Support
•In German, the no-split stemmer supports compound stemming without breaking the compound into
separate stems, which provides better browsability.
•In English, the derivational variant provides the root stem for morphologically derived words.
Related Topics
• Standard Inflectional Stemming
• Expanded Inflectional Stemming
• Derivational Stemming
3.5.1 Standard Inflectional Stemming
With inflectional stemming, words retain the part of speech (noun, verb, and so on) of the base forms.
For example, the verb forms speaks and speaking remain verbs like the base form speak, even while
incorporating changes related to person (first, second, third person), number (singular and plural), tense
(present, past, future), aspect (progressive) or other grammatical features.
Here are some additional examples:
Stems toExample
{aller, vais, vas, va, allons, allez, vont} [French]
{reach, reaches, reached, reaching}
{big, bigger, biggest}
{balloon, balloons}
{go, goes, going, gone, went}
aller
reach
big
balloon
go
The bold words are the stems (dictionary forms). The characters added to the stem (es in reaches, s
in balloons ) are called inflections or affixes.
To handle unknown words such as neologisms, the standard stemmer contains a set of morphological
rules that apply to words.
2010-12-0225
Linguistic Analysis Support
3.5.2 Expanded Inflectional Stemming
The expanded inflectional stemming dictionaries provide all the same functionality as the standard
stemmers provided, and more. The expanded inflectional stemmer allows for certain non-standard word
forms–for example, capitalization errors–as well as standard forms. Thus it can be used to process
informal or imperfect text (such as email, online documents, or queries). The variation it handle includes
case variation, hyphenation and unaccented characters among others. The expanded variant of the
CCJT languages is designed for more granular stemming results suitable for index generation.
3.5.3 Inflectional Stemmer Guesser
The inflectional stemmer guesser contains morphological rules that can be applied to syntactic units
that are unknown to the standard or expanded inflectional stemmer and, therefore, cannot be stemmed.
The software provides inflectional stemmer guessers for English, French, German, and Spanish.
3.5.4 Compound Word Stemming
Compound words are those like bookmark or birdbath, formed by combining or concatenating several
words. German is especially famous for its compounds, for example, Bildungsroman from Bildung
"education" and Roman "novel", and Weltanschauung from Welt "world" and Anschauung "view".
The software performs compound analysis for German. In German, compounds are always separated
into their component stems.
3.5.5 Non-Decompounding Stemming
The German language module includes a variant no-split stemmer that does not perform de-compounding
in the stemmer. This stemmer stems the head of the compound, but does not split the compound into
separate stems. For example, the plural compound Bildungsromane is stemmed to Bildungsroman,
but is not split into component stems. The returned stem is always a single term; and since there is no
compound boundary marker, the term cannot be broken up.
If alternate stems are possible, more than one stem may be returned, as with the standard and expanded
stemmers.
2010-12-0226
Linguistic Analysis Support
3.5.6 Derivational Stemming
Derivational stemming involves cases in which words and stems may or may not have the same part
of speech: a noun may be derived from a verb stem (as for participation and participate), or an
adjective may be derived from a noun (as for boyish and boy). Here are more derivational examples:
•{introduction, introductory, introducer} from introduce
•{subcategory, categorize, categorization} from category
•{useful, usable, unusable} from use
•{reenlist} from enlist
Derivational stemming is currently supported for English only.
3.5.7 Stemming Unknown Words
The stemmer identifies the stems of all the standard words of a language. However, an unknown word,
such as one not found in the system dictionary, will not have a stem. In general, the stemmer returns
the input term as the stem itself. A complicating factor is that, due to case-normalization, the input to
the stemmer may include more than one variant term for a given word. This means that one variant
might be found while another might not be. By default, the stemmer returns the stems of found terms
and removes unfound terms from the results.
For example, at the beginning of a sentence, the word Dogs would be normalized as the disjunction
(Dogs | dogs). In such cases, the stemmer considers both members of the disjunction–both Dogs and
dogs. Assume that lower-case dogs is in the stemmer dictionary, and that capitalized Dogs is absent.
Since Dogs is not in the dictionary (and considered an unfound word), it would stem to Dogs itself.
Since dogs is in the dictionary, it stems to dog. By default, the stemmer discards the unknown word
Dogs and returns dog as the stem of the found variant. This is the default behavior.
If none of the case-normalized variants is found, then the stemmer returns all the case-normalized
variants. For example, suppose the input sentence begins with the unknown word Fbzzz. The case
normalizer returns the disjunction (Fbzzz | fbzzz). The stemmer finds neither one in the dictionary
and returns both forms as stems.
Related Topics
• Case Normalization Rules
2010-12-0227
Linguistic Analysis Support
3.6 Part-of-Speech Support
The part-of-speech tagger identifies and labels the part of speech for each word in context. A word's
part-of-speech is the grammatical category it falls into, such as noun or verb, along with subclass
attributes of each of these major categories, such as singular or plural for nouns, and present or past
tense for verbs.
For certain of its language modules, the software supports the use of two types of parts-of-speech tags.
You can also use these tags when creating extraction rules:
•Umbrella tags–These tags identify major parts-of-speech at a high level, without breaking down the
part of speech further than its overall function. For example, the Nn tag identifies all nouns, regardless
of whether they are singular or plural, feminine or masculine, and so on.
•Complete tags–These tags identify the exact part-of-speech, along with its attributes. For example,
the Nn-Pl tag identifies plural nouns, and V-Pres-3-sg identifies present tense, 3rd person singular
verbs.
For specific details about the tag sets in each supported language, refer to the chapter for that language
in the "
Language Module Reference
3.6.1 Tag Name Conventions
Tags consist of feature names separated by hyphens. The first feature name is called a category tag.
It usually specifies the high level part of speech of the word, for example, noun or verb, abbreviated as
Nn and V respectively. When the tag contains more than one part-of-speech, as in V/Adj or Det/Pron,
this indicates that the part-of-speech can be of either category.
Feature tags classify the word more precisely. They may indicate number (for example, plural and
singular), person (for example, first, second or third), or tense (for example, present and past). Thus,
the tag V-Pres-3-Sg indicates that the verb is present tense, third person singular.
When a feature appears in all lower case, as in the tag Prep-para from the Spanish tagger, it stands
for a word in that language (here, Spanish para), and means that the word's distribution differs enough
from that of other words of its category to rate its own feature. Such very specific features are listed in
the language-specific tables.
For specific details about the tag sets in each supported language, refer to the chapter for that language
in the "
Language Modules Reference
" part of this guide.
" part of this guide.
2010-12-0228
Linguistic Analysis Support
3.6.2 Unfound Words
Words not found in the tagger dictionary are passed to the relevant guesser to be assigned the most
likely tag. The guesser assigns tags to unfound words based on a set of rules about the morphology
of the given language. Capitalization information may also be used as capitalized words are also proper
nouns in many languages. Combinations of alphabetic, numeric and optionally, punctuation characters
tend to be guessed as proper nouns as well. Ordinal numbers are tagged either as noun or adjective,
depending on the context. Internet and e-mail addresses are assigned the tag Nn-Net.
In the Asian languages, unfound words are assigned the tag Nn by default.
3.6.3 Tagged Stemming
The tagged stemming operation provides complete linguistic analysis of input text, including stemming
with respect to part-of-speech information. This operation segments text into words and punctuation,
performs document analysis, case normalization, and part-of-speech tagging. Then, given a term and
its part-of-speech tag, it performs stemming of the term. For example, for the input term-tag pair
children[Nn-Pl], the output is child.
3.6.4 Word Breaking
The word-breaking operation segments text into words and punctuation, performs document analysis,
case normalization, and part-of-speech tagging.
2010-12-0229
Linguistic Analysis Support
2010-12-0230
Loading...
+ 230 hidden pages
You need points to download manuals.
1 point = 1 manual.
You can buy points or you can get point for every manual you upload.