Title: Chinese Information Extraction Technologies
1Chinese Information Extraction Technologies
- Hsin-Hsi Chen
- Department of Computer Science and Information
Engineering - National Taiwan University
- Taipei, Taiwan
- E-mail hh_chen_at_csie.ntu.edu.tw
2Outline
- Introduction to Information Extraction (IE)
- Chinese IE Technologies
- Tagging Environment for Chinese IE
- Applications
- Summary
3Introduction
4Introduction
- Information Extraction
- the extraction or pulling out of pertinent information from large volumes of text
- Information Extraction System
- an automated system to extract pertinent information from large volumes of text
- Information Extraction Technologies
- techniques used to automatically extract specified information from text
(http://www.itl.nist.gov/iaui/894.02/related_projects/muc/)
5An Example in Air Vehicle Launch
- Original Document
- Named-Entity-Tagged Document
- Equivalence Classes
- Co-Reference Tagged Document
6<DOC> <DOCID> NTU-AIR_LAUNCH-????-19970612-002 </DOCID> <DATASET> Air Vehicle Launch </DATASET> <DD> 1997/06/12 </DD> <DOCTYPE> ???? </DOCTYPE> <DOCSRC> ???? </DOCSRC> <TEXT> ????????
??????????????????? ????????????
???????,??????? ????????????,?????????????? ??????
?,?????????? ????????? ??????????? ????
?????????????????? ??????? ,?????????????????????
????? ??????????,????????????????
Legend: red = location name, blue = date expression, green = organization name, purple = person name
7???????????????,??????????? ???????,??????????
??????????????????????????? ??????????????????????
?????? ??????????????,????????????
??????????????????????????? ??????????,???????????
?????????????????,???????? ??????????????????
?????????? ????? ?????????????,????????????? ??
??????????????????????????, ??,???????????????????
8?????????????????????????? ????
?????????????????????????? ????????????
???????????????????????,? ??????????????????????
???? ?????? ??????,???????????? ?????????
</TEXT> </DOC>
9<DOC> <DOCID> NTU-AIR_LAUNCH-????-19970612-002 </DOCID> <DATASET> Air Vehicle Launch </DATASET> <DD> 1997/06/12 </DD> <DOCTYPE> ???? </DOCTYPE> <DOCSRC> ???? </DOCSRC> <ISRELEVANT> NO </ISRELEVANT> <TITLE> <ENAMEX TYPE="LOCATION">?</ENAMEX>??<ENAMEX TYPE="LOCATION">??</ENAMEX>1065????? </TITLE> <TEXT>
?????<ENAMEX TYPE="LOCATION">??</ENAMEX>?<ENAMEX TYPE="LOCATION">???</ENAMEX><TIMEX TYPE="DATE">???</TIMEX>??????<ENAMEX TYPE="LOCATION">???</ENAMEX>????????????????
?<ENAMEX TYPE="LOCATION">??</ENAMEX>?<TIMEX TYPE="DATE">???</TIMEX>,<ENAMEX TYPE="LOCATION">??</ENAMEX>?<ENAMEX TYPE="LOCATION">??</ENAMEX><TIMEX TYPE="DATE">??</TIMEX>?<ENAMEX TYPE="LOCATION">??</ENAMEX>?????????,???????????<ENAMEX TYPE="LOCATION">??</ENAMEX>????????,??<ENAMEX TYPE="LOCATION">??</ENAMEX>?????? ?????<ENAMEX TYPE="LOCATION">??</ENAMEX>????????????? ????
10<ID="3">??? <ID="4" REF="3">?? <ID="5" REF="3">???????????? ???????
<ID="63">??????? <ID="66" REF="63">????????????? ?????? ?????
<ID="65" REF="63">????????????? <ID="70" REF="65">?? <ID="69" REF="65">?? <ID="64" REF="63">?????????
11<DOC> <DOCID> NTU-AIR_LAUNCH-????-19970612-002 </DOCID> <DATASET> Air Vehicle Launch </DATASET> <DD> 1997/06/12 </DD> <DOCTYPE> ???? </DOCTYPE> <DOCSRC> <COREF ID="1">????</COREF> </DOCSRC> <ISRELEVANT> NO </ISRELEVANT> <TITLE> <COREF ID="6">?</COREF>??<COREF ID="23">??</COREF><COREF ID="45" REF="44" TYPE="IDENT" MIN="????">1065?????</COREF> </TITLE> <TEXT>
?<COREF ID="2" REF="1" TYPE="IDENT">??</COREF>??<COREF ID="61">??</COREF>?<COREF ID="8" STATUS="OPT" REF="6" TYPE="IDENT">???</COREF><COREF ID="3">???</COREF>??????<COREF ID="7" REF="6" TYPE="IDENT">???</COREF>????<COREF ID="5" STATUS="OPT" REF="3" TYPE="IDENT" MIN="???">???????????? ?<COREF ID="24" REF="23" TYPE="IDENT">??</COREF>????</COREF>,<COREF ID="77"><COREF ID="9" REF="6" TYPE="IDENT">??</COREF>?<COREF ID="29">??</COREF></COREF><COREF ID="4" REF="3" TYPE="IDENT">??</COREF>?<COREF ID="62" REF="61" TYPE="IDENT">??</COREF>??<COREF ID="63" MIN="??">???????</COREF>,<COREF ID="64" REF="63" TYPE="IDENT" MIN="??">?????????</COREF>??<COREF ID="81" STATUS="OPT" REF="75" TYPE="IDENT" MIN="??"><COREF ID="30" REF="29" TYPE="IDENT">??</COREF>???</COREF>?????,??<COREF ID="31" REF="29" TYPE="IDENT">??</COREF>??????
???????????????????? ????
12IE Evaluation in MUC-7 (1998)
- Named Entity Task (NE): Insert SGML tags into the text to mark each string that represents a person, organization, or location name, or a date or time stamp, or a currency or percentage figure.
- Multi-lingual Entity Task (MET): the NE task for Chinese and Japanese.
- Co-reference Task (CO): Capture information on co-referring expressions: all mentions of a given entity, including those tagged in the NE and TE tasks.
13IE Evaluation in MUC-7 (cont.)
- Template Element Task (TE): Extract basic information related to organization, person, and artifact entities, drawing evidence from anywhere in the text.
- Template Relation Task (TR): Extract relational information on employee_of, manufacture_of, and location_of relations.
- Scenario Template Task (ST): Extract pre-specified event information and relate the event information to particular organization, person, or artifact entities involved in the event.
14Chinese IE Technologies
- Segmentation
- Named Entity Extraction
- Part of Speech/Sense Tagging
- Full/Partial Parsing
- Co-Reference Resolution
15Segmentation
16Segmentation
- Problem
- A Chinese sentence is composed of characters without word boundaries.
- ?????????
- ? ? ?? ? ? ???
- ? ? ??? ? ???
- Word Definition
- A character string with an independent meaning and a specific syntactic function
17Segmentation
- Standard
- China: ???????????????
- Implemented in 1988
- National standard in 1992 (GB/T13715-92)
- Taiwan: ???????????????
- Proposed by ROCLING in 1996
- National standard in 1999 (CNS14366)
18Segmentation Strategies
- The dictionary is an important resource
- List all possible words
- Find the most plausible path from a word lattice
- ???????????
- ??????????????
19Segmentation Strategies (Continued)
- Disambiguation: select the best combination
- Rule-based
- Longest-word first: ???? ? ?? ? ????????? ? ??? ? ???
- Delete the discontinuous fragments
- Other heuristic rules: 2-3 word preference, ...
- Parser
- Statistics-based
- Markov models, relaxation method, and so on
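The longest-word-first heuristic above can be sketched as greedy forward maximum matching. This is a minimal illustration, not the lecture's actual system; the Latin-letter "characters" and toy lexicon stand in for real Chinese text and a real dictionary.

```python
def longest_match_segment(sentence, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position take the
    longest dictionary word; fall back to a single character."""
    words, i = [], 0
    while i < len(sentence):
        # Try the longest window first, shrinking down to one character.
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            if n == 1 or sentence[i:i + n] in dictionary:
                words.append(sentence[i:i + n])
                i += n
                break
    return words

lexicon = {"ab", "abc", "cd"}
print(longest_match_segment("abcd", lexicon))  # -> ['abc', 'd']
```

Note that with lexicon `{"ab", "cd"}` the same input segments as `['ab', 'cd']`, showing how dictionary coverage changes the chosen path through the word lattice.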
20Segmentation Strategies
- Dictionary Coverage
- A dictionary cannot cover all the words
- Solutions
- Morphological rules
- (Semi-)automatic construction of dictionaries: automatic terminology extraction
- Unknown word resolution
21Morphological Rules
- numeral + classifier
- ???, ???
- date/time
- ????????
- noun (or verb) prefix/suffix
- ???
- special verbs
- ?? ?,?? ?,?? ?
- ????,????,????,????
- ???,???,???
- ...
22Term Extraction: n-gram Approach
- Compute n-grams from a corpus
- Select candidate terms
- Successor variety
- the successor variety will sharply increase until a segment boundary is reached
- Use i-grams and (i+1)-grams to select candidate terms of length i
- Mutual Information
- Significance Estimation Function
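The mutual-information criterion above can be sketched for two-character candidates: pairs of adjacent characters whose pointwise mutual information exceeds a threshold become candidate terms. The toy character stream and the threshold value are illustrative assumptions.

```python
from collections import Counter
from math import log2

def pmi_terms(chars, threshold=1.5):
    """Score adjacent character pairs by pointwise mutual information
    and keep those above the threshold as candidate two-char terms."""
    uni = Counter(chars)
    bi = Counter(zip(chars, chars[1:]))
    n_u, n_b = sum(uni.values()), sum(bi.values())
    terms = {}
    for (a, b), f in bi.items():
        # PMI = log2( P(a,b) / (P(a) * P(b)) )
        pmi = log2((f / n_b) / ((uni[a] / n_u) * (uni[b] / n_u)))
        if pmi >= threshold:
            terms[a + b] = pmi
    return terms

candidates = pmi_terms(list("xyxyxyab"))
```

Here the frequently co-occurring pair "xy" and the always-together pair "ab" pass the threshold, while the incidental pair "yx" does not; the same idea extends to i-grams versus (i+1)-grams for longer terms.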
23Named Entity Extraction
24Named Entity Extraction
- Five basic components in a document
- People, affairs, time, places, things
- Major unknown words
- Named Entities in MET2
- Names: people, organizations, locations
- Numbers: monetary/percentage expressions
- Time: date/time expressions
25Named People Extraction
- Chinese person names
- Chinese person names are composed of surnames and names.
- Most Chinese surnames are single characters; some rare ones are two characters.
- Most names are two characters; some rare ones are single characters (in Taiwan).
- The length of Chinese person names ranges from 2 to 6 characters.
- Transliterated person names
- Transliterated person names denote foreigners.
- The length of transliterated person names is not restricted to 2 to 6 characters.
26Named People Extraction: Chinese Person Names
- Extraction Strategies
- Baseline models: name-formulation statistics
- Propose possible candidates.
- Context cues
- Add extra scores to the candidates.
- When a title appears before (after) a string, it is probably a person name.
- Person names usually appear at the head or the tail of a sentence.
- Persons may be accompanied by speech-act verbs like "??", "?", "??", etc.
- Cache occurrences of named people
- A candidate appearing more than once has a high tendency to be a person name.
27Structure of Chinese Personal Names
- Chinese surnames have the following three types
- Single character like '?', '?', '?', '?'
- Two characters like '??' and '??'
- Two surnames together like '??'
- Most names have the following two types
- Single character
- Two characters
28Training Data
- Name-formulation statistics are trained from a 1-million person-name corpus in Taiwan.
- Each entry contains surname, name, and sex.
- There are 489,305 male names and 509,110 female names.
- In total, 598 surnames are retrieved from this 1-M corpus.
- Surnames of very low frequency, like ?, ?, etc., are removed to avoid false alarms.
- Only 541 surnames are left, and they are used to trigger the person name extraction system.
29Training Data
- The probability of a Chinese character being the first character (or the second character) of a name is computed for male and female separately.
- We compute the probabilities using training tables for female and male, respectively.
- Either the male score or the female score may be greater than the thresholds.
- In some cases, the female score may be greater than the male score.
- Thresholds are defined such that 99% of the training data pass them.
30Baseline Models: name-formulation statistics
- Model 1. Single-character surname, e.g., ?, ?, ? and ?
- P(C1)P(C2)P(C3) using the training table for male > Threshold1 and P(C2)P(C3) using the training table for male > Threshold2, or
- P(C1)P(C2)P(C3) using the training table for female > Threshold3 and P(C2)P(C3) using the training table for female > Threshold4
- Model 2. Two-character surname, e.g., ?? and ??
- P(C2)P(C3) using the training table for male > Threshold2, or
- P(C2)P(C3) using the training table for female > Threshold4
- Model 3. Two surnames together, like ??
- P(C12)P(C2)P(C3) using the training table for female > Threshold3, P(C2)P(C3) using the training table for female > Threshold4, and P(C12)P(C2)P(C3) using the training table for female > P(C12)P(C2)P(C3) using the training table for male
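The baseline scoring can be sketched as below: a surname from the trained surname list triggers a candidate, and the product of per-position character probabilities is compared against a threshold. The romanized names, the tiny probability tables, and the threshold value are all hypothetical stand-ins for the 1M-name training data.

```python
# Hypothetical trained resources (stand-ins for the real 1M-name corpus).
SURNAMES = {"chen", "lin"}              # trigger list (541 surnames in the talk)
P_FIRST = {"zhi": 0.02, "mei": 0.03}    # P(C2): first given-name character
P_SECOND = {"ming": 0.04, "hua": 0.05}  # P(C3): second given-name character
THRESHOLD = 1e-4                        # e.g., chosen so ~99% of training names pass

def is_person_name(surname, c2, c3):
    """Model-1-style check: surname triggers, P(C2)*P(C3) must pass."""
    if surname not in SURNAMES:
        return False
    score = P_FIRST.get(c2, 0.0) * P_SECOND.get(c3, 0.0)
    return score > THRESHOLD

print(is_person_name("chen", "zhi", "ming"))  # 0.02 * 0.04 = 8e-4 > 1e-4 -> True
```

A real system would keep separate male and female tables and compare both scores against their own thresholds, as the slide describes.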
31Cues from Character Levels
- Gender
- A married woman may add her husband's surname before her surname. That forms type 3 person names.
- Because a surname may be considered as a name, candidates with two surnames do not always belong to the type 3 person names.
- Gender information helps us disambiguate this type of person name.
- Some Chinese characters have a high score for male and some for female. The following shows some examples.
- Male: ?????????????????????
- Female: ?????????????????????
32Cues from Sentence Levels
- Titles
- When a title appears before (after) a candidate, it is probably a person name. This helps to decide the boundary of a name.
- ????? vs. ??????? ...
- Mutual Information
- Telling whether a word is a content word or a name is indispensable.
- ?????,??????
- When there is a strong relationship between surrounding words, the candidate word has a high probability of being a content word.
- Punctuation Marks
- When a candidate is located at the end of a sentence, we give it an extra score.
- Words around a caesura mark tend to be of similar types.
33Cues from Passage/Document Level: Cache
- A person name may appear more than once in a paragraph.
- There are four cases when the cache is used:
- (1) C1C2C3 and C1C2C4 are both in the cache, and C1C2 is correct.
- (2) C1C2C3 and C1C2C4 are both in the cache, and both are correct.
- (3) C1C2C3 and C1C2 are both in the cache, and C1C2C3 is correct.
- (4) C1C2C3 and C1C2 are both in the cache, and C1C2 is correct.
34Cache
- The problem in using the cache is case selection.
- Every entry in the cache is assigned a weight.
- An entry with a clear right boundary has a high weight.
- title and punctuation
- The other entries are assigned low weights.
- The use of weights in case selection
- high vs. high => case (2)
- high vs. low or low vs. high => the high one is correct
- low vs. low
- check the score of the last character of the name part
- ??? ???
- ??? ???
35Discussion
- Some typical types of errors:
- Foreign names (e.g., ???, ???)
- They are identified as proper nouns correctly, but are assigned wrong features.
- About 20% of errors belong to this type.
- Rare surnames (e.g., ?, ?, ?) or artists' stage names
- Nearly 14% of errors come from this type.
- Others
- Other proper nouns (place names, organization names, etc.)
- identification errors
36Omitted Name Problem
- Some texts omit the name part and leave only the surname.
- ??????
- Strategies
- If this candidate appeared earlier in the same paragraph, it is an omitted name.
- If this candidate has a special title like ??????? or a general title like ??????..., then it is an omitted name.
- If two single characters have a very high probability of being surnames, and they appear around a caesura mark, then they are regarded as omitted names.
37Transliterated Person Names
- Challenging Issues
- No special cue, like the surnames in Chinese person names, to trigger the recognition system.
- No restriction on the length of a transliterated person name.
- No large-scale transliterated person name corpus.
- Ambiguity in classification: '???' may denote a city or a former American president.
38Strategy (1)
- Character Condition
- When a foreign name is transliterated, the selection of homophones is restrictive. Richard Macs: ????? vs. ?????
- A basic character set can be trained from a transliterated name corpus.
- If all the characters in a string belong to this set, the string is regarded as a candidate.
39Strategy (2)
- Syllable Condition
- Some characters which meet the character condition do not look like transliterated names.
- Syllable Sequence
- Simplified Condition
- (1) For each candidate, we check the syllable of the first (the last) character.
- (2) If the syllable does not belong to the training corpus, the character is deleted.
- (3) The remaining characters are treated in a similar way.
40Strategy (3)
- Frequency Condition
- For each candidate which has only two characters, we compute the frequency of these two characters to see if it is larger than a threshold.
- The threshold is determined in a similar way to the baseline model for Chinese person names.
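The character and frequency conditions above can be combined into a single candidate filter, sketched below. The transliteration character set, the pair-frequency table, and the threshold are small hypothetical samples; a real system would train them from a transliterated-name corpus, and would add the syllable condition as a further filter.

```python
# Hypothetical trained resources (Latin letters stand in for the
# Chinese transliteration character set).
TRANSLIT_CHARS = set("abcdefg")               # character condition: trained set
PAIR_FREQ = {("a", "b"): 12, ("c", "d"): 2}   # frequency condition: pair counts
FREQ_THRESHOLD = 5

def is_transliterated_candidate(chars):
    # Character condition: every character must come from the trained set.
    if not all(c in TRANSLIT_CHARS for c in chars):
        return False
    # Frequency condition applies only to two-character candidates.
    if len(chars) == 2:
        return PAIR_FREQ.get((chars[0], chars[1]), 0) >= FREQ_THRESHOLD
    return True
```

Under these toy tables, "ab" passes both conditions while "cd" fails the frequency threshold, mirroring how short candidates need extra statistical support.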
41Cues around Names
- Cues within Transliterated Names
- Character Condition
- Syllable Condition
- Frequency Condition
- Cues around Transliterated Names
- titles the same as Chinese person names
- name introducers "?", "??", "??", "??", and "??"
- special verbs the same as Chinese person names
- first name + middle name + last name
42Discussion
- Some transliterated person names may be identified by the Chinese person name extraction system.
- ??? ???
- Some nouns may look like transliterated person names.
- popular brands of automobiles, e.g., '???' and '???'
- Chinese proper nouns, e.g., '??', '??' and '??'
- Chinese person names, e.g., '???'
- Besides the above nouns, boundary errors affect the precision too.
- (?)???
43Named Organization Extraction
- A complete organization name can be divided into two parts: name and keyword.
- Example: ?????
- Many words can serve as names, but only some fixed words can serve as keywords.
- Challenging Issues
- (1) A keyword is usually a common content word.
- (2) A keyword may appear in an abbreviated form.
- (3) The keyword may be omitted completely.
44Classification of Organization Names
- Complete organization names
- This type of organization name is usually composed of proper nouns and keywords.
- Some organization names are very long, so (left) boundary determination is difficult.
- Some organization names with keywords are still ambiguous.
- '???' usually denotes reading matter, not organizations.
- Incomplete organization names
- These organization names often omit their keywords.
- The abbreviated organization names may be ambiguous.
- '??' and '??' are famous sport teams in Taiwan and in the USA, respectively; however, they are also common content words.
45Strategies
- Keywords
- A keyword shows not only the possibility of an occurrence of an organization name, but also its right boundary.
- Prefix
- A prefix is a good marker for a possible left boundary.
- Single-character words
- If the character preceding a possible keyword is a single-character word, then the content word is not a keyword.
- If the characters preceding a possible keyword cannot exist independently, they form the name part of an organization.
- Words of at least two characters
- The words that compose a name part usually have strong relationships.
46Strategies
- Parts of speech
- The name part of an organization cannot extend beyond a transitive verb.
- Numerals and classifiers are also helpful.
- Cache
- Problem: when should a pattern be put into the cache?
- The character set is incomplete.
- n-gram model
- It must consist of a name and an organization name keyword.
- Its length must be greater than 2 words.
- It does not cross any punctuation marks.
- It must occur more often than a threshold.
47Handcrafted Rules
- OrganizationName → OrganizationName OrganizationNameKeyword, e.g., ??? ??
- OrganizationName → CountryName OrganizationNameKeyword, e.g., ?? ???
- OrganizationName → PersonName OrganizationNameKeyword, e.g., ??? ???
- OrganizationName → CountryName OrganizationName, e.g., ?? ???
- OrganizationName → LocationName OrganizationName, e.g., ???? ??
- OrganizationName → CountryName D D D OrganizationNameKeyword, e.g., ?? ?? ????
- OrganizationName → PersonName D D OrganizationNameKeyword, e.g., ??? ?? ???
- OrganizationName → LocationName D D OrganizationNameKeyword, e.g., ?? ?? ????
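Rules of this shape can be sketched as pattern matching over a sequence of typed tokens. The rule inventory below covers only the two-constituent productions, and the English example tokens are hypothetical; this is an illustration of the rule format, not the lecture's actual grammar.

```python
# A subset of the two-constituent handcrafted rules, as (type1, type2) pairs.
RULES = {
    ("OrganizationName", "OrganizationNameKeyword"),
    ("CountryName", "OrganizationNameKeyword"),
    ("PersonName", "OrganizationNameKeyword"),
    ("CountryName", "OrganizationName"),
    ("LocationName", "OrganizationName"),
}

def match_organizations(tagged):
    """tagged: list of (token, type) pairs; returns joined
    organization-name strings for every adjacent pair matching a rule."""
    orgs = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if (t1, t2) in RULES:
            orgs.append(w1 + w2)
    return orgs
```

For example, `match_organizations([("Taiwan", "CountryName"), ("University", "OrganizationNameKeyword")])` yields one organization candidate; the three-constituent rules with the D slots would extend the window to triples in the same style.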
48Discussion
- Most errors result from organization names without keywords.
- ??? ?? ????
- ?? ?? ??
- Identification errors
- Even if keywords appear, organization names do not always exist.
- ???? ????
- Erroneous left boundaries are also a problem.
- ????? (??)????
- Ambiguities
- ??? ????
49Application of Gender Assignment
- Anaphora resolution
- "?????,??????????,??????????,?????????????????????,?????,?????,?????????????????,????????"
- The gender of a person name is useful for this problem.
- The correct rate for gender assignment is 89%.
- Co-Reference resolution
50Named Location Extraction
- A location name is composed of name and keyword parts.
- Rules
- LocationName → PersonName LocationNameKeyword
- LocationName → LocationName LocationNameKeyword
- Locative verbs like '??', '??', and so on, are introduced to treat location names without keywords.
- Cache and n-gram models are also employed to extract location names.
51Date Expressions
- DATE → NUMBER YEAR (? ?)
- DATE → NUMBER MTHUNIT (? ?)
- DATE → NUMBER DUNIT (? ?)
- DATE → REGINC (??)
- DATE → FSTATE DATE (?? ??)
- DATE → COMMON DATE (? ??)
- DATE → REGINE DATE (?? ????)
- DATE → DATE DMONTH (?? ??)
- DATE → DATE BSTATE (?? ?)
- DATE → FSTATEDATE DATE (?? ???)
- DATE → FSTATEDATE DMONTH (?? ??)
- DATE → FSTATEDATE FSTATEDATE (?? ??)
- DATE → DATE YXY DATE (???? ? ????)
52Time Expressions
- TIME → NUMBER HUNIT (? ?)
- TIME → NUMBER MUNIT (?? ?)
- TIME → NUMBER SUNIT (? ?)
- TIME → FSTATETIME TIME
- TIME → FSTATE TIME
- TIME → TIME BSTATE
- TIME → MORN BSTATE
- TIME → TIME TIME
- TIME → TIME YXY TIME (?? ? ??)
- TIME → NUMBER COLON NUMBER (03:45)
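The last rule (TIME → NUMBER COLON NUMBER) can be sketched as a simple pattern; the remaining rules would chain recognized sub-expressions in the same cascaded fashion. The regular expression below is an illustrative assumption, not the lecture's actual recognizer.

```python
import re

# NUMBER COLON NUMBER, e.g. "03:45": one or two digits, a colon, two digits.
TIME_COLON = re.compile(r"\b(\d{1,2}):(\d{2})\b")

def find_colon_times(text):
    """Return every colon-separated time expression found in the text."""
    return [m.group(0) for m in TIME_COLON.finditer(text)]

print(find_colon_times("meeting at 03:45 and 9:05"))  # -> ['03:45', '9:05']
```

A full grammar-based recognizer would instead tokenize NUMBER and unit tokens first and then apply the productions bottom-up, so that nested rules such as TIME → FSTATE TIME compose.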
53Monetary Expressions
- DMONEY → MOUNIT NUMBER MOUNIT (?? ? ?)
- DMONEY → NUMBER MOUNIT MOUNIT (? ? ??)
- DMONEY → NUMBER MOUNIT (? ?)
- DMONEY → MOUNIT MOUNIT NUMBER (?? 5)
- DMONEY → MOUNIT NUMBER ( 5)
- DMONEY → NUMBER YXY DMONEY (? ? ??)
- DMONEY → DMONEY YXY DMONEY (?? ? ??)
- DMONEY → DMONEY YXY NUMBER (200 - 500)
54Percentage Expressions
- DPERCENT → PERCENT NUMBER (??? ?)
- DPERCENT → NUMBER PERCENT (3%)
- DPERCENT → DPERCENT YXY DPERCENT (5% ? 8%)
- DPERCENT → DPERCENT YXY NUMBER (???? ? ?)
- DPERCENT → NUMBER YXY DPERCENT (? ? ????)
55Named Entity Extraction in MET2
- Transform Chinese texts in GB codes into texts in Big-5 codes.
- Segment Chinese texts into a sequence of tokens.
- Identify named people.
- Identify named organizations.
- Identify named locations.
- Use the n-gram model to identify named organizations/locations.
- Identify the rest of the named expressions.
- Transform the results in Big-5 codes into results in GB codes.
56From GB Codes to Big-5 Codes
- The Big-5 traditional character set and the GB simplified character set are adopted in Taiwan and in China, respectively.
- Our system is developed on the basis of Big-5 codes, so the transformation is required.
- Characters used both in the simplified character set and in the traditional character set always result in error mapping.
- ?? vs. ?? ?? vs. ?? ?? vs. ???? vs. ?? ?? vs. ?? ??? vs. ?????? vs. ??? ?? vs. ?? ?? vs. ?????? vs. ???? and so on.
- More unknown words may be generated.
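The code transformation necessarily goes through a common representation, and simplified characters with no Big-5 counterpart surface as conversion errors, which is one source of the mapping problems noted above. This sketch uses Python's standard codecs (an assumption about tooling; the original system predates them) with Unicode as the pivot.

```python
def gb_to_big5(gb_bytes):
    """Decode GB-coded bytes and re-encode as Big-5.
    Characters with no Big-5 mapping are replaced with '?' rather
    than silently mis-mapped, making the failures visible."""
    text = gb_bytes.decode("gbk")            # GB bytes -> Unicode
    return text.encode("big5", errors="replace")  # Unicode -> Big-5 bytes

# A traditional-compatible character converts cleanly; a simplified-only
# character (e.g. the simplified form of a door radical) does not.
ok = gb_to_big5("明".encode("gbk"))
```

Every such replacement character then shows up downstream as an unknown word, which is why the slide notes that more unknown words may be generated.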
57Segmentation
- We list all the possible words by dictionary look-up, and then resolve ambiguities by segmentation strategies.
- The test documents in MET-2 are selected from China newspapers.
- Our dictionary is trained from Taiwan corpora.
- Due to the different vocabulary sets, many more unknown words may be introduced: ???? vs. ????, ?? vs. ??, ?? vs. ???, ??? vs. ???, etc.
- The unknown words from different code sets and different vocabulary sets make named entity extraction more challenging.
58MET-2 Formal Run of NTUNLPL
- F-measures
- P&R: 79.61%
- 2P&R: 77.88%
- P&2R: 81.42%
- Recall and Precision
- name (85%, 79%)
- number (91%, 98%)
- time (95%, 85%)
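The three F-measures above weight precision and recall differently: the balanced P&R, a precision-favoring 2P&R, and a recall-favoring P&2R. Assuming the MUC scorer's van Rijsbergen formula (a standard assumption, not stated on the slide), they are instances of F_beta with beta below, equal to, and above 1:

```python
def f_measure(precision, recall, beta=1.0):
    """Van Rijsbergen F-measure: F_beta = (b^2+1) P R / (b^2 P + R).
    beta > 1 weights recall more heavily; beta < 1 weights precision."""
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# With recall above precision (as in the run above), the recall-weighted
# score exceeds the balanced one, which exceeds the precision-weighted one.
p, r = 0.79, 0.85
scores = [f_measure(p, r, b) for b in (0.5, 1.0, 2.0)]
```

This ordering matches the reported numbers, where P&2R (81.42%) > P&R (79.61%) > 2P&R (77.88%).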
59Named Persons
- The recall rate and the precision rate are 91% and 74%.
- Major errors
- Segmentation, e.g., ??? -> ?? ?: part of a person name may be regarded as a word during segmentation.
- The surname, name, character set and title lists are incomplete, e.g., ???, ?? ? ??, ?? ??
- Blanks, e.g., ? ?: we cannot tell whether blanks exist in the original documents or are inserted by the segmentation system.
- Boundary errors
- Japanese names, e.g., ?????
60Evaluation: Named Organizations
- The recall rate and the precision rate are 78% and 85%.
- Major errors
- More than two content words between name and keyword, e.g., ?? ?? ?? ?? ??
- Absence of keywords, e.g., ???????
- Absence of the name part: the name part does not satisfy the character condition, e.g., ????
- n-gram errors, e.g., ???????????
61Evaluation: Named Locations
- The recall rate and the precision rate are 78% and 69%.
- Character set
- The characters "?" and "?" in the string "????" do not belong to our transliterated character set.
- Wrong keyword
- The character "?" is an organization keyword. Thus the string "?????" is mis-regarded as an organization name.
- Common content words
- Words such as "??", "??", etc., are common content words. We do not give them special tags.
- Single-character locations
- Single-character locations such as "?", "?", and so on, are missed during recognition.
62Evaluation: Time/Date Expressions
- The recall rate and the precision rate for date, time, monetary and percentage expressions are (94%, 88%), (98%, 70%), (98%, 98%), and (83%, 98%), respectively.
- Major errors
- Propagation errors
- segmentation before entity extraction, e.g., ??
- named people extraction before date expressions
- Absent date units
- the date unit does not appear, e.g., ????
- the date unit should appear, e.g., ???
63- Absent keywords
- Some keywords are not listed.
- E.g., ???????8?58? is divided into ??, 8?58?
- Rule coverage
- E.g., ?????
- Ambiguity
- Some characters like ? can be used in both time and monetary expressions. E.g., ???????? is divided into two parts: ??? and ?????
- The strings "??" and "??" are words. In our pipelined model, "????" and "????" will be missed.
64Issues
- Dealing with the errors propagated from previous modules
- Pipelining model vs. interleaving model
- Dealing with the errors resulting from rule coverage
- Handcrafted rules vs. learned rules
- Dealing with the errors resulting from segmentation standards
- Vocabulary set of Academia Sinica vs. Peking University
65Pipelining Model
segmentation
named entity extraction
named people
named location
named organization
number date/time
input
output
ambiguity resolution
only one result
66Interleaving Model
table lookup
named people
named locations
named organization
number date/time
input
ambiguity resolution
output
67An Example in Interleaving Model
68Learning Rules vs. Hand-Crafted Rules
- Collect organization names.
- Extract patterns
- Cluster organization names based on keywords
- Assign features to name parts
- Employ the Teiresias algorithm to extract patterns (http://cbcsrv.watson.ibm.com/Tspd.html)
69Teiresias Algorithm
- Patterns consist of words and wild cards, e.g.,
- ?? ?? ??
- ?? ?? ??
- => ?? ??
- Parameter setting
- L: the least number of non-wild cards
- W: the maximum extent allowed to span L non-wild cards
- T: confidence level, i.e., how many training instances this pattern must satisfy
70Keyword Set
- Extracting the keyword set
- Input all the training instances (i.e., organization names) into the Teiresias algorithm.
- Let the confidence level be 5.
- Find all the patterns not ending with a wild card, e.g., ( ?? ) ( ?? ) ( ? ? )
- Regard the suffix of a pattern as a keyword, e.g., ??? ??? ? ?? ?? ??? ?? ?????
71Features of Patterns
- types
- named entities
- named people
- named locations
- named organizations
- date expression ( ????????????? )
- number ( 87????????? )
- common nouns
72Tagging
73Tagging
- Lexical level
- Part of Speech Tagging
- Named entity tagging
- Sense Tagging
- Syntactic level
- Syntactic Category (Structure) Tagging
- Discourse level
- Anaphora-Antecedent Tagging
- Co-Reference Tagging
74Part-of-Speech Tagging
- Issues affecting tagging accuracy
- the amount of training data
- the granularity of the tag set
- the occurrences of unknown words, and so on
- Academia Sinica Balanced Corpus
- 5 million words
- 46 tags
- Language models, e.g., bigram, trigram, ...
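A bigram language model for tagging can be sketched as Viterbi decoding over transition and emission probabilities. The tags, words, and probabilities below are toy assumptions for illustration; the real models would be estimated from the Academia Sinica Balanced Corpus.

```python
def viterbi(words, tags, trans, emit, start):
    """Return the tag sequence maximizing start * emission * transition
    products under a bigram (HMM-style) model.
    trans[(t1, t2)], emit[(t, w)], start[t] are probabilities."""
    best = {t: (start.get(t, 0) * emit.get((t, words[0]), 0), [t]) for t in tags}
    for w in words[1:]:
        nxt = {}
        for t in tags:
            # Best previous tag extended by transition into t.
            p, path = max((best[pt][0] * trans.get((pt, t), 0), best[pt][1])
                          for pt in tags)
            nxt[t] = (p * emit.get((t, w), 0), path + [t])
        best = nxt
    return max(best.values())[1]

# Hypothetical two-tag model (N = noun, V = verb).
TAGS = ["N", "V"]
TRANS = {("N", "N"): 0.3, ("N", "V"): 0.7, ("V", "N"): 0.5, ("V", "V"): 0.5}
EMIT = {("N", "dog"): 0.9, ("V", "dog"): 0.1, ("N", "runs"): 0.2, ("V", "runs"): 0.8}
START = {"N": 0.6, "V": 0.4}
print(viterbi(["dog", "runs"], TAGS, TRANS, EMIT, START))  # -> ['N', 'V']
```

A trigram model would condition each transition on the two previous tags instead of one, at the cost of a larger state space.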
75Sense Tagging
- Assign sense labels to words in a sentence.
- Sense Tagging Set
- tong2yi4ci2ci2lin2 (?????, Cilin)
- 12 large categories
- 94 middle categories
- 1,428 small categories
- 3,925 word clusters
76A: People
Aa: a collective name
01 Human being / The people / Everybody
02 I / We
03 You / You
04 He/She / They
05 Myself / Others / Someone
06 Who
Ab: people of all ages and both sexes
01 A Man / A Woman / Men and Women
02 An Old Person / An Adult / The old and the young
03 A Teenager
04 An Infant / A Child
Ac: posture
01 A Tall Person / A Dwarf
02 A Fat Person / A Thin Person
03 A Beautiful Woman / A Handsome Man
78Degree of Polysemy in Mandarin Chinese
- The small categories of Cilin are used to compute the distribution of word senses.
- ASBC is employed to count the frequency of a word.
- In total, 28,321 word types appear both in Cilin and in the ASBC corpus.
- In total, 5,922 words are polysemous.
79- degree: number of senses of a word
- word type: a dictionary entry
80- 97.53% (94.70% for N and V)
- 98.22% (97.08% for A and K)
- (chart: 5,922 vs. 4,132 words)
81- 93.77% of polysemous words belong to the class of low ambiguity; they occupy only 58.52% of the tokens in the ASBC corpus.
- word token: an occurrence of a word type in the ASBC corpus
- word type: a dictionary entry
82Low frequency (< 100), middle frequency (100 ≤ f < 1000), high frequency (≥ 1000)
83Phenomena
- POS information reduces the degree of ambiguity.
- In total, 8.94% of word tokens are highly ambiguous in Table 3. This decreases to 0.47% in Table 4.
- Highly ambiguous words tend to be highly frequent.
- 23.67% of word types are middle- or high-frequency words, and they occupy 94.06% of word tokens.
84Semantic Tagging: Unambiguous Words
- Acquire the context for each semantic tag, starting from the unambiguous words (those words that have only one sense in Cilin), using Cilin and the ASBC corpus.
85Acquire Contextual Vectors
- An unambiguous word is characterized by the surrounding words.
- The window size is set to 6, and stop words are removed.
- A sense tag Ctag is represented as a vector (w1, w2, ..., wn).
- MI metric
- EM metric
86Semantic Tagging: Ambiguous Words
- Apply the information trained at the first stage to select the best sense tag from the candidates of each ambiguous word (those words that have more than one sense in Cilin).
87Apply and Retrain Contextual Vectors
- Identify the context vector of an ambiguous word.
- Measure the similarity between a sense vector and a context vector by the cosine formula.
- Select the sense tag with the highest similarity score.
- Retrain the sense vector for each sense tag.
88Semantic Tagging: Unknown Words
- Adopt outside evidence from the mapping between WordNet synsets and Cilin sense tags to narrow down the candidate set.
90Experiments
- Test Materials
- Sample documents of different categories from the ASBC corpus
- In total, 35,921 words in the test corpus
- Research associates tagged this corpus manually.
- Mark up the ambiguous words by looking up the Cilin dictionary.
- Tag the unknown words by looking up the mapping table.
- The tag mapper achieves approximately 82.52% performance.
92(Results chart: 49.55%, 62.60%, 31.36%, 27.00%)
93- The performance for tagging low-ambiguity (2-4), middle-ambiguity (5-8) and high-ambiguity (>8) words is similar (i.e., 63.98%, 60.92% and 67.95%) when 1, 2, and 3 candidates are proposed.
- Under the middle categories and 1-3 proposed candidates, the performance for tagging low, middle and high ambiguous words is 71.02%, 73.88%, and 75.94%.
95Co-Reference Resolution
96Introduction
- Anaphora vs. Co-Reference
- Anaphora
- ?????
- Type/Instance: ??/??, ?????/??
- Function/Value: ?????/?? 30 ?
- NP: ?????? ?????/???
??
??
97Flow of Co-Reference Resolution
Document
98Find the Candidate List
Document
Determine Candidates
All the Candidates
Co-Reference Resolution Algorithm
Singletons
Class 1
Class 2
Class N
99Find the Candidates
- Select all nouns (Cand-Terms)
- Na (????)
- Nb (????)
- Nc (????)
- Nd (????)
- Nh (???)
- Delete some Nds (171 in total)
- e.g., ??????????, ??, ??, ...
- Select noun phrases (Cand-NP)
- Select maximal noun phrases (Cand-MaxNP)
Some are found during named entity extraction.
100Recognize NPs whose head is Na (common noun)
??(Neqa) ?(DE) ??(Na) ??(Neqa) ?(Neu) ?(Nf)
??(Na) ?(Nep) ?(Neu) ?(Nf) ??(Na) ?(Nes)
??(Na)
101Recognize NPs whose head is Nh (pronoun)
Na
Nh
Init
Nb
??(Nb) ??(Nh) ???(Na) ??(Nd) ?(Nh) ??(Nh)
102Cilin (?????)
- 12 large categories
- 94 middle categories
- 1,428 small categories
- 3,925 word clusters
103Features
- Classification
- Word/Phrase Itself
- Part of Speech of Head
- Semantics of Head
- Type of Named Entities
- Positions (Sentences and Paragraphs)
- Number: Singular, Plural and Unknown
- Gender: Pronouns and Chinese Person Names
- Pronouns: Personal Pronouns, Demonstrative Pronouns
104Co-Reference Resolution Algorithms
- Strategy 1: simple pattern matching
- Strategy 2: Cardie clustering algorithm
105Cardie Clustering Algorithm
107Semantics Restrictions
- SemanticFun_1(NPi, NPj)
- If the heads belong to the same word cluster, then 0 is assigned; else 1 is assigned.
- SemanticFun_2(NPi, NPj)
- Integrates POS, named entity and Cilin sense information
- Only one is an NE
- NPi and NPj are not NEs, and they are in Cilin
- NPi and NPj are not NEs, and only one of them is in Cilin
- NPi and NPj are not NEs, and both are not in Cilin → 0
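The clustering strategy can be sketched as below: noun phrases whose pairwise feature distance stays under a radius are merged into one coreference class, in the spirit of the Cardie clustering algorithm. The feature set, weights, and toy noun phrases are assumptions for illustration, not the lecture's actual distance function.

```python
def np_distance(a, b):
    """Weighted incompatibility between two NP feature dicts.
    Incompatible gender bars coreference outright (infinite distance)."""
    d = 0.0
    if a["gender"] != b["gender"] and "unknown" not in (a["gender"], b["gender"]):
        d += float("inf")
    if a["number"] != b["number"]:
        d += 1.0
    if a["semclass"] != b["semclass"]:
        d += 1.0
    return d

def cluster(nps, radius=1.0):
    """Greedy clustering: add each NP to the first class where it is
    within the radius of every member; otherwise start a new class."""
    classes = []
    for np_ in nps:
        for c in classes:
            if all(np_distance(np_, m) <= radius for m in c):
                c.append(np_)
                break
        else:
            classes.append([np_])
    return classes
```

For instance, two feminine singular person NPs fall into one class while a masculine one is kept apart by the gender restriction, matching the role of SemanticFun-style constraints above.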
108SemanticFun_2
- One of them is an NE
- The column denotes the type of NE
- The row denotes the part of speech
- An English string in a table cell denotes a Cilin sense
109Experimental Results
110Named Entity Tagging Environment
114tag at the same time
116?? vs. ??????
118Named Entity Extraction in Bioinformatics
Application
- Named Entities
- Protein Name
- Gene Name
119Summary
- Segmentation
- Named Entity Extraction
- POS and Sense Tagging
- Co-Reference Resolution
- NE Tagging Environment
- Bioinformatics