Title: Chinese Information Extraction Technologies
1Chinese Information Extraction Technologies
- Hsin-Hsi Chen
- Department of Computer Science and Information
Engineering - National Taiwan University
- Taipei, Taiwan
- E-mail hh_chen_at_csie.ntu.edu.tw
2Outline
- Introduction to Information Extraction (IE)
- Chinese IE Technologies
- Tagging Environment for Chinese IE
- Applications
- Summary
3Introduction
4Introduction
- Information Extraction
- the extraction or pulling out of pertinent information from large volumes of text
- Information Extraction System
- an automated system to extract pertinent information from large volumes of text
- Information Extraction Technologies
- techniques used to automatically extract specified information from text
(http://www.itl.nist.gov/iaui/894.02/related_projects/muc/)
5An Example in Air Vehicle Launch
- Original Document
- Named-Entity-Tagged Document
- Equivalence Classes
- Co-Reference Tagged Document
6<DOC> <DOCID> NTU-AIR_LAUNCH-????-19970612-002 </DOCID> <DATASET> Air Vehicle Launch </DATASET> <DD> 1997/06/12 </DD> <DOCTYPE> ???? </DOCTYPE> <DOCSRC> ???? </DOCSRC> <TEXT> ????????
??????????????????? ????????????
???????,??????? ????????????,?????????????? ??????
?,?????????? ????????? ??????????? ????
?????????????????? ??????? ,?????????????????????
????? ??????????,????????????????
Legend: red = location name, blue = date expression, green = organization name, purple = person name
7???????????????,??????????? ???????,??????????
??????????????????????????? ??????????????????????
?????? ??????????????,????????????
??????????????????????????? ??????????,???????????
?????????????????,???????? ??????????????????
?????????? ????? ?????????????,????????????? ??
??????????????????????????, ??,???????????????????
8?????????????????????????? ????
?????????????????????????? ????????????
???????????????????????,? ??????????????????????
???? ?????? ??????,???????????? ?????????
</TEXT> </DOC>
9<DOC> <DOCID> NTU-AIR_LAUNCH-????-19970612-002 </DOCID> <DATASET> Air Vehicle Launch </DATASET> <DD> 1997/06/12 </DD> <DOCTYPE> ???? </DOCTYPE> <DOCSRC> ???? </DOCSRC> <ISRELEVANT> NO </ISRELEVANT> <TITLE> <ENAMEX TYPE="LOCATION">?</ENAMEX>??<ENAMEX TYPE="LOCATION">??</ENAMEX>1065????? </TITLE> <TEXT>
?????<ENAMEX TYPE="LOCATION">??</ENAMEX>?<ENAMEX TYPE="LOCATION">???</ENAMEX><TIMEX TYPE="DATE">???</TIMEX>??????<ENAMEX TYPE="LOCATION">???</ENAMEX>????????????????
?<ENAMEX TYPE="LOCATION">??</ENAMEX>?<TIMEX TYPE="DATE">???</TIMEX>,<ENAMEX TYPE="LOCATION">??</ENAMEX>?<ENAMEX TYPE="LOCATION">??</ENAMEX><TIMEX TYPE="DATE">??</TIMEX>?<ENAMEX TYPE="LOCATION">??</ENAMEX>?????????,???????????<ENAMEX TYPE="LOCATION">??</ENAMEX>????????,??<ENAMEX TYPE="LOCATION">??</ENAMEX>?????? ?????<ENAMEX TYPE="LOCATION">??</ENAMEX>????????????? ????
10<ID="3">??? <ID="4" REF="3">?? <ID="5" REF="3">???????????? ???????
<ID="63">??????? <ID="66" REF="63">????????????? ?????? ?????
<ID="65" REF="63">????????????? <ID="70" REF="65">?? <ID="69" REF="65">?? <ID="64" REF="63">?????????
11<DOC> <DOCID> NTU-AIR_LAUNCH-????-19970612-002 </DOCID> <DATASET> Air Vehicle Launch </DATASET> <DD> 1997/06/12 </DD> <DOCTYPE> ???? </DOCTYPE> <DOCSRC> <COREF ID="1">????</COREF> </DOCSRC> <ISRELEVANT> NO </ISRELEVANT> <TITLE> <COREF ID="6">?</COREF>??<COREF ID="23">??</COREF><COREF ID="45" REF="44" TYPE="IDENT" MIN="????">1065?????</COREF> </TITLE> <TEXT>
?<COREF ID="2" REF="1" TYPE="IDENT">??</COREF>??<COREF ID="61">??</COREF>?<COREF ID="8" STATUS="OPT" REF="6" TYPE="IDENT">???</COREF><COREF ID="3">???</COREF>??????<COREF ID="7" REF="6" TYPE="IDENT">???</COREF>????<COREF ID="5" STATUS="OPT" REF="3" TYPE="IDENT" MIN="???">???????????? ?<COREF ID="24" REF="23" TYPE="IDENT">??</COREF>????</COREF>,<COREF ID="77"><COREF ID="9" REF="6" TYPE="IDENT">??</COREF>?<COREF ID="29">??</COREF></COREF><COREF ID="4" REF="3" TYPE="IDENT">??</COREF>?<COREF ID="62" REF="61" TYPE="IDENT">??</COREF>??<COREF ID="63" MIN="??">???????</COREF>,<COREF ID="64" REF="63" TYPE="IDENT" MIN="??">?????????</COREF>??<COREF ID="81" STATUS="OPT" REF="75" TYPE="IDENT" MIN="??"><COREF ID="30" REF="29" TYPE="IDENT">??</COREF>???</COREF>?????,??<COREF ID="31" REF="29" TYPE="IDENT">??</COREF>??????
???????????????????? ????
12IE Evaluation in MUC-7 (1998)
- Named Entity Task (NE): Insert SGML tags into the text to mark each string that represents a person, organization, or location name, or a date or time stamp, or a currency or percentage figure.
- Multi-lingual Entity Task (MET): the NE task for Chinese and Japanese.
- Co-reference Task (CO): Capture information on co-referring expressions: all mentions of a given entity, including those tagged in the NE and TE tasks.
13IE Evaluation in MUC-7 (cont.)
- Template Element Task (TE): Extract basic information related to organization, person, and artifact entities, drawing evidence from anywhere in the text.
- Template Relation Task (TR): Extract relational information on employee_of, manufacture_of, and location_of relations.
- Scenario Template Task (ST): Extract pre-specified event information and relate the event information to particular organization, person, or artifact entities involved in the event.
14Chinese IE Technologies
- Segmentation
- Named Entity Extraction
- Part of Speech/Sense Tagging
- Full/Partial Parsing
- Co-Reference Resolution
15Segmentation
16Segmentation
- Problem
- A Chinese sentence is composed of characters without word boundaries.
- ?????????
- ? ? ?? ? ? ???
- ? ? ??? ? ???
- Word Definition
- A character string with an independent meaning and a specific syntactic function
17Segmentation
- Standard
- China: ???????????????
- Implemented in 1988
- National standard in 1992 (GB/T13715-92)
- Taiwan: ???????????????
- Proposed by ROCLING in 1996
- National standard in 1999 (CNS14366)
18Segmentation Strategies
- The dictionary is an important resource
- List all possible words
- Find the most plausible path from a word lattice
- ???????????
- ??????????????
19Segmentation Strategies (Continued)
- Disambiguation: select the best combination
- Rule-based
- Longest-word first: ???? ? ?? ? ????????? ? ??? ? ???
- Delete the discontinuous fragments
- Other heuristic rules: 2-3 word preference, ...
- Parser
- Statistics-based
- Markov models, relaxation method, and so on
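The longest-word-first heuristic above can be sketched as greedy forward maximum matching. This is a minimal illustration, not the lecture's actual system; the Latin-letter "characters" and toy lexicon stand in for real Chinese text and a real dictionary.

```python
def longest_match_segment(sentence, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position take the
    longest dictionary word; fall back to a single character."""
    words, i = [], 0
    while i < len(sentence):
        # Try the longest window first, shrinking down to one character.
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            if n == 1 or sentence[i:i + n] in dictionary:
                words.append(sentence[i:i + n])
                i += n
                break
    return words

lexicon = {"ab", "abc", "cd"}
print(longest_match_segment("abcd", lexicon))  # -> ['abc', 'd']
```

Note that with lexicon `{"ab", "cd"}` the same input segments as `['ab', 'cd']`, showing how dictionary coverage changes the chosen path through the word lattice.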
20Segmentation Strategies
- Dictionary Coverage
- A dictionary cannot cover all the words
- Solutions
- Morphological rules
- (Semi-)automatic construction of dictionaries: automatic terminology extraction
- Unknown word resolution
21Morphological Rules
- numeral + classifier
- ???, ???
- date/time
- ????????
- noun (or verb) prefix/suffix
- ???
- special verbs
- ?? ?,?? ?,?? ?
- ????,????,????,????
- ???,???,???
- ...
22Term Extraction: n-gram Approach
- Compute n-grams from a corpus
- Select candidate terms
- Successor variety
- the successor variety will sharply increase until a segment boundary is reached
- Use i-grams and (i+1)-grams to select candidate terms of length i
- Mutual Information
- Significance Estimation Function
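The mutual-information criterion above can be sketched for two-character candidates: pairs of adjacent characters whose pointwise mutual information exceeds a threshold become candidate terms. The toy character stream and the threshold value are illustrative assumptions.

```python
from collections import Counter
from math import log2

def pmi_terms(chars, threshold=1.5):
    """Score adjacent character pairs by pointwise mutual information
    and keep those above the threshold as candidate two-char terms."""
    uni = Counter(chars)
    bi = Counter(zip(chars, chars[1:]))
    n_u, n_b = sum(uni.values()), sum(bi.values())
    terms = {}
    for (a, b), f in bi.items():
        # PMI = log2( P(a,b) / (P(a) * P(b)) )
        pmi = log2((f / n_b) / ((uni[a] / n_u) * (uni[b] / n_u)))
        if pmi >= threshold:
            terms[a + b] = pmi
    return terms

candidates = pmi_terms(list("xyxyxyab"))
```

Here the frequently co-occurring pair "xy" and the always-together pair "ab" pass the threshold, while the incidental pair "yx" does not; the same idea extends to i-grams versus (i+1)-grams for longer terms.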
23Named Entity Extraction
24Named Entity Extraction
- Five basic components in a document
- People, affairs, time, places, things
- Major unknown words
- Named Entities in MET2
- Names: people, organizations, locations
- Numbers: monetary/percentage expressions
- Time: date/time expressions
25Named People Extraction
- Chinese person names
- Chinese person names are composed of surnames and names.
- Most Chinese surnames are single characters; some rare ones are two characters.
- Most names are two characters; some rare ones are single characters (in Taiwan).
- The length of Chinese person names ranges from 2 to 6 characters.
- Transliterated person names
- Transliterated person names denote foreigners.
- The length of transliterated person names is not restricted to 2 to 6 characters.
26Named People Extraction: Chinese Person Names
- Extraction Strategies
- Baseline models: name-formulation statistics
- Propose possible candidates.
- Context cues
- Add extra scores to the candidates.
- When a title appears before (after) a string, it is probably a person name.
- Person names usually appear at the head or the tail of a sentence.
- Persons may be accompanied by speech-act verbs like "??", "?", "??", etc.
- Cache occurrences of named people
- A candidate appearing more than once has a high tendency to be a person name.
27Structure of Chinese Personal Names
- Chinese surnames have the following three types
- Single character like '?', '?', '?', '?'
- Two characters like '??' and '??'
- Two surnames together like '??'
- Most names have the following two types
- Single character
- Two characters
28Training Data
- Name-formulation statistics are trained from a 1-million person-name corpus in Taiwan.
- Each entry contains surname, name, and sex.
- There are 489,305 male names and 509,110 female names.
- In total, 598 surnames are retrieved from this 1-M corpus.
- Surnames of very low frequency, like ?, ?, etc., are removed to avoid false alarms.
- Only 541 surnames are left, and they are used to trigger the person name extraction system.
29Training Data
- The probability of a Chinese character being the first character (or the second character) of a name is computed for male and female separately.
- We compute the probabilities using training tables for female and male, respectively.
- Either the male score or the female score may be greater than the thresholds.
- In some cases, the female score may be greater than the male score.
- Thresholds are defined such that 99% of the training data pass them.
30Baseline Models: name-formulation statistics
- Model 1. Single-character surname, e.g., ?, ?, ? and ?
- P(C1)P(C2)P(C3) using the training table for male > Threshold1 and P(C2)P(C3) using the training table for male > Threshold2, or
- P(C1)P(C2)P(C3) using the training table for female > Threshold3 and P(C2)P(C3) using the training table for female > Threshold4
- Model 2. Two-character surname, e.g., ?? and ??
- P(C2)P(C3) using the training table for male > Threshold2, or
- P(C2)P(C3) using the training table for female > Threshold4
- Model 3. Two surnames together, like ??
- P(C12)P(C2)P(C3) using the training table for female > Threshold3, P(C2)P(C3) using the training table for female > Threshold4, and P(C12)P(C2)P(C3) using the training table for female > P(C12)P(C2)P(C3) using the training table for male
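The baseline scoring can be sketched as below: a surname from the trained surname list triggers a candidate, and the product of per-position character probabilities is compared against a threshold. The romanized names, the tiny probability tables, and the threshold value are all hypothetical stand-ins for the 1M-name training data.

```python
# Hypothetical trained resources (stand-ins for the real 1M-name corpus).
SURNAMES = {"chen", "lin"}              # trigger list (541 surnames in the talk)
P_FIRST = {"zhi": 0.02, "mei": 0.03}    # P(C2): first given-name character
P_SECOND = {"ming": 0.04, "hua": 0.05}  # P(C3): second given-name character
THRESHOLD = 1e-4                        # e.g., chosen so ~99% of training names pass

def is_person_name(surname, c2, c3):
    """Model-1-style check: surname triggers, P(C2)*P(C3) must pass."""
    if surname not in SURNAMES:
        return False
    score = P_FIRST.get(c2, 0.0) * P_SECOND.get(c3, 0.0)
    return score > THRESHOLD

print(is_person_name("chen", "zhi", "ming"))  # 0.02 * 0.04 = 8e-4 > 1e-4 -> True
```

A real system would keep separate male and female tables and compare both scores against their own thresholds, as the slide describes.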
31Cues from Character Levels
- Gender
- A married woman may add her husband's surname before her surname. That forms type 3 person names.
- Because a surname may be considered as a name, candidates with two surnames do not always belong to the type 3 person names.
- Gender information helps us disambiguate this type of person name.
- Some Chinese characters have a high score for male and some for female. The following shows some examples.
- Male: ?????????????????????
- Female: ?????????????????????
32Cues from Sentence Levels
- Titles
- When a title appears before (after) a candidate, it is probably a person name. This helps to decide the boundary of a name.
- ????? vs. ??????? ...
- Mutual Information
- Telling whether a word is a content word or a name is indispensable.
- ?????,??????
- When there is a strong relationship between surrounding words, the candidate word has a high probability of being a content word.
- Punctuation Marks
- When a candidate is located at the end of a sentence, we give it an extra score.
- Words around a caesura mark tend to be of similar types.
33Cues from Passage/Document Level: Cache
- A person name may appear more than once in a paragraph.
- There are four cases when the cache is used:
- (1) C1C2C3 and C1C2C4 are both in the cache, and C1C2 is correct.
- (2) C1C2C3 and C1C2C4 are both in the cache, and both are correct.
- (3) C1C2C3 and C1C2 are both in the cache, and C1C2C3 is correct.
- (4) C1C2C3 and C1C2 are both in the cache, and C1C2 is correct.
34Cache
- The problem in using the cache is case selection.
- Every entry in the cache is assigned a weight.
- An entry with a clear right boundary has a high weight.
- title and punctuation
- The other entries are assigned low weights.
- The use of weights in case selection
- high vs. high => case (2)
- high vs. low or low vs. high => the high one is correct
- low vs. low
- check the score of the last character of the name part
- ??? ???
- ??? ???
35Discussion
- Some typical types of errors:
- Foreign names (e.g., ???, ???)
- They are identified as proper nouns correctly, but are assigned wrong features.
- About 20% of errors belong to this type.
- Rare surnames (e.g., ?, ?, ?) or artists' stage names
- Nearly 14% of errors come from this type.
- Others
- Other proper nouns (place names, organization names, etc.)
- identification errors
36Omitted Name Problem
- Some texts omit the name part and leave only the surname.
- ??????
- Strategies
- If this candidate appeared earlier in the same paragraph, it is an omitted name.
- If this candidate has a special title like ??????? or a general title like ??????..., then it is an omitted name.
- If two single characters have a very high probability of being surnames, and they appear around a caesura mark, then they are regarded as omitted names.
37Transliterated Person Names
- Challenging Issues
- No special cue, like the surnames in Chinese person names, to trigger the recognition system.
- No restriction on the length of a transliterated person name.
- No large-scale transliterated person name corpus.
- Ambiguity in classification: '???' may denote a city or a former American president.
38Strategy (1)
- Character Condition
- When a foreign name is transliterated, the selection of homophones is restrictive. Richard Macs: ????? vs. ?????
- A basic character set can be trained from a transliterated name corpus.
- If all the characters in a string belong to this set, the string is regarded as a candidate.
39Strategy (2)
- Syllable Condition
- Some characters which meet the character condition do not look like transliterated names.
- Syllable Sequence
- Simplified Condition
- (1) For each candidate, we check the syllable of the first (the last) character.
- (2) If the syllable does not belong to the training corpus, the character is deleted.
- (3) The remaining characters are treated in a similar way.
40Strategy (3)
- Frequency Condition
- For each candidate which has only two characters, we compute the frequency of these two characters to see if it is larger than a threshold.
- The threshold is determined in a similar way to the baseline model for Chinese person names.
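The character and frequency conditions above can be combined into a single candidate filter, sketched below. The transliteration character set, the pair-frequency table, and the threshold are small hypothetical samples; a real system would train them from a transliterated-name corpus, and would add the syllable condition as a further filter.

```python
# Hypothetical trained resources (Latin letters stand in for the
# Chinese transliteration character set).
TRANSLIT_CHARS = set("abcdefg")               # character condition: trained set
PAIR_FREQ = {("a", "b"): 12, ("c", "d"): 2}   # frequency condition: pair counts
FREQ_THRESHOLD = 5

def is_transliterated_candidate(chars):
    # Character condition: every character must come from the trained set.
    if not all(c in TRANSLIT_CHARS for c in chars):
        return False
    # Frequency condition applies only to two-character candidates.
    if len(chars) == 2:
        return PAIR_FREQ.get((chars[0], chars[1]), 0) >= FREQ_THRESHOLD
    return True
```

Under these toy tables, "ab" passes both conditions while "cd" fails the frequency threshold, mirroring how short candidates need extra statistical support.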
41Cues around Names
- Cues within Transliterated Names
- Character Condition
- Syllable Condition
- Frequency Condition
- Cues around Transliterated Names
- titles the same as Chinese person names
- name introducers "?", "??", "??", "??", and "??"
- special verbs the same as Chinese person names
- first name + middle name + last name
42Discussion
- Some transliterated person names may be identified by the Chinese person name extraction system.
- ??? ???
- Some nouns may look like transliterated person names.
- popular brands of automobiles, e.g., '???' and '???'
- Chinese proper nouns, e.g., '??', '??' and '??'
- Chinese person names, e.g., '???'
- Besides the above nouns, boundary errors affect the precision too.
- (?)???
43Named Organization Extraction
- A complete organization name can be divided into two parts: name and keyword.
- Example: ?????
- Many words can serve as names, but only some fixed words can serve as keywords.
- Challenging Issues
- (1) A keyword is usually a common content word.
- (2) A keyword may appear in an abbreviated form.
- (3) The keyword may be omitted completely.
44Classification of Organization Names
- Complete organization names
- This type of organization name is usually composed of proper nouns and keywords.
- Some organization names are very long, so (left) boundary determination is difficult.
- Some organization names with keywords are still ambiguous.
- '???' usually denotes reading matter, not organizations.
- Incomplete organization names
- These organization names often omit their keywords.
- The abbreviated organization names may be ambiguous.
- '??' and '??' are famous sport teams in Taiwan and in the USA, respectively; however, they are also common content words.
45Strategies
- Keywords
- A keyword shows not only the possibility of an occurrence of an organization name, but also its right boundary.
- Prefix
- A prefix is a good marker for a possible left boundary.
- Single-character words
- If the character preceding a possible keyword is a single-character word, then the content word is not a keyword.
- If the characters preceding a possible keyword cannot exist independently, they form the name part of an organization.
- Words of at least two characters
- The words that compose a name part usually have strong relationships.
46Strategies
- Parts of speech
- The name part of an organization cannot extend beyond a transitive verb.
- Numerals and classifiers are also helpful.
- Cache
- Problem: when should a pattern be put into the cache?
- The character set is incomplete.
- n-gram model
- It must consist of a name and an organization name keyword.
- Its length must be greater than 2 words.
- It does not cross any punctuation marks.
- It must occur more often than a threshold.
47Handcrafted Rules
- OrganizationName → OrganizationName OrganizationNameKeyword, e.g., ??? ??
- OrganizationName → CountryName OrganizationNameKeyword, e.g., ?? ???
- OrganizationName → PersonName OrganizationNameKeyword, e.g., ??? ???
- OrganizationName → CountryName OrganizationName, e.g., ?? ???
- OrganizationName → LocationName OrganizationName, e.g., ???? ??
- OrganizationName → CountryName D D D OrganizationNameKeyword, e.g., ?? ?? ????
- OrganizationName → PersonName D D OrganizationNameKeyword, e.g., ??? ?? ???
- OrganizationName → LocationName D D OrganizationNameKeyword, e.g., ?? ?? ????
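Rules of this shape can be sketched as pattern matching over a sequence of typed tokens. The rule inventory below covers only the two-constituent productions, and the English example tokens are hypothetical; this is an illustration of the rule format, not the lecture's actual grammar.

```python
# A subset of the two-constituent handcrafted rules, as (type1, type2) pairs.
RULES = {
    ("OrganizationName", "OrganizationNameKeyword"),
    ("CountryName", "OrganizationNameKeyword"),
    ("PersonName", "OrganizationNameKeyword"),
    ("CountryName", "OrganizationName"),
    ("LocationName", "OrganizationName"),
}

def match_organizations(tagged):
    """tagged: list of (token, type) pairs; returns joined
    organization-name strings for every adjacent pair matching a rule."""
    orgs = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if (t1, t2) in RULES:
            orgs.append(w1 + w2)
    return orgs
```

For example, `match_organizations([("Taiwan", "CountryName"), ("University", "OrganizationNameKeyword")])` yields one organization candidate; the three-constituent rules with the D slots would extend the window to triples in the same style.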
48Discussion
- Most errors result from organization names without keywords.
- ??? ?? ????
- ?? ?? ??
- Identification errors
- Even if keywords appear, organization names do not always exist.
- ???? ????
- Erroneous left boundaries are also a problem.
- ????? (??)????
- Ambiguities
- ??? ????
49Application of Gender Assignment
- Anaphora resolution
- "?????,??????????,??????????,?????????????????????,?????,?????,?????????????????,????????"
- The gender of a person name is useful for this problem.
- The correct rate for gender assignment is 89%.
- Co-Reference resolution
50Named Location Extraction
- A location name is composed of name and keyword parts.
- Rules
- LocationName → PersonName LocationNameKeyword
- LocationName → LocationName LocationNameKeyword
- Locative verbs like '??', '??', and so on, are introduced to treat location names without keywords.
- Cache and n-gram models are also employed to extract location names.
51Date Expressions
- DATE → NUMBER YEAR (? ?)
- DATE → NUMBER MTHUNIT (? ?)
- DATE → NUMBER DUNIT (? ?)
- DATE → REGINC (??)
- DATE → FSTATE DATE (?? ??)
- DATE → COMMON DATE (? ??)
- DATE → REGINE DATE (?? ????)
- DATE → DATE DMONTH (?? ??)
- DATE → DATE BSTATE (?? ?)
- DATE → FSTATEDATE DATE (?? ???)
- DATE → FSTATEDATE DMONTH (?? ??)
- DATE → FSTATEDATE FSTATEDATE (?? ??)
- DATE → DATE YXY DATE (???? ? ????)
52Time Expressions
- TIME → NUMBER HUNIT (? ?)
- TIME → NUMBER MUNIT (?? ?)
- TIME → NUMBER SUNIT (? ?)
- TIME → FSTATETIME TIME
- TIME → FSTATE TIME
- TIME → TIME BSTATE
- TIME → MORN BSTATE
- TIME → TIME TIME
- TIME → TIME YXY TIME (?? ? ??)
- TIME → NUMBER COLON NUMBER (03:45)
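The last rule (TIME → NUMBER COLON NUMBER) can be sketched as a simple pattern; the remaining rules would chain recognized sub-expressions in the same cascaded fashion. The regular expression below is an illustrative assumption, not the lecture's actual recognizer.

```python
import re

# NUMBER COLON NUMBER, e.g. "03:45": one or two digits, a colon, two digits.
TIME_COLON = re.compile(r"\b(\d{1,2}):(\d{2})\b")

def find_colon_times(text):
    """Return every colon-separated time expression found in the text."""
    return [m.group(0) for m in TIME_COLON.finditer(text)]

print(find_colon_times("meeting at 03:45 and 9:05"))  # -> ['03:45', '9:05']
```

A full grammar-based recognizer would instead tokenize NUMBER and unit tokens first and then apply the productions bottom-up, so that nested rules such as TIME → FSTATE TIME compose.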
53Monetary Expressions
- DMONEY → MOUNIT NUMBER MOUNIT (?? ? ?)
- DMONEY → NUMBER MOUNIT MOUNIT (? ? ??)
- DMONEY → NUMBER MOUNIT (? ?)
- DMONEY → MOUNIT MOUNIT NUMBER (?? 5)
- DMONEY → MOUNIT NUMBER ( 5)
- DMONEY → NUMBER YXY DMONEY (? ? ??)
- DMONEY → DMONEY YXY DMONEY (?? ? ??)
- DMONEY → DMONEY YXY NUMBER (200 - 500)
54Percentage Expressions
- DPERCENT → PERCENT NUMBER (??? ?)
- DPERCENT → NUMBER PERCENT (3%)
- DPERCENT → DPERCENT YXY DPERCENT (5% ? 8%)
- DPERCENT → DPERCENT YXY NUMBER (???? ? ?)
- DPERCENT → NUMBER YXY DPERCENT (? ? ????)
55Named Entity Extraction in MET2
- Transform Chinese texts in GB codes into texts in Big-5 codes.
- Segment Chinese texts into a sequence of tokens.
- Identify named people.
- Identify named organizations.
- Identify named locations.
- Use the n-gram model to identify named organizations/locations.
- Identify the rest of the named expressions.
- Transform the results in Big-5 codes into results in GB codes.
56From GB Codes to Big-5 Codes
- The Big-5 traditional character set and the GB simplified character set are adopted in Taiwan and in China, respectively.
- Our system is developed on the basis of Big-5 codes, so the transformation is required.
- Characters used both in the simplified character set and in the traditional character set always result in error mapping.
- ?? vs. ?? ?? vs. ?? ?? vs. ???? vs. ?? ?? vs. ?? ??? vs. ?????? vs. ??? ?? vs. ?? ?? vs. ?????? vs. ???? and so on.
- More unknown words may be generated.
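The code transformation necessarily goes through a common representation, and simplified characters with no Big-5 counterpart surface as conversion errors, which is one source of the mapping problems noted above. This sketch uses Python's standard codecs (an assumption about tooling; the original system predates them) with Unicode as the pivot.

```python
def gb_to_big5(gb_bytes):
    """Decode GB-coded bytes and re-encode as Big-5.
    Characters with no Big-5 mapping are replaced with '?' rather
    than silently mis-mapped, making the failures visible."""
    text = gb_bytes.decode("gbk")            # GB bytes -> Unicode
    return text.encode("big5", errors="replace")  # Unicode -> Big-5 bytes

# A traditional-compatible character converts cleanly; a simplified-only
# character (e.g. the simplified form of a door radical) does not.
ok = gb_to_big5("明".encode("gbk"))
```

Every such replacement character then shows up downstream as an unknown word, which is why the slide notes that more unknown words may be generated.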
57Segmentation
- We list all the possible words by dictionary look-up, and then resolve ambiguities by segmentation strategies.
- The test documents in MET-2 are selected from China newspapers.
- Our dictionary is trained from Taiwan corpora.
- Due to the different vocabulary sets, many more unknown words may be introduced: ???? vs. ????, ?? vs. ??, ?? vs. ???, ??? vs. ???, etc.
- The unknown words from different code sets and different vocabulary sets make named entity extraction more challenging.
58MET-2 Formal Run of NTUNLPL
- F-measures
- P&R: 79.61%
- 2P&R: 77.88%
- P&2R: 81.42%
- Recall and Precision
- name (85%, 79%)
- number (91%, 98%)
- time (95%, 85%)
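The three F-measures above weight precision and recall differently: the balanced P&R, a precision-favoring 2P&R, and a recall-favoring P&2R. Assuming the MUC scorer's van Rijsbergen formula (a standard assumption, not stated on the slide), they are instances of F_beta with beta below, equal to, and above 1:

```python
def f_measure(precision, recall, beta=1.0):
    """Van Rijsbergen F-measure: F_beta = (b^2+1) P R / (b^2 P + R).
    beta > 1 weights recall more heavily; beta < 1 weights precision."""
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# With recall above precision (as in the run above), the recall-weighted
# score exceeds the balanced one, which exceeds the precision-weighted one.
p, r = 0.79, 0.85
scores = [f_measure(p, r, b) for b in (0.5, 1.0, 2.0)]
```

This ordering matches the reported numbers, where P&2R (81.42%) > P&R (79.61%) > 2P&R (77.88%).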
59Named Persons
- The recall rate and the precision rate are 91% and 74%.
- Major errors
- Segmentation, e.g., ??? -> ?? ?: part of a person name may be regarded as a word during segmentation.
- The surname, name, character set and title lists are incomplete, e.g., ???, ?? ? ??, ?? ??
- Blanks, e.g., ? ?: we cannot tell whether blanks exist in the original documents or are inserted by the segmentation system.
- Boundary errors
- Japanese names, e.g., ?????
60Evaluation: Named Organizations
- The recall rate and the precision rate are 78% and 85%.
- Major errors
- More than two content words between name and keyword, e.g., ?? ?? ?? ?? ??
- Absence of keywords, e.g., ???????
- Absence of the name part: the name part does not satisfy the character condition, e.g., ????
- n-gram errors, e.g., ???????????
61Evaluation: Named Locations
- The recall rate and the precision rate are 78% and 69%.
- Character set
- The characters "?" and "?" in the string "????" do not belong to our transliterated character set.
- Wrong keyword
- The character "?" is an organization keyword. Thus the string "?????" is mis-regarded as an organization name.
- Common content words
- Words such as "??", "??", etc., are common content words. We do not give them special tags.
- Single-character locations
- Single-character locations such as "?", "?", and so on, are missed during recognition.
62Evaluation: Time/Date Expressions
- The recall rate and the precision rate for date, time, monetary and percentage expressions are (94%, 88%), (98%, 70%), (98%, 98%), and (83%, 98%), respectively.
- Major errors
- Propagation errors
- segmentation before entity extraction, e.g., ??
- named people extraction before date expressions
- Absent date units
- the date unit does not appear, e.g., ????
- the date unit should appear, e.g., ???
63- Absent keywords
- Some keywords are not listed.
- E.g., ???????8?58? is divided into ??, 8?58?
- Rule coverage
- E.g., ?????
- Ambiguity
- Some characters like ? can be used in both time and monetary expressions. E.g., ???????? is divided into two parts: ??? and ?????
- The strings "??" and "??" are words. In our pipelined model, "????" and "????" will be missed.
64Issues
- Dealing with the errors propagated from previous modules
- Pipelining model vs. interleaving model
- Dealing with the errors resulting from rule coverage
- Handcrafted rules vs. learned rules
- Dealing with the errors resulting from segmentation standards
- Vocabulary set of Academia Sinica vs. Peking University
65Pipelining Model
segmentation
named entity extraction
named people
named location
named organization
number date/time
input
output
ambiguity resolution
only one result
66Interleaving Model
table lookup
named people
named locations
named organization
number date/time
input
ambiguity resolution
output
67An Example in Interleaving Model
68Learning Rules vs. Hand-Crafted Rules
- Collect organization names.
- Extract patterns
- Cluster organization names based on keywords
- Assign features to name parts
- Employ the Teiresias algorithm to extract patterns (http://cbcsrv.watson.ibm.com/Tspd.html)
69Teiresias Algorithm
- Patterns consist of words and wild cards, e.g.,
- ?? ?? ??
- ?? ?? ??
- => ?? ??
- Parameter setting
- L: the least number of non-wild cards
- W: the maximum extent allowed to span L non-wild cards
- T: confidence level, i.e., how many training instances this pattern must satisfy
70Keyword Set
- Extracting the keyword set
- Input all the training instances (i.e., organization names) into the Teiresias algorithm.
- Let the confidence level be 5.
- Find all the patterns not ending with a wild card, e.g., ( ?? ) ( ?? ) ( ? ? )
- Regard the suffix of a pattern as a keyword, e.g., ??? ??? ? ?? ?? ??? ?? ?????
71Features of Patterns
- types
- named entities
- named people
- named locations
- named organizations
- date expression ( ????????????? )
- number ( 87????????? )
- common nouns
72Tagging
73Tagging
- Lexical level
- Part of Speech Tagging
- Named entity tagging
- Sense Tagging
- Syntactic level
- Syntactic Category (Structure) Tagging
- Discourse level
- Anaphora-Antecedent Tagging
- Co-Reference Tagging
74Part-of-Speech Tagging
- Issues affecting tagging accuracy
- the amount of training data
- the granularity of the tag set
- the occurrences of unknown words, and so on
- Academia Sinica Balanced Corpus
- 5 million words
- 46 tags
- Language models, e.g., bigram, trigram, ...
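A bigram language model for tagging can be sketched as Viterbi decoding over transition and emission probabilities. The tags, words, and probabilities below are toy assumptions for illustration; the real models would be estimated from the Academia Sinica Balanced Corpus.

```python
def viterbi(words, tags, trans, emit, start):
    """Return the tag sequence maximizing start * emission * transition
    products under a bigram (HMM-style) model.
    trans[(t1, t2)], emit[(t, w)], start[t] are probabilities."""
    best = {t: (start.get(t, 0) * emit.get((t, words[0]), 0), [t]) for t in tags}
    for w in words[1:]:
        nxt = {}
        for t in tags:
            # Best previous tag extended by transition into t.
            p, path = max((best[pt][0] * trans.get((pt, t), 0), best[pt][1])
                          for pt in tags)
            nxt[t] = (p * emit.get((t, w), 0), path + [t])
        best = nxt
    return max(best.values())[1]

# Hypothetical two-tag model (N = noun, V = verb).
TAGS = ["N", "V"]
TRANS = {("N", "N"): 0.3, ("N", "V"): 0.7, ("V", "N"): 0.5, ("V", "V"): 0.5}
EMIT = {("N", "dog"): 0.9, ("V", "dog"): 0.1, ("N", "runs"): 0.2, ("V", "runs"): 0.8}
START = {"N": 0.6, "V": 0.4}
print(viterbi(["dog", "runs"], TAGS, TRANS, EMIT, START))  # -> ['N', 'V']
```

A trigram model would condition each transition on the two previous tags instead of one, at the cost of a larger state space.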
75Sense Tagging
- Assign sense labels to words in a sentence.
- Sense Tagging Set
- tong2yi4ci2ci2lin2 (?????, Cilin)
- 12 large categories
- 94 middle categories
- 1,428 small categories
- 3,925 word clusters
76A: People
Aa: a collective name
01 Human being / The people / Everybody
02 I / We
03 You / You
04 He/She / They
05 Myself / Others / Someone
06 Who
Ab: people of all ages and both sexes
01 A Man / A Woman / Men and Women
02 An Old Person / An Adult / The old and the young
03 A Teenager
04 An Infant / A Child
Ac: posture
01 A Tall Person / A Dwarf
02 A Fat Person / A Thin Person
03 A Beautiful Woman / A Handsome Man
78Degree of Polysemy in Mandarin Chinese
- The small categories of Cilin are used to compute the distribution of word senses.
- ASBC is employed to count the frequency of a word.
- In total, 28,321 word types appear both in Cilin and in the ASBC corpus.
- In total, 5,922 words are polysemous.
79- degree: number of senses of a word
- word type: a dictionary entry
80- 97.53% (94.70% for N and V)
- 98.22% (97.08% for A and K)
- (chart: 5,922 vs. 4,132 words)
81- 93.77% of polysemous words belong to the class of low ambiguity; they occupy only 58.52% of the tokens in the ASBC corpus.
- word token: an occurrence of a word type in the ASBC corpus
- word type: a dictionary entry
82Low frequency (< 100), middle frequency (100 ≤ f < 1000), high frequency (≥ 1000)
83Phenomena
- POS information reduces the degree of ambiguity.
- In total, 8.94% of word tokens are highly ambiguous in Table 3. This decreases to 0.47% in Table 4.
- Highly ambiguous words tend to be highly frequent.
- 23.67% of word types are middle- or high-frequency words, and they occupy 94.06% of word tokens.
84Semantic Tagging: Unambiguous Words
- Acquire the context for each semantic tag, starting from the unambiguous words (those words that have only one sense in Cilin), using Cilin and the ASBC corpus.
85Acquire Contextual Vectors
- An unambiguous word is characterized by the surrounding words.
- The window size is set to 6, and stop words are removed.
- A sense tag Ctag is represented as a vector (w1, w2, ..., wn).
- MI metric
- EM metric
86Semantic Tagging: Ambiguous Words
- Apply the information trained at the first stage to select the best sense tag from the candidates of each ambiguous word (those words that have more than one sense in Cilin).
87Apply and Retrain Contextual Vectors
- Identify the context vector of an ambiguous word.
- Measure the similarity between a sense vector and a context vector by the cosine formula.
- Select the sense tag with the highest similarity score.
- Retrain the sense vector for each sense tag.
88Semantic Tagging: Unknown Words
- Adopt outside evidence from the mapping between WordNet synsets and Cilin sense tags to narrow down the candidate set.
90Experiments
- Test Materials
- Sample documents of different categories from the ASBC corpus
- In total, 35,921 words in the test corpus
- Research associates tagged this corpus manually.
- Mark up the ambiguous words by looking up the Cilin dictionary.
- Tag the unknown words by looking up the mapping table.
- The tag mapper achieves approximately 82.52% performance.
92(Results chart: 49.55%, 62.60%, 31.36%, 27.00%)
93- The performance for tagging low-ambiguity (2-4), middle-ambiguity (5-8) and high-ambiguity (>8) words is similar (i.e., 63.98%, 60.92% and 67.95%) when 1, 2, and 3 candidates are proposed.
- Under the middle categories and 1-3 proposed candidates, the performance for tagging low, middle and high ambiguous words is 71.02%, 73.88%, and 75.94%.
95Co-Reference Resolution
96Introduction
- Anaphora vs. Co-Reference
- Anaphora
- ?????
- Type/Instance: ??/??, ?????/??
- Function/Value: ?????/?? 30 ?
- NP: ?????? ?????/???
??
??
97Flow of Co-Reference Resolution
Document
98Find the Candidate List
Document
Determine Candidates
All the Candidates
Co-Reference Resolution Algorithm
Singletons
Class 1
Class 2
Class N
99Find the Candidates
- Select all nouns (Cand-Terms)
- Na (????)
- Nb (????)
- Nc (????)
- Nd (????)
- Nh (???)
- Delete some Nds (171 in total)
- e.g., ??????????, ??, ??, ...
- Select noun phrases (Cand-NP)
- Select maximal noun phrases (Cand-MaxNP)
Some are found during named entity extraction.
100Recognize NPs whose head is Na (common noun)
??(Neqa) ?(DE) ??(Na) ??(Neqa) ?(Neu) ?(Nf)
??(Na) ?(Nep) ?(Neu) ?(Nf) ??(Na) ?(Nes)
??(Na)
101Recognize NPs whose head is Nh (pronoun)
Na
Nh
Init
Nb
??(Nb) ??(Nh) ???(Na) ??(Nd) ?(Nh) ??(Nh)
102Cilin (?????)
- 12 large categories
- 94 middle categories
- 1,428 small categories
- 3,925 word clusters
103Features
- Classification
- Word/Phrase Itself
- Part of Speech of Head
- Semantics of Head
- Type of Named Entities
- Positions (Sentences and Paragraphs)
- Number: Singular, Plural and Unknown
- Gender: Pronouns and Chinese Person Names
- Pronouns: Personal Pronouns, Demonstrative Pronouns
104Co-Reference Resolution Algorithms
- Strategy 1: simple pattern matching
- Strategy 2: Cardie clustering algorithm
105Cardie Clustering Algorithm
107Semantics Restrictions
- SemanticFun_1(NPi, NPj)
- If the heads belong to the same word cluster, then 0 is assigned; else 1 is assigned.
- SemanticFun_2(NPi, NPj)
- Integrates POS, named entity and Cilin sense information
- Only one is an NE
- NPi and NPj are not NEs, and they are in Cilin
- NPi and NPj are not NEs, and only one of them is in Cilin
- NPi and NPj are not NEs, and both are not in Cilin → 0
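The clustering strategy can be sketched as below: noun phrases whose pairwise feature distance stays under a radius are merged into one coreference class, in the spirit of the Cardie clustering algorithm. The feature set, weights, and toy noun phrases are assumptions for illustration, not the lecture's actual distance function.

```python
def np_distance(a, b):
    """Weighted incompatibility between two NP feature dicts.
    Incompatible gender bars coreference outright (infinite distance)."""
    d = 0.0
    if a["gender"] != b["gender"] and "unknown" not in (a["gender"], b["gender"]):
        d += float("inf")
    if a["number"] != b["number"]:
        d += 1.0
    if a["semclass"] != b["semclass"]:
        d += 1.0
    return d

def cluster(nps, radius=1.0):
    """Greedy clustering: add each NP to the first class where it is
    within the radius of every member; otherwise start a new class."""
    classes = []
    for np_ in nps:
        for c in classes:
            if all(np_distance(np_, m) <= radius for m in c):
                c.append(np_)
                break
        else:
            classes.append([np_])
    return classes
```

For instance, two feminine singular person NPs fall into one class while a masculine one is kept apart by the gender restriction, matching the role of SemanticFun-style constraints above.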
108SemanticFun_2
- One of them is an NE
- The column denotes the type of NE
- The row denotes the part of speech
- An English string in a table cell denotes a Cilin sense
109Experimental Results
110Named Entity Tagging Environment
114tag at the same time
116?? vs. ??????
118Named Entity Extraction in Bioinformatics
Application
- Named Entities
- Protein Name
- Gene Name
119Summary
- Segmentation
- Named Entity Extraction
- POS and Sense Tagging
- Co-Reference Resolution
- NE Tagging Environment
- Bioinformatics