Title: Chinese Information Extraction
1Chinese Information Extraction
Tianfang Yao
Department of Computer Science and Engineering
Shanghai Jiao Tong University
1954 Hua Shan Road
Shanghai, 200030, China
2Outline
- Introduction
- Word Segmentation
- Named Entity Extraction
- Entity Relation Extraction
- Conclusion
3Introduction (1)
- Chinese Language
- Difficulties in Chinese NLP
- State-of-the-Art for Chinese Information Extraction
4Introduction (2)
- Chinese Language
- Chinese is typologically different from English or German.
- It has a large character set of about 44,908 characters.
- Although Chinese has a history of more than 6,000 years, no complete standard grammar of Chinese has been established so far.
5Introduction (3)
- Chinese Language
- The form of a Chinese character is related to its meaning. The script combines pictographs, e.g. 日 (sun) and 月 (moon), self-explanatory characters, e.g. 上 (above) and 下 (below), and associative compounds, e.g. 信 (believe), a character made up of 人 (man) and 言 (word), which means a message or something that can be believed or trusted.
- There are many homonyms among Chinese words, e.g. 锣 (gong), 螺 (spiral shell), 骡 (mule), 箩 (bamboo basket), etc.
- A Chinese word can be split or expanded, and its order can be changed. e.g.??(take a meal) vs. ???????(haircut)? vs. ???
6Introduction (4)
- Difficulties in Chinese NLP
- Because there are no spaces between the characters in a Chinese sentence, words must be segmented before the sentence structure can be analyzed.
- Chinese characters have no inflection, so semantic structures are more important than syntactic structures for understanding Chinese sentences.
- The combination of Chinese words is flexible, changeable, succinct and implicit. Sometimes constituents are omitted from the sentence.
- A Chinese sentence sometimes contains consecutive nouns or consecutive verbs.
7Introduction (5)
- State-of-the-Art for Chinese Information Extraction
- Knowledge engineering approaches
- Automatically trainable approaches
- Statistical approaches
- Hybrid approaches
8Word Segmentation (1)
- Research on Automatic Chinese Word Segmentation
- (Kaiying Liu. Computer Science Department, Shan Xi University, China)
- 1. Definitions
- Definition 1 Ambiguous Phrase of Overlap Type
- Assume that AJB is a character string and W is a word list. If AJ ∈ W and JB ∈ W, then AJB is called an ambiguous phrase of overlap type.
- e.g. In the string 当代表 (act as a delegate), both 当代 (of our time) and 代表 (delegate) are words, so this string is an ambiguous phrase of overlap type.
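Definition 1 can be checked mechanically against a word list. The following is a minimal sketch, not the author's implementation: the function names are hypothetical, and the test string reuses the schematic letters A, J and B from the definition rather than real Chinese characters.

```python
def overlap_splits(s, words):
    """Return every (A, J, B) split of s where A+J and J+B are both words.

    s     -- the character string to test
    words -- the word list W, as a set of strings
    """
    splits = []
    n = len(s)
    for i in range(1, n - 1):        # A = s[:i] is non-empty
        for j in range(i + 1, n):    # J = s[i:j] and B = s[j:] are non-empty
            a, mid, b = s[:i], s[i:j], s[j:]
            if a + mid in words and mid + b in words:
                splits.append((a, mid, b))
    return splits


def is_overlap_ambiguous(s, words):
    """True if s is an ambiguous phrase of overlap type with respect to words."""
    return bool(overlap_splits(s, words))


# Schematic example taken directly from the definition: AJ and JB are words.
print(overlap_splits("AJB", {"AJ", "JB"}))
```

Each returned triple names the shared string J explicitly, so the same function can also be used to count the overlapping strings in a longer phrase.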
9Word Segmentation (2)
- Definition 2 Chain Length
- The number of ambiguous (overlapping) strings in a phrase is called the chain length.
- e.g. There is one ambiguous string in the string 当代表, so the chain length is 1.
- Definition 3 Ambiguous Phrase of Combination Type
- Assume that AB is a character string and W is a word list. If A ∈ W, B ∈ W and AB ∈ W, then AB is called an ambiguous phrase of combination type.
- e.g. In the string 个人 (individual), 个 (quantifier), 人 (man) and 个人 are all words, so this string is an ambiguous phrase of combination type.
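The combination-type test in Definition 3 is a simple membership check. A minimal sketch follows, assuming the word list is available as a Python set; the function name is hypothetical and the example uses the schematic letters from the definition.

```python
def combination_splits(s, words):
    """Return every (A, B) split of s where A, B and the whole string are words."""
    if s not in words:
        return []
    return [
        (s[:i], s[i:])
        for i in range(1, len(s))
        if s[:i] in words and s[i:] in words
    ]


# Schematic example from the definition: A, B and AB are all in the word list.
print(combination_splits("AB", {"A", "B", "AB"}))
```

A non-empty result means the string is an ambiguous phrase of combination type with respect to the given word list.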
10Word Segmentation (3)
- 2. Build the ambiguous phrase libraries
- 78,000 phrases for overlap type
- More than 3,000 phrases for combination type
- Statistical results for overlap type
- Their chain lengths are mostly 1 or 2, about 95% of all.
- Among the ambiguous phrases like ABCD with a chain length of 2, 98% can be segmented into AB/CD.
- The segmentation of about 82% of the ambiguous phrases like ABCDE with a chain length of 3 depends on the leftmost three characters ABC.
- False ambiguous phrases: 94%
- Real ambiguous phrases: 6%
11Word Segmentation (4)
- False ambiguous phrase: it actually has only one segmentation result in real texts. e.g. ?(be given)??(a criticism)
- Real ambiguous phrase: it has more than one applicable segmentation result.
- Case 1: the results have almost equal occurrence probabilities.
- e.g. 应用于 (apply to) can be segmented into 应用/于 (apply to) or 应/用于 (should be used in)
- Case 2: the phrase is mostly segmented into only one result in real texts.
- e.g. ???(have dismissed) should mostly be segmented into ???(have dismissed)
12Word Segmentation (5)
- 3. Approaches for segmenting ambiguous phrases of overlap type
- Statistics based approach
- A word-formation capacity library is built. It includes frequency information for each ambiguous phrase AJB with a chain length of 1, that is, the frequencies with which its parts form words: FreqLeft(AJ), FreqRight(B), FreqLeft(A) and FreqRight(JB)
- Rule 1 If FreqLeft(AJ) × FreqRight(B) > FreqLeft(A) × FreqRight(JB), AJB is segmented into AJ/B, otherwise into A/JB
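Rule 1 compares two frequency products. The sketch below shows that comparison under stated assumptions: the two frequency tables are plain dictionaries with made-up counts, missing entries count as zero, and the function name is hypothetical. The rule is read as choosing between AJ/B and A/JB.

```python
def segment_chain1(a, j, b, freq_left, freq_right):
    """Segment an overlap phrase AJB with chain length 1 by comparing products.

    freq_left  -- hypothetical table: how often a string forms a word on the left
    freq_right -- hypothetical table: how often a string forms a word on the right
    """
    score_aj_b = freq_left.get(a + j, 0) * freq_right.get(b, 0)
    score_a_jb = freq_left.get(a, 0) * freq_right.get(j + b, 0)
    if score_aj_b > score_a_jb:
        return [a + j, b]      # AJ/B
    return [a, j + b]          # A/JB


# Made-up counts for illustration only.
freq_left = {"AJ": 8, "A": 2}
freq_right = {"B": 5, "JB": 3}
print(segment_chain1("A", "J", "B", freq_left, freq_right))
```

With these counts the products are 8 × 5 = 40 and 2 × 3 = 6, so the AJ/B reading is chosen.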
13Word Segmentation (6)
- (Depending on the statistical results for the ambiguous phrase library)
- Rule 2 An ambiguous phrase with a chain length of 2, like ABCD, is segmented into AB/CD.
- Rule 3 An ambiguous phrase with a chain length of 3, like ABCDE, is first segmented into ABC/DE, then the front part ABC is segmented as an ambiguous phrase with a chain length of 1.
- Rule 4 An ambiguous phrase with a chain length of 4, like ABCDEF, is segmented into AB/CD/EF.
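Rules 2 to 4 can be read as a dispatch on chain length. The sketch below makes two assumptions that the slides do not state explicitly: each schematic letter stands for exactly one character, and the intended splits are AB/CD, ABC/DE and AB/CD/EF. The function name and the `segment_chain1` callback (any routine that segments a chain-length-1 phrase, such as a Rule 1 implementation) are hypothetical.

```python
def segment_by_chain_length(phrase, chain_length, segment_chain1):
    """Segment an overlap phrase according to its chain length.

    segment_chain1 -- callback that takes a chain-length-1 phrase and
                      returns its list of segments
    """
    if chain_length == 1:
        return segment_chain1(phrase)
    if chain_length == 2:                      # ABCD -> AB/CD
        return [phrase[:2], phrase[2:]]
    if chain_length == 3:                      # ABCDE -> ABC/DE, then split ABC
        return segment_chain1(phrase[:3]) + [phrase[3:]]
    if chain_length == 4:                      # ABCDEF -> AB/CD/EF
        return [phrase[:2], phrase[2:4], phrase[4:]]
    raise ValueError("no rule for chain length %d" % chain_length)


# Stand-in for Rule 1 that always prefers AJ/B; for illustration only.
prefer_left = lambda p: [p[:2], p[2:]]
print(segment_by_chain_length("ABCDE", 3, prefer_left))
```

Passing the chain-length-1 routine as a callback keeps the statistical decision of Rule 1 separate from the fixed splits of Rules 2 and 4.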
14Word Segmentation (7)
- Rules based approach
- Rule 1 If there is an appulsive (directional) verb in an ambiguous phrase and its previous word is a verb, the appulsive verb is segmented off on its own. e.g. ?????(really embody) should be segmented into ?????, because ?(come up) is an appulsive verb and ?? is a verb.
- Rule 2 If the first character of an ambiguous phrase is a quantifier and the word preceding the phrase is a numeral, the quantifier is segmented off on its own. e.g. 65???(a high building of 65 stories) should be segmented into 65???, because ? is a quantifier and 65 is a numeral.
15Word Segmentation (8)
- 4. Approaches for segmenting ambiguous phrases of combination type
- Statistics based approach
- Among all ambiguous phrases, 30% usually have only one segmentation result. Therefore, a library including 133 phrases is built. The structure of the database is as follows
  FIELD NAME   TYPE     LENGTH   EXPLANATION
  word         char     4        AB
  nh           number   3        the number of times AB is kept as the single word AB
  nf           number   3        the number of times AB is segmented into A/B
- Assume freq = nh/(nh + nf) and thresholds a1 and a2, where a1 > a2. If freq > a1, AB is kept as the single word AB. If freq < a2, it is segmented into A/B.
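The threshold test on freq can be sketched as follows. Two things here are assumptions rather than facts from the slides: nh is read as the count of occurrences kept whole and nf as the count of occurrences that were split, and the threshold values in the example are invented. The function name is hypothetical.

```python
def decide_combination(nh, nf, a1, a2):
    """Decide how to segment a combination-type phrase AB from its counts.

    nh -- assumed: times AB was kept as one word
    nf -- assumed: times AB was split into A/B
    a1, a2 -- thresholds with a1 > a2 (values are not given in the source)
    """
    total = nh + nf
    if total == 0:
        return "undecided"        # no evidence in the library
    freq = nh / total
    if freq > a1:
        return "AB"
    if freq < a2:
        return "A/B"
    return "undecided"            # between the thresholds: use other clues


print(decide_combination(nh=9, nf=1, a1=0.8, a2=0.2))
```

Phrases that fall between the two thresholds are left undecided here, which is where a context-based rule would have to take over.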
16Word Segmentation (9)
- POS rule based approach
- The segmentation of a word is related to the POS of its context words. If the word preceding AB is a numeral, AB is segmented into A/B, otherwise it is kept as AB. e.g. In the sentence ????????(He sleeps in his room by himself), AB = ??. Because ? is a numeral, ?? should be segmented into ?/?. But in the phrase ??????(The individual interests of the peasantry), ?? should not be segmented.
17Word Segmentation (10)
18Word Segmentation (11)
- 6. System test results
- The system was tested on a corpus randomly chosen from Beijing Youth, which contains 607 ambiguous phrases of overlap type and 2292 ambiguous phrases of combination type. The precisions are 97% and 87% respectively.
19Named Entity Extraction (1)
- Description of the NTU System used for MET2
- (Hsin-Hsi Chen et al. Natural Language Processing Lab., Department of Computer Science and Information Engineering, National Taiwan University)
- Processing Steps of Named Entity Extraction
- (1) Transform Chinese texts in GB codes into texts in Big-5 codes
- (2) Segment Chinese texts into a sequence of tokens
- (3) Identify named people
- (4) Identify named organizations
- (5) Identify named locations
- (6) Use n-gram model to identify named organizations/locations
- (7) Identify the rest of named expressions
- (8) Transform the results in Big-5 codes into the results in GB codes
20Named Entity Extraction (2)
- (1) Transform Chinese texts in GB codes into texts in Big-5 codes
- The GB code is an internal code of the simplified Chinese character set, which is used in mainland China. Big-5, on the other hand, is an internal code of the traditional Chinese character set, which is used in Taiwan and Hong Kong.
- e.g. simplified Chinese characters vs. traditional Chinese characters
- ???? (Artificial Intelligence) ????
- ??(Software) ??
- ??(Report) ??
- ???(New Zealand) ???
- The NTU System is designed for traditional Chinese text, but the test texts in MET2 are in GB code, so the GB code of the test texts must be transformed into Big-5 code. This mapping is not always one-to-one; sometimes it is one-to-many.
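The one-to-many problem can be made concrete with a small sketch. It is not the NTU System's converter: the three-entry mapping table is an illustrative assumption, and the function name is hypothetical. The only library features used are Python's standard `gb2312` and `big5` codecs and `itertools.product`.

```python
from itertools import product

# Tiny illustrative simplified-to-traditional table (an assumption for this
# sketch). The simplified character 发 corresponds to two traditional
# characters, 發 and 髮, which is a case of one-to-many mapping.
S2T = {"发": ["發", "髮"], "软": ["軟"], "件": ["件"]}


def gb_to_big5_candidates(gb_bytes, table=S2T):
    """Decode GB bytes and return every Big-5 encoded candidate string."""
    text = gb_bytes.decode("gb2312")
    # Characters missing from the table are passed through unchanged.
    options = [table.get(ch, [ch]) for ch in text]
    return ["".join(chars).encode("big5") for chars in product(*options)]


candidates = gb_to_big5_candidates("发".encode("gb2312"))
print([c.decode("big5") for c in candidates])
```

Returning all candidates leaves the choice between them to a later step that can look at the surrounding words.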
21Named Entity Extraction (3)
- (2) Segment Chinese texts into a sequence of tokens
- List all possible words by dictionary look-up, and then resolve ambiguities with segmentation strategies. The dictionary is trained from the CKIP corpus, whose articles are collected from Taiwan newspapers, magazines, etc.
- (3) Identify named people
- Chinese person names
- Most Han Chinese surnames are a single character, but some are two characters.
- Most given names are two characters, but some are a single character.
- Theoretically, every character can be used in a name. The length of Chinese names ranges from 2 to 6 characters.
- Three kinds of recognition strategies are adopted
- Name-formulation rules
- Context clues, e.g., titles, positions, speech-act verbs, etc.
- Cache
22Named Entity Extraction (4)
- Name-formulation rules
- They are trained from a person name corpus in Taiwan, which contains 1 million Chinese names. Each entry contains surname, name and sex.
- Possible candidates
- Model 1. Single character for surname
- P(C1)P(C2)P(C3) using male (female) training table > threshold1(3) and
- P(C2)P(C3) using male (female) training table > threshold2(4)
- Model 2. Two characters for surname
- P(C2)P(C3) using male (female) training table > threshold2(4)
- Model 3. Two surnames together
- P(C12)P(C2)P(C3) using female training table > threshold3
- P(C2)P(C3) using female training table > threshold4 and
- P(C12)P(C2)P(C3) using female training table > P(C12)P(C2)P(C3) using male training table
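As an illustration of Model 1, the two threshold tests can be written as below. This is a sketch under assumptions: the probability tables are small dictionaries with invented values, unseen characters get probability zero, and the function name and threshold values are hypothetical. One table pair would be used per sex, as on the slide.

```python
def is_model1_candidate(c1, c2, c3, surname_prob, name_prob, t_full, t_given):
    """Model 1: single-character surname C1 followed by given name C2 C3.

    surname_prob, name_prob -- hypothetical per-character probability tables
    t_full  -- threshold for P(C1)P(C2)P(C3)
    t_given -- threshold for P(C2)P(C3)
    """
    p_given = name_prob.get(c2, 0.0) * name_prob.get(c3, 0.0)
    p_full = surname_prob.get(c1, 0.0) * p_given
    # Both conditions must hold for the string to be a name candidate.
    return p_full > t_full and p_given > t_given


# Invented probabilities keyed by placeholder characters.
surname_prob = {"C1": 0.1}
name_prob = {"C2": 0.05, "C3": 0.04}
print(is_model1_candidate("C1", "C2", "C3", surname_prob, name_prob, 1e-4, 1e-3))
```

In practice the products of many small probabilities are often compared in log space to avoid underflow; that is a general engineering choice, not something the source describes.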
23Named Entity Extraction (5)
- Context clues, e.g., titles, positions, speech-act verbs, etc.
- Titles ??(Dr.) ??(Prof.) ??(Mrs./Ms.) ??(Miss) ??(Mr.)
- Positions ??(President) ??(Director) ???(General Manager)
- Speech-act verbs ??(speak) ?(say) ??(bring up)
- Cache
- The cache provides a global clue, because a person name may appear more than once in a document. The cache is used to store the identified candidates. Four cases arise when the cache is used
- (1) C1C2C3 and C1C2C4 are in the cache, and C1C2 is correct.
- (2) C1C2C3 and C1C2C4 are in the cache, and both are correct.
- (3) C1C2C3 and C1C2 are in the cache, and C1C2C3 is correct.
- (4) C1C2C3 and C1C2 are in the cache, and C1C2 is correct.
24Named Entity Extraction (6)
- Transliterated person names
- Transliterated person names denote foreigners. The length of transliterated person names is not restricted to 2 to 6 characters.
- Main strategies
- Transliterated name set
- The transliterated names trained from MET data are regarded as a built-in name set.
- Character condition
- Two special character sets are retrieved from MET training data. The first character of a name must belong to a 280-character set, and the remaining characters must appear in a 411-character set. The character condition is a loose restriction and should be employed together with other clues.
- Titles
- The titles used with Chinese person names are also applicable to transliterated person names.
- Name introducers
- Such as ? (be called), ?? (Her/His name is ), ?? (respectfully call sb. )
- Special verbs
- e.g. ??(issue/express/deliver), ??(hint/imply)
25Named Entity Extraction (7)
- (4) Identify named organizations
- The structure of organization names is more complex than that of person names. Basically, a complete organization name can be divided into a name and a keyword.
- Names, such as ???(UN), ??(USA), ???(Robertson)
- Keywords, such as ??(Army), ???(Embassy), ???(Foundation)
- There are some rules to recognize organization names
- OrganizationName -> OrganizationName OrganizationNameKeyword
- OrganizationName -> CountryName OrganizationNameKeyword
- OrganizationName -> PersonName OrganizationNameKeyword
- OrganizationName -> CountryName DDD OrganizationNameKeyword
- OrganizationName -> PersonName DD OrganizationNameKeyword
- OrganizationName -> LocationName DD OrganizationNameKeyword
- OrganizationName -> CountryName OrganizationName
- OrganizationName -> LocationName OrganizationName
- Where D is a content word, such as ??(International), ??(culture and education), etc.
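The rewrite rules above can be applied bottom-up to a sequence of tags. The sketch below is one possible reading, not the NTU System's parser: it covers only the two-symbol rules and leaves out the rules containing the content-word symbol D, whose exact form is unclear in this transcript. The function name is hypothetical.

```python
# Right-hand sides of the two-symbol rules; each reduces to OrganizationName.
RULES = [
    ("OrganizationName", "OrganizationNameKeyword"),
    ("CountryName", "OrganizationNameKeyword"),
    ("PersonName", "OrganizationNameKeyword"),
    ("CountryName", "OrganizationName"),
    ("LocationName", "OrganizationName"),
]


def reduce_organizations(tags):
    """Repeatedly replace any matching rule body with OrganizationName."""
    tags = list(tags)
    changed = True
    while changed:
        changed = False
        for rhs in RULES:
            n = len(rhs)
            for i in range(len(tags) - n + 1):
                if tuple(tags[i:i + n]) == rhs:
                    tags[i:i + n] = ["OrganizationName"]
                    changed = True
                    break
            if changed:
                break          # restart the scan after every reduction
    return tags


print(reduce_organizations(["LocationName", "PersonName", "OrganizationNameKeyword"]))
```

Every reduction shortens the sequence by one tag, so the loop always terminates. The recursive rules are handled naturally because a reduced OrganizationName can take part in a later match.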
26Named Entity Extraction (8)
- (5) Identify named locations
- The structure of location names is similar to that of organization names. The rules are like
- LocationName -> PersonName LocationNameKeyword
- LocationName -> LocationName LocationNameKeyword
- The following are some examples of location keywords
- ?(mountain) ??(center) ??(highway) ??(north of ) ?(city)
- Other strategies for recognizing location names without keywords
- Locative verbs ??(come from ) ??(go to )
- Cache
- N-gram model: employs multiple occurrences to find a pattern
27Named Entity Extraction (9)
- (6) Use n-gram model to identify named organizations/locations
- Although the cache mechanism and the n-gram model use the same feature, i.e., multiple occurrences, their concepts are totally different. For organization names, it is unclear when a pattern should be put into the cache because its left boundary is hard to determine.
- In the model, patterns are selected to meet the following criteria
- It must consist of a name and an organization name keyword
- Its length must be greater than two words
- It does not cross a sentence boundary or any punctuation mark
- It must occur at least twice
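The four selection criteria can be sketched as a counting pass over token sequences. Several details are assumptions made for illustration: the input is already split at sentence boundaries and punctuation, a pattern is taken to be any n-gram that ends with a keyword (the words before the keyword play the role of the name), and the English tokens in the example are placeholders. The function name is hypothetical.

```python
from collections import Counter


def repeated_org_patterns(segments, keywords, min_len=3, min_count=2):
    """Collect keyword-final n-grams that are long enough and repeated.

    segments -- list of token lists, each free of punctuation, so no
                pattern can cross a sentence boundary or punctuation mark
    keywords -- set of organization name keywords
    """
    counts = Counter()
    for tokens in segments:
        for end, tok in enumerate(tokens):
            if tok not in keywords:
                continue
            # Length end - start + 1 must be at least min_len (more than two words).
            for start in range(0, end - min_len + 2):
                counts[tuple(tokens[start:end + 1])] += 1
    return {pattern for pattern, c in counts.items() if c >= min_count}


segments = [
    ["the", "Acme", "Relief", "Fund", "said"],
    ["Acme", "Relief", "Fund", "grew"],
]
print(repeated_org_patterns(segments, {"Fund"}))
```

Counting every possible left boundary and keeping only the repeated n-grams is what lets repetition settle the left boundary, which the cache alone cannot decide.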
28Named Entity Extraction (10)
- (7) Identify the rest of named expressions
- The rule based approach is used for the following named expressions
- Date expressions
- DATE -> NUMBER YEAR
- DATE -> NUMBER MTHUNIT
- Time expressions
- TIME -> NUMBER HUNIT
- TIME -> TIME BSTATE
- Monetary expressions
- DMONEY -> MOUNIT NUMBER MOUNIT
- DMONEY -> NUMBER MOUNIT
- Percentage expressions
- DPERCENT -> PERCENT NUMBER
- DPERCENT -> NUMBER PERCENT
29Named Entity Extraction (11)
- (8) Transform the results in Big-5 codes into the results in GB codes
- MET2 Testing Results
  Named Entity        Recall(%)   Precision(%)
  Person Name         91          74
  Organization Name   78          85
  Location Name       78          69
  Date                94          88
  Time                98          70
  Money               98          98
  Percent             83          98
- F-MEASURES P&R 79.61 2P&R 77.88 P&2R 81.42
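For reference, the F-measure combines precision P and recall R as F = (β² + 1)PR / (β²P + R). The sketch below implements this standard formula. Mapping the three reported scores to β = 1 (P&R), β = 0.5 (2P&R) and β = 2 (P&2R) is an assumption based on the usual scoring convention, and the overall scores above cannot be recomputed from the per-type rows alone, because they depend on the underlying counts.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 gives more weight to recall, beta < 1 to precision.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


# Person Name row of the table: recall 91, precision 74.
for beta in (1.0, 0.5, 2.0):
    print(beta, round(f_measure(74, 91, beta), 2))
```

Because recall is higher than precision for person names, the recall-weighted score is the largest of the three and the precision-weighted score the smallest.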
30Entity Relation Extraction (1)
- A Trainable Method for Extracting Chinese Entity Names and Their Relations
- (Yimin Zhang et al. Intel China Research Center, Beijing, China)
- The process can be divided into two stages. The first is the learning process, in which several classifiers are built from the training data. The second is the extracting process, in which Chinese entity names and their relations are extracted using the learned classifiers. The learning algorithm used in the learning process is memory-based learning (MBL), a classification-based supervised learning approach.
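The core idea of memory-based learning is to store the training examples and classify a new instance by its most similar stored examples. The sketch below shows the simplest form of that idea, a nearest-neighbor lookup with an overlap distance over symbolic features. It is not the IG-Tree variant used by the authors, and the feature values and function names are invented for illustration.

```python
from collections import Counter


def overlap_distance(x, y):
    """Number of feature positions where two instances disagree."""
    return sum(1 for a, b in zip(x, y) if a != b)


def mbl_classify(memory, instance, k=1):
    """Classify by majority vote among the k nearest stored examples.

    memory -- list of (feature_tuple, label) pairs kept from training
    """
    ranked = sorted(memory, key=lambda ex: overlap_distance(ex[0], instance))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented training examples: (type of entity 1, type of entity 2, position).
memory = [
    (("person", "company", "same_sentence"), "employee-of"),
    (("company", "product", "same_sentence"), "product-of"),
]
print(mbl_classify(memory, ("person", "company", "neighbor_sentence")))
```

The query differs from the first stored example in one feature and from the second in all three, so the first example's label is returned.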
31Entity Relation Extraction (2)
32Entity Relation Extraction (3)
- The main steps for the learning process
- (1) Prepare training data in which all noun phrases, entity names and relations are manually annotated.
- (2) Segment, tag and partially parse the training data.
- (3) Extract the training sets from the parsed training data. Four training sets are extracted for different tasks, related to Chinese person names, entity names, noun phrases, and relations between entity names respectively. The main features used in an example can be either local context features, e.g. dependency relations, or global context features, e.g. the features of a word in the whole document.
- (4) Use the MBL algorithm to obtain an IG-Tree for each of the four training sets. An IG-Tree is a compressed representation of the training set that can be processed quickly during classification.
33Entity Relation Extraction (4)
- The main steps for the extracting process
- Segment, tag and partially parse the Chinese documents.
- Identify Chinese person names using the PersonName-IG-Tree.
- Identify Chinese organization names using the same method as the NTU System.
- Identify other entity names using the same method as the NTU System.
- Identify Chinese noun phrases (NP chunking) using the NP-IG-Tree.
- Use the extracted entity names and noun phrases to perform partial parsing again and fix the parsing errors.
- Use the EntityName-IG-Tree to classify the extracted noun phrases. This step identifies entity names that were missed in the previous steps.
- Use the Relation-IG-Tree to identify relations between the extracted entity names.
34Entity Relation Extraction (5)
- The entity relations extracted
- Employee-of,
- Location-of,
- Product-of and
- No-relation
- The features for this task
- The features used in the CRYSTAL System,
- Some new features, such as the linear order of the entity names, the word(s) between the entity names, and the relative position of the entity names (in the same sentence or in neighboring sentences), etc.
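The three new features named above can be computed from token positions. The sketch below is an illustration under assumptions: entities are given as token spans with an exclusive end, every token carries a sentence index, and the feature names, function name and example tokens are invented rather than taken from the system described.

```python
def pair_features(tokens, sent_ids, e1, e2):
    """Build order, between-words and relative-position features for a pair.

    tokens   -- list of tokens for the document
    sent_ids -- sentence index of each token
    e1, e2   -- (start, end) token spans of the two entity names, end exclusive
    """
    first, second = sorted([e1, e2])
    gap = abs(sent_ids[e1[0]] - sent_ids[e2[0]])
    if gap == 0:
        position = "same_sentence"
    elif gap == 1:
        position = "neighboring_sentence"
    else:
        position = "distant"
    return {
        "order": "e1_first" if e1 <= e2 else "e2_first",
        "between": tuple(tokens[first[1]:second[0]]),
        "position": position,
    }


tokens = ["Wu", "joined", "TCL", "."]
sent_ids = [0, 0, 0, 0]
print(pair_features(tokens, sent_ids, (0, 1), (2, 3)))
```

A dictionary like this can be flattened into the fixed-length symbolic feature vector that a memory-based classifier expects.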
35Entity Relation Extraction (6)
- Example
- The phrase ????(Legend's President) in the subject position includes the following features. (Note: Legend = Legend Holdings Limited or Legend Group, which is a famous computer company in China.)
- SUBJ-Terms-??
- SUBJ-Terms-??
- SUBJ-Mod-Terms-??
- SUBJ-Head-Terms-??
- SUB-Classes-Employee
- SUB-Mod-Classes-Organization
- SUB-Head-Classes-Organization(should be Position)
36Entity Relation Extraction (7)
- Learning and extracting processes
- For every two related entity names in the training data, a training example is identified and extracted. After all examples are extracted, they are fed to the MBL learner to build the Relation-IG-Tree.
- The extracting process extracts all pairs of entity names in the same way as the learning process. Then the relation between every pair of entity names is derived by the Relation-IG-Tree.
37Entity Relation Extraction (8)
- Example 1
- ???????????IT???????,
- As a famous manufacturer of IT hardware devices in China, the Lang Chao Group
- Company name ???? Product name IT????
- Training example Company name (??/?) Product name ???
- Relation product-of
- Example 2
- ?????????????????,?????TCL???????????????????????
- Wu Shihong became the media focus once again; however, this time she came to Shanghai as the vice president of TCL group and its IT company's general manager.
- Person name ??? Company name TCL??
- Training example If a person name and a company name appear in neighboring sentences, and no other person names or company names are found in between, they tend to have an employee-of relation.
- Relation employee-of
38Entity Relation Extraction (9)
- System testing results
- To test this approach, a manually annotated corpus of about 200 business news articles is used. All the entity names (about 500 person names and 300 organization names), noun phrases, and relations in the corpus were manually annotated. Ten pairs of training sets and test sets were randomly selected from the corpus, with each set equal in size to half of the entire corpus. All data sets were tested, and the results are as follows
                      Recall(%)   Precision(%)
  Person Name         86.3        83.2
  Organization Name   73.4        89.3
  Employee-of         75.6        92.3
  Product-of          56.2        87.1
  Location-of         67.2        75.6
39Conclusion
- Chinese is typologically different from English or German. There exist some special difficulties in Chinese NLP, such as word segmentation.
- There are mainly two types of ambiguous phrases in Chinese word segmentation: overlap type and combination type. Among overlap ambiguous phrases, chain lengths of 1 or 2 account for about 95%. Among combination ambiguous phrases, 30% usually have only one possible segmentation. Ambiguity can be removed with methods that depend on the ambiguity type.
- Chinese named entities are major constituents of Chinese documents. Different methods can be combined to extract them, such as character conditions, statistical information, titles, punctuation marks, organization and location keywords, speech-act and locative verbs, cache and n-gram model.
- The determination of Chinese entity relations can be viewed as a classification process. In the learning process, several classifiers are built from the training data. In the extracting process, the relations are extracted using the learned classifiers. Machine learning techniques have been used effectively in Chinese entity relation extraction.