Chinese Information Extraction presentation

Title: Chinese Information Extraction

1
Chinese Information Extraction
Tianfang Yao
Department of Computer Science and Engineering
Shanghai Jiao Tong University
1954 Hua Shan Road
Shanghai, 200030 China
2
Outline

Introduction
Word Segmentation
Named Entity Extraction
Entity Relation Extraction
Conclusion

3
Introduction (1)

Chinese Language
Difficulties in Chinese NLP
State-of-the-Art for Chinese Information
Extraction

4
Introduction (2)

Chinese Language
Chinese is typologically different from English or German.
It has a large character set of about 44,908 characters.
Although Chinese has a history of more than 6,000 years, a complete standard grammar has still not been established.

5
Introduction (3)

Chinese Language
The form of a Chinese character is related to its meaning. Character formation includes pictographs, e.g. 日 (sun) and 月 (moon); self-explanatory characters, e.g. 上 (above) and 下 (below); and associative compounds, e.g. 信 (believe), a character made up of 人 (man) and 言 (word) that means a message or something that can be believed or trusted.
There are many homonyms among Chinese words, e.g. 锣 (gong), 螺 (spiral shell), 骡 (mule), 箩 (bamboo basket), etc.
A Chinese word can be split apart or expanded, and the order of its parts can be changed, e.g. ??(take a meal) vs.
???????(haircut)? vs. ???

6
Introduction (4)

Difficulties in Chinese NLP
Because there are no spaces between the characters in a Chinese sentence, words must be segmented before the sentence structure can be analyzed.
Chinese characters have no inflection, so semantic structure is more important than syntactic structure for understanding Chinese sentences.
The combination of Chinese words is flexible, changeable, succinct and implicit. Constituents are sometimes omitted from the sentence.
A Chinese sentence sometimes contains consecutive nouns or consecutive verbs.

7
Introduction (5)

State-of-the-Art for Chinese Information
Extraction
Knowledge engineering approaches
Automatically trainable approaches
Statistical approaches
Hybrid approaches

8
Word Segmentation (1)

Research on Automatic Chinese Word Segmentation
(Kaiying Liu, Computer Science Department, Shan Xi University, China)
1. Definitions
Definition 1: Ambiguous Phrase of Overlap Type
Assume that AJB is a character string and W is a word list. If AJ ∈ W and JB ∈ W, then AJB is called an ambiguous phrase of overlap type.
e.g. In the string 当代表 (act as a delegate), both 当代 (of our time) and 代表 (delegate) are words, so this string is an ambiguous phrase of overlap type.

9
Word Segmentation (2)

Definition 2: Chain Length
The number of ambiguous strings in a phrase is called its chain length.
e.g. There is one ambiguous string in the string 当代表, so its chain length is 1.
Definition 3: Ambiguous Phrase of Combination Type
Assume that AB is a character string and W is a word list. If A ∈ W, B ∈ W and AB ∈ W, then AB is called an ambiguous phrase of combination type.
e.g. In the string 个人 (individual), 个 (quantifier), 人 (man) and 个人 are all words, so this string is an ambiguous phrase of combination type.
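
The overlap and combination definitions can be checked mechanically against a word list. The following is a minimal sketch rather than the system described in the slides; the function names and the tiny word list are assumptions made for illustration.

```python
def overlap_splits(s, words):
    """Return every (A, J, B) with s == A + J + B, A+J in words and J+B in words."""
    splits = []
    for i in range(1, len(s) - 1):        # i marks the end of A
        for j in range(i + 1, len(s)):    # j marks the end of J, so B is never empty
            a, mid, b = s[:i], s[i:j], s[j:]
            if a + mid in words and mid + b in words:
                splits.append((a, mid, b))
    return splits


def is_combination_ambiguous(s, words):
    """True if s is a word and can also be cut into two parts that are both words."""
    return s in words and any(
        s[:i] in words and s[i:] in words for i in range(1, len(s))
    )


# Illustrative word list W (an assumption, not the real dictionary).
WORDS = {"当代", "代表", "个", "人", "个人"}
```

Calling `overlap_splits("当代表", WORDS)` reports the single split point of an overlap-type phrase, and `is_combination_ambiguous("个人", WORDS)` flags a combination-type phrase.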

10
Word Segmentation (3)

2. Build the ambiguous phrase libraries
78,000 phrases of overlap type
More than 3,000 phrases of combination type
Statistical results for overlap type
Chain lengths are mostly 1 or 2, about 95% of all phrases.
Among the ambiguous phrases like ABCD with a chain length of 2, 98% can be segmented into AB/CD.
The segmentation of about 82% of the ambiguous phrases like ABCDE with a chain length of 3 depends on the leftmost three characters ABC.
False ambiguous phrases: 94%
Real ambiguous phrases: 6%

11
Word Segmentation (4)

False ambiguous phrase: It actually has only one segmentation result in real texts, e.g. ?(be given)??(a criticism)
Real ambiguous phrase: It has more than one applicable segmentation result.
Case 1: The results have almost equal occurrence probabilities.
e.g. ???(apply to) can be segmented into ???(apply to) or ???(should be used in)
Case 2: The phrase is mostly segmented into only one result in real texts.
e.g. ???(have dismissed) should mostly be segmented into ???(have dismissed)

12
Word Segmentation (5)

3. Approaches for segmenting ambiguous phrases of overlap type
Statistics-based approach
Build a word-formation capacity library that includes frequency information for each ambiguous phrase AJB with a chain length of 1, that is, the frequencies for constructing words: FreqLeft(AJ), FreqRight(B), FreqLeft(A) and FreqRight(JB)
Rule 1: If FreqLeft(AJ) × FreqRight(B) > FreqLeft(A) × FreqRight(JB), AJB is segmented into AJ/B; otherwise it is segmented into A/JB
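
Rule 1 amounts to comparing two products of frequencies. The sketch below assumes the rule reads FreqLeft(AJ) × FreqRight(B) > FreqLeft(A) × FreqRight(JB) and that the library is available as two plain dictionaries; the function name, the dictionary layout and the default of 0 for unseen strings are all assumptions.

```python
def segment_chain1(a, j, b, freq_left, freq_right):
    """Segment an overlap-type phrase AJB with chain length 1.

    freq_left / freq_right are hypothetical lookup tables standing in for
    FreqLeft(.) and FreqRight(.); unseen strings count as frequency 0.
    """
    score_aj_b = freq_left.get(a + j, 0) * freq_right.get(b, 0)
    score_a_jb = freq_left.get(a, 0) * freq_right.get(j + b, 0)
    if score_aj_b > score_a_jb:
        return [a + j, b]      # AJ/B
    return [a, j + b]          # A/JB, including the tie case
```

Ties fall to A/JB because the rule only names AJ/B for the strictly greater case.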

13
Word Segmentation (6)

(Depending on the statistical results for the ambiguous phrase library)
Rule 2: An ambiguous phrase with a chain length of 2, like ABCD, is segmented into AB/CD.
Rule 3: An ambiguous phrase with a chain length of 3, like ABCDE, is first segmented into ABC/DE; then the front part ABC is segmented as an ambiguous phrase with a chain length of 1.
Rule 4: An ambiguous phrase with a chain length of 4, like ABCDEF, is segmented into AB/CD/EF.
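
Rules 2 to 4 can be sketched as a dispatch on chain length. This is an illustration under two assumptions: the lost separators are read as AB/CD, ABC/DE and AB/CD/EF, and every letter stands for one character, so the chain length equals the phrase length minus 2. The `resolve_chain1` callback is a hypothetical stand-in for whatever handles chain length 1 (for example Rule 1).

```python
def segment_by_chain_length(phrase, resolve_chain1):
    """Apply Rules 2-4 to an overlap-type phrase of 4, 5 or 6 characters."""
    chain_length = len(phrase) - 2
    if chain_length == 2:                        # ABCD -> AB/CD
        return [phrase[:2], phrase[2:]]
    if chain_length == 3:                        # ABCDE -> ABC/DE, then resolve ABC
        return resolve_chain1(phrase[:3]) + [phrase[3:]]
    if chain_length == 4:                        # ABCDEF -> AB/CD/EF
        return [phrase[:2], phrase[2:4], phrase[4:]]
    raise ValueError("only chain lengths 2, 3 and 4 are covered by these rules")
```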

14
Word Segmentation (7)

Rule-based approach
Rule 1: If an ambiguous phrase contains an appulsive verb whose previous word is a verb, the appulsive verb is segmented as a separate word. e.g. ?????(really embody) should be segmented into ?????, because ?(come up) is an appulsive verb and ?? is a verb.
Rule 2: If the first character in an ambiguous phrase is a quantifier and the word preceding the phrase is a numeral, the quantifier is segmented as a separate word. e.g. 65???(a high building of 65 stories) should be segmented into 65???, because ? is a quantifier and 65 is a numeral.

15
Word Segmentation (8)

4. Approaches for segmenting ambiguous phrases of combination type
Statistics-based approach
Among all ambiguous phrases, 30% usually have only one segmentation result. Therefore, a library of 133 phrases is built. The structure of the database is as follows:
FIELD NAME   TYPE     LENGTH   EXPLANATION
word         char     4        AB
nh           number   3        number of times kept as AB
nf           number   3        number of times segmented into A/B
Let freq = nh / (nh + nf), with thresholds a1 and a2, where a1 > a2. If freq > a1, AB is kept as AB; if freq < a2, it is segmented into A/B.
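
The threshold test on freq can be sketched as follows. The sketch assumes nh counts the occurrences kept as one word and nf the occurrences that were split, and it returns None when freq lies between the two thresholds, a case the slide does not specify. The function name and the threshold values in the usage note are hypothetical.

```python
def decide_combination(nh, nf, a1, a2):
    """Return 'AB', 'A/B', or None when the statistics alone do not decide."""
    if a1 <= a2:
        raise ValueError("expected a1 > a2")
    total = nh + nf
    if total == 0:
        return None                 # phrase never observed
    freq = nh / total
    if freq > a1:
        return "AB"                 # keep as one word
    if freq < a2:
        return "A/B"                # split into two words
    return None                     # undecided; another method must take over
```

For example, with assumed thresholds a1 = 0.8 and a2 = 0.2, a phrase seen 9 times as one word and once split is kept whole.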

16
Word Segmentation (9)

POS rule-based approach
The segmentation of a word is related to the POS of its context words. If the word preceding AB is a numeral, AB is segmented into A/B; otherwise it is kept as AB. e.g. In the sentence ????????(He sleeps in his room by himself), AB = ??. Because ? is a numeral, ?? should be segmented into ??. But in the phrase ??????(The individual interests of the peasantry), ?? should not be segmented.

17
Word Segmentation (10)

5. System architecture

18
Word Segmentation (11)

6. System test results
The system has been tested with a corpus randomly chosen from Beijing Youth, in which there are 607 ambiguous phrases of overlap type and 2292 ambiguous phrases of combination type. The precisions are 97% and 87% respectively.

19
Named Entity Extraction (1)

Description of the NTU System used for MET2
(Hsin-Hsi Chen et al., Natural Language Processing Lab., Department of Computer Science and Information Engineering, National Taiwan University)
Processing Steps of Named Entity Extraction
(1) Transform Chinese texts in GB codes into
texts in Big-5 codes
(2) Segment Chinese texts into a sequence of
tokens
(3) Identify named people
(4) Identify named organizations
(5) Identify named locations
(6) Use n-gram model to identify named
organizations/locations
(7) Identify the rest of named expressions
(8) Transform the results in Big-5 codes into
the results in GB codes

20
Named Entity Extraction (2)

(1) Transform Chinese texts in GB codes into
texts in Big-5 codes
The GB code is an internal code for the simplified Chinese character set, which is used in mainland China. Big-5, on the other hand, is an internal code for the traditional Chinese character set, which is used in Taiwan and Hong Kong.
e.g. simplified vs. traditional Chinese characters
Simplified                         Traditional
???? (Artificial Intelligence)     ????
?? (Software)                      ??
?? (Report)                        ??
??? (New Zealand)                  ???
The NTU system is designed for traditional Chinese text, while the test texts in MET2 are in GB code, so the GB code of the test texts must be transformed into Big-5 code. This mapping is not always one-to-one; sometimes it is one-to-many.
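
The one-to-many problem can be illustrated with Python's built-in `gb2312` and `big5` codecs. This is only a sketch: the mapping table `S2T` is a tiny hypothetical excerpt, not the table used by the NTU system, and a real converter would need context to choose among the candidates.

```python
import itertools

# Hypothetical excerpt of a simplified-to-traditional mapping table.
# Most characters map to one form; some, such as 发, map to several.
S2T = {
    "报": ["報"],
    "告": ["告"],
    "发": ["發", "髮"],
}


def gb_to_big5_candidates(gb_bytes):
    """Decode GB text and return every possible Big-5 encoding of it."""
    text = gb_bytes.decode("gb2312")
    # Characters missing from the table are assumed to be unchanged.
    options = [S2T.get(ch, [ch]) for ch in text]
    return ["".join(chars).encode("big5") for chars in itertools.product(*options)]
```

A word whose characters all map one-to-one yields a single candidate, while a text containing 发 yields two, which is where disambiguation is needed.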

21
Named Entity Extraction (3)

(2) Segment Chinese texts into a sequence of
tokens
List all possible words by dictionary look-up, and then resolve ambiguities with segmentation strategies. The dictionary is trained from the CKIP corpus, whose articles are collected from Taiwan newspapers, magazines, etc.
(3) Identify named people
Chinese person names
Most Han Chinese surnames are a single character, but some are two characters.
Most given names are two characters, but some are a single character.
Theoretically, every character can be used in a name. The length of Chinese names ranges from 2 to 6 characters.
Three kinds of recognition strategies are adopted:
Name-formulation rules
Context clues, e.g., titles, positions, speech-act verbs, etc.
Cache

22
Named Entity Extraction (4)

Name-formulation rules
They are trained from a person name corpus in Taiwan that contains 1 million Chinese names. Each entry contains a surname, a given name and sex.
Possible candidates
Model 1. Single character for surname
P(C1) × P(C2) × P(C3) using the male (female) training table > threshold1 (threshold3), and
P(C2) × P(C3) using the male (female) training table > threshold2 (threshold4)
Model 2. Two characters for surname
P(C2) × P(C3) using the male (female) training table > threshold2 (threshold4)
Model 3. Two surnames together
P(C12) × P(C2) × P(C3) using the female training table > threshold3,
P(C2) × P(C3) using the female training table > threshold4, and
P(C12) × P(C2) × P(C3) using the female training table > P(C12) × P(C2) × P(C3) using the male training table
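
Model 1 can be sketched as two threshold tests run once per training table. The sketch assumes the probabilities are multiplied, that thresholds 1 and 2 go with the male table and thresholds 3 and 4 with the female table, and that a candidate is accepted when either table passes. The table layout (a `surname` and a `given` dictionary) and all names are hypothetical.

```python
def passes_model1(table, c1, c2, c3, t_full, t_given):
    """Check both Model 1 conditions against one training table."""
    p_given = table["given"].get(c2, 0.0) * table["given"].get(c3, 0.0)
    p_full = table["surname"].get(c1, 0.0) * p_given
    return p_full > t_full and p_given > t_given


def is_candidate_model1(c1, c2, c3, male, female, thresholds):
    """thresholds maps 1..4 to the threshold values named on the slide."""
    return (
        passes_model1(male, c1, c2, c3, thresholds[1], thresholds[2])
        or passes_model1(female, c1, c2, c3, thresholds[3], thresholds[4])
    )
```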

23
Named Entity Extraction (5)

Context clues, e.g., titles, positions,
speech-act verbs, etc.
Titles: ??(Dr.), ??(Prof.), ??(Mrs./Ms.), ??(Miss), ??(Mr.)
Positions: ??(President), ??(Director), ???(General Manager)
Speech-act verbs: ??(speak), ?(say), ??(bring up)
Cache
The cache provides a global clue, because a person name may appear more than once in a document. The cache is used to store the identified candidates. Four cases arise when the cache is used:
(1) C1C2C3 and C1C2C4 are in the cache, and
C1C2 is correct.
(2) C1C2C3 and C1C2C4 are in the cache, and
both are correct.
(3) C1C2C3 and C1C2 are in the cache, and
C1C2C3 is correct.
(4) C1C2C3 and C1C2 are in the cache, and
C1C2 is correct.

24
Named Entity Extraction (6)

Transliterated person names
Transliterated person names denote foreigners. Their length is not restricted to 2 to 6 characters.
Main strategies
Transliterated name set
The transliterated names trained from MET data are regarded as a built-in name set.
Character condition
Two special character sets are retrieved from MET training data. The first character of a name must belong to a 280-character set, and the remaining characters must appear in a 411-character set. The character condition is a loose restriction, so it should be employed together with other clues.
Titles
Titles used with Chinese person names are also applicable to transliterated person names.
Name introducers
e.g. ? (be called), ?? (Her/His name is ), ?? (respectfully call sb. )
Special verbs
e.g. ??(issue/express/deliver), ??(hint/imply)
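
The character condition described above is a simple membership test. The sketch below uses placeholder sets; the real 280-character and 411-character sets come from the MET training data and are not reproduced here, and the function name is an assumption.

```python
def satisfies_character_condition(name, first_chars, other_chars):
    """True if the first character is in first_chars and the rest are in other_chars.

    This is a loose filter only; a positive result should be combined with
    other clues such as titles, name introducers or special verbs.
    """
    if not name:
        return False
    return name[0] in first_chars and all(ch in other_chars for ch in name[1:])
```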

25
Named Entity Extraction (7)

(4) Identify named organizations
The structure of organization names is more complex than that of person names. Basically, a complete organization name can be divided into a name and a keyword.
e.g. names: ???(UN), ??(USA), ???(Robertson)
keywords: ??(Army), ???(Embassy), ???(Foundation)
The following rules are used to recognize organization names:
OrganizationName -> OrganizationName OrganizationNameKeyword
OrganizationName -> CountryName OrganizationNameKeyword
OrganizationName -> PersonName OrganizationNameKeyword
OrganizationName -> CountryName DDD OrganizationNameKeyword
OrganizationName -> PersonName DD OrganizationNameKeyword
OrganizationName -> LocationName DD OrganizationNameKeyword
OrganizationName -> CountryName OrganizationName
OrganizationName -> LocationName OrganizationName
where D is a content word, such as ??(International), ??(culture and education), etc.
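
The rewrite rules for organization names can be tried out on sequences of token tags. The sketch below takes the rules literally as printed, reading DDD and DD as three and two content words; if the original slide used alternation there, the rule list would need adjusting. Tag names and the function name are assumptions.

```python
# Each rule is the right-hand side of "OrganizationName -> ...".
# "ORG" stands for a nested OrganizationName, "KW" for OrganizationNameKeyword.
RULES = [
    ("ORG", "KW"),
    ("COUNTRY", "KW"),
    ("PERSON", "KW"),
    ("COUNTRY", "D", "D", "D", "KW"),
    ("PERSON", "D", "D", "KW"),
    ("LOCATION", "D", "D", "KW"),
    ("COUNTRY", "ORG"),
    ("LOCATION", "ORG"),
]


def is_organization_name(tags):
    """True if the tag sequence can be derived from OrganizationName."""
    tags = tuple(tags)
    for rule in RULES:
        if "ORG" not in rule:
            if tags == rule:
                return True
        elif rule[0] == "ORG":
            suffix = rule[1:]
            n = len(suffix)
            # The nested name must be non-empty, so recursion always shrinks.
            if len(tags) > n and tags[-n:] == suffix and is_organization_name(tags[:-n]):
                return True
        else:
            prefix = rule[:-1]
            n = len(prefix)
            if len(tags) > n and tags[:n] == prefix and is_organization_name(tags[n:]):
                return True
    return False
```

The recursion terminates because every recursive call receives a strictly shorter tag sequence.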

26
Named Entity Extraction (8)

(5) Identify named locations
The structure of location names is similar to that of organization names. The rules are like
LocationName -> PersonName LocationNameKeyword
LocationName -> LocationName LocationNameKeyword
The following are some examples of location keywords:
?(mountain), ??(center), ??(highway), ??(the north of ), ?(city)
Other strategies for recognizing location names without keywords
Locative verbs: ??(come from ), ??(go to )
Cache
N-gram model: employs multiple occurrences to find a pattern

27
Named Entity Extraction (9)

(6) Use n-gram model to identify named
organizations/locations
Although the cache mechanism and the n-gram model use the same feature, i.e., multiple occurrences, their concepts are totally different. For organization names, it is unclear when a pattern should be put into the cache, because its left boundary is hard to decide.
In the model, the patterns are selected to meet the following criteria:
It must consist of a name and an organization name keyword
Its length must be greater than two words
It must not cross a sentence boundary or any punctuation mark
It must occur at least twice
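
The selection criteria above can be sketched as an n-gram count over text that has already been cut at sentence boundaries and punctuation. This is an illustration only: it treats the words before the keyword as the name part without checking them, and the function name, the `max_len` cap and the sample tokens in the usage note are assumptions.

```python
from collections import Counter


def candidate_patterns(segments, keywords, max_len=5):
    """Collect n-grams that end in a keyword, span more than two words,
    and occur at least twice.

    segments: lists of words, each list free of punctuation and sentence breaks.
    keywords: set of organization name keywords.
    """
    counts = Counter()
    for seg in segments:
        for end, word in enumerate(seg):
            if word not in keywords:
                continue
            for n in range(3, max_len + 1):      # length greater than two words
                start = end - n + 1
                if start < 0:
                    break
                counts[tuple(seg[start:end + 1])] += 1
    return {pattern for pattern, c in counts.items() if c >= 2}
```

With English stand-in tokens, two segments that both contain "Acme Widget Foundation" and a keyword set holding "Foundation" yield that three-word pattern as the only candidate.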

28
Named Entity Extraction (10)

(7) Identify the rest of named expressions
The rule-based approach is used for the following named expressions:
Date expressions
DATE -> NUMBER YEAR
DATE -> NUMBER MTHUNIT
Time expressions
TIME -> NUMBER HUNIT
TIME -> TIME BSTATE
Monetary expressions
DMONEY -> MOUNIT NUMBER MOUNIT
DMONEY -> NUMBER MOUNIT
Percentage expressions
DPERCENT -> PERCENT NUMBER
DPERCENT -> NUMBER PERCENT

29
Named Entity Extraction (11)

(8) Transform the results in Big-5 codes into the
results in GB codes
MET2 Testing Results
Named Entity         Recall (%)   Precision (%)
Person Name          91           74
Organization Name    78           85
Location Name        78           69
Date                 94           88
Time                 98           70
Money                98           98
Percent              83           98
F-MEASURES: P&R 79.61, 2P&R 77.88, P&2R 81.42
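
The three F-measures combine precision and recall with different weights. The sketch below uses the standard F-beta formula and assumes the usual MUC reading of the labels: P&R is beta = 1, 2P&R weights precision more heavily (beta = 0.5), and P&2R weights recall more heavily (beta = 2). The overall precision and recall behind the reported scores are not given on the slide, so the usage note uses one table row instead.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For the Person Name row (precision 74, recall 91), `f_measure(74, 91)` gives the balanced score, and passing `beta=2` or `beta=0.5` shifts the weight toward recall or precision.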

30
Entity Relation Extraction (1)

A Trainable Method for Extracting Chinese Entity
Names and Their Relations
(Yimin Zhang et al. Intel China Research Center,
Beijing, China)
The process can be divided into two stages. The first is the learning process, in which several classifiers are built from the training data. The second is the extracting process, in which Chinese entity names and their relations are extracted using the learned classifiers. The learning algorithm used in the learning process is memory-based learning (MBL), a classification-based supervised learning approach.

31
Entity Relation Extraction (2)
32
Entity Relation Extraction (3)

The main steps for the learning process
(1) Prepare training data in which all noun phrases, entity names and relations are manually annotated.
(2) Segment, tag and partially parse the training data.
(3) Extract the training sets from the parsed training data. Four training sets are extracted for different tasks, related respectively to Chinese person names, entity names, noun phrases, and relations between entity names in the training data. The main features used in an example can be either local context features, e.g. dependency relations, or global context features, e.g. the features of a word in the whole document.
(4) Use the MBL algorithm to obtain an IG-Tree for each of the four training sets. An IG-Tree is a compressed representation of the training set that can be processed quickly in the classification process.
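
The core idea of memory-based learning is to store training examples and label a new instance by its most similar stored example. The sketch below shows only that idea with an unweighted feature-overlap measure; it does not build an IG-Tree, which compresses the stored examples using information gain, and the function name and feature values are illustrative assumptions.

```python
def classify_by_memory(memory, instance):
    """Return the label of the stored example sharing the most feature values.

    memory: list of (feature_tuple, label) pairs; ties go to the earliest example.
    """
    if not memory:
        raise ValueError("memory must contain at least one example")

    def overlap(example):
        features, _label = example
        return sum(1 for a, b in zip(features, instance) if a == b)

    return max(memory, key=overlap)[1]
```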

33
Entity Relation Extraction (4)

The main steps for the extracting process
Segment, tag and partially parse the Chinese documents.
Identify Chinese person names using PersonName-IG-Tree.
Identify Chinese organization names using the same method as the NTU system.
Identify other entity names using the same method as the NTU system.
Identify Chinese noun phrases (NP chunking) using NP-IG-Tree.
Use the extracted entity names and noun phrases to perform partial parsing again to fix the parsing errors.
Use EntityName-IG-Tree to classify the extracted noun phrases. This step identifies entity names that were missed in the previous steps.
Use Relation-IG-Tree to identify relations between the extracted entity names.

34
Entity Relation Extraction (5)

The entity relation extracted
Employee-of,
Location-of,
Product-of and
No-relation
The features for this task
The features used in the CRYSTAL system,
plus some new features, such as the linear order of the entity names, the word(s) between the entity names, the relative position of the entity names (in the same sentence or in neighboring sentences), etc.

35
Entity Relation Extraction (6)

Example
The phrase ???? (Legend's President) (Note: Legend = Legend Holdings Limited, or Legend Group, which is a famous computer company in China) in the subject position includes the features
SUBJ-Terms-??
SUBJ-Terms-??
SUBJ-Mod-Terms-??
SUBJ-Head-Terms-??
SUB-Classes-Employee
SUB-Mod-Classes-Organization
SUB-Head-Classes-Organization(should be Position)

36
Entity Relation Extraction (7)

Learning and extracting processes
For every two related entity names in the training data, a training example is identified and extracted. After all examples are extracted, they are fed to the MBL learner to build the Relation-IG-Tree.
The extracting process extracts all pairs of entity names in the same way as the learning process. The relation between every pair of entity names is then derived by the Relation-IG-Tree.

37
Entity Relation Extraction (8)

Example 1
???????????IT???????,
As a famous manufacturer of IT hardware devices in China, the Lang Chao Group
Company name: ???? Product name: IT????
Training example: Company name (??/?) Product name ???
Relation: product-of
Example 2
?????????????????,?????TCL???????????????????????
Wu Shihong became the media focus once again; however, this time she came to Shanghai as the vice president of TCL Group and the general manager of its IT company.
Person name: ??? Company name: TCL??
Training example: If a person name and a company name appear in neighboring sentences, and no other person names or company names are found in between, they tend to have an employee-of relation.
Relation: employee-of

38
Entity Relation Extraction (9)

System testing results
To test this approach, a manually annotated corpus of about 200 business news articles is used. All the entity names (about 500 person names and 300 organization names), noun phrases, and relations in the corpus were manually annotated. Ten pairs of training sets and test sets were randomly selected from the corpus, with each set equal in size to half of the entire corpus. All data sets were tested; the results are as follows:
                     Recall (%)   Precision (%)
Person Name          86.3         83.2
Organization Name    73.4         89.3
Employee-of          75.6         92.3
Product-of           56.2         87.1
Location-of          67.2         75.6

39
Conclusion

Chinese is typologically different from English or German. There are some special difficulties in Chinese NLP, such as word segmentation.
There are two main types of ambiguous phrase in Chinese word segmentation: the overlap type and the combination type. Among overlap-type ambiguous phrases, chain lengths are mostly 1 or 2, accounting for about 95%. Among combination-type ambiguous phrases, 30% usually have only one possible segmentation. Ambiguity can be removed with methods that depend on the ambiguity type.
Chinese named entities are major constituents of Chinese documents. Different methods can be combined to extract them, such as character conditions, statistical information, titles, punctuation marks, organization and location keywords, speech-act and locative verbs, the cache and the n-gram model.
The determination of Chinese entity relations can be viewed as a classification process. In the learning process, several classifiers are built from the training data. In the extracting process, the relations are extracted using the learned classifiers. Machine learning techniques have been used effectively in Chinese entity relation extraction.
