Chinese Named Entity Recognition with Multiple Features

1
Chinese Named Entity Recognition with Multiple
Features
HLT/EMNLP 2005, Vancouver, B.C., Canada, October
6-8, 2005
  • Youzheng Wu, Jun Zhao, Bo Xu, Hao Yu
  • Institute of Automation, Chinese Academy of
    Sciences
  • Fujitsu R&D Center Co., Ltd.
  • {yzwu, jzhao, bxu}_at_nlpr.ia.ac.cn
  • yu_at_frdc.fujitsu.com.cn

2
Outline
  • Introduction
  • Related Work
  • Chinese NER with Multiple Features
  • The Hybrid Model (Word Model + POS Model)
  • Word Model (Word Entity Model + Word Context
    Model)
  • POS Model (POS Entity Model + POS Context Model)
  • Heuristic Human Knowledge
  • Experiments
  • Conclusion

3
Introduction
  • Named Entity Recognition is one of the key
    techniques in IE, QA, Parsing, Metadata tagging,
    etc.
  • We focus on the NE definitions given by PKU
    (Peking University).
  • PN (person name) is divided into 5 sub-classes:
    Chinese PN, Japanese PN, Russian PN, Euramerican
    PN and Abbreviated PN.
  • LN (location name) is divided into 2
    sub-classes: CLN and ALN.
  • ORG, TIM, NUM.

4
Differences Between ENE and CNE Recognition
  • The main differences between Chinese NE
    recognition and English NE recognition include:
  • Unlike English, Chinese lacks capitalization
    information, which plays a very important role in
    identifying named entities.
  • There are no spaces between words in Chinese, so
    the text has to be segmented before NER.
    Consequently, errors in word segmentation
    affect the result of NER.

5
Related Work
  • Approaches to NER focus on machine learning:
  • CoTraining, CoBoosting
  • Hidden Markov Model
  • Maximum entropy model
  • Transformation-based learning
  • Typical Systems for Chinese NER
  • Hsin-Hsi Chen, et al. 1997
  • Yu, et al. 1998
  • Chua, et al. 2000
  • Sun, et al. 2002

6
Chinese NER with Multiple Features
  • Features of Chinese NE
  • Chinese NEs have very distinct word features in
    their composition and contextual information.
  • Data sparseness is very serious when only
    word features are used.
  • Three Basic Ideas
  • Combine coarse-grained features (POS Model) with
    fine-grained features (Word Model).
  • Introduce heuristic human knowledge into the
    statistical model.
  • Use separate sub-models to describe Japanese,
    Russian and Euramerican transliterated person
    names and Chinese person names.

7
The Hybrid Model
  • The task of NER
  • Given a word/POS sequence, find the optimal
    sequence WC/TC by splitting, combining and
    classifying the given sequence.
  • We can obtain the optimal sequence WC/TC
    through the following three models:
  • the Word Model
  • the POS Model
  • the Hybrid Model

8
Word Model and POS Model
  • Word Model
  • Estimates the probability of generating an NE
    from the viewpoint of the word sequence
  • The
  • POS Model
  • Estimates the probability of generating an NE
    from the viewpoint of the POS sequence
  • The

9
The Hybrid Model
  • Combines Word Model with POS Model
  • where, factor a 0 is to balance Word Model and
    POS Model.
  • The Hybrid Model consists of four sub-models:
    the word context model P(WC), the POS context
    model P(TC), the word entity model P(W|WC) and
    the POS entity model P(T|TC).
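
The balancing idea on this slide can be sketched in log space. The slide's equation did not survive in this transcript, so the combination below is an assumption rather than the authors' formula: it takes the hybrid score to be the Word Model log-probability plus a times the POS Model log-probability. The names `hybrid_score` and `best_candidate` and all probabilities are hypothetical.

```python
import math

def hybrid_score(word_logprob, pos_logprob, a):
    """Assumed combination: log P_word + a * log P_pos."""
    return word_logprob + a * pos_logprob

def best_candidate(candidates, a):
    """Return the label of the candidate with the highest hybrid score.

    candidates: list of (label, word_logprob, pos_logprob) tuples (toy data).
    """
    return max(candidates, key=lambda c: hybrid_score(c[1], c[2], a))[0]

# Two hypothetical competing analyses of the same span.
candidates = [
    ("PN", math.log(0.020), math.log(0.10)),  # favoured by the Word Model
    ("LN", math.log(0.010), math.log(0.40)),  # favoured by the POS Model
]

print(best_candidate(candidates, a=0.0))  # only the Word Model counts
print(best_candidate(candidates, a=2.8))  # the POS Model weighs more heavily
```

Under this assumed form, a larger a gives the POS Model a larger say in the decision, which matches the description of the balancing factor.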

10
Word Definition in Word Model

11
POS Definition in POS Model
  • PKU POS specification is adopted in our system.
  • The size of POS set is 48.

12
Context Model
  • Word Context Model
  • POS Context Model

13
Entity Model
  • Class Definition in Entity Model

14
Word Entity Model
  • Word Entity Model for PN
  • Word Entity Model for LN and ON
  • Word Entity Model for ALN

15
POS Entity Model
  • POS Entity Model for PN
  • The Word Entity Model for PN is used in place of
    the POS Entity Model for PN.
  • POS Entity Model for LN and ON
  • POS Entity Model for ALN

16
Training Corpus for Context Model
  • Five months of People's Daily text tagged with
    NER tags

17
Training Corpus for Entity Model
  • Chinese PN: 15.6 million
  • Japanese PN: 0.15 million
  • Euramerican PN: 0.4 million
  • Russian PN: 0.44 million

18
Heuristic Human Knowledge for PN
  • Chinese PN surname list (476 items)
  • Japanese PN surname list (9189 items)
  • Russian PN character lists
  • Euramerican PN character lists
  • NE length restriction

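How a surname list and a length restriction shrink the search space can be sketched as follows. This is a minimal illustration, not the system's actual procedure: it assumes a Chinese PN is a listed surname of one or two characters followed by a given name of at most `MAX_GIVEN_LEN` characters, and it scans a raw character string instead of a segmented word sequence. `SURNAMES`, `MAX_GIVEN_LEN` and `chinese_pn_candidates` are hypothetical names.

```python
SURNAMES = {"王", "李", "张", "欧阳"}  # toy subset of a surname list
MAX_GIVEN_LEN = 2                      # assumed length restriction

def chinese_pn_candidates(chars):
    """Return (start, end) spans that begin with a listed surname."""
    spans = []
    for i in range(len(chars)):
        for surname_len in (1, 2):
            # Skip surnames that would run past the end of the text.
            if i + surname_len > len(chars):
                continue
            if chars[i:i + surname_len] not in SURNAMES:
                continue
            for given_len in range(1, MAX_GIVEN_LEN + 1):
                end = i + surname_len + given_len
                if end <= len(chars):
                    spans.append((i, end))
    return spans

print(chinese_pn_candidates("王小明说"))
```

Only spans that start with a listed surname and respect the length limit are passed on for scoring, so every other span is pruned before the statistical model is consulted.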
19
Heuristic Human Knowledge for LN
  • Location keyword list (including 607 items)
  • General word list (such as verbs and
    prepositions). Words in the list are usually
    followed by a location name, such as "?/at",
    "?/go".
  • ALN name list (including 407 items)

20
Heuristic Human Knowledge for ON
  • Organization keyword list (including 3129 items)
  • An organization name template list which is used
    to recognize the missed ONs in the statistical
    model. Some of these templates are as follows.
  • ON → LN D OrgKeyWord
  • ON → PN D OrgKeyWord
  • ON → ON OrgKeyWord
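
The three templates can be read as patterns over a sequence of already-tagged tokens. The sketch below is one possible reading, not the authors' implementation: it assumes every token carries a tag, that "D" tags an intermediate word, and that "OrgKeyWord" tags a word from the organization keyword list. `TEMPLATES` and `apply_on_templates` are hypothetical names.

```python
TEMPLATES = [
    ("LN", "D", "OrgKeyWord"),
    ("PN", "D", "OrgKeyWord"),
    ("ON", "OrgKeyWord"),
]

def apply_on_templates(tokens):
    """Merge token runs whose tag sequence matches a template into one ON.

    tokens: list of (text, tag) pairs with tags assigned beforehand.
    """
    out, i = [], 0
    while i < len(tokens):
        for template in TEMPLATES:
            window = tokens[i:i + len(template)]
            if tuple(tag for _, tag in window) == template:
                merged = "".join(text for text, _ in window)
                out.append((merged, "ON"))
                i += len(template)
                break
        else:
            # No template matched at this position; keep the token as is.
            out.append(tokens[i])
            i += 1
    return out

tokens = [("北京", "LN"), ("语言", "D"), ("大学", "OrgKeyWord"), ("位于", "O")]
print(apply_on_templates(tokens))
```

The matcher makes a single left-to-right pass; a run that matches a template is replaced by one ON token, which is how ONs missed by the statistical model could be recovered.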

21
Back-off Model to Smooth
  • An escape probability is used to smooth the
    statistical model
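
The smoothing formula on this slide is not preserved in the transcript. As a rough illustration of smoothing with an escape probability, the sketch below backs a bigram estimate off to a unigram estimate using a Witten-Bell-style escape weight; that particular estimator is an assumption swapped in here, not necessarily the one the authors use. `train` and `smoothed_prob` are hypothetical names.

```python
from collections import Counter, defaultdict

def train(tokens):
    """Collect unigram counts and per-history follower counts."""
    unigrams = Counter(tokens)
    bigrams = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        bigrams[prev][cur] += 1
    return unigrams, bigrams

def smoothed_prob(cur, prev, unigrams, bigrams):
    """P(cur | prev), mixing bigram and unigram estimates via an escape weight."""
    p_uni = unigrams[cur] / sum(unigrams.values())
    followers = bigrams.get(prev)
    if not followers:
        return p_uni  # unseen history: rely on the lower-order model
    n = sum(followers.values())  # tokens observed after prev
    t = len(followers)           # distinct types observed after prev
    escape = t / (n + t)         # assumed Witten-Bell-style escape probability
    return (1 - escape) * followers[cur] / n + escape * p_uni

unigrams, bigrams = train(["a", "b", "a", "c", "a", "b"])
print(smoothed_prob("a", "a", unigrams, bigrams))  # unseen bigram, still nonzero
```

The escape weight grows with the number of distinct continuations seen after a history, so histories with varied continuations lean more on the lower-order model.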

22
Experiments
  • Three Experiments
  • Will the Hybrid Model be more effective than the
    Word Model and the POS Model?
  • Will the conclusion from different testing sets
    be consistent?
  • Will the performance be improved significantly
    after combining heuristic human knowledge?
  • Metrics for evaluations
  • Precision, Recall and F-Measure.
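
For reference, the three metrics can be computed from sets of entity spans as sketched below. The sketch assumes exact-match scoring on (start, end, type) tuples and the balanced F-measure (F1); the slide does not state which matching criterion or F weighting is used. `prf` is a hypothetical helper name.

```python
def prf(gold, predicted):
    """Return (precision, recall, F1) for two sets of entity tuples."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: one of two predictions is correct; one of three gold NEs is found.
gold = {(0, 2, "PN"), (5, 7, "LN"), (9, 12, "ON")}
predicted = {(0, 2, "PN"), (5, 7, "ON")}
print(prf(gold, predicted))
```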

23
Experiment I
  • a in the Hybrid Model denotes the balancing
    factor between the Word Model and the POS Model.
  • The larger a is, the larger the contribution of
    the POS Model. The smaller a is, the larger the
    contribution of the Word Model.
  • Goal: find the best value of a on the testing
    corpus of one month of People's Daily.
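
The search for the best balancing factor can be sketched as a simple grid search. Everything here is hypothetical: `best_balancing_factor` is an illustrative helper, and `toy_f_measure` is a made-up stand-in for "run the Hybrid Model with this a and compute the F-measure", not a measurement from the experiments.

```python
def best_balancing_factor(evaluate, candidates):
    """Return the candidate value of a with the highest evaluate(a)."""
    return max(candidates, key=evaluate)

def toy_f_measure(a):
    """Made-up score curve with a single peak; not real results."""
    return 0.9 - (a - 1.5) ** 2 / 10

grid = [0.5 * k for k in range(9)]  # 0.0, 0.5, ..., 4.0
print(best_balancing_factor(toy_f_measure, grid))
```

In practice `evaluate` would be called once per grid point for each entity type, since the slides report the effect of a on PN, LN and ON separately.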

24
Experiment I (cont.)
  • Performance of Recognizing PNs Impacted by a

25
Experiment I (cont.)
  • Performance of Recognizing LNs Impacted by a

26
Experiment I (cont.)
  • Performance of Recognizing ONs Impacted by a

27
Experiment I (cont.)
  • Choose the best value, a = 2.8, after comparing
    the figures.

28
Experiment I (cont.)
  • Conclusion 1
  • The Hybrid Model can improve on the performance
    of both the Word Model and the POS Model.
  • The improvements for PN, LN and ON differ.
    The POS Model has an obvious negative effect on
    the recall of ON recognition at all times, while
    the recalls for PN and LN recognition improve
    at first but then decrease as a increases.

29
Experiment II
  • Experiments on the MET-2 testing corpus
    validate the conclusion from Experiment I.
  • This table validates that the Hybrid Model is
    better than both the Word Model and the POS
    Model.

30
Experiment II (cont.)
  • Conclusion 2
  • Our algorithm is consistent across different
    testing sets, i.e. the Hybrid Model, which
    combines the Word Model with the POS Model,
    achieves better performance than either the Word
    Model or the POS Model on different testing sets.

31
Experiment III
  • To validate the idea that heuristic human
    knowledge can not only reduce the search space,
    but also improve the performance.

  • Model I: Statistical Model without knowledge
  • Model II: Statistical Model with knowledge

32
Experiment III
  • Conclusion 3
  • From this experiment, we learn that human
    knowledge can not only reduce the search space,
    but also significantly improve the performance of
    a purely statistical model.

33
Conclusion
  • We propose a hybrid Chinese NER model which
    combines multiple features.
  • The main contributions are as follows:
  • The proposed Hybrid Model integrates
    coarse-grained features (POS Model) with
    fine-grained features (Word Model), so that each
    can make up for the disadvantages of the other.
  • In order to reduce the search space and improve
    the efficiency of the model, we incorporate
    heuristic human knowledge into the statistical
    model, which increases the performance of NER
    significantly.
  • To capture the intrinsic features of different
    types of entities, we design several sub-models
    for different entities. In particular, we divide
    transliterated person names into three
    sub-classes according to their character sets
    (JPN, RPN and EPN), in addition to CPN.

34
  • Thank You!

35
Compare with Sun, et al. 2002(2)