Title: Simple Features for Chinese Word Sense Disambiguation
1Simple Features for Chinese Word Sense Disambiguation
- Hoa Trang Dang, Ching-yi Chia, Martha Palmer, Fu-Dong Chiou
- Computer and Information Science
- University of Pennsylvania
- htd, chingyc, mpalmer, chioufd_at_unagi.cis.upenn.edu
2Overview
- Maximum entropy WSD feature types
- English Senseval2 verbs
- Chinese
- Penn Chinese Treebank
- People's Daily News
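For reference, a conditional maximum entropy classifier is usually written in the log-linear form below. The slides do not spell out the parameterization, so the notation is assumed here for illustration: s is a candidate sense, c is the context of the target word, the f_i are feature functions of the kinds listed above, and the lambda_i are their learned weights.

```latex
p(s \mid c) = \frac{\exp\left(\sum_i \lambda_i f_i(s, c)\right)}{\sum_{s'} \exp\left(\sum_i \lambda_i f_i(s', c)\right)}
```

The predicted sense is the s that maximizes p(s | c).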
3English Senseval2 verbs
- Primarily Penn Treebank WSJ corpus
- WordNet 1.7 sense inventory
- 29 verbs
- 15.6 senses/verb in corpus
- baseline (most frequent sense): 40%
- best system performance: 60%
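The most-frequent-sense baseline can be sketched as follows. This is a minimal illustration, not the evaluation code behind the slides; the function names and the toy sense labels are made up for the example.

```python
from collections import Counter

def most_frequent_sense(train_senses):
    """Return the sense label that occurs most often in the training data."""
    return Counter(train_senses).most_common(1)[0][0]

def baseline_accuracy(train_senses, test_senses):
    """Accuracy (percent) of always guessing the most frequent training sense."""
    guess = most_frequent_sense(train_senses)
    correct = sum(1 for s in test_senses if s == guess)
    return 100.0 * correct / len(test_senses)

# Toy sense labels for a single verb (illustrative only, not Senseval-2 data).
train = ["call.1", "call.3", "call.1", "call.2", "call.1"]
test = ["call.1", "call.2", "call.1", "call.3"]
print(baseline_accuracy(train, test))
```

The baseline is computed separately for each target word, since every word has its own sense inventory.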
4Local Collocational Features (English)
- Collocational features for w
- word w
- pos of w
- pos of words at positions +1, -1 relative to w
- words at positions -2, -1, +1, +2 relative to w
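The collocational features listed above can be extracted with a few lines of code. This is a sketch under assumptions: the feature names, the "<none>" padding value for positions outside the sentence, and the example sentence are all hypothetical, not taken from the original system.

```python
def collocational_features(words, tags, i):
    """Build collocational features for the target word at index i.

    words and tags are parallel lists for one sentence. Positions that fall
    outside the sentence get the placeholder "<none>" (an assumption).
    """
    def at(seq, j):
        return seq[j] if 0 <= j < len(seq) else "<none>"

    feats = {"w": words[i], "pos": tags[i]}
    # Part-of-speech tags of the immediate neighbours.
    for off in (-1, 1):
        feats[f"pos{off:+d}"] = at(tags, i + off)
    # Word forms in a two-word window on each side.
    for off in (-2, -1, 1, 2):
        feats[f"w{off:+d}"] = at(words, i + off)
    return feats

words = ["She", "called", "him", "a", "liar"]
tags = ["PRP", "VBD", "PRP", "DT", "NN"]
feats = collocational_features(words, tags, 1)
print(feats)
```

Each key/value pair would typically become one binary feature for the classifier.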
5Local Syntactic Features (English)
- Syntactic features
- whether or not the sentence is passive
- whether there is a subject, direct object, indirect object, or clausal complement
- the words (if any) in the positions of subject, direct object, indirect object, particle, prepositional complement (and its object)
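As a rough sketch, the syntactic features could be derived from a per-verb summary of the parse. The input format is an assumption: a dict mapping slot names to head words, with a boolean under "passive". The slot and feature names are hypothetical, and the parse-tree traversal that would fill in the dict is not shown.

```python
# Slots named on the slide; the dict keys themselves are hypothetical.
PRESENCE_SLOTS = ["subject", "direct_object", "indirect_object",
                  "clausal_complement"]
WORD_SLOTS = ["subject", "direct_object", "indirect_object", "particle",
              "prep_complement", "prep_object"]

def syntactic_features(parse):
    """Turn a per-verb parse summary into syntactic features.

    A slot that is missing or None is treated as empty.
    """
    feats = {"passive": bool(parse.get("passive", False))}
    # Presence/absence features.
    for slot in PRESENCE_SLOTS:
        feats[f"has_{slot}"] = parse.get(slot) is not None
    # Word features, emitted only for slots that are filled.
    for slot in WORD_SLOTS:
        if parse.get(slot) is not None:
            feats[f"{slot}_word"] = parse[slot]
    return feats

parse = {"passive": False, "subject": "She", "direct_object": "him"}
feats = syntactic_features(parse)
print(feats)
```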
6Local Semantic Features (English)
- Semantic features
- a Named Entity tag (PERSON, ORGANIZATION, LOCATION) for proper nouns
- WordNet synsets and hypernyms for the nouns
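The idea behind the semantic features can be illustrated with a toy hypernym table standing in for WordNet. Everything in the tables below is invented for the example. Real WordNet nouns can have several synsets and several hypernym paths, which this single-chain sketch ignores.

```python
# Toy stand-in for WordNet: one synset per noun, one parent per synset.
# These entries are illustrative, not real WordNet 1.7 data.
TOY_SYNSET = {"dog": "dog.n", "car": "car.n"}
TOY_HYPERNYM = {"dog.n": "animal.n", "animal.n": "entity.n",
                "car.n": "vehicle.n", "vehicle.n": "entity.n"}

def semantic_features(noun, ne_tag=None):
    """Return the set of semantic features for one argument noun.

    Proper nouns contribute their Named Entity tag; common nouns contribute
    their synset and every hypernym on the path to the root.
    """
    if ne_tag is not None:
        return {f"ne={ne_tag}"}
    feats = set()
    synset = TOY_SYNSET.get(noun)
    while synset is not None:
        feats.add(f"syn={synset}")
        synset = TOY_HYPERNYM.get(synset)
    return feats

print(sorted(semantic_features("dog")))
print(semantic_features("Penn", ne_tag="ORGANIZATION"))
```

Including hypernyms lets the classifier generalize across nouns that share an ancestor, here "dog" and "car" both reaching "entity.n".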
7Overall Accuracy of System (English)
Feature Type                                  Accuracy
Collocation                                   48.3
Collocation + Syntax                          53.9
Collocation + Syntax + Semantics              59.0
Collocation + Topic                           52.9
Collocation + Syntax + Topic                  54.2
Collocation + Syntax + Semantics + Topic      60.2
8Data Preparation (Chinese)
- Penn Chinese Treebank (100K words)
- CETA (Chinese-English Translation Assistance) Dictionary
- 28 words (multiple verb senses, possibly other pos)
- 3.5 senses/word in corpus
- Baseline (most frequent sense): 77%
9Local Collocational Features (Chinese)
- Collocational Features
- word
- pos
- word-2, word-1, word1, word2
- pos-1, pos1
- followsVerb
10Local Syntactic Features (Chinese)
- Syntactic Features
- hassubj
- subj
- hasobj
- obj-p
- obj
- hasinobj
- Comp-VP
- VPComp
- Comp-IP
- hasprd
11Local Semantic Features (Chinese)
- Semantic Features (for verbs only)
- generated by assigning a HowNet noun category to each subject and object
- subjsem
- objsem
12Overall Accuracy of Maximum Entropy System (CTB)
Feature Type                                  Accuracy   Std Dev
Collocation (no pos)                          86.8       1.0
Collocation                                   93.4       0.5
Collocation + Syntax                          94.4       0.4
Collocation + Syntax + Semantics              94.4       0.6
Collocation + Topic                           90.3       1.0
Collocation + Syntax + Topic                  92.7       0.9
Collocation + Syntax + Semantics + Topic      92.8       0.8
Baseline                                      76.7
13Data Preparation (PDN)
- People's Daily News (PDN)
- Five words with low accuracy and counts in CTB were subsequently sense-tagged in PDN (1M words).
- About 200 sentences/word from PDN.
- 8.2 senses/verb in corpus
- Baseline (most frequent sense): 58%
- Automatic segmentation, pos-tagging, parsing
14Overall Accuracy of Maximum Entropy System (PDN)
Feature Type                                  Accuracy   Std Dev
Collocation (no pos)                          72.3       2.2
Collocation                                   70.3       2.9
Collocation + Syntax                          71.7       3.9
Collocation + Syntax + Semantics              71.7       4.2
Collocation + Topic                           73.3       3.2
Collocation + Syntax + Topic                  72.6       2.9
Collocation + Syntax + Semantics + Topic      73.0       3.4
Baseline                                      57.6
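Because the three data sets have very different baselines, one way to compare them is relative error reduction over the most-frequent-sense baseline. The slides do not report this metric. The sketch below simply applies it to the baseline and best accuracy figures given on the slides, using the rounded 40 for the English baseline.

```python
def error_reduction(baseline, accuracy):
    """Relative error reduction (percent) of a system over a baseline.

    Both arguments are accuracies in percent.
    """
    return 100.0 * (accuracy - baseline) / (100.0 - baseline)

# (baseline, best accuracy) pairs as reported on the slides.
results = {
    "English": (40.0, 60.2),
    "CTB": (76.7, 94.4),
    "PDN": (57.6, 73.3),
}
for name, (base, best) in results.items():
    print(f"{name}: {error_reduction(base, best):.1f}% error reduction")
```

On these figures the reduction is roughly 34% for English, 76% for CTB, and 37% for PDN.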
15Conclusion
- Types of features that are important for English and Chinese are different.
- Parse information is useful for English WSD.
- Lexical collocational information may be sufficient for Chinese.
- Chinese word sense disambiguation addressed at segmentation level