Title: Re-organization of IR/CSC team
1 Re-organization of IR/CSC team
- Hongchao He
  - Conf. follow-up: TREC-10, NTCIR
  - Paper follow-up: ICCLP, SIGIR paper
- Guihong Cao
  - MSKK-III: Clustering for technique transfer
- Yang Wen
  - MSKK-III: Distance word dependency
- Min Zhang
  - MSKK/CSC: Entropy-based pruning for applications of (Pinyin/Hiragana) input system
2 Chinese Spelling Checking (or, the Big CSC)
- Jianfeng Gao
- NLC Group, MSRCN
3 Outline
- Introduction
- Chinese spelling checking
- Our approach
- Key techniques and experiments
- Milestones
4 Introduction
Goal: Automatically correct Chinese spelling errors using the MS-Pinyin (MSPY) input system
- Chinese spelling errors using the MS-Pinyin input system
- Chinese spelling error patterns
- English spelling checking
- Why is CSC difficult?
5 Chinese spelling errors using MSPY
- Text in the brain → Syllable: Pinyin (phonetic) errors
- Syllable → Key stroke (typing): typographic errors
- Key stroke → Converted text: system errors
6 Chinese spelling error patterns
- Substitution errors
  - Pinyin error
  - System error (includes Pinyin error in some systems)
- Non-substitution errors → word segmentation errors
  - Typographic errors: insertion/deletion/transposition
7 English spelling checking
- Non-word error detection (the → hte)
  - N-gram (letter) analysis
  - Dictionary lookup
- Real-word error detection (from → form)
  - NLP: parser driven
  - Statistical approach: data/error driven
    - Local: n-gram language model, depends on a pre-defined confusion set
    - Global: Winnow, Bayesian, TBL, etc.
  - Problem: lack of error detection
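
The dictionary-lookup branch of non-word error detection can be sketched in a few lines. This is a minimal illustration, not the checker discussed in these slides: the function name `find_non_words` and the toy lexicon are assumptions made for the example. It also shows why real-word errors (from → form) need a different technique: `form` is itself in the lexicon, so lookup alone cannot flag it.

```python
def find_non_words(tokens, lexicon):
    """Return (position, token) pairs for tokens missing from the lexicon."""
    return [(i, t) for i, t in enumerate(tokens) if t.lower() not in lexicon]


# Toy lexicon (assumption, for illustration only).
LEXICON = {"the", "cat", "came", "from", "form", "house"}

non_word_hits = find_non_words("hte cat came from the house".split(), LEXICON)
real_word_hits = find_non_words("the cat came form the house".split(), LEXICON)

print(non_word_hits)   # the misspelling "hte" is flagged
print(real_word_hits)  # nothing is flagged: "form" is a valid word
```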
8 Why is CSC difficult?
- Word segmentation
  - Ambiguous
  - OOV: proper noun detection (personal name, location, organization, etc.)
  - Segmentation error propagation
- Non-word errors (in the English sense) do not exist
- MSPY makes good use of a word trigram language model
9 Chinese spelling checking
- CSC related works
  - Template matching: long distance, e.g. <???> <???>
  - Pattern matching: long words (n > 3), e.g. ???? ? ????, ??? ? ????
  - N-gram models: substitution errors
- CSC challenges
  - Long distance: coverage issue of template/pattern set
  - High-frequency confusion sets, e.g. ?,? ?,?
  - OOV, especially proper nouns
  - N-gram has already been fully used by MSPY
10 Chinese spelling error patterns in MSPY
- Proper noun
  - Personal name
  - Location
  - Organization
- Non-word errors: context independent
  - Insertion/deletion/transposition/substitution
  - E.g. ???? ? ????, ??? ? ????
- Real-word errors: context sensitive
  - E.g. ? ? ?, ? ? ?, ?? ? ??
11 Flowchart of our approach
Text with errors
Proper noun detection
Word segmentation
Word fuzzy matching (trigger: single-character string, low probability)
Non-word error correction
Context-sensitive disambiguation
Real-word error correction
12 Word segmentation and proper noun detection
- Language model based word segmentation
- Class-based language model
- $P(W) = P_{\text{outside}}(W) \cdot P_{\text{inside}}^{a}(W \mid \langle \text{PN} \rangle)$, a ?
- Outside probability: PN-tagged training data
  - Using NLPWIN to tag the corpus
  - Filtering, rule base
  - EM?
- Inside probability: PN list training data
- Using cache (or, dynamic dictionary)
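
Language-model-based word segmentation can be illustrated with a small dynamic-programming sketch. It is deliberately simpler than the class-based model on this slide: it assumes a plain unigram word model with no proper-noun class, and the function name `segment`, the fallback score for unknown single characters, and the toy probabilities are all assumptions made for the example.

```python
import math


def segment(text, word_logprob, max_len=4, unk_logprob=-20.0):
    """Return the most probable segmentation of text under a unigram model."""
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if piece in word_logprob:
                lp = word_logprob[piece]
            elif len(piece) == 1:
                lp = unk_logprob    # unknown single character (OOV fallback)
            else:
                continue            # unknown multi-character strings are not words
            if best[j] + lp > best[i]:
                best[i] = best[j] + lp
                back[i] = j
    words = []
    i = n
    while i > 0:                    # follow back-pointers from the end
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


# Toy unigram probabilities (made up for illustration).
TOY_LOGPROB = {w: math.log(p) for w, p in {
    "北京": 0.02, "大学": 0.02, "北京大学": 0.01,
    "北": 0.001, "京": 0.001, "大": 0.002, "学": 0.002,
}.items()}

print(segment("北京大学", TOY_LOGPROB))
print(segment("北京大学生", TOY_LOGPROB))
```

In this toy setup the single long word scores higher than the two shorter words, and a character outside the lexicon falls back to a one-character token, which is the kind of single-character string the later fuzzy-matching step uses as a trigger.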
13 Experiments and Findings
- Measure: precision/recall definition
- Training data: People's Daily
- Tagging tool: NLPWIN
- Test data: spec.
- Results and Findings
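
The slide leaves the precision/recall definition open. A minimal sketch using the standard definitions is given here, assuming that system output and the gold standard are both represented as sets of items (for example, proper-noun spans); the function name and the toy spans are illustrative assumptions, not data from these experiments.

```python
def precision_recall(predicted, gold):
    """Precision = correct / predicted, recall = correct / gold."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy spans as (start, end) offsets, made up for illustration.
gold_spans = {(0, 2), (5, 8), (10, 12)}
predicted_spans = {(0, 2), (5, 8), (9, 11), (14, 16)}

p, r = precision_recall(predicted_spans, gold_spans)
print(f"precision={p:.2f} recall={r:.2f}")
```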
14 Long word fuzzy matching
- Definition of Distance(s1, s2)
  - Long word: n > 3
  - Sum of delete/insert/substitute operations on a character
- Fast fuzzy matching
  - Global: Lei Zhang's ACL
  - Local: trigger (single char, or low n-gram probability)
- Search: error detection/correction
  - Viterbi
  - Simplified version
  - Long word: local matching
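
Distance(s1, s2), defined as the number of character deletions, insertions, and substitutions, is the classic edit distance. The sketch below computes it and uses it for a brute-force match against a list of long words. It is an assumption-laden illustration: it scans the whole list rather than using the fast global or local matching mentioned on the slide, and `fuzzy_match`, `max_dist`, and the toy word list are hypothetical.

```python
def edit_distance(s1, s2):
    """Minimum number of character deletions, insertions, and substitutions."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                 # delete c1
                           cur[j - 1] + 1,              # insert c2
                           prev[j - 1] + (c1 != c2)))   # substitute or match
        prev = cur
    return prev[-1]


def fuzzy_match(s, long_words, max_dist=1, min_len=4):
    """Return (distance, word) for long words (n > 3) close to, but not equal to, s."""
    if len(s) < min_len:
        return []
    hits = []
    for w in long_words:
        if len(w) < min_len or abs(len(w) - len(s)) > max_dist:
            continue  # too short, or length difference alone exceeds max_dist
        d = edit_distance(s, w)
        if 0 < d <= max_dist:
            hits.append((d, w))
    return sorted(hits)


# Toy long-word list (assumption, for illustration only).
LONG_WORDS = ["中华人民共和国", "人民代表大会"]
print(fuzzy_match("中华人名共和国", LONG_WORDS))
```

An exact lexicon match returns no candidates, since a string that is already a known long word needs no correction.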
15 Experiments and Findings
- Contact 100 people, 3000 to 5000 characters per person
- Error analysis
- Algorithm
- Measure: precision/recall
- Large lexicon, acquisition
- Trigger/threshold?
- Results and Findings
16 Context sensitive disambiguation
- Building confusion set specific to MSPY
- Feature selection: context vector
  - Collocation: contiguous POS or words/characters
  - Context words: words/characters within a K-size window
  - Triple?
- Weighting schema and classifier
  - Context vector, TFIDF
  - Winnow, Bayesian, TBL, etc.
- Scaling up
  - Enlarge confusion set
  - Feature pruning
  - Adaptation
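
As one concrete reading of the Bayesian option with context-word features, the sketch below trains a naive Bayes classifier over the words within a K-size window around each confusion-set member. Everything here is an assumption for illustration: the class and function names, add-one smoothing, the English confusion set {from, form} standing in for an MSPY-specific one, and the four toy training sentences. Collocation features, TFIDF weighting, Winnow, and TBL are not covered.

```python
import math
from collections import Counter, defaultdict


def context_features(tokens, i, k=2):
    """Words within a window of k positions on each side of position i."""
    lo, hi = max(0, i - k), min(len(tokens), i + k + 1)
    return [t for j, t in enumerate(tokens[lo:hi], lo) if j != i]


class ConfusionSetClassifier:
    """Naive Bayes over context words, with add-one smoothing."""

    def __init__(self, confusion_set, k=2):
        self.confusion_set = set(confusion_set)
        self.k = k
        self.word_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, sentences):
        for tokens in sentences:
            for i, t in enumerate(tokens):
                if t in self.confusion_set:
                    self.word_counts[t] += 1
                    for f in context_features(tokens, i, self.k):
                        self.feature_counts[t][f] += 1
                        self.vocab.add(f)

    def predict(self, tokens, i):
        """Pick the confusion-set member that best fits the context at i."""
        feats = context_features(tokens, i, self.k)
        total = sum(self.word_counts.values())
        v = len(self.vocab) + 1  # one extra slot for unseen context words

        def score(w):
            s = math.log((self.word_counts[w] + 1) / (total + len(self.confusion_set)))
            denom = sum(self.feature_counts[w].values()) + v
            for f in feats:
                s += math.log((self.feature_counts[w][f] + 1) / denom)
            return s

        return max(sorted(self.confusion_set), key=score)


# Toy training sentences (made up for illustration).
TRAIN = [
    "a letter from my friend".split(),
    "he came from the city".split(),
    "fill in the form please".split(),
    "sign the form today".split(),
]
clf = ConfusionSetClassifier({"from", "form"})
clf.train(TRAIN)
print(clf.predict("she came form the city".split(), 2))
print(clf.predict("please sign the from today".split(), 3))
```

Enlarging the confusion set only requires passing more members to the constructor, but the feature tables grow with it, which is where the feature pruning listed under scaling up would come in.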
17 Experiments and Findings
- Measure: precision/recall
- Training data
- Test data (XXX confusion set)
- Results and Findings
18 Experiments and Findings
- Current Work
  - Pseudo-training set based on MSPY IME
  - Preliminary data processing (400M PD)
  - Unigram error model (10,000 words useful)
    - ? ?/69484 ?/10289 ?/2394
  - Trigram error pattern (980,000 useful)
    - ????>? / ???,>?
  - Experiments based on basic approaches
  - Pseudo-test set from ????
  - Continuous pair (Recall 50, Precision 25)
  - Pattern Matching (??)
- Future Work
  - Hybrid approaches
  - Pattern Clustering Continuous pair
  - Functional words error detection
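
The slides do not spell out how the unigram error model is built. One plausible reading, sketched here purely as an assumption, is a table counting how often each intended word comes out as a different word in the pseudo-training set. The function names, the aligned-pair input format, and the toy data are hypothetical and are not taken from the MSPY experiments.

```python
from collections import Counter, defaultdict


def build_unigram_error_model(aligned_pairs):
    """Count, for each intended word, the erroneous words it was converted to."""
    counts = defaultdict(Counter)
    for intended, converted in aligned_pairs:
        if intended != converted:
            counts[intended][converted] += 1
    return counts


def top_confusions(model, intended, n=3):
    """Most frequent erroneous conversions of an intended word."""
    return model[intended].most_common(n)


# Toy aligned (intended, converted) pairs, made up for illustration.
PAIRS = [("他", "他")] * 5 + [("他", "她")] * 3 + [("他", "它")]
error_model = build_unigram_error_model(PAIRS)
print(top_confusions(error_model, "他"))
```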
19 System evaluation: put it all together
- Evaluation toolset
- Measure: precision/recall
- Training data
- Test data
- Results and Findings
20 Prototype
- Demo
  - Online/offline CSC
  - Right click
  - Spelling error detection/correction
  - Proper noun detection/correction
21 Assignment
- Jianfeng Gao: overall, fuzzy matching
- Mu Li: context sensitive disambiguation
- Jian Sun: PN detection
- Yang Wen: system evaluation
- Yulin Kang: demo
- Lei Zhang: senior consultant
22 Milestones
- Oct. 2001, Ming says Yes (TAB demo)
- Dec. 2001, Dong says Yes (Transfer)
- Aug. 2002, HJ says Yes (Party)
23 Information
- Access at \\msrcn4p3\rootD\gaojf\spell
- Contact me if there are any problems
- Jianfeng Gao, Tel: 86-10-62617711-5778, Email: jfgao_at_microsoft.com