Re-organization of IR/CSC team


1
Re-organization of IR/CSC team
  • Hongchao He
    • Conf. follow up: TREC-10, NTCIR
    • Paper follow up: ICCLP, SIGIR paper
  • Guihong Cao
    • MSKK-III: Clustering for technique transfer
  • Yang Wen
    • MSKK-III: Distance word dependency
  • Min Zhang
    • MSKK/CSC: Entropy based pruning for applications
      of (Pinyin/Hiragana) input system

2
Chinese Spelling Checking (or, the Big CSC)
  • Jianfeng Gao
  • NLC Group, MSRCN

3
Outline
  • Introduction
  • Chinese spelling checking
  • Our approach
  • Key techniques and experiments
  • Milestones

4
Introduction
Goal: Automatically correct Chinese spelling
errors made with the MS-Pinyin (MSPY) input system
  • Chinese spelling errors using MS-Pinyin input
    system
  • Chinese spelling error patterns
  • English spelling checking
  • Why CSC is difficult?

5
Chinese spelling errors using MSPY
Text in the brain
  → [Pinyin (phonetic) errors] → Syllable
  → [Typographic errors] → Key stroke (typing)
  → [System errors] → Converted text

6
Chinese spelling error patterns
  • Substitution errors
    • Pinyin error
    • System error (includes Pinyin error in some
      systems)
  • Non-substitution errors → word segmentation
    errors
    • Typographic errors: insertion/deletion/transposition

7
English spelling checking
  • Non-word error detection (the → hte)
    • N-gram (letter) analysis
    • Dictionary lookup
  • Real-word error detection (from → form)
    • NLP parser driven
    • Statistical approach: data/error driven
      • Local: n-gram language model, depends on a
        pre-defined confusion set
      • Global: Winnow, Bayesian, TBL, etc.
  • Problem: lack of error detection
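
The two detection styles listed above can be sketched in a few lines of Python. This is only an illustration: the lexicon, the bigram counts, the confusion set, and both function names are made-up toy assumptions, and only the the/hte and from/form examples come from the slide.

```python
from collections import Counter

# Toy data, assumed for illustration only (not from the slides).
LEXICON = {"a", "letter", "from", "form", "him", "the"}
CONFUSION_SETS = [{"from", "form"}]
BIGRAMS = Counter({("letter", "from"): 8, ("from", "him"): 9, ("letter", "form"): 1})


def nonword_errors(tokens):
    """Dictionary lookup: flag tokens that are not in the lexicon."""
    return [t for t in tokens if t not in LEXICON]


def best_in_confusion_set(prev, word, nxt):
    """Local n-gram check: pick the confusion-set member that best fits its neighbors."""
    for candidates in CONFUSION_SETS:
        if word in candidates:
            # Add-one smoothing keeps unseen bigrams from zeroing the score.
            return max(
                sorted(candidates),
                key=lambda w: (BIGRAMS[(prev, w)] + 1) * (BIGRAMS[(w, nxt)] + 1),
            )
    return word  # words outside every confusion set are left alone
```

Dictionary lookup flags "hte" but lets "form" through, because "form" is itself a valid word; only the confusion-set check can recover "from" in "a letter form him". Words outside every pre-defined confusion set are never examined, which is the coverage limit the slide points at.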

8
Why CSC is difficult?
  • Word segmentation
    • Ambiguity
    • OOV: proper noun detection (personal name,
      location, organization, etc.)
    • Segmentation error propagation
  • Non-word errors (in the English sense) do not
    exist
  • MSPY makes good use of a word trigram language model

9
Chinese spelling checking
  • CSC related work
    • Template matching: long distance, e.g. <???>
      <???>
    • Pattern matching: long words (n > 3), e.g. ????
      → ????, ??? → ????
    • N-gram models: substitution errors
  • CSC challenges
    • Long distance: coverage issue of the template/pattern
      set
    • Frequently used confusion sets, e.g. ?,?
      ?,?
    • OOV, especially proper nouns
    • N-gram: already fully used by MSPY

10
Chinese spelling error patterns in MSPY
  • Proper noun
    • Personal name
    • Location
    • Organization
  • Non-word errors: context independent
    • Insertion/deletion/transposition/substitution
    • E.g. ???? → ????, ??? → ????
  • Real-word errors: context sensitive
    • E.g. ? → ?, ? → ?, ?? → ??

11
Flowchart of our approach
Text with errors
Proper noun detection
Word segmentation
Word fuzzy matching (trigger: single-char string,
low prob.)
Non-word error correction
Context sensitive disambiguation
Real-word error correction
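
The fuzzy-matching trigger in this flowchart ("single char string, low prob") is not spelled out on the slide. One possible reading, sketched below in Python, is that a misspelled word tends to fall apart into a run of single-character words after segmentation, so such runs, together with words whose language-model probability is very low, are the positions worth handing to the fuzzy matcher. The function name, `min_run`, and `min_prob` are assumptions made for this sketch.

```python
def trigger_positions(words, probs, min_run=2, min_prob=1e-6):
    """Return indices of segmented words that should trigger fuzzy matching.

    words: segmenter output, one string per word.
    probs: one probability per word (e.g. from an n-gram model).
    A position is flagged if it lies in a run of at least `min_run`
    consecutive single-character words, or if its probability is below
    `min_prob`. Both thresholds are illustrative assumptions.
    """
    flagged = set()
    i = 0
    while i < len(words):
        j = i
        # Extend j over a run of single-character words starting at i.
        while j < len(words) and len(words[j]) == 1:
            j += 1
        if j - i >= min_run:
            flagged.update(range(i, j))
        i = max(j, i + 1)
    flagged.update(k for k, p in enumerate(probs) if p < min_prob)
    return sorted(flagged)
```

With the placeholder segmentation `["AB", "c", "d", "e", "FG"]` and uniform probabilities, the run `c d e` is flagged and the multi-character words are not.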

12
Word segmentation and proper noun detection
  • Language model based word segmentation
  • Class-based language model
    • P(W) = P_outside(W) × P_inside(W<PN>)^α, α = ?
  • Outside probability: PN tagged training data
    • Using NLPWIN to tag the corpus
    • Filtering, rule base
    • EM?
  • Inside probability: PN list training data
  • Using cache (or, dynamic dictionary)
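
As a rough illustration of the class-based model, the sketch below assumes the slide's formula reads P(W) = P_outside(W) × P_inside(W<PN>)^α, where the outside model scores the sentence with each proper noun replaced by a <PN> class token and the inside model scores the proper-noun strings themselves. The function name and its argument layout are assumptions; the slide leaves the value of α open.

```python
import math


def class_lm_logprob(p_outside, p_inside_list, alpha=1.0):
    """Log-probability under the assumed class-based model.

    p_outside:     probability of the word sequence with <PN> class tokens.
    p_inside_list: one inside probability per detected proper noun.
    alpha:         weight on the inside model (left open on the slide).
    """
    return math.log(p_outside) + alpha * sum(math.log(p) for p in p_inside_list)
```

Working in log space turns the exponent α into a simple multiplicative weight, so it can be tuned on held-out data without touching either component model.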

13
Experiments and Findings
  • Measure: precision/recall definition
  • Training data: People's Daily
  • Tag tool: NLPWIN
  • Test data: spec.
  • Results and Findings
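
The slide does not give the precision/recall definition itself. A minimal sketch of the usual one, applied here to proper-noun spans, could look like this; representing each detection as a `(start, end)` pair and requiring an exact span match are assumptions of the sketch.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted items against gold-standard items.

    Both arguments are iterables of hashable items, e.g. (start, end) spans
    of proper nouns. An item counts as correct only on an exact match.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

For example, four predicted spans of which one is correct, against two gold spans, give a precision of 0.25 and a recall of 0.5.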

14
Long word fuzzy matching
  • Definition of Distance(s1, s2)
    • Long word: n > 3
    • Sum of delete/insert/substitute a character
  • Fast fuzzy matching
    • Global: Lei Zhang's ACL
    • Local: trigger (single char, or low n-gram
      probability)
  • Search: error detection/correction
    • Viterbi
    • Simplified version
  • Long word: local matching
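
Distance(s1, s2) as described here, the number of character deletions, insertions, and substitutions, matches the standard Levenshtein edit distance. The sketch below implements that distance and a brute-force lexicon lookup for long words. The lookup is deliberately naive and is not the fast global or local matching the slide refers to; `fuzzy_match`, `max_dist`, and the example words are assumptions added for illustration.

```python
def distance(s1, s2):
    """Levenshtein distance: minimum number of single-character
    deletions, insertions, and substitutions turning s1 into s2."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(
                prev[j] + 1,               # delete c1
                cur[j - 1] + 1,            # insert c2
                prev[j - 1] + (c1 != c2),  # substitute, or free match
            ))
        prev = cur
    return prev[-1]


def fuzzy_match(s, lexicon, max_dist=1, min_len=4):
    """Return long lexicon words (length > 3) within max_dist of s."""
    return sorted(
        w for w in lexicon
        if len(w) >= min_len and distance(s, w) <= max_dist
    )
```

With a toy lexicon containing the idiom 一帆风顺, the misspelling 一凡风顺 (a same-Pinyin substitution) is one edit away and is matched. Transposition is not counted as a single operation here, since the slide's definition lists only delete, insert, and substitute.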

15
Experiments and Findings
  • Contact 100 people, 3,000 to 5,000
    characters per person
  • Error analysis
  • Algorithm
  • Measure: precision/recall
  • Large lexicon, acquisition
  • Trigger/threshold?
  • Results and Findings

16
Context sensitive disambiguation
  • Building a confusion set specific to MSPY
  • Feature selection: context vector
    • Collocation: contiguous POS or words/characters
    • Context words: words/characters within a K-size
      window
    • Triple?
  • Weighting scheme and classifier
    • Context vector, TFIDF
    • Winnow, Bayesian, TBL, etc.
  • Scaling up
    • Enlarge confusion set
    • Feature pruning
    • Adaptation
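
The feature types and one of the classifiers named above can be sketched as follows, using an English from/form confusion pair for readability. The feature naming scheme, the fixed threshold, and the promotion factor are assumptions of this sketch; the slides do not say how the features were encoded or which Winnow variant was planned.

```python
def context_features(tokens, i, k=2):
    """Features for the word at position i: context words inside a
    K-size window plus the contiguous left/right neighbors (collocations)."""
    feats = set()
    for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
        if j != i:
            feats.add(f"ctx={tokens[j]}")
    if i > 0:
        feats.add(f"left={tokens[i - 1]}")
    if i + 1 < len(tokens):
        feats.add(f"right={tokens[i + 1]}")
    return feats


class Winnow:
    """Binary Winnow with multiplicative promotion and demotion."""

    def __init__(self, alpha=2.0, threshold=5.0):
        self.alpha = alpha
        self.threshold = threshold
        self.w = {}  # feature -> weight; unseen features weigh 1.0

    def predict(self, feats):
        return sum(self.w.get(f, 1.0) for f in feats) >= self.threshold

    def update(self, feats, label):
        if self.predict(feats) == label:
            return  # mistake-driven: only update on errors
        factor = self.alpha if label else 1.0 / self.alpha
        for f in feats:
            self.w[f] = self.w.get(f, 1.0) * factor


def train(examples, epochs=5):
    """examples: list of (feature_set, label) pairs."""
    model = Winnow()
    for _ in range(epochs):
        for feats, label in examples:
            model.update(feats, label)
    return model
```

Here `True` would stand for one member of a confusion pair and `False` for the other; a larger confusion set would need one such classifier per member or a multi-class variant, which is part of the scaling question the slide raises.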

17
Experiments and Findings
  • Measure: precision/recall
  • Training data
  • Test data (XXX confusion set)
  • Results and Findings

18
Experiments and Findings
  • Current Work
    • Pseudo-training set based on MSPY IME
    • Preliminary data processing (400M PD)
      • Unigram error model (10,000 words useful)
        ? ?/69484 ?/10289 ?/2394
      • Trigram error pattern (980,000 useful)
        ???? > ? / ???, > ?
    • Experiments based on basic approaches
      • Pseudo-test set from ????
      • Continuous pair (recall 50%, precision 25%)
      • Pattern matching (??)
  • Future Work
    • Hybrid approaches
    • Pattern clustering, continuous pair
    • Function-word error detection
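
A unigram error model of the kind listed under current work can be pictured as a table of per-character confusion counts collected from a pseudo-training set. The sketch below assumes the set consists of pairs of (original text, text after converting its Pinyin back through the IME) and that only equal-length pairs are aligned character by character. The function name, the pair format, and the example strings are illustrative assumptions, not the authors' data.

```python
from collections import Counter, defaultdict


def unigram_error_model(pairs):
    """Count character substitutions between original and converted text.

    pairs: iterable of (original, converted) strings.
    Returns a mapping: original char -> Counter of wrongly produced chars.
    Pairs of unequal length are skipped because they cannot be aligned
    position by position without a real alignment step.
    """
    model = defaultdict(Counter)
    for original, converted in pairs:
        if len(original) != len(converted):
            continue
        for o, c in zip(original, converted):
            if o != c:
                model[o][c] += 1
    return model
```

Calling `most_common()` on an entry lists the confusions for that character by frequency, which resembles the character-with-counts line on the slide.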

19
System evaluation: putting it all together
  • Evaluation toolset
  • Measure: precision/recall
  • Training data
  • Test data
  • Results and Findings

20
Prototype
  • Demo
  • Online and offline CSC
  • Right click
  • Spelling error detection/correction
  • Proper noun detection/correction

21
Assignment
  • Jianfeng Gao: overall, fuzzy matching
  • Mu Li: context sensitive disambiguation
  • Jian Sun: PN detection
  • Yang Wen: system evaluation
  • Yulin Kang: demo
  • Lei Zhang: senior consultant

22
Milestones
  • Oct. 2001, Ming says Yes (TAB demo)
  • Dec. 2001, Dong says Yes (Transfer)
  • Aug. 2002, HJ says Yes (Party)

23
Information
  • Access at \\msrcn4p3\rootD\gaojf\spell
  • Contact me if any problems
  • Jianfeng Gao, Tel 86-10-62617711-5778, Email
    jfgao_at_microsoft.com