Re-organization of IR/CSC team

Provided by: researchM6

Transcript and Presenter's Notes

1
Re-organization of IR/CSC team

Hongchao He
Conf. follow-up: TREC-10, NTCIR
Paper follow-up: ICCLP, SIGIR paper
Guihong Cao
MSKK-III: Clustering for technique transfer
Yang Wen
MSKK-III: Distance word dependency
Min Zhang
MSKK/CSC: Entropy-based pruning for applications of (Pinyin/Hiragana) input system

2
Chinese Spelling Checking (or, the Big CSC)

Jianfeng Gao
NLC Group, MSRCN

3
Outline

Introduction
Chinese spelling checking
Our approach
Key techniques and experiments
Milestone

4
Introduction
Goal: Automatically correct Chinese spelling errors made with the MS-Pinyin (MSPY) input system

Chinese spelling errors using MS-Pinyin input system
Chinese spelling error patterns
English spelling checking
Why is CSC difficult?

5
Chinese spelling errors using MSPY
Text in the brain
→ Pinyin (phonetic) errors → Syllable
→ Typographic errors → Key stroke (typing)
→ System errors → Converted text

6
Chinese spelling error patterns

Substitution errors
Pinyin error
System error (includes Pinyin error in some systems)
Non-substitution errors → word segmentation errors
Typographic errors: insertion/deletion/transposition
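
The typographic error types listed above (insertion, deletion, transposition), together with substitution, are the edit operations counted by the optimal string alignment distance. The slides do not say which distance measure was used, so the following is only a minimal sketch under that assumption; the function name `osa_distance` and the Pinyin strings are illustrative.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between a and b.

    Counts insertions, deletions, substitutions and transpositions
    of two adjacent characters, each with cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


# Hypothetical keystroke-level example: "ng" typed as "gn".
print(osa_distance("zhongguo", "zhognguo"))
```

A typed string within distance 1 of a valid Pinyin syllable sequence would be a natural correction candidate under this measure.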

7
English spelling checking

Non-word error detection (the → hte)
N-gram (letter) analysis
Dictionary lookup
Real-word error detection (from → form)
NLP: parser driven
Statistical approach: data/error driven
Local: n-gram language model, depends on a pre-defined confusion set
Global: Winnow, Bayesian, TBL, etc.
Problem: lack of error detection
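
The two detection styles above can be sketched as follows: dictionary lookup flags non-words, and a local bigram model chooses among the members of a pre-defined confusion set for real-word errors. This is an illustration only, not the method from the talk; the toy corpus, the confusion set, the add-one smoothing and all function names are assumptions.

```python
from collections import Counter

# Toy training text (assumption); a real system would use a large corpus.
CORPUS = ("we heard from the team . the form is ready . "
          "a letter from the office .").split()
VOCAB = set(CORPUS)
CONFUSION_SETS = [{"from", "form"}]      # pre-defined confusion set (assumption)

UNIGRAMS = Counter(CORPUS)
BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))


def non_word_errors(tokens, vocab=VOCAB):
    """Dictionary lookup: indices of tokens that are not in the vocabulary."""
    return [i for i, tok in enumerate(tokens) if tok not in vocab]


def bigram_prob(prev, word):
    """P(word | prev) with add-one smoothing."""
    return (BIGRAMS[(prev, word)] + 1) / (UNIGRAMS[prev] + len(VOCAB))


def correct_real_word(tokens, i):
    """Return the confusion-set member that best fits the local context."""
    word = tokens[i]
    for cset in CONFUSION_SETS:
        if word in cset:
            def score(candidate):
                left = bigram_prob(tokens[i - 1], candidate) if i > 0 else 1.0
                right = (bigram_prob(candidate, tokens[i + 1])
                         if i + 1 < len(tokens) else 1.0)
                return left * right
            return max(sorted(cset), key=score)
    return word                          # not in any confusion set: unchanged


print(non_word_errors("hte form is ready".split()))
print(correct_real_word("a letter form the office".split(), 2))
```

A word outside every confusion set is never reconsidered, which is the coverage limitation the slide refers to.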

8
Why is CSC difficult?

Word segmentation
Ambiguous
OOV: Proper noun detection (personal name, location, organization, etc.)
Segmentation error propagation
Non-word errors (in the English sense) do not exist
MSPY already makes good use of a word trigram language model
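
Segmentation ambiguity can be illustrated by enumerating every way to split a string into lexicon words. The sketch below is not taken from the talk; the lexicon, the example sentence and the function name `segmentations` are assumptions chosen for illustration.

```python
def segmentations(text, lexicon, max_len=4):
    """Return every split of `text` into words found in `lexicon`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, min(len(text), max_len) + 1):
        word = text[:end]
        if word in lexicon:
            for rest in segmentations(text[end:], lexicon, max_len):
                results.append([word] + rest)
    return results


# Toy lexicon (assumption). The string below has two valid readings.
LEXICON = {"研究", "研究生", "生命", "命", "起源"}

for seg in segmentations("研究生命起源", LEXICON):
    print(" / ".join(seg))
```

Both readings consist only of lexicon words, so a checker needs context (for example a language model) to choose between them. A wrong choice propagates to every later stage.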

9
Chinese spelling checking

CSC related works
Template matching: long distance, e.g. <???> <???>
Pattern matching: long words (n>3), e.g. ???? → ????, ??? → ????
N-gram models: substitution errors
CSC challenges
Long distance, coverage issue of template/pattern set
Frequently used confusion sets, e.g. ?,? ?,?
OOV, especially proper nouns
N-gram: already fully used by MSPY

10
Chinese spelling error patterns in MSPY

Proper noun
Personal name
Location
Organization
Non-word errors: context independent
Insertion/deletion/transposition/substitution
E.g. ???? → ????, ??? → ????
Real-word errors: context sensitive
E.g. ? → ?, ? → ?, ?? → ??

11
Flowchart of our approach
Text with errors
Proper noun detection
Word segmentation
Word fuzzy matching (trigger: single-char string, low prob)
Non-word error correction
Context sensitive disambiguation
Real-word error correction
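
The fuzzy matching step in the flowchart can be sketched as follows: after segmentation, a run of consecutive single-character tokens with low probability triggers a lookup against a lexicon of long words. This is a rough sketch, not the authors' implementation; the lexicon, the unigram probabilities, the threshold `LOW_LOGPROB` and all function names are hypothetical.

```python
import math

LONG_WORDS = {"一心一意", "全心全意"}                      # toy long-word lexicon (assumption)
UNIGRAM_PROB = {"一": 0.01, "新": 0.001, "意": 0.0005}     # toy probabilities (assumption)
LOW_LOGPROB = -10.0                                      # hypothetical trigger threshold


def edit_distance(a, b):
    """Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def single_char_runs(tokens, min_len=2):
    """Yield (start, end) spans of consecutive single-character tokens."""
    start = None
    for i, tok in enumerate(tokens + [""]):   # "" is a sentinel closing a trailing run
        if len(tok) == 1:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                yield start, i
            start = None


def run_logprob(chars):
    """Sum of unigram log-probabilities; unseen characters get a small floor."""
    return sum(math.log(UNIGRAM_PROB.get(c, 1e-6)) for c in chars)


def fuzzy_correct(tokens, max_dist=1):
    """Replace a low-probability single-char run by the closest long word."""
    out, pos = [], 0
    for start, end in single_char_runs(tokens):
        out.extend(tokens[pos:start])
        run = "".join(tokens[start:end])
        best = min(sorted(LONG_WORDS), key=lambda w: edit_distance(run, w))
        if run_logprob(run) < LOW_LOGPROB and edit_distance(run, best) <= max_dist:
            out.append(best)
        else:
            out.extend(tokens[start:end])
        pos = end
    out.extend(tokens[pos:])
    return out


# Assumed segmenter output for a sentence containing a mistyped four-character word.
print(fuzzy_correct(["他", "工作", "一", "新", "一", "意"]))
```

The sketch matches a whole run against the lexicon, so an unrelated single-character word adjacent to the run would defeat it. A fuller version would try sub-spans of the run as well.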

12
Word segmentation and proper noun detection