Title: Related Works About Text Normalization
Contents
- Segmentation
- New Word Identification
- Spelling Correction
Segmentation
- The main issue in word segmentation is how to find the correct segmentation among all possible ones.
- Chinese strings have no spaces between words.
A Generalized Word Segmentation Model (1992)
- c_1^n = c_1, c_2, ..., c_n (characters)
- W_i = w_i1, w_i2, ..., w_im_i (words)
- L_i = l_i1, l_i2, ..., l_im_i (word lengths)
- m_i: the number of words in the i-th segmentation
- T_ij = t_ij1, t_ij2, ..., t_ijm_i (lexical tags)
- (a sketch of enumerating the possible segmentations follows below)
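As a concrete illustration of "all possible segmentations" from the previous slide, here is a minimal sketch that enumerates every way to split a string into lexicon words; the toy lexicon and the Latin-letter stand-ins for Chinese characters are invented for illustration.

```python
def all_segmentations(s, lexicon):
    """Enumerate every segmentation of s into words from the lexicon."""
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        if prefix in lexicon:
            for rest in all_segmentations(s[i:], lexicon):
                results.append([prefix] + rest)
    return results

# Toy lexicon; Latin letters stand in for Chinese characters.
lexicon = {"a", "b", "c", "ab", "bc", "abc"}
for seg in all_segmentations("abc", lexicon):
    print(" / ".join(seg))
# a / b / c, a / bc, ab / c, abc
```

The 1992 model then scores each candidate segmentation W_i (with its lengths L_i and tags T_ij) to pick the best one.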
Parameter Adjustment
- Adaptive Learning
- Robustness Enhancement
A Chinese word segmentation based on language
situation in processing ambiguous words (2004)
- The first Chinese character of w is x; the Chinese character series after x is y (w = xy).
- Language situation in the document
- Language situation in the directory
- The function of the language situation
- Layer 1: W = wordleft x y z u
- Layer 2: construct combinatorial character series from wordleft, x, y, z, and u, and set them as a_1 ... a_k, in which a_1 is set as wordleft x.
- Layer 3: calculate I(a_1), ..., I(a_k).
- Layer 4: if I(a_1) is the greatest value, set y_i = 1, denoting that wordleft x can be segmented into a word.
- Repeat until all the characters have been processed (a simplified sketch of this loop follows below).
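The slide leaves the scoring function I(.) abstract (it is derived from the language-situation statistics). The sketch below simplifies Layer 2 to a merge-versus-split choice at each step, just to show the shape of the layered loop; all names are illustrative.

```python
def segment(chars, score):
    """Greedy layered segmentation in the spirit of the slide.

    `score` plays the role of I(.): any function rating how word-like
    a candidate character series is.
    """
    words = []
    left = chars[0]                  # current partial word ("wordleft")
    for x in chars[1:]:
        # Layer 2 (simplified): candidates a1 = wordleft + x, a2 = x alone.
        candidates = [left + x, x]
        scores = [score(a) for a in candidates]   # Layer 3
        if scores[0] == max(scores):              # Layer 4
            left = left + x          # wordleft + x continues one word
        else:
            words.append(left)       # close the current word, start anew
            left = x
    words.append(left)
    return words
```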
Improved Source-Channel Models for Chinese Word Segmentation (2003)
- Let S be a Chinese sentence. For all possible word segmentations W, W* = argmax_W P(W|S) = argmax_W P(W) P(S|W) (Bayes' rule).
- Define word classes C as follows:
- entries in a lexicon (lexicon words below)
- morphologically derived words
- factoids
- named entities
- Rewritten over word classes: C* = argmax_C P(C) P(S|C) (Eq. 2).
- Two steps:
- First, given an input string S, all word candidates are generated and stored in a lattice. Each candidate is tagged with its word class and the class model probability P(S'|C), where S' is the substring of S that the candidate covers.
- Second, a Viterbi search is used to select (from the lattice) the most probable word segmentation (i.e. word class sequence C) according to Eq. (2) (a sketch of this search follows below).
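A minimal sketch of the second step, assuming the lattice from the first step is given as a mapping from spans to precomputed log scores (the combined source and channel terms of Eq. (2)); the names and data structures here are assumptions for illustration.

```python
import math

def viterbi_segment(s, lattice):
    """Select the most probable segmentation from a word lattice.

    `lattice` maps (start, end) spans of s to a combined log score for
    the candidate covering s[start:end].
    """
    n = len(s)
    best = [-math.inf] * (n + 1)   # best[i]: best score covering s[:i]
    back = [0] * (n + 1)           # back[i]: start of the word ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            if (start, end) in lattice and best[start] > -math.inf:
                cand = best[start] + lattice[(start, end)]
                if cand > best[end]:
                    best[end], back[end] = cand, start
    words, i = [], n
    while i > 0:                   # walk the back pointers
        words.append(s[back[i]:i])
        i = back[i]
    return list(reversed(words))
```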
New Word Identification
- Done together with segmentation.
- Sequences of single characters are good candidates for new words.
- 99% of Chinese words are 1-4 characters long.
A two-stage statistical word segmentation system
for Chinese (2003)
- The first stage: segmentation of known words
- Word bigram language models and the Viterbi algorithm are used.
A two-stage statistical word segmentation system
for Chinese
- The second stage: unknown word identification (a hybrid algorithm)
- Word juncture model
- There are two types of junctures in unknown word identification: word boundary (denoted by t_B) and non-word boundary (denoted by t_N).
- Word juncture probability P_WJM(w^U) (a sketch of estimating it follows below)
- The larger the probability Pr(t_N(w_i, w_i+1)), the more likely the two words are merged together into one new word.
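The slide does not say how the juncture probabilities are trained; a natural sketch (an assumption, not the paper's procedure) estimates them by relative frequency from data where each juncture is labeled t_B or t_N:

```python
from collections import Counter

def train_wjm(labeled_junctures):
    """Estimate Pr(t | (w1, w2)) by relative frequency.

    `labeled_junctures` is an iterable of (w1, w2, t) triples, where t is
    "B" (word boundary) or "N" (non-word boundary inside one new word).
    """
    joint = Counter()
    pair = Counter()
    for w1, w2, t in labeled_junctures:
        joint[(w1, w2, t)] += 1
        pair[(w1, w2)] += 1

    def pr(t, w1, w2):
        if pair[(w1, w2)] == 0:
            return 0.5               # uninformative backoff for unseen pairs
        return joint[(w1, w2, t)] / pair[(w1, w2)]

    return pr
```

When pr("N", w1, w2) is close to 1, merging w1 and w2 into one new word is favored, per the slide.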
A two-stage statistical word segmentation system
for Chinese
- Word-formation patterns
- The character ? appears at the end of a multi-character word in more than 93% of cases.
- Four patterns when a character w takes part in forming a word:
- w itself is a word. (S)
- w is the beginning of an unknown word. (B)
- w is in the middle of an unknown word. (M)
- w appears at the end of an unknown word. (E)
- Let Pr(pttn(w)) denote the relevant probability, pttn in {S, B, M, E}.
- 1 - Pr(S(w)) is the word-formation power of the known word w.
- For a certain unknown word w^U = w_1 w_2 ... w_l, the pattern probabilities of the component characters are combined (a sketch follows below).
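The formula itself was lost in transcription; a natural reading (assumed here, not quoted from the paper) multiplies the pattern probabilities of the component characters, with the first taking pattern B, inner ones M, and the last E:

```python
import math

def pattern_logscore(components, pr_pttn):
    """log P_pttn(w^U) for an unknown word w^U = w_1 ... w_l.

    `pr_pttn(pttn, w)` returns Pr(pttn(w)) for pttn in {"S","B","M","E"}.
    """
    if len(components) == 1:
        return math.log(pr_pttn("S", components[0]))
    score = math.log(pr_pttn("B", components[0]))
    for w in components[1:-1]:
        score += math.log(pr_pttn("M", w))
    return score + math.log(pr_pttn("E", components[-1]))
```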
A two-stage statistical word segmentation system
for Chinese
- A decoder finally incorporates the word juncture model P_WJM(w^U), the word-formation patterns P_pttn(w^U), and the word bigram probability P_bigram(w^U) to score these candidates, and then applies the Viterbi algorithm again to find the best new segmentation W^U = x_1 x_2 ... x_m that has the maximum score (one way to combine the scores is sketched below).
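The slide does not spell out how the three scores are combined; the simplest assumed form is an unweighted sum of logs (a product of the probabilities), which plugs directly into the Viterbi search sketched earlier:

```python
def combined_logscore(log_p_wjm, log_p_pttn, log_p_bigram):
    """Assumed decoder score: log P_WJM + log P_pttn + log P_bigram.
    Any interpolation weights the paper may use are omitted here."""
    return log_p_wjm + log_p_pttn + log_p_bigram
```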
Segmenting Chinese Unknown Words by Heuristic Method (2005)
Segmenting Chinese Unknown Words by Heuristic
Method
- Statistical approach
- Mutual information (MI): a and b are Chinese characters, N is the size of the corpus.
- Significance estimation (SE): c is a string with n characters; a and b are the two overlapping substrings of c with n - 1 characters (see the sketch below).
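Both formulas were lost in transcription. The sketch below uses the standard pointwise mutual information; for SE, the form f(c)/(f(a)+f(b)-f(c)) is an assumption reconstructed from the definitions above.

```python
import math

def mi(f_a, f_b, f_ab, n):
    """Pointwise mutual information of the character pair (a, b):
    log [ P(ab) / (P(a) P(b)) ], with P(x) = f(x)/N over a corpus of size N."""
    return math.log((f_ab / n) / ((f_a / n) * (f_b / n)))

def se(f_c, f_a, f_b):
    """Significance estimation of string c, where a and b are its two
    overlapping (n-1)-character substrings (assumed form, see above)."""
    return f_c / (f_a + f_b - f_c)
```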
Segmenting Chinese Unknown Words by Heuristic
Method
- Heuristic method
- A heuristic method with five rules is proposed to segment Chinese text using the mutual information (MI) and significance estimation (SE) of all bi-grams and tri-grams, respectively, in the corpus.
Statistically-Enhanced New Word Identification in
a Rule-Based Chinese System
- Finding candidates for new words
- For characters, define IWP (independent word probability).
- IWP(s) is the joint probability of the IWP(c) of the component characters.
- If IWP(s) < threshold, s is considered a candidate for a new word (see the sketch below).
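A small sketch of this candidate filter; estimating IWP(c) as the fraction of c's corpus occurrences where it stands alone as a word, and the threshold value, are assumptions for illustration.

```python
def is_new_word_candidate(s, iwp_char, threshold=0.1):
    """IWP(s) = product of IWP(c) over the component characters of s;
    s is a new-word candidate when IWP(s) < threshold (value illustrative)."""
    p = 1.0
    for c in s:
        p *= iwp_char(c)   # IWP(c): P(c occurs as an independent word)
    return p < threshold
```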
Statistically-Enhanced New Word Identification in
a Rule-Based Chinese System
- POS tagging
- Define P(Category, Position, Length) for a character.
- For example, a character may tend to occur in the second position of two-character and three-character verbs.
- P(Cat) for a word is the joint probability of the P(Cat, Pos, Len) of the component characters.
Korean Word Spacing
- <b>: a token that marks the space (word boundary).
- Korean, unlike Chinese, writes spaces between words.
- The space itself is treated as just another character, <b>.
- A trigram model over characters (including <b>) is used.
- Word boundaries can then be recovered from the model.
- e.g. trigram probabilities P(c | a b), where any of a, b, c may be <b>.
- P(S) = P(a_1, a_2, ..., a_N) ≈ P(a_1) P(a_2|a_1) P(a_3|a_1 a_2) ... P(a_N|a_N-2 a_N-1) (scored as sketched below)
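A minimal sketch of scoring a spaced sentence under this model, treating <b> as an ordinary symbol; the trigram probability function is assumed to be given (e.g. estimated from a corpus with smoothing).

```python
import math

def sentence_logprob(symbols, p_tri):
    """log P(S) = log P(a1) + log P(a2|a1) + sum_i log P(ai | a(i-2) a(i-1)).

    `symbols` is the character sequence with "<b>" tokens at the spaces;
    `p_tri(c, context)` returns P(c | context) for contexts of length 0-2.
    """
    logp = 0.0
    for i, c in enumerate(symbols):
        context = tuple(symbols[max(0, i - 2):i])
        logp += math.log(p_tri(c, context))
    return logp

# The spaced sentence "AB CDE" is scored as:
# sentence_logprob(["A", "B", "<b>", "C", "D", "E"], p_tri)
```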
Markov Chain for Sentence Generation
[Figure: sentence generation model for ABCDE; states 0 to 5, with an optional <b> emission at each gap between characters]
- Only left-to-right transitions.
- A space (<b>) may or may not be generated between adjacent characters.
- ABCDE can be spaced as A BCDE, A B CDE, ..., A B C D E: 2^4 = 16 ways in total.
- The most probable of these paths is selected as the spacing result.
[Figure: lattice diagram; rows are the states (the characters and their <b> variants), columns are time steps t = 0, 1, 2, ..., 2N-1]
- For a sentence of N characters a path contains at most 2N-1 symbols, since a <b> may be inserted into each of the N-1 gaps.
- Each path through the lattice corresponds to one candidate spacing.
[Figure: possible transition paths into lattice node (t, s): k1 from row s, k2 from row s-1, k3 from row s-2]
- A penalty term log P(d) is added when a space is emitted, penalizing according to d, the length of the word the space closes.
- L(t, s, 1) = max_k L(t-1, s, k) + log P(<b> | x_t-2,s-1, w_s) + log P(d) (penalty)
- L(t, s, 2) = max_k L(t-1, s-1, k) + log P(w_s | x_t-2,s-1, <b>)
- L(t, s, 3) = max_k L(t-1, s-2, k) + log P(w_s | x_t-2,s-2, w_s-1)
- Decision: (t*, k*) = argmax_k,t L(t, N, k)/t, for N <= t <= 2N-1 (illustrated below by brute force)
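For short sentences the lattice search can be illustrated by brute force: score every one of the 2^(N-1) spacings with the trigram model (sentence_logprob from the earlier sketch) and keep the best, normalizing by path length as in the decision rule. The log P(d) penalty is omitted here; the lattice Viterbi computes the same argmax efficiently.

```python
from itertools import product

def best_spacing(text, p_tri):
    """Brute-force stand-in for the lattice search over all spacings."""
    best, best_score = None, float("-inf")
    for gaps in product([False, True], repeat=len(text) - 1):
        symbols = [text[0]]
        for c, space in zip(text[1:], gaps):
            if space:
                symbols.append("<b>")   # emit a space in this gap
            symbols.append(c)
        # Length-normalized score, mirroring argmax L(t, N, k)/t.
        score = sentence_logprob(symbols, p_tri) / len(symbols)
        if score > best_score:
            best, best_score = symbols, score
    return best
```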
Performance with varying n-gram language models
- Syllable accuracy: SA = (1 - (S + M)/Ns) × 100 (%)
- Word accuracy: WA = C/Nw × 100 (%)
- S: number of space insertion errors, M: number of space deletion errors, C: number of correct words, Ns: total number of syllables, Nw: total number of words (computed in the sketch below).
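The two metrics, transcribed directly into code:

```python
def syllable_accuracy(s_errors, m_errors, n_syllables):
    """SA = (1 - (S + M)/Ns) * 100 (%)."""
    return (1 - (s_errors + m_errors) / n_syllables) * 100

def word_accuracy(c_correct, n_words):
    """WA = C/Nw * 100 (%)."""
    return c_correct / n_words * 100
```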
Discussion
- Go beyond the plain character trigram model.
- Introduce tags (classes).
- Build trigrams over <tag> (or <class>) units, e.g. <tag>_<tag> with a space and <tag><tag> without one.
- Frequently occurring <tag><tag> sequences could themselves be treated as a single <class> (just as <b> is).
- This parallels the techniques used in New Word Identification above.
Spelling Correction
- Simple misspellings (not in the dictionary)
- Context-free spelling correction
- Ex) acress → actress
- Wrong in context (in the dictionary)
- Context-sensitive spelling correction
- Ex) peace of cake → piece of cake
- between them → among them
Spelling correction using probabilistic methods (1984)
- P(Y|X): the probability of generating a garbled word Y from a correct word X.
- Let S(b|a) be the probability that the garbler replaces the symbol a in some input word X by the symbol b.
- P(Y|X, t)
Spelling correction using probabilistic methods
- Step 1: insert t null symbols into the correct word X = x_1 ... x_m, giving U = u_1 ... u_m+t.
- Step 2: replace each symbol u_i in U by o_i with probability S(o_i|u_i); the resulting string is V = v_1 ... v_m+t.
- Step 3: Y = C(V).
- Decision rule: for the given garbled word Y, compute the probabilities P(Y|X) for every X in the dictionary (see the sketch below).
- X* = the value of X in the dictionary which maximizes P(Y|X).
- The total computation needed to compute P(Y|X_i), i = 1, 2, ..., l, is on the order of l (K_0 + K_1 m + K_2 |Y|)
- (m: the average length of the words in the dictionary)
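A sketch of the decision rule, assuming the channel model P(Y|X) is available as a function; the dictionary contents in the usage line are illustrative.

```python
def correct(garbled, dictionary, p_channel):
    """Return the dictionary word X maximizing P(Y|X) for the garbled Y."""
    return max(dictionary, key=lambda x: p_channel(garbled, x))

# e.g. correct("acress", ["actress", "across", "access"], p_channel)
```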
A Spelling Correction Program Based on a Noisy Channel Model (1990)
- Score Pr(c) Pr(t|c), where Pr(c) is estimated by (freq(c) + 1)/N.
- Four matrices for calculating Pr(t|c) (combined in the sketch below):
- (1) del[x, y]: the number of times that the character y was deleted after the character x in the training set
- (2) add[x, y]: the number of times that y was inserted after x
- (3) sub[x, y]: the number of times that y (from the correct word) was typed as x
- (4) rev[x, y]: the number of times that xy was reversed (typed as yx)
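A sketch of turning the four matrices into the per-edit probability Pr(t|c); the normalizing counts (character unigrams and bigrams from the training corpus) follow the common presentation of this model and are an assumption here.

```python
def p_typo_given_correct(edit, del_m, add_m, sub_m, rev_m, count1, count2):
    """Pr(t|c) for a typo t derived from correct word c by a single edit.

    `edit` is one of:
      ("del", x, y): y deleted after x   -> del[x,y] / count2[x+y]
      ("add", x, y): y inserted after x  -> add[x,y] / count1[x]
      ("sub", x, y): y typed as x        -> sub[x,y] / count1[y]
      ("rev", x, y): xy typed as yx      -> rev[x,y] / count2[x+y]
    """
    kind, x, y = edit
    if kind == "del":
        return del_m[(x, y)] / count2[x + y]
    if kind == "add":
        return add_m[(x, y)] / count1[x]
    if kind == "sub":
        return sub_m[(x, y)] / count1[y]
    if kind == "rev":
        return rev_m[(x, y)] / count2[x + y]
    raise ValueError(kind)
```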
A Spelling Correction Program Based on a Noisy
Channel Model
Automatic Rule Acquisition for Spelling
Correction (1997)
- A large class of errors involves misspellings that result in valid words (piece vs. peace, among vs. between).
- Confusion sets: {principal, principle}, {except, accept}, ...
- Transformation-Based Learning
Automatic Rule Acquisition for Spelling Correction
- Change word w1 to w2 if:
- a word W occurs within k words of w1 (co-occurrences; see the sketch after this list)
- e.g. change principle to principal if the word school appears within the proximity window.
- a specific pattern of up to l contiguous words and/or part-of-speech tags occurs around w1 (collocations)
- e.g. change piece to peace if the word world immediately precedes it.
- a specific pattern of noncontiguous words and/or part-of-speech tags occurs around w1 (collocations with wildcards)
- e.g. change except to accept if the word three positions before is "he" and the immediately preceding word is "not".
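A sketch of applying an acquired rule of the first template (co-occurrence within a k-word window); the rule encoding and the example sentence are invented for illustration.

```python
def apply_cooccurrence_rule(tokens, w1, w2, trigger, k):
    """Change w1 to w2 wherever `trigger` occurs within k words of it."""
    out = list(tokens)
    for i, tok in enumerate(tokens):
        if tok == w1:
            window = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
            if trigger in window:
                out[i] = w2
    return out

# Change "principle" to "principal" when "school" is nearby:
print(apply_cooccurrence_rule(
    "the school hired a new principle".split(),
    "principle", "principal", "school", k=5))
# ['the', 'school', 'hired', 'a', 'new', 'principal']
```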