Title: Related Works About Text Normalization
Contents
- Segmentation
- New Word Identification
- Spelling Correction
Segmentation
- The main issue in word segmentation is how to find the correct segmentation among all possible ones.
- Chinese strings have no spaces between words.
A Generalized Word Segmentation Model (1992)
- c_1^n = c_1, c_2, ..., c_n (characters)
- W_i = w_i1, w_i2, ..., w_im_i (words)
- L_i = l_i1, l_i2, ..., l_im_i (word lengths)
- m_i: the number of words in the i-th segmentation
- T_ij = t_ij1, t_ij2, ..., t_ijm_i (lexical tags)
- (a sketch of enumerating the possible segmentations follows below)
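As a concrete illustration of "all possible segmentations" from the previous slide, here is a minimal sketch that enumerates every way to split a string into lexicon words; the toy lexicon and the Latin-letter stand-ins for Chinese characters are invented for illustration.

```python
def all_segmentations(s, lexicon):
    """Enumerate every segmentation of s into words from the lexicon."""
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        if prefix in lexicon:
            for rest in all_segmentations(s[i:], lexicon):
                results.append([prefix] + rest)
    return results

# Toy lexicon; Latin letters stand in for Chinese characters.
lexicon = {"a", "b", "c", "ab", "bc", "abc"}
for seg in all_segmentations("abc", lexicon):
    print(" / ".join(seg))
# a / b / c, a / bc, ab / c, abc
```

The 1992 model then scores each candidate segmentation W_i (with its lengths L_i and tags T_ij) to pick the best one.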
Parameter Adjustment
- Adaptive Learning
- Robustness Enhancement
A Chinese word segmentation based on language
situation in processing ambiguous words (2004)
- The first Chinese character of w is x; the Chinese character series after x is y (w = xy).
- Language situation in the document
- Language situation in the directory
- The function of the language situation
- Layer 1: W = wordleft x y z u
- Layer 2: construct combinatorial character series from wordleft, x, y, z, and u, and set them as a_1 ... a_k, in which a_1 is set as wordleft x.
- Layer 3: calculate I(a_1), ..., I(a_k).
- Layer 4: if I(a_1) is the greatest value, set y_i = 1, denoting that wordleft x can be segmented into a word.
- Repeat until all the characters have been processed (a simplified sketch of this loop follows below).
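The slide leaves the scoring function I(.) abstract (it is derived from the language-situation statistics). The sketch below simplifies Layer 2 to a merge-versus-split choice at each step, just to show the shape of the layered loop; all names are illustrative.

```python
def segment(chars, score):
    """Greedy layered segmentation in the spirit of the slide.

    `score` plays the role of I(.): any function rating how word-like
    a candidate character series is.
    """
    words = []
    left = chars[0]                  # current partial word ("wordleft")
    for x in chars[1:]:
        # Layer 2 (simplified): candidates a1 = wordleft + x, a2 = x alone.
        candidates = [left + x, x]
        scores = [score(a) for a in candidates]   # Layer 3
        if scores[0] == max(scores):              # Layer 4
            left = left + x          # wordleft + x continues one word
        else:
            words.append(left)       # close the current word, start anew
            left = x
    words.append(left)
    return words
```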
Improved Source-Channel Models for Chinese Word Segmentation (2003)
- Let S be a Chinese sentence. For all possible word segmentations W, W* = argmax_W P(W|S) = argmax_W P(W) P(S|W) (Bayes' rule).
- Define word classes C as follows:
- entries in a lexicon (lexicon words below)
- morphologically derived words
- factoids
- named entities
- Rewritten over word classes: C* = argmax_C P(C) P(S|C) (Eq. 2).
- Two steps:
- First, given an input string S, all word candidates are generated and stored in a lattice. Each candidate is tagged with its word class and the class model probability P(S'|C), where S' is the substring of S that the candidate covers.
- Second, a Viterbi search is used to select (from the lattice) the most probable word segmentation (i.e. word class sequence C) according to Eq. (2) (a sketch of this search follows below).
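A minimal sketch of the second step, assuming the lattice from the first step is given as a mapping from spans to precomputed log scores (the combined source and channel terms of Eq. (2)); the names and data structures here are assumptions for illustration.

```python
import math

def viterbi_segment(s, lattice):
    """Select the most probable segmentation from a word lattice.

    `lattice` maps (start, end) spans of s to a combined log score for
    the candidate covering s[start:end].
    """
    n = len(s)
    best = [-math.inf] * (n + 1)   # best[i]: best score covering s[:i]
    back = [0] * (n + 1)           # back[i]: start of the word ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            if (start, end) in lattice and best[start] > -math.inf:
                cand = best[start] + lattice[(start, end)]
                if cand > best[end]:
                    best[end], back[end] = cand, start
    words, i = [], n
    while i > 0:                   # walk the back pointers
        words.append(s[back[i]:i])
        i = back[i]
    return list(reversed(words))
```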
New Word Identification
- Done together with segmentation.
- Sequences of single characters are good candidates for new words.
- 99% of Chinese words are 1-4 characters long.
A two-stage statistical word segmentation system
for Chinese (2003)
- The first stage: segmentation of known words
- Word bigram language models and the Viterbi algorithm are used.
A two-stage statistical word segmentation system
for Chinese
- The second stage: unknown word identification (a hybrid algorithm)
- Word juncture model
- There are two types of junctures in unknown word identification: word boundary (denoted by t_B) and non-word boundary (denoted by t_N).
- Word juncture probability P_WJM(w^U) (a sketch of estimating it follows below)
- The larger the probability Pr(t_N(w_i, w_i+1)), the more likely the two words are merged together into one new word.
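The slide does not say how the juncture probabilities are trained; a natural sketch (an assumption, not the paper's procedure) estimates them by relative frequency from data where each juncture is labeled t_B or t_N:

```python
from collections import Counter

def train_wjm(labeled_junctures):
    """Estimate Pr(t | (w1, w2)) by relative frequency.

    `labeled_junctures` is an iterable of (w1, w2, t) triples, where t is
    "B" (word boundary) or "N" (non-word boundary inside one new word).
    """
    joint = Counter()
    pair = Counter()
    for w1, w2, t in labeled_junctures:
        joint[(w1, w2, t)] += 1
        pair[(w1, w2)] += 1

    def pr(t, w1, w2):
        if pair[(w1, w2)] == 0:
            return 0.5               # uninformative backoff for unseen pairs
        return joint[(w1, w2, t)] / pair[(w1, w2)]

    return pr
```

When pr("N", w1, w2) is close to 1, merging w1 and w2 into one new word is favored, per the slide.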
A two-stage statistical word segmentation system
for Chinese
- Word-formation patterns
- The character ? appears at the end of a multi-character word in more than 93% of cases.
- Four patterns when a character w takes part in forming a word:
- w itself is a word. (S)
- w is the beginning of an unknown word. (B)
- w is in the middle of an unknown word. (M)
- w appears at the end of an unknown word. (E)
- Let Pr(pttn(w)) denote the relevant probability, pttn in {S, B, M, E}.
- 1 - Pr(S(w)) is the word-formation power of the known word w.
- For a certain unknown word w^U = w_1 w_2 ... w_l, the pattern probabilities of the component characters are combined (a sketch follows below).
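The formula itself was lost in transcription; a natural reading (assumed here, not quoted from the paper) multiplies the pattern probabilities of the component characters, with the first taking pattern B, inner ones M, and the last E:

```python
import math

def pattern_logscore(components, pr_pttn):
    """log P_pttn(w^U) for an unknown word w^U = w_1 ... w_l.

    `pr_pttn(pttn, w)` returns Pr(pttn(w)) for pttn in {"S","B","M","E"}.
    """
    if len(components) == 1:
        return math.log(pr_pttn("S", components[0]))
    score = math.log(pr_pttn("B", components[0]))
    for w in components[1:-1]:
        score += math.log(pr_pttn("M", w))
    return score + math.log(pr_pttn("E", components[-1]))
```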
A two-stage statistical word segmentation system
for Chinese
- A decoder finally incorporates the word juncture model P_WJM(w^U), the word-formation patterns P_pttn(w^U), and the word bigram probability P_bigram(w^U) to score these candidates, and then applies the Viterbi algorithm again to find the best new segmentation W^U = x_1 x_2 ... x_m that has the maximum score (one way to combine the scores is sketched below).
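The slide does not spell out how the three scores are combined; the simplest assumed form is an unweighted sum of logs (a product of the probabilities), which plugs directly into the Viterbi search sketched earlier:

```python
def combined_logscore(log_p_wjm, log_p_pttn, log_p_bigram):
    """Assumed decoder score: log P_WJM + log P_pttn + log P_bigram.
    Any interpolation weights the paper may use are omitted here."""
    return log_p_wjm + log_p_pttn + log_p_bigram
```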
Segmenting Chinese Unknown Words by Heuristic Method (2005)
Segmenting Chinese Unknown Words by Heuristic
Method
- Statistical approach
- Mutual information (MI): a and b are Chinese characters, N is the size of the corpus.
- Significance estimation (SE): c is a string with n characters; a and b are the two overlapping substrings of c with n - 1 characters (see the sketch below).
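Both formulas were lost in transcription. The sketch below uses the standard pointwise mutual information; for SE, the form f(c)/(f(a)+f(b)-f(c)) is an assumption reconstructed from the definitions above.

```python
import math

def mi(f_a, f_b, f_ab, n):
    """Pointwise mutual information of the character pair (a, b):
    log [ P(ab) / (P(a) P(b)) ], with P(x) = f(x)/N over a corpus of size N."""
    return math.log((f_ab / n) / ((f_a / n) * (f_b / n)))

def se(f_c, f_a, f_b):
    """Significance estimation of string c, where a and b are its two
    overlapping (n-1)-character substrings (assumed form, see above)."""
    return f_c / (f_a + f_b - f_c)
```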
Segmenting Chinese Unknown Words by Heuristic
Method
- Heuristic method
- A heuristic method with five rules is proposed to segment Chinese text using the mutual information (MI) and significance estimation (SE) of all bi-grams and tri-grams, respectively, in the corpus.
Statistically-Enhanced New Word Identification in
a Rule-Based Chinese System
- Finding candidates for new words
- For characters, define IWP (independent word probability).
- IWP(s) is the joint probability of the IWP(c) of the component characters.
- If IWP(s) < threshold, s is considered a candidate for a new word (see the sketch below).
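A small sketch of this candidate filter; estimating IWP(c) as the fraction of c's corpus occurrences where it stands alone as a word, and the threshold value, are assumptions for illustration.

```python
def is_new_word_candidate(s, iwp_char, threshold=0.1):
    """IWP(s) = product of IWP(c) over the component characters of s;
    s is a new-word candidate when IWP(s) < threshold (value illustrative)."""
    p = 1.0
    for c in s:
        p *= iwp_char(c)   # IWP(c): P(c occurs as an independent word)
    return p < threshold
```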
Statistically-Enhanced New Word Identification in
a Rule-Based Chinese System
- POS tagging
- Define P(Category, Position, Length) for a character.
- For example, a character may tend to occur in the second position of two-character and three-character verbs.
- P(Cat) for a word is the joint probability of the P(Cat, Pos, Len) of the component characters.
Korean Word Spacing
- <b>: a token that marks the space (word boundary).
- Korean, unlike Chinese, writes spaces between words.
- The space itself is treated as just another character, <b>.
- A trigram model over characters (including <b>) is used.
- Word boundaries can then be recovered from the model.
- e.g. trigram probabilities P(c | a b), where any of a, b, c may be <b>.
- P(S) = P(a_1, a_2, ..., a_N) ≈ P(a_1) P(a_2|a_1) P(a_3|a_1 a_2) ... P(a_N|a_N-2 a_N-1) (scored as sketched below)
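A minimal sketch of scoring a spaced sentence under this model, treating <b> as an ordinary symbol; the trigram probability function is assumed to be given (e.g. estimated from a corpus with smoothing).

```python
import math

def sentence_logprob(symbols, p_tri):
    """log P(S) = log P(a1) + log P(a2|a1) + sum_i log P(ai | a(i-2) a(i-1)).

    `symbols` is the character sequence with "<b>" tokens at the spaces;
    `p_tri(c, context)` returns P(c | context) for contexts of length 0-2.
    """
    logp = 0.0
    for i, c in enumerate(symbols):
        context = tuple(symbols[max(0, i - 2):i])
        logp += math.log(p_tri(c, context))
    return logp

# The spaced sentence "AB CDE" is scored as:
# sentence_logprob(["A", "B", "<b>", "C", "D", "E"], p_tri)
```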
Markov Chain for Sentence Generation
[Figure: sentence generation model for ABCDE; states 0 to 5, with an optional <b> emission at each gap between characters]
- Only left-to-right transitions.
- A space (<b>) may or may not be generated between adjacent characters.
- ABCDE can be spaced as A BCDE, A B CDE, ..., A B C D E: 2^4 = 16 ways in total.
- The most probable of these paths is selected as the spacing result.
[Figure: lattice diagram; rows are the states (the characters and their <b> variants), columns are time steps t = 0, 1, 2, ..., 2N-1]
- For a sentence of N characters a path contains at most 2N-1 symbols, since a <b> may be inserted into each of the N-1 gaps.
- Each path through the lattice corresponds to one candidate spacing.
[Figure: possible transition paths into lattice node (t, s): k1 from row s, k2 from row s-1, k3 from row s-2]
- A penalty term log P(d) is added when a space is emitted, penalizing according to d, the length of the word the space closes.
- L(t, s, 1) = max_k L(t-1, s, k) + log P(<b> | x_t-2,s-1, w_s) + log P(d) (penalty)
- L(t, s, 2) = max_k L(t-1, s-1, k) + log P(w_s | x_t-2,s-1, <b>)
- L(t, s, 3) = max_k L(t-1, s-2, k) + log P(w_s | x_t-2,s-2, w_s-1)
- Decision: (t*, k*) = argmax_k,t L(t, N, k)/t, for N <= t <= 2N-1 (illustrated below by brute force)
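For short sentences the lattice search can be illustrated by brute force: score every one of the 2^(N-1) spacings with the trigram model (sentence_logprob from the earlier sketch) and keep the best, normalizing by path length as in the decision rule. The log P(d) penalty is omitted here; the lattice Viterbi computes the same argmax efficiently.

```python
from itertools import product

def best_spacing(text, p_tri):
    """Brute-force stand-in for the lattice search over all spacings."""
    best, best_score = None, float("-inf")
    for gaps in product([False, True], repeat=len(text) - 1):
        symbols = [text[0]]
        for c, space in zip(text[1:], gaps):
            if space:
                symbols.append("<b>")   # emit a space in this gap
            symbols.append(c)
        # Length-normalized score, mirroring argmax L(t, N, k)/t.
        score = sentence_logprob(symbols, p_tri) / len(symbols)
        if score > best_score:
            best, best_score = symbols, score
    return best
```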
Performance with varying n-gram language models
- Syllable accuracy: SA = (1 - (S + M)/Ns) × 100 (%)
- Word accuracy: WA = C/Nw × 100 (%)
- S: number of space insertion errors, M: number of space deletion errors, C: number of correct words, Ns: total number of syllables, Nw: total number of words (computed in the sketch below).
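The two metrics, transcribed directly into code:

```python
def syllable_accuracy(s_errors, m_errors, n_syllables):
    """SA = (1 - (S + M)/Ns) * 100 (%)."""
    return (1 - (s_errors + m_errors) / n_syllables) * 100

def word_accuracy(c_correct, n_words):
    """WA = C/Nw * 100 (%)."""
    return c_correct / n_words * 100
```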
Discussion
- Go beyond the plain character trigram model.
- Introduce tags (classes).
- Build trigrams over <tag> (or <class>) units, e.g. <tag>_<tag> with a space and <tag><tag> without one.
- Frequently occurring <tag><tag> sequences could themselves be treated as a single <class> (just as <b> is).
- This parallels the techniques used in New Word Identification above.
Spelling Correction
- Simple misspellings (not in the dictionary)
- Context-free spelling correction
- Ex) acress → actress
- Wrong in context (in the dictionary)
- Context-sensitive spelling correction
- Ex) peace of cake → piece of cake
- between them → among them
Spelling correction using probabilistic methods (1984)
- P(Y|X): the probability of generating a garbled word Y from a correct word X.
- Let S(b|a) be the probability that the garbler replaces the symbol a in some input word X by the symbol b.
- P(Y|X, t)
Spelling correction using probabilistic methods
- Step 1: insert t null symbols into the correct word X = x_1 ... x_m, giving U = u_1 ... u_m+t.
- Step 2: replace each symbol u_i in U by o_i with probability S(o_i|u_i); the resulting string is V = v_1 ... v_m+t.
- Step 3: Y = C(V).
- Decision rule: for the given garbled word Y, compute the probabilities P(Y|X) for every X in the dictionary (see the sketch below).
- X* = the value of X in the dictionary which maximizes P(Y|X).
- The total computation needed to compute P(Y|X_i), i = 1, 2, ..., l, is on the order of l (K_0 + K_1 m + K_2 |Y|)
- (m: the average length of the words in the dictionary)
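A sketch of the decision rule, assuming the channel model P(Y|X) is available as a function; the dictionary contents in the usage line are illustrative.

```python
def correct(garbled, dictionary, p_channel):
    """Return the dictionary word X maximizing P(Y|X) for the garbled Y."""
    return max(dictionary, key=lambda x: p_channel(garbled, x))

# e.g. correct("acress", ["actress", "across", "access"], p_channel)
```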
A Spelling Correction Program Based on a Noisy Channel Model (1990)
- Score Pr(c) Pr(t|c), where Pr(c) is estimated by (freq(c) + 1)/N.
- Four matrices for calculating Pr(t|c) (combined in the sketch below):
- (1) del[x, y]: the number of times that the character y was deleted after the character x in the training set
- (2) add[x, y]: the number of times that y was inserted after x
- (3) sub[x, y]: the number of times that y (from the correct word) was typed as x
- (4) rev[x, y]: the number of times that xy was reversed (typed as yx)
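A sketch of turning the four matrices into the per-edit probability Pr(t|c); the normalizing counts (character unigrams and bigrams from the training corpus) follow the common presentation of this model and are an assumption here.

```python
def p_typo_given_correct(edit, del_m, add_m, sub_m, rev_m, count1, count2):
    """Pr(t|c) for a typo t derived from correct word c by a single edit.

    `edit` is one of:
      ("del", x, y): y deleted after x   -> del[x,y] / count2[x+y]
      ("add", x, y): y inserted after x  -> add[x,y] / count1[x]
      ("sub", x, y): y typed as x        -> sub[x,y] / count1[y]
      ("rev", x, y): xy typed as yx      -> rev[x,y] / count2[x+y]
    """
    kind, x, y = edit
    if kind == "del":
        return del_m[(x, y)] / count2[x + y]
    if kind == "add":
        return add_m[(x, y)] / count1[x]
    if kind == "sub":
        return sub_m[(x, y)] / count1[y]
    if kind == "rev":
        return rev_m[(x, y)] / count2[x + y]
    raise ValueError(kind)
```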
A Spelling Correction Program Based on a Noisy
Channel Model
Automatic Rule Acquisition for Spelling
Correction (1997)
- A large class of errors involves misspellings that result in valid words (piece vs. peace, among vs. between).
- Confusion sets: {principal, principle}, {except, accept}, ...
- Transformation-Based Learning
Automatic Rule Acquisition for Spelling Correction
- Change word w1 to w2 if:
- a word W occurs within k words of w1 (co-occurrences; see the sketch after this list)
- e.g. change principle to principal if the word school appears within the proximity window.
- a specific pattern of up to l contiguous words and/or part-of-speech tags occurs around w1 (collocations)
- e.g. change piece to peace if the word world immediately precedes it.
- a specific pattern of noncontiguous words and/or part-of-speech tags occurs around w1 (collocations with wildcards)
- e.g. change except to accept if the word three positions before is "he" and the immediately preceding word is "not".
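A sketch of applying an acquired rule of the first template (co-occurrence within a k-word window); the rule encoding and the example sentence are invented for illustration.

```python
def apply_cooccurrence_rule(tokens, w1, w2, trigger, k):
    """Change w1 to w2 wherever `trigger` occurs within k words of it."""
    out = list(tokens)
    for i, tok in enumerate(tokens):
        if tok == w1:
            window = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
            if trigger in window:
                out[i] = w2
    return out

# Change "principle" to "principal" when "school" is nearby:
print(apply_cooccurrence_rule(
    "the school hired a new principle".split(),
    "principle", "principal", "school", k=5))
# ['the', 'school', 'hired', 'a', 'new', 'principal']
```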