1
Related Works About Text Normalization
2
Contents
  • Segmentation
  • New Word Identification
  • Spelling Correction

3
Segmentation
  • The main issue in word segmentation is how to
    find the correct segmentation among all possible
    ones
  • Chinese strings have no spaces between words

4
A Generalized Word Segmentation Model (1992)
  • c1n = c1, c2, ..., cn (characters)
  • Wi = wi1, wi2, ..., wimi (words)
  • Li = li1, li2, ..., limi (lengths)
  • mi = number of words in the i-th segmentation
  • Tij = tij1, tij2, ..., tijmi (lexical tags)

5
(No Transcript)
6
Parameter Adjustment
  • Adaptive Learning
  • Robustness Enhancement

7
A Chinese word segmentation based on language
situation in processing ambiguous words (2004)
  • The first Chinese character of w is x, and the
    Chinese character sequence after x is y
  • (w = xy)
  • Language situation in the document
  • Language situation in the directory
  • the function of language situation

8
  • Layer 1: W = wordleft, x, y, z, u
  • Layer 2: construct combinatorial character
    sequences from wordleft, x, y, z and u, and set
    them as a1 ... ak, in which a1 is set as
    wordleft + x
  • Layer 3: calculate I(a1), ..., I(ak)
  • Layer 4: if I(a1) is the greatest value, set yi =
    1, denoting that wordleft + x can be segmented
    into a word
  • Repeat until all the characters have been
    processed

9
Improved Source-Channel Models for Chinese Word
Segmentation (2003)
  • Let S be a Chinese sentence; consider all possible
    word segmentations W
  • W* = argmaxW P(W|S) = argmaxW P(W) P(S|W)  (Bayes rule)
  • Define word classes C as follows:
  • entries in a lexicon (lexicon words below)
  • morphologically derived words
  • factoids
  • named entities
  • Rewritten over word class sequences:
    C* = argmaxC P(C) P(S|C)  (Eq. 2)

10
  • Two steps
  • First, given an input string S, all word
    candidates are generated (and stored in a
    lattice). Each candidate is tagged with its word
    class and the class model probability P(S'|C),
    where S' is the corresponding substring of S.
  • Second, Viterbi search is used to select (from
    the lattice) the most probable word segmentation
    (i.e. word class sequence C) according to Eq.
    (2). (A toy sketch follows.)
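The following toy sketch illustrates the two steps with a miniature assumed lexicon and made-up log-probabilities standing in for the paper's class models; it is an illustration of the lattice-plus-Viterbi idea, not the paper's implementation.

LEXICON = {  # assumed log P(S'|C) stand-ins
    "研究": -2.0, "研究生": -4.0, "生命": -2.5, "命": -6.0, "起源": -3.0,
}
OOV_CHAR_LOGP = -10.0  # fallback score for a single unknown character

def build_lattice(s):
    """Step 1: all word candidates as lattice edges (start, end, word, logp)."""
    edges = []
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            w = s[i:j]
            if w in LEXICON:
                edges.append((i, j, w, LEXICON[w]))
            elif j == i + 1:
                edges.append((i, j, w, OOV_CHAR_LOGP))
    return edges

def viterbi_segment(s):
    """Step 2: pick the most probable path through the lattice."""
    edges = build_lattice(s)
    best = {0: (0.0, [])}  # character position -> (best score, best word sequence)
    for i in range(len(s)):
        if i not in best:
            continue
        score, words = best[i]
        for start, end, w, logp in edges:
            if start == i and (end not in best or score + logp > best[end][0]):
                best[end] = (score + logp, words + [w])
    return best[len(s)][1]

print(viterbi_segment("研究生命起源"))  # e.g. ['研究', '生命', '起源']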

11
New Word Identification
  • Done together with segmentation
  • Sequences of single characters are good
    candidates for new words
  • 99% of Chinese words are 1-4 characters long

12
A two-stage statistical word segmentation system
for Chinese (2003)
  • The first stage: segmentation of known words
  • Word bigram language models and the Viterbi
    algorithm are used

13
A two-stage statistical word segmentation system
for Chinese
  • The second stage: unknown word identification
    using a hybrid algorithm
  • Word juncture model
  • Two types of junctures in unknown word
    identification:
  • - word boundary (denoted by tB) and non-word
    boundary (denoted by tN)
  • Word juncture probability PWJM(wU)
  • The larger the probability Pr(tN(wi, wi+1)),
    the more likely the two words are to be merged
    together into one new word

14
A two-stage statistical word segmentation system
for Chinese
  • Word-formation patterns
  • For example, a given character appears at the end
    of a multi-character word in more than 93% of cases
  • A character w can take four patterns when forming
    a word:
  • w itself is a word (S)
  • w is the beginning of an unknown word (B)
  • w is in the middle of an unknown word (M)
  • w appears at the end of an unknown word (E)
  • Let Pr(pttn(w)) denote the corresponding
    probability, pttn ∈ {S, B, M, E}
  • 1 − Pr(S(w)) measures the word-formation power of
    the known word w
  • For an unknown word wU = w1w2...wl, the pattern
    probability Ppttn(wU) is computed from the
    Pr(pttn(wi)) of its component characters

15
A two-stage statistical word segmentation system
for Chinese
  • A decoder finally incorporates the word juncture
    model PWJM(WU), the word-formation patterns
    Ppttn(WU) and the word bigram probability Pbigram(WU)
    to score these candidates, and then applies the
    Viterbi algorithm again to find the best new
    segmentation WU = x1x2...xm that has the maximum
    score (a sketch of such a scorer follows)
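A minimal sketch of how such a decoder might combine the three scores for one unknown-word candidate. The probability tables, the simple log-sum combination, and the example word are assumptions for illustration, not the paper's actual estimates.

import math

# assumed probability tables (would be estimated from a segmented corpus)
P_JUNCTURE_NB = {("电", "脑"): 0.9}               # Pr(tN(wi, wi+1)): non-boundary
P_PATTERN = {("电", "B"): 0.6, ("脑", "E"): 0.7}  # Pr(pttn(w)), pttn in {S, B, M, E}
P_WORD = {"电脑": 1e-5}                           # stand-in for the bigram term

def score_unknown_word(segments):
    """Log-score for merging `segments` into one new word wU."""
    s = 0.0
    for a, b in zip(segments, segments[1:]):          # word juncture model
        s += math.log(P_JUNCTURE_NB.get((a, b), 1e-6))
    patterns = ["B"] + ["M"] * (len(segments) - 2) + ["E"]
    for seg, p in zip(segments, patterns):            # word-formation patterns
        s += math.log(P_PATTERN.get((seg, p), 1e-6))
    s += math.log(P_WORD.get("".join(segments), 1e-8))  # lexical/bigram term
    return s

print(score_unknown_word(["电", "脑"]))  # higher scores -> better merge candidates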

16
Segmenting Chinese Unknown Words by Heuristic
Method (2005)
  • Problem definition

17
Segmenting Chinese Unknown Words by Heuristic
Method
  • Statistical approach
  • Mutual information (MI) (a sketch follows this
    list)
  • a and b are Chinese characters, N is the size of
    the corpus
  • Significance estimation (SE)
  • where c is a string with n characters, and a and b
    are the two overlapping substrings of c with n − 1
    characters
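The MI formula itself appeared only as a figure on the original slide; the sketch below assumes the standard pointwise mutual information computed from corpus-relative frequencies of characters and character bigrams.

import math
from collections import Counter

def char_bigram_mi(corpus):
    """Map each character bigram (a, b) to its (assumed standard) MI score."""
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    n = len(corpus)  # N: size of the corpus
    return {(a, b): math.log2((f / n) / ((unigrams[a] / n) * (unigrams[b] / n)))
            for (a, b), f in bigrams.items()}

scores = char_bigram_mi("ABABABCD")
print(scores[("A", "B")])  # high MI: A and B almost always occur together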

18
Segmenting Chinese Unknown Words by Heuristic
Method
  • Heuristic Method
  • We propose a heuristic method with five rules to
    segment Chinese text using mutual information
    (MI) and significance estimation (SE) of all
    bi-grams and tri-grams, respectively, in the
    corpus.

19
Statistically-Enhanced New Word Identification in
a Rule-Based Chinese System
  • Finding candidates for new words
  • For characters, define IWP (independent word
    probability)
  • IWP(s) for a string is the joint property of the
    IWP(c) of its component characters
  • If IWP(s) < threshold, s is considered a candidate
    for a new word (a sketch follows)
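A small sketch of the IWP test. The toy segmented corpus, the product combination, and the threshold value are assumptions for illustration.

def iwp_char(c, segmented_corpus):
    """IWP(c): fraction of occurrences of character c that form a word by themselves."""
    as_word = sum(1 for sent in segmented_corpus for w in sent if w == c)
    total = sum(w.count(c) for sent in segmented_corpus for w in sent)
    return as_word / total if total else 0.0

def is_new_word_candidate(s, segmented_corpus, threshold=0.1):
    """IWP(s): joint (here: product) of the component characters' IWP."""
    iwp_s = 1.0
    for c in s:
        iwp_s *= iwp_char(c, segmented_corpus)
    return iwp_s < threshold  # low IWP -> likely a new word

corpus = [["我", "喜欢", "自然", "语言"], ["语言", "学", "很", "有趣"]]
print(is_new_word_candidate("语言", corpus))  # True with this toy corpus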

20
Statistically-Enhanced New Word Identification in
a Rule-Based Chinese System
  • POS tagging
  • Define P(Category, Position, Length) for a
    character
  • e.g. a character may tend to occur in the second
    position of two-character and three-character verbs
  • P(Cat) for a word is the joint probability of the
    P(Cat, Pos, Len) values of its component characters
    (a sketch follows)
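A sketch of combining per-character P(Category, Position, Length) values into P(Cat) for a candidate word. The probability table, the category set, and the backoff value are assumed for illustration.

P_CAT_POS_LEN = {  # (char, category, position, word length) -> assumed probability
    ("美", "ADJ", 1, 2): 0.4, ("美", "NOUN", 1, 2): 0.2,
    ("丽", "ADJ", 2, 2): 0.5, ("丽", "NOUN", 2, 2): 0.1,
}

def best_category(word, categories=("NOUN", "VERB", "ADJ"), backoff=1e-4):
    """P(Cat) for the word = product of P(Cat, Pos, Len) over its characters."""
    n = len(word)
    scores = {}
    for cat in categories:
        p = 1.0
        for pos, ch in enumerate(word, start=1):
            p *= P_CAT_POS_LEN.get((ch, cat, pos, n), backoff)
        scores[cat] = p
    return max(scores, key=scores.get)

print(best_category("美丽"))  # -> 'ADJ' under the assumed table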

21
Character Trigram Model with a Space Symbol <b>
  • <b>: the word boundary (space) symbol
  • Unlike Chinese, the input text marks word
    boundaries with spaces
  • The space <b> is treated as an ordinary character,
    and trigrams over characters (including <b>) are
    used
  • Word probabilities can then be obtained from the
    same model
  • Ex) trigram probabilities such as P(c | a, b),
    P(<b> | a, b), P(c | b, <b>)
  • P(S) = P(a1, a2, ..., aN)
    ≈ P(a1) P(a2|a1) P(a3|a1, a2) ... P(aN|aN-2, aN-1)

22
Markov Chain for Sentence Generation
[Figure: sentence generation model for ABCDE, with states 0 to 5 for the characters A, B, C, D, E and optional <b> (space) nodes between them]
  • Only left-to-right transitions
  • A space (<b>) may be inserted between any two
    adjacent characters
  • ABCDE can be spaced as "A BCDE", "A B CDE", ...,
    "ABCDE": 16 possibilities in all
  • The most probable candidate among all possible
    spacings is chosen (a sketch follows this slide)
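A toy sketch of this idea: enumerate every way of inserting <b> into a character string (2^(n-1) candidates for n characters, 16 for ABCDE) and keep the candidate with the best character-trigram score. Exhaustive enumeration is used here only for illustration (the following slides replace it with a lattice search), and the trigram table and back-off value are assumed stand-ins.

import itertools

TRIGRAM_LOGP = {("A", "B", "<b>"): -0.5, ("B", "<b>", "C"): -0.7}  # assumed values
DEFAULT_LOGP = -3.0                                                # crude back-off

def trigram_score(symbols):
    """Sum of log P(a_i | a_{i-2}, a_{i-1}) over the sequence, <b> included."""
    padded = ["<b>", "<b>"] + symbols
    return sum(TRIGRAM_LOGP.get(tuple(padded[i - 2:i + 1]), DEFAULT_LOGP)
               for i in range(2, len(padded)))

def best_spacing(chars):
    best = None
    for gaps in itertools.product([0, 1], repeat=len(chars) - 1):  # 2^(n-1) spacings
        symbols = [chars[0]]
        for ch, g in zip(chars[1:], gaps):
            symbols += ["<b>", ch] if g else [ch]
        cand = (trigram_score(symbols), symbols)
        if best is None or cand[0] > best[0]:
            best = cand
    return best

print(best_spacing("ABCDE"))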

23
Lattice diagram
[Figure: lattice with states s = 1, ..., N on one axis (the characters A, B, ..., E, i.e. A1 = A, A2 = B, ...) and time steps t = 0, 1, 2, ..., 2N-1 on the other]
  • Each transition is conditioned on the previous
    n − 1 symbols (n = 3, i.e. a character trigram)
  • The best path through this lattice gives the most
    probable spacing

24
[Figure: possible transition paths into lattice point (t, s) from (t-1, s), (t-1, s-1) and (t-1, s-2), labeled k = 1, 2, 3; log P(d) is a penalty term based on the word length d]
  • L(t, s, 1) = maxk L(t-1, s, k) + log
    P(<b> | xt-2,s-1, ws) + log P(d) (penalty)
  • L(t, s, 2) = maxk L(t-1, s-1, k) + log
    P(ws | xt-2,s-1, <b>)
  • L(t, s, 3) = maxk L(t-1, s-2, k) + log
    P(ws | xt-2,s-2, ws-1)
  • Decision: (t, k) = arg maxk,t L(t, N, k) / t  (N <
    t < 2N-1)

25
Performance with varying n-gram language models
  • Spacing accuracy: SA = (1 − (S + M) / Ns) × 100 (%)
  • Word accuracy: WA = (C / Nw) × 100 (%)
  • S: number of space insertion errors, M: number of
    space deletion errors, C: number of correct words,
    Ns and Nw: total numbers of spacing units and
    words, respectively
26
Discussion
  • The character trigram model alone
  • Tag (class) information
  • <tag (or class)> trigrams: <tag>_<tag>, <tag><tag>
  • Treating a frequent <tag><tag> sequence as a
    single <class> (cf. the <b> symbol)
  • Relation to New Word Identification (handling of
    new words)

27
Spelling Correction
  • Simple misspellings (not in the dictionary)
  • context-free spelling correction
  • Ex) acress → actress
  • Wrong word in context (in the dictionary)
  • context-sensitive spelling correction
  • Ex) peace of cake → piece of cake
  • between them → among them

28
Spelling correction using probabilistic
methods (1984)
  • P(Y|X), the probability of generating a garbled
    word Y from a correct word X
  • Let S(b|a) be the probability that the garbler
    replaces the symbol a in some input word X by the
    symbol b
  • P(Y|X, t)

29
Spelling correction using probabilistic methods
  • Step 1: t symbols are inserted into the correct
    word X, giving U = u1 ... um+t
  • Step 2: replace each symbol ui in U by oi with
    probability S(oi|ui); the resulting string is
    V = v1 ... vm+t
  • Step 3: Y = C(V)

30
  • Decision rule: for the given garbled word, compute
    the probabilities P(Y|X) for every X in the
    dictionary (sketched below)
  • Xp = the value of X in the dictionary which
    maximizes P(Y|X)
  • The total computation needed to compute P(Y|Xi),
    i = 1, 2, ..., l, is
  • l (K0 + K1 m + K2 |Y|)
  • (m: the average length of the words in the
    dictionary)
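A minimal sketch of the decision rule above. The toy channel model (independent per-symbol substitutions, equal lengths only) is an assumed stand-in for the garbling model P(Y|X); a real channel model would also cover insertions and deletions.

def toy_channel(y, x, keep=0.9, sub=0.01):
    """Assumed toy P(Y|X): per-symbol substitution only, equal lengths required."""
    if len(y) != len(x):
        return 0.0  # insertions/deletions are not modeled in this toy version
    p = 1.0
    for a, b in zip(x, y):
        p *= keep if a == b else sub
    return p

def correct(garbled, dictionary, p_y_given_x=toy_channel):
    """Return the dictionary word X that maximizes P(Y|X) for the garbled word Y."""
    return max(dictionary, key=lambda x: p_y_given_x(garbled, x))

print(correct("absense", ["absence", "absents"]))  # -> 'absence'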

31
A Spelling Correction Program Based on a Noisy
Channel Model (1990)
  • Candidates are scored by Pr(c) Pr(t|c), where Pr(c)
    is estimated by (freq(c) + 1) / N
  • 4 matrices for calculating Pr(t|c) (see the sketch
    after this list):
  • (1) del[x, y], the number of times that the
    character y was deleted after the character x in
    the training set
  • (2) add[x, y], the number of times that y was
    inserted after x
  • (3) sub[x, y], the number of times that y (from
    the correct word) was typed as x
  • (4) rev[x, y], the number of times that xy was
    reversed
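A hedged sketch of scoring one candidate correction c for a typo t as Pr(c) Pr(t|c): Pr(c) uses the (freq(c) + 1) / N estimate from the slide, and Pr(t|c) is read off the relevant confusion matrix. The toy counts, the pre-identified single edit, and the simple normalization by character counts (rather than the paper's exact denominators) are assumptions.

def pr_t_given_c(edit, matrices, char_counts):
    """edit = (kind, x, y), e.g. ('del', 'c', 't') means t was deleted after c."""
    kind, x, y = edit
    count = matrices[kind].get((x, y), 0)         # del / add / sub / rev counts
    return count / max(char_counts.get(x, 1), 1)  # simplified normalization

def score_candidate(c, edit, freq, n, matrices, char_counts):
    pr_c = (freq.get(c, 0) + 1) / n               # Pr(c) = (freq(c) + 1) / N
    return pr_c * pr_t_given_c(edit, matrices, char_counts)

# toy usage: typo "acress" from candidate "actress" by deleting 't' after 'c'
matrices = {"del": {("c", "t"): 25}, "add": {}, "sub": {}, "rev": {}}
print(score_candidate("actress", ("del", "c", "t"),
                      freq={"actress": 1200}, n=44_000_000,
                      matrices=matrices, char_counts={"c": 500_000}))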

32
A Spelling Correction Program Based on a Noisy
Channel Model
33
(No Transcript)
34
Automatic Rule Acquisition for Spelling
Correction (1997)
  • A large class of errors involves misspellings that
    result in valid words (piece vs. peace, among
    vs. between)
  • Confusion sets: {principal, principle},
    {except, accept}, ...
  • Transformation-Based Learning

35
Automatic Rule Acquisition for Spelling Correction
  • Change word w1 to w2 if:
  • the word W occurs within k words of w1
    (co-occurrences; sketched after this list)
  • Ex) Change the word from principle to principal if
    the word school appears within the proximity
    window
  • a specific pattern of up to l contiguous words
    and/or part-of-speech tags occurs around w1
    (collocations)
  • Ex) Change the word from piece to peace if the word
    world immediately precedes it
  • a specific pattern of noncontiguous words and/or
    part-of-speech tags occurs around w1 (collocations
    with wildcards)
  • Ex) Change the word from except to accept if the
    word three positions before it is he and the
    immediately preceding word is not
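A small sketch of applying one learned rule of the first type above. The tokenization, the specific rule, and the window size are illustrative assumptions.

def apply_cooccurrence_rule(tokens, w1, w2, indicator, k):
    """Change w1 to w2 wherever the indicator word occurs within k words of it."""
    out = list(tokens)
    for i, tok in enumerate(tokens):
        if tok == w1:
            window = tokens[max(0, i - k): i] + tokens[i + 1: i + 1 + k]
            if indicator in window:
                out[i] = w2
    return out

sentence = "the principle of the school greeted us".split()
print(apply_cooccurrence_rule(sentence, "principle", "principal", "school", k=4))
# -> ['the', 'principal', 'of', 'the', 'school', 'greeted', 'us']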