Title: Text Sequence Modeling
1. Text Sequence Modeling
2. Introduction
- Textual data comprise sequences of words: "The quick brown fox ..."
- Many tasks can put this sequence information to good use:
  - Part-of-speech tagging
  - Named-entity extraction
  - Text chunking
  - Author identification
3. Part-of-Speech Tagging
- Assign grammatical tags to words
- A basic task in the analysis of natural-language data: phrase identification, entity extraction, etc.
- Ambiguity: "tag" could be a noun (as in "a tag is a part-of-speech label") or a verb; context resolves the ambiguity
4. The Penn Treebank POS Tag Set
5. POS Tagging Process
Berlin Chen
6. POS Tagging Algorithms
- Rule-based taggers: large numbers of hand-crafted rules
- Probabilistic taggers: use a tagged corpus to train some sort of model, e.g. an HMM
[Diagram: tags tag1, tag2, tag3, each generating words word1, word2, word3]
7. The Brown Corpus
- Comprises about 1 million English words
- HMMs were first used for tagging on the Brown Corpus in 1967; somewhat dated now
- The British National Corpus has 100 million words
8. Simple Charniak Model
[Diagram: tags t1, t2, t3, each independently generating words w1, w2, w3]
- Don't need P(wi | wi-1) to find the argmax over tags
- What about words that have never been seen before?
- Clever tricks for smoothing the number of parameters (aka priors)
9. Some details
- Estimate: P(tag i | word j) = (number of times word j appears with tag i) / (number of times word j appears)
- For unseen words: P(tag i | new word) = (number of times a word that had never been seen before gets tag i) / (number of such occurrences in total)
- Test-data accuracy on the Brown Corpus: 91.51%
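To make the estimator concrete, here is a minimal sketch of the count-based tagger described above. The `corpus` format and all names are illustrative assumptions, not the original implementation:

```python
from collections import Counter, defaultdict

def train(corpus):
    """corpus: iterable of (word, tag) pairs from a tagged corpus."""
    word_tag = defaultdict(Counter)  # counts of tag i with word j
    unseen_tag = Counter()           # tags given to first-time words
    seen = set()
    for word, tag in corpus:
        if word not in seen:         # word had never been seen before
            unseen_tag[tag] += 1
            seen.add(word)
        word_tag[word][tag] += 1
    return word_tag, unseen_tag

def tag_word(word, word_tag, unseen_tag):
    # P(tag i | word j) = count(word j, tag i) / count(word j);
    # unknown words fall back to the first-occurrence tag distribution.
    counts = word_tag.get(word) or unseen_tag
    return counts.most_common(1)[0][0]
```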
10. HMM
[Diagram: hidden tag chain t1 → t2 → t3, each tag emitting its word w1, w2, w3]
- Brown test-set accuracy: 95.97%
11. Better Smoothing Model
- Suppose you have seen a particular word w just three or four times before
- Estimates of P(t | w) will be of low quality
- So bin words by how many times they have been seen: 0, 1, 2, 3-4, 5-7, 8-10, 11-20, 21-30, >30
12. Smoothing, cont.
- Achieves 96.02% on the Brown Corpus
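A sketch of the binning step, assuming the bin boundaries from the previous slide; P(t | w) estimates would then be pooled within each bin:

```python
import bisect

BIN_UPPER = [0, 1, 2, 4, 7, 10, 20, 30]  # bins 0,1,2,3-4,5-7,8-10,11-20,21-30,>30

def freq_bin(count):
    """Map a word's training-set count to its smoothing bin index."""
    return bisect.bisect_left(BIN_UPPER, count) if count <= 30 else len(BIN_UPPER)

assert freq_bin(3) == freq_bin(4)    # 3-4 share a bin
assert freq_bin(31) == freq_bin(99)  # everything >30 pools together
```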
13. Morphological Features
- Knowledge that "quickly" ends in -ly should help identify the word as an adverb
- randomizing → -ing
- Split each word into a root ("quick") and a suffix ("-ly")
[Diagram: each tag ti generating a root ri and a suffix si]
14. Morphological Features
- Typical morphological analyzers produce multiple possible splits: Gastroenteritis → ???
- Achieves 96.45% on the Brown Corpus
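A minimal sketch of root/suffix splitting; the suffix inventory here is a toy assumption, and returning several candidate analyses mirrors the multiple-splits point above:

```python
SUFFIXES = ["ness", "ing", "ly", "ed", "er", "s"]  # toy inventory

def splits(word):
    """Return candidate (root, suffix) analyses, no-suffix first."""
    out = [(word, "")]  # the unanalyzed word is always a candidate
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 1:
            out.append((word[: -len(suf)], suf))
    return out

print(splits("quickly"))      # [('quickly', ''), ('quick', 'ly')]
print(splits("randomizing"))  # [('randomizing', ''), ('randomiz', 'ing')]
```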
15. Inference in an HMM
- Compute the probability of a given observation sequence
- Given an observation sequence, compute the most likely hidden state sequence
- Given an observation sequence and a set of possible models, decide which model most closely fits the data
David Meir Blei
16. Viterbi Algorithm
[Trellis diagram: states x1 ... xt-1 leading into state j; observations o1 ... ot-1, ot, ot+1, ..., oT]
$\delta_j(t) = \max_{x_1 \ldots x_{t-1}} P(x_1 \ldots x_{t-1},\ o_1 \ldots o_{t-1},\ x_t = j,\ o_t)$
The state sequence which maximizes the probability of seeing the observations to time t-1, landing in state j, and seeing the observation at time t.
David Meir Blei
17. Viterbi Algorithm
[Trellis diagram: states x1 ... xt-1, xt, xt+1; observations o1 ... oT]
Recursive computation:
$\delta_j(t) = \max_i \, \delta_i(t-1) \, a_{ij} \, b_j(o_t)$
where $a_{ij} = P(x_t = j \mid x_{t-1} = i)$ and $b_j(o_t) = P(o_t \mid x_t = j)$.
David Meir Blei
18. Viterbi Algorithm
[Trellis diagram: states x1 ... xt, ..., xT; observations o1 ... oT]
Compute the most likely state sequence by working backwards:
$x_T^* = \arg\max_j \delta_j(T)$, then follow the stored backpointers $\psi_j(t)$ from $t = T$ down to $t = 1$.
David Meir Blei
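A minimal sketch of the three steps above (initialize, recurse, backtrack). The parameter names and dict layout are assumptions; replacing max with a sum over predecessors turns this into the forward algorithm for the first inference problem on slide 15:

```python
def viterbi(obs, states, init, trans, emit):
    """init[j]=P(x1=j); trans[i][j]=P(xt=j|xt-1=i); emit[j][o]=P(o|x=j)."""
    # delta[j]: probability of the best path ending in state j
    delta = {j: init[j] * emit[j][obs[0]] for j in states}
    psi = []  # backpointers, one dict per step
    for o in obs[1:]:
        prev, delta, back = delta, {}, {}
        for j in states:
            best = max(states, key=lambda i: prev[i] * trans[i][j])
            delta[j] = prev[best] * trans[best][j] * emit[j][o]
            back[j] = best
        psi.append(back)
    # Work backwards from the best final state, following backpointers.
    last = max(states, key=lambda j: delta[j])
    path = [last]
    for back in reversed(psi):
        path.append(back[path[-1]])
    return list(reversed(path)), delta[last]
```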
19. Viterbi Small Example
Model: Pr(x1 = T) = 0.2, Pr(x2 = T | x1 = T) = 0.7, Pr(x2 = T | x1 = F) = 0.1, Pr(o = T | x = T) = 0.4, Pr(o = T | x = F) = 0.9. Observations: o1 = T, o2 = F.
[Diagram: two-step chain x1 → x2 with emissions o1, o2]
Brute force:
Pr(x1=T, x2=T, o1=T, o2=F) = 0.2 × 0.4 × 0.7 × 0.6 = 0.0336
Pr(x1=T, x2=F, o1=T, o2=F) = 0.2 × 0.4 × 0.3 × 0.1 = 0.0024
Pr(x1=F, x2=T, o1=T, o2=F) = 0.8 × 0.9 × 0.1 × 0.6 = 0.0432
Pr(x1=F, x2=F, o1=T, o2=F) = 0.8 × 0.9 × 0.9 × 0.1 = 0.0648
Pr(X1, X2 | o1=T, o2=F) is proportional to Pr(X1, X2, o1=T, o2=F)
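The same numbers, run through the viterbi() sketch from slide 18 (the state names T/F and dict layout are assumptions):

```python
states = ["T", "F"]
init = {"T": 0.2, "F": 0.8}
trans = {"T": {"T": 0.7, "F": 0.3}, "F": {"T": 0.1, "F": 0.9}}
emit = {"T": {"T": 0.4, "F": 0.6}, "F": {"T": 0.9, "F": 0.1}}

path, p = viterbi(["T", "F"], states, init, trans, emit)
print(path, p)  # ['F', 'F'] 0.0648 (up to rounding), matching the brute force
```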
20. Viterbi Small Example
[Diagram: Viterbi trellis for the two-step example above]
21. Recent Developments
- Toutanova et al. (2003) use a dependency network and a richer feature set
- Idea: using the next tag as well as the previous tag should improve tagging performance
22. Example
[Diagram: tags t1, t2, t3 over the words "will to fight"]
- "will" is typically a modal verb, so Pr(t1 | will) will favor MD (rather than NN)
- There is a tag TO that the word "to" always receives, so Pr(t2 = TO | to, t1) will be close to 1 regardless of t1 (TO is rarely preceded by MD)
- So there is no way to discover that t1 should be NN
23. Using classification/regression for data exploration
- Suppose you have thousands of variables and you're not sure about the interactions among those variables
- Build a classification/regression model for each variable, using the rest of the variables as inputs
David Heckerman
24. Example with three variables X, Y, and Z
- Target X, inputs Y, Z
- Target Y, inputs X, Z
- Target Z, inputs X, Y
[Decision tree for Y: split on X; X=1 gives p(y|x1), X=0 splits on Z into p(y|x0,z0) and p(y|x0,z1)]
David Heckerman
25. Summarize the trees with a single graph
- Target X, inputs Y, Z
- Target Y, inputs X, Z
- Target Z, inputs X, Y
[The decision tree for Y from the previous slide, summarized together with the other trees as a directed graph over nodes X, Y, Z]
David Heckerman
26. Dependency Network
- Build a classification/regression model for every variable given the other variables as inputs
- Construct a graph where
  - Nodes correspond to variables
  - There is an arc from X to Y if X helps to predict Y
- The graph along with the individual classification/regression models is a dependency network
- (Heckerman, Chickering, Meek, Rounthwaite, Kadie 2000)
David Heckerman
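A hedged sketch of the recipe: fit one model per variable and add an arc when an input carries predictive weight. Decision trees and the importance threshold are stand-ins of mine for "helps to predict", not the paper's exact criterion:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def dependency_network(data, names, threshold=0.01):
    """data: 2-D array, one column per variable; returns arcs (from, to)."""
    arcs = []
    for j, target in enumerate(names):
        inputs = [i for i in range(len(names)) if i != j]
        model = DecisionTreeClassifier(max_depth=3).fit(data[:, inputs], data[:, j])
        for i, imp in zip(inputs, model.feature_importances_):
            if imp > threshold:  # arc: variable i helps to predict the target
                arcs.append((names[i], target))
    return arcs

# Toy usage: Y is a noisy copy of X, Z is independent noise.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, 500)
y = (x ^ (rng.random(500) < 0.1)).astype(int)
z = rng.integers(0, 2, 500)
print(dependency_network(np.column_stack([x, y, z]), ["X", "Y", "Z"]))
```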
27. Example: TV viewing
- Nielsen data, 2/6/95-2/19/95
- Goal: exploratory data analysis (acausal)
- 400 shows, 3000 viewers
David Heckerman
28. [Dependency network learned from the Nielsen TV data]
David Heckerman
29. A consistent dependency network
- Target X, inputs Y, Z
- Target Y, inputs X, Z
- Target Z, inputs X, Y
[Trees and summary graph over X, Y, Z, as on slide 25]
David Heckerman
30. An inconsistent dependency network
- Target X, inputs Y, Z
- Target Y, inputs X, Z
- Target Z, inputs X, Y
[Trees and summary graph over X, Y, Z; here the local models do not correspond to any single joint distribution]
David Heckerman
31. Dependency Network
- Idea: using the next tag as well as the previous tag should improve tagging performance
- Multiclass logistic regression model for ti given the other tags t-i and the words w
- 460,552 features (current word, previous word, next word, capitalization, prefixes, suffixes, unknown word)
- 96.6% per-word accuracy on the Penn corpus, a 4% error reduction over the previous best; 56.3% whole-sentence correct
32. Modified Viterbi (2nd-Order HMM)
33.
- The dependency network learns a single multiclass logistic regression model for ti given t-i and w, but does not consider the tag sequence in its entirety
- Train the logistic regression using a data set that looks like this:

    ti    ti-1   ti+1   f1   f2    ...   fd
    PER   LOC    PER    1    2.7   ...   0
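A minimal sketch of fitting such a classifier from rows shaped like the table above; the feature names and toy rows are assumptions:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

rows = [  # (features for position i, correct tag ti)
    ({"t_prev": "LOC", "t_next": "PER", "word": "Frank", "cap": 1.0}, "PER"),
    ({"t_prev": "O", "t_next": "O", "word": "in", "cap": 0.0}, "O"),
]
X, y = zip(*rows)
vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)
print(clf.predict(vec.transform(
    [{"t_prev": "LOC", "t_next": "PER", "word": "Frank", "cap": 1.0}])))
```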
34. Named-Entity Classification
- "Mrs. Frank" is a person
- "Steptoe and Johnson" is a company
- "Honduras" is a location
- etc.
- Bikel et al. (1998) from BBN: Nymble, a statistical approach using HMMs
35.
[Diagram: name classes nc1 → nc2 → nc3, each generating word1, word2, word3]
- Name classes: Not-A-Name, Person, Location, etc.
- Smoothing for sparse training data; word features
- Training: 100,000 words from WSJ
- Accuracy: 93%
- 450,000 words → same accuracy
36. Training-development-test
37. Co-Learning
- First extract candidate word sequences and [...]
- A noun phrase comprises a noun (obviously) and any associated modifiers.
38. Spelling predictor
- [...] or [...]
- An appositive is a re-naming or amplification of a word that immediately precedes it.
40. Also Extract a Contextual Predictor
- Head of the modifying appositive, or [...]
41. Or
- Preposition together with the noun it modifies
42. Learning from Unlabeled Examples
- Start with seed rules
- Each rule has a strength of 0.9999
43. Learning Algorithm
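The transcript does not include the algorithm slide itself; what follows is a hedged sketch in the spirit of Collins and Singer's co-training of spelling and context rules, with the precision threshold and data layout as assumptions of mine:

```python
def cotrain(examples, seed_rules, rounds=5, min_precision=0.95):
    """examples: (spelling, context) pairs; seed_rules: feature -> (label, strength)."""
    rules = dict(seed_rules)
    for _ in range(rounds):
        for view in (0, 1):  # 0: label via spelling rules, 1: via context rules
            labeled = [(ex, rules[ex[view]][0])
                       for ex in examples if ex[view] in rules]
            # Induce new rules on the opposite view from the induced labels.
            counts = {}
            for ex, label in labeled:
                feat = ex[1 - view]
                counts.setdefault(feat, {}).setdefault(label, 0)
                counts[feat][label] += 1
            for feat, by_label in counts.items():
                label, n = max(by_label.items(), key=lambda kv: kv[1])
                precision = n / sum(by_label.values())
                if precision >= min_precision:  # keep only high-precision rules
                    rules.setdefault(feat, (label, precision))
    return rules
```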
45. Evaluation
- Evaluated on 1,000 hand-labeled extracted sequences
- Accuracy: 83.3% overall, 91.3% if confined to Location/Organization/Person examples
46. Conditional Random Fields
- The dependency network learns a single multiclass logistic regression model for ti given t-i and w, but does not consider the tag sequence in its entirety
- CRFs optimize model parameters with respect to the entire sequence
- More expensive optimization; increased flexibility and accuracy

    ti    ti-1   ti+1   f1   f2    ...   fd
    PER   LOC    PER    1    2.7   ...   0
47. From Logistic Regression to CRF
- Logistic regression: $p(y \mid x) = \frac{1}{Z(x)} \exp(\lambda_y \cdot x)$
- Or, in feature-function form: $p(y \mid x) = \frac{1}{Z(x)} \exp\big(\sum_k \lambda_k f_k(y, x)\big)$, where the label y is a scalar
- Linear-chain CRF: $p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\big(\sum_t \sum_k \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t)\big)$, where the tag sequence y is a vector
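A small numeric sketch of the two quantities that differ between the models: the sequence score and the partition function Z(x), computed by the forward recursion. The dense array layout is an assumption; real CRFs use sparse indicator features:

```python
import numpy as np

def score(y, U, W):
    """U[t, tag]: observation score at position t; W[a, b]: transition a->b."""
    s = U[0, y[0]]
    for t in range(1, len(y)):
        s += W[y[t - 1], y[t]] + U[t, y[t]]
    return s

def log_Z(U, W):
    """log of the partition function Z(x), summing over all tag sequences."""
    alpha = U[0].copy()
    for t in range(1, U.shape[0]):
        alpha = U[t] + np.logaddexp.reduce(alpha[:, None] + W, axis=0)
    return np.logaddexp.reduce(alpha)

# log p(y | x) = score(y, U, W) - log_Z(U, W)
```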
48. CRF Parameter Estimation
- Maximize the conditional log likelihood
- Or the regularized log likelihood
- Optimize with conjugate gradient, BFGS, etc.
- POS tagging, 45 tags, 10^6 words: about 1 week
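For reference, the standard forms of the two objectives named above, following the Sutton and McCallum tutorial cited on the next slide (the Gaussian prior with variance $\sigma^2$ is the usual regularizer, assumed here):

```latex
% Conditional log likelihood and its regularized version
\ell(\lambda) = \sum_i \log p\!\left(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}; \lambda\right)
\qquad
\ell_{\mathrm{reg}}(\lambda) = \ell(\lambda) - \sum_k \frac{\lambda_k^2}{2\sigma^2}
```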
49. Sutton and McCallum (2006)
50. Skip-Chain CRFs
51. Bock's Results: POS Tagging
- Penn Treebank
52. Bock's Results: Named Entity
- CoNLL-03 task