Title: Text Sequence Modeling
1. Text Sequence Modeling
2. Introduction
- Textual data comprise sequences of words: "The quick brown fox ..."
- Many tasks can put this sequence information to good use:
  - Part-of-speech tagging
  - Named-entity extraction
  - Text chunking
  - Author identification
3. Part-of-Speech Tagging
- Assign grammatical tags to words
- A basic task in the analysis of natural-language data: phrase identification, entity extraction, etc.
- Ambiguity: "tag" could be a noun (as in "a tag is a part-of-speech label") or a verb; context resolves the ambiguity
4. The Penn Treebank POS Tag Set
5. POS Tagging Process
Berlin Chen
6. POS Tagging Algorithms
- Rule-based taggers: large numbers of hand-crafted rules
- Probabilistic taggers: use a tagged corpus to train some sort of model, e.g. an HMM
[Diagram: tags tag1, tag2, tag3, each generating words word1, word2, word3]
7. The Brown Corpus
- Comprises about 1 million English words
- HMMs were first used for tagging on the Brown Corpus in 1967; somewhat dated now
- The British National Corpus has 100 million words
8. Simple Charniak Model
[Diagram: tags t1, t2, t3, each independently generating words w1, w2, w3]
- Don't need P(wi | wi-1) to find the argmax over tags
- What about words that have never been seen before?
- Clever tricks for smoothing the number of parameters (aka priors)
9. Some details
- Estimate: P(tag i | word j) = (number of times word j appears with tag i) / (number of times word j appears)
- For unseen words: P(tag i | new word) = (number of times a word that had never been seen before gets tag i) / (number of such occurrences in total)
- Test-data accuracy on the Brown Corpus: 91.51%
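To make the estimator concrete, here is a minimal sketch of the count-based tagger described above. The `corpus` format and all names are illustrative assumptions, not the original implementation:

```python
from collections import Counter, defaultdict

def train(corpus):
    """corpus: iterable of (word, tag) pairs from a tagged corpus."""
    word_tag = defaultdict(Counter)  # counts of tag i with word j
    unseen_tag = Counter()           # tags given to first-time words
    seen = set()
    for word, tag in corpus:
        if word not in seen:         # word had never been seen before
            unseen_tag[tag] += 1
            seen.add(word)
        word_tag[word][tag] += 1
    return word_tag, unseen_tag

def tag_word(word, word_tag, unseen_tag):
    # P(tag i | word j) = count(word j, tag i) / count(word j);
    # unknown words fall back to the first-occurrence tag distribution.
    counts = word_tag.get(word) or unseen_tag
    return counts.most_common(1)[0][0]
```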
10. HMM
[Diagram: hidden tag chain t1 → t2 → t3, each tag emitting its word w1, w2, w3]
- Brown test-set accuracy: 95.97%
11. Better Smoothing Model
- Suppose you have seen a particular word w just three or four times before
- Estimates of P(t | w) will be of low quality
- So bin words by how many times they have been seen: 0, 1, 2, 3-4, 5-7, 8-10, 11-20, 21-30, >30
12. Smoothing, cont.
- Achieves 96.02% on the Brown Corpus
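A sketch of the binning step, assuming the bin boundaries from the previous slide; P(t | w) estimates would then be pooled within each bin:

```python
import bisect

BIN_UPPER = [0, 1, 2, 4, 7, 10, 20, 30]  # bins 0,1,2,3-4,5-7,8-10,11-20,21-30,>30

def freq_bin(count):
    """Map a word's training-set count to its smoothing bin index."""
    return bisect.bisect_left(BIN_UPPER, count) if count <= 30 else len(BIN_UPPER)

assert freq_bin(3) == freq_bin(4)    # 3-4 share a bin
assert freq_bin(31) == freq_bin(99)  # everything >30 pools together
```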
13. Morphological Features
- Knowledge that "quickly" ends in -ly should help identify the word as an adverb
- randomizing → -ing
- Split each word into a root ("quick") and a suffix ("-ly")
[Diagram: each tag ti generating a root ri and a suffix si]
14. Morphological Features
- Typical morphological analyzers produce multiple possible splits: Gastroenteritis → ???
- Achieves 96.45% on the Brown Corpus
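A minimal sketch of root/suffix splitting; the suffix inventory here is a toy assumption, and returning several candidate analyses mirrors the multiple-splits point above:

```python
SUFFIXES = ["ness", "ing", "ly", "ed", "er", "s"]  # toy inventory

def splits(word):
    """Return candidate (root, suffix) analyses, no-suffix first."""
    out = [(word, "")]  # the unanalyzed word is always a candidate
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 1:
            out.append((word[: -len(suf)], suf))
    return out

print(splits("quickly"))      # [('quickly', ''), ('quick', 'ly')]
print(splits("randomizing"))  # [('randomizing', ''), ('randomiz', 'ing')]
```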
15. Inference in an HMM
- Compute the probability of a given observation sequence
- Given an observation sequence, compute the most likely hidden state sequence
- Given an observation sequence and a set of possible models, decide which model most closely fits the data
David Meir Blei
16. Viterbi Algorithm
[Trellis diagram: states x1 ... xt-1 leading into state j; observations o1 ... ot-1, ot, ot+1, ..., oT]
$\delta_j(t) = \max_{x_1 \ldots x_{t-1}} P(x_1 \ldots x_{t-1},\ o_1 \ldots o_{t-1},\ x_t = j,\ o_t)$
The state sequence which maximizes the probability of seeing the observations to time t-1, landing in state j, and seeing the observation at time t.
David Meir Blei
17. Viterbi Algorithm
[Trellis diagram: states x1 ... xt-1, xt, xt+1; observations o1 ... oT]
Recursive computation:
$\delta_j(t) = \max_i \, \delta_i(t-1) \, a_{ij} \, b_j(o_t)$
where $a_{ij} = P(x_t = j \mid x_{t-1} = i)$ and $b_j(o_t) = P(o_t \mid x_t = j)$.
David Meir Blei
18. Viterbi Algorithm
[Trellis diagram: states x1 ... xt, ..., xT; observations o1 ... oT]
Compute the most likely state sequence by working backwards:
$x_T^* = \arg\max_j \delta_j(T)$, then follow the stored backpointers $\psi_j(t)$ from $t = T$ down to $t = 1$.
David Meir Blei
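A minimal sketch of the three steps above (initialize, recurse, backtrack). The parameter names and dict layout are assumptions; replacing max with a sum over predecessors turns this into the forward algorithm for the first inference problem on slide 15:

```python
def viterbi(obs, states, init, trans, emit):
    """init[j]=P(x1=j); trans[i][j]=P(xt=j|xt-1=i); emit[j][o]=P(o|x=j)."""
    # delta[j]: probability of the best path ending in state j
    delta = {j: init[j] * emit[j][obs[0]] for j in states}
    psi = []  # backpointers, one dict per step
    for o in obs[1:]:
        prev, delta, back = delta, {}, {}
        for j in states:
            best = max(states, key=lambda i: prev[i] * trans[i][j])
            delta[j] = prev[best] * trans[best][j] * emit[j][o]
            back[j] = best
        psi.append(back)
    # Work backwards from the best final state, following backpointers.
    last = max(states, key=lambda j: delta[j])
    path = [last]
    for back in reversed(psi):
        path.append(back[path[-1]])
    return list(reversed(path)), delta[last]
```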
19. Viterbi Small Example
Model: Pr(x1 = T) = 0.2, Pr(x2 = T | x1 = T) = 0.7, Pr(x2 = T | x1 = F) = 0.1, Pr(o = T | x = T) = 0.4, Pr(o = T | x = F) = 0.9. Observations: o1 = T, o2 = F.
[Diagram: two-step chain x1 → x2 with emissions o1, o2]
Brute force:
Pr(x1=T, x2=T, o1=T, o2=F) = 0.2 × 0.4 × 0.7 × 0.6 = 0.0336
Pr(x1=T, x2=F, o1=T, o2=F) = 0.2 × 0.4 × 0.3 × 0.1 = 0.0024
Pr(x1=F, x2=T, o1=T, o2=F) = 0.8 × 0.9 × 0.1 × 0.6 = 0.0432
Pr(x1=F, x2=F, o1=T, o2=F) = 0.8 × 0.9 × 0.9 × 0.1 = 0.0648
Pr(X1, X2 | o1=T, o2=F) is proportional to Pr(X1, X2, o1=T, o2=F)
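The same numbers, run through the viterbi() sketch from slide 18 (the state names T/F and dict layout are assumptions):

```python
states = ["T", "F"]
init = {"T": 0.2, "F": 0.8}
trans = {"T": {"T": 0.7, "F": 0.3}, "F": {"T": 0.1, "F": 0.9}}
emit = {"T": {"T": 0.4, "F": 0.6}, "F": {"T": 0.9, "F": 0.1}}

path, p = viterbi(["T", "F"], states, init, trans, emit)
print(path, p)  # ['F', 'F'] 0.0648 (up to rounding), matching the brute force
```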
20. Viterbi Small Example
[Diagram: Viterbi trellis for the two-step example above]
21. Recent Developments
- Toutanova et al. (2003) use a dependency network and a richer feature set
- Idea: using the next tag as well as the previous tag should improve tagging performance
22. Example
[Diagram: tags t1, t2, t3 over the words "will to fight"]
- "will" is typically a modal verb, so Pr(t1 | will) will favor MD (rather than NN)
- There is a tag TO that the word "to" always receives, so Pr(t2 = TO | to, t1) will be close to 1 regardless of t1 (TO is rarely preceded by MD)
- So there is no way to discover that t1 should be NN
23. Using classification/regression for data exploration
- Suppose you have thousands of variables and you're not sure about the interactions among those variables
- Build a classification/regression model for each variable, using the rest of the variables as inputs
David Heckerman
24. Example with three variables X, Y, and Z
- Target X, inputs Y, Z
- Target Y, inputs X, Z
- Target Z, inputs X, Y
[Decision tree for Y: split on X; X=1 gives p(y|x1), X=0 splits on Z into p(y|x0,z0) and p(y|x0,z1)]
David Heckerman
25. Summarize the trees with a single graph
- Target X, inputs Y, Z
- Target Y, inputs X, Z
- Target Z, inputs X, Y
[The decision tree for Y from the previous slide, summarized together with the other trees as a directed graph over nodes X, Y, Z]
David Heckerman
26. Dependency Network
- Build a classification/regression model for every variable given the other variables as inputs
- Construct a graph where
  - Nodes correspond to variables
  - There is an arc from X to Y if X helps to predict Y
- The graph along with the individual classification/regression models is a dependency network
- (Heckerman, Chickering, Meek, Rounthwaite, Kadie 2000)
David Heckerman
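A hedged sketch of the recipe: fit one model per variable and add an arc when an input carries predictive weight. Decision trees and the importance threshold are stand-ins of mine for "helps to predict", not the paper's exact criterion:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def dependency_network(data, names, threshold=0.01):
    """data: 2-D array, one column per variable; returns arcs (from, to)."""
    arcs = []
    for j, target in enumerate(names):
        inputs = [i for i in range(len(names)) if i != j]
        model = DecisionTreeClassifier(max_depth=3).fit(data[:, inputs], data[:, j])
        for i, imp in zip(inputs, model.feature_importances_):
            if imp > threshold:  # arc: variable i helps to predict the target
                arcs.append((names[i], target))
    return arcs

# Toy usage: Y is a noisy copy of X, Z is independent noise.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, 500)
y = (x ^ (rng.random(500) < 0.1)).astype(int)
z = rng.integers(0, 2, 500)
print(dependency_network(np.column_stack([x, y, z]), ["X", "Y", "Z"]))
```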
27. Example: TV viewing
- Nielsen data, 2/6/95-2/19/95
- Goal: exploratory data analysis (acausal)
- 400 shows, 3000 viewers
David Heckerman
28. [Dependency network learned from the Nielsen TV data]
David Heckerman
29. A consistent dependency network
- Target X, inputs Y, Z
- Target Y, inputs X, Z
- Target Z, inputs X, Y
[Trees and summary graph over X, Y, Z, as on slide 25]
David Heckerman
30. An inconsistent dependency network
- Target X, inputs Y, Z
- Target Y, inputs X, Z
- Target Z, inputs X, Y
[Trees and summary graph over X, Y, Z; here the local models do not correspond to any single joint distribution]
David Heckerman
31. Dependency Network
- Idea: using the next tag as well as the previous tag should improve tagging performance
- Multiclass logistic regression model for ti given the other tags t-i and the words w
- 460,552 features (current word, previous word, next word, capitalization, prefixes, suffixes, unknown word)
- 96.6% per-word accuracy on the Penn corpus, a 4% error reduction over the previous best; 56.3% whole-sentence correct
32. Modified Viterbi (2nd-Order HMM)
33.
- The dependency network learns a single multiclass logistic regression model for ti given t-i and w, but does not consider the tag sequence in its entirety
- Train the logistic regression using a data set that looks like this:

    ti    ti-1   ti+1   f1   f2    ...   fd
    PER   LOC    PER    1    2.7   ...   0
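A minimal sketch of fitting such a classifier from rows shaped like the table above; the feature names and toy rows are assumptions:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

rows = [  # (features for position i, correct tag ti)
    ({"t_prev": "LOC", "t_next": "PER", "word": "Frank", "cap": 1.0}, "PER"),
    ({"t_prev": "O", "t_next": "O", "word": "in", "cap": 0.0}, "O"),
]
X, y = zip(*rows)
vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)
print(clf.predict(vec.transform(
    [{"t_prev": "LOC", "t_next": "PER", "word": "Frank", "cap": 1.0}])))
```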
34. Named-Entity Classification
- "Mrs. Frank" is a person
- "Steptoe and Johnson" is a company
- "Honduras" is a location
- etc.
- Bikel et al. (1998) from BBN: Nymble, a statistical approach using HMMs
35.
[Diagram: name classes nc1 → nc2 → nc3, each generating word1, word2, word3]
- Name classes: Not-A-Name, Person, Location, etc.
- Smoothing for sparse training data; word features
- Training: 100,000 words from WSJ
- Accuracy: 93%
- 450,000 words → same accuracy
36. Training-development-test
37. Co-Learning
- First extract candidate word sequences and [...]
- A noun phrase comprises a noun (obviously) and any associated modifiers.
38. Spelling predictor
- [...] or [...]
- An appositive is a re-naming or amplification of a word that immediately precedes it.
40. Also Extract a Contextual Predictor
- Head of the modifying appositive, or [...]
41. Or
- Preposition together with the noun it modifies
42. Learning from Unlabeled Examples
- Start with seed rules
- Each rule has a strength of 0.9999
43. Learning Algorithm
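The transcript does not include the algorithm slide itself; what follows is a hedged sketch in the spirit of Collins and Singer's co-training of spelling and context rules, with the precision threshold and data layout as assumptions of mine:

```python
def cotrain(examples, seed_rules, rounds=5, min_precision=0.95):
    """examples: (spelling, context) pairs; seed_rules: feature -> (label, strength)."""
    rules = dict(seed_rules)
    for _ in range(rounds):
        for view in (0, 1):  # 0: label via spelling rules, 1: via context rules
            labeled = [(ex, rules[ex[view]][0])
                       for ex in examples if ex[view] in rules]
            # Induce new rules on the opposite view from the induced labels.
            counts = {}
            for ex, label in labeled:
                feat = ex[1 - view]
                counts.setdefault(feat, {}).setdefault(label, 0)
                counts[feat][label] += 1
            for feat, by_label in counts.items():
                label, n = max(by_label.items(), key=lambda kv: kv[1])
                precision = n / sum(by_label.values())
                if precision >= min_precision:  # keep only high-precision rules
                    rules.setdefault(feat, (label, precision))
    return rules
```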
45. Evaluation
- Evaluated on 1,000 hand-labeled extracted sequences
- Accuracy: 83.3% overall, 91.3% if confined to Location/Organization/Person examples
46. Conditional Random Fields
- The dependency network learns a single multiclass logistic regression model for ti given t-i and w, but does not consider the tag sequence in its entirety
- CRFs optimize model parameters with respect to the entire sequence
- More expensive optimization; increased flexibility and accuracy

    ti    ti-1   ti+1   f1   f2    ...   fd
    PER   LOC    PER    1    2.7   ...   0
47. From Logistic Regression to CRF
- Logistic regression: $p(y \mid x) = \frac{1}{Z(x)} \exp(\lambda_y \cdot x)$
- Or, in feature-function form: $p(y \mid x) = \frac{1}{Z(x)} \exp\big(\sum_k \lambda_k f_k(y, x)\big)$, where the label y is a scalar
- Linear-chain CRF: $p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\big(\sum_t \sum_k \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t)\big)$, where the tag sequence y is a vector
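A small numeric sketch of the two quantities that differ between the models: the sequence score and the partition function Z(x), computed by the forward recursion. The dense array layout is an assumption; real CRFs use sparse indicator features:

```python
import numpy as np

def score(y, U, W):
    """U[t, tag]: observation score at position t; W[a, b]: transition a->b."""
    s = U[0, y[0]]
    for t in range(1, len(y)):
        s += W[y[t - 1], y[t]] + U[t, y[t]]
    return s

def log_Z(U, W):
    """log of the partition function Z(x), summing over all tag sequences."""
    alpha = U[0].copy()
    for t in range(1, U.shape[0]):
        alpha = U[t] + np.logaddexp.reduce(alpha[:, None] + W, axis=0)
    return np.logaddexp.reduce(alpha)

# log p(y | x) = score(y, U, W) - log_Z(U, W)
```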
48. CRF Parameter Estimation
- Maximize the conditional log likelihood
- Or the regularized log likelihood
- Optimize with conjugate gradient, BFGS, etc.
- POS tagging, 45 tags, 10^6 words: about 1 week
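For reference, the standard forms of the two objectives named above, following the Sutton and McCallum tutorial cited on the next slide (the Gaussian prior with variance $\sigma^2$ is the usual regularizer, assumed here):

```latex
% Conditional log likelihood and its regularized version
\ell(\lambda) = \sum_i \log p\!\left(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}; \lambda\right)
\qquad
\ell_{\mathrm{reg}}(\lambda) = \ell(\lambda) - \sum_k \frac{\lambda_k^2}{2\sigma^2}
```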
49. Sutton and McCallum (2006)
50. Skip-Chain CRFs
51. Bock's Results: POS Tagging
- Penn Treebank
52. Bock's Results: Named Entity
- CoNLL-03 task