Transcript and Presenter's Notes

Title: Text Sequence Modeling


1
Text Sequence Modeling
2
Introduction
  • Textual data comprise sequences of words: "The quick brown fox ..."
  • Many tasks can put this sequence information to
    good use
  • Part of speech tagging
  • Named entity extraction
  • Text chunking
  • Author identification

3
Part-of-Speech Tagging
  • Assign grammatical tags to words
  • Basic task in the analysis of natural language
    data
  • Phrase identification, entity extraction, etc.
  • Ambiguity: "tag" could be a noun or a verb
  • Here "a tag" is a part-of-speech label (a noun); context resolves the
    ambiguity

4
The Penn Treebank POS Tag Set
5
POS Tagging Process
Berlin Chen
6
POS Tagging Algorithms
  • Rule-based taggers: large numbers of hand-crafted rules
  • Probabilistic taggers: use a tagged corpus to train some sort of model,
    e.g., an HMM

[Diagram: tag sequence tag1, tag2, tag3 paired with words word1, word2, word3]
7
The Brown Corpus
  • Comprises about 1 million English words
  • HMMs were first used for tagging on the Brown Corpus
  • Dates from 1967, so it is somewhat dated now
  • The British National Corpus has 100 million words

8
Simple Charniak Model
[Diagram: word chain w1 -> w2 -> w3, with each word wi generating its tag ti]
  • Don't need P(w_i | w_{i-1}) to find the argmax over tags (see the
    factorization sketched below)
  • What about words that have never been seen before?
  • Clever tricks for smoothing the number of parameters (a.k.a. priors)
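A plausible reading of the diagram above, written out in standard notation (my reconstruction, not a formula from the slides):

  P(w_{1:n}, t_{1:n}) = \prod_i P(w_i | w_{i-1}) P(t_i | w_i)

so that

  argmax_{t_{1:n}} P(t_{1:n} | w_{1:n}) = argmax_{t_{1:n}} \prod_i P(t_i | w_i)

since the P(w_i | w_{i-1}) factors do not involve the tags and drop out of the argmax.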

9
Some Details
P(tag i | word j) is estimated as
  (number of times word j appears with tag i) / (number of times word j appears)
For previously unseen words, P(tag i | unseen word) is estimated as
  (number of times a word that had never been seen before gets tag i) /
  (total number of such occurrences)
Test data accuracy on the Brown Corpus: 91.51%
10
HMM
[Diagram: tag chain t1 -> t2 -> t3, with each tag ti generating its word wi]
  • Brown test set accuracy: 95.97%
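For reference, the first-order HMM that this diagram depicts factors as (standard notation, not transcribed from the slide):

  P(w_{1:n}, t_{1:n}) = \prod_i P(t_i | t_{i-1}) P(w_i | t_i)

and tagging chooses \hat{t}_{1:n} = argmax_{t_{1:n}} \prod_i P(t_i | t_{i-1}) P(w_i | t_i), computed with the Viterbi algorithm described later.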

11
Better Smoothing Model
  • Suppose you have seen a particular word w just
    three or four times before
  • Estimates of P(t | w) will be of low quality

Bin words by how many times they have been seen: 0, 1, 2, 3-4, 5-7, 8-10, 11-20, 21-30, >30
12
Smoothing, cont.
  • Achieves 96.02% on the Brown Corpus

13
Morphological Features
  • Knowledge that "quickly" ends in "-ly" should help identify the word as
    an adverb
  • randomizing -> -ing
  • Split each word into a root ("quick") and a suffix ("-ly")

[Diagram: each tag ti generates a root ri and a suffix si]
14
Morphological Features
  • Typical morphological analyzers produce multiple
    possible splits
  • "Gastroenteritis" -> ???
  • Achieves 96.45% on the Brown Corpus

15
Inference in an HMM
  • Compute the probability of a given observation
    sequence
  • Given an observation sequence, compute the most
    likely hidden state sequence
  • Given an observation sequence and set of possible
    models, which model most closely fits the data?

David Meir Blei
16
Viterbi Algorithm
[Trellis diagram: states x1, ..., x_{t-1}, ending in state j, over
observations o1, ..., o_{t-1}, o_t, o_{t+1}, ..., oT]
Track, for each state j and time t, the state sequence which maximizes the
probability of seeing the observations to time t-1, landing in state j, and
seeing the observation at time t
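In the usual notation (notation added here; the slide gives only the verbal definition above):

  \delta_t(j) = \max_{x_1, ..., x_{t-1}} P(x_1, ..., x_{t-1}, o_1, ..., o_{t-1}, x_t = j, o_t)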
David Meir Blei
17
Viterbi Algorithm
[Trellis diagram: states x1, ..., x_{t-1}, x_t, x_{t+1} over observations
o1, ..., o_{t-1}, o_t, o_{t+1}, ..., oT]
Recursive computation (see below)
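With transition probabilities a_{ij} = P(x_{t+1} = j | x_t = i) and emission probabilities b_j(o) = P(o | x = j) (notation assumed; the slide's formula was an image), the recursion takes the standard form:

  \delta_{t+1}(j) = \max_i \delta_t(i) a_{ij} b_j(o_{t+1}),    \psi_{t+1}(j) = argmax_i \delta_t(i) a_{ij}

where \psi stores the back-pointers used to recover the best path.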
David Meir Blei
18
Viterbi Algorithm
[Trellis diagram: full state sequence x1, ..., x_{t-1}, x_t, x_{t+1}, ..., xT
over observations o1, ..., oT]
Compute the most likely state sequence by working backwards (see below)
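In the same assumed notation, the backward pass is:

  \hat{x}_T = argmax_j \delta_T(j),    \hat{x}_t = \psi_{t+1}(\hat{x}_{t+1}) for t = T-1, ..., 1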
David Meir Blei
19
Viterbi Small Example
Pr(x1 = T) = 0.2, Pr(x2 = T | x1 = T) = 0.7, Pr(x2 = T | x1 = F) = 0.1
Pr(o = T | x = T) = 0.4, Pr(o = T | x = F) = 0.9
Observations: o1 = T, o2 = F
[Diagram: two-state chain x1 -> x2 with observations o1, o2]
Brute Force
Pr(x1=T, x2=T, o1=T, o2=F) = 0.2 x 0.4 x 0.7 x 0.6 = 0.0336
Pr(x1=T, x2=F, o1=T, o2=F) = 0.2 x 0.4 x 0.3 x 0.1 = 0.0024
Pr(x1=F, x2=T, o1=T, o2=F) = 0.8 x 0.9 x 0.1 x 0.6 = 0.0432
Pr(x1=F, x2=F, o1=T, o2=F) = 0.8 x 0.9 x 0.9 x 0.1 = 0.0648
Pr(X1, X2 | o1=T, o2=F) is proportional to Pr(X1, X2, o1=T, o2=F)
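A minimal Python sketch of Viterbi on this two-step example (illustrative code, not from the slides); it recovers the same maximum as the brute-force table, 0.0648 for the sequence (F, F):

  # Viterbi on the two-step example above; states are True/False.
  prior = {True: 0.2, False: 0.8}                  # Pr(x1)
  trans = {True: {True: 0.7, False: 0.3},          # Pr(x2 | x1)
           False: {True: 0.1, False: 0.9}}
  emit = {True: {True: 0.4, False: 0.6},           # Pr(o | x)
          False: {True: 0.9, False: 0.1}}
  obs = [True, False]                              # o1 = T, o2 = F

  # delta[s]: best joint probability of any state path ending in state s
  delta = {s: prior[s] * emit[s][obs[0]] for s in (True, False)}
  backptrs = []
  for o in obs[1:]:
      prev, delta, bp = delta, {}, {}
      for s in (True, False):
          best = max((True, False), key=lambda r: prev[r] * trans[r][s])
          delta[s] = prev[best] * trans[best][s] * emit[s][o]
          bp[s] = best
      backptrs.append(bp)

  # Backtrace the most likely state sequence.
  last = max(delta, key=delta.get)
  path = [last]
  for bp in reversed(backptrs):
      path.append(bp[path[-1]])
  path.reverse()
  print(path, delta[last])   # [False, False] 0.0648 (up to floating point)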
20
Viterbi Small Example
[Image: the Viterbi computation on the two-state chain x1 -> x2 with
observations o1, o2]
21
Recent Developments
  • Toutanova et al., 2003, use a dependency
    network and richer feature set
  • Idea: using the next tag as well as the previous tag should improve
    tagging performance

22
Example
[Diagram: tags t1, t2, t3 over the words "will to fight"
(o1 = will, o2 = to, o3 = fight)]
  • "will" is typically a modal verb, so Pr(t1 | will) will favor MD (rather
    than NN)
  • There is a tag TO that the word "to" always receives, so
    Pr(t2 = TO | to, t1) will be close to 1 regardless of t1 (TO is rarely
    preceded by MD)
  • No way to discover that t1 should be NN

23
Using classification/regression for data
exploration
  • Suppose you have thousands of variables and you're not sure about the
    interactions among those variables
  • Build a classification/regression model for each
    variable, using the rest of the variables as
    inputs

David Heckerman
24
Example with three variables X, Y, and Z
Target: X; Inputs: Y, Z
Target: Y; Inputs: X, Z
Target: Z; Inputs: X, Y
[Decision tree for Y: split on X; if X = 1, predict p(y | x1); if X = 0,
split on Z and predict p(y | x0, z0) or p(y | x0, z1)]
David Heckerman
25
Summarize the trees with a single graph
Target: X; Inputs: Y, Z
Target: Y; Inputs: X, Z
Target: Z; Inputs: X, Y
[The same per-variable trees as on the previous slide, summarized as a
single graph with nodes X, Y, Z]
David Heckerman
26
Dependency Network
  • Build a classification/regression model for every
    variable given the other variables as inputs
  • Construct a graph where
  • Nodes correspond to variables
  • There is an arc from X to Y if X helps to predict
    Y
  • The graph along with the individual classification/regression models is
    a "dependency network" (see the sketch below)
  • (Heckerman, Chickering, Meek, Rounthwaite, and Kadie, 2000)
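A minimal sketch of this construction, assuming binary data in a NumPy array and using scikit-learn's LogisticRegression as the per-variable model (the paper itself uses probabilistic decision trees; the coefficient threshold below is a made-up stand-in for its feature selection):

  import numpy as np
  from sklearn.linear_model import LogisticRegression

  def dependency_network(X, names, threshold=0.1):
      # X: (n_samples, n_vars) 0/1 array, each column taking both values;
      # names: one name per column. Returns (arcs, per-variable models).
      n_vars = X.shape[1]
      arcs, models = [], {}
      for j in range(n_vars):
          others = [k for k in range(n_vars) if k != j]
          model = LogisticRegression(max_iter=1000).fit(X[:, others], X[:, j])
          models[names[j]] = (others, model)
          # Draw an arc from X_k to X_j if X_k's weight is non-negligible.
          for k, coef in zip(others, model.coef_[0]):
              if abs(coef) > threshold:
                  arcs.append((names[k], names[j]))
      return arcs, models

The summary graph is then just the nodes in names plus the arcs, read as "parent helps predict child", mirroring the definition above.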

David Heckerman
27
Example TV viewing
Nielsen data, 2/6/95-2/19/95
Goal: exploratory data analysis (acausal)
400 shows, 3000 viewers
David Heckerman
28
David Heckerman
29
A consistent dependency network
Target: X; Inputs: Y, Z
Target: Y; Inputs: X, Z
Target: Z; Inputs: X, Y
[The per-variable trees and the summary graph over X, Y, Z: an example where
the local models are consistent with a single joint distribution]
David Heckerman
30
An inconsistent dependency network
Target: X; Inputs: Y, Z
Target: Y; Inputs: X, Z
Target: Z; Inputs: X, Y
[The per-variable trees and the summary graph over X, Y, Z: an example where
the local models are not consistent with any single joint distribution]
David Heckerman
31
Dependency Network
  • Idea: using the next tag as well as the previous tag should improve
    tagging performance
  • A multiclass logistic regression model for t_i given t_-i and w (see the
    form below)
  • 460,552 features (current word, previous word, next word, capitalization,
    prefixes, suffixes, unknown word)
  • 96.6% per-word accuracy on the Penn corpus, a 4% error reduction over the
    previous best; 56.3% whole-sentence correct
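The local model is a standard multiclass logistic regression (conditional log-linear) model; in the usual notation (assumed here, not transcribed from the slide):

  p(t_i | t_{i-1}, t_{i+1}, w) \propto \exp( \sum_k \lambda_k f_k(t_i, t_{i-1}, t_{i+1}, w) )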

32
Modified Viterbi (2nd-Order HMM)
33
  • The dependency network learns a single multiclass logistic regression
    model for t_i given t_-i and w, but does consider the tag sequence in its
    entirety when decoding
  • Train the logistic regression using a data set that looks like this

t_i    t_i-1    t_i+1    f1    f2    ...    f_d
PER    LOC      PER      1     2.7   0      ...
34
Named-Entity Classification
  • "Mrs. Frank" is a person
  • "Steptoe and Johnson" is a company
  • "Honduras" is a location
  • etc.
  • Bikel et al. (1998) from BBN: "Nymble", a statistical approach using HMMs

35
[Diagram: name-class chain nc1 -> nc2 -> nc3, with each name class generating
its word]
  • Name classes: Not-A-Name, Person, Location, etc.
  • Smoothing for sparse training data; word features
  • Training: 100,000 words from the WSJ
  • Accuracy: 93%
  • 450,000 words -> about the same accuracy

36
Training-Development-Test
37
Co-Learning
  • First extract candidate word sequences

and
A noun phrase comprises a noun (obviously) and
any associated modifiers.
38
spelling predictor
or
An appositive is a re-naming or amplification of
a word that immediately precedes it.
39
(No Transcript)
40
Also Extract a Contextual Predictor
head of the modifying appositive
or
41
Or
preposition together with the noun it modifies
42
Learning from Unlabeled Examples
  • Start with seed rules
  • Each rule has a strength of 0.9999
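A generic bootstrapping loop in this spirit, sketched in Python (the feature names, threshold, and rule-induction criterion are made up for illustration; this is not the exact algorithm from the slides):

  from collections import defaultdict

  def bootstrap(candidates, seed_rules, rounds=5, threshold=0.95):
      # candidates: list of feature sets; rules: {feature: (label, strength)}
      rules = dict(seed_rules)
      for _ in range(rounds):
          # 1. Label each candidate with the strongest rule that fires on it.
          labeled = {}
          for i, feats in enumerate(candidates):
              fired = [(rules[f][1], rules[f][0]) for f in feats if f in rules]
              if fired:
                  labeled[i] = max(fired)[1]
          # 2. Induce new rules: keep feature/label pairs whose empirical
          #    precision on the currently labeled candidates beats the threshold.
          counts = defaultdict(lambda: defaultdict(int))
          for i, label in labeled.items():
              for f in candidates[i]:
                  counts[f][label] += 1
          for f, by_label in counts.items():
              label, n = max(by_label.items(), key=lambda kv: kv[1])
              strength = n / sum(by_label.values())
              if f not in rules and strength > threshold:
                  rules[f] = (label, strength)
      return rules

  # Toy usage: a seed rule saying the (made-up) feature 'contains-Mr.' marks a Person.
  seeds = {'contains-Mr.': ('Person', 0.9999)}
  cands = [{'contains-Mr.', 'ctx-said'}, {'ctx-said', 'full=Maury_Cooper'}]
  print(bootstrap(cands, seeds))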

43
Learning Algorithm
44
(No Transcript)
45
Evaluation
  • Evaluated on 1,000 hand-labeled extracted
    sequences
  • Accuracy: 83.3% overall, 91.3% if confined to
    Location/Organization/Person examples

46
Conditional Random Fields
  • The dependency network learns a single multiclass logistic regression
    model for t_i given t_-i and w, but does consider the tag sequence in its
    entirety when decoding
  • CRFs optimize model parameters with respect to the entire sequence
  • More expensive optimization, but increased flexibility and accuracy

t_i    t_i-1    t_i+1    f1    f2    ...    f_d
PER    LOC      PER      1     2.7   0      ...
47
From Logistic Regression to CRF
  • Logistic regression (equation shown as an image; the label y is a scalar)
  • Or, an equivalent rewriting in terms of feature functions
  • Linear-chain CRF (the label y is a vector, i.e., the whole tag sequence);
    see the standard forms below
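In standard notation (my reconstruction; the slide's equations were images):

  Logistic regression:  p(y | x) = (1 / Z(x)) \exp( \sum_k \lambda_k f_k(y, x) )

  Linear-chain CRF:     p(y_{1:T} | x) = (1 / Z(x)) \prod_{t=1}^{T} \exp( \sum_k \lambda_k f_k(y_t, y_{t-1}, x, t) )

where Z is the appropriate normalizer: a sum over single labels for logistic regression, and a sum over whole label sequences for the CRF.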
48
CRF Parameter Estimation
  • Conditional log likelihood
  • Regularized log likelihood (written out below)
  • Optimize with conjugate gradient, BFGS, etc.
  • POS tagging, 45 tags, ~10^6 words: about 1 week of training
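Written out (standard form, assumed rather than transcribed), the objective is the conditional log likelihood with a Gaussian penalty on the weights:

  \ell(\lambda) = \sum_i \log p(y^{(i)} | x^{(i)}; \lambda)

  \ell_{reg}(\lambda) = \sum_i \log p(y^{(i)} | x^{(i)}; \lambda) - \sum_k \lambda_k^2 / (2 \sigma^2)

which is then maximized with a gradient-based method such as conjugate gradient or (L-)BFGS.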

49
Sutton and McCallum (2006)
50
Skip-Chain CRFs
51
Bock's Results: POS Tagging
Penn Treebank
52
Bock's Results: Named Entity
CoNLL-03 task