Title: Probabilistic Graphs: Efficient Natural Spoken Language Processing
1. Probabilistic Graphs: Efficient Natural (Spoken) Language Processing
2. The Standard Clichés
- Moore's Cliché
  - Exponential growth in computing power and memory will continue to open up new possibilities
- The Internet Cliché
  - With the advent and growth of the world-wide web, an ever increasing amount of information must be managed
3. More Standard Clichés
- The Convergence Cliché
  - Data, voice and video networking will be integrated over a universal network that
    - includes land lines and wireless
    - includes broadband and narrowband
    - likely implementation is IP (internet protocol)
- The Interface Cliché
  - The three forces above (growth in computing power, information online, and networking) will both enable and require new interfaces
  - Speech will become as common as graphics
4. Application Requirements
- Robustness
  - acoustic and linguistic variation
  - disfluencies and noise
- Scalability
  - from embedded devices to palmtops to clients to servers
  - across tasks from simple to complex
    - from system-initiative form-filling to mixed-initiative dialogue
- Portability
  - simple adaptation to new tasks and new domains
  - preferably automated as much as possible
5. The Big Question
- How do humans handle unrestricted language so effortlessly in real time?
- Unfortunately, the classical linguistic assumptions and methodology completely ignore this issue
- This is a dangerous strategy for processing natural spoken language
6. My Favorite Experiments (I)
- Head-Mounted Eye Tracking
  - Mike Tanenhaus et al. (Univ. Rochester)
  - "Pick up the yellow plate"
- Clearly shows human understanding is online
7. My Favorite Experiments (II)
- Garden Paths and Context Sensitivity
  - Crain & Steedman (U. Connecticut, U. Edinburgh)
  - if the noun's denotation is not a singleton in context, postmodification is much more likely
- Garden Paths are Frequency and Agreement Sensitive
  - Tanenhaus et al.
  - The horse raced past the barn fell. (raced likely simple past)
  - The horses brought into the barn fell. (brought likely a participle, and a less likely activity for horses)
8. Conclusion: Function and Evolution
- Humans aggressively prune in real time
  - This is an existence proof: there must be enough info to do so; we just need to find it
  - All linguistic information is brought in at 200ms
  - Other pruning strategies have no such existence proof
- Speakers are cooperative in their use of language
  - Especially with spoken language, which is very different from written language due to real-time requirements
- (Co-?)Evolution of language and speakers to optimize these requirements
9. Stats: Explanation or Stopgap?
- The Common View
  - Statistics are some kind of approximation of underlying factors requiring further explanation.
- Steve Abney's Analogy (AT&T Labs)
  - Statistical Queueing Theory
  - Consider traffic flows through a toll gate on a highway.
  - Underlying factors are diverse, and explain the actions of each driver, their cars, possible causes of flat tires, drunk drivers, etc.
  - Statistics is more insightful and explanatory in this case, as it captures emergent generalizations
  - It is a reductionist error to insist on a low-level account
10. Algebraic vs. Statistical
- False Dichotomy
  - Statistical systems have an algebraic basis, even if trivial
  - The best performing statistical systems have the best linguistic conditioning
    - Holds for phonology/phonetics and morphology/syntax
    - Most explanatory in the traditional sense
  - Statistical estimators are less significant than the conditioning
- In other sciences, statistics is used for exploratory data analysis
  - trendier: data mining; trendiest: information harvesting
- Emergent statistical generalizations can be algebraic
11. The Speech Recognition Problem
- The Recognition Problem
  - Find the most likely sequence w of words given the sequence of acoustic observation vectors a
- Use Bayes' law to create a generative model (sketch below)
  - argmax_w P(w|a) = argmax_w P(a|w) P(w) / P(a) = argmax_w P(a|w) P(w)
  - Language Model P(w): usually n-grams (discrete)
  - Acoustic Model P(a|w): usually HMMs (continuous density)
- Challenge 1: beat trigram language models
- Challenge 2: extend this paradigm to NLP
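A minimal sketch of the noisy-channel objective above, rescoring a couple of hypotheses by log P(a|w) + log P(w) in log space. The hypotheses and scores are made-up placeholders, not the output of any real recognizer or language model.

```java
import java.util.*;

public class NoisyChannelRescore {
    public static void main(String[] args) {
        // hypothesis -> log P(a|w) from the recognizer (assumed values)
        Map<String, Double> acoustic = Map.of(
            "flights from boston today", -120.3,
            "flights for boston to pay", -119.8);
        // hypothesis -> log P(w) from a language model (assumed values)
        Map<String, Double> lm = Map.of(
            "flights from boston today", -14.2,
            "flights for boston to pay", -22.7);

        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String w : acoustic.keySet()) {
            double score = acoustic.get(w) + lm.get(w);  // log P(a|w) + log P(w)
            if (score > bestScore) { bestScore = score; best = w; }
        }
        System.out.println(best + "  " + bestScore);
    }
}
```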
12. N-best and Word Graphs
- Speech recognizers can return n-best histories
  - 1. flights from Boston today   2. lights for Boston to pay
  - 3. flights from Austin today   4. flights for Boston to pay
- Or a packed word graph of histories (sketch below)
  - the sum of a path's edge log probs equals its acoustic/language log prob
- The path closest to the utterance in dense graphs is much better than the first-best on average
  - [table of results by average word-graph density]
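A minimal sketch of such a packed word graph, assuming each edge carries a combined acoustic/language log probability (the graph, nodes, and scores here are illustrative only). A path's log probability is the sum of its edge log probabilities.

```java
import java.util.*;

public class WordGraph {
    record Edge(int from, int to, String word, double logProb) {}

    public static void main(String[] args) {
        // A small packed graph covering hypotheses like "flights from Boston today"
        // and "lights for Austin to pay" (illustrative scores).
        List<Edge> edges = List.of(
            new Edge(0, 1, "flights", -1.0), new Edge(0, 1, "lights", -3.0),
            new Edge(1, 2, "from", -1.0),    new Edge(1, 2, "for", -2.0),
            new Edge(2, 3, "Boston", -1.0),  new Edge(2, 3, "Austin", -1.5),
            new Edge(3, 5, "today", -1.0),
            new Edge(3, 4, "to", -1.5),      new Edge(4, 5, "pay", -1.5));

        // The path's log prob is the sum of its edge log probs.
        String[] path = {"flights", "from", "Boston", "today"};
        double logProb = 0.0;
        int node = 0;
        for (String w : path) {
            for (Edge e : edges) {
                if (e.from() == node && e.word().equals(w)) {
                    logProb += e.logProb();
                    node = e.to();
                    break;
                }
            }
        }
        System.out.println("log P(path) = " + logProb);
    }
}
```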
13. Probabilistic Graph Processing
- The architecture we're exploring in the context of spoken dialogue systems involves:
  - Speech recognizers that produce a probabilistic word graph as output, with scores given by acoustic probabilities
  - A tagger that transforms a word graph into a word/tag graph, with scores given by joint probabilities
  - A parser that transforms a word/tag graph into a syntactic graph (as in CKY parsing), with scores given by the grammar
- Allows each module to rescore the decisions output by previous modules
- Long Term: Apply this architecture to speech act detection, dialogue act selection, and generation
14. Probabilistic Graph Tagger
- In: probabilistic word graph
  - P(As|Ws): conditional acoustic likelihoods or confidences
- Out: probabilistic word/tag graph
  - P(Ws,Ts): joint word/tag likelihoods (ignores acoustics), or
  - P(As,Ws,Ts): joint acoustic/word/tag likelihoods
- General history-based implementation in Java (simplified sketch below)
  - next tag/word probability is a function of a specified history
  - operates purely left to right on the forward pass
  - backwards prune to edges within a beam / on the n-best paths
  - able to output hypotheses online
  - optional backwards confidence rescoring (not P(As,Ws,Ts))
  - need a node for each active history class for a proper model
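A minimal sketch of a left-to-right forward pass over a word graph with beam pruning, in the spirit of the history-based tagger described above but not its actual code. For brevity the history class is a single previous tag rather than the two-tag history of the next slide, the "graph" is a short chain, and the model scores and tag dictionary are placeholders.

```java
import java.util.*;

public class GraphTaggerSketch {
    record Edge(int from, int to, String word, double acousticLogProb) {}
    record State(int node, String prevTag) {}        // history class = previous tag

    static final double BEAM = 5.0;                  // log-prob beam width

    // Placeholder for log P(tag | prevTag) + log P(word | tag); a real model looks these up.
    static double transitionEmission(String prevTag, String tag, String word) {
        return -1.0;
    }

    // Placeholder tag dictionary.
    static Set<String> candidateTags(String word) {
        return Set.of("NN", "NNS", "VBD", "RB");
    }

    public static void main(String[] args) {
        List<Edge> edges = List.of(
            new Edge(0, 1, "prices", -2.0),
            new Edge(1, 2, "rose", -1.5),
            new Edge(2, 3, "sharply", -1.0));
        int lastNode = 3;

        // Best log score of any path reaching each (node, history) state.
        Map<State, Double> viterbi = new HashMap<>();
        viterbi.put(new State(0, "<s>"), 0.0);

        for (int node = 0; node < lastNode; node++) {
            final int n = node;
            // Beam prune: drop states at this node far below the node's best score.
            double best = viterbi.entrySet().stream()
                .filter(e -> e.getKey().node() == n)
                .mapToDouble(Map.Entry::getValue)
                .max().orElse(Double.NEGATIVE_INFINITY);
            viterbi.entrySet().removeIf(
                e -> e.getKey().node() == n && e.getValue() < best - BEAM);

            // Expand surviving states along each outgoing word edge and candidate tag.
            for (Edge e : edges) {
                if (e.from() != n) continue;
                for (State s : List.copyOf(viterbi.keySet())) {
                    if (s.node() != n) continue;
                    double base = viterbi.get(s);
                    for (String tag : candidateTags(e.word())) {
                        double score = base + e.acousticLogProb()
                                     + transitionEmission(s.prevTag(), tag, e.word());
                        viterbi.merge(new State(e.to(), tag), score, Math::max);
                    }
                }
            }
        }
        viterbi.forEach((s, score) -> System.out.println(s + "  " + score));
    }
}
```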
15. Backwards Rescore & Minimize
- All paths:
  - 1. A,C,E = 1/64
  - 2. A,C,D = 1/128
  - 3. B,C,D = 1/256
  - 4. B,C,E = 1/512
- Each edge gets the sum of the scores of all paths that go through it
- Normalize by the total (1/64 + 1/128 + 1/256 + 1/512); worked out below
- Note: outputs sum to 1 after the backward pass
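Working through the example above (a check of the slide's numbers, not part of the original deck):

```java
// Each edge's posterior is the sum of the scores of all paths through it,
// divided by the total over all paths.
public class EdgePosteriors {
    public static void main(String[] args) {
        double pACE = 1.0 / 64, pACD = 1.0 / 128, pBCD = 1.0 / 256, pBCE = 1.0 / 512;
        double total = pACE + pACD + pBCD + pBCE;            // 15/512
        System.out.println("A: " + (pACE + pACD) / total);   // 0.8
        System.out.println("B: " + (pBCD + pBCE) / total);   // 0.2
        System.out.println("C: " + total / total);           // 1.0, C is on every path
        System.out.println("D: " + (pACD + pBCD) / total);   // 0.4
        System.out.println("E: " + (pACE + pBCE) / total);   // 0.6
        // Competing edges sum to 1 (A+B = 1, D+E = 1), as the slide notes.
    }
}
```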
16. Tagger Probability Model
- Exact Probabilities
  - P(As,Ws,Ts) = P(Ws,Ts) P(As|Ws,Ts)
  - P(Ws,Ts) = P(Ts) P(Ws|Ts) (top-down)
- Approximations (sketch below)
  - Two-Tag History (tag trigram)
    - P(Ts) = ∏_n P(T_n | T_{n-2}, T_{n-1})
  - Words Depend Only on Tags (HMM)
    - P(Ws|Ts) = ∏_n P(W_n | T_n)
  - Pronunciation Independent of Tag (use standard acoustics)
    - P(As|Ws,Ts) ≈ P(As|Ws)
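A minimal sketch of how these factors combine into a joint score in log space under the approximations above; the lookup functions are hypothetical placeholders, not the actual estimators.

```java
public class TaggerModelSketch {
    static double logTagTrigram(String t2, String t1, String t) { return -1.2; }  // placeholder
    static double logWordGivenTag(String w, String t)           { return -2.3; }  // placeholder
    static double logAcousticGivenWords(String[] words)         { return -40.0; } // placeholder

    // log P(As,Ws,Ts) under the slide's approximations:
    // sum_n [ log P(T_n | T_{n-2}, T_{n-1}) + log P(W_n | T_n) ] + log P(As|Ws)
    static double jointLogProb(String[] words, String[] tags) {
        double lp = 0.0;
        String t2 = "<s>", t1 = "<s>";                 // boundary tags
        for (int n = 0; n < words.length; n++) {
            lp += logTagTrigram(t2, t1, tags[n]);      // tag trigram term
            lp += logWordGivenTag(words[n], tags[n]);  // HMM emission term
            t2 = t1; t1 = tags[n];
        }
        return lp + logAcousticGivenWords(words);      // acoustics independent of tags
    }

    public static void main(String[] args) {
        String[] words = {"prices", "rose", "sharply", "today"};
        String[] tags  = {"NNS", "VBD", "RB", "NN"};
        System.out.println(jointLogProb(words, tags));
    }
}
```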
17. Prices rose sharply today
0. -35.612683136497516 NNS/prices VBD/rose RB/sharply NN/today (0, 2, NNS/prices) (2, 10, VBD/rose) (10, 14, RB/sharply) (14, 15, NN/today)
1. -37.035496392922575 NNS/prices VBD/rose RB/sharply NNP/today (0, 2, NNS/prices) (2, 10, VBD/rose) (10, 14, RB/sharply) (14, 15, NNP/today)
2. -40.439580756197934 NNS/prices VBP/rose RB/sharply NN/today (0, 2, NNS/prices) (2, 9, VBP/rose) (9, 11, RB/sharply) (11, 15, NN/today)
3. -41.86239401262299 NNS/prices VBP/rose RB/sharply NNP/today (0, 2, NNS/prices) (2, 9, VBP/rose) (9, 11, RB/sharply) (11, 15, NNP/today)
4. -43.45450487625557 NN/prices VBD/rose RB/sharply NN/today (0, 1, NN/prices) (1, 6, VBD/rose) (6, 14, RB/sharply) (14, 15, NN/today)
5. -44.87731813268063 NN/prices VBD/rose RB/sharply NNP/today (0, 1, NN/prices) (1, 6, VBD/rose) (6, 14, RB/sharply) (14, 15, NNP/today)
6. -45.70597331609037 NNS/prices NN/rose RB/sharply NN/today (0, 2, NNS/prices) (2, 8, NN/rose) (8, 13, RB/sharply) (13, 15, NN/today)
7. -45.81027979248346 NNS/prices NNP/rose RB/sharply NN/today (0, 2, NNS/prices) (2, 7, NNP/rose) (7, 12, RB/sharply) (12, 15, NN/today)
8. ..
18. Prices rose sharply after hours: 15-best as a word/tag graph, after minimization
19. Prices rose sharply after hours: 15-best as a word/tag graph, after minimization and collapsing
[Graph figures; edge labels: prices/NN, prices/NNS, rose/VBD, rose/VBP, rose/NN, rose/NNP, sharply/RB, after/RB, after/IN, hours/NNS]
20. Weighted Minimization (isn't easy)
- Can push probabilities back through the graph
- The ratio of scores must be equivalent for sound minimization (the difference of log scores)
- Assume the x > y comparison preserves the sum of paths
- [Figure: edges (B,A) with weights w, x and (C,A) with weights z, y]
21. Weighted Minimization is Problematic
- Can't minimize if the ratio is not the same
- To push, must have the same amount to push (toy check below):
  - (x1 - x2) = (y1 - y2)
  - equivalently e^x1 / e^x2 = e^y1 / e^y2
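A toy check of this condition (an assumed example, not from the original slides): two states can only be merged after weight pushing if their outgoing log-score differences match.

```java
public class PushCheck {
    // (x1 - x2) == (y1 - y2), equivalently e^x1/e^x2 == e^y1/e^y2.
    static boolean canMerge(double x1, double x2, double y1, double y2) {
        return Math.abs((x1 - x2) - (y1 - y2)) < 1e-12;
    }

    public static void main(String[] args) {
        System.out.println(canMerge(-1, -3, -2, -4));  // true: same ratio, push the constant back
        System.out.println(canMerge(-1, -3, -2, -5));  // false: ratios differ, cannot merge
    }
}
```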
22. How to Collect the n Best in O(n k)
- Do a forward pass through the graph, saving
  - the best total path score at each node
  - backpointers to all previous nodes, with scores
  - this is done during tagging (linear in the max length k)
- Algorithm (a generic sketch follows this slide)
  - add the first-best and second-best final paths to a priority queue
  - repeat n times:
    - follow the backpointers of the best path on the queue back to the beginning
    - save the next best path (if any) at each node onto the queue
- Can do the same for all paths within a beam epsilon
- Result is deterministic; minimize before parsing
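A generic sketch of n-best extraction from a word lattice in this spirit: a forward pass stores the best score to each node, then a priority-queue search works backwards from the final node, so hypotheses come off the queue in overall-best order. This is not the tagger's actual code, and the lattice and scores are made up.

```java
import java.util.*;

public class NBestSketch {
    record Edge(int from, int to, String word, double logProb) {}
    // Partial path hypothesis growing backwards from the final node.
    record Hyp(int node, double backScore, List<String> words, double priority) {}

    public static void main(String[] args) {
        int start = 0, end = 3, n = 3;
        List<Edge> edges = List.of(
            new Edge(0, 1, "flights", -1.0), new Edge(0, 1, "lights", -3.0),
            new Edge(1, 2, "from", -1.0),    new Edge(1, 2, "for", -2.0),
            new Edge(2, 3, "Boston", -1.0),  new Edge(2, 3, "Austin", -1.5));

        // Forward pass: best log score from the start to each node.
        double[] forward = new double[end + 1];
        Arrays.fill(forward, Double.NEGATIVE_INFINITY);
        forward[start] = 0.0;
        for (Edge e : edges)  // edges are topologically ordered by start node
            forward[e.to()] = Math.max(forward[e.to()], forward[e.from()] + e.logProb());

        // Backward best-first search: priority = forward estimate + score from the end.
        PriorityQueue<Hyp> queue =
            new PriorityQueue<Hyp>(Comparator.comparingDouble((Hyp h) -> -h.priority()));
        queue.add(new Hyp(end, 0.0, List.of(), forward[end]));
        int found = 0;
        while (!queue.isEmpty() && found < n) {
            Hyp h = queue.poll();
            if (h.node() == start) {  // a complete path reached the start node
                System.out.println(h.priority() + "  " + h.words());
                found++;
                continue;
            }
            for (Edge e : edges) {
                if (e.to() != h.node()) continue;
                List<String> words = new ArrayList<>(h.words());
                words.add(0, e.word());
                double back = h.backScore() + e.logProb();
                queue.add(new Hyp(e.from(), back, words, forward[e.from()] + back));
            }
        }
    }
}
```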
23. Collins' Head/Dependency Parser
- Michael Collins (AT&T), 1998 UPenn PhD Thesis
- Generative model of tree probabilities P(Tree)
- Parses WSJ with 90% constituent precision/recall
  - Best performance for a single parser, but Henderson's Johns Hopkins thesis beat it by blending it with other parsers (Charniak, Ratnaparkhi)
  - The formal language induced from simple smoothing of the treebank is trivially Word* (Charniak)
- Collins' parser runs in real time
  - Collins' naïve C implementation
  - Parses 100% of the test set
24. Collins' Grammar Model
- Similar to a GPSG + CG (a.k.a. HPSG) model
  - Subcat frames: adjuncts / complements distinguished
  - Generalized coordination
  - Unbounded dependencies via slash
  - Punctuation
  - Distance metric encodes word order (canonical or not)
- Probabilities conditioned top-down
- 12,000-word vocabulary (> 5 occurrences in the treebank)
  - backs off to a word's tag
  - approximates unknown words from words with < 5 occurrences
- Induces feature information statistically
25. Collins' Statistics (Simplified)
- Choose Start Symbol, Head Tag, Head Word
  - P(RootCat, HeadTag, HeadWord)
- Project Daughter and Left/Right Subcat Frames
  - P(DaughterCat | MotherCat, HeadTag, HeadWord)
  - P(SubCat | MotherCat, DtrCat, HeadTag, HeadWord)
- Attach Modifier (Comp/Adjunct, Left/Right); see the sketch below
  - P(ModifierCat, ModifierTag, ModifierWord | SubCat, MotherCat, DaughterCat, HeadTag, HeadWord, Distance)
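A hedged sketch of how factors of this shape multiply into a derivation score in log space; the lookup functions are placeholders rather than Collins' actual backed-off estimators, and the category names are illustrative only.

```java
public class CollinsFactorsSketch {
    static double logRoot(String rootCat, String headTag, String headWord) { return -2.0; }   // placeholder
    static double logDaughter(String dtrCat, String motherCat,
                              String headTag, String headWord) { return -1.0; }               // placeholder
    static double logSubcat(String subcat, String motherCat, String dtrCat,
                            String headTag, String headWord) { return -0.5; }                 // placeholder
    static double logModifier(String modCat, String modTag, String modWord,
                              String subcat, String motherCat, String dtrCat,
                              String headTag, String headWord, int distance) { return -3.0; } // placeholder

    public static void main(String[] args) {
        // One head projection (S -> VP headed by "rose") plus one left modifier (NP subject).
        double lp = logRoot("S", "VBD", "rose")
                  + logDaughter("VP", "S", "VBD", "rose")
                  + logSubcat("{NP-C}", "S", "VP", "VBD", "rose")
                  + logModifier("NP", "NNS", "prices", "{NP-C}", "S", "VP", "VBD", "rose", 0);
        System.out.println("derivation log prob = " + lp);
    }
}
```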
26. Complexity and Efficiency
- Collins' wide-coverage linguistic grammar generates millions of readings for simple strings
- But Collins' parser runs faster than real time on unseen sentences of arbitrary length
- How?
- Punchline: time-synchronous beam search reduces time to linear
  - Tighter estimates, with more features and more complex grammars, ran faster and more accurately
  - The beam allows a tradeoff of accuracy (search error) and speed
27. Completeness and Dialogue
- Collins' parser is not complete in the usual sense
  - But neither are humans (e.g. garden paths)
- Syntactic features alone don't determine structure
  - Humans can't parse without context, semantics, etc.
  - Even phone or phoneme detection is very challenging, especially in a noisy environment
- Top-down expectations and knowledge of likely bottom-up combinations prune the vast search space online
- The question is how to combine this with other factors
- Next steps: semantics, pragmatics, dialogue
28. Conclusions
- Need a ranking of hypotheses for applications
- A beam can reduce processing time to linear
  - need good statistics to do this
- More linguistic features are better for statistical models
  - can induce the relevant ones and their weights from data
  - linguistic rules emerge from these generalizations
- Using acoustic / word / tag / syntax graphs allows the propagation of uncertainty
  - the ideal is totally online (the model is compatible with this)
  - approximation allows simpler modules to do a first pruning