1. Linguistics Methodology meets Language Reality
The quest for robustness, scalability, and portability in (spoken) language applications
- Bob Carpenter
- SpeechWorks International
2. The Standard Cliché(s)
- Moore's Cliché
- Exponential growth in computing power and memory will continue to open up new possibilities
- The Internet Cliché
- With the advent and growth of the world-wide web,
an ever increasing amount of information must be
managed
3. More Standard Clichés
- The Convergence Cliché
- Data, voice, and video networking will be integrated over a universal network that
- includes land lines and wireless
- includes broadband and narrowband
- likely implementation is IP (internet protocol)
- The Interface Cliché
- The three forces above (growth in computing
power, information online, and networking) will
both enable and require new interfaces
- Speech will become as common as graphics
4. Some Comp Ling Clichés
- The Standard Linguist's Cliché
- "But it must be recognized that the notion 'probability of a sentence' is an entirely useless one, under any known interpretation of this term." - Noam Chomsky, 1969 essay on Quine
- The Standard Engineer's Cliché
- "Anytime a linguist leaves the group the recognition rate goes up." - Fred Jelinek, 1988 address to DARPA
5. The Theoretical Abstraction
- mature, monolingual, native language speaker
- idealized to complete knowledge of language
- static, homogeneous language community
- all speakers learn identical grammars
- competence (vs. performance)
- performance is a natural class
- wetware implementation follows theory in
divorcing knowledge of language from processing
- assumes the existence and innateness of a language faculty
6. The Explicit Methodology
- Empirical basis is binary grammaticality judgements
- intuitive (to a properly trained linguist)
- innateness and the language faculty
- appropriate for phonetics through dialogue
- in practice, very little agreement at boundaries
and no standard evaluations of theories vs. data
- Models of particular languages
- by grammars that generate formal languages
- low priority for transformationalists
- high priority for monostratalists/computationalists
7. The Holy Grail of Linguistics
- A grammar meta-formalism in which
- all and only natural language grammars (idealized
as above) can be expressed
- assumed to correspond to the language faculty
- Grail is sought by every major camp of linguists
- Explains why all major linguistic theories look
alike from any perspective outside of a
linguistics department
- The expedient abstractions have become an end in
themselves
8. But Applications Require
- Robustness
- acoustic and linguistic variation
- disfluencies and noise
- Scalability
- from embedded devices to palmtops to clients to
servers
- across tasks from simple to complex
- system-initiative form-filling to mixed-initiative dialogue
- Portability
- simple adaptation to new tasks and new domains
- preferably automated as much as possible
9. The $64,000 Question
- How do humans handle unrestricted language so
effortlessly in real time?
- Unfortunately, the classical linguistic assumptions and methodology completely ignore this issue
- Psycholinguistics has uncovered some baselines
- lexicon (and syntax?) highly parallel
- time course of processing totally online
- information integration < 200 ms for all sources
- But is short on explanations
10. (AI) Success by Stupidity
- Jaime Carbonell's Argument (ECAI, mid-1990s)
- Apparent intelligence because they're too limited to do anything wrong; right answer hardcoded
- Typical in Computational NL Grammars
- lexicon limited to demo
- rules limited to common ones (e.g., no heavy shift)
- Scaling up usually destroys this limited
success
- 1,000,000s of grammatical readings with large
grammars
11. My Favorite Experiments (I)
- Mike Tanenhaus et al. (Univ. Rochester)
- Head-Mounted Eye Tracking
- "Pick up the yellow plate"
- Clearly shows that understanding is online
12. My Favorite Experiments (II)
- Garden Paths are Context Sensitive
- Crain & Steedman (U. Connecticut & U. Edinburgh)
- if noun is not unique in context, postmodification is much more likely than if noun picks out unique individual
- Garden Paths are Frequency and Agreement Sensitive
- Tanenhaus et al.
- "The horse raced past the barn fell." (raced likely past)
- "The horses brought into the barn fell." (brought likely participle, and less likely activity for horses)
13. Stats: Explanation or Stopgap?
- A Common View
- Statistics are some kind of approximation of
underlying factors requiring further explanation.
- Steve Abney's Analogy (AT&T Labs)
- Statistical Queueing Theory
- Consider traffic flows through a toll gate on a
highway.
- Underlying factors are diverse, and explain the
actions of each driver, their cars, possible
causes of flat tires, drunk drivers, etc.
- Statistics is more insightful and explanatory in this case as it captures emergent generalizations
- It is a reductionist error to insist on a low-level
account
14. Competence vs. Performance
- What is computed vs. how it is computed
- The what can be traditional grammatical structure
- All structures not computed, regardless of the
how
- Define what probabilistically, independently of
how
15. Algebraic vs. Statistical
- False Dichotomy
- All statistical systems have an algebraic basis,
even if trivial
- The Good News
- Best statistical systems have best linguistic
conditioning (most explanatory in traditional
sense)
- Statistical estimators far less significant than the appropriate linguistic conditioning
- Rest of the talk provides examples of this
16. Bayesian Statistical Modeling
- Concerned with prior and posterior probabilities
- Allows updates of reasoning
- Bayes' Law: P(A,B) = P(A|B) P(B) = P(B|A) P(A)
- E.g., Source/Channel Model for Speech Recognition
- Ws = sequence of words
- As = sequence of acoustic observations
- Compute ArgMax_Ws P(Ws|As)
- ArgMax_Ws P(Ws|As) = ArgMax_Ws P(As|Ws) P(Ws) / P(As) = ArgMax_Ws P(As|Ws) P(Ws)
- P(As|Ws) = acoustic model; P(Ws) = language model (toy sketch below)
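Below is a minimal Python sketch of this decision rule; the hypotheses and log scores are invented for illustration (they echo the n-best example later in the talk) and do not come from a real recognizer.

```python
# Toy sketch of the source/channel decision rule (all scores invented):
# pick the word sequence Ws maximizing log P(As|Ws) + log P(Ws).
hypotheses = {
    "flights from Boston today": (-120.0, -12.0),   # (acoustic log prob, LM log prob)
    "flights from Austin today": (-121.5, -13.0),
    "lights for Boston to pay":  (-119.0, -22.0),
}

best = max(hypotheses, key=lambda ws: sum(hypotheses[ws]))
print(best)   # "flights from Boston today"
```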
17. Simple Bayesian Update Example
- Monty Hall's "Let's Make a Deal"
- Three curtains with prize behind one, no other
info
- Contestant chooses one of three
- Monty then opens curtain of one of others that
does not have the prize
- if you choose curtain 2, then one of curtain 1 or 3 must not contain the prize
- Monty then lets you either keep your first guess, or change to the remaining curtain he didn't open.
- Should you switch, stay, or doesn't it matter?
18. Answer
- Yes! You should switch.
- Why? Consider the possibilities (simulation below)
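A quick way to see the 2/3 vs. 1/3 split is to simulate the game; this small Python sketch is illustrative only and is not from the original talk.

```python
# Monty Hall simulation: switching wins about 2/3 of the time.
import random

def play(switch, n_trials=100_000):
    wins = 0
    for _ in range(n_trials):
        prize = random.randrange(3)      # curtain hiding the prize
        choice = random.randrange(3)     # contestant's first pick
        # Monty opens a curtain that is neither the pick nor the prize
        opened = next(c for c in range(3) if c != choice and c != prize)
        if switch:
            # switch to the one remaining unopened curtain
            choice = next(c for c in range(3) if c != choice and c != opened)
        wins += (choice == prize)
    return wins / n_trials

print("stay:  ", play(switch=False))   # ~0.33
print("switch:", play(switch=True))    # ~0.67
```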
19. Defaults via Bayesian Inference
- Bayesian Inference provides an explanation for
rationality of default reasoning
- Reason by choosing an action to maximize expected payoff given some knowledge
- ArgMax_Action Payoff(Action) * P(Action | Knowledge)
- Given additional information, update Knowledge to Knowledge'
- ArgMax_Action Payoff(Action) * P(Action | Knowledge')
- Chosen action may be different, as in "Let's Make a Deal"
- Inferences are not logically sound, but are rational
- Bayesian framework integrates partiality and
uncertainty of background knowledge
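A toy sketch of the argmax above, reusing the Monty Hall posteriors; the payoff and probability values are assumed for illustration and are not from the talk.

```python
# Default reasoning as expected-payoff maximization (toy values).
payoff = {"stay": 1.0, "switch": 1.0}        # prize is worth the same either way
posterior = {"stay": 1/3, "switch": 2/3}     # P(win | action, knowledge after Monty's reveal)

best = max(payoff, key=lambda a: payoff[a] * posterior[a])
print(best)  # "switch" -- the default choice changes once the new knowledge arrives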
20. Example: Allophonic Variation
- English Pronunciation (M. Riley & A. Ljolje, AT&T)
- Derived from TIMIT with phoneme/phone labels
- orthographic: bottle
- phonological: / b aa t ax l / (ARPAbet phonemes)
- phonetic: 0.75 b aa dx el (TIMITbet phones)
- 0.13 b aa t el
- 0.10 b aa dx ax l
- 0.02 b aa t ax l
- Allophonic variation is non-deterministic
21. E.g., Allophonic Variation (cont'd)
- Simple statistical model (simplified w/o
insertion)
- Estimate probability of phones given phonemes
- P(a1,...,aM | p1,...,pM) = P(a1 | p1,...,pM) P(a2 | p1,...,pM, a1) ... P(aM | p1,...,pM, a1,...,aM-1)
- Approximate phoneme context to +/- K phones
- Approximate phone history to 0 or 1 phones
- 0: P(aJ | pJ-K,...,pJ,...,pJ+K)
- 1: P(aJ | pJ-K,...,pJ,...,pJ+K, aJ-1)
- Uses word boundary marker and stress
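As a rough illustration of the conditioning, here is a toy Python sketch of the order-1 model; the context window, keys, and probability values are invented stand-ins for the TIMIT-trained, decision-tree-smoothed estimates.

```python
# Toy context-dependent allophone model (order-1 phone history, +/-1 phoneme window).
# key: (prev_phoneme, phoneme, next_phoneme, prev_phone) -> {phone: prob}
cond = {
    ("aa", "t", "ax", "aa"): {"dx": 0.85, "t": 0.15},   # flapping of /t/ in "bottle"
    ("t", "ax", "l", "dx"):  {"ax": 0.25, "": 0.75},    # /ax/ often deleted (syllabic l)
}

def best_phone(context):
    """Most likely surface phone (or deletion) for a phoneme in context."""
    dist = cond[context]
    return max(dist, key=dist.get)

print(best_phone(("aa", "t", "ax", "aa")))              # 'dx'
print(best_phone(("t", "ax", "l", "dx")) or "(deleted)")
```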
22. E.g., Allophonic Variation (concl'd)
- Cluster phonological features using decision trees
- Sparse data smoothed by decision trees over standard features (+/- stop, voicing, aspiration, etc.)
- Conditional entropy without context: 1.5 bits; with context: 0.8 bits
- Most likely allophone correct 85.5% of the time; correct allophone in top 5: 99%
- Average 17 pronunciations/word to get 95%
- Robust: handles multiple pronunciations
- Scalable: to the whole of English pronunciation
- Portable: easy to move to new dialects with training
- K. Knight (ISI): similar techniques for Japanese pronunciation of English words!
23. Example: Co-articulation
- HMMs have been applied to speech since mid-70s
- Two major recent improvements, the first being
simply more training data and cycles
- Second is context-dependent triphones
- Instead of one HMM per phoneme/phone, use one per context-dependent triphone
- example: t-r+u, an r preceded by t and followed by u
- crucially clustered by phonological features to
overcome sparsity
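A small sketch of how a phone string expands into context-dependent triphone units, using the common left-center+right naming convention; this is an illustration, not the system's actual code.

```python
# Expand a phone sequence into triphone unit names (l-c+r notation).
def triphones(phones):
    """Map each phone to a triphone label given its left/right neighbours."""
    padded = ["sil"] + phones + ["sil"]          # pad with silence at the edges
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

print(triphones(["t", "r", "u"]))
# ['sil-t+r', 't-r+u', 'r-u+sil']
```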
24. Exploratory Data Analysis
- (Trendier: data mining; Trendiest: information harvesting)
- Specious Argument: A statistical model won't help explain linguistic processes.
- Counter 1: Abney's anti-reductionist analogy
- But even if you don't believe that
- Counter 2: In other sciences (pace linguistic tradition), statistics is used to discover regularities
- Allophone example: "had your" pronunciation
- / d / is 51% likely to be realized as [jh], 37% as [d]
- if / d / is realized as [jh], / y / deletes 84% of the time
- if / d / is realized as [d], / y / deletes 10% of the time
25. Balancing Gricean Maxims
- Grice gives us conflicting maxims
- quantity (exactly as informative as required)
- quality (try to make your contribution true)
- manner (be perspicuous, e.g., avoid ambiguity, be brief)
- Manner pulls in opposite directions
- quality without ambiguity lengthens statements
- quantity and (part of) manner require brevity
- Balance by estimating a multidimensional
goodness metric for generation
26. Gricean Balance (cont'd)
- Consider problem for aggregation in generation
- "Every student ran slowly or every student walked quickly."
- aggregates to
- "Every student ran slowly or walked quickly."
- This reduces sentence length, shortens clause length, and increases ambiguity.
- These tradeoffs need to be balanced (toy scoring sketch below)
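One way to picture such a balance is a weighted score over candidate realizations; the features and weights in this Python sketch are invented for illustration and are not the talk's actual metric.

```python
# Toy multidimensional goodness metric for choosing among generation candidates.
def goodness(sentence, n_readings, weights=(1.0, 0.5, 2.0)):
    w_len, w_clause, w_ambig = weights
    words = sentence.split()
    clauses = sentence.count(",") + sentence.count(" or ") + 1   # crude clause proxy
    avg_clause_len = len(words) / clauses
    # lower penalties are better: length, clause length, and ambiguity all cost
    return -(w_len * len(words) + w_clause * avg_clause_len + w_ambig * n_readings)

candidates = {
    "Every student ran slowly or walked quickly.": 2,                   # shorter, ambiguous
    "Every student ran slowly or every student walked quickly.": 1,     # longer, unambiguous
}
best = max(candidates, key=lambda s: goodness(s, candidates[s]))
print(best)   # which candidate wins depends entirely on the weights chosen
```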
27. Collins' Head/Dependency Parser
- Michael Collins' 1998 UPenn PhD thesis
- Parses WSJ with 90% constituent precision/recall
- Generative model of tree probabilities
- Clever Linguistic Decomposition and Training
- P(RootCat, HeadTag, HeadWord)
- P(DaughterCat | MotherCat, HeadTag, HeadWord)
- P(SubCat | MotherCat, DtrCat, HeadTag, HeadWord)
- P(ModifierCat, ModifierTag, ModifierWord | SubCat, MotherCat, DaughterCat, HeadTag, HeadWord, Distance)
28. E.g., Collins Parser (cont'd)
- Distance encodes heaviness
- Adjunct vs. Complement modifiers distinguished
- Head Words and Tags model lexical variation and
word-word attachment preferences
- Also conditions on punctuation, coordination, UDCs
- 12,000 word vocabulary plus unknown word
attachment model (by Collins) and tag model (by
A. Ratnaparkhi, another 1998 UPenn thesis)
- Smoothed by backing off words to categories
- Trivial statistical estimators; the power is in the conditioning
29. Computational Complexity
- Wide coverage linguistic grammars generate millions of readings
- But Collins' parser runs faster than real time on
a notebook on unseen sentences of length up to
100
- How? Pruning.
- Collins found tighter statistical estimates of
tree likelihoods with more features and more
complex grammars ran faster because a tighter
beam could be used
- (E. Charniak & S. Caraballo at Brown have really
pushed the envelope here)
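A minimal sketch of the beam idea in a chart parser: within a cell, keep only edges whose log score is within a fixed beam of the best. The edges and threshold are invented for illustration.

```python
# Beam pruning within a chart cell; tighter score estimates allow a tighter beam.
def prune_cell(edges, beam=3.0):
    """edges: list of (structure, log_score); return the survivors."""
    if not edges:
        return edges
    best = max(score for _, score in edges)
    return [(e, s) for e, s in edges if s >= best - beam]

cell = [("NP -> DT NN", -2.1), ("NP -> NP PP", -7.9), ("S -> NP VP", -2.5)]
print(prune_cell(cell))   # the -7.9 edge falls outside the beam and is dropped
```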
30. Complexity (cont'd)
- Collins' parser is not complete in the usual sense
- But neither are humans (e.g., garden paths)
- Can trade speed for accuracy in statistical parsers
- Syntax is not processed autonomously
- Humans can't parse without context, semantics, etc.
- Even phone or phoneme detection is very challenging, especially in a noisy environment
- Top-down expectations and knowledge of likely bottom-up combinations prune the vast search space online
- Question is how to combine this with other factors
31. N-best and Word Graphs
- Speech recognizers can return n-best histories
- flights from Boston today
- flights from Austin today
- flights for Boston to pay
- lights for Boston to pay
- Can also return a packed word graph of histories
- sum of log probs along a path equals the acoustics / word-string joint log prob (toy graph below)
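A toy sketch of such a word graph and its best path; the graph, words, and probabilities are invented for illustration.

```python
import math

# adjacency: node -> list of (next_node, word, log_prob)
graph = {
    0: [(1, "flights", math.log(0.6)), (1, "lights", math.log(0.4))],
    1: [(2, "from", math.log(0.7)), (2, "for", math.log(0.3))],
    2: [(3, "Boston", math.log(0.8)), (3, "Austin", math.log(0.2))],
    3: [(4, "today", math.log(0.9)), (4, "to pay", math.log(0.1))],
}

def best_path(node=0, final=4):
    """Return (total log prob, word list) of the highest-scoring path."""
    if node == final:
        return 0.0, []
    candidates = []
    for nxt, word, lp in graph[node]:
        score, words = best_path(nxt, final)
        candidates.append((lp + score, [word] + words))
    return max(candidates)

print(best_path())   # (log prob, ['flights', 'from', 'Boston', 'today'])
```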
32. Probabilistic Graph Processing
- The architecture we're exploring in the context of spoken dialogue systems involves
- Speech recognizers that produce probabilistic word graph output
- A tagger that transforms a word graph into a word/tag graph with scores given by joint probabilities
- A parser that transforms a word/tag graph into a graph-based chart (as in CKY or chart parsing)
- Allows each module to rescore the previous module's decisions
- Apply this architecture to speech act detection, dialogue act selection, and in generation
33. "Prices rose sharply after hours": 15-best shown as a word/tag graph, after minimization (figure slide)
34. Challenge: Beat n-grams
- Backed-off trigram models estimated from 300M words of WSJ provide the best language models (toy backoff sketch below)
- We know there is more to language than two words of history
- Challenge is to find out how to model it.
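For reference, a toy sketch of a backed-off trigram model; the counts and backoff weight are invented, and real models use proper discounting (e.g., Katz backoff).

```python
# Backed-off trigram language model with toy probability tables.
trigram = {("prices", "rose", "sharply"): 0.30}
bigram  = {("rose", "sharply"): 0.10, ("rose", "slowly"): 0.05}
unigram = {"sharply": 0.001, "slowly": 0.002, "after": 0.01}

def p(w, w1, w2, alpha=0.4):
    """P(w | w1 w2): use the trigram if seen, else back off with weight alpha."""
    if (w1, w2, w) in trigram:
        return trigram[(w1, w2, w)]
    if (w2, w) in bigram:
        return alpha * bigram[(w2, w)]
    return alpha * alpha * unigram.get(w, 1e-7)

print(p("sharply", "prices", "rose"))   # trigram hit
print(p("slowly", "prices", "rose"))    # backs off to the bigram
```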
35. Conclusions
- Need ranking of hypotheses for applications
- Beam can reduce processing time to linear
- need good statistics to do this
- More linguistic features are better for stat
models
- can induce the relevant ones and weights from data
- linguistic rules emerge from these generalizations
- Using acoustic / word / tag / syntax graphs allows the propagation of uncertainty
- ideal is totally online (model is compatible with this)
- approximation allows simpler modules to do a first pruning pass
36. Plugs
- Run, don't walk, to read
- Steve Abney. 1996. Statistical methods and
linguistics. In J. L. Klavans and P. Resnik,
eds., The Balancing Act. MIT Press.
- Mark Seidenberg and Maryellen MacDonald. 1999. A
probabilistic constraints approach to language
acquisition and processing. Cognitive Science.
- Dan Jurafsky and James H. Martin. 2000. Speech
and Language Processing. Prentice-Hall.
- Chris Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.