Title: Computational Linguistics
1 Computational Linguistics
- What is it, and what (if any) are its unifying themes?
2 Computational linguistics
3 I often agree with XKCD
4 ... linguistics?
[Figure: fields placed along a spectrum from "more rigorous" to "less rigorous / more flakey": physics, chemistry, biology, psychology, neuropsychology, computational linguistics, literary criticism.]
5 What defines the rigor of a field?
- Whether results are reproducible
- Whether theories are testable/falsifiable
- Whether there is a common set of methods for similar problems
- Whether approaches to problems can yield interesting new questions/answers
6 Linguistics
7 [Figure: fields placed along a spectrum from "more rigorous" to "less rigorous": engineering, sociology, linguistics, literary criticism.]
8 The true situation with linguistics
[Figure: subfields of linguistics arranged along a spectrum from "more rigorous" to "less rigorous", including experimental phonetics, historical linguistics, psycholinguistics, some areas of sociolinguistics (e.g. Bill Labov), other areas of sociolinguistics (e.g. Deborah Tannen), theoretical linguistics (e.g. lexical-functional grammar), and theoretical linguistics (e.g. minimalist syntax).]
9 Okay, enough already. What is computational linguistics?
- Text normalization/segmentation
- Morphological analysis
- Automatic word pronunciation prediction
- Transliteration
- Word-class prediction, e.g. part-of-speech tagging
- Parsing
- Semantic role labeling
- Machine translation
- Dialog systems
- Topic detection
- Summarization
- Text retrieval
- Bioinformatics
- Language modeling for automatic speech recognition
- Computer-aided language learning (CALL)
10 Computational linguistics
- Often thought of as natural language engineering
- But there is also a serious scientific component
to it.
11 Why CL may seem ad hoc
- Wide variety of areas (as in linguistics)
- If it's natural language engineering, the goal is often just to build something that works
- Techniques tend to change in somewhat faddish ways
- For example, machine learning approaches fall in and out of favor
16 Machine learning in CL
- In general it's a plus, since it has meant that evaluation has become more rigorous
- But it's important that the field not turn into applied machine learning
- For this to be avoided, people need to continue to focus on what linguistic features are important
- Fortunately, this seems to be happening
17 Some interesting themes
- Finite-state methods
  - Many application areas
  - Raises interesting questions about how much of language is regular (in the sense of finite-state)
- Grammar induction
  - Linguists have done a poor job at their stated goal of explaining how humans learn grammar
- Computational models of language change
  - Historical evidence for language change is only partial. There are many changes in language for which we have no direct evidence.
18 Finite-state methods
- Used from the 1950s onwards
- Went out of fashion a bit during the 1980s
- Then a revival in the 1990s with the advent of
weighted finite-state methods
19 Some applications
- Analysis of word structure: morphology
- Analysis of sentence structure
- Part of speech tagging
- Parsing
- Speech recognition
- Text normalization
- Computational biology
20 Regular languages
- A regular language is a language with a finite alphabet that can be constructed out of one or more of the following operations (a small illustrative sketch follows):
  - Set union
  - Concatenation
  - Transitive closure (Kleene star)
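Below is a minimal sketch (my own illustration, not from the talk) of these three operations, approximated over finite sets of strings; true regular languages are infinite objects, so the Kleene star here is truncated at a fixed depth.

```python
# Toy versions of the three operations that define regular languages,
# working on finite sets of strings for illustration.

def union(l1: set[str], l2: set[str]) -> set[str]:
    """Set union of two languages."""
    return l1 | l2

def concat(l1: set[str], l2: set[str]) -> set[str]:
    """All concatenations of a string from l1 with a string from l2."""
    return {a + b for a in l1 for b in l2}

def star(l: set[str], depth: int = 3) -> set[str]:
    """Kleene star, truncated at `depth` repetitions so it stays finite."""
    result = {""}            # the empty string is always in l*
    layer = {""}
    for _ in range(depth):
        layer = concat(layer, l)
        result |= layer
    return result

# Example: (ab | c)* restricted to at most 2 repetitions
print(sorted(star(union({"ab"}, {"c"}), depth=2)))
# ['', 'ab', 'abab', 'abc', 'c', 'cab', 'cc']
```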
21 Finite-state automata: formal definition
Every regular language can be recognized by a finite-state automaton, and every finite-state automaton recognizes a regular language (Kleene's theorem).
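As a concrete illustration (my own sketch; the particular automaton is a hypothetical example, not one from the slides), here is a deterministic finite-state automaton recognizing the regular language (ab)*:

```python
# A DFA as a transition table, start state, and set of accepting states,
# recognizing (ab)* over the alphabet {a, b}.

def make_dfa():
    delta = {                      # transition function delta: Q x Sigma -> Q
        ("q0", "a"): "q1",
        ("q1", "b"): "q0",
    }
    return delta, "q0", {"q0"}     # transitions, start state, accepting states

def accepts(word: str) -> bool:
    """Run the DFA; reject if a required transition is missing."""
    delta, state, finals = make_dfa()
    for symbol in word:
        if (state, symbol) not in delta:
            return False
        state = delta[(state, symbol)]
    return state in finals

assert accepts("")
assert accepts("abab")
assert not accepts("aba")
```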
22 Representation of FSAs: State Diagram
23 Regular relations: formal definition
24 Finite-state transducers
25 An FST
26 Composition
- In addition to union, concatenation and Kleene closure, regular relations are closed under composition
- Composition is to be understood here the same way as composition in algebra
- R1 ∘ R2 means: take the output of R1 and feed it to the input of R2 (see the sketch below)
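A small sketch of relational composition (my own illustration, with relations represented extensionally as sets of input/output pairs; real FST composition operates on the machines themselves, not on enumerated pairs):

```python
# Relational composition: R1 o R2 pairs x with z whenever R1 maps x to
# some y and R2 maps that same y to z.

def compose(r1: set[tuple[str, str]],
            r2: set[tuple[str, str]]) -> set[tuple[str, str]]:
    """Feed the outputs of r1 into the inputs of r2."""
    return {(x, z) for (x, y1) in r1 for (y2, z) in r2 if y1 == y2}

# Hypothetical example: R1 realizes an affixed form, R2 spells out its
# pronunciation.
R1 = {("dog+s", "dogs")}
R2 = {("dogs", "d o g z")}
print(compose(R1, R2))   # {('dog+s', 'd o g z')}
```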
27 Composition: an illustration
28 R1 as a transducer
29 R2 as a transducer
30 R1 ∘ R2
31 Some things you can do with FSTs
- Text analysis/normalization
- Word segmentation
- Abbreviation expansion
- Digit-to-number-name mappings
- i.e. mapping from writing to language
- Morphological analysis
- Syntactic analysis
- E.g. part-of-speech tagging
- (With weights) pronunciation modeling and
language modeling for speech recognition
32 That's fine for engineering, but...
- Does it really account for the facts?
- Is morphology really regular?
- Is the mapping between writing and speech really
regular?
33 What is morphology?
- scripserunt is the third person plural perfect active of scribo ('I write')
- Morphology relates word forms
  - the lemma of scripserunt is scribo
- Morphology analyzes the structure of word forms
  - scripserunt has the structure scrib+s+erunt
34Morphology is a relation
- Imagine you have a Latin morphological analyzer
comprising - D a relation that maps between surface form and
decomposed form - L a relation that maps between decomposed form
and lemma - Then
- scripserunt ? D scribserunt
- scripserunt ? D ? L scribo
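A toy rendering of this pipeline (my own; the relation names D and L come from the slide, but representing them as single-entry Python dictionaries is purely illustrative):

```python
# Morphological analysis as relation composition:
# surface form -> decomposed form -> lemma.

D = {"scripserunt": "scrib+s+erunt"}   # surface form -> decomposed form
L = {"scrib+s+erunt": "scribo"}        # decomposed form -> lemma

def apply(relation: dict[str, str], form: str) -> str:
    return relation[form]

surface = "scripserunt"
decomposed = apply(D, surface)         # scripserunt o D
lemma = apply(L, decomposed)           # scripserunt o D o L
print(decomposed, lemma)               # scrib+s+erunt scribo
```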
35English regular plurals
- cat s cats /s/
- dog s dogs /z/
- spouse s spouses /?z/
- This can be implemented by a rule that composes
with the base word, inserting the relevant form
of the affix at the end
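A rough sketch of the allomorph choice (my own approximation, keyed off spelling for readability; a real transducer would operate on phonemes, and the ending lists here are heuristic):

```python
# Choose the English regular plural allomorph /s/, /z/, or /@z/ from the
# final sound of the base (approximated by its spelling).

def plural_allomorph(base: str) -> str:
    sibilant_endings = ("s", "z", "x", "ch", "sh", "se", "ce", "ge")
    voiceless_endings = ("p", "t", "k", "f", "th")
    if base.endswith(sibilant_endings):
        return "@z"                     # spouses, churches
    if base.endswith(voiceless_endings):
        return "s"                      # cats, cliffs
    return "z"                          # dogs, days

for word in ("cat", "dog", "spouse"):
    print(word, "->", plural_allomorph(word))
# cat -> s, dog -> z, spouse -> @z
```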
36 Templatic affixes in Yowlumne
Transducer for each affix transforms base into
required templatic form and appends the relevant
string.
37 Subtractive morphology
The transducer deletes the final VC of the base (a minimal sketch follows).
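A minimal sketch of that deletion rule (my own; the sample base is hypothetical, and the vowel inventory is simplified):

```python
# Subtractive morphology: derive a form by deleting the base's final
# vowel+consonant sequence, as stated on the slide.

VOWELS = set("aeiou")

def delete_final_vc(base: str) -> str:
    """Delete a final vowel+consonant sequence, if the base ends in one."""
    if len(base) >= 2 and base[-1] not in VOWELS and base[-2] in VOWELS:
        return base[:-2]
    return base

print(delete_final_vc("takin"))   # hypothetical base -> 'tak'
```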
38 Bontoc infixation
- Insert a marker > after the first consonant (if any)
- Change > into the infix -um- (a small sketch follows)
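A small sketch of the two-step recipe (my own implementation; fikas 'strong' → fumikas is the standard textbook Bontoc example, cited here from memory):

```python
# Bontoc infixation in two finite-state-style rewrite steps:
# (1) insert a marker after the first consonant, (2) realize it as -um-.

VOWELS = set("aeiou")

def insert_marker(word: str) -> str:
    """Step 1: insert '>' after the first consonant, if there is one."""
    if word and word[0] not in VOWELS:
        return word[0] + ">" + word[1:]
    return ">" + word if word else word

def realize_infix(word: str) -> str:
    """Step 2: rewrite the marker '>' as the infix 'um'."""
    return word.replace(">", "um")

print(realize_infix(insert_marker("fikas")))   # 'fumikas'
```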
39 Side note: infixation in English
[Figure: expletive infixation, splitting the place name Kalamazoo as Kalama-...-zoo.]
40 Reduplication: Gothic
Problem: mapping w to ww is not a regular relation.
41 Factoring Reduplication
- Prosodic constraints
- Copy verification transducer C
42 Non-Exact Copies
- Dakota (Inkelas & Zoll, 1999)
43 Non-Exact Copies
- Basic and modified stems in Sye (Inkelas & Zoll, 1999)
44 Morphological Doubling Theory (Inkelas & Zoll, 1999)
- Most linguistic accounts of reduplication assume that the copying is done as part of morphology
- In MDT:
  - Reduplication involves doubling at the morphosyntactic level, i.e. one is actually simply repeating words or morphemes
  - Phonological doubling is thus expected, but not required
45 Gothic Reduplication under Morphological Doubling Theory
46 Summary
- If Inkelas & Zoll are right, then all morphology can be computed using regular relations
- This in turn suggests that computational morphology has picked the right tool for the job
47 Another Example: Linguistic analysis of text
- Maps the stuff you see on the page, e.g. text written in the standard orthography of a language, into linguistic units (words, morphemes, phonemes)
- For example:
  - I ate a 25kg bass
  - aɪ eɪt ə twɛnti faɪv kɪləɡræm bæs
- This can be done using transducers
- But is the mapping between writing and language really regular (finite-state)?
48 Linguistic analysis of text
- Abbreviation expansion
- Disambiguation
- Number expansion
- Morphological analysis of words
- Word pronunciation
49 A transducer for number names
Consider a machine that maps between digit strings and their reading as number names in English (a compact sketch follows):
30,294,005,179,018,903.56 → thirty quadrillion, two hundred ninety-four trillion, five billion, one hundred seventy-nine million, eighteen thousand, nine hundred three, point five six
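A compact sketch of such a mapping (my own implementation, not the talk's transducer; a finite-state version would compile rules like these into a weighted FST):

```python
# Map a digit string, with optional commas and decimal point, to its
# English number-name reading.

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
SCALES = ["", " thousand", " million", " billion", " trillion",
          " quadrillion"]

def under_1000(n: int) -> str:
    parts = []
    if n >= 100:
        parts.append(ONES[n // 100] + " hundred")
        n %= 100
    if n >= 20:
        parts.append(TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else ""))
    elif n:
        parts.append(ONES[n])
    return " ".join(parts)

def number_name(digits: str) -> str:
    integer, _, fraction = digits.replace(",", "").partition(".")
    n = int(integer)
    if n == 0:
        words = "zero"
    else:
        groups, scale = [], 0
        while n:                          # peel off three digits at a time
            n, g = divmod(n, 1000)
            if g:
                groups.append(under_1000(g) + SCALES[scale])
            scale += 1
        words = ", ".join(reversed(groups))
    if fraction:
        words += ", point " + " ".join(ONES[int(d)] for d in fraction)
    return words

print(number_name("30,294,005,179,018,903.56"))
# thirty quadrillion, two hundred ninety-four trillion, five billion,
# one hundred seventy-nine million, eighteen thousand, nine hundred three,
# point five six
```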
50 Mapping between speech and writing
- It seems obvious on the face of it that the mapping between speech and its written form is regular. After all, the words are ordered in the same way as speech. Even the letters tend to be ordered in the same way as the sounds they represent.
51 Some examples where it isn't
[Figure: a transliterated example of honorific inversion, in which the written order of signs differs from the spoken order.]
52 Finite-state methods
- In morphology they seem almost exactly correct as characterizations of the natural phenomenon
- In the mapping from writing to language, again, finite-state models seem almost exactly correct
53 Grammar induction
The common nativist view in linguistics, from Gilbert Harman's review of Chomsky's New Horizons in the Study of Language and Mind (published in Journal of Philosophy, 98(5), May 2001): "Further reflection along these lines and a great deal of empirical study of particular languages has led to the 'principles and parameters' framework which has dominated linguistics in the last few decades. The idea is that languages are basically the same in structure, up to certain parameters, for example, whether the head of a phrase goes at the beginning of a phrase or at the end. Children do not have to learn the basic principles, they only need to set the parameters. Linguistics aims at stating the basic principles and parameters by considering how languages differ in certain more or less subtle respects. The result of this approach has been a truly amazing outpouring of discoveries about how languages are the same yet different."
54 Similarly
Cedric Boeckx and Norbert Hornstein. 2003. The Varying Aims of Linguistic Theory. "Children come equipped with a set of principles of grammar construction (i.e. Universal Grammar (UG)). The principles of UG have open parameters. Specific grammars arise once values for these open parameters are specified. Parameter values are determined on the basis of the primary linguistic data. A language-specific grammar, then, is simply a specification of the values that the principles of UG leave open."
55 My challenge (with Shalom Lappin)
57 Automatic induction of grammars from unannotated text
- Klein, Dan and Manning, Christopher. 2004. Corpus-based induction of syntactic structure: models of dependency and constituency. Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics.
- Lots of subsequent work
58 Different syntactic representations
59 Dependency Model with Valence (DMV)
- Each head generates a set of non-STOP arguments to one side, then a STOP argument, then similarly on the other side (see the schematic formula below)
- Trained using expectation maximization
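Schematically (my reconstruction from Klein and Manning 2004; the notation is mine, and no formula appears on the slide itself), the probability of the dependency tree D(h) rooted at head h factors as:

```latex
% DMV factorization (reconstruction; notation mine). For each head h and
% direction dir, arguments are generated until STOP; adj records whether
% any argument has yet been generated in that direction.
P(D(h)) = \prod_{dir \in \{\leftarrow,\rightarrow\}}
          \Bigg[ \prod_{a \in \mathrm{deps}(h,\,dir)}
            P(\neg\mathrm{STOP} \mid h, dir, adj)\,
            P(a \mid h, dir)\, P(D(a)) \Bigg]
          \, P(\mathrm{STOP} \mid h, dir, adj)
```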
60 Performance
61 Improvements
- Constituent structure can be induced in a similar way to inducing word classes (e.g. parts of speech), by considering the environments in which the putative constituent finds itself.
- In Klein & Manning's constituent-context model (CCM), the probability of a bracketing is computed as sketched below.
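The formula itself was on a slide image; as a hedged reconstruction from Klein and Manning's CCM papers, the model scores a sentence S together with a bracketing B as:

```latex
% CCM joint probability (my reconstruction; notation mine).
% alpha_ij is the yield of span <i,j>, beta_ij its linear context,
% and B_ij says whether <i,j> is a constituent under bracketing B.
P(S, B) = P(B) \prod_{\langle i,j \rangle}
          P(\alpha_{ij} \mid B_{ij})\, P(\beta_{ij} \mid B_{ij})
```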
62 Combined DMV+CCM
Subsequent work, e.g. Rens Bod's 2006 Unsupervised Data-Oriented Parsing, reports F-scores close to 83.0. For comparison, the best supervised parsers get about 91.0.
63 Some objections and a synopsis
- Children do not learn grammars from unannotated text corpora; they get a lot of guidance from the environmental situation
  - Sure
- Performance of automatic induction algorithms is still far from human performance, so they do not constitute evidence that we can do away with (nativist) linguistic theories of language acquisition
  - They do not show this. But the argument would have more weight if nativist theories had already been demonstrated to contribute to a working model of grammar induction
- But Computational Linguistics is starting to make some serious contributions to this 50-year-old debate
64 The evolution of complex structure in language
Examples from Stump, Gregory (2001) Inflectional Morphology: A Theory of Paradigm Structure. Cambridge University Press.
65 Evolutionary Modeling (A tiny sample)
- Hare, M. and Elman, J. L. (1995) Learning and morphological change. Cognition, 56(1): 61-98.
- Kirby, S. (1999) Function, Selection, and Innateness: The Emergence of Language Universals. Oxford.
- Nettle, D. (1999) Using Social Impact Theory to simulate language change. Lingua, 108(2-3): 95-117.
- de Boer, B. (2001) The Origins of Vowel Systems. Oxford.
- Niyogi, P. (2006) The Computational Nature of Language Learning and Evolution. Cambridge, MA: MIT Press.
66 A multi-agent simulation
- The system is seeded with a grammar and a small number of agents
- Each agent randomly selects a set of phonetic rules to apply to forms
- Agents are assigned to one of a small number of social groups
- Two parents beget child agents
- Children are exposed to a predetermined number of training forms combined from both parents
- Forms are presented proportional to their underlying frequency
- Children must learn to generalize to unseen slots for words
- The learning algorithm is similar to David Yarowsky and Richard Wicentowski (2000) Minimally supervised morphological analysis by multimodal alignment. Proceedings of ACL-2000, Hong Kong, pages 207-216.
  - Features include the last n characters of the input form, plus semantic class
  - Learners select the optimal surface form to derive other forms from ("optimal" meaning the form requiring the simplest resulting ruleset: a Minimum Description Length criterion)
- Forms are periodically pooled among all agents, and the n best forms are kept for each word and each slot
- The population grows, but is kept in check by "natural disasters" and a quasi-Malthusian model of resource limitations
- Agents age and die according to reasonably realistic mortality statistics (a toy sketch of the whole loop follows)
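A toy sketch of the overall loop (entirely my own construction; it keeps only the skeleton of the slide's design, with social groups, frequency weighting, and the Yarowsky and Wicentowski-style learner elided or stubbed):

```python
# Skeleton of the multi-agent simulation: agents carry phonetic rules,
# parents train children on pooled forms, and the population is culled
# under a resource limit.
import random

class Agent:
    def __init__(self, rules):
        self.rules = rules              # phonetic rules this agent applies
        self.lexicon = {}               # slot -> surface form
        self.age = 0

def train_child(parents, seed_forms, n_exposures):
    """Expose a child to forms pooled from both parents."""
    child = Agent(rules=random.choice([p.rules for p in parents]))
    pool = [f for p in parents for f in p.lexicon.items()] or seed_forms
    for _ in range(n_exposures):
        slot, form = random.choice(pool)   # frequency weighting elided
        child.lexicon[slot] = form
    # Generalization to unseen slots (the MDL-style learner) would go here.
    return child

def step(population, seed_forms):
    """One generation: a birth, aging, and quasi-Malthusian culling."""
    parents = random.sample(population, 2)
    population.append(train_child(parents, seed_forms, n_exposures=50))
    for agent in population:
        agent.age += 1
    max_size = 100                        # resource limit on growth
    population[:] = sorted(population, key=lambda a: a.age)[:max_size]

population = [Agent(rules=[]) for _ in range(10)]
seed_forms = [("sing.pres", "walk"), ("plur.past", "walked")]
for _ in range(20):
    step(population, seed_forms)
print(len(population), "agents after 20 generations")
```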
67 Final states for a given initial state
68 Another example
- Kirby, Simon. 2001. Spontaneous evolution of linguistic structure: an iterated learning model of the emergence of regularity and irregularity. IEEE Transactions on Evolutionary Computation, 5(2): 102-110.
- Assumes two meaning components, each with 5 values, for 25 possible words
- The initial speaker randomly selects examples from the 25, producing random strings for each, and teaches them to the hearer
- Not all of the slots are filled, thus producing a bottleneck: the hearer must compute forms for the missing slots (a minimal sketch follows)
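A minimal toy version of this loop (my own construction, under the slide's assumptions of a 5x5 meaning space and a transmission bottleneck; the gap-filling heuristic is a crude stand-in for Kirby's grammar induction):

```python
# Iterated learning with a bottleneck: each generation observes only part
# of the lexicon and must fill the missing slots compositionally.
import random

MEANINGS = [(a, b) for a in range(5) for b in range(5)]   # 25 meanings

def random_string():
    return "".join(random.choice("abcdefg") for _ in range(4))

def fill_gaps(observed):
    """Guess forms for unseen meanings by reusing halves of observed forms
    as 'morphemes' for each meaning component (a crude induction step)."""
    left = {m[0]: f[: len(f) // 2] for m, f in observed.items()}
    right = {m[1]: f[len(f) // 2:] for m, f in observed.items()}
    lexicon = dict(observed)
    for m in MEANINGS:
        if m not in lexicon and m[0] in left and m[1] in right:
            lexicon[m] = left[m[0]] + right[m[1]]
    return lexicon

# Initial speaker: random, unstructured (holistic) forms.
language = {m: random_string() for m in MEANINGS}
for generation in range(10):
    taught = dict(random.sample(sorted(language.items()), 15))  # bottleneck
    language = fill_gaps(taught)

print(len(language), "meanings expressible after 10 generations")
```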
69 The basic algorithm produces results that are too regular
[Figure: the initial state and the final state of the lexicon.]
70 A more realistic result
- Addition of other constraints, including:
  - a random tendency for speakers to omit symbols,
  - a frequency distribution over the 25 possible meaning combinations
71 Summary
- Evolutionary modeling is evolving slowly
- We are a long way from being able to model the complexities of known language evolution
- Nonetheless, computational approaches promise to lend insights into how complex social systems such as language change over time, and to complement discoveries in historical linguistics
72 Final thoughts
- Language is central to what it means to be human.
- Language is used to:
  - Communicate information
  - Communicate requests
  - Persuade, cajole
  - (In written form) record history
  - Deceive
- Other animals do some or most of these things (cf. Anindya Sinha's work on bonnet macaques)
- But humans are better at all of these
73 Final thoughts
- So the scientific study of language ought to be more central than it is
- We need to learn much more about how language works:
  - How humans evolved language
  - How languages changed over time
  - How humans learn language
- Computational linguistics can contribute to all of these questions.