Title: Modelling language evolution
1 Modelling language evolution
- Tandy Warnow
- The University of Texas at Austin
2 Species phylogeny
[Figure: species tree with leaves Orangutan, Human, Gorilla, and Chimpanzee. From the Tree of Life website, University of Arizona.]
3 Possible Indo-European tree (Ringe, Warnow and Taylor 2000)
4 Controversies for Indo-European history
- Subgrouping: Other than the 10 major subgroups, what is likely to be true? In particular, what about
  - Italo-Celtic,
  - Greco-Armenian,
  - Anatolian + Tocharian,
  - Satem Core?
5 This talk
- Empirical evidence of how estimated phylogenies depend upon both the data and the method, and can be wrong
- Models of language evolution (from the earliest ones to more recent ones), why we need them, and what we still need to do. Note: simulations and estimation methods both depend upon model assumptions!
- Results of simulation studies based upon some new models
- Comments
6 Nakhleh et al., Transactions of the Philological Society 2005
- Methods studied: UPGMA (lexico-statistics), neighbor joining (NJ), maximum parsimony (MP), maximum compatibility (MC), weighted MP, weighted MC, and Gray & Atkinson.
- Datasets: Four versions of the Ringe & Taylor IE data (lexical, morphological, and phonological characters): lexical only vs. all, screened vs. unscreened
- Observations:
  - UPGMA (lexico-statistics) does the worst: it splits known subgroups.
  - Other than UPGMA, all methods reconstruct the ten major subgroups, Anatolian + Tocharian, and Greco-Armenian. Nothing else is consistently reconstructed.
  - When using lexical data only, all methods group Italic, Celtic, and Germanic together.
  - Some methods (not all) will reconstruct different trees on different datasets. Screening datasets to remove obvious homoplasy can result in better (?) trees.
7 Question: how to determine which phylogenies are reliable?
- Data: need high quality data!
- Phylogenetic reconstruction methods need to be tested before being trusted! Examples of possible tests:
  - Benchmark real datasets (need good benchmarks! Are there any?)
  - Simulated datasets (need good models!)
8 Simulation study (cartoon)
9 Simulation study (cartoon)
[Figure: simulation-study cartoon. FN: false negative (missing edge); FP: false positive (incorrect edge); 50% error rate.]
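The FN and FP rates named in the cartoon can be computed by comparing the internal edges (bipartitions of the leaf set) of the true tree and the estimated tree. The following is a minimal sketch, not taken from the talk: the nested-tuple tree encoding and the function names `bipartitions` and `fn_fp_rates` are assumptions made for illustration.

```python
def bipartitions(tree, all_leaves):
    """Return the non-trivial bipartitions (internal edges) of a tree.

    `tree` is a nested tuple of leaf names, e.g. (("A", "B"), ("C", "D")).
    Each bipartition is stored as the side that does NOT contain a fixed
    reference leaf, so the same split always has the same representation.
    """
    all_leaves = frozenset(all_leaves)
    reference = min(all_leaves)
    splits = set()

    def collect(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(collect(child) for child in node))
        side = all_leaves - below if reference in below else below
        # Skip trivial splits (fewer than two leaves on either side).
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return below

    collect(tree)
    return splits


def fn_fp_rates(true_tree, estimated_tree, leaves):
    """False negative and false positive rates of the estimated tree."""
    true_splits = bipartitions(true_tree, leaves)
    est_splits = bipartitions(estimated_tree, leaves)
    # FN: true edges missing from the estimate; FP: estimated edges not in the truth.
    fn = len(true_splits - est_splits) / len(true_splits) if true_splits else 0.0
    fp = len(est_splits - true_splits) / len(est_splits) if est_splits else 0.0
    return fn, fp


# Hypothetical five-leaf example (not data from the talk).
true_tree = (("A", "B"), ("C", ("D", "E")))
estimated_tree = (("A", "C"), ("B", ("D", "E")))
print(fn_fp_rates(true_tree, estimated_tree, list("ABCDE")))  # (0.5, 0.5)
```

In this example each tree has two internal edges and shares only one with the other, so both rates are 0.5, which corresponds to a 50% error rate.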
10 Modelling language evolution
- Models of evolution allow reconstruction methods to be evaluated in simulation. This allows us to understand the conditions under which each method will perform well.
- Models of evolution (for simulation purposes) need to reflect good scholarship, and should be able to reproduce the properties of real data.
- Models of evolution are also present in estimation methods, whether explicitly (as in ML or Bayesian) or implicitly.
11 Issues in modelling language evolution
- Character evolution model.
- Variation between characters.
- Cladogenesis model: tree vs. network vs. dialect continuum?
12 Modelling the evolution of single linguistic characters
- Types of linguistic characters:
  - Phonological (sound changes)
  - Lexical (meanings based on a wordlist)
  - Morphological (especially inflectional)
- Modelling issues: state space, lexical clock, homoplasy, and polymorphism
  - Easy: the lexical clock is not believed, and most linguistic characters have an infinite number of possible states.
  - More interesting: homoplasy, polymorphism, and variation between characters.
13 Homoplasy-free evolution
- When a character changes state, it changes to a new state not in the tree
- In other words, there is no homoplasy (character reversal or parallel evolution)
- First inferred for weird innovations in phonological characters and morphological characters in the 19th century.
[Figure: tree with nodes labelled by character states 0 and 1.]
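For simulation purposes, homoplasy-free evolution can be sketched by drawing a brand-new state every time a character changes, so that no state ever arises twice. This is an illustrative sketch only: the nested-tuple tree encoding, the per-edge change probability `change_prob`, and the function name `evolve_homoplasy_free` are assumptions, not part of any model described in the talk.

```python
import itertools
import random


def evolve_homoplasy_free(tree, change_prob, rng):
    """Evolve one character down `tree` without homoplasy.

    `tree` is a nested tuple of leaf names. On each edge the character
    changes with probability `change_prob`, and every change produces a
    state that has never been used before.
    Returns a dict mapping each leaf name to its state.
    """
    fresh_states = itertools.count(1)  # state 0 is reserved for the root
    leaf_states = {}

    def descend(node, state):
        if isinstance(node, str):
            leaf_states[node] = state
            return
        for child in node:
            child_state = state
            if rng.random() < change_prob:
                child_state = next(fresh_states)  # brand-new state
            descend(child, child_state)

    descend(tree, 0)
    return leaf_states


# Hypothetical four-leaf tree; the seed only makes the run repeatable.
tree = (("A", "B"), ("C", "D"))
print(evolve_homoplasy_free(tree, 0.3, random.Random(1)))
```

Because new states come from a counter that never repeats, reversal and parallel evolution are impossible by construction.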
14 Lexical characters can also evolve without homoplasy
- For every cognate class, the nodes of the tree in that class should form a connected subset, as long as there is no undetected borrowing nor parallel semantic shift.
- However, our research suggests that 15% of lexical characters evolve homoplastically.
[Figure: tree with nodes labelled by character states 0, 1, and 2.]
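Whether a character could have evolved without homoplasy on a given tree can be checked from the leaf states alone. This sketch uses Fitch's parsimony algorithm (a standard technique swapped in here, not a method from the talk). A character with r states is homoplasy-free on a tree exactly when the minimum number of changes is r - 1. The sketch assumes a rooted binary tree encoded as nested tuples and one state per language (no polymorphism); the function names are hypothetical.

```python
def fitch_score(tree, leaf_states):
    """Minimum number of state changes on a rooted binary tree (Fitch)."""
    changes = 0

    def state_set(node):
        nonlocal changes
        if isinstance(node, str):
            return {leaf_states[node]}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:
            return common
        changes += 1  # the two subtrees disagree: one change is needed
        return left | right

    state_set(tree)
    return changes


def is_homoplasy_free(tree, leaf_states):
    """True if every state can occupy a connected part of the tree."""
    n_states = len(set(leaf_states.values()))
    return fitch_score(tree, leaf_states) == n_states - 1


# Hypothetical cognate-class assignments on a four-language tree.
tree = (("A", "B"), ("C", "D"))
print(is_homoplasy_free(tree, {"A": 1, "B": 1, "C": 0, "D": 0}))  # True
print(is_homoplasy_free(tree, {"A": 1, "B": 0, "C": 1, "D": 0}))  # False
```

In the second example, state 1 appears on both sides of the tree and state 0 does too, so at least two changes are needed for two states, which implies homoplasy.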
15 Polymorphism
- Polymorphism means two or more states exhibited by the same language for a character.
- Most common examples are lexical: two or more words for the same basic meaning. Examples: big/large, little/small, rock/stone.
- Lexical polymorphism results primarily from semantic shift, but polymorphism due to borrowing also occurs.
- Incidence: lexical polymorphism is very common but transient (almost all polymorphisms are lost within a millennium). Less frequent for other types of characters.
16 Modelling variation between characters: Rates-across-sites
- If a site (i.e., character) is twice as fast as another on one edge, it is twice as fast everywhere.
[Figure: two trees on leaves A, B, C, and D.]
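The rates-across-sites assumption can be written as a product: the expected amount of change for a site on an edge is the site's rate times the edge's length, so the ratio between two sites is the same on every edge. A minimal sketch follows; the function name and the example rates and edge lengths are made up for illustration.

```python
def rates_across_sites(site_rates, edge_lengths):
    """Expected change per (site, edge): site rate times edge length."""
    return {
        (site, edge): rate * length
        for site, rate in site_rates.items()
        for edge, length in edge_lengths.items()
    }


# Hypothetical values: one character evolves twice as fast as the other.
site_rates = {"fast": 2.0, "slow": 1.0}
edge_lengths = {"e1": 0.1, "e2": 0.4}
expected = rates_across_sites(site_rates, edge_lengths)

for edge in edge_lengths:
    ratio = expected[("fast", edge)] / expected[("slow", edge)]
    print(edge, ratio)  # the ratio is 2 on every edge
```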
17 Modelling variation between characters: The no-common-mechanism model
- In this model, there is a separate random variable for every combination of site and edge. The underlying tree is fixed, but otherwise there are no constraints on variation between sites.
[Figure: two trees on leaves A, B, C, and D.]
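Under no common mechanism, each (site, edge) pair gets its own independently drawn parameter, so knowing how fast one character is on one edge says nothing about any other edge. The sketch below is illustrative only: the uniform distribution, the `max_rate` bound, and the function name are assumptions, since the slide does not specify how the random variables are distributed.

```python
import random


def no_common_mechanism(sites, edges, rng, max_rate=1.0):
    """Draw an independent parameter for every (site, edge) pair.

    The uniform distribution is an arbitrary choice for illustration.
    """
    return {
        (site, edge): rng.uniform(0.0, max_rate)
        for site in sites
        for edge in edges
    }


# Hypothetical sites and edges; the seed only makes the run repeatable.
params = no_common_mechanism(["s1", "s2"], ["e1", "e2", "e3"], random.Random(0))
for key, value in sorted(params.items()):
    print(key, round(value, 3))
```

Only the tree (the set of edges) is shared between sites; the number of free parameters grows as the number of sites times the number of edges.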
18 Homoplasy-free models without polymorphism
- The earliest models were all tree models that were homoplasy-free and obeyed the lexical clock.
- Ringe-Warnow PP (perfect phylogeny, i.e., a homoplasy-free, no-common-mechanism, non-parametric tree model)
19 Cladogenesis
- The speciation model ranges from trees all the way to dialect continuums. Intermediate models include horizontal transfer (borrowing) and hybridization (creoles).
20 Modelling borrowing: Networks and Trees within Networks
21 Perfect Phylogenetic Network model
- Nakhleh et al. Perfect Phylogenetic Network (PPN) model: all characters evolve without homoplasy down a tree contained within the network. Published in Language, 2005.
- Warnow-Evans-Ringe-Nakhleh (2004) extends the PPN model to allow for limited and identifiable homoplasy.
22 Perfect Phylogenetic Network for IE (Nakhleh et al., Language 2005)
23 What about polymorphism?
- Our first model of polymorphism (Bonet et al., 1996) was a non-parametric, no-common-mechanism model for homoplasy-free characters, with polymorphism due to semantic shift.
- Three problems: (1) because it is non-parametric, it cannot be used for simulation; (2) homoplasy is fairly frequent for lexical characters (15% of characters); (3) what about polymorphism due to borrowing?
24 Nicholls and Gray model for polymorphism
- Geoff Nicholls and Russell Gray (2006): Homoplasy-free, rates-across-sites, parametric model in which the character adds and loses states under a stochastic process. The number of states in a lineage can go up and down (including down to 0 and then back up).
- Problems: (1) homoplasy is frequent in lexical characters; (2) what is the linguistic process?
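The idea that a lineage gains and loses states over time, and can drop to zero states before gaining one again, can be illustrated with a simple discrete-time gain/loss process. This is a loose sketch and not the published model: the discrete time steps, the parameters `gain_prob` and `loss_prob`, and the function name are all assumptions chosen for illustration.

```python
import random


def simulate_state_count(steps, gain_prob, loss_prob, rng, start=1):
    """Track how many states one lineage holds for a character over time.

    At each step at most one new state is gained (with probability
    `gain_prob`), and every state currently held is lost independently
    with probability `loss_prob`.
    """
    count = start
    history = [count]
    for _ in range(steps):
        if rng.random() < gain_prob:
            count += 1
        count -= sum(rng.random() < loss_prob for _ in range(count))
        history.append(count)
    return history


# Hypothetical parameters; the seed only makes the run repeatable.
print(simulate_state_count(20, 0.2, 0.2, random.Random(3)))
```

A count above 1 corresponds to polymorphism in that lineage, and a count of 0 to a meaning with no state at all, which is one reason to ask what linguistic process such a model represents.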
25 What needs to be done in modelling
- We need parametric models of character evolution that include reasonable levels of homoplasy, in which polymorphism arises due to semantic shift (conflation of two characters), by borrowing, or due to other linguistic processes.
- We also need cladogenesis models that incorporate population-level processes, and can represent dialect continuums.
26 Simulation study (Barbancon et al.)
- Simulated evolution down networks with 30 leaves, three contact edges, and with moderate levels of homoplasy and borrowing for 300 lexical characters and 60 morphological characters.
- Compared trees constructed by various methods to the genetic tree contained in the network, for topological accuracy.
- Methods compared: NJ, UPGMA, weighted and unweighted MP and MC.
27 Standard Model Conditions
- Screened dataset
  - Lexical characters: 4% homoplastic, 10% evolve with borrowing
  - Morphological characters: no homoplasy nor borrowing
- Unscreened dataset
  - Lexical characters: 20% homoplastic, 20% borrowed
  - Morphological characters: 5% homoplastic, no borrowing
- Molecular clock for the cladogenesis model
- No-common-mechanism model with moderate variation between characters
- Lexical weight = 1, morphological weight = 50
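The model conditions above can be collected into a small configuration object, which makes it easy to see how many characters of each kind a simulated dataset would contain. This is a sketch under stated assumptions: the homoplasy and borrowing figures are read as percentages, the character counts (300 lexical, 60 morphological), 30 leaves, and three contact edges come from the simulation study slide, and all class, field, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharacterClass:
    count: int                   # number of characters of this type
    homoplastic_fraction: float  # fraction evolving with homoplasy
    borrowed_fraction: float     # fraction evolving with borrowing
    weight: int                  # weight used by weighted MP / weighted MC


@dataclass(frozen=True)
class ModelCondition:
    name: str
    lexical: CharacterClass
    morphological: CharacterClass
    leaves: int = 30
    contact_edges: int = 3


SCREENED = ModelCondition(
    "screened",
    lexical=CharacterClass(300, 0.04, 0.10, weight=1),
    morphological=CharacterClass(60, 0.0, 0.0, weight=50),
)
UNSCREENED = ModelCondition(
    "unscreened",
    lexical=CharacterClass(300, 0.20, 0.20, weight=1),
    morphological=CharacterClass(60, 0.05, 0.0, weight=50),
)


def expected_homoplastic(condition):
    """Expected number of homoplastic characters under a condition."""
    total = sum(
        chars.count * chars.homoplastic_fraction
        for chars in (condition.lexical, condition.morphological)
    )
    return round(total)


print(expected_homoplastic(SCREENED))    # 12
print(expected_homoplastic(UNSCREENED))  # 63
```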
28 Simulation study (cartoon)
[Figure: simulation-study cartoon. FN: false negative (missing edge); FP: false positive (incorrect edge); 50% error rate.]
29 Clocklike data
30 Clocklike data
31 Points
- Screening the data helps to improve the phylogenetic accuracy of most methods.
- When data are generated under network models, methods which reconstruct trees do not perform well.
- Modelling helps us predict the conditions under which different methods will perform well, or poorly. The more accurate the models, the more relevant the predictions. We need better models!
32 Future research
- Testing other methods in simulation (including some network construction methods)
- Formulating improved (more realistic) models of language evolution
- Implementing simulation tools under these improved models
- Developing estimation methods under these improved models
- Reanalyzing IE, and looking at some new families (or subfamilies)
33 Acknowledgements
- Funding: NSF, the David and Lucile Packard Foundation, the Radcliffe Institute for Advanced Study, the Program for Evolutionary Dynamics at Harvard, and the Institute for Cellular and Molecular Biology at UT-Austin.
- Collaborators: Don Ringe, Steve Evans, Luay Nakhleh, and Francois Barbancon.
34 For more information
- Please see the Computational Phylogenetics for Historical Linguistics web site for papers, data, and additional material: http://www.cs.rice.edu/~nakhleh/CPHL
35 Differences between characters
- Lexical: most easily borrowed (most borrowings detectable), and homoplasy relatively frequent (we estimate about 25-30% overall for our wordlist, but a much smaller percentage for basic vocabulary). Also, lexical characters have a high incidence (80%) of transient polymorphism.
- Phonological: can still be borrowed, but much less likely than lexical. Complex phonological characters are infrequently (if ever) homoplastic, although simple phonological characters are very often homoplastic.
- Morphological: least easily borrowed, least likely to be homoplastic. Rarely polymorphic.