Modelling language evolution

1
Modelling language evolution

Tandy Warnow
The University of Texas at Austin

2
Species phylogeny
From the Tree of Life Website, University of
Arizona
[Figure: phylogenetic tree with leaves Orangutan, Human, Gorilla, Chimpanzee]
3
Possible Indo-European tree (Ringe, Warnow and
Taylor 2000)
4
Controversies for Indo-European history

Subgrouping: Other than the 10 major subgroups,
what is likely to be true? In particular, what
about
Italo-Celtic,
Greco-Armenian,
Anatolian + Tocharian,
Satem Core?

5
This talk

Empirical evidence of how estimated phylogenies
depend upon both the data and the method - and
can be wrong
Models of language evolution (from the earliest
ones to more recent ones), why we need them, and
what we still need to do. Note: simulations and
estimation methods both depend upon model
assumptions!
Results of simulation studies based upon some new
models
Comments

6
Nakhleh et al., Transactions of the Philological
Society 2005

Methods studied: UPGMA (lexico-statistics),
neighbor joining (NJ), maximum parsimony (MP),
maximum compatibility (MC), weighted MP, weighted
MC, and Gray & Atkinson.
Datasets: four versions of the Ringe & Taylor IE
data (lexical, morphological, and phonological
characters): lexical only vs. all, screened vs.
unscreened
Observations:
UPGMA (lexico-statistics) does the worst: it
splits known subgroups.
Other than UPGMA, all methods reconstruct the ten
major subgroups, Anatolian + Tocharian, and
Greco-Armenian. Nothing else is consistently
reconstructed.
When using lexical data only, all methods group
Italic, Celtic, and Germanic together.
Some methods (not all) will reconstruct different
trees on different datasets. Screening datasets
to remove obvious homoplasy can result in better
(?) trees.

7
Question: how to determine which phylogenies are
reliable?

Data: need high quality data!
Phylogenetic reconstruction methods need to be
tested before being trusted! Examples of
possible tests:
Benchmark real datasets (need good benchmarks!
Are there any?)
Simulated datasets (need good models!)

8
Simulation study (cartoon)
9
Simulation study (cartoon)
FN: false negative (missing edge)
FP: false positive (incorrect edge)
50% error rate
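The FN and FP rates in the cartoon can be computed by comparing the leaf bipartitions induced by the internal edges of the true tree and of the estimated tree. The following is a minimal sketch, assuming trees are encoded as nested tuples of leaf names; the function names and the five-leaf example are illustrative and are not taken from the talk.

```python
def leaf_set(tree):
    """Return the set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree, leaves):
    """Collect the non-trivial leaf bipartitions induced by internal edges."""
    reference = min(leaves)  # fixes which side of each split is stored
    splits = set()

    def visit(node):
        below = leaf_set(node)
        # An edge is internal only if both sides hold at least two leaves.
        if len(below) >= 2 and len(leaves - below) >= 2:
            splits.add(below if reference not in below else leaves - below)
        if not isinstance(node, str):
            for child in node:
                visit(child)

    visit(tree)
    return splits


def error_rates(true_tree, estimated_tree):
    """Return (FN rate, FP rate) of the estimated tree against the true tree."""
    leaves = leaf_set(true_tree)
    true_splits = bipartitions(true_tree, leaves)
    est_splits = bipartitions(estimated_tree, leaves)
    fn = len(true_splits - est_splits) / len(true_splits) if true_splits else 0.0
    fp = len(est_splits - true_splits) / len(est_splits) if est_splits else 0.0
    return fn, fp


# Hypothetical example: one of two internal edges is wrong in the estimate.
true_tree = (("A", "B"), "C", ("D", "E"))
estimated_tree = (("A", "C"), "B", ("D", "E"))
print(error_rates(true_tree, estimated_tree))
```

In this example each tree has two internal edges and they share one, so both rates come out at 50%.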
10
Modelling language evolution

Models of evolution allow reconstruction methods
to be evaluated in simulation. This allows us to
understand the conditions under which each method
will perform well.
Models of evolution (for simulation purposes)
need to reflect good scholarship, and should be
able to reproduce the properties of real data.
Models of evolution are also present in
estimation methods, whether explicitly (as in ML
or Bayesian) or implicitly.

11
Issues in modelling language evolution

Character evolution model.
Variation between characters.
Cladogenesis model: tree vs. network vs. dialect
continuum?

12
Modelling the evolution of single linguistic
characters

Types of linguistic characters:
Phonological (sound changes)
Lexical (meanings based on a wordlist)
Morphological (especially inflectional)
Modelling issues: state space, lexical clock,
homoplasy, and polymorphism
Easy: lexical clock not believed, and most
linguistic characters have an infinite number of
possible states.
More interesting: homoplasy, polymorphism, and
variation between characters.

13
Homoplasy-free evolution

When a character changes state, it changes to a
new state not in the tree
In other words, there is no homoplasy (character
reversal or parallel evolution)
First inferred for weird innovations in
phonological characters and morphological
characters in the 19th century.

[Figure: example tree whose nodes are labelled with character states 0 and 1]
14
Lexical characters can also evolve without
homoplasy

For every cognate class, the nodes of the tree in
that class should form a connected subset, as
long as there is no undetected borrowing or
parallel semantic shift.
However, our research suggests that 15% of
lexical characters evolve homoplastically.

[Figure: example tree whose nodes are labelled with character states 0, 1, and 2]
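The connected-subset condition can be checked directly when a state is known for every node of the tree, internal nodes included. The sketch below is a minimal illustration under that assumption; the edge list, node names, and state labels are hypothetical. It relies on the fact that, in a tree, a set of k nodes is connected exactly when k - 1 tree edges have both endpoints in the set.

```python
from collections import Counter


def is_homoplasy_free(edges, state_of):
    """Check that the nodes sharing each state form a connected subtree.

    edges: list of (parent, child) pairs describing a tree.
    state_of: dict mapping every node (internal and leaf) to its state.
    """
    nodes_per_state = Counter(state_of.values())
    edges_within_state = Counter()
    for parent, child in edges:
        if state_of[parent] == state_of[child]:
            edges_within_state[state_of[parent]] += 1
    # Connected in a tree <=> (number of nodes) - 1 edges stay inside the set.
    return all(edges_within_state[state] == count - 1
               for state, count in nodes_per_state.items())


# Hypothetical four-leaf tree: root r, internal nodes x and y.
edges = [("r", "x"), ("r", "y"), ("x", "a"), ("x", "b"), ("y", "c"), ("y", "d")]

# State 1 arises once, on the edge leading to y: no homoplasy.
clean = {"r": 0, "x": 0, "y": 1, "a": 0, "b": 0, "c": 1, "d": 1}
# State 1 arises twice, at leaves a and c: parallel evolution.
parallel = {"r": 0, "x": 0, "y": 0, "a": 1, "b": 0, "c": 1, "d": 0}

print(is_homoplasy_free(edges, clean), is_homoplasy_free(edges, parallel))
```

When only the leaf states are observed, the question becomes whether some labelling of the internal nodes passes this check, which is the harder perfect phylogeny problem and is not attempted here.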
15
Polymorphism

Polymorphism means two or more states exhibited
by the same language for a character.
Most common examples are lexical: two or more
words for the same basic meaning. Examples:
big/large, little/small, rock/stone.
Lexical polymorphism results primarily from
semantic shift, but polymorphism due to borrowing
also occurs.
Incidence: lexical polymorphism is very common
but transient (almost all polymorphisms lost
within a millennium). Less frequent for other
types of characters.

16
Modelling variation between characters:
Rates-across-sites

If a site (i.e., character) is twice as fast as
another on one edge, it is twice as fast
everywhere.

[Figure: two trees on leaves A, B, C, D]
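One way to read the rates-across-sites assumption is that each site has a single rate multiplier applied to a shared set of edge lengths. The sketch below illustrates this reading; the edge names, base lengths, and rate values are made up for the example.

```python
import math

# Hypothetical shared edge lengths for a four-leaf tree.
base_edge_lengths = {"A": 0.1, "B": 0.3, "C": 0.2, "D": 0.4, "internal": 0.05}

# Hypothetical per-site rate multipliers.
site_rates = {"slow_site": 1.0, "fast_site": 2.0}


def site_edge_lengths(base, rate):
    """Scale every shared edge length by one site-specific rate."""
    return {edge: rate * length for edge, length in base.items()}


slow = site_edge_lengths(base_edge_lengths, site_rates["slow_site"])
fast = site_edge_lengths(base_edge_lengths, site_rates["fast_site"])

# The ratio between the two sites is the same on every edge.
ratios = {edge: fast[edge] / slow[edge] for edge in base_edge_lengths}
print(ratios)
```

Because the multiplier is per site rather than per edge, a site that is twice as fast on one edge is twice as fast on all of them.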
17
Modelling variation between characters: The
no-common-mechanism model

In this model, there is a separate random
variable for every combination of site and edge -
the underlying tree is fixed, but otherwise there
are no constraints on variation between sites.

[Figure: two trees on leaves A, B, C, D]
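The no-common-mechanism idea can be illustrated by drawing an independent parameter for every (site, edge) pair on a fixed tree. In the sketch below the uniform distribution, the edge names, and the function name are assumptions made only for illustration; the model itself places no such constraint on the parameters.

```python
import random


def no_common_mechanism_parameters(edges, n_sites, rng, max_length=1.0):
    """Draw one independent parameter per (site, edge) combination.

    The uniform draw is an illustrative assumption, not part of the model.
    """
    return [{edge: rng.uniform(0.0, max_length) for edge in edges}
            for _ in range(n_sites)]


# Hypothetical edges of a fixed four-leaf tree.
edges = ["A", "B", "C", "D", "internal"]
rng = random.Random(0)  # seeded so the sketch is reproducible
parameters = no_common_mechanism_parameters(edges, n_sites=3, rng=rng)

for site_index, site in enumerate(parameters):
    print(site_index, {edge: round(value, 2) for edge, value in site.items()})
```

Unlike the rates-across-sites setting, nothing ties the value of one site on an edge to its value on any other edge.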
18
Homoplasy-free models without polymorphism

The earliest models were all tree models that
were homoplasy-free and obeyed the lexical clock.
Ringe-Warnow PP (perfect phylogeny - i.e.,
homoplasy-free, no common mechanism,
non-parametric tree model)

19
Cladogenesis

The speciation model ranges from trees all the
way to dialect continuums. Intermediate models
include horizontal transfer (borrowing) and
hybridization (creoles).

20
Modelling borrowing: Networks and Trees within
Networks

21
Perfect Phylogenetic Network model

Nakhleh et al. Perfect Phylogenetic Network (PPN)
model: all characters evolve without homoplasy
down a tree contained within the network.
Published in Language, 2005.
Warnow-Evans-Ringe-Nakhleh (2004) extends the PPN
model to allow for limited and identifiable
homoplasy.
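To make "a tree contained within the network" concrete, the sketch below enumerates such trees for a small directed network by keeping exactly one incoming edge for every node. This is only an illustration under a simplified encoding (a dictionary from each node to its list of possible parents); the node names are hypothetical and the encoding is not the one used in the PPN papers.

```python
from itertools import product


def trees_in_network(parents):
    """Yield the edge set of each tree obtained by choosing one parent per node.

    parents: dict mapping node -> list of parent nodes (empty for the root).
    """
    nodes = list(parents)
    # The root has no parent, so give it a single placeholder choice.
    choices = [parents[node] or [None] for node in nodes]
    for picked in product(*choices):
        yield {(parent, node)
               for parent, node in zip(picked, nodes)
               if parent is not None}


# Hypothetical network: leaf b has a genetic parent x and a contact parent y.
network = {
    "root": [],
    "x": ["root"],
    "y": ["root"],
    "a": ["x"],
    "b": ["x", "y"],
    "c": ["y"],
}

contained_trees = list(trees_in_network(network))
for tree in contained_trees:
    print(sorted(tree))
```

Under this reading, a character is compatible with the network if it evolves without homoplasy on at least one of the enumerated trees.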

22
Perfect Phylogenetic Network for IE (Nakhleh et
al., Language 2005)
23
What about polymorphism?

Our first model of polymorphism (Bonet et al.,
1996) was a non-parametric, no-common-mechanism
model for homoplasy-free characters, with
polymorphism due to semantic shift.
Three problems: (1) because it is non-parametric,
it cannot be used for simulation; (2) homoplasy is
fairly frequent for lexical characters (15% of
characters); (3) what about polymorphism due to
borrowing?

24
Nicholls and Gray model for polymorphism

Geoff Nicholls and Russell Gray (2006):
Homoplasy-free, rates-across-sites, parametric
model in which the character adds and loses
states under a stochastic process. The number of
states in a lineage can go up and down (including
down to 0 and then back up).
Problems: (1) homoplasy is frequent in lexical
characters; (2) what is the linguistic process?
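The behaviour described above, where the number of states in a lineage rises and falls and can pass through 0, can be illustrated with a simple birth-death simulation along one lineage. This is a generic sketch and not the authors' parameterization: the constant birth rate, the per-state loss rate, and all names are assumptions.

```python
import random


def simulate_state_count(birth_rate, loss_rate, duration, rng, start=1):
    """Simulate how many states a lineage holds for one character over time.

    New states arise at a constant rate (birth_rate > 0); each state
    currently present is lost independently at rate loss_rate.
    Returns a list of (time, count) pairs, one per event.
    """
    time, count = 0.0, start
    history = [(time, count)]
    while True:
        total_rate = birth_rate + loss_rate * count
        time += rng.expovariate(total_rate)  # waiting time to the next event
        if time >= duration:
            break
        if rng.random() < birth_rate / total_rate:
            count += 1  # a new state is added
        else:
            count -= 1  # one existing state is lost
        history.append((time, count))
    return history


rng = random.Random(1)  # seeded so the sketch is reproducible
history = simulate_state_count(birth_rate=0.5, loss_rate=1.0,
                               duration=50.0, rng=rng)
print(min(count for _, count in history), max(count for _, count in history))
```

When the count reaches 0 only the birth event has positive rate, so the count can recover but never becomes negative.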

25
What needs to be done in modelling

We need parametric models of character evolution
that include reasonable levels of homoplasy, in
which polymorphism arises due to semantic shift
(conflation of two characters), by borrowing, or
due to other linguistic processes.
We also need cladogenesis models that incorporate
population-level processes, and can represent
dialect continuums.

26
Simulation study (Barbancon et al.)

Simulated evolution down networks with 30 leaves,
three contact edges, and with moderate levels of
homoplasy and borrowing for 300 lexical
characters and 60 morphological characters.
Compared trees constructed by various methods to
the genetic tree contained in the network, for
topological accuracy.
Methods compared: NJ, UPGMA, weighted and
unweighted MP and MC.

27
Standard Model Conditions

Screened dataset:
Lexical characters: 4% homoplastic, 10% evolve
with borrowing
Morphological characters: no homoplasy or
borrowing
Unscreened dataset:
Lexical characters: 20% homoplastic, 20% borrowed
Morphological characters: 5% homoplastic, no
borrowing
Molecular clock for the cladogenesis model
No-common-mechanism model with moderate variation
between characters
Lexical weight = 1, morphological weight = 50
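These conditions can be written down as a small configuration, which makes the implied character counts easy to check. The sketch below assumes the figures on the slide are percentages and uses the 300 lexical and 60 morphological characters from the simulation study slide; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharacterConditions:
    n_characters: int
    homoplastic_pct: int  # assumed to be a percentage
    borrowed_pct: int     # assumed to be a percentage
    weight: int

    def expected_homoplastic(self):
        return self.n_characters * self.homoplastic_pct // 100

    def expected_borrowed(self):
        return self.n_characters * self.borrowed_pct // 100

    def total_weight(self):
        return self.n_characters * self.weight


SCREENED = {
    "lexical": CharacterConditions(300, 4, 10, 1),
    "morphological": CharacterConditions(60, 0, 0, 50),
}
UNSCREENED = {
    "lexical": CharacterConditions(300, 20, 20, 1),
    "morphological": CharacterConditions(60, 5, 0, 50),
}

for name, dataset in (("screened", SCREENED), ("unscreened", UNSCREENED)):
    for kind, cond in dataset.items():
        print(name, kind, cond.expected_homoplastic(),
              cond.expected_borrowed(), cond.total_weight())
```

With these weights the 60 morphological characters carry a total weight of 3000 against 300 for the lexical characters, so weighted methods are dominated by morphology.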

28
Simulation study (cartoon)
FN: false negative (missing edge)
FP: false positive (incorrect edge)
50% error rate
29
Clocklike data
30
Clocklike data
31
Points

Screening the data helps to improve the
phylogenetic accuracy of most methods.
When data are generated under network models,
methods which reconstruct trees do not perform
well.
Modelling helps us predict the conditions under
which different methods will perform well, or
poorly. The more accurate the models, the more
relevant the predictions. We need better models!

32
Future research

Testing other methods in simulation (including
some network construction methods)
Formulating improved (more realistic) models of
language evolution
Implementing simulation tools under these
improved models
Developing estimation methods under these
improved models
Reanalyzing IE, and looking at some new families
(or subfamilies)

33
Acknowledgements

Funding: NSF, the David and Lucile Packard
Foundation, the Radcliffe Institute for Advanced
Studies, The Program for Evolutionary Dynamics at
Harvard, and the Institute for Cellular and
Molecular Biology at UT-Austin.
Collaborators: Don Ringe, Steve Evans, Luay
Nakhleh, and Francois Barbancon.

34
For more information

Please see the Computational Phylogenetics for
Historical Linguistics web site for papers, data,
and additional material:
http://www.cs.rice.edu/~nakhleh/CPHL

35
Differences between characters

Lexical: most easily borrowed (most borrowings
detectable), and homoplasy relatively frequent
(we estimate about 25-30% overall for our
wordlist, but a much smaller percentage for
basic vocabulary). Also, lexical characters have
a high incidence (80%) of transient polymorphism.
Phonological: can still be borrowed but much less
likely than lexical. Complex phonological
characters are infrequently (if ever)
homoplastic, although simple phonological
characters are very often homoplastic.
Morphological: least easily borrowed, least
likely to be homoplastic. Rarely polymorphic.