Modelling language evolution - PowerPoint PPT Presentation

1
Modelling language evolution
  • Tandy Warnow
  • The University of Texas at Austin

2
Species phylogeny
From the Tree of Life website, University of
Arizona
Orangutan
Human
Gorilla
Chimpanzee
3
Possible Indo-European tree (Ringe, Warnow and
Taylor 2000)
4
Controversies for Indo-European history
  • Subgrouping: Other than the 10 major subgroups,
    what is likely to be true? In particular, what
    about
  • Italo-Celtic,
  • Greco-Armenian,
  • Anatolian + Tocharian,
  • Satem Core?

5
This talk
  • Empirical evidence of how estimated phylogenies
    depend upon both the data and the method - and
    can be wrong
  • Models of language evolution (from the earliest
    ones to more recent ones), why we need them, and
    what we still need to do. Note: simulations and
    estimation methods both depend upon model
    assumptions!
  • Results of simulation studies based upon some new
    models
  • Comments

6
Nakhleh et al., Transactions of the Philological
Society 2005
  • Methods studied: UPGMA (lexico-statistics),
    Neighbor joining, maximum parsimony, maximum
    compatibility, weighted MP, weighted MC, and
    Gray & Atkinson.
  • Datasets: Four versions of the Ringe & Taylor IE
    data (lexical, morphological, and phonological
    characters): lexical only vs. all, screened vs.
    unscreened
  • Observations:
  • UPGMA (lexico-statistics) does the worst - it
    splits known subgroups.
  • Other than UPGMA, all methods reconstruct the ten
    major subgroups, Anatolian + Tocharian, and
    Greco-Armenian. Nothing else is consistently
    reconstructed.
  • When using lexical data only, all methods group
    Italic, Celtic, and Germanic together.
  • Some methods (not all) will reconstruct different
    trees on different datasets. Screening datasets
    to remove obvious homoplasy can result in better
    (?) trees.

7
Question: how to determine which phylogenies are
reliable?
  • Data: need high-quality data!
  • Phylogenetic reconstruction methods need to be
    tested before being trusted! Examples of
    possible tests:
  • Benchmark real datasets (need good benchmarks!
    Are there any?)
  • Simulated datasets (need good models!)

8
Simulation study (cartoon)
9
Simulation study (cartoon)
FN: false negative (missing edge); FP: false
positive (incorrect edge); 50% error rate
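The FN and FP rates on this slide can be sketched in code. This is a minimal illustration, assuming that edges are compared as bipartitions of the leaf set (a common convention, not stated on the slide). The function names and the two five-leaf trees are hypothetical.

```python
from typing import FrozenSet, Iterable, Set, Tuple


def normalize(side: Iterable[str], leaves: Set[str]) -> FrozenSet[str]:
    """Store each bipartition as the side that excludes a fixed reference leaf."""
    side = frozenset(side)
    reference = min(leaves)
    return frozenset(leaves) - side if reference in side else side


def error_rates(true_edges, est_edges, leaves) -> Tuple[float, float]:
    """Return (FN rate, FP rate) for two trees given as lists of leaf bipartitions."""
    true_set = {normalize(s, leaves) for s in true_edges}
    est_set = {normalize(s, leaves) for s in est_edges}
    # FN: true edges missing from the estimate; FP: estimated edges not in the true tree.
    fn = len(true_set - est_set) / len(true_set) if true_set else 0.0
    fp = len(est_set - true_set) / len(est_set) if est_set else 0.0
    return fn, fp


leaves = {"A", "B", "C", "D", "E"}
true_edges = [{"A", "B"}, {"D", "E"}]  # internal edges of ((A,B),C,(D,E))
est_edges = [{"A", "B"}, {"C", "D"}]   # internal edges of ((A,B),E,(C,D))
fn_rate, fp_rate = error_rates(true_edges, est_edges, leaves)
print(fn_rate, fp_rate)
```

In this toy case one of the two true internal edges is missed and one of the two estimated edges is wrong, so both rates are 0.5.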
10
Modelling language evolution
  • Models of evolution allow reconstruction methods
    to be evaluated in simulation. This allows us to
    understand the conditions under which each method
    will perform well.
  • Models of evolution (for simulation purposes)
    need to reflect good scholarship, and should be
    able to reproduce the properties of real data.
  • Models of evolution are also present in
    estimation methods, whether explicitly (as in ML
    or Bayesian) or implicitly.

11
Issues in modelling language evolution
  • Character evolution model.
  • Variation between characters.
  • Cladogenesis model: tree vs. network vs. dialect
    continuum?

12
Modelling the evolution of single linguistic
characters
  • Types of linguistic characters
  • Phonological (sound changes)
  • Lexical (meanings based on a wordlist)
  • Morphological (especially inflectional)
  • Modelling issues: state space, lexical clock,
    homoplasy, and polymorphism
  • Easy: lexical clock not believed, and most
    linguistic characters have an infinite number of
    possible states.
  • More interesting: homoplasy, polymorphism, and
    variation between characters.

13
Homoplasy-free evolution
  • When a character changes state, it changes to a
    new state not in the tree
  • In other words, there is no homoplasy (character
    reversal or parallel evolution)
  • First inferred for weird innovations in
    phonological characters and morphological
    characters in the 19th century.

[Figure: tree with nodes labelled by character state (0 or 1)]
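Homoplasy-free evolution can be sketched as a small simulation. This is an illustration only: the tree, the per-edge change probability `p_change`, and the function name are assumptions, not part of the talk. Every change draws a state that has never been used before, so no state can arise twice.

```python
import random
from itertools import count


def evolve_homoplasy_free(children, root, p_change, rng):
    """Evolve one character down a rooted tree; each change yields a brand-new state."""
    fresh = count(1)          # supply of states not yet in the tree
    state = {root: 0}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            changed = rng.random() < p_change
            state[child] = next(fresh) if changed else state[node]
            stack.append(child)
    return state


# Hypothetical rooted tree: root -> x, y; x -> A, B; y -> C, D
children = {"root": ["x", "y"], "x": ["A", "B"], "y": ["C", "D"]}
state = evolve_homoplasy_free(children, "root", p_change=0.4, rng=random.Random(1))
print(state)
```

Because every change introduces a fresh state, the number of edges on which the character changes always equals the number of distinct states minus one.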
14
Lexical characters can also evolve without
homoplasy
  • For every cognate class, the nodes of the tree in
    that class should form a connected subset - as
    long as there is no undetected borrowing nor
    parallel semantic shift.
  • However, our research suggests that 15% of
    lexical characters evolve homoplastically.

[Figure: tree with nodes labelled by cognate class (0, 1, or 2)]
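The connected-subset condition can be checked directly when a state is known for every node of the tree. The sketch below assumes such a full labelling (in practice internal states are unknown and must be inferred); the tree and labellings are hypothetical. It relies on the fact that a set of k nodes in a tree is connected exactly when k - 1 tree edges lie inside it.

```python
from collections import Counter


def is_homoplasy_free(edges, state):
    """True if, for every state, the nodes in that state form a connected subtree."""
    nodes_per_state = Counter(state.values())
    # Count tree edges whose two endpoints share a state.
    internal_edges = Counter(state[u] for u, v in edges if state[u] == state[v])
    return all(internal_edges[s] == n - 1 for s, n in nodes_per_state.items())


edges = [("r", "x"), ("r", "y"), ("x", "A"), ("x", "B"), ("y", "C"), ("y", "D")]
compatible = {"r": 0, "x": 1, "A": 1, "B": 1, "y": 0, "C": 0, "D": 0}
homoplastic = {"r": 0, "x": 0, "A": 1, "B": 0, "y": 0, "C": 1, "D": 0}
print(is_homoplasy_free(edges, compatible), is_homoplasy_free(edges, homoplastic))
```

In the second labelling, state 1 appears at leaves A and C with state 0 in between, which is the pattern that parallel semantic shift or undetected borrowing would produce.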
15
Polymorphism
  • Polymorphism means two or more states exhibited
    by the same language for a character.
  • Most common examples are lexical: two or more
    words for the same basic meaning. Examples:
    big/large, little/small, rock/stone.
  • Lexical polymorphism results primarily from
    semantic shift, but polymorphism due to borrowing
    also occurs.
  • Incidence: lexical polymorphism is very common
    but transient (almost all polymorphisms lost
    within a millennium). Less frequent for other
    types of characters.
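One way to represent polymorphism in data is to store a set of states per language and character. The sketch below is illustrative only: the data layout and function name are assumptions, and the word forms stand in for cognate-class labels.

```python
def polymorphic_characters(data):
    """Map each language to the characters for which it shows two or more states."""
    return {
        language: sorted(ch for ch, states in characters.items() if len(states) > 1)
        for language, characters in data.items()
    }


# Hypothetical coding: each meaning maps to the set of words attested for it.
data = {
    "English": {
        "BIG": {"big", "large"},
        "STONE": {"rock", "stone"},
        "WATER": {"water"},
    }
}
print(polymorphic_characters(data))
```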

16
Modelling variation between characters:
Rates-across-sites
  • If a site (i.e., character) is twice as fast as
    another on one edge, it is twice as fast
    everywhere.

[Figure: two trees on leaves A, B, C, D]
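The rates-across-sites assumption can be written as one rate multiplier per character applied to shared edge lengths. This is a minimal sketch; the edge lengths, rates, and names are made up for illustration.

```python
import math


def ras_expected_changes(edge_lengths, site_rates):
    """Expected change per (site, edge): the site's rate times the shared edge length."""
    return {
        site: {edge: rate * length for edge, length in edge_lengths.items()}
        for site, rate in site_rates.items()
    }


edge_lengths = {"e1": 0.1, "e2": 0.3, "e3": 0.05}   # shared by all characters
site_rates = {"slow": 1.0, "fast": 2.0}              # one multiplier per character
expected = ras_expected_changes(edge_lengths, site_rates)
ratios = [expected["fast"][e] / expected["slow"][e] for e in edge_lengths]
print(ratios)
```

The ratio between the two characters is the same on every edge, which is exactly the constraint the slide describes.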
17
Modelling variation between characters: The
no-common-mechanism model
  • In this model, there is a separate random
    variable for every combination of site and edge -
    the underlying tree is fixed, but otherwise there
    are no constraints on variation between sites.

[Figure: two trees on leaves A, B, C, D]
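Under no-common-mechanism, each (site, edge) pair gets its own parameter. The sketch below draws these independently from a uniform distribution, which is an arbitrary choice made for illustration; the model itself places no such restriction.

```python
import random


def ncm_parameters(sites, edges, rng):
    """Draw an independent change parameter for every combination of site and edge."""
    return {(site, edge): rng.uniform(0.0, 1.0) for site in sites for edge in edges}


sites = ["c1", "c2"]
edges = ["e1", "e2", "e3"]
params = ncm_parameters(sites, edges, random.Random(0))
for key in sorted(params):
    print(key, round(params[key], 3))
```

The tree (the set of edges) is shared, but the number of free parameters grows as sites times edges, and nothing ties a character's behaviour on one edge to its behaviour on another.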
18
Homoplasy-free models without polymorphism
  • The earliest models were all tree models,
    homoplasy-free and obeyed the lexical clock.
  • Ringe-Warnow PP (perfect phylogeny - i.e.,
    homoplasy-free, no common mechanism,
    non-parametric tree model)

19
Cladogenesis
  • The speciation model ranges from trees all the
    way to dialect continuums. Intermediate models
    include horizontal transfer (borrowing) and
    hybridization (creoles).

20
Modelling borrowing: Networks and Trees within
Networks

21
Perfect Phylogenetic Network model
  • Nakhleh et al.: Perfect Phylogenetic Network (PPN)
    model. All characters evolve without homoplasy
    down a tree contained within the network.
    Published in Language, 2005.
  • Warnow-Evans-Ringe-Nakhleh (2004) extends PPN
    model to allow for limited and identifiable
    homoplasy.
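The idea of a tree contained within a network can be sketched by enumeration. This is a simplified illustration, not the PPN algorithm: contact edges are treated here as extra directed parent edges, and each contained tree keeps exactly one incoming edge per non-root node. The network and names are hypothetical.

```python
from itertools import product


def contained_trees(parents):
    """Yield every tree obtained by keeping one parent for each non-root node."""
    nodes = list(parents)
    for choice in product(*(parents[node] for node in nodes)):
        yield dict(zip(nodes, choice))


# Hypothetical network: leaf B inherits from x and also has a contact edge from y.
parents = {"x": ["r"], "y": ["r"], "A": ["x"], "B": ["x", "y"], "C": ["y"]}
trees = list(contained_trees(parents))
for tree in trees:
    print(tree)
```

A character is then required to evolve without homoplasy on at least one of these trees, which may be a different tree for different characters.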

22
Perfect Phylogenetic Network for IE (Nakhleh et
al., Language 2005)
23
What about polymorphism?
  • Our first model of polymorphism (Bonet et al.,
    1996) was a non-parametric model for
    homoplasy-free characters, no-common-mechanism
    model, with polymorphism due to semantic shift.
  • Three problems: (1) because it is non-parametric,
    it cannot be used for simulation; (2) homoplasy is
    fairly frequent for lexical characters (15% of
    characters); (3) what about polymorphism due to
    borrowing?

24
Nicholls and Gray model for polymorphism
  • Geoff Nicholls and Russell Gray (2006):
    Homoplasy-free, rates-across-sites, parametric
    model in which the character adds and loses
    states under a stochastic process. The number of
    states in a lineage can go up and down (including
    down to 0 and then back up).
  • Problems: (1) homoplasy is frequent in lexical
    characters; (2) what is the linguistic process?
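The way the number of states in a lineage can rise and fall can be illustrated with a generic birth-death simulation. This is a sketch under assumed dynamics (new states arrive at a constant rate `birth`, each existing state is lost at rate `death`); it is not the authors' parameterization, and all names and values are hypothetical.

```python
import random


def simulate_state_count(n0, birth, death, t_end, rng):
    """Simulate the number of states in one lineage as a continuous-time birth-death process."""
    t, n = 0.0, n0
    path = [(t, n)]
    while True:
        total_rate = birth + death * n      # birth must be > 0
        t += rng.expovariate(total_rate)    # waiting time to the next event
        if t >= t_end:
            return path
        if rng.random() < birth / total_rate:
            n += 1                           # a new state is added
        else:
            n -= 1                           # an existing state is lost
        path.append((t, n))


path = simulate_state_count(n0=1, birth=0.5, death=1.0, t_end=20.0, rng=random.Random(3))
print(min(n for _, n in path), max(n for _, n in path))
```

When the count reaches 0 only additions are possible, so the lineage can drop to no states for the character and later recover, as the slide notes.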

25
What needs to be done in modelling
  • We need parametric models of character evolution
    that include reasonable levels of homoplasy, in
    which polymorphism arises due to semantic shift
    (conflation of two characters), by borrowing, or
    due to other linguistic processes.
  • We also need cladogenesis models that incorporate
    population-level processes, and can represent
    dialect continuums.

26
Simulation study (Barbancon et al.)
  • Simulated evolution down networks with 30 leaves,
    three contact edges, and with moderate levels of
    homoplasy and borrowing for 300 lexical
    characters and 60 morphological characters.
  • Compared trees constructed by various methods to
    the genetic tree contained in the network, for
    topological accuracy.
  • Methods compared: NJ, UPGMA, weighted and
    unweighted MP and MC.

27
Standard Model Conditions
  • Screened dataset:
  • Lexical characters: 4% homoplastic, 10% evolve
    with borrowing
  • Morphological characters: no homoplasy nor
    borrowing
  • Unscreened dataset:
  • Lexical characters: 20% homoplastic, 20% borrowed
  • Morphological characters: 5% homoplastic, no
    borrowing
  • Molecular clock for the cladogenesis model
  • No-common-mechanism model with moderate variation
    between characters
  • Lexical weight = 1, morphological weight = 50
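These conditions could be written down as a small configuration. The sketch assumes the homoplasy and borrowing figures are percentages of characters, takes the character counts (300 lexical, 60 morphological) from the simulation-study slide, and shows one hypothetical way the weights might enter a weighted MP or MC score. All names are illustrative.

```python
N_CHARACTERS = {"lexical": 300, "morphological": 60}
WEIGHTS = {"lexical": 1, "morphological": 50}

# Fractions of characters, assuming the slide's figures are percentages.
CONDITIONS = {
    "screened": {
        "lexical": {"homoplastic": 0.04, "borrowed": 0.10},
        "morphological": {"homoplastic": 0.0, "borrowed": 0.0},
    },
    "unscreened": {
        "lexical": {"homoplastic": 0.20, "borrowed": 0.20},
        "morphological": {"homoplastic": 0.05, "borrowed": 0.0},
    },
}


def expected_counts(dataset):
    """Expected number of homoplastic and borrowed characters of each type."""
    return {
        kind: {prop: round(N_CHARACTERS[kind] * frac) for prop, frac in props.items()}
        for kind, props in CONDITIONS[dataset].items()
    }


def weighted_score(violations):
    """Weighted count of violating characters, e.g. incompatible ones under weighted MC."""
    return sum(WEIGHTS[kind] * n for kind, n in violations.items())


print(expected_counts("screened"))
print(weighted_score({"lexical": 3, "morphological": 1}))
```

With a weight of 50, a single conflicting morphological character costs more than dozens of conflicting lexical ones.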

28
Simulation study (cartoon)
FN: false negative (missing edge); FP: false
positive (incorrect edge); 50% error rate
29
Clocklike data
30
Clocklike data
31
Points
  • Screening the data helps to improve the
    phylogenetic accuracy of most methods.
  • When data are generated under network models,
    methods which reconstruct trees do not perform
    well.
  • Modelling helps us predict the conditions under
    which different methods will perform well, or
    poorly. The more accurate the models, the more
    relevant the predictions. We need better models!

32
Future research
  • Testing other methods in simulation (including
    some network construction methods)
  • Formulating improved (more realistic) models of
    language evolution
  • Implementing simulation tools under these
    improved models
  • Developing estimation methods under these
    improved models
  • Reanalyzing IE, and looking at some new families
    (or subfamilies)

33
Acknowledgements
  • Funding: NSF, the David and Lucile Packard
    Foundation, the Radcliffe Institute for Advanced
    Studies, The Program for Evolutionary Dynamics at
    Harvard, and the Institute for Cellular and
    Molecular Biology at UT-Austin.
  • Collaborators: Don Ringe, Steve Evans, Luay
    Nakhleh, and Francois Barbancon.

34
For more information
  • Please see the Computational Phylogenetics for
    Historical Linguistics web site for papers, data,
    and additional material:
    http://www.cs.rice.edu/~nakhleh/CPHL

35
Differences between characters
  • Lexical: most easily borrowed (most borrowings
    detectable), and homoplasy relatively frequent
    (we estimate about 25-30% overall for our
    wordlist, but a much smaller percentage for
    basic vocabulary). Also, lexical characters have
    a high incidence (80%) of transient polymorphism.
  • Phonological: can still be borrowed, but this is
    much less likely than for lexical characters.
    Complex phonological characters are infrequently
    (if ever) homoplastic, although simple
    phonological characters are very often
    homoplastic.
  • Morphological: least easily borrowed, least
    likely to be homoplastic. Rarely polymorphic.