Taking the Bite (Byte?) Out of Phylogeny - PowerPoint PPT Presentation

Slides: 57
Provided by: lucyj7
Learn more at: http://bioquest.org

1
Taking the Bite (Byte?) Out of Phylogeny
  • Jennifer Galovich
  • Lucy Kluckhohn Jones
  • Holly Pinkart

2
Introduction
  • Goal: produce an exercise that will engage
    allied health students and:
  • Strengthen math skills and decrease math phobia
  • Decrease molecular data phobia
  • Increase bioinformatics literacy

3
Prerequisites
  • The following will be presented to students prior
    to this project
  • Basic evolutionary concepts and use of 16S rRNA
    in determining relationships between prokaryotes
  • Introduction to Biology Workbench, BLAST and tree
    construction

4
Approach
  • Use the theme of food poisoning to engage both
    nursing and nutrition student populations
  • Utilize mathematics and bioinformatics tools

5
Approach
  • Students will pick a week in which food poisoning
    is likely: Christmas, 4th of July, Thanksgiving,
    etc.
  • Students will:
  • identify a source of food poisoning (e.g.,
    Salmonella), and check the Morbidity and
    Mortality Weekly Report tables for the number of
    cases in a specific state or region
  • calculate the proportion of cases represented by
    that region
  • answer: Is this number of cases unusual based on
    the data presented for this time period? How can
    you tell?
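The proportion calculation in this step is simple arithmetic. A minimal sketch follows; the case counts are invented for illustration and are not taken from any MMWR table, and the function name is hypothetical.

```python
def proportion_of_cases(region_cases, total_cases):
    """Return the fraction of all reported cases that come from one region."""
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    return region_cases / total_cases


# Hypothetical counts: 36 cases in one state out of 450 reported in total.
share = proportion_of_cases(36, 450)
print(f"{share:.1%} of reported cases came from this state")
```

Comparing this share with the same state's share in neighboring weeks is one simple way to start judging whether the number is unusual.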

6
Approach
  • Students will then address the questions:
  • Without culturing the organism, how might you
    track it in humans or in a food supply?
  • What relationships (if any) exist between
    various strains of this organism?
  • Can this type of data be used to find the
    original strain?

7
Approach
  • Students will:
  • obtain sequence data from NCBI's GenBank for the
    organism (or virus) of interest
  • BLAST the sequence to find organisms with related
    sequences
  • Collect 8-13 of the closest BLAST results to
    perform a global alignment, and construct a tree

8
Questions
  • Students choose a time period (week) and search
    MMWR (Morbidity and Mortality Weekly Report) for
    the number of cases of a particular disease in
    that week.
  • 1. Given the chosen disease, how many cases of the
    disease occurred in a particular state (or other
    locale) during the week?

9
More Questions about the Scene
  • 2a. How many persons are involved? Is there an
    index case?
  • 2b. What percent of the population has the
    disease?
  • 3. What other question might you ask from these
    data?
  • 4. What microbe causes the disease? What strain,
    if appropriate?

10
Now What? (Questions about the microbe)
  • 5. If you want to determine the specific strain
    of the microbe, can you find the genetic
    sequence?
  • 6. How has the strain evolved?
  • 7. What is its phylogeny, and what are the closest
    neighbors?

11
And Then. . . (Questions to Investigate)
  • 8a. Why is the answer to the previous question of
    interest to you if you are a nurse, a
    dietician, a parent, the mayor, the hospital
    director, the first responder, a restaurant
    owner, a cruise ship director, a public health
    inspector, or other interested person (you
    choose)?
  • 8b. What other questions are of interest to you
    in this role?

12
Finding the Microbe
  • Search MMWR Morbidity Tables

http://www.cdc.gov/mmwr/distrnds.html
13
Choose a Week
http://wonder.cdc.gov/mmwr/mmwrmorb.asp
14
Choose a Disease
http://wonder.cdc.gov/mmwr/mmwr_reps.asp?mmwr_year=2006&mmwr_week=07&mmwr_table=2F
15
What Percent of the Residents are Sick?
http://wonder.cdc.gov/mmwr/mmwr_reps.asp?mmwr_year=2006&mmwr_week=01&mmwr_table=2F
16
Find a Microbe
  • Use your text, class notes, or other resources to
    determine the causative agent of the disease you
    have chosen.
  • Choose a microbe, then find its family tree.
  • For the Salmonellosis example, we have chosen
    Salmonella enterica, a microbe with many
    variants, called serovars.

17
Basics of Tree Construction
  • Preliminary Exercises
  • Goal
  • Students will practice with small examples before
    trying to construct a tree
  • Students will learn phylogenetics notation and
    terminology (also see Glossary at end)

18
From Sequences to Pairwise Alignment
  • The Needleman-Wunsch Method 

19
The Needleman-Wunsch Method
  • We have one row for each residue in sequence (2)
    and one column for each residue in sequence (1).
    To get started, we add a 0th row and a 0th
    column.
  • We make a table of residue scores, S(i,j). The
    number S(i,j) is computed by comparing residue i
    in sequence (1) with residue j in sequence (2),
    using previously chosen values for matches and
    mismatches.
  • Each alignment matrix entry, H(i,j), gives the
    score of the best alignment of the first i
    residues in sequence (1) with the first j
    residues of sequence (2).
  • The upper left corner is position (0,0).
  • We set H(0,0) = 0.
  • The rest of the values in the top row are
    (reading across) -g, -2g, -3g, etc., where g is
    the gap penalty.
  • Similarly, the rest of the values in the leftmost
    column are (reading down) -g, -2g, -3g, etc.
  • To compute the value of H(i+1,j+1) we first
    consider the values north, west and northwest. We
    then find three numbers:
  • S(i+1,j+1) + (the value immediately northwest)
  • (the value just north) - g
  • (the value just west) - g

20
Distance Matrix
  • Then we choose the largest of these three numbers
    to be H(i+1,j+1) and draw an arrow from position
    (i+1,j+1) to the position that gave us the value
    of H(i+1,j+1).
  • Example
  • Let match = 1, mismatch = -1 and g = 2.
    Consider the sequences
  • (1) G A A T T C
  • (2) G G A T
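The fill-and-trace procedure can be sketched in Python as follows. This is a minimal illustration, not the code behind Biology Workbench: the function name and the tie-breaking order (diagonal first, then north, then west) are choices made for this sketch, and it follows only one arrow when several tie. The defaults are the example's values (match = 1, mismatch = -1, g = 2).

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=2):
    """Global alignment: columns follow seq1, rows follow seq2."""
    cols, rows = len(seq1), len(seq2)
    H = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, cols + 1):          # top row: -g, -2g, ...
        H[0][i] = -i * gap
    for j in range(1, rows + 1):          # leftmost column: -g, -2g, ...
        H[j][0] = -j * gap

    def score(i, j):
        return match if seq1[i - 1] == seq2[j - 1] else mismatch

    for j in range(1, rows + 1):
        for i in range(1, cols + 1):
            H[j][i] = max(H[j - 1][i - 1] + score(i, j),   # northwest
                          H[j - 1][i] - gap,               # north
                          H[j][i - 1] - gap)               # west

    # Trace back from the bottom-right corner, following one arrow per cell.
    top, bottom = [], []
    i, j = cols, rows
    while i > 0 or j > 0:
        if i > 0 and j > 0 and H[j][i] == H[j - 1][i - 1] + score(i, j):
            top.append(seq1[i - 1]); bottom.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif j > 0 and H[j][i] == H[j - 1][i] - gap:
            top.append("-"); bottom.append(seq2[j - 1])
            j -= 1
        else:
            top.append(seq1[i - 1]); bottom.append("-")
            i -= 1
    return H, "".join(reversed(top)), "".join(reversed(bottom))


H, aligned1, aligned2 = needleman_wunsch("GAATTC", "GGAT")
for row in H:
    print(row)
print(aligned1)
print(aligned2)
```

Changing the `match`, `mismatch` and `gap` arguments is a quick way to see how the scoring values change the alignment.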

 
21
Try This Exercise (at home ok)
  • Complete the table and then follow the arrows to
    determine the alignment
  • A diagonal arrow corresponds to aligning the two
    letters.
  • A horizontal arrow corresponds to aligning a
    letter from (1) with a gap.
  • A vertical arrow corresponds to aligning a letter
    from (2) with a gap.
  • (Note that if you have ties, you may have more
    than one arrow, and so more than one best
    alignment.) 
  • Redo this exercise with your own choice of match,
    mismatch and gap values.  Experiment with these
    values to obtain alignments different from the
    ones you got in part (a).

22
From Pairwise Alignment to Multiple Alignment 
  • Idea of global progressive alignment 
  • Most alike sequences are aligned together in
    order of their similarity.  A consensus is
    determined and then aligned to the next most
    similar sequence. The determination of next most
    similar is made using phylogenetic information
    (a guide tree).

23
From Alignment to Distance Matrix 
  • There are many different ways of computing
    the distance between pairs of sequences in a
    multiple alignment. Each uses different
    assumptions, which may or may not be reasonable
    for a given situation. For example, the simplest
    model, Jukes-Cantor, assumes that mutation occurs
    at a constant rate, and that each nucleotide is
    equally likely to mutate into any other
    nucleotide (at that rate). For protein
    sequences, the calculation is (even) more
    complicated.
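As an illustration of the Jukes-Cantor model, the standard correction turns the observed fraction p of differing sites into a distance d = -(3/4) ln(1 - 4p/3). The sketch below assumes two already-aligned nucleotide sequences; the function name, the choice to skip gapped columns, and the two short sequences are assumptions made for this example.

```python
import math


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance between two aligned nucleotide sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Compare only the columns where neither sequence has a gap.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        raise ValueError("too divergent for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


# Made-up sequences that differ at 2 of 8 sites (p = 0.25).
print(round(jukes_cantor_distance("GATTACAG", "GACTACAA"), 4))
```

The corrected distance is always at least as large as p, because the model allows for sites that changed more than once.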

24
From distance matrix to tree
  • Again, there are many different methods
    available. Biology Workbench uses ClustalW to
    construct multiple alignments. Clustal uses the
    neighbor joining method to find the guide tree.
    The final tree produced by Workbench is a
    compilation of these guide trees.

25
Clustering Methods 
  • The UPGMA (Unweighted Pair-Group Method with
    Arithmetic means) method
  • + easy to describe; produces an ultrametric (and
    hence additive) tree
  • - assumptions (molecular clock: all species
    evolve at the same rate)
  • General idea 
  • Step 1.  Find the two closest taxa.
  • Step 2.  Treat the two closest as a new
    combined taxon, and make a new matrix,
    calculating distances from the combined taxon to
    the others using the average of all the pairwise
    distances involved.
  • Iterate these two steps until the tree is
    completed. 
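The two iterated steps can be sketched in Python as below. This is a minimal illustration with hypothetical names, not the implementation used by any particular package. It runs on the four-taxon matrix (A, B, C, D) given in the later exercises, and it records each merge with a height equal to half the average distance between the merged clusters.

```python
from itertools import combinations


def upgma(names, matrix):
    """Return (tree as nested tuples, list of merges) for a distance matrix."""
    d = {(a, b): matrix[i][j]
         for i, a in enumerate(names) for j, b in enumerate(names)}
    # Each cluster is (subtree, list of the original taxa it contains).
    clusters = [(name, [name]) for name in names]
    merges = []

    def average(c1, c2):
        # Average of all pairwise distances between the two clusters.
        values = [d[(a, b)] for a in c1[1] for b in c2[1]]
        return sum(values) / len(values)

    while len(clusters) > 1:
        # Step 1: find the two closest clusters.
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: average(clusters[p[0]], clusters[p[1]]))
        c1, c2 = clusters[x], clusters[y]
        merges.append((c1[0], c2[0], average(c1, c2) / 2))
        # Step 2: treat them as one combined taxon.
        combined = ((c1[0], c2[0]), c1[1] + c2[1])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(combined)
    return clusters[0][0], merges


names = ["A", "B", "C", "D"]
matrix = [[0, 17, 21, 27],
          [17, 0, 12, 18],
          [21, 12, 0, 14],
          [27, 18, 14, 0]]
tree, merges = upgma(names, matrix)
print(tree)
for left, right, height in merges:
    print(left, right, height)
```

Averaging over the original pairwise distances, rather than over the previous matrix entries, keeps every taxon weighted equally as clusters grow.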

26
  • Construct the UPGMA tree for the following
    distance matrix 

Observe A and D are closest
 
Next, update the matrix
 
Now the A/D cluster and C are closest.
 
27
Exercises
  • Finish constructing this tree.
  • The tree is ultrametric, but the data are not.
    (Why not?) How would the data have to be changed
    in order that they be ultrametric?
  • The tree is additive.  Are the data?
  • Now redo questions 1 to 3 for the case where the
    BD distance is 12 instead of 10.
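One way to check whether a distance matrix is additive is the standard four-point condition: for every four taxa, of the three sums d(a,b) + d(c,e), d(a,c) + d(b,e) and d(a,e) + d(b,c), the two largest must be equal. The condition is not stated on the slides, so treat the sketch below as an outside aid. The function name is hypothetical, and the matrix is the one from the later exercises, not the matrix used in this example.

```python
from itertools import combinations


def is_additive(names, d, tol=1e-9):
    """Four-point condition on a symmetric distance matrix d[a][b]."""
    for a, b, c, e in combinations(names, 4):
        sums = sorted([d[a][b] + d[c][e],
                       d[a][c] + d[b][e],
                       d[a][e] + d[b][c]])
        if abs(sums[2] - sums[1]) > tol:   # two largest sums must agree
            return False
    return True


names = ["A", "B", "C", "D"]
rows = [[0, 17, 21, 27],
        [17, 0, 12, 18],
        [21, 12, 0, 14],
        [27, 18, 14, 0]]
d = {a: dict(zip(names, row)) for a, row in zip(names, rows)}
print(is_additive(names, d))
```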

28
Neighbor Joining  (NJ) 
  • + additive (but not ultrametric);
    computationally efficient
  • - unrooted. Prior knowledge is needed to decide
    how to root the tree.
  • Note: the species which are closest according to
    the distance matrix need NOT be neighbors. That's
    why we need a modified distance formula.
  • Exercise: Draw a picture of a tree on four taxa
    that illustrates the problem described in the
    note above.

29
Constructing a Neighbor Joining Tree
  • Step 1: Find the two taxa which are closest
    using the modified distance formula below. Join
    them.
  • To find the modified distance from node i to node
    j:
  • Let N be the number of taxa.
  • Let R_i = (sum of all the distances from node i
    to all others except node j) / (N - 2).
  • Let R_j = (sum of all the distances from node j
    to all others except node i) / (N - 2).
  • Let D(i,j) = matrix distance.
  • Calculate the modified distance, D', from i to j
    as D'(i,j) = D(i,j) - R_i - R_j. For example,
    using the distance matrix we used earlier,
    D'(A,B) = 9 - 6 - 9 = -6.
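The modified distance can be computed mechanically, as in the sketch below. It follows the wording above, which leaves node j out of the sum for R_i; some textbook statements of neighbor joining sum over every other node instead, so results may differ from other references. The function name is hypothetical and the matrix is the four-taxon one from the later exercises.

```python
from itertools import combinations


def modified_distances(names, d):
    """Return {(i, j): D(i,j) - R_i - R_j} for every pair of taxa."""
    n = len(names)
    result = {}
    for i, j in combinations(names, 2):
        # R_i: distances from i to everything except j, divided by N - 2.
        r_i = sum(d[i][k] for k in names if k not in (i, j)) / (n - 2)
        r_j = sum(d[j][k] for k in names if k not in (i, j)) / (n - 2)
        result[(i, j)] = d[i][j] - r_i - r_j
    return result


names = ["A", "B", "C", "D"]
rows = [[0, 17, 21, 27],
        [17, 0, 12, 18],
        [21, 12, 0, 14],
        [27, 18, 14, 0]]
d = {a: dict(zip(names, row)) for a, row in zip(names, rows)}

modified = modified_distances(names, d)
closest = min(modified, key=modified.get)
for pair, value in modified.items():
    print(pair, value)
print("join first:", closest)
```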

30
NJ (continued)
  • Step 2: Suppose that nodes i and j give the
    smallest value of the modified distance D'.
    Start the tree by joining those nodes to a new
    node. Call the new node (ij). We now have two
    fewer taxa and one more internal node, for a net
    of one less node than we started with.
  • Step 3: Now, as in the UPGMA method, we make a
    new matrix showing the distances to all the nodes
    except i and j. Problem: the new internal node
    (ij) is not in the original matrix.

31
This problem can be solved
  • Step 4: To update the matrix, you will need to
    compute the distance from the new internal node
    (ij) to the remaining nodes. For each remaining
    node k, compute the new distance as
  • ½ [D(i,k) + D(j,k) - D(i,j)]
  • Step 5: Apply steps 1 to 4 to the revised matrix.
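The matrix update in Step 4 can be sketched as a small helper. The function name and the "(ij)" labeling are choices made for this illustration. The example joins C and D in the four-taxon matrix from the exercises, purely to show the arithmetic.

```python
def join_nodes(names, d, i, j):
    """Replace nodes i and j by a new node (ij); return new names and matrix."""
    new = f"({i}{j})"
    rest = [k for k in names if k not in (i, j)]
    # Distances among the remaining nodes are copied unchanged.
    new_d = {a: {b: d[a][b] for b in rest} for a in rest}
    new_d[new] = {new: 0.0}
    for k in rest:
        # Distance from the new internal node (ij) to node k.
        dist = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        new_d[new][k] = dist
        new_d[k][new] = dist
    return rest + [new], new_d


names = ["A", "B", "C", "D"]
rows = [[0, 17, 21, 27],
        [17, 0, 12, 18],
        [21, 12, 0, 14],
        [27, 18, 14, 0]]
d = {a: dict(zip(names, row)) for a, row in zip(names, rows)}

new_names, new_d = join_nodes(names, d, "C", "D")
for a in new_names:
    print(a, [new_d[a][b] for b in new_names])
```

Calling the helper again on its own output, after recomputing the modified distances, is how Step 5 would proceed.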

32
Exercises
  • Practice the NJ method on the matrix we had
    earlier.
  • Now try both methods using the matrix below.
    Why do you get different trees?

     A   B   C   D
A    0  17  21  27
B   17   0  12  18
C   21  12   0  14
D   27  18  14   0
33
Final Approach
  • Use the theme of food poisoning to engage both
    nursing and nutrition student populations
  • Utilize mathematics and bioinformatics tools

34
Find the Microbial Gene
  • NCBI Search

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide
35
Choose a Strain
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=nucleotide&cmd=search&term=Salmonella+enterica+16s+ribosomal+RNA+gene
36
BLAST
  • Basic Local Alignment Search Tool

http://www.ncbi.nlm.nih.gov/BLAST/
37
Paste Sequence, BLAST off!
http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?CMD=Web&LAYOUT=TwoWindows&AUTO_FORMAT=Semiauto&ALIGNMENTS=50&ALIGNMENT_VIEW=Pairwise&CLIENT=web&DATABASE=nr&DESCRIPTIONS=100&ENTREZ_QUERY=%28none%29&EXPECT=10&FILTER=L&FORMAT_OBJECT=Alignment&FORMAT_TYPE=HTML&NCBI_GI=on&PAGE=Nucleotides&PROGRAM=blastn&SERVICE=plain&SET_DEFAULTS.x=34&SET_DEFAULTS.y=8&SHOW_OVERVIEW=on&END_OF_HTTPGET=Yes&SHOW_LINKOUT=yes&GET_SEQUENCE=yes
38
BLAST Results
39
BLAST Sequences
http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi
40
GenBank
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=88604678
41
FASTA
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&qty=1&c_start=1&list_uids=88604678&dopt=fasta&dispmax=5&sendto=&from=begin&to=end&extrafeatpresent=1&ef_CDD=8&ef_MGC=16&ef_HPRD=32&ef_STS=64&ef_tRNA=128
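For readers who prefer a script to the web form, a FASTA record can also be requested through NCBI's E-utilities `efetch` service. This route is not mentioned in the slides. The sketch below only builds the request URL; the function name is hypothetical, and whether the record number 88604678 from the GenBank slide still resolves has not been checked.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fasta_url(record_id, db="nucleotide"):
    """Build an efetch URL that asks for one record in FASTA text format."""
    params = {"db": db, "id": record_id, "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


url = fasta_url("88604678")
print(url)
# To download the record (requires network access):
#   from urllib.request import urlopen
#   fasta_text = urlopen(url).read().decode()
```

The downloaded text can then be pasted into Biology Workbench in the same way as a sequence copied from the browser.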
42
Constructing a Tree
  • Add sequences
  • http://seqtool.sdsc.edu/CGI/BW.cgi!

43
Clustal W
  • Choose the Multiple Sequence Alignment

http://seqtool.sdsc.edu/CGI/BW.cgi!
44
Choose a Tree Type
  • Choose Rooted and/or Unrooted
  • Submit

http://seqtool.sdsc.edu/CGI/BW.cgi!
45
Voila!
  • Unrooted Tree

http://seqtool.sdsc.edu/CGI/BW.cgi!
46
Rooted Tree
  • Which species are the most closely related?

http://seqtool.sdsc.edu/CGI/BW.cgi!
47
Final Questions
  • How are the data helpful if you are a:
  • Parent?
  • Restaurant owner?
  • Hospital director?
  • Public health inspector?

48
Assessment
  • Student Learning Outcomes
  • More comfortable with computation
  • Using the tools to answer questions
  • Empowerment (we hope!)

49
References -- Texts
  • Emphasis on algorithms
  • Neil C. Jones and Pavel A. Pevzner, An
    Introduction to Bioinformatics Algorithms
  • Michael S. Waterman, Introduction to
    Computational Biology
  • Bio/Math Balanced
  • Paul G. Higgs and Teresa K. Attwood,
    Bioinformatics and Molecular Evolution
  • The Bible of Phylogenetics
  • Joseph Felsenstein, Inferring Phylogenies

50
References -- Websites
  • http://mbi.ohio-state.edu/2005/tutorials2005.html
  • (tutorial on tree construction)
  • http://bioalgorithms.info/courses.php
  • (list of links to bioinformatics course websites)
  • http://tree-thinking.org/
  • (resources for learning and teaching)

51
Glossary (for the faint of heart)
  • Taxon (plural taxa) or operational taxonomic
    unit (OTU): an entity (such as a species,
    protein sequence, language, etc.) whose distance
    from or similarity to other entities can be
    measured.
  • Phylogeny: the evolutionary history of some
    collection of taxa, i.e., tracking lineages as
    the taxa change through time.
  • Phylogenetic tree: a graphic representation of a
    phylogeny.

52
More Glossary
  • Matrix: a rectangular array of data
  • Graph: a collection of nodes (aka vertices)
    (usually represented by dots) and edges
    (connected pairs of vertices, usually represented
    by line segments)
  • Example


53
Even More Glossary
  • Connected graph -- In a connected graph, it is
    always possible to get from any node to any other
    node by following the edges. Here is an example
    of a graph that is not connected, since we can't
    get from to

54
Glossary- are we there yet?
  • Cycle -- a graph has a cycle if you can start at
    some node and, following the edges, get back to
    that node without backtracking. Here is a graph
    with a cycle marked in red.

55
Glossary almost done
  • Tree: a connected graph with no cycles
  • Weighted tree: a tree whose edges are labelled
    to represent distances
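  • Additive tree: a weighted tree in which the
    distance between any two nodes is the sum of the
    lengths of the edges on the path joining them. In
    particular, if node B lies on the path from A to
    C, the distance from A to B plus the distance
    from B to C is the same as the distance from A
    to C.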
  • Degree of a node (or valence): the number of
    edges attached to a node
  • Rooted tree: a tree where some node has been
    specially designated. (Usually we interpret the
    root to be the ancestral taxon.)
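The definition "a connected graph with no cycles" can be tested directly: a connected graph on n nodes has no cycles exactly when it has n - 1 edges. The sketch below assumes the graph is given as a list of nodes and a list of distinct edges with no self-loops; the function name is hypothetical.

```python
def is_tree(nodes, edges):
    """True if the graph is connected and has exactly len(nodes) - 1 edges."""
    nodes = list(nodes)
    if not nodes:
        return False
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    # Walk the graph from the first node to find everything reachable.
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        for nxt in neighbors[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    connected = len(seen) == len(nodes)
    return connected and len(edges) == len(nodes) - 1


print(is_tree("ABC", [("A", "B"), ("B", "C")]))              # a path
print(is_tree("ABC", [("A", "B"), ("B", "C"), ("C", "A")]))  # contains a cycle
print(is_tree("ABCD", [("A", "B"), ("C", "D")]))             # not connected
```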

56
The end of the Glossary
  • Binary tree: if rooted, the root has degree 2 and
    all other nodes have degree 1 or 3.
  • Internal nodes: nodes in a rooted tree of degree
    3
  • Leaves: nodes in any tree of degree 1.
  • Ultrametric tree: a tree is ultrametric if it
    meets the three point condition. Any three nodes
    determine three distances, AB, BC and AC. The
    three point condition says that the two largest
    of these three distances must be the same.
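The three point condition is easy to check by machine for a whole distance matrix. The sketch below is an illustration with hypothetical names. It tests the four-taxon matrix from the exercises and a small made-up matrix that does satisfy the condition.

```python
from itertools import combinations


def is_ultrametric(names, d, tol=1e-9):
    """Three point condition: in every triple the two largest distances agree."""
    for a, b, c in combinations(names, 3):
        smallest, middle, largest = sorted([d[a][b], d[b][c], d[a][c]])
        if abs(largest - middle) > tol:
            return False
    return True


names = ["A", "B", "C", "D"]
rows = [[0, 17, 21, 27],
        [17, 0, 12, 18],
        [21, 12, 0, 14],
        [27, 18, 14, 0]]
d = {a: dict(zip(names, row)) for a, row in zip(names, rows)}
print(is_ultrametric(names, d))

# Made-up clock-like distances: A and B are 4 apart, both are 10 from C.
clock = {"A": {"A": 0, "B": 4, "C": 10},
         "B": {"A": 4, "B": 0, "C": 10},
         "C": {"A": 10, "B": 10, "C": 0}}
print(is_ultrametric(["A", "B", "C"], clock))
```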