Title: Taking the Bite (Byte?) Out of Phylogeny
1Taking the Bite (Byte?) Out of Phylogeny
- Jennifer Galovich
- Lucy Kluckhohn Jones
- Holly Pinkart
2Introduction
- The goal is to produce an exercise that will engage allied health students and
  - Strengthen math skills and decrease math phobia
  - Decrease molecular data phobia
  - Increase bioinformatics literacy
3Prerequisites
- The following will be presented to students prior to this project:
  - Basic evolutionary concepts and the use of 16S rRNA in determining relationships between prokaryotes
  - Introduction to Biology Workbench, BLAST and tree construction
4Approach
- Use the theme of food poisoning to engage both nursing and nutrition student populations
- Utilize mathematics and bioinformatics tools
5Approach
- Students will pick a week in which food poisoning is likely: Christmas, 4th of July, Thanksgiving, etc.
- Students will
  - identify a source of food poisoning (e.g., Salmonella), and check the Morbidity and Mortality Weekly Report tables for the number of cases in a specific state or region
  - calculate the proportion of cases represented by that region
  - answer: Is this number of cases unusual based on the data presented for this time period? How can you tell?
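The proportion asked for here is a simple ratio, and the same arithmetic gives the percent of residents who are sick. A minimal Python sketch, assuming made-up counts: the numbers below are hypothetical placeholders, not MMWR data, and the function and variable names are ours.

```python
def proportion(part, whole):
    """Return part / whole as a fraction between 0 and 1."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole

# Hypothetical placeholder numbers; replace with values read from the MMWR tables.
cases_in_region = 30            # cases reported in the chosen state or region
cases_nationwide = 600          # cases reported in all regions for the same week
region_population = 5_000_000   # residents of the chosen state or region

share_of_cases = proportion(cases_in_region, cases_nationwide)
percent_sick = 100 * proportion(cases_in_region, region_population)

print(f"Share of reported cases: {share_of_cases:.3f}")
print(f"Percent of residents sick: {percent_sick:.4f}%")
```

Deciding whether a count is "unusual" still requires comparing it with the counts reported for other weeks or regions; the sketch only covers the arithmetic.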
6Approach
- Students will then address the questions:
  - Without culturing the organism, how might you track it in humans or in a food supply?
  - What relationships (if any) exist between various strains of this organism?
  - Can this type of data be used to find the original strain?
7Approach
- Students will
  - obtain sequence data from NCBI's GenBank for the organism (or virus) of interest
  - BLAST the sequence to find organisms with related sequences
  - collect 8-13 of the closest BLAST results to perform a global alignment, and construct a tree
8Questions
- Students choose a time period (week) and search MMWR (Morbidity and Mortality Weekly Report) for the number of cases of a particular disease for that week.
- Given the chosen disease, how many cases of the disease occurred in a particular state (or other locale) during the week?
9More Questions about the Scene
- 2a. How many persons are involved? Is there an index case?
- 2b. What percent of the population has the disease?
- 3. What other question might you ask from these data?
- 4. What microbe causes the disease? What strain, if appropriate?
10Now What? (Questions about the microbe)
- 5. If you want to determine the specific strain of the microbe, can you find the genetic sequence?
- How has the strain evolved?
- What is its phylogeny, and what are the closest neighbors?
11And Then. . . (Questions to Investigate)
- 8a. Why is the answer to the previous question of interest to you if you are a nurse, a dietician, a parent, the mayor, the hospital director, the first responder, a restaurant owner, a cruise ship director, a public health inspector, or other interested person (you choose)?
- 8b. What other questions are of interest to you in this role?
12Finding the Microbe
- Search MMWR Morbidity Tables
http://www.cdc.gov/mmwr/distrnds.html
13Choose a Week
http://wonder.cdc.gov/mmwr/mmwrmorb.asp
14Choose a Disease
http://wonder.cdc.gov/mmwr/mmwr_reps.asp?mmwr_year=2006&mmwr_week=07&mmwr_table=2F
15What Percent of the Residents are Sick?
http://wonder.cdc.gov/mmwr/mmwr_reps.asp?mmwr_year=2006&mmwr_week=01&mmwr_table=2F
16Find a Microbe
- Use your text, class notes, or other resources to determine the causative agent of the disease you have chosen.
- Choose a microbe, then find its family tree.
- For the salmonellosis example, we have chosen Salmonella enterica, a microbe with many variants, called serovars.
17Basics of Tree Construction
- Preliminary Exercises
- Goal
  - Students will practice with small examples before trying to construct a tree
  - Students will learn phylogenetics notation and terminology (also see Glossary at end)
18From Sequences to Pairwise Alignment
- The Needleman-Wunsch Method
19The Needleman-Wunsch Method
- We have one row for each residue in sequence (2) and one column for each residue in sequence (1). To get started, we add a 0th row and a 0th column.
- We make a table of residue scores, S(i,j). The number S(i,j) is computed by comparing residue i in sequence (1) with residue j in sequence (2), using previously chosen values for matches and mismatches.
- Each alignment matrix entry, H(i,j), gives the score of the best alignment of the first i residues in sequence (1) with the first j residues of sequence (2).
- The upper left corner is position (0,0).
- We set H(0,0) = 0.
- The rest of the values in the top row are (reading across) -g, -2g, -3g, etc., where g is the gap penalty.
- Similarly, the rest of the values in the leftmost column are (reading down) -g, -2g, -3g, etc.
- To compute the value of H(i+1,j+1) we first consider the values north, west and northwest. We then find
  - S(i+1,j+1) + the value immediately northwest
  - (the value just north) - g
  - (the value just west) - g
20Distance Matrix
- Then we choose the largest of these three numbers to be H(i+1,j+1) and draw an arrow from position (i+1,j+1) to the position that gave us the value of H(i+1,j+1).
- Example
  - Let match = 1, mismatch = -1 and g = 2. Consider the sequences
  - (1) G A A T T C
  - (2) G G A T
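The filling rule can be sketched in Python as a way to check a table completed by hand. This is a minimal sketch, not a full aligner: it uses the example's values (match = 1, mismatch = -1, g = 2), the function name `nw_table` is ours, and arrows are not recorded.

```python
def nw_table(seq1, seq2, match=1, mismatch=-1, gap=2):
    """Fill the Needleman-Wunsch score table.

    Rows correspond to residues of seq2 and columns to residues of seq1,
    so table[j][i] plays the role of H(i, j); row 0 and column 0 are the
    extra 0th row and 0th column.
    """
    rows, cols = len(seq2) + 1, len(seq1) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, cols):
        table[0][i] = -i * gap          # top row: -g, -2g, -3g, ...
    for j in range(1, rows):
        table[j][0] = -j * gap          # leftmost column: -g, -2g, -3g, ...
    for j in range(1, rows):
        for i in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            table[j][i] = max(
                table[j - 1][i - 1] + s,    # northwest + residue score
                table[j - 1][i] - gap,      # north - g
                table[j][i - 1] - gap,      # west - g
            )
    return table

table = nw_table("GAATTC", "GGAT")
for row in table:
    print(" ".join(f"{value:4d}" for value in row))
```

The bottom right entry of the printed table is the score of the best global alignment of the two sequences.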
21Try This Exercise (at home ok)
- (a) Complete the table and then follow the arrows to determine the alignment.
  - A diagonal arrow corresponds to aligning the two letters.
  - A horizontal arrow corresponds to aligning a letter from (1) with a gap.
  - A vertical arrow corresponds to aligning a letter from (2) with a gap.
  - (Note that if you have ties, you may have more than one arrow, and so more than one best alignment.)
- (b) Redo this exercise with your own choice of match, mismatch and gap values. Experiment with these values to obtain alignments different from the ones you got in part (a).
22From Pairwise Alignment to Multiple Alignment
- Idea of global progressive alignment
  - The most alike sequences are aligned together, in order of their similarity. A consensus is determined and then aligned to the next most similar sequence. The determination of "next most similar" is made using phylogenetic information (a guide tree).
23From Alignment to Distance Matrix
- There are many different ways of computing the distance between pairs of sequences in a multiple alignment. Each uses different assumptions, which may or may not be reasonable for a given situation. For example, the simplest model, Jukes-Cantor, assumes that mutation occurs at a constant rate, and that each nucleotide is equally likely to mutate into any other nucleotide (at that rate). For protein sequences, the calculation is (even) more complicated.
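Under the Jukes-Cantor assumptions, the usual distance estimate is d = -(3/4) ln(1 - (4/3)p), where p is the observed proportion of sites at which two aligned sequences differ. A minimal Python sketch follows; the function name, the decision to skip gapped columns, and the toy sequences are our assumptions, not part of the exercise.

```python
import math

def jukes_cantor_distance(aligned1, aligned2):
    """Jukes-Cantor distance between two aligned nucleotide sequences."""
    if len(aligned1) != len(aligned2):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: columns with a gap ("-") in either sequence are skipped.
    pairs = [(a, b) for a, b in zip(aligned1, aligned2)
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    # p = observed proportion of sites that differ
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        raise ValueError("sequences too different for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)

# Made-up toy sequences, for illustration only.
d = jukes_cantor_distance("GATTACA", "GACTACA")
print(d)
```

The corrected distance is slightly larger than the raw proportion of differences, because the model allows for sites that have mutated more than once.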
24From distance matrix to tree
- Again, there are many different methods available. Biology Workbench uses ClustalW to construct multiple alignments. Clustal uses the neighbor joining method to find the guide tree. The final tree produced by Workbench is a compilation of these guide trees.
25Clustering Methods
- The UPGMA (Unweighted Pair-Group Method with Arithmetic means) method
  - Advantages: easy to describe; produces an ultrametric (and hence additive) tree
  - Disadvantage: assumptions (molecular clock: all species evolve at the same rate)
- General idea
  - Step 1. Find the two closest taxa.
  - Step 2. Treat the two closest as a new "combined" taxon, and make a new matrix, calculating distances from the combined taxon to the others using the average of all the pairwise distances involved.
  - Iterate these two steps until the tree is completed.
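The two steps can be sketched in Python as follows. This is a minimal sketch under our own assumptions: the distance matrix is stored as a nested dictionary, the function names are ours, and the data are the four-taxon matrix from the Exercises slide later in this deck. It reports the order of the merges rather than drawing the tree.

```python
from itertools import combinations

def average_distance(cluster1, cluster2, dist):
    """Average of all pairwise distances between members of two clusters."""
    total = sum(dist[x][y] for x in cluster1 for y in cluster2)
    return total / (len(cluster1) * len(cluster2))

def upgma_merges(dist):
    """Return the merges as (cluster1, cluster2, distance), in order."""
    clusters = [(name,) for name in dist]    # start with one cluster per taxon
    merges = []
    while len(clusters) > 1:
        # Step 1: find the two closest clusters.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_distance(pair[0], pair[1], dist))
        merges.append((a, b, average_distance(a, b, dist)))
        # Step 2: replace them by the combined cluster.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

dist = {
    "A": {"A": 0, "B": 17, "C": 21, "D": 27},
    "B": {"A": 17, "B": 0, "C": 12, "D": 18},
    "C": {"A": 21, "B": 12, "C": 0, "D": 14},
    "D": {"A": 27, "B": 18, "C": 14, "D": 0},
}
merges = upgma_merges(dist)
for a, b, d in merges:
    print(a, b, d)
```

In the finished tree, each merge is usually drawn as a node at height d/2 above the leaves, which is what makes the tree ultrametric.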
26- Construct the UPGMA tree for the following
distance matrix
- Observe: A and D are closest.
- Next, update the matrix.
- Now the A/D cluster and C are closest.
27Exercises
- 1. Finish constructing this tree.
- 2. The tree is ultrametric, but the data are not. (Why not?) How would the data have to be changed in order that they be ultrametric?
- 3. The tree is additive. Are the data?
- Now, redo questions 1-3 in case the BD distance is 12 instead of 10.
28Neighbor Joining (NJ)
- Advantages: additive (but not ultrametric); computationally efficient
- Disadvantage: unrooted. Prior knowledge is needed to decide how to root the tree.
- Note: the species which are closest according to the distance matrix need NOT be neighbors. That's why we need a modified distance formula.
- Exercise: Draw a picture of a tree on four taxa that illustrates the problem described in the note above.
29Constructing a Neighbor Joining Tree
- Step 1: Find the two taxa which are closest using the modified distance formula below. Join them.
- To find the modified distance from node i to node j:
  - Let N be the number of taxa.
  - Let R_i = sum of all the distances from node i to all others except node j, divided by N - 2.
  - Let R_j = sum of all the distances from node j to all others except node i, divided by N - 2.
  - Let D(i,j) = matrix distance.
  - Calculate the modified distance, D', from i to j as D'(i,j) = D(i,j) - R_i - R_j. For example, using the distance matrix we used earlier, D'(A,B) = 9 - 6 - 9 = -6.
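The modified distance can be computed directly from its definition. Below is a minimal Python sketch that follows the formula as stated on this slide; the nested-dictionary layout and function name are ours, and the data are the four-taxon matrix from the Exercises slide later in this deck. Many textbook descriptions of neighbor joining include D(i,j) itself in the row sums, so values computed elsewhere may differ from these.

```python
from itertools import combinations

def modified_distance(dist, i, j):
    """Modified distance D'(i, j) = D(i, j) - R_i - R_j."""
    n = len(dist)                               # N, the number of taxa
    others = [k for k in dist if k not in (i, j)]
    r_i = sum(dist[i][k] for k in others) / (n - 2)
    r_j = sum(dist[j][k] for k in others) / (n - 2)
    return dist[i][j] - r_i - r_j

dist = {
    "A": {"A": 0, "B": 17, "C": 21, "D": 27},
    "B": {"A": 17, "B": 0, "C": 12, "D": 18},
    "C": {"A": 21, "B": 12, "C": 0, "D": 14},
    "D": {"A": 27, "B": 18, "C": 14, "D": 0},
}

# Step 1: the pair with the smallest modified distance is joined first.
closest = min(combinations(dist, 2),
              key=lambda pair: modified_distance(dist, pair[0], pair[1]))
print(closest, modified_distance(dist, *closest))
```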
30NJ (continued)
- Step 2: Suppose that nodes i and j give the smallest value of the modified distance. Start the tree by joining those nodes to a new node. Call the new node (ij). We now have two fewer taxa and one more internal node, for a net of one less node than we started with.
- Step 3: Now, as in the UPGMA method, we make a new matrix showing the distances to all the nodes except i and j. Problem: the new internal node (ij) is not in the original matrix.
31This problem can be solved
- Step 4: To update the matrix, you will need to compute the distance from the new internal node (ij) to the remaining nodes. For each remaining node k, compute the new distance as
  - ½ [D(i,k) + D(j,k) - D(i,j)]
- Step 5: Apply steps 1-4 to the revised matrix.
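The matrix update in Step 4 can be sketched in Python as shown below. This is a minimal sketch under our own assumptions: the nested-dictionary layout and the function name `join_nodes` are ours, the data are the four-taxon matrix from the Exercises slide later in this deck, and C and D are simply taken to be the pair chosen for joining.

```python
def join_nodes(dist, i, j):
    """Return a new distance table in which nodes i and j are replaced by (ij)."""
    new_name = f"({i}{j})"
    remaining = [k for k in dist if k not in (i, j)]
    # Distances among the remaining nodes are copied unchanged.
    new_dist = {k: {m: dist[k][m] for m in remaining} for k in remaining}
    new_dist[new_name] = {new_name: 0}
    for k in remaining:
        # Distance from the new internal node (ij) to node k.
        d = 0.5 * (dist[i][k] + dist[j][k] - dist[i][j])
        new_dist[new_name][k] = d
        new_dist[k][new_name] = d
    return new_dist

dist = {
    "A": {"A": 0, "B": 17, "C": 21, "D": 27},
    "B": {"A": 17, "B": 0, "C": 12, "D": 18},
    "C": {"A": 21, "B": 12, "C": 0, "D": 14},
    "D": {"A": 27, "B": 18, "C": 14, "D": 0},
}
updated = join_nodes(dist, "C", "D")
for name, row in updated.items():
    print(name, row)
```

Repeating the selection step and this update on the smaller table carries out Step 5.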
32Exercises
- Practice the NJ method on the matrix we had earlier.
- Now try both methods using the matrix below. Why do you get different trees?

       A    B    C    D
  A    0   17   21   27
  B   17    0   12   18
  C   21   12    0   14
  D   27   18   14    0
33 Final Approach
- Use the theme of food poisoning to engage both nursing and nutrition student populations
- Utilize mathematics and bioinformatics tools
34Find the Microbial Gene
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide
35Choose a Strain
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=nucleotide&cmd=search&term=Salmonella+enterica+16s+ribosomal+RNA+gene
36BLAST
- Basic Local Alignment Search Tool
http://www.ncbi.nlm.nih.gov/BLAST/
37Paste Sequence, BLAST off!
http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?CMD=Web&LAYOUT=TwoWindows&AUTO_FORMAT=Semiauto&ALIGNMENTS=50&ALIGNMENT_VIEW=Pairwise&CLIENT=web&DATABASE=nr&DESCRIPTIONS=100&ENTREZ_QUERY=%28none%29&EXPECT=10&FILTER=L&FORMAT_OBJECT=Alignment&FORMAT_TYPE=HTML&NCBI_GI=on&PAGE=Nucleotides&PROGRAM=blastn&SERVICE=plain&SET_DEFAULTS.x=34&SET_DEFAULTS.y=8&SHOW_OVERVIEW=on&END_OF_HTTPGET=Yes&SHOW_LINKOUT=yes&GET_SEQUENCE=yes
38BLAST Results
39BLAST Sequences
http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi
40GenBank
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=88604678
41FASTA
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&qty=1&c_start=1&list_uids=88604678&dopt=fasta&dispmax=5&sendto=&from=begin&to=end&extrafeatpresent=1&ef_CDD=8&ef_MGC=16&ef_HPRD=32&ef_STS=64&ef_tRNA=128
42Constructing a Tree
- Add sequences
http://seqtool.sdsc.edu/CGI/BW.cgi#!
43Clustal W
- Choose the Multiple Sequence Alignment
http://seqtool.sdsc.edu/CGI/BW.cgi#!
44Choose a Tree Type
- Choose Rooted and/or Unrooted
- Submit
http://seqtool.sdsc.edu/CGI/BW.cgi#!
45Voila!
http://seqtool.sdsc.edu/CGI/BW.cgi#!
46Rooted Tree
- Which species are the most closely related?
http://seqtool.sdsc.edu/CGI/BW.cgi#!
47Final Questions
- How are the data helpful if you are a
  - Parent?
  - Restaurant owner?
  - Hospital director?
  - Public health inspector?
48Assessment
- Student Learning Outcomes
- More comfortable with computation
- Using the tools to answer questions
- Empowerment (we hope!)
49References -- Texts
- Emphasis on algorithms
  - Neil C. Jones and Pavel A. Pevzner, An Introduction to Bioinformatics Algorithms
  - Michael S. Waterman, Introduction to Computational Biology
- Bio/Math Balanced
  - Paul G. Higgs and Teresa K. Attwood, Bioinformatics and Molecular Evolution
- The "Bible" of Phylogenetics
  - Joseph Felsenstein, Inferring Phylogenies
50References -- Websites
- http://mbi.ohio-state.edu/2005/tutorials2005.html
  - (tutorial on tree construction)
- http://bioalgorithms.info/courses.php
  - (list of links to bioinformatics course websites)
- http://tree-thinking.org/
  - (resources for learning and teaching)
51Glossary (for the faint of heart)
- Taxon (plural taxa) or operational taxonomic unit (OTU): an entity (such as a species, protein sequence, language, etc.) whose distance from or similarity to other entities can be measured.
- Phylogeny: the evolutionary history of some collection of taxa, i.e., tracking lineages as the taxa change through time.
- Phylogenetic tree: a graphic representation of a phylogeny.
52More Glossary
- Matrix: a rectangular array of data
- Graph: a collection of nodes (aka vertices) (usually represented by dots) and edges (connected pairs of vertices, usually represented by line segments)
53Even More Glossary
- Connected graph: in a connected graph, it is always possible to get from any node to any other node by following the edges. A graph is not connected if there is some pair of nodes with no path of edges between them.
54Glossary- are we there yet?
- Cycle: a graph has a cycle if you can start at some node and, following the edges, get back to that node without backtracking.
55Glossary almost done
- Tree: a connected graph with no cycles
- Weighted tree: a tree whose edges are labelled to represent distances
- Additive tree: a weighted tree in which the distance between any two nodes is the sum of the edge lengths along the path joining them. So whenever B lies on the path from A to C, the distance from A to B plus the distance from B to C is the same as the distance from A to C.
- Degree of a node (or valence): the number of edges attached to a node
- Rooted tree: a tree where some node has been specially designated. (Usually we interpret the root to be the ancestral taxon.)
56The end of the Glossary
- Binary tree: if rooted, the root has degree 2 and all others have degree 1 or 3.
- Internal nodes: nodes in a rooted tree of degree 3
- Leaves: nodes in any tree of degree 1.
- Ultrametric tree: a tree is ultrametric if it meets the three point condition. Any three nodes determine three distances, AB, BC and AC. The three point condition says that the two largest of these three distances must be the same.
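The three point condition can be checked mechanically on a table of distances. A minimal Python sketch, assuming the distances are stored as a nested dictionary; the function name and the small `clocklike` table are made up for illustration, while `exercise` is the four-taxon matrix from the Exercises slide.

```python
from itertools import combinations

def satisfies_three_point(dist, tol=1e-9):
    """True if, for every triple of taxa, the two largest distances are equal."""
    for a, b, c in combinations(dist, 3):
        d = sorted([dist[a][b], dist[b][c], dist[a][c]])
        if abs(d[2] - d[1]) > tol:
            return False
    return True

# Made-up example in which A and B are equally far from C.
clocklike = {
    "A": {"A": 0, "B": 2, "C": 6},
    "B": {"A": 2, "B": 0, "C": 6},
    "C": {"A": 6, "B": 6, "C": 0},
}
# The four-taxon matrix from the Exercises slide.
exercise = {
    "A": {"A": 0, "B": 17, "C": 21, "D": 27},
    "B": {"A": 17, "B": 0, "C": 12, "D": 18},
    "C": {"A": 21, "B": 12, "C": 0, "D": 14},
    "D": {"A": 27, "B": 18, "C": 14, "D": 0},
}
print(satisfies_three_point(clocklike))
print(satisfies_three_point(exercise))
```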