Title: Lecture 2: Character Homology with particular attention to sequence alignment
1Lecture 2: Character Homology (with particular
attention to sequence alignment)
2DNA sequences
- Strings of characters
- Each with one of 4 possible states
- 4 nucleotide bases
- adenine, cytosine, guanine, thymine
3DNA sequences
- Protein-coding genes
- Other structural genes
- Ribosomal RNA
- Transfer RNA
- Other DNA - "non-coding"
- Introns
- Repetitive DNA
4Protein Coding Sequences
Structure and function determined by sequence of
amino acids
5Protein-Coding Genes
6Possible Changes in DNA sequences
- Substitutions
- Inversions
- Insertions and Deletions
- indels
7Protein-Coding Genes
- 3rd position of codon generally redundant
- 1st position changes more common than 2nd
position
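The codon-position pattern above can be checked against the standard genetic code. The sketch below is illustrative only: it assumes the standard nuclear code, ignores changes to or from stop codons, and the names `CODE` and `synonymous_fraction` are made up for this example.

```python
from itertools import product

BASES = "tcag"
# Standard genetic code, codons enumerated with bases in the order t, c, a, g
# (ttt, ttc, tta, ttg, tct, ...). "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_fraction(position):
    """Fraction of single-base changes at a codon position (0, 1 or 2)
    that leave the amino acid unchanged, over all sense codons."""
    same = total = 0
    for codon, aa in CODE.items():
        if aa == "*":
            continue  # skip stop codons as starting points
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            if CODE[mutant] == "*":
                continue  # ignore changes that create a stop codon
            total += 1
            same += CODE[mutant] == aa
    return same / total


for pos in range(3):
    print("codon position", pos + 1, round(synonymous_fraction(pos), 3))
```

Under these assumptions the third position has by far the largest share of silent changes, the first position has a few (in leucine and arginine codons), and the second position has none.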
8Purines
Pyrimidines
9Substitutions
- Transitions
- purine -> purine
- pyrimidine -> pyrimidine
- Transversions
- purine -> pyrimidine
- pyrimidine -> purine
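The two substitution classes can be written as a small classifier. This is a minimal sketch; the function name `classify_substitution` is an assumption for illustration.

```python
PURINES = {"a", "g"}
PYRIMIDINES = {"c", "t"}


def classify_substitution(x, y):
    """Return 'transition', 'transversion' or 'none' for a change x -> y."""
    x, y = x.lower(), y.lower()
    for base in (x, y):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA base: {base!r}")
    if x == y:
        return "none"
    # A transition stays within one chemical class; a transversion crosses.
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


print(classify_substitution("a", "g"))  # purine -> purine
print(classify_substitution("a", "c"))  # purine -> pyrimidine
```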
12Homology in gene sequences
t c t c c a g g t g c a c g t c t t c t
a g t c t c c a g g t g c a c g t c t t
???
13Homology in gene sequences
t c t c c a g g t g c a c g t c t t c t
a g t c t c c a g g t g c a c g t c t t
15Homology in gene sequences
t c t c c a g g t g c a c g t c t t c t
a g t c t c c a g g t g c a c g t c t t
a g t c c c c a g g t g c a c g t c t t
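One way to see the positional-homology problem in the first two sequences above is to slide one against the other and count identical bases at each offset. The sketch below does only that (no internal gaps); the function names are assumptions for illustration.

```python
def matches_at_shift(a, b, shift):
    """Count identical bases when a[i] is compared with b[i + shift]."""
    return sum(
        1
        for i in range(len(a))
        if 0 <= i + shift < len(b) and a[i] == b[i + shift]
    )


def best_shift(a, b):
    """Return the offset of a relative to b that gives the most matches."""
    shifts = range(-(len(a) - 1), len(b))
    return max(shifts, key=lambda s: matches_at_shift(a, b, s))


seq_a = "tctccaggtgcacgtcttct"
seq_b = "agtctccaggtgcacgtctt"

shift = best_shift(seq_a, seq_b)
print("unshifted matches:", matches_at_shift(seq_a, seq_b, 0))
print("best shift:", shift, "matches:", matches_at_shift(seq_a, seq_b, shift))
```

With no offset only a handful of positions agree; shifting the first sequence two positions to the right makes every overlapping position identical, which is the kind of hypothesis of homology an alignment expresses.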
16Introns and exons
17Insertions and deletions
- indels
- generally occur in multiples of three in exons of
protein-coding genes
- may be any length in introns
1 a g t c t c c a g g t g c a c g t c t t
2 a g t c c c c a g g t g c a c g t c t t
18Insertions and deletions
- indels
- generally occur in multiples of three in exons of
protein-coding genes
- may be any length in introns
Frame-Shift Mutation
1 a g t c t c c a g g t g c a c g t c t t
2 a g t c c c c a g g t g c a c g t c t t
3 a g t t c c c c a g g t g c a c g t c t t
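The effect of a single-base insertion on the reading frame can be sketched by splitting sequences 2 and 3 into codons. The sketch assumes, purely for illustration, that the reading frame starts at the first base; the three-base insertion `plus_three` is a made-up comparison case, not from the slides.

```python
def codons(seq, frame=0):
    """Split a sequence into complete triplets, starting at offset `frame`."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


seq2 = "agtccccaggtgcacgtctt"
seq3 = "agttccccaggtgcacgtctt"            # sequence 3: one extra t
plus_three = seq2[:3] + "ttt" + seq2[3:]  # hypothetical 3-base insertion

print(codons(seq2))
print(codons(seq3))        # every codon after the insertion differs
print(codons(plus_three))  # one codon added, downstream codons unchanged
```

An insertion whose length is a multiple of three adds whole codons and leaves the rest of the frame intact, which is why such indels are the ones generally seen in exons.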
19Gene families arise from gene duplications
20- Paralogous genes: two or more different gene
loci in the same organism that originated by gene
duplication
21- Paralogous genes: two or more different gene
loci in the same organism that originated by gene
duplication
- Orthologous genes: the same gene in two different
organisms, homologous due to presence in a common
ancestor
22Gene Copy
- May be more than one paralogous copy of a gene in
the genome
- Copies may be functional
- e.g. EF1α gene occurs in two copies in insects
- Non-functional copies are called pseudogenes
- Insertions and deletions of any length
- May contain stop codons (TGA, TAG, TAA)
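A simple screen for one pseudogene symptom is to look for stop codons inside the reading frame. The sketch below is a rough illustration, not a validated pseudogene test; `in_frame_stops` and the example sequence are hypothetical.

```python
STOP_CODONS = {"tga", "tag", "taa"}


def in_frame_stops(seq, frame=0):
    """Return start positions of stop codons in the given reading frame."""
    seq = seq.lower()
    return [
        i
        for i in range(frame, len(seq) - 2, 3)
        if seq[i:i + 3] in STOP_CODONS
    ]


example = "atgaaataaccc"  # hypothetical sequence
for frame in range(3):
    print("frame", frame, "stops at", in_frame_stops(example, frame))
```

A stop codon well before the expected end of a coding sequence, in the frame the gene is known to use, suggests the copy may be non-functional.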
23BLAST (Basic Local Alignment Search Tool)
- National Center for Biotechnology Information
- Algorithms to match query sequence with sequences
in database (GenBank)
- If the sequences are homologous, there should be
short stretches of complete identity (seeds, or
"words")
- Should be able to find seeds even if sequence has
insertions and deletions
- Finds best matches and computes overall
similarity, probability of a match this close
arising by chance, etc.
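The seed ("word") idea can be sketched in a few lines: index every word of length k in the subject, then look up each word of the query. This is a toy illustration of seeding only, not NCBI's implementation; the function names and the choice k = 4 are assumptions.

```python
from collections import Counter, defaultdict


def index_words(subject, k):
    """Map each word of length k to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(subject) - k + 1):
        index[subject[i:i + k]].append(i)
    return index


def find_seeds(query, subject, k=4):
    """Return (query_pos, subject_pos) pairs of exact word matches."""
    index = index_words(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds


def best_diagonal(seeds):
    """Return (offset, seed_count) for the offset with the most seeds."""
    return Counter(s - q for q, s in seeds).most_common(1)[0]


query = "tctccaggtgcacgtcttct"
subject = "agtctccaggtgcacgtctt"
print(best_diagonal(find_seeds(query, subject)))
```

Seeds that share the same offset (the same "diagonal") point to one region of similarity; a real search would then extend those seeds and score the resulting local alignment.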
25Ribosomes - the workbench for protein synthesis
27Ribosomes
- Consist of ribosomal RNA and proteins
- Protein manufacturing machinery
- rRNA synthesized in nucleolus
- The rRNA self-assembles into two folded
structures, the large and small subunits
28(image of ribosome by Harry Noller, U.C. Santa
Cruz, Venki Ramakrishnan at Cambridge, and Thomas
Steitz at Yale)
29Eukaryotic rDNA
- NTS - non-transcribed spacer regions
- ITS - internal transcribed spacers
- In thousands of tandem repeats
- Identical due to concerted evolution
31 28S rRNA
32Ribosomal DNA
- Insertions and deletions are common
- Alignment of sequences necessary to establish
positional homologies of nucleotides
33Sequence Alignment
X g a c g t t a g a g c t a a t c
Y g a c a g c t c g t c g a
Z g a c g c c c a t c g a g
34Sequence Alignment
X g a c g t t a g a g c t a a t c
Y 1 g a c - - - a g c t c g t c g a
Y g a c a g c t c g t c g a
Z g a c g c c c a t c g a g
35Sequence Alignment
X g a c g t t a g a g c t a a t c
Y 1 g a c - - - a g c t c g t c g a
Y 2 g a c - - - - - a g c t c g t c g a
Y g a c a g c t c g t c g a
Z g a c g c c c a t c g a g
37Sequence Alignment
X g a c g t t a g a g c t a a t c
Z 1 g a c - - - - - - g c c c a t c g a g
Y 1 g a c - - - a g c t c g t c g a
Y 2 g a c - - - - - a g c t c g t c g a
Y g a c a g c t c g t c g a
Z g a c g c c c a t c g a g
38Sequence alignment
- How many gaps should we insert?
- Assign cost to whole gaps or each space in a gap
separately?
- Less cost for expanding an existing gap?
- Should gaps be treated as character states?
- Assign different costs for transitions and
transversions that result from an alignment?
- What order to align the sequences?
- Programs such as CLUSTAL, MALIGN, POY and others
attempt to optimize functions with all of these
parameters, and more
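The difference between costing each space separately and costing whole gaps (with a cheaper extension) can be made concrete. The sketch below compares the single three-space gap of the Y 1 alignment with a hypothetical alignment that scatters the same three spaces; the cost values and function names are assumptions for illustration.

```python
import re


def gap_runs(aligned):
    """Return the lengths of the contiguous gaps in an aligned sequence."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned)]


def per_space_cost(aligned, space=1):
    """Every gap position costs the same, however the gaps are grouped."""
    return space * sum(gap_runs(aligned))


def affine_cost(aligned, gap_open=5, gap_extend=1):
    """Opening a gap is expensive; extending an existing gap is cheap."""
    return sum(gap_open + gap_extend * (n - 1) for n in gap_runs(aligned))


one_gap = "gac---agctcgtcga"     # Y 1 from the slides
three_gaps = "gac-a-g-ctcgtcga"  # hypothetical: same spaces, three gaps

for name, aln in [("one gap", one_gap), ("three gaps", three_gaps)]:
    print(name, per_space_cost(aln), affine_cost(aln))
```

Under the per-space scheme the two alignments cost the same; under the open/extend scheme the single longer gap is much cheaper, which reflects the view that one indel event of length three is more plausible than three separate events.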
39CLUSTAL X
- Multiple Alignment Program
- First, produce guide tree
- Compute distances between sequences
- Cluster most similar sequences together
- Use tree as template for order of sequential
pairwise alignments
- Conduct sequential pairwise alignments
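Each step of a progressive alignment rests on a pairwise alignment. The sketch below is a textbook Needleman-Wunsch global alignment with a simple per-space gap penalty, offered as an illustration of that pairwise step; it is not CLUSTAL's actual scoring scheme, it does not build a guide tree, and the score values are assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: best score for aligning a[:i] with b[:j].
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two bases
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


x = "gacgttagagctaatc"
y = "gacagctcgtcga"
aligned_x, aligned_y, total = needleman_wunsch(x, y)
print(aligned_x)
print(aligned_y)
print("score:", total)
```

Changing `match`, `mismatch` or `gap` can change which alignment is returned, which is why the parameter choices matter.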
40Clustal X Alignment Parameters: slow, accurate
alignments
- Penalty for opening first gap
- Penalty for changing size of existing gap
- Penalty for each inferred transition
- Penalty for each inferred transversion
41Optimization Alignment: Ward Wheeler (AMNH)
- Completely different approach
- Strategy of multiple sequence alignment followed
by phylogenetic analysis is fundamentally flawed
- Better approach is to integrate process of
sequence alignment with phylogenetic analysis in
an iterative, recursive analysis
- POY program
- Attempts to optimize both phylogenetic analysis
and alignment simultaneously
- VERY CPU intensive
42Optimization Alignment
- Minimize both insertion/deletion events and
substitutions
- Alignments are dynamic and uniquely tailored to
each topology
- Alignments chosen to minimize tree length
- Homoplasy and alignment cost functions minimized
simultaneously
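"Tree length" here is the number of changes a topology requires. The sketch below counts that length for a fixed alignment on a fixed tree using Fitch parsimony; it does not perform optimization alignment itself (POY varies the alignment as well), and the four-taxon data set and function names are hypothetical.

```python
def fitch(tree, states):
    """Return (possible ancestral states, changes) for one column.

    tree is a leaf name or a pair (left_subtree, right_subtree);
    states maps each leaf name to its character state.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one change is needed on this part of the tree.
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, alignment):
    """Sum the Fitch change counts over all alignment columns."""
    n_columns = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[c] for name, seq in alignment.items()})[1]
        for c in range(n_columns)
    )


# Hypothetical aligned sequences for four taxa.
alignment = {"A": "gact", "B": "gact", "C": "gcct", "D": "gcca"}
print(tree_length((("A", "B"), ("C", "D")), alignment))
print(tree_length((("A", "C"), ("B", "D")), alignment))
```

The same alignment gives different lengths on different topologies; optimization alignment goes one step further and lets the alignment itself change with each topology being evaluated.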
43Sensitivity Analysis
- Explore different regions of alignment parameter
space by trying different combinations of
parameters
- Determine how sensitive results are to changes in
different parameters
- Determine optimal combinations of parameters
- Criterion for optimality is often congruence
(concordance) with other data sets after analysis
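A very small version of the parameter exploration can be sketched by costing two candidate alignments of Y against X over a grid of transition, transversion and gap costs. The second candidate (Y with a terminal gap) and all cost values are hypothetical, and the sketch only reports which candidate is cheaper; it does not measure congruence with other data sets.

```python
from itertools import product

PURINES = set("ag")


def count_events(ref, aligned):
    """Count (transitions, transversions, gap positions) column by column."""
    ts = tv = gaps = 0
    for x, y in zip(ref, aligned):
        if x == "-" or y == "-":
            gaps += 1
        elif x != y:
            if (x in PURINES) == (y in PURINES):
                ts += 1  # both purines or both pyrimidines
            else:
                tv += 1
    return ts, tv, gaps


X = "gacgttagagctaatc"
candidates = {
    "internal gap": "gac---agctcgtcga",  # Y 1 from the slides
    "terminal gap": "gacagctcgtcga---",  # hypothetical alternative
}


def preferred(ts_cost, tv_cost, gap_cost):
    """Name of the cheaper candidate under the given costs."""
    def cost(name):
        ts, tv, gaps = count_events(X, candidates[name])
        return ts * ts_cost + tv * tv_cost + gaps * gap_cost
    return min(candidates, key=cost)


for ts_cost, tv_cost, gap_cost in product([1], [1, 2, 3], [1, 2, 4]):
    print(ts_cost, tv_cost, gap_cost, preferred(ts_cost, tv_cost, gap_cost))
```

In this toy case the gap cost never changes the outcome (both candidates contain three gap positions), but the preferred alignment flips once transversions cost more than 1.5 times as much as transitions.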
44 28S rRNA
45Structural Alignment
- Another approach to alignment of molecules for
which we have models of secondary structure
- ribosomal sequences, tRNA, etc.
46 28S rRNA - D2 region
47 28S rRNA - D2 region
48Structural Alignment (with thanks to Joe
Gillespie and Matt Yoder)
57Structural Alignment
Regions of unambiguous homology
58Some recent reviews
- Morrison, D. A. (2006). Multiple sequence
alignment for phylogenetic purposes. Australian
Systematic Botany 19: 479-539.
- Ogden & Rosenberg (2006). Multiple Sequence
Alignment Accuracy and Phylogenetic Inference.
Syst. Biol. 55(2): 314-328.
- Höhl & Ragan (2007). Is Multiple-Sequence
Alignment Required for Accurate Inference of
Phylogeny? Syst. Biol. 56(2): 206-221.
- Parmentier et al. (2006). Large scale multiple
sequence alignment with simultaneous phylogeny
inference. J. Parallel Distrib. Comput. 66:
1534-1545.
- Phillips, A. (2000). Multiple Sequence Alignment
in Phylogenetic Analysis. Molecular Phylogenetics
and Evolution 16(3): 317-330.
59Lutzoni et al. (2000) Systematic Biology 49(4)
628-651
- Method to integrate ambiguously aligned regions
into analysis
- Retain positional homologies
- Calculate minimum distances necessary to
transform one fragment into another
60Structural Alignment
Regions of unambiguous homology
61Fixed State Optimization
- Treat contiguous strings of nucleotides as
character states
- Calculate minimum pairwise distances between such
states
- Attempt to find topology that minimizes these
distances
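The "minimum pairwise distance" between whole fragments can be illustrated with a plain edit distance, where each substitution or single-base indel costs one step. This is only a sketch of the idea: the published methods use their own cost schemes, and the function names and unit costs are assumptions. The fragments are the unaligned X, Y and Z sequences from the earlier alignment slides.

```python
def edit_distance(a, b, sub=1, indel=1):
    """Minimum total cost of substitutions and indels turning a into b."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [i * indel]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j - 1] + (0 if x == y else sub),  # match or substitute
                prev[j] + indel,                       # delete x
                cur[j - 1] + indel,                    # insert y
            ))
        prev = cur
    return prev[-1]


def step_matrix(fragments):
    """Pairwise distances between fragments treated as character states."""
    return [[edit_distance(a, b) for b in fragments] for a in fragments]


fragments = ["gacgttagagctaatc", "gacagctcgtcga", "gacgcccatcgag"]
for row in step_matrix(fragments):
    print(row)
```

A tree search can then treat each fragment as one state of a single character and use a matrix like this as the cost of changing from one state to another.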
62Alignment methods (Wheeler, Cladistics, 2001)
63Saturation of gene sequences
t c t c c a g g t g c a c g t c t t c t
1 a g t c t c c a g g t g c a c g t c t t
2 a g t c c c c a g g t g c a c g t c t t
64Saturation of gene sequences
t c t c c a g g t g c a c g t c t t c t
1 a g t c t c c a g g t g c a c g t c t t
2 a g t c c c c a g g t g c a c g t c t t
3 a g t c t c c a g g t g c a c g t c t t
65Saturation of gene sequences
t c t c c a g g t g c a c g t c t t c t
1 a g t c t c c a g g t g c a c g t c t t
2 a g t c c c c a g g t g c a c g t c t t
3 a g t c t c c a g g t g c a c g t c t t
4 a g t c a c c a g g t g c a c g t c t t
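The saturation effect in sequences 1 to 4 can be quantified by comparing the number of changes that occurred with the number of differences still visible. The sketch assumes, for illustration, that the numbered sequences are successive states of one lineage; `hamming` is a made-up helper name.

```python
def hamming(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


lineage = [
    "agtctccaggtgcacgtctt",  # 1
    "agtccccaggtgcacgtctt",  # 2: t -> c at the fifth site
    "agtctccaggtgcacgtctt",  # 3: c -> t, back to the original state
    "agtcaccaggtgcacgtctt",  # 4: t -> a at the same site
]

actual = sum(hamming(p, q) for p, q in zip(lineage, lineage[1:]))
observed = hamming(lineage[0], lineage[-1])
print("substitutions that occurred:", actual)
print("differences observed between 1 and 4:", observed)
```

Three substitutions hit the same site, yet only one difference separates the first and last sequences, and sequences 1 and 3 look identical. Observed differences therefore underestimate the true amount of change once sites start receiving multiple hits.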