Title: MB280 Introduction to Bioinformatics 3 credits Genes, machines and you' Learn the basics of analyzin
1 MB280 Introduction to Bioinformatics (3 credits)
Genes, machines and you. Learn the basics of analyzing DNA and protein sequences using state-of-the-art computer software.
Note: This is an experimental class for graduate students, with limited space for undergraduates. Last year's class, which incorporated both student groups, was highly successful.
MB535-Lab: Genomic Analysis (4 credits)
Learn the basics of bioinformatics: database searching, sequence analysis, phylogenetic reconstruction and in silico research design. Take advantage of biological databases that will inform and direct your future research agenda.
- Tuesday lecture 2:00-3:15
- Thursday laboratory 2:00-5:00
- INBRE Bioinformatics Lab
- Cooley Lab B2
- Professor Marcie McClure
- mars_at_parvati.msu.montana.edu
2 What was covered last lecture?
- What is bioinformatics, from several perspectives?
- Why the multiple alignment is important.
- Analyzing data at different levels.
3 Tentative syllabus
- 8/28/07 1) Class orientation, introduction to computers and operating languages
- 8/30/07 1st Lab Assignment: know the machines your monitor is accessing/using your own computer
  - i. Know how to access the Internet; send me an email.
  - ii. Do the Unix/Linux tutorials.
  - iii. Do a web search for the terms bioinformatics and computational biology.
  - iv. Create a list of sites and types of methods necessary to do bioinformatics and computational biology.
- 9/4/07 2) What is Bioinformatics/Computational Biology
- 9/6/07 2nd Lab Assignment
  - i. Loading software on your own machine.
  - ii. The NCBI/EBI/PR sites: familiarize yourself with them.
  - iii. Do a conceptual translation of a nucleic acid sequence.
  - iv. Choose a gene family to work on for the rest of the semester.
- 9/11/07 3) Database searching and pairwise alignments
- 9/13/07 3rd Lab Assignment
4 More Than You Ever Wanted To Know About Searching Sequence Databases
5 (Figure labels: Convergence; Homology; Recombination or Transposition; Analogy)
6 1-Dimensional Biosequence Structures
DNA         TAC GGA TGT TTC CGC CTA
RNA         AUG CCU ACA AAG GCG GAU
Amino acid  met pro thr lys ala asp
            M   P   T   K   A   D
McClure, 2001
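The conceptual translation from RNA to amino acids shown on this slide can be sketched in a few lines. This is a minimal illustration, assuming the standard genetic code (NCBI translation table 1); the names CODON_TABLE and translate are made up for this example and are not from the lecture.

```python
# Standard genetic code, built from the conventional U, C, A, G ordering
# of the three codon positions ("*" marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(rna):
    """Translate an RNA string codon by codon, stopping at a stop codon."""
    protein = []
    # Ignore any trailing bases that do not form a complete codon.
    for pos in range(0, len(rna) - len(rna) % 3, 3):
        residue = CODON_TABLE[rna[pos:pos + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


# The RNA row from the slide.
print(translate("AUGCCUACAAAGGCGGAU"))  # MPTKAD
```

The output matches the one-letter amino acid row on the slide (M P T K A D).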
7 Pairwise vs. Multiple Alignments
8 Primary Structure: the Sequence
Sequence Alignment
Phylogenetics: >70% id N.A.; <70% id A.A.
Determine OSM: <30% id A.A.
2-D and 3-D Predictions
function, evolution, structure
McClure, 2001
10 Identity and Similarity
Identity
Sequence 1  M A L V D D M F R
Match       M A   V D   M F R
Sequence 2  M A C V D E M F R
Similarity
Blue: Nitrogen; White: Carbon; Red: Oxygen
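The identity comparison on this slide can be reproduced with a short sketch. It assumes the two sequences are already aligned, have equal length and contain no gap characters; the function names are illustrative only.

```python
def match_line(seq1, seq2):
    """Return the residues that are identical at each aligned position,
    with a space wherever the two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    return "".join(a if a == b else " " for a, b in zip(seq1, seq2))


def percent_identity(seq1, seq2):
    """Percentage of aligned positions that hold the same residue."""
    matches = sum(1 for ch in match_line(seq1, seq2) if ch != " ")
    return 100.0 * matches / len(seq1)


seq1 = "MALVDDMFR"
seq2 = "MACVDEMFR"
print(match_line(seq1, seq2))                  # MA VD MFR
print(round(percent_identity(seq1, seq2), 1))  # 77.8
```

Seven of the nine positions match, so the two example sequences are about 78% identical. Similarity would additionally count conservative replacements such as D/E, which requires a substitution matrix.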
11 Sequence Identity vs. Homology
12 Alignment Theory
(what a database searcher should know about alignments)
- Alignments are evolutionary hypotheses.
- A gap and its length are distinct quantities; should different weights be applied to each?
- Weights for different mismatches should be permitted (substitution matrices).
- An alignment demonstrates similarity, not necessarily homology.
13 Local vs. Global Alignments
- A global alignment is a comparison over the entire sequence length. Gaps are added for overall optimization in the context of a specific scoring scheme.
- Use a global alignment to estimate the overall degree of similarity between sequences, e.g. for phylogenetic analysis.
- In local alignments only the most highly conserved sections are aligned, and no attempt is made to align divergent sequence regions by adding gaps.
- Local alignments are good for finding conserved motifs.
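The difference between global and local alignment can be seen in a small dynamic-programming sketch. The scoring values (match +1, mismatch -1, gap -2) are arbitrary choices for illustration, not values from the lecture, and the function returns only the score, not the alignment itself.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score by dynamic programming.

    local=False scores a global alignment (Needleman-Wunsch style):
    both sequences must be aligned end to end.
    local=True scores a local alignment (Smith-Waterman style):
    cells are never allowed below zero and the best cell anywhere wins.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


print(alignment_score("ACGT", "TTACGTTT", local=True))  # 4
print(alignment_score("ACGT", "TTACGTTT"))              # -4
```

The local score rewards the shared ACGT motif alone, while the global score is pulled down by the four flanking residues that must be aligned against gaps.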
14 Database Searching I
- A sequence by itself is not informative; it must be analyzed by comparative methods against existing databases to develop hypotheses concerning relatives and function.
- If your sequence has a similar copy in the database, and the known sequence has been annotated, the match will quickly reveal some properties of your sequence.
- Because of the degeneracy of the genetic code, searches for homologous sequences should be conducted at the protein sequence level, which is about 5 times more sensitive at finding matches.
15 Database Searching II
- Database searching was derived from three areas of previous knowledge:
  - Similarity scores allowing substitutions of residues with similar characteristics (Dayhoff matrix)
  - Computer algorithms that find the best matches between two sequences (Smith-Waterman)
  - Sequence databases which store the millions of bases of data to be searched (GenBank)
16 Database Searching III
- Database Search Assumptions
  - Sequences being sought have an evolutionary ancestral sequence in common with the query.
  - Our best guess at the actual path of evolution is the path that requires the fewest evolutionary events.
  - Not all substitutions are equally likely; they should be weighted accordingly using a similarity score and a substitution matrix.
  - Gaps may be penalized using two different weighting schemes:
    - Constant
    - Affine
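The two gap-weighting schemes can be written as small functions. This is a sketch under one common convention; the penalty values are placeholders, and some programs instead charge the extension cost for every gap position including the first.

```python
def constant_gap_penalty(length, penalty=8):
    """Constant scheme: every gap costs the same, whatever its length."""
    return -penalty if length > 0 else 0


def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Affine scheme: one charge for opening the gap plus a smaller
    charge for each additional position the gap spans."""
    if length <= 0:
        return 0
    return -(gap_open + gap_extend * (length - 1))


for length in (1, 2, 5):
    print(length, constant_gap_penalty(length), affine_gap_penalty(length))
# 1 -8 -10
# 2 -8 -11
# 5 -8 -14
```

Under the affine scheme a single long gap is cheaper than several short ones of the same total length, which reflects the idea that one insertion or deletion event can span many residues.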
17 Similarity Scoring
- Similarity scores are used to assess sequence likeness based upon a substitution matrix.
- Some similarity scores are based upon observed substitutions of one amino acid for another in homologous proteins; others are based on physicochemical properties or other criteria.
- Modern similarity scores computed as log-odds scores have been shown to be the most efficient way to use the observed substitution data to detect homologous sequences.
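A log-odds score compares how often two residues are seen aligned in related sequences with how often they would pair up by chance. The sketch below assumes the usual form, score = scale x log2(q_ab / (p_a x p_b)), rounded to an integer; the frequencies are invented for illustration and the parameter names are not from the lecture.

```python
import math


def log_odds_score(q_ab, p_a, p_b, scale=2.0):
    """Scaled, rounded log-odds score.

    q_ab  observed frequency of residues a and b aligned together
    p_a   background frequency of residue a
    p_b   background frequency of residue b
    scale multiplier applied before rounding (2.0 gives half-bit units)
    """
    return round(scale * math.log2(q_ab / (p_a * p_b)))


# Pair seen twice as often as chance predicts: positive score.
print(log_odds_score(0.125, 0.25, 0.25))    # 2
# Pair seen exactly as often as chance predicts: zero.
print(log_odds_score(0.0625, 0.25, 0.25))   # 0
# Pair seen half as often as chance predicts: negative score.
print(log_odds_score(0.03125, 0.25, 0.25))  # -2
```

Positive entries in a substitution matrix therefore mark replacements that occur more often than expected in related proteins, and negative entries mark replacements that are avoided.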
18 Substitution Matrices I
- A substitution matrix contains values proportional to the probability that amino acid l mutates into amino acid v.
- Probabilities are based on:
  - Observed mutation
  - Physicochemical properties and common mutation based on codon positions
- Matrices rescue us from having to assume that all amino acid changes are equally likely.
- The choice of substitution matrix may influence the outcome of analysis.
- There are three popular amino acid substitution matrices currently in use.
19 Substitution Matrices II
- PAM: Percent Accepted Mutation, developed by Dayhoff et al. Used protein families to determine substitutions.
- BLOSUM: BLOcks SUbstitution Matrix. Used conserved regions of protein families.
- GONNET: used classical distance measures to estimate an alignment of proteins and then used those alignments to estimate a new matrix.
20 PAM
- Observed mutations in protein families
- Accepted Point Mutation / Percent Accepted Mutation
- Derived from global alignments of closely related sequences (globins and cytochromes)
- The number with the matrix (PAM40, PAM100) refers to the evolutionary distance.
- Several later groups have extended Dayhoff's methodology or reapplied her analysis using later databases with more examples.
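One way to see why the PAM number acts as an evolutionary distance: in the construction usually attributed to Dayhoff, PAM1 is a table of replacement probabilities for one accepted mutation per 100 residues, and the table for a larger distance n is PAM1 multiplied by itself n times. The sketch below uses a made-up two-letter alphabet rather than real amino acid data, so the numbers are purely illustrative.

```python
def mat_mul(x, y):
    """Multiply two square matrices stored as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Multiply matrix m by itself so that it appears n times (n >= 1)."""
    result = m
    for _ in range(n - 1):
        result = mat_mul(result, m)
    return result


# Hypothetical PAM1-like matrix: each residue has a 1% chance of being
# replaced by the other residue over one unit of evolutionary distance.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam2 = mat_power(TOY_PAM1, 2)
toy_pam250 = mat_power(TOY_PAM1, 250)
print(round(toy_pam2[0][0], 4))    # 0.9802
print(round(toy_pam250[0][0], 3))  # 0.503
```

As the distance grows, the chance that a residue is still unchanged falls toward its background frequency, which is why high-numbered PAM matrices suit distant relationships and low-numbered ones suit close relatives.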
21 The PAM250 substitution matrix multiplied by 10.
22 Extensions
- Jones, Thornton and coworkers used the same methodology as Dayhoff but with modern databases (CABIOS 8:275).
- Gonnet and coworkers (Science 256:1443) used a slightly different (but theoretically equivalent) methodology.
- Henikoff & Henikoff (Proteins 17:49) compared these two newer versions of the PAM matrices with Dayhoff's originals.
BLOSUM
- The BLOSUM series of matrices was created by Steve Henikoff and colleagues (PNAS 89:10915).
- Derived from local, ungapped alignments of distantly related sequences.
- The number after the matrix (BLOSUM62) refers to the minimum percent identity of the blocks used to construct the matrix; greater numbers are lesser distances.
- The BLOSUM series of matrices generally performs better than PAM matrices for local similarity searches (Proteins 17:49).
23 The log-odds matrix for BLOSUM62
24 Gonnet Matrix
- Developed by Gonnet, Cohen, and Benner (1992).
- Uses exhaustive pairwise alignments of the protein databases.
- Distance matrices differ on whether they were derived from distantly or closely homologous proteins.
- They suggest that their matrix should be used for initial comparisons, and a PAM matrix for subsequent refinements.
25 Important Caveats
- No single matrix performs best on all sequences. Some are better for sequences with few gaps, and others are better for sequences with fewer identical amino acids.
- In our study of RT, we found that out of PAM, BLOSUM, and GONNET, PAM70 works the best for this protein.
- PAM matrices have a theoretical advantage over other methods because they are based on observed mutations.
- BLOSUM assumes that the evolutionary rates are uniform over the whole protein ---- NOT TRUE!!
26 (Timeline figure: Phylogenetics, Alignment, Structural Prediction)
Pairwise alignment and searches:
- 1970 Gibbs/McIntyre, and Needleman/Wunsch
- 1981 Smith/Waterman
- 1982 Maizel and Lenk
- 1983 Wilbur/Lipman
- 1985 Lipman/Pearson, FASTA
- 1990 Karlin/Altschul
- 1990 Altschul et al., BLAST
Local and multiple alignment: Dynamic Programming; Stochastic Methods; Hidden Markov Modeling, 1994
Phylogenetics:
- 1962 Zuckerkandl and Pauling
- 1967 FM algorithm
- Distance, Parsimony, Maximum Likelihood
- 1961? 1972 AA replacement
28 (Figure: scoring grid for aligning ATGGCCAACGAW with GTFCLAACAGMW; grid values not recoverable)
ATGGCCAACGAW
GTFCLAACAGMW
ATGGCCAAC GAW
GTF CLAACAGMW
SCORE 103
SCORE 130
29 Algorithms
- Exact Algorithms
  - Needleman-Wunsch: the first sequence comparison method. It scored one point for a match and no points for a mismatch, and it performed a global search. It was very time consuming and computationally expensive.
  - Smith-Waterman: the most rigorous method. It has no built-in heuristics; it uses dynamic programming to account for every possible alignment, so it finds the optimal local alignment under the chosen scoring scheme.
- Approximate Algorithms
  - BLAST and FASTA both apply heuristic restrictions that allow a faster search with less sensitivity, but more selectivity. They do not find all best matches!
  - BLAST: breaks the query and the database sequences into fragments and initially seeks matches between fragments.
  - FASTA: looks for optimal local alignments by scanning the sequence for small matching words. Scores of these segments are calculated and summed to generate an alignment score. It outputs an optimized alignment with gaps.
  - BLAST is 3-4x faster than FASTA.
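The word-scanning idea behind FASTA can be sketched as follows: index every short word of the query, look up each word of the database sequence, and count hits per diagonal (the offset between the two positions). Many hits on one diagonal suggest a region worth aligning properly. This is a simplified illustration with hypothetical names and a word length of 2, not the FASTA program's actual implementation.

```python
from collections import Counter, defaultdict


def word_hits_by_diagonal(query, subject, k=2):
    """Count identical k-letter words shared by two sequences, grouped by
    diagonal (subject position minus query position)."""
    # Index the query: word -> list of start positions.
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)
    # Scan the subject and record the diagonal of every shared word.
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in positions.get(subject[j:j + k], []):
            diagonals[j - i] += 1
    return diagonals


print(word_hits_by_diagonal("ACGT", "TTACGT"))  # Counter({2: 3})
```

All three shared words (AC, CG, GT) fall on diagonal 2, meaning the query matches the subject shifted by two positions.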
30 Smith-Waterman Based Search Methods
- FASTA: the first method that used local optimal segments to produce local alignments based on S-W.
  - Also the first method to introduce gaps, but had no reliable scoring scheme.
- Ssearch: compares a sequence to all entries in a database, but is very slow.
- BLAST: the most popular method. It compares the query sequence to matches in the database, reporting back an alignment.
  - Each comparison is given a score reflecting the degree of similarity between the two sequences.
- WU-BLAST: a beefed-up version of BLAST, created at Washington University.
  - It has outperformed BLAST in some studies, but our study showed no difference.
31 A Closer Look at BLAST
- BLAST: Basic Local Alignment Search Tool
- It is a set of sequence comparison algorithms used to search databases for optimal local alignments to a query.
- It breaks the query and database sequences into fragments and seeks matches between them.
- The algorithm was written to balance speed and sensitivity for distant sequence relationships.
32 BLAST Algorithm
First step: For each position p of the query, find the list of words of length w scoring more than T when paired with the word starting at p.
33 Second step: Extension of hits requires a second hit on the same diagonal at a distance of less than A.
Third step: Gapped extension of High Scoring Pairs above a threshold.
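The first two steps can be sketched with a toy scoring scheme. Everything here is an assumption made for illustration: a four-letter alphabet, +2 for an identical pair and -1 otherwise in place of a real substitution matrix, and the default values of w, T and A. The gapped extension of the third step is not shown.

```python
from itertools import product

ALPHABET = "ACGT"  # toy alphabet; a protein search would use 20 amino acids


def pair_score(a, b):
    """Hypothetical scoring: +2 for identical letters, -1 otherwise."""
    return 2 if a == b else -1


def neighborhood(query, p, w=3, T=3):
    """Step 1: all words of length w that score more than T when paired
    with the query word starting at position p."""
    target = query[p:p + w]
    return [
        "".join(word)
        for word in product(ALPHABET, repeat=w)
        if sum(pair_score(a, b) for a, b in zip(word, target)) > T
    ]


def two_hit(hit1, hit2, A=40):
    """Step 2: two hits, each a (query_pos, subject_pos) pair, trigger an
    extension if they lie on the same diagonal less than A apart."""
    (q1, s1), (q2, s2) = hit1, hit2
    same_diagonal = (s1 - q1) == (s2 - q2)
    return same_diagonal and 0 < abs(q2 - q1) < A


print(neighborhood("ACGTA", 0))            # ['ACG']
print(len(neighborhood("ACGTA", 0, T=2)))  # 10
print(two_hit((0, 5), (10, 15)))           # True
print(two_hit((0, 5), (10, 16)))           # False
```

Lowering T from 3 to 2 admits the nine words with a single mismatch as well as the exact word, which shows how the threshold trades speed for sensitivity.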
34 Which Method is Best?
- It depends on your research needs.
- BLAST is the tool of choice for most searches because it is quick and fairly reliable.
- WU-BLAST has outperformed BLAST in some studies, but in our work BLAST is superior.
- Ssearch is more sensitive, but more computationally expensive.
35 Non-trivial BLAST Parameters (work fine at default settings)
- Gap Penalty and Gap Cost: gap scores are negative numbers representing an insertion or deletion. Opening a gap is penalized more heavily than extending it.
- Word Size: the number of amino acids or nucleotides that BLAST uses as segments for its initial search.
- Genetic Codes: there are rare genetic codes that lie outside the norm; for our purposes the standard code is used.
- Formatting Parameters: these are set to retrieve the easiest-to-read output and should be left alone.
36 Changing the Parameters
- E-value: measures the expected number of sequences in the database that would achieve a given score by chance alone. The lower the E-value, the less likely the match arose by chance and the stronger the case for homology.
- Filters: BLAST by default will filter out low complexity regions (regions of small repeats) of the query and database sequences. If not filtered, many more false positives would result.
- Matrix: you will find that different matrices make a world of difference.
- Databases: it is important to know which database to use for your own research.
- Alignments and Descriptions: each can be set to as many as 1000. If these are set to 50 and there are 100 hits, only the top 50 will be reported.
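The E-value can be made concrete with the Karlin-Altschul formula that underlies BLAST statistics, E = K x m x n x exp(-lambda x S), where m is the query length, n the database length and S the raw score. In the sketch below the default values of lambda and K are illustrative placeholders only; real values depend on the scoring matrix and gap penalties in use.

```python
import math


def expected_hits(score, query_len, db_len, lam=0.318, K=0.134):
    """Expected number of chance alignments scoring at least `score`
    when a query of query_len residues is searched against db_len residues."""
    return K * query_len * db_len * math.exp(-lam * score)


query_len, db_len = 250, 1_000_000
for score in (40, 60, 80):
    print(score, f"{expected_hits(score, query_len, db_len):.2e}")
```

Two properties follow directly from the formula: the E-value falls exponentially as the score rises, and it grows in proportion to database size, so the same alignment is less significant in a larger database.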
37 The NCBI BLAST family of programs includes:
- blastp: compares an amino acid query sequence against a protein sequence database.
- blastn: compares a nucleotide query sequence against a nucleotide sequence database.
- blastx: compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
- tblastn: compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
- tblastx: compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page.
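The program descriptions above amount to a lookup on query type and database type. The helper below is a hypothetical convenience for remembering that mapping; it is not part of any BLAST distribution.

```python
BLAST_PROGRAMS = {
    # (query type, database type): program
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast_program(query_type, database_type, translate_both=False):
    """Pick the BLAST program for a query/database combination.

    translate_both=True selects tblastx, which compares six-frame
    translations of a nucleotide query and a nucleotide database.
    """
    if translate_both:
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, database_type)]


print(choose_blast_program("protein", "nucleotide"))     # tblastn
print(choose_blast_program("nucleotide", "nucleotide",
                           translate_both=True))         # tblastx
```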
38 File Formats
FASTA format
>gi|21616997|gb|AAM66461.1| pol polyprotein [Human immunodeficiency virus type 1]
PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGRWKPKMIGGI
GGFIKVRQYDQILIEICGHK AIGTVLVGPTPVNIIGRNLLTQIGCTLNF
PISPIETVPVKLKPGMDGPKVKQWPLTEEKIRALTEICXEL
EKEGKISKIGPENPYNTPVFAIKKKDSTKWRKLVDFRELNKKTQDFWEVQ
LGIPHPAGLKKKKSVTVLDV GDAYFSVPLDKDFRKYTAFTIPSINNETP
GIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPDIVIY
QYVDDLYVGSDLEIGQHRAKIEELRQHLXKWGFYTPDKKHQKEPPXLWM
Accession number
AAM66461
Raw sequence
PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGRWKPKMIGGI
GGFIKVRQYDQILIEICGHK AIGTVLVGPTPVNIIGRNLLTQIGCTLNF
PISPIETVPVKLKPGMDGPKVKQWPLTEEKIRALTEICXEL
EKEGKISKIGPENPYNTPVFAIKKKDSTKWRKLVDFRELNKKTQDFWEVQ
LGIPHPAGLKKKKSVTVLDV GDAYFSVPLDKDFRKYTAFTIPSINNETP
GIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPDIVIY
QYVDDLYVGSDLEIGQHRAKIEELRQHLXKWGFYTPDKKHQKEPPXLWM
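Reading FASTA format takes only a few lines: a line starting with ">" begins a record and carries the description, and the lines that follow are the sequence. The sketch below assumes the header on this slide originally used the NCBI layout ">gi|number|gb|ACCESSION.version| description"; the function names are illustrative, and the example record is shortened to the first lines of the sequence above.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped or contain spaces; join them.
            chunks.append(line.replace(" ", ""))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def accession(header):
    """Extract the accession, without its version suffix, from an
    NCBI-style 'gi|number|db|ACCESSION.version|' header."""
    fields = header.split("|")
    if len(fields) >= 4 and fields[0] == "gi":
        return fields[3].split(".")[0]
    return header.split()[0]


example = """>gi|21616997|gb|AAM66461.1| pol polyprotein
PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGRWKPKMIGGI
GGFIKVRQYDQILIEICGHK
"""
header, sequence = parse_fasta(example)[0]
print(accession(header))  # AAM66461
print(len(sequence))      # 70
```

Dropping the header line and joining the remaining lines gives the raw sequence format shown on the slide.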
39 References and Further Reading
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997 Sep 1;25(17):3389-402. Review.
- Brenner SE, Chothia C, Hubbard TJ. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc Natl Acad Sci U S A. 1998 May 26;95(11):6073-8.
- M.O. Dayhoff, R.V. Eck, M.A. Chang, and M.R. Sochard. Atlas of Protein Sequence and Structure. Natl. Biomed. Res. Fnd., Silver Spring MD, 1965.
- Gaston H. Gonnet, Steven A. Benner, and Mark A. Cohen. Analysis of amino acid substitution during divergent evolution: the 400 by 400 dipeptide substitution matrix. Biochem. and Biophys. Res. Com., 199(2):489-496, 1994.
- Steven Henikoff and Jorja G. Henikoff. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA, 89:10915-10919, 1992.
- W.R. Pearson and D.J. Lipman. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA, 85:2444-2448, 1988.