1
MB280 Introduction to Bioinformatics (3 credits)
Genes, machines and you. Learn the basics of analyzing DNA and protein sequences
using state-of-the-art computer software.
Note: This is an experimental class for graduate students, with limited space for
undergraduates. Last year's class, incorporating both student groups, was highly
successful.
MB535-Lab Genomic Analysis (4 credits)
Learn the basics of Bioinformatics: database searching, sequence analysis,
phylogenetic reconstruction and in silico research design. Take advantage of
biological databases that will inform and direct your future research agenda.
  • Tuesday lecture 2:00-3:15
  • Thursday laboratory 2:00-5:00
  • INBRE Bioinformatics Lab
  • Cooley Lab B2
  • Professor Marcie McClure
  • mars_at_parvati.msu.montana.edu

2
What was covered last lecture?
  • What is Bioinformatics, from several perspectives?
  • Why the multiple alignment is important.
  • Analyzing data at different levels.
3
  • Tentative syllabus.
  • 8/28/07 1) Class orientation, introduction to computers and operating languages
  • 8/30/07 1st Lab Assignment: know the machines your monitor is accessing/using your own computer
  • i. Know how to access the Internet: send me an email.
  • ii. Do the Unix/Linux tutorials.
  • iii. Do a web search for the terms bioinformatics and computational biology.
  • iv. Create a list of sites and types of methods necessary to do bioinformatics and computational biology.
  • 9/4/07 2) What is Bioinformatics/Computational Biology
  • 9/6/07 2nd Lab Assignment
  • i. Loading software on your own machine.
  • ii. The NCBI/EBI/PR sites: familiarize yourself with them.
  • iii. Do a conceptual translation of a nucleic acid sequence.
  • iv. Choose a gene family to work on for the rest of the semester.
  • 9/11/07 3) Database searching and pairwise alignments
  • 9/13/07 3rd Lab Assignment

4
More Than You Ever Wanted To Know About
Searching Sequence Databases
5
Convergence
Homology
Recombination or Transposition
Analogy
6
1-Dimensional Biosequence Structures
DNA         TAC GGA TGT TTC CGC CTA
RNA         AUG CCU ACA AAG GCG GAU
Amino acid  met pro thr lys ala asp
            M   P   T   K   A   D
McClure, 2001
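
The DNA to RNA to amino acid relationship on this slide can be sketched in a few lines of Python. This is only an illustration: it assumes the DNA line is the template strand (so each RNA base is its complement), the codon table is cut down to the six codons shown on the slide, and the function names are hypothetical.

```python
# Complement rules for copying a DNA template strand into mRNA.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

# Only the six codons that appear on the slide; a full table has 64 entries.
CODON_TABLE = {
    "AUG": "M", "CCU": "P", "ACA": "T",
    "AAG": "K", "GCG": "A", "GAU": "D",
}


def transcribe_template(dna):
    """Return the mRNA complementary to a DNA template strand."""
    return "".join(DNA_TO_RNA[base] for base in dna)


def translate(rna):
    """Translate an mRNA string codon by codon, ignoring a trailing partial codon."""
    usable = len(rna) - len(rna) % 3
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


print(transcribe_template("TACGGATGTTTC"))  # AUGCCUACAAAG
print(translate("AUGCCUACAAAGGCGGAU"))      # MPTKAD
```

A real conceptual translation would use the complete standard genetic code and handle stop codons.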
7
Pairwise vs. Multiple Alignments
8
Primary Structure the Sequence
Sequence Alignment
Phylogenetics: >70% id N.A., <70% id A.A.
Determine OSM: <30% id A.A.
2-D and 3-D Predictions
function
evolution
structure
McClure, 2001
10
Identity and Similarity
Identity
Sequence 1  M A L V D D M F R
Match       M A   V D   M F R
Sequence 2  M A C V D E M F R
Similarity
(Figure legend: blue = nitrogen, white = carbon, red = oxygen)
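
As a rough illustration of the identity row, the sketch below builds the match line for the two nine-residue sequences on this slide and reports percent identity. It assumes the sequences are already aligned, ungapped and of equal length; the function names are hypothetical.

```python
def match_line(seq1, seq2):
    """Return the shared residue at identical positions and a space elsewhere."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    return "".join(a if a == b else " " for a, b in zip(seq1, seq2))


def percent_identity(seq1, seq2):
    """Percentage of aligned positions holding the same residue."""
    matches = sum(1 for c in match_line(seq1, seq2) if c != " ")
    return 100.0 * matches / len(seq1)


s1 = "MALVDDMFR"
s2 = "MACVDEMFR"
print(match_line(s1, s2))                  # MA VD MFR
print(round(percent_identity(s1, s2), 1))  # 77.8
```

Similarity, unlike identity, also needs a rule for which non-identical residues count as alike, which is what a substitution matrix supplies.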
11
Sequence Identity vs. Homology
12
Alignment Theory
(what a database searcher should know about
alignments)
  • Alignments are evolutionary hypotheses.
  • The existence of a gap and its length are distinct
    quantities; should different weights be applied
    to each?
  • Weights for different mismatches should be
    permitted (substitution matrices).
  • An alignment demonstrates similarity, not
    necessarily homology.

13
Local vs. Global Alignments
  • A global alignment is a comparison over the
    entire sequence length. Gaps are added for
    overall optimization in the context of a specific
    scoring scheme.
  • Use a global alignment to estimate the overall
    degree of similarity between sequences, i.e. for
    phylogenetic analysis.
  • In local alignments only the most highly
    conserved sections are aligned, and no attempt is
    made to align divergent sequence regions by
    adding gaps.
  • Local alignments are good for finding conserved
    motifs.

14
Database Searching I
  • A sequence by itself is not informative; it must
    be analyzed by comparative methods against
    existing databases to develop hypotheses
    concerning relatives and function.
  • If your sequence has a similar copy in the
    database, the match will quickly reveal some
    properties of your sequence, provided the known
    sequence has been annotated.
  • Because of the degeneracy of the genetic code,
    searches for homologous sequences should be
    conducted at the protein sequence level, which is
    about 5 times more sensitive at finding matches.

15
Database Searching II
  • Database searching was derived from three areas
    of previous knowledge:
  • Similarity Scores: allowing substitutions of
    residues with similar characteristics (Dayhoff
    matrix)
  • Computer Algorithms: finding the best matches
    between two sequences (Smith-Waterman)
  • Sequence Databases: storing the millions of
    bases of data to be searched (GenBank)

16
Database Searching III
  • Database Search Assumptions
  • Sequences being sought have an evolutionary
    ancestral sequence in common with the query
  • Our best guess at the actual path of evolution is
    the path that requires the fewest evolutionary
    events.
  • Not all substitutions are equally likely; they
    should be weighted accordingly using a similarity
    score and a substitution matrix.
  • Gaps may be penalized using two different
    weighting schemes:
  • Constant
  • Affine
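
The two gap weighting schemes can be compared with a small Python sketch. The numeric penalties are arbitrary illustrative values, and the affine form used here (opening cost for the first position, extension cost for each further position) is one common convention; other texts charge the extension cost for every position.

```python
def constant_gap_penalty(length, penalty=8):
    """Same cost for any gap, however long (0 when there is no gap)."""
    return penalty if length > 0 else 0


def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Opening cost for the first gap position, smaller cost for each extra one."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


for length in (1, 2, 5, 10):
    print(length, constant_gap_penalty(length), affine_gap_penalty(length))
# 1 8 10
# 2 8 11
# 5 8 14
# 10 8 19
```

Under the affine scheme one long gap costs less than several short ones of the same total length, which matches the idea that a single insertion or deletion event can span many residues.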

17
Similarity Scoring
  • Similarity Scores are used to assess sequence
    likeness based upon a substitution matrix
  • Some Similarity Scores are based upon observed
    substitutions of one amino acid for another in
    homologous proteins; others are based on
    physicochemical properties or other criteria
  • Modern similarity scores computed as log-odds
    scores have been shown to be the most efficient
    way to use the observed substitution data to
    detect homologous sequences
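
A log-odds score compares how often two residues are seen aligned in homologous proteins with how often they would be paired by chance. The sketch below is a simplified illustration only: the frequencies are made-up numbers, the scale factor of 2 with a base-2 logarithm (half-bit units) is an assumption, and the function name is hypothetical.

```python
import math


def log_odds_score(pair_freq, freq_a, freq_b, scale=2.0):
    """Scaled log2 ratio of observed pair frequency to chance expectation."""
    expected = freq_a * freq_b
    return round(scale * math.log2(pair_freq / expected))


# Made-up frequencies: a pair seen twice as often as chance scores positive,
# a pair seen half as often as chance scores negative.
print(log_odds_score(0.02, 0.1, 0.1))   # 2
print(log_odds_score(0.005, 0.1, 0.1))  # -2
```

Because the scores are logarithms, adding them along an alignment corresponds to multiplying the underlying odds ratios.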

18
Substitution Matrices I
  • A substitution matrix contains values
    proportional to the probability that amino acid l
    mutates into amino acid v.
  • Probabilities are based on:
  • Observed mutation
  • Physicochemical properties and common mutation
    based on codon positions
  • Matrices rescue us from having to assume that all
    amino acid changes are equally likely
  • The choice of substitution matrix may influence
    the outcome of analysis.
  • There are three popular amino acid substitution
    matrices currently in use.

19
Substitution Matrices II
  • PAM (Percent Accepted Mutation): developed
    by Dayhoff et al. Used protein families to
    determine substitutions.
  • BLOSUM (BLOcks SUbstitution Matrix): used
    conserved regions of protein families.
  • GONNET: used classical distance measures to
    estimate an alignment of proteins and then used
    those alignments to estimate a new matrix.

20
PAM
  • Observed mutations in protein families
  • Accepted Point Mutation / Percent
    Accepted Mutation
  • Derived from global alignments of closely related
    sequences (globins and cytochromes)
  • The number with the matrix (PAM40, PAM100) refers
    to the evolutionary distance
  • Several later groups have extended Dayhoff's
    methodology or reapplied her analysis using later
    databases with more examples.

21
The PAM250 substitution matrix multiplied by 10.
22
Extensions
  • Jones, Thornton and coworkers used the same
    methodology as Dayhoff but with modern databases
    (CABIOS 8:275).
  • Gonnet and coworkers (Science 256:1443) used a
    slightly different (but theoretically equivalent)
    methodology.
  • Henikoff & Henikoff (Proteins 17:49) compared
    these two newer versions of the PAM matrices with
    Dayhoff's originals.
BLOSUM
  • The BLOSUM series of matrices was created by
    Steve Henikoff and colleagues (PNAS 89:10915).
  • Derived from local, ungapped alignments of
    distantly related sequences.
  • The number after the matrix (BLOSUM62) refers to
    the minimum percent identity of the blocks used to
    construct the matrix; greater numbers are lesser
    distances.
  • The BLOSUM series of matrices generally performs
    better than PAM matrices for local similarity
    searches (Proteins 17:49).
23
The log-odds matrix for BLOSUM62
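
To show how such a matrix is used, the sketch below scores the two nine-residue sequences from the Identity and Similarity slide position by position. Only the eight BLOSUM62 entries needed for that pair are included, so the dictionary is an excerpt rather than the full 20 by 20 matrix, and the function names are hypothetical.

```python
# Excerpt of BLOSUM62: just the residue pairs needed for the example below.
BLOSUM62_SUBSET = {
    ("M", "M"): 5, ("A", "A"): 4, ("V", "V"): 4, ("D", "D"): 6,
    ("F", "F"): 6, ("R", "R"): 5, ("L", "C"): -1, ("D", "E"): 2,
}


def pair_score(a, b, matrix=BLOSUM62_SUBSET):
    """Look up a residue pair; the matrix is symmetric, so try both orders."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def ungapped_score(seq1, seq2):
    """Sum the substitution scores of an ungapped, equal-length alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(ungapped_score("MALVDDMFR", "MACVDEMFR"))  # 36
```

The conservative D/E substitution adds to the total while the L/C mismatch subtracts from it, which is the behaviour a plain identity count cannot capture.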
24
Gonnet Matrix
  • developed by Gonnet, Cohen, and Benner (1992)
  • uses exhaustive pairwise alignments of the
    protein databases
  • distance matrices differ on whether they were
    derived from distantly or closely homologous
    proteins
  • They suggest that for initial comparisons their
    matrix should be used, but for following
    refinements a PAM matrix should be used

25
Important Caveats
  • No single matrix performs best on all sequences.
    Some are better for sequences with few gaps, and
    others are better for sequences with fewer
    identical amino acids.
  • In our study of RT, we found that out of PAM,
    BLOSUM, and GONNET, PAM70 works the best for this
    protein.
  • PAM matrices have a theoretical advantage over
    other methods because they are based on observed
    mutations.
  • BLOSUM assumes that the evolutionary rates are
    uniform over the whole protein ---- NOT TRUE!!

26
Phylogenetics
Alignment
Structural Prediction
Pairwise and searches:
  • 1970 Gibbs/McIntyre, and Needleman/Wunsch
  • 1981 Smith/Waterman
  • 1982 Maizel and Lenk
  • 1983 Wilbur/Lipman
  • 1985 Lipman/Pearson, FASTA
  • 1990 Karlin/Altschul
  • 1990 Altschul et al., BLAST
Local and Multiple Alignment: Dynamic Programming, Stochastic Methods,
Hidden Markov Modeling, 1994
  • 1962 Zuckerkandl and Pauling
  • 1967 FM algorithm
  • Distance, Parsimony, Maximum Likelihood
  • 1961? 1972 AA replacement
28
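(Figure: scored comparison of the sequences ATGGCCAACGAW and GTFCLAACAGMW,
shown first without and then with gaps. Two labels of -8 and a set of
per-segment scores appear on the figure but cannot be placed from the
transcript.)
ATGGCCAACGAW GTFCLAACAGMW
ATGGCCAAC GAW GTF CLAACAGMW
SCORE 103
SCORE 130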
29
Algorithms
  • Exact Algorithms
  • Needleman-Wunsch: the first sequence comparison
    method. It scores one point for a match and no
    points for a mismatch, and performs a global
    alignment. It was very time consuming and
    computationally expensive.
  • Smith-Waterman: the most rigorous method. There
    are no built-in heuristics, so this method, which
    employs dynamic programming, effectively evaluates
    every possible alignment.
  • Approximate Algorithms
  • BLAST and FASTA: both apply heuristic
    restrictions that allow a faster search with less
    sensitivity but more selectivity. They do not
    find all best matches!
  • BLAST: breaks the query and the database
    sequences into fragments and initially seeks
    matches between fragments.
  • FASTA: looks for optimal local alignments by
    scanning the sequence for small matching words.
    Scores of these segments are calculated and
    summed to generate an alignment score. It
    outputs an optimized alignment with gaps.

BLAST is 3-4x faster than FASTA
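
The two exact approaches can be sketched with the same dynamic programming recurrence. This is a score-only illustration under simple assumptions: one fixed penalty per gap position, arbitrary match and mismatch values, and hypothetical function names. The global version follows the Needleman-Wunsch idea of aligning both sequences end to end; the local version follows the Smith-Waterman idea of never letting a cell drop below zero.

```python
def global_score(a, b, match=1, mismatch=0, gap=-1):
    """Best score for aligning all of a against all of b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best score over all pairs of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(global_score("ACGT", "AGT"))             # 2
print(local_score("TTTACGTTT", "GGGACGGGG"))   # 6
```

Both functions fill a table with one cell per pair of positions, so the work grows with the product of the two sequence lengths. That cost is why heuristic tools are preferred for searching whole databases.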
30
Smith-Waterman Based Search Methods
  • FASTA: the first method that used local optimal
    segments to produce local alignments based on S-W
  • - also the first method to introduce
    gaps, but had no reliable
    scoring scheme
  • Ssearch: compares a sequence to all entries in a
    database, but is very slow
  • BLAST: the most popular method. It compares
    the query sequence to matches in the database,
    reporting back an alignment.
  • - each comparison is given a score reflecting
    the degree of similarity between the two
    sequences
  • WU-BLAST: a beefed-up version of BLAST, which
    was created at Washington University.
  • - it has outperformed BLAST in some studies,
    but our study showed no difference

31
A Closer Look at BLAST
  • BLAST: Basic Local Alignment Search Tool
  • It is a set of sequence comparison algorithms used
    to search databases for optimal local alignments
    to a query.
  • It breaks the query and database sequences into
    fragments and seeks matches between them.
  • The algorithm was written to balance speed and
    sensitivity for distant sequence relationships.

32
BLAST Algorithm
First step: For each position p of the query,
find the list of words of length w scoring more
than T when paired with the word starting at p.
33
Second step: Extension of hits requires a
second hit on the same diagonal at a distance of
less than A.

Third step: Gapped extension of High Scoring
Pairs above a threshold.
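
The first two steps can be sketched in Python under a strong simplification: only identical words count as hits, so the threshold T and the substitution matrix are left out entirely. The parameter `max_dist` plays the role of A, and all function names are hypothetical. This is an illustration of the idea, not the BLAST implementation.

```python
from collections import defaultdict


def word_hits(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where identical words of length w start."""
    index = defaultdict(list)
    for j in range(len(subject) - w + 1):
        index[subject[j:j + w]].append(j)
    hits = []
    for p in range(len(query) - w + 1):
        for j in index.get(query[p:p + w], []):
            hits.append((p, j))
    return hits


def two_hit_seeds(hits, w=3, max_dist=40):
    """Keep pairs of non-overlapping hits on one diagonal closer than max_dist."""
    by_diag = defaultdict(list)
    for p, j in hits:
        by_diag[j - p].append(p)  # hits on the same diagonal share j - p
    seeds = []
    for diag, positions in sorted(by_diag.items()):
        last = None
        for p in sorted(positions):
            if last is not None and p - last < w:
                continue  # overlaps the previous hit, so ignore it
            if last is not None and p - last < max_dist:
                seeds.append((diag, last, p))
            last = p
    return seeds


hits = word_hits("ACDEFGHIKL", "WWACDEWGHIWW")
print(hits)                 # [(0, 2), (1, 3), (5, 7)]
print(two_hit_seeds(hits))  # [(2, 0, 5)]
```

Each seed would then be handed to the third step, a gapped extension in both directions.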
34
Which Method is Best?
  • It depends on your research needs
  • BLAST is the tool of choice for most searches
    because it is quick and fairly reliable
  • WU-BLAST has outperformed BLAST in some studies,
    but in our work BLAST is superior
  • Ssearch is more sensitive, but more
    computationally expensive

35
Non-trivial BLAST Parameters (work fine at default settings)
  • Gap Penalty and Gap Cost: gap scores are
    negative numbers representing an insertion or
    deletion. Inserting a gap is penalized more
    heavily than extending it.
  • Word Size: the number of amino acids or
    nucleotides that BLAST uses as segments for its
    initial search.
  • Genetic Codes: there are rare genetic codes that
    lie outside the norm; for our purposes the
    standard code is used.
  • Formatting Parameters: these are set to retrieve
    the easiest-to-read output and should be left
    alone.
36
Changing the Parameters
  • E-value: measures the expected number of
    sequences in the database that would achieve a
    given score by chance alone. The lower the
    E-value, the better the chance of homology.
  • Filters: BLAST by default will filter out low
    complexity regions (regions of small repeats) of
    the query and database sequences. If not
    filtered, many more false positives would result.
  • Matrix: you will find that different matrices
    make a world of difference.
  • Databases: it is important to know which
    database to use for your own research.
  • Alignments and Descriptions: can be set up to
    1000 each. If these are set to 50 and there are
    100 hits, only the top 50 will be reported.
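
The dependence of the E-value on score and database size can be illustrated with the Karlin-Altschul form E = K * m * n * exp(-lambda * S), where m is the query length, n the database length and S the raw score. The default values of lambda and K below are illustrative figures of the kind quoted for ungapped BLOSUM62 scoring and should be treated as assumptions; the function name is hypothetical.

```python
import math


def expected_hits(raw_score, query_len, db_len, lam=0.318, k=0.134):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Same query and database, increasing score: the E-value falls exponentially.
for score in (30, 50, 70):
    print(score, f"{expected_hits(score, 250, 1_000_000):.3g}")

# Same score, database twice as large: the E-value doubles.
ratio = expected_hits(50, 250, 2_000_000) / expected_hits(50, 250, 1_000_000)
print(round(ratio, 6))  # 2.0
```

This is why the same alignment score becomes less convincing as the database being searched grows.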

37
The NCBI BLAST family of programs includes:
  • blastp: compares an amino acid query sequence
    against a protein sequence database
  • blastn: compares a nucleotide query sequence
    against a nucleotide sequence database
  • blastx: compares a nucleotide query sequence
    translated in all reading frames against a
    protein sequence database
  • tblastn: compares a protein query sequence
    against a nucleotide sequence database
    dynamically translated in all reading frames
  • tblastx: compares the six-frame translations of
    a nucleotide query sequence against the six-frame
    translations of a nucleotide sequence database.
    Please note that the tblastx program cannot be
    used with the nr database on the BLAST Web page.
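
The phrase "all reading frames" refers to three offsets on the given strand and three on its reverse complement. A minimal Python sketch of how those six frames can be enumerated is shown below; it stops at the nucleotide level (no codon table), assumes an uppercase ACGT input, and uses hypothetical function names.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Return the six reading frames, each trimmed to whole codons."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            usable = (len(strand) - offset) // 3 * 3
            frames.append(strand[offset:offset + usable])
    return frames


for frame in six_frames("ATGGCCAAC"):
    print(frame)
# ATGGCCAAC
# TGGCCA
# GGCCAA
# GTTGGCCAT
# TTGGCC
# TGGCCA
```

Programs such as blastx and tblastn translate each of these frames into protein before comparing, which is why they can find a coding region without knowing its strand or phase.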
38
File Formats
FASTA format
>gi|21616997|gb|AAM66461.1| pol polyprotein
Human immunodeficiency virus type 1
PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGRWKPKMIGGI
GGFIKVRQYDQILIEICGHK AIGTVLVGPTPVNIIGRNLLTQIGCTLNF
PISPIETVPVKLKPGMDGPKVKQWPLTEEKIRALTEICXEL
EKEGKISKIGPENPYNTPVFAIKKKDSTKWRKLVDFRELNKKTQDFWEVQ
LGIPHPAGLKKKKSVTVLDV GDAYFSVPLDKDFRKYTAFTIPSINNETP
GIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPDIVIY
QYVDDLYVGSDLEIGQHRAKIEELRQHLXKWGFYTPDKKHQKEPPXLWM
Accession number
AAM66461
Raw sequence
PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGRWKPKMIGGI
GGFIKVRQYDQILIEICGHK AIGTVLVGPTPVNIIGRNLLTQIGCTLNF
PISPIETVPVKLKPGMDGPKVKQWPLTEEKIRALTEICXEL
EKEGKISKIGPENPYNTPVFAIKKKDSTKWRKLVDFRELNKKTQDFWEVQ
LGIPHPAGLKKKKSVTVLDV GDAYFSVPLDKDFRKYTAFTIPSINNETP
GIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPDIVIY
QYVDDLYVGSDLEIGQHRAKIEELRQHLXKWGFYTPDKKHQKEPPXLWM
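
The relationship between a FASTA record, its accession number and its raw sequence can be sketched with a small parser. The example assumes the header follows the pipe-delimited layout `gi|number|db|ACCESSION.version|`, uses a shortened sequence for brevity, and relies on hypothetical function names; production code would normally use an established library instead.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.replace(" ", ""))  # drop stray spaces
    if header is not None:
        yield header, "".join(chunks)


def accession_from_header(header):
    """Extract the accession, without version, from a gi|...|db|ACC.ver| header."""
    fields = header.split("|")
    return fields[3].split(".")[0]


record = """>gi|21616997|gb|AAM66461.1| pol polyprotein
PQITLWQRPL
VTIKIGGQLK
"""

for header, sequence in parse_fasta(record):
    print(accession_from_header(header))  # AAM66461
    print(sequence)                       # PQITLWQRPLVTIKIGGQLK
```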
39
References and Further Reading
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ.
Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs. Nucleic Acids Res. 1997 Sep 1;25(17):3389-402. Review.
Brenner SE, Chothia C, Hubbard TJ. Assessing sequence comparison methods with
reliable structurally identified distant evolutionary relationships. Proc Natl
Acad Sci U S A. 1998 May 26;95(11):6073-8.
M.O. Dayhoff, R.V. Eck, M.A. Chang, and M.R. Sochard. Atlas of Protein Sequence
and Structure. Natl. Biomed. Res. Fnd., Silver Spring MD, 1965.
Gaston H. Gonnet, Steven A. Benner, and Mark A. Cohen. Analysis of amino acid
substitution during divergent evolution: the 400 by 400 dipeptide substitution
matrix. Biochem. Biophys. Res. Commun., 199(2):489-496, 1994.
Steven Henikoff and Jorja G. Henikoff. Amino acid substitution matrices from
protein blocks. Proc. Natl. Acad. Sci. USA, 89:10915-10919, 1992.
W.R. Pearson and D.J. Lipman. Improved tools for biological sequence
comparison. Proc. Natl. Acad. Sci. USA, 85:2444-2448, 1988.