Bioinformatika (Bioinformatics)

1
Bioinformatics
Alignment of biological sequences - databases and software
LU, 2008, Juris Viksna
2
Sequence comparison methods used in practice
Smith-Waterman algorithm (1981)
  • Local alignment
  • Linear gap penalties
  • Uses substitution matrices
Is $\Theta(n^2)$ time acceptable in practice?
  • Protein length: about 100 amino acids
  • Comparing two proteins: 10000 operations (0.01 s at 1 MHz)
  • Protein database: 1000000 entries
  • Searching the database: $10^8$ operations, 100 s
  • Comparing two databases: $10^{16}$ operations, 25 years
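The quadratic cost discussed on this slide comes from filling a table with one cell per pair of residues. A minimal sketch of Smith-Waterman scoring with a linear gap penalty follows; the match/mismatch values stand in for a real substitution matrix and, like the function name, are assumptions made only for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=2):
    """Return the best local alignment score of a and b (linear gap penalty)."""
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment here
                H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                H[i - 1][j] - gap,    # gap in b
                H[i][j - 1] - gap,    # gap in a
            )
            best = max(best, H[i][j])
    return best


print(smith_waterman("GTAAGGTCC", "GTTAGGTCC"))  # 15
```

The two nested loops visit len(a) * len(b) cells, which is where the roughly 10000 operations for two proteins of about 100 residues comes from.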
3
Heuristic methods - FASTA
  • Hash table of short words in the query sequence
  • Go through DB and look for matches in the query
    hash (linear in size of DB)
  • K-tuple determines word size (k-tup = 1 is a single
    amino acid)
  • Lipman & Pearson 1985

VLICT _
VLICTAVLMVLICTAAAVLICTMSDFFD
Adapted from M.Gerstein
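The word lookup described on this slide can be sketched roughly as below, using the slide's own example strings. The function names and the choice of ktup = 2 are assumptions for illustration, not part of the FASTA program.

```python
from collections import defaultdict


def build_ktup_index(query, ktup):
    """Map every word of length ktup in the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    return index


def find_hits(index, ktup, subject):
    """Scan the database sequence once and collect (query_pos, subject_pos) matches."""
    hits = []
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            hits.append((i, j))
    return hits


index = build_ktup_index("VLICT", 2)
hits = find_hits(index, 2, "VLICTAVLMVLICTAAAVLICTMSDFFD")
print(len(hits))  # 13
```

The scan touches each database position once, which is the "linear in size of DB" property noted in the bullets.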
4
The FASTA Algorithm
  • 4 steps
  • use lookup table to find all identities at least
    ktup long, find regions of identities (Fig.1A)
  • rescan 10 regions (diagonals) with highest
    density of identities using PAM250 (Fig.1B)
  • join regions if possible without decreasing score
    below threshold (Fig.1C)
  • rescore à la Smith-Waterman within 32 residues around
    the initial region (note: doesn't save the alignment)
    (Fig.1D)
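As a rough illustration of the first step and the selection of candidate regions, the sketch below counts ktup identities per diagonal and returns the densest diagonals. It is a simplification: it ranks diagonals by a plain hit count rather than by FASTA's actual density score, and the function name and defaults are assumptions.

```python
from collections import Counter


def top_diagonals(query, subject, ktup=2, n_best=10):
    """Rank diagonals (subject_pos - query_pos) by their number of ktup identities."""
    words = {}
    for i in range(len(query) - ktup + 1):
        words.setdefault(query[i:i + ktup], []).append(i)
    counts = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in words.get(subject[j:j + ktup], ()):
            counts[j - i] += 1  # hits on the same diagonal share j - i
    return counts.most_common(n_best)


diagonals = top_diagonals("VLICT", "VLICTAVLMVLICTAAAVLICTMSDFFD")
print(dict(diagonals))
```

In the later steps only these few diagonals would be rescanned with a substitution matrix, joined, and rescored.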

5
FASTA Parameters
  • ktup = 2 for proteins, 6 for DNA
  • init1: score after rescanning with PAM250 (or
    other)
  • initn: score after joining regions
  • opt: score after Smith-Waterman

6
FASTA algorithm
Fig.1 of Pearson and Lipman 1988
7
FASTA algorithm
Adapted from D Brutlag
8
FASTA histograms
9
Heuristic methods - BLAST
  • Altschul, S., Gish, W., Miller, W., Myers, E. W. &
    Lipman, D. J. (1990). Basic local alignment
    search tool. J. Mol. Biol. 215, 403-410
  • Indexes query and DB
  • Starts with all overlapping words from query
  • Calculates neighborhood of each word using PAM
    matrix and probability threshold
  • Looks up all words and neighbors from query in
    database index
  • Extends High Scoring Pairs (HSPs) left and right
    to maximal length
  • Finds Maximal Segment Pairs (MSPs) between query
    and database
  • Blast 1 does not permit gaps in alignments

Adapted from M.Gerstein
10
BLAST algorithm
  • Keyword search of all words of length w from the
    query of length n in a database of length m,
    with score above threshold
  • w = 11 for nucleotide queries, 3 for proteins
  • Do local alignment extension for each found
    keyword
  • Extend result until longest match above threshold
    is achieved
  • Running time O(nm)

Adapted from S.Daudenarde
11
BLAST algorithm
Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD
Keyword: GVK
Neighborhood words (word, score): GVK 18, GAK 16, GIK 16, GGK 14, GLK 13, GNK 12, GRK 11, GEK 11, GDK 11
Neighborhood score threshold: T = 13
Extension:
Query  22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60
             DN  G     IR L    G K I  L  E  RG  K
Sbjct 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263
High-scoring Pair (HSP)
Adapted from S.Daudenarde
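The neighborhood step on this slide can be sketched as a simple filter over the word scores listed there. The scores and T = 13 are copied from the slide; treating a score equal to T as passing is an assumption, as is the function name.

```python
# Word scores against the query word GVK, copied from the slide.
word_scores = {
    "GVK": 18, "GAK": 16, "GIK": 16, "GGK": 14, "GLK": 13,
    "GNK": 12, "GRK": 11, "GEK": 11, "GDK": 11,
}
T = 13  # neighborhood score threshold from the slide


def neighborhood(scores, threshold):
    """Keep the words whose score reaches the threshold (assumed inclusive)."""
    return [word for word, score in scores.items() if score >= threshold]


neighbors = neighborhood(word_scores, T)
print(neighbors)  # ['GVK', 'GAK', 'GIK', 'GGK', 'GLK']
```

Only the surviving words would then be looked up in the database index.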
12
Original BLAST
  • Dictionary: all words of length w
  • Alignment: ungapped extensions until score falls
    below some threshold
  • Output: all local alignments with score >
    statistical threshold

Adapted from S.Daudenarde
13
Original BLAST Example
  • w = 4
  • Exact keyword match of GGTC
  • Extend diagonals with mismatches until score is
    under 50
  • Output result:
    GTAAGGTCC
    GTTAGGTCC

[Dot-matrix figure; axis sequences: ACGAAGTAAGGTCCAGT and CTGATCCTGGATTGCGA]
From lectures by Serafim Batzoglou (Stanford)
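A minimal sketch of ungapped seed extension is given below. It is not the slide's exact stopping rule: it swaps in an X-drop criterion (stop once the running score falls `x_drop` below the best seen), and the scoring values and function name are assumptions. The second sequence is the slide's second axis sequence read in reverse, which is the orientation in which it contains the reported match.

```python
def extend_ungapped(a, b, i, j, w, match=1, mismatch=-1, x_drop=2):
    """Extend an exact seed a[i:i+w] == b[j:j+w] along its diagonal, without gaps.

    Returns (start_a, start_b, length) of the best-scoring extension.
    """
    def extend(step, pa, pb):
        # Walk in one direction and remember how far the best running score reached.
        score = best = best_len = length = 0
        while 0 <= pa < len(a) and 0 <= pb < len(b):
            score += match if a[pa] == b[pb] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= x_drop:
                break
            pa += step
            pb += step
        return best_len

    left = extend(-1, i - 1, j - 1)
    right = extend(+1, i + w, j + w)
    return i - left, j - left, w + left + right


a = "ACGAAGTAAGGTCCAGT"
b = "CTGATCCTGGATTGCGA"[::-1]  # second axis sequence, read in reverse
i, j = a.find("GGTC"), b.find("GGTC")
sa, sb, n = extend_ungapped(a, b, i, j, w=4)
print(a[sa:sa + n])  # GTAAGGTCC
print(b[sb:sb + n])  # GTTAGGTCC
```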
14
Gapped BLAST Example
  • Original BLAST exact keyword search, THEN:
  • Extend with gaps in a zone around the ends of the exact
    match until score < threshold, then merge nearby
    alignments
  • Output result:
    GTAAGGTCC-AGT
    GTTAGGTCCTAGT

[Dot-matrix figure; axis sequences: ACGAAGTAAGGTCCAGT and CTGATCCTGGATTGCGA]
From lectures by Serafim Batzoglou (Stanford)
16
BLAST - Programs
  • blastp: compares an amino acid query sequence
    against a protein sequence database
  • blastn: compares a nucleotide query sequence
    against a nucleotide sequence database
  • blastx: compares a nucleotide query sequence
    translated in all reading frames against a
    protein sequence database
  • tblastn: compares a protein query sequence against
    a nucleotide sequence database dynamically
    translated in all reading frames
  • tblastx: compares the six-frame translations of a
    nucleotide query sequence against the six-frame
    translations of a nucleotide sequence database.
    Please note that tblastx is extremely slow and
    CPU-intensive
17
PSI-BLAST and transitive sequence comparison
18
Visualization of results (dotplots)
19
Protein sequence formats - FASTA
>sp|P00156|CYB_HUMAN Cytochrome b - Homo sapiens (Human).
MTPMRKINPLMKLINHSFIDLPTPSNISAWWNFGSLLGACLILQITTGLFLAMHYSPDAS
TAFSSIAHITRDVNYGWIIRYLHANGASMFFICLFLHIGRGLYYGSFLYSETWNIGIILL
LATMATAFMGYVLPWGQMSFWGATVITNLLSAIPYIGTDLVQWIWGGYSVDSPTLTRFFT
FHFILPFIIAALATLHLLFLHETGSNNPLGITSHSDKITFHPYYTIKDALGLLLFLLSLM
TLTLFSPDLLGDPDNYTLANPLNTPPHIKPEWYFLFAYTILRSVPNKLGGVLALLLSILI
LAMIPILHMSKQQSMMFRPLSQSLYWLLAADLLILTWIGGQPVSYPFTIIGQVASVLYFT
TILILMPTISLIENKMLKWA
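Reading this format takes only a few lines. The sketch below assumes the usual convention that a header line starts with ">" and that UniProt-style headers separate database, accession and entry name with "|"; the function name is hypothetical and the sample is truncated to the first sequence line of the record.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


sample = """>sp|P00156|CYB_HUMAN Cytochrome b - Homo sapiens (Human).
MTPMRKINPLMKLINHSFIDLPTPSNISAWWNFGSLLGAC
LILQITTGLFLAMHYSPDAS
"""
header, sequence = parse_fasta(sample)[0]
database, accession, rest = header.split("|", 2)
print(accession, len(sequence))  # P00156 60
```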
20
Protein sequence formats - SwissProt
ID   CYB_HUMAN      STANDARD;      PRT;   380 AA.
AC   P00156; Q34786; Q8HBR6; Q8HNQ0; Q8HNQ1; Q8HNQ9; Q8HNR4; Q8HNR7;
AC   Q8W7V8; Q8WCV9; Q8WCY2; Q8WCY7; Q8WCY8; Q9B1A6; Q9B1B6; Q9B1B8;
AC   Q9B2X7; Q9B2X9; Q9B2Y3; Q9B2Z0; Q9B2Z4; Q9T6H6; Q9T9Y0; Q9TEH4;
DT   21-JUL-1986, integrated into UniProtKB/Swiss-Prot.
DT   21-JUL-1986, sequence version 1.
DT   19-SEP-2006, entry version 75.
DE   Cytochrome b.
GN   Name=MT-CYB; Synonyms=COB, CYTB, MTCYB;
OS   Homo sapiens (Human).
OG   Mitochondrion.
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
OC   Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   NUCLEOTIDE SEQUENCE.
RC   TISSUE=Placenta;
RX   MEDLINE=81173052; PubMed=7219534; [NCBI, ExPASy, EBI, Israel]
RA   Anderson S., Bankier A.T., Barrell B.G., de Bruijn M.H.L.,
RA   Coulson A.R., Drouin J., Eperon I.C., Nierlich D.P., Roe B.A.,
RA   Sanger F., Schreier P.H., Smith A.J.H., Staden R., Young I.G.;
RT   "Sequence and organization of the human mitochondrial genome.";
RL   Nature 290:457-465(1981).
21
Protein sequence formats - SwissProt
FT   VARIANT     360    360       T -> A.
FT                                /FTId=VAR_013667.
FT   VARIANT     368    368       T -> I.
FT                                /FTId=VAR_013668.
SQ   SEQUENCE   380 AA;  42730 MW;  90097B07FF2C1FD8 CRC64;
     MTPMRKINPL MKLINHSFID LPTPSNISAW WNFGSLLGAC LILQITTGLF LAMHYSPDAS
     TAFSSIAHIT RDVNYGWIIR YLHANGASMF FICLFLHIGR GLYYGSFLYS ETWNIGIILL
     LATMATAFMG YVLPWGQMSF WGATVITNLL SAIPYIGTDL VQWIWGGYSV DSPTLTRFFT
     FHFILPFIIA ALATLHLLFL HETGSNNPLG ITSHSDKITF HPYYTIKDAL GLLLFLLSLM
     TLTLFSPDLL GDPDNYTLAN PLNTPPHIKP EWYFLFAYTI LRSVPNKLGG VLALLLSILI
     LAMIPILHMS KQQSMMFRPL SQSLYWLLAA DLLILTWIGG QPVSYPFTII GQVASVLYFT
     TILILMPTIS LIENKMLKWA
22
Protein sequence formats - NiceProt
23
Protein sequence databases - UniProt
http://www.expasy.uniprot.org/
24
Protein sequence databases - PIR
http://pir.georgetown.edu/
25
Protein sequence databases - SwissProt
http://www.expasy.org/sprot/
26
Protein sequence databases - NRL-3D
27
Protein structure databases - PDB
http://www.pdb.org
28
Alignment programs - BLAST
http://www.ncbi.nlm.nih.gov/BLAST/
http://www.ebi.ac.uk/blastall/
29
Alignment programs - FASTA
http://fasta.bioch.virginia.edu/
http://www.ebi.ac.uk/fasta33/
30
Alignment programs - Smith-Waterman
http://www.ebi.ac.uk/MPsrch/
31
Evaluating alignments
$S = \sum_{i,j} S(i,j) - nG$
  • S = total score
  • S(i,j) = similarity matrix score for aligning i
    and j
  • Sum is carried out over all aligned i and j
  • n = number of gaps (assuming no gap ext. penalty)
  • G = gap penalty

Simplest score (for identity matrix) is S = number of matches.
What does a score of 10 mean?
What is the right cutoff?
Adapted from M.Gerstein
32
Assessing sequence similarity
  • Need to know how strong an alignment can be
    expected from chance alone
  • Chance is the comparison of:
  • Real but non-homologous sequences
  • Real sequences that are shuffled to preserve
    compositional properties
  • Sequences that are generated randomly based upon
    a DNA or protein sequence model (favored)

Adapted from S.Daudenarde
33
Model Random Sequence
  • Necessary to evaluate the score of a match
  • Take into account background
  • Adjust for GC content
  • Poly-A tails
  • Junk sequences
  • Codon bias

Adapted from S.Daudenarde
34
E-Values
E-Value: expectation value cutoff. An E-Value of 1 means that one would
expect to find 1 alignment by chance alone with a score as high as the one
reported in a search of the given database. Setting the E-value cutoff
higher will report more sequence matches. In other words, the smaller the
E-value, the better.
It is computed approximately as $E = m n \, 2^{-S'}$, where $m$ and $n$ are
the sequence lengths and $S'$ is computed, e.g., as
$S' = (\lambda S - \ln K) / \ln 2$.
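The two formulas on this slide translate directly into code. In the sketch below the function names are hypothetical, and the values passed for lambda and K are illustrative placeholders rather than parameters of any particular scoring system.

```python
import math


def bit_score(raw_score, lam, K):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value(m, n, bits):
    """Expected number of chance alignments: E = m * n * 2**(-S')."""
    return m * n * 2.0 ** (-bits)


# Placeholder parameters, chosen only to show the calls.
bits = bit_score(raw_score=100, lam=0.3, K=0.1)
print(e_value(m=100, n=1_000_000, bits=bits))
```

Each additional bit of normalized score halves the E-value, and doubling the database length doubles it.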
35
P-Values
P-Value: the probability that such a result (or a better one) would be
obtained for a "random" protein. Again, the smaller the better.
36
P-Values
  • According to the Poisson distribution, the
    probability of finding b HSPs with a score ≥ S is
    given by
  • $e^{-E} E^b / b!$
  • For b = 0, that chance is
  • $e^{-E}$
  • Thus the probability of finding at least one such
    HSP is
  • $P = 1 - e^{-E}$

Adapted from S.Daudenarde
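The relation between E-values and P-values on this slide can be checked numerically with a short sketch (function names are hypothetical):

```python
import math


def prob_b_hsps(E, b):
    """Poisson probability of exactly b HSPs when E are expected by chance."""
    return math.exp(-E) * E ** b / math.factorial(b)


def p_value(E):
    """Probability of at least one HSP: P = 1 - e^{-E}."""
    return -math.expm1(-E)  # numerically stable form of 1 - exp(-E)


for E in (10.0, 1.0, 0.01):
    print(E, p_value(E))
```

For small E the two quantities nearly coincide (P is approximately E), while for large E the P-value saturates near 1.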
37
P-Values
  • P(s > S) = .01
  • A P-value of .01 occurs at the score threshold S (392
    below) where the score s from a random comparison is
    greater than this threshold 1% of the time
  • Likewise for P = .001 and so on.

Adapted from M.Gerstein
38
General Protein Search Principles
  • Choose between local or global search algorithms
  • Use most sensitive search algorithm available
  • Original BLAST for no gaps
  • Smith-Waterman for most sensitivity
  • FASTA with k-tuple 1 is a good compromise
  • Gapped BLAST for well delimited regions
  • PSI-BLAST for families
  • Initially BLOSUM62 and default gap penalties
  • If no significant results, use BLOSUM30 and lower
    gap penalties
  • FASTA cutoff of .01
  • Blast cutoff of .0001
  • Examine results between exp. 0.05 and 10 for
    biological significance
  • Ensure expected score is negative
  • Beware of hits on long sequences or hits with
    unusual aa composition
  • Reevaluate results of borderline significance
    using limited query region
  • Segment long queries ≥ 300 amino acids
  • Segment around known motifs

(some text adapted from D Brutlag)
39
ROC diagrams
We consider two proteins homologous if their similarity exceeds some threshold t.
  • true positives (tp): s(a,b) ≥ t and a, b are homologous
  • false positives (fp): s(a,b) ≥ t and a, b are not homologous
  • true negatives (tn): s(a,b) < t and a, b are not homologous
  • false negatives (fn): s(a,b) < t and a, b are homologous

Sensitivity = tp/n = tp/(tp+fn)
Specificity = tn/n = tn/(tn+fp)
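The definitions above can be sketched as a small threshold sweep. The scores and homology labels below are made-up illustration data, not results from any real comparison, and the function names are hypothetical.

```python
def confusion(scores, homologous, t):
    """Count (tp, fp, tn, fn) when pairs with score >= t are called homologous."""
    tp = fp = tn = fn = 0
    for score, is_homolog in zip(scores, homologous):
        if score >= t:
            if is_homolog:
                tp += 1
            else:
                fp += 1
        elif is_homolog:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


# Hypothetical similarity scores and ground-truth labels.
scores = [35, 28, 22, 18, 12, 8]
homologous = [True, True, False, True, False, False]

for t in (10, 20, 30):
    tp, fp, tn, fn = confusion(scores, homologous, t)
    print(t, sensitivity(tp, fn), specificity(tn, fp))
```

Raising the threshold trades sensitivity for specificity; plotting one against the other over all thresholds gives the ROC curve.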
40
ROC diagrams
41
ROC diagrams
[Figure: coverage plotted against error rate for two methods (red is more effective), with points at different score thresholds (Thresh=10, Thresh=20, Thresh=30)]
Coverage (roughly, fraction of sequences that one confidently says something about): sensitivity = tp/n = tp/(tp+fn)
Error rate (fraction of the statements that are false positives): 1 - specificity = fp/n, where specificity = tn/n = tn/(tn+fp)
Adapted from M.Gerstein