Bioinformatika - PowerPoint PPT Presentation

About This Presentation

Title:

Bioinformatika

Slides: 41

Provided by: Jur6

Transcript and Presenter's Notes

Title: Bioinformatika

1
Bioinformatika (Bioinformatics)
Alignment of biological sequences - databases and software
LU, 2008, Juris Viksna
2
Sequence comparison methods used in practice
Smith-Waterman algorithm (1981)
Local alignment
Linear gap penalties
Uses substitution matrices
Is $\Theta(n^2)$ time acceptable in practice?
Protein length: about 100 amino acids
Comparing two proteins: 10000 operations (0.01 s at 1 MHz)
Protein database: 1000000 entries
Searching the database: $10^8$ operations, 100 s
Comparing two databases: $10^{16}$ operations, 25 years
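As a rough illustration of the quadratic cost discussed above, the following sketch fills the Smith-Waterman table for two strings and returns the best local score. It assumes a simple match/mismatch score in place of a substitution matrix, together with a linear gap penalty; the function name and the default score values are illustrative choices, not taken from the slides.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of strings a and b.

    Assumption: match/mismatch scores stand in for a substitution matrix,
    and every gap position costs `gap` (linear gap penalty).
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0,
                          prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                          prev[j] + gap,       # gap in b
                          curr[j - 1] + gap)   # gap in a
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman("GTAAGGTCC", "GTTAGGTCC"))
```

The two nested loops visit every cell of the table once, which is where the quadratic running time comes from.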
3
Heuristic methods - FASTA

Hash table of short words in the query sequence
Go through DB and look for matches in the query hash (linear in size of DB)
K-tuple determines word size (k-tup 1 is a single aa)
Lipman & Pearson 1985

VLICT _
VLICTAVLMVLICTAAAVLICTMSDFFD
Adapted from M.Gerstein
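The word lookup described on this slide can be sketched as follows. This is a minimal illustration, assuming plain Python dictionaries as the hash table; the function names are hypothetical and not part of the FASTA program.

```python
from collections import defaultdict


def build_word_table(query, k):
    """Map every k-tuple of the query to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)
    return table


def find_word_hits(query, db_seq, k):
    """Scan the database sequence once and report (query_pos, db_pos) hits."""
    table = build_word_table(query, k)
    hits = []
    for j in range(len(db_seq) - k + 1):
        for i in table.get(db_seq[j:j + k], ()):
            hits.append((i, j))
    return hits


# The word and sequence shown on the slide.
print(find_word_hits("VLICT", "VLICTAVLMVLICTAAAVLICTMSDFFD", 5))
```

Each database position costs one table lookup, so the scan is linear in the size of the database, as the slide states.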
4
The FASTA Algorithm

4 steps
1. Use a lookup table to find all identities at least ktup long, and find regions of identities (Fig.1A)
2. Rescan the 10 regions (diagonals) with the highest density of identities using PAM250 (Fig.1B)
3. Join regions if possible without decreasing the score below a threshold (Fig.1C)
4. Rescore à la Smith-Waterman within 32 residues around the initial region (note: doesn't save the alignment) (Fig.1D)
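The first step, and the selection of diagonals for the second, can be sketched like this. It is a simplified illustration that only counts ktup identities per diagonal; the PAM250 rescanning, joining and final rescoring are not shown, and the function name is hypothetical.

```python
from collections import Counter


def top_diagonals(query, db_seq, k, n_best=10):
    """Return up to n_best (diagonal, hit_count) pairs, densest first.

    A hit between query position i and database position j lies on
    diagonal j - i, so hits on the same diagonal form an ungapped region.
    """
    positions = {}
    for i in range(len(query) - k + 1):
        positions.setdefault(query[i:i + k], []).append(i)

    counts = Counter()
    for j in range(len(db_seq) - k + 1):
        for i in positions.get(db_seq[j:j + k], ()):
            counts[j - i] += 1
    return counts.most_common(n_best)


print(top_diagonals("HEAGAWGHEE", "PAWHEAE", 2, 1))
```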

5
FASTA Parameters

ktup = 2 for proteins, 6 for DNA
init1: score after rescanning with PAM250 (or other)
initn: score after joining regions
opt: score after Smith-Waterman

6
FASTA algorithm
Fig.1 of Pearson and Lipman 1988
7
FASTA algorithm
Adapted from D Brutlag
8
FASTA histograms
9
Heuristic methods - BLAST

Altschul, S., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403-410
Indexes query and DB
Starts with all overlapping words from query
Calculates neighborhood of each word using PAM matrix and probability threshold
Looks up all words and neighbors from query in database index
Extends High Scoring Pairs (HSPs) left and right to maximal length
Finds Maximal Segment Pairs (MSPs) between query and database
BLAST 1 does not permit gaps in alignments

Adapted from M.Gerstein
10
BLAST algorithm

Keyword search of all words of length w from the query of length n in a database of length m, with score above threshold
w = 11 for nucleotide queries, 3 for proteins
Do local alignment extension for each found keyword
Extend result until longest match above threshold is achieved
Running time O(nm)

Adapted from S.Daudenarde
11
BLAST algorithm
Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD
Keyword: GVK
Neighborhood words: GVK 18, GAK 16, GIK 16, GGK 14, GLK 13, GNK 12, GRK 11, GEK 11, GDK 11
Neighborhood score threshold: T = 13
Extension:
Query  22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60
             DN  G     IR L    G K I  L  E  RG  K
Sbjct 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263
High-scoring Pair (HSP)
Adapted from S.Daudenarde
12
Original BLAST

Dictionary: all words of length w
Alignment: ungapped extensions until score falls below some threshold
Output: all local alignments with score > statistical threshold

Adapted from S.Daudenarde
13
Original BLAST Example

w = 4
Exact keyword match of GGTC
Extend diagonals with mismatches until score is under 50%
Output result
GTAAGGTCC
GTTAGGTCC

Sequences on the dot-matrix axes:
ACGAAGTAAGGTCCAGT
CTGATCCTGGATTGCGA
From lectures by Serafim Batzoglou (Stanford)
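The seed-and-extend idea of this example can be sketched as below. Note the assumptions: the stopping rule here is an X-drop criterion (stop once the running score falls more than `x_drop` below the best score seen), which is swapped in for the slide's threshold rule; the +1/-1 scores, the function name and the second input sequence are illustrative choices made so that the slide's output appears.

```python
def extend_ungapped(a, b, i, j, w, match=1, mismatch=-1, x_drop=2):
    """Extend the exact seed a[i:i+w] == b[j:j+w] along its diagonal.

    Returns (score, segment_of_a, segment_of_b) for the best ungapped
    extension found under the assumed X-drop stopping rule.
    """
    def walk(step, pa, pb):
        score = best = 0
        best_len = steps = 0
        while 0 <= pa < len(a) and 0 <= pb < len(b):
            score += match if a[pa] == b[pb] else mismatch
            steps += 1
            if score > best:
                best, best_len = score, steps
            elif best - score > x_drop:
                break  # score dropped too far below the best: stop
            pa += step
            pb += step
        return best, best_len

    right_score, right_len = walk(1, i + w, j + w)
    left_score, left_len = walk(-1, i - 1, j - 1)
    start_a, start_b = i - left_len, j - left_len
    length = left_len + w + right_len
    score = w * match + left_score + right_score
    return score, a[start_a:start_a + length], b[start_b:start_b + length]


# GGTC starts at index 9 in the first string and index 7 in the second.
print(extend_ungapped("ACGAAGTAAGGTCCAGT", "AGCGTTAGGTCCTAGTC", 9, 7, 4))
```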
14
Gapped BLAST Example

Original BLAST exact keyword search, THEN
Extend with gaps in a zone around ends of exact match until score < threshold, then merge nearby alignments
Output result
GTAAGGTCC-AGT
GTTAGGTCCTAGT

Sequences on the dot-matrix axes:
ACGAAGTAAGGTCCAGT
CTGATCCTGGATTGCGA
From lectures by Serafim Batzoglou (Stanford)
15
Gapped BLAST Example

Original BLAST exact keyword search, THEN
Extend with gaps around ends of exact match until score < threshold, then merge nearby alignments
Output result
GTAAGGTCC-AGT
GTTAGGTCCTAGT

Sequences on the dot-matrix axes:
ACGAAGTAAGGTCCAGT
CTGATCCTGGATTGCGA
From lectures by Serafim Batzoglou (Stanford)
16
BLAST - Programs
blastp: compares an amino acid query sequence against a protein sequence database
blastn: compares a nucleotide query sequence against a nucleotide sequence database
blastx: compares a nucleotide query sequence translated in all reading frames against a protein sequence database
tblastn: compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
tblastx: compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that tblastx is extremely slow and CPU-intensive
17
PSI-BLAST and transitive sequence comparison
18
Visualization of results (dotplots)
19
Protein sequence formats - FASTA
>sp|P00156|CYB_HUMAN Cytochrome b - Homo sapiens (Human).
MTPMRKINPLMKLINHSFIDLPTPSNISAWWNFGSLLGACLILQITTGLFLAMHYSPDAS
TAFSSIAHITRDVNYGWIIRYLHANGASMFFICLFLHIGRGLYYGSFLYSETWNIGIILL
LATMATAFMGYVLPWGQMSFWGATVITNLLSAIPYIGTDLVQWIWGGYSVDSPTLTRFFT
FHFILPFIIAALATLHLLFLHETGSNNPLGITSHSDKITFHPYYTIKDALGLLLFLLSLM
TLTLFSPDLLGDPDNYTLANPLNTPPHIKPEWYFLFAYTILRSVPNKLGGVLALLLSILI
LAMIPILHMSKQQSMMFRPLSQSLYWLLAADLLILTWIGGQPVSYPFTIIGQVASVLYFT
TILILMPTISLIENKMLKWA
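A record in this format can be read with a few lines of code. The sketch below is a minimal reader, assuming a header line that starts with `>` followed by sequence lines; the function name is hypothetical, and real files may need more validation.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = ">sp|P00156|CYB_HUMAN Cytochrome b\nMTPMRKINPL\nMKLINHSFID\n"
print(parse_fasta(example))
```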
20
Protein sequence formats - SwissProt
ID   CYB_HUMAN      STANDARD;      PRT;   380 AA.
AC   P00156; Q34786; Q8HBR6; Q8HNQ0; Q8HNQ1; Q8HNQ9; Q8HNR4; Q8HNR7;
AC   Q8W7V8; Q8WCV9; Q8WCY2; Q8WCY7; Q8WCY8; Q9B1A6; Q9B1B6; Q9B1B8;
AC   Q9B2X7; Q9B2X9; Q9B2Y3; Q9B2Z0; Q9B2Z4; Q9T6H6; Q9T9Y0; Q9TEH4;
DT   21-JUL-1986, integrated into UniProtKB/Swiss-Prot.
DT   21-JUL-1986, sequence version 1.
DT   19-SEP-2006, entry version 75.
DE   Cytochrome b.
GN   Name=MT-CYB; Synonyms=COB, CYTB, MTCYB;
OS   Homo sapiens (Human).
OG   Mitochondrion.
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
OC   Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   NUCLEOTIDE SEQUENCE.
RC   TISSUE=Placenta;
RX   MEDLINE=81173052; PubMed=7219534; [NCBI, ExPASy, EBI, Israel]
RA   Anderson S., Bankier A.T., Barrell B.G., de Bruijn M.H.L.,
RA   Coulson A.R., Drouin J., Eperon I.C., Nierlich D.P., Roe B.A.,
RA   Sanger F., Schreier P.H., Smith A.J.H., Staden R., Young I.G.;
RT   "Sequence and organization of the human mitochondrial genome.";
RL   Nature 290:457-465(1981).
21
Protein sequence formats - SwissProt
FT   VARIANT     360    360       T -> A.
FT                                /FTId=VAR_013667.
FT   VARIANT     368    368       T -> I.
FT                                /FTId=VAR_013668.
SQ   SEQUENCE   380 AA;  42730 MW;  90097B07FF2C1FD8 CRC64;
     MTPMRKINPL MKLINHSFID LPTPSNISAW WNFGSLLGAC LILQITTGLF LAMHYSPDAS
     TAFSSIAHIT RDVNYGWIIR YLHANGASMF FICLFLHIGR GLYYGSFLYS ETWNIGIILL
     LATMATAFMG YVLPWGQMSF WGATVITNLL SAIPYIGTDL VQWIWGGYSV DSPTLTRFFT
     FHFILPFIIA ALATLHLLFL HETGSNNPLG ITSHSDKITF HPYYTIKDAL GLLLFLLSLM
     TLTLFSPDLL GDPDNYTLAN PLNTPPHIKP EWYFLFAYTI LRSVPNKLGG VLALLLSILI
     LAMIPILHMS KQQSMMFRPL SQSLYWLLAA DLLILTWIGG QPVSYPFTII GQVASVLYFT
     TILILMPTIS LIENKMLKWA
22
Protein sequence formats - NiceProt
23
Protein sequence databases - UniProt
http://www.expasy.uniprot.org/
24
Protein sequence databases - PIR
http://pir.georgetown.edu/
25
Protein sequence databases - SwissProt
http://www.expasy.org/sprot/
26
Protein sequence databases - NRL-3D
27
Protein structure databases - PDB
http://www.pdb.org
28
Comparison programs - BLAST
http://www.ncbi.nlm.nih.gov/BLAST/
http://www.ebi.ac.uk/blastall/
29
Comparison programs - FASTA
http://fasta.bioch.virginia.edu/
http://www.ebi.ac.uk/fasta33/
30
Comparison programs - Smith-Waterman
http://www.ebi.ac.uk/MPsrch/
31
Assessing alignments

$S = \sum_{i,j} S(i,j) - nG$
S = total score
S(i,j) = similarity matrix score for aligning i and j
Sum is carried out over all aligned i and j
n = number of gaps (assuming no gap ext. penalty)
G = gap penalty

Simplest score (for identity matrix) is S = number of matches
What does a score of 10 mean? What is the right cutoff?
Adapted from M.Gerstein
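The score defined on this slide can be computed for a given alignment as in the sketch below. It assumes that a gap is a run of `-` characters in one row and that each run costs G once, which is one reading of "no gap extension penalty"; the function names and the gap penalty of 2 are illustrative.

```python
def identity(x, y):
    """Identity matrix: 1 for a match, 0 otherwise."""
    return 1 if x == y else 0


def alignment_score(row_a, row_b, sim, gap_penalty):
    """Sum sim(x, y) over aligned pairs and subtract gap_penalty per gap run."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    total, n_gaps = 0, 0
    prev = None  # row that held a gap in the previous column, if any
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            cur = "a" if x == "-" else "b"
            if cur != prev:
                n_gaps += 1  # a new gap opens; extending it is free
            prev = cur
        else:
            total += sim(x, y)
            prev = None
    return total - n_gaps * gap_penalty


print(alignment_score("GTAAGGTCC-AGT", "GTTAGGTCCTAGT", identity, 2))
```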
32
Assessing sequence similarity

Need to know how strong an alignment can be expected from chance alone
Chance is the comparison of:
Real but non-homologous sequences
Real sequences that are shuffled to preserve compositional properties
Sequences that are generated randomly based upon a DNA or protein sequence model (favored)

Adapted from S.Daudenarde
33
Model Random Sequence

Necessary to evaluate the score of a match
Take into account background
Adjust for GC content
Poly-A tails
Junk sequences
Codon bias

Adapted from S.Daudenarde
34
E-Values
E-Value: expectation value cutoff. An E-Value of 1 means that one would expect to find, by chance, at most 1 alignment with a score as high as the one reported in a search of the given database. Setting the E-value higher will report more sequence matches.
Accordingly, the smaller, the better...
It is computed approximately as $E = mn \cdot 2^{-S'}$, where m, n are the sequence lengths and S' is computed, e.g., as
$S' = \frac{\lambda S - \ln K}{\ln 2}$
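The two formulas above translate directly into code. In this sketch the function names are hypothetical, and the values of λ and K in the usage line are placeholders chosen for illustration; in practice they depend on the scoring matrix and gap penalties.

```python
import math


def bit_score(raw_score, lam, k):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, m, n, lam, k):
    """E = m * n * 2^(-S'), with m and n the sequence lengths."""
    return m * n * 2.0 ** (-bit_score(raw_score, lam, k))


# Placeholder parameters: lam=0.267 and k=0.041 are assumed, not from the slides.
print(e_value(50, 100, 10**6, lam=0.267, k=0.041))
```

Substituting S' into the first formula gives the equivalent form $E = K m n e^{-\lambda S}$, which is a convenient check on the implementation.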
35
P-Values
P-Value: the probability that such a result (or a better one) would be obtained for a "random" protein
Again, the smaller, the better...
36
P-Values

According to the Poisson distribution, the probability of finding b HSPs with a score ≥ S is given by
$e^{-E} E^{b} / b!$
For b = 0, that chance is
$e^{-E}$
Thus the probability of finding at least one such HSP is
$P = 1 - e^{-E}$

Adapted from S.Daudenarde
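The relation between E and P on this slide can be checked numerically with a short sketch; the function names are illustrative.

```python
import math


def prob_exactly(b, e):
    """Poisson probability of exactly b HSPs when e are expected."""
    return math.exp(-e) * e ** b / math.factorial(b)


def p_value(e):
    """P = 1 - e^(-E): probability of at least one HSP."""
    # expm1 keeps precision when E is very small.
    return -math.expm1(-e)


for e in (10.0, 1.0, 0.01):
    print(e, p_value(e))
```

For small E the two quantities are nearly equal, since $1 - e^{-E} \approx E$; they differ noticeably only when E approaches 1 or more.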
37
P-Values

P(s > S) = .01
A P-value of .01 occurs at the score threshold S (392 below) where the score s from a random comparison is greater than this threshold 1% of the time
Likewise for P = .001 and so on.

Adapted from M.Gerstein
38
General Protein Search Principles

Choose between local or global search algorithms
Use most sensitive search algorithm available
Original BLAST for no gaps
Smith-Waterman for most sensitivity
FASTA with k-tuple 1 is a good compromise
Gapped BLAST for well delimited regions
PSI-BLAST for families
Initially BLOSUM62 and default gap penalties

If no significant results, use BLOSUM30 and lower gap penalties
FASTA cutoff of .01
BLAST cutoff of .0001
Examine results between exp. 0.05 and 10 for biological significance
Ensure expected score is negative
Beware of hits on long sequences or hits with unusual aa composition
Reevaluate results of borderline significance using limited query region
Segment long queries ≥ 300 amino acids
Segment around known motifs

(some text adapted from D Brutlag)
39
ROC diagrams
We will consider proteins to be homologous if their similarity exceeds some threshold t
true positives (tp): s(a,b) ≥ t and a, b are homologous
false positives (fp): s(a,b) ≥ t and a, b are not homologous
true negatives (tn): s(a,b) < t and a, b are not homologous
false negatives (fn): s(a,b) < t and a, b are homologous
Sensitivity = tp/n = tp/(tp+fn), where n is the number of homologous pairs
Specificity = tn/n = tn/(tn+fp), where n is the number of non-homologous pairs
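These definitions can be turned into a small sketch that yields one point of a ROC curve per threshold. The function names and the scored pairs are made up for illustration; they are not data from the presentation.

```python
def confusion_counts(scored_pairs, t):
    """Count tp, fp, tn, fn for (score, is_homologous) pairs at threshold t."""
    tp = fp = tn = fn = 0
    for score, homologous in scored_pairs:
        if score >= t:
            if homologous:
                tp += 1
            else:
                fp += 1
        elif homologous:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity_specificity(scored_pairs, t):
    """Return (tp/(tp+fn), tn/(tn+fp)); needs both classes to be present."""
    tp, fp, tn, fn = confusion_counts(scored_pairs, t)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical similarity scores with known homology labels.
pairs = [(35, True), (25, True), (15, True),
         (28, False), (12, False), (5, False), (8, False)]
for t in (10, 20, 30):
    print(t, sensitivity_specificity(pairs, t))
```

Raising the threshold trades sensitivity for specificity, which is the trade-off a ROC diagram displays.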
40
ROC diagrams
41
ROC diagrams
Coverage (roughly, fraction of sequences that one confidently says something about): sensitivity = tp/n = tp/(tp+fn)
Error rate (fraction of the statements that are false positives): specificity = tn/n = tn/(tn+fp), error rate = 1 - specificity = fp/n
Different score thresholds: Thresh = 10, Thresh = 20, Thresh = 30
Two methods (red is more effective)
Adapted from M.Gerstein
