Title: Lecture 3 : Overview
1- Lecture 3 Overview
- Biological text based search NCBI.
- NCBI homepage http//www.ncbi.nlm.nih.gov/
- Entrez http//www.ncbi.nlm.nih.gov/Entrez
/ - OMIM http//www.ncbi.nlm.nih.gov/entrez/query.
fcgi?dbOMIM - Another way to obtain FASTA formatted
sequences - through LocusLink http//www.ncbi.nlm.n
ih.gov/LocusLink/ - For help http//www.ornl.gov/hgmis/posters/
chromosome/blast.html
Note Do not forget to choose human as organism.
2 Lecture 3 Overview
FASTA format
3- Other Biological Text Based
- Searches
- SRS (sequence retrieval system)
- at EBI, England. http//srs.ebi.ac.uk/
-
- STAG at DDBJ, Japan.
- http//stag.genome.ad.jp/
- Expasy at SIB
- (Swiss Institute of
- Bio informatics),
- Switzerland.
- http//www.expasy.org/
4Gaucher Disease
Phillipe Charles Ernest Gaucher
- Cause A rare inherited disease (genetic
- disorder), caused by enzyme deficiency
- (glucocerebrosidase GBA), which results in
- accumulation of the lipid glucocerebrosides
- in cells lysosomes.
- Highest prevalence in Ashkenazi Jewish
population. - Traditional therapy total or partial
- removal of the spleen, blood transfusions,
- orthopedic procedures and bone marrow
- transplantation.
- Enzyme replacement therapy, very costly
- 100,000 - 400,000 US/patient/year.
1882
5Gaucher Disease
Characterization remarkable degree of
variability in clinical signs and
symptoms. Clinical signs anemia, bone damage,
enlarged liver and spleen (most patients have
Type 1 disease), some develop additional severe
central nervous system damage (Type 2 3,
neuronopathic). Type 1 is the most common
genetic disease affecting Ashkenazi Jews. As
many as 1 in 10 Ashkenazi Jews and 1 in 100
people in the general population may be carrying
the gene mutation that causes Type 1 Gauchers.
6Lecture 4 Pair-wise Sequence Alignment
- Outline
- Biological motivation for homology search.
- Short Overview Algorithms and heuristics.
- Methods for comparing two sequences
- (Smith-Waterman, BLAST, FASTA).
- Wet experience.
7Similarity / Identity
Examples If looks like an elephant, and smells
like an elephant its an elephant. If walks
like a duck, and quacks like a duck its a
duck.
Identity - CAT CAT Similarity CAT CAT, CAR,
HAT For each three letter word there are at
most 203 similar words.
http//cbms.st-and.ac.uk/academics/ryan/Teaching/m
olbiol/Bioinf_files/v3_document.htm
8Biological Motivation and Implications
- Sequences by themselves are not
informative. - Sequence analysis by comparative methods against
databases in order to find related (homologous)
sequences. - Re-construct long DNA sequences from short
overlapping fragments. - Explore frequently appearing patterns of
nucleotides - (homologous sequences, structure similarity,
a common ancestor - and similar function).
9Why DO Homologies ?
- A powerful tool to? compare newly discovered
- sequences with known genes.
- Both functional, structural and evolutionary
- information can be inferred.
- Regions of similarity in unrelated proteins
- may be detected.
10- What is Homology ?
- Similarity between properties in various species.
- Before Darwin, homology was defined
morphologically. - Example
- Bats and butterflies fly, but structures are
different. - Bats fly and whales swim, yet the bones in a
bat's wing - and a whale's flipper are strikingly alike.
-
- Conclusion
- Bats and butterflies wings are not homologous.
- Bats wings and whales flippers are
homologous.
11Homology Interpretation from Darwin to 21st
Century
- Darwin (1859) Homology is a result of descent
- with modifications from a common ancestor.
- Modern genetics Homology is determined by
genes. - Two sequences are homologous if they are
similar - and share a common ancestor (similarity by
itself is not enough two - sequences without a common ancestor can have
similar sub-sequences). - Large enough similarities typically imply
homology - when gt 30 identity over 100 residues or
more, sequences are - homologous (all or non property).
- Regions gt 50 identical, in a 20-40 amino acid
region may - occur frequently by chance.
12Classical Example (Science, 1983)
Relationships were established between cancer and
un-controlled cell growth The sequence of a new
human gene (probably related to cancer) was
aligned to genes known to affect cell growth (in
drosophila, the fruitfly), which has recently
been called the little person with wings
Schneider, D. Using Drosophila as a Model
Insect. Nature Reviews Genetics 1 218-226,
2000.).
The
similarity between the sequences was very high,
suggesting that the cancer related gene also
affects cell growth.
http//www.molbio.princeton.edu/courses/mb427/2001
/projects/03/cancer.html
13The Limits of Sequence Similarity
14Similarity vs. Distance
- Two related notions for sequences comparisons
- How similar 2 sequences are ? Count matches.
- How distant 2 sequences are ? Count mismatches.
HIGH SIMILARITY SHORT DISTANCE
Similarity can be either positive or
negative. Distance is always non-negative (gt0).
Identical sequences have zero (0) distance.
15 What is an Alignment ? How Do We Measure
Sequence Alignment ?
T C A T G C A T T G
A process of lining-up 2 or more sequences to
achieve maximum level of identity, in order to
find homologies.
?
T C A T G C A T T G
T C A T G C A T T G
or
16Terms
Homology Sequences that are related by
divergence from a common ancestor. Identity Seq
uences that are invariant. Similarity Sequences
that are related.
C A T C A T
C A G C A T
17Most Common Mutation Types
Insertion (in) AGCGGC deletion (del)
ACG_C ACGGC substitution (sub) CCGGC
(INDEL insertions deletions)
18What is an Alignment ?
- Given two nucleotide sequence strings S and T,
- an alignment is a pair of sequence strings S, T
- consisting of nucleotides and gaps ( _ ) such
that - 1. S and T have the same length.
- 2. Removing all gaps from S will get S.
- 3. Removing all gaps from T will get T.
- Examples
- S ACTG S AC_TG S ACTG S ACTG
- T AGT T A_GT_ T AGT_ T _AGT
.
19There are many alignments S, T. We are
looking for the best one.
But how do we measure the quality of an
alignment ?
S ACTG S AC_TG S ACTG S ACTG T
AGT T A_GT_ T AGT_ T _AGT
Good Identical characters (match). Bad
Different characters (mismatch) Indel (gap).
20Similarity Score of Alignment
- Each pair of characters in the alignment
- gets a value, depending on its identity.
- The similarity score of the alignment is
- the sum of pair values.
- Example for pair values (relevant to DNA)
- Identical characters (match) 1
- Different characters (mismatch) -1
- Indel (gap) -1
21Example for Similarity Scores
Score -1 0 -4
S ACTG T AGT
S AC_TG
T A_GT_
S ACTG
T AGT_
S ACTG
T
_AGT
S, T is the best of these three,
but is it the best of ALL alignments ???
22Lecture 4 Pair-wise Sequence Alignment
- Outline
- Biological motivation for homology search.
- Short Overview Algorithms and heuristics.
- Methods for comparing two sequences
- Smith-Waterman, BLAST, FASTA.
- Wet experience.
23Algorithms Heuristics
- There are a number of exact algorithms and
- heuristics for finding alignment(s)
- Exact algorithm guarantees to find the best.
- Heuristics usually find the best or almost the
best. -
- Heuristics are typically much faster
- (time vs. quality trade-off).
24Pair-wise Alignment Programs
- Exact Algorithms
- Based on dynamic programming, a known
- algorithmic tool (not exhaustive search !).
- Most sensitive, but computational expensive
and slow - Heuristics, based on SW algorithm
- 1. FastA (1985) (http//www2.ebi.ac.uk/
fasta3/) - 2. Blast (1990) (http//www.ncbi.nlm.n
ih.gov/BLAST/)
- Needlman-Wunch (1970).
- Smith-Waterman (1981).
25Types of Gap Penalties
(Improved Pricing of InDels)
Motivation Aligning cDNAs to Genomic DNA
cDNA query
Example
Genomic DNA
In this case, if we penalize every single gap by
-1, the similarity score will be very low, and
the parent DNA will not be picked up.
26Types of Gap Penalties
- (insertions or deletions, indels)
- Insertions and deletions are rare in
evolution. - Once they are created, they are easy to
extend. - Examples
- BLAST Cost to open a gap 10 (high penalty).
- Cost to extend a gap 0.5 (low
penalty).
FASTA
27A Simple Graphical Representation - Dot Plot
C T T A G G A C T
G A G G A C T
Put a dot at every position with identical
nucleotides in the two sequences.
Sequences C T T A G G A C T G A G G A C T
28A Simple Graphical Representation - Dot Plot
C T T A G G A C T
G A G G A C T
Put a dot at every position with identical
nucleotides in the two sequences.
Long diagonals with dots - long matches (good !)
C T T A G G A C T G A G
G A C T
Short dotted diagonals - short matches
(unimpressive)
C T T A G G A C T G A G
G A C T
29Getting Rid of Short Diagonals - word
size
C T T A G G A C T
G A G G A C T
Start with original dot plot. Retain a run along
a diagonal only if has length 6 or more.
This word size is called Ktup in Fasta, W in
Blast
30How Does Word Size Influence Sensitivity
Selectivity ?
Sensitivity of Algorithm The ability to
recognize distantly related sequences. Find all
real homologues. Selectivity of Algorithm The
ability to discard false positive matches between
un-related sequences. Find only real homologues.
Large word size - fast, less sensitive, more
selective distant relatives do not have
many runs of matches, un-related
sequences stand no chance to be selected. Small
word size - slow, more sensitive, less
selective.
31Effect of Word Size
Large word size - fast, less sensitive, more
selective distant relatives do not have many
runs of matches, un-related sequences stand no
chance to be selected. Small word size - slow,
more sensitive, less selective. Example If
word size 3, we will find all words containing
TCG in this sequence (very sensitive compared to
large word size, but less selective and will
find all TCGs).
32Effect of Word Size
If word size 3, you will not find TCGA in this
sequence.
33Smith-Waterman Algorithm (1981)
- Sensitive, guaranteed to find the best
alignment(s) - between a pair of sequences.
- For 2 sequences S of length n and T of length m,
- Smith-Waterman takes nm computational steps.
- ??
- Assume a query sequence S of n200
nucleotides. - ?? We are looking for homologous sequences in a
database - containing, e.g., 107 sequences (small
db) Ti each m500. - ?? Number of computation steps 200500 per
pair S, Ti - 107 200 500 1012 steps total.
- ?? NOT PRACTICAL even for modern computers !
- What is the solution ?
34What is the solution ?
- Use a heuristic algorithm to discard most
irrelevant sequences. - Perform the exact algorithm on the small group of
remaining sequences.
35Overcoming SW Impracticality
1. Use special purpose hardware, e.g. Compugen.
Speed-up factors of up to 100, but costs
? 2. Heuristic solutions - fast, USUALLY OK,
No guarantees.
THE two heuristics FASTA BLAST. Common idea
A really good alignment almost always contains
sub-sequences that are identical (break the
sequences into fragments). Search strategy
Look for identical sub-sequences Extend those
to longer, almost identical sub-sequences (but
ignore non-promising sub-sequences). The best
alignment is not guaranteed to be there
(heuristic).
36Heuristic Search Algorithms
- BLAST (Altschul 1990)
- Basic Local Alignment Search Tool
- Developed in NCBI.
- Uses rapid word lookup methods (heuristics) to
completely skip most of the database entries. - Extremely fast - One order of magnitude faster
than FASTA.. - Almost as sensitive as FASTA.
- FASTA (Pearson 1985)
- Developed in EMBL.
- Uses heuristics to avoid calculating the full
dynamic programming matrix. - Speed up searches by an order of magnitude
compared to full Smith-Waterman. - The statistical side of FASTA is still stronger
than BLAST.
37Strategies BLAST FASTA
- First, identify hot spots (Ktup or word size W,
in FASTA and blastN, respectively) of exact
matches, between the query and databases, that
are high scoring strings (HSP, that satisfy a
Threshold value T). - 2. Next, take only the best hot spots (HSP).
- 3. Extend them to longer diagonal runs (no
indels yet). - 4. Merge close by diagonal runs (this introduces
indels). - Optimize the merges by running Smith-Waterman on
them. - Notes
- 1. The longer the word the more rapid, but less
thorough, the database search is. - 2. The final output consists of many sequences
(ordered by similarity scores). - 3. The best alignment is not guaranteed to be
among them ! Heuristics.
38FASTA Visualization
Identify all hot spots longer than Ktup.
Ignore all short hot spots. The longest hot spot
is called init1.
Merge diagonal runs. Optimize using SW in a
narrow band. Best result is called
Extend hot spots to longer diagonal runs.
Longest diagonal run is initn.
opt.
39FastA Output
FastA produces a list, where each entry looks
like EM_HUMAF?267177 Homo sapiens glucocer
(5420) f 3236 623 1.1e-176 The database name
and entry (accession numbers). Then comes the
species. and a short gene name. The length of
the sequence. Similarity score of the optimal
alignment (opt). The bits score, and the
E-value.
Both measure the statistical significance of the
alignment.
40FastA Output - Explanation
E-value is the theoretically Expected number of
false hits (random sequences) per sequence
query, given a similarity score (a statistical
significance threshold for reporting matches
against database sequences). Low E-value means
high significance, fewer matches will be
reported. Bits is an alternative statistical
measure for significance. High bits means high
significance. Some versions also display
z-score, a measure similar to Bits.
41What Is a Significant E-Value ?
How many false positives to expect? For E-value
10 4 1 in 10,000 Database No. of Entries
False Positive SwissProt 105,967 10.6 PIR-PSD
283,153 28.3 TrEMBL 594,148 59.4
42FastA Parameters
Matrix - Determine the scoring system for letter
substitutions. For nucleotide
sequences matrices are very simple. More
complicated matrices used for proteins. Filter -
Mask off segments of the query sequence that
are highly repetitive (thus not meaningful
biologically, but significant
statistically). Repetitive sequence found by a
filter program are substituted using the letter
"N" in nucleotide sequence (e.g., "NNNNNNNN")
and the letter "X" in protein sequences (e.g.,
"XXXXXXXXX"). Histogram - A (primitive)
graphical display of the distribution of all
sequences in the data bases searched, according
to their similarity scores.
43opt
A Histogram for observed () vs expected ()
the usual bell curve
Unexpected, high score sequences (signal
vs noise)
44Different Variants of Blast FastA
http//www-bioeng.ucsd.edu/research/research_group
s/compbio/workshop/
45Example Comparison of Gene 1 Between
Human and Another Species.
Homo sapiens vs Mus musculus
Homo sapiens vs Gallus gallus
Homo Sapiens vs Fugu Rubripes
http//www1.imim.es/courses/Lisboa01/slide6.5.html
46Lecture 4 Pair-wise Sequence Alignment
- Outline
- Biological motivation for homology search.
- Short Overview Algorithms and heuristics.
- Methods for comparing two sequences
- Smith-Waterman, BLAST, FASTA.
- Wet experience.
47Lets Run FastA
48http//www.ebi.ac.uk/fasta33/
Molecule type
http//www2.ebi.ac.uk/fasta3/help.html Example
and interpretation of results http//www.ebi.ac.u
k/2can/tutorials/nucleotide/fasta2.html
49Do not confuse FASTA Input Format
Step 1 copy the sequence in FASTA
format/accession number and paste into input box.
50FASTA Output
forward or reverse strand
The best scores are              Â
     opt
bits E-valueEM_HUMBC003356 BC003356.1
Homo sapiens, gluco (2279) f 11395 1819.6 Â
 0EM_HUMHSHL60A D13286.1 Homo sapiens mRNA for
 (2259) f 11263 1798.7   0EM_HUMHSGCBPRC
M19285.1 Human glucocerebrosid (2275) f 11221
1792.0 Â Â 0EM_HUMHSGCB01 M16328.1 Human
glucocerebrosida (2587) f 11131 1777.7 Â Â
0EM_PATAX752903 AX752903.1 Sequence 13 from Pa
(2586) f 11108 1774.0 Â Â 0EM_HUMHSGCBL
K02920.1 Human lysosomal glucoce (1792) f
8924 1427.6 Â Â 0EM_PATI09351 I09351.1
Sequence 1 from patent  (1661) f 8188
1310.8 Â Â 0EM_SYBT008212 BT008212.1
Synthetic construct  (1611) f 8045
1288.1 Â Â 0EM_PATAX147658 AX147658.1
Sequence 7 from Pat (1611) f 7924 1268.9 Â
 0
Genes name
genes length
Database name and entry (accession number)
Thumb rules Sequences with E-value lt 0.01 are
almost always homologous. Sequences with E-value
between 1-10 may be related. Word size The thumb
rule is that the larger the word-length the less
sensitive, but faster the search will be.
51FASTA Output - cont.
EM_HUM is the EMBL database division that the
entry is in, in this case EMBL human (EMBL
HUMAN). BC003356 is the entry name in the
database. BC003356.1 is the accession number
associated with this entry. Homo sapiens gluco
is the start of the description of the database
entry. (2279) is the number of nucleotides
present in the entry. f this stands for
forward or reverse, and represents which strand
the alignment was based from. 11395 is the
optimized score. 1819.6 is the bits score, which
is a normalized score calculated from the raw
score. 0 is the expectation value. In evaluating
the E() scores, the following rules of thumb can
be used sequences with E() less than 0.01 are
almost always found to be homologous, sequences
with E() between 1 and 10 frequently turn out to
be related as well. A value of zero is an exact
match.
52FASTA Alignment
gtgtEM_HUMBC003356 BC003356.1 Homo sapiens,
glucosidase, Â (2279 nt)initn 11395 init1
11395 opt 11395 Â Z-score 9759.0 Â bits 1819.6
E() Â Â 0banded Smith-Waterman score 11395
 100.000 identity (100.000 ungapped) in 2279
nt overlap (1-22791-2279) Â Â Â Â Â Â Â Â 10 Â
   20     30     40     50   Â
 60Sequen AGCTAAGGCAGGTACCTGCATCCTTGTTTTTGTTTAGTG
GATCCTCTATCCTTCAGAGACÂ Â Â Â
EM_HUM
AGCTAAGGCAGGTACCTGCATCCTTGTTTTTGTTTAGTGGATCCTCTATC
CTTCAGAGACÂ Â Â Â Â Â Â Â 10 Â Â Â Â 20 Â Â Â Â 30
    40     50     60
Note perfect match
Explanations about FASTA
http//www.embnet.se/roadmap/internet-fasta.html
FASTA also available at http//www.dna.affrc.go.j
p/htbin/prefastadna.pl http//fasta.bioch.virginia
.edu/fasta_www/cgi/search_frm.cgi?pgmfa
http//bioweb.pasteur.fr/seqanal/interfaces/fasta
.html
53FASTA Last Alignment
gtgtEM_HUMHSGCBA2 M22213.1 Human
glucocerebrosidase mRNA, Â (73 nt)initn 360
init1 360 opt 360 Â Z-score 327.0 Â bits 69.4
E() 2.6e-08banded Smith-Waterman score 360
 100.000 identity (100.000 ungapped) in 72 nt
overlap (2187-22581-72) Â Â Â Â 2160 Â Â Â 2170
   2180    2190    2200    2210  Â
 Sequen GGAAGCCTTAAAGCAGCAGCGGGTGTGCCCAGGCACCCAGA
TGATTCCTATGGCACCAGCÂ Â Â Â Â Â Â Â Â Â Â Â Â Â Â
    EM_HUM  Â
             AGGCACCCAGATGATTCCTATGGC
ACCAGCÂ Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
 10     20     30    2220    2230 Â
  2240    2250    2260    2270  Â
 Sequen CAGGAAAAATGGCAGCTCTTAAAGGAGAAAATGTTTGAGCC
CAAAAAAAAAAAAAAAAAAÂ Â Â Â
         EM_HUM
CAGGAAAAATGGCAGCTCTTAAAGGAGAAAATGTTTGAGCCCC Â Â Â
             40     50     60
    70                Â
54Program Fastx Compares DNA Sequences to
Protein Databases
BL50 matrix ktup 2 gap-pen -15/ -2 The best
scores are
initn init1 opt z-sc
E(750289) BC000349_1(BC000349pidnone) Homo sa (
320) f 2208 2208 2208 2607.9 1.4e-137 AX147656_1
(AX147656pidnone) Sequenc ( 516) f 2258 1269
1269 1493.1 1.8e-75 AF285236_1(AF285236pidnone)
Pan tro ( 536) f 2260 1269 1269 1492.8
1.8e-75
55Lets Run
56BLAST
- Faster than FastA.
- Similar search strategy.
- Gapped Blast algorithm (1997, 3 times faster
than ungapped blast). - First, BLAST produces a list of high scoring
segment pairs - a pair of sub-sequences of the
same length without gaps - between the query and
databases. - Maximal segment pair is the segment pair with
maximal score. - A pair is reported only if its score gt
Threshold. -
57What Does BLAST Do ?
- Identify short segment pairs of length W
(default 11 in blastN) that are present in both
sequences, with high scoring strings (above a
threshold value T). - The statistical significance of these high
scoring strings is evaluated to determine whether
the matches are random or not. - The segment pairs are extended without gaps,
until - the maximal possible score is reached (when
score - falls below a limit due to mismatches in the
extension, the extension is stopped).
58http//www.ncbi.nih.gov/BLAST/
BLAST Help
http//www.ncbi.nlm.nih.gov/BLAST/blast_help.html
http//www.ncbi.nlm.nih.gov/Education/BLASTinfo/t
ut1.html
59Blast A Family of Programs
- BlastN - nt versus nt database.
- BlastP - protein versus protein database.
- BlastX - translated nt versus protein database.
- tBlastN - protein versus translated nt database.
- tBlastX - translated nt versus translated nt
database.
Query DNA Protein Database DNA Protein
60Cont.
61http//www.ncbi.nlm.nih.gov/BLAST/
gtsingle line description. Paste nucleotide
sequence (FASTA format) into search box.
Step 1 Choose the appropriate BLAST program for
DNA or protein sequences, paste your
sequence/accession.
62Input Format
63Step 2 Choose the right database.
64Blast Parameters
default
Step 3 Submit your query and start Blast !
65Step 4 Format to get results.
66Program introduction (top) page
version number, date
Step 5 get results.
Default max. 50 databases searched.
http//www.ncbi.nlm.nih.gov/blast/blast_FAQs.html
67BLAST Alignment - Results
query
All sequences or sequence fragments that match
the query are presented. Mousing over a hit will
display its definition and score. The histogram
shows the lowest (most significant, red)
E-values. Hatched areas non-similar regions.
68BLAST Output
Score (bit score) is a value calculated from the
number of gaps and substitutions for each aligned
sequence.
Link to Entrez
Sorting by E-value
sequence identifier codes in databases
10-179
species
sequence description
69(No Transcript)
70Important Messages Concerning Matches
- A statistically significant match is not
necessarily biologically significant, and
conversely a match may be of biological
significance without passing the statistical
test. You should therefore use - your knowledge of the biology of the
system to help interpret - these results.
- Use positive and negative known controls !
- Be cautious when you have a significant
hit. - Validate with independent tools !
- The absence of a match in the database does not
mean that - there is no homologous sequence in the
database. - 3. The presence of a similar sequence is no
guarantee that the sequences are homologous.
71What to do When Your Search Produced NO Hits ?
- Standard solutions
- Check your sequence format.
- Raise the Expect (E) value.
- Turn-off filtering of low complexity.
- Decrease word size.
72What to do When Your Search Returns Homologs With
Weak Similarities ?
- Run PSI-BLAST.
- Look for sequence motifs.
What In The Next Lectures ?
Why is pairwise sequence alignment different for
proteins ?