Lecture 3 : Overview

About This Presentation

Title:

Lecture 3 : Overview

Description:

OMIM: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM ... STAG at DDBJ, Japan. http://stag.genome.ad.jp/ Expasy at SIB (Swiss Institute of ... – PowerPoint PPT presentation

Number of Views:238

Avg rating:3.0/5.0

Slides: 73

Provided by: Ben5152

Category:

more less

Transcript and Presenter's Notes

Title: Lecture 3 : Overview

1

Lecture 3 Overview
Biological text based search NCBI.
NCBI homepage http//www.ncbi.nlm.nih.gov/
Entrez http//www.ncbi.nlm.nih.gov/Entrez
/
OMIM http//www.ncbi.nlm.nih.gov/entrez/query.
fcgi?dbOMIM
Another way to obtain FASTA formatted
sequences
through LocusLink http//www.ncbi.nlm.n
ih.gov/LocusLink/
For help http//www.ornl.gov/hgmis/posters/
chromosome/blast.html

Note Do not forget to choose human as organism.
2
Lecture 3 Overview
FASTA format
3

Other Biological Text Based
Searches
SRS (sequence retrieval system)
at EBI, England. http//srs.ebi.ac.uk/
STAG at DDBJ, Japan.
http//stag.genome.ad.jp/
Expasy at SIB
(Swiss Institute of
Bio informatics),
Switzerland.
http//www.expasy.org/

4
Gaucher Disease
Phillipe Charles Ernest Gaucher

Cause A rare inherited disease (genetic
disorder), caused by enzyme deficiency
(glucocerebrosidase GBA), which results in
accumulation of the lipid glucocerebrosides
in cells lysosomes.
Highest prevalence in Ashkenazi Jewish
population.
Traditional therapy total or partial
removal of the spleen, blood transfusions,
orthopedic procedures and bone marrow
transplantation.
Enzyme replacement therapy, very costly
100,000 - 400,000 US/patient/year.

1882
5
Gaucher Disease
Characterization remarkable degree of
variability in clinical signs and
symptoms. Clinical signs anemia, bone damage,
enlarged liver and spleen (most patients have
Type 1 disease), some develop additional severe
central nervous system damage (Type 2 3,
neuronopathic). Type 1 is the most common
genetic disease affecting Ashkenazi Jews. As
many as 1 in 10 Ashkenazi Jews and 1 in 100
people in the general population may be carrying
the gene mutation that causes Type 1 Gauchers.
6
Lecture 4 Pair-wise Sequence Alignment

Outline
Biological motivation for homology search.
Short Overview Algorithms and heuristics.
Methods for comparing two sequences
(Smith-Waterman, BLAST, FASTA).
Wet experience.

7
Similarity / Identity
Examples If looks like an elephant, and smells
like an elephant its an elephant. If walks
like a duck, and quacks like a duck its a
duck.
Identity - CAT CAT Similarity CAT CAT, CAR,
HAT For each three letter word there are at
most 203 similar words.
http//cbms.st-and.ac.uk/academics/ryan/Teaching/m
olbiol/Bioinf_files/v3_document.htm
8
Biological Motivation and Implications

Sequences by themselves are not
informative.
Sequence analysis by comparative methods against
databases in order to find related (homologous)
sequences.
Re-construct long DNA sequences from short
overlapping fragments.
Explore frequently appearing patterns of
nucleotides
(homologous sequences, structure similarity,
a common ancestor
and similar function).

9
Why DO Homologies ?

A powerful tool to? compare newly discovered
sequences with known genes.
Both functional, structural and evolutionary
information can be inferred.
Regions of similarity in unrelated proteins
may be detected.

What is Homology ?
Similarity between properties in various species.
Before Darwin, homology was defined
morphologically.
Example
Bats and butterflies fly, but structures are
different.
Bats fly and whales swim, yet the bones in a
bat's wing
and a whale's flipper are strikingly alike.
Conclusion
Bats and butterflies wings are not homologous.
Bats wings and whales flippers are
homologous.

11
Homology Interpretation from Darwin to 21st
Century

Darwin (1859) Homology is a result of descent
with modifications from a common ancestor.
Modern genetics Homology is determined by
genes.
Two sequences are homologous if they are
similar
and share a common ancestor (similarity by
itself is not enough two
sequences without a common ancestor can have
similar sub-sequences).
Large enough similarities typically imply
homology
when gt 30 identity over 100 residues or
more, sequences are
homologous (all or non property).
Regions gt 50 identical, in a 20-40 amino acid
region may
occur frequently by chance.

12
Classical Example (Science, 1983)
Relationships were established between cancer and
un-controlled cell growth The sequence of a new
human gene (probably related to cancer) was
aligned to genes known to affect cell growth (in
drosophila, the fruitfly), which has recently
been called the little person with wings
Schneider, D. Using Drosophila as a Model
Insect. Nature Reviews Genetics 1 218-226,
2000.).
The
similarity between the sequences was very high,
suggesting that the cancer related gene also
affects cell growth.
http//www.molbio.princeton.edu/courses/mb427/2001
/projects/03/cancer.html
13
The Limits of Sequence Similarity
14
Similarity vs. Distance

Two related notions for sequences comparisons
How similar 2 sequences are ? Count matches.
How distant 2 sequences are ? Count mismatches.

HIGH SIMILARITY SHORT DISTANCE
Similarity can be either positive or
negative. Distance is always non-negative (gt0).
Identical sequences have zero (0) distance.
15
What is an Alignment ? How Do We Measure
Sequence Alignment ?
T C A T G C A T T G
A process of lining-up 2 or more sequences to
achieve maximum level of identity, in order to
find homologies.
?
T C A T G C A T T G
T C A T G C A T T G
or
16
Terms
Homology Sequences that are related by
divergence from a common ancestor. Identity Seq
uences that are invariant. Similarity Sequences
that are related.
C A T C A T
C A G C A T
17
Most Common Mutation Types
Insertion (in) AGCGGC deletion (del)
ACG_C ACGGC substitution (sub) CCGGC
(INDEL insertions deletions)
18
What is an Alignment ?

Given two nucleotide sequence strings S and T,
an alignment is a pair of sequence strings S, T
consisting of nucleotides and gaps ( _ ) such
that
1. S and T have the same length.
2. Removing all gaps from S will get S.
3. Removing all gaps from T will get T.
Examples
S ACTG S AC_TG S ACTG S ACTG
T AGT T A_GT_ T AGT_ T _AGT

.
19
There are many alignments S, T. We are
looking for the best one.
But how do we measure the quality of an
alignment ?
S ACTG S AC_TG S ACTG S ACTG T
AGT T A_GT_ T AGT_ T _AGT
Good Identical characters (match). Bad
Different characters (mismatch) Indel (gap).
20
Similarity Score of Alignment

Each pair of characters in the alignment
gets a value, depending on its identity.
The similarity score of the alignment is
the sum of pair values.
Example for pair values (relevant to DNA)
Identical characters (match) 1
Different characters (mismatch) -1
Indel (gap) -1

21
Example for Similarity Scores
Score -1 0 -4
S ACTG T AGT
S AC_TG
T A_GT_
S ACTG
T AGT_
S ACTG
T
_AGT

S, T is the best of these three,
but is it the best of ALL alignments ???
22
Lecture 4 Pair-wise Sequence Alignment

Outline
Biological motivation for homology search.
Short Overview Algorithms and heuristics.
Methods for comparing two sequences
Smith-Waterman, BLAST, FASTA.
Wet experience.

23
Algorithms Heuristics

There are a number of exact algorithms and
heuristics for finding alignment(s)
Exact algorithm guarantees to find the best.
Heuristics usually find the best or almost the
best.
Heuristics are typically much faster
(time vs. quality trade-off).

24
Pair-wise Alignment Programs

Exact Algorithms
Based on dynamic programming, a known
algorithmic tool (not exhaustive search !).
Most sensitive, but computational expensive
and slow
Heuristics, based on SW algorithm
1. FastA (1985) (http//www2.ebi.ac.uk/
fasta3/)
2. Blast (1990) (http//www.ncbi.nlm.n
ih.gov/BLAST/)

Needlman-Wunch (1970).
Smith-Waterman (1981).

25
Types of Gap Penalties
(Improved Pricing of InDels)
Motivation Aligning cDNAs to Genomic DNA
cDNA query
Example
Genomic DNA
In this case, if we penalize every single gap by
-1, the similarity score will be very low, and
the parent DNA will not be picked up.
26
Types of Gap Penalties

(insertions or deletions, indels)
Insertions and deletions are rare in
evolution.
Once they are created, they are easy to
extend.
Examples
BLAST Cost to open a gap 10 (high penalty).
Cost to extend a gap 0.5 (low
penalty).

FASTA
27
A Simple Graphical Representation - Dot Plot
C T T A G G A C T
G A G G A C T
Put a dot at every position with identical
nucleotides in the two sequences.
Sequences C T T A G G A C T G A G G A C T
28
A Simple Graphical Representation - Dot Plot
C T T A G G A C T
G A G G A C T
Put a dot at every position with identical
nucleotides in the two sequences.
Long diagonals with dots - long matches (good !)
C T T A G G A C T G A G
G A C T
Short dotted diagonals - short matches
(unimpressive)
C T T A G G A C T G A G
G A C T
29
Getting Rid of Short Diagonals - word
size
C T T A G G A C T
G A G G A C T
Start with original dot plot. Retain a run along
a diagonal only if has length 6 or more.
This word size is called Ktup in Fasta, W in
Blast
30
How Does Word Size Influence Sensitivity
Selectivity ?
Sensitivity of Algorithm The ability to
recognize distantly related sequences. Find all
real homologues. Selectivity of Algorithm The
ability to discard false positive matches between
un-related sequences. Find only real homologues.
Large word size - fast, less sensitive, more
selective distant relatives do not have
many runs of matches, un-related
sequences stand no chance to be selected. Small
word size - slow, more sensitive, less
selective.
31
Effect of Word Size
Large word size - fast, less sensitive, more
selective distant relatives do not have many
runs of matches, un-related sequences stand no
chance to be selected. Small word size - slow,
more sensitive, less selective. Example If
word size 3, we will find all words containing
TCG in this sequence (very sensitive compared to
large word size, but less selective and will
find all TCGs).
32
Effect of Word Size
If word size 3, you will not find TCGA in this
sequence.
33
Smith-Waterman Algorithm (1981)

Sensitive, guaranteed to find the best
alignment(s)
between a pair of sequences.
For 2 sequences S of length n and T of length m,
Smith-Waterman takes nm computational steps.
??
Assume a query sequence S of n200
nucleotides.
?? We are looking for homologous sequences in a
database
containing, e.g., 107 sequences (small
db) Ti each m500.
?? Number of computation steps 200500 per
pair S, Ti
107 200 500 1012 steps total.
?? NOT PRACTICAL even for modern computers !
What is the solution ?

34
What is the solution ?

Use a heuristic algorithm to discard most
irrelevant sequences.
Perform the exact algorithm on the small group of
remaining sequences.

35
Overcoming SW Impracticality
1. Use special purpose hardware, e.g. Compugen.
Speed-up factors of up to 100, but costs
? 2. Heuristic solutions - fast, USUALLY OK,
No guarantees.
THE two heuristics FASTA BLAST. Common idea
A really good alignment almost always contains
sub-sequences that are identical (break the
sequences into fragments). Search strategy
Look for identical sub-sequences Extend those
to longer, almost identical sub-sequences (but
ignore non-promising sub-sequences). The best
alignment is not guaranteed to be there
(heuristic).
36
Heuristic Search Algorithms

BLAST (Altschul 1990)
Basic Local Alignment Search Tool
Developed in NCBI.
Uses rapid word lookup methods (heuristics) to
completely skip most of the database entries.
Extremely fast - One order of magnitude faster
than FASTA..
Almost as sensitive as FASTA.

FASTA (Pearson 1985)
Developed in EMBL.
Uses heuristics to avoid calculating the full
dynamic programming matrix.
Speed up searches by an order of magnitude
compared to full Smith-Waterman.
The statistical side of FASTA is still stronger
than BLAST.

37
Strategies BLAST FASTA

First, identify hot spots (Ktup or word size W,
in FASTA and blastN, respectively) of exact
matches, between the query and databases, that
are high scoring strings (HSP, that satisfy a
Threshold value T).
2. Next, take only the best hot spots (HSP).
3. Extend them to longer diagonal runs (no
indels yet).
4. Merge close by diagonal runs (this introduces
indels).
Optimize the merges by running Smith-Waterman on
them.
Notes
1. The longer the word the more rapid, but less
thorough, the database search is.
2. The final output consists of many sequences
(ordered by similarity scores).
3. The best alignment is not guaranteed to be
among them ! Heuristics.

38
FASTA Visualization
Identify all hot spots longer than Ktup.
Ignore all short hot spots. The longest hot spot
is called init1.
Merge diagonal runs. Optimize using SW in a
narrow band. Best result is called
Extend hot spots to longer diagonal runs.
Longest diagonal run is initn.
opt.
39
FastA Output
FastA produces a list, where each entry looks
like EM_HUMAF?267177 Homo sapiens glucocer
(5420) f 3236 623 1.1e-176 The database name
and entry (accession numbers). Then comes the
species. and a short gene name. The length of
the sequence. Similarity score of the optimal
alignment (opt). The bits score, and the
E-value.
Both measure the statistical significance of the
alignment.
40
FastA Output - Explanation
E-value is the theoretically Expected number of
false hits (random sequences) per sequence
query, given a similarity score (a statistical
significance threshold for reporting matches
against database sequences). Low E-value means
high significance, fewer matches will be
reported. Bits is an alternative statistical
measure for significance. High bits means high
significance. Some versions also display
z-score, a measure similar to Bits.
41
What Is a Significant E-Value ?
How many false positives to expect? For E-value
10 4 1 in 10,000 Database No. of Entries
False Positive SwissProt 105,967 10.6 PIR-PSD
283,153 28.3 TrEMBL 594,148 59.4
42
FastA Parameters
Matrix - Determine the scoring system for letter
substitutions. For nucleotide
sequences matrices are very simple. More
complicated matrices used for proteins. Filter -
Mask off segments of the query sequence that
are highly repetitive (thus not meaningful
biologically, but significant
statistically). Repetitive sequence found by a
filter program are substituted using the letter
"N" in nucleotide sequence (e.g., "NNNNNNNN")
and the letter "X" in protein sequences (e.g.,
"XXXXXXXXX"). Histogram - A (primitive)
graphical display of the distribution of all
sequences in the data bases searched, according
to their similarity scores.
43
opt
A Histogram for observed () vs expected ()
the usual bell curve
Unexpected, high score sequences (signal
vs noise)
44
Different Variants of Blast FastA
http//www-bioeng.ucsd.edu/research/research_group
s/compbio/workshop/
45
Example Comparison of Gene 1 Between
Human and Another Species.
Homo sapiens vs Mus musculus
Homo sapiens vs Gallus gallus
Homo Sapiens vs Fugu Rubripes
http//www1.imim.es/courses/Lisboa01/slide6.5.html

46
Lecture 4 Pair-wise Sequence Alignment

Outline
Biological motivation for homology search.
Short Overview Algorithms and heuristics.
Methods for comparing two sequences
Smith-Waterman, BLAST, FASTA.
Wet experience.

47
Lets Run FastA
48
http//www.ebi.ac.uk/fasta33/
Molecule type
http//www2.ebi.ac.uk/fasta3/help.html Example
and interpretation of results http//www.ebi.ac.u
k/2can/tutorials/nucleotide/fasta2.html
49
Do not confuse FASTA Input Format
Step 1 copy the sequence in FASTA
format/accession number and paste into input box.
50
FASTA Output
forward or reverse strand
The best scores are
opt
bits E-valueEM_HUMBC003356 BC003356.1
Homo sapiens, gluco (2279) f 11395 1819.6
0EM_HUMHSHL60A D13286.1 Homo sapiens mRNA for
(2259) f 11263 1798.7 0EM_HUMHSGCBPRC
M19285.1 Human glucocerebrosid (2275) f 11221
1792.0 0EM_HUMHSGCB01 M16328.1 Human
glucocerebrosida (2587) f 11131 1777.7
0EM_PATAX752903 AX752903.1 Sequence 13 from Pa
(2586) f 11108 1774.0 0EM_HUMHSGCBL
K02920.1 Human lysosomal glucoce (1792) f
8924 1427.6 0EM_PATI09351 I09351.1
Sequence 1 from patent (1661) f 8188
1310.8 0EM_SYBT008212 BT008212.1
Synthetic construct (1611) f 8045
1288.1 0EM_PATAX147658 AX147658.1
Sequence 7 from Pat (1611) f 7924 1268.9
0
Genes name
genes length
Database name and entry (accession number)
Thumb rules Sequences with E-value lt 0.01 are
almost always homologous. Sequences with E-value
between 1-10 may be related. Word size The thumb
rule is that the larger the word-length the less
sensitive, but faster the search will be.
51
FASTA Output - cont.
EM_HUM is the EMBL database division that the
entry is in, in this case EMBL human (EMBL
HUMAN). BC003356 is the entry name in the
database. BC003356.1 is the accession number
associated with this entry. Homo sapiens gluco
is the start of the description of the database
entry. (2279) is the number of nucleotides
present in the entry. f this stands for
forward or reverse, and represents which strand
the alignment was based from. 11395 is the
optimized score. 1819.6 is the bits score, which
is a normalized score calculated from the raw
score. 0 is the expectation value. In evaluating
the E() scores, the following rules of thumb can
be used sequences with E() less than 0.01 are
almost always found to be homologous, sequences
with E() between 1 and 10 frequently turn out to
be related as well. A value of zero is an exact
match.
52
FASTA Alignment
gtgtEM_HUMBC003356 BC003356.1 Homo sapiens,
glucosidase, (2279 nt)initn 11395 init1
11395 opt 11395 Z-score 9759.0 bits 1819.6
E() 0banded Smith-Waterman score 11395
100.000 identity (100.000 ungapped) in 2279
nt overlap (1-22791-2279)    10
20 30 40 50
60Sequen AGCTAAGGCAGGTACCTGCATCCTTGTTTTTGTTTAGTG
GATCCTCTATCCTTCAGAGAC
EM_HUM
AGCTAAGGCAGGTACCTGCATCCTTGTTTTTGTTTAGTGGATCCTCTATC
CTTCAGAGAC   10 20 30
40 50 60
Note perfect match
Explanations about FASTA
http//www.embnet.se/roadmap/internet-fasta.html
FASTA also available at http//www.dna.affrc.go.j
p/htbin/prefastadna.pl http//fasta.bioch.virginia
.edu/fasta_www/cgi/search_frm.cgi?pgmfa
http//bioweb.pasteur.fr/seqanal/interfaces/fasta
.html
53
FASTA Last Alignment
gtgtEM_HUMHSGCBA2 M22213.1 Human
glucocerebrosidase mRNA, (73 nt)initn 360
init1 360 opt 360 Z-score 327.0 bits 69.4
E() 2.6e-08banded Smith-Waterman score 360
100.000 identity (100.000 ungapped) in 72 nt
overlap (2187-22581-72)    2160 2170
2180 2190 2200 2210
Sequen GGAAGCCTTAAAGCAGCAGCGGGTGTGCCCAGGCACCCAGA
TGATTCCTATGGCACCAGC
EM_HUM
AGGCACCCAGATGATTCCTATGGC
ACCAGC
10 20 30   2220 2230
2240 2250 2260 2270
Sequen CAGGAAAAATGGCAGCTCTTAAAGGAGAAAATGTTTGAGCC
CAAAAAAAAAAAAAAAAAA
EM_HUM
CAGGAAAAATGGCAGCTCTTAAAGGAGAAAATGTTTGAGCCCC
   40 50 60
70
54
Program Fastx Compares DNA Sequences to
Protein Databases
BL50 matrix ktup 2 gap-pen -15/ -2 The best
scores are
initn init1 opt z-sc
E(750289) BC000349_1(BC000349pidnone) Homo sa (
320) f 2208 2208 2208 2607.9 1.4e-137 AX147656_1
(AX147656pidnone) Sequenc ( 516) f 2258 1269
1269 1493.1 1.8e-75 AF285236_1(AF285236pidnone)
Pan tro ( 536) f 2260 1269 1269 1492.8
1.8e-75
55
Lets Run
56
BLAST

Faster than FastA.
Similar search strategy.
Gapped Blast algorithm (1997, 3 times faster
than ungapped blast).
First, BLAST produces a list of high scoring
segment pairs - a pair of sub-sequences of the
same length without gaps - between the query and
databases.
Maximal segment pair is the segment pair with
maximal score.
A pair is reported only if its score gt
Threshold.

57
What Does BLAST Do ?

Identify short segment pairs of length W
(default 11 in blastN) that are present in both
sequences, with high scoring strings (above a
threshold value T).
The statistical significance of these high
scoring strings is evaluated to determine whether
the matches are random or not.
The segment pairs are extended without gaps,
until
the maximal possible score is reached (when
score
falls below a limit due to mismatches in the
extension, the extension is stopped).

58
http//www.ncbi.nih.gov/BLAST/
BLAST Help
http//www.ncbi.nlm.nih.gov/BLAST/blast_help.html
http//www.ncbi.nlm.nih.gov/Education/BLASTinfo/t
ut1.html
59
Blast A Family of Programs

BlastN - nt versus nt database.
BlastP - protein versus protein database.
BlastX - translated nt versus protein database.
tBlastN - protein versus translated nt database.
tBlastX - translated nt versus translated nt
database.

Query DNA Protein Database DNA Protein
60
Cont.
61
http//www.ncbi.nlm.nih.gov/BLAST/
gtsingle line description. Paste nucleotide
sequence (FASTA format) into search box.
Step 1 Choose the appropriate BLAST program for
DNA or protein sequences, paste your
sequence/accession.
62
Input Format
63
Step 2 Choose the right database.
64
Blast Parameters
default
Step 3 Submit your query and start Blast !
65
Step 4 Format to get results.
66
Program introduction (top) page
version number, date
Step 5 get results.
Default max. 50 databases searched.
http//www.ncbi.nlm.nih.gov/blast/blast_FAQs.html
67
BLAST Alignment - Results
query
All sequences or sequence fragments that match
the query are presented. Mousing over a hit will
display its definition and score. The histogram
shows the lowest (most significant, red)
E-values. Hatched areas non-similar regions.
68
BLAST Output
Score (bit score) is a value calculated from the
number of gaps and substitutions for each aligned
sequence.
Link to Entrez
Sorting by E-value
sequence identifier codes in databases
10-179
species
sequence description
69
(No Transcript)
70
Important Messages Concerning Matches

A statistically significant match is not
necessarily biologically significant, and
conversely a match may be of biological
significance without passing the statistical
test. You should therefore use
your knowledge of the biology of the
system to help interpret
these results.
Use positive and negative known controls !
Be cautious when you have a significant
hit.
Validate with independent tools !
The absence of a match in the database does not
mean that
there is no homologous sequence in the
database.
3. The presence of a similar sequence is no
guarantee that the sequences are homologous.

71
What to do When Your Search Produced NO Hits ?

Standard solutions
Check your sequence format.
Raise the Expect (E) value.
Turn-off filtering of low complexity.
Decrease word size.

72
What to do When Your Search Returns Homologs With
Weak Similarities ?

Run PSI-BLAST.
Look for sequence motifs.

What In The Next Lectures ?
Why is pairwise sequence alignment different for
proteins ?

Write a Comment

User Comments (0)

About PowerShow.com

Lecture 3 : Overview - PowerPoint PPT Presentation

Lecture 3 : Overview

OMIM: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM ... STAG at DDBJ, Japan. http://stag.genome.ad.jp/ Expasy at SIB (Swiss Institute of ... – PowerPoint PPT presentation