What is BLAST - PowerPoint PPT Presentation

Provided by: edc84
1
What is BLAST?
BLAST (Basic Local Alignment Search Tool) is a
set of similarity search programs designed to
explore all of the available sequence databases
regardless of whether the query is protein or
DNA. "Local" means that it searches for and aligns
sequence segments rather than aligning entire
sequences. It is able to detect relationships among
sequences that share only isolated regions of
similarity. Currently, it is the most popular and
most widely accepted sequence analysis tool.
2
Why BLAST?
  • Identify unknown sequences - The best way to
    identify an unknown sequence is to see if that
    sequence already exists in a public database. If
    the database sequence is a well-characterized
    sequence, then you may have access to a wealth of
    biological information.
  • Help predict gene/protein function and structure:
    genes with similar sequences tend to
    share similar functions or structures.
  • Identify protein families: group related (paralogous
    or orthologous) genes and their proteins into a
    family.
  • Prepare sequences for multiple alignments
  • And more

3
Different types of homology search
DNA vs. DNA

GCNTACACGTCACCATCTGTGCCACCACNCATGTCTCTAGTGATCCCTCATAAGTTCCAACAAAGTTTGC
GCCTACACACCGCCAGTTGTG-TTCCTGCTATGTCTCTAGTGATCCCTGAAAAGTTCCAGCGTATTTTGC

GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA----
GAATACTCAACAGCAACATCAACGGGCAGCAGAAAATAGGCTTTGCCATCACTGCCATTAAGGATGTGGG

------------------TGTTGAGGAAAGCAGACATTGACCTCACCGAGAGGGCAGGCGAGCTCAGGTA
TTGACAGTACACTCATAGTGTTGAGGAAAGCTGACGTTGACCTCACCAAGTGGGCAGGAGAACTCACTGA

GGATGAGGTGGAGCATATGATCACCATCATACAGAACTCAC-------CAAGATTCCAGACTGGTTCTTG
GGATGAGATGGAACGTGTGATGACCATTATGCAGAATCCATGCCAGTACAAGATCCCAGACTGGTTCTTG
4
Protein vs. protein
5
Translated DNA vs. protein
(or the other way around)
6
Translated DNA vs. translated DNA
7
Basic BLAST programs and databases
8
(No Transcript)
9
How Does BLAST Work?
  • Two-step procedure:
  • Compare the query sequence to every database entry.
    For each entry, if a segment of a certain
    length (the word size) is similar to part of the query
    sequence, the two have a hit.
  • Query:    GTTGACCGTTAGCCGACGTTAAGCT
  • DB entry: ACATAGCCCGTTAGCCGCTGATACGACCGTAC
  • Extend each hit in both directions until the
    alignment score drops below a threshold. The result
    is a high-scoring segment pair (HSP).
  • A Smith-Waterman-like algorithm is then used to compute a
    local alignment around each HSP.

Word size: 7
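
The seed-and-extend idea on this slide can be sketched in Python using the slide's own query and DB entry. This is only an illustration, not BLAST's actual implementation: the function names, the match/mismatch scores (+1/-3) and the drop-off threshold are assumptions chosen for the example, and the final gapped (Smith-Waterman-like) step is not shown.

```python
def find_word_hits(query, subject, word_size=7):
    """Return (query_pos, subject_pos) pairs where an exact word is shared."""
    lookup = {}
    for i in range(len(query) - word_size + 1):
        lookup.setdefault(query[i:i + word_size], []).append(i)
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in lookup.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits


def extend_ungapped(query, subject, qi, sj, word_size,
                    match=1, mismatch=-3, x_drop=5):
    """Extend a seed without gaps in both directions.

    Extension in a direction stops once the running score has fallen
    more than x_drop below the best score seen (assumed rule).
    Returns (q_start, q_end, s_start, s_end, score), ends exclusive.
    """
    best = score = word_size * match
    q_start, q_end = qi, qi + word_size

    # Extend to the right of the seed.
    i, j = qi + word_size, sj + word_size
    while i < len(query) and j < len(subject) and best - score <= x_drop:
        score += match if query[i] == subject[j] else mismatch
        i, j = i + 1, j + 1
        if score > best:
            best, q_end = score, i

    # Extend to the left, restarting from the best score so far.
    score = best
    i, j = qi - 1, sj - 1
    while i >= 0 and j >= 0 and best - score <= x_drop:
        score += match if query[i] == subject[j] else mismatch
        if score > best:
            best, q_start = score, i
        i, j = i - 1, j - 1

    offset = sj - qi
    return q_start, q_end, q_start + offset, q_end + offset, best


query = "GTTGACCGTTAGCCGACGTTAAGCT"
subject = "ACATAGCCCGTTAGCCGCTGATACGACCGTAC"
for qi, sj in find_word_hits(query, subject, word_size=7):
    q0, q1, s0, s1, score = extend_ungapped(query, subject, qi, sj, 7)
    print(qi, sj, query[q0:q1], subject[s0:s1], score)
```

With word size 7, the four overlapping seeds inside the shared segment CCGTTAGCCG all extend to the same HSP, which is why real implementations merge hits on the same diagonal.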
10
Blastn: Construct Queries
Paste your sequence here.
Specify the search region.
Choose the database.
nr: the non-redundant database. The others are subsets of
the nr database.
11
Blastn: Options
Limit results to certain organisms only.
Example: protease NOT hiv1[Organism]
Lower EXPECT thresholds are more stringent.
The smaller the word size, the higher the
sensitivity.
12
Blastn: Filters
  • Low-complexity: Some sequence segments are
    biologically uninteresting (e.g., hits against
    common acidic-, basic- or proline-rich regions).
    Such segments are identified by the SEG or DUST
    program and screened out.
  • Human repeats: This option masks human repeats
    (LINEs and SINEs) and is especially useful for
    human sequences that may contain these repeats.
    Filtering for repeats can increase the speed of a
    search, especially with very long sequences (>100
    kb) and against databases that contain a large
    number of repeats (e.g., htgs).
  • Mask for lookup table only: BLAST searches
    consist of two phases: finding hits based on a
    lookup table and then extending them. This option
    tells BLAST to apply the other filters only in
    the first phase.
  • Mask lower case: Sequence in lower case is
    screened out. This allows users to define
    customized filtering regions.
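
To make the idea of low-complexity masking concrete, here is a small Python sketch. It is not the SEG or DUST algorithm: it simply computes the Shannon entropy of each sliding window and lower-cases windows that fall below a cutoff, so the result could be fed to a "mask lower case" style option. The window length and entropy cutoff are assumptions picked for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the letter composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=1.5):
    """Lower-case every window whose composition entropy is below min_entropy."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for k in range(start, start + window):
                masked[k] = masked[k].lower()
    return "".join(masked)


# A poly-A run flanked by more complex sequence.
example = "ACGTGCATGCTA" + "A" * 12 + "TGCATCGATCGA"
print(mask_low_complexity(example))
```

Windows that only partly overlap the poly-A run can also fall below the cutoff, so a few flanking bases may be masked as well; real filters tune this behaviour much more carefully.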

13
Blastn: When to Use
  • Your query sequence is a nucleotide sequence.
    Blastn can help to:
  • Find the identity of your query sequence.
  • Find sequences similar to your query sequence.
  • Blastn returns nucleotide sequences stored in
    NCBI databases.

A variant of blastn: MegaBLAST. It is
specifically designed to efficiently (up to 10
times faster) find long alignments between very
similar sequences.
14
Interpret BLAST results - Distribution
Query sequence
BLAST hits. Click to access the pairwise
alignment.
This image shows the distribution of BLAST hits
on the query sequence. Each line represents a
hit. The span of a line represents the region
where similarity is detected. Different colors
represent different ranges of scores.
15
Interpret BLAST results - Description
16
Interpret BLAST results - Pairwise Alignment
Query line: the segment from the query sequence.
Subj line: the segment from the hit (subject) sequence.
Middle line: the consensus bases.
17
Blastp: Protein Query vs. Protein DB
Blastp is used both for identifying a query amino
acid sequence and for finding similar sequences
in protein databases. Like other BLAST programs,
blastp is designed to find local regions of
similarity. However, when sequence similarity
spans the whole sequence, blastp will report a
global alignment, which is the preferred result
for protein identification purposes. Unlike for
nucleotide BLAST, there is no MEGABLAST
equivalent for protein searches.
18
Blastp: Special Parameters
Gap costs: penalties for opening a new gap or for
extending an existing gap.
Matrix: a table of scores that are assigned to
various amino acid substitutions. In general,
different substitution matrices are tailored to
detecting similarities among sequences that have
diverged to differing degrees. The BLOSUM-62 matrix
is among the best for detecting most weak protein
similarities. For particularly long and weak
alignments, the BLOSUM-45 matrix may prove
superior. For short queries, PAM matrices may be
used instead.
19
Exercise
  • Find out how the gap cost is calculated.
  • For a gap of length k, is the cost
  • gap_exist + k * gap_ext, OR
  • gap_exist + (k-1) * gap_ext?

20
Blastp Special Parameters
For proteins, a provisional table of recommended
substitution matrices and gap costs for various
query lengths is
21
BLOSUM62 matrix
BLOSUM62 Substitution Matrix
22
Basic idea
  • Conserved regions from multiple sources are
    aligned into blocks
  • The identity level is high, so we know they
    are homologous without needing a score matrix.

23
Frequency of AA pairs
  • 37 columns; each column has 3(3-1)/2 = 3 pairs, for a
    total of 111 pairs.
  • The pair I-L occurs 3 times; L-L occurs 13 times.
  • P_IL = 3/111, P_LL = 13/111
  • Total number of amino acids: 111.
  • P_I = 2/111, P_L = 21/111
  • 2 P_I P_L < P_IL!
  • P_L P_L < P_LL!

24
Blosum
  • Score(x,y) = log_2 (p_xy / e_xy),
  • where e_xy = 2 p_x p_y for x ≠ y,
  • and e_xx = p_x p_x.
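
As a worked check of the log-odds formula, the following Python sketch plugs in the I and L counts from the previous slide (3 I-L pairs and 13 L-L pairs out of 111, with P_I = 2/111 and P_L = 21/111). The function name is made up for this example. Published BLOSUM matrices additionally scale and round these values to integers; that step is left out here.

```python
import math


def log_odds(p_pair, p_x, p_y, same_residue):
    """Score(x, y) = log2(p_xy / e_xy), with e_xy = 2 p_x p_y, e_xx = p_x^2."""
    expected = p_x * p_y if same_residue else 2 * p_x * p_y
    return math.log2(p_pair / expected)


# Observed frequencies from the example block (111 pairs, 111 residues).
p_il, p_ll = 3 / 111, 13 / 111
p_i, p_l = 2 / 111, 21 / 111

score_il = log_odds(p_il, p_i, p_l, same_residue=False)
score_ll = log_odds(p_ll, p_l, p_l, same_residue=True)
print(f"Score(I,L) = {score_il:.2f}, Score(L,L) = {score_ll:.2f}")
```

Both scores are positive because each pair is observed more often than expected by chance, which is exactly the inequality noted on the previous slide.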

25
BLOSUM 62
  • Some protein families are better studied, so
    they are overrepresented in the database.
  • To remove this bias from the statistics, those proteins
    are clustered together before the BLOSUM calculation.

26
BLOSUM 62
Weight 0.5
Weight 0.5
Weight 1
Weight 1
  • Sequences that are 62% or more identical
    are grouped together and given a total weight of 1.
  • This way, the AA pairs are counted only between groups
    that are less than 62% identical.
  • The lower this number, the better suited the
    matrix is to distant homology searches.
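
The weighting scheme on this slide can be sketched as follows. This is a simplified illustration rather than the procedure used to build the published matrices: it assumes equal-length, gap-free block rows, uses simple single-linkage grouping, and the helper names and toy sequences are invented for the example.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length block rows."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_weights(seqs, threshold=0.62):
    """Group rows that are >= threshold identical; each group has total weight 1."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if identity(seqs[i], seqs[j]) >= threshold:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(len(seqs))]
    return [1 / roots.count(r) for r in roots]


# Toy block: the first two rows are 90% identical, the others are unrelated.
block = ["ACDEFGHIKL", "ACDEFGHIKV", "MNPQRSTVWY", "LLLLLKKKKK"]
print(cluster_weights(block))
```

For this toy block the first two rows share one group and get weight 0.5 each, while the other two rows keep weight 1, mirroring the weights shown on the slide.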

27
Blastx: Nucleotide Query vs. Protein DB
Blastx is useful for finding proteins similar to
those encoded by a nucleotide query. It compares
the translation of the nucleotide query sequence
to a protein database. Because blastx translates
the query sequence in all six reading frames and
provides combined significance statistics for
hits to different frames, it is particularly
useful when the reading frame of the query
sequence is unknown or when it contains errors that
may lead to frame shifts or other coding errors.
Thus a blastx search is often the first analysis
performed on a read from a newly derived
sequence, and it is used extensively in analyzing EST
sequences.
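
What "all six reading frames" means can be shown with a short Python sketch. It assumes the standard genetic code (translation table 1) and is not how blastx itself is implemented; the function names are made up for this example, and codons containing ambiguous bases are rendered as X.

```python
from itertools import product

# Standard genetic code (translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Reverse complement; characters other than A, C, G, T are left unchanged."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Map frame number (+1..+3, -1..-3) to the translated protein string."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


for frame, protein in sorted(six_frame_translations("ATGGCCTAA").items()):
    print(f"{frame:+d}: {protein}")
```

For organisms or organelles that use a different genetic code, the AMINO_ACIDS string would have to be replaced by the one for the appropriate translation table.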
28
Blastx: Attention
  • ATTENTION
  • You have to make sure that your query sequence
    is a nucleotide coding region.
  • Blastx is not applicable to non-coding genomic DNA/RNA
    (introns, intergenic regions, tRNA, rRNA), because
    these do not encode protein.

29
Blastx: Special Parameters
Different species may use different genetic codes
to encode the same amino acid. You have to
specify the appropriate genetic code (translation
table) for your query sequence based on the
organism and source.
30
Blastx: Interpret Results
Middle line: letters = consensus (identical) amino acid
residues; + = similar amino acid residues; white
space = unmatched.
31
Tblastn: Protein Query vs. Translated DB
A tblastn search allows you to compare a protein
sequence to the six-frame translations of a
nucleotide database. It can be a very productive
way of finding homologous protein coding regions
in unannotated nucleotide sequences such as
expressed sequence tags (ESTs) and draft genome
records (HTG), located in BLAST databases est and
htgs, respectively.
32
Tblastx: Translated Nucleotide Query vs. Translated DB
tblastx takes a nucleotide query sequence,
translates it in all six frames, and compares
those translations to the database sequences
dynamically translated in all six frames. This
effectively performs a more sensitive blastp
search without doing the manual
translation. tblastx gets around the
potential frame shifts and ambiguities that may
prevent certain open reading frames from being
detected. This is very useful in identifying
potential proteins encoded by single-pass read
ESTs. In addition, it can be a good tool for
identifying novel genes.
33
Other blast programs
PSI-BLAST: Position-Specific Iterated (PSI)-BLAST
is the most sensitive BLAST program, making it
useful for finding very distantly related
proteins. Use PSI-BLAST when your standard
protein-protein BLAST search either failed to
find significant hits, or returned hits with
descriptions such as "hypothetical protein" or
"similar to..."
34
Other blast programs
BLAST 2 Sequences: "BLAST 2 Sequences" is designed
for direct comparison of two sequences. This
program takes two input sequences and compares
them directly. Please note that "BLAST 2
Sequences" regards the second sequence as the
database. If the database sequence or second
query is present in NCBI databases, using
GI/Accession instead of the FASTA sequence would
allow the program to incorporate the translation
and other sequence features, found in that
record, into the final result to make it more
informative.
35
Other blast programs
  • Search for short and near-exact matches: The normal
    parameters for standard BLAST are too stringent
    for short query sequences. Therefore, appropriate
    parameters are preset for short and near-exact
    matches.
  • For nucleotides (<20 bp): A common use is to check
    the specificity of primers used in the polymerase
    chain reaction (PCR) or in hybridization. Enter the
    query as: forward primer, NNNNNNNNNN, reverse primer. Since BLAST
    looks for local alignments and searches both
    strands, there is no need to reverse complement
    one of the primers before doing the concatenation
    or the search. Use word size 7, E value 1000, no
    filter.
  • For proteins (<10-15-mer): use matrix PAM30, E
    value 20000, word size 2, no filter.
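
The slides describe the web forms, but the same short-query settings can also be passed on the command line. The sketch below only builds the argument lists and assumes the NCBI BLAST+ tools (blastn, blastp) with the options -query, -db, -word_size, -evalue, -matrix, -dust and -seg; that toolkit is not mentioned in the slides, and the file and database names are hypothetical placeholders.

```python
def short_query_command(program, query_path, database):
    """Build a BLAST+ argument list using the short-query settings above."""
    common = [program, "-query", query_path, "-db", database]
    if program == "blastn":
        # Word size 7, E value 1000, low-complexity (DUST) filter off.
        return common + ["-word_size", "7", "-evalue", "1000", "-dust", "no"]
    if program == "blastp":
        # PAM30 matrix, word size 2, E value 20000, SEG filter off.
        return common + ["-matrix", "PAM30", "-word_size", "2",
                         "-evalue", "20000", "-seg", "no"]
    raise ValueError(f"no short-query settings defined for {program!r}")


# Hypothetical file and database names, for illustration only.
print(" ".join(short_query_command("blastn", "primers.fa", "nt")))
print(" ".join(short_query_command("blastp", "peptide.fa", "nr")))
```

If BLAST+ is installed, the returned list could be handed to subprocess.run(cmd, check=True); check the options against the help output of your installed version first.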

36
Summary - If your sequence is NUCLEOTIDE
37
Summary - If your sequence is PROTEIN
38
Raw Score, Bit Score, P-value and E-value
39
Score Matrix
  • BLOSUM62

40
Raw Score and E-value
  • VLNVWGKVEAD
  • VLKCWGPMEAD
  • raw score = S(V,V) + S(L,L) + S(N,K) + ... + S(D,D)
  • Both sequences are substrings of the query and
    the subject (database), respectively.
  • Because there is no gap, this is called an HSP:
  • High-Scoring Segment Pair.
  • Is this HSP significant?
  • Can it occur purely by chance?
  • E-value of this raw score is the number of
    expected occurrences if both query and database
    are random sequences.
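
The raw score of the example HSP can be computed by summing the substitution scores column by column. The sketch below hard-codes only the eleven scores needed for this pair; they are transcribed by hand and believed to match the standard BLOSUM62 table, so check them against the full matrix before relying on them. The names are invented for the example.

```python
# Hand-transcribed subset of BLOSUM62 (assumed values; verify before reuse).
BLOSUM62_SUBSET = {
    ("V", "V"): 4, ("L", "L"): 4, ("N", "K"): 0, ("V", "C"): -1,
    ("W", "W"): 11, ("G", "G"): 6, ("K", "P"): -1, ("V", "M"): 1,
    ("E", "E"): 5, ("A", "A"): 4, ("D", "D"): 6,
}


def pair_score(a, b, table=BLOSUM62_SUBSET):
    """Look up a symmetric substitution score."""
    if (a, b) in table:
        return table[(a, b)]
    return table[(b, a)]


def raw_score(query_segment, subject_segment):
    """Sum of substitution scores over an ungapped alignment."""
    if len(query_segment) != len(subject_segment):
        raise ValueError("ungapped segments must have equal length")
    return sum(pair_score(a, b)
               for a, b in zip(query_segment, subject_segment))


print(raw_score("VLNVWGKVEAD", "VLKCWGPMEAD"))
```

With these assumed entries the eleven columns sum to 39.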

41
How to compute E-value from raw score
  • There is a rigorous mathematical analysis behind
    this, but we only need to know the following.
  • If the query sequence has length m and the database has
    length n, then the number of
    non-overlapping HSPs with score at least x that is
    expected by chance is
  • E = K m n exp(-lambda x)
  • This makes sense:
  • Doubling the length of either sequence should
    double the number of HSPs attaining a given
    score.
  • Also, for an HSP to attain the score 2x it must
    attain the score x twice in a row, so one expects
    E to decrease exponentially with score.
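
The two sanity checks on this slide can be verified numerically. In the sketch below, the values of K and lambda are placeholders chosen only for illustration; the real values depend on the scoring system and are printed in the BLAST report. The function name and the query and database lengths are likewise assumptions.

```python
import math

# Placeholder statistical parameters (assumed; take real ones from a BLAST report).
K = 0.13
LAMBDA = 0.32


def expected_hsps(score, m, n, k=K, lam=LAMBDA):
    """E = K * m * n * exp(-lambda * score)."""
    return k * m * n * math.exp(-lam * score)


m, n, x = 250, 1_000_000, 40
base = expected_hsps(x, m, n)

# Doubling the query length doubles the expected number of HSPs.
print(expected_hsps(x, 2 * m, n) / base)

# Doubling the score shrinks E exponentially.
print(base, expected_hsps(2 * x, m, n))
```

The second comparison shows why small increases in score translate into large drops in E-value.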

42
Bit Score
  • Raw scores have little meaning without detailed
    knowledge of the scoring system used, or more
    simply its statistical parameters K and lambda.
  • The bit score is the normalized score.
  • Therefore, E-value = m n 2^(-bitscore)
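
The step from raw score to bit score can be written out explicitly. The derivation below uses the standard definition of the bit score S' in terms of the raw score S and the parameters K and lambda; the symbol S' is introduced here for convenience and does not appear on the slide.

```latex
\begin{align*}
S' &= \frac{\lambda S - \ln K}{\ln 2} \\
E  &= K\,m\,n\,e^{-\lambda S}
    = m\,n\,e^{-(\lambda S - \ln K)}
    = m\,n\,e^{-S' \ln 2}
    = m\,n\,2^{-S'}
\end{align*}
```

Because K and lambda are absorbed into S', bit scores from searches with different scoring systems can be compared directly, and each additional bit halves the E-value.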

43
Exercise
  • Retrieve horse myoglobin.
  • BLASTp
  • What do you get?
  • What is hemoglobin?
  • TBLAST
  • Find the DNA sequence corresponding to horse
    myoglobin.
  • Can you do the reverse translation without
    knowing the DNA sequence?