Introduction to Bioinformatics - PowerPoint PPT Presentation

Slides: 35
Provided by: ChiChe7
1
Introduction to Bioinformatics
  • BLAST

2
BLAST
  • Introduction
  • What is BLAST?
  • Query Sequence Formats
  • What does BLAST tell you?
  • Choices
  • Variety of BLAST
  • BLAST Programs: Which One to Use?
  • Commonly Used BLAST Programs
  • BLAST Databases: Which One to Search?
  • Understanding the Output
  • Database Search with BLAST
  • BLAST Steps: How It Works

Acknowledgement: The presentation includes
adaptations from NCBI's Introduction to Molecular
Biology Information Resources modules.
3
What is BLAST?
  • Basic Local Alignment Search Tool
  • The Google™ of bioinformatics
  • Query is a DNA or protein sequence, not a text
    term
  • Character string comparison against all the
    sequences in the target database
  • Rigorous statistics used to identify
    statistically significant matches

4
Query Sequence Formats
  • Bare sequence
  • QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
    KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
    VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
    FLFLIKHNPTNTIVYFGRYWSP
  •   1 qikdllvsss tdldttlvlv naiyfkgmwk tafnaedtre mpfhvtkqes kpvqmmcmnn
       61 sfnvatlpae kmkilelpfa sgdlsmlvll pdevsdleri ektinfeklt ewtnpntmek
      121 rrvkvylpqm kieekynlts vlmalgmtdl fipsanltgi ssaeslkisq avhgafmels
      181 edgiemagst gviedikhsp eseqfradhp flflikhnpt ntivyfgryw sp
  • Identifiers
  • accession, accession.version or gi's
  • e.g., p01013, AAA68881.1, 129295, gi|129295
  • FASTA format
  • GenBank format

5
Query Sequence in FASTA Format
  • FASTA definition line ("def line") that begins
    with a greater-than sign (>), followed by some
    text that briefly describes the query sequence
    on a single line
  • Up to 80 nucleotide bases or amino acids per line
  • Blank lines not allowed in the middle
  • Example
  • >gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
    QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
    KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
    VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
    FLFLIKHNPTNTIVYFGRYWSP
  • Additional information
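
The FASTA layout described on this slide can be read with a few lines of code. The following is a minimal sketch, not NCBI's parser: the function name `parse_fasta` is hypothetical, and the sketch assumes the def line uses "|" to separate its identifier fields, as in the example.

```python
def parse_fasta(text):
    """Return a list of (definition_line, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skipped here, although BLAST input should not contain blank lines
        if line.startswith(">"):
            # A new def line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.replace(" ", ""))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
"""

records = parse_fasta(example)
defline, sequence = records[0]
fields = defline.split("|")  # gi, 129295, sp, P01013, name and description
print(fields[1], fields[3], len(sequence))
```

Joining the sequence lines into one string makes the 80-characters-per-line limit irrelevant to later processing.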

6
What does BLAST tell you?
  • Putative identity and function of your query
    sequence
  • Helps to direct experimental design to prove the
    function
  • Find similar sequences in model organisms (e.g.,
    yeast, C. elegans, mouse), which can be used to
    further study the gene
  • Compare complete genomes against each other to
    identify similarities and differences among
    organisms

7
Variety of BLASTs
http://www.ncbi.nlm.nih.gov/BLAST/
8
BLAST Programs: Which One to Use?
  • Depends on
  • What type of query sequence you have (nucleotide
    or protein)
  • What type of database you will search against
    (nucleotide or protein)
  • BLAST program descriptions
  • brief list
  • BLAST program selection guide

9
Commonly Used BLAST Programs
  • Examples of BLAST programs
  • BLASTN
  • Nucleic acids against nucleic acids
  • BLASTP
  • Protein query against protein database
  • Usually better to use than nucleotide-nucleotide
    BLAST
  • Since the genetic code is degenerate, blastn can
    often give less specific results than blastp
  • ...but what if we don't have a protein query
    sequence? What are our options?
  • BLASTX
  • Translated nucleic acids against protein database
  • One way to do a protein BLAST search if you have
    a nucleotide query sequence
  • The BLAST program does the translating for you,
    in all 6 reading frames
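
To make "all 6 reading frames" concrete, here is a minimal sketch of a six-frame translation using the standard genetic code. It only illustrates the idea and is not how the BLAST program itself is implemented; the function names and the "+1" to "-3" frame labels are assumptions made for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Translate three frames on each strand (hypothetical frame labels)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```

Each of the six protein strings would then be searched against the protein database, which is why a translated search costs more than a plain protein search.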

10
BLAST Databases: Which One to Search?
  • What type of data do you want to search against?
    For example
  • Characterized sequences?
  • Specialized sequences?
  • Complete genomes or chromosomes?
  • BLAST database descriptions are available in the
  • BLAST help document
  • BLAST program selection guide

11
Request ID (RID)
  • An RID is like a ticket number that allows you to
    retrieve your search results and format them in
    many different ways over the next 24 hours.
  • If you've saved RIDs from your recent searches,
    you can enter the RIDs directly using the
    "Retrieve results with a Request ID" page, which
    is accessible from the bottom of the BLAST home
    page

12
Search Results: Understanding the Output
  • Reference to BLAST paper
  • Reminders about your specific query
  • RID
  • query sequence reminder (contains the information
    from your FASTA def line)
  • what database you searched against
  • Graphical summary
  • shows where the hits aligned to your query
  • colors indicate score range
  • mouse over a colored bar to see info about that
    hit
  • Text summary (GI numbers and Def lines)
  • GI links to complete record in Entrez
  • Score links to pairwise alignment between your
    query sequence and the hit
  • Pairwise alignments
  • BLAST statistics for your search

13
Database Search w/ BLAST
  • Finding similar sequences is a primary use of
    bioinformatics
  • BLAST

Enter sequence → Choose DB → Hit
Acknowledgement: Slides 15-19 are adapted from
lecture notes of Professor Chau-Wen Tseng of the CS
Department at the University of Maryland, with
permission.
14
Database Search w/ BLAST
  • Versions of BLAST
  • BLASTN
  • Nucleic acids against nucleic acids
  • BLASTP
  • Protein query against protein database
  • BLASTX
  • Translated nucleic acids against protein database
  • TBLASTN
  • Protein query against translated nucleic acid
    database
  • TBLASTX
  • Translated nucleic acids against translated
    nucleic acids
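
The five versions listed above differ only in the type of the query, the type of the database, and whether nucleotide sequences are translated. A minimal sketch of that selection rule follows; the function `choose_blast_program` and its parameter names are hypothetical and exist only for this illustration.

```python
def choose_blast_program(query_type, database_type, compare_as_protein=False):
    """Pick a BLAST version from the query and database types.

    compare_as_protein only matters for nucleotide versus nucleotide:
    it selects the translated comparison (tblastx) instead of blastn.
    """
    valid = {"nucleotide", "protein"}
    if query_type not in valid or database_type not in valid:
        raise ValueError("types must be 'nucleotide' or 'protein'")
    if query_type == "protein":
        # A protein query needs a translated database if the database is nucleotide.
        return "blastp" if database_type == "protein" else "tblastn"
    if database_type == "protein":
        # Nucleotide query against proteins: the query is translated.
        return "blastx"
    return "tblastx" if compare_as_protein else "blastn"


print(choose_blast_program("nucleotide", "protein"))
print(choose_blast_program("protein", "nucleotide"))
```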

15
Database Search w/ BLAST
16
Database Search w/ BLAST
  • BLAST graphic result

17
Database Search w/ BLAST
  • BLAST result
  • Matching sequences with bit score and E-value
  • Hyperlinks to database entry for each sequence
  • Example (columns: hyperlink to sequence, bit
    score, E-value)
  • gi|17330420|gb|BH384278.1|BH384278 ... 153 3e-36
  • gi|17320126|gb|BH373984.1|BH373984 ... 140 9e-34
  • gi|17338337|gb|BH392196.1|BH392196 ... 112 8e-25
  • gi|20373967|gb|BH771010.1|BH771010 ... 105 1e-21
  • gi|17314411|gb|BH368367.1|BH368367 ... 104 2e-21
  • gi|17332712|gb|BH386570.1|BH386570 ... 64 3e-21

18
BLAST Statistical Evaluation
  • E Value
  • The number of different alignments with scores
    equivalent to or better than the observed
    alignment score that are expected to occur in a
    database search by chance.
  • The lower the E value, the more significant the
    score.

19
BLAST How It Works
  • Find high scoring local alignments between query
    sequence and target database
  • Assumption
  • True-match alignments are very likely to contain
    very high-scoring matches within them
  • Steps
  • Seeding
  • Searching
  • Extension
  • Evaluation

20
BLAST Steps
  • Seeding
  • For each word of length w in the query (w-mer),
    generate a list of all possible words (neighbors)
    with a score of at least threshold T (determined
    by using the scoring matrix)
  • Default
  • w = 3 for protein
  • w = 11 for DNA
  • BLASTN, however, does not find neighbors. It
    uses the 11-mers from the query only.

21
Query word (w = 3)
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Neighborhood words: PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13,
PMG 13, PSG 13, PQA 12, PQN 12
Neighborhood score threshold (T = 13)
This example uses BLOSUM62.
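
The neighborhood of the query word PQG can be recomputed with a short script. This is a minimal sketch under stated assumptions: only the BLOSUM62 rows for P, Q and G are included (enough for this one word), the values are typed in from the standard matrix and should be checked against a published copy before reuse, and the function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Partial BLOSUM62: rows for P, Q and G only, columns in AMINO_ACIDS order.
BLOSUM62_ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}


def pair_score(query_residue, candidate_residue):
    """Look up the substitution score for one aligned residue pair."""
    return BLOSUM62_ROWS[query_residue][AMINO_ACIDS.index(candidate_residue)]


def neighborhood(word, threshold):
    """Enumerate every word of the same length scoring at least threshold."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(pair_score(q, c) for q, c in zip(word, candidate))
        if score >= threshold:
            hits.append(("".join(candidate), score))
    # Highest score first, ties in alphabetical order.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


for neighbor, score in neighborhood("PQG", 13):
    print(neighbor, score)
```

With T = 13 the script keeps the nine words scoring 13 or more; PQA and PQN score 12 and are dropped. Brute-force enumeration of all 8000 three-letter words is fine for w = 3 but would not scale to long words.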
22
BLOSUM 62
23
BLAST Steps
  • Searching
  • Determine the locations of all common words
    between the query and the database (word hits)
  • Identifies all word hits

24
Query word (w = 3)
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Neighborhood words: PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13,
PMG 13, PSG 13, PQA 12, PQN 12
Neighborhood score threshold (T = 13)
Hit
Query:   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
Subject: TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA
25
Implementation of BLAST Steps 1 and 2
  • (Using DNA as an example)
  • Given a query sequence q, a database sequence d,
    and the word size w, the following steps are
    needed
  • Find w-mers from q and create a table (such as
    hash table) to store them and record the position
    of their occurrences.
  • Scan d to find if a w-mer in d is also in q,
    i.e., in the table.

26
Implementation of BLAST Steps 1 and 2
Example: Suppose q = ACACAT, d = GTACACGTT, and w = 3.
Build the table: The 3-mers of q are ACA, CAC, ACA, and CAT. Their positions
are stored in the table. Positions in a sequence start at 1.
  • ACA: 1, 3
  • CAC: 2
  • CAT: 4
27
Implementation of BLAST Steps 1 and 2
Example (cont'd): Match w-mers. Use the 3-mers in d one at a time to find
their position(s) in q using the hash table (ACA: 1, 3; CAC: 2; CAT: 4).
Here is the result:
  • GTA: not found
  • TAC: not found
  • ACA: 1, 3
  • CAC: 2
  • ACG: not found
  • CGT: not found
  • GTT: not found
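
The table lookup in this example can be sketched in a few lines. The sketch below follows the slide's setup (q = ACACAT, d = GTACACGTT, w = 3, positions starting at 1) and uses a Python dictionary as the hash table; the function names are hypothetical.

```python
from collections import defaultdict


def build_wmer_table(q, w):
    """Map each w-mer of q to the list of its 1-based start positions."""
    table = defaultdict(list)
    for i in range(len(q) - w + 1):
        table[q[i:i + w]].append(i + 1)
    return dict(table)


def find_word_hits(q, d, w):
    """Scan d once and report (w-mer, position in q, position in d) triples."""
    table = build_wmer_table(q, w)
    hits = []
    for j in range(len(d) - w + 1):
        word = d[j:j + w]
        for i in table.get(word, []):
            hits.append((word, i, j + 1))
    return hits


print(build_wmer_table("ACACAT", 3))
print(find_word_hits("ACACAT", "GTACACGTT", 3))
```

Building the table takes one pass over q and matching takes one pass over d, with each lookup expected to take constant time.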
28
Implementation of BLAST Steps 1 and 2
  • Questions
  • How many w-mers are there in q?
  • How many w-mers are there in d?
  • What is the time required to build the table?
  • What is the time required to find all the
    positions?
  • An alternative is to build the table from the
    w-mers in d and then scan q to find the positions
    of the w-mers. Why is this not recommended?

29
BLAST Steps
  • Extension
  • Extend hits to find HSPs (high-scoring segment
    pairs) that have scores higher than a threshold
  • Introduce gaps using dynamic programming
  • Problem of extension
  • Time-consuming to find the highest score
  • Solution (heuristic)
  • Extend until the score drops by X below the best
    score seen so far

Example:
ABCDEFGHIJKLMNOPQRST
ABCDEFZYIJKLMXWVUTAB
1234565456789876565  ← Score
00000012100001234345 ← Drop-off score
Match = 1, Mismatch = -1, X = 5
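
A minimal sketch of the X-drop heuristic follows, using the two strings and the parameters from the example above (match 1, mismatch -1, X 5). It covers ungapped extension in one direction only; real BLAST extends in both directions from the word hit and scores with a substitution matrix. The function name `extend_right` is hypothetical.

```python
def extend_right(query, subject, match=1, mismatch=-1, x_drop=5):
    """Extend an ungapped alignment left to right with an X-drop cutoff.

    Returns (best_score, best_length): the highest running score seen and
    the number of aligned characters at which it was reached.
    """
    score = best_score = best_length = 0
    for i, (a, b) in enumerate(zip(query, subject), start=1):
        score += match if a == b else mismatch
        if score > best_score:
            best_score, best_length = score, i
        elif best_score - score >= x_drop:
            break  # the score has fallen X below the best, so stop extending
    return best_score, best_length


best_score, best_length = extend_right(
    "ABCDEFGHIJKLMNOPQRST", "ABCDEFZYIJKLMXWVUTAB"
)
print(best_score, best_length)
```

For these strings the best score is 9, reached after 13 characters, so the reported segment ends at M. Stopping early trades a guaranteed optimum for speed, which is the point of the heuristic.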
30
Query word (w = 3)
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Neighborhood words: PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13,
PMG 13, PSG 13, PQA 12, PQN 12
Neighborhood score threshold (T = 13)
Hit
Query:   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
          LA  L    TP G R    W   P  D     ER     A
Subject: TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA
31
BLAST Steps
  • Evaluation
  • Select all HSPs from step 3 having a score of at
    least S (for some parameter S). The value of S
    should be large enough that unrelated sequences
    are unlikely to have HSPs with scores of at
    least S.
  • Evaluate the statistical significance of extended
    hits
  • Report only those above the determined threshold

32
BLAST Statistical Evaluation
  • For local, ungapped alignments
  • m = size of query
  • n = size of database
  • E = expected number of HSPs with scores at least S
  • p = probability of finding at least one HSP with
    score at least S
  • good tutorial at
  • http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
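
For ungapped local alignments, the Karlin-Altschul statistics covered in the tutorial give E = K * m * n * exp(-lambda * S) and p = 1 - exp(-E). The sketch below evaluates both, plus the normalized bit score. The function names are hypothetical, and the values of K and lambda are illustrative placeholders: the real values depend on the scoring matrix and on residue frequencies.

```python
import math


def expected_hsps(score, m, n, K, lam):
    """E: expected number of HSPs with raw score at least `score`."""
    return K * m * n * math.exp(-lam * score)


def p_value(e_value):
    """p: probability of at least one such HSP, 1 - exp(-E)."""
    return -math.expm1(-e_value)


def bit_score(score, K, lam):
    """Normalized score S' such that E = m * n * 2**(-S')."""
    return (lam * score - math.log(K)) / math.log(2)


# Illustrative parameters only (assumed, not taken from the slides).
K, LAM = 0.134, 0.318
m, n = 250, 1_000_000  # query length and database size in residues

for raw in (40, 60, 100):
    e = expected_hsps(raw, m, n, K, LAM)
    print(raw, round(bit_score(raw, K, LAM), 1), f"{e:.3g}", f"{p_value(e):.3g}")
```

For small E the two quantities are nearly equal, which is why BLAST reports can quote E alone.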

33
Interpretations of Expected Value
  • Expected value ranges
  • E < 10^-100 → very low, homologs or identical
    genes
  • E < 10^-3 → moderate, may be related genes
  • E > 1 → high, probably / may be unrelated
  • 0.5 < E < 1 → ??? In the twilight zone. Try
    detailed search
  • If database search
  • Long list of gradually declining E values →
    large gene family
  • Long regions of moderate similarity → more
    significant than short regions of high identity
  • Biological relevance
  • Still need to determine biological significance!!!

34
Other BLAST Algorithms
  • BLAST2
  • Gapped-BLAST
  • PSI-BLAST
  • BLAT
  • Mega BLAST
  • etc.