1
the Blat Rap
the ins and outs of going blat
  • by Jim Kent

2
Blat Rap Outline
  • What blat is good at.
  • How it compares to other tools.
  • How it works.
  • How best to use it.
  • When it's free and when it's not.

3
The Need for Speed
  • 10 billion bases of genome, more to come.
  • 15 million ESTs publicly available.
  • BLAST queues at NCBI are 15 minutes.
  • What works well for searching a 10 million amino
    acid database does not work well for searching a
    20 billion nucleotide database.

4
BLAT is Good At
  • Aligning mRNA and ESTs to the genome.
      • Extremely fast and accurate.
      • Aligns whole mRNA, not just separate exons.
      • Handles introns and splice sites.
      • Easy-to-parse native output format. Also outputs
        BLAST-compatible format.
  • Translated alignments between species.
      • Still quite fast.
      • Sensitive enough for human/mouse and other
        vertebrate/vertebrate comparisons.

5
BLAT is not the best tool for
  • Untranslated DNA alignments between species
    more distant than monkey/human.
      • We use blastz for mouse/human DNA alignments.
  • Protein/protein alignments generally.
      • The protein databases are small enough that blastp
        is still an excellent choice for fast searches.
      • We use SAM and PSI-BLAST for detecting remote
        protein homologies.

6
Other Good Uses of BLAT
  • Aligning reads against the genome to find SNPs
    and other polymorphisms.
  • Clustering together redundant protein, mRNA or
    EST records from Genbank.
  • Mapping annotations from one version of the human
    genome to another version of the human genome.
  • Looking for recent duplication in the human
    genome that may cause cross-hybridization
    problems in microarrays and PCR experiments.

7
Blat Rap Outline
  • What blat is good at.
  • How it compares to other tools.
  • How it works.
  • How best to use it.
  • When it's free and when it's not.

8
mRNA/Genome Alignments
Remapping Sanger Chromosome 22 mRNAs to genome
and comparing to Sanger annotated exon/intron
structure.
9
Translated Mouse/Human Alignments
WU-TBLASTX and BLAT in translated mode (-t=dnax
-q=prot) aligning mouse 3x coverage whole genome
shotgun reads vs. Human Chromosome 22.
WU-TBLASTX was run with word-size of 5.
10
Web Search Usp18 vs Genome
11
Blat Rap Outline
  • What blat is good at.
  • How it compares to other tools.
  • How it works.
  • How best to use it.
  • When it's free and when it's not.

12
The Alignment Problem
  • Figuring out how to position two strings, allowing
    some insertions, so as to maximize the number of
    positions where the two strings agree.

aca--gacacactattatg-g-gc-caga-ac-cac
acacacagacacact-tt
atgtgtgctcacacacacacgct
mRNA vs. genomic alignments are an important
special case
13
Introns in Alignments
ATGGAGGGGCAGAGCGGCCGCTGCAAGATCGTGGTGGTGGGAGACGCAGA
GTGCGGCAAGACGGCGCTGCTGCAGGTGTTCGCCAAGGACGCCTATCCCG
GGgtgagggacctgcgtcttgggagggggacgctaaggctgctggggggt
gggtgacaggggccctggcgacggatgggaatgggtactcgggtaaccag
ggacaagagacagggggtcggaggacgcggggaggccttgagggctcagg
aaggactgcagaggattggggtgggaggaattagggagcagggtgagata
gatggggtttgggagaaccagagcatccgggagggagggcgaggggaatg
tcggaggtcctgggcaatggagaggggaagaactagggggctgaagggac
cagaagggaacaggaggaggtcttggagcttagcagagattctccggggg
ggggggggggggggcaggagctcccgggatctcccctttgcccaatccca
gaccaacttgtgtccaggggctgggctggacggggtgtgggagtgaggag
ggcatttatctggggtgaggacttggagagatgatctcatctggatccat
ccgtgtctgcagAGTTATGTCCCCACCGTGTTTGAGAACTACACTGCGAG
CTTTGAGATCGACAAGCGCCGCATTGAGCTCAACATGTGGGACACTTCAG
14
Perfect Matches Serve as Seeds
  • Computers can look for exact matches very
    quickly.
  • Finding inexact matches is slow.
  • Inexact matches should contain some short exact
    matches.
  • Inexact matches should contain multiple even
    shorter exact matches.

15
ggagaatagggcatgctctgaggtctgctggaacccatcc
1  12 123456 12 12 12345   1 12  12345 1
gtggattagggcttgttccgaggttatcgggttcccatac

ttcttgtctcgctccagggcaccgtgcaggaaatcccggg
2345 1 123 1 12 1234567  1234567  1234
ttctgggctctcgcccgggcacctagcaggaatacccgat

acacctcctcattctcatccagccactggatgacgaaggg
  123  123456  123 123  1  12345 12345
ggacccgctcattaccatacagtaaacggatggcgaagac

Distribution of identical matches
length  1  2  3  4  5  6  7  8
number  5  5  4  1  5  2  2  0
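
The counts on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: it assumes the three 40-base rows are one continuous pair of sequences, so a run of matches may continue across a row break. The function name match_run_lengths is made up for this example.

```python
from collections import Counter

# The three 40-base rows from the slide, joined into one pair of sequences
# (assumption: the rows are consecutive pieces of the same alignment).
SEQ_A = (
    "ggagaatagggcatgctctgaggtctgctggaacccatcc"
    "ttcttgtctcgctccagggcaccgtgcaggaaatcccggg"
    "acacctcctcattctcatccagccactggatgacgaaggg"
)
SEQ_B = (
    "gtggattagggcttgttccgaggttatcgggttcccatac"
    "ttctgggctctcgcccgggcacctagcaggaatacccgat"
    "ggacccgctcattaccatacagtaaacggatggcgaagac"
)


def match_run_lengths(a, b):
    """Return the lengths of maximal runs of identical aligned bases."""
    runs = []
    current = 0
    for x, y in zip(a, b):
        if x == y:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


distribution = Counter(match_run_lengths(SEQ_A, SEQ_B))
for length in range(1, 9):
    print(length, distribution[length])
```

Short exact runs are common even in a fairly diverged alignment, which is the property the seeding step relies on.
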
16
Steps in Fast cDNA Alignments
  • 1. Break cDNA into 500 base chunks.
  • 2. Use an index to find regions in genome similar
    to each chunk of cDNA.
  • 3. Do a detailed alignment between genomic
    regions and cDNA chunk.
  • 4. Use dynamic programming to stitch together
    detailed alignments of chunks into detailed
    alignment of whole.
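
Step 1 can be sketched as follows. The slide does not say whether chunks overlap, so this sketch assumes plain non-overlapping chunks; chunk_cdna is a hypothetical name. Each chunk keeps its offset so that chunk alignments can later be mapped back to whole-cDNA coordinates for the stitching step.

```python
def chunk_cdna(cdna, chunk_size=500):
    """Yield (offset, chunk) pairs that cover the cDNA from left to right."""
    for offset in range(0, len(cdna), chunk_size):
        yield offset, cdna[offset:offset + chunk_size]


# A 1200-base query yields chunks starting at offsets 0, 500 and 1000.
for offset, chunk in chunk_cdna("a" * 1200):
    print(offset, len(chunk))
```
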

17
Indexing
  • Within an exon there should be some K-mers that
    align perfectly.
  • Build an index which contains positions of each
    K-mer in genome. K is typically between 8 and
    13.
  • Step through each K-mer in cDNA chunk and look it
    up in index.
  • Get list of hits - positions in cDNA and in
    genome that match for K bases.

18
Genome:  cacaattatcacgaccgc
3-mers:  cac aat tat cac gac cgc
Index:   aat 3      gac 12
         cac 0,9    tat 6
         cgc 15
cDNA:    aattctcac
3-mers:  aat att ttc tct ctc tca cac
          0   1   2   3   4   5   6
hits:    aat 0,3  -3
         cac 6,0   6
         cac 6,9  -3
clump:   cacAATtatCACgaccgc
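
The worked example on this slide can be reproduced with a small Python sketch. Some details are assumptions read off the slide and not a statement of how blat is implemented: the genome is indexed with non-overlapping K-mers, the cDNA is scanned with overlapping K-mers, and hits are clumped when they share the same diagonal (cDNA position minus genome position). The names build_index and find_hits are hypothetical.

```python
from collections import defaultdict


def build_index(genome, k):
    """Map each non-overlapping k-mer of the genome to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index


def find_hits(cdna, index, k):
    """Look up every overlapping k-mer of the cDNA in the index.

    Returns (cdna_pos, genome_pos, diagonal) triples.
    """
    hits = []
    for c_pos in range(len(cdna) - k + 1):
        for g_pos in index.get(cdna[c_pos:c_pos + k], []):
            hits.append((c_pos, g_pos, c_pos - g_pos))
    return hits


genome = "cacaattatcacgaccgc"
cdna = "aattctcac"
index = build_index(genome, 3)
hits = find_hits(cdna, index, 3)

# Group hits by diagonal; hits sharing a diagonal form a candidate clump.
clumps = defaultdict(list)
for c_pos, g_pos, diagonal in hits:
    clumps[diagonal].append((c_pos, g_pos))

print(hits)
print(dict(clumps))
```

Only the two hits on diagonal -3 (aat and the second cac) line up, which corresponds to the clump shown on the slide.
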
19
Detailed Alignments
  • Smith-Waterman technique based on dynamic
    programming.
  • 'Banded' Smith-Waterman: faster, but doesn't
    tolerate long inserts.
  • Recursive seed and extend: faster yet, handles
    large gaps, but only works on very similar
    sequences.
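
For reference, a minimal Smith-Waterman scorer looks roughly like this. It is a textbook sketch, not blat's code: the scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration, and it returns only the best local score, not the alignment itself. A banded variant would fill only the cells near the main diagonal, which is why it is faster but cannot span long inserts.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best local alignment score of a and b (linear gap cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


print(smith_waterman_score("acgtacgt", "acgttacgt"))
```

The table has len(a) x len(b) cells, so the cost grows with the product of the two lengths. That is acceptable for a 500-base chunk against a small genomic region, but not for a whole genome.
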

20
Recursive Seed and Extend
  • Find perfect matches that are too long to occur
    reasonably by chance in a region.
  • Extend through short mismatches.
  • Extend through short gaps.
  • Existing matches divide sequence into regions.
  • Recurse to align unaligned regions at reduced
    stringency.
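
The recursion can be sketched in Python as below. This is a simplified illustration under stated assumptions, not blat's algorithm: the seed is taken to be the longest exact match in the region, "reduced stringency" is modeled as lowering the minimum seed length by one per level down to a floor, and the extension through short mismatches and gaps is left out. All names are hypothetical.

```python
def longest_exact_match(a, b):
    """Return (a_start, b_start, length) of the longest common substring."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def seed_and_recurse(a, b, min_len, floor=3, a_off=0, b_off=0):
    """Return collinear exact-match blocks as (a_start, b_start, length)."""
    if min_len < floor or not a or not b:
        return []
    a_start, b_start, length = longest_exact_match(a, b)
    if length < min_len:
        return []
    # The seed divides both sequences into a left and a right region;
    # each region is searched again with a lower length threshold.
    left = seed_and_recurse(a[:a_start], b[:b_start],
                            min_len - 1, floor, a_off, b_off)
    right = seed_and_recurse(a[a_start + length:], b[b_start + length:],
                             min_len - 1, floor,
                             a_off + a_start + length,
                             b_off + b_start + length)
    return left + [(a_off + a_start, b_off + b_start, length)] + right


# Sequences adapted from the slide that follows (gap characters removed).
query = "acatacgxxxxcatgxxxxxcctgattaccggxxcctgaa"
target = "yacatacgyyyycatgyyyyyccagattaaccggyyycctgaayyy"
blocks = seed_and_recurse(query, target, min_len=6)
for q_start, t_start, length in blocks:
    print(q_start, t_start, query[q_start:q_start + length])
```

Because each seed restricts the later searches to the regions on either side of it, the resulting blocks are always in the same order in both sequences.
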

21
acataxxxxxxxxxxxxxxxxxxgatta xxxxxx cctgax
yacatayyyyyyyyyyyyyyyyyygattayyyyyyyycctgayyyy

acatacgxxxxxxxxxxxxxxxxgatta xxxxxx cctgaa
yacatacgyyyyyyyyyyyyyyyygattayyyyyyyycctgaayyy

acatacgxxxxxxxxxxxxxcctgatta-ccggxx cctgaa
yacatacgyyyyyyyyyyyyyccagattaaccggyyycctgaayyy

acatacgxxxxcatgxxxxxcctgatta-ccggxx cctgaa
yacatacgyyyycatgyyyyyccagattaaccggyyycctgaayyy

acatacg catg cctgatta-ccgg cctgaa
yacatacg catg ccagattaaccgg cctgaa
22
Stitching Together Alignments
23
Repeats Complicate Things
24
Solution: Dynamic Programming
  • Define a block of alignment as a region with no
    insertions or deletions.
  • Each block can be represented by 4 coordinates:
    cStart, cEnd, gStart, gEnd.
  • Each block has a score: match - mismatch.
  • Each gap between blocks has score
    -log(gSize) - cSize.
  • Pick the maximal-scoring set of blocks where one
    block must follow another in both c and g.
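
A minimal sketch of this chaining step in Python follows. The block representation and gap score are taken from the slide, but several details are assumptions made for the example: the logarithm is taken as the natural log, a genomic gap of size 0 is scored as if it were 1 to avoid log(0), and the search is a simple quadratic dynamic program. The names Block, gap_score and best_chain are hypothetical.

```python
import math
from typing import NamedTuple


class Block(NamedTuple):
    c_start: int   # start in cDNA
    c_end: int     # end in cDNA (exclusive)
    g_start: int   # start in genome
    g_end: int     # end in genome (exclusive)
    score: float   # matches minus mismatches


def gap_score(prev, nxt):
    """Score of the gap between two consecutive blocks: -log(gSize) - cSize."""
    g_size = nxt.g_start - prev.g_end
    c_size = nxt.c_start - prev.c_end
    return -math.log(max(g_size, 1)) - c_size


def best_chain(blocks):
    """Return (score, chain) for the best-scoring collinear set of blocks."""
    if not blocks:
        return 0.0, []
    blocks = sorted(blocks, key=lambda b: (b.c_start, b.g_start))
    best = [b.score for b in blocks]   # best chain score ending at each block
    back = [-1] * len(blocks)          # predecessor in that chain
    for j, nxt in enumerate(blocks):
        for i in range(j):
            prev = blocks[i]
            # nxt must follow prev in both cDNA and genome coordinates.
            if prev.c_end <= nxt.c_start and prev.g_end <= nxt.g_start:
                candidate = best[i] + gap_score(prev, nxt) + nxt.score
                if candidate > best[j]:
                    best[j] = candidate
                    back[j] = i
    end = max(range(len(blocks)), key=best.__getitem__)
    chain = []
    k = end
    while k != -1:
        chain.append(blocks[k])
        k = back[k]
    chain.reverse()
    return best[end], chain


exon1 = Block(0, 100, 1000, 1100, 100)
exon2 = Block(100, 200, 5000, 5100, 100)
repeat_hit = Block(100, 150, 500, 550, 50)   # spurious hit from a repeat
score, chain = best_chain([repeat_hit, exon2, exon1])
print(score, chain)
```

The logarithmic genomic gap cost makes a long intron cheap, while the linear cDNA gap cost penalizes skipping query bases. The repeat hit drops out because it cannot follow exon1 in genome order.
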

25
(No Transcript)
26
Blat Rap Outline
  • What blat is good at.
  • How it compares to other tools.
  • How it works.
  • How best to use it.
  • When it's free and when it's not.

27
Standalone vs. Client/Server
  • Standalone - best for batch queries. Executable
    is called blat.
  • Client/Server - best for interactive queries.
    Executables are called gfClient/gfServer.

28
Standalone
  • Advantages
      • 2x as fast.
      • More sensitive for protein/translated searches.
      • Runs well on computer clusters.
      • Runs effectively in 256 meg of RAM.
  • Disadvantages
      • Can't process the complete genome at once unless
        you have 8 gig of RAM.
      • Must combine and sort results of multiple runs.
      • Must wait for the index to be built before the
        first query is processed.

29
Client/Server
  • Advantages
      • Can process the entire genome in 1.2 Gb of RAM.
      • Can process the translated genome in 2.5 Gb of RAM.
      • Index is prebuilt. Response to the first query is
        typically less than 2 seconds.
      • No need to sort results.
  • Disadvantages
      • Ties up lots of memory in the server machine.
      • Half as fast as standalone.

30
Standalone mRNA/DNA Searches
  • blat target query output -ooc=11.ooc
  • Target (aka database) can be a fasta file, a nib
    file, or a text file containing a list of fasta
    and nib files. Target is typically a chromosome.
  • For nucleotide queries there is no need to mask.
  • Query can be a fasta file or a list of fasta files.
    Typically the query is a large batch of mRNA or EST
    sequences.
  • Output by default is in a tab-separated format.
    The -out=blast option and other output options
    were added recently.
  • -ooc=11.ooc tells blat which 11-mers occur too
    often to be useful. It greatly increases
    blat's speed.
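
The tab-separated output is straightforward to read from a script. The sketch below assumes the 21-column PSL layout that UCSC documents for blat's default output; the slides themselves only say "tab-separated", so treat the column names as an assumption. The sample line is invented for illustration, and parse_psl_line and read_psl are hypothetical names.

```python
PSL_FIELDS = [
    "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",
    "strand", "qName", "qSize", "qStart", "qEnd",
    "tName", "tSize", "tStart", "tEnd",
    "blockCount", "blockSizes", "qStarts", "tStarts",
]
TEXT_FIELDS = {"strand", "qName", "tName"}
LIST_FIELDS = {"blockSizes", "qStarts", "tStarts"}


def parse_psl_line(line):
    """Turn one tab-separated alignment line into a dict of typed fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(PSL_FIELDS):
        raise ValueError("expected %d columns, got %d"
                         % (len(PSL_FIELDS), len(columns)))
    record = {}
    for name, value in zip(PSL_FIELDS, columns):
        if name in TEXT_FIELDS:
            record[name] = value
        elif name in LIST_FIELDS:
            # Comma-separated lists end with a trailing comma.
            record[name] = [int(x) for x in value.split(",") if x]
        else:
            record[name] = int(value)
    return record


def read_psl(lines):
    """Yield records, skipping header lines (which do not start with a digit)."""
    for line in lines:
        if line[:1].isdigit():
            yield parse_psl_line(line)


# Invented example: a 100-base mRNA aligned in two blocks to chr22.
sample = ("95\t5\t0\t0\t0\t0\t1\t900\t+\tmRNA1\t100\t0\t100\t"
          "chr22\t50000\t1000\t2000\t2\t60,40,\t0,60,\t1000,1960,\n")
records = list(read_psl(["psLayout version 3\n", "\n", sample]))
print(records[0]["tName"], records[0]["blockSizes"])
```

When a genome is searched one chromosome at a time, the records from the separate runs can be concatenated and sorted by qName to bring all alignments for a query together.
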

31
Standalone translated searches
  • blat target query output -t=dnax -q=prot
      • Aligns proteins vs. translated genome.
  • blat target query output -t=dnax -q=rnax
      • Aligns translated RNA vs. translated genome.
  • blat target query output -t=dnax -q=dnax
      • Aligns translated genome vs. translated genome
        (best to chop query into 4 kb or smaller pieces).
  • For translated searches it's best to use masked
    target DNA.

32
Client/Server Setup
  • Convert each chromosome into its own .nib file
    with faToNib.
  • Start up gfServer on a machine with enough
    memory. It will take 10 minutes to build an
    index.
  • Run gfClient, telling it query sequence, machine
    and port number that server is on.
  • Parse gfClient output into your own interactive
    systems.

33
Translated Client/Server
  • Mask chromosomes before converting to nib.
  • Index will take 30 minutes to generate and
    require 2.5 gig
  • For human genome both nucleotide and translated
    servers fit on one Linux box with 4 Gb of RAM.
  • In general one gfServer can support about 8
    gfClients. (I put as much of the work as
    possible on the client side.)

34
short matches
  • To find perfect 21-mers:
      • -minMatch=1
      • -minScore=21
      • -minIdentity=100
  • For perfect 19-mers:
      • -tileSize=10
      • -minMatch=1
      • -minScore=19
      • -minIdentity=100
  • For 21-mers with one mismatch:
      • -minMatch=1
      • -oneOff
      • -minScore=21

35
For Nucleotide Extra Sensitivity
  • -tileSize=10
  • -minIdentity=0
  • -minScore=0
  • Add 5 Ns at start of target and rerun
  • Try blastz

36
Fast Near Perfect Long Matches
  • Use -tileSize=12, -ooc=12.ooc
  • Try -fastMap
  • -minIdentity=98
  • -minScore=100

37
Blat Rap Outline
  • What blat is good at.
  • How it compares to other tools.
  • How it works.
  • How best to use it.
  • When it's free and when it's not.

38
BLAT is Free For
  • Non-profit organizations
  • Students and educational institutions
  • For interactive use on the web.
  • Limited program driven use on the web (less than
    2 hits/minute, less than 1,000 hits/day).
  • For the first 30 days after downloading.

39
Commercial Licenses
Jim Kent jim_kent_at_pacbell.net
Heidi Brumbaugh heidi_b_at_pacbell.net
40
THE END