Title: the Blat Rap
1the Blat Rap
the ins and outs of going blat
2Blat Rap Outline
- What blat is good at.
- How it compares to other tools.
- How it works.
- How best to use it.
- When it's free and when it's not.
3The Need for Speed
- 10 billion bases of genome, more to come.
- 15 million ESTs publicly available.
- BLAST queues at NCBI are 15 minutes.
- What works well for searching a 10 million amino
acid database does not work well for searching a
20 billion nucleotide database.
4BLAT is Good At
- Aligning mRNA and ESTs to the genome.
- Extremely fast and accurate
- Aligns whole mRNA, not just separate exons
- Handles introns and splice sites
- Easy to parse native output format. Also outputs
blast compatible format.
- Translated alignments between species
- Still quite fast.
- Sensitive enough for human/mouse and other
vertebrate/vertebrate comparisons.
5BLAT is not the best tool for
- Untranslated DNA alignments between species
further than monkey/human.
- We use blastz for mouse/human DNA alignments.
- Protein/Protein alignments generally
- The protein databases are small enough that blastp is
still an excellent choice for fast searches.
- We use SAM and PSIBLAST for detecting remote
protein homologies.
6Other Good Uses of BLAT
- Aligning reads against the genome to find SNPs
and other polymorphisms.
- Clustering together redundant protein, mRNA or
EST records from Genbank.
- Mapping annotations from one version of the human
genome to another version of the human genome.
- Looking for recent duplication in the human
genome that may cause cross-hybridization
problems in microarrays and PCR experiments.
7Blat Rap Outline
- What blat is good at.
- How it compares to other tools.
- How it works.
- How best to use it.
- When it's free and when it's not.
8mRNA/Genome Alignments
Remapping Sanger Chromosome 22 mRNAs to genome
and comparing to Sanger annotated exon/intron
structure.
9Translated Mouse/Human Alignments
WU-TBLASTX and BLAT in translated mode (-t=dnax
-q=prot) aligning mouse 3x coverage whole genome
shotgun reads vs. Human Chromosome 22.
WU-TBLASTX was run with word-size of 5.
10Web Search Usp18 vs Genome
11Blat Rap Outline
- What blat is good at.
- How it compares to other tools.
- How it works.
- How best to use it.
- When it's free and when it's not.
12The Alignment Problem
- Figuring out how to position two strings, allowing
some insertions, so as to maximize the number of
positions where the two strings agree.
aca--gacacactattatg-g-gc-caga-ac-cac
acacacagacacact-tt
atgtgtgctcacacacacacgct
mRNA vs. genomic alignments are an important
special case
13Introns in Alignments
ATGGAGGGGCAGAGCGGCCGCTGCAAGATCGTGGTGGTGGGAGACGCAGA
GTGCGGCAAGACGGCGCTGCTGCAGGTGTTCGCCAAGGACGCCTATCCC
G GGgtgagggacctgcgtcttgggagggggacgctaaggctgctggggg
gt gggtgacaggggccctggcgacggatgggaatgggtactcgggtaac
cag ggacaagagacagggggtcggaggacgcggggaggccttgagggct
cagg aaggactgcagaggattggggtgggaggaattagggagcagggtg
agata gatggggtttgggagaaccagagcatccgggagggagggcgagg
ggaatg tcggaggtcctgggcaatggagaggggaagaactagggggctg
aagggac cagaagggaacaggaggaggtcttggagcttagcagagattc
tccggggg ggggggggggggggcaggagctcccgggatctcccctttgc
ccaatccca gaccaacttgtgtccaggggctgggctggacggggtgtgg
gagtgaggag ggcatttatctggggtgaggacttggagagatgatctca
tctggatccat ccgtgtctgcagAGTTATGTCCCCACCGTGTTTGAGAA
CTACACTGCGAG CTTTGAGATCGACAAGCGCCGCATTGAGCTCAACATG
TGGGACACTTCAG
14Perfect Matches Serve as Seeds
- Computers can look for exact matches very
quickly.
- Finding inexact matches is slow.
- Inexact matches should contain some short exact
matches.
- Inexact matches should contain multiple even
shorter exact matches.
15
(Three pairs of aligned 40-base sequences; on the slide, each run of identical bases is numbered between the two sequences.)
ggagaatagggcatgctctgaggtctgctggaacccatcc
gtggattagggcttgttccgaggttatcgggttcccatac

ttcttgtctcgctccagggcaccgtgcaggaaatcccggg
ttctgggctctcgcccgggcacctagcaggaatacccgat

acacctcctcattctcatccagccactggatgacgaaggg
ggacccgctcattaccatacagtaaacggatggcgaagac

Distribution of identical matches
length  1 2 3 4 5 6 7 8
number  5 5 4 1 5 2 2 0
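The idea that inexact matches contain short exact matches can be checked directly by measuring runs of identical bases between two aligned sequences. The sketch below does that for the first sequence pair on this slide; the function name is illustrative, and it only tallies one pair rather than the whole table.

```python
from collections import Counter

def exact_match_runs(a, b):
    """Return the lengths of consecutive runs of identical bases in two aligned strings."""
    runs, current = [], 0
    for x, y in zip(a, b):
        if x == y:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

top = "ggagaatagggcatgctctgaggtctgctggaacccatcc"
bottom = "gtggattagggcttgttccgaggttatcgggttcccatac"
runs = exact_match_runs(top, bottom)
histogram = Counter(runs)  # run length -> number of runs of that length
```

Even in this fairly diverged pair, two runs of 5 and one of 6 survive, which is what makes short exact seeds usable.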
16Steps in Fast cDNA Alignments
- 1. Break cDNA into 500 base chunks.
- 2. Use an index to find regions in genome similar
to each chunk of cDNA.
- 3. Do a detailed alignment between genomic
regions and cDNA chunk.
- 4. Use dynamic programming to stitch together
detailed alignments of chunks into detailed
alignment of whole.
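Step 1 is simple enough to sketch. This assumes consecutive, non-overlapping chunks and keeps each chunk's offset so later steps can map coordinates back to the whole cDNA; the function name is illustrative, and blat's own chunking may differ in detail.

```python
def chunk_cdna(cdna, size=500):
    """Yield (offset, chunk) pairs covering the cDNA in consecutive pieces."""
    for offset in range(0, len(cdna), size):
        yield offset, cdna[offset:offset + size]

# A 1200-base toy cDNA splits into chunks of 500, 500 and 200 bases.
pieces = [(offset, len(chunk)) for offset, chunk in chunk_cdna("a" * 1200)]
```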
17Indexing
- Within an exon there should be some K-mers that
align perfectly.
- Build an index which contains positions of each
K-mer in genome. K is typically between 8 and
13.
- Step through each K-mer in cDNA chunk and look it
up in index.
- Get list of hits: positions in cDNA and in
genome that match for K bases.
18
Genome: cacaattatcacgaccgc
3-mers: cac aat tat cac gac cgc
Index:  aat 3      gac 12
        cac 0,9    tat 6
        cgc 15
cDNA:   aattctcac
3-mers: aat att ttc tct ctc tca cac
        0   1   2   3   4   5   6
hits:   aat 0,3   -3
        cac 6,0    6
        cac 6,9   -3
clump:  cacAATtatCACgaccgc
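The toy example on this slide can be reproduced in a few lines. This sketch indexes non-overlapping 3-mers of the genome, looks up every overlapping 3-mer of the cDNA, and groups hits by the difference between cDNA and genome position. Grouping by exact diagonal is a simplification chosen for illustration, and the function names are not blat's.

```python
from collections import defaultdict

def build_index(genome, k):
    """Map each non-overlapping k-mer of the genome to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index

def find_hits(cdna, index, k):
    """Look up every overlapping k-mer of the cDNA.

    Returns (cdna_pos, genome_pos, cdna_pos - genome_pos) for each hit.
    """
    hits = []
    for q in range(len(cdna) - k + 1):
        for t in index.get(cdna[q:q + k], []):
            hits.append((q, t, q - t))
    return hits

genome = "cacaattatcacgaccgc"
cdna = "aattctcac"
index = build_index(genome, 3)
hits = find_hits(cdna, index, 3)

# Hits sharing a diagonal are candidates for the same clump.
by_diagonal = defaultdict(list)
for q, t, diagonal in hits:
    by_diagonal[diagonal].append((q, t))
```

The two hits on diagonal -3 (aat at genome position 3 and cac at genome position 9) form the clump shown on the slide; the lone hit on diagonal 6 does not.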
19Detailed Alignments
- Smith-Waterman technique based on dynamic
programming.
- Banded Smith-Waterman: faster, but doesn't
tolerate long inserts.
- Recursive seed and extend: faster yet, handles
large gaps, but only works on very similar
sequences.
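A minimal score-only sketch of Smith-Waterman, with an optional band, shows why the banded form cannot follow a long insert: cells more than `band` off the main diagonal are never filled. The scoring values and the `band` parameter are assumptions made for illustration, not blat's settings.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2, band=None):
    """Best local alignment score between a and b (no traceback).

    If band is given, only cells with |i - j| <= band are computed.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if band is not None and abs(i - j) > band:
                continue  # outside the band: cell stays 0
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

full = smith_waterman_score("ttacgt", "acgt")            # match lies 2 off the diagonal
narrow = smith_waterman_score("ttacgt", "acgt", band=1)  # band too narrow to see it
wide = smith_waterman_score("ttacgt", "acgt", band=2)
```

The full matrix costs time proportional to the product of the two lengths; the band cuts that to roughly length times band width, at the price of missing anything shifted further than the band.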
20Recursive Seed and Extend
- Find perfect matches that are too long to occur
reasonably by chance in a region.
- Extend through short mismatches.
- Extend through short gaps.
- Existing matches divide sequence into regions.
- Recurse to align unaligned regions at reduced
stringency.
21
 acataxxxxxxxxxxxxxxxxxxgatta xxxxxx cctgax
yacatayyyyyyyyyyyyyyyyyygattayyyyyyyycctgayyyy

 acatacgxxxxxxxxxxxxxxxxgatta xxxxxx cctgaa
yacatacgyyyyyyyyyyyyyyyygattayyyyyyyycctgaayyy

 acatacgxxxxxxxxxxxxxcctgatta-ccggxx cctgaa
yacatacgyyyyyyyyyyyyyccagattaaccggyyycctgaayyy

 acatacgxxxxcatgxxxxxcctgatta-ccggxx cctgaa
yacatacgyyyycatgyyyyyccagattaaccggyyycctgaayyy

 acatacg    catg     cctgatta-ccgg   cctgaa
yacatacg    catg     ccagattaaccgg   cctgaa
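The recursion can be sketched as follows. This is a simplified illustration, not blat's implementation: it uses the longest exact match in a region as the seed, skips the extension through mismatches and short gaps, and models "reduced stringency" as lowering the minimum seed length by one per level down to a floor. All names and parameters are hypothetical.

```python
def longest_common_substring(a, b):
    """Return (length, start_in_a, start_in_b) of the longest exact match."""
    best_len, best_a, best_b = 0, 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_a, best_b = cur[j], i - cur[j], j - cur[j]
        prev = cur
    return best_len, best_a, best_b

def seed_and_recurse(a, b, min_len, floor=2, a_off=0, b_off=0):
    """Return exact blocks (a_start, b_start, length) in left-to-right order."""
    length, sa, sb = longest_common_substring(a, b)
    if length == 0 or length < min_len:
        return []
    relaxed = max(floor, min_len - 1)  # smaller regions tolerate shorter seeds
    left = seed_and_recurse(a[:sa], b[:sb], relaxed, floor, a_off, b_off)
    right = seed_and_recurse(a[sa + length:], b[sb + length:], relaxed, floor,
                             a_off + sa + length, b_off + sb + length)
    return left + [(a_off + sa, b_off + sb, length)] + right

blocks = seed_and_recurse("aaaaccccggggtt", "aaaatccccaggggaa", min_len=4)
```

Each seed splits both sequences into a left and a right region, so the blocks come out collinear by construction.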
22Stitching Together Alignments
23Repeats Complicate Things
24Solution: Dynamic Programming
- Define a block of alignment as a region with no
insertions or deletions.
- Each block can be represented by 4 coordinates:
cStart, cEnd, gStart, gEnd.
- Each block has a score: matches - mismatches.
- Each gap between blocks has a score:
-log(gSize) - cSize.
- Pick the maximal scoring set of blocks where one
block must follow another in both cDNA (c) and
genome (g) coordinates.
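A minimal sketch of the block-chaining step. It reads the gap score as -log(gSize) - cSize, assuming gSize and cSize are the distances between consecutive blocks in genome and cDNA coordinates, and treats a zero-length genomic gap as costing nothing. The Block type and function names are illustrative.

```python
import math
from typing import NamedTuple

class Block(NamedTuple):
    cStart: int
    cEnd: int
    gStart: int
    gEnd: int
    score: int  # matches minus mismatches

def gap_score(prev, nxt):
    """Assumed reading of the slide: -log(gSize) - cSize."""
    c_size = nxt.cStart - prev.cEnd
    g_size = nxt.gStart - prev.gEnd
    return -(math.log(g_size) if g_size > 0 else 0.0) - c_size

def best_chain(blocks):
    """Return (total score, chain) for the best-scoring collinear set of blocks."""
    blocks = sorted(blocks, key=lambda b: (b.cStart, b.gStart))
    if not blocks:
        return 0.0, []
    best = [float(b.score) for b in blocks]
    back = [-1] * len(blocks)
    for j, nxt in enumerate(blocks):
        for i in range(j):
            prev = blocks[i]
            # prev must end before nxt starts in both cDNA and genome
            if prev.cEnd <= nxt.cStart and prev.gEnd <= nxt.gStart:
                candidate = best[i] + gap_score(prev, nxt) + nxt.score
                if candidate > best[j]:
                    best[j], back[j] = candidate, i
    j = max(range(len(blocks)), key=best.__getitem__)
    total, chain = best[j], []
    while j != -1:
        chain.append(blocks[j])
        j = back[j]
    return total, chain[::-1]

exon1 = Block(0, 50, 1000, 1050, 50)
exon2 = Block(50, 100, 2000, 2050, 50)          # true second exon, 950-base intron
repeat_copy = Block(50, 100, 90000, 90050, 48)  # weaker copy much further away
total, chain = best_chain([repeat_copy, exon2, exon1])
```

The logarithm makes long introns cheap while gaps in the cDNA stay expensive, and the nearer, better-matching copy wins over the distant repeat.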
26Blat Rap Outline
- What blat is good at.
- How it compares to other tools.
- How it works.
- How best to use it.
- When it's free and when it's not.
27Standalone vs. Client/Server
- Standalone - best for batch queries. Executable
is called blat.
- Client/Server - best for interactive queries.
Executables are called gfClient/gfServer.
28Standalone
- Advantages
- 2x as fast
- More sensitive for protein/translated searches
- Runs well on computer clusters
- Runs effectively in 256 meg of RAM.
- Disadvantages
- Can't process complete genome at once unless you have
8 gig of RAM.
- Must combine and sort results of multiple runs.
- Must wait for index to be built before first query is
processed.
29Client/Server
- Advantages
- Can process entire genome in 1.2 Gb of RAM
- Process translated genome in 2.5 Gb of RAM
- Index is prebuilt. Response to first query is
typically < 2 s.
- No need to sort results.
- Disadvantages
- Ties up lots of memory in server machine
- 1/2 as fast as standalone
30Standalone mRNA/DNA Searches
- blat target query output -ooc=11.ooc
- Target (aka database) can be a fasta file, nib
file or a text file containing a list of fasta
and nib files. Target is typically a chromosome.
- For nucleotide queries no need to mask.
- Query can be a fasta file or list of fasta files.
Typically query is a large batch of mRNA or EST
sequences.
- Output by default is in a tab-separated format.
Recently -out=blast option and other output
options added.
- -ooc=11.ooc tells blat which 11-mers occur too
often to be useful. It greatly increases
blat's speed.
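Parsing the default output takes only a few lines. This sketch assumes the tab-separated format is the 21-column PSL layout documented for current blat releases, and it skips header lines by ignoring any line that does not have 21 columns starting with a number. The function name and the sample record are made up for illustration.

```python
PSL_FIELDS = [
    "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",
    "strand", "qName", "qSize", "qStart", "qEnd",
    "tName", "tSize", "tStart", "tEnd",
    "blockCount", "blockSizes", "qStarts", "tStarts",
]
TEXT_FIELDS = {"strand", "qName", "tName"}
LIST_FIELDS = {"blockSizes", "qStarts", "tStarts"}

def parse_psl(lines):
    """Yield one dict per alignment line, converting numeric columns to int."""
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != len(PSL_FIELDS) or not cols[0].isdigit():
            continue  # header or blank line
        record = dict(zip(PSL_FIELDS, cols))
        for name in PSL_FIELDS:
            if name in LIST_FIELDS:
                record[name] = [int(x) for x in record[name].split(",") if x]
            elif name not in TEXT_FIELDS:
                record[name] = int(record[name])
        yield record

# Hypothetical two-block alignment of a 60-base mRNA across a 950-base intron.
sample = [
    "psLayout version 3\n",
    "\n",
    "59\t1\t0\t0\t0\t0\t1\t950\t+\tmrna1\t60\t0\t60\tchr22\t100000\t1000\t2010\t2\t30,30,\t0,30,\t1000,1980,\n",
]
records = list(parse_psl(sample))
```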
31Standalone translated searches
- blat target query output -t=dnax -q=prot
- This aligns proteins vs translated genome
- blat target query output -t=dnax -q=rnax
- Aligns translated RNA vs translated genome
- blat target query output -t=dnax -q=dnax
- Aligns translated genome vs. translated genome
(best to chop query into 4kb or less pieces)
- For translated searches it's best to use masked
target DNA.
32Client/Server Setup
- Convert each chromosome into its own .nib file
with faToNib.
- Start up gfServer on a machine with enough
memory. It will take 10 minutes to build an
index.
- Run gfClient, telling it query sequence, machine
and port number that server is on.
- Parse gfClient output into your own interactive
systems.
33Translated Client/Server
- Mask chromosomes before converting to nib.
- Index will take 30 minutes to generate and
require 2.5 gig.
- For human genome both nucleotide and translated
servers fit on one Linux box with 4 Gb of RAM.
- In general one gfServer can support about 8
gfClients. (I put as much of the work as
possible on the client side.)
34short matches
- To find perfect 21-mers
- -minMatch=1
- -minScore=21
- -minIdentity=100
- For perfect 19-mers
- -tileSize=10
- -minMatch=1
- -minScore=19
- -minIdentity=100
- For 21-mers with one mismatch
- -minMatch=1
- -oneOff=1
- -minScore=21
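When these settings are driven from a script, a small helper can assemble the argument list. This sketch assumes blat takes options in -name=value form (plain -name for switches) and only builds the list without running anything. The helper name and the file names are hypothetical.

```python
def blat_command(target, query, output, **options):
    """Build an argument list for a standalone blat run (does not execute it)."""
    args = ["blat", target, query, output]
    for name, value in options.items():
        if value is True:
            args.append(f"-{name}")          # switch with no value
        else:
            args.append(f"-{name}={value}")  # option with a value
    return args

# Settings from this slide for perfect 21-mers; file names are placeholders.
perfect_21 = blat_command("chr22.nib", "probes.fa", "probes.psl",
                          minMatch=1, minScore=21, minIdentity=100)
```

The resulting list can be handed to `subprocess.run` as is, which avoids shell quoting problems with file names.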
35For Nucleotide Extra Sensitivity
- -tileSize=10
- -minIdentity=0
- -minScore=0
- Add 5 Ns at start of target and rerun
- Try blastz
36Fast Near Perfect Long Matches
- Use -tileSize=12, -ooc=12.ooc
- Try -fastMap
- -minIdentity=98
- -minScore=100
37Blat Rap Outline
- What blat is good at.
- How it compares to other tools.
- How it works.
- How best to use it.
- When it's free and when it's not.
38BLAT is Free For
- Non-profit organizations
- Students and educational institutions
- For interactive use on the web.
- Limited program driven use on the web (less than
2 hits/minute, less than 1,000 hits/day).
- For the first 30 days after downloading.
39Commercial Licenses
Jim Kent jim_kent_at_pacbell.net
Heidi Brumbaugh heidi_b_at_pacbell.net
40THE END