Title: Biological Sequence Determination
1Biological Sequence Determination
protein
RNA
DNA
Robert M. Horton, PhD, MS rmhorton@cybertory.org
artwork commons.wikimedia.org
2(No Transcript)
3Sequencing
context
- protein
- RNA
- DNA
- old methods
technological biological
concepts
- classical sequencing (Sanger)
- automation, base calling, quality scoring
- shotgun sequencing, assembly, finishing
chemistry, enzymes physics, computers
contemporary
- "next generation"
- methods: pyrosequencing, CRT, SOLiD
- applications: resequencing, epigenetics, RNA-Seq
microfluidics microfabrication
contemplation of the future
- third generation
- SMRT, nanopores, etc.
4Protein Sequencing
Why Proteins?
- Digestible (pepsin, trypsin, chymotrypsin)
- Important
- Small
- Chemically distinguishable (purifiable)
Insulin: Fred Sanger, Nobel Prize, 1958
5Classes of RNA
- mRNA
- modified bases (cap with m7G, 2'-O-methylation)
- splicing
- polyadenylation
- tRNA
- modified bases (GMe, GMe2, CMe, T, Ψ, UH2, I, IMe)
- rRNA
- prokaryotic: 70S = 50S (5S, 23S) + 30S (16S)
- eukaryotic: 80S = 60S (5S, 5.8S, 28S) + 40S (18S)
- 7SL
- RNA of Signal Recognition Particle (SRP)
- homologous to Alu SINE (11% of human genome)
- snRNA
- spliceosomes (U1, U2, U4, U5, U6)
- snoRNA
- pre-rRNA processing (U3)
- guide 2'-O-methylation
- guide pseudouridylation
- RNAi
- siRNA (short interfering RNA)
- miRNA (microRNA)
- post-transcriptional gene silencing
- 3' UTR, conserved
- piRNA
- transcriptional silencing of retrotransposons
... et cetera ...
6DNA Sequencing
- 1977
- The modern era of DNA sequencing begins
7Chemical Sequencing of DNA (Maxam-Gilbert)
February 1977
Two steps: damage bases (specific, partial); cleave backbone
Four reactions: A, AG, C, CT
http://nobelprize.org/nobel_prizes/chemistry/laureates/1980/gilbert-lecture.pdf
8(Sanger Sequencing)
Chain Termination Sequencing
2',3'-dideoxy TTP
- Sanger F, Nicklen S, Coulson AR
- DNA sequencing with chain-terminating inhibitors
- PNAS 74:5463-7, December 1977
9Primer Extension
Bacterial DNA polymerase I adds nucleotides to the 3' end of the primer to complement the 5'-overhanging template.
Each strand is an ordered sequence with a
direction.
Arrows indicate 5' to 3' direction (DNA grows
biochemically in this direction).
(pyrophosphate released)
10Sanger sequencing
Individual reactions with one dNTP partially
poisoned with dideoxynucleotides (ddATP, ddCTP,
ddGTP, ddTTP)
- Decades of improvements
- automated
- fluorescence
- four colors
- one lane
- dye terminators
- one reaction
- capillaries
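The chain-termination idea on this slide can be sketched in a few lines. This is a toy model, not a simulation of real chemistry: it assumes every possible termination product is present, and the function names (`termination_lengths`, `read_ladder`) are made up for illustration. Each of the four reactions yields fragments ending wherever its dideoxynucleotide was incorporated, and reading fragments from shortest to longest recovers the synthesized strand.

```python
def termination_lengths(synthesized: str, base: str) -> list[int]:
    """Lengths of the fragments produced in the reaction poisoned with dd<base>.

    `synthesized` is the newly made strand, written 5' to 3'. A fragment of
    length n exists in this reaction when position n of that strand is `base`.
    """
    return [i + 1 for i, b in enumerate(synthesized) if b == base]


def read_ladder(lanes: dict[str, list[int]]) -> str:
    """Read the sequence from four lanes by ordering fragments by length."""
    base_at_length = {n: base for base, lengths in lanes.items() for n in lengths}
    return "".join(base_at_length[n] for n in sorted(base_at_length))


synthesized = "GATTACA"
lanes = {base: termination_lengths(synthesized, base) for base in "ACGT"}
print(lanes["A"])          # fragment lengths seen in the ddATP reaction
print(read_ladder(lanes))  # sequence read from shortest to longest fragment
```

With dye terminators the four lanes collapse into one reaction, but the read-out logic (order by size, look up the label) stays the same.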
11Automated Sanger sequencing
trace base calls
quality scores
12Quality Score
$q = -10 \log_{10}(p)$
p = predicted error probability
Example: a 1/1000 probability of error gives a q score of 30
Uses: data quality monitoring, assembly, consensus, finishing criteria
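As a quick check of the quality score definition on this slide, here is a minimal Python sketch of the conversion in both directions. The function names are illustrative and do not come from Phred itself.

```python
import math


def phred_quality(p: float) -> float:
    """Quality score q = -10 * log10(p) for an error probability 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return -10.0 * math.log10(p)


def error_probability(q: float) -> float:
    """Inverse conversion: p = 10 ** (-q / 10)."""
    return 10.0 ** (-q / 10.0)


print(phred_quality(1 / 1000))  # the slide's example: 1/1000 error probability
print(error_probability(20))    # error probability implied by q = 20
```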
13Sequencing Strategy
Primer walking (serial)
Shotgun Sequencing (parallel)
14Universal Primers
15Assembly
16read length affects assembly
17Next-Generation Sequencing
- Pyrosequencing (454/Roche)
- Cycles of Reversible Termination (Solexa/Illumina)
- Ligation (ABI SOLiD)
"Third-Generation" Sequencing
- SMRT (Pacific Biosciences)
18pyrosequencing
pyrophosphate (released by dNTP incorporation) + APS (adenosine 5'-phosphosulfate) → ATP + sulfate
enzyme: ATP sulfurylase
19pyrosequencing
luciferin + O2 (oxygen) + ATP → oxyluciferin + AMP + pyrophosphate + light
enzyme: firefly luciferase
20pyrosequencing
more biochemistry
problem: pyrophosphate recycling
solution: apyrase breaks down ATP to AMP + 2 Pi (or wash out solution)
problem: luciferase can use dATP
solution: use an analog suitable for polymerase but not luciferase
21pyrosequencing
flowgram
Ronaghi M. Genome Res 11:3-11, 2001
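A flowgram can be mimicked with a small idealized model. It assumes nucleotides are flowed one at a time in a fixed repeating order, and that the light signal is exactly proportional to the number of bases incorporated in that flow. The flow order "TACG" and the function names are assumptions for illustration. Real signals are noisy, especially for long homopolymers.

```python
def flowgram(synthesized: str, flow_order: str = "TACG") -> list[tuple[str, int]]:
    """Idealized flowgram: one (nucleotide, signal) pair per flow.

    `synthesized` is the strand being built, 5' to 3'. The signal for a flow
    is the length of the homopolymer run incorporated during that flow
    (0 when the flowed nucleotide does not match the next base).
    """
    unknown = set(synthesized) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")
    signals = []
    pos = 0
    flow = 0
    while pos < len(synthesized):
        nucleotide = flow_order[flow % len(flow_order)]
        run = 0
        while pos < len(synthesized) and synthesized[pos] == nucleotide:
            run += 1
            pos += 1
        signals.append((nucleotide, run))
        flow += 1
    return signals


def call_bases(signals: list[tuple[str, int]]) -> str:
    """Turn an idealized flowgram back into a base sequence."""
    return "".join(nucleotide * run for nucleotide, run in signals)


flows = flowgram("TTAGGGC")
print(flows)
print(call_bases(flows))
```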
22Emulsion PCR
water droplet in oil
one primer bound to solid bead
individual template molecule
23Emulsion PCR
DNA anchored to bead all comes from the same template molecule
"polony" = "PCR colony"
24pyrosequencing
Alternatives to chemiluminescence
- heat (thermosequencing)
- pH change ("Ion Torrent")
25Cycles of Reversible Termination
Helicos
Illumina
Illumina
Metzker M. Sequencing Technologies - The Next Generation. Nature Reviews Genetics 11:31-46, 2010.
26Short Read Alignment
27FASTQ Format
maq.sourceforge.net/fastq.shtml
q = chr((Q <= 93 ? Q : 93) + 33)
Q = ord(q) - 33
Quality characters for Q = 0 to 60:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]
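The two conversions on this slide translate directly into Python. This sketch assumes the offset-33 encoding described on the slide, with Q capped at 93; the function names are illustrative.

```python
def quality_to_char(Q: int) -> str:
    """Encode a quality value as one character: chr(min(Q, 93) + 33)."""
    if Q < 0:
        raise ValueError("quality must be non-negative")
    return chr(min(Q, 93) + 33)


def char_to_quality(q: str) -> int:
    """Decode one quality character: ord(q) - 33."""
    return ord(q) - 33


def decode_quality_string(qualities: str) -> list[int]:
    """Decode the whole quality line of a FASTQ record."""
    return [char_to_quality(c) for c in qualities]


print(quality_to_char(0), quality_to_char(60))  # ends of the 0..60 scale
print(decode_quality_string("!5I"))
```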
28Paired End Tags
Mme I
TCCRAC (20/18)
29Illumina Genome Analyzer
30Illumina Genome Analyzer
- Bridge Amplification forms "Polonies"
31Illumina Genome Analyzer
- Cycles of Reversible Termination
32Ligation-based Sequencing
- SOLiD (ABI)
- Complete Genomics
- Polonator (Church Lab)
33SOLiD
Sequencing by Oligonucleotide Ligation and
Detection
3'- ATNNNZZZ-5'
artwork is from the pamphlet Dibase Sequencing
and Color Space Analysis
34(No Transcript)
35(No Transcript)
36SOLiD
37SOLiD Dibase Encoding
AT CG GC TA
AC CA GT TG
AA CC GG TT
AG CT GA TC
38SOLiD Dibase Encoding
color space
base space
Each color sequence can represent four different
base sequences. The base sequence is one unit
longer than the color sequence. You need to know
one base to tell which sequence is represented.
39SOLiD Dibase Encoding
single color change is probably an error
SNP causes two color changes
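The dibase encoding slides can be made concrete with a short sketch. It assumes the usual numbering in which the four dinucleotide groups get colors 0 to 3: AA/CC/GG/TT = 0, AC/CA/GT/TG = 1, AG/CT/GA/TC = 2, AT/CG/GC/TA = 3. With bases coded A=0, C=1, G=2, T=3, that color is the XOR of the two base codes. The color numbers and function names are assumptions; the slides only show the groupings.

```python
BASES = "ACGT"


def encode_colors(seq: str) -> list[int]:
    """Base space to color space: one color per adjacent pair of bases."""
    return [BASES.index(a) ^ BASES.index(b) for a, b in zip(seq, seq[1:])]


def decode_colors(first_base: str, colors: list[int]) -> str:
    """Color space to base space, given one known base to anchor the decoding."""
    seq = [first_base]
    for color in colors:
        seq.append(BASES[BASES.index(seq[-1]) ^ color])
    return "".join(seq)


colors = encode_colors("ATGGC")
print(colors)
# The same colors decode to four different base sequences, one per starting base.
print([decode_colors(b, colors) for b in BASES])
# A single-base change (SNP) alters two adjacent colors.
print(encode_colors("ATCGC"))
```

Because every base takes part in two overlapping dinucleotides, a true SNP shows up as two adjacent color changes, whereas an isolated single color change is more likely a measurement error.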
40Single-Molecule, Real-Time (SMRT) Sequencing
- High throughput
- Parallelism (small reactions)
- Speed (immediate results)
- Long reads
- Read individual templates from mixtures
- Haplotyping
41SMRT Sequencing
42Simulated SMRT Sequencing Data
43Platform Comparisons
Xu M, Fujita D, and Hanagata N. Perspectives and Challenges of Emerging Single-Molecule DNA Sequencing Technologies. Small 5(23):2638-2649, 2009
44Other Technologies
- Mass spectrometry
- TEM
- STM
- nanonozzle probes
- nanopores (protein, graphene)
- ionic current blockage
- transverse tunneling currents
- exonuclease
45Targeted Exome Capture
nimblegen.com
46Bonus Slides
47Selenocysteine tRNA
48Omics
- transcriptome
- exome
- kinome
49Plus and Minus Method
(circa 1975)
"minus": polymerase stops at missing base
"plus": T4 DNA polymerase 3' exonuclease stalled by dNTP
Sanger F, Coulson AR. J Mol Biol. 94(3):441-8, 1975
50pyrosequencing
Animation: http://www.pyrosequencing.com/DynPage.aspx?id=7454
51Bioinformatics Classics
- Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48:443-453, 1970.
- Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol 147:195-197, 1981.
52Automated Base Calling
- 1. identify idealized peak locations
- assume locally even spacing
- 2. find observed peaks
- 3. match observed to expected
- omit and split as necessary
- 4. add "good" unmatched peaks
53Error Probabilities
- predictive
- does not require knowing actual sequence
- valid
- the set of bases assigned to probability p should have an actual error rate of p
- discriminating
- helps to distinguish correct vs. incorrect base calls
- 1,000,000 base calls with 10,000 errors (p = 0.01)
- better if we can break it into two 500,000 sets
- p = 0.018 in one set (9000 errors)
- p = 0.002 in second set (1000 errors)
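The arithmetic behind the "discriminating" example can be checked with a small sketch. It counts how many base calls sit in a set whose empirical error rate meets a chosen cutoff. The cutoff of 0.005 and the function name are arbitrary choices for illustration.

```python
def calls_meeting_cutoff(sets: list[tuple[int, int]], max_error_rate: float) -> int:
    """Count base calls in sets whose error rate (errors / calls) is within the cutoff.

    Each set is a (number_of_calls, number_of_errors) pair.
    """
    return sum(calls for calls, errors in sets if errors / calls <= max_error_rate)


pooled = [(1_000_000, 10_000)]                # one set, p = 0.01
split = [(500_000, 9_000), (500_000, 1_000)]  # p = 0.018 and p = 0.002

# Same total errors either way, but only the split identifies high-quality calls.
print(calls_meeting_cutoff(pooled, 0.005))
print(calls_meeting_cutoff(split, 0.005))
```

Both descriptions are equally valid (they predict the same 10,000 errors), but the split one lets a user pick out half a million calls that are five times more reliable than the pooled estimate suggests.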
54Error Probability Calibration
'Given a set of parameters and a training set of
reads for which it is known which base-calls are
correct and which are errors, find a way of
associating parameter values to error
probabilities that has (near) maximum
discrimination power for small r.'
55Phred Quality Score Parameters
Empirical.
Small values tend to correspond to more accurate
base-calls.
Window-based parameters smooth out error
probabilities.
- Peak spacing (7 peak window)
- largest / smallest peak-to-peak spacing
- Uncalled/called ratio (7 peak window)
- amplitude of largest uncalled / smallest called peak
- Uncalled/called ratio (3 peak window)
- Peak resolution
- -1 × bases to the next unresolved base
56Lookup Table Production
- Select a range of 50 threshold values for each of the 4 parameters.
- These 50 values are chosen so that each increment contains approximately the same number of bases in the training set.
- For each 4-tuple of parameter thresholds (50^4 = 6,250,000)
- find the set of bases defined by these thresholds
- compute empirical error rates
- The parameter set with the lowest error rate goes into the table.
- if multiple 4-tuples give the same rate, choose the largest set
- These bases are removed, and the process is repeated until all bases are represented in the table.
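The greedy procedure on this slide can be sketched on a toy scale. This is not Phred's implementation. It assumes a base falls in a threshold set when every parameter is at or below its threshold (small values being better), takes the threshold lists as given rather than deriving them from equal-count bins, and brute-forces every threshold tuple, which is only practical for tiny inputs. All names are illustrative.

```python
from itertools import product


def covered(params, cutoffs):
    """True when every parameter value is at or below its threshold."""
    return all(value <= cutoff for value, cutoff in zip(params, cutoffs))


def build_lookup_table(bases, thresholds):
    """Greedy table construction.

    bases: list of (params_tuple, is_error) training examples.
    thresholds: one list of candidate cutoffs per parameter.
    Returns rows of (cutoffs, empirical_error_rate) in the order chosen.
    """
    remaining = list(bases)
    table = []
    while remaining:
        best_key, best_cutoffs = None, None
        for cutoffs in product(*thresholds):
            members = [b for b in remaining if covered(b[0], cutoffs)]
            if not members:
                continue
            rate = sum(1 for _, is_error in members if is_error) / len(members)
            key = (rate, -len(members))  # lowest rate first, then largest set
            if best_key is None or key < best_key:
                best_key, best_cutoffs = key, cutoffs
        if best_cutoffs is None:
            break  # the thresholds cannot cover the remaining bases
        table.append((best_cutoffs, best_key[0]))
        remaining = [b for b in remaining if not covered(b[0], best_cutoffs)]
    return table


def lookup(table, params):
    """Error rate of the first row that covers params, or None if no row does."""
    for cutoffs, rate in table:
        if covered(params, cutoffs):
            return rate
    return None


training = [((1, 1), False), ((1, 2), False), ((2, 1), False),
            ((2, 2), True), ((3, 3), True), ((3, 1), False)]
table = build_lookup_table(training, [[1, 2, 3], [1, 2, 3]])
print(table)
print(lookup(table, (2, 1)), lookup(table, (2, 2)))
```

Rows are consulted in the order they were created, so a new base call gets the error rate of the best (earliest) row whose thresholds it satisfies.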
57Post-translational Modification
(or co-translational)
- acylation (at O, N, or S)
- acetylation (acetate, CH3CO2- )
- myristoylation (myristate, a C14 fatty acid)
- palmitoylation (palmitate, a C16 fatty acid)
- alkylation
- methylation
- isoprenylation
- phosphorylation
- signal transduction
- ADP-ribosylation
- signal transduction
- cholera toxin
- glycosylation (glycoproteins)
- mucin, cellular interaction, structural
- N-linked
- asparagine
- O-linked
- serine, threonine, hydroxylysine, hydroxyproline
- iodination
- thyroid hormone
- hydroxylation
- hydroxylysine in collagen
- covalently bound enzyme cofactors
- FAD, biotin, etc
- ubiquitination
... and many more
58Wandering Spot Method
ca. 1970s; RNA or DNA
partial digestion, 2D separation
Horizontal: base composition
Vertical: size
This is an RNAse T1 fragment, so it ends in G
Fuke, M., and Busch, H. Nucleic Acids Res. 4:339-352, 1977.
59Enzymatic vs Chemical Partial Cleavage of RNA
Sequence-specific RNases:
- Phy M: AU
- A: pyrimidine-specific (CU)
- U2: A or AG
- T1: degrades after G residues
- V1: degrades paired bases
Peattie DA. PNAS 76:1760-1764, 1979.
enzymatic
chemical
60Modified Nucleotides in tRNA
(post-transcriptional)
- pseudouridine (Ψ)
- dihydrouridine (UH2)
- inosine (I)
- methylinosine (IMe)
- methylguanine (GMe)
- dimethylguanine (GMe2)
- methylcytosine (CMe)
- ribothymine (T)
61Nucleotide Ambiguity Codes
(IUPAC)
Unambiguous: A, C, G, T, U
2-fold degenerate:
- M: A or C
- R: A or G (puRine)
- W: A or T (Weak)
- S: C or G (Strong)
- Y: C or T (pYrimidine)
- K: G or T
3-fold degenerate:
- V: A, C or G (not T)
- H: A, C or T (not G)
- D: A, G or T (not C)
- B: C, G or T (not A)
4-fold degenerate:
- X: A, C, G or T
- N: A, C, G or T
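The ambiguity codes map naturally onto a lookup table. This sketch expands a degenerate sequence into every concrete sequence it represents and tests whether a sequence matches a degenerate pattern. The dictionary follows the codes on this slide; the function names are illustrative. The example pattern is the Mme I site TCCRAC from the Paired End Tags slide.

```python
from itertools import product

# Ambiguity code -> the bases it can stand for (as listed on the slide).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT",
}


def expand(pattern: str) -> list[str]:
    """All unambiguous sequences represented by a degenerate pattern."""
    return ["".join(bases) for bases in product(*(IUPAC[c] for c in pattern.upper()))]


def matches(sequence: str, pattern: str) -> bool:
    """True when each base of `sequence` is allowed by the code at that position."""
    return len(sequence) == len(pattern) and all(
        base in IUPAC[code] for base, code in zip(sequence.upper(), pattern.upper())
    )


print(expand("TCCRAC"))
print(matches("TCCGAC", "TCCRAC"), matches("TCCTAC", "TCCRAC"))
```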
62Automated Base Calling
Phred: third-party base caller with better accuracy than ABI's; open source(ish)
Ewing B, Hillier L, Wendl MC, Green P. Base-Calling of Automated Sequencer Traces Using Phred. I. Accuracy Assessment. Genome Res. 8:175-185, 1998
Ewing B and Green P. Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities. Genome Res. 8:186-194, 1998
63Shotgun Sequencing
Staden R. A strategy of DNA sequencing employing computer programs. Nucleic Acids Research 6(7):2601-2610, 1979
With modern fast sequencing techniques and
suitable computer programs it is now possible to
sequence whole genomes without the need of
restriction maps. This paper describes computer
programs that can be used to order both sequence
gel readings and clones. A method of coding for
uncertainties in gel readings is described. These
programs are available on request.
The whole of the DNA to be sequenced is
shotgunned into a suitable vector and cloned.
Ideally the cloned fragments would be of at least
200 bases in length. The clones are then
sequenced and the computer used to collate the
data. Collation involves searching for overlaps
in the data.
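"Searching for overlaps" can be illustrated with a deliberately naive greedy assembler. This is a sketch, not Staden's program: it only considers exact suffix-prefix overlaps on one strand, ignores sequencing errors, reverse complements, and reads contained inside other reads, and the minimum overlap of 3 is an arbitrary choice for the toy data.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the longest overlap.

    Returns the remaining contigs once no pair overlaps by at least min_len.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGGCGT", "GCGTACG", "TACGTTA"]
print(overlap("ATGGCGT", "GCGTACG", 3))
print(greedy_assemble(reads))
```

Longer reads make longer unique overlaps possible, which is one reason read length affects how well an assembly resolves repeats.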
64 2D gel electrophoresis
65cybertory.org/exercises/primerDesign
66(No Transcript)
67Protein Sequencing
Edman Degradation
phenylisothiocyanate
invented ca. 1950s, automated ca. 1973
proceeds from N-terminus, read 50-70 aa
http://en.wikipedia.org/wiki/Edman_degradation
Mass Spectrometry
Precise determination of molecular weights of
peptides
A few amino acids can ID a spot on 2D gel
68(Sec)
modified from Wikimedia commons