Title: Basic Molecular Biology
1Basic Molecular Biology
Many slides by Omkar Deshpande
2Overview
- Structures of biomolecules
- Central Dogma of Molecular Biology
- Overview of this course
- Genome Sequencing
3Human Genome Program, U.S. Department of Energy,
Genomics and Its Impact on Medicine and Society:
A 2001 Primer, 2001
5Watson and Crick
7Macromolecule (Polymer)    Monomer
DNA                         Deoxyribonucleotides (dNTP)
RNA                         Ribonucleotides (NTP)
Protein or Polypeptide      Amino Acid
8Nucleic acids (DNA and RNA)
- Form the genetic material of all living organisms.
- Found mainly in the nucleus of a cell (hence "nucleic").
- Contain phosphoric acid as a component (hence "acid").
- They are made up of nucleotides.
9Nucleotides
10DNA
A T G C
11The gene and the genome
- Genome: the entire DNA sequence within the nucleus.
- The information in the genome is used for protein synthesis.
- A gene is a length of DNA that codes for a (single) protein.
12How big are genomes?
Organism                             Genome Size (Bases)   Estimated Genes
Human (Homo sapiens)                 3 billion             20,000
Laboratory mouse (M. musculus)       2.6 billion           20,000
Mustard weed (A. thaliana)           100 million           18,000
Roundworm (C. elegans)               97 million            16,000
Fruit fly (D. melanogaster)          137 million           12,000
Yeast (S. cerevisiae)                12.1 million          5,000
Bacterium (E. coli)                  4.6 million           3,200
Human immunodeficiency virus (HIV)   9,700                 9
13Repeats
- The DNA is full of repetitive elements (those that occur over & over & over).
- There are several types of repeats, including SINEs & LINEs (Short & Long Interspersed Elements) (1 million just ALUs) and low complexity elements.
- Their function is poorly understood, but they make sequence-analysis problems (such as assembly) more difficult.
14Central dogma
[Diagram: DNA --transcription--> RNA (mRNA, tRNA, rRNA, snRNA); mRNA --translation--> polypeptide]
15Transcription
- The DNA is contained in the nucleus of the cell.
- A stretch of it unwinds there, and its message (or sequence) is copied onto a molecule of mRNA.
- The mRNA then exits from the cell nucleus.
16DNA
RNA
A T G C
T → U (in RNA, uracil takes the place of thymine)
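The change of alphabet can be sketched in a few lines of Python. This is a minimal illustration, assuming the input is the coding (sense) strand written 5' to 3', so the mRNA carries the same letters with U in place of T. The function name `transcribe` is made up for this example.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA spelled by a DNA coding strand, replacing T with U."""
    dna = coding_strand.upper()
    unexpected = set(dna) - set("ACGT")
    if unexpected:
        raise ValueError(f"unexpected letters in DNA: {sorted(unexpected)}")
    return dna.replace("T", "U")


# DNA version of the mRNA used on the Translation slide
print(transcribe("ATGCCGGGAGTATAG"))
```

A real polymerase reads the template strand and builds the complementary RNA; working from the coding strand is a shortcut that gives the same letters.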
17More complexity
- The RNA message is sometimes edited.
- Exons are nucleotide segments whose codons will be expressed.
- Introns are intervening segments (genetic gibberish) that are snipped out.
- Exons are spliced together to form mRNA.
18Splicing
- frgjjthissentencehjfmkcontainsjunkelm
- thissentencecontainsjunk
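The toy sentence above can be spliced programmatically. The sketch below assumes the exon boundaries are already known and given as half-open index pairs; the coordinates are hypothetical values chosen to fit this toy string, and `splice` is an illustrative name.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon segments (half-open [start, end) index pairs), dropping introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)


pre_mrna = "frgjjthissentencehjfmkcontainsjunkelm"
exons = [(5, 17), (22, 34)]  # hypothetical exon coordinates for the toy string
print(splice(pre_mrna, exons))
```

In a real genome the hard part is finding the exon boundaries, which is the gene prediction problem listed in the course outline.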
19Key player: RNA polymerase
- It is the enzyme that brings about transcription
by going down the line, pairing mRNA nucleotides
with their DNA counterparts.
20Promoters
- Promoters are sequences in the DNA just upstream of transcripts that define the sites of initiation.
- The role of the promoter is to attract RNA polymerase to the correct start site so transcription can be initiated.
[Diagram: promoter on the DNA, with the 5' and 3' ends labeled]
22Transcription key steps
- Initiation
- Elongation
- Termination
[Diagram labels: DNA, RNA]
27Genes can be switched on/off
- An adult multicellular organism contains a wide variety of cell types, e.g., muscle, nerve, and blood cells.
- The different cell types nevertheless contain the same DNA.
- This differentiation arises because different cell types express different genes.
- Promoters are one type of gene regulator.
28Transcription (recap)
- The DNA is contained in the nucleus of the cell.
- A stretch of it unwinds there, and its message (or sequence) is copied onto a molecule of mRNA.
- The mRNA then exits from the cell nucleus.
- Its destination is a molecular workbench in the
cytoplasm, a structure called a ribosome.
29Translation
- How do we interpret the information carried by mRNA to the ribosome?
- Think of the sequence as a sequence of triplets.
- Think of AUGCCGGGAGUAUAG as AUG-CCG-GGA-GUA-UAG.
- Each triplet (codon) maps to an amino acid.
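The triplet reading can be sketched in Python using the standard genetic code. The compact table below lists one amino-acid letter per codon in U, C, A, G order, with `*` marking stop codons. The names `CODON_TABLE` and `translate` are illustrative, and the sketch assumes translation starts at the first letter and ends at the first stop codon.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code: one letter per codon, codons enumerated in UCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Read the mRNA codon by codon and stop at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# AUG-CCG-GGA-GUA-UAG reads as Met-Pro-Gly-Val-stop
print(translate("AUGCCGGGAGUAUAG"))

# Redundancy: 64 codons map to only 20 amino acids plus stop.
print(sum(1 for aa in CODON_TABLE.values() if aa == "L"), "codons encode leucine")
```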
30The Genetic Code
- f: codon → amino acid
- 1968 Nobel Prize in Physiology or Medicine: Nirenberg and Khorana (shared with Holley)
- Important: the genetic code is (nearly) universal!
- It is also redundant / degenerate: several codons can map to the same amino acid.
31The Genetic Code
32Proteins
- Composed of a chain of amino acids.

     R
     |
H2N--C--COOH
     |
     H

R: one of 20 possible groups
33Proteins
     R              R
     |              |
H2N--C--COOH   H2N--C--COOH
     |              |
     H              H
34Dipeptide
This is a peptide bond (the C--NH link between the two residues):

     R  O      R
     |  ||     |
H2N--C--C--NH--C--COOH
     |         |
     H         H
35Protein structure
- The linear sequence of amino acids folds to form a complex 3-D structure.
- The structure of a protein is intimately connected to its function.
- The 3-D shape of proteins gives them their working ability: the ability to bind with other molecules.
36Our course (2417)
Part 1, DNA: Assembly, Evolution, Alignment
Part 2, Genes: Prediction, Regulation
[Diagram: central dogma, DNA --transcription--> mRNA, rRNA, snRNA; mRNA --translation--> polypeptide]
37DNA Sequencing
Some slides shamelessly stolen from Serafim
Batzoglou
38DNA sequencing
- How we obtain the sequence of nucleotides of a
species
ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGAC
TACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG
ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT
39Which representative of the species?
- Which human?
- Answer one
- Answer two: it doesn't matter
- Polymorphism rate: number of letter changes between two different members of a species
- Humans: ~1/1,000 to 1/10,000
- Other organisms have much higher polymorphism rates
40Why humans are so similar
- A small population that interbred reduced the genetic variation
- Out of Africa ~40,000 years ago
41Migration of human variation
- http://info.med.yale.edu/genetics/kkidd/point.html
44DNA Sequencing
- Goal
- Find the complete sequence of A, C, G, T's in DNA
- Challenge
- There is no machine that takes long DNA as an input, and gives the complete sequence as output
- Can only sequence ~500 letters at a time
45DNA sequencing vectors
[Diagram: DNA is sheared ("Shake") into DNA fragments, which are inserted into a vector at a known location (restriction site)]
Vector: circular genome (bacterium, plasmid)
46DNA sequencing gel electrophoresis
- Start at primer (restriction site)
- Grow DNA chain
- Include dideoxynucleoside (modified a, c, g, t)
- Stops reaction at all possible points
- Separate products by length, using gel electrophoresis
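The logic of the steps above can be mimicked with an idealized Python sketch. It assumes exactly one terminated product per position of the newly grown strand, each labeled by its terminating base, and no noise. Real reactions produce many copies of each product and far messier signals. The function names are illustrative.

```python
import random


def terminated_products(strand: str) -> list[tuple[int, str]]:
    """One chain-terminated product per position: (length, labeled last base)."""
    return [(i + 1, base) for i, base in enumerate(strand)]


def read_by_length(products: list[tuple[int, str]]) -> str:
    """Order the products by length, as the gel does, and read the end labels."""
    return "".join(base for _, base in sorted(products))


strand = "ACGTGACTGAGGACCGTG"  # first fragment shown on the DNA sequencing slide
products = terminated_products(strand)
random.shuffle(products)  # the reaction yields an unordered mixture
print(read_by_length(products))
```

Sorting by length is the whole trick: the gel turns an unordered mixture into a ladder whose rungs spell out the sequence.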
47Electrophoresis diagrams
48Challenging to read answer
49Challenging to read answer
50Reading an electropherogram
- Filtering
- Smoothing
- Correction for length compressions
- A method for calling the letters: PHRED
- PHRED: PHil's Read EDitor (by Phil Green)
- Based on dynamic programming
- Several better methods exist, but labs are
reluctant to change
51Output of PHRED: a read
- A read: 500-700 nucleotides
- Bases:          A  C  G  A  A  T  C  A  G  A
- Quality scores: 16 18 21 23 25 15 28 30 32 21
- Quality score = -10 * log10(Prob(Error))
- Reads can be obtained from the leftmost and rightmost ends of the insert
- Double-barreled sequencing: both leftmost & rightmost ends are sequenced
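The quality-score formula can be checked with a short Python sketch, using the ten bases and scores from the slide. The function names are illustrative and are not part of PHRED itself.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Invert Q = -10 * log10(P): the error probability implied by a quality score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Quality score for a given error probability."""
    return -10 * math.log10(p)


bases = "ACGAATCAGA"
quals = [16, 18, 21, 23, 25, 15, 28, 30, 32, 21]
for base, q in zip(bases, quals):
    print(f"{base}  Q={q}  P(error)={phred_to_error_prob(q):.4f}")

# Expected number of miscalled bases in this short stretch
print(sum(phred_to_error_prob(q) for q in quals))
```

A score of 20 corresponds to a 1 in 100 chance that the call is wrong, and 30 to 1 in 1,000.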
52Method to sequence longer regions
genomic segment
cut many times at random (Shotgun)
Get two reads from each segment
500 bp
500 bp
53Reconstructing The Sequence
reads
Cover region with 7-fold redundancy (7X)
Overlap reads and extend to reconstruct the
original genomic region
54Definition of Coverage
- Length of genomic segment: L
- Number of reads: n
- Length of each read: l
- Definition: Coverage C = n l / L
- How much coverage is enough?
- Lander-Waterman model: assuming a uniform distribution of reads, C = 10 results in about 1 gapped region per 1,000,000 nucleotides
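The figure of one gap per million nucleotides can be reproduced with a small Python sketch. It uses the simplest form of the Lander-Waterman expectation, n * exp(-C), which assumes uniformly placed reads and ignores the minimum overlap needed to join two reads. The 500-letter read length is taken from the earlier slides, and the function names are illustrative.

```python
import math


def coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """C = n * l / L."""
    return n_reads * read_len / genome_len


def expected_gaps(n_reads: int, c: float) -> float:
    """Simplified Lander-Waterman expectation: n * exp(-C) gapped regions."""
    return n_reads * math.exp(-c)


L = 1_000_000  # genomic segment length
l = 500        # read length, from the earlier slides
n = 20_000     # number of reads chosen so that C = 10

c = coverage(n, l, L)
print(f"C = {c}, expected gaps = {expected_gaps(n, c):.2f}")
```

Because the expectation falls off exponentially in C, each extra unit of coverage cuts the expected number of gaps by a factor of about e.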
55Challenges with Fragment Assembly
- Sequencing errors
- ~1-2% of bases are wrong
- Repeats
- Computation: ~O(N^2) where N = # reads
[Diagram: false overlap due to repeat]
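The quadratic cost comes from comparing every read against every other read. Below is a naive Python sketch that assumes error-free reads and exact suffix-prefix matches; real assemblers tolerate sequencing errors and use indexing to avoid the full all-pairs comparison. The reads and the `min_len` threshold are made up for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def all_pairs_overlaps(reads: list[str], min_len: int = 3) -> dict[tuple[int, int], int]:
    """Compare every ordered pair of reads: N * (N - 1) overlap computations."""
    found = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = overlap(a, b, min_len)
                if k:
                    found[(i, j)] = k
    return found


reads = ["ACGTGACT", "GACTGAGG", "GAGGACCG"]  # made-up reads
print(all_pairs_overlaps(reads))
```

A repeat longer than `min_len` would produce exactly the kind of false overlap shown on the slide, since two reads from different copies of the repeat look like neighbours.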
56Repeats
- Bacterial genomes: 5%
- Mammals: 50%
- Repeat types
- Low-complexity DNA (e.g. ATATATATACATA)
- Microsatellite repeats (a1...ak)^N where k ~ 3-6
- (e.g. CAGCAGTAGCAGCACCAG)
- Transposons
- SINE (Short Interspersed Nuclear Elements)
- e.g., ALU: ~300-long, 10^6 copies
- LINE (Long Interspersed Nuclear Elements)
- ~4000-long, 200,000 copies
- LTR retroposons (Long Terminal Repeats (~700 bp) at each end)
- cousins of HIV
- Gene families: genes duplicate and then diverge (paralogs)
- Recent duplications: ~100,000-long, very similar copies
57Hierarchical Sequencing
58Hierarchical Sequencing Strategy
genome
- Obtain a large collection of BAC clones
- Map them onto the genome (Physical Mapping)
- Select a minimum tiling path
- Sequence each clone in the path with shotgun
- Assemble
- Put everything together
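Once the clones are mapped, selecting a minimum tiling path can be viewed as an interval-cover problem. The sketch below assumes each clone's start and end position on the genome are already known from the physical map, which is an idealization. The coordinates are made up, and the greedy rule always picks the clone that reaches farthest among those starting inside the region covered so far.

```python
def minimum_tiling_path(clones: list[tuple[int, int]], genome_len: int) -> list[tuple[int, int]]:
    """Greedy interval cover: fewest (start, end) clones that cover [0, genome_len)."""
    ordered = sorted(clones)
    path, covered, i = [], 0, 0
    while covered < genome_len:
        best = None
        # Among clones starting within the covered region, keep the farthest-reaching one.
        while i < len(ordered) and ordered[i][0] <= covered:
            if best is None or ordered[i][1] > best[1]:
                best = ordered[i]
            i += 1
        if best is None or best[1] <= covered:
            raise ValueError(f"no clone extends coverage past position {covered}")
        path.append(best)
        covered = best[1]
    return path


# Made-up clone coordinates (for example, in kb)
clones = [(0, 150), (100, 300), (120, 260), (250, 400), (280, 450), (390, 500)]
print(minimum_tiling_path(clones, 500))
```

In practice the map gives only relative order and overlap, not exact coordinates, so real tiling-path selection works on the overlap graph instead.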
59Methods of physical mapping
- Goal
- Make a map of the locations of each clone relative to one another
- Use the map to select a minimal set of clones to sequence
- Methods
- Hybridization
- Digestion
601. Hybridization
[Diagram: probes p1 ... pn]
- Short words, the probes, attach to complementary words
- Construct many probes
- Treat each BAC with all probes
- Record which ones attach to it
- Same words attaching to BACs X, Y → overlap
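The overlap inference from shared probes can be sketched in Python. The data are hypothetical: each BAC is represented by the set of probes that attached to it, and any pair sharing at least `min_shared` probes is reported as a candidate overlap. A probe that occurs in a repeat would create false positives, which this sketch does not handle.

```python
from itertools import combinations


def overlaps_from_probes(fingerprints: dict[str, set[str]], min_shared: int = 1) -> dict[tuple[str, str], set[str]]:
    """Report clone pairs that share at least min_shared hybridizing probes."""
    candidates = {}
    for a, b in combinations(sorted(fingerprints), 2):
        shared = fingerprints[a] & fingerprints[b]
        if len(shared) >= min_shared:
            candidates[(a, b)] = shared
    return candidates


# Hypothetical hybridization results: BAC name -> probes that attached to it
fingerprints = {
    "X": {"p1", "p2", "p3"},
    "Y": {"p3", "p4"},
    "Z": {"p5"},
}
print(overlaps_from_probes(fingerprints))
```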
612. Digestion
- Restriction enzymes cut DNA where specific words appear
- Cut each clone separately with an enzyme
- Run fragments on a gel and measure length
- Clones Ca, Cb have fragments of length li, lj, lk → overlap
- Double digestion
- Cut with enzyme A, enzyme B, then enzymes A + B
62Whole-Genome Shotgun Sequencing
63Whole Genome Shotgun Sequencing
genome
plasmids (2-10 Kbp)
cosmids (40 Kbp)
forward-reverse paired reads (~500 bp each) at a known distance
64 History of DNA Sequencing
Adapted from Eric Green, NIH; adapted from Messing & Llaca, PNAS (1998)
[Timeline figure: sequencing milestones plotted against efficiency (bp/person/year)]
1870 Miescher: discovers DNA
1940 Avery: proposes DNA as genetic material
1953 Watson & Crick: double helix structure of DNA
1965 Holley: sequences yeast tRNA-Ala
1970 Wu: sequences λ cohesive end DNA
1977 Sanger: dideoxy chain termination; Gilbert: chemical degradation
1980 Messing: M13 cloning
1986 Hood et al.: partial automation
1990 Cycle sequencing; improved sequencing enzymes; improved fluorescent detection schemes
2009 Next generation sequencing; improved enzymes and chemistry; new image processing
Efficiency axis (bp/person/year), in increasing order: 1; 15; 150; 1,500; 15,000; 25,000; 50,000; 200,000; 50,000,000; 100,000,000,000. The time axis also marks 2002.
65Read length and throughput
[Plot: bases per machine run (1 Mb, 10 Mb, 100 Mb, 1 Gb) against read length (10 bp, 100 bp, 1,000 bp)]
NGS slides courtesy of Gabor Marth
66Sequencing chemistries
DNA ligation
DNA base extension
Church, 2005
67Massively parallel sequencing
Church, 2005
68Features of NGS data
- Short sequence reads
- 100-200 bp: 454 (Roche)
- 35-120 bp: Solexa (Illumina), SOLiD (AB)
- Huge amount of sequence per run
- Gigabases per run
- Huge number of reads per run
- Up to billions
- Higher error (compared with Sanger)
- Different error profile
69Current and future application areas
70What can we use them for?
                              SANGER            454               Solexa          AB SOLiD
De novo assembly              Mammal (3*10^9)   Bacteria, Yeast   Bacteria        Bacteria?
SNP Discovery                 Yes               Yes               90% of human    90% of human
Larger events                 Yes               Yes               Yes             Yes
Transcript profiling (rare)   No                Maybe             Yes             Yes
72Computer scientists vs Biologists
- Nothing is ever completely true or false in Biology.
- Everything is either true or false in computer science.
73Next Gen Raw Data
- Machine readouts are different
- Read length, accuracy, and error profiles are variable.
- All parameters change rapidly as machine hardware, chemistry, optics, and noise filtering improve
74Current and future application areas
75Fundamental informatics challenges
76Informatics challenges (contd)
77AB SOLiD System dibase sequencing
2-base, 4-color: 16 probe combinations
- 4 dyes encode the 16 possible 2-base combinations
- Detecting a single color indicates 4 combinations and eliminates 12
- Each color reflects position, not the base call
- Each base is interrogated by two probes
- Dual interrogation eases discrimination of errors (random or systematic) vs. SNPs (true polymorphisms)
78Converting colors into letters
[Diagram: 4 possible sequences]
- The decoding matrix allows a sequence of transitions (colors) to be converted to a base sequence, as long as one of the bases is known.
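The decoding step can be sketched in Python. The decoding matrix itself is not in the transcript, so the sketch assumes the commonly described two-base encoding, in which the bases are numbered A=0, C=1, G=2, T=3 and each color is the XOR of the codes of two adjacent bases. The color sequence in the example is made up.

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def encode_colors(seq: str) -> list[int]:
    """One color per adjacent base pair: the XOR of the two base codes (assumed scheme)."""
    return [BASE_TO_CODE[a] ^ BASE_TO_CODE[b] for a, b in zip(seq, seq[1:])]


def decode_colors(first_base: str, colors: list[int]) -> str:
    """Rebuild the bases from a known first base and the color transitions."""
    bases = [first_base]
    for color in colors:
        bases.append(CODE_TO_BASE[BASE_TO_CODE[bases[-1]] ^ color])
    return "".join(bases)


colors = [3, 2, 0, 1]  # made-up color read
# Without a known base there are 4 candidate sequences, one per possible first base.
for first in "ACGT":
    print(first, decode_colors(first, colors))
```

Under this scheme a single miscalled color changes every base downstream of it, whereas a true SNP changes two adjacent colors, which is how dual interrogation separates errors from polymorphisms.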
79SOLiD error checking code
80Comparison of the technologies
              SANGER     454           Solexa      AB SOLiD
Output        Sequence   Flowgram      Sequence    Colors
Read Length   500-700    250-500       35-70       35-50
Error rate    2%         3% (indels)   1%          4% or 0.06%
Mb per run    0.8        20            10000       20000
Cost per Mb   1000       50            0.15        0.05
Paired?       Yes        Sort of       Yes (<1k)   Yes (<10k)