Title: BINF 5230
1 Advances in Molecular Cellular Genetics
Principles and Applications of Bioinformatics
Lecture 4 BINF 5230
2 Before we continue the discussion of DNA
sequencing, let us remember that the early work
on sequencing focused on protein sequencing. The
amino acid sequence of a peptide was determined
by Edman degradation. This technique involves
labeling the N-terminal amino acid of a
polypeptide, cleaving it from the chain, and
identifying it by high-pressure liquid
chromatography. This procedure is repeated until
all amino acids have been identified. Edman
degradation was in routine use until the
mid-1980s. Since then, recombinant DNA technology
has facilitated sequencing at the DNA level. Now
the question.
H.A. What advantages does the protein sequencing
method have relative to DNA sequencing?
Despite some advantages of the protein
degradation method, DNA technology is now used
routinely. Why?
3 Let us review the stages of sequencing from the
previous lecture. The Online Education Kit
"Understanding the Human Genome Project" presents
the essential steps in sequencing a genome:
http://www.genome.gov/25019885
View the animations and read the transcripts of
Introduction, then Mapping, and finally Building
Libraries.
H.A. What does "shotgun" sequencing mean?
4 Here we analyze the main technique for DNA
sequencing, called shotgun sequencing. Two
approaches have been used to sequence the genome.
They differ in the methods they use to cut up the
DNA and assemble it in the correct order, and in
whether they map the chromosomes before decoding
the sequence.
First there was the BAC to BAC approach. A
second, newer method is called whole genome
shotgun sequencing. The BAC to BAC approach
first creates a crude physical map of the whole
genome before sequencing the DNA. Constructing a
map requires cutting the chromosomes into large
pieces and figuring out the order of these big
chunks of DNA before taking a closer look and
sequencing all the fragments.
5 Whole Genome Shotgun Sequencing
The shotgun sequencing method goes straight to
the job of decoding, bypassing the need for a
physical map. Therefore, it is much faster.
H.A. This method is really much faster, but it is
also much more difficult. Which step of this
method is more difficult? Why?
6 The process of shotgun sequencing starts by
physically breaking up the DNA molecule into
millions of random pieces that are about 150,000
base pairs (bp) or more long.
These pieces are fingerprinted to give each piece
a unique identification tag that determines the
order of the fragments.
Fingerprinting involves cutting each BAC fragment
with a single enzyme and finding common sequence
landmarks in overlapping fragments that determine
the location of each BAC along the chromosome.
Overlapping BACs with markers every 100,000 bp
then form a map of each chromosome.
This step is not needed in Whole Genome Shotgun
Sequencing.
7 However, current sequencing technologies can
read only a few hundred bases, up to about 1,000
base pairs of DNA, at a time. So these big
fragments are shredded into small pieces of the
size the sequencer can handle. To do so, the
fragments are inserted into cloning vectors in
order to amplify the DNA.
8 Part III Preparing DNA for Detection
The goal: to create copies that we will be able
to read.
Each of these 150,000 bp fragments is inserted
into a BAC, a bacterial artificial chromosome. A
BAC is a man-made piece of DNA that can replicate
inside a bacterial cell. The whole collection of
BACs containing the entire human genome is called
a BAC library.
H.A. Why do we need to create a BAC library?
Each BAC is then broken randomly into pieces of
500-1,000 bp. These double-stranded pieces are
then separated into two single strands. The
separation is initiated by heating.
9 How do we read the nucleotide text? As in almost
all biological processes and methods of analysis,
the main idea is based on the principle of
complementarity.
H.A. List several biological functions that are
based on complementarity.
10 We construct the complement of the DNA clone.
To do it, we need to add
1) the "regular" nucleotides
A's, G's, C's, and T's,
2) a polymerase enzyme
to connect the free nucleotides to the single
strand of cloned DNA,
and 3) DNA primers.
G A DNA primer is a single-stranded piece of
DNA of known sequence, about 20 bases long, that
will initiate the building of a new DNA strand
that is complementary to the cloned DNA.
11 What else do we need to add?
A special marker in the sequence: "special"
nucleotides. Because of their molecular
structure, whenever one of these special
nucleotides adds itself to a growing DNA strand,
the growth of that strand stops.
The normal substrates for DNA replication are
nucleoside triphosphates that are based on the
sugar 2-deoxyribose (dNTPs).
ddNTP
When a nucleotide based on 2',3'-dideoxyribose
(ddNTP) is incorporated into the DNA backbone,
replication is terminated.
H.A. Explain why this structure terminates
replication.
12 Consider replicating a DNA strand in the
presence of, for example, dideoxy-T. MOST of the
time when a T is needed, the enzyme will pick up a
normal deoxy-T and go on adding more nucleotides
to make the new strand (because the concentration
of dideoxy-T is much lower than that of normal
deoxy-T). However, about 5% of the time, the
enzyme will get a dideoxy-T, and that strand can
never again be elongated. Sooner or later ALL of
the copies will get terminated by a T, but each
time the enzyme makes a new strand, the place
where it gets stopped will be random. In millions
of starts, there will be strands stopping at
every possible T along the way. ALL of the
strands we make started at one exact position.
ALL of them end with a T. There are many millions
at each possible T position. To find out where
all the T's are in our newly synthesized strand,
all we have to do is find out the sizes of all
the terminated products!
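The reasoning on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the strand `TCTAGGACATG`, the function names, and the way termination is modeled (an independent 5% chance at every T) are hypothetical and not part of the lecture material.

```python
import random

def possible_stop_lengths(new_strand, dd_base="T"):
    """Lengths of all products that can end on a dideoxy copy of dd_base."""
    return [i + 1 for i, base in enumerate(new_strand) if base == dd_base]

def synthesize_once(new_strand, dd_base="T", p_stop=0.05, rng=random):
    """Grow one copy; at each dd_base there is a p_stop chance of termination."""
    for i, base in enumerate(new_strand):
        if base == dd_base and rng.random() < p_stop:
            return i + 1          # terminated right after this base
    return len(new_strand)        # reached the end without termination

strand = "TCTAGGACATG"            # hypothetical new strand, primer end first
rng = random.Random(0)            # fixed seed so the run is repeatable
observed = {synthesize_once(strand, rng=rng) for _ in range(100_000)}

print(possible_stop_lengths(strand))  # [1, 3, 10]
print(sorted(observed))               # terminated sizes seen in the simulation
```

Every terminated product has a length that marks the position of a T, which is why measuring product sizes is enough to locate all the T's.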
13 H.A. Explain the statement "ALL of the strands
we make started at one exact position."
14 These dideoxynucleotides are special in
another way, too.
They fluoresce when struck by a laser beam.
Each type of nucleotide is marked with its own
color:
A, G, C, T
This is because we need to know where replication
stopped.
15 Build New Sequences
The primer attaches itself to a
complementary sequence on the vector and
initiates the building of a new DNA strand.
Vector and clone:       A G A T C C T G T A C G A T T . . .
Primer and new strand:  T C T A G G A C A T G
To separate the new strand from the vector, we
expose them to heat.
This process is often repeated many times (around
40) to get the most mileage from each of the
billions of vectors present.
16 Many Copies of Various Lengths
T C T A G G A G C A G T
T C T A G G T A C T G
T C T A G G T T C
T C T A G G A
Because of the special nucleotides, which get
added occasionally at random locations, the
lengths of the new copies range in size from one
base to the total number of bases in the vector.
17 Part IV Detecting the Sequence
Result of the previous stages: billions of copies
of a unique DNA strand, of various lengths, each
marked with a color.
The goal: 1) to sort the strands according to
length, 2) to read the color of the pieces that
are the same length to determine the last base
for that length.
T C T A G G A G C A G T
T C T A G G T A C T G
T C T A G G T T C
T C T A G G A
18 Gel Electrophoresis Separates DNA Molecules of
Different Sizes
G Electrophoresis: a technique in which samples
are moved (pulled) through an acrylamide or
gelatin-like material using electrical voltage
and current.
19 Initiate Electrophoresis
We apply an electric field to the gel, positive
at one end and negative at the other. This causes
the negatively charged DNA pieces to migrate to
the positive end.
H.A. Explain why DNA is negatively charged.
20 Because of the gel's resistance,
the smaller pieces of DNA travel faster through
the gel than do the larger pieces. The movement
of the pieces is very precise. For example, a
piece that is 396 base
pairs long will move slightly faster than a piece
that is 397 base pairs long.
21 Detect copies of a DNA fragment of the exact
same length
How? They pass the detector at the same time.
22 Putting all four deoxynucleotides into the
picture
The sequence of the DNA is rather obvious if you
know the color codes: just read the colors from
bottom to top: GTGGACAT. Thus electrophoresis is
used to separate the resulting fragments by size
and we can 'read' the sequence from it.
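Reading the gel from bottom to top amounts to sorting the fragments by length and noting the color of each one's last base. A minimal sketch, assuming each band is given as a hypothetical `(length, terminal_base)` pair with one band per length:

```python
def read_gel(bands):
    """Return the sequence read from bands given as (length, terminal_base).

    The shortest fragments travel farthest, so reading bottom to top is the
    same as reading in order of increasing fragment length.
    """
    return "".join(base for _, base in sorted(bands))

# Hypothetical detector output, deliberately listed out of order.
bands = [(3, "G"), (1, "G"), (8, "T"), (2, "T"),
         (5, "A"), (4, "G"), (7, "A"), (6, "C")]

print(read_gel(bands))  # GTGGACAT
```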
23 Data Collection
Because the sequence of the primer is known,
the computer can determine the first base of
the cloned DNA strand and then the subsequent 1,
2, or 500 bases.
T C T A G G
T A C T G
24 DNA sequence to be determined
25 The sequence of the newly synthesized DNA (which
is deduced from the gel) is the complement of the
unknown strand.
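Recovering the unknown strand from the newly synthesized one is a base-by-base substitution. The sketch below uses the read GTGGACAT from the gel slide; the function names are illustrative. Because the two strands run in opposite directions, the complement is also reversed when the unknown strand is written in the conventional 5' to 3' direction.

```python
# Watson-Crick pairing: A with T, C with G.
PAIRING = str.maketrans("ACGT", "TGCA")

def complement(seq):
    """Base-by-base complement, written in the same left-to-right order."""
    return seq.translate(PAIRING)

def reverse_complement(seq):
    """Complementary strand written 5' to 3'."""
    return complement(seq)[::-1]

new_strand = "GTGGACAT"                # sequence deduced from the gel
print(complement(new_strand))          # CACCTGTA
print(reverse_complement(new_strand))  # ATGTCCAC
```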
26 A large-scale sequencing lab uses an automated
DNA sequencer, where the fragments are piped
through a tiny glass-fiber capillary during the
electrophoresis step, and they come out the far
end in size order. An ultraviolet laser checks
for bands of fluorescent colors.
There might be as many as 96 'lanes' of
samples running in one gel. A computer interprets
the colors and prints 700 or so nucleotides of
accurate sequence.
27 This is an example of what the sequencer's
computer shows us. The computer even interprets
the colors by printing the nucleotide sequence
across the top of the plot.
This is a fragment of the entire file, which
would span 900 or so nucleotides of accurate
sequence.
In addition to the nucleotide sequence text, the
automated sequencer also provides trace diagrams.
Trace diagrams are analyzed by base-calling
programs that use dynamic programming to match
predicted and observed peak intensity and peak
location.
H.A. Looking at the trace diagrams, which
indicate the probability that each base was
called correctly, find the bases with the lowest
probability.
28 Phred, a base-calling program for DNA sequence
traces, is widely used to assess sequence quality.
Phred reads DNA sequence chromatogram
files and analyzes the peaks to call bases.
After calling bases, Phred examines the peaks
around each base call to assign a quality score
to each base call. Quality scores range from 4 to
about 60, with higher values corresponding to
higher quality.
Phred quality score   Probability that the base is called wrong   Accuracy of the base call
10                    1 in 10                                      90%
20                    1 in 100                                     99%
40                    1 in 10,000                                  99.99%
The most commonly used measure of read quality is
to count the bases with a quality score of 20 and
above; the resulting number is often called the
"Phred20 score".
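The rows of the table follow the standard Phred relation Q = -10 log10(P), where P is the probability that the base call is wrong. A short sketch of that relation and of the Phred20 count follows; the function names and the example scores are illustrative and are not taken from the Phred program itself.

```python
import math

def error_probability(q):
    """Probability that a base with Phred quality q is called wrong."""
    return 10 ** (-q / 10)

def phred_quality(p_error):
    """Phred quality score for a given error probability."""
    return -10 * math.log10(p_error)

def phred20_score(qualities):
    """Number of bases in a read with quality 20 or above."""
    return sum(1 for q in qualities if q >= 20)

for q in (10, 20, 40):
    p = error_probability(q)
    print(q, p, f"{100 * (1 - p):.2f}% accurate")

read_qualities = [4, 19, 20, 35, 60]   # hypothetical per-base scores
print(phred20_score(read_qualities))   # 3
```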
29 Part V Assembly and Finishing
Assembly of 700-1000 Base Segments
After a small DNA segment has been read, the
question is:
How do we read a large DNA segment?
The answer: because we have many overlapping
pieces, we also have many starting points for the
4,000-base sequences, enough to allow us to
read every base.
30 Part V Assembly and Finishing
Sequence assembly is a classic bioinformatics
problem: to put the pieces together in the most
likely order by juxtaposing the overlapping
fragments in such a way as to assemble complete
stretches of DNA. Typical questions:
What is the minimum overlap length
(L, typically > 100 base pairs)?
What minimum percentage of similarity is required
in overlapping fragments (they should not differ
by more than 5%)?
The main problem of overlap detection is to
detect all pairs of fragments that have any
significant overlap. But we cannot afford to
compare every pair exhaustively, so several
heuristic algorithms have been introduced. Most
research institutes that sequence DNA use their
own software for assembling the sequences that
they produce.
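The two parameters above (minimum overlap length and maximum allowed difference) can be made concrete with a brute-force sketch. It is not any institute's assembler: the function names are hypothetical, the toy reads are far shorter than real ones, and `min_len` is set to 3 only so that the example fits on a slide.

```python
from itertools import permutations

def overlap_length(a, b, min_len=3, max_diff=0.05):
    """Longest suffix of a that matches a prefix of b.

    An overlap of k bases is accepted if k >= min_len and at most
    max_diff * k positions differ. Returns 0 if no overlap qualifies.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-k:], b[:k]))
        if mismatches <= max_diff * k:
            return k
    return 0

def all_overlaps(reads, min_len=3, max_diff=0.05):
    """Test every ordered pair of reads: the exhaustive, quadratic approach."""
    found = {}
    for a, b in permutations(reads, 2):
        k = overlap_length(a, b, min_len, max_diff)
        if k:
            found[(a, b)] = k
    return found

print(overlap_length("TCTAGGAGCAGT", "GCAGTTTACG"))  # 5
```

With n reads, `all_overlaps` performs about n squared comparisons, which is the cost that the heuristic algorithms mentioned above are meant to avoid.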
31 Assembly of 150,000-Base Segments
With the help of computers, we assemble the
500-base sequences into the 150,000-base segments
from which they were derived.
Rebuilding the Chromosomes
Finally, we determine the chromosome that the
150,000-base segment belongs to.
32 Here are different examples of overlapping
fragments.
H.A. There are two examples of overlapping:
a) end-to-end, b) in the middle only. Could we
use both types in the assembly procedure?
a) b)
33 The Problem of Genome Assembly
Genome assembly is a very difficult computational
problem, made more difficult because genomes
contain large numbers of identical sequences,
known as repeats. These repeats can be thousands
of nucleotides long, and some occur in thousands
of different locations, especially in large
genomes. Over 30 percent of the genome consists of
sequence that is repeated several times, so a
repeat overlap might also occur between fragments
that are millions of base pairs apart in the
genome. The process is a lot like assembling a
jigsaw puzzle: methodically placing puzzle pieces
next to each other to see if they fit together.
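To see why repeats mislead an assembler, it helps to list the short subsequences (k-mers) that occur more than once in a sequence: any read ending inside one of them can overlap reads from several places. The toy genome and the function name below are made up for illustration.

```python
from collections import Counter

def repeated_kmers(genome, k):
    """Return every k-mer that occurs more than once, with its count."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n > 1}

toy_genome = "ATGCGCGTTACCGCGTTAGG"   # hypothetical; CGCGTTA appears twice
print(repeated_kmers(toy_genome, 5))
# {'CGCGT': 2, 'GCGTT': 2, 'CGTTA': 2}
```

A read that ends in CGCGT could be followed either by TACC or by TAGG, so the overlap alone does not say which continuation is the right one.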
34 When is a genome sequence considered draft, and
when is it finished?
Summary of the report of the Second International
meeting on Human Genome Sequencing, 1997
Criteria for finished sequence:
Each nucleotide sequenced to 8-10x coverage
(error rate < 1/10,000); individual gaps must be
shorter than 150,000 bp. On this basis, only
about 25% of the human genome sequence was
finished when first published.
Finishing the euchromatic sequence of the human
genome, Nature, 2004
In this research, 95% of the genome was completed
to this level.
H.A. What other chromosomal material exists?
Why is our research focused on euchromatin?
35 In 2001, the sequence omitted 10% of the
euchromatic genome, had 150,000 gaps, and the
order/orientation of many segments was unknown.
In 2004, the sequence contains 2.85 Gb and covers
99% of the euchromatic genome, with
only 341 gaps and
an error rate of 10^-5.
H.A. Summarize the main steps of the two main
approaches for genome analysis. Underline the
differences between these two methods.
36 The Databases of DNA Sequences
GenBank,
operated by NCBI (National Center for
Biotechnology Information)
Contains all publicly available sequences of DNA,
with annotations
Same DNA sequence content as
EMBL (European Molecular Biology Laboratory)
DDBJ (DNA Data Bank of Japan)
Online Mendelian Inheritance in Man
A catalog of human genes and genetic disorders,
linked to gene entries in GenBank
H.A. Describe these two databases.