Title: BME280/CSE277/CSE377: Bioinformatics, Spring 2006
1BME280/CSE277/CSE377 Bioinformatics, Spring 2006
2Administrivia
- Lecture time: TTh 12:30-1:45pm
- Lecture place: Engineering II, Room 322
- Instructor: Ion Mandoiu
- Office: ITEB 261
- Tel: 6-3784
- E-mail: ion_at_engr.uconn.edu
- Office hours: MW 1-2pm
3Textbooks
- Neil C. Jones and Pavel A. Pevzner, An
Introduction to Bioinformatics Algorithms, MIT
Press, 2004. Textbook website: http://bioalgorithms.info/
(REQUIRED)
- D. Gusfield, Algorithms on Strings, Trees, and
Sequences, Cambridge University Press, 1997
(OPTIONAL)
4Grading
- 30% homework assignments
- Bi-weekly
- 30% programming projects
- Individual, 3-4 projects
- 40% final project
- Individual or teams of 2
- Written report and short presentation
- Possible topics
- Algorithm implementation and empirical study
- In-depth survey of a topic not covered in class
- Progress on open research problems
- Propose your own!
5What is Bioinformatics?
- Bioinformatics is generally defined as the
analysis, prediction, and modeling of biological
data with the help of computers
6Why Bioinformatics?
- DNA sequencing technologies have created massive
amounts of information that can only be
efficiently analyzed with computers
- Hundreds of species sequenced
- Human, rat, chimp, chicken, ...
- As the information grows ever larger and
more complex, more computational tools are needed
to sort through the data.
- Biology is becoming an information science!
- Slowly, we are learning how cells work through
comparative genomics, not unlike comparative
linguistics
7Bioinformatics Tools
- Bioinformatics problems involve multiple aspects
- Example: Sequence Comparison
- Biology: How are genes evolving? How is gene
function related to gene sequence?
- Learning/AI: How do we define "similar"? Can
we learn from examples?
- Algorithms: How can we efficiently find all
similar sequences?
- Statistics: How do we distinguish a random match
from a true one?
8Course Description
- Course emphasis
- Modeling computational problems arising in
biology as graph-theoretic, statistical, or
mathematical optimization problems
- Design, analysis, and implementation of efficient
algorithms
- Algorithmic techniques to be covered
- Exhaustive search
- Integer programming
- Greedy algorithms
- Dynamic programming
- Divide-and-conquer
- Graph algorithms
- Combinatorial pattern matching
- Clustering
- Hidden Markov models
- Randomized algorithms
9Course Description
- Biological applications
- Restriction mapping
- DNA sequencing
- Motif finding
- Pairwise sequence alignment
- Gene prediction
- Evolutionary trees
- Genome rearrangements
10Complete and return the survey!
11Basic Molecular Biology
12The Cell
Source: D. Geiger
All cells contain the same DNA, yet there are
many types of cells!
13Mendel and his Genes
- Genes: physical and functional traits passed on
from one generation to the next
- Discovered by Gregor Mendel in the 1860s while he
was experimenting with the pea plant. He asked
the question:
Do traits come from a blend of both parents'
traits or from only one parent?
14The Pea Plant Experiments
- Mendel discovered that genes were passed on to
offspring by both parents in two forms: dominant
and recessive.
- The dominant form would be the phenotypic
characteristic of the offspring
15DNA: The Code of Life
- The structure and the four genomic letters code
for all living organisms
- Adenine, Guanine, Thymine, and Cytosine, which
pair A-T and C-G on complementary strands.
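The pairing rule above can be sketched in a few lines of Python. This is only an illustration: the names `PAIR`, `complement`, and `reverse_complement` are chosen here and do not come from the course material, and the input is assumed to contain only the letters A, C, G, T.

```python
# Watson-Crick pairing as stated on the slide: A-T and C-G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIR[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, read in the opposite direction."""
    return complement(strand)[::-1]


print(complement("ATGCCG"))          # TACGGC
print(reverse_complement("ATGCCG"))  # CGGCAT
```

The reversal reflects that the two strands of a duplex run in opposite directions, so the complementary strand is conventionally written reversed.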
16DNA Components
Source: D. Geiger
17The Human Genome
Source: D. Geiger
18DNA Organization
Source: D. Geiger
19Genome Sizes
- E. coli (bacteria): 4.7 Mb (megabases)
- Yeast (simple fungi): 15 Mb
- Nematode (C. elegans): 100 Mb
- Mouse: 2 Gb (gigabases)
- Human: 3 Gb
- Wheat: 16.5 Gb
- Lily: 32-48 Gb
20Genes
- DNA strings contain
- Coding regions (genes)
- Control regions
- Junk DNA (unknown function)
- Estimated number of genes
- E. coli (bacteria): 4,000
- Yeast (simple fungi): 6,000
- Nematode (C. elegans): 13,000
- Human: 32,000 (?)
21Central Dogma
- Cells express different subsets of genes under
different environments
Gene --(Transcription)--> mRNA --(Translation)--> Protein
22Gene Transcription
Source: D. Geiger
- RNA is similar to DNA, but has
- slightly different backbone
- Uracil (U) instead of Thymine (T)
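As a rough illustration of the T-to-U difference, the sketch below turns a DNA string into the corresponding RNA string. It assumes the input is the coding (sense) strand, so the mRNA has the same letters with U in place of T. The function name `transcribe` is hypothetical.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA spelled like the coding strand, with U replacing T."""
    return coding_strand.upper().replace("T", "U")


print(transcribe("ATGGCCTAA"))  # AUGGCCUAA
```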
23RNA Roles
Source: D. Geiger
24Translation
- Catalyzed by the ribosome
- Using two different sites, the ribosome
continually binds tRNA, joins the amino acids
together, and moves to the next location along the
mRNA
- 10 codons/second, but multiple translations can
occur simultaneously
http://wong.scripps.edu/PIX/ribosome.jpg
25Genetic Code
- Human cells produce approx. 100,000 proteins
- Proteins are polypeptides consisting of
70-3,000 amino acids
- There are 20 different amino acids; every 3
nucleotides in a gene encode 1 amino acid (or
the STOP signal)
Source: D. Geiger
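The codon rule above (3 nucleotides per amino acid, or STOP) can be sketched as a table lookup. The example below assumes the standard genetic code, written as one letter per codon in T, C, A, G order with `*` for STOP, and assumes the input is a coding-strand DNA string read from its first base. The names `CODON_TABLE` and `translate` are illustrative only.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; "*" marks STOP.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate codon by codon until a STOP codon or the end of the string."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":  # STOP signal: translation ends here
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCTGGTAAGGG"))  # MAW
```

The table has 64 entries mapping onto 20 amino acids plus STOP, which is why several codons share the same letter.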
26Protein Folding
- Proteins are not linear structures, though they
are built that way
- The amino acids have very different chemical
properties; they interact with each other after
the protein is built
- This causes the protein to start folding and
adopt its functional structure
- Proteins may fold in reaction to some ions, and
several separate chains of peptides may join
together through their hydrophobic and
hydrophilic amino acids to form a polymer
27Protein Folding (contd)
- The structure that a protein adopts is vital to
its chemistry
- Its structure determines which of its amino acids
are exposed to carry out the protein's function
- Its structure also determines what substrates it
can react with
28Protein Structure
Source: D. Geiger
29Basic Molecular Biotechnology: How is information
accessed at the molecular level?
30Operations on DNA/RNA
- Amplification (making many copies)
- Cutting into shorter fragments
- Reading fragment lengths
- Reading DNA sequence
- Probing presence of specific fragments
31Why we need so many copies
- Biologists needed to find a way to read DNA
codes.
- How do you read base pairs that are angstroms in
size?
- It is not possible to look at DNA directly
because of its small size.
- Need to use chemical techniques to detect what
you are looking for.
- To read something so small, you need a lot of it,
so that you can actually detect the chemistry.
- Need a way to make many copies of the base pairs,
and a method for reading the pairs.
32Polymerase Chain Reaction
- Problem: Modern instrumentation cannot easily
detect single molecules of DNA, making
amplification a prerequisite for further analysis
- Solution: PCR doubles the number of DNA fragments
at every iteration
1 -> 2 -> 4 -> 8 -> ...
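The doubling can be made concrete with a small calculation. The sketch below assumes ideal PCR, where every fragment is copied in every cycle. Real reactions are less efficient, so the numbers are upper bounds. The function names are made up for this example.

```python
def copies_after(cycles: int, initial_copies: int = 1) -> int:
    """Number of fragments after the given number of ideal doubling cycles."""
    return initial_copies * 2 ** cycles


def cycles_needed(target_copies: int, initial_copies: int = 1) -> int:
    """Smallest number of ideal cycles that reaches at least target_copies."""
    cycles = 0
    while copies_after(cycles, initial_copies) < target_copies:
        cycles += 1
    return cycles


print([copies_after(n) for n in range(4)])  # [1, 2, 4, 8]
print(copies_after(30))                     # 1073741824
print(cycles_needed(10 ** 6))               # 20
```

Under this idealized model, about 30 cycles turn a single molecule into roughly a billion copies.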
33Denaturation
Raise temperature to 94°C to separate the duplex
form of DNA into single strands
34Design primers
- To perform PCR, a 10-20 bp sequence on either side
of the sequence to be amplified must be known,
because DNA polymerase requires a primer to
synthesize a new strand of DNA
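To see why the flanking sequences must be known, here is a minimal "in silico PCR" sketch. It assumes exact primer matches, a single-stranded template string, a forward primer that matches the template directly, and a reverse primer given as the reverse complement of the far flank. All names (`amplicon`, `fwd_primer`, `rev_primer`) are hypothetical, and real primer design involves melting temperatures and mismatches that are ignored here.

```python
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Complement each base and reverse the result."""
    return "".join(PAIR[base] for base in reversed(strand))


def amplicon(template: str, fwd_primer: str, rev_primer: str):
    """Return the template region bounded by the two primers, or None."""
    start = template.find(fwd_primer)
    if start == -1:
        return None
    # The reverse primer binds the opposite strand, so on the template it
    # appears as its reverse complement, downstream of the forward primer.
    rev_site = reverse_complement(rev_primer)
    end = template.find(rev_site, start + len(fwd_primer))
    if end == -1:
        return None
    return template[start:end + len(rev_site)]


template = "TTTT" + "ACGTACGTAC" + "GGGGCCCCAAAA" + "TGCATGCATG" + "TTTT"
print(amplicon(template, "ACGTACGTAC", "CATGCATGCA"))
```

Only the two flanks need to be known in advance; the sequence between them is recovered as part of the amplified product.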
35Annealing
- Anneal primers at 50-65°C
37Extension
- Extend primers: raise temperature to 72°C, allowing
Taq polymerase to attach at each priming site and
extend a new DNA strand
39Repeat
- Repeat the denaturation, annealing, and extension
steps at their respective temperatures
40Polymerase Chain Reaction
41Restriction Enzymes
- Discovered in the early 1970s
- Used as a defense mechanism by bacteria to break
down the DNA of attacking viruses.
- They cut the DNA into small fragments.
- Can also be used to cut the DNA of organisms.
- This allows the DNA sequence to be handled in
more manageable bite-size pieces.
- It is then possible, using standard purification
techniques, to single out certain fragments and
duplicate them to macroscopic quantities.
42Molecular Scissors
Source: Molecular Cell Biology, 4th edition, fig. 9-10
43Discovering Restriction Enzymes
- HindII: first restriction enzyme, discovered by
Hamilton Smith in 1970
- From the bacterium Haemophilus influenzae
- Discovered accidentally while studying how the
bacterium Haemophilus influenzae takes up DNA
from the phage virus P22
- Recognizes and cuts DNA at sequences
- GTGCAC
- GTTAAC
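A digest with the two recognition sequences listed above might be sketched as follows. The cut position is an assumption made for this example: `cut_offset=3` places the cut in the middle of each 6-base site, which the slide does not state. The function names are also hypothetical.

```python
SITES = ("GTGCAC", "GTTAAC")  # recognition sequences listed on the slide


def find_sites(dna: str, sites=SITES):
    """Return sorted (position, site) pairs for every occurrence of a site."""
    hits = []
    for site in sites:
        pos = dna.find(site)
        while pos != -1:
            hits.append((pos, site))
            pos = dna.find(site, pos + 1)
    return sorted(hits)


def digest(dna: str, sites=SITES, cut_offset: int = 3):
    """Cut dna cut_offset bases into each site and return the fragments.

    cut_offset=3 (mid-site) is an assumption of this sketch.
    """
    cuts = sorted({pos + cut_offset for pos, _ in find_sites(dna, sites)})
    bounds = [0] + cuts + [len(dna)]
    return [dna[a:b] for a, b in zip(bounds, bounds[1:])]


dna = "AAAGTGCACTTTGTTAACGG"
print(find_sites(dna))  # [(3, 'GTGCAC'), (12, 'GTTAAC')]
print(digest(dna))      # ['AAAGTG', 'CACTTTGTT', 'AACGG']
```

Joining the fragments in order always gives back the original string, which is a quick sanity check on any digest routine.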
44Recognition Sites of Restriction Enzymes
45Separating DNA by Size
- Gel electrophoresis is a process for separating
DNA by size
- Can separate DNA fragments that differ in length
by only 1 nucleotide, for fragments up to 500
nucleotides long
46Gel Electrophoresis
- DNA fragments are injected into a gel positioned
in an electric field
- DNA is negatively charged near neutral pH
- The ribose phosphate backbone of each nucleotide
is acidic, so DNA has an overall negative charge
- DNA molecules move towards the positive electrode
47Gel Electrophoresis (contd)
- DNA fragments of different lengths are separated
according to size
- Smaller molecules move through the gel matrix
more readily than larger molecules
- The gel matrix restricts random diffusion so
molecules of different lengths separate into bands
48Detecting DNA: Autoradiography
- One way to visualize separated DNA bands on a gel
is autoradiography
- The DNA is radioactively labeled
- The gel is laid against a sheet of photographic
film in the dark, exposing the film at the
positions where the DNA is present.
49Detecting DNA: Fluorescence
- Another way to visualize DNA bands in gel is
fluorescence
- The gel is incubated with a solution containing
the fluorescent dye ethidium
- Ethidium binds to the DNA
- The DNA lights up when the gel is exposed to
ultraviolet light.
50Gel Electrophoresis Example
Direction of DNA movement
Smaller fragments travel farther
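As a toy model of the gel, fragments of equal length can be grouped into one band and the bands listed by how far they travel. The sketch deliberately ignores real migration physics and only encodes the ordering rule "smaller fragments travel farther". The name `gel_bands` is made up here.

```python
from collections import Counter


def gel_bands(fragments):
    """Return (length, count) bands, farthest-travelled (shortest) first."""
    counts = Counter(len(fragment) for fragment in fragments)
    return sorted(counts.items())


print(gel_bands(["AAAGTG", "CACTTTGTT", "AACGG", "TTTTT"]))
# [(5, 2), (6, 1), (9, 1)]
```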
51Sequencing
- Biologists can reliably find the sequence of
A/C/T/G for short strings (a few hundred
nucleotides)
- Chain termination
- Single strand template
- Complementary strand synthesis blocked with small
probability at particular nucleotides
- Lengths of fragments read for each class of
strings
53Sequencing
ATACGGA ATACGG ATACG ATAC ATA AT A
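The read-out step can be sketched with the prefixes shown above. This is a simplified model with two assumptions: it works directly on the synthesized strand rather than on the template, and every prefix length is taken to be observed exactly once. The names `termination_lanes` and `read_sequence` are illustrative.

```python
def termination_lanes(strand: str):
    """Group the lengths of all terminated prefixes by their last base."""
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(strand, start=1):
        lanes[base].append(length)
    return lanes


def read_sequence(lanes) -> str:
    """Rebuild the strand by visiting fragment lengths from short to long."""
    by_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(by_length[n] for n in sorted(by_length))


lanes = termination_lanes("ATACGGA")
print(lanes)                 # {'A': [1, 3, 7], 'C': [4], 'G': [5, 6], 'T': [2]}
print(read_sequence(lanes))  # ATACGGA
```

Each class of strings (fragments ending in A, C, G, or T) contributes only lengths. Sorting all lengths and noting which class each belongs to spells out the sequence.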
54Sequencing
55 DNA Hybridization
- Single-stranded DNA will naturally bind to
complementary strands
- Hybridization is used to locate genes, regulate
gene expression, and determine the degree of
similarity between DNA from different sources
- Hybridization is also referred to as annealing or
renaturation
56Microarray Technologies
- Oligonucleotide arrays
- Short (20-60bp) synthetic DNA strands
- Arrays of cDNAs
- Obtained by reverse transcription from Expressed
Sequence Tags (ESTs)
57DNA Array Hybridization Experiment
Images courtesy of Affymetrix.
58 Two-Color Technique
- Sample labeled RED
- Control labeled GREEN
- YELLOW: probes hybridize to both sample and
control
- BLACK: probes hybridize to neither
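The color rule can be written as a small decision function. The sketch assumes each spot has a sample (red) and a control (green) signal normalized to the range 0 to 1, and that a signal at or above a cutoff counts as hybridization. The cutoff value 0.5 and the name `spot_color` are arbitrary choices for this illustration, not part of the technique.

```python
def spot_color(sample_signal: float, control_signal: float,
               threshold: float = 0.5) -> str:
    """Classify a spot from its red (sample) and green (control) signals."""
    in_sample = sample_signal >= threshold
    in_control = control_signal >= threshold
    if in_sample and in_control:
        return "YELLOW"  # probe hybridized to both
    if in_sample:
        return "RED"     # sample only
    if in_control:
        return "GREEN"   # control only
    return "BLACK"       # neither


print(spot_color(0.9, 0.8))  # YELLOW
print(spot_color(0.9, 0.1))  # RED
```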
59Sequencing by Hybridization
- Exploits parallel hybridization capabilities
offered by DNA arrays
- ALL probes of a certain length k (k = 8 to 10) are
synthesized on the array
- Target DNA hybridizes at locations which store
probes complementary to its k-substrings
- Sequencing by Hybridization (SBH) Problem:
Reconstruct target DNA given its k-length
substrings (spectrum)
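A minimal sketch of the SBH problem follows, under two stated assumptions: the spectrum is treated as a multiset of k-substrings (repeats are kept), and reconstruction uses plain exhaustive search, which is exponential in the worst case and only meant for tiny inputs. The names `spectrum` and `reconstruct` are chosen for this example.

```python
def spectrum(dna: str, k: int):
    """Return the sorted multiset of all k-length substrings of dna."""
    return sorted(dna[i:i + k] for i in range(len(dna) - k + 1))


def reconstruct(kmers):
    """Find one string whose spectrum equals kmers, or None if none exists."""
    kmers = sorted(kmers)

    def extend(path, remaining):
        if not remaining:
            return path
        for i, kmer in enumerate(remaining):
            # The next k-mer must overlap the previous one in k-1 letters.
            if path[-1][1:] == kmer[:-1]:
                found = extend(path + [kmer], remaining[:i] + remaining[i + 1:])
                if found is not None:
                    return found
        return None

    for i, start in enumerate(kmers):
        path = extend([start], kmers[:i] + kmers[i + 1:])
        if path is not None:
            # Spell the string: first k-mer, then the last letter of each next one.
            return path[0] + "".join(kmer[-1] for kmer in path[1:])
    return None


print(spectrum("ATGC", 2))             # ['AT', 'GC', 'TG']
print(reconstruct(["AT", "GC", "TG"]))  # ATGC
```

A spectrum does not always determine the target uniquely: different strings can share the same k-substrings, so the search returns one consistent answer rather than the answer.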
60Operations on Proteins
- Cloning in expression vectors
- 2-Dimensional gel electrophoresis: separate
proteins by molecular weight/pH gradient
- Antibody techniques (immunoprecipitation,
antibody arrays, ...)
- Mass spectrometry (e.g., MALDI-TOF)
61Active research problems
- Genome projects have already given draft genome
sequences for hundreds of species, but lots of
questions remain to be answered
- Create a complete parts list: gene sequences
(including intron/exon structure), transcription
factors, ...
- Understand function of each part, e.g., protein
structure, protein/DNA and protein/protein
interactions
- Understand mechanisms, e.g., pathways
- Understand how everything fits together: systems
biology