Title: Introduction to bioinformatics Lecture 2 Genes and Genomes
1Introduction to bioinformatics, Lecture 2: Genes and Genomes
2Organisational
- Course website: http://ibi.vu.nl/teaching/mnw_2year/mnw2_2007.php
- or click on http://ibi.vu.nl (> teaching > Introduction to Bioinformatics)
- Course book: Bioinformatics and Molecular Evolution by Paul G. Higgs and Teresa K. Attwood (Blackwell Publishing), 2005, ISBN (Pbk) 1-4051-0683-2
- Lots of information about Bioinformatics can be found on the web.
3DNA sequence
.....acctc ctgtgcaaga acatgaaaca nctgtggttc
tcccagatgg gtcctgtccc aggtgcacct gcaggagtcg
ggcccaggac tggggaagcc tccagagctc aaaaccccac
ttggtgacac aactcacaca tgcccacggt gcccagagcc
caaatcttgt gacacacctc ccccgtgccc acggtgccca
gagcccaaat cttgtgacac acctccccca tgcccacggt
gcccagagcc caaatcttgt gacacacctc ccccgtgccc
ccggtgccca gcacctgaac tcttgggagg accgtcagtc
ttcctcttcc ccccaaaacc caaggatacc cttatgattt
cccggacccc tgaggtcacg tgcgtggtgg tggacgtgag
ccacgaagac ccnnnngtcc agttcaagtg gtacgtggac
ggcgtggagg tgcataatgc caagacaaag ctgcgggagg
agcagtacaa cagcacgttc cgtgtggtca gcgtcctcac
cgtcctgcac caggactggc tgaacggcaa ggagtacaag
tgcaaggtct ccaacaaagc aaccaagtca gcctgacctg
cctggtcaaa ggcttctacc ccagcgacat cgccgtggag
tgggagagca atgggcagcc ggagaacaac tacaacacca
cgcctcccat gctggactcc gacggctcct tcttcctcta
cagcaagctc accgtggaca agagcaggtg gcagcagggg
aacatcttct catgctccgt gatgcatgag gctctgcaca
accgctacac gcagaagagc ctctc.....
4Genome size
Organism                    Number of base pairs
φX-174 virus                5,386
Epstein-Barr virus          172,282
Mycoplasma genitalium       580,000
Haemophilus influenzae      1.8 × 10^6
Yeast (S. cerevisiae)       12.1 × 10^6
Human                       3.2 × 10^9
Wheat                       16 × 10^9
Lilium longiflorum          90 × 10^9
Salamander                  100 × 10^9
Amoeba dubia                670 × 10^9
5Four DNA nucleotide building blocks
G-C is more strongly hydrogen-bonded than A-T
6A gene codes for a protein
CCTGAGCCAACTATTGATGAA
CCUGAGCCAACUAUUGAUGAA
PEPTIDE
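The DNA, mRNA and peptide lines on this slide can be reproduced in a few lines of Python. This is a minimal sketch: the function names are illustrative, and the codon dictionary holds only the seven codons that occur in this example, not the complete genetic code.

```python
# Only the codons that occur in this example; NOT a complete genetic code.
CODONS = {"CCU": "P", "GAG": "E", "CCA": "P", "ACU": "T",
          "AUU": "I", "GAU": "D", "GAA": "E"}

def transcribe(coding_dna: str) -> str:
    """mRNA for a coding (sense) strand: same sequence with T replaced by U."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Translate codon by codon, starting at the first base."""
    return "".join(CODONS[mrna[i:i + 3]] for i in range(0, len(mrna) - 2, 3))

dna = "CCTGAGCCAACTATTGATGAA"
mrna = transcribe(dna)
print(mrna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(mrna))  # PEPTIDE
```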
7Central Dogma of Molecular Biology
Transcription
Translation
Replication
DNA
mRNA
Protein
Transcription is carried out by RNA polymerase (II)
Translation is performed on ribosomes
Replication is carried out by DNA polymerase
Reverse transcriptase copies RNA into DNA
Transcription + Translation = Expression
8But DNA can also be transcribed into non-coding RNA
- tRNA (transfer): transfer of amino acids to the ribosome during protein synthesis.
- rRNA (ribosomal): essential component of the ribosomes (complex with rProteins).
- snRNA (small nuclear): mainly involved in RNA splicing (removal of introns). snRNPs.
- snoRNA (small nucleolar): involved in chemical modifications of ribosomal RNAs and other RNAs. snoRNPs.
- SRP RNA (signal recognition particle): forms an RNA-protein complex involved in protein secretion.
- Further: microRNA, eRNA, gRNA, tmRNA etc.
9Eukaryotes have spliced genes
- Promoter: involved in transcription initiation (TF/RNApol-binding sites)
- TSS: transcription start site
- UTRs: un-translated regions (important for translational control)
- Exons will be spliced together by removal of the introns
- Poly-adenylation site: important for transcription termination (but also mRNA stability, mRNA export from the nucleus etc.)
10DNA makes mRNA makes Protein
11DNA makes RNA makes Protein
yet another picture to appreciate the above
statement
12Some facts about human genes
- There are about 20,000-25,000 genes in the human genome (about 3% of the genome)
- Average gene length is 8,000 bp
- Average of 5-6 exons per gene
- Average exon length is 200 bp
- Average intron length is 2000 bp
- 8% of the genes have a single exon
- Some exons can be as small as 1 or 3 bp
13DMD the largest known human gene
- The largest known human gene is DMD, the gene that encodes dystrophin: 2.4 million bp over 79 exons
- X-linked recessive disease (affects boys)
- Two variants: Duchenne-type (DMD) and Becker-type (BMD)
- Duchenne-type: more severe, frameshift mutations
- Becker-type: milder phenotype, in-frame mutations
Posture changes during progression of Duchenne
muscular dystrophy
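The frameshift versus in-frame distinction on this slide comes down to whether the number of coding bases lost is a multiple of three. The sketch below illustrates only that arithmetic; the function name is hypothetical, and it assumes a deletion that lies entirely within coding sequence, ignoring splice sites and everything else that matters in real cases.

```python
def deletion_effect(deleted_bp: int) -> str:
    """Classify a coding-sequence deletion by its length alone.

    A multiple of 3 removes whole codons and keeps the downstream reading
    frame; any other length shifts the frame for all downstream codons.
    """
    if deleted_bp <= 0:
        raise ValueError("deleted_bp must be a positive number of bases")
    return "in-frame" if deleted_bp % 3 == 0 else "frameshift"

for n in (1, 3, 150, 151):
    print(n, deletion_effect(n))
```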
14Nucleic acid basics
- Nucleic acids are polymers
nucleotide
nucleoside
- Each monomer consists of 3 moieties
15Nucleic acid basics (2)
- Purines and Pyrimidines can base-pair (Watson-Crick pairs)
Watson and Crick, 1953
16Nucleic acid as hetero-polymers
(Ribose sugar, RNA precursor)
(2'-deoxyribose sugar, DNA precursor)
- REMEMBER
- DNA: deoxyribonucleotides; RNA: ribonucleotides (OH-groups at the 2' position)
- Note the directionality of DNA (5'-3' and 3'-5') or RNA (5'-3')
- DNA: A, G, C, T; RNA: A, G, C, U
(2'-deoxythymidine triphosphate, nucleotide)
17So
RNA
18Stability of base-pairing
- C-G base pairing is more stable than A-T (A-U) base pairing (why?)
- 3rd codon position has freedom to evolve (synonymous mutations)
- Species can therefore optimise their G-C content (e.g. thermophiles are GC rich) (consequences for codon use?)
Thermocrinis ruber, heat-loving bacteria
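One answer to the "why?" on this slide is that a G-C pair forms three hydrogen bonds while an A-T (or A-U) pair forms two. The sketch below simply counts those bonds for a strand paired with its perfect complement. The function name is illustrative, and the count is only a rough proxy for stability, since base stacking also contributes.

```python
# Watson-Crick hydrogen bonds contributed by each base in a perfect duplex.
H_BONDS = {"G": 3, "C": 3, "A": 2, "T": 2, "U": 2}

def duplex_hydrogen_bonds(strand: str) -> int:
    """Total hydrogen bonds when `strand` pairs with its exact complement."""
    return sum(H_BONDS[base] for base in strand.upper())

print(duplex_hydrogen_bonds("GCGCGCGC"))  # GC-only 8-mer
print(duplex_hydrogen_bonds("ATATATAT"))  # AT-only 8-mer
```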
20Single Letter Code
DNA codons                      Single Letter Code   Amino Acid
ATT, ATC, ATA                   I                    Isoleucine
CTT, CTC, CTA, CTG, TTA, TTG    L                    Leucine
GTT, GTC, GTA, GTG              V                    Valine
TTT, TTC                        F                    Phenylalanine
ATG                             M, Start             Methionine
TGT, TGC                        C                    Cysteine
GCT, GCC, GCA, GCG              A                    Alanine
GGT, GGC, GGA, GGG              G                    Glycine
CCT, CCC, CCA, CCG              P                    Proline
ACT, ACC, ACA, ACG              T                    Threonine
TCT, TCC, TCA, TCG, AGT, AGC    S                    Serine
TAT, TAC                        Y                    Tyrosine
TGG                             W                    Tryptophan
CAA, CAG                        Q                    Glutamine
AAT, AAC                        N                    Asparagine
CAT, CAC                        H                    Histidine
GAA, GAG                        E                    Glutamic acid
GAT, GAC                        D                    Aspartic acid
AAA, AAG                        K                    Lysine
CGT, CGC, CGA, CGG, AGA, AGG    R                    Arginine
TAA, TAG, TGA                   Stop                 Stop codons
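The codon table on this slide can be typed in as a dictionary and then queried. The sketch below, with illustrative names, inverts the table into a codon-to-amino-acid lookup and estimates how often a single-base change at each codon position leaves the amino acid unchanged, which is one way to see why the 3rd codon position has the most freedom to evolve. Stop codons are written as "*" here, and a change from one stop codon to another is counted as synonymous; both are choices made for this example.

```python
# The slide's table: one-letter code -> codons ("*" stands for Stop).
CODONS_BY_AMINO_ACID = {
    "I": "ATT ATC ATA", "L": "CTT CTC CTA CTG TTA TTG", "V": "GTT GTC GTA GTG",
    "F": "TTT TTC", "M": "ATG", "C": "TGT TGC", "A": "GCT GCC GCA GCG",
    "G": "GGT GGC GGA GGG", "P": "CCT CCC CCA CCG", "T": "ACT ACC ACA ACG",
    "S": "TCT TCC TCA TCG AGT AGC", "Y": "TAT TAC", "W": "TGG", "Q": "CAA CAG",
    "N": "AAT AAC", "H": "CAT CAC", "E": "GAA GAG", "D": "GAT GAC",
    "K": "AAA AAG", "R": "CGT CGC CGA CGG AGA AGG", "*": "TAA TAG TGA",
}

# Inverted lookup: codon -> one-letter code.
CODON_TO_AA = {codon: aa
               for aa, codons in CODONS_BY_AMINO_ACID.items()
               for codon in codons.split()}

def synonymous_fraction(position: int) -> float:
    """Fraction of single-base changes at a codon position (0, 1 or 2)
    that leave the encoded amino acid (or stop signal) unchanged."""
    same = total = 0
    for codon, aa in CODON_TO_AA.items():
        for base in "ACGT":
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            same += CODON_TO_AA[mutant] == aa
    return same / total

print(len(CODON_TO_AA))  # all 64 codons are covered
for pos in range(3):
    print(pos + 1, round(synonymous_fraction(pos), 3))
```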
21DNA compositional biases
- Base compositions of genomes: GC (and therefore also AT) content varies between different genomes
- The GC-content is sometimes used to classify organisms in taxonomy
- High GC content bacteria: Actinobacteria, e.g. in Streptomyces coelicolor it is 72%
- Low GC content: Plasmodium falciparum (about 20%)
- Other examples
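GC content is just the fraction of G and C among the bases of a sequence. A minimal sketch follows; the function name is illustrative, and ignoring ambiguous symbols such as n is a choice made here, not something the slides specify.

```python
def gc_content(seq: str) -> float:
    """Fraction of G+C among unambiguous bases.

    Anything that is not A, C, G, T or U (spaces, 'n', dots) is ignored.
    """
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    at = sum(seq.count(base) for base in "ATU")
    if gc + at == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return gc / (gc + at)

# First line of the DNA sequence shown earlier in the lecture.
fragment = "acctc ctgtgcaaga acatgaaaca nctgtggttc"
print(f"{gc_content(fragment):.1%}")
```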
22Genetic diseases cystic fibrosis
- Known since very early on (Celtic gene)
- Autosomal, recessive, hereditary disease (Chr. 7)
- Symptoms
- Exocrine glands (which produce sweat and mucus)
- Abnormal secretions
- Respiratory problems
- Reduced fertility and (male) anatomical anomalies
23cystic fibrosis (2)
- Gene product: CFTR (cystic fibrosis transmembrane conductance regulator)
- CFTR is an ABC (ATP-binding cassette) transporter or traffic ATPase.
- These proteins transport molecules such as sugars, peptides, inorganic phosphate, chloride, and metal cations across the cellular membrane.
- CFTR transports chloride ions (Cl-) across the membranes of cells in the lungs, liver, pancreas, digestive tract, reproductive tract, and skin.
24cystic fibrosis (3)
- CF gene CFTR has a 3-bp deletion leading to Del508 (Phe) in the 1480 aa protein (epithelial Cl- channel)
- Protein is degraded in the endoplasmic reticulum (ER) instead of being inserted into the cell membrane
Theoretical Model of NBD1. PDB identifier 1NBD as viewed in Protein Explorer http://proteinexplorer.org
Diagram depicting the five domains of the CFTR
membrane protein (Sheppard 1999).
The deltaF508 deletion is the most common cause
of cystic fibrosis. The isoleucine (Ile) at amino
acid position 507 remains unchanged because both
ATC and ATT code for isoleucine
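The deletion can be traced base by base. The sketch below assumes the commonly quoted CFTR coding sequence around codons 506-510 (ATC ATC TTT GGT GTT); that flanking sequence and all names in the code are assumptions made for illustration, not taken from the slide. The dictionary covers only the codons needed here.

```python
# Only the codons needed for this example; NOT a complete genetic code.
CODONS = {"ATC": "I", "ATT": "I", "TTT": "F", "GGT": "G", "GTT": "V"}

def translate(dna: str) -> str:
    """Translate codon by codon, starting at the first base."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

# Assumed codons 506-510 of normal CFTR: Ile Ile Phe Gly Val.
normal = "ATCATCTTTGGTGTT"

# Remove the three bases CTT: the last base of codon 507 and the
# first two bases of codon 508 (string indices 5, 6 and 7).
mutant = normal[:5] + normal[8:]

print(normal, translate(normal))  # ATCATCTTTGGTGTT IIFGV
print(mutant, translate(mutant))  # ATCATTGGTGTT IIGV
```

Codon 507 changes from ATC to ATT but still reads as isoleucine, the phenylalanine is lost, and the downstream frame is preserved because exactly three bases were removed.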
25Let's return to DNA and RNA structure
- Unlike the three-dimensional structures of proteins, DNA molecules assume simple double-helical structures independent of their sequences.
- Three kinds of double helix have been observed in DNA: type A, type B, and type Z, which differ in their geometries.
- RNA, on the other hand, can have structures as diverse as those of proteins, as well as a simple double helix of type A.
- The ability to be both informational and diverse in structure suggests that RNA was the prebiotic molecule that could function in both replication and catalysis (the RNA World Hypothesis).
- In fact, some viruses encode their genetic material in RNA (e.g. retroviruses)
26Three dimensional structures of double helices
Side view A-DNA, B-DNA, Z-DNA
Space-filling models of A, B and Z- DNA
Top view A-DNA, B-DNA, Z-DNA
27Major and minor grooves
28Forces that stabilize nucleic acid double helix
- There are two major forces that contribute to stability of helix formation
- Hydrogen bonding in base-pairing
- Hydrophobic interactions in base stacking
Same strand stacking
cross-strand stacking
29Types of DNA double helix
- Type A
- major conformation RNA
- minor conformation DNA
- Right-handed helix
- Type B
- major conformation DNA
- Right-handed helix
- Type Z
- minor conformation DNA
- Left-handed helix
30Secondary structures of Nucleic acids
- DNA is primarily in duplex form
- RNA is normally single-stranded and can adopt diverse secondary structures other than a duplex.
31Non B-DNA Secondary structures
Hoogsteen basepairs
Source: Van Dongen et al. (1999), Nature Structural Biology 6, 854-859
32More Secondary structures
- Cloverleaf rRNA structure
16S rRNA Secondary Structure Based on Phylogenetic Data
Source: Cornelis W. A. Pleij in Gesteland, R. F. and Atkins, J. F. (1993) THE RNA WORLD. Cold Spring Harbor Laboratory Press.
333D structures of RNA: transfer-RNA structures
- Secondary structure of tRNA (cloverleaf)
- Tertiary structure of tRNA
343D structures of RNA: ribosomal-RNA structures
- Secondary structure of large rRNA (16S)
- Tertiary structure of large rRNA subunit
353D structures of RNA: Catalytic RNA
- Secondary structure of self-splicing RNA
- Tertiary structure of self-splicing RNA
36Some structural rules
- Base-pairing is stabilizing
- Un-paired sections (loops) destabilize
- 3D conformation with interactions makes up for
this
37Three main principles
- DNA makes RNA makes Protein
- Structure more conserved than sequence
- Sequence → Structure → Function
38How to go from DNA to protein sequence
A piece of double stranded DNA:
5' attcgttggcaaatcgcccctatccggc 3'
3' taagcaaccgtttagcggggataggccg 5'
DNA direction is from 5' to 3'
39How to go from DNA to protein sequence
6-frame conceptual translation using the codon table:
5' attcgttggcaaatcgcccctatccggc 3'
3' taagcaaccgtttagcggggataggccg 5'
So, there are six possibilities to make a protein from an unknown piece of DNA, only one of which might be a natural protein
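The six possibilities are three reading frames on the strand as written plus three on its reverse complement. The sketch below applies this to the fragment on the slide. The names are illustrative; the table is the standard genetic code packed into a 64-letter string (bases ordered T, C, A, G; "*" for stop), and incomplete trailing codons are simply dropped.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ("*" = stop).
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna: str) -> str:
    """Opposite strand, written 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def translate(dna: str) -> str:
    """Translate from the first base; codons with ambiguous bases become X."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna: str) -> dict:
    """Conceptual translations for frames +1..+3 and -1..-3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

top = "attcgttggcaaatcgcccctatccggc"
for name, peptide in six_frames(top).items():
    print(name, peptide)
```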
40Remark
- Identifying (annotating) human genes, i.e. finding what they are and what they do, is a difficult problem
- First, the gene should be delineated on the genome
- Gene finding methods should be able to tell a gene region from a non-gene region
- Start, stop codons, further compositional differences
- Then, a putative function should be found for the gene located
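The simplest signal mentioned above, start and stop codons, can be turned into a naive open-reading-frame scan. This is only a sketch under strong assumptions: it looks at one strand, ignores introns and compositional signals, and uses illustrative names and an arbitrary minimum length. Real gene finders do far more.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna: str, min_codons: int = 2) -> list:
    """Return (start, end, frame) for each ATG...stop stretch on this strand.

    start is the index of the A in ATG; end is one past the stop codon.
    min_codons counts codons from ATG up to, but not including, the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ATG
    return orfs

example = "CCATGGCTTGATAA"  # made-up sequence for illustration
for start, end, frame in find_orfs(example):
    print(frame, example[start:end])
```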
41Evolution and three-dimensional protein structure
information
Isocitrate dehydrogenase: the distance from the active site (in yellow) determines the rate of evolution (red = fast evolution, blue = slow evolution)
Dean, A. M. and G. B. Golding, Pacific Symposium on Biocomputing 2000
42Genomic Data Sources
- DNA/protein sequence
- Expression (microarray)
- Proteome (X-ray, NMR, mass spectrometry)
- Metabolome
- Physiome (spatial, temporal)
Integrative bioinformatics
43Genomic Data Sources Vertical Genomics
genome
transcriptome
proteome
metabolome
physiome
Dinner discussion Integrative Bioinformatics
Genomics VU
44DNA makes RNA makes Protein (reminder)
45DNA makes RNA makes Protein: Expression data
- More copies of mRNA for a gene lead to more protein
- mRNA can now be measured for all the genes in a cell at once through microarray technology
- Can have 60,000 spots (genes) on a single gene chip
- Colour change gives intensity of gene expression (over- or under-expression)
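Over- and under-expression is often summarised as a log2 ratio of two intensities per spot. The sketch below assumes a two-colour design (sample channel versus reference channel), which the slide does not state; the gene names, intensities, function names and the two-fold threshold are all invented for illustration.

```python
import math

def log2_ratio(sample: float, reference: float) -> float:
    """log2(sample / reference); positive means higher in the sample."""
    if sample <= 0 or reference <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(sample / reference)

def expression_call(ratio: float, threshold: float = 1.0) -> str:
    """Label a spot using a symmetric log2 threshold (1.0 = two-fold)."""
    if ratio >= threshold:
        return "over-expressed"
    if ratio <= -threshold:
        return "under-expressed"
    return "unchanged"

# Hypothetical spots: name -> (sample intensity, reference intensity).
spots = {"geneA": (800.0, 200.0), "geneB": (100.0, 400.0), "geneC": (310.0, 300.0)}
for name, (sample, reference) in spots.items():
    ratio = log2_ratio(sample, reference)
    print(name, round(ratio, 2), expression_call(ratio))
```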
47Proteomics
- Elucidating all 3D structures of proteins in the cell
- This is also called Structural Genomics
- Finding out what these proteins do
- This is also called Functional Genomics
49Protein-protein interaction networks
50Metabolic networks: Glycolysis and Gluconeogenesis
KEGG database (Japan)
51High-throughput Biological Data
- Enormous amounts of biological data are being generated by high-throughput capabilities; even more are coming
- genomic sequences
- arrayCGH (Comparative Genomic Hybridization) data, gene expression data
- mass spectrometry data
- protein-protein interaction data
- protein structures
- ......
52Protein structural data explosion
Protein Data Bank (PDB): 14,500 structures (6 March 2001): 10,900 X-ray crystallography, 1,810 NMR, 278 theoretical models, others...
53Dickerson's formula: equivalent to Moore's law
n = e^{0.19(y-1960)}, with y the year.
On 27 March 2001 there were 12,123 3D protein structures in the PDB; Dickerson's formula predicts 12,066 (within 0.5%)!
54Sequence versus structural data
- Structural genomics initiatives are now in full swing and growth is still exponential.
- However, growth of sequence data is even more rapid. There are now more than 500 completely sequenced genomes publicly available.
- Increasing gap between structural and sequence data (Mind the gap)
55Bioinformatics
Diagram: scientific disciplines ordered from large, external (integrative) to small, internal (individual), under the headings Science and Human: Planetary Science, Cultural Anthropology, Population Biology, Sociology, Sociobiology, Psychology, Systems Biology, Biology, Medicine, Molecular Biology, Chemistry, Physics
56Bioinformatics
- Offers an ever more essential input to
- Molecular Biology
- Pharmacology (drug design)
- Agriculture
- Biotechnology
- Clinical medicine
- Anthropology
- Forensic science
- Chemical industries (detergent industries, etc.)