Title: BB30055: Genes and genomes
1 BB30055 Genes and genomes
- Genomes - Dr. MV Hejmadi (bssmvh)
- 3 broad areas
- (A) Genomes
- (B) Applications of genome projects
- (C) Genome evolution
2 Why sequence the genome?
- 3 main reasons
- A description of the sequence of every gene is valuable. It includes regulatory regions, which help in understanding not only the molecular activities of the cell but also the ways in which they are controlled.
- To identify and characterise important inheritable disease genes or bacterial genes (for industrial use)
- To understand the role of intergenic and other non-coding sequences, e.g. satellites, intronic regions etc.
3 History of Human Genome Project (HGP)
- 1953 DNA structure (Watson & Crick)
- 1972 Recombinant DNA (Paul Berg)
- 1977 DNA sequencing (Maxam, Gilbert and Sanger)
- 1985 PCR technology (Kary Mullis)
- 1986 automated sequencing (Leroy Hood & Lloyd Smith)
- 1988 IHGSC established (NIH, DOE), Watson leads
- 1990 IHGSC scaled up, BLAST published (Lipman & Myers)
- 1992 Watson quits, Venter sets up TIGR
- 1993 F Collins heads IHGSC, Sanger Centre (Sulston)
- 1995 cDNA microarray
- 1998 Celera Genomics (J Craig Venter)
- 2001 Working draft of human genome sequence published
- 2003 Finished sequence announced
4 Human Genome Project (HGP)
- Goal: obtain the entire DNA sequence of the human genome
- Players
- (A) International Human Genome Sequencing Consortium (IHGSC)
- - public funding, free access to all, started earlier
- - used the method of mapping overlapping clones
- (B) Celera Genomics
- - private funding, pay to view
- - started in 1998
- - used whole genome shotgun strategy
5 Whose genome is it anyway?
- (A) International Human Genome Sequencing Consortium (IHGSC)
- - composite from several different people, generated from 10-20 primary samples taken from numerous anonymous donors across racial and ethnic groups
- (B) Celera Genomics
- - 5 different donors (one of whom was J Craig Venter himself!)
6 Strategies for sequencing the human genome
7 Sequencing larger genomes
[Figure: mapping phase followed by sequencing phase]
8 Result
30,000-40,000 protein-coding genes estimated, based on known genes and predictions
                 IHGSC    Celera
definite genes   24,500   26,383
possible genes   5,000    12,000
9 Organisation of human genome
- Nuclear genome (3.2 Gbp)
- 24 types of chromosomes
- e.g. Y: 51 Mb and chr1: 279 Mb
10 General organisation of human genome
11 Polypeptide-coding regions
12 Rare bicistronic transcription units, e.g. UBA52 transcription generates ubiquitin and a ribosomal protein (L40; the related UBA80 gene yields ubiquitin and S27a)
Gene organisation
13 General organisation of human genome
14 Non-polypeptide-coding: RNA-encoding
15 Pseudogenes (ψ)
- Non-functional copies of exonic sequences of an active gene
- Thought to arise by genomic insertion of a cDNA as a result of retroposition
- Contribute to overall repetitive elements (<1%)
16 Processed pseudogenes
17 Pseudogenes in globin gene cluster
18 Gene fragments or truncated genes
- Gene fragments: small segments of a gene (e.g. single exon from a multiexon gene)
- Truncated genes: short components of functional genes (e.g. 5' or 3' end)
- Thought to arise due to unequal crossover or exchange
19 General organisation of human genome
20 Repetitive elements
- Main classes based on origin
- Tandem repeats
- Interspersed repeats
- Segmental duplications
21 1) Tandem repeats
- Blocks of tandem repeats at
- subtelomeres
- pericentromeres
- Short arms of acrocentric chromosomes
- Ribosomal gene clusters
22 Tandem / clustered repeats
Broadly divided into 4 types based on size
class            Size of repeat   Repeat block   Major chromosomal location
Satellite        5-171 bp         >100 kb        Centromeric heterochromatin
Minisatellite    9-64 bp          0.1-20 kb      Telomeres
Microsatellite   1-13 bp          <150 bp        Dispersed
HMG3 by Strachan and Read pp 265-268
23 Satellites
- Large arrays of repeats
- Some examples
- Satellite 1, 2 & 3
- α satellite (alphoid DNA)
- - found in all chromosomes
- β satellite
HMG3 by Strachan and Read pp 265-268
24 Minisatellites
- Moderate sized arrays of repeats
- Some examples
- Hypervariable minisatellite DNA
- - core of GGGCAGGAXG
- - found in telomeric regions
- - used in original DNA fingerprinting technique
by Alec Jeffreys
HMG3 by Strachan and Read pp 265-268
25 Microsatellites
- VNTRs: Variable Number of Tandem Repeats
- SSRs: Simple Sequence Repeats
- 1-13 bp repeats, e.g. (A)n, (AC)n
- 2% of genome (dinucleotides: 0.5%)
- Used as genetic markers (especially for disease mapping)
[Figure: individual genotype]
HMG3 by Strachan and Read pp 265-268
26 Microsatellite genotyping
- Design PCR primers unique to one locus in the genome
- A single pair of PCR primers will produce different sized products for each of the different length microsatellites
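The size difference between PCR products can be sketched in a few lines of Python. This is only an illustration: the function names, the flank lengths and the repeat counts are hypothetical and do not come from the lecture. It assumes the product length is simply the two flanking regions (primers included) plus the repeat block.

```python
def pcr_product_size(left_flank: int, right_flank: int, unit: str, n_repeats: int) -> int:
    """Length (bp) of the PCR product spanning one microsatellite allele."""
    # Flanks are fixed for a locus; only the repeat block varies between alleles.
    return left_flank + len(unit) * n_repeats + right_flank


def genotype_sizes(left_flank, right_flank, unit, allele_repeats):
    """Sorted product sizes for the alleles carried by one individual."""
    return sorted(
        pcr_product_size(left_flank, right_flank, unit, n) for n in allele_repeats
    )


# Hypothetical (AC)n locus with 60 bp and 55 bp flanks (assumed values).
# An individual heterozygous for 18 and 22 repeats gives two bands.
print(genotype_sizes(60, 55, "AC", (18, 22)))  # [151, 159]
```

With one primer pair, each allele length shows up as its own band, which is what makes the locus usable as a genetic marker.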
27 2) Interspersed repeats
- A.k.a. transposon-derived repeats
- 45% of genome
- Arise mainly as a result of transposition, either through a DNA or an RNA intermediate
28 Interspersed repeats (transposon-derived): major types
class            family            size      Copy number   % of genome
LINE             L1 (Kpn family)   6.4 kb    0.5 x 10^6    16.9
                 L2                          0.3 x 10^6    3.2
SINE             Alu               0.3 kb    1.1 x 10^6    10.6
LTR              e.g. HERV         1.3 kb    0.3 x 10^6    8.3
DNA transposon   mariner           0.25 kb   1-2 x 10^4    2.8
Updated from HGP publications
HMG3 by Strachan & Read pp 268-272
29 LINEs (long interspersed elements)
- Among the most ancient elements of eukaryotic genomes
- Autonomous transposition (reverse transcriptase)
- 6-8kb long
- Internal polymerase II promoter and 2 ORFs
- 3 related LINE families in humans
- - LINE-1, LINE-2, LINE-3
- Believed to be responsible for retrotransposition of SINEs and creation of processed pseudogenes
30 LINEs (long interspersed elements)
Nature (2001) pp879-880
HMG3 by Strachan Read pp268-272
31 SINEs (short interspersed elements)
- Non-autonomous (successful freeloaders! borrow RT from other sources such as LINEs)
- 100-300bp long
- Internal polymerase III promoter
- No proteins
- Share 3' ends with LINEs
- 3 related SINE families in humans
- - active Alu, inactive MIR and Ther2/MIR3
32 LINEs and SINEs have preferred insertion sites
- In this example, yellow represents the distribution of mys (a type of LINE) over a mouse genome where chromosomes are orange. There are more mys inserted in the sex (X) chromosomes.
33
- Try the link below to do an online experiment which shows how an Alu insertion polymorphism has been used as a tool to reconstruct the human lineage
- http://www.geneticorigins.org/geneticorigins/pv92/intro.html
34 Long Terminal Repeats (LTR)
- Repeats in the same orientation on both sides of the element, e.g. ATATATNNNNNNNATATAT
- Contain sequences that serve as transcription promoters as well as terminators.
- These sequences allow the element to code for an mRNA molecule that is processed and polyadenylated.
- At least two genes are coded within the element to supply essential activities for the retrotransposition mechanism.
- The RNA contains a specific primer binding site (PBS) for initiating reverse transcription.
- A hallmark of almost all mobile elements is that they form small direct repeats at the site of integration.
35 Long Terminal Repeats (LTR)
- Autonomous or non-autonomous
- Autonomous retroposons encode gag and pol genes, which encode the protease, reverse transcriptase, RNase H and integrase
Nature (2001) pp879-880
HMG3 by Strachan Read pp268-272
36 DNA transposons (lateral transfer?)
- DNA transposons
- Inverted repeats on both sides of element
- e.g. ATGCNNNNNNNNNNNCGTA
Nature (2001) pp879-880
From Genes VII by Lewin
37 3) Segmental duplications
- Closely related sequence blocks at different genomic loci
- Transfer of 1-200kb blocks of genomic sequence
- Segmental duplications can occur on homologous chromosomes (intrachromosomal) or non-homologous chromosomes (interchromosomal)
- Not always tandemly arranged
- Relatively recent
38 Segmental duplications
- Interchromosomal: segments duplicated among non-homologous chromosomes
- Intrachromosomal: duplications occur within a chromosome / arm
Nature Reviews Genetics 2, 791-800 (2001)
39 Segmental duplications
Segmental duplications in chromosome 22
40 Segmental duplications - chromosome 7
41 Nature Reviews Genetics 2, 791-800 (2001)
42 Major insights from the HGP
- Gene size, content and distribution
- Proteome content
- SNP identification
- Distribution of GC content
- CpG islands
- Recombination rates
- Repeat content
Nature (2001) 15th Feb Vol 409 special issue pgs
814 875-914.
43 1) Gene size
44 Gene content
- More genes: twice as many as Drosophila / C. elegans
- Uneven gene distribution: gene-rich and gene-poor regions
- More paralogs: some gene families have extended the number of paralogs, e.g. olfactory gene family has 1000 genes
- More alternative transcripts: increased RNA splice variants produced, thereby expanding the primary proteins by 5 fold (e.g. neurexin genes)
45 Gene distribution
- Genes generally dispersed (1 gene per 100kb)
- Class III complex at HLA 6p21.3
- Overlapping genes (transcribed from 2 DNA strands): rare
- Genes within genes, e.g. NF1 gene
HMG3 Fig 9.8
46 Uneven gene distribution
- Gene-rich regions
- - E.g. MHC on chromosome 6 has 60 genes with a GC content of 54%
- Gene-poor regions
- - 82 gene deserts identified
- - ? Large or unidentified genes
- What is the functional significance of these variations?
47 2) Proteome content
- Proteome more complex than invertebrates
- Protein domains (sections with identifiable shape/function)
- Domain arrangements in humans
- - largest total number of domains is 130
- - largest number of domain types per protein is 9
- - mostly identical arrangement of domains
[Figure: Protein X drawn as a string of domains of types A, B and C]
48 Proteome more complex than invertebrates
- No huge difference in domain number in humans
- BUT, frequency of domain sharing very high in human proteins (structural proteins and proteins involved in signal transduction and immune function)
- However, only 3 cases where a combination of 3 domain types is shared by human & yeast proteins
- e.g. carbamoyl-phosphate synthase (involved in the first 3 steps of de novo pyrimidine biosynthesis) has 7 domain types, which occurs once in human and yeast but twice in Drosophila
49 3) SNPs (single nucleotide polymorphisms)
- Sites that result from point mutations in individual base pairs
- Biallelic
- 60,000 SNPs lie within exons and untranslated regions (85% of exons lie within 5kb of a SNP)
- May or may not affect the ORF
- Most SNPs may be regulatory
- More than 1.4 million SNPs identified
- One every 1.9kb length on average
- Densities vary over regions and chromosomes
- e.g. HLA region has a high SNP density,
reflecting maintenance of diverse haplotypes over
many MYears
Nature (2001) 15th Feb Vol 409 special issue pgs
821-823 928
50 How does one distinguish sequence errors from polymorphisms?
- Sequence errors
- - Each piece of genome sequenced at least 10 times to reduce error rate (0.01%)
- Polymorphisms
- - Sequence variation between individuals is 0.1%
- - To be defined as a polymorphism, the altered sequence must be present in a significant proportion of the population
- - Rate of polymorphisms in diploid human genome is about 1 in 500 bp
Nature (2001) 15th Feb Vol 409 special issue pgs
821-823 928
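The idea of using repeated sequencing to separate random errors from real variation can be pictured with a small sketch. It is not the method used by either sequencing effort. The function name and the 0.3 threshold are hypothetical choices for illustration, and the population-frequency criterion for a true polymorphism is not modelled here.

```python
from collections import Counter


def classify_site(bases: str, min_variant_fraction: float = 0.3) -> str:
    """Classify one aligned position from the bases called in each read.

    `bases` holds one base call per read covering the position.
    `min_variant_fraction` is an assumed cut-off, not a published value.
    """
    ranked = Counter(bases).most_common()
    if len(ranked) == 1:
        return "consensus"  # every read agrees
    minor_fraction = ranked[1][1] / len(bases)
    if minor_fraction >= min_variant_fraction:
        return "candidate polymorphism"  # second base seen in many reads
    return "likely sequencing error"  # second base seen in very few reads


# Ten reads over each position, matching the 10-fold coverage on the slide.
print(classify_site("AAAAAAAAAA"))
print(classify_site("AAAAAAAAAG"))
print(classify_site("AAAAAGGGGG"))
```

A base seen in a single read out of ten is most simply explained as an error, while a base seen in about half the reads is a candidate for a real difference between the two copies of the genome.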
51 SNPs and disease
52 3) SNPs and risk of disease
N(291)S
53 3) SNPs and risk of disease
- Late-onset Alzheimer's disease (LOAD): Apolipoprotein E (APOE) ε4 haplotype is a genetic risk factor
- 3 major alleles (APO E2, E3 and E4)
- - APO E2: Cys112 / Cys158
- - APO E3: Cys112 / Arg158
- - APO E4: Arg112 / Arg158
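The three APOE alleles differ only at residues 112 and 158, so the allele can be read off from those two positions. A minimal lookup sketch follows. The dictionary and function names are hypothetical, and any other residue combination is simply reported as "unknown".

```python
# Residues at positions 112 and 158 for each major allele, as listed above.
APOE_ALLELES = {
    ("Cys", "Cys"): "E2",
    ("Cys", "Arg"): "E3",
    ("Arg", "Arg"): "E4",
}


def apoe_allele(res112: str, res158: str) -> str:
    """Return the APOE allele name for the residues at positions 112 and 158."""
    return APOE_ALLELES.get((res112, res158), "unknown")


print(apoe_allele("Arg", "Arg"))  # E4
```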
54 3) SNPs and pharmacogenomics
55 4) Distribution of GC content
- Genome wide average of 41%
- Huge regional variations exist
- - E.g. distal 48Mb of chromosome 1p has 47% but chromosome 13 has only 36%
- Confirms cytogenetic staining with G-bands (Giemsa)
- - dark G-bands: low GC content (37%)
- - light G-bands: high GC content (45%)
Nature (2001) 15th Feb Vol 409 special issue pg
876-877
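Regional GC variation of this kind is usually measured in fixed windows along the sequence. A minimal sketch, assuming non-overlapping windows and ignoring any base that is not A, C, G or T; the function names, window size and toy sequence are illustrative only.

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C among the A/C/G/T bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0  # nothing countable, e.g. a run of N
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def windowed_gc(seq: str, window: int) -> list:
    """GC percentage for consecutive non-overlapping windows."""
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, window)
    ]


# Toy sequence: an AT-rich stretch followed by a GC-rich stretch.
print(windowed_gc("ATATATATGC" + "GCGCGCATGC", 10))  # [20.0, 80.0]
```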
56 5) CpG islands
[Figure: CpG is methylated at C to give methyl-CpG, which is converted to TpG by deamination; CpG islands show no methylation]
- Significance of CpG islands
- - Non-methylated CpG islands associated with the 5' ends of genes
- - Aberrant methylation of CpG islands is one mechanism of inactivating tumor suppressor genes (TSGs) in neoplasia
http://www.sanger.ac.uk/HGP/cgi.shtml
57 CpG islands
- CpG dinucleotides greatly under-represented in human genome
- 28,890 CpG islands in number
- Variable density
- - e.g. Y has 2.9/Mb but
- - 16, 17 & 22 have 19-22/Mb
- - Average is 10.5/Mb
Nature (2001) 15th Feb Vol 409 special issue pg
877-888
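Under-representation of CpG is commonly expressed as an observed/expected ratio: the number of CpG dinucleotides multiplied by the sequence length, divided by the product of the C count and the G count. This ratio is a widely used convention and is an addition here, not something stated on the slides. The function name and test sequences are illustrative.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G)."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0  # ratio undefined without both bases; report 0
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    return cpg_count * len(seq) / (c_count * g_count)


print(cpg_obs_exp("CGCGCGCG"))  # CpG-rich toy sequence
print(cpg_obs_exp("CATGCATG"))  # C and G present but never as CpG
```

A ratio well below 1 means fewer CpGs than base composition alone would predict, which is the bulk-genome situation. Unmethylated islands escape the deamination loss and keep a ratio closer to 1.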
58 6) Recombination rates
- 2 main observations
- - Recombination rate increases with decreasing arm length
- - Recombination rate suppressed near the centromeres and increases towards the distal 20-35Mb
59 7) Repeat content
- a) Age distribution
- b) Comparison with other genomes
- c) Variation in distribution of repeats
- d) Distribution by GC content
- e) Y chromosome
Nature (2001) 409 pp 881-891
60 Repeat content
a) Age distribution
- Most interspersed repeats predate eutherian radiation (confirms the slow rate of clearance of nonfunctional sequence from vertebrate genomes)
- LINEs and SINEs have extremely long lives
- 2 major peaks of transposon activity
- No DNA transposition in the past 50MYr
- LTR retroposons teetering on the brink of extinction
61 a) Age distribution
- Overall decline in interspersed repeat activity in hominid lineage in the past 35-40MYr
- Compared to mouse genome, which shows a younger and more dynamic genome
62 b) Comparison with other genomes
- Higher density of transposable elements in euchromatic portion of genome
- Higher abundance of ancient transposons
- 60% of interspersed repeats (IR) made up of LINE1 and Alu repeats
- whereas DNA transposons represent only 6%
- (a few human genes appear likely to have resulted from horizontal transfer from bacteria!)
63 c) Variation in distribution of repeats
- Some regions show either
- - High repeat density
- - e.g. chromosome Xp11: a 525kb region shows 89% repeat density
- - Low repeat density
- - e.g. HOX homeobox gene cluster (<2% repeats)
- - (indicative of regulatory elements which have low tolerance for insertions)
64 d) Distribution by GC content
- High GC: gene rich. High AT: gene poor
- LINEs abundant in AT-rich regions
- SINEs lower in AT-rich regions
- Alu repeats in particular retained in actively transcribed GC rich regions, e.g. chromosome 19 has 5 Alus compared to Y chromosome
65 e) The Y chromosome!
- Unusually young genome (high tolerance to gaining insertions)
- Mutation rate is 2.1X higher in male germline
- - Possibly due to cell division rates or different repair mechanisms
66
- Working draft published Feb 2001
- Finished sequence April 2003
- Annotation of genes going on
- (refer International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 21 October 2004, doi 10.1038/nature03001)
67 Other genomes sequenced
- 2002: Mus musculus, 36,000 genes
- 1997: (organism not named), 4,200 genes
- Sept 2003: Canis, 18,473 human orthologs
- 1998: (organism not named), 19,099 genes
- 31 Aug 2005: Pan troglodytes, 28% identical human orthologs
- 2002: (organism not named), 38,000 genes
Science (26 Sep 2003)Vol301(5641)pp1854-1855
68 References
- HMG 3 by Strachan and Read, Chapter 9 pp 265-268
- Genetics: from genes to genomes by Hartwell et al (2/e), Chapter 10 pp 339-348
- Nature (2001) 409 pp 879-891