Title: BB30055: Genes and genomes
Major insights from the HGP
- Gene size, content and distribution
- Proteome content
- SNP identification
- Distribution of GC content
- CpG islands
- Recombination rates
- Repeat content
Nature (2001) 15 Feb, Vol 409 special issue, pp 814, 875-914
1) Gene size
Gene content
- More genes: twice as many as Drosophila / C. elegans
- Uneven gene distribution: gene-rich and gene-poor regions
- More paralogs: some gene families have extended the number of paralogs, e.g. the olfactory gene family has 1000 genes
- More alternative transcripts: increased RNA splice variants produced, thereby expanding the primary proteins by 5 fold (e.g. neurexin genes)
Gene distribution
Genes generally dispersed (1 gene per 100kb)
Class III complex at HLA 6p21.3
Overlapping genes (transcribed from 2 DNA strands): rare
Genes within genes, e.g. NF1 gene
HMG3 Fig 9.8
Uneven gene distribution
- Gene-rich regions
  - E.g. MHC on chromosome 6 has 60 genes with a GC content of 54%
- Gene-poor regions
  - 82 gene deserts identified
  - ? Large or unidentified genes
- What is the functional significance of these variations?
2) Proteome content
- Proteome more complex than invertebrates
Protein domains (sections with identifiable shape/function)
Domain arrangements in humans
- largest total number of domains is 130
- largest number of domain types per protein is 9
- mostly identical arrangement of domains
[Figure: Protein X shown as a linear arrangement of domain types A, B and C]
Proteome more complex than invertebrates
- No huge difference in domain number in humans
- BUT frequency of domain sharing is very high in human proteins (structural proteins and proteins involved in signal transduction and immune function)
- However, there are only 3 cases where a combination of 3 domain types is shared by human and yeast proteins
- E.g. carbamoyl-phosphate synthase (involved in the first 3 steps of de novo pyrimidine biosynthesis) has 7 domain types; this arrangement occurs once in human and yeast but twice in Drosophila
3) SNPs (single nucleotide polymorphisms)
- Sites that result from point mutations in individual base pairs
- Biallelic
- 60,000 SNPs lie within exons and untranslated regions (85% of exons lie within 5kb of a SNP)
- May or may not affect the ORF
- Most SNPs may be regulatory
- More than 1.4 million SNPs identified
- One every 1.9kb length on average
- Densities vary over regions and chromosomes
- E.g. the HLA region has a high SNP density, reflecting maintenance of diverse haplotypes over many millions of years
Nature (2001) 15 Feb, Vol 409 special issue, pp 821-823, 928
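As a rough consistency check on the SNP figures above, the average spacing is simply the surveyed sequence length divided by the number of SNPs. The sketch below is illustrative only: the function name is hypothetical, and the 2.7 Gb surveyed length is an assumption chosen to be consistent with the slide, not a value stated in the lecture.

```python
def mean_snp_spacing_kb(sequence_bp: float, snp_count: int) -> float:
    """Average distance between neighbouring SNPs, in kilobases."""
    if snp_count <= 0:
        raise ValueError("snp_count must be positive")
    return sequence_bp / snp_count / 1_000


# Assumed surveyed sequence length (about 2.7 Gb); SNP count taken from the slide.
SURVEYED_BP = 2.7e9
SNP_COUNT = 1_400_000

spacing = mean_snp_spacing_kb(SURVEYED_BP, SNP_COUNT)
print(f"about one SNP every {spacing:.1f} kb")
```

The same function can be applied per chromosome or per region to see how densities vary, given region lengths and SNP counts from a real data set.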
How does one distinguish sequence errors from polymorphisms?
- Sequence errors
  - Each piece of genome sequenced at least 10 times to reduce error rate (0.01%)
- Polymorphisms
  - Sequence variation between individuals is 0.1%
  - To be defined as a polymorphism, the altered sequence must be present in a significant population
  - Rate of polymorphisms in diploid human genome is about 1 in 500 bp
Nature (2001) 15 Feb, Vol 409 special issue, pp 821-823, 928
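To make the comparison between the two rates concrete, the arithmetic can be sketched as follows. This is a back-of-the-envelope illustration, not a variant-calling method; the window size and the function name are assumptions introduced here.

```python
ERROR_RATE = 0.0001      # 0.01% residual sequencing error (from the slide)
VARIATION_RATE = 0.001   # 0.1% sequence variation between individuals (from the slide)


def expected_differences(window_bp: int, rate: float) -> float:
    """Expected number of differing bases in a window at a given per-base rate."""
    return window_bp * rate


window = 10_000  # illustrative window size in bp (an assumption)
errors = expected_differences(window, ERROR_RATE)
variants = expected_differences(window, VARIATION_RATE)
print(f"expected errors: {errors:.1f}, expected true differences: {variants:.1f}")
print(f"true differences are about {variants / errors:.0f} times more frequent")
```

At these rates real differences outnumber residual errors, which is why a candidate variant must still be confirmed in a significant part of the population before it is called a polymorphism.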
SNPs and disease
SNPs and risk of disease
N(291)S
SNPs and pharmacogenomics
4) Distribution of GC content
- Genome-wide average of 41%
- Huge regional variations exist
- E.g. the distal 48Mb of chromosome 1p averages 47%, but chromosome 13 has only 36%
- Confirms cytogenetic staining with G-bands (Giemsa)
  - dark G-bands: low GC content (37%)
  - light G-bands: high GC content (45%)
Nature (2001) 15 Feb, Vol 409 special issue, pp 876-877
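GC content is the fraction of bases in a stretch of sequence that are G or C, and regional variation shows up when it is computed in windows along a chromosome. The following is a minimal sketch with hypothetical function names and a toy sequence; real analyses would use windows of many kilobases.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the A/C/G/T bases of seq (other symbols ignored)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def windowed_gc(seq: str, window: int, step: int) -> list[float]:
    """GC content of each full window of length `window`, moving by `step` bases."""
    return [
        gc_content(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence: an AT-rich half followed by a GC-rich half.
toy = "ATATATATAT" + "GCGCGCATGC"
print(gc_content(toy))           # overall GC fraction
print(windowed_gc(toy, 10, 10))  # one value per non-overlapping 10 bp window
```

On the toy sequence the overall value hides the contrast between the two halves, which is the same effect the slide describes for the genome-wide average versus individual regions.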
5) CpG islands
[Figure: CpG methylated at C gives methyl-CpG; deamination of methyl-CpG gives TpG]
CpG islands show no methylation
- Significance of CpG islands
  - Non-methylated CpG islands are associated with the 5' ends of genes
  - Usually overlap the promoter region
  - Aberrant methylation of CpG islands is linked to pathologies like cancer or epigenetic diseases like Rett syndrome
http://www.sanger.ac.uk/HGP/cgi.shtml
Inheritance of CpG methylation
Epigenetic disease: Rett Syndrome
- Characterised by neurodevelopmental problems after birth
- Caused by mutations in a gene on the X chromosome, MECP2 (methyl CpG-binding protein 2), whose protein normally binds to methylated CpG and represses gene expression
- RS symptoms are associated with the failure of mutated MECP2 to regulate transcription of a specific gene, DLX5, one allele of which is normally imprinted. Without the MeCP2 protein, production of the Dlx5 protein is increased, which influences production of the neurotransmitter GABA in the brain
CpG islands
- Greatly under-represented in human genome
- 28,890 in number (5 times less than expected)
- 56% of human genes and 47% of the mouse genes have CpG islands
- Variable density
  - e.g. Y has 2.9/Mb, but
  - chromosomes 16, 17 and 22 have 19-22/Mb
  - Average is 10.5/Mb
Nature (2001) 15 Feb, Vol 409 special issue, pp 877-888
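Under-representation is usually measured at the level of the CpG dinucleotide, as the ratio of observed CpG count to the count expected from the C and G frequencies alone. The sketch below shows that calculation; the function names are hypothetical, and the thresholds used to flag an island (length of at least 200 bp, GC above 50%, observed/expected above 0.6) are commonly quoted criteria assumed here rather than taken from the lecture.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: count(CG) / (count(C) * count(G) / length)."""
    seq = seq.upper()
    c = seq.count("C")
    g = seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    observed = seq.count("CG")
    expected = c * g / len(seq)
    return observed / expected


def looks_like_cpg_island(seq: str, min_len: int = 200,
                          min_gc: float = 0.5, min_ratio: float = 0.6) -> bool:
    """Apply the assumed length, GC and observed/expected thresholds to seq."""
    if len(seq) < min_len:
        return False
    upper = seq.upper()
    gc_fraction = (upper.count("G") + upper.count("C")) / len(upper)
    return gc_fraction > min_gc and cpg_obs_exp(upper) > min_ratio


print(cpg_obs_exp("CGCGCGCG"))            # CpG-rich toy sequence
print(cpg_obs_exp("CATGCATG"))            # same C and G counts, no CpG
print(looks_like_cpg_island("CG" * 150))  # 300 bp of pure CpG
```

A ratio well below 1 over bulk genomic sequence reflects the loss of methylated CpG by deamination to TpG, while unmethylated islands keep a ratio closer to 1.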
6) Recombination rates
- 2 main observations
  - Recombination rate increases with decreasing arm length
  - Recombination rate is suppressed near the centromeres and increases towards the distal 20-35Mb
7) Repeat content
- Age distribution
- Comparison with other genomes
- Variation in distribution of repeats
- Distribution by GC content
- Y chromosome
Nature (2001) 409 pp 881-891
Repeat content
a) Age distribution
- Most interspersed repeats predate the eutherian radiation (confirms the slow rate of clearance of nonfunctional sequence from vertebrate genomes)
- LINEs and SINEs have extremely long lives
- 2 major peaks of transposon activity
- No DNA transposition in the past 50 Myr
- LTR retroposons teetering on the brink of extinction
a) Age distribution
- Overall decline in interspersed repeat activity in the hominid lineage in the past 35-40 Myr
- Compared to the mouse genome, which shows a younger and more dynamic genome
b) Comparison with other genomes
- Higher density of transposable elements in euchromatic portion of genome
- Higher abundance of ancient transposons
- 60% of interspersed repeats (IR) made up of LINE1 and Alu repeats
- whereas DNA transposons represent only 6%
- (a few human genes appear likely to have resulted from horizontal transfer from bacteria!!)
c) Variation in distribution of repeats
- Some regions show either
  - High repeat density
    - e.g. on chromosome Xp11 a 525kb region shows 89% repeat density
  - Low repeat density
    - e.g. HOX homeobox gene cluster (<2% repeats)
    - (indicative of regulatory elements which have low tolerance for insertions)
d) Distribution by GC content
- High GC: gene rich. High AT: gene poor
- LINEs abundant in AT-rich regions
- SINEs lower in AT-rich regions
- Alu repeats in particular retained in actively transcribed GC-rich regions, e.g. chromosome 19 has 5 Alus compared to Y chromosome
e) The Y chromosome
- Unusually young genome (high tolerance to gaining insertions)
- Mutation rate is 2.1X higher in male germline
- Possibly due to cell division rates or different repair mechanisms
- Working draft published Feb 2001
- Finished sequence April 2003
- Annotation of genes going on
- (refer International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 21 October 2004, doi:10.1038/nature03001)
References
- HMG 3 by Strachan and Read, Chapter 9 pp 265-268
- Genetics: from genes to genomes by Hartwell et al (2/e), Chapter 10 pp 339-348
- Nature (2001) 409 pp 879-891