BB30055: Genes and genomes - PowerPoint PPT Presentation

1
BB30055 Genes and genomes
  • Genomes - Dr. MV Hejmadi (bssmvh)
  • 3 broad areas
  • (A) Genomes
  • (B) Applications of genome projects
  • (C) Genome evolution

2
Why sequence the genome?
  • 3 main reasons
  • A description of the sequence of every gene is valuable.
    This includes regulatory regions, which help in
    understanding not only the molecular activities
    of the cell but also the ways in which they are
    controlled.
  • To identify and characterise important heritable
    disease genes or bacterial genes (for industrial
    use)
  • To understand the role of intergenic sequences, e.g. satellites,
    intronic regions etc.

3
History of Human Genome Project (HGP)
  • 1953 DNA structure (Watson & Crick)
  • 1972 Recombinant DNA (Paul Berg)
  • 1977 DNA sequencing (Maxam, Gilbert and Sanger)
  • 1985 PCR technology (Kary Mullis)
  • 1986 Automated sequencing (Leroy Hood & Lloyd
    Smith)
  • 1988 IHGSC established (NIH, DOE); Watson leads
  • 1990 IHGSC scaled up; BLAST published
    (Lipman & Myers)
  • 1992 Watson quits; Venter sets up TIGR
  • 1993 F Collins heads IHGSC; Sanger Centre
    (Sulston)
  • 1995 cDNA microarray
  • 1998 Celera genomics (J Craig Venter)
  • 2001 Working draft of human genome sequence
    published
  • 2003 Finished sequence announced

4
Human Genome Project (HGP)
  • Goal: obtain the entire DNA sequence of the human
    genome
  • Players
  • (A) International Human Genome Sequencing Consortium
    (IHGSC)
  • - public funding, free access to all, started
    earlier
  • - used the method of mapping overlapping clones
  • (B) Celera Genomics
  • - private funding, pay to view
  • - started in 1998
  • - used the whole-genome shotgun strategy

5
Whose genome is it anyway?
  • (A) International Human Genome Sequencing Consortium
    (IHGSC)
  • - composite from several different people,
    generated from 10-20 primary samples taken from
    numerous anonymous donors across racial and
    ethnic groups
  • (B) Celera Genomics
  • - 5 different donors (one of whom was J Craig
    Venter himself!)

6
Strategies for sequencing the human genome
7
sequencing larger genomes
Mapping phase
Sequencing phase
8
Result
30,000-40,000 protein-coding genes estimated, based
on known genes and predictions
                 IHGSC    Celera
definite genes   24,500   26,383
possible genes   5,000    12,000
9
Organisation of human genome
  • Mitochondrial genome
  • Nuclear genome (3.2 Gbp)
  • 24 types of chromosomes
  • Sizes: Y is 51 Mb and chr 1 is 279 Mb

10
General organisation of human genome

11
Polypeptide-coding regions
12
Rare bicistronic transcription units E.g. UBA52
transcription generates ubiquitin and a ribosomal
protein S27a
Gene organisation
13
General organisation of human genome

14
Non-polypeptide-coding: RNA-encoding
15
Pseudogenes (ψ)
  • Non-functional copies of exonic sequences of an
    active gene.
  • Thought to arise by genomic insertion of a cDNA
    as a result of retroposition
  • Contribute to overall repetitive elements (<1%)

16
processed pseudogenes -
17
Pseudogenes in globin gene cluster
18
Gene fragments or truncated genes
  • Gene fragments: small segments of a gene (e.g.
    single exon from a multiexon gene)

Truncated genes: short components of functional
genes (e.g. 5' or 3' end)
Thought to arise due to unequal crossover or
exchange
19
General organisation of human genome

20
Repetitive elements
  • Main classes based on origin
  • Tandem repeats
  • Interspersed repeats
  • Segmental duplications

21
1) Tandem repeats
  • Blocks of tandem repeats at
  • subtelomeres
  • pericentromeres
  • Short arms of acrocentric chromosomes
  • Ribosomal gene clusters

22
Tandem / clustered repeats
Broadly divided into 4 types based on size
Class            Size of repeat   Repeat block   Major chromosomal location
Satellite        5-171 bp         > 100 kb       Centromeric heterochromatin
Minisatellite    9-64 bp          0.1-20 kb      Telomeres
Microsatellite   1-13 bp          < 150 bp       Dispersed
HMG3 by Strachan and Read pp 265-268
23
Satellites
  • Large arrays of repeats
  • Some examples
  • Satellites 1, 2 & 3
  • α (alphoid DNA)
  • - found in all chromosomes
  • β satellite

HMG3 by Strachan and Read pp 265-268
24
Minisatellites
  • Moderate sized arrays of repeats
  • Some examples
  • Hypervariable minisatellite DNA
  • - core of GGGCAGGAXG
  • - found in telomeric regions
  • - used in original DNA fingerprinting technique
    by Alec Jeffreys

HMG3 by Strachan and Read pp 265-268
25
Microsatellites
  • VNTRs - Variable Number of Tandem Repeats
  • SSRs - Simple Sequence Repeats
  • 1-13 bp repeats, e.g. (A)n, (AC)n
  • 2% of genome (dinucleotides: 0.5%)
  • Used as genetic markers (especially for disease
    mapping)

Individual genotype
HMG3 by Strachan and Read pp 265-268
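A minimal sketch of how simple sequence repeats such as (A)n or (AC)n could be located in a sequence string. Only the 1-13 bp unit size comes from the slides; the function name, the 5-copy minimum and the example sequence are illustrative assumptions.

```python
import re


def find_microsatellites(seq, max_unit=13, min_repeats=5):
    """Return (start, unit, copies) for each tandem run of a short unit.

    max_unit follows the 1-13 bp range quoted on the slide;
    min_repeats=5 is an arbitrary illustrative threshold.
    """
    # Lazy quantifier so the shortest repeating unit is reported first,
    # e.g. "AC" rather than "ACAC". \1 demands exact copies of the unit.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


example = "GGTT" + "AC" * 8 + "TTGG"   # hypothetical sequence with an (AC)8 run
print(find_microsatellites(example))   # [(4, 'AC', 8)]
```

Only perfect repeats are reported here; real microsatellites often contain interruptions, which this sketch ignores.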
26
Microsatellite genotyping
Design PCR primers unique to one locus in the
genome. A single pair of PCR primers will produce
different-sized products for each of the
different-length microsatellite alleles.
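The size logic behind microsatellite genotyping can be sketched as follows: with primers fixed in the unique flanking sequence, the PCR product length differs between alleles only by the number of repeat units. The flank lengths, repeat unit and allele sizes below are hypothetical values chosen for illustration.

```python
def pcr_product_length(left_flank, right_flank, unit, copies):
    """Product length in bp: both flanks (primers included) plus the repeat block."""
    return left_flank + len(unit) * copies + right_flank


def copies_from_product(length, left_flank, right_flank, unit):
    """Infer the repeat copy number from an observed product length."""
    repeat_bp = length - left_flank - right_flank
    if repeat_bp < 0 or repeat_bp % len(unit) != 0:
        raise ValueError("product length is not consistent with this locus")
    return repeat_bp // len(unit)


# Hypothetical locus: 40 bp and 35 bp flanks around an (AC)n repeat.
allele_1 = pcr_product_length(40, 35, "AC", 12)   # 99 bp
allele_2 = pcr_product_length(40, 35, "AC", 15)   # 105 bp
genotype = sorted(copies_from_product(size, 40, 35, "AC")
                  for size in (allele_1, allele_2))
print(genotype)   # [12, 15], i.e. a heterozygote
```

Two bands of different size from one primer pair therefore indicate a heterozygous individual at that locus.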

27
2) Interspersed repeats
  • A.k.a. transposon-derived repeats
  • 45% of genome
  • Arise mainly as a result of transposition,
    through either a DNA or an RNA intermediate

28
Interspersed repeats (transposon-derived)
Major types
Class            Family            Size      Copy number   % of genome
LINE             L1 (Kpn family)   6.4 kb    0.5 x 10^6    16.9
                 L2                          0.3 x 10^6    3.2
SINE             Alu               0.3 kb    1.1 x 10^6    10.6
LTR              e.g. HERV         1.3 kb    0.3 x 10^6    8.3
DNA transposon   mariner           0.25 kb   1-2 x 10^4    2.8
Updated from HGP publications
HMG3 by Strachan Read pp268-272
29
LINEs (long interspersed elements)
  • Among the most ancient elements of eukaryotic genomes
  • Autonomous transposition (encode reverse transcriptase)
  • 6-8 kb long
  • Internal polymerase II promoter and 2 ORFs
  • 3 related LINE families in humans:
    LINE-1, LINE-2, LINE-3
  • Believed to be responsible for retrotransposition
    of SINEs and creation of processed pseudogenes

30
LINEs (long interspersed elements)
Nature (2001) pp879-880
HMG3 by Strachan Read pp268-272
31
SINEs (short interspersed elements)
  • Non-autonomous (successful freeloaders! They borrow
    RT from other sources such as LINEs)
  • 100-300 bp long
  • Internal polymerase III promoter
  • No proteins
  • Share 3' ends with LINEs
  • 3 related SINE families in humans:
    active Alu, inactive MIR and Ther2/MIR3

32
LINEs and SINEs have preferred insertion sites
  • In this example, yellow represents the
    distribution of mys (a type of LINE) over a mouse
    genome where chromosomes are orange. There are
    more mys inserted in the sex (X) chromosomes.

33
  • Try the link below to do an online experiment
    which shows how an Alu insertion polymorphism has
    been used as a tool to reconstruct the human
    lineage
  • http://www.geneticorigins.org/geneticorigins/pv92/intro.html

34
Long Terminal Repeats (LTR)
  • Repeats in the same orientation on both sides of the
    element, e.g. ATATATNNNNNNNATATAT
  • Contain sequences that serve as transcription
    promoters as well as terminators.
  • These sequences allow the element to code for an
    mRNA molecule that is processed and
    polyadenylated.
  • At least two genes are coded within the element to
    supply essential activities for the
    retrotransposition mechanism.
  • The RNA contains a specific primer binding site
    (PBS) for initiating reverse transcription.
  • A hallmark of almost all mobile elements is that
    they form small direct repeats at the site
    of integration.

35
Long Terminal Repeats (LTR)
  • Autonomous or non-autonomous
  • Autonomous retroposons encode gag, pol genes
    which encode the protease, reverse transcriptase,
    RNAseH and integrase

Nature (2001) pp879-880
HMG3 by Strachan Read pp268-272
36
DNA transposons (lateral transfer?)
  • DNA transposons
  • Inverted repeats on both sides of element
  • e.g. ATGCNNNNNNNNNNNCGTA

Nature (2001) pp879-880
From Genes VII by Lewin
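The difference between the direct terminal repeats of LTR elements and the inverted terminal repeats of DNA transposons can be sketched with the two example strings from the slides. The slides draw the inverted repeat as a mirrored string (ATGC...CGTA); on double-stranded DNA an inverted repeat is normally read as the reverse complement, so the sketch checks for both. That extra check and all names below are assumptions, not part of the slides.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def terminal_repeat_type(element, k):
    """Compare the first and last k bases of an element."""
    if k <= 0 or 2 * k > len(element):
        raise ValueError("k must be positive and at most half the element length")
    left, right = element[:k].upper(), element[-k:].upper()
    if right == left:
        return "direct"                 # same orientation, as in LTRs
    if right == reverse_complement(left):
        return "inverted"               # reverse complement on the same strand
    if right == left[::-1]:
        return "mirrored"               # simplified drawing used on the slide
    return "none"


print(terminal_repeat_type("ATATATNNNNNNNATATAT", 6))   # direct
print(terminal_repeat_type("ATGCNNNNNNNNNNNCGTA", 4))   # mirrored
print(terminal_repeat_type("ATGCNNNNGCAT", 4))          # inverted
```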
37
3) Segmental duplications
  • Closely related sequence blocks at different
    genomic loci
  • Transfer of 1-200kb blocks of genomic sequence
  • Segmental duplications can occur on homologous
    chromosomes (intrachromosomal) or non homologous
    chromosomes (interchromosomal)
  • Not always tandemly arranged
  • Relatively recent

38
Segmental duplications
  • Interchromosomal: segments duplicated among
    non-homologous chromosomes

Intrachromosomal: duplications occur within a
chromosome / arm
Nature Reviews Genetics 2, 791-800 (2001)
39
Segmental duplications
Segmental duplications in chromosome 22
40
Segmental duplications - chromosome 7.
41
Nature Reviews Genetics 2, 791-800 (2001)
42
Major insights from the HGP
  1. Gene size, content and distribution
  2. Proteome content
  3. SNP identification
  4. Distribution of GC content
  5. CpG islands
  6. Recombination rates
  7. Repeat content

Nature (2001) 15th Feb Vol 409 special issue pgs
814 875-914.
43
1) Gene size
44
Gene content.
  • More genes: twice as many as Drosophila /
    C. elegans
  • Uneven gene distribution: gene-rich and
    gene-poor regions
  • More paralogs: some gene families have extended
    the number of paralogs, e.g. the olfactory gene family
    has 1000 genes
  • More alternative transcripts: increased RNA
    splice variants produced, thereby expanding the
    primary proteins by 5-fold (e.g. neurexin genes)

45
Gene distribution
Genes generally dispersed (1 gene per 100kb)
Class III complex at HLA 6p21.3
Overlapping genes (transcribed from 2 DNA
strands) - Rare
Genes within genes, e.g. NF1 gene
HMG3 Fig 9.8
46
Uneven gene distribution
  • Gene-rich regions
  • E.g. MHC on chromosome 6 has 60 genes with a GC
    content of 54%
  • Gene-poor regions
  • 82 gene deserts identified
  • ? Large or unidentified genes
  • What is the functional significance of these
    variations?

47
2) Proteome content
  • Proteome more complex than in invertebrates

Protein domains (sections with identifiable
shape/function)
Domain arrangements in humans:
  • largest total number of domains is 130
  • largest number of domain types per protein is 9
  • mostly identical arrangement of domains
[Figure: Protein X, drawn as a string of repeated domains A, B and C]
48
Proteome more complex than invertebrates
  • No huge difference in domain number in humans
  • BUT, frequency of domain sharing is very high in
    human proteins (structural proteins and proteins
    involved in signal transduction and immune
    function)
  • However, there are only 3 cases where a combination of 3
    domain types is shared by human and yeast proteins.
  • e.g. carbamoyl-phosphate synthase (involved in the
    first 3 steps of de novo pyrimidine biosynthesis)
    has 7 domain types, an arrangement which occurs once in human
    and yeast but twice in Drosophila


49
3) SNPs (single nucleotide polymorphisms)
  • Sites that result from point mutations in
    individual base pairs
  • biallelic
  • 60,000 SNPs lie within exons and untranslated
    regions (85% of exons lie within 5 kb of a SNP)
  • May or may not affect the ORF
  • Most SNPs may be regulatory
  • More than 1.4million SNPs identified
  • One every 1.9kb length on average
  • Densities vary over regions and chromosomes
  • e.g. HLA region has a high SNP density,
    reflecting maintenance of diverse haplotypes over
    many Myr

Nature (2001) 15th Feb Vol 409 special issue pgs
821-823 928
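A quick arithmetic check on the two headline SNP figures above. Multiplying the number of SNPs by the average spacing gives the amount of sequence they are spread over. The 3.2 Gbp genome size is taken from the "Organisation of human genome" slide; reading the product as "sequence surveyed" is an interpretation, not a statement from the slides.

```python
def implied_span_bp(n_snps, spacing_bp):
    """Sequence length implied by n_snps spaced spacing_bp apart on average."""
    return n_snps * spacing_bp


span = implied_span_bp(1_400_000, 1_900)   # 1.4 million SNPs, one per 1.9 kb
fraction = span / 3_200_000_000            # relative to the 3.2 Gbp nuclear genome

print(span)                 # 2660000000 bp, i.e. about 2.7 Gb
print(round(fraction, 2))   # 0.83
```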
50
How does one distinguish sequence errors from
polymorphisms?
  • Sequence errors
  • Each piece of genome sequenced at least 10 times
    to reduce the error rate (0.01%)
  • Polymorphisms
  • Sequence variation between individuals is 0.1%
  • To be defined as a polymorphism, the altered
    sequence must be present in a significant
    proportion of the population
  • Rate of polymorphisms in diploid human genome is
    about 1 in 500 bp

Nature (2001) 15th Feb Vol 409 special issue pgs
821-823 928
51
SNPs and disease
52
3) SNPs and risk of disease
N(291)S
53
3) SNPs and risk of disease
Late-onset Alzheimer's disease (LOAD): the apolipoprotein
e4 haplotype is a genetic risk factor
3 major alleles (APO E2, E3 and E4):
APO E2  Cys112 / Cys158
APO E3  Cys112 / Arg158
APO E4  Arg112 / Arg158
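The allele definitions above amount to a lookup on the residues at positions 112 and 158. A minimal sketch follows; the function and table names are hypothetical, and only the three residue combinations listed on the slide are covered.

```python
# (residue 112, residue 158) -> allele, exactly as listed on the slide.
APOE_ALLELES = {
    ("Cys", "Cys"): "E2",
    ("Cys", "Arg"): "E3",
    ("Arg", "Arg"): "E4",
}


def apoe_allele(res112, res158):
    """Name the APOE allele from the amino acids at positions 112 and 158."""
    key = (res112.capitalize(), res158.capitalize())
    if key not in APOE_ALLELES:
        raise ValueError(f"combination {key} is not one of the 3 major alleles")
    return APOE_ALLELES[key]


print(apoe_allele("Cys", "Arg"))   # E3
print(apoe_allele("Arg", "Arg"))   # E4, the LOAD risk allele named on the slide
```

Each allele differs from the next by a single Cys/Arg change, i.e. by one SNP in the coding sequence.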
54
3) SNPs and pharmacogenomics
55
4) Distribution of GC content
  • Genome-wide average of 41%
  • Huge regional variations exist
  • E.g. distal 48 Mb of chromosome 1p has 47%, but
    chromosome 13 has only 36%
  • Confirms cytogenetic staining with G-bands
    (Giemsa)
  • dark G-bands: low GC content (37%)
  • light G-bands: high GC content (45%)

Nature (2001) 15th Feb Vol 409 special issue pg
876-877
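GC content figures like those above are simply the percentage of G and C bases in a stretch of sequence. A minimal sketch, computed both for a whole sequence and in fixed windows to show regional variation; the function names, window size and toy sequence are illustrative assumptions.

```python
def gc_percent(seq):
    """Percentage of G+C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def windowed_gc(seq, window):
    """GC percentage of consecutive, non-overlapping windows of a given size."""
    return [gc_percent(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, window)]


print(gc_percent("ATGC"))           # 50.0
print(windowed_gc("AAAAGGGG", 4))   # [0.0, 100.0]
```

On real chromosomes the windows would be tens of kb or more; the 4 bp window here only keeps the example readable.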
56
5) CpG islands
CpG -> methyl-CpG (methylated at C) -> TpG (by deamination)
CpG islands show no methylation
  • Significance of CpG islands
  • Non-methylated CpG islands are associated with the 5'
    ends of genes
  • Aberrant methylation of CpG islands is one
    mechanism of inactivating tumor suppressor genes
    (TSGs) in neoplasia

http://www.sanger.ac.uk/HGP/cgi.shtml
57
CpG islands
  • Greatly under-represented in human genome
  • 28,890 in number
  • Variable density
  • e.g. Y has 2.9/Mb, but
    chromosomes 16, 17 & 22 have 19-22/Mb
  • Average is 10.5/Mb

Nature (2001) 15th Feb Vol 409 special issue pg
877-888
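Under-representation of CpG can be put into numbers with an observed/expected ratio: the count of CG dinucleotides divided by the count expected from the C and G frequencies alone. This ratio is a widely used measure but is not given on the slides; the function name and toy sequences are assumptions.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0                      # no CpG is possible
    return seq.count("CG") * len(seq) / (c_count * g_count)


print(cpg_obs_exp("CGCGCGCG"))   # 2.0, CpG-rich as in an island
print(cpg_obs_exp("CATGCATG"))   # 0.0, C and G present but no CpG
```

A ratio well below 1 across bulk genomic DNA is what the deamination of methyl-CpG to TpG would predict, while unmethylated islands keep a ratio closer to 1.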
58
6) Recombination rates
  • 2 main observations
  • Recombination rate increases with decreasing arm
    length
  • Recombination rate suppressed near the
    centromeres and increases towards the distal
    20-35Mb

59
7) Repeat content
  • Age distribution
  • Comparison with other genomes
  • Variation in distribution of repeats
  • Distribution by GC content
  • Y chromosome

Nature (2001) 409 pp 881-891
60
Repeat content.
a) Age distribution
  • Most interspersed repeats predate eutherian
    radiation (confirms the slow rate of clearance of
    nonfunctional sequence from vertebrate genomes)
  • LINEs and SINEs have extremely long lives
  • 2 major peaks of transposon activity
  • No DNA transposition in the past 50MYr
  • LTR retroposons teetering on the brink of
    extinction

61
a) Age distribution
  • overall decline in interspersed repeat activity
    in hominid lineage in the past 35-40MYr
  • compared to mouse genome, which shows a younger
    and more dynamic genome

62
b) Comparison with other genomes
  • Higher density of transposable elements in
    euchromatic portion of genome
  • Higher abundance of ancient transposons
  • 60% of interspersed repeats (IR) made up of LINE1 and Alu repeats,
    whereas DNA transposons represent only 6%
  • (a few human genes appear likely to have
    resulted from horizontal transfer from
    bacteria!!)

63
c) Variation in distribution of repeats
  • Some regions show either
  • High repeat density
  • e.g. chromosome Xp11: a 525 kb region shows 89%
    repeat density
  • Low repeat density
  • e.g. HOX homeobox gene cluster (<2% repeats)
  • (indicative of regulatory elements which have low
    tolerance for insertions)

64
d) Distribution by GC content
  • High GC: gene rich; high AT: gene poor
  • LINEs abundant in AT-rich regions
  • SINEs lower in AT-rich regions
  • Alu repeats in particular retained in actively
    transcribed GC-rich regions, e.g. chromosome 19 has
    5% Alus compared to Y chromosome

65
e) The Y chromosome !
  • Unusually young genome (high tolerance to gaining
    insertions)
  • Mutation rate is 2.1X higher in male germline
  • Possibly due to cell division rates or different
    repair mechanisms

66
  • Working draft published Feb 2001
  • Finished sequence April 2003
  • Annotation of genes going on
  • (refer to International Human Genome Sequencing
    Consortium. Finishing the euchromatic sequence of
    the human genome. Nature, 21 October 2004, doi:
    10.1038/nature03001)

67
Other genomes sequenced
2002          Mus musculus      36,000 genes
1997                            4,200 genes
Sept 2003     Canis             18,473 human orthologs
1998                            19,099 genes
31 Aug 2005   Pan troglodytes   28% identical human orthologs
2002                            38,000 genes
Science (26 Sep 2003)Vol301(5641)pp1854-1855
68
References
  • HMG3 by Strachan and Read, Chapter 9, pp 265-268
  • Genetics: from genes to genomes by Hartwell et al
    (2/e), Chapter 10, pp 339-348
  • Nature (2001) 409, pp 879-891