Title: Chap. 6 Genes, Genomics, and Chromosomes (Part A)
1Chap. 6 Genes, Genomics, and Chromosomes (Part A)
- Topics
- Eukaryotic Gene Structure
- Chromosomal Organization of Genes and Noncoding
DNA - Transposable (Mobile) DNA Elements
- Goals
- Learn how genes encoded by complex transcription
units are expressed. - Learn the origin, types, and functions of DNA in
higher organisms. - Learn the properties of transposons and their
roles in gene evolution.
RxFISH-painted human chromosomes.
2Overview of Human Genes Chromosomes
Human diploid genomic DNA contains 109 bp
divided among 22 autosomes and 2 sex chromosomes.
The longest autosome (1) contains 280 x 106 bp.
Only 1.5 of human DNA encodes proteins or
functional RNA products. The expressed, coding
segments of genes are called exons. Exons are
highly conserved in sequence. Noncoding DNA
consists of spacer DNA between genes and intron
DNA within genes. Noncoding DNA is not strongly
conserved and accounts for most of the variations
in sequences between individual humans. As
discussed later, DNA is highly condensed (overall
105-fold in mitotic chromosomes) by
protein-nucleic acid complexes called nucleosomes
and other higher-order structures (Fig. 6.1).
3Simple Transcription Units
Eukaryotic genes are monocistronic in that only
one protein is produced from a given mRNA.
However, multiple forms of mRNAs, and therefore
proteins, are produced from many genes. Simple
gene transcription units produce only one type of
mRNA and protein (Fig. 6.3a). Mutations at sites
a b often reduce or prevent transcription.
Mutations at site c can change the amino acid
sequence of the protein and interfere with its
function. Mutations at site d affecting the
selection of the exon 2/3 splice site can result
in an abnormally spliced mRNA and nonfunctional
protein.
4Complex Transcription Units
Complex gene transcription units produce several
species of mRNAs, and thus proteins (Fig. 6.3b).
The exon content of mRNAs and domain composition
of proteins are varied by selection of
alternative splice sites (Top), polyadenylation
sites (Middle), and even promoter sites (Bottom).
Site selection may vary in different cell types
and during different stages of development. The
effects of mutations (e.g., c d) on the gene
products synthesized from these transcription
units will be discussed in class. About 60 of
humans genes are contained in complex
transcription units.
5Alternative Splicing Gene Regulation
Protein domains can be encoded by a single exon
or by a small collection of exons within a larger
gene. The coding regions for domains can be
spliced in or out of the primary transcript by
the process of alternative splicing. The
resulting mRNAs encode different forms of the
protein, known as isoforms. Alternative splicing
is an important method for regulation of gene
expression in different tissues and different
physiological states. It is estimated that 60 of
all human genes are expressed as alternatively
spliced mRNAs. Alternative splicing is
illustrated in Fig. 4.16 for the fibronectin
gene. The fibroblast and hepatocyte isoforms
differ in their content of the EIIIA and EIIIB
domains which mediate cell surface binding.Twenty
different isoforms of fibronectin produced by
alternative splicing have been identified.
6Human Genomic DNA Protein-coding Genes
Genomic DNA of higher eukaryotes contains 4 main
classes of DNA--1) protein-coding genes, 2)
tandemly repeated genes, 3) repetitious DNA, and
4) unclassified spacer DNA (Table 6.1). Protein
coding genes are grouped into the categories
known as solitary genes, and duplicated or
diverged genes belonging to gene families. In
humans, roughly equal numbers of protein-coding
genes occur in these two categories. Groups of
homologous duplicated genes form gene and protein
families, such as the ß-globin family.
(25-30)
7The Human ß-globin Gene Family
The ß-globin gene cluster on chromosome 11 is
shown in Fig. 6.4a. The ß-globin genes are
expressed in different stages of life. ?, Ag, and
Gg are expressed during different trimesters of
fetal development (next slide). ß expression
begins around birth continues throughout adult
life. Fetal hemoglobin molecules made with the
d???? and G? or A? polypeptides have a higher
affinity for O2 than maternal hemoglobin,
facilitating O2 transfer to the fetus.
The 5 ß-globin genes are derived from an
ancestral ß-globin gene via gene duplication.
Over time, these genes accumulated adaptive
mutations via sequence drift resulting in the
specialized species of ß-globin proteins. Genomic
DNA also contains nonfunctional DNA sequences
called pseudogenes that are derived from gene
duplication or reverse transcription and
integration of cDNA sequences made from mRNA
(covered below). ß-globin pseudogenes contain
introns and thus were derived by gene
duplication. Over time these genes became
nonfunctional also due to sequence drift. Because
they are not harmful, pseudogenes remain in the
genome, marking a gene duplication event in an
earlier ancestor.
8Expression of Human Globin Genes
9Exon and Gene Duplication from Unequal Crossing
Over
Fig. 6.2 illustrates how duplication of genes
(e.g., the ß-globins) and exons can occur via
unequal crossing over during meiosis and
formation of gametes. Exon duplication results in
proteins containing repeated domains (e.g., the
EGF precursor, Fig. 3.11). In the examples shown,
recombination is shown to occur between L1
retrotransposon sequences which are common in
genomic DNA.
10Modular Domain Structure of Proteins
Domains are independently folding and
functionally specialized tertiary structure units
within a protein. The respective globular and
fibrous structural domains of the hemagglutinin
monomer (which happen to be individual
polypeptide chains) are illustrated above in Fig.
3.10a. Domains (such as the EGF domain) also may
be encoded within a single polypeptide chain, as
illustrated in Fig. 3.11. Domains still perform
their standard functions although fused together
in a longer polypeptide (e.g., DNA binding and
ATPase domains of a transcription factor). The
modular domain structure of many proteins has
resulted from the shuffling and splicing together
of their coding sequences within longer genes.
Epidermal growth factor (EGF) domain
11Gene Density in Genomic DNA
Higher eukaryotes contain far more noncoding DNA
between genes than bacteria and simple eukaryotes
(Fig. 6.4). The region of human genomic DNA
containing the ß-globin gene cluster shown in the
figure actually is a relatively "gene-rich"
region of human DNA. Some regions known as
gene-poor "deserts" also occur. Higher eukaryotes
also contain a larger amount of intron DNA.
Although one-third of human DNA is transcribed
into pre-mRNA, 95 ends up being degraded after
RNA splicing reactions. On average, the typical
exon is 50-200 bp in length, while the median
length of introns is 3.3 kb in human genes.
12Human Genomic DNA Tandemly Repeated Genes
Tandemly repeated genes also are derived by gene
duplication. Unlike gene families, the sequences
of these duplicated genes are identical or
strongly conserved. In addition, they commonly
are arranged in a head-to-tail fashion in tandem
arrays over a long stretch of DNA. rRNAs and
snRNAs (used in splicing reactions, Chap. 8) are
representative of this group (Table 6.1).
Multiple copies of these genes are needed due to
the requirement for vast amounts of these RNAs in
the cell. tRNA and histone genes are included in
this category, but these genes typically occur in
clusters and not true tandem arrays.
13Nonprotein-coding Genes in Human Genomic DNA
Thousands of genes in the human genome encode
functional RNAs (Table 6.2). The functions of
several of these are covered in later chapters.
14Repetitious DNA
Two main categories of repetitious
DNA--simple-sequence DNA and interspersed
repeats--occur in eukaryotic genomes (Table 6.1).
Interspersed repeats are more common and are
derived largely from transposons. Simple-sequence
DNA is less prevalent, accounting for 6 of
human genomic DNA. Simple-sequence DNA is also
known as satellite DNA, due to its formation of
satellite bands during cesium chloride density
gradient ultracentrifugation. The function of
this DNA is mostly obscure. It is commonly found
at the centromere and telomere regions of
chromosomes.
(25-30)
15Properties of Satellite DNA
Satellite DNA is classified into 3 types based on
length. True satellite DNA consists of 14-500 bp
sequence units that tandemly repeat over 20-100
kb lengths of genomic DNA. Minisatellite DNA
consists of 15-100 bp sequence units that
tandemly repeat over 1-5 kb stretches of DNA.
Microsatellite DNA consists of 1-13 bp units that
can repeat up to 150 times. Microsatellite DNA is
thought to originate from backward slippage of
a growing daughter strand on its template strand
during DNA replication (Fig. 6.5).The sequences
of repeat units are highly conserved which
suggests they perform important functions. Each
category of satellite DNA contains a number of
different repeat sequences. Simple-sequence DNAs
can serve as DNA markers due to variations in
repeat number. Satellite DNAs are exploited in
FISH (fluorescence in situ hybridization)
chromosome staining (Fig. 6.6).
16DNA Fingerprinting
DNA fingerprinting is a method for identifying
individuals based on their minisatellite DNA
(Fig. 6.7). It was developed in the mid-80s and
is widely used in forensics, paternity analysis,
and for research purposes. In the method,
minisatellite DNA from a genomic DNA specimen is
amplified by PCR using primers that bind to
unique sequences flanking minisatellite repeat
units. Bands corresponding to each minisatellite
locus then are separated on gels. Although
satellite DNA is highly conserved in sequence,
the number of tandem copies at each loci is
highly variable between individuals. This results
from unequal crossing over during formation of
gametes in meiosis. Due to the variation in the
number of repeats at each locus, different
individuals can be readily distinguished based on
banding patterns.
17Chap. 6 Problem 3
Satellite DNA is classified into 3 categories
based on length. Satellite DNA consists of 14-500
bp sequence units that tandemly repeat over
20-100 kb lengths of genomic DNA. Minisatellite
DNA consists of 15-100 bp sequence units that
tandemly repeat over 1-5 kb stretches of DNA.
Microsatellite DNA consists of 1-13 bp units that
can repeat up to 150 times. Although the
sequences of satellite DNA are highly conserved,
the number of tandem copies at each locus is
highly variable between individuals. This
originates due to unequal crossing over during
formation of gametes in meiosis (Upper figure).
DNA fingerprinting is a method for identifying
individuals based on variations in minisatellite
DNA (Fig. 6.7). In the method, minisatellite DNA
is amplified by PCR using unique primers flanking
repeat regions, and the collection of fragments
is run on a gel. Due to the variation in the
number of repeats at different loci, different
individuals can be readily distinguished.
18Interspersed Repeats
Interspersed repeat DNA comprises the largest
fraction of repetitious DNA in eukaryotic
genomes. This DNA, which is also called
moderately repeated DNA makes up 45 of human
genomic DNA. Interspersed repeat DNA is composed
of partial and complete transposon sequences or
"mobile DNA". Mobile DNAs were discovered by
Barbara McClintock in the 1940s. These sequences
move by "transposition". Transpositions in germ
line cells are inheritable and occur at a rate of
one transposition per 8 individuals. In somatic
cells they can cause somatic cell mutations.
Mobile DNA has been very important in genome
evolution.
(25-30)
19Mobile DNA Elements
Mobile DNA elements are grouped into two classes,
DNA transposons and retrotransposons (Fig. 6.8).
DNA transposons move directly as DNA via a
"cut-and-paste" mechanism. Retrotransposons move
via an RNA intermediate and a "copy-and-paste"
mechanism, wherein the original copy of the
transposon is preserved. Retroviruses, like HIV,
formally are a subclass of retrotransposons that
can move between cells because they encode viral
coat proteins. DNA transposons predominate in
bacteria retrotransposons are more prevalent in
eukaryotes.
20Mobile DNA in Prokaryotes
Bacteria contain DNA transposons called insertion
sequences (Fig. 6.9). IS elements are 1-2 kb DNAs
that transpose within the bacterial genome to
random locations. Transposition ("jumping") is
mediated by an encoded transposase protein.
Insertion usually causes gene inactivation and is
harmful. Nonetheless, E. coli encodes 20 types
of IS elements. They are tolerated in part due to
their low transposition rate (1 in 105 - 107
cells per generation). This rate is set by the
low rate of transcription of the transposase
gene. IS elements contain inverted repeat
sequences of 50 bp at each end of the
protein-coding region that are crucial for
transposition.
21Mechanism of IS Element Transposition
Transposition occurs in 3 main steps, as
summarized in Fig. 6.10. The excision of the IS
element and its cutting-and-pasting into the
target sequence is mediated by the transposase
(Steps 1 2). The single-stranded DNA regions
remaining at the insertion site after transposase
action are filled-in and the nicks sealed by
cellular DNA polymerase and DNA ligase (Step 3).
All transposases we will cover produce staggered
cuts at their target sites. This leads to
production of short direct repeat sequences
immediately flanking the sites of insertion.
Eukaryotic DNA transposons jump in genomic DNA by
a similar mechanism.
22Mechanism of DNA Transposon Copy Number Increase
About 3 x 105 copies of full-length and truncated
DNA transposons occur in human genomic DNA (3 of
DNA). Although DNA transposons move via a
cut-and-paste mechanism, their copy number in the
genome will increase if they transpose during DNA
synthesis preceding the first meiotic division of
gametogenesis (Fig. 6.11).
23LTR Retrotransposons
Eukaryotic retrotransposons fall into two major
groups--LTR retrotransposons and non-LTR
retrotransposons. Together, these sequences
account for 42 of human genomic DNA.
LTRs stand for long direct terminal repeats. LTRs
consist of 250-600 bp direct repeat sequences
located at the ends of the retrotransposon coding
region (Fig. 6.12). LTR retrotransposons share
many features with retroviruses. They both encode
LTRs, reverse transcriptase, and DNA integrase.
However, LTR retrotransposons lack coat proteins
that allow retroviruses to move between cells.
Transposition occurs via an RNA intermediate that
is transcribed from a promoter in the left LTR
(Fig. 6.13). The primary transcript is
polyadenylated, forming the retroviral genomic
RNA.
24Retroviral LTR-retrotransposon DNA Synthesis
The mechanism by which retroviral and LTR
retrotransposon DNA is synthesized prior to
integration into genomic DNA is shown in Fig.
6.14. DNA integrase inserts the completed
retroviral DNA into genomic DNA via a mechanism
similar to that described for bacterial IS
elements. Namely, a short direct repeat is
produced at each end of the integrated DNA. On
the order of 4.4 x 105 LTR retrotransposon
sequences occur in human DNA. Most of these are
non-functional due to recombination between LTR
sequences and deletion of the intervening DNA.
25Non-LTR Retrotransposons
Even more abundant in human genomic DNA are
non-LTR retrotransposon sequences. There are two
main classes of non-LTR retrotransposons, known
as long interspersed elements (LINEs, 6 kb), and
short interspersed elements (SINEs, 300 bp).
LINEs encode a reverse transcriptase (ORF2)
needed for transposition (Fig. 6.16), whereas
SINEs do not. Instead SINEs are thought to rely
on LINE-encoded enzymes for transposition. LINEs
are grouped into L1, L2, and L3 families, of
which only L1 is active today. LINE sequences
occur at 9 x 105 copies per human genome. SINEs
occur at 1.6 x 106 copies. The most abundant
SINE is the Alu element, which is named based on
the fact that it encodes an AluI restriction
site. Alu elements were important for gene
duplications at the ß-globin locus (Figs. 6.4).
poly(A) site
promoter site
26L/SINE Transposition (I)
The mechanism of LINE (and SINE) transposition is
illustrated in Fig. 6.17. In summary, LINE
primary transcripts are translated into the ORF1
and ORF2 gene products in the cytosol. The RNA
then returns to the nucleus with the ORF1 2
proteins. These enzymes catalyze reverse
transcription and integration of the LINE element
at AT-rich regions of genomic DNA. The poly(A)
tail of the LINE RNA is used for selection of
integration sites. SINE element
retrotransposition is thought to be mediated by
the ORF1 2 proteins encoded by LINEs.
1
Nicking
27L/SINE Transposition (II)
Many LINEs are truncated at the 5' end due to
incomplete reverse transcription of the LINE RNA.
For this reason, and sequence drift, only 0.01
of LINE elements are functional today (100 per
genome). It is further thought that LINE SINE
transpositions occur at a rate of 1 in 8
individuals in the population. LINE
transpositions have been implicated in human
disease. About 1/600 mutations causing disease
can be traced to LINE transposition. However,
LINE SINE transpositions have been crucial in
the evolution of the human genome, as discussed
in the remaining slides. Lastly, the ORF1 and 2
LINE proteins are thought to be responsible for
insertion of processed pseudogenes into genomic
DNA.
28Exon Shuffling via Recombination Between
Homologous Interspersed Repeats
We previously have noted that gene evolution has
involved exon shuffling between protein-coding
genes in the genome. A large amount of shuffling
has occurred due to the prevalence of
interspersed repeats in the genome. Due to
sequence conservation within these regions,
crossover events can take place at these sites
(Fig. 6.18). This results in exon shuffling
between nonhomologous genes and the formation of
new genes with new combinations of protein
domains. As illustrated in Fig. 6.2, such events
also have been important in exon and gene
duplications.
29Exon Shuffling via Transposition
Exon shuffling can also occur via cut-and-paste
transpositions mediated by DNA transposons. The
mechanism by which this occurs is illustrated in
Fig. 6.19a. It requires that two copies of the
transposon flank the target exon. Both DNA
transposons and the exon will move as one piece
of DNA if the transposase happens to cleave DNA
at the left inverted repeat of the upstream
transposon and at the right inverted repeat of
the downstream transposon. Gene 1 ends up losing
the exon, and Gene 2 acquires the exon
30Exon Shuffling via Transposition
Exons can move along with a LINE element when it
transposes via its copy-and-paste mechanism (Fig.
6.19b). When a LINE element has a weak poly(A)
signal, RNA polymerase II continues to transcribe
downstream, potentially through an exon. If this
exon has a strong poly(A) signal, then
transcription stops and the RNA is
polyadenylated. Then following the mechanism in
Fig. 6.17, DNA encoding the exon and the LINE
element can be incorporated into another gene.
The spliced mRNA produced from the acceptor gene
may contain the newly introduced exon. Exon
shuffling is supported by experimental evidence
and the enormous amount of interspersed repeat
DNA in genomes. Over billions of years, it has
played a major role in evolution of genomes.