Title: Formation of novel protein-coding genes
1Formation of novel protein-coding genes
- Level 3 Molecular Evolution and Bioinformatics
- Jim Provan
Patthy Chapter 6
2De-novo formation of novel protein-coding genes
- Creation of simple structural elements such as
a-helices, b-sheets and reverse turns seems to be
rather trivial - So many alternative ways of forming these
structures - Have been invented independently several times
- Proteins with repetitive structure are most
likely to arise de novo - Repetitive oligonucleotide sequences can expand,
forming periodic protein structures - Collagen-like
- Leucine-rich repeat (LRR)
- Probably arose several times during evolution
3Evolution of serum antifreeze glycoproteins
- Fish that live in polar waters have serum
antifreeze glycoproteins (AFGPs) which allow them
to tolerate temperatures of as low as 1.9C - It has been shown that fish from the north and
south poles have evolved very similar AFGPs
independently - AFGP of Antarctic fish, made up of a simple
tripeptide repeat evolved by recruitment of the
5 and 3 ends of an ancestral trypsinogen gene
(secretory signal and 3 UTR) and de novo
amplification of a 9bp Thr-Ala-Ala motif - Arctic cod also have a Thr-Ala-Ala tripeptide
repeat-based AFGP but this has no relationship
with the trypsinogen gene - Threonines are O-linked to galactosyl-N-acetylgala
ctosamine and periodicity of repeats matches
periodicity of water molecules - Convergent evolution of the tripeptide-based AFGP
4De novo creation of complex proteins
- Probability of de novo creation of more complex,
globular proteins is inversely proportional to
complexity - Those that consist of a single supersecondary
structure element (TIM barrel proteins) have
higher probability of independent creation - TIM barrel structure likely to have evolved
several times - Easier to remodel replicas of old protein folds
than to invent them from scratch - Creation of first folded proteins was probably
the rate-limiting step in protein-based life - All extant proteins probably arose from a limited
number of ancestral folds through divergence
5Evolutionary convergence
- Previous examples highlight convergence to
similar primary, secondary or tertiary structure
(structural convergence) - Unlike structural convergence, functional
convergence and mechanistic convergence are
relatively common - Several types of proteinases that have similar
function (i.e. they cleave proteins) but have
different structures and catalytic mechanisms and
have evolved independently - Example of mechanistic convergence is the serine
proteases of the subtilisin and trypsin families - Similar active sites and catalytic mechanisms but
no sequence or conformational homology - Catalytic triad residues (His, Asp, Ser) occur in
different order in primary structures - Difficult to prove structural convergence
6Gene duplications
- Evolutionary significance of gene duplication is
that it gives rise to a redundant duplication of
a gene - Duplicated gene may acquire divergent mutations
and eventually emerge as a new gene - Gene duplication is the predominant and most
important mechanism by which new genes arise - Genes derived by a duplication event are said to
be paralogous and are found in different loci of
the chromosome - Different from orthologous genes gained by
speciation events, which are found in different
loci of the corresponding species
7Types of DNA duplications
- An increase in the number of copies of a DNA
segment can be brought about by several types of
DNA duplication - Partial, intragenic or internal gene duplication
only an internal segment of a protein-coding
gene is duplicated - Complete gene duplication, including flanking
regions necessary for expression - Partial chromosome duplication several adjacent
genes are duplicated - Chromosomal duplication (aneuploidy)
- Genome duplication (polyploidy)
8Mechanisms of gene duplication
- Major mechanisms for short intragenic
duplications is disengagement of the DNA
polymerase from the strand that is being copied
and reattachment at the wrong point (slipped
strand mispairing) - Major mechanism for larger duplications involves
unequal crossing over - Involves mistaken pairing and recombination
between homologous chromosomes - Most likely in already-duplicated regions
- Allows rapid expansion of repeats within genes
and expansion of gene families - May facilitate homogenisation of gene sequences
and thus slow down divergence (concerted
evolution)
9Unequal crossing-over
10Gene duplications in lysozyme
- In ruminants, lysozyme gene has been duplicated
10 times and is expressed less in
extra-intestinal tissues - In mice, intestinal lysozyme is expressed from
lysP gene, whereas in other tissues it is encoded
by the lysM gene - Original gene duplication through unequal
crossing-over in Alu-like B2 middle repetitive
elements
11Retrosequences
- Copies of protein-coding genes may be produced by
duplicative transposition - DNA is transcribed into RNA, which is
reverse-transcribed into a cDNA (retroposition) - During re-insertion, small segments of host DNA
(4-12bp) are duplicated, forming direct repeats - Significant diagnostic features of
retrosequences - Lack introns (where parent gene would have
introns) - Lack upstream promoter elements of parent gene
- Contain poly(A) stretches at 3 end
- Flanked by short, direct repeats
- Different chromosomal location from original gene
12Functionality of retrosequences
- Depending on whether the copied gene is
functional or not, we can distinguish processed
genes (retrogenes) and processed pseudogenes
(retropseudogenes) - Several reasons why functional retrogenes are
unlikely - Process of reverse-transcription is very
inaccurate - Lacks necessary regulatory elements
- Generally truncated at 5 end (reverse
transcriptase failure) - May be inserted in genomic region unsuitable for
expression - More likely to form retropseudogenes
- Some examples of processed functional genes have
been found e.g. human phosphoglycerate kinase - X-linked gene has 11 exons and 10 introns
- Autosomal PGK gene has no introns and a poly(A)
tail
13Alu elements
- Processed pseudogenes of the RNA gene specifying
7SL RNA which cuts signal sequences of secreted
proteins - About 300 bp long
- Around 500,000 copies in the human genome (5-6)
- Named after characteristic AluI restriction site
- Derived from functional 7SL sequence by
duplication, two deletions and many mutations - Play a key role in genome plasticity since they
facilitate unequal crossing-over - Gene duplication
- Exon shuffling
14Fate of duplicated genes
- Determined by functional consequences of having
extra copies of same gene and increased amounts
of protein - Duplications can be advantageous, deleterious or
neutral - If an organism is exposed to a toxic environment,
there may be an advantage in overproduction of
detoxifying enzymes - Disadvantage will result of overproduction of
protein upsets regulatory balance - Most duplications are neutral fate determined
by selection and drift - Duplicated gene is unlikely to be fixed unless it
acquires a novel and useful function - May specialise in different subfunctions of
ancestral gene - May acquire drastically different functions
(hepatocyte growth factor vs. plasminogen)
15Formation of gene families
- Recently duplicated gene families are generally
found in close proximity on the same chromosome - Some multigene families contain invariant
repeated genes - Common when large quantities of protein product
are required - Histones have to be synthesised at a high rate
during a well-defined, short period of cell
division - Some members of multigene families serve the same
function but differ in tissue specificity,
developmental regulation or biochemical
properties e.g. isozymes
16Concerted evolution in multigene families
- Paralogous members of multigene families are very
similar to each other within one species although
orthologous members of the same family may differ
greatly between even closely related species - Suggests that mechanisms exist which cause gene
families to evolve together as a unit (concerted
evolution) - Process of concerted evolution of multigene
families under the effects of random genetic
drift is known as molecular drive - Gene correction mechanisms may homogenise genes
difficult to trace true evolutionary history of
many multigene families
17Dating gene duplications
- Assuming duplicated genes diverge at a constant
rate, we can estimate the date of a gene
duplication, TD, that gave rise to two paralogous
genes (A and B) if we have sequences of these
paralogues from two different species (1 and 2)
and we know the time of speciation TS - If genes evolved at a constant rate then
- Average number of substitutions per site (KA
KB/2) in the two orthologue comparisons (A1 vs.
A2, B1 vs. B2) is proportional to TS - Average number of substitutions per site KAB in
the four paralogous comparisons (A1 vs. B1, A2
vs. B2, A1 vs. B2, A2 vs. B1) is proportional to
the time since duplication TD - Thus, the following equation holds
- TD/TS 2KAB/KA KB
18Dating gene duplications (continued)
- All vertebrates have both myoglobin and
haemoglobin - Myoglobin differs from both the a and b subunits
of haemoglobin more than they differ from each
other - Myoglobin diverged (TD 600-800 mya) before the
a and b genes arose (TD 500 mya) - Mammals, reptiles, birds, amphibians and bony
fish all have distinct a and b subunits, whereas
the most primitive vertebrates, the Agnatha
(jawless fish), contain only one type of
haemoglobin subunit - Myoglobin and haemoglobin diverged prior to the
separation of agnathans and jawed vertebrates - Duplication giving rise to a and b subunits
occurred in the ancestor of all jawed vertebrates
following its divergence from agnathans
19Evolutionary history and linkage patterns in a-
and b-globin clusters
- In humans, gene cluster of the a-globin family
(c16) consists of four functional genes (z, a1,
a2, q1) and three unprocessed pseudogenes (Yz,
Ya1, Ya2) - Embryonic type z is most divergent (estimated TD
gt 300 mya) - q1 is less divergent (estimated TD 260 mya)
- Genes a1 and a2 produce identical polypeptide and
have near-identical nucleotide sequence,
suggesting recent divergence - b-globin family (c11) contains five functional
genes and Yb - Adult types (b and d) diverged from non-adult
types (Gg, Ag and e) around 155-200 mya - Ancestor of both g genes diverged from e about
100-140 mya - Duplication that formed Gg and Ag occurred after
separation of human lineage from New World
monkeys (35 mya) - Divergence of adult genes (b and d) occurred
about 80 mya
20Intergene distance and time since duplication