Title: Genomics: Basic concepts
1Genomics Basic concepts
2Contents
- Genome structure
- Comparative mapping
- Comparative genomics
- Definitions in comparative genomics
- Orthologs
- Protein sequences
- Gene expression
3Nuclear genomes and others
Most of the genetic material in eukaryotes is
located in the nucleus. The mitochondria and
chloroplasts also contain important genetic
material, but these organelle genomes usually
carry only a small number of genes and are
typically inherited maternally. Therefore, unless
otherwise specified, genetic and genomic studies
usually refer to the nuclear genome.
Image from Michael Davidson, Florida State
University, http://micro.magnet.fsu.edu/cells/plantcell.html
4From DNA to protein
We know that proteins - large, complex molecules
made up of amino acids - are really what perform
most life functions. How do we get from a DNA
sequence to a protein?
Two main steps are involved in this process:
1) Transcription: the DNA is copied into a
molecule of RNA, called messenger RNA or mRNA,
whose bases are complementary to those of the
DNA. This complementarity is the same chemical
mechanism that holds the 2 strands of DNA
together, where an A binds to a T, and a G to a
C. There is one difference: RNA contains the base
uracil (U) instead of thymine, so during
transcription the RNA will have an A where the
DNA has a T, a G where the DNA has a C, and a C
where the DNA has a G, but a U where the DNA has
an A.
2) Translation: the information now encoded in
RNA is deciphered (translated) into instructions
for making a protein. Proteins are then
manufactured in cell structures known as
ribosomes.
See next slide for a pictorial of this process.
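The base-pairing rule for transcription can be sketched in a few lines of Python. This is only an illustration: the function name `transcribe` and the example sequence are made up for this sketch, and strand orientation (5'/3') is ignored.

```python
# Pairing rules described above: each DNA base on the template strand
# is matched by its complementary RNA base (U takes the place of T).
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_dna: str) -> str:
    """Return the RNA complementary to a DNA template strand.

    Hypothetical helper; strand orientation is ignored in this sketch.
    """
    return "".join(DNA_TO_RNA[base] for base in template_dna.upper())

# Made-up template strand, for illustration only
print(transcribe("TACCGTAAG"))  # AUGGCAUUC
```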
5Key background concept: protein synthesis
Most of our genome is composed of non-coding DNA
interspersed with relatively few protein-coding
sequences, or genes. The DNA of a gene is
converted, or transcribed, into messenger RNA
(mRNA), RNA that serves as a template for protein
synthesis. One of the steps in producing the
mature mRNA is called splicing, where the introns
(sequences interrupting genes) are removed. The
remaining mRNA is translated into protein (next
slide).
From the National Center for Biotechnology
Information (NCBI), A Science Primer,
http://www.ncbi.nlm.nih.gov/About/primer/est.html
6Translation
In translation, codons of three nucleotides
determine which amino acid will be added next in
the growing protein chain. So by examining the
DNA sequence alone we can determine the sequence
of amino acids that will later appear in the
final protein. In fact, there are software
programs that can do this for us, which we will
learn more about later in the module. The part of
the DNA sequence that could potentially code for
a protein (and be a gene or a part of a gene) is
called an open reading frame (ORF). It is
important to know which nucleotide to start
translation with, and when to stop; otherwise
there would be a frameshift of the sequence,
leading to an entirely different protein.
Therefore it is important to know which is the
start codon, and which is the stop codon: these
are specific codons which signal the beginning
and end of an open reading frame (see next
slide).
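To see why the starting nucleotide matters, here is a minimal Python sketch that translates the same DNA in two reading frames. It assumes the standard genetic code, written in the compact form used by NCBI translation table 1. The function name `translate` is hypothetical, and the sequence is the start of the example on the next slide with a TAA stop codon appended for illustration.

```python
from itertools import product

# Standard genetic code: codons enumerated with bases in the order T, C, A, G;
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna: str, frame: int = 0) -> str:
    """Translate codon by codon from offset `frame` until a stop codon."""
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

seq = "ATGGCATTCGCAGTCTAA"   # illustrative sequence (TAA stop appended)
print(translate(seq))        # MAFAV
print(translate(seq, 1))     # WHSQS: shifting by one base changes every codon
```

Shifting the frame by a single nucleotide produces a completely different amino acid sequence, which is the frameshift effect described above.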
7Open reading frames
In most species, the codon for methionine (Met),
which is the sequence ATG, signals the start of
an open reading frame, and a stop is signaled by
one of 3 stop codons: TAA, TAG, or TGA. The
sequence of nucleotides between a start
(initiation) codon and a stop (termination) codon
is called an open reading frame (ORF). ORFs can
potentially code for proteins (and so be a gene
or part of a gene), but not all do.
An example of a DNA sequence with a start (green)
and stop (red) codon:
...TCGAATGGCATTCGCAGTC...TACTTGCACGCTTGACCGTCATAAGCA...
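The ORF definition above can be turned into a simple scan: find an ATG, then read forward three bases at a time until a stop codon appears. The sketch below is a simplified illustration, not a gene-finding tool. The function name `find_orfs` and the test sequence are made up, and only one strand is searched.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna: str) -> list[str]:
    """Return each stretch that runs from an ATG to the first in-frame stop codon."""
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # Step through the sequence codon by codon, in frame with the ATG
        for i in range(start + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                orfs.append(dna[start:i + 3])
                break
    return orfs

# Made-up sequence: the ATG at position 5 is followed by an in-frame TGA
print(find_orfs("TCGAATGGCATTCGCAGTCTGACCG"))  # ['ATGGCATTCGCAGTCTGA']
```

An ATG with no in-frame stop codon downstream is not reported, since the reading frame is never closed.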
8The universal genetic code
(which, by the way, isn't completely universal)
This table shows all the nucleotide triplets
(codons) and their corresponding amino acids.
Remember that in RNA, a uracil (U) is substituted
for a thymine (T).
There are some exceptions, for example in
certain organelles and bacteria.
9Wobble in the code
You can see from the table on the previous slide
that some amino acids are coded for by more than
one codon. For example, both UUU and UUC code for
phenylalanine. Therefore, some changes in the
DNA sequence, for example a point mutation such
as that shown below, do not result in changes in
the amino acid sequence, and therefore in the
protein produced.
Changes such as this are called synonymous, or
silent, mutations (and are usually treated as
neutral): they do not result in a change in the
final gene product. In this example, a change in
the third nucleotide from a U to a C would not
result in an amino acid change, as both GCU and
GCC code for alanine.
Image from Wikipedia, genetic code,
http://en.wikipedia.org/wiki/Genetic_code
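Whether a codon change is synonymous can be checked by looking both codons up in the code table. The following Python sketch assumes the standard genetic code written with RNA bases. The helper names `is_synonymous` and `codons_for` are hypothetical.

```python
from itertools import product

# Standard genetic code with RNA bases in the order U, C, A, G; "*" marks a stop.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def is_synonymous(codon_a: str, codon_b: str) -> bool:
    """True if both codons specify the same amino acid."""
    return CODON_TABLE[codon_a] == CODON_TABLE[codon_b]

def codons_for(amino_acid: str) -> list[str]:
    """All codons that specify the given one-letter amino acid."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == amino_acid]

print(is_synonymous("GCU", "GCC"))  # True: both code for alanine (A)
print(is_synonymous("UUU", "UUC"))  # True: both code for phenylalanine (F)
print(is_synonymous("GCU", "GAU"))  # False: alanine (A) vs. aspartate (D)
print(codons_for("A"))              # ['GCU', 'GCC', 'GCA', 'GCG']
```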
10Key background concept: Genome structure
Much of the genome, the entire genetic material
of an organism, is actually not genes per se.
Interrupting the protein-coding sequences of a
gene (called exons) are introns, intervening DNA
segments. Both are transcribed into RNA (see
next few slides) but only exons remain in the
final, mature mRNA transcript.
So when we're talking about the DNA sequence, it
is important to keep in mind that not all of the
sequence is actually genes. This will be
important to remember later when we learn about
comparing sequences, finding genes, and other
tools and methods of genomics.
From Wikipedia, gene, http://en.wikipedia.org/wiki/Gene
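The relationship between exons, introns, and the final transcript can be illustrated with a short Python sketch. Everything here is hypothetical: the function name `splice`, the sequence, and the exon coordinates are invented for illustration, with the intron written in lowercase so it is easy to spot.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon segments, given as 0-based (start, end) half-open intervals."""
    return "".join(pre_mrna[start:end] for start, end in exons)

# Made-up transcript: exon 1 (6 bases) + intron (11 bases, lowercase) + exon 2 (9 bases)
pre_mrna = "AUGGCA" + "guaagucuuag" + "UUCGCAUAA"
exons = [(0, 6), (17, 26)]

print(splice(pre_mrna, exons))  # AUGGCAUUCGCAUAA
```

Only the exon sequence survives in the spliced result, which is why a mature mRNA is shorter than the stretch of DNA it was transcribed from.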
11Key background concept: higher order structure
In addition, each of the 20 amino acids coded for
by the DNA sequence has different chemical
properties that cause the protein chains to form
specific three-dimensional shapes that help
differentiate their particular functions in the
cell. For example, certain folding patterns
(called tertiary structures) make it possible for
specific enzymes to bind in a particular place.
One change in the DNA sequence could change the
amino acid, which could change the protein
structure. You can see why the DNA sequence is so
important!
Figure: sketch of a tertiary protein folding
structure
12Levels and types of genome variation
Plant genomes may differ from one another in
different ways:
- Amount of DNA in the nucleus. This is generally
quantified in picograms (pg), and sometimes
called the C-value. It has not been measured for
all plants yet, but in those for which it has,
this value can vary over 1000-fold.
- Number and size of chromosomes.
- Differences at the sequence level, both in the
absolute order of the bases, and in the type and
number of different classes of sequences.
Typically in genomics research we are talking
about sequence-level variation.
Cullis 2004
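To relate a C-value in picograms to a genome size in base pairs, a commonly used conversion is roughly 978 million base pairs per picogram of DNA. That factor is not given in these slides and is an assumption of this sketch, as are the function name and the two example C-values, which are purely hypothetical.

```python
# Assumed conversion factor (not from the slides): ~0.978 x 10^9 bp per pg of DNA
BP_PER_PICOGRAM = 0.978e9

def c_value_to_mbp(picograms: float) -> float:
    """Convert a DNA amount in picograms to millions of base pairs (Mbp)."""
    return picograms * BP_PER_PICOGRAM / 1e6

# Hypothetical C-values for a small and a large plant genome
small_pg, large_pg = 0.1, 120.0
print(round(c_value_to_mbp(small_pg), 1), "Mbp")
print(round(c_value_to_mbp(large_pg), 1), "Mbp")
print(round(large_pg / small_pg), "fold difference")
```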
13Mechanisms of genetic diversity
Remember that all of the organisms that exist
today originated, billions of years ago, from a
common ancestor. The huge amount of variation
among organisms that exists now is due to changes
in the DNA sequence over time (evolution).
There are many mechanisms by which genetic
variation occurs. Some of these are:
- Point mutations
- Insertions and deletions
- Translocations
- Transposons
- Splicing, transcription and translation errors
We will not go into these in detail in this
module, but it is important to keep these in mind
for understanding the differences we see among
organisms.
14Finding genes: cDNA
In many cases, especially in the early stages of
exploring or comparing genomes, researchers want
to focus mainly on the genes of an organism
rather than the whole genome sequence. You might
think it would be better to just study mRNA to
really focus on the genes. However, mRNA and
protein are very unstable outside of a cell and
therefore much more difficult to work
with. Therefore, scientists use special enzymes
to convert RNA into complementary DNA (cDNA).
cDNA is a much more stable compound and, because
it was generated from an mRNA in which the
introns have been removed, cDNA represents only
the expressed sequence of the genes. One widely
used method of studying genes this way is using
ESTs, which we will learn more about in the next
section.
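In sequence terms, the DNA strand copied from an mRNA is its reverse complement, with T in place of U. A minimal Python sketch (the function name and input sequence are made up for illustration):

```python
# Pairing rules for copying RNA into DNA (T is used instead of U)
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}

def first_strand_cdna(mrna: str) -> str:
    """Return the DNA strand complementary to an mRNA, written 5'->3'."""
    return "".join(RNA_TO_DNA[base] for base in reversed(mrna.upper()))

# Made-up mRNA fragment
print(first_strand_cdna("AUGGCAUUC"))  # GAATGCCAT
```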
15Comparative mapping
The long-used technique of genetic mapping, or
linkage mapping, uses the concepts of Mendelian
inheritance and recombination to determine the
chromosomal location of either genes or other
sequences of DNA called markers (such as RFLPs,
SSRs, etc.) by analyzing their inheritance
patterns. This is most typically done by either
Southern hybridization or, more recently,
polymerase chain reaction (PCR) with primers
designed to specifically amplify certain DNA
sequences. Both of these methods require a
minimum amount of sequence similarity between the
probe and target species and so cannot be used
with more distantly related species.
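The way recombination is used to place markers on a map can be illustrated with the simplest calculation: the percentage of progeny that are recombinant between two markers, which approximates their map distance in centimorgans when the markers are close together. The function name and the progeny counts below are hypothetical.

```python
def recombination_frequency(parental: int, recombinant: int) -> float:
    """Percentage of progeny showing recombination between two markers."""
    total = parental + recombinant
    if total == 0:
        raise ValueError("no progeny scored")
    return 100.0 * recombinant / total

# Hypothetical mapping population: 180 parental-type and 20 recombinant progeny
print(recombination_frequency(parental=180, recombinant=20))  # 10.0
```

Markers that recombine rarely are close together on the chromosome; a value near 50% suggests the markers are unlinked.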
However, heterologous DNA markers can be used to
generate comparative maps and to infer linkage
conservation and the position of orthologous loci
among sexually incompatible species. For
example, most Gramineae genomes (i.e. grass
species such as maize, rice, wheat, barley, and
millet) are now connected through comparative
genetic maps (next slide).
Figure: Here, markers from the tomato map were
mapped onto the eggplant map, highlighting an
inversion between the two species.
16While genome size varies dramatically among grass
species, gene content and gene order are more
highly conserved
17Syntenic relationships
One of the key early discoveries of genomics was
that the genes in related organisms often retain
an ancestral order; that is, even after evolution
and divergence, certain sets of genes are in the
same order along the chromosome among related
organisms. This is now referred to as synteny
(although in the past this term meant only "on
the same chromosome", it has come to mean
co-linear). Comparison across entire genomes is
sometimes referred to as macrosynteny. Direct
comparisons of fully sequenced portions of
related genomes allow even finer comparisons of
genetic changes across species, and such
comparisons are sometimes referred to as
microsynteny studies.
18Definition: synteny
Synteny (strict-sense definition): orthologous
loci in two different species located on the same
chromosome.
Colinear: a set of loci in two different species
that are on the same chromosome and conserved in
order.
"Synteny" is now used as a substitute for
"colinear", despite the different origins of the
terms.
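The difference between the two definitions can be made concrete with a small Python sketch. It assumes each chromosome is represented as a list of unique locus names in map order. The function names and the example orders are hypothetical.

```python
def shared_loci(order_a: list[str], order_b: list[str]) -> list[str]:
    """Loci present on both chromosomes, listed in the order found in species A."""
    in_b = set(order_b)
    return [locus for locus in order_a if locus in in_b]

def is_colinear(order_a: list[str], order_b: list[str]) -> bool:
    """True if the shared loci occur in the same order in both species.

    A fully reversed order also counts, since chromosome orientation is arbitrary.
    """
    shared_a = shared_loci(order_a, order_b)
    keep = set(shared_a)
    shared_b = [locus for locus in order_b if locus in keep]
    return shared_a == shared_b or shared_a == shared_b[::-1]

species_1 = ["A", "B", "C", "D"]
species_2 = ["A", "C", "B", "D"]   # B and C are inverted relative to species 1

print(shared_loci(species_1, species_2))  # ['A', 'B', 'C', 'D']: syntenic (strict sense)
print(is_colinear(species_1, species_2))  # False: same chromosome, different order
```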
19Macrosynteny vs. microsynteny
Figure: gene order along the chromosomes for two
pairs of organisms. Capital letters (A, B, C)
mark the genes being compared; lowercase letters
mark the genes lying between them.
In the 2 organisms being compared on the left,
not only are the genes A, B, and C syntenic, but
all the genes in between are as well. In the 2
organisms being compared on the right, although
genes A, B, and C appear syntenic, other genes in
between are not.
We must always consider the possibility that at
the micro level, gene content and gene order are
not so well conserved, despite overall
macrosynteny.
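One way to put a number on this is to ask, for each pair of neighboring anchor genes, what fraction of the genes lying between them is shared by both organisms. The sketch below is only one possible measure (shared genes divided by all genes seen between the anchors). The function name and the gene orders are hypothetical.

```python
def intervening_conservation(order_a, order_b, anchors):
    """For each adjacent anchor pair, the fraction of in-between genes shared.

    Assumes unique gene names and anchors listed in their chromosomal order.
    """
    def between(order, left, right):
        i, j = order.index(left), order.index(right)
        return set(order[i + 1:j])

    result = {}
    for left, right in zip(anchors, anchors[1:]):
        genes_a = between(order_a, left, right)
        genes_b = between(order_b, left, right)
        union = genes_a | genes_b
        result[(left, right)] = len(genes_a & genes_b) / len(union) if union else 1.0
    return result

# Hypothetical gene orders; A, B, and C are the anchor genes
org_1 = ["A", "t", "u", "B", "v", "w", "x", "C"]
org_2 = ["A", "k", "n", "B", "g", "o", "x", "C"]

print(intervening_conservation(org_1, org_1, ["A", "B", "C"]))
# {('A', 'B'): 1.0, ('B', 'C'): 1.0}
print(intervening_conservation(org_1, org_2, ["A", "B", "C"]))
# {('A', 'B'): 0.0, ('B', 'C'): 0.2}
```

In the second comparison the anchors A, B, and C are in the same order, yet almost none of the genes between them match.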
20Comparing gene function and relationships
As well as comparing the order of genes among
different organisms, we also want to know whether
these genes have an ancestral relationship. The
term ortholog is used to denote homologous genes
from different species that are derived from a
common ancestral gene at the time of the last
common ancestor (as opposed to paralogs, see next
slide). The term homology historically meant
having a common evolutionary origin, or
relatedness. However, it is now often used simply
to describe similarity in DNA sequence (to the
dismay of many language purists).
21Definitions: orthology vs. paralogy
Orthologous genes: two genes found in
different species that trace back to a common
ancestral gene at the time of the last common
ancestor.
Paralogous genes: genes duplicated within a
species.
Figure: gene A in an ancestral species and its
copies in Species A and Species B, with
paralogous genes labeled.
22Conserved ortholog methods
Care must be taken that one is comparing actual
orthologous genes among different genomes, and
not paralogs, in order to draw any conclusions
about the evolutionary relationship and history
of the genomes (as paralogs may have diverged at
different rates). For a good discussion of the
relationships between orthologs and paralogs and
the history of the use of the terms, see Koonin
E (2005) Orthologs, paralogs, and evolutionary
genomics. Annu Rev Genet 39: 309-338
23Methods/Criteria for Identifying Orthologs
1) Hybridization: If a gene probe identifies
(through Southern hybridization) a single copy in
two related diploid species, it is assumed that
the gene is orthologous in those two species.
Southern hybridization (>70% sequence similarity
required to see signal)
The DNA sequence in the photo at the right
appears to be single copy per genome, so it may
be surmised to be an ortholog.
24Paralogous gene families
Paralogous gene family (a large portion of genes
belong to multigene families, and orthologs are
not easily discriminated between plant species)
Southern hybridization (>70% sequence similarity)
In this example the DNA sequence hybridizes to a
large number of genes in the genome, making it
impossible to distinguish between paralogs and
orthologs.
Note: this method of ortholog detection has a
limitation. It cannot detect orthologs that are
less than 70% similar at the nucleic acid level.
That level of similarity is usually reached when
comparing species within a family, but not always
in more distant comparisons (e.g. between
families).
25Identifying Orthologs via computation
Another method of identifying orthologs is called
in silico (computational matching). This
utilizes sequence alignments (either nucleic acid
or amino acid) to detect putative orthologs by
identifying strong sequence alignments (we will
learn more about sequence alignments in the next
section). It is important to note that,
eventually, laboratory confirmations must be done
to prove orthology; however, first identifying
putative (potential) orthologs computationally
(and eliminating clear paralogs) can greatly
decrease the amount of validation work needed.
This approach does, however, require that large
amounts of sequence data be available for all the
organisms being analyzed.
For an example of this method, see Fulton TM,
van der Hoeven R, Eannetta NT, Tanksley SD
(2002) Identification, analysis and utilization
of Conserved Ortholog Set (COS) markers for
comparative genomics in higher plants. Plant
Cell 14: 1457-1467
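One simple computational heuristic for picking putative orthologs out of alignment results is the reciprocal best hit: gene a in species A and gene b in species B are paired only if each is the other's highest-scoring match. This is a generic illustration and not the procedure used in the paper cited above. The gene names and scores are invented, and in practice the scores would come from an alignment program.

```python
def reciprocal_best_hits(scores: dict[tuple[str, str], float]) -> list[tuple[str, str]]:
    """Return (gene_a, gene_b) pairs that are each other's best-scoring match.

    `scores` maps (gene in species A, gene in species B) to an alignment score.
    """
    best_for_a: dict[str, str] = {}
    best_for_b: dict[str, str] = {}
    for (a, b), score in scores.items():
        if a not in best_for_a or score > scores[(a, best_for_a[a])]:
            best_for_a[a] = b
        if b not in best_for_b or score > scores[(best_for_b[b], b)]:
            best_for_b[b] = a
    return sorted((a, b) for a, b in best_for_a.items() if best_for_b[b] == a)

# Hypothetical alignment scores between genes of species A (a1..a3) and B (b1, b2)
scores = {
    ("a1", "b1"): 95, ("a1", "b2"): 60,
    ("a2", "b1"): 70, ("a2", "b2"): 40,
    ("a3", "b2"): 88,
}
print(reciprocal_best_hits(scores))  # [('a1', 'b1'), ('a3', 'b2')]
```

Here a2 matches b1 fairly well but is not b1's best match, so it is left out as a possible paralog. Pairs found this way remain putative until confirmed.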
26SNPs
Single nucleotide polymorphisms, or SNPs
(pronounced "snips"), are differences in a DNA
sequence of just one nucleotide. These
differences may or may not cause a visible change
in phenotype, or even in the amino acid sequence.
SNPs are very useful as markers as they are the
most basic type possible, but they require being
able to do quite a bit of DNA sequencing, which
can be expensive.
An example of a SNP between 2 small DNA
sequences. Remember, DNA sequences are
abbreviated to the 4 different bases, A
(adenine), C (cytosine), T (thymine), and G
(guanine).
GACGTACGATCAG
GACTTACGATCAG
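Finding the SNP in the two sequences above amounts to comparing them position by position. A minimal Python sketch, assuming the sequences are already aligned and of equal length (the function name is hypothetical):

```python
def find_snps(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, base in seq_a, base in seq_b) for each mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, a, b)
        for position, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]

print(find_snps("GACGTACGATCAG", "GACTTACGATCAG"))  # [(4, 'G', 'T')]
```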
27Comparing protein sequences
While proteins can be sequenced directly, it is
much more difficult and time-consuming than
sequencing DNA. Furthermore, due to the nature of
the genetic code, the protein sequence can be
deduced from the DNA sequence, but not vice
versa. Thus, protein sequences are usually
compared by deducing (translating) the sequences
from the DNA sequences first. There are software
programs that help do this. However, there are
cases when comparing protein sequences, rather
than DNA sequences, is useful. Recall that some
amino acids can be coded for by more than one set
of triplet nucleotides. For example, alanine can
be coded for by GCU, GCC, GCA or GCG. This means
that two organisms might have differences in the
DNA sequence but still have the same amino acid
sequence. Therefore, two organisms that are so
distantly related that they have many differences
in their DNA sequences may have fewer differences
in their amino acid sequences, and thus can be
compared at this level instead.
28Gene Expression
To complicate matters, just because there is a
gene present in the genome that codes for a
particular protein does not mean that that gene
is on all the time. When, where, and how a gene
is turned on is commonly referred to as gene
expression, and measuring it is called expression
profiling. Once we learn where and how a gene is
expressed under normal circumstances, we can then
study what happens in different states, such as
in disease or during various stages of
development. To accomplish the latter goal,
however, researchers must identify and study the
protein, or proteins, coded for by the gene(s)
involved, and identify how their expression is
controlled. So genomics, and comparative
genomics, can involve looking at the locations of
genes, changes in the sequence (such as in
evolutionary studies), or how these genes are
expressed. The next section describes some of the
methods for doing so.
29Resources More Information
Koonin E (2005) Orthologs, paralogs, and
evolutionary genomics. Annu Rev Genet 39: 309-338
The Human Genome Project,
http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml
National Center for Biotechnology Information,
http://www.ncbi.nlm.nih.gov/
National Center for Biotechnology Information
(NCBI). A Science Primer. Available at
http://www.ncbi.nlm.nih.gov/About/primer/index.html
(accessible as of January 6, 2007).
National Institutes of Health, National Institute
of General Medical Sciences (2001) Genetics
Basics. NIH Publication No. 01-662. Also
available at http://publications.nigms.nih.gov/genetics/