Title: Genome Organization overview
1- Genome Organization overview
- Eukaryotic genomes are complex, and DNA amount and organization vary widely between species
- C-value paradox: the amount of DNA in the haploid cell of an organism is not related to its evolutionary complexity or number of genes
2C-value Paradox
Drosophila has a genome 20x smaller than human and 2x fewer genes
Newt and lungfish genomes are 5x and 50x larger than human, respectively
3Number of genes does increase in higher organisms
4Re-association Kinetics
Cot: initial DNA concentration (Co) x time (t)
Complexity: sum of the lengths of all single-copy (unique) sequences in a genome
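As a rough illustration of how a Cot curve is read, the sketch below assumes ideal second-order re-association kinetics, where the fraction of DNA still single-stranded is C/Co = 1 / (1 + k x Cot). The rate constant `k` and the function names are illustrative assumptions, not values from these slides.

```python
def fraction_single_stranded(cot, k):
    """Fraction of DNA still single-stranded at a given Cot value.

    Assumes ideal second-order kinetics: C/Co = 1 / (1 + k * Cot).
    k is a hypothetical rate constant chosen for illustration.
    """
    return 1.0 / (1.0 + k * cot)


def cot_half(k):
    """Cot value at which half of the DNA has re-associated (Cot1/2 = 1/k)."""
    return 1.0 / k


if __name__ == "__main__":
    k = 2.0  # assumed rate constant, arbitrary units
    for cot in (0.0, 0.1, cot_half(k), 10.0):
        print(f"Cot = {cot:6.2f}  single-stranded fraction = "
              f"{fraction_single_stranded(cot, k):.3f}")
```

Under this model a slowly re-associating (high Cot1/2) fraction corresponds to a small `k`, which is how single-copy sequence is separated from repeated sequence on a Cot plot.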
5- There are different classes of eukaryotic DNA
- based on sequence complexity
- revealed by re-association kinetics
6A timeline of genomics research
7Status of plant genome sequences as of January
2006
8The human genome
- Two versions of the human genome sequence were published in February 2001
- DNA sequences that encode proteins make up only 5% of the genome
- 50% of sequences are transposable elements
- Clusters of gene-rich regions are separated by gene deserts
- CH 19 has the highest gene density; CH 13 and Y show the lowest gene density
9The human genome
- Gene total estimated at 30,000-40,000 (now maybe 25,000), w/ an average gene size of 27 kb
- Hundreds of genes share homology w/ those of bacteria
- The number of introns varies greatly (from 0 for histone genes to 234 for titin)
10The human genome
- Genes are larger and contain more and larger introns than those in invertebrates (the dystrophin gene is 2.5 Mb)
- Genes are not evenly spaced on CHs
- The most common genes include those involved in nucleic acid metabolism (7.5%), receptors (5%), protein kinases (2.8%) and cytoskeletal structural proteins (2.8%)
11The human genome - predicted gene function
12Any 2 human genomes are roughly 99.9% identical
On average, 0.1% of nucleotides differ
Chr - chromosome
n - number of samples examined
bp - number of base pairs sequenced
S - number of polymorphic sites
p - nucleotide divergence
Przeworski, M., et al. (2000) Trends Genet 16,
296-302.
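To put the 0.1% figure in absolute terms, here is a small arithmetic sketch. The haploid genome size of about 3.2 billion bp is an assumption not stated on the slide, and the function name is illustrative.

```python
def expected_differences(genome_bp, fraction_different):
    """Approximate number of differing sites between two genomes."""
    return round(genome_bp * fraction_different)


if __name__ == "__main__":
    genome_bp = 3_200_000_000    # assumed haploid human genome size (bp)
    fraction_different = 0.001   # 0.1% average difference, from the slide
    n = expected_differences(genome_bp, fraction_different)
    print(f"Roughly {n:,} differing sites between two haploid genomes")
```

Even at 99.9% identity, this works out to a few million differing positions, which is ample raw material for phenotypic variation.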
13Yet phenotypic differences abound!
14Genome organization in plants
- Size of genome varies widely (100 Mb-5,500 Mb)
- Many tandem gene duplications; larger duplications and some interchromosomal duplications are also observed
- Large-genome plants also have genes clustered, with long stretches of intergenic DNA
- In maize, the intergenic sequences are composed mainly of transposons
17- Genome Organization
- gene identification
- Genes can be difficult to identify/predict based on genome sequence
- The human genome appears to contain fewer genes than originally predicted, but an estimated 35,000 genes produce an estimated 150,000 proteins
18- Genome Organization
- gene identification
- No one-to-one correspondence between
- Genome (all genes of an organism)
- Transcriptome (all transcripts of an organism)
- Proteome (all proteins of an organism)
19Variable estimates of human gene content
20Gene identification - the simple view
21Gene identification - the challenges
from Klug & Cummings 1997
22Gene identification - the challenges
- Non-coding sequences
- Promoters and enhancers of gene expression can be distant from the coding region itself
- Genes can have alternative promoters
- Genes can have alternative terminators
24Gene identification - the challenges
- Introns and exons
- Most eukaryotic genes have introns
- Introns are often much longer than exons
- There are often many introns, so the mRNA is much shorter than the genomic DNA
- Intron size can vary between the same gene of different species
- Splice junctions are difficult to predict
- Alternative splicing
25Gene identification - the challenges
Introns and exons
- Eukaryotes only
- Removal of internal parts of the newly transcribed RNA
- Takes place in the cell nucleus
26Introns and exons
27Introns - numerous, longer than exons
28- Variable intron size
- same gene, different organism
29Introns - alternative splicing
- Different splice patterns from the same sequence,
therefore different products from the same gene.
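To make the "different products from the same gene" point concrete, the sketch below enumerates the transcripts possible under one simplified model: the first and last exons are always kept, and any internal exon may be skipped. Real splicing is far more constrained. The exon names and the function name are hypothetical.

```python
from itertools import combinations


def exon_skipping_isoforms(exons):
    """List every exon combination that keeps the first and last exon.

    Simplified model: internal exons are independently kept or skipped,
    and exon order is always preserved.
    """
    if len(exons) < 2:
        return [tuple(exons)]
    first, *internal, last = exons
    isoforms = []
    # Go from "keep all internal exons" down to "skip all internal exons".
    for n_kept in range(len(internal), -1, -1):
        for kept in combinations(internal, n_kept):
            isoforms.append((first, *kept, last))
    return isoforms


if __name__ == "__main__":
    for isoform in exon_skipping_isoforms(["E1", "E2", "E3", "E4"]):
        print("-".join(isoform))
```

Under this model a gene with n exons could in principle give 2^(n-2) transcripts, so a single sequence can yield many distinct mRNAs.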
30One gene, many proteins - alternative splicing, 3' cleavage
31- Exon shuffling
- Different genes having similar exons
32Why genome, transcriptome and proteome don't correlate in size
- More sophisticated regulation of expression
- Proteome vastly larger than genome
- Alternative splicing, promoters, terminators (59% of genes, with an average of 3 different products)
- RNA editing
- Post-translational modifications
- Moonlighting
- Same protein, different function depending on cellular location
33Gene Identification
- Open reading frames
- Sequence conservation
- Database searches
- Synteny
- Sequence features
- CpG islands
- Evidence for transcription
- ESTs, microarrays
- Gene inactivation
- Transformation, TEs, RNAi
34Gene identification - Open reading frames
- 5'atgcccaagctgaatagcgtagaggggttttcatcatttgaggacgatgtataa
- frame 1
- atg ccc aag ctg aat agc gta gag ggg ttt
-  M   P   K   L   N   S   V   E   G   F
- tca tca ttt gag gac gat gta taa
-  S   S   F   E   D   D   V   *
- frame 2
- tgc cca agc tga ata gcg tag agg ggt ttt cat
-  C   P   S   *   I   A   *   R   G   F   H
- cat ttg agg acg atg tat
-  H   L   R   T   M   Y
- (* = stop codon)
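The frame-by-frame translation above can be reproduced with a short script. This is a minimal sketch that uses the standard genetic code and only the forward strand. The function names and the simple "starts with M, single stop at the end" test are illustrative assumptions, not a full gene-prediction method.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate a DNA string in the given forward frame (0, 1 or 2).

    Stop codons are written as '*'; a trailing partial codon is ignored.
    """
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def is_open_reading_frame(protein):
    """True if the translation starts with M and its only stop is at the end."""
    return (protein.startswith("M") and protein.endswith("*")
            and "*" not in protein[:-1])


if __name__ == "__main__":
    seq = "atgcccaagctgaatagcgtagaggggttttcatcatttgaggacgatgtataa"
    for frame in range(3):
        protein = translate(seq, frame)
        print(f"frame {frame + 1}: {protein}  open: {is_open_reading_frame(protein)}")
```

Frame 1 reads through to a single terminal stop, while frame 2 is interrupted by internal stops, which is the signal an ORF scan looks for.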
35Gene identification - Database searches
36Gene identification - Synteny
Mouse-human synteny and sequence conservation.
A) Blocks of synteny between mouse chromosome 11 and parts of 5 different human chromosomes.
B) Enlarged block with perfect correspondence in order, orientation and spacing of 23 putative genes, and 245 conserved sequence blocks of >100 bp with >70% identity, many in noncoding regions.
Caution! Even regions of high synteny may not show perfect gene-for-gene correspondence.
From Gibson & Muse (2002) A Primer of Genome Science, Sinauer Inc.
37Gene identification - CpG islands
- Defined as regions of DNA of at least 200 bp in length that have a GC content above 50% and a ratio of observed vs. expected CpGs close to or above 0.6
- Used to help predict gene sequences, especially promoter regions
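The three criteria above can be checked directly on a sequence. The sketch below assumes the commonly used observed/expected formula, (number of CpG x length) / (number of C x number of G), and treats "close to or above 0.6" as a simple >= 0.6 cutoff. The function names and that cutoff handling are illustrative assumptions.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")
    gc_content = (c_count + g_count) / n if n else 0.0
    # Expected CpG count if C and G were placed independently.
    expected = c_count * g_count / n if n else 0.0
    ratio = cpg_count / expected if expected else 0.0
    return gc_content, ratio


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the length, GC-content and obs/exp CpG criteria to one region."""
    gc_content, ratio = cpg_stats(seq)
    return len(seq) >= min_len and gc_content > min_gc and ratio >= min_ratio


if __name__ == "__main__":
    candidate = "CG" * 100  # toy 200 bp sequence, not real genomic data
    print(cpg_stats(candidate), looks_like_cpg_island(candidate))
```

In practice these checks are applied in a sliding window along the genome; the sketch scores a single region only.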
38Gene identification - evidence of transcription
- Sequencing libraries of cDNA clones yields expressed sequence tags (ESTs), which are not necessarily full-length
39- Genome Organization
- duplicated genes
- Gene families
- paralogs
- orthologs (homologs)
- Pseudogenes
40Duplicated genes
- Paralogs: evolved one from another through gene duplication
- Encode closely related proteins
- Formed by duplication of an ancestral gene
followed by mutation
- Five functional genes and two pseudogenes ?
41Pseudogenes
- Nonfunctional copies of genes
- Formed by duplication of an ancestral gene, or by reverse transcription and integration of the cDNA
- Not expressed due to mutations that produce a stop codon (nonsense) or a frameshift, or mutations that prevent mRNA transcription or processing
42Duplicated genes
- Can be clustered, as in the ?-globin cluster, or dispersed in the genome, as seen for the entire globin family in humans
43Duplicated genes
- Paralogs vs orthologs (or homologs)
- Different members of the globin gene family are paralogs, having evolved one from another through gene duplication. Paralogs are separated by a gene duplication event.
- Each specific family member (e.g. ? globin ? human) is an ortholog (homolog) of the same family member in another species. Both evolved from an ancestral ? globin ? gene. Orthologs (homologs) are separated by a speciation event.
- It is not always easy to distinguish true orthologs from paralogs when comparing large multigene families between species, especially in polyploid organisms!
44- Genome Organization
- transcripts that do not encode proteins (ncRNA)
- Less than 5% of higher eukaryotic genome is protein coding
- 97-98% of the transcriptional output of the human genome is ncRNA
- Introns
- Transfer RNAs (tRNA)
- 500 tRNA genes in human genome
- Ribosomal RNAs
- Tandem arrays on several chromosomes
- 150-200 copies of 28S 5.8S 18S cluster
- 200-300 copies of 5S cluster
45- Genome Organization- ncRNA
- 97-98% of the transcriptional output of the human genome is ncRNA
- Small nucleolar RNAs (snoRNAs)
- Single genes
- Modify rRNAs
- Small nuclear RNAs (snRNAs)
- Spliceosomes
- Small regulatory RNAs
- Micro RNAs (miRNA)
- Short interfering RNAs (siRNA)
- Participate in transcriptional and non-transcriptional gene silencing, regulation of translation
- Many come from intergenic regions recently recognized as transcribed
46- Genome Organization - ncRNA
- 97-98% of the transcriptional output of the human genome is ncRNA
- Longer regulatory RNAs
- ncRNAs derived from introns of protein-coding genes and introns and exons of non-protein-coding genes constitute the majority of the genomic programming in higher organisms
- Explains why very different organisms show little difference in protein coding sequence
48- Genome Organization
- repetitive DNA
- 50% of human genome
- Moderately repeated DNA
- Tandemly repeated rRNA, tRNA and histone genes (gene products needed in high amounts)
- Large duplicated gene families
- Mobile DNA
- Segmental duplications
49Repetitive DNA - Segmental duplications
- Found especially around centromeres and telomeres
- Often come from nonhomologous chromosomes
- Many can come from the same source
- Tend to be large (10 to 50 kb)
- Unique to humans?
50Repetitive DNA - Segmental duplications
51Repetitive DNA - Transposon derived repeats
- Most of the moderately repeated DNA sequences found throughout higher eukaryotic genomes (45% of human genome)
- Some encode enzymes that catalyze movement
- Long interspersed elements (LINE) retrotransposons
- Short interspersed elements (SINE) retrotransposons
- LTR (long terminal repeat) retrotransposons
- DNA transposons
52Repetitive DNA - Transposon derived repeats
53Repetitive DNA - Transposon derived repeats
- Different regions of the genome differ in density of repeats
- Most LINEs accumulate in AT-rich regions
- Alu elements accumulate in GC-rich regions
54- Genome Organization
- repetitive DNA
- Simple-sequence Repeats
- 3% of genome
- Highly repeated short sequences found in centromeres and telomeres
- Variable numbers of tandem repeats (VNTR) dispersed throughout the genome
55Repetitive DNA - Highly repetitive satellite DNA
56Repetitive DNA - VNTRs
- dispersed throughout the genome
- 1-13 base repeat unit
- microsatellite, SSR
- includes trinucleotide repeats in protein coding genes
- 14-500 repeats
- minisatellites
- Used as mapping and fingerprinting markers
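As an illustration of how microsatellites (SSRs) can be located in a sequence, the sketch below uses a regular-expression back-reference to find a short unit repeated in tandem. The 1-13 base unit range follows the slide. The minimum of 4 repeats and the function name are assumptions chosen for the example.

```python
import re


def find_microsatellites(seq, min_unit=1, max_unit=13, min_repeats=4):
    """Find tandem repeats of a short unit in a DNA string.

    Returns a list of (start, unit, repeat_count) tuples. The lazy
    quantifier makes the regex prefer the shortest unit that works,
    so CAGCAGCAGCAG is reported as CAG x 4 rather than CAGCAG x 2.
    """
    seq = seq.upper()
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


if __name__ == "__main__":
    toy = "TTGA" + "CAG" * 5 + "TTA"  # toy sequence with a trinucleotide repeat
    print(find_microsatellites(toy))
```

Because the repeat count at such a locus varies between individuals, the count reported at a given position is what makes these loci useful as mapping and fingerprinting markers.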
57Overview of human genome composition