Genome Organization overview - PowerPoint PPT Presentation

About This Presentation

Title:

Genome Organization overview

Slides: 59

Provided by: Gloria75

Transcript and Presenter's Notes

Title: Genome Organization overview

1

Genome Organization overview
Eukaryotic genomes are complex and DNA amounts
and organization vary widely between species
C-value paradox: the amount of DNA in the haploid
cell of an organism is not related to its
evolutionary complexity or number of genes

2
C-value Paradox
Drosophila has a 20x smaller genome than human and
2x fewer genes
Newt and lungfish genomes are ~5x and ~50x larger
than human
3
Number of genes does increase in higher organisms
4
Re-association Kinetics
Cot = initial DNA concentration (C0) x time (t)
Complexity: the summed length of all single-copy
(unique) sequences in a genome
5

There are different classes of eukaryotic DNA
based on sequence complexity
revealed by re-association kinetics
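
As a rough illustration of how re-association kinetics separates these classes, the sketch below uses the textbook ideal second-order model, in which the fraction of DNA still single-stranded is 1 / (1 + k x Cot). The rate constant k and the example values are assumptions made for illustration, not figures from the slides. In this model a sequence class with higher complexity has a smaller k and therefore a larger Cot1/2.

```python
def fraction_single_stranded(cot, k):
    """Fraction of DNA still single-stranded under ideal second-order kinetics."""
    return 1.0 / (1.0 + k * cot)


def cot_half(k):
    """Cot value at which half of the DNA has re-associated (equals 1 / k)."""
    return 1.0 / k


# Hypothetical rate constants for three sequence classes (illustrative only).
classes = {"highly repeated": 100.0, "moderately repeated": 1.0, "single copy": 0.001}

for name, k in classes.items():
    print(f"{name}: Cot1/2 = {cot_half(k):g}")
```

Plotting fraction_single_stranded against log(Cot) for each class would give the familiar stepped Cot curve, one step per class.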

6
A time line of genomics research
7
Status of plant genome sequences as of January
2006
8
The human genome

Two versions of human genome sequences were
published in February 2001.
DNA sequences that encode proteins make up only 5%
of the genome
50% of sequences are transposable elements
Clusters of gene-rich regions are separated by
gene deserts
CH 19 has the highest gene density; CH 13 and Y
show the lowest gene density

9
The human genome

Gene total estimated at 30,000-40,000 (now maybe
25,000), with an average gene size of 27 kb
Hundreds of genes share homology with those of
bacteria
The number of introns varies greatly (from 0 for
histone to 234 for titin)

10
The human genome

Genes are larger and contain more and larger introns
compared to those in invertebrates (the dystrophin
gene is 2.5 Mb)
Genes are not evenly spaced on CHs
The most common genes include those involved in
nucleic acid metabolism (7.5%), receptors (5%),
protein kinases (2.8%) and cytoskeletal structural
proteins (2.8%)

11
The human genome predicted gene function
12
Any 2 human genomes are roughly 99.9% identical
On average 0.1%
Chr - chromosome
n - Number of samples examined
bp - Number of base pairs sequenced
S - Number of polymorphic sites
p - Nucleotide divergence
Przeworski, M., et al. (2000) Trends Genet 16,
296-302.
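
The quantities in this legend can be computed directly from a set of aligned sequences. The sketch below is a minimal illustration on made-up toy sequences (not data from Przeworski et al.). It assumes that p stands for the average number of pairwise differences per site, often written as pi; the function names are hypothetical.

```python
from itertools import combinations


def segregating_sites(seqs):
    """S: number of alignment columns with more than one distinct base."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def nucleotide_diversity(seqs):
    """Average pairwise differences per site over all pairs of sequences."""
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    differences = sum(
        sum(a != b for a, b in zip(first, second)) for first, second in pairs
    )
    return differences / (len(pairs) * length)


# Toy alignment: n = 4 samples, bp = 10 sites (invented for illustration).
toy_alignment = ["ACGTACGTAC", "ACGTACGTAC", "ACGTTCGTAC", "ACGTACGTAA"]

print("S =", segregating_sites(toy_alignment))
print("pi =", nucleotide_diversity(toy_alignment))
```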
13
Yet phenotypic differences abound!
14
Genome organization in plants

Size of genome varies widely (100 Mb-5,500 Mb)
Many tandem gene duplications and larger
duplications; some interchromosomal duplications
are also observed
Large-genome plants also have genes clustered
with long stretches of intergenic DNA
In maize, the intergenic sequences are composed
mainly of transposons

17

Genome Organization - gene identification
Genes can be difficult to identify/predict based
on genome sequence
The human genome appears to contain fewer genes
than originally predicted, but an estimated
35,000 genes produce an estimated 150,000 proteins

18

Genome Organization - gene identification
No one-to-one correspondence between:
Genome (all genes of an organism)
Transcriptome (all transcripts of an organism)
Proteome (all proteins of an organism)

19
Variable estimates of human gene content
20
Gene identification - the simple view
21
Gene identification - the challenges
from Klug & Cummings 1997
22
Gene identification - the challenges

Non-coding sequences
Promoters and enhancers of gene expression can be
distant from the coding region itself
Genes can have alternative promoters
Genes can have alternative terminators

24
Gene identification - the challenges

Introns and exons
Most eukaryotic genes have introns
Introns are often much longer than exons
Often many introns, so the mRNA is much shorter than
the genomic DNA
Intron size can vary between the same gene of
different species
Splice junctions are difficult to predict
Alternative splicing

25
Gene identification - the challenges
Introns and exons

Eukaryotes only
Removal of internal parts of the newly
transcribed RNA.
Takes place in the cell nucleus

26
Introns and exons
27
Introns - numerous and longer than exons
28

Variable intron size
same gene, different organism

29
Introns - alternative splicing

Different splice patterns from the same sequence,
therefore different products from the same gene.
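
A toy sketch of this idea: the same pre-mRNA yields different mature mRNAs depending on which exons are kept. The sequence, exon coordinates and function name are invented for illustration; real splice-site choice is far more complex.

```python
def splice(pre_mrna, exons, keep):
    """Join the selected exons (by index), dropping introns and skipped exons."""
    return "".join(
        pre_mrna[start:end]
        for index, (start, end) in enumerate(exons)
        if index in keep
    )


# Toy transcript: exons in upper case, introns (starting gt, ending ag) in lower case.
pre_mrna = "AAAAgtccagCCCCgtttagGGGG"
exons = [(0, 4), (10, 14), (20, 24)]  # half-open (start, end) coordinates

full_length = splice(pre_mrna, exons, keep={0, 1, 2})
exon_skipped = splice(pre_mrna, exons, keep={0, 2})
print(full_length, exon_skipped)
```

Both products come from one gene sequence; only the set of retained exons differs.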

30
One gene, many proteins: alternative splicing and
3' cleavage
31

Exon shuffling
Different genes having similar exons

32
Why genome, transcriptome and proteome don't
correlate in size

More sophisticated regulation of expression
Proteome vastly larger than genome
Alternate splicing, promoters, terminators (59%
of genes with an average of 3 different products)
RNA editing
Post-translational modifications
Moonlighting: same protein, different function
depending on cellular location

33
Gene Identification

Open reading frames
Sequence conservation: database searches, synteny
Sequence features: CpG islands
Evidence for transcription: ESTs, microarrays
Gene inactivation: transformation, TEs, RNAi

34
Gene identification - Open reading frames

5' atgcccaagctgaatagcgtagaggggttttcatcatttgaggacgatgtataa
frame 1
atg ccc aag ctg aat agc gta gag ggg ttt
 M   P   K   L   N   S   V   E   G   F
tca tca ttt gag gac gat gta taa
 S   S   F   E   D   D   V   *
frame 2
tgc cca agc tga ata gcg tag agg ggt ttt cat
 C   P   S   *   I   A   *   R   G   F   H
cat ttg agg acg atg tat
 H   L   R   T   M   Y
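
The frame-by-frame translation on this slide can be reproduced with a short script. This is a minimal sketch using the standard genetic code; the function and variable names are hypothetical, and stop codons are written as "*".

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna, frame=0):
    """Translate one forward reading frame (0, 1 or 2); '*' marks a stop codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


sequence = "atgcccaagctgaatagcgtagaggggttttcatcatttgaggacgatgtataa"
print("frame 1:", translate(sequence, 0))
print("frame 2:", translate(sequence, 1))
```

Frame 1 reads through to a single terminal stop, while frame 2 is interrupted by internal stops, which is why frame 1 is the candidate open reading frame.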

35
Gene identification - Database searches
36
Gene identification - Synteny
Mouse-human synteny and sequence conservation
A) Blocks of synteny between mouse chromosome 11 and
parts of 5 different human chromosomes
B) Enlarged block with perfect correspondence in
order, orientation and spacing of 23 putative
genes, and 245 conserved sequence blocks of > 100
bp with > 70% identity, many in noncoding
regions
Caution! Even regions of high synteny
may not show perfect gene-for-gene
correspondence
From Gibson & Muse (2002) A
Primer of Genome Science, Sinauer Inc.
37
Gene identification - CpG islands

Defined as regions of DNA of at least 200 bp in
length that have a GC content above 50% and a
ratio of observed vs. expected CpGs close to or
above 0.6
Used to help predict gene sequences, especially
promoter regions.
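
The criteria above can be checked for a candidate region with a few lines of code. This is a sketch only: it assumes the commonly used formula observed/expected = (number of CpG x length) / (number of C x number of G), which the slide does not spell out, and the function names and default thresholds are illustrative.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    length = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")
    gc_content = (c_count + g_count) / length if length else 0.0
    expected = c_count * g_count / length if length else 0.0
    ratio = cpg_count / expected if expected else 0.0
    return gc_content, ratio


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the three slide criteria to a single candidate region."""
    gc_content, ratio = cpg_stats(seq)
    return len(seq) >= min_len and gc_content > min_gc and ratio >= min_ratio


print(looks_like_cpg_island("CG" * 100))  # CpG-rich toy sequence
print(looks_like_cpg_island("AT" * 100))  # AT-only toy sequence
```

In practice the test is applied in a sliding window along the genome rather than to one fixed region.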

38
Gene identification - evidence of transcription

Sequencing libraries of cDNA clones yields
expressed sequence tags (ESTs), not necessarily
full-length

39

Genome Organization - duplicated genes
Gene families
paralogs
orthologs (homologs)
Pseudogenes

40
Duplicated genes

Paralogs evolved one from another through gene
duplication
Encode closely related proteins
Formed by duplication of an ancestral gene
followed by mutation

Five functional genes and two pseudogenes ?

41
Pseudogenes

Nonfunctional copies of genes
Formed by duplication of ancestral gene, or by
reverse transcription and integration of the cDNA
Not expressed due to mutations that produce a
premature stop codon (nonsense or frameshift), or
mutations that prevent mRNA transcription or processing

42
Duplicated genes

Can be clustered as in ?-globin cluster, or
dispersed in genome as seen for entire globin
family in humans

43
Duplicated genes

Paralogs vs orthologs (or homologs)
Different members of the globin gene family are
paralogs, having evolved one from another through
gene duplication. Paralogs are separated by a
gene duplication event.
Each specific family member (e.g. ? globin ?
human) is an ortholog (homolog) of the same
family member in another species. Both evolved
from an ancestral ? globin ? gene. Orthologs
(homologs) are separated by a speciation event.
It is not always easy to distinguish true
orthologs from paralogs when comparing large
multigene families between species. Especially in
polyploid organisms!

44

Genome Organization - transcripts that do not encode proteins (ncRNA)

< 5% of higher eukaryotic genome is protein
coding
97-98% of the transcriptional output of the
human genome is ncRNA
Introns
Transfer RNAs (tRNA)
500 tRNA genes in human genome
Ribosomal RNAs
Tandem arrays on several chromosomes
150-200 copies of 28S-5.8S-18S cluster
200-300 copies of 5S cluster

45

Genome Organization - ncRNA

97-98% of the transcriptional output of the
human genome is ncRNA
Small nucleolar RNAs (snoRNAs)
Single genes
Modify rRNAs
Small nuclear RNAs (snRNAs)
Spliceosomes
Small regulatory RNAs
Micro RNAs (miRNA)
Short interfering RNAs (siRNA)
Participate in transcriptional and
non-transcriptional gene silencing, regulation
of translation
Many come from intergenic regions recently
recognized as transcribed

46

Genome Organization - ncRNA

97-98% of the transcriptional output of the human
genome is ncRNA
Longer regulatory RNAs
ncRNAs derived from introns of protein-coding
genes and introns and exons of non-protein-coding
genes constitute the majority of the genomic
programming in higher organisms
Explains why very different organisms show little
difference in protein coding sequence

48

Genome Organization - repetitive DNA

50% of human genome
Moderately repeated DNA
Tandemly repeated rRNA, tRNA and histone genes
(gene products needed in high amounts)
Large duplicated gene families
Mobile DNA
Segmental duplications

49
Repetitive DNA - Segmental duplications

Found especially around centromeres and telomeres
Often come from nonhomologous chromosomes
Many can come from the same source
Tend to be large (10 to 50 kb)
Unique to humans?

50
Repetitive DNA - Segmental duplications
51
Repetitive DNA - Transposon-derived repeats

Most of the moderately repeated DNA sequences
found throughout higher eukaryotic genomes (45%
of human genome)
Some encode enzymes that catalyze movement
Long interspersed elements (LINE)
retrotransposons
Short interspersed elements (SINE)
retrotransposons
LTR (long terminal repeat) retrotransposons
DNA transposons

52
Repetitive DNA - Transposon-derived repeats
53
Repetitive DNA - Transposon-derived repeats