Introduction to Genomics - PowerPoint PPT Presentation

About This Presentation
Title: Introduction to Genomics

Description: PowerPoint presentation. Author: Pevsner. Last modified by: Jack Min. Created: 8/20/2002 7:07:52 PM. Presentation format: On-screen Show.

Slides: 78
Provided by: Pevs3

Transcript and Presenter's Notes


1
Introduction to Genomics and the Tree of Life
Chapter 13
2
Extra-Reading
  • Next generation sequencer
  • What can next-generation sequencers do for genetics/genomics research?
  • Compar_genomics
  • What can we learn from comparative genomics?

3
Outline of today's lecture
Introduction: 5 perspectives, history of life
Genome-sequencing projects: chronology
Genome analysis: criteria, resequencing, metagenomics
DNA sequencing technologies: Sanger, 454, Solexa
Process of genome sequencing: centers, repositories
Genome annotation: features, prokaryotes, eukaryotes
4
Five approaches to genomics
As we survey the tree of life, consider these perspectives:
Approach I: cataloguing genomic information
  • Genome size
  • number of chromosomes
  • GC content
  • isochores
  • number of genes
  • repetitive DNA
  • unique features of each genome
Approach II: cataloguing comparative genomic information
  • Orthologs and paralogs
  • COGs
  • lateral gene transfer
Approach III: function, biological principles, evolution
  • How genome size is regulated
  • polyploidization
  • birth and death of genes
  • neutral theory of evolution
  • positive and negative selection
  • speciation
Approach IV: Human disease relevance
Approach V: Bioinformatics aspects
  • Algorithms, databases, websites
Page 519
5
Introduction: Lessons learned from comparative genomics
  • What have we learned about genes by comparing genomic sequences?
  • What have we learned about regulation?
  • About 5% of the human genome is under purifying selection
  • Positively regulated regions
  • Mechanisms and history of mammalian evolution
  • Nonuniformity of neutral evolutionary rates within species
  • Nonuniformity of evolution along the branches of phylogeny
  • Learning more from existing data
  • Choice of species
  • Choice of tools
  • Future of comparative genomics
6
Levels of analysis in genomics
level        topics                  databases
DNA          genes, chromosomes      GenBank
RNA          ESTs, ncRNA             UniGene, GEO
protein      ORFs, composition       UniProt
complexes    binary, multimeric      BIND
pathways                             COGs, KEGG
organelles
organs
individuals  variation and disease   HapMap
species      speciation              TaxBrowser, SGD
genus                                JAX mouse
phylum                               FishBase
kingdom                              TOL
7
Definitions of terms
Genomics is the study of genomes (the DNA comprising an organism) using the tools of bioinformatics.
Bioinformatics is the study of proteins, genes, and genomes using computer algorithms and databases.
Systematics is the scientific study of the kinds and diversity of organisms and of any and all relationships among them.
Classification is the ordering of organisms into groups on the basis of their relationships. The relationships may be evolutionary (phylogenetic) or may refer to similarities of phenotype (phenetic).
Taxonomy is the theory and practice of classifying organisms.
8
Pace (2001) described a tree of life based on small subunit rRNA sequences. This tree shows the three main branches described by Woese and colleagues.
Fig. 13.1 Page 521
9
Molecular sequences as basis of trees
Historically, trees were generated primarily
using characters provided by morphological data.
Molecular sequence data are now commonly used,
including sequences (such as small-subunit RNAs)
that are highly conserved. Visit the European
Small Subunit Ribosomal RNA database for 20,000
SSU rRNA sequences.
Page 523
10
Tree of life from David Hillis lab (based on 3000 rRNAs)
Figure labels: animals, plants, protists, bacteria, fungi, archaea; "you are here"
http://www.zo.utexas.edu/faculty/antisense/Download.html
11
Tree of life from David Hillis lab (based on 3000 rRNAs)
Figure label: "you are here"
http://www.zo.utexas.edu/faculty/antisense/Download.html
12
Ribosomal RNA Database
Ribosomal Database Project: http://rdp.cme.msu.edu/index.jsp
Santos, S. R. and Ochman, H. Identification and phylogenetic sorting of bacterial lineages with universally conserved genes and proteins. Environmental Microbiology. 2004 Jul;6(7):754-9.
  • Download fusA (translation elongation factor 2, EF-2)
  • Obtain DNA in the FASTA format
  • Align by ClustalW in MEGA
  • Create a neighbor-joining tree
Page 524
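The neighbor-joining step above starts from a matrix of pairwise distances between aligned sequences. As a rough illustration only, the sketch below parses an already aligned FASTA text and computes the proportion of differing sites (p-distance) for each pair. The function names and the toy sequences are hypothetical; they are not real fusA data, and MEGA offers several distance models beyond this simple one.

```python
from itertools import combinations


def parse_fasta(text):
    """Parse FASTA text into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns with a gap ('-') in either sequence are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    diffs = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            diffs += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def distance_matrix(records):
    """Pairwise p-distances keyed by (name1, name2), names in sorted order."""
    return {
        (m, n): p_distance(records[m], records[n])
        for m, n in combinations(sorted(records), 2)
    }


# Toy aligned sequences (hypothetical, for illustration only).
EXAMPLE = ">seqA\nACGTACGT\n>seqB\nACGTACGA\n>seqC\nAC-TTCGA\n"

if __name__ == "__main__":
    for pair, d in distance_matrix(parse_fasta(EXAMPLE)).items():
        print(pair, round(d, 3))
```

A tree-building program would take such a matrix as input; the clustering itself is left to MEGA here.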
14
European Small Subunit Ribosomal RNA database (http://www.psb.ugent.be/rRNA/ssu/)
15
Neighbor-joining tree of 150 fusA (GTPase) DNA
sequences
Yersinia pestis
Clostridium
Aquifex aeolicus
Mycoplasma
Bacillus anthracis
Mycobacterium
Rickettsia
Treponema
16
History of life on earth
4.55 BYA: formation of earth (violent 100 MY period)
4.4-3.8 BYA: last ocean-evaporating impacts
3.9 BYA: oldest dated rocks
3.8 BYA: sun brightened to 70% of today's luminosity
Ammonia, methane, or carbon dioxide atmosphere.
Earliest life: RNA, protein
Source: Schopf J.W. (ed.), Life's Origins (U. Calif. Press, 2002)
Page 521
17
Timeline in millions of years ago (MYA); axis ticks: 1000, 500, 100, 0
Eons: Proterozoic eon, Phanerozoic eon
Events shown: Cambrian explosion; Age of Reptiles ends; deuterostome/protostome; echinoderm/chordate; land plants; insects
Page 522
18
Timeline in millions of years ago (MYA); axis ticks: 100, 50, 10, 0
Events shown: human/chimp divergence; dinosaurs extinct; mammalian radiation; mass extinction
Page 522
19
Timeline in millions of years ago (MYA); axis ticks: 10, 5, 1, 0
Events shown: Homo sapiens/chimp divergence; emergence of Homo erectus; earliest stone tools; Australopithecus (Lucy)
Page 522
20
Timeline in years ago; axis ticks: 1,000,000, 500,000, 100,000, 0
Events shown: Homo erectus emerges in Africa; Mitochondrial Eve
Page 523
21
Timeline in years ago; axis ticks: 100,000, 50,000, 10,000, 0
Events shown: emergence of anatomically modern H. sapiens; Neanderthal and Homo erectus disappear
Page 523
22
Timeline in years ago; axis ticks: 10,000, 5,000, 1,000, 0
Events shown: Ice Man from Alps; earliest pyramids; Aristotle
Page 523
23
Timeline in years ago; axis ticks: 1,000, 500, 100, 0
Events shown: Darwin, Mendel; algebra; calculus; Gutenberg
Page 523
24
Chronology of genome sequencing projects
We will next summarize the major achievements
in genome sequencing projects from a
chronological perspective.
Page 525
25
Chronology of genome sequencing projects
1976: first viral genome. Fiers et al. sequence bacteriophage MS2 (3,569 base pairs, accession NC_001417).
1977: Sanger et al. sequence bacteriophage φX174. This virus is 5,386 base pairs (encoding 11 genes). See accessions J02482 and NC_001422.
Page 527
26
Chronology of genome sequencing projects
1981: Human mitochondrial genome, 16,500 base pairs (encodes 13 proteins, 2 rRNA, 22 tRNA)
Today (10/09), over 1800 mitochondrial genomes sequenced
1986: Chloroplast genome, 156,000 base pairs (most are 120 kb to 200 kb)
Page 527
27
mitochondrion
chloroplast
Lack mitochondria (?)
28
Entrez Genomes organelle resource at NCBI
http://www.ncbi.nlm.nih.gov/genomes/ORGANELLES/organelles.html
29
There are >2100 eukaryotic organelle genomes (10/09)
30
GOBASE resource for organelle genomes
http://megasun.bch.umontreal.ca/gobase/
31
MitoDat resource for organelle genomes
This database is dedicated to the nuclear genes
specifying the enzymes, structural proteins, and
other proteins, many still not identified,
involved in mitochondrial biogenesis and
function. MitoDat highlights predominantly human
nuclear-encoded mitochondrial proteins. Not
updated recently.
http://www-lecb.ncifcrf.gov/mitoDat/
32
MitoMap resource for organelle genomes
http://www.mitomap.org/
33
It is possible to map mutations in human
mitochondrial DNA that are responsible for
disease
34
Chronology of genome sequencing projects
1995 first genome of a free-living organism,
the bacterium Haemophilus influenzae
Page 530
35
Chronology of genome sequencing projects
1996: first eukaryotic genome. The complete genome sequence of the budding yeast Saccharomyces cerevisiae was reported. We will describe this genome soon.
Also in 1996, TIGR reported the sequence of the first archaeal genome, Methanococcus jannaschii.
Page 532
36
Chronology of genome sequencing projects
1997: More bacteria and archaea. Escherichia coli: 4.6 megabases, 4200 proteins (38% of unknown function)
1998: first multicellular organism. Nematode Caenorhabditis elegans: 97 Mb, 19,000 genes.
1999: first human chromosome. Chromosome 22 (49 Mb, 673 genes)
Page 532
37
1999 Human chromosome 22 sequenced
38
Chronology of genome sequencing projects
2000: Fruitfly Drosophila melanogaster (13,000 genes); plant Arabidopsis thaliana; human chromosome 21
2001: draft sequence of the human genome (public consortium and Celera Genomics)
Page 534
40
2000
44
Overview of genome analysis
  • Selection of genomes for sequencing
  • Sequence one individual genome, or several?
  • How big are genomes?
  • Genome sequencing centers
  • Genome sequencing strategies
  • When has a genome been fully sequenced?
  • Repository for genome sequence data
  • Genome annotation

Page 537
45
Applications of Genome Sequencing
Purpose             Template                                   Example
De novo sequencing  Genome sequencing                          Sequencing >1000 influenza genomes
De novo sequencing  Ancient DNA                                Extinct Neanderthal genome
De novo sequencing  Metagenomics                               Human gut
Resequencing        Whole genomes                              Individual humans
Resequencing        Genomic regions                            Assessment of genomic rearrangements or disease-associated regions
Resequencing        Somatic mutations                          Sequencing mutations in cancer
Transcriptome       Full-length transcripts                    Defining regulated messenger RNA transcripts
Transcriptome       Serial Analysis of Gene Expression (SAGE)  Defining regulated messenger RNA transcripts
Transcriptome       Noncoding RNAs                             Identifying and quantifying microRNAs in samples
Epigenetics         Methylation changes                        Measuring methylation changes in cancer
Table 13.15 p.538
46
Overview of genome analysis
Fig. 13.8 p.539
47
Criteria for selecting genomes for sequencing
  • Criteria include
  • genome size (some plants are >>> human genome)
  • cost
  • relevance to human disease (or other disease)
  • relevance to basic biological questions
  • relevance to agriculture

Page 538
48
Criteria for selecting genomes for sequencing
  • Criteria include
  • genome size (some plants are >>> human genome)
  • cost
  • relevance to human disease (or other disease)
  • relevance to basic biological questions
  • relevance to agriculture
  • Recent projects
  • Chicken Fungi (many)
  • Chimpanzee Honey bee
  • Cow Sea urchin
  • Dog Rhesus macaque

Page 540
49
Selection criteria
Selection of genomes for sequencing is based on specific criteria. For an overview, see a series of white papers posted on the National Human Genome Research Institute (NHGRI) website: http://www.genome.gov/10002154
For a description of NHGRI selection criteria, visit http://www.genome.gov/10001495
Page 540
50
Criteria for selecting genomes for sequencing
Sequence one individual genome, or several?
Try one
-- Each genome center may study one chromosome from an organism
-- It is necessary to measure polymorphisms (e.g. SNPs) in large populations
For viruses, thousands of isolates may be sequenced. For the human genome, cost is the impediment.
Page 540
51
Diversity of genome sizes
How big are genomes?
Viral genomes: 1 kb to 350 kb (Mimivirus 1181 kb)
Bacterial genomes: 0.5 Mb to 13 Mb
Eukaryotic genomes: 8 Mb to 686 Gb (human 3 Gb)
Page 540
52
Genome sizes in nucleotide base pairs
Figure: genome size ranges, with axis ticks from 10^4 to 10^11 bp, for plasmids, viruses, bacteria, fungi, plants, algae, insects, mollusks, bony fish, amphibians, reptiles, birds, and mammals.
The size of the human genome is 3 x 10^9 bp; almost all of its complexity is in single-copy DNA. The human genome is thought to contain 30,000-40,000 genes.
http://www3.kumc.edu/jcalvet/PowerPoint/bioc801b.ppt
53
16 eukaryotic genome projects > 1000 megabases
Genus, species          Subgroup       Size (Mb)  chr  common name
Macropus eugenii        Mammals        3800       8    tammar wallaby
Oryctolagus cuniculus   Mammals        3500       22   rabbit
Cavia porcellus         Mammals        3400       31   guinea pig
Pan troglodytes         Mammals        3100       24   chimpanzee
Homo sapiens            Mammals        3038       23   human
Bos taurus              Mammals        3000       30   cow
Dasypus novemcinctus    Mammals        3000       32   nine-banded armadillo
Loxodonta africana      Mammals        3000       28   African savanna elephant
Sorex araneus           Mammals        3000            European shrew
Rattus norvegicus       Mammals        2750       21   rat
Canis familiaris        Mammals        2400       39   dog
Zea mays                Land Plants    2365       10   corn
Aplysia californica     Other Animals  1800       17   California sea hare
Danio rerio             Fishes         1700       25   zebrafish
Gallus gallus           Birds          1200       40   chicken
Triphysaria versicolor  Land Plants    1200            plant parasite
54
Ancient DNA projects
  • Special challenges
  • Ancient DNA is degraded by nucleases
  • The majority of DNA in samples derives from
    unrelated organisms such as bacteria that invaded
    after death
  • The majority of DNA in samples is contaminated
    by human DNA
  • Determination of authenticity requires special
    controls, and analysis of multiple independent
    extracts

Page 542
55
Metagenomics projects
  • Two broad areas
  • Environmental (ecological)
  • e.g. hot spring, ocean, sludge, soil
  • Organismal
  • e.g. human gut, feces, lung

Page 543
56
Outline of today's lecture
Introduction: 5 perspectives, history of life, time lines
Genome-sequencing projects: chronology
Genome analysis: criteria, resequencing, metagenomics
DNA sequencing technologies: Sanger, 454, Solexa
Process of genome sequencing: centers, repositories
Genome annotation: features, prokaryotes, eukaryotes
58
Overview of genome analysis
20 Genome sequencing centers contributed to the
public sequencing of the human genome. Many of
these are listed at the Entrez genomes site. (Or
see Table 19.3, page 803.)
Page 548
59
Two approaches to genome sequencing
Whole genome shotgun sequencing (Celera)
Hierarchical shotgun sequencing (public consortium)
60
Two approaches to genome sequencing
Whole Genome Shotgun (from the NCBI website): An
approach used to decode an organism's genome by
shredding it into smaller fragments of DNA which
can be sequenced individually. The sequences of
these fragments are then ordered, based on
overlaps in the genetic code, and finally
reassembled into the complete sequence. The
'whole genome shotgun' (WGS) method is applied to
the entire genome all at once, while the
'hierarchical shotgun' method is applied to
large, overlapping DNA fragments of known
location in the genome.
Page 548
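The idea of ordering fragments by their overlaps and reassembling them can be sketched with a toy greedy merger. This is a minimal illustration under strong assumptions (error-free reads, one strand only, no repeats), not the method used by Celera or the public consortium; the function names and the example fragments are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest such match is shorter than min_len.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    # Drop fragments fully contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlaps: the pieces stay as separate contigs
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags


if __name__ == "__main__":
    reads = ["ATGGCGTA", "GCGTACGTT", "ACGTTAGC"]  # toy fragments
    print(greedy_assemble(reads))
```

When no overlaps remain, the function returns several pieces rather than one sequence, which loosely mirrors how a draft assembly is left as contigs separated by gaps.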
61
Human genome project strategies
Whole genome shotgun sequencing (Celera)
-- given the computational capacity, this approach is far faster than hierarchical shotgun sequencing
-- the approach was validated using Drosophila
62
Two approaches to genome sequencing
Hierarchical shotgun method: assemble contigs from
various chromosomes, then sequence and assemble
them. A contig is a set of overlapping clones or
sequences from which a sequence can be obtained.
The sequence may be draft or finished. A contig
is thus a chromosome map showing the locations
of those regions of a chromosome where
contiguous DNA segments overlap. Contig maps are
important because they provide the ability to
study a complete, and often large segment of the
genome by examining a series of overlapping
clones which then provide an unbroken succession
of information about that region.
Page 548
63
Two approaches to genome sequencing
Hierarchical shotgun sequencing (public consortium)
-- 29,000 BAC clones
-- 4.3 billion base pairs
-- it is helpful to assign chromosomal loci to sequenced fragments, especially in light of the large amount of repetitive DNA in the genome
-- individual chromosomes assigned to centers
64
Source: IHGSC (2001)
65
Sequenced-clone contigs are merged to form
scaffolds of known order and orientation
Fig. 19.8 Page 804
Source: IHGSC (2001)
66
When has a genome been fully sequenced?
A typical goal is to obtain five- to ten-fold coverage.
Finished sequence: a clone insert is contiguously sequenced to a high quality standard, with an error rate of 0.01%. There are usually no gaps in the sequence.
Draft sequence: clone sequences may contain several regions separated by gaps. The true order and orientation of the pieces may not be known.
Page 549
67
When has a genome been fully sequenced?
Fold coverage   % sequenced
0.25            22
0.5             39
0.75            53
1               63
2               87.5
3               95
4               98.2
5               99.4
6               99.75
7               99.91
8               99.97
9               99.99
10              99.995
Page 551
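The slide does not state how the table was derived, but its values are close to a simple Poisson model in which a base is missed with probability e^(-c) at fold coverage c, so the expected fraction sequenced is 1 - e^(-c). The sketch below prints that model for the same coverage values; treating the table as an instance of this model is an assumption, and small differences from the slide's rounding are possible.

```python
import math


def fraction_sequenced(fold_coverage):
    """Expected fraction of the genome covered at least once.

    Assumes reads land uniformly at random (Poisson model).
    """
    return 1.0 - math.exp(-fold_coverage)


if __name__ == "__main__":
    for c in (0.25, 0.5, 0.75, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10):
        print(f"{c:>5}x  {100 * fraction_sequenced(c):.3f}%")
```

Under this model each additional fold of coverage shrinks the unsequenced fraction by a factor of e, which is why the gains flatten out beyond about five-fold coverage.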
68
Trace repository for genome sequence data
Raw data from many genome sequencing projects are stored at the trace archive at NCBI or EBI (main NCBI page, bottom right). Also visit http://trace.ensembl.org/
As of October 2008, the Trace Archive had 2 billion traces. As of October 2009 it has 2,108,000,000 traces.
Page 552
69
Fig. 13.12 Page 553
70
  • http://www.jgi.doe.gov/education/
  • http://www.youtube.com/watch?v=RLsb0pMx_oU&feature=channel_page
  • A Howard Hughes Medical Institute (HHMI) video production describing the Whole Genome Shotgun Sequencing process at the JGI. This video is viewable on YouTube in three parts: Part 1 (chapters 1-5), Part 2 (chapters 6-8), Part 3 (chapters 9-14).

71
Role of comparative genomics
Phylogenetic footprinting
Phylogenetic shadowing
Population shadowing
Page 552
72
Fig. 13.13 Page 554
73
Outline of today's lecture
Introduction: 5 perspectives, history of life, time lines
Genome-sequencing projects: chronology
Genome analysis: criteria, resequencing, metagenomics
DNA sequencing technologies: Sanger, 454, Solexa
Process of genome sequencing: centers, repositories
Genome annotation: features, prokaryotes, eukaryotes
74
Fig. 13.14 Page 555
75
Genome annotation
Information content in genomic DNA includes
-- nucleotide composition (GC content)
-- repetitive DNA elements
-- protein-coding genes, other genes
Page 555
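Nucleotide composition is the simplest of these features to compute. The sketch below calculates GC content as a percentage of the unambiguous bases in a sequence, plus a sliding-window version of the kind that could be used to look for regional variation. The function names, window sizes, and example strings are illustrative assumptions, not part of the lecture.

```python
def gc_content(seq):
    """GC content as a percentage of the A, C, G, T bases in seq.

    Ambiguous characters such as N are ignored.
    """
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def sliding_gc(seq, window, step):
    """GC content in windows of the given size; returns (start, percent) pairs."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    print(gc_content("ATGCGCTTAA"))
    print(sliding_gc("AAAAGGGGATAT", window=4, step=4))
```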
76
GC content varies across genomes
Figure: histograms of the number of species in each GC class for bacteria, plants, invertebrates, and vertebrates; x-axis: GC content (%), 20 to 80.
Fig. 13.15 Page 556
77
Gene prediction tools
  • http://bioinformatics.ca/links_directory/?subcategory_id=39
  • http://www.geneprediction.org/
  • Common tools
  • GenScan: http://genes.mit.edu/GENSCAN.html
  • HMMgene: http://www.cbs.dtu.dk/services/HMMgene/
  • Microbial: http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi
  • Fungal: http://www.cbcb.umd.edu/software/GlimmerHMM/
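
The tools listed above rely on statistical models of gene structure. As a much simpler point of comparison, the sketch below scans the forward strand for open reading frames (an ATG followed in frame by a stop codon). It is a naive illustration only, does not reproduce what GenScan, HMMgene, or Glimmer do, and its function name and parameters are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Find forward-strand ORFs in all three reading frames.

    Returns (start, end, frame) tuples, where seq[start:end] runs from
    the ATG through the stop codon. min_codons counts the codons before
    the stop codon, including the ATG.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    dna = "CCATGAAATTTGGGTAACC"  # toy sequence
    for start, end, frame in find_orfs(dna):
        print(frame, dna[start:end])
```

A scan like this ignores the reverse strand, introns, and codon usage, which is part of why dedicated gene prediction tools are needed for real genomes.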