What is Comparative Genomics? - PowerPoint PPT Presentation

About This Presentation
Title:

What is Comparative Genomics?

Description:

'Informatics' techniques from applied math, computer science and ... Proteins more likely to interact if they are parts of the same cellular process ... – PowerPoint PPT presentation

Slides: 36
Provided by: igertmsu
Transcript and Presenter's Notes

1
What is Comparative Genomics?
Insights gained through comparison of genomes
from different species
2
How did it all start?
  • We needed some genomes to start comparing
  • Many Bacteria sequenced first
  • Model organisms: Yeast, Worm, Fruit fly,
    Thale cress
  • Finally, Human
  • Comparative genomics did not just happen
  • Enough data had to be accumulated
  • Development of new computational methods to meet
    the challenges of processing large amounts of data
  • Informatics techniques from applied math, computer
    science and statistics were adapted for biological
    sequences
3
Comparing sequenced genomes
Comparison of genomic sequences from different
species can help identify the following
  • Gene structure
  • Gene function
  • Interaction between gene products
  • Non-coding RNAs
  • Regulatory sequences
4
Evolution and sequence conservation
  • Genome comparisons are based on a simple premise:
    conservation implies functional importance
  • If there are no constraints on DNA sequence,
    random mutations will occur
  • Over large evolutionary times (millions of years),
    these random mutations make two related sequences
    different
  • Sequences from different genomes will be conserved
    if they code for proteins
  • Sequences from different genomes will be conserved
    if they are important for regulation (protein
    binding)
5
No-hypothesis-driven approach
Hypothesis-driven approaches
  • Develop goals based on an available hypothesis
  • Design initial experiments (and backups if those
    fail)
  • When it yields results, go to NIH, NSF, DOE, ONR
    for funding
No-hypothesis-driven approaches
  • Start with a general knowledge of the biological
    system
  • Collect a large amount of data (usually with
    high-throughput methods) and try extracting and/or
    amplifying signal from noisy data
  • Sometimes it works for reasons that are obvious
  • Sometimes it works for reasons that are NOT obvious
  • Sometimes it doesn't work because the data are too
    noisy
  • Funding agencies are not likely to fund this kind
    of research
6
Finding DNA regulatory motifs (protein binding
sites)
Experimental approaches
  • Promoter trapping
  • DNA footprinting
  • In vitro binding site selection (SELEX)
Computational approaches
  • Searching databases of known sites
  • Finding over-represented motifs in a group of
    sequences (Gibbs sampling, Expectation
    Maximization):
  • In promoters of homologous genes
  • In promoters of functionally linked genes
  • In promoters of interacting proteins
  • Ab initio methods: positional conservation of
    (pseudo)palindromic DNA motifs
7
Finding motifs in promoters of homologous genes
  • Perform an all-versus-all BLAST search of proteomes
  • Pool together promoters of related genes
  • Find conserved motifs (Gibbs sampling, Expectation
    Maximization)
Only DNA motifs in related genes can be
identified
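The pooling step can be illustrated with a much simpler stand-in for Gibbs sampling or Expectation Maximization: counting which k-mers occur in most of the pooled promoters. This is only a rough sketch. The function name `shared_kmers`, its parameters, and the toy sequences are assumptions made for illustration and do not come from the slides.

```python
from collections import Counter

def shared_kmers(promoters, k=6, min_fraction=0.8):
    """Return k-mers present in at least min_fraction of the promoters."""
    presence = Counter()
    for seq in promoters:
        seq = seq.upper()
        # Count each k-mer once per promoter, not once per occurrence.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        presence.update(kmers)
    needed = min_fraction * len(promoters)
    return sorted(kmer for kmer, n in presence.items() if n >= needed)

# Hypothetical promoters of three related genes (toy sequences).
promoters = [
    "TTACGGATTCAAGT",
    "CCGGATTCTTAGCA",
    "AGTAGGATTCGGCT",
]
print(shared_kmers(promoters, k=6, min_fraction=1.0))
```

Exact k-mer counting misses degenerate motifs, which is one reason probabilistic methods such as Gibbs sampling are used in practice.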
8
Finding DNA motifs by positional conservation of
palindromes
  • The approach targets sites for dimeric proteins
    and is particularly suited for helix-turn-helix
    (HTH) proteins of Bacteria and Archaea
  • HTH proteins bind as dimers, usually with variable
    sequence spacing
  • Binding sites are palindromic with a poorly
    conserved middle
    GGATTnnnAATCC GGATTnnAATCC GGATTnnAAGCC
  • Starting from a complete set of promoter
    sequences, we find imperfect palindromes of
    variable length
  • Remove sequence bias (A/T or G/C content > 80%)
  • Search all-versus-all and identify similar motifs
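The palindrome search and the bias filter might be sketched as follows. This is a minimal illustration under stated assumptions, not the presenter's implementation: the half-site length of 5, the spacer range, the single allowed mismatch, and the names `find_palindromes` and `is_biased` are all hypothetical choices. The test sequences are the slide's example sites with arbitrary bases filled in for `n`.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def is_biased(site, threshold=0.8):
    """True if A/T or non-A/T (G/C) content exceeds the threshold."""
    at = sum(base in "AT" for base in site) / len(site)
    return at > threshold or (1.0 - at) > threshold

def find_palindromes(seq, half=5, spacers=range(0, 5), max_mismatch=1):
    """Return (start, spacer_length, site, mismatches) for inverted repeats."""
    seq = seq.upper()
    hits = []
    for gap in spacers:
        total = 2 * half + gap
        for start in range(len(seq) - total + 1):
            left = seq[start:start + half]
            right = seq[start + half + gap:start + total]
            # A perfect site has left == reverse complement of right.
            mismatches = sum(
                a != b for a, b in zip(left, reverse_complement(right))
            )
            site = seq[start:start + total]
            if mismatches <= max_mismatch and not is_biased(site):
                hits.append((start, gap, site, mismatches))
    return hits

print(find_palindromes("GGATTACGAATCC", spacers=range(2, 4)))
print(find_palindromes("GGATTCAAAGCC", spacers=range(2, 4)))
```

The second call returns a hit with one mismatch, matching the imperfect site GGATTnnAAGCC from the slide. The all-versus-all comparison of the resulting motifs is not covered by this sketch.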
9
Many potential binding sites are found ...
Sulfate metabolism
Transposons
RNA Pol K Ribosomal proteins
GTP-binding ATPase
Short hypothetical proteins
The role of found motifs is difficult to predict
10
Finding DNA motifs - the summary
In promoters of homologous genes
  • Easy to perform and interpret results
  • Works only for proteins with sequence homology
In promoters of interacting proteins
  • General approach, works even in the absence of
    sequence homology
  • Needs better coverage of interactions
  • High-throughput studies of species other than
    yeast will enable comparative analysis
Ab initio methods
  • General approach, requires no prior knowledge
  • Complementary approaches (experimental or
    computational) are needed to link the found sites
    to their DNA-binding proteins
11
Evolution and sequence conservation
  • Genome comparisons are based on a simple premise:
    conservation implies functional importance
  • If there are no constraints on DNA sequence,
    random mutations will occur
  • Over large evolutionary times (millions of years),
    these random mutations make two related sequences
    different
  • Sequences from different genomes will be conserved
    if they code for proteins
  • Sequences from different genomes will be conserved
    if they are important for regulation (protein
    binding)
  • Comparative genomics is needed to identify
    conservation
12
Comparative genomics helps genome annotations
  • In prokaryotes, finding genes is relatively
    easy based on open reading frames (ORFs)
  • In eukaryotes, we have to look for ORFs, exons,
    introns, splice sites, polyA sites
  • Bad news: predicted exons sometimes do not exist
  • More bad news: pseudogenes
  • The bad news keeps coming: alternative splicing
  • Good news: in different species, the genes
    normally have similar exon-intron structure
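To illustrate why ORF-based gene finding is comparatively simple in prokaryotes, here is a minimal sketch that scans the three forward reading frames for ATG-to-stop stretches. It ignores the reverse strand, alternative start codons, and any statistical scoring. The function name `find_orfs` and the `min_codons` cutoff are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for ATG..stop ORFs on the forward strand.

    start is the 0-based index of the ATG; end is one past the stop codon.
    min_codons counts the coding codons, including ATG, excluding the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs

# Toy sequence: ATG AAA TTT GGG TAA embedded in frame 2.
print(find_orfs("CCATGAAATTTGGGTAACC"))
```

A scan like this does not carry over to eukaryotes, where introns interrupt the reading frame.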
13
Case 1
Cellular concentration of metabolite is too low
to occupy the riboswitch binding site.
Transcription and
RNA polymerase
Courtesy of R. Breaker, Yale U.
14
Case 1
Cellular concentration of metabolite is too low
to occupy the riboswitch binding site.
Transcription and intramolecular RNA folding
continue.
RNA polymerase
Courtesy of R. Breaker, Yale U.
15
Case 1
Cellular concentration of metabolite is too low
to occupy the riboswitch binding site.
Translation is initiated.
Transcription and intramolecular RNA folding
continue.
Typically the new mRNA codes for a biosynthetic
or transport protein that raises the
intracellular level of the metabolite.
Ribosome
Gene regulation (next case) is accomplished by
variations in the interactions of the regions
highlighted in orange.
Courtesy of R. Breaker, Yale U.
16
Case 2
Cellular concentration of metabolite (X) is high.
RNA polymerase produces the long untranslated
leader region.
Intramolecular folding can lead to an alternate
conformation.
Figure labels: X (metabolite), Nascent RNA, RNA polymerase, DNA template
The alternate riboswitch conformation is stable
when metabolite is bound.
Courtesy of R. Breaker, Yale U.
17
Case 2
Cellular concentration of metabolite (X) is high.
RNA polymerase produces the long untranslated
leader region.
Transcription continues.
Intramolecular folding can lead to an alternate
conformation.
Figure labels: X (metabolite), RNA polymerase
The alternate riboswitch conformation is stable
when metabolite is bound.
Courtesy of R. Breaker, Yale U.
18
Case 2
Cellular concentration of metabolite (X) is high.
Transcription continues.
Now, RNA folding leads to formation of an
intrinsic terminator.
Figure labels: X (metabolite), RNA polymerase
Courtesy of R. Breaker, Yale U.
19
Case 2
Cellular concentration of metabolite (X) is high.
Transcription continues.
Now, RNA folding leads to formation of an
intrinsic terminator.
Figure labels: X (metabolite), RNA polymerase
The transcript is never completed and the
metabolite biosynthetic or transport protein is
not produced.
Courtesy of R. Breaker, Yale U.
20
What does this ncRNA bind?
21
Can we predict functions without a strict measure
of significance (no sequence or structural
similarity)?
This is done by a machine-trained (objective),
jury-like system using inference
Comparative genomics predicts protein
interactions (Rosetta Stone)
  • In yeast, topoisomerase II has two domains
    that correspond to gyrases A and B
  • Sequence comparisons show that these two domains
    are individual proteins in E. coli
  • The implication is that these two proteins
    interact, and that their fusion was favored
    during evolution
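The Rosetta Stone logic can be sketched as a search for two different proteins from one genome that align to non-overlapping regions of a single protein in another genome. The input format, the function name `rosetta_pairs`, the `max_overlap` tolerance, and the coordinates below are all hypothetical; in practice the alignment records would come from a sequence similarity search.

```python
from collections import defaultdict
from itertools import combinations

def rosetta_pairs(hits, max_overlap=10):
    """Predict interacting pairs from domain-fusion evidence.

    hits: iterable of (query, fused_protein, start, end) alignment records,
    where start/end are positions on the candidate fused protein.
    """
    by_fused = defaultdict(list)
    for query, fused, start, end in hits:
        by_fused[fused].append((query, start, end))

    predictions = set()
    for fused, regions in by_fused.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(regions, 2):
            if q1 == q2:
                continue
            # Residues covered by both alignments (negative means a gap).
            overlap = min(e1, e2) - max(s1, s2) + 1
            if overlap <= max_overlap:
                a, b = sorted((q1, q2))
                predictions.add((a, b, fused))
    return sorted(predictions)

# Made-up coordinates, for illustration only.
hits = [
    ("GyrB", "TOP2", 1, 650),
    ("GyrA", "TOP2", 680, 1200),
    ("GyrA", "OtherProtein", 1, 500),
]
print(rosetta_pairs(hits))
```

Only the pair that maps to separate regions of the same fused protein is reported. A single hit, as on `OtherProtein`, gives no fusion evidence.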
23
Predicting protein function by genome context
24
What does gene colinearity mean?
Krr1/Rrp20
Rio1/Rio2
Spo11
Tif11
25
Not much, unless supported by phylogeny and
function
26
The case of Fibrillarin/Nop56 colinearity
27
Fibrillarin and Nop56 DO interact
28
Functional clues for hypothetical proteins based
on genomic context analysis
29
High-throughput approaches
  • Had to be developed quickly to match the speed of
    genome sequencing
  • As a general rule, most experimental approaches
    can be adapted for high-throughput
  • Protein interactions (two hybrid, TAP)
  • Protein localizations
  • Gene regulations (microarray)
  • Structure determination (more recent, still
    gaining speed)

30
What is a high-throughput experiment?
  • Usually done at the level of whole organism
    (whole genome) under different conditions
  • HT experiments are aided by
  • Equipment miniaturization
  • Robotics
  • Other automated procedures
  • In almost all instances, heavy data analysis and
    processing is required

31
General properties of HT experiments
  • Collect large amounts of data under many
    different conditions
  • Err on the side of collecting too much data; disk
    storage is cheap
  • Process raw data (computers)
  • Analyze data (computers)
  • Integrate data from various sources (computers)
  • Identify patterns and cluster the results based
    on similarity (computers)

32
Integrating heterogeneous data to predict protein
interactions
33
Analysis of different data types is usually based
on Bayesian inference
Example: protein interactions
  • Proteins more likely to interact if they are
    co-expressed
  • Proteins more likely to interact if they are
    co-localized in cell
  • Proteins more likely to interact if they are
    co-localized in genome
  • Proteins more likely to interact if they are
    parts of the same cellular process
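One common way to combine evidence types like those listed above is a naive Bayes calculation, in which each piece of evidence multiplies the prior odds of interaction by a likelihood ratio. The sketch below assumes the evidence types are independent, which is rarely exactly true. The prior and all likelihood ratios are invented numbers used only to show the arithmetic.

```python
import math

def posterior_probability(prior, likelihood_ratios):
    """Posterior odds = prior odds * product of likelihood ratios."""
    log_odds = math.log(prior / (1.0 - prior))
    for lr in likelihood_ratios:
        # Work in log space so many ratios can be combined stably.
        log_odds += math.log(lr)
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)

# Hypothetical ratios: P(evidence | interact) / P(evidence | do not interact).
evidence = {
    "co-expressed": 5.0,
    "co-localized in cell": 3.0,
    "co-localized in genome": 4.0,
    "same cellular process": 6.0,
}
prior = 0.001  # assumed fraction of protein pairs that truly interact

p = posterior_probability(prior, evidence.values())
print(f"Posterior probability of interaction: {p:.3f}")
```

Even with four supporting lines of evidence, the low prior keeps the posterior well below certainty, which is why integrating many independent data sources matters.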
34
Predicting large protein complexes from
individual parts
35
Beware of erroneous annotations