Title: Jeanette P. Schmidt
1Orthologs, Paralogs and splice variants -- can
we recognize experimental artifacts
algorithmically?
- Jeanette P. Schmidt
- Stanford University
2Computational Biology
- Interplay between experiments and computations
3Sometimes experimental data (especially if
obtained in high throughput) can be dirty
Sometimes -- one can detect bad experimental data
Normally try to correct the model
4Overview
- Context: identify and characterize (protein-coding) genes and their function across organisms
- Types of experimental data
- Addressing bad data
- Definitions
- Orthologs
- Paralogs
- Splice variants
- Why they pose computational challenges
5Data
- Fully sequenced genomes from several organisms
- ESTs (expressed sequence tags) obtained through
high throughput methods
An Expressed Sequence Tag is a portion of an
entire gene that can be used to help identify
unknown genes and to map their positions within
a genome.
6The Problem
- Identify the important (protein-coding) genes and their function across organisms
- Define paralogs, pseudogenes, and orthologs, and why they make computation difficult
- Identify pseudogenes -- they look computationally like genes but are not functional -- they do not make proteins
- How can computation guide experiments?
7ESTs and gene discovery
An Expressed Sequence Tag is a portion of an
entire gene that can be used to help identify
unknown genes and to map their positions within a
genome.
8Where do the ESTs come from?
Extended exon or artifact?
Splice variant or artifact?
Read through or artifact?
9Characteristics
No stop codon in an intron of 90 nucleotides is
quite common: $(1 - 3/64)^{30} \approx 0.24$
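The estimate above can be reproduced with a few lines of Python. This is a minimal sketch, assuming every codon is equally likely and independent of its neighbors (a simplification; real intron sequence is not uniform). It also assumes 3 stop codons out of 64, as in the standard genetic code. The function name `prob_no_stop` is hypothetical.

```python
from fractions import Fraction

STOP_CODONS = 3    # TAA, TAG, TGA in the standard genetic code
TOTAL_CODONS = 64  # 4^3 possible nucleotide triplets


def prob_no_stop(length_nt: int) -> float:
    """Probability that a random in-frame stretch of length_nt nucleotides
    contains no stop codon (assumption: uniform, independent codons)."""
    codons = length_nt // 3  # only complete codons are counted
    p_not_stop = Fraction(TOTAL_CODONS - STOP_CODONS, TOTAL_CODONS)
    return float(p_not_stop ** codons)


if __name__ == "__main__":
    # Shorter read-throughs are more likely to be free of stop codons.
    for n in (30, 90, 300):
        print(n, round(prob_no_stop(n), 3))
```

For 90 nucleotides (30 codons) this gives roughly 0.24, matching the slide. The value falls quickly as the read-through gets longer, which is why the absence of a stop codon is weak evidence for short stretches.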
- Read through
- Stop codon?
- Length of read through? (statistical estimates)
- The shorter, the more likely to get a read through w/o stop codon
- How many times observed (different mRNA pools, different tissues)
- Present in other species (ortholog)
- Extended exon
- Donor acceptor site
- Present in other species (ortholog)
- Splice variant?
- Stop codon?
- Present in other species (ortholog)
- Observed in certain tissues (more than once)
10Should we bother with ESTs?
- "A gene atlas of the mouse and human protein-encoding transcriptomes" (PNAS, 2004) (Hogenesch)
- "We find that although no single line of evidence is universally predictive of expression, EST evidence has the most predictive value"
11Determining accuracy of splice variants by EST
- Most common approach
- Look at tissue specificity of exon in transcript
12When looking at tissue specific splicing, why
look at the splicing event rather than the exon?
[Figure: two panels, Exon Mapping and Splice Mapping. ESTs A (Heart), B (Liver), C (Brain), and D (Brain) are aligned against full-length transcripts FL1 and FL2.]
Mapping ESTs to exons doesn't really distinguish
the two variants
Looking at the ESTs which share the same splice
site shows brain-specific splicing
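The difference between the two mappings can be illustrated with a small sketch. The exon layout below is hypothetical toy data, not the layout from the slide figure: it assumes ESTs A, B, and C cover exons 1-2-3 and EST D skips exon 2. The function names are likewise assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: EST name -> (tissue, exons covered, in transcript order).
ESTS = {
    "A": ("Heart", [1, 2, 3]),
    "B": ("Liver", [1, 2, 3]),
    "C": ("Brain", [1, 2, 3]),
    "D": ("Brain", [1, 3]),  # assumed to skip exon 2
}


def tissues_by_exon(ests):
    """Exon mapping: which tissues have an EST covering each exon."""
    out = defaultdict(set)
    for tissue, exons in ests.values():
        for exon in exons:
            out[exon].add(tissue)
    return dict(out)


def tissues_by_junction(ests):
    """Splice mapping: which tissues have an EST spanning each exon-exon junction."""
    out = defaultdict(set)
    for tissue, exons in ests.values():
        for junction in zip(exons, exons[1:]):
            out[junction].add(tissue)
    return dict(out)


if __name__ == "__main__":
    print(tissues_by_exon(ESTS))
    print(tissues_by_junction(ESTS))
```

In this toy case every exon is seen in heart, liver, and brain, so exon mapping shows no tissue specificity. The junction (1, 3) is seen only in brain, which is the signal the splice mapping picks up.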
13Tissue specific splicing
- By changing the metric slightly we get a significant increase.
- Number of exons: $k$
- Number of potential splice junctions: $O(k^2)$
- Number of splice junctions in practice: $O(k)$
- Note that even with $O(k)$ splice junctions --> potential number of splice variants is $2^k$.
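The counts on this slide can be checked by brute force for small k. This is a sketch under simple assumptions: any exon i may be joined to any later exon j (the potential junctions), "in practice" is read as consecutive junctions only, and a variant is any choice of which exons to keep (the empty choice is included so the count is exactly 2^k). All function names are hypothetical.

```python
from itertools import combinations


def potential_junctions(k: int) -> int:
    """Any exon i can in principle be spliced to any later exon j: k*(k-1)/2 pairs."""
    return k * (k - 1) // 2


def consecutive_junctions(k: int) -> int:
    """Junctions between neighboring exons only: k-1, i.e. O(k)."""
    return max(k - 1, 0)


def count_exon_subsets(k: int) -> int:
    """Enumerate every keep-or-skip choice over k exons (includes the empty choice)."""
    exons = range(1, k + 1)
    return sum(1 for r in range(k + 1) for _ in combinations(exons, r))


if __name__ == "__main__":
    for k in (2, 4, 8):
        print(k, potential_junctions(k), consecutive_junctions(k), count_exon_subsets(k))
```

The enumeration is exponential on purpose; it is only meant to confirm the 2^k growth for small k, not to be used on real gene models.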
14When looking at tissue specific splicing, why
look at the splicing event rather than the exon?
[Figure: two panels, Exon Mapping and Splice Mapping. ESTs A (Heart), B (Liver), C (Brain), D (Brain), and E (Brain) are aligned against full-length transcripts FL1, FL2, and FL3.]
16Comparison with other species
If A2 is a functional gene, A1 and A2 are
paralogs -- A2 is the result of a duplication
of A1
Not functional --> A2 is a pseudogene
17Events do not always happen in the order we
prescribe
18Events do not always happen in convenient order
19Events do not always happen in convenient order
Speciation 1: A1 and B1 are orthologs
Gene duplication: B1 and B2 are paralogs
[Figure: gene tree showing A1; B1 and B2; C1, C2, and C3]
Now what?
20Ortholog identification
- Combination of methods used to identify orthologs
- Homology: use reciprocal best hit (need complete set of genes to identify best)
- Breaks in presence of paralogs --> more than one best hit
- Syntenic confirmation
- Provides additional solid evidence for ortholog relationship but requires sequenced genome
- Cannot distinguish between paralogs --> they are adjacent
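The reciprocal best hit rule, and the way it breaks when paralogs tie, can be sketched as follows. The similarity scores are hypothetical toy values, and the function names `best_hits` and `reciprocal_best_hits` are assumptions made for illustration. In practice the scores would come from an all-against-all sequence comparison between two complete gene sets.

```python
# Hypothetical similarity scores: (gene in species A, gene in species B) -> score.
SCORES = {
    ("A1", "B1"): 95,
    ("A1", "B2"): 95,  # B1 and B2 are paralogs and tie as best hit for A1
    ("A2", "B3"): 90,
    ("A2", "B1"): 40,
}


def best_hits(scores, gene, side):
    """Top-scoring partners of gene. side=0: gene is in species A; side=1: species B."""
    candidates = {pair[1 - side]: s for pair, s in scores.items() if pair[side] == gene}
    if not candidates:
        return set()
    top = max(candidates.values())
    return {g for g, s in candidates.items() if s == top}


def reciprocal_best_hits(scores):
    """Pairs (a, b) where b is the unique best hit of a and a is the unique best hit of b."""
    pairs = []
    for a in sorted({pair[0] for pair in scores}):
        hits = best_hits(scores, a, 0)
        if len(hits) != 1:
            continue  # more than one best hit: the rule cannot decide
        (b,) = hits
        if best_hits(scores, b, 1) == {a}:
            pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    print(reciprocal_best_hits(SCORES))
```

With these toy scores A2 and B3 are reported as orthologs, while A1 gets no ortholog at all because its two best hits tie. Exact ties are rare with real scores; near-ties cause the same problem in a less visible way.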
21Rat and Mouse Synteny
22What's the solution?
- Change the metric
- Allow multiple orthologs per gene
- Extend notion of best reciprocal hit by including the ε-neighborhood
- 2 options -->
- allow an ortholog only if the ε-neighborhood has size 1
- include the entire neighborhood
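One way to read the two options is sketched below. It interprets the e-neighborhood as an epsilon-neighborhood: all hits whose score lies within `eps` of the best score. That reading, the toy scores, and the function names are assumptions, not details given on the slide.

```python
# Hypothetical similarity scores: (gene in species A, gene in species B) -> score.
SCORES = {
    ("A1", "B1"): 95,
    ("A1", "B2"): 93,  # near-tie caused by a paralog
    ("A2", "B3"): 90,
    ("A2", "B1"): 40,
}


def neighborhood(scores, gene, side, eps):
    """Partners of gene scoring within eps of its best hit (side=0: A, side=1: B)."""
    candidates = {pair[1 - side]: s for pair, s in scores.items() if pair[side] == gene}
    if not candidates:
        return set()
    top = max(candidates.values())
    return {g for g, s in candidates.items() if top - s <= eps}


def orthologs(scores, eps, strict):
    """strict=True: option 1 (neighborhood of size 1); strict=False: option 2."""
    result = {}
    for a in sorted({pair[0] for pair in scores}):
        near_a = neighborhood(scores, a, 0, eps)
        if strict:
            # Option 1: accept only an unambiguous, reciprocal single hit.
            if len(near_a) == 1:
                (b,) = near_a
                if neighborhood(scores, b, 1, eps) == {a}:
                    result[a] = {b}
        else:
            # Option 2: keep every partner whose own neighborhood contains a.
            mutual = {b for b in near_a if a in neighborhood(scores, b, 1, eps)}
            if mutual:
                result[a] = mutual
    return result


if __name__ == "__main__":
    print(orthologs(SCORES, eps=5, strict=True))
    print(orthologs(SCORES, eps=5, strict=False))
```

Option 1 trades coverage for cleanliness: A1 is left without an ortholog. Option 2 reports both B1 and B2 for A1, which makes the possible presence of two orthologs explicit.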
23Summary
- Cleaner ortholog identification
- Better picture of when 2 orthologs might be present in other species
- Important for pre-clinical experiments (off targets)
24Example: Incyte hand-edited lipid kinase
Protein with strong similarity to
phosphatidylinositol-4-phosphate 5-kinase type
III (mouse Pip5k3), member of
phosphatidylinositol-4-phosphate 5-kinase and
TCP-1 or cpn60 families, contains a domain of
unknown function and a FYVE zinc finger
Protein of unknown function, has strong
similarity to a region of
phosphatidylinositol-4-phosphate 5-kinase type
III (mouse Pip5k3), which is a lipid kinase that
binds to phosphatidylinositol 3-phosphate and may
act in endosomal trafficking
Protein containing a domain of unknown function,
has strong similarity to a region of
phosphatidylinositol-4-phosphate 5-kinase type
III (mouse Pip5k3), which is a lipid kinase that
binds to phosphatidylinositol 3-phosphate
25FL lipid kinase
26Acknowledgements
- Kristian Stevens
- Mirjana Marjanovic
- Jim Wingrove
- Ursula Vitt