Title: Discussion Points for 2nd Pseudogene Call
1. Discussion Points for 2nd Pseudogene Call
- Mark Gerstein
- 2005.09.22, 11:00 EST
2. Intersection of Pseudogenes from Three Groups (Original)
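[Venn diagram of three sets: Havana-Gencode (167 pseudogenes), Yale (184 pseudogenes), and UCSC retrogenes (15 expressed, of which 7-8 are pseudogenes; 143 not expressed, all pseudogenes). Region counts as extracted, positions not recoverable: 42, 45, 35, 21, 86, 87, 87, 18, 17, 18, 16, 22.]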
86 Havana pseudogenes overlap with at least one Yale
pseudogene, and 87 Yale pseudogenes overlap with at
least one Havana pseudogene (likewise for the retrogenes).
These are global counts: at some loci three Havana
pseudogenes may overlap a single Yale pseudogene, while
at other loci several Yale pseudogenes overlap one
Havana pseudogene.
Provided by France.
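Why the two overlap counts need not match can be shown with a small sketch. The coordinates below are hypothetical toy values, not real annotations, and the helper names (`overlaps`, `count_overlapping`) are assumptions made for illustration only.

```python
from typing import List, Tuple

# (chromosome, start, end), with end inclusive; strand is not used
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals on the same chromosome share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def count_overlapping(query: List[Interval], target: List[Interval]) -> int:
    """Number of query intervals that overlap at least one target interval."""
    return sum(any(overlaps(q, t) for t in target) for q in query)


# Hypothetical toy sets: three Havana entries fall under one long Yale entry,
# and two Yale entries fall under one Havana entry.
havana = [("chr1", 100, 200), ("chr1", 250, 300), ("chr1", 350, 400), ("chr2", 10, 50)]
yale = [("chr1", 150, 380), ("chr2", 20, 30), ("chr2", 40, 60), ("chr3", 1, 5)]

havana_hits = count_overlapping(havana, yale)  # Havana entries with any Yale overlap
yale_hits = count_overlapping(yale, havana)    # Yale entries with any Havana overlap
print(havana_hits, yale_hits)
```

Because overlap is many-to-many, the count depends on which set is used as the query, which is consistent with the 86 versus 87 reported above.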
3. Intersection of Pseudogenes from 4 Groups (Updated)
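[Venn diagram of three sets: Havana-Gencode (167 pseudogenes), Yale (164 pseudogenes), and UCSC retrogenes (146 not expressed). Region counts as extracted, with GIS pseudogenes in parentheses: 52 (2), 14 (2), 16 (0), 82 (34), 15 (1), 17 (7), 33 (1).]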
- The numbers in parentheses are pseudogenes from GIS.
- All from http://pseudogene.org/ENCODE/cross-ref
- Pseudo-exons were merged to form pseudogenes and used for this comparison (now a pseudogene has only a single start and end).
- Strand information is ignored.
- There are a total of 229 pseudogenes in the union.
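The pseudo-exon merging step described above might look roughly like the following. This is a minimal sketch under assumptions: the record layout `(pseudogene_id, chromosome, strand, start, end)` and the function name `merge_pseudo_exons` are hypothetical, and the real cross-reference files may be organized differently.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (pseudogene_id, chromosome, strand, start, end)
Exon = Tuple[str, str, str, int, int]
Span = Tuple[str, int, int]  # (chromosome, start, end)


def merge_pseudo_exons(exons: List[Exon]) -> Dict[str, Span]:
    """Collapse the pseudo-exons of each pseudogene into one start/end span.

    Strand is read but deliberately dropped, matching the comparison above.
    """
    spans: Dict[str, Span] = {}
    for gene_id, chrom, _strand, start, end in exons:
        if gene_id not in spans:
            spans[gene_id] = (chrom, start, end)
            continue
        prev_chrom, prev_start, prev_end = spans[gene_id]
        if prev_chrom != chrom:
            raise ValueError(f"{gene_id} has pseudo-exons on more than one chromosome")
        spans[gene_id] = (chrom, min(prev_start, start), max(prev_end, end))
    return spans


# Toy input, not real annotation data.
toy_exons = [
    ("pg1", "chr1", "+", 100, 150),
    ("pg1", "chr1", "+", 300, 420),
    ("pg2", "chr2", "-", 900, 950),
]
merged = merge_pseudo_exons(toy_exons)
print(merged)
```

After merging, each pseudogene is a single interval, so the overlap comparison between groups does not depend on how each group split its pseudo-exons.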
4. Intersection of Pseudogenes from 4 Groups
Non-processed Consensus
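[Venn diagram of three sets: Havana-Gencode (167 pseudogenes), Yale (164 pseudogenes), and UCSC retrogenes (146 not expressed). Region counts as extracted, with GIS pseudogenes in parentheses: 52 (2), 14 (2), 16 (0), 82 (34), 15 (1), 17 (7), 33 (1).]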
Rough agreement now is 82 + 52 - 7 = 127 of
229 total. What to do with the remaining 102?
                      GENCODE Processed    GENCODE Non-Processed
Yale Processed             7 / 8                 5 / 5
Yale Non-Processed         4 / 4                39 / 37
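The totals on this slide can be cross-checked with a few lines of arithmetic. This is only a sketch under two assumptions: that the seven region counts in the diagram are disjoint and together make up the union, and that the agreement figure is read as 82 + 52 - 7.

```python
# Region counts as they appear in the diagram (GIS counts in parentheses omitted).
region_counts = [52, 14, 16, 82, 15, 17, 33]

# Assumption: the seven regions are disjoint, so their sum is the size of the union.
union_total = sum(region_counts)

# Assumption: the agreement figure is read as 82 + 52 - 7.
agreement = 82 + 52 - 7

# Pseudogenes in the union that are outside the agreed set.
remaining = union_total - agreement

print(union_total, agreement, remaining)
```

Under these assumptions the numbers are self-consistent: the regions sum to the 229 quoted for the union, and the difference between the union and the agreed set is the 102 left to resolve.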
5. How to Pick Pseudogenes for RT-PCR?
- Start with the intersection (127)
- Duplicated vs. processed: how many of each? (21?)
- Rank pseudogenes
  - By likelihood to be transcribed according to ENCODE evidence
    - ditag, then CAGE, then tiling array
  - By their uniqueness in genome
    - Good primers
    - Non cross-hybridizing probes
  - How to get a consistent rank?
- Who will do RT-PCR?
- What coordinates to use?
- (Ignore 1 processed pseudogene already being sequenced by GIS group.)
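One way to get a consistent rank from the criteria above is a lexicographic sort: ditag evidence first, then CAGE, then tiling array, with uniqueness as the tie-breaker. The sketch below is illustrative only. The `Candidate` fields, and in particular `genome_copies` as a stand-in for "uniqueness in genome", are assumptions and not an agreed scoring scheme.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    name: str
    ditag: bool          # supported by ditag evidence
    cage: bool           # supported by CAGE evidence
    tiling: bool         # supported by tiling-array evidence
    genome_copies: int   # hypothetical uniqueness measure: fewer copies is better


def rank_key(c: Candidate) -> Tuple[bool, bool, bool, int, str]:
    """Sort key: evidence types in priority order, then uniqueness, then name.

    False sorts before True, so 'not c.ditag' puts ditag-supported entries first.
    """
    return (not c.ditag, not c.cage, not c.tiling, c.genome_copies, c.name)


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Return candidates from most to least attractive for RT-PCR."""
    return sorted(candidates, key=rank_key)


# Toy candidates, not real pseudogenes.
toy = [
    Candidate("A", ditag=False, cage=True, tiling=True, genome_copies=1),
    Candidate("B", ditag=True, cage=False, tiling=False, genome_copies=3),
    Candidate("C", ditag=True, cage=False, tiling=False, genome_copies=1),
    Candidate("D", ditag=False, cage=False, tiling=False, genome_copies=1),
]
ranked_names = [c.name for c in rank(toy)]
print(ranked_names)
```

Including the name as the final key makes the order deterministic, so every group applying the same key to the same table gets the same list. Whether a strict priority order is better than a weighted score is still an open question for the call.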
6. How to generate a consensus for the remaining 102
pseudogenes?
- Stick with the intersection (127)
- Develop consistent criteria for identifying pseudogenes and apply them uniformly to ENCODE
  - E.g. protein matches with disablements found from a pipeline
  - Ignores tricky cases flagged by manual annotation
- Do a simple union of UCSC, Havana, and Yale, giving 229
  - GIS is a subset of the other 3
  - Describe pseudogenes as being identified by multiple approaches and then explicitly flag each group's unique ones in the final annotation
  - Easy, but perhaps biases stats
- Do a qualified union
  - Allow each group to question particular pseudogenes in another's set
  - Send questions around and then have a call to sort out differences
  - Need a way to arbitrate, e.g. we could demand an obvious disablement
  - We might learn something!
- How do we represent this in the browser and in stats?
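The "simple union with flags" option could be sketched as follows: overlapping intervals from all groups are clustered, and each cluster records which groups support it. This is a minimal sketch under assumptions. The function name `flagged_union`, the toy coordinates, and the choice of single-linkage clustering on merged spans are illustrative and not an agreed procedure.

```python
from typing import Dict, List, Set, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), strand ignored
Cluster = Tuple[Interval, Set[str]]  # merged span and the groups supporting it


def flagged_union(sets: Dict[str, List[Interval]]) -> List[Cluster]:
    """Cluster overlapping intervals across groups and flag the supporting groups."""
    tagged = sorted(
        (chrom, start, end, group)
        for group, intervals in sets.items()
        for chrom, start, end in intervals
    )
    clusters: List[Cluster] = []
    for chrom, start, end, group in tagged:
        if clusters:
            (c_chrom, c_start, c_end), groups = clusters[-1]
            # Same chromosome and starts inside the current cluster: extend it.
            if c_chrom == chrom and start <= c_end:
                clusters[-1] = ((c_chrom, c_start, max(c_end, end)), groups | {group})
                continue
        clusters.append(((chrom, start, end), {group}))
    return clusters


# Toy input, not real annotations.
toy_sets = {
    "Havana": [("chr1", 100, 200), ("chr2", 10, 50)],
    "Yale": [("chr1", 150, 260), ("chr3", 5, 9)],
    "UCSC": [("chr1", 250, 300)],
}
clusters = flagged_union(toy_sets)
shared = [c for c in clusters if len(c[1]) >= 2]   # supported by two or more groups
unique = [c for c in clusters if len(c[1]) == 1]   # flagged as one group's own
print(len(clusters), len(shared), len(unique))
```

Note that single-linkage clustering chains intervals: in the toy data the UCSC entry does not overlap the Havana entry directly, yet both land in one cluster through the Yale entry. That behaviour affects the counts and is one of the things a qualified union would need to settle.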
7. Once we have consensus, how to agree on
pseudogene boundaries?
- Keep each group's boundaries unchanged
- If pseudogenes overlap, take the largest region (union) or the smallest
- Develop uniform criteria for assigning pseudogene boundaries and apply them to each of the pseudogenes in the consensus set
  - Could just take each pseudogene in the consensus and have one group realign it against parent
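The "largest region (union) or smallest" choice for overlapping pseudogenes can be made concrete with a short sketch. The helper names `union_span` and `intersection_span` are hypothetical, "smallest" is interpreted here as the region common to all groups, and the coordinates are toy values.

```python
from typing import List, Optional, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), end inclusive


def _single_chromosome(intervals: List[Interval]) -> str:
    """Return the shared chromosome, or raise if the input is empty or mixed."""
    chroms = {chrom for chrom, _, _ in intervals}
    if len(chroms) != 1:
        raise ValueError("intervals must be non-empty and on one chromosome")
    return chroms.pop()


def union_span(intervals: List[Interval]) -> Interval:
    """Largest region: leftmost start to rightmost end."""
    chrom = _single_chromosome(intervals)
    return (chrom, min(s for _, s, _ in intervals), max(e for _, _, e in intervals))


def intersection_span(intervals: List[Interval]) -> Optional[Interval]:
    """Smallest region: the part covered by every interval, or None if there is none."""
    chrom = _single_chromosome(intervals)
    start = max(s for _, s, _ in intervals)
    end = min(e for _, _, e in intervals)
    return (chrom, start, end) if start <= end else None


# Toy boundaries for one consensus pseudogene as called by three groups.
calls = [("chr1", 100, 200), ("chr1", 150, 260), ("chr1", 120, 220)]
largest = union_span(calls)
smallest = intersection_span(calls)
print(largest, smallest)
```

The union always exists, while the common region can be empty when calls only chain together through a third call. That case would need a rule of its own, or a realignment against the parent as suggested above.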