Title: ENCODE Pseudogene Annotation Subgroup: Summary of Thurs. 16-Sept Call
1. ENCODE Pseudogene Annotation Subgroup: Summary of Thurs. 16-Sept Call
- Summarized by M Gerstein, 16-Sept
- Participating groups: Havana, IMIM, UCSC, Yale, GIS, Affy
2. Overall Goals of Pseudogene Subgroup
- Create consensus ENCODE pseudogene annotation
- Agree on defining elements of a pseudogene
- To what degree do pseudogenes confound gene annotation? How many are close or distal to genes?
- Cross-reference this annotation against ENCODE experiments
- How many pseudogenes have some functional "activity"? How many are transcribed?
- How many are associated with TARs/transfrags? CAGE ditags? ChIP-chip binding sites?
- Cross-hybridization problem
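3. Intersection of Pseudogenes from 3 Groups
[Venn diagram of the three sets below; region counts as extracted, placement not recoverable: 42, 45, 35, 21, 86, 87, 87, 18, 17, 18, 16, 22]
- Havana-Gencode: 167 pseudogenes
- Yale: 184 pseudogenes
- UCSC retrogenes: 15 expressed (7-8 pseudogenes), 143 not expressed (all pseudogenes)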
86 Havana pseudogenes overlap with any Yale pseudogene, and 87 Yale pseudogenes overlap with any Havana pseudogene (idem for retrogenes). The two counts differ because this is a global result: in some loci three Havana pseudogenes may overlap with only one Yale pseudogene, while in other loci several Yale pseudogenes overlap with one Havana pseudogene.
Provided by France Denoeud (IMIM)
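The asymmetry described above (86 vs. 87) follows from counting "members of one set that overlap anything in the other set" in each direction. A minimal sketch of that counting is given here; the coordinates are made-up toy values, and the function names and inclusive-coordinate convention are assumptions rather than the method IMIM actually used.

```python
from typing import List, Tuple

# (region, start, end) with inclusive coordinates; toy values, not real annotations
Interval = Tuple[str, int, int]

def overlaps(a: Interval, b: Interval) -> bool:
    """True if a and b lie in the same region and share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

def count_overlapping(query: List[Interval], target: List[Interval]) -> int:
    """Number of query intervals that overlap ANY target interval."""
    return sum(1 for q in query if any(overlaps(q, t) for t in target))

# Three toy "Havana" calls fall inside one long toy "Yale" call
havana = [("ENm001", 100, 200), ("ENm001", 250, 300),
          ("ENm001", 350, 400), ("ENm002", 10, 50)]
yale = [("ENm001", 90, 420), ("ENm002", 500, 600)]

havana_hits = count_overlapping(havana, yale)
yale_hits = count_overlapping(yale, havana)
print(havana_hits, yale_hits)
```

In this toy case three calls from one set map onto a single call from the other, so the two directional counts are not equal, which is the same effect noted for the real sets.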
4. "Yale-only" Pseudogenes: 5 Examples
>15 ENm002 831244 831480 237 IPIIPI00442001 259..337 pexons 1 235
FHALVVLSWPHVLELLPQ
RNPSLHVASLTRQLQHCMAGHQLLQFKGSTLALVIITLELERLMPGWCAP
ISDLLKKAQV FHALVVLSWPHVLELLPQR
NPSLHVASLTRQLQHCMAGHQLLQFKGSTLALVIITLELERLMPGWCAPI
SDLLKKAQV FHALVVLSWPHVLELLPQR
NPSLHVASLTRQLQHCMAGHQLLQFKGSTLALVIITLELERLMPGWCAPI
SDLLKKAQVFHALVVLSWPHVLELLPQRNPSLHVASLTRQLQHCMAGHQ
LLQFKGSTLALVIITLELERLMPGWCAPISDLLKKAQV
No disablement, overlap exon
>70 ENm007 381109 381518 410 IPIIPI00448927 239..330 frameshift1 ENm007 381109 381518 pexons -404 -147
SKKPSLSVQPGPVMAPGESLTLHCVSDVG
YDRFVLYKEGERDLRQLPGRQPQAGLSQANFTLGPVSRSYGGQYRCYGAH
NLSSECSAPSDP SPQPSLSAQPGSPVLSGDSLTPQHHSEAGF
DSSALTR-----TR!LPARQRLDGQHLLDVPLGHASHPPGGQHRCCGGHN
ASCPRSVPRRP PGVSKKPSLSVQPGPVMAPGESLTLHCVSD
VGYDRFVL-YKEGERDLRQLPGRQPQAGLSQANFTLGPVSRSYGGQYRCY
GAHNLSSECSAPG-SPQPSLSAQPGSPVLSGDSLTPQHHSEAGFDSSAL
/YQD-----KGLPARQRLDGQHLLDVPLGHASHPPGGQHRCCGGHNASCP
RSV PSDPLDILITGQIRGT-----PFISVQPG PRRPHPTSWL-QVRGP
YPDPIPFSALDPG
Frameshift
>122 ENm009 367441 368389 949 IPIIPI00465221 1..305 ENm009 367441 368389 pexons -949 -37
MALPITNGTLFMPFVLTFIGIPGFESVQCWIGIPFCATYVIALI..
.......WILYPIICTYHLVQSLPTGPTIPQPLYLWVKDQTH
MALPITNGTLFMPFVLTF
IGIPGFESVQCWIGIPFCATYVIALI.........WILYPIICTYHLVQS
LPTGPTIPQPLYLWVKDQTH
MALPITNGTLFMPFVLTFIGIPGFESVQCWIGIPFCATY
VIALI.........WILYPIICTYHLVQSLPTGPTIPQPLYLWVKDQTH
MALPITNGTLFMPFVLTFIGIPGFESVQCWIGIPFCATYVIALI......
...WILYPIICTYHLVQSLPTGPTIPQPLYLWVKDQTH
No disablement, overlap exon
Remove 12, but some tricky issues, i.e. 12, 99, 152, 169, 108
>205 ENr223 185680 201963 16284 IPIIPI00023543 110..588 2.78 intron4 stop2 frameshift5 pexons 3383 3620 3826 4462 12565 12865 12917 13099 13459 13551
Disablements, has introns, probably duplicated, overlaps exon
>177 ENr122 359278 362468 3191 IPIIPI00029222 980..1118 0.87 intron0 stop0 frameshift2 ENr122 359278 362468 2768 3191 pexons 2768 3191
LGNTIQDIGMGKDFMTKTPKAMATKVKIDRWDLIKLKSFCTAKE
TTIRVNRQPTKWEKIFAIYSSDKGLISRIYNE---LKQIYKKKTNNPIKK
WAKDMNRHPSKEDIYAAKKHMKKCSSSLAIREMQIKTTMRYHLTPVR
LGNNILDTGFGKYFMTKMPKAIATETKIEIWDISKLK!FCRAKETI
NSVNRQPIEMEKIFANYASDRGLISRIY!KKTNLNLQAKTKQHNSIKKWP
KDMDRHFSKDDICVANKPRKTLPTSLIIREIQIKTMMRYHLTPFR
IKTLEKNLGNTIQDIGMGKDFMTKTPKAMATKVKIDRWDLIKLK-SF
CTAKETTIRVNRQPTKWEKIFAIYSSDKGLISRIY---NELKQIYKKKT-
NNPIKKWAKDMNRHPSKEDIYAAKKHMKKCSSSLAIREMQIKTTMRYHLT
PVRVRLLYALLGNNILDTGFGKYFMTKMPKAIATETKIEIWDISKLK/S
FCRAKETINSVNRQPIEMEKIFANYASDRGLISRIYKKNKLKFTSKNQT
\NNSIKKWPKDMDRHFSKDDICVANKPRKTLPTSLIIREIQIKTMMRYHL
TPFR
Multiple Frameshifts, overlap exon
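The record headers in these examples appear to begin with five fields: an identifier, an ENCODE region name, a start, an end, and a length. That reading is an assumption inferred from the numbers themselves (for instance 831480 - 831244 + 1 = 237); the slides do not document the format. Under that assumption, a short sanity check of the first five fields might look like this, with `parse_header` and `length_consistent` being hypothetical helper names.

```python
import re

def parse_header(line: str) -> dict:
    """Parse the first five whitespace-separated fields of a record header.

    Assumed layout (not documented in the slides): id, region, start, end, length.
    """
    fields = line.split()
    return {
        # drop any leading marker characters before the numeric id
        "id": re.sub(r"^\D+", "", fields[0]),
        "region": fields[1],
        "start": int(fields[2]),
        "end": int(fields[3]),
        "length": int(fields[4]),
    }

def length_consistent(rec: dict) -> bool:
    """Check that length equals end - start + 1 (inclusive coordinates)."""
    return rec["end"] - rec["start"] + 1 == rec["length"]

# Header prefixes copied from two of the examples above
rec_a = parse_header(">15 ENm002 831244 831480 237")
rec_b = parse_header(">205 ENr223 185680 201963 16284")
print(rec_a["id"], length_consistent(rec_a))
print(rec_b["id"], length_consistent(rec_b))
```

The remaining header fields (protein identifier, ranges, disablement counts, pexons) are left unparsed here because their exact layout varies between records.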
5. "Havana-only" Pseudogenes: 5 Examples
>12 ENm002 242882 243044 163 IPIIPI00017094 2359..2399 0.26 ENm002 242882 243044
FARASKEQKDKFLKNRGFSLLANQLYLHRGTQELLECFIE
FSRPSKKQKDKFLK-YSFSLLANQLFLHQEIQELTDSFIK LDAYFAR
ASKEQKDKFLKNRGFSLLANQLYLHRGTQELLECFI-EMFFGRHIGLDE
FEAFSRPSKKQKDKFLK-YSFSLLANQLFLHQEIQELTDSFI/EMFFG
CTGLDE
>56 ENm006 1293946 1313338 19393 IPIIPI00384823 1..1276 8.3 intron0 stop7 frameshift9
>125 ENm009 424525 425472 948 IPIIPI00022766 1..282
MYIVAVAGNIFLIFLIMTERSLHEPLYLFLSMLASANFLL
AAAAAPEVLAILWFH.........KQIKDRVILLFSPISVCC
MYIVAVAGNIFLIFLIMTERSLHEPMYLFLSMLASADFLLATAA
APKVLAILWFH.........KQIKDRVILLFSPISVCC
MYIVAVAGNIFLIFLIMTERSLHEPLYLFLSMLASANFLLAAAAAPE
VLAILWFH.........KQIKDRVILLFSPISVCCMYIVAVAGNIFLIF
LIMTERSLHEPMYLFLSMLASADFLLATAAAPKVLAILWFH.........
KQIKDRVILLFSPISVCC
Similar discussion for "UCSC only"
>103 ENm008 153121 155155 2035 IPIIPI00217473 8..143 intron2 stop0 frameshift0 pexons 25 96 1360 1566 1906 2033
TIIVSMWAKISTQADTIGTE
TLE LFLSHPQTKTYFPHFDLHPGS
AQLRAHGSKVVAAVGDAVKSI TIIVSMWAKISTQADTIGTETLE
RRagg LFLSHPQTKTYFPHFDLHPGSAQLRAHGS
KVVAAVGDAVKSI DDIGGALSKLSELHAYILRVDPVNFK
LLSHCLLVTLAARFPADFTAEAHAAWDKFLSVVSSVL
TEKYR DDIGGALSKLSELHAYILRVDPVNFK
LLSHCLLVTLAARFPADFTAEAHAAWAKFLSVVSSVLTEKYR
RLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVG
DAVKSIDDIGGALSKLSELHAYILRVDPVNFKLRLFLSHPQTKTYFPHF
DLHPGSAQLRAHGSKVVAAVGDAVKSIDDIGGALSKLSELHAYILRVDPV
NFKV
>174 ENr121 322430 366341 43912 IPIIPI00384823 1..1276 intron0 stop2 frameshift2 ENr121 322430 366341 38663 42482 pexons 38663 42482
MTGSNSHITILTLNINGLNSAIKRHRRASWIKSQDPSVCCIQET
...
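6. Pseudogenes Overlapping Gencode Exons
[Venn diagram of the three sets below; region counts as extracted, placement not recoverable: 122, 28, 30, 124, 13, 12, 20, 2]
- Havana-Gencode: 167 pseudogenes
- Yale: 184 pseudogenes
- Havana-Gencode Exons: 17603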
7. 49 GIS Pseudogenes, Not Yet Fully Compared
- The 49 non-redundant ENCODE processed pseudogenes were used for comparison with pseudogenes from Yale, Vega, and Ensembl groups.
- 4 pseudogenes were uniquely found in the two libraries.
[Venn diagram, from GIS: GIS-PET (4), Yale (12), Vega (5), Ensembl (3); remaining region counts as extracted, placement not recoverable: 20, 2, 2, 1]
8. Browser Tracks
R Baertsch, UCSC
Pseudogene track
A processed pseudogene at chr21:33775699-33776428
genome-test.cse.ucsc.edu/ENCODE/encode.html
9. Overall short-term goal for next call
Come up with a consensus list of pseudogenes suitable for careful checking for transcription (perhaps by RT-PCR)
10. Immediate ToDo's for Next Call
- Classify pseudogenes as processed or non-processed (with a third "not sure" category)
- Venn diagrams in each category
- Need to add to our current 87 consensus
- Among duplicated pseudogenes: determine Yale/Havana consensus, add to 87
- Among processed pseudogenes
- Merge in 49 from GIS
- Each group should determine which of its pseudogenes not in the consensus it still wants to keep and repost them to list
- Update list summary and UCSC browser
- List summary web page (maintained by Deyou, http://homes.gersteinlab.org/people/zhengdy/cgi-bin/encode-pgene.cgi)
- Flag truly tricky ones as questionable, to be returned to later (e.g. 169, OR ex. truncated at 6TM)
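One way to produce the per-category Venn counts mentioned in this list is to tag each pseudogene with its category and the set of groups that annotated it, then count each (category, group combination) pair. The sketch below assumes that bookkeeping has already been done; the records, identifiers, and the `venn_counts` helper are all hypothetical illustrations, not subgroup data.

```python
from collections import Counter

# Hypothetical records: (pseudogene id, category, groups that annotated it)
calls = [
    ("pg1", "processed", {"Havana", "Yale"}),
    ("pg2", "processed", {"Yale"}),
    ("pg3", "non-processed", {"Havana", "Yale", "UCSC"}),
    ("pg4", "not sure", {"Havana"}),
    ("pg5", "processed", {"Havana", "Yale"}),
]

def venn_counts(records):
    """Count pseudogenes per (category, exact combination of groups)."""
    counts = Counter()
    for _pg_id, category, groups in records:
        counts[(category, frozenset(groups))] += 1
    return counts

counts = venn_counts(calls)

# Print one line per Venn region, grouped by category
for (category, groups), n in sorted(
    counts.items(), key=lambda kv: (kv[0][0], sorted(kv[0][1]))
):
    print(category, "+".join(sorted(groups)), n)
```

Each key corresponds to exactly one region of a Venn diagram for that category, so the counts can be read directly into a figure.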
11. Browser ToDo's for Next Call
- Send alignments to Rob so he can link to browser
- A clear coloring scheme for differentiating processed vs non-processed pgenes
- UCSC will index by names used by the different groups
- Create an additional fourth sub-track for consensus pseudogenes
- Perhaps an additional track for prominent disagreements, i.e. questionable pseudogenes (or another color)
- Small fix on Gencode "pseudogene" track
12. Remaining Issues
- Of the consensus pseudogenes, determine unique sequences for RT-PCR or matching against probes
- Remaining questions
- How are we going to arrive at agreed-upon boundaries for pseudogenes (start and stop)?
- What is the best for alignments, cDNA or protein?
- (Given that complete cDNA info is not available for everything, perhaps best to stick to proteins initially.)
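On the open boundary question, two simple conventions that could be compared are the union of all groups' calls (most inclusive) and their intersection (the core every group agrees on). The sketch below only illustrates those two options on made-up coordinates; neither was adopted on the call, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

Call = Tuple[int, int]  # (start, end), inclusive; toy coordinates only

def union_bounds(calls: List[Call]) -> Call:
    """Most inclusive boundaries: leftmost start to rightmost end."""
    return min(s for s, _ in calls), max(e for _, e in calls)

def intersection_bounds(calls: List[Call]) -> Optional[Call]:
    """Core boundaries shared by all calls, or None if they do not all overlap."""
    start = max(s for s, _ in calls)
    end = min(e for _, e in calls)
    return (start, end) if start <= end else None

# Three hypothetical calls for the same locus from three groups
locus_calls = [(100, 200), (120, 230), (90, 210)]

union = union_bounds(locus_calls)
core = intersection_bounds(locus_calls)
print(union, core)
```

The union is the safer choice when designing probes that must avoid the pseudogene, while the intersection is the more conservative choice when selecting sequence that every group agrees belongs to it.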