Title: Annotation Parsing
Affymetrix File Format
- Comma-separated file containing lots of data
- UniGene, Ensembl, Entrez and SwissProt IDs
- Genome Version, Chromosomal Location, Alignment info
- Gene Ontology Info
- Pathway Membership
- Protein Families and Domains
- A sample record looks like:
"1000_at","Human Genome U95Av2 Array","Homo
sapiens","Dec 18, 2005","Exemplar
sequence","GenBank","X60188mRNA","X60188
/FEATUREmRNA /DEFINITIONHSERK1 Human ERK1 mRNA
for protein serine/threonine kinase","X60188","---
","Hs.861","May 2004 (NCBI 35)","chr1630032927-30
042040 (-) // 93.03 // p11.2","mitogen-activated
protein kinase 3","MAPK3","chr16p12-p11.2","full
length","ENSG00000102882","5595","P27361 ///
Q9BWJ1 /// Q7Z3H5 /// Q8NHX0 ///
Q8NHX1","EC2.7.1.-","601795","NP_002737.1","NM_00
2746","---","---","---","---","---","---","74 //
regulation of progression through cell cycle //
non-traceable author statement /// 6468 //
protein amino acid phosphorylation // inferred
from direct assay /// 6468 // protein amino acid
phosphorylation // inferred from electronic
annotation /// 7049 // cell cycle // inferred
from electronic annotation","---","166 //
nucleotide binding // inferred from electronic
annotation /// 4674 // protein serine/threonine
kinase activity // inferred from electronic
annotation /// 4707 // MAP kinase activity //
non-traceable author statement /// 4713 //
protein-tyrosine kinase activity // inferred from
electronic annotation /// 5515 // protein binding
// inferred from physical interaction /// 5524 //
ATP binding // non-traceable author statement ///
16740 // transferase activity // inferred from
electronic annotation /// 4672 // protein kinase
activity // inferred from electronic annotation
/// 4707 // MAP kinase activity // inferred from
electronic annotation /// 5524 // ATP binding //
inferred from electronic annotation /// 16301 //
kinase activity // inferred from electronic
annotation","MAPK_Cascade // GenMAPP ///
S1P_Signaling // GenMAPP /// TGF_Beta_Signaling_Pa
thway // GenMAPP","ec // A2S7_HUMAN // (Q96Q40)
Serine/threonine-protein kinase ALS2CR7 (EC
2.7.1.37) (Amyotrophic lateral sclerosis 2
chromosomal region candidate gene protein 7) //
1.0E-77 /// ec // A2S7_HUMAN // (Q96Q40)
Serine/threonine-protein kinase ALS2CR7 (EC
2.7.1.37) (Amyotrophic lateral sclerosis 2
chromosomal region candidate gene protein 7) //
2.0E-85 /// hanks // 3.1.1 // CMCG Group CMGC I
Cyclin-dependent (CDKs) and close relatives
CDC2Hs // 1.0E-85 /// hanks // 3.1.1 // CMCG
Group CMGC I Cyclin-dependent (CDKs) and close
relatives CDC2Hs // 1.0E-79","---","IPR000719 //
Protein kinase","---","---","This probe set was
annotated using the Matching Probes based
pipeline to a Entrez Gene identifier using 3
transcripts. // false // Matching Probes //
A","BC000205(15),BX537897(15),NM_002746(16)","NM_0
02746 // Homo sapiens mitogen-activated protein
kinase 3 (MAPK3), mRNA. // refseq // 16 // ---
/// CR603463 // full-length cDNA clone
CS0DN005YA14 of Adult brain of Homo sapiens
(human). // gb // 15 // --- ///
ENSESTT00000097559 // --- // ensembl_est // 15 //
--- /// ENST00000263025 // cdnaknown-ccds
chromosomeNCBI35163003292830042042-1
geneENSG00000102882 CCDS10672.1 //
ensembl_transcript // 15 // --- /// BC000205 //
Homo sapiens, clone IMAGE3350666, mRNA, partial
cds. // gb // 15 // --- /// BX537897 // Homo
sapiens mRNA cDNA DKFZp686O0215 (from clone
DKFZp686O0215). // gb // 15 // ---","ENSESTT000000
97558 // ensembl_est // 4 // Cross Hyb Matching
Probes /// AK096992 // gb // 1 // Cross Hyb
Matching Probes"
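
The sample record suggests two levels of structure: quoted CSV fields that may contain commas (for example "Dec 18, 2005"), and multi-valued fields in which " /// " separates entries and " // " separates the parts of one entry. A minimal sketch of splitting both levels follows. The class and method names are hypothetical, and two things are assumed rather than taken from the source: that "---" marks a missing value, and that a doubled quote inside a quoted field is an escaped quote, as in common CSV practice.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the WorkBench code base. */
final class AffyRecordSplitter {

    /** Splits one CSV record into fields, keeping commas that sit inside double quotes. */
    static List<String> splitRecord(String record) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (c == '"') {
                // Assumption: a doubled quote inside a quoted field is a literal quote.
                if (inQuotes && i + 1 < record.length() && record.charAt(i + 1) == '"') {
                    current.append('"');
                    i++;
                } else {
                    inQuotes = !inQuotes;
                }
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    /** Splits a multi-valued field: "///" separates entries, "//" separates parts of an entry. */
    static List<List<String>> splitMultiValue(String field) {
        List<List<String>> entries = new ArrayList<>();
        String trimmed = field.trim();
        if (trimmed.isEmpty() || trimmed.equals("---")) {
            return entries; // assumption: "---" means no value
        }
        for (String entry : trimmed.split("\\s*///\\s*")) {
            List<String> parts = new ArrayList<>();
            for (String part : entry.split("\\s*//\\s*")) {
                parts.add(part.trim());
            }
            entries.add(parts);
        }
        return entries;
    }
}
```

Applied to the Gene Ontology field of the sample, each entry would come back as an ID, a term name, and an evidence description.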
WorkBench Model
1. Automatically identify chip type by specific marker presence.
2. Parse and filter the appropriate annotation file to produce a smaller version of the annotations, called the idx file.
3. Store all annotations in a Map from marker ID to annotation line.
4. For future accesses, skip filtering step 2.
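
The filter-then-index steps can be sketched roughly as below. This is an illustration under assumptions, not the WorkBench implementation: the class and method names are hypothetical, each record is assumed to occupy one physical line, and the marker ID is assumed to be the first quoted field, as in the sample record.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of filtering annotation lines and indexing them by marker ID. */
final class MarkerIndexSketch {

    /** Returns the first quoted field of a record, or null if the line does not start with a quote. */
    static String markerId(String line) {
        if (!line.startsWith("\"")) {
            return null;
        }
        int end = line.indexOf('"', 1);
        return end < 0 ? null : line.substring(1, end);
    }

    /** Keeps only the lines whose marker ID is wanted, keyed by marker ID, with the line left unparsed. */
    static Map<String, String> filterAndIndex(Iterable<String> lines, Set<String> wantedIds) {
        Map<String, String> byMarker = new HashMap<>();
        for (String line : lines) {
            String id = markerId(line);
            if (id != null && wantedIds.contains(id)) {
                byMarker.put(id, line);
            }
        }
        return byMarker;
    }
}
```

Because the map values are raw lines, every lookup still has to split the line again.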
Issues
- A lot of hardcoded values in the parser: chip names, annotation names, etc. (100, 147, 393)
- Hardcoded list of included annotations. (393)
- Chip type map is fragile: it depends on specific markers being present.
- Annotations are stored in memory in an unparsed state, which forces the annotation line to be parsed on every element access. (368, 511)
- All included annotations are stored in memory. (42)
- Would benefit from a Singleton pattern: could then avoid file access in the static constructor, methods wouldn't be static, etc.
- Includes GUI elements, causing difficulty with test cases and programmatic usage. (108)
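
One way to address the unparsed-state and hardcoded-column issues is to split each record once at load time and look columns up by name. The sketch below is only an illustration: `ParsedAnnotations` is a hypothetical class, it is an ordinary instance rather than static state, and it contains no GUI code. The column names would come from the file's header row, which the sample above does not show.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical store that holds records already split into fields. */
final class ParsedAnnotations {
    private final Map<String, Integer> columnIndex = new HashMap<>();
    private final Map<String, String[]> rowsByMarker = new HashMap<>();

    /** header: column names in file order; rows: split records whose first field is the marker ID. */
    ParsedAnnotations(List<String> header, List<String[]> rows) {
        for (int i = 0; i < header.size(); i++) {
            columnIndex.put(header.get(i), i);
        }
        for (String[] row : rows) {
            if (row.length > 0) {
                rowsByMarker.put(row[0], row);
            }
        }
    }

    /** Returns the stored value, or null when the marker or the column is unknown. */
    String get(String markerId, String column) {
        String[] row = rowsByMarker.get(markerId);
        Integer idx = columnIndex.get(column);
        if (row == null || idx == null || idx >= row.length) {
            return null;
        }
        return row[idx];
    }
}
```

A call such as `annotations.get("1000_at", "Gene Symbol")` then costs two map lookups instead of a re-parse; the column name in that call is an assumed header, used only for illustration.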
Annotation Sizes
Proposed fixes
- Determine and specify the relationship between Microarray data objects and Annotation information. What will be the impact if annotations are not available?
- User-requested annotation loading as a separate step.
- Allow for multiple annotation formats; support non-Affymetrix and custom formats.
- Do not create a custom index file.
- Allow user-specified filtering of annotations.
- Explore open-source disk-based indexes and databases, for example Berkeley DB or hsqldb.
- Proper MVC structure: the AnnotationParser class is simply for loading and parsing data and does not cause GUI events. (Although this can be said of many Data classes; see CSExprMicroarraySet.java:125.)
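
As a rough illustration of how several of these fixes could fit together, the sketch below separates annotation lookup from any particular file format and from the GUI. All names here (`AnnotationSource`, `NoAnnotations`, `MapAnnotations`) are hypothetical and do not exist in the WorkBench code. The "annotations not available" case is modelled as an explicit implementation that returns empty results, so callers do not need special handling.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical format-independent view of annotations; no GUI dependencies. */
interface AnnotationSource {
    Optional<String> lookup(String markerId, String annotationName);
}

/** Used when the user has not loaded any annotation file. */
final class NoAnnotations implements AnnotationSource {
    @Override
    public Optional<String> lookup(String markerId, String annotationName) {
        return Optional.empty();
    }
}

/** In-memory source; an Affymetrix or custom-format loader would populate the map. */
final class MapAnnotations implements AnnotationSource {
    private final Map<String, Map<String, String>> byMarker;

    MapAnnotations(Map<String, Map<String, String>> byMarker) {
        this.byMarker = new HashMap<>(byMarker);
    }

    @Override
    public Optional<String> lookup(String markerId, String annotationName) {
        Map<String, String> row = byMarker.get(markerId);
        return row == null ? Optional.empty() : Optional.ofNullable(row.get(annotationName));
    }
}
```

A disk-based index could be added later as another implementation of the same interface without changing callers.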