Annotation Parsing - PowerPoint PPT Presentation

Slides: 7
Provided by: matt79
Transcript and Presenter's Notes

1
Annotation Parsing
2
Affymetrix File Format
  • Comma-separated file containing many annotation fields per probe set
  • UniGene, Ensembl, Entrez and SwissProt IDs
  • Genome Version, Chromosomal Location, Alignment
    info
  • Gene Ontology Info
  • Pathway Membership
  • Protein Families and Domains
  • Looks like

"1000_at","Human Genome U95Av2 Array","Homo sapiens","Dec 18, 2005",
"Exemplar sequence","GenBank","X60188mRNA","X60188 /FEATUREmRNA
/DEFINITIONHSERK1 Human ERK1 mRNA for protein serine/threonine
kinase","X60188","---","Hs.861","May 2004 (NCBI 35)",
"chr1630032927-30042040 (-) // 93.03 // p11.2","mitogen-activated
protein kinase 3","MAPK3","chr16p12-p11.2","full length",
"ENSG00000102882","5595","P27361 /// Q9BWJ1 /// Q7Z3H5 /// Q8NHX0 ///
Q8NHX1","EC2.7.1.-","601795","NP_002737.1","NM_002746","---","---",
"---","---","---","---","74 // regulation of progression through cell
cycle // non-traceable author statement /// 6468 // protein amino acid
phosphorylation // inferred from direct assay /// 6468 // protein
amino acid phosphorylation // inferred from electronic annotation ///
7049 // cell cycle // inferred from electronic annotation","---",
"166 // nucleotide binding // inferred from electronic annotation ///
4674 // protein serine/threonine kinase activity // inferred from
electronic annotation /// 4707 // MAP kinase activity // non-traceable
author statement /// 4713 // protein-tyrosine kinase activity //
inferred from electronic annotation /// 5515 // protein binding //
inferred from physical interaction /// 5524 // ATP binding //
non-traceable author statement /// 16740 // transferase activity //
inferred from electronic annotation /// 4672 // protein kinase
activity // inferred from electronic annotation /// 4707 // MAP kinase
activity // inferred from electronic annotation /// 5524 // ATP
binding // inferred from electronic annotation /// 16301 // kinase
activity // inferred from electronic annotation","MAPK_Cascade //
GenMAPP /// S1P_Signaling // GenMAPP /// TGF_Beta_Signaling_Pathway //
GenMAPP","ec // A2S7_HUMAN // (Q96Q40) Serine/threonine-protein kinase
ALS2CR7 (EC 2.7.1.37) (Amyotrophic lateral sclerosis 2 chromosomal
region candidate gene protein 7) // 1.0E-77 /// ec // A2S7_HUMAN //
(Q96Q40) Serine/threonine-protein kinase ALS2CR7 (EC 2.7.1.37)
(Amyotrophic lateral sclerosis 2 chromosomal region candidate gene
protein 7) // 2.0E-85 /// hanks // 3.1.1 // CMCG Group CMGC I
Cyclin-dependent (CDKs) and close relatives CDC2Hs // 1.0E-85 ///
hanks // 3.1.1 // CMCG Group CMGC I Cyclin-dependent (CDKs) and close
relatives CDC2Hs // 1.0E-79","---","IPR000719 // Protein kinase",
"---","---","This probe set was annotated using the Matching Probes
based pipeline to a Entrez Gene identifier using 3 transcripts. //
false // Matching Probes // A",
"BC000205(15),BX537897(15),NM_002746(16)","NM_002746 // Homo sapiens
mitogen-activated protein kinase 3 (MAPK3), mRNA. // refseq // 16 //
--- /// CR603463 // full-length cDNA clone CS0DN005YA14 of Adult brain
of Homo sapiens (human). // gb // 15 // --- /// ENSESTT00000097559 //
--- // ensembl_est // 15 // --- /// ENST00000263025 // cdnaknown-ccds
chromosomeNCBI35163003292830042042-1 geneENSG00000102882 CCDS10672.1
// ensembl_transcript // 15 // --- /// BC000205 // Homo sapiens, clone
IMAGE:3350666, mRNA, partial cds. // gb // 15 // --- /// BX537897 //
Homo sapiens mRNA cDNA DKFZp686O0215 (from clone DKFZp686O0215). // gb
// 15 // ---","ENSESTT00000097558 // ensembl_est // 4 // Cross Hyb
Matching Probes /// AK096992 // gb // 1 // Cross Hyb Matching Probes"
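
In the sample record, a single quoted field can hold several entries separated by "///", and each entry holds parts separated by "//", with "---" standing in for an empty value. A minimal sketch of splitting such a field is given here; the class name MultiValueField and its method are hypothetical and are not taken from the WorkBench code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the WorkBench code. */
final class MultiValueField {
    private MultiValueField() {}

    /**
     * Splits one annotation field into entries (separated by "///")
     * and each entry into its parts (separated by "//").
     * A field that is null or just "---" yields an empty list.
     */
    static List<List<String>> parse(String field) {
        List<List<String>> entries = new ArrayList<>();
        if (field == null || field.trim().equals("---")) {
            return entries;
        }
        // Split on the three-slash separator first, then on the two-slash one.
        for (String entry : field.split("\\s*///\\s*")) {
            List<String> parts = new ArrayList<>();
            for (String part : entry.split("\\s*//\\s*")) {
                parts.add(part.trim());
            }
            entries.add(parts);
        }
        return entries;
    }
}
```

Applied to the pathway field of the sample record, this would give three entries, each with a pathway name and the source name GenMAPP.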
3
WorkBench Model
  • Automatically identify the chip type from the presence
    of specific markers.
  • Parse and filter the appropriate annotation file to
    produce a smaller version of the annotations, called
    an idx file.
  • Store all annotations in a Map from marker ID to
    annotation line.
  • On later accesses, reuse the idx file and skip the
    filtering in the second step.
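
The filter-and-index steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the WorkBench parser: the class AnnotationIndexSketch, its methods, and the choice of columns to keep are all hypothetical, and unlike the model described above the sketch keeps each record as already-split fields instead of a raw line.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the filter-and-index steps; not the WorkBench code. */
final class AnnotationIndexSketch {
    private AnnotationIndexSketch() {}

    /** Splits one CSV line; commas inside double-quoted fields are kept. */
    static List<String> splitCsv(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    cur.append('"'); // doubled quote inside a quoted field
                    i++;
                } else {
                    inQuotes = !inQuotes;
                }
            } else if (c == ',' && !inQuotes) {
                fields.add(cur.toString());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());
        return fields;
    }

    /** Keeps only the requested columns of each record, keyed by marker ID (column 0). */
    static Map<String, List<String>> buildIndex(List<String> lines, int[] keepColumns) {
        Map<String, List<String>> index = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue;
            }
            List<String> all = splitCsv(line);
            List<String> kept = new ArrayList<>();
            for (int col : keepColumns) {
                // Missing columns get the same "---" placeholder the file uses.
                kept.add(col < all.size() ? all.get(col) : "---");
            }
            index.put(all.get(0), kept);
        }
        return index;
    }
}
```

Writing the kept fields back to disk would give the smaller idx file; the in-memory Map then answers lookups by marker ID.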

4
Issues
  • Many hardcoded values in the parser: chip names,
    annotation names, etc. (100, 147, 393)
  • Hardcoded list of included annotations. (393)
  • The chip type map is fragile: it depends on specific
    markers being present.
  • Annotations are stored in memory in an unparsed
    state, which forces the annotation line to be parsed
    on every element access. (368, 511)
  • All included annotations are stored in memory. (42)
  • Would benefit from a Singleton pattern: file access
    in the static constructor could then be avoided,
    methods would not need to be static, etc.
  • Includes GUI elements, which makes test cases and
    programmatic usage difficult. (108)
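
Two of the issues above (re-parsing on every access, and file access in a static constructor) could be addressed along the following lines. This is only a sketch: AnnotationStore and its methods are hypothetical names, and the lazy holder idiom is one common way to build a singleton in Java, not necessarily what the deck's author had in mind.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical singleton store holding annotations that are already parsed. */
final class AnnotationStore {
    private final Map<String, String[]> byMarker = new HashMap<>();

    // The constructor does no file access; loading is a separate, explicit step.
    private AnnotationStore() {}

    // Lazy holder idiom: the instance is created on the first getInstance() call.
    private static final class Holder {
        static final AnnotationStore INSTANCE = new AnnotationStore();
    }

    static AnnotationStore getInstance() {
        return Holder.INSTANCE;
    }

    /** Stores the fields of one record, split once at load time. */
    synchronized void put(String markerId, String[] fields) {
        byMarker.put(markerId, fields.clone());
    }

    /** Returns one field without re-parsing, or null if marker or column is unknown. */
    synchronized String get(String markerId, int column) {
        String[] fields = byMarker.get(markerId);
        if (fields == null || column < 0 || column >= fields.length) {
            return null;
        }
        return fields[column];
    }
}
```

Because the store has no GUI dependencies and instance methods only, a test case can fill it with a few records and query it directly.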

5
Annotation Sizes
6
Proposed fixes
  • Determine and specify the relationship between
    Microarray data objects and Annotation
    information. What will be the impact if
    annotations are not available?
  • Make user-requested annotation loading a separate
    step.
  • Allow for multiple annotation formats; support
    non-Affymetrix and custom formats.
  • Do not create a custom index file.
  • Allow user-specified filtering of annotations.
  • Explore open source disk-based indexes and
    databases, for example Berkeley DB and HSQLDB.
  • Proper MVC structure: the AnnotationParser class
    is only for loading and parsing data and does not
    cause GUI events. (Although this can be said of
    many Data classes, see
    CSExprMicroarraySet.java:125.)
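
One way to picture the proposals for multiple formats and user-specified filtering is a small format-neutral interface with a filtering wrapper. The names AnnotationSource and FilteredAnnotationSource are hypothetical, and the column names used with them would depend on the annotation file; nothing here is taken from the existing code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical format-neutral contract; Affymetrix, other vendor, or custom parsers could implement it. */
interface AnnotationSource {
    /** Returns column name to value for one marker, or an empty map if the marker is unknown. */
    Map<String, String> annotationsFor(String markerId);
}

/** Wraps any source and exposes only the columns the user asked for. */
final class FilteredAnnotationSource implements AnnotationSource {
    private final AnnotationSource delegate;
    private final Set<String> wantedColumns;

    FilteredAnnotationSource(AnnotationSource delegate, Set<String> wantedColumns) {
        this.delegate = delegate;
        this.wantedColumns = wantedColumns;
    }

    @Override
    public Map<String, String> annotationsFor(String markerId) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : delegate.annotationsFor(markerId).entrySet()) {
            if (wantedColumns.contains(e.getKey())) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }
}
```

Neither type touches the GUI, so a view can ask for annotations when it needs them while the parsing code stays testable on its own. A disk-based index or database could sit behind the same interface.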