Title: National Center for Biomedical Computing at
1geWorkbench: An Interoperable Platform for
Integrative Genomics
- National Center for Biomedical Computing at
- Columbia University
2geWorkbench
MAGNet Software Tools
- An interoperable platform for integrative
genomics research
A. Califano, A. Floratos
3Integrated Genomics Solution
Comparative Genomics
Medical Records
Proteomics
Scientific and Technological Innovation
DNA Sequences
Protein Structure
Literature
Lead Compounds
Gene Expression
4geWorkbench (genomic Workbench)
- Based on caWorkbench, an NCI/caBIG-funded effort
- Open source, Java based platform
- Integrated Genomics Platform
- Support for gene expression data, sequences, pathways, structure, etc. (40 visualization and analysis modules).
- Access to local and remote data sources and analytical services.
- Support for workflow scripting.
- Integration with caGRID.
- Development framework
- Open source development.
- Modular/extensible architecture, supporting pluggable components with configurable user interface.
- Formal (caBIG-registered) data models for a multitude of bioinformatics concepts.
- Easy integration of 3rd party components.
5History
- Evolution
- caWorkbench 1.0 (NCICB sponsored), 2003
- caWorkbench 2.0 (AMDeC/Columbia sponsored), 2005
- caWorkbench 3.0 (caBIG sponsored), 2006
- geWorkbench 3.0 (NCBC/caBIG sponsored), 2007
- A platform for caGRID enablement
- MAGNet: Multi-scale Analysis of Genomic and Cellular Networks
- National Center for Biomedical Computing (NIH Roadmap)
- geWorkbench (genomics Workbench) will be the software platform for the Center's technology.
6Component-Based Architecture
Visualization
Data Management
Algorithms
Databases
7Architecture
8Available geWorkbench Components
- Filters/Normalizers
- Threshold filters, detection call filter, Genepix flags filter.
- Centering, quantile normalization, housekeeping genes normalization
- Analyses
- T-test, hierarchical clustering, self-organizing maps.
- BLAST, pattern discovery, sequence alignments, synteny
- Network reconstruction.
- 3rd Party components
- Cytoscape, GoMiner, RProteomics
- Workflow support
- Data formatters
- Gene expression data (affy, genepix, caArray)
- Sequence data (fasta)
- Genotypic data
- Visualizers
- Color mosaic.
- Scatter plot.
- Spreadsheet view.
- Expression profile graph.
- Gene networks.
- Pathways.
- Annotations
- Go Terms.
- caBIO gene info.
9User Interface
10geWorkbench Resources
http://www.geworkbench.org/
11Ontology-Anchored Interfaces
- Scalable mechanisms for the definition of community based standards
- Vocabulary
- Grammar
- Semantics
- Ontology Anchored Interface Design
- Available only for Fundamental Biomedical Types
- DNA Sequence
- Gene
- Gene Expression
-
- Extend Existing Ontologies with Two New Axes (CDE)
- Complex Data Structures
- Algorithms
BISON: Biomedical Informatics Structured Ontology Notation
12Interface Design CDE
[Figure: ontology diagram with three axes. Node labels recovered from the slide:]
Algorithms: Pattern Discovery (1D, 2D, DNA Sequence, Protein Sequence, Gene Expression, SNP), Pattern Matching (1D, 2D), Linkage Analysis, Synteny, Classification, Clustering
Complex Data Structures: Adjacency Matrix, Pattern, Pattern Match, Cluster, Training Set, Phylogenetic Tree
Primary Biological Datatypes (GO): Sequence, DNA Seq, Protein Seq, Expression, Gene, SNP, Tissue, Organ
13BISON Biomedical Informatics Structured Ontology
Component A
    @Publish public DSDataSet publish(. . .) {
        DSDataSet dataSet;
        // do some work that assigns a value to dataSet.
        return dataSet;
    }
Component B
    @Subscribe public void receive(DSDataSet dataSet, Object source) {
        // Consume the argument dataSet, as appropriate
    }
- Provide re-usable models of common bioinformatics concepts
- Data: sequence, expression, genotype, structure, proteomics
- Complex data structures: patterns, clusters, HMMs, PSSMs, alignments
- Algorithms: clustering, matching, discovery, normalization, filtering
- Provide a foundation for the development of interoperable geWorkbench components
- Endorsed by multiple communities (caBIG, AMDeC, NCBCs)
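The publish/receive pair on this slide can be read as annotation-driven messaging between components. The following is a minimal sketch of that idea in plain Java, not geWorkbench's actual framework: the `Publish` and `Subscribe` annotation definitions, the `DSDataSet` stand-in, and the `MiniBus` dispatcher are all hypothetical names introduced here for illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins: the real geWorkbench annotations and DSDataSet are not reproduced here.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Publish {}

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Subscribe {}

class DSDataSet {
    final String label;
    DSDataSet(String label) { this.label = label; }
}

class ComponentA {
    @Publish
    public DSDataSet publish() {
        // A real component would do some work here to build the data set.
        return new DSDataSet("expression-matrix");
    }
}

class ComponentB {
    final List<String> received = new ArrayList<>();

    @Subscribe
    public void receive(DSDataSet dataSet, Object source) {
        received.add(dataSet.label + " from " + source.getClass().getSimpleName());
    }
}

// Hypothetical dispatcher: forwards the result of each @Publish method
// to every compatible @Subscribe method on the other registered components.
class MiniBus {
    private final List<Object> components = new ArrayList<>();

    void register(Object component) { components.add(component); }

    void fire(Object publisher) {
        try {
            for (Method pub : publisher.getClass().getDeclaredMethods()) {
                if (!pub.isAnnotationPresent(Publish.class) || pub.getParameterCount() != 0) continue;
                Object payload = pub.invoke(publisher);
                for (Object c : components) {
                    if (c == publisher) continue;
                    for (Method sub : c.getClass().getDeclaredMethods()) {
                        Class<?>[] p = sub.getParameterTypes();
                        if (sub.isAnnotationPresent(Subscribe.class) && p.length == 2
                                && p[0].isInstance(payload) && p[1].isInstance(publisher)) {
                            sub.invoke(c, payload, publisher);
                        }
                    }
                }
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Registering a `ComponentA` and a `ComponentB` on a `MiniBus` and calling `fire` on the `ComponentA` instance delivers the data set to `ComponentB.receive` without either component referencing the other, which is the decoupling the slide is pointing at.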
14BISON - BioObjects
15BISON Example - Patterns
Pattern<T>
    Match<T> match(T)
    Collection(Match<T>) match(Collection(T))
    String toString()

SequencePattern extends Pattern<Sequence>
    SeqMatch match(Sequence)
    Collection(SeqMatch) match(Collection(Sequence))
    String toString()

StructurePattern extends Pattern<Structure>
    3DMatch match(Structure)
    Collection(3DMatch) match(Collection(Structure))
    String toString()

Match<T>

SeqMatch extends Match<Sequence>
    Sequence getObject()
    int getAlignment()
    double getPValue()

3DMatch extends Match<Structure>
    Structure getObject()
    vector getV1()
    vector getV2()
    double getAlpha()
    double getPValue()
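The Pattern/Match hierarchy on this slide maps naturally onto Java generics. The sketch below is one possible rendering, under stated assumptions: it uses interfaces rather than base classes, returns `null` when nothing matches, treats a sequence pattern as a literal motif located with `String.indexOf`, and leaves out `getPValue()` because the slide does not say how p-values are computed. The `Sequence` wrapper and the motif semantics are illustrative, not BISON's definitions.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

interface Match<T> {
    T getObject();
}

interface Pattern<T> {
    // Returns null when the object does not match (an assumption of this sketch).
    Match<T> match(T object);

    // Matches every object in the collection and keeps only the hits.
    default Collection<Match<T>> match(Collection<T> objects) {
        List<Match<T>> hits = new ArrayList<>();
        for (T o : objects) {
            Match<T> m = match(o);
            if (m != null) hits.add(m);
        }
        return hits;
    }
}

// Hypothetical minimal sequence type.
final class Sequence {
    final String residues;
    Sequence(String residues) { this.residues = residues; }
}

final class SeqMatch implements Match<Sequence> {
    private final Sequence sequence;
    private final int alignment;

    SeqMatch(Sequence sequence, int alignment) {
        this.sequence = sequence;
        this.alignment = alignment;
    }

    @Override public Sequence getObject() { return sequence; }

    // Zero-based offset of the motif within the sequence.
    int getAlignment() { return alignment; }
}

final class SequencePattern implements Pattern<Sequence> {
    private final String motif;

    SequencePattern(String motif) { this.motif = motif; }

    // Covariant return type: callers holding a SequencePattern get a SeqMatch directly.
    @Override public SeqMatch match(Sequence s) {
        int offset = s.residues.indexOf(motif);
        return offset < 0 ? null : new SeqMatch(s, offset);
    }

    @Override public String toString() { return "SequencePattern(" + motif + ")"; }
}
```

A `StructurePattern` would follow the same shape with `Structure` in place of `Sequence`; the type parameter is what lets generic components (viewers, filters) handle any pattern without knowing the underlying data type.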
16BISON Example Classification
Classifier<T>
    Classification<T> run(T object)
    Collection<Classification<T>> run(TestSet<T> objects)
    void init(TrainingSet<T> objects)

ProteinSVM extends Classifier<Protein>
    Classification<Protein> run(Protein protein)
    Collection<Classification<Protein>> run(TestSet<Protein> proteins)
    void init(TrainingSet<Protein> proteins)

MicroarrayPD extends Classifier<Microarray>
    Classification<Microarray> run(Microarray ma)
    Collection<Classification<Microarray>> run(TestSet<Microarray> mas)
    void init(TrainingSet<Microarray> mas)

CrossValidation<T>
    void run()
    void init(TrainingSet<T> tSet)
    uses Classifier<T> and TrainingSet<T>
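Because cross-validation is written against the generic classifier interface, it can be reused for any data type. A minimal sketch of that design follows, with several simplifications that are assumptions of this example rather than BISON definitions: `Classification<T>` is reduced to a `String` label, `TrainingSet<T>` to a list of `Labeled<T>` items, `run()` returns the accuracy instead of `void`, and the toy `NearestNeighbor` classifier stands in for components such as ProteinSVM.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical labeled example; replaces the slide's TrainingSet<T> element type.
final class Labeled<T> {
    final T item;
    final String label;
    Labeled(T item, String label) { this.item = item; this.label = label; }
}

interface Classifier<T> {
    void init(List<Labeled<T>> trainingSet);
    String run(T object);
}

// Toy classifier on single numeric values: returns the label of the closest training example.
final class NearestNeighbor implements Classifier<Double> {
    private List<Labeled<Double>> training = new ArrayList<>();

    @Override public void init(List<Labeled<Double>> trainingSet) {
        training = new ArrayList<>(trainingSet);
    }

    @Override public String run(Double x) {
        Labeled<Double> best = null;
        for (Labeled<Double> ex : training) {
            if (best == null || Math.abs(ex.item - x) < Math.abs(best.item - x)) best = ex;
        }
        return best == null ? null : best.label;
    }
}

// Leave-one-out cross-validation over any Classifier<T>.
final class CrossValidation<T> {
    private final Classifier<T> classifier;
    private List<Labeled<T>> data = new ArrayList<>();

    CrossValidation(Classifier<T> classifier) { this.classifier = classifier; }

    void init(List<Labeled<T>> tSet) { data = new ArrayList<>(tSet); }

    // Returns the fraction of held-out examples that were classified correctly.
    double run() {
        if (data.isEmpty()) return 0.0;
        int correct = 0;
        for (int i = 0; i < data.size(); i++) {
            List<Labeled<T>> train = new ArrayList<>(data);
            Labeled<T> heldOut = train.remove(i);
            classifier.init(train);
            if (heldOut.label.equals(classifier.run(heldOut.item))) correct++;
        }
        return (double) correct / data.size();
    }
}
```

Swapping in a different `Classifier<T>` implementation requires no change to `CrossValidation<T>`, which is the reuse the class diagram is meant to convey.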
17geWorkbench
- Major effort on making the platform broadly available to and extensible by the biomedical research community
- Integration of MAGNet Tools
- B cell Interaction Knowledge base
- ARACNE
- REDUCE
- MEDUSA
- GeneWays
- Protein Structure Pipeline
- JMol (Open source Molecular Viewer)
18Integration of 3rd party components
Cytoscape
GenePattern
MatrixREDUCE
GoMiner
19Core IV Infrastructure (cont.)
MAGNet/C2B2
20matrixREDUCE
- Statistical mechanical modeling of genome-wide
transcription factor occupancy data by
MatrixREDUCE
Bussemaker Lab
21matrixREDUCE
22ARACNE Accurate Reconstruction of Cellular
Networks
- Reverse Engineering Transcriptional Interactions
Califano and Dalla Favera Labs
23ARACNE Recap
Step 1: Compute Mutual Information to assess statistical independence of each gene (mRNA) pair
Step 2 (DPI): Data Processing Inequality used to test each gene (mRNA) triplet
z
x
y
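The two steps can be sketched as follows. This is an illustration under stated assumptions, not the ARACNE implementation: the slides do not say how mutual information is estimated, so the sketch uses a simple equal-frequency binning estimator, and the class name, method names, and the `threshold` parameter are hypothetical. The DPI step follows the rule that, in any triplet whose three pairwise MI values all pass the threshold, the weakest edge is treated as indirect and removed.

```java
import java.util.Arrays;

final class AracneSketch {

    // Assigns each value to one of `bins` equal-frequency bins according to its rank.
    static int[] rankBins(double[] v, int bins) {
        Integer[] order = new Integer[v.length];
        for (int i = 0; i < v.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(v[a], v[b]));
        int[] bin = new int[v.length];
        for (int r = 0; r < order.length; r++) bin[order[r]] = r * bins / order.length;
        return bin;
    }

    // Step 1: mutual information (in nats) between two expression profiles of equal length.
    static double mutualInformation(double[] x, double[] y, int bins) {
        int n = x.length;
        int[] bx = rankBins(x, bins);
        int[] by = rankBins(y, bins);
        double[][] joint = new double[bins][bins];
        double[] px = new double[bins];
        double[] py = new double[bins];
        for (int i = 0; i < n; i++) {
            joint[bx[i]][by[i]] += 1.0 / n;
            px[bx[i]] += 1.0 / n;
            py[by[i]] += 1.0 / n;
        }
        double mi = 0.0;
        for (int a = 0; a < bins; a++) {
            for (int b = 0; b < bins; b++) {
                if (joint[a][b] > 0) mi += joint[a][b] * Math.log(joint[a][b] / (px[a] * py[b]));
            }
        }
        return mi;
    }

    // Step 2: DPI on a symmetric MI matrix. Returns the adjacency matrix of surviving edges.
    static boolean[][] applyDpi(double[][] mi, double threshold) {
        int g = mi.length;
        boolean[][] candidate = new boolean[g][g];
        boolean[][] kept = new boolean[g][g];
        for (int i = 0; i < g; i++) {
            for (int j = 0; j < g; j++) {
                candidate[i][j] = kept[i][j] = (i != j) && mi[i][j] > threshold;
            }
        }
        // Triplets are tested against the candidate graph, so the result does not depend on order.
        for (int i = 0; i < g; i++) {
            for (int j = i + 1; j < g; j++) {
                for (int k = j + 1; k < g; k++) {
                    if (!(candidate[i][j] && candidate[j][k] && candidate[i][k])) continue;
                    double ij = mi[i][j], jk = mi[j][k], ik = mi[i][k];
                    if (ij < jk && ij < ik) kept[i][j] = kept[j][i] = false;
                    else if (jk < ij && jk < ik) kept[j][k] = kept[k][j] = false;
                    else if (ik < ij && ik < jk) kept[i][k] = kept[k][i] = false;
                }
            }
        }
        return kept;
    }
}
```

For a triplet where x and y interact only through z, MI(x, y) is the smallest of the three values, so the x-y edge is dropped while x-z and z-y survive.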
24The germinal center
mutated Ig V region
unmutated
B-Cell Subpopulations
Germinal Center (GC)
Post-GC
Plasma Cell
Pre-GC
Naïve B
Centroblast
Centrocyte
Memory B
Follicular Lymphoma
B-CLL
Burkitt Lymphoma
Diffuse Large Cell Lymphoma
Mantle Cell Lymphoma
(CD5)
Hodgkin Disease
Multiple Myeloma
(CD5)
B-Cell Derived Malignancies
25C-MYC Sub-network
- Graphical representation
- all 56 1st neighbors
- the top ranking 444 2nd neighbors
26ChIP Validation
Expected background: 10% (1,300/12,600)
27ARACNE Problems
- Single run is very sensitive to dataset selection
- Bootstrapping: selecting N (equal-sized) subsamples of the expression profiles with replacement.
- Statistical significance based on consensus across all runs
- Non-transcriptional interactions should not be allowed to remove transcriptional interactions via the DPI.
- Identify Transcription Factors
- Prevent removal of TF interactions by non-TF interactions
- Network is an average of normal, tumor-related and experimentally manipulated cell phenotypes.
- Generate network for homogeneous phenotype using experimental perturbations
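The bootstrapping idea above can be sketched as a resampling loop that counts how often each edge is recovered. This is a minimal illustration with hypothetical names (`BootstrapConsensus`, `edgeSupport`, the `infer` callback, string edge labels); the statistical test that turns support counts into significance is not described on the slide and is not reproduced here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.function.Function;

final class BootstrapConsensus {

    // Draws `sampleCount` profile indices with replacement, so each subsample has the original size.
    static int[] resample(int sampleCount, Random rng) {
        int[] idx = new int[sampleCount];
        for (int i = 0; i < sampleCount; i++) idx[i] = rng.nextInt(sampleCount);
        return idx;
    }

    // Runs `infer` on each bootstrap subsample and counts how many runs report each edge.
    // `infer` receives the selected profile indices and returns the edges it found.
    static Map<String, Integer> edgeSupport(int sampleCount, int runs, long seed,
                                            Function<int[], Set<String>> infer) {
        Random rng = new Random(seed);
        Map<String, Integer> support = new HashMap<>();
        for (int r = 0; r < runs; r++) {
            for (String edge : infer.apply(resample(sampleCount, rng))) {
                support.merge(edge, 1, Integer::sum);
            }
        }
        return support;
    }
}
```

Edges that appear in most runs are robust to dataset selection, while edges that appear in only a few runs are likely artifacts of a particular subsample.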
28Performance improvements
- Systematic (100-fold) improvement for bootstrapping
- Systematic (100-fold) improvement when TFs are used
- Homogeneous network is possible with as few as 100 microarrays (MA)
29The B-Cell Knowledge Base
- Everything you always wanted to know about B
Cells but were afraid to ask
H. Bussemaker, A. Califano, R. Dalla Favera, C. Leslie, A. Rzhetsky, C. Wiggins
30Knowledge Base for Human B Lymphocytes
- Integrative
- Bayesian evidence integration of pairwise interactions
- Protein-Protein, Protein-DNA
- Context Specific
- ARACNE, GeneWays, REDUCE
- B-Cell data or B-cell specific criteria
- Linked to one of the largest B-Cell expression profile microarray datasets, ChIP-Chip assays (MYC/BCL6), miRNA profiles, and literature
- Captures multi-variate dependencies
- Three-way interactions via MINDY and MatrixREDUCE
- Post-translational modulation of transcriptional regulation
- Combinatorial transcriptional regulation
- Signal transduction control of transcriptional regulation, i.e. the Transferome meets the Transcriptome
- Links to literature, via GeneWays
31Integrating protein-DNA and protein-protein
Interactions via Naïve Bayes Classification
- Protein-Protein Interactions (PPIs)
- Human PPI databases
- Human Protein Reference Database (HPRD)
- Biomolecular Interaction Network Database (BIND)
- Database of Interacting Proteins (DIP) and IntAct
- Y2H Studies (2 in human)
- Eukaryotic PPI via orthologous genes (Inparanoid)
- MIPS, BIND, IntAct.
- GeneWays predictions (context-specific literature analysis)
- Co-expression analysis (Mutual Information)
- Gene Ontology classification (biological process/compartment)
- Protein-DNA Interactions (PDIs)
- Human PDI databases
- TRANSFAC, BIND, MycDB
- Mouse PDI databases (TRANSFAC, BIND) via orthologous genes (Inparanoid)
- ARACNE (bootstrap-TF)
- GeneWays predictions (context-specific literature analysis)
- 49,719 interactions (4,944 genes)
- 27,705 PPIs (4,209 genes)
- 22,014 PDIs (3,216 genes/457 TFs)
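Naive Bayes integration treats each evidence source as conditionally independent, so the posterior odds of an interaction are the prior odds multiplied by one likelihood ratio per source. The sketch below shows only that arithmetic; the class name, the prior, and the likelihood ratios in the usage note are hypothetical and are not values from the knowledge base.

```java
final class NaiveBayesEvidence {

    // prior: prior probability that a gene pair interacts (0 < prior < 1).
    // likelihoodRatios: for each evidence source, P(evidence | interaction) / P(evidence | no interaction).
    // Returns the posterior probability of interaction under the naive independence assumption.
    static double posterior(double prior, double[] likelihoodRatios) {
        double logOdds = Math.log(prior / (1.0 - prior));
        for (double lr : likelihoodRatios) {
            logOdds += Math.log(lr);  // summing logs avoids overflow when many sources are combined
        }
        return 1.0 / (1.0 + Math.exp(-logOdds));
    }
}
```

For example, with an even prior of 0.5 and two supporting sources with likelihood ratios 2 and 3, the posterior odds are 6 to 1, that is, a probability of 6/7. A source with a ratio below 1 lowers the posterior instead.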
32Ribosomal Protein Control
BTF3
From ARACNE
MI(SHOX2-MYC) = 0.073, MI(MYC-RP) = 0, MI(SHOX2-RP) > 0.3
From Literature
MYC
From ARACNE
SHOX2
33GeneWays
Rzhetsky lab
34Literature Data Mining Tool
- Natural Language Parsing Analysis of Literature Data
- Full text analysis (not abstract)
- Ontology-based classification of (inter)-actions
- Links to literature
- Sentences
- Full Text
35Functional Annotation Pipeline
- Everything you always wanted to know about a PDB
Structure but were afraid to ask
Honig and Rost labs
36Background
- Pipeline for automated gathering of functional annotations of protein structures
- Main targets: NESG protein structures
- Structure analysis
- Structure comparison (Skan, CE, Dali)
- Binding site identification (SCREEN)
- Binding site comparison (Procat, cmplig)
- Sequence analysis
- Sequence neighbors (PSI-BLAST)
- Conservation analysis (CONSURF)
- Domain analysis (InterProScan)
37Data flow
PDB or NESG repository
PDB file
Sequence
DALI, CE, Skan
SCREEN
Procat
CONSURF
Procat results
Skan, DALI, CE hits
Cavities
InterProScan
PSI-BLAST (local nr or Uniprot)
PDB file conservation
DaliLite, CE
ClustalW
Identified sequence domains
BLAST report
Structure alignment
Parsing and internal Data representation
MySQL database
38Web Interface
http://luna.bioc.columbia.edu/mfischer/cgi-bin/pipe.pl (/nesg.pl)
39Pipeline test system
Sporulation response regulator Spo0F from Bacillus subtilis
PDB 1F51, resolution 3 Å, SCOP C.23.1.1
Complex with phosphotransferase Spo0B (dimer): 4 Spo0F monomers, 2 Spo0B dimers