National Center for Biomedical Computing at - PowerPoint PPT Presentation


About This Presentation

Title:

National Center for Biomedical Computing at


Slides: 40

Provided by: afl66

Transcript and Presenter's Notes

1
geWorkbench: An Interoperable Platform for
Integrative Genomics

National Center for Biomedical Computing at
Columbia University

2
geWorkbench
MAGNet Software Tools

An interoperable platform for integrative
genomics research

A. Califano, A. Floratos
3
Integrated Genomics Solution
Comparative Genomics
Medical Records
Proteomics
Scientific and Technological Innovation
DNA Sequences
Protein Structure
Literature
Lead Compounds
Gene Expression
4
geWorkbench (genomic Workbench)

Based on caWorkbench, an NCI/caBIG-funded effort
Open source, Java based platform

Integrated Genomics Platform
Support for gene expression data, sequences,
pathways, structure, etc. (40 visualization and
analysis modules).
Access to local and remote data sources and
analytical services.
Support for workflow scripting.
Integration with caGRID.

Development framework
Open source development.
Modular/extensible architecture, supporting
pluggable components with configurable user
interface.
Formal (caBIG-registered) data models for
multitude of bioinformatics concepts.
Easy integration of 3rd party components.

5
History

Evolution
caWorkbench 1.0 (NCICB sponsored), 2003
caWorkbench 2.0 (AMDeC/Columbia sponsored), 2005
caWorkbench 3.0 (caBIG sponsored), 2006
geWorkbench 3.0 (NCBC/caBIG sponsored), 2007
A platform for caGRID enablement
MAGNet: Multi-scale Analysis of Genomic and
Cellular Networks
National Center for Biomedical Computing (NIH
Roadmap)
geWorkbench (genomics Workbench) will be the
software platform for the Center's technology.

6
Component-Based Architecture
Visualization
Data Management
Algorithms
Databases
7
Architecture
8
Available geWorkbench Components

Filters/Normalizers
Threshold filters, detection call filter, Genepix
flags filter.
Centering, quantile normalization, housekeeping
genes normalization
Analyses
T-test, hierarchical clustering, self-organizing
maps.
BLAST, pattern discovery, sequence alignments,
synteny.
Network reconstruction.
3rd Party components
Cytoscape, GoMiner, RProteomics
Workflow support

Data formatters
Gene expression data (affy, genepix, caArray)
Sequence data (fasta)
Genotypic data
Visualizers
Color mosaic.
Scatter plot.
Spreadsheet view.
Expression profile graph.
Gene networks.
Pathways.
Annotations
GO terms.
caBIO gene info.

9
User Interface
10
geWorkbench Resources
http://www.geworkbench.org/
11
Ontology-Anchored Interfaces

Scalable mechanisms for the definition of
community-based standards
Vocabulary
Grammar
Semantics
Ontology-Anchored Interface Design
Available only for Fundamental Biomedical Types
DNA Sequence
Gene
Gene Expression
Extend Existing Ontologies with Two New Axes
(CDE)
Complex Data Structures
Algorithms

BISON: Biomedical Informatics Structured Ontology
Notation
12
Interface Design CDE
Diagram labels, grouped by axis:
Algorithms: Pattern Discovery (1D, 2D, DNA Sequence, Protein Sequence, Gene Expression, SNP), Pattern Matching (1D, 2D), Linkage Analysis, Synteny, Clustering, Classification
Complex Data Structures: Pattern, Pattern Match, Cluster, Adjacency Matrix, Training Set, Phylogenetic Tree
Primary Biological Datatypes (GO): DNA Seq, Protein Seq, Sequence, Expression, Gene, SNP, Tissue, Organ
13
BISON: Biomedical Informatics Structured Ontology
Component A
@Publish public DSDataSet publish(. . .) {
    DSDataSet dataSet;
    // do some work that assigns a value to dataSet.
    return dataSet;
}
Component B
@Subscribe public void receive(DSDataSet dataSet, Object source) {
    // Consume the argument dataSet, as appropriate
}

Provide re-usable models of common bioinformatics
concepts
Data: sequence, expression, genotype, structure,
proteomics
Complex data structures: patterns, clusters,
HMMs, PSSMs, alignments
Algorithms: clustering, matching, discovery,
normalization, filtering
Provide a foundation for the development of
interoperable geWorkbench components
Endorsed by multiple communities (caBIG, AMDeC,
NCBCs)
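
The publish/subscribe wiring between Component A and Component B can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the geWorkbench framework itself: the `@Publish` and `@Subscribe` annotations are redeclared here as stand-ins, `DSDataSet` is reduced to a placeholder class, and `MiniEventBus` is a hypothetical dispatcher invented for the example.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the annotations named on the slide.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Publish {}

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Subscribe {}

// Placeholder data type; the real DSDataSet is not described in the slides.
class DSDataSet {
    final String label;
    DSDataSet(String label) { this.label = label; }
}

class ComponentA {
    @Publish
    public DSDataSet publish(String label) {
        // Do some work that assigns a value to dataSet.
        DSDataSet dataSet = new DSDataSet(label);
        return dataSet;
    }
}

class ComponentB {
    final List<String> received = new ArrayList<>();

    @Subscribe
    public void receive(DSDataSet dataSet, Object source) {
        // Consume the argument dataSet, as appropriate.
        received.add(dataSet.label);
    }
}

// Illustrative dispatcher (an assumption): forwards a published object to every
// registered @Subscribe method whose first parameter type accepts it.
class MiniEventBus {
    private final List<Object> subscribers = new ArrayList<>();

    void register(Object component) { subscribers.add(component); }

    void deliver(Object payload, Object source) {
        for (Object subscriber : subscribers) {
            for (Method m : subscriber.getClass().getMethods()) {
                Class<?>[] params = m.getParameterTypes();
                if (m.isAnnotationPresent(Subscribe.class)
                        && params.length == 2
                        && params[0].isInstance(payload)) {
                    try {
                        m.setAccessible(true); // component classes here are package-private
                        m.invoke(subscriber, payload, source);
                    } catch (ReflectiveOperationException e) {
                        throw new IllegalStateException("delivery failed", e);
                    }
                }
            }
        }
    }
}

class PubSubDemo {
    public static void main(String[] args) {
        ComponentA a = new ComponentA();
        ComponentB b = new ComponentB();
        MiniEventBus bus = new MiniEventBus();
        bus.register(b);
        bus.deliver(a.publish("expression matrix"), a);
        System.out.println(b.received);
    }
}
```

The point of the design is that neither component references the other: Component B only declares which data type it consumes, so any component that publishes that type can feed it.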

14
BISON - BioObjects
15
BISON Example - Patterns
Pattern<T>
  Match<T> match(T)
  Collection<Match<T>> match(Collection<T>)
  String toString()
SequencePattern extends Pattern<Sequence>
  SeqMatch match(Sequence)
  Collection<SeqMatch> match(Collection<Sequence>)
  String toString()
StructurePattern extends Pattern<Structure>
  3DMatch match(Structure)
  Collection<3DMatch> match(Collection<Structure>)
  String toString()
Match<T>
SeqMatch extends Match<Sequence>
  Sequence getObject()
  int getAlignment()
  double getPValue()
3DMatch extends Match<Structure>
  Structure getObject()
  vector getV1()
  vector getV2()
  double getAlpha()
  double getPValue()
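
The Pattern/Match hierarchy on this slide maps fairly directly onto Java generics. The sketch below is one possible rendering, with several assumptions: `Pattern<T>` and `Match<T>` are written as interfaces (so concrete types use `implements` rather than `extends`), a non-matching sequence yields `null`, the pattern is a plain substring motif, `getAlignment()` is taken to be the match offset, and `getPValue()` is left out because the slides do not say how it is computed.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

interface Match<T> {
    T getObject();
}

interface Pattern<T> {
    // Returns null when the object does not match (an assumption of this sketch).
    Match<T> match(T object);

    // Wildcards let subtypes return a more specific collection type.
    Collection<? extends Match<T>> match(Collection<? extends T> objects);
}

// Placeholder sequence type; the BISON Sequence model is richer than a string.
class Sequence {
    final String residues;
    Sequence(String residues) { this.residues = residues; }
}

class SeqMatch implements Match<Sequence> {
    private final Sequence sequence;
    private final int alignment;

    SeqMatch(Sequence sequence, int alignment) {
        this.sequence = sequence;
        this.alignment = alignment;
    }

    @Override
    public Sequence getObject() { return sequence; }

    // Offset of the first occurrence of the motif in the sequence.
    public int getAlignment() { return alignment; }
}

class SequencePattern implements Pattern<Sequence> {
    private final String motif;

    SequencePattern(String motif) { this.motif = motif; }

    @Override
    public SeqMatch match(Sequence s) {
        int offset = s.residues.indexOf(motif);
        return offset < 0 ? null : new SeqMatch(s, offset);
    }

    @Override
    public Collection<SeqMatch> match(Collection<? extends Sequence> sequences) {
        List<SeqMatch> hits = new ArrayList<>();
        for (Sequence s : sequences) {
            SeqMatch m = match(s);
            if (m != null) {
                hits.add(m);
            }
        }
        return hits;
    }

    @Override
    public String toString() { return "SequencePattern(" + motif + ")"; }
}
```

A `StructurePattern` would follow the same shape with `Structure` as the type argument and a match type carrying the geometric fields listed on the slide.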
16
BISON Example - Classification
Classifier<T>
  Classification<T> run(T object)
  Collection<Classification<T>> run(TestSet<T> objects)
  void init(TrainingSet<T> objects)
ProteinSVM extends Classifier<Protein>
  Classification<Protein> run(Protein protein)
  Collection<Classification<Protein>> run(TestSet<Protein> proteins)
  void init(TrainingSet<Protein> proteins)
MicroarrayPD extends Classifier<Microarray>
  Classification<Microarray> run(Microarray ma)
  Collection<Classification<Microarray>> run(TestSet<Microarray> mas)
  void init(TrainingSet<Microarray> mas)
CrossValidation<T>
  void run()
  void init(TrainingSet<T> tSet)
Diagram bindings: Classifier<T> with TrainingSet<T>; Classifier<Protein> with TrainingSet<Protein>; Classifier<Microarray> with TrainingSet<Microarray>
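
To make the `Classifier<T>` contract concrete, here is a small Java sketch with one implementation. It is illustrative only: `Classification`, `TrainingSet`, `TestSet`, and `Microarray` are reduced to minimal placeholder classes, and the nearest-centroid rule is a stand-in chosen for brevity. It is not the method behind `MicroarrayPD` or `ProteinSVM`, which the slides do not describe.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class Classification<T> {
    final T object;
    final String label;
    Classification(T object, String label) { this.object = object; this.label = label; }
}

class TrainingSet<T> {
    final List<T> items = new ArrayList<>();
    final List<String> labels = new ArrayList<>();
    TrainingSet<T> add(T item, String label) { items.add(item); labels.add(label); return this; }
}

class TestSet<T> {
    final List<T> items = new ArrayList<>();
    TestSet<T> add(T item) { items.add(item); return this; }
}

interface Classifier<T> {
    void init(TrainingSet<T> objects);
    Classification<T> run(T object);
    Collection<Classification<T>> run(TestSet<T> objects);
}

// Placeholder microarray: one expression value per probe.
class Microarray {
    final double[] expression;
    Microarray(double... expression) { this.expression = expression; }
}

// Hypothetical classifier: assigns the label of the closest class mean.
class NearestCentroidClassifier implements Classifier<Microarray> {
    private final Map<String, double[]> centroids = new LinkedHashMap<>();

    @Override
    public void init(TrainingSet<Microarray> training) {
        centroids.clear();
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i < training.items.size(); i++) {
            double[] x = training.items.get(i).expression;
            String label = training.labels.get(i);
            double[] sum = centroids.computeIfAbsent(label, k -> new double[x.length]);
            for (int j = 0; j < x.length; j++) {
                sum[j] += x[j];
            }
            counts.merge(label, 1, Integer::sum);
        }
        // Turn per-class sums into per-class means.
        for (Map.Entry<String, double[]> e : centroids.entrySet()) {
            double[] c = e.getValue();
            int n = counts.get(e.getKey());
            for (int j = 0; j < c.length; j++) {
                c[j] /= n;
            }
        }
    }

    @Override
    public Classification<Microarray> run(Microarray ma) {
        String best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> e : centroids.entrySet()) {
            double[] c = e.getValue();
            double d = 0;
            for (int j = 0; j < c.length; j++) {
                double diff = ma.expression[j] - c[j];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = e.getKey();
            }
        }
        return new Classification<>(ma, best);
    }

    @Override
    public Collection<Classification<Microarray>> run(TestSet<Microarray> mas) {
        List<Classification<Microarray>> out = new ArrayList<>();
        for (Microarray ma : mas.items) {
            out.add(run(ma));
        }
        return out;
    }
}
```

Because the interface is generic, a cross-validation helper can be written once against `Classifier<T>` and `TrainingSet<T>` and reused for proteins, microarrays, or any other data type.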
17
geWorkbench

Major effort on making the platform broadly
available to and extensible by the biomedical
research community
Integration of MAGNet Tools
B cell Interaction Knowledge base
ARACNE
REDUCE
MEDUSA
GeneWays
Protein Structure Pipeline
JMol (Open source Molecular Viewer)

18
Integration of 3rd party components
Cytoscape
GenePattern
MatrixREDUCE
GoMiner
19
Core IV Infrastructure (cont.)
MAGNet/C2B2
20
matrixREDUCE

Statistical mechanical modeling of genome-wide
transcription factor occupancy data by
MatrixREDUCE

Bussemaker Lab
21
matrixREDUCE
22
ARACNE: Accurate Reconstruction of Cellular
Networks

Reverse Engineering Transcriptional Interactions

Califano and Dalla Favera Labs
23
ARACNE Recap
Step 1: Compute Mutual Information (MI) to assess
statistical independence of each gene (mRNA) pair
Step 2 (DPI): Data Processing Inequality used to
test each gene (mRNA) triplet
(Diagram: gene triplet x, y, z)
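
The two steps can be sketched in Java as follows. This is a simplified illustration, not the ARACNE implementation: it assumes expression values have already been discretized into a small number of levels and uses a plug-in histogram estimate of MI, and the DPI step uses no tolerance, simply dropping the strictly weakest edge of every fully connected triplet. The class name, method names, and the `threshold` parameter are hypothetical.

```java
class AracneSketch {

    /** Step 1: plug-in mutual information (in nats) between two discretized profiles. */
    static double mutualInformation(int[] x, int[] y, int levels) {
        int n = x.length;
        double[][] joint = new double[levels][levels];
        double[] px = new double[levels];
        double[] py = new double[levels];
        for (int i = 0; i < n; i++) {
            joint[x[i]][y[i]] += 1.0 / n;
            px[x[i]] += 1.0 / n;
            py[y[i]] += 1.0 / n;
        }
        double mi = 0;
        for (int a = 0; a < levels; a++) {
            for (int b = 0; b < levels; b++) {
                if (joint[a][b] > 0) {
                    mi += joint[a][b] * Math.log(joint[a][b] / (px[a] * py[b]));
                }
            }
        }
        return mi;
    }

    /**
     * Step 2: data processing inequality. Edges start as all gene pairs with
     * MI above the threshold; in every triplet whose three edges are all
     * candidates, the strictly weakest edge is treated as indirect and removed.
     * The MI matrix is assumed to be symmetric.
     */
    static boolean[][] applyDpi(double[][] mi, double threshold) {
        int g = mi.length;
        boolean[][] candidate = new boolean[g][g];
        boolean[][] keep = new boolean[g][g];
        for (int i = 0; i < g; i++) {
            for (int j = 0; j < g; j++) {
                candidate[i][j] = keep[i][j] = (i != j && mi[i][j] > threshold);
            }
        }
        for (int i = 0; i < g; i++) {
            for (int j = i + 1; j < g; j++) {
                for (int k = j + 1; k < g; k++) {
                    if (!(candidate[i][j] && candidate[j][k] && candidate[i][k])) {
                        continue;
                    }
                    double ij = mi[i][j], jk = mi[j][k], ik = mi[i][k];
                    if (ij < jk && ij < ik) {
                        drop(keep, i, j);
                    } else if (jk < ij && jk < ik) {
                        drop(keep, j, k);
                    } else if (ik < ij && ik < jk) {
                        drop(keep, i, k);
                    }
                }
            }
        }
        return keep;
    }

    private static void drop(boolean[][] keep, int a, int b) {
        keep[a][b] = false;
        keep[b][a] = false;
    }
}
```

Triplets are always tested against the original candidate edges, so the result does not depend on the order in which triplets are visited.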
24
The germinal center
mutated Ig V region
unmutated
B-Cell Subpopulations
Germinal Center (GC)
Post-GC
Plasma Cell
Pre-GC
Naïve B
Centroblast
Centrocyte
Memory B
Follicular Lymphoma
B-CLL
Burkitt Lymphoma
Diffuse Large Cell Lymphoma
Mantle Cell Lymphoma
(CD5)
Hodgkin Disease
Multiple Myeloma
(CD5)
B-Cell Derived Malignancies
25
C-MYC Sub-network

Graphical representation
all 56 1st neighbors
the top ranking 444 2nd neighbors

26
ChIP Validation
Expected background: 10% (1,300/12,600)
27
ARACNE Problems

Single run is very sensitive to dataset selection
Bootstrapping: selecting N (equal-sized)
subsamples of the expression profiles with
replacement.
Statistical significance based on consensus
across all runs
Non-transcriptional interactions should not be
allowed to remove transcriptional interactions
via the DPI.
Identify transcription factors (TFs)
Prevent removal of TF interactions by non-TF
interactions
Network is an average of normal, tumor-related and
experimentally manipulated cell phenotypes.
Generate network for a homogeneous phenotype using
experimental perturbations
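
The bootstrapping idea can be sketched in Java as below. This is a rough illustration under stated assumptions: the network inference step is passed in as a function (for example, a single run of mutual-information scoring followed by the DPI), and consensus is reported as the plain fraction of runs in which each edge appears, which is simpler than a formal significance test. The class and method names are hypothetical.

```java
import java.util.Random;
import java.util.function.Function;

class BootstrapConsensus {

    /** Draws sample indices with replacement; the result has as many entries as there are samples. */
    static int[] resample(int nSamples, Random rng) {
        int[] idx = new int[nSamples];
        for (int i = 0; i < nSamples; i++) {
            idx[i] = rng.nextInt(nSamples);
        }
        return idx;
    }

    /**
     * data[gene][sample] holds the expression profiles. For each run, the
     * samples are resampled with replacement, the supplied inference function
     * builds a network from the resampled matrix, and edge occurrences are
     * counted. Returns, for every gene pair, the fraction of runs with an edge.
     */
    static double[][] edgeFrequencies(double[][] data, int runs, long seed,
                                      Function<double[][], boolean[][]> infer) {
        Random rng = new Random(seed);
        int genes = data.length;
        int samples = data[0].length;
        int[][] counts = new int[genes][genes];
        for (int r = 0; r < runs; r++) {
            int[] idx = resample(samples, rng);
            double[][] boot = new double[genes][samples];
            for (int g = 0; g < genes; g++) {
                for (int s = 0; s < samples; s++) {
                    boot[g][s] = data[g][idx[s]];
                }
            }
            boolean[][] net = infer.apply(boot);
            for (int i = 0; i < genes; i++) {
                for (int j = 0; j < genes; j++) {
                    if (net[i][j]) {
                        counts[i][j]++;
                    }
                }
            }
        }
        double[][] freq = new double[genes][genes];
        for (int i = 0; i < genes; i++) {
            for (int j = 0; j < genes; j++) {
                freq[i][j] = counts[i][j] / (double) runs;
            }
        }
        return freq;
    }
}
```

An edge that appears in nearly every run is robust to the choice of samples, while an edge that appears in only a few runs is likely an artifact of one particular dataset selection.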

28
Performance improvements

Systematic (100-fold) improvement for
Bootstrapping
Systematic (100-fold) improvement when TFs are
used
Homogeneous network is possible with as few as
100 microarrays (MA)

29
The B-Cell Knowledge Base

Everything you always wanted to know about B
Cells but were afraid to ask

H. Bussemaker, A. Califano, R. Dalla Favera, C.
Leslie, A. Rzhetsky, C. Wiggins
30
Knowledge Base for Human B Lymphocytes

Integrative
Bayesian evidence integration of pairwise
interactions
Protein-Protein, Protein-DNA
Context Specific
ARACNE, GeneWays, REDUCE
B-cell data or B-cell-specific criteria
Linked to one of the largest B-cell expression
profile microarray datasets, ChIP-chip assays
(MYC/BCL6), miRNA profiles, and literature
Captures multivariate dependencies
Three-way interactions via MINDY and MatrixREDUCE
Post-translational modulation of transcriptional
regulation
Combinatorial transcriptional regulation
Signal transduction control of transcriptional
regulation, i.e., the Transferome meets the
Transcriptome
Links to literature, via GeneWays

31
Integrating protein-DNA and protein-protein
Interactions via Naïve Bayes Classification

Protein-Protein Interactions (PPIs)
Human PPI databases
Human Protein Reference Database (HPRD)
Biomolecular Interaction Network Database (BIND)
Database of Interacting Proteins (DIP) and IntAct
Y2H studies (2 in human)
Eukaryotic PPIs via orthologous genes
(Inparanoid)
MIPS, BIND, IntAct.
GeneWays Predictions (context-specific literature
analysis)
Co-expression analysis (Mutual Information)
Gene Ontology classification (biological
process/compartment)
Protein-DNA Interactions (PDIs)
Human PDI databases
TRANSFAC, BIND, MycDB
Mouse PDI databases (TRANSFAC, BIND) via
orthologous genes (Inparanoid)
ARACNE (bootstrap-TF)
GeneWays predictions (context-specific literature
analysis)

49,719 interactions (4,944 genes)
27,705 PPIs (4,209 genes)
22,014 PDIs (3,216 genes/457 TFs)
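
Naïve Bayes evidence integration can be illustrated with a short Java sketch. It assumes each evidence source (a database hit, an ARACNE edge, a GeneWays prediction, and so on) is summarized by a likelihood ratio estimated against gold-standard positive and negative sets, and that sources are treated as conditionally independent, so their ratios multiply. The class name, method names, and all numbers in the usage note are hypothetical.

```java
class NaiveBayesIntegration {

    /**
     * Likelihood ratio of one evidence source: how much more often the
     * evidence is seen for gold-standard positives than for negatives.
     */
    static double likelihoodRatio(int positivesWithEvidence, int positives,
                                  int negativesWithEvidence, int negatives) {
        double pGivenPos = (double) positivesWithEvidence / positives;
        double pGivenNeg = (double) negativesWithEvidence / negatives;
        return pGivenPos / pGivenNeg;
    }

    /**
     * Posterior probability of an interaction: prior odds multiplied by the
     * likelihood ratio of every evidence source observed for the pair.
     * Computed in log space to stay stable when many sources are combined.
     */
    static double posterior(double prior, double... likelihoodRatios) {
        double logOdds = Math.log(prior / (1 - prior));
        for (double lr : likelihoodRatios) {
            logOdds += Math.log(lr);
        }
        double odds = Math.exp(logOdds);
        return odds / (1 + odds);
    }
}
```

For example, with a made-up prior of 0.01 and two supporting sources with likelihood ratios 10 and 20, `posterior(0.01, 10, 20)` gives odds of 200 × 0.01 / 0.99, which corresponds to a posterior of about 0.67.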

32
Ribosomal Protein Control
BTF3
From ARACNE
MI(SHOX2-MYC) = 0.073, MI(MYC-RP) = 0,
MI(SHOX2-RP) > 0.3
From Literature
MYC
From ARACNE
SHOX2
33
GeneWays

Networks from literature

Rzhetsky lab
34
Literature Data Mining Tool

Natural Language Parsing Analysis of Literature
Data
Full text analysis (not abstract)
Ontology-based classification of (inter)-actions
Links to literature
Sentences
Full Text

35
Functional Annotation Pipeline

Everything you always wanted to know about a PDB
Structure but were afraid to ask

Honig and Rost labs
36
Background

Pipeline for automated gathering of functional
annotations of protein structures
Main targets: NESG protein structures
Structure analysis
Structure comparison (Skan, CE, Dali)
Binding site identification (SCREEN)
Binding site comparison (Procat, cmplig)
Sequence analysis
Sequence neighbors (PSI-BLAST)
Conservation analysis (CONSURF)
Domain analysis (InterProScan)

37
Data flow
PDB or NESG repository
PDB file
Sequence
DALI, CE, Skan
SCREEN
Procat
CONSURF
Procat results
Skan, DALI, CE hits
Cavities
InterProScan
PSI-BLAST (local nr or Uniprot)
PDB file conservation
DaliLite, CE
ClustalW
Identified sequence domains
BLAST report
Structure alignment
Parsing and internal Data representation
MySQL database
38
Web Interface
http://luna.bioc.columbia.edu/mfischer/cgi-bin/pipe.pl (/nesg.pl)
39
Pipeline test system
Sporulation response regulator Spo0F from Bacillus subtilis
PDB 15f1, resolution 3 Å, SCOP C.23.1.1
Complex with phosphotransferase Spo0B (dimer)
4 Spo0F monomers, 2 Spo0B dimers
