National Center for Biomedical Computing at - PowerPoint PPT Presentation

Slides: 40
Provided by: afl66
1
geWorkbench: An Interoperable Platform for
Integrative Genomics
  • National Center for Biomedical Computing at
  • Columbia University

2
geWorkbench
MAGNet Software Tools
  • An interoperable platform for integrative
    genomics research

A. Califano, A. Floratos
3
Integrated Genomics Solution
Comparative Genomics
Medical Records
Proteomics
Scientific and Technological Innovation
DNA Sequences
Protein Structure
Literature
Lead Compounds
Gene Expression
4
geWorkbench (genomic Workbench)
  • Based on caWorkbench, an NCI/caBIG-funded effort
  • Open-source, Java-based platform
  • Integrated genomics platform
      • Support for gene expression data, sequences,
        pathways, structure, etc. (40 visualization and
        analysis modules).
      • Access to local and remote data sources and
        analytical services.
      • Support for workflow scripting.
      • Integration with caGRID.
  • Development framework
      • Open-source development.
      • Modular/extensible architecture, supporting
        pluggable components with configurable user
        interface.
      • Formal (caBIG-registered) data models for a
        multitude of bioinformatics concepts.
      • Easy integration of third-party components.

5
History
  • Evolution
  • caWorkbench 1.0 (NCICB sponsored), 2003
  • caWorkbench 2.0 (AMDeC/Columbia sponsored), 2005
  • caWorkbench 3.0 (caBIG sponsored), 2006
  • geWorkbench 3.0 (NCBC/caBIG sponsored), 2007
  • A platform for caGRID enablement
  • MAGNet: Multi-scale Analysis of Genomic and
    Cellular Networks
  • National Center for Biomedical Computing (NIH
    Roadmap)
  • geWorkbench (genomics Workbench) will be the
    software platform for the Center's technology.

6
Component-Based Architecture
Visualization
Data Management
Algorithms
Databases
7
Architecture
8
Available geWorkbench Components
  • Filters/Normalizers
      • Threshold filters, detection call filter, GenePix
        flags filter.
      • Centering, quantile normalization, housekeeping
        genes normalization.
  • Analyses
      • t-test, hierarchical clustering, self-organizing
        maps.
      • BLAST, pattern discovery, sequence alignments,
        synteny.
      • Network reconstruction.
  • Third-party components
      • Cytoscape, GoMiner, RProteomics
  • Workflow support
  • Data formatters
      • Gene expression data (Affy, GenePix, caArray)
      • Sequence data (FASTA)
      • Genotypic data
  • Visualizers
      • Color mosaic.
      • Scatter plot.
      • Spreadsheet view.
      • Expression profile graph.
      • Gene networks.
      • Pathways.
  • Annotations
      • GO terms.
      • caBIO gene info.

9
User Interface
10
geWorkbench Resources
http://www.geworkbench.org/
11
Ontology-Anchored Interfaces
  • Scalable mechanisms for the definition of
    community based standards
  • Vocabulary
  • Grammar
  • Semantics
  • Ontology Anchored Interface Design
  • Available only for Fundamental Biomedical Types
  • DNA Sequence
  • Gene
  • Gene Expression
  • Extend Existing Ontologies with Two New Axes
    (CDE)
  • Complex Data Structures
  • Algorithms

BISON: Biomedical Informatics Structured Ontology
Notation
12
Interface Design CDE
Diagram with three axes: Algorithms, Complex Data
Structures, and Primary Biological Datatypes (GO).
Labels shown in the diagram:
  • Pattern Discovery, 1D Pattern Discovery, 2D Pattern
    Discovery, DNA Sequence Pattern Discovery, Protein
    Sequence Pattern Discovery, Gene Expression Pattern
    Discovery, SNP Pattern Discovery
  • 1D Pattern Matching, 2D Pattern Matching, Pattern
    Match, Pattern
  • Linkage Analysis, Classification, Clustering,
    Cluster, Training Set
  • Adjacency Matrix, Synteny, Phylogenetic Tree
  • Protein Seq, DNA Seq, Sequence, Expression, Gene,
    SNP, Tissue, Organ
13
BISON: Biomedical Informatics Structured Ontology
Component A
    @Publish
    public DSDataSet publish(. . .) {
        DSDataSet dataSet;
        // do some work that assigns a value to dataSet.
        return dataSet;
    }
Component B
    @Subscribe
    public void receive(DSDataSet dataSet, Object source) {
        // Consume the argument dataSet, as appropriate
    }
  • Provide re-usable models of common bioinformatics
    concepts
  • Data: sequence, expression, genotype, structure,
    proteomics
  • Complex data structures: patterns, clusters,
    HMMs, PSSMs, alignments
  • Algorithms: clustering, matching, discovery,
    normalization, filtering
  • Provide a foundation for the development of
    interoperable geWorkbench components
  • Endorsed by multiple communities (caBIG, AMDeC,
    NCBCs)
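
The publish/subscribe exchange between Component A and Component B on this slide could work roughly as follows. This is a minimal sketch, not geWorkbench's actual implementation: the `Publish` and `Subscribe` annotations, the `DSDataSet` class, and the `EventBusSketch` dispatcher are all hypothetical stand-ins defined here for illustration, and routing by reflection is an assumption about how such a framework might deliver data.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the annotations and data type named on the slide.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Publish {}

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Subscribe {}

class DSDataSet {
    final String label;
    DSDataSet(String label) { this.label = label; }
}

class ComponentA {
    @Publish
    public DSDataSet publish() {
        return new DSDataSet("expression-set-1");
    }
}

class ComponentB {
    final List<String> received = new ArrayList<>();

    @Subscribe
    public void receive(DSDataSet dataSet, Object source) {
        received.add(dataSet.label);
    }
}

// Minimal dispatcher (an assumption): finds annotated methods by reflection.
class EventBusSketch {
    private final List<Object> components = new ArrayList<>();

    void register(Object component) { components.add(component); }

    // Call every zero-argument @Publish method on source and route the result.
    void publishFrom(Object source) {
        try {
            for (Method pub : source.getClass().getMethods()) {
                if (!pub.isAnnotationPresent(Publish.class) || pub.getParameterCount() != 0) continue;
                Object event = pub.invoke(source);
                if (event != null) deliver(event, source);
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("dispatch failed", e);
        }
    }

    // Deliver to every (event, source) @Subscribe method whose first parameter accepts the event.
    private void deliver(Object event, Object source) throws ReflectiveOperationException {
        for (Object target : components) {
            if (target == source) continue;
            for (Method sub : target.getClass().getMethods()) {
                Class<?>[] params = sub.getParameterTypes();
                if (sub.isAnnotationPresent(Subscribe.class)
                        && params.length == 2
                        && params[0].isInstance(event)) {
                    sub.invoke(target, event, source);
                }
            }
        }
    }
}
```

Matching on the parameter type means a component only declares what data it consumes; it never needs a reference to the component that produced it.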

14
BISON - BioObjects
15
BISON Example - Patterns
Pattern<T>
    Match<T> match(T)
    Collection(Match<T>) match(Collection(T))
    String toString()
SequencePattern extends Pattern<Sequence>
    SeqMatch match(Sequence)
    Collection(SeqMatch) match(Collection(Sequence))
    String toString()
StructurePattern extends Pattern<Structure>
    3DMatch match(Structure)
    Collection(3DMatch) match(Collection(Structure))
    String toString()
SeqMatch extends Match<Sequence>
    Sequence getObject()
    int getAlignment()
    double getPValue()
3DMatch extends Match<Structure>
    Structure getObject()
    vector getV1()
    vector getV2()
    double getAlpha()
    double getPValue()
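
The generic pattern/match contract on this slide can be written as compilable Java along the following lines. This is a sketch under stated assumptions: `String` stands in for the slide's Sequence type, the pattern is a literal motif found with `indexOf`, a `null` return means "no match", and `getPValue()` is omitted because the slide gives no way to compute it.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// A match records which object was matched.
interface Match<T> {
    T getObject();
}

// A pattern can be matched against one object or a collection of them.
interface Pattern<T> {
    Match<T> match(T object);
    // The wildcard lets implementations return a collection of a Match subtype.
    Collection<? extends Match<T>> match(Collection<T> objects);
}

final class SeqMatch implements Match<String> {
    private final String sequence;
    private final int alignment; // offset of the motif within the sequence

    SeqMatch(String sequence, int alignment) {
        this.sequence = sequence;
        this.alignment = alignment;
    }

    @Override public String getObject() { return sequence; }
    public int getAlignment() { return alignment; }
}

final class SequencePattern implements Pattern<String> {
    private final String motif;

    SequencePattern(String motif) { this.motif = motif; }

    // Returns the first occurrence of the motif, or null when it is absent.
    @Override
    public SeqMatch match(String sequence) {
        int offset = sequence.indexOf(motif);
        return offset < 0 ? null : new SeqMatch(sequence, offset);
    }

    // Keeps only the sequences that contain the motif.
    @Override
    public Collection<SeqMatch> match(Collection<String> sequences) {
        List<SeqMatch> hits = new ArrayList<>();
        for (String s : sequences) {
            SeqMatch m = match(s);
            if (m != null) hits.add(m);
        }
        return hits;
    }

    @Override public String toString() { return "SequencePattern(" + motif + ")"; }
}
```

Java generics are invariant, so the collection overload is declared with `? extends Match<T>`; that is what allows `SequencePattern` to return `Collection<SeqMatch>` as the slide's diagram suggests.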
16
BISON Example Classification
Classifier<T>
    Classification<T> run(T object)
    Collection<Classification<T>> run(TestSet<T> objects)
    void init(TrainingSet<T> objects)
ProteinSVM extends Classifier<Protein>
    Classification<Protein> run(Protein protein)
    Collection<Classification<Protein>> run(TestSet<Protein> proteins)
    void init(TrainingSet<Protein> proteins)
MicroarrayPD extends Classifier<Microarray>
    Classification<Microarray> run(Microarray ma)
    Collection<Classification<Microarray>> run(TestSet<Microarray> mas)
    void init(TrainingSet<Microarray> mas)
CrossValidation<T> (uses a Classifier<T> and a TrainingSet<T>)
    void run()
    void init(TrainingSet<T> tSet)
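
To show how a `CrossValidation<T>` could drive any `Classifier<T>` through the `init`/`run` contract above, here is a small sketch. Several parts are assumptions made for illustration only: the `Labeled<T>` holder replaces the slide's TrainingSet, labels are plain strings, `run()` returns an accuracy instead of `void`, the scheme is leave-one-out, and `NearestMeanClassifier` is a toy classifier on single numbers, not ProteinSVM or MicroarrayPD.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical holder for one labeled training example.
final class Labeled<T> {
    final T item;
    final String label;
    Labeled(T item, String label) { this.item = item; this.label = label; }
}

interface Classifier<T> {
    void init(List<Labeled<T>> trainingSet); // train
    String run(T object);                    // classify one object
}

// Toy classifier: predicts the label whose training mean is closest.
final class NearestMeanClassifier implements Classifier<Double> {
    private final Map<String, double[]> stats = new HashMap<>(); // label -> {sum, count}

    @Override
    public void init(List<Labeled<Double>> trainingSet) {
        stats.clear();
        for (Labeled<Double> ex : trainingSet) {
            double[] s = stats.computeIfAbsent(ex.label, k -> new double[2]);
            s[0] += ex.item;
            s[1] += 1;
        }
    }

    @Override
    public String run(Double x) {
        String best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> e : stats.entrySet()) {
            double dist = Math.abs(x - e.getValue()[0] / e.getValue()[1]);
            if (dist < bestDist) {
                bestDist = dist;
                best = e.getKey();
            }
        }
        return best;
    }
}

final class CrossValidation<T> {
    private final Classifier<T> classifier;
    private List<Labeled<T>> data = new ArrayList<>();

    CrossValidation(Classifier<T> classifier) { this.classifier = classifier; }

    void init(List<Labeled<T>> trainingSet) { data = new ArrayList<>(trainingSet); }

    // Leave-one-out: retrain without example i, then test on example i.
    double run() {
        int correct = 0;
        for (int i = 0; i < data.size(); i++) {
            List<Labeled<T>> train = new ArrayList<>(data);
            Labeled<T> heldOut = train.remove(i);
            classifier.init(train);
            if (heldOut.label.equals(classifier.run(heldOut.item))) correct++;
        }
        return data.isEmpty() ? 0.0 : (double) correct / data.size();
    }
}
```

Because `CrossValidation<T>` depends only on the interface, the same evaluation code would apply to a protein or microarray classifier without change.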
17
geWorkbench
  • Major effort on making the platform broadly
    available to and extensible by the biomedical
    research community
  • Integration of MAGNet Tools
  • B cell Interaction Knowledge base
  • ARACNE
  • REDUCE
  • MEDUSA
  • GeneWays
  • Protein Structure Pipeline
  • JMol (Open source Molecular Viewer)

18
Integration of 3rd party components
Cytoscape
GenePattern
MatrixREDUCE
GoMiner
19
Core IV Infrastructure (cont.)
MAGNet/C2B2
20
MatrixREDUCE
  • Statistical mechanical modeling of genome-wide
    transcription factor occupancy data by
    MatrixREDUCE

Bussemaker Lab
21
MatrixREDUCE
22
ARACNE: Accurate Reconstruction of Cellular
Networks
  • Reverse Engineering Transcriptional Interactions

Califano and Dalla Favera Labs
23
ARACNE Recap
Step 1: Compute mutual information to assess
statistical independence of each gene (mRNA) pair
Step 2 (DPI): Data Processing Inequality used to
test each gene (mRNA) triplet
(Diagram: gene triplet x, y, z)
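
The two steps in this recap can be sketched in a few lines. This is an illustration under simplifying assumptions, not the ARACNE implementation: mutual information is estimated here with a plain equal-width histogram, the DPI is applied without any tolerance parameter, and the class and method names (`AracneSketch`, `miMatrix`, `applyDpi`) are hypothetical.

```java
final class AracneSketch {

    // Map each value to one of `bins` equal-width bins over the observed range.
    static int[] discretize(double[] values, int bins) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double width = (max - min) / bins;
        int[] out = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = width == 0.0 ? 0 : Math.min(bins - 1, (int) ((values[i] - min) / width));
        }
        return out;
    }

    // Step 1: histogram estimate of mutual information (in nats) for one gene pair.
    static double mutualInformation(double[] x, double[] y, int bins) {
        int n = x.length;
        int[] bx = discretize(x, bins);
        int[] by = discretize(y, bins);
        int[][] joint = new int[bins][bins];
        int[] countX = new int[bins];
        int[] countY = new int[bins];
        for (int i = 0; i < n; i++) {
            joint[bx[i]][by[i]]++;
            countX[bx[i]]++;
            countY[by[i]]++;
        }
        double mi = 0.0;
        for (int a = 0; a < bins; a++) {
            for (int b = 0; b < bins; b++) {
                if (joint[a][b] == 0) continue;
                double pxy = (double) joint[a][b] / n;
                // p(x,y) / (p(x) p(y)) written in terms of counts
                mi += pxy * Math.log((double) joint[a][b] * n / ((double) countX[a] * countY[b]));
            }
        }
        return mi;
    }

    // Pairwise MI for all gene profiles; values below the threshold are treated as no edge.
    static double[][] miMatrix(double[][] profiles, int bins, double threshold) {
        int g = profiles.length;
        double[][] mi = new double[g][g];
        for (int i = 0; i < g; i++) {
            for (int j = i + 1; j < g; j++) {
                double v = mutualInformation(profiles[i], profiles[j], bins);
                if (v >= threshold) {
                    mi[i][j] = v;
                    mi[j][i] = v;
                }
            }
        }
        return mi;
    }

    // Step 2: DPI. Drop edge i-j if some third gene k makes it the weakest edge of the triplet.
    static boolean[][] applyDpi(double[][] mi) {
        int g = mi.length;
        boolean[][] keep = new boolean[g][g];
        for (int i = 0; i < g; i++) {
            for (int j = i + 1; j < g; j++) {
                if (mi[i][j] <= 0.0) continue;
                boolean weakest = false;
                for (int k = 0; k < g && !weakest; k++) {
                    if (k == i || k == j) continue;
                    weakest = mi[i][j] < Math.min(mi[i][k], mi[j][k]);
                }
                keep[i][j] = keep[j][i] = !weakest;
            }
        }
        return keep;
    }
}
```

For a chain x → y → z, the indirect x-z pair carries the least information of the three, so it is the edge the DPI step removes.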
24
The germinal center
mutated Ig V region
unmutated
B-Cell Subpopulations
Germinal Center (GC)
Post-GC
Plasma Cell
Pre-GC
Naïve B
Centroblast
Centrocyte
Memory B
Follicular Lymphoma
B-CLL
Burkitt Lymphoma
Diffuse Large Cell Lymphoma
Mantle Cell Lymphoma
(CD5)
Hodgkin Disease
Multiple Myeloma
(CD5)
B-Cell Derived Malignancies
25
C-MYC Sub-network
  • Graphical representation
  • all 56 1st neighbors
  • the top ranking 444 2nd neighbors

26
ChIP Validation
Expected background: 10% (1,300/12,600)
27
ARACNE Problems
  • A single run is very sensitive to dataset selection
      • Bootstrapping: selecting N (equal-sized)
        subsamples of the expression profiles with
        replacement.
      • Statistical significance based on consensus
        across all runs
  • Non-transcriptional interactions should not be
    allowed to remove transcriptional interactions
    via the DPI.
      • Identify transcription factors (TFs)
      • Prevent removal of TF interactions by non-TF
        interactions
  • Network is an average of normal, tumor-related and
    experimentally manipulated cell phenotypes.
      • Generate network for a homogeneous phenotype
        using experimental perturbations
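
The bootstrapping remedy listed on this slide might look like the sketch below. It makes several assumptions: the expression matrix is laid out as genes × samples, the network-inference step is passed in as a function (any routine that returns an adjacency matrix), and consensus is reported as a raw edge frequency across runs, which is simpler than the significance test the slide mentions. `BootstrapConsensus` and its method names are hypothetical.

```java
import java.util.Random;
import java.util.function.Function;

final class BootstrapConsensus {

    // Draw as many samples (columns) as the original has, with replacement.
    static double[][] resampleColumns(double[][] data, Random rng) {
        int genes = data.length;
        int samples = data[0].length;
        double[][] out = new double[genes][samples];
        for (int s = 0; s < samples; s++) {
            int pick = rng.nextInt(samples);
            for (int g = 0; g < genes; g++) {
                out[g][s] = data[g][pick];
            }
        }
        return out;
    }

    // Run the inference on `runs` bootstrap datasets and return, per edge,
    // the fraction of runs in which the edge was reported.
    static double[][] edgeSupport(double[][] data, int runs, long seed,
                                  Function<double[][], boolean[][]> infer) {
        int genes = data.length;
        int[][] counts = new int[genes][genes];
        Random rng = new Random(seed); // fixed seed keeps the sketch reproducible
        for (int r = 0; r < runs; r++) {
            boolean[][] edges = infer.apply(resampleColumns(data, rng));
            for (int i = 0; i < genes; i++) {
                for (int j = 0; j < genes; j++) {
                    if (edges[i][j]) counts[i][j]++;
                }
            }
        }
        double[][] support = new double[genes][genes];
        for (int i = 0; i < genes; i++) {
            for (int j = 0; j < genes; j++) {
                support[i][j] = (double) counts[i][j] / runs;
            }
        }
        return support;
    }
}
```

Edges that appear in most bootstrap runs are the ones least dependent on which samples happened to be in the dataset; a cutoff on the support value then yields the consensus network.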

28
Performance improvements
  • Systematic (100-fold) improvement for
    Bootstrapping
  • Systematic (100-fold) improvement when TFs are
    used
  • Homogeneous network is possible with as few as
    100 microarrays (MA)

29
The B-Cell Knowledge Base
  • Everything you always wanted to know about B
    Cells but were afraid to ask

H. Bussemaker, A. Califano, R. Dalla Favera, C.
Leslie, A. Rzhetsky, C. Wiggins
30
Knowledge Base for Human B Lymphocytes
  • Integrative
  • Bayesian Evidence integration of pairwise
    interactions
  • Protein-Protein, Protein-DNA
  • Context Specific
  • ARACNE, GeneWays, REDUCE
  • B-Cell data or B-cell specific criteria
  • Linked to one of the largest B-cell expression
    profile microarray datasets, ChIP-chip assays
    (MYC/BCL6), miRNA profiles, and literature
  • Captures multivariate dependencies
  • Three-way interactions via MINDY and MatrixREDUCE
  • Post-translational modulation of transcriptional
    regulation
  • Combinatorial transcriptional regulation
  • Signal transduction control of transcriptional
    regulation, i.e., the Transferome meets the
    Transcriptome
  • Links to literature, via GeneWays

31
Integrating protein-DNA and protein-protein
Interactions via Naïve Bayes Classification
  • Protein-Protein Interactions (PPIs)
  • Human PPI databases
  • Human Protein Reference Database (HPRD)
  • Biomolecular Interaction Network Database (BIND)
  • Database of Interacting Proteins (DIP) and IntAct
  • Y2H studies (2 in human)
  • Eukaryotic PPI via orthologous genes
    (Inparanoid)
  • MIPS, BIND, IntAct.
  • GeneWays Predictions (context-specific literature
    analysis)
  • Co-expression analysis (Mutual Information)
  • Gene Ontology classification (biological
    process/compartment)
  • Protein-DNA Interactions (PDIs)
  • Human PDI databases
  • TRANSFAC, BIND, MycDB
  • Mouse PDI databases (TRANSFAC, BIND) via
    orthologous genes (Inparanoid)
  • ARACNE (bootstrap-TF)
  • GeneWays predictions (context-specific literature
    analysis)
  • 49,719 interactions (4,944 genes)
  • 27,705 PPIs (4,209 genes)
  • 22,014 PDIs (3,216 genes/457 TFs)
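
One common way to combine evidence sources like those listed above with a Naïve Bayes model is to multiply per-source likelihood ratios onto the prior odds. The sketch below shows that arithmetic only. It is not the Knowledge Base's actual model: the class name, the source names, and every probability are illustrative assumptions, and sources that were not observed are simply ignored, whereas a full model would also weigh the absence of evidence.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class NaiveBayesEvidence {
    private final double priorOdds;
    private final Map<String, Double> likelihoodRatios = new HashMap<>();

    // priorProbability: assumed chance that a random gene pair truly interacts.
    NaiveBayesEvidence(double priorProbability) {
        this.priorOdds = priorProbability / (1.0 - priorProbability);
    }

    // Register a source by how often it fires for true versus false interactions.
    void addSource(String name, double pEvidenceGivenTrue, double pEvidenceGivenFalse) {
        likelihoodRatios.put(name, pEvidenceGivenTrue / pEvidenceGivenFalse);
    }

    // Posterior probability of an interaction given the sources that support it.
    // Treating the sources as independent is the "naive" assumption.
    double posterior(Set<String> observedSources) {
        double odds = priorOdds;
        for (String source : observedSources) {
            Double lr = likelihoodRatios.get(source);
            if (lr == null) {
                throw new IllegalArgumentException("unknown source: " + source);
            }
            odds *= lr;
        }
        return odds / (1.0 + odds);
    }
}
```

With a prior of 0.5 and one source that fires for 80% of true pairs but 20% of false ones, the likelihood ratio is 4 and the posterior is 4 / (1 + 4) = 0.8; each additional independent source multiplies the odds again.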

32
Ribosomal Protein Control
BTF3
From ARACNE
MI(SHOX2-MYC) 0.073
MI(MYC-RP) 0
MI(SHOX2-RP) > 0.3
From Literature
MYC
From ARACNE
SHOX2
33
GeneWays
  • Networks from literature

Rzhetsky lab
34
Literature Data Mining Tool
  • Natural Language Parsing Analysis of Literature
    Data
  • Full text analysis (not abstract)
  • Ontology-based classification of (inter)-actions
  • Links to literature
  • Sentences
  • Full Text

35
Functional Annotation Pipeline
  • Everything you always wanted to know about a PDB
    Structure but were afraid to ask

Honig and Rost labs
36
Background
  • Pipeline for automated gathering of functional
    annotations of protein structures
  • Main targets: NESG protein structures
  • Structure analysis
  • Structure comparison (Skan, CE, Dali)
  • Binding site identification (SCREEN)
  • Binding site comparison (Procat, cmplig)
  • Sequence analysis
  • Sequence neighbors (PSI-BLAST)
  • Conservation analysis (CONSURF)
  • Domain analysis (InterProScan)

37
Data flow
PDB or NESG repository
PDB file
Sequence
DALI, CE, Skan
SCREEN
Procat
CONSURF
Procat results
Skan, DALI, CE hits
Cavities
InterProScan
PSI-BLAST (local nr or Uniprot)
PDB file conservation
DaliLite, CE
ClustalW
Identified sequence domains
BLAST report
Structure alignment
Parsing and internal Data representation
MySQL database
38
Web Interface
http://luna.bioc.columbia.edu/mfischer/cgi-bin/pipe.pl (/nesg.pl)
39
Pipeline test system
Sporulation response regulator Spo0F from
Bacillus subtilis
PDB 15f1, resolution 3 Å, SCOP C.23.1.1
Complex with phosphotransferase Spo0B (dimer):
4 Spo0F monomers, 2 Spo0B dimers