Title: Panel: The Broader Role of Artificial Intelligence in Large-Scale Scientific Research
1. Panel: The Broader Role of Artificial Intelligence in Large-Scale Scientific Research
- Joel Saltz, MD, PhD
- Professor, Biomedical Informatics, Computer Science
- Davis Chair in Cancer Research
- The Ohio State University
- Visiting Professor, ISI, USC
2. Translational Research
Disease mechanism, disease classification, diagnosis, treatment
- Biology, biotechnology, bioinformatics
Credit: NIH SPORE Guidelines
3. Imaging, Medical Analysis and Grid Environments (IMAGE), September 16-18, 2003
- Identify, query, retrieve, and carry out on-demand data product generation directed at collections of data from multiple sites/groups on a given topic; reproduce each group's data analysis and carry out new analyses on all datasets. Should be able to carry out entirely new analyses or to incrementally modify other scientists' data analyses. Should not have to worry about physical location of data or processing.
4. caBIG: 50 Grid-Enabled Cancer Centers
5. Biomedical Research Grids: Types of Information
- Radiological Studies
- Pathology
- Molecular (Proteomics, gene expression)
- Genetic, Epigenetic (SNPs, haplotype analysis)
- Laboratory, pharmacy, outcome data
6. Example: Some Data Types Generated by OSU Comprehensive Cancer Center
Shared Resource | Example data types
Molecular Cytogenetics | Datasets from karyotype analysis, data from SKY/FISH experiments
Analytical Cytometry | FACSCalibur output, raw data from DiVa Option system
Genotyping and Sequencing | DNA sequence information for the sample, primer sequence, PCR sequencing information
Microarray | Output from Affymetrix gene expression and custom microarray analyses
Mouse Phenotyping | Digitized radiographic, gross, and histologic images, hematology characterization
Real-Time PCR | Description of sample plate, raw and processed output files
Tissue Procurement | Anonymized pathology report, age, gender, race, tissue procurement id, patient id (if consent form is available), consent form, virtual slide of the tissue (if available)
Leukemia Tissue Bank | Sample processing date, date sample taken from the patient, accession id, patient id, patient name and last name, specimen type, number of tubes, diagnosis info, protocol
Proteomics | 1D and 2D gel images, sample information (PI name, analysis method, instrument name), diagnosis, protein expressions for spots
Clinical Trials Office | Protocol descriptions, investigator, title, status, approval processing ids, clinical data for patients on protocols, lab reports, adverse events, and trial outcomes
7. caBIG Problem Statement
- Production of data is outstripping our ability to analyze it
- The research community may not be aware of other work and datasets
- Researchers may not tag data with the definition of the data they produce
- Semantic information is not often encoded nor included with data sets
- Data islands or silos of information are produced based on the problems outlined above
- A small group of knowledgeable people transmit data amongst themselves
- Modern exploratory research requires the integration of disparate databases of biological information to explain results
- To elucidate the mechanism behind disease we must aggregate data from many databases
Peter Covitz, NCI
8. Vision
- Provide the Grid for Cancer Research so that we may:
  - Raise awareness of disparate datasets in the biological research community
  - Allow research groups to exchange datasets with ease
  - Allow research groups to understand the semantics of the datasets that they publish without always having to get on the phone
  - Allow for quicker publication of the analysis of integrated data
- Blind man/elephant problem: high-throughput techniques and molecular imaging are powerful, but each contributes only one piece of the puzzle
Peter Covitz, NCI
9. Bleeding Edge Pilot Project
10. Example: Ohio State BISTI Center for Grid Enabled Image Analysis; Novartis Molecular Imaging Studies (Knopp)
11. Example: Genotype-Phenotype Correlation
- Genetic, phenotypic data related via phylogenetic tree
- 3473 SNPs among 11 strains of inbred mice
- Tree represents ancestry relationship between strains
- C3H/HeJ and DBA/2J are mutations associated with high heart rate variability
- Once candidate genotypes are identified, gather additional information about candidates from wrapped data sources
- Integration of gene-drug relationships in cancer treatment
- (Janies, Knoblock, Khan, Saltz)
12. Example: Classification of Neuroblastoma
- Decision tree models current classification system
- Close collaboration with leading pathologist (Shimada, USC) who developed classifications
- Automate analysis, correlate with molecular, outcome data
- Classification determines treatment
- Children's Oncology Group
- Scope: North America, Australia, New Zealand
13. Example: Use Case from Abramson Cancer Center
- Use Case
  - A researcher would like to study the error rate in pathological diagnoses of solid tumor samples and compare numerous molecular diagnostic approaches to determine if the molecular diagnostic approach can enhance the accuracy of pathological diagnoses.
- Query
  - I want all solid tumors, specifically for lung cancer, that have a diagnosis based on tumor pathology. Each diagnosis must have an image of the tumor that allows for independent verification of diagnoses. Each record retrieved must also have either proteomics marker data or microarray data (Affy or two-color) included so that different molecular techniques can be correlated to the tumor pathology. In addition, I want all protein annotations for markers and genes associated with the proteomics and microarray data so I can perform meta-analyses.
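The query in this use case can be read as a conjunction of filters followed by an annotation lookup. The following is a minimal sketch of that reading over in-memory records, not the grid query mechanism itself. The `TumorRecord` type, its field names, and the `annotations` dictionary are all hypothetical and stand in for whatever shared metadata the participating sites would publish.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TumorRecord:
    # Hypothetical record layout; field names are assumptions for illustration.
    record_id: str
    site: str                      # anatomical site, e.g. "lung"
    tumor_type: str                # e.g. "solid"
    diagnosis_basis: str           # e.g. "pathology"
    image_uri: Optional[str] = None
    proteomics: List[str] = field(default_factory=list)  # marker names
    microarray: List[str] = field(default_factory=list)  # gene symbols


def matches_use_case(r: TumorRecord) -> bool:
    """Solid lung tumor, pathology-based diagnosis, image present,
    and at least one kind of molecular data attached."""
    return (
        r.tumor_type == "solid"
        and r.site == "lung"
        and r.diagnosis_basis == "pathology"
        and r.image_uri is not None
        and bool(r.proteomics or r.microarray)
    )


def run_query(
    records: List[TumorRecord], annotations: Dict[str, str]
) -> List[Tuple[TumorRecord, Dict[str, str]]]:
    """Return matching records, each paired with the annotations
    available for its markers and genes."""
    results = []
    for r in records:
        if matches_use_case(r):
            names = r.proteomics + r.microarray
            found = {n: annotations[n] for n in names if n in annotations}
            results.append((r, found))
    return results
```

In a grid setting each predicate would presumably be evaluated at the site holding the data; the sketch only shows the logical shape of the request.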
14. Issues: Biomedical Grid Architecture
- What metadata needs to be described?
- How to enforce standardization and completeness of metadata description
- Is it practical for everyone to use the same ontologies?
- If not, how to handle local variations in a controlled manner
- Are there middleware solutions that can help? (yes!)
- Data grid techniques for query, management of very large grid-based datasets
- Role for immutable grid-based datatypes?
15. Issues: Biomedical Grid Architecture
- Distributed Ontology Management
- Many sites, many types of complex data
- Sites need to have freedom to create local ontology variants in a controlled manner
- Systematic methods for controlled management, query of ontology variants
- Heuristic datatyping
- Data quality control
- Well-defined structure (e.g. XML schema), sanity checks to check accuracy and completeness of metadata
- e.g. are data values consistent with what would be expected in an Affymetrix gene expression dataset?
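The two kinds of sanity check named above (metadata completeness, and plausibility of data values) might be sketched as follows. The required field names and the numeric bounds are assumptions chosen for illustration. They are not taken from any Affymetrix specification and would need to be set per platform.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical list of metadata fields a site might require.
REQUIRED_FIELDS = ("sample_id", "platform", "probe_count")


def check_metadata(meta: Dict[str, object]) -> List[str]:
    """Return the required fields that are missing or empty."""
    return [f for f in REQUIRED_FIELDS if not meta.get(f)]


def values_look_plausible(
    values: Sequence[float],
    low: float = 0.0,          # assumed lower bound for intensities
    high: float = 1e6,         # assumed upper bound for intensities
    max_bad_fraction: float = 0.01,
) -> bool:
    """True if at most `max_bad_fraction` of the values are
    non-finite or outside [low, high]. An empty dataset fails."""
    if not values:
        return False
    bad = sum(
        1 for v in values
        if not math.isfinite(v) or v < low or v > high
    )
    return bad / len(values) <= max_bad_fraction
```

A structural check against an XML schema would come before either of these; this sketch covers only the checks that a schema alone cannot express.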
16. Issues: In Silico Research - Not Your Father's Datamining
- Example: Predict clinical outcome
- Goal: optimize function F that predicts outcome by combining clinical, molecular, image data
- Molecular, image data in turn need to be interpreted and analyzed
- Need to find image analysis functions Gi and molecular data analysis functions Hi that make it possible to best predict outcome
- Functions Gi and Hi make use of domain-specific knowledge (e.g. phylogenetic trees, histologic classifications, pathways)
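One possible way to write this goal down is given here as an assumption-laden formalization; it does not appear on the slide. For patient j, let c_j be the clinical data, x_j the image data, m_j the molecular data, and y_j the observed outcome. The loss function \ell, the counts p and q, and the patient count N are introduced only for illustration.

```latex
\hat{y}_j = F\bigl(c_j,\; G_1(x_j),\dots,G_p(x_j),\; H_1(m_j),\dots,H_q(m_j)\bigr),
\qquad
\bigl(F^{*},\{G_i^{*}\},\{H_i^{*}\}\bigr)
  = \arg\min_{F,\{G_i\},\{H_i\}} \sum_{j=1}^{N} \ell\bigl(y_j,\hat{y}_j\bigr)
```

The point of the slide is that the search is over the analysis functions Gi and Hi as well as over F, which is what distinguishes it from mining a fixed feature table.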
17. Issues: Incorporating Ad-hoc Data Sources, Information Integration
- Not all data sources will post precise metadata definitions, ontologies, etc.
- Biomedical researchers should be able to use non-conforming data
- Goal is to develop a system where we automate the integration of data sources in a way that is easy and natural
- Wrap data sources, define metadata, ontologies that allow data integration
- Middleware to cache ad-hoc data and to make ad-hoc information a first-class grid citizen
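The "wrap, then cache" idea can be sketched as below. The sketch assumes an ad-hoc source is just a list of rows with locally chosen column names. `make_wrapper`, `AdHocCache`, and every field name are hypothetical and do not correspond to any existing middleware API.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


def make_wrapper(
    field_map: Dict[str, str],
    converters: Optional[Dict[str, Callable]] = None,
) -> Callable[[Row], Row]:
    """Build a function that renames source columns to shared
    metadata names and optionally converts their values.
    Columns not listed in `field_map` are dropped."""
    converters = converters or {}

    def wrap(row: Row) -> Row:
        out: Row = {}
        for src, dst in field_map.items():
            if src in row:
                conv = converters.get(dst)
                out[dst] = conv(row[src]) if conv else row[src]
        return out

    return wrap


class AdHocCache:
    """Holds wrapped rows so they can be queried like any other source."""

    def __init__(self) -> None:
        self._store: Dict[str, List[Row]] = {}

    def put(self, source_name: str, rows: List[Row], wrap) -> None:
        self._store[source_name] = [wrap(r) for r in rows]

    def query(self, **criteria) -> List[Row]:
        return [
            r
            for rows in self._store.values()
            for r in rows
            if all(r.get(k) == v for k, v in criteria.items())
        ]
```

For example, a spreadsheet with columns `Pt` and `Age(yrs)` could be wrapped with `make_wrapper({"Pt": "patient_id", "Age(yrs)": "age"}, {"age": int})` and then queried by `patient_id` alongside conforming sources.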
18. Issues: Security Requirements
- Patients can give consent for some data for some studies or classes of studies
- Researchers need to be able to control access to data
- IRBs can approve release of identified data to some individuals
- IRBs can specify how deidentification is to be carried out and when deidentified data can be released
- Cooperative study may have different IRB-dictated constraints at different sites
- Individuals associated with a given study may have different roles with different data access permissions
- Access requests and successful accesses must be logged
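Three of these requirements (per-study patient consent, role-dependent permissions, and logging of every request) can be combined in a small sketch. The role names, the two data classes, and the `AccessController` type are assumptions made for illustration. Real IRB constraints would be considerably richer, for instance varying by site.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Set, Tuple

# Hypothetical mapping from role to the data classes it may read.
PERMISSIONS: Dict[str, Set[str]] = {
    "principal_investigator": {"identified", "deidentified"},
    "analyst": {"deidentified"},
}


@dataclass
class AccessController:
    consents: Dict[str, Set[str]]        # patient_id -> consented study ids
    roles: Dict[Tuple[str, str], str]    # (user, study) -> role name
    log: List[dict] = field(default_factory=list)

    def request(self, user: str, study: str,
                patient_id: str, data_class: str) -> bool:
        """Grant access only if the user has a role on the study, the
        role permits this data class, and the patient consented to the
        study. Every request is logged, whether granted or not."""
        role = self.roles.get((user, study))
        granted = bool(
            role is not None
            and data_class in PERMISSIONS.get(role, set())
            and study in self.consents.get(patient_id, set())
        )
        self.log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "study": study,
            "patient_id": patient_id,
            "data_class": data_class,
            "granted": granted,
        })
        return granted
```

Because denied requests are logged too, the log can later be filtered on `granted` to separate access requests from successful accesses.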
19. Issues: Security Requirements
- Need ontology-based description of roles, ownership, permissions
- Validate correctness of description
- Conditions should lead to expected results (i.e. if one specifies rules, does one end up with counter-intuitive results?)
20. Thanks!
21. Mobius
- Middleware system that provides support for management of metadata definitions (defined as XML schemas) and efficient storage and retrieval of data instances in a distributed environment
- Mechanism for data-driven applications to cache, share, and asynchronously communicate data in a distributed environment
- Grid-based distributed, searchable, and shareable persistent storage
- Infrastructure for grid coordination language
22. Global Model Exchange
- Store and link data models defined inside namespaces in the grid
- Enables other services to publish, retrieve, discover, remove, and version metadata definitions
- Services composed in a DNS-like architecture representing parent-child namespace hierarchy
- When a schema is registered in GME, it is stored under the name and namespace specified by the application, and the schema is assigned a version number
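The registration behaviour described above (store under a namespace and name, assign a version number, discover through a parent-child hierarchy) can be illustrated with a single-process sketch. This is not the GME interface. The class name, the method names, and the convention that child namespaces are dot-separated prefixes of the parent (as in DNS) are all assumptions.

```python
from typing import Dict, List, Optional, Tuple


class SchemaRegistry:
    """Toy in-memory registry: (namespace, name) -> list of versions."""

    def __init__(self) -> None:
        self._schemas: Dict[Tuple[str, str], List[str]] = {}

    def publish(self, namespace: str, name: str, schema_text: str) -> int:
        """Store a schema and return the version number assigned to it.
        Versions start at 1 and increase with each publish."""
        versions = self._schemas.setdefault((namespace, name), [])
        versions.append(schema_text)
        return len(versions)

    def retrieve(self, namespace: str, name: str,
                 version: Optional[int] = None) -> str:
        """Return a specific version, or the latest if none is given."""
        versions = self._schemas[(namespace, name)]
        return versions[-1] if version is None else versions[version - 1]

    def discover(self, namespace: str) -> List[Tuple[str, str]]:
        """List schemas in `namespace` and in any child namespace,
        where "a.b" is treated as a child of "b"."""
        return sorted(
            key for key in self._schemas
            if key[0] == namespace or key[0].endswith("." + namespace)
        )
```

In the distributed design the slide describes, each namespace would presumably be served by its own service instance and lookups delegated up or down the hierarchy; the sketch collapses that into one dictionary.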
25. Finding candidate genes
- Complex, multi-factorial diseases
- CAD, diabetes, schizophrenia
- Long candidate lists
- Variations between individuals in their response to treatments
- Non-responders to statin treatment
- GOAL: Link phenotypes to genotypes
26. Genotype <> Phenotype
- APPROACH: establish QTLs by correlating changes in genotype with changes in phenotype
- Genotype: SNPs
- Use in mapping: haplotype blocks
- Functional: come in many flavors (cis-regulatory, non-synonymous, mRNA stability)
- Phenotypes: mice
- Inbred strains provide well-characterized genotypes
- Publicly available quantitative phenotype data (http://www.jax.org/phenome/)
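The approach of correlating changes in genotype with changes in phenotype can be illustrated with a deliberately simple sketch: code each SNP as 0 or 1 per inbred strain, compute the Pearson correlation with a quantitative trait, and rank SNPs by absolute correlation. The function names and the input layout are assumptions. Real QTL mapping would use proper significance testing and account for strain relatedness, which this sketch does not attempt.

```python
from math import sqrt
from typing import Dict, List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def rank_snps(
    genotypes: Dict[str, Dict[str, int]],   # snp_id -> {strain: 0 or 1}
    phenotype: Dict[str, float],            # strain -> trait value
    min_strains: int = 3,
) -> List[Tuple[str, float]]:
    """Rank SNPs by |correlation| between allele code and trait,
    using only strains that have both a genotype and a phenotype."""
    scores = {}
    for snp, alleles in genotypes.items():
        common = sorted(s for s in phenotype if s in alleles)
        if len(common) < min_strains:
            continue
        scores[snp] = pearson(
            [float(alleles[s]) for s in common],
            [phenotype[s] for s in common],
        )
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

SNPs at the top of the ranking are candidates only; they would still need the follow-up annotation steps described in the surrounding slides.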
27. BISP Demonstration: Mouse Phenotype Trait Query
[Diagram: Mouse Phenotype/SNP Analysis. Labels: Distributed Data Storage (three Mako nodes); distributed query processing; Mobius Virtual Mako Service; internet; trait query; SNP IDs; HUGO IDs, BLAST alignments; CERF Service Delivery System (CERF Server, CERF Client); MouseTraitQueryResult (resource instance); Action: ViewGoMiner; GOMiner; HTML Browser.]
28. BRTT Demonstrations, June 21, 2000: Mouse SNP-Trait Association Demonstration
Execute query
A CERF Action/Service binding allows the query to be executed; the user is prompted for the name of a trait (HDL6).
29. Image Data Processing
Digitized Microscopy
DCE-MRI Studies
Visualization of Terabyte-Scale, Multiresolution Data
30. Image Analysis Middleware Framework
- The Distributed Metadata and Data Management service keeps track of workflows, image datasets, and analysis results
  - Manage metadata associated with images, analysis results, annotations in a distributed environment
  - Federate databases of image, clinical, and molecular data in a distributed environment
- The Image Data Storage Service manages the storage resources on the server and encapsulates efficient storage methods for image datasets
  - Create and maintain large-scale, on-line databases of images (from gigabytes to multiple terabytes in size) on disk-based storage clusters
- The Distributed Execution Service supports on-demand analysis of image data
  - Execute simple and complex image analysis operations and workflows on distributed collections of images
  - Integrate data retrieval and processing on commodity clusters and multiprocessor machines
31. BRTT Demonstrations, June 21, 2000: Rat Placenta Microscopy Image Demonstration
Search for Images of Interest
32. Prototype based on caCORE
- cancer Common Ontologic Representation Environment (caCORE)
- caCORE is the technology stack that facilitates data integration across multiple scientific disciplines
33. caGRID Core Architecture
[Diagram: caGRID Extension (Integration of Discovery and Query Services). Labels: Client; OGSA-DAI, Globus; caGRID extension (Concept Discovery); caGRID extension (Federated Query); caGRID extension (metadata); caGRID extension (query); Grid; strongly typed XML transport, metadata management (Mobius); Data Source; caBIO server.]
34. Petabyte-sized rotating storage archives are no longer hypothetical
35. Ohio Supercomputing Center Mass Storage Testbed
- 50 TB of performance storage
  - Home directories, project storage space, and long-term frequently accessed files
- 420 TB of performance/capacity storage
  - Active Disk Cache: compute jobs that require directly connected storage
  - Parallel file systems and scratch space
  - Large temporary holding area
- 128 TB tape library
  - Backups and long-term "offline" storage
IBM's Storage Tank technology combined with TFN connections will allow large data sets to be moved seamlessly throughout the state with increased redundancy.