Panel: The Broader Role of Artificial Intelligence in Large-Scale Scientific Research


Provided by: robpe

Learn more at: http://bmi.osu.edu

1
Panel: The Broader Role of Artificial
Intelligence in Large-Scale Scientific Research

Joel Saltz MD, PhD
Professor of Biomedical Informatics and Computer
Science
Davis Chair in Cancer Research
The Ohio State University
Visiting Professor ISI, USC

2
Translational Research
Disease mechanism, disease classification,
diagnosis, treatment

Biology, biotechnology, bioinformatics

Credit: NIH SPORE Guidelines
3
Imaging, Medical Analysis and Grid Environments
(IMAGE), September 16-18, 2003

Identify, query, retrieve, and carry out on-demand
data product generation directed at collections
of data from multiple sites/groups on a given
topic; reproduce each group's data analysis and
carry out new analyses on all datasets. Should be
able to carry out entirely new analyses or to
incrementally modify other scientists' data
analyses. Should not have to worry about physical
location of data or processing.

4
caBIG: 50 Grid-Enabled Cancer Centers
5
Biomedical Research Grids: Types of Information

Radiological Studies
Pathology
Molecular (Proteomics, gene expression)
Genetic, Epigenetic (SNPs, haplotype analysis)
Laboratory, pharmacy, outcome data

6
Example: Some Data Types Generated by OSU
Comprehensive Cancer Center

Shared Resource: Example data types
Molecular Cytogenetics: Datasets from karyotype analysis, data from SKY/FISH experiments
Analytical Cytometry: FACSCalibur output, raw data from DiVaOption system
Genotyping and Sequencing: DNA sequence information for the sample, primer sequence, PCR sequencing information
Microarray: Output from Affymetrix gene expression and custom microarray analyses
Mouse Phenotyping: Digitized radiographic, gross, and histologic images, hematology characterization
Real Time-PCR: Description of sample plate, raw and processed output files
Tissue Procurement: Anonymized pathology report, age, gender, race, tissue procurement id, patient id (if consent form is available), consent form, virtual slide of the tissue (if available)
Leukemia Tissue Bank: Sample processing date, date sample taken from the patient, accession id, patient id, patient name and last name, specimen type, number of tubes, diagnosis info, protocol
Proteomics: 1D and 2D gel images, sample information (PI name, analysis method, instrument name), diagnosis, protein expressions for spots
Clinical Trials Office: Protocol descriptions, investigator, title, status, approval processing ids, clinical data for patients on protocols, lab reports, adverse events, and trial outcomes
7
caBIG Problem Statement

Production of data outstripping our ability to
analyze it
The research community may not be aware of other
work and datasets
Researchers may not tag data with the definition
of the data they produce
Semantic information is not often encoded nor
included with data sets
Data Islands or Silos of information are
produced based on the problems outlined above
A small group of knowledgeable people transmit
data amongst themselves
Modern exploratory research requires the
integration of disparate databases of biological
information to explain results
To elucidate the mechanism behind disease we must
aggregate data from many databases

Peter Corbitz NCI
8
Vision

Provide the Grid For Cancer Research so that we
may
Raise awareness of disparate datasets in the
biological research community
Allow research groups to exchange datasets with
ease
Allow research groups to understand the semantics
of the datasets that they publish without always
having to get on the phone
Allow for quicker publication of the analysis of
integrated data

Blind Man/Elephant problem: high-throughput
techniques and molecular imaging are powerful, but
each contributes only one piece of the puzzle

Peter Corbitz - NCI
9
Bleeding Edge Pilot Project
10
Example: Ohio State BISTI Center for Grid-Enabled
Image Analysis; Novartis Molecular Imaging Studies
(Knopp)
11
Example: Genotype-Phenotype Correlation

Genetic, phenotypic data related via phylogenetic
tree
3473 SNPs among 11 strains of inbred mice
Tree represents ancestry relationship between
strains
C3HHEJ and DBA2J are mutations associated with
high heart rate variability
Once candidate genotypes are identified, gather
additional information about candidates from
wrapped data sources
Integration of Gene-Drug relationships in cancer
treatment
(Janies, Knoblock, Khan, Saltz)

12
Example: Classification of Neuroblastoma

Decision-tree models current classification
system
Close collaboration with leading Pathologist
(Shimada USC) who developed classifications
Automate analysis, correlate with molecular,
outcome data
Classification determines treatment
Children's Oncology Group
Scope: North America, Australia, New Zealand

13
Example: Use Case from Abramson Cancer Center

Use Case
A researcher would like to study the error rate in
pathological diagnoses of solid tumor samples
and compare numerous molecular diagnostic
approaches to determine if the molecular
diagnostic approach can enhance the accuracy of
pathological diagnoses.
Query
I want all solid tumors, specifically for lung
cancer, that have a diagnosis based on tumor
pathology. Each diagnosis must have an image of
the tumor that allows for independent
verification of diagnoses. Each record retrieved
must also have either proteomics marker data or
microarray data (Affy or two-color) included so
that different molecular techniques can be
correlated to the tumor pathology. In addition, I
want all protein annotations for markers and
genes associated with the proteomics and
microarray data so I can perform meta-analyses.
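
The query above can be read as a conjunction of filters over tumor records. A minimal sketch follows; the `TumorRecord` fields and the `run_query` helper are hypothetical names chosen for illustration, not part of any caBIG schema, and the protein-annotation lookup is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TumorRecord:
    # All field names are illustrative assumptions.
    record_id: str
    tumor_type: str                  # e.g. "solid"
    site: str                        # e.g. "lung"
    diagnosis_basis: str             # e.g. "pathology"
    image_uri: Optional[str] = None  # image used to verify the diagnosis
    proteomics_markers: List[str] = field(default_factory=list)
    microarray_platform: Optional[str] = None  # "affy", "two-color", or None


def matches_use_case(r: TumorRecord) -> bool:
    """True if the record satisfies every condition stated in the query."""
    has_molecular = bool(r.proteomics_markers) or (
        r.microarray_platform in {"affy", "two-color"}
    )
    return (
        r.tumor_type == "solid"
        and r.site == "lung"
        and r.diagnosis_basis == "pathology"
        and r.image_uri is not None
        and has_molecular
    )


def run_query(records: List[TumorRecord]) -> List[TumorRecord]:
    """Return the records that match, preserving input order."""
    return [r for r in records if matches_use_case(r)]
```

In a grid setting the same predicate would be pushed to each site rather than applied to a local list.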

14
Issues: Biomedical Grid Architecture

What metadata needs to be described?
How to enforce standardization and completeness
of metadata description
Is it practical for everyone to use the same
ontologies?
If not, how to handle local variations in a
controlled manner
Are there middleware solutions that can help?
(yes!)
Data grid techniques for query, management of
very large grid-based datasets
Role for immutable grid-based datatypes?

15
Issues: Biomedical Grid Architecture

Distributed Ontology Management
Many sites, many types of complex data
Sites need to have freedom to create local
ontology variants in a controlled manner
Systematic methods for controlled management,
query of ontology variants
Heuristic datatyping
Data quality control
Well-defined structure (e.g. XML schema)
Sanity checks on the accuracy and
completeness of metadata
e.g. are data values consistent with what would
be expected in an Affymetrix gene expression
dataset?
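
A sanity check of the kind described above might look like the following sketch. The required field names and the numeric thresholds are placeholders chosen for illustration; they are not real Affymetrix limits.

```python
# Hypothetical required metadata fields for an expression dataset.
REQUIRED_FIELDS = ("sample_id", "platform", "values")


def check_expression_dataset(dataset, min_probes=1000, max_intensity=65535.0):
    """Return a list of human-readable problems; an empty list means it passed.

    min_probes and max_intensity are illustrative assumptions only.
    """
    problems = []
    # Completeness: every required metadata field must be present.
    for name in REQUIRED_FIELDS:
        if name not in dataset:
            problems.append(f"missing field: {name}")
    # Plausibility: the value count and range should look like the claimed type.
    values = dataset.get("values", [])
    if len(values) < min_probes:
        problems.append(
            f"only {len(values)} values, expected at least {min_probes}"
        )
    if any(v < 0 or v > max_intensity for v in values):
        problems.append("intensity outside expected range")
    return problems
```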

16
Issues: In Silico Research - Not Your Father's
Datamining

Example: predict clinical outcome
Goal: optimize function F that predicts outcome
by combining clinical, molecular, image data
Molecular, image data in turn need to be
interpreted and analyzed
Need to find image analysis functions Gi and
molecular data analysis functions Hi that make it
possible to best predict outcome
Functions Gi and Hi make use of domain-specific
knowledge (e.g. phylogenetic trees, histologic
classifications, pathways)
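
One way to write the goal above down, as a sketch: only F, Gi and Hi come from the slide. The symbols c (clinical data), I (image data), M (molecular data), the loss ℓ and the patient index k are notation assumed here.

```latex
\hat{y}_k = F\bigl(c_k,\; G_1(I_k), \dots, G_m(I_k),\; H_1(M_k), \dots, H_n(M_k)\bigr),
\qquad
\min_{F,\,\{G_i\},\,\{H_i\}} \; \sum_{k=1}^{N} \ell\bigl(y_k, \hat{y}_k\bigr)
```

Here y_k is the observed outcome for patient k, and the search runs jointly over the predictor F and the analysis functions that feed it.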

17
Issues: Incorporating ad-hoc data sources
(Information Integration)

Not all data sources will post precise metadata
definitions, ontologies, etc.
Biomedical researchers should be able to use
non-conforming data
Goal is to develop a system that automates the
integration of data sources in a way that is easy
and natural
Wrap data sources, define metadata, ontologies
that allow data integration
Middleware to cache ad-hoc data and to make
ad-hoc information a first-class grid citizen

18
Issues: Security Requirements

Patients can give consent for some data, for
some studies or classes of studies
Researchers need to be able to control access to
data
IRBs can approve release of identified data to
some individuals
IRBs can specify how deidentification is to be
carried out and when deidentified data can be
released
Cooperative study may have different IRB-dictated
constraints at different sites
Individuals associated with a given study may
have different roles with different data access
permissions
Access requests and successful accesses must be
logged
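
A minimal sketch of how several of these requirements could fit together: consent is checked per patient and study, permissions per role and data class, and every request is logged whether or not it succeeds. All class and field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Grant:
    # An IRB-approved permission: which role may see which class of data.
    study: str
    role: str
    data_class: str  # "identified" or "deidentified"


@dataclass
class AccessController:
    grants: set = field(default_factory=set)    # set of Grant
    consents: set = field(default_factory=set)  # set of (patient_id, study)
    log: list = field(default_factory=list)     # one entry per request

    def request(self, user, role, study, patient_id, data_class):
        """Decide an access request and log it, whether or not it succeeds."""
        allowed = (
            (patient_id, study) in self.consents
            and Grant(study, role, data_class) in self.grants
        )
        outcome = "granted" if allowed else "denied"
        self.log.append((user, study, patient_id, data_class, outcome))
        return allowed
```

Site-specific IRB constraints could be modeled by keeping one controller, with its own grants, per site.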

19
Issues: Security Requirements

Need ontology-based description of roles,
ownership, permissions
Validate correctness of description
Conditions should lead to expected results (i.e.
if one specifies rules, does one end up with
counter-intuitive results?)
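
One simple validation is to look for rules that contradict each other. The sketch below assumes rules are plain (role, resource, effect) tuples; that representation is an assumption, not the ontology-based description the slide calls for.

```python
def find_conflicts(rules):
    """Return (role, resource) pairs given both "allow" and "deny".

    rules is an iterable of (role, resource, effect) tuples, where effect
    is "allow" or "deny". Each conflicting pair is reported once.
    """
    first_effect = {}
    conflicts = []
    for role, resource, effect in rules:
        key = (role, resource)
        if key not in first_effect:
            first_effect[key] = effect
        elif first_effect[key] != effect and key not in conflicts:
            conflicts.append(key)
    return conflicts
```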

20
Thanks!
21
Mobius

Middleware system that provides support for
management of metadata definitions (defined as
XML schemas) and efficient storage and retrieval
of data instances in a distributed environment.
Mechanism for data driven applications to cache,
share, and asynchronously communicate data in a
distributed environment
Grid based distributed, searchable, and shareable
persistent storage
Infrastructure for grid coordination language

22
Global Model Exchange

Store and link data models defined inside
namespaces in grid.
Enables other services to publish, retrieve,
discover, remove, and version metadata
definitions
Services composed in a DNS-like architecture
representing parent-child namespace hierarchy
When a schema is registered in GME, it is stored
under the name and namespace specified by the
application, and the schema is assigned a version number
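
The registration behavior described above can be sketched with an in-memory registry. This is an illustration only, not the Mobius GME API: the `ModelExchange` class, its method names and the dotted namespace notation are all assumptions.

```python
class ModelExchange:
    """Toy registry: schemas stored under (namespace, name), versioned on publish."""

    def __init__(self):
        # (namespace, name) -> list of schema texts; list index + 1 is the version
        self._schemas = {}

    def publish(self, namespace, name, schema_text):
        """Store a schema and return the version number assigned to it."""
        versions = self._schemas.setdefault((namespace, name), [])
        versions.append(schema_text)
        return len(versions)

    def retrieve(self, namespace, name, version=None):
        """Return a specific version, or the latest when version is None."""
        versions = self._schemas[(namespace, name)]
        return versions[-1] if version is None else versions[version - 1]

    def discover(self, namespace):
        """List schemas in a namespace and its children (dotted hierarchy assumed)."""
        return sorted(
            (ns, n)
            for (ns, n) in self._schemas
            if ns == namespace or ns.startswith(namespace + ".")
        )
```

In the DNS-like arrangement the slide describes, each namespace would presumably be served by its own service rather than one dictionary.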

23
(No Transcript)
24
(No Transcript)
25
Finding candidate genes

Complex, multi-factorial diseases
CAD, diabetes, schizophrenia
Long candidate lists
Variations between individuals in their response
to treatments
Non-responders to statin treatment
GOAL: Link phenotypes to genotypes

26
Genotype <-> Phenotype

APPROACH: establish QTLs by correlating changes
in genotype with changes in phenotype
Genotype: SNPs
Use in mapping: Haplotype blocks
Functional: come in many flavors (cis-regulatory,
non-synonymous, mRNA stability)
Phenotypes: Mice
Inbred strains provide well characterized
genotypes
Publicly available quantitative phenotype data
(http://www.jax.org/phenome/)
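
As a rough illustration of correlating genotype with phenotype across inbred strains, the sketch below ranks SNPs by the difference in mean phenotype between the two allele groups. This is a deliberately naive stand-in for real QTL mapping, and the function name and input layout are assumptions.

```python
from statistics import mean


def allele_effects(genotypes, phenotype):
    """Rank SNPs by the absolute difference in mean phenotype between alleles.

    genotypes: {snp_id: {strain: allele}}
    phenotype: {strain: numeric trait value}
    SNPs without exactly two alleles among the phenotyped strains are skipped.
    """
    effects = {}
    for snp, calls in genotypes.items():
        groups = {}
        for strain, allele in calls.items():
            if strain in phenotype:
                groups.setdefault(allele, []).append(phenotype[strain])
        if len(groups) == 2:
            mean_a, mean_b = (mean(vals) for vals in groups.values())
            effects[snp] = abs(mean_a - mean_b)
    # Largest effect first.
    return sorted(effects.items(), key=lambda kv: kv[1], reverse=True)
```

A real analysis would also account for the ancestry relationships between strains and test for significance.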

27
BISP Demonstration: Mouse Phenotype Trait Query
[Architecture diagram: Mouse Phenotype/SNP Analysis.
Component labels: Distributed Data Storage (three Mako
instances), processing, Distributed query, Mobius
Virtual Mako Service, internet, CERF Service Delivery
System (CERF Server, CERF Client), GOMiner, HTML Browser.
Data and action labels: Trait query, trait, SNP IDs,
HUGO IDs, BLAST Alignments, MouseTraitQueryResult
(resource instance), Action ViewGoMiner, ActionView.]
28
BRTT Demonstrations, June 21, 2000: Mouse
SNP-Trait Association Demonstration
Execute query
A CERF Action/Service binding allows the query to
be executed; the user is prompted for the name of
a trait (HDL6).
29
Image Data Processing
Digitized Microscopy
DCE-MRI Studies
Visualization of Terabyte Scale, Multiresolution
Data
30
Image Analysis Middleware Framework

The Distributed Metadata and Data Management
service to keep track of workflows, image
datasets, and analysis results.
Manage metadata associated with images, analysis
results, annotations in a distributed
environment.
Federate databases of image, clinical, and
molecular data in a distributed environment
The Image Data Storage Service to manage the
storage resources on the server and encapsulate
efficient storage methods for image datasets.
Create and maintain large-scale, on-line
databases of images (from gigabytes to multiple
terabytes in size) on disk-based storage
clusters.
The Distributed Execution Service to support
on-demand analysis of image data
Execute simple and complex image analysis
operations and workflows on distributed
collections of images.
Integrate data retrieval and processing on
commodity clusters and multiprocessor machines

31
BRTT Demonstrations, June 21, 2000: Rat Placenta
Microscopy Image Demonstration
Search for Images of Interest
32
Prototype based on caCORE

cancer Common Ontologic Representation
Environment (caCORE)
caCORE is the technology stack that facilitates
data integration across multiple scientific
disciplines

33
caGRID Core architecture
[Architecture diagram: caGRID Extension (Integration of
Discovery and Query Services). Labels as extracted:
Client; OGSA-DAI, Globus; caGRID extension (Concept
Discovery); caGRID extension (Federated Query);
OGSA-DAI; caGRID extension (metadata); caGRID extension
(query); Grid; Globus; Strongly typed XML transport,
Metadata Management (Mobius); Data Source; caBIO server.]
34
Petabyte sized rotating storage archives are no
longer hypothetical
35
Ohio Supercomputing Center Mass Storage Testbed

50 TB of performance storage
home directories, project storage space, and
long-term frequently accessed files.
420 TB of performance/capacity storage
Active Disk Cache - compute jobs that require
directly connected storage,
parallel file systems, and scratch space.
Large temporary holding area
128 TB tape library
Backups and long-term "offline" storage

IBM's Storage Tank technology combined with TFN
connections will allow large data sets to be
seamlessly moved throughout the state with
increased redundancy and seamless delivery.
