Panel: The Broader Role of Artificial Intelligence in Large-Scale Scientific Research

Provided by: robpe
Learn more at: http://bmi.osu.edu

1
Panel: The Broader Role of Artificial
Intelligence in Large-Scale Scientific Research
  • Joel Saltz MD, PhD
  • Professor, Biomedical Informatics and Computer
    Science
  • Davis Chair in Cancer Research
  • The Ohio State University
  • Visiting Professor, ISI, USC

2
Translational Research
Disease mechanism, disease classification,
diagnosis, treatment
  • Biology, biotechnology, bioinformatics

Credit: NIH SPORE Guidelines
3
Imaging, Medical Analysis and Grid Environments
(IMAGE), September 16-18, 2003
  • Identify, query, retrieve, and carry out on-demand
    data product generation directed at collections
    of data from multiple sites/groups on a given
    topic; reproduce each group's data analysis and
    carry out new analyses on all datasets.
  • Users should be able to carry out entirely new
    analyses or to incrementally modify other
    scientists' data analyses.
  • Users should not have to worry about the physical
    location of data or processing.

4
caBIG: 50 Grid-Enabled Cancer Centers
5
Biomedical Research Grids: Types of Information
  • Radiological Studies
  • Pathology
  • Molecular (Proteomics, gene expression)
  • Genetic, Epigenetic (SNPs, haplotype analysis)
  • Laboratory, pharmacy, outcome data

6
Example: Some Data Types Generated by the OSU
Comprehensive Cancer Center
Shared Resource: Example data types
  • Molecular Cytogenetics: Datasets from karyotype analysis, data from SKY/FISH experiments
  • Analytical Cytometry: FACSCalibur output, raw data from DiVa Option system
  • Genotyping and Sequencing: DNA sequence information for the sample, primer sequence, PCR sequencing information
  • Microarray: Output from Affymetrix gene expression and custom microarray analyses
  • Mouse Phenotyping: Digitized radiographic, gross, and histologic images, hematology characterization
  • Real Time-PCR: Description of sample plate, raw and processed output files
  • Tissue Procurement: Anonymized pathology report, age, gender, race, tissue procurement id, patient id (if consent form is available), consent form, virtual slide of the tissue (if available)
  • Leukemia Tissue Bank: Sample processing date, date sample taken from the patient, accession id, patient id, patient name and last name, specimen type, number of tubes, diagnosis info, protocol
  • Proteomics: 1D and 2D gel images, sample information (PI name, analysis method, instrument name), diagnosis, protein expressions for spots
  • Clinical Trials Office: Protocol descriptions, investigator, title, status, approval processing ids, clinical data for patients on protocols, lab reports, adverse events, and trial outcomes
7
caBIG Problem Statement
  • Production of data outstripping our ability to
    analyze it
  • The research community may not be aware of other
    work and datasets
  • Researchers may not tag data with the definition
    of the data they produce
  • Semantic information is not often encoded nor
    included with data sets
  • Data Islands or Silos of information are
    produced based on the problems outlined above
  • A small group of knowledgeable people transmit
    data amongst themselves
  • Modern exploratory research requires the
    integration of disparate databases of biological
    information to explain results
  • To elucidate the mechanism behind disease we must
    aggregate data from many databases

Peter Corbitz NCI
8
Vision
  • Provide the Grid for Cancer Research so that we
    may:
  • Raise awareness of disparate datasets in the
    biological research community
  • Allow research groups to exchange datasets with
    ease
  • Allow research groups to understand the semantics
    of the datasets that they publish without always
    having to get on the phone
  • Allow for quicker publication of the analysis of
    integrated data
  • Blind Man/Elephant problem: high-throughput
    techniques and molecular imaging are powerful, but
    each contributes only a piece of the puzzle

Peter Corbitz - NCI
9
Bleeding Edge Pilot Project
10
Example: Ohio State BISTI Center for Grid-Enabled
Image Analysis, Novartis Molecular Imaging Studies
(Knopp)
11
Example: Genotype-Phenotype Correlation
  • Genetic, phenotypic data related via phylogenetic
    tree
  • 3473 SNPs among 11 strains of inbred mice
  • Tree represents ancestry relationship between
    strains
  • C3H/HeJ and DBA/2J: mutations associated with
    high heart rate variability
  • Once candidate genotypes are identified, gather
    additional information about candidates from
    wrapped data sources
  • Integration of Gene-Drug relationships in cancer
    treatment
  • (Janies, Knoblock, Khan, Saltz)

12
Example: Classification of Neuroblastoma
  • Decision tree models the current classification
    system
  • Close collaboration with the leading pathologist
    (Shimada, USC) who developed the classifications
  • Automate analysis, correlate with molecular,
    outcome data
  • Classification determines treatment
  • Children's Oncology Group
  • Scope: North America, Australia, New Zealand

13
Example: Use Case from Abramson Cancer Center
  • Use Case
  • A researcher would like to study the error rate in
    pathological diagnoses of solid tumor samples
    and compare numerous molecular diagnostic
    approaches to determine if the molecular
    diagnostic approach can enhance the accuracy of
    pathological diagnoses.
  • Query
  • I want all solid tumors, specifically for lung
    cancer, that have a diagnosis based on tumor
    pathology. Each diagnosis must have an image of
    the tumor that allows for independent
    verification of diagnoses. Each record retrieved
    must also have either proteomics marker data or
    microarray data (Affy or two-color) included so
    that different molecular techniques can be
    correlated to the tumor pathology. In addition, I
    want all protein annotations for markers and
    genes associated with the proteomics and
    microarray data so I can perform meta-analyses.
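The selection criteria in the query above can be sketched as a simple filter. This is only an illustration: the `TumorRecord` type and all of its field names are hypothetical and do not come from caBIG or any real schema, and the annotation lookup requested at the end of the query is not covered.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TumorRecord:
    # Hypothetical record layout; fields only mirror the criteria in the query text.
    record_id: str
    tumor_site: str
    is_solid: bool
    pathology_diagnosis: Optional[str] = None
    image_uri: Optional[str] = None
    proteomics_markers: List[str] = field(default_factory=list)
    microarray_platform: Optional[str] = None  # assumed values: "affy" or "two-color"


def matches_use_case(rec: TumorRecord) -> bool:
    """True if the record satisfies every condition stated in the query."""
    has_molecular = bool(rec.proteomics_markers) or rec.microarray_platform in {
        "affy",
        "two-color",
    }
    return (
        rec.is_solid
        and rec.tumor_site == "lung"
        and rec.pathology_diagnosis is not None  # diagnosis based on tumor pathology
        and rec.image_uri is not None  # image needed for independent verification
        and has_molecular  # proteomics marker data or microarray data
    )


def select_records(records: List[TumorRecord]) -> List[TumorRecord]:
    """Keep only the records that match the use case."""
    return [r for r in records if matches_use_case(r)]
```

In a grid setting each condition would be evaluated against a different data source; the point of the sketch is only that every clause of the query maps to a checkable predicate.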

14
Issues: Biomedical Grid Architecture
  • What metadata needs to be described?
  • How to enforce standardization and completeness
    of metadata description?
  • Is it practical for everyone to use the same
    ontologies?
  • If not, how to handle local variations in a
    controlled manner?
  • Are there middleware solutions that can help?
    (yes!)
  • Data grid techniques for query and management of
    very large grid-based datasets
  • Role for immutable grid-based datatypes?

15
Issues: Biomedical Grid Architecture
  • Distributed Ontology Management
  • Many sites, many types of complex data
  • Sites need to have freedom to create local
    ontology variants in controlled manner
  • Systematic methods for controlled management,
    query of ontology variants
  • Heuristic datatyping
  • Data quality control
  • Well-defined structure (e.g. XML schema) and
    sanity checks to verify accuracy and
    completeness of metadata
  • e.g. are data values consistent with what would
    be expected in an Affymetrix gene expression
    dataset?
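The two kinds of quality control mentioned above (completeness of metadata and plausibility of data values) can be sketched with the Python standard library. The required field names and the plausible intensity range are assumptions made for illustration; they are not taken from any Affymetrix specification or from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of metadata fields a site might require.
REQUIRED_FIELDS = ("platform", "sample_id", "organism")


def check_metadata(xml_text):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    root = ET.fromstring(xml_text)
    for name in REQUIRED_FIELDS:
        node = root.find(name)
        # A field counts as missing if the element is absent or has no text.
        if node is None or not (node.text or "").strip():
            problems.append(f"missing required field: {name}")
    return problems


def check_expression_values(values, lower=0.0, upper=1e6):
    """Flag values outside an assumed plausible intensity range."""
    return [
        f"value {v} at index {i} is outside the expected range"
        for i, v in enumerate(values)
        if not (lower <= v <= upper)
    ]
```

A full solution would validate against the registered XML schema itself; this sketch only shows the shape of a completeness check and a value sanity check.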

16
Issues: In Silico Research - Not Your Father's
Data Mining
  • Example: Predict clinical outcome
  • Goal: optimize function F that predicts outcome
    by combining clinical, molecular, image data
  • Molecular, image data in turn need to be
    interpreted and analyzed
  • Need to find image analysis functions G_i and
    molecular data analysis functions H_i that make it
    possible to best predict outcome
  • Functions G_i and H_i make use of domain-specific
    knowledge (e.g. phylogenetic trees, histologic
    classifications, pathways)
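The composition described above, where F combines clinical data with the outputs of image analysis functions and molecular analysis functions, can be sketched as follows. Everything here is a toy: `mean_intensity`, `count_expressed`, and `toy_F` are hypothetical stand-ins, and no optimization of F is attempted.

```python
def make_predictor(F, G, H):
    """Build a predictor from a combiner F, image functions G, and molecular functions H."""

    def predict(clinical, image, molecular):
        image_features = [g(image) for g in G]  # one feature per image analysis function
        molecular_features = [h(molecular) for h in H]  # one per molecular analysis function
        return F(clinical, image_features, molecular_features)

    return predict


def mean_intensity(image):
    """Toy image analysis function: average pixel value."""
    return sum(image) / len(image)


def count_expressed(molecular):
    """Toy molecular analysis function: number of genes with nonzero expression."""
    return sum(1 for level in molecular.values() if level > 0)


def toy_F(clinical, image_features, molecular_features):
    """Toy combiner: an arbitrary, unfitted score, not a real outcome model."""
    return clinical["age"] / 10 + image_features[0] + molecular_features[0]


predict = make_predictor(toy_F, G=[mean_intensity], H=[count_expressed])
```

In practice the search is over which G and H to use as well as over the parameters of F, which is what distinguishes this from data mining on a fixed feature table.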

17
Issues: Incorporating Ad-hoc Data Sources,
Information Integration
  • Not all data sources will post precise metadata
    definitions, ontologies etc
  • Biomedical researchers should be able to use
    non-conforming data
  • Goal is to develop a system that automates the
    integration of data sources in a way that is
    easy and natural
  • Wrap data sources, define metadata, ontologies
    that allow data integration
  • Middleware to cache ad-hoc data and to make
    ad-hoc information a first class grid citizen
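A minimal sketch of wrapping and caching an ad-hoc source, assuming the source delivers CSV text. The `WrappedSource` class, its `fetch` callback, and the field mapping are hypothetical; they are not part of Mobius or any other middleware named in this talk.

```python
import csv
import io


class WrappedSource:
    """Wrap an ad-hoc CSV source and expose its rows under shared field names."""

    def __init__(self, name, fetch, field_map):
        self.name = name
        self._fetch = fetch  # callable returning raw CSV text (assumed interface)
        self._field_map = field_map  # local column name -> shared field name
        self._cache = None

    def records(self):
        # Fetch and translate once, then serve later calls from the cache.
        if self._cache is None:
            rows = csv.DictReader(io.StringIO(self._fetch()))
            self._cache = [
                {shared: row[local] for local, shared in self._field_map.items()}
                for row in rows
            ]
        return self._cache
```

The field map plays the role of the metadata definition: it is the only place where knowledge about the non-conforming source is recorded.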

18
Issues: Security Requirements
  • Patients can give consent for some data to be
    used in some studies or classes of studies
  • Researchers need to be able to control access to
    data
  • IRBs can approve release of identified data to
    some individuals
  • IRBs can specify how deidentification is to be
    carried out and when deidentified data can be
    released
  • Cooperative study may have different IRB-dictated
    constraints at different sites
  • Individuals associated with a given study may
    have different roles with different data access
    permissions
  • Access requests and successful accesses must be
    logged
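The requirements above (patient consent per study, role-dependent permissions, and logging of every request along with its outcome) can be sketched together. The `AccessPolicy` class, the role names, and the data categories are assumptions for illustration only; IRB-specific and site-specific constraints are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class AccessPolicy:
    # patient_id -> set of study ids the patient has consented to (hypothetical layout)
    consented_studies: dict
    # role -> set of data categories the role may read, e.g. "identified", "deidentified"
    role_permissions: dict
    # every request is appended here, whether or not it was granted
    audit_log: list = field(default_factory=list)

    def request(self, user, role, study, patient_id, category):
        """Decide one access request and record it in the audit log."""
        consented = study in self.consented_studies.get(patient_id, set())
        permitted = category in self.role_permissions.get(role, set())
        allowed = consented and permitted
        self.audit_log.append((user, role, study, patient_id, category, allowed))
        return allowed
```

Because denied requests are logged too, the log covers both access requests and successful accesses, as the last requirement asks.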

19
Issues: Security Requirements
  • Need ontology-based description of roles,
    ownership, permissions
  • Validate correctness of description
  • Conditions should lead to expected results (i.e.
    if one specifies rules, does one end up with
    counter-intuitive results?)

20
Thanks!
21
Mobius
  • Middleware system that provides support for
    management of metadata definitions (defined as
    XML schemas) and efficient storage and retrieval
    of data instances in a distributed environment.
  • Mechanism for data-driven applications to cache,
    share, and asynchronously communicate data in a
    distributed environment
  • Grid-based distributed, searchable, and shareable
    persistent storage
  • Infrastructure for grid coordination language

22
Global Model Exchange
  • Store and link data models defined inside
    namespaces in the grid.
  • Enables other services to publish, retrieve,
    discover, remove, and version metadata
    definitions
  • Services composed in a DNS-like architecture
    representing parent-child namespace hierarchy
  • When a schema is registered in GME, it is stored
    under the name and namespace specified by the
    application, and the schema is assigned a version
    number
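The registration behavior described above can be sketched as an in-memory registry. This is not the GME API: `SchemaRegistry` and its method names are hypothetical, versions are assumed to start at 1, and the DNS-like hierarchy is approximated by treating a namespace ending in ".parent" as a child of "parent".

```python
class SchemaRegistry:
    """Toy stand-in for a schema registry keyed by (namespace, name)."""

    def __init__(self):
        self._store = {}  # (namespace, name) -> list of schema texts, oldest first

    def publish(self, namespace, name, schema_text):
        """Store a schema and return the version number assigned to it."""
        versions = self._store.setdefault((namespace, name), [])
        versions.append(schema_text)
        return len(versions)

    def retrieve(self, namespace, name, version=None):
        """Return the requested version, or the latest one if none is given."""
        versions = self._store[(namespace, name)]
        return versions[-1] if version is None else versions[version - 1]

    def discover(self, parent_namespace):
        """List schemas registered in a namespace or in any of its children."""
        suffix = "." + parent_namespace
        return sorted(
            key
            for key in self._store
            if key[0] == parent_namespace or key[0].endswith(suffix)
        )
```

A distributed implementation would delegate `discover` and `retrieve` to the service responsible for each namespace, in the same way DNS delegates zones.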

25
Finding candidate genes
  • Complex, multi-factorial diseases
  • CAD, diabetes, schizophrenia
  • Long candidate lists
  • Variations between individuals in their response
    to treatments
  • Non-responders to statin treatment
  • GOAL: Link phenotypes to genotypes

26
Genotype <-> Phenotype
  • APPROACH: establish QTLs by correlating changes
    in genotype with changes in phenotype
  • Genotype: SNPs
  • Use in mapping: haplotype blocks
  • Functional: come in many flavors (cis-regulatory,
    non-synonymous, mRNA stability)
  • Phenotypes: Mice
  • Inbred strains provide well characterized
    genotypes
  • Publicly available quantitative phenotype data
    (http://www.jax.org/phenome/)
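The idea of correlating genotype with phenotype across inbred strains can be sketched by ranking SNPs on the Pearson correlation between a 0/1 allele coding and a quantitative trait. This is a simplification made for illustration, not the method used in the demonstration: it ignores strain ancestry and multiple testing, and the data layout is an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_snps(genotypes, phenotype):
    """Rank SNPs by the absolute correlation of allele (0/1) with the trait.

    genotypes: snp_id -> {strain: 0 or 1}   (hypothetical layout)
    phenotype: strain -> trait value
    """
    strains = sorted(phenotype)
    trait = [phenotype[s] for s in strains]
    scores = {
        snp: pearson([alleles[s] for s in strains], trait)
        for snp, alleles in genotypes.items()
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)
```

SNPs at the top of the ranking would be the candidates for which additional information is then gathered from other sources.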

27
BISP Demonstration: Mouse Phenotype Trait Query
[Diagram: Mouse Phenotype/SNP Analysis. Labeled components: Distributed Data Storage (three Mako instances), distributed query processing, Mobius Virtual Mako Service, internet. Data labels: trait query, trait, SNP IDs, HUGO IDs, BLAST alignments. CERF Service Delivery System: CERF Server, CERF Client, HTML Browser, MouseTraitQueryResult (resource instance), actions ViewGoMiner and ActionView, GOMiner.]
28
BRTT Demonstrations, June 21, 2000: Mouse
SNP-Trait Association Demonstration
Execute query
A CERF Action/Service binding allows the query to
be executed; the user is prompted for the name of
a trait (HDL6).
29
Image Data Processing
Digitized Microscopy
DCE-MRI Studies
Visualization of Terabyte Scale, Multiresolution
Data
30
Image Analysis Middleware Framework
  • The Distributed Metadata and Data Management
    Service keeps track of workflows, image
    datasets, and analysis results.
  • Manage metadata associated with images, analysis
    results, annotations in a distributed
    environment.
  • Federate databases of image, clinical, and
    molecular data in a distributed environment
  • The Image Data Storage Service manages the
    storage resources on the server and encapsulates
    efficient storage methods for image datasets.
  • Create and maintain large-scale, on-line
    databases of images (from gigabytes to multiple
    terabytes in size) on disk-based storage
    clusters.
  • The Distributed Execution Service supports
    on-demand analysis of image data.
  • Execute simple and complex image analysis
    operations and workflows on distributed
    collections of images.
  • Integrate data retrieval and processing on
    commodity clusters and multiprocessor machines

31
BRTT Demonstrations, June 21, 2000: Rat Placenta
Microscopy Image Demonstration
Search for Images of Interest
32
Prototype based on caCORE
  • cancer Common Ontologic Representation
    Environment (caCORE)
  • caCORE is the technology stack that facilitates
    data integration across multiple scientific
    disciplines

33
caGRID Core Architecture
[Diagram: caGRID Extension (Integration of Discovery and Query Services). Labeled layers: Client; OGSA-DAI and Globus with caGRID extensions for Concept Discovery and Federated Query; Grid with OGSA-DAI and Globus plus caGRID extensions for metadata and query; strongly typed XML transport and metadata management (Mobius); Data Source: caBIO server.]
34
Petabyte-sized rotating storage archives are no
longer hypothetical
35
Ohio Supercomputer Center Mass Storage Testbed
  • 50 TB of performance storage
  • home directories, project storage space, and
    long-term frequently accessed files.
  • 420 TB of performance/capacity storage
  • Active Disk Cache - compute jobs that require
    directly connected storage
  • parallel file systems, and scratch space.
  • Large temporary holding area
  • 128 TB tape library
  • Backups and long-term "offline" storage

IBM's Storage Tank technology combined with TFN
connections will allow large data sets to be
seamlessly moved throughout the state with
increased redundancy and seamless delivery.