geneCBR: a case-based reasoning tool for cancer diagnosis using microarray datasets

1
geneCBR: a case-based reasoning tool for cancer
diagnosis using microarray datasets
  • Dr. Florentino Fdez-Riverola
  • University of Vigo

Computer System of New Generation
2
Outline
  • DNA Microarray Technology: characteristics and
    model operation overview
  • Bioinformatics and AI: new challenges and
    emerging research areas
  • CBR systems: case-based reasoning
  • GENE-CBR: human genome analysis using CBR systems
  • Demo: geneCBR in action, cancer diagnosis using
    microarrays

3
Microarrays: characteristics
  • silicon chips that can measure the expression
    levels of thousands of genes simultaneously
  • microarrays are based on a database of over
    40,000 gene fragments called expressed sequence
    tags (ESTs)
  • they allow us, for the first time, to obtain a
    global view of the cells belonging to:
  • different individuals
  • different time intervals for the same individual
  • different tissues of the same individual
  • gene expression profiles can be used as inputs to
    large-scale data analysis:
  • as fingerprints to build more accurate molecular
    classifications
  • for discovering hidden taxonomies
  • for increasing our understanding of normal and
    disease states

4
Microarrays: model operation overview
  • how does the chip work?
  • Microarray chips incorporate different dyed genes
    tiled in a grid-like fashion
  • The individual's DNA to be analyzed is dyed with
    a different colour
  • Both sets of labelled DNA strands are allowed to
    hybridize, or bind
  • hybridization events are detected by identifying
    fluorescence changes in the strands of DNA
  • a scanner and the associated software perform
    various forms of image analysis to measure and
    report raw gene expression values
  • the scanned intensities show how active the genes
    represented by the ESTs are in the cell
  • strong fluorescence indicates that the gene is
    very active in the cell
  • no fluorescence indicates that the gene is
    inactive in the cell

[Diagram labels: scanner, preprocessing, microarray data file]
5
Available data
  • bone marrow samples from 43 adult patients with
    Acute Myeloid Leukemia (AML) plus 6 healthy
    individuals
  • 10 patients with Acute Promyelocytic Leukemia
    (APL)
  • 4 patients with Acute Myeloid Leukemia with
    inv(16) (AML-inv(16))
  • 7 patients with Acute Monocytic Leukemia
    (AML-mono)
  • 22 patients with Acute non-Monocytic Leukemia
    (AML-other)
  • 6 samples belonging to healthy individuals
    (control samples)
  • volume of information processed:
  • each microarray contains 22,283 ESTs (≈ genes)
  • 49 microarrays × 22,283 ESTs = 1,091,867 gene
    expression values
  • data available today:
  • 150 microarrays (Human Genome 133A) and 210
    microarrays (Human Genome - plus)
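
The volume figure on this slide is the product of the number of samples and the number of ESTs per microarray. A minimal check in Python, using only the counts given above (the variable names are illustrative):

```python
# Counts taken from the slide: 43 AML patients plus 6 healthy controls,
# and 22,283 ESTs measured on each microarray.
subtype_counts = {"APL": 10, "AML-inv(16)": 4, "AML-mono": 7, "AML-other": 22}
n_controls = 6
ests_per_array = 22_283

n_patients = sum(subtype_counts.values())      # the four AML subtypes
n_microarrays = n_patients + n_controls        # one microarray per sample
total_values = n_microarrays * ests_per_array  # gene expression values

print(n_patients, n_microarrays, f"{total_values:,}")
```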

6
Challenges for microarray Data Mining
  • three main types of data analysis are needed for
    biomedical applications:
  • gene selection (≈ attribute selection in AI):
    finding the genes most strongly related to a
    particular class
  • classification (≈ supervised classification in
    AI): classifying diseases or predicting outcomes
    based on gene expression patterns, and perhaps
    even identifying the best treatment for a given
    genetic signature
  • clustering (≈ unsupervised classification in AI):
    finding new biological classes or refining
    existing ones
  • three parallel research areas:
  • convenient visualization of experiments and
    results
  • discovery of biological knowledge (metabolic
    pathways, etc.)
  • low-level analysis providing better readouts
    (preprocessing, normalization, etc.)

7
Problems with existing data
  • the analysis of microarrays presents a number of
    unique challenges for Machine Learning and Data
    Mining techniques
  • the capacity of microarrays for generating
    enormous amounts of data is also a handicap
  • a great amount of data belongs to each individual
    (thousands of genes)
  • efficiency and memory problems
  • lack of initial knowledge
  • what is the significance level of each gene?
  • given the difficulty of collecting microarray
    samples, the number of samples is likely to
    remain small in many interesting cases
  • having so many fields relative to so few samples
    creates a high likelihood of finding false
    positives
  • these problems are aggravated by the potential
    errors that can be present in microarray data
    (systematic and random errors)
  • sophisticated data analysis techniques and robust
    methods are required, capable of extracting
    biologically meaningful knowledge from the raw
    data
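
To give a feel for the false-positive problem mentioned above, the following sketch computes how many genes would look significant purely by chance. It assumes every gene is tested independently at the same significance level and that no gene is truly related to the class; the level 0.05 is an illustrative choice, not a value from the slides.

```python
def expected_false_positives(n_genes: int, alpha: float) -> float:
    """Expected number of genes passing a test at level alpha by chance alone."""
    return n_genes * alpha


def prob_at_least_one(n_genes: int, alpha: float) -> float:
    """Probability that at least one of n_genes independent tests is a false positive."""
    return 1.0 - (1.0 - alpha) ** n_genes


n_genes = 22_283   # ESTs per microarray, as stated in the slides
alpha = 0.05       # assumed significance level (illustrative)

print(expected_false_positives(n_genes, alpha))
print(prob_at_least_one(n_genes, alpha))
```

Under these assumptions, more than a thousand genes would be flagged by chance, which is why gene selection needs to be robust when only a few dozen samples are available.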

8
CBR systems (Case-Based Reasoning)
  • Kolodner (1983a, 1983b): a problem-solving
    paradigm in AI that can be viewed as a methodology
    for reasoning and learning
  • reasoning by reusing past cases is a powerful and
    frequently applied way for humans to solve
    problems (Joh, 1997)
  • the memory of the system (case base) stores a
    certain number of previously experienced
    situations
  • CASE = PROBLEM description + applied SOLUTION +
    RESULT
  • a new problem is solved by finding similar past
    cases and reusing them in the new problem
    situation
  • Riesbeck et al. (1989)
  • 4 cyclical steps are performed when it is
    necessary to solve a new problem
  • Kolodner (1993); Aamodt and Plaza (1994); Watson
    (1997)
  • case-based reasoning is, in effect, a cyclic and
    integrated process of solving a problem, learning
    from this experience, solving a new problem, and
    so on

9
The CBR cycle
  • (1) RETRIEVE: given a new problem, RETRIEVING one
    or more previously experienced cases (the most
    similar cases)
  • (2) REUSE: REUSING the case(s) in one way or
    another to obtain a proposed solution
  • (3) REVISE: REVISING the solution based on reusing
    previous case(s) to obtain a confirmed solution
  • (4) RETAIN: RETAINING the new experience by
    incorporating it into the existing knowledge base
    (case base)
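
The four steps can be sketched as a small generic loop. This is only an illustration of the cycle, not the GENE-CBR implementation: the `Case` and `CaseBase` names, the Euclidean distance and the majority vote are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from math import dist


@dataclass
class Case:
    problem: tuple               # problem description (a feature vector here)
    solution: str                # applied solution, e.g. a diagnosis
    result: str = "unconfirmed"  # outcome of applying the solution


class CaseBase:
    def __init__(self, cases):
        self.cases = list(cases)

    def retrieve(self, problem, k=3):
        """(1) RETRIEVE the k stored cases closest to the new problem."""
        return sorted(self.cases, key=lambda c: dist(c.problem, problem))[:k]

    @staticmethod
    def reuse(retrieved):
        """(2) REUSE: propose the most frequent solution among retrieved cases."""
        return Counter(c.solution for c in retrieved).most_common(1)[0][0]

    @staticmethod
    def revise(proposed, expert_solution=None):
        """(3) REVISE: an expert may override the proposal; otherwise it stands."""
        return proposed if expert_solution is None else expert_solution

    def retain(self, problem, confirmed):
        """(4) RETAIN the confirmed case so later problems can reuse it."""
        self.cases.append(Case(tuple(problem), confirmed, "confirmed"))


base = CaseBase([Case((0.1, 0.2), "A"), Case((0.2, 0.1), "A"), Case((0.9, 0.8), "B")])
new_problem = (0.15, 0.15)
proposed = base.reuse(base.retrieve(new_problem, k=2))
confirmed = base.revise(proposed)
base.retain(new_problem, confirmed)
print(proposed, len(base.cases))
```

Because the case base grows in the last step, the next retrieval already benefits from the newly confirmed case.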
10
Main characteristics of CBR systems
  • adaptive and dynamic systems: the number of cases
    stored in the memory of the model changes,
    allowing the system to adapt to new situations
  • CBR systems allow the use of general knowledge in
    the resolution of a particular problem
  • CBR systems facilitate the indexing of the
    available information
  • CBR systems can use incomplete cases
  • CBR systems are aware of their limitations
    (perhaps a problem has no solution)
  • CBR systems facilitate the use of representative
    and flexible data structures
  • case adaptation helps to discover
    interconnections and hidden structures in the
    available data
  • CBR systems can be completely automated

11
GENE-CBR
12
Goals and objectives of GENE-CBR
  • Develop an effective and reliable system able to
    diagnose cancer subtypes based on the analysis of
    microarray data
  • CBR system (Case-Based Reasoning): solve new
    problems (a new patient) based on previous
    experience (diagnosed patients)
  • Implement a flexible tool for designing and
    testing new techniques and experiments
  • AI techniques: selection, clustering, inference
  • BeanShell: programmer interface
  • Construct an advanced editing module for run-time
    modification of coded techniques
13
Logic architecture
[Diagram: the CBR cycle (1 RETRIEVE, 2 REUSE, 3 REVISE, 4 RETAIN) around the CASE BASE, with the DFP and GCS components]
14
Model overview
[Diagram labels: Gene Selection (most relevant genes, DFP); Clustering (genetically similar patients); Prediction (initial prediction); Knowledge Discovery (revised prediction and final diagnosis); reclassification]
15
GENE-CBR (i): retrieval
  • objectives:
  • perform gene selection without losing information
  • extract simplified fuzzy patterns (FP) for each
    pathology
  • make it possible to use AI techniques that were
    initially discarded
  • main phases:
  • supervised fuzzy discretisation of gene
    expression values
  • labels Low, Medium and High, plus the overlapping
    labels Low-Medium (LM) and Medium-High (MH)
  • supervised gene selection for each pathology
  • advantages:
  • independence from the ordering of the data
  • takes data variability into account
  • allows new knowledge to be discovered
  • the results obtained are interpretable
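
The fuzzy discretisation phase listed above can be illustrated with a small sketch. The slides do not give the membership functions, so everything here is an assumption made for the example: piecewise-linear Low/Medium/High memberships defined by three breakpoints (`low`, `mid`, `high`) and a membership threshold `theta`; a value whose membership reaches the threshold in two neighbouring sets receives an overlapping label (LM or MH).

```python
def memberships(value, low, mid, high):
    """Membership degrees of an expression value in the Low/Medium/High sets.

    Assumed shapes: Low falls linearly from `low` to `mid`, Medium peaks at
    `mid`, and High rises linearly from `mid` to `high`.
    """
    if value <= low:
        return {"L": 1.0, "M": 0.0, "H": 0.0}
    if value >= high:
        return {"L": 0.0, "M": 0.0, "H": 1.0}
    if value <= mid:
        m = (value - low) / (mid - low)
        return {"L": 1.0 - m, "M": m, "H": 0.0}
    h = (value - mid) / (high - mid)
    return {"L": 0.0, "M": 1.0 - h, "H": h}


def fuzzy_label(value, low, mid, high, theta=0.4):
    """Return L, LM, M, MH or H; theta must not exceed 0.5 so a label always exists."""
    degrees = memberships(value, low, mid, high)
    return "".join(name for name in ("L", "M", "H") if degrees[name] >= theta)


# Hypothetical breakpoints for one gene (not taken from the slides).
for value in (10, 25, 50, 75, 100):
    print(value, fuzzy_label(value, low=0, mid=50, high=100))
```

In a supervised setting the breakpoints would be derived from the training samples of each gene; here they are passed in directly to keep the sketch short.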

16
GENE-CBR (i): retrieval
17
GENE-CBR (i): retrieval
[Diagram: the fuzzy patterns FP_AML-other, FP_healthy, FP_AML-inv(), FP_APL, FP_AML-monocytic, ... feeding into the DFP]
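
One way to picture how per-pathology fuzzy patterns could be combined into a single gene set (the DFP) is sketched below. The slides do not state the selection rule, so the rule used here is an assumption: a gene enters a class's fuzzy pattern when at least a fraction `pi` of that class's samples share the same label, and it is kept in the final set when it carries different labels in at least two patterns. Gene names and labels are toy data.

```python
from collections import Counter


def fuzzy_pattern(samples, pi=0.75):
    """Build one class's pattern from its samples (each a dict gene -> label)."""
    pattern = {}
    for gene in samples[0]:
        label, count = Counter(s[gene] for s in samples).most_common(1)[0]
        if count / len(samples) >= pi:   # label is representative of the class
            pattern[gene] = label
    return pattern


def discriminant_genes(patterns):
    """Keep genes whose label differs between at least two class patterns."""
    labels_by_gene = {}
    for pattern in patterns.values():
        for gene, label in pattern.items():
            labels_by_gene.setdefault(gene, set()).add(label)
    return sorted(g for g, labels in labels_by_gene.items() if len(labels) >= 2)


# Toy discretised samples for two classes (hypothetical genes g1..g3).
apl = [{"g1": "H", "g2": "L", "g3": "M"}, {"g1": "H", "g2": "L", "g3": "L"}]
healthy = [{"g1": "L", "g2": "L", "g3": "M"}, {"g1": "L", "g2": "L", "g3": "M"}]

patterns = {"APL": fuzzy_pattern(apl), "healthy": fuzzy_pattern(healthy)}
selected = discriminant_genes(patterns)
print(patterns, selected)
```

In this toy data only g1 separates the two classes, so it is the only gene retained.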
18
GENE-CBR (ii): reuse
  • objectives:
  • unsupervised identification of genetic
    similarities between patients
  • taking into account only the previously selected
    genes (DFP)
  • main phases:
  • training a DFP-dimensional GCS network
  • Growing Cell Structures, Fritzke, B. (1993)
  • presenting the new patient to the network
  • classifying using a proportional weighted voting
    scheme
  • advantages:
  • clustering without taking the patient class into
    account
  • definition of an indexing and similarity
    structure between nodes (≈ relating patients)
  • generation of clusters that may contain new,
    unknown cancer subtypes (knowledge discovery)
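
The voting step can be illustrated independently of the GCS network, which is not reproduced here. In this sketch the patients grouped with the new one are given as (class, distance) pairs, and each votes with a weight proportional to its similarity; the similarity formula 1 / (1 + distance) is an assumption chosen for the example, not the weighting used by GENE-CBR.

```python
from collections import defaultdict


def weighted_vote(neighbours):
    """Return the winning class and each class's share of the total weight.

    neighbours: list of (class_label, distance) pairs for the retrieved patients.
    """
    scores = defaultdict(float)
    for label, distance in neighbours:
        scores[label] += 1.0 / (1.0 + distance)   # closer patients weigh more
    total = sum(scores.values())
    shares = {label: score / total for label, score in scores.items()}
    winner = max(shares, key=shares.get)
    return winner, shares


# Hypothetical neighbours of a new patient.
winner, shares = weighted_vote([("APL", 0.0), ("AML-other", 1.0), ("APL", 3.0)])
print(winner, shares)
```

The shares sum to one, so they can be shown to a doctor as the relative support for each diagnosis.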

19
GENE-CBR (ii): reuse
[Diagram labels: Similarity, - Similarity, AML-inv()]
20
GENE-CBR (iii): revise
  • objectives:
  • provide doctors with meaningful information about
    the classification carried out by the system
  • help in discovering new knowledge
  • if-then rules as a decision-support mechanism
  • information supplied:
  • identification of similar patients (from a
    genetic point of view)
  • proportional weighted voting and assigned weights
  • rule generation using See5, Quinlan, J.R. (2000)
  • DFP genes belonging to the set of patients
    retrieved by the GCS network
  • advantages:
  • doctors can supervise the final decision proposed
    by the system
  • new knowledge is generated in the form of easily
    understandable rules

21
GENE-CBR (iii): revise
KARYOTYPE
BIOLOGICAL AND CLINICAL CHARACTERISTICS
Rule 6 (45 / 4, lift 1.1):
If X65962 (AFFX-HSAC07/X00351_5_at) is LOW then
If U96781 (AFFX-BioDn-3_at) is LOW-MEDIUM then AML-other
Else If D87845 (AFFX-hum_alu_at) is HIGH then AML-inv() 0.968
22
GENE-CBR (iv): retain
  • objectives:
  • feed new knowledge back into the system
  • new subclassification of existing cancer
    pathologies
  • reclassification of existing patients
  • identification of correlated genes
  • discovery of new markers able to distinguish new
    pathologies
  • identification of prototypical patients and rare
    cases
  • main phases:
  • update the case base with a new microarray every
    time a new classification is generated
  • modification of the parameters of the model
  • advantages:
  • possibility of easily integrating new biological
    knowledge into the hybrid system

23
Applied technologies
  • Design patterns: Action, Future, MVC, Singleton,
    Wizard
  • 100% Java: Swing, BeanShell, Log4j, JFreeChart
  • Unified Modeling Language: Poseidon for UML

24
Future work
  • moving towards a plug-in architecture
  • designing a core where each technique is
    implemented as a plug-in -> aiBENCH
  • implementing k-fold cross-validation
  • automatic generation of multiple training and
    test cases
  • supporting standard microarray data formats
  • MIAME: Minimum Information About a Microarray
    Experiment
  • deploying GENE-CBR with Java Web Start
  • remote and automatic access to the latest
    versions of the GENE-CBR project
  • on-line access to genetic sequence databases
  • GenBank (http://www.ncbi.nlm.nih.gov/Genbank)
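
For the cross-validation item in the list above, the automatic generation of training and test cases could look like the following sketch. It is a plain k-fold split over sample indices; the function name, the fixed seed and the choice of 7 folds for the 49 microarrays are illustrative assumptions.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)         # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]    # k disjoint folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_splits(n_samples=49, k=7))
for train, test in splits:
    print(len(train), len(test))
```

Every sample appears in exactly one test fold, so each of the 49 microarrays is classified once by a model that never saw it during training.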

25
Demo: GENE-CBR in action
26
geneCBR: a case-based reasoning tool for cancer
diagnosis using microarray datasets
  • Dr. Florentino Fdez-Riverola
  • University of Vigo

Computer System of New Generation