geneCBR: a case-based reasoning tool for cancer diagnosis using microarray datasets

1
geneCBR: a case-based reasoning tool for cancer
diagnosis using microarray datasets

Dr. Florentino Fdez-Riverola
University of Vigo

Computer System of New Generation
2
Outline

DNA Microarray Technology
characteristics and model operation overview
Bioinformatics and AI
new challenges and emerging research areas
CBR systems
case-based reasoning
GENE-CBR
human genome analysis using CBR systems
Demo
geneCBR in action: cancer diagnosis using microarrays

3
Microarrays: characteristics

silicon chips that can measure the expression levels of thousands of genes simultaneously
microarrays are based on a database of over 40,000 gene fragments called expressed sequence tags (ESTs)
they allow us, for the first time, to obtain a global view of cells belonging to:
  different individuals
  different time intervals for the same individual
  different tissues of the same individual
gene expression profiles can be used as inputs to large-scale data analysis:
  as fingerprints to build more accurate molecular classifications
  discovering hidden taxonomies
  increasing our understanding of normal and disease states

4
Microarrays: model operation overview

how does the chip work?
microarray chips incorporate different dyed genes tiled in a grid-like fashion
the individual's DNA to be analyzed is dyed with a different colour
both sets of labelled DNA strands are allowed to hybridize (bind)
hybridization events are detected by identifying fluorescence changes in the DNA strands
a scanner and its associated software perform various forms of image analysis to measure and report raw gene expression values
the scanned intensities show how active the genes represented by the ESTs are in the cell:
  strong fluorescence indicates that the gene is very active in the cell
  no fluorescence indicates that the gene is inactive in the cell

[Figure: processing pipeline from scanner through preprocessing to microarray data file]
5
Available data

bone marrow samples from 43 adult patients with Acute Myeloid Leukemia (AML) plus 6 healthy individuals:
  10 patients with Acute Promyelocytic Leukemia (APL)
  4 patients with Acute Myeloid Leukemia with inv(16) (AML-inv(16))
  7 patients with Acute Monocytic Leukemia (AML-mono)
  22 patients with Acute non-Monocytic Leukemia (AML-other)
  6 samples belonging to healthy individuals (control samples)

volume of information processed
each microarray contains 22,283 ESTs (≈ genes)
49 microarrays × 22,283 ESTs = 1,091,867 gene expression values

data available today
150 microarrays (Human Genome 133A) and 210 microarrays (Human Genome - plus)

6
Challenges for microarray Data Mining

three main types of data analysis are needed for biomedical applications:
gene selection (≈ attribute selection in AI)
  finding the genes most strongly related to a particular class
classification (≈ supervised classification in AI)
  classifying diseases or predicting outcomes based on gene expression patterns, and perhaps even identifying the best treatment for a given genetic signature
clustering (≈ unsupervised classification in AI)
  finding new biological classes or refining existing ones

three parallel research areas:
  convenient visualization of experiments and results
  discovery of biological knowledge (metabolic pathways, etc.)
  low-level analysis providing better readouts (preprocessing, normalization, etc.)

7
Problems with existing data

the analysis of microarrays presents a number of unique challenges for Machine Learning and Data Mining techniques
the capacity of microarrays to generate enormous amounts of data is also a handicap:
  large amount of data belonging to each individual (thousands of genes)
  efficiency and memory problems
lack of initial knowledge
  what is the significance level of each gene?
given the difficulty of collecting microarray samples, the number of samples is likely to remain small in many interesting cases
having so many fields relative to so few samples creates a high likelihood of finding false positives
these problems are aggravated by the potential errors present in microarray data (systematic and random errors)
sophisticated data analysis techniques and robust methods are required, capable of extracting biologically meaningful knowledge from the raw data
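
The false-positive risk mentioned above can be illustrated with a small sketch. It assumes a per-gene significance threshold of 0.05 (a value chosen here for illustration, not taken from the slides) and the simplifying assumption that no gene is truly related to the class, so every p-value is uniformly distributed. The function names are hypothetical.

```python
import random


def expected_false_positives(n_genes: int, alpha: float) -> float:
    """Expected number of genes flagged purely by chance when no gene is truly informative."""
    return n_genes * alpha


def simulate_false_positives(n_genes: int, alpha: float, seed: int = 0) -> int:
    """Count chance 'discoveries': under the null hypothesis each p-value is uniform on [0, 1)."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n_genes) if rng.random() < alpha)


N_GENES = 22283  # ESTs per microarray, as stated in the slides
ALPHA = 0.05     # assumed per-gene threshold (illustrative only)

if __name__ == "__main__":
    print("expected chance findings:", expected_false_positives(N_GENES, ALPHA))
    print("simulated chance findings:", simulate_false_positives(N_GENES, ALPHA))
```

With roughly a thousand genes expected to pass the threshold by chance alone, testing every gene independently on 49 samples is unreliable without some form of gene selection or correction.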

8
CBR systems (Case-Based Reasoning)

Kolodner (1983a, 1983b). Problem-solving paradigm in AI. It can be viewed as a methodology for reasoning and learning
reasoning by reusing past cases is a powerful and frequently applied way for humans to solve problems. Joh (1997)
the memory of the system (case base) stores a certain number of previously experienced situations
CASE = PROBLEM description + applied SOLUTION + RESULT
a new problem is solved by finding similar past cases and reusing them in the new problem situation
Riesbeck et al. (1989)
4 cyclical steps are performed whenever a new problem has to be solved
Kolodner (1993), Aamodt and Plaza (1994), Watson (1997)
Case-based reasoning is, in effect, a cyclic and integrated process of solving a problem, learning from this experience, solving a new problem, and so on...

9
The CBR cycle

(1) RETRIEVE: given a new problem, RETRIEVING one or more previously experienced cases (the most similar cases)
(2) REUSE: REUSING the case(s) in one way or another to obtain a proposed solution
(3) REVISE: REVISING the solution based on reusing the previous case(s) to obtain a confirmed solution
(4) RETAIN: RETAINING the new experience by incorporating it into the existing knowledge base (case base)
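
The four steps can be sketched as a minimal, generic loop. This is only an illustration of the cycle, not the geneCBR implementation: the `Case` fields mirror the problem / solution / result triple from the previous slide, while the Euclidean distance, the majority vote in `reuse`, and the `review` callback standing in for an expert are all assumptions made here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Case:
    problem: Sequence[float]     # problem description
    solution: str                # applied solution
    result: str = "unconfirmed"  # result of applying the solution


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two problem descriptions (assumed similarity measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


class CaseBase:
    def __init__(self, cases: List[Case]) -> None:
        self.cases = list(cases)

    def retrieve(self, problem: Sequence[float], k: int = 3) -> List[Case]:
        """(1) RETRIEVE the k most similar stored cases."""
        return sorted(self.cases, key=lambda c: distance(c.problem, problem))[:k]

    def reuse(self, similar: List[Case]) -> str:
        """(2) REUSE: propose the most frequent solution among the retrieved cases."""
        return Counter(c.solution for c in similar).most_common(1)[0][0]

    def revise(self, proposed: str, review: Callable[[str], str]) -> str:
        """(3) REVISE: let a reviewer confirm or correct the proposed solution."""
        return review(proposed)

    def retain(self, problem: Sequence[float], solution: str) -> None:
        """(4) RETAIN: store the confirmed experience as a new case."""
        self.cases.append(Case(problem, solution, result="confirmed"))

    def solve(self, problem: Sequence[float],
              review: Callable[[str], str] = lambda s: s, k: int = 3) -> str:
        proposed = self.reuse(self.retrieve(problem, k))
        confirmed = self.revise(proposed, review)
        self.retain(problem, confirmed)
        return confirmed


if __name__ == "__main__":
    base = CaseBase([Case((0.0, 0.0), "A"), Case((0.0, 1.0), "A"), Case((5.0, 5.0), "B")])
    print(base.solve((0.2, 0.1), k=2), len(base.cases))
```

Because `retain` appends every solved problem, the case base grows with use, which is the adaptive behaviour the cycle is meant to provide.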
10
Main characteristics of CBR systems

adaptive and dynamic systems: the number of cases stored in the memory of the model changes, allowing the system to adapt to new situations
CBR systems allow the use of general knowledge in solving a particular problem
CBR systems facilitate the indexing of the available information
CBR systems can use incomplete cases
CBR systems are aware of their limitations (perhaps a problem has no solution)
CBR systems facilitate the use of representative and flexible data structures
case adaptation helps to discover interconnections and hidden structures in the available data
CBR systems can be completely automated

11
GENE-CBR
12
GENE-CBR: goals and objectives
Develop an effective and reliable system able to diagnose cancer subtypes based on the analysis of microarray data
CBR system (Case-Based Reasoning): solve new problems (new patients) based on previous experience (diagnosed patients)
Implement a flexible tool for designing and testing new techniques and experiments
AI techniques: selection, clustering, inference
BeanShell programmer interface
Construct an advanced editing module for run-time modification of coded techniques
13
Logic architecture
[Figure: the CBR cycle around the CASE BASE (1 RETRIEVE, 2 REUSE, 3 REVISE, 4 RETAIN), with DFP and GCS components]
14
Model overview
[Figure labels: reclassification; Gene Selection; most relevant genes (DFP); Clustering; revised prediction and final diagnostic; genetically similar patients; Knowledge Discovery; initial prediction; Prediction]
15
GENE-CBR (i): retrieval

objectives
  perform gene selection without losing information
  extract simplified fuzzy patterns (FP) for each pathology
  possibility of using AI techniques that were initially discarded
main phases
  supervised fuzzy discretisation of gene expression values: Low, Medium, High and overlapping labels (LM, MH)
  supervised gene selection for each pathology
advantages
  independence from the ordering of the data
  takes data variability into account
  allows new knowledge to be discovered
  the results obtained are interpretable
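
The fuzzy discretisation phase can be sketched as follows. The slides do not give the membership functions, so this is a simplified stand-in: piecewise-linear memberships defined by three hypothetical breakpoints (`low_end`, `mid`, `high_start`) and a hypothetical threshold `theta`. A value receives every label whose membership reaches `theta`, which is how the overlapping labels LM and MH arise.

```python
from typing import Dict


def memberships(x: float, low_end: float, mid: float, high_start: float) -> Dict[str, float]:
    """Membership of expression value x in Low, Medium and High.

    Assumes low_end < mid < high_start; the three memberships always sum to 1.
    """
    def clamp(v: float) -> float:
        return max(0.0, min(1.0, v))

    low = clamp((mid - x) / (mid - low_end))       # 1 at or below low_end, 0 at or above mid
    high = clamp((x - mid) / (high_start - mid))   # 0 at or below mid, 1 at or above high_start
    medium = 1.0 - low - high                      # peaks at mid
    return {"L": low, "M": medium, "H": high}


def discretise(x: float, low_end: float, mid: float, high_start: float,
               theta: float = 0.4) -> str:
    """Return 'L', 'LM', 'M', 'MH' or 'H' (theta <= 0.5 guarantees at least one label)."""
    m = memberships(x, low_end, mid, high_start)
    return "".join(name for name in ("L", "M", "H") if m[name] >= theta)


if __name__ == "__main__":
    # Breakpoints below are illustrative values, not taken from the slides.
    for value in (50, 150, 200, 250, 400):
        print(value, discretise(value, low_end=100, mid=200, high_start=300))
```

In a supervised setting the breakpoints would be derived from the training samples of each gene rather than fixed by hand.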

16
GENE-CBR (i): retrieval
17
GENE-CBR (i): retrieval
[Figure labels: FP_AML-other, FP_healthy, FP_AML-inv(), FP_APL, FP_AML-monocytic, DFP, ...]
18
GENE-CBR (ii): reuse

objectives
  unsupervised identification of genetic similarities between patients
  taking into account only the previously selected genes (DFP)
main phases
  training a DFP-dimensional GCS network (Growing Cell Structures. Fritzke, B. (1993))
  presenting the new patient to the network
  classifying using a proportional weighted voting scheme
advantages
  clustering without taking the patient class into account
  definition of an indexing and similarity structure between nodes (relating patients)
  generation of clusters containing new subtypes of unknown cancer (knowledge discovery)
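
The voting step can be sketched in isolation. This example does not implement a GCS network: as a stand-in, it retrieves the `k` nearest stored patients and lets each one vote for its class with a weight that decreases with distance. The weighting formula `1 / (1 + distance)` and the function names are assumptions made for illustration, not the scheme used by geneCBR.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def weighted_vote(neighbours: List[Tuple[str, float]]) -> Dict[str, float]:
    """Turn (class label, distance) pairs into each class's share of the total vote."""
    weights: Dict[str, float] = defaultdict(float)
    for label, dist in neighbours:
        weights[label] += 1.0 / (1.0 + dist)  # assumed weighting: closer patients count more
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


def classify(new_patient: Sequence[float],
             case_base: List[Tuple[Sequence[float], str]],
             k: int = 3) -> Tuple[str, Dict[str, float]]:
    """Retrieve the k nearest patients (stand-in for the GCS cluster) and vote."""
    ranked = sorted(
        ((label, math.dist(profile, new_patient)) for profile, label in case_base),
        key=lambda pair: pair[1],
    )
    shares = weighted_vote(ranked[:k])
    return max(shares, key=shares.get), shares


if __name__ == "__main__":
    # Toy two-gene profiles; values are made up for the example.
    patients = [((0.0, 0.0), "APL"), ((0.0, 1.0), "APL"), ((3.0, 4.0), "AML-other")]
    print(classify((0.0, 0.0), patients))
```

Returning the per-class shares alongside the winning label keeps the evidence behind a prediction visible, which matters when the result is later reviewed by a clinician.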

19
GENE-CBR (ii): reuse
[Figure labels: Similarity, - Similarity, AML-inv()]
20
GENE-CBR (iii): revise

objectives
  provide doctors with meaningful information about the classification carried out by the system
  help in discovering new knowledge
  if-then rules as a decision-support mechanism
information supplied
  identification of similar patients (from a genetic point of view)
  proportional weighted voting and assigned weights
  rule generation using See5. Quinlan, J.R. (2000)
  DFP genes belonging to the set of patients retrieved by the GCS network
advantages
  doctors can supervise the final decision proposed by the system
  new knowledge generated in the form of easily understandable rules

21
GENE-CBR (iii): revise
KARYOTYPE
BIOLOGICAL AND CLINICAL CHARACTERISTICS
Rule 6 (45 / 4, lift 1.1)
If X65962 (AFFX-HSAC07/X00351_5_at) is LOW then
If U96781 (AFFX-BioDn-3_at) is LOW-MEDIUM then AML-other
Else If D87845 (AFFX-hum_alu_at) is HIGH then AML-inv() 0.968
22
GENE-CBR (iv): retain

objectives
  feed new knowledge back into the system
  new subclassification of existing cancer pathologies
  reclassification of existing patients
  identification of correlated genes
  discovery of new markers able to distinguish new pathologies
  identification of prototypical patients and rare cases
main phases
  update the case base with a new microarray every time a new classification is generated
  modification of the parameters of the model
advantages
  possibility of easily integrating new biological knowledge into the hybrid system