Title: geneCBR: a case-based reasoning tool for cancer diagnosis using microarray datasets
1. geneCBR: a case-based reasoning tool for cancer diagnosis using microarray datasets
- Dr. Florentino Fdez-Riverola
- University of Vigo
Computer System of New Generation
2. Outline
- DNA Microarray Technology
  - characteristics and model operation overview
- Bioinformatics and AI
  - new challenges and emerging research areas
- CBR systems
  - case-based reasoning
- GENE-CBR
  - human genome analysis using CBR systems
- Demo
  - geneCBR in action: cancer diagnosis using microarrays
3. Microarrays: characteristics
- silicon chips that can measure the expression levels of thousands of genes simultaneously
- microarrays are based on a database of over 40,000 fragments of genes called expressed sequence tags (ESTs)
- allow us, for the first time, to obtain a global view of the cells belonging to
  - different individuals
  - different time intervals for the same individual
  - different tissues of the same individual
- gene expression profiles can be used as inputs to large-scale data analysis
  - as fingerprints to build more accurate molecular classification
  - discovering hidden taxonomies
  - increasing our understanding of normal and disease states
4. Microarrays: model operation overview
- how does the chip work?
  - microarray chips incorporate different dyed genes tiled in a grid-like fashion
  - the individual's DNA to be analyzed is dyed with a different colour
  - both sets of labelled DNA strands are allowed to hybridize (bind)
  - hybridization events are detected by identifying fluorescence changes in the strands of DNA
  - a scanner and the associated software perform various forms of image analysis to measure and report raw gene expression values
- the scanned intensities show how active the genes represented by the ESTs are in the cell
  - strong fluorescence indicates that the gene is very active in the cell
  - no fluorescence indicates that the gene is inactive in the cell
[Figure: scanner, preprocessing, microarray data file]
5. Available data
- bone marrow samples from 43 adult patients with Acute Myeloid Leukemia (AML) plus 6 healthy individuals
  - 10 patients with Acute Promyelocytic Leukemia (APL)
  - 4 patients with Acute Myeloid Leukemia with inv(16) (AML-inv(16))
  - 7 patients with Acute Monocytic Leukemia (AML-mono)
  - 22 patients with Acute non-Monocytic Leukemia (AML-other)
  - 6 samples belonging to healthy individuals (control samples)
- volume of information processed
  - each microarray contains 22,283 ESTs (≈ genes)
  - 49 microarrays: 1,091,867 gene expression values
- data available today
  - 150 microarrays (Human Genome 133A) and 210 microarrays (Human Genome - plus)
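The volume figure on this slide follows directly from the sample counts. A minimal sketch of the arithmetic, using only the numbers listed above (the variable names are illustrative and not part of geneCBR):

```python
# Sample counts per class, as listed on the slide.
samples_per_class = {
    "APL": 10,
    "AML-inv(16)": 4,
    "AML-mono": 7,
    "AML-other": 22,
    "control": 6,
}
ESTS_PER_MICROARRAY = 22_283  # ESTs measured on each microarray

# One microarray per sample, so the totals are simple sums and products.
n_patients = sum(n for name, n in samples_per_class.items() if name != "control")
n_microarrays = sum(samples_per_class.values())
n_values = n_microarrays * ESTS_PER_MICROARRAY

print(n_patients)     # 43
print(n_microarrays)  # 49
print(n_values)       # 1091867
```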
6. Challenges for microarray data mining
- three main types of data analysis needed for biomedical applications
  - gene selection (≈ attribute selection in AI)
    - find the genes most strongly related to a particular class
  - classification (≈ supervised classification in AI)
    - classifying diseases or predicting outcomes based on gene expression patterns, and perhaps even identifying the best treatment for a given genetic signature
  - clustering (≈ unsupervised classification in AI)
    - finding new biological classes or refining existing ones
- three parallel research areas
  - convenient visualization of experiments and results
  - discovery of biological knowledge (metabolic pathways, etc.)
  - low-level analysis providing better readouts (preprocessing, normalization, etc.)
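To make the gene selection task concrete, here is a minimal sketch that ranks genes by how strongly they separate one class from the rest. The scoring function (difference of class means divided by the sum of standard deviations) is an assumption chosen for illustration; it is not the selection method used by GENE-CBR, and the gene names and values are made up.

```python
from statistics import mean, pstdev

def rank_genes(expression, labels, target_class):
    """Rank genes by how well they separate target_class from the rest.

    expression: dict mapping gene id -> list of expression values, one per sample
    labels: list of class labels, one per sample
    Score (illustrative assumption): |mean_in - mean_out| / (sd_in + sd_out)
    """
    scores = {}
    for gene, values in expression.items():
        inside = [v for v, c in zip(values, labels) if c == target_class]
        outside = [v for v, c in zip(values, labels) if c != target_class]
        spread = pstdev(inside) + pstdev(outside)
        # Small epsilon avoids division by zero for constant genes.
        scores[gene] = abs(mean(inside) - mean(outside)) / (spread + 1e-9)
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: four samples, three hypothetical genes.
labels = ["APL", "APL", "AML-other", "AML-other"]
expression = {
    "gene_a": [9.0, 9.2, 1.0, 1.1],  # clearly higher in APL
    "gene_b": [5.0, 5.1, 5.0, 4.9],  # almost the same in both classes
    "gene_c": [2.0, 6.0, 3.0, 5.0],  # noisy, no class difference on average
}
ranking = rank_genes(expression, labels, "APL")
print(ranking)  # ['gene_a', 'gene_b', 'gene_c']
```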
7. Problems with existing data
- analysis of microarrays presents a number of unique challenges for Machine Learning and Data Mining techniques
- its capacity for generating enormous amounts of data is, however, also a handicap
  - large amount of data belonging to each individual (thousands of genes)
    - efficiency and memory problems
  - lack of initial knowledge
    - what is the significance level of each gene?
  - given the difficulty of collecting microarray samples, the number of samples is likely to remain small in many interesting cases
    - having so many fields relative to so few samples creates a high likelihood of finding false positives
  - these problems are compounded by the potential errors that can be present in microarray data (systematic and random errors)
- sophisticated data analysis techniques and robust methods are required, capable of extracting biologically meaningful knowledge from the raw data
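The false-positive risk can be made tangible with a back-of-the-envelope estimate. The sketch below assumes every gene is tested separately at a per-gene significance level of 0.05; that threshold is an assumption for illustration and is not stated in the slides. The gene count is the one given on the "Available data" slide.

```python
N_GENES = 22_283  # ESTs per microarray (from the "Available data" slide)
ALPHA = 0.05      # assumed per-gene significance level, not from the slides

# If no gene were truly related to the class, each test would still pass
# with probability ALPHA, so the expected number of spurious hits is:
expected_false_positives = N_GENES * ALPHA

print(round(expected_false_positives))  # 1114
```

Under these assumptions, more than a thousand genes would look significant by chance alone, which is why a small number of samples cannot be analysed gene by gene without further safeguards.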
8. CBR systems (Case-Based Reasoning)
- Kolodner (1983a, 1983b). Problem-solving paradigm in AI. It can be viewed as a methodology for reasoning and learning
  - reasoning by re-using past cases is a powerful and frequently applied way for humans to solve problems, Joh (1997)
- the memory of the system (case base) stores a certain number of previously experienced situations
  - CASE = PROBLEM description + applied SOLUTION + RESULT
- a new problem is solved by finding similar past cases and reusing them in the new problem situation
  - Riesbeck et al. (1989)
- 4 cyclical steps are performed when it is necessary to solve a new problem
  - Kolodner (1993), Aamodt and Plaza (1994), Watson (1997)
- case-based reasoning is, in effect, a cyclic and integrated process of solving a problem, learning from this experience, solving a new problem, and so on
9. The CBR cycle
- (1) RETRIEVE: RETRIEVING one or more previously experienced cases
- (2) REUSE: REUSING the case(s) in one way or another
- (3) REVISE: REVISING the solution based on reusing previous case(s)
- (4) RETAIN: RETAINING the new experience by incorporating it into the existing knowledge base (case base)
[Figure: the CBR cycle. New problem → (1) RETRIEVE → most similar cases → (2) REUSE → proposed solution → (3) REVISE → confirmed solution → (4) RETAIN]
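The four steps can be sketched as a small loop over a case base. This is a generic illustration of the cycle, not GENE-CBR's implementation: the nearest-neighbour retrieval, the majority-vote reuse, the optional expert override in the revise step, and all names (`Case`, `solve`, and so on) are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from math import dist

@dataclass
class Case:
    problem: tuple   # problem description (here: a numeric feature vector)
    solution: str    # solution that was applied
    result: bool     # whether the solution was confirmed

def retrieve(case_base, problem, k=3):
    """(1) RETRIEVE: the k stored cases closest to the new problem."""
    return sorted(case_base, key=lambda c: dist(c.problem, problem))[:k]

def reuse(similar_cases):
    """(2) REUSE: propose the most common solution among the retrieved cases."""
    return Counter(c.solution for c in similar_cases).most_common(1)[0][0]

def revise(proposed, expert_solution=None):
    """(3) REVISE: an expert may override the proposal; otherwise accept it."""
    return expert_solution if expert_solution is not None else proposed

def retain(case_base, problem, confirmed):
    """(4) RETAIN: store the solved problem as a new case."""
    case_base.append(Case(problem, confirmed, True))

def solve(case_base, problem, expert_solution=None):
    proposed = reuse(retrieve(case_base, problem))
    confirmed = revise(proposed, expert_solution)
    retain(case_base, problem, confirmed)
    return confirmed

case_base = [
    Case((1.0, 1.0), "A", True),
    Case((1.2, 0.9), "A", True),
    Case((5.0, 5.0), "B", True),
    Case((5.1, 4.8), "B", True),
]
print(solve(case_base, (1.1, 1.0)))  # A
print(len(case_base))                # 5
```

Because every solved problem is appended to the case base, the next call to `solve` can already retrieve it, which is the "learning from this experience" part of the cycle.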
10. Main characteristics of CBR systems
- adaptive and dynamic systems: the number of cases stored in the memory of the model changes, allowing the system to adapt to new situations
- CBR allows the use of general knowledge in the resolution of a particular problem
- CBR facilitates the indexing of the available information
- CBR can use incomplete cases
- CBR systems are aware of their limitations (perhaps a problem has no solution)
- CBR facilitates the use of representative and flexible data structures
- case adaptation helps to discover interconnections and hidden structures in the available data
- CBR can be completely automated
11. GENE-CBR
12. GENE-CBR: goals and objectives
- develop an effective and reliable system able to diagnose cancer subtypes based on the analysis of microarray data
  - CBR system (Case-Based Reasoning): solve new problems (new patient) based on previous experience (diagnosed patients)
- implement a flexible tool for designing and testing new techniques and experiments
  - AI techniques: selection, clustering, inference
- construct an advanced editing module for run-time modification of coded techniques
  - BeanShell: programmer interface
13. Logic architecture
[Figure: logic architecture. Labels: CASE BASE, 1 RETRIEVE, 2 REUSE, 3 REVISE, 4 RETAIN, DFP, GCS]
14. Model overview
[Figure: model overview. Labels: Gene Selection, most relevant genes (DFP), Clustering, genetically similar patients, Prediction, initial prediction, Knowledge Discovery, revised prediction and final diagnostic, reclassification]
15. GENE-CBR (i): retrieval
- objectives
  - perform gene selection without losing information
  - extract simplified fuzzy patterns (FP) for each pathology
  - possibility of using AI techniques initially discarded
- main phases
  - supervised fuzzy discretisation of gene expression values
    - Low, Medium, High and overlapping labels (LM, MH)
  - supervised gene selection for each pathology
- advantages
  - independence of the ordering existing in the data
  - takes data variability into account
  - allows new knowledge to be discovered
  - obtained results are interpretable
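A minimal sketch of what a fuzzy discretisation with overlapping labels might look like. The piecewise-linear membership functions, the fixed value range and the threshold `theta` are assumptions made for illustration; the slides do not give the actual membership functions, and this simplified version is not supervised (it ignores the class labels).

```python
def memberships(x, lo, hi):
    """Membership degrees of x in Low, Medium and High over [lo, hi] (hi > lo)."""
    t = min(max((x - lo) / (hi - lo), 0.0), 1.0)  # rescale to [0, 1]
    return {
        "L": max(0.0, 1.0 - 2.0 * t),    # 1 at the minimum, 0 from the midpoint up
        "M": 1.0 - abs(2.0 * t - 1.0),   # triangle peaking at the midpoint
        "H": max(0.0, 2.0 * t - 1.0),    # 0 up to the midpoint, 1 at the maximum
    }

def fuzzy_label(x, lo, hi, theta=0.4):
    """Return L, M, H, or an overlapping label (LM, MH).

    A label is kept when its membership degree reaches theta. With these
    membership functions, theta <= 0.5 guarantees at least one label.
    """
    degrees = memberships(x, lo, hi)
    return "".join(name for name in "LMH" if degrees[name] >= theta)

values = (10, 25, 50, 75, 95)
print([fuzzy_label(v, 0, 100) for v in values])  # ['L', 'LM', 'M', 'MH', 'H']
```

Values close to the boundary between two labels receive both (LM or MH) instead of being forced into one of them, which is one way of not discarding information during discretisation.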
16. GENE-CBR (i): retrieval
17. GENE-CBR (i): retrieval
[Figure: FP_AML-other, FP_healthy, FP_AML-inv(), FP_APL, FP_AML-monocytic, ..., DFP]
18. GENE-CBR (ii): reuse
- objectives
  - unsupervised identification of genetic similarities between patients
  - taking into account only the previously selected genes (DFP)
- main phases
  - training a DFP-dimensional GCS network
    - Growing Cell Structures. Fritzke, B. (1993)
  - presenting the new patient to the network
  - classifying using a proportional weighted voting scheme
- advantages
  - clustering without taking the patient class into account
  - definition of an indexing and similarity structure between nodes (≈ relating patients)
  - generation of clusters containing new subtypes of unknown cancer (knowledge discovery)
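The slides do not spell out the voting formula, so the following is only one plausible reading of a proportional weighted vote: each retrieved patient votes for its own class with a weight that decreases with distance to the new patient, and the weights are normalised to sum to 1. The weight function `1 / (1 + distance)` and the function name are assumptions.

```python
from collections import defaultdict

def weighted_vote(neighbours):
    """neighbours: list of (class_label, distance) pairs for retrieved patients.

    Each neighbour votes for its class with weight 1 / (1 + distance);
    weights are normalised so that the class scores sum to 1.
    """
    weights = [(label, 1.0 / (1.0 + d)) for label, d in neighbours]
    total = sum(w for _, w in weights)
    scores = defaultdict(float)
    for label, w in weights:
        scores[label] += w / total
    winner = max(scores, key=scores.get)
    return winner, dict(scores)

# Hypothetical retrieved patients: two close APL cases, one distant AML-other case.
neighbours = [("APL", 0.0), ("APL", 1.0), ("AML-other", 3.0)]
winner, scores = weighted_vote(neighbours)
print(winner)  # APL
```

Returning the per-class scores alongside the winner keeps the assigned weights available for inspection, rather than reporting only the predicted class.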
19. GENE-CBR (ii): reuse
[Figure labels: Similarity, - Similarity, AML-inv()]
20. GENE-CBR (iii): revise
- objectives
  - provide doctors with meaningful information about the classification carried out by the system
  - help in discovering new knowledge
  - if-then rules as a decision-making support mechanism
- information supplied
  - identification of similar patients (from a genetic point of view)
  - proportional weighted voting and assigned weights
  - rule generation using See5. Quinlan, J.R. (2000)
    - DFP genes belonging to the set of patients retrieved by the GCS network
- advantages
  - doctors can supervise the final decision proposed by the system
  - new knowledge generation in the form of easily understandable rules
21. GENE-CBR (iii): revise
[Figure labels: KARYOTYPE; BIOLOGICAL AND CLINICAL CHARACTERISTICS]
Rule 6 (45 / 4, lift 1.1)
If X65962 (AFFX-HSAC07/X00351_5_at) is LOW then
  If U96781 (AFFX-BioDn-3_at) is LOW-MEDIUM then AML-other
  Else If D87845 (AFFX-hum_alu_at) is HIGH then AML-inv() 0.968
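Rules of this kind translate directly into code. The sketch below encodes Rule 6 as shown, assuming that the two inner conditions are nested under the first one; the slide layout leaves that nesting ambiguous, so treat it as one possible reading. The function name and the dictionary input are hypothetical, and the numeric value 0.968 attached to the last branch is not used because the slide does not say what it measures.

```python
def rule_6(labels):
    """Apply Rule 6 to a patient.

    labels: dict mapping gene accession -> fuzzy label of its expression value
            (e.g. "LOW", "LOW-MEDIUM", "HIGH").
    Returns the predicted class, or None when the rule does not fire.
    """
    if labels.get("X65962") == "LOW":
        if labels.get("U96781") == "LOW-MEDIUM":
            return "AML-other"
        elif labels.get("D87845") == "HIGH":
            return "AML-inv()"
    return None

patient = {"X65962": "LOW", "U96781": "HIGH", "D87845": "HIGH"}
print(rule_6(patient))  # AML-inv()
```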
22. GENE-CBR (iv): retain
- objectives
  - feed new knowledge back into the system
    - new subclassification of existing cancer pathologies
    - reclassification of existing patients
    - identification of correlated genes
    - discovery of new markers able to distinguish new pathologies
    - identification of prototypical patients and rare cases
- main phases
  - update the case base with a new microarray every time a new classification is generated
  - modification of the parameters of the model
- advantages
  - possibility of easily integrating new biological knowledge into the hybrid system
23. Applied technologies
- Design patterns
  - Action
  - Future
  - MVC
  - Singleton
  - Wizard
- 100% Java
  - Swing
  - BeanShell
  - Log4j
  - JFreeChart
- Unified Modeling Language
  - Poseidon for UML
24. Future work
- moving towards a plug-in architecture
  - designing a core where each technique is implemented as a plug-in > aiBENCH
- implementing k-fold cross-validation
  - generation of multiple training and test cases in an automatic way
- supporting standard microarray data formats
  - MIAME: Minimum Information About a Microarray Experiment
- deploying GENE-CBR with Java Web Start
  - remote and automatic access to the latest versions of the GENE-CBR project
- on-line access to genetic sequence databases
  - GenBank (http://www.ncbi.nlm.nih.gov/Genbank)
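As an illustration of the planned cross-validation support, the following sketch generates training and test index sets automatically. It is a generic k-fold split, not code from GENE-CBR; the function name is hypothetical, and the sample count of 49 is borrowed from the "Available data" slide.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

splits = list(k_fold_splits(n_samples=49, k=5))
print([len(test) for _, test in splits])  # [10, 10, 10, 10, 9]
```

In practice the sample order would usually be shuffled first, and with class sizes as unbalanced as 4 versus 22 patients a stratified split would be a sensible refinement.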
25. Demo: GENE-CBR in action