Title: Nada Lavrac
1. Nada Lavrac
- Subgroup Discovery
- Recent Biomedical Applications
- Solomon seminar, Ljubljana, January 2008
2. Talk outline
- Data mining in a nutshell
- Subgroup discovery in a nutshell
- Relational data mining and propositionalization in a nutshell
- DNA Microarray Data Analysis with SD
- RSD approach to Descriptive Analysis of Differentially Expressed Genes
- Future work: Towards service-oriented knowledge technologies for information fusion
3. Data Mining in a Nutshell
Data Mining: knowledge discovery from data
data → model, patterns
Given: a transaction data table, relational database, text documents, Web pages, ...
Find: a classification model, a set of interesting patterns
4. Data Mining in a Nutshell
Data Mining: knowledge discovery from data
data → model, patterns
Given: a transaction data table, relational database, text documents, Web pages, ...
Find: a classification model, a set of interesting patterns
- Symbolic model / symbolic patterns: provide an explanation; a new unclassified instance is turned into a classified instance
- Black-box classifier: no explanation
5. Data mining example. Input: Contact lens data
6. Output: Decision tree for contact lens prescription
tear production?
├─ reduced → NONE
└─ normal → astigmatism?
   ├─ no → SOFT
   └─ yes → spect. prescription?
      ├─ hypermetrope → NONE
      └─ myope → HARD
7. Output: Classification/prediction rules for contact lens prescription
- IF tear production = reduced THEN lenses = NONE
- IF tear production = normal AND astigmatism = yes AND spect. pre. = hypermetrope THEN lenses = NONE
- IF tear production = normal AND astigmatism = no THEN lenses = SOFT
- IF tear production = normal AND astigmatism = yes AND spect. pre. = myope THEN lenses = HARD
- DEFAULT lenses = NONE
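A rule list like this is applied top-down: the first matching rule fires. As a minimal Python sketch (function name and value encodings are hypothetical, chosen to mirror the attribute values above):

```python
def prescribe(tear_production, astigmatism, spect_prescription):
    """Apply the ordered contact-lens rule list; the first matching rule fires."""
    if tear_production == "reduced":
        return "NONE"
    if tear_production == "normal" and astigmatism == "no":
        return "SOFT"
    if tear_production == "normal" and astigmatism == "yes":
        if spect_prescription == "hypermetrope":
            return "NONE"
        if spect_prescription == "myope":
            return "HARD"
    return "NONE"  # default rule
```

Because the rules are ordered, the default only applies to instances no earlier rule covers.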
8. Task reformulation: Concept learning problem (positive vs. negative examples of the target class)
9. Classification versus Subgroup Discovery
- Classification task: constructing models from data (constructing sets of predictive rules)
  - Predictive induction, aimed at learning models for classification and prediction
  - Classification rules aim at covering only positive examples
  - A set of rules forms a domain model
- Subgroup discovery task: finding interesting patterns in data (constructing individual subgroup-describing rules)
  - Descriptive induction, aimed at exploratory data analysis
  - Subgroup descriptions aim at covering a significant proportion of positive examples
  - Each rule (pattern) is an independent chunk of knowledge
10. Classification versus Subgroup Discovery
11. Talk outline
- Data mining in a nutshell
- Subgroup discovery in a nutshell
- Relational data mining and propositionalization in a nutshell
- DNA Microarray Data Analysis with SD
- RSD approach to Descriptive Analysis of Differentially Expressed Genes
12. Subgroup discovery in a nutshell
- SD task definition
  - Given: a set of labeled training examples and a target class of interest
  - Find: descriptions of the most interesting subgroups of target class examples, which
    - are as large as possible (high target class coverage)
    - have a significantly different distribution of the target class examples (high TP/FP ratio, high RelAcc, high significance)
- Other (subjective) criteria of interestingness
  - Surprising to the user, simple, useful, actionable
13. CHD Risk Group Discovery Task
- Task: Find and characterize population subgroups with high CHD risk
- Input: Patient records described by stage A (anamnestic), stage B (anamnestic + laboratory), and stage C (anamnestic + laboratory + ECG) attributes
- Output: Best subgroup descriptions that are most actionable for CHD risk screening at the primary health-care level
14. Subgroup discovery in the CHD application
- From the best induced subgroup descriptions, five were selected by the expert as most actionable for CHD risk screening (by GPs)
  - A1: CHD IF male AND positive family history AND age > 46
  - A2: CHD IF female AND body mass index > 25 AND age > 63
  - B1: CHD IF ..., B2: CHD IF ..., C1: CHD IF ...
- Principal risk factors (found by subgroup mining)
- Supporting risk factors (found by statistical analysis)
  - A1: psychosocial stress, as well as cigarette smoking, hypertension and overweight
  - A2: ...
15. Characteristics of Subgroup Discovery Algorithms
Remark: Subgroup discovery can be viewed as cost-sensitive rule learning, rewarding covered TPs and penalizing covered FPs.
- The SD algorithm does not look for a single complex rule describing all examples of target class A (all CHD patients), but for several rules that describe parts (subgroups) of A.
- SD prefers rules that are accurate (cover only CHD patients) and have high generalization potential (cover large patient subgroups). This trade-off is modeled by the parameter g of SD's rule quality heuristic.
- SD naturally uses example weights in its procedure for repetitive subgroup generation, via its weighted covering algorithm and rule quality evaluation heuristic.
16. Weighted covering algorithm for rule set construction
[Figure: CHD patients and other patients, each example carrying an initial weight of 1.0]
For learning a set of subgroup-describing rules, SD implements an iterative weighted covering algorithm. The quality of a rule is measured by trading off coverage and precision.
17. Weighted covering algorithm for rule set construction
[Figure: subgroup covered by the rule "f2 and f3" among CHD patients and other patients, all weights still 1.0]
- Rule quality measure in SD: q(Cl ← Cond) = TP / (FP + g)
- Rule quality measure in CN2-SD: WRAcc(Cl ← Cond) = p(Cond) × (p(Cl|Cond) − p(Cl)) = coverage × (precision − default precision) = Pos/N × Neg/N × (TPr − FPr)
- Coverage: sum of the covered weights; Precision: purity of the covered examples
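The two quality measures can be written down directly from their definitions. A minimal sketch in Python (function names are illustrative; `tp`/`fp` are the positive/negative counts covered by a rule, `pos`/`neg` the class sizes):

```python
def q_g(tp, fp, g):
    """Generalization-quotient heuristic of the SD algorithm: q = TP / (FP + g)."""
    return tp / (fp + g)

def wracc(tp, fp, pos, neg):
    """Weighted relative accuracy used by CN2-SD:
    WRAcc = p(Cond) * (p(Cl|Cond) - p(Cl))
          = coverage * (precision - default precision)."""
    n = pos + neg
    covered = tp + fp
    return (covered / n) * (tp / covered - pos / n)
```

Expanding the definition shows the equivalence with the second form on the slide: WRAcc = Pos/N × Neg/N × (TPr − FPr), where TPr = TP/Pos and FPr = FP/Neg.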
18. Weighted covering algorithm for rule set construction
[Figure: after the first rule is learned, the weights of the covered CHD patients are decreased from 1.0 to 0.5; all other weights remain 1.0]
In contrast with classification rule learning algorithms (e.g. CN2), the covered positive examples are not deleted from the training set in the next rule learning iteration: they are re-weighted, and the next best rule is learned.
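The re-weighting loop can be sketched in a few lines of Python. This is an illustrative simplification, not the SD implementation: rules are modeled as hypothetical boolean predicates, rule quality as coverage × precision, and covered positives have their weight multiplied by a decay factor `gamma` instead of being deleted:

```python
def weighted_covering(positives, negatives, candidate_rules, gamma=0.5, n_rules=3):
    """Sketch of weighted covering: covered positives are re-weighted
    (weight *= gamma) rather than removed, so later iterations prefer
    rules that cover still-unexplained examples."""
    weights = {e: 1.0 for e in positives}
    chosen = []
    for _ in range(n_rules):
        def quality(rule):
            # coverage (sum of covered weights) x precision (purity)
            cov_w = sum(weights[e] for e in positives if rule(e))
            tp = sum(1 for e in positives if rule(e))
            fp = sum(1 for e in negatives if rule(e))
            return 0.0 if tp + fp == 0 else cov_w * tp / (tp + fp)
        best = max(candidate_rules, key=quality)
        if quality(best) == 0.0:
            break
        chosen.append(best)
        for e in positives:
            if best(e):
                weights[e] *= gamma  # decrease, don't delete
    return chosen
```

After one rule is selected, its covered positives contribute only half as much to the coverage of later candidates, which yields a small set of relatively independent rules.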
19. Subgroup visualization
The CHD task: Find, characterize and visualize population subgroups with high CHD risk (large enough, distributionally unusual, most actionable)
20. Induced subgroups and their statistical characterization
Subgroup A2, for female patients: High CHD risk IF body mass index over 25 kg/m2 (typically 29) AND age over 63 years.
Supporting characteristics are positive family history and hypertension. Women in this risk group typically have slightly increased LDL cholesterol values and normal but decreased HDL cholesterol values.
21. Statistical characterization of expert-selected subgroups
22. Statistical characterization of subgroups
- Starts from the induced subgroup descriptions
- Statistical significance of all available features (all risk factors) is computed given two populations: true positive cases (CHD patients correctly included in the subgroup) and all negative cases (healthy subjects)
- A χ² test with 95% confidence level is used
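For a binary risk factor this amounts to a χ² test on a 2×2 contingency table. A minimal sketch in Python (function names are illustrative; the shortcut formula is the standard closed form for 2×2 tables):

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic (1 d.o.f.) for the 2x2 contingency table
    [[a, b], [c, d]]: rows = risk factor present/absent,
    columns = true positive cases / negative cases."""
    n = a + b + c + d
    # shortcut form of sum((observed - expected)^2 / expected)
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

CHI2_CRIT_95 = 3.841  # critical value at the 95% confidence level, 1 d.o.f.

def significant_at_95(a, b, c, d):
    return chi_square_2x2(a, b, c, d) > CHI2_CRIT_95
```

A risk factor is reported as a supporting characteristic only when its statistic exceeds the 95% critical value.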
23. Propositional subgroup discovery algorithms
- SD algorithm (Gamberger & Lavrac, JAIR 2002)
- APRIORI-SD (Kavsek & Lavrac, AAI 2006)
- CN2-SD (Lavrac et al., JMLR 2004): adapting the CN2 classification rule learner to subgroup discovery
  - Weighted covering algorithm
  - Weighted relative accuracy (WRAcc) search heuristic, with added example weights
  - WRAcc: trade-off between rule coverage and rule accuracy
  - Probabilistic classification
  - Evaluation with different rule interestingness measures
24. Subgroup discovery: lessons learned
- In expert-guided subgroup discovery, the expert may decide to choose sub-optimal subgroups which are the most actionable
- A weighted covering algorithm for rule subset construction (or rule set selection), using decreased weights of covered positive examples, can be used to construct/select a small set of relatively independent patterns
- Additional evidence in the form of supporting factors increases the expert's confidence in rules resulting from automated discovery
- Value added: subgroup visualization
25. Talk outline
- Data mining in a nutshell
- Subgroup discovery in a nutshell
- Relational data mining and propositionalization in a nutshell
- DNA Microarray Data Analysis with SD
- RSD approach to Descriptive Analysis of Differentially Expressed Genes
26. Relational Data Mining (Inductive Logic Programming) in a Nutshell
Relational Data Mining: knowledge discovery from data
data → model, patterns
Given: a relational database, a set of tables, sets of logical facts, a graph, ...
Find: a classification model, a set of interesting patterns
27. Relational Data Mining (ILP)
- Learning from multiple tables
- Complex relational problems:
  - temporal data: time series in medicine, traffic control, ...
  - structured data: representation of molecules and their properties in protein engineering, biochemistry, ...
- Illustrative example: structured objects - Trains
28. RSD: Upgrading CN2-SD to Relational Subgroup Discovery
- Implementing a propositionalization approach to relational data mining, through efficient first-order feature construction
- Using CN2-SD for propositional subgroup discovery
[Diagram: first-order feature construction → features → subgroup discovery → rules]
29. Propositionalization in a nutshell
[Figure: TRAIN_TABLE]
Propositionalization task: Transform a multi-relational (multiple-table) representation into a propositional (single-table) representation.
Proposed in ILP systems LINUS (1991), 1BC (1999), ...
[Figure: PROPOSITIONAL TRAIN_TABLE]
30. Propositionalization in relational data mining
[Figure: TRAIN_TABLE]
Main propositionalization step, first-order feature construction:
  f1(T) :- hasCar(T,C), clength(C,short).
  f2(T) :- hasCar(T,C), hasLoad(C,L), loadShape(L,circle).
  f3(T) :- ...
Propositional learning: t(T) ← f1(T), f4(T)
Relational interpretation: eastbound(T) ← hasShortCar(T), hasClosedCar(T).
[Figure: PROPOSITIONAL TRAIN_TABLE]
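The transformation from relational facts to a single table can be sketched in Python. The two-train dataset below is hypothetical, invented only to make the features f1 and f2 above executable:

```python
# Hypothetical relational facts about trains, cars and loads
has_car = {"t1": ["c1", "c2"], "t2": ["c3"]}
car_length = {"c1": "short", "c2": "long", "c3": "long"}
has_load = {"c1": ["l1"], "c2": [], "c3": ["l2"]}
load_shape = {"l1": "circle", "l2": "triangle"}

def f1(t):  # f1(T) :- hasCar(T,C), clength(C,short).
    return any(car_length[c] == "short" for c in has_car[t])

def f2(t):  # f2(T) :- hasCar(T,C), hasLoad(C,L), loadShape(L,circle).
    return any(load_shape[l] == "circle" for c in has_car[t] for l in has_load[c])

# Propositional table: one row per train, one Boolean column per feature
table = {t: {"f1": f1(t), "f2": f2(t)} for t in has_car}
```

Each first-order feature becomes one Boolean column, after which any propositional learner (here, CN2-SD) can be applied to the table.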
31. Relational subgroup discovery
- RSD algorithm (Lavrac et al., ILP 2002; Zelezny & Lavrac, MLJ 2006)
- Implementing a propositionalization approach to relational learning, through efficient first-order feature construction
- Syntax-driven feature construction, using Progol/Aleph-style modeb/modeh declarations
  - f121(M) :- hasAtom(M,A), atomType(A,21)
  - f235(M) :- lumo(M,Lu), lessThr(Lu,-1.21)
- Using CN2-SD for propositional subgroup discovery
  - mutagenic(M) ← feature121(M), feature235(M)
[Diagram: first-order feature construction → features → subgroup discovery → rules]
32. RSD: Lessons learned
- Efficient propositionalization can be applied to individual-centered, multi-instance learning problems:
  - one free global variable (denoting an individual, e.g. molecule M)
  - one or more structural predicates (e.g. has_atom(M,A)), each introducing a new existential local variable (e.g. atom A), using either the global variable (M) or a local variable introduced by other structural predicates (A)
  - one or more utility predicates defining properties of individuals or their parts, assigning values to variables
- feature121(M) :- hasAtom(M,A), atomType(A,21)
- feature235(M) :- lumo(M,Lu), lessThr(Lu,-1.21)
- mutagenic(M) :- feature121(M), feature235(M)
33. Talk outline
- Data mining in a nutshell
- Subgroup discovery in a nutshell
- Relational data mining and propositionalization in a nutshell
- DNA Microarray Data Analysis with SD
- RSD approach to Descriptive Analysis of Differentially Expressed Genes
34. DNA microarray data analysis
- Genomics: the study of genes and their function
- Functional genomics is a typical scientific discovery domain, characterized by a very large number of attributes (genes) relative to the number of examples (observations)
  - typical values: 7000-16000 attributes, 50-150 examples
- Functional genomics using gene expression monitoring by DNA microarrays (gene chips) enables:
  - better understanding of many biological processes
  - improved disease diagnosis and prediction in medicine
35. Gene Expression Data: data mining format
[Figure: expression matrix with few cases (rows) and many features (columns 1, 2, ..., 100)]
36. Standard approach: High-Dimensional Classification Models
- Neural Networks, Support Vector Machines, ...
37. High-Dimensional Classification Models (contd.)
- Usually good at predictive accuracy
  - Golub et al., Science 286:531-537, 1999
  - Ramaswamy et al., PNAS 98:15149-54, 2001
- Resistant to overfitting (mainly SVM, ensembles, ...)
- But black-box models are hard to interpret
38. Subgroup discovery in DNA microarray data analysis: functional genomics domains
- Two-class diagnosis problem of distinguishing between acute lymphoblastic leukemia (ALL, 27 samples) and acute myeloid leukemia (AML, 11 samples), with 34 samples in the test set. Every sample is described with gene expression values for 7129 genes.
- Multi-class cancer diagnosis problem with 14 different cancer types, in total 144 samples in the training set and 54 samples in the test set. Every sample is described with gene expression values for 16063 genes.
- http://www-genome.wi.mit.edu/cgi-bin/cancer/datasets.cgi
39. Subgroup discovery in microarray data analysis
- Applying the SD algorithm to the cancer diagnosis problem with 14 different cancer types (leukemia, CNS, lung cancer, lymphoma, ...)
- Altogether 144 samples in the training set, 54 samples in the test set
- Every sample is described with gene expression values for 16063 genes
- IF (KIAA0128_gene EXPRESSED) AND (prostaglandin_d2_synthase_gene NOT_EXPRESSED) THEN Leukemia
  - sensitivity: 23/24 (training set), 4/6 (test set)
  - specificity: 120/120 (training set), 47/48 (test set)
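The reported fractions are the standard definitions of sensitivity and specificity; a minimal sketch (function names are illustrative):

```python
def sensitivity(tp, fn):
    """Proportion of actual target-class samples the rule covers: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Proportion of non-target samples the rule correctly leaves uncovered: TN / (TN + FP)."""
    return tn / (tn + fp)
```

For instance, training-set sensitivity 23/24 means the rule covers 23 of the 24 leukemia samples and misses one.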
40. Subgroup discovery in microarray data analysis: expert's comments
- SD results in simple IF-THEN rules, interpretable by biologists.
- The best-scoring rule for leukemia shows expression of KIAA0128 (Septin 6), whose relation to the disease is directly explicable.
- The second condition concerns the absence of Prostaglandin D-synthase (PGDS). PGDS is an enzyme active in the production of prostaglandins (pro-inflammatory and anti-inflammatory molecules). Elevated expression of PGDS has been found in brain tumors, ovarian and breast cancer [Su 2001, Kawashima 2001], while hematopoietic PGDS has not been, to our knowledge, associated with leukemias.
41. Propositional subgroup discovery: accuracy-interpretability trade-off
- Patterns in the form of IF-THEN rules induced by SD
- Interpretable by biologists
- D. Gamberger, N. Lavrac, F. Zelezny, J. Tolar, J. Biomed. Informatics 37(5):269-284, 2004
- Special care taken to avoid fluke rules
- Still, inferior in terms of accuracy
42. Talk outline
- Data mining in a nutshell
- Subgroup discovery in a nutshell
- Relational data mining and propositionalization in a nutshell
- DNA Microarray Data Analysis with SD
- RSD approach to Descriptive Analysis of Differentially Expressed Genes
43. Accuracy-interpretability trade-off
- Dilemma: accuracy or interpretability?
- Our approach: achieve both at the same time
  - Learn an accurate high-dimensional classifier
  - Learn comprehensible summarizations of the genes in the classifier by relational subgroup discovery
  - In Learning 2, instances are genes, not patients!
44. Actual approach to Learning 1: identifying sets of differentially expressed genes in data preprocessing
45. Identifying differentially expressed genes
46. Identifying differentially expressed genes
- We want to find genes that display a large difference in gene expression between groups and are homogeneous within groups
- Typically, one would use statistical tests (e.g. the t-test) and calculate p-values (e.g. by a permutation test)
- p-values from these tests have to be corrected for multiple testing (e.g. Bonferroni correction)
The two-sample t statistic is used to test equality of the group means m1, m2.
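A minimal sketch of this step in Python, assuming the unequal-variance (Welch) form of the two-sample t statistic; the Bonferroni helper is an illustration of the correction idea, not the talk's exact procedure:

```python
from math import sqrt

def t_statistic(group1, group2):
    """Two-sample t statistic (Welch form) comparing the group means
    of one gene's expression values."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)  # sample variance
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    return (m1 - m2) / sqrt(v1 / n1 + v2 / n2)

def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: a gene is significant only if its p-value
    is below alpha divided by the number of tests performed."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]
```

With thousands of genes tested at once, the corrected threshold becomes very strict, which is exactly why multiple-testing correction matters here.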
47. Ranking of differentially expressed genes
The genes can be ordered in a ranked list L according to their differential expression between the classes. The challenge is to extract meaning from this list, i.e. to describe the genes. The terms of the Gene Ontology were used as a vocabulary for describing the genes.
Selected genes have different influence on the classifier. The weight of that influence can be extracted from the learned model (e.g. a voting algorithm or SVM) or from the gene selection algorithm, in the form of a score or weight. The description of the selected genes should be biased towards the genes with higher weights.
48. Statistical Significance Meets Biological Relevance: motivation for relational feature construction
- Gene A is obvious, well-known, and not interesting
- Gene J activates gene X, which is an oncogene
49. Relational Subgroup Discovery
- Learning 2, technically:
- Discovery of gene subgroups which
  - largely overlap with those associated by the classifier with a given class
  - can be compactly summarized in terms of their features
- What are features?
  - attributes of the original attributes (genes), and
  - first-order features extracted from the Gene Ontology and the NCBI gene annotation database ENTREZ
- Recent work: first-order features generated from GO, ENTREZ and KEGG
50. Gene Ontology (GO)
- GO is a database of terms for genes
  - Function - What does the gene product do?
  - Process - Why does it perform these activities?
  - Component - Where does it act?
- Known genes are annotated to GO terms (www.ncbi.nlm.nih.gov)
- Terms are connected as a directed acyclic graph (is_a, part_of)
- Levels represent specificity of the terms
12093 biological process terms, 1812 cellular component terms, 7459 molecular function terms
51. Gene Ontology (2)
12093 biological process terms, 1812 cellular component terms, 7459 molecular function terms
52. Multi-relational representation
[Diagram: GENE (main table, class labels) linked to GENE-GENE INTERACTION, GENE-FUNCTION, GENE-PROCESS and GENE-COMPONENT tables; the FUNCTION, PROCESS and COMPONENT hierarchies are connected by is_a and part_of links]
53. Encoding as relational background knowledge
- Prolog facts:
  - predicate(geneID, CONSTANT).
  - interaction(geneID, geneID).
  - component(2532,'GO0016020').
  - component(2532,'GO0005886').
  - component(2534,'GO0008372').
  - function(2534,'GO0030554').
  - function(2534,'GO0005524').
  - process(2534,'GO0007243').
  - interaction(2534,5155).
  - interaction(2534,4803).
fact(class, geneID, weight).
fact('diffexp',64499, 5.434).
fact('diffexp',2534, 4.423).
fact('diffexp',5199, 4.234).
fact('diffexp',1052, 2.990).
fact('diffexp',6036, 2.500).
fact('random',7443, 1.0).
fact('random',9221, 1.0).
fact('random',23395, 1.0).
fact('random',9657, 1.0).
fact('random',19679, 1.0).
Basic, plus generalized background knowledge using GO: zinc ion binding → metal ion binding, ion binding, binding
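The generalization step walks up the GO hierarchy so that a gene annotated with a specific term also matches features built from its more general ancestors. A toy Python sketch, using the slide's zinc-ion-binding chain as a hypothetical fragment of the is_a hierarchy:

```python
# Hypothetical fragment of the GO is_a hierarchy (child -> parent)
is_a = {
    "zinc ion binding": "metal ion binding",
    "metal ion binding": "ion binding",
    "ion binding": "binding",
}

def generalizations(term):
    """Return the term together with all its ancestors along is_a links."""
    terms = [term]
    while terms[-1] in is_a:
        terms.append(is_a[terms[-1]])
    return terms
```

Adding the ancestors as extra background facts lets a single general feature (e.g. "binding") cover genes annotated with many different specific terms.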
54. RSD first-order feature construction
First-order features with support > min_support:
- f(7,A) :- function(A,'GO0046872').
- f(8,A) :- function(A,'GO0004871').
- f(11,A) :- process(A,'GO0007165').
- f(14,A) :- process(A,'GO0044267').
- f(15,A) :- process(A,'GO0050874').
- f(20,A) :- function(A,'GO0004871'), process(A,'GO0050874').
- f(26,A) :- component(A,'GO0016021').
- f(29,A) :- function(A,'GO0046872'), component(A,'GO0016020').
- f(122,A) :- interaction(A,B), function(B,'GO0004872').
- f(223,A) :- interaction(A,B), function(B,'GO0004871'), process(B,'GO0009613').
- f(224,A) :- interaction(A,B), function(B,'GO0016787'), component(B,'GO0043231').
The interaction-based features are existential: they hold if some interacting gene B has the stated properties.
55. Propositionalization
56. Propositional learning: subgroup discovery
[Figure: rule "f2 and f3" with coverage counts 4, 0]
57. Subgroup Discovery
[Figure: differentially expressed and non-differentially expressed genes, each with initial weight 1.0]
58. Subgroup Discovery
[Figure: subgroup covered by the rule "f2 and f3" among differentially expressed and non-differentially expressed genes, all weights still 1.0]
In RSD (using the propositional learner CN2-SD), the quality of a rule is Coverage × Precision, where Coverage is the sum of the covered weights and Precision is the purity of the covered genes.
59. Subgroup Discovery
[Figure: weights of the covered differentially expressed genes decreased from 1.0 to 0.5]
RSD naturally uses gene weights in its procedure for repetitive subgroup generation, via its heuristic rule evaluation: weighted relative accuracy.
60. Summary: The RSD approach
- The RSD system
  - constructs relational logic features of genes, such as interaction(g,G), function(G,protein_binding) ("g interacts with another gene whose functions include protein binding")
  - subjects features to constraints (undecomposability, minimum support, ...)
  - then discovers subgroups using these features
Example: genes coding for proteins located in the nucleus, whose functions include protein binding and whose related processes include transcription, are highly expressed in the TEL class.
61. Experiments
- We have applied the proposed methodology of first-order feature construction and subgroup discovery to discover descriptions of groups of differentially expressed genes on three popular classification problems from gene expression data:
  - ALL vs. AML: 6817 genes, 73 labeled samples, 2 classes
  - Subtypes of ALL: 22283 genes, 132 labeled samples, 6 classes
  - 14 cancer types: 16063 genes, 198 labeled samples, 14 classes
- For all three problems and all classes we selected a set of the most differentially expressed genes (highest t-score ranking) and the same number of randomly chosen non-differentially expressed genes
  - In initial work: 50 genes selected
  - In recent work: Gene Set Enrichment Analysis (GSEA) used to determine the top-ranked genes
62. Results: Discovered subgroup descriptions
- Descriptions of differentially expressed genes for acute lymphoblastic leukemia (ALL) vs. acute myeloid leukemia (AML) (Golub 99); each rule is preceded by its two coverage counts:
  - [12, 0] interaction(A,B), process(B,'humoral immune response').
  - [11, 0] interaction(A,B), function(B,'peptidase activity').
  - [8, 0] interaction(A,B), process(B,'proteolysis'), interaction(A,B), function(B,'peptidase activity').
  - [10, 0] interaction(A,B), process(B,'immune response'), component(B,'extracellular space').
  - [8, 1] function(A,'signal transducer activity').
63. Results: Discovered subgroup descriptions (2)
- Subtypes of ALL, 6 classes, one vs. all others (Ross 03)
- BCR subtype:
  - [9, 0] interaction(A,B), function(B,'metal ion binding'), component(A,'membrane'), process(A,'cell adhesion').
  - [8, 0] interaction(A,B), function(B,'transmembrane receptor activity'), function(A,'receptor activity').
  - [10, 1] interaction(A,B), function(B,'protein homodimerization activity'), interaction(A,B), process(B,'response to stimulus').
64. Results: Clear effect of using background knowledge and weights in learning
65. Related and Recent Work
- There are several approaches to descriptive analysis of gene expression data: Onto-Express, GOstat, GoMiner, FunSpec, FatiGO, GOTermFinder, ...
  - They do not consider the weight of importance and do not use the interaction information
  - They define a cutoff (fixed or calculated) in the gene list and calculate the significance of single GO terms
- Our recent work overcomes the fixed cut-off problem
  - In SEGS (Search for Enriched Gene Sets) the cutoff is dynamically determined through Gene Set Enrichment Analysis
- Extended feature construction: GO, ENTREZ, KEGG
66. Summary of the presented RSD approach
- A method for adding interpretability to high-dimensional gene-expression-based classifiers was presented
- Sequence of two data mining tasks:
  - predictive classifier construction
  - descriptive subgroup discovery
- The 2nd learning task integrates public gene annotation data through relational features
- Highlight: genes are attributes in the 1st task but become examples in the 2nd; this is good because of their abundance
67. Summary of recent work
- The SEGS approach (Trajkovski, Lavrac and Tolar, JBI 2008, in press) to descriptive subgroup discovery, together with Gene Set Enrichment Analysis for initial gene set selection, proved very effective
- In the 2nd learning task, propositionalization through first-order feature construction was in this experiment
  - not used to transform a multiple-table representation of training instances into a single-table representation
  - but used to transform the information available in structured web repositories (GO, KEGG, ENTREZ) into features used in exploratory data analysis
- Side effect: a consistent, automatically updateable database of biological background knowledge is now available on the Web: kt.ijs.si/segs/
68. Future work: Towards service-oriented knowledge technologies for information fusion
- Steps towards SOKT for information fusion:
  - Service-oriented knowledge technologies (SOKT) workshop in Ljubljana, Jan 9-18, 2008
  - Implementation of the SEGS workflow as a first step towards a SOKT toolbox (previously named KMET), in collaboration with Leiden University
  - Implementation of other modules of the future SOKT toolbox
69. Conclusions and Thanks
- Results show that our methodology is capable of automatic extraction of meaningful biomedical knowledge.
- Extracted knowledge can be used for guiding medical research, generating different interpretations of the learned model, or constructing complex gene features for building interpretable classifiers.
- We have high hopes of discovering novel, yet reliable medical knowledge from the relational combination of gene expression data with public gene annotation databases.
- Thanks: this work was done in collaboration with Dragan Gamberger, Filip Zelezny, Igor Trajkovski and Jakub Tolar.