Title: Data Mining for the NHS Information Authority
1Data Mining for the NHS Information Authority
- Brief reviewby Evandro Leite Jr
2Is it maths, management or computer science?
- Data mining definition
- Analysis of large volumes of data to extract
important trends and higher level information. - We are drowning in data, but starving for
knowledge! (J. Naisbett) - Data mining became a Computer Science subject in
the last 10 years, but it will always use
mathematics as the base of it.
3Some quick definitions
- Variables
- Continuous its measured values are real numbers
(ex. 73.827, 23). - Categorical takes values in a finite set not
having any natural ordering (ex. black, red,
green). - Ordered finite set, with some way of sorting the
elements of the set. (ex. age in years, interval
of integer numbers, 01/09/2004). - Dependent variable or set of classes The aspect
of the data to be studied. - Independent variable or set of attributes
Variables that are manipulated to explain the
dependent variable. - Types of problems
- Regression-type -gt dependent variable
Continuous Ex House selling price ( value)
price is real - Classification-type -gt dependent variable
Categorical Ex Who will graduate (yes, no) yes
and no are categories - DECISION TREES SOLVES CLASSIFICATION AND
REGRESSION PROBLEMS
4The focus of the project
- There are many mathematical and computing tools
that can be applied to data mining. - Association Rules, Regression, Classification
and Clustering. - For now the focus has been give to
- Classification using Classification Trees.
- Regression using Regression Trees.
5Classification Trees
CAN BE USED ONLY IF THE DEPENDENT VARIABLE IS
CATEGORICAL. Ex Explain the reasons patient die
after going to ICU. Outcome
(Survived/Died) Aim To understand complex
datasets by splitting it into datasets with less
entropy. The key is how to choose the best
attribute to split the data.
6Classification Trees
How to choose the best attribute to split? Gini
impurity Used by the CART algorithm
(Classification and Regression Trees). Suppose y
takes on values in 1, 2, , m, and let f(i, j)
frequency of value j in node i. That is f(i, j)
is the proportion of records assigned to node i
for which y j. Entropy Used by the C4.5 and
C5.0 algorithms. This measure is based on the
concept of entropy used in information theory.
7(No Transcript)
8Gaps in knowledge
- Are there other algorithms and function to be
found? - What are the best functions and algorithms for
each dataset? - The way to find out the goodness of a tree is
known. However, finding the best size tree is a
NP-complete problem. How to improve that? - How to combine the best from neural networks,
support vector machines, relation rules, decision
trees etc to create a meta learner and meta-meta
learner.
9A software which can implement multiple algorithms
- The software will be able to run the different
algorithms for the same dataset. - Trees generated from different algorithms will be
created and will be compared. The user will be
able to visually compare them, or to pick the one
that has the inferior misclassification rate or
model complexity. - Depending on the nature of the problem
(classification or regression) a specific
algorithm can be much more efficient.
10Last presentations play golf dataset
Independent variables
Dep. var
OUTLOOK TEMPERATURE HUMIDITY WINDY PLAY
sunny 85 85 FALSE Don't Play
sunny 80 90 TRUE Don't Play
overcast 83 78 FALSE Play
rain 70 96 FALSE Play
rain 68 80 FALSE Play
rain 65 70 TRUE Don't Play
overcast 64 65 TRUE Play
sunny 72 95 FALSE Don't Play
sunny 69 70 FALSE Play
rain 75 80 FALSE Play
sunny 75 70 TRUE Play
overcast 72 90 TRUE Play
overcast 81 75 FALSE Play
rain 71 80 TRUE Don't Play
11Comparison between decision tree algorithms
Answer tree solution using the famous CART
algorithm SPSS Analytical Software
12Comparison between decision tree
algorithmsSpartacus Data Mining tools using the
C4.5 algorithmSouthampton University
13End of the introductory part
14Part 1The meta and meta-meta learners
- The meta-learner
- The user will choose the dataset and the
variables. - A trial of different runs, using combinations of
different methods will be the input of a neural
network (the meta-learner).
15Set of rules
C1 CRT
Data quality
Meta-learner
Optimal data quality
CPU time Memory utilisation
Dataset
Neural network
simpler rules
Total CPU time
CPU time Memory utilisation
C2 QUEST
Data quality
Memory utilisationS memory(c) / CPU(c) c
-------------------------------Total time
Set of rules
16The meta-meta-learners
Meta-Learner 1 CRT
User defined could be a function likeBest
meta-learner DataQuality A Simpler rules
B - Memory C - Time D
Meta-Learner2 Neural networkLinear discriminant
Dataset
Neural network(probably not necessary)
Meta-Learner 3 Relation rulesC4.5STR-Tree
17The meta-meta-learners user input and output
Input
Output
Dataset name? NHSDependent variables? LOS,
OUTCOME, STROKE How much do you care
aboutData quality (0-99) Parsimonious models
(0-99)Time to process (0-99)Memory utilisation
(0-99)
The best meta-learner for youis a combination
of C4.5, ANN and Relation rules.These are the
best rules1- IF HEART ATTACK and AGE gt 90
then OUTCOME DEATH (error 3)2- Everybody
that has STOKE also has HIGH BLOOD PRESSURE 3-
AGE 2.3 APACHE 2 0.4 LOS (error 25)
18A software which can implement multiple algorithms
- Once the best meta-learner is found for a given
situation, dataset and dependent variable, the
user can define this meta-learner as the one to
be executed in similar situations. - Ex To find the out the patients LOS in the ICU
datasets the ML3(CRT) will be used. However to
find out the outcome of the patient (died or
survived) the ML103(C4.5, relation rules) will be
used.
19No more slides
20(No Transcript)
21Nice things about decision trees
- There are many mathematical and computing tools
that can be applied to data mining. - Association Rules, Regression, Classification
and Clustering. - For now the focus has been give to
- Classification using Classification Trees.
- Regression using Regression Trees.
22Regression Trees
CAN BE USED ONLY IF THE DEPENDENT VARIABLE IS
CONTINUOS.Ex Time a patient stays in the
hospital (LOS in days) Aim To reduce the
entropy of an dataset by splitting it into
datasets with less entropy. The key is how to
choose the best attribute to split the data.