Data Mining for the NHS Information Authority presentation

1
Data Mining for the NHS Information Authority

Brief review by Evandro Leite Jr

2
Is it maths, management or computer science?

Data mining definition
Analysis of large volumes of data to extract
important trends and higher-level information.
"We are drowning in data, but starving for
knowledge!" (J. Naisbitt)
Data mining has become a computer science subject
over the last 10 years, but it will always have
mathematics as its base.

3
Some quick definitions

Variables
Continuous: measured values are real numbers
(ex. 73.827, 23).
Categorical: takes values in a finite set with no
natural ordering (ex. black, red, green).
Ordered: takes values in a finite set whose
elements can be sorted (ex. age in years, an
interval of integer numbers, 01/09/2004).
Dependent variable (or set of classes): the aspect
of the data to be studied.
Independent variables (or set of attributes):
variables that are manipulated to explain the
dependent variable.
Types of problems
Regression-type -> the dependent variable is
continuous. Ex: house selling price (value); a
price is a real number.
Classification-type -> the dependent variable is
categorical. Ex: who will graduate (yes, no); yes
and no are categories.
DECISION TREES SOLVE BOTH CLASSIFICATION AND
REGRESSION PROBLEMS

4
The focus of the project

There are many mathematical and computing tools
that can be applied to data mining.
Association Rules, Regression, Classification
and Clustering.
For now the focus has been given to
Classification using Classification Trees.
Regression using Regression Trees.

5
Classification Trees
CAN BE USED ONLY IF THE DEPENDENT VARIABLE IS
CATEGORICAL.
Ex: Explain why patients die after going to the
ICU. Outcome (Survived/Died)
Aim: To understand complex datasets by splitting
them into subsets with less entropy.
The key is how to choose the best attribute on
which to split the data.
6
Classification Trees
How to choose the best attribute to split?
Gini impurity: Used by the CART algorithm
(Classification and Regression Trees). Suppose y
takes on values in {1, 2, ..., m}, and let f(i, j)
be the frequency of value j in node i. That is,
f(i, j) is the proportion of records assigned to
node i for which y = j.
Entropy: Used by the C4.5 and C5.0 algorithms.
This measure is based on the concept of entropy
used in information theory.
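
The formulas themselves did not survive in this transcript. As a minimal sketch, assuming the standard textbook definitions (Gini impurity as 1 minus the sum over j of f(i, j) squared, and entropy as minus the sum over j of f(i, j) * log2 f(i, j)), the two measures can be computed for a single node as follows. The function names are illustrative only, and the example node simply reuses the class counts of the play golf dataset (9 Play, 5 Don't Play).

```python
import math
from collections import Counter


def proportions(labels):
    """Return the proportion f(i, j) of each class j among the labels of one node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of the squared class proportions."""
    return 1.0 - sum(p * p for p in proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: minus the sum of p * log2(p) over the classes."""
    return -sum(p * math.log2(p) for p in proportions(labels))


# Example node with the class counts of the play golf dataset.
node = ["Play"] * 9 + ["Don't Play"] * 5
print(round(gini_impurity(node), 3))
print(round(entropy(node), 3))
```

Both measures are zero for a node that contains a single class and grow as the classes become more evenly mixed, which is why a good split is one that lowers them.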
7
(No Transcript)
8
Gaps in knowledge

Are there other algorithms and functions to be
found?
What are the best functions and algorithms for
each dataset?
How to measure the goodness of a tree is known.
However, finding the best-sized tree is an
NP-complete problem. How can that be improved?
How can the best of neural networks, support
vector machines, relation rules, decision trees,
etc. be combined to create a meta-learner and a
meta-meta-learner?

9
Software that can implement multiple algorithms

The software will be able to run different
algorithms on the same dataset.
Trees generated by the different algorithms will
be compared. The user will be able to compare them
visually, or to pick the one with the lowest
misclassification rate or model complexity.
Depending on the nature of the problem
(classification or regression), a specific
algorithm can be much more efficient.
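
The selection step described on this slide could be sketched roughly as below. Everything here is an assumption made for illustration: the TreeResult record, the use of leaf count as a proxy for model complexity, and above all the numbers, which are made up and are not results from the presentation.

```python
from dataclasses import dataclass


@dataclass
class TreeResult:
    """Summary of one tree produced by one algorithm (hypothetical record)."""
    algorithm: str
    misclassification_rate: float  # fraction of records classified wrongly
    leaf_count: int                # used here as a proxy for model complexity


def pick_tree(results, prefer="misclassification"):
    """Pick the tree with the lowest value of the preferred criterion.

    The other criterion is used only to break ties.
    """
    if prefer == "misclassification":
        return min(results, key=lambda r: (r.misclassification_rate, r.leaf_count))
    if prefer == "complexity":
        return min(results, key=lambda r: (r.leaf_count, r.misclassification_rate))
    raise ValueError("prefer must be 'misclassification' or 'complexity'")


# Made-up numbers for illustration only.
candidates = [
    TreeResult("CART", 0.14, 5),
    TreeResult("C4.5", 0.07, 8),
    TreeResult("QUEST", 0.14, 3),
]
print(pick_tree(candidates).algorithm)
print(pick_tree(candidates, prefer="complexity").algorithm)
```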

10
Last presentation's play golf dataset
Independent variables: OUTLOOK, TEMPERATURE, HUMIDITY, WINDY
Dependent variable: PLAY

OUTLOOK   TEMPERATURE  HUMIDITY  WINDY  PLAY
sunny     85           85        FALSE  Don't Play
sunny     80           90        TRUE   Don't Play
overcast  83           78        FALSE  Play
rain      70           96        FALSE  Play
rain      68           80        FALSE  Play
rain      65           70        TRUE   Don't Play
overcast  64           65        TRUE   Play
sunny     72           95        FALSE  Don't Play
sunny     69           70        FALSE  Play
rain      75           80        FALSE  Play
sunny     75           70        TRUE   Play
overcast  72           90        TRUE   Play
overcast  81           75        FALSE  Play
rain      71           80        TRUE   Don't Play
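
To illustrate how an entropy-based algorithm would choose the first split on this dataset, here is a minimal sketch that computes the information gain (entropy before the split minus the weighted entropy after it) for the two categorical attributes. It is not the presenter's software: the function names are illustrative, and TEMPERATURE and HUMIDITY are left out because continuous attributes would first need a threshold.

```python
import math
from collections import Counter, defaultdict

# (OUTLOOK, WINDY, PLAY) columns copied from the play golf table.
ROWS = [
    ("sunny", False, "Don't Play"), ("sunny", True, "Don't Play"),
    ("overcast", False, "Play"), ("rain", False, "Play"),
    ("rain", False, "Play"), ("rain", True, "Don't Play"),
    ("overcast", True, "Play"), ("sunny", False, "Don't Play"),
    ("sunny", False, "Play"), ("rain", False, "Play"),
    ("sunny", True, "Play"), ("overcast", True, "Play"),
    ("overcast", False, "Play"), ("rain", True, "Don't Play"),
]


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute_index):
    """Entropy of PLAY before the split minus the weighted entropy after it."""
    labels = [row[-1] for row in rows]
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute_index]].append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


gains = {"OUTLOOK": information_gain(ROWS, 0), "WINDY": information_gain(ROWS, 1)}
best = max(gains, key=gains.get)
print({name: round(gain, 3) for name, gain in gains.items()}, best)
```

OUTLOOK comes out well ahead of WINDY because every overcast day is a Play day, so that branch is already pure.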
11
Comparison between decision tree algorithms
AnswerTree solution using the famous CART
algorithm (SPSS analytical software)
12
Comparison between decision tree algorithms
Spartacus Data Mining tools using the C4.5
algorithm (Southampton University)

13
End of the introductory part
14
Part 1: The meta and meta-meta learners

The meta-learner
The user will choose the dataset and the
variables.
A series of trial runs, using combinations of
different methods, will be the input to a neural
network (the meta-learner).

15
[Diagram: Meta-learner]
Dataset -> C1 (CRT) and C2 (QUEST)
Each classifier produces: set of rules, data
quality, CPU time, memory utilisation
These feed a neural network (the meta-learner)
Meta-learner outputs: optimal data quality,
simpler rules, total CPU time, memory utilisation
Memory utilisation = [ sum over c of
memory(c) / CPU(c) ] / Total time
16
The meta-meta-learners
Dataset
Meta-Learner 1: CRT
Meta-Learner 2: Neural network, Linear discriminant
Meta-Learner 3: Relation rules, C4.5, STR-Tree
Neural network (probably not necessary)
User defined: could be a function like
Best meta-learner = DataQuality * A + Simpler rules * B
- Memory * C - Time * D
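
The user-defined function on this slide can be sketched as a weighted score. This assumes the operators lost from the transcript are multiplication and addition, so that data quality and rule simplicity count in favour of a meta-learner while memory and time count against it. The measurements below are hypothetical values scaled to the range 0 to 1, and the weights A to D are hypothetical user preferences.

```python
def meta_learner_score(data_quality, rule_simplicity, memory, time, weights):
    """Score = DataQuality*A + SimplerRules*B - Memory*C - Time*D."""
    a, b, c, d = weights
    return data_quality * a + rule_simplicity * b - memory * c - time * d


def best_meta_learner(measurements, weights):
    """Return the name of the meta-learner with the highest score."""
    return max(
        measurements,
        key=lambda name: meta_learner_score(*measurements[name], weights),
    )


# Hypothetical measurements: (data quality, rule simplicity, memory, time).
measurements = {
    "Meta-Learner 1": (0.80, 0.90, 0.20, 0.10),
    "Meta-Learner 2": (0.90, 0.40, 0.60, 0.70),
    "Meta-Learner 3": (0.85, 0.70, 0.40, 0.30),
}

# Hypothetical user weights A, B, C, D.
print(best_meta_learner(measurements, (90, 50, 10, 20)))
print(best_meta_learner(measurements, (99, 0, 0, 0)))
```

With balanced weights the simplest and cheapest candidate wins; when only data quality matters, the ranking changes. That dependence on the user's priorities is the point of making the function user defined.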
17
The meta-meta-learners user input and output
Input
Dataset name? NHS
Dependent variables? LOS, OUTCOME, STROKE
How much do you care about:
Data quality (0-99)
Parsimonious models (0-99)
Time to process (0-99)
Memory utilisation (0-99)
Output
The best meta-learner for you is a combination
of C4.5, ANN and Relation rules.
These are the best rules:
1- IF HEART ATTACK and AGE > 90 then
OUTCOME = DEATH (error 3)
2- Everybody that has STROKE also has HIGH BLOOD
PRESSURE
3- AGE 2.3 APACHE 2 0.4 LOS (error 25)
18
Software that can implement multiple algorithms

Once the best meta-learner is found for a given
situation, dataset and dependent variable, the
user can define this meta-learner as the one to
be executed in similar situations.
Ex: To find out a patient's LOS in the ICU
datasets, ML3 (CRT) will be used. However, to
find out the outcome of the patient (died or
survived), ML103 (C4.5, relation rules) will be
used.

19
No more slides
20
(No Transcript)
21
Nice things about decision trees

There are many mathematical and computing tools
that can be applied to data mining.
Association Rules, Regression, Classification
and Clustering.
For now the focus has been given to
Classification using Classification Trees.
Regression using Regression Trees.

22
Regression Trees
CAN BE USED ONLY IF THE DEPENDENT VARIABLE IS
CONTINUOUS.
Ex: Time a patient stays in the hospital (LOS in
days)
Aim: To reduce the impurity of a dataset (for a
continuous dependent variable, typically its
variance rather than entropy) by splitting it
into subsets with less impurity.
The key is how to choose the best attribute on
which to split the data.
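
As a rough sketch of how a regression tree might pick a split, the code below tries every midpoint between neighbouring values of one continuous attribute and keeps the threshold that leaves the lowest weighted variance of the dependent variable. Variance is assumed here as the impurity measure, which is the usual choice for a continuous target. The ages and lengths of stay are made up for illustration and are not NHS data.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_threshold(xs, ys):
    """Return (threshold, weighted variance) of the best binary split on xs.

    Returns None when all xs are equal, since no split is possible.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold fits between two identical values
        threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        score = (len(left) * variance(left) + len(right) * variance(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Made-up patient ages and lengths of stay in days.
ages = [30, 35, 40, 70, 75, 80]
los = [2.0, 3.0, 2.0, 9.0, 10.0, 11.0]
threshold, score = best_threshold(ages, los)
print(threshold, round(score, 3))
```

Each resulting subset would then be split again in the same way until a stopping rule is met, and each leaf would predict the mean LOS of its records.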
