Data Mining for the NHS Information Authority PowerPoint PPT Presentation


Title: Data Mining for the NHS Information Authority


1
Data Mining for the NHS Information Authority
  • Brief review by Evandro Leite Jr

2
Is it maths, management or computer science?
  • Data mining definition
  • Analysis of large volumes of data to extract
    important trends and higher level information.
  • We are drowning in data, but starving for
    knowledge! (J. Naisbitt)
  • Data mining has become a Computer Science subject
    over the last 10 years, but it will always rest on
    mathematics.

3
Some quick definitions
  • Variables
  • Continuous: its measured values are real numbers
    (ex. 73.827, 23).
  • Categorical: takes values in a finite set with
    no natural ordering (ex. black, red, green).
  • Ordered: takes values in a finite set that can be
    sorted (ex. age in years, an interval of integer
    numbers, 01/09/2004).
  • Dependent variable, or set of classes: the aspect
    of the data to be studied.
  • Independent variables, or set of attributes:
    variables that are manipulated to explain the
    dependent variable.
  • Types of problems
  • Regression-type -> the dependent variable is
    continuous. Ex: house selling price (the price
    is a real number).
  • Classification-type -> the dependent variable is
    categorical. Ex: who will graduate (yes, no);
    yes and no are categories.
  • DECISION TREES SOLVE CLASSIFICATION AND
    REGRESSION PROBLEMS

4
The focus of the project
  • There are many mathematical and computing tools
    that can be applied to data mining.
  • Association Rules, Regression, Classification
    and Clustering.
  • For now the focus has been given to
  • Classification using Classification Trees.
  • Regression using Regression Trees.

5
Classification Trees
CAN BE USED ONLY IF THE DEPENDENT VARIABLE IS
CATEGORICAL.
Ex: Explain why patients die after going to the ICU.
Outcome: (Survived/Died).
Aim: To understand a complex dataset by splitting it
into datasets with less entropy. The key is how to
choose the best attribute to split the data.
6
Classification Trees
How to choose the best attribute to split?
Gini impurity: Used by the CART algorithm
(Classification and Regression Trees). Suppose y
takes on values in {1, 2, ..., m}, and let f(i, j) be
the frequency of value j in node i. That is, f(i, j)
is the proportion of records assigned to node i
for which y = j.
Entropy: Used by the C4.5 and C5.0 algorithms. This
measure is based on the concept of entropy used in
information theory.
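The two split measures named on this slide can be sketched in a few lines of Python. This is only an illustration using the standard textbook formulas, Gini impurity = 1 - sum over j of f(i, j)^2 and entropy = - sum over j of f(i, j) log2 f(i, j). The function names and the sample node are assumptions for the example, not code from CART or C4.5.

```python
from collections import Counter
from math import log2


def class_frequencies(labels):
    """Return f(i, j): the proportion of records in a node with each class j."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum_j f(i, j)^2."""
    return 1.0 - sum(f * f for f in class_frequencies(labels))


def entropy(labels):
    """Entropy of a node in bits: -sum_j f(i, j) * log2 f(i, j)."""
    return -sum(f * log2(f) for f in class_frequencies(labels))


# Hypothetical node holding 9 "Survived" and 5 "Died" records.
node = ["Survived"] * 9 + ["Died"] * 5
print(round(gini_impurity(node), 3))
print(round(entropy(node), 3))
```

Both measures are 0 for a node that contains a single class and grow as the classes become more evenly mixed, which is why a split that lowers them is preferred.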
7
(No Transcript)
8
Gaps in knowledge
  • Are there other algorithms and functions to be
    found?
  • What are the best functions and algorithms for
    each dataset?
  • How to measure the goodness of a tree is
    known. However, finding the best-size tree is an
    NP-complete problem. How can that be improved?
  • How can the best of neural networks, support
    vector machines, relation rules, decision trees,
    etc. be combined to create a meta-learner and a
    meta-meta-learner?

9
Software that can implement multiple algorithms
  • The software will be able to run different
    algorithms on the same dataset.
  • Trees will be generated by the different
    algorithms and compared. The user will be able to
    compare them visually, or to pick the one with the
    lowest misclassification rate or model
    complexity.
  • Depending on the nature of the problem
    (classification or regression), a specific
    algorithm can be much more efficient.

10
Last presentation's play golf dataset
Independent variables: OUTLOOK, TEMPERATURE, HUMIDITY, WINDY
Dependent variable: PLAY

OUTLOOK   TEMPERATURE  HUMIDITY  WINDY  PLAY
sunny     85           85        FALSE  Don't Play
sunny     80           90        TRUE   Don't Play
overcast  83           78        FALSE  Play
rain      70           96        FALSE  Play
rain      68           80        FALSE  Play
rain      65           70        TRUE   Don't Play
overcast  64           65        TRUE   Play
sunny     72           95        FALSE  Don't Play
sunny     69           70        FALSE  Play
rain      75           80        FALSE  Play
sunny     75           70        TRUE   Play
overcast  72           90        TRUE   Play
overcast  81           75        FALSE  Play
rain      71           80        TRUE   Don't Play
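As a rough illustration of how an entropy-based algorithm would rank attributes on this dataset, the sketch below computes the information gain of the two categorical attributes. The records are the ones from the slide; the function and variable names are assumptions for the example, and the numeric columns are left out because they would first need a threshold.

```python
from collections import Counter, defaultdict
from math import log2

# The play golf records from the slide:
# (outlook, temperature, humidity, windy, play).
ROWS = [
    ("sunny", 85, 85, False, "Don't Play"),
    ("sunny", 80, 90, True, "Don't Play"),
    ("overcast", 83, 78, False, "Play"),
    ("rain", 70, 96, False, "Play"),
    ("rain", 68, 80, False, "Play"),
    ("rain", 65, 70, True, "Don't Play"),
    ("overcast", 64, 65, True, "Play"),
    ("sunny", 72, 95, False, "Don't Play"),
    ("sunny", 69, 70, False, "Play"),
    ("rain", 75, 80, False, "Play"),
    ("sunny", 75, 70, True, "Play"),
    ("overcast", 72, 90, True, "Play"),
    ("overcast", 81, 75, False, "Play"),
    ("rain", 71, 80, True, "Don't Play"),
]
COLUMNS = {"OUTLOOK": 0, "TEMPERATURE": 1, "HUMIDITY": 2, "WINDY": 3}


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Entropy of PLAY before the split minus the weighted entropy after it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[-1])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - after


for name in ("OUTLOOK", "WINDY"):
    print(name, round(information_gain(ROWS, COLUMNS[name]), 3))
```

On these 14 records the gain is about 0.247 for OUTLOOK and about 0.048 for WINDY, so splitting on OUTLOOK first removes more entropy.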
11
Comparison between decision tree algorithms
AnswerTree solution using the famous CART
algorithm (SPSS analytical software)
12
Comparison between decision tree algorithms
Spartacus Data Mining tools using the C4.5
algorithm (Southampton University)

13
End of the introductory part
14
Part 1: The meta and meta-meta learners
  • The meta-learner
  • The user will choose the dataset and the
    variables.
  • The results of a series of trial runs, using
    combinations of different methods, will be the
    input of a neural network (the meta-learner).

15
[Diagram: the meta-learner]
Dataset
C1 (CRT): set of rules; data quality; CPU time; memory utilisation
C2 (QUEST): set of rules; data quality; CPU time; memory utilisation
Meta-learner (neural network): optimal data quality; simpler rules; total CPU time
Memory utilisation = (sum over c of memory(c) / CPU(c)) / Total time
16
The meta-meta-learners
Meta-Learner 1: CRT
Meta-Learner 2: Neural network, Linear discriminant
Meta-Learner 3: Relation rules, C4.5, STR-Tree
Dataset
Neural network (probably not necessary)
User defined: could be a function like
Best meta-learner = DataQuality * A + Simpler rules * B - Memory * C - Time * D
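One way to read the user-defined function on this slide is as a weighted score: data quality and rule simplicity add to it, memory and time subtract from it, and A to D are the user's weights. The sketch below assumes that reading. The `Candidate` class, the 0-1 scaling of the measurements and all the numbers are hypothetical, chosen only to show the mechanics.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Measurements for one meta-learner (hypothetical values, scaled 0-1)."""
    name: str
    data_quality: float
    rule_simplicity: float
    memory: float
    time: float


def score(candidate, a, b, c, d):
    """DataQuality*A + SimplerRules*B - Memory*C - Time*D (assumed reading)."""
    return (candidate.data_quality * a + candidate.rule_simplicity * b
            - candidate.memory * c - candidate.time * d)


def best_meta_learner(candidates, a, b, c, d):
    """Return the candidate with the highest weighted score."""
    return max(candidates, key=lambda cand: score(cand, a, b, c, d))


candidates = [
    Candidate("ML1 (CRT)", 0.80, 0.90, 0.20, 0.10),
    Candidate("ML2 (neural network, linear discriminant)", 0.90, 0.30, 0.60, 0.70),
    Candidate("ML3 (relation rules, C4.5, STR-Tree)", 0.85, 0.70, 0.40, 0.30),
]
# Weights on a 0-99 scale, supplied by the user.
winner = best_meta_learner(candidates, a=90, b=50, c=10, d=20)
print(winner.name)
```

Changing the weights changes the winner: with these made-up measurements, a user who cares only about data quality (B, C and D set to 0) would be given ML2 instead of ML1.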
17
The meta-meta-learners user input and output
Input
Dataset name? NHS
Dependent variables? LOS, OUTCOME, STROKE
How much do you care about:
Data quality (0-99)
Parsimonious models (0-99)
Time to process (0-99)
Memory utilisation (0-99)
Output
The best meta-learner for you is a combination
of C4.5, ANN and Relation rules.
These are the best rules:
1- IF HEART ATTACK and AGE > 90 then OUTCOME = DEATH (error 3)
2- Everybody that has STROKE also has HIGH BLOOD PRESSURE
3- AGE 2.3 APACHE 2 0.4 LOS (error 25)
18
Software that can implement multiple algorithms
  • Once the best meta-learner is found for a given
    situation, dataset and dependent variable, the
    user can define this meta-learner as the one to
    be executed in similar situations.
  • Ex: To find the patients' LOS in the ICU
    datasets, ML3 (CRT) will be used. However, to
    find the outcome of the patient (died or
    survived), ML103 (C4.5, relation rules) will be
    used.

19
No more slides
20
(No Transcript)
21
Nice things about decision trees
  • There are many mathematical and computing tools
    that can be applied to data mining.
  • Association Rules, Regression, Classification
    and Clustering.
  • For now the focus has been given to
  • Classification using Classification Trees.
  • Regression using Regression Trees.

22
Regression Trees
CAN BE USED ONLY IF THE DEPENDENT VARIABLE IS
CONTINUOUS.
Ex: Time a patient stays in the hospital (LOS in days).
Aim: To reduce the entropy of a dataset by splitting
it into datasets with less entropy. The key is how to
choose the best attribute to split the data.
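The slide speaks of entropy, but it does not say how that is measured for a continuous dependent variable. A common stand-in, used here as an assumption rather than as the method of the presentation, is the variance of the dependent variable: the sketch picks the threshold on one attribute that leaves the lowest weighted variance of LOS in the two resulting datasets. The ages and LOS values are invented for illustration.

```python
from statistics import pvariance


def weighted_variance(left, right):
    """Variance of the two groups, weighted by their number of records."""
    total = len(left) + len(right)
    return (len(left) * pvariance(left) + len(right) * pvariance(right)) / total


def best_threshold(xs, ys):
    """Try each midpoint between distinct x values; return (threshold, impurity)."""
    pairs = list(zip(xs, ys))
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        impurity = weighted_variance(left, right)
        if best is None or impurity < best[1]:
            best = (threshold, impurity)
    return best


# Hypothetical records: patient age and length of stay (LOS) in days.
ages = [30, 35, 40, 70, 75, 80]
los = [2, 3, 2, 9, 10, 11]
threshold, impurity = best_threshold(ages, los)
print(threshold, round(impurity, 3))
```

With these invented records the chosen threshold falls between the younger patients with short stays and the older patients with long stays, which is the kind of split a regression tree would place at its root.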