Title: Data Mining
1Data Mining
- Luc Dehaspe
- K.U.L. Computer Science Department
- Marc Van Hulle
- K.U.L. Neurophysiology Department
http://toledo.kuleuven.ac.be/
2Course overview
Data Mining
3Exercise session
- Part 1 (L. Dehaspe)
- 2 × 2.5 h paper-and-pencil sessions
- application of algorithms
- Part 2 (M. Van Hulle)
- hands-on exercises
4Exam
- Written exam, closed book
- Part 1 (Sessions 1-7): 50%
- Coverage
- Questions RESTRICTED TO CONTENT OF SLIDES
- Occasional pointers to additional material: I do not expect you to study this material
- Questions
- One main question: apply/understand algorithm (30%)
- Two smaller questions: explain concept, compute model quality, ... (2 × 10%)
- Part 2 (Sessions 8-14): 50% (explained later by Marc Van Hulle)
5Working definition data mining
- tools to search data for patterns and relationships that lead to better business decisions
- business: commercial/scientific
6Overview
- myths and facts
- the Data Mining process
- methods
- visual
- non-visual
7Myths and facts
- New technology cycle
- phase 1 hype
- unrealistic expectations
- naive users
- phase 2 frustration
- phase 3 rejection
- Alternative: realistic view on vital technology
8Myth 1: tabula rasa (virgin territory)
- Data mining methods are fundamentally different
from previous methods
Fact
- Underlying ideas often decades old
- neural networks: 1940
- k-nearest neighbour: 1950
- CART (regression trees): 1960
- Novel
- integrated applications to general business problems
- more data, more computing power
- non-academic users
9Data Mining
Diagram labels: Problem, Solution, Meta learning, Task, Data, Performance
Solution: integration of tools, mixture of old and new
10Take home lesson 1
- Not one optimal method
- But portfolio of tools, mixture of old and new
11Myth 2: manna from heaven
- Data mining produces surprising results
- that will turn your business upside-down
- without any input of domain expert knowledge
- without any tuning of the technology
Fact
- incremental changes rather than revolutionary
- long term competitive advantage
- occasional breakthroughs (e.g. link between aspirin and Reye's syndrome)
- technology: assistant to the domain expert
- careful selection required of
- goal
- technology
12Take home lesson 2
- Crucial combination of
- business (application domain) expertise
- data mining technology expertise
13Data Mining process model
- Definition
- Link with the scientific method
14The data mining process
The non-trivial process of finding valid, novel,
potentially useful, and ultimately understandable
patterns in data
- process: iterative, learn to ask better questions
- valid: patterns can be generalized to new data
- novel and useful: offer a competitive advantage
- understandable: contribute to insight in the domain
15Interrogating the database: Look-up queries
Which customers have car insurance?
What is the average toxicity of cadmium chloride?
How many earthquakes occurred last year?
How did HIV patient p123 react to AZT?
16Interrogating the database: Finding patterns
What is the profile of returning customers?
What is the relation between in vitro activity
and chemical structure?
What is the relation between geological features
and the occurrence of earthquakes?
What is the relation between an HIV patient's
therapy history and response to AZT?
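The difference between the two kinds of question can be sketched in a few lines of Python. The customer records and field names below are invented for illustration; the first function answers a look-up query, the second asks how an outcome varies with an attribute, which is a very simple form of pattern finding.

```python
from collections import defaultdict

# Hypothetical customer records (invented for illustration).
customers = [
    {"id": 1, "age_group": "20-40", "car_insurance": True,  "returned": True},
    {"id": 2, "age_group": "20-40", "car_insurance": False, "returned": True},
    {"id": 3, "age_group": "40-60", "car_insurance": True,  "returned": False},
    {"id": 4, "age_group": "40-60", "car_insurance": True,  "returned": False},
    {"id": 5, "age_group": "20-40", "car_insurance": True,  "returned": True},
    {"id": 6, "age_group": "40-60", "car_insurance": False, "returned": True},
]

def lookup_car_insurance(records):
    """Look-up query: ids of the records that satisfy a stated condition."""
    return [r["id"] for r in records if r["car_insurance"]]

def return_rate_by(records, attribute):
    """Pattern question: how does the return rate vary with an attribute?"""
    counts = defaultdict(lambda: [0, 0])  # value -> [returned, total]
    for r in records:
        counts[r[attribute]][0] += r["returned"]
        counts[r[attribute]][1] += 1
    return {value: returned / total for value, (returned, total) in counts.items()}

insured_ids = lookup_car_insurance(customers)
rates = return_rate_by(customers, "age_group")
print(insured_ids)
print(rates)
```

The look-up returns facts already stored in the database; the second function summarises a relationship that no single record contains.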
17Science
18Science
"The actual discovery of such an explanatory hypothesis is a process of creation, in which imagination as well as knowledge is involved." Irving Copi, Introduction to Logic, 1986
"The formation of hypotheses is the most mysterious of all the categories of scientific method. Where they come from, no one knows. A person is sitting somewhere, minding his own business, and suddenly - flash! - he understands something he didn't understand before." Robert M. Pirsig, Zen and the Art of Motorcycle Maintenance
Diagram labels: collect data, build hypothesis, formulate theory, verify hypothesis
19Evolution of data generation
Data Rich, Knowledge Poor
"Everyone, even the most patient and thorough investigator, must pick and choose, deciding which facts to study and which to pass over." Irving Copi, Introduction to Logic, 1986
Diagram labels: 2000, Data source, Data, Data analyst
20The scientific method
Knowledge Discovery in Databases
Diagram labels: Data warehousing, collect data, Data Mining, build hypothesis, formulate theory, care, inspiration, Statistics - OLAP, verify hypothesis
21Data Mining
- Definition
- Extracting or mining knowledge from large
amounts of data
CRISP-DM process model
22Data mining in industry
- An in silico research assistant allowing researchers to
- Explore an integrated database
- For a variety of research purposes (business goals)
- Using an optimal selection of data mining technologies
Diagram labels: knowledge, pattern
23Data Mining process model CRISP-DM
24Business understanding
- Which are the business goals?
- Translation to data mining problem definition
- Design of a plan to meet objectives
25Data understanding
- First collection of data
- Becoming familiar with the data
- Judge data quality
- Discovery of
- first insights
- interesting subsets
26Data preparation
- Extract final data set from original set
- Selection of
- tables
- records
- attributes
- transformation
- data cleaning
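The preparation steps listed above can be sketched as one small Python function. This is a minimal sketch under stated assumptions: the records, the field names and the rule "drop rows with missing values" are invented for illustration, and real projects usually need more careful cleaning.

```python
# Hypothetical raw records; None or "n/a" marks a missing value.
raw = [
    {"id": 1, "age": 34, "income": "42000", "notes": "ok", "country": "BE"},
    {"id": 2, "age": None, "income": "38000", "notes": "", "country": "BE"},
    {"id": 3, "age": 51, "income": "n/a", "notes": "call back", "country": "NL"},
    {"id": 4, "age": 29, "income": "51000", "notes": "", "country": "BE"},
]

def prepare(records, attributes, keep_record):
    """Select records and attributes, convert income to a number, drop incomplete rows."""
    prepared = []
    for r in records:
        if not keep_record(r):                    # selection of records
            continue
        row = {a: r[a] for a in attributes}       # selection of attributes
        if "income" in row:                       # transformation: text -> float
            try:
                row["income"] = float(row["income"])
            except ValueError:
                row["income"] = None
        if any(v is None for v in row.values()):  # data cleaning: drop incomplete rows
            continue
        prepared.append(row)
    return prepared

final = prepare(raw, ["id", "age", "income"], lambda r: r["country"] == "BE")
print(final)
```

Dropping incomplete rows is only one possible cleaning policy; imputing a value is a common alternative when data is scarce.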
27Modelling
- Selection of modelling techniques
- calibrating parameters
- regular backtracking to adapt data to technology
- (some techniques discussed further on)
28Evaluation
- Decide whether to use Data Mining results
- Verification of all steps
- Check whether business goals have been met
29Deployment
- Organisation and presentation of new insights
- variable complexity
- deliver report
- implement software that allows process to be
repeated
30Visual Data Mining methods
- Pro
- an image has a broader information bandwidth than text
- (cf. an image tells more than a thousand words)
- Con
- problems with representation of 3 dimensions
- not effective in case of color blindness
- interpretation gives more information on subject than on object
- stars, clouds, Hermann Rorschach test
31Visual Data Mining methods
32Visual Data Mining methods
33Visual Data Mining methods
- Conditional probabilities
34Visual Data Mining methods
35Visual Data Mining methods
36Non-visual data mining methods
- Statistics - OLAP
- descriptive: average, median, standard deviation, distribution
- hypothesis testing: (observed differences)/(random variation)
- discriminant analysis
- predictive: regression analysis (linear, non-linear)
- clustering
- Neural networks
- Decision trees and rules
- Conceptual clustering
- Association rules
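As a small illustration of the descriptive statistics and of the ratio (observed differences)/(random variation), the sketch below uses Python's standard `statistics` module. The two groups of measurements are invented, and Welch's t statistic is used here as one common instance of that ratio; the slides do not say which test is meant.

```python
import math
import statistics

# Hypothetical measurements for two groups (invented for illustration).
group_a = [5.1, 4.9, 5.6, 5.8, 5.0, 5.4]
group_b = [6.2, 5.9, 6.5, 6.1, 6.8, 6.0]

def describe(values):
    """Descriptive statistics: average, median, standard deviation."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }

def welch_t(a, b):
    """Observed difference in means divided by its estimated standard error."""
    observed_difference = statistics.mean(a) - statistics.mean(b)
    random_variation = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return observed_difference / random_variation

summary_a = describe(group_a)
t_value = welch_t(group_a, group_b)
print(summary_a)
print(t_value)
```

A ratio far from zero suggests that the observed difference is large compared with what random variation alone would produce.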
37(Non-)visual Data Mining methods: OLAP - Data cubes
- Online analytical processing
- Classical statistical methods combined with database technology
- real-time calculations
- powerful visualisation methods
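The idea of a data cube can be illustrated with a roll-up: a measure is summed over every dimension that is not kept. The sketch below assumes a tiny, invented fact table with three dimensions (region, product, quarter); real OLAP systems precompute and index such aggregates rather than scanning the facts.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, quarter, amount).
facts = [
    ("North", "A", "Q1", 100), ("North", "A", "Q2", 120),
    ("North", "B", "Q1", 80),  ("South", "A", "Q1", 60),
    ("South", "B", "Q1", 90),  ("South", "B", "Q2", 110),
]
DIMENSIONS = ("region", "product", "quarter")

def roll_up(facts, keep):
    """Sum the amount over all dimensions that are not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for *dims, amount in facts:
        totals[tuple(dims[i] for i in idx)] += amount
    return dict(totals)

by_region = roll_up(facts, ["region"])
by_region_product = roll_up(facts, ["region", "product"])
grand_total = roll_up(facts, [])
print(by_region)
print(by_region_product)
print(grand_total)
```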
38Non-Visual Data Mining methods: Regression
39Non-visual Data Mining methods: Discriminant analysis
- R.A. Fisher, 1936
- discovers planes that separate classes
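A minimal sketch of Fisher's linear discriminant for two classes in two dimensions, where the separating "plane" is a line. The sample points are invented, and placing the threshold halfway between the projected class means is an assumption made here (it corresponds to treating both classes as equally likely).

```python
def mean_vec(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, m):
    """2x2 scatter matrix of one class: sum of (x - m)(x - m)^T."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_discriminant(class0, class1):
    """Return (w, threshold); x is assigned to class 1 when w . x > threshold."""
    m0, m1 = mean_vec(class0), mean_vec(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]  # assumed non-zero
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    # Fisher direction: inverse within-class scatter times the mean difference.
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    threshold = 0.5 * (w[0] * (m0[0] + m1[0]) + w[1] * (m0[1] + m1[1]))
    return w, threshold

def classify(x, w, threshold):
    return 1 if w[0] * x[0] + w[1] * x[1] > threshold else 0

# Hypothetical training points for two classes.
class0 = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
class1 = [(6.0, 7.0), (7.0, 6.0), (7.0, 8.0), (8.0, 7.0)]
w, threshold = fisher_discriminant(class0, class1)
print(w, threshold)
print(classify((2.5, 2.5), w, threshold), classify((6.5, 7.5), w, threshold))
```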
40Non-Visual Data Mining methods: Neural Networks
- Represent functions with output a discrete value, a real value, or a vector
- Neurobiological motivation
- Network parameters tuned on basis of input-output examples (backpropagation)
- e.g. input from sensors
- camera (face recognition)
- microphone (speech recognition)
41Non-Visual Data Mining methods: Decision trees
42Non-Visual Data Mining methods: Decision trees
- Attribute selection
- information gain
- how well does an attribute distribute the data according to their target class
- maximal reduction of Entropy
- Entropy = $-p_M \log_2 p_M - p_F \log_2 p_F$
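The entropy formula and the information gain criterion can be sketched as follows. The eight buyer records are invented for illustration, with target classes M and F as in the formula; information gain is computed as the entropy of the whole set minus the weighted entropy of the subsets produced by splitting on an attribute.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction of entropy obtained by splitting the rows on one attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical buyer records (invented for illustration).
rows = [
    {"frame": "2-Door", "color": "Red",  "sex": "M"},
    {"frame": "2-Door", "color": "Blue", "sex": "M"},
    {"frame": "2-Door", "color": "Red",  "sex": "M"},
    {"frame": "2-Door", "color": "Blue", "sex": "F"},
    {"frame": "4-Door", "color": "Red",  "sex": "F"},
    {"frame": "4-Door", "color": "Blue", "sex": "F"},
    {"frame": "4-Door", "color": "Red",  "sex": "F"},
    {"frame": "4-Door", "color": "Blue", "sex": "M"},
]

gain_frame = information_gain(rows, "frame", "sex")
gain_color = information_gain(rows, "color", "sex")
print(gain_frame, gain_color)
```

In this toy data set the frame attribute separates the classes better than color, so a tree builder that maximises information gain would split on frame first.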
43Non-Visual Data Mining methods: Decision rules
- IF
- Frame 2-Door AND
- Engine ? V6 AND
- Age
- Cost 30K AND
- Color Red
- THEN
- buyer is highly likely to be male
44Non-Visual Data Mining methods: Clustering
Figure: clusters labelled Cholesterol biosynthesis, Cell cycle, Early response, Signaling and angiogenesis, Wound healing (Eisen et al., PNAS 1998)
45Non-Visual Data Mining methods: Conceptual clustering
- Groups examples and provides description of each group
- all examples
- A: Age-20
- B: Age 20-40
- b1: Age 20-40 and Frame 2-Door
- b2: Age 20-40 and Frame 4-Door
- C: Age 40-60
- D: Age 60
- d1: Age 60 and Frame 2-Door
- d2: Age 60 and Frame 4-Door
46Non-Visual Data Mining methods: Association rules
- IF-THEN rules show relationships
- e.g. which products are bought together?
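A minimal sketch of how an IF-THEN association rule can be scored on shopping baskets. Support and confidence are the measures commonly used for this purpose; the slides do not name them, so treat them as an assumption here, and the baskets are invented.

```python
# Hypothetical shopping baskets (invented for illustration).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item of the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the IF part, the fraction that also contains the THEN part."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

# Rule: IF bread THEN butter
rule_support = support({"bread", "butter"}, baskets)
rule_confidence = confidence({"bread"}, {"butter"}, baskets)
print(rule_support, rule_confidence)
```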
47Evaluation pitfalls: Post hoc ergo propter hoc
Everyone who drank Stella in the year 1743 is
now dead. Therefore, Stella is fatal.
48Evaluation pitfalls: Correlation does not imply causality
- Palm size correlates with your life expectancy
- The larger your palm, the shorter you will live, on average.
Why?
Women have smaller palms and live 6 years longer on average
Be careful with actions inspired by data mining results!
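The palm-size example can be reproduced with a small simulation. This is a sketch under invented assumptions: palm sizes of about 17 cm and 19 cm and the spreads are made up, and only the 6-year difference in life expectancy comes from the slide. Within each sex, palm size and lifespan are generated independently, yet the pooled data still shows a negative correlation.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def simulate(n_per_group=2000, seed=0):
    """Generate (sex, palm size, lifespan) with no direct palm-lifespan link."""
    rng = random.Random(seed)
    people = []
    # Assumed group means: palm size in cm, lifespan in years.
    for sex, palm_mean, life_mean in (("F", 17.0, 82.0), ("M", 19.0, 76.0)):
        for _ in range(n_per_group):
            people.append((sex, rng.gauss(palm_mean, 1.0), rng.gauss(life_mean, 8.0)))
    return people

def corr(people):
    return pearson([p[1] for p in people], [p[2] for p in people])

people = simulate()
overall = corr(people)
within = {s: corr([p for p in people if p[0] == s]) for s in ("F", "M")}
print(overall)
print(within)
```

The pooled correlation is clearly negative, while the correlation inside each group stays close to zero: sex drives both variables, and palm size has no effect of its own in this simulation.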
49Evaluation pitfalls: Hypothesis validation
- descriptive statistics: 1 hypothesis
- data mining: 1 hypothesis-SPACE
- much higher probability of random relationships
- validation on separate data set required
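Why a separate data set is needed can be shown with purely random data. In the sketch below (sizes, seed and the use of Pearson correlation are assumptions chosen for illustration) a target and 200 candidate attributes are all independent noise. Searching the whole hypothesis space on a small training part still finds an attribute that looks related to the target; the same attribute is then checked on held-out data.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def best_random_attribute(n_train=30, n_valid=300, n_attributes=200, seed=0):
    """Pick the attribute most correlated with the target on the training part."""
    rng = random.Random(seed)
    n = n_train + n_valid
    target = [rng.gauss(0, 1) for _ in range(n)]
    attributes = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_attributes)]
    # Search the hypothesis space using the training part only.
    best = max(attributes, key=lambda a: abs(pearson(a[:n_train], target[:n_train])))
    r_train = pearson(best[:n_train], target[:n_train])
    r_valid = pearson(best[n_train:], target[n_train:])
    return r_train, r_valid

r_train, r_valid = best_random_attribute()
print(r_train, r_valid)
```

The training correlation of the selected attribute is sizeable although no real relationship exists; on the separate validation part it falls back towards zero.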