Title: Intelligent Data Mining
1. Intelligent Data Mining
Ethem Alpaydin, Department of Computer Engineering, Boğaziçi University
alpaydin_at_boun.edu.tr
2. What is Data Mining?
- The search for very strong patterns (correlations, dependencies) in big data that can generalise to accurate future decisions.
- Also known as Knowledge Discovery in Databases (KDD) or Business Intelligence.
3. Example Applications
- Association (Basket Analysis)
  - 30% of customers who buy diapers also buy beer.
- Classification
  - Young women buy small, inexpensive cars.
  - Older, wealthy men buy big cars.
- Regression
  - Credit scoring
4. Example Applications
- Sequential Patterns
  - Customers who pay late on two or more of the first three installments have a 60% probability of defaulting.
- Similar Time Sequences
  - The value of company X's stock has moved similarly to that of company Y's.
5. Example Applications
- Exceptions (Deviation Detection)
  - Are any of my customers behaving differently than usual?
- Text Mining (Web Mining)
  - Which documents on the internet are similar to this document?
6. IDIS: US Forest Service
- Identifies forest stands (areas similar in age, structure, and species composition).
- Predicts how different stands would react to fire and what preventive measures should be taken.
7. GTE Labs
- KEFIR (Key Findings Reporter)
  - Evaluates health-care utilization costs.
  - Isolates groups whose costs are likely to increase in the next year.
  - Finds medical conditions for which there is a known procedure that improves health and decreases costs.
8. Lockheed
- RECON: stock portfolio selection
  - Creates a portfolio of 150-200 securities from an analysis of a database of the performance of 1,500 securities over a 7-year period.
9. VISA
- Credit card fraud detection
- CRIS: neural-network software that learns to recognize the spending patterns of card holders and scores transactions by risk.
  - If a card holder normally buys gas and groceries and the account suddenly shows a purchase of stereo equipment in Hong Kong, CRIS sends a notice to the bank, which in turn can contact the card holder.
10. ISL Ltd (Clementine): BBC
- Audience prediction
  - Program schedulers must be able to predict the likely audience for a program and the optimum time to show it.
  - Type of program, time, competing programs, and other events affect audience figures.
11. Data Mining is NOT Magic!
Data mining draws on the concepts and methods of
databases, statistics, and machine learning.
12. From the Warehouse to the Mine
[Flow diagram:]
Transactional Databases -> (extract, transform, cleanse data) -> Data Warehouse -> (define goals, data transformations) -> Standard Form
13. How to Mine?
- Verification: computer-assisted, user-directed, top-down; query-and-report and OLAP (Online Analytical Processing) tools.
- Discovery: automated, data-driven, bottom-up.
14. Steps: 1. Define Goal
- Associations between products?
- New market segments or potential customers?
- Buying patterns over time or product sales trends?
- Discriminating among classes of customers?
15. Steps: 2. Prepare Data
- Integrate, select, and preprocess existing data (already done if there is a warehouse).
- Add any other data relevant to the objective that might supplement the existing data.
16. Steps: 2. Prepare Data (contd.)
- Select the data: identify relevant variables.
- Data cleaning: errors, inconsistencies, duplicates, missing data.
- Data scrubbing: mappings, data conversions, new attributes.
- Visual inspection: data distribution, structure, outliers, correlations between attributes.
- Feature analysis: clustering, discretization.
17. Steps: 3. Select Tool
- Identify the task class:
  - Clustering/segmentation, association, classification,
  - Pattern detection/prediction in time series.
- Identify the solution class:
  - Explanation (decision trees, rules) vs. black box (neural networks).
- Model assessment, validation, and comparison:
  - k-fold cross-validation, statistical tests.
- Combination of models.
18. Steps: 4. Interpretation
- Are the results (explanations/predictions) correct and significant?
- Consult a domain expert.
19. Example
- Data as a table of attributes:

  Name  Income  Owns a house?  Marital status  Default
  Ali   25,000  Yes            Married         No
  Veli  18,000  No             Married         Yes

We would like to be able to explain the value of one attribute in terms of the values of the other attributes that are relevant.
20. Modelling Data
- Attributes x are observable.
- y = f(x), where f is unknown and probabilistic.
21. Building a Model for Data
[Diagram: an unknown process f maps input x to output y; we build a model f̂ that approximates f.]
22. Learning from Data
- Given a sample X = {x^t, y^t}_t,
- we build f̂(x^t), a predictor of f(x^t), that minimizes the difference between our predictions and the actual values.
23. Types of Applications
- Classification: y ∈ {C1, C2, ..., CK}
- Regression: y ∈ ℝ
- Time-series prediction: x is temporally dependent
- Clustering: group x according to similarity
24. Example
[Scatter plot: customers plotted by yearly income and savings, labelled OK or DEFAULT.]
25. Example Solution
[Plot: the (yearly income, savings) plane split at thresholds θ1 and θ2 into OK and DEFAULT regions.]
RULE: IF yearly-income > θ1 AND savings > θ2 THEN OK ELSE DEFAULT
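As a minimal sketch, this rule can be coded directly; the threshold values below are hypothetical placeholders that would in practice be estimated from the data.

```python
# Hypothetical thresholds; real values would be learned from data.
THETA1 = 20_000  # yearly-income threshold (assumed)
THETA2 = 5_000   # savings threshold (assumed)

def credit_decision(yearly_income, savings):
    """Return 'OK' if both thresholds are exceeded, else 'DEFAULT'."""
    if yearly_income > THETA1 and savings > THETA2:
        return "OK"
    return "DEFAULT"

print(credit_decision(25_000, 7_000))  # OK
print(credit_decision(18_000, 2_000))  # DEFAULT
```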
26. Decision Trees
[Decision tree over x1 = yearly income and x2 = savings; leaves give y = 0 (DEFAULT) or y = 1 (OK).]
27. Clustering
[Scatter plot: the same (yearly-income, savings) data grouped into three clusters, Type 1, Type 2, and Type 3.]
28. Time-Series Prediction
[Timeline from Jan through Dec to the following Jan: given the past and the present, predict the future value (?).]
- Discovery of frequent episodes.
29. Methodology
[Flow diagram:]
1. Start from the initial standard form.
2. Data reduction: value and feature reductions.
3. Split the data into a train set and a test set.
4. Train alternative predictors (Predictor 1, 2, ..., L) on the train set.
5. Test the trained predictors on the test data and choose the best.
6. Accept the best predictor if it is good enough.
30. Data Visualisation
- Plot data in fewer dimensions (typically 2) to allow visual analysis.
- Visualisation of structure, groups, and outliers.
31. Data Visualisation
[Plot: data in the (yearly income, savings) plane; a rule region with marked exceptions.]
32. Techniques for Training Predictors
- Parametric multivariate statistics
- Memory-based (Case-based) Models
- Decision Trees
- Artificial Neural Networks
33. Classification
- x: a d-dimensional vector of attributes
- C1, C2, ..., CK: K classes
- Reject or doubt option
- Compute P(Ci|x) from the data and choose k such that
  P(Ck|x) = max_j P(Cj|x)
34. Bayes Rule
- p(x|Cj): likelihood that an object of class j has features x
- P(Cj): prior probability of class j
- p(x): probability of an object (of any class) having features x
- P(Cj|x): posterior probability that an object with features x is of class j
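These four quantities combine by Bayes' rule, the formula the slide's definitions describe:

```latex
P(C_j \mid x) = \frac{p(x \mid C_j)\, P(C_j)}{p(x)},
\qquad
p(x) = \sum_{j=1}^{K} p(x \mid C_j)\, P(C_j).
```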
35. Statistical Methods
- Parametric model, e.g., Gaussian, for the class densities p(x|Cj):
  - Univariate
  - Multivariate
36. Training a Classifier
- Given data {x^t}_t of class Cj:
- Univariate: p(x|Cj) is N(μj, σj²)
- Multivariate: p(x|Cj) is N_d(μj, Σj)
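A minimal NumPy sketch of the univariate case: fit one Gaussian per class by maximum likelihood (sample mean and variance) and classify by the largest prior-weighted likelihood. The data and labels below are illustrative, not from the lecture.

```python
import numpy as np

def fit_gaussian_classifier(x, y):
    """Estimate per-class priors, means, and variances from 1-D data x
    with class labels y (maximum-likelihood estimates)."""
    params = {}
    for c in np.unique(y):
        xc = x[y == c]
        params[c] = (len(xc) / len(x),  # prior P(Cj)
                     xc.mean(),         # mean mu_j
                     xc.var())          # variance sigma_j^2
    return params

def classify(x0, params):
    """Choose the class maximizing p(x0|Cj) * P(Cj)."""
    def score(p):
        prior, mu, var = p
        lik = np.exp(-(x0 - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        return prior * lik
    return max(params, key=lambda c: score(params[c]))

# Illustrative incomes of non-defaulting (0) and defaulting (1) customers.
x = np.array([25_000, 30_000, 28_000, 18_000, 15_000, 17_000], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
print(classify(26_000.0, fit_gaussian_classifier(x, y)))  # -> 0
```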
37. Example: 1D Case
38. Example: Different Variances
39. Example: Many Classes
40. 2D Case: Equal Spherical Classes
41. Shared Covariances
42. Different Covariances
43. Actions and Risks
- ai: action i
- λ(ai|Cj): loss of taking action ai when the situation is Cj
- R(ai|x) = Σj λ(ai|Cj) P(Cj|x)
- Choose ak such that R(ak|x) = min_i R(ai|x)
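A small sketch of minimum-risk action selection, assuming the loss matrix λ and the posteriors P(Cj|x) are already available as arrays; the loss values are illustrative.

```python
import numpy as np

def min_risk_action(loss, posterior):
    """loss[i, j] = lambda(a_i | C_j); posterior[j] = P(C_j | x).
    Returns the index of the action with minimum expected risk
    R(a_i | x) = sum_j loss[i, j] * posterior[j], plus all risks."""
    risks = loss @ posterior
    return int(np.argmin(risks)), risks

# Illustrative 2-class example: actions = (grant credit, refuse credit).
loss = np.array([[0.0, 10.0],   # granting to a defaulter costs 10
                 [1.0,  0.0]])  # refusing a good customer costs 1
posterior = np.array([0.8, 0.2])
action, risks = min_risk_action(loss, posterior)
print(action, risks)  # -> 1 (refuse credit); risks = [2.0, 0.8]
```

Note that even with an 80% posterior of being a good customer, the high loss of granting credit to a defaulter makes refusal the minimum-risk action here.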
44. Function Approximation (Scoring)
45. Regression
- y = f(x) + ε, where ε is noise.
- In linear regression, f(x) = w x + w0.
- Find w, w0 that minimize the error on the sample:
  E(w, w0 | X) = Σ_t (y^t − (w x^t + w0))²
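A minimal least-squares sketch, solving for w and w0 in closed form with NumPy on illustrative data:

```python
import numpy as np

def fit_linear(x, y):
    """Find w, w0 minimizing E = sum_t (y_t - (w * x_t + w0))^2
    via least squares (np.linalg.lstsq)."""
    A = np.column_stack([x, np.ones_like(x)])  # design matrix [x, 1]
    (w, w0), *_ = np.linalg.lstsq(A, y, rcond=None)
    return w, w0

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0 + np.array([0.1, -0.1, 0.05, -0.05])  # noisy line
print(fit_linear(x, y))  # approximately (2, 1)
```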
46. Linear Regression
47. Polynomial Regression
48. Polynomial Regression
49. Multiple Linear Regression
50. Feature Selection
- Subset selection
  - Forward and backward methods
- Linear projection
  - Principal Components Analysis (PCA)
  - Linear Discriminant Analysis (LDA)
51. Sequential Feature Selection
[Diagram of the two search directions over subsets of {x1, x2, x3, x4}:]
- Forward selection starts from the single features (x1), (x2), (x3), (x4) and grows the best subset one feature at a time, e.g. (x1 x3), (x2 x3), (x3 x4), (x2 x4), (x1 x4), (x1 x2), then (x1 x2 x3), (x2 x3 x4), ...
- Backward selection starts from the full set (x1 x2 x3 x4) and removes one feature at a time: (x1 x2 x3), (x1 x2 x4), (x1 x3 x4), (x2 x3 x4), ...
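A sketch of greedy forward selection, assuming the caller supplies a scoring function evaluate(subset) (e.g., the validation accuracy of a model trained on just those features); this is one plausible rendering, not the slide's exact procedure.

```python
def forward_selection(features, evaluate):
    """Greedy forward selection: starting from the empty set, repeatedly
    add the single feature that most improves evaluate(subset); stop
    when no addition improves the score. `evaluate` is assumed to
    return a score where higher is better."""
    selected, best_score = [], float("-inf")
    improved = True
    while improved:
        improved = False
        best_feature = None
        for f in features:
            if f in selected:
                continue
            score = evaluate(selected + [f])
            if score > best_score:
                best_score, best_feature = score, f
                improved = True
        if improved:
            selected.append(best_feature)
    return selected, best_score
```

Backward selection is symmetric: start from the full set and greedily remove the feature whose removal helps most (or hurts least).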
52. Principal Components Analysis (PCA)
[Plots: the data in the original (x1, x2) axes and in the principal-component (z1, z2) axes; the whitening transform.]
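A minimal PCA sketch via the eigendecomposition of the sample covariance matrix (illustrative and unoptimized; whitening would additionally divide each component by its standard deviation):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the top-k principal components:
    the eigenvectors of the data covariance matrix with the largest
    eigenvalues."""
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return Xc @ top                         # z = W^T (x - m)
```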
53. Linear Discriminant Analysis (LDA)
[Plot: two classes in the (x1, x2) plane projected onto the discriminant direction z1.]
54. Memory-based Methods
- Case-based reasoning
- Nearest-neighbor algorithms
- Keep a list of known instances and interpolate the response from those (see the k-NN sketch below).
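A minimal k-nearest-neighbor sketch of this idea; k and the Euclidean metric are common choices, not prescribed by the slide.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=3):
    """Label x_query by a majority vote among its k nearest
    training instances (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Illustrative use:
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0]])
y = np.array([0, 0, 1])
print(knn_classify(X, y, np.array([1.1, 0.9]), k=3))  # -> 0
```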
55. Nearest Neighbor
[Plot: instances in the (x1, x2) plane; a query point takes the label of its nearest neighbors.]
56. Local Regression
[Plot: y versus x fit by local models (Mixture of Experts).]
57. Missing Data
- Ignore cases with missing data
- Mean imputation
- Imputation by regression
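A minimal sketch of the second option, mean imputation, with NumPy (imputation by regression would instead predict each missing value from the other attributes):

```python
import numpy as np

def mean_impute(X):
    """Replace NaNs in each column with that column's mean over the
    observed (non-NaN) values."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

X = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, np.nan]])
print(mean_impute(X))  # NaNs replaced by column means 2.0 and 3.0
```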
58. Training Decision Trees
[Plot: data in the (x1, x2) plane partitioned by axis-aligned splits.]
59. Measuring Disorder
[Plots: two candidate splits at threshold θ in the (x1, x2) plane; the better split leaves less class disorder on each side.]
60. Entropy
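The usual disorder measure here is the entropy of the class proportions at a node, H = −Σj pj log2 pj; a minimal sketch:

```python
import numpy as np

def entropy(labels):
    """H = -sum_j p_j log2 p_j over the class proportions; 0 for a
    pure node, maximal when the classes are equally mixed."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(-p * np.log2(p)) + 0.0)  # +0.0 normalizes -0.0

print(entropy([0, 0, 0, 0]))  # 0.0 (pure node)
print(entropy([0, 0, 1, 1]))  # 1.0 (maximally mixed, 2 classes)
```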
61. Artificial Neural Networks
[Diagram: a single unit with inputs x1, ..., xd plus bias x0 = 1, weights w0, w1, ..., wd, and activation g giving output y = g(Σj wj xj + w0).]
- Regression: g is the identity.
- Classification: g is the sigmoid (0/1).
62. Training a Neural Network
- Given a training set X, find the weights w that minimize the error E on X.
63. Nonlinear Optimization
[Plot: the error E as a function of a weight wi.]
- Gradient descent: iterative learning starting from random w.
- Δwi = −η ∂E/∂wi, where η is the learning factor.
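A minimal gradient-descent sketch for the squared error of a linear unit; the learning factor η and the iteration count are illustrative choices.

```python
import numpy as np

def gradient_descent(x, y, eta=0.01, epochs=500):
    """Iteratively update w, w0 by Delta w = -eta * dE/dw, starting
    from small random weights, for E = sum_t (y_t - (w x_t + w0))^2."""
    rng = np.random.default_rng(0)
    w, w0 = rng.normal(scale=0.1, size=2)
    for _ in range(epochs):
        err = y - (w * x + w0)          # residuals
        w += eta * np.sum(err * x)      # -dE/dw  (up to a factor of 2)
        w0 += eta * np.sum(err)         # -dE/dw0 (up to a factor of 2)
    return w, w0

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
print(gradient_descent(x, y))  # approaches (2, 1)
```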
64. Neural Networks for Classification
- K outputs oj, j = 1, ..., K; each oj estimates P(Cj|x).
65. Multiple Outputs
66. Iterative Training
[Plots: decision boundaries during iterative training, linear vs. nonlinear.]
67. Nonlinear Classification
- Linearly separable case.
- NOT linearly separable: requires a nonlinear discriminant.
68. Multi-Layer Networks
[Diagram: inputs x1, ..., xd (plus bias x0 = 1) feed hidden units h1, ..., hH (plus bias h0 = 1) through first-layer weights; the hidden units feed outputs o1, ..., oK through second-layer weights.]
69. Probabilistic Networks
70. Evaluating Learners
- Given a model M, how can we assess its performance on real (future) data?
- Given M1, M2, ..., ML, which one is the best?
71. Cross-validation
[Diagram: the data is split into k parts; in each round, k-1 parts are used for training and the remaining part is held out for validation.]
Repeat k times and average.
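A minimal k-fold cross-validation sketch, assuming the caller supplies train_and_score, which trains on the training folds and returns a scalar score (e.g., accuracy) on the held-out fold:

```python
import numpy as np

def k_fold_cv(X, y, k, train_and_score):
    """Split NumPy arrays X, y into k parts; in each round hold one
    part out for validation, train on the rest, and average the k
    scores returned by train_and_score(X_tr, y_tr, X_va, y_va)."""
    idx = np.random.default_rng(0).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        va = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_score(X[tr], y[tr], X[va], y[va]))
    return float(np.mean(scores))
```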
72. Combining Learners: Why?
[Flow diagram: from the initial standard form, split the data into a train set and a validation set; train Predictor 1, 2, ..., L on the train set; choose the best on the validation set to obtain the Best Predictor.]
73. Combining Learners: How?
[Flow diagram: as before, train Predictor 1, 2, ..., L on the train set, but instead of choosing a single best on the validation set, combine their outputs by voting.]
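A minimal sketch of combination by majority voting over trained predictors; each predictor is assumed to be a callable returning a class label.

```python
from collections import Counter

def vote(predictors, x):
    """Combine trained predictors by majority vote: collect each
    predictor's label for x and return the most common one."""
    labels = [p(x) for p in predictors]
    return Counter(labels).most_common(1)[0][0]

# Illustrative use with three toy predictors:
preds = [lambda x: 0, lambda x: 1, lambda x: 1]
print(vote(preds, x=None))  # -> 1
```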
74. Conclusions: The Importance of Data
- Extract valuable information from large amounts of raw data.
- A large amount of reliable data is a must; the quality of the solution depends highly on the quality of the data.
- Data mining is not alchemy; we cannot turn stone into gold.
75. Conclusions: The Importance of the Domain Expert
- A joint effort of human experts and computers.
- Any information (symmetries, constraints, etc.) regarding the application should be made use of to help the learning system.
- Results should be checked for consistency by domain experts.
76. Conclusions: The Importance of Being Patient
- Data mining is not straightforward; repeated trials are needed before the system is fine-tuned.
- Mining may be lengthy and costly. Large expectations lead to large disappointments!
77. Once Again: Important Requirements for Mining
- A large amount of high-quality data
- Devoted and knowledgeable experts on:
  - The application domain
  - Databases (data warehouse)
  - Statistics and machine learning
- Time and patience
78. That's all folks!