Title: DATA MINING a business perspective
1 DATA MINING, a business perspective
2 DATA MINING, a business perspective
- Introduction Data Mining
- A methodology for Data Mining
- Algorithm selection
- Success stories
- Data mining primitives
- 10 golden rules
- Conclusion
3 Data crisis
- Nowadays a large company produces more information in one day than anyone can take in during a whole lifetime
- The amount of data in the world roughly doubles every ten months
- It will be increasingly difficult to get the right information
4 Knowledge management
- Data → Information → Knowledge
- A database is an asset
- Organization must be a constant source of knowledge
- Creating, collecting, selecting, trading, filtering, exchanging, interpreting, cleaning, enriching
5 Main challenges
- Very large and growing databases
- Noisy/incomplete data
- Make discoveries understandable
- Legal/ethical/privacy issues
6 The ancestors of Data Mining
Database
Query languages
4th-generation languages
Executive Information System
OLAP
Data Mining
Next hype (?!)
7 KDD Definition
- Knowledge Discovery in Databases is the non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data (after Frawley et al., 1991)
8 Different forms of knowledge
Shallow Data (SQL)
Multi-Dimensional Data (OLAP)
Hidden Data (KDD)
Deep Data (only with clues)
9 Legacy databases
- Databases in the 80s were created to support the administrative process
- Bureaucratic attitude, "big brother is watching you", bad screen design, form-oriented software
- Ergo: primary administrative data is OK, the rest is rubbish
- Software ergonomics
10 The KDD process
Information requirements
Data selection
Cleaning
- Domain consistency
- De-duplication
- Disambiguation
Enrichment (external data)
Coding
Data Mining
- Clustering
- Segmentation
- Prediction
Reporting application
Action
Feedback
11 Cleaning
Original data
De-duplication
Domain consistency
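The two cleaning steps named on this slide can be sketched as follows. This is a minimal illustration, not the method used in the presentation: the record fields (`name`, `address`, `birth_year`), the matching rule and the allowed year range are all hypothetical assumptions.

```python
def record_key(record):
    """Normalise the fields used to decide that two records are the same client."""
    return (record["name"].strip().lower(), record["address"].strip().lower())


def de_duplicate(records):
    """Keep the first record seen for each normalised key."""
    unique = {}
    for record in records:
        unique.setdefault(record_key(record), record)
    return list(unique.values())


def enforce_domain(record, min_year=1900, max_year=2000):
    """Replace a birth year outside the allowed domain by None (unknown)."""
    cleaned = dict(record)
    if not min_year <= cleaned["birth_year"] <= max_year:
        cleaned["birth_year"] = None
    return cleaned


# Hypothetical original data: one duplicate client and one impossible birth year.
original = [
    {"name": "J. Smith", "address": "1 Main St", "birth_year": 1950},
    {"name": "j. smith ", "address": "1 main st", "birth_year": 1950},
    {"name": "A. Jones", "address": "2 High St", "birth_year": 1801},
]

unique = de_duplicate(original)
cleaned = [enforce_domain(record) for record in unique]
```

Real de-duplication usually needs fuzzy matching on names and addresses; exact matching on normalised text is only the simplest possible rule.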
12 Coding
Original fields: address, date of birth, car, date of purchase
Coded fields: region, age, car class, purchase month no.

Coding
Client  Region  Car class  Product  Age  House   Purchase month no.
10022   1       5          A        70   50.000  16
20033   2       3          B        40   60.000  NONE
10022   1       5          B        70   50.000  24

Alternative coding
Client  Region  Car class  Product  Age class  House class  Purchase season
10022   1       5          A        4          3            spring
20033   2       3          B        2          4            NONE
10022   1       5          B        4          3            winter
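A coding step of this kind can be sketched in a few lines. The sketch below recodes an age into an age class and a running purchase month number into a season. The class boundaries in `AGE_EDGES` are hypothetical, chosen only so that the slide's example values come out the same (70 gives class 4, 40 gives class 2), and it assumes that month number 1 is a January.

```python
from bisect import bisect_right

# Hypothetical class boundaries (not given in the slides).
AGE_EDGES = (30, 45, 60)

SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def to_class(value, edges):
    """Return a 1-based class number for value, given ascending bin edges."""
    return bisect_right(edges, value) + 1


def to_season(month_no):
    """Map a running month number (assumed: 1 = January) to a season."""
    if month_no is None:  # the client made no purchase
        return "NONE"
    calendar_month = (month_no - 1) % 12 + 1
    return SEASON_BY_MONTH[calendar_month]


records = [
    {"client": 10022, "age": 70, "month_no": 16},
    {"client": 20033, "age": 40, "month_no": None},
    {"client": 10022, "age": 70, "month_no": 24},
]

recoded = [
    {
        "client": r["client"],
        "age_class": to_class(r["age"], AGE_EDGES),
        "season": to_season(r["month_no"]),
    }
    for r in records
]
```

Which coding is better depends on the question asked of the data: a month number keeps the order of purchases, a season exposes a yearly pattern.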
13 Clustering / Prediction

Clustering
Region  Age  Car class  House   Clustering
1       70   5          50.000  type x
2       40   3          60.000  type q

Prediction
Region  Age  Car class  House   Prediction
1       70   5          50.000  product A and B
2       40   3          60.000  product B

input: Region, Age, Car class, House; output: the Clustering or Prediction column
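The clustering step can be illustrated with a small k-means routine. The slides do not say which clustering algorithm was used, so k-means is a swapped-in technique here, and the (age, car class) points are hypothetical. Starting centroids are simply the first k points, which keeps the sketch deterministic.

```python
from math import dist


def k_means(points, k, iterations=20):
    """Cluster points into k groups; returns (labels, centroids)."""
    centroids = [tuple(p) for p in points[:k]]  # deterministic start
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Move every centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_labels == labels and new_centroids == centroids:
            break  # nothing changed: converged
        labels, centroids = new_labels, new_centroids
    return labels, centroids


# Hypothetical (age, car class) records forming two customer types.
customers = [(70, 5), (68, 5), (72, 4), (40, 3), (38, 3), (42, 2)]
labels, centroids = k_means(customers, k=2)
```

In practice the attributes would be scaled first, because an attribute with a large range (such as house value) otherwise dominates the distance.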
14 Algorithm selection
Algorithms compared: Genetic Algorithms, Association Rules, Neural Networks, Decision Trees, KNN
Criteria:
- Ability to handle large number of records
- Ability to handle large number of attributes
- Ability to handle numeric attributes
- Ability to handle strings
Rating scale: Bad, Medium, Good
15 k-Nearest Neighbor
- ...a method that classifies records on the basis of their similarity with other records
- distance (similarity) measure necessary
- no knowledge gained
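The idea can be shown with a minimal k-nearest-neighbor classifier. Euclidean distance is assumed as the similarity measure, and the (age, car class) training records and product labels are hypothetical. Note that nothing is learned in advance: the stored records themselves are the model, which is why the method yields no explicit knowledge.

```python
from collections import Counter
from math import dist


def knn_classify(training, query, k=3):
    """Return the majority label among the k training records nearest to query."""
    nearest = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical records: ((age, car class), product bought).
training = [
    ((70, 5), "A"),
    ((65, 4), "A"),
    ((40, 3), "B"),
    ((35, 2), "B"),
    ((45, 3), "B"),
]

older_client = knn_classify(training, (68, 5), k=3)
younger_client = knn_classify(training, (42, 3), k=3)
```

Every classification scans all stored records, which is one reason the method copes poorly with very large numbers of records.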
16 Decision trees
A ? 35 0
A ? 35 100
17 Decision Trees: Information Gain (2)
Information gain on numeric attributes is based on binary splits
Error
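A binary split on a numeric attribute can be found by trying a threshold between each pair of neighbouring values and keeping the one with the highest information gain. The sketch below assumes entropy as the impurity measure and midpoints as candidate thresholds; the ages and class labels are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_binary_split(values, labels):
    """Return (threshold, gain) of the best split 'value <= threshold'."""
    pairs = list(zip(values, labels))
    base = entropy(labels)
    best_threshold, best_gain = None, 0.0
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2  # midpoint between neighbouring values
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        remainder = (
            len(left) * entropy(left) + len(right) * entropy(right)
        ) / len(labels)
        gain = base - remainder
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain


# Hypothetical data: age of a client and whether a product was bought (1) or not (0).
ages = [25, 30, 35, 40, 50, 60]
bought = [0, 0, 0, 1, 1, 1]
threshold, gain = best_binary_split(ages, bought)
```

A decision tree learner repeats this search for every attribute at every node and splits on the attribute with the highest gain.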
18 Neural networks (I)
19 Neural networks (II)
20 Neural networks (III)
21 Association rules
A → C: confidence 50%, support 33%
A, B → C: confidence 100%, support 33%
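Support and confidence of rules such as these can be computed as in the sketch below. Support is taken here as the fraction of transactions that contain every item of the rule, and confidence as that support divided by the support of the left-hand side. The three transactions are hypothetical, chosen so that the results match the figures on the slide (confidence 50 and 100, support 33).

```python
def support(transactions, items):
    """Fraction of transactions that contain all the given items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of the whole rule divided by the support of its left-hand side."""
    whole_rule = set(antecedent) | set(consequent)
    return support(transactions, whole_rule) / support(transactions, antecedent)


# Hypothetical market baskets.
transactions = [{"A", "B", "C"}, {"A", "D"}, {"B", "D"}]

support_a_c = support(transactions, {"A", "C"})
confidence_a_c = confidence(transactions, {"A"}, {"C"})
support_ab_c = support(transactions, {"A", "B", "C"})
confidence_ab_c = confidence(transactions, {"A", "B"}, {"C"})
```

Adding B to the left-hand side raises the confidence without changing the support, because the only basket with both A and B also contains C.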
22 Data Mining success stories
- KLM: predicting pilot bid behavior with Genetic Algorithms; 2% reduction in overall pilot costs
- A large bank: cross-selling application using k-nearest neighbor and clustering; loans predictor 86%
- Adaptive system management: decision trees derived from monitoring data; effective analysis of performance bottlenecks
23 Customer profile
24 Customer profile
25 Performance UNIX system
26 Bottlenecks in data mining
SQL
Fuzzy
27 Bottlenecks in data mining
28 Relational data mining primitives
29 Data mining in the organization
- A small team of data analysts use powerful professional data mining tools
- A larger group of product/marketing/HR managers use client tools
- Groups of normal IT users use embedded data mining
30 10 golden rules
- 1 Support for extremely large data sets
- 2 Support for hybrid learning
- 3 Based on a data warehouse
- 4 Data cleaning facilities
- 5 Dynamic coding
- 6 Integrated with DSS
- 7 Extendable architecture
- 8 Support for heterogeneous databases
- 9 Client/server architecture
- 10 Cache optimization
31 Conclusions
De Molen 25 HOUTEN
- Great potential
- Relation with data warehousing
- Change of relational technology
- Not easy to implement
- 80% data, 20% mining
- The self-learning organization
- Knowledge management