DATA MINING: a business perspective

Slides: 32
Provided by: staffSci

1
DATA MINING: a business perspective
  • P.W. Adriaans

2
DATA MINING, a business perspective
  • Introduction Data Mining
  • A methodology for Data Mining
  • Algorithm selection
  • Success stories
  • Data mining primitives
  • 10 golden rules
  • Conclusion

3
Data crisis
  • Nowadays a large company produces more
    information in one day than any person can take
    in during a whole lifetime
  • The amount of data in the world roughly doubles
    every ten months
  • It will be increasingly difficult to get the
    right information

4
Knowledge management
  • Data → Information → Knowledge
  • A database is an asset
  • The organization must be a constant source of
    knowledge
  • Creating, collecting, selecting, trading,
    filtering, exchanging, interpreting, cleaning,
    enriching

5
Main challenges
  • Very large and growing databases
  • Noisy/incomplete data
  • Make discoveries understandable
  • Legal/ethical/privacy issues

6
The ancestors of Data Mining
  • Database
  • Query languages
  • 4th-generation languages
  • Executive Information System
  • OLAP
  • Data Mining
  • Next hype (?!)

7
KDD Definition
  • Knowledge Discovery in Databases is the
    non-trivial extraction of implicit, previously
    unknown and potentially useful knowledge from
    data (after Frawley et al., 1991)

8
Different forms of knowledge
  • Shallow Data (SQL)
  • Multi-Dimensional Data (OLAP)
  • Hidden Data (KDD)
  • Deep Data (only with clues)

9
Legacy databases
  • Databases in the 80s were created to support the
    administrative process
  • Bureaucratic attitude ("big brother is watching
    you"), bad screen design, form-oriented software
  • Ergo: primary administrative data is OK, the rest
    is rubbish
  • Software ergonomics

10
The KDD process
Stages, in order:
  • Data selection
  • Cleaning (domain consistency, de-duplication,
    disambiguation)
  • Enrichment
  • Coding
  • Data Mining (clustering, segmentation,
    prediction)
  • Reporting application
Inputs: information requirements, external data
Output: action, plus feedback into the process

11
Cleaning
  • Original data
  • De-duplication
  • Domain consistency
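The two cleaning steps named on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (name, address, birth_year), the normalization rule, and the valid range for birth years are hypothetical and do not come from the presentation.

```python
def normalize(text):
    """Lower-case and collapse whitespace so near-identical spellings match."""
    return " ".join(str(text).lower().split())


def deduplicate(records, key_fields):
    """Keep the first record for each normalized key (hypothetical matching rule)."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(normalize(rec[field]) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def enforce_domain(records, field, is_valid):
    """Replace values outside the allowed domain with None instead of guessing."""
    cleaned = []
    for rec in records:
        rec = dict(rec)  # copy, so the input records stay untouched
        if not is_valid(rec[field]):
            rec[field] = None
        cleaned.append(rec)
    return cleaned


# Hypothetical sample data: the first two rows describe the same client.
records = [
    {"name": "J. Jansen", "address": "Main St 1", "birth_year": 1950},
    {"name": "j.  jansen", "address": "main st 1", "birth_year": 1950},
    {"name": "P. Smit", "address": "Dam 7", "birth_year": 1850},
]

unique = deduplicate(records, key_fields=("name", "address"))
# Assumed domain rule: birth years between 1900 and 2000 are plausible.
cleaned = enforce_domain(unique, "birth_year", lambda y: 1900 <= y <= 2000)
```

Real de-duplication usually needs fuzzier matching than exact comparison of normalized strings; the exact-match key is used here only to keep the sketch short.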
12
Coding
Original attributes: address, date of birth, car, date of purchase
Coded attributes:    region, age, car class, purchase month no.

Coding
Client  Region  Car class  Product  Age  House   Purchase month no.
10022   1       5          A        70   50.000  16
20033   2       3          B        40   60.000  NONE
10022   1       5          B        70   50.000  24

Alternative coding
Client  Region  Car class  Product  Age class  House class  Purchase season
10022   1       5          A        4          3            spring
20033   2       3          B        2          4            NONE
10022   1       5          B        4          3            winter
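The step from the first coding to the alternative coding can be illustrated with a short Python sketch. The slide does not state the actual coding rules, so both rules below are assumptions chosen only because they reproduce the values shown: month numbers are taken to wrap around every 12 months (16 becomes April, 24 becomes December), and age classes are taken to be 20-year bands.

```python
SEASONS = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def month_no_to_season(month_no):
    """Map a running month number to a season; None stays None (no purchase)."""
    if month_no is None:
        return None
    calendar_month = (month_no - 1) % 12 + 1  # assumed: 1 = January, 13 = January, ...
    return SEASONS[calendar_month]


def age_to_class(age, band=20):
    """Assumed rule: class n covers ages (n-1)*band + 1 .. n*band."""
    return (age + band - 1) // band


# The purchase month numbers and ages from the slide's coding table.
seasons = [month_no_to_season(m) for m in (16, None, 24)]
age_classes = [age_to_class(a) for a in (70, 40)]
```

Coarser codes such as seasons and classes trade detail for patterns that are easier for a mining algorithm to pick up, which is presumably why the slide shows two codings of the same records.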
13
Clustering and prediction

Clustering
Region  Age  Car class  House   Cluster
1       70   5          50.000  type x
2       40   3          60.000  type q

Prediction
Input                            Output
Region  Age  Car class  House    Product
1       70   5          50.000   product A and B
2       40   3          60.000   product B

14
Algorithm selection
Algorithms compared:
  • Genetic Algorithms
  • Association Rules
  • Neural Networks
  • Decision Trees
  • KNN
Criteria:
  • Ability to handle large number of records
  • Ability to handle large number of attributes
  • Ability to handle numeric attributes
  • Ability to handle strings
Ratings: Bad / Medium / Good

15
k-Nearest Neighbor
  • ...a method that classifies records on the
    basis of their similarity with other records
  • A distance (similarity) measure is necessary
  • No knowledge is gained (no explicit model is
    produced)
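A minimal Python sketch of the idea: a new record is given the majority label of its k closest known records. The Euclidean distance, the value of k, the feature choice (age, car class) and the sample records are all assumptions made for illustration; the slide does not prescribe any of them.

```python
import math
from collections import Counter


def knn_classify(query, examples, k=3):
    """Return the majority label among the k examples closest to the query."""
    nearest = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labelled records: (age, car class) -> customer type.
examples = [
    ((70, 5), "x"),
    ((68, 4), "x"),
    ((40, 3), "q"),
    ((42, 2), "q"),
    ((38, 3), "q"),
]

older_customer = knn_classify((65, 5), examples)
younger_customer = knn_classify((41, 3), examples)
```

The classifier stores the records and nothing else, which matches the slide's remark that no knowledge is gained. In practice the attributes would need rescaling first, since age would otherwise dominate the distance.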

16
Decision trees
[Figure: decision tree that splits on attribute A at the value 35; the two branches are labelled 0 and 100]

17
Decision trees: information gain (2)
  • Information gain on numeric attributes is based
    on binary splits
  • [Figure: numeric axis with the values 10, 20, 30,
    40, 50, 60, 70, 80; label "Error"]
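How a binary split on a numeric attribute can be scored by information gain is sketched below in Python. The candidate thresholds (midpoints between neighbouring distinct values) and the sample labels are assumptions for illustration; the presentation does not say how its thresholds were chosen.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_binary_split(values, labels):
    """Try every midpoint between distinct values; return (threshold, gain)."""
    pairs = list(zip(values, labels))
    parent = entropy(labels)
    best_threshold, best_gain = None, 0.0
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = parent - weighted
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain


# Hypothetical data: the attribute values from the slide with made-up labels.
values = [10, 20, 30, 40, 50, 60, 70, 80]
labels = ["no", "no", "no", "yes", "yes", "yes", "yes", "yes"]

threshold, gain = best_binary_split(values, labels)
```

With these labels the split at 35 separates the classes perfectly, so the gain equals the entropy of the unsplit data.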
18
Neural networks (I)
19
Neural networks (II)
20
Neural networks (III)
21
Association rules
  • A → C: confidence 50%, support 33%
  • A,B → C: confidence 100%, support 33%
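Support and confidence can be computed directly from a list of transactions, as the Python sketch below shows. It uses the common definitions: support is the fraction of transactions that contain every item of the rule, and confidence is that fraction divided by the support of the left-hand side. The three transactions are hypothetical, chosen only so that the results match the figures on the slide.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by the support of its left-hand side."""
    whole_rule = support(set(antecedent) | set(consequent), transactions)
    return whole_rule / support(antecedent, transactions)


# Hypothetical transactions (not from the presentation).
transactions = [{"A", "B", "C"}, {"A", "D"}, {"D"}]

support_a_c = support({"A", "C"}, transactions)             # about 0.33
confidence_a_c = confidence({"A"}, {"C"}, transactions)     # about 0.5
support_ab_c = support({"A", "B", "C"}, transactions)       # about 0.33
confidence_ab_c = confidence({"A", "B"}, {"C"}, transactions)  # about 1.0
```

The second rule is more reliable but no more frequent: adding B to the left-hand side raises the confidence while the support stays the same.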
22
Data Mining success stories
  • KLM: predicting pilot bid behavior with Genetic
    Algorithms; 2% reduction in overall pilot costs
  • A large bank: cross-selling application using
    k-nearest neighbor and clustering; loans
    predictor 86%
  • Adaptive system management: decision trees
    derived from monitoring data; effective analysis
    of performance bottlenecks

23
Customer profile
24
Customer profile
25
Performance UNIX system
26
Bottlenecks in data mining
SQL
Fuzzy
27
Bottlenecks in data mining
28
Relational data mining primitives
29
Data mining in the organization
  • A small team of data analysts use powerful
    professional data mining tools
  • A larger group of product/marketing/HR managers
    use client tools
  • Groups of normal IT users use embedded data mining

30
10 golden rules
  • 1 Support for extremely large data sets
  • 2 Support for hybrid learning
  • 3 Based on a data warehouse
  • 4 Data cleaning facilities
  • 5 Dynamic coding
  • 6 Integrated with DSS
  • 7 Extendable architecture
  • 8 Support for heterogeneous databases
  • 9 Client/server architecture
  • 10 Cache optimization

31
Conclusions
De Molen 25 HOUTEN
  • Great potential
  • Relation with data warehousing
  • Change of relational technology
  • Not easy to implement
  • 80% data, 20% mining
  • The self-learning organization
  • Knowledge management