Data Mining - PowerPoint PPT Presentation

1
Data Mining Versus Semantic Web
Veljko Milutinovic, vm@etf.bg.ac.yu
http://galeb.etf.bg.ac.yu/vm
2
Data Mining versus Semantic Web
  • Two different avenues leading to the same goal!
  • The goal: efficient retrieval of knowledge from large compact or distributed databases, or the Internet.
  • What is the knowledge? A synergistic interaction of information (data) and its relationships (correlations).
  • The major difference: placement of complexity!

3
Essence of Data Mining
  • Data and knowledge represented with simple mechanisms (typically HTML) and without metadata (data about data).
  • Consequently, relatively complex algorithms have to be used (complexity migrated into the retrieval request time).
  • In return, low complexity at system design time!

4
Essence of Semantic Web
  • Data and knowledge represented with complex mechanisms (typically XML) and with plenty of metadata (a byte of data may be accompanied by a megabyte of metadata).
  • Consequently, relatively simple algorithms can be used (low complexity at the retrieval request time).
  • However, large metadata design and maintenance complexity at system design time.

5
Major Knowledge Retrieval Algorithms (for Data Mining)
  • Neural Networks
  • Decision Trees
  • Rule Induction
  • Memory-Based Reasoning, etc.
  Consequently, the stress is on algorithms!

6
Major Metadata Handling Tools (for Semantic Web)
  • XML
  • RDF
  • Ontology Languages
  • Verification (Logic, Trust): efforts in progress
  Consequently, the stress is on tools!

7
Issues in Data Mining Infrastructure
Authors: Nemanja Jovanovic (nemko@acm.org), Valentina Milenkovic (tina@eunet.yu), Veljko Milutinovic (vm@etf.bg.ac.yu)
http://galeb.etf.bg.ac.yu/vm
8

Semantic Web
  • Ivana Vujovic (ile@eunet.yu)
  • Erich Neuhold (neuhold@ipsi.fhg.de)
  • Peter Fankhauser (fankhaus@ipsi.fhg.de)
  • Claudia Niederée (niederee@ipsi.fhg.de)
  • Veljko Milutinovic (vm@etf.bg.ac.yu)
  • http://galeb.etf.bg.ac.yu/vm

9
Data Mining in a Nutshell
  • Uncovering the hidden knowledge
  • Huge NP-complete search space
  • Multidimensional interface

10
A Problem
You are a marketing manager for a cellular phone company.
  • Problem: churn is too high
  • Turnover (after contract expires) is 40%
  • Customers receive a free phone (cost $125) with contract
  • You pay a sales commission of $250 per contract
  • Giving a new telephone to everyone whose contract is expiring is very expensive (as well as wasteful)
  • Bringing back a customer after quitting is both difficult and expensive

11
A Solution
  • Three months before a contract expires, predict which customers will leave
  • If you want to keep a customer that is predicted to churn, offer them a new phone
  • The ones that are not predicted to churn need no attention
  • If you don't want to keep the customer, do nothing
  • How can you predict future behavior?
  • Tarot Cards?
  • Magic Ball?
  • Data Mining?

12
Still Skeptical?
13
The Definition
The automated extraction of predictive
information from (large) databases
  • Automated
  • Extraction
  • Predictive
  • Databases

14
History of Data Mining
15
Repetition in Solar Activity
  • 1613: Galileo Galilei
  • 1859: Heinrich Schwabe

16
The Return of the Halley Comet
Edmund Halley (1656-1742)
Recorded returns: 239 BC, 1531, 1607, 1682, 1910, 1986; next expected: 2061?
17
Data Mining is Not
  • Data warehousing
  • Ad-hoc query/reporting
  • Online Analytical Processing (OLAP)
  • Data visualization

18
Data Mining is
  • Automated extraction of predictive information from various data sources
  • Powerful technology with great potential to help users focus on the most important information stored in data warehouses or streamed through communication lines

19
Focus of this Presentation
  • Data Mining problem types
  • Data Mining models and algorithms
  • Efficient Data Mining
  • Available software

20
Data Mining Problem Types
21
Data Mining Problem Types
  • 6 types
  • Often a combination solves the problem

22
Data Description and Summarization
  • Aims at a concise description of data characteristics
  • Lower end of the scale of problem types
  • Provides the user with an overview of the data structure
  • Typically a subgoal

23
Segmentation
  • Separates the data into interesting and meaningful subgroups or classes
  • Manual or (semi-)automatic
  • A problem in itself, or just a step in solving a problem
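The slides do not name a particular segmentation technique; as a minimal sketch of one common (semi-)automatic method, here is plain k-means on a single hypothetical numeric attribute (the data values and starting centroids are made up for illustration):

```python
def kmeans_1d(points, centroids, iters=20):
    """Plain 1-D k-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            # Index of the nearest centroid
            idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Recompute centroids; keep the old one if a cluster went empty
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two obvious "segments" in hypothetical customer-spending data
cents, groups = kmeans_1d([1, 2, 3, 20, 21, 22], [0.0, 10.0])
```

With well-separated data like this, the two centroids settle on the two group means after a couple of iterations.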

24
Classification
  • Assumption: existence of objects with characteristics that belong to different classes
  • Building classification models which assign correct labels in advance
  • Exists in a wide range of applications
  • Segmentation can provide labels or restrict data sets

25
Concept Description
  • Understandable description of concepts or classes
  • Close connection to both segmentation and classification
  • Similarities to and differences from classification

26
Prediction (Regression)
  • Finds the numerical value of the target attribute for unseen objects
  • Similar to classification; the difference: discrete becomes continuous
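As a minimal illustration of predicting a continuous target (not from the slides; the data points are made up), ordinary least squares fits a line from one numeric attribute:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on a single attribute."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y over variance of x
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx  # intercept follows from the means

# Data lying exactly on y = 2x + 1
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

The fitted model can then output a numerical value for any unseen x.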

27
Dependency Analysis
  • Finding the model that describes significant dependencies between data items or events
  • Prediction of the value of a data item
  • Special case: associations
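For the association special case, the usual strength measures are support and confidence; a minimal sketch over hypothetical market baskets (the items are made up):

```python
def association_strength(transactions, lhs, rhs):
    """Support and confidence for the rule lhs -> rhs over item baskets."""
    n = len(transactions)
    both = sum(1 for t in transactions if lhs <= t and rhs <= t)
    lhs_count = sum(1 for t in transactions if lhs <= t)
    support = both / n                               # P(lhs and rhs)
    confidence = both / lhs_count if lhs_count else 0.0  # P(rhs | lhs)
    return support, confidence

baskets = [{"bread", "butter"}, {"bread", "butter", "milk"},
           {"bread"}, {"milk"}]
sup, conf = association_strength(baskets, {"bread"}, {"butter"})
```

Here "bread implies butter" holds in 2 of 4 baskets (support 0.5) and in 2 of the 3 baskets containing bread (confidence about 0.67).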

28
Data Mining Models
29
Neural Networks
  • Characterizes processed data with a single numeric value
  • Efficient modeling of large and complex problems
  • Based on biological structures: neurons
  • Network consists of neurons grouped into layers

30
Neuron Functionality
(Diagram: inputs I1, I2, I3, …, In, each scaled by a weight W1, W2, W3, …, Wn, feed the activation function f.)
Output = f(W1·I1, W2·I2, …, Wn·In)
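The slide leaves the activation function f unspecified; a minimal sketch, assuming the usual weighted sum passed through a sigmoid (both are assumptions, not stated in the slides):

```python
import math

def neuron_output(inputs, weights):
    """One neuron: weighted sum of inputs through a sigmoid activation.
    (Sigmoid is an assumed choice; the slides only name f.)"""
    s = sum(w * i for w, i in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-s))

out = neuron_output([0.5, -1.0, 0.25], [0.8, 0.2, 1.0])
```

A layer is just many such neurons sharing the same input vector; training adjusts the weights W1…Wn.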
31
Training Neural Networks
32
Decision Trees
  • A way of representing a series of rules that lead to a class or value
  • Iterative splitting of data into discrete groups, maximizing the distance between them at each split
  • Classification trees and regression trees
  • Univariate splits and multivariate splits
  • Unlimited growth and stopping rules
  • CHAID, CART, QUEST, C5.0
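A minimal sketch of the iterative-splitting idea (not any of the named products): pick the univariate threshold that minimizes weighted Gini impurity, a criterion used by CART-style trees. The balance/label data below is made up to echo the example tree:

```python
def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(values, labels):
    """Try each candidate threshold on one attribute;
    keep the one with the lowest weighted Gini impurity."""
    best_t, best_score = None, float("inf")
    n = len(labels)
    for t in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Hypothetical balances and churn labels; the clean split is at 8
t, s = best_split([5, 8, 12, 20], ["LOW", "LOW", "HIGH", "HIGH"])
```

Growing a full tree repeats this split recursively on each resulting group until a stopping rule fires.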

33
Decision Trees
Balancegt10
Balancelt10
Agelt32
Agegt32
MarriedNO
MarriedYES
34
Decision Trees
35
Rule Induction
  • Method of deriving a set of rules to classify
    cases
  • Creates independent rules that are unlikely to
    form a tree
  • Rules may not cover all possible situations
  • Rules may sometimes conflict in a prediction

36
Rule Induction
If balance > 100.000 then confidence = HIGH (weight = 1.7)
If balance > 25.000 and status = married then confidence = HIGH (weight = 2.3)
If balance < 40.000 then confidence = LOW (weight = 1.9)
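A sketch of how such independent, possibly conflicting weighted rules could be applied: every matching rule fires, and the heaviest one wins. The rule structure and conflict-resolution strategy are assumptions for illustration; the thresholds echo the slide (read with European thousands separators):

```python
def predict_confidence(case, rules):
    """Fire all rules that match the case; resolve conflicting
    predictions by taking the label of the highest-weight rule."""
    fired = [(r["weight"], r["label"]) for r in rules if r["test"](case)]
    if not fired:
        return None  # rules may not cover all possible situations
    return max(fired)[1]

rules = [
    {"test": lambda c: c["balance"] > 100_000,
     "label": "HIGH", "weight": 1.7},
    {"test": lambda c: c["balance"] > 25_000 and c["status"] == "married",
     "label": "HIGH", "weight": 2.3},
    {"test": lambda c: c["balance"] < 40_000,
     "label": "LOW", "weight": 1.9},
]

# A married customer with balance 30.000 fires two conflicting rules;
# the heavier rule (weight 2.3) wins.
label = predict_confidence({"balance": 30_000, "status": "married"}, rules)
```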
37
K-nearest Neighbor and Memory-Based Reasoning
(MBR)
  • Usage of knowledge of previously solved similar problems in solving the new problem
  • Assigning the class of the group where most of the k neighbors belong
  • First step: finding a suitable measure for the distance between attributes in the data
  • How far is black from green?
  • + Easy handling of non-standard data types
  • - Huge models
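A minimal k-NN sketch, assuming numeric attributes and Euclidean distance (the distance measure is exactly the part the slide says must be chosen first; the training points below are made up):

```python
import math
from collections import Counter

def knn_classify(point, data, k=3):
    """Vote among the k training points closest in Euclidean distance."""
    neighbors = sorted(data, key=lambda d: math.dist(point, d[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# The "model" is simply the stored training data (hence: huge models)
train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"),
         ((5.0, 5.0), "B"), ((5.1, 4.8), "B"), ((4.9, 5.2), "B")]
label = knn_classify((5.0, 4.9), train, k=3)
```

Note there is no training step at all: the cost is shifted entirely to query time, and the whole data set must be kept in memory.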

38
K-nearest Neighbor and Memory-Based Reasoning
(MBR)
39
Data Mining Models and Algorithms
  • Many other available models and algorithms
  • Logistic regression
  • Discriminant analysis
  • Generalized Adaptive Models (GAM)
  • Genetic algorithms
  • Etc.
  • Many application-specific variations of known models
  • Final implementation usually involves several techniques
  • Selection of the solution that gives the best results

40
Efficient Data Mining
41
Is It Working?
(Humorous troubleshooting flowchart with nodes: "Is it working?", "Don't mess with it!", "Did you mess with it?", "You shouldn't have!", "Will it explode in your hands?", "Anyone else knows?", "Can you blame someone else?", "You're in TROUBLE!", "Hide it", "Look the other way", "NO PROBLEM!")
42
DM Process Model
  • 5A, used by SPSS Clementine (Assess, Access, Analyze, Act, and Automate)
  • SEMMA, used by SAS Enterprise Miner (Sample, Explore, Modify, Model, and Assess)
  • CRISP-DM tends to become a standard

43
CRISP-DM
  • CRoss-Industry Standard Process for Data Mining
  • Conceived in 1996 by three companies

44
CRISP-DM Methodology
Four-level breakdown of the CRISP-DM methodology:
  • Phases
  • Generic Tasks
  • Specialized Tasks
  • Process Instances
45
Mapping generic models to specialized models
  • Analyze the specific context
  • Remove any details not applicable to the context
  • Add any details specific to the context
  • Specialize generic contexts according to the concrete characteristics of the context
  • Possibly rename generic contents to provide more explicit meanings

46
Generalized and Specialized Cooking
  • Preparing food on your own
  • Find out what you want to eat
  • Find the recipe for that meal
  • Gather the ingredients
  • Prepare the meal
  • Enjoy your food
  • Clean up everything (or leave it for later)
  • Raw steak with vegetables?
  • Check the cookbook or call mom
  • Defrost the meat (if you had it in the fridge)
  • Buy missing ingredients or borrow them from the neighbors
  • Cook the vegetables and fry the meat
  • Enjoy your food even more
  • You were cooking, so convince someone else to do the dishes

47
CRISP-DM Model
(Cyclic process diagram.)
  • Business understanding
  • Data understanding
  • Data preparation
  • Modeling
  • Evaluation
  • Deployment
48
Business Understanding
  • Determine business objectives
  • Assess situation
  • Determine data mining goals
  • Produce project plan

49
Data Understanding
  • Collect initial data
  • Describe data
  • Explore data
  • Verify data quality

50
Data Preparation
  • Select data
  • Clean data
  • Construct data
  • Integrate data
  • Format data

51
Modeling
  • Select modeling technique
  • Generate test design
  • Build model
  • Assess model

52
Evaluation
(Inputs: results, models, findings.)
  • Evaluate results
  • Review process
  • Determine next steps

53
Deployment
  • Plan deployment
  • Plan monitoring and maintenance
  • Produce final report
  • Review project

54
At Last
55
Available Software
14 tools
56
Comparison of fourteen DM tools
  • The Decision Tree products were: CART, Scenario, See5, S-Plus
  • The Rule Induction tools were: WizWhy, DataMind, DMSK
  • Neural Networks were built from three programs: NeuroShell2, PcOLPARS, PRW
  • The Polynomial Network tools were: ModelQuest Expert, Gnosis, a module of NeuroShell2, KnowledgeMiner

57
Criteria for evaluating DM tools
  • A list of 20 criteria for evaluating DM tools, put into 4 categories
  • Capability measures what a desktop tool can do, and how well it does it:
    - Handles missing data
    - Considers misclassification costs
    - Allows data transformations
    - Quality of testing options
    - Has programming language
    - Provides useful output reports
    - Visualisation

58
Criteria for evaluating DM tools
  • Learnability/Usability shows how easy a tool is to learn and use:
    - Tutorials
    - Wizards
    - Easy to learn
    - User's manual
    - Online help
    - Interface

59
Criteria for evaluating DM tools
  • Interoperability shows a tool's ability to interface with other computer applications:
    - Importing data
    - Exporting data
    - Links to other applications
  • Flexibility:
    - Model adjustment flexibility
    - Customizable work environment
    - Ability to write or change code

60
A classification of data sets
  • Pima Indians Diabetes data set
  • 768 cases of Native American women from the Pima tribe, some of whom are diabetic, most of whom are not
  • 8 attributes plus the binary class variable for diabetes per instance
  • Wisconsin Breast Cancer data set
  • 699 instances of breast tumors, some of which are malignant, most of which are benign
  • 10 attributes plus the binary malignancy variable per case
  • The Forensic Glass Identification data set
  • 214 instances of glass collected during crime investigations
  • 10 attributes plus the multi-class output variable per instance
  • Moon Cannon data set
  • 300 solutions to the equation x = 2v² sin(g)cos(g)/g
  • The data were generated without adding noise
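Reading the equation as the standard projectile-range formula (an interpretation; the slide's symbols are garbled, and the gravitational constant and parameter ranges below are assumptions), a noise-free data set like the one described can be generated as:

```python
import math
import random

G = 9.81  # gravitational acceleration in m/s^2 (assumed; not in the slides)

def cannon_range(v, gamma):
    """Projectile range x = 2*v^2*sin(gamma)*cos(gamma)/g, no noise."""
    return 2 * v**2 * math.sin(gamma) * math.cos(gamma) / G

random.seed(0)
# 300 (velocity, angle, range) triples with hypothetical parameter ranges
data = [(v, g, cannon_range(v, g))
        for v, g in ((random.uniform(10, 100), random.uniform(0, math.pi / 2))
                     for _ in range(300))]
```

Because no noise is added, a mining tool should be able to recover the target function essentially exactly from these 300 solutions.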

61
Evaluation of fourteen DM tools
62
Conclusions
63
WWW.NBA.COM
64
Se7en
65
? CD ROM ?