Introduction to Data Mining

Slides: 26
Provided by: marko159

1
Introduction to Data Mining
  • Marko Grobelnik
  • Institut Jozef Stefan

2
Outline
  • Motivation and definition
  • What are typical applications?
  • How do we build solutions?
  • Methods and algorithms
  • Tools and standards
  • Conclusion

3
Motivation: Necessity is the Mother of Invention
  • Data explosion problem
  • Automated data collection tools and mature
    database technology lead to tremendous amounts of
    data stored in databases, data warehouses and
    other information repositories
  • We are drowning in data, but starving for
    knowledge!

4
Data pyramid
  • Wisdom (knowledge + experience)
  • Knowledge (information + rules)
  • Information (data + context)
  • Data

5
What Is Data Mining?
  • Data mining (knowledge discovery in databases -
    KDD, business intelligence)
  • Extraction of interesting (non-trivial,
    implicit, previously unknown and potentially
    useful) information from data in large databases
  • Tell me something interesting about the data.
  • Describe the data.

6
Potential Applications
  • Database analysis and decision support
  • Market analysis and management
  • Risk analysis and management
  • Fraud detection and management
  • Text analysis - Text Mining
  • Web analysis - Web Mining
  • Intelligent query answering

7
Market Analysis and Management
  • Where are the data sources for analysis?
  • Credit card transactions, loyalty cards, discount
    coupons, customer complaint calls, plus (public)
    lifestyle studies.
  • Target marketing
  • Find clusters of model customers who share the
    same characteristics: interests, income level,
    spending habits, etc.
  • Determine customer purchasing patterns over time
  • Conversion of a single to a joint bank account:
    marriage, etc.

8
Analysis and Risk Management
  • Finance planning and asset evaluation
  • cash flow analysis and prediction
  • time series analysis (trend analysis, etc.)
  • Resource planning
  • summarize and compare the resources and spending
  • Competition
  • Monitor competitors and market directions
  • Set pricing strategy in a highly competitive
    market

9
Fraud Detection and Management
  • Use historical data to build models of fraudulent
    behavior and use data mining to help identify
    similar instances
  • Example applications
  • Auto insurance: detect a group of people who
    stage accidents to collect on insurance
  • Money laundering: detect suspicious money
    transactions
  • Telephone fraud: detect suspicious
    patterns (generate a call model: destination,
    time, duration)

10
Other Areas of Application
  • Sports
  • Analysis of NBA games (e.g., detect the
    opponent's strategy)
  • Astronomy
  • discovery and classification of new objects
  • Internet
  • analysis of Web access logs, discovery of user
    behavior patterns, analyzing effectiveness of Web
    marketing, improving Web site organization
  • Text
  • news analysis, medical record analysis, automatic
    email sorting and filtering, automatic document
    categorization

11
Data mining: intersection of multiple disciplines
  • Database systems, data warehouse and OLAP
  • Statistics
  • Machine learning
  • Visualization
  • Information science
  • High performance computing
  • Other disciplines
  • Neural networks, mathematical modeling,
    information retrieval, pattern recognition, ...

12
From data to knowledge
  • Data mining: the core of the knowledge discovery
    process.

[Figure: KDD process diagram; stages from bottom to
top: Databases, Data Integration and Data Cleaning,
Data Warehouse, Selection, Task-relevant Data, Data
Mining, Pattern Evaluation, Knowledge]

13
Main steps of KDD
  • Learning the application domain
  • relevant prior knowledge and goals of the application
  • Data cleaning and preprocessing (may take 60% of
    effort!)
  • Creating a target data set: data selection
  • find useful features, generate new features, map
    feature values, discretize values
  • Choosing data mining tools/algorithms
  • summarization, classification, regression,
    association, clustering.
  • Data mining: search for patterns of interest
  • Interpretation: analysis of results.
  • visualization, transformation, removing redundant
    patterns, etc.
  • Use of discovered knowledge.

14
Data Mining and Business Intelligence
[Figure: pyramid; the potential to support business
decisions increases toward the top. Layers, top to
bottom:]
  • Making Decisions (End User)
  • Data Presentation: Visualization Techniques
    (Business Analyst)
  • Data Mining: Information Discovery (Data Analyst)
  • Data Exploration: Statistical Analysis, Querying
    and Reporting
  • Data Warehouses / Data Marts: OLAP, MDA (DBA)
  • Data Sources: Paper, Files, Information
    Providers, Database Systems, OLTP

15
Mining the data: what kind of data?
  • Relational databases
  • Data warehouses
  • Transactional databases
  • Advanced DB systems and information repositories:
    object-oriented and object-relational databases,
    spatial databases, time-series and temporal
    data, text and multimedia databases,
    heterogeneous and legacy databases, WWW

16
Data mining algorithms (I)
  • Association
  • finding rules like "if the customer bought item
    A, then in X% of transactions she/he also bought
    item B"; this holds for Y% of all transactions
  • Classification and Prediction
  • classify data based on the values in a
    classifying attribute, e.g., classify countries
    based on climate, or classify cars based on gas
    mileage
  • predict some unknown or missing attribute values
    based on other information

17
Data mining algorithms (II)
  • Clustering
  • group data to form new classes, e.g., find
    groups of customers with similar behavior
  • Time-series analysis
  • trend and deviation analysis: find and
    characterize evolution trends, sequential
    patterns, similar sequences, and deviation data,
    e.g., stock analysis.
  • similarity-based pattern-directed analysis: find
    and characterize user-specified patterns in
    large databases.
  • cyclicity/periodicity analysis: find
    segment-wise or total cycles or periodic
    behaviors in time-related data.
  • Other pattern-directed or statistical analysis
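
Trend and deviation analysis can be illustrated with a very small sketch. It assumes one simple approach among many: the trend is a trailing moving average, and a deviation is a value that differs from the mean of the preceding window by more than a fixed threshold. The function names, the price series and the threshold are illustrative assumptions, not taken from the slides.

```python
def moving_average(series, window):
    """Trailing moving average: one value per full window."""
    return [
        sum(series[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(series))
    ]


def deviations(series, window, threshold):
    """Indices whose value differs from the mean of the
    preceding `window` values by more than `threshold`."""
    flagged = []
    for i in range(window, len(series)):
        baseline = sum(series[i - window:i]) / window
        if abs(series[i] - baseline) > threshold:
            flagged.append(i)
    return flagged


# Hypothetical daily prices with one obvious spike at index 4.
prices = [10, 11, 12, 13, 30, 15, 16, 17]
trend = moving_average(prices, 3)
outliers = deviations(prices, 3, 10)
```

On this toy series only the spike is flagged; real stock data would call for a threshold derived from the data (for example, a multiple of the standard deviation) rather than a constant.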

18
Association rules
  • Finding associations or correlations among a set
    of items
  • Applications
  • basket data analysis, cross-marketing, ...
  • Example
  • buying beer and chips -> ketchup [0.5%, 60%]
  • rule form: LHS -> RHS [support, confidence]
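
The two numbers attached to a rule can be computed directly from a list of transactions. The sketch below assumes the usual definitions: support is the fraction of all transactions that contain both the left-hand side (LHS) and the right-hand side (RHS), and confidence is the fraction of LHS-containing transactions that also contain the RHS. The function name and the toy baskets are illustrative, not from the slides.

```python
def rule_stats(transactions, lhs, rhs):
    """Return (support, confidence) of the rule lhs -> rhs."""
    lhs, rhs = set(lhs), set(rhs)
    # Transactions that contain every item of the left-hand side.
    n_lhs = sum(1 for t in transactions if lhs <= set(t))
    # Transactions that contain both sides of the rule.
    n_both = sum(1 for t in transactions if (lhs | rhs) <= set(t))
    support = n_both / len(transactions) if transactions else 0.0
    confidence = n_both / n_lhs if n_lhs else 0.0
    return support, confidence


# Hypothetical market baskets.
baskets = [
    {"beer", "chips", "ketchup"},
    {"beer", "chips"},
    {"beer", "chips", "ketchup"},
    {"milk", "bread"},
    {"chips", "ketchup"},
]
support, confidence = rule_stats(baskets, {"beer", "chips"}, {"ketchup"})
```

Here two of the five baskets contain all three items (support 40%), and two of the three baskets with beer and chips also contain ketchup (confidence about 67%).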

19
Classification
  • Finding rules that describe given groups of
    objects
  • Applications: credit approval, target marketing,
    medical diagnosis, treatment effectiveness
    analysis,...
  • Example: based on the past symptoms and diagnoses
    of patients, generate a model describing the influence
    of symptoms on disease, to be used for
    classification of future test data and better
    understanding of each class
  • Methods: decision trees (e.g., ID3, C5),
    statistics, neural networks,...

20
Classification using decision trees
  • A decision tree
  • Top-down decision tree generation algorithm; at
    each step:
  • partition examples based on the selected
    attribute value
  • select attribute favoring the partitioning which
    makes the majority of examples belong to a single
    class

[Figure: example decision tree. Root node: outlook,
with branches sunny, overcast and rain; inner nodes
humidity and windy; leaves labeled P or N.]

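
The attribute-selection step can be sketched in a few lines. The slides only say that the chosen attribute should favor partitions in which most examples belong to a single class; the sketch below assumes information gain (the entropy-based criterion used by ID3) as one common way to measure that. The function names and the six toy examples are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(rows, labels, attr):
    """Entropy reduction obtained by partitioning on `attr`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(
        len(group) / len(labels) * entropy(group)
        for group in groups.values()
    )
    return entropy(labels) - remainder


def best_attribute(rows, labels, attrs):
    """Attribute whose partition gives the purest subsets."""
    return max(attrs, key=lambda a: information_gain(rows, labels, a))


# Hypothetical training examples with class labels P and N.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "overcast", "windy": False},
    {"outlook": "overcast", "windy": True},
    {"outlook": "rain", "windy": False},
    {"outlook": "rain", "windy": True},
]
labels = ["N", "N", "P", "P", "P", "N"]
chosen = best_attribute(rows, labels, ["outlook", "windy"])
```

On this toy data the outlook attribute is selected, because two of its three branches contain a single class. A full tree builder would apply the same step recursively to each partition.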
21
Classification methods
  • Decision trees and decision rules
  • given a training set of labeled data
  • tree pruning used for noise handling and avoiding
    overfitting the data
  • Bayesian classification
  • Naïve Bayesian classification
  • Bayesian belief networks
  • Neural network approach
  • multi-layer networks and back-propagation
  • Genetic algorithms
  • genetic operators (mutation, cross-over, ...) and
    fitness function selection
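
As an illustration of one method from this list, here is a minimal sketch of Naïve Bayesian classification for categorical attributes. It assumes add-one (Laplace) smoothing and log-probabilities, which are common choices but are not specified in the slides. The function names and the symptom data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute values."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute) -> value counts
    values_seen = defaultdict(set)       # attribute -> distinct values
    for row, label in zip(rows, labels):
        for attr, value in row.items():
            value_counts[(label, attr)][value] += 1
            values_seen[attr].add(value)
    return class_counts, value_counts, values_seen


def predict(model, row):
    """Return the class with the highest posterior score."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for attr, value in row.items():
            # Add-one smoothing avoids zero probabilities.
            numerator = value_counts[(label, attr)][value] + 1
            denominator = count + len(values_seen[attr])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical patient records: symptoms and diagnoses.
rows = [
    {"fever": "yes", "cough": "yes"},
    {"fever": "yes", "cough": "no"},
    {"fever": "no", "cough": "yes"},
    {"fever": "no", "cough": "yes"},
]
labels = ["flu", "flu", "cold", "cold"]
model = train_naive_bayes(rows, labels)
diagnosis = predict(model, {"fever": "yes", "cough": "yes"})
```

The "naïve" part is the assumption that attributes are independent given the class, which is why the per-attribute terms are simply added in log space.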

22
Clustering methods
  • partitioning a set of data into a set of classes,
    called clusters, such that the members of each
    class share some interesting common
    properties.
  • clusters are of high quality if the intra-class
    similarity is high and the inter-class similarity
    is low
  • The distance measure is important
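
The quality criterion can be made concrete with a small sketch. It assumes Euclidean distance as the distance measure (low distance meaning high similarity) and compares the mean pairwise distance inside a cluster with the mean distance between two clusters. The function names and the points are illustrative assumptions.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_intra_distance(cluster):
    """Mean pairwise distance between members of one cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)


def mean_inter_distance(c1, c2):
    """Mean distance between members of two different clusters."""
    return sum(euclidean(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


# Two hypothetical, well-separated clusters in the plane.
cluster_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
cluster_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
intra_a = mean_intra_distance(cluster_a)
inter_ab = mean_inter_distance(cluster_a, cluster_b)
```

For a good clustering the intra-cluster distance is much smaller than the inter-cluster distance, as it is here. A different distance measure (for example, one that rescales attributes) can change which grouping looks best.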

23
Data-Mining tools
  • Main producers of Data-Mining software
  • IBM Intelligent Miner, extender for DB2
  • SAS Enterprise Miner
  • SPSS Clementine
  • Microsoft Analysis Server (part of SQL Server
    2000)
  • many more smaller producers

24
Data Mining standards
  • PMML (Predictive Model Markup Language)
  • XML-based language for saving and sharing models
    (most widely accepted standard)
  • CRISP-DM
  • standardized methodology for building Data Mining
    applications
  • OLE DB for Data Mining
  • Microsoft's standard for developing OLE DB/COM
    components for extending Analysis Server with new
    Data Mining functionality (uses a customized SQL
    language)
  • IBM and Oracle prepared standard extensions to
    the SQL language to support Data Mining functionality

25
Conclusion
  • Data mining is a rapidly developing area
  • Who needs data mining, and why?
  • (almost) everybody who has data
  • to get something more out of the data
  • More information
  • http://www.kdnuggets.com/