Data Mining - PowerPoint PPT Presentation

Provided by: hpcnetOrg

1
Data Mining
  • Lecture 1
  • Introduction to Data Mining
  • Manuel Penaloza, PhD

2
Introduction to Data Mining
  • Society produces huge amounts of data daily
  • Retail Store
  • POS data on customer purchases
  • Banks
  • Collection of customer service calls
  • Telecommunications
  • Phone call records (mobile and house-based calls)
  • Medicine
  • Genomic data collected on the structure of genes
  • Government
  • Law enforcement data, income tax data
  • Others: transactional data from sports,
    schools, research, search engines, etc.

3
What is Data Mining (DM)?
  • It is the process of discovering hidden
    relationships and patterns in large data sets
  • It can also predict the outcome of a future
    observation
  • Data mining is an interdisciplinary field
  • It is an extension of statistical analysis
  • It uses techniques from
  • Statistics
  • Machine learning
  • Pattern recognition
  • Database technology
  • Visualization
  • High-performance computing

4
Questions answered by DM
  • Extracting useful information from a dataset that
    answers questions such as
  • Which CC customers are most profitable?
  • Which loan applicants are high-risk?
  • Which customer will respond to a planned
    promotion?
  • How do we detect phone card fraud?
  • How do customer profiles change over time?
  • Which customers prefer product A over product
    B?
  • What is the revenue prediction for next year?
  • Which students are more likely to transfer than
    others?
  • Which taxpayers may be cheating the system?
  • Who is most likely to violate a probation
    sentence?
  • What is the predicted outcome for some treatment?

5
Data sources
  • Relational Databases
  • Transactional data with many tables
  • Data warehouses
  • Historical data, aggregated and updated
    periodically
  • Files
  • In special format (e.g., CSV) or proprietary
    binary
  • Internet or electronic mail
  • HTML, XML, web search results, e-mails
  • Scientific, research
  • Seismology, remote sensing, etc.

6
Example: Health System
  • Characteristics of the Health System
  • Personal medical records (GP, specialists, etc.)
  • Billing records
  • Hospital data (surgery, admission, etc.)
  • Questions
  • Are MDs following the procedures?
  • Which patients may have adverse drug
    reactions?
  • Are people committing fraud?
  • Which patients are most likely to get cancer?

7
Case study: E-commerce
  • A person buys a book from Amazon.com
  • Objective: Recommend other books this person is
    likely to buy
  • Amazon may do clustering or sequential pattern
    analysis based on books bought by other people
  • Data analyzed
  • Customers who bought "Data Mining: Practical
    Machine Learning Tools and Techniques" also
    bought "Introduction to Data Mining"
  • Recommendations have been successful for Amazon
  • Increasing buyers' satisfaction and purchases
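
The "customers who bought X also bought Y" idea can be sketched as simple co-purchase counting. This is only an illustration, not Amazon's actual method (the slide itself only says Amazon "may" use clustering or sequential pattern analysis). The function name `recommend` and the purchase histories are hypothetical.

```python
from collections import Counter

def recommend(purchases, book, top_n=3):
    """Rank other books by how many customers bought them together with `book`."""
    counts = Counter()
    for basket in purchases.values():
        if book in basket:
            # Count every other title in the same customer's history.
            counts.update(title for title in basket if title != book)
    return [title for title, _ in counts.most_common(top_n)]

# Hypothetical purchase histories (customer -> set of titles bought).
purchases = {
    "ann": {"Data Mining", "Introduction to Data Mining"},
    "bob": {"Data Mining", "Introduction to Data Mining", "Statistics"},
    "cat": {"Data Mining", "Introduction to Data Mining"},
    "dan": {"Cooking"},
}

print(recommend(purchases, "Data Mining"))
# ['Introduction to Data Mining', 'Statistics']
```

Counting co-purchases is the simplest possible baseline; it ignores purchase order and customer similarity, which the techniques named on the slide would take into account.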

8
What motivated data mining?
  • Growth in data collection
  • Presence of data warehouses with reliable data
  • Competitive pressure to increase sales
  • The development of commercial off-the-shelf
    (COTS) data mining software
  • Examples: XLMiner, Insightful Miner, SAS, SPSS
  • Growth of computing power and storage capacity
  • High dimensionality of the data
  • Heterogeneous and complex data
  • Limitation of humans

9
Insightful Miner™ 7 GUI
Figures taken from the Insightful Miner 7 Guide
10
Creating Models
  • Create a network of pipelined components
  • By dragging and dropping components

11
Choosing a data mining system
  • Systems differ in functionality and methodology
  • Selection determined by
  • Type of operating system used in your
    organization
  • The data sources handled by the tool
  • ASCII text files, relational databases, XML data
  • The data mining functions and methods offered
  • Scalability of the system
  • Row and column scalability
  • Visualization tools available
  • Graphical user interface that guides the
    execution of the methods
  • Integration with other information systems
  • Cost and performance

12
Data Mining in Databases
  • Current applications include data mining modules
  • Example
  • Database management systems such as Oracle and
    MS SQL Server
  • CRM (Customer Relationship Management)
  • Advantages for Database systems
  • One-stop shopping
  • Minimize data movement and conversion
  • Disadvantages for Database systems
  • Limited to DM methods available in the system
  • Data extractions and transformations may not be
    powerful enough

13
Standard data mining life cycle
  • CRISP-DM (Cross-Industry Standard Process for Data Mining)
  • It is an iterative process with phase
    dependencies
  • It consists of six (6) phases

14
CRISP-DM
  • Cross-industry standard developed in 1996
  • Analysts from SPSS/ISL, NCR, Daimler-Benz, OHRA
  • Funding from European Commission
  • Important Characteristics
  • Non-proprietary
  • Application/Industry neutral
  • Tool neutral
  • General problem-solving process
  • Six-phase process, but it is missing a step for
  • Saving results and updating the model

15
CRISP-DM Phases (1)
  • Business Understanding
  • Understand project objectives and requirements
  • Formulation of a data mining problem definition
  • Data Understanding
  • Data collection
  • Evaluate the quality of the data
  • Perform exploratory data analysis
  • Data Preparation
  • Clean, prepare, integrate, and transform the
    data
  • Select appropriate attributes and variables

16
CRISP-DM Phases (2)
  • Modeling
  • Select and apply appropriate modeling techniques
  • Calibrate model parameters to optimize results
  • If necessary, return to data preparation phase
    to satisfy model's data format
  • Evaluation
  • Determine if model satisfies objectives set in
    phase 1
  • Identify business issues that have not been
    addressed
  • Deployment
  • Organize and present the model to the user
  • Put model into practice
  • Set up for continuous mining of the data

17
Data mining tasks (1)
  • Classification
  • Predict the categorical value of a target
    (dependent) variable based on the values of other
    attributes
  • Target variable is partitioned into classes
  • It predicts class membership of a new
    observation
  • Example: Which drug should be prescribed for
    older patients with low sodium/potassium ratios?
  • Estimation
  • Similar to classification except target variable
    is numeric
  • That is, predicting a numeric value
  • Example: Estimate the blood pressure of a person
    based on his/her age, gender, body mass index,
    etc.
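
As a rough illustration of estimation, the sketch below fits a straight line by ordinary least squares and uses it to estimate a numeric target. It assumes a single predictor (age) and made-up blood pressure values. The slide lists several predictors and does not prescribe a method, so the function `fit_line` and the data are purely illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up training data: age in years, systolic blood pressure.
ages = [30, 40, 50, 60]
pressures = [120, 125, 130, 135]

slope, intercept = fit_line(ages, pressures)
estimate = slope * 45 + intercept  # estimated blood pressure at age 45
print(slope, intercept, estimate)  # 0.5 105.0 127.5
```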

18
Data mining tasks (2)
  • Prediction
  • Similar to estimation except that results lie in
    the future
  • Example: Predict the price of a stock 3 months
    into the future
  • Clustering
  • Grouping similar records together
  • Example: Find patients with similar profiles
  • Associations
  • Uncover rules that indicate the association
    between two or more attributes
  • Example: Find out which items are purchased together
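
For the association task, two standard measures of a rule "A implies B" are its support (the fraction of transactions containing both A and B) and its confidence (the fraction of transactions containing A that also contain B). The slides do not name these measures. The sketch below computes them for a made-up set of shopping baskets, and the function names are illustrative.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never applies in this data
    return support(transactions, set(antecedent) | set(consequent)) / base

# Made-up market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(baskets, {"bread", "milk"}))       # 0.5
print(confidence(baskets, {"bread"}, {"milk"}))  # 2 of the 3 bread baskets contain milk
```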

19
Task: Classification
  • Build a model that learns to predict the class
    from pre-labeled instances or observations
  • Many approaches: regression, decision trees,
    neural networks

Given a set of points from known classes, what is the
class of a new point?
Diagram taken from www.kdnuggets.com/data_mining_course/index.html
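
One way to answer the question above is a k-nearest-neighbor vote, a method this lecture lists later under Methods and Techniques. The sketch below is a minimal version that assumes 2-D points, Euclidean distance, and made-up labeled data. The name `knn_classify` is illustrative.

```python
import math
from collections import Counter

def knn_classify(train, point, k=3):
    """Predict the class of `point` by majority vote among its k nearest neighbors.

    `train` is a list of ((x, y), label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up pre-labeled instances from two classes.
train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]

print(knn_classify(train, (2, 2)))      # A
print(knn_classify(train, (6.5, 6.5)))  # B
```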
20
Task: Clustering
  • Find groupings of instances given unlabeled data

Diagram taken from www.kdnuggets.com/data_mining_course/index.html
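
The slide does not name a clustering algorithm. As one common choice, the sketch below uses k-means: assign each point to its nearest centroid, recompute each centroid as the mean of its points, and repeat. The data, the starting centroids, and the fixed iteration count are all assumptions made for illustration.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Basic k-means: returns the final centroids and the clusters of points."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Made-up unlabeled points forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
print(clusters)
# [[(1, 1), (1, 2), (2, 1)], [(8, 8), (8, 9), (9, 8)]]
```

In practice the result depends on the number of clusters and on the starting centroids, which is one reason the analyst has to stay involved.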
21
DM looks easy
[Figure: Data, Data Mining Method (Regression, Decision Tree, Neural Network, Association Rules), Model]
  • But it is not easy
  • The real world is complicated

22
Methods and Techniques
  • Cluster analysis (task: clustering)
  • Association rules (task: association)
  • Decision trees (tasks: prediction,
    classification)
  • Neural networks (tasks: prediction,
    classification)
  • K-nearest neighbor (tasks: prediction,
    classification, clustering)
  • Regression analysis (tasks: estimation,
    prediction)
  • Confidence interval estimation (task: estimation)

23
Fallacies of Data Mining (1)
  • Fallacy 1: There are data mining tools that
    automatically find the answers to our problems
  • Reality: There are no automatic tools that will
    solve your problems while you wait
  • Fallacy 2: The DM process requires little human
    intervention
  • Reality: The DM process requires human
    intervention in all its phases, including
    evaluation and updating of the model by human
    experts
  • Fallacy 3: Data mining has a quick ROI
  • Reality: ROI depends on the startup costs,
    personnel costs, data source costs, and so on

24
Fallacies of Data Mining (2)
  • Fallacy 4: DM tools are easy to use
  • Reality: Analysts must be familiar with the
    model
  • Fallacy 5: DM will identify the causes of the
    business problem
  • Reality: DM tools only identify patterns in your
    data; analysts must identify the causes
  • Fallacy 6: Data mining will clean up a data
    repository automatically
  • Reality: A sequence of transformation tasks must
    be defined by an analyst during the early DM phases
  • Fallacies described by Jen Que Louie, President
    of Nautilus Systems, Inc.

25
In summary,
  • Problems suitable for Data Mining
  • Require discovering knowledge to make the right
    decisions
  • Current solutions are not adequate
  • Expect a high payoff for the right decisions
  • Have accessible, sufficient, and relevant data
  • Have a changing environment
  • IMPORTANT
  • ENSURE privacy if personal data is used!
  • Not every data mining application is successful!

26
Main References
  • Ian Witten and Eibe Frank. Data Mining: Practical
    Machine Learning Tools and Techniques, 2nd
    edition, Morgan Kaufmann Publishers
  • Daniel Larose. Discovering Knowledge in Data: An
    Introduction to Data Mining, Wiley Publication
  • Pang-Ning Tan et al. Introduction to Data
    Mining, Addison Wesley
  • Jiawei Han and Micheline Kamber. Data Mining:
    Concepts and Techniques, Morgan Kaufmann
    Publishers
  • Online data mining course offered by KDnuggets™
    at www.kdnuggets.com/data_mining_course/index.html
  • Engineering Statistics Handbook, available online
    at http://www.itl.nist.gov/div898/handbook/eda/section1/eda126.htm

27
Exercise 1
  • CRISP-DM is not the only DM process. Do a quick
    search on the Internet for another process and
    describe its similarities to and differences from
    CRISP-DM.
  • Determine how data mining could help a web search
    engine company like Google in its operation.
  • Identify one or more objectives.
  • Which data mining task(s) could help this
    company?