Title: Data Mining
1 Data Mining
- Lecture 1
- Introduction to Data Mining
- Manuel Penaloza, PhD
2 Introduction to Data Mining
- Society produces huge amounts of data daily
- Retail Store
- POS data on customer purchases
- Banks
- Collection of customer service calls
- Telecommunications
- Phone call records (mobile and house-based calls)
- Medicine
- Genomic data collected on the structure of genes
- Government
- Law enforcement data, income tax data
- Others: (transactional) data from sports, schools, research, search engines, etc.
3 What is Data Mining (DM)?
- It is the process of discovering hidden relationships and patterns in large data sets
- It can also predict the outcome of a future observation
- Data mining is an interdisciplinary field
- It is an extension to statistical analysis
- It uses techniques from
- Statistics
- Machine learning
- Pattern recognition
- Database technology
- Visualization
- High-performance computing
4 Questions answered by DM
- Extracting useful information from a dataset that answers questions such as:
- Which credit card (CC) customers are most profitable?
- Which loan applicants are high-risk?
- Which customers will respond to a planned promotion?
- How do we detect phone card fraud?
- How do customer profiles change over time?
- Which customers prefer product A over product B?
- What is the revenue prediction for next year?
- Which students are more likely to transfer than others?
- Which taxpayers may be cheating the system?
- Who is most likely to violate a probation sentence?
- What is the predicted outcome for some treatment?
5 Data sources
- Relational Databases
- Transactional data with many tables
- Data warehouses
- Historical data, aggregated and updated periodically
- Files
- In special format (e.g., CSV) or proprietary binary
- Internet or electronic mail
- HTML, XML, web search results, e-mails
- Scientific, research
- Seismology, remote sensing, etc.
6 Example: Health System
- Characteristics of the Health System
- Personal medical records (GP, specialists, etc.)
- Billing records
- Hospital data (surgery, admission, etc.)
- Questions
- Are MDs following the procedures?
- Which patients may have adverse drug reactions?
- Are people committing fraud?
- Which patients are most likely to get cancer?
7 Case study: E-commerce
- A person buys a book from Amazon.com
- Objective: Recommend other books this person is likely to buy
- Amazon may do clustering or sequential pattern analysis based on books bought by other people
- Data analyzed
- Customers who bought Data Mining: Practical Machine Learning Tools and Techniques also bought Introduction to Data Mining
- Recommendations have been successful for Amazon
- Increasing buyers' satisfaction and purchases
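The "customers who bought X also bought Y" idea can be illustrated with a simple co-purchase count. This is only a minimal sketch, not Amazon's actual method (the slide mentions clustering or sequential pattern analysis). The function name `recommend`, the customer names, and the purchase data are all made up for illustration.

```python
from collections import Counter

def recommend(purchases, book, top_n=3):
    """Rank the books most often bought by customers who also bought `book`."""
    counts = Counter()
    for basket in purchases.values():
        if book in basket:
            # Count every other title in the same customer's purchase history.
            counts.update(title for title in basket if title != book)
    return [title for title, _ in counts.most_common(top_n)]

# Hypothetical purchase histories: customer -> set of titles bought.
purchases = {
    "ann": {"Practical ML Tools", "Introduction to Data Mining"},
    "bob": {"Practical ML Tools", "Introduction to Data Mining", "Cooking Basics"},
    "cid": {"Practical ML Tools", "Discovering Knowledge in Data"},
    "dee": {"Cooking Basics"},
}

print(recommend(purchases, "Practical ML Tools", top_n=1))
```

With this toy data, "Introduction to Data Mining" is co-purchased twice and every other title once, so it is the top recommendation.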
8 What motivated data mining?
- Growth in data collection
- Presence of data warehouses with reliable data
- Competitive pressure to increase sales
- The development of commercial off-the-shelf (COTS) data mining software
- Examples: XLMiner, Insightful Miner, SAS, SPSS
- Growth of computing power and storage capacity
- High dimensionality of the data
- Heterogeneous and complex data
- Limitation of humans
9 Insightful Miner™ 7 GUI
Figures taken from the Insightful Miner 7 Guide
10 Creating Models
- Create a network of pipelined components
- By dragging and dropping components
11 Choosing a data mining system
- Systems differ in functionality and methodology
- Selection determined by
- Type of operating system used in your organization
- The data sources handled by the tool
- ASCII text files, relational databases, XML data
- The data mining functions and methods offered
- Scalability of the system
- Row and column scalability
- Visualization tools available
- Graphical user interface that guides the execution of the methods
- Integration with other information systems
- Cost and performance
12 Data Mining in Databases
- Current applications include data mining modules
- Example
- Database management systems such as Oracle and MS SQL Server
- CRM (Customer Relationship Management)
- Advantages for Database systems
- One-stop shopping
- Minimize data movement and conversion
- Disadvantages for Database systems
- Limited to DM methods available in the system
- Data extractions and transformations may not be powerful enough
13 Standard data mining life cycle
- CRISP-DM (Cross-Industry Standard Process for Data Mining)
- It is an iterative process with phase dependencies
- It consists of six (6) phases
14 CRISP-DM
- Cross-industry standard developed in 1996
- Analysts from SPSS/ISL, NCR, Daimler-Benz, OHRA
- Funding from European Commission
- Important Characteristics
- Non-proprietary
- Application/Industry neutral
- Tool neutral
- General problem-solving process
- Process with six phases but missing
- Saving results and updating the model
15 CRISP-DM Phases (1)
- Business Understanding
- Understand project objectives and requirements
- Formulation of a data mining problem definition
- Data Understanding
- Data collection
- Evaluate the quality of the data
- Perform exploratory data analysis
- Data Preparation
- Clean, prepare, integrate, and transform the data
- Select appropriate attributes and variables
16 CRISP-DM Phases (2)
- Modeling
- Select and apply appropriate modeling techniques
- Calibrate model parameters to optimize results
- If necessary, return to data preparation phase to satisfy model's data format
- Evaluation
- Determine if model satisfies objectives set in phase 1
- Identify business issues that have not been addressed
- Deployment
- Organize and present the model to the user
- Put model into practice
- Set up for continuous mining of the data
17 Data mining tasks (1)
- Classification
- Predict the categorical value of a target (dependent) variable based on the values of other attributes
- Target variable is partitioned into classes
- It predicts class membership of a new observation
- Example: Which drug should be prescribed for older patients with low sodium/potassium ratios?
- Estimation
- Similar to classification except target variable is numeric
- That is, predicting a numeric value
- Example: Estimate the blood pressure of a person based on his/her age, gender, body mass index, etc.
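To make the estimation task concrete, here is a minimal sketch that fits a least-squares line to estimate blood pressure from age alone. A single predictor is a simplification of the slide's example, which also lists gender and body mass index. The function name `fit_line` and the data values are invented for illustration and are not medical data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up training data: age in years, systolic blood pressure in mmHg.
ages = [30, 40, 50, 60]
systolic = [115, 120, 125, 130]

slope, intercept = fit_line(ages, systolic)
estimate = slope * 55 + intercept  # numeric estimate for a 55-year-old
print(f"slope={slope:.2f}, intercept={intercept:.1f}, estimate={estimate:.1f}")
```

The output is a number rather than a class label, which is what distinguishes estimation from classification.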
18 Data mining tasks (2)
- Prediction
- Similar to estimation except that results lie in the future
- Example: Predict the price of a stock 3 months into the future
- Clustering
- Grouping similar records together
- Example: Find patients with similar profiles
- Associations
- Uncover rules that indicate the association between two or more attributes
- Find out which items are purchased together
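Association rules are usually scored with two standard measures, support and confidence. The sketch below computes both for a rule such as "bread implies milk". The helper names and the market-basket data are invented for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of `consequent` given `antecedent`."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Made-up market-basket data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(transactions, {"bread", "milk"}))       # share of baskets with both items
print(confidence(transactions, {"bread"}, {"milk"}))  # rule: bread -> milk
```

Here bread and milk appear together in 2 of 4 baskets (support 0.5), and 2 of the 3 baskets containing bread also contain milk (confidence about 0.67).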
19 Task: Classification
- Build a model that learns to predict the class from pre-labeled instances or observations
- Many approaches: Regression, Decision Trees, Neural Networks
Given a set of points from classes, what is the class of a new point?
Diagram taken from www.kdnuggets.com/data_mining_course/index.html
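The question "what is the class of a new point?" can be sketched with a k-nearest-neighbor classifier, one of the methods this lecture lists for classification. It is chosen here only because it is short; it is not one of the three approaches named on this slide. The function name `knn_classify` and the labeled points are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train, point, k=3):
    """Return the majority class among the k training points nearest to `point`."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Made-up pre-labeled instances: ((x, y), class).
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 6.5), "B"), ((6.5, 7.0), "B"),
]

print(knn_classify(train, (1.8, 1.5)))  # new point near the "A" group
print(knn_classify(train, (6.2, 6.4)))  # new point near the "B" group
```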
20 Task: Clustering
- Find groupings of instances given unlabeled data
Diagram taken from www.kdnuggets.com/data_mining_course/index.html
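One common way to group unlabeled data is k-means, sketched below. The slide does not name a specific clustering algorithm, so k-means is an assumption made for illustration. The points and starting centroids are invented, and the starting centroids are fixed rather than random so the run is repeatable.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Made-up unlabeled points forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

No class labels are supplied anywhere; the two groups emerge only from the distances between points.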
21 DM looks easy
[Diagram: Data -> Data Mining Method (Regression, Decision Tree, Neural Network, Association Rules) -> Model]
- But it is not easy
- The real world is complicated
22 Methods and Techniques
- Cluster Analysis (task: clustering)
- Association Rules (task: association)
- Decision trees (tasks: prediction, classification)
- Neural networks (tasks: prediction, classification)
- K-nearest neighbor (tasks: prediction, classification, clustering)
- Regression analysis (tasks: estimation, prediction)
- Confidence interval estimation (task: estimation)
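As an illustration of the last item, confidence interval estimation, here is a minimal sketch of an approximate interval for a population mean. It uses the normal approximation, which is an assumption; for a sample this small a t-based interval would be wider and more appropriate. The function name and the sample values are invented.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err

# Made-up sample of systolic blood pressure readings (mmHg).
sample = [118, 122, 125, 130, 121, 127, 124, 119, 126, 128]
low, high = mean_confidence_interval(sample)
print(f"95% CI for the mean: ({low:.1f}, {high:.1f})")
```

The interval estimates a range for the numeric quantity instead of a single value, which is why the slide files it under estimation.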
23 Fallacies of Data Mining (1)
- Fallacy 1: There are data mining tools that automatically find the answers to our problem
- Reality: There are no automatic tools that will solve your problems while you wait
- Fallacy 2: The DM process requires little human intervention
- Reality: The DM process requires human intervention in all its phases, including updating and evaluating the model by human experts
- Fallacy 3: Data mining has a quick ROI
- Reality: It depends on the startup costs, personnel costs, data source costs, and so on
24 Fallacies of Data Mining (2)
- Fallacy 4: DM tools are easy to use
- Reality: Analysts must be familiar with the model
- Fallacy 5: DM will identify the causes of the business problem
- Reality: DM tools only identify patterns in your data; analysts must identify the cause
- Fallacy 6: Data mining will clean up a data repository automatically
- Reality: A sequence of transformation tasks must be defined by an analyst during early DM phases
- Fallacies described by Jen Que Louie, President of Nautilus Systems, Inc.
25 In summary,
- Problems suitable for Data Mining
- Require discovering knowledge to make the right decisions
- Current solutions are not adequate
- Expected high-payoff for the right decisions
- Have accessible, sufficient, and relevant data
- Have a changing environment
- IMPORTANT
- ENSURE privacy if personal data is used!
- Not every data mining application is successful!
26 Main References
- Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition, Morgan Kaufmann Publishers
- Daniel Larose. Discovering Knowledge in Data: An Introduction to Data Mining, Wiley
- Pang-Ning Tan et al. Introduction to Data Mining, Addison Wesley
- Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers
- Online data mining course offered by KDnuggets™ at www.kdnuggets.com/data_mining_course/index.html
- Engineering Statistics Handbook available online at http://www.itl.nist.gov/div898/handbook/eda/section1/eda126.htm
27 Exercise 1
- CRISP-DM is not the only DM process. Do a quick search on the Internet for another process. Describe any similarities and differences with CRISP-DM.
- Determine how data mining could help a web search engine company like Google in its operation.
- Identify one or more objectives.
- Which data mining task(s) could help this company?