Title: Data Mining
1Data Mining
- Just a teaser of an overview
2Classic Cases
- Beer and diapers (an urban legend?)
- market basket analysis
- Credit card fraud detection based on user buying
patterns
- gas station purchase followed by big buy
- Online brokerages sponsor golf tournaments
- 70% of online investors are males over 40 who
golf
- 60% of stock investors are golfers
- Senior citizens buy CDs for their grand-kids
- RAP ads in magazines for seniors (?)
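Market basket analysis, mentioned above with the beer-and-diapers story, is usually summarized with three numbers per rule: support, confidence, and lift. The sketch below computes them for item pairs. The transactions are invented for illustration, and `pair_rules` is a hypothetical helper name, not something from the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, invented for illustration only.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def pair_rules(baskets):
    """Return (support, confidence, lift) for every ordered item pair seen together."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n                      # share of baskets holding both items
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]        # P(rhs | lhs)
            lift = confidence / (item_counts[rhs] / n)   # confidence vs. baseline P(rhs)
            rules[(lhs, rhs)] = (support, confidence, lift)
    return rules

rules = pair_rules(transactions)
support, confidence, lift = rules[("beer", "diapers")]
print(f"beer -> diapers: support={support:.2f}, "
      f"confidence={confidence:.2f}, lift={lift:.2f}")
```

A lift above 1 suggests the two items appear together more often than their individual popularity would predict.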
3Data Mining - Overview
- Extraction of hidden predictive information from
large databases
- Predict future trends and behavior
- proactive, knowledge-based decision making
- Automated, prospective analysis
- Goes beyond Data Warehousing and OLAP
- Supervised vs. Unsupervised Learning
- Attempt to have computers learn concepts
4Examples of Data Mining
- Marketing (clustering for customer segmentation)
- Direct marketing
- how to reach customers who matter - mass mailing
- Customer modeling
- customer profile - identify who is most likely to
buy
- Trend analysis
- isolating when a customer changes habits
- Fraud Detection
- insurance
- credit card
- Healthcare
- patient type clustering
- prediction of DRG outliers
- Hospital length of stay prediction
5Examples of Data Mining (contd.)
- Banking
- credit card marketing campaign
- Telecommunication
- changing long distance carriers
- Astronomy
- classification of stars in the known universe
- Geographic Information System
- identify relative consumption of products within
counties and zip codes
6Very Nice Intro to Data Mining
- http://www.aw.com/info/database/roiger.html
- Business focus
- Enough algorithm detail to learn something
- Includes Excel based software for many common
data mining techniques
7Some commonly used data mining techniques
- Classification trees
- Discriminant analysis
- Logistic regression
- Neural networks
- Kohonen self-organizing maps (networks)
- K-means or hierarchical cluster analysis
8Two Nice Examples in PMS
- Cluster Analysis (sec. 8.7)
- Cluster cities based on some demographic
information
- standardization of data
- Calculation of a distance metric
- Clustering as a non-linear optimization problem
(using Evolutionary Solver)
- Discriminant Analysis
- Classify people as Wall Street Journal readers
(or not)
- Fits regression line that discriminates between
the two groups
- Can use the classifier equation to predict
whether a new person would or would not subscribe
9Supervised Learning
- We see and hear things, learn their names,
classify them using our own mental models
- We have inputs and outputs
- We have predefined classes into which we wish to
classify new cases
- Example: Look at data of mortgage borrowers and
whether or not they defaulted. Build model to
predict if a new borrower is a default risk.
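The mortgage example can be sketched with a very simple supervised learner. The code below uses a one-nearest-neighbor rule as a stand-in; the slides do not name a specific algorithm for this example. The borrower records, the two input fields, and the function name `predict_default` are all assumptions made for illustration.

```python
import math

# Hypothetical training data: inputs are (income in $000s, debt-to-income ratio),
# the output is whether the borrower defaulted.
training = [
    ((35.0, 0.45), True),
    ((40.0, 0.50), True),
    ((85.0, 0.20), False),
    ((95.0, 0.15), False),
    ((60.0, 0.30), False),
]

def predict_default(borrower, examples):
    """Label a new borrower with the outcome of the closest training case."""
    _, nearest_label = min(examples, key=lambda ex: math.dist(ex[0], borrower))
    return nearest_label

# New data has inputs only; the model supplies the predicted class.
print(predict_default((38.0, 0.48), training))
print(predict_default((90.0, 0.18), training))
```

Because the inputs are left unscaled here, income dominates the distance; rescaling the inputs first would usually be advisable.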
10Supervised Learning
[Diagram] Training data (inputs and outputs), made up
of examples (defaulted) and non-examples (did NOT
default), feeds a data mining algorithm (for
classification), which produces a classifier model.
New data (inputs only) and test data are then run
through the classifier model, which outputs Default
or Not Default, OR a probability of default.
11DISCRIM.XLS
- This file contains the annual income and size of
investment portfolio (both in thousands of
dollars) for 84 people.
- It also indicates whether each of these people
subscribes or does not subscribe to the Wall
Street Journal.
- Using income and size of investment portfolio,
determine a classification rule that maximizes
the number of people correctly classified as
subscribers or nonsubscribers.
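One way to picture the task is as a search over linear rules of the form "income + weight * portfolio >= cutoff", keeping the rule that classifies the most people correctly. The sketch below does this with a coarse grid search. The eight records are invented (the real file has 84), and the grid search is only a stand-in for whatever optimizer is used with the spreadsheet.

```python
# Hypothetical records: (income $000s, portfolio $000s, subscribes to WSJ?)
people = [
    (60.0, 25.0, False),
    (45.0, 10.0, False),
    (70.0, 15.0, False),
    (55.0, 40.0, False),
    (90.0, 60.0, True),
    (110.0, 35.0, True),
    (65.0, 95.0, True),
    (100.0, 75.0, True),
]

def count_correct(weight, cutoff, records):
    """Count records where 'income + weight * portfolio >= cutoff' matches the label."""
    return sum(
        ((income + weight * portfolio) >= cutoff) == subscribes
        for income, portfolio, subscribes in records
    )

def best_rule(records):
    """Try a coarse grid of weights and cutoffs; keep the rule with the most correct."""
    best = (-1, None, None)
    for weight in [w / 4 for w in range(0, 13)]:   # 0.0, 0.25, ..., 3.0
        for cutoff in range(0, 401, 5):            # 0, 5, ..., 400
            correct = count_correct(weight, cutoff, records)
            if correct > best[0]:
                best = (correct, weight, cutoff)
    return best

correct, weight, cutoff = best_rule(people)
print(f"{correct}/{len(people)} correct with "
      f"income + {weight} * portfolio >= {cutoff}")
```

In this invented data set, income alone cannot separate the two groups, so the search has to give the portfolio a nonzero weight.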
12Unsupervised Learning
- We have inputs only
- We do NOT have predefined classes into which we
wish to classify new cases
- We want to cluster the data so that cases are
- Similar to other cases in their cluster
- Different from cases in other clusters
- Example: Look at data for online investors and
cluster them into groups
- do targeted marketing based on cluster membership
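A minimal k-means sketch shows what "similar within a cluster, different across clusters" means in practice. The investor records and the starting centers are assumptions made for illustration, and the number of clusters is fixed in advance by how many starting centers are passed in.

```python
import math

def k_means(points, centers, iterations=20):
    """Plain k-means: assign each point to its nearest center, then recompute centers."""
    centers = [tuple(c) for c in centers]
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                new_centers.append(
                    tuple(sum(v) / len(members) for v in zip(*members))
                )
            else:
                new_centers.append(old)  # keep an empty cluster's center unchanged
        if new_centers == centers:       # nothing moved, so the clusters are stable
            break
        centers = new_centers
    return assignment, centers

# Hypothetical online investors: (age, trades per month)
investors = [(25, 30), (28, 35), (30, 28), (55, 4), (60, 2), (58, 6)]
labels, centers = k_means(investors, centers=[investors[0], investors[3]])
print(labels)
print(centers)
```

The algorithm only returns cluster numbers; deciding what to call each group (for example "young active traders") is left to the analyst.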
13Unsupervised Learning
[Diagram] Training data (inputs only), here examples
of online investors, feeds a data mining algorithm
(for clustering) together with the number of
clusters (some algs need this). The result is a
clustering model and a set of clusters (each case
gets put in a cluster, you name the clusters).
14Can you identify clusters?
15CLUSTERS.XLS
- This file contains demographic data on 49 of the
largest cities in the United States.
- Some of the data appears in the shaded region of
the figure on the next slide.
- For example, Atlanta is 67% African American, 2%
Hispanic, and 1% Asian. It has a median age of
31, a 5% unemployment rate, and a per capita
income of $22,000.
- We would like to group these 49 cities into four
clusters of cities that are demographically
similar.
16Steps in Data Mining
- Exploratory
- Exploring relationships, reasoning later!
- Hypotheses
- Testing relationships based on theory / some
evidence
Beer-diaper example, Insurance Sports car
Antique
Classification Problem: Credit Approval,
Bankruptcy prediction
17Steps in Data Mining
This step is critical and can take an enormous
amount of time and effort.
- Expertise on the data
- Is there enough data? (size)
- Data quality: noise, format, missing values,
errors. Garbage in, garbage out!
- Is data available on the relevant factors?
18Data for Mining
- Most data mining packages want a flat file of
data
- Much work goes into preparing the data
- Missing values
- May transform numeric into categorical
- Often standardize or rescale the data using a
variety of different approaches
- Standardized values are in a relatively small
range of values
- Removes bias due to large differences in absolute
numeric values
- Example: ages in years vs. fractions (decimals
between 0 and 1)
- WHY? Many data mining algorithms calculate
distance between cases (records in your data)
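The effect of rescaling on distance can be seen in a small sketch. It uses z-scores (subtract the mean, divide by the standard deviation), which is only one of the several approaches the slide alludes to. The four records are invented for illustration.

```python
import math
from statistics import mean, pstdev

def z_scores(values):
    """Rescale a column to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical records: age in years and a fraction between 0 and 1.
ages = [25, 35, 45, 55]
fractions = [0.10, 0.90, 0.20, 0.80]

raw = list(zip(ages, fractions))
scaled = list(zip(z_scores(ages), z_scores(fractions)))

# Distance from case 0 to cases 1 and 2, before and after rescaling.
for name, rows in (("raw", raw), ("scaled", scaled)):
    d01 = math.dist(rows[0], rows[1])
    d02 = math.dist(rows[0], rows[2])
    print(f"{name}: d(0,1)={d01:.2f}  d(0,2)={d02:.2f}")
```

On the raw values, age swamps the fraction column, so case 1 looks closest to case 0 even though their fractions are far apart. After rescaling, both columns contribute and case 2 becomes the nearer neighbor.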
19Flat File for Mining
Standardized data values
Raw data values
20Steps in Data Mining
Different techniques are suited for different
kinds of problems, produce different kinds of
output
- Unsupervised Learning
- K-means cluster analysis
- Kohonen Networks
- The number of groups is not known a priori
- Supervised Learning
- Neural Networks
- The number of groups is known a priori, enabling
categorization and training
- Linear vs. non-linear separation, dimensionality
Classification Problem: Credit Approval,
Bankruptcy prediction
Customer profiling
214. Analysis
- Subject new data to classification models to aid
in decision making
- Use output from clustering models as part of
larger decision or planning problem
- Supplement data mining models with data
visualization, statistical reporting
- Supplement all of this with domain knowledge,
common sense, expert opinion
22Decision Trees
- Another supervised learning technique
- Generalizes input instances by building a tree
- Non-terminal nodes are tests on one or more
attributes (i.e. pick next branch to traverse)
- Terminal nodes are decision outcomes
- Simple, graphical, can be transformed into
logical rules
23Hypothetical Training Data for Disease Diagnosis
24Simple Decision Tree
Swollen Glands?
- Yes: Diagnosis Strep Throat
- No: Fever?
  - Yes: Diagnosis Cold
  - No: Diagnosis Allergy
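A decision tree like this one translates directly into logical rules. The sketch below assumes the branches read as follows: swollen glands leads to strep throat; otherwise fever leads to cold, and no fever leads to allergy. The function name `diagnose` is a hypothetical choice for illustration.

```python
def diagnose(swollen_glands: bool, fever: bool) -> str:
    """Walk the tree from the root test down to a terminal (decision) node."""
    if swollen_glands:          # root test
        return "Strep Throat"
    if fever:                   # second test, reached only when glands are not swollen
        return "Cold"
    return "Allergy"

for glands in (True, False):
    for fever in (True, False):
        print(f"swollen_glands={glands}, fever={fever} -> {diagnose(glands, fever)}")
```

Each path from the root to a leaf corresponds to one IF ... THEN rule, which is why trees are considered easy to explain.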
25Neural Networks Used for Classification
- Based on simple model of brain as collection of
neurons that fire if inputs exceed a certain
threshold
- Three types of neurons or nodes
- Input nodes: these are input variables
- Hidden nodes: they help fit a model
- Output nodes: these are the predictions
- Supervised learning
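The three node types can be sketched as a tiny forward pass in which each neuron fires when its weighted input exceeds a threshold. The weights below are picked by hand to compute "exactly one of two inputs is on"; they are not learned, and real packages typically use smoother activation functions than this hard threshold.

```python
def neuron(inputs, weights, threshold):
    """Fire (1) if the weighted sum of inputs exceeds the threshold, else 0."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0

def forward(inputs, hidden_layer, output_node):
    """Input nodes feed hidden nodes; hidden nodes feed the single output node."""
    hidden = [neuron(inputs, weights, threshold) for weights, threshold in hidden_layer]
    weights, threshold = output_node
    return neuron(hidden, weights, threshold)

# Hand-picked (weights, threshold) pairs, assumed for illustration.
hidden_layer = [
    ([1.0, 1.0], 0.5),   # fires if at least one input is on
    ([1.0, 1.0], 1.5),   # fires only if both inputs are on
]
output_node = ([1.0, -1.0], 0.5)  # first hidden node on, second off

for case in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(case, "->", forward(case, hidden_layer, output_node))
```

Neither hidden node alone gives the answer; the output node combines them, which is the sense in which hidden nodes "help fit a model".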
26Basic Steps with Neural Nets
- Prepare data including input and output variables
- Standardization a good idea
- Hold out some data to test model later
- Submit training data to neural network software
(see later slide)
- Build neural network model
- A few tuning parameters involved
- Model iteratively puts weights on neurons in an
attempt to predict the outputs correctly
- Submit new data to neural network for predictive
classification
- Neural network is a black box: there is NO
equation to look at
- Can be difficult to gain insight from neural
network
- Has proven to be a very effective classification
method in practice
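The idea of iteratively adjusting weights until the outputs are predicted correctly can be sketched with a single-neuron perceptron. This is far simpler than the multi-layer networks that commercial packages build, and it stands in for them only to show the training loop. The training cases, learning rate, and epoch limit are assumptions made for illustration.

```python
def predict(inputs, weights, bias):
    """Fire (1) if the weighted sum plus bias is positive, else 0."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0

def train_perceptron(cases, epochs=50, rate=1.0):
    """Nudge weights toward the target whenever a training case is misclassified."""
    weights = [0.0] * len(cases[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for inputs, target in cases:
            error = target - predict(inputs, weights, bias)
            if error != 0:
                mistakes += 1
                weights = [w + rate * error * x for w, x in zip(weights, inputs)]
                bias += rate * error
        if mistakes == 0:   # a full pass with no errors: stop training
            break
    return weights, bias

# Hypothetical training cases: two yes/no inputs, output 1 only when both are yes.
training = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(training)

for inputs, target in training:
    print(inputs, "target", target, "predicted", predict(inputs, weights, bias))
```

With one neuron the learned weights are still readable; once hidden layers are added, the weights no longer map to a simple equation, which is the "black box" concern raised above.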
27Basic Structure of a Neural Network
28Neural Network Software
- Clementine
- iDA: trial version ships with Roiger and Geatz
book (see previous slide)
- Excel based add-ins
- Neuralyst
- Freeware and Shareware
29More Data Mining Software
- Clustan: cluster analysis
- SAS
- SPSS
- KXEN
- Lots more at KDNuggets.com