Title: Data Mining in Large Databases
1Data Mining in Large Databases
- (Contributing slides by Gregory Piatetsky-Shapiro, and Rajeev Rastogi and Kyuseok Shim, Lucent Bell Laboratories)
2Overview
- Introduction
- Association Rules
- Classification
- Clustering
3Background
- Corporations have huge databases containing a wealth of information
- Business databases potentially constitute a goldmine of valuable business information
- Very little functionality in database systems to support data mining applications
- Data mining: the efficient discovery of previously unknown patterns in large databases
4Applications
- Fraud Detection
- Loan and Credit Approval
- Market Basket Analysis
- Customer Segmentation
- Financial Applications
- E-Commerce
- Decision Support
- Web Search
5Data Mining Techniques
- Association Rules
- Sequential Patterns
- Classification
- Clustering
- Similar Time Sequences
- Similar Images
- Outlier Discovery
- Text/Web Mining
6Examples of Patterns
- Association rules
- 98% of people who purchase diapers buy beer
- Classification
- People with age less than 25 and salary > 40k drive sports cars
- Similar time sequences
- Stocks of companies A and B perform similarly
- Outlier Detection
- Residential customers with businesses at home
7Association Rules
- Given
- A database of customer transactions
- Each transaction is a set of items
- Find all rules X => Y that correlate the presence of one set of items X with another set of items Y
- Any number of items in the consequent or antecedent of a rule
- Possible to specify constraints on rules (e.g., find only rules involving expensive imported products)
8Association Rules
- Sample Applications
- Market basket analysis
- Attached mailing in direct marketing
- Fraud detection for medical insurance
- Department store floor/shelf planning
9Confidence and Support
- A rule must have some minimum user-specified confidence
- 1 & 2 => 3 has 90% confidence if, when a customer bought 1 and 2, in 90% of cases the customer also bought 3.
- A rule must have some minimum user-specified support
- 1 & 2 => 3 should hold in some minimum percentage of transactions to have business value
10Example
- For minimum support 50% and minimum confidence 50%, we have the following rules:
- 1 => 3 with 50% support and 66% confidence
- 3 => 1 with 50% support and 100% confidence
11Problem Decomposition
- 1. Find all sets of items that have minimum support
- Use the Apriori algorithm
- 2. Use the frequent itemsets to generate the desired rules
- Generation is straightforward
12Problem Decomposition - Example
For minimum support 50% and minimum confidence 50%:
- For the rule 1 => 3
- Support = Support({1, 3}) = 50%
- Confidence = Support({1, 3}) / Support({1}) = 66% (a small computation sketch follows below)
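To make the support/confidence arithmetic concrete, here is a minimal Python sketch. The original example database is not reproduced on these slides, so the transactions below are a hypothetical toy set chosen so that 1 => 3 has 50% support and 66% confidence, matching the numbers above.

```python
# Minimal sketch of support/confidence computation. The transactions are a
# hypothetical toy set (the slides' own example database is not shown here).

def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of (X union Y) divided by the support of X."""
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

transactions = [{1, 3, 4}, {1, 2, 3, 5}, {1, 2, 5}, {2, 5}]
print(support(transactions, {1, 3}))        # 0.5      -> 50% support
print(confidence(transactions, {1}, {3}))   # 0.666... -> 66% confidence
print(confidence(transactions, {3}, {1}))   # 1.0      -> 100% confidence
```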
13The Apriori Algorithm
- Fk : set of frequent itemsets of size k
- Ck : set of candidate itemsets of size k
- F1 = {large 1-itemsets}
- for (k = 1; Fk != {}; k++) do
-   Ck+1 = new candidates generated from Fk
-   foreach transaction t in the database do
-     Increment the count of all candidates in Ck+1 that are contained in t
-   Fk+1 = candidates in Ck+1 with minimum support
- Answer = Union over k of Fk
14Key Observation
- Every subset of a frequent itemset is also frequent => a candidate itemset in Ck+1 can be pruned if even one of its subsets is not contained in Fk (this loop and pruning step are sketched in code below)
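A compact Python sketch of the Apriori loop from slide 13, including the subset-pruning step described above. The function name `apriori`, the fractional `min_support` parameter, and the set-based candidate join are illustrative choices, not details given on the slides.

```python
# Sketch of the Apriori loop (slide 13) plus subset pruning (slide 14).
from itertools import combinations

def apriori(transactions, min_support):
    """Return all itemsets whose support is at least min_support (a fraction)."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def frequent_of(candidates):
        # One scan of the database: count each candidate, keep those with min support
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c for c, cnt in counts.items() if cnt / n >= min_support}

    frequent = frequent_of({frozenset([i]) for t in transactions for i in t})  # F1
    answer = set(frequent)
    k = 1
    while frequent:
        # Candidate generation: join Fk with itself to form (k+1)-itemsets ...
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # ... and prune any candidate that has a k-subset which is not frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        frequent = frequent_of(candidates)   # Fk+1
        answer |= frequent
        k += 1
    return answer                            # union of all Fk

# Example: apriori([{1, 3, 4}, {1, 2, 3, 5}, {1, 2, 5}, {2, 5}], min_support=0.5)
```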
15Apriori - Example
[Figure: worked example - candidate 1-itemsets C1 are counted in a scan of database D to give F1; candidates C2 are generated from F1, and a second scan of D gives F2]
16Sequential Patterns
- Given
- A sequence of customer transactions
- Each transaction is a set of items
- Find all maximal sequential patterns supported by more than a user-specified percentage of customers
- Example: 10% of customers who bought a PC did a memory upgrade in a subsequent transaction
17Classification
- Given
- Database of tuples, each assigned a class label
- Develop a model/profile for each class
- Example profile (good credit):
- (25 < age < 40 and income > 40k) or (married = YES)
- Sample applications
- Credit card approval (good, bad)
- Bank locations (good, fair, poor)
- Treatment effectiveness (good, fair, poor)
18Decision Tree
- An internal node is a test on an attribute.
- A branch represents an outcome of the test, e.g., Color = red.
- A leaf node represents a class label or class label distribution.
- At each node, one attribute is chosen to split training examples into distinct classes as much as possible.
- A new case is classified by following a matching path to a leaf node.
19Decision Trees
20Example Tree
[Figure: decision tree for the weather data]
- Outlook = sunny -> test Humidity
- Humidity = high -> No
- Humidity = normal -> Yes
- Outlook = overcast -> Yes
- Outlook = rain -> test Windy
- Windy = true -> No
- Windy = false -> Yes
(the same tree appears as nested conditionals below)
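A minimal Python sketch of the example tree written as nested conditionals; attribute values and class labels are taken from the reconstructed figure above.

```python
# The example tree above written as nested conditionals (a sketch).
def classify(outlook: str, humidity: str, windy: bool) -> str:
    if outlook == "sunny":
        return "No" if humidity == "high" else "Yes"
    if outlook == "overcast":
        return "Yes"
    # outlook == "rain"
    return "No" if windy else "Yes"

print(classify("sunny", "normal", False))  # Yes
```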
21Decision Tree Algorithms
- Building phase
- Recursively split nodes using the best splitting attribute for each node
- Pruning phase
- A smaller, imperfect decision tree generally achieves better accuracy
- Prune leaf nodes recursively to prevent over-fitting
22Attribute Selection
- Which is the best attribute?
- The one which will result in the smallest tree
- Heuristic: choose the attribute that produces the purest nodes
- Popular impurity criterion: information gain
- Information gain increases with the average purity of the subsets that an attribute produces
- Strategy: choose the attribute that results in the greatest information gain
23Which attribute to select?
24Computing information
- Information is measured in bits
- Given a probability distribution, the info required to predict an event is the distribution's entropy
- Entropy gives the information required in bits (this can involve fractions of bits!)
- Formula for computing the entropy:
- entropy(p1, p2, ..., pn) = -p1 log2(p1) - p2 log2(p2) - ... - pn log2(pn)
25Example attribute Outlook
- Outlook = Sunny
- Outlook = Overcast
- Outlook = Rainy
- Expected information for the attribute
26Computing the information gain
- Information gain = (information before split) - (information after split)
- Information gain for the attributes from the weather data (a small computation sketch follows below)
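A small Python sketch of these two formulas. The function names and the 9-yes/5-no toy split below are illustrative (they mirror a 14-example dataset split three ways by one attribute), not values quoted from the slides.

```python
# Sketch: entropy and information gain for a candidate split.
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(labels, subsets):
    """(information before split) - (weighted information after split)."""
    total = len(labels)
    after = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(labels) - after

# Toy usage: a 9-yes / 5-no set split into three subsets by some attribute
before = ["yes"] * 9 + ["no"] * 5
split = [["yes", "yes", "no", "no", "no"],      # one attribute value
         ["yes"] * 4,                           # another value: a pure subset
         ["yes", "yes", "yes", "no", "no"]]     # a third value
print(information_gain(before, split))          # roughly 0.25 bits
```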
27Continuing to split
28The final decision tree
- Note: not all leaves need to be pure; sometimes identical instances have different classes
- Splitting stops when the data can't be split any further
29Decision Trees
- Pros
- Fast execution time
- Generated rules are easy to interpret by humans
- Scale well for large data sets
- Can handle high dimensional data
- Cons
- Cannot capture correlations among attributes
- Consider only axis-parallel cuts
30Clustering
- Given
- Data points and number of desired clusters K
- Group the data points into K clusters
- Data points within clusters are more similar than across clusters
- Sample applications
- Customer segmentation
- Market basket customer analysis
- Attached mailing in direct marketing
- Clustering companies with similar growth
31Traditional Algorithms
- Partitional algorithms
- Enumerate K partitions optimizing some criterion
- Example: square-error criterion E = sum over clusters Ci of sum over points x in Ci of ||x - mi||^2
- mi is the mean of cluster Ci
32K-means Algorithm
- Assign initial means
- Assign each point to the cluster with the closest mean
- Compute the new mean for each cluster
- Iterate until the criterion function converges (see the sketch below)
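A compact NumPy sketch of the four steps above; the function name `kmeans`, the random seeding of the initial means, and the convergence test are illustrative choices.

```python
# Sketch of the K-means loop above using NumPy.
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Assign initial means: pick k random points as starting centers
    means = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to the cluster with the closest mean
        dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Compute the new mean of each cluster (keep the old mean if a cluster is empty)
        new_means = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        # Iterate until the means stop moving (criterion has converged)
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels, means

# Example: cluster 100 random 2-D points into K = 3 clusters
pts = np.random.default_rng(1).random((100, 2))
labels, centers = kmeans(pts, 3)
```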
33K-means example, step 1
Pick 3 initial cluster centers (randomly)
34K-means example, step 2
Assign each point to the closest cluster center
35K-means example, step 3
Move each cluster center to the mean of each
cluster
36K-means example, step 4
Reassign points closest to a different new cluster center. Q: Which points are reassigned?
37K-means example, step 4
A: three points
38K-means example, step 4b
Re-compute cluster means
39K-means example, step 5
Move cluster centers to cluster means
40Discussion
- The result can vary significantly depending on the initial choice of seeds
- Can get trapped in a local minimum
- Example
- To increase the chance of finding the global optimum: restart with different random seeds
41K-means clustering summary
- Advantages
- Simple, understandable
- Items automatically assigned to clusters
- Disadvantages
- Must pick the number of clusters beforehand
- All items forced into a cluster
- Too sensitive to outliers
42Traditional Algorithms
- Hierarchical clustering
- Nested Partitions
- Tree structure
43Agglomerative Hierarchical Algorithms
- The most widely used hierarchical clustering approach
- Initially each point is a distinct cluster
- Repeatedly merge the closest clusters until the number of clusters becomes K
- Closest: dmean(Ci, Cj), the distance between the means of Ci and Cj
- or dmin(Ci, Cj), the minimum distance between a point in Ci and a point in Cj
- Likewise dave(Ci, Cj) and dmax(Ci, Cj) (a sketch using dmean follows below)
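A short Python sketch of agglomerative clustering with the dmean criterion: repeatedly merge the two clusters whose means are closest until K clusters remain. The function name and the brute-force pair search are illustrative choices.

```python
# Sketch of agglomerative clustering with the d_mean merge criterion.
import numpy as np

def agglomerative(points, k):
    clusters = [[i] for i in range(len(points))]   # each point starts as its own cluster
    while len(clusters) > k:
        means = [points[c].mean(axis=0) for c in clusters]
        # find the pair of clusters whose means are closest (d_mean)
        i, j = min(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                   key=lambda ab: np.linalg.norm(means[ab[0]] - means[ab[1]]))
        clusters[i] = clusters[i] + clusters[j]     # merge the two closest clusters
        del clusters[j]
    return clusters                                 # lists of point indices, one per cluster

pts = np.random.default_rng(0).random((20, 2))
print(agglomerative(pts, 3))
```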
44Similar Time Sequences
- Given
- A set of time-series sequences
- Find
- All sequences similar to the query sequence
- All pairs of similar sequences
- whole matching vs. subsequence matching
- Sample Applications
- Financial market
- Scientific databases
- Medical Diagnosis
45Whole Sequence Matching
- Basic Idea
- Extract k features from every sequence
- Every sequence is then represented as a point in k-dimensional space
- Use a multi-dimensional index to store and search these points
- Spatial indices do not work well for high-dimensional data
46Similar Time Sequences
- Take Euclidean distance as the similarity measure
- Obtain the Discrete Fourier Transform (DFT) coefficients of each sequence in the database
- Build a multi-dimensional index using the first few Fourier coefficients
- Use the index to retrieve sequences that are at most a given distance away from the query sequence
- Post-processing:
- Compute the actual distance between sequences in the time domain (see the sketch below)
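A minimal Python sketch of this filter-and-refine idea, assuming equal-length sequences. A plain linear scan stands in for the multi-dimensional index, and using the orthonormal DFT makes the feature-space distance a lower bound on the time-domain distance (so the filtering step loses no true matches).

```python
# Sketch: keep the first few DFT coefficients as features, filter candidates
# by distance in feature space, then verify with the true time-domain distance.
import numpy as np

def dft_features(seq, n_coeffs=3):
    """First few orthonormal DFT coefficients, flattened into real features."""
    coeffs = np.fft.fft(np.asarray(seq, dtype=float), norm="ortho")[:n_coeffs]
    return np.concatenate([coeffs.real, coeffs.imag])

def similar_sequences(database, query, eps, n_coeffs=3):
    q_feat = dft_features(query, n_coeffs)
    hits = []
    for i, seq in enumerate(database):
        # Filtering step: feature-space distance (a multi-dimensional index would go here)
        if np.linalg.norm(dft_features(seq, n_coeffs) - q_feat) <= eps:
            # Post-processing: actual Euclidean distance in the time domain
            if np.linalg.norm(np.asarray(seq, float) - np.asarray(query, float)) <= eps:
                hits.append(i)
    return hits
```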
47Outlier Discovery
- Given
- Data points and the number of outliers (n) to find
- Find the top n outlier points
- Outliers are considerably dissimilar from the remainder of the data
- Sample applications
- Credit card fraud detection
- Telecom fraud detection
- Medical analysis
48Statistical Approaches
- Model the underlying distribution that generates the dataset (e.g., normal distribution)
- Use discordancy tests depending on
- data distribution
- distribution parameter (e.g. mean, variance)
- number of expected outliers
- Drawbacks
- Most tests are for a single attribute
- In many cases, data distribution may not be known
49Distance-based Outliers
- For a fraction p and a distance d,
- a point o is an outlier if at least a fraction p of the points lie at a distance greater than d from o
- General enough to model statistical outlier tests
- Nested-loop and cell-based algorithms have been developed (the nested-loop version is sketched below)
- They scale acceptably for large datasets
- The cell-based algorithm does not scale well to high dimensions
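A minimal sketch of the nested-loop test for this definition of a distance-based outlier; the function name, parameters, and toy data are illustrative.

```python
# Sketch of the nested-loop test: flag a point if at least a fraction p of the
# other points lie farther than d from it.
import numpy as np

def distance_based_outliers(points, p, d):
    n = len(points)
    outliers = []
    for i, o in enumerate(points):                    # outer loop over candidate points
        dists = np.linalg.norm(points - o, axis=1)    # inner "loop": distances to all points
        far_fraction = np.sum(dists > d) / (n - 1)    # fraction of other points farther than d
        if far_fraction >= p:
            outliers.append(i)
    return outliers

pts = np.vstack([np.random.default_rng(0).normal(size=(50, 2)),
                 [[8.0, 8.0]]])                       # one obvious outlier appended
print(distance_based_outliers(pts, p=0.95, d=3.0))    # expect index 50 to be flagged
```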