1
Data Mining in Large Databases
  • (Contributing slides by Gregory Piatetsky-Shapiro,
  • and
  • Rajeev Rastogi and Kyuseok Shim,
  • Lucent Bell Laboratories)

2
Overview
  • Introduction
  • Association Rules
  • Classification
  • Clustering

3
Background
  • Corporations have huge databases containing a
    wealth of information
  • Business databases potentially constitute a
    goldmine of valuable business information
  • Very little functionality in database systems to
    support data mining applications
  • Data mining: the efficient discovery of previously unknown patterns in large databases

4
Applications
  • Fraud Detection
  • Loan and Credit Approval
  • Market Basket Analysis
  • Customer Segmentation
  • Financial Applications
  • E-Commerce
  • Decision Support
  • Web Search

5
Data Mining Techniques
  • Association Rules
  • Sequential Patterns
  • Classification
  • Clustering
  • Similar Time Sequences
  • Similar Images
  • Outlier Discovery
  • Text/Web Mining

6
Examples of Patterns
  • Association rules
  • 98% of people who purchase diapers also buy beer
  • Classification
  • People with age less than 25 and salary > 40k drive sports cars
  • Similar time sequences
  • Stocks of companies A and B perform similarly
  • Outlier Detection
  • Residential customers with businesses at home

7
Association Rules
  • Given
  • A database of customer transactions
  • Each transaction is a set of items
  • Find all rules X => Y that correlate the presence of one set of items X with another set of items Y
  • Any number of items in the consequent or
    antecedent of a rule
  • Possible to specify constraints on rules (e.g.,
    find only rules involving expensive imported
    products)

8
Association Rules
  • Sample Applications
  • Market basket analysis
  • Attached mailing in direct marketing
  • Fraud detection for medical insurance
  • Department store floor/shelf planning

9
Confidence and Support
  • A rule must have some minimum user-specified
    confidence
  • {1, 2} => {3} has 90% confidence if, when a customer bought items 1 and 2, in 90% of cases the customer also bought item 3.
  • A rule must have some minimum user-specified
    support
  • {1, 2} => {3} should hold in some minimum percentage of transactions to have business value (a sketch of both measures follows below)
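
A minimal Python sketch (not from the slides) of the two measures, assuming each transaction is represented as a set of item ids; the four-transaction database is hypothetical, chosen to reproduce the numbers on the next slide.

    def support(itemset, transactions):
        """Fraction of transactions that contain every item in itemset."""
        itemset = set(itemset)
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def confidence(antecedent, consequent, transactions):
        """support(antecedent U consequent) / support(antecedent)."""
        both = set(antecedent) | set(consequent)
        return support(both, transactions) / support(antecedent, transactions)

    # Hypothetical transaction database
    transactions = [{1, 2, 3}, {1, 3}, {1, 4}, {2, 5, 6}]
    print(support({1, 3}, transactions))       # 0.5  -> 50% support
    print(confidence({1}, {3}, transactions))  # 0.67 -> 66% confidence
    print(confidence({3}, {1}, transactions))  # 1.0  -> 100% confidence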

10
Example
  • For minimum support 50% and minimum confidence 50%, we have the following rules
  • {1} => {3} with 50% support and 66% confidence
  • {3} => {1} with 50% support and 100% confidence

11
Problem Decomposition
  • 1. Find all sets of items that have minimum
    support
  • Use Apriori Algorithm
  • 2. Use the frequent itemsets to generate the desired rules
  • Generation is straightforward (a sketch follows below)
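
A sketch of the generation step (mine, not the slides' code): given the supports of all frequent itemsets, every split of an itemset into antecedent and consequent is checked against the confidence threshold. By the Apriori property, every antecedent of a frequent itemset is itself frequent, so its support is available.

    from itertools import combinations

    def generate_rules(supports, min_conf):
        """supports: dict mapping frozenset itemsets to their support fraction.
        Returns rules (X, Y, confidence) with X => Y and confidence >= min_conf."""
        rules = []
        for itemset in supports:
            if len(itemset) < 2:
                continue
            for r in range(1, len(itemset)):
                for ante in combinations(sorted(itemset), r):
                    ante = frozenset(ante)
                    conf = supports[itemset] / supports[ante]
                    if conf >= min_conf:
                        rules.append((set(ante), set(itemset - ante), conf))
        return rules

    # Supports consistent with the example slides: {1}: 75%, {3}: 50%, {1, 3}: 50%
    supports = {frozenset({1}): 0.75, frozenset({3}): 0.5, frozenset({1, 3}): 0.5}
    for x, y, c in generate_rules(supports, min_conf=0.5):
        print(x, "=>", y, round(c, 2))   # {1} => {3} 0.67  and  {3} => {1} 1.0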

12
Problem Decomposition - Example
For minimum support 50% and minimum confidence 50%:
  • For the rule {1} => {3}:
  • Support = Support({1, 3}) = 50%
  • Confidence = Support({1, 3}) / Support({1}) = 66%

13
The Apriori Algorithm
  • Fk : set of frequent itemsets of size k
  • Ck : set of candidate itemsets of size k
  • F1 = {large 1-itemsets}
  • for (k = 1; Fk != ∅; k++) do
  •     Ck+1 = new candidates generated from Fk
  •     foreach transaction t in the database do
  •         increment the count of all candidates in Ck+1 that are contained in t
  •     Fk+1 = candidates in Ck+1 with minimum support
  • Answer = ∪k Fk
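
A compact Python rendering of this loop, with the subset-based pruning from the next slide built into candidate generation. It is a sketch under my own naming, not the original implementation.

    from itertools import combinations

    def apriori(transactions, min_support):
        """transactions: list of sets of items. Returns {frozenset: support}."""
        n = len(transactions)
        # F1: frequent 1-itemsets
        counts = {}
        for t in transactions:
            for item in t:
                counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
        frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
        answer = dict(frequent)              # will accumulate the union of all Fk
        k = 1
        while frequent:
            # Generate C(k+1) from Fk; prune candidates with an infrequent k-subset
            prev = set(frequent)
            candidates = {a | b for a in prev for b in prev
                          if len(a | b) == k + 1
                          and all(frozenset(s) in prev
                                  for s in combinations(a | b, k))}
            # One pass over the database to count the surviving candidates
            counts = {c: 0 for c in candidates}
            for t in transactions:
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            frequent = {c: cnt / n for c, cnt in counts.items()
                        if cnt / n >= min_support}
            answer.update(frequent)          # Answer = union of all Fk
            k += 1
        return answer

On the hypothetical four-transaction database used in the earlier sketch, apriori(transactions, 0.5) returns supports for {1}, {2}, {3} and {1, 3}.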

14
Key Observation
  • Every subset of a frequent itemset is also frequent => a candidate itemset in Ck+1 can be pruned if even one of its subsets is not contained in Fk

15
Apriori - Example
(Figure: the example database D is scanned once to count the candidate 1-itemsets C1 and obtain F1; candidates C2 are generated from F1, D is scanned again to count them, yielding the frequent 2-itemsets F2.)
16
Sequential Patterns
  • Given
  • A sequence of customer transactions
  • Each transaction is a set of items
  • Find all maximal sequential patterns supported by
    more than a user-specified percentage of
    customers
  • Example: 10% of customers who bought a PC did a memory upgrade in a subsequent transaction

17
Classification
  • Given
  • Database of tuples, each assigned a class label
  • Develop a model/profile for each class
  • Example profile (good credit):
  • (25 < age < 40 and income > 40k) or (married = YES)
  • Sample applications
  • Credit card approval (good, bad)
  • Bank locations (good, fair, poor)
  • Treatment effectiveness (good, fair, poor)

18
Decision Tree
  • An internal node is a test on an attribute.
  • A branch represents an outcome of the test, e.g., Color = red.
  • A leaf node represents a class label or class
    label distribution.
  • At each node, one attribute is chosen to split
    training examples into distinct classes as much
    as possible
  • A new case is classified by following a matching path to a leaf node (a small sketch follows the example tree below).

19
Decision Trees
20
Example Tree
    Outlook = sunny:
        Humidity = high: No
        Humidity = normal: Yes
    Outlook = overcast: Yes
    Outlook = rain:
        Windy = true: No
        Windy = false: Yes
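
A small sketch (mine) of how a new case is classified by following a matching path through the example tree above; the nested-dict encoding is just one possible representation.

    # The example tree: internal nodes test an attribute, leaves are class labels.
    tree = {"Outlook": {
        "sunny":    {"Humidity": {"high": "No", "normal": "Yes"}},
        "overcast": "Yes",
        "rain":     {"Windy": {"true": "No", "false": "Yes"}},
    }}

    def classify(node, case):
        """Follow the branch matching the case's attribute value until a leaf."""
        while isinstance(node, dict):
            attribute, branches = next(iter(node.items()))
            node = branches[case[attribute]]
        return node

    print(classify(tree, {"Outlook": "sunny", "Humidity": "normal"}))  # Yes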
21
Decision Tree Algorithms
  • Building phase
  • Recursively split nodes using best splitting
    attribute for node
  • Pruning phase
  • A smaller, imperfect decision tree generally achieves better accuracy on unseen data
  • Prune leaf nodes recursively to prevent
    over-fitting

22
Attribute Selection
  • Which is the best attribute?
  • The one which will result in the smallest tree
  • Heuristic: choose the attribute that produces the purest nodes
  • Popular impurity criterion: information gain
  • Information gain increases with the average purity of the subsets that an attribute produces
  • Strategy: choose the attribute that results in the greatest information gain

23
Which attribute to select?
24
Computing information
  • Information is measured in bits
  • Given a probability distribution, the info required to predict an event is the distribution's entropy
  • Entropy gives the information required in bits (this can involve fractions of bits!)
  • Formula for computing the entropy: entropy(p1, ..., pn) = -p1 log2(p1) - p2 log2(p2) - ... - pn log2(pn)

25
Example attribute Outlook
  • Information for Outlook = Sunny
  • Information for Outlook = Overcast
  • Information for Outlook = Rainy
  • Expected information for the attribute is the weighted average of the values above

26
Computing the information gain
  • Information gain = (information before split) - (information after split)
  • Among the weather-data attributes, Outlook yields the greatest information gain, so it becomes the root of the tree (a sketch of the computation follows below)
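
A sketch of both computations in Python (mine, not the slides' code); the five-row table is a hypothetical stand-in for the weather data, which is not reproduced in this transcript.

    from math import log2

    def entropy(counts):
        """entropy(p1, ..., pn) = -sum(pi * log2 pi) over non-zero class fractions."""
        total = sum(counts)
        return -sum(c / total * log2(c / total) for c in counts if c)

    def information_gain(rows, attribute, target="Play"):
        """(information before split) - (weighted information after the split)."""
        def info(subset):
            labels = [r[target] for r in subset]
            return entropy([labels.count(v) for v in set(labels)])
        before = info(rows)
        after = 0.0
        for value in set(r[attribute] for r in rows):
            part = [r for r in rows if r[attribute] == value]
            after += len(part) / len(rows) * info(part)
        return before - after

    # Hypothetical rows in the spirit of the weather data
    rows = [{"Outlook": "sunny", "Play": "No"}, {"Outlook": "sunny", "Play": "No"},
            {"Outlook": "overcast", "Play": "Yes"}, {"Outlook": "rain", "Play": "Yes"},
            {"Outlook": "rain", "Play": "No"}]
    print(round(information_gain(rows, "Outlook"), 3))   # 0.571 bits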

27
Continuing to split
28
The final decision tree
  • Note: not all leaves need to be pure; sometimes identical instances have different classes
  • Splitting stops when the data cannot be split any further

29
Decision Trees
  • Pros
  • Fast execution time
  • Generated rules are easy to interpret by humans
  • Scale well for large data sets
  • Can handle high dimensional data
  • Cons
  • Cannot capture correlations among attributes
  • Consider only axis-parallel cuts

30
Clustering
  • Given
  • Data points and number of desired clusters K
  • Group the data points into K clusters
  • Data points within clusters are more similar than
    across clusters
  • Sample applications
  • Customer segmentation
  • Market basket customer analysis
  • Attached mailing in direct marketing
  • Clustering companies with similar growth

31
Traditional Algorithms
  • Partitional algorithms
  • Enumerate K partitions optimizing some criterion
  • Example: the square-error criterion E = Σi Σ_{p ∈ Ci} ||p - mi||², where mi is the mean of cluster Ci (a short sketch follows below)
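
A one-function numpy sketch of the square-error criterion (names are mine):

    import numpy as np

    def square_error(clusters):
        """E = sum_i sum_{p in Ci} ||p - mi||^2, where mi is the mean of cluster Ci.
        clusters: list of (n_i x d) arrays, one array of points per cluster."""
        return sum(((pts - pts.mean(axis=0)) ** 2).sum() for pts in clusters)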

32
K-means Algorithm
  • Assign initial means
  • Assign each point to the cluster with the closest mean
  • Compute a new mean for each cluster
  • Iterate until the criterion function converges (a minimal sketch follows below)
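
A minimal numpy sketch of the iteration above (assign points to the nearest mean, recompute means, repeat until assignments stop changing); it is a sketch under my own naming, not the slides' code.

    import numpy as np

    def kmeans(points, k, max_iter=100, seed=0):
        """points: (n, d) array of data points. Returns (means, labels)."""
        rng = np.random.default_rng(seed)
        # Assign initial means: k distinct data points chosen at random
        means = points[rng.choice(len(points), size=k, replace=False)].astype(float)
        labels = None
        for _ in range(max_iter):
            # Assign each point to the cluster with the closest mean
            dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            if labels is not None and np.array_equal(new_labels, labels):
                break  # assignments no longer change: the criterion has converged
            labels = new_labels
            # Compute a new mean for each non-empty cluster
            for j in range(k):
                if np.any(labels == j):
                    means[j] = points[labels == j].mean(axis=0)
        return means, labels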

33
K-means example, step 1
Pick 3 initial cluster centers (randomly)
34
K-means example, step 2
Assign each point to the closest cluster center
35
K-means example, step 3
Move each cluster center to the mean of each
cluster
36
K-means example, step 4
Reassign points that are now closest to a different cluster center. Q: Which points are reassigned?
37
K-means example, step 4
A: three points (shown with animation on the original slide)
38
K-means example, step 4b
Re-compute cluster means
39
K-means example, step 5
Move cluster centers to cluster means
40
Discussion
  • Result can vary significantly depending on the initial choice of seeds
  • Can get trapped in a local minimum
  • Example
  • To increase the chance of finding the global optimum, restart with different random seeds

41
K-means clustering summary
  • Advantages
  • Simple, understandable
  • Items are automatically assigned to clusters
  • Disadvantages
  • Must pick the number of clusters beforehand
  • All items forced into a cluster
  • Too sensitive to outliers

42
Traditional Algorithms
  • Hierarchical clustering
  • Nested Partitions
  • Tree structure

43
Agglomerative Hierarchical Algorithms
  • The most widely used hierarchical clustering algorithms
  • Initially each point is a distinct cluster
  • Repeatedly merge closest clusters until the number of clusters becomes K
  • Closest can be measured by dmean(Ci, Cj) = ||mi - mj|| (distance between cluster means)
  • or by dmin(Ci, Cj) = min ||p - q|| over p in Ci, q in Cj
  • Likewise dave(Ci, Cj) (average pairwise distance) and dmax(Ci, Cj) (maximum pairwise distance); a single-linkage sketch follows below
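
A sketch of the merge loop (mine), using dmin (single linkage) as the closeness measure; the naive O(n³) search is fine for illustration.

    import numpy as np

    def agglomerative(points, k):
        """Single-linkage agglomerative clustering: start with every point as its
        own cluster and repeatedly merge the closest pair until k clusters remain.
        points: (n, d) array. Returns a list of k lists of point indices."""
        clusters = [[i] for i in range(len(points))]

        def d_min(ci, cj):
            # dmin(Ci, Cj): smallest pairwise distance between the two clusters
            return min(np.linalg.norm(points[a] - points[b]) for a in ci for b in cj)

        while len(clusters) > k:
            # Find and merge the closest pair of clusters
            i, j = min(((i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))),
                       key=lambda p: d_min(clusters[p[0]], clusters[p[1]]))
            clusters[i].extend(clusters[j])
            del clusters[j]
        return clusters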

44
Similar Time Sequences
  • Given
  • A set of time-series sequences
  • Find
  • All sequences similar to the query sequence
  • All pairs of similar sequences
  • whole matching vs. subsequence matching
  • Sample Applications
  • Financial market
  • Scientific databases
  • Medical Diagnosis

45
Whole Sequence Matching
  • Basic Idea
  • Extract k features from every sequence
  • Every sequence is then represented as a point in
    k-dimensional space
  • Use a multi-dimensional index to store and search
    these points
  • Spatial indices do not work well for high
    dimensional data

46
Similar Time Sequences
  • Take Euclidean distance as the similarity measure
  • Obtain Discrete Fourier Transform (DFT)
    coefficients of each sequence in the database
  • Build a multi-dimensional index using the first few Fourier coefficients
  • Use the index to retrieve sequences that are at most a given threshold distance away from the query sequence
  • Post-processing: compute the actual distance between sequences in the time domain (a sketch of the whole pipeline follows below)
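
A sketch of the whole-matching pipeline (mine): extract the first few DFT coefficients as features, filter candidates by distance in feature space, then verify with the true Euclidean distance in the time domain. With the 1/sqrt(n) normalisation, Parseval's theorem makes the truncated feature-space distance a lower bound on the true distance, so the filter causes no false dismissals. A brute-force scan stands in for the multi-dimensional index; sequences are assumed to have equal length.

    import numpy as np

    def dft_features(seq, n_coeffs=3):
        """First few DFT coefficients of a sequence, as a real feature vector."""
        coeffs = np.fft.fft(np.asarray(seq, dtype=float)) / np.sqrt(len(seq))
        f = coeffs[:n_coeffs]
        return np.concatenate([f.real, f.imag])

    def similar_sequences(database, query, epsilon, n_coeffs=3):
        """Return indices of sequences within Euclidean distance epsilon of query."""
        q_feat = dft_features(query, n_coeffs)
        results = []
        for i, seq in enumerate(database):
            # Filtering step: anything farther than epsilon in feature space
            # is farther than epsilon in the time domain, so it can be skipped.
            if np.linalg.norm(dft_features(seq, n_coeffs) - q_feat) > epsilon:
                continue
            # Post-processing: verify with the actual time-domain distance
            if np.linalg.norm(np.asarray(seq, float) - np.asarray(query, float)) <= epsilon:
                results.append(i)
        return results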

47
Outlier Discovery
  • Given
  • Data points and the number of outliers (n) to find
  • Find the top n outlier points
  • Outliers are considerably dissimilar from the remainder of the data
  • Sample applications
  • Credit card fraud detection
  • Telecom fraud detection
  • Medical analysis

48
Statistical Approaches
  • Model the underlying distribution that generates the data set (e.g., normal distribution)
  • Use discordancy tests depending on
  • data distribution
  • distribution parameters (e.g., mean, variance)
  • number of expected outliers
  • Drawbacks
  • Most tests are for a single attribute
  • In many cases, the data distribution may not be known

49
Distance-based Outliers
  • For a fraction p and a distance d, a point o is an outlier if at least a fraction p of the points lie at a distance greater than d from o
  • General enough to model statistical outlier tests
  • Develop nested-loop and cell-based algorithms (a nested-loop sketch follows below)
  • They scale reasonably well for large data sets
  • The cell-based algorithm does not scale well to high dimensions
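
A nested-loop sketch (mine) of this distance-based definition; the cell-based optimisation mentioned above is omitted.

    import numpy as np

    def distance_based_outliers(points, p, d):
        """points: (n, dim) array. Returns indices of points for which at least a
        fraction p of the remaining points lie at a distance greater than d."""
        n = len(points)
        outliers = []
        for i in range(n):  # simple O(n^2) nested-loop algorithm
            dists = np.linalg.norm(points - points[i], axis=1)
            # the point itself is at distance 0, so it never counts as "far"
            if np.count_nonzero(dists > d) / (n - 1) >= p:
                outliers.append(i)
        return outliers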