1
Data Mining in Large Databases
  • (Contributing slides by Gregory Piatetsky-Shapiro,
  • and
  • Rajeev Rastogi and Kyuseok Shim,
  • Lucent Bell Laboratories)

2
Overview
  • Introduction
  • Association Rules
  • Classification
  • Clustering

3
Background
  • Corporations have huge databases containing a
    wealth of information
  • Business databases potentially constitute a
    goldmine of valuable business information
  • Very little functionality in database systems to
    support data mining applications
  • Data mining: the efficient discovery of previously unknown patterns in large databases

4
Applications
  • Fraud Detection
  • Loan and Credit Approval
  • Market Basket Analysis
  • Customer Segmentation
  • Financial Applications
  • E-Commerce
  • Decision Support
  • Web Search

5
Data Mining Techniques
  • Association Rules
  • Sequential Patterns
  • Classification
  • Clustering
  • Similar Time Sequences
  • Similar Images
  • Outlier Discovery
  • Text/Web Mining

6
Examples of Patterns
  • Association rules
  • 98% of people who purchase diapers also buy beer
  • Classification
  • People with age less than 25 and salary > 40k drive sports cars
  • Similar time sequences
  • Stocks of companies A and B perform similarly
  • Outlier Detection
  • Residential customers with businesses at home

7
Association Rules
  • Given
  • A database of customer transactions
  • Each transaction is a set of items
  • Find all rules X => Y that correlate the presence of one set of items X with another set of items Y
  • Any number of items in the consequent or
    antecedent of a rule
  • Possible to specify constraints on rules (e.g.,
    find only rules involving expensive imported
    products)

8
Association Rules
  • Sample Applications
  • Market basket analysis
  • Attached mailing in direct marketing
  • Fraud detection for medical insurance
  • Department store floor/shelf planning

9
Confidence and Support
  • A rule must have some minimum user-specified
    confidence
  • {1, 2} => {3} has 90% confidence if, when a customer bought items 1 and 2, in 90% of cases the customer also bought item 3.
  • A rule must have some minimum user-specified
    support
  • {1, 2} => {3} should hold in some minimum percentage of transactions to have business value (a sketch of both measures follows below)
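
A minimal Python sketch (not from the slides) of the two measures, assuming each transaction is represented as a set of item ids; the four-transaction database is hypothetical, chosen to reproduce the numbers on the next slide.

    def support(itemset, transactions):
        """Fraction of transactions that contain every item in itemset."""
        itemset = set(itemset)
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def confidence(antecedent, consequent, transactions):
        """support(antecedent U consequent) / support(antecedent)."""
        both = set(antecedent) | set(consequent)
        return support(both, transactions) / support(antecedent, transactions)

    # Hypothetical transaction database
    transactions = [{1, 2, 3}, {1, 3}, {1, 4}, {2, 5, 6}]
    print(support({1, 3}, transactions))       # 0.5  -> 50% support
    print(confidence({1}, {3}, transactions))  # 0.67 -> 66% confidence
    print(confidence({3}, {1}, transactions))  # 1.0  -> 100% confidence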

10
Example
  • For minimum support 50% and minimum confidence 50%, we have the following rules
  • {1} => {3} with 50% support and 66% confidence
  • {3} => {1} with 50% support and 100% confidence

11
Problem Decomposition
  • 1. Find all sets of items that have minimum
    support
  • Use Apriori Algorithm
  • 2. Use the frequent itemsets to generate the desired rules
  • Generation is straightforward (a sketch follows below)
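
A sketch of the generation step (mine, not the slides' code): given the supports of all frequent itemsets, every split of an itemset into antecedent and consequent is checked against the confidence threshold. By the Apriori property, every antecedent of a frequent itemset is itself frequent, so its support is available.

    from itertools import combinations

    def generate_rules(supports, min_conf):
        """supports: dict mapping frozenset itemsets to their support fraction.
        Returns rules (X, Y, confidence) with X => Y and confidence >= min_conf."""
        rules = []
        for itemset in supports:
            if len(itemset) < 2:
                continue
            for r in range(1, len(itemset)):
                for ante in combinations(sorted(itemset), r):
                    ante = frozenset(ante)
                    conf = supports[itemset] / supports[ante]
                    if conf >= min_conf:
                        rules.append((set(ante), set(itemset - ante), conf))
        return rules

    # Supports consistent with the example slides: {1}: 75%, {3}: 50%, {1, 3}: 50%
    supports = {frozenset({1}): 0.75, frozenset({3}): 0.5, frozenset({1, 3}): 0.5}
    for x, y, c in generate_rules(supports, min_conf=0.5):
        print(x, "=>", y, round(c, 2))   # {1} => {3} 0.67  and  {3} => {1} 1.0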

12
Problem Decomposition - Example
For minimum support 50% and minimum confidence 50%:
  • For the rule {1} => {3}:
  • Support = Support({1, 3}) = 50%
  • Confidence = Support({1, 3}) / Support({1}) = 66%

13
The Apriori Algorithm
  • Fk : set of frequent itemsets of size k
  • Ck : set of candidate itemsets of size k
  • F1 = {large 1-itemsets}
  • for (k = 1; Fk != ∅; k++) do
  •     Ck+1 = new candidates generated from Fk
  •     foreach transaction t in the database do
  •         increment the count of all candidates in Ck+1 that are contained in t
  •     Fk+1 = candidates in Ck+1 with minimum support
  • Answer = ∪k Fk
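
A compact Python rendering of this loop, with the subset-based pruning from the next slide built into candidate generation. It is a sketch under my own naming, not the original implementation.

    from itertools import combinations

    def apriori(transactions, min_support):
        """transactions: list of sets of items. Returns {frozenset: support}."""
        n = len(transactions)
        # F1: frequent 1-itemsets
        counts = {}
        for t in transactions:
            for item in t:
                counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
        frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
        answer = dict(frequent)              # will accumulate the union of all Fk
        k = 1
        while frequent:
            # Generate C(k+1) from Fk; prune candidates with an infrequent k-subset
            prev = set(frequent)
            candidates = {a | b for a in prev for b in prev
                          if len(a | b) == k + 1
                          and all(frozenset(s) in prev
                                  for s in combinations(a | b, k))}
            # One pass over the database to count the surviving candidates
            counts = {c: 0 for c in candidates}
            for t in transactions:
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            frequent = {c: cnt / n for c, cnt in counts.items()
                        if cnt / n >= min_support}
            answer.update(frequent)          # Answer = union of all Fk
            k += 1
        return answer

On the hypothetical four-transaction database used in the earlier sketch, apriori(transactions, 0.5) returns supports for {1}, {2}, {3} and {1, 3}.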

14
Key Observation
  • Every subset of a frequent itemset is also frequent => a candidate itemset in Ck+1 can be pruned if even one of its subsets is not contained in Fk

15
Apriori - Example
(Figure: the example database D is scanned once to count the candidate 1-itemsets C1 and obtain F1; candidates C2 are generated from F1, D is scanned again to count them, yielding the frequent 2-itemsets F2.)
16
Sequential Patterns
  • Given
  • A sequence of customer transactions
  • Each transaction is a set of items
  • Find all maximal sequential patterns supported by
    more than a user-specified percentage of
    customers
  • Example: 10% of customers who bought a PC did a memory upgrade in a subsequent transaction

17
Classification
  • Given
  • Database of tuples, each assigned a class label
  • Develop a model/profile for each class
  • Example profile (good credit):
  • (25 < age < 40 and income > 40k) or (married = YES)
  • Sample applications
  • Credit card approval (good, bad)
  • Bank locations (good, fair, poor)
  • Treatment effectiveness (good, fair, poor)

18
Decision Tree
  • An internal node is a test on an attribute.
  • A branch represents an outcome of the test, e.g., Color = red.
  • A leaf node represents a class label or class
    label distribution.
  • At each node, one attribute is chosen to split
    training examples into distinct classes as much
    as possible
  • A new case is classified by following a matching path to a leaf node (a small sketch follows the example tree below).

19
Decision Trees
20
Example Tree
    Outlook = sunny:
        Humidity = high: No
        Humidity = normal: Yes
    Outlook = overcast: Yes
    Outlook = rain:
        Windy = true: No
        Windy = false: Yes
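
A small sketch (mine) of how a new case is classified by following a matching path through the example tree above; the nested-dict encoding is just one possible representation.

    # The example tree: internal nodes test an attribute, leaves are class labels.
    tree = {"Outlook": {
        "sunny":    {"Humidity": {"high": "No", "normal": "Yes"}},
        "overcast": "Yes",
        "rain":     {"Windy": {"true": "No", "false": "Yes"}},
    }}

    def classify(node, case):
        """Follow the branch matching the case's attribute value until a leaf."""
        while isinstance(node, dict):
            attribute, branches = next(iter(node.items()))
            node = branches[case[attribute]]
        return node

    print(classify(tree, {"Outlook": "sunny", "Humidity": "normal"}))  # Yes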
21
Decision Tree Algorithms
  • Building phase
  • Recursively split nodes using best splitting
    attribute for node
  • Pruning phase
  • A smaller, imperfect decision tree generally achieves better accuracy on unseen data
  • Prune leaf nodes recursively to prevent
    over-fitting

22
Attribute Selection
  • Which is the best attribute?
  • The one which will result in the smallest tree
  • Heuristic: choose the attribute that produces the purest nodes
  • Popular impurity criterion: information gain
  • Information gain increases with the average purity of the subsets that an attribute produces
  • Strategy: choose the attribute that results in the greatest information gain

23
Which attribute to select?
24
Computing information
  • Information is measured in bits
  • Given a probability distribution, the info required to predict an event is the distribution's entropy
  • Entropy gives the information required in bits (this can involve fractions of bits!)
  • Formula for computing the entropy: entropy(p1, ..., pn) = -p1 log2(p1) - p2 log2(p2) - ... - pn log2(pn)

25
Example attribute Outlook
  • Information for Outlook = Sunny
  • Information for Outlook = Overcast
  • Information for Outlook = Rainy
  • Expected information for the attribute is the weighted average of the values above

26
Computing the information gain
  • Information gain = (information before split) - (information after split)
  • Among the weather-data attributes, Outlook yields the greatest information gain, so it becomes the root of the tree (a sketch of the computation follows below)
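
A sketch of both computations in Python (mine, not the slides' code); the five-row table is a hypothetical stand-in for the weather data, which is not reproduced in this transcript.

    from math import log2

    def entropy(counts):
        """entropy(p1, ..., pn) = -sum(pi * log2 pi) over non-zero class fractions."""
        total = sum(counts)
        return -sum(c / total * log2(c / total) for c in counts if c)

    def information_gain(rows, attribute, target="Play"):
        """(information before split) - (weighted information after the split)."""
        def info(subset):
            labels = [r[target] for r in subset]
            return entropy([labels.count(v) for v in set(labels)])
        before = info(rows)
        after = 0.0
        for value in set(r[attribute] for r in rows):
            part = [r for r in rows if r[attribute] == value]
            after += len(part) / len(rows) * info(part)
        return before - after

    # Hypothetical rows in the spirit of the weather data
    rows = [{"Outlook": "sunny", "Play": "No"}, {"Outlook": "sunny", "Play": "No"},
            {"Outlook": "overcast", "Play": "Yes"}, {"Outlook": "rain", "Play": "Yes"},
            {"Outlook": "rain", "Play": "No"}]
    print(round(information_gain(rows, "Outlook"), 3))   # 0.571 bits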

27
Continuing to split
28
The final decision tree
  • Note: not all leaves need to be pure; sometimes identical instances have different classes
  • Splitting stops when the data cannot be split any further

29
Decision Trees
  • Pros
  • Fast execution time
  • Generated rules are easy to interpret by humans
  • Scale well for large data sets
  • Can handle high dimensional data
  • Cons
  • Cannot capture correlations among attributes
  • Consider only axis-parallel cuts

30
Clustering
  • Given
  • Data points and number of desired clusters K
  • Group the data points into K clusters
  • Data points within clusters are more similar than
    across clusters
  • Sample applications
  • Customer segmentation
  • Market basket customer analysis
  • Attached mailing in direct marketing
  • Clustering companies with similar growth

31
Traditional Algorithms
  • Partitional algorithms
  • Enumerate K partitions optimizing some criterion
  • Example: the square-error criterion E = Σi Σ_{p ∈ Ci} ||p - mi||², where mi is the mean of cluster Ci (a short sketch follows below)
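
A one-function numpy sketch of the square-error criterion (names are mine):

    import numpy as np

    def square_error(clusters):
        """E = sum_i sum_{p in Ci} ||p - mi||^2, where mi is the mean of cluster Ci.
        clusters: list of (n_i x d) arrays, one array of points per cluster."""
        return sum(((pts - pts.mean(axis=0)) ** 2).sum() for pts in clusters)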

32
K-means Algorithm
  • Assign initial means
  • Assign each point to the cluster with the closest mean
  • Compute a new mean for each cluster
  • Iterate until the criterion function converges (a minimal sketch follows below)
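
A minimal numpy sketch of the iteration above (assign points to the nearest mean, recompute means, repeat until assignments stop changing); it is a sketch under my own naming, not the slides' code.

    import numpy as np

    def kmeans(points, k, max_iter=100, seed=0):
        """points: (n, d) array of data points. Returns (means, labels)."""
        rng = np.random.default_rng(seed)
        # Assign initial means: k distinct data points chosen at random
        means = points[rng.choice(len(points), size=k, replace=False)].astype(float)
        labels = None
        for _ in range(max_iter):
            # Assign each point to the cluster with the closest mean
            dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            if labels is not None and np.array_equal(new_labels, labels):
                break  # assignments no longer change: the criterion has converged
            labels = new_labels
            # Compute a new mean for each non-empty cluster
            for j in range(k):
                if np.any(labels == j):
                    means[j] = points[labels == j].mean(axis=0)
        return means, labels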

33
K-means example, step 1
Pick 3 initial cluster centers (randomly)
34
K-means example, step 2
Assign each point to the closest cluster center
35
K-means example, step 3
Move each cluster center to the mean of each
cluster
36
K-means example, step 4
Reassign points that are now closest to a different cluster center. Q: Which points are reassigned?
37
K-means example, step 4
A: three points (shown with animation on the original slide)
38
K-means example, step 4b
Re-compute cluster means
39
K-means example, step 5
Move cluster centers to cluster means
40
Discussion
  • Result can vary significantly depending on the initial choice of seeds
  • Can get trapped in a local minimum
  • Example
  • To increase the chance of finding the global optimum, restart with different random seeds

41
K-means clustering summary
  • Advantages
  • Simple, understandable
  • Items are automatically assigned to clusters
  • Disadvantages
  • Must pick the number of clusters beforehand
  • All items forced into a cluster
  • Too sensitive to outliers

42
Traditional Algorithms
  • Hierarchical clustering
  • Nested Partitions
  • Tree structure

43
Agglomerative Hierarchical Algorithms
  • The most widely used hierarchical clustering algorithms
  • Initially each point is a distinct cluster
  • Repeatedly merge closest clusters until the number of clusters becomes K
  • Closest can be measured by dmean(Ci, Cj) = ||mi - mj|| (distance between cluster means)
  • or by dmin(Ci, Cj) = min ||p - q|| over p in Ci, q in Cj
  • Likewise dave(Ci, Cj) (average pairwise distance) and dmax(Ci, Cj) (maximum pairwise distance); a single-linkage sketch follows below
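
A sketch of the merge loop (mine), using dmin (single linkage) as the closeness measure; the naive O(n³) search is fine for illustration.

    import numpy as np

    def agglomerative(points, k):
        """Single-linkage agglomerative clustering: start with every point as its
        own cluster and repeatedly merge the closest pair until k clusters remain.
        points: (n, d) array. Returns a list of k lists of point indices."""
        clusters = [[i] for i in range(len(points))]

        def d_min(ci, cj):
            # dmin(Ci, Cj): smallest pairwise distance between the two clusters
            return min(np.linalg.norm(points[a] - points[b]) for a in ci for b in cj)

        while len(clusters) > k:
            # Find and merge the closest pair of clusters
            i, j = min(((i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))),
                       key=lambda p: d_min(clusters[p[0]], clusters[p[1]]))
            clusters[i].extend(clusters[j])
            del clusters[j]
        return clusters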

44
Similar Time Sequences
  • Given
  • A set of time-series sequences
  • Find
  • All sequences similar to the query sequence
  • All pairs of similar sequences
  • whole matching vs. subsequence matching
  • Sample Applications
  • Financial market
  • Scientific databases
  • Medical Diagnosis

45
Whole Sequence Matching
  • Basic Idea
  • Extract k features from every sequence
  • Every sequence is then represented as a point in
    k-dimensional space
  • Use a multi-dimensional index to store and search
    these points
  • Spatial indices do not work well for high
    dimensional data

46
Similar Time Sequences
  • Take Euclidean distance as the similarity measure
  • Obtain Discrete Fourier Transform (DFT)
    coefficients of each sequence in the database
  • Build a multi-dimensional index using the first few Fourier coefficients
  • Use the index to retrieve sequences that are at most a given threshold distance away from the query sequence
  • Post-processing: compute the actual distance between sequences in the time domain (a sketch of the whole pipeline follows below)
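
A sketch of the whole-matching pipeline (mine): extract the first few DFT coefficients as features, filter candidates by distance in feature space, then verify with the true Euclidean distance in the time domain. With the 1/sqrt(n) normalisation, Parseval's theorem makes the truncated feature-space distance a lower bound on the true distance, so the filter causes no false dismissals. A brute-force scan stands in for the multi-dimensional index; sequences are assumed to have equal length.

    import numpy as np

    def dft_features(seq, n_coeffs=3):
        """First few DFT coefficients of a sequence, as a real feature vector."""
        coeffs = np.fft.fft(np.asarray(seq, dtype=float)) / np.sqrt(len(seq))
        f = coeffs[:n_coeffs]
        return np.concatenate([f.real, f.imag])

    def similar_sequences(database, query, epsilon, n_coeffs=3):
        """Return indices of sequences within Euclidean distance epsilon of query."""
        q_feat = dft_features(query, n_coeffs)
        results = []
        for i, seq in enumerate(database):
            # Filtering step: anything farther than epsilon in feature space
            # is farther than epsilon in the time domain, so it can be skipped.
            if np.linalg.norm(dft_features(seq, n_coeffs) - q_feat) > epsilon:
                continue
            # Post-processing: verify with the actual time-domain distance
            if np.linalg.norm(np.asarray(seq, float) - np.asarray(query, float)) <= epsilon:
                results.append(i)
        return results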

47
Outlier Discovery
  • Given
  • Data points and the number of outliers (n) to find
  • Find the top n outlier points
  • Outliers are considerably dissimilar from the remainder of the data
  • Sample applications
  • Credit card fraud detection
  • Telecom fraud detection
  • Medical analysis

48
Statistical Approaches
  • Model the underlying distribution that generates the data set (e.g., normal distribution)
  • Use discordancy tests depending on
  • data distribution
  • distribution parameters (e.g., mean, variance)
  • number of expected outliers
  • Drawbacks
  • Most tests are for a single attribute
  • In many cases, the data distribution may not be known

49
Distance-based Outliers
  • For a fraction p and a distance d, a point o is an outlier if at least a fraction p of the points lie at a distance greater than d from o
  • General enough to model statistical outlier tests
  • Develop nested-loop and cell-based algorithms (a nested-loop sketch follows below)
  • They scale reasonably well for large data sets
  • The cell-based algorithm does not scale well to high dimensions
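
A nested-loop sketch (mine) of this distance-based definition; the cell-based optimisation mentioned above is omitted.

    import numpy as np

    def distance_based_outliers(points, p, d):
        """points: (n, dim) array. Returns indices of points for which at least a
        fraction p of the remaining points lie at a distance greater than d."""
        n = len(points)
        outliers = []
        for i in range(n):  # simple O(n^2) nested-loop algorithm
            dists = np.linalg.norm(points - points[i], axis=1)
            # the point itself is at distance 0, so it never counts as "far"
            if np.count_nonzero(dists > d) / (n - 1) >= p:
                outliers.append(i)
        return outliers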