Title: Introduction to Association Analysis
Introduction to Association Analysis
- Zhangxi Lin
- ISQS 3358
- Texas Tech University
Outline
- Basic concepts
- Itemset generation - Apriori principle
- Association rule discovery and generation
- Evaluation of association patterns
- Sequential pattern analysis
Basic Concepts
Association Rule Mining
- Given a set of transactions, find rules that will
predict the occurrence of an item based on the
occurrences of other items in the transaction
Market-Basket transactions
Example of Association Rules
{Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
Definition: Frequent Itemset
- Itemset
- A collection of one or more items
- Example: {Milk, Bread, Diaper}
- k-itemset
- An itemset that contains k items
- Support count (σ)
- Frequency of occurrence of an itemset
- E.g., σ({Milk, Bread, Diaper}) = 2
- Support
- Fraction of transactions that contain an itemset
- E.g., s({Milk, Bread, Diaper}) = 2/5 (computed in
the sketch after this slide)
- Frequent Itemset
- An itemset whose support is greater than or equal
to a minsup threshold
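As a concrete check of these definitions, here is a minimal Python sketch. The five transactions are an assumption: the slide's market-basket table did not survive extraction, so the standard example data consistent with the σ = 2 and s = 2/5 figures above is used.

```python
# Assumed market-basket transactions (the slide's table was lost;
# these are the standard example data matching sigma = 2, s = 2/5)
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

itemset = {"Milk", "Bread", "Diaper"}
sigma = sum(itemset <= t for t in transactions)  # support count: 2
s = sigma / len(transactions)                    # support: 2/5 = 0.4
print(sigma, s)
```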
Definition: Association Rule
- Association Rule
- An implication expression of the form X → Y,
where X and Y are itemsets
- Example: {Milk, Diaper} → {Beer}
- Rule Evaluation Metrics
- Support (s)
- Fraction of transactions that contain both X and Y
- Confidence (c)
- Measures how often items in Y appear in
transactions that contain X
- Count (CKG, SVG) = 1
- Support = 1/5 = 20%
- Count (CKG) = 3
- Confidence = 1/3 = 0.33
- Count (CKG) = 2
- Count (CKG, SVG) = 2
- Confidence (CKG, SVG) = 2/2 = 100%
Formal Definitions
- Support: s(X → Y) = σ(X ∪ Y) / |T|, where |T| is the
number of transactions
- Confidence: c(X → Y) = σ(X ∪ Y) / σ(X)
(both are sketched in code below)
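A minimal sketch of both formulas, reusing the assumed transactions from the previous sketch; for {Milk, Diaper} → {Beer} it reproduces s = 0.4 and c ≈ 0.67, matching the rule list a few slides below.

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(X):
    """s(X) = sigma(X) / |T|: fraction of transactions containing X."""
    return sum(X <= t for t in transactions) / len(transactions)

def rule_metrics(X, Y):
    """(s, c) of X -> Y: s = sigma(X u Y) / |T|, c = sigma(X u Y) / sigma(X)."""
    s = support(X | Y)
    return s, s / support(X)

print(rule_metrics({"Milk", "Diaper"}, {"Beer"}))  # (0.4, 0.666...)
```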
Itemset generation - Apriori principle
Association Rule Mining Task
- Given a set of transactions T, the goal of
association rule mining is to find all rules having
- support ≥ minsup threshold
- confidence ≥ minconf threshold
- Brute-force approach
- List all possible association rules
- Compute the support and confidence for each rule
- Prune rules that fail the minsup and minconf
thresholds
- ⇒ Computationally prohibitive! (see the count below)
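To put a number on this: each of the d items can fall in the antecedent, the consequent, or neither side of a rule, giving R = 3^d − 2^(d+1) + 1 possible rules. This counting formula is a standard result added here as an aside; it is not stated on the slide.

```python
# Number of possible association rules over d items (standard result)
d = 6
R = 3**d - 2**(d + 1) + 1
print(R)  # 602 rules for just six items
```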
Mining Association Rules
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
- Observations
- All the above rules are binary partitions of the
same itemset: {Milk, Diaper, Beer}
- Rules originating from the same itemset have
identical support but can have different confidence
- Thus, we may decouple the support and confidence
requirements
Mining Association Rules
- Two-step approach
- Frequent Itemset Generation
- Generate all itemsets whose support ≥ minsup
- Rule Generation
- Generate high-confidence rules from each frequent
itemset, where each rule is a binary partitioning
of a frequent itemset
- Frequent itemset generation is still
computationally expensive
Frequent Itemset Generation
Given d items, there are 2^d possible candidate
itemsets
Frequent Itemset Generation
- Brute-force approach
- Each itemset in the lattice is a candidate
frequent itemset
- Count the support of each candidate by scanning
the database
- Match each transaction against every candidate
- Complexity ~ O(NMw) ⇒ expensive, since M = 2^d!
Frequent Itemset Generation Strategies
- Reduce the number of candidates (M)
- Complete search: M = 2^d
- Use pruning techniques to reduce M
- Reduce the number of transactions (N)
- Reduce the size of N as the size of the itemset
increases
- Used by DHP and vertical-based mining algorithms
- Reduce the number of comparisons (NM)
- Use efficient data structures to store the
candidates or transactions
- No need to match every candidate against every
transaction
Reducing Number of Candidates
- Apriori principle
- If an itemset is frequent, then all of its
subsets must also be frequent
- The Apriori principle holds due to the following
property of the support measure:
- The support of an itemset never exceeds the
support of its subsets
- This is known as the anti-monotone property of
support
Illustrating Apriori Principle
(Lattice figure: an itemset found to be frequent, so by the
Apriori principle all of its subsets must be frequent as well)
Illustrating Apriori Principle
Apriori Algorithm
- Method (sketched in code after this list)
- Let k = 1
- Generate frequent itemsets of length 1
- Repeat until no new frequent itemsets are
identified
- Generate length-(k+1) candidate itemsets from
length-k frequent itemsets
- Prune candidate itemsets containing subsets of
length k that are infrequent
- Count the support of each candidate by scanning
the DB
- Eliminate candidates that are infrequent, leaving
only those that are frequent
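A runnable Python sketch of this loop, offered as an illustration rather than the course's reference implementation. Candidates are formed by unioning pairs of frequent k-itemsets and pruned when any length-k subset is infrequent:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frozenset: support} for all itemsets with support >= minsup."""
    N = len(transactions)

    def frequent(candidates):
        # Count support of each candidate by scanning the DB
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: n / N for c, n in counts.items() if n / N >= minsup}

    # k = 1: frequent itemsets of length 1
    level = frequent({frozenset([i]) for t in transactions for i in t})
    result, k = dict(level), 1
    while level:
        prev = set(level)
        # Generate length-(k+1) candidates from length-k frequent itemsets
        cands = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune candidates that contain an infrequent length-k subset
        cands = {c for c in cands
                 if all(frozenset(s) in prev for s in combinations(c, k))}
        level = frequent(cands)  # eliminate infrequent candidates
        result.update(level)
        k += 1
    return result
```

With the five assumed transactions from the earlier sketches and minsup = 0.6, this returns four frequent 1-itemsets and four frequent 2-itemsets.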
Association rule discovery and generation
Reducing Number of Comparisons
- Candidate counting
- Scan the database of transactions to determine
the support of each candidate itemset
- To reduce the number of comparisons, store the
candidates in a hash structure
- Instead of matching each transaction against
every candidate, match it against candidates
contained in the hashed buckets (see the sketch below)
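The full hash tree is beyond the scope of a sketch, but a flat hash table already demonstrates the idea (a simplified stand-in, not the actual hash-tree or DHP structure): rather than testing a transaction against all M candidates, hash each of the transaction's own k-subsets and probe the table.

```python
from itertools import combinations

def count_support(transactions, candidates, k):
    """Count k-item candidates by probing a hash table with each
    transaction's own k-subsets, instead of comparing every candidate."""
    counts = {c: 0 for c in candidates}          # candidates in a hash table
    for t in transactions:
        for sub in combinations(sorted(t), k):   # k-subsets of the transaction
            key = frozenset(sub)
            if key in counts:                    # O(1) expected probe
                counts[key] += 1
    return counts
```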
Factors Affecting Complexity
- Choice of minimum support threshold
- Lowering the support threshold results in more
frequent itemsets
- This may increase the number of candidates and the
max length of frequent itemsets
- Dimensionality (number of items) of the data set
- More space is needed to store the support count of
each item
- If the number of frequent items also increases, both
computation and I/O costs may increase
- Size of database
- Since Apriori makes multiple passes, the run time of
the algorithm may increase with the number of
transactions
- Average transaction width
- Transaction width increases with denser data sets
- This may increase the max length of frequent itemsets
and traversals of the hash tree (the number of subsets
in a transaction increases with its width)
Rule Generation
- Given a frequent itemset L, find all non-empty
subsets f ⊂ L such that f → L - f satisfies the
minimum confidence requirement
- If {A,B,C,D} is a frequent itemset, the candidate
rules are:
ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
- If |L| = k, then there are 2^k - 2 candidate
association rules (ignoring L → ∅ and ∅ → L), as
enumerated below
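A quick enumeration confirming the count for L = {A, B, C, D}:

```python
from itertools import combinations

L = set("ABCD")
rules = [(set(f), L - set(f))
         for r in range(1, len(L))             # antecedent sizes 1..k-1
         for f in combinations(sorted(L), r)]  # non-empty proper subsets
print(len(rules))  # 14 = 2**4 - 2
```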
Rule Generation
- How to efficiently generate rules from frequent
itemsets?
- In general, confidence does not have an
anti-monotone property
- c(ABC → D) can be larger or smaller than c(AB → D)
- But the confidence of rules generated from the same
itemset does have an anti-monotone property
- E.g., for L = {A,B,C,D}:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
- Confidence is anti-monotone w.r.t. the number of
items on the RHS of the rule
Rule Generation for Apriori Algorithm
(Figure: lattice of rules generated from one frequent itemset;
once a low-confidence rule is found, the rules below it in the
lattice are pruned)
Rule Generation for Apriori Algorithm
- A candidate rule is generated by merging two rules
that share the same prefix in the rule consequent
- join(CD → AB, BD → AC) would produce the candidate
rule D → ABC
- Prune rule D → ABC if its subset AD → BC does not
have high confidence (a sketch follows)
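A sketch of this level-wise rule generation for a single frequent itemset (illustrative only; consequents grow one item per level, and the subset-based prune is folded into the confidence check, so a failing consequent stops seeding larger ones). The `support` mapping is assumed to contain L and all of its subsets, e.g. as produced by the earlier `apriori` sketch.

```python
def gen_rules(L, support, minconf):
    """High-confidence rules (X, Y, conf), X = L - Y, from frequent itemset L."""
    L = frozenset(L)
    rules, k = [], 1
    level = {frozenset([i]) for i in L}          # 1-item consequents
    while level and k < len(L):
        survivors = set()
        for Y in level:
            conf = support[L] / support[L - Y]   # c((L - Y) -> Y)
            if conf >= minconf:
                rules.append((L - Y, Y, conf))
                survivors.add(Y)
            # else: by anti-monotonicity, any larger consequent
            # containing Y would fail as well
        k += 1
        # Merge surviving consequents into k-item candidate consequents
        level = {a | b for a in survivors for b in survivors
                 if len(a | b) == k}
    return rules
```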
Demonstration
- A bank wants to examine its customer base and
understand which of its products individual
customers own in combination with one another. It
has chosen to conduct a market-basket analysis of
a sample of its customer base. The bank has a
data set that lists the banking products/services
used by 7,991 customers.
- Data set: BANK
- Variables
- ACCT: ID, Nominal, Account Number
- SERVICE: Target, Nominal, Type of Service
- VISIT: Sequence, Ordinal, Order of Product
Purchase
Evaluation of association patterns
Contingency Table
(Figure: 2×2 contingency table for a rule X → Y, counting
transactions by the presence or absence of X and Y)
Statistical Independence
- Population of 1000 students
- 600 students know how to swim (S)
- 700 students know how to bike (B)
- 420 students know how to swim and bike (S, B)
- P(S ∧ B) = 420/1000 = 0.42
- P(S) × P(B) = 0.6 × 0.7 = 0.42
- P(S ∧ B) = P(S) × P(B) ⇒ statistical independence
- P(S ∧ B) > P(S) × P(B) ⇒ positively correlated
- P(S ∧ B) < P(S) × P(B) ⇒ negatively correlated
Statistical-based Measures
- Measures that take into account statistical
dependence, e.g.,
- Lift = P(Y|X) / P(Y)
- Interest = P(X,Y) / (P(X) P(Y))
Example: Lift/Interest Contingency Table

         Coffee   No Coffee   Total
Tea        15         5         20
No Tea     75         5         80
Total      90        10        100

- Association Rule: Tea → Coffee
- Confidence = P(Coffee | Tea) = 15/20 = 0.75
- but P(Coffee) = 0.9
- Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and
Coffee are negatively associated; checked in code below)
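The same arithmetic as a one-off check, using the counts from the table above:

```python
n_tea_coffee, n_tea, n_coffee, n_total = 15, 20, 90, 100

confidence = n_tea_coffee / n_tea            # P(Coffee | Tea) = 0.75
lift = confidence / (n_coffee / n_total)     # 0.75 / 0.9 = 0.833 < 1
print(confidence, lift)
```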
Compared to Confusion Matrix

             Computed Yes   Computed No   Total
Actual Yes        15             5          20
Actual No         75             5          80
Total             90            10         100

In classification, we are interested in P(Actual Yes |
Computed Yes), i.e., P(Row | Column). In association
analysis, we are interested in P(Column | Row).
Sequential pattern analysis
Examples of Sequence Data

Sequence Database | Sequence | Element (Transaction) | Event (Item)
Customer | Purchase history of a given customer | A set of items bought by a customer at time t | Books, dairy products, CDs, etc.
Web Data | Browsing activity of a particular Web visitor | A collection of files viewed by a Web visitor after a single mouse click | Home page, index page, contact info, etc.
Event data | History of events generated by a given sensor | Events triggered by a sensor at time t | Types of alarms generated by sensors
Genome sequences | DNA sequence of a particular species | An element of the DNA sequence | Bases A, T, G, C

(Figure: a sequence drawn as a timeline of elements
(transactions), each holding one or more events (items),
e.g. < {E1,E2} {E1,E3} {E2} {E3,E4} {E2} >)
Examples of Sequence
- Web sequence:
< {Homepage} {Electronics} {Digital Cameras} {Canon
Digital Camera} {Shopping Cart} {Order Confirmation}
{Return to Shopping} >
- Sequence of initiating events causing the nuclear
accident at 3-Mile Island (http://stellar-one.com/nuclear/staff_reports/summary_SOE_the_initiating_event.htm):
< {clogged resin} {outlet valve closure} {loss of
feedwater} {condenser polisher outlet valve shut}
{booster pumps trip} {main waterpump trips} {main
turbine trips} {reactor pressure increases} >
- Sequence of books checked out at a library:
< {Fellowship of the Ring} {The Two Towers}
{Return of the King} >
Sequential Pattern Mining: Definition
- Given:
- a database of sequences
- a user-specified minimum support threshold, minsup
- Task:
- Find all subsequences with support ≥ minsup
Sequential Pattern Mining: Example
Minsup = 50%
Examples of Frequent Subsequences:
< {1,2} >        s=60%
< {2,3} >        s=60%
< {2,4} >        s=80%
< {3} {5} >      s=80%
< {1} {2} >      s=80%
< {2} {2} >      s=60%
< {1} {2,3} >    s=60%
< {2} {2,3} >    s=60%
< {1,2} {2,3} >  s=60%
(a containment-check sketch follows)
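A sketch of the basic operation behind these numbers: testing whether a candidate subsequence is contained in a data sequence, then counting support over the database. The database below is hypothetical (the slide's actual sequence table did not survive extraction), so the printed support is illustrative only.

```python
def contains(seq, sub):
    """True if `sub` is a subsequence of `seq`: each element of `sub`
    is matched, in order, to a distinct element of `seq` that contains it."""
    i = 0
    for element in seq:
        if i < len(sub) and sub[i] <= element:
            i += 1
    return i == len(sub)

# Hypothetical sequence database: each sequence is a list of element sets
db = [
    [{1, 2}, {3}, {5}],
    [{1}, {2, 3}, {4}],
    [{1, 2}, {2, 3}, {5}],
]

sub = [{1}, {2, 3}]  # "1, then 2 and 3 together"
print(sum(contains(s, sub) for s in db) / len(db))  # 2/3 in this toy DB
```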