Title: Mining surprising patterns using temporal description length
1. Mining surprising patterns using temporal description length
- Soumen Chakrabarti (IIT Bombay), Sunita Sarawagi (IIT Bombay), Byron Dom (IBM Almaden)
2. Market basket mining algorithms
- Find prevalent rules that hold over large fractions of the data
- Useful for promotions and store arrangement
- Intensively researched
[Figure, 1990: a mining program announces "Milk and cereal sell together!"]
3. Prevalent ≠ Interesting
- Analysts already know about prevalent rules
- Interesting rules are those that deviate from prior expectation
- Mining's payoff is in finding surprising phenomena
[Figure, 1995: the same discovery, "Milk and cereal sell together!", is announced again and is no longer news]
4. What makes a rule surprising?
- Does not match prior expectation
- Correlation between milk and cereal remains roughly constant over time
- Cannot be trivially derived from simpler rules
- Milk: 10%, cereal: 10%
- Milk and cereal: 10% is surprising (independence predicts 1%)
- Eggs: 10%
- Milk, cereal and eggs: 0.1% is surprising! (expected 1%)
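The independence arithmetic above can be sketched as follows (a minimal illustration; the function name is mine, not from the talk):

```python
def independence_expectation(marginals):
    """Expected joint support if the items sold independently."""
    expected = 1.0
    for p in marginals:
        expected *= p
    return expected

# Milk and cereal each appear in 10% of baskets; independence
# predicts a joint support of 1%, so an observed 10% is surprising.
pair = independence_expectation([0.10, 0.10])   # about 0.01, i.e. 1%

# With eggs also at 10%, using the *observed* pair support of 10%
# predicts 1% for the triple; an observed 0.1% is surprisingly low.
triple = 0.10 * 0.10   # observed Pr[milk & cereal] x Pr[eggs]
```

Surprise runs in both directions: the pair is 10x above its expectation, the triple 10x below.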
5. Two views on data mining
[Diagram: two pipelines. In the first, Data feeds a Mining Program, which produces a Discovery. In the second, Data plus a Model of the Analyst's Knowledge of the Data feed the Mining Program, and the Discovery goes back to the Analyst.]
6. Our contributions
- A new notion of surprising patterns
- Detect changes in correlation along time
- Filter out steady, uninteresting correlations
- Algorithms to mine for surprising patterns
- Encode data into bit streams using two models
- Surprise = difference in the number of bits needed
- Experimental results
- Demonstrate superiority over prevalent patterns
7. A simpler problem: one item
- Milk-buying habits modeled by a biased coin
- Customer tosses this coin to decide whether to buy milk
- Head (1) denotes that the basket contains milk
- Coin bias is Pr[milk]
- Analyst wants to study Pr[milk] along time
- A single coin with fixed bias is not interesting
- Changes in bias are interesting
8. The coin segmentation problem
- Players A and B
- A has a set of coins with different biases
- A repeatedly:
- Picks an arbitrary coin
- Tosses it an arbitrary number of times
- B observes the head/tail sequence and
- Guesses the transition points and biases
9. How to explain the data
- Given n head/tail observations
- Can assume n different coins with bias 0 or 1
- Data fits perfectly (with probability one)
- Many coins needed
- Or assume one coin
- May fit data poorly
- Best explanation is a compromise
[Figure: the toss sequence split into segments with biases 5/7, 1/3 and 1/4]
10. Coding examples
- Sequence of k zeroes
- Naïve encoding takes k bits
- Run-length encoding takes about log k bits
- 1000 bits, ten randomly placed 1s, the rest 0s
- Posit a coin with bias 0.01
- Data encoding cost (Shannon's theorem) is about 1000 × H(0.01) ≈ 81 bits
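The Shannon bound for the 1000-bit example can be checked numerically; a minimal sketch (helper name is mine):

```python
import math

def entropy_bits(p):
    """Binary entropy H(p): bits per symbol for a coin of bias p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# 1000 bits with ten randomly placed 1s, encoded with a bias-0.01 coin:
cost = 1000 * entropy_bits(0.01)
print(round(cost))   # prints 81: far below the naive 1000 bits
```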
11. How to find optimal segments
- Sequence of 17 tosses
- Derived graph with 18 nodes; an edge from node i to node j is a candidate segment
- Data cost for Pr[head] = 5/7: segment with 5 heads and 2 tails
- Edge cost = model cost + data cost
- Model cost = one node ID + one Pr[head]
12. Approximate shortest path
- Suppose there are T tosses
- Make T^(1-ε) chunks, each with T^ε nodes (tune ε)
- Find shortest paths within chunks
- Some nodes are chosen in each chunk
- Solve a shortest path problem over all chosen nodes
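An exact (quadratic-in-T) version of the shortest-path segmentation can be sketched as below. The per-segment model charge is a stand-in constant and all names are illustrative, not the talk's implementation:

```python
import math

def data_cost(heads, tails):
    """Shannon cost in bits of a segment under its MLE bias."""
    n = heads + tails
    cost = 0.0
    for k in (heads, tails):
        if k:
            cost -= k * math.log2(k / n)
    return cost

def segment(tosses, model_cost=8.0):
    """Shortest path on the derived graph: the edge (i, j) costs
    model_cost plus the data cost of encoding tosses[i:j]."""
    T = len(tosses)
    best = [math.inf] * (T + 1)   # best[j]: cheapest encoding of prefix j
    best[0] = 0.0
    cut = [0] * (T + 1)
    for j in range(1, T + 1):
        for i in range(j):
            h = sum(tosses[i:j])
            c = best[i] + model_cost + data_cost(h, (j - i) - h)
            if c < best[j]:
                best[j], cut[j] = c, i
    bounds, j = [], T             # walk back to recover the segments
    while j > 0:
        bounds.append((cut[j], j))
        j = cut[j]
    return best[T], bounds[::-1]

# A clear bias change: 20 heads then 20 tails splits into two segments.
cost, segs = segment([1] * 20 + [0] * 20)
```

The chunked approximation on this slide trades the O(T^2) edge set of this sketch for shortest paths inside chunks plus one path over the chosen nodes.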
13. Two or more items
- Unconstrained segmentation
- k items induce a 2^k-sided coin
- Faces: milk and cereal = 11; milk but not cereal = 10; neither = 00; etc.
- Shortest path finds a significant shift in any of the coin face probabilities
- Problem: some of these shifts may be completely explained by lower-order marginals
14. Example
- A drop in the joint sale of milk and cereal is completely explained by a drop in the sale of milk
- Pr[milk ∧ cereal] / (Pr[milk] × Pr[cereal]) remains constant over time
- Call this ratio ρ
15. Constant-ρ segmentation
- ρ = observed support ÷ support expected under independence
- Compute the global ρ over all time
- All coins must share this common value of ρ
- Segment by constrained optimization
- Compare with the unconstrained coding cost
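A minimal sketch of the ratio the constrained model holds fixed (names and counts are illustrative, not the paper's code):

```python
def rho(joint, a, b, n):
    """Observed joint support divided by the independence estimate."""
    return (joint / n) / ((a / n) * (b / n))

# Globally: milk in 1000 of 10000 baskets, cereal in 1000, both in 300.
global_rho = rho(300, 1000, 1000, 10000)    # 3.0: strong correlation

# A segment whose local rho matches the global one is steady and
# uninteresting; a segment where it diverges is the surprise.
seg_rho = rho(30, 100, 100, 1000)           # also 3.0: no surprise here
```

The constrained segmentation forces every segment's coin to satisfy this global ρ; a large gap between the constrained and unconstrained coding costs signals that ρ actually varied.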
16. Is all this really needed?
- A simpler alternative
- Aggregate data into suitable time windows
- Compute support, correlation, ρ, etc. in each window
- Use a variance threshold to choose itemsets
- Pitfalls
- Arbitrary choices of windows and thresholds
- May miss fine detail
- Over-sensitive to outliers
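For concreteness, the windowed baseline might look like this sketch (illustrative names; the pitfalls above are exactly the window and threshold choices):

```python
def windowed_rho_variance(a_flags, b_flags, window):
    """Compute the support ratio rho in fixed windows, then its variance.
    The answer depends heavily on `window`, and a single outlier window
    can dominate the variance -- the pitfalls noted above."""
    ratios = []
    for s in range(0, len(a_flags), window):
        a, b = a_flags[s:s + window], b_flags[s:s + window]
        n = len(a)
        pa, pb = sum(a) / n, sum(b) / n
        pab = sum(x & y for x, y in zip(a, b)) / n
        if pa > 0 and pb > 0:
            ratios.append(pab / (pa * pb))
    mean = sum(ratios) / len(ratios)
    return sum((r - mean) ** 2 for r in ratios) / len(ratios)

# Perfectly steady correlation: the variance is zero, so this baseline
# (correctly) skips the itemset -- but one noisy window would fool it.
v = windowed_rho_variance([1, 0] * 50, [1, 0] * 50, window=10)
```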
17. ...but no simpler
- "Smoothing leads to an estimated trend that is descriptive rather than analytic or explanatory. Because it is not based on an explicit probabilistic model, the method cannot be treated rigorously in terms of mathematical statistics." (T. W. Anderson, The Statistical Analysis of Time Series)
18. Experiments
- 2.8 million baskets over 7 years (1987-93)
- 15,800 items, average 2.62 items per basket
- Two algorithms
- The complete MDL approach
- MDL segmentation plus statistical tests (MStat)
- Anecdotes
- MDL is effective at penalizing obvious itemsets
19. Quality of approximation
20. Little agreement in itemset ranks
- Simpler methods do not approximate MDL
21. MDL has high selectivity
- Scores of the best itemsets stand out from the rest under MDL
22. Three anecdotes
- Charts: ρ against time for each itemset pair
- Polo shirt and shorts: high MStat score, small marginals
- Bedsheets and pillowcases: high correlation, small variation
- Men's and women's shorts: high MDL score, significant gradual drift
23. Conclusion
- New notion of surprising patterns based on
- Joint support expected from marginals
- Variation of joint support along time
- Robust MDL formulation
- Efficient algorithms
- Near-optimal segmentation using shortest path
- Pruning criteria
- Successful application to real data