CS 345: Topics in Data Warehousing - PowerPoint PPT Presentation

Slides: 26

Provided by: BrianB120

Learn more at: http://web.stanford.edu

1
CS 345: Topics in Data Warehousing

Tuesday, November 16, 2004

2
Review of Thursday's Class

Dimension Key Mapping Revisited
Comments on Assignment 2
Updating the Data Warehouse
Incremental maintenance vs. drop & rebuild
Self-maintainable views
Approximate Query Processing
Sampling-based techniques
Computing confidence intervals
Online vs. pre-computed samples
Sampling and joins
Alternative techniques

3
Outline of Today's Class

Data Mining
What is data mining?
Types of data mining
Data mining pitfalls
Decision Tree Classifiers
What is a decision tree?
Learning decision trees
Entropy
Information Gain
Cross-Validation

4
Data Mining

What is data mining?
Many definitions
Basically: identify interesting patterns in data
Most often, the term data mining refers to
automatic detection of patterns through machine
learning
Data mining is one part of the broader process of
knowledge discovery in databases (KDD)
KDD: the process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data
This is what data warehousing is all about.
Data mining is a field of research
Draws from databases, artificial intelligence,
statistics
Relatively new research community
Several conferences and journals
ACM KDD, SIAM Data Mining, IEEE ICDM

5
Knowledge Discovery in Databases
[Figure: the KDD process as a stack of stages. Data → Preprocessing (selection, cleaning, transformation, feature extraction) → Data Mining (identify patterns, generate models) → Interpretation/Evaluation (validation tests, visualization) → Knowledge]

6
Types of Data Mining

OLAP
Group-by aggregation queries are a simple type of
data mining
Summarize the data set
Classification
Build predictive model to categorize records into
discrete classes
Examples
Classify mortgage applicants as "will default" or "will not default"
Face recognition in image database
Identify likely terrorists vs. unlikely
terrorists
Regression
Build predictive model to predict real-valued
function
Examples
Predict how much revenue each customer will
generate
Predict profitability of planned marketing
campaign
Clustering
Separate data records into groups of similar
items
Clustering vs. Classification
Classification is supervised, clustering is
unsupervised
Classification uses pre-defined class labels, clustering doesn't.

7
Types of Data Mining

Outlier detection
Identify unusual or atypical data records
Sometimes to investigate them further
Sometimes to exclude them from a broad analysis
Trend analysis / forecasting
Identify changes in patterns of data over time
Example: What will be next month's revenue?
Dependency detection
Which attributes are correlated with one another?
Which attribute values are likely to occur
together?
Popular technique: association rule mining
Also known as market basket analysis
Find products that are often bought together as
part of same transaction
Temporal pattern detection / time series mining
Recognize commonly recurring patterns in time
series data
Example: technical analysis of financial markets

8
Data Mining Pitfalls

Overfitting
Spurious patterns may emerge by chance
Don't mistake coincidence for causality
Example: ESP experiment
Ask 10,000 test subjects to predict whether each of 10 face-down playing cards is red or black
10 subjects predicted all 10 cards correctly!
Conclusion: "1 out of every 1,000 people has ESP"
But pure guessing gets all 10 cards right with probability 1/2^10 = 1/1024, so about 10 perfect scores out of 10,000 are expected by chance alone
Can be a particular concern in data sets with:
Lots of attributes
Not too many records
Reporting obvious patterns
Example: learning cancer risk factors
Women are more likely than men to have breast
cancer
Men are more likely than women to have prostate
cancer
These patterns are not novel
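The arithmetic behind the ESP example can be checked in a few lines. This is a minimal sketch, assuming each guess is an independent 50/50 coin flip (the slide does not state the deck composition); the variable names are illustrative.

```python
subjects = 10_000   # test subjects in the experiment
cards = 10          # red/black guesses per subject

# Probability that pure guessing gets every card right.
p_perfect = 0.5 ** cards

# Expected number of subjects with a perfect score by luck alone.
expected_perfect = subjects * p_perfect

# Probability that at least one subject scores perfectly by chance.
p_at_least_one = 1 - (1 - p_perfect) ** subjects

print(f"P(perfect score by guessing) = {p_perfect:.6f}")
print(f"Expected perfect scorers     = {expected_perfect:.2f}")
print(f"P(at least one perfect)      = {p_at_least_one:.6f}")
```

Under these assumptions, roughly ten perfect scorers are what chance predicts, so the observed result is no evidence of a real pattern.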

9
Data Mining Pitfalls

Confusing correlation and causation
Data mining can identify attributes that are
correlated
Correlation doesn't necessarily imply causation
Example: studying causes of obesity
Overweight people are more likely to drink diet soda
Conclusion: "Drinking diet soda causes obesity"
Moral of the story: interpretation and evaluation of patterns is crucial
Data mining algorithms are not magical
Patterns they identify must be examined carefully
to avoid drawing inappropriate conclusions

10
Decision Tree Classifiers

Decision trees are one type of classification
model
Internal nodes of decision tree labeled with
attributes
Each internal node represents a test
Edges labeled with attribute values
Edges represent the results of the tests
Leaves labeled with class values
Leaves represent the classifier's predictions
To classify a record, walk down the tree starting
at the root
The path that is followed depends on the
attribute values of the record being classified

[Figure: example decision tree. Root test: Employed? (Yes / No). Second-level tests: Credit Score? (High / Low) and Income? (High / Low). Leaves: Approve or Reject.]

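The root-to-leaf walk described on this slide might be sketched as follows. The tree below is hypothetical: the attribute names come from the figure, but which branch leads to which leaf is an assumption made for illustration, as is the nested-dictionary representation.

```python
# Hypothetical tree: the branch-to-leaf mapping is assumed, not taken from the slide.
tree = {
    "attribute": "Employed",
    "branches": {
        "Yes": {
            "attribute": "Credit Score",
            "branches": {"High": "Approve", "Low": "Reject"},
        },
        "No": {
            "attribute": "Income",
            "branches": {"High": "Approve", "Low": "Reject"},
        },
    },
}


def classify(node, record):
    """Walk from the root to a leaf, following the edge that matches the record."""
    while isinstance(node, dict):          # internal node: apply its test
        value = record[node["attribute"]]
        node = node["branches"][value]     # follow the edge labeled with that value
    return node                            # leaf: the predicted class


applicant = {"Employed": "Yes", "Credit Score": "Low", "Income": "High"}
print(classify(tree, applicant))
```

Only the attributes along the followed path are consulted; here Income is never examined because the applicant is employed.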
11
Decision Tree Learning

We're given a data set with unknown values for an attribute of interest
Example:
Data set is customer records
Attribute of interest is "Will Close Account in Next 3 Months"
Unknown attribute is referred to as the target attribute
This data set is referred to as the test set
We also have a second data set where the values
of the target attribute are known
Referred to as the training set
We would like to build a decision tree classifier
to predict the value of the target attribute
Construct a decision tree that accurately
classifies the records in the training set
Use the decision tree to predict the value of the
target attribute for the records from the test
set
Hopefully a classifier that works well on the
training set will also work well on the other
data set!

12
Decision Tree Learning

When does decision tree learning work well?
Training set and test set are similar
Patterns in the training set are also present in
the test set
Rules learned from one data set apply to the
other
Decision tree identifies general, globally valid
patterns
And not specific, idiosyncratic properties of the
training records
Need to avoid overfitting the model to the
training set
Occam's razor: simple explanations are usually the best
Simple (small) decision trees are usually
preferable
Easier for humans to interpret
Usually less prone to overfitting
Finding the smallest accurate decision tree is
NP-Hard
Decision trees are usually built top-down using a greedy heuristic
Idea: first test the attributes that do the best job of separating the classes

13
Decision Tree Learning

Basic decision tree learning algorithm
Do all records in training set belong to same class?
Yes → Return leaf node with that class.
Do all records in training set have the same values for all attributes (other than target)?
Yes → Return leaf node with most common class.
Otherwise:
Pick the single attribute that best separates records from different classes
Use that attribute for the root of the decision tree
Children of root node are decision trees, one per value of the chosen attribute
Build them recursively using same algorithm, each on the records with that value
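The recursive procedure above could be sketched as below. This is one possible reading, not the course's implementation: "best separates" is taken to mean lowest weighted entropy after the split (the criterion the later slides introduce), and the customer records, attribute names, and dictionary tree format are all hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def expected_entropy(records, attr, target):
    """Weighted average entropy of the subsets produced by splitting on attr."""
    groups = {}
    for r in records:
        groups.setdefault(r[attr], []).append(r[target])
    return sum(len(g) / len(records) * entropy(g) for g in groups.values())


def build_tree(records, attrs, target):
    labels = [r[target] for r in records]
    # Case 1: every record has the same class -> leaf with that class.
    if len(set(labels)) == 1:
        return labels[0]
    # Case 2: records agree on all remaining attributes -> leaf with most common class.
    if all(len({r[a] for r in records}) == 1 for a in attrs):
        return Counter(labels).most_common(1)[0][0]
    # Otherwise: greedily pick the attribute with the lowest expected entropy.
    best = min(attrs, key=lambda a: expected_entropy(records, a, target))
    remaining = [a for a in attrs if a != best]
    children = {}
    for value in {r[best] for r in records}:
        subset = [r for r in records if r[best] == value]
        children[value] = build_tree(subset, remaining, target)
    return {"attribute": best, "children": children}


# Hypothetical training set: will the customer close the account?
customers = [
    {"Plan": "Basic", "Complaints": "Yes", "Closes": "Yes"},
    {"Plan": "Basic", "Complaints": "Yes", "Closes": "Yes"},
    {"Plan": "Basic", "Complaints": "No", "Closes": "No"},
    {"Plan": "Premium", "Complaints": "Yes", "Closes": "No"},
    {"Plan": "Premium", "Complaints": "Yes", "Closes": "No"},
    {"Plan": "Premium", "Complaints": "No", "Closes": "No"},
]
tree = build_tree(customers, ["Plan", "Complaints"], "Closes")
print(tree)
```

On this toy data the root tests Plan, the Premium branch becomes a leaf immediately, and only the Basic branch needs a second test on Complaints.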

14
Splitting Criterion

How to decide which attribute is best to test
first?
Each attribute splits data into subsets
Ideally, each subset should be as homogeneous as possible
Need metric for homogeneity of a data set
Example
Two classes, +/-
100 records overall (50 +s and 50 -s)
A and B are two binary attributes
Records with A=0: 48+, 2-
Records with A=1: 2+, 48-
Records with B=0: 26+, 24-
Records with B=1: 24+, 26-
Splitting on A is better than splitting on B
A does a good job of separating +s and -s
B does a poor job of separating +s and -s
15
Entropy

Entropy is a good way to measure homogeneity
Measures minimum number of bits per record needed
to optimally encode class values
Entropy example
Three classes (A,B,C)
A occurs ½ of the time
B and C each occur ¼ of the time
Optimal encoding: A → 0, B → 10, C → 11
Entropy = average bits / record = 1.5
Entropy formula: $H(S) = -\sum_i p_i \log_2 p_i$
Entropy of data set S is denoted H(S)
c_i's are the possible classes
p_i = fraction of records from S that have class c_i
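The three-class example can be checked against the standard Shannon entropy formula. A minimal sketch; the function and variable names are illustrative.

```python
from math import log2


def entropy(probabilities):
    """H(S) = -sum(p_i * log2(p_i)), skipping classes that never occur."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


class_probs = {"A": 0.5, "B": 0.25, "C": 0.25}
code = {"A": "0", "B": "10", "C": "11"}   # the encoding from the slide

# Average code length per record under that encoding.
avg_code_length = sum(class_probs[c] * len(code[c]) for c in class_probs)

print(entropy(class_probs.values()), avg_code_length)
```

Both quantities come out to 1.5 bits per record, which is why this encoding is optimal for these class frequencies.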

16
Entropy Examples

Example
10 records have class A
20 records have class B
30 records have class C
40 records have class D
Entropy: $H = -(0.1 \log_2 0.1 + 0.2 \log_2 0.2 + 0.3 \log_2 0.3 + 0.4 \log_2 0.4)$
Entropy = 1.85
Earlier example revisited
Two classes, +/-
100 records overall (50 +s and 50 -s)
A and B are two binary attributes
Records with A=0: 48+, 2-; Entropy = 0.24
Records with A=1: 2+, 48-; Entropy = 0.24
Records with B=0: 26+, 24-; Entropy ≈ 0.999
Records with B=1: 24+, 26-; Entropy ≈ 0.999
A is better than B because average entropy is less after splitting on A
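The entropy figures on this slide can be reproduced from class counts alone. The sketch below assumes base-2 logarithms (entropy in bits); helper names are illustrative.

```python
from math import log2


def entropy_from_counts(counts):
    """Entropy in bits of a data set given the number of records per class."""
    total = sum(counts)
    return -sum((n / total) * log2(n / total) for n in counts if n > 0)


def average_entropy(subsets):
    """Average entropy of the subsets, weighted by subset size."""
    total = sum(sum(s) for s in subsets)
    return sum(sum(s) / total * entropy_from_counts(s) for s in subsets)


four_class = entropy_from_counts([10, 20, 30, 40])   # classes A, B, C, D

split_a = [[48, 2], [2, 48]]     # (+, -) counts for A=0 and A=1
split_b = [[26, 24], [24, 26]]   # (+, -) counts for B=0 and B=1

print(f"four-class entropy: {four_class:.2f}")
print(f"after split on A:   {average_entropy(split_a):.2f}")
print(f"after split on B:   {average_entropy(split_b):.2f}")
```

Splitting on A leaves far less uncertainty about the class than splitting on B, which matches the slide's conclusion.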

17
Information Gain

Information gain: expected reduction in entropy, H(S) − H(S|A)
Expected entropy after splitting on attribute A: H(S|A)
$H(S|A) = \sum_i \frac{|S_{A=a_i}|}{|S|} \, H(S_{A=a_i})$, i.e. the sum of (fraction of records with A = a_i) × (entropy of records with A = a_i)
Sum is taken over all possible values of attribute A
Computes weighted average entropy across all subsets
Weight of subset is proportional to the number of records in the subset
Always split on attribute with greatest
information gain
This is one possible splitting rule for building
decision trees
However, other splitting criteria are also used
sometimes
Gain ratio, Gini index, etc.
Alternative methods of measuring homogeneity
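As a worked instance of the definition, take the A/B example from the entropy slides: 100 records, 50 + and 50 −, with each attribute splitting them into two subsets of 50. The notation Gain(S, A) is introduced here for illustration, and the subset entropies are shown unrounded (0.2423 and 0.9988).

```latex
\begin{align*}
H(S) &= -\left(0.5\log_2 0.5 + 0.5\log_2 0.5\right) = 1 \\
H(S \mid A) &= \tfrac{50}{100}(0.2423) + \tfrac{50}{100}(0.2423) \approx 0.24 \\
H(S \mid B) &= \tfrac{50}{100}(0.9988) + \tfrac{50}{100}(0.9988) \approx 0.999 \\
\mathrm{Gain}(S, A) &= H(S) - H(S \mid A) \approx 0.76 \\
\mathrm{Gain}(S, B) &= H(S) - H(S \mid B) \approx 0.001
\end{align*}
```

Splitting on A removes about three quarters of a bit of uncertainty per record, while splitting on B removes almost none, so the information-gain rule picks A.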

18
Decision Tree Example
State Season Barometer Weather
AK Winter Down Snow
HI Winter Down Sun
HI Summer Up Sun
CA Summer Up Rain
AK Winter Up Snow
CA Winter Down Sun
AK Summer Down Sun
CA Winter Up Rain
HI Summer Down Sun
Predicting the weather
Target attribute: Weather
Source attributes: State, Season, Barometer
19
Decision Tree Example
State:
AK: 2 Snow, 1 Sun → 0.92
HI: 3 Sun → 0.00
CA: 2 Rain, 1 Sun → 0.92
Expected entropy = 0.61
State Season Barometer Weather
AK Winter Down Snow
HI Winter Down Sun
HI Summer Up Sun
CA Summer Up Rain
AK Winter Up Snow
CA Winter Down Sun
AK Summer Down Sun
CA Winter Up Rain
HI Summer Down Sun
Season:
Winter: 2 Snow, 2 Sun, 1 Rain → 1.52
Summer: 3 Sun, 1 Rain → 0.81
Expected entropy = 1.21
Barometer:
Down: 1 Snow, 4 Sun → 0.72
Up: 1 Snow, 1 Sun, 2 Rain → 1.50
Expected entropy = 1.07
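The root-split comparison on this slide can be recomputed from the nine weather records. A minimal sketch, assuming the tuple layout (State, Season, Barometer, Weather); the helper names are illustrative.

```python
from collections import Counter, defaultdict
from math import log2

# The nine records from the example table: (State, Season, Barometer, Weather).
records = [
    ("AK", "Winter", "Down", "Snow"),
    ("HI", "Winter", "Down", "Sun"),
    ("HI", "Summer", "Up", "Sun"),
    ("CA", "Summer", "Up", "Rain"),
    ("AK", "Winter", "Up", "Snow"),
    ("CA", "Winter", "Down", "Sun"),
    ("AK", "Summer", "Down", "Sun"),
    ("CA", "Winter", "Up", "Rain"),
    ("HI", "Summer", "Down", "Sun"),
]
columns = ("State", "Season", "Barometer")


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def expected_entropy(rows, col_index):
    """Weighted average entropy of Weather after splitting on one column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col_index]].append(row[-1])   # last field is the target
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())


scores = {name: expected_entropy(records, i) for i, name in enumerate(columns)}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name:10s} {score:.2f}")
print("split on:", best)
```

State has the lowest expected entropy of the three attributes, so it is chosen for the root of the tree.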
20
Decision Tree Example
State = AK: split on Season (Winter → Snow, Summer → Sun)
State Season Barometer Weather
AK Winter Down Snow
AK Winter Up Snow
AK Summer Down Sun
HI Winter Down Sun
HI Summer Up Sun
HI Summer Down Sun
CA Summer Up Rain
CA Winter Down Sun
CA Winter Up Rain
State = HI: leaf node (Sun)
State = CA: split on Barometer (Up → Rain, Down → Sun)
21
Decision Tree Example
[Figure: the learned decision tree. Root: State. CA → Barometer (Up → Rain, Down → Sun); AK → Season (Winter → Snow, Summer → Sun); HI → Sun.]

22
Overfitting and Pruning

Performance graph at right exhibits typical
phenomenon
Accuracy on training data increases as decision tree grows
Accuracy on test data initially increases, then
decreases.
Why does this happen?
Highly predictive attributes near root of
decision tree capture general patterns
Less predictive attributes added later are mostly
capturing statistical noise
Goal: stop building the decision tree before overfitting kicks in
Pruning → eliminate lower portions of the decision tree
Replace sub-tree with a leaf node

[Figure: accuracy (y-axis) vs. decision tree size (x-axis), with one curve for training set accuracy and one for test set accuracy; the optimal tree size is marked on the x-axis.]
23
Pruning via Cross-Validation

Cross-validation
Separate training set into two parts
Most of the training set is used to build tree
Small holdout set is used to validate accuracy
Post-pruning approach
Build decision tree with training data (less
holdout set)
Traverse tree in bottom-up fashion
For each sub-tree
Consider pruning sub-tree, replacing with leaf
node
If pruned tree is more accurate on holdout set,
then use it
Otherwise, stick with original sub-tree
Idea behind pruning
Portion of tree that models general patterns
works well on holdout set
Portion of tree that fits random noise works
poorly on holdout set
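The bottom-up post-pruning pass could look roughly like this. It is a sketch under several assumptions: each internal node stores the majority class of the training records that reached it (the "majority" field is hypothetical), the tree and holdout records are made up, and each sub-tree is judged only on the holdout records that reach it, which gives the same decision as comparing whole-tree accuracy because other records are unaffected.

```python
def classify(node, record):
    while isinstance(node, dict):
        child = node["children"].get(record[node["attribute"]])
        if child is None:              # unseen value: fall back to majority class
            return node["majority"]
        node = child
    return node


def accuracy(tree, records, target):
    if not records:
        return 1.0
    return sum(classify(tree, r) == r[target] for r in records) / len(records)


def prune(node, holdout, target):
    """Prune children first, then consider replacing this sub-tree with a leaf."""
    if not isinstance(node, dict):
        return node
    for value, child in list(node["children"].items()):
        reaching = [r for r in holdout if r[node["attribute"]] == value]
        node["children"][value] = prune(child, reaching, target)
    leaf = node["majority"]
    # Use the leaf only if it is strictly more accurate on the holdout set.
    if accuracy(leaf, holdout, target) > accuracy(node, holdout, target):
        return leaf
    return node


# Hypothetical tree whose lower split ("Owns Pet") fits noise in the training data.
tree = {
    "attribute": "Employed", "majority": "Approve",
    "children": {
        "Yes": {
            "attribute": "Owns Pet", "majority": "Approve",
            "children": {"Yes": "Approve", "No": "Reject"},
        },
        "No": "Reject",
    },
}
holdout = [
    {"Employed": "Yes", "Owns Pet": "No", "Decision": "Approve"},
    {"Employed": "Yes", "Owns Pet": "Yes", "Decision": "Approve"},
    {"Employed": "No", "Owns Pet": "Yes", "Decision": "Reject"},
    {"Employed": "No", "Owns Pet": "No", "Decision": "Reject"},
]
pruned = prune(tree, holdout, "Decision")
print(pruned)
```

In this example the noisy lower split is collapsed into a leaf, while the root test survives because it still helps on the holdout records. Note that `prune` modifies the tree in place.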

24
Sufficient Statistics

What information is needed to determine which attribute to split on?
Need to compute expected entropy of each
attribute
To compute expected entropy after splitting on
attribute A
How many records are there with each value of A?
Among the records with each A value, how many
belong to each class?
These counts are called sufficient statistics
Computing sufficient statistics via SQL
Use a simple group-by SQL query (one per attribute):
    SELECT A, Class, COUNT(*)
    FROM Table
    GROUP BY A, Class
For non-root nodes, need a WHERE clause for earlier splits:
    SELECT A, Class, COUNT(*)
    FROM Table
    WHERE B = x AND C = y
    GROUP BY A, Class
Full data cube contains all sufficient statistics
for entire decision tree
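The group-by queries above can be tried end to end with an in-memory SQLite database. This sketch loads the weather records from the earlier example; the table name Weather and the column name Class are assumptions chosen to mirror the slide's query template.

```python
import sqlite3

rows = [
    ("AK", "Winter", "Down", "Snow"), ("HI", "Winter", "Down", "Sun"),
    ("HI", "Summer", "Up", "Sun"),    ("CA", "Summer", "Up", "Rain"),
    ("AK", "Winter", "Up", "Snow"),   ("CA", "Winter", "Down", "Sun"),
    ("AK", "Summer", "Down", "Sun"),  ("CA", "Winter", "Up", "Rain"),
    ("HI", "Summer", "Down", "Sun"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Weather (State TEXT, Season TEXT, Barometer TEXT, Class TEXT)"
)
conn.executemany("INSERT INTO Weather VALUES (?, ?, ?, ?)", rows)

# Root node: sufficient statistics for the candidate attribute State.
root_counts = conn.execute(
    "SELECT State, Class, COUNT(*) FROM Weather "
    "GROUP BY State, Class ORDER BY State, Class"
).fetchall()

# Non-root node (the State = 'AK' branch): WHERE clause encodes the earlier split.
ak_counts = conn.execute(
    "SELECT Season, Class, COUNT(*) FROM Weather WHERE State = ? "
    "GROUP BY Season, Class ORDER BY Season, Class",
    ("AK",),
).fetchall()

print(root_counts)
print(ak_counts)
conn.close()
```

Each result row is one (attribute value, class, count) triple, which is exactly what the expected-entropy computation needs for that attribute.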

25
Decision Trees and Data Warehouses

Generally, building a decision tree involves dimension-focused queries
As opposed to typical fact-focused queries
Records for which predictions are made are
dimension rows (e.g. Customers, Accounts)
Sometimes queries just involve the dimension
table
Other times dimension attributes may be
supplemented by virtual behavioral attributes
Two approaches for gathering sufficient
statistics
Compute entire data cube (including subtotals) in
one query
Issue a series of small group-by queries
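To make the first approach concrete, the sketch below builds a small count cube in memory: every record contributes to each combination of "keep this attribute" or "roll it up". This only illustrates the idea; a warehouse would normally compute the cube inside the DBMS. The marker ALL, the function name, and the tuple layout (State, Season, Barometer, Weather) are assumptions.

```python
from collections import Counter
from itertools import product

ALL = "*"   # marker for an attribute that has been rolled up (subtotal)

records = [
    ("AK", "Winter", "Down", "Snow"), ("HI", "Winter", "Down", "Sun"),
    ("HI", "Summer", "Up", "Sun"),    ("CA", "Summer", "Up", "Rain"),
    ("AK", "Winter", "Up", "Snow"),   ("CA", "Winter", "Down", "Sun"),
    ("AK", "Summer", "Down", "Sun"),  ("CA", "Winter", "Up", "Rain"),
    ("HI", "Summer", "Down", "Sun"),
]


def count_cube(rows, n_attrs):
    """Count records for every combination of kept / rolled-up attributes."""
    cube = Counter()
    for row in rows:
        *attrs, cls = row
        # Each attribute is either kept or rolled up: 2**n_attrs cells per record.
        for keep in product((True, False), repeat=n_attrs):
            key = tuple(v if k else ALL for v, k in zip(attrs, keep))
            cube[key + (cls,)] += 1
    return cube


cube = count_cube(records, n_attrs=3)

# Sufficient statistics for any node are now lookups, e.g. Season counts
# within the State = AK branch:
print(cube[("AK", "Winter", ALL, "Snow")], cube[("AK", "Summer", ALL, "Sun")])
```

The trade-off is the usual one: the cube answers every later lookup without another scan, but it can grow large with many attributes, whereas the series of small group-by queries computes only the counts that the tree actually needs.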