Title: CS 345: Topics in Data Warehousing
1 CS 345: Topics in Data Warehousing
- Tuesday, November 16, 2004
2 Review of Thursday's Class
- Dimension Key Mapping Revisited
- Comments on Assignment 2
- Updating the Data Warehouse
- Incremental maintenance vs. drop and rebuild
- Self-maintainable views
- Approximate Query Processing
- Sampling-based techniques
- Computing confidence intervals
- Online vs. pre-computed samples
- Sampling and joins
- Alternative techniques
3 Outline of Today's Class
- Data Mining
- What is data mining?
- Types of data mining
- Data mining pitfalls
- Decision Tree Classifiers
- What is a decision tree?
- Learning decision trees
- Entropy
- Information Gain
- Cross-Validation
4 Data Mining
- What is data mining?
- Many definitions
- Basically: identify interesting patterns in data
- Most often, the term data mining refers to automatic detection of patterns through machine learning
- Data mining is one part of the broader process of knowledge discovery in databases (KDD)
- KDD: the process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data
- This is what data warehousing is all about.
- Data mining is a field of research
- Draws from databases, artificial intelligence, statistics
- Relatively new research community
- Several conferences and journals
- ACM KDD, SIAM Data Mining, IEEE ICDM
5 Knowledge Discovery in Databases
Knowledge
Interpretation/Evaluation
- Validation Tests
- Visualization
Data Mining
- Identify Patterns
- Generate Models
Preprocessing
- Selection
- Cleaning
- Transformation
- Feature Extraction
Data
6 Types of Data Mining
- OLAP
- Group-by aggregation queries are a simple type of data mining
- Summarize the data set
- Classification
- Build predictive model to categorize records into discrete classes
- Examples
- Classify mortgage applicants as will default or will not default
- Face recognition in image database
- Identify likely terrorists vs. unlikely terrorists
- Regression
- Build predictive model to predict real-valued function
- Examples
- Predict how much revenue each customer will generate
- Predict profitability of planned marketing campaign
- Clustering
- Separate data records into groups of similar items
- Clustering vs. Classification
- Classification is supervised, clustering is unsupervised
- Classification uses pre-defined class labels, clustering doesn't.
7 Types of Data Mining
- Outlier detection
- Identify unusual or atypical data records
- Sometimes to investigate them further
- Sometimes to exclude them from a broad analysis
- Trend analysis / forecasting
- Identify changes in patterns of data over time
- Example: What will be next month's revenue?
- Dependency detection
- Which attributes are correlated with one another?
- Which attribute values are likely to occur together?
- Popular technique: association rule mining
- Also known as market basket analysis
- Find products that are often bought together as part of same transaction
- Temporal pattern detection / time series mining
- Recognize commonly recurring patterns in time series data
- Example: technical analysis of financial markets
8 Data Mining Pitfalls
- Overfitting
- Spurious patterns may emerge by chance
- Don't mistake coincidence for causality
- Example: ESP experiment
- Ask 10,000 test subjects to predict whether each of 10 face-down playing cards is red or black
- 10 subjects predicted all 10 cards correctly!
- Faulty conclusion: 1 out of every 1,000 people has ESP
- Can be a particular concern in datasets with
- Lots of attributes
- Not too many records
- Reporting obvious patterns
- Learning cancer risk factors
- Women are more likely than men to have breast cancer
- Men are more likely than women to have prostate cancer
- These patterns are not novel
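The ESP result above is what pure guessing would produce. A minimal sketch of the arithmetic, assuming each red/black guess is an independent 50/50 coin flip (the variable names are illustrative, not from the lecture):

```python
# Chance that one subject guesses all 10 red/black cards correctly by luck.
n_subjects = 10_000
n_cards = 10
p_all_correct = 0.5 ** n_cards  # 1/1024, assuming independent 50/50 guesses

# Expected number of perfect scores if nobody has ESP.
expected_perfect = n_subjects * p_all_correct

print(f"P(all {n_cards} correct by chance) = {p_all_correct:.6f}")
print(f"Expected perfect scores among {n_subjects} subjects = {expected_perfect:.2f}")
```

Under that assumption, roughly 10 perfect scores are expected from chance alone, so observing 10 is no evidence of ESP.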
9 Data Mining Pitfalls
- Confusing correlation and causation
- Data mining can identify attributes that are correlated
- Correlation doesn't necessarily imply causation
- Example: studying causes of obesity
- Overweight people are more likely to drink diet soda
- Faulty conclusion: drinking diet soda causes obesity
- Moral of the story: interpretation and evaluation of patterns is crucial
- Data mining algorithms are not magical
- Patterns they identify must be examined carefully to avoid drawing inappropriate conclusions
10 Decision Tree Classifiers
- Decision trees are one type of classification model
- Internal nodes of decision tree labeled with attributes
- Each internal node represents a test
- Edges labeled with attribute values
- Edges represent the results of the tests
- Leaves labeled with class values
- Leaves represent the classifier's predictions
- To classify a record, walk down the tree starting at the root
- The path that is followed depends on the attribute values of the record being classified
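[Figure: example decision tree for a loan decision. The root tests Employed? (Yes / No); the second level tests Credit Score? and Income? (High / Low); the leaves are Approve or Reject.]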
11 Decision Tree Learning
- We're given a data set with unknown values for an attribute of interest
- Example
- Data set is Customer records
- Attribute of interest is Will Close Account in Next 3 Months
- Unknown attribute referred to as target attribute
- This data set is referred to as the test set
- We also have a second data set where the values of the target attribute are known
- Referred to as the training set
- We would like to build a decision tree classifier to predict the value of the target attribute
- Construct a decision tree that accurately classifies the records in the training set
- Use the decision tree to predict the value of the target attribute for the records from the test set
- Hopefully a classifier that works well on the training set will also work well on the other data set!
12 Decision Tree Learning
- When does decision tree learning work well?
- Training set and test set are similar
- Patterns in the training set are also present in the test set
- Rules learned from one data set apply to the other
- Decision tree identifies general, globally valid patterns
- And not specific, idiosyncratic properties of the training records
- Need to avoid overfitting the model to the training set
- Occam's razor: simple explanations are usually the best
- Simple (small) decision trees are usually preferable
- Easier for humans to interpret
- Usually less prone to overfitting
- Finding the smallest accurate decision tree is NP-hard
- Decision trees are usually built top-down using a greedy heuristic
- Idea: first test the attributes that do the best job of separating the classes
13 Decision Tree Learning
- Basic decision tree learning algorithm
- Do all records in training set belong to same class?
- Yes → Return leaf node with that class.
- Do all records in training set have the same values for all attributes (other than target)?
- Yes → Return leaf node with most common class.
- Otherwise
- Pick the single attribute that best separates records from different classes
- Use that attribute for the root of the decision tree
- Children of root node are decision trees, one per value of the chosen attribute
- Build them recursively using same algorithm
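The recursive procedure above can be sketched in Python. This is a minimal sketch under stated assumptions, not the course's implementation: "best separates" is taken to mean lowest weighted-average entropy of the resulting subsets (one common choice), the tree is represented as nested `(attribute, {value: subtree})` tuples, and the loan records and attribute names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Bits per record needed to encode the class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def expected_entropy(records, attr, target):
    """Weighted average entropy of the subsets produced by splitting on attr."""
    groups = {}
    for r in records:
        groups.setdefault(r[attr], []).append(r[target])
    return sum(len(g) / len(records) * entropy(g) for g in groups.values())

def build_tree(records, attrs, target):
    """Return a class label (leaf) or an (attribute, {value: subtree}) node."""
    classes = [r[target] for r in records]
    # Case 1: every record has the same class.
    if len(set(classes)) == 1:
        return classes[0]
    # Case 2: no attribute can distinguish the records any further.
    candidates = [a for a in attrs if len({r[a] for r in records}) > 1]
    if not candidates:
        return Counter(classes).most_common(1)[0][0]
    # Otherwise: split on the best attribute (assumed criterion: lowest
    # expected entropy) and build one child per attribute value.
    best = min(candidates, key=lambda a: expected_entropy(records, a, target))
    children = {}
    for value in {r[best] for r in records}:
        subset = [r for r in records if r[best] == value]
        children[value] = build_tree(subset, attrs, target)
    return (best, children)

# Hypothetical training set, for illustration only.
training = [
    {"Employed": "Yes", "Credit": "High", "Decision": "Approve"},
    {"Employed": "Yes", "Credit": "Low",  "Decision": "Approve"},
    {"Employed": "Yes", "Credit": "Low",  "Decision": "Approve"},
    {"Employed": "No",  "Credit": "High", "Decision": "Approve"},
    {"Employed": "No",  "Credit": "Low",  "Decision": "Reject"},
    {"Employed": "No",  "Credit": "Low",  "Decision": "Reject"},
]
tree = build_tree(training, ["Employed", "Credit"], "Decision")
print(tree)
```

On this toy data the sketch splits on Employed first, then on Credit for the unemployed applicants.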
14 Splitting Criterion
- How to decide which attribute is best to test first?
- Each attribute splits data into subsets
- Ideally, each subset should be as homogeneous as possible
- Need metric for homogeneity of a data set
- Example
- Two classes, +/-
- 100 records overall (50 +s and 50 -s)
- A and B are two binary attributes
- Records with A=0: 48+, 2-
- Records with A=1: 2+, 48-
- Records with B=0: 26+, 24-
- Records with B=1: 24+, 26-
- Splitting on A is better than splitting on B
- A does a good job of separating +s and -s
- B does a poor job of separating +s and -s
15 Entropy
- Entropy is a good way to measure homogeneity
- Measures minimum number of bits per record needed to optimally encode class values
- Entropy example
- Three classes (A,B,C)
- A occurs ½ of the time
- B and C each occur ¼ of the time
- Optimal encoding: A = 0, B = 10, C = 11
- Entropy = average bits per record = 1.5
- Entropy formula
- $H(S) = -\sum_i p_i \log_2 p_i$
- Entropy of data set S is denoted H(S)
- The $c_i$ are the possible classes
- $p_i$ = fraction of records from S that have class $c_i$
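As a check on the three-class example, the entropy formula $H(S) = -\sum_i p_i \log_2 p_i$ can be evaluated by hand with $p_A = 1/2$ and $p_B = p_C = 1/4$:

```latex
\begin{align*}
H(S) &= -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} \\
     &= \tfrac{1}{2}(1) + \tfrac{1}{4}(2) + \tfrac{1}{4}(2) \\
     &= 1.5 \text{ bits per record}
\end{align*}
```

The factors 1, 2, and 2 are exactly the code lengths of the encodings 0, 10, and 11, which is why the average code length matches the entropy here.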
16 Entropy Examples
- Example
- 10 records have class A
- 20 records have class B
- 30 records have class C
- 40 records have class D
- Entropy = -(.1 log .1) - (.2 log .2) - (.3 log .3) - (.4 log .4), with logs taken base 2
- Entropy = 1.85
- Earlier example revisited
- Two classes, +/-
- 100 records overall (50 +s and 50 -s)
- A and B are two binary attributes
- Records with A=0: 48+, 2-; Entropy = 0.24
- Records with A=1: 2+, 48-; Entropy = 0.24
- Records with B=0: 26+, 24-; Entropy = 0.999
- Records with B=1: 24+, 26-; Entropy = 0.999
- A is better than B because average entropy is less after splitting on A
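The entropy values in these examples can be reproduced from class counts alone. A minimal sketch; the function name `entropy` and the count-list interface are illustrative choices, not from the lecture:

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as record counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

four_class = entropy([10, 20, 30, 40])  # classes A, B, C, D
subset_a = entropy([48, 2])             # records with A=0 (same for A=1)
subset_b = entropy([26, 24])            # records with B=0 (same for B=1)

print(f"Four-class example: {four_class:.2f}")
print(f"Subset after splitting on A: {subset_a:.2f}")
print(f"Subset after splitting on B: {subset_b:.4f}")
```

Because both subsets of each split hold 50 records, the average entropy after a split equals the per-subset value, so A (about 0.24 bits) beats B (just under 1 bit).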
17 Information Gain
- Information gain = expected reduction in entropy
- Expected entropy after splitting on attribute A: $H(S|A)$
- $H(S|A) = \sum_i (\text{fraction of records with } A = a_i) \cdot (\text{entropy of records with } A = a_i)$
- Sum is taken over all possible values of attribute A
- Computes weighted average entropy across all subsets
- Weight of subset is proportional to the number of records in the subset
- Always split on attribute with greatest information gain
- This is one possible splitting rule for building decision trees
- However, other splitting criteria are also used sometimes
- Gain ratio, Gini index, etc.
- Alternative methods of measuring homogeneity
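Applied to the earlier two-class example (50 +s and 50 -s, so the entropy before splitting is 1 bit), the information gain works out as follows. The notation $\mathrm{Gain}(S, A)$ is introduced here for illustration and is not from the slides:

```latex
\begin{align*}
\mathrm{Gain}(S, A) &= H(S) - H(S|A) \\
H(S|A) &= \tfrac{50}{100}(0.24) + \tfrac{50}{100}(0.24) = 0.24
  &&\Rightarrow\quad \mathrm{Gain}(S, A) = 1 - 0.24 = 0.76 \\
H(S|B) &\approx \tfrac{50}{100}(1.00) + \tfrac{50}{100}(1.00) = 1.00
  &&\Rightarrow\quad \mathrm{Gain}(S, B) \approx 0
\end{align*}
```

Since $H(S)$ is the same for every candidate attribute, choosing the greatest information gain is equivalent to choosing the lowest expected entropy after the split.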
18 Decision Tree Example
State Season Barometer Weather
AK Winter Down Snow
HI Winter Down Sun
HI Summer Up Sun
CA Summer Up Rain
AK Winter Up Snow
CA Winter Down Sun
AK Summer Down Sun
CA Winter Up Rain
HI Summer Down Sun
Predicting the weather
- Target attribute: Weather
- Source attributes: State, Season, Barometer
19 Decision Tree Example
State
- AK: 2 Snow, 1 Sun → 0.92
- HI: 3 Sun → 0.00
- CA: 2 Rain, 1 Sun → 0.92
- Expected entropy = 0.61
Season
- Winter: 2 Snow, 2 Sun, 1 Rain → 1.52
- Summer: 3 Sun, 1 Rain → 0.81
- Expected entropy = 1.21
Barometer
- Down: 1 Snow, 4 Sun → 0.72
- Up: 1 Snow, 1 Sun, 2 Rain → 1.50
- Expected entropy = 1.07
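The per-attribute expected entropies for the weather data can be recomputed directly from the nine training records. A minimal sketch; the tuple layout and helper names are illustrative choices:

```python
from collections import Counter, defaultdict
from math import log2

# The nine training records: (State, Season, Barometer, Weather).
ROWS = [
    ("AK", "Winter", "Down", "Snow"),
    ("HI", "Winter", "Down", "Sun"),
    ("HI", "Summer", "Up",   "Sun"),
    ("CA", "Summer", "Up",   "Rain"),
    ("AK", "Winter", "Up",   "Snow"),
    ("CA", "Winter", "Down", "Sun"),
    ("AK", "Summer", "Down", "Sun"),
    ("CA", "Winter", "Up",   "Rain"),
    ("HI", "Summer", "Down", "Sun"),
]
COLUMNS = {"State": 0, "Season": 1, "Barometer": 2}
TARGET = 3  # index of the Weather column

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def expected_entropy(rows, col):
    """Weighted average entropy of Weather after splitting on column col."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[col]].append(row[TARGET])
    return sum(len(s) / len(rows) * entropy(s) for s in subsets.values())

scores = {name: expected_entropy(ROWS, col) for name, col in COLUMNS.items()}
for name, score in scores.items():
    print(f"{name:10s} {score:.3f}")
best = min(scores, key=scores.get)
print("Split on:", best)
```

State gives the lowest expected entropy, so it becomes the root test.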
20 Decision Tree Example
State = AK: split on Season (Winter → Snow, Summer → Sun)
State Season Barometer Weather
AK Winter Down Snow
AK Winter Up Snow
AK Summer Down Sun
HI Winter Down Sun
HI Summer Up Sun
HI Summer Down Sun
CA Summer Up Rain
CA Winter Down Sun
CA Winter Up Rain
State = HI: leaf node (Sun)
State = CA: split on Barometer (Up → Rain, Down → Sun)
21 Decision Tree Example
[Figure: the resulting decision tree]
State
- AK → Season
  - Winter → Snow
  - Summer → Sun
- HI → Sun
- CA → Barometer
  - Up → Rain
  - Down → Sun
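Classifying a record with this tree means walking from the root to a leaf. A minimal sketch, assuming the tree is stored as nested `(attribute, {value: subtree})` tuples with plain strings as leaves (a representation chosen here for illustration):

```python
# Internal node: (attribute, {value: subtree}); leaf: class label string.
TREE = ("State", {
    "AK": ("Season", {"Winter": "Snow", "Summer": "Sun"}),
    "HI": "Sun",
    "CA": ("Barometer", {"Up": "Rain", "Down": "Sun"}),
})

def classify(tree, record):
    """Follow the edges matching the record's attribute values down to a leaf."""
    while not isinstance(tree, str):
        attribute, children = tree
        tree = children[record[attribute]]
    return tree

print(classify(TREE, {"State": "CA", "Season": "Winter", "Barometer": "Up"}))  # Rain
print(classify(TREE, {"State": "AK", "Season": "Summer", "Barometer": "Up"}))  # Sun
```

The second record (AK, Summer, Up) does not appear in the training data; the tree still predicts Sun because only State and Season are tested on that path.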
22 Overfitting and Pruning
- The performance graph (accuracy vs. decision tree size) exhibits a typical phenomenon
- Accuracy on training data increases as decision tree grows
- Accuracy on test data initially increases, then decreases.
- Why does this happen?
- Highly predictive attributes near root of decision tree capture general patterns
- Less predictive attributes added later are mostly capturing statistical noise
- Goal: stop building the decision tree before overfitting kicks in
- Pruning → eliminate lower portions of the decision tree
- Replace sub-tree with a leaf node
[Figure: accuracy (y-axis) vs. decision tree size (x-axis), with one curve for training set accuracy and one for test set accuracy; the optimal tree size is marked on the x-axis.]
23 Pruning via Cross-Validation
- Cross-validation
- Separate training set into two parts
- Most of the training set is used to build tree
- Small holdout set is used to validate accuracy
- Post-pruning approach
- Build decision tree with training data (excluding the holdout set)
- Traverse tree in bottom-up fashion
- For each sub-tree
- Consider pruning sub-tree, replacing with leaf node
- If pruned tree is more accurate on holdout set, then use it
- Otherwise, stick with original sub-tree
- Idea behind pruning
- Portion of tree that models general patterns works well on holdout set
- Portion of tree that fits random noise works poorly on holdout set
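The post-pruning pass can be sketched as follows. This is a minimal sketch under stated assumptions: each internal node is a dict that also stores the majority training class at that node (used as the replacement leaf), each sub-tree is judged only on the holdout records that reach it (equivalent to comparing whole-tree accuracy, since the rest of the tree is unchanged), and the example tree, attribute names, and holdout records are hypothetical.

```python
def classify(tree, record):
    """Walk down the tree; leaves are class-label strings."""
    while isinstance(tree, dict):
        # Unseen attribute value: fall back to the node's majority class.
        tree = tree["children"].get(record[tree["attr"]], tree["majority"])
    return tree

def accuracy(tree, records, target):
    return sum(classify(tree, r) == r[target] for r in records) / len(records)

def prune(tree, holdout, target):
    """Bottom-up pruning driven by accuracy on the holdout records."""
    if not isinstance(tree, dict):
        return tree  # a leaf cannot be pruned further
    # Prune the children first, each with the holdout records routed to it.
    for value, child in list(tree["children"].items()):
        reaching = [r for r in holdout if r[tree["attr"]] == value]
        tree["children"][value] = prune(child, reaching, target)
    if not holdout:
        return tree  # no holdout evidence either way: keep the sub-tree
    leaf = tree["majority"]
    # Replace the sub-tree only if the leaf is strictly more accurate.
    if accuracy(leaf, holdout, target) > accuracy(tree, holdout, target):
        return leaf
    return tree

# Hypothetical tree whose lower split (ShoeSize) only fits noise.
tree = {"attr": "Employed", "majority": "Approve", "children": {
    "Yes": "Approve",
    "No": {"attr": "ShoeSize", "majority": "Reject",
           "children": {"Small": "Reject", "Large": "Approve"}},
}}
holdout = [
    {"Employed": "Yes", "ShoeSize": "Large", "Decision": "Approve"},
    {"Employed": "No",  "ShoeSize": "Large", "Decision": "Reject"},
    {"Employed": "No",  "ShoeSize": "Small", "Decision": "Reject"},
    {"Employed": "No",  "ShoeSize": "Large", "Decision": "Reject"},
]
pruned = prune(tree, holdout, "Decision")
print(pruned)
```

In this example the ShoeSize sub-tree is replaced by a Reject leaf, while the Employed split at the root survives because it classifies the holdout records better than a single leaf would.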
24 Sufficient Statistics
- What information is needed to determine what attribute to split on?
- Need to compute expected entropy after splitting on each attribute
- To compute expected entropy after splitting on attribute A
- How many records are there with each value of A?
- Among the records with each A value, how many belong to each class?
- These counts are called sufficient statistics
- Computing sufficient statistics via SQL
- Use a simple group-by SQL query (one per attribute)
  SELECT A, Class, COUNT(*)
  FROM Table
  GROUP BY A, Class
- For non-root nodes, need a WHERE clause for earlier splits
  SELECT A, Class, COUNT(*)
  FROM Table
  WHERE B = x AND C = y
  GROUP BY A, Class
- Full data cube contains all sufficient statistics for entire decision tree
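The group-by query can be tried end to end with Python's built-in `sqlite3` module. A minimal sketch: the table name `Records` is a hypothetical stand-in (`Table` is a reserved word in SQL), and the rows reproduce the earlier two-class example for attribute A.

```python
import sqlite3
from math import log2

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Records (A INTEGER, Class TEXT)")
# 100 records: A=0 has 48 '+' and 2 '-'; A=1 has 2 '+' and 48 '-'.
rows = [(0, "+")] * 48 + [(0, "-")] * 2 + [(1, "+")] * 2 + [(1, "-")] * 48
conn.executemany("INSERT INTO Records VALUES (?, ?)", rows)

# Sufficient statistics: one count per (attribute value, class) pair.
stats = conn.execute(
    "SELECT A, Class, COUNT(*) FROM Records GROUP BY A, Class"
).fetchall()

# Expected entropy after splitting on A, computed from the counts alone.
counts_by_value = {}
for a_value, _cls, n in stats:
    counts_by_value.setdefault(a_value, []).append(n)
total = sum(n for _, _, n in stats)
expected = 0.0
for counts in counts_by_value.values():
    size = sum(counts)
    subset_entropy = -sum(c / size * log2(c / size) for c in counts)
    expected += size / total * subset_entropy

print(sorted(stats))
print(f"Expected entropy after splitting on A: {expected:.2f}")
```

Only the four counts returned by the query are needed; the individual records never leave the database, which is the point of using sufficient statistics.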
25 Decision Trees and Data Warehouses
- Generally, building a decision tree involves dimension-focused queries
- As opposed to typical fact-focused queries
- Records for which predictions are made are dimension rows (e.g. Customers, Accounts)
- Sometimes queries just involve the dimension table
- Other times dimension attributes may be supplemented by virtual behavioral attributes
- Two approaches for gathering sufficient statistics
- Compute entire data cube (including subtotals) in one query
- Issue a series of small group-by queries