Title: CS590D: Data Mining Chris Clifton
1CS590D Data Mining, Chris Clifton
- January 14, 2006
- Data Preparation
2Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary
3Why Data Preprocessing?
- Data in the real world is dirty
- incomplete: lacking attribute values, lacking
certain attributes of interest, or containing
only aggregate data
- e.g., occupation = ""
- noisy: containing errors or outliers
- e.g., Salary = "-10"
- inconsistent: containing discrepancies in codes
or names
- e.g., Age = "42", Birthday = "03/07/1997"
- e.g., Was rating "1, 2, 3", now rating "A, B, C"
- e.g., discrepancy between duplicate records
4Why Is Data Dirty?
- Incomplete data comes from
- n/a data value when collected
- different considerations between the time when the
data was collected and when it is analyzed
- human/hardware/software problems
- Noisy data comes from the process of data
- collection
- entry
- transmission
- Inconsistent data comes from
- Different data sources
- Functional dependency violation
5Why Is Data Preprocessing Important?
- No quality data, no quality mining results!
- Quality decisions must be based on quality data
- e.g., duplicate or missing data may cause
incorrect or even misleading statistics
- Data warehouse needs consistent integration of
quality data
- Data extraction, cleaning, and transformation
comprise the majority of the work of building a
data warehouse (Bill Inmon)
6Multi-Dimensional Measure of Data Quality
- A well-accepted multidimensional view
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Interpretability
- Accessibility
- Broad categories
- intrinsic, contextual, representational, and
accessibility.
7Major Tasks in Data Preprocessing
- Data cleaning
- Fill in missing values, smooth noisy data,
identify or remove outliers, and resolve
inconsistencies - Data integration
- Integration of multiple databases, data cubes, or
files - Data transformation
- Normalization and aggregation
- Data reduction
- Obtains reduced representation in volume but
produces the same or similar analytical results - Data discretization
- Part of data reduction but with particular
importance, especially for numerical data
8Forms of data preprocessing
9Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary
10Data Cleaning
- Importance
- Data cleaning is one of the three biggest
problems in data warehousing (Ralph Kimball)
- Data cleaning is the number one problem in data
warehousing (DCI survey)
- Data cleaning tasks
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
- Resolve redundancy caused by data integration
11Missing Data
- Data is not always available
- E.g., many tuples have no recorded value for
several attributes, such as customer income in
sales data - Missing data may be due to
- equipment malfunction
- inconsistent with other recorded data and thus
deleted - data not entered due to misunderstanding
- certain data may not be considered important at
the time of entry - not register history or changes of the data
- Missing data may need to be inferred.
12How to Handle Missing Data?
- Ignore the tuple: usually done when class label
is missing (assuming the task is
classification); not effective when the percentage
of missing values per attribute varies
considerably
- Fill in the missing value manually: tedious +
infeasible?
- Fill it in automatically with
- a global constant, e.g., "unknown"; a new
class?!
- the attribute mean
- the attribute mean for all samples belonging to
the same class: smarter
- the most probable value: inference-based, such as
a Bayesian formula or decision tree
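A minimal sketch of the automatic fill-in strategies above (not from the slides; the tiny table, the column names, and the helper name are illustrative assumptions):

```python
# Sketch: fill missing values with a global constant, the attribute mean,
# or the class-conditional mean. Data and column names are made up.
def impute(rows, attr, strategy="mean", class_attr=None, constant="unknown"):
    present = [r[attr] for r in rows if r[attr] is not None]
    overall_mean = sum(present) / len(present) if present else None
    # Per-class means, used by the "class_mean" strategy
    class_means = {}
    if class_attr is not None:
        for label in {r[class_attr] for r in rows}:
            vals = [r[attr] for r in rows
                    if r[class_attr] == label and r[attr] is not None]
            if vals:
                class_means[label] = sum(vals) / len(vals)
    for r in rows:
        if r[attr] is None:
            if strategy == "constant":
                r[attr] = constant
            elif strategy == "class_mean" and class_attr is not None:
                r[attr] = class_means.get(r[class_attr], overall_mean)
            else:  # plain attribute mean
                r[attr] = overall_mean
    return rows

data = [{"income": 30000, "risk": "low"},  {"income": None, "risk": "low"},
        {"income": 90000, "risk": "high"}, {"income": None, "risk": "high"}]
print(impute(data, "income", strategy="class_mean", class_attr="risk"))
```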
13CS590D Data Mining, Chris Clifton
- January 17, 2006
- Data Preparation
14What is Data?
Attributes
- Collection of data objects and their attributes
- An attribute is a property or characteristic of
an object - Examples eye color of a person, temperature,
etc. - Attribute is also known as variable, field,
characteristic, or feature - A collection of attributes describe an object
- Object is also known as record, point, case,
sample, entity, or instance
Objects
15Attribute Values
- Attribute values are numbers or symbols assigned
to an attribute - Distinction between attributes and attribute
values - Same attribute can be mapped to different
attribute values - Example height can be measured in feet or
meters - Different attributes can be mapped to the same
set of values - Example Attribute values for ID and age are
integers - But properties of attribute values can be
different - ID has no limit but age has a maximum and minimum
value
16Measurement of Length
- The way you measure an attribute may not match
the attribute's properties.
17Types of Attributes
- There are different types of attributes
- Nominal
- Examples ID numbers, eye color, zip codes
- Ordinal
- Examples: rankings (e.g., taste of potato chips
on a scale from 1-10), grades, height in {tall,
medium, short}
- Interval
- Examples calendar dates, temperatures in Celsius
or Fahrenheit. - Ratio
- Examples temperature in Kelvin, length, time,
counts
18Properties of Attribute Values
- The type of an attribute depends on which of the
following properties it possesses
- Distinctness: = and ≠
- Order: < and >
- Addition: + and -
- Multiplication: * and /
- Nominal attribute: distinctness
- Ordinal attribute: distinctness and order
- Interval attribute: distinctness, order, and
addition
- Ratio attribute: all 4 properties
19(No Transcript)
20(No Transcript)
21Discrete and Continuous Attributes
- Discrete Attribute
- Has only a finite or countably infinite set of
values - Examples zip codes, counts, or the set of words
in a collection of documents - Often represented as integer variables.
- Note binary attributes are a special case of
discrete attributes - Continuous Attribute
- Has real numbers as attribute values
- Examples temperature, height, or weight.
- Practically, real values can only be measured and
represented using a finite number of digits. - Continuous attributes are typically represented
as floating-point variables.
22Types of data sets
- Record
- Data Matrix
- Document Data
- Transaction Data
- Graph
- World Wide Web
- Molecular Structures
- Ordered
- Spatial Data
- Temporal Data
- Sequential Data
- Genetic Sequence Data
23Important Characteristics of Structured Data
- Dimensionality
- Curse of Dimensionality
- Sparsity
- Only presence counts
- Resolution
- Patterns depend on the scale
24Record Data
- Data that consists of a collection of records,
each of which consists of a fixed set of
attributes
25Data Matrix
- If data objects have the same fixed set of
numeric attributes, then the data objects can be
thought of as points in a multi-dimensional
space, where each dimension represents a distinct
attribute - Such data set can be represented by an m by n
matrix, where there are m rows, one for each
object, and n columns, one for each attribute
26Document Data
- Each document becomes a 'term' vector,
- each term is a component (attribute) of the
vector, - the value of each component is the number of
times the corresponding term occurs in the
document.
27Transaction Data
- A special type of record data, where
- each record (transaction) involves a set of
items. - For example, consider a grocery store. The set
of products purchased by a customer during one
shopping trip constitute a transaction, while the
individual products that were purchased are the
items.
28Graph Data
- Examples Generic graph and HTML Links
29Chemical Data
30Ordered Data
- Sequences of transactions
Items/Events
An element of the sequence
31Ordered Data
32Ordered Data
Average Monthly Temperature of land and ocean
33Data Quality
- What kinds of data quality problems?
- How can we detect problems with the data?
- What can we do about these problems?
- Examples of data quality problems
- Noise and outliers
- missing values
- duplicate data
34Noise
- Noise refers to modification of original values
- Examples: distortion of a person's voice when
talking on a poor phone and snow on a television
screen
Two Sine Waves
Two Sine Waves Noise
35Outliers
- Outliers are data objects with characteristics
that are considerably different than most of the
other data objects in the data set
36Missing Values
- Reasons for missing values
- Information is not collected (e.g., people
decline to give their age and weight) - Attributes may not be applicable to all cases
(e.g., annual income is not applicable to
children) - Handling missing values
- Eliminate Data Objects
- Estimate Missing Values
- Ignore the Missing Value During Analysis
- Replace with all possible values (weighted by
their probabilities)
37Duplicate Data
- Data set may include data objects that are
duplicates, or almost duplicates of one another - Major issue when merging data from heterogeneous
sources - Examples
- Same person with multiple email addresses
- Data cleaning
- Process of dealing with duplicate data issues
38Noisy Data
- Noise: random error or variance in a measured
variable
- Incorrect attribute values may be due to
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitation
- inconsistency in naming convention
- Other data problems which require data cleaning
- duplicate records
- incomplete data
- inconsistent data
39How to Handle Noisy Data?
- Binning method
- first sort data and partition into (equi-depth)
bins - then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc. - Clustering
- detect and remove outliers
- Combined computer and human inspection
- detect suspicious values and check by human
(e.g., deal with possible outliers) - Regression
- smooth by fitting the data into regression
functions
40Simple Discretization Methods Binning
- Equal-width (distance) partitioning
- Divides the range into N intervals of equal size
uniform grid - if A and B are the lowest and highest values of
the attribute, the width of intervals will be W
(B A)/N. - The most straightforward, but outliers may
dominate presentation - Skewed data is not handled well.
- Equal-depth (frequency) partitioning
- Divides the range into N intervals, each
containing approximately same number of samples - Good data scaling
- Managing categorical attributes can be tricky.
41Binning Methods for Data Smoothing
- Sorted data (e.g., by price)
- 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
- Partition into (equi-depth) bins
- Bin 1 4, 8, 9, 15
- Bin 2 21, 21, 24, 25
- Bin 3 26, 28, 29, 34
- Smoothing by bin means
- Bin 1 9, 9, 9, 9
- Bin 2 23, 23, 23, 23
- Bin 3 29, 29, 29, 29
- Smoothing by bin boundaries
- Bin 1 4, 4, 4, 15
- Bin 2 21, 21, 25, 25
- Bin 3 26, 26, 26, 34
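A small sketch (assumed, not part of the original slides) of equi-depth binning with the two smoothing variants, reproducing the price example above:

```python
def equi_depth_bins(values, n_bins):
    """Partition sorted values into n_bins bins of (roughly) equal size."""
    values = sorted(values)
    size = len(values) // n_bins
    return ([values[i * size:(i + 1) * size] for i in range(n_bins - 1)]
            + [values[(n_bins - 1) * size:]])

def smooth_by_means(bins):
    # Replace every value in a bin by the (rounded) bin mean.
    return [[round(sum(b) / len(b)) for _ in b] for b in bins]

def smooth_by_boundaries(bins):
    # Replace each value by the closer of the bin's min/max boundary.
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```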
42Cluster Analysis
43Regression
(Figure: data points in the x-y plane with a fitted regression line y = x + 1)
44Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary
45Data Integration
- Data integration
- combines data from multiple sources into a
coherent store - Schema integration
- integrate metadata from different sources
- Entity identification problem: identify real
world entities from multiple data sources, e.g.,
A.cust-id ≡ B.cust-#
- Detecting and resolving data value conflicts
- for the same real world entity, attribute values
from different sources are different - possible reasons different representations,
different scales, e.g., metric vs. British units
46Handling Redundancy in Data Integration
- Redundant data occur often when integration of
multiple databases - The same attribute may have different names in
different databases - One attribute may be a derived attribute in
another table, e.g., annual revenue
- Redundant data can often be detected by
correlation analysis
- Careful integration of the data from multiple
sources may help reduce/avoid redundancies and
inconsistencies and improve mining speed and
quality
47Data Transformation
- Smoothing remove noise from data
- Aggregation summarization, data cube
construction - Generalization concept hierarchy climbing
- Normalization scaled to fall within a small,
specified range - min-max normalization
- z-score normalization
- normalization by decimal scaling
- Attribute/feature construction
- New attributes constructed from the given ones
48CS590D Data Mining, Chris Clifton
- January 18, 2005
- Data Preparation
49Data Transformation Normalization
- min-max normalization:
v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
- z-score normalization:
v' = (v - mean_A) / stand_dev_A
- normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
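A brief sketch (assumed) of the three normalizations matching the formulas above; the income numbers are only illustrative:

```python
import math

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization: map [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Z-score normalization: center on the mean, scale by the standard deviation."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10^j, where j is the smallest integer making every |v'| < 1."""
    j = math.ceil(math.log10(max(abs(v) for v in values) + 1e-12))
    return [v / (10 ** j) for v in values]

# e.g., map an income of 73,600 from the range [12,000, 98,000] onto [0, 1]
print(round(min_max(73600, 12000, 98000), 3))   # 0.716
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
print(decimal_scaling([-986, 917]))             # [-0.986, 0.917]
```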
50Z-Score (Example)
v       v' (z-score)       v       v' (z-score)
(left column: avg = 0.68, sdev = 0.59;  right column: avg = 34.3, sdev = 55.9)
0.18   -0.84               20     -0.26
0.60   -0.14               40      0.11
0.52 -0.27 5 .55
0.25 -0.72 70 4
0.80 0.20 32 -.05
0.55 -0.22 8 -.48
0.92 0.40 5 -.53
0.21 -0.79 15 -.35
0.64 -0.07 250 3.87
0.20 -0.80 32 -.05
0.63 -0.09 18 -.30
0.70 0.04 10 -.44
0.67 -0.02 -14 -.87
0.58 -0.17 22 -.23
0.98 0.50 45 .20
0.81 0.22 60 .47
0.10 -0.97 -5 -.71
0.82 0.24 7 -.49
0.50 -0.30 2 -.58
3.00 3.87 4 -.55
51Data Preprocessing
- Aggregation
- Sampling
- Dimensionality Reduction
- Feature subset selection
- Feature creation
- Discretization and Binarization
- Attribute Transformation
52Aggregation
- Combining two or more attributes (or objects)
into a single attribute (or object) - Purpose
- Data reduction
- Reduce the number of attributes or objects
- Change of scale
- Cities aggregated into regions, states,
countries, etc - More stable data
- Aggregated data tends to have less variability
53Aggregation
Variation of Precipitation in Australia
Standard Deviation of Average Monthly
Precipitation
Standard Deviation of Average Yearly Precipitation
54CS490D Introduction to Data Mining, Chris Clifton
- January 26, 2004
- Data Preparation
55Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary
56Data Reduction Strategies
- A data warehouse may store terabytes of data
- Complex data analysis/mining may take a very long
time to run on the complete data set - Data reduction
- Obtain a reduced representation of the data set
that is much smaller in volume but yet produce
the same (or almost the same) analytical results - Data reduction strategies
- Data cube aggregation
- Dimensionality reduction remove unimportant
attributes - Data Compression
- Numerosity reduction fit data into models
- Discretization and concept hierarchy generation
57Data Cube Aggregation
- The lowest level of a data cube
- the aggregated data for an individual entity of
interest - e.g., a customer in a phone calling data
warehouse. - Multiple levels of aggregation in data cubes
- Further reduce the size of data to deal with
- Reference appropriate levels
- Use the smallest representation which is enough
to solve the task - Queries regarding aggregated information should
be answered using data cube, when possible
58Dimensionality Reduction
- Feature selection (i.e., attribute subset
selection) - Select a minimum set of features such that the
probability distribution of different classes
given the values for those features is as close
as possible to the original distribution given
the values of all features - reduce of patterns in the patterns, easier to
understand - Heuristic methods (due to exponential of
choices) - step-wise forward selection
- step-wise backward elimination
- combining forward selection and backward
elimination - decision-tree induction
59Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
(Figure: induced decision tree with A4? at the root, A1? and A6? as internal
nodes, and Class 1 / Class 2 leaves)
Reduced attribute set: {A1, A4, A6}
60Heuristic Feature Selection Methods
- There are 2^d possible sub-features of d features
- Several heuristic feature selection methods
- Best single features under the feature
independence assumption choose by significance
tests. - Best step-wise feature selection
- The best single-feature is picked first
- Then the next best feature conditioned on the first,
... - Step-wise feature elimination
- Repeatedly eliminate the worst feature
- Best combined feature selection and elimination
- Optimal branch and bound
- Use feature elimination and backtracking
61Data Compression
- String compression
- There are extensive theories and well-tuned
algorithms - Typically lossless
- But only limited manipulation is possible without
expansion - Audio/video compression
- Typically lossy compression, with progressive
refinement - Sometimes small fragments of signal can be
reconstructed without reconstructing the whole - Time sequence is not audio
- Typically short and vary slowly with time
62Data Compression
Original Data
Compressed Data
lossless
Original Data Approximated
lossy
63Wavelet Transformation
- Discrete wavelet transform (DWT) linear signal
processing, multiresolutional analysis - Compressed approximation store only a small
fraction of the strongest of the wavelet
coefficients - Similar to discrete Fourier transform (DFT), but
better lossy compression, localized in space - Method
- Length, L, must be an integer power of 2 (padding
with 0s, when necessary) - Each transform has 2 functions smoothing,
difference - Applies to pairs of data, resulting in two set of
data of length L/2 - Applies two functions recursively, until reaches
the desired length
64DWT for Image Compression
(Figure: recursive wavelet decomposition of an image; at each level the output of
the low-pass filter is split again into low-pass and high-pass components)
65Curse of Dimensionality
- When dimensionality increases, data becomes
increasingly sparse in the space that it occupies - Definitions of density and distance between
points, which is critical for clustering and
outlier detection, become less meaningful
- Randomly generate 500 points
- Compute difference between max and min distance
between any pair of points
66Dimensionality Reduction
- Purpose
- Avoid curse of dimensionality
- Reduce amount of time and memory required by data
mining algorithms - Allow data to be more easily visualized
- May help to eliminate irrelevant features or
reduce noise - Techniques
- Principal Component Analysis
- Singular Value Decomposition
- Others supervised and non-linear techniques
67Principal Component Analysis
(Figure: data in the X1-X2 plane with principal component axes Y1 and Y2)
68Dimensionality Reduction PCA
- Goal is to find a projection that captures the
largest amount of variation in data
69Dimensionality Reduction PCA
- Find the eigenvectors of the covariance matrix
- The eigenvectors define the new space
70Principal Component Analysis
- Given N data vectors from k dimensions, find c <= k
orthogonal vectors that can be best used to
represent data - The original data set is reduced to one
consisting of N data vectors on c principal
components (reduced dimensions) - Each data vector is a linear combination of the c
principal component vectors - Works for numeric data only
- Used when the number of dimensions is large
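A compact sketch (assumed) of PCA via the eigenvectors of the covariance matrix, using NumPy; the synthetic data is illustrative only:

```python
import numpy as np

def pca(X, c):
    """Project the rows of X (N x k) onto the top-c principal components."""
    X_centered = X - X.mean(axis=0)          # PCA works on mean-centered data
    cov = np.cov(X_centered, rowvar=False)   # k x k covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1][:c]    # indices of the top-c eigenvalues
    components = eigvecs[:, order]           # k x c projection matrix
    return X_centered @ components, components

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X[:, 1] = 2 * X[:, 0] + 0.1 * X[:, 1]        # make two attributes correlated
scores, comps = pca(X, c=2)
print(scores.shape, comps.shape)             # (500, 2) (5, 2)
```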
71Dimensionality Reduction PCA
72Dimensionality Reduction ISOMAP
- Construct a neighbourhood graph
- For each pair of points in the graph, compute the
shortest path distances (geodesic distances)
By Tenenbaum, de Silva, Langford (2000)
73Feature Subset Selection
- Another way to reduce dimensionality of data
- Redundant features
- duplicate much or all of the information
contained in one or more other attributes - Example purchase price of a product and the
amount of sales tax paid - Irrelevant features
- contain no information that is useful for the
data mining task at hand - Example students' ID is often irrelevant to the
task of predicting students' GPA
74Feature Subset Selection
- Techniques
- Brute-force approach
- Try all possible feature subsets as input to data
mining algorithm - Embedded approaches
- Feature selection occurs naturally as part of
the data mining algorithm - Filter approaches
- Features are selected before data mining
algorithm is run - Wrapper approaches
- Use the data mining algorithm as a black box to
find best subset of attributes
75CS590D Data Mining, Chris Clifton
- January 24, 2006
- Data Preparation
76Feature Creation
- Create new attributes that can capture the
important information in a data set much more
efficiently than the original attributes - Three general methodologies
- Feature Extraction
- domain-specific
- Mapping Data to New Space
- Feature Construction
- combining features
77Mapping Data to a New Space
Fourier transform Wavelet transform
Two Sine Waves
Two Sine Waves Noise
Frequency
78Discretization Using Class Labels
3 categories for both x and y
5 categories for both x and y
79Discretization Without Using Class Labels
Data
Equal interval width
Equal frequency
K-means
80Attribute Transformation
- A function that maps the entire set of values of
a given attribute to a new set of replacement
values such that each old value can be identified
with one of the new values
- Simple functions: x^k, log(x), e^x, |x|
- Standardization and Normalization
81Numerosity Reduction
- Parametric methods
- Assume the data fits some model, estimate model
parameters, store only the parameters, and
discard the data (except possible outliers)
- Log-linear models: obtain the value at a point in m-D
space as the product of appropriate marginal
subspaces - Non-parametric methods
- Do not assume models
- Major families histograms, clustering, sampling
82Regression and Log-Linear Models
- Linear regression Data are modeled to fit a
straight line - Often uses the least-square method to fit the
line - Multiple regression allows a response variable Y
to be modeled as a linear function of
multidimensional feature vector - Log-linear model approximates discrete
multidimensional probability distributions
83Regress Analysis and Log-Linear Models
- Linear regression: Y = α + β X
- Two parameters, α and β, specify the line and are
to be estimated by using the data at hand
- using the least squares criterion on the known
values of Y1, Y2, ..., X1, X2, ...
- Multiple regression: Y = b0 + b1 X1 + b2 X2
- Many nonlinear functions can be transformed into
the above.
- Log-linear models:
- The multi-way table of joint probabilities is
approximated by a product of lower-order tables
- Probability: p(a, b, c, d) ≈ α_ab β_ac γ_ad δ_bcd
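A short sketch (assumed) of fitting these regression models with the least-squares criterion in NumPy; the synthetic data and the helper name least_squares are illustrative:

```python
import numpy as np

def least_squares(X, y):
    """Fit y ~ b0 + b1*x1 + ... by ordinary least squares; returns coefficients."""
    X_design = np.column_stack([np.ones(len(X)), X])      # prepend intercept column
    coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return coeffs                                          # [b0, b1, ..., bp]

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5, size=100)       # true line: y = 3 + 2x + noise
print(least_squares(x.reshape(-1, 1), y))                  # approx [3.0, 2.0]
```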
84Histograms
- A popular data reduction technique
- Divide data into buckets and store average (sum)
for each bucket - Can be constructed optimally in one dimension
using dynamic programming - Related to quantization problems.
85Clustering
- Partition data set into clusters, and one can
store cluster representation only - Can be very effective if data is clustered but
not if data is smeared - Can have hierarchical clustering and be stored in
multi-dimensional index tree structures - There are many choices of clustering definitions
and clustering algorithms, further detailed in
Chapter 8
86Hierarchical Reduction
- Use multi-resolution structure with different
degrees of reduction - Hierarchical clustering is often performed but
tends to define partitions of data sets rather
than clusters - Parametric methods are usually not amenable to
hierarchical representation - Hierarchical aggregation
- An index tree hierarchically divides a data set
into partitions by value range of some attributes - Each partition can be considered as a bucket
- Thus an index tree with aggregates stored at each
node is a hierarchical histogram
87Sampling
- Allow a mining algorithm to run in complexity
that is potentially sub-linear to the size of the
data - Choose a representative subset of the data
- Simple random sampling may have very poor
performance in the presence of skew - Develop adaptive sampling methods
- Stratified sampling
- Approximate the percentage of each class (or
subpopulation of interest) in the overall
database - Used in conjunction with skewed data
- Sampling may not reduce database I/Os (page at a
time).
88Sampling
SRSWOR (simple random sample without
replacement)
SRSWR
89Sampling
Cluster/Stratified Sample
Raw Data
90Sampling
- Sampling is the main technique employed for data
selection. - It is often used for both the preliminary
investigation of the data and the final data
analysis. - Statisticians sample because obtaining the entire
set of data of interest is too expensive or time
consuming. - Sampling is used in data mining because
processing the entire set of data of interest is
too expensive or time consuming.
91Sampling
- The key principle for effective sampling is the
following - using a sample will work almost as well as using
the entire data sets, if the sample is
representative - A sample is representative if it has
approximately the same property (of interest) as
the original set of data
92Types of Sampling
- Simple Random Sampling
- There is an equal probability of selecting any
particular item - Sampling without replacement
- As each item is selected, it is removed from the
population - Sampling with replacement
- Objects are not removed from the population as
they are selected for the sample. - In sampling with replacement, the same object can
be picked up more than once - Stratified sampling
- Split the data into several partitions then draw
random samples from each partition
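A small sketch (assumed) of the three sampling schemes using Python's random module; the toy data and the key/frac parameters are illustrative:

```python
import random
from collections import defaultdict

def srs_without_replacement(data, n):
    return random.sample(data, n)                     # each object picked at most once

def srs_with_replacement(data, n):
    return [random.choice(data) for _ in range(n)]    # duplicates are possible

def stratified_sample(data, key, frac):
    """Draw the same fraction from every stratum (class) defined by key(obj)."""
    strata = defaultdict(list)
    for obj in data:
        strata[key(obj)].append(obj)
    sample = []
    for members in strata.values():
        k = max(1, round(frac * len(members)))
        sample.extend(random.sample(members, k))
    return sample

data = [{"id": i, "cls": "A" if i < 90 else "B"} for i in range(100)]  # skewed: 90 A, 10 B
print(len(srs_without_replacement(data, 10)))
print(len(stratified_sample(data, key=lambda r: r["cls"], frac=0.1)))  # 9 A + 1 B
```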
93Sample Size
8000 points 2000 Points 500 Points
94Sample Size
- What sample size is necessary to get at least one
object from each of 10 groups?
95Similarity and Dissimilarity
- Similarity
- Numerical measure of how alike two data objects
are. - Is higher when objects are more alike.
- Often falls in the range 0,1
- Dissimilarity
- Numerical measure of how different are two data
objects - Lower when objects are more alike
- Minimum dissimilarity is often 0
- Upper limit varies
- Proximity refers to a similarity or dissimilarity
96Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data
objects.
97Euclidean Distance
- Euclidean Distance:
dist(p, q) = sqrt( sum_k (p_k - q_k)^2 )
- Where n is the number of dimensions (attributes)
and p_k and q_k are, respectively, the kth
attributes (components) of data objects p and q
- Standardization is necessary if scales differ.
98Euclidean Distance
Distance Matrix
99Minkowski Distance
- Minkowski Distance is a generalization of
Euclidean Distance:
dist(p, q) = ( sum_k |p_k - q_k|^r )^(1/r)
- Where r is a parameter, n is the number of
dimensions (attributes) and p_k and q_k are,
respectively, the kth attributes (components) of
data objects p and q.
100Minkowski Distance Examples
- r = 1: City block (Manhattan, taxicab, L1 norm)
distance
- A common example of this is the Hamming distance,
which is just the number of bits that are
different between two binary vectors
- r = 2: Euclidean distance
- r → ∞: supremum (L_max norm, L_∞ norm) distance
- This is the maximum difference between any
component of the vectors - Do not confuse r with n, i.e., all these
distances are defined for all numbers of
dimensions.
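A minimal sketch (assumed) of the Minkowski family, with r = 1, 2, and infinity as the special cases above; the example points are made up:

```python
import math

def minkowski(p, q, r):
    """Minkowski distance of order r between two equal-length vectors."""
    if math.isinf(r):
        return max(abs(a - b) for a, b in zip(p, q))         # supremum (L_max) distance
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

p, q = (0, 2), (3, 1)
print(minkowski(p, q, 1))            # 4.0    city block (Manhattan)
print(round(minkowski(p, q, 2), 3))  # 3.162  Euclidean
print(minkowski(p, q, math.inf))     # 3      supremum
```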
101Minkowski Distance
Distance Matrix
102Mahalanobis Distance
mahalanobis(p, q) = (p - q) Σ^(-1) (p - q)^T,
where Σ is the covariance matrix of the input data X
For the red points in the figure, the Euclidean distance is 14.7 and the
Mahalanobis distance is 6.
103Mahalanobis Distance
(Figure: covariance matrix C and three points in the plane)
A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5)
Mahal(A, B) = 5, Mahal(A, C) = 4
104Common Properties of a Distance
- Distances, such as the Euclidean distance, have
some well known properties:
- d(p, q) >= 0 for all p and q, and d(p, q) = 0
only if p = q (Positive definiteness)
- d(p, q) = d(q, p) for all p and q (Symmetry)
- d(p, r) <= d(p, q) + d(q, r) for all points p,
q, and r (Triangle Inequality)
- where d(p, q) is the distance (dissimilarity)
between points (data objects), p and q. - A distance that satisfies these properties is a
metric
105Common Properties of a Similarity
- Similarities, also have some well known
properties:
- s(p, q) = 1 (or maximum similarity) only if p =
q
- s(p, q) = s(q, p) for all p and q (Symmetry)
- where s(p, q) is the similarity between points
(data objects), p and q.
106Similarity Between Binary Vectors
- Common situation is that objects, p and q, have
only binary attributes - Compute similarities using the following
quantities:
- M01 = the number of attributes where p was 0 and
q was 1
- M10 = the number of attributes where p was 1 and
q was 0
- M00 = the number of attributes where p was 0 and
q was 0
- M11 = the number of attributes where p was 1 and
q was 1
- Simple Matching and Jaccard Coefficients
- SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)
- J = number of 11 matches / number of
not-both-zero attribute values
= M11 / (M01 + M10 + M11)
107SMC versus Jaccard Example
- p = 1 0 0 0 0 0 0 0 0 0
- q = 0 0 0 0 0 0 1 0 0 1
- M01 = 2 (the number of attributes where p was 0
and q was 1)
- M10 = 1 (the number of attributes where p was 1
and q was 0)
- M00 = 7 (the number of attributes where p was 0
and q was 0)
- M11 = 0 (the number of attributes where p was 1
and q was 1)
- SMC = (M11 + M00) / (M01 + M10 + M11 + M00)
= (0 + 7) / (2 + 1 + 0 + 7) = 0.7
- J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0)
= 0
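A short sketch (assumed) that computes SMC and Jaccard for the binary vectors of this example:

```python
def smc_and_jaccard(p, q):
    """Simple matching coefficient and Jaccard coefficient for 0/1 vectors."""
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    jaccard = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) else 0.0
    return smc, jaccard

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_and_jaccard(p, q))   # (0.7, 0.0)
```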
108Cosine Similarity
- If d1 and d2 are two document vectors, then
- cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d||
is the length of vector d
- Example:
- d1 = 3 2 0 5 0 0 0 2 0 0
- d2 = 1 0 0 0 0 0 0 1 0 2
- d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 +
0*0 + 2*1 + 0*0 + 0*2 = 5
- ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 +
2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
- ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 +
1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
- cos(d1, d2) = 0.3150
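A matching sketch (assumed) for cosine similarity of the two document vectors above:

```python
import math

def cosine(d1, d2):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(d1, d2), 4))   # 0.315
```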
109Extended Jaccard Coefficient (Tanimoto)
- Variation of Jaccard for continuous or count
attributes - Reduces to Jaccard for binary attributes
110Correlation
- Correlation measures the linear relationship
between objects - To compute correlation, we standardize data
objects, p and q, and then take their dot product
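A tiny sketch (assumed) that standardizes two objects and combines the products, i.e., Pearson correlation; the sample vectors are illustrative:

```python
import math

def correlation(p, q):
    """Pearson correlation: standardize both vectors, then average the products."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    sp = math.sqrt(sum((a - mp) ** 2 for a in p) / (n - 1))
    sq = math.sqrt(sum((b - mq) ** 2 for b in q) / (n - 1))
    ps = [(a - mp) / sp for a in p]
    qs = [(b - mq) / sq for b in q]
    return sum(a * b for a, b in zip(ps, qs)) / (n - 1)

print(round(correlation([1, 2, 3, 4], [2, 4, 6, 8]), 3))   # 1.0
print(round(correlation([1, 2, 3, 4], [8, 6, 4, 2]), 3))   # -1.0
```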
111Visually Evaluating Correlation
Scatter plots showing correlations from -1 to 1.
112General Approach for Combining Similarities
- Sometimes attributes are of many different types,
but an overall similarity is needed.
113Using Weights to Combine Similarities
- May not want to treat all attributes the same.
- Use weights wk which are between 0 and 1 and sum
to 1.
114Density
- Density-based clustering requires a notion of
density
- Euclidean density
- Euclidean density number of points per unit
volume - Probability density
- Graph-based density
115Euclidean Density Cell-based
- Simplest approach is to divide region into a
number of rectangular cells of equal volume and
define density as the number of points the cell contains
116Euclidean Density Center-based
- Euclidean density is the number of points within
a specified radius of the point
117Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary
118Discretization
- Three types of attributes
- Nominal values from an unordered set
- Ordinal values from an ordered set
- Continuous real numbers
- Discretization
- divide the range of a continuous attribute into
intervals - Some classification algorithms only accept
categorical attributes. - Reduce data size by discretization
- Prepare for further analysis
119Discretization and Concept Hierarchy
- Discretization
- reduce the number of values for a given
continuous attribute by dividing the range of the
attribute into intervals. Interval labels can
then be used to replace actual data values - Concept hierarchies
- reduce the data by collecting and replacing low
level concepts (such as numeric values for the
attribute age) by higher level concepts (such as
young, middle-aged, or senior)
120CS490D Introduction to Data Mining, Chris Clifton
- January 28, 2004
- Data Preparation
121Discretization and Concept Hierarchy Generation
for Numeric Data
- Binning (see sections before)
- Histogram analysis (see sections before)
- Clustering analysis (see sections before)
- Entropy-based discretization
- Segmentation by natural partitioning
122Definition of Entropy
- Entropy: H(X) = - sum_x P(x) log2 P(x)
- Example: coin flip
- X = {heads, tails}
- P(heads) = P(tails) = 1/2
- -1/2 log2(1/2) - 1/2 log2(1/2) = 1
- H(X) = 1
- What about a two-headed coin?
- Conditional Entropy: H(Y|X) = sum_x P(x) H(Y | X = x)
123Entropy-Based Discretization
- Given a set of samples S, if S is partitioned
into two intervals S1 and S2 using boundary T,
the entropy after partitioning is
E(S, T) = |S1|/|S| Ent(S1) + |S2|/|S| Ent(S2)
- The boundary that minimizes the entropy function
over all possible boundaries is selected as a
binary discretization
- The process is recursively applied to partitions
obtained until some stopping criterion is met,
e.g., Ent(S) - E(S, T) > δ
- Experiments show that it may reduce data size and
improve classification accuracy
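A compact sketch (assumed) of one step of entropy-based discretization: try every candidate boundary between sorted values and keep the one minimizing the weighted entropy. The age/label data is made up:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return the boundary T minimizing |S1|/|S|*Ent(S1) + |S2|/|S|*Ent(S2)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_e = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no boundary between identical values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs[:i]]
        right = [lab for v, lab in pairs[i:]]
        e = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

ages = [23, 25, 30, 41, 44, 52, 60, 63]
buys = ["no", "no", "no", "yes", "yes", "yes", "no", "no"]
print(best_split(ages, buys))
```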
124Segmentation by Natural Partitioning
- A simple 3-4-5 rule can be used to segment
numeric data into relatively uniform, natural
intervals. - If an interval covers 3, 6, 7 or 9 distinct
values at the most significant digit, partition
the range into 3 equi-width intervals - If it covers 2, 4, or 8 distinct values at the
most significant digit, partition the range into
4 intervals - If it covers 1, 5, or 10 distinct values at the
most significant digit, partition the range into
5 intervals
125Example of 3-4-5 Rule
(Figure: worked example of the 3-4-5 rule; by Step 4 the range (-$400 - $5,000)
has been partitioned into natural intervals)
126Concept Hierarchy Generation for Categorical Data
- Specification of a partial ordering of attributes
explicitly at the schema level by users or
experts - streetltcityltstateltcountry
- Specification of a portion of a hierarchy by
explicit data grouping - Urbana, Champaign, ChicagoltIllinois
- Specification of a set of attributes.
- System automatically generates partial ordering
by analysis of the number of distinct values - E.g., street lt city ltstate lt country
- Specification of only a partial set of attributes
- E.g., only street < city, not others
127Automatic Concept Hierarchy Generation
- Some concept hierarchies can be automatically
generated based on the analysis of the number of
distinct values per attribute in the given data
set - The attribute with the most distinct values is
placed at the lowest level of the hierarchy
- Note: Exception - weekday, month, quarter, year
(Figure: generated hierarchy, from top to bottom)
country (15 distinct values)
province_or_state (65 distinct values)
city (3,567 distinct values)
street (674,339 distinct values)
128Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary
129Summary
- Data preparation is a big issue for both
warehousing and mining - Data preparation includes
- Data cleaning and data integration
- Data reduction and feature selection
- Discretization
- A lot of methods have been developed, but this is still an
active area of research
130References
- E. Rahm and H. H. Do. Data Cleaning: Problems and
Current Approaches. IEEE Bulletin of the
Technical Committee on Data Engineering, Vol. 23,
No. 4.
- D. P. Ballou and G. K. Tayi. Enhancing data
quality in data warehouse environments.
Communications of the ACM, 42:73-78, 1999.
- H. V. Jagadish et al., Special Issue on Data
Reduction Techniques. Bulletin of the Technical
Committee on Data Engineering, 20(4), December
1997.
- A. Maydanchik, Challenges of Efficient Data
Cleansing (DM Review - Data Quality resource
portal).
- D. Pyle. Data Preparation for Data Mining. Morgan
Kaufmann, 1999.
- D. Quass. A Framework for Research in Data
Cleaning. (Draft, 1999)
- V. Raman and J. Hellerstein. Potter's Wheel: An
Interactive Framework for Data Cleaning and
Transformation, VLDB 2001.
- T. Redman. Data Quality: Management and
Technology. Bantam Books, New York, 1992.
- Y. Wand and R. Wang. Anchoring data quality
dimensions in ontological foundations.
Communications of the ACM, 39:86-95, 1996.
- R. Wang, V. Storey, and C. Firth. A framework for
analysis of data quality research. IEEE Trans.
Knowledge and Data Engineering, 7:623-640, 1995.
- http://www.cs.ucla.edu/classes/spring01/cs240b/notes/data-integration1.pdf
131CS590D Data Mining, Chris Clifton
- January 20, 2005
- Data Cubes
132A Sample Data Cube
Total annual sales of TV in U.S.A.
133Cuboids Corresponding to the Cube
all
0-D(apex) cuboid
country
product
date
1-D cuboids
product,date
product,country
date, country
2-D cuboids
3-D(base) cuboid
product, date, country
134Browsing a Data Cube
- Visualization
- OLAP capabilities
- Interactive manipulation
135Typical OLAP Operations
- Roll up (drill-up) summarize data
- by climbing up hierarchy or by dimension
reduction - Drill down (roll down) reverse of roll-up
- from higher level summary to lower level summary
or detailed data, or introducing new dimensions - Slice and dice
- project and select
- Pivot (rotate)
- reorient the cube, visualization, 3D to series of
2D planes. - Other operations
- drill across involving (across) more than one
fact table - drill through through the bottom level of the
cube to its back-end relational tables (using SQL)
136A Star-Net Query Model
(Figure: star-net query model. Each radial line is a dimension - Customer Orders,
Shipping Method, Time, Product, Location, Organization, Promotion - and each
circle on a line is called a footprint. Footprints shown include AIR-EXPRESS and
TRUCK on Shipping Method; DAILY, QTRLY, ANNUALLY on Time; PRODUCT ITEM, PRODUCT
GROUP, PRODUCT LINE on Product; ORDER and CONTRACTS on Customer; and CITY,
COUNTRY, DISTRICT, REGION, DIVISION, SALES PERSON on the Location and
Organization lines.)
137Efficient Data Cube Computation
- Data cube can be viewed as a lattice of cuboids
- The bottom-most cuboid is the base cuboid
- The top-most cuboid (apex) contains only one cell
- How many cuboids in an n-dimensional cube with L
levels? - Materialization of data cube
- Materialize every (cuboid) (full
materialization), none (no materialization), or
some (partial materialization) - Selection of which cuboids to materialize
- Based on size, sharing, access frequency, etc.
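For the cuboid-count question above, the usual formula (stated here as a reminder; it is not spelled out on the slide) is T = (L1 + 1) x (L2 + 1) x ... x (Ln + 1), where Li is the number of levels of dimension i excluding the virtual top level "all". For example, three dimensions with 4, 3, and 4 levels give (4 + 1)(3 + 1)(4 + 1) = 100 cuboids.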
138Cube Operation
- Cube definition and computation in DMQL
- define cube sales [item, city, year]:
sum(sales_in_dollars)
- compute cube sales
- Transform it into a SQL-like language (with a new
operator cube by, introduced by Gray et al. '96)
- SELECT item, city, year, SUM (amount)
- FROM SALES
- CUBE BY item, city, year
- Compute the following Group-Bys
- (date, product, customer),
- (date,product),(date, customer), (product,
customer), - (date), (product), (customer)
- ()
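A tiny sketch (assumed) that enumerates the 2^n group-bys a CUBE BY over n attributes expands to:

```python
from itertools import combinations

def cube_groupbys(dims):
    """All 2^n group-by attribute sets generated by CUBE BY dims."""
    return [combo for r in range(len(dims), -1, -1)
            for combo in combinations(dims, r)]

print(cube_groupbys(("item", "city", "year")))
# [('item', 'city', 'year'), ('item', 'city'), ('item', 'year'), ('city', 'year'),
#  ('item',), ('city',), ('year',), ()]
```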
139Cube Computation ROLAP-Based Method
- Efficient cube computation methods
- ROLAP-based cubing algorithms (Agarwal et al. '96)
- Array-based cubing algorithm (Zhao et al. '97)
- Bottom-up computation method (Beyer &
Ramakrishnan '99)
- H-cubing technique (Han, Pei, Dong & Wang,
SIGMOD '01)
- ROLAP-based cubing algorithms
- Sorting, hashing, and grouping operations are
applied to the dimension attributes in order to
reorder and cluster related tuples - Grouping is performed on some sub-aggregates as a
partial grouping step - Aggregates may be computed from previously
computed aggregates, rather than from the base
fact table
140Cube Computation ROLAP-Based Method (2)
- This is not in the textbook but in a research
paper - Hash/sort based methods (Agarwal et. al. VLDB96)
- Smallest-parent computing a cuboid from the
smallest, previously computed cuboid - Cache-results caching results of a cuboid from
which other cuboids are computed to reduce disk
I/Os - Amortize-scans computing as many as possible
cuboids at the same time to amortize disk reads
- Share-sorts: sharing sorting costs across
multiple cuboids when a sort-based method is used
- Share-partitions: sharing the partitioning cost
across multiple cuboids when hash-based
algorithms are used
141Multi-way Array Aggregation for Cube Computation
- Partition arrays into chunks (a small subcube
which fits in memory). - Compressed sparse array addressing (chunk_id,
offset) - Compute aggregates in multiway by visiting cube
cells in the order which minimizes the number of times
to visit each cell, and reduces memory access and
storage cost.
What is the best traversing order to do multi-way
aggregation?
142Multi-way Array Aggregation for Cube Computation
143Multi-way Array Aggregation for Cube Computation
(Figure: a 3-D array with dimensions A (a0-a3), B (b0-b3), and C (c0-c3),
partitioned into 64 chunks numbered 1-64; the multi-way method scans the chunks
in numbering order so that partial aggregates for the smaller planes stay in
memory)
144Multi-Way Array Aggregation for Cube Computation
(Cont.)
- Method the planes should be sorted and computed
according to their size in ascending order. - See the details of Example 2.12 (pp. 75-78)
- Idea keep the smallest plane in the main memory,
fetch and compute only one chunk at a time for
the largest plane - Limitation of the method computing well only for
a small number of dimensions - If there are a large number of dimensions,
bottom-up computation and iceberg cube
computation methods can be explored
145Indexing OLAP Data Bitmap Index
- Index on a particular column
- Each value in the column has a bit vector; bit
operations are fast
- The length of the bit vector: the number of records
in the base table
- The i-th bit is set if the i-th row of the base
table has the value for the indexed column
- Not suitable for high-cardinality domains
Base table
Index on Region
Index on Type
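A minimal sketch (assumed) of a bitmap index on one column, with one bit vector per distinct value; the sample Region and Type columns are made up:

```python
def bitmap_index(column):
    """Build {value: bit-vector-as-int} for a low-cardinality column."""
    index = {}
    for i, value in enumerate(column):
        index.setdefault(value, 0)
        index[value] |= 1 << i          # set bit i for this row
    return index

region = ["Asia", "Europe", "Asia", "America", "Europe"]
idx = bitmap_index(region)
print({k: bin(v) for k, v in idx.items()})

# AND/OR of bit vectors answers multi-column predicates quickly, e.g.:
type_idx = bitmap_index(["Retail", "Dealer", "Dealer", "Retail", "Dealer"])
print(bin(idx["Europe"] & type_idx["Dealer"]))   # rows that are Europe AND Dealer
```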
146Indexing OLAP Data Join Indices
- Join index: JI(R-id, S-id) where R(R-id, ...) ⋈ S(S-id, ...)
- Traditional indices map the values to a list of
record ids
- It materializes a relational join in the JI file and
speeds up relational join, a rather costly
operation
- In data warehouses, a join index relates the values
of the dimensions of a star schema to rows in
the fact table
- E.g., fact table Sales and two dimensions city
and product - A join index on city maintains for each distinct
city a list of R-IDs of the tuples recording the
Sales in the city - Join indices can span multiple dimensions
147Efficient Processing OLAP Queries
- Determine which operations should be performed on
the available cuboids - transform drill, roll, etc. into corresponding
SQL and/or OLAP operations, e.g., dice = selection +
projection - Determine to which materialized cuboid(s) the
relevant operations should be applied. - Exploring indexing structures and compressed vs.
dense array structures in MOLAP
148Iceberg Cube
- Computing only the cuboid cells whose count or
other aggregates satisfy the condition
- HAVING COUNT(*) >= minsup
- Motivation
- Only a small portion of cube cells may be above
the water in a sparse cube
- Only calculate interesting data - data above a
certain threshold
- Suppose 100 dimensions, only 1 base cell. How
many aggregate (non-base) cells if count >= 1?
What about count >= 2?
149Bottom-Up Computation (BUC)
- BUC (Beyer & Ramakrishnan, SIGMOD '99)
- Bottom-up vs. top-down? Depends on how you view
it!
- Apriori property
- Aggregate the data,
then move to the next level
- If minsup is not met, stop!
- If minsup = 1 ⇒ compute the full CUBE!
150Partitioning
- Usually, the entire data set can't fit in main
memory
- Sort distinct values, partition into blocks that
fit - Continue processing
- Optimizations
- Partitioning
- External Sorting, Hashing, Counting Sort
- Ordering dimensions to encourage pruning
- Cardinality, Skew, Correlation
- Collapsing duplicates
- Can't do holistic aggregates anymore!
151Drawbacks of BUC
- Requires a significant amount of memory
- On par with most other CUBE algorithms though
- Does not obtain good performance with dense CUBEs
- Overly skewed data or a bad choice of dimension
ordering reduces performance - Cannot compute iceberg cubes with complex
measures
- CREATE CUBE Sales_Iceberg AS
- SELECT month, city, cust_grp,
- AVG(price), COUNT(*)
- FROM Sales_Infor
- CUBE BY month, city, cust_grp
- HAVING AVG(price) >= 800 AND
- COUNT(*) >= 50
152Non-Anti-Monotonic Measures
CREATE CUBE Sales_Iceberg AS SELECT month, city,
cust_grp, AVG(price), COUNT(*) FROM
Sales_Infor CUBE BY month, city, cust_grp HAVING
AVG(price) >= 800 AND COUNT(*) >= 50
- The cubing query with avg is non-anti-monotonic!
- (Mar, *, *, 600, 1800) fails the HAVING clause
- (Mar, *, Bus, 1300, 360) passes the clause
Month City Cust_grp Prod Cost Price
Jan Tor Edu Printer 500 485
Jan Tor Hld TV 800 1200
Jan Tor Edu Camera 1160 1280
Feb Mon Bus Laptop 1500 2500
Mar Van Edu HD 540 520
153Top-k Average
- Let (*, Van, *) cover 1,000 records
- Avg(price) is the average price of those 1,000
sales
- Avg50(price) is the average price of the top-50
sales (top-50 according to the sales price)
- Top-k average is anti-monotonic
- The top 50 sales in Van. having avg(price) <
800 ⇒ the top 50 deals in Van. during Feb. must
have avg(price) < 800
154Binning for Top-k Average
- Computing top-k avg is costly with large k
- Binning idea
- Avg50(c) >= 800
- Large value collapsing: use a sum and a count to
summarize records with measure >= 800
- If that count is already large enough, no need to
check small records
- Small value binning: a group of bins
- One bin covers a range, e.g., 600-800, 400-600,
etc.
- Register a sum and a count for each bin
155Approximate top-k average
Suppose for (*, Van, *), we have:
Approximate avg50() = (28000 + 10600 + 600 * 15) / 50 = 952
Range      Sum     Count
Over 800   28000   20
600-800    10600   15
400-600    15200   30
Top 50
The cell may pass the HAVING clause
156Quant-info for Top-k Average Binning
- Accumulate quant-info for cells to compute
average iceberg cubes efficiently - Three pieces sum, count, top-k bins
- Use top-k bins to estimate/prune descendants
- Use sum and count to consolidate current cell
- Approximate avg50(): anti-monotonic, can be computed efficiently (strongest)
- Real avg50(): anti-monotonic, but computationally costly
- avg(): not anti-monotonic (weakest)
157An