Title: CS590D: Data Mining Chris Clifton
1CS590D Data Mining, Chris Clifton
- January 14, 2006
- Data Preparation
2Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary
3Why Data Preprocessing?
- Data in the real world is dirty
- incomplete: lacking attribute values, lacking
certain attributes of interest, or containing
only aggregate data
- e.g., occupation = ""
- noisy: containing errors or outliers
- e.g., Salary = "-10"
- inconsistent: containing discrepancies in codes
or names
- e.g., Age = "42", Birthday = "03/07/1997"
- e.g., Was rating "1, 2, 3", now rating "A, B, C"
- e.g., discrepancy between duplicate records
4Why Is Data Dirty?
- Incomplete data comes from
- n/a data value when collected
- different considerations between the time when the
data was collected and when it is analyzed
- human/hardware/software problems
- Noisy data comes from the process of data
- collection
- entry
- transmission
- Inconsistent data comes from
- Different data sources
- Functional dependency violation
5Why Is Data Preprocessing Important?
- No quality data, no quality mining results!
- Quality decisions must be based on quality data
- e.g., duplicate or missing data may cause
incorrect or even misleading statistics
- Data warehouse needs consistent integration of
quality data
- Data extraction, cleaning, and transformation
comprise the majority of the work of building a
data warehouse (Bill Inmon)
6Multi-Dimensional Measure of Data Quality
- A well-accepted multidimensional view
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Interpretability
- Accessibility
- Broad categories
- intrinsic, contextual, representational, and
accessibility.
7Major Tasks in Data Preprocessing
- Data cleaning
- Fill in missing values, smooth noisy data,
identify or remove outliers, and resolve
inconsistencies - Data integration
- Integration of multiple databases, data cubes, or
files - Data transformation
- Normalization and aggregation
- Data reduction
- Obtains reduced representation in volume but
produces the same or similar analytical results - Data discretization
- Part of data reduction but with particular
importance, especially for numerical data
8Forms of data preprocessing
9Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary
10Data Cleaning
- Importance
- Data cleaning is one of the three biggest
problems in data warehousing (Ralph Kimball)
- Data cleaning is the number one problem in data
warehousing (DCI survey)
- Data cleaning tasks
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
- Resolve redundancy caused by data integration
11Missing Data
- Data is not always available
- E.g., many tuples have no recorded value for
several attributes, such as customer income in
sales data - Missing data may be due to
- equipment malfunction
- inconsistent with other recorded data and thus
deleted - data not entered due to misunderstanding
- certain data may not be considered important at
the time of entry - not register history or changes of the data
- Missing data may need to be inferred.
12How to Handle Missing Data?
- Ignore the tuple: usually done when class label
is missing (assuming the task is
classification); not effective when the percentage
of missing values per attribute varies
considerably
- Fill in the missing value manually: tedious +
infeasible?
- Fill it in automatically with
- a global constant, e.g., "unknown"; a new
class?!
- the attribute mean
- the attribute mean for all samples belonging to
the same class: smarter
- the most probable value: inference-based, such as
a Bayesian formula or decision tree
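A minimal sketch of the automatic fill-in strategies above (not from the slides; the tiny table, the column names, and the helper name are illustrative assumptions):

```python
# Sketch: fill missing values with a global constant, the attribute mean,
# or the class-conditional mean. Data and column names are made up.
def impute(rows, attr, strategy="mean", class_attr=None, constant="unknown"):
    present = [r[attr] for r in rows if r[attr] is not None]
    overall_mean = sum(present) / len(present) if present else None
    # Per-class means, used by the "class_mean" strategy
    class_means = {}
    if class_attr is not None:
        for label in {r[class_attr] for r in rows}:
            vals = [r[attr] for r in rows
                    if r[class_attr] == label and r[attr] is not None]
            if vals:
                class_means[label] = sum(vals) / len(vals)
    for r in rows:
        if r[attr] is None:
            if strategy == "constant":
                r[attr] = constant
            elif strategy == "class_mean" and class_attr is not None:
                r[attr] = class_means.get(r[class_attr], overall_mean)
            else:  # plain attribute mean
                r[attr] = overall_mean
    return rows

data = [{"income": 30000, "risk": "low"},  {"income": None, "risk": "low"},
        {"income": 90000, "risk": "high"}, {"income": None, "risk": "high"}]
print(impute(data, "income", strategy="class_mean", class_attr="risk"))
```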
13CS590D Data Mining, Chris Clifton
- January 17, 2006
- Data Preparation
14What is Data?
Attributes
- Collection of data objects and their attributes
- An attribute is a property or characteristic of
an object - Examples eye color of a person, temperature,
etc. - Attribute is also known as variable, field,
characteristic, or feature - A collection of attributes describe an object
- Object is also known as record, point, case,
sample, entity, or instance
Objects
15Attribute Values
- Attribute values are numbers or symbols assigned
to an attribute - Distinction between attributes and attribute
values - Same attribute can be mapped to different
attribute values - Example height can be measured in feet or
meters - Different attributes can be mapped to the same
set of values - Example Attribute values for ID and age are
integers - But properties of attribute values can be
different - ID has no limit but age has a maximum and minimum
value
16Measurement of Length
- The way you measure an attribute may not match
the attribute's properties.
17Types of Attributes
- There are different types of attributes
- Nominal
- Examples ID numbers, eye color, zip codes
- Ordinal
- Examples: rankings (e.g., taste of potato chips
on a scale from 1-10), grades, height in {tall,
medium, short}
- Interval
- Examples calendar dates, temperatures in Celsius
or Fahrenheit. - Ratio
- Examples temperature in Kelvin, length, time,
counts
18Properties of Attribute Values
- The type of an attribute depends on which of the
following properties it possesses
- Distinctness: = and ≠
- Order: < and >
- Addition: + and -
- Multiplication: * and /
- Nominal attribute: distinctness
- Ordinal attribute: distinctness and order
- Interval attribute: distinctness, order, and
addition
- Ratio attribute: all 4 properties
19(No Transcript)
20(No Transcript)
21Discrete and Continuous Attributes
- Discrete Attribute
- Has only a finite or countably infinite set of
values - Examples zip codes, counts, or the set of words
in a collection of documents - Often represented as integer variables.
- Note binary attributes are a special case of
discrete attributes - Continuous Attribute
- Has real numbers as attribute values
- Examples temperature, height, or weight.
- Practically, real values can only be measured and
represented using a finite number of digits. - Continuous attributes are typically represented
as floating-point variables.
22Types of data sets
- Record
- Data Matrix
- Document Data
- Transaction Data
- Graph
- World Wide Web
- Molecular Structures
- Ordered
- Spatial Data
- Temporal Data
- Sequential Data
- Genetic Sequence Data
23Important Characteristics of Structured Data
- Dimensionality
- Curse of Dimensionality
- Sparsity
- Only presence counts
- Resolution
- Patterns depend on the scale
24Record Data
- Data that consists of a collection of records,
each of which consists of a fixed set of
attributes
25Data Matrix
- If data objects have the same fixed set of
numeric attributes, then the data objects can be
thought of as points in a multi-dimensional
space, where each dimension represents a distinct
attribute - Such data set can be represented by an m by n
matrix, where there are m rows, one for each
object, and n columns, one for each attribute
26Document Data
- Each document becomes a 'term' vector,
- each term is a component (attribute) of the
vector, - the value of each component is the number of
times the corresponding term occurs in the
document.
27Transaction Data
- A special type of record data, where
- each record (transaction) involves a set of
items. - For example, consider a grocery store. The set
of products purchased by a customer during one
shopping trip constitute a transaction, while the
individual products that were purchased are the
items.
28Graph Data
- Examples Generic graph and HTML Links
29Chemical Data
30Ordered Data
- Sequences of transactions
Items/Events
An element of the sequence
31Ordered Data
32Ordered Data
Average Monthly Temperature of land and ocean
33Data Quality
- What kinds of data quality problems?
- How can we detect problems with the data?
- What can we do about these problems?
- Examples of data quality problems
- Noise and outliers
- missing values
- duplicate data
34Noise
- Noise refers to modification of original values
- Examples: distortion of a person's voice when
talking on a poor phone and snow on a television
screen
Two Sine Waves
Two Sine Waves Noise
35Outliers
- Outliers are data objects with characteristics
that are considerably different than most of the
other data objects in the data set
36Missing Values
- Reasons for missing values
- Information is not collected (e.g., people
decline to give their age and weight) - Attributes may not be applicable to all cases
(e.g., annual income is not applicable to
children) - Handling missing values
- Eliminate Data Objects
- Estimate Missing Values
- Ignore the Missing Value During Analysis
- Replace with all possible values (weighted by
their probabilities)
37Duplicate Data
- Data set may include data objects that are
duplicates, or almost duplicates of one another - Major issue when merging data from heterogeneous
sources - Examples
- Same person with multiple email addresses
- Data cleaning
- Process of dealing with duplicate data issues
38Noisy Data
- Noise: random error or variance in a measured
variable
- Incorrect attribute values may be due to
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitation
- inconsistency in naming convention
- Other data problems which require data cleaning
- duplicate records
- incomplete data
- inconsistent data
39How to Handle Noisy Data?
- Binning method
- first sort data and partition into (equi-depth)
bins - then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc. - Clustering
- detect and remove outliers
- Combined computer and human inspection
- detect suspicious values and check by human
(e.g., deal with possible outliers) - Regression
- smooth by fitting the data into regression
functions
40Simple Discretization Methods Binning
- Equal-width (distance) partitioning
- Divides the range into N intervals of equal size
uniform grid - if A and B are the lowest and highest values of
the attribute, the width of intervals will be W
(B A)/N. - The most straightforward, but outliers may
dominate presentation - Skewed data is not handled well.
- Equal-depth (frequency) partitioning
- Divides the range into N intervals, each
containing approximately same number of samples - Good data scaling
- Managing categorical attributes can be tricky.
41Binning Methods for Data Smoothing
- Sorted data (e.g., by price)
- 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
- Partition into (equi-depth) bins
- Bin 1 4, 8, 9, 15
- Bin 2 21, 21, 24, 25
- Bin 3 26, 28, 29, 34
- Smoothing by bin means
- Bin 1 9, 9, 9, 9
- Bin 2 23, 23, 23, 23
- Bin 3 29, 29, 29, 29
- Smoothing by bin boundaries
- Bin 1 4, 4, 4, 15
- Bin 2 21, 21, 25, 25
- Bin 3 26, 26, 26, 34
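A small sketch (assumed, not part of the original slides) of equi-depth binning with the two smoothing variants, reproducing the price example above:

```python
def equi_depth_bins(values, n_bins):
    """Partition sorted values into n_bins bins of (roughly) equal size."""
    values = sorted(values)
    size = len(values) // n_bins
    return ([values[i * size:(i + 1) * size] for i in range(n_bins - 1)]
            + [values[(n_bins - 1) * size:]])

def smooth_by_means(bins):
    # Replace every value in a bin by the (rounded) bin mean.
    return [[round(sum(b) / len(b)) for _ in b] for b in bins]

def smooth_by_boundaries(bins):
    # Replace each value by the closer of the bin's min/max boundary.
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```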
42Cluster Analysis
43Regression
(Figure: data points in the x-y plane with a fitted regression line y = x + 1)
44Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary
45Data Integration
- Data integration
- combines data from multiple sources into a
coherent store - Schema integration
- integrate metadata from different sources
- Entity identification problem: identify real
world entities from multiple data sources, e.g.,
A.cust-id ≡ B.cust-#
- Detecting and resolving data value conflicts
- for the same real world entity, attribute values
from different sources are different - possible reasons different representations,
different scales, e.g., metric vs. British units
46Handling Redundancy in Data Integration
- Redundant data occur often when integration of
multiple databases - The same attribute may have different names in
different databases - One attribute may be a derived attribute in
another table, e.g., annual revenue
- Redundant data can often be detected by
correlation analysis
- Careful integration of the data from multiple
sources may help reduce/avoid redundancies and
inconsistencies and improve mining speed and
quality
47Data Transformation
- Smoothing remove noise from data
- Aggregation summarization, data cube
construction - Generalization concept hierarchy climbing
- Normalization scaled to fall within a small,
specified range - min-max normalization
- z-score normalization
- normalization by decimal scaling
- Attribute/feature construction
- New attributes constructed from the given ones
48CS590D Data Mining, Chris Clifton
- January 18, 2005
- Data Preparation
49Data Transformation Normalization
- min-max normalization:
v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
- z-score normalization:
v' = (v - mean_A) / stand_dev_A
- normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
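A brief sketch (assumed) of the three normalizations matching the formulas above; the income numbers are only illustrative:

```python
import math

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization: map [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Z-score normalization: center on the mean, scale by the standard deviation."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10^j, where j is the smallest integer making every |v'| < 1."""
    j = math.ceil(math.log10(max(abs(v) for v in values) + 1e-12))
    return [v / (10 ** j) for v in values]

# e.g., map an income of 73,600 from the range [12,000, 98,000] onto [0, 1]
print(round(min_max(73600, 12000, 98000), 3))   # 0.716
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
print(decimal_scaling([-986, 917]))             # [-0.986, 0.917]
```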
50Z-Score (Example)
v       v' (z-score)       v       v' (z-score)
(left column: avg = 0.68, sdev = 0.59;  right column: avg = 34.3, sdev = 55.9)
0.18   -0.84               20     -0.26
0.60   -0.14               40      0.11
0.52 -0.27 5 .55
0.25 -0.72 70 4
0.80 0.20 32 -.05
0.55 -0.22 8 -.48
0.92 0.40 5 -.53
0.21 -0.79 15 -.35
0.64 -0.07 250 3.87
0.20 -0.80 32 -.05
0.63 -0.09 18 -.30
0.70 0.04 10 -.44
0.67 -0.02 -14 -.87
0.58 -0.17 22 -.23
0.98 0.50 45 .20
0.81 0.22 60 .47
0.10 -0.97 -5 -.71
0.82 0.24 7 -.49
0.50 -0.30 2 -.58
3.00 3.87 4 -.55
51Data Preprocessing
- Aggregation
- Sampling
- Dimensionality Reduction
- Feature subset selection
- Feature creation
- Discretization and Binarization
- Attribute Transformation
52Aggregation
- Combining two or more attributes (or objects)
into a single attribute (or object) - Purpose
- Data reduction
- Reduce the number of attributes or objects
- Change of scale
- Cities aggregated into regions, states,
countries, etc - More stable data
- Aggregated data tends to have less variability
53Aggregation
Variation of Precipitation in Australia
Standard Deviation of Average Monthly
Precipitation
Standard Deviation of Average Yearly Precipitation
54CS490D Introduction to Data Mining, Chris Clifton
- January 26, 2004
- Data Preparation
55Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary
56Data Reduction Strategies
- A data warehouse may store terabytes of data
- Complex data analysis/mining may take a very long
time to run on the complete data set - Data reduction
- Obtain a reduced representation of the data set
that is much smaller in volume but yet produce
the same (or almost the same) analytical results - Data reduction strategies
- Data cube aggregation
- Dimensionality reduction remove unimportant
attributes - Data Compression
- Numerosity reduction fit data into models
- Discretization and concept hierarchy generation
57Data Cube Aggregation
- The lowest level of a data cube
- the aggregated data for an individual entity of
interest - e.g., a customer in a phone calling data
warehouse. - Multiple levels of aggregation in data cubes
- Further reduce the size of data to deal with
- Reference appropriate levels
- Use the smallest representation which is enough
to solve the task - Queries regarding aggregated information should
be answered using data cube, when possible
58Dimensionality Reduction
- Feature selection (i.e., attribute subset
selection) - Select a minimum set of features such that the
probability distribution of different classes
given the values for those features is as close
as possible to the original distribution given
the values of all features - reduce of patterns in the patterns, easier to
understand - Heuristic methods (due to exponential of
choices) - step-wise forward selection
- step-wise backward elimination
- combining forward selection and backward
elimination - decision-tree induction
59Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
(Figure: induced decision tree with A4? at the root, A1? and A6? as internal
nodes, and Class 1 / Class 2 leaves)
Reduced attribute set: {A1, A4, A6}
60Heuristic Feature Selection Methods
- There are 2^d possible sub-features of d features
- Several heuristic feature selection methods
- Best single features under the feature
independence assumption choose by significance
tests. - Best step-wise feature selection
- The best single-feature is picked first
- Then the next best feature conditioned on the first,
... - Step-wise feature elimination
- Repeatedly eliminate the worst feature
- Best combined feature selection and elimination
- Optimal branch and bound
- Use feature elimination and backtracking
61Data Compression
- String compression
- There are extensive theories and well-tuned
algorithms - Typically lossless
- But only limited manipulation is possible without
expansion - Audio/video compression
- Typically lossy compression, with progressive
refinement - Sometimes small fragments of signal can be
reconstructed without reconstructing the whole - Time sequence is not audio
- Typically short and vary slowly with time
62Data Compression
Original Data
Compressed Data
lossless
Original Data Approximated
lossy
63Wavelet Transformation
- Discrete wavelet transform (DWT) linear signal
processing, multiresolutional analysis - Compressed approximation store only a small
fraction of the strongest of the wavelet
coefficients - Similar to discrete Fourier transform (DFT), but
better lossy compression, localized in space - Method
- Length, L, must be an integer power of 2 (padding
with 0s, when necessary) - Each transform has 2 functions smoothing,
difference - Applies to pairs of data, resulting in two set of
data of length L/2 - Applies two functions recursively, until reaches
the desired length
64DWT for Image Compression
(Figure: recursive wavelet decomposition of an image; at each level the output of
the low-pass filter is split again into low-pass and high-pass components)
65Curse of Dimensionality
- When dimensionality increases, data becomes
increasingly sparse in the space that it occupies - Definitions of density and distance between
points, which is critical for clustering and
outlier detection, become less meaningful
- Randomly generate 500 points
- Compute difference between max and min distance
between any pair of points
66Dimensionality Reduction
- Purpose
- Avoid curse of dimensionality
- Reduce amount of time and memory required by data
mining algorithms - Allow data to be more easily visualized
- May help to eliminate irrelevant features or
reduce noise - Techniques
- Principal Component Analysis
- Singular Value Decomposition
- Others supervised and non-linear techniques
67Principal Component Analysis
(Figure: data in the X1-X2 plane with principal component axes Y1 and Y2)
68Dimensionality Reduction PCA
- Goal is to find a projection that captures the
largest amount of variation in data
69Dimensionality Reduction PCA
- Find the eigenvectors of the covariance matrix
- The eigenvectors define the new space
70Principal Component Analysis
- Given N data vectors from k dimensions, find c <= k
orthogonal vectors that can be best used to
represent data - The original data set is reduced to one
consisting of N data vectors on c principal
components (reduced dimensions) - Each data vector is a linear combination of the c
principal component vectors - Works for numeric data only
- Used when the number of dimensions is large
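A compact sketch (assumed) of PCA via the eigenvectors of the covariance matrix, using NumPy; the synthetic data is illustrative only:

```python
import numpy as np

def pca(X, c):
    """Project the rows of X (N x k) onto the top-c principal components."""
    X_centered = X - X.mean(axis=0)          # PCA works on mean-centered data
    cov = np.cov(X_centered, rowvar=False)   # k x k covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1][:c]    # indices of the top-c eigenvalues
    components = eigvecs[:, order]           # k x c projection matrix
    return X_centered @ components, components

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X[:, 1] = 2 * X[:, 0] + 0.1 * X[:, 1]        # make two attributes correlated
scores, comps = pca(X, c=2)
print(scores.shape, comps.shape)             # (500, 2) (5, 2)
```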
71Dimensionality Reduction PCA
72Dimensionality Reduction ISOMAP
- Construct a neighbourhood graph
- For each pair of points in the graph, compute the
shortest path distances (geodesic distances)
By Tenenbaum, de Silva, Langford (2000)
73Feature Subset Selection
- Another way to reduce dimensionality of data
- Redundant features
- duplicate much or all of the information
contained in one or more other attributes - Example purchase price of a product and the
amount of sales tax paid - Irrelevant features
- contain no information that is useful for the
data mining task at hand - Example students' ID is often irrelevant to the
task of predicting students' GPA
74Feature Subset Selection
- Techniques
- Brute-force approach
- Try all possible feature subsets as input to data
mining algorithm - Embedded approaches
- Feature selection occurs naturally as part of
the data mining algorithm - Filter approaches
- Features are selected before data mining
algorithm is run - Wrapper approaches
- Use the data mining algorithm as a black box to
find best subset of attributes
75CS590D Data Mining, Chris Clifton
- January 24, 2006
- Data Preparation
76Feature Creation
- Create new attributes that can capture the
important information in a data set much more
efficiently than the original attributes - Three general methodologies
- Feature Extraction
- domain-specific
- Mapping Data to New Space
- Feature Construction
- combining features
77Mapping Data to a New Space
Fourier transform Wavelet transform
Two Sine Waves
Two Sine Waves Noise
Frequency
78Discretization Using Class Labels
3 categories for both x and y
5 categories for both x and y
79Discretization Without Using Class Labels
Data
Equal interval width
Equal frequency
K-means
80Attribute Transformation
- A function that maps the entire set of values of
a given attribute to a new set of replacement
values such that each old value can be identified
with one of the new values
- Simple functions: x^k, log(x), e^x, |x|
- Standardization and Normalization
81Numerosity Reduction
- Parametric methods
- Assume the data fits some model, estimate model
parameters, store only the parameters, and
discard the data (except possible outliers)
- Log-linear models: obtain the value at a point in m-D
space as the product of appropriate marginal
subspaces - Non-parametric methods
- Do not assume models
- Major families histograms, clustering, sampling
82Regression and Log-Linear Models
- Linear regression Data are modeled to fit a
straight line - Often uses the least-square method to fit the
line - Multiple regression allows a response variable Y
to be modeled as a linear function of
multidimensional feature vector - Log-linear model approximates discrete
multidimensional probability distributions
83Regress Analysis and Log-Linear Models
- Linear regression: Y = α + β X
- Two parameters, α and β, specify the line and are
to be estimated by using the data at hand
- using the least squares criterion on the known
values of Y1, Y2, ..., X1, X2, ...
- Multiple regression: Y = b0 + b1 X1 + b2 X2
- Many nonlinear functions can be transformed into
the above.
- Log-linear models:
- The multi-way table of joint probabilities is
approximated by a product of lower-order tables
- Probability: p(a, b, c, d) ≈ α_ab β_ac γ_ad δ_bcd
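A short sketch (assumed) of fitting these regression models with the least-squares criterion in NumPy; the synthetic data and the helper name least_squares are illustrative:

```python
import numpy as np

def least_squares(X, y):
    """Fit y ~ b0 + b1*x1 + ... by ordinary least squares; returns coefficients."""
    X_design = np.column_stack([np.ones(len(X)), X])      # prepend intercept column
    coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return coeffs                                          # [b0, b1, ..., bp]

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5, size=100)       # true line: y = 3 + 2x + noise
print(least_squares(x.reshape(-1, 1), y))                  # approx [3.0, 2.0]
```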
84Histograms
- A popular data reduction technique
- Divide data into buckets and store average (sum)
for each bucket - Can be constructed optimally in one dimension
using dynamic programming - Related to quantization problems.
85Clustering
- Partition data set into clusters, and one can
store cluster representation only - Can be very effective if data is clustered but
not if data is smeared - Can have hierarchical clustering and be stored in
multi-dimensional index tree structures - There are many choices of clustering definitions
and clustering algorithms, further detailed in
Chapter 8
86Hierarchical Reduction
- Use multi-resolution structure with different
degrees of reduction - Hierarchical clustering is often performed but
tends to define partitions of data sets rather
than clusters - Parametric methods are usually not amenable to
hierarchical representation - Hierarchical aggregation
- An index tree hierarchically divides a data set
into partitions by value range of some attributes - Each partition can be considered as a bucket
- Thus an index tree with aggregates stored at each
node is a hierarchical histogram
87Sampling
- Allow a mining algorithm to run in complexity
that is potentially sub-linear to the size of the
data - Choose a representative subset of the data
- Simple random sampling may have very poor
performance in the presence of skew - Develop adaptive sampling methods
- Stratified sampling
- Approximate the percentage of each class (or
subpopulation of interest) in the overall
database - Used in conjunction with skewed data
- Sampling may not reduce database I/Os (page at a
time).
88Sampling
SRSWOR (simple random sample without
replacement)
SRSWR
89Sampling
Cluster/Stratified Sample
Raw Data
90Sampling
- Sampling is the main technique employed for data
selection. - It is often used for both the preliminary
investigation of the data and the final data
analysis. - Statisticians sample because obtaining the entire
set of data of interest is too expensive or time
consuming. - Sampling is used in data mining because
processing the entire set of data of interest is
too expensive or time consuming.
91Sampling
- The key principle for effective sampling is the
following - using a sample will work almost as well as using
the entire data sets, if the sample is
representative - A sample is representative if it has
approximately the same property (of interest) as
the original set of data
92Types of Sampling
- Simple Random Sampling
- There is an equal probability of selecting any
particular item - Sampling without replacement
- As each item is selected, it is removed from the
population - Sampling with replacement
- Objects are not removed from the population as
they are selected for the sample. - In sampling with replacement, the same object can
be picked up more than once - Stratified sampling
- Split the data into several partitions then draw
random samples from each partition
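A small sketch (assumed) of the three sampling schemes using Python's random module; the toy data and the key/frac parameters are illustrative:

```python
import random
from collections import defaultdict

def srs_without_replacement(data, n):
    return random.sample(data, n)                     # each object picked at most once

def srs_with_replacement(data, n):
    return [random.choice(data) for _ in range(n)]    # duplicates are possible

def stratified_sample(data, key, frac):
    """Draw the same fraction from every stratum (class) defined by key(obj)."""
    strata = defaultdict(list)
    for obj in data:
        strata[key(obj)].append(obj)
    sample = []
    for members in strata.values():
        k = max(1, round(frac * len(members)))
        sample.extend(random.sample(members, k))
    return sample

data = [{"id": i, "cls": "A" if i < 90 else "B"} for i in range(100)]  # skewed: 90 A, 10 B
print(len(srs_without_replacement(data, 10)))
print(len(stratified_sample(data, key=lambda r: r["cls"], frac=0.1)))  # 9 A + 1 B
```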
93Sample Size
8000 points 2000 Points 500 Points
94Sample Size
- What sample size is necessary to get at least one
object from each of 10 groups?
95Similarity and Dissimilarity
- Similarity
- Numerical measure of how alike two data objects
are. - Is higher when objects are more alike.
- Often falls in the range 0,1
- Dissimilarity
- Numerical measure of how different are two data
objects - Lower when objects are more alike
- Minimum dissimilarity is often 0
- Upper limit varies
- Proximity refers to a similarity or dissimilarity
96Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data
objects.
97Euclidean Distance
- Euclidean Distance:
dist(p, q) = sqrt( sum_k (p_k - q_k)^2 )
- Where n is the number of dimensions (attributes)
and p_k and q_k are, respectively, the kth
attributes (components) of data objects p and q
- Standardization is necessary if scales differ.
98Euclidean Distance
Distance Matrix
99Minkowski Distance
- Minkowski Distance is a generalization of
Euclidean Distance:
dist(p, q) = ( sum_k |p_k - q_k|^r )^(1/r)
- Where r is a parameter, n is the number of
dimensions (attributes) and p_k and q_k are,
respectively, the kth attributes (components) of
data objects p and q.
100Minkowski Distance Examples
- r = 1: City block (Manhattan, taxicab, L1 norm)
distance
- A common example of this is the Hamming distance,
which is just the number of bits that are
different between two binary vectors
- r = 2: Euclidean distance
- r → ∞: supremum (L_max norm, L_∞ norm) distance
- This is the maximum difference between any
component of the vectors - Do not confuse r with n, i.e., all these
distances are defined for all numbers of
dimensions.
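A minimal sketch (assumed) of the Minkowski family, with r = 1, 2, and infinity as the special cases above; the example points are made up:

```python
import math

def minkowski(p, q, r):
    """Minkowski distance of order r between two equal-length vectors."""
    if math.isinf(r):
        return max(abs(a - b) for a, b in zip(p, q))         # supremum (L_max) distance
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

p, q = (0, 2), (3, 1)
print(minkowski(p, q, 1))            # 4.0    city block (Manhattan)
print(round(minkowski(p, q, 2), 3))  # 3.162  Euclidean
print(minkowski(p, q, math.inf))     # 3      supremum
```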
101Minkowski Distance
Distance Matrix
102Mahalanobis Distance
mahalanobis(p, q) = (p - q) Σ^(-1) (p - q)^T,
where Σ is the covariance matrix of the input data X
For the red points in the figure, the Euclidean distance is 14.7 and the
Mahalanobis distance is 6.
103Mahalanobis Distance
(Figure: covariance matrix C and three points in the plane)
A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5)
Mahal(A, B) = 5, Mahal(A, C) = 4
104Common Properties of a Distance
- Distances, such as the Euclidean distance, have
some well known properties:
- d(p, q) >= 0 for all p and q, and d(p, q) = 0
only if p = q (Positive definiteness)
- d(p, q) = d(q, p) for all p and q (Symmetry)
- d(p, r) <= d(p, q) + d(q, r) for all points p,
q, and r (Triangle Inequality)
- where d(p, q) is the distance (dissimilarity)
between points (data objects), p and q. - A distance that satisfies these properties is a
metric
105Common Properties of a Similarity
- Similarities, also have some well known
properties:
- s(p, q) = 1 (or maximum similarity) only if p =
q
- s(p, q) = s(q, p) for all p and q (Symmetry)
- where s(p, q) is the similarity between points
(data objects), p and q.
106Similarity Between Binary Vectors
- Common situation is that objects, p and q, have
only binary attributes - Compute similarities using the following
quantities:
- M01 = the number of attributes where p was 0 and
q was 1
- M10 = the number of attributes where p was 1 and
q was 0
- M00 = the number of attributes where p was 0 and
q was 0
- M11 = the number of attributes where p was 1 and
q was 1
- Simple Matching and Jaccard Coefficients
- SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)
- J = number of 11 matches / number of
not-both-zero attribute values
= M11 / (M01 + M10 + M11)
107SMC versus Jaccard Example
- p = 1 0 0 0 0 0 0 0 0 0
- q = 0 0 0 0 0 0 1 0 0 1
- M01 = 2 (the number of attributes where p was 0
and q was 1)
- M10 = 1 (the number of attributes where p was 1
and q was 0)
- M00 = 7 (the number of attributes where p was 0
and q was 0)
- M11 = 0 (the number of attributes where p was 1
and q was 1)
- SMC = (M11 + M00) / (M01 + M10 + M11 + M00)
= (0 + 7) / (2 + 1 + 0 + 7) = 0.7
- J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0)
= 0
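A short sketch (assumed) that computes SMC and Jaccard for the binary vectors of this example:

```python
def smc_and_jaccard(p, q):
    """Simple matching coefficient and Jaccard coefficient for 0/1 vectors."""
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    jaccard = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) else 0.0
    return smc, jaccard

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_and_jaccard(p, q))   # (0.7, 0.0)
```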
108Cosine Similarity
- If d1 and d2 are two document vectors, then
- cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d||
is the length of vector d
- Example:
- d1 = 3 2 0 5 0 0 0 2 0 0
- d2 = 1 0 0 0 0 0 0 1 0 2
- d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 +
0*0 + 2*1 + 0*0 + 0*2 = 5
- ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 +
2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
- ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 +
1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
- cos(d1, d2) = 0.3150
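A matching sketch (assumed) for cosine similarity of the two document vectors above:

```python
import math

def cosine(d1, d2):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(d1, d2), 4))   # 0.315
```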
109Extended Jaccard Coefficient (Tanimoto)
- Variation of Jaccard for continuous or count
attributes - Reduces to Jaccard for binary attributes
110Correlation
- Correlation measures the linear relationship
between objects - To compute correlation, we standardize data
objects, p and q, and then take their dot product
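A tiny sketch (assumed) that standardizes two objects and combines the products, i.e., Pearson correlation; the sample vectors are illustrative:

```python
import math

def correlation(p, q):
    """Pearson correlation: standardize both vectors, then average the products."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    sp = math.sqrt(sum((a - mp) ** 2 for a in p) / (n - 1))
    sq = math.sqrt(sum((b - mq) ** 2 for b in q) / (n - 1))
    ps = [(a - mp) / sp for a in p]
    qs = [(b - mq) / sq for b in q]
    return sum(a * b for a, b in zip(ps, qs)) / (n - 1)

print(round(correlation([1, 2, 3, 4], [2, 4, 6, 8]), 3))   # 1.0
print(round(correlation([1, 2, 3, 4], [8, 6, 4, 2]), 3))   # -1.0
```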
111Visually Evaluating Correlation
Scatter plots showing correlations from -1 to 1.
112General Approach for Combining Similarities
- Sometimes attributes are of many different types,
but an overall similarity is needed.
113Using Weights to Combine Similarities
- May not want to treat all attributes the same.
- Use weights wk which are between 0 and 1 and sum
to 1.
114Density
- Density-based clustering requires a notion of
density
- Euclidean density
- Euclidean density number of points per unit
volume - Probability density
- Graph-based density
115Euclidean Density Cell-based
- Simplest approach is to divide region into a
number of rectangular cells of equal volume and
define density as the number of points the cell contains
116Euclidean Density Center-based
- Euclidean density is the number of points within
a specified radius of the point
117Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary
118Discretization
- Three types of attributes
- Nominal values from an unordered set
- Ordinal values from an ordered set
- Continuous real numbers
- Discretization
- divide the range of a continuous attribute into
intervals - Some classification algorithms only accept
categorical attributes. - Reduce data size by discretization
- Prepare for further analysis
119Discretization and Concept Hierarchy
- Discretization
- reduce the number of values for a given
continuous attribute by dividing the range of the
attribute into intervals. Interval labels can
then be used to replace actual data values - Concept hierarchies
- reduce the data by collecting and replacing low
level concepts (such as numeric values for the
attribute age) by higher level concepts (such as
young, middle-aged, or senior)
120CS490D Introduction to Data Mining, Chris Clifton
- January 28, 2004
- Data Preparation
121Discretization and Concept Hierarchy Generation
for Numeric Data
- Binning (see sections before)
- Histogram analysis (see sections before)
- Clustering analysis (see sections before)
- Entropy-based discretization
- Segmentation by natural partitioning
122Definition of Entropy
- Entropy: H(X) = - sum_x P(x) log2 P(x)
- Example: coin flip
- X = {heads, tails}
- P(heads) = P(tails) = 1/2
- -1/2 log2(1/2) - 1/2 log2(1/2) = 1
- H(X) = 1
- What about a two-headed coin?
- Conditional Entropy: H(Y|X) = sum_x P(x) H(Y | X = x)
123Entropy-Based Discretization
- Given a set of samples S, if S is partitioned
into two intervals S1 and S2 using boundary T,
the entropy after partitioning is
E(S, T) = |S1|/|S| Ent(S1) + |S2|/|S| Ent(S2)
- The boundary that minimizes the entropy function
over all possible boundaries is selected as a
binary discretization
- The process is recursively applied to partitions
obtained until some stopping criterion is met,
e.g., Ent(S) - E(S, T) > δ
- Experiments show that it may reduce data size and
improve classification accuracy
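A compact sketch (assumed) of one step of entropy-based discretization: try every candidate boundary between sorted values and keep the one minimizing the weighted entropy. The age/label data is made up:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return the boundary T minimizing |S1|/|S|*Ent(S1) + |S2|/|S|*Ent(S2)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_e = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no boundary between identical values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs[:i]]
        right = [lab for v, lab in pairs[i:]]
        e = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

ages = [23, 25, 30, 41, 44, 52, 60, 63]
buys = ["no", "no", "no", "yes", "yes", "yes", "no", "no"]
print(best_split(ages, buys))
```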
124Segmentation by Natural Partitioning
- A simple 3-4-5 rule can be used to segment
numeric data into relatively uniform, natural
intervals. - If an interval covers 3, 6, 7 or 9 distinct
values at the most significant digit, partition
the range into 3 equi-width intervals - If it covers 2, 4, or 8 distinct values at the
most significant digit, partition the range into
4 intervals - If it covers 1, 5, or 10 distinct values at the
most significant digit, partition the range into
5 intervals
125Example of 3-4-5 Rule
(Figure: worked example of the 3-4-5 rule; by Step 4 the range (-$400 - $5,000)
has been partitioned into natural intervals)
126Concept Hierarchy Generation for Categorical Data
- Specification of a partial ordering of attributes
explicitly at the schema level by users or
experts - streetltcityltstateltcountry
- Specification of a portion of a hierarchy by
explicit data grouping - Urbana, Champaign, ChicagoltIllinois
- Specification of a set of attributes.
- System automatically generates partial ordering
by analysis of the number of distinct values - E.g., street lt city ltstate lt country
- Specification of only a partial set of attributes
- E.g., only street < city, not others
127Automatic Concept Hierarchy Generation
- Some concept hierarchies can be automatically
generated based on the analysis of the number of
distinct values per attribute in the given data
set - The attribute with the most distinct values is
placed at the lowest level of the hierarchy
- Note: Exception - weekday, month, quarter, year
(Figure: generated hierarchy, from top to bottom)
country (15 distinct values)
province_or_state (65 distinct values)
city (3,567 distinct values)
street (674,339 distinct values)
128Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary
129Summary
- Data preparation is a big issue for both
warehousing and mining - Data preparation includes
- Data cleaning and data integration
- Data reduction and feature selection
- Discretization
- A lot of methods have been developed, but this is still an
active area of research
130References
- E. Rahm and H. H. Do. Data Cleaning: Problems and
Current Approaches. IEEE Bulletin of the
Technical Committee on Data Engineering, Vol. 23,
No. 4.
- D. P. Ballou and G. K. Tayi. Enhancing data
quality in data warehouse environments.
Communications of the ACM, 42:73-78, 1999.
- H. V. Jagadish et al., Special Issue on Data
Reduction Techniques. Bulletin of the Technical
Committee on Data Engineering, 20(4), December
1997.
- A. Maydanchik, Challenges of Efficient Data
Cleansing (DM Review - Data Quality resource
portal).
- D. Pyle. Data Preparation for Data Mining. Morgan
Kaufmann, 1999.
- D. Quass. A Framework for Research in Data
Cleaning. (Draft, 1999)
- V. Raman and J. Hellerstein. Potter's Wheel: An
Interactive Framework for Data Cleaning and
Transformation, VLDB 2001.
- T. Redman. Data Quality: Management and
Technology. Bantam Books, New York, 1992.
- Y. Wand and R. Wang. Anchoring data quality
dimensions in ontological foundations.
Communications of the ACM, 39:86-95, 1996.
- R. Wang, V. Storey, and C. Firth. A framework for
analysis of data quality research. IEEE Trans.
Knowledge and Data Engineering, 7:623-640, 1995.
- http://www.cs.ucla.edu/classes/spring01/cs240b/notes/data-integration1.pdf
131CS590D Data Mining, Chris Clifton
- January 20, 2005
- Data Cubes
132A Sample Data Cube
Total annual sales of TV in U.S.A.
133Cuboids Corresponding to the Cube
all
0-D(apex) cuboid
country
product
date
1-D cuboids
product,date
product,country
date, country
2-D cuboids
3-D(base) cuboid
product, date, country
134Browsing a Data Cube
- Visualization
- OLAP capabilities
- Interactive manipulation
135Typical OLAP Operations
- Roll up (drill-up) summarize data
- by climbing up hierarchy or by dimension
reduction - Drill down (roll down) reverse of roll-up
- from higher level summary to lower level summary
or detailed data, or introducing new dimensions - Slice and dice
- project and select
- Pivot (rotate)
- reorient the cube, visualization, 3D to series of
2D planes. - Other operations
- drill across involving (across) more than one
fact table - drill through through the bottom level of the
cube to its back-end relational tables (using SQL)
136A Star-Net Query Model
(Figure: star-net query model. Each radial line is a dimension - Customer Orders,
Shipping Method, Time, Product, Location, Organization, Promotion - and each
circle on a line is called a footprint. Footprints shown include AIR-EXPRESS and
TRUCK on Shipping Method; DAILY, QTRLY, ANNUALLY on Time; PRODUCT ITEM, PRODUCT
GROUP, PRODUCT LINE on Product; ORDER and CONTRACTS on Customer; and CITY,
COUNTRY, DISTRICT, REGION, DIVISION, SALES PERSON on the Location and
Organization lines.)
137Efficient Data Cube Computation
- Data cube can be viewed as a lattice of cuboids
- The bottom-most cuboid is the base cuboid
- The top-most cuboid (apex) contains only one cell
- How many cuboids in an n-dimensional cube with L
levels? - Materialization of data cube
- Materialize every (cuboid) (full
materialization), none (no materialization), or
some (partial materialization) - Selection of which cuboids to materialize
- Based on size, sharing, access frequency, etc.
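For the cuboid-count question above, the usual formula (stated here as a reminder; it is not spelled out on the slide) is T = (L1 + 1) x (L2 + 1) x ... x (Ln + 1), where Li is the number of levels of dimension i excluding the virtual top level "all". For example, three dimensions with 4, 3, and 4 levels give (4 + 1)(3 + 1)(4 + 1) = 100 cuboids.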
138Cube Operation
- Cube definition and computation in DMQL
- define cube sales [item, city, year]:
sum(sales_in_dollars)
- compute cube sales
- Transform it into a SQL-like language (with a new
operator cube by, introduced by Gray et al. '96)
- SELECT item, city, year, SUM (amount)
- FROM SALES
- CUBE BY item, city, year
- Compute the following Group-Bys
- (date, product, customer),
- (date,product),(date, customer), (product,
customer), - (date), (product), (customer)
- ()
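A tiny sketch (assumed) that enumerates the 2^n group-bys a CUBE BY over n attributes expands to:

```python
from itertools import combinations

def cube_groupbys(dims):
    """All 2^n group-by attribute sets generated by CUBE BY dims."""
    return [combo for r in range(len(dims), -1, -1)
            for combo in combinations(dims, r)]

print(cube_groupbys(("item", "city", "year")))
# [('item', 'city', 'year'), ('item', 'city'), ('item', 'year'), ('city', 'year'),
#  ('item',), ('city',), ('year',), ()]
```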
139Cube Computation ROLAP-Based Method
- Efficient cube computation methods
- ROLAP-based cubing algorithms (Agarwal et al. '96)
- Array-based cubing algorithm (Zhao et al. '97)
- Bottom-up computation method (Beyer &
Ramakrishnan '99)
- H-cubing technique (Han, Pei, Dong & Wang,
SIGMOD '01)
- ROLAP-based cubing algorithms
- Sorting, hashing, and grouping operations are
applied to the dimension attributes in order to
reorder and cluster related tuples - Grouping is performed on some sub-aggregates as a
partial grouping step - Aggregates may be computed from previously
computed aggregates, rather than from the base
fact table
140Cube Computation ROLAP-Based Method (2)
- This is not in the textbook but in a research
paper - Hash/sort based methods (Agarwal et. al. VLDB96)
- Smallest-parent computing a cuboid from the
smallest, previously computed cuboid - Cache-results caching results of a cuboid from
which other cuboids are computed to reduce disk
I/Os - Amortize-scans computing as many as possible
cuboids at the same time to amortize disk reads
- Share-sorts: sharing sorting costs across
multiple cuboids when a sort-based method is used
- Share-partitions: sharing the partitioning cost
across multiple cuboids when hash-based
algorithms are used
141Multi-way Array Aggregation for Cube Computation
- Partition arrays into chunks (a small subcube
which fits in memory). - Compressed sparse array addressing (chunk_id,
offset) - Compute aggregates in multiway by visiting cube
cells in the order which minimizes the number of times
to visit each cell, and reduces memory access and
storage cost.
What is the best traversing order to do multi-way
aggregation?
142Multi-way Array Aggregation for Cube Computation
143Multi-way Array Aggregation for Cube Computation
(Figure: a 3-D array with dimensions A (a0-a3), B (b0-b3), and C (c0-c3),
partitioned into 64 chunks numbered 1-64; the multi-way method scans the chunks
in numbering order so that partial aggregates for the smaller planes stay in
memory)
144Multi-Way Array Aggregation for Cube Computation
(Cont.)
- Method the planes should be sorted and computed
according to their size in ascending order. - See the details of Example 2.12 (pp. 75-78)
- Idea keep the smallest plane in the main memory,
fetch and compute only one chunk at a time for
the largest plane - Limitation of the method computing well only for
a small number of dimensions - If there are a large number of dimensions,
bottom-up computation and iceberg cube
computation methods can be explored
145Indexing OLAP Data Bitmap Index
- Index on a particular column
- Each value in the column has a bit vector; bit
operations are fast
- The length of the bit vector: the number of records
in the base table
- The i-th bit is set if the i-th row of the base
table has the value for the indexed column
- Not suitable for high-cardinality domains
Base table
Index on Region
Index on Type
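A minimal sketch (assumed) of a bitmap index on one column, with one bit vector per distinct value; the sample Region and Type columns are made up:

```python
def bitmap_index(column):
    """Build {value: bit-vector-as-int} for a low-cardinality column."""
    index = {}
    for i, value in enumerate(column):
        index.setdefault(value, 0)
        index[value] |= 1 << i          # set bit i for this row
    return index

region = ["Asia", "Europe", "Asia", "America", "Europe"]
idx = bitmap_index(region)
print({k: bin(v) for k, v in idx.items()})

# AND/OR of bit vectors answers multi-column predicates quickly, e.g.:
type_idx = bitmap_index(["Retail", "Dealer", "Dealer", "Retail", "Dealer"])
print(bin(idx["Europe"] & type_idx["Dealer"]))   # rows that are Europe AND Dealer
```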
146Indexing OLAP Data Join Indices
- Join index: JI(R-id, S-id) where R(R-id, ...) ⋈ S(S-id, ...)
- Traditional indices map the values to a list of
record ids
- It materializes a relational join in the JI file and
speeds up relational join, a rather costly
operation
- In data warehouses, a join index relates the values
of the dimensions of a star schema to rows in
the fact table
- E.g., fact table Sales and two dimensions city
and product - A join index on city maintains for each distinct
city a list of R-IDs of the tuples recording the
Sales in the city - Join indices can span multiple dimensions
147Efficient Processing OLAP Queries
- Determine which operations should be performed on
the available cuboids - transform drill, roll, etc. into corresponding
SQL and/or OLAP operations, e.g., dice = selection +
projection - Determine to which materialized cuboid(s) the
relevant operations should be applied. - Exploring indexing structures and compressed vs.
dense array structures in MOLAP
148Iceberg Cube
- Computing only the cuboid cells whose count or
other aggregates satisfy the condition
- HAVING COUNT(*) >= minsup
- Motivation
- Only a small portion of cube cells may be above
the water in a sparse cube
- Only calculate interesting data - data above a
certain threshold
- Suppose 100 dimensions, only 1 base cell. How
many aggregate (non-base) cells if count >= 1?
What about count >= 2?
149Bottom-Up Computation (BUC)
- BUC (Beyer & Ramakrishnan, SIGMOD '99)
- Bottom-up vs. top-down? Depends on how you view
it!
- Apriori property
- Aggregate the data,
then move to the next level
- If minsup is not met, stop!
- If minsup = 1 ⇒ compute the full CUBE!
150Partitioning
- Usually, the entire data set can't fit in main
memory
- Sort distinct values, partition into blocks that
fit - Continue processing
- Optimizations
- Partitioning
- External Sorting, Hashing, Counting Sort
- Ordering dimensions to encourage pruning
- Cardinality, Skew, Correlation
- Collapsing duplicates
- Can't do holistic aggregates anymore!
151Drawbacks of BUC
- Requires a significant amount of memory
- On par with most other CUBE algorithms though
- Does not obtain good performance with dense CUBEs
- Overly skewed data or a bad choice of dimension
ordering reduces performance - Cannot compute iceberg cubes with complex
measures
- CREATE CUBE Sales_Iceberg AS
- SELECT month, city, cust_grp,
- AVG(price), COUNT(*)
- FROM Sales_Infor
- CUBE BY month, city, cust_grp
- HAVING AVG(price) >= 800 AND
- COUNT(*) >= 50
152Non-Anti-Monotonic Measures
CREATE CUBE Sales_Iceberg AS SELECT month, city,
cust_grp, AVG(price), COUNT(*) FROM
Sales_Infor CUBE BY month, city, cust_grp HAVING
AVG(price) >= 800 AND COUNT(*) >= 50
- The cubing query with avg is non-anti-monotonic!
- (Mar, *, *, 600, 1800) fails the HAVING clause
- (Mar, *, Bus, 1300, 360) passes the clause
Month City Cust_grp Prod Cost Price
Jan Tor Edu Printer 500 485
Jan Tor Hld TV 800 1200
Jan Tor Edu Camera 1160 1280
Feb Mon Bus Laptop 1500 2500
Mar Van Edu HD 540 520
153Top-k Average
- Let (*, Van, *) cover 1,000 records
- Avg(price) is the average price of those 1,000
sales
- Avg50(price) is the average price of the top-50
sales (top-50 according to the sales price)
- Top-k average is anti-monotonic
- The top 50 sales in Van. having avg(price) <
800 ⇒ the top 50 deals in Van. during Feb. must
have avg(price) < 800
154Binning for Top-k Average
- Computing top-k avg is costly with large k
- Binning idea
- Avg50(c) >= 800
- Large value collapsing: use a sum and a count to
summarize records with measure >= 800
- If that count is already large enough, no need to
check small records
- Small value binning: a group of bins
- One bin covers a range, e.g., 600-800, 400-600,
etc.
- Register a sum and a count for each bin
155Approximate top-k average
Suppose for (*, Van, *), we have:
Approximate avg50() = (28000 + 10600 + 600 * 15) / 50 = 952
Range      Sum     Count
Over 800   28000   20
600-800    10600   15
400-600    15200   30
Top 50
The cell may pass the HAVING clause
156Quant-info for Top-k Average Binning
- Accumulate quant-info for cells to compute
average iceberg cubes efficiently - Three pieces sum, count, top-k bins
- Use top-k bins to estimate/prune descendants
- Use sum and count to consolidate current cell
- Approximate avg50(): anti-monotonic, can be computed efficiently (strongest)
- Real avg50(): anti-monotonic, but computationally costly
- avg(): not anti-monotonic (weakest)
157An