1
CS590D: Data Mining, Chris Clifton
  • January 14, 2006
  • Data Preparation

2
Data Preprocessing
  • Why preprocess the data?
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

3
Why Data Preprocessing?
  • Data in the real world is dirty
  • incomplete: lacking attribute values, lacking
    certain attributes of interest, or containing
    only aggregate data
  • e.g., occupation=""
  • noisy: containing errors or outliers
  • e.g., Salary="-10"
  • inconsistent: containing discrepancies in codes
    or names
  • e.g., Age="42", Birthday="03/07/1997"
  • e.g., was rating "1, 2, 3", now rating "A, B, C"
  • e.g., discrepancy between duplicate records

4
Why Is Data Dirty?
  • Incomplete data comes from
  • "n/a" data value when collected
  • different considerations between the time when
    the data was collected and when it is analyzed
  • human/hardware/software problems
  • Noisy data comes from the process of data
  • collection
  • entry
  • transmission
  • Inconsistent data comes from
  • Different data sources
  • Functional dependency violation

5
Why Is Data Preprocessing Important?
  • No quality data, no quality mining results!
  • Quality decisions must be based on quality data
  • e.g., duplicate or missing data may cause
    incorrect or even misleading statistics.
  • Data warehouse needs consistent integration of
    quality data
  • "Data extraction, cleaning, and transformation
    comprises the majority of the work of building a
    data warehouse." - Bill Inmon

6
Multi-Dimensional Measure of Data Quality
  • A well-accepted multidimensional view
  • Accuracy
  • Completeness
  • Consistency
  • Timeliness
  • Believability
  • Value added
  • Interpretability
  • Accessibility
  • Broad categories
  • intrinsic, contextual, representational, and
    accessibility.

7
Major Tasks in Data Preprocessing
  • Data cleaning
  • Fill in missing values, smooth noisy data,
    identify or remove outliers, and resolve
    inconsistencies
  • Data integration
  • Integration of multiple databases, data cubes, or
    files
  • Data transformation
  • Normalization and aggregation
  • Data reduction
  • Obtains reduced representation in volume but
    produces the same or similar analytical results
  • Data discretization
  • Part of data reduction but with particular
    importance, especially for numerical data

8
Forms of data preprocessing
9
Data Preprocessing
  • Why preprocess the data?
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

10
Data Cleaning
  • Importance
  • "Data cleaning is one of the three biggest
    problems in data warehousing" - Ralph Kimball
  • "Data cleaning is the number one problem in data
    warehousing" - DCI survey
  • Data cleaning tasks
  • Fill in missing values
  • Identify outliers and smooth out noisy data
  • Correct inconsistent data
  • Resolve redundancy caused by data integration

11
Missing Data
  • Data is not always available
  • E.g., many tuples have no recorded value for
    several attributes, such as customer income in
    sales data
  • Missing data may be due to
  • equipment malfunction
  • inconsistent with other recorded data and thus
    deleted
  • data not entered due to misunderstanding
  • certain data may not be considered important at
    the time of entry
  • history or changes of the data were not registered
  • Missing data may need to be inferred.

12
How to Handle Missing Data?
  • Ignore the tuple: usually done when the class
    label is missing (assuming the task is
    classification); not effective when the
    percentage of missing values per attribute varies
    considerably
  • Fill in the missing value manually: tedious +
    infeasible?
  • Fill it in automatically with
  • a global constant: e.g., "unknown", a new
    class?!
  • the attribute mean
  • the attribute mean for all samples belonging to
    the same class: smarter
  • the most probable value: inference-based, such as
    a Bayesian formula or decision tree (see the
    pandas sketch below)
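
A minimal pandas sketch of these automatic fill-in strategies (the tiny
income/class table and its column names are hypothetical):

```python
# Hypothetical toy data: one numeric attribute with gaps, one class label.
import pandas as pd

df = pd.DataFrame({
    "income": [35000, None, 52000, None, 61000],
    "class":  ["low", "low", "high", "high", "high"],
})

# Global constant (here -1 stands in for an "unknown" value)
filled_const = df["income"].fillna(-1)

# Attribute mean over all samples
filled_mean = df["income"].fillna(df["income"].mean())

# Smarter: the attribute mean within each class
filled_class_mean = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(filled_class_mean.tolist())  # [35000.0, 35000.0, 52000.0, 56500.0, 61000.0]
```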

13
CS590D: Data Mining, Chris Clifton
  • January 17, 2006
  • Data Preparation

14
What is Data?
  • Collection of data objects and their attributes
  • An attribute is a property or characteristic of
    an object
  • Examples: eye color of a person, temperature,
    etc.
  • Attribute is also known as variable, field,
    characteristic, or feature
  • A collection of attributes describe an object
  • Object is also known as record, point, case,
    sample, entity, or instance

15
Attribute Values
  • Attribute values are numbers or symbols assigned
    to an attribute
  • Distinction between attributes and attribute
    values
  • Same attribute can be mapped to different
    attribute values
  • Example: height can be measured in feet or
    meters
  • Different attributes can be mapped to the same
    set of values
  • Example: attribute values for ID and age are
    integers
  • But properties of attribute values can be
    different
  • ID has no limit but age has a maximum and minimum
    value

16
Measurement of Length
  • The way you measure an attribute may not match
    the attribute's properties.

17
Types of Attributes
  • There are different types of attributes
  • Nominal
  • Examples: ID numbers, eye color, zip codes
  • Ordinal
  • Examples: rankings (e.g., taste of potato chips
    on a scale from 1-10), grades, height in {tall,
    medium, short}
  • Interval
  • Examples: calendar dates, temperatures in Celsius
    or Fahrenheit
  • Ratio
  • Examples: temperature in Kelvin, length, time,
    counts

18
Properties of Attribute Values
  • The type of an attribute depends on which of the
    following properties it possesses
  • Distinctness: = ≠
  • Order: < >
  • Addition: + -
  • Multiplication: * /
  • Nominal attribute: distinctness
  • Ordinal attribute: distinctness & order
  • Interval attribute: distinctness, order &
    addition
  • Ratio attribute: all 4 properties

19
(No Transcript)
20
(No Transcript)
21
Discrete and Continuous Attributes
  • Discrete Attribute
  • Has only a finite or countably infinite set of
    values
  • Examples: zip codes, counts, or the set of words
    in a collection of documents
  • Often represented as integer variables.
  • Note binary attributes are a special case of
    discrete attributes
  • Continuous Attribute
  • Has real numbers as attribute values
  • Examples: temperature, height, or weight
  • Practically, real values can only be measured and
    represented using a finite number of digits.
  • Continuous attributes are typically represented
    as floating-point variables.

22
Types of data sets
  • Record
  • Data Matrix
  • Document Data
  • Transaction Data
  • Graph
  • World Wide Web
  • Molecular Structures
  • Ordered
  • Spatial Data
  • Temporal Data
  • Sequential Data
  • Genetic Sequence Data

23
Important Characteristics of Structured Data
  • Dimensionality
  • Curse of Dimensionality
  • Sparsity
  • Only presence counts
  • Resolution
  • Patterns depend on the scale

24
Record Data
  • Data that consists of a collection of records,
    each of which consists of a fixed set of
    attributes

25
Data Matrix
  • If data objects have the same fixed set of
    numeric attributes, then the data objects can be
    thought of as points in a multi-dimensional
    space, where each dimension represents a distinct
    attribute
  • Such a data set can be represented by an m by n
    matrix, where there are m rows, one for each
    object, and n columns, one for each attribute

26
Document Data
  • Each document becomes a 'term' vector,
  • each term is a component (attribute) of the
    vector,
  • the value of each component is the number of
    times the corresponding term occurs in the
    document.
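
A small sketch of this representation (toy two-document corpus; a real
system would also normalize and stem terms):

```python
# Build term-count vectors over a shared vocabulary (toy corpus).
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]
vocab = sorted({t for d in docs for t in d.split()})

def term_vector(doc):
    counts = Counter(doc.split())
    return [counts[t] for t in vocab]  # one component per term

for d in docs:
    print(term_vector(d))
```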

27
Transaction Data
  • A special type of record data, where
  • each record (transaction) involves a set of
    items.
  • For example, consider a grocery store. The set
    of products purchased by a customer during one
    shopping trip constitutes a transaction, while
    the individual products that were purchased are
    the items.

28
Graph Data
  • Examples: generic graph and HTML links

29
Chemical Data
  • Benzene Molecule C6H6

30
Ordered Data
  • Sequences of transactions

(Figure: a sequence of transactions; each element of the sequence is a
set of items/events)
31
Ordered Data
  • Genomic sequence data

32
Ordered Data
  • Spatio-Temporal Data

Average Monthly Temperature of land and ocean
33
Data Quality
  • What kinds of data quality problems?
  • How can we detect problems with the data?
  • What can we do about these problems?
  • Examples of data quality problems
  • Noise and outliers
  • missing values
  • duplicate data

34
Noise
  • Noise refers to modification of original values
  • Examples: distortion of a person's voice when
    talking on a poor phone, and "snow" on a
    television screen

(Figure: two sine waves; two sine waves + noise)
35
Outliers
  • Outliers are data objects with characteristics
    that are considerably different from most of the
    other data objects in the data set

36
Missing Values
  • Reasons for missing values
  • Information is not collected (e.g., people
    decline to give their age and weight)
  • Attributes may not be applicable to all cases
    (e.g., annual income is not applicable to
    children)
  • Handling missing values
  • Eliminate Data Objects
  • Estimate Missing Values
  • Ignore the Missing Value During Analysis
  • Replace with all possible values (weighted by
    their probabilities)

37
Duplicate Data
  • Data set may include data objects that are
    duplicates, or almost duplicates of one another
  • Major issue when merging data from heterogeneous
    sources
  • Examples
  • Same person with multiple email addresses
  • Data cleaning
  • Process of dealing with duplicate data issues

38
Noisy Data
  • Noise: random error or variance in a measured
    variable
  • Incorrect attribute values may be due to
  • faulty data collection instruments
  • data entry problems
  • data transmission problems
  • technology limitation
  • inconsistency in naming convention
  • Other data problems which require data cleaning
  • duplicate records
  • incomplete data
  • inconsistent data

39
How to Handle Noisy Data?
  • Binning method
  • first sort data and partition into (equi-depth)
    bins
  • then one can smooth by bin means, smooth by bin
    median, smooth by bin boundaries, etc.
  • Clustering
  • detect and remove outliers
  • Combined computer and human inspection
  • detect suspicious values and check by human
    (e.g., deal with possible outliers)
  • Regression
  • smooth by fitting the data into regression
    functions

40
Simple Discretization Methods Binning
  • Equal-width (distance) partitioning
  • Divides the range into N intervals of equal size:
    uniform grid
  • if A and B are the lowest and highest values of
    the attribute, the width of intervals will be
    W = (B - A)/N
  • The most straightforward, but outliers may
    dominate the presentation
  • Skewed data is not handled well.
  • Equal-depth (frequency) partitioning
  • Divides the range into N intervals, each
    containing approximately the same number of
    samples
  • Good data scaling
  • Managing categorical attributes can be tricky.

41
Binning Methods for Data Smoothing
  • Sorted data (e.g., by price):
  • 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
  • Partition into (equi-depth) bins:
  • Bin 1: 4, 8, 9, 15
  • Bin 2: 21, 21, 24, 25
  • Bin 3: 26, 28, 29, 34
  • Smoothing by bin means:
  • Bin 1: 9, 9, 9, 9
  • Bin 2: 23, 23, 23, 23
  • Bin 3: 29, 29, 29, 29
  • Smoothing by bin boundaries (see the sketch
    below):
  • Bin 1: 4, 4, 4, 15
  • Bin 2: 21, 21, 25, 25
  • Bin 3: 26, 26, 26, 34
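
A short Python sketch reproducing the example above (rounding bin means
to integers, as the slide does):

```python
# Equi-depth binning and smoothing on the sorted price data above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means (rounded)
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value moves to the closer boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```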

42
Cluster Analysis
43
Regression
(Figure: data points in the x-y plane fitted by the regression line
y = x + 1)
44
Data Preprocessing
  • Why preprocess the data?
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

45
Data Integration
  • Data integration
  • combines data from multiple sources into a
    coherent store
  • Schema integration
  • integrate metadata from different sources
  • Entity identification problem: identify real
    world entities from multiple data sources, e.g.,
    A.cust-id ≡ B.cust-#
  • Detecting and resolving data value conflicts
  • for the same real world entity, attribute values
    from different sources are different
  • possible reasons different representations,
    different scales, e.g., metric vs. British units

46
Handling Redundancy in Data Integration
  • Redundant data occur often when integrating
    multiple databases
  • The same attribute may have different names in
    different databases
  • One attribute may be a derived attribute in
    another table, e.g., annual revenue
  • Redundant data may be detectable by correlation
    analysis
  • Careful integration of the data from multiple
    sources may help reduce/avoid redundancies and
    inconsistencies and improve mining speed and
    quality

47
Data Transformation
  • Smoothing: remove noise from data
  • Aggregation: summarization, data cube
    construction
  • Generalization: concept hierarchy climbing
  • Normalization: scaled to fall within a small,
    specified range
  • min-max normalization
  • z-score normalization
  • normalization by decimal scaling
  • Attribute/feature construction
  • New attributes constructed from the given ones

48
CS590D: Data Mining, Chris Clifton
  • January 18, 2005
  • Data Preparation

49
Data Transformation Normalization
  • min-max normalization:
    v' = ((v - minA) / (maxA - minA)) * (new_maxA -
    new_minA) + new_minA
  • z-score normalization:
    v' = (v - meanA) / stand_devA
  • normalization by decimal scaling:
    v' = v / 10^j, where j is the smallest integer
    such that Max(|v'|) < 1
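
A minimal sketch of the three normalizations above (toy values; mapping
min-max onto [0, 1] is an assumed choice of new range):

```python
import math

v = [200, 300, 400, 600, 1000]
min_a, max_a = min(v), max(v)
mean_a = sum(v) / len(v)
std_a = math.sqrt(sum((x - mean_a) ** 2 for x in v) / len(v))

min_max = [(x - min_a) / (max_a - min_a) for x in v]  # new range [0, 1]
z_score = [(x - mean_a) / std_a for x in v]

j = 0                          # smallest j with Max(|v'|) < 1
while max(abs(x) for x in v) / 10 ** j >= 1:
    j += 1
decimal = [x / 10 ** j for x in v]  # here j = 4

print(min_max, z_score, decimal, sep="\n")
```
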
50
Z-Score (Example)
(Two example attributes, with v' = (v - Avg) / sdev. Left pair:
Avg = 0.68, sdev = 0.59. Right pair: Avg = 34.3, sdev = 55.9.)

   v      v'   |    v      v'
 0.18   -0.84  |   20    -0.26
 0.60   -0.14  |   40     0.11
 0.52   -0.27  |   65     0.55
 0.25   -0.72  |   70     0.64
 0.80    0.20  |   32    -0.05
 0.55   -0.22  |    8    -0.48
 0.92    0.40  |    5    -0.53
 0.21   -0.79  |   15    -0.35
 0.64   -0.07  |  250     3.87
 0.20   -0.80  |   32    -0.05
 0.63   -0.09  |   18    -0.30
 0.70    0.04  |   10    -0.44
 0.67   -0.02  |  -14    -0.87
 0.58   -0.17  |   22    -0.23
 0.98    0.50  |   45     0.20
 0.81    0.22  |   60     0.47
 0.10   -0.97  |   -5    -0.71
 0.82    0.24  |    7    -0.49
 0.50   -0.30  |    2    -0.58
 3.00    3.87  |    4    -0.55
51
Data Preprocessing
  • Aggregation
  • Sampling
  • Dimensionality Reduction
  • Feature subset selection
  • Feature creation
  • Discretization and Binarization
  • Attribute Transformation

52
Aggregation
  • Combining two or more attributes (or objects)
    into a single attribute (or object)
  • Purpose
  • Data reduction
  • Reduce the number of attributes or objects
  • Change of scale
  • Cities aggregated into regions, states,
    countries, etc
  • More stable data
  • Aggregated data tends to have less variability

53
Aggregation
(Figure: Variation of Precipitation in Australia - standard deviation
of average monthly precipitation vs. standard deviation of average
yearly precipitation)
54
CS490D: Introduction to Data Mining, Chris Clifton
  • January 26, 2004
  • Data Preparation

55
Data Preprocessing
  • Why preprocess the data?
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

56
Data Reduction Strategies
  • A data warehouse may store terabytes of data
  • Complex data analysis/mining may take a very long
    time to run on the complete data set
  • Data reduction
  • Obtain a reduced representation of the data set
    that is much smaller in volume but yet produce
    the same (or almost the same) analytical results
  • Data reduction strategies
  • Data cube aggregation
  • Dimensionality reduction: remove unimportant
    attributes
  • Data compression
  • Numerosity reduction: fit data into models
  • Discretization and concept hierarchy generation

57
Data Cube Aggregation
  • The lowest level of a data cube
  • the aggregated data for an individual entity of
    interest
  • e.g., a customer in a phone calling data
    warehouse.
  • Multiple levels of aggregation in data cubes
  • Further reduce the size of data to deal with
  • Reference appropriate levels
  • Use the smallest representation which is enough
    to solve the task
  • Queries regarding aggregated information should
    be answered using data cube, when possible

58
Dimensionality Reduction
  • Feature selection (i.e., attribute subset
    selection)
  • Select a minimum set of features such that the
    probability distribution of different classes
    given the values for those features is as close
    as possible to the original distribution given
    the values of all features
  • reduce the # of patterns, making them easier to
    understand
  • Heuristic methods (due to the exponential # of
    choices):
  • step-wise forward selection
  • step-wise backward elimination
  • combining forward selection and backward
    elimination
  • decision-tree induction

59
Example ofDecision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
(Figure: decision tree with root A4?, internal nodes A6? and A1?, and
leaves labelled Class 1 and Class 2)
Reduced attribute set: {A1, A4, A6}
60
Heuristic Feature Selection Methods
  • There are 2^d possible sub-features of d features
  • Several heuristic feature selection methods
  • Best single features under the feature
    independence assumption: choose by significance
    tests
  • Best step-wise feature selection
  • The best single-feature is picked first
  • Then the next best feature conditioned on the
    first, ...
  • Step-wise feature elimination
  • Repeatedly eliminate the worst feature
  • Best combined feature selection and elimination
  • Optimal branch and bound
  • Use feature elimination and backtracking

61
Data Compression
  • String compression
  • There are extensive theories and well-tuned
    algorithms
  • Typically lossless
  • But only limited manipulation is possible without
    expansion
  • Audio/video compression
  • Typically lossy compression, with progressive
    refinement
  • Sometimes small fragments of signal can be
    reconstructed without reconstructing the whole
  • Time sequence is not audio: typically short and
    varies slowly with time

62
Data Compression
(Figure: lossless compression maps Original Data to Compressed Data and
back exactly; lossy compression recovers only an approximation of the
original data)
63
Wavelet Transformation
  • Discrete wavelet transform (DWT): linear signal
    processing, multi-resolution analysis
  • Compressed approximation: store only a small
    fraction of the strongest wavelet coefficients
  • Similar to discrete Fourier transform (DFT), but
    better lossy compression, localized in space
  • Method
  • Length, L, must be an integer power of 2 (pad
    with 0s when necessary)
  • Each transform has 2 functions: smoothing,
    difference
  • Applies to pairs of data, resulting in two sets
    of data of length L/2
  • Applies the two functions recursively until it
    reaches the desired length

64
DWT for Image Compression
  • Image
  • Low Pass High Pass
  • Low Pass High Pass
  • Low Pass High Pass

65
Curse of Dimensionality
  • When dimensionality increases, data becomes
    increasingly sparse in the space that it occupies
  • Definitions of density and distance between
    points, which is critical for clustering and
    outlier detection, become less meaningful
  • Randomly generate 500 points
  • Compute difference between max and min distance
    between any pair of points

66
Dimensionality Reduction
  • Purpose
  • Avoid curse of dimensionality
  • Reduce amount of time and memory required by data
    mining algorithms
  • Allow data to be more easily visualized
  • May help to eliminate irrelevant features or
    reduce noise
  • Techniques
  • Principal Component Analysis
  • Singular Value Decomposition
  • Others supervised and non-linear techniques

67
Principal Component Analysis
(Figure: data in the X1-X2 plane with principal component axes Y1 and
Y2)
68
Dimensionality Reduction PCA
  • Goal is to find a projection that captures the
    largest amount of variation in data

(Figure: 2-D data with the principal eigenvector e in the x1-x2 plane)
69
Dimensionality Reduction PCA
  • Find the eigenvectors of the covariance matrix
  • The eigenvectors define the new space

(Figure: 2-D data with the principal eigenvector e in the x1-x2 plane)
70
Principal Component Analysis
  • Given N data vectors from k dimensions, find
    c ≤ k orthogonal vectors that can be best used
    to represent the data
  • The original data set is reduced to one
    consisting of N data vectors on c principal
    components (reduced dimensions)
  • Each data vector is a linear combination of the c
    principal component vectors
  • Works for numeric data only
  • Used when the number of dimensions is large
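
A NumPy sketch of this procedure (synthetic 2-D data; keeping c = 1
component is an arbitrary choice for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # k x k covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

order = np.argsort(eigvals)[::-1]       # strongest components first
components = eigvecs[:, order[:1]]      # keep c = 1 principal component
reduced = Xc @ components               # N x c reduced representation
print(reduced.shape)                    # (100, 1)
```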

71
Dimensionality Reduction PCA
72
Dimensionality Reduction ISOMAP
  • Construct a neighbourhood graph
  • For each pair of points in the graph, compute the
    shortest path distances - geodesic distances

By Tenenbaum, de Silva, Langford (2000)
73
Feature Subset Selection
  • Another way to reduce dimensionality of data
  • Redundant features
  • duplicate much or all of the information
    contained in one or more other attributes
  • Example: purchase price of a product and the
    amount of sales tax paid
  • Irrelevant features
  • contain no information that is useful for the
    data mining task at hand
  • Example: students' ID is often irrelevant to the
    task of predicting students' GPA

74
Feature Subset Selection
  • Techniques
  • Brute-force approach
  • Try all possible feature subsets as input to data
    mining algorithm
  • Embedded approaches
  • Feature selection occurs naturally as part of
    the data mining algorithm
  • Filter approaches
  • Features are selected before data mining
    algorithm is run
  • Wrapper approaches
  • Use the data mining algorithm as a black box to
    find best subset of attributes

75
CS590D: Data Mining, Chris Clifton
  • January 24, 2006
  • Data Preparation

76
Feature Creation
  • Create new attributes that can capture the
    important information in a data set much more
    efficiently than the original attributes
  • Three general methodologies
  • Feature Extraction
  • domain-specific
  • Mapping Data to New Space
  • Feature Construction
  • combining features

77
Mapping Data to a New Space
(Figure: Fourier transform and wavelet transform of "two sine waves"
and "two sine waves + noise", plotted against frequency)
78
Discretization Using Class Labels
  • Entropy-based approach

3 categories for both x and y
5 categories for both x and y
79
Discretization Without Using Class Labels
Data
Equal interval width
Equal frequency
K-means
80
Attribute Transformation
  • A function that maps the entire set of values of
    a given attribute to a new set of replacement
    values such that each old value can be identified
    with one of the new values
  • Simple functions: x^k, log(x), e^x, |x|
  • Standardization and normalization

81
Numerosity Reduction
  • Parametric methods
  • Assume the data fits some model, estimate model
    parameters, store only the parameters, and
    discard the data (except possible outliers)
  • Log-linear models: obtain the value at a point in
    m-D space as the product of appropriate marginal
    subspaces
  • Non-parametric methods
  • Do not assume models
  • Major families histograms, clustering, sampling

82
Regression and Log-Linear Models
  • Linear regression: data are modeled to fit a
    straight line
  • Often uses the least-square method to fit the
    line
  • Multiple regression: allows a response variable Y
    to be modeled as a linear function of a
    multidimensional feature vector
  • Log-linear model: approximates discrete
    multidimensional probability distributions

83
Regression Analysis and Log-Linear Models
  • Linear regression: Y = α + β X
  • Two parameters, α and β, specify the line and are
    to be estimated by using the data at hand
  • using the least squares criterion on the known
    values of Y1, Y2, ..., X1, X2, ...
  • Multiple regression: Y = b0 + b1 X1 + b2 X2
  • Many nonlinear functions can be transformed into
    the above
  • Log-linear models
  • The multi-way table of joint probabilities is
    approximated by a product of lower-order tables
  • Probability: p(a, b, c, d) = αab βac χad δbcd
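
A closed-form least-squares sketch for the single-variable case (toy
x/y values):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Least-squares estimates of beta (slope) and alpha (intercept)
beta = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
alpha = y.mean() - beta * x.mean()
print(alpha, beta)  # the fitted line is Y = alpha + beta * X
```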

84
Histograms
  • A popular data reduction technique
  • Divide data into buckets and store average (sum)
    for each bucket
  • Can be constructed optimally in one dimension
    using dynamic programming
  • Related to quantization problems.

85
Clustering
  • Partition data set into clusters, and one can
    store cluster representation only
  • Can be very effective if data is clustered but
    not if data is smeared
  • Can have hierarchical clustering and be stored in
    multi-dimensional index tree structures
  • There are many choices of clustering definitions
    and clustering algorithms, further detailed in
    Chapter 8

86
Hierarchical Reduction
  • Use multi-resolution structure with different
    degrees of reduction
  • Hierarchical clustering is often performed but
    tends to define partitions of data sets rather
    than clusters
  • Parametric methods are usually not amenable to
    hierarchical representation
  • Hierarchical aggregation
  • An index tree hierarchically divides a data set
    into partitions by value range of some attributes
  • Each partition can be considered as a bucket
  • Thus an index tree with aggregates stored at each
    node is a hierarchical histogram

87
Sampling
  • Allow a mining algorithm to run in complexity
    that is potentially sub-linear to the size of the
    data
  • Choose a representative subset of the data
  • Simple random sampling may have very poor
    performance in the presence of skew
  • Develop adaptive sampling methods
  • Stratified sampling
  • Approximate the percentage of each class (or
    subpopulation of interest) in the overall
    database
  • Used in conjunction with skewed data
  • Sampling may not reduce database I/Os (page at a
    time).

88
Sampling
(Figure: SRSWOR (simple random sample without replacement) vs. SRSWR
(simple random sample with replacement))
89
Sampling
Cluster/Stratified Sample
Raw Data
90
Sampling
  • Sampling is the main technique employed for data
    selection.
  • It is often used for both the preliminary
    investigation of the data and the final data
    analysis.
  • Statisticians sample because obtaining the entire
    set of data of interest is too expensive or time
    consuming.
  • Sampling is used in data mining because
    processing the entire set of data of interest is
    too expensive or time consuming.

91
Sampling
  • The key principle for effective sampling is the
    following
  • using a sample will work almost as well as using
    the entire data set, if the sample is
    representative
  • A sample is representative if it has
    approximately the same property (of interest) as
    the original set of data

92
Types of Sampling
  • Simple Random Sampling
  • There is an equal probability of selecting any
    particular item
  • Sampling without replacement
  • As each item is selected, it is removed from the
    population
  • Sampling with replacement
  • Objects are not removed from the population as
    they are selected for the sample.
  • In sampling with replacement, the same object can
    be picked up more than once
  • Stratified sampling
  • Split the data into several partitions, then draw
    random samples from each partition (see the
    sketch below)
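
A NumPy sketch of the three schemes (toy population of 100 items; the
two strata labels are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)
population = np.arange(100)
strata = np.repeat(["a", "b"], 50)     # two partitions of the population

srswor = rng.choice(population, size=10, replace=False)
srswr = rng.choice(population, size=10, replace=True)  # duplicates possible

# Stratified: draw a fixed-size random sample from each partition
stratified = np.concatenate([
    rng.choice(population[strata == s], size=5, replace=False)
    for s in np.unique(strata)
])
print(srswor, srswr, stratified, sep="\n")
```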

93
Sample Size

8000 points 2000 Points 500 Points
94
Sample Size
  • What sample size is necessary to get at least one
    object from each of 10 groups?

95
Similarity and Dissimilarity
  • Similarity
  • Numerical measure of how alike two data objects
    are.
  • Is higher when objects are more alike.
  • Often falls in the range [0, 1]
  • Dissimilarity
  • Numerical measure of how different two data
    objects are
  • Lower when objects are more alike
  • Minimum dissimilarity is often 0
  • Upper limit varies
  • Proximity refers to a similarity or dissimilarity

96
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data
objects.
97
Euclidean Distance
  • Euclidean distance:
    dist(p, q) = sqrt( Σk (pk - qk)^2 )
  • Where n is the number of dimensions (attributes)
    and pk and qk are, respectively, the kth
    attributes (components) of data objects p and q
  • Standardization is necessary, if scales differ

98
Euclidean Distance
Distance Matrix
99
Minkowski Distance
  • Minkowski distance is a generalization of
    Euclidean distance:
    dist(p, q) = ( Σk |pk - qk|^r )^(1/r)
  • Where r is a parameter, n is the number of
    dimensions (attributes) and pk and qk are,
    respectively, the kth attributes (components) of
    data objects p and q

100
Minkowski Distance Examples
  • r = 1. City block (Manhattan, taxicab, L1 norm)
    distance
  • A common example of this is the Hamming distance,
    which is just the number of bits that are
    different between two binary vectors
  • r = 2. Euclidean distance
  • r → ∞. Supremum (Lmax norm, L∞ norm) distance
  • This is the maximum difference between any
    component of the vectors
  • Do not confuse r with n, i.e., all these
    distances are defined for all numbers of
    dimensions (see the sketch below)
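
A small sketch covering the three cases (p and q are toy points):

```python
def minkowski(p, q, r):
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if r == float("inf"):                 # supremum (L_max) distance
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

p, q = (0, 2), (3, 6)
print(minkowski(p, q, 1))             # 7.0 (city block)
print(minkowski(p, q, 2))             # 5.0 (Euclidean)
print(minkowski(p, q, float("inf")))  # 4   (supremum)
```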

101
Minkowski Distance
Distance Matrix
102
Mahalanobis Distance
Σ is the covariance matrix of the input data X:
mahalanobis(p, q) = (p - q) Σ^-1 (p - q)^T
For the red points in the figure, the Euclidean distance is 14.7 and
the Mahalanobis distance is 6.
103
Mahalanobis Distance
Covariance matrix:
Σ = [[0.3, 0.2],
     [0.2, 0.3]]
A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5)
Mahal(A, B) = 5, Mahal(A, C) = 4
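
A NumPy check of this example, using the covariance matrix above (the
printed values are the squared-form distances the slide reports):

```python
import numpy as np

sigma = np.array([[0.3, 0.2],
                  [0.2, 0.3]])   # covariance matrix from the figure
inv = np.linalg.inv(sigma)

def mahal(p, q):                 # squared Mahalanobis distance
    d = np.asarray(p) - np.asarray(q)
    return float(d @ inv @ d)

A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
print(mahal(A, B))   # 5.0
print(mahal(A, C))   # 4.0
```
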
104
Common Properties of a Distance
  • Distances, such as the Euclidean distance, have
    some well known properties.
  • d(p, q) ≥ 0 for all p and q, and d(p, q) = 0
    only if p = q (Positive definiteness)
  • d(p, q) = d(q, p) for all p and q (Symmetry)
  • d(p, r) ≤ d(p, q) + d(q, r) for all points p,
    q, and r (Triangle Inequality)
  • where d(p, q) is the distance (dissimilarity)
    between points (data objects) p and q
  • A distance that satisfies these properties is a
    metric

105
Common Properties of a Similarity
  • Similarities, also have some well known
    properties.
  • s(p, q) = 1 (or maximum similarity) only if
    p = q
  • s(p, q) s(q, p) for all p and q. (Symmetry)
  • where s(p, q) is the similarity between points
    (data objects), p and q.

106
Similarity Between Binary Vectors
  • Common situation is that objects, p and q, have
    only binary attributes
  • Compute similarities using the following
    quantities
  • M01 = the number of attributes where p was 0 and
    q was 1
  • M10 = the number of attributes where p was 1 and
    q was 0
  • M00 = the number of attributes where p was 0 and
    q was 0
  • M11 = the number of attributes where p was 1 and
    q was 1
  • Simple Matching and Jaccard Coefficients
  • SMC = number of matches / number of attributes
        = (M11 + M00) / (M01 + M10 + M11 + M00)
  • J = number of 11 matches / number of
    not-both-zero attribute values
      = M11 / (M01 + M10 + M11)

107
SMC versus Jaccard Example
  • p = 1 0 0 0 0 0 0 0 0 0
  • q = 0 0 0 0 0 0 1 0 0 1
  • M01 = 2 (the number of attributes where p was 0
    and q was 1)
  • M10 = 1 (the number of attributes where p was 1
    and q was 0)
  • M00 = 7 (the number of attributes where p was 0
    and q was 0)
  • M11 = 0 (the number of attributes where p was 1
    and q was 1)
  • SMC = (M11 + M00) / (M01 + M10 + M11 + M00)
        = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
  • J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0)
      = 0
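
The same computation in a few lines of Python:

```python
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

m01 = sum(a == 0 and b == 1 for a, b in zip(p, q))
m10 = sum(a == 1 and b == 0 for a, b in zip(p, q))
m00 = sum(a == 0 and b == 0 for a, b in zip(p, q))
m11 = sum(a == 1 and b == 1 for a, b in zip(p, q))

smc = (m11 + m00) / (m01 + m10 + m11 + m00)  # 0.7
jac = m11 / (m01 + m10 + m11)                # 0.0
print(smc, jac)
```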

108
Cosine Similarity
  • If d1 and d2 are two document vectors, then
  • cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
  • where • indicates the vector dot product and
    ||d|| is the length of vector d
  • Example:
  • d1 = 3 2 0 5 0 0 0 2 0 0
  • d2 = 1 0 0 0 0 0 0 1 0 2
  • d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 +
    0*0 + 2*1 + 0*0 + 0*2 = 5
  • ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 +
    0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
  • ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 +
    0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
  • cos(d1, d2) = 0.3150
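
A sketch reproducing this computation:

```python
import math

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

dot = sum(a * b for a, b in zip(d1, d2))    # 5
n1 = math.sqrt(sum(a * a for a in d1))      # sqrt(42) = 6.481
n2 = math.sqrt(sum(b * b for b in d2))      # sqrt(6)  = 2.449
print(dot / (n1 * n2))                      # 0.3150
```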

109
Extended Jaccard Coefficient (Tanimoto)
  • Variation of Jaccard for continuous or count
    attributes
  • Reduces to Jaccard for binary attributes

110
Correlation
  • Correlation measures the linear relationship
    between objects
  • To compute correlation, we standardize data
    objects, p and q, and then take their dot product

111
Visually Evaluating Correlation
Scatter plots showing the similarity from -1 to 1.
112
General Approach for Combining Similarities
  • Sometimes attributes are of many different types,
    but an overall similarity is needed.

113
Using Weights to Combine Similarities
  • May not want to treat all attributes the same.
  • Use weights wk which are between 0 and 1 and sum
    to 1.

114
Density
  • Density-based clustering requires a notion of
    density
  • Examples:
  • Euclidean density
  • Euclidean density = number of points per unit
    volume
  • Probability density
  • Graph-based density

115
Euclidean Density Cell-based
  • Simplest approach is to divide the region into a
    number of rectangular cells of equal volume and
    define density as the # of points the cell
    contains

116
Euclidean Density Center-based
  • Euclidean density is the number of points within
    a specified radius of the point

117
Data Preprocessing
  • Why preprocess the data?
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

118
Discretization
  • Three types of attributes
  • Nominal: values from an unordered set
  • Ordinal: values from an ordered set
  • Continuous: real numbers
  • Discretization
  • divide the range of a continuous attribute into
    intervals
  • Some classification algorithms only accept
    categorical attributes.
  • Reduce data size by discretization
  • Prepare for further analysis

119
Discretization and Concept Hierarchy
  • Discretization
  • reduce the number of values for a given
    continuous attribute by dividing the range of the
    attribute into intervals. Interval labels can
    then be used to replace actual data values
  • Concept hierarchies
  • reduce the data by collecting and replacing low
    level concepts (such as numeric values for the
    attribute age) by higher level concepts (such as
    young, middle-aged, or senior)

120
CS490D: Introduction to Data Mining, Chris Clifton
  • January 28, 2004
  • Data Preparation

121
Discretization and Concept Hierarchy Generation
for Numeric Data
  • Binning (see sections before)
  • Histogram analysis (see sections before)
  • Clustering analysis (see sections before)
  • Entropy-based discretization
  • Segmentation by natural partitioning

122
Definition of Entropy
  • Entropy: H(X) = - Σx P(x) log2 P(x)
  • Example: Coin Flip
  • X = {heads, tails}
  • P(heads) = P(tails) = ½
  • -½ log2(½) - ½ log2(½) = ½ + ½ = 1
  • H(X) = 1
  • What about a two-headed coin?
  • Conditional Entropy: H(Y|X) = Σx P(x) H(Y|X = x)

123
Entropy-Based Discretization
  • Given a set of samples S, if S is partitioned
    into two intervals S1 and S2 using boundary T,
    the entropy after partitioning is
    E(S, T) = (|S1|/|S|) Ent(S1) + (|S2|/|S|) Ent(S2)
  • The boundary that minimizes the entropy function
    over all possible boundaries is selected as a
    binary discretization
  • The process is recursively applied to partitions
    obtained until some stopping criterion is met,
    e.g., the information gain falls below a
    threshold (see the sketch below)
  • Experiments show that it may reduce data size and
    improve classification accuracy
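
A minimal sketch of one entropy-based binary split (toy labelled
values; the recursion and stopping test are omitted):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2    # candidate boundary T
        s1 = [l for v, l in pairs if v <= t]
        s2 = [l for v, l in pairs if v > t]
        e = (len(s1) * entropy(s1) + len(s2) * entropy(s2)) / len(pairs)
        if best is None or e < best[1]:
            best = (t, e)
    return best

print(best_split([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"]))
# (6.5, 0.0): splitting at T = 6.5 yields two pure intervals
```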

124
Segmentation by Natural Partitioning
  • A simple 3-4-5 rule can be used to segment
    numeric data into relatively uniform, natural
    intervals.
  • If an interval covers 3, 6, 7 or 9 distinct
    values at the most significant digit, partition
    the range into 3 equi-width intervals
  • If it covers 2, 4, or 8 distinct values at the
    most significant digit, partition the range into
    4 intervals
  • If it covers 1, 5, or 10 distinct values at the
    most significant digit, partition the range into
    5 intervals

125
Example of 3-4-5 Rule
(Figure: worked example of the 3-4-5 rule, Steps 1-4, partitioning a
range of approximately (-$4,000 - $5,000))
126
Concept Hierarchy Generation for Categorical Data
  • Specification of a partial ordering of attributes
    explicitly at the schema level by users or
    experts
  • street < city < state < country
  • Specification of a portion of a hierarchy by
    explicit data grouping
  • {Urbana, Champaign, Chicago} < Illinois
  • Specification of a set of attributes
  • System automatically generates partial ordering
    by analysis of the number of distinct values
  • E.g., street < city < state < country
  • Specification of only a partial set of attributes
  • E.g., only street < city, not others

127
Automatic Concept Hierarchy Generation
  • Some concept hierarchies can be automatically
    generated based on the analysis of the number of
    distinct values per attribute in the given data
    set
  • The attribute with the most distinct values is
    placed at the lowest level of the hierarchy
  • Note: Exception - weekday, month, quarter, year

country           (15 distinct values)
province_or_state (65 distinct values)
city              (3,567 distinct values)
street            (674,339 distinct values)
128
Data Preprocessing
  • Why preprocess the data?
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

129
Summary
  • Data preparation is a big issue for both
    warehousing and mining
  • Data preparation includes
  • Data cleaning and data integration
  • Data reduction and feature selection
  • Discretization
  • A lot of methods have been developed, but this is
    still an active area of research

130
References
  • E. Rahm and H. H. Do. Data Cleaning: Problems and
    Current Approaches. IEEE Bulletin of the
    Technical Committee on Data Engineering, Vol. 23,
    No. 4.
  • D. P. Ballou and G. K. Tayi. Enhancing data
    quality in data warehouse environments.
    Communications of the ACM, 42:73-78, 1999.
  • H. V. Jagadish et al., Special Issue on Data
    Reduction Techniques. Bulletin of the Technical
    Committee on Data Engineering, 20(4), December
    1997.
  • A. Maydanchik, Challenges of Efficient Data
    Cleansing (DM Review - Data Quality resource
    portal).
  • D. Pyle. Data Preparation for Data Mining. Morgan
    Kaufmann, 1999.
  • D. Quass. A Framework for Research in Data
    Cleaning. (Draft, 1999)
  • V. Raman and J. Hellerstein. Potter's Wheel: An
    Interactive Framework for Data Cleaning and
    Transformation. VLDB'2001.
  • T. Redman. Data Quality: Management and
    Technology. Bantam Books, New York, 1992.
  • Y. Wand and R. Wang. Anchoring data quality
    dimensions: ontological foundations.
    Communications of the ACM, 39:86-95, 1996.
  • R. Wang, V. Storey, and C. Firth. A framework for
    analysis of data quality research. IEEE Trans.
    Knowledge and Data Engineering, 7:623-640, 1995.
  • http://www.cs.ucla.edu/classes/spring01/cs240b/notes/data-integration1.pdf

131
CS590D: Data Mining, Chris Clifton
  • January 20, 2005
  • Data Cubes

132
A Sample Data Cube
Total annual sales of TV in U.S.A.
133
Cuboids Corresponding to the Cube
all                                          0-D (apex) cuboid
country, product, date                       1-D cuboids
(product, date), (product, country),
(date, country)                              2-D cuboids
(product, date, country)                     3-D (base) cuboid
134
Browsing a Data Cube
  • Visualization
  • OLAP capabilities
  • Interactive manipulation

135
Typical OLAP Operations
  • Roll up (drill-up): summarize data
  • by climbing up a hierarchy or by dimension
    reduction
  • Drill down (roll down): reverse of roll-up
  • from higher-level summary to lower-level summary
    or detailed data, or introducing new dimensions
  • Slice and dice:
  • project and select
  • Pivot (rotate):
  • reorient the cube, visualization, 3D to series of
    2D planes
  • Other operations:
  • drill across: involving (across) more than one
    fact table
  • drill through: through the bottom level of the
    cube to its back-end relational tables (using
    SQL)

136
A Star-Net Query Model
(Figure: a star-net query model. Dimension lines radiate from a central
point: Customer Orders, Shipping Method, Customer, Product, Time,
Location, Organization, and Promotion. Footprints on the lines include
CONTRACTS, ORDER, AIR-EXPRESS, TRUCK, DAILY, QTRLY, ANNUALY, PRODUCT
ITEM, PRODUCT GROUP, PRODUCT LINE, CITY, DISTRICT, REGION, COUNTRY,
DIVISION, and SALES PERSON. Each circle is called a footprint.)
137
Efficient Data Cube Computation
  • Data cube can be viewed as a lattice of cuboids
  • The bottom-most cuboid is the base cuboid
  • The top-most cuboid (apex) contains only one cell
  • How many cuboids are there in an n-dimensional
    cube with L levels? (T = Πi (Li + 1))
  • Materialization of data cube
  • Materialize every (cuboid) (full
    materialization), none (no materialization), or
    some (partial materialization)
  • Selection of which cuboids to materialize
  • Based on size, sharing, access frequency, etc.

138
Cube Operation
  • Cube definition and computation in DMQL
  • define cube sales [item, city, year]:
    sum(sales_in_dollars)
  • compute cube sales
  • Transform it into a SQL-like language (with a new
    operator cube by, introduced by Gray et al.'96)
  • SELECT item, city, year, SUM (amount)
  • FROM SALES
  • CUBE BY item, city, year
  • Compute the following Group-Bys:
  • (date, product, customer),
  • (date, product), (date, customer), (product,
    customer),
  • (date), (product), (customer)
  • ()
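
The 2^n group-bys that CUBE BY enumerates can be listed with itertools
(dimension names taken from the group-by list above):

```python
from itertools import combinations

dims = ("date", "product", "customer")
groupbys = [combo for r in range(len(dims), -1, -1)
            for combo in combinations(dims, r)]
print(groupbys)  # 8 group-bys, from (date, product, customer) down to ()
```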

139
Cube Computation ROLAP-Based Method
  • Efficient cube computation methods
  • ROLAP-based cubing algorithms (Agarwal et
    al.'96)
  • Array-based cubing algorithm (Zhao et al.'97)
  • Bottom-up computation method (Beyer &
    Ramakrishnan'99)
  • H-cubing technique (Han, Pei, Dong & Wang:
    SIGMOD'01)
  • ROLAP-based cubing algorithms
  • Sorting, hashing, and grouping operations are
    applied to the dimension attributes in order to
    reorder and cluster related tuples
  • Grouping is performed on some sub-aggregates as a
    partial grouping step
  • Aggregates may be computed from previously
    computed aggregates, rather than from the base
    fact table

140
Cube Computation ROLAP-Based Method (2)
  • This is not in the textbook but in a research
    paper
  • Hash/sort based methods (Agarwal et al.
    VLDB'96)
  • Smallest-parent: computing a cuboid from the
    smallest, previously computed cuboid
  • Cache-results: caching results of a cuboid from
    which other cuboids are computed, to reduce disk
    I/Os
  • Amortize-scans: computing as many cuboids as
    possible at the same time to amortize disk reads
  • Share-sorts: sharing sorting costs across
    multiple cuboids when a sort-based method is used
  • Share-partitions: sharing the partitioning cost
    across multiple cuboids when hash-based
    algorithms are used

141
Multi-way Array Aggregation for Cube Computation
  • Partition arrays into chunks (a small subcube
    which fits in memory).
  • Compressed sparse array addressing: (chunk_id,
    offset)
  • Compute aggregates in multiway by visiting cube
    cells in the order which minimizes the # of times
    to visit each cell, and reduces memory access and
    storage cost

What is the best traversing order to do multi-way
aggregation?
142
Multi-way Array Aggregation for Cube Computation
143
Multi-way Array Aggregation for Cube Computation
(Figure: a 3-D array with dimensions A (a0-a3), B (b0-b3), and C
(c0-c3), partitioned into 64 chunks numbered 1-64)
144
Multi-Way Array Aggregation for Cube Computation
(Cont.)
  • Method: the planes should be sorted and computed
    according to their size in ascending order
  • See the details of Example 2.12 (pp. 75-78)
  • Idea: keep the smallest plane in main memory,
    fetch and compute only one chunk at a time for
    the largest plane
  • Limitation of the method: computes well only for
    a small number of dimensions
  • If there are a large number of dimensions,
    bottom-up computation and iceberg cube
    computation methods can be explored

145
Indexing OLAP Data Bitmap Index
  • Index on a particular column
  • Each value in the column has a bit vector:
    bit-ops are fast
  • The length of the bit vector: # of records in the
    base table
  • The i-th bit is set if the i-th row of the base
    table has the value for the indexed column
  • Not suitable for high-cardinality domains (see
    the sketch below)

Base table
Index on Region
Index on Type
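
A toy Python sketch of a bitmap index (plain lists of 0/1 ints stand in
for packed bit vectors):

```python
rows = ["Asia", "Europe", "Asia", "America", "Europe"]  # indexed column

# One bit vector per distinct value; the i-th bit is set if row i
# holds that value.
index = {v: [int(r == v) for r in rows] for v in set(rows)}

# Fast bit-ops: rows where the region is Asia OR Europe
asia, europe = index["Asia"], index["Europe"]
print([a | e for a, e in zip(asia, europe)])  # [1, 1, 1, 0, 1]
```
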
146
Indexing OLAP Data Join Indices
  • Join index: JI(R-id, S-id) where R (R-id, ...) ⋈
    S (S-id, ...)
  • Traditional indices map the values to a list of
    record ids
  • It materializes the relational join in a JI file
    and speeds up relational join - a rather costly
    operation
  • In data warehouses, a join index relates the
    values of the dimensions of a star schema to rows
    in the fact table
  • E.g., fact table: Sales, and two dimensions: city
    and product
  • A join index on city maintains for each distinct
    city a list of R-IDs of the tuples recording the
    sales in the city
  • Join indices can span multiple dimensions

147
Efficient Processing OLAP Queries
  • Determine which operations should be performed on
    the available cuboids
  • transform drill, roll, etc. into corresponding
    SQL and/or OLAP operations, e.g., dice =
    selection + projection
  • Determine to which materialized cuboid(s) the
    relevant operations should be applied.
  • Exploring indexing structures and compressed vs.
    dense array structures in MOLAP

148
Iceberg Cube
  • Computing only the cuboid cells whose count or
    other aggregate satisfies the condition
  • HAVING COUNT(*) > minsup
  • Motivation
  • Only a small portion of cube cells may be "above
    the water" in a sparse cube
  • Only calculate "interesting" data - data above a
    certain threshold
  • Suppose 100 dimensions, only 1 base cell. How
    many aggregate (non-base) cells if count > 1?
    What about count > 2?

149
Bottom-Up Computation (BUC)
  • BUC (Beyer & Ramakrishnan, SIGMOD'99)
  • Bottom-up vs. top-down? Depends on how you view
    it!
  • Apriori property:
  • Aggregate the data,
    then move to the next level
  • If minsup is not met, stop!
  • If minsup = 1, compute the full CUBE!

150
Partitioning
  • Usually, the entire data set can't fit in main
    memory
  • Sort distinct values, partition into blocks that
    fit
  • Continue processing
  • Optimizations
  • Partitioning
  • External Sorting, Hashing, Counting Sort
  • Ordering dimensions to encourage pruning
  • Cardinality, Skew, Correlation
  • Collapsing duplicates
  • Can't do holistic aggregates anymore!

151
Drawbacks of BUC
  • Requires a significant amount of memory
  • On par with most other CUBE algorithms, though
  • Does not obtain good performance with dense CUBEs
  • Overly skewed data or a bad choice of dimension
    ordering reduces performance
  • Cannot compute iceberg cubes with complex
    measures
  • CREATE CUBE Sales_Iceberg AS
  • SELECT month, city, cust_grp,
  • AVG(price), COUNT(*)
  • FROM Sales_Infor
  • CUBE BY month, city, cust_grp
  • HAVING AVG(price) > 800 AND
  • COUNT(*) > 50

152
Non-Anti-Monotonic Measures
CREATE CUBE Sales_Iceberg AS
SELECT month, city, cust_grp, AVG(price), COUNT(*)
FROM Sales_Infor
CUBE BY month, city, cust_grp
HAVING AVG(price) > 800 AND COUNT(*) > 50
  • The cubing query with avg is non-anti-monotonic!
  • (Mar, *, *, 600, 1800) fails the HAVING clause
  • (Mar, *, Bus, 1300, 360) passes the clause

Month  City  Cust_grp  Prod     Cost  Price
Jan    Tor   Edu       Printer   500   485
Jan    Tor   Hld       TV        800  1200
Jan    Tor   Edu       Camera   1160  1280
Feb    Mon   Bus       Laptop   1500  2500
Mar    Van   Edu       HD        540   520
...
153
Top-k Average
  • Let (*, Van, *) cover 1,000 records
  • Avg(price) is the average price of those 1,000
    sales
  • Avg50(price) is the average price of the top-50
    sales (top-50 according to the sales price)
  • Top-k average is anti-monotonic
  • If the top 50 sales in Van. have avg(price) <
    800, then the top 50 deals in Van. during Feb.
    must have avg(price) < 800

154
Binning for Top-k Average
  • Computing top-k avg is costly with large k
  • Binning idea
  • Avg50(c) > 800
  • Large value collapsing: use a sum and a count to
    summarize records with measure >= 800
  • If that count reaches k (here 50), no need to
    check small records
  • Small value binning: a group of bins
  • One bin covers a range, e.g., [600, 800),
    [400, 600), etc.
  • Register a sum and a count for each bin
155
Approximate top-k average
Suppose for (*, Van, *) we have:

Range        Sum    Count
Over 800     28000  20
[600, 800)   10600  15
[400, 600)   15200  30
...

The top 50 records are the 20 + 15 in the first two bins plus 15 more
from the [400, 600) bin, each valued at that bin's upper bound of 600.
Approximate avg50() = (28000 + 10600 + 600*15) / 50 = 952
The cell may pass the HAVING clause
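
A sketch of this estimate (bin layout from the table above; valuing the
partial bin at its upper bound is the optimistic assumption that keeps
the estimate safe for pruning):

```python
bins = [(28000, 20), (10600, 15), (15200, 30)]  # (sum, count) per range
upper = [None, 800, 600]  # upper bound of each range; the open-ended
                          # first bin is always fully consumed here

k, total, taken = 50, 0.0, 0
for (s, c), hi in zip(bins, upper):
    need = min(c, k - taken)
    # A fully used bin contributes its exact sum; a partially used
    # bin is bounded by need * its upper value.
    total += s if need == c else need * hi
    taken += need
    if taken == k:
        break

print(total / k)  # (28000 + 10600 + 600 * 15) / 50 = 952.0
```
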
156
Quant-info for Top-k Average Binning
  • Accumulate quant-info for cells to compute
    average iceberg cubes efficiently
  • Three pieces: sum, count, top-k bins
  • Use top-k bins to estimate/prune descendants
  • Use sum and count to consolidate the current cell

From strongest to weakest:
  • Approximate avg50(): anti-monotonic, can be
    computed efficiently
  • Real avg50(): anti-monotonic, but computationally
    costly
  • avg(): not anti-monotonic
157