Title: Data Preprocessing
1 Data Preprocessing
Chapter 2
2 Chapter Objectives
- Realize the importance of data preprocessing for real-world data before data mining or the construction of data warehouses.
- Get an overview of some data preprocessing issues and techniques.
3 The Course
[Course overview diagram: data sources (DS) feed data preprocessing (2) and a data warehouse (DW, 3); OLAP (4) operates on the warehouse; data mining (DM) covers association (5), classification (6), and clustering (7).]
Legend: DS = data source, DW = data warehouse, DM = data mining.
4 - The Chapter
[Chapter overview diagram mapping the preprocessing tasks to sections 2.3, 2.4, 2.4, and 2.5.]
5- Chapter Outline
- Introduction (2.1, 2.2)
- Data Cleaning (2.3)
- Data Integration (2.4)
- Data Transformation (2.4)
- Data Reduction (2.5)
- Concept Hierarchy (2.6)
6- Introduction
- Why Preprocess the Data (2.1)
- Where to Preprocess Data
- Identifying Typical Properties of Data (2.2)
7-- Why Preprocess the Data
- A well-accepted multi-dimensional measure of data quality:
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Interpretability
- Accessibility
8-- Why Preprocess the Data
- Reasons for data cleaning
- Incomplete data (missing data)
- Noisy data (contains errors)
- Inconsistent data (containing discrepancies)
- Reasons for data integration
- Data comes from multiple sources
- Reasons for data transformation
- Some data must be transformed to be usable for mining
- Reasons for data reduction
- Performance
- No quality data → no quality mining results!
9-- Where to Preprocess Data
[Diagram: data from the data sources (DS) flows through a staging database (SD) into the data warehouse (DW), which feeds OLAP and data mining (DM: association, classification, clustering). Data preprocessing is done here, in the staging database.]
Legend: DS = data source, DW = data warehouse, DM = data mining, SD = staging database.
10-- Identifying Typical Properties of Data
- Descriptive data summarization techniques can be used to identify the typical properties of the data and to decide which data values should be treated as noise. For many data preprocessing tasks it is useful to know the following measures of the data:
- The central tendency
- The dispersion
11--- Measuring the Central Tendency
- Central tendency measures
- Mean
- Median
- Mode
- Midrange: (max() + min()) / 2
- For data mining purposes, we need to know how to compute these measures efficiently in large databases. It is important to know whether a measure is
- distributive,
- algebraic, or
- holistic.
12 --- Measuring the Central Tendency
- Distributive measure: a measure that can be computed by partitioning the data, computing the measure for each partition, and then merging the results to arrive at the measure's value for the entire data set.
- E.g. sum(), count(), max(), min()
- Algebraic measure: a measure that can be computed by applying an algebraic function to one or more distributive measures.
- E.g. avg(), which is sum()/count() (see the sketch below)
- Holistic measure: the entire data set is needed to compute the measure.
- E.g. median
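To make the distinction concrete, here is a minimal Python sketch (not part of the original slides) showing why avg() is algebraic: each partition contributes only its distributive sum() and count(), and the global mean is recovered by merging those partial results. The partition values reuse the price list from the binning example later in the chapter.

```python
def partial_sum_count(partition):
    """Distributive part: computed independently on each data partition."""
    return sum(partition), len(partition)

def merged_mean(partitions):
    """Algebraic part: combine the distributive results into the mean."""
    total, n = 0, 0
    for part in partitions:
        s, c = partial_sum_count(part)
        total += s
        n += c
    return total / n

partitions = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(merged_mean(partitions))   # 20.33..., no single pass over all the data is needed
# The median, in contrast, is holistic: it cannot be merged from per-partition medians.
```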
13--- Measuring the Dispersion
- Dispersion or variance is the degree to which numerical data tends to spread.
- The most common measures are
- Standard deviation
- Range: max() - min()
- Quartiles
- The five-number summary
- Interquartile range (IQR)
- Boxplot analysis
14---- Quartiles
- The kth percentile of data sorted in ascending order is the value x having the property that k percent of the data entries lie at or below x.
- The first quartile, Q1, is the 25th percentile; Q2, the median, is the 50th percentile; and Q3 is the 75th percentile.
- IQR = Q3 - Q1 is a simple measure that gives the spread of the middle half.
- A common rule of thumb for identifying suspected outliers is to single out values more than 1.5 x IQR above Q3 or below Q1 (see the sketch below).
- The 5-number summary: the min, Q1, the median, Q3, and the max.
- Box plots can be drawn from the 5-number summary and are useful tools for identifying outliers.
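As an illustration, the following sketch (hypothetical helper functions, using the simple "median of the halves" definition of quartiles) computes the five-number summary and the IQR, and flags values more than 1.5 x IQR beyond Q1 or Q3 as suspected outliers.

```python
def median(xs):
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def five_number_summary(data):
    xs = sorted(data)
    n = len(xs)
    q1 = median(xs[: n // 2])         # median of the lower half
    q3 = median(xs[(n + 1) // 2 :])   # median of the upper half
    return min(xs), q1, median(xs), q3, max(xs)

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 120]
lo, q1, med, q3, hi = five_number_summary(data)
iqr = q3 - q1
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print((lo, q1, med, q3, hi), iqr, outliers)   # 120 is flagged as a suspected outlier
```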
15 ---- Boxplot Analysis
- Boxplot
- Data is represented with a box
- The ends of the box are Q1 and Q3, i.e., the height of the box is the IQR
- The median is marked by a line within the box
- Whiskers: two lines outside the box extend to the minimum and maximum
[Boxplot figure: whiskers at the lowest and highest values, box from Q1 to Q3, median line inside the box.]
16- Chapter Outline
- Introduction (2.1, 2.2)
- Data Cleaning (2.3)
- Data Integration (2.4)
- Data Transformation (2.4)
- Data Reduction (2.5)
- Concept Hierarchy (2.6)
17- Data Cleaning
- Importance
- Data cleaning is the number one problem in data warehousing
- In data cleaning, the following data problems are resolved:
- Incomplete data (missing data)
- Noisy data (contains errors)
- Inconsistent data (containing discrepancies)
18-- Missing Data
- Data is not always available
- E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to
- equipment malfunction
- data inconsistent with other recorded data and thus deleted
- data not entered due to misunderstanding
- certain data not being considered important at the time of entry
19--- How to Handle Missing Data?
- Fill in the missing value manually (often infeasible)
- Fill in with a global constant such as "unknown" or "n/a" (not recommended: the data mining algorithm will see this as a normal value)
- Fill in with the attribute mean or median (see the sketch below)
- Fill in with the class mean or median (classes need to be known)
- Fill in with the most likely value (using regression, decision trees, most similar records, etc.)
- Use other attributes to predict the value (e.g. if a postcode is missing, use the suburb value)
- Ignore the record
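A minimal sketch of two of these strategies, assuming a simple list-of-dictionaries record layout (the records and attribute names are illustrative): fill a missing numeric attribute with the overall mean, or with the mean of the records belonging to the same class.

```python
records = [
    {"income": 30000, "class": "budget"},
    {"income": None,  "class": "budget"},    # missing value
    {"income": 90000, "class": "premium"},
    {"income": 70000, "class": "premium"},
]

def fill_with_mean(records, attr, by_class=None):
    """Replace None values of `attr` with the (optionally per-class) mean."""
    def mean_of(rs):
        vals = [r[attr] for r in rs if r[attr] is not None]
        return sum(vals) / len(vals)
    for r in records:
        if r[attr] is None:
            group = [x for x in records
                     if by_class is None or x[by_class] == r[by_class]]
            r[attr] = mean_of(group)
    return records

print(fill_with_mean(records, "income", by_class="class")[1]["income"])  # 30000.0
```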
20-- Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to
- faulty data collection
- data entry problems
- data transmission problems
- data conversion errors
- data decay problems
- technology limitations, e.g. buffer overflow or field size limits
21--- How to Handle Noisy Data?
- Binning
- First sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, bin boundaries, etc.
- Regression
- Smooth by fitting the data to regression functions
- Clustering
- Detect and remove outliers
- Combined computer and human inspection
- Detect suspicious values and have them checked by a human
22--- Binning Methods for Data Smoothing
- Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
- Partition into equal-frequency (equi-depth) bins (see the sketch below):
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
- Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
- Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
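The same result can be reproduced programmatically. The sketch below (illustrative helper functions, not from the slides) partitions the sorted prices into equal-frequency bins and then smooths them by bin means and by bin boundaries.

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def equal_frequency_bins(data, n_bins):
    data = sorted(data)
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # every value in a bin is replaced by the (rounded) bin mean
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # every value is replaced by the closer of the bin's min and max
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```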
23--- Regression
[Figure: data points in the x-y plane smoothed by fitting the regression line y = x + 1.]
24--- Cluster Analysis
25-- Inconsistent Data
- Inconsistent data can be due to
- data entry errors
- data integration errors (different formats, codes, etc.)
- Handling inconsistent data
- It is important to have data entry verification (check both the format and the values of the data entered)
- Correct with the help of external reference data
26-- Data Cleaning as a Process
- Data discrepancy detection
- Use metadata (e.g., domain, range, correlation, distribution, DDS)
- Check field overloading
- Check for inconsistent use of codes and formats (e.g. 5/12/2004 vs. 12/5/2004)
- Check the uniqueness rule, consecutive rule, and null rule
- Use commercial tools
- Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-checking) to detect errors and make corrections
- Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., use correlation and clustering to find outliers)
27--- Properties of the Normal Distribution Curve
- The normal (distribution) curve
- From µ-σ to µ+σ: contains about 68% of the measurements (µ: mean, σ: standard deviation)
- From µ-2σ to µ+2σ: contains about 95% of the measurements
- From µ-3σ to µ+3σ: contains about 99.7% of the measurements
28--- Correlation
[Scatter plots illustrating positive correlation, negative correlation, and no correlation.]
29- Chapter Outline
- Introduction (2.1, 2.2)
- Data Cleaning (2.3)
- Data Integration (2.4)
- Data Transformation (2.4)
- Data Reduction (2.5)
- Concept Hierarchy (2.6)
30- Data Integration
- Data integration: combines data from multiple sources into a coherent data store
- Main problems
- Entity identification problem
- Identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
- Redundancy problem
- An attribute is redundant if it can be derived from other attribute(s)
- Inconsistencies in attribute naming can cause redundancy
- Solutions
- Entity identification problems can be resolved using metadata
- Some redundancy problems can also be resolved using metadata, and others can be resolved by correlation analysis
31-- Correlation
[Scatter plots illustrating positive correlation, negative correlation, and no correlation, repeated here for the data integration context.]
32--- Correlation Analysis (Numerical Data)
- Correlation coefficient (also called Pearson's product-moment coefficient):
  r_A,B = Σ(a_i - Ā)(b_i - B̄) / (n·σ_A·σ_B) = (Σ(a_i·b_i) - n·Ā·B̄) / (n·σ_A·σ_B)
- where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ(a_i·b_i) is the sum of the AB cross-product.
- If r_A,B > 0, A and B are positively correlated
- r_A,B = 0: independent
- r_A,B < 0: negatively correlated (see the sketch below)
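A small sketch (illustrative data, plain Python) that computes r_A,B directly from the formula above; it uses population standard deviations so that the n in the denominator matches the definition.

```python
import math

def pearson_r(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # population standard deviations, matching the n in the denominator
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    cross = sum(x * y for x, y in zip(a, b))   # sum of the AB cross-product
    return (cross - n * mean_a * mean_b) / (n * sd_a * sd_b)

A = [2, 4, 6, 8, 10]
B = [1, 3, 5, 7, 11]
print(pearson_r(A, B))   # close to 1.0 -> strongly positively correlated
```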
33--- Correlation Analysis (Categorical Data)
- χ² (chi-square) test: χ² = Σ (observed - expected)² / expected
- The larger the χ² value, the more likely the variables are related
- The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
34--- Chi-Square Calculation: An Example
- χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated from the data distribution in the two categories)
- The result shows that like_science_fiction and play_chess are correlated in the group (see the sketch below)

                             Play chess   Not play chess   Sum (row)
  Like science fiction       250 (90)     200 (360)        450
  Not like science fiction   50 (210)     1000 (840)       1050
  Sum (col.)                 300          1200             1500
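The χ² statistic for this table can be recomputed in a few lines; the expected count of each cell is (row total × column total) / grand total. The code below is a sketch, not part of the slides.

```python
observed = {("like_sf", "chess"): 250, ("like_sf", "no_chess"): 200,
            ("not_like_sf", "chess"): 50, ("not_like_sf", "no_chess"): 1000}

row_totals = {"like_sf": 450, "not_like_sf": 1050}
col_totals = {"chess": 300, "no_chess": 1200}
grand_total = 1500

chi2 = 0.0
for (row, col), obs in observed.items():
    expected = row_totals[row] * col_totals[col] / grand_total   # e.g. 450*300/1500 = 90
    chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 2))   # a large value (~507.9), so the two attributes are correlated
```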
35- Chapter Outline
- Introduction (2.1, 2.2)
- Data Cleaning (2.3)
- Data Integration (2.4)
- Data Transformation (2.4)
- Data Reduction (2.5)
- Concept Hierarchy (2.6)
36- Data Transformation
- In data transformation, data is transformed or consolidated into forms appropriate for mining. Data transformation can involve:
- Smoothing: remove noise from the data using binning, regression, or clustering
- Aggregation: e.g. sales data can be aggregated to monthly totals
- Generalization: concept hierarchy climbing, e.g. cities can be generalized to countries, and ages to youth, middle-aged, and senior
- Normalization: attribute data scaled to fall within a small, specified range
- Attribute/feature construction: new attributes constructed from the given ones
37-- Data Transformation: Normalization
- Min-max normalization to [new_min_A, new_max_A]:
  v' = (v - min_A) / (max_A - min_A) · (new_max_A - new_min_A) + new_min_A
- Ex. Let income range from 12,000 to 98,000, normalized to [0.0, 1.0]. Then 73,000 is mapped to (73,000 - 12,000) / (98,000 - 12,000) ≈ 0.709
- Z-score normalization (µ: mean, σ: standard deviation):
  v' = (v - µ) / σ
- Ex. Let µ = 54,000 and σ = 16,000. Then 73,000 is mapped to (73,000 - 54,000) / 16,000 ≈ 1.19
- Normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1 (see the sketch below)
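A short sketch (hypothetical helper names) applying the three normalizations to the income example above; the value range used for decimal scaling is illustrative only.

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    return (v - mean) / std

def decimal_scaling(v, max_abs):
    j = 0
    while max_abs / (10 ** j) >= 1:   # smallest j with max(|v'|) < 1
        j += 1
    return v / (10 ** j)

print(min_max(73000, 12000, 98000))   # ~0.709
print(z_score(73000, 54000, 16000))   # ~1.19
print(decimal_scaling(-986, 986))     # -0.986 (j = 3)
```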
38-- Attribute/Feature Construction
- Sometimes it is helpful or necessary to construct new attributes or features
- Helpful for understanding and accuracy
- For example: create an attribute volume based on the attributes height, depth, and width
- Construction is based on mathematical or logical operations
- Attribute construction can help to discover missing information about the relationships between data attributes
39- Chapter Outline
- Introduction (2.1, 2.2)
- Data Cleaning (2.3)
- Data Integration (2.4)
- Data Transformation (2.4)
- Data Reduction (2.5)
- Concept Hierarchy (2.6)
40- Data Reduction
- The data is often too large, and reducing it can improve performance. Data reduction means reducing the representation of the data set while producing the same (or almost the same) analytical results.
- Data reduction includes
- Reducing the number of rows (objects)
- Reducing the number of attributes (features)
- Compression
- Discretization (covered in the next section)
41-- Reducing the Number of Rows
- Aggregation
- Aggregation of data into a higher concept level
- We can have multiple levels of aggregation, e.g. weekly, monthly, quarterly, yearly, and so on
- For data reduction, use the highest aggregation level that is still sufficient
- Numerosity reduction
- Data volume can be reduced by choosing alternative forms of data representation
42--- Types of Numerosity Reduction
- Parametric
- Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
- E.g. linear regression: the data are modeled to fit a straight line (see the sketch below)
- Non-parametric
- Histograms
- Clustering
- Sampling
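As a concrete example of parametric reduction, the sketch below (illustrative data) fits y = w·x + b by least squares and keeps only the two parameters; the raw points can then be discarded and any value reconstructed approximately on demand.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = w*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - w * mean_x
    return w, b

xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
w, b = fit_line(xs, ys)
print(w, b)          # only these two numbers need to be stored
print(w * 10 + b)    # any value can be reconstructed (approximately) on demand
```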
43---- Reduction with Histograms
- A popular data reduction technique: divide the data into buckets and store a representation of each bucket (sum, count, etc.)
- Histogram types
- Equal-width: divides the range into N intervals of equal size
- Equal-depth: divides the range into N intervals, each containing approximately the same number of samples
- V-optimal: considers all histogram types for a given number of buckets and chooses the one with the least variance
- MaxDiff: after sorting the data to be approximated, the borders of the buckets are placed at the points where adjacent values have the maximum difference
44---- Example Histogram
45---- Reduction with Clustering
- Partition the data into clusters based on closeness in space; retain representatives of the clusters (centroids) and the outliers
- Effectiveness depends upon the distribution of the data
- Hierarchical clustering is possible (multi-resolution)
[Figure: clustered points with their centroids marked; an isolated point is labeled as an outlier.]
46---- Reduction with Sampling
- Sampling allows a large data set to be represented by a much smaller random sample (subset) of the data
- Will the patterns in the sample represent the patterns in the data?
- How to select a random sample? (See the sketch below.)
- Simple random sampling without replacement (SRSWOR)
- Simple random sampling with replacement (SRSWR)
- Cluster sample (SRSWOR or SRSWR from clusters)
- Stratified sample (stratum = a group based on an attribute value)
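The sketch below shows these sampling schemes with the Python standard library; the data and the stratum attribute are illustrative.

```python
import random
from collections import defaultdict

data = list(range(1, 101))                                    # illustrative records 1..100
strata = {x: ("low" if x <= 70 else "high") for x in data}    # hypothetical stratum attribute

srswor = random.sample(data, 10)                  # without replacement
srswr = [random.choice(data) for _ in range(10)]  # with replacement

# Stratified sample: draw from each stratum in proportion to its size.
groups = defaultdict(list)
for x in data:
    groups[strata[x]].append(x)
stratified = [x for g in groups.values()
              for x in random.sample(g, max(1, round(0.1 * len(g))))]

print(srswor, srswr, stratified, sep="\n")
```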
47---- Sampling
[Figure: SRSWOR (simple random sampling without replacement) and SRSWR (simple random sampling with replacement) drawing records from the raw data.]
48 Sampling Example
[Figure: raw data partitioned for a cluster/stratified sample.]
49-- Reducing the Number of Attributes
- Reduce the number of attributes, dimensions, or features
- Select a minimum set of attributes (features) that is sufficient for the data mining or analytical task
- Purpose
- Avoid the curse of dimensionality, which creates a sparse data space and bad clusters
- Reduce the amount of time and memory required by data mining algorithms
- Allow data to be more easily visualized
- May help to eliminate irrelevant and duplicate features or reduce noise
50--- Reducing the Number of Attributes: Techniques
- Step-wise forward selection (see the sketch below)
- E.g. {} → {A1} → {A1, A3} → {A1, A3, A5}
- Step-wise backward elimination
- E.g. {A1, A2, A3, A4, A5} → {A1, A3, A4, A5} → {A1, A3, A5}
- Combining forward selection and backward elimination
- Decision-tree induction (covered in Chapter 5)
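A minimal sketch of step-wise forward selection, assuming a user-supplied attribute-subset quality score (the toy score below is purely illustrative and mirrors the {A1, A3, A5} example): greedily add the attribute that most improves the score until no attribute helps.

```python
def forward_selection(attributes, score):
    """Greedy step-wise forward selection driven by a subset-quality score."""
    selected = []
    best = score(selected)
    improved = True
    while improved:
        improved = False
        best_attr = None
        for a in attributes:
            if a in selected:
                continue
            s = score(selected + [a])
            if s > best:
                best, best_attr, improved = s, a, True
        if improved:
            selected.append(best_attr)
    return selected

# Toy score: only A1, A3 and A5 are informative; the rest hurt the score.
useful = {"A1": 3, "A3": 2, "A5": 1}
score = lambda attrs: sum(useful.get(a, -1) for a in attrs)
print(forward_selection(["A1", "A2", "A3", "A4", "A5"], score))  # ['A1', 'A3', 'A5']
```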
51-- Data Compression
- Data compression reduces the size of the data and can be used for all sorts of data
- Saves storage space
- Saves communication time
- Compression can be lossless or lossy, e.g. ZIP, the discrete wavelet transform (DWT), and principal component analysis (PCA)
- For data mining, data compression is beneficial if the data mining algorithms can manipulate the compressed data directly, without uncompressing it. String compression schemes such as ZIP, for example, allow only limited manipulation of the data.
52- Chapter Outline
- Introduction (2.1, 2.2)
- Data Cleaning (2.3)
- Data Integration (2.4)
- Data Transformation (2.4)
- Data Reduction (2.5)
- Data Discretization and Concept Hierarchy (2.6)
53- Discretization
- Three types of attributes
- Nominal: values from an unordered set, e.g., color, profession
- Ordinal: values from an ordered set, e.g., military or academic rank
- Continuous: real numbers, e.g., integer or real-valued measurements
- Discretization
- Divide the range of a continuous attribute into intervals
- Some classification algorithms only accept categorical attributes
- Reduce data size by discretization
- Prepare for further analysis
54-- Discretization and Concept Hierarchy
- Discretization
- Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
- Interval labels can then be used to replace actual data values
- Supervised vs. unsupervised
- Split (top-down) vs. merge (bottom-up)
- Discretization can be performed recursively on an attribute
- Concept hierarchy formation
- Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as young, middle-aged, or senior)
55-- Discretization and Concept Hierarchy Generation for Numeric Data
- Typical methods (all of the methods can be applied recursively):
- Binning (covered above): top-down split, unsupervised
- Histogram analysis (covered above): top-down split, unsupervised
- Clustering analysis (covered above): either top-down split or bottom-up merge, unsupervised
- Entropy-based discretization: supervised, top-down split
- Interval merging by χ² analysis: unsupervised, bottom-up merge
- Segmentation by natural partitioning: top-down split, unsupervised
56--- Entropy-Based Discretization
- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the expected information requirement after partitioning is
  I(S, T) = (|S1| / |S|) · Entropy(S1) + (|S2| / |S|) · Entropy(S2)
- Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is
  Entropy(S1) = - Σ_{i=1..m} p_i · log2(p_i)
- where p_i is the probability of class i in S1
- The boundary that minimizes the entropy function over all possible boundaries is selected as the binary discretization (see the sketch below)
- The process is recursively applied to the partitions obtained until some stopping criterion is met
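A minimal sketch of one step of entropy-based (supervised) discretization on illustrative data: every boundary between adjacent distinct values is tried, and the one that minimizes the weighted entropy I(S, T) defined above is selected.

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    best_t, best_info = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                 # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2      # candidate boundary T
        left = [c for v, c in pairs if v <= t]
        right = [c for v, c in pairs if v > t]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if info < best_info:
            best_t, best_info = t, info
    return best_t, best_info

ages = [21, 23, 25, 34, 40, 45, 52, 60]
cls  = ["young", "young", "young", "mid", "mid", "mid", "senior", "senior"]
print(best_split(ages, cls))   # (29.5, ...): the first cut separates the 'young' records
```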
57--- Interval Merging by χ² Analysis
- Merging-based (bottom-up) vs. splitting-based methods
- Merge: find the best neighboring intervals and merge them to form larger intervals, recursively
- ChiMerge
- Initially, each distinct value of a numerical attribute A is considered to be one interval
- χ² tests are performed for every pair of adjacent intervals
- Adjacent intervals with the least χ² values are merged together, since low χ² values for a pair indicate similar class distributions
- The merge process proceeds recursively until a predefined stopping criterion is met (such as a significance level, max-interval, max inconsistency, etc.)
58--- Segmentation by Natural Partitioning
- A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals.
- If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
- If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
- If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
59---- Example of the 3-4-5 Rule
[Worked example (figure): a numeric value range is segmented step by step using the 3-4-5 rule.]
60-- Concept Hierarchy Generation for Categorical Data
- Categorical data
- Discrete, finite cardinality, unordered
- E.g. geographic location, job category, product
- Problem: how to impose an order on categorical data
- E.g. organize location into categories:
- street < city < province < country
61 -- Concept Hierarchy Generation for Categorical Data
- A partial order of attributes at the schema level
- E.g. street < city < state < country
- A portion of a hierarchy given by explicit data grouping
- E.g. {Jeddah, Riyadh} < Saudi Arabia
- A set of attributes with a partial order generated by the cardinality of the attributes
- E.g., street < city < state < country
- Only a partial set of attributes
- E.g., only street < city is given, not the others
- The system automatically fills in the others
62--- Automatic Concept Hierarchy Generation
- Some hierarchies can be generated automatically based on the analysis of the number of distinct values per attribute in the data set
- The attribute with the most distinct values is placed at the lowest level of the hierarchy (see the sketch below)
- Exceptions exist, e.g., weekday, month, quarter, year
63 End