Title: Data Mining
1 Data Mining
- Lecture 2
- Data Pre-processing
- Manuel Penaloza, PhD
2 Why Data Pre-processing?
- Data in data sources may be
- Incomplete (missing data)
- Noisy (random errors or outliers)
- Inconsistent (wrong codes, violation of data constraints)
- Not suitable for data mining models
- Such as redundant attributes (highly correlated attributes)
- Pre-processing consists of cleaning, reducing, and transforming the data
- Pre-processing makes data suitable for data mining
- It is an important step for successful data mining
- Good data mining results require quality data
3 Types of data
- A data set consists of a collection of data objects
- A data object is also called a record, entity, observation, instance, example, pattern, event, or case
- A data object is described by a set of attributes
- An attribute is also called a field, variable, feature, characteristic, property, or dimension
- An attribute is a property or characteristic of an object
- It may vary from object to object (eye color)
- It may vary over time (temperature)
4 Types of attributes
- Numeric data
- Integer, decimal, binary (True/False, 1/0, or Y/N)
- Non-numeric data
- Nominal: just different names to distinguish objects
- Zip codes: 57002, 33202; student names: John, Peter
- Categorical: values grouped into classes
- Course codes: CSC, EE; risk factors: High, Low
- Ordinal: ordered values
- Grades: A, B, C; temperature: Cold, Fair, Hot
- Multimedia data (different standard formats)
- Images, video, audio
- Values may be converted between types
5 Reasons for errors in data
- Data entry errors
- Temperature values 75.12, 98.12, 667.2 (misplaced '.')
- Different data formats in the sources
- Zip values 57702, Y2S6K (from different sources)
- Different use of field codes in the sources
- Wrong codes: Cold = C, but in Montreal C means Chaud (Hot)
- Faulty data collection instruments
- A streak in the same place on a set of photographs
- Technology limitations
- Limited buffer size
- Value not considered important or inapplicable
6 Dealing with missing data
- Remove the record (if class is missing)
- Ignore missing value (if allowed by algorithm)
- Fill in missing value manually (time consuming)
- Use a global constant (e.g., UNKNOWN, n/a, 99)
- Use attribute mean (for samples of same class)
- Infer value using regression or decision trees
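A minimal sketch of the removal and fill-in options, assuming the data sits in a pandas DataFrame; the column names and values below are illustrative, not from a real data set:

```python
import numpy as np
import pandas as pd

# Illustrative records with a missing temperature value
df = pd.DataFrame({
    "class": ["play", "play", "no-play", "no-play"],
    "temperature": [75.0, np.nan, 64.0, 68.0],
})

# Option 1: remove records with missing values
cleaned = df.dropna()

# Option 2: fill with a global constant
filled = df.fillna({"temperature": 99})

# Option 3: fill with the attribute mean of samples of the same class
df["temperature"] = df.groupby("class")["temperature"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)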
7 Dealing with noisy data
- Noise is a random error in a measured variable
- Values of variables might be distorted, or
- Spurious objects have been added to the data
- Outliers may be either noise or valid data
- Use histograms and scatter plots to detect them
- Look for extreme and/or lonely values
8 Other outlier detection methods
- Using boxplots
- Sort values and determine IQR = Q3 - Q1 (Q1 and Q3 are the 1st and 3rd quartiles)
- Remove values more than 1.5 x IQR below Q1 or 1.5 x IQR above Q3
- Assume attribute B with values 30.83, 58.67, 24.50, 27.83, 20.17, 32.08, 54.42, 49.50, 34.92, 19.75, ...
- After sorting the values, we get Q1 = 22.67, Q3 = 37.75, IQR = 37.75 - 22.67 = 15.08
- Outlier if value > 60.37 (= Q3 + 1.5 x IQR)
- Using the mean (M) and standard deviation (S)
- Remove values below M - 3S or above M + 3S
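A sketch of both rules with NumPy; note that the slide's Q1 and Q3 come from the full attribute (the value list above is truncated), and NumPy's quantile convention differs slightly, so the exact cutoffs computed here will not match 60.37:

```python
import numpy as np

# The attribute B values listed above (truncated sample)
b = np.array([30.83, 58.67, 24.50, 27.83, 20.17,
              32.08, 54.42, 49.50, 34.92, 19.75])

# Boxplot rule: flag values beyond 1.5 x IQR from the quartiles
q1, q3 = np.percentile(b, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(b[(b < lower) | (b > upper)])

# Mean/standard-deviation rule: flag values outside M +/- 3S
m, s = b.mean(), b.std(ddof=1)
print(b[(b < m - 3 * s) | (b > m + 3 * s)])
```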
9 Dealing with inconsistent data
- Inconsistency is mainly due to data integration
- In relational databases
- Enforce data constraints
- Age values must fall within a valid range
- Enforce referential integrity
- An employee's department must be a valid department number
- Enforce the right use of codes
- Define standard codes: Yes/No, Y/N, or 1/0
- Detect and deal with misclassifications
- An attribute with the following value frequencies
- USA: 1, France: 1, US: 156, Europe: 46, Japan: 34
- The USA and France values are misclassified
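A sketch of spotting misclassified codes from value frequencies with pandas; the counts mirror the example above:

```python
import pandas as pd

# Country attribute with two stray codes mixed in
country = pd.Series(["US"] * 156 + ["Europe"] * 46 + ["Japan"] * 34
                    + ["USA", "France"])

# Rare values often signal misclassifications
counts = country.value_counts()
print(counts[counts < 5])          # USA: 1, France: 1

# Map the stray codes onto the standard ones
country = country.replace({"USA": "US", "France": "Europe"})
```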
10 Unified Date Format
- Date fields cause problems
- Systems may accept them in different formats
- June 12, 2006; 06/12/06; 06/12/2006; 06-12-03; etc.
- Another common format: YYYYMMDD (or MMDDYYYY)
- Fields may have values such as 2006, 200603, 9999
- Find out what month and/or day is implicit in incomplete dates
- Convert all date fields to a standard format (e.g., YYYYMMDD)
- This format does not preserve intervals
- 20060612 - 20060530 = 82, not the 13 days actually between the two dates
- Convert to the KSP Date Format if you want interval preservation
- KSP Date = YYYY + (Julian Day - 0.5) / (365 + 1_if_leap_year)
- Example: June 12, 2006 is day 163 of a non-leap year
- 2006 + (163 - 0.5) / 365 = 2006.4452 (rounded to four digits)
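A small sketch of the conversion in Python; `ksp_date` is an illustrative helper name, not a standard library function:

```python
from datetime import date
import calendar

def ksp_date(d: date) -> float:
    """KSP Date = YYYY + (Julian day - 0.5) / (365 + 1 if leap year)."""
    julian_day = d.timetuple().tm_yday            # day of year, 1..366
    days_in_year = 366 if calendar.isleap(d.year) else 365
    return d.year + (julian_day - 0.5) / days_in_year

print(round(ksp_date(date(2006, 6, 12)), 4))      # 2006.4452
```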
11 Data transformations (DT)
- Make data suitable for data mining
- By reducing data size
- By eliminating redundant attributes
- To increase data accuracy
- To satisfy data mining model requirements
- Important techniques
- Aggregation
- Dimensionality Reduction
- Attribute subset selection
- Attribute creation
- Discretization
- Normalization
12 DT: Aggregation
- Combine two or more objects into one object
- Aggregation can eliminate attributes, or
- Reduce the number of values for an attribute
- Motivation: reduce the size of the data set
- Requires less memory and processing time, and allows the use of more complex data mining algorithms
- Example: aggregate store sales data by product, city, or region
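A sketch of the store-sales example with pandas; the table contents are made up for illustration:

```python
import pandas as pd

# Hypothetical per-transaction sales records
sales = pd.DataFrame({
    "city":    ["Rapid City", "Rapid City", "Sioux Falls", "Sioux Falls"],
    "product": ["widget", "gadget", "widget", "gadget"],
    "amount":  [120.0, 80.0, 200.0, 150.0],
})

# Aggregating by city collapses many records into one per city
# and eliminates the product attribute
print(sales.groupby("city")["amount"].sum())
```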
13 DT: Dimensionality reduction
- Main technique: Principal Components Analysis (PCA)
- PCA is a linear algebra technique that finds new attributes that are linear combinations of the original ones
- The technique rotates the data from the existing axes (attributes) to new positions
- The first two components often retain most of the variation in the original data
- Multiply the original data by the principal components to obtain the reduced data
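A minimal PCA sketch using scikit-learn; the synthetic data and the choice of two components are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 100 objects, 4 correlated numeric attributes
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 2))])

# Rotate the axes and keep the first two principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # variance retained per component
```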
14 DT: Attribute subset selection
- Remove redundant attributes that carry no additional useful information
- Example: the purchase price of a product and the sales tax paid for it contain the same information
- Redundancy can be detected by computing the correlation coefficient
- Correlation measures how strongly two attributes are related
- Height and weight, for example, are somewhat related
- The correlation coefficient, r, ranges from -1.0 to 1.0
- The closer r is to -1 or 1, the more closely the two variables are related
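A sketch of the redundancy check with NumPy, using made-up price and tax figures:

```python
import numpy as np

# Hypothetical purchase prices and the 6% sales tax paid on each
price = np.array([10.0, 25.0, 40.0, 55.0, 80.0])
tax = 0.06 * price

# Pearson correlation coefficient r
r = np.corrcoef(price, tax)[0, 1]
print(r)   # 1.0: perfectly correlated, so one attribute is redundant
```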
15 DT: Attribute creation
- Create new attributes from the original attributes
- Feature extraction is one such technique
- Example: use a set of photographs to classify whether a human face is present
- Instead of using every pixel, use features extracted from them, such as the presence or absence of certain edges, areas, or shapes correlated with human faces
- Attribute construction is based on mathematical or logical operations
- Example: create the attribute area from the attributes height and width
16 DT: Discretization (1)
- Discretization of continuous attributes
- Transform numerical values into categorical values
- Equal width: divide the value range into intervals of equal size
- Example: temperature values (page 12)
- Equal height: gives better breakpoints
- Don't split equal values across bins
- Example: temperature values with height 4
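A sketch of both methods with pandas, using the temperature values listed on slide 18; the Cold/Fair/Hot labels and the choice of three bins are illustrative:

```python
import pandas as pd

temperature = pd.Series([64, 65, 68, 69, 70, 71, 72, 72,
                         75, 75, 80, 81, 83, 85])

# Equal width: bins of equal size across the value range
equal_width = pd.cut(temperature, bins=3, labels=["Cold", "Fair", "Hot"])

# Equal height (equal frequency): roughly the same count per bin
equal_height = pd.qcut(temperature, q=3, labels=["Cold", "Fair", "Hot"])

print(pd.DataFrame({"temp": temperature,
                    "width": equal_width,
                    "height": equal_height}))
```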
17 DT: Discretization (2)
- Supervised method (class dependent)
- Divide each attribute's range into intervals
- Sort instances based on the attribute's values
- Place breakpoints where the class changes
- Example: temperature values
- Result: 8 intervals or groups
- Enforcing a minimum number of instances per interval reduces this to 3 groups
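A sketch of the class-based breakpoint placement in plain Python, pairing the temperature values with the play/no-play class labels from the standard weather data; because equal values are never split across bins, this sketch yields slightly fewer groups than the slide's count:

```python
# (temperature, class) pairs, sorted by temperature
pairs = sorted([(64, "yes"), (65, "no"), (68, "yes"), (69, "yes"),
                (70, "yes"), (71, "no"), (72, "no"), (72, "yes"),
                (75, "yes"), (75, "yes"), (80, "no"), (81, "yes"),
                (83, "yes"), (85, "no")])

# Place a breakpoint between consecutive values where the class changes
breakpoints = []
for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
    if c1 != c2 and v1 != v2:
        breakpoints.append((v1 + v2) / 2)
print(breakpoints)   # one interval boundary per class change
```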
18 DT: Normalization
- Transform a set of values into values in some range
- Avoids having a variable with large values dominate the data mining results
- Main techniques
- Min-max: values transformed into the 0-1 interval
- Z-score: using the mean (X) and standard deviation (sx)
- Apply the transformation (x - X) / sx
- Example values: 64 65 68 69 70 71 72 72 75 75 80 81 83 85
- For these values, X = 73.57, sx = 6.57; z-score normalized values: -1.46 -1.30 -0.85 -0.70 -0.54 -0.39 -0.24 -0.24 0.22 0.22 0.98 1.13 1.43 1.74
- Min-max: minimum value 64, maximum value 85; apply the transformation (x - min) / (max - min); normalized values: 0.0 0.05 0.19 0.24 0.29 0.33 0.38 0.38 0.52 0.52 0.76 0.81 0.90 1.0
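Both transformations in NumPy, reproducing the numbers above (the sample standard deviation is used for the z-score):

```python
import numpy as np

x = np.array([64, 65, 68, 69, 70, 71, 72, 72,
              75, 75, 80, 81, 83, 85], dtype=float)

# Min-max: map values into the 0-1 interval
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score: center on the mean, scale by the standard deviation
z = (x - x.mean()) / x.std(ddof=1)

print(np.round(min_max, 2))   # 0.0 0.05 0.19 ... 1.0
print(np.round(z, 2))         # -1.46 -1.30 -0.85 ... 1.74
```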
19 Conversions
- From nominal to numeric
- Status {Single, Married, Divorced, Widow} to {1, 2, 3, 4}
- From binary to numeric
- Gender {M, F} converted to {0, 1}
- From ordinal to numeric
- Grades {A, B, C, D, F} to {4, 3, 2, 1, 0}
- From nominal to binary
- For each value, create a binary variable (e.g., SS = 1 if Status = Single, 0 otherwise)
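A sketch of the nominal-to-binary and ordinal-to-numeric conversions with pandas:

```python
import pandas as pd

# Nominal to binary: one indicator variable per value (SS, SM, ...)
status = pd.Series(["Single", "Married", "Divorced", "Widow"])
indicators = pd.get_dummies(status, prefix="S").astype(int)
print(indicators)

# Ordinal to numeric: map grades onto an ordered scale
grades = pd.Series(["A", "B", "C", "D", "F"])
print(grades.map({"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}).tolist())
```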
20 Data pre-processing with Weka
- Weka is a data mining toolkit
- Developed by the University of Waikato, New Zealand
- Used mainly for research and education
- Weka data files use the native ARFF format
- It can import files from other formats (e.g., CSV)
- Pre-processing tools are called filters
- Normalization, re-sampling, discretization, numeric transformations, ...
- Several tutorials
- http://www.cs.queensu.ca/home/cisc333/tutorial/Weka.html
- http://maya.cs.depaul.edu/classes/ect584/WEKA/index.html
21 Several environments in Weka
- The Weka GUI presents four environments
- Command Line Interface (CLI), the Experimenter, the Explorer, and Knowledge Flow
- We will use the Explorer here
22 Pre-process the Iris data set (1)
- Click Explorer and click the Open file button
- Select the iris.arff file from the data folder
- Weka provides some information about the data
- Number of instances
- Data file attributes
- Statistics and a histogram for the attributes, one at a time
- You can switch between attributes
- You can visualize all attributes by clicking the Visualize all button in the lower right window
23 Pre-process the Iris data set (2)
- Select the petallength attribute, click the Choose button, select the attribute folder under the unsupervised folder, and finally select the Discretize filter
24 Pre-process the Iris data set (3)
- Click the Apply button in the upper right position
- Results are shown in the middle right window
- Parameters can be changed for the Discretize filter
- By clicking on the text at the right side of the Choose button
25 Exercise 2: Using the WEKA Filters
- Become familiar with the use of the WEKA filters using the Explorer
- Load the weather data (weather.arff) and do the following
- Remove the second attribute (temperature) using the Remove unsupervised filter (see page 382 in your textbook), or the Remove button at the bottom left of the screen
- Create a new attribute using the AddExpression unsupervised filter that adds 5 to the humidity values
- Normalize the numeric attributes (using the Normalize unsupervised filter)
- Display the results using the Edit button