Title: Data Mining
1 Data Mining
- Lecture 2
- Data Pre-processing
- Manuel Penaloza, PhD
2 Why Data Pre-processing?
- Data in data sources may be
- Incomplete (missing data)
- Noisy (random errors or outliers)
- Inconsistent (wrong codes, violation of data constraints)
- Not suitable for data mining models
- Such as redundant attributes (highly correlated attributes)
- Pre-processing consists of cleaning, reducing, and transforming the data
- Pre-processing makes data suitable for data mining
- It is an important step for successful data mining
- Good data mining results require quality data
3 Types of data
- A data set consists of a collection of data objects
- A data object is also called a record, entity, observation, instance, example, pattern, event, or case
- A data object is described by a set of attributes
- An attribute is also called a field, variable, feature, characteristic, property, or dimension
- An attribute is a property or characteristic of an object
- It may vary from object to object (eye color)
- It may vary over time (temperature)
4 Types of attributes
- Numeric data
- Integer, decimal, binary (True/False, 1/0, or Y/N)
- Non-numeric data
- Nominal: just different names to distinguish objects
- Zip codes: 57002, 33202; student names: John, Peter
- Categorical: values grouped into classes
- Course codes: CSC, EE; risk factors: High, Low
- Ordinal: ordered values
- Grades: A, B, C; temperature: Cold, Fair, Hot
- Multimedia data (different standard formats)
- Images, video, audio
- Values may be converted between types
5 Reasons for errors in data
- Data entry errors
- Temperature values 75.12, 98.12, 667.2 (misplaced '.')
- Different data formats in the sources
- Zip values 57702, Y2S6K (from different sources)
- Different use of field codes in the sources
- Wrong codes: Cold = C, but in Montreal C means Chaud (Hot)
- Faulty data collection instruments
- A streak in the same place on a set of photographs
- Technology limitations
- Limited buffer size
- Value not considered important or inapplicable
6 Dealing with missing data
- Remove the record (if class is missing)
- Ignore missing value (if allowed by algorithm)
- Fill in missing value manually (time consuming)
- Use a global constant (e.g., UNKNOWN, n/a, 99)
- Use attribute mean (for samples of same class)
- Infer value using regression or decision trees
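A minimal sketch of the removal and fill-in options, assuming the data sits in a pandas DataFrame; the column names and values below are illustrative, not from a real data set:

```python
import numpy as np
import pandas as pd

# Illustrative records with a missing temperature value
df = pd.DataFrame({
    "class": ["play", "play", "no-play", "no-play"],
    "temperature": [75.0, np.nan, 64.0, 68.0],
})

# Option 1: remove records with missing values
cleaned = df.dropna()

# Option 2: fill with a global constant
filled = df.fillna({"temperature": 99})

# Option 3: fill with the attribute mean of samples of the same class
df["temperature"] = df.groupby("class")["temperature"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)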
7 Dealing with noisy data
- Noise is a random error in a measured variable
- Values of variables might be distorted, or
- Spurious objects have been added to the data
- Outliers may be either noise or valid data
- Use histograms and scatter plots to detect them
- Look for extreme and/or lonely values
8 Other outlier detection methods
- Using boxplots
- Sort values and determine IQR = Q3 - Q1 (Q1 and Q3 are the 1st and 3rd quartiles)
- Remove values more than 1.5 x IQR below Q1 or 1.5 x IQR above Q3
- Assume attribute B with values 30.83, 58.67, 24.50, 27.83, 20.17, 32.08, 54.42, 49.50, 34.92, 19.75, ...
- After sorting the values, we get Q1 = 22.67, Q3 = 37.75, IQR = 37.75 - 22.67 = 15.08
- Outlier if value > 60.37 (= Q3 + 1.5 x IQR)
- Using the mean (M) and standard deviation (S)
- Remove values below M - 3S or above M + 3S
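A sketch of both rules with NumPy; note that the slide's Q1 and Q3 come from the full attribute (the value list above is truncated), and NumPy's quantile convention differs slightly, so the exact cutoffs computed here will not match 60.37:

```python
import numpy as np

# The attribute B values listed above (truncated sample)
b = np.array([30.83, 58.67, 24.50, 27.83, 20.17,
              32.08, 54.42, 49.50, 34.92, 19.75])

# Boxplot rule: flag values beyond 1.5 x IQR from the quartiles
q1, q3 = np.percentile(b, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(b[(b < lower) | (b > upper)])

# Mean/standard-deviation rule: flag values outside M +/- 3S
m, s = b.mean(), b.std(ddof=1)
print(b[(b < m - 3 * s) | (b > m + 3 * s)])
```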
9 Dealing with inconsistent data
- Inconsistency is mainly due to data integration
- In relational databases
- Enforce data constraints
- Age values must fall within a valid range
- Enforce referential integrity
- An employee's department must be a valid department number
- Enforce the right use of codes
- Define standard codes: Yes/No, Y/N, or 1/0
- Detect and deal with misclassifications
- An attribute with the following value frequencies
- USA: 1, France: 1, US: 156, Europe: 46, Japan: 34
- The USA and France values are misclassified
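A sketch of spotting misclassified codes from value frequencies with pandas; the counts mirror the example above:

```python
import pandas as pd

# Country attribute with two stray codes mixed in
country = pd.Series(["US"] * 156 + ["Europe"] * 46 + ["Japan"] * 34
                    + ["USA", "France"])

# Rare values often signal misclassifications
counts = country.value_counts()
print(counts[counts < 5])          # USA: 1, France: 1

# Map the stray codes onto the standard ones
country = country.replace({"USA": "US", "France": "Europe"})
```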
10 Unified Date Format
- Date fields cause problems
- Systems may accept them in different formats
- June 12, 2006; 06/12/06; 06/12/2006; 06-12-03; etc.
- Another common format: YYYYMMDD (or MMDDYYYY)
- Fields may have values such as 2006, 200603, 9999
- Find out what month and/or day is implicit in incomplete dates
- Convert all date fields to a standard format (e.g., YYYYMMDD)
- This format does not preserve intervals
- 20060612 - 20060530 = 82, not the 13 days actually between the two dates
- Convert to the KSP Date Format if you want interval preservation
- KSP Date = YYYY + (Julian Day - 0.5) / (365 + 1_if_leap_year)
- Example: June 12, 2006 is day 163 of a non-leap year
- 2006 + (163 - 0.5) / 365 = 2006.4452 (rounded to four digits)
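A small sketch of the conversion in Python; `ksp_date` is an illustrative helper name, not a standard library function:

```python
from datetime import date
import calendar

def ksp_date(d: date) -> float:
    """KSP Date = YYYY + (Julian day - 0.5) / (365 + 1 if leap year)."""
    julian_day = d.timetuple().tm_yday            # day of year, 1..366
    days_in_year = 366 if calendar.isleap(d.year) else 365
    return d.year + (julian_day - 0.5) / days_in_year

print(round(ksp_date(date(2006, 6, 12)), 4))      # 2006.4452
```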
11 Data transformations (DT)
- Make data suitable for data mining
- By reducing data size
- By eliminating redundant attributes
- To increase data accuracy
- To satisfy data mining model requirements
- Important techniques
- Aggregation
- Dimensionality Reduction
- Attribute subset selection
- Attribute creation
- Discretization
- Normalization
12 DT: Aggregation
- Combine two or more objects into one object
- Aggregation can eliminate attributes, or
- Reduce the number of values for an attribute
- Motivation: reduce the size of the data set
- Requires less memory and processing time, and allows the use of more complex data mining algorithms
- Example: aggregate store sales data by product, city, or region
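A sketch of the store-sales example with pandas; the table contents are made up for illustration:

```python
import pandas as pd

# Hypothetical per-transaction sales records
sales = pd.DataFrame({
    "city":    ["Rapid City", "Rapid City", "Sioux Falls", "Sioux Falls"],
    "product": ["widget", "gadget", "widget", "gadget"],
    "amount":  [120.0, 80.0, 200.0, 150.0],
})

# Aggregating by city collapses many records into one per city
# and eliminates the product attribute
print(sales.groupby("city")["amount"].sum())
```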
13 DT: Dimensionality reduction
- Main technique: Principal Components Analysis (PCA)
- PCA is a linear algebra technique that finds new attributes that are linear combinations of the original ones
- The technique rotates the data from the existing axes (attributes) to new positions
- The first two components often retain most of the variation in the original data
- Multiply the original data by the principal components to obtain the reduced data
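A minimal PCA sketch using scikit-learn; the synthetic data and the choice of two components are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 100 objects, 4 correlated numeric attributes
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 2))])

# Rotate the axes and keep the first two principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # variance retained per component
```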
14 DT: Attribute subset selection
- Remove redundant attributes that carry no additional useful information
- Example: the purchase price of a product and the sales tax paid for it contain the same information
- Redundancy can be detected by computing the correlation coefficient
- Correlation measures how strongly two attributes are related
- Height and weight, for example, are somewhat related
- The correlation coefficient, r, ranges from -1.0 to 1.0
- The closer r is to -1 or 1, the more closely the two variables are related
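A sketch of the redundancy check with NumPy, using made-up price and tax figures:

```python
import numpy as np

# Hypothetical purchase prices and the 6% sales tax paid on each
price = np.array([10.0, 25.0, 40.0, 55.0, 80.0])
tax = 0.06 * price

# Pearson correlation coefficient r
r = np.corrcoef(price, tax)[0, 1]
print(r)   # 1.0: perfectly correlated, so one attribute is redundant
```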
15 DT: Attribute creation
- Create new attributes from the original attributes
- Feature extraction is one such technique
- Example: use a set of photographs to classify whether a human face is present
- Instead of using every pixel, use features extracted from them, such as the presence or absence of certain edges, areas, or shapes correlated with human faces
- Attribute construction is based on mathematical or logical operations
- Example: create the attribute area from the attributes height and width
16 DT: Discretization (1)
- Discretization of continuous attributes
- Transform numerical values into categorical values
- Equal width: divide the value range into intervals of equal size
- Example: temperature values (page 12)
- Equal height: gives better breakpoints
- Don't split equal values across bins
- Example: temperature values with height 4
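A sketch of both methods with pandas, using the temperature values listed on slide 18; the Cold/Fair/Hot labels and the choice of three bins are illustrative:

```python
import pandas as pd

temperature = pd.Series([64, 65, 68, 69, 70, 71, 72, 72,
                         75, 75, 80, 81, 83, 85])

# Equal width: bins of equal size across the value range
equal_width = pd.cut(temperature, bins=3, labels=["Cold", "Fair", "Hot"])

# Equal height (equal frequency): roughly the same count per bin
equal_height = pd.qcut(temperature, q=3, labels=["Cold", "Fair", "Hot"])

print(pd.DataFrame({"temp": temperature,
                    "width": equal_width,
                    "height": equal_height}))
```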
17 DT: Discretization (2)
- Supervised method (class dependent)
- Divide each attribute's range into intervals
- Sort instances based on the attribute's values
- Place breakpoints where the class changes
- Example: temperature values
- Result: 8 intervals or groups
- Enforcing a minimum number of instances per interval reduces this to 3 groups
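A sketch of the class-based breakpoint placement in plain Python, pairing the temperature values with the play/no-play class labels from the standard weather data; because equal values are never split across bins, this sketch yields slightly fewer groups than the slide's count:

```python
# (temperature, class) pairs, sorted by temperature
pairs = sorted([(64, "yes"), (65, "no"), (68, "yes"), (69, "yes"),
                (70, "yes"), (71, "no"), (72, "no"), (72, "yes"),
                (75, "yes"), (75, "yes"), (80, "no"), (81, "yes"),
                (83, "yes"), (85, "no")])

# Place a breakpoint between consecutive values where the class changes
breakpoints = []
for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
    if c1 != c2 and v1 != v2:
        breakpoints.append((v1 + v2) / 2)
print(breakpoints)   # one interval boundary per class change
```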
18 DT: Normalization
- Transform a set of values into values in some range
- Avoids having a variable with large values dominate the data mining results
- Main techniques
- Min-max: values transformed into the 0-1 interval
- Z-score: using the mean (X) and standard deviation (sx)
- Apply the transformation (x - X) / sx
- Example values: 64 65 68 69 70 71 72 72 75 75 80 81 83 85
- For these values, X = 73.57, sx = 6.57; z-score normalized values: -1.46 -1.30 -0.85 -0.70 -0.54 -0.39 -0.24 -0.24 0.22 0.22 0.98 1.13 1.43 1.74
- Min-max: minimum value 64, maximum value 85; apply the transformation (x - min) / (max - min); normalized values: 0.0 0.05 0.19 0.24 0.29 0.33 0.38 0.38 0.52 0.52 0.76 0.81 0.90 1.0
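Both transformations in NumPy, reproducing the numbers above (the sample standard deviation is used for the z-score):

```python
import numpy as np

x = np.array([64, 65, 68, 69, 70, 71, 72, 72,
              75, 75, 80, 81, 83, 85], dtype=float)

# Min-max: map values into the 0-1 interval
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score: center on the mean, scale by the standard deviation
z = (x - x.mean()) / x.std(ddof=1)

print(np.round(min_max, 2))   # 0.0 0.05 0.19 ... 1.0
print(np.round(z, 2))         # -1.46 -1.30 -0.85 ... 1.74
```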
19 Conversions
- From nominal to numeric
- Status {Single, Married, Divorced, Widow} to {1, 2, 3, 4}
- From binary to numeric
- Gender {M, F} converted to {0, 1}
- From ordinal to numeric
- Grades {A, B, C, D, F} to {4, 3, 2, 1, 0}
- From nominal to binary
- For each value, create a binary variable (e.g., SS = 1 if Status = Single, 0 otherwise)
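A sketch of the nominal-to-binary and ordinal-to-numeric conversions with pandas:

```python
import pandas as pd

# Nominal to binary: one indicator variable per value (SS, SM, ...)
status = pd.Series(["Single", "Married", "Divorced", "Widow"])
indicators = pd.get_dummies(status, prefix="S").astype(int)
print(indicators)

# Ordinal to numeric: map grades onto an ordered scale
grades = pd.Series(["A", "B", "C", "D", "F"])
print(grades.map({"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}).tolist())
```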
20 Data pre-processing with Weka
- Weka is a data mining toolkit
- Developed by the University of Waikato, New Zealand
- Used mainly for research and education
- Weka data files use the native ARFF format
- It can import files from other formats (e.g., CSV)
- Pre-processing tools are called filters
- Normalization, re-sampling, discretization, numeric transformations, ...
- Several tutorials
- http://www.cs.queensu.ca/home/cisc333/tutorial/Weka.html
- http://maya.cs.depaul.edu/classes/ect584/WEKA/index.html
21 Several environments in Weka
- The Weka GUI presents four environments
- Command Line Interface (CLI), the Experimenter, the Explorer, and Knowledge Flow
- We will use the Explorer here
22 Pre-process the Iris data set (1)
- Click Explorer and click the Open file button
- Select the iris.arff file from the data folder
- Weka provides some information about the data
- Number of instances
- Data file attributes
- Statistics and a histogram for the attributes, one at a time
- You can switch between attributes
- You can visualize all attributes by clicking the Visualize all button in the lower right window
23 Pre-process the Iris data set (2)
- Select the petallength attribute, click the Choose button, select the attribute folder under the unsupervised folder, and finally select the Discretize filter
24 Pre-process the Iris data set (3)
- Click the Apply button in the upper right position
- Results are shown in the middle right window
- Parameters can be changed for the Discretize filter
- By clicking on the text at the right side of the Choose button
25 Exercise 2: Using the WEKA Filters
- Become familiar with the use of the WEKA filters using the Explorer
- Load the weather data (weather.arff) and do the following
- Remove the second attribute (temperature) using the Remove unsupervised filter (see page 382 in your textbook), or the Remove button at the bottom left of the screen
- Create a new attribute using the AddExpression unsupervised filter that adds 5 to the humidity values
- Normalize the numeric attributes (using the Normalize unsupervised filter)
- Display the results using the Edit button