Data Mining - PowerPoint PPT Presentation

1
Data Mining
  • Lecture 2
  • Data Pre-processing
  • Manuel Penaloza, PhD

2
Why Data Pre-processing?
  • Data in data sources may be
  • Incomplete (missing data)
  • Noisy (random error or outliers)
  • Inconsistent (wrong codes, violation of data
    constraints)
  • Not suitable for data mining models
  • Such as redundant attributes (highly correlated
    attributes)
  • Pre-processing consists of cleaning, reducing,
    and transforming the data
  • Pre-processing makes data suitable for data
    mining
  • It is an important step for successful data
    mining
  • Good data mining results require quality data

3
Types of data
  • A data set consists of a collection of data
    objects
  • A data object is also called a record, entity,
    observation, instance, example, pattern, event,
    or case
  • A data object is described by a set of attributes
  • An attribute is also called a field, variable,
    feature, characteristic, property, or dimension
  • An attribute is a property or characteristic of
    an object
  • It may vary from object to object (eye color)
  • It may vary over time (temperature)

4
Types of attributes
  • Numeric Data
  • Integer, decimal, binary (True/False, 1/0, or
    Y/N)
  • Non-numeric data
  • Nominal: just different names to distinguish
    objects
  • zip codes: 57002, 33202; student names: John,
    Peter
  • Categorical: values grouped into classes
  • Course codes: CSC, EE; risk factors: High, Low
  • Ordinal: values with an ordering
  • Grades: A, B, C; temperature: Cold, Fair, Hot
  • Multimedia data (different standards formats)
  • Images, Video, Audio
  • Values may be converted between types

5
Reasons for errors in data
  • Data entry errors
  • Temp values: 75.12, 98.12, 667.2 (misplaced '.')
  • Different data formats on the sources
  • Zip values: 57702, Y2S6K (error introduced by
    merging sources)
  • Different use of field codes on the sources
  • Wrong codes: in Montreal, code C means Chaud
    (Hot), not Cold
  • Faulty data collection instruments
  • A streak in the same place on a set of
    photographs
  • Technology limitations
  • Limited buffer size
  • Value not considered important or inapplicable

6
Dealing with missing data
  • Remove the record (if class is missing)
  • Ignore missing value (if allowed by algorithm)
  • Fill in missing value manually (time consuming)
  • Use a global constant (e.g., UNKNOWN, n/a, 99)
  • Use attribute mean (for samples of same class)
  • Infer value using regression or decision trees
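The attribute-mean strategy above can be sketched in plain Python; the records and field names below are made up for illustration:

```python
# Fill in a missing numeric value with the mean of the samples
# belonging to the same class (hypothetical records).
records = [
    {"class": "A", "temp": 70.0},
    {"class": "A", "temp": None},   # missing value
    {"class": "A", "temp": 74.0},
    {"class": "B", "temp": 90.0},
    {"class": "B", "temp": 92.0},
]

# Mean of each class, computed over the known values only
means = {}
for cls in {r["class"] for r in records}:
    known = [r["temp"] for r in records
             if r["class"] == cls and r["temp"] is not None]
    means[cls] = sum(known) / len(known)

# Impute: replace each missing value with its class mean
for r in records:
    if r["temp"] is None:
        r["temp"] = means[r["class"]]

print(records[1]["temp"])  # → 72.0 (mean of the other class-A values)
```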

7
Dealing with noisy data
  • Noise is a random error in a measured variable
  • Values of variables might be distorted, or
  • Spurious objects have been added to the data
  • Outliers may be either noise or valid data
  • Use histograms and scatter plots to detect them
  • Look for extreme and/or lonely values

8
Other outlier detection methods
  • Using boxplots
  • Sort values and determine IQR = Q3 − Q1 (3rd
    minus 1st quartile)
  • Remove values 1.5×IQR below Q1 or 1.5×IQR above
    Q3
  • Assume attribute B with values 30.83, 58.67,
    24.50, 27.83, 20.17, 32.08, 54.42, 49.50, 34.92,
    19.75, ...
  • After sorting the values, we get Q1 = 22.67,
    Q3 = 37.75, IQR = 37.75 − 22.67 = 15.08
  • Outlier if value > 37.75 + 1.5 × 15.08 = 60.37
    or value < 22.67 − 1.5 × 15.08 = 0.05
  • Using mean (M) and standard deviation (S)
  • Remove values below M − 3S or above M + 3S
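The boxplot rule can be expressed in a few lines of Python. Note that `statistics.quantiles` uses the "exclusive" interpolation method, so its quartiles need not match the slide's hand-computed Q1 and Q3 exactly; the value 250.0 is an artificial extreme added for illustration:

```python
import statistics

def iqr_outliers(values):
    """Return values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

data = [30.83, 58.67, 24.50, 27.83, 20.17, 32.08,
        54.42, 49.50, 34.92, 19.75, 250.0]
print(iqr_outliers(data))  # → [250.0]
```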

9
Dealing with inconsistent data
  • Inconsistency mainly due to data integrations
  • In relational databases
  • Enforce data constraints
  • e.g., Age values must fall within a valid range
  • Enforce referential integrity
  • An employee's department must be a valid
    department number
  • Enforce consistent use of codes
  • Define one standard code: Yes/No, Y/N, or 1/0
  • Detect and deal with misclassifications
  • An attribute with the following value
    frequencies:
  • USA: 1, France: 1, US: 156, Europe: 46, Japan:
    34
  • The USA and France values are misclassified
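A frequency count makes such misclassifications easy to spot. A minimal sketch, with counts mirroring the slide's example:

```python
from collections import Counter

# Country attribute with the slide's value frequencies
values = ["US"] * 156 + ["Europe"] * 46 + ["Japan"] * 34 + ["USA", "France"]

counts = Counter(values)
# Very rare values are candidates for misclassified variants
suspects = [v for v, n in counts.items() if n <= 2]
print(sorted(suspects))  # → ['France', 'USA']
```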

10
Unified Date Format
  • Date fields cause problems
  • Systems may accept them in different formats
  • June 12, 2006, 06/12/06, 06/12/2006, 06-12-03,
    etc.
  • Another common format: YYYYMMDD (or MMDDYYYY)
  • Fields may have values such as 2006, 200603,
    9999
  • Find out what month and/or day is implicit in
    incomplete dates; convert all date fields to a
    standard format (e.g., YYYYMMDD)
  • The format does not preserve intervals
  • 20060612 − 20060530 = 82, but the dates are only
    13 days apart
  • Convert to the KSP date format if you want
    interval preservation
  • KSP Date = YYYY + (Julian day − 0.5) / (365 +
    1_if_leap_year)
  • Example: June 12, 2006 is Julian day 163
  • 2006 + (163 − 0.5) / 365 = 2006.4452 (rounded to
    four digits)

11
Data transformations (DT)
  • Make data suitable for data mining
  • By reducing data size
  • By eliminating redundant attributes
  • To increase data accuracy
  • To satisfy data mining model requirements
  • Important techniques
  • Aggregation
  • Dimensionality Reduction
  • Attribute subset selection
  • Attribute creation
  • Discretization
  • Normalization

12
DT Aggregation
  • Combine two or more objects into one object
  • Aggregation can eliminate attributes or
  • Reduce the number of values for an attribute
  • Motivation: reduce the size of the data set
  • Requiring less memory and less processing time,
    and enabling more complex data mining algorithms
  • Example: aggregate store sales data by product,
    city, or region
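A sketch of the aggregation idea in Python; the sales records here are hypothetical:

```python
from collections import defaultdict

# Hypothetical store sales records: (product, city, amount)
sales = [
    ("widget", "Rapid City", 120.0),
    ("widget", "Rapid City", 80.0),
    ("widget", "Sioux Falls", 50.0),
    ("gadget", "Rapid City", 200.0),
]

# Aggregate by city: fewer objects, and the product attribute disappears
by_city = defaultdict(float)
for product, city, amount in sales:
    by_city[city] += amount

print(dict(by_city))  # → {'Rapid City': 400.0, 'Sioux Falls': 50.0}
```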

13
DT Dimensionality reduction
  • Main technique Principal Components Analysis
    (PCA)
  • PCA is a linear algebra technique that finds new
    attributes that are combinations of the original
    ones
  • Technique rotates the data from existing axes
    (attributes) to new positions
  • The first two components retain most of the
    original data
  • Project the original data by multiplying it by
    the principal components
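The steps above (center the data, find the new axes, rotate onto them) can be sketched with NumPy; the small two-attribute data set is made up for illustration:

```python
import numpy as np

# Toy data: two strongly correlated attributes
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1],
              [1.5, 1.6], [1.1, 0.9]])

# 1. Center the data
Xc = X - X.mean(axis=0)

# 2. The eigenvectors of the covariance matrix are the new axes
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]          # largest variance first
components = eigvecs[:, order]

# 3. Rotate (project) the data onto the principal components
scores = Xc @ components

explained = eigvals[order] / eigvals.sum()
print(explained[0])  # the first component retains most of the variance
```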

14
DT Attribute subset selection
  • Remove redundant attributes with no useful
    information
  • Example: the purchase price of a product and the
    sales tax paid for it contain the same
    information
  • Redundancy is determined by computing their
    correlation coefficient
  • Correlation measures how strongly two attributes
    are related
  • Height and weight are somewhat related
  • The correlation coefficient, r, ranges from -1.0
    to 1.0
  • The closer r is to -1 or 1, the more closely the
    two variables are related
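Pearson's r can be computed directly from its definition; the height and weight values below are invented for illustration:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two attributes."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical heights (cm) and weights (kg)
height = [150, 160, 170, 180, 190]
weight = [52, 61, 67, 77, 86]
print(pearson_r(height, weight))  # close to 1: strongly related
```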

15
DT Attribute creation
  • Create new attributes from the original
    attributes
  • Feature extraction is one of these techniques
  • Example: use a set of photographs to classify a
    human face
  • Instead of using every pixel, use features
    extracted from them, such as the presence or
    absence of certain edges, areas, or figures
    correlated with human faces
  • Construction is based on mathematical or logical
    operations
  • Example: create an attribute "area" based on the
    attributes height and width
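The construction example can be written in one pass over the objects; the numeric values are invented:

```python
# Construct a new attribute "area" from height and width
objects = [{"height": 2.0, "width": 3.0},
           {"height": 1.5, "width": 5.0}]
for o in objects:
    o["area"] = o["height"] * o["width"]
print([o["area"] for o in objects])  # → [6.0, 7.5]
```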

16
DT Discretization (1)
  • Discretization of continuous attributes
  • Transform numerical values to categorical values
  • Equal-width intervals
  • Example: temperature values (page 12)
  • Equal-height: it gives better breakpoints
  • Don't split equal values across bins
  • Example: temperature values with height 4
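Equal-width and equal-height binning can be sketched as follows, using the temperature values from the normalization slide (the bin counts are my choice):

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Bin index per value; the last bin is closed on the right
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_height_bins(sorted_values, height):
    """Place roughly `height` consecutive values in each bin."""
    return [sorted_values[i:i + height]
            for i in range(0, len(sorted_values), height)]

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
print(equal_width_bins(temps, 3))
print(equal_height_bins(temps, 4))
```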

17
DT Discretization (2)
  • Supervised method (class dependent)
  • Divide each attribute's range into intervals
  • Sort instances based on the attribute's values
  • Place breakpoints where class changes
  • Example: temperature values
  • Result: 8 intervals or groups
  • Enforce a minimum number of instances per
    interval (3 groups)
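The supervised method can be sketched as: sort by attribute value, then cut wherever the class label changes (the toy temperature/class pairs are invented):

```python
# Sort instances by attribute value, break where the class changes
data = [(64, "yes"), (65, "no"), (68, "yes"), (69, "yes"),
        (70, "yes"), (71, "no"), (72, "no"), (75, "yes")]
data.sort(key=lambda t: t[0])

breakpoints = [
    (data[i][0] + data[i + 1][0]) / 2      # midpoint between neighbors
    for i in range(len(data) - 1)
    if data[i][1] != data[i + 1][1]        # class label changes here
]
print(breakpoints)  # → [64.5, 66.5, 70.5, 73.5]
```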

18
DT Normalization
  • Transform a set of values to values in some
    range
  • Prevents a variable with large values from
    dominating the data mining results
  • Main techniques
  • Min-max (values transformed into the 0-1
    interval)
  • Example values: 64 65 68 69 70 71 72 72 75 75
    80 81 83 85
  • Minimum value 64, maximum value 85; apply the
    transformation (x − min) / (max − min)
  • Normalized values: 0.0 0.05 0.19 0.24 0.29 0.33
    0.38 0.38 0.52 0.52 0.76 0.81 0.90 1.0
  • Z-score (using the mean (X) and standard
    deviation (sx))
  • Apply the transformation (x − X) / sx
  • For the above values, X = 73.57, sx = 6.57;
    normalized values:
  • -1.46 -1.30 -0.85 -0.70 -0.54 -0.39 -0.24 -0.24
    0.22 0.22 0.98 1.13 1.43 1.74
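Both normalizations applied to the slide's temperature values; using the sample standard deviation reproduces the slide's numbers:

```python
import statistics

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]

# Min-max: map values into the 0-1 interval
lo, hi = min(temps), max(temps)
minmax = [round((v - lo) / (hi - lo), 2) for v in temps]

# Z-score: center on the mean, scale by the standard deviation
mean = statistics.mean(temps)
sx = statistics.stdev(temps)          # sample standard deviation
zscores = [round((v - mean) / sx, 2) for v in temps]

print(minmax[:3])   # → [0.0, 0.05, 0.19]
print(zscores[:3])  # → [-1.46, -1.3, -0.85]
```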
19
Conversions
  • From Nominal to Numeric
  • Status: Single, Married, Divorced, Widow to 1,
    2, 3, 4
  • From Binary to Numeric
  • Gender: M, F converted to 0, 1
  • From Ordinal to Numeric
  • Grades: A, B, C, D, F to 4, 3, 2, 1, 0
  • From Nominal to Binary
  • For each value, create a binary variable (e.g.,
    S_S = 1 when Status = Single)
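The nominal-to-binary conversion (one indicator variable per distinct value) in Python; the Status values come from the slide:

```python
statuses = ["Single", "Married", "Divorced", "Widow", "Single"]
values = sorted(set(statuses))

# One binary variable per distinct Status value
encoded = [{f"Status_{v}": int(s == v) for v in values} for s in statuses]
print(encoded[0]["Status_Single"])  # → 1
```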
20
Data pre-processing with Weka
  • Weka is a data mining toolkit
  • Developed by the University of Waikato, New
    Zealand
  • Used mainly for research and education
  • Weka data files use the native ARFF format
  • It can import files from other formats (e.g.,
    CSV)
  • Pre-processing tools are called filters
  • Normalization, re-sampling, discretization,
    numeric transformations, and more
  • Several tutorials
  • http://www.cs.queensu.ca/home/cisc333/tutorial/Weka.html
  • http://maya.cs.depaul.edu/classes/ect584/WEKA/index.html

21
Several environments in Weka
  • Weka GUI presents four environments
  • Command line interface (CLI), the Experimenter,
    the Explorer, and Knowledge Flow
  • We will use the Explorer here

22
Pre-process Iris data set (1)
  • Click Explorer and click Open file button
  • Select iris.arff file from the data folder
  • Weka provides some information about the data
  • Number of instances
  • Data file attributes
  • Statistics and histogram for attributes one at a
    time
  • You can switch between attributes
  • You can visualize all attributes by clicking the
    Visualize all button in the lower-right window

23
Pre-process Iris data set (2)
  • Select the petallength attribute, click the
    Choose button, select the attribute folder under
    the unsupervised folder, and finally select the
    Discretize filter

24
Pre-process Iris data set (3)
  • Click the Apply button in the upper right
  • Results are shown in the middle-right window
  • Parameters of the Discretize filter can be
    changed
  • By clicking the text to the right of the Choose
    button

25
Exercise 2 Using the WEKA Filters
  • Become familiar with the use of the WEKA filters
    using Explorer
  • Load the weather data (weather.arff) and do the
    following
  • Remove the second attribute (temperature) using
    the Remove unsupervised filter (see page 382 of
    your textbook), or the Remove button at the
    bottom left of the screen.
  • Create a new attribute using AddExpression
    unsupervised filter that adds 5 to the humidity
    values.
  • Normalize (using the normalize unsupervised
    filter) the numeric attributes.
  • Display the results using the Edit button.