Data Mining: Data - PowerPoint PPT Presentation

Description:

Sami Äyrämö (C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002. 2 ... e.g., People decline to answer a question (age, weight, position, ...


Title: Data Mining: Data


1
Data Mining: Data
  • Lecture 3
  • TIES445 Data mining
  • Nov-Dec 2007
  • Sami Äyrämö

2
Data quality
  • GIGO: Garbage In, Garbage Out
  • The effectiveness of a DM exercise depends on the
    quality of the data
  • Data quality concerns
      • individual measurements (records and fields)
      • collections of observations
  • Sources of error are infinite
      • Human error (e.g., keyboard error)
      • Instrumentation failure
      • Inaccurate or imprecise measurement
      • Inadequate specification of the measurement or
        data collection process

3
Quality of individual measurements
  • Bias
      • the difference between the mean of repeated
        measurements and the true value
  • Precision
      • variability of repeated measurements (NOTE:
        precision is not the number of digits in a record)
  • Accuracy
      • small bias and high precision (e.g., small
        variance)
      • e.g., repeated measurements of someone's height may
        be precise (reliable) but inaccurate (not valid)
        if (s)he is wearing shoes (we are not measuring
        the right thing)
  • True value (does it even exist?)
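
The distinction between bias and precision can be made concrete with a small numeric sketch. The readings and the true height below are hypothetical values chosen for illustration (they are not from the lecture), and the sample standard deviation is used here as one possible measure of precision.

```python
import statistics

# Hypothetical data: five repeated height measurements (cm) of a person
# wearing shoes, and an assumed true (barefoot) height.
TRUE_HEIGHT_CM = 170.0
readings_cm = [172.9, 173.1, 173.0, 173.2, 172.8]


def bias(measurements, true_value):
    """Difference between the mean of repeated measurements and the true value."""
    return statistics.mean(measurements) - true_value


def spread(measurements):
    """Sample standard deviation: a small spread means high precision."""
    return statistics.stdev(measurements)


measured_bias = bias(readings_cm, TRUE_HEIGHT_CM)
measured_spread = spread(readings_cm)

print(f"bias:   {measured_bias:.2f} cm")
print(f"spread: {measured_spread:.2f} cm")
```

In this toy data the readings agree closely with each other (small spread, so precise) yet sit about 3 cm above the assumed true value (large bias, so inaccurate).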

4
Quality of collections of data: bias
  • Distorted (biased) samples
      • mismatch between the sample population and
        the population of interest (selection bias),
        e.g., calculating the average age of students in
        Jyväskylä when the sample is restricted to female
        students
      • a sample may be selected through a chain of
        selection steps, e.g., candidates for bank loans:
        1) potential customers are contacted, 2) some
        reply, some do not, 3) of those who replied, some
        are creditworthy, some are not, 4) those who take
        out a loan are followed, 5) some are good
        customers, some are not
      • populations are not static (population drift),
        e.g., customers' shopping behaviour may change
        over time
  • A biased sample leads to inconsistent estimates
    of population parameters
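
The average-age example can be sketched in a few lines of Python. The student records below are invented purely for illustration; only the idea (restricting the sample to one subgroup shifts the estimate) comes from the slide.

```python
import statistics

# Hypothetical student records: (gender, age). Invented for illustration only.
students = [
    ("F", 20), ("F", 21), ("F", 22), ("F", 21),
    ("M", 24), ("M", 26), ("M", 25), ("M", 27),
]


def mean_age(records):
    """Average age over a list of (gender, age) records."""
    return statistics.mean([age for _, age in records])


# Population of interest: all students.
population_mean = mean_age(students)

# Distorted sample: selection restricted to female students.
female_only = [r for r in students if r[0] == "F"]
biased_mean = mean_age(female_only)

# Difference between the biased estimate and the population value.
selection_bias = biased_mean - population_mean

print(f"all students: {population_mean:.2f}")
print(f"female only:  {biased_mean:.2f}")
print(f"difference:   {selection_bias:.2f}")
```

In this toy data, collecting more records in the same restricted way would not move the estimate toward the population mean, which is the sense in which the estimate is inconsistent.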

5
Quality of collections of data: incomplete data
  • Incomplete data: missing or empty values
  • Missing value: information is not collected
      • e.g., people decline to answer a question (age,
        weight, position, ...)
  • Empty value: information does not exist
      • a form may have conditional parts, e.g., the expiry
        date of a driver's license cannot be filled out
        by children
  • Determining whether a value is empty or
    missing requires domain knowledge
  • If the discriminating information is not
    provided, both empty and missing values are
    treated as and called missing
  • Fundamental question for a data mining task: why
    are the data incomplete?
  • Note: a distorted (biased) sample is actually a
    special case of incomplete data
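
A minimal sketch of how domain knowledge might separate empty from missing values, using the driver's license example. The function name, the record layout, and the age threshold are all assumptions made for illustration; none of them come from the lecture.

```python
from typing import Optional

# Assumption for illustration: a driver's license cannot be held below this age.
MIN_LICENSE_AGE = 18


def classify_license_expiry(age: Optional[int], expiry: Optional[str]) -> str:
    """Label the license-expiry field as 'present', 'empty' or 'missing'."""
    if expiry is not None:
        return "present"
    if age is None:
        # No discriminating information is available: treat as missing.
        return "missing"
    if age < MIN_LICENSE_AGE:
        # Domain knowledge: the value cannot exist for this respondent.
        return "empty"
    # An adult may have no license or may have declined to answer;
    # without further information the value is called missing.
    return "missing"


# Hypothetical form records: (age, license expiry date).
records = [(35, "2030-05-01"), (9, None), (42, None), (None, None)]
labels = [classify_license_expiry(age, expiry) for age, expiry in records]
print(labels)
```

The last two records show the fallback described above: when nothing distinguishes the two cases, the value is simply labeled missing.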