Data Cleansing: Filling Missing Values in Data - PowerPoint PPT Presentation

1
Data Cleansing: Filling Missing Values in Data
  • Class Presentation
  • CIS 764
  • Instructor: Dr. William Hankley
  • Presented by: Gaurav Chauhan

2
Overview
  • Problems Caused
  • Methods for retrieving missing values
  • Predicting values
  • The average way
  • The probabilistic way
  • By leveraging the relational network structure
  • Conclusions

3
Problems Caused
  • Missing values cause problems in the following
    data analysis tasks:
  • Summarizing variables
  • Computing new variables
  • Comparing variables
  • Combining variables
  • Time series analysis

4
Methods for retrieving missing values
  • Averaging the available values to predict the
    missing one
  • Using a probabilistic approach for value prediction
  • Leveraging the relational network structure of the
    data to predict values

5
Predicting Values - the average way
Year   Rainfall (avg, cm)       Temperature (avg)
1936   30                       60 °F
1937   32                       66 °F
1938   N/A (predicted 28.5)     62 °F
1939   25                       64 °F
1940   23                       69 °F
1941   30                       59 °F
1942   N/A (predicted 29.0)     60 °F
1943   28                       59 °F
1944   22                       65 °F

6
Finding the values for 1938 and 1942
  • The rainfall for each of these two years can be
    estimated as the average of the adjacent years
  • Taking the average of the rainfall in 1937 and 1939
  • Rainfall in 1938 = (32 + 25)/2 cm
  • = 28.5 cm
  • Taking the average of the rainfall in 1941 and 1943
  • Rainfall in 1942 = (30 + 28)/2 cm
  • = 29 cm
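
The neighbor-averaging step can be sketched in Python as follows. This is a minimal illustration, not code from the presentation: the function name fill_with_neighbor_mean and the use of None to mark a missing value are assumptions made for the example, and the data is the rainfall column from slide 5.

```python
from typing import Optional

def fill_with_neighbor_mean(values: list[Optional[float]]) -> list[Optional[float]]:
    """Fill each missing entry (None) with the mean of its two immediate neighbors.

    Hypothetical helper: entries at either end of the series, or with a
    missing neighbor, are left as None.
    """
    filled = list(values)
    for i, value in enumerate(values):
        if value is None and 0 < i < len(values) - 1:
            prev, nxt = values[i - 1], values[i + 1]
            if prev is not None and nxt is not None:
                filled[i] = (prev + nxt) / 2
    return filled

# Average rainfall in cm for 1936..1944; None marks the N/A years.
rainfall_cm = [30, 32, None, 25, 23, 30, None, 28, 22]
filled = fill_with_neighbor_mean(rainfall_cm)
print(filled[2], filled[6])  # values for 1938 and 1942
```

Only the two adjacent years are used here, which matches the slide; a longer window or a whole-column mean would be a different design choice.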

7
Predicting Values - the probabilistic way
  • Assume that we have n values and are required
    to predict the (n+1)th value, v_{n+1}
  • For every i from 1 to n, the probability that a
    data instance has the value v_i is p(v_i)
  • Each of these probabilities is calculated on the
    basis of the frequency with which v_i occurs in
    the data
  • v_{n+1} is then picked at random such that
  • p(v_{n+1} = v_i) > p(v_{n+1} = v_j)
  • if p(v_i) > p(v_j)
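
One way to read the rule above is as frequency-weighted random sampling: a value that occurs more often in the data is proportionally more likely to be drawn for the missing slot. The sketch below assumes that reading. The function names and the sample data are hypothetical and are not taken from the presentation.

```python
import random
from collections import Counter

def value_probabilities(observed):
    """Estimate p(v) for each distinct value from its relative frequency."""
    counts = Counter(observed)
    total = len(observed)
    return {value: count / total for value, count in counts.items()}

def predict_missing(observed, rng):
    """Draw one value at random, weighted by how often it occurs in the data."""
    probs = value_probabilities(observed)
    values = list(probs)
    weights = [probs[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]

# Hypothetical observed values: "A" occurs 3 times, "B" twice, "C" once.
observed = ["A", "B", "A", "C", "A", "B"]
probs = value_probabilities(observed)

# A seeded generator keeps the illustration repeatable.
rng = random.Random(0)
draws = Counter(predict_missing(observed, rng) for _ in range(10_000))
print(probs)
print(draws.most_common())
```

Over many draws the more frequent values are picked more often, which is the ordering the slide states; any single prediction can still be a less frequent value.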

8
Predicting Values by leveraging the relational
network
  • This technique applies only to relational data
  • A missing value is predicted as the mode of the
    values held by the instance's peers, that is, the
    instances that fit the same relational network and
    have no missing values

9
Predicting Values by leveraging the relational
network

10
Predicting Values by leveraging the relational
network
  • Example 1
  • Complete data:
      Book A            Book C        Book B
      Category A        Category C    Category B
  • Data with a missing category:
      Book A            Book C        Book B
      ? (Predicted A)   Category C    Category B

11
Predicting Values by leveraging the relational
network
  • Example 2
  • Teacher
      Student 1    Student 2         Student 3    Student 4
      Age (19)     ? (Predicted 19)  Age (18)     Age (19)
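
Example 2 can be sketched in Python as taking the mode of the known ages among students who share the same teacher. This is an illustrative sketch only: the function name predict_from_peers, the dictionary layout, and the use of None for the missing age are assumptions, not part of the presentation.

```python
from collections import Counter

def predict_from_peers(peer_values):
    """Return the most common known value among peers, or None if none is known."""
    known = [v for v in peer_values if v is not None]
    if not known:
        return None
    # most_common(1) gives [(value, count)] for the most frequent value.
    return Counter(known).most_common(1)[0][0]

# Students linked to the same teacher; Student 2's age is missing.
student_ages = {
    "Student 1": 19,
    "Student 2": None,
    "Student 3": 18,
    "Student 4": 19,
}

peers = [age for name, age in student_ages.items() if name != "Student 2"]
predicted_age = predict_from_peers(peers)
print(predicted_age)
```

If two values tie for most common, Counter returns the one encountered first, so a real application would need an explicit tie-breaking rule.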

12
Conclusion
  • Missing values degrade data that is used for
    analysis, learning, or mining purposes
  • Various techniques aim at predicting missing values,
    but none has reached 100% accuracy
  • An average prediction accuracy of 90% for these
    values is still acceptable

13
References
  • www.hrs.co.nz
  • http://dblife.cs.wisc.edu/search.cgi?entity=entity-8982

14
Questions, Anyone?
  • I am shivering not because of nervousness but
    because of cold room temperature
  • -one nervous student