Exploratory Data Analysis - PowerPoint PPT Presentation

About This Presentation
Title:

Exploratory Data Analysis

Description:

The goal of data analysis is to gain information from the data. Exploratory data analysis: set of methods to display and summarize the data. ... – PowerPoint PPT presentation

Slides: 29
Provided by: facwebC

Transcript and Presenter's Notes

Title: Exploratory Data Analysis


1
Exploratory Data Analysis
  • The goal of data analysis is to gain information
    from the data.
  • Exploratory data analysis: a set of methods to
    display and summarize the data.
  • Data on just one variable: the distribution of
    the observations is analyzed by
  • Displaying the data in a graph that shows overall
    patterns and unusual observations (bar chart,
    histogram, density curve)
  • Computing descriptive statistics that summarize
    specific aspects of the data (center and spread).

2
Review of Histograms
  • A histogram represents percent by area.
  • The height of each block represents
    frequencies/percentages of the observations
    falling in the interval.
  • The total area under a histogram is ______ if
    height in frequencies
  • The total area under a histogram is ______ if
    height in percentages
  • There is no fixed choice for the number of
    classes in a histogram
  • If class intervals are too small, the histogram
    will have spikes
  • If class intervals are too large, some
    information will be missed.
  • Use your judgment!
  • Typically statistical software will choose the
    class intervals for you, but you can modify them.
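
The effect of the class-interval width can be tried out in a few lines of Python. This is only a sketch: it uses the Babe Ruth home-run counts that appear later in this presentation, and the helper names (bin_counts, show) and the chosen widths are illustrative assumptions, not part of the original slides.

```python
# Babe Ruth's yearly home-run counts (data from a later slide).
ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]


def bin_counts(data, start, width, n_bins):
    """Count observations in [start + k*width, start + (k+1)*width) for each k."""
    counts = [0] * n_bins
    for x in data:
        k = (x - start) // width
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts


def show(data, start, width, n_bins):
    """Print a text histogram: one row per class interval."""
    for k, count in enumerate(bin_counts(data, start, width, n_bins)):
        lo = start + k * width
        print(f"{lo:>3}-{lo + width:<3} {'*' * count}")


narrow = bin_counts(ruth, 20, 2, 21)   # width 2: spiky, many empty classes
medium = bin_counts(ruth, 20, 10, 5)   # width 10: a readable overall shape
wide = bin_counts(ruth, 20, 25, 2)     # width 25: most detail is lost

show(ruth, 20, 10, 5)
```

With width 2 many classes are empty and the rest are isolated spikes; with width 25 only two blocks remain.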

3
(No Transcript)
4
Center and Spread
5
Measuring Centers
  • The most common measures are the mean (or
    average) and the median.
  • The Mean or Average
  • To calculate the average of a set of
    observations, add their values and divide by the
    number of observations

Data: Number of home runs hit by Babe Ruth as a
Yankee: 54, 59, 35, 41, 46, 25, 47, 60, 54, 46,
49, 46, 41, 34, 22. The mean number of home runs
hit in a year is 659/15 ≈ 43.9.
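
As a check on the arithmetic, the mean can be computed step by step in Python. This is a small sketch using only the data listed above; the variable names are illustrative.

```python
import statistics

# Home runs hit by Babe Ruth in each of his 15 seasons as a Yankee.
ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

total = sum(ruth)   # add the values
n = len(ruth)       # number of observations
mean = total / n    # divide the sum by the count

print(total, n, round(mean, 2))   # prints: 659 15 43.93

# The standard library gives the same answer.
library_mean = statistics.mean(ruth)
```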
6
  • The median
  • The median M is the midpoint of a distribution,
    the number such that half the observations are
    smaller and the other half are larger.
  • To find the median
  • Sort all the observations in order of size from
    smallest to largest
  • If the number of observations n is odd, the
    median M is the center observation in the ordered
    list, i.e. M is the (n+1)/2-th observation.
  • If the number of observations n is even, the
    median M is the mean of the two center
    observations in the ordered list.

Example 1: Ordered list of home run hits by Babe
Ruth: 22 25 34 35 41 41 46 46 46 47 49 54 54 59
60. n = 15, so the median is the 8th observation:
Median = 46.
Example 2: Ordered list of home run hits by Roger
Maris: 8 13 14 16 23 26 28 33 39 61.
n = 10, so Median = (23 + 26)/2 = 24.5.
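
The rule for finding the median can be written as a short Python function. This is a sketch of the steps described above (sort, then take the center value or the mean of the two center values); the function name is illustrative.

```python
import statistics


def median(values):
    """Median by the rule on the slide: sort, then pick the center."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the (n+1)/2-th observation (index mid when counting from 0).
        return ordered[mid]
    # Even n: the mean of the two center observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
maris = [8, 13, 14, 16, 23, 26, 28, 33, 39, 61]

print(median(ruth))    # prints: 46
print(median(maris))   # prints: 24.5
```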
7
Mean versus Median
  • The mean and median of a symmetric distribution
    are close together.
  • In skewed distributions, the mean is farther out
    in the long tail than is the median. The mean is
    more sensitive to extreme values.

(Figures: a symmetric, a right-skewed, and a left-skewed distribution, each with the mean and the median marked; the median splits the area 50%/50%.)
8
Mean or Median?
  • The mean is a good measure for the center of a
    symmetric distribution
  • The median is a resistant measure and should be
    used for skewed distributions. Its value is only
    slightly affected by the presence of extreme
    observations, no matter how large these
    observations are.

9
The Mode
On average, the cars under study get 18.9 miles
per gallon, and 50% of the cars under study get
at least 18 miles per gallon. The mode is the
observation value with the highest frequency.
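
The car data are not listed in the transcript, so this sketch finds the mode of the Babe Ruth home-run counts from the earlier slides instead. It simply counts how often each value occurs.

```python
import statistics
from collections import Counter

ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

# Count how many times each value occurs, then take the most frequent one.
counts = Counter(ruth)
mode_value, mode_frequency = counts.most_common(1)[0]

print(mode_value, mode_frequency)   # prints: 46 3
```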
10
Spread of a Distribution
Two measures of spread:
1. The Quartiles. The first quartile Q1 is the
value such that 25% of the observations fall at
or below it (Q1 is often called the 25th
percentile). The third quartile Q3 is the value
such that 75% of the observations fall at or
below it (Q3 is often called the 75th
percentile). Typically used if the distribution
of the observations is skewed.
(Figure: a distribution split by Q1, M, and Q3 into four parts, each holding 25% of the observations.)
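
Quartiles can be computed with Python's statistics.quantiles. This is a sketch on the Babe Ruth data from the earlier slides. Software packages use slightly different interpolation conventions, so the two methods shown here can disagree with each other and with hand calculation; which convention the original presentation used is not stated.

```python
import statistics

ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

# n=4 cuts the ordered data into four parts and returns the cut points Q1, M, Q3.
q_inclusive = statistics.quantiles(ruth, n=4, method="inclusive")
q_exclusive = statistics.quantiles(ruth, n=4, method="exclusive")

print("inclusive:", q_inclusive)   # prints: inclusive: [38.0, 46.0, 51.5]
print("exclusive:", q_exclusive)   # prints: exclusive: [35.0, 46.0, 54.0]
```

Both conventions agree on the median (46) but place Q1 and Q3 slightly differently for a small data set like this one.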
11
First quartile (Q1) = 16, third quartile (Q3) =
21. What does this mean in terms of the data?
12
Percentiles (also called quantiles): In general,
the nth percentile is a value such that n% of the
observations fall at or below it.
(Figure: a distribution with n% of the area at or below the nth percentile.)
In the example before: 5th percentile =
10.35, 95th percentile = 24.1, 10th percentile =
11, 90th percentile = 22. Hence about 80% of
the cars get between 11 and 22 miles per gallon.
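
The definition can be checked from the other direction: given a value, count the percentage of observations at or below it. The car data are not listed in the transcript, so this sketch uses the Babe Ruth home-run counts; the function name is illustrative.

```python
ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]


def percent_at_or_below(data, value):
    """Percentage of observations that are less than or equal to value."""
    at_or_below = sum(1 for x in data if x <= value)
    return 100 * at_or_below / len(data)


p46 = percent_at_or_below(ruth, 46)   # 9 of the 15 seasons
p60 = percent_at_or_below(ruth, 60)   # every season

# Share of observations between the 10th and 90th percentiles, as on the slide.
middle_share = 90 - 10

print(p46, p60, middle_share)   # prints: 60.0 100.0 80
```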
13
Descriptive measures for skewed distributions
  • If the histogram of the data is skewed, use the
    following descriptive statistics
  • Min, Q1, Median, Q3, Max
  • To describe the distribution of the observed
    variable.
  • In our example,
  • Min = 8, Q1 = 16, Median = 18, Q3 = 21, Max = 61
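
A five-number summary can be sketched in Python as follows. The data behind the example on this slide are not in the transcript, so the sketch uses the Roger Maris home-run counts from the median example; the function name and the choice of the "inclusive" quartile convention are assumptions.

```python
import statistics

maris = [8, 13, 14, 16, 23, 26, 28, 33, 39, 61]


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile rule."""
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (min(data), q1, med, q3, max(data))


summary = five_number_summary(maris)
print(summary)   # prints: (8, 14.5, 24.5, 31.75, 61)
```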

14
The Standard Deviation
If a distribution is symmetric: use the average
to measure the center and the standard
deviation to measure the spread. The standard
deviation s (or SD) measures how far the
observations are from the average.
Example: A person's metabolic rate is the rate at
which the body consumes energy. Rates of 7 men in
a study on dieting: 1792, 1666, 1614, 1460, 1867,
1439, 1362. The mean is 1600 and the
s.d. is s = 189.24.
Deviation: 1867 - 1600 = 267
Deviation: 1600 - 1439 = 161
(Figure: dot plot of the seven metabolic rates on an axis from 1300 to 1900.)
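
The deviations in this example can be listed in Python. This is a sketch using the seven metabolic rates above; deviations are taken as observation minus mean, so values below the mean come out negative.

```python
import statistics

rates = [1792, 1666, 1614, 1460, 1867, 1439, 1362]

mean_rate = statistics.mean(rates)

# Deviation of each observation from the mean (observation minus mean).
deviations = [x - mean_rate for x in rates]

# Sample standard deviation of the rates.
s = statistics.stdev(rates)

print(mean_rate, deviations, round(s, 2))
```

The deviations always add up to zero, which is why the standard deviation is built from squared deviations rather than from the deviations themselves.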
15
Formula for the SD
In symbols, the standard deviation s of n
observations x_1, ..., x_n with mean \bar{x} is
$$ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} $$
The variance of an observed variable is defined
as the square of the standard deviation.
Variance = s^2
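
A minimal Python sketch of this computation, assuming the sample form of the formula (divisor n - 1), which is the version that reproduces the s = 189.24 quoted for the metabolic rates. The function name is illustrative.

```python
import math
import statistics

rates = [1792, 1666, 1614, 1460, 1867, 1439, 1362]


def sample_sd(xs):
    """Sample standard deviation: sqrt of (sum of squared deviations) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    squared_deviations = sum((x - mean) ** 2 for x in xs)
    return math.sqrt(squared_deviations / (n - 1))


s = sample_sd(rates)
variance = s ** 2   # the variance is the square of the standard deviation

print(round(s, 2), round(variance, 2))
```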
16
Properties of the SD
  • It measures the spread about the mean.
  • Only used in association with the mean. Good
    descriptive measure for symmetric distributions
  • If s = 0, all the observations have the same
    value
  • It is never negative: the larger s is, the more
    spread out the observations are around the mean
  • It is NOT a resistant measure: a few extreme
    observations may affect its value (make it very
    large).
  • The variance is the square of the s.d.
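
Two of these properties can be seen numerically. In this sketch the list of identical values and the extra observation 3000 are made-up values for illustration; only the seven metabolic rates come from the slides.

```python
import statistics

rates = [1792, 1666, 1614, 1460, 1867, 1439, 1362]

# Hypothetical list in which every observation has the same value.
same = [1600] * 7
s_same = statistics.stdev(same)

# Original data, and the same data with one hypothetical extreme value added.
s_rates = statistics.stdev(rates)
s_extreme = statistics.stdev(rates + [3000])

print(s_same, round(s_rates, 2), round(s_extreme, 2))
```

A single extreme observation more than doubles s here, which is what "not resistant" means in practice.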

17
Interpreting the SD
  • For many lists of observations, especially if
    their histogram is bell-shaped:
  • Roughly 68% of the observations in the list lie
    within 1 standard deviation of the average
  • Roughly 95% of the observations lie within 2
    standard deviations of the average

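(Figure: bell-shaped histogram with the average, Ave ± s.d., and Ave ± 2 s.d. marked; 68% of the observations lie within 1 s.d. and 95% within 2 s.d. of the average.)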
18
Example
In a large university, data were collected to
study the academic achievements of computer
science majors. We'll consider the SAT math
scores of 224 first-year CS students. The
average SATM score is 595.28 with s.d. s = 86.40.
(Figure: Histogram of the SATM Scores)
Are the average and s.d. good descriptions of
the SATM scores distribution? Roughly 68% of the
students have scores between 510 and 680. Roughly
95% of the students have scores between 422 and
768.
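
The two intervals come from the average plus or minus one and two standard deviations. A short Python sketch of that arithmetic, using only the two summary numbers quoted above:

```python
mean_satm = 595.28
sd_satm = 86.40

# Average plus or minus 1 s.d. (roughly 68% of scores for a bell-shaped list).
one_sd = (mean_satm - sd_satm, mean_satm + sd_satm)

# Average plus or minus 2 s.d. (roughly 95% of scores).
two_sd = (mean_satm - 2 * sd_satm, mean_satm + 2 * sd_satm)

print([round(v, 2) for v in one_sd])   # prints: [508.88, 681.68]
print([round(v, 2) for v in two_sd])   # prints: [422.48, 768.08]
```

The slide rounds these bounds to 510 and 680, and to 422 and 768.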
19
CS students example: Descriptive statistics
  • Mean = 595.28, Std Deviation = 86.40
  • Min = 300, Q1 = 540, Median = 600.00, Q3 = 650, Max = 800
  • IQR = 110, 1.5 x IQR = 165
  • 5th percentile = 460, 95th percentile = 750

(Figure: Histogram of the SATM Scores, with 95% of scores between 422 and 768.)
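
The IQR and 1.5 x IQR values can be reproduced from the quartiles. The transcript does not say how 1.5 x IQR is used; this sketch assumes the common convention of flagging observations more than 1.5 x IQR below Q1 or above Q3 as suspected outliers.

```python
q1, q3 = 540, 650
minimum, maximum = 300, 800

iqr = q3 - q1      # interquartile range: 110
step = 1.5 * iqr   # 165.0

# Assumed convention: values beyond these fences are suspected outliers.
lower_fence = q1 - step
upper_fence = q3 + step

low_outliers_possible = minimum < lower_fence
high_outliers_possible = maximum > upper_fence

print(iqr, step, lower_fence, upper_fence)   # prints: 110 165.0 375.0 815.0
```

Under that assumption the minimum score of 300 falls below the lower fence of 375, while the maximum of 800 stays inside the upper fence.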
20
Analysis of the scores for male and female
students
(Figures: SATM scores for men; SATM scores for women)
21
Exploratory Data Analysis
  1. Always plot your data
  2. Look for overall patterns and striking deviations
    such as outliers
  3. Calculate a numerical summary to describe the
    center and the spread
  4. NEXT STEP: sometimes the overall pattern is so
    regular that we can describe it through a smooth
    curve, called a density curve

22
Computing descriptive statistics in Excel
  • There are two ways:
  • Use the formula palette: click on the fx button
  • OR
  • Use the Data Analysis ToolPak: select
    Descriptive Statistics

23
The descriptive statistics tool
  • Input range: sequence of cells containing the
    data
  • Label in First row
  • Output range: tell Excel where to put the output
  • Summary statistics: to be checked

24
(No Transcript)
25
Formulas for 5-number summary
Select an empty cell, and type the function name
you want to compute, or use the function palette
for the list of available functions. For
instance, to compute the min of the fuel
consumption data in the city, type
=MIN(B2:B31)
26
Normal distributions
Normal curves provide a simple, compact way to
describe symmetric, bell-shaped distributions.
(Figure: normal curve drawn over the histogram of SAT math scores for CS students)
27
Money spent in a supermarket
Is the normal curve a good approximation?
28
SAT math scores for CS students
The area under the histogram, i.e. the
percentages of the observations, can be
approximated by the corresponding area under the
normal curve. If the histogram is symmetric, we
say that the data are approximately normal (or
normally distributed). We need to know only the
average and the standard deviation of the
observations!!
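
Areas under a normal curve can be computed with Python's statistics.NormalDist. This sketch uses the SATM average (595.28) and standard deviation (86.40) from the earlier slides and assumes the scores are approximately normal; the helper name share_between is illustrative.

```python
from statistics import NormalDist

mu, sigma = 595.28, 86.40
satm = NormalDist(mu=mu, sigma=sigma)


def share_between(low, high):
    """Area under the normal curve between low and high."""
    return satm.cdf(high) - satm.cdf(low)


within_1 = share_between(mu - sigma, mu + sigma)           # about 0.68
within_2 = share_between(mu - 2 * sigma, mu + 2 * sigma)   # about 0.95
slide_68 = share_between(510, 680)                         # rounded bounds from the slide

print(round(within_1, 3), round(within_2, 3), round(slide_68, 3))
```

Only the average and the standard deviation enter the calculation, which is the point made on the slide.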