DATA DESCRIPTION - PowerPoint PPT Presentation

About This Presentation
Title:

DATA DESCRIPTION

Description:

In a cross-sectional analysis a unit/subject will be the entity you are studying. ... Ordinal data: excellent/good/bad, Interval data: temperature, GMAT scores, ... – PowerPoint PPT presentation

Slides: 37
Provided by: sheld

Transcript and Presenter's Notes

Title: DATA DESCRIPTION


1
DATA DESCRIPTION
2
Units
  • Unit: the entity we are studying (called a
    subject if it is a human being)
  • Each unit/subject has certain parameters, e.g., a
    student (subject) has his age, weight, height,
    home address, number of units taken, and so on.

3
Variables
  • These parameters are called variables.
  • In statistics variables are stored in columns,
    each variable occupying a column.

4
Cross-sectional and time-series analyses
  • In a cross-sectional analysis a unit/subject will
    be the entity you are studying. For example, if
    you study the housing market in San Diego, a unit
    will be a house, and variables will be price,
    size, age, etc., of a house.
  • In a time-series analysis the unit is a time
    unit, say, hour, day, month, etc.

5
Data Types
  • Nominal data: male/female, colors, ...
  • Ordinal data: excellent/good/bad, ...
  • Interval data: temperature, GMAT scores, ...
  • Ratio data: distance to school, price, ...

6
Two forms
  • GRAPHICAL form
  • NUMERICAL SUMMARY form

7
Graphical forms
  • Sequence plots
  • Histograms (frequency distributions)
  • Scatter plots

8
Sequence plots
  • To describe a time series
  • The horizontal axis is always related to the
    sequence in which data were collected
  • The vertical axis is the value of the variable

9
Example sequence plot
10
Histograms I
  • A histogram (frequency distribution) shows how
    many values are in a certain range.
  • It is used for cross-sectional analysis.
  • The potential observation values are divided into
    groups (called classes).
  • The number of observations falling into each
    class is called the frequency of that class.
  • When we say an observation falls into a class, we
    mean its value is greater than or equal to the
    lower bound but less than the upper bound of the
    class.

11
Example histogram
  • A commercial bank is studying the time a customer
    spends in line. They recorded the waiting times
    (in minutes) of 28 customers:
  • 5.9 7.6 5.3 9.7 1.6 3.5 7.4
  • 4.0 1.6 7.3 8.2 8.4 6.5 8.9
  • 1.1 8.6 4.3 1.2 3.3 2.1 8.4
  • 1.1 6.7 5.0 4.5 9.4 6.3 6.4

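The class frequencies for this example can be tallied with a short Python sketch. The class boundaries [1, 3), [3, 5), [5, 7), [7, 9), [9, 11) are an assumption, chosen to match the cut-off values 3, 5, 7, 9 and 11 used later in the slides; the variable names are illustrative only.

```python
# Waiting times (in minutes) of the 28 customers from the slide.
waiting_times = [
    5.9, 7.6, 5.3, 9.7, 1.6, 3.5, 7.4,
    4.0, 1.6, 7.3, 8.2, 8.4, 6.5, 8.9,
    1.1, 8.6, 4.3, 1.2, 3.3, 2.1, 8.4,
    1.1, 6.7, 5.0, 4.5, 9.4, 6.3, 6.4,
]

# Assumed class boundaries (lower bound included, upper bound excluded).
class_bounds = [(1, 3), (3, 5), (5, 7), (7, 9), (9, 11)]

# An observation falls into a class when lower <= value < upper.
frequencies = [
    sum(1 for t in waiting_times if lower <= t < upper)
    for lower, upper in class_bounds
]

# Print a simple text histogram, one row per class.
for (lower, upper), count in zip(class_bounds, frequencies):
    print(f"[{lower:2d}, {upper:2d}): {count:2d} {'*' * count}")
```

Because the upper bound is excluded, a waiting time of exactly 5.0 minutes is counted in the class [5, 7), not in [3, 5).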
12
Example histogram
13
Histogram II
  • The relative frequency distribution depicts the
    ratio of the frequency and the total number of
    observations.
  • The cumulative distribution depicts the
    percentage of observations that are less than a
    specific value.

14
Example relative frequency distribution
  • A relative frequency distribution plots the
    fraction (or percentage) of observations in each
    class instead of the actual number. For this
    problem, the relative frequency of the first
    class is 6/28 = 0.214. The remaining relative
    frequencies are 0.179, 0.250, 0.286 and 0.071. A
    graph similar to the histogram above can then be
    plotted.
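As a quick check of these figures, the relative frequencies can be computed from the class counts. This is a minimal sketch: the counts 6, 5, 7, 8, 2 are tallied from the 28 waiting times, assuming classes of width 2 starting at 1 minute.

```python
# Class counts for [1,3), [3,5), [5,7), [7,9), [9,11), tallied from the
# 28 waiting times (the class boundaries are an assumption).
frequencies = [6, 5, 7, 8, 2]
total = sum(frequencies)  # total number of observations

# Relative frequency = class frequency / total number of observations.
relative_frequencies = [count / total for count in frequencies]

for count, rel in zip(frequencies, relative_frequencies):
    print(f"{count}/{total} = {rel:.3f}")
```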

15
Example cumulative distribution
  • In the previous example, the fraction of
    observations that are less than 3 minutes is
    0.214, the fraction of observations that are
    less than 5 is 0.214 + 0.179 = 0.393, less than 7
    is 0.214 + 0.179 + 0.250 = 0.643, less than 9 is
    0.214 + 0.179 + 0.250 + 0.286 = 0.929, and less
    than 11 is 1.0.
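The running sums above can be reproduced with `itertools.accumulate`. This is a sketch under the same assumption about the classes (width 2, starting at 1 minute); accumulating the counts before dividing avoids piling up rounding error.

```python
from itertools import accumulate

# Upper class bounds and class counts for the bank waiting-time example
# (class boundaries assumed to be [1,3), [3,5), [5,7), [7,9), [9,11)).
upper_bounds = [3, 5, 7, 9, 11]
frequencies = [6, 5, 7, 8, 2]
total = sum(frequencies)

# Running totals of the counts, then divided by the number of observations.
cumulative_counts = list(accumulate(frequencies))
cumulative = [count / total for count in cumulative_counts]

for bound, value in zip(upper_bounds, cumulative):
    print(f"less than {bound:2d} minutes: {value:.3f}")
```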

16
Example cumulative distribution
17
Histogram III
  • The summation of all the relative frequencies is
    always 1.
  • The cumulative distribution is non-decreasing.
  • The last value of the cumulative distribution is
    always 1.
  • A cumulative distribution can be derived from the
    corresponding relative distribution, and vice
    versa.
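These four properties can be checked numerically on the bank example. The sketch below assumes the relative frequencies 6/28, 5/28, 7/28, 8/28 and 2/28 and recovers them from the cumulative distribution by taking successive differences; all names are illustrative.

```python
from itertools import accumulate
from math import isclose

# Relative frequencies of the five classes in the bank example.
relative = [6 / 28, 5 / 28, 7 / 28, 8 / 28, 2 / 28]

# Relative -> cumulative: running sums.
cumulative = list(accumulate(relative))

# Cumulative -> relative: first value, then successive differences.
recovered = [cumulative[0]] + [
    later - earlier for earlier, later in zip(cumulative, cumulative[1:])
]

sums_to_one = isclose(sum(relative), 1.0)
non_decreasing = all(a <= b for a, b in zip(cumulative, cumulative[1:]))
ends_at_one = isclose(cumulative[-1], 1.0)
round_trip_ok = all(isclose(r, s) for r, s in zip(relative, recovered))

print(sums_to_one, non_decreasing, ends_at_one, round_trip_ok)
```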

18
Probability
  • A random variable is a variable whose values
    cannot be predetermined but are governed by some
    random mechanism.
  • Although we cannot predict precisely the value of
    a random variable, we might be able to tell the
    probability of a random variable being in a
    certain interval.
  • The relative frequency can also be read as an
    estimate of the probability of a random variable
    falling in the corresponding class.
  • In the same way, the relative frequency
    distribution serves as an estimate of the
    probability distribution.

19
Scatter plots
  • A scatter plot shows the relationship between two
    variables.

20
Example scatter plot
  • The following are the height and foot size
    measurements of 8 men arbitrarily selected from
    students in the cafeteria. Heights and foot sizes
    are in centimeters.
  • Man     1     2     3     4     5     6     7     8
  • Height  155   160   149   175   182   145   177   164
  • Foot    23.3  21.8  22.1  26.3  28.0  20.7  25.3  24.9

21
Example scatter plot
22
Numerical Summary Forms
  • Central location: mean, median, and mode.
  • Dispersion: standard deviation and variance.
  • Correlation.

23
Mean
  • Mean/average is the sum of the observations
    divided by the number of observations.
  • Data: 27 22 26 24 27 20 23 24 18 32
  • Sum = 27 + 22 + 26 + 24 + 27 + 20 + 23 + 24
    + 18 + 32 = 243
  • Mean = 243/10 = 24.3

24
Median
  • Median is the value of the central observation
    (the one in the middle), when the observations
    are listed in ascending or descending order.
  • When there is an even number of values, the
    median is given by the average of the middle two
    values.
  • When there is an odd number of values, the median
    is given by the middle number.

25
Example median
  • 18 20 22 23 24 24 26 27 27 32

26
Compare mean and median
  • The median is less sensitive to outliers than the
    mean. Check the mean and median for the
    following two data sets:
  • 18 20 22 23 24 24 26 27 27 32
  • 18 20 22 23 24 24 26 27 27 320

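The comparison can be carried out with Python's standard `statistics` module. This is a small sketch using the two data sets from the slide; the variable names are illustrative.

```python
from statistics import mean, median

# The two data sets from the slide; they differ only in the last value.
original = [18, 20, 22, 23, 24, 24, 26, 27, 27, 32]
with_outlier = [18, 20, 22, 23, 24, 24, 26, 27, 27, 320]

for name, data in (("original", original), ("with outlier", with_outlier)):
    # With 10 values, the median is the average of the 5th and 6th values.
    print(f"{name}: mean = {mean(data):.1f}, median = {median(data):.1f}")
```

Replacing 32 with 320 moves the mean from 24.3 to 53.1, while the median stays at 24 in both cases.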
27
Mode
  • Mode is the most frequently occurring value(s).
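A data set can have more than one mode. As an illustration, the sketch below applies `statistics.multimode` (available in Python 3.8 and later) to the ten observations used in the mean example, where both 24 and 27 occur twice.

```python
from statistics import multimode

# The ten observations used in the mean example.
sample = [27, 22, 26, 24, 27, 20, 23, 24, 18, 32]

# multimode returns every value that ties for the highest count.
modes = sorted(multimode(sample))
print(modes)
```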

28
Symmetry and skew
  • A frequency distribution in which the area to the
    left of the mean is a mirror image of the area to
    the right is called a symmetrical distribution.
  • A distribution that has a longer tail on the
    right hand side than on the left is called
    positively skewed or skewed to the right. A
    distribution that has a longer tail on the left
    is called negatively skewed.
  • If a distribution is positively skewed, the mean
    typically exceeds the median. For a negatively
    skewed distribution, the mean is typically less
    than the median.

29
Range
  • The range is the difference between the maximum
    and minimum values of the observations.

30
Standard deviation and variance
  • The standard deviation is used to describe the
    dispersion of the data.
  • The variance is the squared standard deviation.

31
Calculation of S.D.
  • Calculate the mean.
  • Calculate the deviations of the observations
    from the mean.
  • Calculate the squares of the deviations and sum
    them up.
  • Divide the sum by n-1 and take the square root.

32
Example S.D.
  • Sample:    27    22    26    24    27    20    23    24    18    32
  • Deviation: 2.7   -2.3  1.7   -0.3  2.7   -4.3  -1.3  -0.3  -6.3  7.7
  • Sq of Dev: 7.29  5.29  2.89  0.09  7.29  18.5  1.69  0.09  39.7  59.3
  • Sum of Sq of Dev = 7.29 + 5.29 + ... + 59.3 = 142.1
  • Std. Dev. = sqrt(142.1/9) ≈ 3.97
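The calculation steps can be followed in a short Python sketch that uses the sample from this example and divides by n-1 as described. The cross-check against `statistics.stdev` is an addition for illustration, not part of the slides.

```python
from math import sqrt
from statistics import stdev

sample = [27, 22, 26, 24, 27, 20, 23, 24, 18, 32]
n = len(sample)

# Step 1: calculate the mean.
sample_mean = sum(sample) / n

# Step 2: calculate the deviations from the mean.
deviations = [x - sample_mean for x in sample]

# Step 3: square the deviations and sum them up.
sum_of_squares = sum(d ** 2 for d in deviations)

# Step 4: divide the sum by n-1 (the variance) and take the square root.
variance = sum_of_squares / (n - 1)
std_dev = sqrt(variance)

print(f"mean = {sample_mean:.1f}, sum of squares = {sum_of_squares:.1f}")
print(f"variance = {variance:.2f}, std. dev. = {std_dev:.2f}")

# Cross-check against the standard library's sample standard deviation.
print(f"statistics.stdev = {stdev(sample):.2f}")
```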

33
(No Transcript)
34
Empirical rules
  • If the distribution is symmetrical and
    bell-shaped:
  • Approximately 68% of the observations will be
    within plus and minus one standard deviation of
    the mean.
  • Approximately 95% of the observations will be
    within two standard deviations of the mean.
  • Approximately 99.7% of the observations will be
    within three standard deviations of the mean.
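These percentages can be reproduced if "symmetrical and bell-shaped" is taken to mean a normal distribution. That reading is an assumption, since the slides do not name the distribution. The sketch uses `statistics.NormalDist` (Python 3.8 and later).

```python
from statistics import NormalDist

# Assumption: the bell-shaped distribution is a standard normal distribution.
standard_normal = NormalDist(mu=0, sigma=1)

# Probability of falling within k standard deviations of the mean.
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, probability in coverage.items():
    print(f"within {k} standard deviation(s): {probability:.1%}")
```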

35
Percentiles
  • The 75th percentile is the value such that 75% of
    the numbers are less than or equal to this value
    and the remaining 25% are larger than this value.
  • The k-th percentile is the value such that k% of
    the numbers are less than or equal to this value
    and the remaining (100-k)% are larger than this
    value.
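Several conventions exist for computing percentiles from a finite sample. The sketch below uses the nearest-rank convention, which is one possible reading of this definition and not necessarily the one intended by the slides; the function name `percentile` is hypothetical.

```python
from math import ceil


def percentile(values, k):
    """Nearest-rank k-th percentile (0 < k <= 100).

    Returns the smallest observation such that at least k% of the
    observations are less than or equal to it.
    """
    if not values or not 0 < k <= 100:
        raise ValueError("need a non-empty sample and 0 < k <= 100")
    ordered = sorted(values)
    rank = ceil(k * len(ordered) / 100)  # 1-based position in sorted order
    return ordered[rank - 1]


# The ten observations used in the mean example.
sample = [27, 22, 26, 24, 27, 20, 23, 24, 18, 32]
print(percentile(sample, 75))
print(percentile(sample, 50))
```

With tied values or small samples, the share of observations at or below the returned value can exceed k%, so the definition holds only approximately.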

36
Correlation coefficient
  • The correlation coefficient measures how closely
    two variables are (linearly) related to each
    other. It has a value between -1 and 1.
  • Its sign distinguishes positive and negative
    linear relationships.
  • If two variables are not linearly related, the
    correlation coefficient will be close to zero; if
    they are closely related, the correlation
    coefficient will be close to 1 or -1.
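As an illustration, the correlation coefficient can be computed for the height and foot size data from the scatter plot example. The slides give no formula, so the sketch assumes the usual Pearson coefficient; the function name `pearson_r` is hypothetical.

```python
from math import sqrt

# Height and foot size (in centimeters) of the 8 men in the scatter plot example.
heights = [155, 160, 149, 175, 182, 145, 177, 164]
foot_sizes = [23.3, 21.8, 22.1, 26.3, 28.0, 20.7, 25.3, 24.9]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / sqrt(ss_x * ss_y)


r = pearson_r(heights, foot_sizes)
print(f"r = {r:.2f}")
```

For this data the coefficient comes out at roughly 0.93, which indicates a strong positive linear relationship between height and foot size.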