Basic Statistics II - PowerPoint PPT Presentation

1
Basic Statistics II
  • Biostatistics, MHA, CDC, Jul 09
  • Prof. KG Satheesh Kumar
  • Asian School of Business

2
Frequency Distribution and Probability Distribution
  • Frequency distribution: plot of frequency along the
    y-axis and the variable along the x-axis
  • A histogram is an example
  • Probability distribution: plot of probability
    along the y-axis and the variable along the x-axis
  • Both have the same shape
  • Properties of probability distributions:
  • Each probability is always between 0 and 1
  • The sum of all probabilities must be 1

3
Theoretical Probability Distributions
  • For a discrete variable we have a discrete
    probability distribution:
  • Binomial distribution
  • Poisson distribution
  • Geometric distribution
  • Hypergeometric distribution
  • For a continuous variable we have a continuous
    probability distribution:
  • Uniform (rectangular) distribution
  • Exponential distribution
  • Normal distribution

4
The Normal Distribution
  • If a random variable X is affected by many
    independent causes, none of which is
    overwhelmingly large, the probability
    distribution of X closely follows the normal
    distribution. X is then called a normal variate and
    we write $X \sim N(\mu, \sigma^2)$, where $\mu$ is the
    mean and $\sigma^2$ is the variance
  • A normal pdf is completely defined by its mean $\mu$
    and variance $\sigma^2$. The square root of the variance
    is called the standard deviation $\sigma$.
  • If several independent random variables are
    normally distributed, their sum will also be
    normally distributed with mean equal to the sum
    of individual means and variance equal to the sum
    of individual variances.

5
The Normal pdf
6
The area under any pdf between two given values
of X is the probability that X falls between
these two values
7
Standard Normal Variate, Z
  • The standard normal variate (SNV), Z, is the normal
    random variable with mean 0 and standard deviation 1
  • Tables are available for standard normal
    probabilities
  • X and Z are connected by
  • $Z = (X - \mu)/\sigma$ and $X = \mu + \sigma Z$
  • The area under the X curve between $X_1$ and $X_2$ is
    equal to the area under the Z curve between $Z_1$ and $Z_2$.

8
Standard Normal Probabilities (Table of z distribution)

  z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
 0.0   0.0000  0.0040  0.0080  0.0120  0.0160  0.0199  0.0239  0.0279  0.0319  0.0359
 0.1   0.0398  0.0438  0.0478  0.0517  0.0557  0.0596  0.0636  0.0675  0.0714  0.0753
 0.2   0.0793  0.0832  0.0871  0.0910  0.0948  0.0987  0.1026  0.1064  0.1103  0.1141
 0.3   0.1179  0.1217  0.1255  0.1293  0.1331  0.1368  0.1406  0.1443  0.1480  0.1517
 0.4   0.1554  0.1591  0.1628  0.1664  0.1700  0.1736  0.1772  0.1808  0.1844  0.1879
 0.5   0.1915  0.1950  0.1985  0.2019  0.2054  0.2088  0.2123  0.2157  0.2190  0.2224
 0.6   0.2257  0.2291  0.2324  0.2357  0.2389  0.2422  0.2454  0.2486  0.2517  0.2549
 0.7   0.2580  0.2611  0.2642  0.2673  0.2704  0.2734  0.2764  0.2794  0.2823  0.2852
 0.8   0.2881  0.2910  0.2939  0.2967  0.2995  0.3023  0.3051  0.3078  0.3106  0.3133
 0.9   0.3159  0.3186  0.3212  0.3238  0.3264  0.3289  0.3315  0.3340  0.3365  0.3389
 1.0   0.3413  0.3438  0.3461  0.3485  0.3508  0.3531  0.3554  0.3577  0.3599  0.3621
 1.1   0.3643  0.3665  0.3686  0.3708  0.3729  0.3749  0.3770  0.3790  0.3810  0.3830
 1.2   0.3849  0.3869  0.3888  0.3907  0.3925  0.3944  0.3962  0.3980  0.3997  0.4015
 1.3   0.4032  0.4049  0.4066  0.4082  0.4099  0.4115  0.4131  0.4147  0.4162  0.4177
 1.4   0.4192  0.4207  0.4222  0.4236  0.4251  0.4265  0.4279  0.4292  0.4306  0.4319
 1.5   0.4332  0.4345  0.4357  0.4370  0.4382  0.4394  0.4406  0.4418  0.4429  0.4441
 1.6   0.4452  0.4463  0.4474  0.4484  0.4495  0.4505  0.4515  0.4525  0.4535  0.4545
 1.7   0.4554  0.4564  0.4573  0.4582  0.4591  0.4599  0.4608  0.4616  0.4625  0.4633

The z-value is on the left and top margins and
the probability (the area under the curve between 0 and z,
shaded in the diagram) is in the body of the table
9
Illustration
  • Q. A tube light has a mean life of 4500 hours with a
    standard deviation of 1500 hours. In a lot of
    1000 tubes, estimate the number of tubes lasting
    between 4000 and 6000 hours.
  • P(4000 < X < 6000) = P(-1/3 < Z < 1)
  • = 0.1306 + 0.3413
  • = 0.4719
  • Hence the probable number of tubes in a lot of
    1000 lasting 4000 to 6000 hours is 472
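
The table lookup in this illustration can be cross-checked numerically. The following is a minimal sketch using `statistics.NormalDist` from the Python standard library; the variable names are illustrative and not part of the original slides.

```python
from statistics import NormalDist

# Tube life modelled as X ~ N(mean 4500, SD 1500), as stated in the question.
life = NormalDist(mu=4500, sigma=1500)

# P(4000 < X < 6000) is the difference of two cumulative probabilities.
p = life.cdf(6000) - life.cdf(4000)

# Expected number of tubes in a lot of 1000 that fall in this range.
expected_tubes = round(1000 * p)

print(f"P(4000 < X < 6000) = {p:.4f}")
print(f"Expected tubes out of 1000: {expected_tubes}")
```

The computed probability agrees with the table-based value of 0.4719 to four decimal places.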

10
Illustration
  • Q. The cost of a certain procedure is estimated to
    average Rs.25,000 per patient. Assuming a normal
    distribution and a standard deviation of Rs.5000,
    find a value such that 95% of the patients pay
    less than that.
  • Using tables, P(Z < Z1) = 0.95 gives Z1 = 1.645.
    Hence X1 = 25000 + 1.645 x 5000 = Rs.33,225
  • 95% of the patients pay less than Rs.33,225
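
The reverse lookup (from a probability back to a value of X) can be sketched with the inverse CDF of `statistics.NormalDist`. This is an illustrative check, not the method used on the slide; the unrounded z-value gives a result about one rupee below the slide's figure, which uses the rounded 1.645.

```python
from statistics import NormalDist

# Procedure cost modelled as X ~ N(mean 25000, SD 5000), as in the question.
cost = NormalDist(mu=25000, sigma=5000)

# z-value with 95% of the standard normal distribution below it.
z1 = NormalDist().inv_cdf(0.95)

# Corresponding cost, equivalent to X1 = mu + sigma * z1.
x1 = cost.inv_cdf(0.95)

print(f"z1 = {z1:.4f}")
print(f"95% of patients pay less than Rs.{x1:,.0f}")
```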

11
Sampling Basics
  • Population or Universe is the collection of all
    units of interest. E.g. Households of a specific
    type in a given city at a certain time.
    Population may be finite or infinite
  • Sampling frame is the list of all the units in
    the population with identifications such as serial
    numbers, house numbers, telephone numbers, etc.
  • Sample is a set of units drawn from the
    population according to some specified procedure
  • Unit is an element or group of elements on which
    observations are made. E.g. a person, a family, a
    school, a book, a piece of furniture etc.

12
Census Vs Sampling
  • Census:
  • Thought to be accurate and reliable, but often
    not so if the population is large
  • Needs more resources (money, time, manpower)
  • Unsuitable for destructive tests
  • Sampling:
  • Needs fewer resources
  • Highly qualified and skilled persons can be used
  • Subject to sampling error, which can be reduced by
    using a large and representative sample

13
Sampling Methods
  • Probability Sampling (Random Sampling)
  • Simple Random Sampling
  • Systematic Random Sampling
  • Stratified Random Sampling
  • Cluster Sampling (single-stage, multi-stage)
  • Non-probability Sampling
  • Convenience Sampling
  • Judgment Sampling
  • Quota Sampling

14
Limitations of Non-Random Sampling
  • Selection does not ensure a known chance that a
    unit will be selected (i.e. non-representative)
  • Inaccurate in view of the selection bias
  • Results cannot be used for generalisation because
    inferential statistics requires probability
    sampling for valid conclusions
  • Useful for pilot studies and exploratory research

15
Sampling Distribution and Standard Error of the
Mean
  • The sampling distribution of $\bar{x}$ is the
    probability distribution of all possible values
    of $\bar{x}$ for a given sample size n taken from the
    population.
  • According to the Central Limit Theorem (CLT), for a
    large enough sample size n, the sampling distribution
    is approximately normal with mean $\mu$ and standard
    deviation $\sigma/\sqrt{n}$. This standard deviation is
    called the standard error of the mean.
  • The CLT holds for non-normal populations also and
    states: for large enough n, $\bar{x} \sim N(\mu, \sigma^2/n)$
    approximately
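
A small simulation can make the standard error concrete. The sketch below is an illustration only: the exponential population, the sample size of 50 and the 2000 repetitions are all assumptions chosen for the demonstration, not values from the slides. It draws repeated samples from a clearly non-normal population and compares the spread of the sample means with $\sigma/\sqrt{n}$.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)   # fixed seed so the run is repeatable

n = 50                   # sample size (assumed for illustration)
num_samples = 2000       # number of repeated samples (assumed)
mu, sigma = 1.0, 1.0     # mean and SD of an exponential population with rate 1

# Draw many samples of size n from the skewed population and record each mean.
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = mean(sample_means)   # should be close to mu
se_observed = stdev(sample_means)    # spread of the sample means
se_theory = sigma / sqrt(n)          # standard error predicted by the CLT

print(f"Mean of sample means: {mean_of_means:.4f} (population mean {mu})")
print(f"Observed SE: {se_observed:.4f}, theoretical SE: {se_theory:.4f}")
```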

16
Illustration
  • Q. When sampling from a population with SD 55,
    using a sample size of 150, what is the
    probability that the sample mean will be at least
    8 units away from the population mean?
  • Standard error of the mean, SE = 55/sqrt(150) =
    4.4907
  • Hence 8 units = 1.7815 SE
  • Area within 1.7815 SE on both sides of the mean
    = 2 x 0.4625 = 0.925
  • Hence required probability = 1 - 0.925 = 0.075
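
The same steps can be sketched in code. This is a minimal check using the standard library, with illustrative variable names; it uses the exact normal CDF rather than the rounded table entry, so the last digits may differ slightly from the slide.

```python
from math import sqrt
from statistics import NormalDist

sigma = 55      # population SD from the question
n = 150         # sample size from the question
distance = 8    # how far the sample mean may be from the population mean

se = sigma / sqrt(n)   # standard error of the mean
z = distance / se      # the distance expressed in standard errors

std_normal = NormalDist()  # standard normal, mean 0 and SD 1
p_within = std_normal.cdf(z) - std_normal.cdf(-z)
p_at_least = 1 - p_within  # probability of being 8 or more units away

print(f"SE = {se:.4f}, z = {z:.4f}")
print(f"P(at least {distance} units away) = {p_at_least:.3f}")
```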

17
Illustration
  • Q. An Economist wishes to estimate the average
    family income in a certain population. The
    population SD is known to be 4,500 and the
    economist uses a random sample of size 225. What
    is the probability that the sample mean will fall
    within 800 of the population mean?
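
The slide leaves this question unanswered. Following the same steps as the preceding illustration, a worked solution might read as follows. The table value 0.4962 for z = 2.67 is taken from a standard normal table and lies beyond the rows reproduced in this deck.

```latex
\begin{align*}
SE &= \frac{\sigma}{\sqrt{n}} = \frac{4500}{\sqrt{225}} = 300 \\
z &= \frac{800}{SE} = \frac{800}{300} \approx 2.67 \\
P(|\bar{x} - \mu| < 800) &= P(-2.67 < Z < 2.67) \approx 2 \times 0.4962 = 0.9924
\end{align*}
```

So the sample mean would fall within 800 of the population mean in roughly 99% of such samples.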

18
Point and Interval Estimation
  • The value of an estimator (see next slide),
    obtained from a sample can be used to estimate
    the value of the population parameter. Such an
    estimate is called a point estimate.
  • This is a 50:50 estimate in the sense that the
    actual parameter value is equally likely to be on
    either side of the point estimate.
  • A more useful estimate is the interval estimate,
    where an interval is specified along with a
    measure of confidence (90%, 95%, 99%, etc.)
  • The interval estimate with its associated measure
    of confidence is called a confidence interval.
  • A confidence interval is a range of numbers
    believed to include the unknown population
    parameter, with a certain level of confidence

19
Estimators
  • Population parameters ($\mu$, $\sigma^2$, $p$) and sample
    statistics ($\bar{x}$, $s^2$, $p_s$)
  • An estimator of a population parameter is a
    sample statistic used to estimate the parameter
  • The statistic $\bar{x}$ is an estimator of the parameter $\mu$
  • The statistic $s^2$ is an estimator of the parameter $\sigma^2$
  • The statistic $p_s$ is an estimator of the parameter $p$

20
Illustration
  • Q. A wine importer needs to report the average
    percentage of alcohol in bottles of French wine.
    From experience with previous kinds of wine, the
    importer believes the population SD is 1.2. The
    importer randomly samples 60 bottles of the new
    wine and obtains a sample mean of 9.3. Find the
    90% confidence interval for the average
    percentage of alcohol in the population.

21
Answer
  • Standard error = 1.2/sqrt(60) = 0.1549
  • For a 90% confidence interval, Z = 1.645
  • Hence the margin of error = 1.645 x 0.1549
    = 0.2548
  • Hence the 90% confidence interval is
  • 9.3 ± 0.3 (margin of error rounded to one decimal place)
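
The interval can be reproduced with a few lines of code. This is a minimal sketch of a z-based confidence interval for a mean with known population SD; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

sample_mean = 9.3   # sample mean alcohol percentage
sigma = 1.2         # population SD believed by the importer
n = 60              # number of bottles sampled
confidence = 0.90   # required confidence level

se = sigma / sqrt(n)

# Two-sided interval: leave (1 - confidence) / 2 in each tail.
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

margin = z * se
ci = (sample_mean - margin, sample_mean + margin)

print(f"SE = {se:.4f}, z = {z:.3f}, margin of error = {margin:.4f}")
print(f"90% confidence interval: ({ci[0]:.2f}, {ci[1]:.2f})")
```

Without rounding the margin to one decimal place, the interval is roughly 9.05 to 9.55.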

22
More Sampling Distributions
  • A sampling distribution is the probability
    distribution of a given test statistic (e.g. Z),
    which is a numerical quantity calculated from
    sample statistics
  • The sampling distribution depends on the distribution
    of the population, the statistic being considered
    and the sample size
  • Distribution of sample mean: Z or t distribution
  • Distribution of sample proportion: Z (large
    sample)
  • Distribution of sample variance: chi-square
    distribution

23
The t-distribution
  • The t-distribution is also bell-shaped and very
    similar to the standard normal (Z) distribution, N(0, 1)
  • Its mean is 0 and its variance is df/(df - 2), for df > 2
  • df = degrees of freedom = n - 1, where n = sample size
  • For large sample sizes, t and Z are practically identical
  • For small n, the variance of t is larger than
    that of Z and hence t has wider tails, reflecting the
    uncertainty introduced by the unknown population SD
    or the smaller sample size n

25
Illustration
  • Q. A large drugstore wants to estimate the
    average weekly sales for a brand of soap. A
    random sample of 13 weeks gives the following
    numbers: 123, 110, 95, 120, 87, 89, 100, 105, 98,
    88, 75, 125, 101. Determine the 90% confidence
    interval for average weekly sales.
  • Sample mean = 101.23 and sample SD = 15.13. From the
    t-table, for 90% confidence at df = 12, t =
    1.782. Hence margin of error = 1.782 x
    15.13/sqrt(13) = 7.48. The 90% confidence
    interval is (93.75, 108.71)
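
The calculation can be sketched as follows. The Python standard library has no t-distribution, so the critical value 1.782 is simply copied from the slide's t-table lookup rather than computed; everything else is derived from the data.

```python
from math import sqrt
from statistics import mean, stdev

# Weekly sales for the 13 sampled weeks, as listed in the question.
sales = [123, 110, 95, 120, 87, 89, 100, 105, 98, 88, 75, 125, 101]

# Critical t for 90% confidence with df = 12, taken from the slide's t-table.
t_crit = 1.782

n = len(sales)
x_bar = mean(sales)
s = stdev(sales)                 # sample SD (divides by n - 1)
margin = t_crit * s / sqrt(n)    # margin of error
ci = (x_bar - margin, x_bar + margin)

print(f"mean = {x_bar:.2f}, SD = {s:.2f}, margin of error = {margin:.2f}")
print(f"90% confidence interval: ({ci[0]:.2f}, {ci[1]:.2f})")
```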

26
Chi-Square Distribution
  • Chi-square distribution is the probability
    distribution of the sum of several independent
    squared Z variables
  • It has a df parameter associated with it (like t
    distribution).
  • Being a sum of squares, the chi-squares cannot be
    negative and hence the distribution curve is
    entirely on the positive side, skewed to the
    right.

The mean is df and variance is 2df
27
Confidence Interval for population variance using
chi-square distribution
  • Q. A random sample of 30 gives a sample variance of
    18,540 for a certain variable. Give a 95%
    confidence interval for the population variance
  • Point estimate for the population variance = 18,540
  • Given df = 29, Excel gives the chi-square values
  • 45.7 for a right-tail area of 2.5% and 16.0 for a
    right-tail area of 97.5%
  • Hence, for the population variance,
  • the lower limit of the confidence interval
  • = 18540 x 29/45.7 = 11,765 and
  • the upper limit of the confidence interval
  • = 18540 x 29/16.0 = 33,604
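
The interval limits follow from dividing (df x sample variance) by the two chi-square values. A minimal sketch, with the two critical values copied from the slide (the standard library cannot compute them) and illustrative variable names:

```python
n = 30                 # sample size
sample_var = 18540     # sample variance
df = n - 1             # degrees of freedom

# Chi-square values for df = 29 as quoted on the slide (from Excel).
chi2_right_2_5 = 45.7    # right-tail area of 2.5%
chi2_right_97_5 = 16.0   # right-tail area of 97.5%

# The larger chi-square value gives the lower limit, and vice versa.
lower = df * sample_var / chi2_right_2_5
upper = df * sample_var / chi2_right_97_5

print(f"95% confidence interval for the variance: ({lower:,.0f}, {upper:,.0f})")
```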

29
Chi-Square Test for Goodness of Fit
  • A goodness-of-fit test is a statistical test of how well
    sample data support an assumption about the
    distribution of a population
  • The chi-square statistic used is
  • $\chi^2 = \sum (O - E)^2 / E$, where O is the observed value
    and E the expected value
  • The above value is then compared with the
    critical value (obtained from a table or using
    Excel) for the given df and the required level of
    significance, $\alpha$ (1% or 5%)

30
Illustration
  • Q. A company comes out with a new watch and
    wants to find out whether people have special
    preferences for colour or whether all four
    colours under consideration are equally
    preferred. A random sample of 80 prospective
    buyers indicated preferences as follows: 12, 40,
    8, 20. Is there a colour preference at the 1%
    significance level?
  • Assuming no preference, the expected values would
    all be 20. Hence the chi-square value is 64/20 +
    400/20 + 144/20 + 0 = 30.4
  • For df = 3 and 1% significance, the critical value
    (cutting off a right-tail area of 1%) is 11.3.
  • The computed value of 30.4 is far greater than
    11.3 and hence deep in the rejection region. So
    we reject the assumption of no colour preference.

31
  • Q. The following data are about the births of newborn
    babies on various days of the week during the
    past year in a hospital. Can we assume that
    birth is independent of the day of the week?
    Sun 116, Mon 184, Tue 148, Wed 145, Thu 153,
    Fri 150, Sat 154 (Total 1050)
  • Ans. Assuming independence, the expected values
    would all be 1050/7 = 150. Hence the chi-square
    value is $34^2/150 + 34^2/150 + 2^2/150 + 5^2/150 + 3^2/150 + 0 + 4^2/150$
    $= 2366/150 = 15.77$
  • For df = 6 and 5% significance, the critical value
    (cutting off a right-tail area of 5%) is 12.6.
  • The computed value of 15.77 is greater than the
    critical value of 12.6 and hence falls in the
    rejection region. So we reject the assumption of
    independence.

32
Correlation
  • Correlation refers to the concomitant variation
    between two variables in such a way that change
    in one is associated with a change in the other
  • The statistical technique used to analyse the
    strength and direction of the above association
    between two variables is called correlation
    analysis

33
Correlation and Causation
  • Even if an association is established between two
    variables no cause-effect relationship is implied
  • Association between x and y may be looked upon
    as
  • x causes y
  • y causes x
  • x and y influence each other (mutual influence)
  • x and y are both influenced by z, v (influence of
    third variable)
  • due to chance (spurious association)
  • Hence caution needed while interpreting
    correlation

34
Types of Correlations
  • Positive (direct) and negative (inverse)
  • Positive: direction of change is the same
  • Negative: direction of change is opposite
  • Linear and non-linear
  • Linear: changes are in a constant ratio
  • Non-linear: ratio of change is varying
  • Simple, partial and multiple
  • Simple: only two variables are involved
  • Partial: there may be third and other variables,
    but they are kept constant
  • Multiple: association of multiple variables
    considered simultaneously

35
Scatter Diagrams
[Figure: six scatter diagrams with correlation coefficients
r = 1, r = -0.54, r = 0.85, r = -0.94, r = 0.42 and r = 0.17]
36
Correlation Coefficient
  • The correlation coefficient (r) indicates the
    strength and direction of association
  • The value of r is between -1 and +1
  • -1: perfect negative correlation
  • +1: perfect positive correlation
  • |r| above 0.75: very high correlation
  • |r| from 0.50 to 0.75: high correlation
  • |r| from 0.25 to 0.50: low correlation
  • |r| below 0.25: very low correlation

37
Methods of Correlation Analysis
  • Scatter diagram
  • A quick, approximate visual idea of association
  • Karl Pearson's coefficient of correlation
  • For numeric data measured on an interval or ratio
    scale
  • $r = \mathrm{Cov}(x, y) / (SD_x \, SD_y)$
  • Spearman's rank correlation
  • For ordinal (rank) data
  • $R = 1 - 6 \sum d^2 / [n(n^2 - 1)]$, where $\sum d^2$ is the
    sum of squared differences of ranks
  • Method of least squares
  • $r^2 = b_{xy} \, b_{yx}$, i.e. the product of the regression
    coefficients

38
Karl Pearson Correlation Coefficient
(Product-Moment Correlation)
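  • $r = \mathrm{Cov}(x, y) / (SD_x \, SD_y)$
  • Recall $n \, \mathrm{Var}(X) = SS_{xx}$, $n \, \mathrm{Var}(Y) = SS_{yy}$ and
    $n \, \mathrm{Cov}(X, Y) = SS_{xy}$
  • Thus $r^2 = \mathrm{Cov}^2(X, Y) / [\mathrm{Var}(X) \, \mathrm{Var}(Y)]$
  • $= SS_{xy}^2 / (SS_{xx} \, SS_{yy})$
  • Note: $r^2$ is called the coefficient of determination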

39
Sample Problem
  • The following data refer to two variables,
    promotional expense (Rs. lakhs) and sales ('000
    units), collected in the context of a promotional
    study. Calculate the correlation coefficient.
  • Promo: 7 10 9 4 11 5 3
  • Sales: 12 14 13 5 15 7 4

40
Promo (X)  Sales (Y)  X - Ave(X)  Y - Ave(Y)   Sxy   Sxx   Syy
    7         12           0           2         0     0     4
   10         14           3           4        12     9    16
    9         13           2           3         6     4     9
    4          5          -3          -5        15     9    25
   11         15           4           5        20    16    25
    5          7          -2          -3         6     4     9
    3          4          -4          -6        24    16    36

Ave(X) = 7, Ave(Y) = 10; SSxy = 83, SSxx = 58, SSyy = 124

Coefficient of determination, r-squared = 83 x 83 / (58 x 124) = 0.95787
Coefficient of correlation, r = square root of 0.95787 = 0.978708
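
The table can be reproduced programmatically. The sketch below computes the sums of squares and the Pearson coefficient directly from the promo and sales data; the function and variable names are illustrative.

```python
from math import sqrt

promo = [7, 10, 9, 4, 11, 5, 3]     # promotional expense, Rs. lakhs
sales = [12, 14, 13, 5, 15, 7, 4]   # sales, '000 units


def sums_of_squares(x, y):
    """Return (SSxy, SSxx, SSyy) computed from deviations about the means."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    ss_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    ss_xx = sum((a - x_bar) ** 2 for a in x)
    ss_yy = sum((b - y_bar) ** 2 for b in y)
    return ss_xy, ss_xx, ss_yy


ss_xy, ss_xx, ss_yy = sums_of_squares(promo, sales)

r_squared = ss_xy ** 2 / (ss_xx * ss_yy)   # coefficient of determination
r = ss_xy / sqrt(ss_xx * ss_yy)            # coefficient of correlation

print(f"SSxy = {ss_xy}, SSxx = {ss_xx}, SSyy = {ss_yy}")
print(f"r-squared = {r_squared:.5f}, r = {r:.6f}")
```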
41
Spearman's Rank Correlation Coefficient
  • The ranks of 15 students in two subjects A and B
    are given below. Find Spearman's rank correlation
    coefficient.
  • (1,10) (2,7) (3,2) (4,6) (5,4) (6,8)
    (7,3) (8,1) (9,11) (10,15) (11,9) (12,5)
    (13,14) (14,12) and (15,13)
  • Solution: sum of squared differences (SSD) of ranks
    = 81 + 25 + 1 + 4 + 1 + 4 + 16 + 49 + 4 + 25 + 4 + 49 + 1 + 4 + 4 = 272
  • R = 1 - 6 x 272/(14 x 15 x 16) = 0.5143
  • Hence a moderate degree of positive correlation
    between the ranks of students in the two subjects
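
The rank-difference formula is short enough to check in code. A minimal sketch, assuming no tied ranks (as in this data set); variable names are illustrative.

```python
# (rank in subject A, rank in subject B) for each of the 15 students.
ranks = [
    (1, 10), (2, 7), (3, 2), (4, 6), (5, 4), (6, 8), (7, 3), (8, 1),
    (9, 11), (10, 15), (11, 9), (12, 5), (13, 14), (14, 12), (15, 13),
]

n = len(ranks)

# Sum of squared differences between the two ranks of each student.
sum_d2 = sum((a - b) ** 2 for a, b in ranks)

# Spearman's formula; n * (n^2 - 1) equals (n - 1) * n * (n + 1) = 14 x 15 x 16.
rank_corr = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(f"Sum of squared rank differences = {sum_d2}")
print(f"Spearman's rank correlation = {rank_corr:.4f}")
```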

42
Regression Analysis
  • Statistical technique for expressing the
    relationship between two (or more) variables in
    the form of an equation (regression equation)
  • Dependent or response or predicted variable
  • Independent or regressor or predictor variable
  • Used for prediction or forecasting

43
Types of Regression Models
  • Simple and multiple regression models
  • Simple: only one independent variable
  • Multiple: more than one independent variable
  • Linear and nonlinear regression models
  • Linear: the value of the response variable changes in
    proportion to the change in the predictor, so that
    $Y = a + bX$

44
Simple Linear Regression Model
  • $Y = a + bX$,
  • a and b are constants to be determined using the
    given data
  • Note: more strictly, we may write $Y = a_{yx} + b_{yx} X$
  • To determine a and b, solve the following two
    equations (called normal equations):
  • $\sum Y = a n + b \sum X$ ------- (1)
  • $\sum XY = a \sum X + b \sum X^2$ ------- (2)
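
As an illustration, the two normal equations can be solved directly for the promo and sales data from the earlier sample problem. The sketch below uses Cramer's rule for the 2 x 2 system; this solution method is an assumption chosen for brevity, and the slides do not prescribe one.

```python
promo = [7, 10, 9, 4, 11, 5, 3]     # X: promotional expense, Rs. lakhs
sales = [12, 14, 13, 5, 15, 7, 4]   # Y: sales, '000 units

n = len(promo)
sum_x = sum(promo)
sum_y = sum(sales)
sum_xy = sum(x * y for x, y in zip(promo, sales))
sum_x2 = sum(x * x for x in promo)

# Normal equations:
#   sum_y  = a * n     + b * sum_x    (1)
#   sum_xy = a * sum_x + b * sum_x2   (2)
# Solve the 2 x 2 system with Cramer's rule.
det = n * sum_x2 - sum_x ** 2
a = (sum_y * sum_x2 - sum_x * sum_xy) / det
b = (n * sum_xy - sum_x * sum_y) / det

print(f"a = {a:.4f}, b = {b:.4f}")
```

The result (a close to -0.017, b close to 1.431) matches the values obtained later in the deck from the sums-of-squares formulae.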

45
Calculating Regression Coefficients
  • Instead of solving the simultaneous equations, one
    may directly use formulae
  • For $Y = a + bX$, i.e. regression of Y on X:
  • $b_{yx} = SS_{xy} / SS_{xx}$
  • $a_{yx} = \bar{Y} - b_{yx} \bar{X}$, where $\bar{Y}$ and $\bar{X}$ are the
    mean values of Y and X
  • For the $X = a + bY$ form (regression of X on Y):
  • $b_{xy} = SS_{xy} / SS_{yy}$
  • $a_{xy} = \bar{X} - b_{xy} \bar{Y}$, where $\bar{X}$ and $\bar{Y}$ are the
    mean values of X and Y
46
Example
  • For the earlier problem of Sales (dependent
    variable) Vs Promotional expenses (independent
    variable) set up the simple linear regression
    model and predict the sales when promotional
    spending is Rs.13 lakhs
  • Solution: we need to find a and b in $Y = a + bX$
  • $b = SS_{xy} / SS_{xx} = 83/58 = 1.4310$
  • $a = \bar{Y} - b \bar{X} = 10 - 1.4310 \times 7 = -0.017$
  • Hence the regression equation is $Y = -0.017 + 1.4310 X$
  • For X = 13 lakhs, we get Y = 18.59, i.e. 18,590
    units of predicted sales
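
The fit and the prediction can be sketched with the sums-of-squares formulae. The helper name `predict_sales` is hypothetical and used only for illustration.

```python
promo = [7, 10, 9, 4, 11, 5, 3]     # X: promotional expense, Rs. lakhs
sales = [12, 14, 13, 5, 15, 7, 4]   # Y: sales, '000 units

x_bar = sum(promo) / len(promo)
y_bar = sum(sales) / len(sales)

ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(promo, sales))
ss_xx = sum((x - x_bar) ** 2 for x in promo)

b = ss_xy / ss_xx          # slope of the regression of Y on X
a = y_bar - b * x_bar      # intercept: the line passes through the means


def predict_sales(promo_spend):
    """Predicted sales ('000 units) for a given promotional spend (Rs. lakhs)."""
    return a + b * promo_spend


predicted = predict_sales(13)

print(f"Y = {a:.3f} + {b:.4f} X")
print(f"Predicted sales at X = 13: {predicted:.2f} thousand units")
```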

47
Linear Regression using Excel
48
Properties of Regression Coefficients
  • Coefficient of determination: $r^2 = b_{yx} \, b_{xy}$
  • If one regression coefficient is greater than one in
    magnitude, the other is less than one, because $r^2$
    lies between 0 and 1
  • Both regression coefficients must have the same sign,
    which is also the sign of the correlation coefficient r
  • The regression lines intersect at the means of X
    and Y
  • Each regression coefficient gives the slope of
    the respective regression line

49
Coefficient of Determination
  • Recall:
  • $SS_{yy}$ = sum of squared deviations of Y from the
    mean
  • Let us define:
  • SSR as the sum of squared deviations of the estimated
    (using the regression equation) values of Y from the
    mean
  • SSE as the sum of squared errors
    (error means actual Y minus estimated Y)
  • It can be shown that
  • $SS_{yy} = SSR + SSE$, i.e. Total Variation =
    Explained Variation + Unexplained (error)
    Variation
  • $r^2 = SSR / SS_{yy}$ = Explained Variation / Total
    Variation
  • Thus $r^2$ represents the proportion of the total
    variability of the dependent variable y that is
    accounted for or explained by the independent
    variable x

50
Coefficient of Determination for Statistical
Validity of Promo-Sales Regression Model
Ye = -0.017 + 1.4310X is the estimated value of Y; "mean" is the mean of Y (10).

Promo (X)  Sales (Y)    Ye     (Ye - mean)^2  (Y - Ye)^2  (Y - mean)^2
    7         12       10.00        0.00         4.00          4
   10         14       14.29       18.43         0.09         16
    9         13       12.86        8.19         0.02          9
    4          5        5.71       18.43         0.50         25
   11         15       15.72       32.76         0.52         25
    5          7        7.14        8.19         0.02          9
    3          4        4.28       32.76         0.08         36

Ave(X) = 7, Ave(Y) = 10; SSR = 118.77, SSE = 5.22, SSyy = 124

Coefficient of determination, r-squared = 118.77/124 = 0.957824

Thus 96% of the variation in sales is explained by promo expenses
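
The decomposition of total variation into explained and unexplained parts can be verified numerically. The sketch below refits the line without rounding a and b, so its r-squared (about 0.95787) differs in the fifth decimal place from the table value, which is built from rounded coefficients. Variable names are illustrative.

```python
promo = [7, 10, 9, 4, 11, 5, 3]     # X: promotional expense, Rs. lakhs
sales = [12, 14, 13, 5, 15, 7, 4]   # Y: sales, '000 units

x_bar = sum(promo) / len(promo)
y_bar = sum(sales) / len(sales)

# Least-squares fit of Y on X using the sums-of-squares formulae.
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(promo, sales))
ss_xx = sum((x - x_bar) ** 2 for x in promo)
b = ss_xy / ss_xx
a = y_bar - b * x_bar

fitted = [a + b * x for x in promo]   # estimated Y for each observation

ssr = sum((f - y_bar) ** 2 for f in fitted)            # explained variation
sse = sum((y - f) ** 2 for y, f in zip(sales, fitted)) # unexplained variation
ss_yy = sum((y - y_bar) ** 2 for y in sales)           # total variation

r_squared = ssr / ss_yy

print(f"SSR = {ssr:.2f}, SSE = {sse:.2f}, SSyy = {ss_yy:.0f}")
print(f"r-squared = {r_squared:.5f}")
```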