??9:10?12:00 A211? - PowerPoint PPT Presentation

Slides: 46
Provided by: Hun117
Transcript and Presenter's Notes

1
?????
  • ? ?
  • ???????
  • ??910?1200 A211?
  • hchen_at_math.ntu.edu.tw

2
????
  • ????,????????(?????2?)
  • ???????
  • ??????
  • ?????????????
  • ????(?????7?)
  • ??????????
  • ?????
  • ?????
  • ?????(?????8?)
  • Principal Component Analysis
  • Factor Analysis
  • Discriminant Analysis
  • Cluster Analysis
  • Canonical Correlation Analysis

3
  • ???
  • ??
  • ????
  • R(??????)
  • R has a home page at http://www.r-project.org/
  • Download
  • ??????
  • ???(30)?projects(70)

4
? ?
  • ??
  • Exploratory Data Analysis Decision Making
  • Data Mining
  • Data Collection ?????
  • ????
  • R Software
  • ????,????????
  • Probability and Random Variables
  • Variance
  • ????
  • Association
  • IntroRegression
  • MultipleRegression
  • DAonREgression

5
? ?
  • ?????
  • Principal Component Analysis
  • Factor Analysis
  • Discriminant Analysis
  • Cluster Analysis
  • Canonical Correlation Analysis

6
Statistics for Decision Making
  • Describing Sets of Data
  • Objective: Introduce numerical methods and
    graphical displays to summarize data sets.
  • Graphical and numerical tools:
  • for examining the distribution of a single
    variable,
  • for comparing several distributions, and
  • for investigating changes over time.
  • Sampling and Statistical Inference
  • Objective: Provide methods for making inferences
    about a population based on a sample of
    observations drawn from that population.
  • Forecasting with Distinguishable Data
  • Objective: Introduce the basic concepts of
    forecasting to motivate a regression model.
  • Method for studying relationships among several
    variables.
  • Regression Coefficients and Forecasts
  • Objective: Understand regression coefficients and
    how to use them for forecasting.

7
Statistics for Decision Making
  • Measures of Goodness of Fit and Residual Analysis
  • Objective: Introduce a few statistics that
    measure how well a regression model fits the data
    and show how to use residual analysis to detect
    inadequacies of a regression model.
  • Developing a Regression Model
  • Objective: Demonstrate how to develop a useful
    regression model through:
  • Selection of the Dependent Variable
  • Selection of the Independent Variables
  • Determining the Nature of Relationships

8
Sampling and Statistical Inference
  • Objective: Provide methods for making inferences
    about a population based on a sample of
    observations drawn from that population.
  • Inference from a Sample
  • Statistical Estimation
  • From Margin of Error to Confidence Interval
  • Test of Significance

9
Inference from a Sample
  • The sample provides useful information, but the
    information is imperfect.
  • Samples are taken when it is impossible,
    impractical or too expensive to obtain complete
    data on relevant population.
  • Ex. Suppose you asked 100 potential customers
    how much they will spend on a proposed new
    product next year.
  • From the 100 responses you obtained a sample
    average of 250. You could make the following
    inferences:
  • My best estimate of average sales per potential
    customer is 250.
  • Average sales per potential customer will be
    between 210 and 290 with 95% confidence.
  • Average sales per potential customer will be
    greater than the break-even amount of 210 at a
    2.5% level of significance.
  • Law of Large Numbers
  • Draw independent observations at random from any
    population with finite mean μ.
  • As the number of observations drawn increases,
    the mean of the observed values eventually
    approaches the mean μ of the population as
    closely as you specify and then stays that
    close.
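
The Law of Large Numbers can be watched in a small simulation. The sketch below is only an illustration: the normal population with mean 250 and SD 200 is an assumption made for the demo, not data from the slides, and the helper name running_means is hypothetical.

```python
import random

def running_means(draws):
    """Return the mean of the first k draws, for k = 1..len(draws)."""
    means, total = [], 0.0
    for k, x in enumerate(draws, start=1):
        total += x
        means.append(total / k)
    return means

# Assumed population, for illustration only: normal, mean 250, SD 200.
MU, SIGMA = 250.0, 200.0
rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [rng.gauss(MU, SIGMA) for _ in range(100_000)]
means = running_means(draws)

# The running mean settles closer and closer to MU as n grows.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n = {n:>6}: running mean = {means[n - 1]:.2f}")
```

With few observations the running mean can be far from 250; as n grows it should settle near 250 and stay there.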

10
Sampling variability
  • Parameter: p, the proportion of the adult
    population in the US (190 million) that finds
    clothes shopping frustrating.
  • Statistic: p̂ = 66%, or 1650 out of 2500 adults.
  • Sampling variability: the value of a statistic
    varies in repeated random sampling.
  • To answer "What would happen if we took many
    samples?":
  • Take a large number of samples from the same
    population.
  • Calculate the sample proportion p̂ for each
    sample.
  • Make a histogram of the values of p̂.
  • Examine the distribution displayed in the
    histogram.
  • We can imitate the chance behavior of many samples
    by using random digits or a computer (simulation).

11
Sampling variability
  • The sampling distribution of a statistic is the
    distribution of values taken by the statistic in
    all possible samples of the same size from the
    same population.
  • It can be either
  • approximated by simulation or
  • obtained exactly by probability theory in
    statistics.
  • 1000 SRSs of size 100 when p = 0.6.
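
The simulation approach can be sketched as follows. This is a minimal illustration, assuming each respondent independently answers "yes" with probability 0.6, which approximates an SRS from a very large population. The function name sample_count and the text histogram are choices made for this sketch.

```python
import random
from collections import Counter
from statistics import mean

def sample_count(n, p, rng):
    """Number of 'yes' answers in one simulated sample of size n."""
    return sum(rng.random() < p for _ in range(n))

P_TRUE, N, REPS = 0.6, 100, 1000
rng = random.Random(1)  # fixed seed so the run is repeatable
counts = [sample_count(N, P_TRUE, rng) for _ in range(REPS)]
p_hats = [c / N for c in counts]

mean_p_hat = mean(p_hats)
# Share of samples whose proportion lands within 0.1 of the true p
share_within = sum(abs(c / N - P_TRUE) <= 0.1 + 1e-9 for c in counts) / REPS

# Text histogram of the sample proportions, two counts (0.02) per bin
hist = Counter(c // 2 * 2 for c in counts)
for start in sorted(hist):
    print(f"{start / N:.2f}-{(start + 1) / N:.2f} {'#' * (hist[start] // 5)}")
print(f"mean of sample proportions: {mean_p_hat:.3f}")
print(f"share within 0.6 +/- 0.1:   {share_within:.3f}")
```

The histogram should be roughly bell-shaped and centered near 0.6, with most samples falling within 0.1 of the true proportion.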

12
1000 SRSs of size 100 and 2500 when p = 0.6
13
Bias and variance
  • A statistic is unbiased if the mean of its
    sampling distribution is equal to the true value
    of the parameter being estimated (no
    favoritism).
  • The variability of a statistic is described by
    the spread of its sampling distribution.
  • 95% of the sample proportions will lie in the
    range 0.6 ± 0.1 (n = 100) or 0.6 ± 0.02 (n = 2500).
  • Larger samples have smaller spreads.
  • As long as the population is much larger than the
    sample, the spread of the sampling distribution
    for a sample of fixed size n is approximately the
    same for any population size.
  • An SRS of size 2500 from 270 million US residents
    gives results as precise as an SRS of size 2500
    from 740,000 inhabitants of SFO!
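
The spreads quoted above can be checked with the usual formula for the standard deviation of a sample proportion, sqrt(p(1 - p)/n). The sketch below also applies the standard finite population correction, sqrt((N - n)/(N - 1)), to show why the population size barely matters. The function name margin_95 is hypothetical, and "two standard deviations" is used as the approximate 95% range.

```python
import math

def margin_95(p, n, population=None):
    """Approximate 95% margin: two standard deviations of the sample proportion.

    With population=None the population is treated as infinite; otherwise
    the finite population correction sqrt((N - n) / (N - 1)) is applied.
    """
    sd = math.sqrt(p * (1 - p) / n)
    if population is not None:
        sd *= math.sqrt((population - n) / (population - 1))
    return 2 * sd

for n in (100, 2500):
    print(f"n = {n:>4}: margin = {margin_95(0.6, n):.3f}")

for population in (270_000_000, 740_000):
    print(f"N = {population:>11,}: margin = {margin_95(0.6, 2500, population):.4f}")
```

The two population sizes give margins that agree to about four decimal places, so the sample size, not the population size, drives the precision.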

14
(No Transcript)
15
Why randomize?
  • The act of randomizing guarantees that the
    results of analyzing our data are subject to the
    laws of probability.
  • Randomization removes bias.
  • Replication (bigger sample) reduces variance.
  • Better answer to "What would happen if the sample
    or the experiment were repeated many times?"
  • Caution: the sampling distribution does not
    reflect bias due to under-coverage, non-response,
    lack of realism, etc.

16
Presidential Election and Poll

17
??1936???????
  • ??????????????????????????????
  • ??????????????
  • ??????,?1929??1933?????????????
  • ??????????????????The spender must go?
  • ???????????????? (deficit financing)????Balance
    the budget of the American people first?
  • ????????????????????
  • ???Literary Digest????????57?43?????
  • ?????????????????????
  • ????1916??,????????????????
  • ????????62?38?????????
  • ?????-???-???
  • ??Literary Digest??????????????,????????,???????56
    ?44?????
  • ?????????????,??????????56?44?????

18
Digest???????
  • ?????????????,????????,????????????????????
  • ????????????????,????????
  • ???????
  • ????Digest?????????????,???????????????
  • ??????????
  • ????????,?????????????,???20???,????????????,????
    ????????????????

19
??????????????????
  • ????16??????393??????????????,
  • ???1033???????
  • ????????,???????????????????,?????????????????,???
    ??16??????????????(????)???
  • ?????????,?????????????????

20
Digest???????
  • ?????????????,????????,????????????????????
  • (???????????????????????)?
  • ???????
  • ????Digest?????????????,???????????????
  • ??????????
  • ????????,?????????????,???20???,????????????,????
    ????????????????

21
Statistical Estimation
  • A parameter is a number that describes the
    population.
  • Its value is fixed but unknown.
  • A statistic is a number that describes a sample.
  • Its value is known for a sample, but it can
    change from sample to sample.
  • We use a statistic to estimate an unknown
    parameter.
  • Error of estimation is the difference between an
    estimate and the estimated parameter.
  • In the case of estimating the population mean
    using the sample mean,
  • Error of estimation = sample mean - population
    mean
  • The distribution of the error of estimation:
    Central Limit Theorem
  • If the sample size is large, the error of
    estimation is approximately normally distributed
    with mean zero and a standard deviation which can
    be estimated by
  • Standard error = sample standard
    deviation / (sample size)^(1/2)
  • The Normal Distribution
  • If X has a N(μ, σ²) distribution, then
    Z = (X - μ)/σ has a N(0, 1) distribution.

22
The normal density
  • The height of the normal density curve for the
    normal distribution with mean μ and SD σ is given
    by
    f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)}
  • Why are normal distributions important?
  • Good description for some distributions of real
    data. (e.g. test scores, repeated measurements,
    characteristics of biological populations, etc.)
  • Good approximations to the results of many kinds
    of chance outcomes. (e.g. coin tossing).
  • Many statistical inference procedures based on
    normal distributions work well for other roughly
    symmetric distributions.

23
From Margin of Error to Confidence Interval
  • What is the probability that the error of
    estimation exceeds two standard errors?
  • If we use two standard errors as the margin of
    error around our estimate, what can we say about
    the resulting interval estimate?
  • Confidence and Probability
  • When reporting that a confidence interval for a
    population mean extends from 210 to 290, it is
    tempting to slip into the language of
    probability, and say there is only a 5% chance
    that the true mean of the population is outside
    this interval.
  • Such a probabilistic interpretation is much more
    natural and appealing than the rather convoluted
    interpretation above. But is it legitimate?
  • Example
  • Suppose from a sample of 100 potential customers
    one market researcher obtained a 95% confidence
    interval of (190, 210) for the average amount a
    potential customer will spend on a product next
    year.
  • Another market researcher from a different sample
    of size 400 obtained a 95% confidence interval of
    (215, 225).
  • How do you reconcile these two results?
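
The step from margin of error to confidence interval can be sketched numerically. The sample mean of 250 and the sample size of 100 come from the earlier customer example. The sample SD of 200 is an assumption here, borrowed from the figure given in the Test of Significance example. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def standard_error(sample_sd, n):
    """Standard error of the sample mean: SD / sqrt(n)."""
    return sample_sd / math.sqrt(n)

def two_se_interval(sample_mean, sample_sd, n):
    """Interval estimate: sample mean plus or minus two standard errors."""
    margin = 2 * standard_error(sample_sd, n)
    return sample_mean - margin, sample_mean + margin

# Probability that a normally distributed error of estimation
# exceeds two standard errors in either direction: P(|Z| > 2)
prob_exceeds_two_se = 2 * (1 - NormalDist().cdf(2))

# Assumed inputs: mean 250, SD 200, n = 100
low, high = two_se_interval(250, 200, 100)
print(f"standard error: {standard_error(200, 100):.1f}")
print(f"interval: ({low:.0f}, {high:.0f})")
print(f"P(|error| > 2 SE) = {prob_exceeds_two_se:.4f}")
```

Under these assumptions the interval is (210, 290), and the error exceeds two standard errors a little under 5% of the time, which is where the 95% figure comes from.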

24
Test of Significance
  • Example 1: A market researcher asked a sample of
    100 potential customers how much they plan to
    spend on a product next year.
  • The mean of the sample turned out to be 250 and
    the standard deviation is 200.
  • Is it likely that average sales per capita
    exceed a break-even level of 208?
  • Example 2: Suppose a manager is trying to decide
    which of two new products, A or B, to
    introduce. Break-even sales per capita are 208
    for both A and B.
  • Sample results are given in the following.
  • Product A: sample size = 10,000, sample mean = 211,
    sample SD = 100
  • Product B: sample size = 100, sample mean = 250,
    sample SD = 300
  • Example 3: In a Business Week/Harris executive
    poll, senior executives were asked "Compared
    with the last 12 months, do you think the rate of
    growth of the gross domestic product will go up,
    go down, or stay the same for the next 12 months?"
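
For Example 2, one way to compare the two products is a one-sided z-test of each sample mean against the break-even level. The sketch below uses the figures given for Products A and B. The large-sample normal approximation and the function names are assumptions of this illustration, not necessarily the method the lecture goes on to use.

```python
import math
from statistics import NormalDist

def z_statistic(sample_mean, sample_sd, n, break_even):
    """Standard errors by which the sample mean exceeds break-even."""
    return (sample_mean - break_even) / (sample_sd / math.sqrt(n))

def one_sided_p_value(z):
    """P(Z >= z) under the standard normal distribution."""
    return 1 - NormalDist().cdf(z)

BREAK_EVEN = 208
# name: (sample mean, sample SD, sample size), as given in Example 2
products = {"A": (211, 100, 10_000), "B": (250, 300, 100)}

results = {}
for name, (sample_mean, sample_sd, n) in products.items():
    z = z_statistic(sample_mean, sample_sd, n, BREAK_EVEN)
    results[name] = (z, one_sided_p_value(z))
    print(f"Product {name}: z = {z:.2f}, one-sided p = {results[name][1]:.4f}")
```

Product A beats break-even by a small amount but with strong evidence (z = 3), while Product B shows a much larger sample mean with weaker evidence (z = 1.4). Statistical significance and the size of the effect are different questions.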

25
Test for Independence
  • Application on Business outlook
  • Results of this poll are summarized below
    (Business Week, 1/09/95).

    Outlook          Date of Survey
                     12/94   6/94   12/93   Total
    Go Up              152    177     101     430
    Go Down            104     72      36     212
    Stay the Same      144    152     261     557
    Not Sure             0      0       4       4
    Total              400    401     402    1203

  • Have the executives changed their outlook over
    time?
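
A chi-square test of independence is one common way to address this question. The sketch below computes the statistic by hand from the counts in the table. The critical value 12.592 is the standard 5% point of the chi-square distribution with 6 degrees of freedom. The expected counts in the "Not Sure" row are very small, so the approximation is rough for that row, and an analyst might reasonably drop or merge it.

```python
# Observed counts by survey date (12/94, 6/94, 12/93), from the table
observed = {
    "Go Up": [152, 177, 101],
    "Go Down": [104, 72, 36],
    "Stay the Same": [144, 152, 261],
    "Not Sure": [0, 0, 4],
}

def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a two-way table."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for row, row_total in zip(rows, row_totals):
        for count, col_total in zip(row, col_totals):
            # Expected count if outlook were independent of survey date
            expected = row_total * col_total / grand_total
            statistic += (count - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return statistic, df

CRITICAL_5PCT_DF6 = 12.592  # 5% critical value, chi-square with 6 df

chi2, df = chi_square_statistic(observed)
print(f"chi-square = {chi2:.1f} on {df} degrees of freedom")
print("reject independence at the 5% level:", chi2 > CRITICAL_5PCT_DF6)
```

The statistic is far above the critical value, which suggests the distribution of outlooks differs across the three survey dates.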

26
Relations in categorical data
  • Relationship between two or more categorical
    variables.
  • Use counts (frequencies) or percent (relative
    frequencies) of individuals that fall into
    various categories.
  • Two-way table
  • A two-way table describes two categorical
    variables.
  • Each horizontal row in the table describes
    individuals with one level of the row variable.
  • Each vertical column describes individuals with
    one level of the column variable.
  • Ex. Years of school completed, by age (thousands
    of persons)

27
Marginal distributions
  • Look at the distribution of each variable
    separately.
  • The Total column lists the row totals, and the
    Total row lists the column totals.
  • Row and column totals specify the marginal
    distributions of each of the two categorical
    variables.
  • The distribution of years of schooling completed
    among people age 25 years and over

28
Describing relationships
  • What percent of people aged 25 to 34 have
    completed 4 years of college?
  • What percent of people aged 35 to 54 have
    completed 4 years of college?
  • What percent of people aged 55 and over have
    completed 4 years of college?
  • Conclusion?

29
Conditional distribution of age group on the
education level
30
Three-way table
  • The table of outcome by hospital by patient
    condition is a three-way table that reports the
    frequencies of each combination of levels of
    three categorical variables.
  • We can aggregate a three-way table into a two-way
    table.
  • A variable being aggregated can become a lurking
    variable.

31
NSF study on the salaries of new women engineers
  • The median salary of newly graduated female
    engineers and scientists was 73% of that for
    males.
  • Field is a lurking variable. (life and social
    sciences against physical and engineering)

32
Establishing causation
  • The best (and only?) method of establishing
    causation is to conduct a carefully designed
    experiment in which the effects of possible
    lurking variables are controlled.
  • What other criteria can we use when we can't do
    an experiment?

33
Smoking causes lung cancer
  • The association is strong.
  • The association is consistent.
  • Higher doses are associated with stronger
    responses.
  • The alleged cause precedes the effect in time.
  • The alleged cause is plausible.

34
Forecasting with Distinguishable Data
  • Objective: Introduce the basic concepts of
    forecasting to motivate a regression model.
  • Forecasting with Indistinguishable Data
  • If the future value of the variable you would
    like to forecast is indistinguishable from the
    sample values you collected, then you forecast
    with indistinguishable data.
  • Example 1: To help forecast the selling price
    of your house, you obtained a sample: 109,360,
    137,980, 131,230, 130,230, 125,410, 124,370,
    139,030, 140,160, 144,220, 154,190.
  • Forecasting when the Data are Distinguishable
  • When your sample contains additional information
    so that the sample values are no longer
    indistinguishable from the future value you would
    like to forecast, you forecast with
    distinguishable data.
  • Example 2: Our sample also contains
    information on the square footage of the ten
    houses: (109,360, 1404), (137,980, 1477),
    (131,230, 1503), (130,230, 1552),
    (125,410, 1608), (124,370, 1633),
    (139,030, 1717), (140,160, 1775),
    (144,220, 1838), (154,190, 1934).

35
Forecasting with Distinguishable Data
  • Assume that your house has 1682 square feet of
    living area.
  • Analysis 1: sample average of all ten houses:
    133,618 (SD 12,406)
  • Analysis 2: Stratify the sample according to
    house size (square feet of living area).

    Size Range   Sample Average       SD   Number of Observations
    1400-1599           127,200   12,381   4
    1600-1799           132,243    8,513   4
    1800-1999           149,205    7,050   2

  • Then use 132,243 (instead of 133,618) to
    forecast the selling value.
  • Does the cell standard deviation properly measure
    the forecast uncertainty?
  • Is it possible to have a measure of the overall
    efficacy of our partitioning the sample into
    cells?
  • Use the data more efficiently: The stratification
    method that we used is unsatisfactory for two
    reasons. First, we have ignored data on houses
    that are less like yours but still somewhat
    similar. Second, we have stratified the data
    somewhat arbitrarily.
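
The two analyses can be reproduced from the ten (price, square footage) pairs listed earlier. This is a minimal sketch: the 200-square-foot strata follow the size ranges in the table, and the function names summarize and forecast are hypothetical.

```python
from statistics import mean, stdev

# (selling price, square feet) for the ten houses in the sample
houses = [
    (109360, 1404), (137980, 1477), (131230, 1503), (130230, 1552),
    (125410, 1608), (124370, 1633), (139030, 1717), (140160, 1775),
    (144220, 1838), (154190, 1934),
]

def summarize(prices):
    """Return (sample average, sample SD, number of observations)."""
    return mean(prices), stdev(prices), len(prices)

# Analysis 1: ignore size and use all ten houses
overall = summarize([price for price, _ in houses])

# Analysis 2: strata 200 square feet wide, as in the table
strata = {}
for low in (1400, 1600, 1800):
    prices = [price for price, sqft in houses if low <= sqft < low + 200]
    strata[(low, low + 199)] = summarize(prices)

def forecast(sqft):
    """Forecast = average price of the stratum that contains sqft."""
    for (low, high), (average, _, _) in strata.items():
        if low <= sqft <= high:
            return average
    raise ValueError("no stratum covers this size")

print(f"all houses: {overall[0]:,.1f} (SD {overall[1]:,.1f})")
for (low, high), (average, sd, count) in strata.items():
    print(f"{low}-{high}: {average:,.1f} (SD {sd:,.1f}, n = {count})")
print(f"forecast for 1682 sq ft: {forecast(1682):,.1f}")
```

The forecast for a 1682-square-foot house rests on only four of the ten observations, which is the inefficiency the slide points out.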

36
The question of causation
  • Mother's adult height vs daughter's adult height.
  • Amount of saccharin in a rat's diet vs count of
    tumors in the rat's bladder.
  • A student's SAT score and the student's
    first-year GPA.
  • Monthly flow of money into stock mutual funds vs
    monthly rate of return for the stock market.
  • The anesthetic used in surgery vs whether the
    patient survives the surgery.
  • The number of years of education a worker has vs
    the worker's income.

37
Explaining association
  • Causation.
  • Common response. (a lurking variable).
  • Confounding: two variables are confounded when
    their effects on a response variable are mixed
    together.

38
Data on the survival of patients after surgery in
hospital A and B
  • Hospital A loses 3% of patients while Hospital B
    loses 2%.

39
Lurking variable...
  • 1% vs 1.3% for patients in good condition
  • 3.8% vs 4% for patients in bad condition

40
Simpson's paradox
  • How can A do better in each group, yet do worse
    overall?
  • An association or comparison that holds for all
    of several groups can reverse direction when the
    data are combined to form a single group.
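
The reversal can be reproduced with a small calculation. The patient counts below are NOT in this transcript. They are assumed values chosen only because they reproduce the percentages quoted on the hospital slides (3% vs 2% overall, 1% vs 1.3% in good condition, 3.8% vs 4% in bad condition).

```python
# Assumed counts (hypothetical, chosen to match the quoted percentages):
# hospital -> condition -> (deaths, patients)
data = {
    "A": {"good": (6, 600), "bad": (57, 1500)},
    "B": {"good": (8, 600), "bad": (8, 200)},
}

def death_rate(deaths, patients):
    return deaths / patients

def overall_rate(groups):
    """Death rate after aggregating over patient condition."""
    deaths = sum(d for d, _ in groups.values())
    patients = sum(n for _, n in groups.values())
    return death_rate(deaths, patients)

for condition in ("good", "bad"):
    rate_a = death_rate(*data["A"][condition])
    rate_b = death_rate(*data["B"][condition])
    print(f"{condition} condition: A {rate_a:.1%} vs B {rate_b:.1%}")

print(f"overall: A {overall_rate(data['A']):.1%} vs B {overall_rate(data['B']):.1%}")
```

Hospital A has the lower death rate within each condition but the higher rate overall, because under these assumed counts it treats far more patients in bad condition. Patient condition is the lurking variable.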

41
Regression Model
  • Try to create a model that specifies the
    relationship between selling price (dependent
    variable) and other variables (independent or
    explanatory variables) that help you forecast the
    selling price.
  • It is reasonable to assume that as size goes up,
    selling price will go up on average.
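
As one possible form of such a model, the sketch below fits a straight line, price = intercept + slope × size, by ordinary least squares to the ten houses from the forecasting example, and then forecasts the price of the 1682-square-foot house. A simple linear model is an assumption made for illustration; the slides do not specify the model at this point. The function name fit_line is hypothetical.

```python
# (selling price, square feet) for the ten houses in the sample
houses = [
    (109360, 1404), (137980, 1477), (131230, 1503), (130230, 1552),
    (125410, 1608), (124370, 1633), (139030, 1717), (140160, 1775),
    (144220, 1838), (154190, 1934),
]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

sizes = [sqft for _, sqft in houses]
prices = [price for price, _ in houses]
intercept, slope = fit_line(sizes, prices)
forecast = intercept + slope * 1682

print(f"price = {intercept:,.0f} + {slope:.1f} * size")
print(f"forecast for 1682 sq ft: {forecast:,.0f}")
```

Unlike stratification, the fitted line uses all ten observations and needs no arbitrary choice of size ranges.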

42
Regression Coefficients and Forecasts
  • Objective: Understand regression coefficients and
    how to use them for forecasting.

43
Measures of Goodness of Fit and Residual Analysis
  • Objective: Introduce a few statistics that
    measure how well a regression model fits the data
    and show how to use residual analysis to detect
    inadequacies of a regression model.

44
Developing a Regression Model
  • Objective: Demonstrate how to develop a useful
    regression model through:
  • Selection of the Dependent Variable
  • Selection of the Independent Variables
  • Determining the Nature of Relationships

45
(No Transcript)