Title: ??9:10?12:00 A211?
1?????
- ? ?
- ???????
- ??910?1200 A211?
- hchen_at_math.ntu.edu.tw
2????
- ????,????????(?????2?)
- ???????
- ??????
- ?????????????
- ????(?????7?)
- ??????????
- ?????
- ?????
- ?????(?????8?)
- ?????(Principal Component Analysis)
- ????(Factor Analysis)
- ?????(Discriminant Analysis)
- ?????(Cluster Analysis)
- ??????(Canonical Correlation Analysis)
3- ???
- ??
- ????
- R(??????)
- R has a home page at http://www.r-project.org/
- Download
- ??????
- ???(30)?projects(70)
4? ?
- ??
- Exploratory Data Analysis Decision Making
- Data Mining
- Data Collection ?????
- ????
- R Software
- ????,????????
- Probability and Random Variables
- Variance
- ????
- Association
- IntroRegression
- MultipleRegression
- DAonREgression
5? ?
- ?????
- ?????(Principal Component Analysis)
- ????(Factor Analysis)
- ?????(Discriminant Analysis)
- ?????(Cluster Analysis)
- ??????(Canonical Correlation Analysis)
6Statistics for Decision Making
- Describing Sets of Data
- Objective: Introduce numerical methods and graphical displays to summarize data sets.
- Graphical and numerical tools
- for examining the distribution of a single variable,
- for comparing several distributions, and
- for investigating changes over time.
- Sampling and Statistical Inference
- Objective: Provide methods to infer about a population based on a sample of observations drawn from that population.
- Forecasting with Distinguishable Data
- Objective: Introduce the basic concepts of forecasting to motivate a regression model.
- Method for studying relationships among several variables.
- Regression Coefficients and Forecasts
- Objective: Understand regression coefficients and how to use them for forecasting.
7Statistics for Decision Making
- Measures of Goodness of Fit and Residual Analysis
- Objective: Introduce a few statistics that measure how well a regression model fits the data, and show how to use residual analysis to detect inadequacies of a regression model.
- Developing a Regression Model
- Objective: Demonstrate how to develop a useful regression model through:
- Selection of the Dependent Variable
- Selection of the Independent Variables
- Determining the Nature of Relationships
8Sampling and Statistical Inference
- Objective: Provide methods to infer about a population based on a sample of observations drawn from that population.
- Inference from a Sample
- Statistical Estimation
- From Margin of Error to Confidence Interval
- Test of Significance
9Inference from a Sample
- The sample provides useful information, but the information is imperfect.
- Samples are taken when it is impossible, impractical, or too expensive to obtain complete data on the relevant population.
- Example: Suppose you ask 100 potential customers how much they will spend on a proposed new product next year.
- From the 100 responses you obtain a sample average of 250. You could make the following inferences:
- My best estimate of average sales per potential customer is 250.
- Average sales per potential customer will be between 210 and 290, with 95% confidence.
- Average sales per potential customer will be greater than the break-even amount of 210 at a 2.5% level of significance.
- Law of Large Numbers
- Draw independent observations at random from any population with finite mean $\mu$.
- As the number of observations drawn increases, the mean of the observed values eventually approaches the population mean $\mu$ as closely as you specify and then stays that close.
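The Law of Large Numbers can be watched at work in a short simulation. The sketch below is only an illustration: the fair die (population mean 3.5), the seed, and the name `running_means` are assumptions chosen for the example, not part of the course material.

```python
import random


def running_means(n_draws, seed=0):
    """Running mean of n_draws rolls of a fair die (population mean 3.5)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0
    means = []
    for i in range(1, n_draws + 1):
        total += rng.randint(1, 6)  # one independent observation
        means.append(total / i)     # mean of the first i observations
    return means


means = running_means(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n = {n:>6}: running mean = {means[n - 1]:.4f}")
```

The early running means wander, while the later ones settle near 3.5, which is the behavior the law describes.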
10Sampling variability
- Parameter: $p$, the proportion of the adult population in the US (190 million) that finds clothes shopping frustrating.
- Statistic: 66%, or 1650 out of 2500 adults.
- Sampling variability: The value of a statistic varies in repeated random sampling.
- To answer "What would happen if we took many samples?":
- Take a large number of samples from the same population.
- Calculate the sample proportion $\hat{p}$ for each sample.
- Make a histogram of the values of $\hat{p}$.
- Examine the distribution displayed in the histogram.
- We can imitate the chance behavior of many samples by using random digits or a computer (simulation).
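The simulation steps above can be sketched in a few lines. This is a minimal illustration, assuming $p = 0.6$, samples of size 100, and 1000 repetitions (the setting used on the following slides). Each respondent is drawn independently, which approximates an SRS from a very large population. The function names and the text histogram are hypothetical helpers.

```python
import math
import random
from collections import Counter


def simulate_sample_proportions(p, n, n_samples, seed=0):
    """Draw n_samples samples of size n and return each sample proportion."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n
            for _ in range(n_samples)]


def text_histogram(values, bin_width=0.02, scale=5):
    """Print a rough histogram: one '#' per `scale` samples in each bin."""
    bins = Counter(math.floor(v / bin_width + 1e-9) for v in values)
    for b in sorted(bins):
        print(f"{b * bin_width:.2f} | {'#' * (bins[b] // scale)}")


p_hats = simulate_sample_proportions(p=0.6, n=100, n_samples=1000)
text_histogram(p_hats)
```

The printed histogram is centered near 0.6 and roughly bell-shaped, which is the sampling distribution the slides go on to describe.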
11Sampling variability
- The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population.
- It can be either
- approximated by simulation or
- obtained exactly by probability theory in statistics.
- Example: 1000 SRSs of size 100 when $p = 0.6$.
12 1000 SRSs of size 100 and 2500 when $p = 0.6$
13Bias and variance
- A statistic is unbiased if the mean of its sampling distribution is equal to the true value of the parameter being estimated ("no favoritism").
- The variability of a statistic is described by the spread of its sampling distribution.
- 95% of the sample proportions will lie in the range $0.6 \pm 0.1$ ($n = 100$) or $0.6 \pm 0.02$ ($n = 2500$).
- Larger samples have smaller spreads.
- As long as the population is much larger than the sample, the spread of the sampling distribution for a sample of fixed size $n$ is approximately the same for any population size.
- An SRS of size 2500 from 270 million US residents gives results as precise as an SRS of size 2500 from 740,000 inhabitants of SFO!
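The two ranges quoted above can be checked with the usual formula for the standard deviation of a sample proportion, $\sqrt{p(1-p)/n}$. The formula is not shown on the slide and is brought in here as a standard result. The sketch uses two standard deviations as the approximate 95% range, and the function name is a hypothetical helper.

```python
import math


def two_sd_margin(p, n):
    """Two standard deviations of the sample proportion for sample size n."""
    return 2 * math.sqrt(p * (1 - p) / n)


for n in (100, 2500):
    print(f"n = {n:>4}: 0.6 +/- {two_sd_margin(0.6, n):.3f}")
```

The margins come out close to 0.1 and 0.02. Note that the population size does not appear in the formula, which is why only the sample size matters when the population is much larger than the sample.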
15Why randomize?
- The act of randomizing guarantees that the results of analyzing our data are subject to the laws of probability.
- Randomization removes bias.
- Replication (a bigger sample) reduces variance.
- Better answer to "What would happen if the sample or the experiment were repeated many times?"
- Caution: The sampling distribution does not reflect bias due to under-coverage, non-response, lack of realism, etc.
16Presidential Election and Poll
17??1936???????
- ??????????????????????????????
- ??????????????
- ??????,?1929??1933?????????????
- ??????????????????The spender must go?
- ???????????????? (deficit financing)????Balance
the budget of the American people first? - ????????????????????
- ???Literary Digest????????57?43?????
- ?????????????????????
- ????1916??,????????????????
- ????????62?38?????????
- ?????-???-???
- ??Literary Digest??????????????,????????,???????56
?44????? - ?????????????,??????????56?44?????
18Digest???????
- ?????????????,????????,????????????????????
- ????????????????,????????
- ???????
- ????Digest?????????????,???????????????
- ??????????
- ????????,?????????????,???20???,????????????,????
????????????????
19??????????????????
- ????16??????393??????????????,
- ???1033???????
- ????????,???????????????????,?????????????????,???
??16??????????????(????)??? - ?????????,?????????????????
20Digest???????
- ?????????????,????????,????????????????????
- (???????????????????????)?
- ???????
- ????Digest?????????????,???????????????
- ??????????
- ????????,?????????????,???20???,????????????,????
????????????????
21Statistical Estimation
- A parameter is a number that describes the population.
- Its value is fixed but unknown.
- A statistic is a number that describes a sample.
- Its value is known for a sample, but it can change from sample to sample.
- We use a statistic to estimate an unknown parameter.
- Error of estimation is the difference between an estimate and the estimated parameter.
- When estimating the population mean using the sample mean,
- Error of Estimation = sample mean - population mean
- The distribution of the Error of Estimation: Central Limit Theorem
- If the sample size is large, the error of estimation is approximately normally distributed with mean zero and a standard deviation which can be estimated by
- Standard Error = sample standard deviation / $\sqrt{\text{sample size}}$
- The Normal Distribution
- If $X$ has the $N(\mu, \sigma^2)$ distribution, then $Z = (X - \mu)/\sigma$ has the $N(0, 1)$ distribution.
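As a small worked example of the standard error and of standardizing, the sketch below uses the customer survey numbers: sample size 100, sample average 250, and a sample standard deviation of 200. Treating the standard deviation of 200 (quoted on the Test of Significance slide) as belonging to this same sample is an assumption, and the function names are hypothetical.

```python
import math


def standard_error(sample_sd, n):
    """Estimated standard deviation of the error of estimation."""
    return sample_sd / math.sqrt(n)


def standardize(x, mu, sigma):
    """Convert x from N(mu, sigma^2) to the N(0, 1) scale."""
    return (x - mu) / sigma


sample_mean, sample_sd, n = 250, 200, 100   # assumed to describe one sample
se = standard_error(sample_sd, n)           # 200 / 10 = 20
low, high = sample_mean - 2 * se, sample_mean + 2 * se
print(f"standard error = {se}, interval = ({low}, {high})")
print("an error of 40 is", standardize(40, 0, se), "standard errors")
```

With these inputs the interval of two standard errors on each side of the sample mean is 210 to 290, matching the interval quoted in the customer example.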
22The normal density
- The height of the normal density curve for the normal distribution with mean $\mu$ and SD $\sigma$ is given by $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$.
- Why are normal distributions important?
- They are good descriptions for some distributions of real data (e.g., test scores, repeated measurements, characteristics of biological populations).
- They are good approximations to the results of many kinds of chance outcomes (e.g., coin tossing).
- Many statistical inference procedures based on normal distributions work well for other roughly symmetric distributions.
23From Margin of Error to Confidence Interval
- What is the probability that the error of estimation exceeds two standard errors?
- If we add two standard errors to our estimate as the margin of error, what can we say about the resulting interval estimate?
- Confidence and Probability
- When reporting that a confidence interval for a population mean extends from 210 to 290, it is tempting to slip into the language of probability and say there is only a 5% chance that the true mean of the population is outside this interval.
- Such a probabilistic interpretation is much more natural and appealing than the rather convoluted interpretation above. But is it legitimate?
- Example
- Suppose from a sample of 100 potential customers one market researcher obtained a 95% confidence interval of (190, 210) for the average amount a potential customer will spend on a product next year.
- Another market researcher from a different sample of size 400 obtained a 95% confidence interval of (215, 225).
- How do you reconcile these two results?
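One way to explore the first two questions is to repeat the sampling many times and count how often the interval "sample mean plus or minus two standard errors" covers the true mean. The sketch below does that for a hypothetical normal population. The population mean of 250, the SD of 200, the seed, and the function name are all assumptions made for illustration.

```python
import math
import random
import statistics


def coverage_of_two_se_interval(pop_mean, pop_sd, n, n_repeats, seed=0):
    """Share of repeated samples whose mean +/- 2 SE interval covers pop_mean."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(n_repeats):
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        mean = statistics.fmean(sample)
        se = statistics.stdev(sample) / math.sqrt(n)
        if mean - 2 * se <= pop_mean <= mean + 2 * se:
            covered += 1
    return covered / n_repeats


# Exact normal probability that the error exceeds two standard errors.
tail_probability = 1 - math.erf(2 / math.sqrt(2))

coverage = coverage_of_two_se_interval(pop_mean=250, pop_sd=200,
                                       n=100, n_repeats=2000)
print(f"P(|Z| > 2) = {tail_probability:.4f}")
print(f"share of intervals covering the true mean = {coverage:.3f}")
```

About 95% of the intervals cover the true mean. The 95% describes the long-run behavior of the procedure over repeated samples, not a probability statement about any single computed interval, so two researchers can report intervals that do not overlap.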
24Test of Significance
- Example 1: A market researcher asked a sample of 100 potential customers how much they plan to spend on a product next year.
- The mean of the sample turned out to be 250 and the standard deviation 200.
- Is it likely that average sales per capita exceed a break-even level of 208?
- Example 2: Suppose a manager is trying to decide which of two new products, A or B, to introduce. Break-even sales per capita are 208 for both A and B.
- Sample results are given below.
- Product A: sample size 10,000, sample mean = 211, sample SD = 100
- Product B: sample size 100, sample mean = 250, sample SD = 300
- Example 3: In a Business Week/Harris executive poll, senior executives were asked, "Compared with the last 12 months, do you think the rate of growth of the gross domestic product will go up, go down, or stay the same for the next 12 months?"
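For Example 2, a one-sided z test against the break-even level of 208 makes the comparison concrete. The sketch below is one reasonable way to set it up, not necessarily the procedure used in class. It relies on the normal approximation for the sample mean, and the function names are hypothetical.

```python
import math


def z_statistic(sample_mean, sample_sd, n, break_even):
    """Number of standard errors by which the sample mean exceeds break-even."""
    return (sample_mean - break_even) / (sample_sd / math.sqrt(n))


def upper_tail_p(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# (sample mean, sample SD, sample size) as given in Example 2
products = {"A": (211, 100, 10_000), "B": (250, 300, 100)}
results = {}
for name, (mean, sd, n) in products.items():
    z = z_statistic(mean, sd, n, break_even=208)
    results[name] = (z, upper_tail_p(z))
    print(f"Product {name}: z = {z:.2f}, one-sided p-value = {results[name][1]:.4f}")
```

Product A exceeds break-even by only 3 but has z = 3.0, while Product B exceeds it by 42 with z = 1.4. Statistical significance and the size of the estimated gain are different questions, and the manager needs both.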
25Test for Independence
- Application: Business outlook
- Results of this poll are summarized below (Business Week, 1/09/95).

                    Date of Survey
Outlook           12/94    6/94   12/93   Total
Go Up               152     177     101     430
Go Down             104      72      36     212
Stay the Same       144     152     261     557
Not Sure              0       0       4       4
Total               400     401     402    1203

- Have the executives changed their outlook over time?
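A common way to approach this question is the Pearson chi-square statistic for a two-way table. The slide does not name the test, so using it here is an assumption. The sketch computes expected counts from the row and column totals and compares the statistic with 12.59, the 5% critical value for 6 degrees of freedom.

```python
# Counts from the poll, columns in the order 12/94, 6/94, 12/93.
table = {
    "Go Up": [152, 177, 101],
    "Go Down": [104, 72, 36],
    "Stay the Same": [144, 152, 261],
    "Not Sure": [0, 0, 4],
}


def chi_square_statistic(rows):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    grand_total = sum(row_totals)
    stat = 0.0
    for row, row_total in zip(rows, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand_total
            stat += (observed - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, df


stat, df = chi_square_statistic(list(table.values()))
print(f"chi-square = {stat:.1f} on {df} degrees of freedom")
print("exceeds the 5% critical value of 12.59:", stat > 12.59)
```

The "Not Sure" row has very small expected counts, so the chi-square approximation is rough for those cells. Dropping or merging that row would be a sensible check.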
26Relations in categorical data
- Relationship between two or more categorical variables.
- Use counts (frequencies) or percents (relative frequencies) of individuals that fall into various categories.
- Two-way table
- A two-way table describes two categorical variables.
- Each horizontal row in the table describes individuals with one level of the row variable.
- Each vertical column describes individuals with one level of the column variable.
- Example: Years of school completed, by age (thousands of persons)
27Marginal distributions
- Look at the distribution of each variable separately.
- The Total column lists the totals for each of the rows (row totals); similarly, the Total row lists the column totals.
- Row and column totals specify the marginal distributions of each of the two categorical variables.
- The distribution of years of schooling completed among people aged 25 years and over
28Describing relationships
- What percent of people aged 25 to 34 have completed 4 years of college?
- What percent of people aged 35 to 54 have completed 4 years of college?
- What percent of people aged 55 and over have completed 4 years of college?
- Conclusion?
29Conditional distribution of age group on the education level
30Three way table
- The table of outcome by hospital by patient condition is a three-way table that reports the frequencies of each combination of levels of three categorical variables.
- We can aggregate a three-way table into a two-way table.
- A variable being aggregated can become a lurking variable.
31NSF study on the salaries of new women engineers
- The median salary of newly graduated female engineers and scientists was 73% of that for males.
- Field is a lurking variable (life and social sciences versus physical sciences and engineering).
32Establishing causation
- The best (and only?) method of establishing causation is to conduct a carefully designed experiment in which the effects of possible lurking variables are controlled.
- What other criteria can we use when we can't do an experiment?
33Smoking causes lung cancer
- The association is strong.
- The association is consistent.
- Higher doses are associated with stronger responses.
- The alleged cause precedes the effect in time.
- The alleged cause is plausible.
34Forecasting with Distinguishable Data
- Objective: Introduce the basic concepts of forecasting to motivate a regression model.
- Forecasting with Indistinguishable Data
- If the future value of the variable you would like to forecast is indistinguishable from the sample values you collected, then you forecast with indistinguishable data.
- Example 1: To help forecast the selling price of your house, you obtained a sample of ten selling prices: 109,360; 137,980; 131,230; 130,230; 125,410; 124,370; 139,030; 140,160; 144,220; 154,190.
- Forecasting when the Data are Distinguishable
- When your sample contains additional information so that the sample values are no longer indistinguishable from the future value you would like to forecast, you forecast with distinguishable data.
- Example 2: Our sample also contains information on the square footage of the ten houses, given here as (selling price, square feet): (109,360, 1404), (137,980, 1477), (131,230, 1503), (130,230, 1552), (125,410, 1608), (124,370, 1633), (139,030, 1717), (140,160, 1775), (144,220, 1838), (154,190, 1934).
35Forecasting with Distinguishable Data
- Assume that your house has 1682 square feet of living area.
- Analysis 1: The sample average of all ten houses is 133,618 (SD = 12,406).
- Analysis 2: Stratify the sample according to house size (square feet of living area).

Size Range    Sample Average    SD        Number of Observations
1400-1599     127,200           12,381    4
1600-1799     132,243            8,513    4
1800-1999     149,205            7,050    2

- Then use 132,243 (instead of 133,618) to forecast the selling value.
- Does the cell standard deviation properly measure the forecast uncertainty?
- Is it possible to have a measure of the overall efficacy of partitioning the sample into cells?
- Use the data more efficiently: The stratification method that we used is unsatisfactory for two reasons. First, we have ignored data on houses that are similar to, but not most like, yours. Second, we have stratified the data somewhat arbitrarily.
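Both analyses can be reproduced directly from the ten (selling price, square feet) pairs. The sketch below is a minimal illustration using 200-square-foot cells starting at 1400, as in the table. The helper name `stratum_of` is hypothetical, and the standard deviations are sample standard deviations (divisor n - 1).

```python
import statistics

# (selling price, square feet) for the ten houses in the sample
houses = [
    (109360, 1404), (137980, 1477), (131230, 1503), (130230, 1552),
    (125410, 1608), (124370, 1633), (139030, 1717), (140160, 1775),
    (144220, 1838), (154190, 1934),
]


def stratum_of(sqft, start=1400, width=200):
    """Return the (low, high) size range that contains sqft."""
    low = start + ((sqft - start) // width) * width
    return (low, low + width - 1)


# Analysis 1: ignore size and use all ten prices.
prices = [price for price, _ in houses]
overall = (statistics.mean(prices), statistics.stdev(prices))

# Analysis 2: group prices by size range, then summarize each cell.
cells = {}
for price, sqft in houses:
    cells.setdefault(stratum_of(sqft), []).append(price)
summary = {rng: (statistics.mean(p), statistics.stdev(p), len(p))
           for rng, p in cells.items()}

print(f"all houses: mean = {overall[0]:.0f}, SD = {overall[1]:.0f}")
for (low, high), (mean, sd, count) in sorted(summary.items()):
    print(f"{low}-{high}: mean = {mean:.0f}, SD = {sd:.0f}, n = {count}")

forecast = summary[stratum_of(1682)][0]  # cell average for a 1682 sq ft house
print(f"stratified forecast for 1682 sq ft: {forecast:.0f}")
```

Changing `start` or `width` changes the cells and therefore the forecast, which makes the arbitrariness of the stratification easy to see.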
36The question of causation
- Mother's adult height vs. daughter's adult height.
- Amount of saccharin in a rat's diet vs. count of tumors in the rat's bladder.
- A student's SAT score and the student's first-year GPA.
- Monthly flow of money into stock mutual funds vs. monthly rate of return for the stock market.
- The anesthetic used in surgery vs. whether the patient survives the surgery.
- The number of years of education a worker has vs. the worker's income.
37Explaining association
- Causation.
- Common response (a lurking variable).
- Confounding: Two variables are confounded when their effects on a response variable are mixed together.
38Data on the survival of patients after surgery in hospital A and B
- Hospital A loses 3% of patients while Hospital B loses 2%.
39Lurking variable...
- Hospital A vs. Hospital B: 1% vs. 1.3% for patients in good condition
- Hospital A vs. Hospital B: 3.8% vs. 4% for patients in bad condition
40Simpson's paradox
- How can A do better in each group, yet do worse overall?
- An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group.
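The reversal can be reproduced with a small three-way table. The counts below are hypothetical: they were chosen only so that the death rates match the percentages quoted on the hospital slides (3% vs. 2% overall, 1% vs. 1.3% in good condition, 3.8% vs. 4% in bad condition), and are not taken from the original data.

```python
# Hypothetical (deaths, patients) counts chosen to match the quoted rates.
counts = {
    "A": {"good": (6, 600), "bad": (57, 1500)},
    "B": {"good": (8, 600), "bad": (8, 200)},
}


def death_rate(deaths, patients):
    return deaths / patients


def overall_rate(hospital):
    """Aggregate over patient condition, which turns it into a lurking variable."""
    deaths = sum(d for d, _ in counts[hospital].values())
    patients = sum(n for _, n in counts[hospital].values())
    return deaths / patients


for condition in ("good", "bad"):
    a = death_rate(*counts["A"][condition])
    b = death_rate(*counts["B"][condition])
    print(f"{condition} condition: A = {a:.1%}, B = {b:.1%}")
print(f"overall: A = {overall_rate('A'):.1%}, B = {overall_rate('B'):.1%}")
```

In these counts Hospital A treats far more patients in bad condition, so its overall rate is pulled toward the higher bad-condition rate even though it does better within each group.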
41Regression Model
- Try to create a model that specifies the relationship between selling price (dependent variable) and other variables (independent or explanatory variables) that help you forecast the selling price.
- It is reasonable to assume that as size goes up, selling price will go up on average.
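A straight-line model of selling price on size can be fitted to the ten houses by ordinary least squares. The sketch below is one simple way to do it by hand and assumes a linear relationship. The function name is hypothetical, and the forecast is for the 1682-square-foot house from the earlier slide.

```python
# (selling price, square feet) for the ten houses in the sample
houses = [
    (109360, 1404), (137980, 1477), (131230, 1503), (130230, 1552),
    (125410, 1608), (124370, 1633), (139030, 1717), (140160, 1775),
    (144220, 1838), (154190, 1934),
]


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


sizes = [sqft for _, sqft in houses]
prices = [price for price, _ in houses]
slope, intercept = fit_line(sizes, prices)
forecast = intercept + slope * 1682

print(f"price = {intercept:.0f} + {slope:.2f} * size")
print(f"forecast for 1682 sq ft: {forecast:.0f}")
```

Unlike the stratified average, the fitted line uses all ten houses and does not depend on where the cell boundaries are drawn.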
42Regression Coefficients and Forecasts
- Objective: Understand regression coefficients and how to use them for forecasting.
43Measures of Goodness of Fit and Residual Analysis
- Objective: Introduce a few statistics that measure how well a regression model fits the data, and show how to use residual analysis to detect inadequacies of a regression model.
44Developing a Regression Model
- Objective: Demonstrate how to develop a useful regression model through:
- Selection of the Dependent Variable
- Selection of the Independent Variables
- Determining the Nature of Relationships