Title: Lecture 1: Thu, Sept 5
1Lecture 1 Thu, Sept 5
- Introduction/Syllabus (web page)
- Today's material
- Key Statistical Concepts
- Types of Data
- Pie and Bar Charts
- Histograms, Stem-and-Leaf Plots
- Scatter Plots
- Intro to JMP-IN (Xr 2.94 2.95)
- Homework Assignment
2Key Definitions
- Statistics: the art of data analysis. Involves
classifying, summarizing, organizing, and
interpreting numerical information.
- Population: the set of all items of interest in a
statistical problem.
- Sample: a subset of items in the population.
- Descriptive Statistics: a body of methods used to
summarize and organize the characteristics of
sample data.
- Inferential Statistics: a body of methods used to
draw inferences about characteristics of
populations based on sample data.
3
- Variable: a characteristic or property of an
individual item of a population or sample.
- Observation: the value assigned to a variable.
- Parameter: a descriptive measure of a population.
- Statistic: a descriptive measure of a sample.
- Statistical Inference: the process of making an
estimate, prediction, or decision about a
population based on information contained in a
sample.
- Measure of Reliability: a statement about the
degree of uncertainty.
4Example Cola Wars
- Cola wars is the popular term for the intense
competition between Coca-Cola and Pepsi displayed
in their marketing campaigns. Their campaigns
have featured movie and television stars, rock
videos, athletic endorsements, and claims of
consumer preference based on taste tests.
Suppose, as part of a Pepsi marketing campaign,
1,000 cola consumers are given a blind taste test
(i.e., a taste test in which the two brand names
are disguised). Each consumer is asked to state
their gender, age and a preference for brand A or
brand B.
5
- a. Describe the population.
- b. Describe the variables of interest.
- c. Describe the sample.
- d. Describe the inference about the taste
preference.
- e. Assume the cola preferences of 1,000 consumers
were indicated in a taste test. Describe how the
reliability of an inference concerning the
preferences of all cola consumers in the Pepsi
bottler's marketing region could be measured.
6Solutions
- a. Population of interest: the collection or set
of all cola consumers.
- b. Variables of interest: gender, age, and cola
preference.
- c. Sample: 1,000 cola consumers selected from the
population of all cola consumers.
- d. Inference of interest: generalization of the
cola preferences of the 1,000 sampled consumers
to the population of all cola consumers. In
particular, the preferences of the consumers in
the sample can be used to estimate the percentage
of all cola consumers who prefer each brand.
7
- e. When the preferences of 1,000 consumers are
used to estimate the preferences of all
consumers in the region, the estimate will not
exactly mirror the preferences of the population.
For example, if the taste test shows that 56% of
the 1,000 consumers chose Pepsi, it does not
follow (nor is it likely) that exactly 56% of all
cola drinkers in the region prefer Pepsi.
- Nevertheless, we can use sound statistical
reasoning (which is presented later in the
course) to ensure that our sampling procedure
will generate estimates that are almost certainly
within a specified limit of the true percentage
of all consumers who prefer Pepsi.
- For example, such reasoning might assure us
that the estimate of the preference for Pepsi
from the sample is almost certainly within 5% of
the actual population preference. The implication
is that the actual preference for Pepsi is
between 51% (i.e., 56 - 5) and 61% (i.e., 56 + 5),
that is, 56% ± 5%. This interval represents a
measure of reliability for the inference.
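The interval arithmetic in part e can be checked with a short Python sketch. The function name `reliability_interval` is hypothetical, and the count of 560 is only implied by the example (56% of 1,000). The margin of 5 is taken as given here; how such a margin is derived is covered later in the course.

```python
def reliability_interval(sample_percent, margin):
    """Return (lower, upper) bounds, in percent, for estimate +/- margin."""
    return sample_percent - margin, sample_percent + margin

n_sampled = 1000        # consumers in the taste test (from the example)
n_prefer_pepsi = 560    # assumption: 56% of 1,000, as implied by the example

# Sample percentage preferring Pepsi.
sample_percent = 100 * n_prefer_pepsi / n_sampled

# Margin of 5 is the figure quoted in the example, not something computed here.
lower, upper = reliability_interval(sample_percent, margin=5)
print(f"{sample_percent:.0f}% +/- 5% -> between {lower:.0f}% and {upper:.0f}%")
```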
8Types of Data (Chapter 2)
- Quantitative Data are obtained when the variable
being observed takes numerical values.
- Qualitative Data are obtained when the variable
being observed can only be categorized into
different groups (classes).
- Ranked Data are obtained when the variable is
categorized into different groups, but the groups
are ranked.
9Questions
- In the Cola Wars example, what type of data are
the variables of interest?
- Gender
- Age
- Cola preference
- Give one example of each type of data: numerical,
categorical, ranked.
10Types of data - examples
Interval data
Age   Income
55    75000
42    68000
.     .
Weight gain
10
5
.
Nominal data
Person   Marital status
1        married
2        single
3        single
.        .
Computer   Brand
1          IBM
2          Dell
3          IBM
.          .
11Types of data - examples
Interval data
Age   Income
55    75000
42    68000
.     .
Weight gain
10
5
.
Nominal data
With nominal data, all we can do is calculate
the proportion of data that falls into each
category.
Brand     Count   Percent
IBM       25      50%
Dell      11      22%
Compaq    8       16%
Other     6       12%
Total     50
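The proportion calculation for nominal data can be sketched in Python using the computer-brand counts from this slide. The variable names are illustrative only.

```python
# Brand counts as listed on the slide (total of 50 computers).
counts = {"IBM": 25, "Dell": 11, "Compaq": 8, "Other": 6}
total = sum(counts.values())

# With nominal data, the only summary is the share falling in each category.
percents = {brand: 100 * n / total for brand, n in counts.items()}

for brand, n in counts.items():
    print(f"{brand:<7} {n:>3} {percents[brand]:>5.0f}%")
```

The percentages come out to 50, 22, 16, and 12, matching the second row of figures on the slide.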
12Types of data analysis
- Knowing the type of data is necessary to properly
select the technique to be used when analyzing
data.
- Type of analysis allowed for each type of data:
- Interval data: arithmetic calculations
- Nominal data: counting the number of observations
in each category
- Ordinal data: computations based on an ordering
process
13Cross-Sectional/Time-Series Data
- Cross-sectional data are collected at a certain
point in time
- Marketing survey (observe preferences by gender,
age)
- Test score in a statistics course
- Starting salaries of an MBA program's graduates
- Time-series data are collected over successive
points in time
- Weekly closing price of gold
- Amount of crude oil imported monthly
14Graphical Techniques for Qualitative Data
- How to summarize? Count the number of times each
value of the data occurs and compute the
proportion of times it occurs.
- Pie Chart: a circle divided into a number of
slices that represent the various categories such
that the size of each slice is proportional to
the percentage corresponding to that category.
- Bar Chart: uses bars to represent the frequencies
(or relative frequencies) such that the height of
each bar equals the frequency or relative
frequency of each of the categories.
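As a rough illustration of the two definitions above, the Python sketch below computes the slice angle for a pie chart and prints a text bar chart. It reuses the computer-brand counts from the nominal data example. The text rendering is a stand-in for a real plotting package, not how JMP-IN draws charts.

```python
# Category counts taken from the computer-brand example earlier in the lecture.
counts = {"IBM": 25, "Dell": 11, "Compaq": 8, "Other": 6}
total = sum(counts.values())

# Pie chart: each slice's angle is proportional to the category's share.
angles = {cat: 360 * n / total for cat, n in counts.items()}

# Bar chart: each bar's height (here, length) equals the category's frequency.
for cat, n in counts.items():
    print(f"{cat:<7} {'#' * n} {n}  (pie slice: {angles[cat]:.1f} degrees)")
```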
15Turboprop Airplanes
- In 1994, a spate of small aircraft crashes made
the safety of turboprop airplanes an issue. As
part of an analysis of different types of
accidents, Airjet Ltd determined where accidents
occurred for both turboprop airplanes and jets in
the period 1984-1993. The data are stored using
the following format
16
- Results for turboprops are stored in column 1
(n = 260). Results for jets are stored in column 2
(n = 298).
- Identify the type of data stored in each column.
- Use two pie charts to summarize these data.
- Does it appear that turboprop airplanes and jets
have similar accident patterns?
18Graphical Techniques for Quantitative Data
- Frequency Distribution: a table that groups data
in non-overlapping intervals called classes and
records the number of observations (frequencies)
in each class.
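A frequency distribution can be built by assigning each observation to one class and counting. The Python sketch below assumes equal-width classes of the form [lower, lower + width). The function name and the data values are made up for illustration and are not the MBA salary data.

```python
def frequency_distribution(data, lower, width, n_classes):
    """Count observations in classes [lower + i*width, lower + (i+1)*width)."""
    freqs = [0] * n_classes
    for x in data:
        i = int((x - lower) // width)   # index of the class containing x
        if 0 <= i < n_classes:          # values outside all classes are skipped
            freqs[i] += 1
    return freqs

# Made-up observations (e.g. salary offers in thousands of dollars).
values = [52, 61, 58, 73, 66, 49, 55, 68, 62, 57, 71, 64]
freqs = frequency_distribution(values, lower=45, width=10, n_classes=3)

for i, f in enumerate(freqs):
    start = 45 + 10 * i
    print(f"{start} to under {start + 10}: {f}")
```

Because each class includes its lower limit and excludes its upper limit, the classes do not overlap and every observation is counted at most once.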
19
- Frequency Histogram: created by drawing
rectangles. The bases of the rectangles
correspond to the class intervals, and the height
of each rectangle equals the number of
observations in that class.
- Stem-and-Leaf Display: similar to a histogram, but
with each observation represented by a leaf (see
description on the next page).
- Ogive: the graphical representation of the
cumulative relative frequency distribution.
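The points plotted in an ogive are running totals of the class frequencies divided by the number of observations. A minimal Python sketch, using hypothetical class frequencies and class limits:

```python
from itertools import accumulate

# Hypothetical class frequencies and upper class limits.
freqs = [2, 6, 4]
upper_limits = [55, 65, 75]
n = sum(freqs)

# Running totals of the frequencies, then converted to proportions.
cumulative = list(accumulate(freqs))
cum_rel = [c / n for c in cumulative]

# Each (upper limit, cumulative relative frequency) pair is one ogive point.
for limit, p in zip(upper_limits, cum_rel):
    print(f"under {limit}: {p:.3f}")
```

The last cumulative relative frequency is always 1, since every observation falls at or below the final class.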
20Shapes of histograms
Symmetry
- There are four typical shape characteristics
21Shapes of histograms
Skewness
Negatively skewed
Positively skewed
22Modal classes
- A modal class is the one with the largest number
of observations.
- A unimodal histogram
The modal class
23Modal classes
A bimodal histogram
A modal class
A modal class
24Bell shaped histograms
- Many statistical techniques require that the
population be bell shaped.
- Drawing the histogram helps verify the shape of
the population in question.
25Example MBA Salaries
- The table contains the top salary offer (in
thousands of dollars) received by each member of
a sample of 50 MBA students who recently
graduated from the Graduate School of Management
at Rutgers, the state university of New Jersey.
26MBA Salary Data
27Frequency Distribution
28Histogram of MBA Salaries
29Shapes of Histograms
- Symmetric: a histogram which, if you draw a line
down the middle, looks identical on both sides
- Positively skewed: a histogram with a long tail
extending to the right
- Negatively skewed: a histogram with a long tail
extending to the left
- Bell-shaped: a histogram that looks like a bell
- Number of modal classes: the number of distinct
peaks in a histogram
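One rough way to find the modal class and count peaks from a list of class frequencies is sketched below in Python. The function names and frequencies are hypothetical. The peak rule used here (a class strictly taller than both neighbours) is a simplifying assumption: it ignores flat-topped peaks and will count small random bumps, so judging shape by eye from the histogram remains the primary method.

```python
def modal_class(freqs):
    """Index of the class with the largest frequency (first one if tied)."""
    return max(range(len(freqs)), key=lambda i: freqs[i])

def count_peaks(freqs):
    """Count classes strictly taller than both neighbours (ends compare to 0)."""
    padded = [0] + list(freqs) + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )

unimodal = [2, 6, 9, 5, 1]      # made-up frequencies with one peak
bimodal = [3, 8, 4, 2, 7, 3]    # made-up frequencies with two peaks

print(modal_class(unimodal), count_peaks(unimodal))
print(modal_class(bimodal), count_peaks(bimodal))
```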
30Stem-and-Leaf Plot
- Split each datum into stem and leaf
- Stem: the first part of the number
- Leaf: the last digit of the number
- Examples
- ?
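The stem and leaf split can be sketched in Python as follows, assuming non-negative whole numbers so that the leaf is the last digit and the stem is everything before it. The function name and the data are made up for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted leaves (last digits)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 58 -> stem 5, leaf 8
        plot[stem].append(leaf)
    return dict(plot)

# Made-up observations.
data = [52, 61, 58, 73, 66, 49, 55, 68, 62, 57, 71, 64]
plot = stem_and_leaf(data)

for stem in sorted(plot):
    leaves = "".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
```

For these values the display has four rows: `4 | 9`, `5 | 2578`, `6 | 12468`, and `7 | 13`. Stems with no observations are simply omitted in this sketch.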
31Stem-and-Leaf Example 2
32Histogram and Stem-and-Leaf
33Cumulative Frequency Distribution
34Histogram and Ogive Plot
35Example Production
- In order to estimate how long it will take to
produce a particular product, a manufacturer will
study the relationship between production time
per unit and the number of units that have
been produced. The line or curve characterizing
this relationship is called a learning curve
(Adler and Clark, Management Science, Mar 1991).
- Twenty-five employees, all of whom were
performing the same production task for the 10th
time, were observed. Each person's task
completion time (in minutes) was recorded. The
same 25 employees were observed again the 30th
time they performed the same task and the 50th
time they performed the task. The resulting
completion times are shown in the table below.
36
- Use a statistical software package to construct a
frequency histogram for each of the three data
sets.
- Compare the histograms. Does it appear that the
relationship between task completion time and the
number of times the task is performed is in
agreement with the observations noted above about
production processes in general? Explain.
38Graphical Techniques for 2 Quantitative Variables
- Scatter Plot
- Graphical method to describe the relationship
between two quantitative variables
- Two-dimensional plot, with one variable's values
plotted along the vertical axis and the other's
along the horizontal axis.
39Typical Patterns of Scatter Diagrams
Negative linear relationship
Positive linear relationship
No relationship
Negative nonlinear relationship
Nonlinear (concave) relationship
This is a weak linear relationship. A nonlinear
relationship seems to fit the data better.
40House Sales and Mortgage Levels
- The economics department of a national investment
banking firm is conducting a study to determine
how house sales are related to mortgage rate
levels. The number of houses sold
and the average monthly mortgage rate for 36
months were recorded.
41
- a. Draw a scatter diagram for these data with
number of houses sold on the vertical axis.
- b. Describe the relationship between mortgage
rates and number of homes sold.
42Graphing the Relationship Between Two Nominal
Variables
- We create a contingency table.
- This table lists the frequency for each
combination of values of the two variables.
- We can create a bar chart that represents the
frequency of occurrence of each combination of
values.
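A contingency table is a count of how often each pair of values occurs. The Python sketch below tallies pairs with `collections.Counter`. The (occupation, newspaper) observations are invented for illustration and are not the data from Example 2.8.

```python
from collections import Counter

# Invented (occupation, newspaper) pairs, one per respondent.
observations = [
    ("blue-collar", "Star"), ("blue-collar", "Sun"), ("blue-collar", "Star"),
    ("white-collar", "Post"), ("white-collar", "Globe and Mail"),
    ("professional", "Globe and Mail"), ("professional", "Post"),
    ("white-collar", "Post"),
]

# Frequency of each combination of the two nominal variables.
table = Counter(observations)

rows = sorted({occupation for occupation, _ in observations})
cols = sorted({paper for _, paper in observations})

# Print the table: one row per occupation, one column per newspaper.
print(f"{'':<14}" + "".join(f"{c:>16}" for c in cols))
for r in rows:
    print(f"{r:<14}" + "".join(f"{table[(r, c)]:>16}" for c in cols))
```

Combinations that never occur show up as 0, because a `Counter` returns 0 for missing keys. Each row of the printed table gives the bar heights for one occupation's bar chart.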
43Contingency table
- Example 2.8
- To conduct an efficient advertising campaign,
the relationship between occupation and
newspaper readership is studied. The following
table was created.
44Contingency table
- Solution
- If there is no relationship between occupation
and newspaper read, the bar charts describing the
frequency of readership of newspapers should look
similar across occupations.
45Bar charts for a contingency table
Blue-collar workers prefer the Star and the Sun.
White-collar workers and professionals mostly
read the Post and the Globe and Mail.
46 2.6 Describing Time-Series Data
- Data can be classified according to the time they
are collected.
- Cross-sectional data are all collected at the
same time.
- Time-series data are collected at successive
points in time.
- Time-series data are often depicted on a line
chart (a plot of the variable over time).
47Line Chart
- Example 2.9
- The total amounts of income tax paid by
individuals in 1987 through 1999 are listed
below.
- Draw a graph of these data and describe the
information produced.
48Line Chart
For the first five years, total tax was
relatively flat. From 1993 there was a rapid
increase in tax revenues.
Line charts can be used to describe nominal data
time series.
49Homework Assignment 1
- Due next Thursday, Sept 19, at the start of
class.
- Full assignment will be posted on the Stat 101
web page this Thu at 5pm.