Title: Statistical Review
1Statistical Review
- Measures of Central Location
2SALARY.XLS
- Lists starting salaries for 190 graduates from an
undergraduate school of business. - The data is in the range named Salary on a sheet
called Data.
3The Mean
- We calculate the mean salary by entering the
formula AVERAGE(Salary) in cell B6 of the
Excel Functions worksheet. - The mean salary is 29,762.
- The mean in this example is a representative
measure because the distribution of salaries is
nearly symmetric. - The mean can be misleading due to skewness.
4The Median
- The median is the middle observation when the
data are listed from smallest to largest. - If there is an odd number of observations, the
median is the middle observation. - If there is an even number of observations, we
take the median to be the average of the two
middle observations.
5The Median -- continued
- We calculate the median salary in Example 3.1 by
entering the formula MEDIAN(Salary) in cell B7
of the Excel Functions worksheet. - The median in this example is 29,850.
- In this case, the mean and the median values are
nearly the same because the distribution is
approximately symmetric.
6The Median -- continued
- If the salary distribution were skewed (for
example, a few graduates received abnormally
large salaries), the mean would be biased upward
while the median would not be affected by the
unusual values. - Thus, it is better to use the median in
characterizing the center of a distribution when
that distribution is skewed.
7The Mode
- The mode is the most frequently occurring value.
- If the values are essentially continuous, as with
the salaries in Example 3.1, then the mode is
essentially irrelevant. There is typically no
single value that occurs more than once. - Thus, the mode is not likely to provide much
information.
8 9SHOES.XLS
- This file lists shoe sizes purchased at a shoe
store. - We seek to find the best-selling shoe size at
this store.
10The Mode
- The mode is the most frequently occurring value.
- If the values are essentially continuous, then
the mode is usually not relevant. There is
typically no single value that occurs more than
once.
11Why is the mode relevant here?
- Shoe sizes come in discrete increments, rather
than a continuum so it makes sense to find the
mode, the size that is requested most often, in
this example.
12Computing the Mode in EXCEL
- The mode can be found in Excel by entering the
formula MODE(Range of Data). - Applying this formula in the present example, we
find that size 11 is the most frequently
purchased shoe size. - This is also apparent form the histogram on the
next slide.
13Distribution of Shoe Sizes
14- Measures of Variability Variance and Standard
Deviation
15OTIS4.XLS
- Suppose that Otis Elevator is going to stop
manufacturing elevator rails. Instead, it is
going to buy them from an outside supplier. - Otis would like each rail to have a diameter of 1
inch. - The company has obtained samples of ten elevator
rails from each supplier. They are listed in
columns A and B of this Excel file.
16Which should Otis prefer?
- Observe that the mean, median, and mode are all
exactly 1 inch for each of the two suppliers. - Based on these measures, the two suppliers are
equally good and right on the mark. However, we
when we consider measures of variability,
supplier 1 is somewhat better than supplier 2.
Why?
17Explanation
- The reason is that supplier 2s rails exhibit
more variability about the mean than do supplier
1s rails. - If we want rails to have a diameter of 1 inch,
then more variability around the mean is very
undesirable!
18Variance
- The most commonly used measures of variability
are the variance and standard deviation. - The variance is essentially the average of the
squared deviations from the mean. - We say essentially because there are two
versions of the variance the population variance
and the sample variance.
19More on the Variance
- The variance tends to increase when there is more
variability around the mean. - Indeed, large deviations from the mean contribute
heavily to the variance because they are squared. - One consequence of this is that the variance is
expressed in squared units (squared dollars, for
example) rather than original units.
20Standard Deviation
- A more intuitive measure of variability is the
standard deviation. - The standard deviation is defined to be the
square root of the variance. - Thus, the standard deviation is measured in
original units, such as dollars, and it is much
easier to interpret.
21Computing Variance and Standard Deviation in Excel
- Excel has built-in functions for computing these
measures of variability. - The sample variances and standard deviations of
the rail diameters from the suppliers in the
present example can be found by entering the
following formulas VAR(Supplier1) in cell E8
and STDEV(Supplier1) in cell E9.
22Computing Variances Standard Deviations --
continued
- Of course, enter similar formulas for supplier 2
in cells F8 and F9. - As we mentioned earlier, it is difficult to
interpret the variances numerically because they
are expressed in squared inches, not inches. - All we can say is that the variance from supplier
2 is considerably larger than the variance from
supplier 1.
23Interpretation of the Standard Deviation
- The standard deviations, on the other hand, are
expressed in inches. The standard deviation for
supplier 1 is approximately 0.012 inch, and
supplier 2s standard deviation is approximately
three times this large. - This is quite a disparity. Hence, Otis will
prefer to buy rails from supplier 1.
24- Interpretation of the Standard Deviation Rules
of Thumb
25DOW.XLS
- This file contains monthly closing prices for the
Dow Jones Index from January 1947 through January
1993. - The monthly returns from the index are also shown
starting with February 1947. Each return is the
monthly percentage change (expressed) as a
decimal) in the index. - How well do the rules of thumb work for these
data?
26Rules of Thumb
- Many data sets follow rules of thumb.
- Approximately 68 of the observations are within
one standard deviation of the mean. - Approximately 95 of the observations are within
two standard deviations of the mean. - Approximately 99.7 - almost all - of the
observations are within three standard deviations
of the mean.
27Index Time Series Plot
- A time series plot of the index show that the
index has been increasingly fairly steadily over
the period. - Whenever a series indicates a clear trend such as
the index does, most of the measures we have been
discussing are less relevant. - For example, the mean of the index for this
period has at most historical interest. We are
probably more interested in predicting the future
of the Dow, and the historical mean has little
relevance for predicting the future.
28Time Series Plot of Dow Closing Index
29Time Series Plot of Dow Returns
30Return Time Series Plot
- A time series plot of the returns show no obvious
trend over the period. - The measures we have been discussing are relevant
in discussing the series of returns, which
fluctuate around a stable mean. - We first calculate the mean and standard
deviation of the returns by using the Excel
functions AVERAGE and STDEV in cells B4 and B5.
See the table on the next slide.
31Rules of Thumb for Dow Jones Data
32Returns -- continued
- The average return is 0.59 and the standard
deviation of about 3.37. - Therefore, the rules of thumb (if they apply)
imply, for example, that about 2/3 of all returns
are within the interval 0.59 3.37, that is
from -2.78 to 3.95. - In order to determine if the rules of thumb apply
to these returns, we can use a frequency table.
33Creating the Frequency Table
- We first enter the upper limits of the suitable
categories in the range A8A15. - Any categories can be chosen but it is convenient
to choose categories in which each breakpoint is
one standard deviation higher than the previous
one with the open-ended categories on either end
are more than 3 standard deviations from the
mean. - Next we use the FREQUENCY function to fill in
column C. FREQUENCY(Returns,Bins)
34Frequency Table continued
- Finally, we use the frequencies in column C to
calculate the actual percentage of return within
k standard deviations of the mean for k1, k2
and k3 and we compare these with the percentages
from the rules of thumb. - The agreement between these percentages is not
perfect - there are a few more observations
within one standard deviation of the mean than
the rule of thumb predicts - but in general the
rules of thumb work quite well.
35- Obtaining Summary Measures with Add-Ins
36SALARY.XLS
- Lists starting salaries for 190 graduates from an
undergraduate school of business. - The data is in the range named Salary on a sheet
called Data. - We need to find a set of useful summary measures
for the salaries.
37Summary Statistics
- To find the summary statistics of a set of data
we can use the Stat-Pro Add-In or Excels
Analysis ToolPak. In this example we use the
Stat-Pro Add-In . - Begin by placing the cursor anywhere within the
data range. Then select StatPro/Summary
Stats/One-Variable Summary Stats menu item. - Select all variables you want to summarize, and
select the summary measures you want to find from
the Available Summary Measures dialog box shown
on the next slide.
38Available Summary Measures
39About the Measures
- Four measures are selected by default. These are
mean, median, Standard Deviation and Count. You
can override these. - A typical output appears here.
- It includes many of the measures we have
discussed plus a few more.
40About the Measures -- continued
- The mean absolute deviation is similar to the
variance except that it is an average of the
absolute (note squared) deviations from the mean. - The kurtosis and skewness indicate the relative
peakedness of the distribution and its skewness. - By clicking on any of the cells containing the
measures (Column B), youll see that StatPro
provides the formulas for the outputs. (Analysis
ToolPak does not do so.)
41About the Measures -- continued
- The effect of this is that if any of the data
changes the summary measures we produced change
automatically. - All output is formatted as numerical to three
decimal places by default. You can reformat them
in a more appropriate manner if you would like.
42- Measures of Association Covariance and
Correlation
43EXPENSES.XLS
- A survey questions members of 100 households
about their spending habits. - The data in this file represent the salary,
expense for cultural activities, expense for
sports-related activities, and the expense for
dining-out for each household over the past year. - Do these variables appear to be related linearly?
44Covariance and Correlation
- When we need to summarize the relationship
between two variables we can use the measures
covariance and correlation. We summarize the type
of behavior observed in a scatterplot. - Each measures the strength (and direction) of a
linear relationship between two numerical
variables. - The relationship is strong if the points in a
scatterplot cluster tightly around some straight
line. If this line rises form left to right then
the relationship is positive. If it falls from
left to right then the relationship is negative.
45Determining Linear Relationships
- Scatterplots of each variable versus each other
would provide the answer to the question but six
scatterplots would be required, one for each
pair. - To get a quick indication of possible linear
relationships we can use Stat-Proto obtain a
table of correlations and/or covariances.
46Table of Correlations and Covariances
- To get the table, place the cursor anywhere in
the data set and use the StatPro/Summary
Stats/Correlations, Covariances menu item and
proceed in the obvious way.
47Relationships
- The only relationships that stand out are the
positive relationships between salary and
cultural expenses and between salary and dining
expenses. - The negative relationships are between cultural
and sports-related expenses. - To confirm these graphically we show scatterplots
of Salary versus Culture and Culture versus
Sports
48Scatterplot Indicating Positive Relationship
49Scatterplot Indicating Negative Relationship
50Correlation and Covariance Properties
- In general, the following properties are evident
from the Table of correlations and covariances. - The correlation between a variable and itself is
1. - The correlation between X and Y is the same as
the correlation between Y and X. Therefore, it is
sufficient to list the correlations below (or
above) the diagonal in the table. (The same is
true for the covariances). - The covariance between a variable and itself is
the variance of the variable. We indicate this in
the heading of the covariance table.
51Correlation and Covariance Properties -- continued
- It is difficult to interpret the magnitudes of
covariances. These depend on the fact that the
data are measured in dollars rather than, say,
thousands of dollars. It is such easier to
interpret the magnitudes of the correlations
because they are scaled to be between -1 and 1.
52- Describing Data Sets with Boxplots
53DOW.XLS
- This file lists the monthly returns on the Dow
from February 1947 through January 1993. - Use a boxplot to summarize the distribution of
these returns.
54Boxplots
- A boxplot is a very useful graphical method for
summarizing data. - Boxplots can be used in two ways either to
describe a single variable in a data set or to
compare two (or more) variables. - Excel has no boxplot option, but we included this
option in the StatPro add-in.
55Creating Boxplots
- Place the cursor anywhere within the data set,
use the StatPro/Charts/BoxPlots(s) menu item and
proceed in the obvious way. - Eventually two sheets will be added to your
workbook. One has a the boxplot chart, while the
other contains summary measures used to form the
boxplot. - The following slides show the chart and the
summary measure information.
56Boxplot Chart
57Boxplot Summary Measures
58Keys to Understanding Boxplots
- The right and left of the box are at the third
and first quartiles. Therefore, the length of the
box equals the interquartile range (IQR), and the
box itself represents the middle 50 of the
observations. The height of the box has no
significance. - The vertical line inside the box indicates the
location of the median. The point inside the box
indicates the location of the mean.
59Keys to Understanding Boxplots -- continued
- Horizontal lines are drawn from each side of the
box. They extend to the most extreme observations
tat are no farther than 1.5 IQRs from the box.
They are useful for indicating variability and
skewness. - Observations farther than 1.5 IQRs from the box
are shown as individual points. If they are
between 1.5 IRQs and 3 IQRs from the box, they
are call mild outliers and are hollow. Otherwise,
they are called extreme outliers and are solid.
60Dow Returns Boxplot
- The boxplot for this example summarizes the
distribution of the returns. - It implies that the Dow returns are approximately
symmetric on each side of the median, although
the mean is a bit below the median. - Also there are a few mild outliers but no extreme
outliers.
61- Describing Data Sets with Boxplots
62ACTORS.XLS
- Recall that the salaries of famous actors and
actresses are listed in this file. - Use side-by-side boxplots to compare the salaries
of male and female actors and actresses.
63Side-by-Side Boxplots
- Boxplots are probably most useful for comparing
two populations graphically. This is done using
side-by-side boxplots. - The data setup for this type of comparison can
be in one of two forms stacked or unstacked. - Data are stacked if there is a code variable
that designates which category each observation
is in, and there is a single measurement
variable that contains the data for both
categories.
64Side-by-Side Boxplots -- continued
- The data are unstacked if there is a separate
data column for each category. - In this example the data are stacked because
Gender designates the gender associated with each
observation and Salary is the single measurement
variable. - If the data were unstacked, for example, actors
salaries would be in one column and actresses
salaries would be in another.
65Creating a Side-by-Side Boxplot
- Since the data are stacked the following must be
steps must be followed to create the boxplots. - Place cursor within the data set and select the
StatPro/Charts/Boxplot(s) menu item. - In the dialog box that opens check the stacked
option. - Then choose Gender as the code variable and
Salary as the measurement variable. - The resulting data in stacked form and the
side-by-side boxplot appears.
66Actor Data in Stacked Form
67Side-by-Side Boxplot Chart
68Reading the Boxplot
- It is clear the the female salary box is
considerably to the left of the male salary box,
although both have about the same IQR. - Each boxplot has three indications that the
salary distributions are skewed to the right - the means are larger than the medians
- the medians are closer to the left sides of the
boxes than to the right sides - the horizontal lines extend farther to the right
than to the left of the boxes. However, there are
no outliers.
69 70Background Information
- The Spring Mills Company produces and distributes
a wide variety of manufactured goods. Due to its
variety, it has a a large number of customers. - Spring Mills classifies these customers as small,
medium and large, depending on the volume of
business each does with them. - Recently they have noticed a problem with
accounts receivable. They are not getting paid by
their customers in as timely a manner as they
would like. This obviously costs them money.
71RECEIVE.XLS
- Spring Mills has gathered data on 280 customer
accounts. - For each of these accounts the data set lists
three variables - Size, the size of the customer (coded 1 for
small, 2 for medium, 3 for large) - Days, the number of days since the customer was
billed - Amount, the amount the customer owes
- What information can we obtain from this data?
72Analysis
- It is best to start by getting a good sense of
the data. To do this we - Calculate several summary measures for the Days
and Amount - Create a histogram of Amount
- Create a scatterplot of Amount versus Days
- From these we determine
- a positive skewness in the Amount variable (The
mean is considerably larger than the median and
the standard deviation of Amount is quite large). - The scatterplot suggests some suspicious
behavior, with two distinct groups of points.
73Analysis -- continued
- The next step is to see whether the different
customer sizes have any effect on either Days,
Amount, or the relationship between Days and
Amount. - To do this, it is useful to unstack the Days and
Amount variables - that is to create a new Days
and Amounts variable for each group of customer
sizes. For example, the Days and Amount variables
for customers of size 1 are named Days1 and
Amount1. This can be accomplished by using
StatPros Unstack procedure but copying and
pasting also work.
74Analysis -- continued
- Once unstacked we need to calculate summary
measures and a variety of charts on these
unstacked variables. - The charts include
- Histograms of Amount for each size customer
- Boxplots of days owed by different size customers
- Boxplots of amounts owed by different size
customers - Scatterplots of Amount versus Days for each size
customer
75Summary Measures for Combined Data
76Histogram of All Amounts Owed
77Scatterplot of Amount versus Days for All
Customers
78Summary Measures Broken Down by Size
79Histogram for Small Customers
80Histogram of Amount for Medium Customers
81Histogram of Amount for Large Customers
82Boxplots of Days Owed by Different Size Customers
83Boxplots of Amounts Owed by Different Size
Customers
84Scatterplot of Amount versus Days for Small
Companies
85Scatterplot of Amount versus Days for Medium
Companies
86Scatterplot of Amount versus Days for Large
Companies
87Analysis -- continued
- There is obviously a lot going on here and it is
evident form the charts. We point out the
following - there are considerably fewer large customers than
small or medium customers. - the large customers tend to owe considerably more
than small or medium customers. - the small customers do not tend to be as long
overdue as the small and medium customers. - there is no relationship between Days and Amount
for the small customers, but there is a definite
positive relationship between these variables for
the medium and large customers.
88More Analysis
- We have done the obvious but here is still much
more we can do. - For example, suppose Spring Mills wants a
breakdown of customers who owe at least 500. - We first create a new variable called Large?
next to the original variables that equals 1 for
all amounts greater than 500 and equals 0
otherwise. - We do this by entering a formula in D6 and
copying down.The next slide shows a sample of
data and the formula to be used.
89More Analysis - -continued
- We enter the formula
- IF(C6gtB3,1,0)
- in cell D6 and copy it down.
- We can then use a pivot table to create a count
of the number of 1s in this new variable for
each value of the Size variable.
90Pivot Tables for Counts of Customers Who Owe More
than 500
91More Analysis -- continued
- We created the pivot table twice, once showing
the counts as percentages of each column, and
once showing them as percentages of each row. - One table shows that 73 of all customers with
amounts less than 500 are small customers.The
other table shows that 45 of all medium-sized
customers owe more than 500. - This type of analysis is often referred to as
slicing and dicing the data. They are based on
the same counts but portray them in different
ways. Neither way is better they both provide
useful information.
92More Analysis -- continued
- Finally, we investigate the amount of interest
Spring Mills is losing by the delays in its
customers payments. - We assume that the company can make 12 annual
interest on excess cash. Then we can create a
Lost variable for each customer size that
indicates the amount of interest Spring Mills
loses on each customer group.
93More Analysis
- The formula entered in cell C10 to calculate
Lost1 is B10A10C7/365. This is the amount
owed by the number of days owed multiplied by the
interest rate, divided by the number of days in a
year. Then we copy this formula down and to the
columns to calculate Lost2 and Lost3. - Then we calculate the sums of these amounts in
row 5. - The next slide shows a sample of data at this
point.
94More Analysis -- continued
- Although Spring Mills is losing more per customer
from the large customers, it is losing more total
from the medium-sized customers - because there
are more of them.
95More Analysis -- continued
- This is shown graphically by a pie chart of the
sums in row 5.
96Findings
- If Spring Mills really wants to decrease
receivables, it might want to target the
medium-sized customer group, from which it is
losing the most interest. - Or it could target the large customers because
they owe the most on average. - The most appropriate action depends on the cost
and effectiveness of targeting any particular
customer group. However, the analysis presented
here gives the company a much better picture of
whats currently going on.
97 98Background Information
- The RP Supermarket is open 24 hours a day, 7
days a week. Lately it has been receiving a lot
of complaints from customers about excessive
waiting in line for checking out. - RP has decided to investigate this situation by
gathering data on arrivals, departures, and line
lengths at the checkout stations. - It has collected data in half-hour increments for
an entire week - 336 observations - starting at 8
am on Monday and ending at 8 am on the following
Monday.
99CHECKOUT.XLS
- This file includes the data collected on the
following variables - InitialWaiting, the number waiting or being
checked out at the beginning of a half-hour
period - Arrivals, the number of arrivals to the checkout
stations during a period - Departures, the number of checkout stations open
during a period - Checkers, the number of checkout stations open
during a period
100CHECKOUT.XLS -- continued
- The data set also includes time variables
- Day, day of week
- StartTime, clock time at the beginning of each
half-hour period - TimeInterval, a descriptive term for the time of
day such as Lunch rush for 1130 a.m. to 130
p.m. - Finally the data set includes a calculated
variable - EndWaiting, the number waiting or being checked
out at the end of a half-hour period This
variable for any time period equals
InitialWaiting plus Arrivals minus Departures it
also equals InitialWaiting for the next period.
101The Data
102Analysis Information
- The manager of RP wants to analyze the data to
discover any trends, particularly in the pattern
of arrivals throughout the day or across the
entire week. - Also the store currently uses a
seat-of-the-pants approach to opening and
closing checkout stations each half hour. The
manager would like to see how well the current
approach is working. - Of course, she would love to know the best
strategy for opening and closing checkout
stations - but this is beyond her (and our)
capabilities at this point
103Analysis
- Obviously time plays a crucial role in this data,
so a good place to start is to create one or more
time series plots. - The time series plot of InitialWaiting and
Arrivals Variables shown on the next slide shows
that - Fridays and Saturdays are the busiest days
- the time pattern of arrivals is somewhat
different - more spread out - during the weekends
than during the weekdays - there are fairly regular peak arrival periods
during the weekdays - the number waiting is sometimes as large as 10 or
20, and the largest of these tend to be around
peak arrival times
104Time Series Plot of Initial Waiting and Arrivals
Variables
105Analysis -- continued
- A similar time series plot shown below shows
Arrivals and Departures.
106Analysis -- continued
- On this plot it is difficult to separate the two
time series they are practically on top of each
other. - Perhaps this is not so bad because this tells us
that the store is checking out customers
approximately as quickly as they are arriving. - A somewhat more efficient way to understand the
time series behavior is to use pivot tables.
107Analysis - continued
- A pivot table for Average InitialWaiting by Hour
of Day is generated. This table can be seen on
the next slide. - To create this table, we
- drag the InitialWaiting variable to the Data
area, express it as an average - drag StartTime variable to the Row area
- drag the Day variable to the Page area
- Finally, we use the data in the pivot table to
create a time series plot.
108Average Initial Waiting by Hour of Day
109Analysis -- continued
- Note how the variable in the Page area works.
Depending on which day we select in the page area
the corresponding table and time series plot
changes. - Similarly, a pivot table can be created,
accompanied by a column chart. Average Arrivals
by TimeInterval of Day can be analyzed with a
pivot table by dragging Arrival to the Data area,
expressing it as an average, dragging the
TimeInterval variable to the Row area, and
dragging the day variable to the Page area. This
table is is on the next slide. You can check for
the patterns on each day.
110Average Arrivals by Time Interval of Days
111More Analysis
- The manager of RP is ultimately interested in
whether the right number of checkout stations
are available throughout the day. - We can create two scatterplots to provide some
evidence. The plots follow this slide. - The first of these would be a scatterplot of
Checkers versus TotalCustomers. The
TotalCustomers variable is calculated as the sum
of InitialWaiting and the Arrivals to measure the
total amount of work presented to the checkout
stations in any half-hour period.
112Scatterplot of Checkers versus Total Customers
113Scatterplot of End Waiting versus Checkers
114More Analysis -- continued
- From the scatterplot we can see an obvious
positive relationship between these two
variables. - Evidently management is reacting as they should -
it is opening more checkout stations when there
is more traffic. - The second scatterplot shows EndWaiting versus
Checkers. There is again a definite upward trend.
Periods when more checkout stations are open tend
to be associated with periods where more
customers still remain in the checkout process.
115More Analysis - continued
- Presumably, management is reacting with more open
checkout stations in busy periods, but it is not
reacting strongly enough. - Just fiddling with the numbers in the Checker
column will not solve the manager's problems.
116Conclusions
- Two problems exist. First, there is a trade-off
between the cost of having customers wait in
line and the cost of paying extra checkout
people. This is a difficult tradeoff for any
supermarket manager. - Second, the number of departures is clearly
related to the number of checkout stations open.
Therefore it doesnt make sense to change the
numbers in the Checkers column without changing
the numbers in the Departures (and hence
InitialWaiting and EndWaiting) column in an
appropriate way. This is not an easy problem.
117 118Background Information
- The HyTex Company is a direct marketer of
stereophonic equipment, personal computers and
other electronic products. - HyTex advertises entirely by mailing catalogs to
its customers, and all of its orders are taken
over the telephone. - The company spends a great deal of money on its
catalog mailings, and it wants to be sure that
this is paying off in sales.
119CATALOGS.XLS
- This file contains the data that has been
collected on 1000 customers at the end of the
current year. - For each customer it has data on the following
variables - Age coded as 1 for 30 or younger, 2 for 31 to
55, 3 for 56 and older. - Gender coded as 1 for males, 2 for females
- OwnHome coded as 1 if customer owns a home, 2
otherwise - Married coded as 1 if customer is currently
married, 2 otherwise
120CATALOGS.XLS -- continued
- Close coded as 1 if customers lives reasonably
close to a shopping area that sells similar
merchandise, 2 otherwise - Salary combined annual salary of customer and
spouse (if any) - Children number of children living with customer
- History coded as NA if customer had no
dealings with the company before this year, 1 if
customer was a low spending customer last year, 2
if medium-spending, 3 if high-spending - Catalogs Number of catalogs sent to the customer
this year - AmountSpent Total amount of purchases made by
the customer this year
121Analysis Desired
- HyTex wants to analyze these data carefully to
understand its customers better. - They want to see whether they are sending the
catalogs to the right customers. - Each customer either receives 6, 12, 18, or 24
catalogs through the mail. Currently which
customer receives which amount is not thought out
carefully. They want to know if the current
distribution of catalogs is effective. Is there
room for improvement?
122Analysis
- HyTex is obviously interested in the AmountSpent
variable. Therefore, it makes sense to create
scatterplots of AmountSpent versus selected
explanatory variables. - First we create the scatterplot shown on the next
slide showing AmountSpent versus Salary. - It is clear that customers with higher salaries
tend to spend more, although the variability in
amounts spent increases significantly as salary
increases.
123Scatterplot of Amount Spent versus Salary
124Analysis -- continued
- Second, we create a scatterplot of AmountSpent
versus Catalogs - This plot shows that there is some tendency
toward higher spending among customers who
receive more catalogs. - But do the catalogs cause more spending, or are
more catalogs sent to customers who would tend to
spend more anyway? There is no way to answer this
with this data. - Next is the scatterplot of AmountSpent versus
Children - This plot shows the interesting tendency of
customers with more children to spend less.
125Scatterplot of Amount Spent versus Catalogs
126Scatterplot of Amount Spent versus Children
127More Analysis
- Pivot tables and accompanying charts are very
useful in this type of situation. We can begin by
using this technique to get a better
understanding of the demographics of the
customers. - The first pivot table and chart shows the
percentage of an age group who own homes. Using
the pivot table we can see how these percentages
change for women, unmarried men and so on. - Specifically small percentages of the younger
people own their own hoe, regardless of marital
status or gender.
128Percent of Home Owners versus Age, Married and
Gender
129More Analysis -- continued
- The second pivot table shows the percentage of
each age group who are married, for any
combination of the Gender and OwnHome variables. - For example, we can check that the
married/unmarried split is quite different for
women who dont own a home than for male home
owners. - The third pivot table shows the average Salary
broken down by Age and Gender, with page
variables for OwnHome and Married.
130Percent of Married versus Age, OwnHome and Gender
131Average Salary versus Age, Gender, Married and
OwnHome
132More Analysis --continued
- The shape of the resulting charts is practically
the same for any combination of the page
variables. However, the heights of the bars
change appreciably. For example, salaries are
higher for the married home owners than for
unmarried customers who are not home owners. - Another pivot table and chart break the data down
in another way. Each column in the pivot table
shows the percentages in the various History
categories for a particular number of children.
Each of these columns corresponds to one of the
bars in the stacked bar chart. We have also
used Close as a page variable.
133Percentages in History Categories versus Children
and Close
134More Analysis -- continued
- Two interesting points emerge from this plot.
- First, customers with more children tend to be
more heavily represented in the low-spending
History category. - Second, you can check by changing the setting of
the Close variable from 1 to 2, the percentage of
high spenders among customers who live far from
electronic stores is much higher than for those
who live close to such stores. - The next pivot table provide insight into how
HyTex determined its catalog mailing
distributions. Each row shows a percentage of a
particular History category that were sent
6,12,18 or 24 catalogs.
135Catalog Distribution versus History
136More Analysis -- continued
- The companys distribution policy is still
unclear but we can see that it definitely sends
more catalogs to high spending customers and
fewer to low spending customers. - Finally the last pivot table shows AmountSpent
versus History and Catalogs, with a variety of
demographic variables in the page area. There are
so many possible combinations so it can be
difficult to discover all the existing patterns
137Average Amount Spent versus History, Catalogs and
Demographic Variables
138More Analysis -- continued
- One things stands out loud and clear from the
graph the more customers receive, the more they
tend to spend. - Also, if they were large spenders last year, they
tend to be large spenders this year. - In a pivot table with this many combinations
there will almost certainly be some combinations
with no observations.