Title: How to Lie with Statistics
1How to Lie with Statistics
- as in the book by Darrell Huff
2Types of Lies
- Intentional deceit
- Selective data use
- Extrapolation
- Creative graphics
- Faulty assumptions
- Incompetence
3Mmmm
- "Many of the truths we hold onto depend on our point of view." (Ben Kenobi)
4Look at this graph
5Compared to this one
6Or this one
7Some graphs use pictures
8Both the height and the width were doubled from Joe to Ann
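Why a picture scaled this way misleads can be checked with a little arithmetic. The sketch below assumes the pictograph's height and width are both multiplied by the same factor; the function name and the unit-sized base picture are hypothetical, chosen only for illustration.

```python
def area_ratio(scale: float) -> float:
    """Ratio of drawn areas when height and width are both multiplied by `scale`."""
    base_height, base_width = 1.0, 1.0  # hypothetical unit-sized picture
    scaled_area = (base_height * scale) * (base_width * scale)
    return scaled_area / (base_height * base_width)

# Doubling both dimensions quadruples the area the eye compares.
print(area_ratio(2.0))  # 4.0
```

The value being plotted only doubled, but the reader sees a figure four times as large.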
9The Gee-Whiz Graph
- Attractive figures must be true
- Axes do not need labels or units
- Scale is intent dependent
10This looks like the fertilizer does an OK job.
11This looks even better
12How to mislead through visual effects
- Direct labels - changing the title
- Encoding - using color coding to associate values with numbers
- Self-representing scales - portraying commonly known objects next to the object being discussed
21Sample with the Built-in Bias
- Practically all statistics are based on a sample of a population. So...
- how was the sample chosen?
- how big is the sample?
- what population does it claim to represent?
- what population does it actually represent?
22Size of Sample
- Flip a coin 5 times
- Heads four times
- 80% heads
- Flip 100 times
- Same results???
- In general, the larger the sample size, the
better the estimation.
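The coin example can be checked exactly rather than by flipping. The sketch below assumes a fair coin and uses the binomial distribution to ask how likely "80% or more heads" is with 5 flips versus 100 flips; the helper name `prob_at_least` is hypothetical.

```python
from math import comb

def prob_at_least(heads: int, flips: int, p: float = 0.5) -> float:
    """Exact binomial probability of at least `heads` heads in `flips` flips."""
    return sum(
        comb(flips, k) * p**k * (1 - p) ** (flips - k)
        for k in range(heads, flips + 1)
    )

small = prob_at_least(4, 5)     # 80% or more heads in 5 flips
large = prob_at_least(80, 100)  # 80% or more heads in 100 flips

print(f"5 flips:   {small:.4f}")
print(f"100 flips: {large:.2e}")
```

With 5 flips, four or more heads turns up roughly one time in five; with 100 flips, 80 or more heads is vanishingly rare, which is why the larger sample gives the better estimate.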
23FDR poll
- 1930s
- Polls were taken during the U.S. presidential campaign of Franklin D. Roosevelt (FDR).
- Only people with telephones were polled.
- The pollsters predicted that FDR's opponent would win, but FDR won the actual election.
- The poll did NOT accurately reflect all of the voters because the opinions of only one part of the population (wealthy people with telephones) were taken into account.
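The effect of polling only one part of the population can be sketched with made-up numbers. Everything in the block below is an assumption for illustration: the group sizes and support shares are invented, not historical data, and the `support` helper is hypothetical.

```python
# Hypothetical electorate: (group, number of voters, share supporting FDR).
# These numbers are invented for illustration, not historical data.
electorate = [
    ("has telephone", 200, 0.40),
    ("no telephone", 800, 0.65),
]

def support(groups):
    """Overall share supporting FDR across the given groups."""
    total = sum(n for _, n, _ in groups)
    supporters = sum(n * share for _, n, share in groups)
    return supporters / total

# A poll that only reaches telephone owners sees one group; the election sees all.
phone_only = support([g for g in electorate if g[0] == "has telephone"])
everyone = support(electorate)

print(f"telephone-only poll: {phone_only:.0%} for FDR")
print(f"whole electorate:    {everyone:.0%} for FDR")
```

However many telephone owners are polled, the estimate stays wrong, because the bias comes from who was asked and not from how many.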
24Random?
- Truly random samples are practically impossible, but almost everyone claims they've done it
- Quadrats were randomly placed in a forest - except where there was a thicket of multiflora rose and greenbrier
- Water quality was determined at random locations in a stream - but somehow the locations always cluster around access roads
25What population does the sample represent?
- No population is uniform - it is composed of many distinct and interacting subsets
- are you crossing subset boundaries?
- mixing age groups, income levels, soil types
- are you aware of the subsets?
- how finely are the subsets divided?
- is each observation independent?
- inflating sample size
26Best ways to Lie with Sampling
- Ignore possible biases in your sampling method - many are too difficult to detect anyway.
- Claim everything has been done randomly - it is expected of you
- Express more confidence in your sampling method than it merits
- Do not elaborate
27The Well Chosen Average
- Remember the three measures of central tendency
- Mean
- Median
- Mode
- "Average" could mean any of them
28Pick your favorite average
- Mean 37,000
- Median 12,000
- Mode 9,000
- Each is a legitimate average but can serve conflicting purposes
- Incomes
- 9,000
- 9,000
- 9,000
- 12,000
- 120,000
- 85,000
- 15,000
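The three figures on this slide can be reproduced from the listed incomes. This is a minimal sketch using Python's standard `statistics` module; the variable name `incomes` is just a label for the slide's data.

```python
from statistics import mean, median, mode

# The seven incomes listed on the slide.
incomes = [9000, 9000, 9000, 12000, 120000, 85000, 15000]

print("mean:  ", mean(incomes))    # 37,000
print("median:", median(incomes))  # 12,000
print("mode:  ", mode(incomes))    # 9,000
```

All three are honest computations on the same data; which one gets reported as "the average" depends on the story being told.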
29Standard Deviation versus Standard Error
- Standard Deviation describes variability around the mean
- Standard Error assesses the precision of the estimate of the population mean
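The relationship between the two can be sketched as follows, assuming the usual definition of the standard error of the mean as the sample standard deviation divided by the square root of the sample size. The measurements are hypothetical and the `standard_error` helper is not from the slides.

```python
from math import sqrt
from statistics import stdev

def standard_error(sample):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return stdev(sample) / sqrt(len(sample))

# Hypothetical measurements, invented for illustration.
sample = [4.1, 5.3, 4.8, 6.0, 5.2, 4.6, 5.5, 4.9]

sd = stdev(sample)
se = standard_error(sample)
print(f"SD = {sd:.3f}, SE = {se:.3f}")
```

Because of the division by the square root of n, error bars drawn with the standard error are narrower than those drawn with the standard deviation for the same data.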
30How to Lie with Averages
- Use the standard error. For any sample of more than one observation it is smaller than the standard deviation and thus looks better.
- People look at the error bars but rarely at what the bars represent
- State an average without explaining what it measures.
- The average person will think it is the mean
31The Little Figures that are Not There
- Confusing graphics
- Meaningless and misleading averages
- lose an average of 30 pounds
- Proportions or ratios stated without an explanation of what produced them
- 3 out of 4 dentists prefer Brand X toothpaste
32What Average?
- Person Money
- John 2
- Ann 3
- Bob 1
- Mary 10
- Sue 5
- Carol 2
- Ken 999
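How much a single value can drag the mean is easy to check on this table. The sketch below, using only the figures on the slide, compares the mean and median with and without Ken; the dictionary name `money` is just a label.

```python
from statistics import mean, median

# Person -> money, as listed on the slide.
money = {"John": 2, "Ann": 3, "Bob": 1, "Mary": 10,
         "Sue": 5, "Carol": 2, "Ken": 999}

everyone = list(money.values())
without_ken = [v for name, v in money.items() if name != "Ken"]

print("mean with Ken:     ", mean(everyone))       # 146
print("median with Ken:   ", median(everyone))     # 3
print("mean without Ken:  ", round(mean(without_ken), 2))
print("median without Ken:", median(without_ken))  # 2.5
```

One person moves the mean from under 4 to 146, while the median barely changes.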
33Misleading Graphics
- Graphics are aesthetically appealing
- Graphics convey a lot of information with minimum effort
- Graphics are nice and vague
35Examples of Irrelevant Conclusions
- 72% of all crow nests in a particular forest are in pine trees
- Therefore, crows prefer to nest in pine trees.
- But 95% of all trees in the forest are pine!
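One way to test the "preference" claim is to compare use with availability. The sketch below assumes the slide's figures are percentages (72% of nests in pines, 95% of trees are pines) and divides each tree type's share of nests by its share of trees; the `selection_ratio` name is hypothetical.

```python
def selection_ratio(share_of_nests: float, share_of_trees: float) -> float:
    """Share of nests in a tree type divided by that type's share of all trees.

    A ratio near 1 means use is proportional to availability;
    below 1 means the tree type is used less than its abundance would suggest.
    """
    return share_of_nests / share_of_trees

pine = selection_ratio(0.72, 0.95)
other = selection_ratio(1 - 0.72, 1 - 0.95)

print(f"pine:  {pine:.2f}")
print(f"other: {other:.2f}")
```

Under these assumptions pines are used less than their abundance would predict and the remaining trees far more, which is the opposite of the stated conclusion.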
36Failure to apply in general..
- A bird survey in a woodlot detected a healthy songbird population
- But at the next survey, hardly a bird was heard.
- What happened to the birds?
37Confuse cause and effect
- Does high income cause ownership of stocks or does ownership of stocks cause high income?
- Mistake correlation for causation
38Misleading Numbers
- Play with significant digits. 1.28 appears more accurate than 1.0.
- Answer a question with a statistic that is slightly irrelevant.
- Do students get a quality education at school X? Yes, the average GPA is 3.2
- But what if a student has a class rank of 70 with a GPA of 3.8?
39Other mis-uses
- Present a result without a significance value.
- Use untestable assumptions.
- Use precision and accuracy interchangeably
- Perform nonsensical tests that sound good.
- No significant difference was found between field
A and field B. At what scale????
40Remember that.
- a statistic is only worthwhile when it satisfies the assumptions of the model/test. Knowing whether the assumptions are met depends on the competence of the person running the stats. Violations are often difficult to catch in the review process.
41Thanks to
- http://faculty.washington.edu/chudler/stat3.html Eric Chudler
- http://atlantic.evsc.virginia.edu/jhp7e/EVSC503/slides/stats_lie02/index.htm David Bowne
- Dr. Robertta H. Barba, San José State University http://sweeneyhall.sjsu.edu/edit272/lie/sld001.htm