Title: Measures of Precision
1How to Fake Data if you must
Rachel Fewster
Department of Statistics
2Who wants to fake data?
- Electoral finance returns
- Toxic emissions reports
- Business tax returns
3Land areas of world countries: real or fake?
[Figure: two tallies of first digits, 1 to 9]
First tally: IIIII III III I I II I
This one is right! This one has as many 1s as 5-9s put together!
Second tally: I I III I IIII I II III
This one seems more even
7Real land areas of world countries
11 of them begin with digits 1-4
Only 5 begin with digits 5-9
8Friday's Newspaper
10 out of 34 numbers began with a 1
None out of 34 began with a 9!
9The Curious Case of the Grimy Log-books
- In 1881, American astronomer Simon Newcomb
noticed something funny about books of logarithm
tables
10The Curious Case of the Grimy Log-books
The books always seemed grubby on the first
pages, but clean on the last pages
The first pages are for numbers beginning with
digits 1 and 2
The last pages are for numbers beginning with
digits 8 and 9
11The Curious Case of the Grimy Log-books
Why?
People seemed to look up numbers beginning with
1 and 2 more often than they looked up numbers
beginning with 8 and 9.
Because numbers beginning with 1 and 2 are MORE
COMMON than numbers beginning with 8 and 9!!
12Newcomb's Law
30% of numbers begin with a 1!!
< 5% of numbers begin with a 9!!
American Journal of Mathematics, 1881
13The First Digits
Over 30% of numbers begin with a 1
Only 5% of numbers begin with a 9
14The First Digits
Numbers beginning with a 1
Numbers beginning with a 9
There is the same opportunity for numbers to
begin with 9 as with 1, but for some reason they
don't!
15Chance of a number starting with digit d: log10((d+1)/d)
0.301 = log10(2/1)
0.176 = log10(3/2)
0.125 = log10(4/3)
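The three values on this slide fit the pattern log10((d+1)/d). As a quick check, here is a minimal Python sketch that tabulates that chance for every first digit from 1 to 9. The function name `benford_probability` is made up for this illustration.

```python
import math

def benford_probability(d: int) -> float:
    """Chance that a number starts with digit d (1 to 9) under the logarithmic law."""
    return math.log10((d + 1) / d)

# Print the chance for each possible first digit.
for d in range(1, 10):
    print(d, round(benford_probability(d), 3))

# The nine chances telescope to log10(10/1), so they add up to 1.
total = sum(benford_probability(d) for d in range(1, 10))
print("total:", round(total, 6))
```

The chance for digit 9 comes out just under 5%, in line with the "< 5%" on the Newcomb slide.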
16Reactions to Newcomb's law
Nothing!
for 57 years!
17Enter Frank Benford, 1938
Physicist with the General Electric Company
Assembled over 20,000 numbers and
counted their first digits!
"A study as wide as time and energy permitted."
18Populations
Numbers from newspapers
Drainage rates of rivers
Numbers from Reader's Digest articles
Street addresses of American Men of Science
19About 30% begin with a 1
About 5% begin with a 9
20Anomalous numbers !!
Benford gave the law its name but no
explanation.
21The logarithmic law applies to outlaw numbers
that are without known relationship, rather than
to those that follow an orderly course and so
the logarithmic relation is essentially a Law of
Anomalous Numbers.
22What is the explanation?
Explanations for Benford's Law
- Numbers from a wide range of data sources have
about 30% of 1s, down to only 5% of 9s.
- Benford called these "outlaw" or "anomalous"
numbers. They include street addresses of
American Men of Science, populations, areas,
numbers from magazines and newspapers.
- Benford's "orderly" numbers, like atomic weights
and physical constants, don't follow the law.
23Popular Explanations
- Scale Invariance
- Base Invariance
- Complicated Measure Theory
- Divine choice
- Mystery of Nature
Scale Invariance and Base Invariance say that IF
there is a universal law, it must be Benford's.
They don't explain why there should be a law to
start with!
24Complicated Measure Theory
In a nutshell: if you grab numbers from all
over the place (a random mix of distributions),
their digit frequencies ultimately converge to
Benford's Law
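The "random mix" idea can be tried out with a small simulation. The sketch below is only an illustration, not the measure-theory result itself. The three distribution families, the random scale between 1 and a million, the seed and all the names are assumptions made for this example.

```python
import math
import random

def first_digit(x: float) -> int:
    """First significant digit of a positive number."""
    n = math.floor(math.log10(x))
    r = x / 10 ** n
    # Guard against floating-point rounding right next to a power of 10.
    if r < 1:
        r *= 10
    elif r >= 10:
        r /= 10
    return int(r)

def draw_from_random_distribution(rng: random.Random) -> float:
    """Pick a distribution at random, then draw one number from it."""
    scale = 10 ** rng.uniform(0, 6)  # hypothetical: typical size from 1 to a million
    kind = rng.choice(["uniform", "exponential", "lognormal"])
    if kind == "uniform":
        return scale * (1.0 - rng.random())  # uniform on (0, scale]
    if kind == "exponential":
        return rng.expovariate(1.0 / scale)  # exponential with mean = scale
    return scale * rng.lognormvariate(0.0, 1.0)

rng = random.Random(2009)  # fixed seed so the run is repeatable
counts = [0] * 10
for _ in range(100_000):
    x = draw_from_random_distribution(rng)
    if x > 0:  # skip the (extremely unlikely) exact zero
        counts[first_digit(x)] += 1

kept = sum(counts)
freqs = [counts[d] / kept for d in range(1, 10)]
for d, f in enumerate(freqs, start=1):
    print(d, round(f, 3), "Benford:", round(math.log10((d + 1) / d), 3))
```

In this particular mix the random scale does most of the work, so the agreement with Benford is very close. Other mixes may converge more slowly.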
25That's why THIS works well
26It doesn't really explain WHAT will work well,
nor why
It doesn't explain why street addresses of
American Men of Science work well!
27The Key Idea
If a hat is covered evenly in red and white
stripes
it will be half red
and half white.
Photo - Eric Pouhier http://commons.wikimedia.org/wiki/Napoleon
29A Hat
If the red stripes cover half the base, they'll
cover about half the hat
The red stripes and the white stripes even out
over the shape of the hat
32What if the red stripes cover 30% of the base?
[Figure: the hat, with red stripes on its base from 0 to 0.3, 1 to 1.3, ..., 5 to 5.3; axis from 0 to 6]
Then they'll cover about 30% of the hat.
33What if the red stripes cover precisely fraction
0.301 of the base?
Then they'll cover fraction 0.301 of the hat.
[Figure: the hat, with red stripes on its base from n to n + 0.301 for n = 0, 1, ..., 5; axis from 0 to 6]
0.301 = log10(2/1)
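The hat claim is easy to check numerically. The sketch below assumes a made-up hat shape (a lopsided bell over the base 0 to 6, not the one in the photo). It slices the hat into thin vertical strips and adds up the area that falls on red stripes.

```python
import math

def hat(t: float) -> float:
    """A hypothetical hat outline: a bell shape peaking at 2.7 on the base 0..6."""
    return math.exp(-0.5 * (t - 2.7) ** 2)

def red_fraction(stripe_width: float, steps_per_unit: int = 1000) -> float:
    """Fraction of the hat's area lying over the red stripes [n, n + stripe_width)."""
    red = 0.0
    total = 0.0
    dt = 1.0 / steps_per_unit
    for i in range(6 * steps_per_unit):
        t = (i + 0.5) * dt          # midpoint of a thin vertical slice
        area = hat(t) * dt
        total += area
        if t - int(t) < stripe_width:  # this slice sits on a red stripe
            red += area
    return red / total

for width in (0.5, 0.3, 0.301):
    print(width, round(red_fraction(width), 3))
```

The match is approximate rather than exact, which is why the slides say "about": a lumpier or much narrower hat would even out less well.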
34Think of X as a random number
We want the probability that X has first digit
1
Let the hat be a probability density curve for X
Then AREAS on the hat give PROBABILITIES for X
Total area = 1
Area = 0.95 from 1 to 5
Pr(1 < X < 5) = 0.95
36In the same way...
[Figure: the hat, with red stripes on its base from n to n + 0.301 for n = 0, 1, ..., 5; axis from 0 to 6]
If the red stripes somehow represent the X values
with first digit 1, and the red stripes have
area 0.301, then Pr(X has first digit 1) =
0.301.
37So X values with first digit 1 somehow lie on a
set of evenly spaced stripes?
Write X in Scientific Notation: X = r x 10^n
r is between 1 and 10
n is an integer
39For example
1 < r < 2: the first digit of X is 1
r > 2: the first digit of X is not 1
For the first digit of X, only r matters!
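To make the scientific-notation step concrete, here is a small Python sketch that splits a positive number X into r (between 1 and 10) and an integer n with X = r x 10^n, then reads the first digit off r alone. The function name and the sample numbers are chosen only for illustration.

```python
import math

def scientific_notation(x: float) -> tuple[float, int]:
    """Split a positive number x into (r, n) with x = r * 10**n and 1 <= r < 10."""
    n = math.floor(math.log10(x))
    r = x / 10 ** n
    # Guard against floating-point rounding right next to a power of 10.
    if r < 1:
        r, n = r * 10, n - 1
    elif r >= 10:
        r, n = r / 10, n + 1
    return r, n

for x in (1.5, 247, 1840, 0.0031, 96000):
    r, n = scientific_notation(x)
    # The first digit of x is just the whole-number part of r; n plays no role.
    print(f"{x} = {r:.3g} x 10^{n}, first digit {int(r)}")
```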
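42Take logs to base 10
log(X) = log(r x 10^n) = log(r) + n
Or in other words, log(X) is the integer n plus log(r)
43r is between 1 and 10, so log(r) is between 0 and 1
n is an integer
X has first digit 1 when r is between 1 and 2,
that is, when log(r) is between 0 and 0.301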
46 n is an integer
X has first digit 1 precisely when log(X)
is between n and n + 0.301 for any integer n
n = 0: X from 1 to 2
n = 1: X from 10 to 20
n = 2: X from 100 to 200
STRIPES!!
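The stripe rule can be checked directly: take log to base 10, subtract the integer part n, and see whether what is left is below 0.301. The sketch below compares that test with simply reading off the first digit. The function names and the sample values are made up for this illustration.

```python
import math

STRIPE_WIDTH = math.log10(2)  # about 0.301

def on_a_stripe(x: float) -> bool:
    """True when log10(x) lies between n and n + 0.301 for some integer n."""
    log_x = math.log10(x)
    n = math.floor(log_x)
    return log_x - n < STRIPE_WIDTH

def starts_with_one(x: float) -> bool:
    """Read the first digit straight off scientific notation, e.g. '1.84e+03'."""
    return f"{x:.10e}"[0] == "1"

examples = [1.5, 15, 150, 1999, 2.5, 37, 999, 0.0012, 0.08]
for x in examples:
    print(x, on_a_stripe(x), starts_with_one(x))
```

Values sitting exactly on a stripe edge (2, 20, 200, ...) are left out of the examples on purpose, because floating-point rounding in the logarithm can tip them either way.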
48The hat is the probability density curve for
log(X)
[Figure: the hat, with red stripes on its base from n to n + 0.301 for n = 0, 1, ..., 5; axis from 0 to 6]
X values with first digit 1 satisfy: log(X) is between n and n + 0.301
n = 0: X from 1 to 2
n = 1: X from 10 to 20
n = 2: X from 100 to 200
and so on!
50So X values with first digit 1 DO lie on evenly
spaced stripes, on the log scale!
[Figure: the hat, with red stripes on its base from n to n + 0.301 for n = 0, 1, ..., 5; axis from 0 to 6]
The PROBABILITY of getting first digit 1 is the
AREA of the red stripes, approximately the fraction on
the base, 0.301.
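A quick simulation shows the same thing from the other direction. The sketch below assumes a particular hat for log(X), a normal curve centred at 3 with standard deviation 1. That choice, the seed and the sample size are all made up for this illustration. It draws points under the hat, converts back to the usual scale, and counts how often the first digit is 1.

```python
import random

rng = random.Random(1881)  # fixed seed so the run is repeatable
N = 100_000
ones = 0
for _ in range(N):
    log_x = rng.gauss(3.0, 1.0)  # a point under the hat, on the log scale
    x = 10 ** log_x              # back to the usual scale
    # Read the first digit off scientific notation, e.g. '1.8400000000e+03'.
    if f"{x:.10e}"[0] == "1":
        ones += 1

fraction_ones = ones / N
print("fraction with first digit 1:", round(fraction_ones, 3))
```

With this hat the fraction should land close to 0.301, give or take sampling noise.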
51We've done it!
We've shown that we really should expect the
first digit to be 1 about 30% of the time!
52Intuitively
[Figure: the hat, with red stripes on its base from n to n + 0.301 for n = 0, 1, ..., 5; axis from 0 to 6]
The log scale distorts: small numbers (e.g. 100)
are stretched out; larger numbers (e.g. 900) are
bunched up. The first digit corresponds to
regularly spaced stripes on the log scale.
So the smallest numbers (first digit 1) are
stretched out, and get the highest probability!
53When is this going to work?
[Figure: the hat, with red stripes on its base from n to n + 0.301 for n = 0, 1, ..., 5; axis from 0 to 6]
The distribution of X needs to be WIDE on the
log scale!
We need a lot of stripes to balance out big ones
and little ones! We get one stripe every
integer, so we need a lot of integers!
54When is this going to work?
[Figure: the hat, with red stripes on its base from n to n + 0.301 for n = 0, 1, ..., 5; axis from 0 to 6]
X ranges from 0 to 6 on the log scale, so it
ranges from 1 to 10^6 on the usual scale!
1 .. 2 .. Miss a few ... 999,999 .. 1,000,000
55These are Benford's Outlaw Numbers!
[Figure: the hat, with red stripes on its base from n to n + 0.301 for n = 0, 1, ..., 5; axis from 0 to 6]
- All we need is a distribution that is
  - WIDE (4-6 orders of magnitude or more)
  - Reasonably SMOOTH
- Then the red stripes will even out to cover about
30% of the total area.
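How wide is wide enough? The sketch below assumes a normal hat for log(X), centred at a made-up value of 5.8, and works out the exact area of the stripes from n to n + 0.301 for several hat widths. The standard deviations tried, the centre and the function names are assumptions for this illustration only.

```python
import math

def normal_cdf(z: float) -> float:
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_first_digit_one(mean: float, sd: float) -> float:
    """Area of the stripes [n, n + 0.301) under a normal hat for log10(X)."""
    width = math.log10(2)
    lowest = math.floor(mean - 10 * sd) - 1   # stripes further out hold no area
    highest = math.ceil(mean + 10 * sd) + 1
    total = 0.0
    for n in range(lowest, highest + 1):
        total += normal_cdf((n + width - mean) / sd) - normal_cdf((n - mean) / sd)
    return total

# A hat with standard deviation sd covers roughly 4 * sd orders of magnitude.
for sd in (0.05, 0.2, 0.5, 1.0, 2.0):
    print(sd, round(prob_first_digit_one(5.8, sd), 3))
```

The narrow hats miss the stripes almost entirely, while a standard deviation of about 1 (roughly 4 orders of magnitude across) already gives a stripe area very close to 0.301.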
56In Real Life
World Populations: from 50 for the Pitcairn
Islands to 1.3 x 10^9 for China
Wide (9 integers => 9 stripes)
First digits: very good fit to Benford!
58Electorate populations? From 583,000 to 773,000
in California
The hat has less than one stripe! Benford
doesn't work here.
Of course not! All the first digits are 5, 6, or
7
59But naturally occurring populations are a
different story! Cities in California: from 94
in the city of Vernon to 3.9 million in Los
Angeles
Yes! It's Benford!
Wide enough (5 integers => 5 stripes)
60Powerball Jackpots? From 10 million to 365
million
Not bad!
Orders of magnitude: only 1.5, but sometimes
you just hit lucky!
Data with kind permission from www.lottostrategies.com
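The width of each real-life example can be worked out from the smallest and largest values quoted on the slides: the number of orders of magnitude is log10(largest / smallest). A minimal sketch, with the function and dictionary names made up for illustration:

```python
import math

def orders_of_magnitude(smallest: float, largest: float) -> float:
    """Width of the range on the log scale."""
    return math.log10(largest / smallest)

# Smallest and largest values as quoted on the slides.
ranges = {
    "World populations": (50, 1.3e9),
    "California electorates": (583_000, 773_000),
    "California cities": (94, 3.9e6),
    "Powerball jackpots": (10e6, 365e6),
}

for name, (smallest, largest) in ranges.items():
    width = orders_of_magnitude(smallest, largest)
    print(f"{name}: {width:.1f} orders of magnitude")
```

The electorates span well under one order of magnitude, which is why their hat has less than one stripe.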
61Your tax return...?
???
If you plan to fake data, you should first check
whether it ought to be Benford! BUT the IRD has
a few other tricks up its sleeve too.
62Thanks for listening!
- To find out more
  - "A Simple Explanation of Benford's Law"
  - by R. M. Fewster
  - The American Statistician, to appear.
  - PDF from www.stat.auckland.ac.nz/fewster/benford.html
- Judy Paterson's CMCT course, Term 1 2009, Centre
for Mathematical Content in Teaching