Use of Quantile Functions - PowerPoint PPT Presentation

About This Presentation
Title:

Use of Quantile Functions

Description:

A histogram can be plotted using hist(z) Exponential Distribution ... hist(stamford, xlab='ozone concentration', main='Stamford ozone concentrations' ... – PowerPoint PPT presentation

Number of Views:183
Avg rating:3.0/5.0
Slides: 59
Provided by: jphil
Category:
Tags: functions | hist | quantile | use

less

Transcript and Presenter's Notes

Title: Use of Quantile Functions


1
Use of Quantile Functions
  • in Data Analysis

2
In general, Quantile Functions (sometimes
referred to as Inverse Density Functions or
Percent Point Functions) return the Value X at
which P(X) specified cumulative probability
given that particular distribution.
3
Recall the normal distribution
4
So for the specified cumulative probability of
0.025, the value of X turns out to be -1.96. In
other words, ?(-1.96) 0.025, so ?-1(0.025)
-1.96 Notice, then, that ?-1 is producing the
quantile function, Q
5
(No Transcript)
6
Idealised Samples
These are produced by first defining a sequence
of n probability points using
i ½ n
pi
i 1,2,.n
Note that these are evenly spaced between 0 and 1

7
For example, for n 20, we have p1, p2, .p20
equal to 0.025, 0.075, 0.975 Probability
points may be produced with the R function
ppoints
8
(No Transcript)
9
Now suppose that a distribution has quantile
function Q. Then, if the random variable Z has qf
Q, from the original definition, P(Z Q(pi) )
pi i 1.. n Hence,
Q(p1).Q(pn) may be regarded as an idealised
sample of size n.
10
The Uniform Distribution
1
0
1
11
The Uniform Distribution
The quantile function is easy here e.g. Q(0.2)0.2
1
Area 0.2
0
0.2
1
12
The Uniform Distribution
In general, Q(p)p
1
Area p
0
p
1
13
Hence p1. pn may be regarded as an idealised
sample of size n from the U(0, 1) distribution.
14
The Normal Distribution
We have already seen that the normal distribution
has quantile function given by ?-1. Thus an
idealised N(0, 1) sample of size 20, will be
?-1(0.025), ?-1(0.075), ?-1(0.125), ..
15
In R the result is found with qnorm
16
A histogram can be plotted using hist(z)
17
Exponential Distribution
The Exp(1) distribution has quantile function Q
given by Q(p) - ln(1 - p)
(exercise!) In R it is given by the function
qexp. Thus an idealised Exp(1) sample of size
20, together with its histogram are as follows
18
(No Transcript)
19
(No Transcript)
20
Empirical Quantile Functions
Suppose that we have a set of n observations y1,
yn of a variable y. Let y(1) .........
y(n) be their order statistics. Analogously to
earlier work we define the empirical quantile
function Qe of these data by Qe(pi)
y( i ) with linear interpolation for other
values of p in the interval (0, 1).
21
Thus, for any p, Qe(p) is the essentially the
value of the variable, y, below which a fraction,
p, of the observations lie. Example the three
quartiles of the distribution of the data are
given by Qe(1/4), Qe(1/2) and Qe(3/4).
22
Example
Maximum daily ozone concentrations. The R data
frame ozone is made available on the module web
pages. It gives maximum daily ozone
concentrations, in parts per billion, measured at
Stamford, Connecticut and Yonkers, New York,
during the period May - September one year. Data
are given only for the 132 days (out of a total
of 152) on which readings were successfully
obtained at both locations.
23
(No Transcript)
24
(No Transcript)
25
(No Transcript)
26
(No Transcript)
27
The sorted Stamford data are given by
28
The R code gt hist(stamford, xlab"ozone
concentration", main"Stamford ozone
concentrations") gt plot(density(stamford),
main"density estimate for Stamford ozone
concentrations") produces the histogram and
(kernel) density estimate shown in the next
slides. (For details of the latter, see 5.6 of
Venables and Ripley (2002).)
29
(No Transcript)
30
(No Transcript)
31
A simple summary is given by gt
summary(stamford) Min. 1st Qu. Median Mean
3rd Qu. Max. 14.00 50.50 80.00 89.87
119.80 240.00 while much the same is achieved
with the use of the R (empirical) quantile
function, which by default evaluates this at p
(0.00, 0.25, 0.50, 0.75, 1.00) gt
quantile(stamford) 0 25 50 75
100 14.00 50.50 80.00 119.75 240.00
32
Both the histogram and the summaries show the
data to be quite positively skewed. This is quite
typical of data constrained to be positive. We
can specify any vector of points at which to
evaluate the quantile function gt p
seq(0,1,0.1) gt p 10.0 0.1 0.2 0.3 0.4 0.5
0.6 0.7 0.8 0.9 1.0
33
(No Transcript)
34
Q-Q Plots
We wish to investigate whether observations y1
. yn may reasonably be regarded as a random
sample from some theoretical distribution. A Q-Q
plot compares the empirical quantile function Qe
of the data with the theoretical quantile
function Q of the distribution.
35
We plot y(i) ( Qe(pi)) against Q(pi) for the
probability points p1,p2 etc. as defined
earlier. If the data follow an approximately
straight line with slope 1 and intercept 0, the
observed values are said to be compatible with
the theoretical distribution. i.e. y (i) Q(pi)
36
Refinement
If the data do not lie on the specific line just
mentioned, but instead lie on a line with slope b
and intercept a, i.e. y(i) a bQ(pi) This
indicates compatibility with the appropriate
distribution scaled by b and shifted by a.
37
For example if Q is the quantile function of the
N(0,1) distribution, then the relation suggests
compatibility with the N(a,b2) distribution.
38
Stamford Ozone Concentrations
We check for normality. A normal Q-Q plot
compares empirical qf of the data with the qf of
the N(0,1) distribution. The R function qqnorm
constructs the plot.
39
(No Transcript)
40
(No Transcript)
41
The function qqline adds the straight line y a
bx corresponding to the normal distribution
N(a,b2) whose lower and upper quartiles match
those of the data. (This is a resistant fit,
unaffected by one or two extreme observations).
42
(No Transcript)
43
The lack of linearity in the plotted data
suggests that they are not well modelled by a
normal distribution. Rather, the convex shape
indicates positive skewness (as was seen in the
histogram). We investigate whether a
transformation would help. A check is carried out
as to whether the square roots of the Stamford
ozone concentrations can be modelled by a normal
distribution.
44
(Notice that, when using R, xy gives the yth
power of x, and, of course, a power of 0.5 gives
a square root).
45
(No Transcript)
46
The plotted data are reasonably well fitted by
the straight line which has slope 2.844 and
intercept 9.0. Thus the data are reasonably well
modelled by the N(9, 2.8442) distribution (whose
lower and upper quartiles match those of the
data). Note that the sample mean and standard
deviation are 9.1 and 2.715 respectively.
47
Note The equation of the straight line can be
obtained by inspection, or more accurately by
using the two points that it is defined by,
i.e. (Q(1/4), Qe(1/4)), (Q(3/4),
Qe(3/4)) Appropriate R commands give these
as (-0.675, 7.106) and (0.675, 10.940)
48
Example
The R vector abbey in the package MASS gives 31
determinations of nickel content (µg g-1) in
a Canadian syenite rock.
49
We check whether the data are reasonably modelled
by an exponential distribution. There is no
predefined function in R to construct an
exponential Q-Q plot, so we have to work from
first principles.
50
qexp(ppoints(31)) gives the theoretical quantile
values at 31 probability points.
sort(abbey) gives the sorted experimental values
51
The following command produces a Q-Q plot with
axes labelled accordingly.
52
(No Transcript)
53
If we ignore the highest observation, the data
appear to be reasonably compatible with an
exponential distribution with mean around 12.5
(exp(0.08)).
54
(No Transcript)
55
However, the probability that the highest of 31
independent observations from an exponential
distribution of mean 12.5 is as great as 125 is
only 0.0014, so there must be considerable doubt
about the suitability of an exponential model
here. A further possibility is that some
transformation of the data may be modelled by a
normal distribution. Try this out in the
practicals.
56
Boxplots
These are useful for comparing distributions of
the same variable. The R code
produces the following boxplot
57
(No Transcript)
58
Recall that the lower and upper ends of each box
give the first and third quartiles of the
corresponding distribution and the centre line
indicates the median.
Write a Comment
User Comments (0)
About PowerShow.com