Title: Descriptive Statistics
1Descriptive Statistics
Spatial Statistics (SGG 2413)
- Assoc. Prof. Dr. Abdul Hamid b. Hj. Mar Iman
- Director
- Centre for Real Estate Studies
- Faculty of Engineering and Geoinformation Science
- Universiti Teknologi Malaysia
- Skudai, Johor
2Learning Objectives
- Overall: To give students a basic understanding of descriptive statistics
- Specific: Students will be able to
- understand the basic concept of descriptive statistics
- understand the concept of distribution
- calculate measures of central tendency and dispersion
- calculate measures of kurtosis and skewness
3Contents
- What is descriptive statistics
- Central tendency, dispersion, kurtosis, skewness
- Distribution
4Descriptive Statistics
- Use sample information to explain/make abstraction of population phenomena.
- Common phenomena
- Association (e.g. σ1,2.3 = 0.75)
- Tendency (left-skew, right-skew)
- Trend, pattern, location, dispersion, range
- Causal relationship (e.g. if X then Y)
- Emphasis on meaningful characterisation of data (e.g. central tendency, variability), graphics, and description
- Use non-parametric analysis (e.g. χ², t-test, 2-way anova)
5E.g. of Abstraction of phenomena
6Inferential Statistics
- Using sample statistics to infer some phenomena of population parameters
- Common phenomena: cause-and-effect
- One-way r/ship: Y = f(X)
- Feedback r/ship: Y1 = f(Y2, X, e1); Y2 = f(Y1, Z, e2)
- Recursive: Y1 = f(X, e1); Y2 = f(Y1, Z, e2)
- Use parametric analysis (e.g. α and β) through regression analysis
- Emphasis on hypothesis testing
7Parametric statistics
- Statistical analysis that attempts to explain the population parameter using a sample
- E.g. of statistical parameters: mean, variance, std. dev., R², t-value, F-ratio, ?xy, etc.
- It assumes that the distributions of the variables being assessed belong to known parameterised families of probability distributions
8Examples of parametric relationship
[Figure: examples of parametric relationships; surviving labels: Dep9t 215.8, Dep7t 192.6]
9Non-parametric statistics
- First used by Wolfowitz (1942)
- Statistical analysis that attempts to explain the population parameter using a sample without making assumptions about the frequency distribution of the assessed variable
- In other words, the variable being assessed is distribution-free
- E.g. of non-parametric statistics: histogram, stochastic kernel, non-parametric regression
10Descriptive & Inferential Statistics (DS & IS)
- DS gather information about a population characteristic (e.g. income) and describe it with a parameter of interest (e.g. mean)
- IS uses the parameter to test a hypothesis pertaining to that characteristic. E.g.
- Ho: mean income = RM 4,000
- H1: mean income < RM 4,000
- The result of hypothesis testing is used to make inference about the characteristic of interest (e.g. Malaysian ? upper middle income)
11Sample Statistics - Central Tendency
12Central Tendency - Mean
- For individual observations, x̄ = Σx/n. E.g.
- X = 3, 5, 7, 7, 8, 8, 8, 9, 9, 10, 10, 12
- Σx = 96; n = 12
- Thus, x̄ = 96/12 = 8
- The above observations can be organised into a frequency table and the mean calculated on the basis of frequencies: x̄ = Σfx/Σf
- Σfx = 96; Σf = 12
- Thus, x̄ = 96/12 = 8
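As a minimal sketch of the two routes to the mean, the Python snippet below uses the twelve observations listed on this slide. The variable names are illustrative only, and the frequency table is rebuilt from the raw values with `collections.Counter` because the original table did not survive.

```python
from collections import Counter

# Observations listed on the slide
x = [3, 5, 7, 7, 8, 8, 8, 9, 9, 10, 10, 12]

# Mean from individual observations: sum of x divided by n
mean_raw = sum(x) / len(x)

# Mean from a frequency table: sum of f*x divided by sum of f
freq = Counter(x)  # maps each value to its frequency
mean_freq = sum(value * f for value, f in freq.items()) / sum(freq.values())

print(mean_raw, mean_freq)
```

Both routes divide the same total (96) by the same count (12), so they must agree.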
13Central Tendency - Mean and Mid-point
- Let's say we have data like this
Price (RM 000/unit) of Shop Houses in Skudai
Town    Min    Max
A       228    450
B       320    430
Can you calculate the mean?
14Central Tendency - Mean and Mid-point (contd.)
- Let's calculate
- Town A: (228 + 450)/2 = 339
- Town B: (320 + 430)/2 = 375
- Are these figures means?
M = ½(Min + Max)
15Central Tendency - Mean and Mid-point (contd.)
- Let's say we have price data as follows
- Town A: 228, 295, 310, 420, 450
- Town B: 320, 295, 310, 400, 430
- Calculate the means
- Town A: (228 + 295 + 310 + 420 + 450)/5 = 340.6
- Town B: (320 + 295 + 310 + 400 + 430)/5 = 351
- Are the results the same as previously?
- ⇒ Be careful about mean and mid-point!
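The contrast between the mean and the mid-point can be checked with a short Python sketch. The `midpoint` helper is a hypothetical name, and it takes the minimum and maximum from the listed prices themselves, so for Town B it uses 295 (the smallest listed value) as the minimum.

```python
from statistics import mean

# Price data (RM 000/unit) as listed on the slide
prices = {
    "Town A": [228, 295, 310, 420, 450],
    "Town B": [320, 295, 310, 400, 430],
}

def midpoint(values):
    """Mid-point of the range: half of (min + max)."""
    return (min(values) + max(values)) / 2

for town, values in prices.items():
    print(town, "mean =", mean(values), "mid-point =", midpoint(values))
```

The mean uses every observation, while the mid-point depends only on the two extremes, which is why the two figures generally differ.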
16Central Tendency - Mean of Grouped Data
- House rental or prices in the PMR are frequently tabulated as a range of values. E.g.
- What is the mean rental across the areas?
- Σf = 23; Σfx = 3317.5
- Thus, x̄ = 3317.5/23 = 144.24
17Central Tendency - Median
- Let's say house rentals in a particular town are tabulated
- Calculation of median rental needs a graphical aid (the ogive)
- 1. Median position = (n + 1)/2 = (25 + 1)/2 = 13th Taman
- 2. (i.e. between 10 and 15 points on the vertical axis of the ogive).
- 3. Corresponds to RM 140-145/month on the horizontal axis
- 4. There are (17 - 8) = 9 Taman in the range of RM 140-145/month
- 5. The 13th Taman is 5th out of the 9 Taman
- 6. The rental interval width is 5
- 7. Therefore, the median rental can be calculated as 140 + (5/9 x 5) = RM 142.8
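The interpolation in the steps above can be written as a small Python function. This is only a sketch: `grouped_median` and its parameter names are hypothetical, and the inputs (lower class limit 140, 8 Taman below the class, 9 Taman inside it, class width 5, n = 25) are the figures quoted in the steps.

```python
def grouped_median(lower, cum_below, freq, width, n):
    """Interpolate the median inside the class that contains it.

    lower     : lower limit of the median class
    cum_below : cumulative frequency below the median class
    freq      : frequency of the median class
    width     : width of the median class
    n         : total number of observations
    """
    position = (n + 1) / 2  # rank of the median observation
    return lower + (position - cum_below) / freq * width

median_rental = grouped_median(lower=140, cum_below=8, freq=9, width=5, n=25)
print(round(median_rental, 1))
```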
18Central Tendency - Median (contd.)
19Central Tendency - Quartiles (contd.)
Following the same process as in calculating the median:
Upper quartile = ¾(n + 1) = 19.5th Taman; UQ = 145 + (3/7 x 5) = RM 147.1/month
Lower quartile = (n + 1)/4 = 26/4 = 6.5th Taman; LQ = 135 + (3.5/5 x 5) = RM 138.5/month
Inter-quartile = UQ - LQ = 147.1 - 138.5 = 8.6th Taman; IQ = 138.5 + (4/5 x 5) = RM 142.5/month
20Variability
- Indicates dispersion, spread, variation, deviation
- For single population or sample data: σ² = Σ(xi - µ)²/n and s² = Σ(xi - x̄)²/(n - 1)
- where σ² and s² = population and sample variance respectively, xi = individual observations, µ = population mean, x̄ = sample mean, and n = total number of individual observations.
- The square roots are σ = population standard deviation and s = sample standard deviation
21Variability (contd.)
- Why is a measure of dispersion important?
- Consider yields of two plant species
- Plant A (ton): 1.8, 1.9, 2.0, 2.1, 3.6
- Plant B (ton): 1.0, 1.5, 2.0, 3.0, 3.9
- Mean A = mean B = 2.28
- But, different variability!
- Var(A) = 0.557, Var(B) = 1.367
- Would you choose to grow plant A or B?
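The plant-yield comparison can be reproduced with Python's standard `statistics` module. This sketch assumes the sample variance (divisor n - 1), which is the version that matches the variances quoted on this slide.

```python
from statistics import mean, variance

# Yields (ton) as listed on the slide
plant_a = [1.8, 1.9, 2.0, 2.1, 3.6]
plant_b = [1.0, 1.5, 2.0, 3.0, 3.9]

mean_a, mean_b = mean(plant_a), mean(plant_b)

# statistics.variance uses the sample formula (divides by n - 1)
var_a, var_b = variance(plant_a), variance(plant_b)

print(round(mean_a, 2), round(mean_b, 2))
print(round(var_a, 3), round(var_b, 3))
```

The means coincide, but Plant B's yields are spread far more widely around that mean.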
22Variability (contd.)
- Coefficient of variation (CV) = std. deviation as % of the mean
- A better measure compared to std. dev. in cases where samples have different means. E.g.
- Plant X (ton/ha): 1.2, 1.4, 2.6, 2.7, 3.9
- Plant Y (ton/ha): 1.4, 1.5, 2.1, 3.2, 3.9
23Variability (cont.)
Calculate CV for both species.
CVx = (1.097/2.36) x 100 = 46.5%
CVy = (1.094/2.42) x 100 = 45.2%
⇒ Species X is a little more variable than species Y
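A hedged sketch of the CV calculation in Python follows. It applies the definition directly (sample standard deviation as a percentage of the mean); the helper name `coefficient_of_variation` is hypothetical.

```python
from statistics import mean, stdev

# Yields (ton/ha) as listed for the two species
plant_x = [1.2, 1.4, 2.6, 2.7, 3.9]
plant_y = [1.4, 1.5, 2.1, 3.2, 3.9]

def coefficient_of_variation(data):
    """Sample standard deviation expressed as a percentage of the mean."""
    return stdev(data) / mean(data) * 100

cv_x = coefficient_of_variation(plant_x)
cv_y = coefficient_of_variation(plant_y)
print(round(cv_x, 1), round(cv_y, 1))
```

Because the CV is unit-free, it stays comparable even when the two samples have different means.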
24Variability (cont.)
- Std. dev. of a frequency distribution
- E.g. age distribution of second-home buyers
(SHB)
25Probability distribution
- Logical probability: If there are 20 lecturers, the probability that A becomes a professor is p = 1/20 = 0.05
- Experiential probability: Out of 100 births, half of them were girls (p = 0.5); as the number increased to 1,000, two-thirds were girls (p = 0.67); but from a record of 10,000 new-born babies, three-quarters were girls (p = 0.75)
- Subjective probability: The probability of a drug addict recovering from addiction is 50:50
- General rule: Pr (event X) = (No. of times event X occurs) / (Total number of occurrences)
- Probability of certain event X to occur has a specific form of distribution
26Probability Distribution
Classical example of
tossing
What is the distribution of the sum of tosses?
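The object being tossed did not survive in this transcript. As an assumption for illustration only, the Python sketch below takes it to be two fair six-sided dice and tabulates the distribution of their sum using the general rule (favourable outcomes over total outcomes).

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)  # assumption: fair six-sided dice

# Count how many of the 36 equally likely outcomes give each sum
counts = Counter(a + b for a, b in product(faces, repeat=2))
total = sum(counts.values())

# Probability of each sum as an exact fraction
dist = {s: Fraction(c, total) for s, c in sorted(counts.items())}

for s, p in dist.items():
    print(s, p, "#" * counts[s])
```

The printed bars rise to a peak at 7 and fall away symmetrically, and the probabilities add up to exactly 1.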
27Probability Distribution (contd.)
Discrete variable
Values of x are discrete (discontinuous)
Sum of lengths of vertical bars = Σ p(X = x) = 1, summed over all x
28Probability Distribution (cont.)
Continuous variable
Age distribution of second-home buyers in probability histogram
Mean = 39.5; Std. dev = 2.45
Pr (Area under curve) = 1
29Probability Distribution (cont.)
- Pr (Age ≤ 36) = 0.02
- Pr (Age ≤ 37) = Pr (Age ≤ 36) + Pr (Age = 37) = 0.02 + 0.07 = 0.09
- Pr (Age ≤ 38) = Pr (Age ≤ 37) + Pr (Age = 38) = 0.09 + 0.04 = 0.13
- Pr (Age ≤ 39) = Pr (Age ≤ 38) + Pr (Age = 39) = 0.13 + 0.18 = 0.31
- Pr (Age ≤ 40) = Pr (Age ≤ 39) + Pr (Age = 40) = 0.31 + 0.36 = 0.67
- Pr (Age ≤ 41) = Pr (Age ≤ 40) + Pr (Age = 41) = 0.67 + 0.14 = 0.81
- Pr (Age ≤ 42) = Pr (Age ≤ 41) + Pr (Age = 42) = 0.81 + 0.10 = 0.91
- Pr (Age ≤ 43) = Pr (Age ≤ 42) + Pr (Age = 43) = 0.91 + 0.09 = 1.00
⇒ Cumulative probability corresponds to the left tail of a distribution
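The running totals on this slide are a cumulative sum of the individual age probabilities. A minimal Python sketch, using the probabilities quoted above and rounding to two decimals to suppress floating-point noise:

```python
from itertools import accumulate

ages = [36, 37, 38, 39, 40, 41, 42, 43]
probs = [0.02, 0.07, 0.04, 0.18, 0.36, 0.14, 0.10, 0.09]

# Running total of the individual probabilities
cumulative = [round(c, 2) for c in accumulate(probs)]

for age, p, c in zip(ages, probs, cumulative):
    print(f"Pr(Age <= {age}) = {c:.2f}   (Pr(Age = {age}) = {p:.2f})")
```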
30Probability Distribution (cont.)
Larger sample
- As larger and larger samples are drawn, the probability distribution gets smoother
- Tens of different types of probability distribution: Z, t, F, gamma, etc.
- Most important: normal distribution
Very large sample
31Normal Distribution - ND
- Salient features of ND
- Bell-shaped, symmetrical
- Total area under curve = 1
- Area under curve between any two points = prob. of values in that range (shaded area)
- Prob. of any exact value = 0
- Has a function of f(x) = (1/(σ√(2π))) e^(-(x - µ)²/(2σ²))
where µ = mean of variable x; σ = std. dev. of x; π = ratio of circumference of a circle to its diameter = 3.14; e = base of natural log = 2.71828.
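To make the features of the normal curve concrete, here is a small Python sketch of the standard normal density function with a crude numerical check that the area under it is 1. The function name `normal_pdf`, the integration range (-8 to 8) and the step size are illustrative choices, not part of the lecture.

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    coeff = 1 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Height of the standard normal curve at its peak
peak = normal_pdf(0, 0, 1)

# Crude Riemann sum of the area under the curve from -8 to 8
step = 0.01
area = sum(normal_pdf(-8 + i * step, 0, 1) * step for i in range(1601))

print(round(peak, 5), round(area, 6))
```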
32Normal Distribution - ND
[Figure: normal curves for Population 1 and Population 2, with means µ1, µ2 and std. devs σ1, σ2]
A larger population has narrower base (smaller variance)
µ determines location while σ determines shape of ND
33Normal Distribution (cont.)
Has a mean µ and a variance σ², i.e. X ~ N(µ, σ²)
Has the following distribution of observations
Home-buyers example
Mean age = 39.3; Std. dev = 2.42
34Standard Normal Distribution (SND)
- Since different populations have different µ and σ (thus, locations and shapes of distribution), they have to be standardised.
- Most common standardisation: standard normal distribution (SND), also called Z-distribution
- ?(Xx) is given by area under curve
- Has no standard algebraic method of integration
- ⇒ Z ~ N(0,1)
- To transform f(x) into f(z): Z = (x - µ)/σ ~ N(0, 1)
35Z-Distribution
- Probability is distributed in such a way that
- Approx. 68%: -1 < z < 1
- Approx. 95%: -1.96 < z < 1.96
- Approx. 99%: -2.58 < z < 2.58
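These three central areas can be checked without a Z-table using `statistics.NormalDist` from the Python standard library. The helper name `central_area` is hypothetical.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, std. dev. 1

def central_area(k):
    """Probability that -k < Z < k."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 1.96, 2.58):
    print(k, round(central_area(k), 4))
```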
36Z-distribution (cont.)
- When X = µ, Z = 0, i.e. Z = (µ - µ)/σ = 0
- When X = µ + σ, Z = 1
- When X = µ + 2σ, Z = 2
- When X = µ + 3σ, Z = 3 and so on.
- It can be proven that P(X1 < X < Xk) = P(Z1 < Z < Zk)
- SND shows the probability to the right of any particular value of Z.
37Normal distribution - Questions
- A study found that the mean age, A, of second-home buyers in Johor Bahru is 39.3 years old with a std. dev. of 2.45 years. Assuming normality, how sure are you that the mean age is (a) ≥ 40 years old (b) 39 to 42 years old?
- Answer (a): P(A ≥ 40)
- = P[Z ≥ (40 - 39.3)/2.4]
- = P(Z ≥ 0.2917 ≈ 0.3000)
- = 0.3821
- (b) P(39 ≤ A ≤ 42)
- = 1 - P(A ≤ 39) - P(A ≥ 42)
- = 1 - 0.45224 - P[Z ≥ (42 - 39.3)/2.4]
- = 1 - 0.45224 - P(Z ≥ 1.125)
- = 1 - 0.45224 - 0.12924
- = 0.4185
Use Z-table!
Always remember to convert to SND: subtract the mean and divide by the std. dev.
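As a cross-check on this kind of Z-table lookup, the two probabilities can be computed directly from the normal CDF. The sketch below assumes a mean of 39.3 and the std. dev. of 2.4 used in the worked answer. Exact CDF values differ slightly from table values because the table rounds Z to two decimals.

```python
from statistics import NormalDist

# Mean and std. dev. as used in the worked answer
age = NormalDist(mu=39.3, sigma=2.4)

# (a) probability that age is at least 40
p_a = 1 - age.cdf(40)

# (b) probability that age lies between 39 and 42
p_b = age.cdf(42) - age.cdf(39)

print(round(p_a, 4), round(p_b, 4))
```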
38Student's t-Distribution
- Similar to Z-distribution (bell-shaped, symmetrical)
- Has a function of f(t) = [Γ((v + 1)/2) / (√(vπ) Γ(v/2))] (1 + t²/v)^(-(v + 1)/2)
- where Γ = gamma function; v = n - 1 = d.o.f.; π = 3.142
- Flatter with thicker tails
- Distributed with t ~ (0, σ) and -∞ < t < ∞
- As n → ∞, t ~ (0, σ) → N(0,1)
- Probability calculation requires information on d.o.f.
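The thicker tails, and the convergence to the Z-distribution as the d.o.f. grow, can be seen numerically. This Python sketch implements the standard Student's t density, with `math.lgamma` (log of the gamma function) used to avoid overflow for large d.o.f. The function names `t_pdf` and `z_pdf` are hypothetical.

```python
import math

def t_pdf(t, v):
    """Student's t density with v degrees of freedom."""
    log_coeff = (math.lgamma((v + 1) / 2) - math.lgamma(v / 2)
                 - 0.5 * math.log(v * math.pi))
    return math.exp(log_coeff - (v + 1) / 2 * math.log1p(t * t / v))

def z_pdf(z):
    """Standard normal density, for comparison."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

# Compare peak height (t = 0) and tail height (t = 3) as d.o.f. increase
for v in (1, 5, 30, 1000):
    print("v =", v, round(t_pdf(0, v), 4), round(t_pdf(3, v), 4))
print("z    ", round(z_pdf(0), 4), round(z_pdf(3), 4))
```

With few d.o.f. the peak is lower and the tail at t = 3 is noticeably higher than for Z; by v = 1000 the two curves are practically indistinguishable.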
39How Are t-dist. and Z-dist. Related?
- Using central limit theorem, x̄ ~ N(µ, σ²/n) will become z ~ N(0, 1) as n → ∞
- ⇒ For a large sample, t-dist. of a variable or a parameter is given by t = (x̄ - µ)/(s/√n)
- The interval of critical values for variable x is x̄ ± t·s/√n
40Skewness, m3 & Kurtosis, m4
- Skewness, m3 measures degree of symmetry of distribution
- Kurtosis, m4 measures its degree of peakedness
- Both are useful when comparing sample distributions with different shapes
- Useful in data analysis
41Skewness
42Kurtosis
Mesokurtic distribution: kurtosis = 3
Leptokurtic distribution: kurtosis > 3
Platykurtic distribution: kurtosis < 3
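The slide formulas for skewness and kurtosis did not survive extraction. The Python sketch below therefore assumes the common moment-based definitions: skewness = m3/m2^1.5 and kurtosis = m4/m2², where mk is the k-th central moment. On this scale a normal distribution has kurtosis 3. The function names are hypothetical; the sample data are the Plant A yields from the variability slides.

```python
def central_moment(data, k):
    """k-th central moment: average of (x - mean)^k."""
    m = sum(data) / len(data)
    return sum((x - m) ** k for x in data) / len(data)

def skewness(data):
    """Moment-based skewness: m3 / m2^1.5 (0 for symmetric data)."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    """Moment-based kurtosis: m4 / m2^2 (3 for a normal distribution)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2

plant_a = [1.8, 1.9, 2.0, 2.1, 3.6]  # yields (ton) from the variability example
print(round(skewness(plant_a), 3), round(kurtosis(plant_a), 3))
```

A positive skewness for Plant A reflects the single high yield (3.6) pulling the right tail out.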
43Occurrence of ganoderma
44Aluminium residues in the soil
E.g. Al2 + H2O → Al2O + H2
45Measures of spatial separation
- E.g. WCM = ((545.10 - 542.86)² + (105.90 - 105.48)²)^0.5
- = (5.0176 + 0.1764)^0.5
- = 2.28 (i.e. 2,280 m)
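The separation above is a straight-line (Euclidean) distance between two coordinate pairs. A minimal Python sketch, assuming the coordinates are in kilometres (which is what the conversion of 2.28 to 2,280 m implies); the point names `p1` and `p2` are illustrative.

```python
import math

# Coordinate pairs from the example (assumed to be in km)
p1 = (545.10, 105.90)
p2 = (542.86, 105.48)

# Euclidean distance: square root of the summed squared differences
distance_km = math.hypot(p1[0] - p2[0], p1[1] - p2[1])

print(round(distance_km, 2), "km =", round(distance_km * 1000), "m")
```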
46Spatial distribution
Occurrence of ganoderma
47Spatial distribution point data
Ethnic distribution of residence
48Ethnic distribution of residence