Title: CPSC 601.04
1CPSC 601.04
- Statistical Analysis in GIS
- Dr. M. Gavrilova
2Overview
- Importance of correct data representation
- Variance and covariance
- Autocorrelation
- Applications to pattern analysis and geometric
modeling
3Overuse of color and dimensionality
Four colors, three dimensions, and two plots to
visualize five data points
http://www.math.yorku.ca/SCS/Gallery/
4Misleading data axis
5Overcrowded data
Steven Skiena, Stony Brook, NY
http://www.cs.sunysb.edu/skiena
6Time increasing over time
http://www.math.yorku.ca/SCS/Gallery/
7Scatterplot: linear or logarithmic?
Results of a World Values Survey poll of
happiness among people throughout the world,
plotted against an economic measure, GNP per
capita. Many countries, particularly those in
Latin America, had higher marks for happiness
than their economic situation would predict.
This conclusion rests on the assumption that
happiness should be linearly related to GNP.
8GIS goals
- An organized collection of computer hardware,
software, geographic data, and personnel designed
to efficiently capture, store, update,
manipulate, analyze, and display all forms of
geographically referenced data.
9Spatial Analysis
- Provides
- an efficient and generally reliable means of
obtaining knowledge about spatial processes,
- a way of maximizing our knowledge of spatial
processes with the minimum of error.
10Spatial processes
- Spatial Data
- location and attribute → P_i (x, y, z)
- Spatial Stochastic Processes
- statistics and inference
- Spatial is special
- spatial autocorrelation
- spatial non-stationarity
- proximity
11Examples of data analysis
The Space Shuttle Challenger exploded shortly
after take-off in January 1986. Cause: failure of
the O-ring seals used to isolate the fuel supply
from burning gases. Graph from the Report of the
Presidential Commission on the Space Shuttle
Challenger Accident, 1986. NASA staff had
analysed the data on the relation between
temperature and number of O-ring failures (out of
6), but they had excluded observations where no
O-rings failed, believing that they were
uninformative. Those excluded observations,
showing no failure, were mainly at warm
temperatures (65-80 degF).
12Better graph: curve fitting
Apart from the disastrous omission of the
observations with 0 failures, the original graph
can be improved by 1. drawing a smoothed curve to
fit the points and 2. removing the background
grid, which obscures the data. This gives a
graph which shows excessive risks associated with
both high and low temperatures.
13Logistic regression model
14Challenger disaster
- Reanalysis of the O-ring data involved fitting a
logistic regression model. This provides a
predicted extrapolation (black curve) of the
probability of failure to the low (31 degF)
temperature at the time of the launch and
confidence bands on that extrapolation (red
curves). See also Tappin, L. (1994). "Analyzing
data relating to the Challenger disaster".
Mathematics Teacher, 87, 423-426.
- There's not much data at low temperatures (the
confidence band is quite wide), but the predicted
probability of failure is uncomfortably high.
Would you take a ride on Challenger when the
weather is cold?
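The logistic regression model mentioned above can be written in its standard textbook form, shown here only as a sketch. The symbols are assumptions introduced for illustration: t is the launch temperature in degF, p(t) is the probability of O-ring failure, and β0, β1 are the coefficients to be fitted. The fitted values from the reanalysis are not given in these slides.

```latex
p(t) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 t)}}
\qquad\Longleftrightarrow\qquad
\log\frac{p(t)}{1 - p(t)} = \beta_0 + \beta_1 t
```

A negative β1 corresponds to a failure probability that rises as the temperature drops, which is the direction of the extrapolation to 31 degF.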
15Good examples
The French engineer, Charles Minard (1781-1870),
illustrated the disastrous result of Napoleon's
failed Russian campaign of 1812. The graph shows
the size of the army by the width of the band
across the map of the campaign on its outward and
return legs, with temperature on the retreat
shown on the line graph at the bottom. Many
consider Minard's original the best statistical
graphic ever drawn.
16Florence Nightingale's Coxcomb diagrams
17Escaping the 2D
18Definitions: statistical variables
- Samples and populations consist of individuals.
- Values of certain attributes are called
observations (e.g. age, income).
- Attributes vary across individuals, and they
are called variables.
- Variables are described by distributions and
their parameters (e.g. Normal, Poisson, ...).
- A random variable X assumes its value according
to the outcome of a chance experiment (coin,
dice).
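19Definitions: Variance
- Variance is the sum of squared deviations from
the mean divided by the number of observations n
(population variance) or by n-1 (sample variance).
Sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
Population variance: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$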
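The two divisors can be compared directly in a minimal Python sketch. The function name `variance`, its `sample` flag, and the list `ages` are hypothetical and introduced only for illustration; the numbers are made up and are not data from the lecture.

```python
def variance(values, sample=True):
    """Sum of squared deviations from the mean, divided by n-1 (sample) or n (population)."""
    n = len(values)
    if n == 0 or (sample and n < 2):
        raise ValueError("not enough observations")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n - 1 if sample else n)


# Made-up observations, for illustration only.
ages = [2, 4, 4, 4, 5, 5, 7, 9]
print("population variance:", variance(ages, sample=False))
print("sample variance:    ", variance(ages, sample=True))
```

The sample variance is always slightly larger than the population variance for the same values, because it divides the same sum by n-1 instead of n.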
20Autocorrelation
- Spatial autocorrelation is a measure of the
similarity of objects within an area.
- Jay Lee and Louis K. Marion, 2001
21Moran's Index
- The formula to compute Moran's index is the
following:
- where n is the number of individual points,
- A: area of the bounding polygon, i.e. the total
area of the map including all points
- z_i: value of the parameter (attribute) measured
for point i
22Features
- w_ij is computed according to the following rule,
where min(d_ij) is the smallest of all distances
between all pairs of points computed
- In this formula, distance d_ij is computed
according to the formulas for Euclidean, supremum
or Manhattan metrics. Since d_ii is equal to 0,
w_ii would become infinite, so cases where i = j
should be excluded. This results in n^2 - n
ordered pairs of points.
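The three distance metrics named above, and the count of pairs left after excluding i = j, can be sketched in Python as follows. The function names are hypothetical and chosen only for this illustration.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def supremum(p, q):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(p, q))


def ordered_pairs(n):
    """All ordered index pairs (i, j) with i != j; there are n*n - n of them."""
    return [(i, j) for i in range(n) for j in range(n) if i != j]


p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q), manhattan(p, q), supremum(p, q))
print(len(ordered_pairs(4)))
```

For any pair of points the supremum distance is the smallest of the three and the Manhattan distance the largest, so the choice of metric changes which pairs count as close.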
23Selecting pairs of points
- The sum over all i, j means that ALL ORDERED PAIRS
of points should be considered by the formula,
i.e. pair (i, j) and pair (j, i) are counted
separately.
- Sometimes, only pairs of sample points within a
specific distance from each other are considered.
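A minimal Python sketch of the pair selection described above, using the commonly cited textbook form of Moran's index, I = (n / Σ w_ij) · Σ w_ij (z_i - z̄)(z_j - z̄) / Σ (z_i - z̄)^2. Several things here are assumptions rather than the lecture's own definitions: the weight rule w_ij = 1 / d_ij, the use of Euclidean distance, and the names `morans_i` and `max_distance`. This textbook form does not use the area A. Points are assumed to be distinct, so that no d_ij is zero.

```python
import math


def morans_i(points, z, max_distance=None):
    """Moran's index over all ordered pairs (i, j), i != j.

    Assumed weight rule: w_ij = 1 / d_ij (Euclidean d_ij).
    If max_distance is given, pairs farther apart are skipped.
    """
    n = len(points)
    z_mean = sum(z) / n
    dev = [value - z_mean for value in z]

    weighted_cross_products = 0.0
    weight_sum = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:  # d_ii = 0 would give an infinite weight
                continue
            d = math.dist(points[i], points[j])
            if max_distance is not None and d > max_distance:
                continue
            w = 1.0 / d
            weight_sum += w
            weighted_cross_products += w * dev[i] * dev[j]

    sum_sq = sum(v * v for v in dev)
    if weight_sum == 0.0 or sum_sq == 0.0:
        raise ValueError("no usable pairs, or attribute is constant")
    return (n / weight_sum) * (weighted_cross_products / sum_sq)


# Made-up points on a line, for illustration only.
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
clustered = [1.0, 1.0, 5.0, 5.0]    # similar values sit next to each other
alternating = [1.0, 5.0, 1.0, 5.0]  # neighbours always differ
print(morans_i(pts, clustered, max_distance=1.0))
print(morans_i(pts, alternating, max_distance=1.0))
```

With the distance cut-off set to 1, only immediate neighbours contribute: the clustered attribute gives a positive index and the alternating attribute a negative one.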
24Application to pattern analysis
- Example: autocorrelation on a grid.
- Sample points that fall in the same cell are
combined. Size and location of the cell define
the autocorrelation parameters.
- Consider all pairs of GRID CELLS, where XC and YC
now denote coordinates of the center of each grid
cell and the attribute z for each grid cell is the
sum of combined attributes of all points that
belong to this cell.
- Result: insight into pattern analysis and
correlation can be obtained.
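The grid step above can be sketched in Python: points are binned into square cells, attributes are summed per cell, and each occupied cell is represented by its center (XC, YC). The function name `aggregate_to_grid` and the parameters `cell_size` and `origin` are hypothetical, and the square, axis-aligned grid is an assumption made for this illustration.

```python
from collections import defaultdict


def aggregate_to_grid(points, z, cell_size, origin=(0.0, 0.0)):
    """Sum point attributes per square grid cell.

    Returns (centers, totals): the (XC, YC) center of each occupied
    cell and the summed attribute z of the points inside it.
    """
    sums = defaultdict(float)
    for (x, y), value in zip(points, z):
        col = int((x - origin[0]) // cell_size)
        row = int((y - origin[1]) // cell_size)
        sums[(col, row)] += value

    centers, totals = [], []
    for (col, row), total in sorted(sums.items()):
        xc = origin[0] + (col + 0.5) * cell_size
        yc = origin[1] + (row + 0.5) * cell_size
        centers.append((xc, yc))
        totals.append(total)
    return centers, totals


# Made-up sample points and attributes, for illustration only.
sample_points = [(0.2, 0.3), (0.8, 0.9), (1.5, 0.1), (2.7, 2.2)]
sample_z = [1.0, 2.0, 3.0, 4.0]
centers, totals = aggregate_to_grid(sample_points, sample_z, cell_size=1.0)
print(centers)
print(totals)
```

The returned centers and totals can then be passed to an autocorrelation measure in place of the original points and attributes; changing `cell_size` or `origin` changes the result, which is why cell size and location act as parameters of the analysis.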
25Case study 1: Pattern Analysis
- Analysis of instances of patients undergoing
cardiac catheterization, and the location of those
instances, i.e. city blocks.
- Primary question: is the spatial variation of
heart disease a random or non-random pattern?
- Secondary question: what is the relationship
between disease occurrence and social and
demographic factors (Spatial Regression)?
26Set up
- Analysis results are affected by grid size
- prone to subjective choices
- constrained by spatial resolution of data
- Solving the problem by
- using a non-arbitrary grid(s)
- implementing a guided selection of the
square unit area or grid size
27City blocks in Calgary
28Methodology
- Definition of a city-block grid based on the
main division in the city, i.e. using the square
grid centered on the intersection between Center
Street and Center Avenue as the main axes of the
geometric plan thus created.
- Grid regularity decreases as distance increases
from its center.
- L_p norms provide the flexibility to adjust grid
size and shape accordingly.
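For reference, the L_p distance between two planar points a = (x_a, y_a) and b = (x_b, y_b) has the standard form below. The point labels a and b are notation assumed here; how the case study maps a particular p to a grid shape is not spelled out in the slides.

```latex
d_p(a, b) = \left( |x_a - x_b|^p + |y_a - y_b|^p \right)^{1/p}, \quad p \ge 1
\qquad
d_\infty(a, b) = \max\left( |x_a - x_b|,\; |y_a - y_b| \right)
```

Setting p = 1 gives the Manhattan metric, p = 2 the Euclidean metric, and the limit p → ∞ the supremum metric, so varying p moves continuously between a diamond-shaped, a circular and a square neighbourhood around each point.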
29Methodology
- Application of varying L_p norms
- Varying spatial weights for spatial
autocorrelation
- Autocorrelation analysis at varying scales
(CDA, community)
- Data: 2001/1996 census
30Experiments
31Observations
- Sensitivity of Spatial Autocorrelation to
- L_p norm
- spatial weight
- Proposed method useful in determining
- best distance
- best spatial weight
- In context of multivariate spatial regression
- best, i.e. lowest variance
32Results
- The Calgary Journal, Regional publication,
"Researchers link heart disease to urban
lifestyles", on SPARCS activity profile, Oct. 26 -
Nov. 8, 2005
- High risk of heart attack: male, high education,
married
33Case study 2: Oil spill discharge
34Summary statistics
Variable          Cells          Min.  Max.  Mean   St. dev.  Sum     Skew  Kurt.
Oil spill counts  44 (2,741)     0     3     0.02   0.162     53      9.85  113.6
Flight counts     2,151 (2,741)  0     309   13.75  27.12     37,681  4.21  25.6
The mean and the standard deviation describe the
central tendency and the statistical dispersion
of the data, while the high skewness (asymmetry)
and kurtosis (from the Greek for "bulging")
values indicate highly skewed distributions, that
is, a lack of normality in the data.
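Skewness and kurtosis can be computed from the central moments of the data, as in the minimal Python sketch below. The slides do not say which estimator produced the table values, so the moment-based definitions used here (with kurtosis reported as excess over the normal value of 3) are an assumption, as are the function name `skewness_and_kurtosis` and the made-up counts.

```python
def skewness_and_kurtosis(values):
    """Moment-based skewness and excess kurtosis (both 0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0.0:
        raise ValueError("all values are identical")
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skew, excess_kurtosis


# Made-up counts, for illustration only: mostly zeros with one large value,
# the same kind of shape as a sparse count variable.
sparse_counts = [0, 0, 0, 0, 10]
symmetric = [1, 2, 3, 4, 5]
print(skewness_and_kurtosis(sparse_counts))
print(skewness_and_kurtosis(symmetric))
```

A variable that is zero in most cells and large in a few, like the sparse example, gives a clearly positive skewness, while the symmetric example gives a skewness of zero.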
35Data clustering
36Statistical analysis
- Our exploratory analyses indicate that there is a
positive spatial autocorrelation within datasets
for all variables.
- An initial overview of the statistical
distribution and normality of each of the
variables selected for this study indicated
absence of normality in the data.
Exploratory Spatial Analysis of Illegal Oil
Discharges Detected off Canada's Pacific
Coast. Norma Serra-Sogas, Patrick O'Hara,
Rosaline Canessa, Stefania Bertazzon and Marina
Gavrilova
37Lecture summary
- Proper statistical analysis is important
- Variance and autocorrelation are two important
vehicles for data analysis
- Combining these measures with various metrics,
hierarchical structures, grids, attributes and
also data filtering/visualization methods is a
direction of current research.