Title: Going Beyond GIS
1Going Beyond GIS for Environmental Health
Frank C. Curriero (fcurrier_at_jhsph.edu)
Environmental Health Sciences and Biostatistics
Bloomberg School of Public Health
EnviroHealth Connections Summer Institute 2006
2Bio
- Joint appt. in Env Health Sci and Biostatistics
- PhD in Statistics
- Research agenda is spatial statistics
[Diagram labels: Statistics, Geography (GIS), Env Health, Spatial Statistics]
3Objectives
- Provide exposure to the field of spatial statistics.
- Keep it simple (non-technical)
- Applications of GIS in Environmental Health
- Beyond GIS, maps make you think/question
- Current research topics
- Geography (location) is a source of variation worth considering in environmental health investigations.
4What is Spatial Statistics?
Statistics for the analysis of spatial (geographic) data
What is Spatial Data?
The "where", in addition to the "what" that was observed or measured, is important and is recorded with the data.
Location information (the "where") can vary.
What is GIS?
Stands for Geographic Information System.
Anything more depends on who you ask!
5What is a GIS?
One-word definition: Database
Two-word definition: Visual Database
Visual database for geographic data
- Stores
- Manipulates
- Analyzes
- Queries
- Creates
- Displays
. . . .
MAPS
Layer cake of information
6What else
- A computer system (piece of software) with a tremendous amount of capability for storing, querying, combining, presenting, . . . , spatial data.
- GIS is designed specifically for spatial data and hence built to handle all of its complicated features.
- GIS is a generic name like word processor. ArcGIS, MapInfo, Idrisi are examples of different GIS.
- The earth does not have to be the backdrop for every GIS application, but it is certainly the most common.
7What else (cont.)
- Public health was not the first and probably will not be the last application of GIS and spatial statistics.
- GIS as a mechanism for generating hypotheses (exploratory spatial data analysis).
- GIS is a tool, a very powerful and valuable tool when working with spatial data.
8Applications in Spatial Statistics and GIS
- Waterborne disease outbreaks
- DDE soil contamination
- Lyme Disease
- Prostate cancer mapping
- Chesapeake Bay water quality assessment
9US Waterborne Disease Outbreaks, 1948-1994
Outbreak Data
Location        Longitude  Latitude  Month  Year
AL, Anniston    -85.83     33.65     Oct    1953
AL, Center Pt.  -86.68     33.63     Nov    1958
WY, Cody        -109.06    44.53     July   1986
. . .
10US Waterborne Disease Outbreaks, 1948-1994
Substantive Questions
Do outbreaks occur at random across the US?
Are outbreaks preceded by extreme precipitation events?
Does the risk of an outbreak vary spatially, and is it related to watershed vulnerability?
11Objective: Association between extreme precip. and outbreaks
Methods: Overlay map of outbreaks and extreme precip events
- 2,105 watersheds (USGS)
- 16,000 weather stations (NCDC)
- define extreme precipitation
- aggregate precip and outbreak to watershed
Results: 51% of outbreaks were coincident with extreme levels of precip within a 2 month lag preceding the outbreak month.
Conclusion: Is this evidence of an association?
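The Methods steps above (define extreme precipitation, then check each outbreak against the preceding months) can be sketched roughly as follows. This is a minimal illustration, not the analysis behind the slide: the percentile cutoff, the watershed names `W1`/`W2`, the toy precipitation series, and the choice to count the outbreak month itself inside the lag window are all assumptions made for the example.

```python
from statistics import quantiles

def extreme_months(monthly_precip, pct=90):
    """Indices of months whose precipitation exceeds the pct-th percentile.

    The percentile cutoff is an assumption; the slides do not state one.
    """
    threshold = quantiles(monthly_precip, n=100, method="inclusive")[pct - 1]
    return {i for i, p in enumerate(monthly_precip) if p > threshold}

def preceded_by_extreme(outbreak_month, extremes, lag=2):
    """True if an extreme month falls in the outbreak month or the `lag` months before it."""
    return any((outbreak_month - k) in extremes for k in range(lag + 1))

# Hypothetical monthly precipitation, already aggregated to two watersheds.
precip_by_watershed = {
    "W1": [2.0, 2.5, 1.8, 9.5, 2.2, 2.1, 1.9, 2.4, 2.0, 2.3, 2.6, 2.2],
    "W2": [3.0, 3.1, 2.9, 3.2, 3.0, 8.8, 3.1, 2.8, 3.3, 3.0, 2.9, 3.1],
}
# Hypothetical outbreaks as (watershed, month index into the series above).
outbreaks = [("W1", 5), ("W2", 2), ("W2", 7)]

extremes = {w: extreme_months(series) for w, series in precip_by_watershed.items()}
flags = [preceded_by_extreme(month, extremes[w]) for w, month in outbreaks]
share = sum(flags) / len(flags)
print(f"{sum(flags)} of {len(flags)} toy outbreaks coincide with an extreme month")
```

A share computed this way is descriptive only; whether it is evidence of an association depends on how often extreme months occur when there is no outbreak, which is the question the Conclusion line raises.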
12 16,000 Weather Stations Reporting Monthly Precipitation
13 2,105 US Watersheds
14US Waterborne Disease Outbreaks, 1948-1994
Results: 51% of outbreaks were coincident with extreme levels of precip within a 2 month lag preceding the outbreak month.
Conclusion: Is this evidence of an association?
15US Waterborne Disease Outbreaks, 1948-1994
- Map generation included many involved GIS tasks on numerous data sources, GIS Spatial Analysis.
- Statistically speaking, though, it represents risk factor data.
- Spatial statistics often considers the map as a starting point, which in GIS is often an endpoint.
16Western Maryland Superfund Site
DDE Soil Sample Data
Sample  Easting  Northing  DDE (ppm)
1       1108420  725173    160
2       1108300  725378    4
110     1108490  725038    92
. . .
17Substantive Questions
Does the site exceed regulated levels of DDE contamination, and is it in need of remediation?
What is the level of DDE in my backyard?
20Kriged DDE Predictions
Kriging: Spatial prediction at unsampled locations based on data from sampled locations.
Environmental health applications of kriging: exposure maps
21Baltimore County Lyme Disease 1989-1990
[Map legend: Lyme Case, Lyme Control]
Lyme Disease Cases and Controls
Cases                  Controls
Longitude  Latitude    Longitude  Latitude
-76.4047   39.3421     -76.4054   39.3419
-76.3433   39.3736     -76.3522   39.3718
-76.7592   39.3265     -76.7665   39.3119
. . .
22Baltimore County Lyme Disease 1989-1990
[Map legend: Lyme Case, Lyme Control]
Substantive Questions
Do cases of Lyme Disease tend to cluster, generally or as localized hot spots?
Does risk of Lyme Disease vary spatially over Balt. County?
Identify and quantify environmental risk factors associated with Lyme Disease.
23Baltimore County Lyme Disease Risk 1989-1990
Spatial Case/Control Analysis
- Spatial density estimate of cases divided by spatial density estimate of controls (nonparametric kernel approach).
- Logistic regression approach to include covariates.
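The first bullet above, a ratio of kernel density estimates, can be illustrated with a small sketch. It assumes a Gaussian kernel, a single fixed bandwidth, and planar coordinates; the points are invented for the example, and the function names `kernel_density` and `relative_risk` are hypothetical rather than anything from the original analysis.

```python
import math

def kernel_density(x, y, points, bandwidth):
    """Gaussian kernel density estimate at (x, y) from a list of (x, y) points."""
    h2 = bandwidth ** 2
    total = sum(
        math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * h2)) for px, py in points
    )
    return total / (len(points) * 2 * math.pi * h2)

def relative_risk(x, y, cases, controls, bandwidth):
    """Case density divided by control density at (x, y)."""
    return kernel_density(x, y, cases, bandwidth) / kernel_density(
        x, y, controls, bandwidth
    )

# Hypothetical planar coordinates: cases concentrate near (0, 0),
# controls concentrate near (5, 5).
cases = [(0.0, 0.0), (0.5, 0.2), (-0.3, 0.4), (0.2, -0.5), (5.0, 5.0)]
controls = [(0.1, 0.1), (5.2, 4.8), (4.7, 5.3), (5.5, 5.1), (4.9, 4.6)]

near_cluster = relative_risk(0.0, 0.0, cases, controls, bandwidth=1.0)
away_from_cluster = relative_risk(5.0, 5.0, cases, controls, bandwidth=1.0)
print(near_cluster, away_from_cluster)
```

A ratio above 1 suggests relatively more cases than controls near that location, and below 1 the reverse. The result is sensitive to the bandwidth, and the ratio is unstable wherever the control density is close to zero.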
24Statistical Methods Exist to Address
- Do cases (events) show a tendency to cluster?
- Identifying clusters or hot spots.
- Does risk of disease (or outcome of interest) vary spatially?
- Is disease risk elevated near a particular point source?
- Spatial prediction of outcomes at unobserved locations.
- Risk factor estimation in the presence of residual spatial variation.
25Types of Spatial Data
1. Geostatistical Data
Basic structure is data tagged with locations.
Locations can essentially exist anywhere.
Referred to as continuous spatial variation.
Example: MD Superfund Site DDE
26 2. Point Pattern Data
Locations are the data denoting occurrence of events.
Common to aggregate to area-level data.
Example: Baltimore County Lyme Disease Cases, Baltimore County Lyme Disease Controls
3. Area-level Data
Data summarized to an area unit.
Rarely arises naturally.
Often an aggregate form of point pattern data.
Referred to as discrete spatial variation.
Example: Maryland prostate cancer by zip code
27Why Collect Locations as Part of Data?
- Sometimes locations are the only data (as in point patterns).
- Risk (or outcome of interest) may vary spatially.
- Location can serve as an information gateway to other linked data sources
  - environmental
  - demographic
  - social
  - etc.
- Data are spatially dependent and locations are used in statistical methods that account for this dependence.
- In general things can vary spatially and geography (location) may be a source of variation worth considering.
28Temporal Dependence
- Time series or longitudinal data.
- Past/present direction inherent in temporal data.
Spatial Dependence
- Dimensions > 1 and loss of directional component.
- Observations closer together in space are more similar than observations further away (clustering).
[Slide callouts: in space; on the earth]
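One common way to put a number on "closer observations are more similar" is Moran's I, which the slides do not name; it is used here only as an illustration. The sketch assumes binary weights (1 for pairs within `max_dist`, 0 otherwise) and two invented 4 x 4 grids, one clustered and one alternating like a checkerboard.

```python
import math

def morans_i(values, coords, max_dist):
    """Moran's I with binary weights: pairs within max_dist count as neighbors."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = 0.0       # sum of dev[i] * dev[j] over neighboring pairs
    weight_sum = 0.0  # number of (ordered) neighboring pairs
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= max_dist:
                cross += dev[i] * dev[j]
                weight_sum += 1.0
    return (n / weight_sum) * (cross / sum(d * d for d in dev))

# Hypothetical 4 x 4 grid of locations.
coords = [(x, y) for y in range(4) for x in range(4)]
clustered = [1.0 if x < 2 else 0.0 for x, y in coords]          # high values on one side
checkerboard = [float((x + y) % 2) for x, y in coords]          # neighbors always differ

i_clustered = morans_i(clustered, coords, max_dist=1.0)
i_checkerboard = morans_i(checkerboard, coords, max_dist=1.0)
print(i_clustered, i_checkerboard)
```

Positive values indicate that neighboring observations tend to be alike (clustering), and negative values indicate that neighbors tend to differ. Judging whether an observed value is unusual still requires a reference distribution, for example from permuting the values over the locations.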
29Spatial Dependence (clustering) in Environmental
Health Data
Could be due to
- A contagious agent of the outcome under investigation.
- The spatial variation in the population at risk.
- An underlying shared environmental characteristic, measured or unmeasured, that also varies spatially (Shared Environment Effect).
30What GIS is Not
- A complete system for statistical or scientific inference.
- Maps, the most basic and fundamental concepts in GIS, are not statistical inference.
- A GIS map of
  - one variable is analogous to a histogram display
  - two variables overlaid is analogous to an x-y scatterplot or 2x2 table.
- In statistics we go beyond histograms and scatterplots.
31An Important Distinction
In the GIS literature, "analysis" or "spatial analysis" often means spatial data manipulation, which is something different from statistical analysis.
32Two Current Research Problems in Spatial
Statistics and GIS
- Non-geocoded Data
- Non-Euclidean Distance
33Geographic Analysis of Prostate Cancer in Maryland
PI: Ann Klassen (HPM Oncology)
Collaborators: Margaret Ensminger, Chyvette Williams, Jean Hee Hong (HPM); Frank Curriero (Biostat); Anthony Alberg (Epi); Martin Kulldorff (Harvard); Helen Meissner (NCI)
Cooperative Agreement from Association of Schools of Public Health and Centers for Disease Control
Data Agreement with the Maryland Cancer Registry
One of six CDC projects investigating geography and prostate cancer, including NY, CT/MA, NJ, Kansas/Iowa, and Louisiana.
34Prostate Cancer Reported to MD Cancer Registry 1992-1997
[Map: Proportion of an Outcome of Interest, all geocoded cases. Legend: No Data, 0 - 12, 13 - 30, 31 - 67, 68 - 100]
Outcomes of Interest Include
- Incidence
- Stage at diagnosis
- Tumor grade at diagnosis
- Failure to stage or grade
- Treatment and mortality
35Proportion of an Outcome of Interest
[Map: all geocoded cases. Legend: No Data, 0 - 12, 13 - 30, 31 - 67, 68 - 100]
36What is Geocoding?
GIS process of translating mailing address information to coordinates on a map, such as with longitude and latitude
Example: 16 Goucher Woods Ct Towson, MD 21286 -> (-76.5883, 39.4005)
Nongeocoded Data
Mailing addresses that could not be geocoded
Example: 8123 Rose Haven Road Rosedale, MD 21237 -> Nongeocoded
37Reasons for Nongeocodes
- Address error
- PO Box
- Rural routes
- Base maps out of date
38Proportion of Outcome of Interest
[Two maps: Geocoded Cases (15,585) and All Cases (17,091). Legend: No Data, 0 - 12, 13 - 30, 31 - 67, 68 - 100]
39Statistical Issues
(1) Common to just ignore nongeocodes
- What's the consequence?
- Historically not well documented in publications
(2) Level of aggregation for analyses?
- Zip code level, Census tract, county, etc.
40Statistical Issues (cont.)
(3) Nongeocodes represent missing data and are most likely not missing at random
[Map: MD Prostate Cancer, Proportion of Nongeocodes. Legend (Nongeocoded): 0 - 9, 10 - 25, 26 - 47, 48 - 75, 76 - 100]
41Statistical Issues (cont.)
(4) Nongeocodes carry plenty of information
Known Information (fictitious example)
Age: 72
Race: White
Year of Diagnosis: 1991
Stage at Diagnosis: Late
Tumor Grade: Aggressive
Zip Code: 21237
42Statistical Solutions
(a) Impute a location for nongeocodes
- Determine the age-race distribution within known zip codes
- Weighted random selection based on known age and race
- Sampling with and without replacement
- Multiple imputation to assess bias
(Joint work with Ann Klassen, HPM)
(b) Develop statistical models for outcomes at different levels of aggregation
- Spatial variation in risk model for geocoded household level data and nongeocoded zip code level data
(Joint work with Peter Diggle, Biost)
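Solution (a) can be sketched as a weighted random draw of a location inside the known zip code, repeated several times. Everything concrete below is an assumption for illustration: the candidate areas, their centroids, the stratum counts, and the names `CANDIDATES`, `impute_location`, and `multiple_imputations` are invented, and the sketch draws with replacement only.

```python
import random

# Hypothetical candidate areas inside one zip code: a centroid (lon, lat) and
# the number of residents in each (age group, race) stratum. Values are made up.
CANDIDATES = {
    "21237": [
        {"centroid": (-76.50, 39.33), "counts": {("65+", "White"): 120, ("65+", "Black"): 30}},
        {"centroid": (-76.48, 39.35), "counts": {("65+", "White"): 40, ("65+", "Black"): 90}},
        {"centroid": (-76.52, 39.34), "counts": {("65+", "White"): 0, ("65+", "Black"): 10}},
    ]
}

def impute_location(zip_code, stratum, rng):
    """Pick one candidate centroid, weighted by the stratum's population count."""
    areas = CANDIDATES[zip_code]
    weights = [area["counts"].get(stratum, 0) for area in areas]
    return rng.choices(areas, weights=weights, k=1)[0]["centroid"]

def multiple_imputations(zip_code, stratum, m, seed=0):
    """Draw m imputed locations (with replacement) for one nongeocoded case."""
    rng = random.Random(seed)
    return [impute_location(zip_code, stratum, rng) for _ in range(m)]

# A nongeocoded case like the fictitious example: age 72, White, zip code 21237.
draws = multiple_imputations("21237", ("65+", "White"), m=20)
print(draws[:3])
```

Repeating the downstream analysis once per imputed data set and comparing the results gives a sense of how much the imputation itself moves the conclusions, which is the purpose of the multiple imputation step.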
43Chesapeake Bay Water Quality Assessment
Data: Temperature, Turbidity, Dissolved Oxygen, Chlorophyll a
Needed: Assessments at unsampled locations
44Kriging
A spatial regression method that provides optimal prediction at unsampled locations.
Kriged predictions are weighted averages of sampled data, with higher weights given to data closer to the prediction site.
Proximity is measured by the straight line Euclidean distance (as the crow flies).
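The weighted-average idea above can be shown with a minimal ordinary kriging sketch. It assumes an exponential covariance function with a fixed sill and range; in practice these would be estimated from the data. The sites, values, and helper names (`exp_cov`, `solve`, `ordinary_kriging`) are invented for the example.

```python
import math

def exp_cov(h, sill=1.0, range_par=2.0):
    """Exponential covariance as a function of Euclidean distance h (assumed model)."""
    return sill * math.exp(-h / range_par)

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ordinary_kriging(sites, values, target):
    """Return (prediction, weights) at target; the weights are constrained to sum to 1."""
    n = len(sites)
    # Covariances among sampled sites, bordered by the unbiasedness constraint.
    a = [[exp_cov(math.dist(sites[i], sites[j])) for j in range(n)] + [1.0] for i in range(n)]
    a.append([1.0] * n + [0.0])
    # Covariances between each sampled site and the prediction site.
    b = [exp_cov(math.dist(s, target)) for s in sites] + [1.0]
    weights = solve(a, b)[:n]  # the last unknown is the Lagrange multiplier
    return sum(w * v for w, v in zip(weights, values)), weights

# Hypothetical sampled sites and measurements.
sites = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 3.0)]
values = [10.0, 12.0, 8.0, 20.0]
prediction, weights = ordinary_kriging(sites, values, target=(0.2, 0.1))
print(prediction, weights)
```

In this toy setup the sample nearest the prediction site receives the largest weight, and predicting at a sampled site returns that sample's value.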
45Chesapeake Bay Fixed Station Data
Euclidean distance may not be appropriate.
Propose a water metric.
Currently kriging only works for Euclidean distance.
New methods needed.
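To see why a water metric can differ from straight-line distance, the sketch below compares the two on a tiny invented grid, where `W` marks water cells and `#` marks land. The shortest in-water path is found by breadth-first search over adjacent water cells. The grid and the function name `water_distance` are assumptions for illustration, and this is only one possible way to define such a metric.

```python
import math
from collections import deque

# Hypothetical map: a strip of land separates two arms of water that join at the bottom.
GRID = [
    "W#W",
    "W#W",
    "WWW",
]

def water_distance(grid, start, goal):
    """Length of the shortest path from start to goal moving only through water cells."""
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return dist[(r, c)]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == "W" and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return None  # goal not reachable through water

station_a, station_b = (0, 0), (0, 2)
euclidean = math.dist(station_a, station_b)
through_water = water_distance(GRID, station_a, station_b)
print(euclidean, through_water)
```

The two stations are close as the crow flies but far apart for anything that has to travel through the water. The sketch only computes the distance; substituting it into a covariance function does not by itself guarantee a valid kriging model, which is the gap the slide points to.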
46Closing Remarks
- GIS for spatial database management and hypothesis generation (posing the questions)
- Spatial Statistics for inferential methods (answering the questions)
- Why consider location
  - Scientific inference may depend on it
  - Gateway to environmental data
  - Source of variation worth considering
- Biography and Geography of Public Health