Logistic Regression - PowerPoint PPT Presentation

Provided by: AJL2

Transcript and Presenter's Notes


1
Logistic Regression
  • Often, the spatial phenomenon under investigation
    can only be described by a categorical variable.
  • Wildfires are typically depicted with polygons
    showing burned vs. not burned areas
  • Or bird distribution, indicating presence or
    absence of birds
  • The previous regression technique is not suitable
    because the dependent variable is neither
    interval nor ratio
  • Logistic regression treats the distribution in a
    probabilistic manner, that is, the occurrence of
    the study phenomenon is evaluated in terms of
    probability

2
Logistic Regression
  • If Pa is the probability that a phenomenon is
    present and Pb is the probability that it is
    absent, then
  • $P_a + P_b = 1$
  • $U_a = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n + e$
  • Ua is the utility function of event a, expressed
    as a linear combination of the explanatory
    variables X1, X2, ..., Xn; bn is the estimated
    parameter of variable Xn and e is the error term

3
Logistic Regression
  • A greater value of Ua implies a greater
    probability for the event to take place. When Ua
    approaches infinity, Pa approaches 1, indicating
    a high likelihood for the event to occur. When
    Ua approaches negative infinity, Pa approaches 0.
  • When Ua equals zero, the probability is .50,
    implying a 50/50 chance for the event to occur.
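
The limiting behaviour on this slide matches the standard logistic transform Pa = exp(Ua) / (1 + exp(Ua)). The slides do not state this formula explicitly, so it is an assumption here, and the function name is purely illustrative. A minimal sketch:

```python
import math

def probability_from_utility(u: float) -> float:
    """Logistic transform: map a utility value U_a to a probability P_a."""
    # Use the numerically stable branch so exp() never overflows.
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

# Large negative, zero, and large positive utilities.
for u in (-10.0, 0.0, 10.0):
    print(u, round(probability_from_utility(u), 4))
```

A utility of zero gives exactly 0.5, and the probability moves toward 0 or 1 as the utility becomes strongly negative or positive.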

4
Logistic Regression Example
  • Example from Chou
  • Fires in the San Jacinto Ranger District of the
    San Bernardino National Forest were examined to
    map the distribution of fire occurrence
    probability. The basic model consisted of eight
    independent variables:
  • Area, perimeter, vegetation, proximity to
    buildings, proximity to campgrounds, proximity to
    roads, maximum temperature in July, and annual
    precipitation

5
Variables in Fire Distribution Study
  • X1 Area: area of geographic unit
  • X2 Perimeter: perimeter of geographic unit
  • X3 Vegetation: vegetation computed by rotation
    period
  • X4 Building: proximity to structures
  • X5 Campground: proximity to campgrounds
  • X6 Road: proximity to roads
  • X7 Temperature: maximum temperature in July
  • X8 Precipitation: annual precipitation
  • The dependent variable is a code indicating
    whether or not a geographic unit burned. Area
    and perimeter provide general geometric
    characteristics. Vegetation, precipitation, and
    temperature represent environmental factors,
    while building, campground, and road represent
    human-related factors

6
Results of Logistic Regression
  • The model indicates that perimeter, vegetation,
    campground, road, and temperature are variables
    to be included in the model. The other variables
    are not included because their coefficients are
    not statistically different from 0

7
Results of Logistic Regression
  • The percentage-correctly-estimated (PCE) index
    shows the maximum level of estimation accuracy
    of a model.
  • In this example, PCE is 60%, not much better than
    a random 50/50 chance.
  • Therefore, another variable was evaluated
8
Alternative Model
  • Included an additional variable to determine
    whether it makes any significant difference in
    model performance
  • New variable represents neighborhood effects, or
    conditions of the surrounding geographic units
  • Assumes that fire occurrence probability is
    affected not only by the environmental and
    human-related variables listed in the basic
    model, but also by the distribution of fire
    occurrence probability of adjacent units
  • The new spatial term X9 is defined by the
    percentage of neighboring units that were burned
    during the study period
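
The spatial term X9 can be illustrated with a short sketch. The adjacency dictionary, unit names, and burn flags are hypothetical, and treating a unit with no neighbors as 0% is an assumption made only for this example.

```python
def neighborhood_burned_percent(adjacency, burned):
    """For each unit, the percentage of its neighboring units that burned."""
    x9 = {}
    for unit, neighbors in adjacency.items():
        if not neighbors:
            x9[unit] = 0.0  # assumption: a unit with no neighbors gets 0
            continue
        n_burned = sum(1 for n in neighbors if burned[n])
        x9[unit] = 100.0 * n_burned / len(neighbors)
    return x9

# Hypothetical geographic units and their adjacent units.
adjacency = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B"],
    "D": ["B"],
}
burned = {"A": True, "B": False, "C": True, "D": False}
x9 = neighborhood_burned_percent(adjacency, burned)
print(x9)
```

In a GIS workflow the adjacency would typically come from polygon topology rather than a hand-written dictionary.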

9
New Results
  • Results from the new study are quite different
  • Only two variables are statistically significant:
    vegetation and neighborhood effects
  • Vegetation appears to be the determining
    environmental variable in the distribution of
    wildfires in the study area
  • Finally, wildfires are influenced by neighborhood
    conditions

10
Testing Statistical Significance
  • Did the neighborhood effects significantly change
    the model? This requires a chi-square test of the
    likelihood ratio:
  • $\chi^2 = -2 \ln (L_0 / L_1)$
  • where L0 denotes the likelihood of the basic
    model and L1 denotes the likelihood of the study
    model
  • Statistical testing suggests that the
    neighborhood variable significantly improved the
    performance of the model
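
A minimal sketch of the likelihood ratio test, assuming the standard statistic -2 ln(L0/L1) and one added parameter (the neighborhood term), so one degree of freedom. The log-likelihood values are hypothetical; the slides do not report them.

```python
import math

def likelihood_ratio_test_df1(loglik_basic, loglik_study):
    """Chi-square likelihood ratio test for one added parameter (df = 1)."""
    # -2 ln(L0 / L1) written in terms of log-likelihoods.
    statistic = -2.0 * (loglik_basic - loglik_study)
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))
    return statistic, p_value

# Hypothetical log-likelihoods for the basic and study models.
stat, p = likelihood_ratio_test_df1(loglik_basic=-120.0, loglik_study=-112.5)
print(stat, p)
```

A small p-value would indicate that the added variable improves the fit by more than chance alone would explain.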

11
Procedure for Regression Analysis (Barber, p. 448)
  • Specify the variables in the model and the exact
    form of the relationship between them
  • Collect data
  • Estimate the parameters of the model
  • Statistically test the utility of the developed
    model, and check whether the assumptions of the
    simple linear regression model are satisfied
  • Use the model for prediction

12
Example of Data Manipulation and Programming in
ArcView
  • Manipulating Yield Data with DataManipulation.ave

13
Spatial Prediction of Landslide Hazard Using
Logistic Regression and GIS
  • Art Lembo
  • 620 Presentation
  • Based on the paper by Gorsevski, Gessler, and Foltz

14
Introduction
  • Landslides are natural geologic processes that
    cause billions of dollars in damage and thousands
    of deaths each year
  • 95% of landslides occur in developing countries

15
Causes of Landslides
  • Human activities, such as deforestation and urban
    expansion, accelerate the process of landslides
  • Roads and harvest activities in timberlands
    increase the occurrence of landslides
  • In undisturbed forest, soil erosion is generally
    negligible

16
Clearwater National Forest
  • 1995-1996
  • Major landslides occurred during the winter
    following heavy rains, snowmelt, and high river
    flow
  • Over 900 landslides were recorded on the unstable
    slopes of the forest
  • Landslide occurrence was widely distributed and
    included artificial slopes such as road cuts and
    fills, or natural slopes in clearcut areas

17
(No Transcript)
18
Landslide Data
  • Within the large remote area, a DEM was used to
    generate quantitative topographic attributes
  • Slope, elevation, aspect, profile curvature,
    tangent curvature, plan curvature, flow path, and
    contributing area
  • Photo interpretation and field inventory
    identified landslide areas

19
(No Transcript)
20
Considerations in Creating Hazard Models
  • Datasets combined and stored in a GIS database
  • Hazard model assumptions:
  • Strength of a model depends on the quality of the
    data collected
  • Data-driven models are not appropriate for
    extrapolation to neighboring areas
  • Climatic conditions may change so that the past
    is not an indicator of the future
  • Uncertainty exists when a hazard map is derived
    from a statistically based model

21
Models Used in Study
  • Logistic regression was used to correlate
    the environmental attributes with landslide
    distribution
  • Because of the existence of uncertainty, a
    Receiver Operating Characteristic (ROC) curve was
    used; it plots the proportion of true positives
    against the proportion of false positives at each
    level of the criterion
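
A minimal sketch of how the points of an ROC curve can be computed, treating each distinct predicted probability as a candidate cut-off. The labels and scores are hypothetical and the function name is illustrative.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) at each score threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative case")
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical data: 1 = landslide present, 0 = absent.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
for fpr, tpr in curve:
    print(round(fpr, 3), round(tpr, 3))
```

Each point corresponds to one possible cut-off, so a decision maker can trade missed landslides against false alarms by choosing a point on the curve.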

22
Assessing Landslide Hazard
  • Field inspection using a check list to identify
    sites susceptible to landsliding
  • Projection of future patterns of instability from
    analysis of landslide inventories
  • Multivariate analysis of factors characterizing
    observed sites of slope instability
  • Stability ranking based on criteria such as
    slope, land forms, or geologic structure
  • Failure probability analysis based on slope
    stability models with stochastic hydrologic
    simulation

23
Preparing the Data
  • Primary and secondary attributes are derived from
    a 30 m DEM, reducing the high cost of collecting
    the data
  • Landslides assessed through aerial reconnaissance
  • Landslide hazard areas are then identified based
    on spatial correlation between the attributes
    derived from the DEM
  • ROC curves used for decision making

24
Data Sampling
  • 15% of non-landslide cells were randomly sampled
    to represent the absence of landslides
  • A multivariate subset was derived from the
    coverages where landslides were absent
  • The landslide coverage was a point data set that
    sampled grid cells where landslides were present
  • Both samples were joined together, with the
    dependent variable as a binary response (present
    or absent)
  • Final output stored in ASCII and used in SAS
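
The sampling step can be sketched as follows. The cell identifiers, the fixed random seed, and the function name are hypothetical; only the 15% sampling fraction and the binary response come from the slide.

```python
import random

def build_sample(absent_cells, present_cells, fraction=0.15, seed=0):
    """Randomly sample a fraction of absence cells and join them with presence cells."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    k = round(fraction * len(absent_cells))
    sampled_absent = rng.sample(absent_cells, k)
    # Binary response: 1 = landslide present, 0 = landslide absent.
    rows = [(cell, 1) for cell in present_cells]
    rows += [(cell, 0) for cell in sampled_absent]
    return rows

# Hypothetical grid cell ids.
absent = list(range(1000))
present = list(range(1000, 1040))
rows = build_sample(absent, present)
print(len(rows))
```

Sampling only part of the absence cells keeps the two classes from being overwhelmingly unbalanced before the model is fitted.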

25
Statistical Analysis
  • Normal plot of data to determine if the data
    followed a normal distribution
  • The plot showed that the data points do not fall
    along a straight line, so the data are not
    multivariate normal
  • Logistic regression is used when the predictor
    variables are not normally distributed, and
    some predictor variables are categorical
  • Factor analysis was applied to determine the
    number of underlying variables
  • Only significantly loaded variables were
    considered

26
Statistical Analysis
  • The form of the logistic regression model is
    defined as
  • $P(y = 1 \mid x) = \frac{\exp(B'x)}{1 + \exp(B'x)}$
  • where x is the data vector for a randomly
    selected experimental unit and y is the value of
    the binary outcome variable. Maximum likelihood
    was used to estimate B for the predictive
    equation
  • Variables not significant at the .1 level were
    eliminated
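
A minimal sketch of maximum likelihood estimation for a logistic model, using plain gradient ascent on the log-likelihood. This is a swapped-in illustration, not the SAS procedure used in the study; the data, learning rate, and iteration count are hypothetical.

```python
import math

def fit_logistic(xs, ys, learning_rate=0.1, iterations=5000):
    """Estimate coefficients by gradient ascent on the log-likelihood.

    Returns [intercept, b1, ..., bn].
    """
    n_features = len(xs[0])
    beta = [0.0] * (n_features + 1)
    for _ in range(iterations):
        gradient = [0.0] * len(beta)
        for x, y in zip(xs, ys):
            row = [1.0] + list(x)  # leading 1.0 for the intercept
            z = sum(b * v for b, v in zip(beta, row))
            p = 1.0 / (1.0 + math.exp(-z))
            # Gradient of the log-likelihood: sum over units of (y - p) * x.
            for j, v in enumerate(row):
                gradient[j] += (y - p) * v
        beta = [b + learning_rate * g / len(xs) for b, g in zip(beta, gradient)]
    return beta

# Hypothetical single-predictor data with overlapping classes.
xs = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]]
ys = [0, 0, 1, 0, 1, 1]
beta = fit_logistic(xs, ys)
print(beta)
```

Statistical packages usually solve the same problem with Newton-type iterations and also report standard errors, which is what allows variables to be screened at a chosen significance level.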

27
Logit Results
  • Logit showed that the most important variables
    contributing to the slope instability were Flow
    Path and mean slope of upland area
  • $\log\left(\frac{p}{1-p}\right) = -2.2642 + 0.4969 \cdot \mathrm{FACTOR8} + 0.6039 \cdot \mathrm{FLPATH}$
  • or, equivalently,
    $p = \frac{\exp(-2.2642 + 0.4969 \cdot \mathrm{FACTOR8} + 0.6039 \cdot \mathrm{FLPATH})}{1 + \exp(-2.2642 + 0.4969 \cdot \mathrm{FACTOR8} + 0.6039 \cdot \mathrm{FLPATH})}$
  • p = probability of landslide hazard
  • FACTOR8 = factor with underlying characteristics
    of aspect
  • FLPATH = maximum distance of water to the point
    in the catchment
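
A minimal sketch that evaluates the fitted equation on this slide, reading it as an intercept of -2.2642 plus 0.4969 times FACTOR8 plus 0.6039 times FLPATH. That reading is an interpretation of the slide text, and the input values below are hypothetical because the slides do not give the scale of either variable.

```python
import math

# Coefficients as reported on the slide.
INTERCEPT = -2.2642
COEF_FACTOR8 = 0.4969
COEF_FLPATH = 0.6039

def landslide_probability(factor8, flpath):
    """Probability of landslide hazard from the fitted logit equation."""
    logit = INTERCEPT + COEF_FACTOR8 * factor8 + COEF_FLPATH * flpath
    return math.exp(logit) / (1.0 + math.exp(logit))

# Hypothetical inputs: both predictors at zero, then both at one.
p_baseline = landslide_probability(0.0, 0.0)
p_higher = landslide_probability(1.0, 1.0)
print(round(p_baseline, 4), round(p_higher, 4))
```

Because both coefficients are positive, raising either predictor raises the estimated probability.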

28
(No Transcript)
29
Logit Results
  • The explanatory variables in the logit model have
    positive coefficients. Therefore, higher scores
    increase the probability of landslide hazard.
  • The logit model assumes a nonlinear relationship
    between the probability and the explanatory
    variables
  • The hazard map based on the ROC curve technique
    groups the hazard into two classes, Low Hazard and
    High Hazard; the probability map shows five
    classes of probability of landslide hazard

30
Final Results
  • 59.1% of the landslides and 69.8% of the
    non-landslides were correctly determined
  • Model can be applied to large geographic areas
  • ROC curves are incorporated as a sophisticated
    tool for decision makers for the spatial
    prediction of landslide hazard

31
a) Cut-off based on ROC curve technique b)
Probability of landslide hazard