Maximum Entropy - PowerPoint PPT Presentation

Slides: 52
Provided by: nrac
Transcript and Presenter's Notes

Title: Maximum Entropy


1
Maximum Entropy
  • RESM 575
  • Spring 2009
  • Lecture 13

2
Maximum entropy
(Phillips et al. 2008)
  • History
  • E. T. Jaynes 1957
  • Thermodynamics
  • Inference and information theory

3
The Maximum Entropy Method
  • Origins: Jaynes 1957, statistical mechanics
  • Recent use: machine learning, e.g. automatic
    language translation
  • To estimate an unknown distribution:
  • Determine what you know (constraints)
  • Among distributions satisfying the constraints,
    output the one with maximum entropy
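
The recipe above can be tried on a toy problem. The sketch below is an illustration, not part of the lecture: it assumes a six-sided die whose only known constraint is an average roll of 4.5, and it relies on the standard result that the maximum entropy distribution under a mean constraint has the exponential form p(k) proportional to exp(lam * k). The function names and the bisection search are choices made for this example.

```python
import math

FACES = range(1, 7)

def gibbs_die(lam):
    """Distribution over die faces with p(k) proportional to exp(lam * k)."""
    weights = [math.exp(lam * k) for k in FACES]
    z = sum(weights)
    return [w / z for w in weights]

def mean(p):
    """Expected face value under distribution p."""
    return sum(k * pk for k, pk in zip(FACES, p))

def maxent_die(target_mean, lo=-10.0, hi=10.0, tol=1e-12):
    """Find lam by bisection; the mean of gibbs_die(lam) increases with lam."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean(gibbs_die(mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return gibbs_die((lo + hi) / 2)

p = maxent_die(4.5)
print([round(pk, 4) for pk in p])
print(round(mean(p), 4))
```

With a target mean of 3.5 the same search returns the uniform distribution, since that constraint carries no information beyond the die being fair on average.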

5
What is it?
  • Maxent is a general-purpose method for making
    predictions or inferences from incomplete
    information.
  • Its origins lie in statistical mechanics (Jaynes,
    1957), and it remains an active area of research
    with an annual conference, Maximum Entropy and
    Bayesian Methods, that explores applications in
    diverse areas such as astronomy, portfolio
    optimization, image reconstruction, statistical
    physics and signal processing.

6
Like other Bayesian models
  • Uses prior information
  • Maxent is an alternative to methods of inference
    of classical statistics

7
Maximum Entropy Principle
The fact that a certain probability distribution
maximizes entropy subject to certain constraints
representing our incomplete information is the
fundamental property which justifies the use of
that distribution for inference; it agrees with
everything that is known but carefully avoids
assuming anything that is not known (Jaynes,
1990).
8
Why?
  • Introduced as a general approach for
    presence-only modeling of species distributions,
    suitable for all existing applications involving
    presence-only datasets.

9
Modeling species distributions
[Figure: Yellow-throated Vireo occurrence points and
environmental variables]
10
Estimating a probability distribution
  • Given:
  • Map divided into cells
  • Environmental variables, with values in each cell
  • Occurrence points: samples from an unknown
    distribution
  • Our task is to estimate the unknown
    probability distribution
  • Note:
  • The distribution sums to 1 over the whole map
  • Most probability values will be very small
  • Different from estimating probability of presence

11
Entropy
  • More entropy: more spread out, closer to
    uniform distribution
  • 2nd law of thermodynamics:
  • Without external influences, a system moves to
    increase entropy
  • Maximum entropy method:
  • Apply constraints to remove external influences
  • Species spreads out to fill areas with suitable
    conditions
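
To make "more entropy: more spread out" concrete, here is a minimal sketch that computes Shannon entropy for two distributions over four map cells. The cell probabilities are made up for illustration.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # fully spread out
peaked = [0.85, 0.05, 0.05, 0.05]   # concentrated in one cell

print(entropy(uniform))  # equals ln(4), the maximum for four cells
print(entropy(peaked))   # smaller: less spread out
```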

12
Using Maxent for Species Distributions
  • Features
  • Constraints
  • Regularization

13
Features impose constraints
Feature: an environmental variable, or a function
thereof.
Find the distribution p of maximum entropy such
that, for all features f, mean(f) = sample average
of f.
14
Features
  • Environmental variables or functions thereof.
  • Maxent has these classes of features (others are
    possible)
  • Linear: the variable itself
  • Quadratic: square of the variable
  • Product: product of two variables
  • Binary (indicator): membership in a
    category
  • Threshold
  • Hinge

[Figure: two plots of feature value (0 to 1) against
an environmental variable]
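
The feature classes listed on this slide can be written as small functions of the environmental variables. This is a sketch under assumptions: the slide does not define threshold and hinge features, so the forms below (a step at a knot, and a ramp from the knot up to the variable's maximum) follow the usual description in the Maxent literature, and the parameter names `knot` and `x_max` are invented for this example.

```python
def linear(x):
    """Linear feature: the variable itself."""
    return x

def quadratic(x):
    """Quadratic feature: square of the variable."""
    return x * x

def product(x, y):
    """Product feature: product of two variables."""
    return x * y

def binary(category, level):
    """Indicator feature: 1 if the cell belongs to the given category."""
    return 1.0 if category == level else 0.0

def threshold(x, knot):
    """Threshold feature (assumed form): 1 above the knot, 0 otherwise."""
    return 1.0 if x > knot else 0.0

def hinge(x, knot, x_max):
    """Hinge feature (assumed form): 0 up to the knot, then linear up to 1 at x_max."""
    if x <= knot:
        return 0.0
    return (x - knot) / (x_max - knot)

print(threshold(3.0, 2.0), hinge(3.0, 2.0, 4.0))
```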
15
Constraints
Each feature type imposes constraints on the output
distribution:
  • Linear features: mean
  • Quadratic features: variance
  • Product features: covariance
  • Threshold features: proportion above threshold
  • Hinge features: mean above threshold
  • Binary features (categorical): proportion in
    each category
16
Regularization
[Figure: sample average and true mean plotted on
temperature and precipitation axes]
Find the distribution p of maximum entropy such
that mean(f) lies in a confidence region around the
sample average of f.
17
The Maxent distribution
is always a Gibbs distribution:
$q_\lambda(x) = \exp\left(\sum_j \lambda_j f_j(x)\right) / Z$
where Z is a scaling factor so that the distribution
sums to 1, $f_j$ is the jth feature, and $\lambda_j$
is a coefficient, calculated by the program.
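
As a minimal sketch of the Gibbs form, the function below evaluates the distribution over a handful of cells given feature values and coefficients. The feature values and coefficients are invented for illustration; in practice the coefficients are fitted by the Maxent program.

```python
import math

def gibbs_distribution(features, lambdas):
    """Return q over cells, where features[i][j] is feature j at cell i."""
    scores = [sum(l * f for l, f in zip(lambdas, row)) for row in features]
    top = max(scores)  # subtracting the max avoids overflow; it cancels in the ratio
    weights = [math.exp(s - top) for s in scores]
    z = sum(weights)   # scaling factor Z, so the result sums to 1
    return [w / z for w in weights]

# Three cells, one feature (hypothetical values), coefficient 1.0
q = gibbs_distribution([[0.0], [1.0], [2.0]], [1.0])
print(q, sum(q))
```

Setting every coefficient to zero gives the uniform distribution, which is the maximum entropy answer when no constraint is active.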
18
Maxent is penalized maximum likelihood
Log likelihood:
$\mathrm{LogLikelihood}(q_\lambda) = \frac{1}{m} \sum_i \ln q_\lambda(x_i)$,
where $x_1, \dots, x_m$ are the occurrence points.
Maxent maximizes the regularized likelihood
$\mathrm{LogLikelihood}(q_\lambda) - \sum_j \beta_j |\lambda_j|$,
where $\beta_j$ is the width of the confidence
interval for $f_j$. Similar to the Akaike
Information Criterion (AIC).
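
The objective on this slide can be evaluated directly once a candidate distribution is in hand. The sketch below assumes the penalty uses the absolute value of each coefficient (an L1 penalty, as in the published descriptions of Maxent); the distribution, occurrence cells, coefficients and widths are made-up numbers.

```python
import math

def regularized_log_likelihood(q, occurrence_cells, lambdas, betas):
    """Average log probability at the occurrence cells minus sum_j beta_j * |lambda_j|."""
    m = len(occurrence_cells)
    log_lik = sum(math.log(q[i]) for i in occurrence_cells) / m
    penalty = sum(b * abs(l) for b, l in zip(betas, lambdas))
    return log_lik - penalty

q = [0.1, 0.2, 0.3, 0.4]   # candidate distribution over four cells
occurrence_cells = [2, 3]  # indices of cells with occurrence records
value = regularized_log_likelihood(q, occurrence_cells, [1.5, -0.5], [0.1, 0.2])
print(value)
```

Larger widths make nonzero coefficients more expensive, which pushes the fitted distribution back toward uniform.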
19
Output
  • When Maxent is applied to presence-only species
    distribution modeling, the pixels of the study
    area make up the space on which the Maxent
    probability distribution is defined.
  • Pixels with known species occurrence records
    constitute the sample points.
  • The features are climatic variables, elevation,
    soil category, vegetation type or other
    environmental variables, and functions thereof.

20
To note
  • Sometimes both presence and absence occurrence
    data are available for the development of models,
    in which case general-purpose statistical methods
    can be used (for an overview of the variety of
    techniques currently in use, see Corsi et al.,
    2000; Elith, 2002; Guisan and Zimmermann, 2000;
    Scott et al., 2002).

21
Opportunity
  • However, while vast stores of presence-only data
    (records etc.) exist, absence data are rarely
    available
  • Poorly sampled areas, remote, difficult
  • Absence data may be of questionable value in many
    situations

23
Background
  • 16 modeling methods
  • 226 well-surveyed species in 6 regions of the
    world

24
The authors used three statistics, the area under
the Receiver Operating Characteristic curve
(AUC), correlation (COR) and Kappa, to assess the
agreement between the presence-absence records
and the predictions.
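
Two of these statistics are easy to compute by hand, which helps when reading the comparison. The sketch below uses the standard definitions: AUC as the probability that a randomly chosen presence site scores higher than a randomly chosen absence site, and Cohen's kappa from a 2x2 confusion table. The scores and counts are made up, and this is not the evaluation code used in the study.

```python
def auc(scores_presence, scores_absence):
    """Fraction of presence/absence pairs ranked correctly (ties count one half)."""
    wins = 0.0
    for sp in scores_presence:
        for sa in scores_absence:
            if sp > sa:
                wins += 1.0
            elif sp == sa:
                wins += 0.5
    return wins / (len(scores_presence) * len(scores_absence))

def cohen_kappa(tp, fp, fn, tn):
    """Agreement beyond chance between thresholded predictions and records."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1 - expected)

print(auc([0.9, 0.6, 0.4], [0.5, 0.2]))
print(cohen_kappa(tp=40, fp=10, fn=5, tn=45))
```

AUC needs no threshold, while kappa depends on the threshold used to turn predictions into presence or absence.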
27
Maximum Entropy
  • Only useful when applied to testable information
    (information for which it can be determined
    whether a given distribution is consistent with
    it)
  • Given testable information, the maximum entropy
    procedure consists of seeking the probability
    distribution which maximizes information entropy,
    subject to the constraints of the information.
  • This constrained optimization problem is
    typically solved using the method of Lagrange
    multipliers.
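
For reference, the Lagrange multiplier argument can be sketched in a few lines. The notation is chosen for this sketch: $\bar{f}_j$ denotes the sample average of feature $f_j$, and $\mu$ is the multiplier for the normalization constraint.

```latex
\begin{align*}
&\text{maximize } H(p) = -\sum_x p(x) \ln p(x)
  \quad \text{subject to } \sum_x p(x) f_j(x) = \bar{f}_j \ \text{for all } j,
  \quad \sum_x p(x) = 1 \\
&\mathcal{L} = -\sum_x p(x) \ln p(x)
  + \sum_j \lambda_j \Big( \sum_x p(x) f_j(x) - \bar{f}_j \Big)
  + \mu \Big( \sum_x p(x) - 1 \Big) \\
&\frac{\partial \mathcal{L}}{\partial p(x)}
  = -\ln p(x) - 1 + \sum_j \lambda_j f_j(x) + \mu = 0 \\
&\Rightarrow \quad p(x) = \frac{\exp\big( \sum_j \lambda_j f_j(x) \big)}{Z},
  \qquad Z = e^{1 - \mu} = \sum_x \exp\Big( \sum_j \lambda_j f_j(x) \Big)
\end{align*}
```

The stationarity condition forces an exponential form in the features, which is why the solution is a Gibbs distribution.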

29
Output format
[Figure panels: raw output; cumulative output]
30
Cumulative output format
  • Gives estimate of omission rate
  • A pixel p has cumulative value c
  • Total probability of pixels with lower
    probability than p is c%
  • Set a threshold of c
  • Binary model with presence if cumulative value
    is above c
  • Omission rate is c% if test data drawn from
    Maxent distribution
  • Predict omission rate of c% for real test data
  • Example thresholds:
  • 5 (light red)
  • 20 (dark red)
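
A minimal sketch of how cumulative values could be derived from raw probabilities. It assumes a 0 to 100 scale and counts each pixel's own probability in its cumulative value (the "at or below" convention), which is an assumption of this example; the raw values are made up.

```python
def cumulative_output(raw):
    """Percent of total probability in pixels whose raw value is at or below each pixel's."""
    total = sum(raw)
    order = sorted(range(len(raw)), key=lambda i: raw[i])
    cumulative = [0.0] * len(raw)
    running = 0.0
    start = 0
    while start < len(order):
        end = start
        # Pixels with tied raw values share one cumulative value.
        while end < len(order) and raw[order[end]] == raw[order[start]]:
            running += raw[order[end]]
            end += 1
        for i in order[start:end]:
            cumulative[i] = 100.0 * running / total
        start = end
    return cumulative

raw = [0.1, 0.2, 0.3, 0.4]
cum = cumulative_output(raw)
presence = [c > 20 for c in cum]  # binary model at threshold 20
print(cum, presence)
```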

31
Logistic output format
  • Estimates probability of presence
  • Between 0 and 1
  • Scaled so that a typical presence has value 0.5
  • Defined as
    $c\, q_\lambda(x) / (1 + c\, q_\lambda(x))$
  • where $c = \exp(H(q_\lambda))$, with H the entropy
    of $q_\lambda$
  • Probability of presence depends on sampling
    details:
  • Site size
  • Observation time
  • These details should correspond to collection
    effort for occurrence points
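
The logistic transform can be sketched as follows, taking H to be the entropy of the raw distribution (an assumption where the slide text is garbled). The raw values are illustrative.

```python
import math

def logistic_output(q):
    """Map raw probabilities q to c*q / (1 + c*q) with c = exp(entropy of q)."""
    h = -sum(p * math.log(p) for p in q if p > 0)  # entropy of the raw distribution
    c = math.exp(h)
    return [c * p / (1 + c * p) for p in q]

print(logistic_output([0.25, 0.25, 0.25, 0.25]))
print(logistic_output([0.1, 0.2, 0.3, 0.4]))
```

For a uniform raw distribution over n pixels, c equals n, so every pixel gets 0.5, consistent with the scaling described on the slide.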

32
Response curves
  • Show how predicted probability of presence
    depends on each variable
  • Simple features → simpler model
  • Easier interpretation
  • Complex features → complex model
  • Better fit to data
  • Linear + quadratic (top)
  • Threshold features (middle)
  • All feature types (bottom)

33
Effect of regularization: multiplier = 0.2
  • Smaller confidence intervals
  • Lower entropy
  • Less spread-out
34
Effect of regularization: multiplier = 5
  • Larger confidence intervals
  • Higher entropy
  • More spread-out
35
Effect of regularization: over-fitting
  • Regularization multiplier = 1.0 (not over-fit)
  • Regularization multiplier = 0.2 (clearly over-fit)
36
  • Sage grouse distribution model
  • MAXENT software package
  • Consistently superior to alternative methods
  • Robust to collinearity between explanatory
    variables
  • Accepts continuous and categorical variables
  • Stable distribution with limited training data
  • Evaluates relative variable importance

37
West Virginia Conservation Prioritization using
Species Distribution Modeling
  • Michael Dougherty
  • West Virginia Division of Natural Resources

The Conservation Fund
38
  • Project Goals
  • Develop statewide conservation prioritization map
    based on the distribution of
  • Species of Greatest Conservation Need (SGCN)
  • Habitats of Concern
  • Existing public land
  • The Challenge
  • Develop distribution models for 500 state-tracked
    species
  • Species include plants, herps, birds, bats,
    mammals, aquatics
  • Modeling process must be defensible, transparent,
    and repeatable

39
  • Occurrence data
  • 1. State Natural Heritage Program Biotics
    database
  • Biologists collect Source Features
  • Source Features are grouped into Element
    Occurrences (EOs)
  • EOs represent known populations
  • Species identification is accurate and spatial
    accuracy documented
  • Use of EOs seems to greatly reduce spatial
    autocorrelation
  • 2. Community Ecologists Vegetation Plots
    Database

40
  • Predictor Variables
  • Developed a broad range of predictor variables
  • Climate
  • Landcover
  • Terrain
  • Ecoregions
  • Geology
  • Soils
  • Disturbances

41
  • Workflow Overview
  • Build an array of workstations to run models
  • Develop R scripts to automate running the
    maxent models by iterating through all 500
    species
  • Develop web-based map viewer to assist
    biologists in reviewing maxent model results
  • Perform patch and connectivity analysis using
    FunConn
  • (TBD) Assign weights to patches and connectors

42
  • Scripting Steps
  • Developed R script to perform variable
    pre-selection using boosted regression trees to
    reduce the number of variables to an appropriate
    number (30)
  • Developed R script to produce the maxent batch
    files and perform file management
  • Developed R script to harvest maxent results, a
    Python script to store grids in an ArcSDE
    database, and publish results to a website
  • (TBD) Develop R scripts to perform functional
    connectivity analysis
  • (TBD) Perform layer weighting to produce
    conservation prioritization index

48
Occurrence localities
  • CSV file format. Each line has:
  • Species name
  • X coordinate
  • Y coordinate
  • Multiple species can be in 1 file.

Example:
species,longitude,latitude
bradypus_variegatus,-65.4,-10.3833
bradypus_variegatus,-65.3833,-10.3833
bradypus_variegatus,-65.1333,-16.8
bradypus_variegatus,-63.6667,-17.45
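
A minimal sketch of reading this format with Python's standard `csv` module, grouping points by species since one file can hold several. The function name `read_occurrences` is invented for this example, and the sample text is the example from this slide.

```python
import csv
import io
from collections import defaultdict

SAMPLE = """species,longitude,latitude
bradypus_variegatus,-65.4,-10.3833
bradypus_variegatus,-65.3833,-10.3833
bradypus_variegatus,-65.1333,-16.8
bradypus_variegatus,-63.6667,-17.45
"""

def read_occurrences(handle):
    """Return {species: [(x, y), ...]} from an occurrence CSV."""
    points = defaultdict(list)
    for row in csv.DictReader(handle):
        xy = (float(row["longitude"]), float(row["latitude"]))
        points[row["species"]].append(xy)
    return dict(points)

occurrences = read_occurrences(io.StringIO(SAMPLE))
print(occurrences)
```

For a file on disk, pass `open(path, newline="")` in place of the `io.StringIO` object.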
49
Environmental variables
  • ESRI ASCII raster grid file format.
  • One file per environmental variable
  • All files must have exactly the same bounds and
    cell size
  • Coordinate system must be the same as for the
    occurrence localities
  • Alternative: DIVA .grd format.
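
The "same bounds, same cell size" requirement can be checked before a run. The sketch below assumes the common six-line header of an ESRI ASCII grid (ncols, nrows, xllcorner, yllcorner, cellsize, NODATA_value); files that use xllcenter/yllcenter or omit the NODATA line would need extra handling. The tiny grid and the function names are invented for illustration.

```python
import io

SAMPLE_GRID = """ncols 3
nrows 2
xllcorner -70.0
yllcorner -20.0
cellsize 0.5
NODATA_value -9999
1 2 3
4 -9999 6
"""

def read_ascii_grid(handle):
    """Return (header dict, list of rows) from an ESRI ASCII grid with a six-line header."""
    header = {}
    for _ in range(6):
        key, value = handle.readline().split()
        header[key.lower()] = float(value)
    rows = [[float(v) for v in line.split()] for line in handle if line.strip()]
    return header, rows

def same_geometry(h1, h2):
    """True if two grid headers agree on dimensions, origin and cell size."""
    keys = ("ncols", "nrows", "xllcorner", "yllcorner", "cellsize")
    return all(h1[k] == h2[k] for k in keys)

header, rows = read_ascii_grid(io.StringIO(SAMPLE_GRID))
print(header, rows, same_geometry(header, header))
```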


50
Samples with data (SWD) format
  • Environmental data given with samples in a .csv
    format file
  • Example:
    species,longitude,latitude,cld,dtr,ecoreg,frs,h_dem,pre,pre_l10,pre_l1,pre_l4,pre_l7,tmn,tmp,tmx,vap
    bradypus_variegatus,-65.4,-10.3833,76.0,104.0,10.0,2.0,121.0,46.0,41.0,84.0,54.0,3.0,192.0,266.0,337.0,279.0
    bradypus_variegatus,-65.3833,-10.3833,76.0,104.0,10.0,2.0,121.0,46.0,40.0,84.0,54.0,3.0,192.0,266.0,337.0,279.0
    bradypus_variegatus,-65.1333,-16.8,57.0,114.0,10.0,1.0,211.0,65.0,56.0,129.0,58.0,34.0,140.0,244.0,321.0,221.0
    bradypus_variegatus,-63.6667,-17.45,57.0,112.0,10.0,3.0,363.0,36.0,33.0,71.0,27.0,13.0,135.0,229.0,307.0,202.0
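
Because SWD rows carry the environmental values with each sample, the sample averages that constrain linear features can be computed straight from the file. This is a sketch using the four example rows from this slide; the function names are invented, and averaging a categorical column such as ecoreg is done here only to keep the code short.

```python
import csv
import io

SWD_SAMPLE = """species,longitude,latitude,cld,dtr,ecoreg,frs,h_dem,pre,pre_l10,pre_l1,pre_l4,pre_l7,tmn,tmp,tmx,vap
bradypus_variegatus,-65.4,-10.3833,76.0,104.0,10.0,2.0,121.0,46.0,41.0,84.0,54.0,3.0,192.0,266.0,337.0,279.0
bradypus_variegatus,-65.3833,-10.3833,76.0,104.0,10.0,2.0,121.0,46.0,40.0,84.0,54.0,3.0,192.0,266.0,337.0,279.0
bradypus_variegatus,-65.1333,-16.8,57.0,114.0,10.0,1.0,211.0,65.0,56.0,129.0,58.0,34.0,140.0,244.0,321.0,221.0
bradypus_variegatus,-63.6667,-17.45,57.0,112.0,10.0,3.0,363.0,36.0,33.0,71.0,27.0,13.0,135.0,229.0,307.0,202.0
"""

def read_swd(handle):
    """Return (environmental column names, list of {name: value} rows)."""
    reader = csv.DictReader(handle)
    skip = ("species", "longitude", "latitude")
    env_names = [n for n in reader.fieldnames if n not in skip]
    rows = [{n: float(row[n]) for n in env_names} for row in reader]
    return env_names, rows

def sample_averages(env_names, rows):
    """Mean of each environmental variable over the samples."""
    return {n: sum(r[n] for r in rows) / len(rows) for n in env_names}

env_names, rows = read_swd(io.StringIO(SWD_SAMPLE))
averages = sample_averages(env_names, rows)
print(averages["cld"], averages["dtr"])
```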

51
Background data in SWD format
  • Environmental data at (typically) random points
    in study area
  • Useful:
  • when environmental grids are huge (Maxent needs
    only a small random sample, 10,000)
  • when doing non-uniform sampling
  • Example:
    species,longitude,latitude,cld,dtr,ecoreg,frs,h_dem,pre,pre_l10,pre_l1,pre_l4,pre_l7,tmn,tmp,tmx,vap
    background,-61.775,6.175,60.0,100.0,10.0,0.0,747.0,55.0,24.0,57.0,45.0,81.0,182.0,239.0,300.0,232.0
    background,-66.075,5.325,67.0,116.0,10.0,3.0,1038.0,75.0,16.0,68.0,64.0,145.0,181.0,246.0,331.0,234.0
    background,-59.875,-26.325,47.0,129.0,9.0,1.0,73.0,31.0,43.0,32.0,43.0,10.0,97.0,218.0,339.0,189.0
    background,-68.375,-15.375,58.0,112.0,10.0,44.0,2039.0,33.0,67.0,31.0,30.0,6.0,101.0,181.0,251.0,133.0
    background,-68.525,4.775,72.0,95.0,10.0,0.0,65.0,72.0,16.0,65.0,69.0,133.0,218.0,271.0,346.0,289.0
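
Drawing the background sample is straightforward. The sketch below picks grid cells uniformly at random without replacement; it assumes every cell is valid (no NODATA masking), and the grid dimensions, seed and function name are invented for this example. Non-uniform sampling would need weights, which are not shown.

```python
import random

def sample_background(n_rows, n_cols, k, seed=0):
    """Pick k distinct (row, col) cells uniformly at random from an n_rows x n_cols grid."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    cells = rng.sample(range(n_rows * n_cols), k)
    return [(c // n_cols, c % n_cols) for c in cells]

background = sample_background(n_rows=2000, n_cols=3000, k=10000)
print(len(background), background[:3])
```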