Title: Use of K-nearest Neighbor Imputation for Modeling Forest Inventory Data
Slide 1: Use of K-nearest Neighbor Imputation for Modeling Forest Inventory Data
Andrew J. Lister, USDA Forest Service Northern Research Station, Northern Monitoring Program, Forest Inventory and Analysis
Slide 2: Original rationale for needing classified maps
Plot stratification for variance reduction
> 400,000 photointerpretation points in 13 Northeastern states alone
Slide 3: Instead of photos, use classified imagery to group and weight plots
96.3% accurate, with a kappa statistic of 0.81
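As a reminder of how overall accuracy and the kappa statistic relate, here is a minimal sketch of computing both with scikit-learn. The reference and map labels below are made up for illustration; they are not the study's validation data.

```python
# Minimal sketch (hypothetical data): overall accuracy vs. Cohen's kappa
# for a forest/nonforest classification.
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical reference (photo) labels and map labels: 1 = forest, 0 = nonforest
reference = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
mapped    = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

print(accuracy_score(reference, mapped))     # fraction of points that agree
print(cohen_kappa_score(reference, mapped))  # agreement corrected for chance
```

Kappa is always lower than raw accuracy here because it discounts the agreement expected by chance from the class proportions.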
Slide 4: Better hardware, software, and skill → sophisticated regional and national maps could be produced → growth in demand.
Slide 5: After 2000, the advent of the MODIS sensor made regional mapping even more feasible. Concurrently, computer capabilities and software advances made this type of modeling more accessible to natural resource agencies.
Slide 6: The USDA Forest Service's Remote Sensing Application Center and its Forest Inventory and Analysis Unit assembled a set of several dozen GIS and imagery data layers that could be used for a first iteration of national mapping efforts. Etc.
Slide 7: With these new software and hardware solutions, FIA produced a first iteration of a national product: Forest Biomass of the United States.
Slide 8: The problem with the regression tree approach is that every map requires a separate modeling effort. The ideal approach would be to make one map, with entire plot records imputed to pixels.
Slide 9: Approach: a mapping project in support of the PA state report
1. Clean data: remove whacko plots, plots with any nonforest on them, and categorical predictors. Good idea?
2. Cull confounding predictor data by choosing a subset of variables that effectively group forest inventory data into homogeneous groups.
First, use k-means clustering to group the FIA data into 10 clusters using, e.g., total volume per species as the distance-defining variables → each plot assigned a species composition group (cluster).
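The clustering step above can be sketched as follows; the plots-by-species volume table is a synthetic stand-in, not the FIA database schema.

```python
# Sketch of the k-means step: group plots into 10 species composition
# clusters using per-species volumes as the distance-defining variables.
# All values are synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 200 hypothetical plots x 5 species: total volume per species on each plot
volumes = rng.gamma(shape=2.0, scale=300.0, size=(200, 5))

# Standardize so no single high-volume species dominates the distances,
# then group plots into 10 species composition clusters
X = StandardScaler().fit_transform(volumes)
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

cluster_of_plot = kmeans.labels_  # one species composition group per plot
```

Each plot's cluster label then serves as the class the feature selection step tries to predict.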
Slide 10: Next, run a feature selection algorithm: iteratively assess every predictor for its ability to improve the classification of training data into their correct species composition clusters, and rank them based on this ability.
Why not do data reduction by creating composite variables, e.g., principal components or canonical variates?
Good question...
Intuitively, I feel that adding another layer of modeled data to the modeling process further dissociates the phenomenon being predicted from the measurements that were taken.
I'm probably wrong!
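The deck does not name the feature selection algorithm, so this is only a sketch of one simple version of the idea: score each predictor layer by how well it alone classifies plots into their species composition clusters, then rank. All arrays are synthetic stand-ins for the GIS/imagery predictor stack.

```python
# Sketch of single-variable feature ranking: cross-validated accuracy of a
# simple classifier trained on each predictor alone. Synthetic data; in the
# real workflow the classes are the k-means species composition clusters.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
n_plots, n_predictors = 200, 8
cluster = rng.integers(0, 10, size=n_plots)        # cluster label per plot
predictors = rng.normal(size=(n_plots, n_predictors))
predictors[:, 0] += cluster                        # make predictor 0 informative

scores = []
for j in range(n_predictors):
    acc = cross_val_score(KNeighborsClassifier(5),
                          predictors[:, [j]], cluster, cv=5).mean()
    scores.append((acc, j))

ranked = [j for acc, j in sorted(scores, reverse=True)]
print(ranked)  # predictor indices, best to worst; 0 should rank first here
```

A stepwise (forward/backward) version would re-score predictors conditioned on those already selected, which is closer to "iteratively assess every predictor."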
Slide 11: Next, associate the training plots with the subset of standardized predictor data.
Perform fuzzy classification, whereby each unknown pixel is given the plot ID number of the known plot that is most similar to it (simple Euclidean distance).
Finally, recode the plot ID image with a lookup table created by summarizing the FIA database to the plot level. For example, plot 17 has 1,500 cubic feet of volume per acre, so every pixel that was assigned plot 17 as its nearest neighbor gets a value of 1,500.
The principal advantage: you do the classification once, and then simply recode the plot ID map for every attribute of interest.
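The classify-once-then-recode workflow can be sketched as follows. The plot IDs, predictor values, and pixel values are hypothetical; only plot 17's 1,500 cubic feet/acre comes from the slide.

```python
# Sketch of the imputation and recode steps: each pixel gets the plot ID of
# its nearest training plot in standardized predictor space (Euclidean
# distance), then the plot ID image is recoded through a plot-level lookup
# table. Values are hypothetical.
import numpy as np
from sklearn.neighbors import NearestNeighbors

plot_ids = np.array([17, 23, 42])
plot_predictors = np.array([[0.1, 0.9],   # standardized predictor values
                            [0.8, 0.2],   # at each training plot
                            [0.5, 0.5]])

# 4 pixels described by the same 2 standardized predictor layers
pixels = np.array([[0.2, 0.8], [0.9, 0.1], [0.6, 0.4], [0.0, 1.0]])

nn = NearestNeighbors(n_neighbors=1).fit(plot_predictors)
_, idx = nn.kneighbors(pixels)
plot_id_map = plot_ids[idx.ravel()]       # the classification, done once

# Lookup table from summarizing the plot database: cubic feet of volume/acre
volume_per_acre = {17: 1500, 23: 900, 42: 1200}
volume_map = np.array([volume_per_acre[p] for p in plot_id_map])
print(plot_id_map)   # [17 23 42 17]
print(volume_map)    # [1500  900 1200 1500]
```

Mapping a different attribute (basal area, forest type, percent damage) only requires swapping the lookup table; `plot_id_map` is reused as-is.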
Slide 12: Results of feature selection procedure: which layers best put plots into homogeneous groups?
Slide 13: Ranks of tree volume-related variables (figure)
Top-ranked: x, x2, y2, y, xy; soil pH; coarse fraction 1; coarse fraction 2; plasticity; August, September, November, December, June, and total annual precipitation
Slide 14: Ranks of species composition-related variables (figure)
Slide 15: Ranks of total green biomass variables (figure)
Top-ranked: MODIS May 9 b2, MODIS May 9 b5, MODIS May 9 b6, MODIS July 12 b2, percent conifer forest, MODIS June 10 b6, MODIS September 14 b2, MODIS April 7 b2, MODIS June 10 b2, MODIS June 10 b7, MODIS Aug 13 b2, MODIS June 10 b1, MODIS NDVI, August 13 EVI
Slide 16: Plot ID map; note spatial clustering of similar values
Slide 17: A small number of final recoded, masked maps
Slide 18: A couple more...
Slide 19: Just one more, a wafer-thin map...
Slide 20: Are we there yet?
Slide 21: Examples of final product: Maple/Beech/Birch Forest Type
Slide 22: Examples of final product: Chestnut Oak
Slide 23: Examples of final product: Eastern Hemlock
Slide 24: Examples of final product: Red Maple
Slide 25: Quality assurance
Compute map- and plot-based totals per window, and assess quality of map
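One way to sketch the windowed comparison: tile the map into windows, and in each window compare the map-based mean of an attribute to the mean of the plots falling there. The map, window size, and plot values below are all synthetic.

```python
# Sketch of window-based QA: per-window map mean vs. plot-based mean.
# Synthetic 100x100 map of imputed volume/acre; 5 fake plot values stand in
# for the FIA plots located in each window.
import numpy as np

rng = np.random.default_rng(2)
volume_map = rng.uniform(500, 2000, size=(100, 100))

results = []
win = 50  # 50x50-pixel QA windows -> a 2x2 grid of windows
for i in range(0, 100, win):
    for j in range(0, 100, win):
        map_mean = volume_map[i:i + win, j:j + win].mean()
        # the plot-based mean would come from the plots in this window;
        # here 5 fake plot values stand in for them
        plot_mean = rng.uniform(500, 2000, size=5).mean()
        results.append((i, j, map_mean, plot_mean))

for i, j, m, p in results:
    print(i, j, round(m, 1), round(p, 1))
```

Plotting the paired window totals (or their differences) against each other gives a quick visual check of map quality at the chosen window scale.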
Slide 26: Examples of quality assurance
Percent Dead White Pine
Board Foot Volume, Sweet Birch
Percent Basal Area, Red Oak
Percent Damage, Red Maple
Slide 27: Next steps
- See if data reduction via creation of composite variables improves quality over using a subset of raw variables
- Try different distance metrics
- Assess varying levels of k; could use same recoding approach
- Evaluate hybrid unsupervised-supervised approach, similar to classical remote sensing
- Automate and generalize the process (feature selection, clustering, etc.). Ideally, a user could submit a training data set, and a map would be created via a set of heuristics.
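The "varying levels of k with the same recoding approach" idea might look like this sketch: keep the k nearest plot IDs per pixel, recode each through the same plot-level lookup table, and average. All plot IDs, predictor values, and volumes are hypothetical.

```python
# Sketch of k > 1 imputation: each pixel keeps its k nearest plot IDs,
# each ID is recoded through the same lookup table as in the k = 1 case,
# and the k recoded values are averaged. Values are hypothetical.
import numpy as np
from sklearn.neighbors import NearestNeighbors

plot_ids = np.array([17, 23, 42, 55])
plot_predictors = np.array([[0.1, 0.9], [0.8, 0.2], [0.5, 0.5], [0.3, 0.7]])
volume_per_acre = {17: 1500, 23: 900, 42: 1200, 55: 1100}

pixels = np.array([[0.2, 0.8], [0.9, 0.1]])

k = 3
nn = NearestNeighbors(n_neighbors=k).fit(plot_predictors)
_, idx = nn.kneighbors(pixels)         # k plot indices per pixel
neighbor_ids = plot_ids[idx]           # shape (n_pixels, k)

# Same recode-by-lookup step as k = 1, then average over the k neighbors
lut = np.vectorize(volume_per_acre.get)
volume_map = lut(neighbor_ids).mean(axis=1)
print(volume_map)
```

A distance-weighted average (weights from `nn.kneighbors`'s returned distances) would be a natural variant to compare against the unweighted mean.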
Slide 28: Why worry so much about mapping approaches for forestry data?
alister@fs.fed.us