Title: Exploratory Analysis of Forestry Data in NEFIS
1 Exploratory Analysis of Forestry Data in NEFIS
- Natalia Andrienko, Gennady Andrienko
- FHG AIS (Fraunhofer Institute for Autonomous Intelligent Systems)
- http://www.ais.fraunhofer.de/and
- NEFIS Project Workshop, JRC Italy, 29th June 2005
2 NEFIS and our research
- Our research focus is EDA: Exploratory Data Analysis (in particular, of spatial and temporal data)
- In NEFIS, we strive to explain and promote the ideas and principles of EDA
- We have used the ICP Forests defoliation data as a non-trivial example to demonstrate systematic, comprehensive EDA
- We hope to receive valuable feedback from you to guide our further work
3 What Is EDA?
- Emerged in statistics in the 1970s; originator: John Tukey
- A philosophy and discipline of unbiased looking at data: "What can the data tell me?" rather than "Do they agree with my expectations?"
- Similar to the work of a detective (J. Tukey)
- Need to look at data → focus on visualisation and user interaction with data displays
4 Purposes of EDA
- Uncover peculiarities of the data and, on this basis, understand how the data should be further processed (e.g. filtered, transformed, split into parts, fused, ...)
- Generate hypotheses for further testing (e.g. using statistical methods)
- Choose proper methods for in-depth analysis (possibly domain-specific)
- Especially important for previously unknown data, e.g. found on the Web → relevant to NEFIS
5 EDA vs. other analyses
- EDA does not substitute for rigorous methods of numerical analysis, either general or domain-specific, but should give an understanding of which methods to apply and how
[Flow diagram. Elements: Original data; 1. EDA; Understanding of the data (mental model); 2. Data processing; Processed data; 3. In-depth analysis; Conclusions, theories, decisions, ...]
6 EDA vs. information presentation
- EDA makes intensive use of graphics
- However, nice presentation and reporting are not purposes of EDA
- Primary goal of presentation: convey a certain idea or set of ideas to others
  - Understandably
  - Convincingly
  - In an aesthetically attractive way
- This requires different visual means than exploration
7 The defoliation data
- Large volume: 6169 spatially referenced time series
- Two dimensions: space (S) and time (T)
- Many missing values
- No full compatibility across countries, species, time etc.
8 EDA: data quality issues
- Specialist opinion (after seeing the draft report of the data exploration): "The data were not meant for analysis!"
- But:
  - There are no ideal data (especially on the Web and for free)
  - Even to understand the inadequacy of data, one first needs to explore them
  - Even imperfect data can be useful
  - The principles of EDA (demonstrated further) are applicable to perfect data as well
9 General procedure of the EDA
- See the whole
  - Space and time → 2 complementary views
    - Evolution of spatial patterns in time
    - Distribution of temporal behaviours in space
- Divide and focus
  - Data are complex → have to be explored by slices and subsets (species, age groups, countries, years, ...)
- Attend to particulars
  - Detect outliers, strange behaviours, unexpected patterns, ...
10 See the whole: handle large data volumes
- General approach: data aggregation
- Task 1: explore evolution of spatial patterns
- Appropriate data transformation: aggregate by small space compartments (regular grid with 4025 cells), separately for different species; various aggregates (mean, max)
- Gain: no symbol overlapping
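The grid aggregation described on this slide can be sketched roughly as follows. This is a minimal illustration in Python, not the implementation used for the slides; the record layout `(x, y, species, value)`, the cell size, and the function name `aggregate_by_grid` are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

def aggregate_by_grid(records, cell_size, x0=0.0, y0=0.0):
    """Group point records into square grid cells, separately per species.

    records: iterable of (x, y, species, value); value may be None (missing).
    Returns {(col, row, species): {"mean": ..., "max": ..., "n": ...}}.
    """
    cells = defaultdict(list)
    for x, y, species, value in records:
        if value is None:              # skip missing measurements
            continue
        col = int((x - x0) // cell_size)
        row = int((y - y0) // cell_size)
        cells[(col, row, species)].append(value)
    return {
        key: {"mean": mean(vals), "max": max(vals), "n": len(vals)}
        for key, vals in cells.items()
    }

# Hypothetical sample data: coordinates and values are invented.
sample = [
    (0.2, 0.3, "pine", 10.0),
    (0.8, 0.9, "pine", 30.0),
    (0.5, 0.5, "oak", 20.0),
    (1.5, 0.1, "pine", None),
    (1.6, 0.2, "pine", 40.0),
]
result = aggregate_by_grid(sample, cell_size=1.0)
```

Each cell then yields one map symbol per species instead of many overlapping ones, which is the gain named on the slide.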
11 Explore evolution of spatial patterns
- Animated map
- Map sequence
- Observations:
  - Persistently high values in Poland
  - Improvement in Belarus
  - Mosaic distribution in most countries: great differences between close locations
  - Outliers
12 Divide and focus: exploration on country level
- Recommendable due to inconsistencies between countries
- Observation: abrupt changes between locations → spatial smoothing methods are not appropriate
13 Explore spatial distribution of temporal behaviours
- Are behaviours in neighbouring places similar?
- Step 1. Smoothing: supports revealing general patterns and disregarding fluctuations and outliers (we shall look at outliers later)
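As an illustration of Step 1, the sketch below smooths one time series with a centred moving window. The slides do not say which smoothing method was used; the moving median, the window width of 3, and the function name `smooth` are assumptions chosen for the example. Missing values are represented as `None`.

```python
from statistics import median

def smooth(series, window=3, stat=median):
    """Centred moving-window smoothing that tolerates missing values (None)."""
    half = window // 2
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        present = [v for v in series[lo:hi] if v is not None]
        out.append(stat(present) if present else None)
    return out

# Hypothetical series with one outlier (60) and one missing value.
values = [10, 12, 60, 14, 16, None, 18]
smoothed = smooth(values)
```

With these invented values the outlier no longer dominates the smoothed series, so the general course of the behaviour is easier to compare between places.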
14 Explore spatial distribution of temporal behaviours
- Are behaviours in neighbouring places similar?
- Step 2. Temporal comparison (e.g. with a particular year or with the mean for a period): helps to disregard absolute differences in values and thus focus on behaviours
- Observation: no strong similarity between neighbouring places
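The temporal comparison in Step 2 can be sketched as a simple transformation of each series into differences from a reference. The two function names and the sample values are hypothetical; the point is only that two plots with different absolute levels but the same behaviour become identical after the transformation.

```python
def compare_to_year(series, years, ref_year):
    """Differences of each value from the value in the reference year."""
    ref = series[years.index(ref_year)]
    return [None if v is None or ref is None else v - ref for v in series]

def compare_to_period_mean(series, years, start, end):
    """Differences of each value from the mean over the years start..end."""
    period = [v for v, y in zip(series, years)
              if start <= y <= end and v is not None]
    if not period:
        return [None] * len(series)
    ref = sum(period) / len(period)
    return [None if v is None else v - ref for v in series]

years = [2000, 2001, 2002, 2003]
plot_a = [10, 15, 20, 25]   # low absolute level
plot_b = [50, 55, 60, 65]   # high absolute level, same behaviour

a_vs_2000 = compare_to_year(plot_a, years, 2000)
b_vs_2000 = compare_to_year(plot_b, years, 2000)
```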
15 Compare behaviours in plots with different main species
- Mosaic signs
  - 6 rows for species
  - 14 columns for years 1990-2003
  - Colours encode defoliation values
- Observation: behaviours differ for different main species
16 Explore overall temporal trends
- Line overlapping obstructs data analysis
- → Apply aggregation
17 Aggregation method 1: by quantiles
18 Aggregation method 2: by intervals
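The two aggregation methods named on slides 17 and 18 might be sketched like this. The table layout (rows are time series, columns are years), the function names, and the break values are assumptions for illustration; `statistics.quantiles` is the Python standard-library function (Python 3.8+), used here with the "inclusive" method.

```python
from statistics import quantiles

def yearly_quantiles(table, n=4):
    """Per year (column), the n-1 cut points splitting values into n groups."""
    out = []
    for col in zip(*table):
        present = [v for v in col if v is not None]
        out.append(quantiles(present, n=n, method="inclusive"))
    return out

def yearly_interval_counts(table, breaks):
    """Per year (column), counts of values in the intervals set by breaks."""
    out = []
    for col in zip(*table):
        counts = [0] * (len(breaks) + 1)
        for v in col:
            if v is not None:
                counts[sum(v >= b for b in breaks)] += 1
        out.append(counts)
    return out

# Hypothetical table: 5 series (rows) observed in 2 years (columns).
table = [
    [0, 5],
    [10, 15],
    [20, None],
    [30, 35],
    [40, 45],
]
cuts = yearly_quantiles(table)
counts = yearly_interval_counts(table, breaks=[10, 25])
```

Aggregation by quantiles shows where equal-sized groups of series lie in each year; aggregation by intervals shows how many series fall into fixed value ranges. Either replaces thousands of overlapping lines with a few summary bands.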
19 Divide and focus: Germany
20 Divide and focus: age groups 1, 3
21 Attend to particulars
- Types of particulars (examples)
  - Extreme values
  - Extreme changes
  - High variability
  - ...
- Questions
  - When?
  - Where?
  - What is around?
  - Why? (a question for further, in-depth analysis)
- Domain knowledge is essential
22 Attend to particulars: extreme values
- Click on a segment corresponding to extreme values
- The behaviour(s) is (are) highlighted on the time graph
- The location(s) is (are) highlighted on the map
23 Attend to particulars: what is around?
- In some neighbouring places the behaviours during the period 2000-2003 are somewhat similar
24 Attend to particulars: extreme changes
- Transform the time graph to show changes
- Select extreme changes in a specific year (here: 2003)
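The two steps on this slide, transforming values into changes and selecting extreme changes in one year, can be sketched as below. The function names, the threshold, and the sample table are hypothetical; on the slides the selection is done interactively on the time graph.

```python
def changes(series):
    """Year-to-year differences; None where either neighbour is missing."""
    return [
        None if a is None or b is None else b - a
        for a, b in zip(series, series[1:])
    ]

def extreme_changes(table, years, year, threshold):
    """Ids of series whose change into `year` is at least `threshold`
    in absolute value."""
    k = years.index(year) - 1      # change k leads from years[k] to years[k+1]
    if k < 0:
        raise ValueError("no change is defined for the first year")
    selected = []
    for plot_id, series in table.items():
        d = changes(series)[k]
        if d is not None and abs(d) >= threshold:
            selected.append(plot_id)
    return selected

# Hypothetical plots and values.
years = [2001, 2002, 2003]
table = {
    "A": [20, 22, 50],
    "B": [30, 31, 29],
    "C": [40, None, 10],
    "D": [60, 55, 20],
}
selected = extreme_changes(table, years, 2003, threshold=25)
```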
25 Attend to particulars: high variation
- Aggregate time graph by quantiles
- Save counts
- Visualise, e.g. on a scatter plot
- Select items with high variation
26 Attend to particulars: high fluctuation
- Select items with maximal number of jumps between quantiles
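One possible reading of "jumps between quantiles" is sketched here: in each year every series is assigned to a quantile class, and a jump is counted whenever the class changes from one observed year to the next. The class assignment rule, the function names, and the sample data are assumptions; the slides do not define the computation.

```python
from statistics import quantiles

def quantile_classes(table, n=4):
    """Per year, replace each value by the index (0..n-1) of its quantile class."""
    ids = list(table)
    n_years = len(next(iter(table.values())))
    classes = {pid: [] for pid in ids}
    for t in range(n_years):
        present = [table[p][t] for p in ids if table[p][t] is not None]
        cuts = quantiles(present, n=n, method="inclusive")
        for p in ids:
            v = table[p][t]
            classes[p].append(None if v is None else sum(v > c for c in cuts))
    return classes

def count_jumps(class_seq):
    """Number of class changes between consecutive non-missing years."""
    seq = [c for c in class_seq if c is not None]
    return sum(a != b for a, b in zip(seq, seq[1:]))

# Hypothetical data: A and C are stable, B and D alternate.
table = {
    "A": [10, 10, 10, 10],
    "B": [20, 40, 20, 40],
    "C": [30, 30, 30, 30],
    "D": [40, 20, 40, 20],
}
classes = quantile_classes(table, n=2)   # two classes: below/above the median
jumps = {p: count_jumps(seq) for p, seq in classes.items()}
most = max(jumps.values())
fluctuating = [p for p, j in jumps.items() if j == most]
```

The per-series class sequences could also be tallied (years spent in each class) to support the "save counts" step of the previous slide.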
27 Attend to particulars: stable extremes
- Select items that are always in the topmost 10
28 Attend to particulars: stable increase
- Switch the time graph to the segmentation mode
- Choose "increase" and set the minimum difference
- Select a sequence of years by clicking
- Check sensitivity to the time period!
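A non-interactive analogue of this selection is sketched below, under the assumption that "minimum difference" applies to every year-to-year step within the chosen period; the slides do not spell this out. The function name and the sample series are hypothetical. The last two lines illustrate the sensitivity warning: the same series is selected for one period and rejected for a slightly longer one.

```python
def stable_increase(series, years, start, end, min_diff):
    """True if every year-to-year step from `start` to `end`
    rises by at least `min_diff` (missing values disqualify)."""
    i, j = years.index(start), years.index(end)
    window = series[i:j + 1]
    if len(window) < 2 or any(v is None for v in window):
        return False
    return all(b - a >= min_diff for a, b in zip(window, window[1:]))

years = [1999, 2000, 2001, 2002, 2003]
plot_a = [30, 20, 25, 31, 38]
plot_b = [10, 12, 13, 20, 21]

a_short = stable_increase(plot_a, years, 2000, 2003, min_diff=5)
a_long = stable_increase(plot_a, years, 1999, 2003, min_diff=5)
b_short = stable_increase(plot_b, years, 2000, 2003, min_diff=5)
```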
29 Conclusions: the data
- This dataset is not suitable for application of major statistical analysis methods due to
  - absence of spatial and temporal smoothness
  - skewed distributions
  - outliers
  - missing values
- The data may be suitable for other purposes (e.g. in the context of a broader study of the ecological situation over Europe)
- EDA methods can promote insights
30 Recap: exploration procedure
- See the whole
  - Evolution of spatial patterns in time
  - Distribution of temporal behaviours in space
- Divide and focus
  - Data were explored by slices and subsets (species, age groups, countries, years, ...)
- Attend to particulars
  - Extreme values, extreme changes, high variation, high fluctuations, stable growth
31 Recap: tools
- Visualisation: on thematic maps, time graphs, other aspatial displays
- Aggregation: reduce data volume and symbol overlapping
- Filtering: divide and focus (select subsets)
- Marking: see corresponding data on different displays
- Data transformation: smoothing, computing changes, normalisation etc.
- It is important to use the tools in combination
32 Further information
- Software: http://www.commongis.com
- Scientific issues (papers, tutorials, demos): http://www.ais.fraunhofer.de/and
- Book to appear:
  - N. and G. Andrienko
  - Exploratory Analysis of Spatial and Temporal Data. A Systematic Approach
  - (Springer-Verlag, around the end of 2005)
A systematic approach to defining tasks, tools, and principles of EDA
33 http://www.ais.fraunhofer.de/and
In press, to appear around the end of 2005