Title: Knowledge acquisition and semantic rules discovery for CORINE land cover mapping using Latent Dirich
1Knowledge acquisition and semantic rules
discovery for CORINE land cover mapping using
Latent Dirichlet Allocation
- Sixth conference on Image Information Mining
- EUSC Torrejon Air Base
- 3-5 November 2009
2Land Use Land Cover Mapping
- Subject discussed at EU and international level
- CLC 2000, CLC 2006
- FAO LCCS
- GMES Soil Sealing, Forest, Urban Atlas
- GlobCover
- CLC methodology
- High-resolution visible / IR data (Landsat, SPOT, IRS)
- False or natural color compositions
- Reference topographic maps (1:50K)
- Classification scheme
- Visual interpretation and delineation of polygons
- Quality control
- Products: image data, vector layer, raster layer
3Difficulties in Land Use Land Cover Mapping
- Availability and price of EO data
- Resolution characteristics (spatial, spectral, temporal) not always / not yet fulfilling user needs
- Landsat data were successful in answering current requirements but are no longer available; replaced by SPOT and IRS data
- New sensor data are available but not yet used intensively; the price is very high
- Methodology: low productivity due to
- Most of the time, visual interpretation and on-screen delineation of land cover polygons are the main tools
- Lack of operational automatic or semi-automatic methods capable of extracting the user-defined classes / methods exist but are not used by the production unit
- Solutions:
- Pay for new sensor data and use automatic methods (i.e. stimulate research and pay for software)
- Get data at a low price and further invest in human operators and cheap software tools
4Difficulties in Land Use Land Cover Mapping
- Operational use of automatic or semi-automatic methods is still not very widespread
- Sensors produce different data
- From one day to another
- From one place to another
- From one orbit to another, etc.
- Data coming from different sensors are different (SPOT <> IRS <> RapidEye, ERS <> Radarsat)
- Altered (processed) data can lead to different results (raw data vs. radiometrically enhanced data)
- User-defined classification schemes (nomenclatures) are application specific and not the same as the classes seen by the computer, i.e. the user is not interested in a generic "bare soil"; the user wants "non-irrigated arable land"
5Current work in ROSA
- Many R&D projects aim to increase the effectiveness of extracting geo-information from EO data: ROSAR, GEOINF, SIGUR, MUTER, ROKEO, SAFER, GEOLAND
- Both object-oriented and pixel-based methods are studied, towards bridging the gap between high-level semantic, user-defined needs and state-of-the-art classification algorithms
- Results are tested and evaluated against existing LULC datasets produced and validated under operational conditions
- Algorithms are tested on original data and efforts are made to make them work on new sensor data
6From text to images - Latent Dirichlet Allocation
- Latent Dirichlet Allocation: from text to image processing
- Probabilistic, Bayesian model focusing on documents, i.e. subsets of pixels (tiles)
- Documents are made of words, i.e. pixels, and express a latent set of topics, i.e. classes
- The whole set of words defines the vocabulary, i.e. the dynamic range
- Each topic is modeled as a probability distribution over the vocabulary
- Many documents together form a corpus, i.e. a satellite image
- The order of words in a document is ignored (bag of words)
- Based on training sets (histograms of documents), the model can discover the latent topics in documents
[Figure: text corpus, vocabulary, document, topics]
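The text-to-image analogy above can be made concrete with a small sketch. It assumes each pixel has already been mapped to a word index, cuts the map into square tiles (documents), and counts words per tile while discarding pixel positions (bag of words). The function name `tile_histograms` and the toy 4x4 map are illustrative assumptions, not part of the original presentation.

```python
def tile_histograms(word_map, tile_size, vocab_size):
    """Cut a 2-D map of word indices into square tiles (documents) and
    return one bag-of-words histogram per tile, in row-major tile order."""
    rows, cols = len(word_map), len(word_map[0])
    histograms = []
    # Step over the map tile by tile; incomplete border tiles are skipped.
    for r0 in range(0, rows - tile_size + 1, tile_size):
        for c0 in range(0, cols - tile_size + 1, tile_size):
            counts = [0] * vocab_size
            for r in range(r0, r0 + tile_size):
                for c in range(c0, c0 + tile_size):
                    counts[word_map[r][c]] += 1  # pixel position is discarded
            histograms.append(counts)
    return histograms


# Toy 4x4 "image" whose pixels already hold word indices 0..2 (made-up data).
toy_map = [
    [0, 0, 1, 1],
    [0, 2, 1, 1],
    [2, 2, 0, 1],
    [2, 2, 1, 0],
]
corpus = tile_histograms(toy_map, tile_size=2, vocab_size=3)
```

Each entry of `corpus` is one document histogram; the list as a whole plays the role of the corpus, i.e. the satellite image.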
7Latent Dirichlet Allocation
Given a vocabulary with N visual words
For K topics, a two-variable model can be derived
and then applied to an image document D
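The equations of this slide did not survive in the transcript. For orientation only, the standard LDA formulation is sketched below; it is not necessarily the exact form shown on the original slide. N, K and D follow the slide, while the symbols \alpha, \beta, \theta, z_m, w_m and the document length M are assumptions introduced here.

```latex
% Document D = (w_1, ..., w_M), each w_m being one of the N visual words.
% \theta: K-dimensional topic mixture of the document, \theta \sim \mathrm{Dir}(\alpha).
% \beta: K x N matrix with \beta_{kj} = p(w = j \mid z = k).
p(\theta, \mathbf{z}, D \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{m=1}^{M} p(z_m \mid \theta)\, p(w_m \mid z_m, \beta)

% Likelihood of the document, with the latent variables integrated out:
p(D \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \prod_{m=1}^{M} \sum_{z_m = 1}^{K} p(z_m \mid \theta)\, p(w_m \mid z_m, \beta)\, d\theta
```

In this notation each row of \beta is one topic, i.e. a probability distribution over the N-word vocabulary.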
8LDA for CLC Mapping
- Due to complex semantics, the (document, topics) couple is closer to the CLC classes, i.e. a document can get a CLC class assigned based on the latent topics inside it
- The huge vocabulary in a satellite image (dynamic range) can be reduced using unsupervised methods (e.g. k-means, SoilMapper) to decrease processing time; the new vocabulary is used to identify latent topics in a document
- The dimension of documents (tiles) is a function of the MMU (Minimum Mapping Unit, 25 ha): 15x15 pixels for Landsat, 50x50 pixels for SPOT5
- The number of topics used depends on the CLC level addressed (5, 15, 44) and is directly dependent on the unsupervised process, e.g. 35 words for a Landsat image
- Based on latent topics, a document can get a label according to the CLC nomenclature: words -> topics -> classes
- Semantic rules can be derived for each of the classes in a nomenclature: Class = f(topics(words))
[Figure: satellite image, compressed Landsat / SPOT vocabulary, document, topics, CLC class]
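The slide names k-means as one unsupervised option for reducing the vocabulary. A minimal sketch follows, assuming a single band of scalar pixel values and evenly spread initial centers; the function names `kmeans_1d` and `quantize` and the sample values are hypothetical. The tests reported in this presentation used SoilMapper for this step, not this code.

```python
def kmeans_1d(values, k, n_iter=20):
    """Lloyd's algorithm on scalar pixel values; returns the k cluster centers."""
    lo, hi = min(values), max(values)
    # Spread the initial centers evenly over the dynamic range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            groups[nearest].append(v)
        # Keep the old center when a cluster receives no pixels.
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers


def quantize(values, centers):
    """Replace each pixel value by the index of its nearest center (its word)."""
    return [min(range(len(centers)), key=lambda c: abs(v - centers[c]))
            for v in values]


# Made-up pixel values spanning a wide dynamic range.
pixels = [10, 12, 11, 200, 205, 198, 90, 95]
centers = kmeans_1d(pixels, k=3)
words = quantize(pixels, centers)
```

After this step the vocabulary size equals the number of centers, so the histograms and the topic model operate on a few words instead of the full dynamic range.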
9MEEO SoilMapper
- See Baraldi, A. et al. 2006. Automatic Spectral Rule-Based Preliminary Mapping of Calibrated Landsat TM and ETM+ Images. IEEE Transactions on Geoscience and Remote Sensing, vol. 44, no. 9 (September 2006), 2563-2585
- Image data calibrated to reflectance values using sensor-specific information
- Output: preliminary spectral map with pixels labeled according to an intermediate level
- For Landsat data, 85, 41, 27 or 16 classes can be derived
- Operational services provided by MEEO through ESA SSE
10CLC Mapping Workflow
Raw data -> reduced vocabulary -> topics on pixels -> topics on documents
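The topic-discovery step of this workflow could be sketched with a collapsed Gibbs sampler, one common way to fit LDA. The slides do not state which inference method was actually used, so this is an illustrative assumption. The function name `lda_gibbs`, the hyperparameter values and the toy documents are all hypothetical.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of documents, each a list of word indices in [0, vocab_size).
    Returns (doc_topic, topic_word) count matrices after the last sweep.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    # Give every word occurrence a random initial topic.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


# Toy corpus: three tiles given as lists of word indices (made-up data).
docs = [[0, 0, 1, 1, 0], [2, 3, 3, 2, 2], [0, 1, 0, 1]]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2, vocab_size=4)
```

Row d of `doc_topic` holds the topic counts of document d ("topics on documents"), and row k of `topic_word` holds the word counts of topic k.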
11Test results
- Tests were performed on Landsat data subsets of 600x600 pixels and on SPOT data subsets of 2000x2000 pixels; MEEO SoilMapper was used for intermediate classification (vocabulary compression)
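Combining these subset sizes with the tile sizes given earlier (15x15 pixels for Landsat, 50x50 pixels for SPOT5) gives the number of documents per subset and the ground area of one tile. The sketch below assumes pixel sizes of 30 m for Landsat and 10 m for SPOT5, which the slides do not state; the function names are hypothetical.

```python
def tile_area_ha(tile_side_px, pixel_size_m):
    """Ground area of a square tile in hectares (1 ha = 10,000 m^2)."""
    side_m = tile_side_px * pixel_size_m
    return side_m * side_m / 10_000


def documents_per_subset(subset_side_px, tile_side_px):
    """Number of whole, non-overlapping tiles in a square subset."""
    per_side = subset_side_px // tile_side_px
    return per_side * per_side


# Assumed pixel sizes: 30 m (Landsat), 10 m (SPOT5).
landsat_area = tile_area_ha(15, 30)            # 20.25 ha
spot_area = tile_area_ha(50, 10)               # 25.0 ha
landsat_docs = documents_per_subset(600, 15)   # 1600 tiles
spot_docs = documents_per_subset(2000, 50)     # 1600 tiles
```

Under these assumptions the SPOT5 tile matches the 25 ha MMU exactly, the Landsat tile is slightly smaller, and both subset sizes yield the same number of documents.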
12Test results
- Example 1
- Landsat data
- 27 words / topics
- CLC Level 1 Classes
- Overall accuracy 92%
13Test results
- Example 2
- Landsat data
- 41 words / topics
- CLC Level 2 Classes
- Overall accuracy 70%
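The slides do not say how overall accuracy was computed. A common definition is the share of correctly labeled samples in a confusion matrix, sketched below. The function name and the matrix values are made up for illustration and are not the results reported here.

```python
def overall_accuracy(confusion):
    """Fraction of correctly labeled samples: diagonal sum / total."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Made-up 3-class confusion matrix (rows: reference, columns: predicted).
toy_confusion = [
    [45, 3, 2],
    [4, 30, 1],
    [1, 2, 12],
]
accuracy = overall_accuracy(toy_confusion)
```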
14Semantic rules
- The current approach allows one to establish semantic rules that explain high-level semantics using intermediate-level semantics
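One simple form such a rule could take, in the spirit of Class = f(topics(words)), is a lookup from latent topics to classes followed by a majority vote within the document. The slides do not give the actual rules, so the sketch below is an assumption. The class names are real CLC Level 1 names, but the topic-to-class mapping, the counts and the function name `label_document` are hypothetical.

```python
def label_document(topic_counts, topic_to_class):
    """Sum topic counts per class; return the dominant class and its share."""
    votes = {}
    for topic, count in enumerate(topic_counts):
        cls = topic_to_class[topic]
        votes[cls] = votes.get(cls, 0) + count
    total = sum(votes.values())
    best = max(votes, key=votes.get)
    return best, votes[best] / total


# Hypothetical mapping from four latent topics to CLC Level 1 classes.
topic_to_class = {
    0: "Artificial surfaces",
    1: "Agricultural areas",
    2: "Agricultural areas",
    3: "Water bodies",
}
# Made-up topic counts for one 15x15 tile (225 pixels).
label, share = label_document([20, 90, 100, 15], topic_to_class)
```

The returned share can serve as a rough confidence: a tile dominated by one class gets a value near 1, while a mixed tile gets a lower value.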
15Concluding remarks
- A method is proposed based on similarities that can be established between text and satellite images used in LULC mapping, in terms of words, documents and topics
- The current approach shows promising results in bridging the gap between low-level semantic data and high-level semantic features by operating an intermediate-level classification and then using LDA, thus addressing user requirements for meaningful information for policy making
- Currently, the availability of fully automatic methods allowing effective sensor-specific intermediate-level classification is exploited in order to reach a superior semantic level
- Although CLC tests were shown, any other high-level semantic nomenclature can be applied; semantic rules can be established between the intermediate and superior semantic level features
- A good level of accuracy is obtained by applying LDA on intermediate, fully automatically classified data
- In line with current discussions (GMES, INSPIRE) on new data models for LULC data promoting object-oriented schemes
16Contact
- Dragos Bratasanu, ROSA dragos.bratasanu_at_rosa.ro
- Ion Nedelcu, ROSA ion.nedelcu_at_rosa.ro
- Mihai Datcu, DLR mihai.datcu_at_dlr.de