Title: Efficiently Managing Large-Scale Raster Species Distribution Data in PostgreSQL
1Efficiently Managing Large-Scale Raster Species
Distribution Data in PostgreSQL
- Jianting Zhang, Dept. of Computer Science
- The City College of the City University of New
York - Michael Gertz, Institute of Computer Science
- University of Heidelberg
- Le Gruenwald, School of Computer Science
- The University of Oklahoma
2Outline
- Introduction
- Background and Related Work
- Species Distribution Data
- Quadtree Indexing and Query Processing
- The Proposed Solution
- Database Preparation
- Query Window Decomposition
- Query Optimization and Result Combination
- Experiments and Evaluation
- Conclusion and Future Work
3Introduction
3
4Introduction
Linked Environment for Exploratory Analysis of
Large-Scale Species Distribution Data
ACMGIS08
5Introduction
Environmental
Taxonomic
Correlation
Distribution
Configuration
Distribution
Approximation
Geographical
Phylogenetic
6Background Species Distribution Data
Museum Collections or Species checklists Global Biodiversty Facility (GBIF) data portal (http//data.gbif.org/) 177,887,193 occurrence records from 294 data providers as of 07/25/2009
Museum Collections or Species checklists Species 2000 Annual Checklist (www.sp2000.org) 1,160,711 species as of 2009
Species Range Maps NatureServe Birds of the Western Hemisphere http//www.natureserve.org 4253 birds species 3 Gigabytes
Species Range Maps Little Tree Species http//esp.cr.usgs.gov/data/atlas/little/ 679 tree species, 137 Megabytes
Compiled Species databases WWF Wildfinder database http//www.worldwildlife.org/science/data/wildfinder.cfm 29112 species, 4815 genus, 445 families, 69 orders, 350045 species-ecoregion records, 80 Megabytes
Compiled Species databases USDA Plant databases http//plants.usda.gov/ 89759 plant species in 3141 US counties
7Background Species Distribution Data
8Related Work
- Quadtree Indexing One of the oldest and most
extensively studied indexing and query processing
approach Gaede and Günther 1998 Samet 2005 - Research prototypes QUILT (Shaffer et al 1990)
and SAND (Esperanca and Samet 2002, Samet and
Webber 2006 ) - Lacking of full SQL support - Window query in linear quadtree (Aboulnaga and
Aref 2001) -Key values are stored as Morton codes
for B-Tree Indexing Implementation is
non-trivial - SP-GIST (Aref and Ilyas 2001, Eltabakh et al
2006) - Quad-tree based indexing of line-segments
in PostgreSQL
- Commercial products (Oracle/SQL Server) Kothuri
et al 2002, Fang et al 2008 - Quadtree based indexing of polygonal data
- Filtering mechanism to facilitate querying
spatial relationships at the polygon level - Not directly accessible to application developers
9Related Work
- Open source products
- PostgreSQL Quadtree indexing for binary raster
data is not available - Rasdaman support storing and querying dense
multi-dimensional real-valued arrays based on
tiling (chunking) techniques
- Pieces that contribute to the research
- Quadtree representation of geo-referenced data
and linear quadtrees (Hunter et al 1979,
Gargantini 1982, Samet et al 1983, Shaffer et al
1990, Esperanca and Samet 2002, Samet and Webber
2006) - Efficient window query decomposition (Aref and
Samet 1993, Aref and Samet 1997, Proietti 1999,
Aboulnaga and Aref 2001, Tsai et al 2004) - Microsoft SQL Server Spatials implementation of
quadtree indexing based on path query (Fang et al
2008) - LTREE Tree path indexing module in PostgreSQL
10Overview of the Proposed Approach
11Database Preparation
- Polygons representing species distributions vary
in sizes and shapes significantly - Distributions among a large number of species
overlap greatly cross layer query - It is inefficient to create a quadtree for each
polygon - quadtree paths will be duplicated - Associate a quadtree node with a set of species
identifiers instead of a single species
identifier - How to combine individual quadtrees for efficient
query processing?
12Database Preparation
Individual Quadtrees for Species Distributions
13Database Preparation
Classic Combination
14Database Preparation
Improved Combination
15Database Preparation
- Each of quadtree nodes (leaf or non-leaf) and
their associated species identifiers become a
tuple in PostgreSQL database - Index the table based on the paths of the
quadtree nodes offline - Sample query
- Select bk_id, sp_ids from TB where bk_id lt_at_ 3
- selecting all the tuples whose paths are
decedents of tree path 3 - Results A (16), B(4), C(4)
16Query Window Decomposition
- Transform a spatial query window into tree paths
to match with the database tuples - Decomposition
- Exact matches searching for ancestors.
- Exact matches searching for descendents
3
A
B
A
3.2
B
C
C
C
3.2.0
3.2.2
Condition to match a query window cell C with a
database tuple R
(C.ID is an ancestor of R.path) or (C.ID is a
descendant of R.path)
17Query Window Decomposition
We adopt the approach reported in (Tsai et al
2004) -efficiency and easy of implementation
m8 d3 l13 (1 level 2 and 12 level 3)
- Complexity analysis
- Decomposition algorithm ? O(m) (Tsai et al
2004), m is the larger of the width and height of
a query window - Converting the outputs of (Tsai et al 2004) to
tree paths ? O(ld). l is the number of
decomposed cells and d is the depth of quadtree
for a raster tessellation. - l is proportional to one dimension of the query
window, i.e., m. (Tsai et al 2004) - The overall complexity O(m) O(ld) ?O(md)
- d is a relative small number ?O(m)
18Query Optimization
- For a large query window that does not align with
quadrant divisions very well, the decomposed
cells can be thousands or even more and most of
them will be small cells with same ancestor nodes
- As the queries are sent to the server
independently, duplicated tuples may be returned
and need to be removed when combining query
results
How to minimize duplication while ensure
correctness?
Can we remove duplication and make combination as
simple additions?
19Query Optimization
20Proposed Approach -Discussions
- Summary
- Utilizes existing database storage and indexing
functions no need to define new data types,
develop new indexing approaches, modify query
syntax and revise database query engines - User queries are transformed into formats that
are supported by existing database backend and
the results are combined in the middleware to
answer users query effectively and efficiently. - Advantages
- Use SQL query syntax instead of being forced to
use APIs - Use a variety of databases (as long as they
support efficient path query) - The underlying database systems are left
untouched - Reduce technical complexities
- Does not depend on the availability of source code
21Experiment Setups
- Dell Precision T5400 workstation/PostgreSQL 8.3.5
- Species Distribution Data
- NatureServe http//www.natureserve.org/getData/an
imalData.jsp - Mammals (1693 species), birds (4148 species) and
amphibian (5816 species) - Quadtree
- West hemisphere, i.e., (-180,-90, 0, 90)
- Depth14 (21416384)
- Spatial resolution is finer than 1 arc minute
(1806010800) - Four query window sizes 0.1, 0.5, 1 and 5
degrees - For each query window size 100 queries with
random centers
22Experiments on Database Preparation
- Rasterized bird species distributions (finest
resolution) - 46,139,247 cells
- 1,318,136,140 pairs of (cell, identifier)
combinations - 28.7 species per cell
- of quadtree nodes (database tuples)
- Classic combination 7,511,823 leaf nodes
- Proposed combination 4,957,050 leaf and non-leaf
nodes - Proposed combination 19.3 compression ratio
- of species identifiers
- Classic combination 831,903,250 (110.7 per
node) - Proposed combination 23,865,343 (4.8 per node)
- 34.9 times less with respect to the total of
species identifiers - 23 time less with respect to the average of
identifiers per node
23Experiments on Query Processing
Query response time (ms) Baseline (lt) and
Optimized (op) approaches
of species
24Experiments on Query Processing
Average and Maximum Response Times under Four
Query Windows for the Three Approaches (in
seconds)
Window Size Baseline Query Baseline Query Optimized Query Optimized Query
Window Size AVG MAX AVG MAX
0.1 0.14 0.36 0.03 0.06
0.5 0.93 2.39 0.12 0.25
1 1.63 3.93 0.20 0.50
5 9.68 21.13 1.16 3.38
25Conclusions and Future Work
- This research tackles the problem of storing,
indexing and querying large-scale species
distribution data in the form of binary rasters
in a database environment
- A middleware approach has been adopted by
utilizing existing PostgreSQL database support
for tree paths and by transforming spatial window
query into tree path matching.
- The approach does not require any modifications
on database backend and is applicable to many
database systems that support tree path matching.
- An end-to-end system to manage large-scale
species distribution datasets has been developed
with demonstrated efficiency based on 4000 bird
species distribution data in the West Hemisphere.
26Conclusions and Future Work
- Future work
- Further extend the solution to manage even larger
scale of species distribution datasets. The
ultimate goal is to support all known species at
the million scale and reduce the average query
response time to below one second for realistic
query window sizes to support interactive
customer applications - Possible strategies
- New efficient data structures and algorithms
(e.g. column store, bitmap) - Combing pre-computation and on-demand processing
(query window decomposition) - Using main-memory database techniques
(indexing/querying) - Using GPU for faster query processing (Fermi,
CUDA, Nexus) - Using Map-Reduce/Hadoop for parallel/distributed
query processing (HadoopDB)
27Latest Progresses
- Main-memory database and query processing
algorithm - Memory Consumption 200M for 4000 birds species
- Query response time 1/4 second for query window
size as large as 5 by 5 degrees - Can be used as a cache system for extremely large
scale species distribution data - Web-based tool http//geoteci.engr.ccny.cuny.edu/
geoteci/SPTestMap.html