Efficiently Managing Large-Scale Raster Species Distribution Data in PostgreSQL - PowerPoint PPT Presentation

About This Presentation

Title:

Efficiently Managing Large-Scale Raster Species Distribution Data in PostgreSQL

Description:

Efficiently Managing Large-Scale Raster Species Distribution Data in PostgreSQL Jianting Zhang, Dept. of Computer Science The City College of the City University of ... – PowerPoint PPT presentation

Number of Views:65

Avg rating:3.0/5.0

Slides: 28

Provided by: zhang

Learn more at: http://www-cs.ccny.cuny.edu

Category:

more less

Transcript and Presenter's Notes

Title: Efficiently Managing Large-Scale Raster Species Distribution Data in PostgreSQL

1
Efficiently Managing Large-Scale Raster Species
Distribution Data in PostgreSQL

Jianting Zhang, Dept. of Computer Science
The City College of the City University of New
York
Michael Gertz, Institute of Computer Science
University of Heidelberg
Le Gruenwald, School of Computer Science
The University of Oklahoma

2
Outline

Introduction
Background and Related Work
Species Distribution Data
Quadtree Indexing and Query Processing
The Proposed Solution
Database Preparation
Query Window Decomposition
Query Optimization and Result Combination
Experiments and Evaluation
Conclusion and Future Work

3
Introduction
3
4
Introduction
Linked Environment for Exploratory Analysis of
Large-Scale Species Distribution Data
ACMGIS08
5
Introduction
Environmental
Taxonomic
Correlation
Distribution
Configuration
Distribution
Approximation
Geographical
Phylogenetic
6
Background Species Distribution Data
Museum Collections or Species checklists Global Biodiversty Facility (GBIF) data portal (http//data.gbif.org/) 177,887,193 occurrence records from 294 data providers as of 07/25/2009
Museum Collections or Species checklists Species 2000 Annual Checklist (www.sp2000.org) 1,160,711 species as of 2009
Species Range Maps NatureServe Birds of the Western Hemisphere http//www.natureserve.org 4253 birds species 3 Gigabytes
Species Range Maps Little Tree Species http//esp.cr.usgs.gov/data/atlas/little/ 679 tree species, 137 Megabytes
Compiled Species databases WWF Wildfinder database http//www.worldwildlife.org/science/data/wildfinder.cfm 29112 species, 4815 genus, 445 families, 69 orders, 350045 species-ecoregion records, 80 Megabytes
Compiled Species databases USDA Plant databases http//plants.usda.gov/ 89759 plant species in 3141 US counties
7
Background Species Distribution Data
8
Related Work

Quadtree Indexing One of the oldest and most
extensively studied indexing and query processing
approach Gaede and Günther 1998 Samet 2005
Research prototypes QUILT (Shaffer et al 1990)
and SAND (Esperanca and Samet 2002, Samet and
Webber 2006 ) - Lacking of full SQL support
Window query in linear quadtree (Aboulnaga and
Aref 2001) -Key values are stored as Morton codes
for B-Tree Indexing Implementation is
non-trivial
SP-GIST (Aref and Ilyas 2001, Eltabakh et al
2006) - Quad-tree based indexing of line-segments
in PostgreSQL

Commercial products (Oracle/SQL Server) Kothuri
et al 2002, Fang et al 2008
Quadtree based indexing of polygonal data
Filtering mechanism to facilitate querying
spatial relationships at the polygon level
Not directly accessible to application developers

9
Related Work

Open source products
PostgreSQL Quadtree indexing for binary raster
data is not available
Rasdaman support storing and querying dense
multi-dimensional real-valued arrays based on
tiling (chunking) techniques

Pieces that contribute to the research
Quadtree representation of geo-referenced data
and linear quadtrees (Hunter et al 1979,
Gargantini 1982, Samet et al 1983, Shaffer et al
1990, Esperanca and Samet 2002, Samet and Webber
2006)
Efficient window query decomposition (Aref and
Samet 1993, Aref and Samet 1997, Proietti 1999,
Aboulnaga and Aref 2001, Tsai et al 2004)
Microsoft SQL Server Spatials implementation of
quadtree indexing based on path query (Fang et al
2008)
LTREE Tree path indexing module in PostgreSQL

10
Overview of the Proposed Approach
11
Database Preparation

Polygons representing species distributions vary
in sizes and shapes significantly
Distributions among a large number of species
overlap greatly cross layer query
It is inefficient to create a quadtree for each
polygon - quadtree paths will be duplicated
Associate a quadtree node with a set of species
identifiers instead of a single species
identifier
How to combine individual quadtrees for efficient
query processing?

12
Database Preparation
Individual Quadtrees for Species Distributions
13
Database Preparation
Classic Combination
14
Database Preparation
Improved Combination
15
Database Preparation

Each of quadtree nodes (leaf or non-leaf) and
their associated species identifiers become a
tuple in PostgreSQL database
Index the table based on the paths of the
quadtree nodes offline
Sample query
Select bk_id, sp_ids from TB where bk_id lt_at_ 3
selecting all the tuples whose paths are
decedents of tree path 3
Results A (16), B(4), C(4)

16
Query Window Decomposition

Transform a spatial query window into tree paths
to match with the database tuples - Decomposition

Exact matches searching for ancestors.
Exact matches searching for descendents

3
A
B
A
3.2
B
C
C
C
3.2.0
3.2.2
Condition to match a query window cell C with a
database tuple R
(C.ID is an ancestor of R.path) or (C.ID is a
descendant of R.path)
17
Query Window Decomposition
We adopt the approach reported in (Tsai et al
2004) -efficiency and easy of implementation
m8 d3 l13 (1 level 2 and 12 level 3)

Complexity analysis
Decomposition algorithm ? O(m) (Tsai et al
2004), m is the larger of the width and height of
a query window
Converting the outputs of (Tsai et al 2004) to
tree paths ? O(ld). l is the number of
decomposed cells and d is the depth of quadtree
for a raster tessellation.
l is proportional to one dimension of the query
window, i.e., m. (Tsai et al 2004)
The overall complexity O(m) O(ld) ?O(md)
d is a relative small number ?O(m)

18
Query Optimization

For a large query window that does not align with
quadrant divisions very well, the decomposed
cells can be thousands or even more and most of
them will be small cells with same ancestor nodes
As the queries are sent to the server
independently, duplicated tuples may be returned
and need to be removed when combining query
results

How to minimize duplication while ensure
correctness?
Can we remove duplication and make combination as
simple additions?
19
Query Optimization
20
Proposed Approach -Discussions

Summary
Utilizes existing database storage and indexing
functions no need to define new data types,
develop new indexing approaches, modify query
syntax and revise database query engines
User queries are transformed into formats that
are supported by existing database backend and
the results are combined in the middleware to
answer users query effectively and efficiently.
Advantages
Use SQL query syntax instead of being forced to
use APIs
Use a variety of databases (as long as they
support efficient path query)
The underlying database systems are left
untouched
Reduce technical complexities
Does not depend on the availability of source code

21
Experiment Setups

Dell Precision T5400 workstation/PostgreSQL 8.3.5
Species Distribution Data
NatureServe http//www.natureserve.org/getData/an
imalData.jsp
Mammals (1693 species), birds (4148 species) and
amphibian (5816 species)
Quadtree
West hemisphere, i.e., (-180,-90, 0, 90)
Depth14 (21416384)
Spatial resolution is finer than 1 arc minute
(1806010800)
Four query window sizes 0.1, 0.5, 1 and 5
degrees
For each query window size 100 queries with
random centers

22
Experiments on Database Preparation

Rasterized bird species distributions (finest
resolution)
46,139,247 cells
1,318,136,140 pairs of (cell, identifier)
combinations
28.7 species per cell
of quadtree nodes (database tuples)
Classic combination 7,511,823 leaf nodes
Proposed combination 4,957,050 leaf and non-leaf
nodes
Proposed combination 19.3 compression ratio
of species identifiers
Classic combination 831,903,250 (110.7 per
node)
Proposed combination 23,865,343 (4.8 per node)
34.9 times less with respect to the total of
species identifiers
23 time less with respect to the average of
identifiers per node

23
Experiments on Query Processing
Query response time (ms) Baseline (lt) and
Optimized (op) approaches
of species
24
Experiments on Query Processing
Average and Maximum Response Times under Four
Query Windows for the Three Approaches (in
seconds)
Window Size Baseline Query Baseline Query Optimized Query Optimized Query
Window Size AVG MAX AVG MAX
0.1 0.14 0.36 0.03 0.06
0.5 0.93 2.39 0.12 0.25
1 1.63 3.93 0.20 0.50
5 9.68 21.13 1.16 3.38
25
Conclusions and Future Work

This research tackles the problem of storing,
indexing and querying large-scale species
distribution data in the form of binary rasters
in a database environment

A middleware approach has been adopted by
utilizing existing PostgreSQL database support
for tree paths and by transforming spatial window
query into tree path matching.

The approach does not require any modifications
on database backend and is applicable to many
database systems that support tree path matching.

An end-to-end system to manage large-scale
species distribution datasets has been developed
with demonstrated efficiency based on 4000 bird
species distribution data in the West Hemisphere.

26
Conclusions and Future Work

Future work
Further extend the solution to manage even larger
scale of species distribution datasets. The
ultimate goal is to support all known species at
the million scale and reduce the average query
response time to below one second for realistic
query window sizes to support interactive
customer applications
Possible strategies
New efficient data structures and algorithms
(e.g. column store, bitmap)
Combing pre-computation and on-demand processing
(query window decomposition)
Using main-memory database techniques
(indexing/querying)
Using GPU for faster query processing (Fermi,
CUDA, Nexus)
Using Map-Reduce/Hadoop for parallel/distributed
query processing (HadoopDB)

27
Latest Progresses

Main-memory database and query processing
algorithm
Memory Consumption 200M for 4000 birds species
Query response time 1/4 second for query window
size as large as 5 by 5 degrees
Can be used as a cache system for extremely large
scale species distribution data
Web-based tool http//geoteci.engr.ccny.cuny.edu/
geoteci/SPTestMap.html

Write a Comment

User Comments (0)