Efficiently Managing Large-Scale Raster Species Distribution Data in PostgreSQL - PowerPoint PPT Presentation

About This Presentation
Title:

Efficiently Managing Large-Scale Raster Species Distribution Data in PostgreSQL

Description:

Efficiently Managing Large-Scale Raster Species Distribution Data in PostgreSQL Jianting Zhang, Dept. of Computer Science The City College of the City University of ... – PowerPoint PPT presentation

Number of Views:64
Avg rating:3.0/5.0
Slides: 28
Provided by: zhang
Category:

less

Transcript and Presenter's Notes

Title: Efficiently Managing Large-Scale Raster Species Distribution Data in PostgreSQL


1
Efficiently Managing Large-Scale Raster Species
Distribution Data in PostgreSQL
  • Jianting Zhang, Dept. of Computer Science
  • The City College of the City University of New
    York
  • Michael Gertz, Institute of Computer Science
  • University of Heidelberg
  • Le Gruenwald, School of Computer Science
  • The University of Oklahoma

2
Outline
  • Introduction
  • Background and Related Work
  • Species Distribution Data
  • Quadtree Indexing and Query Processing
  • The Proposed Solution
  • Database Preparation
  • Query Window Decomposition
  • Query Optimization and Result Combination
  • Experiments and Evaluation
  • Conclusion and Future Work

3
Introduction
3
4
Introduction
Linked Environment for Exploratory Analysis of
Large-Scale Species Distribution Data
ACMGIS08
5
Introduction
Environmental
Taxonomic
Correlation
Distribution
Configuration
Distribution
Approximation
Geographical
Phylogenetic
6
Background Species Distribution Data
Museum Collections or Species checklists Global Biodiversty Facility (GBIF) data portal (http//data.gbif.org/) 177,887,193 occurrence records from 294 data providers as of 07/25/2009
Museum Collections or Species checklists Species 2000 Annual Checklist (www.sp2000.org) 1,160,711 species as of 2009
Species Range Maps NatureServe Birds of the Western Hemisphere http//www.natureserve.org 4253 birds species 3 Gigabytes
Species Range Maps Little Tree Species http//esp.cr.usgs.gov/data/atlas/little/ 679 tree species, 137 Megabytes
Compiled Species databases WWF Wildfinder database http//www.worldwildlife.org/science/data/wildfinder.cfm 29112 species, 4815 genus, 445 families, 69 orders, 350045 species-ecoregion records, 80 Megabytes
Compiled Species databases USDA Plant databases http//plants.usda.gov/ 89759 plant species in 3141 US counties
7
Background Species Distribution Data
8
Related Work
  • Quadtree Indexing One of the oldest and most
    extensively studied indexing and query processing
    approach Gaede and Günther 1998 Samet 2005
  • Research prototypes QUILT (Shaffer et al 1990)
    and SAND (Esperanca and Samet 2002, Samet and
    Webber 2006 ) - Lacking of full SQL support
  • Window query in linear quadtree (Aboulnaga and
    Aref 2001) -Key values are stored as Morton codes
    for B-Tree Indexing Implementation is
    non-trivial
  • SP-GIST (Aref and Ilyas 2001, Eltabakh et al
    2006) - Quad-tree based indexing of line-segments
    in PostgreSQL
  • Commercial products (Oracle/SQL Server) Kothuri
    et al 2002, Fang et al 2008
  • Quadtree based indexing of polygonal data
  • Filtering mechanism to facilitate querying
    spatial relationships at the polygon level
  • Not directly accessible to application developers

9
Related Work
  • Open source products
  • PostgreSQL Quadtree indexing for binary raster
    data is not available
  • Rasdaman support storing and querying dense
    multi-dimensional real-valued arrays based on
    tiling (chunking) techniques
  • Pieces that contribute to the research
  • Quadtree representation of geo-referenced data
    and linear quadtrees (Hunter et al 1979,
    Gargantini 1982, Samet et al 1983, Shaffer et al
    1990, Esperanca and Samet 2002, Samet and Webber
    2006)
  • Efficient window query decomposition (Aref and
    Samet 1993, Aref and Samet 1997, Proietti 1999,
    Aboulnaga and Aref 2001, Tsai et al 2004)
  • Microsoft SQL Server Spatials implementation of
    quadtree indexing based on path query (Fang et al
    2008)
  • LTREE Tree path indexing module in PostgreSQL

10
Overview of the Proposed Approach
11
Database Preparation
  • Polygons representing species distributions vary
    in sizes and shapes significantly
  • Distributions among a large number of species
    overlap greatly cross layer query
  • It is inefficient to create a quadtree for each
    polygon - quadtree paths will be duplicated
  • Associate a quadtree node with a set of species
    identifiers instead of a single species
    identifier
  • How to combine individual quadtrees for efficient
    query processing?

12
Database Preparation
Individual Quadtrees for Species Distributions
13
Database Preparation
Classic Combination
14
Database Preparation
Improved Combination
15
Database Preparation
  • Each of quadtree nodes (leaf or non-leaf) and
    their associated species identifiers become a
    tuple in PostgreSQL database
  • Index the table based on the paths of the
    quadtree nodes offline
  • Sample query
  • Select bk_id, sp_ids from TB where bk_id lt_at_ 3
  • selecting all the tuples whose paths are
    decedents of tree path 3
  • Results A (16), B(4), C(4)

16
Query Window Decomposition
  • Transform a spatial query window into tree paths
    to match with the database tuples - Decomposition
  • Exact matches searching for ancestors.
  • Exact matches searching for descendents

3
A
B
A
3.2
B
C
C
C
3.2.0
3.2.2
Condition to match a query window cell C with a
database tuple R
(C.ID is an ancestor of R.path) or (C.ID is a
descendant of R.path)
17
Query Window Decomposition
We adopt the approach reported in (Tsai et al
2004) -efficiency and easy of implementation
m8 d3 l13 (1 level 2 and 12 level 3)
  • Complexity analysis
  • Decomposition algorithm ? O(m) (Tsai et al
    2004), m is the larger of the width and height of
    a query window
  • Converting the outputs of (Tsai et al 2004) to
    tree paths ? O(ld). l is the number of
    decomposed cells and d is the depth of quadtree
    for a raster tessellation.
  • l is proportional to one dimension of the query
    window, i.e., m. (Tsai et al 2004)
  • The overall complexity O(m) O(ld) ?O(md)
  • d is a relative small number ?O(m)

18
Query Optimization
  • For a large query window that does not align with
    quadrant divisions very well, the decomposed
    cells can be thousands or even more and most of
    them will be small cells with same ancestor nodes
  • As the queries are sent to the server
    independently, duplicated tuples may be returned
    and need to be removed when combining query
    results

How to minimize duplication while ensure
correctness?
Can we remove duplication and make combination as
simple additions?
19
Query Optimization
20
Proposed Approach -Discussions
  • Summary
  • Utilizes existing database storage and indexing
    functions no need to define new data types,
    develop new indexing approaches, modify query
    syntax and revise database query engines
  • User queries are transformed into formats that
    are supported by existing database backend and
    the results are combined in the middleware to
    answer users query effectively and efficiently.
  • Advantages
  • Use SQL query syntax instead of being forced to
    use APIs
  • Use a variety of databases (as long as they
    support efficient path query)
  • The underlying database systems are left
    untouched
  • Reduce technical complexities
  • Does not depend on the availability of source code

21
Experiment Setups
  • Dell Precision T5400 workstation/PostgreSQL 8.3.5
  • Species Distribution Data
  • NatureServe http//www.natureserve.org/getData/an
    imalData.jsp
  • Mammals (1693 species), birds (4148 species) and
    amphibian (5816 species)
  • Quadtree
  • West hemisphere, i.e., (-180,-90, 0, 90)
  • Depth14 (21416384)
  • Spatial resolution is finer than 1 arc minute
    (1806010800)
  • Four query window sizes 0.1, 0.5, 1 and 5
    degrees
  • For each query window size 100 queries with
    random centers

22
Experiments on Database Preparation
  • Rasterized bird species distributions (finest
    resolution)
  • 46,139,247 cells
  • 1,318,136,140 pairs of (cell, identifier)
    combinations
  • 28.7 species per cell
  • of quadtree nodes (database tuples)
  • Classic combination 7,511,823 leaf nodes
  • Proposed combination 4,957,050 leaf and non-leaf
    nodes
  • Proposed combination 19.3 compression ratio
  • of species identifiers
  • Classic combination 831,903,250 (110.7 per
    node)
  • Proposed combination 23,865,343 (4.8 per node)
  • 34.9 times less with respect to the total of
    species identifiers
  • 23 time less with respect to the average of
    identifiers per node

23
Experiments on Query Processing
Query response time (ms) Baseline (lt) and
Optimized (op) approaches
of species
24
Experiments on Query Processing
Average and Maximum Response Times under Four
Query Windows for the Three Approaches (in
seconds)
Window Size Baseline Query Baseline Query Optimized Query Optimized Query
Window Size AVG MAX AVG MAX
0.1 0.14 0.36 0.03 0.06
0.5 0.93 2.39 0.12 0.25
1 1.63 3.93 0.20 0.50
5 9.68 21.13 1.16 3.38
25
Conclusions and Future Work
  • This research tackles the problem of storing,
    indexing and querying large-scale species
    distribution data in the form of binary rasters
    in a database environment
  • A middleware approach has been adopted by
    utilizing existing PostgreSQL database support
    for tree paths and by transforming spatial window
    query into tree path matching.
  • The approach does not require any modifications
    on database backend and is applicable to many
    database systems that support tree path matching.
  • An end-to-end system to manage large-scale
    species distribution datasets has been developed
    with demonstrated efficiency based on 4000 bird
    species distribution data in the West Hemisphere.

26
Conclusions and Future Work
  • Future work
  • Further extend the solution to manage even larger
    scale of species distribution datasets. The
    ultimate goal is to support all known species at
    the million scale and reduce the average query
    response time to below one second for realistic
    query window sizes to support interactive
    customer applications
  • Possible strategies
  • New efficient data structures and algorithms
    (e.g. column store, bitmap)
  • Combing pre-computation and on-demand processing
    (query window decomposition)
  • Using main-memory database techniques
    (indexing/querying)
  • Using GPU for faster query processing (Fermi,
    CUDA, Nexus)
  • Using Map-Reduce/Hadoop for parallel/distributed
    query processing (HadoopDB)

27
Latest Progresses
  • Main-memory database and query processing
    algorithm
  • Memory Consumption 200M for 4000 birds species
  • Query response time 1/4 second for query window
    size as large as 5 by 5 degrees
  • Can be used as a cache system for extremely large
    scale species distribution data
  • Web-based tool http//geoteci.engr.ccny.cuny.edu/
    geoteci/SPTestMap.html
Write a Comment
User Comments (0)
About PowerShow.com