1
Efficient Algorithms for Large-Scale GIS
Applications

Laura Toma
Duke University

2
Why GIS?

How it all started..
Duke Environmental researchers
computing flow accumulation for Appalachian
Mountains took 14 days (with 512MB memory)
800km x 800km at 100m resolution → 64 million
points
GIS (Geographic Information Systems)
System that handles spatial data
Visualization, processing, queries, analysis
Indispensable tool
Modeling, analysis, prediction, decision making
Rich area of problems for Computer Science
Graphics, graph theory, computational geometry etc

3
GIS and the Environment

Monitoring: keep an eye on the state of earth
systems using satellites and monitoring stations
(water, ecosystems, urban development)
Modeling, simulation: predict consequences of
human actions and natural processes
Analysis and risk assessment: find the problem
areas and analyse the possible causes (soil
erosion, groundwater pollution, traffic jams)
Planning and decision support: provide
information and tools for better management of
natural and socio-economic resources

4
Precipitation in Tropical South America
Lots of rain Dry
H. Mitasova
5
Nitrogen in Chesapeake Bay
High nitrogen concentrations
H. Mitasova
6
Jockey's Ridge evolution
Combining IR-DOQQ, LIDAR and RTK GPS to assess
the change: decreasing elevation, extending
towards homes and a road
H. Mitasova
7
Bald Head Island Renourishment
Legend: 1998 LIDAR shoreline; 2000 LIDAR shoreline;
2001, Dec. RTK GPS shoreline
Surface is 1998 LIDAR
H. Mitasova
8
Sediment flow
H. Mitasova
9
Computations on Terrains

Reality
Height of terrain is a continuous function of
two variables f(x,y)
Estimate, predict, simulate
Flooding, pollution
Erosion, deposition
Vegetation structure
.

GIS
DEM (Digital Elevation Model) is a set of sample
points and their heights {x, y, h_xy}

Compute indices
10
DEM Representations
TIN
Grid
Contour lines
Sample points
11
Panama DEM
12
Modeling Flow on Terrains

What happens when it rains?
Predict areas susceptible to floods.
Predict location of streams.
Compute watersheds.
Flow is modeled using two basic attributes
Flow Direction (FD)
The direction water flows at a point
Flow Accumulation (FA)
Total amount of water that flows through a point
(if water is distributed according to the flow
directions)

13
Panama DEM - Flow Accumulation
16
Uses

Flow direction and flow accumulation are used
for
Computing other hydrological attributes
river network
moisture indices
watersheds and watershed divides
Analysis and prediction of sediment and
pollutant movement in landscapes.
Decision support in land management, flood and
pollution prevention and disaster management

17
Massive Terrain Data

Remote sensing technology
Massive amounts of terrain data
Higher resolutions (1km, 100m, 30m, 10m, 1m, ...)

NASA-SRTM
Mission launched in 2000
Acquired data for 80% of earth at 30m resolution
5TB
USGS
Most of US at 10m resolution
LIDAR
1m res

18
Example: LIDAR Terrain Data

Massive (irregular) point sets (1-10m resolution)
Relatively cheap and easy to collect

Example: Jockey's Ridge (NC coast)
19
It's Growing!

Appalachian Mountains
Area of approx. 800 km x 800 km
Sampled at
100m resolution → 64 million points (128MB)
30m resolution → 640 million points (1.2GB)
10m resolution → 6.4 billion points (12GB)
1m resolution → 640 billion points (1.2TB)

20
Computing on Massive Data

GRASS (open source GIS)
Killed after running for 17 days on a 6700 x 4300
grid (approx 50 MB dataset)
TARDEM (research, U. Utah)
Killed after running for 20 days on a 12000 x
10000 grid (approx 240 MB dataset)
CPU utilization 5%, 3GB swap file
ArcInfo (ESRI, commercial GIS)
Can handle the 240MB dataset
Doesn't work for datasets bigger than 2GB

21
Outline

Introduction
Flow direction and flow accumulation
Definitions, assumptions, algorithm outline.
Scalability to large terrains
Why not?
I/O-efficient algorithms
I/O-efficient flow accumulation
TerraFlow
Theoretical results
Conclusion

22
Flow Direction (FD) on Grids

Water flows downhill
follows the gradient
On grids: approximated using 3x3 neighborhood
SFD (Single-Flow Direction)
FD points to the steepest downslope neighbor
MFD (Multiple-Flow direction)
FD points to all downslope neighbors

23
Flow accumulation with MFD
24
Flow accumulation with SFD
25
Computing FD

Goal: compute FD for every cell in the grid (FD
grid)
Algorithm
For each cell, compute SFD/MFD by inspecting its 8
neighbor cells
Analysis: O(N) time for a grid of N cells
Is this all?
No! Flat areas: plateaus and sinks
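
The per-cell rule above can be sketched in a few lines of Python. This is only an illustrative in-memory sketch, not code from the talk: the names `downslope_neighbors`, `sfd` and `mfd` are hypothetical, and dividing the height drop by the distance to the neighbor (1 or sqrt(2)) is an assumption about how "steepest" is measured. Flat cells simply get no direction here.

```python
# Offsets of the 8 neighbors in the 3x3 window around a cell.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def downslope_neighbors(grid, r, c):
    """Return [(slope, (nr, nc)), ...] for all strictly lower neighbors of (r, c)."""
    rows, cols = len(grid), len(grid[0])
    result = []
    for dr, dc in NEIGHBORS:
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] < grid[r][c]:
            dist = (dr * dr + dc * dc) ** 0.5  # 1 for axis moves, sqrt(2) for diagonals
            result.append(((grid[r][c] - grid[nr][nc]) / dist, (nr, nc)))
    return result

def sfd(grid, r, c):
    """Single flow direction: the steepest downslope neighbor, or None on flats/sinks."""
    down = downslope_neighbors(grid, r, c)
    return max(down)[1] if down else None

def mfd(grid, r, c):
    """Multiple flow direction: all downslope neighbors."""
    return [cell for _, cell in downslope_neighbors(grid, r, c)]

dem = [[9, 8, 7],
       [8, 5, 4],
       [7, 4, 1]]
print(sfd(dem, 1, 1))
print(mfd(dem, 1, 1))
```

Each cell looks at a constant number of neighbors, which is where the O(N) bound comes from.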

26
FD on Flat Areas

no obvious flow direction
Plateaus
Assign flow directions such that each cell flows
towards the nearest spill point of the plateau
Sinks
Either catch the water inside the sink
Or route the water outside the sink using uphill
flow directions
Model steady state of water and remove (fill)
sinks by simulating flooding: uniformly
pouring water on terrain until steady state
is reached
Assign uphill flow directions on the original
terrain by assigning downhill flow directions on
the flooded terrain
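
One way to picture the flooding step is the sketch below. It uses a priority-flood style sweep from the grid boundary, which is a common in-memory technique and not necessarily the method used in the talk or in TerraFlow. The function name `fill_sinks` is hypothetical, and the sketch assumes water can always leave the terrain at the grid boundary.

```python
import heapq

def fill_sinks(grid):
    """Raise every cell to the lowest level at which water can spill to the boundary."""
    rows, cols = len(grid), len(grid[0])
    filled = [row[:] for row in grid]
    seen = [[False] * cols for _ in range(rows)]
    heap = []
    # Assumption: boundary cells are outlets, so they keep their original height.
    for r in range(rows):
        for c in range(cols):
            if r in (0, rows - 1) or c in (0, cols - 1):
                heapq.heappush(heap, (grid[r][c], r, c))
                seen[r][c] = True
    # Grow inward from the lowest known water level.
    while heap:
        level, r, c = heapq.heappop(heap)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and not seen[nr][nc]:
                    seen[nr][nc] = True
                    # A lower neighbor is flooded up to the current level.
                    filled[nr][nc] = max(grid[nr][nc], level)
                    heapq.heappush(heap, (filled[nr][nc], nr, nc))
    return filled

dem = [[5, 5, 5, 5],
       [5, 1, 2, 5],
       [5, 5, 5, 4],
       [5, 5, 5, 5]]
for row in fill_sinks(dem):
    print(row)
```

On the flooded terrain the former sink becomes a plateau at its spill height, so plateau handling can then assign directions toward the spill point.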

27
Flow Accumulation (FA) on Grids

FA models water flow through each cell with
uniform rain
Initially one unit of water in each cell
Water distributed from each cell to neighbors
pointed to by its FD
Flow conservation: if several FDs, distribute
proportionally to height difference
Flow accumulation of a cell is the total flow through
it
Goal: compute FA for every cell in the grid (FA
grid)

28
Computing FA

FD graph
node for each cell
(directed) edge from cell a to b if FD of a
points to b
FD graph must be acyclic
ok on slopes, be careful on plateaus
FD graph depends on the FD method used
SFD graph: a tree (or a set of trees)
MFD graph: a DAG (or a set of DAGs)

29
Computing FA: Plane Sweeping

Input: flow direction grid FD
Output: flow accumulation grid FA (initialized to
1)
Process cells in topological order. For each
cell
Read its flow from FA grid and its direction from
FD grid
Update flow for downslope neighbors (all
neighbors pointed to by cell flow direction)
Correctness
One sweep is enough
Analysis
O(sort) + O(N) time for a grid of N cells
Note: Topological order means decreasing height
order (since water flows downhill).
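
A minimal in-memory sketch of this sweep is shown below, assuming MFD with flow split proportionally to height difference and a terrain with no flat areas (sinks filled, plateaus already resolved). The function name `flow_accumulation` is hypothetical, and the flow directions are recomputed from heights rather than read from a stored FD grid.

```python
def flow_accumulation(grid):
    """Sweep cells in decreasing height order, pushing flow to lower neighbors."""
    rows, cols = len(grid), len(grid[0])
    fa = [[1.0] * cols for _ in range(rows)]  # one unit of rain per cell
    # Decreasing height is a topological order of the FD graph.
    cells = sorted(((grid[r][c], r, c) for r in range(rows) for c in range(cols)),
                   reverse=True)
    for h, r, c in cells:
        lower = []
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] < h:
                    lower.append((h - grid[nr][nc], nr, nc))
        total = sum(drop for drop, _, _ in lower)
        for drop, nr, nc in lower:
            # Distribute proportionally to height difference.
            fa[nr][nc] += fa[r][c] * drop / total
    return fa

print(flow_accumulation([[3, 2, 1]]))
print(flow_accumulation([[3, 2], [2, 1]]))
```

Because every cell is processed after all of its upslope neighbors, its FA value is final when it is read, which is why one sweep suffices.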

30
Scalability Problem

We can compute FD and FA using simple O(N)-time
algorithms
..but.. for large sets..??

31
Scalability Problem: Why?

Most (GIS) programs assume data fits in memory
minimize only CPU computation
But.. Massive data does not fit in main memory!
OS places data on disk and moves data between
memory and disk as needed
Disk systems try to amortize large access time by
transferring large contiguous blocks of data
When processing massive data disk I/O is the
bottleneck, rather than CPU time!

32
Disks are Slow

The difference in speed between modern CPU
and disk technologies is analogous to the
difference in speed in sharpening a pencil using
a sharpener on one's desk or by taking an
airplane to the other side of the world and using
a sharpener on someone else's desk. (D. Comer)

33
Scalability to Large Data

Example: reading an array from disk
Array size N = 10 elements
Disk block size B = 2 elements
Memory size M = 4 elements (2 blocks)

1 2 10 9 5 6 3 4 8 7
1 5 2 6 3 8 9 4 7 10
Algorithm 2: loads 5 blocks
Algorithm 1: loads 10 blocks
N blocks >> N/B blocks
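
The two block counts can be reproduced with a small simulation. The reading used here is an assumption: each row of numbers is taken to be the on-disk layout of the array, elements are requested in the order 1, 2, ..., 10, and memory holds 2 blocks with least-recently-used eviction. The function name `count_block_loads` is hypothetical.

```python
from collections import OrderedDict

def count_block_loads(layout, block_size, memory_blocks):
    """Read values 1..N in increasing order; count block loads under LRU eviction."""
    position = {value: i for i, value in enumerate(layout)}
    cache = OrderedDict()  # block id -> True, ordered from least to most recently used
    loads = 0
    for value in range(1, len(layout) + 1):
        block = position[value] // block_size
        if block in cache:
            cache.move_to_end(block)       # hit: mark as most recently used
        else:
            loads += 1                     # miss: one I/O
            cache[block] = True
            if len(cache) > memory_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return loads

print(count_block_loads([1, 2, 10, 9, 5, 6, 3, 4, 8, 7], 2, 2))
print(count_block_loads([1, 5, 2, 6, 3, 8, 9, 4, 7, 10], 2, 2))
```

Under these assumptions the first layout costs N/B = 5 loads because consecutive requests share a block, while the second costs N = 10 loads because every request touches a block that is no longer in memory.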

Block size is large (32KB, 64KB) → N >> N/B
N = 256 x 10^6, B = 8000, 1ms disk access time
→ N I/Os take 256 x 10^3 sec = 4266 min
= 71 hr
→ N/B I/Os take 256/8 sec = 32 sec
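
The arithmetic on this slide can be checked directly. The helper name `io_seconds` is hypothetical; the numbers are the ones given above.

```python
def io_seconds(num_ios, access_ms=1):
    """Total time in seconds for num_ios disk accesses at access_ms milliseconds each."""
    return num_ios * access_ms / 1000

N = 256 * 10**6   # elements
B = 8000          # elements per block

unblocked = io_seconds(N)       # one I/O per element
blocked = io_seconds(N // B)    # one I/O per block

print(unblocked, "sec =", unblocked // 60, "min =", unblocked // 3600, "hr")
print(blocked, "sec")
```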

I/O model
I/O-operation
Read/write one block of data from/to disk
I/O-complexity
number of I/O-operations (I/Os) performed by the
algorithm
External memory or I/O-efficient algorithms
Minimize I/O-complexity

RAM model
CPU-operation
CPU-complexity
Number of CPU-operations performed by the
algorithm
Internal memory algorithms
Minimize CPU-complexity

35
I/O-Efficient Algorithms

O(N) I/Os is bad!!
Improve to O(N/B) I/Os (if possible)
Minimize the number of blocks transferred between
main memory and disk
Compute on whole block while it is in memory
Avoid loading a block each time
Use techniques from PRAM algorithms

36
Sorting

Mergesort illustrates often-used features
Main-memory-sized chunks (form N/M runs)
Multi-way merge (repeatedly merge M/B of them)
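
The two phases can be sketched as follows. This is an in-memory simulation meant only to show the structure: Python lists stand in for runs on disk, and the function name `external_mergesort` and the returned pass count are illustrative additions.

```python
import heapq

def external_mergesort(data, M, B):
    """Simulate external mergesort with memory size M and block size B (in elements)."""
    # Phase 1: sort memory-sized chunks, producing about N/M sorted runs.
    runs = [sorted(data[i:i + M]) for i in range(0, len(data), M)]
    # Phase 2: repeatedly merge M/B runs at a time (one block of memory per run).
    fan_in = max(2, M // B)
    passes = 0
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
        passes += 1
    return (runs[0] if runs else []), passes

data = list(range(100, 0, -1))
result, passes = external_mergesort(data, M=8, B=2)
print(result[:5], passes)
```

Each merge pass reads and writes every element once, and the number of passes is logarithmic in the number of runs with base M/B, which is where the sort(N) bound comes from.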

37
Computing FA: I/O-Analysis

Algorithm: O(N) time
Process (sweep) cells in topological order. For
each cell
Read flow from FA grid and direction from FD grid
Update flow in FA grid for downslope neighbors
Problem: cells of same height are distributed over
the terrain
→ scattered access to FA grid and FD grid → O(N)
blocks

38
I/O-Efficient Flow Accumulation
[ATV00]

Eliminating scattered accesses to FD grid
Store FD grid in topological order
Eliminating scattered accesses to FA grid
Obs: flow to a neighbor cell is only needed when
the neighbor's time comes to be processed
Topological rank = time when cell is
processed = priority
Push flow by inserting a flow increment into a
priority queue with
priority equal to the neighbor's priority
Flow of a cell is obtained using DeleteMin operations
Note: augment each cell with the priorities of its 8
neighbors
Obs: space (9N) traded for I/O
Turns O(N) grid accesses into O(N) priority queue
operations
Use I/O-efficient priority queue [A95, BK97]
Buffered B-tree with lazy updates
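
The idea of pushing flow through a priority queue can be sketched as below. Python's in-memory `heapq` stands in for the I/O-efficient priority queue, the `rank` dictionary stands in for the neighbor priorities stored with each cell, and the function name `flow_accumulation_pq` is hypothetical. The sketch assumes MFD on a terrain without flat areas.

```python
import heapq

def flow_accumulation_pq(grid):
    """Flow accumulation where flow is sent forward in time through a priority queue."""
    rows, cols = len(grid), len(grid[0])
    # Cells in topological (decreasing height) order; the index is the priority.
    order = sorted(((r, c) for r in range(rows) for c in range(cols)),
                   key=lambda rc: -grid[rc[0]][rc[1]])
    rank = {cell: i for i, cell in enumerate(order)}
    pq, fa = [], {}
    for i, (r, c) in enumerate(order):
        flow = 1.0  # the cell's own unit of rain
        # DeleteMin every increment addressed to this cell.
        while pq and pq[0][0] == i:
            flow += heapq.heappop(pq)[1]
        fa[(r, c)] = flow
        lower = [(grid[r][c] - grid[nr][nc], (nr, nc))
                 for nr in range(max(0, r - 1), min(rows, r + 2))
                 for nc in range(max(0, c - 1), min(cols, c + 2))
                 if grid[nr][nc] < grid[r][c]]
        total = sum(drop for drop, _ in lower)
        for drop, cell in lower:
            # Insert the increment with the receiving neighbor's priority.
            heapq.heappush(pq, (rank[cell], flow * drop / total))
    return fa

print(flow_accumulation_pq([[3, 2], [2, 1]]))
```

The FA grid is never updated at a scattered location: each cell's value is assembled from the queue at the moment the sweep reaches it.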

40
TerraFlow

TerraFlow is our suite of programs for flow
routing and flow accumulation on massive grids
[ATV00, ACal02]
Flow routing and flow accumulation modeled as
graph problems and solved in optimal I/O bounds
Efficient
2-1000 times faster on very large grids than
existing software
Scalable
1 billion elements!! (>2GB data)
Flexible
Allows multiple methods of flow modeling

http://www.cs.duke.edu/geo/terraflow
41
TerraFlow

Significant speedup over ArcInfo for large
datasets
East-Coast
TerraFlow: 8.7 Hours
ArcInfo: 78 Hours
Washington state
TerraFlow: 63 Hours
ArcInfo
GRASS cannot handle
Hawaii dataset (killed
after 17 days!)

42
I/O-Model

Parameters
N = elements in problem instance
B = elements that fit in disk block
M = elements that fit in main memory
Fundamental bounds
Sorting: sort(N)

(Figure: disk D, block I/O, main memory M, processor P)
In practice block and main memory sizes are big
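
For reference, the sort(N) bound named above is usually written as follows in the external-memory literature, using the parameters N, B and M defined on this slide. scan(N) is included because it appears in a later result.

```latex
\mathrm{scan}(N) = \Theta\!\left(\frac{N}{B}\right),
\qquad
\mathrm{sort}(N) = \Theta\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right)
```

Since M/B is large in practice, the logarithmic factor is small and sort(N) is much closer to N/B than to N.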
44
I/O-Efficient Graph Algorithms

Graph G = (V,E)
Basic graph (searching) problems
BFS, DFS, SSSP, topological sorting
..are big open problems in the I/O-model!
Standard internal memory algorithms: O(E) I/Os
No I/O-efficient algorithms are known for any of
these problems on general graphs!
Lower bound Ω(sort(V)), best known O(V/sqrt(B))
O(sort(E)) algorithms for special classes of
graphs
Trees, grid graphs, bounded-treewidth graphs,
outerplanar graphs, planar graphs
Exploit existence of small separators or
geometric structure

45
SSSP on Grid Graphs [ATV00]
Grid graph: O(N) vertices, O(N) edges
Dijkstra's algorithm: O(N) I/Os
Goal: compute shortest path d(s,t) in O(sort(N)) I/Os

Lemma
The portion of d(s,t) between intersection
points with boundaries of subgrids is the
shortest path within the subgrid

46
SSSP on Grid Graphs [ATV00]
Idea: Compute shortest paths locally in each
subgrid, then compute the shortest way to
combine them together

Divide grid into subgrids of size B x B (assume M >
B^2)
Replace each B x B subgrid with a complete graph on
its boundary nodes
Edge weight: shortest path between the two
boundary vertices within the subgrid
Reduced graph GR
O(N/B) vertices, O(N) edges

47
SSSP on Grid Graphs [ATV00]

Algorithm
Compute SSSP on GR from s to all boundary
vertices
Find SSSP from s to all interior vertices: for
any subgrid σ, for any t in σ
$d(s,t) = \min_{v \in Bnd(\sigma)} \{ d(s,v) + d_\sigma(v,t) \}$
Correctness
easy to show using Lemma
Analysis: O(sort(N)) I/Os
Dijkstra algorithm using I/O efficient priority
queue and graph blocking
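
The decomposition can be sketched in memory as below. This only illustrates the structure (local shortest paths, reduced graph on boundary vertices, then interior vertices), not the I/O-efficient implementation. Assumptions: a 4-neighbor grid, entering a cell costs that cell's value, the source is treated as a boundary vertex of its subgrid, and the names `dijkstra`, `neighbors` and `sssp_by_subgrids` are hypothetical.

```python
import heapq

STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def dijkstra(adj, source):
    """Plain Dijkstra; adj maps a vertex to a list of (neighbor, weight)."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in adj.get(u, ()):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def neighbors(v, rows, cols):
    return [(v[0] + dr, v[1] + dc) for dr, dc in STEPS
            if 0 <= v[0] + dr < rows and 0 <= v[1] + dc < cols]

def sssp_by_subgrids(cost, source, b):
    rows, cols = len(cost), len(cost[0])
    blocks = {}
    for r in range(rows):
        for c in range(cols):
            blocks.setdefault((r // b, c // b), set()).add((r, c))
    reduced, local = {}, {}
    for key, cells in blocks.items():
        # Edges that stay inside this subgrid.
        inner = {v: [(u, cost[u[0]][u[1]]) for u in neighbors(v, rows, cols) if u in cells]
                 for v in cells}
        boundary = [v for v in cells
                    if v == source or any(u not in cells for u in neighbors(v, rows, cols))]
        # Local shortest paths from every boundary vertex within the subgrid.
        local[key] = {v: dijkstra(inner, v) for v in boundary}
        for v in boundary:
            # Complete graph on the boundary, plus original edges leaving the subgrid.
            edges = [(u, local[key][v][u]) for u in boundary if u != v]
            edges += [(u, cost[u[0]][u[1]]) for u in neighbors(v, rows, cols)
                      if u not in cells]
            reduced[v] = edges
    dist = dijkstra(reduced, source)  # distances to all boundary vertices
    for key, cells in blocks.items():
        for t in cells:
            if t not in dist:  # interior vertex: min over boundary of d(s,v) + d_sigma(v,t)
                dist[t] = min(dist[v] + d[t] for v, d in local[key].items())
    return dist

cost = [[(7 * r + 3 * c * c + r * c) % 9 + 1 for c in range(6)] for r in range(6)]
print(sssp_by_subgrids(cost, (0, 0), 3)[(5, 5)])
```

A useful sanity check is to compare the result with plain Dijkstra on the full grid; the two should agree on every cell.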

49
Results on Planar graphs

Planar graph G with N vertices
Separators can be computed in O(sort(N)) I/Os
I/O-efficient reductions [ABT00, AMTZ01]
→ BFS, DFS, SSSP in O(sort(N)) I/Os

50
SSSP on Planar Graphs

Similar to grid graphs. Assume M > B^2, bounded
degree
Assume graph is separated
O(N/B^2) subgraphs, O(B^2) vertices each, S = O(N/B)
separators
each subgraph adjacent to O(B) separators

51
SSSP on Planar Graphs

Reduced graph GR
S = O(N/B) vertices
O(N/B^2) x O(B^2) = O(N) edges
Compute SSSP on GR
Dijkstra's algorithm and I/O-efficient priority
queue
Each vertex is accessed once by its O(B)
adjacent vertices → O(N) I/Os
Use boundary sets
O(N/B^2) boundary sets, each
accessed once by its O(B) adjacent
vertices → O(N/B) I/Os

52
On I/O-Efficient DFS

DFS upper bounds
Internal memory algorithm: O(V+E) time, O(V+E)
I/Os
Best upper bound
O((V + E/B) log V) I/Os on general graphs
DFS on general graphs is a big open problem
Note: PRAM DFS is P-complete
DFS on planar graphs uses O(sort(N)) I/Os
DFS to BFS reduction [AMTZ01]

53
DFS to BFS Reduction on Planar Graphs

Idea: Partition the faces of G into levels around
a source face containing s and grow DFS
level-by-level
Levels can be obtained from BFS in dual graph
Denote
G_i = union of the boundaries of faces at level <
i
T_i = DFS tree of G_i
H_i = G_i \ G_{i-1}
Algorithm: Compute a spanning forest of H_i and
attach it onto T_{i-1}
Structure of levels is simple
The bicomps of the H_i are the boundary cycles of
G_i
Glueing onto T_{i-1} is simple
A spanning tree is a DFS tree if and only if it
has no cross edges
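
The cross-edge criterion is easy to test on small examples. The sketch below assumes an undirected graph given as an edge list and a rooted spanning tree given as a parent map; the function name `is_dfs_tree` is hypothetical.

```python
def is_dfs_tree(edges, parent, root):
    """True if the rooted spanning tree has no cross edges with respect to `edges`.

    parent maps every non-root vertex to its parent in the tree.
    """
    def ancestors(v):
        # All vertices on the tree path from v up to the root, including v.
        seen = {v}
        while v != root:
            v = parent[v]
            seen.add(v)
        return seen

    for u, v in edges:
        if parent.get(u) == v or parent.get(v) == u:
            continue  # tree edge
        if u not in ancestors(v) and v not in ancestors(u):
            return False  # cross edge: neither endpoint is an ancestor of the other
    return True

triangle = [(0, 1), (1, 2), (0, 2)]
print(is_dfs_tree(triangle, {1: 0, 2: 1}, root=0))
print(is_dfs_tree(triangle, {1: 0, 2: 0}, root=0))
```

In the triangle, the path 0-1-2 is a DFS tree because the remaining edge (0, 2) joins a vertex to its ancestor, while the star rooted at 0 is not, because (1, 2) joins two siblings.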

54
DFS to BFS Reduction on Planar Graphs

Idea: Partition the faces of G into levels around
a source face containing s and grow DFS
level-by-level

55
Other Graph Results

Grid graphs [ATV00]
MST, SSSP in O(sort(N)) I/Os
CC in O(scan(N)) I/Os
Planar graphs [ABT00, AMTZ01]
Planar reductions
DFS
General graphs [ABT00]
MST in O(sort(N) log log N) I/Os
Planar directed graphs (submitted)
Topological sorting and ear decomposition in
O(sort(N)) I/Os

56
..In Conclusion

I have tried to convince you of a few things
Massive data is available and in order to process
it scalable algorithms are necessary
I/O-efficient algorithms have applications
outside computer science and have big potential
for (interdisciplinary) collaboration
I/O-efficient algorithms are theory and practice
put together and support educational efforts
Challenging, rewarding, fun!

57
Collaboration

Rewarding, good response
Duke Nicholas School of the Environment
NCSU Dept. of Marine, Earth and Atmospheric
Sciences
GRASS, ESRI
TerraFlow
Incorporated in GRASS [AMT02]
Current work with U. Muenster GE
2 MS students port TerraFlow to Visual C++ under
Windows and make it an ArcInfo extension
Extends projects and brings up new problems
LIDAR data