Title: Efficient Algorithms for Large-Scale GIS Applications
1Efficient Algorithms for Large-Scale GIS
Applications
- Laura Toma
- Duke University
2Why GIS?
- How it all started..
- Duke Environmental researchers
- computing flow accumulation for the Appalachian Mountains took 14 days (with 512MB memory)
- 800km x 800km at 100m resolution → 64 million points
- GIS (Geographic Information Systems)
- System that handles spatial data
- Visualization, processing, queries, analysis
- Indispensable tool
- Modeling, analysis, prediction, decision making
- Rich area of problems for Computer Science
- Graphics, graph theory, computational geometry etc
3GIS and the Environment
- Monitoring: keep an eye on the state of earth systems using satellites and monitoring stations (water, ecosystems, urban development)
- Modeling, simulation: predict consequences of human actions and natural processes
- Analysis and risk assessment: find the problem areas and analyse the possible causes (soil erosion, groundwater pollution, traffic jams)
- Planning and decision support: provide information and tools for better management of natural and socio-economic resources
4 Precipitation in Tropical South America
Lots of rain Dry
H. Mitasova
5Nitrogen in Chesapeake Bay
High nitrogen concentrations
H. Mitasova
6Jockey's Ridge evolution
Combining IR-DOQQ, LIDAR and RTK GPS to assess the change: decreasing elevation, extending towards homes and a road
H. Mitasova
7Bald Head Island Renourishment
- LIDAR shoreline 1998
- LIDAR shoreline 2000
- RTK GPS shoreline, Dec. 2001
- surface is 1998 LIDAR
H. Mitasova
8Sediment flow
H. Mitasova
9Computations on Terrains
- Reality
- Height of terrain is a continuous function of two variables f(x,y)
- Estimate, predict, simulate
- Flooding, pollution
- Erosion, deposition
- Vegetation structure
- ...
- GIS
- DEM (Digital Elevation Model) is a set of sample points and their heights {x, y, h_xy}
Compute indices
10DEM Representations
TIN
Grid
Contour lines
Sample points
11Panama DEM
12Modeling Flow on Terrains
- What happens when it rains?
- Predict areas susceptible to floods.
- Predict location of streams.
- Compute watersheds.
- Flow is modeled using two basic attributes
- Flow Direction (FD)
- The direction water flows at a point
- Flow Accumulation (FA)
- Total amount of water that flows through a point
(if water is distributed according to the flow
directions)
13Panama DEM - Flow Accumulation
14(No Transcript)
15(No Transcript)
16Uses
- Flow direction and flow accumulation are used for
- Computing other hydrological attributes
- river network
- moisture indices
- watersheds and watershed divides
- Analysis and prediction of sediment and pollutant movement in landscapes
- Decision support in land management, flood and pollution prevention and disaster management
17Massive Terrain Data
- Remote sensing technology
- Massive amounts of terrain data
- Higher resolutions (1km, 100m, 30m, 10m, 1m, ...)
- NASA-SRTM
- Mission launched in 2000
- Acquired data for 80% of earth at 30m resolution
- 5TB
- USGS
- Most of US at 10m resolution
- LIDAR
- 1m res
18Example LIDAR Terrain Data
- Massive (irregular) point sets (1-10m resolution)
- Relatively cheap and easy to collect
- Example: Jockey's Ridge (NC coast)
19It's Growing!
- Appalachian Mountains
- Area of approx. 800 km x 800 km
- Sampled at
- 100m resolution → 64 million points (128MB)
- 30m resolution → 640 million points (1.2GB)
- 10m resolution → 6400 million = 6.4 billion points (12GB)
- 1m resolution → 640 billion points (1.2TB)
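The sizes listed above can be re-derived with a few lines of arithmetic. The sketch below is only illustrative: the helper name `grid_points` is made up here, and the 2 bytes per sample is an assumption chosen because it reproduces the 64 million points / 128MB pairing. The exact count at 30m comes out somewhat above the rounded slide figure, since (100/30)^2 is about 11 rather than 10.

```python
def grid_points(side_m, resolution_m):
    """Number of samples in a square grid covering side_m x side_m metres."""
    per_side = side_m // resolution_m
    return per_side * per_side

SIDE_M = 800_000        # 800 km, as stated for the Appalachian dataset
BYTES_PER_POINT = 2     # assumption: 16-bit heights (64M points -> 128MB)

for res in (100, 30, 10, 1):
    n = grid_points(SIDE_M, res)
    print(f"{res:>3}m: {n:,} points, about {n * BYTES_PER_POINT / 1e9:.1f} GB")
```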
20Computing on Massive Data
- GRASS (open source GIS)
- Killed after running for 17 days on a 6700 x 4300 grid (approx. 50 MB dataset)
- TARDEM (research, U. Utah)
- Killed after running for 20 days on a 12000 x 10000 grid (approx. 240 MB dataset)
- CPU utilization 5%, 3GB swap file
- ArcInfo (ESRI, commercial GIS)
- Can handle the 240MB dataset
- Doesn't work for datasets bigger than 2GB
21Outline
- Introduction
- Flow direction and flow accumulation
- Definitions, assumptions, algorithm outline.
- Scalability to large terrains
- Why not?
- I/O-efficient algorithms
- I/O-efficient flow accumulation
- TerraFlow
- Theoretical results
- Conclusion
22Flow Direction (FD) on Grids
- Water flows downhill
- follows the gradient
- On grids: approximated using 3x3 neighborhood
- SFD (Single-Flow Direction)
- FD points to the steepest downslope neighbor
- MFD (Multiple-Flow direction)
- FD points to all downslope neighbors
23Flow accumulation with MFD
24Flow accumulation with SFD
25Computing FD
- Goal: compute FD for every cell in the grid (FD grid)
- Algorithm
- For each cell compute SFD/MFD by inspecting 8 neighbor cells
- Analysis: O(N) time for a grid of N cells
- Is this all?
- NO! Flat areas: plateaus and sinks
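A minimal sketch of the per-cell step, assuming the grid is a list of lists of heights. The function names are invented for illustration, and dividing the height drop by the distance to diagonal neighbors is a common convention assumed here, not something the slides specify. Cells with no lower neighbor (sinks, plateau cells) get no direction, which is exactly the flat-area case the last bullet warns about.

```python
import math

# Offsets of the 8 neighbours in a 3x3 window.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def downslope_neighbours(elev, r, c):
    """Return [(slope, (nr, nc)), ...] for every strictly lower neighbour of (r, c)."""
    out = []
    for dr, dc in NEIGHBOURS:
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(elev) and 0 <= nc < len(elev[0]):
            drop = elev[r][c] - elev[nr][nc]
            if drop > 0:
                # Distance is 1 for axis neighbours, sqrt(2) for diagonals.
                out.append((drop / math.hypot(dr, dc), (nr, nc)))
    return out

def sfd(elev, r, c):
    """Single-flow direction: the steepest downslope neighbour, or None."""
    candidates = downslope_neighbours(elev, r, c)
    return max(candidates)[1] if candidates else None

def mfd(elev, r, c):
    """Multiple-flow direction: all downslope neighbours."""
    return [cell for _, cell in downslope_neighbours(elev, r, c)]

demo = [[9, 8, 7],
        [8, 5, 4],
        [7, 4, 1]]
print(sfd(demo, 1, 1), mfd(demo, 1, 1), sfd(demo, 2, 2))
```

Each cell looks at a constant number of neighbors, which is where the O(N) bound comes from.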
26FD on Flat Areas
- no obvious flow direction
- Plateaus
- Assign flow directions such that each cell flows towards the nearest spill point of the plateau
- Sinks
- Either catch the water inside the sink
- Or route the water outside the sink using uphill flow directions
- Model steady state of water and remove (fill) sinks by simulating flooding: uniformly pouring water on terrain until steady state is reached
- Assign uphill flow directions on the original terrain by assigning downhill flow directions on the flooded terrain
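One way to picture the flooded terrain is the in-memory "priority-flood" idea: grow inward from the grid boundary, always expanding the lowest cell reached so far, and raise each newly reached cell to at least that level. This is a swapped-in textbook technique used for illustration, not necessarily how the talk's software fills sinks, and it assumes water can leave the terrain at any boundary cell.

```python
import heapq

def fill_sinks(elev):
    """Raise every sink to its spill level by flooding inward from the boundary."""
    rows, cols = len(elev), len(elev[0])
    filled = [row[:] for row in elev]
    seen = [[False] * cols for _ in range(rows)]
    heap = []
    # Assumption: every boundary cell is an outlet, so it keeps its own height.
    for r in range(rows):
        for c in range(cols):
            if r in (0, rows - 1) or c in (0, cols - 1):
                heapq.heappush(heap, (elev[r][c], r, c))
                seen[r][c] = True
    while heap:
        level, r, c = heapq.heappop(heap)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and not seen[nr][nc]:
                    seen[nr][nc] = True
                    # A cell below the current water level is flooded up to it.
                    filled[nr][nc] = max(elev[nr][nc], level)
                    heapq.heappush(heap, (filled[nr][nc], nr, nc))
    return filled

terrain = [[9, 9, 9, 9, 9],
           [9, 2, 3, 9, 9],
           [9, 3, 1, 6, 4],
           [9, 9, 9, 9, 9],
           [9, 9, 9, 9, 9]]
for row in fill_sinks(terrain):
    print(row)
```

In this example the four-cell pit is raised to height 6, the height of the saddle cell separating it from the outlet of height 4. Filling turns the sink into a plateau, so the plateau rule (flow towards the nearest spill point) is still needed afterwards.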
27Flow Accumulation (FA) on Grids
- FA models water flow through each cell with uniform rain
- Initially one unit of water in each cell
- Water distributed from each cell to neighbors pointed to by its FD
- Flow conservation: if several FD, distribute proportionally to height difference
- Flow accumulation of cell is total flow through it
- Goal: compute FA for every cell in the grid (FA grid)
28Computing FA
- FD graph
- node for each cell
- (directed) edge from cell a to b if FD of a points to b
- FD graph must be acyclic
- ok on slopes, be careful on plateaus
- FD graph depends on the FD method used
- SFD graph: a tree (or a set of trees)
- MFD graph: a DAG (or a set of DAGs)
29Computing FA: Plane Sweeping
- Input: flow direction grid FD
- Output: flow accumulation grid FA (initialized to 1)
- Process cells in topological order. For each cell
- Read its flow from FA grid and its direction from FD grid
- Update flow for downslope neighbors (all neighbors pointed to by cell flow direction)
- Correctness
- One sweep enough
- Analysis
- O(sort(N)) + O(N) time for a grid of N cells
- Note: topological order means decreasing height order (since water flows downhill).
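The sweep can be sketched in memory as follows. This is an illustrative version under simplifying assumptions: flow directions are recomputed from heights on the fly instead of being read from a separate FD grid, MFD weights are proportional to height difference, and the terrain has no flat areas (equal-height neighbors exchange no flow here).

```python
def flow_accumulation(elev):
    """MFD flow accumulation by one sweep over cells in decreasing height order."""
    rows, cols = len(elev), len(elev[0])
    fa = [[1.0] * cols for _ in range(rows)]          # one unit of rain per cell
    # Topological order = decreasing height; sorting dominates the running time.
    order = sorted(((elev[r][c], r, c) for r in range(rows) for c in range(cols)),
                   reverse=True)
    for h, r, c in order:
        lower = []
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and elev[nr][nc] < h:
                    lower.append((h - elev[nr][nc], nr, nc))
        total_drop = sum(drop for drop, _, _ in lower)
        for drop, nr, nc in lower:
            # Flow conservation: split the cell's flow proportionally to the drop.
            fa[nr][nc] += fa[r][c] * drop / total_drop
    return fa

for row in flow_accumulation([[9, 8, 7], [8, 5, 4], [7, 4, 1]]):
    print([round(v, 2) for v in row])
```

Because every cell is finalized before any lower cell is visited, a single pass suffices. In the example all water ends up in the lowest corner, so that cell's accumulation equals the number of cells.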
30Scalability Problem
- We can compute FD and FA using simple O(N)-time algorithms
- ..but.. for large sets..??
31Scalability Problem: Why?
- Most (GIS) programs assume data fits in memory
- minimize only CPU computation
- But.. Massive data does not fit in main memory!
- OS places data on disk and moves data between memory and disk as needed
- Disk systems try to amortize large access time by transferring large contiguous blocks of data
- When processing massive data, disk I/O is the bottleneck, rather than CPU time!
32Disks are Slow
- "The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one's desk or by taking an airplane to the other side of the world and using a sharpener on someone else's desk." (D. Comer)
33Scalability to Large Data
- Example: reading an array from disk
- Array size N = 10 elements
- Disk block size B = 2 elements
- Memory size M = 4 elements (2 blocks)
1 2 10 9 5 6 3 4 8 7
1 5 2 6 3 8 9 4 7 10
Algorithm 2: loads 5 blocks
Algorithm 1: loads 10 blocks
N blocks >> N/B blocks
- Block size is large (32KB, 64KB) → N >> N/B
- N = 256 x 10^6, B = 8000, 1ms disk access time
- → N I/Os take 256 x 10^3 sec = 4266 min = 71 hr
- → N/B I/Os take 256/8 sec = 32 sec
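The block counts and timings can be checked with a small simulation. The reading of the two number rows as two on-disk layouts of the elements 1..10, visited in increasing order, is an assumption about the lost figure, as is the LRU eviction policy; the function name is made up for this sketch.

```python
from collections import OrderedDict

def count_block_loads(layout, accesses, block_size, memory_blocks):
    """Count disk block loads for a sequence of element accesses (LRU eviction)."""
    block_of = {value: pos // block_size for pos, value in enumerate(layout)}
    cache = OrderedDict()                     # block id -> present, oldest first
    loads = 0
    for value in accesses:
        blk = block_of[value]
        if blk in cache:
            cache.move_to_end(blk)            # hit: mark as most recently used
        else:
            loads += 1                        # miss: one I/O
            cache[blk] = True
            if len(cache) > memory_blocks:
                cache.popitem(last=False)     # evict least recently used block
    return loads

in_order = range(1, 11)
print(count_block_loads([1, 5, 2, 6, 3, 8, 9, 4, 7, 10], in_order, 2, 2))  # 10 loads
print(count_block_loads([1, 2, 10, 9, 5, 6, 3, 4, 8, 7], in_order, 2, 2))  # 5 loads

# Timing estimate with the slide's numbers: 1 ms per I/O.
N, B, access_s = 256 * 10**6, 8000, 1e-3
print(N * access_s / 3600, "hours for N I/Os")
print(N // B * access_s, "seconds for N/B I/Os")
```

The same ten elements cost twice as many loads when consecutive accesses land in different blocks, and the gap grows with the block size: with B = 8000 it is the difference between about 71 hours and 32 seconds.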
34- I/O model
- I/O-operation
- Read/write one block of data from/to disk
- I/O-complexity
- number of I/O-operations (I/Os) performed by the algorithm
- External memory or I/O-efficient algorithms
- Minimize I/O-complexity
- RAM model
- CPU-operation
- CPU-complexity
- Number of CPU-operations performed by the algorithm
- Internal memory algorithms
- Minimize CPU-complexity
35I/O-Efficient Algorithms
- O(N) I/Os is bad!!
- Improve to O(N/B) I/Os (if possible)
- Minimize the number of blocks transferred between main memory and disk
- Compute on whole block while it is in memory
- Avoid loading a block each time
- Use techniques from PRAM algorithms
36Sorting
- Mergesort illustrates often-used features
- Main-memory-sized chunks (form N/M sorted runs)
- Multi-way merge (repeatedly merge M/B of them)
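The two phases can be mimicked in memory to see how few merge passes are needed. This is a sketch only: Python lists stand in for runs that would live on disk, and `M` and `B` are counted in elements.

```python
import heapq

def external_merge_sort(data, M, B):
    """Sort with memory-sized runs and (M/B)-way merges.

    Returns (sorted list, number of merge passes).
    """
    # Phase 1: form about N/M sorted runs, each small enough to sort in memory.
    runs = [sorted(data[i:i + M]) for i in range(0, len(data), M)]
    fan_in = max(2, M // B)          # one block-sized buffer per input run
    passes = 0
    # Phase 2: repeatedly merge groups of M/B runs until a single run is left.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
        passes += 1
    return (runs[0] if runs else []), passes

result, passes = external_merge_sort(list(range(100, 0, -1)), M=8, B=2)
print(result[:5], passes)
```

With N = 100, M = 8 and B = 2 there are 13 initial runs and a fan-in of 4, so two merge passes suffice. Each pass reads and writes every block once, which is where the logarithm to base M/B in the sorting bound comes from.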
37Computing FA: I/O-Analysis
- Algorithm: O(N) time
- Process (sweep) cells in topological order. For each cell
- Read flow from FA grid and direction from FD grid
- Update flow in FA grid for downslope neighbors
- Problem: cells of same height distributed over the terrain
- → scattered access to FA grid and FD grid → O(N) blocks
38I/O-Efficient Flow Accumulation
ATV00
- Eliminating scattered accesses to FD grid
- Store FD grid in topological order
- Eliminating scattered accesses to FA grid
- Obs: flow to neighbor cell is only needed when its time comes to be processed
- Topological rank = time when cell is processed = priority
- Push flow by inserting flow increment in priority queue with priority equal to neighbor's priority
- Flow of cell obtained using DeleteMin operations
- Note: augment each cell with priority of 8 neighbors
- Obs: space (9N) traded for I/O
- Turns O(N) grid accesses into O(N) priority queue operations
- Use I/O-efficient priority queue A95, BK97
- Buffered B-tree with lazy updates
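The idea of pushing flow through a priority queue instead of writing into the FA grid can be sketched as below. Several things are assumptions made for illustration: SFD is used for brevity, `heapq` stands in for an I/O-efficient priority queue, the `rank` dictionary replaces the per-cell record of neighbor priorities, and the terrain is assumed free of flat areas.

```python
import heapq
import math

def flow_accumulation_pq(elev):
    """SFD flow accumulation that never writes into a neighbour's grid cell."""
    rows, cols = len(elev), len(elev[0])
    # Cells sorted by decreasing height; the position in this order is the priority.
    order = sorted(((r, c) for r in range(rows) for c in range(cols)),
                   key=lambda rc: -elev[rc[0]][rc[1]])
    rank = {cell: i for i, cell in enumerate(order)}
    pq = []                                  # entries: (receiver's priority, flow)
    fa = {}
    for i, (r, c) in enumerate(order):       # sequential scan of the sorted cells
        flow = 1.0                           # the cell's own unit of rain
        while pq and pq[0][0] == i:          # DeleteMin everything sent to this cell
            flow += heapq.heappop(pq)[1]
        fa[(r, c)] = flow
        best = None                          # steepest strictly lower neighbour
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols:
                    drop = elev[r][c] - elev[nr][nc]
                    if drop > 0:
                        slope = drop / math.hypot(dr, dc)
                        if best is None or slope > best[0]:
                            best = (slope, (nr, nc))
        if best is not None:
            # Send the flow "forward in time" to when the neighbour is processed.
            heapq.heappush(pq, (rank[best[1]], flow))
    return fa

print(flow_accumulation_pq([[9, 8, 7], [8, 5, 4], [7, 4, 1]])[(2, 2)])
```

A lower neighbor always has a larger rank than the sender, so every message sits in the queue until exactly the step at which its receiver is scanned. The only remaining access patterns are a sequential scan and priority queue operations.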
39(No Transcript)
40TerraFlow
- TerraFlow is our suite of programs for flow routing and flow accumulation on massive grids ATV00, ACal02
- Flow routing and flow accumulation modeled as graph problems and solved in optimal I/O bounds
- Efficient
- 2-1000 times faster on very large grids than existing software
- Scalable
- 1 billion elements!! (>2GB data)
- Flexible
- Allows multiple methods of flow modeling
http://www.cs.duke.edu/geo/terraflow
41TerraFlow
- Significant speedup over ArcInfo for large datasets
- East-Coast
- TerraFlow: 8.7 Hours
- ArcInfo: 78 Hours
- Washington state
- TerraFlow: 63 Hours
- ArcInfo
- GRASS cannot handle Hawaii dataset (killed after 17 days!)
42I/O-Model
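[Figure: I/O model with disk D, block I/O, main memory M, processor P]
- Parameters
- N = number of elements in problem instance
- B = number of elements that fit in disk block
- M = number of elements that fit in main memory
- Fundamental bounds
- Sorting: sort(N) = Θ((N/B) log_{M/B} (N/B)) I/Os
In practice block and main memory sizes are big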
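To get a feel for the parameters, the standard bounds of this model, scan(N) = N/B and sort(N) = (N/B) log_{M/B}(N/B) up to constant factors, can be evaluated numerically. The function names and the memory size used below are hypothetical values chosen for illustration.

```python
import math

def scan_ios(N, B):
    """I/Os to read N elements sequentially: ceil(N / B)."""
    return math.ceil(N / B)

def sort_ios(N, M, B):
    """Leading term of the sorting bound: (N/B) * log base M/B of (N/B)."""
    blocks = N / B
    return blocks * math.log(blocks, M / B)

N = 256 * 10**6      # elements, as in the disk-access example
B = 8000             # elements per block
M = 64 * 10**6       # hypothetical: memory holds 64 million elements

print(scan_ios(N, B))
print(round(sort_ios(N, M, B)))
print(N)             # for comparison: one I/O per element
```

Because M/B is large, the logarithmic factor stays close to 1, so sort(N) is only slightly more than scan(N) and far below N.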
43(No Transcript)
44I/O-Efficient Graph Algorithms
- Graph G = (V,E)
- Basic graph (searching) problems
- BFS, DFS, SSSP, topological sorting
- ..are big open problems in the I/O-model!
- Standard internal memory algorithms: O(E) I/Os
- No I/O-efficient algorithms are known for any of these problems on general graphs!
- Lower bound Ω(sort(V)), best known upper bound O(V/sqrt(B))
- O(sort(E)) algorithms for special classes of graphs
- Trees, grid graphs, bounded-treewidth graphs, outerplanar graphs, planar graphs
- Exploit existence of small separators or geometric structure
45SSSP on Grid Graphs ATV00
Grid graph: O(N) vertices, O(N) edges
Dijkstra's algorithm: O(N) I/Os
Goal: compute shortest path d(s,t) in O(sort(N)) I/Os
- Lemma
- The portion of d(s,t) between intersection points with boundaries of subgrids is the shortest path within the subgrid
46SSSP on Grid Graphs ATV00
Idea: compute shortest paths locally in each subgrid, then compute the shortest way to combine them together
- Divide grid into subgrids of size B x B (assume M > B^2)
- Replace each B x B subgrid with complete graph on boundary nodes
- Edge weight: shortest path between the two boundary vertices within the subgrid
- Reduced graph GR
- O(N/B) vertices, O(N) edges
47SSSP on Grid Graphs ATV00
- Algorithm
- Compute SSSP on GR from s to all boundary vertices
- Find SSSP from s to all interior vertices: for any subgrid σ, for any t in σ
- $d(s,t) = \min_{v \in \mathrm{Bnd}(\sigma)} \{ d(s,v) + d_\sigma(v,t) \}$
- Correctness
- easy to show using Lemma
- Analysis: O(sort(N)) I/Os
- Dijkstra's algorithm using I/O-efficient priority queue and graph blocking
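The decomposition can be tried out in memory on a small grid. This is a sketch under several assumptions that are not from the slides: the grid is square with side n where (n - 1) is a multiple of (b - 1), adjacent b x b subgrids share their boundary row or column, the source is the corner (0, 0), and an edge between two cells costs the sum of their weights. It shows only the structure of the computation, not its I/O behaviour.

```python
import heapq
import random

def dijkstra(adj, source):
    """Textbook in-memory Dijkstra on an adjacency dict {u: [(v, w), ...]}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                          # stale queue entry
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def grid_adjacency(weight, rows, cols):
    """4-neighbour grid graph restricted to the given row and column ranges."""
    adj = {}
    for r in rows:
        for c in cols:
            adj[(r, c)] = [((r + dr, c + dc), weight[r][c] + weight[r + dr][c + dc])
                           for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                           if r + dr in rows and c + dc in cols]
    return adj

def sssp_via_subgrids(weight, b):
    """Distances from (0, 0) computed through the reduced graph on boundary vertices."""
    n, step = len(weight), b - 1
    reduced, local = {}, {}
    for r0 in range(0, n - 1, step):
        for c0 in range(0, n - 1, step):
            sub = grid_adjacency(weight, range(r0, r0 + b), range(c0, c0 + b))
            bnd = [v for v in sub
                   if v[0] in (r0, r0 + step) or v[1] in (c0, c0 + step)]
            for v in bnd:
                dv = dijkstra(sub, v)         # shortest paths inside this subgrid only
                local[(r0, c0), v] = dv
                # Complete graph on the boundary vertices of the subgrid.
                reduced.setdefault(v, []).extend((u, dv[u]) for u in bnd if u != v)
    dist = dijkstra(reduced, (0, 0))          # step 1: source to all boundary vertices
    for (_, v), dv in local.items():          # step 2: extend into each subgrid
        for t, d_local in dv.items():
            dist[t] = min(dist.get(t, float("inf")), dist[v] + d_local)
    return dist

random.seed(1)
w = [[random.randint(1, 9) for _ in range(7)] for _ in range(7)]
direct = dijkstra(grid_adjacency(w, range(7), range(7)), (0, 0))
print(sssp_via_subgrids(w, 3) == direct)
```

The final comparison checks the combined result against a plain Dijkstra run on the whole grid, which is the statement the lemma makes precise.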
48(No Transcript)
49Results on Planar graphs
- Planar graph G with N vertices
- Separators can be computed in O(sort(N)) I/Os
- I/O-efficient reductions ABT00, AMTZ01
- → BFS, DFS, SSSP in O(sort(N)) I/Os
50SSSP on Planar Graphs
- Similar to grid graphs. Assume M > B^2, bounded degree
- Assume graph is separated
- O(N/B^2) subgraphs, O(B^2) vertices each, S = O(N/B) separator vertices
- each subgraph adjacent to O(B) separator vertices
51SSSP on Planar Graphs
- Reduced graph GR
- S = O(N/B) vertices
- O(N/B^2) x O(B^2) = O(N) edges
- Compute SSSP on GR
- Dijkstra's algorithm and I/O-efficient priority queue
- Each vertex is accessed once by each of its O(B) adjacent vertices → O(N) I/Os
- Use boundary sets
- O(N/B^2) boundary sets, each accessed once by its O(B) adjacent vertices → O(N/B) I/Os
52On I/O-Efficient DFS
- DFS upper bounds
- Internal memory algorithm: O(V+E) time, O(V+E) I/Os
- Best upper bound
- O((V + E/B) log V) I/Os on general graphs
- DFS on general graphs is a big open problem
- Note: PRAM DFS is P-complete
- DFS on planar graphs uses O(sort(N)) I/Os
- DFS to BFS reduction AMTZ01
53DFS to BFS Reduction on Planar Graphs
- Idea: partition the faces of G into levels around a source face containing s and grow DFS level-by-level
- Levels can be obtained from BFS in dual graph
- Denote
- G_i = union of the boundaries of faces at level < i
- T_i = DFS tree of G_i
- H_i = G_i \ G_{i-1}
- Algorithm: compute a spanning forest of H_i and attach it onto T_{i-1}
- Structure of levels is simple
- The bicomps of the H_i are the boundary cycles of G_i
- Glueing onto T_{i-1} is simple
- A spanning tree is a DFS tree if and only if it has no cross edges
54DFS to BFS Reduction on Planar Graphs
- Idea: partition the faces of G into levels around a source face containing s and grow DFS level-by-level
55Other Graphs Results
- Grid graphs ATV00
- MST, SSSP in O(sort(N)) I/Os
- CC in O(scan(N)) I/Os
- Planar graphs ABT00, AMTZ01
- Planar reductions
- DFS
- General graphs ABT00
- MST in O(sort(N) log log N) I/Os
- Planar directed graphs (submitted)
- Topological sorting and ear decomposition in O(sort(N)) I/Os
56..In Conclusion
- I have tried to convince you of a few things
- Massive data is available and in order to process it scalable algorithms are necessary
- I/O-efficient algorithms have applications outside computer science and have big potential for (interdisciplinary) collaboration
- I/O-efficient algorithms are theory and practice put together and support educational efforts
- Challenging, rewarding, fun!
57Collaboration
- Rewarding, good response
- Duke Nicholas School of the Environment
- NCSU Dept. of Marine, Earth and Atmospheric Sciences
- GRASS, ESRI
- TerraFlow
- Incorporated in GRASS AMT02
- Current work with U. Muenster GE
- 2 MS students port TerraFlow to VisualC under Windows and make it ArcInfo extension
- Extends projects and brings up new problems
- LIDAR data