1
Efficient Algorithms for Large-Scale GIS
Applications
  • Laura Toma
  • Duke University

2
Why GIS?
  • How it all started...
  • Duke environmental researchers:
  • computing flow accumulation for the Appalachian
    Mountains took 14 days (with 512MB memory)
  • 800km x 800km at 100m resolution → 64 million
    points
  • GIS (Geographic Information Systems)
  • System that handles spatial data
  • Visualization, processing, queries, analysis
  • Indispensable tool
  • Modeling, analysis, prediction, decision making
  • Rich area of problems for Computer Science
  • Graphics, graph theory, computational geometry, etc.

3
GIS and the Environment
  • Monitoring: keep an eye on the state of earth
    systems using satellites and monitoring stations
    (water, ecosystems, urban development)
  • Modeling, simulation: predict consequences of
    human actions and natural processes
  • Analysis and risk assessment: find the problem
    areas and analyse the possible causes (soil
    erosion, groundwater pollution, traffic jams)
  • Planning and decision support: provide
    information and tools for better management of
    natural and socio-economic resources

4
Precipitation in Tropical South America
Color scale: lots of rain → dry
H. Mitasova
5
Nitrogen in Chesapeake Bay
High nitrogen concentrations
H. Mitasova
6
Jockey's Ridge evolution
Combining IR-DOQQ, LIDAR and RTK GPS to assess
the change: decreasing elevation, extending
towards homes and a road
H. Mitasova
7
Bald Head Island Renourishment
Shorelines shown: 1998 LIDAR, 2000 LIDAR, and Dec. 2001
RTK GPS; the surface is the 1998 LIDAR.
H. Mitasova
8
Sediment flow
H. Mitasova
9
Computations on Terrains
  • Reality
  • Height of terrain is a continuous function of
    two variables f(x,y)
  • Estimate, predict, simulate
  • Flooding, pollution
  • Erosion, deposition
  • Vegetation structure
  • GIS
  • DEM (Digital Elevation Model) is a set of sample
    points and their heights {(x, y, h(x,y))}

Compute indices
10
DEM Representations
TIN
Grid
Contour lines
Sample points
11
Panama DEM
12
Modeling Flow on Terrains
  • What happens when it rains?
  • Predict areas susceptible to floods.
  • Predict location of streams.
  • Compute watersheds.
  • Flow is modeled using two basic attributes:
  • Flow Direction (FD)
  • The direction water flows at a point
  • Flow Accumulation (FA)
  • Total amount of water that flows through a point
    (if water is distributed according to the flow
    directions)

13
Panama DEM - Flow Accumulation
14
(No Transcript)
15
(No Transcript)
16
Uses
  • Flow direction and flow accumulation are used
    for:
  • Computing other hydrological attributes
  • river network
  • moisture indices
  • watersheds and watershed divides
  • Analysis and prediction of sediment and
    pollutant movement in landscapes.
  • Decision support in land management, flood and
    pollution prevention and disaster management

17
Massive Terrain Data
  • Remote sensing technology
  • Massive amounts of terrain data
  • Higher resolutions (1km, 100m, 30m, 10m, 1m, ...)
  • NASA-SRTM
  • Mission launched in 2000
  • Acquired data for 80% of earth at 30m resolution
  • 5TB
  • USGS
  • Most of US at 10m resolution
  • LIDAR
  • 1m resolution

18
Example LIDAR Terrain Data
  • Massive (irregular) point sets (1-10m resolution)
  • Relatively cheap and easy to collect

Example: Jockey's Ridge (NC coast)
19
It's Growing!
  • Appalachian Mountains
  • Area of approx. 800 km x 800 km
  • Sampled at (a quick sanity check follows below):
  • 100m resolution → 64 million points (128MB)
  • 30m resolution → 640 million points
    (1.2GB)
  • 10m resolution → 6.4 billion points (12GB)
  • 1m resolution → 640 billion points (1.2TB)
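A minimal sketch of where these numbers come from (Python; the ~2 bytes per
elevation sample is an assumption, and the slide's 30m figure is a round-number
approximation):

    # Number of samples and raw size of an 800 km x 800 km DEM,
    # assuming roughly 2 bytes per elevation value.
    side_m = 800 * 1000
    for res in (100, 30, 10, 1):
        points = (side_m // res) ** 2
        print(f"{res:>3}m resolution: {points:,} points, ~{2 * points / 1e9:.1f} GB")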

20
Computing on Massive Data
  • GRASS (open source GIS)
  • Killed after running for 17 days on a 6700 x 4300
    grid (approx 50 MB dataset)
  • TARDEM (research, U. Utah)
  • Killed after running for 20 days on a 12000 x
    10000 grid (approx 240 MB dataset)
  • CPU utilization 5%, 3GB swap file
  • ArcInfo (ESRI, commercial GIS)
  • Can handle the 240MB dataset
  • Doesn't work for datasets bigger than 2GB

21
Outline
  • Introduction
  • Flow direction and flow accumulation
  • Definitions, assumptions, algorithm outline.
  • Scalability to large terrains
  • Why not?
  • I/O-efficient algorithms
  • I/O-efficient flow accumulation
  • TerraFlow
  • Theoretical results
  • Conclusion

22
Flow Direction (FD) on Grids
  • Water flows downhill
  • follows the gradient
  • On grids: approximated using the 3x3 neighborhood
  • SFD (Single-Flow Direction)
  • FD points to the steepest downslope neighbor
  • MFD (Multiple-Flow direction)
  • FD points to all downslope neighbors

23
Flow accumulation with MFD
24
Flow accumulation with SFD
25
Computing FD
  • Goal: compute FD for every cell in the grid (FD
    grid)
  • Algorithm:
  • For each cell, compute SFD/MFD by inspecting its 8
    neighbor cells (a minimal sketch follows below)
  • Analysis: O(N) time for a grid of N cells
  • Is this all?
  • NO! Flat areas: plateaus and sinks
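A minimal in-memory sketch of the SFD variant (Python; assumes the DEM fits in
memory as a 2D numpy array, and ignores ties, plateaus and sinks, which are
handled as described on the next slides):

    import numpy as np

    # Single-flow-direction (SFD): each cell points to its steepest downslope neighbor.
    NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

    def single_flow_directions(dem, cellsize=1.0):
        rows, cols = dem.shape
        fd = np.full((rows, cols), -1, dtype=int)   # index into NEIGHBORS, -1 = no downslope neighbor
        for i in range(rows):
            for j in range(cols):
                best_slope, best_dir = 0.0, -1
                for k, (di, dj) in enumerate(NEIGHBORS):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < rows and 0 <= nj < cols:
                        dist = cellsize * (2 ** 0.5 if di and dj else 1.0)
                        slope = (dem[i, j] - dem[ni, nj]) / dist   # > 0 means downslope
                        if slope > best_slope:
                            best_slope, best_dir = slope, k
                fd[i, j] = best_dir
        return fd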

26
FD on Flat Areas
  • Flat areas have no obvious flow direction
  • Plateaus
  • Assign flow directions such that each cell flows
    towards the nearest spill point of the plateau
    (see the BFS sketch after this list)
  • Sinks
  • Either catch the water inside the sink
  • Or route the water outside the sink using uphill
    flow directions
  • Model the steady state of water and remove (fill)
    sinks by simulating flooding: uniformly pouring
    water on the terrain until a steady state is
    reached
  • Assign uphill flow directions on the original
    terrain by assigning downhill flow directions on
    the flooded terrain
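A minimal sketch of the plateau step (Python, hypothetical helper; assumes the
plateau cells and their spill points have already been identified):

    from collections import deque

    # Assign flow directions on a plateau so every cell points one step towards its
    # nearest spill point. `plateau` is a set of (i, j) cells of equal height;
    # `spill_points` are the plateau cells that have a downslope neighbor.
    NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

    def plateau_flow_directions(plateau, spill_points):
        fd = {}                                 # cell -> neighbor it flows to
        queue = deque(spill_points)
        visited = set(spill_points)
        while queue:                            # multi-source BFS from the spill points
            i, j = queue.popleft()
            for di, dj in NEIGHBORS:
                nbr = (i + di, j + dj)
                if nbr in plateau and nbr not in visited:
                    visited.add(nbr)
                    fd[nbr] = (i, j)            # one step closer to a spill point
                    queue.append(nbr)
        return fd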

27
Flow Accumulation (FA) on Grids
  • FA models water flow through each cell with
    uniform rain
  • Initially one unit of water in each cell
  • Water distributed from each cell to neighbors
    pointed to by its FD
  • Flow conservation: if a cell has several FDs,
    distribute proportionally to the height difference
  • Flow accumulation of a cell is the total flow through
    it
  • Goal: compute FA for every cell in the grid (FA
    grid)

28
Computing FA
  • FD graph
  • node for each cell
  • (directed) edge from cell a to b if FD of a
    points to b
  • FD graph must be acyclic
  • ok on slopes, be careful on plateaus
  • FD graph depends on the FD method used
  • SFD graph: a tree (or a set of trees)
  • MFD graph: a DAG (or a set of DAGs)

29
Computing FA Plane Sweeping
  • Input: flow direction grid FD
  • Output: flow accumulation grid FA (initialized to
    1)
  • Process cells in topological order. For each
    cell:
  • Read its flow from the FA grid and its direction from
    the FD grid
  • Update flow for downslope neighbors (all
    neighbors pointed to by the cell's flow direction)
  • Correctness
  • One sweep is enough
  • Analysis
  • O(sort(N)) + O(N) time for a grid of N cells
  • Note: topological order means decreasing height
    order (since water flows downhill). A minimal sweep
    sketch follows below.
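A minimal in-memory sketch of the sweep with MFD (Python; assumes a numpy DEM
with plateaus and sinks already resolved, so that strictly lower neighbors are
exactly the flow-direction neighbors):

    import numpy as np

    # Every cell starts with one unit of water and passes its accumulated flow to all
    # strictly lower neighbors, proportionally to the height difference.
    NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

    def flow_accumulation_mfd(dem):
        rows, cols = dem.shape
        fa = np.ones((rows, cols))
        # Decreasing height is a topological order of the flow-direction DAG.
        order = sorted(((i, j) for i in range(rows) for j in range(cols)),
                       key=lambda c: dem[c], reverse=True)
        for i, j in order:
            downslope = []
            for di, dj in NEIGHBORS:
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols and dem[ni, nj] < dem[i, j]:
                    downslope.append((ni, nj))
            total_drop = sum(dem[i, j] - dem[ni, nj] for ni, nj in downslope)
            for ni, nj in downslope:
                fa[ni, nj] += fa[i, j] * (dem[i, j] - dem[ni, nj]) / total_drop
        return fa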

30
Scalability Problem
  • We can compute FD and FA using simple O(N)-time
    algorithms
  • ...but what about large datasets?

31
Scalability Problem Why?
  • Most (GIS) programs assume data fits in memory
  • minimize only CPU computation
  • But.. Massive data does not fit in main memory!
  • OS places data on disk and moves data between
    memory and disk as needed
  • Disk systems try to amortize large access time by
    transferring large contiguous blocks of data
  • When processing massive data, disk I/O is the
    bottleneck rather than CPU time!

32
Disks are Slow
  • The difference in speed between modern CPU
    and disk technologies is analogous to the
    difference in speed in sharpening a pencil using
    a sharpener on one's desk or by taking an
    airplane to the other side of the world and using
    a sharpener on someone else's desk. (D. Comer)

33
Scalability to Large Data
  • Example: reading an array from disk
  • Array size N = 10 elements
  • Disk block size = 2 elements
  • Memory size = 4 elements (2 blocks)

Two layouts of the same array on disk (blocks of 2 elements):
[1 2] [10 9] [5 6] [3 4] [8 7]
[1 5] [2 6] [3 8] [9 4] [7 10]
Algorithm 1 loads 10 blocks; Algorithm 2 loads 5 blocks
N blocks >> N/B blocks
  • Block size is large (32KB, 64KB) → N >> N/B
  • N = 256 x 10^6, B = 8000, 1ms disk access time
  • → N I/Os take 256 x 10^3 sec ≈ 4266 min
    ≈ 71 hr
  • → N/B I/Os take 32 sec (checked in the snippet below)
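The arithmetic above in a few lines (Python, assuming 1 ms per disk access as on
the slide):

    # One I/O per element vs one I/O per block.
    N, B, t_io = 256 * 10**6, 8000, 1e-3      # elements, elements per block, seconds per I/O
    print(N * t_io / 3600, "hours")           # ~71 hours
    print(N // B * t_io, "seconds")           # 32 seconds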

34
  • I/O model
  • I/O-operation:
  • Read/write one block of data from/to disk
  • I/O-complexity:
  • number of I/O-operations (I/Os) performed by the
    algorithm
  • External memory or I/O-efficient algorithms:
  • Minimize I/O-complexity
  • RAM model
  • CPU-operation
  • CPU-complexity
  • Number of CPU-operations performed by the
    algorithm
  • Internal memory algorithms
  • Minimize CPU-complexity

35
I/O-Efficient Algorithms
  • O(N) I/Os is bad!!
  • Improve to O(N/B) I/Os (if possible)
  • Minimize the number of blocks transferred between
    main memory and disk
  • Compute on whole block while it is in memory
  • Avoid loading a block each time
  • Use techniques from PRAM algorithms

36
Sorting
  • Mergesort illustrates frequently used techniques
    (a sketch of the merge phase follows below)
  • Sort main-memory-sized chunks (creating N/M runs)
  • Multi-way merge (repeatedly merge M/B of them)
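A minimal in-memory sketch of the multi-way merge phase (Python; runs are plain
lists standing in for on-disk runs that would be read and written block by block,
and the fan-in would be about M/B):

    import heapq

    # Repeatedly merge up to `fanout` sorted runs into one, until a single run remains.
    def multiway_merge(runs, fanout):
        runs = list(runs)
        while len(runs) > 1:
            runs = [list(heapq.merge(*runs[i:i + fanout]))    # one fanout-way merge
                    for i in range(0, len(runs), fanout)]
        return runs[0] if runs else []

    # Example: 8 sorted runs merged 4 at a time -> two passes.
    runs = [sorted([i, i + 8, i + 16]) for i in range(8)]
    print(multiway_merge(runs, fanout=4))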

37
Computing FA: I/O-Analysis
  • Algorithm: O(N) time
  • Process (sweep) cells in topological order. For
    each cell
  • Read flow from FA grid and direction from FD grid
  • Update flow in FA grid for downslope neighbors
  • Problem: cells of the same height are distributed
    over the terrain
  • → scattered accesses to the FA grid and the FD grid
    → O(N) blocks

38
I/O-Efficient Flow Accumulation
ATV00
  • Eliminating scattered accesses to FD grid
  • Store FD grid in topological order
  • Eliminating scattered accesses to FA grid
  • Obs: flow to a neighbor cell is only needed when
    its turn comes to be processed
  • Topological rank = time when the cell is
    processed = priority
  • Push flow by inserting the flow increment in a
    priority queue with
  • priority equal to the neighbor's priority
  • Flow of a cell is obtained using DeleteMin operations
    (a minimal sketch follows below)
  • Note: augment each cell with the priorities of its 8
    neighbors
  • Obs: space (9N) traded for I/O
  • Turns O(N) grid accesses into O(N) priority queue
    operations
  • Use an I/O-efficient priority queue A95,BK97
  • Buffered B-tree with lazy updates
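A minimal sketch of the idea (Python; heapq stands in for the I/O-efficient
priority queue, and the cells are assumed to be pre-sorted into topological order
with their neighbors' ranks and flow fractions already attached):

    import heapq

    # cells[r] = list of (neighbor_rank, fraction) pairs for the cell of rank r.
    # One sequential sweep plus priority-queue operations replaces scattered grid accesses.
    def flow_accumulation_pq(cells):
        fa = [0.0] * len(cells)
        pq = []                                    # entries: (recipient_rank, flow)
        for rank, neighbors in enumerate(cells):
            flow = 1.0                             # one unit of rain on each cell
            while pq and pq[0][0] == rank:         # DeleteMin: flow pushed to this cell earlier
                flow += heapq.heappop(pq)[1]
            fa[rank] = flow
            for nbr_rank, fraction in neighbors:   # push flow to downslope neighbors
                heapq.heappush(pq, (nbr_rank, flow * fraction))
        return fa

    # Rank 0 splits its flow between ranks 1 and 2; rank 1 sends everything to rank 2.
    print(flow_accumulation_pq([[(1, 0.5), (2, 0.5)], [(2, 1.0)], []]))  # [1.0, 1.5, 3.0]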

39
(No Transcript)
40
TerraFlow
  • TerraFlow is our suite of programs for flow
    routing and flow accumulation on massive grids
    ATV00,ACal02
  • Flow routing and flow accumulation modeled as
    graph problems and solved in optimal I/O bounds
  • Efficient
  • 2-1000 times faster on very large grids than
    existing software
  • Scalable
  • 1 billion elements!! (>2GB data)
  • Flexible
  • Allows multiple flow-modeling methods

http://www.cs.duke.edu/geo/terraflow
41
TerraFlow
  • Significant speedup over ArcInfo for large
    datasets
  • East-Coast
  • TerraFlow 8.7 Hours
  • ArcInfo 78 Hours
  • Washington state
  • TerraFlow 63 Hours
  • ArcInfo
  • GRASS cannot handle the Hawaii dataset
    (killed after 17 days!)

42
I/O-Model
  • Parameters
  • N = number of elements in the problem instance
  • B = number of elements per disk block
  • M = number of elements that fit in main memory
  • Fundamental bounds
  • Sorting: sort(N) = Θ((N/B) log_{M/B}(N/B)) I/Os

Figure: processor (P), main memory (M), and disk (D);
data is transferred between memory and disk in blocks (block I/O)
In practice block and main memory sizes are big
43
(No Transcript)
44
I/O-Efficient Graph Algorithms
  • Graph G(V,E)
  • Basic graph (searching) problems
  • BFS, DFS, SSSP, topological sorting
  • ..are big open problems in the I/O-model!
  • Standard internal-memory algorithms: O(E) I/Os
  • No I/O-efficient algorithms are known for any of
    these problems on general graphs!
  • Lower bound Ω(sort(V)); best known O(V/sqrt(B))
  • O(sort(E)) algorithms for special classes of
    graphs:
  • Trees, grid graphs, bounded-treewidth graphs,
    outerplanar graphs, planar graphs
  • Exploit existence of small separators or
    geometric structure

45
SSSP on Grid Graphs ATV00
Grid graph: O(N) vertices, O(N) edges
Dijkstra's algorithm: O(N) I/Os. Goal: compute
shortest path d(s,t) in O(sort(N)) I/Os
  • Lemma:
  • The portion of d(s,t) between intersection
    points with boundaries of subgrids is the
    shortest path within the subgrid

46
SSSP on Grid Graphs ATV00
Idea: compute shortest paths locally in each
subgrid, then compute the shortest way to
combine them together
  • Divide the grid into subgrids of size BxB (assume M >
    B^2)
  • Replace each BxB subgrid with a complete graph on its
    boundary nodes
  • Edge weight = shortest path between the two
    boundary vertices within the subgrid
  • Reduced graph G_R
  • O(N/B) vertices, O(N) edges

47
SSSP on Grid Graphs ATV00
  • Algorithm:
  • Compute SSSP on G_R from s to all boundary
    vertices
  • Find SSSP from s to all interior vertices: for
    any subgrid σ and any t in σ,
    d(s,t) = min over v in Bnd(σ) of [ d(s,v) + d_σ(v,t) ]
  • Correctness
  • easy to show using the Lemma
  • Analysis: O(sort(N)) I/Os
  • Dijkstra's algorithm using an I/O-efficient priority
    queue and graph blocking (a sketch follows below)
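A minimal in-memory sketch of the reduction (Python; the subgrid decomposition is
assumed given, everything fits in memory, and a plain binary heap replaces the
I/O-efficient priority queue):

    import heapq

    # Each subgrid is an adjacency dict (vertex -> {neighbor: weight}) plus its set of
    # boundary vertices. Local Dijkstra runs turn boundary-to-boundary distances into
    # edges of the reduced graph G_R; a final Dijkstra on G_R gives distances from s
    # to all boundary vertices (step 1 of the algorithm above).

    def dijkstra(adj, source):
        dist, pq = {source: 0.0}, [(0.0, source)]
        while pq:
            d, v = heapq.heappop(pq)
            if d > dist[v]:
                continue
            for u, w in adj.get(v, {}).items():
                if d + w < dist.get(u, float("inf")):
                    dist[u] = d + w
                    heapq.heappush(pq, (d + w, u))
        return dist

    def reduced_graph(subgrids):
        gr = {}
        for adj, boundary in subgrids:
            for v in boundary:
                local = dijkstra(adj, v)             # shortest paths inside one subgrid
                for u in boundary:
                    if u != v and u in local:
                        w = min(gr.get(v, {}).get(u, float("inf")), local[u])
                        gr.setdefault(v, {})[u] = w
                        gr.setdefault(u, {})[v] = w  # undirected grid graph
        return gr

    def sssp_to_boundary_vertices(subgrids, s):
        gr = reduced_graph(subgrids)
        gr.setdefault(s, {})                         # s is assumed to be a boundary vertex
        return dijkstra(gr, s)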

48
(No Transcript)
49
Results on Planar graphs
  • Planar graph G with N vertices
  • Separators can be computed in O(sort(N)) I/Os
  • I/O-efficient reductions ABT00, AMTZ01
  • → BFS, DFS, SSSP in O(sort(N)) I/Os

50
SSSP on Planar Graphs
  • Similar to grid graphs. Assume M > B^2, bounded
    degree
  • Assume the graph is separated
  • O(N/B^2) subgraphs, O(B^2) vertices each, S = O(N/B)
    separator vertices
  • each subgraph adjacent to O(B) separators

51
SSSP on Planar Graphs
  • Reduced graph G_R
  • S = O(N/B) vertices
  • O(N/B^2) x O(B^2) = O(N) edges
  • Compute SSSP on G_R
  • Dijkstra's algorithm and an I/O-efficient priority
    queue
  • Each vertex is accessed once by each of its O(B)
    adjacent vertices → O(N) I/Os
  • Use boundary sets
  • O(N/B^2) boundary sets, each accessed once by its
    O(B) adjacent vertices → O(N/B) I/Os

52
On I/O-Efficient DFS
  • DFS upper bounds
  • Internal-memory algorithm: O(V+E) time, O(V+E)
    I/Os
  • Best upper bound
  • O((V + E/B) log V) I/Os on general graphs
  • DFS on general graphs is a big open problem
  • Note: PRAM DFS is P-complete
  • DFS on planar graphs uses O(sort(N)) I/Os
  • DFS to BFS reduction AMTZ01

53
DFS to BFS Reduction on Planar Graphs
  • Idea: Partition the faces of G into levels around
    a source face containing s and grow the DFS
    level by level
  • Levels can be obtained from BFS in dual graph
  • Denote:
  • G_i = union of the boundaries of faces at level <
    i
  • T_i = DFS tree of G_i
  • H_i = G_i \ G_{i-1}
  • Algorithm: compute a spanning forest of H_i and
    attach it onto T_{i-1}
  • Structure of levels is simple
  • The bicomps of the H_i are the boundary cycles of
    G_i
  • Gluing onto T_{i-1} is simple
  • A spanning tree is a DFS tree if and only if it
    has no cross edges

54
DFS to BFS Reduction on Planar Graphs
  • Idea: Partition the faces of G into levels around
    a source face containing s and grow the DFS
    level by level

55
Other Graphs Results
  • Grid graphs ATV00
  • MST, SSSP in O(sort(N)) I/Os
  • CC in O(scan(N)) I/Os
  • Planar graphs ABT00, AMTZ01
  • Planar reductions
  • DFS
  • General graphs ABT00
  • MST in O(sort(N) log log N) I/Os
  • Planar directed graphs submitted
  • Topological sorting and ear decomposition in
    O(sort(N)) I/Os

56
..In Conclusion
  • I have tried to convince you of a few things
  • Massive data is available, and scalable algorithms
    are necessary in order to process it
  • I/O-efficient algorithms have applications
    outside computer science and have big potential
    for (interdisciplinary) collaboration
  • I/O-efficient algorithms are theory and practice
    put together and support educational efforts
  • Challenging, rewarding, fun!

57
Collaboration
  • Rewarding, good response
  • Duke Nicholas School of the Environment
  • NCSU Dept. of Marine, Earth and Atmospheric
    Sciences
  • GRASS, ESRI
  • TerraFlow
  • Incorporated in GRASS AMT02
  • Current work with U. Muenster, Germany
  • 2 MS students port TerraFlow to Visual C++ under
    Windows and make it an ArcInfo extension
  • Extends projects and brings up new problems
  • LIDAR data