1
Seven O'Clock: A New Distributed GVT Algorithm Using Network Atomic
Operations
  • David Bauer, Garrett Yaun
  • Christopher Carothers
  • Computer Science

Murat Yuksel, Shivkumar Kalyanaraman (ECSE)
2
Global Virtual Time
  • Defines a lower bound on any unprocessed event in the system
    (see the sketch below).
  • Defines the point beyond which events should not be reclaimed.
  • Imperative that the GVT computation operate as efficiently as
    possible.
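For concreteness, GVT is the minimum over every processor's
unprocessed events and any messages still in flight; events below it
can be reclaimed. A minimal sketch with hypothetical state (not the
authors' code):

    #include <stdio.h>

    /* Hypothetical per-processor state: earliest unprocessed event
     * and earliest message sent but not yet received. */
    typedef struct {
        double min_unprocessed;
        double min_in_flight;
    } lp_state;

    /* GVT is the global minimum over both quantities. */
    double compute_gvt(const lp_state *lps, int n) {
        double gvt = lps[0].min_unprocessed;
        for (int i = 0; i < n; i++) {
            if (lps[i].min_unprocessed < gvt) gvt = lps[i].min_unprocessed;
            if (lps[i].min_in_flight  < gvt) gvt = lps[i].min_in_flight;
        }
        return gvt;  /* events with timestamp < gvt may be reclaimed */
    }

    int main(void) {
        lp_state lps[3] = {{7.0, 9.0}, {5.0, 9.0}, {10.0, 6.5}};
        printf("GVT = %.1f\n", compute_gvt(lps, 3));  /* prints 5.0 */
        return 0;
    }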

3
Key Problems
  • Simultaneous Reporting Problem: arises because not all processors
    report their local minimum at precisely the same instant in
    wall-clock time.
  • Transient Message Problem: a message is delayed in the network
    and neither the sender nor the receiver considers that message in
    their respective GVT calculations.

Asynchronous Solution: create a synchronization, or cut, across the
distributed simulation that divides events into two categories: past
and future.
Consistent Cut: a cut where there is no message scheduled in the
future of the sending processor but received in the past of the
destination processor.
4
Mattern's GVT Algorithm
  • Construct cut via message-passing

Cost: O(log N) with a tree, O(N) with a ring
  • With a large number of processors, the free event pool may be
    exhausted while waiting for the GVT computation to complete
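The O(log N) tree cost corresponds to a reduction over
processor-local minima once the cut has colored the messages. A
minimal MPI sketch of just that reduction step (the local minimum is
a stand-in value):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Stand-in for this processor's local minimum (LVT plus any
         * messages it has sent that are not yet accounted for). */
        double local_min = 100.0 - 7.0 * rank;

        /* MPI_Allreduce is typically a tree-style reduction:
         * O(log N) communication rounds. */
        double gvt;
        MPI_Allreduce(&local_min, &gvt, 1, MPI_DOUBLE, MPI_MIN,
                      MPI_COMM_WORLD);

        if (rank == 0) printf("GVT estimate = %f\n", gvt);
        MPI_Finalize();
        return 0;
    }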

5
Fujimoto's GVT Algorithm
  • Construct cut using shared memory flag

Cost: O(1)
Sequentially consistent memory model ensures
proper causal order
  • Limited to shared memory architecture
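The shared-memory flag idea can be expressed with C11 atomics, which
default to sequential consistency; a minimal sketch with hypothetical
names, assuming a threaded simulator:

    #include <stdatomic.h>
    #include <stdbool.h>

    /* One shared flag starts the cut.  C11 atomic operations default
     * to sequentially consistent ordering, so any thread that sees
     * the flag set also sees every write that preceded the store. */
    static atomic_bool gvt_flag;

    void initiate_gvt(void)    { atomic_store(&gvt_flag, true); }

    /* Each worker polls the flag in its event loop and, when set,
     * reports its local minimum for the GVT reduction. */
    bool gvt_in_progress(void) { return atomic_load(&gvt_flag); }

    int main(void) {
        initiate_gvt();
        return gvt_in_progress() ? 0 : 1;
    }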

6
Memory Model
  • Sequentially consistent does not mean
    instantaneous
  • Memory events are only guaranteed to be causally
    ordered

Is there a method to achieve sequentially
consistent shared memory in a loosely
coordinated, distributed environment?
7
GVT Algorithm Differences
Algorithm    Cost of Cut Calculation   Parallel/Distributed   Global Invariant     Independent of Event Memory
Fujimoto     O(1)                      P                      Shared Memory Flag   N
7 O'Clock    O(1)                      P, D                   Real Time Clock      Y
Mattern      O(N) or O(log N)          P, D                   Message Passing      N
Samadi       O(N) or O(log N)          P, D                   Message Passing      N

When an algorithm depends on available event memory, its cost is much
higher.
8
Network Atomic Operations
  • Goal: each processor observes the start of the GVT computation at
    the same instant of wall clock time
  • Definition: an NAO is an agreed-upon frequency in wall clock time
    at which some event is logically observed to have happened across
    a distributed system
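Because the NAO is a frequency in wall clock time, each processor can
compute the next NAO boundary independently. A minimal sketch using a
POSIX clock; the period value is hypothetical:

    #include <stdio.h>
    #include <time.h>

    /* Agreed-upon NAO period in wall clock seconds (hypothetical). */
    #define NAO_PERIOD 0.5

    static double wallclock(void) {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    /* Every processor rounds up to the same period boundary, so all
     * observe the NAO at the same instant of wall clock time (within
     * clock synchronization error). */
    static double next_nao(void) {
        return ((long)(wallclock() / NAO_PERIOD) + 1) * NAO_PERIOD;
    }

    int main(void) {
        printf("next NAO at t = %.3f s\n", next_nao());
        return 0;
    }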

9
Network Atomic Operations
Possible operations provided by a complete sequentially consistent
memory model
10
Clock Synchronization
  • Assumption: all processors share a highly accurate, common view
    of wall clock time.
  • Basic building block: the CPU timestamp counter (see the sketch
    below)
  • Measures time in terms of clock cycles, so a gigahertz CPU clock
    has a granularity of 10^-9 secs
  • Sending events across the network has much larger granularity,
    depending on the technology: 10^-6 secs on 1000base/T
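A sketch of reading the timestamp counter on x86 via the GCC/Clang
intrinsic; the fixed clock rate is an assumption, and real code would
calibrate it:

    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtsc(), x86 only */

    /* Assumed constant 1 GHz clock; real code must calibrate this. */
    #define CPU_HZ 1.0e9

    int main(void) {
        uint64_t start = __rdtsc();
        /* ... work being timed ... */
        uint64_t end = __rdtsc();
        printf("elapsed = %.9f s\n", (end - start) / CPU_HZ);
        return 0;
    }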

11
Clock Synchronization
  • Issues: clock synchronization, drift, and jitter
  • Ostrovsky and Patt-Shamir:
  • provably optimal clock synchronization
  • clocks have drift and the message latency may be unbounded
  • A well-researched problem in distributed computing; we used a
    simplified approach
  • The simplified approach is helpful in determining whether the
    system is working properly

12
Max Send Δt
  • Definition: max_send_delta_t (Δt_max) is the maximum of:
  • the worst-case bound on the time to send an event through the
    network
  • twice the synchronization error
  • twice the maximum clock drift over the simulation time
  • Add a small amount of time to the NAO expiration
  • Similar to sequentially consistent memory
  • Overcomes the transient message problem, clock drift/jitter, and
    clock synchronization error

13
Max Send Δt: Clock Drift
  • Clock drift causes CPU clocks to become unsynchronized
  • Long-running simulations may require multiple synchronizations
  • Or, we account for it in the NAO
  • Max Send Δt overcomes clock drift by ensuring no event falls
    between the cracks

14
Max Send Δt
  • What if the clocks are not well synchronized?
  • Let ΔD_max be the maximum clock drift.
  • Let ΔS_max be the maximum synchronization error.
  • Solution: re-define Δt_max as
  • Δt_max = max(Δt_max, 2 ΔD_max, 2 ΔS_max)  (worked below)
  • In practice, both ΔD_max and ΔS_max are very small in comparison
    to Δt_max.
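As a worked sketch, the redefinition is just a three-way maximum; the
numeric values here are hypothetical:

    #include <stdio.h>

    static double max3(double a, double b, double c) {
        double m = a > b ? a : b;
        return m > c ? m : c;
    }

    int main(void) {
        double dt_max = 1e-6;   /* worst-case network send time       */
        double dD_max = 1e-8;   /* maximum clock drift (hypothetical) */
        double dS_max = 1e-8;   /* maximum synchronization error      */

        /* Δt_max = max(Δt_max, 2 ΔD_max, 2 ΔS_max); drift and sync
         * error are usually negligible next to the network bound. */
        dt_max = max3(dt_max, 2.0 * dD_max, 2.0 * dS_max);
        printf("dt_max = %g s\n", dt_max);
        return 0;
    }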

15
Transient Message Problem
  • Max Send Δt is the worst-case bound on the time to send an event
    through the network
  • guarantees events are accounted for by either the sender or the
    receiver

16
Simultaneous Reporting Problem
  • The problem arises when processors do not start the GVT
    computation simultaneously
  • Seven O'Clock starts simultaneously across all CPUs; therefore,
    the problem cannot occur

17
NAO
[Figure: NAO cut diagram across processors A-E]
18
NAO
[Figure: NAO cut diagram, continued]
19
Simulation: Seven O'Clock GVT Algorithm
  • Assumptions:
  • Each processor has a highly accurate clock
  • A message passing interface without acknowledgements is available
  • The worst-case bound on the time to transmit a message through
    the network, Δt_max, is known
  • Properties:
  • a clock-based algorithm for distributed processors
  • creates a sequentially consistent view of distributed memory

[Figure: example cut in wall clock time across LP1-LP4, with events at
virtual times 5, 7, 9, 10, and 12; each LP marks its cut point within
the Δt_max window around the NAO, reporting LVT_min(5,9) and
LVT_min(7,9), giving GVT = min(5,7). One round is sketched in code
below.]
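Putting the pieces together, one NAO round might look like the
following single-CPU sketch; NAO_PERIOD, process_next_event,
local_lvt_min, and min_sent_in_window are hypothetical stand-ins for
simulator internals, not the authors' implementation:

    #include <stdio.h>
    #include <time.h>

    /* --- Hypothetical stubs for simulator internals --- */
    #define NAO_PERIOD 0.1                 /* agreed NAO period, secs */

    static double wallclock(void) {        /* wall clock time in secs */
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    static double next_nao(void) {         /* next period boundary    */
        return ((long)(wallclock() / NAO_PERIOD) + 1) * NAO_PERIOD;
    }

    static double lvt = 5.0;               /* toy local virtual time  */
    static void process_next_event(void) { lvt += 0.001; }
    static double local_lvt_min(void)     { return lvt; }

    /* Smallest timestamp sent in the wall clock window [lo, hi];
     * stubbed here, tracked by the messaging layer in practice. */
    static double min_sent_in_window(double lo, double hi) {
        (void)lo; (void)hi;
        return 7.0;
    }

    /* One GVT round: simulate until the NAO cut, then fold in any
     * message sent within Δt_max of the cut so transient messages
     * are charged to the sender. */
    static double seven_oclock_round(double dt_max) {
        double cut = next_nao();
        while (wallclock() < cut)
            process_next_event();
        double m = local_lvt_min();
        double s = min_sent_in_window(cut - dt_max, cut);
        double local = m < s ? m : s;
        /* A distributed run would min-reduce `local` across CPUs. */
        return local;
    }

    int main(void) {
        printf("GVT estimate = %f\n", seven_oclock_round(1e-6));
        return 0;
    }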
20
Limitations
  • NAOs cannot be forced
  • agreed-upon intervals cannot change
  • Simulation end time:
  • in the worst case, a complete NAO interval passes with only one
    event remaining to process
  • amortized over the entire run time, the cost is O(1)
  • Exhausted event pool:
  • requires tuning to ensure enough optimistic memory is available

21
Uniqueness
  • The only real-time-based GVT algorithm
  • Zero-cost consistent cut → truly scalable
  • O(1) cost → optimal
  • The only algorithm entirely independent of available event memory
  • Event memory is only loosely tied to the GVT algorithm

22
Performance Analysis Models
  • r-PHOLD
  • PHOLD with reverse computation
  • Modified to control the percentage of remote events (normally 75%)
  • Destinations still decided using a uniform random number
    generator → all LPs are possible destinations
  • TCP-Tahoe
  • TCP-Tahoe ring of campus networks topology
  • Same topology design as used by PDNS in MASCOTS '03
  • Model limitations required us to increase the number of LAN
    routers in order to simulate the same network

23
Performance Analysis Clusters
              Itanium Cluster         NetSim Cluster              Sith Cluster
Location      RPI                     RPI                         Georgia Tech
Total Nodes   4                       40                          30
Total CPUs    16                      80                          60
Total RAM     64 GB                   20 GB                       180 GB
CPU           Quad Itanium-2 1.3GHz   Dual Intel 800MHz           Dual Itanium-2 900MHz
Network       Myrinet 1000base/T      ½ 100base/T, ½ 1000base/T   Ethernet 1000base/T
24
Itanium Cluster: r-PHOLD, CPUs allocated round-robin
25
Maximize distribution (round robin among nodes) versus maximize
parallelization (use all CPUs before using additional nodes)
26
NetSim Cluster: comparing 10% and 25% remote events (using 1 CPU per
node)
27
NetSim Cluster: comparing 10% and 25% remote events (using 1 CPU per
node)
28
TCP Model Topology
Single Campus
10 Campus Networks in a Ring
Our model contained 1,008 campus networks in a ring, simulating
> 540,000 nodes.
29
Itanium Cluster: TCP results using 2 and 4 nodes
30
Sith Cluster: TCP model using 1 CPU per node and 2 CPUs per node
31
Future Work & Conclusions
  • Investigate the power of different models by computing a spectral
    analysis
  • GVT is now in the frequency domain
  • Determine the maximum length of rollbacks
  • Investigate new ways of measuring performance
  • Models are too large to run sequentially
  • Account for hardware effects (even in a NOW there are
    fluctuations in HW performance)
  • Account for model-to-LP mapping
  • Account for different cases, i.e., 4 CPUs distributed across 1,
    2, and 4 nodes