Title: CX: A Scalable, Robust Network for Parallel Computing
CX: A Scalable, Robust Network for Parallel Computing
- Peter Cappello, Dimitrios Mourloukos
- Computer Science
- UCSB
Outline
- Introduction
- Related work
- API
- Architecture
- Experimental results
- Current & future work
Introduction
- "Listen to the technology!" (Carver Mead)
- What is the technology telling us?
  - The Internet's idle cycles/sec are growing rapidly.
  - Bandwidth is increasing & getting cheaper.
  - Communication latency is not decreasing.
  - Human technology is getting neither cheaper nor faster.
Introduction: Project Goals
- Minimize job completion time
  - despite large communication latency
- Jobs complete with high probability
  - despite faulty components
- The application program is oblivious to the
  - number of processors
  - inter-process communication
  - fault tolerance
Introduction: Fundamental Issue: Heterogeneity
[Figure: machines M1 to M5 running operating systems OS1 to OS5 (heterogeneous machine/OS) become a functionally homogeneous JVM]
Related work
- Cilk → Cilk-NOW → Atlas
  - DAG computational model
  - Work-stealing
Related work
- Linda → Piranha → JavaSpaces
  - Space-based coordination
  - Decoupled communication
Related work
- Charlotte (Milan project / Calypso prototype)
  - High performance ⇒ fault tolerance not achieved via transactions
  - Fault tolerance via eager scheduling
Related work
- SuperWeb → Javelin → Javelin++
  - Architecture: client, broker, host
API
- DAG computational model
    int f( int n ) {
        if ( n < 2 )
            return n;
        else
            return f( n-1 ) + f( n-2 );
    }
DAG Computational Model
    int f( int n ) {
        if ( n < 2 ) return n;
        else return f( n-1 ) + f( n-2 );
    }
[Figure: method invocation tree of f(4): f(4) calls f(3) and f(2); f(3) calls f(2) and f(1); each f(2) calls f(1) and f(0)]
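To make the method invocation tree concrete, the Java sketch below builds the tree that a call to f(4) unfolds into and prints it. The class InvocationTree and its methods are illustrative names, not part of CX; the tree it prints has the nine nodes drawn on the slide.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: builds the method invocation tree of the recursive f(n)
// and prints it with indentation. Not CX code; all names are hypothetical.
class InvocationTree {
    final int n;
    final List<InvocationTree> children = new ArrayList<>();

    InvocationTree(int n) {
        this.n = n;
        if (n >= 2) {                          // recursive case: f(n) calls f(n-1) and f(n-2)
            children.add(new InvocationTree(n - 1));
            children.add(new InvocationTree(n - 2));
        }
    }

    // Value computed at this node: a leaf returns n, an inner node sums its two children.
    int value() {
        if (children.isEmpty()) return n;
        return children.get(0).value() + children.get(1).value();
    }

    // Number of invocations (nodes) in the subtree rooted here.
    int size() {
        int s = 1;
        for (InvocationTree c : children) s += c.size();
        return s;
    }

    void print(String indent, StringBuilder out) {
        out.append(indent).append("f(").append(n).append(")\n");
        for (InvocationTree c : children) c.print(indent + "  ", out);
    }

    public static void main(String[] args) {
        InvocationTree root = new InvocationTree(4);
        StringBuilder sb = new StringBuilder();
        root.print("", sb);
        System.out.print(sb);
        System.out.println("invocations = " + root.size() + ", f(4) = " + root.value());
        // last line printed: invocations = 9, f(4) = 3
    }
}
```

Every inner node is a point where the computation splits into two independent subcomputations; those are the nodes the DAG model turns into separately schedulable tasks.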
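DAG Computational Model / API
- Decompose task f(n):
    execute( ) {
        if ( n < 2 )
            setArg( ArgAddr, n );
        else {
            spawn ( /* compose task */ );
            spawn ( f(n-1) );
            spawn ( f(n-2) );
        }
    }
- Compose task, which waits for the results of f(n-1) and f(n-2):
    execute( ) {
        setArg( ArgAddr, in0 + in1 );
    }
[Figure: f(n) spawns f(n-1) and f(n-2); alongside, the task DAG for f(4) with tasks f(3), f(2), f(2), f(1), f(1), f(1), f(0), f(0)]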
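The decompose/compose pair above can be exercised without any network. The Java sketch below is a single-process approximation under stated assumptions: only execute, spawn, setArg, ArgAddr and in0/in1 come from the slide, while Task, Fib, Sum, Sink, Scheduler, the int-valued slots and the LIFO ready list are hypothetical choices. A compose task sits in a WAITING set until both of its inputs have been delivered, then moves to READY.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Sketch of the decompose/compose API run by one in-process "producer".
// Only execute/spawn/setArg/ArgAddr/in0/in1 mirror the slide; the rest is hypothetical.
class DagSketch {
    static final class ArgAddr {                  // an input slot of a waiting task
        final Task target; final int slot;
        ArgAddr(Task target, int slot) { this.target = target; this.slot = slot; }
    }

    static abstract class Task {
        Scheduler scheduler;                      // set when the task is submitted
        final int[] in;                           // in[0] = in0, in[1] = in1
        int missing;                              // inputs not yet delivered
        Task(int arity) { in = new int[arity]; missing = arity; }
        abstract void execute();
        void spawn(Task t) { scheduler.submit(t); }
        void setArg(ArgAddr a, int value) { scheduler.deliver(a, value); }
    }

    static final class Fib extends Task {         // decompose task f(n)
        final int n; final ArgAddr argAddr;
        Fib(int n, ArgAddr argAddr) { super(0); this.n = n; this.argAddr = argAddr; }
        void execute() {
            if (n < 2) {
                setArg(argAddr, n);               // base case
            } else {
                Sum sum = new Sum(argAddr);       // compose task inherits f(n)'s ArgAddr
                spawn(sum);
                spawn(new Fib(n - 1, new ArgAddr(sum, 0)));
                spawn(new Fib(n - 2, new ArgAddr(sum, 1)));
            }
        }
    }

    static final class Sum extends Task {         // compose task
        final ArgAddr argAddr;
        Sum(ArgAddr argAddr) { super(2); this.argAddr = argAddr; }
        void execute() { setArg(argAddr, in[0] + in[1]); }
    }

    static final class Sink extends Task {        // receives the final result; never executed
        Sink() { super(1); }
        void execute() { }
    }

    static final class Scheduler {
        final Deque<Task> ready = new ArrayDeque<>();
        final Set<Task> waiting = new HashSet<>();

        void submit(Task t) {
            t.scheduler = this;
            if (t.missing == 0) ready.push(t); else waiting.add(t);
        }

        void deliver(ArgAddr a, int value) {
            Task t = a.target;
            t.in[a.slot] = value;
            // Last input arrived: move the task from WAITING to READY.
            if (--t.missing == 0 && waiting.remove(t)) ready.push(t);
        }

        void run() { while (!ready.isEmpty()) ready.pop().execute(); }
    }

    static int fib(int n) {
        Scheduler s = new Scheduler();
        Sink sink = new Sink();
        s.submit(new Fib(n, new ArgAddr(sink, 0)));
        s.run();
        return sink.in[0];
    }

    public static void main(String[] args) {
        System.out.println("f(4) = " + fib(4));  // prints f(4) = 3
    }
}
```

In CX the ready tasks would be handed to many producers by a task server; here a single loop plays every role, which is enough to check that the spawn/setArg wiring computes f(n).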
Architecture: Basic Entities
[Figure: a CONSUMER and the PRODUCTION NETWORK (a CLUSTER NETWORK); the consumer's calls are register, then spawn and getResult, then unregister]
Architecture: Cluster
[Figure: a TASK SERVER connected to four PRODUCERs]
A Cluster at Work
[Figure sequence: a TASK SERVER with READY and WAITING queues and two PRODUCERs; the root task f(4) enters the READY queue and is taken by a PRODUCER]
Decompose
    execute( ) {
        if ( n < 2 )
            setArg( ArgAddr, n );
        else {
            spawn ( /* compose task */ );
            spawn ( f(n-1) );
            spawn ( f(n-2) );
        }
    }
A Cluster at Work
[Figure sequence: the PRODUCER holding f(4) decomposes it; f(3) and f(2) enter the READY queue and are taken by PRODUCERs, which decompose them in turn into f(2), f(1), f(1) and f(0)]
Compute Base Case
    execute( ) {
        if ( n < 2 )
            setArg( ArgAddr, n );
        else {
            spawn ( /* compose task */ );
            spawn ( f(n-1) );
            spawn ( f(n-2) );
        }
    }
A Cluster at Work
[Figure sequence: the remaining f(2) tasks are decomposed, and PRODUCERs take the base-case tasks f(1) and f(0) from the READY queue and execute them]
Compose
    execute( ) {
        setArg( ArgAddr, in0 + in1 );
    }
A Cluster at Work
[Figure sequence: the last base-case tasks finish, the compose tasks execute as their inputs arrive, and finally the result R appears at the TASK SERVER]
A Cluster at Work
- The result object R is sent to the Production Network.
- The Production Network returns it to the Consumer.
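Task Server Proxy: Overlap Communication with Computation
[Figure: inside a PRODUCER, a Task Server Proxy with a PRIORITY Q, an INBOX, an OUTBOX and COMP and COMM components; the proxy exchanges tasks and results with the TASK SERVER, which holds READY and WAITING queues]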
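A minimal sketch of the overlap idea, assuming an in-memory queue in place of the remote task server and a fixed per-fetch latency: a COMM thread prefetches tasks into a bounded INBOX while the computing thread executes the task it already has, and results go to an OUTBOX. The slide's PRIORITY Q is simplified to FIFO order here; the class name, the END sentinel, and the squaring "work" are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: overlapping communication with computation inside one producer.
// The "server" is a local queue; latency, names and the work itself are assumptions.
class TaskServerProxySketch {
    private static final int END = -1;   // sentinel: no more tasks (tasks are non-negative)

    static List<Integer> run(List<Integer> serverTasks, long latencyMillis) {
        BlockingQueue<Integer> server = new LinkedBlockingQueue<>(serverTasks);
        BlockingQueue<Integer> inbox = new ArrayBlockingQueue<>(4);    // prefetched tasks
        BlockingQueue<Integer> outbox = new LinkedBlockingQueue<>();   // finished results

        // COMM: fetch the next task while COMP is still busy with the current one.
        Thread comm = new Thread(() -> {
            try {
                Integer task;
                while ((task = server.poll()) != null) {
                    Thread.sleep(latencyMillis);   // simulated network round trip
                    inbox.put(task);               // blocks while the inbox is full
                }
                inbox.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        comm.start();

        // COMP: execute tasks as they arrive; squaring stands in for real work.
        try {
            while (true) {
                int task = inbox.take();
                if (task == END) break;
                outbox.put(task * task);
            }
            comm.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("producer interrupted", e);
        }

        List<Integer> results = new ArrayList<>();
        outbox.drainTo(results);                   // a real proxy would ship these back
        return results;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(1, 2, 3, 4), 10));   // prints [1, 4, 9, 16]
    }
}
```

Bounding the inbox keeps a producer from hoarding tasks that an eager scheduler might rather hand to someone else, while still hiding one or more fetch latencies behind computation.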
Architecture: Work Stealing & Eager Scheduling
- A task is removed from the server only after a complete signal is received.
- A task may be assigned to multiple producers.
- Balance task load among producers of varying processor speeds.
- Tasks on failed/retreating producers are re-assigned.
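The eager-scheduling rules can be sketched as a small server-side table. In this Java sketch the names are hypothetical and the fewest-assignments-first choice is an assumption, not CX's documented policy; work stealing between producers is not modelled. A task leaves the table only on its first complete signal, so a fast producer that runs out of work is handed a task that a slower (or failed) producer still holds, and the late duplicate result is ignored.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of eager scheduling inside one task server. Names and the selection
// policy ("fewest assignments first") are assumptions, not CX's actual code.
class EagerScheduler {
    // taskId -> number of producers it has been assigned to (insertion-ordered)
    private final Map<String, Integer> unfinished = new LinkedHashMap<>();
    private final Map<String, Object> results = new HashMap<>();

    void add(String taskId) { unfinished.putIfAbsent(taskId, 0); }

    // Called when a producer asks for work; returns null once every task is complete.
    String assign() {
        String pick = null;
        int fewest = Integer.MAX_VALUE;
        for (Map.Entry<String, Integer> e : unfinished.entrySet()) {
            if (e.getValue() < fewest) { pick = e.getKey(); fewest = e.getValue(); }
        }
        if (pick != null) unfinished.merge(pick, 1, Integer::sum);  // may re-assign a running task
        return pick;
    }

    // "Complete" signal from a producer: true only for the first result of a task.
    boolean complete(String taskId, Object result) {
        if (unfinished.remove(taskId) == null) return false;   // duplicate or unknown task
        results.put(taskId, result);
        return true;
    }

    Object result(String taskId) { return results.get(taskId); }
    boolean allDone() { return unfinished.isEmpty(); }

    public static void main(String[] args) {
        EagerScheduler s = new EagerScheduler();
        s.add("A"); s.add("B");
        String fast = s.assign();                   // fast producer gets A
        String slow = s.assign();                   // slow producer gets B
        s.complete(fast, 1);                        // A finishes; fast producer is idle again
        String again = s.assign();                  // B is handed out a second time
        System.out.println(fast + " " + slow + " " + again);   // prints A B B
        System.out.println(s.complete(again, 2));   // true: first result for B
        System.out.println(s.complete(slow, 2));    // false: duplicate ignored
        System.out.println(s.allDone());            // true
    }
}
```

A producer that fails simply never sends its complete signal, so its task stays in the table and is handed out again; no explicit failure detection is needed for re-assignment.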
Architecture: Scalability
- A cluster tolerates producer
  - retreat
  - failure
- A single task server, however, is a
  - bottleneck
  - single point of failure.
- We introduce a network of task servers.
Scalability: Class Loading
1. The CX class loader loads classes (Consumer JAR) into each server's class cache.
2. A producer loads classes from its server.
Scalability: Fault-tolerance
- Replicate a server's tasks on its sibling.
- When a server fails, its sibling restores state to a replacement server.
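A sketch of sibling replication, assuming servers are paired two by two and that task insertions and removals are mirrored synchronously; TaskServer, pair, and replace are hypothetical names. Because the replacement becomes the survivor's new sibling, the pair can again absorb a single failure, so the network survives a sequence of failures as long as they occur one at a time.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of sibling replication between task servers. All names are hypothetical;
// mirroring is synchronous and failure detection is assumed to happen elsewhere.
class ReplicationSketch {
    static final class TaskServer {
        final String name;
        final Map<String, String> tasks = new HashMap<>();    // this server's own tasks
        final Map<String, String> replica = new HashMap<>();  // mirror of the sibling's tasks
        TaskServer sibling;

        TaskServer(String name) { this.name = name; }

        void putTask(String id, String task) {
            tasks.put(id, task);
            if (sibling != null) sibling.replica.put(id, task);
        }

        void removeTask(String id) {
            tasks.remove(id);
            if (sibling != null) sibling.replica.remove(id);
        }
    }

    static void pair(TaskServer a, TaskServer b) { a.sibling = b; b.sibling = a; }

    // Invoked once `failed` is detected as dead; its sibling restores its state.
    static TaskServer replace(TaskServer failed, String newName) {
        TaskServer survivor = failed.sibling;
        TaskServer replacement = new TaskServer(newName);
        replacement.tasks.putAll(survivor.replica);    // failed server's tasks come back
        replacement.replica.putAll(survivor.tasks);    // survivor's tasks are mirrored again
        pair(survivor, replacement);                   // ready to tolerate the next failure
        return replacement;
    }

    public static void main(String[] args) {
        TaskServer a = new TaskServer("A"), b = new TaskServer("B");
        pair(a, b);
        a.putTask("t1", "f(3)");
        b.putTask("t2", "f(2)");
        TaskServer c = replace(a, "C");   // A fails
        TaskServer d = replace(b, "D");   // later, B fails too
        System.out.println(c.tasks + " " + d.tasks);   // prints {t1=f(3)} {t2=f(2)}
    }
}
```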
Architecture: Production Network of Clusters
- The network tolerates a single server failure.
- It then restores its ability to tolerate a single failure.
- ⇒ Ability to tolerate a sequence of failures.
Preliminary experiments
- Experiments run on a Linux cluster
  - 100-port Lucent P550 Cajun Gigabit Switch
  - Machine
    - 2 Intel EtherExpress Pro 100 Mb/s Ethernet cards
    - Red Hat Linux 6.0
    - JDK 1.2.2_RC3
  - Heterogeneous
    - processor speeds
    - processors/machine
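Fibonacci Tasks with Synthetic Load
- Decompose task f(n):
    execute( ) {
        if ( n < 2 ) {
            syntheticWorkload( );
            setArg( ArgAddr, n );
        } else {
            syntheticWorkload( );
            spawn ( /* compose task */ );
            spawn ( f(n-1) );
            spawn ( f(n-2) );
        }
    }
- Compose task:
    execute( ) {
        syntheticWorkload( );
        setArg( ArgAddr, in0 + in1 );
    }
[Figure: f(n) spawns f(n-1) and f(n-2)]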
TSEQ vs. T1 (seconds): Computing F(8)

| Workload | TSEQ | T1 | Efficiency (TSEQ / T1) |
|---|---|---|---|
| 4.522 | 497.420 | 518.816 | 0.96 |
| 3.740 | 415.140 | 436.897 | 0.95 |
| 2.504 | 280.448 | 297.474 | 0.94 |
| 1.576 | 179.664 | 199.423 | 0.90 |
| 0.914 | 106.024 | 120.807 | 0.88 |
| 0.468 | 56.160 | 65.767 | 0.85 |
| 0.198 | 24.750 | 29.553 | 0.84 |
| 0.058 | 8.120 | 11.386 | 0.71 |
Average task time: Workload 1, 1.8 sec.; Workload 2, 3.7 sec.
- Parallel efficiency for F(13): 0.87
- Parallel efficiency for F(18): 0.99
Current work
- Implement a CX market maker (broker)
  - Solves the discovery problem between Consumers & Production networks
- Enhance the Producer with Lea's Fork/Join Framework
  - See gee.cs.oswego.edu
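The fork/join framework in java.util.concurrent (Java 7 and later) descends from Doug Lea's framework referenced above. As a rough illustration of what a fork/join-enhanced producer could do with a Fibonacci task locally, here is a standard RecursiveTask sketch; the class name and the sequential cutoff of 10 are arbitrary assumptions, and this is not CX code.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sketch using the JDK fork/join framework (a descendant of Lea's framework).
// FibFJ and THRESHOLD are illustrative; this runs inside one JVM, not across CX.
class FibFJ extends RecursiveTask<Long> {
    static final int THRESHOLD = 10;   // below this, recurse sequentially (assumed cutoff)
    final int n;

    FibFJ(int n) { this.n = n; }

    static long seqFib(int n) { return n < 2 ? n : seqFib(n - 1) + seqFib(n - 2); }

    @Override
    protected Long compute() {
        if (n < THRESHOLD) return seqFib(n);
        FibFJ left = new FibFJ(n - 1);
        left.fork();                               // decompose: f(n-1) may be stolen by another worker
        long right = new FibFJ(n - 2).compute();   // run f(n-2) in this worker thread
        return left.join() + right;                // compose: in0 + in1
    }

    static long fib(int n) { return ForkJoinPool.commonPool().invoke(new FibFJ(n)); }

    public static void main(String[] args) {
        System.out.println("f(30) = " + fib(30));  // prints f(30) = 832040
    }
}
```

The fork/compute/join pattern mirrors the spawn/setArg structure of a CX task, but the work stealing here happens among threads of one process rather than among producers.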
Current work
- Enhance the computational model: branch & bound.
  - Propagate new bounds through the production network: 3 steps
[Figure: SEARCH TREE and PRODUCTION NETWORK, with labels BRANCH and TERMINATE!]
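The branch & bound idea can be sketched on a toy problem. Below, 0/1 knapsack stands in for the search, and a single shared AtomicInteger stands in for propagating a newly found bound to every branch; in CX the bound would have to travel through the production network, which this single-process sketch does not model. The problem choice, the bound function, and all names are assumptions.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of branch & bound with a shared incumbent (0/1 knapsack as the example).
// The AtomicInteger stands in for network-wide bound propagation; names are hypothetical.
class KnapsackBnB {
    private final int[] weight, value;
    private final int capacity;
    private final AtomicInteger best = new AtomicInteger(0);   // best value found so far

    KnapsackBnB(int[] weight, int[] value, int capacity) {
        this.weight = weight; this.value = value; this.capacity = capacity;
    }

    // Optimistic bound: current value plus every remaining item's value.
    private int upperBound(int i, int v) {
        for (int j = i; j < value.length; j++) v += value[j];
        return v;
    }

    private void search(int i, int w, int v) {
        best.accumulateAndGet(v, Math::max);                // publish an improved bound
        if (i == value.length) return;
        if (upperBound(i, v) <= best.get()) return;         // prune: cannot beat the bound
        if (w + weight[i] <= capacity) search(i + 1, w + weight[i], v + value[i]);  // branch: take i
        search(i + 1, w, v);                                                       // branch: skip i
    }

    int solve() { search(0, 0, 0); return best.get(); }

    public static void main(String[] args) {
        int[] w = {10, 20, 30}, v = {60, 100, 120};
        System.out.println(new KnapsackBnB(w, v, 50).solve());   // prints 220
    }
}
```

Pruning only when the optimistic bound cannot exceed the incumbent keeps the search exact; the faster a better incumbent reaches every branch, the more of the search tree can be terminated early.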
Current work
- Investigate computations that appear ill-suited to adaptive parallelism
  - SOR
  - N-body.
End of CX Presentation
- www.cs.ucsb.edu/research/cx
- Next release: end of June; includes source.
- E-mail: cappello_at_cs.ucsb.edu
Introduction: Fundamental Issues
- Communication latency
  - Long latency ⇒ overlap computation with communication.
- Robustness
  - Massive parallelism ⇒ faults
- Scalability
  - Massive parallelism ⇒ login privileges cannot be required.
- Ease of use
  - Jini ⇒ easy upgrade of system components
Related work
- Market mechanisms
  - Huberman; Waldspurger; Malone; Miller & Drexler; Newhouse & Darlington
Related work
- CX integrates
  - DAG computational model
  - Work-stealing scheduler
  - Space-based, decoupled communication
  - Fault-tolerance via eager scheduling
  - Market mechanisms (incentive to participate)
Architecture: Task Identifier
- The DAG has a spawn tree.
- TaskID = path id
- Root.TaskID = 0
- The TaskID is used to detect duplicate
  - tasks
  - results.
[Figure: spawn tree of F(4) with nodes F(3), F(2), F(1) and F(0); edges labelled 0, 1 and 2]
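One way to realize path-based TaskIDs, sketched under assumptions: a child's ID is its parent's ID plus its spawn index, joined with ".", where index 0 denotes the compose task and 1 and 2 denote f(n-1) and f(n-2). Because an ID depends only on the task's position in the spawn tree, a task re-executed by a second producer reproduces the same IDs, so the server can drop a duplicate result by ID. TaskIds, spawnTree and ResultTable are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of path-based task identifiers and duplicate-result filtering.
// The spawn-index convention and the "." separator are assumptions.
class TaskIds {
    static String childId(String parentId, int spawnIndex) { return parentId + "." + spawnIndex; }

    // Collect the IDs of all tasks spawned while decomposing f(n).
    static void spawnTree(int n, String id, List<String> out) {
        out.add(id);
        if (n < 2) return;
        out.add(childId(id, 0));                  // compose task
        spawnTree(n - 1, childId(id, 1), out);    // f(n-1)
        spawnTree(n - 2, childId(id, 2), out);    // f(n-2)
    }

    // Server-side filter: accept a result only the first time its TaskID is seen.
    static final class ResultTable {
        private final Map<String, Integer> results = new HashMap<>();
        boolean accept(String taskId, int value) { return results.putIfAbsent(taskId, value) == null; }
        Integer get(String taskId) { return results.get(taskId); }
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        spawnTree(4, "0", ids);                         // Root.TaskID = 0
        System.out.println(ids);
        ResultTable table = new ResultTable();
        System.out.println(table.accept("0.1.2", 1));  // true: first result for this task
        System.out.println(table.accept("0.1.2", 1));  // false: duplicate from a second producer
    }
}
```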
Architecture: Basic Entities
- Consumer
  - Seeks computing resources.
- Producer
  - Offers computing resources.
- Task Server
  - Coordinates task distribution among its producers.
- Production Network
  - A network of task servers & their associated producers.
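Defining Parallel Efficiency
- Scalar: homogeneous set of P machines
  - Parallel efficiency = $\frac{T_1 / P}{T_P}$
- Vector: heterogeneous set of P machines
  - $P = (P_1, P_2, \ldots, P_d)$, where there are
    - $P_1$ machines of type 1,
    - $P_2$ machines of type 2,
    - ...,
    - $P_d$ machines of type d
  - Parallel efficiency = $\frac{\left( P_1 / T_1 + P_2 / T_2 + \cdots + P_d / T_d \right)^{-1}}{T_P}$, where $T_i$ is the time on a single machine of type i (for d = 1 this reduces to the scalar definition)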
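A short Java sketch of both definitions, reading $T_i$ as the time a single machine of type i needs for the whole job (an assumption about the slide's notation). The class and method names are hypothetical. For example, with two machines that need 100 s alone and two that need 200 s, the ideal time is 1 / (2/100 + 2/200) ≈ 33.3 s, so a measured $T_P$ of 40 s gives an efficiency of about 0.83.

```java
// Sketch of the scalar and vector parallel-efficiency definitions.
// Reading T_i as single-machine time on type i, and all names, are assumptions.
class ParallelEfficiency {
    // Homogeneous: E = (T1 / P) / TP
    static double scalar(double t1, int p, double tp) {
        return (t1 / p) / tp;
    }

    // Heterogeneous: E = (P1/T1 + ... + Pd/Td)^(-1) / TP
    static double vector(int[] p, double[] t, double tp) {
        double rate = 0.0;                 // combined work rate, in jobs per second
        for (int i = 0; i < p.length; i++) rate += p[i] / t[i];
        return (1.0 / rate) / tp;          // ideal parallel time divided by measured time
    }

    public static void main(String[] args) {
        System.out.println(scalar(100, 4, 25));                                   // 1.0
        System.out.println(vector(new int[] {2, 2}, new double[] {100, 200}, 40));
    }
}
```

Weighting each machine type by its own speed means a cluster of slow machines is not penalized for being slow, only for failing to use the capacity it has.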
Future work
- Support special hardware / data: inter-server task movement.
  - Diffusion model
    - Tasks are homogeneous gas atoms diffusing through the network.
  - N-body model: each kind of atom (task) has its own
    - mass (resistance to movement: code size, input size, ...)
    - attraction/repulsion to different servers
      - or to other massive entities, such as
        - special processors
        - large databases.
Future work
- CX preprocessor to simplify the API.