Title: Migratory File Services for Batch-Pipelined Workloads


1
Migratory File Services for Batch-Pipelined Workloads
  • John Bent, Douglas Thain, Andrea Arpaci-Dusseau,
  • Remzi Arpaci-Dusseau, and Miron Livny
  • WiND and Condor Projects
  • 6 May 2003

2
How to Run a Batch-Pipelined Workload?
(Diagram: three pipelines share a batch dataset. Jobs a1-a3 read their private inputs in.1-in.3 plus the shared batch data, pass pipeline files pipe.1-pipe.3 to jobs b1-b3, which write out.1-out.3.)
3
Cluster-to-Cluster Computing
(Diagram: a home system containing an archive and a Condor pool, plus a remote Grid Engine cluster and a remote PBS cluster, all connected across the Internet; each site contains many nodes.)
4
How to Run a Batch-Pipelined Workload?
  • Remote I/O
  • Submit jobs to a remote batch system.
  • Let all I/O come directly home.
  • Inefficient if re-use is common.
  • (But perfect if no data sharing!)
  • FTP-Net
  • User finds remote clusters.
  • Manually stages data in.
  • Submits jobs, deals with failures.
  • Pulls data out.
  • Lather, rinse, repeat.

5
Hawk: A Migratory File Service for Batch-Pipelined Workloads
  • Automatically deploys a task force across
    multiple wide-area systems.
  • Manages applications from a high level, using
    knowledge of process interactions.
  • Provides dependable performance with peer-to-peer
    techniques. (Locality is key!)
  • Understands and reacts to failures using
    knowledge of the system and workloads.

6
Dangers
  • Failures
  • Physical: Networks fail, disks crash, CPUs halt.
  • Logical: Out of space/memory, lease expired.
  • Administrative: You can't use cluster X today.
  • Dependencies
  • A comes before C and D, which are simultaneous.
  • What do we do if the output of C is lost?
  • Risk vs. Reward
  • A gamble: Staging input data to a remote CPU.
  • A gamble: Leaving output data at a remote CPU.

7
Hawk In Action
(Diagram: the three pipelines (a1/b1/c1 with i1/o1, a2/b2/c2 with i2/o2, and a3/b3/c3 with i3/o3) run across the home Condor pool, the Grid Engine cluster, and the PBS cluster, with the outputs o1-o3 collected back at the archive on the home system.)
8
Workflow Language 1 (Start With Condor DAGMan)
  • job a a.condor
  • job b b.condor
  • job c c.condor
  • job d d.condor
  • parent a child c
  • parent b child d

(DAG diagram: a -> c, b -> d.)
9
Workflow Language 2
  • volume v1 ftp://archive/mydata
  • mount v1 a /data
  • mount v1 b /data
  • volume v2 scratch
  • mount v2 a /tmp
  • mount v2 c /tmp
  • volume v3 scratch
  • mount v3 b /tmp
  • mount v3 d /tmp

(Diagram: scratch volume v2 is shared by jobs a and c; scratch volume v3 is shared by jobs b and d.)
10
Workflow Language 3
  • extract v2 x ftp://home/out.1
  • extract v3 x ftp://home/out.2

(Diagram: the file x in scratch volume v2 is extracted home as out.1, and the x in v3 as out.2.)
11
Mapping the Workflow to the Migratory File System
  • Abstract Jobs
  • Become jobs in a batch system
  • May start, stop, fail, checkpoint, restart...
  • Logical scratch volumes
  • Become temporary containers on a scratch disk.
  • May be created, replicated, and destroyed...
  • Logical read volumes
  • Become blocks in a cooperative proxy cache.
  • May be created, cached, and evicted...
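To make this mapping concrete, here is a small Python sketch (hypothetical; the class and variable names are illustrative, not Hawk's API) that models the example workflow from the previous slides as data: jobs a-d from the DAGMan fragment, the read volume v1 backed by ftp://archive/mydata, and the scratch volumes v2 and v3 with their mounts and extractions.

    from __future__ import annotations
    from dataclasses import dataclass

    @dataclass
    class Volume:
        name: str
        source: str | None = None   # archive URL for a read volume; None for scratch

    @dataclass
    class Job:
        name: str
        submit_file: str
        mounts: dict[str, Volume]   # mount point -> volume

    # Volumes: v1 is read-only batch data; v2 and v3 are scratch (pipeline) volumes.
    v1 = Volume("v1", source="ftp://archive/mydata")
    v2 = Volume("v2")
    v3 = Volume("v3")

    # Jobs a-d with the mounts given in the workflow language above.
    a = Job("a", "a.condor", {"/data": v1, "/tmp": v2})
    b = Job("b", "b.condor", {"/data": v1, "/tmp": v3})
    c = Job("c", "c.condor", {"/tmp": v2})
    d = Job("d", "d.condor", {"/tmp": v3})

    # DAG edges (parent -> children) and the final extractions back home.
    children = {"a": ["c"], "b": ["d"]}
    extracts = {"v2": ("x", "ftp://home/out.1"),
                "v3": ("x", "ftp://home/out.2")}

At run time, v1 would be served as blocks through the cooperative proxy cache, while v2 and v3 would become temporary containers on the scratch disks of whichever nodes run their jobs.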

12
System Components
(Diagram: the home system contributes the workflow manager, the Condor matchmaker (MM), the Condor schedd, and the archive; the Condor pool and the PBS cluster contribute the compute nodes.)
13
Gliding In
(Diagram: gliding in. Hawk deploys its own agents onto the PBS cluster and Condor pool nodes by submitting them as ordinary batch jobs from the home system.)
14
System Components
(Diagram: after gliding in, the remote nodes run Condor startds and proxies; the PBS head node and the Condor pool are now managed by the home-side matchmaker, schedd, and workflow manager, with the archive at home.)
15
Cooperative Proxies
(Diagram: cooperative proxies. The proxies within each cluster cooperate, so data fetched by one node can be served to its neighbors over the local network instead of being re-fetched from the archive.)
16
System Components
(Diagram: the same system-components overview as above.)
17
Batch Execution System
(Diagram: the batch execution path, from the home-side Condor matchmaker and schedd to the execution daemons and proxies on the remote nodes.)
18
System Components
(Diagram: simplified view used in the following slides: a single execute node with its startd and proxy, plus the archive and the home-side Condor matchmaker, schedd, and workflow manager.)
19
Workflow Manager Detail
(Diagram: the workflow manager alongside the Condor matchmaker and schedd at the home system.)
20
(Diagram: the detail view now includes the remote side: startds and proxies at the execution sites, in addition to the archive, Condor matchmaker, schedd, and workflow manager.)
21
(Diagram: the proxy maintains a cooperative block input cache; the execute node and its proxy communicate over the local-area network, while the archive is reached over the wide-area network.)
22
(Diagram: a running job issues creat('/tmp/outfile') and open('/data/d15'). The new outfile lands in a container on the proxy's local storage, while the read of d15 is served through the cooperative block input cache, going over the wide-area network to the archive only on a miss.)
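The read path is where the cooperative cache earns its keep. Below is a minimal Python sketch of one plausible lookup order, assuming a check-local, then-peers, then-archive policy; the function and parameter names are hypothetical, since the slides only state that the proxies cooperate and that locality is key.

    def read_block(path, block_no, local_cache, peer_caches, fetch_from_archive):
        # local_cache and each entry of peer_caches are dicts keyed by (path, block_no);
        # fetch_from_archive is a callable that performs the wide-area transfer.
        key = (path, block_no)
        if key in local_cache:                      # hit on this proxy: cheapest
            return local_cache[key]
        for peer in peer_caches:                    # next, ask peer proxies on the LAN
            if key in peer:
                local_cache[key] = peer[key]
                return local_cache[key]
        data = fetch_from_archive(path, block_no)   # last resort: WAN fetch from the archive
        local_cache[key] = data                     # keep it for later re-use
        return data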
23
(Diagram: the job has completed; the proxy's storage now holds the job's output out65 alongside the input data d15 and d16.)
24
(Diagram: the same view; the completed job's output is pushed from the proxy back to the archive at the home system.)
25
(Diagram: with the output safely extracted, the container is deleted and the scratch space holding out65, d15, and d16 is reclaimed.)
26
Fault Detection and Repair
  • The proxy, startd, and agent detect failures
  • Job evicted by machine owner.
  • Network disconnection between job and proxy.
  • Container evicted by storage owner.
  • Out of space at proxy.
  • The workflow manager knows the consequences
  • Job D couldn't perform its I/O.
  • Check: Are volumes V1 and V3 still in place?
  • Aha: Volume V3 was lost -> Run B to create it. (This repair step is sketched below.)
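A minimal Python sketch of that repair logic (hypothetical; not Hawk's code, and the names are illustrative), using the example on this slide: job D fails, volume V1 is still in place, but scratch volume V3 was lost, so its producer B is re-run before D is retried.

    def repair_and_retry(failed_job, mounts, volumes_in_place, producer_of, run):
        # mounts: job -> volumes it needs; producer_of: scratch volume -> job that creates it;
        # run: callback that (re)submits a job to the batch system.
        for volume in mounts[failed_job]:
            if volume not in volumes_in_place:
                run(producer_of[volume])   # e.g. v3 was evicted, so run b again to recreate it
        run(failed_job)                    # then retry the failed job itself

    # The slide's example: D could not do its I/O, V3 was lost, so run B, then D.
    repair_and_retry(failed_job="d",
                     mounts={"d": ["v1", "v3"]},
                     volumes_in_place={"v1"},
                     producer_of={"v3": "b"},
                     run=print)            # stand-in for real resubmission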

27
Performance Testbed
  • Controlled remote cluster
  • 32 cluster nodes at UW.
  • Hawk submitter also at UW.
  • Connected by a restricted 800 Kb/s link.
  • Also some preliminary tests on uncontrolled systems:
  • Hawk over a PBS cluster at Los Alamos.
  • Hawk over a Condor system at INFN in Italy.

28
Batch-Pipelined Applications
Name     Stages   Load             Remote (jobs/hr)   Hawk (jobs/hr)
BLAST    1        Batch Heavy      4.67               747.40
CMS      2        Batch and Pipe   33.78              1273.96
HF       3        Pipe Heavy       40.96              3187.22
(Hawk improves throughput over remote I/O by roughly 160x for BLAST, 38x for CMS, and 78x for HF.)
29
Rollback
(Plot: behavior under a failure, a cascading failure, and the subsequent recovery.)
30
A Little Bit of Philosophy
  • Most systems are built from the bottom up
  • "This disk must have five nines, or else!"
  • MFS works from the top down
  • "If this disk fails, we know what to do."
  • By working from the top down, we finesse many of
    the hard problems in traditional filesystems.

31
Future Work
  • Integration with Stork
  • P2P Aspects: Discovery, Replication
  • Optional Knowledge: Size, Time
  • Delegation and Disconnection
  • Names, names, names
  • Hawk: A migratory file service.
  • Hawkeye: A system monitoring tool.

32
(Closing slide: jobs, data, and a question mark. Feeling overwhelmed?)