Title: Explicit Control in a Batch-Aware Distributed File System

1. Explicit Control in a Batch-Aware Distributed File System
2. Focus of work
- Harnessing, managing remote storage
- Batch-pipelined I/O intensive workloads
- Scientific workloads
- Wide-area grid computing
3. Batch-pipelined workloads
- General properties
- Large number of processes
- Process and data dependencies
- I/O intensive
- Different types of I/O
- Endpoint
- Batch
- Pipeline
4. Batch-pipelined workloads
[Figure: a batch of pipelines; endpoint data enters and leaves at the edges, while pipeline data flows between the jobs within each pipeline]
5. Wide-area grid computing
[Figure: home storage connected to remote clusters across the Internet]
6. Cluster-to-cluster (c2c)
- Not quite p2p
- More organized
- Less hostile
- More homogeneity
- Correlated failures
- Each cluster is autonomous
- Run and managed by different entities
- An obvious bottleneck is the wide-area link
[Figure: clusters connected across the Internet to home storage]
How to manage the flow of data into, within, and out of these clusters?
7. Current approaches
- Remote I/O
- Condor standard universe
- Very easy
- Consistency through serialization
- Prestaging
- Condor vanilla universe
- Manually intensive
- Good performance through knowledge
- Distributed file systems (AFS, NFS)
- Easy to use, uniform name space
- Impractical in this environment
8. Pros and cons

              Practical   Easy to use   Leverages workload info
Remote I/O    Yes         Yes           No
Pre-staging   Yes         No            Yes
Trad. DFS     No          Yes           No
9. BAD-FS
- Solution: the Batch-Aware Distributed File System
- Leverages workload info with storage control
- Detailed information about the workload is known
- Storage layer allows external control
- External scheduler makes informed storage decisions
- Combining information and control results in
- Improved performance
- More robust failure handling
- Simplified implementation

              Practical   Easy to use   Leverages workload info
BAD-FS        Yes         Yes           Yes
10. Practical and deployable
- User-level; requires no privilege
- Packaged as a modified Condor system
- A Condor system which includes BAD-FS
- General glide-in works everywhere
[Figure: BAD-FS glided in on top of SGE clusters, connected across the Internet to home storage]
11. BAD-FS Condor
[Figure: compute nodes running Condor startds with BAD-FS storage servers, a DAGMan-driven job queue, and home storage]
1) NeST storage management
2) Batch-Aware Distributed File System
3) Expanded Condor submit language
4) BAD-FS scheduler
12. BAD-FS knowledge
- Remote cluster knowledge
- Storage availability
- Failure rates
- Workload knowledge
- Data type (batch, pipeline, or endpoint)
- Data quantity
- Job dependencies
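To make this concrete, here is a minimal Python sketch of the kind of information the scheduler works from; the field names, sizes, and structures are illustrative assumptions, not the actual BAD-FS data structures.

# Hypothetical sketch of scheduler knowledge (illustrative only).
workload = {
    "volumes": {
        "B1": {"type": "batch",    "size_mb": 500},  # shared input, worth caching
        "P1": {"type": "pipeline", "size_mb": 200},  # temporary, passed job to job
        "E1": {"type": "endpoint", "size_mb": 5},    # final output, must reach home
    },
    "jobs": {
        "A": {"reads": ["B1"], "writes": ["P1"], "parents": []},
        "B": {"reads": ["P1"], "writes": ["E1"], "parents": ["A"]},
    },
}
cluster = {"storage_free_mb": 2048, "failures_per_hour": 0.01}  # remote-cluster knowledge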
13. Control through lots
- Abstraction that allows external storage control
- Guaranteed storage allocations
- Containers for job I/O
- e.g. I need 2 GB of space for at least 24 hours
- Scheduler
- Creates lots to cache input data
- Subsequent jobs can reuse this data
- Creates lots to buffer output data
- Destroys pipeline data, copies endpoint data home
- Configures workload to access lots
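As a rough illustration of how a scheduler might exercise this control, here is a short Python sketch; the lot interface (create_lot, mount, extract, destroy) and all sizes are hypothetical stand-ins for the storage-control calls described above, not the real NeST/BAD-FS API.

# Hypothetical lot-driving logic for one pipeline (illustrative only).
def run_pipeline(storage, jobs):
    # "I need 2 GB of space for at least 24 hours" for the shared batch input.
    batch_lot = storage.create_lot(size_gb=2.0, duration_hours=24)
    pipe_lot = storage.create_lot(size_gb=0.5, duration_hours=24)  # buffers pipeline data
    for job in jobs:
        job.mount(batch_lot, "/data")  # cached input, reused by subsequent jobs
        job.mount(pipe_lot, "/tmp")    # pipeline data stays inside the cluster
        job.run()
    pipe_lot.extract("out", "ftp://home/out")  # copy only the endpoint output home
    pipe_lot.destroy()                         # pipeline data is then discarded
    # batch_lot is kept so later pipelines can reuse the cached input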
14. Knowledge plus control
- Enhanced performance
- I/O scoping
- Capacity-aware scheduling
- Improved failure handling
- Cost-benefit replication
- Simplified implementation
- No cache consistency protocol
15. I/O scoping
- Technique to minimize wide-area traffic
- Allocate lots to cache batch data
- Allocate lots for pipeline and endpoint
- Extract endpoint
- Cleanup
[Figure: AMANDA workload on remote compute nodes, BAD-FS scheduler across the Internet; per pipeline: 200 MB pipeline data, 500 MB batch data, 5 MB endpoint data]
Steady state: only 5 of 705 MB traverse the wide-area.
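A back-of-the-envelope check of the AMANDA numbers above, written as a small Python snippet (the figures come straight from the slide; the comparison against plain remote I/O is the point of the figure):

# Per AMANDA pipeline in steady state: batch data is cached in a lot after the
# first fetch and pipeline data never leaves the cluster, so only the endpoint
# output crosses the wide-area link.
batch_mb, pipeline_mb, endpoint_mb = 500, 200, 5

total_io_mb = batch_mb + pipeline_mb + endpoint_mb  # 705 MB touched per pipeline
remote_io_traffic_mb = total_io_mb                  # remote I/O ships everything
scoped_traffic_mb = endpoint_mb                     # BAD-FS ships only the endpoint
print(scoped_traffic_mb, "of", total_io_mb, "MB traverse the wide-area")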
16. Capacity-aware scheduling
- Technique to avoid over-allocations
- Scheduler has knowledge of
- Storage availability
- Storage usage within the workload
- Scheduler runs as many jobs as fit
- Avoids wasted utilization
- Improves job throughput
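A minimal sketch of the idea, assuming per-job space requirements are known from the workload description; the field names and the simple first-fit policy are illustrative, not the BAD-FS scheduler's actual algorithm.

# Illustrative capacity-aware admission: start only as many jobs as the
# remote cluster's storage can hold, so lots are never over-committed.
def admit(pending_jobs, free_storage_mb):
    running = []
    for job in pending_jobs:
        need_mb = job["batch_mb"] + job["pipeline_mb"]  # space its lots require
        if need_mb <= free_storage_mb:
            free_storage_mb -= need_mb
            running.append(job)
        # jobs that do not fit wait until earlier jobs release their lots
    return running

# Example: with 1000 MB free, only two of these three 500 MB jobs start now.
print(len(admit([{"batch_mb": 400, "pipeline_mb": 100}] * 3, 1000)))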
17. Improved failure handling
- Scheduler understands data semantics
- Data is not just a collection of bytes
- Losing data is not catastrophic
- Output can be regenerated by rerunning jobs
- Cost-benefit replication (see the sketch below)
- Replicates only data whose replication cost is cheaper than the cost to rerun the job
- Can improve throughput in a lossy environment
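A compact sketch of that rule in Python; the failure-probability weighting and all parameter names are assumptions added for illustration (the slide states the plain cost comparison).

# Illustrative cost-benefit check: copy an output to stable storage only when
# copying it now is cheaper than the expected cost of regenerating it by
# rerunning the jobs that produced it.
def should_replicate(copy_cost_sec, rerun_cost_sec, failure_prob=1.0):
    expected_rerun_cost = failure_prob * rerun_cost_sec
    return copy_cost_sec < expected_rerun_cost

# Example: a 30 s copy pays off if a 600 s rerun is needed 10% of the time.
print(should_replicate(copy_cost_sec=30, rerun_cost_sec=600, failure_prob=0.1))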
18. Simplified implementation
- Data dependencies known
- Scheduler ensures proper ordering
- Build a distributed file system
- With cooperative caching
- But without a cache consistency protocol
19. Real workloads
- AMANDA
- Astrophysics: study of cosmic events such as gamma-ray bursts
- BLAST
- Biology: search for proteins within a genome
- CMS
- Physics: simulation of large particle colliders
- HF
- Chemistry: study of non-relativistic interactions between atomic nuclei and electrons
- IBIS
- Ecology: global-scale simulation of earth's climate used to study effects of human activity (e.g. global warming)
20. Real workload experience
- Setup
- 16 jobs
- 16 compute nodes
- Emulated wide-area
- Configuration
- Remote I/O
- AFS-like with /tmp
- BAD-FS
- Result is an order of magnitude improvement
21. BAD Conclusions
- Schedulers can obtain workload knowledge
- Schedulers need storage control
- Caching
- Consistency
- Replication
- Combining this control with knowledge
- Enhanced performance
- Improved failure handling
- Simplified implementation
22. For more information
- Explicit Control in a Batch-Aware Distributed File System, John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Miron Livny. NSDI '04, 2004.
- Pipeline and Batch Sharing in Grid Workloads, Douglas Thain, John Bent, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Miron Livny. HPDC-12, 2003.
- http://www.cs.wisc.edu/condor/publications.html
- Questions?
23. Why not BAD-scheduler and traditional DFS?
- Practical reasons
- Deployment
- Interoperability
- Technical reasons
- Cooperative caching
- Data sharing
- Traditional DFS
- Assumes sharing is the exception
- Provisions for arbitrary, unplanned sharing
- In batch workloads, sharing is the rule
- Sharing behavior is completely known
- Data committal
- Traditional DFS must guess when to commit
- AFS uses close, NFS uses 30 seconds
- Batch workloads precisely define when
24. Is capacity awareness important in the real world?
- Heterogeneity of remote resources
- Shared disk
- Workloads are changing: some are very, very large and still growing
25. User burden
- Additional info needed in declarative lang.
- User probably already knows this info
- Or can readily obtain it
- Typically, this info already exists
- Scattered across a collection of scripts, Makefiles, etc.
- BAD-FS improves the current situation by collecting this info into one central location
26. In the wild
27. Capacity-aware scheduling evaluation
- Workload
- 64 synthetic pipelines
- Varied pipe size
- Environment
- 16 compute nodes
- Configuration
- Breadth-first
- Depth-first
- BAD-FS
Failures directly correlate to workload throughput.
28. I/O scoping evaluation
- Workload
- 64 synthetic pipelines
- 100 MB of I/O each
- Varied data mix
- Environment
- 32 compute nodes
- Emulated wide-area
- Configuration
- Remote I/O
- Cache volumes
- Scratch volumes
- BAD-FS
Wide-area traffic directly correlates to workload throughput.
29. Cost-benefit replication evaluation
- Workload
- Synthetic pipelines of depth 3
- Runtime 60 seconds
- Environment
- Artificially injected failures
- Configuration
- Always-copy
- Never-copy
- BAD-FS
Trade off overhead in an environment without failures to gain throughput in an environment with failures.
30. Real workloads
- Workload
- Real workloads
- 64 pipelines
- Environment
- 16 compute nodes
- Emulated wide-area
- Cold and warm
- First 16 are cold
- Subsequent 48 warm
- Configuration
- Remote I/O
- AFS-like
- BAD-FS
31. Example workflow language: Condor DAGMan
- Keyword job names a file with execute instructions
- Keywords parent, child express relations
- No declaration of data

job A instructions.A
job B instructions.B
job C instructions.C
job D instructions.D
parent A child B
parent C child D
32. Adding data primitives to a workflow language
- New keywords for container operations
- volume: create a container
- scratch: specify container type
- mount: how the app addresses the container
- extract: the desired endpoint output
- User must provide complete, exact I/O information to the scheduler
- Specify which procs use which data
- Specify size of data read and written
33. Extended workflow language

job A instructions.A
job B instructions.B
job C instructions.C
job D instructions.D
parent A child B
parent C child D
volume B1 ftp://home/data 1 GB
volume P1 scratch 500 MB
volume P2 scratch 500 MB
A mount B1 /data
C mount B1 /data
A mount P1 /tmp
B mount P1 /tmp
C mount P2 /tmp
D mount P2 /tmp
extract P1/out ftp://home/out.1
extract P2/out ftp://home/out.2