1
Explicit Control in a Batch-Aware Distributed File System
2
Focus of work
  • Harnessing and managing remote storage
  • Batch-pipelined, I/O-intensive workloads
  • Scientific workloads
  • Wide-area grid computing

3
Batch-pipelined workloads
  • General properties
  • Large number of processes
  • Process and data dependencies
  • I/O intensive
  • Different types of I/O
  • Endpoint
  • Batch
  • Pipeline

4
Batch-pipelined workloads
[Diagram: a batch-pipelined workload; chains of pipeline data link the jobs within each pipeline, with endpoint data entering at the top and leaving at the bottom.]
5
Wide-area grid computing
[Diagram: remote compute sites connected to home storage across the Internet.]
6
Cluster-to-cluster (c2c)
  • Not quite p2p
  • More organized
  • Less hostile
  • More homogeneity
  • Correlated failures
  • Each cluster is autonomous
  • Run and managed by different entities
  • An obvious bottleneck is the wide-area link

[Diagram: multiple autonomous clusters and the home store, connected by the Internet.]
How should the flow of data into, within, and out of these clusters be managed?
7
Current approaches
  • Remote I/O
  • Condor standard universe
  • Very easy
  • Consistency through serialization
  • Prestaging
  • Condor vanilla universe
  • Manually intensive
  • Good performance through knowledge
  • Distributed file systems (AFS, NFS)
  • Easy to use, uniform name space
  • Impractical in this environment

8
Pros and cons
                Practical   Easy to use   Leverages workload info
  Remote I/O    Yes         Yes           No
  Pre-staging   Yes         No            Yes
  Trad. DFS     No          Yes           No
9
BAD-FS
  • Solution: the Batch-Aware Distributed File System
  • Leverages workload info with storage control
  • Detailed information about the workload is known
  • Storage layer allows external control
  • External scheduler makes informed storage
    decisions
  • Combining information and control results in
  • Improved performance
  • More robust failure handling
  • Simplified implementation

                Practical   Easy to use   Leverages workload info
  BAD-FS        Yes         Yes           Yes
10
Practical and deployable
  • User-level: requires no privilege
  • Packaged as a modified Condor system
  • A Condor system which includes BAD-FS
  • General glide-in: works everywhere (e.g. on top of SGE clusters)

[Diagram: BAD-FS glided in on top of SGE clusters, all reaching the home store across the Internet.]
11
BAD-FS and Condor
[Diagram: each compute node runs a Condor startd alongside BAD-FS storage; a job queue and Condor DAGMan at the home storage site drive the workload.]
  1) NeST storage management
  2) Batch-Aware Distributed File System
  3) Expanded Condor submit language
  4) BAD-FS scheduler
12
BAD-FS knowledge
  • Remote cluster knowledge
  • Storage availability
  • Failure rates
  • Workload knowledge
  • Data type (batch, pipeline, or endpoint)
  • Data quantity
  • Job dependencies
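
As a rough illustration of what this knowledge amounts to (plain Python, illustrative only; not the actual BAD-FS data structures, and the names jobA, B1, P1 follow the extended workflow example later in the deck):

  # Per-job and per-cluster knowledge available to the scheduler.
  workload_info = {
      "jobA": {"parents": [],                      # job dependencies
               "reads":  {"B1": 1 * 1024**3},      # batch input, 1 GB
               "writes": {"P1": 500 * 1024**2}},   # pipeline output, 500 MB
      "jobB": {"parents": ["jobA"],
               "reads":  {"P1": 500 * 1024**2},
               "writes": {"out.1": 5 * 1024**2}},  # endpoint output, 5 MB
  }
  cluster_info = {"free_bytes": 4 * 1024**3,       # storage availability
                  "failure_rate_per_hour": 0.05}   # assumed failure metric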

13
Control through lots
  • Abstraction that allows external storage control
  • Guaranteed storage allocations
  • Containers for job I/O
  • e.g. "I need 2 GB of space for at least 24 hours" (see the sketch below)
  • Scheduler
  • Creates lots to cache input data
  • Subsequent jobs can reuse this data
  • Creates lots to buffer output data
  • Destroys pipeline data; copies endpoint data home
  • Configures workload to access lots
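
A minimal sketch of what such a guarantee looks like to the scheduler (the request shape below is hypothetical, not the real NeST/BAD-FS interface):

  from dataclasses import dataclass

  # Hypothetical lot request: guaranteed space for a bounded time,
  # e.g. "I need 2 GB of space for at least 24 hours".
  @dataclass
  class LotRequest:
      size_bytes: int
      min_lifetime_seconds: int
      purpose: str        # "cache" for batch data, "scratch" for pipeline/endpoint

  req = LotRequest(size_bytes=2 * 1024**3,
                   min_lifetime_seconds=24 * 3600,
                   purpose="cache")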

14
Knowledge plus control
  • Enhanced performance
  • I/O scoping
  • Capacity-aware scheduling
  • Improved failure handling
  • Cost-benefit replication
  • Simplified implementation
  • No cache consistency protocol

15
I/O scoping
  • Technique to minimize wide-area traffic
  • Allocate lots to cache batch data
  • Allocate scratch lots for pipeline and endpoint data
  • Extract endpoint data to home storage
  • Cleanup

[Diagram: compute nodes at a remote cluster access lots through BAD-FS; only endpoint data crosses the Internet to the BAD-FS scheduler and home storage.]
Example: AMANDA touches 200 MB of pipeline data, 500 MB of batch data, and 5 MB of endpoint data. In steady state, only 5 of 705 MB traverse the wide-area.
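
The arithmetic behind those numbers, as a minimal sketch (assuming that without scoping every byte crosses the wide-area, while with scoping only endpoint data does):

  # I/O scoping arithmetic for the AMANDA example (MB).
  pipeline_mb, batch_mb, endpoint_mb = 200, 500, 5

  unscoped = pipeline_mb + batch_mb + endpoint_mb   # 705 MB over the wide-area
  scoped = endpoint_mb                              # 5 MB: batch and pipeline stay in lots

  print(f"{scoped} of {unscoped} MB traverse the wide-area")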
16
Capacity-aware scheduling
  • Technique to avoid over-allocations
  • Scheduler has knowledge of
  • Storage availability
  • Storage usage within the workload
  • Scheduler runs as many jobs as fit (see the sketch below)
  • Avoids wasted utilization
  • Improves job throughput
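
A minimal sketch of the idea (greedy admission against a storage budget; illustrative only, not the actual BAD-FS policy):

  # Admit jobs only while their declared storage footprint fits in the
  # space still available at the remote cluster; the rest wait.
  def admit(jobs, free_bytes):
      running = []
      for name, footprint in jobs:
          if footprint <= free_bytes:
              free_bytes -= footprint
              running.append(name)
      return running

  # With 1 GB free, only one 700 MB pipeline is admitted at a time.
  admit([("pipe1", 700 * 1024**2), ("pipe2", 700 * 1024**2)], 1 * 1024**3)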

17
Improved failure handling
  • Scheduler understands data semantics
  • Data is not just a collection of bytes
  • Losing data is not catastrophic
  • Output can be regenerated by rerunning jobs
  • Cost-benefit replication (sketched below)
  • Replicates data only when copying it is cheaper
    than rerunning the job that produced it
  • Can improve throughput in lossy environment
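
A sketch of the cost-benefit test (an assumed form that weights the rerun cost by the failure rate the scheduler already tracks; not the exact BAD-FS formula):

  # Replicate an output only if copying it home is expected to be cheaper
  # than regenerating it by rerunning the job after a failure.
  def should_replicate(copy_seconds, rerun_seconds, failure_probability):
      expected_rerun_cost = failure_probability * rerun_seconds
      return copy_seconds < expected_rerun_cost

  should_replicate(copy_seconds=10, rerun_seconds=60, failure_probability=0.3)  # True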

18
Simplified implementation
  • Data dependencies known
  • Scheduler ensures proper ordering
  • Build a distributed file system
  • With cooperative caching
  • But without a cache consistency protocol
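
A minimal sketch of why no consistency protocol is needed (illustrative): because the dependency DAG is known, the scheduler simply runs jobs in topological order, so a consumer never starts before its producer has finished writing.

  from graphlib import TopologicalSorter

  # parent A child B, parent C child D (as in the DAGMan example later in the deck)
  deps = {"B": {"A"}, "D": {"C"}}
  order = list(TopologicalSorter(deps).static_order())  # e.g. ['A', 'C', 'B', 'D']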

19
Real workloads
  • AMANDA
  • Astrophysics: study of cosmic events such as
    gamma-ray bursts
  • BLAST
  • Biology: search for proteins within a genome
  • CMS
  • Physics: simulation of large particle colliders
  • HF
  • Chemistry: study of non-relativistic interactions
    between atomic nuclei and electrons
  • IBIS
  • Ecology: global-scale simulation of the Earth's
    climate, used to study effects of human activity
    (e.g. global warming)

20
Real workload experience
  • Setup
  • 16 jobs
  • 16 compute nodes
  • Emulated wide-area
  • Configuration
  • Remote I/O
  • AFS-like with /tmp
  • BAD-FS
  • Result: an order-of-magnitude improvement

21
BAD Conclusions
  • Schedulers can obtain workload knowledge
  • Schedulers need storage control
  • Caching
  • Consistency
  • Replication
  • Combining this control with knowledge
  • Enhanced performance
  • Improved failure handling
  • Simplified implementation

22
For more information
  • Explicit Control in a Batch-Aware Distributed File System,
    John Bent, Douglas Thain, Andrea Arpaci-Dusseau,
    Remzi Arpaci-Dusseau, Miron Livny. NSDI '04, 2004.
  • Pipeline and Batch Sharing in Grid Workloads,
    Douglas Thain, John Bent, Andrea Arpaci-Dusseau,
    Remzi Arpaci-Dusseau, Miron Livny. HPDC-12, 2003.
  • http://www.cs.wisc.edu/condor/publications.html
  • Questions?
23
Why not BAD-scheduler and traditional DFS?
  • Practical reasons
  • Deployment
  • Interoperability
  • Technical reasons
  • Cooperative caching
  • Data sharing
  • Traditional DFSs
  • assume sharing is the exception
  • provision for arbitrary, unplanned sharing
  • In batch workloads, sharing is the rule
  • Sharing behavior is completely known
  • Data committal
  • Traditional DFSs must guess when to commit
  • AFS commits on close; NFS uses a 30-second window
  • Batch workloads define precisely when to commit

24
Is capacity awareness important in the real world?
  1. Heterogeneity of remote resources
  2. Shared disk
  3. Workloads are changing: some are very, very large
     and still growing.

25
User burden
  • Additional info is needed in the declarative language
  • The user probably already knows this info
  • Or can readily obtain it
  • Typically, this info already exists
  • Scattered across a collection of scripts,
    Makefiles, etc.
  • BAD-FS improves the current situation by collecting
    this info in one central location

26
In the wild
27
Capacity-aware scheduling evaluation
  • Workload
  • 64 synthetic pipelines
  • Varied pipe size
  • Environment
  • 16 compute nodes
  • Configuration
  • Breadth-first
  • Depth-first
  • BAD-FS

Failures directly correlate with workload throughput.
28
I/O scoping evaluation
  • Workload
  • 64 synthetic pipelines
  • 100 MB of I/O each
  • Varied data mix
  • Environment
  • 32 compute nodes
  • Emulated wide-area
  • Configuration
  • Remote I/O
  • Cache volumes
  • Scratch volumes
  • BAD-FS

Wide-area traffic directly correlates with workload throughput.
29
Cost-benefit replication evaluation
  • Workload
  • Synthetic pipelines of depth 3
  • Runtime 60 seconds
  • Environment
  • Artificially injected failures
  • Configuration
  • Always-copy
  • Never-copy
  • BAD-FS

Trade off overhead in an environment without failures
to gain throughput in an environment with failures.
30
Real workloads
  • Workload
  • Real workloads
  • 64 pipelines
  • Environment
  • 16 compute nodes
  • Emulated wide-area
  • Cold and warm
  • First 16 are cold
  • Subsequent 48 warm
  • Configuration
  • Remote I/O
  • AFS-like
  • BAD-FS

31
Example workflow language: Condor DAGMan
  • The job keyword names a file with execute instructions
  • The parent and child keywords express dependencies
  • No declaration of data

job A instructions.A
job B instructions.B
job C instructions.C
job D instructions.D
parent A child B
parent C child D
32
Adding data primitives to a workflow language
  • New keywords for container operations
  • volume: create a container
  • scratch: specify container type
  • mount: how the app addresses the container
  • extract: the desired endpoint output
  • User must provide complete, exact I/O information
    to the scheduler
  • Specify which procs use which data
  • Specify size of data read and written

33
Extended workflow language
job A instructions.A
job B instructions.B
job C instructions.C
job D instructions.D
parent A child B
parent C child D
volume B1 ftp://home/data 1 GB
volume P1 scratch 500 MB
volume P2 scratch 500 MB
A mount B1 /data
C mount B1 /data
A mount P1 /tmp
B mount P1 /tmp
C mount P2 /tmp
D mount P2 /tmp
extract P1/out ftp://home/out.1
extract P2/out ftp://home/out.2