Science in the Clouds: History, Challenges, and Opportunities presentation

1
Science in the Clouds: History, Challenges, and
Opportunities

Douglas Thain
University of Notre Dame
GeoClouds Workshop
17 September 2009

2
http://www.cse.nd.edu/ccl
3
The Cooperative Computing Lab

We collaborate with people who have large scale
computing problems.
We build new software and systems to help them
achieve meaningful goals.
We run a production computing system used by
people at ND and elsewhere.
We conduct computer science research, informed by
real world experience, with an impact upon
problems that matter.

4
Clouds in the Hype Cycle
Gartner Hype Cycle Report, 2009
5
What is cloud computing?

A cloud provides rapid, metered access to a
virtually unlimited set of resources.
Two significant impacts on users:
End users must have an economic model for the
work that they want to accomplish.
Apps must be flexible enough to work with an
arbitrary number and kind of resources.

6
Example: Amazon EC2, Sep 2009 (simplified slightly
for discussion)

Small 1 core, 1.7GB RAM, 160GB disk
10 cents/hour
Large 2 cores, 7.5GB RAM, 850GB disk
40 cents/hour
Extra Large 4 cores, 15 GB, 1690GB disk
80 cents/hour
And the Simple Storage Service
15 cents per GB-month stored
17 cents per GB transferred (outside of EC2)
1 cent per 1000 write operations
1 cent per 10000 read operations
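
The price list above is enough to start a simple economic model. The sketch below is one way to do that; the prices are copied from the slide, while the function names and the example workload (100 small instances for a day, 50 GB stored, 10 GB downloaded) are hypothetical and chosen only for illustration.

```python
# Prices in cents, copied from the slide above (Sep 2009, simplified).
INSTANCE_CENTS_PER_HOUR = {"small": 10, "large": 40, "xlarge": 80}
S3_CENTS_PER_GB_MONTH = 15
S3_CENTS_PER_GB_TRANSFERRED = 17


def compute_cost_cents(instance_type, instances, hours):
    """Cost of renting `instances` machines of one type for `hours` each."""
    return INSTANCE_CENTS_PER_HOUR[instance_type] * instances * hours


def storage_cost_cents(gb_stored, months, gb_transferred_out):
    """Cost of keeping data in S3 and moving part of it outside EC2."""
    return (S3_CENTS_PER_GB_MONTH * gb_stored * months
            + S3_CENTS_PER_GB_TRANSFERRED * gb_transferred_out)


# Hypothetical job: 100 small instances for 24 hours,
# 50 GB kept for one month, 10 GB of results downloaded.
total_cents = compute_cost_cents("small", 100, 24) + storage_cost_cents(50, 1, 10)
print(f"${total_cents / 100:.2f}")  # prints $249.20
```

Working in integer cents avoids rounding surprises; request charges (reads and writes) are left out of this sketch.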

7
Is Cloud Computing New?

Not entirely, but a combination of the old ideas
of utility computing and distributed computing.
1960 MULTICS
1980 The Cambridge Ring
1987 Condor Distributed Batch System
1989 SETI@home
1990s Clusters, Beowulf, MPI, NOW
1995 Globus, Grid Computing
2001 TeraGrid
2004 Sun Rents CPUs at $1/hour
2006 Amazon EC2 and S3

8
Clouds Trade CapEx for OpEx
(Chart: cost versus time, comparing the capital expense of ownership with the OpEx of ownership.)
9
What about grid computing?

A vision much like clouds
A worldwide framework that would make massive
scale computing as easy to use as an electrical
socket.
The more modest realization
A means for accessing remote computing facilities
in their native form, usually for CPU-intensive
tasks.
The social context
Large collaborative efforts between computer
scientists and computer-savvy fields,
particularly physics and astronomy.

10
Clouds vs Grids

Grids provide a job execution interface
Run program P on input A, return the output.
Allows the system to maximize utilization and
hide failures, but provides few performance
guarantees and inaccurate metering.
Clouds provide resource allocation
Create a VM with 2GB of RAM for 7 days.
Gives predictable performance and accurate
metering, but exposes problems to the user.
Can be used to build interactive services.
How do I run 1M jobs on 100 servers?

11
Grid Computing Layer Provides Job Execution
Cloud Computing Layer Provides Resource Allocation
12
Create a Condor Pool with 100 Nodes
Allocate 100 Cores
Run 1M Jobs
13
Clouds Solve Some Grid Problems

Application compatibility is simplified.
You provide a VM for Linux 2.3.4.1.2.
Performance is reasonably predictable.
10% variations rather than orders of magnitude.
Fewer administrative headaches for the lone user.
A credit card swipe instead of a certificate.

14
But, Problems New and Old

How do I reliably execute 1M jobs?
Can I share resources and data with others in the
cloud?
How do I authenticate others in the cloud?
Unfortunately, location still matters.
Can we make applications efficiently span
multiple cloud providers?
Can we join existing centers with clouds?
(These are all problems contemplated by grid.)

15
More Open Questions

Can I afford to move my data in to the cloud?
Can I afford to get it out?
Do I trust the cloud to secure my data?
How do I go about constructing an economic model
for my research?
Are there social/technical dangers in putting too
many eggs in one basket?
Is pay-as-you-go the proper model for research?
Should universities get out of the data center
business?

16
Clusters, clouds, and grids give us access to
unlimited CPUs.
How do we write programs that can run effectively
in large systems?
17
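MapReduce( S, M, R )
(Diagram: the map function M is applied to each item of set S, emitting (K,V) pairs; the values V are grouped under Key0 ... KeyN, and the reduce function R turns each group into an output O0, O1, O2.)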
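
As a rough illustration of the MapReduce( S, M, R ) pattern, here is a minimal sequential sketch. It is not any particular framework's API: the function names and the word-count example are assumptions, and a real system would run M and R in parallel across many machines.

```python
from collections import defaultdict


def map_reduce(S, M, R):
    """Apply M to every item of S, group the emitted values by key,
    then reduce each group with R."""
    groups = defaultdict(list)
    for item in S:
        for key, value in M(item):
            groups[key].append(value)
    return {key: R(key, values) for key, values in groups.items()}


def word_map(line):
    """Hypothetical map function: emit (word, 1) for each word."""
    return [(word, 1) for word in line.split()]


def count_reduce(key, values):
    """Hypothetical reduce function: sum the counts for one word."""
    return sum(values)


counts = map_reduce(["a b a", "b c"], word_map, count_reduce)
print(counts)  # prints {'a': 2, 'b': 2, 'c': 1}
```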
18
Of course, not all science fits into the
Map-Reduce model!
19
Example: Biometrics Research

Goal: Design a robust face comparison function.

20
Similarity Matrix Construction

1.0 0.8 0.1 0.0 0.0 0.1
    1.0 0.0 0.1 0.1 0.0
        1.0 0.0 0.1 0.3
            1.0 0.0 0.0
                1.0 0.1
                    1.0
Challenge Workload: 60,000 iris images, 1MB each, .02s per F, 833 CPU-days, 600 TB of I/O
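
The 833 CPU-day figure can be checked with a few lines of arithmetic. This sketch assumes every image is compared against every image (the full 60,000 x 60,000 matrix rather than only the upper triangle), which is the reading that reproduces the number on the slide; the function name is hypothetical. The 600 TB I/O figure is not derived here.

```python
SECONDS_PER_DAY = 86_400


def all_pairs_cpu_days(num_items, seconds_per_call):
    """CPU-days needed to call F once for every ordered pair of items."""
    calls = num_items * num_items
    return calls * seconds_per_call / SECONDS_PER_DAY


print(round(all_pairs_cpu_days(60_000, 0.02)))  # prints 833
```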
21
Now What?
22
(No Transcript)
23
(No Transcript)
24
(No Transcript)
25
Non-Expert User Using 500 CPUs
26
Observation

In a given field of study, many people repeat the
same pattern of work many times, making slight
changes to the data and algorithms.
If the system knows the overall pattern in
advance, then it can do a better job of executing
it reliably and efficiently.
If the user knows in advance what patterns are
allowed, then they have a better idea of how to
construct their workloads.

27
Abstractions for Distributed Computing

Abstraction: a declarative specification of the
computation and data of a workload.
A restricted pattern, not meant to be a general
purpose programming language.
Uses data structures instead of files.
Provide users with a bright path.
Regular structure makes it tractable to model and
predict performance.

28
Working with Abstractions
(Diagram: the inputs A1 ... An, B1 ... Bn, and F are described by the compact data structure AllPairs( A, B, F ), which a custom workflow engine runs on a cloud or grid.)
29
All-Pairs Abstraction

AllPairs( set A, set B, function F )
returns matrix M where
M[i,j] = F( A[i], B[j] ) for all i,j

(Diagram: AllPairs(A,B,F) applies F to every combination of A1 ... An and B1 ... Bn, filling one matrix cell per pair.)
Invoked as: allpairs A B F.exe
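
The definition of AllPairs can be written directly as a sequential sketch. This shows only what the abstraction computes, not how the production engine distributes the work; the function name and the toy comparison function are assumptions.

```python
def all_pairs(A, B, F):
    """Return matrix M with M[i][j] = F(A[i], B[j]) for all i, j."""
    return [[F(a, b) for b in B] for a in A]


def distance(x, y):
    """Toy comparison function standing in for a real F."""
    return abs(x - y)


M = all_pairs([1, 2], [1, 3], distance)
print(M)  # prints [[0, 2], [1, 1]]
```

Because every cell is independent, a workflow engine is free to group cells into jobs of any size, which is what makes the blocking and resource choices on the next slide possible.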
30
How Does the Abstraction Help?

The custom workflow engine:
Chooses the right data transfer strategy.
Chooses the right number of resources.
Chooses blocking of functions into jobs.
Recovers from a larger number of failures.
Predicts overall runtime accurately.
All of these tasks are nearly impossible for
arbitrary workloads, but are tractable (not
trivial) to solve for a specific abstraction.

31
(No Transcript)
32
Choose the Right # of CPUs
33
Resources Consumed
34
All-Pairs in Production

Our All-Pairs implementation has provided over
57 CPU-years of computation to the ND biometrics
research group over the last year.
Largest run so far: 58,396 irises from the Face
Recognition Grand Challenge. The largest
experiment ever run on publicly available data.
Competing biometric research relies on samples of
100-1000 images, which can miss important
population effects.
Reduced computation time from 833 days to 10
days, making it feasible to repeat multiple times
for a graduate thesis. (We can go faster yet.)

35
(No Transcript)
36
Are there other abstractions?
37
Wavefront( matrix M, function F(x,y,d) ) returns
matrix M such that
M[i,j] = F( M[i-1,j], M[i,j-1], M[i-1,j-1] )
(Diagram: Wavefront(M,F) fills matrix M by applying F cell by cell, starting from the given boundary.)
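
A minimal sequential sketch of the Wavefront recurrence follows. It assumes row 0 and column 0 of M are supplied as boundary values and fills in the rest; the function name and the toy F are hypothetical, and a distributed engine would instead run the independent cells of each anti-diagonal in parallel.

```python
def wavefront(M, F):
    """Fill M[i][j] = F(M[i-1][j], M[i][j-1], M[i-1][j-1]) for i, j >= 1.
    Row 0 and column 0 must already hold the boundary values."""
    for i in range(1, len(M)):
        for j in range(1, len(M[i])):
            M[i][j] = F(M[i - 1][j], M[i][j - 1], M[i - 1][j - 1])
    return M


def add_three(x, y, d):
    """Toy F: sum of the two adjacent neighbors and the diagonal one."""
    return x + y + d


grid = [[1, 1, 1],
        [1, 0, 0],
        [1, 0, 0]]
result = wavefront(grid, add_three)
print(result[2][2])  # prints 13
```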
38
Applications of Wavefront

Bioinformatics
Compute the alignment of two large DNA strings in
order to find similarities between species.
Existing tools do not scale up to complete DNA
strings.
Economics
Simulate the interaction between two competing
firms, each of which has an effect on resource
consumption and market price. E.g. When will we
run out of oil?
Applies to any kind of optimization problem
solvable with dynamic programming.

39
Problem: Dispatch Latency

Even with an infinite number of CPUs, dispatch
latency controls the total execution time: O(n)
in the best case.
However, job dispatch latency in an unloaded grid
is about 30 seconds, which may outweigh the
runtime of F.
Things get worse when queues are long!
Solution: Build a lightweight task dispatch
system. (Idea from Falkon@UC)
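
To see why latency dominates, consider the critical path of an n x n wavefront. The sketch below assumes that all cells on one anti-diagonal are dispatched together and that each anti-diagonal must wait for the previous one, giving 2n-1 sequential steps. The 30-second latency comes from the slide; the 1-second runtime of F and the 0.1-second lightweight dispatch latency are hypothetical values for illustration.

```python
def wavefront_lower_bound_seconds(n, dispatch_latency, runtime_of_f):
    """Best-case wall-clock time for an n x n wavefront with unlimited CPUs:
    2n-1 anti-diagonals, each costing one dispatch plus one run of F."""
    return (2 * n - 1) * (dispatch_latency + runtime_of_f)


grid_seconds = wavefront_lower_bound_seconds(500, 30, 1)
light_seconds = wavefront_lower_bound_seconds(500, 0.1, 1)
print(grid_seconds)  # prints 30969
print(f"{grid_seconds / 3600:.1f} hours vs {light_seconds / 60:.1f} minutes")
```

Under these assumptions almost all of the best-case time on the grid is spent waiting for dispatch, not computing F.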

40
1000s of workers dispatched to the cloud
(Diagram: the wavefront engine queues tasks to the work queue and is notified as tasks are done; the work queue drives many workers. For each task, a worker receives F and in.txt and returns out.txt:)
put F.exe
put in.txt
exec F.exe <in.txt >out.txt
get out.txt
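
The master/worker structure in the diagram can be imitated on a single machine with threads. This is only a sketch of the pattern, not the lab's work queue implementation or its API: the function name, the in-memory queues, and the use of threads in place of remote workers are all assumptions.

```python
import queue
import threading


def run_work_queue(tasks, F, num_workers=4):
    """Hand each (task_id, input) to whichever worker is free and
    collect {task_id: F(input)} as tasks finish."""
    todo = queue.Queue()
    done = queue.Queue()
    for task_id, data in tasks.items():
        todo.put((task_id, data))

    def worker():
        # Each worker pulls tasks until the queue is drained.
        while True:
            try:
                task_id, data = todo.get_nowait()
            except queue.Empty:
                return
            done.put((task_id, F(data)))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    results = {}
    while not done.empty():
        task_id, output = done.get()
        results[task_id] = output
    return results


squares = run_work_queue({1: 2, 2: 3, 3: 4}, lambda x: x * x)
```

Because workers pull tasks rather than having them pushed through a batch scheduler, the per-task overhead is a queue operation instead of a full job submission.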
41
Problem: Performance Variation

Tasks can be delayed for many reasons
Heterogeneous hardware.
Interference with disk/network.
Policy based suspension.
Any delayed task in Wavefront has a cascading
effect on the rest of the workload.
Solution - Fast Abort: Keep statistics on task
runtimes, and abort those that lie significantly
outside the mean. Prefer to assign jobs to
machines with a fast history.
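
One possible reading of Fast Abort is sketched below. The slide does not say what "significantly outside the mean" means, so the cutoff of two standard deviations above the mean is an assumption, as are the function and parameter names.

```python
from statistics import mean, stdev


def tasks_to_abort(completed_runtimes, running_elapsed, threshold_stdevs=2.0):
    """Return ids of running tasks whose elapsed time is far above the
    mean runtime of completed tasks (assumed cutoff: mean + k * stdev)."""
    if len(completed_runtimes) < 2:
        return []  # not enough history to judge
    limit = mean(completed_runtimes) + threshold_stdevs * stdev(completed_runtimes)
    return [task for task, elapsed in running_elapsed.items() if elapsed > limit]


history = [10, 12, 11, 9, 10]       # seconds, hypothetical
running = {"t1": 5, "t2": 40}       # seconds elapsed so far, hypothetical
print(tasks_to_abort(history, running))  # prints ['t2']
```

An aborted task would then be resubmitted, ideally to a machine with a faster history.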

42
500x500 Wavefront on 200 CPUs
43
Wavefront on a 200-CPU Cluster
44
Wavefront on a 32-Core CPU
45
The Genome Assembly Problem
AGTCGATCGATCGATAATCGATCCTAGCTAGCTACGA
AGTCGATCGATCGAT
TCGATAATCGATCCTAGCTA
AGCTAGCTACGA
46
Sample Genomes
                       Reads   Data    Pairs   Sequential Time
A. gambiae scaffold    101K    80MB    738K    12 hours
A. gambiae complete    180K    1.4GB   12M     6 days
S. bicolor simulated   7.9M    5.7GB   84M     30 days
47
Some-Pairs Abstraction

SomePairs( set A, list (i,j), function F(x,y) )
returns
list of F( A[i], A[j] ), one for each (i,j) in the list

(Diagram: SomePairs(A,L,F) applies F only to the listed pairs of A1 ... An, here (1,2) (2,1) (2,3) (3,3), instead of filling the whole matrix.)
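
A sequential sketch of SomePairs, using the 1-based pair list from the slide. The function name and the toy F (string concatenation standing in for an alignment function) are assumptions for illustration.

```python
def some_pairs(A, pairs, F):
    """Return [F(A[i], A[j])] for each 1-based (i, j) in `pairs`."""
    return [F(A[i - 1], A[j - 1]) for i, j in pairs]


def join(x, y):
    """Toy F standing in for a real sequence-alignment function."""
    return x + y


out = some_pairs(["a", "b", "c"], [(1, 2), (2, 1), (2, 3), (3, 3)], join)
print(out)  # prints ['ab', 'ba', 'bc', 'cc']
```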
48
Distributed Genome Assembly
100s of workers dispatched to Notre Dame, Purdue,
and Wisconsin
(Diagram: the somepairs master holds A1 ... An, F, and the pair list (1,2) (2,1) (2,3) (3,3); it queues tasks to the work queue and is notified as tasks are done. Detail of a single worker:)
put align.exe
put in.txt
exec F.exe <in.txt >out.txt
get out.txt
49
Small Genome (101K reads)
50
Medium Genome (180K reads)
51
Large Genome (7.9M reads)
52
What's the Upshot?

We can do full-scale assemblies as a routine
matter on existing conventional machines.
Our solution is faster (wall-clock time) than the
next fastest assembler run on 1024x BG/L.
You could almost certainly do better with a
dedicated cluster and a fast interconnect, but
such systems are not universally available.
Our solution opens up research in assembly to
labs with NASCAR instead of Formula-One
hardware.

53
What if your application doesn't fit a regular
pattern?
54
Makeflow
part1 part2 part3: input.data split.py
    ./split.py input.data

out1: part1 mysim.exe
    ./mysim.exe part1 >out1

out2: part2 mysim.exe
    ./mysim.exe part2 >out2

out3: part3 mysim.exe
    ./mysim.exe part3 >out3

result: out1 out2 out3 join.py
    ./join.py out1 out2 out3 > result
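
The rules above form a dependency graph, and an engine can run any rule whose inputs already exist. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9 or later); it is not how Makeflow itself is implemented, and the `RULES` table and `parallel_waves` function are hypothetical names. Commands are omitted because only the dependencies matter here.

```python
from graphlib import TopologicalSorter

# (targets, sources) pairs transcribed from the rules above.
RULES = [
    (["part1", "part2", "part3"], ["input.data", "split.py"]),
    (["out1"], ["part1", "mysim.exe"]),
    (["out2"], ["part2", "mysim.exe"]),
    (["out3"], ["part3", "mysim.exe"]),
    (["result"], ["out1", "out2", "out3", "join.py"]),
]


def parallel_waves(rules):
    """Group targets into waves; everything in one wave depends only on
    earlier waves, so its rules could run concurrently."""
    graph = {t: set(sources) for targets, sources in rules for t in targets}
    produced = set(graph)  # files that some rule creates
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorter.get_ready()
        built = sorted(name for name in ready if name in produced)
        if built:  # skip the wave of pre-existing input files
            waves.append(built)
        sorter.done(*ready)
    return waves


waves = parallel_waves(RULES)
```

For this workload the three `mysim.exe` runs land in the same wave, which is the parallelism a distributed engine can exploit without the user describing the distributed system at all.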
55
Makeflow Implementation
100s of workers dispatched to the cloud
(Diagram: the makeflow master queues tasks to the work queue and is notified as tasks are done; the work queue drives many workers. Example rule:)
bfile: afile prog
    prog afile >bfile
Detail of a single worker:
put prog
put afile
exec prog afile > bfile
get bfile
Two optimizations: Cache inputs and outputs.
Dispatch tasks to nodes with data.
56
Experience with Makeflow

Still in initial deployment, so no big results to
show just yet.
Easy to test and debug on a desktop machine or a
multicore server.
The workload says nothing about the distributed
system. (This is good.)
Graduate students in bioinformatics running codes
at production speeds on hundreds of nodes in less
than a week.

57
Abstractions as a Social Tool

Collaboration with outside groups is how we
encounter the most interesting, challenging, and
important problems in computer science.
However, often neither side understands which
details are essential or non-essential:
Can you deal with files that have upper case
letters?
Oh, by the way, we have 10TB of input, is that
ok?
(A little bit of an exaggeration.)
An abstraction is an excellent chalkboard tool
Accessible to anyone with a little bit of
mathematics.
Makes it easy to see what must be plugged in.
Forces out essential details: data size,
execution time.

58
Conclusion

Grids, clouds, and clusters provide enormous
computing power, but are very challenging to use
effectively.
An abstraction provides a robust, scalable
solution to a narrow category of problems; each
requires different kinds of optimizations.
Limiting expressive power results in systems
that are usable, predictable, and reliable.
Is there a menu of abstractions that would
satisfy many consumers of clouds?

59
Acknowledgments

Cooperative Computing Lab
http://www.cse.nd.edu/ccl

Faculty
Patrick Flynn
Nitesh Chawla
Kenneth Judd
Scott Emrich

Grad Students
Chris Moretti
Hoang Bui
Li Yu
Mike Olson
Michael Albrecht

Undergrads
Mike Kelly
Rory Carmichael
Mark Pasquier
Christopher Lyon
Jared Bulosan

NSF Grants CCF-0621434, CNS-0643229
