Science in the Clouds: History, Challenges, and Opportunities


Title: Science in the Clouds: History, Challenges, and Opportunities


1
Science in the Clouds: History, Challenges, and Opportunities
  • Douglas Thain
  • University of Notre Dame
  • GeoClouds Workshop
  • 17 September 2009

2
http://www.cse.nd.edu/ccl
3
The Cooperative Computing Lab
  • We collaborate with people who have large scale
    computing problems.
  • We build new software and systems to help them
    achieve meaningful goals.
  • We run a production computing system used by
    people at ND and elsewhere.
  • We conduct computer science research, informed by
    real world experience, with an impact upon
    problems that matter.

4
Clouds in the Hype Cycle
Gartner Hype Cycle Report, 2009
5
What is cloud computing?
  • A cloud provides rapid, metered access to a
    virtually unlimited set of resources.
  • Two significant impacts on users:
  • End users must have an economic model for the
    work that they want to accomplish.
  • Apps must be flexible enough to work with an
    arbitrary number and kind of resources.

6
Example: Amazon EC2, Sep 2009 (simplified slightly
for discussion)
  • Small: 1 core, 1.7GB RAM, 160GB disk
  • 10 cents/hour
  • Large: 2 cores, 7.5GB RAM, 850GB disk
  • 40 cents/hour
  • Extra Large: 4 cores, 15GB RAM, 1690GB disk
  • 80 cents/hour
  • And the Simple Storage Service:
  • 15 cents per GB-month stored
  • 17 cents per GB transferred (outside of EC2)
  • 1 cent per 1000 write operations
  • 1 cent per 10000 read operations
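
A price list like this is enough for a first-cut economic model. The following is a minimal sketch, assuming the simplified Sep 2009 prices on this slide and ignoring request charges and partial-hour billing; the function names and the example workload are illustrative, not part of the talk.

```python
# Simplified Sep 2009 prices from the slide, in US cents
# (integers avoid floating-point drift when adding up money).
INSTANCE_CENTS_PER_HOUR = {"small": 10, "large": 40, "xlarge": 80}
INSTANCE_CORES = {"small": 1, "large": 2, "xlarge": 4}
S3_CENTS_PER_GB_MONTH = 15
S3_CENTS_PER_GB_OUT = 17


def compute_cost_cents(instance, count, hours):
    """Cost of running `count` instances of one type for `hours` hours."""
    return INSTANCE_CENTS_PER_HOUR[instance] * count * hours


def storage_cost_cents(gb_stored, months, gb_transferred_out):
    """S3 storage plus outbound transfer; request charges are left out."""
    return (S3_CENTS_PER_GB_MONTH * gb_stored * months
            + S3_CENTS_PER_GB_OUT * gb_transferred_out)


def cents_per_core_hour(instance):
    """Hourly price divided by the number of cores."""
    return INSTANCE_CENTS_PER_HOUR[instance] / INSTANCE_CORES[instance]


if __name__ == "__main__":
    # Hypothetical workload: 100 small instances for 24 hours,
    # 500 GB kept for one month, 50 GB copied back out.
    total = compute_cost_cents("small", 100, 24) + storage_cost_cents(500, 1, 50)
    print(f"${total / 100:.2f}")
```

Under these prices, Large and Extra Large both work out to 20 cents per core-hour, twice the Small rate, so the choice between them depends on memory and disk needs rather than on CPU price.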

7
Is Cloud Computing New?
  • Not entirely. It combines the old ideas of
    utility computing and distributed computing.
  • 1960 MULTICS
  • 1980 The Cambridge Ring
  • 1987 Condor Distributed Batch System
  • 1989 SETI@home
  • 1990s Clusters, Beowulf, MPI, NOW
  • 1995 Globus, Grid Computing
  • 2001 TeraGrid
  • 2004 Sun Rents CPUs at $1/hour
  • 2006 Amazon EC2 and S3

8
Clouds Trade CapEx for OpEx
(Figure: cost versus time; labels "OpEx of Ownership" and "Capital Expense of Ownership".)
9
What about grid computing?
  • A vision much like clouds
  • A worldwide framework that would make massive
    scale computing as easy to use as an electrical
    socket.
  • The more modest realization
  • A means for accessing remote computing facilities
    in their native form, usually for CPU-intensive
    tasks.
  • The social context
  • Large collaborative efforts between computer
    scientists and computer-savvy fields,
    particularly physics and astronomy.

10
Clouds vs Grids
  • Grids provide a job execution interface
  • Run program P on input A, return the output.
  • Allows the system to maximize utilization and
    hide failures, but provides few performance
    guarantees and inaccurate metering.
  • Clouds provide resource allocation
  • Create a VM with 2GB of RAM for 7 days.
  • Gives predictable performance and accurate
    metering, but exposes problems to the user.
  • Can be used to build interactive services.
  • How do I run 1M jobs on 100 servers?

11
Grid Computing Layer Provides Job Execution
Cloud Computing Layer Provides Resource Allocation
12
Create a Condor Pool with 100 Nodes
Allocate 100 Cores
Run 1M Jobs
13
Clouds Solve Some Grid Problems
  • Application compatibility is simplified.
  • You provide a VM for Linux 2.3.4.1.2.
  • Performance is reasonably predictable.
  • 10% variations rather than orders of magnitude.
  • Fewer administrative headaches for the lone user.
  • A credit card swipe instead of a certificate.

14
But, Problems New and Old
  • How do I reliably execute 1M jobs?
  • Can I share resources and data with others in the
    cloud?
  • How do I authenticate others in the cloud?
  • Unfortunately, location still matters.
  • Can we make applications efficiently span
    multiple cloud providers?
  • Can we join existing centers with clouds?
  • (These are all problems contemplated by grid.)

15
More Open Questions
  • Can I afford to move my data in to the cloud?
  • Can I afford to get it out?
  • Do I trust the cloud to secure my data?
  • How do I go about constructing an economic model
    for my research?
  • Are there social/technical dangers in putting too
    many eggs in one basket?
  • Is pay-as-you-go the proper model for research?
  • Should universities get out of the data center
    business?

16
Clusters, clouds, and grids give us access to
unlimited CPUs.
How do we write programs that can run effectively
in large systems?
17
MapReduce( S, M, R )
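(Figure: each item of set S passes through map function M, which emits (K,V) pairs; the values V are grouped by key, Key0 through KeyN; reduce function R turns each key's values into an output O0, O1, O2.)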
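
The data flow in this diagram can be sketched in a few lines. This is a single-machine illustration only, assuming M returns a list of (key, value) pairs and R takes a key and its list of values; the word-count functions are made-up examples, not from the talk.

```python
from collections import defaultdict


def map_reduce(S, M, R):
    """Apply M to every item of S, group the emitted (key, value) pairs
    by key, then apply R to each key and its list of values."""
    groups = defaultdict(list)
    for item in S:
        for key, value in M(item):
            groups[key].append(value)
    return {key: R(key, values) for key, values in groups.items()}


# Illustrative M and R (hypothetical): count word occurrences.
def emit_words(line):
    return [(word, 1) for word in line.split()]


def add_up(key, values):
    return sum(values)


if __name__ == "__main__":
    print(map_reduce(["a b a", "b c"], emit_words, add_up))
```

A real MapReduce system distributes both phases and the grouping step across many machines; only the shape of the computation is shown here.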
18
Of course, not all science fits into the
Map-Reduce model!
19
Example: Biometrics Research
  • Goal: Design a robust face comparison function.

20
Similarity Matrix Construction

1.0  0.8  0.1  0.0  0.0  0.1
     1.0  0.0  0.1  0.1  0.0
          1.0  0.0  0.1  0.3
               1.0  0.0  0.0
                    1.0  0.1
                         1.0
Challenge Workload: 60,000 iris images, 1MB each,
.02s per F, 833 CPU-days, 600 TB of I/O
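
The 833 CPU-days figure can be checked with a short calculation. This sketch assumes that every image is compared against every image (60,000 x 60,000 calls to F) and that F takes a constant .02 seconds; the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def all_pairs_cpu_days(n_items, seconds_per_comparison):
    """CPU-days needed to compare every item against every item
    (n_items squared calls to F), run one after another."""
    comparisons = n_items * n_items
    return comparisons * seconds_per_comparison / SECONDS_PER_DAY


if __name__ == "__main__":
    print(round(all_pairs_cpu_days(60_000, 0.02)))
```

That is 3.6 billion comparisons at .02 seconds each, or 72 million CPU-seconds, which rounds to 833 days.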
21
Now What?
22
(No Transcript)
23
(No Transcript)
24
(No Transcript)
25
Non-Expert User Using 500 CPUs
26
Observation
  • In a given field of study, many people repeat the
    same pattern of work many times, making slight
    changes to the data and algorithms.
  • If the system knows the overall pattern in
    advance, then it can do a better job of executing
    it reliably and efficiently.
  • If the user knows in advance what patterns are
    allowed, then they have a better idea of how to
    construct their workloads.

27
Abstractions for Distributed Computing
  • Abstraction: a declarative specification of the
    computation and data of a workload.
  • A restricted pattern, not meant to be a general
    purpose programming language.
  • Uses data structures instead of files.
  • Provide users with a bright path.
  • Regular structure makes it tractable to model and
    predict performance.

28
Working with Abstractions
(Figure labels: input sets A1..An and B1..Bn, function F, AllPairs( A, B, F ), Custom Workflow Engine, Cloud or Grid, Compact Data Structure.)
29
All-Pairs Abstraction
  • AllPairs( set A, set B, function F )
  • returns matrix M where
  • M[i,j] = F( A[i], B[j] ) for all i,j

(Figure: the command "allpairs A B F.exe" applies F to every pairing of A1..An with B1..Bn, filling the matrix AllPairs(A,B,F).)
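
As a reference point, the All-Pairs definition can be written directly in Python. This is a sequential sketch, not the production engine; the blocked variant and its `block` parameter are assumptions, included only to show how a workload might be cut into tiles that could each be shipped as one job.

```python
def all_pairs(A, B, F):
    """Return matrix M (a list of rows) with M[i][j] = F(A[i], B[j])."""
    return [[F(a, b) for b in B] for a in A]


def all_pairs_blocked(A, B, F, block=2):
    """Same result as all_pairs, computed tile by tile. Each tile of
    block x block cells stands in for one job sent to a remote CPU."""
    M = [[None] * len(B) for _ in A]
    for i0 in range(0, len(A), block):
        for j0 in range(0, len(B), block):
            # Everything inside these two inner loops is one "job".
            for i in range(i0, min(i0 + block, len(A))):
                for j in range(j0, min(j0 + block, len(B))):
                    M[i][j] = F(A[i], B[j])
    return M


if __name__ == "__main__":
    distance = lambda a, b: abs(a - b)  # hypothetical comparison function
    print(all_pairs([1, 2, 3], [10, 20], distance))
```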
30
How Does the Abstraction Help?
  • The custom workflow engine:
  • Chooses the right data transfer strategy.
  • Chooses the right number of resources.
  • Chooses blocking of functions into jobs.
  • Recovers from a larger number of failures.
  • Predicts overall runtime accurately.
  • All of these tasks are nearly impossible for
    arbitrary workloads, but are tractable (not
    trivial) to solve for a specific abstraction.

31
(No Transcript)
32
Choose the Right Number of CPUs
33
Resources Consumed
34
All-Pairs in Production
  • Our All-Pairs implementation has provided over
    57 CPU-years of computation to the ND biometrics
    research group over the last year.
  • Largest run so far: 58,396 irises from the Face
    Recognition Grand Challenge. The largest
    experiment ever run on publicly available data.
  • Competing biometric research relies on samples of
    100-1000 images, which can miss important
    population effects.
  • Reduced computation time from 833 days to 10
    days, making it feasible to repeat multiple times
    for a graduate thesis. (We can go faster yet.)

35
(No Transcript)
36
Are there other abstractions?
37
Wavefront( matrix M, function F(x,y,d) ) returns
matrix M such that
M[i,j] = F( M[i-1,j], M[i,j-1], M[i-1,j-1] )
(Figure: Wavefront(M,F) applies F across matrix M.)
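
The recurrence can be sketched sequentially as follows. It assumes row 0 and column 0 of M already hold boundary values, which the slide does not state; the loop walks anti-diagonals to show where the parallelism comes from.

```python
def wavefront(M, F):
    """Fill M in place so that M[i][j] = F(M[i-1][j], M[i][j-1], M[i-1][j-1]).
    Row 0 and column 0 must already hold the boundary values (an assumption)."""
    n_rows, n_cols = len(M), len(M[0])
    # Cells on one anti-diagonal (i + j == d) depend only on earlier
    # diagonals, so a parallel engine could compute them all at once.
    for d in range(2, n_rows + n_cols - 1):
        for i in range(max(1, d - n_cols + 1), min(d, n_rows)):
            j = d - i
            M[i][j] = F(M[i - 1][j], M[i][j - 1], M[i - 1][j - 1])
    return M


if __name__ == "__main__":
    grid = [[1, 1, 1],
            [1, 0, 0],
            [1, 0, 0]]
    # Hypothetical F: sum of the three neighbors.
    print(wavefront(grid, lambda x, y, d: x + y + d))
```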
38
Applications of Wavefront
  • Bioinformatics
  • Compute the alignment of two large DNA strings in
    order to find similarities between species.
    Existing tools do not scale up to complete DNA
    strings.
  • Economics
  • Simulate the interaction between two competing
    firms, each of which has an effect on resource
    consumption and market price. E.g. When will we
    run out of oil?
  • Applies to any kind of optimization problem
    solvable with dynamic programming.

39
Problem: Dispatch Latency
  • Even with an infinite number of CPUs, dispatch
    latency controls the total execution time: O(n)
    in the best case.
  • However, job dispatch latency in an unloaded grid
    is about 30 seconds, which may outweigh the
    runtime of F.
  • Things get worse when queues are long!
  • Solution: Build a lightweight task dispatch
    system. (Idea from Falkon@UC)
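
A rough lower bound makes the O(n) claim concrete. This sketch assumes an n x n grid of cells to compute, unlimited CPUs, and that each anti-diagonal cannot be dispatched until the previous one has finished; the function name and the choice of n = 500 are illustrative.

```python
def wavefront_min_seconds(n, dispatch_latency, task_seconds=0.0):
    """Lower bound on wall-clock time for an n x n wavefront with
    unlimited CPUs. The 2n - 1 anti-diagonals run one after another,
    and each costs at least one dispatch plus one run of F."""
    diagonals = 2 * n - 1
    return diagonals * (dispatch_latency + task_seconds)


if __name__ == "__main__":
    seconds = wavefront_min_seconds(500, dispatch_latency=30)
    print(f"{seconds / 3600:.1f} hours")
```

With a 30-second dispatch latency, n = 500 gives 999 sequential dispatches, or 29,970 seconds (about 8.3 hours), before counting any time spent inside F.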

40
1000s of workers dispatched to the cloud
(Figure: the wavefront engine queues tasks with the work queue, which drives the workers and reports tasks done. Detail of one task, with files F, in.txt, and out.txt:)
put F.exe
put in.txt
exec F.exe <in.txt >out.txt
get out.txt
41
Problem: Performance Variation
  • Tasks can be delayed for many reasons
  • Heterogeneous hardware.
  • Interference with disk/network.
  • Policy based suspension.
  • Any delayed task in Wavefront has a cascading
    effect on the rest of the workload.
  • Solution: Fast Abort. Keep statistics on task
    runtimes, and abort those that lie significantly
    outside the mean. Prefer to assign jobs to
    machines with a fast history.
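
One way to sketch Fast Abort is shown below. The slide does not say how "significantly outside the mean" is measured, so the cutoff of mean plus k standard deviations, the minimum-sample guard, and all names here are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


class FastAbort:
    """Track completed-task runtimes and flag running tasks that take too long."""

    def __init__(self, k=3.0, min_samples=5):
        self.k = k                      # assumed cutoff: mean + k standard deviations
        self.min_samples = min_samples  # do not abort until there is some history
        self.runtimes = []
        self.by_host = defaultdict(list)

    def record(self, host, seconds):
        """Remember how long a finished task took and where it ran."""
        self.runtimes.append(seconds)
        self.by_host[host].append(seconds)

    def should_abort(self, elapsed):
        """True if a task that has run for `elapsed` seconds is an outlier."""
        if len(self.runtimes) < self.min_samples:
            return False
        cutoff = mean(self.runtimes) + self.k * pstdev(self.runtimes)
        return elapsed > cutoff

    def fastest_host(self, idle_hosts):
        """Prefer the idle host with the lowest average runtime;
        hosts with no history sort last."""
        def average(host):
            history = self.by_host.get(host)
            return mean(history) if history else float("inf")
        return min(idle_hosts, key=average)
```

An aborted task would simply be put back on the queue and handed to another machine, ideally the one returned by `fastest_host`.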

42
500x500 Wavefront on 200 CPUs
43
Wavefront on a 200-CPU Cluster
44
Wavefront on a 32-Core CPU
45
The Genome Assembly Problem
AGTCGATCGATCGATAATCGATCCTAGCTAGCTACGA
AGTCGATCGATCGAT
          TCGATAATCGATCCTAGCTA
                         AGCTAGCTACGA
46
Sample Genomes
Genome                 Reads   Data    Pairs   Sequential Time
A. gambiae scaffold    101K    80MB    738K    12 hours
A. gambiae complete    180K    1.4GB   12M     6 days
S. bicolor simulated   7.9M    5.7GB   84M     30 days
47
Some-Pairs Abstraction
  • SomePairs( set A, list (i,j), function F(x,y) )
  • returns
  • list of F( A[i], A[j] )

(Figure: SomePairs(A,L,F) with L = (1,2) (2,1) (2,3) (3,3) applies F only to the listed pairs drawn from A1..An.)
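
The Some-Pairs definition is small enough to state in code. In this sketch the indices are 1-based to match the (1,2) (2,1) notation, and `overlap` is a toy stand-in for a real alignment function, introduced here only for illustration.

```python
def some_pairs(A, pairs, F):
    """Return [F(A[i], A[j]) for each (i, j) in pairs], using 1-based
    indices to match the slide's notation."""
    return [F(A[i - 1], A[j - 1]) for i, j in pairs]


def overlap(x, y):
    """Toy stand-in for alignment: length of the longest suffix of x
    that is also a prefix of y."""
    for k in range(min(len(x), len(y)), 0, -1):
        if x.endswith(y[:k]):
            return k
    return 0


if __name__ == "__main__":
    reads = ["AGTCGATCGATCGAT", "TCGATAATCGATCCTAGCTA", "AGCTAGCTACGA"]
    print(some_pairs(reads, [(1, 2), (2, 3)], overlap))
```

Unlike All-Pairs, the work grows with the length of the pair list rather than with the square of the set size, which is what makes large read sets tractable.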
48
Distributed Genome Assembly
100s of workers dispatched to Notre Dame, Purdue,
and Wisconsin
(Figure: the somepairs master takes A1..An, F, and the pair list (1,2) (2,1) (2,3) (3,3), queues tasks with the work queue, and collects tasks done. Detail of a single worker, with files F, in.txt, and out.txt:)
put align.exe
put in.txt
exec F.exe <in.txt >out.txt
get out.txt
49
Small Genome (101K reads)
50
Medium Genome (180K reads)
51
Large Genome (7.9M)
52
What's the Upshot?
  • We can do full-scale assemblies as a routine
    matter on existing conventional machines.
  • Our solution is faster (wall-clock time) than the
    next-fastest assembler run on 1024x BG/L.
  • You could almost certainly do better with a
    dedicated cluster and a fast interconnect, but
    such systems are not universally available.
  • Our solution opens up research in assembly to
    labs with NASCAR instead of Formula-One
    hardware.

53
What if your application doesn't fit a regular
pattern?
54
Makeflow
part1 part2 part3: input.data split.py
    ./split.py input.data

out1: part1 mysim.exe
    ./mysim.exe part1 >out1

out2: part2 mysim.exe
    ./mysim.exe part2 >out2

out3: part3 mysim.exe
    ./mysim.exe part3 >out3

result: out1 out2 out3 join.py
    ./join.py out1 out2 out3 > result
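
Rules like these form a dependency graph, and an engine can run any rule whose inputs already exist. The sketch below is not Makeflow's implementation; it is a hypothetical scheduler that only computes which commands could run together, with the workflow on this slide encoded as (targets, sources, command) tuples.

```python
RULES = [
    (["part1", "part2", "part3"], ["input.data", "split.py"], "./split.py input.data"),
    (["out1"], ["part1", "mysim.exe"], "./mysim.exe part1 >out1"),
    (["out2"], ["part2", "mysim.exe"], "./mysim.exe part2 >out2"),
    (["out3"], ["part3", "mysim.exe"], "./mysim.exe part3 >out3"),
    (["result"], ["out1", "out2", "out3", "join.py"], "./join.py out1 out2 out3 > result"),
]


def schedule(rules, existing_files):
    """Return a list of waves. Each wave holds the commands whose sources
    are all available, so the commands in one wave could run in parallel."""
    available = set(existing_files)
    pending = list(rules)
    waves = []
    while pending:
        ready = [r for r in pending if all(s in available for s in r[1])]
        if not ready:
            raise ValueError("some rules can never run: missing inputs")
        waves.append([command for _, _, command in ready])
        for targets, _, _ in ready:
            available.update(targets)  # outputs become inputs for later rules
        pending = [r for r in pending if r not in ready]
    return waves


if __name__ == "__main__":
    start = ["input.data", "split.py", "mysim.exe", "join.py"]
    for number, wave in enumerate(schedule(RULES, start), 1):
        print(number, wave)
```

For this workflow the split runs first, the three simulations can run side by side, and the join runs last.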
55
Makeflow Implementation
100s of workers dispatched to the cloud
(Figure: the makeflow master reads the rule below, queues tasks with the work queue, and collects tasks done.)
bfile: afile prog
    prog afile >bfile
Detail of a single worker, with files prog, afile, and bfile:
put prog
put afile
exec prog afile > bfile
get bfile
Two optimizations: Cache inputs and outputs.
Dispatch tasks to nodes with data.
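
The two optimizations can be illustrated with a small sketch. It assumes the master keeps a record of which files each worker has cached; the data layout and function names are hypothetical, not taken from the actual implementation.

```python
def pick_worker(task_inputs, worker_caches):
    """Choose the worker that already holds the most of the task's inputs.
    worker_caches maps a worker name to the set of file names it has cached."""
    needed = set(task_inputs)
    return max(worker_caches, key=lambda w: len(needed & worker_caches[w]))


def files_to_send(task_inputs, cache):
    """Inputs that still need a 'put' before the task can run on that worker."""
    return [name for name in task_inputs if name not in cache]


if __name__ == "__main__":
    caches = {"w1": {"prog"}, "w2": {"prog", "afile"}, "w3": set()}
    chosen = pick_worker(["prog", "afile"], caches)
    print(chosen, files_to_send(["prog", "afile"], caches[chosen]))
```

When the chosen worker already holds every input, the task needs no transfers at all before the exec step.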
56
Experience with Makeflow
  • Still in initial deployment, so no big results to
    show just yet.
  • Easy to test and debug on a desktop machine or a
    multicore server.
  • The workload says nothing about the distributed
    system. (This is good.)
  • Graduate students in bioinformatics running codes
    at production speeds on hundreds of nodes in less
    than a week.

57
Abstractions as a Social Tool
  • Collaboration with outside groups is how we
    encounter the most interesting, challenging, and
    important problems in computer science.
  • However, often neither side understands which
    details are essential or non-essential:
  • Can you deal with files that have upper case
    letters?
  • Oh, by the way, we have 10TB of input, is that
    ok?
  • (A little bit of an exaggeration.)
  • An abstraction is an excellent chalkboard tool:
  • Accessible to anyone with a little bit of
    mathematics.
  • Makes it easy to see what must be plugged in.
  • Forces out essential details: data size,
    execution time.

58
Conclusion
  • Grids, clouds, and clusters provide enormous
    computing power, but are very challenging to use
    effectively.
  • An abstraction provides a robust, scalable
    solution to a narrow category of problems; each
    requires different kinds of optimizations.
  • Limiting expressive power results in systems
    that are usable, predictable, and reliable.
  • Is there a menu of abstractions that would
    satisfy many consumers of clouds?

59
Acknowledgments
  • Cooperative Computing Lab
  • http://www.cse.nd.edu/ccl
  • Faculty
  • Patrick Flynn
  • Nitesh Chawla
  • Kenneth Judd
  • Scott Emrich
  • Grad Students
  • Chris Moretti
  • Hoang Bui
  • Li Yu
  • Mike Olson
  • Michael Albrecht
  • Undergrads
  • Mike Kelly
  • Rory Carmichael
  • Mark Pasquier
  • Christopher Lyon
  • Jared Bulosan
  • NSF Grants CCF-0621434, CNS-0643229