Title: First Steps in the Clouds
1 First Steps in the Clouds
- Kate Keahey
- keahey_at_mcs.anl.gov
- University of Chicago
- Argonne National Laboratory
2 Why Clouds?
- Resource consumers
- Individual users or Virtual Organizations
- Requirements
- Customized environments for their services/applications
- Services/applications can be short-lived
- New environments/services deployed quickly and often
- Resource providers
- Own and operate physical resources
- Requirements
- Ability to monitor and control their resources
- Provide resources at reasonable operational cost
- Protection from activities performed by resource consumers
- Consumers need to be able to lease platforms, potentially for a short term, that they can customize and control
3 Cloud Computing for Grid Communities: The STAR Application Use Case
4 The STAR Application
- Complex experimental application codes
- Developed over more than 10 years by more than 100 scientists; comprises 2 M lines of C and Fortran code
- www.star.bnl.gov
- Require complex, customized environments
- Rely heavily on the right combination of compiler versions and available libraries
- Dynamically load external libraries depending on the task to be performed
- Environment validation
- To ensure reproducibility and result uniformity across environments
- Why do we need a cloud?
- Resources with the right configuration are hard to find
- A VM-based cloud gives us the required control
5 Running STAR in a Cloud
- First challenge: finding VM-enabled resources
- Amazon Elastic Compute Cloud (EC2)
- More challenges
- Can we use X.509 certs to submit to a cloud?
- Can we use Grid access protocols?
- How much manual configuration do we need to do for a cluster that we need for 4 hours?
- How do we integrate the cluster into the Grid infrastructure?
- Workspace Service
- X.509 certificates are mapped to a project account
- Grid access protocols
- Creating a virtual cluster dynamically
- Contextualization (cluster context): the cluster node VMs find out about each other and integrate that information at boot time
- Integrating the cluster into the Grid
- Contextualization (grid context): the cluster is configured with appropriate host certs, gridmapfiles, etc.
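A minimal sketch of what the two contextualization steps could produce, assuming each node VM receives a roster of its peers at boot. The `NodeInfo` type and both function names are hypothetical and are not part of the Workspace Service; only the grid-mapfile line format (a quoted certificate subject followed by a local account) follows the usual Globus convention.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class NodeInfo:
    """One cluster node as seen in the (hypothetical) boot-time roster."""
    hostname: str
    ip: str


def hosts_entries(roster: List[NodeInfo]) -> str:
    """Cluster context: render hosts-file lines so nodes can resolve each other."""
    return "".join(f"{node.ip}\t{node.hostname}\n" for node in roster)


def gridmap_entries(subject_dns: Iterable[str], account: str) -> str:
    """Grid context: map each X.509 subject DN to one shared project account."""
    return "".join(f'"{dn}" {account}\n' for dn in subject_dns)
```

For example, a two-node roster yields one hosts line per node, and every listed certificate subject maps to the same project account, which mirrors the "X.509 certificates are mapped to a project account" point above.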
6 With thanks to Jerome Lauret and Doug Olson of the STAR project; presented at CHEP07
[Figure: number of running jobs over time at each site (VWS/EC2, BNL, WSU, Fermi, PDSF), declining to 0 in every panel. Labels: Job Completion, File Recovery.]
7 [Figure: accelerated display of workflow job state for NERSC PDSF, EC2 (via Workspace Service), and WSU. Y axis: job number; X axis: job state.]
8 What Did We Learn?
- Performance was not an issue
- The real comparison is having a resource to run on vs. not having a resource to run on
- Contextualization is key for dynamic virtual cluster deployment
- Next steps: a more challenging application
9 Cloud Computing for Grid Providers: Building the Science Cloud at the University of Chicago
10 Challenges
- Virtualization adoption has been relatively slow among Grid providers
- Challenge: integrating VMs into current provisioning models
- Integrate into a site without disrupting the current operation of resources
- I.e., be able to run jobs as well as VMs
- Non-invasive from the perspective of currently used tools
- E.g., no modification to the currently used schedulers and resource managers
- Can be used alongside the current mode of operation
- Batch jobs
- Represent as small a change as possible
- Operate within familiar metaphors
- Avoid error-generating complexity
11 Roll Your Own Cloud
- The Workspace Pilot
- Operates on resources that can support jobs as well as VMs
- E.g., have been booted into Xen domain 0
- Non-invasive extension to batch schedulers (e.g., PBS)
- Wrappers for submission operation, scheduler signals to operate on VMs
- Glidein approach: submits a pilot program that prepares a resource slot for VM deployment
- E.g., adjusts Xen domain 0 memory
- Comes with administrator tools
- E.g., kill-all
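The memory adjustment mentioned above can be pictured as simple bookkeeping: domain 0 gives up just enough memory for the guest and gets it back afterwards. The sketch below is only that arithmetic, under assumed names (`NodeMemory`, `reserve_for_guest`, `dom0_floor_mb`); it does not call Xen and is not the pilot's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeMemory:
    """Memory accounting for one physical node, in MB (illustrative only)."""
    total_mb: int
    dom0_mb: int


def reserve_for_guest(node: NodeMemory, guest_mb: int,
                      dom0_floor_mb: int = 512) -> NodeMemory:
    """Return the node state after shrinking domain 0 to fit a guest VM.

    dom0_floor_mb is an assumed safety limit below which domain 0
    should not be shrunk; the real threshold is site-specific.
    """
    already_free = node.total_mb - node.dom0_mb
    shortfall = max(0, guest_mb - already_free)
    new_dom0 = node.dom0_mb - shortfall
    if new_dom0 < dom0_floor_mb:
        raise ValueError("guest does not fit without starving domain 0")
    return NodeMemory(node.total_mb, new_dom0)


def restore_dom0(node: NodeMemory, original_dom0_mb: int) -> NodeMemory:
    """Return the node state after the guest is gone and memory is given back."""
    return NodeMemory(node.total_mb, original_dom0_mb)
```

On a 4096 MB node where domain 0 initially holds all memory, reserving 3072 MB for a guest leaves domain 0 with 1024 MB; a request that would push domain 0 below the floor is rejected instead.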
12 Workspace Pilot in Action
- Level 1: provision raw resources
- Level 2: provision VMs
- [Diagram components: Workspace Service, LRM/PBS, Xen dom0 nodes]
- VMs are decommissioned
- Raw resources are decommissioned
13 The Pilot Program
- Uses the Xen balloon driver to reduce/restore domain 0 memory so that guest domains (VMs) can be deployed
- Secure VM deployment
- The pilot requires sudo privilege and thus can be used only with the site administrator's approval
- The workspace service provides fine-grained authorization for all requests
- Signal handling
- SIGTERM: pilot exceeded its allotted time
- Notifies VWS, allows it to clean up
- After a configurable time period, takes matters into its own hands
- Default policy: one VM per physical node
- Available for download
- Workspace Release 1.3.1
- http://workspace.globus.org/downloads/index.html
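The signal-handling behavior listed above (notify on SIGTERM, then act after a grace period) can be sketched as follows. This is an illustrative pattern, not the pilot's implementation: the class name `PilotSketch` and the `notify` and `force_cleanup` callbacks are assumptions standing in for "notify VWS" and "take matters into its own hands".

```python
import signal
import threading
from typing import Callable, Optional


class PilotSketch:
    """Hypothetical pilot that reacts to SIGTERM in two stages."""

    def __init__(self, notify: Callable[[], None],
                 force_cleanup: Callable[[], None],
                 grace_seconds: float = 30.0) -> None:
        self.notify = notify                # stand-in for telling the service
        self.force_cleanup = force_cleanup  # stand-in for local forced cleanup
        self.grace_seconds = grace_seconds  # the configurable time period
        self.released = threading.Event()   # set once the service has cleaned up
        self.timer: Optional[threading.Timer] = None

    def on_sigterm(self, signum, frame) -> None:
        """Stage 1: notify the service, then start the grace-period timer."""
        self.notify()
        self.timer = threading.Timer(self.grace_seconds, self._expire)
        self.timer.daemon = True
        self.timer.start()

    def _expire(self) -> None:
        """Stage 2: if the service did not clean up in time, do it locally."""
        if not self.released.is_set():
            self.force_cleanup()

    def install(self) -> None:
        """Register the handler; must be called from the main thread."""
        signal.signal(signal.SIGTERM, self.on_sigterm)
```

A real pilot would call `install()` at startup and set `released` when the service reports that the VMs are gone; keeping the callbacks injectable makes the two-stage policy easy to exercise without sending real signals.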
14 Nimbus @ UC
- What is it?
- The Science Cloud at the University of Chicago
- UC TeraPort cluster configured with the workspace pilot
- Currently 16 nodes
- What can it do for me?
- Allow you to lease a cluster of VMs
- Who can use it?
- Members of the scientific community
- Inasmuch as usage policies allow
- What do I need to do if I want to use it?
- Contact us: keahey_at_mcs.anl.gov
- You will need a VM image (we can help and know others who can), a certificate, and a simple client
15 Cloud Interoperability
- Moving an app from a hardware platform to a cloud is relatively hard
- Need to develop a VM image, learn about cloud computing, figure out logistics
- Moving between clouds
- E.g., STAR app EC2 -> Science Cloud and vice versa is very easy
- Rough consensus on the interfaces needed to provision resources in the cloud
- OGF gridvit-wg
- Chairs: Erol Bozak, Wolfgang Reichert
- Define the requirements for integration of Grid architecture with system virtualization platforms
- Exploring the impact of virtualization on Grid use cases
- Exploring the relationship with standards (DMTF, etc.)