Title: Roger Barga, Architect, Cloud Computing Futures Group, Microsoft Research (MSR)
1. Roger Barga, Architect, Cloud Computing Futures Group, Microsoft Research (MSR)
Cloud Computing: A Microsoft Research Perspective
Contributors to this presentation include Dan Reed, Dennis Gannon, Navendu Jain, and Tony Hey (MSR)
2. eXtreme Computing, MSR
- eXtreme Computing Division (Dan Reed, CVP, Microsoft Research): rethink the nature of computing at extreme scale, from alternative quantum computing models, through the transformative effects of manycore parallelism on programming systems and architectures, to massive cloud computing infrastructure designs.
- Cloud Computing Futures Group: ab initio research and development on cloud hardware and software infrastructure; investigate cloud computing for research empowerment through worldwide government and academic partnerships.
3. Talk Outline
- Data center landscape
- Cloud computing spectrum: the rise of a new platform
- Data-intensive research: the role of cloud computing
- Key takeaways
  - Data centers and HPC: like twins separated at birth (Dan Reed)
  - Data centers evolving at a blistering pace, driven by economics
  - The application model for cloud computing is evolving
  - The economic landscape increasingly favors pay-as-you-go
  - There are many obstacles, but economic forces will dominate them
  - Emergence of the Fourth Paradigm, synergistic with cloud computing
4. HPC and Clouds: Select Comparisons
- Node and system architectures
- Communication fabric
- Storage systems and analytics
- Physical plant and operations
- Reliability and resilience
- Programming models
5. HPC Node Architecture
- Moore's Law favored commodity systems
- Specialized processors and systems faltered
- "Killer micros" and industry-standard blades led
- Inexpensive clusters now dominate
www.top500.org
6. HPC Interconnects
- Ethernet for low end (cost sensitive)
- High end expectations
- Nearly flat networks and very large switches
- Operating system bypass for low latency
(microseconds)
www.top500.org
7. Modern Data Center Network
Internet
Monsoon network with Valiant routing
- Key:
  - CR (L3 Border Router)
  - AR (L3 Access Router)
  - S (L2 Switch)
  - LB (Load Balancer)
  - A (20-Server Rack/TOR)
Source: Albert Greenberg and Cisco
8. HPC Interconnects
- Ethernet for low end (cost sensitive)
- High end expectations
- Nearly flat networks and very large switches
- Operating system bypass for low latency
(microseconds)
www.top500.org
9. HPC Storage Systems
- Local disk
  - Scratch or non-existent
- Secondary storage
  - SAN and parallel file systems
  - Hundreds of TBs (at most)
- Tertiary storage
  - Tape robot(s)
  - 3-5 GB/s bandwidth
  - 60 PB capacity
www.nersc.gov
10. I/O Implications and Scale
- Typical HPC scenario
  - MPI computation
  - Domain decomposition
  - SAN-based parallel file system
  - Periodic checkpoints
- Scaling challenges
  - System MTBF approaching zero
  - Checkpoint frequency increasing
  - I/O demand becoming intolerable
- Implications
  - Unlikely to extend to exascale
  - Loosely consistent models required
Slide by Dan Reed
11. Cloud/HPC Hardware Comparison

Attribute          HPC                     Cloud
Processor          High-end x86            x86
Memory             1-8 GB                  8 GB
Local Disk         Scratch only            Permanent storage
SAN Storage        Common                  Rare
Tertiary Storage   Common                  Rare
Interconnect       InfiniBand or 10 GigE   1 GigE/10 GigE
Network            Flat                    Hierarchical
Physical Plant     Traditional             Optimized
Virtualization                             Efficient

- Predominant differences: network architecture and SAN storage
12. Virtualization as an Enabler
- Emulation of existing apps
  - Hardware via existing ISA, memory-mapped ports, etc.
  - Storage via SCSI LUN or other disk interface
  - Application via underlying API
- Enablement of new services
  - Resource utilization: pool concrete resources
  - Decoupling of concrete resources: enables migration
  - Extension of existing abstractions: e.g., LUN expansion
13. HPC Physical Plant
- Facilities
  - Co-located with the operating institution
  - Standard raised floor and CRAC units
  - Limited UPS support
  - Typically constrained to 3-5 MW
  - Designed as lab showpieces
- Example systems
  - LBL: 38,640 cores
  - ORNL: 150,152 cores
  - LANL: 130,000 cores
  - ANL: 163,840 cores
14. The Data Center Landscape
- Range in size from edge facilities to mega-scale
- Unprecedented economies of scale
- Approximate costs for a medium-sized center (1,000 servers) versus a large, 50K-server center:

Technology       Medium-sized Data Center      Very Large Data Center        Ratio
Network          $95 per Mbps/month            $13 per Mbps/month            7.1
Storage          $2.20 per GB/month            $0.40 per GB/month            5.7
Administration   ~140 servers/administrator    >1,000 servers/administrator  7.1

Source: James Hamilton, LADIS '08
15. Modern Data Center: Containers Separating Concerns
16. Data Center Design Issues
- Where are the costs? Mid-sized facility (20 containers):
  - Cost of power: $0.07/kWh
  - Cost of facility: $200,000,000 (amortized over 15 years)
  - Number of servers: 50,000 (3-year life) at $2K each
  - Power critical load: 15 MW
  - Power Usage Effectiveness (PUE): 1.7
- Observe:
  - Fully burdened cost of power = power consumed plus the cost of cooling and power-distribution infrastructure
  - As the cost of servers drops and power costs rise, power will dominate all other costs.
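The slide's figures can be combined into a rough monthly cost picture. A sketch, assuming 730 hours/month and straight-line amortization (the breakdown itself is illustrative, not from the slide):

```python
# Rough monthly cost breakdown for the mid-sized facility described above.
# Input figures come from the slide; the amortization split is an estimate.
kwh_rate = 0.07            # $/kWh
critical_load_mw = 15      # IT critical load
pue = 1.7                  # total facility power / IT power
facility_cost = 200e6      # amortized over 15 years
servers = 50_000
server_cost = 2_000        # $ each, 3-year life

hours_per_month = 730
# Fully burdened power: IT load scaled by PUE to include cooling/distribution.
power_month = critical_load_mw * 1000 * pue * hours_per_month * kwh_rate
facility_month = facility_cost / (15 * 12)
servers_month = servers * server_cost / (3 * 12)

print(f"power    ${power_month/1e6:.2f}M/month")    # ~ $1.30M
print(f"facility ${facility_month/1e6:.2f}M/month") # ~ $1.11M
print(f"servers  ${servers_month/1e6:.2f}M/month")  # ~ $2.78M
```

Today servers dominate; as their price falls while power and PUE-driven infrastructure costs rise, the power line overtakes the rest, which is exactly the slide's point.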
17. Power
- The EPA released a report stating:
  - In 2006, data centers used 61 terawatt-hours of power
  - Total power bill: $4.5 billion
  - 7 GW peak load (15 power plants)
  - This was 1.5% of all US electrical energy use
  - Expected to double by 2011
- Power accounts for 30% of data center costs
- Only 20-30% CPU utilization
  - Causes: uneven app fit, varying demand, over-provisioning, etc.
- A deeper look and a few ideas...
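The EPA figures above hang together, as a quick check shows (the $0.07/kWh rate is an assumption carried over from the facility-cost slide):

```python
# Sanity-check of the EPA figures cited above (rate of $0.07/kWh assumed).
annual_twh = 61                     # US data-center consumption, 2006
rate_per_kwh = 0.07                 # $/kWh, assumed industrial rate
hours_per_year = 8760

annual_kwh = annual_twh * 1e9       # 1 TWh = 1e9 kWh
annual_bill = annual_kwh * rate_per_kwh
average_gw = annual_twh * 1000 / hours_per_year  # TWh/yr -> average GW draw

print(f"annual bill:  ${annual_bill/1e9:.1f}B")  # close to the $4.5B quoted
print(f"average draw: {average_gw:.1f} GW")      # near the 7 GW peak figure
```

That the average draw is already near the quoted 7 GW peak reflects how flat data center load is: servers burn most of their power even at 20-30% utilization.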
18. Power and Cooling Are Expensive!
- Infrastructure for power and cooling costs a lot
  - Infrastructure plus energy > server costs since 2001
  - Infrastructure alone > server costs since 2004
  - Energy alone > server costs since 2008
- Cost-effective to discard energy-inefficient servers
  - Power savings mean infrastructure savings!
  - Like airlines retiring fuel-guzzling airplanes
19. What can we do about power costs?
- Data centers use 1.5% of US electricity
  - $4.5 billion annually
  - 7 GW peak load (15 power plants)
  - 44.4 million mt CO2 (0.8% of emissions)
- Rethink environmentals
  - Run them in a wider range of conditions
  - Christian Belady's "in a tent" data center experiment
- Rethink UPS
  - Google's battery per server
- Rethink architecture
  - Intel Atom and power states
  - Marlowe Project
20. Marlowe: The Big Sleep
- Adaptive resource management
  - Monitor the data center and its apps
  - Use a rules engine and fuzzy logic to control resources for most current workloads
  - Keep spare capacity available
- Sleep/hibernate: 3-4 watts (vs. 28-36 watts for active Atom servers)
- 5-45 sec. to reactivate a server
Created by Navendu Jain, CJ Williams, Dan Reed, and Jim Larus
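The wattage figures above suggest the shape of the policy. A hypothetical sketch only: Marlowe itself uses a rules engine with fuzzy logic, and `plan_power_states`, its headroom parameter, and the demand numbers are all invented for illustration:

```python
import math

def plan_power_states(demand_servers, total_servers, headroom=0.2):
    """Keep enough servers awake to cover demand plus a safety headroom
    (reactivation takes 5-45 s), and hibernate the rest."""
    active = min(total_servers, math.ceil(demand_servers * (1 + headroom)))
    return active, total_servers - active

# Illustrative numbers: 600 servers' worth of demand in a 1,000-server pool.
active, asleep = plan_power_states(demand_servers=600, total_servers=1000)
# Hibernating saves roughly 32 W - 3.5 W per server (mid-range of the
# figures above).
watts_saved = asleep * (32 - 3.5)
print(active, asleep, f"{watts_saved:.0f} W saved")
```

The headroom trades energy for latency: the bigger it is, the less often a request must wait tens of seconds for a server to wake.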
21. Microsoft's Data Center Evolution
22. What is "cloud computing"?
23. So What is Cloud Computing?
- Using a remote data center to manage scalable, reliable, on-demand access to application services and data.
  - Scalable means possibly millions of simultaneous users of the app, exploiting thousand-fold parallelism in the app.
  - Reliable means five nines; on-demand means available right now.
- Three new aspects of cloud computing:
  - The illusion of infinite computing resources available on demand
  - The elimination of an upfront commitment
  - The ability to pay for computing resources on a short-term basis, as needed
24. Platform Extension to the Cloud is a Continuum
- New capabilities
- New cost structure
- Requires embracing a specific app model
- A hosted version of what you have been using so far
- Requires few changes, if any, to what you know and do
What You've Been Using So Far
25. Spectrum of Application Models
26. Azure Programming Model
- Abstract programming model
- In-band communication and software control
- Load balancers
- Switches
- Highly available fabric controller
27. The Azure Fabric
- Consists of a (large) group of machines, all of which are managed by software called the fabric controller.
- The fabric controller is replicated across a group of five to seven machines, and it owns all of the resources in the fabric.
- Because it can communicate with a fabric agent on every computer, it is also aware of every Windows Azure application in this fabric.
28. Roles: Scalable, Fault-Tolerant, Stateless
- A scalable architecture is critical to take advantage of scalable infrastructure.
- Queues decouple different parts of the app, making it easier to scale the parts independently.
  - Flexible resource allocation: different priority queues and separation of backend servers to process different queues.
  - Queues mask faults in worker roles.
- Roles are mostly stateless processes running in a Windows Server 2008 VM on one or more cores.
  - Web roles provide web-service access to the app and generate tasks for worker roles.
  - Worker roles do the heavy lifting and manage data in tables/blobs.
  - Communication is through queues.
  - The number of instances can scale with load.
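The web-role/queue/worker-role pattern described above can be sketched with an in-process queue standing in for an Azure queue. All names are illustrative, not the Azure API; the point is the decoupling, which lets the two sides scale independently:

```python
import queue
import threading

tasks = queue.Queue()     # stands in for an Azure queue between the roles
results = {}

def web_role(request_id, payload):
    # Web role: accept a request, enqueue a task, stay stateless.
    tasks.put((request_id, payload))

def worker_role():
    # Worker role: pull tasks, do the heavy lifting, store the result.
    while True:
        request_id, payload = tasks.get()
        if request_id is None:      # sentinel: shut this worker down
            break
        results[request_id] = payload.upper()   # placeholder "work"
        tasks.task_done()

workers = [threading.Thread(target=worker_role) for _ in range(4)]
for w in workers:
    w.start()
for i, data in enumerate(["alpha", "beta", "gamma"]):
    web_role(i, data)
tasks.join()                        # wait until every task is processed
for _ in workers:
    tasks.put((None, None))         # one sentinel per worker
for w in workers:
    w.join()
print(results)
```

Because neither side holds state outside the queue, a crashed worker simply leaves its task to be picked up by another instance, which is how queues "mask faults in worker roles."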
29. Storage: Blobs, Tables, Queues, and a Full Relational Database
- The simplest way to store data in Azure storage is to use blobs.
  - A blob contains binary data and can be big, up to 50 GB each; blobs can also have associated metadata.
- Each table holds some number of entities; an entity contains zero or more properties.
- SQL Data Services provide the SQL data platform in the cloud.
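A toy in-memory model of the two storage primitives described above. This is a sketch of the concepts, not the Azure storage API; the function names and limits-as-asserts are invented for illustration:

```python
# Blobs: named binary values (up to 50 GB) with optional metadata.
# Table entities: property bags addressed by a two-part key.
blobs = {}
tables = {}

def put_blob(container, name, data, metadata=None):
    assert len(data) <= 50 * 2**30, "blobs are limited to 50 GB"
    blobs[(container, name)] = (data, metadata or {})

def insert_entity(table, partition_key, row_key, **properties):
    # An entity is just zero or more properties under (partition, row).
    tables.setdefault(table, {})[(partition_key, row_key)] = properties

put_blob("genomes", "db1.fasta", b"ACGT...", metadata={"source": "demo"})
insert_entity("jobs", "hiv-study", "job-001", cpu_hours=15, status="done")
print(tables["jobs"][("hiv-study", "job-001")]["status"])
```

The two-part key matters for scale: entities sharing a partition key can be served together, while different partitions can live on different machines.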
30. Back to the Future (again...)
- Mid-1980s: the invention of client/server databases
  - Data locked up in mainframe DBs
  - Closed, monolithic trust boundary
  - PCs? Spreadsheets and terminal emulation
  - Networks, lots of them: DECnet, IPX, SNA, Banyan Vines, TCP/IP
- Client/server database challenges
  - Had to invent a network abstraction layer, formats, protocols
  - Had to consider latency and concurrency control
  - Had to move the trust boundary
  - Wound up with only 60% of the incumbent's capability; could easily have been dismissed as a failure
- End result
  - Data was made accessible where it could be used in a new way
  - Client/server databases are now viewed as tremendously successful
31. Data in a Cloud Services World
- Cloud database service challenges
  - Same as the client/server DBMS shift: formats, protocols, authentication, authorization, latency, trust boundary
  - Will not do 100% of what client/server databases can do
- Cloud database service capabilities
  - The data boundary moves from the corporate LAN to the internet
  - Utility DBMS for cloud applications
  - Expect new capabilities and a new value proposition
32. Cloud Platform: Strategic Differentiator and Economics
- Competitive advantage AND economics
[Chart: competitive advantage over time, from innovation introduced by the first firm]
33. The Economics of Elasticity, by the Numbers
- Example of elasticity
- Elasticity may be more cost-effective even with a higher per-hour charge!
- It takes weeks to acquire and install equipment yourself
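The claim that elasticity can win even at a higher per-hour charge is easy to illustrate. All numbers below are assumptions made up for the arithmetic, not figures from the slide:

```python
# Own-the-peak vs. rent-elastically, with assumed rates and demand shape.
owned_rate = 0.05        # $/server-hour, fully amortized ownership
cloud_rate = 0.12        # $/server-hour, pay as you go (2.4x higher)
hours = 8760             # one year
peak_servers = 100       # capacity needed at peak
peak_fraction = 0.10     # peak demand lasts 10% of the year
base_servers = 20        # demand the rest of the time

# Owning: all 100 servers are billed all year, busy or idle.
owned_cost = peak_servers * hours * owned_rate
# Renting: pay only for servers actually in use, at the higher rate.
cloud_cost = cloud_rate * hours * (
    peak_fraction * peak_servers + (1 - peak_fraction) * base_servers
)
print(f"owned  ${owned_cost:,.0f}")
print(f"cloud  ${cloud_cost:,.0f}")
```

With this demand shape the elastic option is cheaper despite a 2.4x hourly premium, because the owned fleet sits mostly idle; the spikier the demand, the larger the gap.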
34. The Cloud Empowers the Long Tail of Research
- Research funding model
  - Have a good idea
  - Write a proposal
  - Wait 6 months
  - If successful, wait 3 months to get...
  - Install computers
  - Start work
- Science start-up model
  - Have a good idea
  - Write a business plan
  - Ask VCs for funding
  - If successful...
  - Install computers
  - Start work
- Cloud computing model
  - Have a good idea
  - Grab nodes from a cloud provider
  - Start work
  - Pay for what you actually used
Poised to reach a broad class of new users
Slide compliments of Paul Watson, University of Newcastle (UK)
35. Emergence of a Fourth Research Paradigm
- A thousand years ago: experimental science
  - Description of natural phenomena
- Last few hundred years: theoretical science
  - Newton's laws, Maxwell's equations
- Last few decades: computational science
  - Simulation of complex phenomena
- Today: data-intensive science
  - Scientists overwhelmed with data sets from a variety of different sources
  - Data captured by instruments and sensor networks
  - Data generated by simulations
  - Data generated by computational models
Astronomy was one of the first disciplines to embrace data-intensive science, with the Virtual Observatory (VO) enabling highly efficient access to data and analysis tools at a centralized site. The image shows the Pleiades star cluster from the Digitized Sky Survey combined with an image of the moon, synthesized within the WorldWide Telescope.
With thanks to Jim Gray
36. Science Example: PhyloD as an Azure Service
- A statistical tool used to analyze DNA of HIV from large studies of infected patients
- PhyloD was developed by Microsoft Research and has been highly impactful
- Small but important group of researchers
  - 100s of HIV and HepC researchers actively use it
  - 1000s of research communities rely on the results
- Typical job: 10-20 CPU hours; extreme jobs require 1K-2K CPU hours
- Requires a large number of test runs for a given job (1-10M tests)
- Highly compressed data per job (~100 KB per job)
Cover of PLoS Biology, November 2008
37. Metagenomics Atop Azure
- Metagenomics: ecosystem characterization
- Map-Reduce-style parallel BLAST; the user selects the databases and an input sequence
  - 50 roles: speedup of 45
  - 100 roles: speedup of 94
- Basic map-reduce: a 2 GB database per worker, a 500 MB input file
- Pipeline components:
  - BLAST web role
  - Input-splitter worker role
  - BLAST execution worker roles 1 through n
  - Combiner worker role
  - Azure blob storage: BLAST DB configuration, genome DBs 1 through K
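The splitter/worker/combiner pipeline above can be sketched structurally. Simple substring matching stands in for BLAST, and Python lists stand in for blob storage; everything here is an illustrative assumption, not the actual service:

```python
# Structural sketch of the pipeline: splitter -> N workers -> combiner.
def splitter(input_sequences, n_workers):
    # Input-splitter role: partition the input file across workers.
    return [input_sequences[i::n_workers] for i in range(n_workers)]

def blast_worker(chunk, genome_db):
    # Execution worker role: search its chunk against the genome database
    # (substring match stands in for a real BLAST alignment).
    return [(seq, hit) for seq in chunk for hit in genome_db if seq in hit]

def combiner(partial_results):
    # Combiner role: merge per-worker hit lists into one result set.
    return [hit for part in partial_results for hit in part]

genome_db = ["ACGTACGT", "TTGGACGA", "CCACGTTT"]   # stand-in for blob storage
queries = ["ACGT", "GGA", "TTT"]                   # stand-in for the input file
parts = splitter(queries, n_workers=2)
hits = combiner([blast_worker(c, genome_db) for c in parts])
print(hits)
```

The near-linear speedups quoted above (45 at 50 roles, 94 at 100) follow from this shape: workers share nothing, so only the split and combine steps are serial.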
38. Reference Data on Azure
- Ocean science data on Azure SDS (relational)
  - Two terabytes of coastal and model data
- Computational finance data on SDS (relational)
  - BATS: daily tick data for stocks (10 years)
  - XBRL call reports for banks (10,000 banks)
- Storing select seismic data on Azure for an NSF-funded consortium that collects and distributes global seismological data
  - Data sets requested by researchers worldwide
  - Includes HD videos, seismograms, images, and data from major seismic events
39. Takeaways
- Data centers and HPC: like twins separated at birth
  - Interconnect, storage, and efficient virtualization
- Data centers are evolving at a blistering pace, driven by economics
  - The economics are changing in favor of cloud computing
  - Big data centers offer big economies of scale
  - Cloud computing transfers risk away from application providers
- The application model for cloud computing is evolving
  - Advantages to being close to the metal versus advantages to higher-level abstractions
  - Just because the infrastructure is scalable doesn't mean the app is!
- There are many obstacles to ubiquitous cloud computing
  - But the economic forces will dominate the obstacles
  - There's too much to gain; it will grow!
40. Roger Barga, Architect, Cloud Computing Futures Group, Microsoft Research (MSR)
Cloud Computing for Research
Q & A