Berkeley RAD Lab Technical Vision

1
Berkeley RAD Lab Technical Vision

Armando Fox, Randy Katz, Michael Jordan, Dave
Patterson, Scott Shenker, Ion Stoica
RADS Retreat, June 2005

2
Outline

Overall Vision
Internet Services Vision (ServRADS)
Network Vision (NetRADS)
Internet Services and Network architecture
Principles and Summary

3
Overarching Mantra

Enable a faster pace of network service
innovation through new distributed system
architectures that reduce operations cost by
2-3 orders of magnitude
The Challenge
Software systems: too much information => make
sense of it through statistical learning and
control theory
Network systems: too little information =>
exploit better observation and monitoring in the
network infrastructure to drive management
processes

4
In practice this means

A single person can write, deploy, and operate the
next-generation IT business (the "Fortune 1
million")
Do for Internet apps what the Web did for individual
publishing
Gray's challenge: a planetary-scale distributed
system operated by a single part-time operator
Goal: programmers focus on functionality; put
the "-ility" in the platform
Could be built on utility computing, giving
access to distributed physical resources
Integrated approach to network and server/service
management
Requires a 100x-1000x reduction in TCO (total cost
of ownership) from today's levels

5
What things are like today

World-scale services are created and operated by
expert teams
It takes a Google-sized organization to create a Google
Amazon's book browsing, designed by programmers,
is cumbersome
Browsing for housewares, designed by domain
experts on mature infrastructure, is more usable
We don't know what the next killer app will be!
The NOW project didn't predict Internet search as a
killer app for NOWs
If we succeed, the next killer Internet app will
be written, deployed, and operated, at Google-like
scales, by a single programmer

6
Focusing on lowering cost of ownership

Standard way to account for "where the money
goes" in operating a deployed distributed
application
Definition independent of who is operating the
app
Operators per byte of storage or per CPU? No,
doesn't scale with technology changes
Operators per end-user served? (This is the
figure of merit for e-tailers)
Operators per geographic region served?
Operators per $ spent on capital cost?
Operators per $ of revenue?

7
Outline

Overall Vision
Internet Services Vision (ServRADS)
Network Vision (NetRADS)
Internet Services and Network architecture
Principles and Summary

8
Enabling Technologies for Reducing TCO in ServRADS

Past successes
Microrebooting: fast recovery makes false
positives tolerable
Pinpoint: using statistical learning theory (SLT)
to detect and localize fine-grain failures
Visualization and SLT to help operators and earn
their trust
Elements of technical vision
SLT and machine learning
Operator-centric visualization
Control theory
Open source failures database (sanitized, open
failures forensics repository)

9
Example scenarios

Helping operators make sense of instrumentation
Using ML techniques to localize failures (P.
Bodik, E. Kiciman)
Using automatically-induced statistical models to
identify likely causes of performance problems
(S. Zhang, I. Cohen et al.)
Combining SLT with visualization for
cross-checking problem reports and rapidly
spotting potential problems visually
Facilitating self-tuning/configuration
Using control theory to improve performance of a
distributed streaming database (W. Xu)
Service placement in wide-area distributed system
(D. Oppenheimer)
Microreboots (G. Candea) and microreplacement (S.
Kawamoto) as low-cost prevention/repair
strategies
If false positive cost can be kept low, automate.
Otherwise, help operator do her job.
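
The automate-or-escalate rule above can be read as an expected-cost comparison. The following is a minimal sketch of that reading, not something taken from the slides: the function name, the probability, and both cost parameters are hypothetical, and costs are in arbitrary units (for example, seconds of lost service).

```python
def should_automate(p_false_positive: float,
                    cost_false_positive: float,
                    cost_operator_delay: float) -> bool:
    """Return True if automatic recovery is expected to be cheaper.

    p_false_positive:    assumed probability the detector fired on a healthy component
    cost_false_positive: assumed cost of a needless recovery action (e.g. a microreboot)
    cost_operator_delay: assumed cost of waiting for a human when the failure is real
    """
    if not 0.0 <= p_false_positive <= 1.0:
        raise ValueError("p_false_positive must be in [0, 1]")
    # Acting automatically only hurts when the alarm was false.
    expected_cost_auto = p_false_positive * cost_false_positive
    # Waiting for the operator only hurts when the alarm was real.
    expected_cost_wait = (1.0 - p_false_positive) * cost_operator_delay
    return expected_cost_auto <= expected_cost_wait


# Cheap recovery (microreboot-like): automate even with a noisy detector.
print(should_automate(0.3, cost_false_positive=1.0, cost_operator_delay=60.0))    # True
# Expensive recovery (full restart-like): hand the decision to the operator.
print(should_automate(0.3, cost_false_positive=600.0, cost_operator_delay=60.0))  # False
```

Under these assumptions, lowering the cost of a false positive is what moves a recovery action from the "help the operator" side to the "automate" side, which is the argument the slide makes for microreboots.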

10
Services example: combining viz and SLT

11
Reduce TCO via Planetary-scale Abstractions

Inspiration: narrowly-focused planetary-scale
abstractions whose design and implementation...
scale well: understand distributed scheduling,
locality, symptoms of wide-area failures
are monitorable and controllable (using SLT and
linear control theory)
retain precisely-quantifiable and acceptable
semantics under partial-failure conditions
Examples of existing narrow but powerful
services
MapReduce in Google: understands data locality
Can easily imagine a "lossy MapReduce", like
online aggregation
Queues/messaging in Yahoo, Amazon, others
User information database in Yahoo
Instrumentation collection and analysis services
using Telegraph-CQ
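
One way to picture a "lossy MapReduce" with quantifiable semantics under partial failure is an aggregation that skips unreachable shards and reports how much of the input it actually covered. The sketch below is an illustration only; it is not Google's MapReduce API, and `lossy_word_count`, `read_shard`, and the shard data are all made up for the example.

```python
from collections import Counter


def lossy_word_count(shards, read_shard):
    """Count words over whichever shards can be read.

    Returns (counts, coverage), where coverage is the fraction of shards
    that contributed to the result. Callers decide whether that fraction
    is acceptable instead of the whole job failing.
    """
    totals = Counter()
    shards_read = 0
    for shard in shards:
        try:
            lines = read_shard(shard)
        except OSError:
            continue  # partial failure: skip the shard, reflect it in coverage
        for line in lines:
            totals.update(line.split())  # map: emit words; reduce: sum per word
        shards_read += 1
    coverage = shards_read / len(shards) if shards else 1.0
    return totals, coverage


# Made-up input: shard "c" is unreachable.
SHARDS = {"a": ["x y", "x"], "b": ["y"], "c": None}


def read_shard(name):
    lines = SHARDS[name]
    if lines is None:
        raise OSError(f"shard {name} unavailable")
    return lines


counts, coverage = lossy_word_count(["a", "b", "c"], read_shard)
print(dict(counts), round(coverage, 2))  # {'x': 2, 'y': 2} 0.67
```

Returning the coverage alongside the answer is the design choice that matters here: the result is approximate, but the degree of approximation is stated, in the spirit of online aggregation.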

12
Outline

Overall Vision
Internet Services Vision (ServRADS)
Network Vision (NetRADS)
Internet Services and Network architecture
Principles and Summary

13
RADS Network Problem

Internet routing has proven to be robust
But
Poor visibility: hard to determine health of the
network
Routing policy interactions defeat propagation of
useful diagnostic info: difficult to identify
root cause of problems
Slow reaction times to connectivity failures:
operator intervention (across admin domains)
increases cost of ownership
Key observation: network service failures are
attributed to unexpected traffic surge patterns
Approach: identify and protect "good" traffic
during surge
Mechanism deployed in network edge
It's where the servers and clients are located
Greatest need for lowering management costs
Administrative scope and responsibility is
well-defined

14
iBoxes: New network element for Observe,
Analyze, Act
[Figure: Enterprise Network Architecture]
Inspection-and-Action Boxes
Deep multiprotocol packet inspection
No routing; observation and marking
Policing points: drop, fence, block

15
Network-Level Observe-Analyze-Act

Observe
Packet, path, protocol, service invocation
statistical collection and sampling: frequencies,
latencies, completion rates
Construct the collection infrastructure
Analyze
Determine correlations among observations
"Normal" model discovery and anomaly detection
Exploit SLT
Act
Experiment to test correlations
Prioritize and throttle
Mark and annotate
Control theory? Distributed analyses and actions
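
The "normal model, then anomaly detection" step in Analyze can be illustrated with a very simple stand-in. The sketch below uses a plain z-score test against a baseline window; the slides do not say which SLT model the project uses, so the technique, the function name, the threshold, and the sample request rates are all assumptions made for illustration.

```python
from statistics import mean, stdev


def is_anomalous(baseline, observed, threshold=3.0):
    """Flag `observed` if it lies more than `threshold` standard deviations
    from the mean of the baseline window (the "normal" model)."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return observed != mu  # flat baseline: any deviation is unusual
    return abs(observed - mu) / sigma > threshold


# Made-up DNS requests per second seen during normal operation.
normal_rates = [100, 104, 98, 101, 97, 102]

print(is_anomalous(normal_rates, 103))  # False: within normal variation
print(is_anomalous(normal_rates, 450))  # True: surge well outside the baseline
```

A detector like this only says that something is unusual; the Act step's experiments are still needed to test whether a correlated observation is actually the cause.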

16
Network Layer Mechanism: Annotations

Enhance network visibility: disseminate
observations, communicate actions, provide
in-band network management actions, iBox-to-iBox
communications
iBoxes label packets at annotation layer but do
not rewrite packet contents
Annotations stack, and must be removed from packets
before delivery to A-layer-unaware end nodes
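
The three annotation rules above (label without rewriting, stack, strip before delivery to unaware nodes) can be sketched with a small data structure. This is a hypothetical model for illustration; the `Packet` class, the function names, and the annotation strings are not from the slides and do not describe an actual iBox wire format.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    payload: bytes
    # Annotation stack: the last element is the most recently pushed label.
    annotations: list = field(default_factory=list)


def annotate(packet, ibox_id, note):
    """An iBox pushes a label onto the stack; the payload is left untouched."""
    packet.annotations.append((ibox_id, note))
    return packet


def deliver(packet, endpoint_understands_annotations):
    """Strip the whole stack before handing the packet to an unaware end node."""
    if not endpoint_understands_annotations:
        packet.annotations.clear()
    return packet


pkt = Packet(payload=b"dns-query")
annotate(pkt, "I-S", "dns-rate-high")   # hypothetical observation
annotate(pkt, "I-I", "low-priority")    # hypothetical action marking
print(len(pkt.annotations))             # 2

deliver(pkt, endpoint_understands_annotations=False)
print(pkt.annotations, pkt.payload)     # [] b'dns-query'
```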

17
Scenario: Traffic Surge Inhibiting Network
Services
[Figure: enterprise network diagram. Labels: Internet Edge, Distribution Tier, Access Edge, Server Edge; iBoxes II, IA, IS; nodes R, S, E; Primary and Secondary DNS Servers, Mail Server, Spam Appliance]

DNS Server swamped by excessive request traffic
Observe: DNS timeouts and slowed Web access
traffic, but also higher than normal mail delivery
latency, implying a busy server edge (correlation
between Mail Server and DNS Server utilization?)
Root Cause: High DNS request rates generated by
Spam Appliance, triggered by mail surge
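
The question of whether Mail Server and DNS Server utilization move together can be checked with a correlation coefficient. The sketch below computes the Pearson correlation by hand; the utilization samples are invented for illustration and are not measurements from the scenario.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up utilization samples taken at the same instants during a surge.
mail_util = [0.20, 0.25, 0.60, 0.85, 0.90]
dns_util = [0.15, 0.20, 0.55, 0.80, 0.95]

r = pearson(mail_util, dns_util)
print(r > 0.9)  # True: the two series rise together
```

A high coefficient supports the suspicion but does not by itself establish that mail traffic is driving the DNS load; it only tells the operator or the iBox which relationship is worth testing.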

18
Scenario
[Figure: same enterprise network diagram as on slide 17]

How Diagnosed?
I-S detects high link utilization but abnormally
high DNS traffic
Stats from I-I: high mail traffic, low outgoing
web traffic, inbound traffic high but link
utilization not high
Stats from I-A: lower web traffic, no unusual
mail origination
Problem localized to Server Edge, but visibility
limited: RADS can help

19
Scenario
[Figure: same enterprise network diagram as on slide 17]

Possible Action Responses
Experiment: Redirect local DNS requests to
Secondary DNS server; if these complete, can
infer the server is the problem, not the network
Throttle: Due to Mail Server-DNS correlation,
block/slow email traffic at Server Edge; should
expect reduced DNS server utilization
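
The Throttle response needs some rate-limiting mechanism at the Server Edge. The slides do not name one; a token bucket is a common choice and is used here purely as an illustration. The class, its parameters, and the arrival times are assumptions, and time is passed in explicitly so the example is deterministic.

```python
class TokenBucket:
    """Admit at most `rate_per_sec` messages per second on average,
    with bursts of up to `burst` messages."""

    def __init__(self, rate_per_sec, burst):
        self.rate = float(rate_per_sec)
        self.capacity = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # time of the most recent arrival seen

    def allow(self, now):
        """Return True if a message arriving at time `now` (seconds) may pass."""
        elapsed = max(0.0, now - self.last)
        # Refill for the time that has passed, never beyond the burst size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical policy: 2 mail messages per second, bursts of 3.
bucket = TokenBucket(rate_per_sec=2, burst=3)
arrivals = [0.0, 0.0, 0.0, 0.0, 0.5, 0.5]
print([bucket.allow(t) for t in arrivals])
# [True, True, True, False, True, False]
```

Messages that are refused could be dropped, delayed, or marked, matching the drop, fence, and block policing actions listed for iBoxes; which one applies is a policy decision outside this sketch.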

20
Outline

Overall Vision
Internet Services Vision (ServRADS)
Network Vision (NetRADS)
Internet Services and Network architecture
Principles and Summary

21
Embodying principles in a prototype

Platform architecture and prototype to enable
rapid innovation in network services by
non-experts
automatically accommodates scaling, provisioning,
failure management
multi-datacenter (geoplexed)
observable networks connecting datacenters
potentially planetary scale
runs with minimal operator oversight
Prototype keeps various research projects focused
on common goal and allows ongoing testing
Participation in standards processes to promote
best practices in platform as open standards

22
Reliable Adaptive Distributed Systems
[Figure: layered RADS architecture. Labels: Operator; User; Prototype Applications; Programming Abstractions for Roll-back and wide-area distributed computations; SLT Services; Crash-only services; Observation Infrastructure for System SLT; Application-Specific Overlay Network; Checkable Protocols; Fast Detection, Route Recovery; Observation Infrastructure for network SLT; iBox; Edge Network; Commodity Internet]

23
Generic iBox Architecture
[Figure: generic iBox architecture. Labels: "Tag Mem", "Rules Programs"]

24
Possible architecture of a rack
[Figure: possible architecture of a rack. Labels: app. server application (e.g. J2EE); Microrecovery actions; Datacenter boundary; From other datacenters; To other datacenters; High-level effectors; SLT algo. (several instances); Control loops; High-level sensor data; Externally-induced failures, workload changes, etc.; T-CQ engine; Sanitized data; Preprocessed data; Visualization; Syndrome identification]

25
Outline

Overall Vision
Internet Services Vision (ServRADS)
Network Vision (NetRADS)
Internet Services and Network architecture
Principles and Summary

26
ServRADS Observations Summary

SLT algorithms make sense of large amounts of
data
Classification, outlier/anomaly detection,
clustering, etc.
Viz helps operator use visual pattern
recognition to quickly spot problems and
cross-check SLT models
Enables operator expertise to be quickly brought
to bear
Builds operators' trust in statistical/machine
learning models
Challenge
Fundamental challenges associated with applying
SLT to problem determination (coming up next
session)
Unifying many techniques into a coherent approach
- prototype platform as unifying artifact
Idea: capture best practices in TCO-optimized,
planetary-scale abstractions

27
NetRADS Observations Summary

COPS: Paradigm for (more) automatically
protecting critical resources when network is
under stress
Checkable protocols: visible semantics
Observe network behavior: good (easy), bad
(hard), suspicious
Protect services: throttle, redirect
Network management: major contributor to TCO
NetRADS built on
iBoxes: pervasive infrastructure for observation
and action at the network level
Annotation Layer: for marking, control,
inter-iBox communications
Integration with Internet service approach for
service/server-level visibility and integrated
management