Title: Experimentally inferring runtime datacenter dependencies with X-Trace
1 Experimentally inferring runtime datacenter dependencies with X-Trace
- George Porter - Winter 2007 retreat
2 Relevance to RAD Lab
[Figure: RAD Lab architecture diagram (datacenter cluster or RAMP). Recoverable labels: Policy Maker; Per-node SW stack; Load-Balancer (IDLB); Intrusion-Detection (IDID); Service (IDS); Firewall (IDF); Web 2.0 Applications; Ruby on Rails Interpreter; Web Svc APIs; Path traces and statistics; Trace, X-Trace; Local OS functions; trace, X-Trace, Lib log, D-trigger; Identity-based Routing Layer; Virtual Machine Monitor; Actuator Commands; Sensor Data]
1. Energy? 2. 1 person run Killer Apps?
3 Simple datacenter scenario
- Applications
- Web
- Email
- Network services
- Remote file storage (NFS)
- Naming (DNS)
- Composition
- Service path
- Multilayer
- Task tree
4 Example scenario: observed task tree with X-Trace
- Multilayer service path or task tree
- Static dependencies
- Shared nodes and edges
- Runtime dependencies
- Concurrent requests sharing a node or edge
- Typically in a way that affects throughput, delay, or performance
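As an illustration of static dependencies, the sketch below finds the nodes and edges that two service paths share. The path representation (an ordered list of node names) and all node names are hypothetical; real X-Trace task trees carry more structure than this.

```python
def edges(path):
    """Return the set of directed (from, to) hops along a path."""
    return set(zip(path, path[1:]))

def static_dependencies(path_a, path_b):
    """Nodes and edges that appear on both service paths."""
    shared_nodes = set(path_a) & set(path_b)
    shared_edges = edges(path_a) & edges(path_b)
    return shared_nodes, shared_edges

# Hypothetical paths: both services reach DNS through the same switch.
web_path = ["client", "lb", "web1", "switch", "dns"]
email_path = ["mta", "spamcheck", "switch", "dns"]

nodes, links = static_dependencies(web_path, email_path)
print(nodes, links)
```

A shared node or edge is only a candidate: it becomes a runtime dependency when concurrent requests actually contend for it.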
5 Flash traffic effect on dependent services
- DNS dependency example
- Email server checks for spam by examining hostnames
- Webserver uses client's hostname for access control
- Surge of junk email arrives
- Spam checker floods requests to DNS server
- DNS server becomes overloaded; DNS latency increases
- Web authentication latency increases
- Web throughput decreases
Different applications (web, email) interfere with each other. The source of the degraded performance is non-obvious.
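To make the interference concrete, here is a minimal sketch that treats the DNS server as a single M/M/1 queue. The queueing model, the service rate, and the request rates are all illustrative assumptions, not measurements from this talk.

```python
def mm1_latency(arrival_rate, service_rate):
    """Mean time in system (seconds) for an M/M/1 queue; inf if overloaded."""
    if arrival_rate >= service_rate:
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)

DNS_SERVICE_RATE = 1000.0  # lookups/sec the DNS server can serve (assumed)
WEB_LOOKUPS = 200.0        # lookups/sec from web access control (assumed)

def dns_latency(spam_lookups):
    """DNS latency seen by every client, including the web server."""
    return mm1_latency(WEB_LOOKUPS + spam_lookups, DNS_SERVICE_RATE)

normal = dns_latency(spam_lookups=100.0)  # light email load
surge = dns_latency(spam_lookups=750.0)   # junk email surge
print(f"normal: {normal * 1000:.2f} ms, surge: {surge * 1000:.2f} ms")
```

The web workload never changes in this sketch, yet the latency of its DNS lookups grows by more than an order of magnitude once the spam checker pushes the shared server close to saturation.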
6 Relevance of problem to RAD Lab
- Flash traffic / unexpected traffic patterns
- What will happen during site growth?
- Migration for power savings
- Two perspectives on mechanism
- Reduce offered capacity to save power
- vs. increasing offered capacity to handle increased demands
- Virtualization (Identity-based routing layer)
- Independent, virtualized services might be co-located
- However, this layer may help if it provides a service path
- This work is in initial stages; comments and feedback greatly appreciated
7 Assumptions
- X-Trace deployed at least partially, creating regions of network observations
- Can measure req/sec, transaction and operation rates, latencies, flow rates
- Interested in inferring dependencies on unobserved resources
- links, back-end servers, etc.
- Can generate and/or delay network traffic at key points in the network
- Used for proactive analysis discussed later
8 Summary of Approach
- Observe
- Dynamic, path-based data
- Network policy, SLAs, QoS, service topology, etc.
- Analyze
- Model expected service performance based on path observations
- Use deviations from models to infer dependencies and find correlations
- Act
- Modify network (inject or delay traffic) to test correlations
- e.g., delay selected traffic and look for effect elsewhere
- Identify dependencies before demands affect service behavior
9 System Observations
- Macro-level connectivity behavior
- Layer above X-Trace: from individual paths to operations/sec, latencies over time, flow rates, service topology
- Views from multiple network locations
- Develop algorithms
- Inspect and measure relevant paths
- Start with detailed service semantics, then try to generalize
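One way to picture the layer above X-Trace is a small aggregation step that turns individual path observations into per-window rates and latencies. The record format below, a (timestamp, service, latency) tuple, is a hypothetical simplification and not the X-Trace report format.

```python
from collections import defaultdict

def aggregate(reports, window=1.0):
    """Group (timestamp_sec, service, latency_sec) tuples into fixed windows.

    Returns {(service, window_index): (ops_per_sec, mean_latency_sec)}.
    """
    buckets = defaultdict(list)
    for ts, service, latency in reports:
        buckets[(service, int(ts // window))].append(latency)
    return {
        key: (len(lats) / window, sum(lats) / len(lats))
        for key, lats in buckets.items()
    }

# Hypothetical observations spanning two one-second windows.
reports = [
    (0.1, "web", 0.020), (0.4, "web", 0.030), (0.9, "dns", 0.002),
    (1.2, "web", 0.050), (1.7, "dns", 0.004), (1.8, "dns", 0.006),
]
stats = aggregate(reports)
for key in sorted(stats):
    print(key, stats[key])
```

The resulting per-service time series (operations/sec and latency per window) are the kind of macro-level signal the later modeling and anomaly-detection slides assume.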
10 Managing observations
- Semantic complexity
- Host
- Naming, directory services, remote file storage, authentication
- Middleware
- DB pooling (Tomcat), container-managed persistence, RoR
- Application
- Need application knowledge
- Most variability
- Difficult to predict behavior given request(s)
- Integrating dependency information with policy
- Instrumentation backplane
- Cross-layer visibility as a service [Kompella05]
(Increasing difficulty from host to middleware to application)
11 Model extraction and update
- Modeling based on observations
- Representative timeframe and workload?
- Stationary: how long?
- Non-stationary: changes over time, day-of-week, holidays, etc.
- Updating the model
- Frequency, triggering action (new hw, links, software, O/S versions)
- Modeling based on workload
- AWE-gen tie in
- Modeling based on policy
- SLA, QoS parameters, middlebox policies
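A minimal sketch of one possible observation-based model: an expected latency per hour-of-week slot, refreshed with an exponentially weighted moving average. The slot granularity, the EWMA update, and the value of alpha are assumptions chosen for illustration; the talk does not specify a modeling technique, and this sketch ignores holidays and explicit triggers such as hardware changes.

```python
class SeasonalBaseline:
    """Expected latency per hour-of-week slot, updated with an EWMA."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha   # weight of the newest observation (assumed)
        self.expected = {}   # slot index (0..167) -> expected latency

    @staticmethod
    def slot(day_of_week, hour):
        """Map (day 0..6, hour 0..23) to one of 168 weekly slots."""
        return day_of_week * 24 + hour

    def update(self, day_of_week, hour, latency):
        """Blend a new observation into the slot's expected value."""
        s = self.slot(day_of_week, hour)
        old = self.expected.get(s)
        if old is None:
            self.expected[s] = latency
        else:
            self.expected[s] = (1 - self.alpha) * old + self.alpha * latency

    def deviation(self, day_of_week, hour, latency):
        """Observed minus expected; None if the slot has no history yet."""
        old = self.expected.get(self.slot(day_of_week, hour))
        return None if old is None else latency - old

model = SeasonalBaseline(alpha=0.5)
model.update(0, 9, 0.010)   # Monday 09:00, first observation
model.update(0, 9, 0.020)   # expected value moves halfway toward 0.020
print(model.expected[9], model.deviation(0, 9, 0.045))
```

A smaller alpha makes the model slower to absorb change, which is the trade-off behind the "stationary: how long?" question above.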
12 Performance anomaly detection
- Goal: detect dependency
- Based on determining when expected behavior differs from observed
- Across services (e-mail volume affecting web authentication latency, thus throughput)
- Despite typical service variability
- Develop algorithms to detect that deviations
- are related
- are strong enough
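The two tests on deviations ("related" and "strong enough") could be sketched as follows, using z-scores against a baseline for strength and Pearson correlation for relatedness. Both choices, the thresholds `min_z` and `min_corr`, and the sample numbers are assumptions for illustration, not the algorithm developed in this work.

```python
from statistics import mean, pstdev

def zscores(observed, baseline):
    """Standardize observations against a baseline sample."""
    mu, sigma = mean(baseline), pstdev(baseline)
    return [(x - mu) / sigma for x in observed]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def related_deviation(z_a, z_b, min_z=3.0, min_corr=0.8):
    """True if both series deviate strongly and the deviations move together."""
    strong = (max(abs(z) for z in z_a) >= min_z
              and max(abs(z) for z in z_b) >= min_z)
    return strong and pearson(z_a, z_b) >= min_corr

# Hypothetical per-window samples: email volume (msgs/sec), web auth latency (ms).
email_z = zscores([100, 150, 300, 280, 120], baseline=[100, 110, 90, 105, 95])
web_z = zscores([20, 24, 40, 37, 22], baseline=[20, 22, 18, 21, 19])
quiet_z = zscores([20, 21, 19, 20, 21], baseline=[20, 22, 18, 21, 19])

print(related_deviation(email_z, web_z))    # strong and correlated
print(related_deviation(email_z, quiet_z))  # second series never deviates
```

Correlation alone does not establish a dependency, which is why the approach follows up with active experiments.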
13 Deviation from expected signature: shared back-end server
14 Deviation from expected signature: shared link
15 Link vs. server contention?
16 Proactive dependency discovery
- Selectively inject or delay traffic and observe result
- From source -> sink over shared link
- Result
- Alternative
- Delay web1 -> NFS1 and observe result
[Figure: topology with Clients, web1, web2, NFS1, NFS2, source, and sink; numeric labels 1 and 2]
17 Proactive dependency discovery
- Automatically determining experimental plan
- Dynamically on-demand, or based on policy
- Intervention intensity and duration
- Too much -> disrupt traffic
- Too little -> miss dependency
- Detecting effect via change point detection
- Stationarity test, measure means
- More at poster session
- Delay rate R1, measure rate R2 and latency u2
- Treat u2 as a time series, look for deviations during experiment
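As a rough stand-in for the "measure means" step, the sketch below compares u2 samples taken before and during an experiment using Welch's t statistic. This is a simple two-sample comparison rather than full change-point detection; the threshold and the sample values are assumptions, and the statistic treats samples as independent, which real latency time series may not be.

```python
from statistics import mean, variance

def welch_t(before, during):
    """Welch's t statistic for a difference in means between two samples."""
    se = (variance(before) / len(before)
          + variance(during) / len(during)) ** 0.5
    return (mean(during) - mean(before)) / se

def effect_detected(before, during, threshold=3.0):
    """True if the mean shift is large relative to sampling noise."""
    return abs(welch_t(before, during)) >= threshold

# Hypothetical u2 latency samples (ms) for the observed service.
baseline = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
while_delaying_r1 = [12.0, 11.7, 12.4, 12.1, 11.9, 12.2]  # u2 shifts upward
control = [10.0, 10.2, 9.9, 10.1, 9.8, 10.3]              # no intervention

print(effect_detected(baseline, while_delaying_r1))
print(effect_detected(baseline, control))
```

A detected shift in u2 while R1 is being delayed suggests the two flows share a resource; the control run guards against reading ordinary variability as a dependency.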
18 Summary / Discussion
- Detect runtime dependencies using an observe/analyze/act approach
- Based on path traces collected with X-Trace
- Uses
- Handling unexpected traffic, understanding service behavior despite virtualization and migration
- Initial stages, welcome feedback
19 Backup slides
20 X-Trace overview and status
- Collect path-based traces
- Across layers, devices, and ASes
- Design principles
- Trace request sent in-band
- Trace data collected out-of-band
- Decouple trace requestor from trace receiver
- Components
- Per-layer metadata
- Host and server modification to propagate metadata
- Reporting and aggregation framework
- OpenDHT, I3, SQL
- Progress
- New implementation
- HTTP, IP, I3, SQL, Chord
- C and Java
- Early adopters
- DONA, Coral CDN
- To appear at NSDI '07
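The in-band/out-of-band split can be illustrated with a toy model: metadata travels with the request, each operation emits a report to a separate collector, and the task tree is rebuilt from the reports afterwards. Every name here (`start_task`, `propagate`, `task_tree`, the report fields) is hypothetical and simplified; this is not the X-Trace API or its metadata format.

```python
import itertools
from collections import defaultdict

_ids = itertools.count(1)
reports = []  # stands in for the out-of-band report collector

def report(task_id, op_id, parent_id, label):
    """Record one operation; in a real system this leaves the data path."""
    reports.append({"task": task_id, "op": op_id,
                    "parent": parent_id, "label": label})

def start_task(task_id, label):
    """Begin a task and return the metadata carried in-band with the request."""
    op_id = next(_ids)
    report(task_id, op_id, None, label)
    return (task_id, op_id)

def propagate(metadata, label):
    """Derive metadata for a child operation and report the parent link."""
    task_id, parent_op = metadata
    op_id = next(_ids)
    report(task_id, op_id, parent_op, label)
    return (task_id, op_id)

def task_tree(task_id):
    """Rebuild the task tree as nested (label, children) tuples."""
    children = defaultdict(list)
    labels = {}
    for r in reports:
        if r["task"] == task_id:
            labels[r["op"]] = r["label"]
            children[r["parent"]].append(r["op"])

    def build(op):
        return (labels[op], [build(c) for c in children[op]])

    return [build(root) for root in children[None]]

md = start_task("req-1", "HTTP GET")
propagate(md, "DNS lookup")
nfs = propagate(md, "NFS read")
propagate(nfs, "IP hop")
print(task_tree("req-1"))
```

Because the reports carry only identifiers and parent links, the party that rebuilds the tree does not have to be the party that issued the request.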
21 Role of workload
- Increasing request rate may not affect service under test
- Due to caching, fast path, middlebox interception, etc.
- E.g., workload consisting of a single page served from RAM
- Services often optimized for certain requests
- SQL requests to indexed vs. non-indexed data
- Router fast path vs. slow path
22 Remote file storage behavior under load
- File storage: an application with well-known semantics
- Absent contention, we would expect this behavior at runtime
- NFS write performance
- File size ?, throughput ?
- Deviation from expected
- File size ?, throughput ?
- This could be an indication of resource contention/dependency
[Figure: topology with Clients, web, email, NFS, and DNS]