Title: The RAD Lab monitoring strategy: From Ruckus to Chukwa
1. The RAD Lab monitoring strategy: From Ruckus to Chukwa
Ari Rabkin, RAD Lab, January 2009
2. RAD Lab Overview
[Architecture diagram: a high-level spec is compiled to a low-level spec; the Instrumentation Backplane reports offered load, resource utilization, etc. to the Director; Log Mining turns training data into performance and cost models; inputs include new apps, equipment, and global policies (e.g. SLAs).]
4. Our plan
- Adapt Chukwa
  - Developed in part by Andy and myself at Yahoo! this past summer
- Need to add support for low-latency delivery
  - We had this in mind when designing Chukwa.
5. Recap: Ruckus
- Last year, we started work on a RIOT instrumentation backplane, Ruckus.
- The idea was to dynamically assemble dataflow graphs for monitoring.
  - Processing is done by Modules at dataflow-graph vertices.
- Ruckus is currently in use, supporting the Director.
6. Ruckus, illustrated
[Diagram: data is collected on worker nodes and aggregated in-system; alarms are triggered on aggregate statistics.]
7. Lesson 1: Storage is key
- Chukwa separates data collection from processing
  - Easier to debug and configure each separately
- Scalable batch processing is a well-understood problem, solved by MapReduce.
- We really want to store everything
  - New analysis jobs are common
  - Data often has a lifetime measured in months: easy to support via batch jobs, hard in a streaming model.
8. Lesson 2: Dynamically loadable modules
- Ruckus had a library of dynamically loadable data collection modules.
  - Chukwa calls them Adaptors
- Similar programming model (push data to the framework, which handles transport)
- Definitely the right call
  - Want to watch different log files, different metrics, etc., at different times.
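The push-style programming model can be sketched in a few lines of plain Java. This is an illustrative stand-in, not Chukwa's actual API: the interface and class names (`Adaptor`, `ChunkReceiver`, `LineAdaptor`) are invented here to show the shape of the contract, in which the module pushes bytes and the framework owns transport.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Framework-side sink: the adaptor pushes data here; shipping it to a
// collector is the framework's job, not the adaptor's.
interface ChunkReceiver {
    void add(byte[] data);
}

// The loadable-module contract: start pushing, and stop on request.
interface Adaptor {
    void start(ChunkReceiver dest);
    void shutdown();
}

// A toy adaptor that emits one chunk per log line it observes.
class LineAdaptor implements Adaptor {
    private ChunkReceiver dest;
    public void start(ChunkReceiver dest) { this.dest = dest; }
    public void shutdown() { this.dest = null; }
    public void observe(String line) {
        if (dest != null) dest.add(line.getBytes(StandardCharsets.UTF_8));
    }
}

public class AdaptorDemo {
    public static void main(String[] args) {
        List<byte[]> queue = new ArrayList<>();
        LineAdaptor a = new LineAdaptor();
        a.start(queue::add);               // framework hands the adaptor a sink
        a.observe("GET /index.html 200");  // one chunk queued for transport
        a.shutdown();
        System.out.println(queue.size());  // 1
    }
}
```

Because the adaptor never touches the network, the framework can load and unload modules at runtime to watch different sources at different times.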
9. Lesson 3: You need schemas
- Ruckus didn't define message formats
  - Awkward, and made it hard to program against
- Chukwa defines mandatory metadata for chunks (its base data abstraction)
  - Every chunk has an origin, a format, and a stream offset
- Separately, Chukwa has a notion of parsed records, with complex schemas
  - These can be put into structured storage
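The mandatory chunk metadata can be sketched as a small value class. The field names below are illustrative (not Chukwa's exact ones); what matters is that origin, format tag, and stream offset always travel with the payload.

```java
// Sketch of the chunk abstraction: raw bytes plus the mandatory metadata
// the slide lists. Names here are illustrative, not Chukwa's real API.
class Chunk {
    final String source;    // origin: which host produced the data
    final String dataType;  // format tag, e.g. a log type or metrics type
    final long seqId;       // offset of this chunk within its stream
    final byte[] data;      // the raw payload

    Chunk(String source, String dataType, long seqId, byte[] data) {
        this.source = source;
        this.dataType = dataType;
        this.seqId = seqId;
        this.data = data;
    }
}

public class ChunkDemo {
    public static void main(String[] args) {
        Chunk c = new Chunk("worker07", "HadoopLog", 4096,
                            "2009-01-12 INFO started".getBytes());
        // With origin + offset, a receiver can detect gaps and duplicates;
        // with a format tag, a downstream job can pick the right parser.
        System.out.println(c.source + " " + c.dataType + " @" + c.seqId);
    }
}
```

The stream offset is what makes end-to-end delivery checkable: a missing range of offsets is a visible gap rather than silent loss.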
10. Lesson 4: Reuse MapReduce
- Ruckus was both a collection system and a dataflow processing framework.
  - But people don't have experience with that model for distributed computing
- Better to just rely on Hadoop/MapReduce.
  - Has a lot of engineering put into it.
  - Has a reasonably well-understood idiom for writing scalable processing jobs.
11. Chukwa design goals
- Monitor a variety of data sources
  - Especially logs and metrics
- Scalably process collected data
- Support long-term archiving
- Scale to 1000s of nodes, petabytes of data
  - Every node in a cluster...or a datacenter.
  - Store all collected data indefinitely
12. How does Chukwa work?
- Separates collection and processing
- Collection
  - Adaptors (on each node) output chunks of data, with some minimal metadata.
  - The framework uploads data to a small number of collectors, which write to sink files in HDFS
- Processing
  - Periodic MapReduce jobs organize and analyze collected data
  - Dump to structured storage for ad-hoc visualization.
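The demux step can be modeled in miniature: group raw chunks from the sink by their data type, so each type lands in its own bucket for type-specific parsing. In the real system this is a periodic MapReduce job over sink files in HDFS; the sketch below is an in-memory stand-in with invented names, showing only the grouping the shuffle performs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DemuxSketch {
    // Group (dataType, record) pairs by type -- the grouping a real demux
    // MapReduce job performs when it organizes sink files by data type.
    static Map<String, List<String>> demux(List<String[]> chunks) {
        Map<String, List<String>> byType = new TreeMap<>();
        for (String[] c : chunks) {
            byType.computeIfAbsent(c[0], k -> new ArrayList<>()).add(c[1]);
        }
        return byType;
    }

    public static void main(String[] args) {
        List<String[]> sink = Arrays.asList(
            new String[]{"HadoopLog",  "INFO task started"},
            new String[]{"SysMetrics", "cpu=0.42"},
            new String[]{"HadoopLog",  "WARN slow fetch"});
        Map<String, List<String>> out = demux(sink);
        System.out.println(out.keySet());                 // [HadoopLog, SysMetrics]
        System.out.println(out.get("HadoopLog").size());  // 2
    }
}
```

Running this as a batch job rather than a stream is the latency/scalability trade the next slide describes: the job sees a whole sink file at once, at the cost of minutes of delay.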
13. The architecture
- Guaranteed end-to-end delivery
- Failure-tolerant crash recovery
- Trades latency for scalability
  - Buffers data in temporary files
  - Uses MapReduce to organize it
14. Why use Chukwa?
- Three big reasons:
  - Benefit from Yahoo! developers
  - Can push our innovations to the Hadoop user community
  - Architecture is better in many (not all) ways
- More aggressive reuse of existing tools (+)
- More confidence in scalability (+)
- Easier programming model (+)
- Measurements are minutes old, not seconds (-)
- We have less control of development (-)
15. Chukwa and the RAD Lab
- Chukwa was designed for batch computing, not an interactive service.
  - Minutes of latency are OK there.
- The Director needs faster processing
  - Doesn't need long-term storage or reliable delivery
- Intend to reuse the collection part only
- We had this in mind during the Chukwa design, and we have some ways to fix this.
16. Adapting Chukwa
- We intend to build a "rush delivery" mode for Chukwa: get some data to the Director right away, in addition to archiving it.
  - Can summarize data at the collectors
  - Plan to build this before the next retreat
- Also intend to integrate with Hive and/or SCADS to support ad-hoc queries
17. Chukwa today
- Under active development; Andy and I are major outside contributors
- Deployed on several thousand Hadoop nodes at Yahoo!
  - Being steadily rolled out across the company
- Y! runs Chukwa clusters about 1% of the size of the system being monitored
18. Performance
- Goal: handle a 2000-node cluster generating 5.5 MB/sec
- Collectors can write at 20 MB/sec/collector
  - No state at collectors, so it's easy to add more
- The demux MapReduce job runs at 3 MB/sec/node on the Yahoo! configuration
  - Can add nodes for speed
  - Hadoop will improve
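A back-of-the-envelope check, using only the rates on this slide, shows how little dedicated hardware the target load needs (the class name and the ceiling-division framing are mine):

```java
public class CapacityCheck {
    public static void main(String[] args) {
        double clusterRate   = 5.5;  // MB/sec generated by a 2000-node cluster
        double collectorRate = 20;   // MB/sec each stateless collector writes
        double demuxRate     = 3;    // MB/sec each demux node processes

        // Round up: you need whole nodes, not fractions of one.
        int collectors = (int) Math.ceil(clusterRate / collectorRate);
        int demuxNodes = (int) Math.ceil(clusterRate / demuxRate);

        System.out.println(collectors); // 1
        System.out.println(demuxNodes); // 2
    }
}
```

So on paper a single collector and two demux nodes cover the 2000-node target, which is why adding collectors (stateless) or demux nodes (just more MapReduce workers) is the scaling story.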
19. Related work
- Ganglia (and most NMSes) doesn't handle large data volumes or reliable delivery
- Microsoft Artemis (presented Dec '08)
  - Similar concept, using Dryad rather than Hadoop
  - Leaves data on worker nodes
  - Not open source
- Facebook's Scribe + Hive
  - Scribe is streaming, not batch
  - Hive is batch, and sits atop Hadoop
  - Doesn't do collection or processing
  - No centralized configuration
20. Questions?
- We have ten minutes for questions
- Then the demo