The RAD Lab monitoring strategy: From Ruckus to Chukwa




1
The RAD Lab monitoring strategy: From Ruckus to Chukwa
Ari Rabkin, RAD Lab, January 2009
2
RAD Lab Overview
[Architecture diagram. Labels: Low level spec; Compiler; High level spec; Instrumentation Backplane; New apps, equipment, global policies (e.g. SLA); Offered load, resource utilization, etc.; Director; Training data; Performance and cost models; Log Mining]
4
Our plan
  • Adapt Chukwa
  • Developed in part by Andy and me at Yahoo!
    this past summer
  • Need to add support for low-latency delivery
  • We had this in mind when designing Chukwa.

5
Recap: Ruckus
  • Last year, we started work on a RIOT
    instrumentation backplane, Ruckus.
  • Idea was to dynamically assemble dataflow graphs
    for monitoring.
  • Processing done by Modules at dataflow graph
    vertices
  • Ruckus is currently in use, supporting the director.

6
Ruckus, illustrated
Alarm triggered on aggregate statistics
Data aggregated in-system
Data collected on worker node
7
Lesson 1: Storage is key
  • Chukwa separates data collection from processing
  • Easier to debug and configure each separately
  • Scalable batch processing is a well-understood
    problem, solved by MapReduce.
  • We really want to store everything
  • New analysis jobs are common
  • Data often has a lifetime of months: easy to
    handle via a batch job, hard in a streaming model.

8
Lesson 2: Dynamically loadable modules
  • Ruckus had a library of dynamically loadable data
    collection modules.
  • Chukwa calls them Adaptors
  • Similar programming model. (push data to
    framework, which handles transport)
  • Definitely the right call
  • Want to watch different log files, different
    metrics, etc, at different times.
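
A minimal Python sketch of the programming model on this slide: adaptors push data to a framework that owns transport, and adaptors can be started and stopped at run time. Every name here (`ADAPTOR_TYPES`, `register`, `Framework`, `LineAdaptor`) and the example log path are hypothetical illustrations, not the Ruckus or Chukwa API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry mapping a short type name to an adaptor class.
ADAPTOR_TYPES: Dict[str, type] = {}


def register(name: str) -> Callable[[type], type]:
    """Class decorator that makes an adaptor type available by name."""
    def wrap(cls: type) -> type:
        ADAPTOR_TYPES[name] = cls
        return cls
    return wrap


class Framework:
    """Hypothetical framework: owns transport and the set of running adaptors."""

    def __init__(self) -> None:
        self.received: List[Tuple[str, bytes]] = []
        self.running: Dict[str, object] = {}

    def push(self, source: str, data: bytes) -> None:
        # A real framework would ship the data onward; this sketch records it.
        self.received.append((source, data))

    def add(self, type_name: str, source: str):
        # Start watching a new source at run time.
        adaptor = ADAPTOR_TYPES[type_name](source, self)
        self.running[source] = adaptor
        return adaptor

    def remove(self, source: str) -> None:
        # Stop watching a source; other adaptors are unaffected.
        self.running.pop(source, None)


@register("line")
class LineAdaptor:
    """Hypothetical adaptor that pushes whatever it observes to the framework."""

    def __init__(self, source: str, framework: Framework) -> None:
        self.source = source
        self.framework = framework

    def on_data(self, data: bytes) -> None:
        self.framework.push(self.source, data)


framework = Framework()
adaptor = framework.add("line", "/var/log/app.log")  # example path, an assumption
adaptor.on_data(b"request served\n")
framework.remove("/var/log/app.log")
```

The adaptor never touches the network itself; it only calls `push`, which is what lets the set of watched sources change without touching transport code.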

9
Lesson 3: You need schemas
  • Ruckus didn't define message formats
  • This was awkward and made it hard to program
  • Chukwa defines mandatory metadata for chunks
    (base data abstraction)
  • Every chunk has origin, format, stream offset
  • Separately, Chukwa has a notion of parsed
    records, with complex schemas
  • Can put into structured storage
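
A minimal Python sketch of what mandatory chunk metadata might look like. The three metadata fields come from the slide; the field names, the `payload` field, the helper functions, and the convention that `stream_offset` marks the end of the chunk within its stream are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Chunk:
    origin: str         # where the data came from, e.g. a host and file
    data_format: str    # tag telling later stages how to parse the payload
    stream_offset: int  # assumed convention: offset just past this chunk's last byte
    payload: bytes


def next_chunk(prev_offset: int, origin: str, data_format: str, payload: bytes) -> Chunk:
    """Build the chunk that follows a stream position of prev_offset."""
    return Chunk(origin, data_format, prev_offset + len(payload), payload)


def is_contiguous(chunks: List[Chunk]) -> bool:
    """True if chunks from one stream cover it with no gaps or overlaps."""
    if not chunks:
        return True
    expected_start = chunks[0].stream_offset - len(chunks[0].payload)
    for chunk in chunks:
        start = chunk.stream_offset - len(chunk.payload)
        if start != expected_start:
            return False
        expected_start = chunk.stream_offset
    return True


first = next_chunk(0, "host1:/var/log/app.log", "applog", b"abc")
second = next_chunk(first.stream_offset, first.origin, "applog", b"de")
skipped = next_chunk(second.stream_offset + 1, first.origin, "applog", b"x")
```

With offsets carried on every chunk, a downstream job can notice a gap (as in `skipped`) or a duplicate without knowing anything about the payload format.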

10
Lesson 4: Reuse MapReduce
  • Ruckus was both a collection system and a
    dataflow processing framework.
  • But people don't have experience with that model
    for distributed computing
  • Better to just rely on Hadoop/MapReduce.
  • Has a lot of engineering put into it.
  • Has a reasonably well-understood idiom for
    writing scalable processing jobs.

11
Chukwa design goals
  • Monitor a variety of data sources
  • Especially logs and metrics
  • Scalably process collected data
  • Support long-term archiving
  • Scale to 1000s of nodes, petabytes of data
  • Every node in a cluster...or a datacenter.
  • Store all collected data indefinitely

12
How does Chukwa work?
  • Separates collection and processing
  • Collection:
  • Adaptors (on each node) output chunks of data,
    with some minimal metadata.
  • Framework uploads data to a small number of
    collectors, which write to sink files in HDFS
  • Processing:
  • Periodic MapReduce jobs to organize and analyze
    collected data
  • Dump to structured storage for ad-hoc
    visualization.
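
The two halves of the pipeline can be sketched in a few lines of Python. This is an in-memory illustration only: the `Collector` class, the tuple layout of a chunk, and the map and reduce functions are assumptions, and a list stands in for a sink file in HDFS. It does not use Hadoop.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

# (origin, data_format, payload): a hypothetical stand-in for a chunk and its metadata.
Chunk = Tuple[str, str, bytes]
Key = Tuple[str, str]


class Collector:
    """Hypothetical collector: appends every chunk it receives to a sink."""

    def __init__(self) -> None:
        self.sink: List[Chunk] = []  # stands in for a sink file in HDFS

    def receive(self, chunk: Chunk) -> None:
        self.sink.append(chunk)


def map_chunk(chunk: Chunk) -> Iterator[Tuple[Key, str]]:
    """Map step: split a chunk into lines keyed by (format, origin)."""
    origin, data_format, payload = chunk
    for line in payload.decode("utf-8").splitlines():
        yield (data_format, origin), line


def run_batch(sink: List[Chunk]) -> Dict[Key, int]:
    """Group mapped lines by key, then reduce each group to a line count."""
    groups: Dict[Key, List[str]] = defaultdict(list)
    for chunk in sink:
        for key, line in map_chunk(chunk):
            groups[key].append(line)
    return {key: len(lines) for key, lines in groups.items()}


collector = Collector()
collector.receive(("host1", "syslog", b"boot ok\ndisk ok\n"))
collector.receive(("host2", "metrics", b"cpu=0.4\n"))
collector.receive(("host1", "syslog", b"net ok\n"))
counts = run_batch(collector.sink)
```

Collection only appends; all organizing happens later in the batch step, so a new analysis can be run over the same sink without changing anything on the monitored nodes.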

13
The architecture
  • Pipeline architecture
  • Guaranteed end-to-end delivery
  • Failure tolerant, with crash recovery
  • Trade latency for scalability
  • Buffer data in temporary files
  • Use MapReduce to organize it

14
Why use Chukwa?
  • Three big reasons
  • Benefit from Yahoo! developers
  • Can push our innovations to the Hadoop user
    community
  • Architecture is better in many (not all) ways
  • More aggressive reuse of existing tools (+)
  • More confidence in scalability (+)
  • Easier programming model (+)
  • Measurements are minutes old, not seconds. (-)
  • We have less control of development (-)

15
Chukwa and the RAD Lab
  • Chukwa was designed for batch computing, not an
    interactive service.
  • Minutes of latency OK.
  • The director needs faster processing
  • Doesn't need long-term storage or reliable delivery
  • Intend to reuse only the collection part
  • We had this in mind during the Chukwa design and
    have some ways to address it.

16
Adapting Chukwa
  • We intend to build a "rush delivery" mode for
    Chukwa: get some data to the director right away,
    in addition to archiving.
  • Can summarize data at collectors
  • Plan to build this before next retreat
  • Also intend to integrate with Hive and/or SCADS
    to support ad-hoc queries
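
A rough Python sketch of what a "rush delivery" path could look like: every record is still archived, a selected subset is forwarded immediately, and the collector can produce a summary. This is a guess at the shape of the idea, not the planned design; `RushCollector`, the `latency_ms` field, and the 500 ms threshold are all hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]


class RushCollector:
    """Hypothetical collector with an archive path and a low-latency path."""

    def __init__(self, is_urgent: Callable[[Record], bool],
                 notify_director: Callable[[Record], None]) -> None:
        self.is_urgent = is_urgent
        self.notify_director = notify_director
        self.archive: List[Record] = []

    def receive(self, record: Record) -> None:
        self.archive.append(record)       # everything is still archived
        if self.is_urgent(record):
            self.notify_director(record)  # rush path, bypassing batch processing

    def summary(self, field: str) -> Dict[str, float]:
        """Summarize one numeric field over everything seen so far."""
        values = [r[field] for r in self.archive if field in r]
        if not values:
            return {"count": 0, "mean": 0.0}
        return {"count": len(values), "mean": sum(values) / len(values)}


director_inbox: List[Record] = []
rush_collector = RushCollector(
    is_urgent=lambda r: r.get("latency_ms", 0.0) > 500.0,  # threshold is an assumption
    notify_director=director_inbox.append,
)
for latency in (120.0, 900.0, 180.0):
    rush_collector.receive({"latency_ms": latency})
latency_summary = rush_collector.summary("latency_ms")
```

Only the 900 ms record reaches `director_inbox`, while all three land in the archive for later batch analysis.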

17
Chukwa today
  • Under active development; Andy and I are major
    outside contributors
  • Deployed on several thousand Hadoop nodes at
    Yahoo!
  • Being steadily rolled out across company
  • Y! runs Chukwa clusters about 1% of the size of
    the system being monitored

18
Performance
  • Goal: a 2000-node cluster generates 5.5 MB/sec
  • Collectors can write at 20 MB/sec/collector
  • No state at collectors, so easy to add more
  • Demux MapReduce job runs at 3 MB/sec/node, on
    Yahoo! configuration
  • Can add nodes for speed
  • Hadoop will improve
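
The figures on this slide imply a simple capacity estimate, worked through in Python below. The rates are taken from the slide; the steady-state model (no bursts, no headroom, no redundancy) and the use of 1 MB = 1000 KB are simplifying assumptions.

```python
import math

CLUSTER_NODES = 2000
CLUSTER_RATE_MB_S = 5.5          # data generated by the whole monitored cluster
COLLECTOR_RATE_MB_S = 20.0       # write rate of one collector
DEMUX_RATE_MB_S_PER_NODE = 3.0   # demux MapReduce throughput per node


def nodes_needed(load_mb_s: float, per_node_mb_s: float) -> int:
    """Smallest whole number of nodes whose combined rate covers the load."""
    return math.ceil(load_mb_s / per_node_mb_s)


collectors = nodes_needed(CLUSTER_RATE_MB_S, COLLECTOR_RATE_MB_S)
demux_nodes = nodes_needed(CLUSTER_RATE_MB_S, DEMUX_RATE_MB_S_PER_NODE)
per_node_kb_s = CLUSTER_RATE_MB_S * 1000 / CLUSTER_NODES
```

Under these assumptions one collector and two demux nodes keep up with the stated load, and each monitored node contributes 2.75 KB/sec on average. A real deployment would add capacity for bursts and failures.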

19
Related work
  • Ganglia (and most NMS) doesn't do large data
    volumes or reliable delivery
  • Microsoft Artemis (Presented Dec 08)
  • Similar concept, using Dryad not Hadoop
  • Leaves data on worker nodes
  • Not open source
  • Facebook's Scribe and Hive
  • Scribe is streaming, not batch
  • Hive is batch, and atop Hadoop
  • Doesn't do collection or processing
  • No centralized configuration

20
Questions?
  • We have ten minutes for questions
  • Then the demo