Transcript and Presenter's Notes

Title: ISTORE: Introspective Storage for Data-Intensive Network Services


1
ISTORE: Introspective Storage for Data-Intensive Network Services
  • Aaron Brown, David Oppenheimer, Jim Beck,
  • Kimberly Keeton, Rich Martin, Randi Thomas,
  • John Kubiatowicz, David Patterson, and Kathy
    Yelick
  • Computer Science Division
  • University of California, Berkeley
  • http://iram.cs.berkeley.edu/istore/
  • 1999 Summer IRAM Retreat

2
ISTORE Philosophy
Traditional Research Priorities: 1) Performance 1) Cost 3) Scalability 4) Availability 5) Maintainability
  • Traditional systems research has focused on peak
    performance and cost

3
ISTORE Philosophy: SAM
Traditional Research Priorities: 1) Performance 1) Cost 3) Scalability 4) Availability 5) Maintainability
ISTORE Priorities: 1) Maintainability 2) Availability 3) Scalability 4) Performance 4) Cost
  • In reality, scalability, maintainability, and
    availability (SAM) are equally important
  • performance and cost mean little if the system isn't working

4
ISTORE Philosophy: Introspection
  • ISTORE's solution is introspective systems
  • systems that monitor themselves and automatically
    adapt to changes in their environment and
    workload
  • introspection enables automatic self-maintenance
    and self-tuning
  • ISTORE vision: a framework that makes it easy to build introspective systems
  • ISTORE target: high-end servers for data-intensive infrastructure services
  • single-purpose systems managing large amounts of
    data for large numbers of active network users

5
Outline
  • Motivation for Introspective Systems
  • ISTORE Research Agenda and Architecture
  • Hardware
  • Software
  • Policy-driven Introspection Example
  • Research Issues, Status, and Discussion

6
Motivation: Service Demands
  • Emergence of a true information infrastructure
  • today: e-commerce, online database services, online backup, search engines, and web servers
  • tomorrow: more of the above (with ever-growing datasets), plus thin-client/PDA infrastructure support
  • Infrastructure users expect always-on service and constant quality of service
  • infrastructure must provide fault-tolerance and performance-tolerance
  • failures and slowdowns have major business impact
  • e.g., recent eBay, E*Trade, and Schwab outages

7
Motivation: Service Demands (2)
  • Delivering 24x7 fault- and performance-tolerance
    requires
  • a robust hardware platform
  • fast adaptation to failures, load spikes,
    changing access patterns
  • easy incremental scalability when existing
    resources stop providing desired quality of
    service
  • self-maintenance: the system handles problems as they arise, automatically
  • can't rely on human intervention to fix problems or to tune performance
  • humans are too expensive, too slow, and prone to mistakes
  • Introspective systems are self-maintaining

8
Motivation: System Scaling
  • Infrastructure services are growing rapidly
  • more users, more online data, higher access
    rates, more historical data
  • bigger and bigger back-end systems are needed
  • O(300)-node clusters are deployed now; thousands of nodes are not far off
  • techniques for maintenance and administration
    must scale with the system to 1000s of nodes
  • Today's administrative approaches don't scale
  • systems will be too big to reason about, monitor,
    or fix
  • failures and load variance will be too frequent
    for static solutions to work
  • Introspective, reactive techniques are required

9
ISTORE Research Agenda
  • ISTORE goal: create a hardware/software framework for building introspective servers
  • Hardware: plug-and-play intelligent devices with integrated self-monitoring, diagnostics, and fault-injection hardware
  • intelligence used to collect and filter
    monitoring data
  • diagnostics and fault injection enhance
    robustness
  • networked to create a scalable shared-nothing
    cluster
  • Software: a toolkit that allows programmers to easily define the system's adaptive behavior
  • provides abstractions for manipulating and
    reacting to monitoring data

10
Hardware Requirements for Self-Maintaining
Servers
  • Redundant components that fail fast
  • no single point of failure anywhere
  • Tightly-integrated device monitoring
  • low-level HW diagnostics to detect impending
    failure
  • device health, performance data, access
    patterns, environmental info, ...
  • Automatic preventive maintenance
  • predictive failure analysis based on diagnostic
    data
  • continual scrubbing and in situ stress testing
    of all components, new and old
  • Self-characterizing, plug-and-play hardware

11
ISTORE-1 Hardware Prototype
  • Based on intelligent disk bricks
  • each brick is one ISTORE node
  • ISTORE-1 will have 64 bricks/nodes

12
ISTORE-1 Hardware Design
  • Brick
  • processor board
  • mobile Pentium-II, 128MB SODRAM
  • PCI and ISA busses/controllers, SuperIO (serial
    ports)
  • Flash BIOS
  • 4x100Mb Ethernet interfaces
  • Adaptec Ultra2-LVD SCSI interface
  • disk: one 18.2GB low-profile SCSI disk
  • diagnostic processor
  • OS: several UNIX-like OSs supporting the Linux ABI (Linux, NetBSD, FreeBSD, Solaris x86?)

13
ISTORE-1 Hardware Design (2)
  • Network
  • primary data network
  • hierarchical, highly-redundant switched Ethernet
  • uses 16 20-port 100Mb switches at the leaves
  • each brick connects to 4 independent switches
  • root switching fabric is two ganged 25-port
    Gigabit switches (PacketEngines PowerRails)
  • diagnostic network

14
Diagnostic Processor
  • Each brick has a diagnostic processor
  • Goal: a small, independent, trusted piece of hardware running hand-verifiable monitoring/control software
  • monitoring: connects to motherboard SMBus and CAN bus
  • environmental monitor, CPU watchdog
  • control
  • reboot/power-cycle main CPU
  • inject simulated faults: power, bus transients, memory errors, network interface failure, ...
  • Not-so-small embedded Motorola 68k processor
  • provides the flexibility needed for research
    prototype
  • still can run just a small, simple monitoring and
    control program if desired (no OS, networking,
    etc.)
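
For flavor, the kind of monitoring/control loop such a processor might run is sketched below in Python; the thresholds, sensor reads, and control calls are hypothetical stand-ins, not the actual diagnostic-processor firmware.

# Illustrative sketch of a diagnostic-processor monitoring/control loop (stand-in values)
TEMP_LIMIT_C = 55                 # assumed environmental threshold
HEARTBEAT_TIMEOUT_S = 30          # assumed main-CPU watchdog deadline

def read_temperature_c():         # stand-in for polling an SMBus/CAN environmental sensor
    return 42.0

def seconds_since_heartbeat():    # stand-in for tracking the main CPU's last "alive" message
    return 5.0

def report(event):                # stand-in for sending an event over the diagnostic (CAN) network
    print("diag event:", event)

def power_cycle_main_cpu():       # stand-in for toggling power to the brick's main CPU
    print("power-cycling main CPU")

def monitor_once():
    if read_temperature_c() > TEMP_LIMIT_C:
        report("over-temperature")
    if seconds_since_heartbeat() > HEARTBEAT_TIMEOUT_S:
        report("watchdog expired")
        power_cycle_main_cpu()

monitor_once()                    # firmware would run this periodically, e.g. once per second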

15
Diagnostic Network
  • Separate diagnostic network connects the
    diagnostic processors of each brick
  • provides independent network path to diagnostic
    CPU
  • works when brick CPU is powered off or has failed
  • separate failure modes from Ethernet interfaces
  • CAN (Controller Area Network) diagnostic
    interconnect
  • CAN connects directly to environmental monitoring
    sensors (temperature, fan RPM, ...)
  • one brick per shelf of 8 acts as a gateway from CAN to the redundant switched Ethernet fabric

16
ISTORE-1 Hardware Prototype
  • Meets requirements for a robust HW platform
  • fast embedded CPU performs local monitoring tasks
  • diagnostic hardware enables low-level diagnostic
    monitoring, fail-fast behavior, and fault
    injection
  • highly-redundant system design
  • redundant data network and interfaces at all
    levels
  • separate diagnostic network
  • redundant backup power
  • powerful preventive maintenance
  • each brick can be periodically taken offline and stress-tested/scrubbed using the diagnostic processor's fault-injection capabilities

17
ISTORE Research Agenda
  • ISTORE goal: create a hardware/software framework for building introspective servers
  • Hardware
  • Software: a toolkit that allows programmers to easily define the system's adaptive behavior
  • provides abstractions for manipulating and
    reacting to monitoring data

18
A Software Framework for Introspection
  • ISTORE hardware provides device monitoring
  • application programmers could write ad-hoc code
    to collect, process, and react to monitoring data
  • ISTORE software framework should simplify writing
    introspective applications
  • a rule-based adaptation engine encapsulates the mechanisms of collecting and processing monitoring data
  • a policy compiler and mechanism libraries help turn application adaptation goals into rules and reaction code
  • these provide a high-level, abstract interface to the system's monitoring and adaptation mechanisms

19
Rule-based Adaptation
  • ISTORE's adaptation framework is built on the model of an active database
  • the 'database' includes
  • hardware monitoring data: device status, access patterns, performance stats
  • software monitoring data: app-specific quality-of-service metrics, high-level workload patterns, ...
  • applications define views and triggers over the DB
  • views select and aggregate data of interest to the app
  • triggers are rules that invoke application-specific reaction code when their predicates are satisfied
  • an SQL-like declarative language is used to specify views and trigger rules
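
As a concrete illustration of this model, here is a minimal sketch in Python with an assumed API and made-up data; it is not ISTORE's actual SQL-like syntax.

# Minimal sketch: a view selects/aggregates monitoring data, a trigger runs
# application-specific reaction code when its predicate is satisfied.
from dataclasses import dataclass
from typing import Callable

@dataclass
class View:
    name: str
    select: Callable[[dict], dict]          # pick/aggregate fields of interest

@dataclass
class Trigger:
    view: View
    predicate: Callable[[dict], bool]       # condition over the view's output
    reaction: Callable[[dict], None]        # application-supplied reaction code

def evaluate(trigger: Trigger, record: dict) -> None:
    row = trigger.view.select(record)
    if trigger.predicate(row):
        trigger.reaction(row)

# Example: an app-level quality-of-service metric from the software monitoring data.
latency_view = View("request_latency",
                    select=lambda r: {"node": r["node_id"], "ms": r["avg_latency_ms"]})
qos_trigger = Trigger(latency_view,
                      predicate=lambda row: row["ms"] > 100,    # assumed QoS bound
                      reaction=lambda row: print(f"rebalance load off node {row['node']}"))

evaluate(qos_trigger, {"node_id": 12, "avg_latency_ms": 140, "disk_errors": 0})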

20
Benefits of Views and Triggers
  • Allow applications to focus on adaptation, not
    monitoring
  • hide the mechanics of gathering and processing
    monitoring data
  • can be dynamically redefined without altering
    adaptation code as situation changes
  • Can be implemented without a real database
  • views and triggers implemented as device-local
    and distributed filters and reaction rules
  • the defined views and triggers control the frequency, granularity, and types of data gathered by HW monitoring
  • no materialized database necessary

21
Raising the Level of Abstraction: Policy Compiler and Mechanism Libs
  • Rule-based adaptation doesn't go far enough
  • application designer must still write views,
    triggers, and adaptation code by hand
  • but designer thinks in terms of system policies
  • Solution: the designer specifies policies to the system; the system implements them
  • policy compiler automatically generates views,
    triggers, adaptation code
  • uses preexisting mechanism libraries to implement
    adaptation algorithms
  • claim: feasible for the common adaptation mechanisms needed by data-intensive network service apps.

22
Adaptation Policies
  • Policies specify system states and how to react
    to them
  • high-level specification, independent of the system's schema and object/node identity
  • that knowledge is encapsulated in the policy compiler
  • Examples
  • self-maintenance and availability
  • if overall free disk space is below 10%, compress all but one replica/version of the least-frequently-accessed data
  • if any disk reports more than 5 errors per hour,
    migrate all data off that disk and shut it down
  • invoke load-balancer when new disk is added to
    system
  • performance tuning
  • place large, sequentially-accessed objects on
    outer tracks of fast disks as space becomes
    available

23
Software Structure
policy
policy compiler
view
trigger
adaptation code
mechanism libraries
24
Detailed Adaptation Example
  • Policy: quench hot spots by migrating objects

while ((average queue length for any disk D) > (120% of average for whole system))
    migrate hottest object on D to disk with shortest average queue length
[Diagram: the policy is used as input to the policy compiler, which produces a view, a trigger, and adaptation code; the adaptation code calls the mechanism libraries]
25
Example View Definition
[Policy and software-structure diagram repeated from the previous slide]
26
Example Trigger
[Policy and software-structure diagram repeated]
DEFINE VIEW (average_queue_length..., queue_length..., disk_id..., short_disk...)
27
Example Adaptation Code
[Policy, software-structure diagram, and view definition repeated from the previous slides]
foreach disk_id from_disk
    if (queue_length[from_disk] > 1.2 * average_queue_length)
        user_migrate(from_disk, short_disk)
28
Example Mechanism Lib. Calls
[Policy, software-structure diagram, and view definition repeated from the previous slides]
user_migrate(from_disk, to_disk)
    diskObject x
    x = find_hottest_obj(from_disk)
    migrate(x, to_disk)
foreach disk_id from_disk
    if (queue_length[from_disk] > 1.2 * average_queue_length)
        user_migrate(from_disk, short_disk)
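
Putting the pieces above together, here is a minimal runnable sketch in Python; the data, load figures, and object store are made-up stand-ins, not ISTORE's generated code.

# Runnable sketch of the view, trigger/adaptation code, and mechanism-library
# call for the hot-spot-quenching policy (illustrative stand-ins only).
queue_length = {"disk0": 12.0, "disk1": 2.0, "disk2": 4.0}   # per-disk avg queue length
objects = {"disk0": {"objA": 900, "objB": 120},              # object -> access count
           "disk1": {"objC": 30},
           "disk2": {"objD": 200}}

# "view": aggregate the monitoring data of interest
average_queue_length = sum(queue_length.values()) / len(queue_length)
short_disk = min(queue_length, key=queue_length.get)         # disk with shortest queue

# "mechanism library" calls: move the hottest object off a disk
def find_hottest_obj(disk):
    return max(objects[disk], key=objects[disk].get)

def user_migrate(from_disk, to_disk):
    x = find_hottest_obj(from_disk)
    objects[to_disk][x] = objects[from_disk].pop(x)
    print(f"migrated {x}: {from_disk} -> {to_disk}")

# "trigger" + adaptation code: fire when a disk exceeds 120% of the system average
for from_disk in queue_length:
    if queue_length[from_disk] > 1.2 * average_queue_length:
        user_migrate(from_disk, short_disk)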
29
Mechanism Libraries
  • Unify existing techniques/services found in
    single-node OSs, DBMSs, distributed systems
  • distributed directory services
  • replication and migration
  • data layout and placement
  • distributed transactions
  • checkpointing
  • caching
  • administrative (human UI) tasks
  • Provide a place for higher-level monitoring
  • Simplify creation of adaptation code
  • for humans using the rule system directly
  • for the policy compiler auto-generating code

Select key mechanisms for data-intensive network services
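
For concreteness, a mechanism library's interface might look roughly like this sketch; the class and method names are assumptions for illustration, not ISTORE's actual libraries.

# Illustrative interface only; real mechanism libraries would hide the
# distributed-systems details (synchronization, communication, recovery).
from abc import ABC, abstractmethod
from typing import List

class MechanismLibrary(ABC):
    @abstractmethod
    def lookup(self, obj_id: str) -> List[str]:
        """Distributed directory service: which nodes hold the object."""

    @abstractmethod
    def migrate(self, obj_id: str, from_node: str, to_node: str) -> None:
        """Move an object between nodes, updating directory entries."""

    @abstractmethod
    def replicate(self, obj_id: str, to_node: str) -> None:
        """Create an additional replica of an object."""

    @abstractmethod
    def checkpoint(self, node: str) -> str:
        """Checkpoint a node's state; return a handle for recovery."""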
30
Open Research Issues
  • Defining appropriate software abstractions
  • how should views and triggers be declared?
  • what should the policy language look like?
  • what functions should mechanism libraries
    provide?
  • what is the system's schema?
  • how should heterogeneous hardware be integrated?
  • can it be extended by the user to include new
    types and statistics?
  • what level of policies can be expressed?
  • how much of the implementation can the system
    figure out automatically?
  • to what extent can the system reason about
    policies and their interactions?

31
More Open Research Issues
  • Implementing an introspective system
  • what default policies should the system supply?
  • what are the internal and external interfaces?
  • debugging
  • visualization of states, triggers, ...
  • simulation/coverage analysis of policies,
    adaptation code
  • appropriate administrative interfaces
  • Measuring an introspective system
  • what are the right benchmarks for scalability,
    availability, and maintainability (SAM)?
  • O(>1000)-node scalability
  • how to write applications that scale and run well
    despite continual state of partial failure?

32
Related Work
  • Hardware
  • CMU and UCSB Active Disks
  • Software
  • adaptive databases: MS AutoAdmin, Informix NoKnobs
  • adaptive OSs: MS Millennium, adaptive VINO
  • adaptive storage: HP AutoRAID, attribute-managed storage
  • active databases: UFL Gator, TriggerMan
  • ISTORE unifies many of these techniques in a
    single system

33
Related Work: Ninja
  • Ninja: composable Internet-scale services
  • some ISTORE runtime software services provided
    using Ninja programming platform?
  • provides
  • some fault tolerance
  • a framework for automatic service discovery
  • incremental s/w upgrades

34
Related Work: Telegraph
  • Universal system for information
  • Four layers
  • query, browse, mine
  • global agoric federation
  • continuously reoptimizing query processor
    adaptive data placement
  • storage manager
  • Relationship to ISTORE
  • continuous online reoptimization
  • adaptive data placement
  • indexing, other operations on disk CPU

35
Related Work: OceanStore
  • Global-scale persistent storage
  • Nomadic, highly-available data
  • Federation of data storage providers
  • Investigate global-scale SAM
  • also naming, indexability, consistency
  • Relationship to ISTORE
  • investigating similar concepts but on a global
    scale
  • converse: ISTORE as an "Internet in a box"

36
Related Work: Endeavour
  • Endeavour: a new research project at UCB
  • goal: enhancing human understanding through information technology
  • ISTORE's potential contributions
  • ISTORE is building adaptive, scalable,
    self-maintaining back-end servers for
    storage-intensive network services
  • can be part of Endeavour's back-end infrastructure
  • software research
  • using policies to guide a system's adaptive behavior
  • providing QoS under degraded conditions
  • application platform
  • process and store streams of sensor data

37
Status and Conclusions
  • ISTORE's focus is on introspective systems
  • a new perspective on systems research priorities
  • Proposed framework for building introspection
  • intelligent, self-monitoring plug-and-play
    hardware
  • software that provides a higher level of
    abstraction for the construction of introspective
    systems
  • flexible, powerful rule system for monitoring
  • policy specification automates generation of
    adaptation
  • Status
  • ISTORE-1 hardware prototype being constructed now
  • software prototyping just starting

38
ISTORE Short-Term Plans
  • Solidify/begin implementing benchmarking ideas
  • run on existing systems to characterize and
    compare them with respect to SAM
  • Assemble ISTORE-0 system
  • 6 PCs with similar configurations to ISTORE-1
    bricks
  • 100 Mb/s switched Ethernet
  • Gain experience running multiple OSes
  • Investigate implementation options for monitoring
    database, views, and triggers
  • Study data-intensive network service applications
  • to guide development of policy lang.
  • to determine what types of adaptation will help

39
ISTORE: Introspective Storage for Data-Intensive Network Services
  • For more information
  • http://iram.cs.berkeley.edu/istore/
  • istore-group@cs.berkeley.edu

40
Backup Slides
41
ISTORE-1 Hardware Design
  • Brick
  • processor board
  • mobile Pentium-II, 366 MHz, 128MB SODRAM
  • PCI and ISA busses/controllers, SuperIO (serial
    ports)
  • Flash BIOS
  • 4x100Mb Ethernet interfaces
  • Adaptec Ultra2-LVD SCSI interface
  • disk: one 18.2GB, 10,000 RPM low-profile SCSI disk
  • diagnostic processor
  • Motorola MC68376, 2MB Flash or NVRAM
  • serial connections to CPU for console and
    monitoring
  • controls power to all parts on board
  • CAN interface

42
ISTORE-1 Hardware Design (2)
  • Network
  • primary data network
  • hierarchical, highly-redundant switched Ethernet
  • uses 16 20-port 100Mb switches at the leaves
  • each brick connects to 4 independent switches
  • root switching fabric is two ganged 25-port
    Gigabit switches (PacketEngines PowerRails)
  • diagnostic network
  • point-to-point CAN network connects bricks in a
    shelf
  • Ethernet fabric described above is used for
    shelf-to-shelf communication
  • console I/O from each brick can be routed through
    diagnostic network

43
Motivation: Technology Trends
  • Disks, systems, switches are getting smaller
  • Convergence on intelligent disks (IDISKs)
  • MicroDrive + system-on-a-chip → tiny IDISK nodes
  • Inevitability of enormous-scale systems
  • by 2006, a O(10,000) IDISK-node cluster with 90TB
    of storage could fit in one rack

44
Disk Limit
  • Continued advance in capacity (60%/yr) and bandwidth (40%/yr)
  • Slow improvement in seek, rotation (8%/yr)
  • Time to read whole disk
  • Year: Sequentially / Randomly (1 sector/seek)
  • 1990: 4 minutes / 6 hours
  • 2000: 12 minutes / 1 week (!)
  • will the 3.5-inch form factor make sense in 5-7 years?
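(Illustrative arithmetic for the 2000 row, assuming roughly a 36 GB disk, ~50 MB/s sequential bandwidth, 512-byte sectors, and ~8 ms per random access: 36,000 MB / 50 MB/s ≈ 720 s ≈ 12 minutes sequentially; ~7 x 10^7 sectors x 8 ms ≈ 5.6 x 10^5 s ≈ 6.5 days, i.e. roughly a week, randomly.)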

45
ISTORE-II Hardware Vision
  • System-on-a-chip enables computer, memory,
    redundant network interfaces without
    significantly increasing size of disk
  • Target for 5-7 years
  • 1999 IBM MicroDrive
  • 1.7 x 1.4 x 0.2 (43 mm x 36 mm x 5 mm)
  • 340 MB, 5400 RPM, 5 MB/s, 15 ms seek
  • 2006 MicroDrive?
  • 9 GB, 50 MB/s (1.6X/yr capacity, 1.4X/yr BW)
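(As a check on these assumed growth rates: 340 MB x 1.6^7 ≈ 9 GB and 5 MB/s x 1.4^7 ≈ 50 MB/s over the 7-year horizon.)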

46
2006 ISTORE
  • ISTORE node
  • Add 20% pad to MicroDrive size for packaging, connectors
  • Then double thickness to add IRAM
  • 2.0" x 1.7" x 0.5" (51 mm x 43 mm x 13 mm)
  • Crossbar switches growing by Moore's Law
  • 2x / 1.5 yrs → 4X transistors / 3 yrs
  • Crossbars grow by N² → 2X switch / 3 yrs
  • 16 x 16 in 1999 → 64 x 64 in 2005
  • ISTORE rack (19" x 33" x 84") (480 mm x 840 mm x 2130 mm)
  • 1 tray (3" high) → 16 x 32 → 512 ISTORE nodes
  • 20 trays + switches + UPS → 10,240 ISTORE nodes (!)

47
Benefits of Views and Triggers (2)
  • Equally useful for performance and failure
    management
  • Performance tuning example: DB index management
  • View: access patterns to tables, query predicates used
  • Trigger: access rate to table above/below average
  • Adaptation: add/drop indices based on query stream
  • Failure management example: impending disk failure
  • View: disk error logs, environmental conditions
  • Trigger: frequency of errors, unsafe environment
  • Adaptation: redirect requests to other replicas, shut down disk, generate new replicas, signal operator
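
A hedged sketch of how the disk-failure case might look in the view/trigger style; the thresholds and function names here are assumptions for illustration, not ISTORE's.

# Illustrative only: view/trigger/adaptation for an impending disk failure.
disk_view = {                       # "view": selected monitoring data for one disk
    "disk_id": 7,
    "errors_per_hour": 9,           # from low-level diagnostics / error logs
    "temperature_c": 58,            # environmental sensor
}

ERROR_THRESHOLD = 5                 # assumed trigger thresholds
TEMP_THRESHOLD_C = 55

def impending_failure(view):        # "trigger" predicate
    return (view["errors_per_hour"] > ERROR_THRESHOLD
            or view["temperature_c"] > TEMP_THRESHOLD_C)

def react(view):                    # "adaptation" code (stand-in actions)
    d = view["disk_id"]
    print(f"redirect requests away from disk {d}")
    print(f"create new replicas of objects on disk {d}")
    print(f"shut down disk {d} and notify the operator")

if impending_failure(disk_view):
    react(disk_view)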

48
More Adaptation Policy Examples
  • Self-maintenance and availability
  • maintain two copies of all dirty data stored only
    in volatile memory
  • if a disk fails, restore original redundancy
    level for objects stored on that disk
  • Performance tuning
  • if accesses to a read-mostly object take more
    than 10ms on average, replicate the object on
    another disk
  • Both (like HP AutoRAID)
  • if an object is in the top 10% of frequently-accessed objects, and there is only one copy, create a new replica. If an object is in the bottom 90%, delete all replicas and stripe the object across N disks using RAID-5.
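
As a sketch only (the percentile cut-offs come from the policy above; the layout actions are illustrative stand-ins), the last policy might compile down to a decision like:

# Illustrative: choose a redundancy scheme from an object's access-frequency rank.
def choose_layout(percentile_rank: float, replica_count: int) -> str:
    """percentile_rank: 0.0 = least accessed, 1.0 = most accessed."""
    if percentile_rank >= 0.90:                 # top 10% of objects
        return "keep layout" if replica_count >= 2 else "create new replica"
    return "delete extra replicas; stripe across N disks with RAID-5"

print(choose_layout(0.95, 1))   # -> create new replica
print(choose_layout(0.40, 3))   # -> delete extra replicas; stripe ... RAID-5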

49
Mechanism Library Benefits
  • Programmability
  • libraries provide high-level abstractions of
    services
  • code using the libraries is easier to reason
    about, maintain, customize
  • Performance
  • libraries can be highly-optimized
  • optimization complexity is hidden by abstraction
  • Reliability
  • libraries include code that's easy to forget or get wrong
  • synchronization, communication, memory allocation
  • debugging effort can be spent getting libraries
    right
  • library users inherit the verification effort