Title: ISTORE: Introspective Storage for Data-Intensive Network Services
1. ISTORE: Introspective Storage for Data-Intensive Network Services
- Aaron Brown, David Oppenheimer, Kimberly Keeton, Randi Thomas, John Kubiatowicz, and David Patterson
- Computer Science Division
- University of California, Berkeley
- http://iram.cs.berkeley.edu/istore/
- CS444I Guest Presentation, 4/29/99
2. ISTORE Philosophy
Traditional Research Priorities: 1) Performance 1) Cost 3) Scalability 4) Availability 5) Maintainability
- Traditional systems research has focused on peak performance and cost
3. ISTORE Philosophy
Traditional Research Priorities: 1) Performance 1) Cost 3) Scalability 4) Availability 5) Maintainability
ISTORE Priorities: 1) Maintainability 2) Availability 3) Scalability 4) Performance 4) Cost
- In reality, maintainability, availability, and scalability are more important
  - performance and cost mean little if the system isn't working
4. ISTORE Philosophy: Introspection
- ISTORE's solution is introspective systems
  - systems that monitor themselves and automatically adapt to changes in their environment and workload
  - introspection enables automatic self-maintenance and self-tuning
- ISTORE vision: a framework that makes it easy to build introspective systems
- ISTORE target: high-end servers for data-intensive infrastructure services
  - single-purpose systems managing large amounts of data for large numbers of active network users
5. Outline
- Motivation for Introspective Systems
- ISTORE Research Agenda and Architecture
  - Hardware
  - Software
- Policy-driven Introspection Example
- Research Issues, Status, and Discussion
6. Motivation: Service Demands
- Emergence of a true information infrastructure
  - today: e-commerce, online database services, online backup, search engines, and web servers
  - tomorrow: more of the above (with ever-growing datasets), plus thin-client/PDA infrastructure support
- Infrastructure users expect always-on service and constant quality of service
  - infrastructure must provide fault-tolerance and performance-tolerance
  - failures and slowdowns have major business impact
    - e.g., recent E*Trade, Schwab outages
7. Motivation: Service Demands (2)
- Delivering 24x7 fault- and performance-tolerance requires:
  - fast adaptation to failures, load spikes, changing access patterns
  - easy incremental scalability when existing resources stop providing desired QoS
  - self-maintenance: the system handles problems as they arise, automatically
    - can't rely on human intervention to fix problems or to tune performance
    - humans are too expensive, too slow, prone to mistakes
- Introspective systems can deliver self-maintenance
8. Motivation: System Scaling
- Infrastructure services are growing rapidly
  - more users, more online data, higher access rates, more historical data
  - bigger and bigger backend systems are needed
    - O(300)-node clusters deployed now; thousands of nodes not far off
  - techniques for maintenance and administration must scale with the system to 1000s of nodes
- Today's administrative approaches don't scale
  - systems will be too big to reason about, monitor, or fix
  - failures and load variance will be too frequent for static solutions to work
- Introspective, reactive techniques are required
9. ISTORE Research Agenda
- ISTORE goal: create a hardware/software framework for building introspective servers
  - Hardware: plug-and-play intelligent devices with integrated self-monitoring and diagnostics
    - intelligence used to collect and filter monitoring data
    - networked to create a scalable shared-nothing cluster
  - Software: toolkit that allows programmers to easily define the system's adaptive behavior
    - provides abstractions for manipulating and reacting to monitoring data
10. Hardware Support for Introspective Servers
- Introspective, self-maintaining servers need hardware support
  - tightly-integrated monitoring on all components
    - device health, performance data, access patterns, environmental info, ...
  - shared-nothing architecture for scalability, heterogeneity, availability
  - redundancy at all levels
  - idiot-proof packaging that's compact, easily scaled, easily maintained
- ISTORE uses clusters of intelligent device bricks to provide this support
11. ISTORE-1 Hardware Prototype
- Based on intelligent disk bricks (64 nodes)
  - fast embedded CPU performs local monitoring tasks, runs parallel application code
  - diagnostic hardware provides fail-fast behavior, self-testing, additional monitoring
12. ISTORE-1 Hardware Design
- Brick
  - processor board
    - mobile Pentium-II, 366 MHz, 128MB SODRAM
    - PCI and ISA busses/controllers, SuperIO (serial ports)
    - Flash BIOS
    - 4x100Mb Ethernet interfaces
    - Adaptec Ultra2-LVD SCSI interface
  - disk: one 18.2GB 10,000 RPM low-profile SCSI disk
  - diagnostic processor
13. ISTORE-1 Hardware Design (2)
- Network
  - primary data network
    - hierarchical, highly-redundant switched Ethernet
    - uses 16 20-port 100Mb switches at the leaves
    - each brick connects to 4 independent switches
    - root switching fabric is two ganged 25-port Gigabit switches (PacketEngines PowerRails)
  - diagnostic network
14. Diagnostic Support
- Each brick has a diagnostic processor
  - Goal: small, independent, trusted piece of hardware running hand-verifiable monitoring/control software
  - monitoring: CPU watchdog, environmental conditions
  - control:
    - reboot/power-cycle main CPU
    - inject simulated faults: power, bus transients, memory errors, network interface failure, ...
- Separate diagnostic network connects the diagnostic processors of each brick
  - provides independent network path to diagnostic CPU
    - works when brick CPU is powered off or has failed
    - separate failure modes from Ethernet interfaces
15. Diagnostic Support: Implementation
- Not-so-small embedded Motorola 68k processor
  - provides the flexibility needed for research prototype
    - can communicate with CPU via serial port, if desired
  - still can run just a small, simple monitoring and control program if desired (no OS, networking, etc.)
- CAN (Controller Area Network) diagnostic interconnect
  - one brick per shelf of 8 acts as gateway from CAN to redundant switched Ethernet fabric
  - CAN connects directly to automotive environmental monitoring sensors (temperature, fan RPM, ...)
16. ISTORE Research Agenda
- ISTORE goal: create a hardware/software framework for building introspective servers
  - Hardware
  - Software: toolkit that allows programmers to easily define the system's adaptive behavior
    - provides abstractions for manipulating and reacting to monitoring data
17. A Software Framework for Introspection
- ISTORE hardware provides device monitoring
  - applications could write ad-hoc code to collect, process, and react to monitoring data
- ISTORE software framework should simplify writing introspective applications
  - rule-based adaptation engine encapsulates the mechanisms of collecting and processing monitoring data
  - policy compiler and mechanism libraries help turn application adaptation goals into rules and reaction code
  - these provide a high-level, abstract interface to the system's monitoring and adaptation mechanisms
18. Rule-based Adaptation
- ISTORE's adaptation framework is built on the model of an active database
  - database includes:
    - hardware monitoring data: device status, access patterns, performance stats
    - software monitoring data: app-specific quality-of-service metrics, high-level workload patterns, ...
  - applications define views and triggers over the DB
    - views select and aggregate data of interest to the app
    - triggers are rules that invoke application-specific reaction code when their predicates are satisfied
  - SQL-like declarative language used to specify views and trigger rules
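A minimal Python sketch of the view-and-trigger model on this slide. The slides do not show the ISTORE declarative language, so ordinary Python callables stand in for it here; `MonitoringDB`, `define_view`, `define_trigger`, `insert`, and the `temp_c` field are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass
class MonitoringDB:
    """Toy in-memory stand-in for the active database of monitoring data."""
    rows: List[Row] = field(default_factory=list)
    views: Dict[str, Callable[[List[Row]], Any]] = field(default_factory=dict)
    triggers: List[tuple] = field(default_factory=list)

    def define_view(self, name, query):
        # A view selects and aggregates the rows of interest to the app.
        self.views[name] = query

    def define_trigger(self, view_name, predicate, reaction):
        # A trigger pairs a predicate over a view with reaction code.
        self.triggers.append((view_name, predicate, reaction))

    def insert(self, row):
        # New monitoring data arrives: re-evaluate every trigger.
        self.rows.append(row)
        for view_name, predicate, reaction in self.triggers:
            value = self.views[view_name](self.rows)
            if predicate(value):
                reaction(value)


def max_temperature(rows):
    """Hypothetical view: the hottest temperature reported so far."""
    temps = [r["temp_c"] for r in rows if "temp_c" in r]
    return max(temps) if temps else None


alerts = []
db = MonitoringDB()
db.define_view("max_temp", max_temperature)
db.define_trigger("max_temp", lambda t: t is not None and t > 50, alerts.append)

db.insert({"disk_id": 0, "temp_c": 41})   # predicate false, no reaction
db.insert({"disk_id": 1, "temp_c": 57})   # predicate true, reaction runs
```

The application code only supplies the view, the predicate, and the reaction; the gathering and re-evaluation loop stays inside the framework, which is the separation the slide argues for.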
19. Benefits of Views and Triggers
- Allow applications to focus on adaptation, not monitoring
  - hide the mechanics of gathering and processing monitoring data
  - can be dynamically redefined without altering adaptation code as situation changes
- Can be implemented without a real database
  - views and triggers implemented as device-local and distributed filters and reaction rules
  - defined views and triggers control frequency, granularity, types of data gathered by HW monitoring
  - no materialized database necessary
20. Raising the Level of Abstraction: Policy Compiler and Mechanism Libs
- Rule-based adaptation doesn't go far enough
  - application designer must still write views, triggers, and adaptation code by hand
  - but designer thinks in terms of system policies
- Solution: designer specifies policies to system; system implements them
  - policy compiler automatically generates views, triggers, adaptation code
  - uses preexisting mechanism libraries to implement adaptation algorithms
  - claim: feasible for common adaptation mechanisms needed by data-intensive network service apps
21. Adaptation Policies
- Policies specify system states and how to react to them
  - high-level specification: independent of schema of system, object/node identity
    - that knowledge is encapsulated in policy compiler
- Examples
  - self-maintenance and availability
    - if overall free disk space is below 10%, compress all but one replica/version of least-frequently-accessed data
    - if any disk reports more than 5 errors per hour, migrate all data off that disk and shut it down
    - invoke load-balancer when new disk is added to system
  - performance tuning
    - place large, sequentially-accessed objects on outer tracks of fast disks as space becomes available
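As an illustration of how one of these policies might reduce to a trigger plus reaction code, here is a hedged Python sketch of the "more than 5 errors per hour" example. The class `DiskErrorPolicy`, the callback `migrate_and_shut_down`, and the use of a sliding one-hour window are assumptions made for this sketch; the slides do not say how the error rate would be measured.

```python
from collections import defaultdict, deque

ERROR_LIMIT = 5          # "more than 5 errors per hour" from the policy
WINDOW_SECONDS = 3600    # assumed: a sliding one-hour window


class DiskErrorPolicy:
    """Counts recent errors per disk and reacts once the limit is exceeded."""

    def __init__(self, on_violation):
        self.on_violation = on_violation
        self.errors = defaultdict(deque)   # disk_id -> recent error timestamps
        self.retired = set()               # disks already shut down

    def report_error(self, disk_id, timestamp):
        if disk_id in self.retired:
            return
        window = self.errors[disk_id]
        window.append(timestamp)
        # Drop errors that are an hour old or more.
        while window and timestamp - window[0] >= WINDOW_SECONDS:
            window.popleft()
        if len(window) > ERROR_LIMIT:
            self.retired.add(disk_id)
            self.on_violation(disk_id)


actions = []


def migrate_and_shut_down(disk_id):
    # Placeholder for calls into mechanism libraries (hypothetical names).
    actions.append(("migrate_all_data", disk_id))
    actions.append(("shut_down", disk_id))


policy = DiskErrorPolicy(migrate_and_shut_down)
for t in range(0, 600, 100):       # six errors within ten minutes on disk 3
    policy.report_error(3, t)
policy.report_error(7, 0)          # a single error on disk 7: no reaction
```

Note that the policy text names no particular disk; the per-disk bookkeeping is the kind of detail the slide expects the policy compiler to supply.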
22. Software Structure
[Diagram: policy -> policy compiler -> view, trigger, adaptation code; adaptation code -> mechanism libraries]
23. Detailed Adaptation Example
- Policy: quench hot spots by migrating objects
    while ((average queue length for any disk D) > (120% of average for whole system))
        migrate hottest object on D to disk with shortest average queue length
[Diagram: the policy is used as input to the policy compiler, which produces the view, trigger, and adaptation code; the adaptation code calls the mechanism libraries]
24. Example View Definition
    while ((average queue length for any disk D) > (120% of average for whole system))
        migrate hottest object on D to disk with shortest average queue length
[Diagram: policy -> policy compiler -> view, trigger, adaptation code -> mechanism libraries]
25. Example Trigger
    while ((average queue length for any disk D) > (120% of average for whole system))
        migrate hottest object on D to disk with shortest average queue length
[Diagram: policy -> policy compiler -> view, trigger, adaptation code -> mechanism libraries]
    DEFINE VIEW (average_queue_length..., queue_length..., disk_id..., short_disk...)
26. Example Adaptation Code
    while ((average queue length for any disk D) > (120% of average for whole system))
        migrate hottest object on D to disk with shortest average queue length
[Diagram: policy -> policy compiler -> view, trigger, adaptation code -> mechanism libraries]
    DEFINE VIEW (average_queue_length..., queue_length..., disk_id..., short_disk...)
    foreach disk_id from_disk
        if (queue_length[from_disk] > 1.2*average_queue_length)
            user_migrate(from_disk, short_disk)
27. Example Mechanism Lib. Calls
    while ((average queue length for any disk D) > (120% of average for whole system))
        migrate hottest object on D to disk with shortest average queue length
[Diagram: policy -> policy compiler -> view, trigger, adaptation code -> mechanism libraries]
    DEFINE VIEW (average_queue_length..., queue_length..., disk_id..., short_disk...)
    user_migrate(from_disk, to_disk) {
        diskObject x;
        x = find_hottest_obj(from_disk);
        migrate(x, to_disk);
    }
    foreach disk_id from_disk
        if (queue_length[from_disk] > 1.2*average_queue_length)
            user_migrate(from_disk, short_disk)
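The pieces of this example can be put together in a short runnable Python sketch. It is not the ISTORE implementation: the dictionary layout for disks, the use of access counts to define "hottest", and the three-argument `migrate` are assumptions made so the example is self-contained. It also performs a single pass over the disks rather than the policy's `while` loop, since a real loop would need fresh queue-length measurements after each migration.

```python
from statistics import mean


def find_hottest_obj(disk):
    # Assumed: the hottest object is the one with the highest access count.
    return max(disk["objects"], key=disk["objects"].get)


def migrate(obj, from_disk, to_disk):
    # Stand-in for a mechanism-library call: move the object between disks.
    to_disk["objects"][obj] = from_disk["objects"].pop(obj)


def quench_hot_spots(disks, threshold=1.2):
    """One pass of the policy; returns a list of (object, from_id, to_id)."""
    moves = []
    # The two quantities the view exposes: system-wide average and short_disk.
    average_queue_length = mean(d["queue_length"] for d in disks.values())
    short_disk = min(disks, key=lambda i: disks[i]["queue_length"])
    for from_disk, disk in disks.items():
        if from_disk == short_disk or not disk["objects"]:
            continue
        # The trigger predicate: queue more than 120% of the system average.
        if disk["queue_length"] > threshold * average_queue_length:
            obj = find_hottest_obj(disk)
            migrate(obj, disk, disks[short_disk])
            moves.append((obj, from_disk, short_disk))
    return moves


disks = {
    0: {"queue_length": 9.0, "objects": {"a": 500, "b": 20}},
    1: {"queue_length": 2.0, "objects": {"c": 40}},
    2: {"queue_length": 1.0, "objects": {"d": 10}},
}
moves = quench_hot_spots(disks)
```

With these sample numbers the average queue length is 4.0, so only disk 0 exceeds the 4.8 threshold, and its hottest object `a` moves to disk 2, the disk with the shortest queue.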
28. Mechanism Libraries
- Unify existing techniques/services found in single-node OSs, DBMSs, distributed systems
  - distributed directory services
  - replication and migration
  - data layout and placement
  - distributed transactions
  - checkpointing
  - caching
  - administrative (human UI) tasks
- Provide a place for higher-level monitoring
- Simplify creation of adaptation code
  - for humans using the rule system directly
  - for the policy compiler auto-generating code
select key mechanisms for data-intensive network services
29. Open Research Issues
- Defining appropriate software abstractions
  - how should views and triggers be declared?
  - what is the system's schema?
    - how should heterogeneous hardware be integrated?
    - can it be extended by the user to include new types and statistics?
  - what should the policy language look like?
  - what level of policies can be expressed?
    - how much of the implementation can the system figure out automatically?
    - to what extent can the system reason about policies and their interactions?
  - what functions should mechanism libraries provide?
30. More Open Research Issues
- Implementing an introspective system
  - what default policies should the system supply?
  - what are the internal and external interfaces?
  - debugging
    - visualization of states, triggers, ...
    - simulation/coverage analysis of policies, adaptation code
  - appropriate administrative interfaces
- Measuring an introspective system
  - what are the right benchmarks for maintainability, availability, scalability?
- O(>1000)-node scalability
  - how to write applications that scale and run well despite continual state of partial failure?
31. Related Work
- Hardware
  - CMU and UCSB Active Disks
- Software
  - Adaptive databases: MS AutoAdmin, Informix NoKnobs
  - Adaptive OSs: MS Millennium, adaptive VINO
  - Adaptive storage: HP AutoRAID, attribute-managed storage
  - Active databases: UFL Gator, TriggerMan
- ISTORE unifies many of these techniques in a single system
32. Status and Conclusions
- ISTORE's focus is on introspective systems
  - a new perspective on systems research priorities
- Proposed framework for building introspection
  - intelligent, self-monitoring plug-and-play hardware
  - software that provides a higher level of abstraction for the construction of introspective systems
    - flexible, powerful rule system for monitoring
    - policy specification automates generation of adaptation
- Status
  - ISTORE-1 hardware prototype being constructed now
  - software prototyping just starting
33. ISTORE: Introspective Storage for Data-Intensive Network Services
- For more information:
  - http://iram.cs.berkeley.edu/istore/
  - istore-group@cs.berkeley.edu
34. Backup Slides
35. ISTORE-1 Hardware Design
- Brick
  - processor board
    - mobile Pentium-II, 366 MHz, 128MB SODRAM
    - PCI and ISA busses/controllers, SuperIO (serial ports)
    - Flash BIOS
    - 4x100Mb Ethernet interfaces
    - Adaptec Ultra2-LVD SCSI interface
  - disk: one 18.2GB 10,000 RPM low-profile SCSI disk
  - diagnostic processor
    - Motorola MC68376, 2MB Flash or NVRAM
    - serial connections to CPU for console and monitoring
    - controls power to all parts on board
    - CAN interface
36. ISTORE-1 Hardware Design (2)
- Network
  - primary data network
    - hierarchical, highly-redundant switched Ethernet
    - uses 16 20-port 100Mb switches at the leaves
    - each brick connects to 4 independent switches
    - root switching fabric is two ganged 25-port Gigabit switches (PacketEngines PowerRails)
  - diagnostic network
    - point-to-point CAN network connects bricks in a shelf
    - Ethernet fabric described above is used for shelf-to-shelf communication
    - console I/O from each brick can be routed through diagnostic network
37. Motivation: Technology Trends
- Disks, systems, switches are getting smaller
- Convergence on intelligent disks (IDISKs)
  - MicroDrive + system-on-a-chip => tiny IDISK nodes
- Inevitability of enormous-scale systems
  - by 2006, an O(10,000) IDISK-node cluster with 90TB of storage could fit in one rack
38. Disk Limit
- Continued advance in capacity (60%/yr) and bandwidth (40%/yr)
- Slow improvement in seek, rotation (8%/yr)
- Time to read whole disk:

  Year   Sequentially   Randomly (1 sector/seek)
  1990   4 minutes      6 hours
  2000   12 minutes     1 week(!)

- Does the 3.5" form factor make sense in 5-7 years?
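The trend behind this table can be sketched with a simple model: sequential read time is capacity divided by bandwidth, and random read time is one access (seek plus rotation) per sector. The slides give only the growth rates and the resulting times, so the 1990 disk parameters below (1 GB, 4 MB/s, 11 ms per access, 512-byte sectors) are assumptions chosen for illustration, as is the reading of "8%/yr" as access time shrinking by a factor of 1.08 each year.

```python
def read_times(capacity_bytes, bandwidth_bytes_per_s, access_time_s,
               sector_bytes=512):
    """Return (sequential, random) whole-disk read times in seconds."""
    sequential_s = capacity_bytes / bandwidth_bytes_per_s
    # Random case: one seek plus rotation per sector read.
    random_s = (capacity_bytes / sector_bytes) * access_time_s
    return sequential_s, random_s


def project(value, rate_per_year, years):
    """Compound growth, e.g. rate_per_year=0.60 for 60%/yr."""
    return value * (1 + rate_per_year) ** years


# Assumed (hypothetical) 1990 disk: 1 GB, 4 MB/s, 11 ms per random access.
capacity, bandwidth, access = 1e9, 4e6, 0.011
seq_1990, rand_1990 = read_times(capacity, bandwidth, access)

# Apply the slide's growth rates for ten years.
capacity_2000 = project(capacity, 0.60, 10)
bandwidth_2000 = project(bandwidth, 0.40, 10)
access_2000 = access / (1.08 ** 10)
seq_2000, rand_2000 = read_times(capacity_2000, bandwidth_2000, access_2000)

print(f"1990: {seq_1990 / 60:.1f} min sequential, {rand_1990 / 3600:.1f} h random")
print(f"2000: {seq_2000 / 60:.1f} min sequential, {rand_2000 / 86400:.1f} days random")
```

With these assumed inputs the 1990 figures land close to the slide's 4 minutes and 6 hours. The projected 2000 values depend on the assumed starting point and compounding, so they follow the table's trend without reproducing it exactly: random read time grows far faster than sequential read time because capacity outpaces seek and rotation improvements.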
39. ISTORE-II Hardware Vision
- System-on-a-chip enables computer, memory, redundant network interfaces without significantly increasing size of disk
- Target for 5-7 years:
  - 1999 IBM MicroDrive:
    - 1.7" x 1.4" x 0.2" (43 mm x 36 mm x 5 mm)
    - 340 MB, 5400 RPM, 5 MB/s, 15 ms seek
  - 2006 MicroDrive?
    - 9 GB, 50 MB/s (1.6X/yr capacity, 1.4X/yr BW)
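The 2006 projection follows from compounding the 1999 MicroDrive figures at the stated yearly factors for seven years. A small Python check, where the helper name `compound` is introduced only for this sketch:

```python
def compound(value, factor_per_year, years):
    """Grow a value by a fixed multiplicative factor each year."""
    return value * factor_per_year ** years


years = 2006 - 1999

# 1999 IBM MicroDrive figures from the slide: 340 MB and 5 MB/s.
capacity_gb = compound(340 / 1000, 1.6, years)    # 1.6X/yr capacity
bandwidth_mb_s = compound(5, 1.4, years)          # 1.4X/yr bandwidth

print(f"2006 capacity:  {capacity_gb:.1f} GB")
print(f"2006 bandwidth: {bandwidth_mb_s:.1f} MB/s")
```

The results, about 9.1 GB and 52.7 MB/s, round to the 9 GB and 50 MB/s quoted on the slide.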
40. 2006 ISTORE
- ISTORE node
  - Add 20% pad to MicroDrive size for packaging, connectors
  - Then double thickness to add IRAM
  - 2.0" x 1.7" x 0.5" (51 mm x 43 mm x 13 mm)
- Crossbar switches growing by Moore's Law
  - 2x/1.5 yrs => 4X transistors/3 yrs
  - Crossbars grow by N^2 => 2X switch/3 yrs
  - 16 x 16 in 1999 => 64 x 64 in 2005
- ISTORE rack (19" x 33" x 84") (480 mm x 840 mm x 2130 mm)
  - 1 tray (3" high) => 16 x 32 => 512 ISTORE nodes
  - 20 trays + switches + UPS => 10,240 ISTORE nodes(!)
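The arithmetic on this slide can be checked in a few lines of Python. The constants come from the slides (MicroDrive dimensions, 20% pad, 16 x 32 nodes per tray, 20 trays, and the 9 GB projected drive); the function names are hypothetical, and applying the pad to all three dimensions before doubling the thickness is an assumption that happens to reproduce the slide's 0.5" figure.

```python
# Figures taken from the slides.
MICRODRIVE_IN = (1.7, 1.4, 0.2)    # 1999 IBM MicroDrive, inches
PAD = 1.20                         # "add 20% pad"
NODES_PER_TRAY = 16 * 32
TRAYS_PER_RACK = 20
GB_PER_NODE = 9                    # projected 2006 MicroDrive capacity


def node_size(drive=MICRODRIVE_IN, pad=PAD):
    """Pad every dimension, then double the thickness for the IRAM."""
    length, width, thickness = (d * pad for d in drive)
    return length, width, thickness * 2


def crossbar_ports(ports_1999, year):
    """4x transistors per 3 years and N^2 crossbar cost: N doubles per 3 years."""
    return ports_1999 * 2 ** ((year - 1999) // 3)


nodes_per_rack = NODES_PER_TRAY * TRAYS_PER_RACK
rack_storage_tb = nodes_per_rack * GB_PER_NODE / 1000

print([round(x, 1) for x in node_size()])
print(crossbar_ports(16, 2005), nodes_per_rack, rack_storage_tb)
```

The node comes out at roughly 2.0" x 1.7" x 0.5", the crossbar reaches 64 ports by 2005, and 10,240 nodes at 9 GB each give about 92 TB, consistent with the 90TB-per-rack figure quoted earlier in the deck.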
41. Benefits of Views and Triggers (2)
- Equally useful for performance and failure management
- Performance tuning example: DB index management
  - View: access patterns to tables, query predicates used
  - Trigger: access rate to table above/below average
  - Adaptation: add/drop indices based on query stream
- Failure management example: impending disk failure
  - View: disk error logs, environmental conditions
  - Trigger: frequency of errors, unsafe environment
  - Adaptation: redirect requests to other replicas, shut down disk, generate new replicas, signal operator
42. More Adaptation Policy Examples
- Self-maintenance and availability
  - maintain two copies of all dirty data stored only in volatile memory
  - if a disk fails, restore original redundancy level for objects stored on that disk
- Performance tuning
  - if accesses to a read-mostly object take more than 10ms on average, replicate the object on another disk
- Both (like HP AutoRAID)
  - if an object is in the top 10% of frequently-accessed objects, and there is only one copy, create a new replica. If an object is in the bottom 90%, delete all replicas and stripe the object across N disks using RAID-5.
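The last policy above can be read as a ranking rule. A hedged Python sketch follows: `plan_layout`, its action labels, and the choice to rank by raw access count are all assumptions for illustration, and the function only plans actions rather than performing replication or striping.

```python
def plan_layout(access_counts, replica_counts, hot_fraction=0.10):
    """Map each object to a planned action under the hot/cold policy.

    access_counts:  object name -> number of accesses
    replica_counts: object name -> current number of copies (default 1)
    """
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    n_hot = max(1, round(len(ranked) * hot_fraction)) if ranked else 0
    plan = {}
    for rank, obj in enumerate(ranked):
        if rank < n_hot:
            # Top 10%: add a replica only if there is a single copy.
            if replica_counts.get(obj, 1) == 1:
                plan[obj] = "create_replica"
        else:
            # Bottom 90%: drop replicas and stripe with RAID-5.
            plan[obj] = "delete_replicas_and_stripe_raid5"
    return plan


# Ten objects, o0 the most frequently accessed, each with one copy.
access_counts = {f"o{i}": 100 - 10 * i for i in range(10)}
replica_counts = {name: 1 for name in access_counts}
plan = plan_layout(access_counts, replica_counts)
```

With ten objects the top 10% is a single object, so `o0` gets a new replica and the other nine are planned for striping. A hot object that already has more than one copy is left out of the plan.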
43. Mechanism Library Benefits
- Programmability
  - libraries provide high-level abstractions of services
  - code using the libraries is easier to reason about, maintain, customize
- Performance
  - libraries can be highly-optimized
  - optimization complexity is hidden by abstraction
- Reliability
  - libraries include code that's easy to forget or get wrong
    - synchronization, communication, memory allocation
  - debugging effort can be spent getting libraries right
  - library users inherit the verification effort