ISTORE: Introspective Storage for Data-Intensive Network Services

Slides: 50

Provided by: aaronbrown6

Learn more at: http://istore.cs.berkeley.edu

1
ISTORE: Introspective Storage for Data-Intensive
Network Services

Aaron Brown, David Oppenheimer, Jim Beck,
Kimberly Keeton, Rich Martin, Randi Thomas,
John Kubiatowicz, David Patterson, and Kathy
Yelick
Computer Science Division
University of California, Berkeley
http://iram.cs.berkeley.edu/istore/
1999 Summer IRAM Retreat

2
ISTORE Philosophy
Traditional Research Priorities:
1) Performance 1) Cost 3) Scalability 4) Availability 5) Maintainability

Traditional systems research has focused on peak
performance and cost

3
ISTORE Philosophy: SAM
Traditional Research Priorities:
1) Performance 1) Cost 3) Scalability 4) Availability 5) Maintainability
ISTORE Priorities:
1) Maintainability 2) Availability 3) Scalability 4) Performance 4) Cost

In reality, scalability, maintainability, and
availability (SAM) are equally important:
performance and cost mean little if the system
isn't working

4
ISTORE Philosophy: Introspection

ISTORE's solution is introspective systems:
systems that monitor themselves and automatically
adapt to changes in their environment and
workload
introspection enables automatic self-maintenance
and self-tuning
ISTORE vision: a framework that makes it easy to
build introspective systems
ISTORE target: high-end servers for
data-intensive infrastructure services
single-purpose systems managing large amounts of
data for large numbers of active network users

5
Outline

Motivation for Introspective Systems
ISTORE Research Agenda and Architecture
Hardware
Software
Policy-driven Introspection Example
Research Issues, Status, and Discussion

6
Motivation: Service Demands

Emergence of a true information infrastructure
today: e-commerce, online database services,
online backup, search engines, and web servers
tomorrow: more of above (with ever-growing
datasets), plus thin-client/PDA infrastructure
support
Infrastructure users expect always-on service
and constant quality of service
infrastructure must provide fault-tolerance and
performance-tolerance
failures and slowdowns have major business impact
e.g., recent eBay, E*Trade, Schwab outages

7
Motivation: Service Demands (2)

Delivering 24x7 fault- and performance-tolerance
requires:
a robust hardware platform
fast adaptation to failures, load spikes,
changing access patterns
easy incremental scalability when existing
resources stop providing desired quality of
service
self-maintenance: the system handles problems as
they arise, automatically
can't rely on human intervention to fix problems
or to tune performance
humans are too expensive, too slow, prone to
mistakes
Introspective systems are self-maintaining

8
Motivation: System Scaling

Infrastructure services are growing rapidly
more users, more online data, higher access
rates, more historical data
bigger and bigger back-end systems are needed
O(300)-node clusters deployed now; thousands of
nodes not far off
techniques for maintenance and administration
must scale with the system to 1000s of nodes
Today's administrative approaches don't scale
systems will be too big to reason about, monitor,
or fix
failures and load variance will be too frequent
for static solutions to work
Introspective, reactive techniques are required

9
ISTORE Research Agenda

ISTORE goal: create a hardware/software
framework for building introspective servers
Hardware: plug-and-play intelligent devices with
integrated self-monitoring, diagnostics, and
fault injection hardware
intelligence used to collect and filter
monitoring data
diagnostics and fault injection enhance
robustness
networked to create a scalable shared-nothing
cluster
Software: toolkit that allows programmers to
easily define the system's adaptive behavior
provides abstractions for manipulating and
reacting to monitoring data

10
Hardware Requirements for Self-Maintaining
Servers

Redundant components that fail fast
no single point of failure anywhere
Tightly-integrated device monitoring
low-level HW diagnostics to detect impending
failure
device health, performance data, access
patterns, environmental info, ...
Automatic preventive maintenance
predictive failure analysis based on diagnostic
data
continual scrubbing and in situ stress testing
of all components, new and old
Self-characterizing, plug-and-play hardware

11
ISTORE-1 Hardware Prototype

Based on intelligent disk bricks
each brick is one ISTORE node
ISTORE-1 will have 64 bricks/nodes

12
ISTORE-1 Hardware Design

Brick
processor board
mobile Pentium-II, 128MB SODRAM
PCI and ISA busses/controllers, SuperIO (serial
ports)
Flash BIOS
4x100Mb Ethernet interfaces
Adaptec Ultra2-LVD SCSI interface
disk: one 18.2GB low-profile SCSI disk
diagnostic processor
OS: several UNIX-like OSs supporting Linux ABI
(Linux, NetBSD, FreeBSD, Solaris x86?)

13
ISTORE-1 Hardware Design (2)

Network
primary data network
hierarchical, highly-redundant switched Ethernet
uses 16 20-port 100Mb switches at the leaves
each brick connects to 4 independent switches
root switching fabric is two ganged 25-port
Gigabit switches (PacketEngines PowerRails)
diagnostic network
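
A quick port count for the data network, as a sketch that uses only the figures on this slide plus the 64-brick count from the prototype slide. The wiring in `leaf_switches_for` is a hypothetical layout chosen for illustration; the slides do not say which four switches a given brick attaches to, nor how the spare leaf ports are used.

```python
from collections import Counter

BRICKS = 64            # ISTORE-1 node count (from the prototype slide)
NICS_PER_BRICK = 4     # each brick connects to 4 independent switches
LEAF_SWITCHES = 16     # 20-port 100Mb switches at the leaves
PORTS_PER_LEAF = 20

brick_links = BRICKS * NICS_PER_BRICK                    # links leaving the bricks
leaf_ports = LEAF_SWITCHES * PORTS_PER_LEAF              # ports available at the leaves
links_per_leaf = brick_links // LEAF_SWITCHES            # assumes an even spread
spare_ports_per_leaf = PORTS_PER_LEAF - links_per_leaf   # left over on each leaf switch


def leaf_switches_for(brick: int) -> list[int]:
    """Hypothetical wiring: spread a brick's links over distinct leaf switches."""
    stride = LEAF_SWITCHES // NICS_PER_BRICK
    return [(brick + k * stride) % LEAF_SWITCHES for k in range(NICS_PER_BRICK)]


# Count how many brick links land on each leaf switch under that wiring.
load = Counter(s for b in range(BRICKS) for s in leaf_switches_for(b))
print(brick_links, leaf_ports, links_per_leaf, spare_ports_per_leaf)
```

Under these assumptions every leaf switch carries the same number of brick links, and losing any single leaf switch leaves each brick with three working links.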

14
Diagnostic Processor

Each brick has a diagnostic processor
Goal: small, independent, trusted piece of
hardware running hand-verifiable
monitoring/control software
monitoring: connects to motherboard SMBus, CAN
bus
environmental monitor, CPU watchdog
control:
reboot/power-cycle main CPU
inject simulated faults: power, bus transients,
memory errors, network interface failure, ...
Not-so-small embedded Motorola 68k processor
provides the flexibility needed for research
prototype
still can run just a small, simple monitoring and
control program if desired (no OS, networking,
etc.)

15
Diagnostic Network

Separate diagnostic network connects the
diagnostic processors of each brick
provides independent network path to diagnostic
CPU
works when brick CPU is powered off or has failed
separate failure modes from Ethernet interfaces
CAN (Controller Area Network) diagnostic
interconnect
CAN connects directly to environmental monitoring
sensors (temperature, fan RPM, ...)
one brick per shelf of 8 acts as gateway from
CAN to redundant switched Ethernet fabric

16
ISTORE-1 Hardware Prototype

Meets requirements for a robust HW platform
fast embedded CPU performs local monitoring tasks
diagnostic hardware enables low-level diagnostic
monitoring, fail-fast behavior, and fault
injection
highly-redundant system design
redundant data network and interfaces at all
levels
separate diagnostic network
redundant backup power
powerful preventive maintenance
each brick can be periodically taken offline and
stress-tested/scrubbed using diagnostic
processor's fault injection capabilities

17
ISTORE Research Agenda

ISTORE goal: create a hardware/software
framework for building introspective servers
Hardware
Software: toolkit that allows programmers to
easily define the system's adaptive behavior
provides abstractions for manipulating and
reacting to monitoring data

18
A Software Framework for Introspection

ISTORE hardware provides device monitoring
application programmers could write ad-hoc code
to collect, process, and react to monitoring data
ISTORE software framework should simplify writing
introspective applications
rule-based adaptation engine encapsulates the
mechanisms of collecting, processing monitoring
data
policy compiler and mechanism libraries help turn
application adaptation goals into rules and
reaction code
these provide a high-level, abstract interface to
the system's monitoring and adaptation mechanisms

19
Rule-based Adaptation

ISTORE's adaptation framework built on model of
active database
database includes:
hardware monitoring data: device status, access
patterns, performance stats
software monitoring data: app-specific
quality-of-service metrics, high-level workload
patterns, ...
applications define views and triggers over the
DB
views select and aggregate data of interest to
app.
triggers are rules that invoke
application-specific reaction code when their
predicates are satisfied
SQL-like declarative language used to specify
views and trigger rules
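
The view/trigger model above can be sketched in a few lines. This is an illustrative toy, not ISTORE's implementation: the class `MonitoringDB`, its method names, and the sample rows are all hypothetical, and Python callables stand in for the SQL-like declarative language the slide mentions.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Any, Callable


@dataclass
class MonitoringDB:
    """Toy active database: rows of monitoring data, views, and triggers."""
    rows: list = field(default_factory=list)
    views: dict = field(default_factory=dict)
    triggers: list = field(default_factory=list)

    def define_view(self, name: str, select: Callable[[list], Any]) -> None:
        # A view selects and aggregates the data an application cares about.
        self.views[name] = select

    def define_trigger(self, view: str, predicate: Callable[[Any], bool],
                       reaction: Callable[[Any], None]) -> None:
        # A trigger invokes reaction code when its predicate is satisfied.
        self.triggers.append((view, predicate, reaction))

    def insert(self, row: dict) -> None:
        # New monitoring data arrives; re-evaluate every trigger.
        self.rows.append(row)
        for view, predicate, reaction in self.triggers:
            value = self.views[view](self.rows)
            if predicate(value):
                reaction(value)


alerts = []
db = MonitoringDB()
db.define_view("avg_queue", lambda rows: mean(r["queue_length"] for r in rows))
db.define_trigger("avg_queue", lambda avg: avg > 10, alerts.append)
db.insert({"disk_id": 0, "queue_length": 4})    # average 4: no reaction
db.insert({"disk_id": 1, "queue_length": 30})   # average 17: reaction fires
print(alerts)
```

The application only supplies the view, the predicate, and the reaction; how samples are gathered stays behind `insert`.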

20
Benefits of Views and Triggers

Allow applications to focus on adaptation, not
monitoring
hide the mechanics of gathering and processing
monitoring data
can be dynamically redefined without altering
adaptation code as situation changes
Can be implemented without a real database
views and triggers implemented as device-local
and distributed filters and reaction rules
defined views and triggers control frequency,
granularity, types of data gathered by HW
monitoring
no materialized database necessary

21
Raising the Level of Abstraction: Policy Compiler
and Mechanism Libs

Rule-based adaptation doesn't go far enough
application designer must still write views,
triggers, and adaptation code by hand
but designer thinks in terms of system policies
Solution: designer specifies policies to system;
system implements them
policy compiler automatically generates views,
triggers, adaptation code
uses preexisting mechanism libraries to implement
adaptation algorithms
claim: feasible for common adaptation mechanisms
needed by data-intensive network service apps.

22
Adaptation Policies

Policies specify system states and how to react
to them
high-level specification independent of schema
of system, object/node identity
that knowledge is encapsulated in policy compiler

Examples
self-maintenance and availability
if overall free disk space is below 10%, compress
all but one replica/version of
least-frequently-accessed data
if any disk reports more than 5 errors per hour,
migrate all data off that disk and shut it down
invoke load-balancer when new disk is added to
system
performance tuning
place large, sequentially-accessed objects on
outer tracks of fast disks as space becomes
available
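
As an illustration of how one of these policies might look once turned into reaction code, here is a minimal sketch of the disk-error policy (more than 5 errors per hour leads to migrating data off the disk and shutting it down). The class `DiskErrorPolicy`, the sliding-window bookkeeping, and the two callbacks are assumptions made for this example; the slides do not describe how the error rate is measured.

```python
from collections import defaultdict, deque

ERROR_LIMIT = 5          # "more than 5 errors per hour", from the policy text
WINDOW_SECONDS = 3600    # assumption: the hour is a sliding window


class DiskErrorPolicy:
    def __init__(self, migrate_all, shut_down):
        self._errors = defaultdict(deque)   # disk_id -> recent error timestamps
        self._retired = set()               # disks already shut down
        self._migrate_all = migrate_all     # hypothetical mechanism-library hooks
        self._shut_down = shut_down

    def report_error(self, disk_id, timestamp):
        """Record one error; return True if the policy fired for this disk."""
        if disk_id in self._retired:
            return False
        window = self._errors[disk_id]
        window.append(timestamp)
        # Drop errors that are an hour old or more.
        while window and window[0] <= timestamp - WINDOW_SECONDS:
            window.popleft()
        if len(window) > ERROR_LIMIT:
            self._migrate_all(disk_id)
            self._shut_down(disk_id)
            self._retired.add(disk_id)
            return True
        return False


actions = []
policy = DiskErrorPolicy(
    migrate_all=lambda d: actions.append(("migrate", d)),
    shut_down=lambda d: actions.append(("shutdown", d)),
)
# Six errors within one minute on disk 3 exceed the limit on the sixth report.
fired = [policy.report_error(3, t) for t in range(0, 60, 10)]
print(fired, actions)
```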

23
Software Structure
[Diagram: policy -> policy compiler -> view, trigger, adaptation code -> mechanism libraries]
24
Detailed Adaptation Example

Policy: quench hot spots by migrating objects

while ((average queue length for any disk D) >
(120% of average for whole system))
migrate hottest object on D to disk with
shortest average queue length
[Diagram: the policy is used as input to the policy compiler, which produces a view, a trigger, and adaptation code; the adaptation code calls the mechanism libraries]
25
Example View Definition
while ((average queue length for any disk D) >
(120% of average for whole system))
migrate hottest object on D to disk with
shortest average queue length
26
Example Trigger
while ((average queue length for any disk D) >
(120% of average for whole system))
migrate hottest object on D to disk with
shortest average queue length
DEFINE VIEW (average_queue_length..., queue_length..., disk_id..., short_disk... )
27
Example Adaptation Code
while ((average queue length for any disk D) >
(120% of average for whole system))
migrate hottest object on D to disk with
shortest average queue length
DEFINE VIEW (average_queue_length..., queue_length..., disk_id..., short_disk... )
foreach disk_id from_disk
  if (queue_length[from_disk] > 1.2 * average_queue_length)
    user_migrate(from_disk, short_disk)
28
Example Mechanism Lib. Calls
while ((average queue length for any disk D) >
(120% of average for whole system))
migrate hottest object on D to disk with
shortest average queue length
DEFINE VIEW (average_queue_length..., queue_length..., disk_id..., short_disk... )
foreach disk_id from_disk
  if (queue_length[from_disk] > 1.2 * average_queue_length)
    user_migrate(from_disk, short_disk)
user_migrate(from_disk, to_disk) {
  diskObject x;
  x = find_hottest_obj(from_disk);
  migrate(x, to_disk);
}
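
The pieces of this example can be put together as a small runnable sketch. The function names `user_migrate`, `find_hottest_obj`, and `migrate` follow the slide; everything else is an assumption made for illustration: the `disks` dictionary, modeling a disk's queue length as the sum of per-object loads, and the extra source-disk argument to `migrate`.

```python
from statistics import mean

# Hypothetical state: disk -> {object name: load that object contributes}.
disks = {
    "d0": {"a": 5, "b": 4, "c": 3},
    "d1": {"d": 2},
    "d2": {"e": 4},
}


def queue_length(disk_id):
    # Assumption: queue length is approximated by the sum of object loads.
    return sum(disks[disk_id].values())


def find_hottest_obj(disk_id):
    # Mechanism-library stand-in: the object contributing the most load.
    return max(disks[disk_id], key=disks[disk_id].get)


def migrate(obj, from_disk, to_disk):
    # Mechanism-library stand-in; takes the source disk explicitly,
    # unlike the two-argument call on the slide.
    disks[to_disk][obj] = disks[from_disk].pop(obj)


def user_migrate(from_disk, to_disk):
    x = find_hottest_obj(from_disk)
    migrate(x, from_disk, to_disk)
    return x


def run_trigger(factor=1.2):
    """One pass of the trigger: the view is computed once, then each disk is checked."""
    average_queue_length = mean(queue_length(d) for d in disks)
    short_disk = min(disks, key=queue_length)
    moved = []
    for from_disk in list(disks):
        if queue_length(from_disk) > factor * average_queue_length:
            moved.append((user_migrate(from_disk, short_disk), from_disk, short_disk))
    return moved


moved = run_trigger()
print(moved, {d: queue_length(d) for d in disks})
```

A single pass moves one object per overloaded disk, so the policy's `while` loop corresponds to the trigger firing again for as long as some disk stays above 120% of the average.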
29
Mechanism Libraries

Unify existing techniques/services found in
single-node OSs, DBMSs, distributed systems
distributed directory services
replication and migration
data layout and placement
distributed transactions
checkpointing
caching
administrative (human UI) tasks
Provide a place for higher-level monitoring
Simplify creation of adaptation code
for humans using the rule system directly
for the policy compiler auto-generating code

select key mechanisms for data-intensive network
services
30
Open Research Issues

Defining appropriate software abstractions
how should views and triggers be declared?
what should the policy language look like?
what functions should mechanism libraries
provide?
what is the system's schema?
how should heterogeneous hardware be integrated?
can it be extended by the user to include new
types and statistics?
what level of policies can be expressed?
how much of the implementation can the system
figure out automatically?
to what extent can the system reason about
policies and their interactions?

31
More Open Research Issues

Implementing an introspective system
what default policies should the system supply?
what are the internal and external interfaces?
debugging
visualization of states, triggers, ...
simulation/coverage analysis of policies,
adaptation code
appropriate administrative interfaces
Measuring an introspective system
what are the right benchmarks for scalability,
availability, and maintainability (SAM)?
O(>1000)-node scalability
how to write applications that scale and run well
despite continual state of partial failure?

32
Related Work

Hardware
CMU and UCSB Active Disks
Software
adaptive databases: MS AutoAdmin, Informix
NoKnobs
adaptive OSs: MS Millennium, adaptive VINO
adaptive storage: HP AutoRAID, attribute-managed
storage
active databases: UFL Gator, TriggerMan
ISTORE unifies many of these techniques in a
single system

33
Related Work: Ninja

Ninja: composable Internet-scale services
some ISTORE runtime software services provided
using Ninja programming platform?
provides
some fault tolerance
a framework for automatic service discovery
incremental s/w upgrades

34
Related Work: Telegraph

Universal system for information
Four layers
query, browse, mine
global agoric federation
continuously reoptimizing query processor
adaptive data placement
storage manager
Relationship to ISTORE
continuous online reoptimization
adaptive data placement
indexing, other operations on disk CPU

35
Related Work: OceanStore

Global-scale persistent storage
Nomadic, highly-available data
Federation of data storage providers
Investigate global-scale SAM
also naming, indexability, consistency
Relationship to ISTORE
investigating similar concepts but on a global
scale
converse: ISTORE as "Internet in a box"

36
Related Work: Endeavour

Endeavour: new research project at UCB
goal: enhancing human understanding through
information technology
ISTORE's potential contributions
ISTORE is building adaptive, scalable,
self-maintaining back-end servers for
storage-intensive network services
can be part of Endeavour's back-end
infrastructure
software research
using policies to guide a system's adaptive
behavior
providing QoS under degraded conditions
application platform
process and store streams of sensor data

37
Status and Conclusions

ISTORE's focus is on introspective systems
a new perspective on systems research priorities
Proposed framework for building introspection
intelligent, self-monitoring plug-and-play
hardware
software that provides a higher level of
abstraction for the construction of introspective
systems
flexible, powerful rule system for monitoring
policy specification automates generation of
adaptation
Status
ISTORE-1 hardware prototype being constructed now
software prototyping just starting

38
ISTORE Short-Term Plans

Solidify/begin implementing benchmarking ideas
run on existing systems to characterize and
compare them with respect to SAM
Assemble ISTORE-0 system
6 PCs with similar configurations to ISTORE-1
bricks
100 Mb/s switched Ethernet
Gain experience running multiple OSes
Investigate implementation options for monitoring
database, views, and triggers
Study data-intensive network service applications
to guide development of policy lang.
to determine what types of adaptation will help

39
ISTORE: Introspective Storage for Data-Intensive
Network Services

For more information:
http://iram.cs.berkeley.edu/istore/
istore-group@cs.berkeley.edu

40
Backup Slides
41
ISTORE-1 Hardware Design

Brick
processor board
mobile Pentium-II, 366 MHz, 128MB SODRAM
PCI and ISA busses/controllers, SuperIO (serial
ports)
Flash BIOS
4x100Mb Ethernet interfaces
Adaptec Ultra2-LVD SCSI interface
disk: one 18.2GB 10,000 RPM low-profile SCSI disk
diagnostic processor
Motorola MC68376, 2MB Flash or NVRAM
serial connections to CPU for console and
monitoring
controls power to all parts on board
CAN interface

42
ISTORE-1 Hardware Design (2)

Network
primary data network
hierarchical, highly-redundant switched Ethernet
uses 16 20-port 100Mb switches at the leaves
each brick connects to 4 independent switches
root switching fabric is two ganged 25-port
Gigabit switches (PacketEngines PowerRails)
diagnostic network
point-to-point CAN network connects bricks in a
shelf
Ethernet fabric described above is used for
shelf-to-shelf communication
console I/O from each brick can be routed through
diagnostic network

43
Motivation: Technology Trends

Disks, systems, switches are getting smaller

Convergence on intelligent disks (IDISKs)
MicroDrive + system-on-a-chip => tiny IDISK nodes
Inevitability of enormous-scale systems
by 2006, an O(10,000) IDISK-node cluster with 90TB
of storage could fit in one rack

44
Disk Limit

Continued advance in capacity (60%/yr) and
bandwidth (40%/yr)
Slow improvement in seek, rotation (8%/yr)
Time to read whole disk:
Year   Sequentially   Randomly (1 sector/seek)
1990   4 minutes      6 hours
2000   12 minutes     1 week(!)
3.5" form factor make sense in 5-7 years?

45
ISTORE-II Hardware Vision

System-on-a-chip enables computer, memory,
redundant network interfaces without
significantly increasing size of disk
Target for 5-7 years

1999 IBM MicroDrive:
1.7" x 1.4" x 0.2" (43 mm x 36 mm x 5 mm)
340 MB, 5400 RPM, 5 MB/s, 15 ms seek
2006 MicroDrive?
9 GB, 50 MB/s (1.6X/yr capacity, 1.4X/yr BW)
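
The 2006 figures follow from compounding the quoted growth rates over seven years. The short check below is a sketch; the helper name `project` is made up for this example, and the inputs are the 1999 MicroDrive numbers and growth factors given on this slide.

```python
def project(value, growth_per_year, years):
    """Compound a yearly growth factor over a number of years."""
    return value * growth_per_year ** years


YEARS = 2006 - 1999

capacity_mb_2006 = project(340, 1.6, YEARS)      # 340 MB growing 1.6X per year
bandwidth_mbs_2006 = project(5, 1.4, YEARS)      # 5 MB/s growing 1.4X per year

print(f"capacity: {capacity_mb_2006 / 1000:.1f} GB")
print(f"bandwidth: {bandwidth_mbs_2006:.1f} MB/s")
```

Both results land close to the slide's rounded 9 GB and 50 MB/s.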

46
2006 ISTORE

ISTORE node
Add 20% pad to MicroDrive size for packaging,
connectors
Then double thickness to add IRAM
2.0" x 1.7" x 0.5" (51 mm x 43 mm x 13 mm)
Crossbar switches growing by Moore's Law
2x/1.5 yrs => 4X transistors/3yrs
Crossbars grow by N^2 => 2X switch/3yrs
16 x 16 in 1999 => 64 x 64 in 2005
ISTORE rack (19" x 33" x 84") (480 mm x 840 mm
x 2130 mm)
1 tray (3" high) => 16 x 32 => 512 ISTORE nodes
20 trays + switches + UPS => 10,240 ISTORE nodes(!)
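
The rack and crossbar numbers can be checked with a few lines of arithmetic. This is a sketch using only figures from the slides (the 9 GB per node comes from the 2006 MicroDrive projection); the function name `crossbar_ports` and the reading that port count doubles every three years are how this example interprets the bullets above.

```python
NODES_PER_TRAY = 16 * 32      # one tray holds a 16 x 32 grid of nodes
TRAYS_PER_RACK = 20
TRAY_HEIGHT_IN = 3
RACK_HEIGHT_IN = 84
GB_PER_NODE = 9               # projected 2006 MicroDrive capacity

nodes_per_rack = NODES_PER_TRAY * TRAYS_PER_RACK
rack_storage_tb = nodes_per_rack * GB_PER_NODE / 1000
height_left_in = RACK_HEIGHT_IN - TRAYS_PER_RACK * TRAY_HEIGHT_IN  # for switches, UPS


def crossbar_ports(start_ports, start_year, year, years_per_doubling=3):
    # 4X transistors every 3 years and N^2 crossbar cost give 2X ports every 3 years.
    return start_ports * 2 ** ((year - start_year) // years_per_doubling)


print(nodes_per_rack, rack_storage_tb, height_left_in)
print(crossbar_ports(16, 1999, 2005))
```

The storage total is consistent with the roughly 90TB per rack quoted on the technology-trends slide.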

47
Benefits of Views and Triggers (2)

Equally useful for performance and failure
management
Performance tuning example: DB index management
View: access patterns to tables, query predicates
used
Trigger: access rate to table above/below average
Adaptation: add/drop indices based on query
stream
Failure management example: impending disk
failure
View: disk error logs, environmental conditions
Trigger: frequency of errors, unsafe environment
Adaptation: redirect requests to other replicas,
shut down disk, generate new replicas, signal
operator

48
More Adaptation Policy Examples

Self-maintenance and availability
maintain two copies of all dirty data stored only
in volatile memory
if a disk fails, restore original redundancy
level for objects stored on that disk
Performance tuning
if accesses to a read-mostly object take more
than 10ms on average, replicate the object on
another disk
Both (like HP AutoRAID)
if an object is in the top 10% of
frequently-accessed objects, and there is only
one copy, create a new replica. If an object is
in the bottom 90%, delete all replicas and stripe
the object across N disks using RAID-5.
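
The last policy can be read as a classification step followed by an action per object. The sketch below shows one possible reading; the function `plan_layout`, the action labels, and the sample access counts are hypothetical, and ranking by raw access count is an assumption about what "frequently-accessed" means.

```python
def plan_layout(access_counts, replica_counts, hot_fraction=0.10):
    """Return {object: action} for the top-10% / bottom-90% policy."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    hot_count = max(1, round(len(ranked) * hot_fraction))
    hot = set(ranked[:hot_count])

    actions = {}
    for obj in ranked:
        if obj in hot:
            # Hot object with a single copy: create a new replica.
            if replica_counts[obj] == 1:
                actions[obj] = "replicate"
        else:
            # Cold object: delete replicas and stripe with RAID-5.
            actions[obj] = "stripe_raid5"
    return actions


# Ten objects, o0 the most frequently accessed, each with one copy.
access_counts = {f"o{i}": 100 - 10 * i for i in range(10)}
replica_counts = {name: 1 for name in access_counts}

actions = plan_layout(access_counts, replica_counts)
print(actions)
```

A hot object that already has more than one copy gets no action, which matches the "and there is only one copy" condition in the policy.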