Title: Future Directions in Advanced Storage Services
1. Future Directions in Advanced Storage Services
- Danny Dolev
- School of Engineering and Computer Science
- Hebrew University
2. Case Study: replication for efficiency and robustness
- Storage Area Network (SAN)
- Current technology uses standard Ethernet connectivity and a cluster of workstations
- Message ordering is used to overcome possible inconsistency in case of failures
- During stable periods, total ordering is not used because of its high latency
- What about stress periods?
3. Motivation
- Message delivery order is a fundamental building block in distributed systems
- The agreed order allows distributed applications to use the state-machine replication model to achieve fault tolerance and data replication
- Replicated systems are often built atop Group Communication Systems (GCS)
- A GCS provides message ordering, reliable delivery, and group membership
- Many GCSs have been introduced with a variety of optimization tradeoffs; each has its own bottleneck, which prevents it from becoming truly scalable
4. Current State
- High-performance implementations use a management layer that resides in the critical path and provides:
- Message ordering
- Membership
- State synchronization
- Consistency
- This layer consumes valuable CPU cycles needed for the actual body of work
- No standard interface (API) and no interoperability
- Dictates a specific programming methodology (e.g., event-driven)
- Growth in network capacity outpaces progress in CPU capability
5. The challenges
- As network speeds reach several tens of Gbps, even a multi-core server reaches its CPU limits (approximately 1 Hz of CPU per 1 bps of throughput)
[Figures: GHz/Gbps ratio for Rx and for Tx]
The graphs appear in the paper "TCP Performance Re-Visited" (ISPASS '03) by Foong et al. and are used with the authors' permission.
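The rule of thumb above can be turned into a quick back-of-the-envelope calculation. The sketch below is only an illustration: the function names are hypothetical, the 10 Gbps link and the 2.8 GHz core clock are example inputs, and the 1 Hz per 1 bps ratio is an approximation rather than a measured constant.

```python
def cpu_ghz_needed(link_gbps, hz_per_bps=1.0):
    """CPU clock (in GHz) consumed by network processing at a given line rate.

    hz_per_bps is the rule-of-thumb ratio; 1.0 is an approximation.
    """
    return link_gbps * hz_per_bps


def cores_needed(link_gbps, core_ghz, hz_per_bps=1.0):
    """Number of cores of a given clock speed that the link would saturate."""
    return cpu_ghz_needed(link_gbps, hz_per_bps) / core_ghz


# Example inputs (assumed for illustration): a 10 Gbps link, 2.8 GHz cores.
print(cpu_ghz_needed(10))               # 10.0
print(round(cores_needed(10, 2.8), 2))  # 3.57
```

Under these assumptions, a 10 Gbps link would keep more than three such cores busy with network processing alone, before any application work is done.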
6. The challenges
- New techniques are called for to free the CPU to do productive work
- Extra resources exist
- Peripheral devices are equipped with programmable processors (GPUs, disk controllers, NICs)
- Some devices have dedicated CPUs with unique properties (SIMD, TCAM memory, encryption/decryption logic)
- Offloading parts of the application to such devices is the new dimension!
7. Reasons for Offloading
- Memory bottlenecks
- Reduced memory pressure and fewer cache misses (due to filtering done at the device)
- Better timeliness guarantees
- GPOS -> embedded OS (RTOS)
- Avoiding OS noise (interrupts, context switches, timers, etc.)
8. Reasons for Offloading
- Security
- Another level of isolation
- Harder to tamper with
- Reduced power consumption
- Pentium 4, 2.8 GHz: 68 W
- Intel XScale, 600 MHz: 0.5 W
9. Sample Devices: Graphics
NVIDIA GeForce 6/7800, 600 400 MHz core, 512 MB DDR, memory bandwidth 54.4 GB/s
AGEIA PhysX
- 500 MHz multi-core processor
- Specialized physics units
IBM T60: ATI Mobility Radeon X1300, 6 programmable shader processors, 512 MB
10. Sample Devices: Graphics
- Compared to the CPU, GPU performance has been increasing at a much faster rate
- SIMD architecture (Single Instruction, Multiple Data)
[Figure annotations: 3 times Moore's law; 12 GFLOPS]
11. Sample Devices: Networking
- Today's Network Interface Cards (NICs) are equipped with an onboard CPU
- Execute proprietary code
- Inaccessible to the OS
Killer NIC (http://www.killernic.com/KillerNic/)
- 400 MHz Network Processing Unit, 64 MB DDR
- Embedded Linux OS
- Described as the world's first network card designed specifically for online gaming
12. Replication and Offloading
- Offloading reusable components that implement various distributed algorithms will facilitate the development of cluster replication and reliability
- Possible candidates:
- Reliable Broadcast
- Total Order: timestamp ordering, Token Ring, etc.
- Membership Services
- Failure Detectors
- Atomic Commit Protocols (2PC, 3PC, E3PC, etc.)
- Locking Service
13. Example: Offloaded TO-Application
- We have offloaded Lamport's timestamp ordering algorithm to the networking device
- Application Architecture
14. Example: Offloaded TO-Application
Lamport's Algorithm
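For readers unfamiliar with timestamp ordering, the following is a minimal host-side sketch of the general technique in Python. It is not the offloaded implementation from the talk: the class name, the ack scheme, and the in-memory "network" are assumptions made for illustration, and it presumes reliable FIFO links between processes.

```python
import heapq
from collections import deque


class LamportOrderer:
    """One process's view of timestamp-based total ordering (illustrative)."""

    def __init__(self, pid, all_pids):
        self.pid = pid
        self.clock = 0
        self.pending = []                          # min-heap of (ts, sender, payload)
        self.last_seen = {p: 0 for p in all_pids}  # highest timestamp heard per process

    def stamp(self, payload):
        """Tick the clock and build a message; payload None marks an ack."""
        self.clock += 1
        return (self.clock, self.pid, payload)

    def receive(self, msg):
        """Record a message and return the payloads that became deliverable."""
        ts, sender, payload = msg
        self.clock = max(self.clock, ts)
        self.last_seen[sender] = max(self.last_seen[sender], ts)
        if payload is not None:
            heapq.heappush(self.pending, msg)
        delivered = []
        # The head is stable once every process has been heard from at a
        # timestamp >= the head's; FIFO links then rule out earlier messages.
        while self.pending and all(
            seen >= self.pending[0][0] for seen in self.last_seen.values()
        ):
            delivered.append(heapq.heappop(self.pending)[2])
        return delivered


def simulate():
    """Two concurrent broadcasts arrive in different orders at three processes."""
    n = 3
    procs = [LamportOrderer(pid, range(n)) for pid in range(n)]
    logs = [[] for _ in range(n)]
    a = procs[0].stamp("a")
    b = procs[1].stamp("b")
    # Arrival order differs per receiver, but each sender's messages stay FIFO.
    inbox = [deque([a, b]), deque([b, a]), deque([b, a])]
    while any(inbox):
        for pid in range(n):
            if not inbox[pid]:
                continue
            msg = inbox[pid].popleft()
            logs[pid].extend(procs[pid].receive(msg))
            if msg[2] is not None and msg[1] != pid:
                ack = procs[pid].stamp(None)  # acknowledge data from other processes
                for queue in inbox:
                    queue.append(ack)
    return logs


print(simulate())  # [['a', 'b'], ['a', 'b'], ['a', 'b']]
```

Ties between equal timestamps are broken by process id, which is why "a" (from process 0) precedes "b" (from process 1) at every receiver even though two of them saw "b" first.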
15. Hydra: An Offloading Framework
- Offloading an application is a tedious task
- Depends on device capabilities, SDK, and toolchain
- Requires kernel knowledge (device drivers, DMA)
- Must be repeated for each target device
- We have developed a generic offloading framework that enables a developer to specify the offloading aspects of the application at design time
- Joint work with Yaron Weinsberg (HUJI), Tal Anker (Marvell), Muli Ben-Yehuda (IBM), Pete Wyckoff (OSC)
16. HYDRA Programming Model
- The Hydra programming model enables one to develop Offload-Aware (OA) applications
- OA applications are aware of the available computing resources
- The minimal unit for offloading is called an Offcode (i.e., Offloaded Code)
- Exports a well-defined interface (like COM objects)
- Given as open source or as compiled binaries
- Described by an Offcode Description File (ODF)
- The ODF exposes the offcode's functionality (interfaces)
17. Offcode Libraries
[Figure: an Offcode Library with categories Networking, Math, Graphics, and Security; offcodes BSD Socket (socket.odf) and CRC32 (crc32.odf); a User Lib with an mpeg decoder (Decoder.odf); the OA-App imports offcodes from both libraries]
18. Offcode Description File
Device section:
  <device-class>
    <type>ethernet</type>
    <link>1000</link>
  </device-class>
Offcode description:
  <offcode name="BSD socket offcode">
    <interfaces>
      <interface name="unicast" ID="IID_UNICAST">
        <method name="sendUDP">
          <param> </param>
        </method>
      </interface>
    </interfaces>
  </offcode>
Import section:
  <import>
    <descriptor>Net\BSD Socket\crc32.odf</descriptor>
    <reference type="Pull" priority="0"></reference>
    <IID>6060843</IID>
  </import>
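To show how a tool might consume such a description, here is a small Python sketch that reads the fields from the slide using the standard library's XML parser. The enclosing <odf> root element, the exact attribute spellings, and the summarize_odf helper are assumptions for illustration; the real ODF schema and Hydra's own loader may differ.

```python
import xml.etree.ElementTree as ET

# Sample text modeled on the slide; the <odf> wrapper is an assumption.
ODF_TEXT = r"""
<odf>
  <device-class><type>ethernet</type><link>1000</link></device-class>
  <offcode name="BSD socket offcode">
    <interfaces>
      <interface name="unicast" ID="IID_UNICAST">
        <method name="sendUDP"/>
      </interface>
    </interfaces>
  </offcode>
  <import>
    <descriptor>Net\BSD Socket\crc32.odf</descriptor>
    <reference type="Pull" priority="0"/>
    <IID>6060843</IID>
  </import>
</odf>
"""


def summarize_odf(odf_text):
    """Extract the device class, exported methods, and imports (hypothetical helper)."""
    root = ET.fromstring(odf_text.strip())
    device = root.find("device-class")
    return {
        "device_type": device.findtext("type"),
        "link": int(device.findtext("link")),
        "offcode": root.find("offcode").get("name"),
        "methods": [m.get("name") for m in root.iter("method")],
        "imports": [
            (imp.findtext("descriptor"), imp.find("reference").get("type"))
            for imp in root.findall("import")
        ],
    }


summary = summarize_odf(ODF_TEXT)
print(summary["device_type"], summary["methods"], summary["imports"])
```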
19. Channels
- Offcodes are interconnected via Channels
- A channel determines various communication properties between offcodes
- (I) An Out-Of-Band channel (OOB-channel) is attached to every OA-application and Offcode
- Not performance critical (uses memory copies)
- Used for initialization, control, and event dissemination
[Figure: offcodes A, B, and C connected by OOB-channels and a specialized channel]
20. Channels
- (II) A specialized channel is created for performance-critical communication
- Hydra provides several channel types:
- Unicast / Multicast
- Reliable / Unreliable
- Synchronized / Asynchronous
- Buffered / Zero-Copy (read / write / both)
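One way to picture these options is as independent properties of a channel configuration. The Python sketch below is purely illustrative: the ChannelConfig and CopyMode names are hypothetical and do not reflect Hydra's actual API, and whether every combination is supported in practice is not stated in the talk.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class CopyMode(Enum):
    BUFFERED = "buffered"
    ZERO_COPY_READ = "zero-copy read"
    ZERO_COPY_WRITE = "zero-copy write"
    ZERO_COPY_BOTH = "zero-copy read/write"


@dataclass(frozen=True)
class ChannelConfig:
    """Hypothetical description of a specialized channel's properties."""
    multicast: bool      # False means unicast
    reliable: bool       # False means unreliable
    synchronized: bool   # False means asynchronous
    copy_mode: CopyMode


def all_channel_configs():
    """Enumerate every combination of the four listed properties."""
    flags = (False, True)
    return [
        ChannelConfig(m, r, s, c)
        for m, r, s, c in product(flags, flags, flags, CopyMode)
    ]


# Example: a reliable, asynchronous, zero-copy unicast channel.
fast_path = ChannelConfig(
    multicast=False, reliable=True, synchronized=False,
    copy_mode=CopyMode.ZERO_COPY_BOTH,
)
print(len(all_channel_configs()))  # 32
```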
21. Design Methodology
- We follow the layout design methodology first presented in FarGo [1] and later in FarGo-DA [2]
- Offload-aware applications are designed along two aspects:
- 1. Basic logic design
- Design the application logic and define the components to be offloaded
- 2. Offloading layout design
- Define the communication channels between offcodes and their location constraints
- [1] FarGo-System, ICDCS '99, Ophir Holder and Israel Ben-Shaul
- [2] A programming model and system support for disconnected-aware applications on resource-constrained devices, ICSE '02, Yaron Weinsberg and Israel Ben-Shaul
22. Step 1: Logical Design (the example)
Component | Description
GUI | Provides the viewing area and user controls (define a message pattern, set its frequency, and send it)
TO Service | Provides the TO API: TO_broadcast(), TO_recv()
LamportOrderer | Implements the specific algorithm instance (timestamp ordering)
ReliableBroadcast | Implements a simple RB algorithm
23. Step 2: Offloading Layout Design
[Figure: layout graph of components 1 to 4, connected to the network (net)]
Components legend: 1 GUI, 2 TO Service, 3 Lamport Orderer, 4 Reliable Broadcast
24-30. Channel Constraints
- Link constraint (default): A.ODF targets Device 1; B.ODF targets Device 1 or Device 2
- Pull constraint: A.ODF targets Device 1; B.ODF targets Device 1 or Device 2
- Gang constraint: A.ODF targets Device 1; B.ODF targets Device 2
[Figures: animation frames showing offcodes A and B placed on Device 1 and Device 2 under each constraint]
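The slides name the constraints without defining them, so the following Python sketch rests on stated assumptions: Link imposes no placement restriction, Pull requires both offcodes to end up in the same place, and Gang requires that either both offcodes are offloaded or neither is. The fallback to the host CPU and all function names are likewise hypothetical.

```python
from itertools import product

HOST = "host"  # assumed fallback: an offcode that is not offloaded runs on the host


def satisfied(kind, place_a, place_b):
    """Check one constraint between two placed offcodes (assumed semantics)."""
    if kind == "link":
        return True                                   # no placement restriction
    if kind == "pull":
        return place_a == place_b                     # co-located
    if kind == "gang":
        return (place_a == HOST) == (place_b == HOST)  # both offloaded or neither
    raise ValueError(f"unknown constraint: {kind}")


def valid_layouts(targets, constraints):
    """Enumerate placements that respect each offcode's targets and all constraints.

    targets: offcode name -> list of candidate devices (as in each ODF)
    constraints: list of (kind, offcode_a, offcode_b)
    """
    names = sorted(targets)
    options = [targets[name] + [HOST] for name in names]
    layouts = []
    for combo in product(*options):
        layout = dict(zip(names, combo))
        if all(satisfied(k, layout[a], layout[b]) for k, a, b in constraints):
            layouts.append(layout)
    return layouts


two_targets = {"A": ["dev1"], "B": ["dev1", "dev2"]}
print(len(valid_layouts(two_targets, [("link", "A", "B")])))  # 6
print(valid_layouts(two_targets, [("pull", "A", "B")]))
print(valid_layouts({"A": ["dev1"], "B": ["dev2"]}, [("gang", "A", "B")]))
```

Under these assumptions, Pull leaves only the layouts where B joins A (both on dev1, or both on the host), and Gang leaves only the layout where A is on dev1 and B on dev2, or the one where both stay on the host.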
31. Finally: Application Deployment
[Figure: deployment pipeline] Layout graph -> mapping to logical devices -> mapping to physical devices -> offcode generation -> offloading -> execution
32-33. Evaluation: OA Total-Order Application
- 5 Intel Pentium 4 2.4 GHz systems, each with 512 MB of RAM and a 32-bit, 33 MHz PCI bus
- Programmable Netgear 620 NICs with 512 kB RAM
- Linux version 2.6.11 with the Hydra module
- Dell PowerConnect 6024 Gigabit Ethernet switch
34. Conclusions
- We are at the beginning of a journey toward enabling application developers to fully utilize the available computing resources:
- Peripherals
- Multi-core systems
- Offloading can improve the performance of distributed applications, advanced storage services, intrusion detection systems (IDS), VMMs, etc.