Title: Design
1 Design and Implementation of LH*RS: a Highly-Available Distributed Data Structure
Workshop in Distributed Data Structures
July 2004
Thomas J.E. Schwarz, TjSchwarz@scu.edu
http://www.cse.scu.edu/tschwarz/homepage/thomas_schwarz.html
Rim Moussa, Rim.Moussa@dauphine.fr
http://ceria.dauphine.fr/rim/rim.html
2 Objective
Design, implementation and performance measurements of LH*RS
Factors of interest: parity overhead and recovery performance
3 Overview
- Motivation
- Highly-available schemes
- LH*RS
- Architectural Design
- Hardware testbed
- File Creation
- High Availability
- Recovery
- Conclusion
- Future Work
→ Scenario Description → Performance Results
4 Motivation
- Information volume grows by 30% / year
- Bottleneck of disk access and CPUs
- Failures are frequent and costly
Source: Contingency Planning Research, 1996
5 Requirements
Need highly available networked data storage systems:
scalability, high throughput, high availability
6 Scalable Distributed Data Structure
Dynamic file growth
[Figure: Clients and a Coordinator connected through the Network to the Data Buckets (DBs)]
7 SDDS (Ctnd.)
No centralized directory access
[Figure: Client and Data Buckets (DBs)]
8 Solutions towards High Availability
Data Replication
(+) Good response time, since mirrors are queried
(-) High storage cost (×n for n replicas)
Parity Calculus
Erasure-resilient codes are evaluated regarding:
- Coding rate (parity volume / data volume)
- Update penalty
- Group size used for data reconstruction
- Complexity of coding and decoding
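As a rough illustration of the storage trade-off on this slide, the sketch below compares replication with parity. The function names are hypothetical, equal-size buckets are assumed, and the group of 4 data buckets is borrowed from the tested configuration described later in the deck.

```python
def replication_storage_factor(replicas: int) -> int:
    """Total storage relative to the data volume when n replicas are kept."""
    return replicas


def coding_rate(parity_buckets: int, data_buckets: int) -> float:
    """Parity volume / data volume for one bucket group (equal-size buckets assumed)."""
    return parity_buckets / data_buckets


if __name__ == "__main__":
    print("2 replicas -> storage x", replication_storage_factor(2))
    for k in (1, 2):
        print(f"k = {k} parity buckets for 4 data buckets -> coding rate {coding_rate(k, 4):.2f}")
```

With these assumptions, tolerating one failure costs a full extra copy under replication but only a quarter of the data volume under parity.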
9 Fault-Tolerant Schemes
1 server failure
- Simple XOR parity calculus: RAID systems [Patterson et al., 88], the SDDS LH*g [Litwin et al., 96]
More than 1 server failure
- Binary linear codes [Hellerstein et al., 94]
- Array codes: EVENODD [Blaum et al., 94], X-code [Xu et al., 99], RDP scheme [Corbett et al., 04] → tolerate just 2 failures
- Reed-Solomon codes: IDA [Rabin, 89], RAID X [White, 91], FEC [Blomer et al., 95], Tutorial [Plank, 97], LH*RS [Litwin & Schwarz, 00] → tolerate a large number of failures
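To make the single-failure case concrete, here is a minimal sketch of XOR parity over equally sized blocks. The block contents and the helper name `xor_parity` are made up for illustration; this is the generic RAID-style idea, not code from the LH*RS prototype.

```python
from functools import reduce


def xor_parity(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Four hypothetical data blocks of the same length.
data = [b"alpha", b"bravo", b"delta", b"gamma"]
parity = xor_parity(data)

# Simulate the loss of block 2 and rebuild it from the survivors plus the parity.
survivors = data[:2] + data[3:]
rebuilt = xor_parity(survivors + [parity])
print(rebuilt == data[2])
```

One parity block can rebuild any single lost block; a second simultaneous loss is unrecoverable, which is what motivates the multi-failure codes listed above.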
10 A Highly Available Distributed Data Structure: LH*RS
[Litwin & Schwarz, 00], [Litwin, Moussa & Schwarz, sub.]
11 LH*RS
SDDS: data distribution scheme based on Linear Hashing, LH*LH [Karlesson et al., 96], applied to the key field
Parity calculus: Reed-Solomon codes [Reed & Solomon, 60]
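The slide only names the distribution scheme. As a reminder of what linear hashing does with the key field, here is a minimal sketch of the textbook address computation. It assumes integer keys, a file that started with a single bucket, and an exactly known file state (level i and split pointer n). A real LH* client works from a possibly outdated image of that state and relies on server-side forwarding, which is not shown.

```python
def lh_address(key: int, level: int, split_pointer: int) -> int:
    """Textbook linear-hashing address for a file with one initial bucket.

    level:         current hash level i (buckets 0 .. 2**i - 1 exist at least)
    split_pointer: index n of the next bucket to split
    """
    address = key % (2 ** level)
    if address < split_pointer:
        # This bucket has already split in the current round: use the next level.
        address = key % (2 ** (level + 1))
    return address


if __name__ == "__main__":
    # File with 5 buckets: level 2, bucket 0 already split.
    for key in (7, 8, 12):
        print(key, "->", lh_address(key, level=2, split_pointer=1))
```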
12 LH*RS File Structure
[Figure: a record is inserted at rank r in its Data Bucket; the Parity Buckets of the group hold the parity record of the same rank]
Parity Buckets: (Rank, Key List, Parity Field)
Data Buckets: (Key, Data Field)
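The record layouts on this slide can be sketched as follows. This is an illustrative model only: the class names, the fixed record length and the group size of 4 are assumptions, and plain XOR stands in for the parity calculus (the backup slide on parity calculus notes that the first parity bucket can use XOR; the others would apply Reed-Solomon coefficients).

```python
from dataclasses import dataclass, field

GROUP_SIZE = 4   # data buckets per group (assumption)
RECORD_LEN = 8   # fixed data-field length in bytes (assumption)


@dataclass
class ParityRecord:
    """One parity record per rank: (rank, key list, parity field)."""
    rank: int
    keys: list = field(default_factory=lambda: [None] * GROUP_SIZE)
    parity: bytearray = field(default_factory=lambda: bytearray(RECORD_LEN))


class ParityBucket:
    def __init__(self):
        self.records = {}  # rank -> ParityRecord

    def apply_insert(self, rank, bucket_pos, key, data):
        """Fold a data record inserted at `rank` of data bucket `bucket_pos` into the parity."""
        rec = self.records.setdefault(rank, ParityRecord(rank))
        rec.keys[bucket_pos] = key
        # Pad short data fields with zero bytes, then XOR into the parity field.
        for i, byte in enumerate(data.ljust(RECORD_LEN, b"\0")):
            rec.parity[i] ^= byte
        return rec


pb = ParityBucket()
pb.apply_insert(rank=0, bucket_pos=0, key=17, data=b"abc")
pb.apply_insert(rank=0, bucket_pos=2, key=42, data=b"abd")
print(pb.records[0].keys, bytes(pb.records[0].parity))
```

All data records that share a rank r across the buckets of a group contribute to the same parity record, which is why the parity record carries the list of their keys.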
13 Architectural Design of LH*RS
14 Communication
Use of UDP
- Individual Insert/Update/Delete/Search queries
- Record recovery
- Service and control messages
→ Speed
Use of TCP/IP
- New PB creation
- Large update transfer (DB split)
- Bucket recovery
→ Better performance and reliability than UDP
15 Bucket Architecture
[Figure: bucket architecture, with the following components]
- Network ports: Multicast Listening Port, Send UDP Port, Recv UDP Port, TCP/IP Port (TCP connection)
- Listening threads: TCP Listening Thread, UDP Listening Thread, Multicast Listening Thread
- Message queues and process buffer
- Message processing: Working Threads 1..n, Multicast Working Thread
- Ack. Management Thread: sending credit window (free zones, messages waiting for ack), not-acked messages
16 Architectural Design
Enhancements to the SDDS2000 [Bennour, 00], [Diène, 01] bucket architecture
- TCP/IP Connection Handler
- TCP/IP connections are passive OPEN [RFC 793, ISI, 81]; TCP/IP implementation under Win2K Server OS [McDonal & Barkley, 00]
- Ex.: recovery of 1 DB takes 6.7 s with the SDDS 2000 architecture and 2.6 s with the new architecture → improvement of 60% (hardware config.: 733 MHz machines, 100 Mbps network)
- Flow Control and Acknowledgement Management
- Principle of sending credit and message conservation until delivery [Jacobson, 88], [Diène, 01]
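A minimal sketch of the sending-credit principle, assuming the credit is the maximum number of sent-but-unacknowledged messages and that every message is kept until its acknowledgement arrives. Class and method names are hypothetical; timers and the actual UDP transmission are left to the `transmit` callback.

```python
from collections import OrderedDict, deque


class CreditSender:
    def __init__(self, credit, transmit):
        self.credit = credit          # max number of messages in flight
        self.transmit = transmit      # callback(msg_id, payload) that puts bytes on the wire
        self.pending = deque()        # messages not yet sent (no credit available)
        self.unacked = OrderedDict()  # msg_id -> payload, conserved until acked
        self.next_id = 0

    def send(self, payload):
        self.pending.append(payload)
        self._pump()

    def ack(self, msg_id):
        """An acknowledgement frees one unit of credit."""
        self.unacked.pop(msg_id, None)
        self._pump()

    def retransmit_unacked(self):
        """Called on time-out: resend everything still waiting for an ack."""
        for msg_id, payload in self.unacked.items():
            self.transmit(msg_id, payload)

    def _pump(self):
        while self.pending and len(self.unacked) < self.credit:
            msg_id, self.next_id = self.next_id, self.next_id + 1
            payload = self.pending.popleft()
            self.unacked[msg_id] = payload
            self.transmit(msg_id, payload)


sent = []
sender = CreditSender(credit=2, transmit=lambda msg_id, payload: sent.append(msg_id))
for payload in ("a", "b", "c", "d"):
    sender.send(payload)
sender.ack(0)
print(sent)
```

With a credit of 2, the third message only leaves once the first one is acknowledged.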
17 Architectural Design (Ctnd.)
Dynamic IP address structure: updated when adding new/spare buckets (PBs/DBs) through a multicast probe
18 Hardware Testbed
- 5 machines (Pentium IV 1.8 GHz, 512 MB RAM)
- Ethernet network: max bandwidth of 1 Gbps
- Operating system: Windows 2K Server
- Tested configuration
- 1 Client
- A group of 4 Data Buckets
- k Parity Buckets, k = 0, 1, 2
19 LH*RS
File Creation
20 File Creation
Client operation
- Each Insert/Update/Delete on a data record is propagated to the Parity Buckets
Data Bucket split
- Splitting Data Bucket → PBs: (records that remain) N deletes from the old rank and N inserts at the new rank; (records that move) N deletes
- New Data Bucket → PBs: N inserts (moved records)
- All updates are gathered in the same buffer and transferred (TCP/IP) simultaneously to the respective Parity Buckets of the splitting DB and of the new DB.
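The split bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions, not the prototype's code: keys are integers hashed as in textbook linear hashing, remaining records are compacted to new ranks in their previous order, moved records receive ranks 0, 1, ... in the new bucket, and the tuple format of an update is invented here.

```python
def split_parity_updates(records, level):
    """Return (updates for the PBs of the splitting DB, updates for the PBs of the new DB).

    records: (rank, key) pairs currently held by the splitting data bucket
    level:   hash level i of the splitting bucket before the split
    """
    to_splitting_pbs, to_new_pbs = [], []
    next_stay_rank = next_move_rank = 0
    for rank, key in sorted(records):
        # Every record leaves its old rank in the splitting bucket.
        to_splitting_pbs.append(("delete", rank, key))
        if key % (2 ** (level + 1)) == key % (2 ** level):
            # Record remains: re-inserted at its new (compacted) rank.
            to_splitting_pbs.append(("insert", next_stay_rank, key))
            next_stay_rank += 1
        else:
            # Record moves: inserted into the parity group of the new bucket.
            to_new_pbs.append(("insert", next_move_rank, key))
            next_move_rank += 1
    # Each list would then be sent as a single buffer over TCP/IP.
    return to_splitting_pbs, to_new_pbs


old_updates, new_updates = split_parity_updates([(0, 5), (1, 9), (2, 13), (3, 17)], level=2)
print(old_updates)
print(new_updates)
```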
21 File Creation Perf.
- Experimental set-up
- File of 25 000 data records; 1 data record = 104 B
[Figure panels: Client Sending Credit = 1; Client Sending Credit = 5]
PB Overhead
- k = 0 to k = 1
- → Perf. degradation of 20%
- k = 1 to k = 2
- → Perf. degradation of 8%
22 File Creation Perf.
- Experimental set-up
- File of 25 000 data records; 1 data record = 104 B
[Figure panels: Client Sending Credit = 1; Client Sending Credit = 5]
PB Overhead
- k = 0 to k = 1
- → Perf. degradation of 37%
- k = 1 to k = 2
- → Perf. degradation of 10%
23 LH*RS
Parity Bucket Creation
24 PB Creation Scenario
Searching for a new PB
Coordinator: "Wanna join group g?" <Multicast> [Sender IP address and Entity, Your Entity]
[Figure: PBs connected to the Blank PBs Multicast Group]
25 PB Creation Scenario
Waiting for Replies
Coordinator
- Start UDP listening, start TCP listening, start working threads
- Waiting for confirmation
- If time-out elapsed → cancel all
[Figure: PBs connected to the Blank PBs Multicast Group]
26 PB Creation Scenario
PB Selection
Coordinator: "You are Selected" <UDP>; Cancellation for the other candidates
Selected PB: disconnect from the Blank PBs Multicast Group
[Figure: PBs connected to the Blank PBs Multicast Group]
27 PB Creation Scenario
Auto-creation: Query phase
[Figure: New PB and the Data Buckets group]
28 PB Creation Scenario
Auto-creation: Encoding phase
[Figure: New PB and the Data Buckets group]
29 PB Creation Perf.
- Experimental set-up
- Bucket size: 5000 .. 50000 records; bucket contents = 0.625 × bucket size; file size = 2.5 × bucket size records
[Tables: XOR Encoding | RS Encoding | Comparison]
30 PB Creation Perf.
- Experimental set-up: as on slide 29
[Tables: XOR Encoding | RS Encoding | Comparison]
31 PB Creation Perf.
[Tables: XOR Encoding | RS Encoding | Comparison]
For bucket size = 50000:
- XOR encoding rate: 0.66 MB/sec
- RS encoding rate: 0.673 MB/sec
XOR provides a performance gain of 5% in processing time (≈ 0.02 in the total time)
32 LH*RS
Bucket Recovery
33 Buckets Recovery
Failure Detection
Coordinator: "Are You Alive?" <UDP>
[Figure: Parity Buckets and Data Buckets]
34 Buckets Recovery
Waiting for Replies
Reply to the Coordinator: "I am Alive" <UDP>
[Figure: Parity Buckets and Data Buckets]
35 Buckets Recovery
Searching for 2 Spare DBs
Coordinator: "Wanna be a Spare DB?" <Multicast> [Sender IP address, Your Entity]
[Figure: DBs connected to the Blank DBs Multicast Group]
36 Buckets Recovery
Waiting for Replies
Coordinator
- Start UDP listening, start TCP listening, start working threads
- Waiting for confirmation
- If time-out elapsed → cancel all
[Figure: DBs connected to the Blank DBs Multicast Group]
37 Buckets Recovery
Spare DBs Selection
Coordinator: "You are Selected" <UDP>; Cancellation for the other candidates
Selected spares: disconnect from Blank PBs Multicast Group
[Figure: DBs connected to the Blank DBs Multicast Group]
38 Buckets Recovery
Recovery Manager Determination
Coordinator: "Recover Buckets" [Spares' IP addresses and Entities]
[Figure: Parity Buckets]
39 Buckets Recovery
Query Phase
Recovery Manager → alive buckets participating in the recovery: "Send me records of rank in [r, r+slice-1]" <UDP>
[Figure: Parity Buckets, Data Buckets, Spare DBs]
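The query phase asks for records one slice of ranks at a time. A minimal sketch of how such rank intervals could be enumerated is given below; the function name and the choice of a fixed slice size are assumptions, since the slides do not say how the slice is chosen.

```python
def rank_slices(bucket_size: int, slice_size: int):
    """Yield inclusive rank intervals [r, r + slice - 1] covering ranks 0 .. bucket_size - 1."""
    for first in range(0, bucket_size, slice_size):
        last = min(first + slice_size, bucket_size) - 1
        yield first, last


if __name__ == "__main__":
    for first, last in rank_slices(bucket_size=10, slice_size=4):
        print(f"Send me records of rank in [{first}, {last}]")
```

Working slice by slice bounds the size of each requested buffer, so reconstruction of one slice can proceed while later slices are still being queried.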
40 Buckets Recovery
Reconstruction Phase
Alive buckets participating in the recovery → Recovery Manager: requested buffer <TCP>
[Figure: Parity Buckets, Data Buckets, Spare DBs]
41 DBs Recovery Perf.
Experimental set-up: file of 125 000 recs; bucket of 31 250 recs, i.e. 3.125 MB
[Tables: XOR Decoding | RS Decoding | Comparison]
42 DBs Recovery Perf.
Experimental set-up: as on slide 41
[Tables: XOR Decoding | RS Decoding | Comparison]
43 DBs Recovery Perf.
Experimental set-up: file of 125 000 recs; bucket of 31 250 recs, i.e. 3.125 MB
[Tables: XOR Decoding | RS Decoding | Comparison]
- 1 DB recovery time, XOR: 0.720 sec
- 1 DB recovery time, RS: 0.855 sec
XOR provides a performance gain of 15% in total time
44 DBs Recovery Perf.
Experimental set-up: file of 125 000 recs; bucket of 31 250 recs, i.e. 3.125 MB
[Tables: Recover 2 DBs | Recover 3 DBs]
45 DBs Recovery Perf.
Experimental set-up: as on slide 44
[Tables: Recover 2 DBs | Recover 3 DBs]
46 Perf. Summary of Bucket Recovery
- 1 DB (3.125 MB) in 0.7 sec (XOR) → 4.46 MB/sec
- 1 DB (3.125 MB) in 0.85 sec (RS) → 3.65 MB/sec
- 2 DBs (6.250 MB) in 1.2 sec (RS) → 5.21 MB/sec
- 3 DBs (9.375 MB) in 1.6 sec (RS) → 5.86 MB/sec
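The rates in this summary are the recovered volume divided by the recovery time. The sketch below redoes that arithmetic; `throughput_mb_per_s` is a hypothetical helper name. Note that the 3.65 MB/sec figure for RS corresponds to the 0.855 sec measured on slide 43 rather than to the rounded 0.85 sec.

```python
def throughput_mb_per_s(volume_mb: float, seconds: float) -> float:
    """Recovered volume divided by recovery time."""
    return volume_mb / seconds


measurements = [
    ("1 DB, XOR", 3.125, 0.7),
    ("1 DB, RS", 3.125, 0.855),   # unrounded time from slide 43
    ("2 DBs, RS", 6.250, 1.2),
    ("3 DBs, RS", 9.375, 1.6),
]

for label, volume_mb, seconds in measurements:
    print(f"{label}: {throughput_mb_per_s(volume_mb, seconds):.2f} MB/sec")
```

The throughput grows with the number of recovered buckets because the fixed costs of a recovery (spare selection, queries to the alive buckets) are shared across more reconstructed data.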
47 Conclusion
The conducted experiments show:
- Encoding/decoding optimization and the enhanced bucket architecture → impact on performance
- Good recovery performance
Finally, we improved the processing time of the RS decoding process by 4% to 8%: 1 DB is recovered in half a second.
48 Conclusion
LH*RS
- Mature implementation
- Many optimization iterations
- Only SDDS with scalable availability
49 Future Work
- Better parity update propagation strategy to PBs
- Investigation of faster encoding/decoding processes
50 References
- [Patterson et al., 88] D. A. Patterson, G. Gibson & R. H. Katz, A Case for Redundant Arrays of Inexpensive Disks, Proc. of ACM SIGMOD Conf., pp. 109-116, June 1988.
- [ISI, 81] Information Sciences Institute, RFC 793: Transmission Control Protocol (TCP) Specification, Sept. 1981, http://www.faqs.org/rfcs/rfc793.html
- [McDonal & Barkley, 00] D. MacDonal, W. Barkley, MS Windows 2000 TCP/IP Implementation Details, http://secinf.net/info/nt/2000ip/tcpipimp.html
- [Jacobson, 88] V. Jacobson, M. J. Karels, Congestion Avoidance and Control, Computer Communication Review, Vol. 18, No. 4, pp. 314-329.
- [Xu et al., 99] L. Xu & J. Bruck, X-Code: MDS Array Codes with Optimal Encoding, IEEE Trans. on Information Theory, 45(1), pp. 272-276, 1999.
- [Corbett et al., 04] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong, S. Sankar, Row-Diagonal Parity for Double Disk Failure Correction, Proc. of the 3rd USENIX Conf. on File and Storage Technologies, April 2004.
- [Rabin, 89] M. O. Rabin, Efficient Dispersal of Information for Security, Load Balancing and Fault Tolerance, Journal of ACM, Vol. 36, No. 2, April 1989, pp. 335-348.
- [White, 91] P.E. White, RAID X tackles design problems with existing design RAID schemes, ECC Technologies, ftp://members.aol.com.mnecctek.ctr1991.pdf
- [Blomer et al., 95] J. Blomer, M. Kalfane, R. Karp, M. Karpinski, M. Luby & D. Zuckerman, An XOR-Based Erasure-Resilient Coding Scheme, ICSI Tech. Rep. TR-95-048, 1995.
51 References (Ctnd.)
- [Litwin & Schwarz, 00] W. Litwin & T. Schwarz, LH*RS: A High-Availability Scalable Distributed Data Structure using Reed Solomon Codes, pp. 237-248, Proceedings of the ACM SIGMOD 2000.
- [Karlesson et al., 96] J. Karlsson, W. Litwin & T. Risch, LH*LH: A Scalable High Performance Data Structure for Switched Multicomputers, EDBT 96, Springer Verlag.
- [Reed & Solomon, 60] I. Reed & G. Solomon, Polynomial Codes over Certain Finite Fields, Journal of the Society for Industrial and Applied Mathematics, 1960.
- [Plank, 97] J. S. Plank, A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems, Software: Practice & Experience, 27(9), Sept. 1997, pp. 995-1012.
- [Diène, 01] A.W. Diène, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD Thesis, Nov. 2001, Université Paris Dauphine.
- [Bennour, 00] F. Sahli Bennour, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD Thesis, June 2000, Université Paris Dauphine.
- [Moussa] http://ceria.dauphine.fr/rim/rim.html
- More references: http://ceria.dauphine.fr/rim/biblio.pdf
52 End
Thank you for your attention
53 Parity Calculus
Galois Field
- GF(2^8) → 1 symbol is 1 byte
- GF(2^16) → 1 symbol is 2 bytes
(+) GF(2^16) vs. GF(2^8) halves the number of symbols, and consequently the number of operations in the field
(-) Sizes of the multiplication tables
New Generator Matrix
- 1st line of 1s: each PB executes XOR calculus for any update from the 1st DB of any group → performance gain of 4%, measured for PB creation
- 1st column of 1s: the 1st parity bucket executes XOR calculus instead of RS calculus → performance gain in encoding of 20%
Encoding and Decoding Hints
- Decoding: log pre-calculus of the H^-1 matrix coefficients and of the b vector for multiple-bucket recovery → improvement of 4% to 8%
- Encoding: log pre-calculus of the P matrix coefficients → improvement of 3.5%
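To illustrate why coefficients equal to 1 reduce to XOR and what a log pre-calculus can save, here is a minimal GF(2^8) sketch based on log/antilog tables. The primitive polynomial 0x11D and all function names are assumptions (the slides do not give them), and reading "log pre-calculus" as storing the logarithm of each matrix coefficient once is an interpretation rather than the authors' stated method.

```python
def build_tables(poly=0x11D):
    """Build antilog (EXP) and log (LOG) tables for GF(2^8) with generator 2."""
    exp = [0] * 510
    log = [0] * 256
    x = 1
    for i in range(255):
        exp[i] = x
        log[x] = i
        x <<= 1
        if x & 0x100:
            x ^= poly
    for i in range(255, 510):   # duplicate so that log sums need no modulo
        exp[i] = exp[i - 255]
    return exp, log


EXP, LOG = build_tables()


def gf_mul(a, b):
    """Field multiplication: two log lookups, one addition, one antilog lookup."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def gf_mul_prelogged(log_coef, symbol):
    """Multiply by a non-zero coefficient whose logarithm was computed in advance."""
    return 0 if symbol == 0 else EXP[log_coef + LOG[symbol]]


def parity_symbol(coefs, symbols):
    """One parity symbol: XOR-sum of coefficient * data symbol."""
    p = 0
    for c, s in zip(coefs, symbols):
        p ^= gf_mul(c, s)
    return p


if __name__ == "__main__":
    symbols = [5, 9, 17, 33]
    # A row (or column) of 1s turns the parity into a plain XOR of the data symbols.
    print(parity_symbol([1, 1, 1, 1], symbols) == 5 ^ 9 ^ 17 ^ 33)
    print(gf_mul(3, 7), gf_mul_prelogged(LOG[3], 7))
```

In GF(2^16) the same approach halves the number of symbols per record, but the log and antilog tables grow from 256 to 65 536 entries, which is the table-size drawback noted on the slide.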