1
Design & Implementation of LH*RS, a Highly-Available Distributed Data Structure
Workshop in Distributed Data Structures
July 2004
Thomas J.E. Schwarz  TjSchwarz@scu.edu
http://www.cse.scu.edu/tschwarz/homepage/thomas_schwarz.html
Rim Moussa  Rim.Moussa@dauphine.fr
http://ceria.dauphine.fr/rim/rim.html
2
Objective
Design, Implementation & Performance Measurements of LH*RS
Factors of interest: parity overhead & recovery performance
3
Overview
  • Motivation
  • Highly-available schemes
  • LH*RS
  • Architectural Design
  • Hardware testbed
  • File Creation
  • High Availability
  • Recovery
  • Conclusion
  • Future Work

→ Scenario Description  → Performance Results
4
Motivation
  • Information volume grows by ~30% / year
  • Bottleneck of disk access and CPUs
  • Failures are frequent & costly

Source: Contingency Planning Research, 1996
5
Requirements
Need: Highly Available Networked Data Storage Systems
Scalability, High Throughput, High Availability
6
Scalable Distributed Data Structure
Dynamic file growth
[Diagram: a coordinator and clients connected over a network to data buckets (DBs); the file grows dynamically across buckets]
7
SDDS (Cont'd)
No Centralized Directory Access
[Diagram: a client addresses data buckets (DBs) directly over the network; there is no centralized directory]
8
Solutions towards High Availability
Data Replication
(+) Good response time, since mirrors are queried
(-) High storage cost (×n for n replicas)
Parity Calculus
  • Erasure-resilient codes are evaluated regarding:
  • Coding Rate (parity volume / data volume)
  • Update Penalty
  • Group Size used for Data Reconstruction
  • Complexity of Coding & Decoding

9
Fault-Tolerant Schemes
1 server failure:
Simple XOR parity calculus: RAID systems [Patterson et al., 88], the SDDS LH*g [Litwin et al., 96]
More than 1 server failure:
Binary linear codes [Hellerstein et al., 94]; array codes: EVENODD [Blaum et al., 94], X-code [Xu et al., 99], RDP scheme [Corbett et al., 04] → tolerate just 2 failures
Reed-Solomon codes: IDA [Rabin, 89], RAID X [White, 91], FEC [Blomer et al., 95], tutorial [Plank, 97], LH*RS [Litwin & Schwarz, 00] → tolerate a large number of failures
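The simple XOR scheme the single-failure systems above rely on can be sketched in a few lines: the parity block is the bytewise XOR of the group's data blocks, and any one lost block is the XOR of the survivors with the parity (a minimal illustration, not the LH*RS implementation).

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Four data buckets' blocks, plus their XOR parity.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# One bucket fails: XOR the survivors with the parity to rebuild it.
lost = 2
survivors = [blk for i, blk in enumerate(data) if i != lost]
recovered = xor_blocks(survivors + [parity])
assert recovered == data[lost]
```

This is exactly why a single XOR parity bucket can tolerate only one failure: two unknowns cannot be solved from one parity equation, which is what motivates the Reed-Solomon codes used by LH*RS.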
10
LH*RS: A Highly Available Distributed Data Structure
[Litwin & Schwarz, 00], [Litwin, Moussa & Schwarz, sub.]
11
LH*RS
SDDS: data distribution scheme based on Linear Hashing (LH*LH [Karlsson et al., 96]) applied to the key field
Parity Calculus: Reed-Solomon codes [Reed & Solomon, 60]
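The Linear Hashing addressing rule underlying the scheme can be sketched as follows (a minimal sketch of classic LH addressing; the function name and the client-side image handling are illustrative, not the LH*LH code):

```python
def lh_bucket(key, i, n):
    """Map a key to its bucket under Linear Hashing.

    i: current file level (the file holds between 2**i and 2**(i+1) buckets);
    n: split pointer (buckets below n have already split to level i+1).
    """
    a = key % (2 ** i)
    if a < n:                  # this bucket has already split
        a = key % (2 ** (i + 1))
    return a

# With level i=1 and split pointer n=1, bucket 0 has split into 0 and 2.
assert lh_bucket(3, 1, 1) == 1   # 3 mod 2 = 1: bucket 1, not yet split
assert lh_bucket(2, 1, 1) == 2   # 2 mod 2 = 0 < n, so use 2 mod 4 = 2
```

Because clients cache only (i, n) images, there is no centralized directory to consult on each access, which is the SDDS property shown earlier.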
12
LH*RS File Structure
[Diagram: each record inserted with a key receives a rank r within its data bucket; parity buckets mirror the group rank by rank]
Data Buckets: [ Key | Data Field ]
Parity Buckets: [ Rank | Key List | Parity Field ]
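The per-rank parity entry can be sketched as below: for a given rank, a parity bucket stores the keys of the group's records at that rank plus a parity field (here plain XOR, as used by the first parity bucket; names and the fixed record size are illustrative assumptions).

```python
def parity_entry(group_records, record_size=8):
    """Build one parity-bucket entry for a given rank.

    group_records: the (key, data) record each DB of the group holds at this
    rank, or None if a slot is empty. Returns (key_list, parity_field).
    """
    keys = [rec[0] if rec else None for rec in group_records]
    parity = bytearray(record_size)
    for rec in group_records:
        if rec is None:
            continue
        for i, b in enumerate(rec[1].ljust(record_size, b"\0")):
            parity[i] ^= b
    return keys, bytes(parity)

keys, parity = parity_entry([(12, b"foo"), None, (47, b"bar"), None])
assert keys == [12, None, 47, None]
```

The key list lets recovery locate which surviving records belong to a damaged rank before the parity field is used to rebuild the missing one.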
13
Architectural Design of LH*RS
14
Communication
Use of UDP: individual Insert/Update/Delete/Search queries, record recovery, service and control messages → speed
Use of TCP/IP: new PB creation, large update transfer (DB split), bucket recovery → better performance & reliability than UDP
15
Bucket Architecture
[Diagram: bucket architecture. Multicast, UDP and TCP listening threads receive from their respective ports and feed message queues processed by a pool of working threads. An acknowledgement-management thread tracks the sending credit, the window of messages waiting for acknowledgement, and the not-yet-acked messages for retransmission.]
16
Architectural Design
Enhancements to the SDDS2000 [B00, D01] bucket architecture:
  • TCP/IP connection handler: TCP/IP connections are passive OPEN (RFC 793 [ISI, 81]); TCP/IP implementation under the Win2K Server OS [MacDonald & Barkley, 00]
  • Flow control and acknowledgement management: principle of Sending Credit & message conservation until delivery [Jacobson, 88], [Diène, 01]

Ex.: 1 DB recovery: SDDS2000 architecture 6.7 s, new architecture 2.6 s → 60% improvement (hardware config: 733 MHz machines, 100 Mbps network)
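The sending-credit principle can be sketched as follows: at most `credit` messages may be outstanding at once, and every message is kept until its delivery is acknowledged (a minimal sketch; class and message names are illustrative, and the actual UDP transmission is elided).

```python
class CreditSender:
    """Flow control by sending credit, with message conservation."""

    def __init__(self, credit):
        self.credit = credit
        self.unacked = {}      # msg_id -> message, kept until acked
        self.pending = []      # messages waiting for free credit
        self.next_id = 0

    def send(self, msg):
        if len(self.unacked) >= self.credit:
            self.pending.append(msg)       # no credit left: buffer it
            return None
        self.next_id += 1
        self.unacked[self.next_id] = msg   # would transmit over UDP here
        return self.next_id

    def ack(self, msg_id):
        self.unacked.pop(msg_id, None)     # delivery confirmed: drop the copy
        if self.pending:                   # a credit freed up: send the next
            self.send(self.pending.pop(0))

sender = CreditSender(credit=1)
assert sender.send("insert k=5") == 1
assert sender.send("insert k=9") is None   # buffered, credit exhausted
sender.ack(1)                              # frees credit; 2nd message goes out
assert list(sender.unacked.values()) == ["insert k=9"]
```

Keeping the unacked copies is what allows retransmission on time-out, so UDP speed is obtained without losing messages.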
17
Architectural Design (Cont'd)
Dynamic IP@ structure: updated when adding new/spare buckets (PBs/DBs) through a multicast probe
18
Hardware Testbed
  • 5 machines (Pentium IV 1.8 GHz, 512 MB RAM)
  • Ethernet network, max bandwidth of 1 Gbps
  • Operating system: Windows 2K Server
  • Tested configuration:
  • 1 client
  • A group of 4 data buckets
  • k parity buckets, k ∈ {0, 1, 2}

19
LH*RS
File Creation
20
File Creation
Client operation: propagation of each Insert/Update/Delete on a data record to the parity buckets.
Data Bucket Split:
Splitting DB → its PBs: (records that remain) N deletes from old rank + N inserts at new rank; (records that move) N deletes.
New DB → its PBs: N inserts (moved records).
All updates are gathered in the same buffer and transferred (TCP/IP) simultaneously to the respective parity buckets of the splitting DB & the new DB.
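The update batching a split generates can be sketched as below (an illustrative sketch; the function and argument names are assumptions, and real buffers would carry record payloads rather than bare keys):

```python
def build_split_buffers(remaining, moving):
    """Gather the parity updates a DB split generates.

    remaining: {key: (old_rank, new_rank)} for records staying in the
    splitting DB; moving: [(key, old_rank)] for records sent to the new DB.
    Returns one update buffer per DB, each shipped in a single TCP/IP
    transfer to that DB's parity buckets.
    """
    splitting_db_updates = []
    for key, (old_rank, new_rank) in remaining.items():
        splitting_db_updates.append(("delete", old_rank, key))
        splitting_db_updates.append(("insert", new_rank, key))
    for key, old_rank in moving:
        splitting_db_updates.append(("delete", old_rank, key))

    new_db_updates = [("insert", new_rank, key)
                      for new_rank, (key, _) in enumerate(moving)]
    return splitting_db_updates, new_db_updates

stay, move = {10: (0, 0), 18: (1, 0)}, [(13, 2)]
s_buf, n_buf = build_split_buffers(stay, move)
assert len(s_buf) == 5 and n_buf == [("insert", 0, 13)]
```

Batching all the per-record deletes and inserts into one TCP transfer per parity bucket is what keeps the split's parity overhead bounded.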
21
File Creation Perf.
  • Experimental set-up
  • File of 25,000 data records; 1 data record = 104 B

Client Sending Credit = 1
Client Sending Credit = 5
PB Overhead
  • k = 0 to k = 1
  • → perf. degradation of 20%
  • k = 1 to k = 2
  • → perf. degradation of 8%

22
File Creation Perf.
  • Experimental set-up
  • File of 25,000 data records; 1 data record = 104 B

Client Sending Credit = 1
Client Sending Credit = 5
PB Overhead
  • k = 0 to k = 1
  • → perf. degradation of 37%
  • k = 1 to k = 2
  • → perf. degradation of 10%

23
LH*RS
Parity Bucket Creation
24
PB Creation Scenario
Searching for a new PB
Coordinator
"Wanna join group g?" <Multicast> (Sender IP@ & Entity, Your Entity)
PBs connected to the Blank PBs multicast group
25
PB Creation Scenario
Waiting for Replies
Coordinator
Start UDP listening, start TCP listening, start working threads; wait for confirmation.
If time-out elapses → cancel all
PBs connected to the Blank PBs multicast group
26
PB Creation Scenario
PB Selection
Coordinator
Cancellation
"You are Selected" <UDP>
Disconnect from Blank PBs multicast group
PBs connected to the Blank PBs multicast group
27
PB Creation Scenario
Auto-creation: Query phase

New PB
Data Buckets group
28
PB Creation Scenario
Auto-creation: Encoding phase

New PB
Data Buckets group
29
PB Creation Perf.
  • Experimental set-up
  • Bucket size: 5,000 .. 50,000 records; bucket contents: 0.625 × bucket size
  • File size: 2.5 × bucket size records

XOR Encoding
RS Encoding
Comparison
30
PB Creation Perf.
  • Experimental set-up
  • Bucket size: 5,000 .. 50,000 records; bucket contents: 0.625 × bucket size
  • File size: 2.5 × bucket size records

XOR Encoding
RS Encoding
Comparison
31
PB Creation Perf.
XOR Encoding
RS Encoding
Comparison
For bucket size = 50,000:
XOR encoding rate: 0.66 MB/sec
RS encoding rate: 0.673 MB/sec
XOR provides a performance gain of 5% in processing time (≈0.02% in total time)
32
LH*RS
Bucket Recovery
33
Buckets Recovery
Failure Detection
Coordinator
"Are You Alive?" <UDP>
[Diagram: the coordinator probes the parity buckets and data buckets; two failed buckets cannot answer]
34
Buckets Recovery
Waiting for Replies
Coordinator
"I am Alive!" <UDP>
[Diagram: the alive parity and data buckets reply; the failed buckets stay silent]
35
Buckets Recovery
Searching for 2 Spare DBs
Coordinator
"Wanna be a Spare DB?" <Multicast> (Sender IP@, Your Entity)
DBs connected to the Blank DBs multicast group
36
Buckets Recovery
Waiting for Replies
Coordinator
Start UDP listening, start TCP listening, start working threads; wait for confirmation.
If time-out elapses → cancel all
DBs connected to the Blank DBs multicast group
37
Buckets Recovery
Spare DBs Selection
Coordinator
Cancellation
"You are Selected" <UDP>
Disconnect from Blank DBs multicast group
DBs connected to the Blank DBs multicast group
38
Buckets Recovery
Recovery Manager Determination
Coordinator
"Recover Buckets" (Spares' IP@s & Entities)
Parity Buckets
39
Buckets Recovery
Query Phase
Alive buckets participating in recovery
Recovery Manager
"Send me records of rank in [r, r+slice-1]" <UDP>

Parity Buckets
Data Buckets
Spare DBs
40
Buckets Recovery
Alive buckets participating in recovery
Reconstruction Phase
Recovery Manager
Requested buffer <TCP>

Parity Buckets
Data Buckets
Spare DBs
41
DBs Recovery Perf.
Experimental set-up: file = 125,000 recs; bucket = 31,250 recs ≈ 3.125 MB
XOR Decoding
RS Decoding
Comparison
42
DBs Recovery Perf.
Experimental set-up: file = 125,000 recs; bucket = 31,250 recs ≈ 3.125 MB
XOR Decoding
RS Decoding
Comparison
43
DBs Recovery Perf.
Experimental set-up: file = 125,000 recs; bucket = 31,250 recs ≈ 3.125 MB
XOR Decoding
RS Decoding
Comparison
1 DB recovery time (XOR): 0.720 sec
1 DB recovery time (RS): 0.855 sec
XOR provides a performance gain of 15% in total time
44
DBs Recovery Perf.
Experimental set-up: file = 125,000 recs; bucket = 31,250 recs ≈ 3.125 MB
Recover 2 DBs
Recover 3 DBs
45
DBs Recovery Perf.
Experimental set-up: file = 125,000 recs; bucket = 31,250 recs ≈ 3.125 MB
Recover 2 DBs
Recover 3 DBs
46
Perf. Summary of Bucket Recovery
  • 1 DB (3.125 MB) in 0.7 sec (XOR)
  • → 4.46 MB/sec
  • 1 DB (3.125 MB) in 0.85 sec (RS)
  • → 3.65 MB/sec
  • 2 DBs (6.250 MB) in 1.2 sec (RS)
  • → 5.21 MB/sec
  • 3 DBs (9.375 MB) in 1.6 sec (RS)
  • → 5.86 MB/sec
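The throughputs above are simply recovered volume divided by recovery time; the arithmetic checks out against the slide's figures:

```python
# Recovery throughput = recovered volume / recovery time (figures from above).
recoveries = [(3.125, 0.70), (3.125, 0.855), (6.250, 1.2), (9.375, 1.6)]
rates = [round(mb / sec, 2) for mb, sec in recoveries]
assert rates == [4.46, 3.65, 5.21, 5.86]   # MB/sec
```

Note that multi-bucket RS recovery achieves higher throughput than single-bucket recovery: the fixed per-recovery costs (spare search, matrix inversion) are amortized over more recovered data.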

47
Conclusion
The conducted experiments show that:
Encoding/decoding optimization & the enhanced bucket architecture → strong impact on performance
Good recovery performance. Finally, we improved the processing time of the RS decoding process by 4% to 8%; 1 DB is recovered in half a second.
48
Conclusion
LH*RS
Mature implementation, many optimization iterations
The only SDDS with scalable availability
49
Future Work
Better parity update propagation strategy to PBs
Investigation of faster encoding/decoding processes
50
References
  • [Patterson et al., 88] D. A. Patterson, G. Gibson & R. H. Katz, A Case for Redundant Arrays of Inexpensive Disks, Proc. of ACM SIGMOD Conf., pp. 109-116, June 1988.
  • [ISI, 81] Information Sciences Institute, RFC 793: Transmission Control Protocol (TCP) Specification, Sept. 1981, http://www.faqs.org/rfcs/rfc793.html
  • [MacDonald & Barkley, 00] D. MacDonald, W. Barkley, MS Windows 2000 TCP/IP Implementation Details, http://secinf.net/info/nt/2000ip/tcpipimp.html
  • [Jacobson, 88] V. Jacobson, M. J. Karels, Congestion Avoidance and Control, Computer Communication Review, Vol. 18, No. 4, pp. 314-329.
  • [Xu et al., 99] L. Xu & J. Bruck, X-Code: MDS Array Codes with Optimal Encoding, IEEE Trans. on Information Theory, 45(1), pp. 272-276, 1999.
  • [Corbett et al., 04] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong, S. Sankar, Row-Diagonal Parity for Double Disk Failure Correction, Proc. of the 3rd USENIX Conf. on File and Storage Technologies, April 2004.
  • [Rabin, 89] M. O. Rabin, Efficient Dispersal of Information for Security, Load Balancing and Fault Tolerance, Journal of the ACM, Vol. 36, No. 2, April 1989, pp. 335-348.
  • [White, 91] P. E. White, RAID X tackles design problems with existing design RAID schemes, ECC Technologies, ftp://members.aol.com.mnecctek.ctr1991.pdf
  • [Blomer et al., 95] J. Blomer, M. Kalfane, R. Karp, M. Karpinski, M. Luby & D. Zuckerman, An XOR-Based Erasure-Resilient Coding Scheme, ICSI Tech. Rep. TR-95-048, 1995.

51
References (Cont'd)
[Litwin & Schwarz, 00] W. Litwin & T. Schwarz, LH*RS: A High-Availability Scalable Distributed Data Structure using Reed Solomon Codes, pp. 237-248, Proceedings of ACM SIGMOD 2000.
[Karlsson et al., 96] J. Karlsson, W. Litwin & T. Risch, LH*LH: A Scalable High Performance Data Structure for Switched Multicomputers, EDBT 96, Springer Verlag.
[Reed & Solomon, 60] I. Reed & G. Solomon, Polynomial Codes over Certain Finite Fields, Journal of the Society for Industrial and Applied Mathematics, 1960.
[Plank, 97] J. S. Plank, A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems, Software Practice & Experience, 27(9), Sept. 1997, pp. 995-1012.
[Diène, 01] A. W. Diène, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD Thesis, Nov. 2001, Université Paris Dauphine.
[Bennour, 00] F. Sahli Bennour, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD Thesis, June 2000, Université Paris Dauphine.
[Moussa] http://ceria.dauphine.fr/rim/rim.html
More references: http://ceria.dauphine.fr/rim/biblio.pdf
52
End
Thank you for your Attention
53
Parity Calculus
Galois Field: GF(2^8) → 1 symbol is 1 byte; GF(2^16) → 1 symbol is 2 bytes
(+) GF(2^16) vs. GF(2^8): reduces by 1/2 the number of symbols, and consequently the number of operations in the field
(-) Larger multiplication table sizes
New Generator Matrix:
  • 1st line of 1s: each PB executes XOR calculus for any update from the 1st DB of any group → performance gain of 4% (measured for PB creation)
  • 1st column of 1s: the 1st parity bucket executes XOR calculus instead of RS calculus → performance gain in encoding of 20%

Encoding & Decoding Hints:
  • Decoding: log pre-calculus of the H⁻¹ matrix coefficients and the b vector for multiple-bucket recovery → improvement of 4% to 8%
  • Encoding: log pre-calculus of the P matrix coefficients → improvement of 3.5%
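The "log pre-calculus" trick rests on exp/log tables over the Galois field: a field multiplication becomes one integer addition of logs plus a table lookup. A minimal sketch for GF(2^8) follows; the primitive polynomial 0x11D is an assumption (common in Reed-Solomon tutorials), as the slides do not specify one.

```python
# Build exp/log tables for GF(2^8) with primitive polynomial 0x11D.
GF_EXP = [0] * 512
GF_LOG = [0] * 256
x = 1
for i in range(255):
    GF_EXP[i] = x
    GF_LOG[x] = i
    x <<= 1
    if x & 0x100:          # reduce modulo the primitive polynomial
        x ^= 0x11D
for i in range(255, 512):  # duplicate so gf_mul never needs a modulo
    GF_EXP[i] = GF_EXP[i - 255]

def gf_mul(a, b):
    """Multiply two GF(2^8) symbols via the precomputed log/exp tables."""
    if a == 0 or b == 0:
        return 0
    return GF_EXP[GF_LOG[a] + GF_LOG[b]]

assert gf_mul(2, 2) == 4
assert gf_mul(0x80, 0x02) == 0x1D   # wraps around the field polynomial
```

Pre-computing the logs of the fixed matrix coefficients (P for encoding, H⁻¹ for decoding) removes one table lookup per multiplication in the inner loop, which is the source of the improvements quoted above.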