Slide 1: Contribution to the Design & Implementation of the Highly-Available, Scalable and Distributed Data Structure LH*RS

Paris Dauphine University, CERIA Lab.
October 4th, 2004
Rim Moussa (Rim.Moussa_at_dauphine.fr)
http://ceria.dauphine.fr/rim/rim.html
Thesis Supervisor: Pr. Witold Litwin
Examiners: Pr. Thomas J.E. Schwarz, Pr. Tore Risch
Jury President: Pr. Gérard Lévy
Thesis Presentation in Computer Science, Distributed Databases
Slide 2: Outline
- Issue
- State of the Art
- LH*RS Scheme
- LH*RS Manager
- Experiments
- LH*RS File Creation
- Bucket Recovery
- Parity Bucket Creation
- Conclusion & Future Work
Slide 3: Facts
- Volume of information grows by 30% per year
- Technology:
  - Network infrastructure
    >> Gilder's Law: bandwidth triples every year.
  - Evolution of PCs' storage and computing capacities
    >> Moore's Law: the latter double every 18 months.
- Bottleneck of disk accesses and CPUs

Need for Distributed Data Storage Systems: SDDSs (LH*, RP*) → High Throughput
Slide 4: Facts
- Multicomputers
  >> Modular architecture
  >> Good price/performance
- Tradeoff: frequent and costly failures
  >> Statistics published by the Contingency Planning Research in 1996: the cost of one hour of service interruption, in the case of a brokerage application, is $6.45 million.

Need for Distributed Highly-Available Data Storage Systems
Slide 5: State of the Art
Data Replication
(+) Good response time; mirrors are functional
(-) High storage overhead (×n for n replicas)

Parity Calculus
- Criteria to evaluate erasure-resilient codes:
  - Encoding rate (parity volume / data volume)
  - Update penalty (parity volumes)
  - Group size used for data reconstruction
  - Encoding & decoding complexity
  - Recovery capabilities
Slide 6: Parity Schemes
1-Available Schemes
- XOR parity calculus: RAID technology (levels 3, 4, 5) [PGK88], SDDS LH*g [L96]

k-Available Schemes
- Binary linear codes [H94] → tolerate at most 3 failures
- Array codes: EVENODD [B94], X-code [XB99], RDP [C04] → tolerate at most 2 failures
- Reed-Solomon codes: IDA [R89], RAID X [W91], FEC [B95], Tutorial [P97], LH*RS [LS00, ML02, MS04, LMS04] → tolerate k failures (k > 3)
Slide 7: Outline
- Issue
- State of the Art
- LH*RS Scheme
  - LH*RS?
  - SDDSs?
  - Reed-Solomon Codes?
  - Encoding/Decoding Optimizations
- LH*RS Manager
- Experiments
Slide 8: LH*RS?
LH*RS [LS00]
Scalability & High Throughput:
- LH*: Scalable Distributed Data Structure
- Distribution using Linear Hashing (LH*LH [KLR96])
- LH*LH Manager [B00]
High Availability:
- Parity calculus using Reed-Solomon codes [RS60]
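The linear-hashing distribution named above can be sketched as follows. This is a minimal illustration of the classic LH addressing rule that LH* clients evaluate against their file image (level i, split pointer n); the function name is ours, and the real LH*LH manager additionally adjusts an outdated image on addressing errors.

```python
# Minimal sketch of LH addressing, assuming hash functions h_i(key) = key mod 2^i.
def lh_address(key: int, i: int, n: int) -> int:
    """Map a key to a bucket number for an LH file in state (level i, split pointer n)."""
    a = key % (2 ** i)            # h_i(key)
    if a < n:                     # bucket a has already split in this round,
        a = key % (2 ** (i + 1))  # so use the next-level hash h_{i+1}(key)
    return a

# Example: file state i = 2, n = 1 (bucket 0 already split into buckets 0 and 4)
assert lh_address(4, 2, 1) == 4   # h_2(4) = 0 < n, rehash: h_3(4) = 4
assert lh_address(6, 2, 1) == 2   # h_2(6) = 2 >= n, address stays 2
```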
Slide 9: SDDSs Principles (1)
(1) Dynamic File Growth
[Figure: coordinator, clients and data buckets connected over the network]
Slide 10: SDDSs Principles (2)
(2) No Centralized Directory Access
[Figure: client with its own file image addressing the data buckets directly]
Slide 11: Reed-Solomon Codes
- Encoding: from m data symbols → computation of n - m parity symbols
- Data representation → Galois Field
  - Fields of finite size q
  - Closure property: addition, subtraction, multiplication, division.
- In GF(2^w):
  - (1) Addition: XOR
  - (2) Multiplication: via the tables gflog and antigflog:
    e1 * e2 = antigflog[gflog[e1] + gflog[e2]]
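The table-driven multiplication above can be sketched as follows for GF(2^8). This is an illustration only: it assumes the common primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D) and generator 2; the actual LH*RS implementation may use a different generator or GF(2^16) tables.

```python
# Build gflog/antigflog tables for GF(2^8) by enumerating powers of the generator.
GF_SIZE = 256
gflog = [0] * GF_SIZE
antigflog = [0] * GF_SIZE
b = 1
for log in range(GF_SIZE - 1):
    gflog[b] = log
    antigflog[log] = b
    b <<= 1
    if b & 0x100:
        b ^= 0x11D          # reduce modulo the primitive polynomial (assumption)

def gf_add(e1: int, e2: int) -> int:
    """(1) Addition in GF(2^w) is XOR."""
    return e1 ^ e2

def gf_mul(e1: int, e2: int) -> int:
    """(2) Multiplication via the log tables: antigflog[gflog[e1] + gflog[e2]]."""
    if e1 == 0 or e2 == 0:
        return 0
    return antigflog[(gflog[e1] + gflog[e2]) % (GF_SIZE - 1)]
```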
12RS Encoding
1 0 0 0 0 0 C1,1 C1,j C1,n-m 0 1 0 0 0
0 C2,1 C2,j C2,n-m 0 0 1 0 0 0
C3,1 C3,j C3,n-m 0 0 0 0 0
1 Cm,1 Cm,j Cm,n-m
13RS Decoding
S1 S2 S3 S4 Sm P1 P2 P3 Pn-m
1 0 0 0 0 0 C1,1 C1,2 C1,3 C1,n-m 0 1 0 0 0
0 C2,1 C2,2 C2,3 C2,n-m 0 0 1 0 0 0 C3,1
C3,2 C3,3 C3,n-m
0 0 0 0 0 1 Cm,1 Cm,2
Cm,3 Cm,n-m
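The encoding step of the two slides above can be sketched as a vector-matrix product over the Galois field: each parity symbol P_j is the GF sum of the data symbols S_i weighted by column j of the parity part C. The sketch below is self-contained (it rebuilds GF(2^8) tables under the same 0x11D assumption), and the coefficient matrix C is a placeholder: the real LH*RS codes derive C so that any m surviving symbols suffice for decoding.

```python
# GF(2^8) log tables (assumed primitive polynomial 0x11D, generator 2).
gflog, antigflog = [0] * 256, [0] * 256
b = 1
for log in range(255):
    gflog[b], antigflog[log] = log, b
    b <<= 1
    if b & 0x100:
        b ^= 0x11D

def gf_mul(e1: int, e2: int) -> int:
    if e1 == 0 or e2 == 0:
        return 0
    return antigflog[(gflog[e1] + gflog[e2]) % 255]

def rs_encode(data, C):
    """data: m symbols; C: m x (n-m) parity coefficients; returns the n-m parity symbols."""
    k = len(C[0])
    parity = [0] * k
    for i, s in enumerate(data):
        for j in range(k):
            parity[j] ^= gf_mul(s, C[i][j])   # GF addition is XOR
    return parity

# Illustrative matrix with a first column of 1s, so P1 is the plain XOR of the data.
C = [[1, 3], [1, 5], [1, 17], [1, 29]]        # placeholder coefficients (m = 4)
data = [0x0A, 0x0B, 0x0C, 0x0D]
parity = rs_encode(data, C)
assert parity[0] == 0x0A ^ 0x0B ^ 0x0C ^ 0x0D
```

Decoding works the other way around: collect any m surviving symbols, invert the corresponding m × m submatrix H over the GF, and multiply to recover the lost symbols.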
Slide 14: Optimizations (1): Galois Field
GF(2^8) → 1 symbol = 1 byte; GF(2^16) → 1 symbol = 2 bytes
(+) GF(2^16) vs. GF(2^8) halves the number of symbols → fewer operations in the GF.
(-) Multiplication tables size: GF(2^8): 0.768 KB; GF(2^16): 393.216 KB (512 × 0.768)
Slide 15: Optimizations (2): Parity Matrix

  | 0001 0001 0001 |
  | 0001 eb9b 2284 |
  | 0001 2284 9e74 |
  | 0001 9e44 d7f1 |

1st row of 1s: any update from the 1st data bucket is processed with XOR calculus → performance gain of 4% (case of PB creation, m = 4)
1st column of 1s: encoding of the 1st parity bucket uses XOR calculus → gain in encoding & decoding
Slide 16: Optimizations (3): GF Multiplication
Goal: reduce GF multiplication complexity
e1 * e2 = antigflog[gflog[e1] + gflog[e2]]
Encoding: log pre-calculus of the coefficients of the parity matrix → improvement of 3.5%
Decoding: log pre-calculus of the coefficients of the H^-1 matrix and of the OK-symbols vector → improvement of 4% to 8%, depending on the buckets to recover

  | 0000 0000 0000 |
  | 0000 5ab5 e267 |
  | 0000 e267 0dce |
  | 0000 784d 2b66 |
Slide 17: LH*RS Parity Groups
- Grouping concept:
  - m data buckets
  - k parity buckets
- Data buckets store (Key, Data) records; an inserted key gets a rank r within its bucket.
- Parity buckets store (Rank, Key-list, Parity) records.
A k-available group survives the failure of up to k buckets.
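The grouping arithmetic above can be sketched as follows; the function names are ours, and the sketch only shows the fixed mapping from a data bucket number to its group and to its column in the RS encoding.

```python
# Sketch of the parity-group arithmetic: groups of m consecutive data buckets.
def parity_group(bucket: int, m: int) -> int:
    """Group number of a data bucket, with m data buckets per group."""
    return bucket // m

def in_group_position(bucket: int, m: int) -> int:
    """Position of the bucket inside its group, i.e. its column in the RS encoding."""
    return bucket % m

# With m = 4: buckets 0..3 form group 0, buckets 4..7 form group 1, ...
assert parity_group(5, 4) == 1
assert in_group_position(5, 4) == 1
```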
Slide 18: Outline
- Issue
- State of the Art
- LH*RS Scheme
- LH*RS Manager
  - Communication
  - Gross Architecture
- Experiments
- File Creation
- Bucket Recovery
Slide 19: Communication (1): UDP
- Individual operations (insert, update, delete, search)
- Record recovery
- Control messages
→ Performance
Slide 20: Communication (2): TCP/IP
- Large buffer transfers:
  - New parity buckets
  - Transfer of parity update records (bucket split)
  - Bucket recovery
→ Performance & Reliability
Slide 21: Communication (3): Multicast
- Looking for new data/parity buckets
- Multipoint communication
Slide 22: Architecture
Enhancements to the SDDS2000 architecture
(1) TCP/IP Connection Handler
- TCP/IP connections are passive OPEN (RFC 793 [ISI81]); TCP/IP under the Win2K Server OS [MB00]
- Before: recovery of 1 bucket (3.125 MB): SDDS2000: 6.7 s; SDDS2000-TCP: 2.6 s (hardware config.: 733 MHz CPUs, 100 Mbps network)
→ Improvement of 60%
(2) Flow Control & Message Acknowledgement (FCMA)
- Principle of sending credit & message conservation until delivery [J88, GRS97, D01]
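The FCMA principle above can be illustrated by a small sender sketch: at most `credit` messages are unacknowledged at any time, and every sent message is retained until its ACK arrives, so it can be resent on timeout. This is our own illustration under those two assumptions; the actual manager's transport, timers and message format are omitted.

```python
# Illustrative credit-based sender: sending credit + conservation until delivery.
class CreditSender:
    def __init__(self, credit: int):
        self.credit = credit      # max unacknowledged messages in flight
        self.pending = {}         # msg_id -> message retained until its ACK
        self.queue = []           # messages waiting for a free credit
        self.next_id = 0

    def send(self, msg) -> None:
        self.queue.append(msg)
        self._flush()

    def _flush(self) -> None:
        # Move queued messages into flight while credit remains.
        while self.queue and len(self.pending) < self.credit:
            msg_id, self.next_id = self.next_id, self.next_id + 1
            self.pending[msg_id] = self.queue.pop(0)
            # a real implementation would transmit (msg_id, msg) here

    def ack(self, msg_id: int) -> None:
        # Delivery confirmed: the message may finally be discarded.
        self.pending.pop(msg_id, None)
        self._flush()

    def timed_out(self, msg_id: int):
        # Still retained, so a timeout can trigger retransmission.
        return self.pending.get(msg_id)
```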
Slide 23: Architecture (2)
(3) Dynamic IP Addressing Structure
To tag new servers (data or parity) using multicast:
- a multicast group of blank parity buckets,
- a multicast group of blank data buckets,
- the coordinator tracks the created buckets.
Before: a pre-defined and static table of IP addresses
24Architecture (3)
Network
TCP Listening Thread
TCP/IP Port
Pool of Working Threads
ACK Structure
Messages Queue
Free Zones
UDP Listening Port
UDP Listening Thread
Messages waiting for ACK.
UDP Sending Port
Not acquitted Messages
Multicast Working Thread
ACK Mgmt Threads
Multicast listening Thread
Message Queue
Multicast Listening Port
Slide 25: Experiments
- Performance evaluation: CPU time & communication time
- Experimental environment:
  - 5 machines (Pentium IV 1.8 GHz, 512 MB RAM)
  - 1 Gbps Ethernet network
  - O.S.: Win2K Server
- Tested configuration:
  - 1 client,
  - a group of 4 data buckets,
  - k parity buckets (k = 0, 1, 2, 3).
Slide 26: Outline
- Issue
- State of the Art
- LH*RS Scheme
- LH*RS Manager
- Experiments
- File Creation
  - Parity Update
  - Performance
- Bucket Recovery
- Parity Bucket Creation
Slide 27: File Creation
- Propagation of data record inserts/updates/deletes to the parity buckets.
- Updates: send only the Δrecord.
- Deletes: management of free ranks within data buckets.
- Bucket split, with N1 remaining records and N2 leaving records:
  - parity group of the splitting data bucket: N1 + N2 deletes, then N1 inserts;
  - parity group of the new data bucket: N2 inserts.
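The split rule above can be sketched as follows. The sketch is ours: message emission is simulated by returning (group, operation, key) tuples, and rank bookkeeping is left out; it only shows which parity group receives how many deletes and inserts.

```python
# Sketch of the parity maintenance triggered by a bucket split: the splitting
# bucket's group sees N1+N2 deletes then N1 re-inserts (ranks are reassigned),
# and the new bucket's group sees N2 inserts.
def split_parity_updates(remaining, leaving, old_group, new_group):
    msgs = []
    for key in remaining + leaving:       # N1 + N2 deletes from the old group
        msgs.append((old_group, "delete", key))
    for key in remaining:                 # N1 inserts back, with fresh ranks
        msgs.append((old_group, "insert", key))
    for key in leaving:                   # N2 inserts into the new group
        msgs.append((new_group, "insert", key))
    return msgs

msgs = split_parity_updates(remaining=[1, 5], leaving=[3, 7], old_group=0, new_group=1)
assert sum(1 for g, op, _ in msgs if g == 0 and op == "delete") == 4
assert sum(1 for g, op, _ in msgs if g == 1 and op == "insert") == 2
```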
Slide 28: Performance (File Creation): Config.
- Max bucket size: 10,000 records; file of 25,000 records; 1 record = 104 bytes
- No difference between GF(2^8) and GF(2^16) (we don't wait for ACKs between DBs and PBs)
Slide 29: Performance (File Creation): Client Window = 1
k = 0 → k = 1: performance degradation of 20%
k = 1 → k = 2: performance degradation of 8%
Slide 30: Performance (File Creation): Client Window = 5
k = 0 → k = 1: performance degradation of 37%
k = 1 → k = 2: performance degradation of 10%
Slide 31: Outline
- Issue
- State of the Art
- LH*RS Scheme
- LH*RS Manager
- Experiments
- File Creation
- Bucket Recovery
  - Scenario
  - Performance
- Parity Bucket Creation
Slide 32: Scenario (1): Failure Detection
The coordinator probes the parity and data buckets: "Are you alive?" Two buckets have failed.
Slide 33: Scenario (2): Waiting for Responses
The live parity and data buckets answer "OK"; the two failed buckets stay silent.
Slide 34: Scenario (3): Searching Spare Buckets
The coordinator asks the multicast group of blank data buckets: "Wanna be a spare?"
Slide 35: Scenario (4): Waiting for Replies
Volunteers answer "I would", launch UDP listening, TCP listening and working threads, then wait for confirmation.
If the time-out elapses → cancel everything.
Slide 36: Scenario (5): Spare Selection
The coordinator hires the confirmed spares ("You are hired") and cancels the others.
Slide 37: Scenario (6): Recovery Manager Selection
The coordinator appoints one of the parity buckets: "Recover the failed buckets".
Slide 38: Scenario (7): Query Phase
The recovery manager asks the buckets participating in the recovery (data and parity): "Send me the records of rank in [r, r+slice-1]".
Slide 39: Scenario (8): Reconstruction Phase
The participating buckets send the requested buffers; the recovery manager decodes them and forwards the recovered slices to the spare buckets. The decoding phase runs in parallel with the query phase.
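For the simple 1-unavailability case, the slice-wise reconstruction above can be sketched as follows. The sketch is ours and assumes an all-1s (XOR) parity bucket is among the survivors, so the missing bucket's slice is just the XOR of the live buffers; for k > 1 failures the RS decoding would replace the XOR with a multiplication by the inverted submatrix H^-1.

```python
# Sketch of slice-wise recovery for one unavailable bucket, XOR case.
def recover_slice(live_buffers):
    """live_buffers: equal-length byte buffers of the m-1 live data buckets plus
    the XOR parity bucket, for one slice of ranks. Returns the missing buffer."""
    recovered = bytearray(len(live_buffers[0]))
    for buf in live_buffers:
        for pos, byte in enumerate(buf):
            recovered[pos] ^= byte
    return bytes(recovered)

def recover_bucket(fetch, n_slices):
    """Illustrative driver loop: fetch(s) returns the live buffers of slice s;
    slices keep memory use adaptive to the machine, as on the config. slide."""
    return b"".join(recover_slice(fetch(s)) for s in range(n_slices))

# Group with m = 3: parity = d0 ^ d1 ^ d2, and bucket d1 is lost.
d0, d2 = b"\x01\x02", b"\x04\x08"
d1 = b"\x10\x20"
parity = bytes(a ^ b ^ c for a, b, c in zip(d0, d1, d2))
assert recover_slice([d0, d2, parity]) == d1
```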
Slide 40: Performance (Bucket Recovery): Config.
- File info:
  - file of 125,000 records
  - record size: 100 bytes
  - bucket size: 31,250 records ≈ 3.125 MB
  - group of 4 data buckets (m = 4), k-available with k = 1, 2, 3
- Decoding:
  - GF(2^16)
  - RS decoding (RS + log pre-calculus of H^-1 and of the OK-symbols vector)
  - recovery per slice (adaptive to PCs' storage & computing capacities)
Slide 41: Performance (Bucket Recovery) [chart]
Slide 42: Performance (Bucket Recovery) [chart]
Slide 43: Performance (Bucket Recovery): XOR vs. RS
- Time to recover 1 DB with XOR: 0.58 s
- Time to recover 1 DB with RS: 0.67 s
- XOR in GF(2^16) realizes a gain of 13% in total time (and 30% in CPU time)
Slide 44: Performance (Bucket Recovery) [chart]
Slide 45: Performance (Bucket Recovery) [chart]
Slide 46: Performance (Bucket Recovery): Summary
- Time to recover f buckets ≈ f × time to recover 1 bucket
- Factorized query phase → the difference is the decoding time plus the time to send the recovered buffers
Slide 47: Performance (Bucket Recovery): GF(2^8)
- XOR in GF(2^8) improves decoding performance by 60% compared to RS in GF(2^8).
- RS decoding in GF(2^16) realizes a gain of 50% compared to RS decoding in GF(2^8).
Slide 48: Outline
1. Issue
2. State of the Art
3. LH*RS Scheme
4. LH*RS Manager
5. Experiments
6. File Creation
7. Bucket Recovery
8. Parity Bucket Creation: Scenario, Performance
Slide 49: Scenario (1): Searching for a New Parity Bucket
The coordinator asks the multicast group of blank parity buckets: "Wanna join group g?"
Slide 50: Scenario (2): Waiting for Replies
Volunteers answer "I would", launch UDP listening, TCP listening and working threads, then wait for confirmation.
If the time-out elapses → cancel everything.
Slide 51: Scenario (3): New Parity Bucket Selection
The coordinator hires the confirmed bucket ("You are hired") and cancels the others.
Slide 52: Scenario (4): Auto-creation, Query Phase
The new parity bucket asks the group of data buckets: "Send me your contents!"
Slide 53: Scenario (5): Auto-creation, Encoding Phase
The new parity bucket encodes the buffers received from the group of data buckets.
Slide 54: Performance (PB Creation): Config.
- Max bucket size: 5,000 to 50,000 records
- Bucket load factor: 62.5%
- Record size: 100 bytes
- Group of 4 data buckets
- Encoding:
  - GF(2^16)
  - RS (+ log pre-calculus; row of 1s → XOR encoding to process the 1st DB buffer)
Slide 55: Performance (PB Creation) [chart]
Same encoding rate for any bucket size; CPU time ≈ 74% of total time
Slide 56: Performance (PB Creation) [chart]
Slide 57: Performance (PB Creation): XOR vs. RS
For a bucket size of 50,000 records:
- XOR encoding: 2.062 s
- RS encoding: 2.103 s
XOR realizes a CPU-time gain of 5% (only 0.02% on total time)
Slide 58: Performance (PB Creation): GF(2^8)
- As in GF(2^16), CPU time ≈ 3/4 of total time
- XOR in GF(2^8) improves CPU time by 22%
Slide 59: Performance Summary
Wintel P4, 1.8 GHz, 1 Gbps
- Bucket recovery rate: 4.66 MB/s from 1-unavailability; 6.94 MB/s from 2-unavailability; 7.62 MB/s from 3-unavailability
- Record recovery time: about 1.3 ms
- File creation rate: 0.33 MB/s for k = 0; 0.25 MB/s for k = 1; 0.23 MB/s for k = 2
- Record insert time: 0.29 ms for k = 0; 0.33 ms for k = 1; 0.36 ms for k = 2
- Key search time: individual: 0.24 ms; bulk: 0.056 ms
Slide 60: Conclusion
Experiments prove:
- the encoding/decoding and architecture optimizations have a clear impact on performance;
- good recovery performance.
Slide 61: Future Work
- Update propagation to parity buckets: reliability vs. performance
- Reduce the coordinator's tasks
- Parity declustering
- Investigation of new erasure-resilient codes
Slide 62: References
- [PGK88] D. A. Patterson, G. Gibson & R. H. Katz, A Case for Redundant Arrays of Inexpensive Disks, Proc. of ACM SIGMOD Conf., pp. 109-116, June 1988.
- [ISI81] Information Sciences Institute, RFC 793: Transmission Control Protocol (TCP) Specification, Sept. 1981, http://www.faqs.org/rfcs/rfc793.html
- [MB00] D. MacDonal, W. Barkley, MS Windows 2000 TCP/IP Implementation Details, http://secinf.net/info/nt/2000ip/tcpipimp.html
- [J88] V. Jacobson, M. J. Karels, Congestion Avoidance and Control, Computer Communication Review, Vol. 18, No. 4, pp. 314-329.
- [XB99] L. Xu & J. Bruck, X-Code: MDS Array Codes with Optimal Encoding, IEEE Trans. on Information Theory, 45(1), pp. 272-276, 1999.
- [C04] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong, S. Sankar, Row-Diagonal Parity for Double Disk Failure Correction, Proc. of the 3rd USENIX Conf. on File and Storage Technologies, April 2004.
- [R89] M. O. Rabin, Efficient Dispersal of Information for Security, Load Balancing and Fault Tolerance, Journal of the ACM, Vol. 36, No. 2, April 1989, pp. 335-348.
- [W91] P. E. White, RAID X tackles design problems with existing design RAID schemes, ECC Technologies, ftp://members.aol.com.mnecctek.ctr1991.pdf
- [GRS97] J. C. Gomez, V. Redo, V. S. Sunderam, Efficient Multithreaded User-Space Transport for Network Computing: Design & Test of the TRAP Protocol, Journal of Parallel & Distributed Computing, 40(1), 1997.
63References (2)
BK 95 J. Blomer, M. Kalfane, R. Karp, M.
Karpinski, M. Luby D. Zuckerman, An XOR-Based
Erasure-Resilient Coding Scheme, ICSI Tech. Rep.
TR-95-048, 1995. LS00 W. Litwin T. Schwarz,
LHRS A High-Availability Scalable Distributed
Data Structure using Reed Solomon Codes,
p.237-248, Proceedings of the ACM SIGMOD 2000.
KLR96 J. Karlson, W. Litwin T. Risch, LHLH
A Scalable high performance data structure for
switched multicomputers, EDBT 96, Springer
Verlag. RS60 I. Reed G. Solomon, Polynomial
codes over certain Finite Fields, Journal of the
society for industrial and applied mathematics,
1960. P97 J. S. Plank, A Tutorial on
Reed-Solomon Coding for fault-Tolerance in
RAID-like Systems, Software Practise
Experience, 27(9), Sept. 1997, pp 995-
1012, D01 A.W. Diène, Contribution à la Gestion
de Structures de Données Distribuées et
Scalables, PhD Thesis, Nov. 2001, Université
Paris Dauphine. B00 F. Sahli Bennour,
Contribution à la Gestion de Structures de
Données Distribuées et Scalables, PhD Thesis,
Juin 2000, Université Paris Dauphine.
Références http//ceria.dauphine.fr/rim/theserim
.pdf
64Publications
ML02 R. Moussa, W. Litwin, Experimental
Performance Analysis of LHRS Parity Management,
Carleton Scientific Records of the 4th
International Workshop on Distributed Data
Structure WDAS 2002, p.87-97. MS04 R.
Moussa, T. Schwarz, Design and Implementation of
LHRS A Highly-Available Scalable Distributed
Data Structure, Carleton Scientific Records of
the 6th International Workshop on Distributed
Data Structure WDAS 2004. LMS04 W. Litwin,
R. Moussa, T. Schwarz, Prototype Demonstration of
LHRS A Highly Available Distributed Storage
System, Proc. of VLDB 2004 (Demo Session)
p.1289-1292. LMS04-a W. Litwin, R. Moussa, T.
Schwarz, LHRS A Highly Available Distributed
Storage System, journal version submitted, under
revision.
Slide 65: Thank You For Your Attention
Questions?