1
Contribution to the Design & Implementation of
the Highly Available Scalable and Distributed
Data Structure LH*RS
Paris Dauphine University, CERIA Lab.
4 October 2004
Rim Moussa, Rim.Moussa@dauphine.fr
http://ceria.dauphine.fr/rim/rim.html
Thesis Supervisor: Pr. Witold Litwin
Examiners: Pr. Thomas J.E. Schwarz, Pr. Tore Risch
Jury President: Pr. Gérard Lévy
Thesis Presentation in Computer Science
Distributed Databases
2
Outline
  • Issue
  • State of the Art
  • LH*RS Scheme
  • LH*RS Manager
  • Experimentation
  • LH*RS File Creation
  • Bucket Recovery
  • Parity Bucket Creation
  • Conclusion & Future Work

3
Facts
  • Volume of Information grows by 30%/year
  • Technology
  • Network Infrastructure
  • >> Gilder's Law: bandwidth triples every year.
  • Evolution of PCs' storage & computing capacities
  • >> Moore's Law: the latter double every 18 months.
  • Bottleneck: Disk Accesses & CPUs

Need for Scalable Distributed Data Storage Systems (SDDSs):
LH*, RP* → High Throughput
4
Facts
  • Multicomputers
  • >> Modular Architecture
  • >> Good Price/Performance
  • Tradeoff
  • Frequent & Costly Failures
  • >> Stat. published by the Contingency Planning
    Research in 1996: the cost of one hour of service
    interruption, in the case of a brokerage application,
    is $6.45 million.
  • Need for Distributed Highly-Available Data
    Storage Systems

5
State of the Art
Data Replication
(+) Good Response Time, Mirrors are
functional (-) High Storage Overhead (×n for n
replicas)
Parity Calculus
  • Criteria to evaluate Erasure-Resilient Codes
  • Encoding Rate (Parity Volume / Data Volume)
  • Update Penalty (Parity Volumes)
  • Group Size used for Data Reconstruction
  • Encoding & Decoding Complexity
  • Recovery Capabilities

6
Parity Schemes
1-Available Schemes
XOR Parity Calculus: RAID Technology (levels 3,
4, 5) [PGK88], SDDS LH*g [L96]
k-Available Schemes
Binary Linear Codes [H94] → tolerate max. 3
failures; Array Codes: EVENODD [B94], X-code
[XB99], RDP [C04] → tolerate max. 2 failures;
Reed-Solomon Codes: IDA [R89], RAID X [W91],
FEC [B95], Tutorial [P97], LH*RS
[LS00, ML02, MS04, LMS04] → tolerate k failures
(k > 3)
7
Outline
  • Issue
  • State of the Art
  • LH*RS Scheme
  • LH*RS?
  • SDDSs?
  • Reed-Solomon Codes?
  • Encoding/Decoding Optimizations
  • LH*RS Manager
  • Experimentation

8
LH*RS?
LH*RS [LS00]
Scalability & High Throughput
LH*: Scalable Distributed Data Structure
Distribution using Linear Hashing (LH*LH [KLR96])
LH*LH Manager [B00]
High Availability
Parity Calculus using Reed-Solomon Codes [RS60]
9
SDDSs Principles
(1) Dynamic File Growth
[Diagram: clients and a coordinator connected over the network to a growing set of data buckets]
10
SDDSs Principles (2)
(2) No Centralized Directory Access
File Image
Client
[Diagram: the client addresses the Data Buckets directly through its File Image]
11
Reed-Solomon Codes
  • Encoding
  • From m Data Symbols → Calculus of n Parity
    Symbols
  • Data Representation → Galois Field
  • Fields of finite size q
  • Closure Property: Addition, Subtraction,
    Multiplication, Division.
  • In GF(2^w):
  • (1) Addition: XOR
  • (2) Multiplication: tables gflog and antigflog
  • e1 * e2 = antigflog[gflog[e1] + gflog[e2]]
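The two-table trick above can be sketched as follows (a minimal illustration, not the LH*RS code; the primitive polynomial 0x11D is an assumed, common choice for GF(2^8)):

```python
# GF(2^8) multiplication via log/antilog tables, as on the slide:
# e1 * e2 = antigflog[gflog[e1] + gflog[e2]].

def build_tables(prim=0x11D, w=8):
    size = (1 << w) - 1                 # 255 non-zero field elements
    gflog = [0] * (size + 1)
    antigflog = [0] * (2 * size)        # doubled so index sums need no modulo
    x = 1
    for i in range(size):
        antigflog[i] = x
        gflog[x] = i
        x <<= 1                         # multiply by the generator (2)
        if x & (1 << w):
            x ^= prim                   # reduce modulo the primitive polynomial
    for i in range(size, 2 * size):
        antigflog[i] = antigflog[i - size]
    return gflog, antigflog

GFLOG, ANTIGFLOG = build_tables()

def gf_mul(e1, e2):
    if e1 == 0 or e2 == 0:
        return 0
    return ANTIGFLOG[GFLOG[e1] + GFLOG[e2]]
```

Each multiplication costs two table lookups and one addition instead of a polynomial multiply-and-reduce.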

12
RS Encoding
The m × n generator matrix has the systematic form [ I | C ]:

1 0 0 … 0   C1,1 … C1,j … C1,n-m
0 1 0 … 0   C2,1 … C2,j … C2,n-m
0 0 1 … 0   C3,1 … C3,j … C3,n-m
⋮           ⋮
0 0 0 … 1   Cm,1 … Cm,j … Cm,n-m
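A hedged sketch of encoding with this systematic layout (the coefficient matrix and the primitive polynomial 0x11D below are assumptions for illustration, not the thesis's actual matrix): the data symbols pass through the identity part unchanged, and parity symbol j is the GF(2^8) sum of data[i] · Ci,j.

```python
# Systematic RS encoding with generator [ I | C ]:
# parity[j] = sum over i of data[i] * C[i][j] in GF(2^8).
# gf_mul uses shift-and-reduce multiplication.

def gf_mul(a, b, prim=0x11D):
    p = 0
    while b:
        if b & 1:
            p ^= a          # add (XOR) the current multiple of a
        a <<= 1
        if a & 0x100:
            a ^= prim       # reduce back into GF(2^8)
        b >>= 1
    return p

def rs_encode(data, C):
    """data: m symbols; C: m x (n-m) parity coefficient matrix."""
    parity = [0] * len(C[0])
    for j in range(len(C[0])):
        for i, d in enumerate(data):
            parity[j] ^= gf_mul(d, C[i][j])   # GF addition is XOR
    return parity
```

With a column of 1s, the inner loop degenerates to a plain XOR of the data symbols, which is exactly the optimization exploited later for the first parity bucket.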
13
RS Decoding
[S1 S2 S3 … Sm | P1 P2 P3 … Pn-m]

1 0 0 … 0   C1,1 C1,2 C1,3 … C1,n-m
0 1 0 … 0   C2,1 C2,2 C2,3 … C2,n-m
0 0 1 … 0   C3,1 C3,2 C3,3 … C3,n-m
⋮           ⋮
0 0 0 … 1   Cm,1 Cm,2 Cm,3 … Cm,n-m
14
Optimizations
GF Multiplication
Galois Field
Parity Matrix
GF(2^8) → 1 symbol = 1 byte; GF(2^16) → 1 symbol = 2 bytes
(+) GF(2^16) vs. GF(2^8): halves the number of
symbols → fewer operations in the GF.
(-) Multiplication tables size:
GF(2^8): 0.768 KB; GF(2^16): 393.216 KB (512 ×
0.768)
15
Optimizations (2)
GF Multiplication
Galois Field
Parity Matrix
0001 0001 0001
0001 eb9b 2284
0001 2284 9e74
0001 9e44 d7f1

1st row of 1s: any update from the 1st DB is
processed with XOR calculus → performance gain
of 4% (PB creation case, m = 4)
1st column of 1s: encoding of the 1st PB uses
XOR calculus → gain in encoding & decoding
16
Optimizations (3)
GF Multiplication
Galois Field
Parity Matrix
Goal: Reduce GF Multiplication Complexity
e1 * e2 = antigflog[gflog[e1] + gflog[e2]]
Encoding: log pre-calculus of the coefficients of the P
matrix → improvement of 3.5%
Decoding: log pre-calculus of the coefficients of the H^-1 matrix
and of the OK-symbols vector → improvement of 4% to 8%,
depending on the buckets to recover

0000 0000 0000
0000 5ab5 e267
0000 e267 0dce
0000 784d 2b66
17
LH*RS: Parity Groups
  • Grouping Concept
  • m data buckets
  • k parity buckets

Key r → Insert at Rank r
Data Buckets: Key → Data
Parity Buckets: Rank → Key-list + Parity
A k-available group survives the failure of up to k
buckets.
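A minimal sketch of this bookkeeping (class and field names are assumed for illustration; this is not the SDDS-2000 code). Shown 1-available, so the parity field is a plain XOR of the member records' bodies:

```python
# A parity record of rank r stores the keys of the group members that
# occupy rank r in each of the m data buckets, plus the parity of
# their bodies.

class ParityRecord:
    def __init__(self, m, size):
        self.keys = [None] * m          # key of the member from each data bucket
        self.parity = bytes(size)       # XOR parity of the member bodies

    def apply_update(self, bucket_no, key, delta):
        # delta = old_body XOR new_body (an all-zero old body for an insert)
        self.keys[bucket_no] = key
        self.parity = bytes(p ^ d for p, d in zip(self.parity, delta))
```

Sending only the delta is what lets an update reach the parity buckets without shipping the whole record.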
18
Outline
  • Issue
  • State of the Art
  • LH*RS Scheme
  • LH*RS Manager
  • Communication
  • Gross Architecture
  • Experimentation
  • File Creation
  • Bucket Recovery

19
Communication
TCP/IP
UDP
Multicast
  • Individual Operations
  • (Insert, Update, Delete, Search)
  • Record Recovery
  • Control Messages

Performance
20
Communication
TCP/IP
UDP
Multicast
  • Large Buffer Transfers
  • New Parity Buckets
  • Transfer of Parity Update Records (Bucket Split)
  • Bucket Recovery

Performance & Reliability
21
Communication
TCP/IP
UDP
Multicast
Looking for New Data/Parity Buckets
Multipoint Communication
22
Architecture
Enhancements to SDDS2000 Architecture
  • (1) TCP/IP Connection Handler

TCP/IP connections are passive OPEN (RFC 793
[ISI81]); TCP/IP under Win2K Server OS [MB00]
Before:
1 Bucket Recovery (3.125 MB): SDDS-2000:
6.7 s; SDDS-2000-TCP: 2.6 s (hardware
config: 733 MHz CPU machines, 100 Mbps network)
→ Improvement of 60%
(2) Flow Control & Message Acknowledgement (FCMA)
Principle of sending credit & message
conservation until delivery [J88, GRS97, D01]
23
Architecture (2)
(3) Dynamic IP Addressing Structure
To tag new servers (data or parity) using
Multicast
Multicast Group of Blank Parity Buckets
Created Buckets
Multicast Group of Blank Data Buckets
Coordinator
Before
Pre-defined and Static IP Addresses Table
24
Architecture (3)
Network
TCP Listening Thread
TCP/IP Port
Pool of Working Threads
ACK Structure
Messages Queue
Free Zones
UDP Listening Port
UDP Listening Thread
Messages waiting for ACK.
UDP Sending Port
Unacknowledged Messages

Multicast Working Thread
ACK Mgmt Threads
Multicast listening Thread
Message Queue
Multicast Listening Port
25
Experimentation
  • Performance Evaluation
  • CPU Time
  • Communication Time
  • Experimental Environment
  • 5 Machines (Pentium IV 1.8 GHz, 512 MB RAM)
  • Ethernet Network 1 Gbps
  • O.S. Win2K Server
  • Tested Configuration
  • 1 Client,
  • A group of 4 Data Buckets,
  • k Parity Buckets (k = 0, 1, 2, 3).

26
Outline
  • Issue
  • State of the Art
  • LH*RS Scheme
  • LH*RS Manager
  • Experimentation
  • File Creation
  • Parity Update
  • Performance
  • Bucket Recovery
  • Parity Bucket Creation

27
File Creation
  • Client Operations
  • Propagation of Data Record Inserts/ Updates/
    Deletes to Parity Buckets.
  • Update: send only the Δ-record.
  • Deletes: management of free ranks within data
    buckets.
  • Data Bucket Split

N1 remaining records, N2 leaving records.
Parity group of the splitting data
bucket: N1 + N2 deletes + N1 inserts. Parity
group of the new data bucket: N2 inserts.
28
Performance
Config.
Client Window 1
Client Window 5
Max Bucket Size: 10,000 records; File of 25,000
records; 1 record = 104 bytes. No difference between
GF(2^8) and GF(2^16) (we don't wait for ACKs between
DBs and PBs)
29
Performance
Config.
Client Window 1
Client Window 5
k = 0 → k = 1: perf. degradation of 20%
k = 1 → k = 2: perf. degradation of 8%
30
Performance
Config.
Client Window 1
Client Window 5
k = 0 → k = 1: perf. degradation of 37%
k = 1 → k = 2: perf. degradation of 10%
31
Outline
  • Issue
  • State of the Art
  • LH*RS Scheme
  • LH*RS Manager
  • Experimentation
  • File Creation
  • Bucket Recovery
  • Scenario
  • Performance
  • Parity Bucket Creation

32
Scenario
Failure Detection
Coordinator
Are you Alive?
?
?
Parity Buckets
Data Buckets
33
Scenario (2)
Waiting for Responses
Coordinator
OK
OK
OK
OK
?
?
Parity Buckets
Data Buckets
34
Scenario (3)
Searching Spare Buckets
Coordinator
Wanna be Spare ?
Multicast Group of Blank Data Buckets
35
Scenario (4)
Waiting for Replies
I would
Coordinator
I would
I would
Launch UDP listening, launch TCP listening, launch
working threads; wait for confirmation.
If the time-out elapses → cancel everything
Multicast Group of Blank Data Buckets
36
Scenario (5)
Spare Selection
Cancellation
Coordinator
Confirmed
You are Hired
Confirmed
Multicast Group of Blank Data Buckets
37
Scenario (6)
Recovery Manager Selection
Coordinator
Recover Failed Buckets
Parity Buckets
38
Scenario (7)
Query Phase
Recovery Manager
Send me records of rank in [r, r + slice − 1]

Parity Buckets
Data Buckets
Buckets participating in Recovery
Spare Buckets
39
Scenario (8)
Reconstruction Phase
Recovery Manager
Requested Buffers

Parity Buckets
Data Buckets
Decoding Phase
In parallel with the Query Phase
Buckets participating in Recovery
Recovered Slices
Spare Buckets
40
Performance
2 DBs
1 DB XOR
Config.
1 DB RS
XOR vs. RS
  • File Info
  • File of 125,000 records
  • Record Size: 100 bytes
  • Bucket Size: 31,250 records → 3.125 MB
  • Group of 4 Data Buckets (m = 4), k-available
    with k = 1, 2, 3
  • Decoding
  • GF(2^16)
  • RS Decoding (RS + log pre-calculus of H^-1
    and of the OK-symbols vector)
  • Recovery per slice (adaptive to PCs' storage &
    computing capacities)
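The slice-at-a-time recovery loop can be sketched for the single-failure XOR case (the k > 1 case would multiply the surviving symbols by H^-1 instead; function and variable names are assumptions, not the SDDS-2000 code):

```python
# Recover one failed data bucket from the m-1 surviving data buckets
# and the XOR parity bucket, one slice of records at a time, so the
# recovery manager never buffers more than slice_size records.

def recover_bucket(surviving, parity, slice_size):
    """surviving: m-1 data buckets; parity: the XOR parity bucket.
    Buckets are lists of equal-sized byte records, indexed by rank."""
    recovered = []
    total = len(parity)
    for start in range(0, total, slice_size):        # query phase, per slice
        for r in range(start, min(start + slice_size, total)):
            rec = bytearray(parity[r])               # start from the parity record
            for bucket in surviving:
                rec = bytearray(a ^ b for a, b in zip(rec, bucket[r]))
            recovered.append(bytes(rec))             # decoding phase
    return recovered
```

Tuning slice_size is what makes the scheme adaptive to the spare's storage and computing capacity.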

43
Performance
2 DBs
1 DB XOR
Config.
1 DB RS
XOR vs. RS
Time to recover 1 DB (XOR): 0.58 s
Time to recover 1 DB (RS): 0.67 s
XOR in GF(2^16) realizes a gain of 13% in Total
Time (and 30% in CPU Time)
46
Performance
3 DBs
2 DBs
Summary
XOR vs. RS
1 DB RS
Time to Recover f Buckets ≈ f × Time to Recover
1 Bucket. Factorized Query Phase → the Δ is the
Decoding Time + the Time to send the Recovered
Buffers
47
Performance
GF(28)
3 DBs
2 DBs
Summary
XOR vs. RS
  • XOR in GF(2^8) improves decoding performance by 60%
    compared to RS in GF(2^8).
  • RS decoding in GF(2^16) realizes a gain of
    50% compared to decoding in GF(2^8).

48
Outline
1. Issue
2. State of the Art
3. LH*RS Scheme
4. LH*RS Manager
5. Experimentation
6. File Creation
7. Bucket Recovery
8. Parity Bucket Creation
  • Scenario
  • Performance
49
Scenario
Searching for a new Parity Bucket
Coordinator
Wanna Join Group g ?
Multicast Group of Blank Parity Buckets
50
Scenario (2)
Waiting for Replies
Coordinator
I Would
I Would
I Would
Launch UDP listening, launch TCP listening, launch
working threads; wait for confirmation.
If the time-out elapses → cancel everything
Multicast Group of Blank Parity Buckets
51
Scenario (3)
New Parity Bucket Selection
Cancellation
Coordinator
Cancellation
You are Hired
Confirmed
Multicast Group of Blank Parity Buckets
52
Scenario (4)
Auto-creation Query Phase
Send me your contents !


New Parity Bucket
Group of Data Buckets
53
Scenario (5)
Auto-creation Encoding Phase

New Parity Bucket
Group of Data Buckets
54
Performance
XOR
RS
Config.
GF(28)
XOR vs. RS
  • Max Bucket Size: 5,000 .. 50,000 records
  • Bucket Load Factor: 62.5%
  • Record Size: 100 bytes
  • Group of 4 Data Buckets
  • Encoding
  • GF(2^16)
  • RS (+ log pre-calculus + row of 1s → XOR
    encoding to process the 1st DB buffer)
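For the first parity bucket, whose coefficients are all 1s, the encoding phase reduces to XOR-ing the m buffers fetched in the query phase (an illustrative sketch; the buffer layout and function name are assumptions):

```python
# Auto-creation encoding for the FIRST parity bucket: its matrix
# coefficients are all 1s, so the parity buffer is the plain XOR of
# the m data-bucket buffers (the general case multiplies each buffer
# by its RS coefficient first).

def encode_first_pb(buffers):
    """buffers: m equal-length bytes objects fetched from the group."""
    parity = bytearray(len(buffers[0]))
    for buf in buffers:
        for i, b in enumerate(buf):
            parity[i] ^= b
    return bytes(parity)
```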

55
Performance
XOR
RS
Config.
GF(28)
XOR vs. RS
Same Encoding Rate ∀ Bucket Size
CPU Time ≈ 74% of Total Time
57
Performance
XOR
RS
Config.
GF(28)
XOR vs. RS
For Bucket Size = 50,000 records:
XOR encoding time: 2.062 s
RS encoding time: 2.103 s
XOR realizes a performance gain in CPU time of 5%
(→ only 0.02% on Total Time)
58
Performance
XOR
RS
Config.
GF(28)
XOR vs. RS
  • Same in GF(2^16): CPU Time ≈ 3/4 of Total Time
  • XOR in GF(2^8) improves CPU Time by 22%

59
Performance
Wintel P4, 1.8 GHz, 1 Gbps
Bucket Recovery Rate: 4.66 MB/s from
1-unavailability; 6.94 MB/s from 2-unavailability;
7.62 MB/s from 3-unavailability
Record Recovery Time: about 1.3 ms
File Creation Rate: 0.33 MB/s for k = 0; 0.25 MB/s
for k = 1; 0.23 MB/s for k = 2
Record Insert Time: 0.29 ms for k = 0; 0.33 ms for
k = 1; 0.36 ms for k = 2
Key Search Time: Individual: 0.24 ms; Bulk: 0.056 ms
60
Conclusion
Experiments prove:
Optimizations of Encoding/Decoding and of the Architecture
→ Impact on Performance
Good Recovery Performance
61
Future Work
Update Propagation to Parity Buckets:
Reliability & Performance
Reduce Coordinator Tasks
Parity Declustering
Investigation of New Erasure-Resilient Codes
62
References
  • [PGK88] D. A. Patterson, G. Gibson & R. H. Katz,
    A Case for Redundant Arrays of Inexpensive Disks,
    Proc. of the ACM SIGMOD Conf., pp. 109-116, June 1988.
  • [ISI81] Information Sciences Institute, RFC 793:
    Transmission Control Protocol (TCP)
    Specification, Sept. 1981,
    http://www.faqs.org/rfcs/rfc793.html
  • [MB00] D. MacDonald, W. Barkley, MS Windows 2000
    TCP/IP Implementation Details,
    http://secinf.net/info/nt/2000ip/tcpipimp.html
  • [J88] V. Jacobson, M. J. Karels, Congestion
    Avoidance and Control, Computer Communication
    Review, Vol. 18, No. 4, pp. 314-329.
  • [XB99] L. Xu & J. Bruck, X-Code: MDS Array Codes
    with Optimal Encoding, IEEE Trans. on Information
    Theory, 45(1), pp. 272-276, 1999.
  • [C04] P. Corbett, B. English, A. Goel, T.
    Grcanac, S. Kleiman, J. Leong, S. Sankar,
    Row-Diagonal Parity for Double Disk Failure
    Correction, Proc. of the 3rd USENIX Conf. on
    File and Storage Technologies, April 2004.
  • [R89] M. O. Rabin, Efficient Dispersal of
    Information for Security, Load Balancing and
    Fault Tolerance, Journal of the ACM, Vol. 36, No. 2,
    April 1989, pp. 335-348.
  • [W91] P. E. White, RAID X tackles design problems
    with existing design RAID schemes, ECC
    Technologies, ftp://members.aol.com.mnecctek.ctr1991.pdf
  • [GRS97] J. C. Gomez, V. Rego, V. S. Sunderam,
    Efficient Multithreaded User-Space Transport for
    Network Computing: Design & Test of the TRAP
    Protocol, Journal of Parallel & Distributed
    Computing, 40(1), 1997.

63
References (2)
BK 95 J. Blomer, M. Kalfane, R. Karp, M.
Karpinski, M. Luby D. Zuckerman, An XOR-Based
Erasure-Resilient Coding Scheme, ICSI Tech. Rep.
TR-95-048, 1995. LS00 W. Litwin T. Schwarz,
LHRS A High-Availability Scalable Distributed
Data Structure using Reed Solomon Codes,
p.237-248, Proceedings of the ACM SIGMOD 2000.
KLR96 J. Karlson, W. Litwin T. Risch, LHLH
A Scalable high performance data structure for
switched multicomputers, EDBT 96, Springer
Verlag. RS60 I. Reed G. Solomon, Polynomial
codes over certain Finite Fields, Journal of the
society for industrial and applied mathematics,
1960.  P97 J. S. Plank, A Tutorial on
Reed-Solomon Coding for fault-Tolerance in
RAID-like Systems, Software Practise
Experience, 27(9), Sept. 1997, pp 995-
1012, D01 A.W. Diène, Contribution à la Gestion
de Structures de Données Distribuées et
Scalables, PhD Thesis, Nov. 2001, Université
Paris Dauphine. B00 F. Sahli Bennour,
Contribution à la Gestion de Structures de
Données Distribuées et Scalables, PhD Thesis,
Juin 2000, Université Paris Dauphine.
Références http//ceria.dauphine.fr/rim/theserim
.pdf
64
Publications
[ML02] R. Moussa, W. Litwin, Experimental
Performance Analysis of LH*RS Parity Management,
Carleton Scientific Records of the 4th
International Workshop on Distributed Data
Structures (WDAS 2002), pp. 87-97.
[MS04] R. Moussa, T. Schwarz, Design and
Implementation of LH*RS: A Highly-Available
Scalable Distributed Data Structure, Carleton
Scientific Records of the 6th International
Workshop on Distributed Data Structures (WDAS 2004).
[LMS04] W. Litwin, R. Moussa, T. Schwarz, Prototype
Demonstration of LH*RS: A Highly Available
Distributed Storage System, Proc. of VLDB 2004
(Demo Session), pp. 1289-1292.
[LMS04-a] W. Litwin, R. Moussa, T. Schwarz, LH*RS:
A Highly Available Distributed Storage System,
journal version submitted, under revision.
65
Thank You For Your Attention
Questions ?