Scalable Distributed Data Structures - PowerPoint PPT Presentation (Transcript)
1
Scalable Distributed Data Structures
High-Performance Computing
Witold Litwin & Fethi Bennour
CERIA, University Paris 9 Dauphine
http://ceria.dauphine.fr/
2
Plan
  • Multicomputers for HPC
  • What are SDDSs?
  • Overview of LH*
  • Implementation under SDDS-2000
  • Conclusion

3
Multicomputers
  • A collection of loosely coupled computers
  • Mass-produced and/or preexisting hardware
  • share-nothing architecture
  • Best for HPC because of scalability
  • message passing through a high-speed net
    (≥ 100 Mb/s)
  • Network multicomputers
  • use general-purpose nets & PCs
  • LANs: Fast Ethernet, Token Ring, SCI, FDDI,
    Myrinet, ATM
  • NCSA cluster: 1024 NTs on Myrinet by the end of
    1999
  • Switched multicomputers
  • use a bus, or a switch
  • IBM-SP2, Parsytec...

4
Why Multicomputers ?
  • Unbeatable price-performance ratio for HPC.
  • Cheaper and more powerful than supercomputers.
  • especially the network multicomputers.
  • Available everywhere.
  • Computing power.
  • file size, access and processing times,
    throughput...
  • For more pros & cons:
  • IBM SP2 and GPFS literature.
  • Tanenbaum "Distributed Operating Systems",
    Prentice Hall, 1995.
  • NOW project (UC Berkeley).
  • Bill Gates at Microsoft Scalability Day, May
    1997.
  • www.microsoft.com: White Papers from Business
    Systems Div.
  • Report to the President, President's Information
    Technology Advisory Committee, Aug. 98.

5
Typical Network Multicomputer
(figure: client and server machines connected by the network)
6
Why SDDSs
  • Multicomputers need data structures and file
    systems
  • Trivial extensions of traditional structures are
    not best
  • hot-spots
  • scalability
  • parallel queries
  • distributed and autonomous clients
  • distributed RAM & distance to data
  • For a CPU, data on a disk are as far away as the
    Moon is for a human (J. Gray, ACM Turing Award
    1999)

7
What is an SDDS ?
  • Data are structured
  • records with keys / objects with OIDs
  • more semantics than in the Unix flat-file model
  • the abstraction most popular with applications
  • parallel scans & function shipping
  • Data are on servers
  • waiting for access
  • Overflowing servers split into new servers
  • appended to the file without informing the
    clients
  • Queries come from multiple autonomous clients
  • Access initiators
  • No synchronous updates of the client images
  • No centralized directory used for access
    computations

8
What is an SDDS ?
  • Clients can make addressing errors
  • Clients have a more or less adequate image of the
    actual file structure
  • Servers are able to forward the queries to the
    correct address
  • perhaps in several messages
  • Servers may send Image Adjustment Messages
  • Clients do not make the same error twice
  • Servers support parallel scans
  • Sent out by multicast or unicast
  • With deterministic or probabilistic termination
  • See the SDDS talks & papers for more
  • ceria.dauphine.fr/witold.html
  • Or the LH* ACM-TODS paper (Dec. 96)

9
High-Availability SDDS
  • A server can be unavailable for access without
    service interruption
  • Data are reconstructed from other servers
  • Data and parity servers
  • Up to k ≥ 1 servers can fail
  • At parity overhead cost of about 1/k
  • Factor k can itself scale with the file
  • Scalable availability SDDSs

10-13
An SDDS: growth through splits under inserts
(figure, animated over four slides: client inserts reach the server buckets; an
overflowing bucket splits onto a new server, so the file grows without the
clients being informed)
14-18
An SDDS: Client Access
(figure, animated over five slides: clients address the server buckets directly;
a misaddressed request is forwarded among the servers and triggers an IAM back
to the client)
19-24
Known SDDSs
(figure, built up over six slides: a taxonomy of data structures — classics
vs. SDDSs (1993))
  Hash: LH*, DDH, Breitbart et al.
  1-d tree
  m-d trees
  High availability: LH*m, LH*g
  Scalable (s-)availability: LH*SA, LH*RS
  Security: LH*s
  Disk: SDLSA
http://192.134.119.81/SDDS-bibliograhie.html
25
LH* (a classic)
  • Scalable distributed hash partitioning
  • generalizes the LH addressing scheme
  • variants used in Netscape products, LH-Server,
    Unify, Frontpage, IIS, MsExchange...
  • Typical load factor: 70–90%
  • In practice, at most 2 forwarding messages
  • regardless of the size of the file
  • In general, 1 message per insert and 2 messages
    per search on average
  • 4 messages in the worst case

26
LH* bucket servers
  • For every record c, its correct address a results
    from the LH* addressing rule (sketched below):
  • a ← h_i(c)
  • if n = 0 then exit
  • else
  •   if a < n then a ← h_{i+1}(c)
  • end
  • (i, n): the file state, known only to the
    LH*-coordinator
  • Each server a keeps track only of the function h_j
    used to access it
  • j = i or j = i + 1

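A minimal sketch of the addressing rule above, assuming Python and taking h_j(c) = c mod 2^j as in linear hashing; it is an illustration, not the SDDS-2000 code. A client applies the same rule with its image (i', n') in place of the true state (i, n).

    # LH* addressing rule (sketch; assumes h_j(c) = c mod 2**j, as in linear hashing)
    def h(j, key):
        return key % (2 ** j)

    def lh_address(key, i, n):
        # (i, n) is the file state at the coordinator, or a client's image (i', n')
        a = h(i, key)
        if n == 0:
            return a
        if a < n:                  # bucket a has already split at level i
            a = h(i + 1, key)
        return a

    # Examples matching the figures below, with file state i = 3, n = 2:
    # lh_address(15, 3, 2) -> 7    lh_address(9, 3, 2) -> 9
    # A brand-new client image (0, 0) always yields bucket 0.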
27
LH* clients
  • Each client uses the LH* rule for address
    computation, but with the client image (i', n')
    of the file state.
  • Initially, for a new client, (i', n') = (0, 0).

28
LH* Server Address Verification and Forwarding
  • Server a getting key c (a = m in particular)
    computes (sketched below):
  • a' ← h_j(c)
  • if a' = a then accept c
  • else a'' ← h_{j-1}(c)
  •   if a'' > a and a'' < a' then a' ← a''
  • send c to bucket a'

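A corresponding sketch of the server-side check (assumed Python, with h_j as in the earlier sketch); it returns None when the bucket keeps the record, and otherwise the address the request is forwarded to.

    # LH* server-side address verification and forwarding (sketch)
    def h(j, key):
        return key % (2 ** j)

    def verify_or_forward(a, j, key):
        # bucket a, using hash level j, receives the key
        a1 = h(j, key)
        if a1 == a:
            return None                 # correct bucket: accept the record
        a2 = h(j - 1, key)              # candidate address at the previous level
        if a < a2 < a1:
            a1 = a2
        return a1                       # forward the request to bucket a1

    # Example matching the figures below: bucket 0 (j = 4) gets key 15,
    # verify_or_forward(0, 4, 15) -> 7; bucket 7 (j = 3) then accepts it.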
29
Client Image Adjustment
  • The IAM consists of the address a where the client
    sent c and of j(a); the adjustment (sketched
    below) is
  • if j > i' then i' ← j - 1, n' ← a + 1
  • if n' ≥ 2^i' then n' ← 0, i' ← i' + 1
  • The rule guarantees that the client image is within
    the file
  • Provided there are no file contractions (merges)

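A sketch of the image adjustment (assumed Python): the IAM carries the address a the client used and that bucket's level j.

    # Client image adjustment on receiving an IAM (sketch)
    def adjust_image(i_img, n_img, a, j):
        if j > i_img:
            i_img = j - 1
            n_img = a + 1
        if n_img >= 2 ** i_img:
            n_img = 0
            i_img += 1
        return i_img, n_img

    # Example matching the figures below: image (i' = 0, n' = 0), IAM with a = 7, j = 3
    # adjust_image(0, 0, 7, 3) -> (3, 0), i.e. the image becomes n' = 0, i' = 3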
30-31
LH* file structure
(figure: server buckets 0, 1 (j = 4), 2, 7 (j = 3), 8, 9 (j = 4); coordinator
file state n = 2, i = 3; two clients with images (n' = 0, i' = 0) and
(n' = 3, i' = 2))
32-34
LH* split
(figure, animated over three slides: the coordinator makes bucket n = 2 split
using h_4; a new bucket 10 is appended to the file. Afterwards buckets 0, 1, 2
have j = 4, bucket 7 has j = 3, buckets 8, 9, 10 have j = 4; the coordinator
state becomes n = 3, i = 3; the client images (n' = 0, i' = 0) and
(n' = 3, i' = 2) are unchanged)
35-37
LH* addressing
(figure, animated over three slides: the client with image (n' = 0, i' = 0)
sends key 15; the request is forwarded to the correct bucket 7 (j = 3), which
accepts it and returns an IAM with a = 7, j = 3; the client image becomes
(n' = 0, i' = 3). Coordinator state: n = 3, i = 3)
38-41
LH* addressing
(figure, animated over four slides: a client sends key 9; the request is
forwarded to the correct bucket 9 (j = 4); the IAM shown carries a = 9, j = 4,
and the sending client's image becomes (n' = 1, i' = 3). Coordinator state:
n = 3, i = 3)
42
Result
  • The distributed file can grow over even the whole
    Internet, and yet
  • every insert and search is done in at most four
    messages (IAM included)
  • in general, an insert is done in one message and
    a search in two messages

43
SDDS-2000: Prototype Implementation of LH* and of
RP* on a Wintel multicomputer
  • Client/Server architecture
  • TCP/IP communication (UDP and TCP) with Windows
    Sockets (a minimal messaging sketch follows below)
  • Multiple-thread control
  • Process synchronization (mutex, critical
    section, event, time-out, etc.)
  • Queuing system
  • Optional flow control for UDP messaging

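A minimal sketch of the UDP request/response pattern listed above, in Python rather than the actual SDDS-2000 Windows Sockets code; the server address and the textual message format are hypothetical.

    import socket

    SERVER = ("127.0.0.1", 50000)        # hypothetical bucket-server address

    def send_request(op, key, timeout=1.0):
        # one datagram out, one datagram back (which may carry an IAM);
        # a time-out stands in for the missing delivery guarantee of UDP
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(f"{op} {key}".encode(), SERVER)
            try:
                data, _ = sock.recvfrom(4096)
                return data.decode()
            except socket.timeout:
                return None              # lost message

    # e.g. send_request("SEARCH", 15) or send_request("INSERT", 15)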
44
SDDS-2000 Client Architecture
  • Send Request
  • Receive Response
  • Return Response
  • Client image processing

45
SDDS-2000 Server Architecture
  • Listen thread
  • Queuing system
  • Work threads
  • Local processing
  • Forward
  • Response
(figure: a client request arrives from the network through the socket; the
listen thread puts it into the queuing system; work threads 1 to 4 analyse the
request, process it locally against the SDDS bucket (insertion, search, update,
delete), forward it, or return the response to the client; see the threading
sketch below)
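A sketch of the listen-thread / queue / work-thread pattern of the figure, again assuming Python for brevity; the actual request handling is only stubbed out.

    import queue, socket, threading

    requests = queue.Queue()                       # the queuing system

    def listen_thread(port=50000):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("0.0.0.0", port))
        while True:
            requests.put(sock.recvfrom(4096))      # enqueue (data, client address)

    def work_thread(bucket):
        while True:
            data, addr = requests.get()            # dequeue one request
            # request analysis, local insert/search/update/delete on the bucket,
            # forwarding, or sending the response back would go here
            bucket.append((addr, data))
            requests.task_done()

    if __name__ == "__main__":
        bucket = []                                # stand-in for the SDDS RAM bucket
        threading.Thread(target=listen_thread).start()
        for _ in range(4):                         # four work threads, as in the figure
            threading.Thread(target=work_thread, args=(bucket,)).start()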
46
LH*LH RAM buckets
47
Measuring conditions
  • LAN of 4 computers interconnected by a 100 Mb/s
    Ethernet
  • F.S (Fast Server): Pentium II 350 MHz, 128 MB RAM
  • F.C (Fast Client): Pentium II 350 MHz, 128 MB RAM
  • S.C (Slow Client): Pentium 90 MHz, 48 MB RAM
  • S.S (Slow Server): Pentium 90 MHz, 48 MB RAM
  • The measurements result from 10,000 records or
    more
  • UDP protocol for insertions and searches
  • TCP protocol for splitting

48
Best performance of an F.S configuration
(figure: three slow clients S.C (1), S.C (2), S.C (3) access one fast server
F.S holding bucket 0 (j = 0) over the 100 Mb/s Ethernet; UDP communication)
49
Fast Server: Average Insert time
  • Inserts without ack
  • 3 clients create lost messages
  • → best time: 0.44 ms

50
Fast Server: Average Search time
  • The time measured includes the search processing
    and the response return
  • With more than 3 clients, there are many lost
    messages
  • Whatever the bucket capacity (1,000, 5,000, ...,
    20,000 records),
  • → 0.66 ms is the best time

51
Performance of a Slow Server configuration
(figure: one slow client S.C (with wait) accesses one slow server S.S holding
bucket 0 (j = 0) over the 100 Mb/s Ethernet; UDP communication)
52
Slow Server: Average Insert time
  • Measurements on the server, without ack
  • S.C to S.S (with wait)
  • We don't need a 2nd client
  • → 2.3 ms is the best constant time

53
Slow Server: Average Search time
  • Measurements on the server
  • S.C to S.S (with wait)
  • We don't need a 2nd client
  • → 3.3 ms is the best time

54
Insert time into up to 3 buckets: configuration
(figure: a slow client S.C sends keys in batches 1, 2, 3, ... over the 100 Mb/s
Ethernet; bucket 0 is on a fast server F.S (j = 2), bucket 1 on a slow server
S.S (j = 1), bucket 2 on a slow server S.S (j = 2); UDP communication)
55
Average insert time, no ack
  • File creation includes 2 splits, forwards and
    IAM updates
  • Pre-existing buckets: no splits
  • Conditions: S.C + F.S + 2 S.S
  • Time measured on the server of bucket 0, which is
    informed of the end of insertions by each
    server
  • The split is not penalizing: → 0.8 ms/insert in
    both cases

56
Average search time with 3 Slow Servers: configuration
(figure: a fast client F.C sends keys in batches 1, 2, 3, ... over the 100 Mb/s
Ethernet; buckets 0, 1 and 2 are on slow servers S.S (j = 2, j = 1 and j = 2);
UDP communication)
57
The average key search time: Fast Client & Slow Servers
  • Records are sent in batches: 1, 2, 3, ..., 10,000
  • Balanced load: the 3 buckets receive the same
    number of records
  • Unbalanced load: bucket 1 receives more than the
    others
  • Conclusion: the curve is linear → good
    parallelism

58-61
Extrapolation: single 700 MHz P3 server
(table, built up over four slides)

  Processor        Pentium II 350 MHz (F.S)   Pentium 90 MHz (S.S, ~4x slower)   Pentium III 700 MHz (~2x faster)
  Search time      0.66 ms                    3.3 ms  (~5x the F.S time)         < 0.33 ms (F.S time / 2)
  Insertion time   0.44 ms                    2.37 ms (~5x the F.S time)         < 0.22 ms (F.S time / 2)
62
Extrapolation: search time on fast P3 servers
  • The client is an F.C
  • With 3 servers at 350 MHz, the search time is
    0.216 ms/key
  • With 3 servers at 700 MHz, the search time is
    0.106 ms/key (a worked check follows below)

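A small worked check of these extrapolations, assuming Python and assuming that the per-key time scales linearly with the number of servers (the linear batch curve of the earlier measurements) and inversely with the CPU clock; the helper below is only illustrative.

    def per_key_ms(single_server_ms, n_servers, clock_speedup=1.0):
        # assumes linear scaling with the number of servers and the clock rate
        return single_server_ms / n_servers / clock_speedup

    # per_key_ms(0.66, 3)                  -> 0.22 ms  (close to the 0.216 ms figure above)
    # per_key_ms(0.66, 3, clock_speedup=2) -> 0.11 ms  (close to the 0.106 ms figure above)
    # per_key_ms(0.33, 100)                -> ~0.003 ms for 100 fast servers (next slide)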
63
Extrapolation: search time in a file scaling to
100 servers
64
RP* schemes
  • Produce 1-d ordered files
  • for range search
  • Use m-ary trees
  • like a B-tree
  • Efficiently support range queries (a generic
    sketch follows below)
  • LH* also supports range queries
  • but less efficiently
  • Consist of a family of three schemes
  • RP*N, RP*C and RP*S

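A generic sketch of range partitioning, assuming Python; it only illustrates why a 1-d ordered, range-partitioned file answers range queries efficiently, and is not the actual RP* split or traversal algorithm.

    import bisect

    class RangePartitionedFile:
        # buckets hold disjoint key ranges defined by sorted boundaries
        def __init__(self, boundaries):
            self.boundaries = sorted(boundaries)
            self.buckets = [dict() for _ in range(len(self.boundaries) + 1)]

        def _bucket(self, key):
            return bisect.bisect_right(self.boundaries, key)

        def insert(self, key, record):
            self.buckets[self._bucket(key)][key] = record

        def range_search(self, lo, hi):
            # only the buckets whose ranges intersect [lo, hi] are visited
            first, last = self._bucket(lo), self._bucket(hi)
            for b in self.buckets[first:last + 1]:
                for k in sorted(b):
                    if lo <= k <= hi:
                        yield k, b[k]

    f = RangePartitionedFile([100, 200])
    for k in (5, 150, 199, 250):
        f.insert(k, "rec%d" % k)
    print(list(f.range_search(120, 260)))    # visits only buckets 1 and 2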
65
RP* schemes
66
(No Transcript)
67
Comparison between LH*LH and RP*N
68
Scalable Distributed Log Structured Array (SDLSA)
  • Intended for high-capacity SANs of IBM Ramac
    Virtual Arrays (RVAs) or Enterprise Storage
    Servers (ESSs)
  • One RVA contains up to 0.8 TB of data
  • One ESS contains up to 13 TB of data
  • Reuse of current capabilities
  • Transparent access to the entire SAN, as if it
    were one RVA or ESS
  • Preservation of current functions
  • Log Structured Arrays
  • for high availability without the small-write RAID
    penalty
  • Snapshots
  • New capabilities
  • Scalable TB databases
  • PB databases for an ESS SAN
  • Parallel / distributed processing
  • High availability, tolerating the unavailability of
    an entire server node

69
Gross Architecture
RVA
70
Scalable Availability SDDS
  • Support unavailability of k ≥ 1 server sites
  • The factor k increases automatically with the
    file
  • Necessary to prevent the reliability decrease
  • Moderate overhead for parity data
  • Storage overhead of O(1/k)
  • Access overhead of k messages per data record
    insert or update
  • Do not impair searches and parallel scans
  • Unlike trivial adaptations of RAID-like schemes
  • Several schemes were proposed around LH*
  • Different properties to best suit various
    applications
  • See http://ceria.dauphine.fr/witold.html

71
SDLSA Main features
  • LH* is used as the global addressing scheme
  • RAM buckets split atomically
  • Disk buckets split in a lazy way
  • A record (logical track) moves only when
  • a client accesses it (update or read)
  • it is garbage collected
  • An atomic split of a TB disk bucket would take
    hours
  • The LH*RS scheme is used for
    high availability
  • Litwin, W., Menon, J. Scalable Distributed Log
    Structured Arrays. CERIA Res. Rep. 12, 1999.
    http://ceria.dauphine.fr/witold.html

72
Conclusion
  • SDDSs should be highly useful for HPC
  • Scalability
  • Fast access performance
  • Parallel scans & function shipping
  • High availability
  • SDDSs are available on network multicomputers
  • SDDS-2000
  • Access performance proves at least an order of
    magnitude faster than for traditional files
  • Should reach two orders of magnitude (100x
    improvement) for a 700 MHz P3
  • Combination of a fast net & distributed RAM

73
Future work
  • Experiments
  • Faster net
  • We do not have any volunteer to help
  • More Wintel computers
  • We are adding two 700 MHz P3s
  • Volunteers with funding for more, or their own
    config.?
  • Experiments on switched multicomputers
  • LH*LH runs on Parsytec (J. Karlsson) and SGs
    (Math. Centre of U. Amsterdam)
  • Volunteers with an SP2?
  • Generally, we welcome every cooperation

74
THE END
  • Thank You for Your Attention

Witold LITWIN & Fethi Bennour
Sponsored by HP Laboratories, IBM Almaden
Research & Microsoft Research
75
(No Transcript)