SDDS-2000: A Prototype System for Scalable Distributed Data Structures on Windows 2000


1
SDDS-2000: A Prototype System for Scalable Distributed Data Structures on Windows 2000
  • Witold Litwin
  • Witold.Litwin@dauphine.fr

2
Plan
  • What are SDDSs ?
  • Where are we in 2002 ?
  • LH* : Scalable Distributed Hash Partitioning
  • RP* : Scalable Distributed Range Partitioning
  • High-Availability: LH*RS & RP*RS
  • DBMS coupling: AMOS-SDDS & SD-AMOS
  • Architecture of SDDS-2000
  • Experimental performance results
  • Conclusion
  • Future work

3
What is an SDDS
  • A new type of data structure
  • Specifically for multicomputers
  • Designed for data intensive files
  • horizontal scalability to very large sizes
  • larger than any single-site file
  • parallel and distributed processing
  • especially in (distributed) RAM
  • Record access time better than for any disk file
  • 100-300 µs usually under Win 2000 (100 Mb/s net, 700 MHz CPU, 100 B - 1 KB records)
  • access by multiple autonomous clients

4
Killer apps
  • Any traditional application using a large hash or
    B-tree or k-d file
  • Access time to a RAM SDDS record is about 100 times faster than to a disk record
  • Network storage servers (SAN and NAS)
  • DBMSs
  • WEB servers
  • Video servers
  • Real-time systems
  • High Perf. Comp.

5
Multicomputers
  • A collection of loosely coupled computers
  • common and/or preexisting hardware
  • share nothing architecture
  • message passing through a high-speed net (≥ 100 Mb/s)
  • Network multicomputers
  • use general purpose nets
  • LANs: Ethernet, Token Ring, Fast Ethernet, SCI, FDDI...
  • WANs: ATM...
  • Switched multicomputers
  • use a bus, or a switch
  • e.g., IBM-SP2, Parsytec

6
Typical Network Multicomputer
[Figure: clients and servers connected by network segments]
7
Why multicomputers ?
  • Potentially unbeatable price-performance ratio
  • Much cheaper and more powerful than
    supercomputers
  • 1500 WSs at HPL with 500 GB of RAM & TBs of disks
  • Potential computing power
  • file size
  • access and processing time
  • throughput
  • For more pros & cons:
  • Bill Gates at Microsoft Scalability Day
  • NOW project (UC Berkeley)
  • Tanenbaum "Distributed Operating Systems",
    Prentice Hall, 1995
  • www.microsoft.com White Papers from the Business Syst. Div.

8
Why SDDSs
  • Multicomputers need data structures and file
    systems
  • Trivial extensions of traditional structures are
    not best
  • hot-spots
  • scalability
  • parallel queries
  • distributed and autonomous clients
  • distributed RAM distance to data

9
Distance to data (Jim Gray)
[Figure: local disk: 10 ms; distant RAM (Ethernet): 100 µs; distant RAM (gigabit net): 1 µs; RAM: 100 ns]
10
Distance to Data (Jim Gray)
[Figure: the same ladder in everyday terms; RAM: 1 µs ≈ 1 min; distant RAM (gigabit net): 10 µs ≈ 10 min; distant RAM (Ethernet): 100 µs ≈ 2 h; local disk: 10 ms ≈ 8 d (the Moon)]
11
Scalability Dimensions (Client view)
[Chart: scale-up; operation time (distant RAM) and operations/s vs. data size, servers and clients; linear (ideal) vs. sub-linear (usual)]
12
Scalability Dimensions (Client view, SDDS specific)
[Chart: scale-up; operation time vs. data size; a single computer or a cluster with a fixed number of servers degrades sub-linearly as the data move from local RAM to disk cache, disk and tapes/juke-box, while a multicomputer SDDS stays linear in distributed RAM]
13
Scalability Dimensions (Client view)
[Chart: speed-up; operations/s vs. number of servers; linear (ideal) vs. sub-linear (usual)]
14
What is an SDDS ?
  • Queries come from multiple autonomous clients
  • Data are on servers
  • Data are structured
  • records with keys / objects with OIDs
  • more semantics than in the Unix flat-file model
  • the abstraction most popular with applications
  • parallel scans & function shipping
  • Overflowing servers split into new servers

15
An SDDS
growth through splits under inserts
[Figure, slides 15-18: animation; clients insert into the file, an overflowing server splits and moves part of its records to a new server, and the file keeps growing over more servers]
19
SDDS Addressing Principles
  • SDDS Clients
  • Are not informed about the splits.
  • Do not access any centralized directory for
    record address computations
  • Each has a more or less adequate private image of the actual file structure
  • Can make addressing errors
  • Sending queries or records to incorrect servers
  • Searching for a record that was moved elsewhere
    by splits
  • Sending a record that should be elsewhere for
    the same reason

20
What is an SDDS ?
SDDS Addressing Principles
  • Servers are able to forward the queries to the
    correct address
  • perhaps in several messages
  • Servers may send Image Adjustment Messages
  • Clients do not make same error twice
  • Servers support parallel scans
  • Sent out by multicast or unicast
  • With deterministic or probabilistic termination
  • See the SDDS talks & papers for more
  • ceria.dauphine.fr/witold.html
  • Or the LH* ACM-TODS paper (Dec. 96)

21
An SDDS: Client Access
[Figure, slides 21-25: animation; a client sends a query using its own file image, an incorrectly addressed server forwards it to the correct one, and the client receives an IAM (Image Adjustment Message)]
26
Known SDDSs
[Figure: taxonomy of data structures; classics vs. SDDSs (1993); hash-based (LH*, DDH, Breitbart & al.), 1-d tree and m-d trees; disk-oriented (SDLSA); security and s-availability variants (LH*m, LH*g, LH*s, LH*SA, LH*RS)]
http://ceria.dauphine.fr/SDDS-bibliographie.html
27
LH* (A classic)
  • Scalable distributed hash partitioning
  • Transparent for the applications
  • Unlike the current static schemes (i.e. DB2)
  • Generalizes the LH addressing schema
  • used in many products
  • SQL-Server, IIS, MsExchange, Frontpage, Netscape
    suites, Berkeley DB Library, LH-Server, Unify...
  • Typical load factor: 70 - 90 %
  • In practice, at most 2 forwarding messages
  • regardless of the size of the file
  • In general, 1 message/insert and 2
    messages/search on the average
  • 4 messages in the worst case
  • Several variants are known
  • LH*LH is the most studied

28
Overview of LH
  • Extensible hash algorithm
  • Widely used, e.g.,
  • Netscape browser (100M copies)
  • LH-Server by AR (700K copies sold)
  • MS Frontpage, Exchange, IIS
  • taught in most DB and DS classes
  • the address space expands
  • to avoid overflows & access performance deterioration
  • the file has buckets with capacity b >> 1
  • Hash by division: hi : c → c mod 2^i N gives the address hi(c) of key c
  • Buckets split through the replacement of hi with hi+1, i = 0, 1, ...
  • On average, b/2 keys move to the new bucket

29
Overview of LH
  • Basically, a split occurs when some bucket m overflows
  • One then splits the bucket pointed to by the split pointer n (see the sketch below)
  • usually m ≠ n
  • n evolves: 0; 0,1; 0,1,...,2; 0,...,3; ...; 0,...,7; ...; 0,...,2^i N; 0,...
  • One consequence => no index
  • a characteristic of other EH schemes

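A minimal Python sketch (not the SDDS-2000 code; the names and the list-of-lists layout are assumptions) of the two mechanisms just described: the hash functions hi and the split of the bucket designated by the pointer n.

```python
N = 1   # initial number of buckets
b = 4   # bucket capacity (b >> 1 in practice)

def h(i, c):
    """h_i(c) = c mod (2^i * N): the LH family of hash functions."""
    return c % (2 ** i * N)

def split(buckets, n, i):
    """Split bucket n: rehash its keys with h_{i+1}; about b/2 of them move
    to the new bucket appended at address n + 2^i * N."""
    old = buckets[n]
    buckets[n] = [c for c in old if h(i + 1, c) == n]
    buckets.append([c for c in old if h(i + 1, c) != n])
    n += 1
    if n == 2 ** i * N:          # the pointer wrapped around: the file level grows
        n, i = 0, i + 1
    return n, i
```

The file-evolution slides that follow can be replayed with this sketch, starting from a single bucket holding the keys 35, 12, 7, 15, 24.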
30
LH File Evolution
N = 1, b = 4, i = 0 ; h0 : c → c mod 2^0
Bucket 0 (h0): 35 12 7 15 24 ; n = 0
31
LH File Evolution
N = 1, b = 4, i = 0 ; h1 : c → c mod 2^1 (bucket 0 overflows and splits)
Bucket 0 (h1): 35 12 7 15 24 ; n = 0
32
LH File Evolution
N = 1, b = 4, i = 1 ; h1 : c → c mod 2^1
Bucket 0 (h1): 12 24
Bucket 1 (h1): 35 7 15 ; n = 0
33
LH File Evolution
N = 1, b = 4, i = 1 ; h1 : c → c mod 2^1 (inserts: 21, 11, 32, 58)
Bucket 0 (h1): 32 58 12 24
Bucket 1 (h1): 21 11 35 7 15 ; n = 0
34
LH File Evolution
N = 1, b = 4, i = 1 ; h2 : c → c mod 2^2 (bucket 0 splits)
Bucket 0 (h2): 32 12 24
Bucket 1 (h1): 21 11 35 7 15
Bucket 2 (h2): 58 ; n = 1
35
LH File Evolution
N = 1, b = 4, i = 1 ; h2 : c → c mod 2^2 (insert 33)
Bucket 0 (h2): 32 12 24
Bucket 1 (h1): 33 21 11 35 7 15
Bucket 2 (h2): 58 ; n = 1
36
LH File Evolution
N = 1, b = 4, i = 1 ; h2 : c → c mod 2^2 (bucket 1 splits)
Bucket 0 (h2): 32 12 24
Bucket 1 (h2): 33 21
Bucket 2 (h2): 58
Bucket 3 (h2): 11 35 7 15
37
LH File Evolution
N = 1, b = 4, i = 2 ; h2 : c → c mod 2^2 (the pointer wraps: n = 0, i = 2)
Bucket 0 (h2): 32 12 24
Bucket 1 (h2): 33 21
Bucket 2 (h2): 58
Bucket 3 (h2): 11 35 7 15
38
LH File Evolution
  • Etc
  • One starts h3 then h4 ...
  • The file can expand as much as needed
  • without too many overflows ever

39
Addressing Algorithm
  • a ← hi(c)
  • if n = 0 then exit
  • else
  •   if a < n then a ← hi+1(c)
  • end (a runnable sketch follows below)

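The same algorithm as a small runnable Python function (a sketch; the parameter names and the default N = 1 are assumptions):

```python
def lh_client_address(c, i, n, N=1):
    """Client-side LH addressing: a <- h_i(c); if a lies before the split
    pointer n, that bucket has already split, so use h_{i+1} instead."""
    a = c % (2 ** i * N)
    if n != 0 and a < n:
        a = c % (2 ** (i + 1) * N)
    return a
```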
40
LH
  • Property of LH:
  • Given j = i or j = i + 1, key c is in bucket m iff hj(c) = m ; j = i or j = i + 1
  • Verify it yourself
  • Ideas for LH*:
  • the LH addressing rule is the global rule for the LH* file
  • every bucket is at a server
  • each bucket keeps its level j in the header
  • Check the LH property when the key comes from a client

41
LH* file structure
[Figure: buckets 0, 1, 2, ..., 7, 8, 9 on the servers, with levels j = 4, 4, 3, ..., 3, 4, 4; the coordinator holds n = 2, i = 3; one client image is n' = 0, i' = 0, the other n' = 3, i' = 2]
42
LH* file structure
[Figure: the same file structure (animation frame)]
43
LH* split
[Figure: the coordinator initiates the split of bucket n = 2 (same file state as above)]
44
LH* split
[Figure: the split of bucket 2 in progress (animation frame)]
45
LH* split
[Figure: after the split of bucket 2, a new bucket 10 exists, buckets 2 and 10 have level j = 4, and the coordinator pointer advances to n = 3 (i = 3); the client images are unchanged]
46
LH* Addressing Schema
  • Client
  • computes the LH address m of c using its image
  • sends c to bucket m
  • Server
  • Server a getting key c (a = m in particular) computes:
  • a' ← hj(c)
  • if a' = a then accept c
  • else a'' ← hj-1(c)
  •   if a'' > a and a'' < a' then a' ← a''
  • send c to bucket a'

47
LH* Addressing Schema
  • Client
  • computes the LH address m of c using its image
  • sends c to bucket m
  • Server
  • Server a getting key c (a = m in particular) computes:
  • a' ← hj(c)
  • if a' = a then accept c
  • else a'' ← hj-1(c)
  •   if a'' > a and a'' < a' then a' ← a''
  • send c to bucket a'
  • See LNS93 for the (long) proof (a code sketch follows below)

Simple ?
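A minimal Python rendering of the server-side rule above (a sketch; the function and parameter names are assumptions). It returns None when the key is accepted locally, or the address to forward to.

```python
def lh_server_check(c, a, j, N=1):
    """Bucket a, of level j, received key c: apply the LH* addressing schema."""
    h = lambda level, key: key % (2 ** level * N)
    a1 = h(j, c)
    if a1 == a:
        return None          # the key belongs to this bucket
    a2 = h(j - 1, c)
    if a < a2 < a1:
        a1 = a2              # hop through the intermediate candidate bucket
    return a1                # forward the key to bucket a1 (at most 2 hops in all)
```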
48
Client Image Adjustment
  • The IAM consists of the address a where the client sent c and of j(a)
  • i' is the presumed i in the client's image
  • n' is the presumed value of pointer n in the client's image
  • initially, i' = n' = 0
  • if j > i' then i' ← j - 1 ; n' ← a + 1
  • if n' ≥ 2^i' then n' ← 0 ; i' ← i' + 1
  • The algorithm guarantees that the client image is within the file [LNS93] (a code sketch follows below)
  • if there are no file contractions (merges)

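The adjustment rule as a Python sketch (the dictionary-based image is an assumption):

```python
def adjust_image(image, a, j):
    """Apply an IAM: a is the bucket the client addressed, j is that bucket's level.
    image holds the client's presumed i' and n' (both 0 initially)."""
    if j > image['i']:
        image['i'] = j - 1
        image['n'] = a + 1
    if image['n'] >= 2 ** image['i']:
        image['n'] = 0
        image['i'] += 1
    return image

# As on the next slides: a client with image (i' = 0, n' = 0) that addressed
# bucket a = 0 of level j = 4 ends up with i' = 3, n' = 1.
adjust_image({'i': 0, 'n': 0}, a=0, j=4)   # -> {'i': 3, 'n': 1}
```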
49
LH* addressing
[Figure: the file now has buckets 0, 1, 2, ..., 7, 8, 9, 10 with levels j = 4, 4, 4, 3, 4, 4, 4; coordinator: n = 3, i = 3; a client with image (n' = 0, i' = 0) issues key 15]
50
LH* addressing
[Figure: using its image (n' = 0, i' = 0), the client sends key 15 to bucket 0]
51
LH* addressing
[Figure: the key is forwarded to its correct bucket; the IAM adjusts the client image to n' = 1, i' = 3]
52
LH* addressing
[Figure: a client with image (n' = 0, i' = 0) now issues key 9]
53
LH* addressing
[Figure: the client sends key 9 to the bucket its image designates (animation frame)]
54
LH* addressing
[Figure: key 9 is forwarded among the servers toward its correct bucket]
55
LH* addressing
[Figure: key 9 reaches its correct bucket; the IAM adjusts the client image to n' = 1, i' = 3]
56
Result
  • The distributed file can grow even to the whole Internet, so that
  • every insert and search is done in at most four messages (IAM included)
  • in general, an insert is done in one message and a search in two messages
  • proof in LNS 93

57
10,000 inserts
[Chart: global cost vs. the client's cost]
58
(No Transcript)
59
(No Transcript)
60
Inserts by two clients
61
Parallel Queries
  • A query Q for all buckets of file F, with independent local executions
  • every bucket should get Q exactly once
  • The basis for function shipping
  • fundamental for high-perf. DBMS applications
  • Send Mode
  • multicast
  • not always possible or convenient
  • unicast
  • the client may not know all the servers
  • the servers have to forward the query
  • how ?

[Figure: the client's image vs. the actual file]
62
LH* Algorithm for Parallel Queries (unicast)
  • The client sends Q to every bucket a in its image
  • The message with Q carries the message level j'
  • initially j' = i' if n' ≤ a < 2^i', else j' = i' + 1
  • Bucket a (of level j) copies Q to all its children using the algorithm:
  • while j' < j do
  •   j' ← j' + 1
  •   forward (Q, j') to bucket a + 2^(j'-1)
  • endwhile
  • Prove it ! (a code sketch follows below)

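The unicast scan as a Python sketch (function names and the send/forward callbacks are assumptions): the client stamps every copy of Q with the level j' its image presumes, and each bucket completes the delivery to the children that the sender's image ignores.

```python
def client_scan(Q, i_img, n_img, send):
    """Send Q once to every bucket of the client's image, with message level j'."""
    for a in range(2 ** i_img + n_img):
        j_prime = i_img if n_img <= a < 2 ** i_img else i_img + 1
        send(a, Q, j_prime)

def bucket_scan(Q, a, j, j_prime, forward):
    """Bucket a of level j: forward Q to every child unknown to the sender."""
    while j_prime < j:
        j_prime += 1
        forward(a + 2 ** (j_prime - 1), Q, j_prime)
```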
63
Termination of Parallel Query (multicast or
unicast)
  • How does client C know that the last reply came ?
  • Deterministic Solution (expensive)
  • Every bucket sends its j, m and the selected records, if any
  • m is its (logical) address
  • The client terminates when it holds every m fulfilling the condition
  • m = 0, 1, ..., 2^i + n - 1, where
  • i = min (j) and n = min (m) over the replies with j = i
  • (a code sketch of this test follows below)

[Diagram: buckets 0 .. n-1 have level i+1, buckets n .. 2^i - 1 have level i, buckets 2^i .. 2^i + n - 1 have level i+1]
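A sketch of the deterministic termination test (Python; the replies dictionary is an assumption): every reply carries the replying bucket's address m and level j.

```python
def scan_is_complete(replies):
    """replies: {m: j} for the replies received so far.  The scan is complete once
    a reply is in hand from every bucket m = 0 .. 2^i + n - 1, where i = min(j)
    and n is the smallest m whose reply reported level i."""
    if not replies:
        return False
    i = min(replies.values())
    n = min(m for m, j in replies.items() if j == i)
    return all(m in replies for m in range(2 ** i + n))
```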
64
Termination of Parallel Query (multicast or
unicast)
  • Probabilistic Termination (may need less messaging)
  • all and only the buckets with selected records reply
  • after each reply, C reinitializes a time-out T
  • C terminates when T expires
  • The practical choice of T is network and query dependent
  • e.g., 5 times the Ethernet average retry time
  • 1-2 msec ?
  • experiments needed
  • Which termination is finally more useful in practice ?
  • an open problem

65
LH* variants
  • With/without load (factor) control
  • With/without the (split) coordinator
  • the former was discussed above
  • the latter is a token-passing schema
  • the bucket holding the token is the next to split
  • if an insert occurs and file overload is guessed
  • several algorithms exist for the decision
  • it uses cascading splits
  • See the talk on LH* at http://ceria.dauphine.fr/

66
RP* schemes
  • Produce scalable distributed 1-d ordered files
  • for range search
  • Each bucket (server) has the unique range of keys it may contain
  • Ranges partition the key space
  • Ranges evolve dynamically through splits
  • transparently for the application (a client-side routing sketch follows below)
  • Use RAM m-ary trees at each server
  • Like B-trees
  • Optimized for the RP* split efficiency

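A minimal sketch of client-side routing over a range-partitioned image (Python; the image layout and server names are assumptions, not the SDDS-2000 RP* code): the image maps the lower bound of each range to a server, and an outdated guess is corrected by server forwarding and an IAM, as in LH*.

```python
import bisect

# Hypothetical client image: (lower bound of the range, server); the ranges partition the key space.
image = [(float('-inf'), 'server0'), (1000, 'server1'), (5000, 'server2')]

def route(key):
    """Pick the server whose range should contain the key, according to the image."""
    lows = [low for low, _ in image]
    return image[bisect.bisect_right(lows, key) - 1][1]

# route(42) -> 'server0'; route(3000) -> 'server1'; route(7000) -> 'server2'
```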
67
Current PDBMS technology (Ex. Non-Stop SQL)
  • Static Range Partitioning
  • Done manually by DBA
  • Requires good skills
  • Not scalable

68
RP schemes
69
High-availability SDDS schemes
  • Data remain available despite
  • any single-server failure & most two-server failures
  • or any failure of up to n servers
  • and some catastrophic failures
  • n scales with the file size
  • to offset the reliability decline which would otherwise occur
  • Three principles for high-availability SDDS schemes are currently known:
  • mirroring (LH*m)
  • striping (LH*s)
  • grouping (LH*g, LH*SA, LH*RS, RP*RS)
  • They realize different performance trade-offs

70
LH*RS Record Groups
  • LH*RS records
  • = LH* data records + parity records
  • Records with the same rank r in the bucket group form a record group
  • Each record group gets n parity records
  • computed using Reed-Solomon erasure correcting codes
  • additions and multiplications in Galois Fields
  • see the Sigmod 2000 paper on the Web site for details
  • r is the common key of these records
  • Each group supports the unavailability of up to n of its members (a simplified sketch follows below)

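An illustrative sketch only: LH*RS computes its n parity records per group with Reed-Solomon coding over a Galois field; below, a single XOR parity (the n = 1 case) stands in for the RS calculus, and records are assumed padded to equal length.

```python
from functools import reduce

def xor_bytes(blocks):
    """Bitwise XOR of equal-length byte strings."""
    return reduce(lambda x, y: bytes(p ^ q for p, q in zip(x, y)), blocks)

def parity_record(group):
    """One parity record for the data records of the same rank r across the bucket group."""
    return xor_bytes(group)

def recover_missing(survivors, parity):
    """With a single unavailable bucket, its record is the XOR of the survivors and the parity."""
    return xor_bytes(survivors + [parity])
```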
71
LH*RS Record Groups
[Figure: a bucket group with its data records and the derived parity records]
72
LHRS Parity Management
  • An insert of a data record with rank r creates or, usually, updates the parity records r
  • An update of a data record with rank r updates the parity records r
  • A split recreates the parity records
  • Data records usually change their rank after the split

73
LH*RS Scalable availability
  • Create 1 parity bucket per group until M = 2^i1 buckets
  • Then, at each split,
  • add a 2nd parity bucket to each existing group
  • create 2 parity buckets for the new groups until 2^i2 buckets
  • etc. (a sketch of this rule follows below)

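A sketch of this scalable-availability rule in Python (the threshold values are illustrative assumptions): the availability level k, i.e. the number of parity buckets per group, steps up each time the file size M passes the next threshold.

```python
THRESHOLDS = [2 ** 4, 2 ** 8, 2 ** 12]    # hypothetical 2**i1 < 2**i2 < 2**i3

def availability_level(M):
    """k = 1 while M <= 2**i1 buckets, k = 2 while M <= 2**i2, and so on."""
    k = 1
    for t in THRESHOLDS:
        if M > t:
            k += 1
    return k
```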
74
LH*RS Scalable availability
[Figures, slides 74-78: animation of the rule above; parity buckets are added to the groups as the file doubles in size]
79
SDDS-2000 global architecture
80
SDDS-2000 global architecture
[Figure: applications on the client machines, the SDDS client and server components, etc., communicating over UDP & TCP]
81
SDDS-2000 Client Architecture (RPc)
  • 2 Modules
  • Send Module
  • Receive Module
  • Multithread Architecture
  • SendRequest
  • ReceiveRequest
  • AnalyzeResponse1..4
  • GetRequest
  • ReturnResponse
  • Synchronization Queues
  • Client Images
  • Flow control

82
SDDS-2000 Server Architecture (RPc)
  • Multithread architecture
  • Synchronization queues
  • Listen Thread for incoming requests
  • SendAck Thread for flow control
  • Work Threads for
  • request processing
  • response sendout
  • request forwarding
  • UDP for shorter messages (< 64 K)
  • TCP/IP for longer data exchanges (a transport-selection sketch follows below)
  • Several buckets of different SDDS files

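A sketch of that transport choice with Python sockets (host, port and the exact cut-off handling are illustrative assumptions, not the SDDS-2000 code):

```python
import socket

UDP_LIMIT = 64 * 1024     # UDP for shorter messages, TCP/IP beyond that

def send_message(host, port, payload: bytes):
    if len(payload) < UDP_LIMIT:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.sendto(payload, (host, port))
    else:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.connect((host, port))
            s.sendall(payload)
```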
83
AMOS-SDDS Architecture
  • For database queries
  • Especially parallel scans
  • Couples SDDS-2000 and Amos II
  • RAM OR-DBMS.
  • AMOSQL declarative query language
  • can be embedded into C and Lisp.
  • call-level interface (callin)
  • external procedures (functions) (callout)
  • See the AMOS-II talks & papers for more
  • http://www.dis.uu.se/udbl/

84
AMOS-SDDS Architecture
  • SDDS is used as the distributed RAM storage
    manager.
  • The RP* scheme provides the scalable distributed range partitioning
  • Like in a B-tree, records are lexicographically ordered according to their keys
  • Range queries are supported efficiently
  • Amos II provides a fast SDDS-based RAM OR-DBMS
  • The callout capability realizes the AMOSQL
    object-relational capability usually called
    external or foreign functions.

85
AMOS-SDDS Architecture
AMOS-SDDS Architecture
86
AMOS-SDDS Architecture
AMOS-SDDS scalable distributed query processing
87
AMOS-SDDS Server Query Processing
  • E-strategy
  • Data stay external to AMOS
  • within the SDDS bucket
  • Custom foreign functions perform the query
  • I-strategy
  • Data are imported on the fly into AMOS-II
  • perhaps with local index creation
  • Good for joins
  • AMOS performs the query
  • Which strategy is preferable ?
  • Good question

88
SD-AMOS
  • Server storage manager is a full scale DBMS
  • AMOS-II in our case since it is a RAM DBMS
  • Could be any DBMS
  • SDDS-2000 provides the scalable distributed
    partitioning schema
  • Server DBMS performs the splits
  • When and how ?
  • The client manages scalable query decomposition & execution
  • Easier said than done
  • The whole system generalizes the PDBMS technology
  • which today offers static partitioning only

89
Scalability Analysis
  • Theoretical
  • To validate an SDDS
  • See the papers
  • To get an idea of system performance
  • Limited validity
  • Experimental
  • More accurate validation of design issues
  • Practical necessity
  • Costs orders of magnitude more time and money

90
Experimental Configuration
  • 6 machines: 700 MHz P3
  • 100 Mb/s Ethernet
  • 150 byte records

91
LH* file creation
[Charts: creation time (ms), with and without splits, vs. number of buckets and vs. number of inserts; bucket size b = 5,000, flow control on]
LH* scalability is confirmed
Performance bound by the client processing speed
Ph.D. Thesis of F. Bennour, 2000
92
LH* Key search
[Chart: search time (ms) vs. file size in records and servers (2, 3, 4, 5 servers); actual client image vs. new client image + IAMs]
Performance bound by the client processing speed
93
LH*RS Experimental Performance (Preliminary results)
[Chart: insert time (ms, moving average) vs. number of records during file creation]
94
LH*RS Experimental Performance (Preliminary results)
[Chart: file creation time (s) vs. number of records]
95
LH*RS Experimental Performance (Preliminary results)
  • Normal key search
  • unaffected by the parity calculus
  • 0.3 ms per key search
  • Degraded key search
  • about 2 ms for the application
  • 1.1 ms (k = 4) for the record recovery
  • 1 ms for the client time-out and the coordinator action
  • Bucket recovery at the spare
  • 0.3 ms per record (k = 4)

96
LH*RS Experimental Performance (Preliminary results)
[Chart: insert time (ms, moving average) vs. number of records during file creation]
97
LH*RS Experimental Performance (Preliminary results)
[Chart: file creation time (s) vs. number of records]
98
LH*RS Experimental Performance (Preliminary results)
  • Normal key search
  • unaffected by the parity calculus
  • 0.3 ms per key search
  • Degraded key search
  • about 2 ms for the application
  • 1.1 ms (k = 4) for the record recovery
  • 1 ms for the client time-out and the coordinator action
  • Bucket recovery at the spare
  • 0.3 ms per record (k = 4)

99
Performance Analysis (RP*)
  • Experimental Environment
  • six Pentium III 700 MHz machines
  • Windows 2000
  • 128 MB RAM, extended later to 256 MB RAM
  • 100 Mb/s Ethernet
  • Messages
  • 180 bytes: 80 for the header, 100 for the record
  • keys are random integers within some interval
  • flow control: a sliding window of 10 messages
  • Index
  • capacity of an internal node: 80 index elements
  • capacity of a leaf: 100 records

100
Performance Analysis
  • File Creation
  • Bucket capacity: 50,000 records
  • 150,000 random inserts by a single client
  • With flow control (FC) or without

[Charts: file creation time and average insert time]
101
Discussion
  • Creation time is almost linearly scalable
  • Flow control is quite expensive
  • Losses without it were negligible
  • Both schemes perform almost equally well
  • RP*C slightly better
  • As one could expect
  • Insert time about 30 times faster than for a disk file
  • Insert time appears bound by the client speed

102
Performance Analysis: File Creation
  • File created by 120,000 random inserts by 2 clients
  • Without flow control

[Charts: comparative file creation time by one or two clients; file creation by two clients, total time and time per insert]
103
Discussion
  • Performance improves
  • Insert times appear bound by a server speed
  • More clients would not improve performance of a
    server

104
Performance Analysis: Split Time
[Chart: split times versus bucket capacity]
105
Discussion
  • About linear scalability as a function of the bucket size
  • Larger buckets are more efficient
  • Splitting is very efficient
  • reaching as little as 40 µs per record

106
Performance Analysis: Key Search
  • A single client sends 100,000 successful random search requests
  • Flow control: the client sends at most 10 requests without a reply

[Chart: search time (ms)]
107
Performance Analysis: Key Search
  • A single client sends 100,000 successful random search requests
  • Flow control: the client sends at most 10 requests without a reply

[Charts: total search time and search time per record]
108
Discussion
  • Single search time about 30 times faster than for
    a disk file
  • 350 µs per search
  • Search throughput more than 65 times faster than that of a disk file
  • 145 µs per search
  • RP*N again appears surprisingly efficient with respect to RP*C for more buckets

109
Performance Analysis
  • Range Query
  • Deterministic termination
  • Parallel scan of the entire file, with all 100,000 records sent to the client

[Charts: range query total time and range query time per record]
110
Discussion
  • Range search appears also very efficient
  • Reaching 10 µs per record delivered
  • More servers should further improve the
    efficiency
  • Curves do not become flat yet

111
Range Query Parallel Execution Strategies
Study of MM. Tsangou (Master Th.) & Prof. Samba (U. Dakar)
[Chart: response time (ms) vs. number of servers; Sc. 3: one server at a time; Sc. 1, 2: all servers together; Sc. 1: a single connection request per server]
112
File Size Limits
  • Bucket capacity 751K records, 196 MB
  • Number of inserts 3M
  • Flow control (FC) is necessary to limit the input
    queue at each server

113
File Size Limits
  • Bucket capacity 751K records, 196 MB
  • Number of inserts 3M
  • GA = Global Average ; MA = Moving Average

114
Related Works
[Table: comparative analysis of related works (one entry marked 'Suspicious')]
115
AMOS-SDDS
  • Benchmark data
  • Table Pers (SS, Name, City)
  • size: 20,000 to 300,000 tuples
  • 50 cities
  • random distribution
  • Benchmark queries
  • Join: SS and Name of the persons in the same city
  • nested loop or local index
  • Count Join: count the couples in the same city
  • to determine the result transfer time to the client
  • Count (*) Pers, Max (SS) from Pers
  • Measures
  • scale-up & speed-up
  • comparison to AMOS-II alone

116
Join: best time per tuple
[Chart: best join times per tuple of 14.4, 2.4 and 1.6; 20,000 tuples in Pers; 3,990,070 tuples produced; AMOS-II alone: 13.5 (nested loop), 2.25 (index lookup)]
117
Join & Count: best time per tuple, I-strategy
[Chart: best times per tuple of 1.8 and 1.0; 20,000 tuples in Pers; 3,990,070 tuples produced; AMOS-II alone: 13.5 (nested loop), 2.25 (index lookup)]
118
Join Speed-up, I-strategy
[Chart: join speed-up; 20,000 tuples in Pers; 3,990,070 tuples produced; AMOS-II alone: 13.5 (nested loop), 2.25 (index lookup)]
119
Count Speed-up
[Chart: count speed-up; the E-strategy wins (341); 100,000 tuples in Pers; AMOS-II alone: 280 ms]
120
Join Scale-up Performance
  • The file scales to 300,000 tuples
  • spreading from 1 to 15 AMOS-SDDS servers
  • transparently for the application !
  • 3 servers per machine
  • The poor man's configuration has only 5 server machines
  • Results are extrapolated to 1 server per machine
  • Basically, the CPU component of the elapsed time is divided by 3

121
Join Elapsed Time Scale-up
AMOS-SDDS I-Strategy with Index Lookup Join
122
Join Elapsed Time Scale-up
123
Join Time per Tuple Scale-up
[Chart: join time per tuple as the file scales; better scalability than any current P-DBMS could provide; the join with Count stays flat !]
124
SD-AMOS File Creation
[Chart: insert time (ms) vs. number of servers & inserts; global average and moving average; b = 4,000]
Flat, unexpectedly fast insert time
125
SD-AMOS Large File Creation
[Chart: insert time (ms) vs. number of servers & inserts; global average and moving average; b = 40,000]
The flat, fast insert time remains
126
SD-AMOS Large File Search (time per record)
[Chart: file of 300K records]
The client's maximum processing speed is reached
127
SD-AMOS Very Large File Creation
[Chart: bucket size 750K records; max file size 3M records; record size 100 B]
128
Conclusion
  • SDDS-2000: a prototype SDDS manager for a Windows multicomputer
  • Several variants of LH* and RP*
  • High availability
  • Scalable distributed database query processing
  • AMOS-SDDS & SD-AMOS

129
Conclusion
  • The experimental performance of the SDDS schemes appears in line with the expectations
  • Record search & insert times in the range of a fraction of a millisecond
  • About 30 to 100 times faster than disk-file access performance
  • About ideal (linear) scalability
  • including the query processing
  • The results prove the overall efficiency of the SDDS-2000 system

130
Current & Future Work
  • SDDS-2000 Implementation
  • High-Availability through RS-Codes
  • CERIA & U. Santa Clara (Prof. Th. Schwarz)
  • Disk-based High-Availability SDDSs
  • CERIA & IBM-Almaden (J. Menon)
  • Parallel Queries
  • U. Dakar (Prof. S. Ndiaye) & CERIA
  • Concurrency & Transactions
  • U. Dakar (Prof. T. Seck)
  • Overall Performance Analysis
  • SD-AMOS & SD-DBMS in general
  • CERIA, U. Uppsala (Prof. T. Risch) & U. Dakar
  • SD-SQL-Server ?
  • The extent depends basically on available funding

131
Credits
  • SDDS-2000 Implementation
  • CERIA Ph.D. Students
  • F. Bennour (now Post-Doc) (SDDS-2000 & LH*)
  • A. Wan Diene, Y. Ndiaye (SDDS-2000 & RP*, AMOS-SDDS, SD-AMOS)
  • R. Moussa (RS-Parity Subsystem)
  • Master Student Theses
  • at CERIA and cooperating Universities
  • See the CERIA Web page: ceria.dauphine.fr
  • Partial support for SDDS-2000 research:
  • HPL, IBM Research, Microsoft Research

132
Problems & Exercises
  • Install SDDS-2000 and experiment with the interactive application. Comment on the experience in a few pages.
  • Create your favorite application using the .h interfaces provided with the package.
  • Comment in a few pages on the LH*RS goal and way of working, on the basis of the Sigmod paper and the WDAS-2002 paper by Rim Moussa.
  • Comment in a few pages on the strategies for scalable distributed hash joins according to the paper by D. Schneider & al. Your own ?
  • Can you propose how to deal with theta-joins ?
  • You should split a table under one SQL Server into two tables on two SQL Servers. You wish to generate RP*-like range partitioning on the key attribute(s). You wish to use standard SQL queries as much as possible.
  • How can you find the record with the median (middle) key ?
  • What are the catalog tables where you can find the table size, the associated check constraints, indexes, triggers and stored procedures to move to the new node ?
  • How will you move the records that should leave to the new node ?
  • Idem for a stored function under Amos
  • Idem for Oracle
  • Idem for DB2

133
END
  • Thank you for your attention

Witold Litwin  litwin@dauphine.fr  wlitwin@cs.berkeley.edu
134
(No Transcript)