Building a Database on S3

1
Building a Database on S3
  • Brantner, Florescu, Graf, Kossmann, Kraska
  • SIGMOD '08

2
Introduction
  • The next wave on the Web is to provide services
  • Make it easy for everyone to provide services,
    not just the Googles and Amazons
  • Technical difficulties:
  • 24x7 service
  • Need data centers around the world
  • Must administer servers and any DBs
  • Success can kill: a popular service can outgrow its
    own infrastructure
  • These are the reasons for utility computing (the cloud)

3
  • Goals of utility computing:
  • Storage, CPU, and network bandwidth as a commodity at
    low unit cost
  • Scalability is not a problem
  • Full availability at any time; clients are never blocked
  • Clients can fail at any time
  • Response times are constant (for R/W)
  • Pay by use

4
  • The most prominent utility service is S3
  • Part of Amazon Web Services (AWS)
  • S3, SQS, EC2, SimpleDB
  • S3 is Amazon's Simple Storage Service
  • Infinite scalability, high availability, low cost
  • Currently used for multi-media documents
  • For data that is rarely updated
  • SmugMug is implemented on top of S3
  • S3 is popular as a backup
  • There are products to back up data from MySQL to S3
  • But will S3 work for other kinds of data?

5
  • Disadvantages of S3:
  • Slow compared to a local disk drive
  • Sacrifices consistency: it takes an undetermined amount of
    time to update an object
  • Updates are not applied in the same order as initiated
  • Eventual consistency is the only guarantee

6
  • Can Web-based DB applications be implemented
    on top of utility services (S3)?
  • What if we use S3 for a general-purpose DB?
  • Small objects, frequent updates
  • Present R/W commit protocols
  • Study cost, performance, consistency
  • Goal: preserve the scalability and availability of
    distributed systems, plus some ACID properties
  • Can only maximize the level of consistency
  • Will not try to support full ACID properties

7
  • Show how small objects that are frequently updated can be
    implemented
  • Show how a B-tree can be implemented
  • Protocols for different levels of consistency
  • Performance results with the TPC-W benchmark

8
S3
  • S3 = Simple Storage Service
  • Conceptually, an infinite store for objects from 1B
    to 5GB
  • An object is a byte container identified by a URI
  • Objects can be read/updated with a SOAP or REST-based
    interface (sketched below):
  • Get(uri) returns the object
  • Put(uri, bytestream) writes a new version
  • Get-if-modified-since(uri, TS) gets the new version if the
    object changed since TS
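
A minimal sketch of this interface in Python, using the modern
boto3 client as a stand-in for the SOAP/REST calls described in
the slides (the bucket name is a hypothetical placeholder):

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "example-bucket"  # hypothetical

def get(uri):
    """Get(uri): return the object's bytes."""
    return s3.get_object(Bucket=BUCKET, Key=uri)["Body"].read()

def put(uri, bytestream):
    """Put(uri, bytestream): write a new version of the object."""
    s3.put_object(Bucket=BUCKET, Key=uri, Body=bytestream)

def get_if_modified_since(uri, ts):
    """Get-if-modified-since(uri, TS): new bytes, or None."""
    try:
        resp = s3.get_object(Bucket=BUCKET, Key=uri, IfModifiedSince=ts)
        return resp["Body"].read()
    except ClientError as e:
        if e.response["Error"]["Code"] == "304":  # not modified
            return None
        raise
```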

9
  • In S3, each object is associated with a bucket
  • The user specifies a bucket for a new object and can scan
    through the objects in a bucket
  • Buckets or individual objects can be used as the unit of
    security
  • S3 is not free:
  • $0.15 to store 1GB of data per month
  • A 160GB disk drive costs $70
  • Amortized over 2 years: $0.02 per GB per month (power
    not included)
  • So S3 is in the same ballpark as disk drives
  • Using S3 as a backup is good

10
  • But:
  • There is a cost for R/W access
  • $0.01 per 10K GET requests, per 1K PUT requests
  • Many operate their own servers to cache data
  • Latency is also a problem
  • Reading takes about 100 ms
  • 2 to 3x longer than from a local disk
  • Writing takes 3x as long as reading
  • Throughput is superior
  • Bandwidth is acceptable only if data is read in large
    chunks (about 100KB)
  • Must cluster small objects into pages, as on disk (see the
    sketch below)
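
A toy sketch of that clustering: a page packs many small records
and is stored as one S3 object, so each round trip moves a large
chunk (the record format and the 100KB limit are assumptions):

```python
import json

PAGE_SIZE = 100 * 1024  # target size of one S3 object (~100KB)

class Page:
    """Packs many small records into a single byte blob stored
    under one URI, amortizing S3's per-request latency."""

    def __init__(self, uri):
        self.uri = uri      # S3 key of this page
        self.records = {}   # record key -> record payload (str)

    def size(self):
        return sum(len(k) + len(v) for k, v in self.records.items())

    def add(self, key, record):
        if self.size() + len(key) + len(record) > PAGE_SIZE:
            raise ValueError("page full; allocate a new page")
        self.records[key] = record

    def serialize(self):
        # one blob = one Put(uri, ...) instead of many tiny Puts
        return json.dumps(self.records).encode()
```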

11
  • Implementation details of S3 are not published
  • S3 seems to replicate all data at several data
    centers
  • Replicas can be read/written at any time
  • Updates are propagated asynchronously
  • If a data center fails, another center is used
  • Last update wins
  • Guarantees the full R/W availability that is crucial to Web
    applications

12
SQS
  • SQS = Simple Queue Service
  • Allows users to manage an infinite number of
    queues with infinite capacity
  • Each queue is referenced by a URI and supports
    sending/receiving messages via an HTTP or REST-based
    interface
  • Message size is limited to 8KB for HTTP
  • Supported operations (sketched below):
  • createQueue, send a message to a Q, receive a number of
    messages from the top of a Q, delete a message from a Q,
    grant another user permission to send/receive messages on a Q
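
A minimal sketch of those operations with boto3 (again a modern
stand-in; the queue name is hypothetical):

```python
import boto3

sqs = boto3.client("sqs")

# createQueue
queue_url = sqs.create_queue(QueueName="example-queue")["QueueUrl"]

# send a message to the queue
sqs.send_message(QueueUrl=queue_url, MessageBody="a log record")

# receive up to 10 messages from the top of the queue
msgs = sqs.receive_message(QueueUrl=queue_url,
                           MaxNumberOfMessages=10).get("Messages", [])

# delete each message once it has been processed
for m in msgs:
    sqs.delete_message(QueueUrl=queue_url,
                       ReceiptHandle=m["ReceiptHandle"])
```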

13
  • Cost of SQS: $0.01 to send 1K messages
  • Round-trip times:
  • Each call to SQS returns a result or an ACK
  • Round-trip time is measured as wallclock time
  • Implementation details are not published
  • It seems the messages of a Q are stored in a distributed and
    replicated way
  • Clients can initiate requests at any time and are never
    blocked
  • Best effort at returning messages in FIFO order
  • SQS returns only about every 10th relevant message
  • E.g., a Q has 200 messages; ask for the top 100, get 20

14
EC2
  • EC2 = Elastic Compute Cloud
  • Allows renting machines (CPU + disk) for a specified
    period of time
  • The client gets a virtual machine hosted on an Amazon
    server
  • $0.10 per hour, regardless of how heavily the machine is
    used
  • All requests from EC2 to S3 and SQS are free
  • Using all these services together, the computation moves
    to the data

15
Using S3 as a disk
  • Client-server architecture
  • Similar to distributed shared-disk DB systems
  • A client retrieves pages from S3 based on URIs,
    buffers them locally, updates them, and writes them
    back
  • A record is a bytestream of variable size
    (constrained by the page size)
  • Records can be relational tuples, XML
    elements/documents, or blobs
  • Focus on:
  • Page manager: coordinates R/W, buffers pages
  • Record manager: interface, organizes records on
    pages, free-space management

16
The page manager, record manager, etc. could be executed on EC2,
or the whole client stack could be installed on laptops or mobile
phones to implement a Web 2.0 application (the latter is assumed
here).
17
Record Manager
  • The record manager manages tuples
  • A record is associated with a collection (table)
  • A record is composed of a key and data
  • A record is stored in one page; pages are stored as single
    objects
  • A table is implemented as a bucket
  • A table is identified by a URI
  • Operations: create a new record, read a record by key,
    update by key, delete by key, scan a table by URI (see the
    interface sketch below)
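
A sketch of that record-manager interface as a Python class (the
method names and the page bookkeeping helpers, find_page_with_space,
lookup, and scan, are assumptions, not the paper's actual code):

```python
class RecordManager:
    """Records live in a collection (table) that maps to an S3
    bucket; each record is a (key, data) pair stored in one page."""

    def __init__(self, page_manager, table_uri):
        self.pages = page_manager   # see the page-manager sketch below
        self.table_uri = table_uri  # bucket implementing the table

    def create(self, key, data):
        page = self.pages.find_page_with_space(self.table_uri, len(data))
        page.add(key, data)
        self.pages.mark_updated(page, ("insert", key, data))

    def read(self, key):
        return self.pages.lookup(self.table_uri, key).records[key]

    def update(self, key, data):
        page = self.pages.lookup(self.table_uri, key)
        page.records[key] = data
        self.pages.mark_updated(page, ("update", key, data))

    def delete(self, key):
        page = self.pages.lookup(self.table_uri, key)
        del page.records[key]
        self.pages.mark_updated(page, ("delete", key, None))

    def scan(self):
        for page in self.pages.scan(self.table_uri):
            yield from page.records.items()
```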

18
Page manager
  • Implements a buffer pool for S3 pages
  • Supports reading, updating, marking as updated, and
    creating new pages
  • Implements commit and abort
  • Assume the write set fits into the client's main memory
    or secondary storage
  • Commit must propagate the changes to S3
  • On abort, simply discard the client's buffer pool
  • No pages are evicted from the buffer pool as part of a
    commit; get the up-to-date version if necessary (see the
    sketch below)
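
A minimal page-manager sketch focused on the buffer pool and
commit/abort (log records are simplified to tuples here and
refined in the logging sketch later; the 100 s TTL mirrors the
experimental setup at the end of the deck):

```python
import time

class PageManager:
    """Buffer pool for S3 pages with commit/abort; committed changes
    leave the client as log records, not whole rewritten pages."""

    def __init__(self, fetch_fn, ttl_seconds=100):
        self.fetch = fetch_fn   # e.g. the get(uri) wrapper above
        self.ttl = ttl_seconds
        self.pool = {}          # uri -> (page, fetch_time)
        self.log = []           # (page uri, log record) since last commit

    def read(self, uri):
        page, fetched_at = self.pool.get(uri, (None, 0.0))
        if page is None or time.time() - fetched_at > self.ttl:
            page = self.fetch(uri)                # up-to-date version
            self.pool[uri] = (page, time.time())
        return page

    def mark_updated(self, page, log_record):
        self.log.append((page.uri, log_record))

    def commit(self, send_log_records):
        # propagate log records (to SQS, per the commit protocols
        # below); checkpointing applies them to S3 asynchronously
        send_log_records(self.log)
        self.log = []

    def abort(self):
        # discard all local changes; pages are refetched on demand
        self.log = []
        self.pool.clear()
```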

19
B-tree indexes
  • Adopt existing DB technology where possible
  • Root and intermediate nodes are stored as pages holding
    (key, URI of next-level node) pairs
  • Leaf pages of the primary index hold (key, payload
    data)
  • The data is stored in the leaves of the B-tree (an
    index-organized table, IOT)
  • Leaf pages of a secondary index hold (search key,
    record key)
  • Retrieve the keys of matching records, then go to the primary
    index to retrieve the records with their payload data
  • Nodes at each level are chained
  • The root always stays at the same URI, even when the node
    splits (see the layout sketch below)
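
The node layouts as plain data classes (a sketch; field names and
the root URI are assumptions):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class InnerNode:
    # (separator key, URI of the child page one level down)
    entries: list = field(default_factory=list)
    next_uri: Optional[str] = None    # chaining at the same level

@dataclass
class PrimaryLeaf:
    # primary index: leaves carry (key, payload data), so the
    # table is index-organized (IOT)
    entries: list = field(default_factory=list)
    next_uri: Optional[str] = None

@dataclass
class SecondaryLeaf:
    # secondary index: leaves map search key -> record key; the
    # record itself is then fetched via the primary index
    entries: list = field(default_factory=list)
    next_uri: Optional[str] = None

ROOT_URI = "index/root"   # hypothetical: the root is pinned here
```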

20
Logging
  • Use traditional strategies where possible
  • Insert, delete, and update log records are each
    associated with a data page
  • Redo logging: log records are idempotent, so they can be
    applied more than once with the same result
  • Undo logging: keep the before and after image in
    update log records
  • Keep the last version of the record in delete log records
    (see the sketch below)
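
A sketch of such idempotent log records (the exact record shape is
an assumption; the key property is that applying a record twice
leaves the page in the same state):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LogRecord:
    op: str        # "insert" | "update" | "delete"
    page_uri: str  # the data page this record belongs to
    key: str
    after: Optional[str] = None   # new image (insert/update)
    before: Optional[str] = None  # before image, kept for undo

def apply(page_records, rec):
    """Redo: idempotent, so re-applying after a failure is safe."""
    if rec.op in ("insert", "update"):
        page_records[rec.key] = rec.after    # same result every time
    elif rec.op == "delete":
        page_records.pop(rec.key, None)      # deleting twice is a no-op
```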

21
Security
  • Everybody has access to S3
  • S3 gives clients control of their data
  • The client who owns a collection can give other
    clients R/W privileges on the collection (bucket) or on
    individual pages
  • Views are not supported, but they can be implemented on top
    of S3
  • If the provider is not trusted, the data can be encrypted
  • A curator can be assigned to a collection to approve
    all updates

22
Basic Commit Protocols
  • Updates by one client can be overwritten by
    another, even if the two are updating different tuples
  • This is because the unit of transfer is a page rather than a
    tuple
  • Several small objects must be clustered together
    (which is not the case in typical S3 usage)

23
  • Assume all features of utility computing
  • Protocol (sketched below):
  • The client generates log records for all updates of the
    commit and sends them to SQS
  • The log records are later applied to the pages on S3; this is
    called checkpointing
  • The first step is carried out in constant time
  • The second step can be asynchronous, so users are never
    blocked; if any part fails, resend (log records are idempotent)
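
The two steps under those assumptions, sketched with boto3 and the
LogRecord/apply helpers above (encode/decode are added helpers; the
routing of records to queues is simplified, as PU queues are only
introduced on the next slides):

```python
import json
import boto3

sqs = boto3.client("sqs")

def encode(rec):
    return json.dumps(rec.__dict__)

def decode(body):
    return LogRecord(**json.loads(body))

def commit(log_records, queue_url_for):
    """Step 1: push each log record to the queue of its page.
    Constant time per record; the client is never blocked."""
    for rec in log_records:
        sqs.send_message(QueueUrl=queue_url_for(rec.page_uri),
                         MessageBody=encode(rec))

def checkpoint(queue_url, load_page, store_page):
    """Step 2 (asynchronous): drain log records from a queue and
    apply them to pages on S3; safe to re-run after a failure."""
    msgs = sqs.receive_message(QueueUrl=queue_url,
                               MaxNumberOfMessages=10).get("Messages", [])
    by_page = {}
    for m in msgs:
        rec = decode(m["Body"])
        by_page.setdefault(rec.page_uri, []).append(rec)
    for uri, recs in by_page.items():
        page = load_page(uri)
        for rec in recs:
            apply(page.records, rec)   # idempotent redo
        store_page(page)               # Put the new version to S3
    for m in msgs:                     # delete only after applying
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=m["ReceiptHandle"])
```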

24
  • Preserves the features of utility computing
  • But it is not atomic (only part of the updates of a commit may
    be applied)
  • Not consistent; the only guarantee is that updates will
    eventually be written

25
PU Queues
  • PU = Pending Update queues
  • Clients propagate their log records to PU Qs
  • Each B-tree has one PU Q
  • One PU Q is also associated with each leaf node of a
    primary B-tree of a table

26
Checkpoint Protocol for Data Pages
  • The input of a checkpoint is a PU Q
  • Make sure no other client is carrying out a checkpoint on it
    concurrently
  • Associate a Lock Q with each PU Q
  • A client may checkpoint the object only if it receives the
    token from the Lock Q (see the sketch below)
  • Set a timeout; the checkpoint must be completed by then
  • This protocol updates data pages, but updates really go
    through the B-tree, so see the next slide
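
One way to realize the Lock Q with SQS, reusing the sqs client from
the earlier sketches: the queue holds a single token message, and
the receiver's visibility timeout doubles as the checkpoint
deadline (this mapping to VisibilityTimeout is an assumption, not
stated in the slides):

```python
CHECKPOINT_TIMEOUT = 60  # seconds; checkpoint must finish by then

def try_lock(lock_queue_url):
    """The Lock Q holds exactly one token message. Receiving it
    hides it from other clients for the timeout, which acts as the
    lock lease; if we crash, the token reappears automatically."""
    msgs = sqs.receive_message(
        QueueUrl=lock_queue_url,
        MaxNumberOfMessages=1,
        VisibilityTimeout=CHECKPOINT_TIMEOUT,
    ).get("Messages", [])
    return msgs[0] if msgs else None   # None: someone else holds it
```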

27
Checkpoint Protocol for B-trees
  • More complicated than checkpointing a data page
    because several tree pages are involved:
  • Obtain the token from the Lock Q
  • Receive the log records from the PU Q
  • Sort the log records by key
  • Find the leaf node for the first log record
  • Apply all log records that belong to that leaf node
  • Put the new version to S3
  • Delete the applied log records from the PU Q
  • Continue with the next leaf if there is still time (see the
    sketch below)
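
Those steps as a sketch, reusing try_lock, decode, and apply from
above (find_leaf, load_page, and store_page are hypothetical
helpers):

```python
import time

def checkpoint_btree(lock_q, pu_q, find_leaf, load_page, store_page):
    if try_lock(lock_q) is None:
        return                      # another client is checkpointing
    deadline = time.time() + CHECKPOINT_TIMEOUT

    msgs = sqs.receive_message(QueueUrl=pu_q,
                               MaxNumberOfMessages=10).get("Messages", [])
    # sort log records by key, keeping each receipt handle
    pairs = sorted(((decode(m["Body"]), m["ReceiptHandle"]) for m in msgs),
                   key=lambda p: p[0].key)

    i = 0
    while i < len(pairs) and time.time() < deadline:
        leaf_uri = find_leaf(pairs[i][0].key)  # leaf of the first record
        leaf = load_page(leaf_uri)
        # apply every consecutive record that falls into this leaf
        while i < len(pairs) and find_leaf(pairs[i][0].key) == leaf_uri:
            apply(leaf.records, pairs[i][0])
            i += 1
        store_page(leaf)                       # put new version to S3
    for _, handle in pairs[:i]:                # delete applied records
        sqs.delete_message(QueueUrl=pu_q, ReceiptHandle=handle)
```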

28
Checkpoint Strategies
  • A checkpoint on a page can be carried out by a
    reader, a writer, a watchdog (additional
    infrastructure), or the owner (who may be offline)
  • Assume the writer initiates the checkpoint
  • Each data page carries the TS of its last checkpoint
  • The TS is taken from the machine that does the checkpoint
  • The client computes the difference between its wallclock time
    and the TS
  • If it is bigger than the checkpoint interval (10-15 s), the
    writer carries out the checkpoint (see the sketch below)
  • A page that is updated only once would never be checkpointed,
    so checkpoints are also forced at random
  • Queries can have phantoms
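
The writer's checkpoint decision, sketched (the interval comes from
the slides; the forcing probability is an assumed value):

```python
import random
import time

CHECKPOINT_INTERVAL = 15   # seconds (10-15 s per the slides)
FORCE_PROBABILITY = 0.1    # assumed; handles rarely-updated pages

def should_checkpoint(last_checkpoint_ts):
    """Called by a writer when it updates a page."""
    if time.time() - last_checkpoint_ts > CHECKPOINT_INTERVAL:
        return True
    # a page updated only once would otherwise never be
    # checkpointed, so force a checkpoint at random
    return random.random() < FORCE_PROBABILITY
```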

29
Transactional Properties
  • Durability:
  • achieved with SQS
  • Atomicity:
  • Use an additional Atomic queue associated with each
    client
  • Commit log records to the Atomic Q rather than the PU Qs
  • Every log record carries the id of the client's commit
  • The client sends a special commit record to the Atomic Q,
  • then sends all log records to the PU Qs,
  • and finally deletes the commit record from the Atomic Q
    (sketched below)
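
That commit sequence under these assumptions (message framing with
a commit id is an added convention on top of the earlier
encode/decode helpers):

```python
import uuid

def atomic_commit(log_records, atomic_q, queue_url_for):
    commit_id = str(uuid.uuid4())

    # 1. All log records, tagged with the commit id, go to the
    #    client's Atomic Q first, followed by the special commit
    #    record that marks the commit as complete.
    for rec in log_records:
        sqs.send_message(QueueUrl=atomic_q,
                         MessageBody=commit_id + "|" + encode(rec))
    sqs.send_message(QueueUrl=atomic_q, MessageBody="COMMIT|" + commit_id)

    # 2. Propagate the log records to their PU queues.
    for rec in log_records:
        sqs.send_message(QueueUrl=queue_url_for(rec.page_uri),
                         MessageBody=encode(rec))

    # 3. Only now delete this commit's records from the Atomic Q
    #    (SQS deletes need receipt handles, so receive and match).
    msgs = sqs.receive_message(QueueUrl=atomic_q,
                               MaxNumberOfMessages=10).get("Messages", [])
    for m in msgs:
        if commit_id in m["Body"]:
            sqs.delete_message(QueueUrl=atomic_q,
                               ReceiptHandle=m["ReceiptHandle"])
```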

30
  • Why this logging works:
  • If a client fails, it restarts and inspects its Atomic Q
  • Log records whose id matches no commit record are deleted
    (their commit never completed)
  • Those with a matching id are propagated to the PU Qs and then
    deleted
  • The commit record is deleted from the Atomic Q only after all
    of its log records have been propagated to the PU Qs
  • A log record propagated twice is no problem, because log
    records are idempotent (see the recovery sketch below)
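
A recovery sketch along those lines (message framing as in the
atomic-commit sketch; a real implementation would loop until the
queue is fully drained):

```python
def recover(atomic_q, queue_url_for):
    """On restart: log records with a commit record are winners and
    are re-propagated; records without one are discarded."""
    msgs = sqs.receive_message(QueueUrl=atomic_q,
                               MaxNumberOfMessages=10).get("Messages", [])
    committed = {m["Body"].split("|", 1)[1]
                 for m in msgs if m["Body"].startswith("COMMIT|")}
    for m in msgs:
        if m["Body"].startswith("COMMIT|"):
            continue                      # commit records deleted last
        cid, body = m["Body"].split("|", 1)
        if cid in committed:
            rec = decode(body)
            # propagating twice is harmless: records are idempotent
            sqs.send_message(QueueUrl=queue_url_for(rec.page_uri),
                             MessageBody=body)
        sqs.delete_message(QueueUrl=atomic_q,
                           ReceiptHandle=m["ReceiptHandle"])
    # only now, with all log records propagated, delete the
    # commit records themselves
    for m in msgs:
        if m["Body"].startswith("COMMIT|"):
            sqs.delete_message(QueueUrl=atomic_q,
                               ReceiptHandle=m["ReceiptHandle"])
```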

31
Consistency Levels
  • Consistency levels suitable for the Web:
  • Not strict consistency (every read sees the most recent write)
  • Monotonic reads: if a client reads the value of x, any
    successive read by that client returns that or a more
    recent value
  • Keep track of the highest commit TS of each page cached by
    the client (see the sketch below)
  • Monotonic writes: a W to x completes before any
    successive write to x by the same client
  • Keep a counter for each page, increment it on each update, and
    record the counter value in the log records to order them
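
A sketch of the monotonic-reads check (field names are assumed;
retrying until a fresh-enough version arrives is one possible
realization):

```python
class MonotonicReadCache:
    """Remembers the highest commit timestamp seen per page, so a
    read never goes back to a staler replica."""

    def __init__(self):
        self.highest_ts = {}   # page uri -> highest commit TS seen

    def read(self, uri, fetch_fn):
        page = fetch_fn(uri)             # may hit a stale replica
        seen = self.highest_ts.get(uri, 0)
        while page.commit_ts < seen:     # older than already seen
            page = fetch_fn(uri)         # retry for a newer version
        self.highest_ts[uri] = page.commit_ts
        return page
```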

32
  • Read your writes: the effect of a W on x is always
    seen by a successive R on x by the same client
  • Holds automatically if monotonic reads are used
  • Write follows read: a W on x following an R on x by the
    same client takes place on the same or a more recent
    value of x than was read
  • Holds here because Ws are not applied directly to the data
    pages

33
Isolation
  • Not implemented, but if it were:
  • Multiversion optimistic concurrency control could
    implement isolation on S3
  • Multiversion: retrieve the version of an object as of the
    moment the transaction started
  • At commit, compare the W set to the W sets of transactions
    that committed earlier yet started later; if the
    intersection is empty, the transaction can commit (validation
    sketched below)
  • Apply a 2PL protocol on the PU Qs in the commit phase
  • A global counter is needed; it can be implemented on top
    of S3, but it becomes a bottleneck
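
The validation step of that scheme, sketched as a textbook
first-committer-wins check (not the paper's code):

```python
def can_commit(my_write_set, my_start_ts, committed_txns):
    """Abort if our write set overlaps the write set of any
    transaction that committed after we started."""
    for other in committed_txns:
        if (other.commit_ts > my_start_ts
                and my_write_set & other.write_set):  # page URI sets
            return False                              # conflict: abort
    return True
```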

34
  • Not implemented, but if it were:
  • BOCC (backward-oriented optimistic concurrency control)
  • Also involves a global counter for transaction begin/end
  • Requires that only 1 commit happen at a time; a big problem

35
Experiments and Results
  • Increasing levels of consistency are compared:
  • Basic: eventual consistency
  • Monotonicity: monotonic Rs and Ws, R your Ws, and W follows R
  • Atomicity: in addition to the above 2
  • Baseline (naive approach): write all dirty pages
    to S3
  • The baseline is not even eventually consistent; updates can
    be lost
  • R, W, create, index probe, and abort are the same for all;
    they differ in commit and checkpoints

36
  • Mac with a 2.15 GHz Intel processor
  • Page size for data: 100KB
  • B-tree node size: 57KB
  • TTL of the client's cache: 100 s
  • Cache size: 5 MB
  • Checkpoint interval: 15 s
  • 1 GB of network traffic: $0.18

37
TPC-W Benchmark
  • Models an online bookstore with queries asking for the
    availability of products and placing orders
  • Each transaction retrieves a customer record, searches for 6
    products, and places orders for 3 products (at random)

38
Running Time
  • Average and maximum execution times per transaction, in
    seconds, are high
  • The results are believed acceptable in an interactive
    environment
  • Transactions simulate 12 clicks of about 1 s each, except for
    the commit
  • Higher consistency levels commit faster: they only propagate
    log records to SQS, and atomicity batches the log records
39
Cost ($)
  • Overall cost per 1,000 transactions: run several
    thousand transactions and divide the total cost
  • Cost increases with the highest consistency level
  • Interaction with SQS is expensive: checkpoints to the Atomic Q
    (the checkpoint interval can be changed)
  • 0.3 cents per transaction is acceptable for some applications
  • The more orders, the more it costs
40
Varying Checkpoint Interval
  • Increasing the interval decreases cost
  • An interval of less than 10 seconds means a checkpoint for
    every update
  • An infinite interval costs $7 per 1,000 transactions
41
Related Work
  • Utility computing's biggest success so far is grid
    computing
  • Its specific purpose is to analyze large scientific data sets
  • Even S3 has a specific purpose: multi-media and
    backups
  • The goal here is to broaden the scope of utility computing to
    general-purpose Web-based applications
  • S3 is a distributed (P2P-like) system, yet without the
    technical drawbacks of a centralized component (?)
  • The paper proposes to overlay data management on top of S3
  • The P2P community proposes to create network overlays
    on top of the Internet

42
Conclusions
  • Utility computing is not yet attractive for
    high-performance transaction processing
  • This paper is a first step in that direction
  • It abandons strict consistency and DB-style
    transactions (controversial)
  • Some applications may need ACID properties more than
    scalability and availability
  • Future work: new algorithms for joins and query optimization
  • Need ways to scan through several pages (chained
    I/O)
  • Need the right security infrastructure