Title: Putting the Scalability into Database Scalability Services
1Scalable Query Result Caching for Web Applications
Bruce Maggs, Carnegie Mellon University and Akamai Technologies
Joint work with Charlie Garrod and Amit Manjhi
and Natassa Ailamaki, Phil Gibbons, Todd Mowry,
Chris Olston, and Anthony Tomasic.
2Load at a Web site varies
3Load can be unpredictable
4The provisioning dilemma
- Heavily overprovision systems?
- Waste resources
- Risk loss of availability?
- Lose revenue, reputation
5Static content vs. dynamic content
Static content:
- Changes infrequently
- e.g., fixed web pages, images, movies
Dynamic content:
- Often tailored for each request
- Web server executes code and accesses the DB
6Content Delivery Networks scale static content
Internet core
Users
CDN nodes
Content providers
7CDN Application Services
CDNs can also run applications
Internet
DB
Users
- ...but for data-intensive dynamic applications, the database server becomes the bottleneck!
8Our goal: a Database Scalability Service (DBSS)
DBSS
DB
Users
9Database Scalability Service
users
Content Delivery Network
DBSS
Internet
home server databases
10Database Scalability Service
users
Web and application servers
DBSS
home server databases
11Database Scalability Service
client apps
DBSS
Internet
home server databases
12The challenges for a DBSS
- Provide economical, on-demand scalability
- New requests must reflect database updates
- Provide data privacy for content providers
- Should not increase end-user latency
13One scalability approach: database query result caching
- Simple map from queries to query results
- Advantages
- Easy, well-understood
- Compatible with our privacy goals
- Problems
- Scalability limited by cache miss rate
- Hard to keep caches up-to-date
14Our solution: the Ferdinand DBSS
part of shared cache
local cache
- 2-tier caching: local and cooperative
- All updates sent to home database server
15Our solution: the Ferdinand DBSS
part of shared cache
local cache
- Pub / sub for consistency management
- Consistency is slightly relaxed
16Outline
- Need for on-demand scalability
- Invalidation mechanism
- Security-scalability tradeoff
- Reducing latency
17Addressing consistency
- TTL is wasteful
- Often refreshes cached data unnecessarily (workloads are dominated by reads)
- Must set TTL = 0 for strong consistency!
- Solution: update or invalidate cached data only when affected by updates
- Naïve approach: home organizations notify proxy servers of relevant updates → not scalable
Our approach: a fully distributed, proxy-to-proxy update notification mechanism
18Publish / subscribe for cache consistency
management
- On caching a query Q
- Subscribe to messages for updates that affect Q
- On an update U
- Publish U to notify all affected query caches
The challenge: relate the (current) query to (future) possible updates
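The subscribe-on-cache, publish-on-update pattern above can be sketched in a few lines. This is a toy in-memory illustration, not Ferdinand's implementation: the Broker and QueryCache classes, the channels_for_query callback, and the channel names are all assumptions made for the example (the slides state that Ferdinand itself uses Scribe for publish / subscribe).

```python
from collections import defaultdict


class Broker:
    """Toy in-memory publish/subscribe broker (hypothetical stand-in)."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # channel -> list of callbacks

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in list(self.subscribers[channel]):
            callback(message)


class QueryCache:
    """Caches query results and drops them when a relevant update is published."""

    def __init__(self, broker, channels_for_query):
        self.broker = broker
        # Assumed helper: maps a query to the channels carrying updates that affect it.
        self.channels_for_query = channels_for_query
        self.results = {}
        self.by_channel = defaultdict(set)  # channel -> cached queries
        self.subscribed = set()

    def put(self, query, result):
        # On caching a query Q: subscribe to messages for updates that affect Q.
        self.results[query] = result
        for channel in self.channels_for_query(query):
            self.by_channel[channel].add(query)
            if channel not in self.subscribed:
                self.subscribed.add(channel)
                self.broker.subscribe(
                    channel, lambda _msg, ch=channel: self._invalidate(ch)
                )

    def get(self, query):
        return self.results.get(query)

    def _invalidate(self, channel):
        # On an update U: discard every cached result registered on this channel.
        for query in self.by_channel.pop(channel, set()):
            self.results.pop(query, None)
```

A cache node would call put() after fetching a result, and whichever node processes an update would call broker.publish() on the channels that update maps to. How queries and updates are mapped to channels is the hard part, which the following slides address.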
19Our solution: analyze the Web app
- Determine which updates affect which queries
- If a query and an update are independent, this saves consistency traffic
- Query and update templates
- E.g., SELECT name FROM emp WHERE salary > ?
- UPDATE emp SET dept = ? WHERE id = ?
(? = value set at run-time)
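One simple way to decide whether a query template and an update template are independent is a conservative column-overlap test. The sketch below is an assumption made for illustration (the slides do not describe the actual analysis): it handles single-table templates only, and the QueryTemplate, UpdateTemplate, and may_conflict names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryTemplate:
    table: str
    columns_read: frozenset  # selected columns plus columns used in WHERE


@dataclass(frozen=True)
class UpdateTemplate:
    table: str
    columns_written: frozenset  # columns assigned by an UPDATE
    whole_row: bool = False     # True for INSERT / DELETE


def may_conflict(query, update):
    """Conservative test: False only when the update cannot affect the query."""
    if query.table != update.table:
        return False
    if update.whole_row:
        return True
    return bool(query.columns_read & update.columns_written)


# The two templates from the slide: the update writes only `dept`,
# which the query never reads, so they are independent.
q = QueryTemplate("emp", frozenset({"name", "salary"}))
u = UpdateTemplate("emp", frozenset({"dept"}))
print(may_conflict(q, u))  # False
```

Independent pairs like this one never need a subscription or a notification, which is where the saving in consistency traffic comes from.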
20Distributed Consistency Mechanism
users
proxy node
- Distributed app-level multicast environment, e.g., Scribe
- Forward all updates to backend home servers
21Configuring Multicast Channels
- Key observation: Web applications typically interact with the DB via a small, fixed set of query/update templates (usually 10-100)
- Example
- SELECT qty FROM inv WHERE id = ?
- UPDATE inv SET qty = ? WHERE id = ?
Templates: a natural way to configure channels
Options: channel-by-query or channel-by-update
22Channel-by-Query Option
- One channel per query template Q: C(Q)
- Few subscriptions per cached result
- Many invalidation notifications per update
Conflicts determined lazily (upon update)
23Channel-by-Update Option
- One channel per update template U: C(U)
- Many subscriptions per cached result
- Few invalidation notifications per update
Conflicts determined eagerly (when caching Q)
24Parameter-Specific Channels
- Optimization: consider parameter bindings supplied at runtime, for example:
- Q5: SELECT qty FROM inv WHERE id = ?
- When issued with id = 29, create an extra parameter-specific channel C(5, 29)
- Subscribe to both C(5) and C(5, 29)
- Upon update
- If the update affects a single item with id X, send the notification on channel C(5, X)
- Saves work if X ≠ 29
- Updates affecting multiple items are sent to C(5)
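The channel selection rule above can be written down directly. In this sketch channels are modeled as plain tuples, so C(5) is (5,) and C(5, 29) is (5, 29). The function names are hypothetical and chosen only for the example.

```python
def subscription_channels(template_id, param):
    """Channels a cache joins when it caches template `template_id` bound to `param`."""
    return {(template_id,), (template_id, param)}


def notification_channel(template_id, affected_id=None):
    """Channel an update publishes on; affected_id=None means it touches multiple items."""
    if affected_id is None:
        return (template_id,)
    return (template_id, affected_id)


# A cache holding Q5 with id = 29 subscribes to C(5) and C(5, 29).
subscribed = subscription_channels(5, 29)

print(notification_channel(5, 7) in subscribed)   # False: update to item 7 is never delivered
print(notification_channel(5, 29) in subscribed)  # True: update to item 29 is delivered
print(notification_channel(5) in subscribed)      # True: multi-item updates go to C(5)
```

The saving comes from the first case: single-item updates to other ids are published on channels this cache never joined.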
25Recall the DB bottleneck
Argh!
Content providers home DB server
26The cache can be a bottleneck too
1 poor, lonely cache
Content providers home DB server
27Scalable query caching is hard
- Reduces chance of query reuse
- Sends extra queries to home server
2 caches
Content providers home DB server
28A solution: Ferdinand's 2-tier cache
- If any node stores the current result for a query, the cooperative cache stores it too
- Each query is sent to the home server at most once between updates
- Possible drawbacks
- Complicates consistency management
- Checking the cooperative cache might introduce latency
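The 2-tier lookup path can be sketched as follows: check the local cache, then the query's master node in the cooperative cache, and only then the home server. This is an illustrative sketch under stated assumptions: the class names are hypothetical, and a simple hash-modulo placement stands in for the DHT that Ferdinand uses to locate a query's master node.

```python
import hashlib


class HomeServer:
    """Stand-in for the content provider's home database (hypothetical)."""

    def __init__(self, data):
        self.data = data
        self.queries_served = 0

    def execute(self, query):
        self.queries_served += 1
        return self.data[query]


class Node:
    def __init__(self, name):
        self.name = name
        self.local = {}   # tier 1: this node's local cache
        self.shared = {}  # tier 2: this node's part of the cooperative cache


class Cluster:
    def __init__(self, names, home):
        self.nodes = [Node(name) for name in names]
        self.home = home

    def master_for(self, query):
        # Assumption: hash-modulo placement instead of a real DHT lookup.
        digest = hashlib.sha256(query.encode()).digest()
        return self.nodes[int.from_bytes(digest[:8], "big") % len(self.nodes)]

    def lookup(self, node, query):
        if query in node.local:            # local hit
            return node.local[query]
        master = self.master_for(query)
        if query not in master.shared:     # cooperative miss: ask the home server
            master.shared[query] = self.home.execute(query)
        node.local[query] = master.shared[query]
        return node.local[query]
```

With this structure, every node that misses locally is served by the master node, so the home server sees a given query once until an update invalidates it. Invalidation is omitted here for brevity.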
29Queries with Ferdinand
App server
query Q
Ferdinand
Publish / subscribe
Cooperative caching via a DHT
DBSS nodes
part of shared cache
local cache
Content providers home DB server
30Queries with Ferdinand
App server
Ferdinand
Publish / subscribe
Cooperative caching via a DHT
part of shared cache
local cache
Content providers home DB server
31Queries with Ferdinand
App server
Ferdinand
Publish / subscribe
Q
Cooperative caching via a DHT
part of shared cache
Qs master node
local cache
Content providers home DB server
32Queries with Ferdinand
App server
Ferdinand
Publish / subscribe
Cooperative caching via a DHT
part of shared cache
local cache
Content providers home DB server
33Queries with Ferdinand
App server
Ferdinand
Publish / subscribe
Cooperative caching via a DHT
part of shared cache
local cache
Q
Content providers home DB server
34Queries with Ferdinand
App server
Ferdinand
Publish / subscribe
Q
Cooperative caching via a DHT
Q
part of shared cache
local cache
Content providers home DB server
Response
35Queries with Ferdinand
App server
Q
Ferdinand
Publish / subscribe
Q
Cooperative caching via a DHT
Q
part of shared cache
local cache
Content providers home DB server
36Queries with Ferdinand
App server
Ferdinand
Publish / subscribe
Q
Cooperative caching via a DHT
Q
part of shared cache
local cache
Content providers home DB server
37Queries with Ferdinand
App server
Ferdinand
Publish / subscribe
Q
Q
Cooperative caching via a DHT
Q
part of shared cache
local cache
Content providers home DB server
38Queries with Ferdinand
App server
Ferdinand
Publish / subscribe
Q
Q
Cooperative caching via a DHT
Q
part of shared cache
Response
local cache
Content providers home DB server
39Updates with Ferdinand
App server
Ferdinand
update U
Publish / subscribe
Q
Q
Cooperative caching via a DHT
Q
part of shared cache
U
local cache
Response
Content providers home DB server
40Consistency for a 2-tier cache
App server
Notify the cooperative cache first
Ferdinand
Publish / subscribe
Q
Q
U
Cooperative caching via a DHT
Q
part of shared cache
local cache
Content providers home DB server
41Consistency for a 2-tier cache
App server
and then notify the local caches.
Ferdinand
Publish / subscribe
Q
Q
U
Cooperative caching via a DHT
part of shared cache
local cache
Content providers home DB server
42Evaluating Ferdinand's 2-tier cache
- 3 competing scalability approaches
- Benchmarks and metrics
- Our evaluation goal
- Impact on cache hit rates and scalability
- Performance in higher latency environments
43Competitor 1: 1-tier cache
- No cooperative cache
- Uses pub / sub like Ferdinand
44Competitor 2: no cache
- CDN-like proxy servers scale the Web and app
servers
45Competitor 3: no proxies
- A good baseline to compare against
Argh!
Argh!
Banana?
46Web application benchmarks
- Simulate users as they browse online Web sites
- TPC-W bookstore
- Browsing mix (5% purchases)
- Shopping mix (20% purchases)
- RUBiS auction
- RUBBoS bulletin board
47Our scalability metric: WIPS
- Web Interactions Per Second
- 90% of responses must meet a latency threshold
- End-user latency for the whole Web request
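As a rough illustration of how such a metric might be computed from one benchmark run, the sketch below reports throughput only when at least 90% of responses meet the latency threshold. The function name, its parameters, and the choice to return None for a failed run are assumptions for the example. The slides do not specify the threshold values or the measurement procedure.

```python
def wips(latencies_s, duration_s, threshold_s, required_percent=90):
    """Web interactions per second for one run, or None if the latency rule is violated.

    latencies_s: end-user latency of each completed Web interaction, in seconds.
    duration_s:  length of the measurement interval, in seconds.
    """
    if not latencies_s:
        return 0.0
    within = sum(1 for latency in latencies_s if latency <= threshold_s)
    # Integer comparison avoids floating-point edge cases at exactly 90%.
    if within * 100 < required_percent * len(latencies_s):
        return None
    return len(latencies_s) / duration_s
```

Scalability would then be the highest WIPS value reached over runs with increasing load for which the function does not return None.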
48Implementation details
- Ferdinand: 100% Java
- Interface is a JDBC driver
- MySQL for home DB
- Apache Tomcat as Web / app server
- Scribe publish / subscribe
- Pastry DHT for cooperative cache
49Experiments run on Emulab
- DBSS nodes also run Web and app server
benchmark node
DBSS node
home database node
50Miss rates for 1- and 2-tier caches
misses sent to home database server
51Ferdinand excels on a LAN
52Ferdinand at higher latencies
53Ferdinand OK for medium latencies
bookstore browsing mix
54Scalable consistency matters
55Outline
- Need for on-demand scalability
- Invalidation mechanism
- Security-scalability tradeoff
- Reducing latency
56Guaranteeing security in a DBSS setting
- Limit the ability to observe an application's data by
- The DBSS administrator
- An unauthorized application going through the DBSS
- Security-scalability tradeoff in the DBSS setting
Analyzing the code helps in managing this tradeoff
57A simple solution for guaranteeing security
- Outsource database scalability
- Home server: holds master copies of all data, handles updates directly
- No query execution on the DBSS
- DBSS caches query results (read-only), kept consistent by invalidation
All data passing through the DBSS can be encrypted: queries, updates, query results
58A Simple Example
Schema: toys (toy_id, toy_name)
Q1: SELECT toy_id FROM toys WHERE toy_name = 'GI Joe'
U1: DELETE FROM toys WHERE toy_id = 5
The DBSS cache holds the result of Q1: toy_id = 15
- Nothing is encrypted: the DBSS can see that U1 does not affect Q1, so there are no invalidations
- Results are encrypted: the DBSS cannot inspect the result of Q1, so U1 must invalidate it
More encryption leads to more invalidations
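The example can be restated as a small decision function. This is a sketch under assumptions, not the DBSS logic itself: the function name is hypothetical, and an encrypted result is modeled as None, meaning the DBSS cannot look inside it.

```python
def must_invalidate(cached_toy_ids, deleted_toy_id):
    """Decide whether DELETE ... WHERE toy_id = deleted_toy_id invalidates cached Q1.

    cached_toy_ids: set of toy_ids in the cached result, or None if the
                    result is encrypted and therefore opaque to the DBSS.
    """
    if cached_toy_ids is None:
        # Conservative invalidation: the DBSS cannot rule out a conflict.
        return True
    # Precise invalidation: invalidate only if the deleted row is in the result.
    return deleted_toy_id in cached_toy_ids


print(must_invalidate({15}, 5))  # False: plaintext result, U1 does not touch toy 15
print(must_invalidate(None, 5))  # True: encrypted result forces an invalidation
```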
59Challenge: providing scalability while guaranteeing security
When updates occur, the DBSS needs to invalidate
The application faces a dilemma in what data to encrypt (secure)
- More encryption → conservative invalidation → favors security
- Less encryption → precise invalidation → favors scalability
Security-scalability tradeoff
60Opportunity for managing the tradeoff
Not all data is equally sensitive
Data sensitivity:
- Extremely sensitive (e.g., credit card information): secure at all costs
- Moderately sensitive (e.g., inventory records, customer records): care, but worried about scalability impact
- Completely insensitive (e.g., bestsellers list): don't care
- But for most data, it is nontrivial to assess
- Data sensitivity
- Scalability impact of securing the data
61Key insight: arbitrary queries and updates are not possible
function get_toy_id (toy_name)
  template = "SELECT toy_id FROM toys WHERE toy_name = ?"
  query = attach_to_template (template, toy_name)
  execute (query)
62Data not useful for invalidation: examples
Example 1
Q1: SELECT toy_id FROM toys WHERE toy_name = ?
Q2: SELECT toy_name FROM toys WHERE toy_id = ?
No data is needed for precise invalidation
Example 2
Q1: SELECT toy_id FROM toys WHERE toy_name = ?
U1: DELETE FROM toys WHERE toy_id = ?
Query parameters are not needed for precise invalidation (the query result is needed, though)
63Security without hurting scalability
Data not needed for invalidation
Can secure for free (without hurting
scalability)
Security Conscious Scalability Approach [SIGMOD 06]
As a result, the tradeoff has to be managed only over the remaining data
64Sample experiment methodology
- Scalability: max concurrent users with acceptable response times
- Security: number of query templates with encrypted results
- California privacy law determined the sensitive data
- Non-transactional invalidation
- Start with a cold cache
65Benchmark Applications
- Bookstore (TPC-W, from UW-Madison)
- Online bookseller, a standard web benchmark
- Changed the popularity of books
- Auction (RUBiS, from Rice)
- Modeled after Ebay
- Bulletin board (RUBBoS, from Rice)
- Modeled after Slashdot
Benchmarks model popular websites
66Security-Scalability Tradeoff
U1: DELETE FROM toys WHERE toy_id = 5
Security
Scalability
X denotes encrypted, visible
67Magnitude of Security-Scalability tradeoff
[Chart: scalability (number of concurrent users supported) for each benchmark application]
68Security Results
Query data that can be encrypted for free
7
7
7
6
4
17
and result
14
18
12
Bboard
Bookstore
Auction
69Security Results in Detail
- Auction: the historical record of user bids was not exposed
- Bboard: the ratings users give one another based on the quality of their postings
- Bookstore: book purchase association rules discovered by the vendor (customers who purchase book A also purchase book B)
70Scalability Conscious Security Approach (SCSA)
to managing the tradeoff
[Plot: scalability (number of concurrent users supported, axis 0 to 900) versus security (number of query templates with encrypted results, axis 0 to 30), with the extremes labeled "Nothing encrypted" and "Everything encrypted"]
1. It is easy to get either good scalability or good security
2. SCSA presents a shortcut to manage the tradeoff
71Outline
- Need for on-demand scalability
- Invalidation mechanism
- Security-scalability tradeoff
- Reducing latency
72Contributors to User Latency
Request, high latency
Database
Web server
App server
Response, high latency
Traditional architecture
high latency
Database
DBSS
CDN
DBSS architecture
A single HTTP request → multiple database requests
73Sample Web Application Code
function find_comments (user_id)
  template = "SELECT from_id, body FROM comments WHERE to_id = ?"
  query = attach_to_template (template, user_id)
  result = execute (query)
  foreach (row in result)
    print (get_body (row), get_name (get_id (row)))
- (N+1) queries are issued because
- It is convenient for programmers to abstract database values
- This has no effect in the traditional setting
Found many examples in the benchmark applications
74Reducing User Latency in a DBSS Setting
- Transformations to reduce the number of round-trips
- Group execution of queries: MERGING transformation
- Overlap execution of queries: NONBLOCKING transformation
Web Application Code
Transformed Code
Procedural program with embedded SQL
Holistic transformations using source-to-source compilers
75The MERGING Transformation
www.ebay.com
John
Names of users who have posted comments about
John
Content Delivery Network
1 Query
- Find user_ids who have made comments
- For each user_id, find name of the user
Database Scalability Service
N Queries
High latency
76The MERGING Transformation
Find names of users who have commented about John
Names of users who have posted comments about
John
SELECT from_id, u.name
FROM comments, users u
WHERE from_id = u.id AND to_id = ?
- Find user_ids who have made comments
- For each user_id, find name of the user
Assuming a constant cache hit rate, the number of round-trips to the database decreases by a factor of (N+1)
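The effect of MERGING can be seen by counting round-trips for the unmerged and merged versions of the comment lookup. The sketch below uses an in-memory SQLite database purely for illustration. The table contents, the function names, and the round-trip counter are assumptions for the example, and the code is written by hand rather than produced by the authors' source-to-source transformation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE comments (from_id INTEGER, to_id INTEGER, body TEXT);
INSERT INTO users VALUES (1, 'John'), (2, 'Ann'), (3, 'Bob');
INSERT INTO comments VALUES (2, 1, 'great seller'), (3, 1, 'fast shipping');
""")


def names_unmerged(conn, user_id):
    """1 query for the commenter ids, then 1 query per commenter: N+1 round-trips."""
    round_trips = 1
    rows = conn.execute(
        "SELECT from_id FROM comments WHERE to_id = ? ORDER BY from_id",
        (user_id,),
    ).fetchall()
    names = []
    for (from_id,) in rows:
        round_trips += 1
        row = conn.execute(
            "SELECT name FROM users WHERE id = ?", (from_id,)
        ).fetchone()
        names.append(row[0])
    return names, round_trips


def names_merged(conn, user_id):
    """A single join replaces the N+1 queries: 1 round-trip."""
    rows = conn.execute(
        "SELECT u.name FROM comments c JOIN users u ON c.from_id = u.id "
        "WHERE c.to_id = ? ORDER BY c.from_id",
        (user_id,),
    ).fetchall()
    return [name for (name,) in rows], 1


print(names_unmerged(conn, 1))  # (['Ann', 'Bob'], 3)
print(names_merged(conn, 1))    # (['Ann', 'Bob'], 1)
```

Both versions return the same names, but with N = 2 commenters the unmerged version makes 3 round-trips where the merged version makes 1.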
77The NONBLOCKING Transformation
www.amazon.com
John
Home page
Content Delivery Network
- Greet user
- Get names of related books
Database Scalability Service
High latency
Issue queries concurrently to reduce latency
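Issuing independent queries concurrently can be sketched with a thread pool. This is an illustration only: fake_query is a hypothetical stand-in that sleeps to simulate a high-latency round-trip to the database, and the greeting and book list are made-up values.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_query(result, latency_s=0.05):
    """Hypothetical stand-in for a database query with network latency."""
    time.sleep(latency_s)
    return result


def home_page_blocking():
    # Each query waits for the previous one: total latency is about 2 round-trips.
    greeting = fake_query("Hello, John")
    books = fake_query(["Book A", "Book B"])
    return greeting, books


def home_page_nonblocking():
    # Both queries are in flight at once: total latency is about 1 round-trip.
    with ThreadPoolExecutor(max_workers=2) as pool:
        greeting = pool.submit(fake_query, "Hello, John")
        books = pool.submit(fake_query, ["Book A", "Book B"])
        return greeting.result(), books.result()
```

The two functions return the same page data. The overlap is only valid when neither query depends on the other's result, which is what a NONBLOCKING-style transformation would have to establish before rewriting the code.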
78Applicability of the Transformations
Either transformation applies to 25% (Auction), 75% (Bboard), and 50% (Bookstore) of dynamic runtime interactions
79BBOARD Application Impact on Latency
Average latency in ms
Transformations
Overall latency decreases by 38%; the DBSS-DB latency decreases by 65%
80Impact of Latency on Scalability
Improved scalability
Scalability
Threshold
Latency curve
Latency
Reduced latency curve
Simultaneous users supported
Reducing latency improves scalability
81Effect of the Transformations on Scalability
Scalability (number of concurrent users supported)
Applying both transformations yields the best scalability
82Related work: database scalability for Web applications
- Database caching
- DBCache, DBProxy, MTCache, NEC Cache Portal, MySQL
- Database replication
- many
- Database outsourcing
- Hacigumus ICDE02, Hacigumus SIGMOD02, Amazon SimpleDB, Amazon S3
83Related work: non-DB-oriented Web scalability
- Caching Web application output
- Challenger INFOCOM99, Challenger ACMTrans05, Chabbouh and Makpangou iiWAS05
- Modifying the application design
- Gao IEEETrans05, Wei WWW08
84Conclusions
- Ferdinand's 2-tier cache is very effective compared to a 1-tier cache
- Better miss rates and scalability
- Pub / sub can manage consistency for a 2-tier cache
- Results suggest that neither Ferdinand nor a 1-tier cache should be fully distributed in a high-latency environment without additional techniques