Title: Using User Access Patterns for Semantic Query Caching
1Using User Access Patterns for Semantic Query
Caching
- Qingsong Yao, and Aijun An
- qingsong,ann_at_cs.yorku.ca
- Department of Computer Science
- York University
- Toronto, Canada
- 2003.08
2Topics of Discussion
- User Access Patterns
- Semantic Query Caching Problem and Solutions
- Algorithms
- Implementation and Experiments
- Conclusion and Future Work
3Background
- All SQL queries submitted by a client or a user
have specific meaning, and the query execution
orders follow certain business logics or rules. - The applications are written by using certain
programming tools. The embedded business logics
ensure that the submitted queries have certain
formats and follow certain rules. - In a database-driven web site, each dynamic web
page corresponds to a set of queries. The web
visitors show certain navigation patterns, thus
the queries show certain orders. - User access patterns describe how a group of
users or a client application accesses the data
of a database, include - a collection of user access events which
represents the format of queries. - a collection of frequent user access graphs
which describes the query execution orders. - User access patterns can be mined from database
workload or business logics, and they can - help to rewrite certain SQL to gain fast
execution time. - help to tune the database system,
- help to anticipate and pre-fetch incoming
queries. - help for semantic query caching.
4User Access Patterns
- User access patterns describe how a group of
users or a client application accesses the data
of a database, include - a collection of user access events associated
with the occurrence frequency and the parameter
distribution which describe the format of
queries. - a collection of frequent user access graphs
which describes the query execution orders. - User access patterns can be mined from database
workload or business logics, and they can - help to rewrite certain SQL to gain fast
execution time. - help to tune the database system, and generate
aalternative query execution path by adding
index, materialized view , statistics. - help to anticipate and pre-fetch incoming
queries. - help for semantic query caching.
- help to find possible data distributions or
relations.
5User Access Event and User Access Graph
- A user access event represents a set of similar
queries. It contains a SQL template and a set of
parameters. - SQL Template each value of the SQL queries is
replaced by a wildcard character () - Parameters the corresponding values of the
queries, and can be constants or variables - For example, event (select name from customer
where id,101) will retrieve customer 101 s
name, and event (select name from customer where
idd, cid) represent a set of queries that
retrieve the name of a given customer - A user access graph is a directed dependency
graph which represents the query execution order - each node is a user access event or user access
graph - each edge associates with a confidence value
- contains a set of global variables shared by the
nodes - some nodes associate with actions which change
the value of global variables - Semantic relationship exists between the events
of a graph
6An Example of User Access Graph
V31,V32,V34 are determined event.V34 is
parameter-determined by V33. V31 and V32 have the
same query predicate.
7Problem and Solutions
- Semantic Query Caching
- Previous query results are cached at clients or
mediates. - Each cache associates with a formula to describe
the content. - By finding the semantic relationship between the
cache formula and the query predicate, cached
query result can help to evaluate incoming
queries. - In order to answer a query, a probe query is
performed on the cached data, as well as a
reminder query is performed on the server to
retrieve the data which is not in the caches. - Problem
- no good cache selection and replacement policies.
- cache matching time can not be ignored.
- Solution
- anticipate future queries according to request
sequence and user access graphs. - rewrite current query to answer future queries.
8Problem and Solutions (2)
- Given two consecutive parameterized queries u and
v, three kinds of solutions are proposed
PR
SEQ
- Costs
- Response Time Network Traffic Server
Processing Time and Costs. - Factors
- The semantic relationship between u and v.
- The result size of u and v.
- The possibility that v follows u .
-
9Semantic Relationship
rows
u v
columns
6. u and v are irrelevant.
10Rewriting Algorithm
- Assumption the query predicates of u and v are
both the conjunction of basic predicate units
var1 op var2 cons., where op in ,lt,gt,gt,lt - solutions
- if v is undetermined, solutions
- else if v is result-determined by u, solutions
SEQ - else if u and v are irrelevant, solutions
SEQ - else Gu weighted_directed_graph(u)
- Gv weighted_directed_graph(v)
- comm_diff(Gu,Gv,comm,diff)
- rel relation(comm,diff)
- choose solutions according to rel
- generate rewriting queries
- select a solution based on the overall costs.
11Example
- u1 select a1 from r1 where a21 and a3lt3
-
- u2 select a1 from r1 where a21 and a3gt1,
u1gt select a1 from r1 where a2lt1 and 0lta2-1
and a3lt3
comm (a2,0,1),(0,a2,-1) difflt(a3,0,3),null
gt, ltnull,(0,a3,-1)gt, lt(a3,a2,2),nullgt,ltn
ull,(a2,a3,0)gt
Relationship partial-match
Rewriting queries u1 ? u2 select a1,a3 from r1
where a21 u1 select a1,a3 from r1 where a21
and a3 lt 3 u2 select a1 from r1 where a21
and a3gt3
12Implementation - Architecture
- SQL-Relay is an event-driven, rule-based database
gateway - Each connected user correspond to a state machine
which contains a set of states, variables and
user request sequence. - Each incoming query is one kind of user access
event. - Each event associates with a set of pre-defined
execution rules. - SQL-Relay contains a set of standard routine to
process a given execution rules. - Previous query results are cached for answering
incoming queries. - Two different kinds of caches global cache and
local cache.
13Implementation-Cache Management
Two kinds of caches
- A global cache aims to answer multiple request
from the clients. - always available to answer queries.
- various cache update policies can be used
(immediate update, periodic update, or
deferred update) . - can not be replaced by other caches, and is
always available unless it is temporarily
disabled due to updating. - A local cache aims to answer the request from a
specific client. - has shorter life-time.
- may be replaced by other local caches, and an
LRU-like replacement policy is used for cache
replacement. - is discarded when the underlying data is update.
14Experiment Result (1)
- Mining result for a client/server application
from one day's database queries log - 12 instances of the application, and 9,344 SQL
queries. - 190 user access events.
- 718 user request sequences belong to 21 frequent
user access graphs (support gt10).
q2,q3,q4 has horizontal-match relationship, can
be rewrite as select count() as num,
card_id, contract_last,
contract_first from customer where
cust_num1074
15Experiment Result (2)
- Client program is implemented by using java
language, simulates user request sequences based
on the user access graphs and the following
parameters - nClients the number of concurrent connected
clients. - nPaths the number of user request
sequences. - nRuns the number of runs.
- errorFreq the frequency of a random event
occurs. - abortFreq the frequency of a sequence abort
to execution. - randomFreq the frequency of a random sequence
occurs. - sleepTimePerEvent average sleep time between
two queries, default 0.1s. - sleepTimePerPath average sleep time between
two sequences,default 0.5s. - sleepTimePerRun average sleep time between
two runs, default 3min. - Database server is MySQL version 4 which
implements a server-side query cache function,
and MySQL is configured with a 128Kbytes server
cache. - SQL-Relay is implemented by using java language,
and has 128Kbytes global cache, 4Kbytes local
cache per connected client.
16Experiment Result (3)
- Client program simulates user request sequences
based on the user access graphs. Database server
is MySQL version 4 which is configured with a
128Kbytes server cache. SQL-Relay is implemented
by using java language, and has 128Kbytes global
cache, 4Kbytes local cache per connected client - Comparison of cache performance under the
following conditions - executing queries without cache
- executing queries with 128K server cache
- pre-fetching queries based on user access graphs
- integrating query pre-fetching and query
rewriting rules together
- Our solution has better performance than
others. - cache hit frequency is higher that server-side
cache. - save network traffic by retrieving less data from
the server. - the rewritten queries have less server I/Os.
17Experiment Result (4) TPC-W Benchmark
- Pseudo code for order display web interaction
- q1 select c_id from customer where
c_uname_at_c_uname and c_passwd_at_c_passwd - q2 select max(o_id) from orders where
o_c_id_at_c_id - q3 select customer., orders., address.
country. from customer,address,country,orders - where o_id_at_o_id and c_id_at_c_id and
o_bill_addraddr_id and addr_co_idco_id - q4 select address., country. from address,
country - where addr_id_at_a_ship_id and addr_co_idco_id
- q5 select from order_line,item where
ol_i_idi_id and ol_o_id_at_o_id
- split join query q3 into two queries
- q3_1 select from customer where cid_at_c_id
- q3_2 select orders., address. country. from
customer,address,country,orders - where o_id_at_o_id and o_bill_addraddr_id and
addr_co_idco_id - rewrite q1 to include the answer of q1 and q3_1
- q1 select from customer where
c_uname_at_c_uname and c_passwd_at_c_passwd - reason c_uname and c_passwd is the key.
- benefit access table customer only once, and
avoid join customer with other tables. - disadvantage retrieve more data when the
customer has no orders. - question can we rewrite q3_2 to include the
answer of q3_2 and q4? - select orders., a1., a2. c1., c2. from
address a1, address a2, country c1, country c2,
orders - where o_id_at_o_id and o_bill_addra1.addr_id and
a1.addr_co_idc1.co_id - and o_ship_addra2.addr_id and
a2.addr_co_idc2.co_id - reason o_ship_addr , addr_co_id are foreign
key. - benefit make use of foreign key constraints.
- disadvantage introduce new joins.
18Experiment Result (5) TPC-C Benchmark
- Pseudo code for order status transaction
- q1 EXEC SQL select count(c_id) INTO namecnt
- from customer where
c_lastc_last AND c_d_idd_id AND c_w_idw_id - q2EXEC SQL DECLARE c_name CURSOR FOR
- select c_balance, c_first, c_middle, c_id
- from customer where c_lastc_last AND
c_d_idd_id AND c_w_idw_id - order by c_first
- EXEC SQL OPEN c_name
- q3 if (namecnt2) namecnt // Locate midpoint
customer - for (n0 nltnamecnt/2 n)
- q4 EXEC SQL FETCH c_name INTO c_balance,
c_first, c_middle, c_id - q5 EXEC SQL CLOSE c_name
- Solution
- Execute q2 instead of q1 when q1 is submitted.
- q1 can be answered by retrieving the number of
rows returned by the cursor. - Reason
- q2 always follows q1 (i.e., the confidence is
1.0). - q1 and q2 have similar query predicate (except
the order by clause). - q2 contains q1 (i.e., the answer of q1 can be
answered by that of q2). - Benefit
- Only submit one query to the server.
- Only access the base relation once.
- Improve performance when no customer meets the
condition since the original solution search the
database twice. - Disadvantage
- Not a general solution, every DB has own function
to retrieve the number of rows of a cursor. - More
- DB server can take advantage of such
relationship.
19Conclusion and Future Work
- Our solution has the several advantages
- our caching algorithms are based on the query
execution orders and the semantic relationship
between queries, which are better than the
selection policies based on the global query
reference statistics. - It separates global cache with local cache, and
which will result in a better cache hit ratio. - Our SQL-Relay application is flexible and
extendable where various caching and rewriting
rules can be added and tested. - Future work
- Finding user access pattern from database
workload. - Exploring more ways to use user access patterns.
20The End.