Title: Interactive Query Processing
1. Interactive Query Processing
- Vijayshankar Raman
- Computer Science Division
- University of California at Berkeley
2. Motivation: Nature of querying
- querying = extracting information from data sets
  - different techniques in different settings
- intrinsically slow
- significant user-system interaction
  - info. seekers work iteratively, gradually refining requests based on feedback
3. Problems with traditional solutions
- mismatch between system functionality and mode of HCI
  - black boxes
  - do batch processing
- frustrating delays in iterative process
4. Interactive processing
[Figure: query -> process -> exact answer]
- HCI requirements
  - users must get continual feedback on results of processing
  - allow users to control processing based on prior feedback
- performance goals
  - not to minimize time to give complete results
  - give continually improving partial results
  - adapt to dynamically specified performance goals
5. Context and Talk Outline
- query processing over structured data
  - processing = dataflow through pipelining operators
- outline of rest of talk
  - support for dynamic user control in traditional query proc. architectures (static plans)
  - architecture for giving more aggressive partial results
  - policy for generating partial results
    - client user interface
    - impact of user actions on query execution
  - other related issues
    - interactive data cleaning (A-B-C)
    - adaptive query processing
[Figure: dataflow of pipelining operators over sources R, S, Q]
6. Design goals in supporting user control
- make minimal change to system architecture
  - must be independent of particular query processing algorithms
- no delay in processing
7. Online Reordering (Raman et al. 99, 00)
- users perceive data being processed over time
- prioritize processing for interesting tuples
  - interest based on user-specified preferences
- reorder dataflow so that interesting tuples go first
- encapsulate reordering as a pipelined dataflow operator
8. Framework for Online Reordering
[Figure: reorder operator permutes input stream "acddbadb..." into "abcdabc..." according to user interest]
- want no delay in processing
  - in general, reordering can only be best-effort
- typically process/consume slower than produce
  - exploit throughput difference to reorder
- two aspects
  - mechanism for best-effort reordering
  - reordering policy
9. Juggle mechanism for reordering
- continually prefetch from input, spooling onto auxiliary side disk if needed
- juggle data between buffer and side disk, to keep buffer full of interesting items
- getNext chooses best item currently in buffer
- reordering policy determines what getNext returns, and enrich/spool decisions
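As a rough illustration (not the Informix implementation), the sketch below keeps a small buffer of the currently most interesting items, spools displaced items to an in-memory stand-in for the side disk, and has getNext return the best buffered item. The `Juggle` class, its `buffer_size` parameter, and the list-based "side disk" are illustrative assumptions.

```python
class Juggle:
    """Best-effort reordering operator: keep the buffer full of the
    most interesting items, spooling the rest to a side disk."""

    def __init__(self, source, preference, buffer_size=4):
        self.source = iter(source)     # upstream producer
        self.preference = preference   # user preference: item -> score
        self.buffer = []               # small in-memory buffer
        self.side_disk = []            # spilled items (stand-in for disk)
        self.buffer_size = buffer_size

    def _offer(self, item):
        # enrich the buffer; spool the least interesting item if full
        self.buffer.append(item)
        if len(self.buffer) > self.buffer_size:
            worst = min(self.buffer, key=self.preference)
            self.buffer.remove(worst)
            self.side_disk.append(worst)

    def _prefetch(self):
        # a real juggle interleaves this with the consumer; here we
        # drain the source, then juggle spilled items back in
        for item in self.source:
            self._offer(item)
        while self.side_disk and len(self.buffer) < self.buffer_size:
            self._offer(self.side_disk.pop())

    def get_next(self):
        # policy: return the best item currently in the buffer
        self._prefetch()
        if not self.buffer:
            return None
        best = max(self.buffer, key=self.preference)
        self.buffer.remove(best)
        return best
```

With preferences a > b > c, an input like "ccbbaa" comes back roughly in preference order, limited only by buffer size (best-effort).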
10. Reordering policies
- goal: a good permutation of items t_1...t_n to t'_1...t'_n
[Figure: QOF vs. time; a good reordering raises the QOF curve early]
- quality of feedback for a prefix t'_1, t'_2, ..., t'_k
  = QOF(UP(t'_1), UP(t'_2), ..., UP(t'_k)), UP = user preference
  - determined by application
- goodness of reordering = dQOF/dt
- implication for juggle mechanism
  - process gets the item from buffer that increases QOF the most
  - juggle tries to maintain buffer with such items
11. One application of reordering
- online aggregation [Hellerstein et al. 97, Hellerstein and Haas 99]
  - for SQL aggregate queries, give gradually improving estimates, with confidence intervals
- allow users to speed up estimate refinement for groups of interest
  - prioritize for processing at a per-group granularity

SELECT AVG(gpa) FROM students GROUP BY college
12. Online Aggregation Screenshot
SELECT AVG(gpa) FROM students GROUP BY college
13. QOF in Online Aggregation
- avg. weighted confidence interval
  - preference acts as weight on confidence interval
  - QOF ∝ -Σ_i UP_i / √n_i, n_i = number of tuples processed from group i
- process pulls items from the group with max UP_i / (n_i √n_i)
- desired ratio of group i tuples in buffer = UP_i^(2/3) / Σ_j UP_j^(2/3)
  - juggle tries to maintain this by enrich/spool
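The two rules above can be sketched directly. Function names are illustrative; `prefs` and `counts` are assumed to hold UP_i and n_i per group (with every n_i >= 1):

```python
import math

def next_group(prefs, counts):
    """Pull from the group whose next tuple most improves QOF.
    With QOF proportional to -sum(UP_i / sqrt(n_i)), the marginal gain of
    one more tuple from group i is proportional to UP_i / (n_i * sqrt(n_i))."""
    return max(prefs, key=lambda g: prefs[g] / (counts[g] * math.sqrt(counts[g])))

def desired_buffer_ratios(prefs):
    """Buffer composition that equalizes marginal gains across groups:
    group i should occupy UP_i^(2/3) / sum_j UP_j^(2/3) of the buffer."""
    weights = {g: up ** (2.0 / 3.0) for g, up in prefs.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}
```

For UP = {A: 8, B: 1}, the desired buffer ratio is 8^(2/3) : 1 = 4 : 1, i.e. 80% group A; but once A is far ahead of B in tuples processed, the pull rule switches to B.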
14. Other Quality of Feedback functions
- rate of processing (for a group) ∝ preference
  - QOF ∝ -Σ_i (n_i - n·UP_i)² (variance from ideal proportions; n = Σ_i n_i)
  - process pulls items from the group with max (n·UP_i - n_i)
  - desired ratio of group i tuples in buffer = UP_i
- will see more complex QOF later
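A sketch of this rate-proportional policy (illustrative names; `prefs` assumed normalized to sum to 1):

```python
def next_group_rate(prefs, counts):
    """Rate-proportional policy: with QOF proportional to
    -sum((n_i - n*UP_i)^2), pull from the group furthest behind its
    ideal share of the n = sum(n_i) tuples processed so far."""
    n = sum(counts.values())
    return max(prefs, key=lambda g: n * prefs[g] - counts[g])
```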
15. Results: Reordering in Online Aggregation
- implemented in Informix UDO
- experiments with modified TPC-D queries
- questions
  - how much throughput difference is needed for reordering
  - can the reorderer handle skewed data
- one stress test: skew, minimal proc. cost
  - index-only join
  - 5 order priorities, Zipf distribution

SELECT AVG(o_totalprice), o_orderpriority
FROM order
WHERE EXISTS (SELECT * FROM lineitem WHERE l_orderkey = o_orderkey)
GROUP BY o_orderpriority

[Figure: plan pipeline index scan -> juggle -> process -> consume]
16. Performance results
[Figure: confidence interval vs. time, and tuples processed vs. time]
- 3 times faster for interesting groups
- 2% completion time overhead
17. Findings
- higher processing costs (index/hash join, subquery, ...) make reordering easy
- at very low processing costs, juggle constrained by density of interesting tuples in source
- outlier groups hard to speed up
  - better to use index on reordering column [HHW 97]
- reordering becomes easier over time
- questions to answer
  - where to place juggle
  - are groups fixed statically
18. Outline
- motivation and context
- support for dynamic user control in traditional query proc. architectures
- architecture for giving more aggressive partial results
- policy for generating partial results
  - user interface for displaying partial results
  - impact on routing
- related work, work-in-progress, future work
19. Generating partial results
- traditional architectures also generate continual result tuples
  - arises from continual dataflow through pipelining operators
  - much work on pipelining joins [WA91, HH99, IFF99, UF00]
- this is too rigid
  - especially in distributed environments
[Figure: result space]
20. Context: Query processing in Telegraph
- Telegraph = adaptive dataflow system to query diverse, distributed sources
- much data available as services over the Internet
  - currently only accessible by browse/search/forms
  - want to semi-automatically combine this data
- examples
  - campaign finance information (fff.cs.berkeley.edu)
    - multiple donor lists, home prices, census info., crime ratings, maps, celebrity lists, legislative records
  - restaurant information
    - lists, reviews, addresses, maps, health inspection reports, nutrition information
21. More aggressive partial results
- complete results too rigid
  - source latencies high, diverse
  - dynamic source variations, delays
  - query does not capture user desires
- want to process queries flexibly, giving results asap
  - adapt dynamically to user preferences and source variations
  - allow user feedback to refine partial results
22. Correctness of partial results
- must contain some essential columns
  - group-by columns / sort-by columns
- full disjunction / outer-join semantics
  - appropriate for many Internet data sources
- join semantics
  - cannot give partial results without ensuring that a match exists
- aggregates over-estimate, even with outer-join semantics
  - update early, and compensate later
23. Dynamic query plans
- Eddy [Avnur and Hellerstein 2000]
  - router for directing data through modules
  - dynamically chooses join and selection order
  - all partial tuples generatable
- original routing policy: optimize completion time
- our focus: continual partial results, adaptation to user preferences
24. Eddy Routing
[Figure: Eddy routing tuples from inputs to modules R, P, S]
- Eddy must decide
  - what tuple to route next
  - where to route it
  - based on user preferences, and source properties
- need a routing policy and a reordering mechanism
  - eddy memory and module queues bounded
25. Prioritizing tuples
[Figure: Eddy with a reorderer between inputs and modules R, P, S]
- perform online reordering within Eddy
  - all incoming tuples placed in reorderer
  - when space is available on queues to modules, Eddy takes a tuple from the reorderer and routes it to modules
- best-effort -> automatically reorders before slow modules
26. Routing and reordering policy
[Figure: QOF vs. time; goal is to maximize dQOF/dt at all times]
- GOAL: at any time, route to maximize dQOF/dt
  = benefit of sending tuple to module / cost
- cost: estimate data rates to/from module
- benefit: dependent on application and UI
  - how partial results impact the UI
  - user preferences
27. Outline
- motivation and context
- online reordering for user prioritization of partial results
- architecture for generating more aggressive partial results
- policy for generating partial results
  - Telegraph UI: displaying results and inferring preferences
  - impact on routing
  - experimental results
- related work / work-in-progress / future work
28. Telegraph UI
29. Telegraph UI
- partial results displayed on screen as they arrive
- values clustered into groups
  - can roll up or drill down to see different granularities
  - client has hash table mapping groups to values
- navigation
  - vertical/horizontal scrolling
  - column (un)hiding
  - rollup/drilldown
    - different columns visible at different drilldown levels
30. Getting user preferences (1)
- infer from navigation
  - row/column scrolling, group drill-down and rollup
  - prioritize visible rows/columns
- query evolution
  - subset one-size-fits-all queries
  - future work: query expansion
- explicit
  - at the cell level -- needed for some expensive sources
  - can also have up/down buttons on columns
31. Getting user preferences (2)
- preferences
  - col priorities
  - <predicate, row priority, row-col priorities>
  - e.g. <1, 1.0, none>, <2, 1.7, none>, <3, 1.3, map:high>, <8/33014, 1.0, none>, <8/60610, 1.0, none>
32. Benefit of a partial result
- depends on user preferences
- benefit of updating a cell in output
  = row priority × column weight × Δ cell worthiness
  - column weight = max(col priority, row-col priority)
  - incremental cell worthiness: how much does one extra update add to the cell's value -> quality of feedback
    - enumerations -- 1
    - aggregations -- change in confidence interval
- informing user about execution progress
  - map cell worthiness to foreground color value
33. Benefit of Routing a Tuple
- route tuple according to expected benefit and cost
- benefit of sending a tuple t to a module M and forming result set T
  = Σ_{cells c ∈ T} benefit of updating c
- estimate fanout by sampling previous values -- selectivity estimation
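A sketch of the two benefit formulas above. All names and the flat parameter lists are illustrative assumptions, and the fanout is passed in rather than sampled:

```python
def cell_benefit(row_priority, col_priority, row_col_priority, delta_worthiness):
    """Benefit of one cell update = row priority x column weight x
    incremental worthiness, with column weight = max(col priority,
    row-col priority)."""
    return row_priority * max(col_priority, row_col_priority) * delta_worthiness

def routing_benefit(row_priority, module_columns, fanout):
    """Expected benefit of routing a tuple to module M: an (estimated)
    fanout many result tuples, each updating M's output columns.
    module_columns: list of (col_priority, row_col_priority, delta_worthiness)."""
    per_result = sum(cell_benefit(row_priority, cp, rcp, dw)
                     for cp, rcp, dw in module_columns)
    return fanout * per_result
```

The Eddy would divide this benefit by the module's cost estimate (data rates, per slide 26) when picking a destination.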
34. Granularity of Reordering/Routing
- reorder and route tuples at the granularity of a group
- group = <base relations, predicate>
- groups created and deleted dynamically, as user navigates in UI
- group predicate may be a subset of application-specified row predicate
  - final row depends on values this tuple joins with
35. Partial results, no reordering
- Bush Contributors / Income / Crime Ratings query
[Figure: result screens with vs. without partial results]
36. Partial results, no reordering
- Bush Contributors / Income / Crime Ratings query
[Figure: total worthiness vs. time(s), with and without partial results]
37. Results with row-wise reordering
- scrolling: AZ, AR, CA -> CO, CT, LA, KY, MA, MD, MI
[Figure: aggregate updates per group over time (0-80s), and relative error (up to 0.24) per group]
38. Prioritizing particular columns
- previous graph: Income updates much faster than Crime Ratings
- can we prioritize Crime Ratings?
  - increase number of threads to probe it
- if eddy sends n_a tuples to A, n_b to B, per second
  - n_a < threads_a / latency(A), n_b < threads_b / latency(B)
  - threads_a + threads_b < number of threads per query
  - maximize n_a · worthiness(A) + n_b · worthiness(B)
- more generally, optimize for multiple resources
  - good citizenship
  - costs
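The thread-allocation problem above can be sketched as a brute-force search over two-module splits (illustrative names; latencies in seconds per probe). Since the objective is linear, the optimum puts all threads on the module with the best worthiness/latency ratio; real policies would add the resource and good-citizenship constraints noted above.

```python
def allocate_threads(total_threads, latencies, worthiness):
    """Choose per-module thread counts maximizing sum(n_i * worthiness_i),
    where module i sustains n_i = threads_i / latency_i probes per second,
    subject to threads_a + threads_b <= total_threads."""
    a, b = sorted(latencies)  # the two module names
    best = None
    for ta in range(total_threads + 1):
        tb = total_threads - ta
        value = (ta / latencies[a]) * worthiness[a] \
              + (tb / latencies[b]) * worthiness[b]
        if best is None or value > best[0]:
            best = (value, {a: ta, b: tb})
    return best[1]
```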
39. Outline
- motivation and context
- online reordering for user prioritization of partial results
- architecture for generating more aggressive partial results
- policy for generating partial results
- conclusions and future work
40. Related Work
- IR work
  - ranked retrieval, relevance feedback
  - search strategies, Berry Picking
- making operators pipelining (Ripple Join, XJoin, Tukwila)
- precomputed summaries (OLAP, materialized views, AQUA)
- top N / fast-first query processing
- parachute queries, union queries
- adaptivity
  - dynamic query plans
  - mid-query reoptimization (KD98, IFF99)
  - competition (DEC Rdb)
41. Summary
- query processor should care about how user/application uses results
- online reordering: an effective way of supporting dynamic user control
- give partial results as user wishes by embedding reordering within dynamically controlled dataflow
- system flexibility helpful
- hard to map these user-interaction needs into concrete algorithm performance goals
  - want benchmarks based on user/application traces
- For more information
  - http://telegraph.cs.berkeley.edu/, http://control.cs.berkeley.edu/
42. Future Work
- session-granularity query processing
  - query evolution
  - lazy evaluation: user/client navigates through partial results
  - closer dbms-application interaction
- reduced operator set querying
  - execute query by appropriate routing through data sources and state modules
43. Lessons Learned
- query processor should care about how user/application uses results
- system flexibility helpful
- hard to map these user-interaction needs into concrete algorithm performance goals
  - want benchmarks based on user/application traces
- For more information
  - http://telegraph.cs.berkeley.edu/
  - http://control.cs.berkeley.edu/
44. Granularity of Query Operators
- relational operators = logical abstractions
  - encapsulate multiple physical effects
- inflexible in handling unexpected changes
- information hiding
  - operator speeds
  - resource consumption
  - work sharing
  - competitive access paths
- inflexible tuple routing
  - suppose user asks for SQ tuples
  - suppose there is a delay in P
- want to encapsulate at level of physical operators: data sources, data structures
45. State Modules
[Figure: Eddy routing tuples among sources P, Q, R, S and SteM_P, SteM_Q, SteM_R, SteM_S]
- store state in separate State Modules (SteMs)
- a SteM is like an index: supports builds and probes
- unifies caches, join state, rendezvous buffers
- Eddy routes tuples through SteMs and data sources
  - gets results from sources, or probes/builds into SteMs
  - joins and data access performed in the process of routing
46. HCI Motivation: Why Interactivity?
- Berry picking (Bates 90, 93)
- user studies
  - decision support (O'Day and Jeffries 93)
  - relevance feedback (Koenemann and Belkin 96)
47. BACKUP: Disk data layout during juggling
- performance goal
  - favor early results: optimize Phase 1, at expense of Phase 2
  - spool: sequential I/O; enrich: random I/O
  - reordering in Phase 2 much easier than in Phase 1
- enrich needs approx. index on side-disk
  - have hash-index on tuples, according to user interest
  - done at group granularity
48. BACKUP: Other applications of reordering
- can add reorder to any dataflow
  - will later discuss application for query systems that give aggressive partial results
- also useful in batch query processing
  - sorting often used in query plans for performance gains
  - can replace by best-effort reordering
    - little performance hit, but plan is now pipelining
  - will not discuss further in this talk
- spreadsheets
49. BACKUP: Estimators for rowPriority, incremental worthiness
- easy, unless output row depends on new values
- currently use average of all possible rows
- could estimate a distribution instead
50. BACKUP: Interactive query processing with non-pipelining operators
- basic techniques independent of pipelined operators
  - reordering -- can be used in general query plans, although effectiveness may be hindered by blocking operators
  - State Modules applicable in general -- help with adaptivity
- much work on making plan operators pipelining -- ripple join, XJoin, Tukwila
- reordering itself can be used to avoid blocking sorts in some cases
51. Query processing using SteMs (1)
- SteM operations -- works like an index on a table
  - build(tuple): add tuple to SteM
  - probe(tuple): find all matches among built values
  - bounce back tuple if all matches not found and probe tuple not cached elsewhere
- performing joins by routing through SteMs
  - index joins
    - regular synchronous index joins easy -- have no state
    - over distributed sources, lookups cached by building into SteM
      - separating out the cache allows Eddy to distinguish access cost
    - asynchronous index joins [GW00]
      - help for high-throughput Internet sources
      - SteM acts as rendezvous buffer for pending probe results
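The build/probe interface can be sketched as a hash index, and a doubly pipelined (symmetric) hash join then falls out of routing each arriving tuple to a build on its own SteM and a probe on the other. Class and function names are illustrative, not the Telegraph API.

```python
import itertools
from collections import defaultdict

class SteM:
    """State Module: index-like storage supporting build and probe;
    unifies join state, caches, and rendezvous buffers."""

    def __init__(self, key):
        self.key = key                  # function: tuple -> join key
        self.table = defaultdict(list)  # hash index on built tuples

    def build(self, t):
        self.table[self.key(t)].append(t)

    def probe(self, t, probe_key):
        # all built matches so far; an empty list means no match yet
        # (a real Eddy may bounce the tuple back and retry later)
        return self.table.get(probe_key(t), [])

def symmetric_hash_join(r_stream, s_stream, r_key, s_key):
    """Doubly pipelined hash join expressed as routing through two SteMs;
    alternating the inputs stands in for Eddy routing."""
    stem_r, stem_s = SteM(r_key), SteM(s_key)
    results = []
    for r, s in itertools.zip_longest(r_stream, s_stream):
        if r is not None:
            stem_r.build(r)
            results += [(r, m) for m in stem_s.probe(r, r_key)]
        if s is not None:
            stem_s.build(s)
            results += [(m, s) for m in stem_r.probe(s, s_key)]
    return results
```

Because every tuple builds before the opposite side probes it, each join pair is reported exactly once, as results stream in.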
52. Query processing using SteMs (2)
- performing joins by routing through SteMs (contd.)
  - doubly pipelined hash join
    - use two SteMs, one on each source
  - can also do extensions (e.g. Tukwila [IFF99]), and even non-pipelined joins, using appropriate SteMs
- thus can simulate any static query plan by appropriate routing
[Figure: Eddy routing R and S tuples through builds and probes on SteM_R and SteM_S]
53. Query processing using SteMs (3)
- how to do this in general?
  - don't want Eddy to be a super-join
  - want it to route as per user desires and source properties, independent of any join algorithm
- routing constraint
  - each tuple must have a non-zero probability of being sent to every appropriate module
  - can avoid repeated looping by marking tuples
- Theorem: any routing that satisfies the above constraint will produce all query results and terminate
  - could get duplicates
    - prune at application, or
    - enforce atomicity in tuple routing
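The theorem can be illustrated with a toy two-source join: route work items in random order, marking each tuple with the module visits it still owes, and the same join result comes out for every interleaving. This sketch is illustrative; it routes each tuple's build before its probe, standing in for the bounce-back rule, and dedups the duplicates the slide mentions with a set.

```python
import random
from collections import defaultdict

def eddy_join(r_tuples, s_tuples, seed=0):
    """Random-order routing through two SteMs; assumes the join key is
    each tuple's first field. Any interleaving yields the full join."""
    rng = random.Random(seed)
    stems = {'R': defaultdict(list), 'S': defaultdict(list)}
    # work item: (relation, tuple, remaining module visits);
    # the marking prevents repeated looping and forces termination
    queue = [(rel, t, ['build', 'probe'])
             for rel, ts in (('R', r_tuples), ('S', s_tuples)) for t in ts]
    results = set()                      # set prunes duplicate derivations
    while queue:                         # terminates: each step consumes a visit
        rel, t, ops = queue.pop(rng.randrange(len(queue)))
        op, rest = ops[0], ops[1:]
        if op == 'build':
            stems[rel][t[0]].append(t)
        else:                            # probe the opposite SteM
            other = 'S' if rel == 'R' else 'R'
            for m in stems[other].get(t[0], []):
                results.add((t, m) if rel == 'R' else (m, t))
        if rest:
            queue.append((rel, t, rest))
    return results
```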
54. Benefits of SteMs
- adaptivity
  - join algorithm and access path selection are determined by eddy routing
  - hence Eddy can dynamically choose these based on source properties
  - in fact, Eddy can do hybrid join algorithms
  - adapt to dynamic delays, etc.
- work sharing -- for query evolution, competition
- better information to the Eddy
  - speeds, memory consumption
55. BACKUP: Lazy Evaluation
- user navigates through partial results
  - can see only a few at a time
  - exploit to delay processing
  - do on demand on the region user is currently navigating over
- things to delay
  - expensive source lookups (e.g. map)
  - joins
  - any other formatting etc.
56. BACKUP: Giving probabilistic results
- view partial result as something that throws light on outer-join space
  - positive result: probabilistic
  - negative result: deterministic
- benefit of positive result
  - weighted sum of cells it is likely to show
  - penalty for false result
  - repudiation of earlier false results
- benefit of negative result
  - direct benefit unlikely
  - repudiation of earlier false results
- complications
  - many tuples may contribute to single output tuple -- aggregation
57. Probabilistic partial results
- partial result -> may not apply all filters
  - still want to show result probabilistically
- filters ill-specified
  - data on Internet often inconsistent
- show all partial results: full disjunction (Galindo-Legaria 94)
- positive and negative results
[Figure: output space within outer-join space]
58. What does the user see?
- user navigates through partial results
- with complete-row partial results
  - can either explore these results in detail (sort/scroll along columns as in spreadsheet)
  - or compute approx. aggregates on them in online fashion
    - errors shrinking as more results come in
- with probabilistic, incomplete results?
  - easy if primary key in partial result
  - what to show otherwise?
59. Quantifying benefit of partial results
- partial result sheds light on outer-join space
  - positive result: probabilistic
  - negative result: deterministic
- benefit of positive result
  - weighted sum of cells it is likely to show
  - penalty for false result?
  - repudiation of earlier false results?
- benefit of negative result
  - direct benefit unlikely
  - repudiation of earlier false results?
60. Why probabilistic results are good
- data on the Internet is often inconsistent
  - hence even exact processing cannot give perfect answers
- people are used to sloppy answers
- can allow negations in expert interfaces only
61. Data growth vs. Computer Speedup
- Moore's Law -- number of transistors/chip doubles every 18 months (1965)
- data growth
  - Source: J. Porter, Disk/Trend, Inc. (http://www.disktrend.com/pdf/portrpkg.pdf)
62. Disk Appetite, contd.
- Greg Papadopoulos, CTO Sun
  - disk sales doubling every 9 months
- similar results from Winter VLDB survey
- time to process all your data doubles every 18 months!