Title: Interactive Query Processing
1. Interactive Query Processing
- Vijayshankar Raman
- Computer Science Division
- University of California at Berkeley
2. Motivation: Nature of querying
- querying = extracting information from data sets
  - different techniques in different settings
- intrinsically slow
- significant user-system interaction
  - info. seekers work iteratively, gradually refining requests based on feedback
3. Problems with traditional solutions
- mismatch between system functionality and mode of HCI
  - black boxes
  - do batch processing
- frustrating delays in iterative process
4. Interactive processing
[Figure: query -> process -> exact answer]
- HCI requirements
  - users must get continual feedback on results of processing
  - allow users to control processing based on prior feedback
- performance goals
  - not to minimize time to give complete results
  - give continually improving partial results
  - adapt to dynamically specified performance goals
5. Context and Talk Outline
- query processing over structured data
  - processing = dataflow through pipelining operators
- outline of rest of talk
  - support for dynamic user control in traditional query proc. architectures (static plans)
  - architecture for giving more aggressive partial results
  - policy for generating partial results
    - client user interface
    - impact of user actions on query execution
  - other related issues
    - interactive data cleaning (A-B-C)
    - adaptive query processing
[Figure: dataflow of pipelining operators over sources R, S, Q]
6. Design goals in supporting user control
- make minimal change to system architecture
  - must be independent of particular query processing algorithms
- no delay in processing
7. Online Reordering (Raman et al. 99, 00)
- users perceive data being processed over time
- prioritize processing for interesting tuples
  - interest based on user-specified preferences
- reorder dataflow so that interesting tuples go first
- encapsulate reordering as a pipelined dataflow operator
8. Framework for Online Reordering
[Figure: reorder operator permutes input stream "acddbadb..." into "abcdabc..." according to user interest]
- want no delay in processing
  - in general, reordering can only be best-effort
- typically process/consume slower than produce
  - exploit throughput difference to reorder
- two aspects
  - mechanism for best-effort reordering
  - reordering policy
9. Juggle mechanism for reordering
- continually prefetch from input, spooling onto auxiliary side disk if needed
- juggle data between buffer and side disk, to keep buffer full of interesting items
- getNext chooses best item currently in buffer
- reordering policy determines what getNext returns, and enrich/spool decisions
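As a rough illustration (not the Informix implementation), the sketch below keeps a small buffer of the currently most interesting items, spools displaced items to an in-memory stand-in for the side disk, and has getNext return the best buffered item. The `Juggle` class, its `buffer_size` parameter, and the list-based "side disk" are illustrative assumptions.

```python
class Juggle:
    """Best-effort reordering operator: keep the buffer full of the
    most interesting items, spooling the rest to a side disk."""

    def __init__(self, source, preference, buffer_size=4):
        self.source = iter(source)     # upstream producer
        self.preference = preference   # user preference: item -> score
        self.buffer = []               # small in-memory buffer
        self.side_disk = []            # spilled items (stand-in for disk)
        self.buffer_size = buffer_size

    def _offer(self, item):
        # enrich the buffer; spool the least interesting item if full
        self.buffer.append(item)
        if len(self.buffer) > self.buffer_size:
            worst = min(self.buffer, key=self.preference)
            self.buffer.remove(worst)
            self.side_disk.append(worst)

    def _prefetch(self):
        # a real juggle interleaves this with the consumer; here we
        # drain the source, then juggle spilled items back in
        for item in self.source:
            self._offer(item)
        while self.side_disk and len(self.buffer) < self.buffer_size:
            self._offer(self.side_disk.pop())

    def get_next(self):
        # policy: return the best item currently in the buffer
        self._prefetch()
        if not self.buffer:
            return None
        best = max(self.buffer, key=self.preference)
        self.buffer.remove(best)
        return best
```

With preferences a > b > c, an input like "ccbbaa" comes back roughly in preference order, limited only by buffer size (best-effort).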
10. Reordering policies
- goal: a good permutation of items t_1...t_n to t'_1...t'_n
[Figure: QOF vs. time; a good reordering raises the QOF curve early]
- quality of feedback for a prefix t'_1, t'_2, ..., t'_k
  = QOF(UP(t'_1), UP(t'_2), ..., UP(t'_k)), UP = user preference
  - determined by application
- goodness of reordering = dQOF/dt
- implication for juggle mechanism
  - process gets the item from buffer that increases QOF the most
  - juggle tries to maintain buffer with such items
11. One application of reordering
- online aggregation [Hellerstein et al. 97, Hellerstein and Haas 99]
  - for SQL aggregate queries, give gradually improving estimates, with confidence intervals
- allow users to speed up estimate refinement for groups of interest
  - prioritize for processing at a per-group granularity

SELECT AVG(gpa) FROM students GROUP BY college
12. Online Aggregation Screenshot
SELECT AVG(gpa) FROM students GROUP BY college
13. QOF in Online Aggregation
- avg. weighted confidence interval
  - preference acts as weight on confidence interval
  - QOF ∝ -Σ_i UP_i / √n_i, n_i = number of tuples processed from group i
- process pulls items from the group with max UP_i / (n_i √n_i)
- desired ratio of group i tuples in buffer = UP_i^(2/3) / Σ_j UP_j^(2/3)
  - juggle tries to maintain this by enrich/spool
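The two rules above can be sketched directly. Function names are illustrative; `prefs` and `counts` are assumed to hold UP_i and n_i per group (with every n_i >= 1):

```python
import math

def next_group(prefs, counts):
    """Pull from the group whose next tuple most improves QOF.
    With QOF proportional to -sum(UP_i / sqrt(n_i)), the marginal gain of
    one more tuple from group i is proportional to UP_i / (n_i * sqrt(n_i))."""
    return max(prefs, key=lambda g: prefs[g] / (counts[g] * math.sqrt(counts[g])))

def desired_buffer_ratios(prefs):
    """Buffer composition that equalizes marginal gains across groups:
    group i should occupy UP_i^(2/3) / sum_j UP_j^(2/3) of the buffer."""
    weights = {g: up ** (2.0 / 3.0) for g, up in prefs.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}
```

For UP = {A: 8, B: 1}, the desired buffer ratio is 8^(2/3) : 1 = 4 : 1, i.e. 80% group A; but once A is far ahead of B in tuples processed, the pull rule switches to B.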
14. Other Quality of Feedback functions
- rate of processing (for a group) ∝ preference
  - QOF ∝ -Σ_i (n_i - n·UP_i)² (variance from ideal proportions; n = Σ_i n_i)
  - process pulls items from the group with max (n·UP_i - n_i)
  - desired ratio of group i tuples in buffer = UP_i
- will see more complex QOF later
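A sketch of this rate-proportional policy (illustrative names; `prefs` assumed normalized to sum to 1):

```python
def next_group_rate(prefs, counts):
    """Rate-proportional policy: with QOF proportional to
    -sum((n_i - n*UP_i)^2), pull from the group furthest behind its
    ideal share of the n = sum(n_i) tuples processed so far."""
    n = sum(counts.values())
    return max(prefs, key=lambda g: n * prefs[g] - counts[g])
```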
15. Results: Reordering in Online Aggregation
- implemented in Informix UDO
- experiments with modified TPC-D queries
- questions
  - how much throughput difference is needed for reordering
  - can the reorderer handle skewed data
- one stress test: skew, minimal proc. cost
  - index-only join
  - 5 order priorities, Zipf distribution

SELECT AVG(o_totalprice), o_orderpriority
FROM order
WHERE EXISTS (SELECT * FROM lineitem WHERE l_orderkey = o_orderkey)
GROUP BY o_orderpriority

[Figure: plan pipeline index scan -> juggle -> process -> consume]
16. Performance results
[Figure: confidence interval vs. time, and tuples processed vs. time]
- 3 times faster for interesting groups
- 2% completion time overhead
17. Findings
- higher processing costs (index/hash join, subquery, ...) make reordering easy
- at very low processing costs, juggle constrained by density of interesting tuples in source
- outlier groups hard to speed up
  - better to use index on reordering column [HHW 97]
- reordering becomes easier over time
- questions to answer
  - where to place juggle
  - are groups fixed statically
18. Outline
- motivation and context
- support for dynamic user control in traditional query proc. architectures
- architecture for giving more aggressive partial results
- policy for generating partial results
  - user interface for displaying partial results
  - impact on routing
- related work, work-in-progress, future work
19. Generating partial results
- traditional architectures also generate continual result tuples
  - arises from continual dataflow through pipelining operators
  - much work on pipelining joins [WA91, HH99, IFF99, UF00]
- this is too rigid
  - especially in distributed environments
[Figure: result space]
20. Context: Query processing in Telegraph
- Telegraph = adaptive dataflow system to query diverse, distributed sources
- much data available as services over the Internet
  - currently only accessible by browse/search/forms
  - want to semi-automatically combine this data
- examples
  - campaign finance information (fff.cs.berkeley.edu)
    - multiple donor lists, home prices, census info., crime ratings, maps, celebrity lists, legislative records
  - restaurant information
    - lists, reviews, addresses, maps, health inspection reports, nutrition information
21. More aggressive partial results
- complete results too rigid
  - source latencies high, diverse
  - dynamic source variations, delays
  - query does not capture user desires
- want to process queries flexibly, giving results asap
  - adapt dynamically to user preferences and source variations
  - allow user feedback to refine partial results
22. Correctness of partial results
- must contain some essential columns
  - group-by columns / sort-by columns
- full disjunction / outer-join semantics
  - appropriate for many Internet data sources
- join semantics
  - cannot give partial results without ensuring that a match exists
- aggregates over-estimate, even with outer-join semantics
  - update early, and compensate later
23. Dynamic query plans
- Eddy [Avnur and Hellerstein 2000]
  - router for directing data through modules
  - dynamically chooses join and selection order
  - all partial tuples generatable
- original routing policy: optimize completion time
- our focus: continual partial results, adaptation to user preferences
24. Eddy Routing
[Figure: Eddy routing tuples from inputs to modules R, P, S]
- Eddy must decide
  - what tuple to route next
  - where to route it
  - based on user preferences, and source properties
- need a routing policy and a reordering mechanism
  - eddy memory and module queues bounded
25. Prioritizing tuples
[Figure: Eddy with a reorderer between inputs and modules R, P, S]
- perform online reordering within Eddy
  - all incoming tuples placed in reorderer
  - when space is available on queues to modules, Eddy takes a tuple from the reorderer and routes it to modules
- best-effort -> automatically reorders before slow modules
26. Routing and reordering policy
[Figure: QOF vs. time; goal is to maximize dQOF/dt at all times]
- GOAL: at any time, route to maximize dQOF/dt
  = benefit of sending tuple to module / cost
- cost: estimate data rates to/from module
- benefit: dependent on application and UI
  - how partial results impact the UI
  - user preferences
27. Outline
- motivation and context
- online reordering for user prioritization of partial results
- architecture for generating more aggressive partial results
- policy for generating partial results
  - Telegraph UI: displaying results and inferring preferences
  - impact on routing
  - experimental results
- related work / work-in-progress / future work
28. Telegraph UI
29. Telegraph UI
- partial results displayed on screen as they arrive
- values clustered into groups
  - can roll up or drill down to see different granularities
  - client has hash table mapping groups to values
- navigation
  - vertical/horizontal scrolling
  - column (un)hiding
  - rollup/drilldown
    - different columns visible at different drilldown levels
30. Getting user preferences (1)
- infer from navigation
  - row/column scrolling, group drill-down and rollup
  - prioritize visible rows/columns
- query evolution
  - subset one-size-fits-all queries
  - future work: query expansion
- explicit
  - at the cell level -- needed for some expensive sources
  - can also have up/down buttons on columns
31. Getting user preferences (2)
- preferences
  - col priorities
  - <predicate, row priority, row-col priorities>
  - e.g. <1, 1.0, none>, <2, 1.7, none>, <3, 1.3, map:high>, <8/33014, 1.0, none>, <8/60610, 1.0, none>
32. Benefit of a partial result
- depends on user preferences
- benefit of updating a cell in output
  = row priority × column weight × Δ cell worthiness
  - column weight = max(col priority, row-col priority)
  - incremental cell worthiness: how much does one extra update add to the cell's value -> quality of feedback
    - enumerations -- 1
    - aggregations -- change in confidence interval
- informing user about execution progress
  - map cell worthiness to foreground color value
33. Benefit of Routing a Tuple
- route tuple according to expected benefit and cost
- benefit of sending a tuple t to a module M and forming result set T
  = Σ_{cells c ∈ T} benefit of updating c
- estimate fanout by sampling previous values -- selectivity estimation
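A sketch of the two benefit formulas above. All names and the flat parameter lists are illustrative assumptions, and the fanout is passed in rather than sampled:

```python
def cell_benefit(row_priority, col_priority, row_col_priority, delta_worthiness):
    """Benefit of one cell update = row priority x column weight x
    incremental worthiness, with column weight = max(col priority,
    row-col priority)."""
    return row_priority * max(col_priority, row_col_priority) * delta_worthiness

def routing_benefit(row_priority, module_columns, fanout):
    """Expected benefit of routing a tuple to module M: an (estimated)
    fanout many result tuples, each updating M's output columns.
    module_columns: list of (col_priority, row_col_priority, delta_worthiness)."""
    per_result = sum(cell_benefit(row_priority, cp, rcp, dw)
                     for cp, rcp, dw in module_columns)
    return fanout * per_result
```

The Eddy would divide this benefit by the module's cost estimate (data rates, per slide 26) when picking a destination.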
34. Granularity of Reordering/Routing
- reorder and route tuples at the granularity of a group
- group = <base relations, predicate>
- groups created and deleted dynamically, as user navigates in UI
- group predicate may be a subset of application-specified row predicate
  - final row depends on values this tuple joins with
35. Partial results, no reordering
- Bush Contributors / Income / Crime Ratings query
[Figure: result screens with vs. without partial results]
36. Partial results, no reordering
- Bush Contributors / Income / Crime Ratings query
[Figure: total worthiness vs. time(s), with and without partial results]
37. Results with row-wise reordering
- scrolling: AZ, AR, CA -> CO, CT, LA, KY, MA, MD, MI
[Figure: aggregate updates per group over time (0-80s), and relative error (up to 0.24) per group]
38. Prioritizing particular columns
- previous graph: Income updates much faster than Crime Ratings
- can we prioritize Crime Ratings?
  - increase number of threads to probe it
- if eddy sends n_a tuples to A, n_b to B, per second
  - n_a < threads_a / latency(A), n_b < threads_b / latency(B)
  - threads_a + threads_b < number of threads per query
  - maximize n_a · worthiness(A) + n_b · worthiness(B)
- more generally, optimize for multiple resources
  - good citizenship
  - costs
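The thread-allocation problem above can be sketched as a brute-force search over two-module splits (illustrative names; latencies in seconds per probe). Since the objective is linear, the optimum puts all threads on the module with the best worthiness/latency ratio; real policies would add the resource and good-citizenship constraints noted above.

```python
def allocate_threads(total_threads, latencies, worthiness):
    """Choose per-module thread counts maximizing sum(n_i * worthiness_i),
    where module i sustains n_i = threads_i / latency_i probes per second,
    subject to threads_a + threads_b <= total_threads."""
    a, b = sorted(latencies)  # the two module names
    best = None
    for ta in range(total_threads + 1):
        tb = total_threads - ta
        value = (ta / latencies[a]) * worthiness[a] \
              + (tb / latencies[b]) * worthiness[b]
        if best is None or value > best[0]:
            best = (value, {a: ta, b: tb})
    return best[1]
```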
39. Outline
- motivation and context
- online reordering for user prioritization of partial results
- architecture for generating more aggressive partial results
- policy for generating partial results
- conclusions and future work
40. Related Work
- IR work
  - ranked retrieval, relevance feedback
  - search strategies, Berry Picking
- making operators pipelining (Ripple Join, XJoin, Tukwila)
- precomputed summaries (OLAP, materialized views, AQUA)
- top N / fast-first query processing
- parachute queries, union queries
- adaptivity
  - dynamic query plans
  - mid-query reoptimization (KD98, IFF99)
  - competition (DEC Rdb)
41. Summary
- query processor should care about how user/application uses results
- online reordering: an effective way of supporting dynamic user control
- give partial results as user wishes by embedding reordering within dynamically controlled dataflow
- system flexibility helpful
- hard to map these user-interaction needs into concrete algorithm performance goals
  - want benchmarks based on user/application traces
- For more information
  - http://telegraph.cs.berkeley.edu/, http://control.cs.berkeley.edu/
42. Future Work
- session-granularity query processing
  - query evolution
  - lazy evaluation: user/client navigates through partial results
  - closer dbms-application interaction
- reduced operator set querying
  - execute query by appropriate routing through data sources and state modules
43. Lessons Learned
- query processor should care about how user/application uses results
- system flexibility helpful
- hard to map these user-interaction needs into concrete algorithm performance goals
  - want benchmarks based on user/application traces
- For more information
  - http://telegraph.cs.berkeley.edu/
  - http://control.cs.berkeley.edu/
44. Granularity of Query Operators
- relational operators = logical abstractions
  - encapsulate multiple physical effects
- inflexible in handling unexpected changes
- information hiding
  - operator speeds
  - resource consumption
  - work sharing
  - competitive access paths
- inflexible tuple routing
  - suppose user asks for SQ tuples
  - suppose there is a delay in P
- want to encapsulate at level of physical operators: data sources, data structures
45. State Modules
[Figure: Eddy routing tuples among sources P, Q, R, S and SteM_P, SteM_Q, SteM_R, SteM_S]
- store state in separate State Modules (SteMs)
- a SteM is like an index: supports builds and probes
- unifies caches, join state, rendezvous buffers
- Eddy routes tuples through SteMs and data sources
  - gets results from sources, or probes/builds into SteMs
  - joins and data access performed in the process of routing
46. HCI Motivation: Why Interactivity?
- Berry picking (Bates 90, 93)
- user studies
  - decision support (O'Day and Jeffries 93)
  - relevance feedback (Koenemann and Belkin 96)
47. BACKUP: Disk data layout during juggling
- performance goal
  - favor early results: optimize Phase 1, at expense of Phase 2
  - spool: sequential I/O; enrich: random I/O
  - reordering in Phase 2 much easier than in Phase 1
- enrich needs approx. index on side-disk
  - have hash-index on tuples, according to user interest
  - done at group granularity
48. BACKUP: Other applications of reordering
- can add reorder to any dataflow
  - will later discuss application for query systems that give aggressive partial results
- also useful in batch query processing
  - sorting often used in query plans for performance gains
  - can replace by best-effort reordering
    - little performance hit, but plan is now pipelining
  - will not discuss further in this talk
- spreadsheets
49. BACKUP: Estimators for rowPriority, incremental worthiness
- easy, unless output row depends on new values
- currently use average of all possible rows
- could estimate a distribution instead
50. BACKUP: Interactive query processing with non-pipelining operators
- basic techniques independent of pipelined operators
  - reordering -- can be used in general query plans, although effectiveness may be hindered by blocking operators
  - State Modules applicable in general -- help with adaptivity
- much work on making plan operators pipelining -- ripple join, XJoin, Tukwila
- reordering itself can be used to avoid blocking sorts in some cases
51. Query processing using SteMs (1)
- SteM operations -- works like an index on a table
  - build(tuple): add tuple to SteM
  - probe(tuple): find all matches among built values
  - bounce back tuple if all matches not found and probe tuple not cached elsewhere
- performing joins by routing through SteMs
  - index joins
    - regular synchronous index joins easy -- have no state
    - over distributed sources, lookups cached by building into SteM
      - separating out the cache allows Eddy to distinguish access cost
    - asynchronous index joins [GW00]
      - help for high-throughput Internet sources
      - SteM acts as rendezvous buffer for pending probe results
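The build/probe interface can be sketched as a hash index, and a doubly pipelined (symmetric) hash join then falls out of routing each arriving tuple to a build on its own SteM and a probe on the other. Class and function names are illustrative, not the Telegraph API.

```python
import itertools
from collections import defaultdict

class SteM:
    """State Module: index-like storage supporting build and probe;
    unifies join state, caches, and rendezvous buffers."""

    def __init__(self, key):
        self.key = key                  # function: tuple -> join key
        self.table = defaultdict(list)  # hash index on built tuples

    def build(self, t):
        self.table[self.key(t)].append(t)

    def probe(self, t, probe_key):
        # all built matches so far; an empty list means no match yet
        # (a real Eddy may bounce the tuple back and retry later)
        return self.table.get(probe_key(t), [])

def symmetric_hash_join(r_stream, s_stream, r_key, s_key):
    """Doubly pipelined hash join expressed as routing through two SteMs;
    alternating the inputs stands in for Eddy routing."""
    stem_r, stem_s = SteM(r_key), SteM(s_key)
    results = []
    for r, s in itertools.zip_longest(r_stream, s_stream):
        if r is not None:
            stem_r.build(r)
            results += [(r, m) for m in stem_s.probe(r, r_key)]
        if s is not None:
            stem_s.build(s)
            results += [(m, s) for m in stem_r.probe(s, s_key)]
    return results
```

Because every tuple builds before the opposite side probes it, each join pair is reported exactly once, as results stream in.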
52. Query processing using SteMs (2)
- performing joins by routing through SteMs (contd.)
  - doubly pipelined hash join
    - use two SteMs, one on each source
  - can also do extensions (e.g. Tukwila [IFF99]), and even non-pipelined joins, using appropriate SteMs
- thus can simulate any static query plan by appropriate routing
[Figure: Eddy routing R and S tuples through builds and probes on SteM_R and SteM_S]
53. Query processing using SteMs (3)
- how to do this in general?
  - don't want Eddy to be a super-join
  - want it to route as per user desires and source properties, independent of any join algorithm
- routing constraint
  - each tuple must have a non-zero probability of being sent to every appropriate module
  - can avoid repeated looping by marking tuples
- Theorem: any routing that satisfies the above constraint will produce all query results and terminate
  - could get duplicates
    - prune at application, or
    - enforce atomicity in tuple routing
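The theorem can be illustrated with a toy two-source join: route work items in random order, marking each tuple with the module visits it still owes, and the same join result comes out for every interleaving. This sketch is illustrative; it routes each tuple's build before its probe, standing in for the bounce-back rule, and dedups the duplicates the slide mentions with a set.

```python
import random
from collections import defaultdict

def eddy_join(r_tuples, s_tuples, seed=0):
    """Random-order routing through two SteMs; assumes the join key is
    each tuple's first field. Any interleaving yields the full join."""
    rng = random.Random(seed)
    stems = {'R': defaultdict(list), 'S': defaultdict(list)}
    # work item: (relation, tuple, remaining module visits);
    # the marking prevents repeated looping and forces termination
    queue = [(rel, t, ['build', 'probe'])
             for rel, ts in (('R', r_tuples), ('S', s_tuples)) for t in ts]
    results = set()                      # set prunes duplicate derivations
    while queue:                         # terminates: each step consumes a visit
        rel, t, ops = queue.pop(rng.randrange(len(queue)))
        op, rest = ops[0], ops[1:]
        if op == 'build':
            stems[rel][t[0]].append(t)
        else:                            # probe the opposite SteM
            other = 'S' if rel == 'R' else 'R'
            for m in stems[other].get(t[0], []):
                results.add((t, m) if rel == 'R' else (m, t))
        if rest:
            queue.append((rel, t, rest))
    return results
```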
54. Benefits of SteMs
- adaptivity
  - join algorithm and access path selection are determined by eddy routing
  - hence Eddy can dynamically choose these based on source properties
  - in fact, Eddy can do hybrid join algorithms
  - adapt to dynamic delays, etc.
- work sharing -- for query evolution, competition
- better information to the Eddy
  - speeds, memory consumption
55. BACKUP: Lazy Evaluation
- user navigates through partial results
  - can see only a few at a time
  - exploit to delay processing
  - do on demand on the region user is currently navigating over
- things to delay
  - expensive source lookups (e.g. map)
  - joins
  - any other formatting etc.
56. BACKUP: Giving probabilistic results
- view partial result as something that throws light on outer-join space
  - positive result: probabilistic
  - negative result: deterministic
- benefit of positive result
  - weighted sum of cells it is likely to show
  - penalty for false result
  - repudiation of earlier false results
- benefit of negative result
  - direct benefit unlikely
  - repudiation of earlier false results
- complications
  - many tuples may contribute to single output tuple -- aggregation
57. Probabilistic partial results
- partial result -> may not apply all filters
  - still want to show result probabilistically
- filters ill-specified
  - data on Internet often inconsistent
- show all partial results: full disjunction (Galindo-Legaria 94)
- positive and negative results
[Figure: output space within outer-join space]
58. What does the user see?
- user navigates through partial results
- with complete-row partial results
  - can either explore these results in detail (sort/scroll along columns as in spreadsheet)
  - or compute approx. aggregates on them in online fashion
    - errors shrinking as more results come in
- with probabilistic, incomplete results?
  - easy if primary key in partial result
  - what to show otherwise?
59. Quantifying benefit of partial results
- partial result sheds light on outer-join space
  - positive result: probabilistic
  - negative result: deterministic
- benefit of positive result
  - weighted sum of cells it is likely to show
  - penalty for false result?
  - repudiation of earlier false results?
- benefit of negative result
  - direct benefit unlikely
  - repudiation of earlier false results?
60. Why probabilistic results are good
- data on the Internet is often inconsistent
  - hence even exact processing cannot give perfect answers
- people are used to sloppy answers
- can allow negations in expert interfaces only
61. Data growth vs. Computer Speedup
- Moore's Law -- number of transistors/chip doubles every 18 months (1965)
- data growth
  - Source: J. Porter, Disk/Trend, Inc. (http://www.disktrend.com/pdf/portrpkg.pdf)
62. Disk Appetite, contd.
- Greg Papadopoulos, CTO Sun
  - disk sales doubling every 9 months
- similar results from Winter VLDB survey
- time to process all your data doubles every 18 months!