Interactive Query Processing - PowerPoint PPT Presentation

About This Presentation
Title:

Interactive Query Processing

Description:

querying = extracting information from data sets ... scan. reorder. index. scan. 8/25/09. Interactive Query Processing. 8. Framework for Online Reordering ... – PowerPoint PPT presentation

Number of Views:34
Avg rating:3.0/5.0
Slides: 63
Provided by: rsha
Learn more at: https://dsf.berkeley.edu
Category:

less

Transcript and Presenter's Notes

Title: Interactive Query Processing


1
Interactive Query Processing
  • Vijayshankar Raman
  • Computer Science Division
  • University of California at Berkeley

2
Motivation Nature of querying
  • querying extracting information from data sets
  • different techniques in different settings
  • intrinsically slow
  • significant user-system interaction
  • info. seekers work iteratively, gradually
    refining requests based on feedback

3
Problems with traditional solutions
  • mismatch between system functionality and mode of
    HCI
  • black boxes
  • do batch processing
  • frustrating delays in iterative process

4
Interactive processing
query
process
exact answer
  • HCI requirements
  • users must get continual feedback on results of
    processing
  • allow users to control processing based on prior
    feedback
  • performance goals
  • not to minimize time to give complete results
  • give continually improving partial results
  • adapt to dynamically specified performance goals

5
Context and Talk Outline
  • query processing over structured data
  • processing dataflow through
    pipelining operators
  • outline of rest of talk
  • support for dynamic user control in traditional
    query proc. architectures (static plans)
  • architecture for giving more aggressive partial
    results
  • policy for generating partial results
  • client user interface
  • impact of user actions on query execution
  • other related issues
  • interactive data cleaning (A-B-C)
  • adaptive query processing

R
S
Q
6
Design goals in supporting user control
  • make minimal change to system architecture
  • must be independent of particular query
    processing algorithms
  • no delay in processing

7
Online Reordering (Raman et al. 99,00)
  • users perceive data being processed over time
  • prioritize processing for interesting tuples
  • interest based on user-specified preferences
  • reorder dataflow so that interesting tuples go
    first
  • encapsulate reordering as pipelined dataflow
    operator

8
Framework for Online Reordering
acddbadb...
abcdabc..
user interest
  • want no delay in processing
  • in general, reordering can only be best-effort
  • typically process/consume slower than produce
  • exploit throughput difference to reorder
  • two aspects
  • mechanism for best-effort reordering
  • reordering policy

9
Juggle mechanism for reordering
  • continually prefetch from input, spooling onto
    auxiliary side disk if needed
  • juggle data between buffer and side disk, to keep
    buffer full of interesting items
  • getNext chooses best item currently on buffer
  • reordering policy determines what getNext
    returns, and enrich/spool decisions

10
Reordering policies
good permutation of items t1tn to t?1t?n
GOAL
QOF
time
  • quality of feedback for a prefix t?1t?2t?k
  • QOF(UP(t?1), UP(t?2), UP(t?k )), UP user
    preference
  • determined by application
  • goodness of reordering dQOF/dt
  • implication for juggle mechanism
  • process gets item from buffer that increases QOF
    the most
  • juggle tries to maintain buffer with such items

11
One application of reordering
  • online aggregation Hellerstein et. al 97,
    Hellerstein and Haas99
  • for SQL aggregate queries, give gradually
    improving estimates
  • with confidence intervals
  • allow users to speed up estimate refinement for
    groups of interest
  • prioritize for processing at a per-group
    granularity

SELECT AVG(gpa) FROM students GROUP BY college
12
Online Aggregation Screenshot
SELECT AVG(gpa) FROM students GROUP BY college
13
QOF in Online Aggregation
  • avg weighted confidence interval
  • preference acts as weight on confidence interval
  • QOF ?? UPi / ?ni , ni number of tuples
    processed from group i
  • process pulls items from group with max UPi /
    ni?ni
  • desired ratio of group i tuples on buffer
    UPi2/3/? UPj2/3
  • juggle tries to maintain this by enrich/spool

14
Other Quality of Feedback functions
  • rate of processing (for a group) ? preference
  • QOF ?? (ni - nUPi)2 (variance from ideal
    proportions)
  • process pulls items from group with max (nUPi -
    ni )
  • desired ratio of group i tuples in buffer UPi
  • will see more complex QOF later

15
Results Reordering in Online Aggregation
  • implemented in Informix UDO
  • experiments with modified TPC-D queries
  • questions
  • how much throughput difference is needed for
    reordering
  • can we reorder handle skewed data
  • one stress test skew, minimal proc. cost
  • index-only join
  • 5 orderpriorities, zipf distribution

consume
SELECT AVG(o_totalprice), o_orderpriority FROM
order WHERE exists ( SELECT FROM lineitem
WHERE l_orderkey o_orderkey) GROUP BY
o_orderpriority
juggle
index
scan
process
16
Performance results
confidence interval
tuples processed
time
time
  • 3 times faster for interesting groups
  • 2 completion time overhead

17
Findings
  • higher processing costs (index/hash join,
    subquery, ) make reordering easy
  • at very low processing costs, juggle constrained
    bydensity of interesting tuples in source
  • outlier groups hard to speed up
  • better to use index on reordering column HHW
    97
  • reordering becomes easier over time
  • questions to answer
  • where to place juggle
  • are groups fixed statically

18
Outline
  • motivation and context
  • support for dynamic user control in traditional
    query proc. architectures
  • architecture for giving more aggressive partial
    results
  • policy for generating partial results
  • user interface for displaying partial results
  • impact on routing
  • related work, work-in-progress, future work

19
Generating partial results
  • traditional arch. also generate continual result
    tuples
  • arises from continual dataflow thru pipelining
    operators
  • much work on pipelining joins WA91, HH99,
    IFF99, UF00
  • this is too rigid
  • especially in distributed envirorments

Result Space
20
Context Query processing in Telegraph
  • Telegraph adaptive dataflow system to query
    diverse, distributed sources
  • much data available as services over Internet
  • currently only accessible by browse/search/forms
  • want to semi-automatically combine this data
  • examples
  • campaign finance information (fff.cs.berkeley.edu)
  • multiple donor lists, home prices, census info.,
    crime ratings, maps, celebrity lists, legislative
    records
  • restaurant information
  • lists, reviews, addresses, maps, health
    inspection reports, nutrition information

21
More aggressive partial results
  • complete results too rigid
  • source latencies high, diverse
  • dynamic source variations, delays
  • query does not capture user desires
  • want to process queries flexibly, giving results
    asap
  • adapt dynamically to user preferences and source
    variations
  • allow user feedback to refine partial results

22
Correctness of partial results
  • must contain some essential columns
  • group-by columns / sort-by columns
  • full disjunction/outer-join semantics
  • appropriate for many Internet data sources
  • join semantics
  • cannot give partial results without ensuring that
    match exists
  • aggregates over-estimate, even with outer-join
    semantics
  • update early, and compensate later

23
Dynamic query plans
  • Eddy Avnur and Hellerstein 2000
  • router for directing data thru modules
  • dynamically chooses join and selection order
  • all partial tuples generatable
  • original routing policy optimize completion time
  • our focus continual partial results, adaptation
    to user preferences

24
Eddy Routing
. . . modules
R
P
S
. . . inputs
  • Eddy must decide
  • what tuple to route next
  • where to route it
  • based on user preferences, and source properties
  • need routing policy and a reordering mechanism
  • eddy memory and module queues bounded

25
Prioritizing tuples
. . . modules
R
P
S
. . . inputs
  • perform online reordering within Eddy
  • all incoming tuples placed in reorderer
  • when space available on queues to modules, Eddy
    takes tuple from reorderer and routes it to
    modules
  • best-effort ? automatically reorder before slow
    modules

26
Routing and reordering policy
GOAL
QOF
time
  • GOAL at any time, route to max. dQOF/dt
    benefit of sending tuple to module / cost
  • cost estimate data rates to/from module
  • benefit dependent on application and UI
  • how partial results impact the UI
  • user preferences

27
Outline
  • motivation and context
  • online reordering for user prioritization of
    partial results
  • architecture for generating more aggressive
    partial results
  • policy for generating partial results
  • Telegraph UI displaying results and inferring
    preferences
  • impact on routing
  • experimental results
  • related work/work-in-progress/future work

28
Telegraph UI
  • screenshot

29
Telegraph UI
  • partial results displayed on screen as they
    arrive
  • values clustered into groups
  • can roll-up or drill-down to see different
    granularities
  • client has hash table mapping groups to values
  • navigation
  • vertical/horizontal scrolling
  • column (un)hiding
  • rollup/drilldown
  • different columns visible at different drilldown
    levels

30
Getting user preferences (1)
  • infer from navigation
  • row/column scrolling, group drill down and rollup
  • prioritize visible rows/columns
  • query evolution
  • subset one-size-fits-all queries
  • future work query expansion
  • explicit
  • at the cell level -- need for some expensive
    sources
  • can also have up/down buttons on columns

31
Getting user preferences (2)
  • preferences
  • col priorities
  • ltpredicate, row priority, row-col prioritiesgt
  • e.g. lt1, 1.0, nonegt, lt2, 1.7, nonegt,
    lt3, 1.3, maphighgt, lt8/33014, 1.0,
    nonegt, lt8/60610, 1.0, nonegt

32
Benefit of a partial result
  • depends on user preferences
  • benefit of updating a cell in output
  • row priority x column weight x ? cell worthiness
  • column weight max(col priority, row-col
    priority)
  • incremental cell worthiness how much does one
    extra update add to the cells value ? quality
    of feedback
  • enumerations -- 1
  • aggregations -- change in confidence interval
  • informing user about execution progress
  • map cell worthiness to foreground color value

33
Benefit of Routing a Tuple
  • route tuple according to expected benefit and
    cost
  • benefit of sending a tuple t to a module M and
    forming set T ?cells c ?T benefit of updating c
  • estimate fanout by sampling previous values --
    selectivity estimation

34
Granularity of Reordering/Routing
  • reorder and route tuples at granularity of a
    group
  • group ltbase relations, predicategt
  • groups created and deleted dynamically, as user
    navigates in UI
  • group predicate may be a subset of
    application-specified row predicate
  • final row depends on values this tuple joins with

35
Partial results, no reordering
  • Bush Contributors Income Crime
    Ratings

with partial results
without partial results
36
Partial results, no reordering
  • Bush Contributors Income Crime
    Ratings

with partial results
total worthiness
without partial results
time(s)
37
Results with row-wise reordering
  • scrolling AZ, AR, CA CO, CT, LA,KY,MA,MD,MI

100
Aggr. Updates
0
40
80
time
0.24
rel. error
0
groups
38
Prioritizing particular columns
  • previous graph Income updates much faster than
    Crime Ratings
  • can we prioritize Crime Ratings?
  • increase number of threads to probe it
  • if eddy sends na tuples to A, nb to B, per second
  • na lt threadsa/latency(A), nb lt
    threadsb/latency(B)
  • threadsa threadsb lt Number of Threads Per
    Query
  • max. na ?worthiness(A) nb ?worthiness(B)
  • more generally, optimize for multiple resources
  • good citizenship
  • costs

39
Outline
  • motivation and context
  • online reordering for user prioritization of
    partial results
  • architecture for generating more aggressive
    partial results
  • policy for generating partial results
  • conclusions and future work

40
Related Work
  • IR work
  • ranked retrieval, relevance feedback
  • search strategies, Berry Picking
  • making operators pipelining (Ripple Join, XJoin,
    Tuqwila)
  • precomputed summaries (OLAP, materialized views,
    AQUA)
  • top N / fast-first query processing
  • parachute queries, union queries
  • adaptivity
  • dynamic query plans
  • mid-query reoptimization (KD98, IFF99)
  • competition (DEC Rdb)

41
Summary
  • query processor should care about how
    user/application uses results
  • online reordering effective way of supporting
    dynamic user control
  • give partial results as user wishes by embedding
    reordering within dynamically controlled dataflow
  • system flexibility helpful
  • hard to map these user-interaction needs
    intoconcrete algorithm performance goals
  • wanted benchmarks based on user/application
    traces
  • For more information
  • http//telegraph.cs.berkeley.edu/,
    http//control.cs.berkeley.edu/

42
Future Work
  • session-granularity query processing
  • query evolution
  • lazy evaluation user/client navigates through
    partial results
  • closer dbms-application interaction
  • reduced operator set querying
  • execute query by appropriate routing through data
    sources and state modules

43
Lessons Learned
  • query processor should care about how
    user/application uses results
  • system flexibility helpful
  • hard to map these user-interaction needs
    intoconcrete algorithm performance goals
  • wanted benchmarks based on user/application
    traces
  • For more information
  • http//telegraph.cs.berkeley.edu/
  • http//control.cs.berkeley.edu/

44
Granularity of Query Operators
  • relational operators logical astractions
  • encapsulate multiple physical effects
  • inflexible in handling unexpected changes
  • information hiding
  • operator speeds
  • resource consumption
  • work sharing
  • competitive access paths
  • inflexible tuple routing
  • suppose user asks for SQ tuples
  • suppose there is a delay in P
  • want to encapsulate at level of physical
    operators data sources, data structures

45
State Modules
SteMP
SteMQ
SteMS
SteMR
Eddy
?
Q
Q
R
S
P
  • store state in separate State Modules (SteMs)
  • SteM like an index supports builds and probes
  • unifies caches, join state, rendezvous buffers
  • Eddy routes tuples through SteMs and data sources
  • gets results from sources, or probe/build into
    SteMs
  • joins and data access performed in the process of
    routing

46
HCI Motivation Why Interactivity?
  • Berry picking (Bates 90, 93)
  • user studies
  • decision support (Oday and Jeffries 93)
  • relevance feedback (Koenemann and Belkin 96)

47
BACKUP Disk data layout during juggling
  • performance goal
  • favor early results optimize Phase 1, at expense
    of Phase 2
  • spool sequential I/O, enrich random I/O
  • reordering in Phase 2 much easier than in Phase 1
  • enrich needs approx. index on side-disk
  • have hash-index on tuples, according to user
    interest
  • done at group granularity

48
BACKUP Other applications of reordering
  • can add reorder to any dataflow
  • will later discuss application for query systems
    that give aggressive partial results
  • also useful in batch query processing
  • sorting often used in query plans for performance
    gains
  • can replace by best-effort reordering
  • little performance hit, but plan is now
    pipelining
  • will not discuss further in this talk
  • spreadsheets

49
BACKUP Estimators for rowPriority, incremental
worthiness
  • easy, unless output row depends on new values
  • currently use average of all possible rows
  • could estimate a distribution instead

50
BACKUP Interactive query processing with
non-pipelining operators
  • basic techniques independent of pipelined
    operators.
  • reordering --- can be used in general query
    plans, although effectiveness may be hindered by
    blocking operators
  • State Modules applicable in general -- helps with
    adaptivity
  • much work on making plan operators pipelining --
    ripple join, xjoin, tuqwila
  • reordering itself can be used to avoid blocking
    sorts in some cases

51
Query processing using SteMs (1)
  • SteM operations -- works like index on a table
  • build(tuple) add tuple to SteM
  • probe(tuple) find all matches among build values
  • bounce back tuple if all matches not found and
    probe tuple not cached elsewhere
  • performing joins by routing through SteMs
  • index joins
  • regular synchronous index joins easy -- have no
    state
  • over distributed sources, lookups cached by
    building into SteM
  • separating cache allows Eddy to distinguish
    access cost
  • asynchronous index joins GW00
  • helps for high-throughput Internet sources
  • SteM acts as rendezvous buffer for pending probe
    results

52
Query processing using SteMs (2)
  • performing joins by routing through SteMs (contd)
  • doubly pipelined hash join
  • use two SteMs, one on each source
  • can also do extensions (e.g. Tuqwila IFF99),
    and even non-pipelined joins, using appropriate
    SteMs
  • thus can simulate any static query plan by
    appropriate routing

SteMS
SteMR
R probe
S probe
R bld.
S bld.
Eddy
R
S
53
Query processing using SteMs (3)
  • how to do this in general?
  • dont want Eddy to be a super-join
  • want it to route as per user desires and source
    properties, independent of any join algorithm
  • routing constraint
  • each tuple must have a non-zero probability of
    being sent to every appropriate module
  • can avoid repeated looping by marking tuples
  • Theorem any routing that satisfies above
    constraint will produce all query results and
    terminate
  • could get duplicates
  • prune at application, or
  • enforce atomicity in tuple routing

54
Benefits of SteMs
  • adaptivity
  • join algorithm and access path selection are
    determined by eddy routing
  • hence Eddy can dynamically choose these based on
    source properties
  • in fact, Eddy can do hybrid join algorithms
  • adapt to dynamic delays, etc.
  • work sharing -- for query evolution, competition
  • better information to the Eddy
  • speeds, memory consumption

55
BACKUP Lazy Evaluation
  • User navigates thru partial results
  • can see only a few at a time
  • exploit to delay processing
  • do on demand on the region user currently
    navigating over
  • things to delay
  • expensive source lookups (e.g. map)
  • joins (e.g. )
  • any other formatting etc.

56
Backup giving probabilistic results
  • view partial result as somethng that throws
    light on outer-join space
  • positive result probabilistic
  • negative result deterministic
  • benefit of positive result
  • weighted sum of cells it is likely to show
    penalty for false result
  • repudiation of earlier false results
  • benefit of negative result
  • direct benefit unlikely
  • repudiation of earlier false results
  • complications
  • many tuples may contribute to single output tuple
    -- aggregation

57
Probabilistic partial results
  • partial result gt may not apply all filters
  • still want to show result probabilistically
  • filters ill-specified
  • data on Internet often inconsistent
  • show all partial results full disjunction
    (Galindo-Legaria 94)
  • positive and negative results

Output Space
Outer Join Space
58
What does user see?
  • user navigates thru partial results
  • with complete row partial results
  • can either explore these results in detail
    (sort/scroll along columns as in spreadsheet)
  • compute approx. aggregates on them in online
    fashion
  • errors shrinking as more results come in
  • with probabilistic, incomplete results?
  • easy if primary key in partial result
  • what to show otherwise?

59
Quantifying benefit of partial results
  • partial result sheds light on outer-join space
  • positive result probabilistic
  • negative result deterministic
  • benefit of positive result
  • weighted sum of cells it is likely to show
  • penalty for false result ?
  • repudiation of earlier false results ?
  • benefit of negative result
  • direct benefit unlikely
  • repudiation of earlier false results ?

60
Why probabilistic results are good
  • data on Internet is often inconsistent
  • hence even exact processing cannot give perfect
    answers
  • people are used to sloppy answers
  • can allow negations in expert interfaces only

61
Data growth vs. Computer Speedup
  • Moores Law -- of transistors/chip doubles
    every 18 months (1965)
  • data growth

Source J. Porter, Disk/Trend, Inc.
(http//www.disktrend.com/pdf/portrpkg.pdf)
62
Disk Appetite, contd.
  • Greg Papadopoulos, CTO Sun
  • Disk sales doubling every 9 months
  • similar results from Winter VLDB survey
  • time to process all your data doubles every 18
    months!
Write a Comment
User Comments (0)
About PowerShow.com