Interactive Query Processing - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Interactive Query Processing
  • Vijayshankar Raman
  • Computer Science Division
  • University of California at Berkeley

2
Motivation: Nature of querying
  • querying = extracting information from data sets
  • different techniques in different settings
  • intrinsically slow
  • significant user-system interaction
  • information seekers work iteratively, gradually
    refining requests based on feedback [O'Day and
    Jeffries 93, Koenemann and Belkin 96]

3
Problems with traditional solutions
  • mismatch between system functionality and
    iterative mode of HCI
  • black box functionality
  • batch processing
  • frustrating delays in iterative process

Query: find average grade of students in each
college
4
Interactive processing
(diagram: query → process → exact answer)
  • HCI requirements
  • users must get continual feedback on results of
    processing
  • allow users to control processing based on prior
    feedback

5
Interactive processing
(diagram: query → process → result estimates)
  • HCI requirements
  • users must get continual feedback on results of
    processing
  • allow users to control processing based on prior
    feedback
  • performance goals
  • not to minimize time to give complete results
  • give continually improving partial results
  • adapt to dynamically specified performance goals

6
Background: Traditional Database Query Processing
declarative query: select R.a, S.b, T.c from R,
S, T where <conditions>
(diagram: the query optimizer turns the query into a query
plan over R, S, T, which the query executor then runs)
  • data access: scan-based or index-based
  • selection (σ) -- filter tuples based on a condition
  • join (⋈) -- apply a filter on the cross-product of
    the inputs
  • can use an index if available (index join)
  • else dynamically build hash tables on the inputs
    (hash join)
  • query optimizer
  • chooses plan operator implementations and
    ordering
  • cost model based on pre-computed summary
    statistics
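
For reference, a minimal Python sketch of the hash-join idea mentioned
above: build an in-memory hash table on one input, then stream the other
input past it. The function and variable names (hash_join, build_rows,
probe_rows) are illustrative assumptions, not part of the slides, and a
real executor would add spilling to disk, cost-based operator choice, etc.

    def hash_join(build_rows, probe_rows, build_key, probe_key):
        """Yield joined (build_row, probe_row) pairs."""
        table = {}
        for b in build_rows:                  # build phase
            table.setdefault(build_key(b), []).append(b)
        for p in probe_rows:                  # probe phase (pipelined)
            for b in table.get(probe_key(p), []):
                yield (b, p)

    # Example: join R and S on R.a = S.a
    R = [{"a": 1, "x": "r1"}, {"a": 2, "x": "r2"}]
    S = [{"a": 2, "y": "s1"}, {"a": 3, "y": "s2"}]
    print(list(hash_join(R, S, lambda r: r["a"], lambda s: s["a"])))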

7
My Research
  • interactive query processing
  • interactive data cleaning (Potter's
    Wheel)
  • assume dataflow thru pipelined operators
  • support for dynamic user control in traditional
    query proc. architectures
  • adaptively giving partial results in response to
    user control
  • still more aggressive adaptation
  • state modules

(diagram: alternative query plan trees over R, S, T)
8
Talk Outline
  • motivation and context
  • support for dynamic user control in traditional
    query proc. architectures
  • architecture for adaptively generating partial
    results
  • policy for generating partial results
  • user interface for displaying partial results
  • impact on routing
  • wrapup

9
Design goals in supporting user control
  • make minimal change to system architecture
  • must be independent of particular query
    processing algorithms
  • no delay in processing

10
Online Reordering (Raman et al. 99,00)
  • users perceive data being processed over time
  • prioritize processing for interesting tuples
  • interest based on user-specified preferences
  • reorder dataflow so that interesting tuples go
    first
  • encapsulate reordering as pipelined dataflow
    operator

(diagram: reorder operators inserted into the dataflow over R, S, T)
11
Context: an application of reordering
  • online aggregation [Hellerstein/Haas/Wang 97,
    Haas/Hellerstein 99]
  • for SQL aggregate queries, give gradually
    improving estimates
  • with confidence intervals
  • allow users to speed up estimate refinement for
    groups of interest
  • prioritize for processing at a per-group
    granularity

SELECT AVG(gpa) FROM students GROUP BY college
12
Online Aggregation Screenshot
SELECT AVG(gpa) FROM students GROUP BY college
13
Framework for Online Reordering
(diagram: a reorder operator permutes the produced stream
(acddbadb...) into a more useful order (abcdabc...) based on
user interest f(t), before processing / network transfer)
  • want no delay in processing
  • in general, reordering can only be best-effort
  • typically process/consume slower than produce
  • exploit throughput difference to reorder
  • two aspects
  • mechanism for best-effort reordering
  • reordering policy

14
Juggle: mechanism for reordering
(diagram: produce → prefetch → buffer → getNext →
process/consume; spool and enrich move tuples between the
buffer and a side disk)
  • two threads -- prefetch from input
    -- spool/enrich from auxiliary side disk
  • juggle data between buffer and side disk
  • keep buffer full of interesting items
  • getNext chooses best item currently on buffer
  • getNext, enrich/spool decisions -- based on
    reordering policy
  • side disk management
  • hash index, populated in a way that postpones
    random I/O
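
A minimal, single-threaded Python sketch of the juggle idea (bounded
buffer, side-disk stand-in, getNext preferring the most interesting
buffered item). The class and method names are illustrative assumptions,
not the Informix/Telegraph code; the real juggle runs prefetch and
spool/enrich in separate threads and keeps an on-disk hash index.

    import collections, heapq

    class Juggle:
        """Best-effort reorder operator: keep the buffer full of the
        items the reordering policy currently ranks highest."""
        def __init__(self, interest, buffer_size=1000):
            self.interest = interest      # tuple -> priority, per the policy
            self.buffer_size = buffer_size
            self.buffer = []              # max-heap via negated priority
            self.side = collections.defaultdict(collections.deque)  # side-disk stand-in

        def put(self, tup):
            """Prefetch path: called when the producer emits a tuple."""
            if len(self.buffer) < self.buffer_size:
                heapq.heappush(self.buffer, (-self.interest(tup), id(tup), tup))
            else:
                self.side[self.interest(tup)].append(tup)   # spool when full

        def get_next(self):
            """Consumer path: return the best currently available tuple."""
            if not self.buffer:
                self._enrich()
            if not self.buffer:
                return None
            return heapq.heappop(self.buffer)[2]

        def _enrich(self):
            """Pull the most interesting spooled group back into the buffer
            (here only when the buffer runs dry; the real juggle enriches
            continuously, to the extent of the throughput difference)."""
            if not self.side:
                return
            best = max(self.side)
            bucket = self.side[best]
            while bucket and len(self.buffer) < self.buffer_size:
                t = bucket.popleft()
                heapq.heappush(self.buffer, (-self.interest(t), id(t), t))
            if not bucket:
                del self.side[best]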

15
Reordering policies
goal: a good permutation of the items t1...tn into t'1...t'n
(graph: QOF vs. time, with the GOAL curve rising as fast as
possible)
  • quality of feedback for a prefix t'1, t'2, ..., t'k
  • QOF(UP(t'1), UP(t'2), ..., UP(t'k)), UP = user
    preference
  • determined by application
  • goodness of reordering = dQOF/dt
  • implication for juggle mechanism
  • process gets item from buffer that increases QOF
    the most
  • juggle tries to maintain buffer with such items
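
Written out as formulas (my reconstruction of the slide's notation,
not verbatim from the deck):

    % feedback quality after the consumer has seen the prefix t'_1,...,t'_k,
    % where UP(t) denotes the user's preference for tuple t
    \mathrm{QOF}_k = \mathrm{QOF}\bigl(UP(t'_1), UP(t'_2), \ldots, UP(t'_k)\bigr)

    % goodness of a reordering: how quickly feedback quality improves over time
    \text{goodness} = \frac{d\,\mathrm{QOF}}{dt}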

16
QOF in Online Aggregation
  • avg. weighted confidence interval
  • preference acts as weight on confidence interval
  • QOF ∝ Σi UPi / √ni , ni = number of tuples
    processed from group i
  • process pulls items from the group with max
    UPi / (ni √ni)
  • desired ratio of group i tuples on buffer =
    UPi^(2/3) / Σj UPj^(2/3)
  • juggle tries to maintain this by enrich/spool
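
A small Python sketch of this policy, using the formulas above (my
reconstruction; the function names and the UP/n dictionaries are
illustrative):

    import math

    def next_group(up, n):
        """Pick the group whose next tuple shrinks the preference-weighted
        confidence intervals fastest: argmax_i UP_i / (n_i * sqrt(n_i))."""
        return max(up, key=lambda i:
                   up[i] / (n[i] * math.sqrt(n[i])) if n[i] else float("inf"))

    def desired_buffer_ratios(up):
        """Target fraction of buffer slots per group:
        UP_i^(2/3) / sum_j UP_j^(2/3)."""
        w = {i: p ** (2.0 / 3.0) for i, p in up.items()}
        total = sum(w.values())
        return {i: wi / total for i, wi in w.items()}

    # Example: the user has doubled interest in group 'A'
    up = {"A": 2.0, "B": 1.0, "C": 1.0}
    n = {"A": 10, "B": 10, "C": 10}
    print(next_group(up, n))           # -> 'A'
    print(desired_buffer_ratios(up))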

17
Other QOF functions
  • rate of processing (for a group) ∝ preference
  • QOF ∝ Σi (ni − n·UPi)²  (variance from ideal
    proportions)
  • process pulls items from the group with max
    (n·UPi − ni)
  • desired ratio of group i tuples in buffer = UPi
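
A matching sketch for this rate-proportional variant (again a
reconstruction with illustrative names; UP is assumed normalized to
sum to 1):

    def next_group_rate_proportional(up, n):
        """Keep each group's processed count near its ideal share:
        argmax_i (n_total * UP_i - n_i)."""
        n_total = sum(n.values())
        return max(up, key=lambda i: n_total * up[i] - n[i])

    # Example: group 'B' has fallen behind its ideal share (30 of 100)
    up = {"A": 0.5, "B": 0.3, "C": 0.2}
    n = {"A": 60, "B": 20, "C": 20}
    print(next_group_rate_proportional(up, n))   # -> 'B'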

18
Results: Reordering in Online Aggregation
  • implemented in Informix UDO server
  • experiments with modified TPC-D queries
  • questions
  • how much throughput difference is needed for
    reordering?
  • can reordering handle skewed data?
  • one stress test: skew, very small processing cost
  • index-only join
  • 5 order priorities, Zipf distribution

SELECT AVG(o_totalprice), o_orderpriority FROM
order WHERE EXISTS (SELECT * FROM lineitem
WHERE l_orderkey = o_orderkey) GROUP BY
o_orderpriority
(plan diagram: scan → juggle → index → process → consume)
19
Performance results
(graph: tuples processed vs. time, per group)
  • without reordering

20
Performance results
(graph: tuples processed vs. time, per group)
  • 3 times faster for interesting groups
  • overhead: 2% in completion time, 1 extra disk

21
Performance results
(graph: confidence interval vs. time for groups A, C, E)
22
Overall Findings
  • higher processing costs
  • index/hash join, subquery,
  • reordering easy
  • very low processing costs
  • juggle constrained by density of interesting
    tuples
  • outlier groups hard to speed up
  • better to use index stride [Hellerstein/Haas/Wang
    97]
  • needs pre-computed index
  • reordering becomes easier over time
  • question to answer
  • where to place juggle?

23
Outline
  • motivation and context
  • support for dynamic user control in traditional
    query proc. architectures
  • architecture for adaptively generating partial
    results
  • policy for generating partial results
  • user interface for displaying partial results
  • impact on routing
  • wrapup

24
Incremental results
  • traditional architectures also generate continual
    result tuples
  • arises from continual dataflow thru pipelining
    operators
  • much work on pipelining joins [Wilschut/Apers 91,
    Haas/Hellerstein 99, Ives et al. 99,
    Urhan/Franklin 00]
  • this is too rigid
  • especially in distributed environments

25
Context: Query processing in Telegraph
  • Telegraph: an adaptive dataflow system to query
    diverse, distributed sources
  • much data available as services over the Internet
  • currently only accessible by browse/search/forms
  • want to combine this data through queries
  • examples
  • campaign finance information (Election 2000):
    Federal Election Commission, Yahoo Home
    Prices, Census, APBnews Crime Ratings,
    Maps, IMDB
  • restaurant information:
    SwitchBoard, Fodors, S.F. Chronicle,
    MapQuest, health inspection reports

26
Partial results
  • complete result tuples too rigid
  • source latencies high, diverse
  • dynamic source variations, delays
  • query does not capture user desires
  • non-expert and even expert users query too broad

27
Partial results
  • complete tuples too rigid
  • source latencies high, diverse
  • dynamic source variations, delays
  • query does not capture user desires
  • non-expert and even expert users query too broad
  • want to process queries flexibly
  • give partial result tuples asap
  • adapt dynamically to user preferences and source
    variations

28
Correctness of partial results
  • some columns essential for UI
  • e.g. group-by columns / sort-by columns
  • for maximum flexibility -- outer-join semantics
  • good idea for Web sources
  • or -- strict join semantics
  • no partial results without ensuring match exists
  • key constraints helpful
  • aggregates
  • update early, and compensate later
  • statistical guarantees for aggregates difficult
  • fanouts unknown (can be 0!)
  • key constraints helpful
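
One way to read "update early, and compensate later" for aggregates is
a running aggregate that can retract a contribution when the partial
result it came from turns out not to join. A hedged Python sketch
(illustrative names, not Telegraph code):

    class CompensatingAvg:
        """Running AVG that applies updates eagerly and compensates
        (retracts) them if the contributing tuple is later invalidated."""
        def __init__(self):
            self.total = 0.0
            self.count = 0

        def update(self, value):
            self.total += value
            self.count += 1

        def compensate(self, value):
            """Undo an earlier update whose tuple turned out not to qualify."""
            self.total -= value
            self.count -= 1

        def current(self):
            return self.total / self.count if self.count else None

    avg = CompensatingAvg()
    avg.update(3.0); avg.update(4.0)    # early, optimistic updates
    avg.compensate(4.0)                 # later found not to match
    print(avg.current())                # -> 3.0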

29
Dynamic query plans
  • Eddy [Avnur and Hellerstein 2000]
  • router for directing data thru modules
  • minimize completion time
  • adaptively choosing join order for arbitrary
    tuple
  • all partial tuples generable

30
Partial Results in Dynamic Plans
(diagram: an Eddy routing tuples from inputs through modules)
  • my focus: continual partial results
  • dynamically adapt dataflow to suit user
    preferences
  • Eddy must decide
  • what tuple to route next
  • where to route it
  • based on user preferences and module properties
  • need routing policy and a reordering mechanism
  • eddy memory buffer and module queues bounded

31
Prioritizing tuples
(diagram: an Eddy enhanced with a juggle, routing tuples from
inputs through modules)
  • enhance Eddy with Juggle
  • reorder exploiting slowness of modules
  • synergy -- juggle location problem solved!
  • juggle everywhere, to the extent of throughput
    difference

32
Routing and reordering policy
(graph: QOF vs. time, with the GOAL curve)
  • GOAL: at any time, route to maximize dQOF/dt =
    benefit of sending tuple to module / cost
  • cost: estimated from data rates to/from the module
  • benefit: dependent on application and UI
  • how partial results impact the UI
  • user preferences
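
A minimal Python sketch of that routing rule, assuming each candidate
module exposes an estimated per-tuple cost and we can estimate the
benefit (expected ΔQOF) of sending a tuple there. Everything here is an
illustrative reconstruction, not the Telegraph Eddy:

    from dataclasses import dataclass

    @dataclass
    class Module:
        name: str
        cost: float        # estimated per-tuple cost, e.g. from observed data rates

    def route(tup, modules, benefit):
        """Route the tuple to the module maximizing expected dQOF/dt,
        i.e. (expected benefit of sending tup there) / (its cost)."""
        return max(modules, key=lambda m: benefit(tup, m) / m.cost)

    # Toy example: the cheaper probe wins unless its benefit is much lower
    mods = [Module("crime_index", cost=5.0), Module("income_index", cost=1.0)]
    benefit = lambda tup, m: {"crime_index": 2.0, "income_index": 1.5}[m.name]
    print(route({"state": "CA"}, mods, benefit).name)   # -> 'income_index'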

33
Outline
  • motivation and context
  • online reordering for user prioritization of
    partial results
  • architecture for adaptively generating partial
    results
  • policy for generating partial results
  • Telegraph UI displaying results and inferring
    preferences
  • experimental results
  • wrapup

34
Telegraph UI
  • screenshot

35
Getting user preferences
  • infer from navigation
  • row/column scrolling, group drill down and rollup
  • prioritize visible rows/columns
  • query evolution
  • subset one-size-fits-all queries
  • future work: query expansion
  • explicit
  • up/down buttons on columns
  • at the cell level -- needed for some expensive
    sources
  • map to QOF metric on partial results

36
QOF: Benefit of a partial result
  • depends on user preferences
  • benefit of updating a cell in the output =
  • row priority x column weight x Δ cell resolution
  • incremental cell resolution: how much does one
    extra update add to the cell's value
  • scalars -- 1
  • aggregations -- change in confidence interval
  • informing user about execution progress
  • convey cell resolution on UI
  • future work

37
QOF: Benefit of Routing a Tuple
  • route a tuple according to expected benefit and
    cost
  • benefit of sending a tuple t to a module M and
    forming result set T = Σ over cells c ∈ T of the
    benefit of updating c

38
Throughput difference
  • Bush Contributors ⋈ Income (index) ⋈ AFB
    Crime Ratings (index)

(graph: number of tuples read vs. time (s) for Bush
Contributors, Income, and Crime Ratings)
39
Benefit of giving partial results
  • Bush Contributors ⋈ Income (index) ⋈ AFB
    Crime Ratings (index)

(graph: number of partial results vs. complete results over
time (s), with Bush Contributors, Income, and Crime Ratings
curves)
40
Effect of Delays
Bush Contributors ⋈ Income (index) ⋈ AFB
Crime Ratings (index), with a delay in AFB Crime
Ratings
(graph: number of partial and complete results (log scale) vs.
time (s), showing the effect of the delay)
41
Prioritizing particular values
  • SELECT AVG(Income) FROM Bush Donors, Census
    GROUP BY BushDonors.State
  • scrolling: AZ, AR, CA, CO, CT,
    LA, KY, MA, MD, MI, TX, UT,
    VA, VI, VT

(graph: aggregate updates vs. time (s), per state group)
42
(slides 42 and 43 repeat the same query and aggregate-updates
chart at later points in the scrolling sequence)
44
Distribution of contributions
(bar chart: number of Bush Contributors per state)
45
Outline
  • motivation and context
  • online reordering for user prioritization of
    partial results
  • architecture for adaptively generating partial
    results
  • policy for generating partial results
  • more aggressive adaptation state modules
  • wrapup

46
Granularity of Query Operators
  • relational operators = logical abstractions
  • encapsulate multiple physical effects
  • inflexible in handling unexpected changes
  • cannot gracefully adapt
  • access method/data source selection
  • join algorithm selection
  • resource allocation, e.g. memory
  • delays [Query Scrambling, XJoin]
  • want to encapsulate at the level of physical operators

47
State Modules: Elevator Pitch
(diagram: an Eddy routing tuples among a hash join, an index
join, and state modules)
  • isolate state in State Modules
  • work sharing
  • routing flexibility
  • query execution = routing
  • adapt access methods, join algorithms gracefully
  • directly measure and adapt resource consumption

48
Outline
  • motivation and context
  • online reordering for user prioritization of
    partial results
  • architecture for generating more aggressive
    partial results
  • policy for generating partial results
  • more aggressive adaptation state modules
  • wrapup

49
Related Work
  • information retrieval
  • ranked retrieval, relevance feedback
  • search strategies, Berry Picking
  • incremental query processing
  • pipelining hash joins (Haas/Hellerstein 99,
    Ives et al. 99, Urhan/Franklin 00)
  • top N / fast-first queries (Carey/Kossmann 97,
    Antoshenkov/Ziauddin 96)
  • adaptivity
  • parametric query plans (Graefe/Cole 94)
  • mid-query reoptimization (Urhan/Franklin/Amsaleg 98,
    Kabra/DeWitt 98, Ives et al. 99)
  • competition (Antoshenkov/Ziauddin 96)
  • miscellaneous
  • precomputed summaries (OLAP, materialized views,
    AQUA)
  • parachute queries (Bonnet/Tomasic 98)
  • APPROXIMATE (Vrbsky/Liu 93)

50
Summary
  • applications and query processors need tighter
    coupling
  • online reordering
  • effective way of supporting dynamic user control
  • partial results as desired
  • embed reordering within dynamically controlled
    dataflow
  • hard to map user-interaction needs into concrete
    algorithm performance goals
  • wanted: benchmarks based on user/application
    traces

51
Future Work
  • session-granularity query processing
  • query evolution
  • lazy evaluation: user/client navigates through
    partial results
  • helps thin clients
  • closer dbms-application interaction
  • evaluation tools
  • benchmarks driven by traces/user studies
  • for more information
  • http://telegraph.cs.berkeley.edu/
  • http://control.cs.berkeley.edu/

52
Prioritizing particular columns (1)
  • how do we prioritize external sources?
  • increase the number of threads used to probe them
  • resource scheduling rather than data scheduling
  • if the eddy sends na tuples to A, nb to B, per second
  • na < threadsa / latency(A)
  • nb < threadsb / latency(B)
  • threadsa + threadsb < number of threads per
    query
  • GOAL: maximize na·ΔQOF(A) + nb·ΔQOF(B)
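
A toy Python sketch of this allocation problem (my reconstruction; it
assumes each source is driven at its thread-limited rate
n_i = threads_i / latency_i and simply enumerates integer thread splits,
which is enough for the two-source case on the slide):

    import itertools

    def allocate_threads(latency, dqof, total_threads):
        """Maximize sum_i (threads_i / latency_i) * dQOF_i subject to
        sum_i threads_i <= total_threads (the per-query thread budget)."""
        sources = list(latency)
        best, best_value = None, float("-inf")
        for split in itertools.product(range(total_threads + 1),
                                       repeat=len(sources)):
            if sum(split) > total_threads:
                continue
            value = sum(t / latency[s] * dqof[s]
                        for s, t in zip(sources, split))
            if value > best_value:
                best, best_value = dict(zip(sources, split)), value
        return best

    # Example: source A is slower (higher latency) but its results matter more
    print(allocate_threads(latency={"A": 2.0, "B": 0.5},
                           dqof={"A": 5.0, "B": 1.0},
                           total_threads=6))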

53
Prioritizing particular columns (2)
  • with equal number of threads (3 each)

(graph: number of tuples read vs. time (s) for Income and
Crime Ratings)
54
Prioritizing particular columns (2)
  • choosing number of threads according to
    priorities (5 and 1)

(graph: number of tuples read vs. time (s) for Income and
Crime Ratings)
55
Prioritizing particular columns (3)
  • more generally, optimize with multiple resource
    constraints
  • netizenship
  • s
  • future work

56
Telegraph UI
  • partial results displayed on screen as they
    arrive
  • multi-resolution spreadsheet
  • values clustered into groups
  • can roll-up or drill-down to see different
    granularities
  • different columns visible at different drilldown
    levels
  • client has hash table mapping groups to values
  • navigation
  • vertical/horizontal scrolling
  • rollup/drilldown

57
HCI Motivation: Why Interactivity?
  • Berry picking (Bates 90, 93)
  • user studies
  • decision support (O'Day and Jeffries 93)
  • relevance feedback (Koenemann and Belkin 96)

58
Disk data layout during juggling
  • performance goal
  • favor early results: optimize Phase 1, at the expense
    of Phase 2
  • spool = sequential I/O, enrich = random I/O
  • reordering in Phase 2 is much easier than in Phase 1
  • enrich needs an approx. index on the side disk
  • have a hash index on tuples, according to user
    interest
  • done at group granularity

59
Other applications of reordering
  • can add a reorder operator to any dataflow
  • will later discuss application for query systems
    that give aggressive partial results
  • also useful in batch query processing
  • sorting is often used in query plans for performance
    gains
  • can be replaced by best-effort reordering
  • little performance hit, but the plan is now
    pipelined
  • will not discuss further in this talk
  • spreadsheets

60
Telegraph-Client API
(diagram: the client sends a SQL query and preferences to
Telegraph; Telegraph streams back partial results)
  • preferences
  • col priorities
  • <predicate, row priority, row-col priorities>
  • e.g. <1, 1.0, none>, <2, 1.7, none>,
    <3, 1.3, map=high>, <8/33014, 1.0,
    none>, <8/60610, 1.0, none>

61
Granularity of Reordering/Routing
  • reorder and route tuples at the granularity of a
    group
  • group = <base relations, predicate>
  • groups created and deleted dynamically, as the user
    navigates in the UI
  • group predicate may be a subset of the
    application-specified row predicate
  • final row depends on values this tuple joins with

62
Interactive query processing with non-pipelining
operators
  • basic techniques are independent of pipelined
    operators
  • reordering --- can be used in general query
    plans, although effectiveness may be hindered by
    blocking operators
  • State Modules applicable in general -- help with
    adaptivity
  • much work on making plan operators pipelined --
    ripple join, XJoin, Tukwila
  • reordering itself can be used to avoid blocking
    sorts in some cases

63
Query processing using SteMs (1)
  • SteM operations -- a SteM works like an index on a
    table
  • build(tuple): add the tuple to the SteM
  • probe(tuple): find all matches among built tuples
  • bounce back the tuple if all matches are not found
    and the probe tuple is not cached elsewhere
  • performing joins by routing through SteMs
  • index joins
  • regular synchronous index joins are easy -- have no
    state
  • over distributed sources, lookups are cached by
    building into a SteM
  • separating the cache allows the Eddy to distinguish
    access cost
  • asynchronous index joins [GW 00]
  • helps for high-throughput Internet sources
  • SteM acts as a rendezvous buffer for pending probe
    results
64
Query processing using SteMs (2)
  • performing joins by routing through SteMs (cont'd)
  • doubly pipelined hash join
  • use two SteMs, one on each source
  • can also do extensions (e.g. Tukwila [IFF 99]),
    and even non-pipelined joins, using appropriate
    SteMs
  • thus can simulate any static query plan by
    appropriate routing

(diagram: the Eddy routes R and S tuples to build into SteM_R /
SteM_S and to probe the opposite SteM)
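
Using the SteM class sketched under slide 63, a doubly pipelined hash
join then falls out of a simple routing rule: build each arriving tuple
into its own relation's SteM and probe the other one (a reconstruction
of the slide's idea, not Telegraph code):

    # uses the SteM class sketched above (slide 63)
    stem_r = SteM(key=lambda r: r["a"])
    stem_s = SteM(key=lambda s: s["a"])

    def on_arrival(tup, source):
        """Handle one tuple from R or S; matches stream out as they appear."""
        if source == "R":
            stem_r.build(tup)
            return [(tup, m) for m in stem_s.probe(tup, probe_key=lambda r: r["a"])]
        else:
            stem_s.build(tup)
            return [(m, tup) for m in stem_r.probe(tup, probe_key=lambda s: s["a"])]

    # Tuples may arrive in any interleaving
    print(on_arrival({"a": 1, "x": "r1"}, "R"))   # -> [] (no matching S tuple yet)
    print(on_arrival({"a": 1, "y": "s1"}, "S"))   # -> one joined pair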
65
Query processing using SteMs (3)
  • how to do this in general?
  • don't want the Eddy to be a super-join
  • want it to route as per user desires and source
    properties, independent of any join algorithm
  • routing constraint
  • each tuple must have a non-zero probability of
    being sent to every appropriate module
  • can avoid repeated looping by marking tuples
  • Theorem: any routing that satisfies the above
    constraint will produce all query results and
    terminate
  • could get duplicates
  • prune at application, or
  • enforce atomicity in tuple routing

66
Benefits of SteMs
  • adaptivity
  • join algorithm and access path selection are
    determined by eddy routing
  • hence Eddy can dynamically choose these based on
    source properties
  • in fact, Eddy can do hybrid join algorithms
  • adapt to dynamic delays, etc.
  • work sharing -- for query evolution, competition
  • better information to the Eddy
  • speeds, memory consumption

67
Lazy Evaluation
  • User navigates thru partial results
  • can see only a few at a time
  • exploit to delay processing
  • do work on demand for the region the user is
    currently navigating over
  • things to delay
  • expensive source lookups (e.g. map)
  • joins (e.g. )
  • any other formatting etc.

68
Giving probabilistic results
  • view a partial result as something that throws
    light on the outer-join space
  • positive result: probabilistic
  • negative result: deterministic
  • benefit of positive result
  • weighted sum of cells it is likely to show
  • penalty for false results
  • repudiation of earlier false results
  • benefit of negative result
  • direct benefit unlikely
  • repudiation of earlier false results
  • complications
  • many tuples may contribute to single output tuple
    -- aggregation

69
Probabilistic partial results
  • partial result => may not apply all filters
  • still want to show result probabilistically
  • filters ill-specified
  • data on Internet often inconsistent
  • show all partial results: full disjunction
    (Galindo-Legaria 94)
  • positive and negative results

(diagram: the output space as part of the larger outer-join
space)
70
What does the user see?
  • user navigates thru partial results
  • with complete-row partial results
  • can either explore these results in detail
    (sort/scroll along columns as in a spreadsheet)
  • or compute approx. aggregates on them in an online
    fashion
  • errors shrinking as more results come in
  • with probabilistic, incomplete results?
  • easy if the primary key is in the partial result
  • what to show otherwise?

71
Quantifying benefit of partial results
  • a partial result sheds light on the outer-join space
  • positive result: probabilistic
  • negative result: deterministic
  • benefit of positive result
  • weighted sum of cells it is likely to show
  • penalty for false result ?
  • repudiation of earlier false results ?
  • benefit of negative result
  • direct benefit unlikely
  • repudiation of earlier false results ?

72
Why probabilistic results are good
  • data on Internet is often inconsistent
  • hence even exact processing cannot give perfect
    answers
  • people are used to sloppy answers
  • can allow negations in expert interfaces only

73
Data growth vs. Computer Speedup
  • Moore's Law -- # of transistors/chip doubles
    every 18 months (1965)
  • data growth

Source: J. Porter, Disk/Trend, Inc.
(http://www.disktrend.com/pdf/portrpkg.pdf)
74
Disk Appetite, cont'd.
  • Greg Papadopoulos, CTO of Sun:
  • disk sales doubling every 9 months
  • similar results from Winter VLDB survey
  • time to process all your data doubles every 18
    months!