Dynamic Sample Selection for Approximate Query Processing: Brian Babcock, Surajit Chaudhuri

1
Dynamic Sample Selection for Approximate Query Processing
Brian Babcock, Surajit Chaudhuri, Gautam Das
  • Presented by Mariam John
  • CSE 6392
  • 02/14/2006

2
Contents
  • Introduction
  • Dynamic Sample Selection
  • Policies for Sample Selection
  • Small Group Sampling
  • Pre-Processing Phase
  • Summary

3
Why do we do Approximate Query Processing?
  • Multi-gigabyte data repositories
  • Data Analysis Application
  • Data mining
  • Decision Support Analysis
  • Fast query response time
  • Acceptability of inexact query response

4
Problem
  • Constructing an optimal sample that represents
    the underlying data well.
  • Uniform sampling
  • Non-uniform sampling

5
Non-uniform sampling
  • Purpose is to produce more accurate results
    across a particular set of queries.
  • For those queries, gives closer approximations
    than uniform sampling; for other queries it can
    do worse.
  • Optimal bias differs from query to query.
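
A minimal sketch of how a biased sample can still give unbiased counts: each sampled row is weighted by the inverse of the rate at which it was sampled. The (group, rate) row layout, the group names, and the rates are illustrative assumptions, not taken from the presentation.

```python
from collections import defaultdict

def estimate_group_counts(sample_rows):
    """Estimate COUNT(*) per group from a non-uniform sample.

    Each sampled row is a (group, sampling_rate) pair; a row kept with
    probability p stands in for 1/p rows of the base table.
    """
    estimates = defaultdict(float)
    for group, rate in sample_rows:
        estimates[group] += 1.0 / rate
    return dict(estimates)

# Hypothetical sample: the large group was sampled at 1%,
# the rare group was kept in full (rate 1.0).
sample = [("CA", 0.01)] * 50 + [("WY", 1.0)] * 7
counts = estimate_group_counts(sample)
```

With these made-up numbers, the 50 sampled "CA" rows each stand for about 100 base rows, while the 7 "WY" rows are counted exactly.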

6
Dynamic Sample Selection
Standard Sampling

7
Dynamic Sample Selection
  • Pre-Processing Phase

[Figure: pre-processing phase. Labels: Query Workload, Sample Data, Select Strata, Build Sample, Data, Metadata]

8
Dynamic Sample Selection
  • Runtime Phase

[Figure: runtime phase. Labels: Query, Sample Data, Choose Samples, Rewrite Query, Metadata]

9
Dynamic Sample Selection
  • How to identify the set of biased samples to be
    created? Decided during the pre-processing phase.
  • How to determine which of the various samples to
    use to answer a query? Decided during the runtime
    phase.
  • The simplest and most efficient strategy is to let
    the syntax of the incoming query guide the choice
    of samples.
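
A minimal sketch of syntax-guided selection, assuming the metadata is a simple mapping from a column name to the sample table built for that column. The function name `choose_samples` and the table names are hypothetical.

```python
def choose_samples(group_by_columns, metadata, overall="s_overall"):
    """Pick sample tables for a query from its GROUP BY columns alone.

    metadata maps a column name to the sample table built for it
    (hypothetical names). The overall sample is always included so that
    groups not covered by any column-specific table still get an answer.
    """
    chosen = [metadata[c] for c in group_by_columns if c in metadata]
    return chosen + [overall]

# Hypothetical metadata produced by the pre-processing phase.
metadata = {"state": "sg_state", "product": "sg_product"}
tables = choose_samples(["state", "year"], metadata)
```

Only the query text is inspected here, so the choice costs a few dictionary lookups and no data access.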

10
Small Group Sampling
  • Specific dynamic sample selection technique which
    targets aggregate queries with group-bys.
  • Small group sampling approach
  • Overall sample: a uniform sample, used for the
    large groups.
  • Small group tables: one or more sample tables for
    the small groups.

11
Small Group Sampling
  • The set of small groups depends on the query's
    grouping columns and selection predicates.

12
Small Group Sampling
  • Idea behind Small Group Sampling
  • Determine for which values in each column to
    create small group tables.
  • Create small group tables for each column of a
    table along with the overall sample.
  • During runtime, choose a subset of sample tables
    to answer a query most accurately.
  • Query is rewritten to run against the sample
    tables instead of the base tables.
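
A minimal sketch of the rewriting step for a single-column COUNT(*) query, run against SQLite so it can be tried directly. The table names (`sg_state`, `s_overall`), the scale factor, and the `NOT IN` filter are assumptions made for illustration; the filter stands in for the bitmask field described under the pre-processing phase.

```python
import sqlite3

def rewrite_count_query(group_col, small_table, overall_table, scale):
    """Rewrite SELECT g, COUNT(*) FROM base GROUP BY g to use sample tables.

    Rows in the small group table are counted exactly; rows in the overall
    sample are scaled by the inverse sampling rate. The overall-sample
    branch skips values already covered by the small group table.
    Identifiers are interpolated directly, so pass trusted names only.
    """
    return (
        f"SELECT {group_col}, COUNT(*) AS cnt FROM {small_table} "
        f"GROUP BY {group_col} "
        f"UNION ALL "
        f"SELECT {group_col}, COUNT(*) * {scale} AS cnt FROM {overall_table} "
        f"WHERE {group_col} NOT IN (SELECT {group_col} FROM {small_table}) "
        f"GROUP BY {group_col}"
    )

# Tiny hypothetical sample tables: 'WY' is a small group kept in full,
# and the overall sample is assumed to be a 1% sample (scale 100).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sg_state (state TEXT);
    CREATE TABLE s_overall (state TEXT);
    INSERT INTO sg_state VALUES ('WY'), ('WY'), ('WY');
    INSERT INTO s_overall VALUES ('CA'), ('CA'), ('WY');
""")
sql = rewrite_count_query("state", "sg_state", "s_overall", scale=100)
result = dict(conn.execute(sql).fetchall())
```

The 'WY' row in the overall sample is ignored because the small group table already holds every 'WY' row, so that group is answered exactly while 'CA' is estimated.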

13
Pre-processing Phase
  • For every column, identify the rare values within
    it and create small group tables.
  • Pre-processing phase produces three outputs
  • Overall sample table
  • Small group tables
  • Metadata table
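
A minimal sketch of this phase on in-memory rows. The rule for deciding which values are rare (the most frequent values are "common" until they cover a fixed fraction of the rows), the `small_fraction` and `sample_every` parameters, and the metadata layout are all assumptions for illustration. Taking every k-th row is a deterministic stand-in for a uniform random sample.

```python
from collections import Counter

def find_rare_values(values, small_fraction):
    """Return the values treated as rare for one column.

    Assumed rule: keep the most frequent values as common until they cover
    at least (1 - small_fraction) of the rows; everything else is rare.
    """
    counts = Counter(values)
    needed = len(values) * (1.0 - small_fraction)
    covered = 0
    common = set()
    for value, n in counts.most_common():
        if covered >= needed:
            break
        common.add(value)
        covered += n
    return set(counts) - common

def preprocess(rows, columns, small_fraction, sample_every):
    """Build the three outputs: overall sample, small group tables, metadata."""
    # Every k-th row stands in for a uniform random sample.
    overall = rows[::sample_every]
    small_tables, metadata = {}, {}
    for col in columns:
        rare = find_rare_values([r[col] for r in rows], small_fraction)
        if rare:
            # All rows holding a rare value go into this column's table.
            small_tables[col] = [r for r in rows if r[col] in rare]
            metadata[col] = {"table": f"sg_{col}", "rare_values": rare}
    return overall, small_tables, metadata

# Hypothetical base table: 95 rows for 'CA', 5 rows for 'WY'.
rows = ([{"state": "CA", "product": "pen"}] * 95
        + [{"state": "WY", "product": "pen"}] * 5)
overall, small_tables, metadata = preprocess(rows, ["state", "product"], 0.10, 10)
```

In this toy input only `state` gets a small group table; `product` has a single value, so nothing in it is rare.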

14
Pre-processing Phase
  • Rows can appear in multiple sample tables.
  • Bitmask field is used to identify the set of
    sample tables to which a row was added.
  • Avoids double counting of rows assigned to
    multiple sample tables.
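
A minimal sketch of how such a bitmask might be used, assuming one bit per column's small group table and a "lowest used bit wins" rule for deciding which table a shared row is read from. The function names, the bit assignment, and the rule itself are illustrative assumptions.

```python
def tag_rows(rows, rare_values_by_column):
    """Attach a bitmask to each row: bit i is set if the row belongs to the
    small group table of the i-th column (dict order fixes bit positions)."""
    columns = list(rare_values_by_column)
    tagged = []
    for row in rows:
        mask = 0
        for i, col in enumerate(columns):
            if row[col] in rare_values_by_column[col]:
                mask |= 1 << i
        tagged.append({**row, "mask": mask})
    return tagged, columns

def count_without_double_counting(tagged, columns, used_columns):
    """Count rows across the small group tables of used_columns, once each.

    A row is read from the table of its lowest set bit among the used
    tables and skipped in the others.
    """
    used_bits = [columns.index(c) for c in used_columns]
    total = 0
    for bit in used_bits:
        earlier = sum(1 << b for b in used_bits if b < bit)
        for row in tagged:
            in_this_table = row["mask"] & (1 << bit)
            already_counted = row["mask"] & earlier
            if in_this_table and not already_counted:
                total += 1
    return total

# Hypothetical rows: 'WY' is rare in state, 'ink' is rare in product.
rows = [
    {"state": "WY", "product": "ink"},  # in both small group tables
    {"state": "WY", "product": "pen"},  # state table only
    {"state": "CA", "product": "ink"},  # product table only
    {"state": "CA", "product": "pen"},  # neither table
]
tagged, columns = tag_rows(rows, {"state": {"WY"}, "product": {"ink"}})
masks = [r["mask"] for r in tagged]
total = count_without_double_counting(tagged, columns, ["state", "product"])
```

Summing the two tables naively would count the first row twice (2 + 2 rows); the mask check keeps it in the `state` table only.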

15
Summary
  • Dynamic Sample Selection
  • Takes advantage of available disk space
  • Creates multiple biased sample tables during the
    pre-processing phase
  • Picks best samples during runtime for query
    processing.
  • Small Group Sampling
  • Notion is to treat large and small groups
    differently
  • Creates an overall sample table for large groups
    and a number of small group tables for the rare
    values in each column.