Fast Computation of Database Operations using Graphics Processors - PowerPoint PPT Presentation

1
Fast Computation of Database Operations using
Graphics Processors
  • Naga K. Govindaraju
  • Univ. of North Carolina
  • Modified By,
  • Mahendra Chavan for CS632

2
Goal
  • Utilize graphics processors for fast computation
    of common database operations

3
Motivation: Fast operations
  • Increasing database sizes
  • Faster processor speeds, but little improvement in
    query execution time, because of:
  • Memory stalls
  • Branch mispredictions
  • Resource stalls, e.g. instruction dependencies
  • Utilize the available architectural features and
    exploit parallel execution possibilities

4
Graphics Processors
  • Present in almost every PC
  • Have multiple vertex and pixel processing engines
    running in parallel
  • Can process tens of millions of geometric
    primitives per second
  • Peak performance of GPUs is increasing at a rate of
    2.5-3 times a year!
  • Programmable: fragment programs are executed on
    pixel processing engines

5
Main Contributions
  • Algorithms for predicates, boolean combinations
    and aggregations
  • Utilize SIMD capabilities of pixel processing
    engines
  • They have used these algorithms for selection
    queries on one or more attributes and aggregate
    queries

6
Related Work
  • Hardware Acceleration for DB operations
  • Vector processors for relational DB operations
    (Meki and Kambayashi 2000)
  • SIMD instructions for relational DB operations
    (Zhou and Ross 2002)
  • GPUs for spatial selections and joins (Sun et al.
    2003)

7
Graphics Processors: Design Issues
  • Programming model is limited due to lack of
    random access writes
  • Design algorithms that avoid data rearrangements
  • Programmable pipeline has poor branching
  • Design algorithms without branching in the
    programmable pipeline; evaluate branches using
    fixed function tests

8
Frame Buffer
  • Pixels are stored on the graphics card in a frame buffer.
  • The frame buffer is conceptually divided into:
  • Color buffer: stores the color components of each pixel
    in the frame buffer
  • Depth buffer: stores the depth value associated with each
    pixel; the depth is used to determine surface visibility
  • Stencil buffer: stores a stencil value for each pixel;
    called "stencil" because it is typically used for
    enabling/disabling writes to the frame buffer

9
Graphics Pipeline
[Figure: graphics pipeline stages: vertices, vertex processing engine, alpha test, stencil test, depth test]
10
Graphics Pipeline
  • Vertex processing engine: transforms vertices to points
    on the screen
  • Setup engine: generates information for color, depth, etc.
    associated with primitive vertices
  • Pixel processing engines: fragment processors that perform
    a series of tests before writing the fragments to the
    frame buffer

11
Pixel processing Engines
  • Alpha test: compares the fragment's alpha value to a
    user-specified reference value
  • Stencil test: compares the stencil value of the fragment's
    pixel to a user-specified reference value
  • Depth test: compares the depth value of the fragment to the
    depth value stored in the depth buffer

12
Operators
  • <
  • >
  • <=
  • >=
  • Never
  • Always

13
Fragment Programs and Occlusion Queries
  • Fragment programs: users can supply custom programs
    that run on each fragment
  • Occlusion query: gives the number of fragments that pass
    the different tests

14
Radeon R770 GPU by AMD Graphics Product Group
15
Data Representation on GPUs
  • Textures: 2D arrays that may have multiple channels
  • We store data in textures in floating point
    formats
  • To perform computations on the values: render a
    quadrilateral, generate fragments, run fragment
    programs and perform tests

16
Stencil Tests
  • Fragments failing the stencil test are rejected from
    the rasterization pipeline
  • Stencil operations:
  • KEEP: keep the stencil value in the stencil
    buffer
  • INCR: stencil value = stencil value + 1
  • DECR: stencil value = stencil value - 1
  • ZERO: stencil value = 0
  • REPLACE: stencil value = reference value
  • INVERT: stencil value = bitwise invert (stencil value)

17
Stencil and Depth Tests
  • We can set up the StencilOp(Op1, Op2, Op3) routine as
    below
  • Each fragment has three possible outcomes; the
    stencil operation corresponding to the outcome is executed
  • Op1: when a fragment fails the stencil test
  • Op2: when a fragment passes the stencil test but
    fails the depth test
  • Op3: when a fragment passes both the stencil and depth
    tests
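
The three-outcome rule above can be modelled in software. The following is a minimal sketch, not GPU code: the function names, the string operation names and the 8-bit stencil width are assumptions made for illustration, and INCR/DECR are assumed to clamp rather than wrap.

```python
def apply_stencil_op(op, value, ref, bits=8):
    """Apply one stencil operation to a stencil value (assumed 8 bits wide)."""
    max_value = (1 << bits) - 1
    if op == "KEEP":
        return value
    if op == "INCR":
        return min(value + 1, max_value)   # assumed to clamp at the maximum
    if op == "DECR":
        return max(value - 1, 0)           # assumed to clamp at zero
    if op == "ZERO":
        return 0
    if op == "REPLACE":
        return ref
    if op == "INVERT":
        return ~value & max_value
    raise ValueError(f"unknown stencil op: {op}")


def stencil_depth_step(value, ref, stencil_passes, depth_passes, ops):
    """Pick Op1, Op2 or Op3 from the test outcomes and apply it."""
    op1, op2, op3 = ops
    if not stencil_passes:
        chosen = op1          # fragment fails the stencil test
    elif not depth_passes:
        chosen = op2          # passes stencil, fails depth
    else:
        chosen = op3          # passes both tests
    return apply_stencil_op(chosen, value, ref)


ops = ("KEEP", "KEEP", "INCR")
print(stencil_depth_step(1, 1, True, True, ops))    # 2
print(stencil_depth_step(1, 1, True, False, ops))   # 1
```

With (KEEP, KEEP, INCR), only fragments that pass both tests change their stencil value, which is how a comparison result can be recorded per pixel.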

18
Outline
  • Database Operations on GPUs
  • Implementation Results
  • Analysis
  • Conclusions

19
Outline
  • Database Operations on GPUs
  • Implementation Results
  • Analysis
  • Conclusions

20
Overview
  • Database operations require comparisons
  • Utilize depth test functionality of GPUs for
    performing comparisons
  • Implements all possible comparisons: <, <=, >, >=,
    ==, !=, ALWAYS, NEVER
  • Utilize stencil test for data validation and
    storing results of comparison operations

21
Basic Operations
  • Basic SQL query:
  • SELECT A
  • FROM T
  • WHERE C
  • A: attributes or aggregations (SUM, COUNT, MAX,
    etc.)
  • T: relational table
  • C: Boolean combination of predicates (using
    operators AND, OR, NOT)

22
Outline Database Operations
  • Predicate Evaluation
  • (a op constant): depth test and stencil test
  • (a op b) = (a - b op 0): can be executed on
    GPUs
  • Boolean Combinations of Predicates
  • Express as CNF and repeatedly use stencil tests
  • Aggregations
  • Occlusion queries

23
Outline Database Operations
  • Predicate Evaluation
  • Boolean Combinations of Predicates
  • Aggregations

24
Basic Operations
  • Predicates: ai op constant or ai op aj
  • op is one of <, >, <=, >=, !=, =, TRUE, FALSE
  • Boolean combinations: Conjunctive Normal Form
    (CNF) expression evaluation
  • Aggregations: COUNT, SUM, MAX, MEDIAN, AVG

25
Predicate Evaluation
  • ai op constant (d)
  • Copy the attribute values ai into the depth buffer
  • Define the comparison operation using the depth test
  • Draw a screen-filling quad at depth d
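
The steps above can be sketched in plain Python. This is a software model only: the list stands in for the depth buffer, `evaluate_predicate` is a hypothetical name, and the comparison is applied directly as (ai op d) rather than through a real depth-test API.

```python
import operator

# Comparison functions a depth test can apply (keys mirror the slide's operators).
DEPTH_FUNCS = {
    "<": operator.lt, "<=": operator.le, ">": operator.gt,
    ">=": operator.ge, "==": operator.eq, "!=": operator.ne,
}


def evaluate_predicate(depth_buffer, op, d):
    """Model drawing a quad at depth d: True where (a_i op d) holds."""
    compare = DEPTH_FUNCS[op]
    return [compare(a_i, d) for a_i in depth_buffer]


attribute = [10, 24, 37, 99]     # attribute values copied into the "depth buffer"
mask = evaluate_predicate(attribute, ">=", 30)
print(mask)        # [False, False, True, True]
print(sum(mask))   # 2, the number of passing fragments
```

Every element is compared independently, which is what lets the pixel engines process the fragments in parallel.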

26
[Figure: ai op d. A quad is drawn at depth d over the screen; if (ai op d) the fragment passes, else the fragment is rejected]
27
Predicate Evaluation
  • ai op aj
  • Treat as (ai - aj) op 0
  • Semi-linear queries
  • Defined as a linear combination of attribute values
    compared against a constant
  • The linear combination is computed as a dot product
    of two vectors
  • Utilize the vector processing capabilities of GPUs
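
A semi-linear query can be illustrated as a dot product followed by a comparison. The sketch below is a CPU model under assumed names (`semilinear_query`, `rows`, `coeffs`); it also shows that ai op aj is the special case with coefficients (1, -1) compared against 0.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}


def semilinear_query(rows, coeffs, op, constant):
    """For each row, test (coeffs . row) op constant."""
    compare = OPS[op]
    results = []
    for row in rows:
        dot = sum(c * a for c, a in zip(coeffs, row))  # linear combination
        results.append(compare(dot, constant))
    return results


rows = [(3, 5), (7, 2), (4, 4)]
# a_i > a_j rewritten as (a_i - a_j) > 0
print(semilinear_query(rows, (1, -1), ">", 0))   # [False, True, False]
```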

28
Data Validation
  • Performed using the stencil test
  • Stencil values of valid data are set to a given value s
  • Stencil values of data that fail predicate evaluation are
    set to zero

29
Outline Database Operations
  • Predicate Evaluation
  • Boolean Combinations of Predicates
  • Aggregations

30
Boolean Combinations
  • Expression provided as a CNF
  • CNF is of the form (A1 AND A2 AND ... AND Ak)
  • where Ai = (Bi1 OR Bi2 OR ... OR Bimi)
  • CNF does not have a NOT operator
  • If the expression has a NOT operator, invert the comparison
    operation to eliminate NOT
  • E.g. NOT (ai < d) => (ai >= d)

31
Boolean Combination
  • We will focus on (A1 AND A2)
  • All cases are covered:
  • A1 = (TRUE AND A1)
  • If Ei = (A1 AND A2 AND ... AND Ai-1 AND Ai),
  • then Ei = (Ei-1 AND Ai)

32
  • Clear stencil value to 1
  • For each Ai, i = 1, ..., k
  • do
  • if (mod(i,2)) /* valid stencil value is 1 */
  • Stencil test to pass if stencil value is equal to
    1
  • StencilOp(KEEP, KEEP, INCR)
  • else /* valid stencil value is 2 */
  • Stencil test to pass if stencil value is equal to
    2
  • StencilOp(KEEP, KEEP, DECR)
  • endif
  • For each Bij, j = 1, ..., mi
  • do
  • Perform Bij using COMPARE /* depth test */
  • end for
  • if (mod(i,2)) /* valid stencil value is 2 */
  • if stencil value on screen is 1, REPLACE with 0
  • else /* valid stencil value is 1 */
  • if stencil value on screen is 2, REPLACE with 0
  • endif
  • end for
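
The CNF evaluation loop can be traced with a small software model. This is a sketch under assumptions: `eval_cnf` is a hypothetical name, predicates are plain Python callables standing in for depth-test comparisons, and a list stands in for the stencil buffer.

```python
def eval_cnf(records, clauses):
    """Evaluate (A1 AND ... AND Ak), each Ai a list of OR-ed predicates.

    Returns the final stencil buffer: a nonzero entry marks a record
    that satisfies every clause.
    """
    stencil = [1] * len(records)                 # clear stencil to 1
    for i, clause in enumerate(clauses, start=1):
        valid = 1 if i % 2 else 2                # value marking valid records
        step = 1 if i % 2 else -1                # INCR on odd i, DECR on even i
        for predicate in clause:                 # each B_ij
            for p, record in enumerate(records):
                # Op3 fires only if stencil test and "depth test" both pass
                if stencil[p] == valid and predicate(record):
                    stencil[p] += step
        # records still holding `valid` failed every B_ij of this clause
        stencil = [0 if s == valid else s for s in stencil]
    return stencil


values = [10, 24, 37, 99, 192]
in_range = eval_cnf(values, [[lambda a: a >= 20], [lambda a: a <= 100]])
print(in_range)   # [0, 1, 1, 1, 0]
```

Because a record's stencil value changes as soon as one B_ij passes, later disjuncts in the same clause fail the stencil test for that record, so it is never counted twice.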

33
A1 AND A2
[Figure: screen regions A1, B21, B22 and B23]
34
A1 AND A2
35
A1 AND A2
[Figure: screen region A1]
36
A1 AND A2
[Figure: stencil value is 2 inside A1 and 0 outside A1]
37
A1 AND A2
[Figure: B21, B22 and B23 drawn over A1; stencil is 1 where B21, B22 or B23 overlaps A1, 2 in the rest of A1, and 0 outside A1]
38
A1 AND A2
[Figure: stencil is 1 where B21, B22 or B23 overlaps A1, and 0 everywhere else]
39
A1 AND A2
[Figure: final stencil buffer; stencil 1 marks A1 AND B21, A1 AND B22 and A1 AND B23; stencil 0 elsewhere]
40
Range Query
  • Compute ai within [low, high]
  • Evaluated as (ai >= low) AND (ai <= high)

41
Outline Database Operations
  • Predicate Evaluation
  • Boolean Combinations of Predicates
  • Aggregations

42
Aggregations
  • COUNT, MAX, MIN, SUM, AVG
  • No data rearrangements

43
COUNT
  • Use occlusion queries to get pixel pass count
  • Syntax
  • Begin occlusion query
  • Perform database operation
  • End occlusion query
  • Get the count of attribute values that passed the
    database operation
  • Involves no additional overhead!

44
MAX, MIN, MEDIAN
  • We compute the k-th largest number
  • Traditional algorithms require data
    rearrangements
  • We perform no data rearrangements, no frame
    buffer readbacks

45
K-th Largest Number
  • Say vk is the k-th largest number
  • How do we generate a number m equal to vk?
  • Without knowing vk's bit representation, using only
    comparisons

46
Our algorithm
  • b_max = max. no. of bits in the values in tex
  • x = 0
  • for i = b_max-1 down to 0
  • count = Compare(tex >= x + 2^i)
  • if count > k-1
  • x = x + 2^i
  • return x
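
The bit-by-bit search above can be run on the CPU as a check. In this sketch the counting loop stands in for the Compare pass plus occlusion query, `kth_largest` is a hypothetical name, and the values are assumed to be non-negative integers of at most `b_max` bits.

```python
def kth_largest(values, k, b_max):
    """Build the k-th largest value one bit at a time, using only counts."""
    x = 0
    for i in range(b_max - 1, -1, -1):
        candidate = x + (1 << i)                        # tentatively set bit i
        count = sum(1 for v in values if v >= candidate)
        if count > k - 1:                               # candidate <= v_k: keep the bit
            x = candidate
    return x


s = [10, 24, 37, 99, 192, 200, 200, 232]
print(kth_largest(s, 1, 8))   # 232 (MAX)
print(kth_largest(s, 4, 8))   # 192
print(kth_largest(s, 8, 8))   # 10 (MIN)
```

The loop performs exactly `b_max` comparison passes regardless of the data, and no value is ever moved or read back individually.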

47
K-th Largest Number
  • Lemma: Let vk be the k-th largest number. Let
    count be the number of values >= m
  • If count > (k-1): m <= vk
  • If count <= (k-1): m > vk
  • Apply the earlier algorithm ensuring that count
    > (k-1)

48
Example
  • vk = 11101001
  • m = 00000000

49
Example
  • vk = 11101001
  • m = 10000000
  • m <= vk

50
Example
  • vk = 11101001
  • m = 11000000
  • m <= vk

51
Example
  • vk = 11101001
  • m = 11100000
  • m <= vk

52
Example
  • vk = 11101001
  • m = 11110000
  • m > vk
  • Make the bit 0
  • m = 11100000

53
Example
  • vk = 11101001
  • m = 11101000
  • m <= vk

54
Example
  • vk = 11101001
  • m = 11101100
  • m > vk
  • Make this bit 0
  • m = 11101000

55
Example
  • vk = 11101001
  • m = 11101010
  • m > vk
  • Make this bit 0
  • m = 11101000

56
Example
  • vk = 11101001
  • m = 11101001
  • m <= vk

57
Example
  • Integers ranging from 0 to 255
  • Represent them in depth buffer
  • Idea: use depth functions to perform comparisons
  • Use NV_occlusion_query to determine maximum

58
Example Parallel Max
  • S = {10, 24, 37, 99, 192, 200, 200, 232}
  • Step 1: Draw quad at 128; 192, 200, 200, 232 pass
  • Step 2: Draw quad at 192; 192, 200, 200, 232 pass
  • Step 3: Draw quad at 224; 232 passes
  • Step 4: Draw quad at 240; no values pass
  • Step 5: Draw quad at 232; 232 passes
  • Steps 6, 7, 8: Draw quads at 236, 234, 233; no values
    pass
  • Max is 232

59
SUM and AVG
  • Mipmaps: multi-resolution textures consisting of
    multiple levels
  • Highest level contains the average of all values at
    the lowest level
  • SUM = AVG × COUNT
  • Problems with mipmaps:
  • If we want the sum of a subset of values, then we have
    to introduce conditions in the fragment programs
  • Floating point representations may cause precision problems

60
Accumulator
  • Data representation is of the form
  • a_k 2^k + a_{k-1} 2^{k-1} + ... + a_0
  • Sum = sum(a_k) 2^k + sum(a_{k-1}) 2^{k-1} + ... + sum(a_0)
  • Current GPUs support no bit-masking operations
  • AVG = SUM / COUNT

61
TestBit
  • Read the data value from the texture, say ai
  • F = frac(ai / 2^(k+1))
  • If F >= 0.5, then the k-th bit of ai is 1
  • Set F as the alpha value. The alpha test passes a
    fragment if alpha value >= 0.5

62
Outline
  • Database Operations on GPUs
  • Implementation Results
  • Analysis
  • Conclusions

63
Implementation
  • Dell Precision Workstation with Dual 2.8GHz Xeon
    Processor
  • NVIDIA GeForce FX 5900 Ultra GPU
  • 2GB RAM

64
Implementation
  • CPU: Intel compiler 7.1 with hyperthreading,
    multi-threading, SIMD optimizations
  • GPU: NVIDIA Cg compiler

65
Benchmarks
  • TCP/IP database with 1 million records and four
    attributes
  • Census database with 360K records

66
Copy Time
67
Predicate Evaluation (3 times faster)
68
Range Query (5.5 times faster)
69
Multi-Attribute Query (2 times)
70
Semi-linear Query (9 times faster)
71
COUNT
  • Same timings for GPU implementation

72
Kth-Largest for median (2.5 times faster)
73
Kth-Largest
74
Kth-Largest conditional
75
Accumulator (20 times slower!)
76
Outline
  • Database Operations on GPUs
  • Implementation Results
  • Analysis
  • Conclusions

77
Analysis: Issues
  • Precision: the depth buffer currently has only 24-bit
    precision, which is inadequate
  • Copy time: the GPU has no mechanism to copy from a
    texture to the depth buffer
  • Integer arithmetic: not enough arithmetic instructions in
    the pixel processing engines
  • Depth compare masking: it would be useful to have a
    comparison mask for the depth function

78
Analysis: Issues
  • Memory management: current GPUs have 512 MB of video
    memory; we may use out-of-core techniques and swap
  • No random writes: no data rearrangements possible

79
Analysis: Performance
  • Relative performance gain
  • High performance: predicate evaluation,
    multi-attribute queries, semi-linear queries,
    COUNT
  • Medium performance: k-th largest number
  • Low performance: accumulator

80
High Performance
  • Parallel pixel processing engines
  • Pipelining: multi-attribute queries benefit
  • Early depth culling: fragments are rejected before passing
    through the pixel processing engine
  • Eliminates branch mispredictions

81
Medium Performance
  • Parallelism
  • FX 5900 has a clock speed of 450 MHz and 8 pixel
    processing engines
  • Rendering a single 1000x1000 quad takes 0.278 ms
  • Rendering 19 such quads takes 5.28 ms; the observed
    time is 6.6 ms
  • 80% efficiency in parallelism!
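
The figures above can be reproduced with a few lines of arithmetic. The sketch assumes each pixel processing engine handles one pixel per clock cycle, which is an assumption made here to match the quoted 0.278 ms, not something the slide states.

```python
clock_hz = 450e6          # FX 5900 clock speed
engines = 8               # pixel processing engines
pixels = 1000 * 1000      # one 1000x1000 quad

# Assumed throughput: one pixel per engine per clock cycle.
quad_ms = pixels / (clock_hz * engines) * 1e3
ideal_ms = 19 * quad_ms
observed_ms = 6.6
efficiency = ideal_ms / observed_ms

print(round(quad_ms, 3))      # 0.278
print(round(ideal_ms, 2))     # 5.28
print(round(efficiency, 2))   # 0.8
```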

82
Low Performance
  • No gain over SIMD based CPU implementation
  • Two main reasons
  • Lack of integer-arithmetic
  • Clock rate

83
Outline
  • Database Operations on GPUs
  • Implementation Results
  • Analysis
  • Conclusions

84
Conclusions
  • Novel algorithms to perform database operations
    on GPUs
  • Evaluation of predicates, boolean combinations of
    predicates, aggregations
  • Algorithms take into account GPU limitations
  • No data rearrangements
  • No frame buffer readbacks

85
Conclusions
  • Preliminary comparisons with optimized CPU
    implementations are promising
  • Discussed possible improvements on GPUs
  • GPU as a useful co-processor

86
Relational Joins
  • Modern GPUs have thread groups
  • Each thread group has several threads
  • Data-parallel primitives:
  • Map
  • Scatter: scatters the data of a relation with
    respect to an array L
  • Gather: the reverse of scatter
  • Split: divides the relation into a number of
    disjoint partitions with a given partitioning
    function
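
The four primitives can be illustrated with sequential Python stand-ins. The function names and signatures here are assumptions for illustration; on a GPU each loop iteration would be handled by a separate thread, and L is interpreted as an array of positions.

```python
def map_primitive(fn, relation):
    """Apply fn to every tuple independently."""
    return [fn(t) for t in relation]


def scatter(relation, loc):
    """Write tuple i to output position loc[i]."""
    out = [None] * len(relation)
    for i, t in enumerate(relation):
        out[loc[i]] = t
    return out


def gather(relation, loc):
    """Read output tuple i from input position loc[i] (reverse of scatter)."""
    return [relation[loc[i]] for i in range(len(loc))]


def split(relation, num_parts, part_fn):
    """Divide the relation into disjoint partitions chosen by part_fn."""
    parts = [[] for _ in range(num_parts)]
    for t in relation:
        parts[part_fn(t)].append(t)
    return parts


L = [2, 0, 1]
scattered = scatter(["a", "b", "c"], L)
print(scattered)                                    # ['b', 'c', 'a']
print(gather(scattered, L))                         # ['a', 'b', 'c']
print(split([1, 2, 3, 4, 5], 2, lambda x: x % 2))   # [[2, 4], [1, 3, 5]]
```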

87
NINLJ
[Figure: relations R and S, processed by Thread Group 1, ..., Thread Group i, ..., Thread Group j, ..., Thread Group Bp]
88
INLJ
  • Uses Cache-Sensitive Search trees (CSS-trees) for the
    index structure
  • The inner relation is stored as the CSS-tree
  • Multiple keys are searched in parallel on the tree

89
Sort Merge join
  • Merge step is done in parallel
  • 3 steps:
  • Divide relation S into Q chunks, Q = |S| / M
  • Find the corresponding matching chunks from R by
    using the start and end of each chunk of S
  • Merge each pair of S and R chunks in parallel, 1
    thread group per pair
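
The three merge steps can be sketched sequentially. This is a simplified model under assumptions: both relations are lists of already-sorted join keys, `chunk_size` plays the role of M, the function names are hypothetical, and the outer loop stands in for the thread groups that would each handle one chunk pair.

```python
from bisect import bisect_left, bisect_right


def merge_pair(r_chunk, s_chunk):
    """Merge two sorted chunks; emit each matching key once per (r, s) pair."""
    out, i, j = [], 0, 0
    while i < len(r_chunk) and j < len(s_chunk):
        if r_chunk[i] < s_chunk[j]:
            i += 1
        elif r_chunk[i] > s_chunk[j]:
            j += 1
        else:
            key, i_end, j_end = r_chunk[i], i, j
            while i_end < len(r_chunk) and r_chunk[i_end] == key:
                i_end += 1
            while j_end < len(s_chunk) and s_chunk[j_end] == key:
                j_end += 1
            out.extend([key] * ((i_end - i) * (j_end - j)))
            i, j = i_end, j_end
    return out


def chunked_merge_join(r_keys, s_keys, chunk_size):
    result = []
    for start in range(0, len(s_keys), chunk_size):     # step 1: chunks of S
        s_chunk = s_keys[start:start + chunk_size]
        lo = bisect_left(r_keys, s_chunk[0])            # step 2: matching R range
        hi = bisect_right(r_keys, s_chunk[-1])
        result.extend(merge_pair(r_keys[lo:hi], s_chunk))  # step 3: merge the pair
    return result


print(chunked_merge_join([1, 3, 3, 5, 7, 9], [3, 4, 5, 5, 9, 10], 2))
# [3, 3, 5, 5, 9]
```

Each chunk pair is independent of the others, which is what makes the merge step parallel.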

90
Hash join
  • Partitioning
  • Use the Split primitive to partition both
    relations
  • Matching
  • Read the inner relation into memory
  • Each tuple from the outer relation uses
    sequential/binary search on the inner relation
  • For binary search, the inner relation is
    sorted using bitonic sort
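
The partition-then-match structure can be sketched on the CPU as follows. The assumptions are labeled in the code: keys are non-negative integers, `key % num_parts` is an illustrative partitioning function, and Python's built-in `sorted` stands in for bitonic sort.

```python
from bisect import bisect_left


def hash_join(outer, inner, num_parts):
    """Return the join keys matched between outer and inner relations."""
    # Partitioning: split both relations with the same function (assumed: key % num_parts).
    outer_parts = [[] for _ in range(num_parts)]
    inner_parts = [[] for _ in range(num_parts)]
    for key in outer:
        outer_parts[key % num_parts].append(key)
    for key in inner:
        inner_parts[key % num_parts].append(key)

    # Matching: binary search each outer key in the sorted inner partition.
    matches = []
    for o_part, i_part in zip(outer_parts, inner_parts):
        i_sorted = sorted(i_part)            # stand-in for bitonic sort
        for key in o_part:
            pos = bisect_left(i_sorted, key)
            while pos < len(i_sorted) and i_sorted[pos] == key:
                matches.append(key)
                pos += 1
    return matches


print(sorted(hash_join([1, 2, 3, 4, 4], [4, 2, 2, 8], 3)))   # [2, 2, 4, 4]
```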