Notions of clustering - PowerPoint PPT Presentation

1 / 11
About This Presentation
Title:

Notions of clustering

Description:

MovieStar(name, address, gender, birthdate) And there is an index on MovieStar.name ... SELECT birthdate. FROM StarsIn, MovieStar. WHERE title = 'King Kong' AND ... – PowerPoint PPT presentation

Number of Views:51
Avg rating:3.0/5.0
Slides: 12
Provided by: tho9
Category:

less

Transcript and Presenter's Notes

Title: Notions of clustering


1
Notions of clustering
  • Clustered relation tuples are stored in blocks
    mostly devoted to that relation.
  • Clustering index tuples (of the relation) with
    same search key are stored together.

2
Index-based algorithms selection
  • To evaluate ?a?(R) use index on a, if it exists
  • Cost cost of index lookup (negligible) plus
  • If index is clustering
  • B(R)/V(R,a) I/Os
  • (the fraction of the relation with some value
    for a)
  • Otherwise, with a non-clustering index, an
    approximation is that each tuple we retrieve is
    in a different block, so we get
  • T(R)/V(R,a) I/Os

3
Example of index-based selection
  • ?a?(R), and B(R) 1000, T(R) 20,000
  • R is clustered, but no index on attribute a
  • ?1000 disk I/Os
  • R has a clustering index on a, V(R,a) 100
  • ?10 I/Os
  • R has a non-clustering index on a, V(R,a) 100
  • ?20,000/100 200 disk I/Os
  • V(R,a) 20,000 (i.e. attribute a is key) ? just
    1 I/O

4
Index joins
  • We want to compute R(X,Y) ?? S(Y,Z)
  • Suppose there is a Y-index on S.
  • Algorithm
  • For each tuple t of R, lookup all tuples in S
    with key-value tY and output the join of t.
  • Cost B(R) to read R (clustered case) --- We can
    neglect this cost
  • Each tuple of R joins with T(S)/V(S,Y) tuples of
    S, on average.
  • If S has a non-clustered index on Y
  • ?I/O cost is T(R)T(S)/V(S,Y)
  • If S has a clustered index on Y
  • ?I/O cost is T(R)B(S)/V(S,Y)

5
Example of index-join
  • T(R) 10,000, B(R) 1000
  • T(S) 5000, B(S) 500, V(S,Y) 100
  • To compute R(X,Y) ?? S(Y,Z) using a clustered
    Y-index on S
  • 1000 10,000(500/100)
  • 51,000 I/Os

Bad!!
6
However, things are not so bad in practice
  • Suppose we have the relations
  • StarsIn(title, year, starName)
  • MovieStar(name, address, gender, birthdate)
  • And there is an index on MovieStar.name
  • Consider the SQL query
  • SELECT birthdate
  • FROM StarsIn, MovieStar
  • WHERE title 'King Kong' AND starName name

7
Practice (Contd)
  • We can first do the selection of those tuples in
    StarsIn relation with titleKing Kong. Suppose
    they are 10 such tuples.
  • Now, we know that stars take care to not have the
    same name with some other star. So, name is a key
    for the relation MovieStar. (V(MovieStar, name)
    ?)
  • Hence, V(MovieStar, name) T(MovieStar)
  • Finally the number of I/Os is
  • B(StarsIn) T(?nameKing Kong(StarsIn))

8
Joins using sorted indexes
  • We want to compute R(X,Y) ?? S(Y,Z)
  • If S has a B-tree index on Y,
  • Create sorted sublists of R only, and
  • Do a sort join, extracting the S-tuples in order
    through the index
  • Or, if both have B-tree index on Y, do a
    zigzag-join.

9
Example (B-Tree index on SY)
  • T(R) 10,000, B(R) 1000,
  • T(S) 5000, B(S) 500, V(S,Y) 100, S
    has a B-Tree index on Y
  • Assume that both relations and the indexes are
    clustered.
  • M 101 buffers
  • Create 10 sorted sublists of R. Cost 2B(R)
  • 10 buffers for sublists of R, 1 buffer for S
    (retrieved via index)
  • Join tuples from input buffers
  • Total cost 2B(R) B(R) B(S) index lookup
  • 2000 1000 500 index lookup 3500 index
    lookup

10
Zigzag Join
  • Suppose we have B-Tree indexes on both SY and
    RY.
  • We can jump back and forth between the indexes
    finding Y-values that they share in common.
  • Tuples from R with Y-value that doesnt appear in
    S need never be retrieved, and similarly tuples
    of S whose Y-value doesnt appear in R need never
    be retrieved.
  • Example.
  • Let the Y-values for R be 1,3,4,4,4,5,6
  • Let the Y-values for S be 2,2,4,4,6,7
  • Start with the 1 and 2.
  • Since 1lt2 skip 1 in R.
  • Since 2lt3 skip the 2s in S.
  • Since 3lt4 skip 3 in R.
  • Join 4s.

11
Example (Zigzag Join)
  • T(R) 10,000, B(R) 1000,
  • T(S) 5000, B(S) 500,
  • S and R both have clustered B-Tree indexes on Y
  • We use just 1000500 disk I/Os to read the
    blocks of R and S through their indexes.
  • We can determine from the indexes alone that a
    large fraction of R or S cannot match tuples of
    the other relation, so the cost might be
    considerably less than 1500 I/Os.
Write a Comment
User Comments (0)
About PowerShow.com