Title: GiST
1. GiST
Most slides are adapted from ChengXiang Zhai's lecture slides
2. My taste
My slides
Zhai's Lecture Slides
3. Not from scratch
Domain-specific data types, queries definable
B-tree R-tree R -tree kDB-tree hB-tree RD-tree
Your own tree
Your methods
GiST (Generalized Search Tree) Data Structure
API
4. GiST: Generalized Search Tree
- General: covers B-tree, R-tree, etc.
- Extensible
- domain-specific data types, queries definable
- Easy to extend: six key methods for a new tree
- Efficient: can match specialized trees
- Reusable: concurrency, recovery for indexes
5. Example: Indexing Book Titles
- You are a programmer at the Postech Digital Library
- Develop a library book search system
6. O.K. Let's Start
- Titles of 4 books
- T1: database optimization
- T2: web database
- T3: complexity of optimization algorithms
- T4: algorithms and complexity
- Indexable with (extensible) B-tree?
- linear ordering: T4, T3, T1, T2
7. Extensible B-Tree for Titles
[Figure: B-tree with root separator d, second-level separators c and w, and leaves T4 alg., T3 complexity, T1 database, T2 web]
- Observations
- indexed values have linear ordering: T4, T3, T1, T2
- keys are simply separators: T4, c, T3, d, T1, w, T2
8. Queries on Titles
- Equality predicates
- WHERE book.title = "web databases"
- Containment predicates
- WHERE book.title has "web"
- Prefix predicates
- WHERE book.title start-with "web"
- RegEx predicates (generalize all the others)
- WHERE book.title like "web database"
9. Using B-Tree: What's Wrong?
- What predicates can B-tree support well?
- equality, containing, prefix, regex?
[Figure: the same title B-tree, with separators d, c, w over leaves T4 alg., T3 complexity, T1 database, T2 web]
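One way to explore the question is to stand in for the B-tree with a sorted list and binary search. This is only a sketch: the function names equals, starts_with, and has are made up here, and a sorted Python list is an assumption standing in for the ordered leaf level. Equality and prefix queries can exploit the ordering; containment cannot.

```python
from bisect import bisect_left

# The four example titles, kept sorted the way B-tree leaves would keep them.
titles = sorted([
    "database optimization",                  # T1
    "web database",                           # T2
    "complexity of optimization algorithms",  # T3
    "algorithms and complexity",              # T4
])

def equals(title):
    """Equality: a single binary search over the ordered keys."""
    i = bisect_left(titles, title)
    return titles[i:i + 1] == [title]

def starts_with(prefix):
    """Prefix: matches are contiguous in sort order, so seek once, then scan."""
    i = bisect_left(titles, prefix)
    out = []
    while i < len(titles) and titles[i].startswith(prefix):
        out.append(titles[i])
        i += 1
    return out

def has(word):
    """Containment: the ordering gives no help, so every title is inspected."""
    return [t for t in titles if word in t.split()]
```

Regex predicates in general fall into the same bucket as has: without a usable prefix, the linear ordering does not narrow the search.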
10. New Index
- index pages on disk
- the algorithms
- for searching indexes
- deleting from indexes
- complex transactional details
- page-level locking
- for high concurrency
- write-ahead logging
- for crash recovery
11. We need an API for a New Search Index
12. GiST: Generalizing Balanced Search Trees
- GiST is not universal (just a reasonable generalization)
- balanced tree of <key, ptr> pairs; keys can overlap
- B-Tree generalizes to R-Tree; R-Tree generalizes to GiST
- What is the key generalization?
[Figure: internal nodes (directory) holding key1, key2; leaf nodes (linked list)]
13. The Key Generalization: The Key
- Key evolution: 1-D separator --> 2-D MBR --> predicates
- R-Tree generalizes B-Tree
- generalizing key from 1-D line to 2-D area
- bounding range to (minimal) bounding region
- GiST generalizes R-Tree
- generalizing key from 2-D MBR to predicates
- a predicate that all values v in subtree will satisfy
- B-tree keys
- [k1, k2) --> contains([k1, k2), v)
- R-tree keys
- (x1,y1,x2,y2) --> contains((x1,y1,x2,y2), v)
- RD-tree keys
- {x1, ..., xk} --> subset({x1, ..., xk}, v)
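The three key forms above can each be read as a Boolean function of a value v. The sketch below is illustrative only: the function names follow the slide, but the tuple and set encodings of keys are assumptions.

```python
def contains_range(key, v):
    """B-tree key [k1, k2): every value in the subtree lies in the half-open range."""
    k1, k2 = key
    return k1 <= v < k2

def contains_rect(key, v):
    """R-tree key (x1, y1, x2, y2): every point in the subtree lies inside the MBR."""
    x1, y1, x2, y2 = key
    x, y = v
    return x1 <= x <= x2 and y1 <= y <= y2

def subset(key, v):
    """RD-tree key {x1, ..., xk}: every set in the subtree is a subset of the key."""
    return set(v) <= set(key)
```

In all three cases the key is a predicate that holds for everything stored below it, which is the only property the generic tree code relies on.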
14. GiST for Title Indexing: Predicates
- Must first determine predicates
- What query predicates to support?
- equality: equal(v, "web db")
- containing: has(v, "web")
- What key predicate to use?
- Criteria for choosing key predicates?
- What do you suggest?
15. GiST for Title Indexing: Predicates
- Key predicates: Contains(S, v)
- Root entries: SL = {alg, comp, opt}, SR = {db, opt, web}
- Under SL: SLL = {alg, comp} (T4 alg.), SLR = {comp, opt} (T3 complexity)
- Under SR: SRL = {db, opt} (T1 database), SRR = {db, web} (T2 web)
16. GiST: Built-in Tree Operations
- Search(root R, predicate q)
- Insert(root R, entry E, level l)
- Delete(root R, entry E)
17. GiST: Application-Specific Methods
- Search
- Consistent(E, q): search subtree E for predicate q?
- Labeling
- Union(E1, ..., En): how to label the union of E1, ..., En?
- Categorization
- Penalty(E1, E2): penalty for inserting E2 in subtree E1
- PickSplit(E1, ..., En): how to split into two groups of entries
- Compression (storage/time tradeoff)
- Compress(E): E --> Ec
- Decompress(Ec) --> E' such that E.p implies E'.p
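The six methods can be pictured as an interface that a new tree type fills in. The sketch below is not any real library's API: GiSTExtension and IntervalKeys are hypothetical names, and the interval instance (a B-tree-like key of closed numeric ranges) is just one assumed way to implement them.

```python
from abc import ABC, abstractmethod

class GiSTExtension(ABC):
    """The six application-specific methods; keys and queries are opaque to the tree."""
    @abstractmethod
    def consistent(self, key, query): ...
    @abstractmethod
    def union(self, keys): ...
    @abstractmethod
    def penalty(self, key, new_key): ...
    @abstractmethod
    def pick_split(self, keys): ...
    @abstractmethod
    def compress(self, key): ...
    @abstractmethod
    def decompress(self, stored): ...

class IntervalKeys(GiSTExtension):
    """B-tree-like instance: a key is a closed interval (lo, hi) of numbers."""
    def consistent(self, key, query):
        # The query is also an interval; visit the subtree only if they overlap.
        return key[0] <= query[1] and query[0] <= key[1]
    def union(self, keys):
        return (min(k[0] for k in keys), max(k[1] for k in keys))
    def penalty(self, key, new_key):
        # How much wider the key must become to cover the new entry.
        lo, hi = self.union([key, new_key])
        return (hi - lo) - (key[1] - key[0])
    def pick_split(self, keys):
        # Simplest policy: sort by lower bound and cut in half.
        ordered = sorted(keys)
        mid = len(ordered) // 2
        return ordered[:mid], ordered[mid:]
    def compress(self, key):
        return key      # no compression
    def decompress(self, stored):
        return stored
```

The generic Search, Insert, and Delete code would call only these six methods, which is what lets one tree implementation serve many key types.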
18. Search Operation: Consistent Method
- Search(root R, predicate q)
- traverse subtrees where Consistent is true
- return leaf entries that are consistent
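The traversal above can be sketched as a short recursive function. The Node class, the two-level example tree, and has_word are assumptions made for illustration; the key sets are the ones from the title example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    leaf: bool
    entries: list = field(default_factory=list)  # (key, child-or-record) pairs

def search(node, query, consistent):
    """Return records of all leaf entries whose key is consistent with the query."""
    hits = []
    for key, target in node.entries:
        if not consistent(key, query):
            continue  # prune: nothing below this entry can match
        if node.leaf:
            hits.append(target)
        else:
            hits.extend(search(target, query, consistent))
    return hits

# A two-level version of the title tree (leaf keys are the titles' word sets).
left = Node(True, [({"alg", "comp"}, "T4"), ({"comp", "opt"}, "T3")])
right = Node(True, [({"db", "opt"}, "T1"), ({"db", "web"}, "T2")])
root = Node(False, [({"alg", "comp", "opt"}, left), ({"db", "opt", "web"}, right)])

def has_word(key, word):
    return word in key

print(search(root, "web", has_word))  # ['T2'], found without visiting the left subtree
```

Because keys may overlap, a query such as "opt" has to descend into both subtrees, unlike a B-tree lookup that follows a single path.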
19. Consistent Method
- Consistent(E, q)
- Can E.p and q both hold?
- Does E.p imply (not q)?
- Title GiST
- key predicate p: Contains(S, v), or simply S
- e.g., SL = {alg, comp, opt}
- e.g., SR = {db, opt, web}
- Consistent(SL, has(v, web))?
- how to implement? SL ∩ {web} ≠ Ø
- Consistent(SR, equals(v, web database))?
- how to implement? SR ⊇ {web, db}
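A possible reading of the two checks above, written out for word-set keys. The (kind, words) query encoding is an assumption made for this sketch, not part of the slides.

```python
def consistent(key_words, query):
    """Return False only when no title under this key can satisfy the query."""
    kind, words = query
    if kind == "has":
        # A title below can contain the word only if the key set contains it.
        return bool(key_words & words)
    if kind == "equals":
        # A title equal to the query needs all of its words inside the key set.
        return words <= key_words
    raise ValueError(f"unsupported predicate: {kind}")

SL = {"alg", "comp", "opt"}
SR = {"db", "opt", "web"}

print(consistent(SL, ("has", {"web"})))           # False: skip the left subtree
print(consistent(SR, ("equals", {"web", "db"})))  # True: must search the right subtree
```

A True result is only a "maybe": the subtree still has to be searched, and leaf entries checked against the actual title.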
20. Insert Operation
- Insert(root R, entry E, level l)
- descend tree minimizing potential increase in Penalty
- stop at level specified
- if there is room at node, insert there
- else split according to PickSplit
- propagate changes using Union to adjust keys
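A minimal sketch of these steps for word-set keys. Several things here are assumptions: the Node class, the tiny MAX_ENTRIES capacity, and a placeholder pick_split that simply halves the entry list. The level argument is omitted, so entries always go to a leaf.

```python
from dataclasses import dataclass, field

MAX_ENTRIES = 2  # assumed node capacity, kept tiny so that splits are visible

@dataclass
class Node:
    leaf: bool
    entries: list = field(default_factory=list)  # [key, child-or-record] pairs

def union(keys):
    return frozenset().union(*keys)

def penalty(key, new_key):
    return len(key | new_key) - len(key)

def pick_split(entries):
    mid = len(entries) // 2  # placeholder policy: cut the entry list in half
    return entries[:mid], entries[mid:]

def insert(node, key, record):
    """Insert below node; return (key, node) pairs for the parent: one, or two after a split."""
    if node.leaf:
        node.entries.append([key, record])
    else:
        # Descend into the child whose key would grow least.
        i = min(range(len(node.entries)),
                key=lambda j: penalty(node.entries[j][0], key))
        _, child = node.entries.pop(i)
        # Re-add the child (or its two halves) with keys recomputed via Union.
        node.entries.extend([k, n] for k, n in insert(child, key, record))
    if len(node.entries) <= MAX_ENTRIES:
        return [(union(k for k, _ in node.entries), node)]
    a, b = pick_split(node.entries)
    return [(union(k for k, _ in part), Node(node.leaf, part)) for part in (a, b)]

def insert_root(root, key, record):
    parts = insert(root, key, record)
    if len(parts) == 1:
        return parts[0][1]
    return Node(False, [[k, n] for k, n in parts])  # root split: tree grows one level

root = Node(leaf=True)
for record, words in [("T1", {"db", "opt"}), ("T2", {"db", "web"}), ("T3", {"comp", "opt"})]:
    root = insert_root(root, frozenset(words), record)
```

Splits propagate upward through the return value, so all leaves stay at the same depth, which is what keeps the tree balanced.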
21. Title GiST: Insert
- Where to insert T5 = "complexity of web algorithms"?
[Figure: the title GiST. SL = {alg, comp, opt} over SLL = {alg, comp} (T4 alg.) and SLR = {comp, opt} (T3 complexity); SR = {db, opt, web} over SRL = {db, opt} (T1 database) and SRR = {db, web} (T2 web)]
22. Penalty Method
- Penalty(E1, E2)
- penalty for inserting E2 in subtree E1
- Title GiST
- E2 with S = {comp, web, alg} (i.e., T5 = "complexity of web algorithms")
- Where to insert?
- root: SL = {alg, comp, opt} vs. SR = {db, opt, web}?
- Penalty
- how to implement? |E1 ∪ E2| - |E1|
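Worked through for T5, assuming the penalty is the number of words the subtree key would have to gain (one reasonable reading of the formula above):

```python
def penalty(subtree_words, new_words):
    """Number of words the subtree key must gain to cover the new entry."""
    return len(subtree_words | new_words) - len(subtree_words)

SL = {"alg", "comp", "opt"}
SR = {"db", "opt", "web"}
T5 = {"comp", "web", "alg"}  # "complexity of web algorithms"

print(penalty(SL, T5))  # 1: only "web" is new
print(penalty(SR, T5))  # 2: "comp" and "alg" are new
```

Under this measure T5 goes into the SL subtree, since that key grows by one word instead of two.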
23. PickSplit Method
- PickSplit(E1, ..., En)
- how to split into two groups of entries
- Title GiST
- suppose we have 3 entries (after an Insert)
- S1 = {alg, comp}
- S2 = {comp, opt}
- S3 = {comp, web, alg} (new)
- how to split S1, S2, S3 into two?
- something similar to R-tree algorithm will do
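One heuristic loosely modeled on R-tree seed-based splitting, adapted to word sets: pick the two keys that share the fewest words as seeds, then give each remaining key to the group whose cover grows least. This is an assumed policy for illustration, not the algorithm from the slides, and it does not enforce a minimum fill per group.

```python
from itertools import combinations

def pick_split(keys):
    """Split word-set keys into two groups, seeded by the least-overlapping pair."""
    a, b = min(combinations(range(len(keys)), 2),
               key=lambda p: len(keys[p[0]] & keys[p[1]]))
    groups = ([keys[a]], [keys[b]])
    covers = [set(keys[a]), set(keys[b])]
    for i, k in enumerate(keys):
        if i in (a, b):
            continue
        # Assign to the group whose cover grows least (ties go to the first group).
        g = min((0, 1), key=lambda j: len(covers[j] | k) - len(covers[j]))
        groups[g].append(k)
        covers[g] |= k
    return groups

S1 = {"alg", "comp"}
S2 = {"comp", "opt"}
S3 = {"comp", "web", "alg"}

left, right = pick_split([S1, S2, S3])
print(left, right)  # S1 and S3 end up together; S2 is on its own
```

Grouping S1 with S3 keeps the two resulting keys small ({alg, comp, web} and {comp, opt}), which preserves pruning power for later searches.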
24. Union Method
- Union(E1, ..., En)
- Generates a label for the subtree with E1, ..., En
- Title GiST
- key predicate p: Contains(S, v), or simply S
- S1 = {alg, comp}, S2 = {comp, opt}
- Combined key: {alg, comp, opt}
- Union(E1(SL, ptr1), E2(SR, ptr2))?
- how to implement? SL ∪ SR
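For word-set keys, the label is plain set union over the entries' keys. A sketch, assuming entries are (key, ptr) pairs as in the slide:

```python
def union(entries):
    """Label for a subtree: all words appearing in the keys of its entries."""
    label = set()
    for key, _ptr in entries:
        label |= key
    return label

S1, S2 = {"alg", "comp"}, {"comp", "opt"}
print(sorted(union([(S1, "ptr1"), (S2, "ptr2")])))  # ['alg', 'comp', 'opt']
```

The result satisfies the requirement on keys: every title below either entry has all its words inside the combined set.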
25. Compress/Decompress Method?
- Key storage vs. search time tradeoff
- Compress(E): E --> Ec
- Decompress(Ec) --> E'; E'.p can be looser than E.p (less pruning power)
- Lossy compression may need more time for search
26. Title GiST: Compress/Decompress
- Example 1: no compression
- Compress(E) --> Ec = E
- Decompress(Ec) --> E' = Ec
- Example 2: compress by taking word initials
- Compress
- {algorithm, complexity, optimization} --> {al, co, op}
- Decompress
- {al, co, op} --> {al, co, op}
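Example 2 can be sketched as follows. Reading the decompressed initials as word prefixes is an assumption made here, and may_contain is a hypothetical helper showing how a Consistent check would use the looser key.

```python
def compress(words):
    """Keep only the first two letters of each word (lossy)."""
    return {w[:2] for w in words}

def decompress(initials):
    """The stored initials are used directly as the key, read as word prefixes."""
    return set(initials)

def may_contain(key_prefixes, word):
    """Consistent-style check against a decompressed key: does any prefix match?"""
    return any(word.startswith(p) for p in key_prefixes)

key = decompress(compress({"algorithm", "complexity", "optimization"}))
print(sorted(key))                      # ['al', 'co', 'op']
print(may_contain(key, "complexity"))   # True
print(may_contain(key, "compiler"))     # True: a false positive of the looser key
print(may_contain(key, "web"))          # False
```

The "compiler" case is the cost of lossy compression: the smaller key still never misses a real match, but it prunes fewer subtrees.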
27. GiST: No Magic
- It offers (only) what its model is based on
- It does not represent all possible index structures
- e.g. duplicate objects by multiple inserts (R-tree)
- e.g. support notion of distance and similarity
- rather than Boolean based predicates
- any more?
28. What You Should Know
- What is GiST?
- What are the six key methods?
- How does GiST generalize other more specialized trees?
- What are some limitations of GiST?
29. Carry Away Messages
- Once again, generalize whenever it's possible
- 1-dimension indexing (B-tree, interval-based) --> Multi-dimension indexing (R-tree, region-based) --> Arbitrary objects (GiST, predicate-based)
- Avoid over-generalization
- While a predicate is quite general, it doesn't guarantee pruning power
- Where's the notion of bounding in GiST?
- Whenever you see yet another X, think about possibilities for a more general formulation of X
30. Usage: GiST
- PostgreSQL Spatial Indexes
- http://www.cmarschner.net/mtree.html
31. Q&A