Title: A Combination of Trie-trees and Inverted files for the Indexing of Set-valued Attributes
1 A Combination of Trie-trees and Inverted files
for the Indexing of Set-valued Attributes
- Manolis Terrovitis (NTUA)
- Spyros Passas (NTUA)
- Panos Vassiliadis (UoI)
- Timos Sellis (NTUA)
2 Problem
- We are interested in low cardinality set-values
  - Retail store transaction logs
  - Web logs
  - Biomedical databases etc.
- We address the efficient evaluation of containment queries
  - In which transactions were products a and b sold together?
  - Which users visited only the main page or the download page of our site?
- We propose the Hybrid Trie-Inverted file (HTI) index
3 Outline
- Problem definition
- The HTI index
- Query evaluation
- Experiments
- Conclusions
5 Data and queries
tid  products        tid  products
1    f,a             9    a,e
2    a,d,c           10   g,c,a
3    c,b,a           11   b,a,e
4    f,a,c           12   b,d,c
5    c,g             13   c,f,a,d,b
6    a,b,g,c,d,e     14   b,d
7    a,d,b           15   e
8    a,e,b           16   b,f,a
8 Data and queries
- Find all transactions that contain a, b and d (subset)
- Find all transactions that contain exactly a, b and d (equality)
- Find all transactions that contain only items from a, b and d (superset)
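The three query types can be stated precisely with plain set comparisons. The following is a minimal brute-force sketch over the sample transactions from the slides; the function names are illustrative and not part of the original material, and no index is used.

```python
# Sample transactions from the "Data and queries" slide (tid -> set of items).
transactions = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}

def subset_query(db, query):
    """Records that contain every query item (the query is a subset of the record)."""
    return sorted(tid for tid, items in db.items() if query <= items)

def equality_query(db, query):
    """Records whose item set is exactly the query set."""
    return sorted(tid for tid, items in db.items() if items == query)

def superset_query(db, query):
    """Records made up only of query items (the query is a superset of the record)."""
    return sorted(tid for tid, items in db.items() if items <= query)

q = {"a", "b", "d"}
print(subset_query(transactions, q))    # [6, 7, 13]
print(equality_query(transactions, q))  # [7]
print(superset_query(transactions, q))  # [7, 14]
```

Each function scans every record, which is the cost that an index is meant to avoid.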
9 Data and queries
- Traditional methods
  - Signature files
  - Inverted files
- Differences from text databases
  - Low cardinality
  - Large number of records in comparison with vocabulary size
  - New types of queries (equality, superset)
10 Outline
- Problem definition
- The HTI index
- Query evaluation
- Experiments
- Conclusions
11 The HTI index: Background - The inverted file
12 HTI index: Inverted files - problems
- The evaluation of containment queries relies on merge-joining the inverted lists
- The inverted lists become very long
  - when the database is very large compared to the vocabulary
  - when the item distribution is skewed
- This is often the case in the real world!
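To make the cost concrete, here is a minimal sketch of a plain inverted file and a subset query answered by merge-joining the lists. It uses the sample transactions from the slides; the function names are illustrative, and the "shortest list first" ordering is a common heuristic assumed here, not something stated in the slides.

```python
from collections import defaultdict

transactions = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}

def build_inverted_file(db):
    """Map each item to the sorted list of tids of the records that contain it."""
    inverted = defaultdict(list)
    for tid in sorted(db):            # ascending tids keep every list sorted
        for item in db[tid]:
            inverted[item].append(tid)
    return dict(inverted)

def merge_join(left, right):
    """Intersect two sorted tid lists in one linear pass."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out

def subset_query(inverted, query):
    """Tids of records containing all query items (non-empty query assumed)."""
    lists = sorted((inverted.get(item, []) for item in query), key=len)
    result = lists[0]
    for lst in lists[1:]:
        result = merge_join(result, lst)
    return result

inverted = build_inverted_file(transactions)
print(len(inverted["a"]))                       # 12
print(subset_query(inverted, {"a", "b", "d"}))  # [6, 7, 13]
```

Even in this tiny example the list of the most frequent item, a, holds 12 of the 16 tids and has to be scanned although only three records qualify.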
13 HTI index: Solution?
- We need to break up the lists!
- But how?
- Let's make a list for every combination of items!
14 HTI index: Solution?
- We assume a total order based on the frequency of appearance for the items of the database
- We order the items in each set-value and we transform it to a sequence
- We create a path in the access tree for each sequence
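The three steps above can be sketched as follows on the sample transactions. This is an illustration only: the function names are hypothetical, the tree is a plain nested dictionary, and breaking frequency ties alphabetically is an assumption (the sample data happens to contain no ties).

```python
from collections import Counter

transactions = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}

def frequency_order(db):
    """Rank items by descending frequency (ties broken alphabetically: an assumption)."""
    counts = Counter(item for items in db.values() for item in items)
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return {item: position for position, item in enumerate(ranked)}

def to_sequence(items, rank):
    """Turn a set-value into a sequence that follows the total order."""
    return sorted(items, key=rank.__getitem__)

def build_access_tree(db, rank):
    """Nested-dict trie with one root-to-node path per ordered sequence."""
    root = {}
    for items in db.values():
        node = root
        for item in to_sequence(items, rank):
            node = node.setdefault(item, {})
    return root

rank = frequency_order(transactions)
print(sorted(rank, key=rank.get))                        # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
print(to_sequence({"c", "f", "a", "d", "b"}, rank))      # ['a', 'b', 'c', 'd', 'f']
tree = build_access_tree(transactions, rank)
print(sorted(tree))                                      # ['a', 'b', 'c', 'e']
```

Because every sequence starts with its most frequent item, records that share frequent items also share a path prefix in the tree.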
15 HTI index: All combinations?
19 HTI index: All combinations? Maybe not
20 HTI index: An access tree for the frequent items
22 The HTI index
26 HTI index: The basic points
- The access tree is used only for the most frequent items
- The inverted lists are restructured so that each node of the access tree points to a different inverted sublist
- We keep the access tree in main memory
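One way to picture these points is the rough sketch below. It is not the authors' implementation: the `Node` class, `build_hti`, and the `num_frequent` parameter (standing in for the threshold) are hypothetical, and it assumes that a record's tid is appended to the sublist of every access-tree node on the record's path, while non-frequent items keep ordinary inverted lists.

```python
from collections import Counter, defaultdict

transactions = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}

class Node:
    """Access-tree node: children keyed by item, plus its own inverted sublist."""
    def __init__(self):
        self.children = {}
        self.sublist = []   # tids of records whose sequence passes through this node

def build_hti(db, num_frequent):
    """Build the access tree for the top `num_frequent` items (hypothetical threshold)."""
    counts = Counter(item for items in db.values() for item in items)
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    rank = {item: position for position, item in enumerate(ranked)}
    frequent = set(ranked[:num_frequent])
    root = Node()
    inverted = defaultdict(list)        # plain lists for the non-frequent items
    for tid in sorted(db):              # ascending tids keep all lists sorted
        node = root
        for item in sorted(db[tid], key=rank.__getitem__):
            if item in frequent:
                node = node.children.setdefault(item, Node())
                node.sublist.append(tid)
            else:
                inverted[item].append(tid)
    return root, dict(inverted), rank, frequent

root, inverted, rank, frequent = build_hti(transactions, num_frequent=3)
print(root.children["a"].children["b"].sublist)  # [3, 6, 7, 8, 11, 13, 16]
print(root.children["b"].sublist)                # [12, 14]
print(inverted["d"])                             # [2, 6, 7, 12, 13, 14]
```

Under these assumptions the single inverted list of b is split into one sublist for records that also contain a and one for records that do not, while d, being non-frequent here, keeps its ordinary list.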
27 Outline
- Problem definition
- The HTI index
- Query evaluation
- Experiments
- Conclusions
28 Query Evaluation: Basic Steps
- Find the frequent items of the query set
- Use the access tree to detect the sublists which might participate in the answer
- Merge-join these sublists with the inverted lists of the non-frequent items
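The steps above can be sketched for a subset query as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: all names are hypothetical, `num_frequent` stands in for the threshold, each tid is assumed to be stored in the sublist of every node on its record's path, and only the subset case is covered.

```python
from collections import Counter, defaultdict

transactions = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}

class Node:
    def __init__(self):
        self.children = {}   # item -> Node
        self.sublist = []    # tids of records whose sequence passes through here

def build_hti(db, num_frequent):
    counts = Counter(item for items in db.values() for item in items)
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    rank = {item: position for position, item in enumerate(ranked)}
    frequent = set(ranked[:num_frequent])
    root, inverted = Node(), defaultdict(list)
    for tid in sorted(db):
        node = root
        for item in sorted(db[tid], key=rank.__getitem__):
            if item in frequent:
                node = node.children.setdefault(item, Node())
                node.sublist.append(tid)
            else:
                inverted[item].append(tid)
    return root, dict(inverted), rank, frequent

def candidate_sublists(node, needed, rank):
    """Yield sublists of nodes whose path contains all of `needed` (in rank order)."""
    for item, child in node.children.items():
        if item == needed[0]:
            if len(needed) == 1:
                yield child.sublist   # every frequent query item is on this path
            else:
                yield from candidate_sublists(child, needed[1:], rank)
        elif rank[item] < rank[needed[0]]:
            # a more frequent non-query item; the needed item may still follow
            yield from candidate_sublists(child, needed, rank)
        # otherwise the path has already skipped needed[0]: prune the branch

def merge_join(left, right):
    """Intersect two sorted tid lists in one linear pass."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out

def subset_query(hti, query):
    root, inverted, rank, frequent = hti
    # Step 1: find the frequent items of the query set.
    freq_items = sorted((i for i in query if i in frequent), key=rank.__getitem__)
    rare_lists = [inverted.get(i, []) for i in query if i not in frequent]
    # Step 2: use the access tree to collect the candidate sublists.
    if freq_items:
        result = sorted(t for sub in candidate_sublists(root, freq_items, rank) for t in sub)
    elif rare_lists:
        result = rare_lists.pop()
    else:
        return []  # an empty query is not handled in this sketch
    # Step 3: merge-join with the inverted lists of the non-frequent items.
    for lst in rare_lists:
        result = merge_join(result, lst)
    return result

hti = build_hti(transactions, num_frequent=3)   # a, b, c are treated as frequent
print(subset_query(hti, {"b", "c", "d"}))       # [6, 12, 13]
```

For the query (b, c, d) only the two short sublists reached through the paths a, b, c and b, c are read before joining with the list of d; the long lists of b and c are never scanned in full.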
29 Subset - (b, c, d)
34 Equality - (b, c, d)
38 Superset - (b, c, d)
45 Outline
- Problem definition
- The HTI index
- Query evaluation
- Experiments
- Conclusions
46 Experiments: Setup
- Real data from UCI
  - Web log from microsoft.com: 320k records, 294 items
  - Web log from msnbc.com: 1M records, 17 items
- Synthetic data
  - Zipfian distribution of order 1
  - 100k-1M records
  - 1k-10k items
- Queries with 2-22 items
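For readers who want to reproduce a comparable synthetic workload, the sketch below draws set-values whose items follow a Zipfian distribution of order 1. It is only an approximation of the setup: the slides do not give the record cardinality or the generator used, so `max_cardinality`, the seed, and the function names are assumptions.

```python
import random

def zipf_weights(num_items, order=1.0):
    """Unnormalised Zipfian weights: the item of rank r gets weight 1 / r**order."""
    return [1.0 / (r ** order) for r in range(1, num_items + 1)]

def generate_records(num_records, num_items, max_cardinality=10, order=1.0, seed=0):
    """Generate set-values over items 0..num_items-1 (item 0 is the most frequent).

    `max_cardinality` is an assumed bound; the slides do not state record sizes.
    """
    rng = random.Random(seed)
    weights = zipf_weights(num_items, order)
    items = range(num_items)
    records = []
    for _ in range(num_records):
        size = rng.randint(1, max_cardinality)
        # Draw with replacement and deduplicate, so a record may hold fewer than `size` items.
        records.append(set(rng.choices(items, weights=weights, k=size)))
    return records

records = generate_records(num_records=1000, num_items=100)
print(len(records))
```

The slide's ranges (100k-1M records, 1k-10k items) can be passed as `num_records` and `num_items`; the small values above only keep the example fast.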
47 Experiments: Query performance - DB size
48 Experiments: Query performance - query length
52 Experiments: Access tree size - DB size
54 Experiments
- The HTI scales much better than the inverted file as the query and the database size grow
- A small threshold is enough for a performance gain of over an order of magnitude
- The main memory requirements do not exceed 0.5M for the real data.
55 Outline
- Problem definition
- The HTI index
- Query evaluation
- Experiments
- Conclusions
56 Conclusions
- The HTI index relies on breaking up the larger inverted lists into smaller lists that contain known combinations of items
- The HTI index significantly outperforms the inverted file for small domains and skewed item distributions
- It has moderate memory requirements that can be adjusted by choosing the right threshold
57 The End
58 Experiments: Vocabulary size
59 Experiments: Threshold choice