A Combination of Trie-trees and Inverted files for the Indexing of Set-valued Attributes

1
A Combination of Trie-trees and Inverted files
for the Indexing of Set-valued Attributes

Manolis Terrovitis (NTUA)
Spyros Passas (NTUA)
Panos Vassiliadis (UoI)
Timos Sellis (NTUA)

2
Problem

We are interested in low-cardinality set values
  Retail store transaction logs
  Web logs
  Biomedical databases, etc.
We address the efficient evaluation of containment queries
  In which transactions were products a and b sold together?
  Which users visited only the main page or the download page of our site?
We propose the Hybrid Trie-Inverted file (HTI) index

3
Outline

Problem definition
The HTI index
Query evaluation
Experiments
Conclusions

5
Data and queries
tid  products       tid  products
1    f,a            9    a,e
2    a,d,c          10   g,c,a
3    c,b,a          11   b,a,e
4    f,a,c          12   b,d,c
5    c,g            13   c,f,a,d,b
6    a,b,g,c,d,e    14   b,d
7    a,d,b          15   e
8    a,e,b          16   b,f,a
8
Data and queries

Find all transactions that contain a, b and d (subset)
Find all transactions that contain exactly a, b and d (equality)
Find all transactions that contain only items from a, b and d (superset)
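
The three query types can be made concrete with a brute-force scan over the example table. This is only a reference sketch of the query semantics, not an index; the function names are illustrative and do not come from the presentation.

```python
# Example transactions from the slide (tid -> set of products).
TRANSACTIONS = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


def subset_query(db, q):
    """tids whose set value contains every item of q."""
    return sorted(tid for tid, items in db.items() if q <= items)


def equality_query(db, q):
    """tids whose set value is exactly q."""
    return sorted(tid for tid, items in db.items() if items == q)


def superset_query(db, q):
    """tids whose set value contains only items of q."""
    return sorted(tid for tid, items in db.items() if items <= q)


q = {"a", "b", "d"}
print(subset_query(TRANSACTIONS, q))    # [6, 7, 13]
print(equality_query(TRANSACTIONS, q))  # [7]
print(superset_query(TRANSACTIONS, q))  # [7, 14]
```

Every query here touches all 16 records; the index structures discussed in the rest of the talk exist to avoid that scan.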

9
Data and queries

Traditional methods
  Signature files
  Inverted files
Differences from text databases
  Low cardinality
  Large number of records in comparison with vocabulary size
  New types of queries (equality, superset)

10
Outline

Problem definition
The HTI index
Query evaluation
Experiments
Conclusions

11
The HTI index: Background - The inverted file
12
HTI index: Inverted files - problems

The evaluation of containment queries relies on merge-joining the inverted lists
The inverted lists become very long
  when the database is very large compared to the vocabulary
  when the item distribution is skewed
This is often the case in the real world!
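
A minimal sketch of a plain inverted file and a merge-join based subset query, using the example table from the earlier slide. The helper names are illustrative. It shows why list length matters: the list for item a alone holds 12 of the 16 tids.

```python
from collections import defaultdict

DB = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


def build_inverted_file(db):
    """Map each item to the sorted list of tids that contain it."""
    inverted = defaultdict(list)
    for tid in sorted(db):          # ascending tids keep every list sorted
        for item in db[tid]:
            inverted[item].append(tid)
    return dict(inverted)


def merge_join(left, right):
    """Intersect two sorted tid lists in one linear pass."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out


def subset_query(inverted, q):
    """tids containing every item of the non-empty query set q."""
    lists = sorted((inverted.get(item, []) for item in q), key=len)
    result = lists[0]               # start from the shortest list
    for lst in lists[1:]:
        result = merge_join(result, lst)
    return result


inverted = build_inverted_file(DB)
print(len(inverted["a"]))                       # 12
print(subset_query(inverted, {"a", "b", "d"}))  # [6, 7, 13]
```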

13
HTI index: Solution?

We need to break up the lists!
But how?
Let's make a list for every combination of items!

14
HTI index: Solution?

We assume a total order on the items of the database, based on their frequency of appearance
We order the items of each set value according to this order, transforming the set value into a sequence
We create a path in the access tree for each sequence
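
The three steps above can be sketched as follows on the example table. Two details are assumptions, since the slides do not state them: items are ordered from most to least frequent, and ties are broken alphabetically. The tree is represented as nested dictionaries purely for illustration.

```python
from collections import Counter

DB = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


def frequency_order(db):
    """Return item -> rank; rank 0 is the most frequent item (assumed order)."""
    counts = Counter(item for items in db.values() for item in items)
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return {item: rank for rank, item in enumerate(ranked)}


def to_sequence(items, rank):
    """Turn a set value into a sequence that follows the total order."""
    return sorted(items, key=rank.__getitem__)


def build_access_tree(db, rank):
    """Insert one root-to-node path per sequence into a trie of nested dicts."""
    root = {}
    for items in db.values():
        node = root
        for item in to_sequence(items, rank):
            node = node.setdefault(item, {})
    return root


rank = frequency_order(DB)
tree = build_access_tree(DB, rank)
print(to_sequence(DB[13], rank))  # ['a', 'b', 'c', 'd', 'f']
print(sorted(tree))               # ['a', 'b', 'c', 'e']
```

In this data set the frequency order happens to coincide with alphabetical order (a appears 12 times, g only 3 times), so sequences that share frequent items share a prefix in the tree.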

15-18
HTI index: All combinations?
19
HTI index: All combinations? Maybe not
20-21
HTI index: An access tree for the frequent items
22-25
The HTI index
26
HTI index: The basic points

The access tree is used only for the most frequent items
The inverted lists are restructured so that each node of the access tree points to a different inverted sublist
We keep the access tree in main memory
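
One possible reading of these points, sketched on the example table. The figures are missing from the transcript, so the exact layout is an assumption: here each tree node keeps the sublist of tids whose frequent-item sequence passes through it, and non-frequent items keep ordinary inverted lists. `Node`, `build_hti` and the `threshold` parameter (minimum number of occurrences for an item to count as frequent) are hypothetical names.

```python
from collections import Counter, defaultdict

DB = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


class Node:
    """Access-tree node: child nodes plus the inverted sublist it points to."""

    def __init__(self):
        self.children = {}  # item -> Node
        self.sublist = []   # sorted tids whose frequent sequence passes here


def build_hti(db, threshold):
    """Assumed layout: tree over frequent items, plain lists for the rest."""
    counts = Counter(item for items in db.values() for item in items)
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    rank = {item: r for r, item in enumerate(ranked)}
    frequent = {item for item, c in counts.items() if c >= threshold}
    root, rare_lists = Node(), defaultdict(list)
    for tid in sorted(db):
        node = root
        for item in sorted(db[tid] & frequent, key=rank.__getitem__):
            node = node.children.setdefault(item, Node())
            node.sublist.append(tid)        # one sublist entry per path node
        for item in db[tid] - frequent:
            rare_lists[item].append(tid)    # ordinary inverted list
    return root, dict(rare_lists), frequent, rank


root, rare_lists, frequent, rank = build_hti(DB, threshold=6)
print(sorted(frequent))                             # ['a', 'b', 'c', 'd']
print(root.children["a"].children["b"].sublist)     # [3, 6, 7, 8, 11, 13, 16]
print(root.children["b"].sublist)                   # [12, 14]
print(rare_lists["e"])                              # [6, 8, 9, 11, 15]
```

Under this layout the inverted list of b (9 tids) is split into two sublists, one for the node reached by a, b and one for the node reached by b alone, so a query that involves b together with a never reads tids 12 and 14.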

27
Outline

Problem definition
The HTI index
Query evaluation
Experiments
Conclusions

28
Query Evaluation: Basic Steps

Find the frequent items of the query set
Use the access tree to detect the sublists which might participate in the answer
Merge-join these sublists with the inverted lists of the non-frequent items
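
A hedged sketch of the three steps for a subset query. It assumes the same hypothetical layout as before (each node's sublist holds the tids whose frequent-item sequence passes through that node; `threshold` is a minimum occurrence count), and it uses Python set intersection in place of a real merge-join to stay short. None of the names come from the presentation.

```python
from collections import Counter, defaultdict

DB = {
    1: {"f", "a"}, 2: {"a", "d", "c"}, 3: {"c", "b", "a"}, 4: {"f", "a", "c"},
    5: {"c", "g"}, 6: {"a", "b", "g", "c", "d", "e"}, 7: {"a", "d", "b"},
    8: {"a", "e", "b"}, 9: {"a", "e"}, 10: {"g", "c", "a"}, 11: {"b", "a", "e"},
    12: {"b", "d", "c"}, 13: {"c", "f", "a", "d", "b"}, 14: {"b", "d"},
    15: {"e"}, 16: {"b", "f", "a"},
}


class Node:
    def __init__(self):
        self.children = {}  # item -> Node
        self.sublist = []   # tids whose frequent sequence passes through here


def build_hti(db, threshold):
    counts = Counter(item for items in db.values() for item in items)
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    rank = {item: r for r, item in enumerate(ranked)}
    frequent = {item for item, c in counts.items() if c >= threshold}
    root, rare_lists = Node(), defaultdict(list)
    for tid in sorted(db):
        node = root
        for item in sorted(db[tid] & frequent, key=rank.__getitem__):
            node = node.children.setdefault(item, Node())
            node.sublist.append(tid)
        for item in db[tid] - frequent:
            rare_lists[item].append(tid)
    return root, dict(rare_lists), frequent, rank


def matching_nodes(node, needed, rank):
    """Yield nodes labelled needed[-1] whose path contains all needed items."""
    target = needed[0]
    for item, child in node.children.items():
        if item == target:
            if len(needed) == 1:
                yield child
            else:
                yield from matching_nodes(child, needed[1:], rank)
        elif rank[item] < rank[target]:
            # A more frequent item: the target may still appear further down.
            yield from matching_nodes(child, needed, rank)
        # Otherwise the item is ordered after the target: prune this branch.


def subset_query(index, q):
    """tids containing every item of the non-empty query set q."""
    root, rare_lists, frequent, rank = index
    freq_items = sorted(q & frequent, key=rank.__getitem__)      # step 1
    result = None
    if freq_items:                                               # step 2
        result = set()
        for node in matching_nodes(root, freq_items, rank):
            result.update(node.sublist)
    for item in q - frequent:                                    # step 3
        tids = set(rare_lists.get(item, []))
        result = tids if result is None else result & tids
    return sorted(result or [])


index = build_hti(DB, threshold=6)
print(subset_query(index, {"b", "c", "d"}))  # [6, 12, 13]
print(subset_query(index, {"b", "e"}))       # [6, 8, 11]
```

For the query (b, c, d) only two sublists are read, the ones at the ends of the paths a, b, c, d and b, c, d, rather than the full lists of b, c and d.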

29-33
Subset - (b, c, d)
34-37
Equality - (b, c, d)
38-44
Superset - (b, c, d)
45
Outline

Problem definition
The HTI index
Query evaluation
Experiments
Conclusions

46
Experiments: Setup

Real data from UCI
  web log from microsoft.com: 320k records, 294 items
  web log from msnbc.com: 1M records, 17 items
Synthetic data
  Zipfian distribution of order 1
  100k-1M records
  1k-10k items
Queries with 2-22 items
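
For readers who want comparable synthetic data, the sketch below draws items with probability proportional to 1/rank (a Zipfian distribution of order 1). The slides do not say how record sizes were chosen, so the uniform size between 1 and `max_size` is purely an assumption, as are the function name and its parameters. This is not the generator used in the experiments.

```python
import itertools
import random


def zipf_transactions(n_records, n_items, max_size, seed=0):
    """Generate tid -> set of item ids with Zipf(1) item popularity.

    Assumption: each record's size is uniform in [1, max_size];
    max_size must not exceed n_items.
    """
    rng = random.Random(seed)
    items = list(range(n_items))
    # Weight of the item with 1-based rank r is 1 / r.
    cum_weights = list(itertools.accumulate(1.0 / (r + 1) for r in items))
    db = {}
    for tid in range(1, n_records + 1):
        size = rng.randint(1, max_size)
        record = set()
        while len(record) < size:   # redraw until `size` distinct items
            record.add(rng.choices(items, cum_weights=cum_weights)[0])
        db[tid] = record
    return db


db = zipf_transactions(n_records=2000, n_items=50, max_size=5, seed=1)
print(len(db))
```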

47
Experiments: Query performance - DB size
48-51
Experiments: Query performance - query length
52-53
Experiments: Access tree size - DB size
54
Experiments

The HTI scales much better than the inverted file as the query size and the database size grow
A small threshold is enough for a performance gain of over an order of magnitude
The main memory requirements do not exceed 0.5M for the real data

55
Outline

Problem definition
The HTI index
Query evaluation
Experiments
Conclusions

56
Conclusions

The HTI index relies on breaking up the larger inverted lists into smaller lists that contain known combinations of items
The HTI index significantly outperforms the inverted file for small domains and skewed item distributions
It has moderate memory requirements that can be adjusted by using the right threshold

57
The End