1
A Combination of Trie-trees and Inverted files
for the Indexing of Set-valued Attributes
  • Manolis Terrovitis (NTUA)
  • Spyros Passas (NTUA)
  • Panos Vassiliadis (UoI)
  • Timos Sellis (NTUA)

2
Problem
  • We are interested in low-cardinality set values,
    such as those found in:
      • Retail store transaction logs
      • Web logs
      • Biomedical databases, etc.
  • We address the efficient evaluation of
    containment queries, for example:
      • In which transactions were products a and b
        sold together?
      • Which users visited only the main page or the
        download page of our site?
  • We propose the Hybrid Trie-Inverted file (HTI)
    index

3
Outline
  • Problem definition
  • The HTI index
  • Query evaluation
  • Experiments
  • Conclusions

5
Data and queries
tid  products         tid  products
1    f,a              9    a,e
2    a,d,c            10   g,c,a
3    c,b,a            11   b,a,e
4    f,a,c            12   b,d,c
5    c,g              13   c,f,a,d,b
6    a,b,g,c,d,e      14   b,d
7    a,d,b            15   e
8    a,e,b            16   b,f,a

8
Data and queries
  • Find all transactions that contain a, b and
    d (subset)
  • Find all transactions that contain exactly a,
    b and d (equality)
  • Find all transactions that contain only items
    from a, b and d (superset)

9
Data and queries
  • Traditional methods
      • Signature files
      • Inverted files
  • Differences from text databases
      • Low cardinality
      • Large number of records in comparison with
        vocabulary size
      • New types of queries (equality, superset)

10
Outline
  • Problem definition
  • The HTI index
  • Query evaluation
  • Experiments
  • Conclusions

11
The HTI index: Background (the inverted file)
12
HTI index: Inverted files - problems
  • The evaluation of containment queries relies on
    merge-joining the inverted lists
  • The inverted lists become very long
      • when the database is very large compared to
        the vocabulary
      • when the item distribution is skewed
  • This is often the case in the real world!
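
To make the problem concrete, here is a minimal sketch of a plain inverted file over the 16 example transactions, with a subset query answered by merge-joining the sorted lists. The helper names are illustrative and the query set is assumed to be non-empty.

```python
from collections import defaultdict

RAW = "fa adc cba fac cg abgcde adb aeb ae gca bae bdc cfadb bd e bfa"
DB = {tid: set(items) for tid, items in enumerate(RAW.split(), start=1)}

def build_inverted_file(db):
    """Map each item to the sorted list of tids that contain it."""
    inv = defaultdict(list)
    for tid in sorted(db):
        for item in db[tid]:
            inv[item].append(tid)
    return dict(inv)

def merge_join(left, right):
    """Intersect two sorted tid lists in one linear pass."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out

def subset_query(inv, qs):
    """Tids containing every item of qs (qs assumed non-empty)."""
    if not qs:
        raise ValueError("query set must not be empty")
    lists = sorted((inv.get(item, []) for item in qs), key=len)
    result = lists[0]
    for lst in lists[1:]:
        result = merge_join(result, lst)
    return result

inv = build_inverted_file(DB)
print(len(inv["a"]), len(inv["g"]))      # 12 3
print(subset_query(inv, {"a", "b", "d"}))  # [6, 7, 13]
```

Even in this tiny example the list for a covers 12 of the 16 records, so any query that mentions a has to read most of the database.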

13
HTI index: Solution?
  • We need to break up the lists!
  • But how?
  • Let's make a list for every combination of items!

14
HTI index: Solution?
  • We assume a total order based on the frequency of
    appearance for the items of the database
  • We order the items in each set-value and we
    transform it to a sequence
  • We create a path in the access tree for each
    sequence
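
The three steps above can be sketched on the example transactions as follows. The tie-breaking rule (alphabetical) and the nested-dictionary representation of the tree are assumptions made for this illustration; the slides do not specify either.

```python
from collections import Counter

RAW = "fa adc cba fac cg abgcde adb aeb ae gca bae bdc cfadb bd e bfa"
DB = {tid: set(items) for tid, items in enumerate(RAW.split(), start=1)}

# Total order: most frequent item first (ties broken alphabetically, an assumption).
freq = Counter(item for items in DB.values() for item in items)
ORDER = sorted(freq, key=lambda item: (-freq[item], item))
RANK = {item: r for r, item in enumerate(ORDER)}

def to_sequence(items):
    """Turn a set value into a sequence that follows the total order."""
    return sorted(items, key=RANK.__getitem__)

def build_access_tree(db):
    """Insert one root-to-node path per sequence into a trie of nested dicts."""
    root = {}
    for items in db.values():
        node = root
        for item in to_sequence(items):
            node = node.setdefault(item, {})
    return root

tree = build_access_tree(DB)
print(ORDER)                # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
print(to_sequence(DB[13]))  # ['a', 'b', 'c', 'd', 'f']
print(sorted(tree))         # ['a', 'b', 'c', 'e']
```

Because every sequence starts with its most frequent item, the first level of the tree has only four children here, even though the vocabulary has seven items.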

15
HTI index: All combinations?
16
HTI index: All combinations?
17
HTI index: All combinations?
18
HTI index: All combinations?
19
HTI index: All combinations? Maybe not
20
HTI index: An access tree for the frequent items
21
HTI index: An access tree for the frequent items
22
The HTI index
23
The HTI index
24
The HTI index
25
The HTI index
26
HTI index: The basic points
  • The access tree is used only for the most
    frequent items
  • The inverted lists are restructured so that each
    node of the access tree points to a different
    inverted sublist
  • We keep the access tree in main memory
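
One possible reading of these points is sketched below; it is not the authors' implementation. The assumptions are: `threshold` is the number of most frequent items kept in the access tree, each tree node is identified by its path of frequent items, a record's tid is stored in the sublist of every node along its path, and non-frequent items keep ordinary inverted lists.

```python
from collections import Counter, defaultdict

RAW = "fa adc cba fac cg abgcde adb aeb ae gca bae bdc cfadb bd e bfa"
DB = {tid: set(items) for tid, items in enumerate(RAW.split(), start=1)}

def build_hti(db, threshold):
    """Build a toy hybrid index: per-node sublists plus plain inverted lists."""
    freq = Counter(item for items in db.values() for item in items)
    order = sorted(freq, key=lambda item: (-freq[item], item))
    rank = {item: r for r, item in enumerate(order)}
    frequent = set(order[:threshold])
    sublists = defaultdict(list)  # access-tree node (path tuple) -> sorted tids
    inverted = defaultdict(list)  # non-frequent item -> sorted tids
    for tid in sorted(db):
        path = ()
        # Frequent items rank first, so the path is complete before
        # any non-frequent item of the record is seen.
        for item in sorted(db[tid], key=rank.__getitem__):
            if item in frequent:
                path += (item,)
                sublists[path].append(tid)
            else:
                inverted[item].append(tid)
    return rank, frequent, dict(sublists), dict(inverted)

rank, frequent, sublists, inverted = build_hti(DB, threshold=3)
print(sorted(frequent))      # ['a', 'b', 'c']
print(sublists[("a", "b")])  # [3, 6, 7, 8, 11, 13, 16]
print(sublists[("b",)])      # [12, 14]
print(inverted["d"])         # [2, 6, 7, 12, 13, 14]
```

Under these assumptions the single inverted list of b (nine tids) is split into two sublists, one for records that also contain a and one for records that do not.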

27
Outline
  • Problem definition
  • The HTI index
  • Query evaluation
  • Experiments
  • Conclusions

28
Query evaluation: Basic steps
  1. Find the frequent items of the query set
  2. Use the access tree to detect the sublists which
    might participate in the answer
  3. Merge-join these sublists with the inverted lists
    of the non-frequent items
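
The three steps can be sketched for a subset query as follows. This is a toy illustration under stated assumptions, not the authors' algorithm: `threshold` is the number of most frequent items in the access tree, nodes are identified by their path of frequent items, and the candidate nodes are found by scanning all paths, whereas a real access tree would be traversed.

```python
from collections import Counter, defaultdict
from functools import reduce

RAW = "fa adc cba fac cg abgcde adb aeb ae gca bae bdc cfadb bd e bfa"
DB = {tid: set(items) for tid, items in enumerate(RAW.split(), start=1)}

def build_hti(db, threshold):
    """Toy hybrid index: per-node sublists plus plain inverted lists."""
    freq = Counter(item for items in db.values() for item in items)
    order = sorted(freq, key=lambda item: (-freq[item], item))
    rank = {item: r for r, item in enumerate(order)}
    frequent = set(order[:threshold])
    sublists, inverted = defaultdict(list), defaultdict(list)
    for tid in sorted(db):
        path = ()
        for item in sorted(db[tid], key=rank.__getitem__):
            if item in frequent:
                path += (item,)
                sublists[path].append(tid)
            else:
                inverted[item].append(tid)
    return rank, frequent, dict(sublists), dict(inverted)

def intersect(left, right):
    """Merge-join of two sorted tid lists."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i, j = i + 1, j + 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out

def subset_query(hti, qs):
    """Tids of records containing every item of qs (qs assumed non-empty)."""
    rank, frequent, sublists, inverted = hti
    if not qs:
        raise ValueError("query set must not be empty")
    # Step 1: split the query into frequent and non-frequent items.
    q_freq = sorted((i for i in qs if i in frequent), key=rank.__getitem__)
    lists = [inverted.get(i, []) for i in qs if i not in frequent]
    # Step 2: keep the sublists of nodes labeled with the least frequent
    # of the frequent query items whose path covers all of them.
    if q_freq:
        last, needed = q_freq[-1], set(q_freq)
        lists.append(sorted(
            tid for path, tids in sublists.items()
            if path[-1] == last and needed <= set(path)
            for tid in tids))
    # Step 3: merge-join the candidates with the remaining inverted lists.
    return reduce(intersect, sorted(lists, key=len))

hti = build_hti(DB, threshold=3)
print(subset_query(hti, set("bcd")))  # [6, 12, 13]
```

For the query (b, c, d) with a, b and c in the access tree, only the sublists of the nodes a-b-c and b-c are read (four tids in total) before the join with the list of d.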

29
Subset - (b, c, d)
30
Subset - (b, c, d)
31
Subset - (b, c, d)
32
Subset - (b, c, d)
33
Subset - (b, c, d)
34
Equality - (b, c, d)
35
Equality - (b, c, d)
36
Equality - (b, c, d)
37
Equality - (b, c, d)
38
Superset - (b, c, d)
39
Superset - (b, c, d)
40
Superset - (b, c, d)
41
Superset - (b, c, d)
42
Superset - (b, c, d)
43
Superset - (b, c, d)
44
Superset - (b, c, d)
45
Outline
  • Problem definition
  • The HTI index
  • Query evaluation
  • Experiments
  • Conclusions

46
Experiments: Setup
  • Real data from UCI
      • Web log from microsoft.com: 320k records, 294
        items
      • Web log from msnbc.com: 1M records, 17 items
  • Synthetic data
      • Zipfian distribution of order 1
      • 100k-1M records
      • 1k-10k items
  • Queries with 2-22 items
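
As an illustration of what "Zipfian distribution of order 1" means for the synthetic data, the sketch below draws items with probability proportional to 1/rank. It is not the generator used in the experiments: the record-length distribution (uniform between 1 and `max_len` draws), the `max_len` and `seed` parameters, and the function name are all assumptions.

```python
import random
from collections import Counter
from itertools import accumulate

def zipf_records(n_records, n_items, max_len=10, seed=0):
    """Generate set-valued records whose items follow a Zipf law of order 1."""
    rng = random.Random(seed)
    items = range(n_items)
    # Cumulative weights 1/1, 1/1 + 1/2, ... so item 0 is the most frequent.
    cum = list(accumulate(1.0 / (r + 1) for r in items))
    records = []
    for _ in range(n_records):
        k = rng.randint(1, max_len)  # number of draws; duplicates collapse
        records.append(set(rng.choices(items, cum_weights=cum, k=k)))
    return records

records = zipf_records(n_records=1000, n_items=1000)
counts = Counter(item for rec in records for item in rec)
print(len(records), counts[0] > counts[999])
```

The skew is the point: a handful of items appear in a large share of the records, which is exactly the situation in which plain inverted lists grow long.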

47
Experiments: Query performance - DB size
48
Experiments: Query performance - query length
49
Experiments: Query performance - query length
50
Experiments: Query performance - query length
51
Experiments: Query performance - query length
52
Experiments: Access tree size - DB size
53
Experiments: Access tree size - DB size
54
Experiments
  • The HTI index scales much better than the inverted
    file as the query size and the database size grow
  • A small threshold is enough for a performance
    gain of over an order of magnitude
  • The main-memory requirements do not exceed 0.5M
    for the real data

55
Outline
  • Problem definition
  • The HTI index
  • Query evaluation
  • Experiments
  • Conclusions

56
Conclusions
  • The HTI index relies on breaking up the larger
    inverted lists into smaller lists that contain
    known combinations of items
  • The HTI index significantly outperforms the
    inverted file for small domains and skewed item
    distributions
  • It has moderate memory requirements that can be
    adjusted by choosing the right threshold

57
The End
  • Thank You!

58
Experiments: Vocabulary size
59
Experiments: Threshold choice
60
Experiments: Threshold choice