Clustering, Jan 2003, Yoon, 1

Description:

Fast Clustering for XML Bitmap Indexes J. Yoon University of Louisiana at Lafayette Center for Advanced Computer Studies – PowerPoint PPT presentation

Slides: 89

Transcript and Presenter's Notes

1
Fast Clustering for XML Bitmap Indexes

J. Yoon
University of Louisiana at Lafayette
Center for Advanced Computer Studies

Motivating Examples
Related Work
Bitmap indexing: 3-dim, 2-dim, multi-dim; space problem (compression saves space, no retrieval; clustering: uniform bits → weighted bits)
Clustering: distance-based (k-means, k-nearest neighbor), density-based (k-median), entropy-based () / semantic-based (LSI), topology-based (TOIS-Stanford, ST-based, Jagadish)
Pass: 1-pass clustering → data streams
Preliminaries
XML bitmap index
Weighted bits: popularity pop(), security sec()
Radius of a cluster: rad()
Thresholds: similarity, popularity, radius
Divide-and-conquer clustering
K-means → Hamming distance
Clusters → find centroid
(-) takes long, O(nk) / compare with LSI in quality
Weighted-bit-based divide-and-conquer clustering
Popular bits / unpopular bits
Similarity → find centroid: similar on popular bits and not dissimilar on unpopular bits
One-pass clustering: fast centroid, O(1) / compare with LSI in quality and in speed

Motivating Examples
XML document retrieval with XML bitmap indexes is fast, taking almost constant time. The cost is space consumption → shuffle bits to eliminate 0-planes → cluster: group a very large XML bitmap index into many smaller indexes, each of which remains relevant within itself.
Related Work
Bitmap indexing: 3-dim, 2-dim, multi-dim; space problem (compression saves space, no retrieval; clustering: uniform bits → weighted bits)
Clustering: distance-based (k-means, k-nearest neighbor), density-based (k-median), entropy-based () / semantic-based (LSI), topology-based (TOIS-Stanford, ST-based, Jagadish)
Pass: 1-pass clustering → data streams
Preliminaries
XML bitmap index
Weighted bits: popularity pop(), security sec()
Radius of a cluster: rad()
Thresholds: similarity, popularity, radius
Divide-and-conquer clustering
K-means → Hamming distance
Clusters → find centroid
(-) takes long, O(nk) / compare with LSI in quality
Weighted-XPath-based divide-and-conquer clustering
Popular bits / unpopular bits
Similarity → find centroid: similar on popular bits and not dissimilar on unpopular bits
Fast Clustering on Weighted Bits

4
<paper> <title>W0</title> <author>W1 <affiliate>W2</affiliate> </author> <section>economy W3 </section> </paper>
<paper> <title>W0</title> <author>W1 <affiliate>W2</affiliate> </author> <section>W3 <section>economy</section> </section> </paper>
<paper> <title>economy</title> <author>W1 <affiliate>W2</affiliate> </author> <section>W3 <section>W4</section> </section> </paper>
<paper> <title>W0</title> <author>W1 <affiliate>economy</affiliate> </author> <section>W3 <section>W4</section> </section> </paper>
If full text file: economy W1 W2 W3 W4.
Similarity-based centroid is found by
Consider the bitmaps 01011100, 11011000, and 01111111.
Same words in different structures
Same words in the same structures
Same structure containing different words
Different words in different structures, but linking to the same documents
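
The slide does not say how the centroid of the three bitmaps is obtained. As a minimal sketch, assuming Hamming distance as the dissimilarity and a bit-wise majority vote as the centroid (both are assumptions, and the function names are hypothetical), the three bitmaps can be compared like this:

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length bitmaps differ."""
    if len(a) != len(b):
        raise ValueError("bitmaps must have the same length")
    return sum(x != y for x, y in zip(a, b))


def majority_centroid(rows: list[str]) -> str:
    """Bit-wise majority vote (assumed centroid rule; ties become 0)."""
    n = len(rows)
    width = len(rows[0])
    return "".join(
        "1" if 2 * sum(r[i] == "1" for r in rows) > n else "0"
        for i in range(width)
    )


bitmaps = ["01011100", "11011000", "01111111"]
centroid = majority_centroid(bitmaps)
distances = [hamming(centroid, b) for b in bitmaps]
```

Under these assumptions the centroid coincides with the first bitmap, which is the one closest to the other two.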
5
K-means
ALGORITHM k-Means Bitmap Clustering
INPUT bitmap index BI, number of clusters k, similarity threshold s, radius threshold r
OUTPUT a number of smaller bitmap indexes
METHOD
(1) Select k cluster centroids ci from cluster sets C1, C2, ..., Ck
(2) For each row b (∈ BI), assign b to ci if sim(ci, b) ≥ s
(3) For each cluster, re-compute the center
(4) Continue from (1) until ∀ Ci, radius(Ci) ≤ r
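
The loop above can be sketched in a simplified form. This is not the slide's exact algorithm: it assigns each row to its nearest centroid by Hamming distance instead of testing sim(ci, b) ≥ s, recomputes centroids by bit-wise majority vote, and stops when assignments no longer change instead of checking radius(Ci) ≤ r. All names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of differing bit positions."""
    return sum(x != y for x, y in zip(a, b))


def majority_centroid(rows: list[str]) -> str:
    """Bit-wise majority vote; ties become 0 (an assumed rule)."""
    n = len(rows)
    return "".join(
        "1" if 2 * sum(r[i] == "1" for r in rows) > n else "0"
        for i in range(len(rows[0]))
    )


def kmeans_bitmaps(rows, centroids, max_iter=20):
    """Plain k-means over bit strings, started from caller-supplied centroids."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid (lowest index wins ties).
        new_assignment = [
            min(range(len(centroids)), key=lambda c: hamming(row, centroids[c]))
            for row in rows
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: recompute each non-empty cluster's centroid.
        for c in range(len(centroids)):
            members = [row for row, a in zip(rows, assignment) if a == c]
            if members:
                centroids[c] = majority_centroid(members)
    return assignment, centroids


assignment, centers = kmeans_bitmaps(
    ["1111", "1110", "0001", "0000"], centroids=["1111", "0000"]
)
```

Each pass touches every row once per centroid, which is the O(nk) cost per iteration that the outline lists as the drawback of k-means.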
6
Divide-&-Conquer k-Means
ALGORITHM Bitmap-based Divide-&-Conquer Clustering using k-Means
INPUT bitmap index BI, number of clusters k, popularity threshold p, similarity threshold s, radius threshold r
OUTPUT a number of smaller bitmap indexes
METHOD
Let I be a set of integers
Let proj(B, I) be a projected bitmap from a bitmap B, where I denotes positions in B
(1) For each bit b, if pop(b) ≥ p, then add b to the popular-bitset P
(2) For each cluster centroid ci in k clusters
(3) do For any two rows b (∈ BI),
(4) if sim(proj(ci, P), proj(b, P)) ≥ s, and radius(Ci) is minimum, then assign b to Ci
(5) otherwise assign b to a new cluster Ci+1
(6) Stop if no more clusters are obtained.
(7) For each cluster Ci, if radius(Ci) ≤ r, then form a bit-centroid of Ci
(8) otherwise, invoke Bitmap-based Divide-&-Conquer Clustering (Ci, p, s, r)
7
Divide-&-Conquer k-Min
ALGORITHM Bitmap-based Divide-&-Conquer Clustering with k-Min
INPUT bitmap index BI, popularity threshold p, similarity threshold s, radius threshold r
OUTPUT a number of smaller bitmap indexes
METHOD
Let I be a set of integers
Let proj(B, I) be a projected bitmap from a bitmap B, where I denotes positions in B
(1) For each bit b, if pop(b) ≥ p, then add b to the popular-bitset P
(2) Let the MIN number of bits out of P to satisfy the similarity threshold s be n
(3) For each row b (∈ BI), if proj(b, P) ≥ n, then assign b to S
(4) otherwise assign b to U
(5) Stop if no more clusters are obtained.
(6) If radius(S) ≤ r, form a bit-wise centroid of S
(7) otherwise invoke Bitmap-based Divide-&-Conquer Clustering (S, p, s, r)
(8) If radius(U) ≤ r, form a bit-wise centroid of U
(9) otherwise invoke Bitmap-based Divide-&-Conquer Clustering (U, p, s, r)
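
A minimal sketch of the k-Min procedure follows. Several details are assumptions, since the slide does not define them: pop(b) is taken as the fraction of rows with bit b set, n is taken as ceil(s × |P|), proj(b, P) ≥ n is read as "the row has at least n ones among the popular bits", and radius is the largest Hamming distance to a majority-vote centroid divided by the bitmap width. The eight rows are the document bitmaps from the worked example on later slides, restored to original bit order.

```python
from math import ceil

DOCS = [
    "11011101", "10001011", "10101101", "11000111",
    "11011101", "10001111", "11001010", "01110000",
]


def popular_bits(rows, p):
    """Step (1): positions whose share of 1s is at least p."""
    n = len(rows)
    return [i for i in range(len(rows[0]))
            if sum(r[i] == "1" for r in rows) >= p * n]


def centroid(rows):
    """Bit-wise majority vote (assumed centroid rule)."""
    n = len(rows)
    return "".join("1" if 2 * sum(r[i] == "1" for r in rows) > n else "0"
                   for i in range(len(rows[0])))


def radius(rows):
    """Assumed radius: max normalized Hamming distance to the centroid."""
    c = centroid(rows)
    return max(sum(x != y for x, y in zip(c, r)) for r in rows) / len(c)


def split_once(rows, p, s):
    """Steps (1)-(4): split rows into S (similar on popular bits) and U."""
    P = popular_bits(rows, p)
    need = ceil(s * len(P))
    S = [b for b in rows if sum(b[i] == "1" for i in P) >= need]
    U = [b for b in rows if sum(b[i] == "1" for i in P) < need]
    return S, U


def divide_and_conquer(rows, p, s, r, depth=0, max_depth=8):
    """Steps (5)-(9): recurse until every cluster is tight enough."""
    if len(rows) <= 1 or radius(rows) <= r or depth >= max_depth:
        return [rows]
    S, U = split_once(rows, p, s)
    if not S or not U:
        return [rows]  # no further split is obtained
    return (divide_and_conquer(S, p, s, r, depth + 1, max_depth)
            + divide_and_conquer(U, p, s, r, depth + 1, max_depth))


S, U = split_once(DOCS, p=0.61, s=0.8)
clusters = divide_and_conquer(DOCS, p=0.61, s=0.8, r=0.3)
```

With these thresholds the first split puts documents 1, 3, 4, 5, 6 in S and documents 2, 7, 8 in U. No centroid object has to be picked in advance: each split is a single pass over the rows.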
11
Scalable Bitmap Indexing for XML Document
Retrieval

Plain text-based document (d, w)
Document d is a sequence of words w, where d stands for document and w for word
Vector space
XML document (d, p, w)
Document d is a sequence of paths p, each of which contains a sequence of words w
Not yet sufficient to represent all necessary information, but simple enough to represent in bitmap indexing (i.e., BitCube)
Search-related information
Frequency, structural info (topological info), reference info (dereference info), security info, etc.

12
Scalable XML Document

Scalable XML Document (d, p, w, f, t, r, s, )
Extension of XML Document (d, p, w)
Represents enough features of documents
Bitmap indexing requires large storage and will be multidimensional.
Question
How to use a BitCube to represent all the features?
How to query?

14
Related Work Clustering

Similarity-based clustering
Jaccard's Coefficient
Dice's Coefficient
Vector-Space Model
Generalized Cosine-Similarity Measure
Optimistic Genealogy Measure

15
Similarity Measures

The Set/Bag Model Let X and Y be two collections
of XML documents
Jaccard's Coefficient
Dice's Coefficient
The Vector-Space Model Cosine-Similarity Measure
(CSM)
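
The formulas for these measures did not survive in the transcript. The sketch below uses the standard textbook definitions of the Jaccard coefficient, the Dice coefficient, and cosine similarity; treating X and Y as plain Python sets (for example, sets of paths or words) is an assumption made for illustration.

```python
from math import sqrt


def jaccard(x, y):
    """|X ∩ Y| / |X ∪ Y|; two empty sets are treated as identical."""
    x, y = set(x), set(y)
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def dice(x, y):
    """2 |X ∩ Y| / (|X| + |Y|)."""
    x, y = set(x), set(y)
    total = len(x) + len(y)
    return 2 * len(x & y) / total if total else 1.0


def cosine(u, v):
    """Dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


X = {"paper/title", "paper/author", "paper/section"}
Y = {"paper/author", "paper/section", "paper/figure"}
j, d = jaccard(X, Y), dice(X, Y)
c = cosine([1, 1, 0], [1, 0, 0])
```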

16
Similarity Measures (2)

The Generalized Cosine-Similarity Measure (GCSM)
Let X and Y be vectors and
where
Hierarchical Model
Why only for depth?

17
Related Work Clustering

Graph-based clustering
For an XML document collection C, the s-Graph sg(C) = (N, E) is a directed graph such that N is the set of all the elements and attributes in the documents in C, and (a, b) ∈ E if and only if a is a parent element of b in document(s) in C (b can be an element or an attribute).
For two sets, C1 and C2, of XML documents, the distance between them, where |sg(Ci)| is the number of edges
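
The distance formula itself is missing from the transcript. The sketch below assumes the definition commonly given for s-graph clustering, dist(C1, C2) = 1 − |sg(C1) ∩ sg(C2)| / max(|sg(C1)|, |sg(C2)|), which fits the remark that |sg(Ci)| counts edges; the slide may have used a different form. Function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def s_graph(docs):
    """Edge set of the s-graph: (parent, child) over elements and attributes."""
    edges = set()
    for text in docs:
        stack = [ET.fromstring(text)]
        while stack:
            node = stack.pop()
            for attr in node.attrib:
                edges.add((node.tag, "@" + attr))
            for child in node:
                edges.add((node.tag, child.tag))
                stack.append(child)
    return edges


def s_graph_distance(c1, c2):
    """Assumed distance: 1 - common edges / size of the larger edge set."""
    g1, g2 = s_graph(c1), s_graph(c2)
    largest = max(len(g1), len(g2))
    if largest == 0:
        return 0.0
    return 1 - len(g1 & g2) / largest


C1 = ["<paper><title/><author><email/></author></paper>"]
C2 = ["<paper><title/><section><figure/></section></paper>"]
dist = s_graph_distance(C1, C2)
```

The two collections share only the edge (paper, title) out of three edges each.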

18
Related Work Clustering

Matrix: singular value decomposition
→ compare with BitClustering
D → Dk

19
Is it possible to do better?

Use Dk instead of D → calculate Dk^T q rather than D^T q
Dk is defined as the k-truncated SVD of D
Note that both D and Dk are w × d matrices
Note
U: matrix of left singular vectors, V: matrix of right singular vectors
U and V are orthogonal
Σ is real
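
For reference, the notation behind these bullets can be written out in the standard SVD form. The truncated factors U_k, Σ_k, V_k (first k singular triplets) are the usual convention, not names taken from the slides.

```latex
D = U \Sigma V^{T}, \qquad
D_k = U_k \Sigma_k V_k^{T}, \qquad
\text{scores}(q) = D_k^{T} q
```

Here D is the w × d term-document matrix and q is a query vector of length w, so D_k^T q yields one score per document.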

20
Singular Value Decomposition
21
Using D-Hihat

Term-term comparison
Document-document comparison
Document-term comparison

22
Related Work Clustering

Structure
Suffix tree clustering
→ compare with BitClustering
Optimistic Genealogy Measure, ACM TIS 2003
→ compare with BitClustering

23
Streaming Data

IEEE TKDE Vol 15 No.3 2003
S. Guha, A. Meyerson, N. Mishra, R. Motwani, L. O'Callaghan, Clustering Data Streams: Theory and Practice
G. Cormode, M. Datar, P. Indyk, and S. Muthukrishnan, Comparing Data Streams Using Hamming Norms (How to Zero In)
A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss, One-Pass Wavelet Decompositions of Data Streams
P. Tucker, D. Maier, T. Sheard, and L. Fegaras, Exploiting Punctuation Semantics in Continuous Data Streams

24
Data Streams

Characteristics
Arrives continuously in the form of a stream
Needs to be processed in an on-line fashion
Constraints
The time for processing each stream element must
be small
The amount of memory available to the query
processor is limited

25
Algorithms

Summarize data streams in a concise but reasonably accurate way
Sampling-based
Histograms, Wavelet methods may be good for
static data
Synopsis-based keep small summary and update
One-pass algorithms
Obtaining median, quantiles, and other order
statistics
Correlated Aggregate queries, Mining

26
Streaming Data Models
A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M.
Strauss, One-Pass Wavelet Decompositions of Data
Streams, IEEE TKDE 15(3), pages 541-554.

a0..(N-1) ? Z
Ex) <(d1, p1, w1)>, <(d1, p2, w2)>, <(d2, p1, w2)>, <(d1, p1, w1)>, <(d3, p3, w1)>, <(d1, p3, w1)>
Cash-register model: items on domain values (contiguous, but not ordered)
Ex) <(d1, p1, w1, 2)>, <(d1, p2, w2, 1)>, <(d2, p1, w2, 1)>, <(d3, p3, w1, 1)>, <(d1, p3, w1, 1)>
Aggregate model: items on range values (no particular order)
Ex) <(d1, w1, 3)>, <(d1, w2, 1)>, <(d2, w2, 1)>, <(d3, w1, 1)>
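
The two summarized forms in the examples above can be reproduced from the raw stream with simple counting. This is a minimal sketch of the slide's examples only, not of the wavelet method in the cited paper; the function names are hypothetical.

```python
from collections import Counter

# Raw stream of (document, path, word) items from the first example.
STREAM = [
    ("d1", "p1", "w1"), ("d1", "p2", "w2"), ("d2", "p1", "w2"),
    ("d1", "p1", "w1"), ("d3", "p3", "w1"), ("d1", "p3", "w1"),
]


def count_triples(stream):
    """One pass over the stream: occurrences of each (d, p, w) item."""
    counts = Counter()
    for item in stream:
        counts[item] += 1
    return counts


def drop_paths(counts):
    """Fold the path dimension away: occurrences of each (d, w) pair."""
    totals = Counter()
    for (d, _p, w), c in counts.items():
        totals[(d, w)] += c
    return totals


per_triple = count_triples(STREAM)
per_doc_word = drop_paths(per_triple)
```

Both summaries need a single pass and memory proportional to the number of distinct keys, which matches the stream constraints listed earlier.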

27
Streaming Data Models
A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M.
Strauss, One-Pass Wavelet Decompositions of Data
Streams, IEEE TKDE 15(3), pages 541-554.

a0..(N-1) ? Z
Ex) <(d1, p1, w1)>, <(d1, p2, w2)>, <(d2, p1, w2)>, <(d1, p1, w1)>, <(d3, p3, w1)>, <(d1, p3, w1)>
(Aggregation of paths) If p2 < p3,
Ex) <(d1, p1, w1, 2)>, <(d1, p2, w1,w2, 2)>, <(d2, p1, w2, 1)>, <(d3, p3, w1, 1)>
Aggregation of words
Ex) <(d1, w1,w2, 4)>, <(d2, w2, 1)>, <(d3, w1, 1)>

29
BitClustering High-Speed and Context-based
Clustering for XML Documents
30
Problem with Traditional Clustering

All ePaths are treated as equally significant.
Clustering is based either on content only or on structure (hierarchy) only
Simple bitmap approach: a flat bitmap
Our approach
The more used, the more significant
Clustering for both content and structure
Complex → Pipelined Bitmap Index

31
Why clustering should take the structural
information into account

Parse Tree (Document Tree)

e1
e2 e3 e4
e5 e5 e6
e7 e7
Example paper (title, section (para, section),
reference (paper))

The same word in the Introduction section is
different from the one in the Conclusion section.
Tree contained

32
Significance-based Clustering

It is likely that not all ePaths are equally significant.
Ex) e1 = order.item, e2 = order.payment.card_number
e1 and e2 are not treated equally and uniformly.

<order>
<item>CD <description>Compact disk</description> <price>9.99</price> <quantity>5</quantity> </item>
<item>DVD <description>popular appliance product</description> <color>silver</color> <price>150.00</price> <quantity>1</quantity> </item>
<due>199.95</due>
<payment> <method>credit card</method> <card_number>12345</card_number> </payment>
</order>

Clearly, for the same reason, column-wise security may be popular (Oracle9i)

33
Encoding
Words: appliance 0, card 1, CD 2, check 3, Compact 4, credit 5, disk 6, DVD 7, Popular 8, product 9, silver 10, tomato 11, TV 12, white 13, o1 14, o2 15
Paths: order/@customer 0, order/item 1, order/item/description 2, order/item/color 3, order/item/price 4, order/item/quantity 5, order/due 6, order/payment/method 7, order/payment/card_number 8

Pairs (Path, Value)
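
To make the (path, value) pairs concrete, here is a minimal sketch that encodes an order document with the two dictionaries above. The parsing strategy, the case-insensitive word lookup, and the function name are assumptions; words that are not in the dictionary (prices, quantities, card numbers) are simply skipped.

```python
import xml.etree.ElementTree as ET

WORDS = ["appliance", "card", "CD", "check", "Compact", "credit", "disk",
         "DVD", "Popular", "product", "silver", "tomato", "TV", "white",
         "o1", "o2"]
PATHS = ["order/@customer", "order/item", "order/item/description",
         "order/item/color", "order/item/price", "order/item/quantity",
         "order/due", "order/payment/method", "order/payment/card_number"]
WORD_ID = {w.lower(): i for i, w in enumerate(WORDS)}
PATH_ID = {p: i for i, p in enumerate(PATHS)}


def encode(xml_text):
    """Return the sorted (path id, word id) pairs found in one document."""
    pairs = set()

    def add(path, text):
        pid = PATH_ID.get(path)
        if pid is None:
            return
        for token in text.split():
            wid = WORD_ID.get(token.lower())
            if wid is not None:
                pairs.add((pid, wid))

    def visit(elem, path):
        for name, value in elem.attrib.items():
            add(f"{path}/@{name}", value)
        add(path, elem.text or "")  # text directly under this element
        for child in elem:
            visit(child, f"{path}/{child.tag}")

    root = ET.fromstring(xml_text)
    visit(root, root.tag)
    return sorted(pairs)


ORDER = (
    '<order customer="o1"><item>CD <description>Compact disk</description>'
    "<price>9.99</price><quantity>5</quantity></item>"
    "<item>DVD <description>popular appliance product</description>"
    "<color>silver</color><price>150.00</price><quantity>1</quantity></item>"
    "<due>199.95</due><payment><method>credit card</method>"
    "<card_number>12345</card_number></payment></order>"
)
pairs = encode(ORDER)
```

Each resulting pair would correspond to one column that is set to 1 in the document's row of a (path, value) bitmap index.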

The Set/Bag Model Let X and Y be two collections
of XML documents
Jaccard's Coefficient
Dice's Coefficient
The Vector-Space Model Cosine-Similarity Measure
(CSM)

35
Similarity Measures (2)

The Generalized Cosine-Similarity Measure (GCSM)
Let X and Y be vectors and
where
Hierarchical Model
Why only for depth?

39
and n denotes the number of 1s in the path
from the root to a node.
40
Similarity Measures (3)

The Optimistic Genealogy Measure (OGM) Let C1
and C2 be collections of XML documents

41
Significances of Paths

Prioritization based on
Popularity of paths in usage
Significance of content
Presetting by user definition

42
Traditional
<paper> <title>economy</title> <author>W1 <email>W3</email> </author> <section>W3 <section>W4</section> <figure>W3</figure> </section> </paper>
<paper> <title>economy</title> <section>W3 <section> <section>W3</section> </section> <figure>W3</figure> </section> </paper>
<paper> <title>economy</title> <author> <affiliate>W2</affiliate> </author> <section>W3 <section>W4 <figure>W3</figure> </section> </section> </paper>
<paper> <title>economy</title> <author>W1 </author> <section> <section>W4 <section>W3</section> </section> <figure>W3</figure> </section> </paper>
<paper> <title>economy</title> <author>W1 <email>W3</email> </author> <section>W3 <section>W4</section> <figure>W3</figure> </section> </paper>
<paper> <title>economy</title> <author>W1 </author> <section>W3 <section> <section>W3</section> </section> </section> </paper>
<paper> <title>economy</title> <section>W3 <section>W4 <section>W3</section> </section> <figure>W3</figure> </section> </paper>
<paper> <author>W1 <affiliate>W2</affiliate> <email>W3</email> </author> </paper>
1 title, 2 author, 3 affiliate, 4 email, 5 section, 6 subsection, 7 subsubsection, 8 figure
Don't Care operation, *-operation, can be implemented with 2 bitmaps. 1*101 → 10101 OR 11101.
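
One way to realize a don't-care pattern with two bitmaps is a value bitmap plus a care mask. This is a sketch under assumptions: the slide does not say which two bitmaps it means, and the "*" wildcard character is chosen here for illustration because the original symbol was lost in the transcript.

```python
def compile_pattern(pattern: str, wildcard: str = "*"):
    """Turn a pattern such as '1*101' into (value, care) integer bitmaps."""
    if any(ch not in ("0", "1", wildcard) for ch in pattern):
        raise ValueError("pattern may contain only 0, 1 and the wildcard")
    # value: the required bits, with wildcard positions set to 0
    value = int("".join("1" if ch == "1" else "0" for ch in pattern), 2)
    # care: 1 where the position must match, 0 where it is a don't-care
    care = int("".join("0" if ch == wildcard else "1" for ch in pattern), 2)
    return value, care


def matches(row: str, value: int, care: int) -> bool:
    """A row matches when it agrees with value on every cared-for bit."""
    return (int(row, 2) & care) == (value & care)


value, care = compile_pattern("1*101")
hits = [row for row in ("10101", "11101", "10100") if matches(row, value, care)]
```

The test costs one AND and one comparison per row, regardless of how many positions are don't-cares.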
45
K-Means with non-object Centroid
d1 d2 d3
d4 14/25 12/25 7/25
d5 15/25 13/25 10/25
d6 20/25 22/25 19/25
d7 21/25 21/25 20/25
d8 24/25 22/25 17/25
d2,d6 d1, d7, d8d3 d4 d5
46
the similarity threshold is 0.8, the popularity
threshold is 0.5, the radius threshold is 0.8
Aggregation
Pop bitset 1,4,5,11,14,1519
47
System Architecture
Identification of Popular Bits
Form Bit-Centroids
48
Traditional
(d1) <paper> <title>economy</title> <author>W1 <email>W3</email> </author> <section>W3 <section>W4 <section>W3</section> </section> <figure>W3</figure> </section> </paper>
(d2) <paper> <title>economy</title> <author>W1 W3</author> <section>W3 <section>W4 W3</section> <figure>W3</figure> </section> </paper>
(d3) <paper> <title>W6</title> <author>W7 <affiliate>W2</affiliate> <email>W8</email> </author> <section>W3 <section>W4 <section>W3</section> </section> <figure>W3</figure> </section> </paper>
(d4) <paper> <title>religion</title> <author>W1 <email>W3</email> </author> <section>W17 <section>W18 <section>W19</section> </section> <figure>W20</figure> </section> </paper>
(d5) <paper> <title>economy</title> <author>W1 <email>W3</email> </author> <section>W3 <section>W4 <section>W3</section> </section> <figure>W3</figure> </section> </paper>
1 title, 2 author, 3 affiliate, 4 email, 5 section, 6 subsection, 7 subsubsection, 8 figure
d1 and d5 are the same in structure
d1 and d2 are the same in content
d1 and d3 are similar in both structure and content
d1 and d3 are the same in content and structure on weighted-XPath
d1 and d4 are similar in content on weighted-XPath
49
Traditional
popularity threshold = 0.61, unpopularity threshold = 0.3, similarity threshold = 0.8, dissimilarity threshold = 0.2
centerbit
11011101 for 1,5 → similar if 5 bits out of 6 1-bits are the same
1000111 for 2,6 → similar if 4 bits out of 4 1-bits are the same
1111 for 3,5,6 → similar if 4 bits out of 5 1-bits are the same
1111 for 4,5,6 → similar if 4 bits out of 5 1-bits are the same
1,3,4,5 2,6 7 8
Don't Care operation, *-operation, can be implemented with 2 bitmaps. 1*101 → 10101 OR 11101.
50
A
For 8 data: 8 × 0.61 = 4.88 → 5; 8 × 0.3 = 2.4 → 2
For 5 popular bits: 5 × 0.8 = 4.0 → 4; 5 × 0.2 = 1.0 → 1
popularity threshold = 0.61, unpopularity threshold = 0.3, similarity threshold = 0.8, dissimilarity threshold = 0.2
Distance: < 1 (similar), 1 < x < 4 (mid-similar), > 4 (dissimilar)
centerbit
111 for similar documents on popular bits
110 for mid-similar documents on popular bits
01000111 for dissimilar documents on popular bits
centerbit for original array
111 for similar documents
1101 for mid-similar documents
01110000 for dissimilar documents
51
B
For 8 data: 8 × 0.61 = 4.88 → 5; 8 × 0.3 = 2.4 → 2
For 5 popular bits: 7 × 0.8 = 5.6 → 6; 7 × 0.2 = 1.4 → 1
popularity threshold = 0.61, unpopularity threshold = 0.3, similarity threshold = 0.8, dissimilarity threshold = 0.2
Distance: < 1 (similar), 2 < x < 5 (mid-similar), > 6 (dissimilar)
Bit positions: 12568 47 3
Documents in original order:
1: 11111 10 0
2: 10101 01 0
3: 10111 00 1
4: 11011 01 0
5: 11111 10 0
6: 10111 01 0
7: 11100 01 0
8: 01000 10 1
Documents reordered:
1: 11111 10 0
5: 11111 10 0
4: 11011 01 0
6: 10111 01 0
2: 10101 01 0
3: 10111 00 1
7: 11100 01 0
8: 01000 10 1
centerbit: 11111100 for similar documents on popular bits; for mid-similar documents on popular bits
centerbit for original array: 11011101 for similar documents; for mid-similar documents
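
The bit grouping 12568 / 47 / 3 used in this example can be reproduced by counting 1s per bit position. The sketch assumes that a bit is popular when its count reaches ceil(0.61 × 8) = 5 and unpopular when its count is at most floor(0.3 × 8) = 2, which matches the roundings 4.88 → 5 and 2.4 → 2; the rows are the eight documents restored to original bit order, and the function name is hypothetical.

```python
from math import ceil, floor

DOCS = [
    "11011101",  # document 1
    "10001011",  # document 2
    "10101101",  # document 3
    "11000111",  # document 4
    "11011101",  # document 5
    "10001111",  # document 6
    "11001010",  # document 7
    "01110000",  # document 8
]


def label_bits(rows, pop_threshold, unpop_threshold):
    """Split 1-based bit positions into popular, mid-popular and unpopular."""
    n = len(rows)
    high = ceil(pop_threshold * n)    # minimum count for a popular bit
    low = floor(unpop_threshold * n)  # maximum count for an unpopular bit
    popular, mid, unpopular = [], [], []
    for pos in range(len(rows[0])):
        ones = sum(row[pos] == "1" for row in rows)
        if ones >= high:
            popular.append(pos + 1)
        elif ones <= low:
            unpopular.append(pos + 1)
        else:
            mid.append(pos + 1)
    return popular, mid, unpopular


popular, mid, unpopular = label_bits(DOCS, 0.61, 0.3)
```

The labeling is one pass over the index and does not depend on any choice of centroid.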
52
A
popularity threshold = 0.61 → 5 × 0.61 = 3.05 → 4
unpopularity threshold = 0.3 → 5 × 0.3 = 1.5 → 1
similarity threshold = 0.8 → 5 × 0.8 = 4
dissimilarity threshold = 0.2 → 5 × 0.2 = 1
diameter threshold = 0.8
centerbit
111 for similar documents on popular bits
110 for mid-similar documents on popular bits
01000111 for dissimilar documents on popular bits
Because the diameter for popular documents is 3/5 < 0.8, more clustering
Bit positions: 12568 47 3 (popular bits, mid-popular bits, unpopular bits)
Documents reordered:
1: 11111 10 0
3: 10111 00 1
4: 11011 01 0
5: 11111 10 0
6: 10111 01 0
2: 10101 01 0
7: 11100 01 0
8: 01000 10 1
Distance: < 1 (similar), 1 < x < 3 (mid-similar)
centerbit for original array
111 for popular documents
1101 for mid-popular documents
01110010 for mid-popular documents
53
12568 47 3
12568 47 3
12568 47 3
11111 10 0 10101 01 0 10111 00 1 11011 01 0 11111
10 0 10111 01 0 11100 01 0 01000 10 1
1 2 3 4 5 6 7 8
1 3 4 5 6 2 7 8
11111 10 0 10111 00 1 11011 01 0 11111 10 0 10111
00 0
1 3 4 5 6
11111 10 0 10111 00 1 11011 01 0 11111 10 0 10111
01 0 10101 01 0 11100 01 0 01000 10 1
54
For 6 data: 6 × 0.61 = 3.66 → 4; 6 × 0.3 = 1.8 → 1
For 5 popular bits: 4 × 0.8 = 3.2 → 4; 4 × 0.2 = 0.8 → 0
B
popularity threshold = 0.61, unpopularity threshold = 0.3, similarity threshold = 0.8, dissimilarity threshold = 0.2, diameter threshold = 0.8
1,5,2,6, 3,4,7,8
Distance: < 0 (similar), 1 < x < 3 (mid-similar), > 4 (dissimilar)
12568 47 3
1587 2364
1 5 4 6 2 3 7 8
6 2 3 4 7 8
11111 10 0 11111 10 0 11011 01 0 10111 01 0 10101
01 0 10111 00 1 11100 01 0 01000 10 1
1111 0010 1111 0000 1110 0110 1011 1010 1101
1000 0000 1101
centerbit 11111100 for similar documents on
popular bits for mid-similar documents
on popular bits
centerbit for original array 11011101 for
similar documents for mid-similar
documents
55
A
If popular bits are considered, 1,3,5,6, 4,
2,7, 8
Distance: < 1 (similar), 1 < x < 3 (mid-similar)
1568 247 3
1 3 5 6 4
1111 110 0 1111 000 1 1111 110 0 1111 000 0 1011
101 0
popular bits mid-popular bits
unpopular bits
centerbit for original array 111 for
popular documents 1101 for mid-popular
documents 01110010 for mid-popular documents
56
Coverage

cover(x) is a set of objects that satisfy x.
cover(x) = {o | o satisfies x}
y is x if cover(x) ? cover(y)

57
Similarity in both Structure and Content
58
Bitmap index for paths
Bitmap index for pairs (path, value)
59
Semantics in Hierarchies

Topologies
Order in sibling
Order in depth (not just the depth number in ACM
TIS 03)
Ex) toxics in pharmacy vs. toxics in weapon

60
Encoding Hierarchies
order/@customer 0
order/item 1
order/item/description 2
order/item/color 3
order/item/price 4
order/item/quantity 5
order/due 6
order/payment/method 7
order/payment/card_number 8

Pairs (Path, Value)

<order customer="o1">
<item>CD <description>Compact disk</description> <price>9.99</price> <quantity>5</quantity> </item>
<item>DVD <description>popular appliance product</description> <color>silver</color> <price>150.00</price> <quantity>1</quantity> </item>
<due>199.95</due>
<payment> <method>credit card</method> <card_number>12345</card_number> </payment>
</order>
Bitmap index for paths
Path IDs: 0 1 2 3 4 5 6 7 8
Documents: order1.xml order2.xml order3.xml
Bits (in transcript order): 111111111111011111110111111
Bitmap index for pairs (path, value)
Pair IDs: 0,14 0,15 1,2 1,7 1,10 1,12 2,0 2,4 2,6 2,8 2,9 3,10 3,13 7,3 7,5 7,6
Documents: order1.xml order2.xml order3.xml
Bits (in transcript order): 111111111110000000010000000010111000001001110100
61
root node
Tree-driven Bitmap Index
Vi(n1)

overflow node
leaf node
to Bitmap Index
62
Tree-driven Bitmap Index
Pi1 < Vj1
Vj2 < Pi1 < Vj3
Vj1 < Pi1 < Vj2
leaf node
Pi2.Pk1 < Vp1
Vp1 < Pi2.Pk1 < Vp2
doc1 doc2 . . dock
Bits (in transcript order): 10100010000010 .. 000101010000010 .. 010101010000000 .. 1
63
Incremental
123456789
9 10 11 12 13
11011101 11001111 11010000 01010001 01110111
centrobit for original
11011101 for 1,5 ← 9
100111 for 4,6 ← 10
110000 for 2,7 ← 11
10101101 for 3
01110000 for 8 ← 12: 011000 for 8,12, diameter (2/8 = 0.25) ??
01110111 for 13
64
Types of Incremental

Inserted into an existing cluster
Inserted into an existing cluster that can in
turn be modified to a new cluster
Created a new cluster

65
Procedures

1. Consider database h, which consists of bits b and objects o.
2. Compute pop(b) → labeling bits.
3. Compute sim(o) → grouping objects.
4. Set groups g. Compute center(g) → verify groups.
5. If diameter(g) > diameter_threshold:
If g < h, then set database g (with corresponding b and o), and redo from 1.
If g ≥ h, then relax labeling bits by setting a to 1 at a time, and redo from 3.
6. Else, stop.

66
Another Relaxation

If members(g) > member_threshold and diameter(g) ≤ diameter_threshold, stop
Otherwise, set database g (with corresponding b and o), and redo from 1.

67
Performance

Fast → grouping without parsing all objects
Fast → 1-pass computation of the center, without picking or generating a center object
Flexible →
Incremental →

69
If both popular and mid-popular bits are
considered, 1,5, 3,6, 4, 2,7, 8
Distance: < 1 (similar), 2 < x < 5 (mid-similar), > (dissimilar)
1568 247 3
1 5 3 6 4
1111 110 0 1111 110 0 1111 000 1 1111 000 0 1011
101 0
popular bits mid-popular bits
unpopular bits
centerbit for original array 111 for
popular documents 1101 for mid-popular
documents 01110010 for mid-popular documents
71
popularity threshold = 0.8, similarity threshold = 0.8
40 bits → (12 + 12 + 5 = 29) bits
72
45 bits → (12 + 6 + 4 + 5 = 27) bits
74
Query Plan
75
wrong
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
11000000 11000000 10110000 10001000 10010000 10001010 10000110 10000011
101000000 100010001 110000100 100100001 100001000 110000101 100010010 100100101 01010101
2 2 3 2 2 3 3 3
1 2 3 4 5 6 7 8
2 2 3 2 2 3 3 3
1 2 3 4 5 6 7 8
8 2 1 2 2 1 3 1
8 2 1 2 2 1 3 1
76
1 2 3 4 5 6 7 8
11000000 11000000 10110000 10001000 10010000 10001010 10000110 10000011
2 2 3 2 2 3 3 3
1 2 3 4 5 6 7 8
8 2 1 2 2 1 3 1
77
A
If both popular and mid-popular bits are
considered, 1,5, 3,6, 4, 2,7, 8
Distance: < 1 (similar), 2 < x < 5 (mid-similar), > (dissimilar)
1568 247 3
1 5 3 6 4
1111 110 0 1111 110 0 1111 000 1 1111 000 0 1011
101 0
popular bits mid-popular bits
unpopular bits
centerbit for original array 111 for
popular documents 1101 for mid-popular
documents 01110010 for mid-popular documents
78
What if both popular and mid-popular bits are
considered from the begining
79
popularity threshold = 0.61, unpopularity threshold = 0.3, similarity threshold = 0.8, dissimilarity threshold = 0.2, diameter threshold = 0.3
Bit positions: 12568 47 3
Distance: < 2 (similar), 3 < x < 5 (mid-similar), > 6 (dissimilar)
Documents reordered:
1: 11111 10 0
4: 11011 01 0
5: 11111 10 0
6: 10111 01 0
2: 10101 01 0
3: 10111 00 1
7: 11100 01 0
8: 01000 10 1
centrobit
111- for similar documents on pop: 1,4,5,6, diameter (5/8 = 0.63)
- for similar documents on mid-pop: 2,3,7,8, diameter (8/8 = 1)
80
popularity threshold = 0.61, unpopularity threshold = 0.3, similarity threshold = 0.8, dissimilarity threshold = 0.15, diameter threshold = 0.3
0.61 × 4 docs = 2.44: if > 3, pop; 0.3 × 4 = 1.2: if < 1, unpop
0.8 × 7 bits = 5.6: if 1s > 6, sim; 0.15 × 7 = 1.05: if 1s < 1, unsim
0.8 × 6 bits = 4.8: if 1s > 5, sim; 0.15 × 6 = 0.9: if 4 > 1s > 1, mid-sim
Bit positions: 12568 47 3
Distance and centrobit: 1s > 6 (similar): 1111110 (diameter = 0); 2 < 1s < 5 (mid-sim): 11101 (diameter = (8-6)/8 = 0.25); 1s < 1 (dissimilar)
Documents 1 5 4 6:
1: 11111 10
5: 11111 10
4: 11011 01
6: 10111 01
Distance and centrobit: 1s > 5 (similar); 1 < 1s < 4 (mid-sim) (diameter = (8-0)/8 = 1); 1s < 1 (dissimilar)
Bit positions: 15 8273 46
Documents 2 3 7 8:
2: 11 1010 00
3: 11 1001 01
7: 11 0110 00
8: 00 0101 10
Not recommendable!
81
popularity threshold = 0.61, unpopularity threshold = 0.3, similarity threshold = 0.8, dissimilarity threshold = 0.15, diameter threshold = 0.3
0.61 × 4 docs = 2.44: if > 3, pop; 0.3 × 4 = 1.2: if < 1, unpop
0.8 × 7 bits = 5.6: if 1s > 6, sim; 0.15 × 7 = 1.05: if 1s < 1, unsim
0.8 × 2 bits = 1.6: if 1s > 2, sim; 0.15 × 2 = 0.3: if 1s < 1, mid-sim
Bit positions: 15 8273 46
Distance and centrobit: 1s > 2 (similar): 110 (diameter = 5/8 = 0.62); 1s < 1 (mid-sim): 00010110 (diameter = (8-8)/8 = 0); 1s < 1 (dissimilar)
Documents 2 3 7 8:
2: 11 1010 00
3: 11 1001 01
7: 11 0110 00
8: 00 0101 10
centrobit
1111110 for 1,5
11101 for 4,6
110 for 2,3,7
00010110 for 8
centrobit for original
11011101 for 1,5
100111 for 4,6
101 for 2,3,7
01110000 for 8
82
popularity threshold = 0.61, unpopularity threshold = 0.3, similarity threshold = 0.8, dissimilarity threshold = 0.15, diameter threshold = 0.3
0.61 × 4 docs = 2.44: if > 3, pop; 0.3 × 4 = 1.2: if < 1, unpop
0.8 × 7 bits = 5.6: if 1s > 6, sim; 0.15 × 7 = 1.05: if 1s < 1, unsim
0.8 × 2 bits = 1.6: if 1s > 2, sim; 0.15 × 2 = 0.3: if 1s < 1, mid-sim
Centrobit 1110 (diameter = 4/8 = 0.5) → 11100 (diameter = 3/8 = 0.37), 11011000 (diameter = 0/8 = 0)
Bit positions: 15 8273 46
Documents 2 3 7:
2: 11 1010 00
3: 11 1001 01
7: 11 0110 00
Centrobit 111100101 (diameter = 0/8 = 0), 1110 (diameter = 4/8 = 0.5) → 11000 (diameter = 2/8 = 0.25)
Bit positions: 15 8273 46
Documents 3 2 7:
3: 11 1001 01
2: 11 1010 00
7: 11 0110 00
centrobit
1111110 for 1,5
11101 for 4,6
110 for 2,3,7 → centers 1110, 1110, 1110, 1110, 1101
00010110 for 8
centrobit for original
11011101 for 1,5
100111 for 4,6
110000 for 2,7
10101101 for 3
01110000 for 8
85
p1 p2 p3 p4 p5 p6 p7 p8
d1 d2 d3
? 0 0 2 2 2 0 0 7
Trees of BitCube
Tree mask vector
p1 = e1.e2, p2 = e1.e3, p3 = e1.e3.e5, p4 = e1.e3.e6, p5 = e1.e3.e7.e8, p6 = e1.e4.e9, p7 = e1.e4.e10, p8 = e1.e4.e10.e11
86
p1 p2 p3 p4 p5 p6 p7 p8
d1 d2 d3
? 0 0 2 2 2 0 0 7
Trees of BitCube
Tree mask vector
p1 = e1.e2, p2 = e1.e3, p3 = e1.e3.e5, p4 = e1.e3.e6, p5 = e1.e3.e7.e8, p6 = e1.e4.e9, p7 = e1.e4.e10, p8 = e1.e4.e10.e11
87
? 0 0 2 2 2 0 0 7
Trees of d1, d2, d3
Tree mask vector
88
? 0 0 2 2 2 0 0 7
Trees of d1, d2, d3
Tree mask vector
e1
e2 e3 e3
e5 e5 e6 e6
(d4)
