Title: Data Mining Techniques for Query Relaxation
1. Data Mining Techniques for Query Relaxation
2. Query Relaxation via Abstraction
- Abstraction is context dependent:
  - 6'9" guard → big guard
  - 6'9" forward → medium forward
  - 6'9" center → small center
- Abstraction must be automated for
- Large domains
- Unfamiliar domains
[Figure: height distribution of guards, illustrating the conceptual query "Find me a big guard"; heights under 6' are small, 6' to 6'4" are medium, and over 6'4" are large.]
3. Related Work
- Maximum Entropy (ME) method
- Maximization of entropy (−Σ p log p)
- Only considers frequency distribution
- Conceptual clustering systems
- Only allows non-numerical values (COBWEB)
- Assume a certain distribution (CLASSIT)
4. Supervised vs. Unsupervised Learning
- Supervised Learning
  - Given instances with known class information, generate rules or a decision tree that can be used to infer the class of future instances
  - Examples: ID3, Statistical Pattern Recognition
- Unsupervised Learning
  - Given instances with unknown class information, generate a concept tree that clusters instances into similar classes
  - Examples: COBWEB, TAH Generation (DISC, PKI)
5. Automatic Construction of TAHs
- Necessary for Scaling up CoBase
- Sources of Knowledge
- Database Instance
- Attribute Value Distributions
- Inter-Attribute Relationships
- Query and Answer Statistics
- Domain Expert
- Approach
- Generate Initial TAH
- With Minimal Expert Effort
- Edit the Hierarchy to Suit
- Application Context
- User Profile
6. Clustering Attribute Instances with Non-Numerical Values
7. Pattern-Based Knowledge Induction (PKI)
- Rule-Based
- Clusters attribute values into a TAH based on the other attributes in the relation
- Provides an Attribute Correlation value
8. Definitions
- The cardinality of a pattern P, denoted |P|, is the number of distinct objects that match P.
- The confidence of a rule A → B, denoted χ(A → B), is
  - χ(A → B) = |A ∩ B| / |A|
- Let A → B be a rule that applies to a relation R. The support of the rule over R is defined as
  - η(A → B) = |A| / |R|
9. Knowledge Inference: A Three-Step Process
- Step 1: Infer Rules
  - Consider all rules of the basic form A → B.
  - Calculate Confidence and Support.
  - Confidence measures how well a rule applies to the database.
    - A → B has a confidence of 0.75 means that if A holds, B has a 75% chance of holding as well.
  - Support measures how often a rule applies to the database.
    - A → B has a support of 10 means that it applies to 10 tuples in the database (A holds for 10 tuples).
  - (A small runnable sketch of both measures follows below.)
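To make the two measures concrete, here is a minimal sketch in Python. The four-tuple relation and its attribute names are illustrative inventions (they echo the red/black car example on the next slide), not data from the original deck.

    from fractions import Fraction

    # A hypothetical toy relation: each tuple is a dict of attribute -> value.
    relation = [
        {"color": "red",   "style": "sport"},
        {"color": "red",   "style": "sedan"},
        {"color": "black", "style": "sport"},
        {"color": "black", "style": "sport"},
    ]

    def matches(tup, pattern):
        """True if the tuple has every attribute = value pair of the pattern."""
        return all(tup[a] == v for a, v in pattern.items())

    def confidence(rel, premise, consequent):
        """chi(A -> B) = |A intersect B| / |A|  (slide 8)."""
        a = [t for t in rel if matches(t, premise)]
        ab = [t for t in a if matches(t, consequent)]
        return Fraction(len(ab), len(a)) if a else Fraction(0)

    def support(rel, premise):
        """eta(A -> B) = |A| / |R|  (slide 8)."""
        return Fraction(sum(matches(t, premise) for t in rel), len(rel))

    print(confidence(relation, {"color": "red"}, {"style": "sport"}))   # 1/2
    print(support(relation, {"color": "red"}))                          # 1/2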
10. Knowledge Inference (contd)
- Step 2: Combine Rules
  - If two rules share a consequence and have the same attribute as a premise (with different values), then those values are candidates for clustering.
    - Color = red → style = sport (χ1)
    - Color = black → style = sport (χ2)
    - Suggests red and black should be clustered.
  - The correlation is the product of the confidences of the two rules: γ = χ1 × χ2
11. Clustering
- Algorithm: Binary Cluster (a greedy algorithm; a runnable sketch follows below)
  - repeat
    - INDUCE RULES and determine γ
    - sort γ in descending order
    - for each γ(ai, aj)
      - if ai and aj are unclustered
        - replace ai and aj in the DB with the joint value Jij
  - until fully clustered
- Approximate n-ary γ using binary γ
  - cluster a set of n values if the γ between all pairs is above a threshold
  - decrease the threshold and repeat
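A minimal sketch of the greedy loop, assuming a caller-supplied correlations(db, attr) helper that returns the pair correlations γ (for instance, the correlation function in the sketch after slide 13). The joint-value naming and the no-progress guard are my additions.

    def binary_cluster(db, attr, correlations):
        """Greedy Binary Cluster (slide 11): repeatedly merge the most
        correlated pair of values of `attr` until one cluster remains."""
        while len({t[attr] for t in db}) > 1:
            gammas = correlations(db, attr)   # INDUCE RULES and determine gamma
            busy = set()                      # values already clustered this pass
            merged = False
            for (ai, aj), g in sorted(gammas.items(), key=lambda kv: -kv[1]):
                if ai in busy or aj in busy:
                    continue                  # each value clusters once per pass
                joint = f"({ai},{aj})"        # the joint value J_ij
                for t in db:                  # replace ai and aj in the DB
                    if t[attr] in (ai, aj):
                        t[attr] = joint
                busy.update({ai, aj, joint})
                merged = True
            if not merged:
                break                         # guard: no pair left to merge
        return db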
12. Knowledge Inference (contd)
- Step 3: Combine Correlations
  - The clustering correlation between two values is the weighted sum of their correlations.
  - It combines all the evidence that two values should be clustered together into a single number γ(a1, a2):
    - γ(a1, a2) = Σ_{i=1..m} wᵢ × χ(A=a1 → Bᵢ=bᵢ) × χ(A=a2 → Bᵢ=bᵢ) / (m - 1)
  - where a1, a2 are values of attribute A, and there are m attributes B1, …, Bm in the relation with corresponding weights w1, …, wm
13. Pattern-Based Knowledge Induction (Example)

  A    B    C
  a1   b1   c1
  a1   b2   c1
  a2   b1   c1
  a3   b2   c1

1st iteration:
Rules:
  A = a1 → B = b1, confidence 0.5
  A = a2 → B = b1, confidence 1.0
  A = a1 → C = c1, confidence 1.0
  A = a2 → C = c1, confidence 1.0
correlation(a1, a2) = (0.5 × 1.0 + 1.0 × 1.0) / 2 = 0.75
correlation(a1, a3) = 0.75
correlation(a2, a3) = 0.5
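These first-iteration numbers can be checked mechanically. A minimal sketch, assuming unit weights and summing over every value shared by the two premises (the reading that reproduces the slide's arithmetic):

    from fractions import Fraction
    from itertools import combinations

    attrs = ("A", "B", "C")
    rows = [("a1", "b1", "c1"), ("a1", "b2", "c1"),
            ("a2", "b1", "c1"), ("a3", "b2", "c1")]
    db = [dict(zip(attrs, r)) for r in rows]

    def chi(db, attr, val, other, oval):
        """Confidence of the rule attr=val -> other=oval (slide 8)."""
        prem = [t for t in db if t[attr] == val]
        both = [t for t in prem if t[other] == oval]
        return Fraction(len(both), len(prem)) if prem else Fraction(0)

    def correlation(db, attr, v1, v2, others):
        """gamma(v1, v2): slide 12's weighted sum with unit weights,
        divided by the number of other attributes (m - 1)."""
        total = Fraction(0)
        for other in others:
            for oval in {t[other] for t in db}:
                total += chi(db, attr, v1, other, oval) * chi(db, attr, v2, other, oval)
        return total / len(others)

    for v1, v2 in combinations(("a1", "a2", "a3"), 2):
        print(v1, v2, float(correlation(db, "A", v1, v2, ("B", "C"))))
    # prints 0.75, 0.75, 0.5, matching the slide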
14. Pattern-Based Knowledge Induction (contd)
2nd iteration (a1 and a2 have been merged into the joint value a12):

  A     B    C
  a12   b1   c1
  a12   b2   c1
  a12   b1   c1
  a3    b2   c1

[Figure: cluster tree; a1 and a2 merge at correlation 0.75, and a12 merges with a3 at correlation 0.67.]
15. Example for Non-Numerical Attribute Values: The PEOPLE Relation
16. TAH for People
17. Computing Cor(a12, a3)
- Cor(a12, a3) is computed as follows:
  - Attribute origin: same (Holland)
    - contributes 1.0
  - Attribute hair: same
    - contributes 1.0
  - Attribute eye: different
    - contributes 0.0
  - Attribute height: overlap on MEDIUM
    - 5/10 of a12 and 2/2 of a3
    - contributes 5/10 × 2/2 = 0.5
- cor(a12, a3) = 1/4 × (1 + 1 + 0 + 0.5) = 0.63
18. Correlation Computation
- Compute the correlation between European and Asian.
- Attributes ORIGIN and HAIR COLOR
  - No overlap between Europe and Asia, so no contribution to the correlation
- Attribute EYE COLOR
  - BROWN is the only value with overlap
  - 1 out of 24 Europeans has BROWN
  - 12 out of 12 Asians have BROWN
  - Value BROWN contributes 1/24 × 12/12 = 0.0416
- Attribute HEIGHT
  - SHORT: 5/24 of Europeans and 8/12 of Asians
  - MEDIUM: 11/24 and 3/12
  - TALL: 8/24 and 1/12
  - Attribute HEIGHT contributes 5/24 × 8/12 + 11/24 × 3/12 + 8/24 × 1/12 = 0.2812
- Total contribution: 0.0416 + 0.2812 = 0.3228 (verified in the sketch below)
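The per-attribute contributions can be verified with a short sketch. Only overlapping values matter, so the 23 non-BROWN European eye colors are lumped into a placeholder OTHER bucket (an assumption; the slide does not give their breakdown):

    from fractions import Fraction

    # Value counts per cluster for the two contributing attributes (slide 18).
    europeans = {"eye":    {"BROWN": 1, "OTHER": 23},   # OTHER is a placeholder
                 "height": {"SHORT": 5, "MEDIUM": 11, "TALL": 8}}
    asians    = {"eye":    {"BROWN": 12},
                 "height": {"SHORT": 8, "MEDIUM": 3, "TALL": 1}}

    def contribution(c1, c2):
        """Sum over shared values of (fraction in cluster 1) x (fraction in cluster 2)."""
        n1, n2 = sum(c1.values()), sum(c2.values())
        return sum(Fraction(c1[v], n1) * Fraction(c2[v], n2)
                   for v in c1.keys() & c2.keys())

    for a in ("eye", "height"):
        print(a, float(contribution(europeans[a], asians[a])))   # ~0.0417, 0.28125
    print(float(sum(contribution(europeans[a], asians[a])
                    for a in ("eye", "height"))))                # ~0.3229 (0.3228 on the slide)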
19. Extensions
- Pre-clustering
  - For non-discrete domains
  - Reduces computational complexity
- Expert Direction
  - Identify complex rules
  - Eliminate unrelated attributes
- Eliminating Low-Popularity Rules
  - Set a popularity threshold q
  - Do not keep rules below q
  - Saves time and space
  - Loses knowledge about uncommon data
  - In the transportation example, q = 2 improves efficiency by nearly 80%.
- Statistical sampling for very large domains.
20. Clustering of Attribute Instances with Numerical Values
21. Conventional Clustering Methods I: Maximum Entropy (ME)
- Maximization of entropy (−Σ p log p)
- Only considers the frequency distribution
- Example: {1,1,2,99,99,100} and {1,1,2,3,100,100} have the same entropy (frequencies 2/6, 1/6, 2/6, 1/6)
- ME cannot distinguish between
  - (1) {1,1,2 | 99,99,100}: a good partition
  - (2) {1,1,2 | 3,100,100}: a bad partition
- ME does not consider the value distribution.
- Clusters have no semantic meaning.
22. Conventional Clustering Methods II: Biggest Gap (BG)
- Considers only the value distribution
- Finds cuts at the biggest gaps
- {1,1,1,10,10,20} is partitioned into
  - {1,1,1,10,10} and {20} → bad
- A good partition:
  - {1,1,1} and {10,10,20}
23. New Notion of Goodness of Clusters: Relaxation Error
24. Relaxation Error of a Cluster
25. Relaxation Error of a Partition
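The formulas on slides 24 and 25 were images and did not survive extraction. The following reconstruction is inferred from the worked arithmetic on slides 27 and 28 and assumes uniform value frequencies; treat it as a sketch rather than the deck's exact notation.

    % Relaxation error of a cluster C = {x_1, ..., x_n}: the expected difference
    % between the exact answer and an approximate answer drawn from the cluster.
    \[
    RE(C) = \sum_{i=1}^{n}\sum_{j=1}^{n} p(x_i)\, p(x_j)\, |x_i - x_j|
          = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} |x_i - x_j|
          \quad \text{for uniform } p(x_i) = 1/n .
    \]
    % Relaxation error of a partition {C_1, ..., C_m} of C:
    % the size-weighted average of the sub-cluster relaxation errors.
    \[
    RE(C_1, \dots, C_m) = \sum_{k=1}^{m} \frac{|C_k|}{|C|}\, RE(C_k) .
    \]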
26. Distribution Sensitive Clustering (DISC) Example
27. Relaxation Error
- RE(B) = average pair-wise difference = 3/9 + 2/9 + 3/9 = 8/9 ≈ 0.89
- RE(C) = 0.5
- RE(A) = 2.08
- correlation(B) = 1 - RE(B)/RE(A) = 1 - 0.89/2.08 = 0.57
- correlation(C) = 1 - 0.5/2.08 = 0.76
- correlation(A) = 1 - 2.08/2.08 = 0
28. Examples
- Example 1: {1,1,2,3,100,100}
  - ME: {1,1,2 | 3,100,100}
    - RE(1,1,2) = (0+1+0+1+1+1)/9 = 0.44
    - RE(3,100,100) = 388/9 = 43.11
    - RE({1,1,2 | 3,100,100}) = 0.44 × 3/6 + 43.11 × 3/6 = 21.78
  - Ours: RE({1,1,2,3 | 100,100}) = 0.58
- Example 2: {1,1,1,10,10,20}
  - BG: {1,1,1,10,10 | 20}
    - RE({1,1,1,10,10 | 20}) = 3.6
  - Ours: RE({1,1,1 | 10,10,20}) = 2.22
- (These values are reproduced in the sketch below.)
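A minimal sketch that reproduces these numbers from the RE definitions reconstructed above. The DISC cut {1,1,2,3 | 100,100} in Example 1 is inferred from the 0.58 value, not stated explicitly on the slide:

    from itertools import product

    def re_cluster(xs):
        """RE of one cluster: average |xi - xj| over all ordered pairs."""
        return sum(abs(a - b) for a, b in product(xs, xs)) / len(xs) ** 2

    def re_partition(clusters):
        """RE of a partition: size-weighted average of sub-cluster REs."""
        n = sum(len(c) for c in clusters)
        return sum(len(c) / n * re_cluster(c) for c in clusters)

    print(round(re_partition([[1, 1, 2], [3, 100, 100]]), 2))   # 21.78 (ME cut)
    print(round(re_partition([[1, 1, 2, 3], [100, 100]]), 2))   # 0.58  (DISC cut)
    print(round(re_partition([[1, 1, 1, 10, 10], [20]]), 2))    # 3.6   (BG cut)
    print(round(re_partition([[1, 1, 1], [10, 10, 20]]), 2))    # 2.22  (DISC cut)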
29. An Example
- The table SHIPS has 153 tuples, and the attribute LENGTH has 33 distinct values ranging from 273 to 947. DISC and ME are used to cluster LENGTH into three sub-concepts: SHORT, MEDIUM, and LONG.
30. An Example (contd)
- Cuts by DISC
  - between 636 and 652, and between 756 and 791
  - average gap: 25.5
- Cuts by ME
  - between 540 and 560, and between 681 and 685 (a bad cut)
  - average gap: 12
- Optimal cuts by exhaustive search
  - between 605 and 635, and between 756 and 791
  - average gap: 32.5
- DISC is more effective than ME in discovering relevant concepts in the data.
31. An Example (contd)
[Figure: Clustering of SHIP.LENGTH by DISC and ME; dashed lines mark the cuts by DISC, dash-dot lines the cuts by ME.]
32. Quality of Approximate Answers
33. DISC
- For numeric domains
- Uses intra-attribute knowledge
- Sensitive to both the frequency and value distributions of the data.
- RE = average difference between exact and approximate answers in a cluster.
- The quality of approximate answers is measured by relaxation error (RE): the smaller the RE, the better the approximate answer.
- DISC (Distribution Sensitive Clustering) generates AAHs based on minimization of RE.
34. DISC
- Goal: automatic generation of a TAH for a numerical attribute
- Task: given a numerical attribute and a number s, find the optimal s - 1 cuts that partition the attribute into s sub-clusters
- Needs a measure for the optimality of a clustering.
35. Quality of Partitions
If RE(C) is too big, we can partition C into smaller clusters. The goodness measure for partitioning C into m sub-clusters C1, …, Cm is given by the relaxation error reduction per cluster, called the category utility (CU); a reconstruction of the formula follows below.
For efficiency, binary partitions are used to obtain m-ary partitions.
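The CU formula on this slide was also an image. A plausible reconstruction, consistent with "relaxation error reduction per cluster" and with the RE definitions above (an assumption, not the deck's confirmed notation):

    % Category utility of partitioning C into sub-clusters C_1, ..., C_m:
    % the reduction in relaxation error, averaged over the m sub-clusters.
    \[
    CU = \frac{1}{m}\left( RE(C) - \sum_{k=1}^{m} \frac{|C_k|}{|C|}\, RE(C_k) \right).
    \]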
36. The Algorithms DISC and BinaryCut
- Algorithm DISC(C)
  - if the number of distinct values in C < T, return   /* T is a threshold */
  - let cut = the best cut returned by BinaryCut(C)
  - partition the values in C based on cut
  - let the resultant sub-clusters be C1 and C2
  - call DISC(C1) and DISC(C2)
- Algorithm BinaryCut(C)
  - /* input: cluster C = {x1, …, xn} */
  - for h = 1 to n - 1   /* evaluate each cut */
    - let P be the partition with clusters C1 = {x1, …, xh} and C2 = {xh+1, …, xn}
    - compute the category utility CU for P
    - if CU < MinCU then
      - MinCU = CU, cut = h   /* the best cut */
  - return cut as the best cut
- (A runnable sketch of both procedures follows below.)
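A runnable sketch under the CU reconstruction above, so the best cut maximizes the relaxation-error reduction (the slide's MinCU bookkeeping becomes MaxCU here, matching M-DISC on slide 43). The threshold default T = 4 is an arbitrary choice for the demo:

    from itertools import product

    def re_cluster(xs):
        """RE of one cluster: average |xi - xj| over all ordered pairs."""
        return sum(abs(a - b) for a, b in product(xs, xs)) / len(xs) ** 2

    def re_partition(cs):
        n = sum(len(c) for c in cs)
        return sum(len(c) / n * re_cluster(c) for c in cs)

    def binary_cut(xs):
        """BinaryCut(C): evaluate every cut h of the sorted values x1..xn and
        return the h whose partition {x1..xh | xh+1..xn} has the best CU."""
        best_cu, best_h = float("-inf"), 1
        for h in range(1, len(xs)):
            cu = (re_cluster(xs) - re_partition([xs[:h], xs[h:]])) / 2
            if cu > best_cu:
                best_cu, best_h = cu, h
        return best_h

    def disc(xs, T=4):
        """DISC(C): recursively split until fewer than T distinct values remain."""
        if len(set(xs)) < T:
            return [xs]
        h = binary_cut(xs)
        return disc(xs[:h], T) + disc(xs[h:], T)

    print(disc(sorted([1, 1, 2, 3, 100, 100])))   # [[1, 1, 2, 3], [100, 100]]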
37. The N-ary Partition Algorithm
- Algorithm N-aryPartition(C)
  - let C1 and C2 be the two sub-clusters of C
  - compute CU for the partition {C1, C2}
  - for N = 2 to n - 1
    - let Ci be the sub-cluster of C with maximum relaxation error
    - call BinaryCut to find the best sub-clusters Ci1 and Ci2 of Ci
    - compute and store CU for the partition {C1, …, Ci-1, Ci1, Ci2, Ci+1, …, CN}
    - if the current CU is less than the previous CU
      - stop
    - else
      - replace Ci by Ci1 and Ci2
  - /* the result is an N-ary partition of C */
- (The sketch below continues the DISC code above.)
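Continuing the sketch above (reusing re_cluster, re_partition, and binary_cut), this implements the greedy stopping rule stated on the slide: keep bisecting the worst sub-cluster while CU improves.

    def n_ary_partition(xs):
        """Grow a binary partition of the sorted values into an N-ary one by
        repeatedly bisecting the sub-cluster with maximum RE, while CU improves."""
        h = binary_cut(xs)
        clusters = [xs[:h], xs[h:]]
        prev_cu = (re_cluster(xs) - re_partition(clusters)) / len(clusters)
        while True:
            i = max(range(len(clusters)), key=lambda k: re_cluster(clusters[k]))
            ci = clusters[i]
            if len(set(ci)) < 2:
                break                             # nothing left to bisect
            g = binary_cut(ci)
            candidate = clusters[:i] + [ci[:g], ci[g:]] + clusters[i + 1:]
            cu = (re_cluster(xs) - re_partition(candidate)) / len(candidate)
            if cu < prev_cu:
                break                             # stop when CU stops improving
            clusters, prev_cu = candidate, cu
        return clusters

    # Splitting further drops CU here, so the binary partition is kept.
    print(n_ary_partition(sorted([1, 1, 2, 3, 100, 100])))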
38. Using TAHs for Approximate Query Answering
- select CARGO-ID
- from CARGOS
- where SQUARE-FEET = 300
- and WEIGHT = 740
- → no answers
- The query is relaxed according to TAHs.
39. Approximate Query Answering
- select CARGO-ID
- from CARGOS
- where 294 ≤ SQUARE-FEET ≤ 300
- and 737 ≤ WEIGHT ≤ 741

  CARGO-ID   SQUARE-FEET   WEIGHT
  10         296           740

Relaxation error = (4/11.95 + 0)/2 = 0.168

Further relaxation:
- select CARGO-ID
- from CARGOS
- where 294 ≤ SQUARE-FEET ≤ 306
- and 737 ≤ WEIGHT ≤ 749

  CARGO-ID   SQUARE-FEET   WEIGHT
  10         296           740
  21         301           737
  30         304           746
  44         306           745

Relaxation error = (3.75/11.95 + 3.5/9.88)/2 = 0.334
(See the sketch below for how these numbers arise.)
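The relaxation-error figures on this slide appear to normalize the average deviation per attribute by that attribute's overall RE (the 11.95 and 9.88 are taken from the slide; their derivation is not shown). A sketch of that reading:

    def answer_re(requested, answers, attr_re):
        """Average |returned - requested| per attribute, normalized by the
        attribute's overall relaxation error, then averaged over attributes."""
        terms = []
        for attr, target in requested.items():
            mean_dev = sum(abs(row[attr] - target) for row in answers) / len(answers)
            terms.append(mean_dev / attr_re[attr])
        return sum(terms) / len(terms)

    attr_re = {"SQUARE-FEET": 11.95, "WEIGHT": 9.88}      # taken from the slide
    requested = {"SQUARE-FEET": 300, "WEIGHT": 740}

    first = [{"SQUARE-FEET": 296, "WEIGHT": 740}]
    more = first + [{"SQUARE-FEET": 301, "WEIGHT": 737},
                    {"SQUARE-FEET": 304, "WEIGHT": 746},
                    {"SQUARE-FEET": 306, "WEIGHT": 745}]

    print(round(answer_re(requested, first, attr_re), 3))   # 0.167 (0.168 on the slide)
    print(round(answer_re(requested, more, attr_re), 3))    # 0.334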
40. Performance of DISC
- Theorem: Let D and M be the optimal binary cuts by DISC and ME, respectively. If the data distribution is symmetric with respect to the median, then D = M (i.e., the cuts determined by DISC and ME are the same).
- For skewed distributions, the clusters discovered by DISC have less relaxation error than those found by the ME method.
- The more skewed the data, the greater the performance difference between DISC and ME.
41. Multi-Attribute TAH (MTAH)
In many applications, concepts need to be characterized by multiple attributes, e.g., the nearness of geographical locations.
- Uses of an MTAH:
  - As guidance for query modification
  - As a semantic index
42. Multi-Attribute TAH (MTAH)
43. Multi-Attribute DISC (M-DISC) Algorithm
- Algorithm M-DISC(C)
  - if the number of objects in C < T, return   /* T is a threshold */
  - for each attribute a = 1 to m
    - for each possible binary cut h
      - compute CU for h
      - if CU > MaxCU then   /* remember the best cut */
        - MaxCU = CU, BestAttribute = a, cut = h
  - partition C based on the cut of the attribute BestAttribute
  - let the resultant sub-clusters be C1 and C2
  - call M-DISC(C1) and M-DISC(C2)
44. Greedy M-DISC Algorithm (gM-DISC)
- Algorithm gM-DISC(C)
  - if the number of objects in C < T, return   /* T is a threshold */
  - for each attribute a = 1 to m
    - for each possible binary cut h
      - compute REa for h
      - if REa > MaxRE then   /* remember the best cut */
        - MaxRE = REa, BestAttribute = a, cut = h
  - partition C based on the cut of the attribute BestAttribute
  - let the resultant sub-clusters be C1 and C2
  - call gM-DISC(C1) and gM-DISC(C2)
45. MTAH of RECTANGLES (Height, Width)
46. The Database Table AIRCRAFT
How do we find similar aircraft?
47. MTAH for AIRCRAFT
48. Example for Numerical Attribute Values
Motor data from PartNet (http//PartNet)
49. TAH for Motor Capability
50. TAH for Motor Size and Weight
51. TAHs for Motor
- The Motor table was adapted from Housed Torque data from PartNet. After inputting the data, two TAHs were generated automatically by the DISC algorithm.
- One TAH was based on peak torque, peak torque power, and motor constant. The other was based on outer diameter, length, and weight. The leaf nodes represent part numbers; the intermediate nodes are classes. The relaxation error (the average pair-wise distance between the parts) of each node is also given.
52. Application of TAHs
- The TAHs can be used jointly to satisfy attributes in both TAHs. For example: find parts similar to T-0716 in terms of peak torque, peak torque power, motor constant, outer diameter, length, and weight. By examining both TAHs, we know that QT-0701 is similar to T-0716, with an expected relaxation error of (0.06 + 0.1)/2 = 0.08.
53. Performance of TAH
- Performance measures
  - accuracy = retrieved relevant answers / all relevant answers
  - efficiency = retrieved relevant answers / all retrieved answers
  - where all relevant answers are the best n answers determined by exhaustive search.
- Compare an MTAH with a traditional 2-d index tree (based on frequency distribution).
54. Performance of MTAHs
- Based on the attributes longitude and latitude of 972 geographical locations from a transportation database.
- 500 queries of the form
  - find the n locations nearest to (long, lat)
  - where n is randomly selected from 1 to 20, and long and lat are generated based on the distributions of the geographical locations.
- The MTAH is more accurate than the 2-d tree, and more efficient than exhaustive search.
55. Generation of Evolutionary TAHs
- Approximate query answering for temporal data (given as a set of time sequences)
- Find time sequences that are similar to a given template sequence.
- A time sequence S of n stages is defined as an n-tuple S = (s1, …, sn), where each si is a numerical value.
- Issues
  - Needs a similarity measure for sequences
  - Use clustering for efficient retrieval
  - Evaluation of the work
56. Automatic Construction of TAHs
- Necessary for scaling up CoBase
- Sources of Knowledge
- Database Instance
- Attribute Value Distributions
- Inter-Attribute Relationships
- Query and Answer Statistics
- Domain Expert
- Approach
- Generate Initial TAH
- With Minimal Expert Effort
- Edit the Hierarchy to Suit
- Application Context
- User Profile
57. The CoBase Knowledge-Base Editor
- Tool for Type Abstraction Hierarchies
- Display available TAHs
- Visualize TAHs as graphs
- Edit TAHs
- Add/Delete/Move nodes and sub-trees
- Assign names to nodes
- Interface to Knowledge Discovery Tools
- Cooperative Operators
- Specify parameter values
- Approximate
- Near-To, Similar-To
58. An Example of Using the KB Editor