Stochastic Clustering for Organizing Distributed Information Sources
1
Stochastic Clustering for Organizing Distributed
Information Sources
  • Mei-Ling Shyu, Shu-Ching Chen, Stuart H. Rubin

2
Motivation
  • Efficiently and effectively navigate and retrieve
    information in a distributed information
    environment.
  • Clusters provide a structure for organizing the
    large number of information sources for efficient
    browsing, searching, and retrieval.
  • Each cluster should exhibit similar data-access
    behavior. That is, the information sources in
    each cluster are expected to provide most of the
    information required by user queries that are
    closely related with respect to a particular
    application.

3
Two Basic Concepts
  • Clustering
  • Markov Model

4
Clustering
  • Clustering is a process of grouping the data into
    clusters so that items within a cluster have high
    similarity in comparison to one another, but are
    very dissimilar to items in other clusters.
  • Example: [Figure: sample data points grouped into clusters (Clustering Result).]
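The grouping idea above can be illustrated with a minimal sketch in Python (the data points and the distance threshold are invented for illustration; this is not the paper's algorithm):

```python
# Toy clustering: points whose gap to the previous cluster member is within
# a threshold join that cluster; otherwise they start a new one.
def cluster_points(points, threshold):
    clusters = []
    for p in sorted(points):
        if clusters and p - clusters[-1][-1] <= threshold:
            clusters[-1].append(p)   # close to the current cluster's tail
        else:
            clusters.append([p])     # too far away: start a new cluster
    return clusters

groups = cluster_points([1.0, 1.2, 5.0, 5.1, 9.0], threshold=1.0)
print(groups)  # [[1.0, 1.2], [5.0, 5.1], [9.0]]
```

Items within each group are close to one another, while the groups are far apart, matching the definition above.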
5
Markov Model
[Diagram: a Markov model showing states, transitions between states labeled with transition probabilities (e.g., 0.22, 0.8, 0.9), and initial state probabilities.]
6
Markov Model (cont.)
7
Markov Model (cont.)
How to pronounce the word "of"?
Observation symbol (output symbol): the probability
that the word "of" is pronounced "ov" is
.7 × .9 = .63.
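The slide's arithmetic, worked in Python (the interpretation of .7 as a state-path probability and .9 as an observation symbol probability is an assumption; the numbers themselves are from the slide):

```python
# Joint probability of a state path and its emitted observation symbol:
# multiply the path probability by the observation (output) probability.
p_path = 0.7           # assumed: probability of the state path for "of"
p_emit = 0.9           # assumed: probability that the state emits "ov"
p_ov = p_path * p_emit
print(round(p_ov, 2))  # 0.63
```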
8
Map Markov Model to Database E-R Model
9
Database Markov Model
[Diagram: the BranchInfo database modeled as a Markov model; labels: State, Observation Symbol (Output Symbol), Transition, Name.]
10
State Transition Probability
[Diagram: the BranchInfo model with transition probabilities T on its arcs.]
T is the transition probability between states.
Database access log information is used to
compute T, which is subject to the affinity
relationship between entities in the database:
when two entities are accessed together more
frequently, they have a higher relative affinity
and thus a higher transition probability.
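A minimal sketch of how T could be derived from access-log counts as described above (entity names and co-access counts are invented; each row of the count matrix is normalized into transition probabilities):

```python
# Hypothetical co-access counts from a database access log: entry [i][j]
# is how often entity i is accessed together with entity j.
co_access = {
    "Branch":  {"Branch": 2, "Company": 6, "City": 2},
    "Company": {"Branch": 3, "Company": 1, "City": 1},
    "City":    {"Branch": 1, "Company": 1, "City": 2},
}

def transition_matrix(counts):
    """Normalize each row so the transition probabilities out of a state sum to 1."""
    return {src: {dst: c / sum(row.values()) for dst, c in row.items()}
            for src, row in counts.items()}

T = transition_matrix(co_access)
# Higher affinity (more frequent co-access) gives a higher transition probability.
print(T["Branch"]["Company"])  # 0.6
```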
11
Observation Symbol Probability
[Figure: the observation symbol probability matrix B for BranchInfo (raw values).]
12
Observation Symbol Probability
[Figure: the normalized observation symbol probability matrix B for BranchInfo.]
13
Initial State Probability
[Figure: the initial state probabilities I for BranchInfo.]
I is the initial state probability. It is defined
as the fraction of the number of occurrences of
entity m with respect to the total number of
occurrences for all the member entities.
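The definition above can be computed directly; a sketch with invented entity names and occurrence counts:

```python
# Initial state probability I: occurrences of each entity divided by the
# total occurrences of all member entities (counts are hypothetical).
occurrences = {"Branch": 50, "Company": 30, "City": 20}
total = sum(occurrences.values())
I = {entity: n / total for entity, n in occurrences.items()}
print(I)  # {'Branch': 0.5, 'Company': 0.3, 'City': 0.2}
```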
14
Distributed Database Markov Chain
[Diagram: a distributed Markov chain connecting the BranchInfo and ProductInfo models.]
15
Browsing Graph
  • Each information source is represented as a
    node.
  • An arc connecting two nodes implies that the
    two information sources have structurally
    equivalent entities.
  • Two entities are said to be equivalent if they
    are deemed to possess the same real-world
    states.

16
Structurally Equivalent Entities
[Diagram: the BranchInfo and ProductInfo models side by side.]
Company and Manufacturer are structurally
equivalent.
17
Similarity Measure
[Figure: the matrix of pairwise similarities S between nodes.]
18
Stationary Probability
  • Each node on the Markov chain has a stationary
    probability.
  • It is obtained from the similarities S between
    nodes.
  • A first-order Markov model is applied: the
    probability that a node is chosen depends only
    on the last node on the chain, not on other
    previously chosen nodes.
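One standard way to obtain such stationary probabilities is power iteration on a transition matrix built from the pairwise similarities; a sketch with a made-up symmetric similarity matrix S (not the paper's values):

```python
# Hypothetical symmetric similarities S between three nodes (diagonal 0).
S = [[0.0, 0.8, 0.2],
     [0.8, 0.0, 0.5],
     [0.2, 0.5, 0.0]]

# Normalize each row into transition probabilities (first-order Markov model:
# the next node depends only on the current one).
P = [[s / sum(row) for s in row] for row in S]

# Power iteration: repeatedly apply pi <- pi * P until it stops changing.
pi = [1 / 3] * 3
for _ in range(200):
    pi = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]

print([round(x, 3) for x in pi])  # [0.333, 0.433, 0.233]
```

With symmetric similarities, the iteration converges to each node's share of the total similarity weight, so the most-similar node gets the highest stationary probability.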

19
Stochastic Clustering
  • A greedy algorithm based on browsing-graph
    traversal.
  • Procedure
  • Sort the information sources by stationary
    probability.
  • The information source with the highest
    stationary probability starts a cluster.
  • The information source with the next highest
    stationary probability that is accessible from
    the starting node on the browsing graph is
    added to the cluster.
  • When the cluster fills up to a predefined size,
    the next unclustered information source in the
    list starts a new cluster.
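The procedure above can be sketched in Python (node names, stationary probabilities, browsing-graph edges, and the cluster size are all invented; "accessible" is simplified here to direct adjacency to the cluster's seed node):

```python
# Greedy clustering driven by stationary probabilities on a browsing graph.
stationary = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}
edges = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}, "D": set()}

def stochastic_clusters(stationary, edges, max_size):
    order = sorted(stationary, key=stationary.get, reverse=True)
    unclustered, clusters = set(order), []
    for seed in order:                       # highest remaining probability...
        if seed not in unclustered:
            continue
        cluster = [seed]                     # ...starts a new cluster
        unclustered.discard(seed)
        for cand in order:                   # add next-highest accessible nodes
            if len(cluster) >= max_size:
                break                        # cluster filled to predefined size
            if cand in unclustered and cand in edges[seed]:
                cluster.append(cand)
                unclustered.discard(cand)
        clusters.append(cluster)
    return clusters

print(stochastic_clusters(stationary, edges, max_size=2))
# [['A', 'B'], ['C'], ['D']]
```

D is isolated on the browsing graph, so it ends up in a cluster by itself, as the traversal-based rule requires.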

20
Experimental Results
  • MMM = proposed algorithm
  • MCST = maximum cost spanning tree
  • MCC = maximum cost chain connection
  • BFS = breadth-first search
  • DFS = depth-first search
  • Random = randomly generated sequence
  • Single-Link
  • Complete-Link
  • Group-Average-Link
  • BEA = bond energy algorithm

[Plot: number of inter-cluster accesses vs. cluster size for each algorithm.]
21
Conclusion
  • The paper introduced a stochastically based
    framework, the MMM mechanism.
  • Designed a conceptual information-source
    clustering algorithm for a distributed
    information environment.
  • A cluster consists of several related
    information sources that are usually required
    together by queries in the same application
    domain.
  • Empirical studies show that the MMM mechanism
    performs best among the tested algorithms,
    yielding the smallest number of inter-cluster
    accesses.