Title: Stochastic Clustering for Organizing Distributed Information Sources
1. Stochastic Clustering for Organizing Distributed Information Sources
- Mei-Ling Shyu, Shu-Ching Chen, Stuart H. Rubin
2. Motivation
- Efficiently and effectively navigate and retrieve information in a distributed information environment.
- Clusters provide a structure for organizing the large number of information sources for efficient browsing, searching, and retrieval.
- Each cluster should show similarities in data access behavior. That is, the information sources in each cluster are expected to provide most of the required information for user queries that are closely related with respect to a particular application.
3. Two Basic Concepts
4. Clustering
- Clustering is the process of grouping data into clusters so that items within a cluster are highly similar to one another but very dissimilar to items in other clusters.
- Example: a clustering result (figure).
5. Markov Model
[Figure: a Markov model showing states, transition probabilities between states (e.g., 0.22, 0.8, 0.9), and initial state probabilities.]
6. Markov Model (cont.)
7. Markov Model (cont.)
- How is the word "of" pronounced?
- Observation symbol (output symbol).
- The probability that the word "of" is pronounced "ov" is 0.7 x 0.9 = 0.63.
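To make the arithmetic concrete, here is a minimal Python sketch of scoring an observation sequence in a Markov model with output symbols. The state names, symbols, and probability values are illustrative assumptions chosen only so that the product reproduces the 0.7 x 0.9 = 0.63 figure on the slide, not the model used in the paper.

```python
# Toy model (assumed, not from the paper): two states and their output symbols.
init = {"o": 1.0}                              # assumed initial state probabilities
trans = {"o": {"f": 0.7}}                      # assumed state transition probabilities
emit = {"o": {"o": 1.0}, "f": {"v": 0.9}}      # assumed observation symbol probabilities

def path_probability(states, symbols):
    """Multiply initial, transition, and emission probabilities along one state path."""
    p = init[states[0]] * emit[states[0]][symbols[0]]
    for prev, cur, sym in zip(states, states[1:], symbols[1:]):
        p *= trans[prev][cur] * emit[cur][sym]
    return p

print(path_probability(["o", "f"], ["o", "v"]))  # ~0.63, matching the slide's example
```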
8. Map Markov Model to Database E-R Model
9. Database Markov Model
[Figure: the BranchInfo entity modeled as a Markov model, with its states, observation symbols (output symbols), transitions, and the Name attribute labeled.]
10. State Transition Probability
[Figure: the state transition probability matrix T for BranchInfo.]
T is the transition probability between states. Database access log information is used to compute T, which is subject to the affinity relationship between entities in the database. That is, when two entities are accessed together more frequently, they have a higher relative affinity and therefore a higher transition probability. A sketch of this computation follows.
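The following Python sketch is one reading of this computation, not the authors' code. Each query in the hypothetical access log is treated as a set of entities accessed together, the affinity of two entities is the number of queries accessing both, and each row of T is normalized to sum to 1. The entity names in the example are made up.

```python
from collections import defaultdict

def transition_matrix(query_log, entities):
    """Estimate T from an access log: co-access affinity counts, then row normalization."""
    affinity = defaultdict(float)
    for accessed in query_log:            # accessed: set of entities used by one query
        for m in accessed:
            for n in accessed:
                affinity[(m, n)] += 1.0
    T = {}
    for m in entities:
        row_sum = sum(affinity[(m, n)] for n in entities)
        T[m] = {n: (affinity[(m, n)] / row_sum if row_sum else 0.0) for n in entities}
    return T

# Hypothetical log entries over a BranchInfo-like entity set.
log = [{"Name", "Company"}, {"Name", "Address"}, {"Name", "Company"}]
print(transition_matrix(log, ["Name", "Company", "Address"]))
```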
11. Observation Symbol Probability
[Figure: the raw observation symbol matrix B for BranchInfo.]
12. Observation Symbol Probability
[Figure: the normalized observation symbol matrix B for BranchInfo. A normalization sketch follows.]
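A minimal sketch of the normalization step, under the assumption that the raw matrix B holds counts of how often each observation symbol appears for each state, and that each row is normalized so a state's symbol probabilities sum to 1. The attribute and value names are hypothetical.

```python
def normalize_rows(counts):
    """Turn raw observation-symbol counts into per-state probability distributions."""
    B = {}
    for state, symbol_counts in counts.items():
        total = sum(symbol_counts.values())
        B[state] = {sym: (c / total if total else 0.0) for sym, c in symbol_counts.items()}
    return B

# Hypothetical raw counts for two attributes of BranchInfo.
raw_counts = {"Company": {"IBM": 3, "Sony": 1}, "Name": {"Miami": 2, "Tampa": 2}}
print(normalize_rows(raw_counts))  # {'Company': {'IBM': 0.75, 'Sony': 0.25}, 'Name': {...}}
```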
13. Initial State Probability
[Figure: the initial state probability vector I for BranchInfo.]
I is the initial state probability. It is defined
as the fraction of the number of occurrences of
entity m with respect to the total number of
occurrences for all the member entities.
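A short Python sketch of this definition, with hypothetical occurrence counts:

```python
def initial_probabilities(occurrences):
    """I[m] = occurrences of entity m / total occurrences of all member entities."""
    total = sum(occurrences.values())
    return {entity: count / total for entity, count in occurrences.items()}

occ = {"Name": 40, "Company": 35, "Address": 25}   # hypothetical counts from an access log
print(initial_probabilities(occ))                  # {'Name': 0.4, 'Company': 0.35, 'Address': 0.25}
```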
14. Distributed Database Markov Chain
[Figure: a distributed database Markov chain connecting the BranchInfo and ProductInfo information sources.]
15. Browsing Graph
- Each information source is represented as a node.
- An arc connecting two nodes implies that the two information sources have structurally equivalent entities (see the sketch after this list).
- Two entities are said to be equivalent if they are deemed to possess the same real-world states.
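A minimal sketch of building the browsing graph from pairs of structurally equivalent entities; it is illustrative only and assumes the equivalences are given as input rather than discovered automatically.

```python
from collections import defaultdict

def browsing_graph(equivalent_pairs):
    """equivalent_pairs: iterable of ((source_a, entity_a), (source_b, entity_b))."""
    graph = defaultdict(set)     # node -> set of nodes connected by an arc
    for (src_a, _), (src_b, _) in equivalent_pairs:
        if src_a != src_b:
            graph[src_a].add(src_b)
            graph[src_b].add(src_a)
    return graph

# Example from the next slide: Company (BranchInfo) and Manufacturer (ProductInfo).
pairs = [(("BranchInfo", "Company"), ("ProductInfo", "Manufacturer"))]
print(dict(browsing_graph(pairs)))  # {'BranchInfo': {'ProductInfo'}, 'ProductInfo': {'BranchInfo'}}
```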
16. Structurally Equivalent Entities
[Figure: the BranchInfo and ProductInfo schemas shown side by side.]
Company and Manufacturer are structurally equivalent.
17. Similarity Measure
[Slide content: the pairwise similarity measures S between information sources.]
18. Stationary Probability
- Each node on the Markov chain has a stationary probability.
- It is obtained from the similarities S between nodes.
- A first-order Markov model is applied: the probability that a node is chosen depends only on the last node on the chain, not on other previously chosen nodes (a sketch follows this list).
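A minimal sketch of one way to compute these stationary probabilities, under the assumption that the similarities S are row-normalized into first-order transition probabilities and then iterated to a fixed point; the paper's exact derivation may differ, and the similarity values below are hypothetical.

```python
def stationary_probabilities(S, iterations=200):
    """Power-iterate a row-stochastic matrix derived from the similarities S."""
    nodes = list(S)
    P = {m: {n: S[m][n] / sum(S[m].values()) for n in nodes} for m in nodes}
    pi = {n: 1.0 / len(nodes) for n in nodes}          # start from a uniform distribution
    for _ in range(iterations):
        pi = {n: sum(pi[m] * P[m][n] for m in nodes) for n in nodes}
    return pi

# Hypothetical similarity values between three information sources.
S = {"A": {"A": 0.0, "B": 2.0, "C": 1.0},
     "B": {"A": 2.0, "B": 0.0, "C": 1.0},
     "C": {"A": 1.0, "B": 1.0, "C": 0.0}}
print(stationary_probabilities(S))
```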
19. Stochastic Clustering
- Browsing-graph traversal based and greedy.
- Procedure (a sketch follows this list):
  - Sort the information sources according to their stationary probabilities.
  - The information source with the highest stationary probability starts a cluster.
  - The information source with the next highest stationary probability that is accessible from the starting node on the browsing graph is added to the cluster.
  - When the cluster fills up to a predefined size, the next unclustered information source in the list starts a new cluster.
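The sketch below is my reading of this procedure, not the authors' implementation; in particular, "accessible from the starting node" is interpreted here as directly adjacent on the browsing graph, and the size cap and example data are hypothetical.

```python
def stochastic_clustering(stationary, graph, cluster_size):
    """stationary: {source: probability}; graph: {source: set of adjacent sources}."""
    order = sorted(stationary, key=stationary.get, reverse=True)
    unclustered = set(order)
    clusters = []
    for start in order:
        if start not in unclustered:
            continue
        cluster = [start]                 # highest remaining stationary probability starts a cluster
        unclustered.discard(start)
        for candidate in order:           # consider remaining sources in probability order
            if len(cluster) >= cluster_size:
                break
            if candidate in unclustered and candidate in graph.get(start, set()):
                cluster.append(candidate)
                unclustered.discard(candidate)
        clusters.append(cluster)
    return clusters

pi = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}
g = {"A": {"B", "C"}, "B": {"A"}, "C": {"A", "D"}, "D": {"C"}}
print(stochastic_clustering(pi, g, cluster_size=2))  # [['A', 'B'], ['C', 'D']]
```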
20. Experimental Results
- MMM: the proposed algorithm
- MCST: maximum cost spanning tree
- MCC: maximum cost chain connection
- BFS: breadth-first search
- DFS: depth-first search
- Random: randomly generated sequence
- Single-Link
- Complete-Link
- Group-Average-Link
- BEA: bond energy algorithm
[Plot: number of inter-cluster accesses versus cluster size for each algorithm.]
21. Conclusion
- The paper introduced a stochastically based framework, the MMM mechanism.
- Designed a conceptual information source clustering algorithm for a distributed information environment.
- A cluster consists of several related information sources that are usually required together by queries in the same application domain.
- Empirical studies show that the MMM mechanism performs best among the tested algorithms, since it yields the smallest number of inter-cluster accesses.