Stochastic Clustering for Organizing Distributed Information Sources
1
Stochastic Clustering for Organizing Distributed
Information Sources
  • Mei-Ling Shyu, Shu-Ching Chen, Stuart H. Rubin

2
Motivation
  • Efficiently and effectively navigate and retrieve
    information in a distributed information
    environment.
  • Clusters provide a structure for organizing the
    large number of information sources for efficient
    browsing, searching, and retrieval.
  • Each cluster should exhibit similar data-access
    behavior. That is, the information sources in
    each cluster are expected to provide most of the
    information required by user queries that are
    closely related with respect to a particular
    application.

3
Two Basic Concepts
  • Clustering
  • Markov Model

4
Clustering
  • Clustering is a process of grouping the data into
    clusters so that items within a cluster have high
    similarity in comparison to one another, but are
    very dissimilar to items in other clusters.
  • Example: [Figure: sample data points grouped into clusters (Clustering Result).]
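The grouping idea above can be illustrated with a minimal sketch in Python (the data points and the distance threshold are invented for illustration; this is not the paper's algorithm):

```python
# Toy clustering: points whose gap to the previous cluster member is within
# a threshold join that cluster; otherwise they start a new one.
def cluster_points(points, threshold):
    clusters = []
    for p in sorted(points):
        if clusters and p - clusters[-1][-1] <= threshold:
            clusters[-1].append(p)   # close to the current cluster's tail
        else:
            clusters.append([p])     # too far away: start a new cluster
    return clusters

groups = cluster_points([1.0, 1.2, 5.0, 5.1, 9.0], threshold=1.0)
print(groups)  # [[1.0, 1.2], [5.0, 5.1], [9.0]]
```

Items within each group are close to one another, while the groups are far apart, matching the definition above.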
5
Markov Model
[Diagram: a Markov model showing states, transitions between states labeled with transition probabilities (e.g., 0.22, 0.8, 0.9), and initial state probabilities.]
6
Markov Model (cont.)
7
Markov Model (cont.)
How to pronounce the word "of"?
Observation symbol (output symbol): the probability
that the word "of" is pronounced "ov" is
.7 × .9 = .63.
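The slide's arithmetic, worked in Python (the interpretation of .7 as a state-path probability and .9 as an observation symbol probability is an assumption; the numbers themselves are from the slide):

```python
# Joint probability of a state path and its emitted observation symbol:
# multiply the path probability by the observation (output) probability.
p_path = 0.7           # assumed: probability of the state path for "of"
p_emit = 0.9           # assumed: probability that the state emits "ov"
p_ov = p_path * p_emit
print(round(p_ov, 2))  # 0.63
```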
8
Map Markov Model to Database E-R Model
9
Database Markov Model
[Diagram: the BranchInfo database modeled as a Markov model; labels: State, Observation Symbol (Output Symbol), Transition, Name.]
10
State Transition Probability
[Diagram: the BranchInfo model with transition probabilities T on its arcs.]
T is the transition probability between states.
Database access log information is used to
compute T, which is subject to the affinity
relationship between entities in the database:
when two entities are accessed together more
frequently, they have a higher relative affinity
and thus a higher transition probability.
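A minimal sketch of how T could be derived from access-log counts as described above (entity names and co-access counts are invented; each row of the count matrix is normalized into transition probabilities):

```python
# Hypothetical co-access counts from a database access log: entry [i][j]
# is how often entity i is accessed together with entity j.
co_access = {
    "Branch":  {"Branch": 2, "Company": 6, "City": 2},
    "Company": {"Branch": 3, "Company": 1, "City": 1},
    "City":    {"Branch": 1, "Company": 1, "City": 2},
}

def transition_matrix(counts):
    """Normalize each row so the transition probabilities out of a state sum to 1."""
    return {src: {dst: c / sum(row.values()) for dst, c in row.items()}
            for src, row in counts.items()}

T = transition_matrix(co_access)
# Higher affinity (more frequent co-access) gives a higher transition probability.
print(T["Branch"]["Company"])  # 0.6
```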
11
Observation Symbol Probability
[Figure: the observation symbol probability matrix B for BranchInfo (raw values).]
12
Observation Symbol Probability
[Figure: the normalized observation symbol probability matrix B for BranchInfo.]
13
Initial State Probability
[Figure: the initial state probabilities I for BranchInfo.]
I is the initial state probability. It is defined
as the fraction of the number of occurrences of
entity m with respect to the total number of
occurrences for all the member entities.
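The definition above can be computed directly; a sketch with invented entity names and occurrence counts:

```python
# Initial state probability I: occurrences of each entity divided by the
# total occurrences of all member entities (counts are hypothetical).
occurrences = {"Branch": 50, "Company": 30, "City": 20}
total = sum(occurrences.values())
I = {entity: n / total for entity, n in occurrences.items()}
print(I)  # {'Branch': 0.5, 'Company': 0.3, 'City': 0.2}
```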
14
Distributed Database Markov Chain
[Diagram: a distributed Markov chain connecting the BranchInfo and ProductInfo models.]
15
Browsing Graph
  • Each information source is represented as a
    node.
  • An arc connecting two nodes implies that the
    two information sources have structurally
    equivalent entities.
  • Two entities are said to be equivalent if they
    are deemed to possess the same real-world
    states.

16
Structurally Equivalent Entities
[Diagram: the BranchInfo and ProductInfo models side by side.]
Company and Manufacturer are structurally
equivalent.
17
Similarity Measure
[Figure: the matrix of pairwise similarities S between nodes.]
18
Stationary Probability
  • Each node on the Markov chain has a stationary
    probability.
  • It is obtained from the similarities S between
    nodes.
  • A first-order Markov model is applied: the
    probability that a node is chosen depends only
    on the last node on the chain, not on other
    previously chosen nodes.
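One standard way to obtain such stationary probabilities is power iteration on a transition matrix built from the pairwise similarities; a sketch with a made-up symmetric similarity matrix S (not the paper's values):

```python
# Hypothetical symmetric similarities S between three nodes (diagonal 0).
S = [[0.0, 0.8, 0.2],
     [0.8, 0.0, 0.5],
     [0.2, 0.5, 0.0]]

# Normalize each row into transition probabilities (first-order Markov model:
# the next node depends only on the current one).
P = [[s / sum(row) for s in row] for row in S]

# Power iteration: repeatedly apply pi <- pi * P until it stops changing.
pi = [1 / 3] * 3
for _ in range(200):
    pi = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]

print([round(x, 3) for x in pi])  # [0.333, 0.433, 0.233]
```

With symmetric similarities, the iteration converges to each node's share of the total similarity weight, so the most-similar node gets the highest stationary probability.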

19
Stochastic Clustering
  • A greedy algorithm based on browsing-graph
    traversal.
  • Procedure
  • Sort the information sources by stationary
    probability.
  • The information source with the highest
    stationary probability starts a cluster.
  • The information source with the next highest
    stationary probability that is accessible from
    the starting node on the browsing graph is
    added to the cluster.
  • When the cluster fills up to a predefined size,
    the next unclustered information source in the
    list starts a new cluster.
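The procedure above can be sketched in Python (node names, stationary probabilities, browsing-graph edges, and the cluster size are all invented; "accessible" is simplified here to direct adjacency to the cluster's seed node):

```python
# Greedy clustering driven by stationary probabilities on a browsing graph.
stationary = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}
edges = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}, "D": set()}

def stochastic_clusters(stationary, edges, max_size):
    order = sorted(stationary, key=stationary.get, reverse=True)
    unclustered, clusters = set(order), []
    for seed in order:                       # highest remaining probability...
        if seed not in unclustered:
            continue
        cluster = [seed]                     # ...starts a new cluster
        unclustered.discard(seed)
        for cand in order:                   # add next-highest accessible nodes
            if len(cluster) >= max_size:
                break                        # cluster filled to predefined size
            if cand in unclustered and cand in edges[seed]:
                cluster.append(cand)
                unclustered.discard(cand)
        clusters.append(cluster)
    return clusters

print(stochastic_clusters(stationary, edges, max_size=2))
# [['A', 'B'], ['C'], ['D']]
```

D is isolated on the browsing graph, so it ends up in a cluster by itself, as the traversal-based rule requires.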

20
Experimental Results
  • MMM = proposed algorithm
  • MCST = maximum cost spanning tree
  • MCC = maximum cost chain connection
  • BFS = breadth-first search
  • DFS = depth-first search
  • Random = randomly generated sequence
  • Single-Link
  • Complete-Link
  • Group-Average-Link
  • BEA = bond energy algorithm

[Plot: number of inter-cluster accesses vs. cluster size for each algorithm.]
21
Conclusion
  • The paper introduced a stochastically based
    framework, the MMM mechanism.
  • Designed a conceptual information-source
    clustering algorithm for a distributed
    information environment.
  • A cluster consists of several related
    information sources that are usually required
    together by queries in the same application
    domain.
  • Empirical studies show that the MMM mechanism
    performs best among the tested algorithms,
    yielding the smallest number of inter-cluster
    accesses.