Title: Growing Hierarchical Self-Organizing Maps for Web Mining
1. Growing Hierarchical Self-Organizing Maps for Web Mining
- Joseph P. Herbert and JingTao Yao
- Department of Computer Science,
- University of Regina
- CANADA S4S 0A2
- herbertj_at_cs.uregina.ca jtyao_at_cs.uregina.ca
- http://www2.cs.uregina.ca/herbertj http://www2.cs.uregina.ca/jtyao
2. Introduction
- Many information retrieval and machine learning techniques have not evolved to survive in the Web environment.
- There are two major problems in applying some machine learning techniques to Web mining:
- The dynamic and ever-changing nature of Web data.
- The dimensionality and sheer size of Web data.
3. Introduction
- There are three domains of application: Web content mining, Web usage mining, and Web structure mining.
- Self-Organizing Maps (SOMs) have been used for:
- Web page clustering
- Document retrieval
- Recommendation systems
4. Growing Hierarchical SOMs
- Growing Hierarchical SOMs are a hybridization of Growing SOMs and Hierarchical SOMs.
- Growing SOMs have a dynamic topology of neurons to help address the dynamic nature of data on the Web.
- Hierarchical SOMs are multi-level systems designed to mitigate the high-dimensionality problem of the data.
- Together, the hybrid system provides a logical solution to the combined problem of dynamic, high-dimensional data sources.
5. The Consistency Problem
- The growing hierarchical SOM model suffers from a new problem: maintaining the consistency of hierarchical relationships between levels.
- Training is done locally, without consideration of how changes affect other SOMs connected to the local focus.
- The Web Mining model for Self-Organizing Maps solves this problem through bidirectional update propagation.
6. The Web Mining Model for Self-Organizing Maps
[Figure: architecture of the Web Mining model, showing the Input, Hierarchy, Growth, and Update layers described below.]
- Input Layer
- Each vector is inserted into the SOM network for the first stage of competition.
- An iteration is complete once all input vectors have been presented.
- Hierarchy Layer
- A suitable level within the hierarchy of SOMs is found by traversing the tree.
- The SOM with the collectively maximum similarity to the input is marked and passed to the next layer.
- Growth Layer
- This layer determines whether neurons need to be added to or removed from the current SOM.
- If the error is above an upper-bound threshold, neurons are added; if the error is below a lower-bound threshold, neurons are removed (see the sketch after this list).
- Update Layer
- This layer updates the winning neuron and the neighbourhood associated with it.
- Bidirectional Update Propagation updates the parent neurons and children feature maps that are associated with the winning neuron.
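A minimal Python sketch of the Growth Layer decision described above. This is illustrative only: the error measure (mean quantization error), the threshold values, and the add/remove heuristics are assumptions, since the slides only state that neurons are added above an upper bound and removed below a lower bound.

import numpy as np

def quantization_error(weights, inputs):
    # Mean distance from each input vector to its best-matching neuron.
    dists = np.linalg.norm(inputs[:, None, :] - weights[None, :, :], axis=2)
    return dists.min(axis=1).mean()

def grow_or_shrink(weights, inputs, upper=0.5, lower=0.05):
    # weights: (n_neurons, dim) array; inputs: (n_samples, dim) array.
    err = quantization_error(weights, inputs)
    if err > upper:
        # Error too high: add a neuron near the worst-represented input
        # (an assumed placement heuristic; the slides do not specify one).
        dists = np.linalg.norm(inputs[:, None, :] - weights[None, :, :], axis=2)
        worst = inputs[dists.min(axis=1).argmax()]
        weights = np.vstack([weights, worst + 0.01 * np.random.randn(worst.size)])
    elif err < lower and len(weights) > 1:
        # Error very low: remove the neuron that wins for the fewest inputs.
        bmu = np.linalg.norm(inputs[:, None, :] - weights[None, :, :],
                             axis=2).argmin(axis=1)
        counts = np.bincount(bmu, minlength=len(weights))
        weights = np.delete(weights, counts.argmin(), axis=0)
    return weights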
7. Formal Definition
- A = {A1, ..., At}
- A set of hierarchy levels.
- Ai = {Wi,1, ..., Wi,m}
- A set of individual SOMs.
- Wi,j = {w1, ..., wn}
- A SOM of n neurons.
- Each neuron wk contains a storage unit sk and a weight vector vk (a code sketch of these structures follows).
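A minimal Python sketch of these definitions (illustrative, not from the original slides): a neuron carries a storage unit sk and a weight vector vk, a SOM Wi,j is a collection of neurons, and the hierarchy A is a list of levels Ai, each holding its SOMs.

from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class Neuron:
    weight: np.ndarray                           # weight vector vk
    storage: list = field(default_factory=list)  # storage unit sk

@dataclass
class SOM:
    neurons: List[Neuron]                        # Wi,j = {w1, ..., wn}

# A = {A1, ..., At}: each level Ai = {Wi,1, ..., Wi,m} is a list of SOMs.
Hierarchy = List[List[SOM]]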
8. Three Basic Functions
- Three functions are introduced for actions on the system (see the sketch after this list):
- Lev()
- Returns the hierarchy level that a SOM currently resides on.
- Chd()
- Returns the set of SOMs that have a child relationship to a particular neuron.
- Par()
- Returns the parent SOM of a particular neuron.
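A self-contained Python sketch of the three accessors. The dictionary-based representation and the explicit parent/child links are assumptions made for illustration; only the behaviour of Lev(), Chd(), and Par() comes from the slides.

def lev(som):
    # Lev(): the hierarchy level the SOM currently resides on.
    return som["level"]

def chd(neuron):
    # Chd(): the set of SOMs that have a child relationship to this neuron.
    return neuron.get("children", [])

def par(neuron):
    # Par(): the parent SOM of this neuron, i.e. the map one level up that
    # contains the neuron it descends from.
    parent_neuron = neuron.get("parent_neuron")
    return parent_neuron["som"] if parent_neuron else None

# Tiny example: a top-level map whose single neuron expands into a child map.
top_map = {"level": 1}
top_neuron = {"som": top_map, "parent_neuron": None, "children": []}
top_map["neurons"] = [top_neuron]

child_map = {"level": 2}
child_neuron = {"som": child_map, "parent_neuron": top_neuron, "children": []}
child_map["neurons"] = [child_neuron]
top_neuron["children"].append(child_map)

assert lev(child_map) == 2
assert chd(top_neuron) == [child_map]
assert par(child_neuron) is top_map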
9. Process Flow for Training
- Input is inserted into the network.
- The neuron that is most similar is selected.
- Descend through the hierarchy until similarity is maximal.
- Determine whether the correct number of neurons represents the pattern.
- Add or subtract neurons accordingly.
- Update the winning neuron and its neighbourhood.
- Update the children SOMs.
- Update the parent SOM.
- (This flow is sketched in code after the flowchart below.)
[Flowchart of the training process: (1) an input is presented; (2) the winning neuron on the current level is determined; (3) the similarity between the winning neuron and the input is tested, (4) descending to the next hierarchy level with the closest neuron and repeating from (2) until similarity is maximal; (5) the map is checked for whether it represents the input well enough, adding or subtracting neurons if not; (6) the winning neuron and its neighbourhood are updated; bidirectional propagation then (7) propagates updates downwards and (8) propagates updates upwards.]
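A compact Python sketch of this training flow. This is an illustrative reconstruction, not the authors' implementation: maps are NumPy weight matrices, similarity is Euclidean distance against an assumed threshold, the neighbourhood is a simple 1-D one, and the growth and propagation steps are left as pointers to the other sketches so the control flow stays visible.

import numpy as np

def bmu(weights, x):
    # Index of the best-matching (winning) neuron for input x.
    return int(np.linalg.norm(weights - x, axis=1).argmin())

def train_step(som, x, alpha=0.3, sim_threshold=0.5):
    """One pass of the process flow for a single input vector x.
    som: {'weights': (n, dim) ndarray, 'children': {neuron_index: child_som}}."""
    # Steps 2-4: descend through the hierarchy until similarity is maximal.
    while True:
        i = bmu(som["weights"], x)
        close_enough = np.linalg.norm(som["weights"][i] - x) < sim_threshold
        child = som["children"].get(i)
        if close_enough or child is None:
            break
        som = child                          # proceed to the next hierarchy level

    # Step 5: the grow/shrink check would run here (see the Growth Layer sketch).

    # Step 6: update the winning neuron and a simple 1-D neighbourhood (assumed).
    som["weights"][i] += alpha * (x - som["weights"][i])
    for j in (i - 1, i + 1):
        if 0 <= j < len(som["weights"]):
            som["weights"][j] += 0.5 * alpha * (x - som["weights"][j])

    # Steps 7-8: bidirectional update propagation (sketched on later slides).
    return som, i

# Example: a single root map with four random neurons in a 3-D feature space.
root = {"weights": np.random.rand(4, 3), "children": {}}
updated_map, winner = train_step(root, np.array([0.2, 0.8, 0.5]))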
10. Conceptual View
- At the top-most hierarchy level (A1), only one feature map exists.
- This map contains the absolute highest conceptual view of the entire hierarchical structure.
- Additional SOMs on subsequent levels offer more precise pattern abstraction.
- SOMs are denoted by the sequence of their parents.
- W3,6,4 denotes that the feature map is the fourth map on the third level, derived from the sixth map on the previous level.
11. Learning of Features
- Once a winning neuron wi has been identified (denoted by an asterisk), its weight vector vi is updated according to a learning rate α.
- The value of α decays over time according to the current training iteration.
- vi(q) = vi(q-1) + α(pk(q) - vi(q-1))
- The neighbourhood must also be updated, with a modified learning rate α′ (see the sketch below).
- vNi(d)(q) = vNi(d)(q-1) + α′(pk(q) - vNi(d)(q-1))
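A short Python sketch of these two update rules. The decay schedule for α, the neighbourhood (the winner's immediate 1-D neighbours), and the choice α′ = α/2 are assumptions; the slides only state that α decays over iterations and that the neighbourhood uses a modified rate.

import numpy as np

def update_winner_and_neighbourhood(V, i, p, q, alpha0=0.5, decay=0.01):
    """V: (n, dim) weight vectors; i: index of the winning neuron;
    p: input vector pk(q); q: current training iteration."""
    alpha = alpha0 / (1.0 + decay * q)        # assumed decay schedule for α
    # Winner: vi(q) = vi(q-1) + α (pk(q) - vi(q-1))
    V[i] += alpha * (p - V[i])
    # Neighbourhood Ni(d): vN(q) = vN(q-1) + α' (pk(q) - vN(q-1))
    alpha_prime = alpha / 2.0
    for j in (i - 1, i + 1):
        if 0 <= j < len(V):
            V[j] += alpha_prime * (p - V[j])
    return V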
12. Bidirectional Update Propagation
- Let wi be the winning neuron in SOM Wj,k for input pk.
- To propagate upwards (sketched below):
- Calculate Par(wi) = Wj-1,m, where Lev(Wj-1,m) < Lev(Wj,k).
- Update all neurons wa contained in Wj-1,m that are similar to wi.
- va(q) = va(q-1) + β(pk(q) - va(q-1))
13. Bidirectional Update Propagation
- To propagate downwards (sketched below):
- Calculate Chd(wi) ⊆ Aj+1, where j+1 is the next level in the hierarchy, succeeding level j.
- Update the corresponding weight vectors for all neurons wb in each SOM Wj+1,t, where Wj+1,t is on the lower level Aj+1.
- vb(q) = vb(q-1) + γ(pk(q) - vb(q-1))
- The learning rates β and γ are derived from the value of α.
- Generally, updates to parent neurons are not as strong as updates to children neurons.
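A matching Python sketch of the downward step. Deriving γ (and β) from α as fixed fractions is an assumption, chosen only so that parent updates stay weaker than child updates, as the slide states.

import numpy as np

def propagate_down(child_maps, p, alpha=0.3):
    """Update all neurons wb in the child SOMs Chd(wi) on level A(j+1).
    child_maps: list of (n, dim) weight arrays; p: input vector pk(q)."""
    gamma = 0.5 * alpha   # assumed: γ derived from α, larger than the β used
                          # upwards so that child updates are the stronger ones
    for W in child_maps:
        # vb(q) = vb(q-1) + γ (pk(q) - vb(q-1)), applied to every neuron
        W += gamma * (p - W)
    return child_maps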
14. Web-based News Coverage Example
- The top-most level of the hierarchy contains news articles pertaining to high-level concepts, arranged according to their features.
- The entire collection of Web documents on the online news site is presented through feature maps that abstract their similarities.
- The individual maps W2,1, ..., W2,10 hold Web documents pertaining to Global, Local, Political, Business, Weather, Entertainment, Technology, Sports, Opinion, and Health news, respectively.
15. Web-based News Coverage Example
- Feature map W2,10 has neurons linking to three children maps: W3,10,1, W3,10,2, and W3,10,3.
- Articles in W2,10 relate to Health news.
- W3,10,1 relates to Health Research Funding.
- W3,10,2 relates to Health Outbreak Crises.
- W3,10,3 relates to Doctor Shortages.
- New Health-related articles arrive rapidly, relating to a recent international outbreak.
- Neurons that point to the SOM W3,10,2 are added to the Health Outbreak Crises cluster of W2,10.
16. Conclusion
- The Web mining model of growing hierarchical self-organizing maps minimizes the effects of the dynamic-data and high-dimensionality problems.
- Bidirectional Update Propagation allows changes in pattern abstractions to be reflected on multiple levels of the hierarchy.
- The Web-based news coverage example demonstrates the effectiveness of growing hierarchical self-organizing maps when used in conjunction with bidirectional update propagation.
17. Growing Hierarchical Self-Organizing Maps for Web Mining
Thank-you
- Joseph P. Herbert and JingTao Yao
- Department of Computer Science,
- University of Regina
- CANADA S4S 0A2
- herbertj_at_cs.uregina.ca jtyao_at_cs.uregina.ca
- http://www2.cs.uregina.ca/herbertj http://www2.cs.uregina.ca/jtyao