Sitemap Creation
Andrej Wunrau, Seminar Web Mining, 16. 1. 2003
Outline
1. Task
2. Definition
3. Site Topology in Web Mining
4. Pre-processing
5. Rules and Patterns
6. Concept Hierarchies
7. My Approach
1. Task
- automated creation of a sitemap
- input: a URL
- output: a sitemap
- the sitemap represents the linkage between all pages that can be reached from the given URL
2. Definition
- a sitemap is a graph: the nodes are pages, the edges are hyperlinks
- spanning tree: a minimal graph
  - includes all nodes but only a subset of the edges
  - only one path between any two nodes
  - no cycles
  - no redundancy
  - minimum cost
- one graph can have several different spanning trees
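The definition above can be illustrated with a minimal sketch. It assumes the site is already available as an adjacency list and uses breadth-first search, which is one common way to obtain a spanning tree; the slides do not say which construction or cost measure is meant. The graph `links` and the function name `bfs_spanning_tree` are hypothetical.

```python
from collections import deque


def bfs_spanning_tree(links, root):
    """Return {page: parent} for a breadth-first spanning tree rooted at root."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in parent:       # first time this page is reached
                parent[target] = page      # keep exactly one incoming edge
                queue.append(target)
    return parent


# Hypothetical site graph (page -> pages it links to); it contains cycles.
links = {
    "index": ["a", "b"],
    "a": ["b", "index"],
    "b": ["c"],
    "c": ["index", "a"],
}

tree = bfs_spanning_tree(links, "index")
tree_edges = [(p, c) for c, p in tree.items() if p is not None]
print(tree_edges)
```

The tree keeps every reachable page but only one edge per page, so it has one edge fewer than it has nodes. A different visiting order (for example depth-first) would give a different spanning tree of the same graph.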
3. Site Topology in Web Mining
- automated creation of sitemaps for Web Usage Mining systems
- difficulties arise from the historical development of the web
- static and single-frame sites are easy to map
- problems: CGI, JavaScript, frames, Flash, ...
- personalisation techniques
3. Site Topology in Web Mining
- Web Mining is divided into 3 categories
- sitemap creation is part of Web Structure Mining
- topology is also important in Web Usage Mining and Web Content Mining
- it helps to solve different problems in data preparation and analysis
3. Site Topology in Web Mining
(Figure: architecture of WEBMINER by Robert Cooley)
3. Site Topology in Web Mining
- together with other information, knowledge of the site topology is needed to complete the tasks of
  - user identification
  - path completion
  - filtering rules and patterns
  - exploring user behaviour with concept hierarchies
4. Pre-processing
- log files are usually the only source of information about user behaviour
  - cookies, for instance, are often seen as privacy leaks
- unfortunately, logs don't contain enough information
  - sometimes the user's behaviour has to be guessed
- sitemaps can help
4. Pre-processing: user identification
- user identification
  - task: identifying the unique user
  - which log file entries have been caused by one user?
- two log entries with the same IP
  - we assume that only one user has caused them
  - this can be wrong
- a possible heuristic
  - assume a new user if the requested page cannot be reached from the pages already visited
4. Pre-processing: user identification
- example (log file; the accompanying sitemap figure is not reproduced)
  #  IP            URL
  1  123.456.78.9  A.html
  2  123.456.78.9  B.html
  3  123.456.78.9  C.html
  4  123.456.78.9  Y.html
- these are only heuristics: the user may have used a bookmark or typed the URL in directly
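The user identification heuristic can be sketched as follows. This is only an illustration under assumptions: "reached" is read as "directly linked from a page this user has already visited", and the sitemap `links` is hypothetical, because the sitemap figure of the example is not available. The function name `identify_users` is also invented for this sketch.

```python
def identify_users(entries, links):
    """Label each (ip, url) log entry with an assumed user (ip, index).

    A request is assigned to an existing user of the same IP if the page was
    already visited by that user or is linked from one of that user's pages.
    Otherwise a new user is assumed.
    """
    users = {}    # ip -> list of visited-page sets, one set per assumed user
    labels = []
    for ip, url in entries:
        candidates = users.setdefault(ip, [])
        for index, visited in enumerate(candidates):
            linked = any(url in links.get(page, set()) for page in visited)
            if url in visited or linked:
                visited.add(url)
                labels.append((ip, index))
                break
        else:
            # no known user of this IP could have followed a link to url
            candidates.append({url})
            labels.append((ip, len(candidates) - 1))
    return labels


# Hypothetical sitemap: Y.html is only linked from X.html.
links = {"A.html": {"B.html"}, "B.html": {"C.html"}, "X.html": {"Y.html"}}
log = [
    ("123.456.78.9", "A.html"),
    ("123.456.78.9", "B.html"),
    ("123.456.78.9", "C.html"),
    ("123.456.78.9", "Y.html"),
]
labels = identify_users(log, links)
```

With this assumed sitemap, the first three requests are attributed to one user and the request for Y.html to a second user behind the same IP. As the slide notes, a bookmark or a typed URL would produce the same log, so the result is a guess.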
4. Pre-processing: path completion
- path completion
- problems: proxies and the browser cache
  - although the user requests a page, there are no entries in the logs
  - e.g. the Back button of the browser
    - while the user uses this button, no log entries are caused
    - if the user then requests a page he has not visited before, a log entry is caused
- possible heuristic
  - if the new page can be reached from one of the pages the user visited before, he used the Back button
4. Pre-processing: path completion
- example (log file; the accompanying sitemap figure is not reproduced)
  #  IP            URL
  1  123.456.78.9  A.html
  2  123.456.78.9  B.html
  3  123.456.78.9  C.html
  4  123.456.78.9  X.html
- the entries B, A are added to the log between C and X
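A minimal sketch of the path completion heuristic, applied to the example log. The sitemap `links` is an assumption chosen so that X.html is linked only from A.html (the original figure is not available), and `complete_path` is a hypothetical name. The sketch walks back through the pages the user came along until it finds one that links to the newly requested page.

```python
def complete_path(requests, links):
    """Insert the pages presumably revisited via the Back button."""
    completed = []
    history = []   # pages on the user's current forward path
    for page in requests:
        if history and page not in links.get(history[-1], set()):
            # Not linked from the current page: look for the most recent
            # earlier page that does link to it.
            for i in range(len(history) - 2, -1, -1):
                if page in links.get(history[i], set()):
                    # The user went back from history[-1] to history[i].
                    completed.extend(reversed(history[i:-1]))
                    del history[i + 1:]
                    break
        completed.append(page)
        history.append(page)
    return completed


# Hypothetical sitemap for the example log.
links = {"A.html": {"B.html", "X.html"}, "B.html": {"C.html"}}
path = complete_path(["A.html", "B.html", "C.html", "X.html"], links)
print(path)
# ['A.html', 'B.html', 'C.html', 'B.html', 'A.html', 'X.html']
```

If no earlier page links to the requested page, the sketch leaves the log unchanged at that point, since the request may come from a bookmark or a typed URL.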
5. Rules and Patterns
- sitemaps help to compare the developer's view with the user's view
  - developer's view: sitemap, proposed usage
  - user's view: real usage, patterns and rules
- the bigger the difference, the more interesting
  - every rule that confirms the site topology is uninteresting
- rules that are expected but not present carry important information
  - for instance, the head page
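The comparison of the two views can be sketched with plain set operations. This is an illustration under assumptions: a mined rule is simplified to an ordered pair (page, page), a rule "confirms the topology" when the sitemap contains a direct link for that pair, and all data below are invented. Real rule miners also report support and confidence, which this sketch ignores.

```python
def compare_views(rules, links):
    """Split rules into confirming and unexpected ones, and list links
    for which no rule was found."""
    linked = {(src, dst) for src, targets in links.items() for dst in targets}
    found = set(rules)
    confirming = found & linked   # matches the topology: uninteresting
    unexpected = found - linked   # usage the developer did not propose
    missing = linked - found      # expected from the topology but absent
    return confirming, unexpected, missing


# Hypothetical sitemap and hypothetical mined rules.
links = {
    "index.html": {"A/a.html", "B/b.html"},
    "B/b.html": {"C/c1.html"},
}
rules = [("index.html", "B/b.html"), ("A/a.html", "C/c1.html")]

confirming, unexpected, missing = compare_views(rules, links)
```

Here the rule from index.html to B/b.html only restates a hyperlink, while the missing pair starting at index.html is the kind of absent rule about the head page that the slide calls important.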
6. Concept Hierarchies
- a tree of concepts with an increasing level of abstraction is called a concept hierarchy
(Figure: example concept hierarchy. Labels in the figure: user; head page index.html; potential customer; high potential customer; non customer; external links page; customer; "Thanks for ordering..." page.)
6. Concept Hierarchies
- a more abstract view of the pages
- used to compare the intended user behaviour with the real behaviour
- finding ideal navigation paths
- helps to redesign the site, e.g. to increase
  - user-to-customer conversion
  - usability
- sitemaps help to map the given concepts to all pages found
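A concept hierarchy and its use with a sitemap can be sketched as two small lookup tables. Everything concrete here is an assumption: the parent relation between the concepts, the page names `links.html` and `thanks.html`, and the function names are invented for illustration, since the structure of the original figure cannot be recovered.

```python
# Assumed hierarchy: concept -> more abstract parent concept.
CONCEPT_PARENT = {
    "user": None,
    "non customer": "user",
    "potential customer": "user",
    "high potential customer": "potential customer",
    "customer": "user",
}

# Assumed mapping from pages to their most specific concept.
PAGE_CONCEPT = {
    "index.html": "user",
    "links.html": "non customer",
    "thanks.html": "customer",
}


def generalise(page):
    """Return the concepts of a page, from most specific to most abstract."""
    chain = []
    concept = PAGE_CONCEPT.get(page)
    while concept is not None:
        chain.append(concept)
        concept = CONCEPT_PARENT[concept]
    return chain


def unmapped_pages(sitemap_pages):
    """Pages found in the sitemap that have no concept assigned yet."""
    return sorted(p for p in sitemap_pages if p not in PAGE_CONCEPT)


session = ["index.html", "thanks.html"]
abstract_session = [generalise(p)[0] for p in session]
```

Replacing each requested page by its concept turns a navigation path into a more abstract path, and `unmapped_pages` shows how a complete sitemap can reveal pages that still lack a concept.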
7. My Approach
- input: URL -> path and filename
- output: complete sitemap and a spanning tree
- the first file is opened, and its filename, path, level and referrers are written into an array
- the file is searched for hyperlinks
- these links are filtered, and each URL is pre-processed (direct, relative)
- the file to which a link points is opened, ...
7. My Approach
- very simple example
  L  path       filename    referrer
  0  test.de/   index.html
  1  test.de/A  a.html      index.html
  1  test.de/B  b.html      index.html
  2  test.de/C  c1.html     B/b.html
  2  test.de/C  c2.html     B/b.html
(Figure: graph of the example site with the nodes index, A/a.html, B/b.html, C/c1.html and C/c2.html)
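The steps of the approach can be sketched as a breadth-first crawl. This is not the author's implementation: the page contents in `PAGES`, the in-memory `fetch` function (used instead of real HTTP requests), and all names are assumptions, chosen so that the resulting rows match the example table. Only the first referrer of each page is recorded.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
import posixpath


class LinkParser(HTMLParser):
    """Collect the href values of all <a> tags in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def crawl(start_url, fetch):
    """Return (rows, edges): rows are (level, path, filename, referrer),
    edges are all hyperlinks found between pages of the start host."""
    host = urlsplit(start_url).netloc
    rows, edges = [], []
    seen = {start_url}
    queue = deque([(start_url, 0, "")])
    while queue:
        url, level, referrer = queue.popleft()
        html = fetch(url)
        if html is None:                      # page could not be opened
            continue
        directory, filename = posixpath.split(urlsplit(url).path)
        rows.append((level, host + directory, filename, referrer))
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            target = urljoin(url, href)       # resolves relative links
            if urlsplit(target).netloc != host:
                continue                      # filter links to other hosts
            edges.append((url, target))
            if target not in seen:
                seen.add(target)
                here = urlsplit(url).path.lstrip("/")
                queue.append((target, level + 1, here))
    return rows, edges


# Hypothetical page contents for the example site.
PAGES = {
    "http://test.de/index.html": '<a href="A/a.html">a</a><a href="B/b.html">b</a>',
    "http://test.de/A/a.html": '<a href="../index.html">home</a>',
    "http://test.de/B/b.html": '<a href="../C/c1.html">c1</a>'
                               '<a href="/C/c2.html">c2</a>'
                               '<a href="http://example.org/">external</a>',
    "http://test.de/C/c1.html": '<a href="c2.html">c2</a>',
    "http://test.de/C/c2.html": "",
}

rows, edges = crawl("http://test.de/index.html", PAGES.get)
for row in rows:
    print(row)
```

The list `edges` holds the complete sitemap, including the back link from a.html and the extra link from c1.html to c2.html, while the (referrer, page) pairs in `rows` form a spanning tree of it.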
7. My Approach
- visualisation with H3Viewer by Tamara Munzner
- spanning tree of the sitemap
Literature
- Tamara Munzner: Interactive Visualization of Large Graphs and Networks
- Robert Cooley, Bamshad Mobasher, Jaideep Srivastava: Data Preparation for Mining World Wide Web Browsing Patterns
- Robert Cooley: The Importance of Understanding Web Site Structure and Content when Performing Web Usage Mining
- Bettina Berendt, Myra Spiliopoulou: Analysis of navigation behaviour in web sites integrating multiple information systems
- Wie werden Surfer zu Kunden? Navigationsanalyse zur Ermittlung des Konversionspotenzials verschiedener Bereiche einer Site