Title: Visual Semantic Web Usage Mining
1Visual Semantic Web Usage Mining: the power of
sequences, graphs, and background knowledge for
understanding navigation
Bettina Berendt
Humboldt University Berlin, Institute of
Information Systems www.wiwi.hu-berlin.de/berendt
Talk at Università degli Studi di Roma - La
Sapienza, 6 May 2005
2Acknowledgements
- Elke Brenstein
- Martin Eisend
- Jorge Gonzalez
- Sebastian Hinz
- Andreas Hotho
- Anett Kralisch
- Ernestina Menasalvas
- Bamshad Mobasher
- Daniel Oberle
- Myra Spiliopoulou
- Gerd Stumme
- Max Teltzrow
- Bert Wendland
3Goals and top-level questions
- Make the world's knowledge available to the world
- How do people discover knowledge on the Web?
- How can more knowledge sources contribute to the
Web?
4Approaches to the current Web's biggest
challenges: lots of data, human-understandable
5What is Web Mining?
- Despite its success, one problem of the current
WWW is that much of this knowledge lies dormant
in the data.
- Web mining tries to overcome this problem by
applying data mining techniques to the content,
(hyperlink) structure, and usage of Web resources.
- Goals include
- the improvement of site design and site
structure,
- the generation of dynamic recommendations,
- and improving marketing.
6Semantic Web usage mining
7Agenda
8Semantics and assigning senses
9Semantics of requests, Step 1: Domain ontology
- community portal ka2portal.aifb.uni-karlsruhe.de
- ontology-based
- Knowledge base in F-Logic
- Static pages: annotations
- Dynamic pages: generated from queries
- Queries also in F-Logic
- Logs contain these queries
10- RESEARCHER
- PERSON
- PROJECT
- PUBLICATION
- RESEARCHTOPIC
- EVENT
- ORGANIZATION
- RESEARCHINTEREST
- LASTNAME
- TITLE
- ISABOUT
- EVENTS
- EVENTTITLE
- WORKSATPROJECT
- AUTHOR
- AFFILIATION
- ISWORKEDONBY
- PROGRAMCOMMITTEE
- EMPLOYS
11Semantics of requests, Step 2: Modelling
requests
- HOME www\.dermis\.net\/
- HOME dermis\.multimedica\.de
- DOIA \/doia/mainmenu\.asp\?zugrdlangdesp
- PEDOIA \/doia/mainmenu\.asp\?zugrplangdesp
- D_ALPH1 \/doia/abrowser\.asp\?zugrdlangdesp
- D_ALPH2 \/doia/abrowser\.asp\?zugrdlangdespbeginswithA-Z
- D_ALPH2 \/doia/abrowser\.asp\?zugrdlangdespbeginswithA-Zsize0-9
- D_LOKAL1 \/doia/dbrowser\.asp\?zugrdlangdespbenrA-Z
- D_LOKAL2 \/doia/dbrowser\.asp\?zugrdlangdespbenrA-Z_0-9
- D_LOKAL3 \/doia/dbrowser\.asp\?zugrdlangdespbenrA-Z_0-9_0-9
- D_LOKAL4 \/doia/dbrowser\.asp\?zugrdlangdespbenrA-Z_0-9_0-9_0-9
- SEARCH \/doia/abrowser\.asp\?zugrdlangdespbeginswithA-Za-zA-Za-zA-Za-ztypesearch.
- SEARCH \/doia/abrowser\.asp\?zugrplangdespbeginswithA-Za-zA-Za-zA-Za-ztypesearch.
- SEARCH \/doia/diagalphabrowser\.asp\?zugrdplangdesptypesearchbeginswith.
- D_DIAGNOSE \/doia/diagnose\.asp\?zugrdlangdeps.diagnr.
- P_DIAGNOSE \/doia/diagnose\.asp\?zugrplangdeps.diagnr.
- P_DIAGNOSE \/doia/diagnose\.asp\?langdepszugrpdiagnr.
- (and so on)
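The rule list above pairs a concept name with a regular expression over request URLs. A minimal sketch of how such rules could be applied is given here. The patterns are hypothetical stand-ins, not the ones on the slide: the query-string syntax (`zugr=d`, `beginswith=A`) is an assumption, and the function name `classify` and the fallback label `OTHER` are invented for illustration.

```python
import re

# Hypothetical (concept, pattern) rules, ordered from specific to general.
# The parameter syntax ("zugr=d", "beginswith=A") is assumed, not taken
# from the slide.
RULES = [
    ("HOME", r"^/$"),
    ("D_ALPH2", r"/doia/abrowser\.asp\?.*beginswith=[A-Z]"),
    ("D_ALPH1", r"/doia/abrowser\.asp\?"),
    ("D_DIAGNOSE", r"/doia/diagnose\.asp\?.*zugr=d"),
    ("P_DIAGNOSE", r"/doia/diagnose\.asp\?.*zugr=p"),
]
COMPILED = [(concept, re.compile(pattern)) for concept, pattern in RULES]


def classify(url):
    """Return the concept of the first rule whose pattern matches the URL."""
    for concept, pattern in COMPILED:
        if pattern.search(url):
            return concept
    return "OTHER"  # no rule matched


print(classify("/doia/abrowser.asp?zugr=d&lang=de&beginswith=A"))
```

Because the first matching rule wins, more specific patterns have to precede more general ones, which is why the hypothetical `D_ALPH2` rule is listed before `D_ALPH1`.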
12Semantics of requests, Step 2: Modelling users
Web server log
- 200.x4.xx.xx - - [09/Apr/2002:22:28:35 +0200]
"GET /cgi-bin/ivw/CP/doia/image.asp.ivw?zugrdlangecd14nr87diagnr757370
HTTP/1.0" 200 735
"http://www.dermis.net/doia/image.asp?zugrdlangecd14nr87diagnr757370"
"Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;
DigExt)"
- 200.x4.xx.xx → IP address
- doia/ diagnr757370 → requested page (also
search mode)
- etc.
IP address Localization
Culture and Language
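A log entry of this kind can be split into its fields before any user modelling takes place. The following is a minimal sketch, assuming the server writes the common "combined" log format. The sample line is a simplified illustration modelled on the slide's entry, and the names `LOG_PATTERN` and `parse_line` are hypothetical.

```python
import re
from datetime import datetime

# Assumed field layout: host, ident, user, [time], "request", status,
# size, "referrer", "user agent" (combined log format).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record


# Simplified, illustrative line; the query string is shortened.
SAMPLE = (
    '200.x4.xx.xx - - [09/Apr/2002:22:28:35 +0200] '
    '"GET /doia/image.asp?diagnr=757370 HTTP/1.0" 200 735 '
    '"http://www.dermis.net/doia/image.asp?diagnr=757370" '
    '"Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"'
)

record = parse_line(SAMPLE)
print(record["host"], record["url"], record["time"].isoformat())
```

The `host` field is the starting point for localization, and the `url` field is what the request model maps to a concept.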
13Pattern content, node labels, and sequence mining
14Regular expressions in node labels, templates,
and the semantics of sequences: pattern
discovery in WUM
- select t
- from node a b, template a b as t
- where a.url startswith "SEITE1-"
- and a.occurrence = 1
- and b.url contains "1SCHULE"
- and b.occurrence = 1
- and (b.support / a.support) > 0.2
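The logic of such a query can be approximated over plain session lists. This is a sketch under assumptions, not WUM's implementation: sessions are taken to be lists of URLs, "occurrence 1" is read as the first occurrence of a page in a session, the template is read as "b somewhere after a", and support is counted in sessions. The function name `find_patterns` is hypothetical.

```python
from collections import Counter


def find_patterns(sessions, a_prefix, b_substring, min_conf=0.2):
    """Return {(a, b): confidence} for pairs where b first occurs after a."""
    a_support = Counter()   # sessions containing page a
    ab_support = Counter()  # sessions where b's first visit follows a's
    for session in sessions:
        first_pos = {}
        for position, url in enumerate(session):
            first_pos.setdefault(url, position)  # keep first occurrence only
        a_pages = [u for u in first_pos if u.startswith(a_prefix)]
        b_pages = [u for u in first_pos if b_substring in u]
        for a in a_pages:
            a_support[a] += 1
            for b in b_pages:
                if first_pos[b] > first_pos[a]:
                    ab_support[(a, b)] += 1
    return {
        pair: count / a_support[pair[0]]
        for pair, count in ab_support.items()
        if count / a_support[pair[0]] > min_conf
    }


sessions = [
    ["SEITE1-A", "x", "1SCHULE-x"],
    ["SEITE1-A", "y"],
    ["SEITE1-A", "1SCHULE-x"],
    ["z", "1SCHULE-x", "SEITE1-A"],
]
print(find_patterns(sessions, "SEITE1-", "1SCHULE"))
```

The ratio of the two support counts plays the role of `b.support / a.support` in the query: the share of sessions visiting a that go on to visit b.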
15Similarities ...
16Content transitions, edge labels, and sequence
mining
17Semantics of sequences, Step 3: Strategy pattern
discovery (the miner STRATDYN)
- An ontology of navigation strategies
- Define strategy templates as regular expressions
- Of requests (mapped to ontological entities)
- Of transitions (between ontological entities)
- Ex. .search . individual
- Discover strategies by learning a strategy trie
...
...
...
...
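The idea of strategy templates as regular expressions over concept sequences, followed by a trie over the matching sessions, can be sketched as follows. This is not STRATDYN itself: the template ("a search step, later followed by an individual page"), the concept labels, and the names `matches_template` and `build_trie` are assumptions made for illustration.

```python
import re

# Hypothetical template over space-separated concept labels:
# any steps, then "search", then any steps, then "individual".
TEMPLATE = re.compile(r"^(?:\w+ )*search (?:\w+ )*individual(?: \w+)*$")


def matches_template(session):
    """True if the session (a list of concept labels) fits the template."""
    return TEMPLATE.match(" ".join(session)) is not None


def build_trie(sessions):
    """Aggregate sessions into a prefix tree with visit counts per node."""
    root = {"count": 0, "children": {}}
    for session in sessions:
        node = root
        node["count"] += 1
        for concept in session:
            node = node["children"].setdefault(
                concept, {"count": 0, "children": {}}
            )
            node["count"] += 1
    return root


sessions = [
    ["home", "search", "list", "individual"],
    ["home", "search", "individual"],
    ["home", "list"],
]
trie = build_trie([s for s in sessions if matches_template(s)])
print(trie["children"]["home"]["children"]["search"]["count"])
```

Each path from the root is one observed strategy, and the counts on its nodes show how many sessions shared that prefix.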
18Semantics of sequences, Step 4: Strategy pattern
evaluation
- Use strategy pattern statistics to
- Derive descriptive measures of patterns
- support, confidence
- popularity, effectiveness, efficiency
- Apply inferential statistics to compare patterns
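Two of the descriptive measures, support and confidence, and one possible inferential comparison can be sketched as follows. The slide does not say which statistical test is used, so the two-proportion z-test here is an assumption, as are the function names. Popularity, effectiveness, and efficiency are not defined on the slide and are left out.

```python
import math


def support(sessions, pattern):
    """Number of sessions for which the pattern predicate holds."""
    return sum(1 for session in sessions if pattern(session))


def confidence(sessions, premise, pattern):
    """Share of sessions satisfying the premise that also satisfy the pattern."""
    base = support(sessions, premise)
    if base == 0:
        return 0.0
    return support(sessions, lambda s: premise(s) and pattern(s)) / base


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # both groups all hits or all misses
    z = (hits_a / n_a - hits_b / n_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


sessions = [["search", "individual"], ["search", "list"], ["home"]]
has_search = lambda s: "search" in s
reaches_individual = lambda s: "individual" in s
print(support(sessions, has_search))
print(confidence(sessions, has_search, reaches_individual))
print(two_proportion_z(60, 100, 40, 100))
```

In this reading, a pattern that reaches its goal in 60 of 100 sessions would be compared against one that does so in 40 of 100, and the p-value indicates whether the difference is likely to be chance.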
19Assigning sense to sequences: URLs and
application events
URL
Web page with content
20Visualization
21Communication: Visual data mining. Step 5:
Mapping an ontological relation over concepts
to a linear order and to visual variables
[Figure: concreteness plotted against time; labels: reach goal, refine search, remain unspecific, abandon search]
22Communication: Visual data mining. Step 5:
Example
23Communication: Visual data mining. Step 6:
Visual abstraction → new semantic patterns
24Pattern structure and graph analysis
25Graph-based Web metrics (1): centrality
metrics
26Graph-based Web metrics (2): compactness
27Graph-based Web metrics (3): intermediate
sociometric measures
28Graph-based Web metrics (4): stratum
29Semantics, pattern structure, and graph analysis
30Sequence analysis: Presence (or not) of linear
navigation patterns
32Linear
33Unzooming the same pattern
marks diagnoses
34Ditto, in the original
35Non-linear
36Agenda
37Using results for site improvement
Name
38Using results for personalization
39A caveat: Culture in the sense of Hofstede/Hall
may be a less reliable predictor than ...
- domain knowledge
- a different notion of culture?!
- language
- geographic regions
- (current work with Anett Kralisch and others)
40Using the results for evaluation, site
improvement, and personalization
- Mining for the evaluation of sites and services
- Not-for-profit sites
- Multi-channel user contact
- Privacy attitudes and behaviour
- Differences in cognitive styles and abilities
- Internationalisation / localisation
- Behavioural patterns, user groups → recommend
(and evaluate) changes in
- page design
- navigation design
- domain ontology
41Agenda
42Authoring support for document servers
- Surveys and Web usage mining analysis of a digital
publishing service showed:
- Metadata creation is one of the main barriers for
contribution.
- Reasons include deficiencies in
- information flow
- understanding and use of structured search
- education in structured writing
- HCI aspects
43Authoring support for document servers
44Where does this take us?
45Summary
- Semantics → Web mining
- Data preparation for assigning possible senses
(concepts) to terms (URLs)
- Sequence mining for finding patterns constrained
by node labels
- Pattern content
- Sequence mining and visualization for finding
patterns constrained by edge labels
- Pattern content and structure
- Graph analysis for finding quantitative measures
of pattern structure
- The effect of content abstraction on pattern
structure
- Web mining → semantics
46Outlook
- Usage mining
- Learning, supplying, and integrating behavioural
- ontologies
- knowledge bases
- Understanding more about the implications of
graph structure in interaction with content
- Authoring support
- More intelligent text analysis
- Personalisation
- Mining and metacognition / reflexivity
47Semantic Web usage mining
48Semantic Web usage mining
49Semantic Web usage mining
50- distance matrix, which allows us to adapt
classical graph-based metrics to the educational
context easily. An entry in the matrix is the
length of the shortest path from a place A (a
page) to a place B (another page), counting
places only. Metrics are derived from this matrix
or from a so-called converted distance matrix,
in which infinite values (pages not reachable) are
replaced by a constant. In general, the metrics
express ideas such as the compactness of the site, or
the centrality and accessibility of nodes. They are a
means to validate the technical realisation of
the course structure. These metrics can be
derived automatically, but need to be interpreted
manually [2].
- Metrics which apply to a single page are (with
acronym, name, aspect and definition):
- COD - Converted Out Distance: Centrality (sum of
all entries in a row),
- CID - Converted In Distance: Accessibility (sum
of all entries in a column),
- ROC - Relative Out Centrality: Relative
Centrality (ROC = CD/COD),
- RIC - Relative In Centrality: Relative
Accessibility (RIC = CD/CID).
- Metrics which apply to a whole site are (with
acronym, name and definition):
- CD - Converted Distance: sum of all finite values
in the matrix,
- CP - Compactness: CP = (Max - CD) / (Max - Min), with
Max = (n^2 - n) C and Min = n^2 - n, where n is the
number of pages and C is the maximal distance in
the matrix.
- We can use metrics to classify pages in a course
system. Hubs are central pages, i.e. have a high
ROC, and are easy to find, i.e. have a high RIC.
Examples are tables of contents, index pages, or
any page that is part of the course portal. Content pages
are less central, i.e. have a low ROC, with only
a few links to other pages, i.e. a low RIC. The
overall compactness of the site is one of the
indicators for the degree of integration of a
page. A high compactness indicates good
navigability and good cross-referencing, but may also
indicate a poorly structured site.
- The stratum metric reveals to what degree the
hypertext is organised, for example that some
nodes must be read before others. The stratum
metric is based on the non-converted distance
matrix. The metrics OD (Out Distance) and ID
(In Distance) of a page are defined as the sum of all
finite entries in its row and column, respectively.
D is the total sum of all finite
distances. The prestige P of a page is defined as
P = OD - ID for that page. The absolute prestige
AP of the net is defined as the sum of the
absolute values of the prestiges of all pages. The
stratum S is a normalised absolute prestige,
typically normalised by the linear absolute
prestige LAP, the absolute prestige of a linear
sequence of pages.
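The definitions above can be made concrete with a short sketch that computes the metrics for a small link graph. Several choices are assumptions rather than part of the text: the conversion constant defaults to the number of pages n and is also used as C in the compactness formula, LAP uses the closed form n^3/4 for even n and (n^3 - n)/4 for odd n, and the names `distance_matrix` and `site_metrics` are hypothetical. At least two pages are required.

```python
from collections import deque
import math


def distance_matrix(links, pages):
    """Shortest path lengths between all pages (math.inf if unreachable)."""
    index = {page: i for i, page in enumerate(pages)}
    dist = [[math.inf] * len(pages) for _ in pages]
    for start in pages:
        row = dist[index[start]]
        row[index[start]] = 0
        queue = deque([start])
        while queue:  # breadth-first search from start
            page = queue.popleft()
            for target in links.get(page, ()):
                if row[index[target]] == math.inf:
                    row[index[target]] = row[index[page]] + 1
                    queue.append(target)
    return dist


def site_metrics(links, pages, k=None):
    """Compute COD, CID, ROC, RIC, CD, CP and stratum for a link graph."""
    n = len(pages)
    k = n if k is None else k  # assumed conversion constant, also used as C
    dist = distance_matrix(links, pages)
    conv = [[k if d == math.inf else d for d in row] for row in dist]
    cod = [sum(row) for row in conv]
    cid = [sum(conv[i][j] for i in range(n)) for j in range(n)]
    cd = sum(cod)
    max_cd, min_cd = (n * n - n) * k, n * n - n
    # Stratum uses the non-converted matrix: finite entries only.
    od = [sum(d for d in row if d != math.inf) for row in dist]
    in_d = [
        sum(dist[i][j] for i in range(n) if dist[i][j] != math.inf)
        for j in range(n)
    ]
    ap = sum(abs(o - i) for o, i in zip(od, in_d))
    lap = n ** 3 / 4 if n % 2 == 0 else (n ** 3 - n) / 4
    return {
        "COD": cod,
        "CID": cid,
        "ROC": [cd / c for c in cod],
        "RIC": [cd / c for c in cid],
        "CD": cd,
        "CP": (max_cd - cd) / (max_cd - min_cd),
        "stratum": ap / lap,
    }


# A strictly linear three-page course: a -> b -> c
metrics = site_metrics({"a": ["b"], "b": ["c"]}, ["a", "b", "c"])
print(metrics["CP"], metrics["stratum"])
```

For the linear chain the stratum is 1, since every page has a fixed place in the reading order. Adding a link from c back to a makes every page reachable from every other, which raises the compactness and drops the stratum to 0.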
51Thank you for your attention!
52Outlook
- Combine mining and experimental methodology
- Educational portal: Berendt & Brenstein, BRMIC
2001
- eHealth portal: Kralisch & Berendt, GOR 2005
- Learning, supplying, and integrating behavioural
- ontologies
- knowledge bases
- Patterns over time: pattern monitoring, streams,
Web dynamics
- Authoring support
- Personalisation
- Mining and metacognition / reflexivity