Visual Semantic Web Usage Mining - PowerPoint PPT Presentation

Transcript and Presenter's Notes


1
Visual Semantic Web Usage Mining: the power of
sequences, graphs, and background knowledge for
understanding navigation
Bettina Berendt
Humboldt University Berlin, Institute of
Information Systems www.wiwi.hu-berlin.de/berendt
Talk at Università degli Studi di Roma - La
Sapienza, 6 May 2005
2
Acknowledgements
  • Elke Brenstein
  • Martin Eisend
  • Jorge Gonzalez
  • Sebastian Hinz
  • Andreas Hotho
  • Anett Kralisch
  • Ernestina Menasalvas
  • Bamshad Mobasher
  • Daniel Oberle
  • Myra Spiliopoulou
  • Gerd Stumme
  • Max Teltzrow
  • Bert Wendland

3
Goals and top-level questions
  • Make the world's knowledge available to the world
  • How do people discover knowledge on the Web?
  • How can more knowledge sources contribute to the
    Web?

4
Approaches to the current Web's biggest
challenges: lots of data, human-understandable
5
What is Web Mining?
  • Despite its success, one problem of the current
    WWW is that much of this knowledge lies dormant
    in the data.
  • Web mining tries to overcome these problems by
    applying data mining techniques to the content,
    (hyperlink) structure, and usage of Web resources.
  • Goals include
    • the improvement of site design and site
      structure,
    • the generation of dynamic recommendations,
    • and improving marketing.

6
Semantic Web usage mining
7
Agenda
8
Semantics and assigning senses
9
Semantics of requests, Step 1: Domain ontology
  • community portal ka2portal.aifb.uni-karlsruhe.de
  • ontology-based
  • Knowledge base in F-Logic
  • Static pages: annotations
  • Dynamic pages: generated from queries
  • Queries also in F-Logic
  • Logs contain these queries

10
  • RESEARCHER
  • PERSON
  • PROJECT
  • PUBLICATION
  • RESEARCHTOPIC
  • EVENT
  • ORGANIZATION
  • RESEARCHINTEREST
  • LASTNAME
  • TITLE
  • ISABOUT
  • EVENTS
  • EVENTTITLE
  • WORKSATPROJECT
  • AUTHOR
  • AFFILIATION
  • ISWORKEDONBY
  • PROGRAMCOMMITTEE
  • EMPLOYS

11
Semantics of requests, Step 2: Modelling
requests
  • HOME www\.dermis\.net\/
  • HOME dermis\.multimedica\.de
  • DOIA \/doia/mainmenu\.asp\?zugrdlangdesp
  • PEDOIA \/doia/mainmenu\.asp\?zugrplangdesp
  • D_ALPH1 \/doia/abrowser\.asp\?zugrdlangdesp
  • D_ALPH2 \/doia/abrowser\.asp\?zugrdlangdespbeginswithA-Z
  • D_ALPH2 \/doia/abrowser\.asp\?zugrdlangdespbeginswithA-Zsize0-9
  • D_LOKAL1 \/doia/dbrowser\.asp\?zugrdlangdespbenrA-Z
  • D_LOKAL2 \/doia/dbrowser\.asp\?zugrdlangdespbenrA-Z_0-9
  • D_LOKAL3 \/doia/dbrowser\.asp\?zugrdlangdespbenrA-Z_0-9_0-9
  • D_LOKAL4 \/doia/dbrowser\.asp\?zugrdlangdespbenrA-Z_0-9_0-9_0-9
  • SEARCH \/doia/abrowser\.asp\?zugrdlangdespbeginswithA-Za-zA-Za-zA-Za-ztypesearch.
  • SEARCH \/doia/abrowser\.asp\?zugrplangdespbeginswithA-Za-zA-Za-zA-Za-ztypesearch.
  • SEARCH \/doia/diagalphabrowser\.asp\?zugrdplangdesptypesearchbeginswith.
  • D_DIAGNOSE \/doia/diagnose\.asp\?zugrdlangdeps.diagnr.
  • P_DIAGNOSE \/doia/diagnose\.asp\?zugrplangdeps.diagnr.
  • P_DIAGNOSE \/doia/diagnose\.asp\?langdepszugrpdiagnr. (and so on)
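
The list above pairs a concept label with a regular expression over request URLs. A minimal Python sketch of that idea follows. The patterns and the `classify` helper are hypothetical stand-ins, not the ones used in the talk: the punctuation of the original expressions did not survive in this transcript, and the query-parameter syntax (`zugr=d`, `beginswith=`, `diagnr=`) is an assumption.

```python
import re

# Hypothetical (label, pattern) pairs, loosely modelled on the slide.
# More specific patterns are listed first; the first match wins.
CONCEPT_PATTERNS = [
    ("D_DIAGNOSE", re.compile(r"/doia/diagnose\.asp\?.*zugr=d.*diagnr=\d+")),
    ("P_DIAGNOSE", re.compile(r"/doia/diagnose\.asp\?.*zugr=p.*diagnr=\d+")),
    ("D_ALPH2", re.compile(r"/doia/abrowser\.asp\?zugr=d.*beginswith=[A-Z]")),
    ("D_ALPH1", re.compile(r"/doia/abrowser\.asp\?zugr=d")),
    ("DOIA", re.compile(r"/doia/mainmenu\.asp\?zugr=d")),
    ("PEDOIA", re.compile(r"/doia/mainmenu\.asp\?zugr=p")),
]


def classify(url: str) -> str:
    """Return the label of the first pattern found in the URL."""
    for label, pattern in CONCEPT_PATTERNS:
        if pattern.search(url):
            return label
    return "OTHER"
```

Applying `classify` to every request in a session turns a sequence of URLs into a sequence of concepts, which is the input the later mining steps work on.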

12
Semantics of requests, Step 2: Modelling users
Web server log
  • 200.x4.xx.xx - - [09/Apr/2002:22:28:35 +0200]
    "GET /cgi-bin/ivw/CP/doia/image.asp.ivw?zugr=d&lang=e&cd=14&nr=87&diagnr=757370
    HTTP/1.0" 200 735 "http://www.dermis.net/doia/image.asp?zugr=d&lang=e&cd=14&nr=87&diagnr=757370"
    "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;
    DigExt)"
  • 200.x4.xx.xx → IP address
  • doia/ ... diagnr=757370 → requested page (also
    search mode)
  • etc.
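
Extracting the IP address and the requested page from such an entry can be sketched as follows. This assumes the server writes the Apache Combined Log Format. The sample line is a reconstruction with assumed punctuation, and `parse_log_line` is a hypothetical helper name.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Combined Log Format:
# host ident user [time] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line: str) -> dict:
    """Split one log entry into named fields plus its query parameters."""
    match = LOG_LINE.match(line)
    if match is None:
        raise ValueError("not a Combined Log Format line")
    record = match.groupdict()
    # parse_qs returns a list of values per key; keep the first one
    query = parse_qs(urlsplit(record["url"]).query)
    record["params"] = {key: values[0] for key, values in query.items()}
    return record


# Sample entry; separators such as '=' and '&' are assumed
SAMPLE = (
    '200.x4.xx.xx - - [09/Apr/2002:22:28:35 +0200] '
    '"GET /cgi-bin/ivw/CP/doia/image.asp.ivw'
    '?zugr=d&lang=e&cd=14&nr=87&diagnr=757370 HTTP/1.0" '
    '200 735 "http://www.dermis.net/doia/image.asp" '
    '"Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"'
)

record = parse_log_line(SAMPLE)
```

The `host` field feeds the localization step, and `params` carries the page identifiers (here `diagnr`) that the concept mapping relies on.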

IP address Localization
Culture and Language
13
Pattern content, node labels, and sequence mining
14
Regular expressions in node labels, templates,
and the semantics of sequences: pattern
discovery in WUM
  • select t
  • from node a b, template a b as t
  • where a.url startswith "SEITE1-"
  • and a.occurrence = 1
  • and b.url contains "1SCHULE"
  • and b.occurrence = 1
  • and (b.support / a.support) > 0.2
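
The ratio `b.support / a.support` in this query is the confidence with which visitors of an `a` page go on to reach a `b` page. A rough Python sketch of that computation is given below. It is a simplification, not WUM itself: it counts whole sessions instead of querying WUM's aggregated log, and the session data and the `confidence` helper are hypothetical.

```python
def confidence(sessions, is_a, is_b):
    """Share of sessions reaching an `a` page that later reach a `b` page."""
    support_a = 0
    support_b = 0
    for pages in sessions:
        # Index of the first occurrence of an `a` page, if there is one
        first_a = next((i for i, url in enumerate(pages) if is_a(url)), None)
        if first_a is None:
            continue
        support_a += 1
        # Look for a `b` page anywhere after that first `a` page
        if any(is_b(url) for url in pages[first_a + 1:]):
            support_b += 1
    return support_b / support_a if support_a else 0.0


# Hypothetical sessions, each a list of requested page names
sessions = [
    ["SEITE1-start", "SEITE2-list", "SEITE3-1SCHULE"],
    ["SEITE1-start", "SEITE2-list"],
    ["SEITE0-home", "SEITE3-1SCHULE"],
    ["SEITE1-start", "SEITE3-1SCHULE", "SEITE1-start"],
]

conf = confidence(
    sessions,
    is_a=lambda url: url.startswith("SEITE1-"),
    is_b=lambda url: "1SCHULE" in url,
)
keep_pattern = conf > 0.2
```

Three of the four sessions contain an `a` page and two of those continue to a `b` page, so the pattern passes the 0.2 threshold.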

15
Similarities ...
16
Content transitions, edge labels, and sequence
mining
17
Semantics of sequences, Step 3: Strategy pattern
discovery (the miner STRATDYN)
  • An ontology of navigation strategies
  • Define strategy templates as regular expressions
    • Of requests (mapped to ontological entities)
    • Of transitions (between ontological entities)
  • Ex. .search . individual
  • Discover strategies by learning a strategy trie
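
Matching a strategy template against a session can be sketched in Python by writing the session as a string of concept labels and the template as a regular expression over that string. The template below ("a search, later followed by an individual page") and the helper name are hypothetical; this is not STRATDYN's actual template syntax, and the trie learning step is not shown.

```python
import re

# Hypothetical template over space-separated concept labels:
# any steps, then "search", then any steps, then "individual", then any steps
SEARCH_THEN_INDIVIDUAL = re.compile(
    r"(?:\S+ )*search (?:\S+ )*individual(?: \S+)*"
)


def matches_strategy(concepts, template=SEARCH_THEN_INDIVIDUAL):
    """True if the whole concept sequence fits the strategy template."""
    return template.fullmatch(" ".join(concepts)) is not None
```

For example, `matches_strategy(["home", "search", "list", "individual"])` holds, while a session that reaches an individual page without any search does not match.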

18
Semantics of sequences, Step 4: Strategy pattern
evaluation
  • Use strategy patterns' statistics to
    • Derive descriptive measures of patterns
      • support, confidence
      • popularity, effectiveness, efficiency
    • Apply inferential statistics to compare patterns
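
The slide does not name a specific inferential test. As one possible illustration, the Python sketch below compares how often a strategy pattern occurs in two user groups with a two-proportion z-test. The choice of test, the function name, and the counts in the usage note are all assumptions.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one rate."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # Pattern occurs in all sessions or in none: no difference to test
        return 0.0, 1.0
    z = (p_a - p_b) / std_err
    # Standard normal CDF expressed through the error function
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)
```

With hypothetical counts of 60 matching sessions out of 100 in one group and 40 out of 100 in the other, `two_proportion_z_test(60, 100, 40, 100)` yields a z value of about 2.83, which would be read as a significant difference at the usual 5% level.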

19
Assigning sense to sequences: URLs and
application events
URL
Web page with content
20
Visualization
21
Communication: Visual data mining, Step 5:
Mapping an ontological relation over concepts
to a linear order and to visual variables
[Figure: axes Concreteness and Time; labels Reach goal, Refine search, Remain unspecific, Abandon search]
22
Communication: Visual data mining, Step 5:
Example
23
Communication: Visual data mining, Step 6:
Visual abstraction → new semantic patterns
24
Pattern structure and graph analysis
25
Graph-based Web metrics (1) - centrality
metrics
26
Graph-based Web metrics (2) - compactness
27
Graph-based Web metrics (3) - intermediate
sociometric measures
28
Graph-based Web metrics (4) - stratum
29
Semantics, pattern structure, and graph analysis
30
Sequence analysis: Presence (or not) of linear
navigation patterns
31
(No Transcript)
32
linear
33
Unzooming the same pattern
marks diagnoses
34
Ditto, in the original
35
Non-linear
36
Agenda
37
Using results for site improvement
Name
38
Using results for personalization
39
A caveat: Culture in the sense of Hofstede/Hall
may be a less reliable predictor than ...
  • domain knowledge
  • a different notion of culture?!
  • language
  • geographic regions
  • (current work with Anett Kralisch and others)

40
Using the results for evaluation, site
improvement, and personalization
  • Mining for the evaluation of sites and services
  • Not-for-profit sites
  • Multi-channel user contact
  • Privacy attitudes and behaviour
  • Differences in cognitive styles and abilities
  • Internationalisation / localisation
  • Behavioural patterns, user groups → recommend
    (and evaluate) changes in
    • page design
    • navigation design
    • domain ontology

41
Agenda
42
Authoring support for document servers
  • Surveys and Web usage mining analysis of a digital
    publishing service showed:
  • Metadata creation is one of the main barriers to
    contribution.
  • Reasons include deficiencies in
    • information flow
    • understanding and use of structured search
    • education in structured writing
    • HCI aspects

43
Authoring support for document servers
44
Where does this take us?
45
Summary
  • Semantics → Web mining
  • Data preparation for assigning possible senses
    (concepts) to terms (URLs)
  • Sequence mining for finding patterns constrained
    by node labels
  • Pattern content
  • Sequence mining and visualization for finding
    patterns constrained by edge labels
  • Pattern content and structure
  • Graph analysis for finding quantitative measures
    of pattern structure
  • The effect of content abstraction on pattern
    structure
  • Web mining → semantics

46
Outlook
  • Usage mining
  • Learning, supplying, and integrating behavioural
    • ontologies
    • knowledge bases
  • Understanding more about the implications of
    graph structure in interaction with content
  • Authoring support
  • More intelligent text analysis
  • Personalisation
  • Mining and metacognition / reflexivity

47
Semantic Web usage mining
48
Semantic Web usage mining
49
Semantic Web usage mining
50
  • distance matrix, which allows us to adapt
    classical graph-based metrics to the educational
    context easily. An entry in the matrix is the
    length of the shortest path from a place A (a
    page) to a place B (another page) by counting
    places only. Metrics are derived from this matrix
    or from a so-called converted distance matrix
    where infinite values (pages not reachable) are
    replaced by a constant. Metrics in general
    express ideas such as compactness of the site, or
    centrality and accessibility of nodes. They are a
    means to validate the technical realisation of
    the course structure. These metrics can be
    derived automatically, but need to be interpreted
    manually [2].
  • Metrics which apply to a single page are (with
    acronym, name, aspect and definition):
  • COD - Converted Out Distance - Centrality (sum of
    all entries in a row),
  • CID - Converted In Distance - Accessibility (sum
    of all entries in a column),
  • ROC - Relative Out Centrality - Relative
    Centrality (ROC = CD/COD),
  • RIC - Relative In Centrality - Relative
    Accessibility (RIC = CD/CID).
  • Metrics which apply to a whole site are (with
    acronym, name and definition):
  • CD - Converted Distance - sum of all finite values
    in the matrix,
  • CP - Compactness - CP = (Max - CD) / (Max - Min), with
    Max = (n^2 - n) * C and Min = n^2 - n, where n is the
    number of pages and C is the maximal distance in the
    matrix.
  • We can use metrics to classify pages in a course
    system. Hubs are central pages, i.e. have a high
    ROC, and are easy to find, i.e. have a high RIC.
    Examples are table of contents or index pages or
    any page part of the course portal. Content pages
    are less central, i.e. have a low ROC, with only
    a few links to other pages, i.e. a low RIC. The
    overall compactness of the site is one of the
    indicators for the degree of integration of a
    page. A high compactness indicates good
    navigability and good cross referencing, but also
    indicates a possibly poorly structured site.
  • The stratum metric reveals to what degree the
    hypertext is organised, for example that some
    nodes must be read before others. The stratum
    metric is based on the non-converted distance
    matrix. Metrics such as OD (Out Distance) and ID
    (In Distance) are defined as the sum of all
    finite entries. D is the total sum of all finite
    distances. The prestige P of a page is defined as
    P = OD - ID for that page. The absolute prestige
    AP of the net is defined as the sum of all
    absolute values of prestiges of all pages. The
    stratum S is a normalised absolute prestige,
    typically normalised by the linear absolute
    prestige LAP - the absolute prestige of a linear
    sequence of pages.
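
The metrics described on this slide can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation behind the talk: the site is given as a hypothetical dictionary `links` from each page to the pages it links to, distances count links along the shortest path, and the conversion constant is passed in explicitly (the usage note below uses the number of pages, which is an assumption).

```python
from collections import deque

INF = float("inf")


def distance_matrix(links):
    """Shortest path lengths between all pages, by breadth-first search.

    Every link target is assumed to appear as a key of `links` as well.
    """
    pages = sorted(links)
    dist = {a: {b: INF for b in pages} for a in pages}
    for start in pages:
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            page = queue.popleft()
            for target in links[page]:
                if dist[start][target] == INF:
                    dist[start][target] = dist[start][page] + 1
                    queue.append(target)
    return pages, dist


def converted_metrics(links, constant):
    """COD, CID, ROC, RIC, CD and CP from the converted distance matrix.

    Assumes at least two pages and constant > 1 (no zero denominators).
    """
    pages, dist = distance_matrix(links)
    n = len(pages)
    # Converted matrix: unreachable pairs get the constant, not infinity
    conv = {a: {b: constant if dist[a][b] == INF else dist[a][b]
                for b in pages} for a in pages}
    cod = {a: sum(conv[a].values()) for a in pages}           # row sums
    cid = {b: sum(conv[a][b] for a in pages) for b in pages}  # column sums
    cd = sum(cod.values())
    max_cd = (n * n - n) * constant
    min_cd = n * n - n
    return {
        "COD": cod,
        "CID": cid,
        "CD": cd,
        "ROC": {p: cd / cod[p] for p in pages},
        "RIC": {p: cd / cid[p] for p in pages},
        "CP": (max_cd - cd) / (max_cd - min_cd),
    }


def absolute_prestige(links):
    """Sum of |OD - ID| over all pages, using finite distances only."""
    pages, dist = distance_matrix(links)
    total = 0
    for page in pages:
        out_distance = sum(d for d in dist[page].values() if d != INF)
        in_distance = sum(dist[a][page] for a in pages if dist[a][page] != INF)
        total += abs(out_distance - in_distance)
    return total


def stratum(links):
    """Absolute prestige normalised by that of a linear chain of equal size."""
    n = len(links)
    chain = {i: [i + 1] if i + 1 < n else [] for i in range(n)}
    return absolute_prestige(links) / absolute_prestige(chain)
```

For a three-page linear site `{"a": ["b"], "b": ["c"], "c": []}` with the constant set to 3, CD is 13, compactness is 5/12, and the stratum is 1. A fully cross-linked three-page site has compactness 1 and stratum 0, which matches the reading of high compactness as dense cross-referencing with little imposed reading order.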

51
Thank you for your attention!
52
Outlook
  • Combine mining and experimental methodology
    • Educational portal: Berendt & Brenstein, BRMIC
      2001
    • eHealth portal: Kralisch & Berendt, GOR 2005
  • Learning, supplying, and integrating behavioural
    • ontologies
    • knowledge bases
  • Patterns over time: pattern monitoring, streams,
    Web dynamics
  • Authoring support
  • Personalisation
  • Mining and metacognition / reflexivity