Web Usage Mining for EBusiness Applications - PowerPoint PPT Presentation

About This Presentation
Title:

Web Usage Mining for EBusiness Applications

Description:

Humboldt-Universität zu Berlin, Institute of Information Systems ... A community is a collection of web pages created by individuals or any kind of ... – PowerPoint PPT presentation

Slides: 42
Provided by: bettinaber

Transcript and Presenter's Notes

Title: Web Usage Mining for EBusiness Applications


1
Knowledge and the Web: Web Structure Mining, Link Mining
12 June 2006
Bettina Berendt
Humboldt-Universität zu Berlin, Institute of Information Systems
http://www.wiwi.hu-berlin.de/berendt/2006s/kaw
2
From http://www.caida.org/tools/visualization/walrus/gallery1/lar-gr-l-13.png
3
From Chen, C. and Paul, R.J. Visualizing a knowledge domain's
intellectual structure. Computer, 34 (3), 65-71.
http://www.pages.drexel.edu/cc345/papers/ieeecomputer2001.pdf
4
Today's plan for filling in the method/data matrix
(L = lecture, P = practical, H = homework,
<no.> = course session)
5
Agenda
What's in a link?
Preliminaries: trees and graphs
Ranking: early approaches, HITS
Ranking: PageRank
Web spam
Topical relevance: Co-citation analysis
Outlook
6
Different applications of link mining
  • Basic idea: Referencing / linking to something means
    saying
  • "This is good" and/or
  • "This is related"
  • Search: ranking of search engine results
  • Earlier: What journals / authors are good /
    better? (impact factors)
  • How can link-based ranking be manipulated? Link
    farms / Web spam
  • How can Web spam be detected? A classification
    task
  • Co-citation analysis and other bibliometrics
  • What scientific papers / Web pages / etc. are
    similar or relevant to one another?
  • Identification of Web communities
  • What people / pages / documents / etc. have more
    to do with one another than with all others?
  • Counterterrorism as one application
  • Outlook (in fact, these are the origins): Social
    network analysis

7
Agenda
What's in a link?
Preliminaries: trees and graphs
Ranking: early approaches, HITS
Ranking: PageRank
Web spam
Topical relevance: Co-citation analysis
Outlook
8
Recall: Trees (slide from ISI)
  • A is the root node
  • B is the parent of D and E
  • D and E are children of B
  • (C,F) is an edge
  • 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 are leaves
  • A, B, C, D, E, F, G, H, I are internal nodes
  • The level (or depth) of E is 2 (number of edges
    to root)
  • The height (or order) of the tree is 4 (max
    number of edges from root to a leaf node)
  • The degree of node B is 2 (number of children)
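
The tree figure itself is not part of this transcript, so the following is only a sketch: the child map is a hypothetical tree chosen to agree with the bullets above (root A, B as parent of D and E, edge (C,F), leaves 1 to 10, internal nodes A to I). It shows how level, height and degree could be computed.

```python
# Hypothetical tree: an assumption that matches the bullet points,
# not a reconstruction of the original slide figure.
children = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F", "G"],
    "D": ["H", "1"],
    "E": ["2", "3"],
    "F": ["I", "4"],
    "G": ["5", "6"],
    "H": ["7", "8"],
    "I": ["9", "10"],
}

# Map every non-root node to its parent.
parent = {child: p for p, kids in children.items() for child in kids}


def level(node, root="A"):
    """Number of edges on the path from the root to `node`."""
    edges = 0
    while node != root:
        node = parent[node]
        edges += 1
    return edges


def height(node="A"):
    """Maximum number of edges from `node` down to a leaf."""
    kids = children.get(node, [])
    return 0 if not kids else 1 + max(height(k) for k in kids)


def degree(node):
    """Number of children of `node`."""
    return len(children.get(node, []))


# Leaves are nodes that appear as a child but have no children themselves.
leaves = sorted((n for n in parent if n not in children), key=int)
```

With this tree, `level("E")`, `height()` and `degree("B")` reproduce the values stated on the slide.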

9
Recall: Graphs (data structure def.)
  • Definition: A set of items connected by edges.
    Each item is called a vertex or node. Formally, a
    graph is a set of vertices and a binary relation
    between vertices, adjacency.
  • Formal definition: A graph G can be defined as a
    pair (V,E), where V is a set of vertices, and E
    is a set of edges between the vertices,
    $E \subseteq \{(u,v) \mid u, v \in V\}$. If the
    graph is undirected, the adjacency relation defined
    by the edges is symmetric, or
    $E \subseteq \{\{u,v\} \mid u, v \in V\}$ (sets of
    vertices rather than ordered pairs). If the graph
    does not allow self-loops, adjacency is
    irreflexive.
  • (http://www.nist.gov/dads/HTML/graph.html)
  • Note: Edges are also called links (esp. in
    hypertext graphs like the WWW).
  • $\in$ denotes the element-of relation
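
As a small illustration of the definition above, the following sketch represents a directed graph as a set of ordered pairs and an undirected graph as a set of two-element vertex sets, and checks the "symmetric" and "irreflexive" properties. The vertex names and helper functions are hypothetical and only serve the example.

```python
# Hypothetical example graph; names are illustrative only.
V = {"a", "b", "c"}

# Directed: E is a set of ordered pairs (u, v); ("c", "c") is a self-loop.
E_directed = {("a", "b"), ("b", "c"), ("c", "c")}

# Undirected: E is a set of two-element sets {u, v}.
E_undirected = {frozenset({"a", "b"}), frozenset({"b", "c"})}


def is_edge_set_over(V, E):
    """True if every edge only uses vertices from V."""
    return all(u in V and v in V for u, v in E)


def is_symmetric(E):
    """True if (u, v) in E always implies (v, u) in E."""
    return all((v, u) in E for u, v in E)


def is_irreflexive(E):
    """True if the relation contains no self-loop (u, u)."""
    return all(u != v for u, v in E)


def adjacency_relation(undirected_edges):
    """Ordered-pair adjacency relation induced by undirected edges."""
    pairs = set()
    for edge in undirected_edges:
        u, v = tuple(edge)
        pairs.add((u, v))
        pairs.add((v, u))
    return pairs


adjacency = adjacency_relation(E_undirected)
```

The relation derived from the undirected edge sets is symmetric by construction, whereas the directed example is neither symmetric nor irreflexive.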

10
Matrix Notation
Adjacency Matrix A
Based on the "Modeling the Internet and the Web" slide
set, Ch. 5, School of Information and Computer
Science, University of California, Irvine
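
The matrix shown on this slide is not in the transcript. As a minimal sketch, assuming the common convention that entry A[i][j] is 1 when page i links to page j, an adjacency matrix can be built as follows (page names and links are hypothetical):

```python
def adjacency_matrix(nodes, edges):
    """Return A with A[i][j] = 1 if there is an edge nodes[i] -> nodes[j]."""
    index = {node: i for i, node in enumerate(nodes)}
    A = [[0] * len(nodes) for _ in nodes]
    for u, v in edges:
        A[index[u]][index[v]] = 1
    return A


# Hypothetical three-page web.
nodes = ["p1", "p2", "p3"]
edges = [("p1", "p2"), ("p1", "p3"), ("p3", "p1")]

A = adjacency_matrix(nodes, edges)

# Row sums give out-degrees, column sums give in-degrees.
out_degree = [sum(row) for row in A]
in_degree = [sum(col) for col in zip(*A)]
```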
11
Agenda
What's in a link?
Preliminaries: trees and graphs
Ranking: early approaches, HITS
Ranking: PageRank
Web spam
Topical relevance: Co-citation analysis
Outlook
12
(No Transcript)
13
Early approaches
  • pp. 4-6 from: Link Analysis (slide set accompanying
    Baldi et al., Modeling the Internet and the Web), at
    http://ibook.ics.uci.edu/Slides/MIW20Chapter205.ppt

14
A brief recap of HITS (which you've read about)
  • pp. 7-9, 13 from: Link Analysis (slide set
    accompanying Baldi et al., Modeling the Internet
    and the Web), at
    http://ibook.ics.uci.edu/Slides/MIW20Chapter205.ppt
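
The referenced slides are not reproduced here. As a reminder of the idea behind HITS, the following is a minimal sketch of the standard hub/authority iteration: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The link data, the fixed iteration count and the L2 normalisation are assumptions of this example.

```python
import math


def hits(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of in-linking pages.
        auth = {n: 0.0 for n in nodes}
        for u, targets in links.items():
            for v in targets:
                auth[v] += hub[u]
        # Hub update: sum of authority scores of linked-to pages.
        hub = {n: sum(auth[v] for v in links.get(n, [])) for n in nodes}
        # Normalise both vectors to unit length so scores stay bounded.
        a_norm = math.sqrt(sum(x * x for x in auth.values())) or 1.0
        h_norm = math.sqrt(sum(x * x for x in hub.values())) or 1.0
        auth = {n: x / a_norm for n, x in auth.items()}
        hub = {n: x / h_norm for n, x in hub.items()}
    return hub, auth


# Hypothetical link structure: three hub-like pages, two authority-like pages.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(links)
```

In this toy graph `a1` ends up as the strongest authority, and `h1` and `h2` score higher as hubs than `h3` because they point to both authorities.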

15
Agenda
What's in a link?
Preliminaries: trees and graphs
Ranking: early approaches, HITS
Ranking: PageRank
Web spam
Topical relevance: Co-citation analysis
Outlook
16
  • pp. 15-24 from: Link Analysis (slide set
    accompanying Baldi et al., Modeling the Internet
    and the Web), at
    http://ibook.ics.uci.edu/Slides/MIW20Chapter205.ppt
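
For orientation, here is a minimal sketch of PageRank by power iteration, following the agenda item this slide belongs to. It is a common textbook formulation rather than the version on the referenced slides: the damping factor of 0.85, the uniform treatment of pages without out-links and the toy link data are all assumptions of this example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it links to."""
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in nodes}
        for p in nodes:
            targets = links.get(p, [])
            if targets:
                # A page passes its rank on in equal shares to its out-links.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank uniformly over all pages.
                for q in nodes:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical four-page web.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
```

Page `c`, which collects links from all other pages, receives the highest rank; page `d`, which nobody links to, keeps only the random-jump share.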

17
  • So what are the differences between HITS and
    PageRank?

18
Limitations
  • pp. 36-37 from: Link Analysis (slide set
    accompanying Baldi et al., Modeling the Internet
    and the Web), at
    http://ibook.ics.uci.edu/Slides/MIW20Chapter205.ppt

19
Additional reading on stability and other
properties of the two algorithms
  • Panayiotis Tsaparas (2006). Theoretical analysis
    of Link Analysis Ranking. Talk at the workshop
    "The Future of Web Search", Barcelona 2006.
  • Slides at
    http://grupoweb.upf.es/workshop/slides/fws_panayiotis.pdf

20
Agenda
What's in a link?
Preliminaries: trees and graphs
Ranking: early approaches, HITS
Ranking: PageRank
Web spam
Topical relevance: Co-citation analysis
Outlook
21
  • Becchetti, L., Castillo, C., Donato, D.,
    Leonardi, S., Baeza-Yates, R. (2006). Using
    rank propagation and probabilistic counting for
    link-based spam detection. Technical report,
    DELIS: Dynamically Evolving, Large-Scale
    Information Systems.
  • http://citeseer.ist.psu.edu/becchetti06using.html
  • Slides: talk at the workshop "The Future of Web
    Search", Barcelona 2006,
    http://grupoweb.upf.es/workshop/slides/fws_castillo.pdf

22
Agenda
What's in a link?
Preliminaries: trees and graphs
Ranking: early approaches, HITS
Ranking: PageRank
Web spam
Topical relevance: Co-citation analysis
Outlook
23
Types of citation analysis
  • Direct citation: direct linkages between documents
  • Co-citation: links documents cited together;
    composite judgement of hundreds of citers;
    dynamically changing
  • Bibliographic coupling: linkages due to common
    usage of source documents; represents the authors'
    view of similarity; fixed once the documents are
    written
[Figure: diagrams of the three types, with example documents A, B and C]
From Dingel, K., Lohde, T. (2006). Cluster
analysis based on co-citation data. Presentation
in the KaW seminar.
http://vasarely.wiwi.hu-berlin.de/lehre/2005s/kaw/Session3/Presentations/kaw_e_p1_050627.ppt
Source: Small (1973), Pitkow & Pirolli (1997),
White & Griffith (1981)
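
To make the two indirect measures concrete, here is a minimal sketch that counts both from a citation list: the co-citation strength of two documents is the number of documents citing both, and the bibliographic coupling strength is the number of references they share. The document names and the `cites` data are hypothetical.

```python
def cocitation_and_coupling(cites):
    """cites maps each citing document to the list of documents it cites."""
    docs = sorted(set(cites) | {d for refs in cites.values() for d in refs})
    # For each document, the set of documents that cite it.
    cited_by = {d: {c for c, refs in cites.items() if d in refs} for d in docs}
    cocitation, coupling = {}, {}
    for i, a in enumerate(docs):
        for b in docs[i + 1:]:
            # Co-citation: how many documents cite both a and b.
            cocitation[(a, b)] = len(cited_by[a] & cited_by[b])
            # Bibliographic coupling: how many references a and b share.
            coupling[(a, b)] = len(set(cites.get(a, [])) & set(cites.get(b, [])))
    return cocitation, coupling


# Hypothetical data: papers P1..P3 cite the older documents A, B, C.
cites = {"P1": ["A", "B"], "P2": ["A", "B", "C"], "P3": ["B", "C"]}
cocitation, coupling = cocitation_and_coupling(cites)
```

Adding a new citing paper can change the co-citation counts of A, B and C, while the coupling of P1, P2 and P3 depends only on their own reference lists.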
24
An example: the CiteSeer archive Web site
25
Google Scholar
26
Additional reading: Co-citation and bibliographic
coupling in Web link analysis
  • Dean, J., and Henzinger, M. R. (1999). Finding
    related pages in the World Wide Web. In
    Proceedings of WWW-8, the Eighth International
    World Wide Web Conference.
    http://citeseer.ist.psu.edu/dean99finding.html

27
Agenda
What's in a link?
Preliminaries: trees and graphs
Ranking: early approaches, HITS
Ranking: PageRank
Web spam
Topical relevance: Co-citation analysis
Outlook
28
Outlook: social network analysis
  • Bibliometrics and link mining have their roots in
    a much older area: social network analysis
  • Direct transfer of the link analysis methods we
    have seen: find "opinion leaders" in ciao.de and
    similar sites
  • See also viral marketing
  • Others: analyse communication patterns, prestige,
    power, ...

29
Social networks: example
30
(No Transcript)
31
  • Additional slides on Web Communities
  • based on
  • Flake, G.W., Lawrence, S., Giles, C.L. (2000).
    Efficient identification of web communities. In
    Proc. KDD 2000.

32
Web communities
  • A community is a collection of web pages created
    by individuals or any kind of associations that
    have a common interest in a specific topic, such
    as fan pages of a baseball team or official
    pages of PC vendors. Blog communities are another
    example.
  • Formally: Flake et al.'s definition of an ideal
    community
  • A Pokemon web site is a site that links to or
    is linked to by more Pokemon sites than
    non-Pokemon sites.
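
The Pokemon example can be read as a test on a candidate set of pages: every member must have more links (in either direction) to other members than to non-members. The following sketch checks that condition as the slide words it, with a strict "more than"; the function name and the toy link data are assumptions of this example.

```python
def is_ideal_community(edges, community):
    """edges: iterable of (page, page) links, direction ignored.
    True if every member has more links to members than to non-members."""
    community = set(community)
    for member in community:
        inside = outside = 0
        for a, b in edges:
            if a == b or member not in (a, b):
                continue  # skip self-loops and links not touching this member
            other = b if a == member else a
            if other in community:
                inside += 1
            else:
                outside += 1
        if inside <= outside:
            return False
    return True


# Hypothetical web: two triangles of pages joined by a single link p3 - x1.
edges = [
    ("p1", "p2"), ("p2", "p3"), ("p1", "p3"),
    ("p3", "x1"),
    ("x1", "x2"), ("x2", "x3"), ("x1", "x3"),
]
```

Here `{"p1", "p2", "p3"}` passes the test, whereas `{"p1", "p2"}` fails because each of its pages has as many links to `p3` outside the set as to the other member.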

33
An example, and how to find communities
34
An example, and how to find communities
35
Approximate communities
  • To apply this nice theorem, we would need to have
    the whole Web on our hard disk!
  • Realistically, we crawl a part of the Web
    starting with some pages that are in the
    community we are interested in.
  • Questions
  • What is crawling?
  • What pages are retrieved during this crawl?
  • What other assumptions have to be made?

36
Crawling
  • Archives are not always given
  • Crawling: techniques for assembling archives
    from the Web
  • Simple: the Unix command-line utility wget
  • Sophisticated: WIRE (also contains analysis tools)
  • Crawling involves graph search
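
The remark that crawling involves graph search can be sketched as a breadth-first search from seed pages up to a fixed link distance. To keep the example runnable without network access, `fetch_links` is a hypothetical callback that stands in for downloading a page and extracting its links, and `toy_web` is made-up data.

```python
from collections import deque


def crawl(seeds, fetch_links, max_depth):
    """Breadth-first crawl; returns {page: link distance from the seeds}."""
    depth = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in fetch_links(page):
            if target not in depth:  # visit each page only once
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


# Hypothetical in-memory "web" used instead of real HTTP requests.
toy_web = {"s": ["a", "b"], "a": ["c"], "b": ["a"], "c": ["d"], "d": []}
found = crawl(["s"], lambda page: toy_web.get(page, []), max_depth=2)
```

With a depth limit of 2, page `d` is never retrieved, which is one way a crawl ends up covering only part of the Web.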

37
Focused community crawling
38
(No Transcript)
39
What is the virtual sink (= the site that is
definitely not in the community)?
  • In the ideal version
  • In the approximate version, use an artificial
    virtual sink (a theorem ensures correctness even
    if this is not really at the center of the graph)
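
Flake et al. identify a community via a maximum-flow / minimum-cut computation between a source and a sink: the pages still reachable from the source after the flow is saturated form the community. The following is only a sketch of that idea using a plain Edmonds-Karp max flow. The node names `SOURCE` and `SINK`, the unit capacities on page links and the capacity of 10 on the source edge are assumptions of this example, not the settings from the paper.

```python
from collections import deque


def source_side_of_min_cut(capacity, source, sink):
    """capacity[u][v] is the capacity of the arc u -> v.
    Returns the set of vertices on the source side of a minimum cut."""
    # Residual capacities, with reverse arcs initialised to 0.
    residual = {}
    for u, arcs in capacity.items():
        for v, cap in arcs.items():
            residual.setdefault(u, {}).setdefault(v, 0)
            residual[u][v] += cap
            residual.setdefault(v, {}).setdefault(u, 0)

    def reachable_with_parents():
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable_with_parents()
        if sink not in parent:
            return set(parent)  # no augmenting path left
        # Walk back from the sink to collect the augmenting path.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= flow
            residual[v][u] += flow


# Hypothetical crawl result: two triangles joined by the link p3 - x1.
links = [("p1", "p2"), ("p2", "p3"), ("p1", "p3"), ("p3", "x1"),
         ("x1", "x2"), ("x2", "x3"), ("x1", "x3")]
capacity = {}
for u, v in links:  # treat links as undirected with unit capacity
    capacity.setdefault(u, {})[v] = 1
    capacity.setdefault(v, {})[u] = 1
capacity["SOURCE"] = {"p1": 10}  # virtual source attached to the seed page
capacity["x2"]["SINK"] = 1       # artificial virtual sink attached to
capacity["x3"]["SINK"] = 1       # the outermost crawled pages

community = source_side_of_min_cut(capacity, "SOURCE", "SINK") - {"SOURCE"}
```

In this toy graph the cheapest cut is the single link between `p3` and `x1`, so the seed `p1` ends up in a community with `p2` and `p3`.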

40
Repetition for a better result
41
(No Transcript)