Title: Web Usage Mining for EBusiness Applications
1Knowledge and the Web: Web Structure Mining, Link Mining
12 June 2006
Bettina Berendt
Humboldt-Universität zu Berlin, Institute of Information Systems
http://www.wiwi.hu-berlin.de/berendt/2006s/kaw
2From: http://www.caida.org/tools/visualization/walrus/gallery1/lar-gr-l-13.png
3From: Chen, C. and Paul, R.J. Visualizing a knowledge domain's intellectual structure. Computer, 34 (3), 65-71. http://www.pages.drexel.edu/cc345/papers/ieeecomputer2001.pdf
4Today's plan for filling in the method/data matrix (L = lecture, P = practical, H = homework, <no.> = course session)
5Agenda
What's in a link?
Preliminaries: trees and graphs
Ranking: early approaches, HITS
Ranking: PageRank
Web spam
Topical relevance: co-citation analysis
Outlook
6Different applications of link mining
- Basic idea: Referencing / linking something means saying
- "This is good" and/or
- "This is related"
- Search: ranking of search engine results
- Earlier: What journals / authors are good / better? (impact factors)
- How can link-based ranking be manipulated? Link farms / Web spam
- How can Web spam be detected? A classification task
- Co-citation analysis and other bibliometrics
- What scientific papers / Web pages / etc. are similar or relevant to one another?
- Identification of Web communities
- What people / pages / documents / etc. have more to do with one another than with all others?
- Counterterrorism as one application
- Outlook (in fact, these are the origins): Social network analysis
8Recall: Trees (slide from ISI)
- A is the root node
- B is the parent of D and E
- D and E are children of B
- (C,F) is an edge
- 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 are leaves
- A, B, C, D, E, F, G, H, I are internal nodes
- The level (or depth) of E is 2 (number of edges to root)
- The height (or order) of the tree is 4 (max number of edges from root to a leaf node)
- The degree of node B is 2 (number of children)
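The tree figure itself is not part of this transcript. As a minimal sketch, the following Python code builds one hypothetical tree that is consistent with every statement in the list above (the exact shape is an assumption, not the original figure) and computes depth, height, and degree from it.

```python
# Hypothetical tree: chosen only so that all statements on the slide hold.
children = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F", "G"],
    "D": ["H", "I"],
    "E": ["1", "2"],
    "F": ["3", "4"],
    "G": ["5", "6"],
    "H": ["7", "8"],
    "I": ["9", "10"],
}

# Reverse lookup: child -> parent.
parent = {c: p for p, cs in children.items() for c in cs}


def depth(node, root="A"):
    """Level of a node: number of edges on the path up to the root."""
    d = 0
    while node != root:
        node = parent[node]
        d += 1
    return d


def height(node="A"):
    """Maximum number of edges from node down to a leaf."""
    kids = children.get(node, [])
    return 0 if not kids else 1 + max(height(k) for k in kids)


def degree(node):
    """Number of children of a node."""
    return len(children.get(node, []))


# Leaves are the nodes that never appear as a parent.
leaves = sorted((n for cs in children.values() for n in cs if n not in children), key=int)
internal = sorted(children)
```

In this sketch, `depth("E")`, `height()`, and `degree("B")` reproduce the values 2, 4, and 2 given on the slide.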
9Recall: Graphs (data structure def.)
- Definition: A set of items connected by edges. Each item is called a vertex or node. Formally, a graph is a set of vertices and a binary relation between vertices, adjacency.
- Formal definition: A graph G can be defined as a pair (V,E), where V is a set of vertices, and E is a set of edges between the vertices: E ⊆ {(u,v) | u, v in V}. If the graph is undirected, the adjacency relation defined by the edges is symmetric, or E ⊆ {{u,v} | u, v in V} (sets of vertices rather than ordered pairs). If the graph does not allow self-loops, adjacency is irreflexive.
- (http://www.nist.gov/dads/HTML/graph.html)
- Note: Edges are also called links (esp. in hypertext graphs like the WWW).
- "in" denotes the element-of relation
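The definition above translates almost literally into code. The following is a small Python sketch, with made-up vertex names, that stores a graph as a pair (V, E) of sets and checks the three properties mentioned: edges only connect vertices of V, symmetry (undirected graph), and irreflexivity (no self-loops).

```python
# A small directed example graph; the vertex names are illustrative only.
V = {"a", "b", "c"}
E = {("a", "b"), ("b", "a"), ("b", "c")}


def is_graph(V, E):
    """Every edge must connect two vertices that are elements of V."""
    return all(u in V and v in V for u, v in E)


def is_symmetric(E):
    """True if every edge (u, v) has its reverse (v, u): an undirected graph."""
    return all((v, u) in E for u, v in E)


def is_irreflexive(E):
    """True if the graph has no self-loops (u, u)."""
    return all(u != v for u, v in E)
```

Here the pair (V, E) is a valid graph without self-loops, but it is not symmetric, because the link from b to c has no reverse link. Hyperlink graphs are typically of this directed kind.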
10Matrix Notation
Adjacency Matrix
A
Based on: "Modeling the Internet and the Web" slide set, Ch. 5, School of Information and Computer Science, University of California, Irvine
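The matrix shown on the slide is not part of this transcript. As a minimal sketch, assuming the common convention that A[i][j] = 1 if there is a link from node i to node j (and 0 otherwise), an adjacency matrix can be built from an edge list like this. The node names are illustrative.

```python
def adjacency_matrix(nodes, edges):
    """Return A as a list of rows, with A[i][j] = 1 iff there is an edge nodes[i] -> nodes[j]."""
    index = {n: i for i, n in enumerate(nodes)}
    A = [[0] * len(nodes) for _ in nodes]
    for u, v in edges:
        A[index[u]][index[v]] = 1
    return A


def out_degrees(A):
    """Row sums: number of outgoing links per node."""
    return [sum(row) for row in A]


def in_degrees(A):
    """Column sums: number of incoming links per node."""
    return [sum(col) for col in zip(*A)]
```

With this convention, row sums give out-degrees and column sums give in-degrees, which is the starting point for the link-based ranking methods that follow.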
13Early approaches
- pp. 4-6 from "Link Analysis"
- (Slide set accompanying Baldi et al., Modeling the Internet and the Web, at http://ibook.ics.uci.edu/Slides/MIW20Chapter205.ppt)
14A brief recap of HITS (which you've read about)
- pp. 7-9, 13 from "Link Analysis"
- (Slide set accompanying Baldi et al., Modeling the Internet and the Web, at http://ibook.ics.uci.edu/Slides/MIW20Chapter205.ppt)
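The recap itself is in the referenced slide set. As a reminder of the basic idea, the following Python sketch runs the standard HITS iteration on an edge list: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to, with normalization after each step. The function name, the fixed iteration count, and the Euclidean normalization are choices made for this sketch.

```python
import math


def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a directed edge list."""
    nodes = sorted({n for e in edges for n in e})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of pages that link to n.
        auth = {n: sum(hub[u] for u, v in edges if v == n) for n in nodes}
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # Hub update: sum of authority scores of pages that n links to.
        hub = {n: sum(auth[v] for u, v in edges if u == n) for n in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth
```

For example, with the links h1 -> a, h2 -> a, and h1 -> b, page a ends up as the strongest authority and h1 as the strongest hub.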
16- pp. 15-24 from "Link Analysis"
- (Slide set accompanying Baldi et al., Modeling the Internet and the Web, at http://ibook.ics.uci.edu/Slides/MIW20Chapter205.ppt)
17- So what are the differences between HITS and
PageRank?
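One way to approach this question is to look at the PageRank computation itself. The following is a minimal power-iteration sketch in Python, not the implementation from the referenced slides. The damping factor 0.85 is a commonly used value that is assumed here, and spreading the rank of pages without out-links uniformly over all pages is one of several possible conventions.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Return a dict page -> PageRank score; the scores sum to 1."""
    nodes = sorted({n for e in edges for n in e})
    out = {n: [v for u, v in edges if u == n] for n in nodes}
    n = len(nodes)
    rank = {x: 1.0 / n for x in nodes}
    for _ in range(iterations):
        # Rank held by pages without out-links is redistributed uniformly (assumption).
        dangling = sum(rank[x] for x in nodes if not out[x])
        new = {x: (1 - damping) / n + damping * dangling / n for x in nodes}
        # Each page passes its rank on in equal shares along its out-links.
        for u in nodes:
            for v in out[u]:
                new[v] += damping * rank[u] / len(out[u])
        rank = new
    return rank
```

Note that this yields a single, query-independent score per page, whereas HITS computes two scores (hub and authority) per page.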
18Limitations
- pp. 36-37 from "Link Analysis"
- (Slide set accompanying Baldi et al., Modeling the Internet and the Web, at http://ibook.ics.uci.edu/Slides/MIW20Chapter205.ppt)
19Additional reading on stability and other properties of the two algorithms
- Panayiotis Tsaparas (2006). Theoretical analysis of Link Analysis Ranking. Talk at the Workshop "The Future of Web Search", Barcelona 2006.
- Slides at http://grupoweb.upf.es/workshop/slides/fws_panayiotis.pdf
21- Becchetti, L., Castillo, C., Donato, D., Leonardi, S., Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. Technical report, DELIS: Dynamically Evolving, Large-Scale Information Systems.
- http://citeseer.ist.psu.edu/becchetti06using.html
- Slides: Talk at the Workshop "The Future of Web Search", Barcelona 2006
- http://grupoweb.upf.es/workshop/slides/fws_castillo.pdf
23Types of citation analysis
- Direct citation: direct linkages between documents
- Bibliographic coupling: linkages due to common usage of source documents; represents the authors' view of similarity; fixed once documents are written
- Co-citation: links documents cited together; composite judgement of hundreds of citers; dynamically changing
From: Dingel, K., Lohde, T. (2006). Cluster analysis based on co-citation data. Presentation in the KaW seminar. http://vasarely.wiwi.hu-berlin.de/lehre/2005s/kaw/Session3/Presentations/kaw_e_p1_050627.ppt
Source: Small (1973), Pitkow & Pirolli (1997), White & Griffith (1981)
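To make the distinction concrete, here is a minimal Python sketch on a made-up citation table (the paper names are illustrative). Co-citation counts how many papers cite two documents together, while bibliographic coupling counts how many references two citing papers share.

```python
from collections import Counter
from itertools import combinations

# citing paper -> set of documents it cites (illustrative data)
cites = {
    "P1": {"A", "B"},
    "P2": {"A", "B", "C"},
    "P3": {"B", "C"},
}


def cocitation(cites):
    """Count, for each pair of cited documents, the papers that cite both."""
    counts = Counter()
    for refs in cites.values():
        for pair in combinations(sorted(refs), 2):
            counts[pair] += 1
    return counts


def coupling(cites):
    """Count, for each pair of citing papers, the references they share."""
    return {
        (p, q): len(cites[p] & cites[q])
        for p, q in combinations(sorted(cites), 2)
    }
```

Adding a new citing paper can change the co-citation counts of A, B, and C at any time, while the coupling between P1 and P2 is fixed by their reference lists. This mirrors the "dynamically changing" versus "fixed once documents are written" contrast above.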
24An example: The CiteSeer archive Web site
25Google Scholar
26Additional reading: Co-citation and bibliographic coupling in Web link analysis
- Dean, J., and Henzinger, M. R. (1999). Finding related pages in the World Wide Web. In Proceedings of WWW-8, the Eighth International World Wide Web Conference. http://citeseer.ist.psu.edu/dean99finding.html
28Outlook: social network analysis
- Bibliometrics and link mining have their roots in a much older area: social network analysis
- Direct transfer of the link analysis methods we have found: find "opinion leaders" in ciao.de and similar sites
- → see also viral marketing
- Others: analyse communication patterns, prestige, power, ...
29Social networks example
31- Additional slides on Web Communities
- based on
- Flake, G.W., Lawrence, S., Giles, C.L. (2000).
Efficient identification of web communities. In
Proc. KDD 2000.
32Web communities
- A community is a collection of web pages created by individuals or any kind of associations that have a common interest in a specific topic, such as fan pages of a baseball team, or official pages of PC vendors. Blog communities are another example.
- Formally: Flake et al.'s definition of an ideal community
- "A Pokemon web site is a site that links to or is linked by more Pokemon sites than non-Pokemon sites."
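The definition can be checked directly on a small link graph. The following Python sketch takes the wording above literally: for every page in the candidate set, it counts the distinct pages it links to or is linked by, and requires strictly more of them inside the set than outside. The function names and the toy data in the usage note are assumptions for illustration.

```python
def neighbours(page, edges):
    """Pages that `page` links to or is linked by (self-links are ignored)."""
    return {v if u == page else u for u, v in edges if page in (u, v) and u != v}


def is_ideal_community(members, edges):
    """True if every member has more neighbours inside the set than outside it."""
    members = set(members)
    return all(
        len(neighbours(p, edges) & members) > len(neighbours(p, edges) - members)
        for p in members
    )
```

For example, with the links p1 -> p2, p2 -> p3, p3 -> p1, p3 -> x, and x -> y, the set {p1, p2, p3} satisfies the definition, while {p3, x} does not, because p3 has two neighbours outside that set and only one inside.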
33An example, and how to find communities
34An example, and how to find communities
35Approximate communities
- To apply this nice theorem, we would need to have the whole Web on our hard disk!
- Realistically, we crawl a part of the Web, starting with some pages that are in the community we are interested in.
- Questions
- What is crawling?
- What pages are retrieved during this crawl?
- What other assumptions have to be made?
36Crawling
- Archives are not always given
- Crawling: techniques for assembling archives from the Web
- Simple: Unix command-line utility wget
- Sophisticated: WIRE (contains analysis)
- Crawling contains graph search
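The last point can be illustrated with a short Python sketch: a crawl is a breadth-first search over the link graph, started from seed pages and cut off at a fixed depth. To keep the example self-contained and runnable offline, the function that returns the out-links of a page (`get_links`) is passed in as a parameter. It is a stand-in for real page fetching and link extraction, not part of wget or WIRE.

```python
from collections import deque


def crawl(seeds, get_links, max_depth=2):
    """Breadth-first crawl; returns (page -> depth found, list of links seen)."""
    depth = {s: 0 for s in seeds}
    queue = deque(seeds)
    edges = []
    while queue:
        page = queue.popleft()
        if depth[page] >= max_depth:
            continue  # pages at the depth limit are recorded but not expanded
        for target in get_links(page):
            edges.append((page, target))
            if target not in depth:  # visit every page at most once
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth, edges
```

A usage example on a toy "Web" stored in a dict: `crawl(["s"], lambda p: toy_web.get(p, []), max_depth=2)`, where `toy_web` maps each page to the list of pages it links to. The pages retrieved are exactly those within `max_depth` links of a seed.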
37Focused community crawling
39What is the virtual sink (= the site that is definitely not in the community)?
- In the ideal version
- In the approximate version, use an artificial virtual sink (a theorem ensures correctness even if this is not really at the center of the graph)
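The following Python sketch illustrates the general idea of cutting a community out of a crawled graph with an artificial sink. It is a simplified illustration, not Flake et al.'s exact procedure. A virtual source is tied to the seed pages, every other page is tied to a virtual sink, a maximum flow is computed (here with breadth-first augmenting paths), and the pages still reachable from the source afterwards are returned. The capacity choices (k per link, 1 per sink edge, k defaulting to the number of seeds) and the names `SOURCE` and `SINK` are assumptions made for this sketch.

```python
from collections import defaultdict, deque

SOURCE, SINK = "SOURCE", "SINK"  # artificial nodes; assumed not to clash with page names


def community_by_min_cut(edges, seeds, k=None):
    """Return the pages on the source side of a minimum SOURCE-SINK cut."""
    seeds = set(seeds)
    k = len(seeds) if k is None else k
    pages = {n for e in edges for n in e}
    cap = defaultdict(dict)  # residual capacities
    for u, v in edges:  # hyperlinks are treated as undirected
        if u != v:
            cap[u][v] = cap[v][u] = k
    for s in seeds:  # virtual source feeds the seeds
        cap[SOURCE][s] = float("inf")
        cap[s].setdefault(SOURCE, 0)
    for p in pages - seeds:  # every non-seed page drains to the sink
        cap[p][SINK] = 1
        cap[SINK].setdefault(p, 0)

    def reachable():
        """BFS over edges with remaining capacity; returns node -> predecessor."""
        pred = {SOURCE: None}
        queue = deque([SOURCE])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in pred:
                    pred[v] = u
                    queue.append(v)
        return pred

    while True:
        pred = reachable()
        if SINK not in pred:  # no augmenting path left: flow is maximal
            return set(pred) - {SOURCE}
        path, v = [], SINK
        while pred[v] is not None:
            path.append((pred[v], v))
            v = pred[v]
        flow = min(cap[u][w] for u, w in path)
        for u, w in path:  # push flow along the path, update residuals
            cap[u][w] -= flow
            cap[w][u] += flow
```

On two densely linked groups of four pages each that are joined by a single link, seeding with two pages of the first group returns exactly that first group, because cutting the single bridging link is cheaper than cutting through either group.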
40Repetition for a better result