Title: Knowledge Extraction from the Web
1. Knowledge Extraction from the Web
Monika Henzinger, Steve Lawrence
2. Outline
- Hyperlink analysis in web IR
- Sampling the web
- Web pages
- Web hosts
- Web graph models
- Focused crawling
- Finding communities
3. Hyperlink analysis in web information retrieval
4. Graph structure of the web
- Web graph
- Each web page is a node
- Each hyperlink is a directed edge
- Host graph
- Each host is a node
- If there are k links from host A to host B, there
is an edge with weight k from A to B.
5. Hyperlink analysis in Web IR
- Idea: Mine the structure of the web graph to improve search results
- Related work
- Classic IR work (citations = links), a.k.a. bibliometrics [K63, G72, S73]
- Sociometrics [K53, MMSM86]
- Many web-related papers use this approach [PPR96, AMM97, S97, CK97, K98, BP98]
6. Google's approach
- Assumption: A link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
- Quality of a page is related to its in-degree
- Recursion: Quality of a page is related to
- its in-degree, and to
- the quality of pages linking to it
- PageRank [BP98]
7. Definition of PageRank
- Consider the following infinite random walk (surf):
- Initially the surfer is at a random page
- At each step, the surfer proceeds
- to a randomly chosen web page with probability d
- to a randomly chosen successor of the current page with probability 1-d
- The PageRank of a page p is the fraction of steps the surfer spends at p in the limit.
8. PageRank (cont.)
- By the previous theorem, PageRank is the stationary probability of this Markov chain, i.e.
- PR(p) = d/n + (1-d) · Σ_{(q,p) ∈ E} PR(q) / outdegree(q)
- where n is the total number of nodes in the graph
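A minimal sketch (not the speakers' implementation) of computing PageRank by power iteration, following the slides' convention that d is the random-jump probability. The toy graph and the default d are illustrative.

```python
# Minimal PageRank power-iteration sketch (illustrative only).
# Slide convention: with probability d the surfer jumps to a random page,
# with probability 1-d it follows a random outlink of the current page.

def pagerank(graph, d=0.15, iterations=50):
    """graph: dict mapping each node to a list of its successors."""
    nodes = list(graph)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}            # start from the uniform distribution
    for _ in range(iterations):
        new = {v: d / n for v in nodes}          # random-jump mass
        for v, succs in graph.items():
            if succs:                            # spread (1-d)*PR(v) over successors
                share = (1 - d) * pr[v] / len(succs)
                for u in succs:
                    new[u] += share
            else:                                # dangling page: treat as linking to all pages
                for u in nodes:
                    new[u] += (1 - d) * pr[v] / n
        pr = new
    return pr

if __name__ == "__main__":
    toy = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(pagerank(toy))
```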
9. Neighborhood graph
- Subgraph associated to each query
- [Figure: the start set of query results (Result1 ... Resultn), the back set (b1 ... bm) of pages linking into it, and the forward set (f1 ... fs) of pages it links to]
- An edge for each hyperlink, but no edges within the same host
10. HITS [Kleinberg 98]
- Goal: Given a query, find
- Good sources of content (authorities)
- Good sources of links (hubs)
11. HITS details
- Repeat until HUB and AUTH converge
- Normalize HUB and AUTH
- HUB[v] = Σ AUTH[u_i] for all u_i with Edge(v, u_i)
- AUTH[v] = Σ HUB[w_i] for all w_i with Edge(w_i, v)
- [Figure: page v with in-neighbors w1 ... wk contributing to its AUTH score and out-neighbors u1 ... uk contributing to its HUB score]
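A compact sketch of the HITS iteration on the neighborhood graph's edge list; convergence testing is replaced here by a fixed iteration count, and the neighborhood-graph construction is assumed to have been done already.

```python
# Sketch of the HITS update from the slide: authorities are scored by the hubs
# that point to them, hubs by the authorities they point to; normalize each round.
import math

def hits(edges, iterations=50):
    """edges: list of (source, target) pairs in the neighborhood graph."""
    nodes = {v for e in edges for v in e}
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # AUTH[v] = sum of HUB[w] over edges (w, v)
        auth = {v: 0.0 for v in nodes}
        for w, v in edges:
            auth[v] += hub[w]
        # HUB[v] = sum of AUTH[u] over edges (v, u)
        hub = {v: 0.0 for v in nodes}
        for v, u in edges:
            hub[v] += auth[u]
        # normalize so the scores stay bounded
        for vec in (auth, hub):
            norm = math.sqrt(sum(x * x for x in vec.values())) or 1.0
            for v in vec:
                vec[v] /= norm
    return hub, auth
```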
12. PageRank vs. HITS
- PageRank
- Computation: once for all documents and queries (offline)
- Query-independent; requires combination with query-dependent criteria
- Hard to spam
- HITS
- Computation: required for each query
- Query-dependent
- Relatively easy to spam
- Quality depends on quality of the start set
- Gives hubs as well as authorities
13. PageRank vs. HITS (stability)
- PageRank
- [Lempel] Not rank-stable: O(1) changes in the graph can change O(N^2) order-relations
- [Ng, Zheng, Jordan 01] Value-stable: changing k nodes (with PageRank values p_1, ..., p_k) results in a new vector p' whose distance from p is bounded in terms of p_1 + ... + p_k
- HITS
- Not rank-stable
- Value-stability depends on the gap g between the largest and second largest eigenvalues of A^T A: a change of O(g) in A^T A results in a new vector p' close to p
14. Random sampling of web pages
15. Random sampling of web pages
- Useful for estimating
- Web properties: percentage of pages in a domain, in a language, on a topic; in-degree distribution
- Search engine comparison: percentage of pages in a search engine index (index size)
16. Let's do the random walk!
- Perform a PageRank random walk
- Select a uniform random sample from the resulting pages
- Can't jump to a random page; instead, jump to a random page on a random host seen so far.
- Problems
- Starting-state bias: a finite walk only approximates PageRank.
- Quality-biased sample of the web
17. Most frequently visited pages
18. Most frequently visited hosts
19. Sampling pages nearly uniformly
- Perform a PageRank random walk
- Sample pages from the walk s.t. P(p is sampled) ∝ 1 / PageRank(p)
- Don't know PageRank(p); estimate it by
- PR: PageRank computed on the crawled graph
- VR: visit ratio on the crawled graph
- Result: nearly uniform sample of the web
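A hedged sketch of the resampling step, using the visit ratio (VR) as the PageRank estimate: pages seen on the walk are accepted with probability inversely proportional to how often they were visited, so the accepted set is approximately uniform over the visited pages. The function name and the rejection-sampling formulation are illustrative.

```python
# Illustrative sketch: down-weight frequently visited pages so that the
# accepted pages form a near-uniform sample of the pages seen by the walk.
import random
from collections import Counter

def near_uniform_sample(walk, sample_size):
    """walk: list of visited URLs in order; returns an approximately uniform sample."""
    visits = Counter(walk)
    total = len(walk)
    min_ratio = min(visits.values()) / total          # smallest visit ratio observed
    sample = []
    while len(sample) < sample_size:
        p = random.choice(walk)                       # drawn with probability ∝ visit ratio
        visit_ratio = visits[p] / total
        if random.random() < min_ratio / visit_ratio: # accept with probability ∝ 1 / visit ratio
            sample.append(p)
    return sample
```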
20. Sampling pages nearly uniformly
- Why the sample is nearly uniform
- Recall the definition of PageRank as the stationary distribution of the walk
- A page is well-connected if it can be reached from almost every other page by short paths (O(n^(1/2)) steps)
- For short paths in a well-connected graph, the probability of visiting a page is roughly proportional to its PageRank, so weighting by 1/PageRank gives a nearly uniform sample
21. Sampling pages nearly uniformly
- Problems
- Starting-state bias: a finite walk only approximates PageRank.
- Dependence between visits, especially in short cycles
22. Synthetic graphs: in-degree
23. Synthetic graphs: PageRank
24. Experiments on the real web
- Performed 3 random walks in Nov. 1999 (starting from 10,258 seed URLs)
- Small overlap between walks; the walks disperse well (82% of URLs visited by only 1 walk)

Walk   Visited URLs   Unique URLs
1      2,702,939      990,251
2      2,507,004      921,114
3      5,006,745      1,655,799
25. Percentage of pages in domains
26. Estimating search engine index size
- Choose a sample of pages p_1, p_2, ..., p_n according to a near-uniform distribution
- Check if the pages are in search engine index S [BB98]
- Exact match
- Host match
- The estimate for the relative size of index S is the fraction of sampled pages that are in S, i.e. (1/n) Σ_{j=1}^{n} I[p_j ∈ S], where I[p_j ∈ S] = 1 if p_j is in S and 0 otherwise
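A minimal sketch of the estimator above; `in_index` stands for the exact-match/host-match test against engine S and is an assumed placeholder, not a real API.

```python
# Sketch: the fraction of near-uniformly sampled pages found in search engine S
# estimates the fraction of the web covered by S.

def coverage_estimate(sample_pages, in_index):
    """sample_pages: near-uniform sample p_1..p_n; in_index(p) -> True if p is in S."""
    n = len(sample_pages)
    hits = sum(1 for p in sample_pages if in_index(p))  # sum of indicators I[p_j in S]
    return hits / n                                      # estimate of |S| / |web|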
27. Result set for index size (fall 99)
28. Random sampling of sites
29. Publicly indexable web
- We analyzed the publicly indexable web
- Excludes pages that are not indexed by the major search engines due to
- Authentication requirements
- Pages hidden behind search forms
- The robots exclusion standard
30. Random sampling of sites
- Randomly sample IP addresses (256^4, or about 4.3 billion)
- Test for a web server at the standard port
- Many machines and network connections are temporarily unavailable: recheck all addresses after one week
- Many sites serve the same content on multiple IP addresses for load balancing or redundancy: use DNS and only count one address in the publicly indexable web
- Many servers are not part of the publicly indexable web
- Authorization requirements, default pages, "sites coming soon", web-hosting companies that present their homepage on many IP addresses, printers, routers, proxies, mail servers, etc.
- Use regular expressions to find a majority; manual inspection for the rest
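A hedged sketch of the first two steps (random IP addresses, check for a server on port 80) using only the standard library. Rechecking after a week, DNS de-duplication, and the filtering of non-website servers described above are omitted; the function names and trial count are illustrative.

```python
# Illustrative sketch: sample random 32-bit IP addresses and test whether
# something answers on the standard HTTP port.
import random
import socket

def random_ip():
    return ".".join(str(random.randint(0, 255)) for _ in range(4))

def has_web_server(ip, port=80, timeout=5.0):
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:          # refused, unreachable, or timed out
        return False

def sample_web_servers(trials=1000):
    return [ip for ip in (random_ip() for _ in range(trials)) if has_web_server(ip)]
```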
31. Feb. 99 results
- Manually classified 2,500 random web servers
- 83% of sites are commercial
- The percentage of sites in areas like science, health, and government is relatively small
- It would be feasible and very valuable to create specialized services that are very comprehensive and up to date
- 65% of sites have a majority of pages in English
32. Metadata analysis
- Analyzed simple HTML meta tag usage on the homepages of the 2,500 random servers
- 34% of sites had description or keywords tags
- Low usage of this simple standard suggests that acceptance and widespread use of more complex standards like XML and Dublin Core may be very slow
- 0.3% of sites contained Dublin Core tags
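A small sketch, assuming the homepage HTML has already been fetched, of checking for the simple description/keywords meta tags counted above; it uses only the standard-library HTML parser.

```python
# Illustrative check for <meta name="description"> / <meta name="keywords"> tags.
from html.parser import HTMLParser

class MetaTagCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            name = (dict(attrs).get("name") or "").lower()
            if name in ("description", "keywords"):
                self.found.add(name)

def has_simple_metadata(html_text):
    """Return True if the page declares a description or keywords meta tag."""
    parser = MetaTagCounter()
    parser.feed(html_text)
    return bool(parser.found)
```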
33. Web graph models
34. Inverse power laws on the web
- The fraction of pages with k in-links is approximately proportional to k^(-2.1)
35. Properties with inverse power law
- indegree of web pages
- outdegree of web pages
- indegree of web pages, off-site links only
- outdegree of web pages, off-site links only
- size of weakly connected components
- size of strongly connected components
- indegree of hosts
- outdegree of hosts
- number of hyperlinks between host pairs
- PageRank
36. Category-specific web
- All US company homepages
- Histogram with exponentially increasing bucket sizes (constant size on a log scale)
- Strong deviation from a pure power law
- Unimodal body, power-law tail
37. Web graph model [BA99]
- Preferential attachment model
- Start with a small set of nodes
- At each timestep
- add 1 node v and
- m edges incident to v s.t. for each new edge, P(other endpoint is node u) ∝ in-degree(u)
- Theorem: P(page has k in-links) ∝ k^(-3)
38. Combining preferential and uniform attachment
- Extension of the preferential attachment model
- Start with a small set of nodes
- At timestep t
- add 1 node v and
- m edges s.t. for each new edge, P(node u is the endpoint) is a mixture: proportional to in-degree(u) with probability α, uniform over existing nodes with probability 1-α
- Theorem: P(page has k in-links) follows a power law with a drop-off at small k, with the shape depending on α
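An illustrative generator for the mixture model described above; the parameters m, alpha, and seed_nodes are assumptions for the sketch, and with alpha = 1 it reduces to the pure preferential-attachment model of the previous slide.

```python
# Illustrative growth sketch: each new edge picks its destination preferentially
# (∝ in-degree, by sampling the target of a random existing edge) with
# probability alpha, and uniformly at random with probability 1 - alpha.
import random

def grow_graph(steps, m=5, alpha=0.9, seed_nodes=3):
    indeg = {v: 0 for v in range(seed_nodes)}
    edges = []
    for _ in range(steps):
        v = len(indeg)                         # the newly added node
        indeg[v] = 0
        existing = [u for u in indeg if u != v]
        for _ in range(m):
            if random.random() < alpha and edges:
                # preferential choice: target of a random existing edge (∝ in-degree)
                u = random.choice(edges)[1]
            else:
                u = random.choice(existing)    # uniform choice over existing nodes
            edges.append((v, u))
            indeg[u] += 1
    return edges, indeg
```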
39. Preferential vs. uniform attachment
- The estimated α is above 0.5 for every dataset studied
- Preferential attachment plays a greater role in web link growth than uniform attachment
- Distribution of links to companies and newspapers is close to a power law
- Distribution of links to universities and scientists is closer to uniform
- a more balanced mixture of preferential and uniform attachment

Dataset        α (fraction of preferential attachment)
Companies      0.95
Newspapers     0.95
Web inlinks    0.91
Universities   0.61
Scientists     0.60
Web outlinks   0.58
40. E-commerce categories
41. Other networks
- Most social/biological networks exhibit a drop-off from power-law scaling at small k
- Actor collaborations, paper citations, the US power grid, global web outlinks, web file sizes
42. Graph model summary
- Previous research: power-law distribution of inlinks ("winners take all")
- Only an approximation; it hides important details
- The distribution varies across categories and may be much less biased
- The new model accurately accounts for the distribution of category-specific pages, the web as a whole, and other social networks
- May be used to predict the degree of "winners take all" behavior
43. Copy model [KKRRT99]
- At each timestep, add a new node u with fixed outdegree d.
- The destinations of these links are chosen as follows:
- Choose an existing node v uniformly at random.
- For j = 1, ..., d, the j-th link of u points to a random existing node with probability α and to the destination of v's j-th link with probability 1-α.
- Models the power law as well as the large number of small bipartite cliques.
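An illustrative generator for the copy model described above; the seed initialization and the default values of d and alpha are assumptions for the sketch.

```python
# Illustrative copy-model sketch: each new node gets d outlinks, each of which
# either points to a uniformly random existing node (probability alpha) or
# copies the corresponding link of a uniformly chosen prototype node (1 - alpha).
import random

def copy_model(steps, d=7, alpha=0.5, seed_nodes=8):
    # seed nodes get d random outlinks so that copying is well defined
    out = {v: [random.randrange(seed_nodes) for _ in range(d)] for v in range(seed_nodes)}
    for _ in range(steps):
        u = len(out)
        prototype = random.randrange(u)            # existing node chosen uniformly
        links = []
        for j in range(d):
            if random.random() < alpha:
                links.append(random.randrange(u))  # uniformly random existing node
            else:
                links.append(out[prototype][j])    # copy the prototype's j-th link
        out[u] = links
    return out
```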
44. Relink model
- The host graph exhibits a drop-off from power-law scaling at small k ⇒ relink model
- With a fixed probability, select a random existing node u; otherwise, create a new node u. Add d edges to u.
- The destinations of these links are chosen as follows:
- Choose an existing node v uniformly at random and choose d random edges with source v.
- Determine the destinations as in the copy model.
45. Relink model
46. Linkage between domains
Percentage of each domain's outgoing links that point to .com, to the same domain (Self), and to its next four most common destination domains:

Domain   com    Self   1         2         3         4
com      82.9   82.9   net 6.5   org 2.6   jp 0.8    uk 0.7
cn       15.8   74.1   tw 0.4    jp 0.2    de 0.2    hk 0.1
jp       17.4   74.5   to 0.8    cn 0.6    uk 0.2    de 0.1
tw       22.0   66.0   to 1.3    au 0.6    jp 0.6    ch 0.4
ca       19.4   65.2   uk 0.6    fr 0.4    se 0.3    de 0.3
de       16.0   71.2   uk 0.8    ch 0.6    at 0.5    nl 0.2
br       17.8   69.1   uk 0.4    pt 0.4    de 0.4    ar 0.2
fr       20.9   61.9   ch 0.9    de 0.8    uk 0.7    ca 0.5
uk       34.2   33.1   de 0.6    ca 0.5    jp 0.3    se 0.3
47. Finding communities
48. Finding communities
- Identifying communities is valuable for
- Focused search engines
- Web directory creation
- Content filtering
- Analysis of communities and relationships
49. Recursive communities
- Several methods have been proposed
- One link-based definition:
- A community consists of members that have more
links within the community than outside of the
community
50. s-t Maximum flow
- Definition: given a directed graph G(V, E) with edge capacities c(u,v) ≥ 0 and two vertices s, t ∈ V, find the maximum flow that can be routed from the source s to the sink t.
- Intuition: think of water pipes
- Note: maximum flow = minimum cut
- Maximum flow yields communities
51. Maximum flow communities
- If the source is in the community, the sink is outside of the community, and the degrees of the source and sink exceed the cut size, then maximum flow identifies the entire community.
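A sketch of the flow-based community idea using networkx's minimum_cut. The virtual source/sink construction, the capacity choices, and the function name are assumptions made for the sketch; building the link graph from a crawl of the induced neighborhood is assumed to have been done already.

```python
# Illustrative sketch: attach a virtual source to the seed pages, a virtual sink
# to all other pages, give hyperlinks small capacities, and take the source side
# of the minimum s-t cut as the community.
import networkx as nx

def flow_community(link_graph, seeds, k=1):
    """link_graph: nx.DiGraph of crawled pages; seeds: iterable of seed URLs."""
    seeds = set(seeds)
    g = nx.DiGraph()
    for a, b in link_graph.edges():
        g.add_edge(a, b, capacity=k)   # each hyperlink gets capacity k
        g.add_edge(b, a, capacity=k)   # treat links as bidirectional for the flow
    for s in seeds:
        g.add_edge("_source", s)       # no capacity attribute = infinite capacity
    for v in link_graph.nodes():
        if v not in seeds:
            g.add_edge(v, "_sink", capacity=1)
    cut_value, (source_side, _) = nx.minimum_cut(g, "_source", "_sink")
    return source_side - {"_source"}
```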
52. Maximum flow communities
53. Maximum flow communities
54. SVM web community
- Seed set consisted of
- http://svm.first.gmd.de/
- http://svm.research.bell-labs.com/
- http://www.clrc.rhbnc.ac.uk/research/SVM/
- http://www.support-vector.net/
- Four EM iterations used
- Only external links considered
- Induced graph contained over 11,000 URLs
- Identified community contained 252 URLs
55. Top ranked SVM pages
- Vladimir Vapnik's home page (inventor of SVMs)
- Home page of SVM light, a popular software package
- A hub site of SVM links
- Text categorization corpus
- SVM application list
- John Platt's SVM page (inventor of SMO)
- Research interests of Mario Marchand (SVM researcher)
- SVM workshop home page
- GMD First SVM publication list
- Book: Advances in Kernel Methods - SVM Learning
- B. Schölkopf's SVM page
- GMD First hub page of SVM researchers
- Y. Li's links to SVM pages
- NIPS SVM workshop abstract page
- GMD First SVM links
- Learning System Group of ANU
- NIPS98 workshop on large margin classifiers
- Control theory seminar (with links to SVM material)
- ISIS SVM page
- Jonathan Howell's home page
56. Lowest ranked SVM pages
- Ten web pages tied for the lowest score. All were personal home pages of scientists who had at least one SVM publication.
- Other results contained researchers, students, software, books, conferences, workshops, etc.
- A few false positives: NN and data mining pages.
57. Ronald Rivest community summary
- One seed: http://theory.lcs.mit.edu/rivest
- Four EM iterations used
- First EM iteration used internal links
- Induced graph contained more than 38,000 URLs
- Identified community contained 150 URLs
58. Ronald Rivest top ranked pages
- Thomas H. Cormen's home page
- The Mathematical Guts of RSA Encryption
- Charles E. Leiserson's home page
- Famous people in the history of Cryptography
- Cryptography sites
- Massachusetts Institute of Technology
- General cryptography links
- Spektrum der Wissenschaft - Kryptographie (cryptography)
- Issues in Securing Electronic Commerce over the Internet
- A course based on "Introduction to Algorithms"
- Recommended Literature for Self-Study
- Resume of Aske Plaat
- German article on the who's who of the WWW
- People Ulrik knows
- A course that uses "Introduction to Algorithms"
- Bibliography on algorithms
- An article on encryption
- German computer science institute
- Security links
- International PGP FAQ
59. Ronald Rivest lowest ranked pages
- 23 URLs tied for the lowest rank
- All 23 were personally related to Ronald Rivest or his research
- 11 of the 23 were bibliographies of Rivest's publications
60. Rivest community n-grams
61. Rivest community rules
62. Web communities summary
- Approximate method gives promising results
- Exact method should be practical as well
- Both methods can be easily generalized
- Applications are numerous and exciting
- Building a better web directory
- Focused search engines
- Filtering undesirable content
- Complements text-based methods
63. Focused crawling
64. Focused crawling
- Analyzing the web graph can help locate pages on a specific topic
- A typical crawler considers only the links on the current page
- A graph-based focused crawler learns the context of the web graph where relevant pages appear
- Significant performance improvements
66CiteSeer
67CiteSeer
- Digital library for scientific literature
- Aims to improve communication and progress in
science - Autonomous Citation Indexing, citation context
extraction, distributed error correction,
citation graph analysis, etc. - Helps researchers obtain a better perspective and
overview of the literature with citation context
and new methods of locating related research - Lower cost, wider availability, more up-to-date
than competing citation indexing services - Faster, easier, and more complete access to the
literature can speed research, better direct
research activities, and minimize duplication of
effort
68. CiteSeer
- 575,000 documents
- 6 million citations
- 500,000 daily requests
- 50,000 daily users
- Data for research available on request
- feedback@researchindex.org
69. Distribution of articles
[Figure: distribution of articles, SCI vs. ResearchIndex]
70. Citations over time
71. Citations over time
- Conference papers and technical reports play a very important role in computer science research
- Citations to very recent research are dominated by these types of articles
- When recent journal papers are cited, they are typically "in press" or "to appear"
- The most cited items tend to be journal articles and books
- Conference and technical report citations tend to be replaced with journal and book citations over time
- May not be a one-to-one mapping
72. Online or invisible?
73. Online or invisible?
- Analyzed 119,924 conference articles from DBLP
- Online articles are cited 4.5 times more than offline articles on average
- Are online articles more highly cited because
- they are easier to access and thus more visible, or
- because higher quality articles are more likely to be made available online?
- Within venues, online articles are cited 4.4 times more on average
- Similar when restricted to top-tier conferences
74. Persistence of URLs
- Analyzed URLs referenced within articles in CiteSeer
- URLs per article are increasing
- Many URLs are now invalid
- 1999: 23%
- 1994: 53%
75. Persistence of URLs
- The 2nd searcher found 80% of the URLs the 1st searcher could not find
- Only 3% of URLs could not be found after the 2nd searcher
76. How important are the lost URLs?
- With respect to the ability of future research to verify and/or build on the given paper
- [Figure: importance ratings of the lost URLs, after the 1st searcher and after the 2nd searcher]
77. Persistence of URLs
- Many URLs are now invalid
- The information can often be relocated
- No evidence yet that information very important to future research has been lost
- Citation practices suggest more information will be lost in the future unless these practices are improved
- A widespread and easy-to-use web with invalid links may be more useful than an improved system without invalid links but with added complexity or overhead
78. Extracting knowledge from the web
- Unprecedented opportunity for automated analysis of a large sample of interests and activity in the world
- Many methods for extracting knowledge from the web
- Random sampling and analysis of pages and hosts
- Analysis of link structure and link growth
79. Extracting knowledge from the web
- A variety of information can be extracted
- Distribution of interest and activity in different areas
- Communities related to different topics
- Competition in different areas
- Communication between different communities
80. Collaborators
- Web communities: Gary Flake, Lee Giles, Frans Coetzee
- Link growth modeling: David Pennock, Gary Flake, Lee Giles, Eric Glover
- Hostgraph modeling: Krishna Bharat, Bay-Wei Chang, Matthias Ruhl
- Web page sampling: Allan Heydon, Michael Mitzenmacher, Mark Najork
- Host sampling: Lee Giles
- CiteSeer: Kurt Bollacker, Lee Giles
81. More information
- http://www.henzinger.com/monika/
- http://www.neci.nec.com/lawrence/
- http://citeseer.org/