1
Knowledge Extraction from the Web
Monika Henzinger Steve Lawrence
2
Outline
  • Hyperlink analysis in web IR
  • Sampling the web
  • Web pages
  • Web hosts
  • Web graph models
  • Focused crawling
  • Finding communities

3
Hyperlink analysis in web information retrieval
4
Graph structure of the web
  • Web graph
  • Each web page is a node
  • Each hyperlink is a directed edge
  • Host graph
  • Each host is a node
  • If there are k links from host A to host B, there
    is an edge with weight k from A to B.
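
The host graph definition above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `host_graph` and the example URLs are hypothetical, and dropping links that stay within one host is an assumption the slide does not state.

```python
from collections import Counter
from urllib.parse import urlsplit

def host_graph(page_links):
    """Collapse page-level links into weighted host-level edges.

    page_links: iterable of (source_url, destination_url) pairs.
    Returns {(host_a, host_b): k}, where k counts links from A to B.
    """
    weights = Counter()
    for src, dst in page_links:
        a = urlsplit(src).hostname
        b = urlsplit(dst).hostname
        # Assumption: skip malformed URLs and links within a single host.
        if a is None or b is None or a == b:
            continue
        weights[(a, b)] += 1
    return dict(weights)

links = [
    ("http://a.example/x", "http://b.example/1"),
    ("http://a.example/y", "http://b.example/2"),
    ("http://a.example/x", "http://a.example/y"),
    ("http://b.example/1", "http://c.example/"),
]
graph = host_graph(links)
print(graph)
```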

5
Hyperlink analysis in Web IR
  • Idea: Mine the structure of the web graph to
    improve search results
  • Related work
  • Classic IR work (citations = links), a.k.a.
    bibliometrics [K63, G72, S73, ...]
  • Sociometrics [K53, MMSM86, ...]
  • Many Web-related papers use this approach
    [PPR96, AMM97, S97, CK97, K98, BP98, ...]

6
Google's approach
  • Assumption: A link from page A to page B is a
    recommendation of page B by the author of A (we
    say B is a successor of A)
  • Quality of a page is related to its in-degree
  • Recursion: Quality of a page is related to
  • its in-degree, and to
  • the quality of pages linking to it
  • PageRank [BP98]

7
Definition of PageRank
  • Consider the following infinite random walk
    (surf)
  • Initially the surfer is at a random page
  • At each step, the surfer proceeds
  • to a randomly chosen web page with probability d
  • to a randomly chosen successor of the current
    page with probability 1-d
  • The PageRank of a page p is the fraction of steps
    the surfer spends at p in the limit.
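
The random surf above can be approximated numerically by power iteration on a small graph. This is a minimal sketch under stated assumptions: the value d = 0.15, the fixed iteration count, the toy graph, and the rule that a surfer at a page without successors jumps uniformly are all illustrative choices that do not come from the slides.

```python
def pagerank(successors, d=0.15, iterations=100):
    """Approximate the fraction of steps the random surfer spends on each page.

    successors: dict mapping every page to the list of pages it links to.
    d: probability of jumping to a randomly chosen page (as on the slide).
    """
    pages = list(successors)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: d / n for p in pages}  # mass arriving by random jumps
        for p in pages:
            out = successors[p]
            if out:
                share = (1 - d) * rank[p] / len(out)
                for q in out:
                    new[q] += share
            else:
                # Assumption: with no successor to follow, jump uniformly.
                share = (1 - d) * rank[p] / n
                for q in pages:
                    new[q] += share
        rank = new
    return rank

toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy)
print(sorted(ranks.items(), key=lambda kv: -kv[1]))
```

Page "d" has no in-links, so it is reached only by random jumps and its score settles at d/n.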

8
PageRank (cont.)
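  • By previous theorem
  • PageRank = stationary probability for this Markov
    chain, i.e.
    $$\mathrm{PR}(p) = \frac{d}{n} + (1-d) \sum_{(q,p) \in E} \frac{\mathrm{PR}(q)}{\mathrm{outdegree}(q)}$$
  • where n is the total number of nodes in the
    graph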

9
Neighborhood graph
  • Subgraph associated to each query

[Figure: the query results (Result1, Result2, ..., Resultn) form the start set, shown with the back set (b1, b2, ..., bm) and the forward set (f1, f2, ..., fs).]
An edge for each hyperlink, but no edges within
the same host
10
HITS [Kleinberg98]
  • Goal: Given a query, find
  • Good sources of content (authorities)
  • Good sources of links (hubs)

11
HITS details
  • Repeat until HUB and AUTH converge
  • Normalize HUB and AUTH
  • HUB[v] := Σ AUTH[ui] for all ui with
    Edge(v, ui)
  • AUTH[v] := Σ HUB[wi] for all wi with
    Edge(wi, v)

[Figure: node v with authority score A and hub score H; w1, w2, ..., wk link to v, and v links to u1, u2, ..., uk.]
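
The HUB and AUTH updates can be sketched as follows. This is an illustrative version rather than Kleinberg's reference implementation: it runs a fixed number of iterations instead of testing for convergence, normalizes to unit Euclidean length after each round, and uses a made-up edge list.

```python
import math

def hits(edges, iterations=50):
    """Iterate the HUB/AUTH updates on a list of directed edges (v, u)."""
    nodes = {x for edge in edges for x in edge}
    hub = {x: 1.0 for x in nodes}
    auth = {x: 1.0 for x in nodes}
    for _ in range(iterations):
        # AUTH[v] := sum of HUB[w] over all w with Edge(w, v)
        new_auth = {x: 0.0 for x in nodes}
        for w, v in edges:
            new_auth[v] += hub[w]
        # HUB[v] := sum of AUTH[u] over all u with Edge(v, u)
        new_hub = {x: 0.0 for x in nodes}
        for v, u in edges:
            new_hub[v] += new_auth[u]
        # Normalize both score vectors to unit length.
        for scores in (new_auth, new_hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for x in scores:
                scores[x] /= norm
        hub, auth = new_hub, new_auth
    return hub, auth

example_edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = hits(example_edges)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```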
12
PageRank vs. HITS
  • PageRank
  • Computation
  • Once for all documents and queries (offline)
  • Query-independent: requires combination with
    query-dependent criteria
  • Hard to spam
  • HITS
  • Computation
  • Requires computation for each query
  • Query-dependent
  • Relatively easy to spam
  • Quality depends on quality of start set
  • Gives hubs as well as authorities

13
PageRank vs. HITS
  • PageRank
  • [Lempel] Not rank-stable: O(1) changes in the graph
    can change O(N^2) order relations
  • [Ng, Zheng, Jordan01] Value-stable: a change in k
    nodes (with PR values p1, ..., pk) results in p s.t.
  • HITS
  • Not rank-stable
  • Value-stability depends on the gap g between the
    largest and second-largest eigenvalues of A^T A:
    a change of O(g) in A^T A results in p s.t.

14
Random sampling of web pages
15
Random sampling of web pages
  • Useful for estimating
  • Web properties: percentage of pages in a domain,
    in a language, on a topic, in-degree distribution
  • Search engine comparison: percentage of pages in
    a search engine index (index size)

16
Let's do the random walk!
  • Perform PageRank random walk
  • Select uniform random sample from resulting
    pages
  • Can't jump to a random page; instead, jump to a
    random page on a random host seen so far.
  • Problem
  • Starting-state bias: a finite walk only
    approximates PageRank.
  • Quality-biased sample of the web

17
Most frequently visited pages
18
Most frequently visited hosts
19
Sampling pages nearly uniformly
  • Perform PageRank random walk
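  • Sample pages from walk s.t.
    $$P(p \text{ is sampled} \mid p \text{ is visited}) \propto \frac{1}{\mathrm{PageRank}(p)}$$
  • Don't know PageRank(p); estimate it with
  • PR: PageRank computation of crawled graph
  • VR: VisitRatio on crawled graph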
  • Nearly uniform sample of the web
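
The idea of correcting a PageRank-biased walk can be illustrated on a toy graph. This is a sketch under stated assumptions, not the experiment's code: every visit is accepted with probability inversely proportional to the page's visit ratio on the walk itself, the walk may jump to any page of the toy graph (on the real web the jump goes to a page on a host seen so far), and the graph, d = 0.15, and the function names are hypothetical.

```python
import random
from collections import Counter

def random_walk(successors, steps, d=0.15, rng=random):
    """Return the sequence of pages visited by a PageRank-style walk."""
    pages = list(successors)
    current = rng.choice(pages)
    walk = []
    for _ in range(steps):
        walk.append(current)
        out = successors[current]
        if not out or rng.random() < d:
            current = rng.choice(pages)  # random jump
        else:
            current = rng.choice(out)    # follow a random successor
    return walk

def sample_inverse_to_visit_ratio(walk, rng=random):
    """Keep each visit with probability proportional to 1 / VisitRatio(page)."""
    visits = Counter(walk)
    ratio = {p: c / len(walk) for p, c in visits.items()}
    floor = min(ratio.values())  # scales the acceptance probability to at most 1
    return [p for p in walk if rng.random() < floor / ratio[p]]

toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rng = random.Random(0)
walk = random_walk(toy, 20000, rng=rng)
sample = sample_inverse_to_visit_ratio(walk, rng=rng)
print("visits:", Counter(walk).most_common())
print("sample:", Counter(sample).most_common())
```

The visit counts are strongly skewed toward well-linked pages, while the accepted sample contains each page a roughly equal number of times.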

20
Sampling pages nearly uniformly
  • Nearly uniform sample
  • Recall
  • A page is well-connected if it can be reached by
    almost every other page by short paths (O(n^{1/2})
    steps)
  • For short paths in a well-connected graph

21
Sampling pages nearly uniformly
  • Problems
  • Starting-state bias: a finite walk only
    approximates PageRank.
  • Dependence, especially in short cycles

22
Synthetic graphs: in-degree
23
Synthetic graphs: PageRank
24
Experiments on the real web
  • Performed 3 random walks in Nov 1999 (starting
    from 10,258 seed URLs)
  • Small overlap between walks: walks disperse well
    (82% visited by only 1 walk)

Walk   Visited URLs   Unique URLs
1      2,702,939      990,251
2      2,507,004      921,114
3      5,006,745      1,655,799

25
Percentage of pages in domains
26
Estimating search engine index size
  • Choose a sample of pages p1, p2, p3, ..., pn according
    to a near-uniform distribution
  • Check if the pages are in search engine index S
    [BB98]
  • Exact match
  • Host match
  • Estimate for size of index S is the percentage of
    sampled pages that are in S, i.e.
    $$\frac{1}{n} \sum_{j=1}^{n} I[p_j \in S]$$
    where I[p_j in S] = 1 if p_j is in S and 0 otherwise
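
The estimate is simply the mean of the indicator over the sample. The sketch below shows it with two membership tests; the function name `coverage`, the example URLs, and the reading of "host match" as "the index contains some page on the same host" are assumptions made for illustration.

```python
from urllib.parse import urlsplit

def coverage(sample, in_index):
    """Fraction of sampled pages p for which in_index(p) is true."""
    return sum(1 for p in sample if in_index(p)) / len(sample)

# Hypothetical index contents and page sample.
index = {"http://a.example/1", "http://b.example/home"}
index_hosts = {urlsplit(u).hostname for u in index}
pages = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/home",
    "http://c.example/",
]

exact = coverage(pages, lambda p: p in index)
host = coverage(pages, lambda p: urlsplit(p).hostname in index_hosts)
print(exact, host)
```

Host matching is the looser test, so it can only raise the estimate relative to exact matching.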

27
Result set for index size (fall 99)
28
Random sampling of sites
29
Publicly indexable web
  • We analyzed the publicly indexable web
  • Excludes pages that are not indexed by the major
    search engines due to
  • Authentication requirements
  • Pages hidden behind search forms
  • Robots exclusion standard

30
Random sampling of sites
  • Randomly sample IP addresses (256^4 or about 4.3
    billion)
  • Test for a web server at the standard port
  • Many machines and network connections are
    temporarily unavailable - recheck all addresses
    after one week
  • Many sites serve the same content on multiple IP
    addresses for load balancing or redundancy
  • Use DNS - only count one address in publicly
    indexable web
  • Many servers not part of the publicly indexable
    web
  • Authorization requirements, default page, sites
    coming soon, web-hosting companies that present
    their homepage on many IP addresses, printers,
    routers, proxies, mail servers, etc.
  • Use regular expressions to find a majority,
    manual inspection
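
The first two steps, drawing a random IP address and testing for a web server, might look like the sketch below. It is an illustration only: it assumes port 80 is the "standard port", does not exclude reserved or private address ranges, and omits the one-week recheck, the DNS de-duplication, and the filtering described above. The function names are hypothetical.

```python
import ipaddress
import random
import socket

def random_ipv4(rng=random):
    """Draw one address uniformly from the 256**4 possible IPv4 addresses."""
    return str(ipaddress.IPv4Address(rng.getrandbits(32)))

def has_web_server(ip, port=80, timeout=2.0):
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

candidates = [random_ipv4(random.Random(seed)) for seed in range(5)]
print(candidates)
# Probing is left to the caller, e.g.:
# reachable = [ip for ip in candidates if has_web_server(ip)]
```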

31
Feb 99 results
  • Manually classified 2,500 random web servers
  • 83% of sites commercial
  • Percentage of sites in areas like science,
    health, and government is relatively small
  • Would be feasible and very valuable to create
    specialized services that are very comprehensive
    and up to date
  • 65% of sites have a majority of pages in English

32
Metadata analysis
  • Analyzed simple HTML meta tag usage on the
    homepage of the 2,500 random servers
  • 34% of sites had description or keywords tags
  • Low usage of this simple standard suggests that
    acceptance and widespread use of more complex
    standards like XML and Dublin Core may be very
    slow
  • 0.3% of sites contained Dublin Core tags

33
Web graph models
34
Inverse power laws on the web
  • Fraction of pages with k in-links

35
Properties with inverse power law
  • indegree of web pages
  • outdegree of web pages
  • indegree of web pages, off-site links only
  • outdegree of web pages, off-site links only
  • size of weakly connected components
  • size of strongly connected components
  • indegree of hosts
  • outdegree of hosts
  • number of hyperlinks between host pairs
  • PageRank

36
Category specific web
  • All US company homepages
  • Histogram with exponentially increasing size
    buckets (constant size on log scale)
  • Strong deviation from pure power law
  • Unimodal body, power law tail
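
A histogram with exponentially increasing bucket sizes can be built as follows. This is a small sketch with made-up in-link counts; the base of 2 and the choice to skip pages with zero in-links (which have no position on a log scale) are assumptions.

```python
from collections import Counter

def bucket_index(value, base=2):
    """Return i such that base**i <= value < base**(i + 1), for integer value >= 1."""
    i = 0
    while value >= base:
        value //= base
        i += 1
    return i

def log_histogram(values, base=2):
    """Count values in buckets of constant width on a log scale."""
    buckets = Counter()
    for v in values:
        if v >= 1:  # assumption: zero counts are left out
            buckets[bucket_index(v, base)] += 1
    return dict(buckets)

inlink_counts = [0, 1, 2, 3, 4, 7, 8, 100]
histogram = log_histogram(inlink_counts)
print(histogram)
```

Integer division avoids the rounding problems that a floating-point logarithm can cause at exact powers of the base.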

37
Web graph model [BA99]
  • Preferential attachment model
  • Start with nodes
  • At each timestep
  • add 1 node v and
  • m edges incident to v s.t. for each new edge
    P(other endpoint is node u) ∝ in-degree(u)
  • Theorem: P(page has k in-links) ∝ k^-3
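
A preferential attachment process can be simulated in a few lines. This sketch makes assumptions the slide leaves open: it starts from a complete graph on m + 1 nodes, picks the m targets of each new node without repetition, and weights targets by total degree (as in the usual Barabási-Albert formulation) so that nodes without in-links can still be chosen.

```python
import random

def preferential_attachment(n, m, rng=random):
    """Grow a graph to n nodes; each new node adds m edges to existing nodes.

    Returns a list of edges (new_node, existing_node).
    """
    # Assumed seed graph: complete graph on nodes 0..m.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list is a draw proportional to degree.
    endpoints = [x for edge in edges for x in edge]
    for v in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for u in targets:
            edges.append((v, u))
            endpoints.extend((v, u))
    return edges

pa_edges = preferential_attachment(200, 3, random.Random(0))
indegree = {}
for v, u in pa_edges:
    indegree[u] = indegree.get(u, 0) + 1
print(sorted(indegree.values(), reverse=True)[:10])
```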

38
Combining preferential and uniform
  • Extension of preferential attachment model
  • Start with nodes
  • At timestep t
  • add 1 node v and
  • m edges s.t. for each new edgeP(node u is
    endpoint)
  • Theorem P(page has k in-links) ?

39
Preferential vs. uniform attachment
  • always
  • Preferential attachment plays a greater role in
    web link growth than uniform attachment
  • Distribution of links to companies and newspapers
    close to power law
  • Distribution of links to universities and
    scientists closer to uniform
  • More balanced mixture of preferential and uniform
    attachment

Preferential attachment
Dataset         α
Companies       0.95
Newspapers      0.95
Web inlinks     0.91
Universities    0.61
Scientists      0.60
Web outlinks    0.58
40
E-commerce categories
41
Other networks
  • Most social/biological networks exhibit drop-off
    from power law scaling at small k
  • Actor collaborations, paper citations, US power
    grid, global web outlinks, web file sizes

42
Graph model summary
  • Previous research: power-law distribution of
    inlinks ("winners take all")
  • Only an approximation: hides important details
  • Distribution varies across categories and may
    be much less biased
  • New model accurately accounts for the
    distribution of category-specific pages, the web
    as a whole, and other social networks
  • May be used to predict the degree of "winners
    take all" behavior

43
Copy model [KKRRT99]
  • At each timestep add new node u with fixed
    outdegree d.
  • The destinations of these links are chosen
  • Choose existing node v uniformly at random.
  • For j = 1, ..., d, the j-th link of u points to a
    random existing node with probability α and to
    the destination of v's j-th link with probability
    1 - α.
  • Models power law as well as large number of small
    bipartite cliques.
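
The copy step can be sketched as below. The slide does not say how the process starts, so the seed graph (d + 1 nodes that each link to the d others) is an assumption, as are the function name and the parameter name `alpha` for the probability of choosing a random destination.

```python
import random

def copy_model(n, d, alpha, rng=random):
    """Return links, where links[u] lists the d destinations of node u."""
    # Assumed seed: nodes 0..d, each linking to the d other seed nodes.
    links = [[w for w in range(d + 1) if w != u] for u in range(d + 1)]
    for u in range(d + 1, n):
        v = rng.randrange(u)  # prototype node, uniform over existing nodes
        out = []
        for j in range(d):
            if rng.random() < alpha:
                out.append(rng.randrange(u))  # random existing node
            else:
                out.append(links[v][j])       # copy v's j-th destination
        links.append(out)
    return links

copy_links = copy_model(100, 3, 0.5, random.Random(0))
print(copy_links[:8])
```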

44
Relink model
  • Hostgraph exhibits drop-off from power law
    scaling at small k → relink model
  • With probability β select a random existing node
    u, and with probability 1 - β create a new node u.
    Add d edges to u.
  • The destinations of these links are chosen
  • Choose existing node v uniformly at random and
    choose d random edges with source v.
  • Determine destinations as in the copy model.

45
Relink model
46
Linkage between domains
Domain   com    Self   1        2        3        4
com      82.9   82.9   net 6.5  org 2.6  jp 0.8   uk 0.7
cn       15.8   74.1   tw 0.4   jp 0.2   de 0.2   hk 0.1
jp       17.4   74.5   to 0.8   cn 0.6   uk 0.2   de 0.1
tw       22.0   66.0   to 1.3   au 0.6   jp 0.6   ch 0.4
ca       19.4   65.2   uk 0.6   fr 0.4   se 0.3   de 0.3
de       16.0   71.2   uk 0.8   ch 0.6   at 0.5   nl 0.2
br       17.8   69.1   uk 0.4   pt 0.4   de 0.4   ar 0.2
fr       20.9   61.9   ch 0.9   de 0.8   uk 0.7   ca 0.5
uk       34.2   33.1   de 0.6   ca 0.5   jp 0.3   se 0.3
47
Finding communities
48
Finding communities
  • Identifying communities is valuable for
  • Focused search engines
  • Web directory creation
  • Content filtering
  • Analysis of communities and relationships

49
Recursive communities
  • Several methods proposed
  • One link based method
  • A community consists of members that have more
    links within the community than outside of the
    community

50
s-t Maximum flow
  • Definition: given a directed graph G = (V, E) with
    edge capacities c(u,v) ≥ 0 and two vertices s,
    t ∈ V, find the maximum flow that can be routed
    from the source, s, to the sink, t.
  • Intuition: think of water pipes
  • Note: maximum flow = minimum cut
  • Maximum flow yields communities

51
Maximum flow communities
  • If the source is in the community, the sink is
    outside of the community, and the degree of the
    source and sink exceeds the cut size, then
    maximum flow identifies the entire community.
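
The claim can be checked on a toy graph by computing a maximum flow and taking every node still reachable from the source in the residual graph, which is the source side of a minimum cut. The sketch uses breadth-first augmenting paths (Edmonds-Karp) as a convenient choice, not necessarily the method used by the authors, and it assumes undirected links with unit capacity; the graph and names are made up.

```python
from collections import defaultdict, deque

def max_flow_community(edges, s, t):
    """Return (max flow value, set of nodes on the source side of a min cut)."""
    cap = defaultdict(lambda: defaultdict(int))
    for u, v in edges:
        cap[u][v] += 1  # assumption: each link has capacity 1
        cap[v][u] += 1  # in both directions

    def reachable_from_source():
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable_from_source()
        if t not in parent:
            return flow, set(parent)  # no augmenting path is left
        # Find the bottleneck capacity along the path from s to t.
        bottleneck = float("inf")
        v = t
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        # Push the flow and update residual capacities.
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck

group_a = ["a1", "a2", "a3", "a4"]
group_b = ["b1", "b2", "b3", "b4"]
toy_edges = [(x, y) for g in (group_a, group_b) for i, x in enumerate(g) for y in g[:i]]
toy_edges.append(("a4", "b1"))  # single link between the two groups
flow, community = max_flow_community(toy_edges, "a1", "b4")
print(flow, sorted(community))
```

Here the source and sink each have degree 3, which exceeds the cut size of 1, so the source side of the cut is exactly the first group.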

52
Maximum flow communities
53
Maximum flow communities
54
SVM web community
  • Seed set consisted of
  • http://svm.first.gmd.de/
  • http://svm.research.bell-labs.com/
  • http://www.clrc.rhbnc.ac.uk/research/SVM/
  • http://www.support-vector.net/
  • Four EM iterations used
  • Only external links considered
  • Induced graph contained over 11,000 URLs
  • Identified community contained 252 URLs

55
Top ranked SVM pages
  1. Vladimir Vapnik's home page (inventor of SVMs)
  2. Home page of SVM light, a popular software
    package
  3. A hub site of SVM links
  4. Text categorization corpus
  5. SVM application list
  6. John Platt's SVM page (inventor of SMO)
  7. Research interests of Mario Marchand (SVM
    researcher)
  8. SVM workshop home page
  9. GMD First SVM publication list
  10. Book Advances in Kernel Methods - SVM Learning
  1. B. Schölkopf's SVM page
  2. GMD First hub page of SVM researchers
  3. Y. Li's links to SVM pages
  4. NIPS SVM workshop abstract page
  5. GMD First SVM links
  6. Learning System Group of ANU
  7. NIPS98 workshop on large margin classifiers
  8. Control theory seminar (with links to SVM
    material)
  9. ISIS SVM page
  10. Jonathan Howell's home page

56
Lowest ranked SVM pages
  • Ten web pages tied for the lowest score. All
    were personal home pages of scientists that had
    at least one SVM publication.
  • Other results contained researchers, students,
    software, books, conferences, workshops, etc.
  • A few false positives: NN and data mining.

57
Ronald Rivest community summary
  • One seed: http://theory.lcs.mit.edu/rivest
  • Four EM iterations used
  • First EM iteration used internal links
  • Induced graph contained more than 38,000 URLs
  • Identified community contained 150 URLs

58
Ronald Rivest top ranked pages
  1. Thomas H. Cormen's home page
  2. The Mathematical Guts of RSA Encryption
  3. Charles E. Leiserson's home page
  4. Famous people in the history of Cryptography
  5. Cryptography sites
  6. Massachusetts Institute of Technology
  7. general cryptography links
  8. Spektrum der Wissenschaft - Kryptographie
  9. Issues in Securing Electronic Commerce over the
    Internet
  10. course based on Introduction to Algorithms
  1. Recommended Literature for Self-Study
  2. Resume of Aske Plaat
  3. German article on who's who of the WWW
  4. People Ulrik knows
  5. A course that uses "Introduction to Algorithms"
  6. Bibliography on algorithms
  7. an article on encryption
  8. German computer science institute
  9. security links
  10. International PGP FAQ

59
Ronald Rivest lowest ranked
  • 23 URLs tied for the lowest ranked
  • All 23 were personally related to Ronald Rivest
    or his research
  • 11 / 23 were bibliographies of Rivest's
    publications

60
Rivest community n-grams
61
Rivest community rules
62
Web communities summary
  • Approximate method gives promising results
  • Exact method should be practical as well
  • Both methods can be easily generalized
  • Applications are numerous and exciting
  • Building a better web directory
  • Focused search engines
  • Filtering undesirable content
  • Complements text-based methods

63
Focused crawling
64
Focused crawling
  • Analyzing the web graph can help locate pages on
    a specific topic
  • Typical crawler considers only the links on the
    current page
  • Graph based focused crawler learns the context of
    the web graph where relevant pages appear
  • Significant performance improvements

65
Focused crawling
66
CiteSeer
67
CiteSeer
  • Digital library for scientific literature
  • Aims to improve communication and progress in
    science
  • Autonomous Citation Indexing, citation context
    extraction, distributed error correction,
    citation graph analysis, etc.
  • Helps researchers obtain a better perspective and
    overview of the literature with citation context
    and new methods of locating related research
  • Lower cost, wider availability, more up-to-date
    than competing citation indexing services
  • Faster, easier, and more complete access to the
    literature can speed research, better direct
    research activities, and minimize duplication of
    effort

68
CiteSeer
  • 575,000 documents
  • 6 million citations
  • 500,000 daily requests
  • 50,000 daily users
  • Data for research available on request
  • feedback_at_researchindex.org

69
Distribution of articles
[Figure: SCI vs. ResearchIndex]
70
Citations over time
71
Citations over time
  • Conference papers and technical reports play a
    very important role in computer science research
  • Citations to very recent research are dominated
    by these types of articles
  • When recent journal papers are cited they are
    typically in press or to appear
  • The most cited items tend to be journal articles
    and books
  • Conference and technical report citations tend to
    be replaced with journal and book citations over
    time
  • May not be a one-to-one mapping

72
Online or invisible?
73
Online or invisible?
  • Analyzed 119,924 conference articles from DBLP
  • Online articles cited 4.5 times more than offline
    articles on average
  • Online articles more highly cited because
  • They are easier to access and thus more visible,
    or
  • Because higher quality articles are more likely
    to be made available online?
  • Within venues, online articles cited 4.4 times
    more on average
  • Similar when restricted to top-tier conferences

74
Persistence of URLs
  • Analyzed URLs referenced within articles in
    CiteSeer
  • URLs per article increasing
  • Many URLs now invalid
  • 1999 - 23%
  • 1994 - 53%

75
Persistence of URLs
  • 2nd searcher found 80% of URLs the 1st searcher
    could not find
  • Only 3% of URLs could not be found after 2nd
    searcher

76
How important are the lost URLs?
  • With respect to the ability of future research to
    verify and/or build on the given paper

[Figure panels: After 1st searcher; After 2nd searcher]
77
Persistence of URLs
  • Many URLs now invalid
  • Can often relocate information
  • No evidence that information very important to
    future research has been lost yet
  • Citation practices suggest more information will
    be lost in the future unless these practices are
    improved
  • A widespread and easy to use web with invalid
    links may be more useful than an improved system
    without invalid links but with added complexity
    or overhead

78
Extracting knowledge from the web
  • Unprecedented opportunity for automated analysis
    of a large sample of interests and activity in
    the world
  • Many methods for extracting knowledge from the
    web
  • Random sampling and analysis of pages and hosts
  • Analysis of link structure and link growth

79
Extracting knowledge from the web
  • Variety of information can be extracted
  • Distribution of interest and activity in
    different areas
  • Communities related to different topics
  • Competition in different areas
  • Communication between different communities

80
Collaborators
  • Web communities: Gary Flake, Lee Giles, Frans
    Coetzee
  • Link growth modeling: David Pennock, Gary Flake,
    Lee Giles, Eric Glover
  • Hostgraph modeling: Krishna Bharat, Bay-Wei
    Chang, Matthias Ruhl
  • Web page sampling: Allan Heydon, Michael
    Mitzenmacher, Marc Najork
  • Host sampling: Lee Giles
  • CiteSeer: Kurt Bollacker, Lee Giles

81
More information
  • http://www.henzinger.com/monika/
  • http://www.neci.nec.com/lawrence/
  • http://citeseer.org/