Transcript and Presenter's Notes

Title: Web Mining


1
Web Mining
2
Two Key Problems
  • PageRank
  • Web Content Mining

3
PageRank
  • Intuition: solve the recursive equation "a page
    is important if important pages link to it."
  • Technically: importance = the principal
    eigenvector of the stochastic matrix of the Web.
  • A few fixups needed.

4
Stochastic Matrix of the Web
  • Enumerate pages.
  • Page i corresponds to row and column i.
  • M[i,j] = 1/n if page j links to n pages,
    including page i; M[i,j] = 0 if j does not link
    to i.
  • M[i,j] is the probability we'll next be at page
    i if we are now at page j.

5
Example
Suppose page j links to 3 pages, including i
j
i
1/3
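A minimal sketch of building M from a link table,
assuming pages are numbered 0..n-1; the `links` dict
below is illustrative, not from the slides:

```python
import numpy as np

# Hypothetical link structure: links[j] = pages that page j links to.
# Here y=0, a=1, m=2, matching the example on the next slides.
links = {0: [0, 1], 1: [0, 2], 2: [1]}
n_pages = 3

# M[i, j] = 1/n if page j links to n pages, one of which is i.
M = np.zeros((n_pages, n_pages))
for j, outlinks in links.items():
    for i in outlinks:
        M[i, j] = 1.0 / len(outlinks)

print(M)   # each column sums to 1 (column-stochastic)
```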
6
Random Walks on the Web
  • Suppose v is a vector whose i-th component is
    the probability that we are at page i at a
    certain time.
  • If we follow a link from i at random, the
    probability distribution for the page we are then
    at is given by the vector M v.

7
Random Walks --- (2)
  • Starting from any vector v, the limit
    M(M(M(M v)...)), i.e., M^k v as k grows, is the
    distribution of page visits during a random walk.
  • Intuition: pages are important in proportion to
    how often a random walker would visit them.
  • The math: limiting distribution = principal
    eigenvector of M = PageRank.

8
Example: The Web in 1839

[Graph: Yahoo links to itself and to Amazon; Amazon
links to Yahoo and to Msoft; Msoft links to Amazon.]

       y    a    m
  y   1/2  1/2   0
  a   1/2   0    1
  m    0   1/2   0
9
Simulating a Random Walk
  • Start with the vector v = [1, 1, ..., 1],
    representing the idea that each Web page is given
    one unit of importance.
  • Repeatedly apply the matrix M to v, allowing the
    importance to flow like a random walk.
  • The limit exists, but about 50 iterations is
    sufficient to estimate the final distribution
    (a sketch follows below).
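A sketch of this simulation in Python, using the
1839-Web matrix from slide 8 (NumPy; names are
illustrative):

```python
import numpy as np

def simulate_walk(M, n_pages, iterations=50):
    """Repeatedly apply M to v = [1, 1, ..., 1]."""
    v = np.ones(n_pages)
    for _ in range(iterations):
        v = M @ v
    return v

# The 1839 Web: y=0, a=1, m=2 (matrix from slide 8).
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(simulate_walk(M, 3))   # approaches [6/5, 6/5, 3/5]
```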

10
Example
  • Equations v = M v:
  • y = y/2 + a/2
  • a = y/2 + m
  • m = a/2

  y      a      m
  1      1      1
  1      3/2    1/2
  5/4    1      3/4
  9/8    11/8   1/2
  ...
  6/5    6/5    3/5   (limit)
11
Solving The Equations
  • Because there are no constant terms, these 3
    equations in 3 unknowns do not have a unique
    solution.
  • Add in the fact that y + a + m = 3 to solve.
  • In Web-sized examples, we cannot solve by
    Gaussian elimination; we need to use relaxation
    (= iterative solution).

12
Real-World Problems
  • Some pages are dead ends (have no links out).
  • Such a page causes importance to leak out.
  • Other (groups of) pages are spider traps (all
    out-links are within the group).
  • Eventually spider traps absorb all importance.

13
Microsoft Becomes Dead End

[Graph: as before, but Msoft now has no out-links.]

       y    a    m
  y   1/2  1/2   0
  a   1/2   0    0
  m    0   1/2   0
14
Example
  • Equations v = M v:
  • y = y/2 + a/2
  • a = y/2
  • m = a/2

  y      a      m
  1      1      1
  1      1/2    1/2
  3/4    1/2    1/4
  5/8    3/8    1/4
  ...
  0      0      0     (limit)
15
Msoft Becomes Spider Trap

[Graph: as before, but Msoft now links only to
itself.]

       y    a    m
  y   1/2  1/2   0
  a   1/2   0    0
  m    0   1/2   1
16
Example
  • Equations v = M v:
  • y = y/2 + a/2
  • a = y/2
  • m = a/2 + m

  y      a      m
  1      1      1
  1      1/2    3/2
  3/4    1/2    7/4
  5/8    3/8    2
  ...
  0      0      3     (limit)
. . .
17
Google Solution to Traps, Etc.
  • Tax each page a fixed percentage at each
    iteration.
  • Add the same constant to all pages.
  • Models a random walk with a fixed probability of
    going to a random place next.

18
Example: Previous with 20% Tax
  • Equations v = 0.8(M v) + 0.2:
  • y = 0.8(y/2 + a/2) + 0.2
  • a = 0.8(y/2) + 0.2
  • m = 0.8(a/2 + m) + 0.2
    (A sketch of this iteration follows below.)

  y       a       m
  1       1       1
  1.00    0.60    1.40
  0.84    0.60    1.56
  0.776   0.536   1.688
  ...
  7/11    5/11    21/11   (limit)
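A sketch of the taxed iteration on the spider-trap
Web of slide 15, assuming the 80/20 split used in the
equations above:

```python
import numpy as np

M = np.array([[0.5, 0.5, 0.0],    # y row
              [0.5, 0.0, 0.0],    # a row
              [0.0, 0.5, 1.0]])   # m row (Msoft is a spider trap)

beta = 0.8                        # keep 80%, tax 20% per iteration
v = np.ones(3)
for _ in range(50):
    # Taxed update: v = 0.8 (M v) + 0.2, adding the same
    # constant to all pages.
    v = beta * (M @ v) + (1 - beta)
print(v)   # approaches [7/11, 5/11, 21/11]
```

Because the taxed map is a contraction (spectral
radius 0.8), the iteration converges to a unique
fixed point and the trap no longer absorbs all
importance.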
19
General Case
  • In this example, because there are no dead-ends,
    the total importance remains at 3.
  • In examples with dead-ends, some importance leaks
    out, but total remains finite.

20
Solving the Equations
  • Because there are constant terms, we can expect
    to solve small examples by Gaussian elimination.
  • Web-sized examples still need to be solved by
    relaxation.

21
Speeding Convergence
  • Newton-like prediction of where components of the
    principal eigenvector are heading.
  • Take advantage of locality in the Web.
  • Each technique can reduce the number of
    iterations by 50%.
  • Important --- PageRank takes time!

22
Web Content Mining
  • The Web is perhaps the single largest data source
    in the world.
  • Much of Web (content) mining is about:
  • data/information extraction from semi-structured
    objects and free text, and
  • integration of the extracted data/information.
  • Due to the heterogeneity and lack of structure,
    mining and integration are challenging tasks.

23
(No Transcript)
24
(No Transcript)
25
(No Transcript)
26
Wrapper induction
  • Using machine learning to generate extraction
    rules.
  • The user marks the target items in a few training
    pages.
  • The system learns extraction rules from these
    pages.
  • The rules are applied to extract target items
    from other pages.
  • Many wrapper induction systems, e.g.,
  • WIEN (Kushmerick et al, IJCAI-97),
  • Softmealy (Hsu and Dung, 1998),
  • Stalker (Muslea et al. Agents-99),
  • BWI (Freitag and McCallum, AAAI-00),
  • WL2 (Cohen et al. WWW-02).
  • IDE (Liu and Zhai, WISE-05)
  • Thresher (Hogue and Karger, WWW-05)

27
Stalker: A wrapper induction system (Muslea et
al. Agents-99)
  • E1: 513 Pico, <b>Venice</b>, Phone
    1-<b>800</b>-555-1515
  • E2: 90 Colfax, <b>Palms</b>, Phone (800)
    508-1570
  • E3: 523 1st St., <b>LA</b>, Phone
    1-<b>800</b>-578-2293
  • E4: 403 La Tijera, <b>Watts</b>, Phone (310)
    798-0008
  • We want to extract the area code.
  • Start rules:
  • R1: SkipTo(()
  • R2: SkipTo(-<b>)
  • End rules:
  • R3: SkipTo())
  • R4: SkipTo(</b>)
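A hedged sketch of how a start rule's SkipTo
landmarks might be applied; `apply_start_rule` and
its string scanning are illustrative, not Stalker's
actual implementation:

```python
def apply_start_rule(page, landmarks):
    """Apply SkipTo landmarks left to right; return the index just
    past the last landmark, or None if the rule does not match."""
    pos = 0
    for lm in landmarks:
        hit = page.find(lm, pos)
        if hit < 0:
            return None
        pos = hit + len(lm)
    return pos

# E2 from the slide: start rule R1 = SkipTo('(') lands on the area code.
e2 = "90 Colfax, <b>Palms</b>, Phone (800) 508-1570"
start = apply_start_rule(e2, ["("])
print(e2[start:start + 3])   # -> 800
```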

28
Learning extraction rules
  • Stalker uses sequential covering to learn
    extraction rules for each target item.
  • In each iteration, it learns a perfect rule that
    covers as many positive items as possible without
    covering any negative items.
  • Once a positive item is covered by a rule, the
    whole example is removed.
  • The algorithm ends when all the positive items
    are covered. The result is an ordered list of all
    learned rules (a toy sketch follows below).
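A runnable toy of this loop; the memorizing `learner`
below is a degenerate stand-in for Stalker's actual
rule generation and refinement:

```python
def sequential_covering(examples, learn_one_rule):
    """Learn an ordered rule list: each round, learn a 'perfect' rule
    on the remaining examples, then remove the examples it covers."""
    rules = []
    remaining = list(examples)
    while any(label for _, label in remaining):       # positives left?
        rule = learn_one_rule(remaining)              # covers positives, no negatives
        rules.append(rule)
        remaining = [(x, y) for x, y in remaining if not rule(x)]
    return rules

# Toy use: a rule is a predicate; this learner just memorizes the
# first remaining positive example (hypothetical, for illustration).
examples = [("(800)", True), ("(310)", True), ("-<b>800</b>", True),
            ("plain", False)]
learner = lambda rem: (lambda x, t=next(x for x, y in rem if y): x == t)
print(len(sequential_covering(examples, learner)))    # -> 3 rules
```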

29
Rule induction through an example
  • Training examples:
  • E1: 513 Pico, <b>Venice</b>, Phone
    1-<b>800</b>-555-1515
  • E2: 90 Colfax, <b>Palms</b>, Phone (800)
    508-1570
  • E3: 523 1st St., <b>LA</b>, Phone
    1-<b>800</b>-578-2293
  • E4: 403 La Tijera, <b>Watts</b>, Phone (310)
    798-0008
  • We learn a start rule for the area code.
  • Assume the algorithm starts with E2. It creates
    three initial candidate rules with the first prefix
    symbol and two wildcards:
  • R1: SkipTo(()
  • R2: SkipTo(Punctuation)
  • R3: SkipTo(Anything)
  • R1 is perfect. It covers two positive examples
    but no negative example.

30
Rule induction (cont.)
  • E1: 513 Pico, <b>Venice</b>, Phone
    1-<b>800</b>-555-1515
  • E2: 90 Colfax, <b>Palms</b>, Phone (800)
    508-1570
  • E3: 523 1st St., <b>LA</b>, Phone
    1-<b>800</b>-578-2293
  • E4: 403 La Tijera, <b>Watts</b>, Phone (310)
    798-0008
  • R1 covers E2 and E4, which are removed. E1 and E3
    need additional rules.
  • Three candidates are created:
  • R4: SkipTo(<b>)
  • R5: SkipTo(HtmlTag)
  • R6: SkipTo(Anything)
  • None is good. Refinement is needed.
  • Stalker chooses R4 to refine, i.e., to add
    additional symbols to specialize it.
  • It will find R7: SkipTo(-<b>), which is perfect.

31
Limitations of Supervised Learning
  • Manual labeling is labor-intensive and
    time-consuming, especially if one wants to extract
    data from a huge number of sites.
  • Wrapper maintenance is very costly
  • if Web sites change frequently.
  • It is necessary to detect when a wrapper stops
    working properly.
  • Any change may make existing extraction rules
    invalid.
  • Re-learning is needed, and most likely manual
    re-labeling as well.

32
The RoadRunner System (Crescenzi et al. VLDB-01)
  • Given a set of positive examples (multiple sample
    pages), each containing one or more data records.
  • From these pages, generate a wrapper as a
    union-free regular expression (i.e., no
    disjunction).
  • The approach:
  • To start, a sample page is taken as the wrapper.
  • The wrapper is then refined by solving mismatches
    between the wrapper and each sample page, which
    generalizes the wrapper.

33
(No Transcript)
34
Compare with wrapper induction
  • No manual labeling, but needs a set of positive
    pages of the same template
  • which is not necessary for a page with multiple
    data records.
  • The wrapper is for whole pages, not for data
    records.
  • A Web page can have many pieces of irrelevant
    information.
  • Issues of automatic extraction:
  • Hard to handle disjunctions.
  • Hard to generate attribute names for the
    extracted data.
  • Extracted data from multiple sites need
    integration, manual or automatic.

35
Relation Extraction
  • Assumptions:
  • No single source contains all the tuples.
  • Each tuple appears on many web pages.
  • Components of a tuple appear close together:
  • Foundation, by Isaac Asimov
  • Isaac Asimov's masterpiece, the
    <em>Foundation</em> trilogy
  • There are repeated patterns in the way tuples are
    represented on web pages.

36
Naïve approach
  • Study a few websites and come up with a set of
    patterns, e.g., regular expressions (assembled
    into runnable form below):
  • letter = [A-Za-z.]
  • title = letter{5,40}
  • author = letter{10,30}
  • <b>(title)</b> by (author)
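The slide's patterns as one Python regex. One
assumption: a space is added to the letter class so
multi-word titles and authors can match:

```python
import re

letter = r"[A-Za-z. ]"   # slide's class plus ' ' (assumption)
pattern = re.compile(rf"<b>({letter}{{5,40}})</b> by ({letter}{{10,30}})")

html = "<b>Foundation</b> by Isaac Asimov"
m = pattern.search(html)
if m:
    print("title:", m.group(1), "| author:", m.group(2))
    # -> title: Foundation | author: Isaac Asimov
```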

37
Problems with naïve approach
  • A pattern that works on one web page might
    produce nonsense when applied to another
  • So patterns need to be page-specific, or at least
    site-specific
  • Impossible for a human to exhaustively enumerate
    patterns for every relevant website
  • Will result in low coverage

38
Better approach (Brin)
  • Exploit duality between patterns and tuples
  • Find tuples that match a set of patterns
  • Find patterns that match a lot of tuples
  • DIPRE (Dual Iterative Pattern Relation Extraction)

[Diagram: Patterns and Tuples in a loop --- patterns
Match new tuples; tuples Generate new patterns.]
39
DIPRE Algorithm
  1. R ← SampleTuples
  • e.g., a small set of <title, author> pairs
  2. O ← FindOccurrences(R)
  • Occurrences of tuples on web pages
  • Keep some surrounding context
  3. P ← GenPatterns(O)
  • Look for patterns in the way tuples occur
  • Make sure patterns are not too general!
  4. R ← MatchingTuples(P)
  5. Return or go back to Step 2
    (A sketch of this loop follows below.)
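A sketch of the DIPRE loop; the three callables are
hypothetical stand-ins for the crawling and
pattern-induction machinery:

```python
def dipre(seed_tuples, find_occurrences, gen_patterns, matching_tuples,
          rounds=3):
    """Pattern/tuple duality loop from the slide (a sketch)."""
    R = set(seed_tuples)             # 1. R <- SampleTuples
    for _ in range(rounds):
        O = find_occurrences(R)      # 2. occurrences, with surrounding context
        P = gen_patterns(O)          # 3. reject patterns that are too general
        R |= matching_tuples(P)      # 4. grow the relation; 5. loop to step 2
    return R
```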

40
Web query interface integration
  • Many integration tasks:
  • Integrating Web query interfaces (search forms)
  • Integrating extracted data
  • Integrating textual information
  • Integrating ontologies (taxonomies)
  • We only introduce integration of query
    interfaces.
  • Many web sites provide forms to query the deep
    web.
  • Applications: meta-search and meta-query

41
Global Query Interface
[Figure: query forms from united.com, airtravel.com,
delta.com, and hotwire.com merged into one global
query interface.]
42
Synonym Discovery (He and Chang, KDD-04)
  • Discover synonym attributes, e.g.,
    Author = Writer, Subject = Category

  Source schemas:
  S1: author  title  subject   ISBN
  S2: writer  title  category  format
  S3: name    title  keyword   binding

  Holistic model discovery finds the synonym groups
  {author, writer, name} and {subject, category}.
43
Schema matching as correlation mining
  • Across many sources:
  • Synonym attributes are negatively correlated
  • synonym attributes are semantic alternatives,
  • and thus rarely co-occur in query interfaces.
  • Grouping attributes are positively correlated
  • grouping attributes semantically complement each
    other,
  • and thus often co-occur in query interfaces.
    (A toy sketch of this idea follows below.)
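A toy sketch of the negative-correlation idea; the
score below is illustrative, not the exact measure
from He and Chang's paper:

```python
from itertools import combinations

# Hypothetical query interfaces, each a set of attribute names.
interfaces = [
    {"author", "title", "subject", "isbn"},      # S1
    {"writer", "title", "category", "format"},   # S2
    {"name", "title", "keyword", "binding"},     # S3
]

def correlation(a, b):
    """Toy score: co-occurrence rate minus the rate expected if a and b
    were independent. Negative -> synonym candidates."""
    n = len(interfaces)
    fa = sum(a in s for s in interfaces) / n
    fb = sum(b in s for s in interfaces) / n
    fab = sum(a in s and b in s for s in interfaces) / n
    return fab - fa * fb

attrs = sorted(set().union(*interfaces))
for a, b in combinations(attrs, 2):
    if correlation(a, b) < 0:    # never co-occur, e.g., author/writer
        print(a, "~", b)
```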