Web Mining - PowerPoint PPT Presentation

About This Presentation

Title:

Web Mining

Slides: 44

Provided by: ksu7

Learn more at: https://www.cs.kent.edu

Transcript and Presenter's Notes

1
Web Mining
2
Two Key Problems

Page Rank
Web Content Mining

3
PageRank

Intuition: solve the recursive equation "a page
is important if important pages link to it."
Mathematically: importance = the principal
eigenvector of the stochastic matrix of the Web.
A few fixups are needed.

4
Stochastic Matrix of the Web

Enumerate pages.
Page i corresponds to row and column i.
M[i,j] = 1/n if page j links to n pages,
including page i; M[i,j] = 0 if j does not link to i.
M[i,j] is the probability we'll next be at page
i if we are now at page j.
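
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name `stochastic_matrix`, the dictionary-of-out-links input, and the three-page graph are all assumptions made for the example, and each page is assumed to list every target at most once.

```python
from fractions import Fraction

def stochastic_matrix(out_links, pages):
    """Build M with M[i][j] = 1/n if page j links to n pages including page i, else 0."""
    index = {page: k for k, page in enumerate(pages)}
    size = len(pages)
    M = [[Fraction(0)] * size for _ in range(size)]
    for j, page in enumerate(pages):
        targets = out_links.get(page, [])
        for target in targets:
            # Column j spreads page j's importance evenly over its out-links.
            M[index[target]][j] = Fraction(1, len(targets))
    return M

# Hypothetical graph: page j links to 3 pages, including i.
links = {"j": ["i", "k", "j"], "i": ["j"], "k": ["i", "k"]}
pages = ["i", "j", "k"]
M = stochastic_matrix(links, pages)
entry_i_j = M[0][1]  # row i, column j
column_sums = [sum(M[r][c] for r in range(3)) for c in range(3)]
```

Every column of a page with at least one out-link sums to 1, which is what makes M a (column-)stochastic matrix.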

5
Example
Suppose page j links to 3 pages, including i.
(Figure: the entry of M in row i, column j is 1/3.)
6
Random Walks on the Web

Suppose v is a vector whose i-th component is
the probability that we are at page i at a
certain time.
If we follow a link from i at random, the
probability distribution for the page we are then
at is given by the vector M v.

7
Random Walks --- (2)

Starting from any vector v, the limit M (M (...
M (M v ) ...)) is the distribution of page visits
during a random walk.
Intuition: pages are important in proportion to
how often a random walker would visit them.
The math: limiting distribution = principal
eigenvector of M = PageRank.

8
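Example: The Web in 1839
(Figure: link graph over Yahoo (y), Amazon (a), and Msoft (m).)
Matrix M (column = source page, row = destination page):
       y     a     m
  y   1/2   1/2    0
  a   1/2    0     1
  m    0    1/2    0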
9
Simulating a Random Walk

Start with the vector v = [1, 1, ..., 1], representing
the idea that each Web page is given one unit of
importance.
Repeatedly apply the matrix M to v, allowing the
importance to flow like a random walk.
The limit exists, but about 50 iterations are
sufficient to estimate the final distribution.
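
As a rough sketch of this simulation, the snippet below applies the three-page example matrix (rows and columns ordered y, a, m) repeatedly to v = [1, 1, 1]. The helper names `step` and `power_iterate` are assumptions chosen for illustration; exact fractions are used so the early iterates can be compared by eye with the worked example.

```python
from fractions import Fraction as F

# Example matrix, rows and columns ordered y (Yahoo), a (Amazon), m (Msoft).
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0),    F(1)],
    [F(0),    F(1, 2), F(0)],
]

def step(M, v):
    """One application of M to v (one step of importance flow)."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def power_iterate(M, v, iterations):
    """Apply M to v the given number of times."""
    for _ in range(iterations):
        v = step(M, v)
    return v

v0 = [F(1), F(1), F(1)]
first_three = [power_iterate(M, v0, k) for k in (1, 2, 3)]
estimate = [float(x) for x in power_iterate(M, v0, 50)]
```

After 50 iterations the estimate is already close to the limiting distribution for this small graph; a real implementation would use sparse matrices and floating point rather than exact fractions.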

10
Example

Equations v = M v:
y = y/2 + a/2
a = y/2 + m
m = a/2

Successive iterates (one row per iteration):
   y      a      m
   1      1      1
   1     3/2    1/2
  5/4     1     3/4
  9/8   11/8    1/2
  ...    ...    ...
  6/5    6/5    3/5   (limit)
11
Solving The Equations

Because there are no constant terms, these 3
equations in 3 unknowns do not have a unique
solution.
Add in the fact that y + a + m = 3 to solve.
In Web-sized examples, we cannot solve by
Gaussian elimination; we need to use relaxation
(iterative solution).
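
For the three-page example, the direct solution can be worked by hand. The derivation below uses the example's equations (y = y/2 + a/2, a = y/2 + m, m = a/2) together with the normalization y + a + m = 3; it is only a worked check of the small case, not a method for Web-sized systems.

```latex
\begin{align*}
y &= \tfrac{y}{2} + \tfrac{a}{2} \;\Rightarrow\; y = a, \\
m &= \tfrac{a}{2}, \\
y + a + m &= a + a + \tfrac{a}{2} = \tfrac{5a}{2} = 3 \;\Rightarrow\; a = \tfrac{6}{5}, \\
y &= \tfrac{6}{5}, \qquad m = \tfrac{3}{5}, \\
\text{check: } a &= \tfrac{y}{2} + m = \tfrac{3}{5} + \tfrac{3}{5} = \tfrac{6}{5}.
\end{align*}
```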

12
Real-World Problems

Some pages are dead ends (have no links out).
Such a page causes importance to leak out.
Other (groups of) pages are spider traps (all
out-links are within the group).
Eventually spider traps absorb all importance.

13
Microsoft Becomes Dead End
(Figure: link graph over Yahoo (y), Amazon (a), and Msoft (m); Msoft has no out-links.)
       y     a     m
  y   1/2   1/2    0
  a   1/2    0     0
  m    0    1/2    0
14
Example

Equations v = M v:
y = y/2 + a/2
a = y/2
m = a/2

Successive iterates (one row per iteration):
   y      a      m
   1      1      1
   1     1/2    1/2
  3/4    1/2    1/4
  5/8    3/8    1/4
  ...    ...    ...
   0      0      0    (limit)
15
Msoft Becomes Spider Trap
(Figure: link graph over Yahoo (y), Amazon (a), and Msoft (m); Msoft links only to itself.)
       y     a     m
  y   1/2   1/2    0
  a   1/2    0     0
  m    0    1/2    1
16
Example

Equations v = M v:
y = y/2 + a/2
a = y/2
m = a/2 + m

Successive iterates (one row per iteration):
   y      a      m
   1      1      1
   1     1/2    3/2
  3/4    1/2    7/4
  5/8    3/8     2
  ...    ...    ...
   0      0      3    (limit)
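
The two failure modes can be reproduced with a short sketch that iterates the dead-end and spider-trap matrices from the examples. The variable and function names are illustrative assumptions; the matrices are the ones shown on the slides, ordered y, a, m.

```python
from fractions import Fraction as F

def step(M, v):
    """One application of M to v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def iterate(M, v, n):
    """Apply M to v n times."""
    for _ in range(n):
        v = step(M, v)
    return v

# Msoft (third column) has no out-links: its column is all zeros.
dead_end = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0),    F(0)],
    [F(0),    F(1, 2), F(0)],
]
# Msoft links only to itself: its column puts all weight on row m.
spider_trap = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0),    F(0)],
    [F(0),    F(1, 2), F(1)],
]

v0 = [F(1), F(1), F(1)]
leaked = iterate(dead_end, v0, 50)     # importance drains away
trapped = iterate(spider_trap, v0, 50)  # importance piles up on m
total_after_dead_end = float(sum(leaked))
total_after_trap = sum(trapped)
```

With the dead end the total importance shrinks toward 0, while with the spider trap the total stays at 3 but almost all of it ends up on Msoft.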
17
Google Solution to Traps, Etc.

Tax each page a fixed percentage at each
iteration.
Add the same constant to all pages.
Models a random walk with a fixed probability of
going to a random place next.

18
Example: Previous with 20% Tax

Equations v = 0.8(M v ) + 0.2:
y = 0.8(y/2 + a/2) + 0.2
a = 0.8(y/2) + 0.2
m = 0.8(a/2 + m) + 0.2

Successive iterates (one row per iteration):
   y       a       m
   1       1       1
  1.00    0.60    1.40
  0.84    0.60    1.56
  0.776   0.536   1.688
  ...     ...     ...
  7/11    5/11    21/11   (limit)
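
A minimal sketch of the taxed iteration, assuming the spider-trap matrix from the previous example and a 20% tax redistributed as a constant 0.2 per page. The function name `taxed_step` and the parameter names `keep` and `bonus` are assumptions for illustration.

```python
from fractions import Fraction as F

# Spider-trap matrix, rows and columns ordered y, a, m.
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0),    F(0)],
    [F(0),    F(1, 2), F(1)],
]

def taxed_step(M, v, keep=F(4, 5), bonus=F(1, 5)):
    """v' = keep * (M v) + bonus, applied to every component."""
    return [
        keep * sum(M[i][j] * v[j] for j in range(len(v))) + bonus
        for i in range(len(M))
    ]

v = [F(1), F(1), F(1)]
history = []
for _ in range(3):
    v = taxed_step(M, v)
    history.append(v)

# The stated limit should be left unchanged by one more step.
limit = [F(7, 11), F(5, 11), F(21, 11)]
limit_is_fixed_point = taxed_step(M, limit) == limit
```

Because the taxed equations have constant terms, the fixed point is unique, and the spider trap no longer absorbs all of the importance.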
19
General Case

In this example, because there are no dead-ends,
the total importance remains at 3.
In examples with dead-ends, some importance leaks
out, but total remains finite.

20
Solving the Equations

Because there are constant terms, we can expect
to solve small examples by Gaussian elimination.
Web-sized examples still need to be solved by
relaxation.

21
Speeding Convergence

Newton-like prediction of where components of the
principal eigenvector are heading.
Take advantage of locality in the Web.
Each technique can reduce the number of
iterations by 50%.
Important --- PageRank takes time!

22
Web Content Mining

The Web is perhaps the single largest data source
in the world.
Much of Web (content) mining is about:
Data/information extraction from semi-structured
objects and free text, and
Integration of the extracted data/information.
Due to the heterogeneity and lack of structure,
mining and integration are challenging tasks.

23
(No Transcript)
24
(No Transcript)
25
(No Transcript)
26
Wrapper induction

Using machine learning to generate extraction
rules.
The user marks the target items in a few training
pages.
The system learns extraction rules from these
pages.
The rules are applied to extract target items
from other pages.
Many wrapper induction systems exist, e.g.,
WIEN (Kushmerick et al., IJCAI-97),
Softmealy (Hsu and Dung, 1998),
Stalker (Muslea et al., Agents-99),
BWI (Freitag and McCallum, AAAI-00),
WL2 (Cohen et al., WWW-02),
IDE (Liu and Zhai, WISE-05),
Thresher (Hogue and Karger, WWW-05).

27
Stalker: A wrapper induction system (Muslea et
al., Agents-99)

E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570
E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
E4: 403 La Tijera, <b>Watts</b>, Phone (310) 798-0008
We want to extract the area code.
Start rules:
R1: SkipTo(()
R2: SkipTo(-<b>)
End rules:
R3: SkipTo())
R4: SkipTo(</b>)
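
The following is a simplified sketch of how such rules could be applied, not Stalker's actual implementation. It treats each landmark as a literal string (the real system works on token sequences with wildcards), tries the rules in order, and assumes that end rules scan backward from the end of the text. The function names are hypothetical.

```python
def skip_to(text, landmark):
    """Start rule: scan forward, return the index just past the landmark."""
    pos = text.find(landmark)
    return None if pos == -1 else pos + len(landmark)

def skip_back_to(text, landmark):
    """End rule (assumed to scan backward): index where the last landmark begins."""
    pos = text.rfind(landmark)
    return None if pos == -1 else pos

def first_match(rule, text, landmarks):
    """Try the landmarks in order and return the first position found."""
    for landmark in landmarks:
        pos = rule(text, landmark)
        if pos is not None:
            return pos
    return None

def extract(text, start_landmarks, end_landmarks):
    """Return the text between the start-rule and end-rule positions."""
    start = first_match(skip_to, text, start_landmarks)
    end = first_match(skip_back_to, text, end_landmarks)
    if start is None or end is None or end < start:
        return None
    return text[start:end]

examples = [
    "513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515",
    "90 Colfax, <b>Palms</b>, Phone (800) 508-1570",
    "523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293",
    "403 La Tijera, <b>Watts</b>, Phone (310) 798-0008",
]
# R1, R2 as start landmarks; R3, R4 as end landmarks.
area_codes = [extract(e, ["(", "-<b>"], [")", "</b>"]) for e in examples]
```

The ordered lists mirror the idea that a later rule is only consulted when an earlier one fails to match.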

28
Learning extraction rules

Stalker uses sequential covering to learn
extraction rules for each target item.
In each iteration, it learns a perfect rule that
covers as many positive items as possible without
covering any negative items.
Once a positive item is covered by a rule, the
whole example is removed.
The algorithm ends when all the positive items
are covered. The result is an ordered list of all
learned rules.

29
Rule induction through an example

Training examples:
E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570
E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
E4: 403 La Tijera, <b>Watts</b>, Phone (310) 798-0008
We learn a start rule for the area code.
Assume the algorithm starts with E2. It creates
three initial candidate rules, using the first prefix
symbol and two wildcards:
R1: SkipTo(()
R2: SkipTo(Punctuation)
R3: SkipTo(Anything)
R1 is perfect. It covers two positive examples
but no negative example.

30
Rule induction (cont.)

E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570
E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
E4: 403 La Tijera, <b>Watts</b>, Phone (310) 798-0008
R1 covers E2 and E4, which are removed. E1 and E3
need additional rules.
Three candidates are created:
R4: SkipTo(<b>)
R5: SkipTo(HtmlTag)
R6: SkipTo(Anything)
None is good. Refinement is needed.
Stalker chooses R4 to refine, i.e., to add
additional symbols to specialize it.
It will find R7: SkipTo(-<b>), which is perfect.
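
The sequential-covering loop can be illustrated with a heavily simplified sketch. It is not Stalker's algorithm: candidates here are just the literal strings of up to four characters immediately preceding the target (no wildcards, no token-level refinement), a rule counts as "perfect" if it never lands at a wrong position on the remaining examples, and ties are broken in favor of shorter landmarks. All function names are assumptions.

```python
def lands_at(text, landmark):
    """Index just past the first occurrence of landmark, or None if absent."""
    pos = text.find(landmark)
    return None if pos == -1 else pos + len(landmark)

def covered(landmark, examples):
    """Examples for which the landmark lands exactly at the target start."""
    return [ex for ex in examples if lands_at(ex[0], landmark) == ex[1]]

def is_perfect(landmark, examples):
    """True if the landmark either fails to match or lands on the target."""
    return all(lands_at(text, landmark) in (None, target) for text, target in examples)

def candidates(examples, max_len=4):
    """Literal strings of length 1..max_len that end right before a target."""
    found = set()
    for text, target in examples:
        for n in range(1, max_len + 1):
            if target - n >= 0:
                found.add(text[target - n:target])
    return sorted(found)

def learn_start_rules(examples, max_len=4):
    """Sequential covering: pick a perfect rule, remove what it covers, repeat."""
    remaining, rules = list(examples), []
    while remaining:
        perfect = [c for c in candidates(remaining, max_len) if is_perfect(c, remaining)]
        if not perfect:
            break  # this simplified candidate space cannot cover the rest
        best = max(perfect, key=lambda c: (len(covered(c, remaining)), -len(c)))
        rules.append(best)
        done = covered(best, remaining)
        remaining = [ex for ex in remaining if ex not in done]
    return rules

texts = [
    "513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515",
    "90 Colfax, <b>Palms</b>, Phone (800) 508-1570",
    "523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293",
    "403 La Tijera, <b>Watts</b>, Phone (310) 798-0008",
]
codes = ["800", "800", "800", "310"]
# Each example pairs the text with the index where its area code starts.
examples = [(t, t.index(c)) for t, c in zip(texts, codes)]
rules = learn_start_rules(examples)
```

On these four examples the sketch ends up with an ordered list of two landmarks, the opening parenthesis and `-<b>`, which correspond to R1 and R7 in the walkthrough.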

31
Limitations of Supervised Learning

Manual Labeling is labor intensive and time
consuming, especially if one wants to extract
data from a huge number of sites.
Wrapper maintenance is very costly
if Web sites change frequently:
It is necessary to detect when a wrapper stops
working properly.
Any change may make existing extraction rules
invalid.
Re-learning is needed, and most likely manual
re-labeling as well.

32
The RoadRunner System (Crescenzi et al., VLDB-01)

Given a set of positive examples (multiple sample
pages). Each contains one or more data records.
From these pages, generate a wrapper as a
union-free regular expression (i.e., no
disjunction).
The approach
To start, a sample page is taken as the wrapper.
The wrapper is then refined by solving mismatches
between the wrapper and each sample page, which
generalizes the wrapper.

33
(No Transcript)
34
Compare with wrapper induction

No manual labeling, but it needs a set of positive
pages of the same template,
which is not necessary for a page with multiple
data records.
The wrapper is for pages, not for data records.
A Web page can have many pieces of irrelevant
information.
Issues of automatic extraction:
Hard to handle disjunctions.
Hard to generate attribute names for the
extracted data.
Extracted data from multiple sites need
integration, manual or automatic.

35
Relation Extraction

Assumptions
No single source contains all the tuples
Each tuple appears on many web pages
Components of a tuple appear close together, e.g.,
"Foundation, by Isaac Asimov"
"Isaac Asimov's masterpiece, the
<em>Foundation</em> trilogy"
There are repeated patterns in the way tuples are
represented on web pages

36
Naïve approach

Study a few websites and come up with a set of
patterns, e.g., regular expressions:
letter = [A-Za-z.]
title = letter{5,40}
author = letter{10,30}
<b>(title)</b> by (author)
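
As an illustration, the hand-written pattern can be expressed with Python's `re` module. This sketch adds a space to the letter class so that multi-word names match (an assumption, since the slide text does not show one), and the two sample strings are made up for the example.

```python
import re

# Letter class from the slide, plus a space (assumed) for multi-word names.
LETTER = r"[A-Za-z. ]"
TITLE = LETTER + r"{5,40}"
AUTHOR = LETTER + r"{10,30}"

# <b>(title)</b> by (author)
PATTERN = re.compile(r"<b>(" + TITLE + r")</b> by (" + AUTHOR + r")")

# Hypothetical page fragments.
page_one = "<li><b>Foundation</b> by Isaac Asimov</li>"
page_two = "Isaac Asimov's masterpiece, the <em>Foundation</em> trilogy"

hit = PATTERN.search(page_one)
pair = hit.groups() if hit else None
miss = PATTERN.search(page_two)
```

The second fragment mentions the same book and author but is formatted differently, so the pattern finds nothing there.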

37
Problems with naïve approach

A pattern that works on one web page might
produce nonsense when applied to another
So patterns need to be page-specific, or at least
site-specific
Impossible for a human to exhaustively enumerate
patterns for every relevant website
Will result in low coverage

38
Better approach (Brin)

Exploit duality between patterns and tuples
Find tuples that match a set of patterns
Find patterns that match a lot of tuples
DIPRE (Dual Iterative Pattern Relation Extraction)

(Figure: a cycle between Patterns and Tuples, with arrows labeled Match and Generate.)
39
DIPRE Algorithm

1. R ← SampleTuples
e.g., a small set of <title, author> pairs
2. O ← FindOccurrences(R)
Occurrences of tuples on web pages
Keep some surrounding context
3. P ← GenPatterns(O)
Look for patterns in the way tuples occur
Make sure patterns are not too general!
4. R ← MatchingTuples(P)
5. Return or go back to Step 2
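
The loop above can be sketched on a toy corpus. This is a loose illustration rather than Brin's DIPRE: a pattern here is just a (prefix, middle, suffix) triple of literal context strings, a minimum support count stands in for the "not too general" check, and the pages, seed tuples, and all function names are made up for the example.

```python
import re
from collections import Counter

def find_occurrences(tuples, pages, context=5):
    """Record (prefix, middle, suffix) context for each tuple occurrence."""
    occurrences = []
    for title, author in tuples:
        finder = re.compile(re.escape(title) + r"(.{1,10}?)" + re.escape(author))
        for page in pages:
            for m in finder.finditer(page):
                prefix = page[max(0, m.start() - context):m.start()]
                suffix = page[m.end():m.end() + context]
                occurrences.append((prefix, m.group(1), suffix))
    return occurrences

def gen_patterns(occurrences, min_support=2):
    """Keep contexts seen at least min_support times (crude generality guard)."""
    counts = Counter(occurrences)
    return [pattern for pattern, count in counts.items() if count >= min_support]

def matching_tuples(patterns, pages):
    """Apply each pattern to every page and collect the tuples it matches."""
    found = set()
    field = r"([A-Za-z. ]{3,40}?)"
    for prefix, middle, suffix in patterns:
        matcher = re.compile(
            re.escape(prefix) + field + re.escape(middle) + field + re.escape(suffix)
        )
        for page in pages:
            for m in matcher.finditer(page):
                found.add(m.groups())
    return found

def dipre(seeds, pages, rounds=2):
    """Alternate between finding patterns from tuples and tuples from patterns."""
    tuples = set(seeds)
    for _ in range(rounds):
        patterns = gen_patterns(find_occurrences(tuples, pages))
        tuples |= matching_tuples(patterns, pages)
    return tuples

pages = [
    "<li><b>Foundation</b> by Isaac Asimov</li>",
    "<li><b>Dune</b> by Frank Herbert</li>",
    "<li><b>Neuromancer</b> by William Gibson</li>",
]
seeds = {("Foundation", "Isaac Asimov"), ("Dune", "Frank Herbert")}
result = dipre(seeds, pages)
```

Starting from two seed pairs, the shared context on the first two pages yields one pattern, which then picks up the third pair.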

40
Web query interface integration

Many integration tasks:
Integrating Web query interfaces (search forms)
Integrating extracted data
Integrating textual information
Integrating ontologies (taxonomy)
We only introduce integration of query
interfaces.
Many web sites provide forms to query the deep Web.
Applications: meta-search and meta-query

41
Global Query Interface
(Figure: a global query interface over united.com, airtravel.com, delta.com, and hotwire.com.)
42
Synonym Discovery (He and Chang, KDD-04)

Discover synonym attributes, e.g.,
Author and Writer, Subject and Category

S1: author, title, subject, ISBN
S2: writer, title, category, format
S3: name, title, keyword, binding
(Figure: Holistic Model Discovery over the attributes category, author, name, subject, writer.)
43
Schema matching as correlation mining