CS224: Advances in Database Management System Technology Spring 224

1
CS224: Advances in Database Management System
Technology, Spring 224

Lecture Note 2: Web-data Extraction
Professor Chen Li

2
Readings

Sergey Brin, Extracting Patterns and Relations
from the World Wide Web, WebDB Workshop, 1998.
Anand Rajaraman and Jeffrey D. Ullman, Querying
Websites Using Compact Skeletons, PODS 2001.
Goal: Extract rich data from the Web

3
Brin's approach

Example: Book(title, author)
Find as many book titles and authors as possible.

Sample data
4
Details

1. Start with sample tuples, e.g., five book titles
and authors.
2. Find where the tuples appear on the Web. Accept a
pattern if
it identifies several examples of known tuples,
and
it is sufficiently specific that it is unlikely to
identify too much.
3. Given a set of accepted patterns, find data that
appears in these patterns, and add it to the set of
known data.
4. Repeat steps (2) and (3) several times.
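
The loop described above can be sketched in Python as follows. This is only an outline under stated assumptions: `expand_relation` and the three caller-supplied functions (`find_occurrences`, `generate_patterns`, `apply_patterns`) are hypothetical names that do not appear in the slides, and stopping early when a round adds nothing new is an added convenience.

```python
def expand_relation(seed_tuples, find_occurrences, generate_patterns,
                    apply_patterns, rounds=4):
    """Grow a set of known tuples by alternating pattern finding and matching.

    The three callables are assumptions standing in for the later slides:
      find_occurrences(known)  -> occurrences of known tuples on the Web
      generate_patterns(occs)  -> accepted patterns built from occurrences
      apply_patterns(patterns) -> tuples found by matching the patterns
    """
    known = set(seed_tuples)
    for _ in range(rounds):
        occurrences = find_occurrences(known)       # step 2, first half
        patterns = generate_patterns(occurrences)   # step 2, second half
        new_tuples = set(apply_patterns(patterns))  # step 3
        if new_tuples <= known:
            break  # nothing new was found, so further rounds cannot help
        known |= new_tuples
    return known
```

The number of rounds is a parameter because the slides only say "several times".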

5
Pattern

A pattern consists of five elements:
1. The order, i.e., whether the title appears before
the author in the text, or vice versa.
In a more general case, where tuples have more
than 2 components, the order would be the
permutation of components.
2. The URL prefix.
3. The prefix: the text just prior to the first of
the title or author.
4. The middle: the text appearing between the two data
elements.
5. The suffix: the text following the second of the
two data elements. Both the prefix and suffix
were limited to 10 characters.

6
<ul>
<li><i>Database Systems: The Complete Book</i>,
by Hector Garcia-Molina, Jeffrey Ullman, and
Jennifer Widom.<br>
<li><i>Data Mining: Concepts and Techniques</i>,
by Jiawei Han and Micheline Kamber.<br>
<li><i>Principles of Data Mining</i>, by David
J. Hand, Heikki Mannila, and Padhraic Smyth,
Cambridge, MA: MIT Press, 2001.<br>
</ul>

7
Example

Order: title, then author.
URL prefix: www.ics.uci.edu/ics215/
Prefix, middle, and suffix of the following form:
<li><i>title</i> by author<br>
The prefix is <li><i>, the middle is </i> by
(including the blank after "by"), and the
suffix is <br>.
The title is whatever appears between the prefix
and middle; the author is whatever appears
between the middle and suffix.
The intuition behind why this pattern might be
good is that there are probably lots of reading
lists among the class pages at UCI ICS.
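
As a rough illustration, the example pattern can be written down as a small record and turned into a regular expression. The `Pattern` class, the `extract` function, and the sample line are assumptions made for this sketch. The lazy `(.*?)` groups are a simplification and not necessarily the expressions Brin used.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Pattern(NamedTuple):
    title_first: bool  # the "order": True if the title precedes the author
    url_prefix: str
    prefix: str
    middle: str
    suffix: str


def extract(pattern: Pattern, text: str) -> Optional[Tuple[str, str]]:
    """Return (title, author) for the first match of the pattern, or None."""
    regex = re.compile(
        re.escape(pattern.prefix) + "(.*?)"
        + re.escape(pattern.middle) + "(.*?)"
        + re.escape(pattern.suffix)
    )
    match = regex.search(text)
    if match is None:
        return None
    first, second = match.groups()
    return (first, second) if pattern.title_first else (second, first)


reading_list = Pattern(
    title_first=True,
    url_prefix="www.ics.uci.edu/ics215/",
    prefix="<li><i>",
    middle="</i> by ",  # includes the blank after "by"
    suffix="<br>",
)

# Hypothetical line written to fit the pattern exactly.
line = "<li><i>Principles of Data Mining</i> by David J. Hand<br>"
print(extract(reading_list, line))
# ('Principles of Data Mining', 'David J. Hand')
```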

8
Constraints on patterns

Pattern specificity
is the product of the lengths of the prefix, middle,
suffix, and URL prefix.
It measures how likely we are to find the
pattern: the higher the specificity, the fewer
occurrences we expect.
To make sure a pattern is likely to be accurate,
it must meet two conditions:
There must be >= 2 known data items appearing in
it.
The product of the pattern's specificity and the
number of occurrences of data items in the
pattern must exceed a certain threshold T.
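
The acceptance test can be sketched in a few lines of Python. The function names are illustrative. The slides do not give a value for T, so the threshold is left as a parameter, and the first condition is read here as "at least two known data items".

```python
def specificity(prefix: str, middle: str, suffix: str, url_prefix: str) -> int:
    """Product of the lengths of the four textual parts of a pattern."""
    return len(prefix) * len(middle) * len(suffix) * len(url_prefix)


def is_accepted(prefix, middle, suffix, url_prefix, occurrences, threshold):
    """Apply both conditions: enough known items, and enough specificity."""
    if occurrences < 2:  # assumption: "at least two" known data items
        return False
    score = specificity(prefix, middle, suffix, url_prefix) * occurrences
    return score > threshold


# The reading-list pattern from the earlier example slide:
s = specificity("<li><i>", "</i> by ", "<br>", "www.ics.uci.edu/ics215/")
print(s)  # 7 * 8 * 4 * 23 = 5152
```

Note that an empty prefix, middle, suffix, or URL prefix makes the specificity zero, so such a pattern can never pass the threshold test.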

9
Data Occurrences

A data occurrence in a pattern consists of:
The particular title and author.
The complete URL, not just the prefix as for a
pattern.
The order, prefix, middle, and suffix of the
pattern in which the title and author occurred.
The same title and author might appear in several
different patterns.

10
Finding Data Occurrences Given Data

Given known title-author pairs, to find new
patterns, search the Web to see where these
titles and authors occur.
Assume there is a Web index:
given a word, we can find (pointers to) all pages
containing that word.
The method used is essentially a-priori:
1. Find (pointers to) pages containing any known
author.
Since author names generally consist of 2 words,
use the index for each first name and last name,
and check that the occurrences are consecutive in
the document.
2. Find (pointers to) pages containing any known
title.
Start by finding pages with each word of a title,
and then check that the words appear in order
on the page.
3. Intersect the sets of pages that have an author
and a title on them.
Only these pages need to be searched to find the
patterns in which a known title-author pair is
found. For the prefix and suffix, take the 10
surrounding characters, or fewer if there are not
as many as 10.
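
A toy version of this search, assuming the "Web" is a small in-memory dictionary from URL to page text, might look as follows. The function names and the sample pages are made up for illustration. A substring test stands in for the check that words are consecutive, and only the first occurrence of each string per page is examined.

```python
import re


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = {}
    for url, text in pages.items():
        for word in set(re.findall(r"\w+", text)):
            index.setdefault(word, set()).add(url)
    return index


def pages_with_phrase(index, pages, phrase):
    """Look up every word in the index, then verify the phrase on each page."""
    words = re.findall(r"\w+", phrase)
    if not words:
        return set()
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    return {url for url in candidates if phrase in pages[url]}


def find_occurrences(pages, index, title, author, context=10):
    """Return one occurrence record per page containing both strings."""
    both = (pages_with_phrase(index, pages, title)
            & pages_with_phrase(index, pages, author))
    found = []
    for url in sorted(both):
        text = pages[url]
        t, a = text.find(title), text.find(author)
        (s1, e1), (s2, e2) = sorted([(t, t + len(title)), (a, a + len(author))])
        if e1 > s2:
            continue  # the two strings overlap, so there is no middle
        found.append({
            "url": url,
            "title_first": t < a,
            "prefix": text[max(0, s1 - context):s1],  # up to 10 characters
            "middle": text[e1:s2],
            "suffix": text[e2:e2 + context],          # up to 10 characters
        })
    return found


pages = {
    "www.ics.uci.edu/ics215/": "<ul><li><i>Data Mining</i> by Jiawei Han<br></ul>",
    "example.org/": "Jiawei Han teaches.",
}
index = build_index(pages)
occurrences = find_occurrences(pages, index, "Data Mining", "Jiawei Han")
```

Only the first page contains both the title and the author, so `occurrences` holds a single record for it.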

11
Building Patterns from Data Occurrences

1. Group data occurrences according to their
order and middle.
E.g., one group in the "group-by" might
correspond to the order "title-then-author" and
the middle </I> by.
2. For each group, find the longest common
prefix, suffix, and URL prefix.
3. If the specificity test for the pattern is met,
accept it.
4. Otherwise,
try to split the group into two by extending the
length of the URL prefix by one character, and
repeat from step (2).
If it is impossible to split the group (because
there is only one URL), then we fail to produce a
pattern from the group.
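
The four steps can be sketched recursively in Python. The dictionary layout of an occurrence, the function names, and the sample data are assumptions for this sketch. The pattern's prefix is taken as the longest common ending of the occurrence prefixes, since that is the text immediately before the data. The pattern's suffix is the longest common beginning of the occurrence suffixes.

```python
import os
from collections import defaultdict


def common_suffix(strings):
    """Longest string that every input ends with."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def patterns_from_group(title_first, middle, group, threshold):
    # Step 2: longest common prefix, suffix, and URL prefix of the group.
    url_prefix = os.path.commonprefix([o["url"] for o in group])
    prefix = common_suffix([o["prefix"] for o in group])
    suffix = os.path.commonprefix([o["suffix"] for o in group])
    # Step 3: specificity test (at least 2 occurrences, product above T).
    spec = len(prefix) * len(middle) * len(suffix) * len(url_prefix)
    if len(group) >= 2 and spec * len(group) > threshold:
        return [{"title_first": title_first, "url_prefix": url_prefix,
                 "prefix": prefix, "middle": middle, "suffix": suffix}]
    # Step 4: split on the next URL character and retry each subgroup.
    subgroups = defaultdict(list)
    for o in group:
        subgroups[o["url"][len(url_prefix):len(url_prefix) + 1]].append(o)
    if len(subgroups) < 2:
        return []  # only one URL left: no pattern from this group
    return [p for sub in subgroups.values()
            for p in patterns_from_group(title_first, middle, sub, threshold)]


def generate_patterns(occurrences, threshold):
    # Step 1: group occurrences by order and middle.
    groups = defaultdict(list)
    for o in occurrences:
        groups[(o["title_first"], o["middle"])].append(o)
    return [p for (title_first, middle), group in groups.items()
            for p in patterns_from_group(title_first, middle, group, threshold)]


def occ(url, prefix, suffix):
    return {"url": url, "title_first": True, "middle": "</i> by ",
            "prefix": prefix, "suffix": suffix}


# Hypothetical occurrences; the threshold 5000 is chosen only for the demo.
sample = [
    occ("www.ics.uci.edu/ics184/", "<td><i>", "</td>"),
    occ("www.ics.uci.edu/ics214/", "<li><i>", "<br>"),
    occ("www.ics.uci.edu/ics215/", "<li><i>", "<br>"),
]
patterns = generate_patterns(sample, threshold=5000)
```

With this data the full group fails the test, the split on the next URL character separates ics184 from the other two pages, and only the two-page subgroup yields a pattern.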

12
Example

Suppose our group contains three URLs:
www.ics.uci.edu/ics184/
www.ics.uci.edu/ics214/
www.ics.uci.edu/ics215/
The common prefix is www.ics.uci.edu/ics.
If we have to split the group, then the next
character, 1 versus 2, breaks the group into
two:
those data occurrences on the first page (could
be many) go into one group, and
those occurrences on the other two pages go
into another.
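
The split in this example can be reproduced with a few lines of Python. This is a small sketch that uses `os.path.commonprefix` for the character-wise common prefix.

```python
import os
from collections import defaultdict

urls = [
    "www.ics.uci.edu/ics184/",
    "www.ics.uci.edu/ics214/",
    "www.ics.uci.edu/ics215/",
]

common = os.path.commonprefix(urls)

# Extend the prefix by one character and group the URLs by that character.
groups = defaultdict(list)
for url in urls:
    next_char = url[len(common):len(common) + 1]
    groups[next_char].append(url)

print(common)        # www.ics.uci.edu/ics
print(dict(groups))  # one group under '1', another under '2'
```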

13
Finding Occurrences Given Patterns

Find all URLs that match the URL prefix in at
least one pattern.
For each of those pages, scan the text using a
regular expression built from the pattern's
prefix, middle, and suffix.
From each match, extract the title and author,
according to the order specified in the pattern.
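
A minimal sketch of this scan, assuming pages are held in a dictionary from URL to text and patterns are plain dictionaries, is shown below. Both layouts, the function name `find_tuples`, and the sample pages are illustrative assumptions. The lazy `(.+?)` groups are a simplification of whatever expression a real system would use.

```python
import re


def find_tuples(patterns, pages):
    """Return the set of (title, author) pairs matched by any pattern."""
    found = set()
    for p in patterns:
        regex = re.compile(
            re.escape(p["prefix"]) + "(.+?)"
            + re.escape(p["middle"]) + "(.+?)"
            + re.escape(p["suffix"])
        )
        for url, text in pages.items():
            if not url.startswith(p["url_prefix"]):
                continue  # the page is outside this pattern's URL prefix
            for first, second in regex.findall(text):
                # The order says which captured group is the title.
                found.add((first, second) if p["title_first"] else (second, first))
    return found


patterns = [{
    "title_first": True,
    "url_prefix": "www.ics.uci.edu/ics21",
    "prefix": "<li><i>",
    "middle": "</i> by ",
    "suffix": "<br>",
}]

# Hypothetical pages; the second one has matching markup but the wrong URL.
pages = {
    "www.ics.uci.edu/ics215/": "<li><i>Book A</i> by Author X<br>"
                               "<li><i>Book B</i> by Author Y<br>",
    "www.example.org/": "<li><i>Book C</i> by Author Z<br>",
}
tuples = find_tuples(patterns, pages)
```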

14
Results

24M pages, 147GB
Start with 5 (book, author) pairs
First round: 199 occurrences, 3 patterns, 4047
unique (book, author) pairs
After four rounds, found about 15,000 tuples. About
95% were true title-author pairs.
Data quality is good.

15
RU's approach

Model data values as a graph
Compute skeletons from the graph
Use a skeleton to extract data to populate tables
Used in Junglee, a legendary database startup

16
Example
17
Data graph G

A DAG
Each node is an information element
E.g., A(ddress), T(itle), S(alary)
Can be extracted using predefined regular
expressions
Can have a value, or NULL.
An edge represents a relationship between two
elements
Could be between pages (web link)
Or could be within one page

18
Relation schema

The table we want to populate
Has a given set of attributes X
Each attribute A has a domain Dom(A)
Problem formulation:
Given a data graph G, and a relation R over
attribute set X
Use the data graph to populate the table

19
Skeleton

A tree
Each node is an attribute in X (could be NULL)
Intuition: a pattern/layout of data in the graph
Overlay of a skeleton on the data graph:
a skeleton node matches a graph node, i.e., they
have the same attribute
May use an overlay to extract tuples
Perfect skeleton: a skeleton K is perfect for
data graph G if for each edge e in G, there is an
overlay using K that includes e.

20
Perfect skeletons
[Figure: a data graph with skeleton K1 (good) and skeleton K2 (bad)]

K1 tends to give us the right tuples
K2 can give us wrong tuples
Intuitively, in K1, information elements are
closer together

21
Compact skeletons
[Figure: a data graph, a compact skeleton (K1), and a skeleton that is not compact (K2)]

K is a compact skeleton for G if
for each node u in G, there is a node v in K
such that for any overlay from K to G in which u
participates, u is mapped to v.
Why is K2 not compact? Consider the overlays that
result in the tuples (a_1, t_1, s_3) and (a_2, t_3, s_1):
the children of the root of G get mapped to
different nodes of the skeleton K2 by the two
overlays.
Perfect compact skeleton (PCS): perfect and compact.
That's what we want!
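
The definitions of overlay, perfect, and compact can be made concrete with a small Python sketch. It rests on one reading of the slides that the paper may refine: an overlay is taken to be a mapping from skeleton nodes to graph nodes that preserves attribute labels (NULL is written as `None`) and maps every skeleton edge onto a graph edge. The graph, the two skeletons, and all function names are hypothetical. They are not the K1 and K2 of the lost figures, and this brute-force enumeration is not the PCS algorithm from the paper.

```python
def overlays(k_root, k_label, k_children, g_label, g_edges):
    """Yield every overlay as a dict from skeleton node to graph node."""
    succ = {}
    for u, v in g_edges:
        succ.setdefault(u, []).append(v)

    def embed(k_node, g_node):
        if k_label[k_node] != g_label[g_node]:
            return  # labels differ, so this pair cannot match
        partials = [{k_node: g_node}]
        for child in k_children.get(k_node, []):
            partials = [{**partial, **sub}
                        for partial in partials
                        for g_child in succ.get(g_node, [])
                        for sub in embed(child, g_child)]
        yield from partials

    for g_node in g_label:
        yield from embed(k_root, g_node)


def is_perfect(k_root, k_label, k_children, g_label, g_edges):
    """Every graph edge is included in at least one overlay."""
    covered = set()
    for h in overlays(k_root, k_label, k_children, g_label, g_edges):
        for parent, kids in k_children.items():
            covered.update((h[parent], h[kid]) for kid in kids)
    return covered >= set(g_edges)


def is_compact(k_root, k_label, k_children, g_label, g_edges):
    """Each graph node is always matched by the same skeleton node."""
    images = {}
    for h in overlays(k_root, k_label, k_children, g_label, g_edges):
        for k_node, g_node in h.items():
            images.setdefault(g_node, set()).add(k_node)
    return all(len(k_nodes) == 1 for k_nodes in images.values())


# Hypothetical graph: a NULL root, two NULL record nodes, each with A and T.
g_label = {"r": None, "n1": None, "n2": None,
           "a1": "A", "t1": "T", "a2": "A", "t2": "T"}
g_edges = [("r", "n1"), ("r", "n2"), ("n1", "a1"), ("n1", "t1"),
           ("n2", "a2"), ("n2", "t2")]

# Skeleton 1: root -> record -> {A, T}
k1 = ("k0", {"k0": None, "k1": None, "kA": "A", "kT": "T"},
      {"k0": ["k1"], "k1": ["kA", "kT"]})
# Skeleton 2: root -> x -> A and root -> y -> T (A and T under separate nodes)
k2 = ("k0", {"k0": None, "kx": None, "ky": None, "kA": "A", "kT": "T"},
      {"k0": ["kx", "ky"], "kx": ["kA"], "ky": ["kT"]})

print(is_perfect(*k1, g_label, g_edges), is_compact(*k1, g_label, g_edges))
print(is_perfect(*k2, g_label, g_edges), is_compact(*k2, g_label, g_edges))
```

Under these assumptions both skeletons are perfect, but only the first is compact. The second can match its two NULL nodes to different record nodes, which pairs a1 with t2 and so produces a wrong tuple.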

22
Computing PCSs

Not every graph has a PCS
PCS is unique
An algorithm for computing PCS
Complexity: O(k m |V_G|)
k: # of attributes in the relation
m: # of nodes in the largest subgraph that
has a PCS

23
Other results

Partially PCS (PPCS):
Deal with incomplete information
An algorithm for computing PPCS
Use PCS or PPCS to populate the relation, and
answer queries
Deal with noisy data graphs
