CS224: Advances in Database Management System Technology Spring 224

1
CS224: Advances in Database Management System Technology
Spring 224
  • Lecture Note 2: Web-data Extraction
  • Professor Chen Li

2
Readings
  • Sergey Brin, Extracting Patterns and Relations
    from the World Wide Web, WebDB Workshop, 1998.
  • Anand Rajaraman and Jeffrey D. Ullman, Querying
    Websites Using Compact Skeletons, PODS 2001.
  • Goal: Extract rich data from the Web

3
Brin's approach
  • Example: Book(title, author)
  • Find as many book titles and authors as possible

[Figure: sample data]
4
Details
  • 1. Start with sample tuples, e.g., five book titles
    and authors.
  • 2. Find where the tuples appear on the Web, and derive
    patterns from those occurrences. Accept a pattern if
  • it identifies several examples of known tuples,
    and
  • it is sufficiently specific that it is unlikely to
    identify too much.
  • 3. Given a set of accepted patterns, find data that
    appears in these patterns, and add it to the set of
    known data.
  • 4. Repeat steps (2) and (3) several times.
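
The loop described on this slide can be sketched as follows. This is a minimal sketch under stated assumptions, not Brin's implementation: `find_patterns` and `apply_patterns` are hypothetical callables standing in for steps (2) and (3), and the default of four rounds is only an illustrative choice.

```python
from typing import Callable, Iterable, Set, Tuple

Pair = Tuple[str, str]  # (title, author)


def bootstrap(
    seeds: Iterable[Pair],
    find_patterns: Callable[[Set[Pair]], list],   # step 2 (hypothetical helper)
    apply_patterns: Callable[[list], Set[Pair]],  # step 3 (hypothetical helper)
    rounds: int = 4,
) -> Set[Pair]:
    """Grow a set of known tuples by alternating pattern discovery and extraction."""
    known = set(seeds)
    for _ in range(rounds):
        patterns = find_patterns(known)
        new_pairs = apply_patterns(patterns) - known
        if not new_pairs:
            # Stop early once a round adds nothing to the known set.
            break
        known |= new_pairs
    return known
```

Passing the two steps in as functions keeps the loop independent of how pages are indexed or scanned.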

5
Pattern
  • A pattern consists of five elements:
  • 1. The order, i.e., whether the title appears prior
    to the author in the text, or vice versa.
  • In a more general case, where tuples have more
    than 2 components, the order would be the
    permutation of components.
  • 2. The URL prefix.
  • 3. The prefix of text, just prior to the first of
    the title or author.
  • 4. The middle text appearing between the two data
    elements.
  • 5. The suffix of text following the second of the
    two data elements. Both the prefix and suffix
    were limited to 10 characters.
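
One way to hold these five elements is a small record type. The sketch below is only illustrative: the class name `Pattern`, its field names, and the choice to reject over-long context strings with an exception are assumptions, not part of the lecture.

```python
from dataclasses import dataclass

MAX_CONTEXT = 10  # limit on prefix and suffix length stated on the slide


@dataclass(frozen=True)
class Pattern:
    title_first: bool  # order: True if the title comes before the author
    url_prefix: str
    prefix: str        # text just before the first data element
    middle: str        # text between the two data elements
    suffix: str        # text just after the second data element

    def __post_init__(self):
        # Enforce the 10-character limit on prefix and suffix.
        for name in ("prefix", "suffix"):
            if len(getattr(self, name)) > MAX_CONTEXT:
                raise ValueError(f"{name} is longer than {MAX_CONTEXT} characters")


example = Pattern(True, "www.ics.uci.edu/ics215/", "<li><i>", "</i> by ", "<br>")
```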

6
  • <ul>
  • <li><i>Database Systems: The Complete Book</i>,
    by Hector Garcia-Molina, Jeffrey Ullman, and
    Jennifer Widom.<br>
  • <li><i>Data Mining: Concepts and Techniques</i>,
    by Jiawei Han and Micheline Kamber.<br>
  • <li><i>Principles of Data Mining</i>, by David
    J. Hand, Heikki Mannila, and Padhraic Smyth,
    Cambridge, MA: MIT Press, 2001.<br>
  • </ul>

7
Example
  • Order: title, then author.
  • URL prefix: www.ics.uci.edu/ics215/
  • Prefix, middle, and suffix of the following form:
  • <li><i>title</i> by author<br>
  • The prefix is <li><i>, the middle is </i> by
    (including the blank after "by"), and the
    suffix is <br>.
  • The title is whatever appears between the prefix
    and middle; the author is whatever appears
    between the middle and suffix.
  • The intuition behind why this pattern might be
    good is that there are probably lots of reading
    lists among the class pages at UCI ICS.
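
To make the example concrete, here is a minimal sketch that turns a prefix, middle, and suffix into a regular expression and runs it over two entries of the reading list from the previous slide. Two assumptions are worth flagging: the middle used here includes the comma that appears in that list, and the non-greedy capture groups are just one possible way to delimit title and author.

```python
import re

html = (
    "<ul>\n"
    "<li><i>Database Systems: The Complete Book</i>, "
    "by Hector Garcia-Molina, Jeffrey Ullman, and Jennifer Widom.<br>\n"
    "<li><i>Data Mining: Concepts and Techniques</i>, "
    "by Jiawei Han and Micheline Kamber.<br>\n"
    "</ul>"
)

# Assumed middle: includes the comma found in the list above.
prefix, middle, suffix = "<li><i>", "</i>, by ", "<br>"

# Escape the literal context and capture whatever lies between the pieces.
regex = re.compile(
    re.escape(prefix) + "(.*?)" + re.escape(middle) + "(.*?)" + re.escape(suffix),
    re.DOTALL,
)

pairs = regex.findall(html)
for title, author in pairs:
    print(title, "|", author)
```

With this simple pattern the whole author list, including the trailing period, is captured as one author string. An entry with trailing publisher text would pull that text into the author field as well.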

8
Constraints on patterns
  • Pattern specificity
  • It is the product of the lengths of the prefix, middle,
    suffix, and URL prefix.
  • It measures how likely we are to find the
    pattern: the higher the specificity, the fewer
    occurrences we expect.
  • To make sure a pattern is likely to be accurate,
    it must meet two conditions:
  • 1. There must be ≥ 2 known data items appearing in
    it.
  • 2. The product of the pattern's specificity and the
    number of occurrences of data items in the
    pattern must exceed a certain threshold T.
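
The two conditions can be written down directly. The sketch below reads the first condition as "at least 2 known data items", which is an assumption. The slides give no value for the threshold T, so it is left as a caller-supplied parameter, and the function names are hypothetical.

```python
def specificity(url_prefix: str, prefix: str, middle: str, suffix: str) -> int:
    """Product of the lengths of the four string components of a pattern."""
    return len(url_prefix) * len(prefix) * len(middle) * len(suffix)


def is_accepted(spec: int, n_items: int, n_occurrences: int, threshold: float) -> bool:
    """Apply both acceptance conditions; `threshold` plays the role of T."""
    # Condition 1 is read here as "at least 2 known data items" (assumption).
    return n_items >= 2 and spec * n_occurrences > threshold


# Specificity of the example pattern for www.ics.uci.edu/ics215/.
spec = specificity("www.ics.uci.edu/ics215/", "<li><i>", "</i> by ", "<br>")
```

For the example pattern the component lengths are 23, 7, 8, and 4, so the specificity is 23 × 7 × 8 × 4 = 5152.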

9
Data Occurrences
  • A data occurrence in a pattern consists of:
  • The particular title and author.
  • The complete URL, not just the prefix as for a
    pattern.
  • The order, prefix, middle, and suffix of the
    pattern in which the title and author occurred.
  • The same title and author might appear in several
    different patterns.

10
Finding Data Occurrences Given Data
  • Given known title-author pairs, to find new
    patterns, search the Web to see where these
    titles and authors occur.
  • Assume there is a Web index:
  • Given a word, we can find (pointers to) all pages
    containing that word.
  • The method used is essentially a-priori:
  • 1. Find (pointers to) pages containing any known
    author.
  • Since author names generally consist of 2 words,
    use the index for each first name and last name,
    and check that the occurrences are consecutive in
    the document.
  • 2. Find (pointers to) pages containing any known
    title.
  • Start by finding pages with each word of a title,
    and then check that the words appear in order
    on the page.
  • 3. Intersect the sets of pages that have an author
    and a title on them.
  • Only these pages need to be searched to find the
    patterns in which a known title-author pair is
    found. For the prefix and suffix, take the 10
    surrounding characters, or fewer if there are not
    as many as 10.
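
The filtering described above can be sketched with a toy in-memory index. Everything here is an assumption made for illustration: the dictionary `pages` stands in for the Web, `build_index` for a real Web index, and a plain substring test is used for the "words are consecutive and in order" check.

```python
import re
from typing import Dict, Iterable, Set, Tuple


def build_index(pages: Dict[str, str]) -> Dict[str, Set[str]]:
    """Toy word -> set-of-URLs index, standing in for a real Web index."""
    index: Dict[str, Set[str]] = {}
    for url, text in pages.items():
        for word in re.findall(r"\w+", text):
            index.setdefault(word, set()).add(url)
    return index


def pages_with_phrase(phrase: str, index: Dict[str, Set[str]],
                      pages: Dict[str, str]) -> Set[str]:
    """Use the index per word, then keep pages where the phrase appears intact."""
    words = re.findall(r"\w+", phrase)
    if not words:
        return set()
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    return {url for url in candidates if phrase in pages[url]}


def candidate_pages(pairs: Iterable[Tuple[str, str]], index: Dict[str, Set[str]],
                    pages: Dict[str, str]) -> Set[str]:
    """Pages that contain some known title and some known author."""
    pairs = list(pairs)
    author_pages = set().union(*(pages_with_phrase(a, index, pages) for _, a in pairs))
    title_pages = set().union(*(pages_with_phrase(t, index, pages) for t, _ in pairs))
    return author_pages & title_pages
```

Only the pages returned by `candidate_pages` would then be scanned in full to record the order, prefix, middle, and suffix of each occurrence.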

11
Building Patterns from Data Occurrences
  • 1. Group data occurrences according to their
    order and middle.
  • E.g., one group in the "group-by" might
    correspond to the order "title-then-author" and
    the middle </I> by.
  • 2. For each group, find the longest common
    prefix, suffix, and URL prefix.
  • 3. If the specificity test for the pattern is met,
    accept it.
  • 4. Otherwise,
  • try to split the group into two by extending the
    length of the URL prefix by one character, and
    repeat from step (2).
  • If it is impossible to split the group (because
    there is only one URL), then we fail to produce a
    pattern from the group.
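
Steps 1 and 2 can be sketched as follows; the specificity test and the splitting step are left out. The `Occurrence` record and function names are hypothetical. One interpretation is made explicit: the common "prefix" is computed as the shared tail of the text preceding the first element, since that is the part adjacent to the data. The slide does not spell this out.

```python
import os
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Occurrence(NamedTuple):
    title: str
    author: str
    url: str
    title_first: bool
    prefix: str
    middle: str
    suffix: str


def common_suffix(strings: List[str]) -> str:
    """Longest common suffix, via the common prefix of the reversed strings."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def group_occurrences(occs: List[Occurrence]) -> Dict[Tuple[bool, str], List[Occurrence]]:
    """Step 1: group by (order, middle)."""
    groups = defaultdict(list)
    for occ in occs:
        groups[(occ.title_first, occ.middle)].append(occ)
    return dict(groups)


def summarize(group: List[Occurrence]) -> Dict[str, str]:
    """Step 2: longest context shared by every occurrence in the group."""
    return {
        "url_prefix": os.path.commonprefix([o.url for o in group]),
        # Shared text immediately before the first element (assumed reading).
        "prefix": common_suffix([o.prefix for o in group]),
        # Shared text immediately after the second element.
        "suffix": os.path.commonprefix([o.suffix for o in group]),
    }
```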

12
Example
  • Suppose our group contains three URLs:
  • www.ics.uci.edu/ics184/
  • www.ics.uci.edu/ics214/
  • www.ics.uci.edu/ics215/
  • The common prefix is www.ics.uci.edu/ics
  • If we have to split the group, then the next
    character, 1 versus 2, breaks the group into
    two:
  • those data occurrences on the first page (could
    be many) go into one group,
  • those occurrences on the other two pages go
    into another.
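
The split in this example can be reproduced with a few lines. This is a sketch that works on bare URLs rather than full data occurrences, and the function name is hypothetical.

```python
import os
from collections import defaultdict
from typing import Dict, List, Tuple


def split_by_next_char(urls: List[str]) -> Tuple[str, Dict[str, List[str]]]:
    """Partition URLs by the character that follows their longest common prefix."""
    common = os.path.commonprefix(urls)
    buckets = defaultdict(list)
    for url in urls:
        # A URL equal to the common prefix has no next character; its key is "".
        buckets[url[len(common):len(common) + 1]].append(url)
    return common, dict(buckets)


urls = [
    "www.ics.uci.edu/ics184/",
    "www.ics.uci.edu/ics214/",
    "www.ics.uci.edu/ics215/",
]
common, buckets = split_by_next_char(urls)
print(common)
print(buckets)
```

If every URL in the group is identical, only one bucket comes back, which corresponds to the case where the group cannot be split.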

13
Finding Occurrences Given Patterns
  • Find all URLs that match the URL prefix in at
    least one pattern.
  • For each of those pages, scan the text using a
    regular expression built from the pattern's
    prefix, middle, and suffix.
  • From each match, extract the title and author,
    according to the order specified in the pattern.
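
Putting the three bullets together, a minimal sketch over an in-memory collection of pages might look like this. The representation of a pattern as a plain tuple, the dictionary `pages`, and the non-greedy capture groups are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Set, Tuple

# (title_first, url_prefix, prefix, middle, suffix): a plain-tuple stand-in for a pattern
PatternT = Tuple[bool, str, str, str, str]


def extract(pages: Dict[str, str], patterns: List[PatternT]) -> Set[Tuple[str, str]]:
    """Return the (title, author) pairs matched by any pattern on any eligible page."""
    found: Set[Tuple[str, str]] = set()
    for title_first, url_prefix, prefix, middle, suffix in patterns:
        regex = re.compile(
            re.escape(prefix) + "(.*?)" + re.escape(middle) + "(.*?)" + re.escape(suffix),
            re.DOTALL,
        )
        for url, text in pages.items():
            # Only pages whose URL starts with the pattern's URL prefix are scanned.
            if not url.startswith(url_prefix):
                continue
            for first, second in regex.findall(text):
                # The pattern's order decides which capture is the title.
                title, author = (first, second) if title_first else (second, first)
                found.add((title, author))
    return found
```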

14
Results
  • 24M pages, 147GB
  • Start with 5 (book, author) pairs
  • First round: 199 occurrences, 3 patterns, 4047
    unique (book, author) pairs
  • After four rounds, found about 15,000 tuples. About
    95% were true title-author pairs.
  • Data quality is good.

15
RU's approach (Rajaraman and Ullman)
  • Model data values as a graph
  • Compute skeletons from the graph
  • Use a skeleton to extract data to populate tables
  • Used in Junglee, a legendary database startup

16
Example
17
Data graph G
  • A DAG
  • Each node is an information element
  • E.g., A(ddress), T(itle), S(alary)
  • Can be extracted using predefined regular
    expressions
  • Can have a value, or NULL
  • An edge represents a relationship between two
    elements
  • Could be between pages (web link)
  • Or could be within one page
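
A data graph of this kind can be held in a small structure. The sketch below is an assumption-laden illustration, not the representation used by Rajaraman and Ullman: node ids, the `DataGraph` class, and the use of Python's `graphlib` to confirm the graph is acyclic are all choices made here.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Optional, Tuple


@dataclass
class DataGraph:
    # node id -> (attribute label such as "A", "T", "S", value or None for NULL)
    nodes: Dict[str, Tuple[str, Optional[str]]] = field(default_factory=dict)
    # node id -> ids of child nodes (a link between pages or within one page)
    edges: Dict[str, List[str]] = field(default_factory=dict)

    def add_node(self, node_id: str, attribute: str, value: Optional[str] = None) -> None:
        self.nodes[node_id] = (attribute, value)
        self.edges.setdefault(node_id, [])

    def add_edge(self, parent: str, child: str) -> None:
        self.edges.setdefault(parent, []).append(child)

    def is_dag(self) -> bool:
        """True if the edges contain no cycle."""
        try:
            # Edge direction does not matter for cycle detection.
            TopologicalSorter(self.edges).prepare()
            return True
        except CycleError:
            return False
```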

18
Relation schema
  • The table we want to populate
  • Has a given set of attributes X
  • Each attribute A has a domain Dom(A)
  • Problem formulation:
  • Given a data graph G and a relation R over
    attribute set X,
  • use the data graph to populate the table

19
Skeleton
  • A tree
  • Each node is an attribute in X (could be NULL)
  • Intuition: a pattern/layout of data in the graph
  • Overlay of a skeleton on the data graph
  • A skeleton node matches a graph node, i.e., they
    have the same attribute
  • May use an overlay to extract tuples
  • Perfect skeleton: a skeleton K is perfect for
    data graph G if, for each edge e in G, there is an
    overlay using K that includes e.

20
Perfect skeletons
[Figure: a data graph with two skeletons, K1 (good) and K2 (bad)]
  • K1 tends to give us the right tuples
  • K2 can give us wrong tuples
  • Intuitively, in K1, information elements are
    closer
21
Compact skeletons
[Figure: a data graph with a compact skeleton (K1) and a non-compact skeleton (K2)]
  • K is a compact skeleton for G if:
  • For each node u in G, there is a node v in K
    such that, for any overlay from K to G in which u
    participates, u is mapped to v.
  • Why is K2 not compact? Consider the overlays that
    result in the tuples a_1 t_1 s_3 and a_2 t_3 s_1:
    the children of the root of G get mapped to
    different nodes of the skeleton K2 by the two
    overlays.
  • Perfect compact skeleton (PCS): perfect and compact
  • That's what we want!

22
Computing PCSs
  • Not every graph has a PCS
  • When a PCS exists, it is unique
  • An algorithm for computing the PCS
  • Complexity: O(k · m · |V_G|), where V_G is the node set of G
  • k: number of attributes in the relation
  • m: number of nodes in the largest subgraph that
    has a PCS

23
Other results
  • Partially perfect compact skeleton (PPCS)
  • Deals with incomplete information
  • An algorithm for computing the PPCS
  • Use the PCS or PPCS to populate the relation and
    answer queries
  • Deal with noisy data graphs