KnowItNow: Fast, Scalable Information Extraction from the Web - PowerPoint PPT Presentation

1 / 27
About This Presentation
Title:

KnowItNow: Fast, Scalable Information Extraction from the Web

Description:

KnowItNow: Fast, Scalable Information Extraction from the Web Michael J. Cafarella, Doug Downey, Stephen Soderland, Oren Etzioni – PowerPoint PPT presentation

Number of Views:47
Avg rating:3.0/5.0
Slides: 28
Provided by: Simi158
Category:

less

Transcript and Presenter's Notes

Title: KnowItNow: Fast, Scalable Information Extraction from the Web


1
KnowItNow Fast, Scalable Information Extraction
from the Web
  • Michael J. Cafarella, Doug Downey, Stephen
    Soderland, Oren Etzioni

2
The Problem
  • Numerous NLP applications rely on search-engine
    queries to
  • Extract information from the web.
  • Compute statistics over the Web corpus.
  • Search engines are extremely helpful for several
    linguistic tasks such as
  • Computing usage statistics.
  • Finding a subset of web documents to analyze in
    depth.

3
Problem With Search Engines
  • Search engines were not designed as building
    blocks for NLP applications. As a result
  • An NLP application is forced to issue literally
    millions of queries to search engines increasing
    processing time and limiting scalability.
  • Fetching web documents is also time-consuming.
  • Search engines are limiting the use of
    programmatic queries to their engines
  • Google has placed hard quotas on the number of
    daily queries a program can issue.
  • Other engines force applications to introduce
    courtesy waits between queries.

4
Example of the Problem KnowItAll
  • KnowItAll works in a generate-and-test
    architecture extracting Information in 2 stages
  • First, it Utilizes a small set of domain
    independent extraction patterns to generate
    candidate facts.
  • Second, it automatically tests the plausibility
    of the candidate facts it extracts using
    pointwise mutual information (PMI) statistics
    computed from search-engine hit counts.

5
1st Stage in KnowItAll
  • Take the generic pattern NP1 such as NPList2.
  • This indicates that the head of each simple noun
    phrase (NP) in NPList2 is a member of the class
    named in NP1.
  • Take as example the pattern for class City, and
    the sentence We provide tours to cities such as
    Paris, London, and Berlin.
  • KNOWITALL extracts three candidate cities from
    the sentence Paris, London, Berlin.

6
2nd Stage in KnowItAll
  • KnowItAll needs to assess the likelihood of the
    information it found.
  • Verify that Paris is actually a city.
  • It does that by computing the PMI between Paris
    and a set of k discriminator phrases that tend to
    have high mutual information with city names.
    (Paris is a city)
  • This requires at least k search-engine queries
    for every candidate extraction!

7
The Solution
  • A novel architecture for Information Extraction
    which does not depend on Web search-engine
    queries KnowItNow.
  • Works over 2 stages like KnowItAll
  • Uses a specialized search engine called the
    Binding Engine (or BE) which efficiently returns
    bindings in response to variabilized queries.
  • Uses URNS, a combinatorial model, which estimates
    the probability that each extraction is correct
    without using any additional search engine queries

8
The Binding Engine vs. The Traditional Engine
9
The Traditional Engine
  • Take the search query (Cities such as
    ltNounPhrasegt).
  • Perform a traditional search engine query.
  • For each such URL
  • obtain the document contents.
  • find the searched-for terms in the document text.
  • Run the noun phrase recognizer to determine if
    text found satisfies the linguistic type
    requirement
  • If it does, return the string.

10
Problems With Traditional Engine
  • The search itself doesnt take a long time. Even
    if there are multiple search queries
  • The second stage fetches a large number of
    documents, each fetch likely resulting in a
    random disk seek this stage executes slowly.
  • this disk access is slow regardless of whether it
    happens on a locally-cached copy or on a remote
    document server.

11
The Binding Engine
  • Why not use a table to store a list of terms and
    documents containing them?!
  • The Binding Engine supports these queries
  • Typed variables (such as NounPhrase)
  • String-processing functions (such as head(X) or
    ProperNoun(X)).
  • Standard query terms.
  • It processes a variable by returning every
    possible string in the corpus that has a matching
    type, and that can be substituted for the
    variable and still satisfy the user's query.

12
How the Binding Engine Works?
  • It uses a novel approach called the neighborhood
    index
  • The neighborhood index is an augmented inverted
    index structure.
  • For each term in the corpus, the index keeps a
    list of documents in which the term appears and a
    list of positions where the term occurs.
  • The index also keeps a list of left-hand and
    right-hand neighbors at each position. (Adjacent
    text strings that satisfy a recognizer, e.g.
    NounPhrase)

13
How is The Binding Engine Better?
  • K is the number of concrete terms in the query.
  • B is the number of variable bindings found in the
    corpus.
  • N is the number of documents in the corpus.
  • Expensive processing such as part-of-speech
    tagging or shallow syntactic parsing is performed
    only once, while building the index, and is not
    needed at query time.

14
How is The Binding Engine Better?
  • Average time to return the relevant bindings
  • in response to a set of queries.
  • 0.06 CPU minutes for BE.
  • 8.16 CPU minutes for Nutch (Private search
    engine)

15
Disadvantages of The Binding Engine
  • It consumes a large amount of disk space, as
    parts of the corpus text are folded into the
    index several times.
  • The neighborhood index increased disk space four
    times that of a standard inverted index

16
The URNS Model
  • We need a way to test that the extractions from
    the Binding Engine are correct
  • KnowItAll issues queries to search engines and
    uses the PMI model to verify extractions.
  • PMI is very efficient but it is also very slow.

17
How URNS works?
  • URNS is a probabilistic model
  • It takes the form of a classic balls-and-urns
    model from combinatorics.
  • Each extraction is modeled as a labeled ball in
    an urn.
  • A label represents either an instance of the
    target class or relation, or represents an error

18
How URNS works?
  • C - the set of unique target labels C is the
    number of unique target labels in the urn.
  • E - the set of unique error labels E is the
    number of unique error labels in the urn.
  • num(b) - the function giving the number of balls
    labeled by b where b is a subset of C U E.
  • num(B) is the multi-set giving the number of
    balls for each label b, where b is a subset of B.

19
How URNS works?
  • The goal of an IE system is to discern which of
    the labels it extracts are in fact elements of C.
  • Given that a particular label x was extracted k
    times in a set of n draws from the urn, what is
    the probability that x is a subset of C?

20
Alternative to URNS
  • Items that were extracted more often are more
    likely to be true.
  • i.e. Extractions with higher frequencies are
    true.

21
Experiments
  • Recall how many distinct extractions does each
    system return at high precision?
  • Time how long did each system take to produce
    and rank its extractions?
  • Extraction Rate how many distinct high quality
    extractions does the system return per minute?
    The extraction rate is simply recall divided by
    time.

22
KnowItNow vs. KnowItAllTested on relation
Country
23
KnowItNow vs. KnowItAllTested on relation
CapitalOf
24
KnowItNow vs. KnowItAllTested on relation Corp
25
KnowItNow vs. KnowItAllTested on relation CeoOf
26
KnowItNow vs. KnowItAll
27
Contributions
  • A novel architecture for Information Extraction
    which does not depend on Web search-engine
    queries.
  • Extract tens of thousands of facts from the Web
    in minutes instead of days.
  • KnowItNow's extraction rate is two to three
    orders of magnitude greater than KnowItAll's.
Write a Comment
User Comments (0)
About PowerShow.com