Title: KnowItNow: Fast, Scalable Information Extraction from the Web
1. KnowItNow: Fast, Scalable Information Extraction from the Web
- Michael J. Cafarella, Doug Downey, Stephen Soderland, Oren Etzioni
2. The Problem
- Numerous NLP applications rely on search-engine queries to
  - Extract information from the Web.
  - Compute statistics over the Web corpus.
- Search engines are extremely helpful for several linguistic tasks, such as
  - Computing usage statistics.
  - Finding a subset of Web documents to analyze in depth.
3. Problems with Search Engines
- Search engines were not designed as building blocks for NLP applications. As a result:
  - An NLP application is forced to issue literally millions of queries to search engines, increasing processing time and limiting scalability.
  - Fetching web documents is also time-consuming.
- Search engines are limiting the use of programmatic queries to their engines:
  - Google has placed hard quotas on the number of daily queries a program can issue.
  - Other engines force applications to introduce courtesy waits between queries.
4. Example of the Problem: KnowItAll
- KnowItAll works in a generate-and-test architecture, extracting information in two stages:
  - First, it utilizes a small set of domain-independent extraction patterns to generate candidate facts.
  - Second, it automatically tests the plausibility of the candidate facts it extracts using pointwise mutual information (PMI) statistics computed from search-engine hit counts.
5. 1st Stage in KnowItAll
- Take the generic pattern "NP1 such as NPList2".
- This indicates that the head of each simple noun phrase (NP) in NPList2 is a member of the class named in NP1.
- Take as an example the pattern for the class City and the sentence "We provide tours to cities such as Paris, London, and Berlin."
- KnowItAll extracts three candidate cities from the sentence: Paris, London, and Berlin (a sketch of this pattern match follows).
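A minimal sketch of how such a pattern can be applied, assuming a plain regular expression stands in for KnowItAll's extraction machinery and that the comma/"and"-separated chunks are the simple noun phrases; the class string "cities" and the helper names are illustrative only:

# Illustrative only: a regex-based stand-in for the "<Class> such as <NPList>" pattern.
import re

def extract_candidates(sentence, class_plural="cities"):
    """Return candidate instances of the class found via the 'such as' pattern."""
    pattern = re.compile(re.escape(class_plural) + r"\s+such\s+as\s+([^.]+)", re.IGNORECASE)
    match = pattern.search(sentence)
    if not match:
        return []
    # Treat each comma/"and"-separated chunk of the NP list as a candidate.
    chunks = re.split(r",|\band\b", match.group(1))
    return [c.strip() for c in chunks if c.strip()]

print(extract_candidates("We provide tours to cities such as Paris, London, and Berlin."))
# -> ['Paris', 'London', 'Berlin']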
6. 2nd Stage in KnowItAll
- KnowItAll needs to assess the likelihood of the information it found.
  - Verify that Paris is actually a city.
- It does that by computing the PMI between "Paris" and a set of k discriminator phrases that tend to have high mutual information with city names (e.g. "Paris is a city"), as sketched below.
- This requires at least k search-engine queries for every candidate extraction!
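A rough sketch of this PMI-style assessment, assuming hit counts are available through a hits() function supplied by the caller; the discriminator strings shown are illustrative, not KnowItAll's actual set:

# Hedged sketch of PMI scoring from search-engine hit counts.

def pmi(instance, discriminator, hits):
    """Co-occurrence hits for the instantiated discriminator, normalized by hits for the instance alone."""
    phrase = discriminator.format(instance)        # e.g. "Paris is a city"
    return hits(phrase) / max(hits(instance), 1)

def assess(instance, discriminators, hits):
    # One hit-count query per discriminator (plus one for the instance itself),
    # which is why this stage needs at least k search-engine queries per extraction.
    return [pmi(instance, d, hits) for d in discriminators]

# Example discriminators for the class City (illustrative):
city_discriminators = ["{} is a city", "cities such as {}", "the city of {}"]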
7. The Solution
- A novel architecture for Information Extraction which does not depend on Web search-engine queries: KnowItNow.
- Works over two stages, like KnowItAll:
  - Uses a specialized search engine called the Binding Engine (or BE), which efficiently returns bindings in response to variabilized queries.
  - Uses URNS, a combinatorial model, which estimates the probability that each extraction is correct without using any additional search-engine queries.
8. The Binding Engine vs. the Traditional Engine
9. The Traditional Engine
- Take the search query ("cities such as <NounPhrase>").
- Perform a traditional search-engine query.
- For each returned URL:
  - Obtain the document contents.
  - Find the searched-for terms in the document text.
  - Run the noun-phrase recognizer to determine if the text found satisfies the linguistic type requirement.
  - If it does, return the string (a sketch of this loop follows).
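A sketch of that loop, with search, fetch_document, and noun_phrases_matching passed in as placeholder callables; they are not part of any real search-engine API:

# Hedged sketch of the traditional pipeline described above.

def traditional_extract(query, search, fetch_document, noun_phrases_matching):
    results = []
    for url in search(query):            # the search itself is fast
        text = fetch_document(url)       # slow: a fetch (and likely a random disk seek) per document
        # Linguistic processing (NP recognition) happens here, at query time,
        # for every document returned by the search.
        results.extend(noun_phrases_matching(text, query))
    return results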
10. Problems with the Traditional Engine
- The search itself doesn't take long, even if there are multiple search queries.
- The second stage fetches a large number of documents, with each fetch likely resulting in a random disk seek, so this stage executes slowly.
- This disk access is slow regardless of whether it happens on a locally-cached copy or on a remote document server.
11. The Binding Engine
- Why not use a table to store a list of terms and the documents containing them?
- The Binding Engine supports queries containing:
  - Typed variables (such as NounPhrase).
  - String-processing functions (such as head(X) or ProperNoun(X)).
  - Standard query terms.
- It processes a variable by returning every possible string in the corpus that has a matching type and that can be substituted for the variable while still satisfying the user's query (see the illustration below).
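For concreteness, a variabilized query of the kind BE accepts might look like the following; the query string mirrors the examples above, while binding_engine_query and the counts shown are purely hypothetical:

# Purely illustrative; `binding_engine_query` is a hypothetical stand-in, not BE's API.
query = "cities such as <NounPhrase>"

# Instead of a list of documents, BE returns the variable bindings directly,
# e.g. each binding string together with how often it was found in the corpus:
#   binding_engine_query(query) -> {"Paris": 42, "London": 37, "Berlin": 21, ...}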
12. How Does the Binding Engine Work?
- It uses a novel approach called the neighborhood index.
- The neighborhood index is an augmented inverted-index structure.
- For each term in the corpus, the index keeps a list of documents in which the term appears and a list of positions where the term occurs.
- The index also keeps a list of left-hand and right-hand neighbors at each position (adjacent text strings that satisfy a recognizer, e.g. NounPhrase); a toy sketch follows.
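A toy sketch of such an index and of answering a "cities such as <NounPhrase>" query from it alone, assuming pre-tokenized documents and a placeholder noun_phrase_starting_at recognizer that is run only at index-build time; none of these names come from the paper:

from collections import defaultdict

def build_neighborhood_index(docs, noun_phrase_starting_at):
    """index[term] -> list of (doc_id, position, NP beginning just after this position, or None)."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for i, term in enumerate(tokens):
            # The typed right-hand neighbor is computed once here, at indexing
            # time, and stored alongside the posting.
            index[term.lower()].append((doc_id, i, noun_phrase_starting_at(tokens, i + 1)))
    return index

def cities_such_as(index):
    """Answer 'cities such as <NounPhrase>' using only the index."""
    cities = {(d, p) for d, p, _ in index.get("cities", [])}
    such = {(d, p) for d, p, _ in index.get("such", [])}
    bindings = []
    for doc_id, pos, right_np in index.get("as", []):
        # The three concrete terms must be adjacent; the binding is the NP
        # stored as the right-hand neighbor of the final term.
        if (doc_id, pos - 2) in cities and (doc_id, pos - 1) in such and right_np:
            bindings.append(right_np)
    return bindings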
13. How Is the Binding Engine Better?
- K is the number of concrete terms in the query.
- B is the number of variable bindings found in the corpus.
- N is the number of documents in the corpus.
- BE answers a variabilized query from the index alone, so its query cost depends on K and B rather than on fetching and processing all N matching documents at query time.
- Expensive processing such as part-of-speech tagging or shallow syntactic parsing is performed only once, while building the index, and is not needed at query time.
14. How Is the Binding Engine Better?
- Average time to return the relevant bindings in response to a set of queries:
  - 0.06 CPU minutes for BE.
  - 8.16 CPU minutes for Nutch (a locally run open-source search engine).
15. Disadvantages of the Binding Engine
- It consumes a large amount of disk space, as parts of the corpus text are folded into the index several times.
- The neighborhood index requires roughly four times the disk space of a standard inverted index.
16. The URNS Model
- We need a way to test that the extractions from the Binding Engine are correct.
- KnowItAll issues queries to search engines and uses the PMI model to verify extractions.
- PMI is very effective, but it is also very slow.
17. How Does URNS Work?
- URNS is a probabilistic model.
- It takes the form of a classic balls-and-urns model from combinatorics.
- Each extraction is modeled as a labeled ball in an urn.
- A label represents either an instance of the target class or relation, or represents an error.
18. How Does URNS Work?
- C - the set of unique target labels; |C| is the number of unique target labels in the urn.
- E - the set of unique error labels; |E| is the number of unique error labels in the urn.
- num(b) - the function giving the number of balls labeled by b, where b is an element of C ∪ E.
- num(B) - the multi-set giving the number of balls for each label b in B, where B is a subset of C ∪ E.
19. How Does URNS Work?
- The goal of an IE system is to discern which of the labels it extracts are in fact elements of C.
- Given that a particular label x was extracted k times in a set of n draws from the urn, what is the probability that x is an element of C? (A sketch of this computation follows.)
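A sketch of that computation under the with-replacement urn model, assuming the per-label ball counts num_C and num_E are given; in practice these distributions have to be estimated, and the example numbers at the end are made up:

def urns_probability(k, n, num_C, num_E):
    """P(label x is in C | x was drawn exactly k times in n draws from the urn)."""
    s = sum(num_C) + sum(num_E)                  # total number of balls in the urn
    def weight(counts):
        # Likelihood of seeing a label with r balls exactly k times in n draws,
        # up to the common binomial coefficient, which cancels in the ratio.
        return sum((r / s) ** k * (1 - r / s) ** (n - k) for r in counts)
    in_c = weight(num_C)
    return in_c / (in_c + weight(num_E))

# Made-up example: 100 target labels with 10 balls each, 1000 error labels with
# 1 ball each; a label drawn 3 times in 500 draws is probably in C (about 0.91).
print(urns_probability(k=3, n=500, num_C=[10] * 100, num_E=[1] * 1000))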
20. Alternative to URNS
- Items that were extracted more often are more likely to be true.
- i.e., simply assume that extractions with higher frequencies are true, as in the baseline sketched below.
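A minimal sketch of that frequency baseline (the names are illustrative):

from collections import Counter

def rank_by_frequency(extractions):
    # Rank candidate labels by how many times they were extracted; the most
    # frequent ones are simply assumed to be correct.
    return [label for label, count in Counter(extractions).most_common()]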
21. Experiments
- Recall: how many distinct extractions does each system return at high precision?
- Time: how long did each system take to produce and rank its extractions?
- Extraction rate: how many distinct high-quality extractions does the system return per minute? The extraction rate is simply recall divided by time.
22. KnowItNow vs. KnowItAll: Tested on the relation Country
23. KnowItNow vs. KnowItAll: Tested on the relation CapitalOf
24. KnowItNow vs. KnowItAll: Tested on the relation Corp
25. KnowItNow vs. KnowItAll: Tested on the relation CeoOf
26. KnowItNow vs. KnowItAll
27. Contributions
- A novel architecture for Information Extraction which does not depend on Web search-engine queries.
- Extracts tens of thousands of facts from the Web in minutes instead of days.
- KnowItNow's extraction rate is two to three orders of magnitude greater than KnowItAll's.