Title: KnowItNow: Fast, Scalable Information Extraction from the Web
1. KnowItNow: Fast, Scalable Information Extraction from the Web
- Michael J. Cafarella, Doug Downey, Stephen Soderland, Oren Etzioni
2. The Problem
- Numerous NLP applications rely on search-engine queries to
  - Extract information from the Web.
  - Compute statistics over the Web corpus.
- Search engines are extremely helpful for several linguistic tasks, such as
  - Computing usage statistics.
  - Finding a subset of Web documents to analyze in depth.
3. Problems with Search Engines
- Search engines were not designed as building blocks for NLP applications. As a result:
  - An NLP application is forced to issue literally millions of queries to search engines, increasing processing time and limiting scalability.
  - Fetching web documents is also time-consuming.
- Search engines are limiting the use of programmatic queries to their engines:
  - Google has placed hard quotas on the number of daily queries a program can issue.
  - Other engines force applications to introduce courtesy waits between queries.
4. Example of the Problem: KnowItAll
- KnowItAll works in a generate-and-test architecture, extracting information in two stages:
  - First, it utilizes a small set of domain-independent extraction patterns to generate candidate facts.
  - Second, it automatically tests the plausibility of the candidate facts it extracts using pointwise mutual information (PMI) statistics computed from search-engine hit counts.
5. 1st Stage in KnowItAll
- Take the generic pattern "NP1 such as NPList2".
- This indicates that the head of each simple noun phrase (NP) in NPList2 is a member of the class named in NP1.
- Take as an example the pattern for the class City and the sentence "We provide tours to cities such as Paris, London, and Berlin."
- KnowItAll extracts three candidate cities from the sentence: Paris, London, and Berlin (a sketch of this pattern match follows).
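A minimal sketch of how such a pattern can be applied, assuming a plain regular expression stands in for KnowItAll's extraction machinery and that the comma/"and"-separated chunks are the simple noun phrases; the class string "cities" and the helper names are illustrative only:

# Illustrative only: a regex-based stand-in for the "<Class> such as <NPList>" pattern.
import re

def extract_candidates(sentence, class_plural="cities"):
    """Return candidate instances of the class found via the 'such as' pattern."""
    pattern = re.compile(re.escape(class_plural) + r"\s+such\s+as\s+([^.]+)", re.IGNORECASE)
    match = pattern.search(sentence)
    if not match:
        return []
    # Treat each comma/"and"-separated chunk of the NP list as a candidate.
    chunks = re.split(r",|\band\b", match.group(1))
    return [c.strip() for c in chunks if c.strip()]

print(extract_candidates("We provide tours to cities such as Paris, London, and Berlin."))
# -> ['Paris', 'London', 'Berlin']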
6. 2nd Stage in KnowItAll
- KnowItAll needs to assess the likelihood of the information it found.
  - Verify that Paris is actually a city.
- It does that by computing the PMI between "Paris" and a set of k discriminator phrases that tend to have high mutual information with city names (e.g. "Paris is a city"), as sketched below.
- This requires at least k search-engine queries for every candidate extraction!
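A rough sketch of this PMI-style assessment, assuming hit counts are available through a hits() function supplied by the caller; the discriminator strings shown are illustrative, not KnowItAll's actual set:

# Hedged sketch of PMI scoring from search-engine hit counts.

def pmi(instance, discriminator, hits):
    """Co-occurrence hits for the instantiated discriminator, normalized by hits for the instance alone."""
    phrase = discriminator.format(instance)        # e.g. "Paris is a city"
    return hits(phrase) / max(hits(instance), 1)

def assess(instance, discriminators, hits):
    # One hit-count query per discriminator (plus one for the instance itself),
    # which is why this stage needs at least k search-engine queries per extraction.
    return [pmi(instance, d, hits) for d in discriminators]

# Example discriminators for the class City (illustrative):
city_discriminators = ["{} is a city", "cities such as {}", "the city of {}"]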
7. The Solution
- A novel architecture for Information Extraction which does not depend on Web search-engine queries: KnowItNow.
- Works over two stages, like KnowItAll:
  - Uses a specialized search engine called the Binding Engine (or BE), which efficiently returns bindings in response to variabilized queries.
  - Uses URNS, a combinatorial model, which estimates the probability that each extraction is correct without using any additional search-engine queries.
8. The Binding Engine vs. the Traditional Engine
9. The Traditional Engine
- Take the search query ("cities such as <NounPhrase>").
- Perform a traditional search-engine query.
- For each returned URL:
  - Obtain the document contents.
  - Find the searched-for terms in the document text.
  - Run the noun-phrase recognizer to determine if the text found satisfies the linguistic type requirement.
  - If it does, return the string (a sketch of this loop follows).
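A sketch of that loop, with search, fetch_document, and noun_phrases_matching passed in as placeholder callables; they are not part of any real search-engine API:

# Hedged sketch of the traditional pipeline described above.

def traditional_extract(query, search, fetch_document, noun_phrases_matching):
    results = []
    for url in search(query):            # the search itself is fast
        text = fetch_document(url)       # slow: a fetch (and likely a random disk seek) per document
        # Linguistic processing (NP recognition) happens here, at query time,
        # for every document returned by the search.
        results.extend(noun_phrases_matching(text, query))
    return results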
10. Problems with the Traditional Engine
- The search itself doesn't take long, even if there are multiple search queries.
- The second stage fetches a large number of documents, with each fetch likely resulting in a random disk seek, so this stage executes slowly.
- This disk access is slow regardless of whether it happens on a locally-cached copy or on a remote document server.
11. The Binding Engine
- Why not use a table to store a list of terms and the documents containing them?
- The Binding Engine supports queries containing:
  - Typed variables (such as NounPhrase).
  - String-processing functions (such as head(X) or ProperNoun(X)).
  - Standard query terms.
- It processes a variable by returning every possible string in the corpus that has a matching type and that can be substituted for the variable while still satisfying the user's query (see the illustration below).
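For concreteness, a variabilized query of the kind BE accepts might look like the following; the query string mirrors the examples above, while binding_engine_query and the counts shown are purely hypothetical:

# Purely illustrative; `binding_engine_query` is a hypothetical stand-in, not BE's API.
query = "cities such as <NounPhrase>"

# Instead of a list of documents, BE returns the variable bindings directly,
# e.g. each binding string together with how often it was found in the corpus:
#   binding_engine_query(query) -> {"Paris": 42, "London": 37, "Berlin": 21, ...}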
12. How Does the Binding Engine Work?
- It uses a novel approach called the neighborhood index.
- The neighborhood index is an augmented inverted-index structure.
- For each term in the corpus, the index keeps a list of documents in which the term appears and a list of positions where the term occurs.
- The index also keeps a list of left-hand and right-hand neighbors at each position (adjacent text strings that satisfy a recognizer, e.g. NounPhrase); a toy sketch follows.
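A toy sketch of such an index and of answering a "cities such as <NounPhrase>" query from it alone, assuming pre-tokenized documents and a placeholder noun_phrase_starting_at recognizer that is run only at index-build time; none of these names come from the paper:

from collections import defaultdict

def build_neighborhood_index(docs, noun_phrase_starting_at):
    """index[term] -> list of (doc_id, position, NP beginning just after this position, or None)."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for i, term in enumerate(tokens):
            # The typed right-hand neighbor is computed once here, at indexing
            # time, and stored alongside the posting.
            index[term.lower()].append((doc_id, i, noun_phrase_starting_at(tokens, i + 1)))
    return index

def cities_such_as(index):
    """Answer 'cities such as <NounPhrase>' using only the index."""
    cities = {(d, p) for d, p, _ in index.get("cities", [])}
    such = {(d, p) for d, p, _ in index.get("such", [])}
    bindings = []
    for doc_id, pos, right_np in index.get("as", []):
        # The three concrete terms must be adjacent; the binding is the NP
        # stored as the right-hand neighbor of the final term.
        if (doc_id, pos - 2) in cities and (doc_id, pos - 1) in such and right_np:
            bindings.append(right_np)
    return bindings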
13. How Is the Binding Engine Better?
- K is the number of concrete terms in the query.
- B is the number of variable bindings found in the corpus.
- N is the number of documents in the corpus.
- BE answers a variabilized query from the index alone, so its query cost depends on K and B rather than on fetching and processing all N matching documents at query time.
- Expensive processing such as part-of-speech tagging or shallow syntactic parsing is performed only once, while building the index, and is not needed at query time.
14. How Is the Binding Engine Better?
- Average time to return the relevant bindings in response to a set of queries:
  - 0.06 CPU minutes for BE.
  - 8.16 CPU minutes for Nutch (a locally run open-source search engine).
15. Disadvantages of the Binding Engine
- It consumes a large amount of disk space, as parts of the corpus text are folded into the index several times.
- The neighborhood index requires roughly four times the disk space of a standard inverted index.
16. The URNS Model
- We need a way to test that the extractions from the Binding Engine are correct.
- KnowItAll issues queries to search engines and uses the PMI model to verify extractions.
- PMI is very effective, but it is also very slow.
17. How Does URNS Work?
- URNS is a probabilistic model.
- It takes the form of a classic balls-and-urns model from combinatorics.
- Each extraction is modeled as a labeled ball in an urn.
- A label represents either an instance of the target class or relation, or represents an error.
18. How Does URNS Work?
- C - the set of unique target labels; |C| is the number of unique target labels in the urn.
- E - the set of unique error labels; |E| is the number of unique error labels in the urn.
- num(b) - the function giving the number of balls labeled by b, where b is an element of C ∪ E.
- num(B) - the multi-set giving the number of balls for each label b in B, where B is a subset of C ∪ E.
19. How Does URNS Work?
- The goal of an IE system is to discern which of the labels it extracts are in fact elements of C.
- Given that a particular label x was extracted k times in a set of n draws from the urn, what is the probability that x is an element of C? (A sketch of this computation follows.)
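A sketch of that computation under the with-replacement urn model, assuming the per-label ball counts num_C and num_E are given; in practice these distributions have to be estimated, and the example numbers at the end are made up:

def urns_probability(k, n, num_C, num_E):
    """P(label x is in C | x was drawn exactly k times in n draws from the urn)."""
    s = sum(num_C) + sum(num_E)                  # total number of balls in the urn
    def weight(counts):
        # Likelihood of seeing a label with r balls exactly k times in n draws,
        # up to the common binomial coefficient, which cancels in the ratio.
        return sum((r / s) ** k * (1 - r / s) ** (n - k) for r in counts)
    in_c = weight(num_C)
    return in_c / (in_c + weight(num_E))

# Made-up example: 100 target labels with 10 balls each, 1000 error labels with
# 1 ball each; a label drawn 3 times in 500 draws is probably in C (about 0.91).
print(urns_probability(k=3, n=500, num_C=[10] * 100, num_E=[1] * 1000))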
20. Alternative to URNS
- Items that were extracted more often are more likely to be true.
- i.e., simply assume that extractions with higher frequencies are true, as in the baseline sketched below.
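A minimal sketch of that frequency baseline (the names are illustrative):

from collections import Counter

def rank_by_frequency(extractions):
    # Rank candidate labels by how many times they were extracted; the most
    # frequent ones are simply assumed to be correct.
    return [label for label, count in Counter(extractions).most_common()]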
21. Experiments
- Recall: how many distinct extractions does each system return at high precision?
- Time: how long did each system take to produce and rank its extractions?
- Extraction rate: how many distinct high-quality extractions does the system return per minute? The extraction rate is simply recall divided by time.
22. KnowItNow vs. KnowItAll: Tested on the relation Country
23. KnowItNow vs. KnowItAll: Tested on the relation CapitalOf
24. KnowItNow vs. KnowItAll: Tested on the relation Corp
25. KnowItNow vs. KnowItAll: Tested on the relation CeoOf
26. KnowItNow vs. KnowItAll
27. Contributions
- A novel architecture for Information Extraction which does not depend on Web search-engine queries.
- Extracts tens of thousands of facts from the Web in minutes instead of days.
- KnowItNow's extraction rate is two to three orders of magnitude greater than KnowItAll's.