Information Extraction from the World Wide Web


Slides: 157

Provided by: AndrewM163

Learn more at: http://www.cs.cmu.edu


Transcript and Presenter's Notes


1
Information Extraction from the World Wide Web

William W. Cohen
Carnegie Mellon University
Andrew McCallum
University of Massachusetts Amherst
KDD 2003

2
Example: The Problem
Martin Baker, a person
Genomics job
Employer's job posting form
3
Example: A Solution
4
Extracting Job Openings from the Web
5
Job Openings: Category: Food Services; Keyword: Baker; Location: Continental U.S.
6
Data Mining the Extracted Job Information
7
IE from Research Papers
8
IE from Chinese Documents regarding Weather
Chinese Academy of Sciences
200k documents, several millennia old
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries
9
IE from SEC Filings
This filing covers the period from December 1996 to September 1997.

ENRON GLOBAL POWER & PIPELINES L.L.C.
CONSOLIDATED BALANCE SHEETS
(IN THOUSANDS, EXCEPT SHARE AMOUNTS)

                                                SEPTEMBER 30,   DECEMBER 31,
                                                    1997            1996
                                                -------------   ------------
                                                 (UNAUDITED)
ASSETS
Current Assets
  Cash and cash equivalents                          54,262         24,582
  Accounts receivable                                 8,473          6,301
  Current portion of notes receivable                 1,470          1,394
  Other current assets                                  336            404
                                                   --------       --------
    Total Current Assets                             71,730         32,681
                                                   --------       --------
Investments in to Unconsolidated Subsidiaries       286,340        298,530
Notes Receivable                                     16,059         12,111
                                                   --------       --------
    Total Assets                                    374,408        343,843

LIABILITIES AND SHAREHOLDERS' EQUITY
Current Liabilities
  Accounts payable                                   13,461         11,277
  Accrued taxes                                       1,910          1,488
                                                   --------       --------
    Total Current Liabilities                        15,371         49,348
                                                   --------       --------
Deferred Income Taxes                                   525          4,301

The U.S. energy markets in 1997 were subject to significant fluctuation

Data mine these reports for
- suspicious behavior,
- to better understand what is normal.

10
What is "Information Extraction"?
As a task:
Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying...
NAME TITLE ORGANIZATION
11
What is Information Extraction
As a task
Filling slots in a database from sub-segments of
text.
IE
NAME               TITLE     ORGANIZATION
Bill Gates         CEO       Microsoft
Bill Veghte        VP        Microsoft
Richard Stallman   founder   Free Soft..
12
What is "Information Extraction"?
As a family of techniques:
Information Extraction = segmentation + classification + clustering + association
Microsoft Corporation; CEO; Bill Gates; Microsoft; Gates; Microsoft; Bill Veghte; Microsoft; VP; Richard Stallman; founder; Free Software Foundation

16
IE in Context
Create ontology
Spider
Filter by relevance
IE
Segment Classify Associate Cluster
Database
Load DB
Query, Search
Document collection
Train extraction models
Data mine
Label training data
17
Why IE from the Web?

Science
- Grand old dream of AI: Build large KB and reason with it. IE from the Web enables the creation of this KB.
- IE from the Web is a complex problem that inspires new advances in machine learning.
Profit
- Many companies interested in leveraging data currently locked in unstructured text on the Web.
- Not yet a monopolistic winner in this space.
Fun!
- Build tools that we researchers like to use ourselves: Cora, CiteSeer, MRQE.com, FAQFinder, ...
- See our work get used by the general public.

KB = Knowledge Base
18
Tutorial Outline

IE History
Landscape of problems and solutions
Parade of models for segmenting/classifying
Sliding window
Boundary finding
Finite state machines
Trees
Overview of related problems and solutions
Association, Clustering
Integration with Data Mining
Where to go from here

15 min break
19
IE History

Pre-Web
- Mostly news articles
  - De Jong's FRUMP [1982]: hand-built system to fill Schank-style scripts from news wire
  - Message Understanding Conference (MUC), DARPA [87-95]; TIPSTER [92-96]
- Most early work dominated by hand-built models
  - E.g. SRI's FASTUS, hand-built FSMs.
  - But by 1990s, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek 97], BBN [Bikel et al 98]
Web
- AAAI 94 Spring Symposium on Software Agents
  - Much discussion of ML applied to Web. Maes, Mitchell, Etzioni.
- Tom Mitchell's WebKB, 96
  - Build KBs from the Web.
- Wrapper Induction
  - Initially hand-built, then ML: [Soderland 96], [Kushmerick 97], ...

20
What makes IE from the Web Different?
Less grammar, but more formatting & linking
Newswire
Web
www.apple.com/retail
Apple to Open Its First Retail Store in New York
City MACWORLD EXPO, NEW YORK--July 17,
2002--Apple's first retail store in New York City
will open in Manhattan's SoHo district on
Thursday, July 18 at 8:00 a.m. EDT. The SoHo
store will be Apple's largest retail store to
date and is a stunning example of Apple's
commitment to offering customers the world's best
computer shopping experience. "Fourteen months
after opening our first retail store, our 31
stores are attracting over 100,000 visitors each
week," said Steve Jobs, Apple's CEO. "We hope our
SoHo store will surprise and delight both Mac and
PC users who want to see everything the Mac can
do to enhance their digital lifestyles."
www.apple.com/retail/soho
www.apple.com/retail/soho/theatre.html
The directory structure, link structure,
formatting & layout of the Web is its own new
grammar.
21
Landscape of IE Tasks (1/4): Pattern Feature Domain
Text paragraphs without formatting
Grammatical sentences and some formatting & links
Astro Teller is the CEO and co-founder of
BodyMedia. Astro holds a Ph.D. in Artificial
Intelligence from Carnegie Mellon University,
where he was inducted as a national Hertz fellow.
His M.S. in symbolic and heuristic computation
and B.S. in computer science are from Stanford
University. His work in science, literature and
business has appeared in international media from
the New York Times to CNN to NPR.
Non-grammatical snippets, rich formatting & links
Tables
22
Landscape of IE Tasks (2/4): Pattern Scope
Web site specific (Formatting): Amazon.com Book Pages
Genre specific (Layout): Resumes
Wide, non-specific (Language): University Names
23
Landscape of IE Tasks (3/4): Pattern Complexity
E.g. word patterns:
Closed set: U.S. states
- He was born in Alabama
- The big Wyoming sky
Regular set: U.S. phone numbers
- Phone (413) 545-1323
- The CALD main office can be reached at 412-268-1299
Complex pattern: U.S. postal addresses
- University of Arkansas P.O. Box 140 Hope, AR 71802
- Headquarters 1128 Main Street, 4th Floor Cincinnati, Ohio 45210
Ambiguous patterns, needing context and many sources of evidence: Person names
- ...was among the six houses sold by Hope Feldman that year.
- Pawel Opalinski, Software Engineer at WhizBang Labs.
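The two simplest pattern classes above can be sketched in a few lines of Python. This is a minimal illustration, not part of the tutorial: the state lexicon is deliberately truncated, and the phone regex is just one plausible pattern covering the two formats shown on the slide.

```python
import re

# Closed set: membership in a lexicon (only four states listed, for brevity).
US_STATES = {"Alabama", "Alaska", "Wisconsin", "Wyoming"}

# Regular set: an assumed regex for U.S. phone numbers written as
# "(413) 545-1323" or "412-268-1299"; real data needs more variants.
PHONE_RE = re.compile(r"(?:\(\d{3}\)\s*|\d{3}-)\d{3}-\d{4}")


def find_states(text):
    """Return single-word lexicon members found in the text."""
    return [w for w in re.findall(r"[A-Za-z]+", text) if w in US_STATES]


def find_phones(text):
    """Return every substring that matches the phone-number pattern."""
    return PHONE_RE.findall(text)
```

Multi-word states such as "New York", and the complex and ambiguous classes (addresses, person names), are out of reach for this kind of lookup, which is the point the slide makes.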
24
Landscape of IE Tasks (4/4): Pattern Combinations
Jack Welch will retire as CEO of General Electric
tomorrow. The top role at the Connecticut
company will be filled by Jeffrey Immelt.
Single entity (Named entity extraction)
- Person: Jack Welch
- Person: Jeffrey Immelt
- Location: Connecticut
Binary relationship
- Relation: Person-Title; Person: Jack Welch; Title: CEO
- Relation: Company-Location; Company: General Electric; Location: Connecticut
N-ary record
- Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt
25
Evaluation of Single Entity Extraction
TRUTH:
Michael Kearns and Sebastian Seung will start
Monday's tutorial, followed by Richard M. Karpe
and Martin Cooke.
PRED:
Michael Kearns and Sebastian Seung will start
Monday's tutorial, followed by Richard M. Karpe
and Martin Cooke.

$\text{Precision} = \frac{\#\ \text{correctly predicted segments}}{\#\ \text{predicted segments}} = \frac{2}{6}$

$\text{Recall} = \frac{\#\ \text{correctly predicted segments}}{\#\ \text{true segments}} = \frac{2}{4}$

$F_1 = \text{harmonic mean of Precision and Recall} = \frac{1}{\left(\frac{1}{P} + \frac{1}{R}\right) / 2}$
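As a worked check of the numbers on this slide, here is a small Python sketch that scores predicted segments against true segments by exact match. The `(start, end)` token spans in the usage note are hypothetical, chosen only to give 4 true segments, 6 predicted segments and 2 correct ones.

```python
def precision_recall_f1(true_segments, predicted_segments):
    """Score exact-match segments; each segment is a hashable span."""
    true_set, pred_set = set(true_segments), set(predicted_segments)
    correct = len(true_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(true_set) if true_set else 0.0
    # Harmonic mean of precision and recall; defined as 0 when nothing is correct.
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1
```

With `truth = [(0, 2), (3, 5), (11, 14), (15, 17)]` and `pred = [(0, 2), (3, 5), (6, 7), (8, 9), (12, 14), (16, 17)]`, precision is 2/6, recall is 2/4, and F1 works out to 0.4.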
26
State of the Art Performance

Named entity recognition
- Person, Location, Organization, ...
- F1 in high 80s or low- to mid-90s
Binary relation extraction
- Contained-in (Location1, Location2); Member-of (Person1, Organization1)
- F1 in 60s or 70s or 80s
Wrapper induction
- Extremely accurate performance obtainable
- Human effort (30min) required on each site

27
Landscape of IE Techniques (1/1): Models
Lexicons
Abraham Lincoln was born in Kentucky.
member?
Alabama, Alaska, ..., Wisconsin, Wyoming
...and beyond
Any of these models can be used to capture words,
formatting or both.
28
Landscape: Focus of this Tutorial
Pattern complexity: closed set, regular, complex, ambiguous
Pattern feature domain: words, words + formatting, formatting
Pattern scope: site-specific, genre-specific, general
Pattern combinations: entity, binary, n-ary
Models: lexicon, regex, window, boundary, FSM, CFG
29
Sliding Windows
30
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence during the
1980s and 1990s. As a result of its success and
growth, machine learning is evolving into a
collection of related disciplines: inductive
concept acquisition, analytic learning in problem
solving (e.g. analogy, explanation-based
learning), learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.
CMU UseNet Seminar Announcement
34
A Naïve Bayes Sliding Window Model
[Freitag 1997]
... 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun ...

prefix: w_{t-m} ... w_{t-1}
contents: w_t ... w_{t+n}
suffix: w_{t+n+1} ... w_{t+n+m}
Estimate Pr(LOCATION | window) using Bayes rule.
Try all reasonable windows (vary length, position).
Assume independence for length, prefix words, suffix words, content words.
Estimate from data quantities like Pr("Place" in prefix | LOCATION).
If P("Wean Hall Rm 5409" = LOCATION) is above some threshold, extract it.
Other examples of sliding window: [Baluja et al 2000] (decision tree over individual words & their context)
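The steps above can be sketched roughly as follows. This is a simplified illustration, not Freitag's implementation: it scores each window by summed log-odds of its prefix, contents and suffix words against a background unigram model, plus a smoothed length term. The class name, the interpolation weight `mix`, and the defaults `m` and `max_len` are all assumptions made for the example.

```python
import math
from collections import Counter


class NaiveBayesWindow:
    """Toy sliding-window scorer with independent prefix/contents/suffix models."""

    def __init__(self, m=2, max_len=5, mix=0.5):
        self.m, self.max_len, self.mix = m, max_len, mix
        self.part = {p: Counter() for p in ("prefix", "contents", "suffix")}
        self.background = Counter()   # unigram counts over all training tokens
        self.lengths = Counter()      # histogram of field lengths

    def _split(self, tokens, i, j):
        return {"prefix": tokens[max(0, i - self.m):i],
                "contents": tokens[i:j],
                "suffix": tokens[j:j + self.m]}

    def train(self, examples):
        """examples: iterable of (tokens, i, j); the field is tokens[i:j]."""
        for tokens, i, j in examples:
            self.background.update(tokens)
            self.lengths[j - i] += 1
            for name, words in self._split(tokens, i, j).items():
                self.part[name].update(words)

    def _log_odds(self, name, word):
        # Add-one smoothed background probability of the word.
        bg = (self.background[word] + 1) / (
            sum(self.background.values()) + len(self.background) + 1)
        counts = self.part[name]
        ml = counts[word] / max(1, sum(counts.values()))
        # Interpolate the part-specific estimate with the background model.
        return math.log((self.mix * ml + (1 - self.mix) * bg) / bg)

    def score(self, tokens, i, j):
        length_p = (self.lengths[j - i] + 1) / (
            sum(self.lengths.values()) + self.max_len)
        total = math.log(length_p)
        for name, words in self._split(tokens, i, j).items():
            total += sum(self._log_odds(name, w) for w in words)
        return total

    def best_window(self, tokens):
        """Try all windows up to max_len tokens; return (score, i, j) of the best."""
        candidates = [(self.score(tokens, i, j), i, j)
                      for i in range(len(tokens))
                      for j in range(i + 1, min(len(tokens), i + self.max_len) + 1)]
        return max(candidates)
```

In practice one would extract every window whose score clears a threshold tuned on held-out data, rather than only the single best window.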
35
Naïve Bayes Sliding Window Results
Domain: CMU UseNet Seminar Announcements
Field         F1
Person Name   30
Location      61
Start Time    98
36
SRV: a realistic sliding-window-classifier IE system
[Freitag AAAI 98]

What windows to consider?
- all windows containing as many tokens as the shortest example, but no more tokens than the longest example
How to represent a classifier? It might:
- Restrict the length of window
- Restrict the vocabulary or formatting used before/after/inside window
- Restrict the relative order of tokens
- Etc.

<title>Course Information for CS213</title> <h1>CS 213 C++ Programming</h1>
37
SRV: a rule-learner for sliding-window classification

Top-down rule learning:
let RULES = {};
while (there are uncovered positive examples) {
    // construct a rule R to add to RULES
    let R be a rule covering all examples;
    while (R covers too many negative examples) {
        // pick the candidate condition that scores best when added to R
        let C = argmax_C VALUE(R, R&C, uncoveredExamples),
            over some set of candidate conditions C;
        let R = R & C;
    }
    let RULES = RULES + {R};
}
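A runnable sketch of this top-down covering loop, under simplifying assumptions that are not from the slides: examples are dictionaries of boolean features, a rule is a conjunction (set) of feature names, "too many negatives" means any negative at all, and VALUE is plain precision with ties broken by positive coverage (the tutorial states that SRV itself uses information gain).

```python
def covers(rule, example):
    """A rule is a set of feature names that must all be true in the example."""
    return all(example[f] for f in rule)


def learn_rules(positives, negatives, features):
    """Greedy top-down rule learning over boolean features (illustrative only)."""
    rules, uncovered = [], list(positives)
    while uncovered:
        rule = frozenset()                  # the empty rule covers every example
        pos, neg = uncovered, list(negatives)
        while neg:                          # rule still covers some negatives
            def value(f):
                p = sum(bool(e[f]) for e in pos)
                n = sum(bool(e[f]) for e in neg)
                return (p / (p + n) if p + n else 0.0, p)

            candidates = [f for f in features if f not in rule]
            best = max(candidates, key=value, default=None)
            if best is None or value(best)[1] == 0:
                break                       # no condition keeps any positive
            rule = rule | {best}
            pos = [e for e in pos if e[best]]
            neg = [e for e in neg if e[best]]
        if neg or not pos:
            break                           # could not build a consistent rule
        rules.append(rule)
        uncovered = [e for e in uncovered if not covers(rule, e)]
    return rules
```

Unlike SRV, this sketch has no holdout set for estimating false positive rates, so it will overfit on noisy data.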

38
SRV: a rule-learner for sliding-window classification

Rule learning: greedily add conditions to rules, rules to rule set
Search metric: SRV algorithm greedily adds conditions to maximize information gain
To prevent overfitting: rules are built on 2/3 of data, then their false positive rate is estimated on the 1/3 holdout set.
Candidate conditions: ...

courseNumber(X) :- tokenLength(X,=,2), every(X, inTitle, false), some(X, A, <previousToken>, inTitle, true), some(X, B, <>, tripleton, true)
39
Learning first-order rules

A sample zero-th order rule set:
(tok1InTitle & tok1StartsPara & tok2triple) or (prevtok2EqCourse & prevtok1EqNumber) or ...
First-order rules can be learned the same way, with additional search to find the best condition:
phrase(X) :- firstToken(X,A), not startPara(A), nextToken(A,B), triple(B)
phrase(X) :- firstToken(X,A), prevToken(A,C), eq(C,number), prevToken(C,D), eq(D,course)
Semantics:
p(X) :- q(X), r(X,Y), s(Y)  =  {X : exists Y : q(X) and r(X,Y) and s(Y)}

40
SRV: a rule-learner for sliding-window classification

Primitive predicates used by SRV:
- token(X,W), allLowerCase(W), numerical(W), ...
- nextToken(W,U), previousToken(W,V)
HTML-specific predicates:
- inTitleTag(W), inH1Tag(W), inEmTag(W), ...
- emphasized(W) = inEmTag(W) or inBTag(W) or ...
- tableNextCol(W,U) = U is some token in the column after the column W is in
- tablePreviousCol(W,V), tableRowHeader(W,T), ...

41
SRV: a rule-learner for sliding-window classification

Non-primitive conditions used by SRV:
- every(X, f, c) = for all W in X : f(W)=c
- some(X, W, <f1,...,fk>, g, c) = exists W : g(fk(...(f1(W))...))=c
- tokenLength(X, relop, c)
- position(W, direction, relop, c)
- e.g., tokenLength(X,>,4), position(W,fromEnd,<,2)

42
Utility of non-primitive conditions in greedy
rule search

Greedy search for first-order rules is hard
because useful conditions can give no immediate
benefit:
phrase(X) :- token(X,A), prevToken(A,B), inTitle(B), nextToken(A,C), tripleton(C)

43
Rapier: an alternative approach
[Califf & Mooney, AAAI 99]

A bottom-up rule learner:
initialize RULES to be one rule per example;
repeat {
    randomly pick N pairs of rules (Ri,Rj);
    let {G1,...,GN} be the consistent pairwise generalizations;
    let G* = Gi that optimizes compression;
    let RULES = RULES + {G*} - {R : covers(G*,R)};
}
where compression(G,RULES) = size of RULES - {R : covers(G,R)},
and covers(G,R) means every example matching R also matches G

44
<title>Course Information for CS213</title> <h1>CS 213 C++ Programming</h1>
courseNum(window1) :- token(window1,CS), doubleton(CS), prevToken(CS,CS213), inTitle(CS213), nextTok(CS,213), numeric(213), tripleton(213), nextTok(213,C++), tripleton(C++), ...
<title>Syllabus and meeting times for Eng 214</title> <h1>Eng 214 Software Engineering for Non-programmers </h1>
courseNum(window2) :- token(window2,Eng), tripleton(Eng), prevToken(Eng,214), inTitle(214), nextTok(Eng,214), numeric(214), tripleton(214), nextTok(214,Software), ...
courseNum(X) :- token(X,A), prevToken(A,B), inTitle(B), nextTok(A,C), numeric(C), tripleton(C), nextTok(C,D), ...
45
Rapier: an alternative approach

Combines top-down and bottom-up learning
- Bottom-up to find common restrictions on content
- Top-down greedy addition of restrictions on context
Use of part-of-speech and semantic features (from WORDNET).
Special pattern-language based on sequences of tokens, each of which satisfies one of a set of given constraints:
< <tok∈{ate,hit}, POS∈{vb}>, <tok∈{the}>, <POS∈{nn}> >

46
Rapier results precision/recall
47
Rapier results vs. SRV
48
Rule-learning approaches to sliding-window
classification: Summary

SRV, Rapier, and WHISK [Soderland KDD 97]
Representations for classifiers allow restriction
of the relationships between tokens, etc
Representations are carefully chosen subsets of
even more powerful representations based on logic
programming (ILP and Prolog)
Use of these heavyweight representations is
complicated, but seems to pay off in results
Can simpler representations for classifiers work?

49
BWI: Learning to detect boundaries
[Freitag & Kushmerick, AAAI 2000]

Another formulation: learn three probabilistic classifiers:
- START(i) = Prob(position i starts a field)
- END(j) = Prob(position j ends a field)
- LEN(k) = Prob(an extracted field has length k)
Then score a possible extraction (i,j) by START(i) * END(j) * LEN(j-i)
LEN(k) is estimated from a histogram
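The scoring rule above is simple enough to write out directly. In this hedged sketch the three classifiers are replaced by precomputed lists and a dictionary (`start`, `end`, `length_prob` are hypothetical inputs); in BWI itself START and END come from boosted boundary detectors.

```python
def score_extractions(start, end, length_prob):
    """Rank candidate fields (i, j) by START(i) * END(j) * LEN(j - i).

    start[i] and end[j] are boundary probabilities per token position;
    length_prob maps a length k to its histogram-estimated probability.
    """
    scored = []
    for i, s in enumerate(start):
        for j in range(i, len(end)):
            score = s * end[j] * length_prob.get(j - i, 0.0)
            if score > 0:
                scored.append((score, i, j))
    # Highest-scoring candidate first.
    return sorted(scored, reverse=True)
```

Because START and END are estimated independently, nothing here stops two overlapping candidates from both scoring highly; a threshold or a non-overlap filter has to be applied afterwards.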

50
BWI: Learning to detect boundaries

BWI uses boosting to find detectors for START and END
Each weak detector has a BEFORE and AFTER pattern (on tokens before/after position i).
Each pattern is a sequence of tokens and/or wildcards like anyAlphabeticToken, anyToken, anyUpperCaseLetter, anyNumber, ...
Weak learner for patterns uses greedy search (+ lookahead) to repeatedly extend a pair of empty BEFORE, AFTER patterns

51
BWI: Learning to detect boundaries
Field         F1
Person Name   30
Location      61
Start Time    98
52
Problems with Sliding Windows and Boundary
Finders

Decisions in neighboring parts of the input are
made independently from each other.
Naïve Bayes Sliding Window may predict a seminar
end time before the seminar start time.
It is possible for two overlapping windows to
both be above threshold.
In a Boundary-Finding system, left boundaries are
laid down independently from right boundaries,
and their pairing happens as a separate step.

53
Finite State Machines
54
Hidden Markov Models
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, ...
[Figure: the same model drawn as a graphical model and as a finite state model. States S_{t-1}, S_t, S_{t+1} are linked by transitions; each state emits an observation O_{t-1}, O_t, O_{t+1}. The model generates a state sequence and an observation sequence o1 o2 o3 o4 o5 o6 o7 o8.]
Parameters, for all states S = {s1, s2, ...}:
- Start state probabilities: P(s_t)
- Transition probabilities: P(s_t | s_{t-1})
- Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet
Training: Maximize probability of training observations (w/ prior)
55
IE with Hidden Markov Models
Given a sequence of observations:
Yesterday Pedro Domingos spoke this example sentence.
and a trained HMM (states: person name, location name, background),
find the most likely state sequence (Viterbi):
Yesterday Pedro Domingos spoke this example sentence.
Any words said to be generated by the designated person name state extract as a person name:
Person name: Pedro Domingos
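The decoding step can be sketched as below. This is the textbook Viterbi recursion in log space, not code from the tutorial; the probability floor for unseen emissions and the tiny two-state HMM in the usage note are assumptions for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most likely state sequence for the observations."""
    def lg(p):
        # Floor probabilities so unseen words do not produce log(0).
        return math.log(max(p, floor))

    best = [{s: lg(start_p[s]) + lg(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + lg(trans_p[r][s]))
            scores[s] = (best[-1][prev] + lg(trans_p[prev][s])
                         + lg(emit_p[s].get(obs, 0.0)))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Follow back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

For example, with states `background` and `person`, where `person` emits only "Pedro" and "Domingos" and `background` emits the other five words, decoding "Yesterday Pedro Domingos spoke this example sentence" tags exactly the two name tokens as `person`, and those tokens are then extracted as the person name.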
56
HMM Example: Nymble
[Bikel, et al 1998], [BBN IdentiFinder]
Task: Named Entity Extraction
States: start-of-sentence, Person, Org, (five other name classes), Other, end-of-sentence
Transition probabilities: P(s_t | s_{t-1}, o_{t-1}); back-off to P(s_t | s_{t-1}), then P(s_t)
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}); back-off to P(o_t | s_t), then P(o_t)
Train on 500k words of news wire text.

Results:
Case    Language   F1
Mixed   English    93
Upper   English    91
Mixed   Spanish    90

Other examples of shrinkage for HMMs in IE: [Freitag and McCallum 99]
57
We want More than an Atomic View of Words
Would like richer representation of text: many arbitrary, overlapping features of the words.
- identity of word
- ends in "-ski"
- is capitalized
- is part of a noun phrase
- is in a list of city names
- is under node X in WordNet
- is in bold font
- is indented
- is in hyperlink anchor
- last person name was female
- next two words are "and Associates"
[Figure: states S_{t-1}, S_t, S_{t+1} over observations O_{t-1}, O_t, O_{t+1}; one observation is annotated with the features: is "Wisniewski", part of noun phrase, ends in "-ski"]
58
Problems with Richer Representation and a Joint Model

These arbitrary features are not independent.
- Multiple levels of granularity (chars, words, phrases)
- Multiple dependent modalities (words, formatting, layout)
- Past & future
Two choices:

Ignore the dependencies. This causes over-counting of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!
Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!
[Figure: two chains of states S_{t-1}, S_t, S_{t+1} over observations O_{t-1}, O_t, O_{t+1}, one for each choice]
59
Conditional Sequence Models

We prefer a model that is trained to maximize a conditional probability rather than joint probability: P(s|o) instead of P(s,o):
- Can examine features, but not responsible for generating them.
- Don't have to explicitly model their dependencies.
- Don't waste modeling effort trying to generate what we are given at test time anyway.

60
From HMMs to CRFs
Conditional Finite State Sequence Models
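[McCallum, Freitag & Pereira, 2000]
[Lafferty, McCallum, Pereira 2001]
[Figure: chain of states S_{t-1}, S_t, S_{t+1}, ... over observations O_{t-1}, O_t, O_{t+1}, ..., with one equation for the joint model and one for the conditional model]
(A super-special case of Conditional Random Fields.)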
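The equations on this slide did not survive in the transcript. For orientation, the standard textbook forms of the two models are given below; this is an assumption about what the slide showed, not a reconstruction of it. The symbols $\lambda_k$ (weights), $f_k$ (feature functions) and $Z(\mathbf{o})$ (normalizer) follow the usual CRF notation.

```latex
\begin{align*}
\text{Joint (HMM):}\quad
  & P(\mathbf{s}, \mathbf{o})
    = \prod_{t=1}^{|\mathbf{o}|} P(s_t \mid s_{t-1})\, P(o_t \mid s_t) \\
\text{Conditional (linear-chain CRF):}\quad
  & P(\mathbf{s} \mid \mathbf{o})
    = \frac{1}{Z(\mathbf{o})} \prod_{t=1}^{|\mathbf{o}|}
      \exp\Big( \sum_k \lambda_k\, f_k(s_t, s_{t-1}, \mathbf{o}, t) \Big) \\
\text{where}\quad
  & Z(\mathbf{o})
    = \sum_{\mathbf{s}'} \prod_{t=1}^{|\mathbf{o}|}
      \exp\Big( \sum_k \lambda_k\, f_k(s'_t, s'_{t-1}, \mathbf{o}, t) \Big)
\end{align*}
```

The conditional form only ever scores state sequences given the observations, so the features $f_k$ may look at any part of $\mathbf{o}$ without the model having to generate it.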
61
Conditional Random Fields
Lafferty, McCallum, Pereira 2001
1. FSM special-case: linear chain among unknowns, parameters tied across time steps.
[Figure: chain of states S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4}, all conditioned on O = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}]
2. In general: CRFs = "Conditionally-trained Markov Network", arbitrary structure among unknowns
3. Relational Markov Networks [Taskar, Abbeel, Koller 2002]: Parameters tied across hits from SQL-like queries ("clique templates")
62
Feature Functions
o = Yesterday Pedro Domingos spoke this example sentence.
    o1 o2 o3 o4 o5 o6 o7
[Figure: states s1, s2, s3, s4 aligned with the observation sequence]
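To make the idea of a feature function concrete, here is a hedged Python sketch. The two features and their weights are invented for illustration; each feature looks at the current state, the previous state, and the whole observation sequence, and the weighted sum over positions gives an unnormalized log-score for a state path.

```python
def f_capitalized_person(s_t, s_prev, obs, t):
    """Fires when the current word is capitalized and labeled 'person'."""
    return 1.0 if obs[t][0].isupper() and s_t == "person" else 0.0


def f_person_to_person(s_t, s_prev, obs, t):
    """Fires on a person -> person transition."""
    return 1.0 if s_prev == "person" and s_t == "person" else 0.0


def path_score(states, obs, features, weights):
    """Sum of weight * feature over all positions (unnormalized log-score)."""
    total = 0.0
    for t in range(len(obs)):
        s_prev = states[t - 1] if t > 0 else "start"
        total += sum(w * f(states[t], s_prev, obs, t)
                     for f, w in zip(features, weights))
    return total
```

A conditional probability would be obtained by exponentiating this score and dividing by the sum of exponentiated scores over all possible state paths.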
63
Efficient Inference
64
Learning Parameters of CRFs
Maximize log-likelihood of parameters Λ = {λ_k}
given training data D
Log-likelihood gradient:

Methods:
- iterative scaling (quite slow)
- conjugate gradient (much faster)
- limited-memory quasi-Newton methods, BFGS (super fast)

[Sha & Pereira 2002], [Malouf 2002]
65
Voted Perceptron Sequence Models
[Collins 2002]
Like CRFs with stochastic gradient ascent and a
Viterbi approximation.
[Equation: the update is analogous to the gradient for this one training instance]
Avoids calculating the partition function
(normalizer), Z_o, but uses gradient ascent, not a
2nd-order or conjugate gradient method.
66
General CRFs vs. HMMs

- More general and expressive modeling technique
- Comparable computational efficiency
- Features may be arbitrary functions of any or all observations
- Parameters need not fully specify generation of observations; require less training data
- Easy to incorporate domain knowledge
- State means only "state of process", vs. "state of process" and "observational history I'm keeping"

67
MEMM & CRF Related Work

Maximum entropy for language tasks
- Language modeling [Rosenfeld 94, Chen & Rosenfeld 99]
- Part-of-speech tagging [Ratnaparkhi 98]
- Segmentation [Beeferman, Berger & Lafferty 99]
- Named entity recognition "MENE" [Borthwick, Grishman, ... 98]
HMMs for similar language tasks
- Part of speech tagging [Kupiec 92]
- Named entity recognition [Bikel et al 99]
- Other Information Extraction [Leek 97], [Freitag & McCallum 99]
Serial Generative/Discriminative Approaches
- Speech recognition [Schwartz & Austin 93]
- Reranking Parses [Collins, 00]
Other conditional Markov models
- Non-probabilistic local decision models [Brill 95], [Roth 98]
- Gradient-descent on state path [LeCun et al 98]
- Markov Processes on Curves (MPCs) [Saul & Rahim 99]
- Voted Perceptron-trained FSMs [Collins 02]

68
Person name Extraction
McCallum 2001, unpublished
69
Person name Extraction
70
Features in Experiment

Capitalized Xxxxx
Mixed Caps XxXxxx
All Caps XXXXX
Initial Cap X.
Contains Digit xxx5
All lowercase xxxx
Initial X
Punctuation .,!(), etc
Period .
Comma ,
Apostrophe
Dash -
Preceded by HTML tag

Character n-gram classifier says string is a person name (80% accurate)
In stopword list (the, of, their, etc)
In honorific list (Mr, Mrs, Dr, Sen, etc)
In person suffix list (Jr, Sr, PhD, etc)
In name particle list (de, la, van, der, etc)
In Census lastname list, segmented by P(name)
In Census firstname list, segmented by P(name)
In locations lists (states, cities, countries)
In company name list (J. C. Penny)
In list of company suffixes (Inc, Associates, Foundation)

Hand-built FSM person-name extractor says yes (prec/recall 30/95)
Conjunctions of all previous feature pairs, evaluated at the current time step.
Conjunctions of all previous feature pairs, evaluated at current step and one step ahead.
All previous features, evaluated two steps ahead.
All previous features, evaluated one step behind.
Total number of features: 500k
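A few of the orthographic and list-membership features above could be computed along these lines. This is an illustrative sketch only: the feature names, the regular expressions, and the four-entry honorific list are assumptions, and the experiment's actual feature extractor is not shown in the slides.

```python
import re

# Assumed regexes for some of the word-shape features listed above.
SHAPE_PATTERNS = {
    "capitalized": r"[A-Z][a-z]+",                # Xxxxx
    "mixed_caps": r"[A-Z][a-z]+[A-Z][A-Za-z]*",   # XxXxxx
    "all_caps": r"[A-Z]+",                        # XXXXX
    "initial_cap": r"[A-Z]\.",                    # X.
    "contains_digit": r".*\d.*",                  # xxx5
    "all_lowercase": r"[a-z]+",                   # xxxx
    "initial": r"[A-Z]",                          # X
}

HONORIFICS = frozenset({"Mr", "Mrs", "Dr", "Sen"})  # truncated example list


def word_features(token):
    """Return the set of feature names that fire for one token."""
    feats = {name for name, pattern in SHAPE_PATTERNS.items()
             if re.fullmatch(pattern, token)}
    if token.rstrip(".") in HONORIFICS:
        feats.add("in_honorific_list")
    return feats
```

Features like these overlap freely (a token can be both `all_caps` and `initial`), which is exactly the kind of dependence a conditionally trained model can tolerate.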
71
Training and Testing

Trained on 65k words from 85 pages, 30 different companies' web sites.
Training takes 4 hours on a 1 GHz Pentium.
Training precision/recall is 96% / 96%.
Tested on different set of web pages with similar size characteristics.
Testing precision is 92-95%, recall is 89-91%.

72
Part-of-speech Tagging
45 tags, 1M words training data, Penn Treebank
The/DT asbestos/NN fiber/NN ,/, crocidolite/NN ,/, is/VBZ unusually/RB
resilient/JJ once/IN it/PRP enters/VBZ the/DT lungs/NNS ,/, with/IN even/RB
brief/JJ exposures/NNS to/TO it/PRP causing/VBG symptoms/NNS that/WDT show/VBP
up/RP decades/NNS later/JJ ,/, researchers/NNS said/VBD ./.
Using spelling features:
use words, plus overlapping features: capitalized, begins with #, contains hyphen,
ends in -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies.
[Lafferty, McCallum, Pereira 2001]
73
Table Extraction from Government Reports
Cash receipts from marketings of milk during 1995
at 19.9 billion dollars, was slightly below
1994. Producer returns averaged 12.93 per
hundredweight, 0.19 per hundredweight
below 1994. Marketings totaled 154 billion
pounds, 1 percent above 1994. Marketings
include whole milk sold to plants and dealers as
well as milk sold directly to consumers.

An estimated 1.56 billion pounds of milk
were used on farms where produced, 8 percent
less than 1994. Calves were fed 78 percent of
this milk with the remainder consumed in
producer households.

Milk Cows and Production of Milk and Milkfat
United States, 1993-95
---------------------------------------------------------------------------------
        Number of     Production of Milk and Milkfat 2/
Year    Milk Cows 1/  Per Milk Cow        Percentage of Fat     Total
                      Milk      Milkfat   in All Milk Produced  Milk      Milkfat
---------------------------------------------------------------------------------
        1,000 Head    --- Pounds ---      Percent               Million Pounds

1993    9,589         15,704    575       3.66                  150,582   5,514.4
1994    9,500         16,175    592       3.66                  153,664   5,623.7
1995    9,461         16,451    602       3.66                  155,644   5,694.3
---------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.

74
Table Extraction from Government Reports
Pinto, McCallum, Wei, Croft, 2003
100 documents from www.fedstats.gov
Labels
CRF

Non-Table
Table Title
Table Header
Table Data Row
Table Section Data Row
Table Footnote
... (12 in all)

Features

Percentage of digit chars
Percentage of alpha chars
Indented
Contains 5 consecutive spaces
Whitespace in this line aligns with prev.
...
Conjunctions of all previous features, time
offsets {0,0}, {-1,0}, {0,1}, {1,2}.
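The line-level features above can be sketched as follows. This is a rough illustration under stated assumptions, not the feature extractor used in the cited work: the function names, the exact definition of "whitespace aligns with prev." (a shared column where a whitespace gap begins), and the reading of the offsets as pairs of line positions relative to the current line are all assumptions.

```python
from typing import Dict, List, Set, Union

Value = Union[bool, float]


def gap_columns(line: str) -> Set[int]:
    """Columns where a run of spaces starts right after some text."""
    return {i for i in range(1, len(line)) if line[i] == " " and line[i - 1] != " "}


def line_features(line: str, prev: str = "") -> Dict[str, Value]:
    """Features of one text line, given the previous line."""
    n = max(len(line), 1)
    return {
        "pct-digit": sum(c.isdigit() for c in line) / n,
        "pct-alpha": sum(c.isalpha() for c in line) / n,
        "indented": line[:1] == " ",
        "five-spaces": (" " * 5) in line,
        # assumption: "aligns" means a whitespace gap starts in the same column
        "aligns-with-prev": bool(gap_columns(line) & gap_columns(prev)),
    }


def conjunctions(rows: List[Dict[str, Value]], t: int,
                 offset_pairs=((0, 0), (-1, 0), (0, 1), (1, 2))) -> Set[str]:
    """Conjoin boolean features that fire at two offsets relative to line t."""
    def active(i: int) -> List[str]:
        return sorted(k for k, v in rows[i].items() if v is True)

    out: Set[str] = set()
    for a, b in offset_pairs:
        if 0 <= t + a < len(rows) and 0 <= t + b < len(rows):
            for f in active(t + a):
                for g in active(t + b):
                    out.add(f"{f}@{a}&{g}@{b}")
    return out


# Three lines loosely modelled on the milk table; column widths are made up.
fmt = "{:<12}{:<10}{}".format
lines = [
    "Milk Cows and Production of Milk and Milkfat",
    fmt("1993", "9,589", "15,704"),
    fmt("1994", "9,500", "16,175"),
]
rows = [line_features(cur, prev) for prev, cur in zip([""] + lines, lines)]

if __name__ == "__main__":
    print(rows[2])
    print(sorted(conjunctions(rows, 2)))
```

The conjunction at offsets (-1, 0) lets a model express "the previous line had wide gaps and this line lines up with it", which is a strong cue for a table data row.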

75
Table Extraction Experimental Results
Pinto, McCallum, Wei, Croft, 2003
Line labels, percent correct
HMM                       65
Stateless MaxEnt          85
CRF w/out conjunctions    52
CRF                       95
Δ error = 85%
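The slide does not say which pair of systems the error figure compares. A quick check of relative error reduction, assuming 95 is the accuracy of the full CRF, suggests it is the HMM-to-CRF comparison; the helper name below is hypothetical.

```python
def error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Relative reduction in error rate, in percent, between two accuracies."""
    base_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (base_err - new_err) / base_err


hmm_to_crf = error_reduction(65, 95)      # errors fall from 35 to 5
maxent_to_crf = error_reduction(85, 95)   # errors fall from 15 to 5

if __name__ == "__main__":
    print(f"HMM -> CRF:    {hmm_to_crf:.1f}% fewer errors")
    print(f"MaxEnt -> CRF: {maxent_to_crf:.1f}% fewer errors")
```

Going from 65 to 95 percent correct removes 30 of 35 errors, about 85.7 percent, which is close to the figure on the slide; the MaxEnt-to-CRF comparison gives about 66.7 percent.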
76
Named Entity Recognition
Reuters stories on international news. Train on
300k words.
CRICKET - MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22
South African provincial side Boland
said on Thursday they had signed Leicestershire
fast bowler David Millns on a one year contract.
Millns, who toured Australia with England A in
1992, replaces former England all-rounder Phillip
DeFreitas as Boland's overseas professional.
Labels   Examples
PER      Yayuk Basuki; Innocent Butare
ORG      3M; KDP; Leicestershire
LOC      Leicestershire; Nirmal Hriday; The Oval
MISC     Java; Basque; 1,000 Lakes Rally
77
Automatically Induced Features
McCallum 2003
Index  Feature
0      inside-noun-phrase (o_t-1)
5      stopword (o_t)
20     capitalized (o_t+1)
75     word=the (o_t)
100    in-person-lexicon (o_t-1)
200    word=in (o_t+2)
500    word=Republic (o_t+1)
711    word=RBI (o_t) & header=BASEBALL
1027   header=CRICKET (o_t) & in-English-county-lexicon (o_t)
1298   company-suffix-word (firstmention_t+2)
4040   location (o_t) & POS=NNP (o_t) & capitalized (o_t) & stopword (o_t-1)
4945   moderately-rare-first-name (o_t-1) & very-common-last-name (o_t)
4474   word=the (o_t-2) & word=of (o_t)
78
Named Entity Extraction Results
McCallum & Li, 2003
Method                                                F1   parameters
BBN's Identifinder, word features                     79   500k
CRFs word features, w/out Feature Induction           80   500k
CRFs many features, w/out Feature Induction           75   3 million
CRFs many candidate features, with Feature Induction  90   60k
79
Inducing State-Transition Structure
K-reversible grammars: Chidlovskii, 2000
Structure learning for HMMs + IE: Seymore et al
1999; Freitag & McCallum 2000
80
Limitations of Finite State Models

Finite state models have a linear structure
Web documents have a hierarchical structure
Are we suffering by not modeling this structure
more explicitly?
How can one learn a hierarchical extraction model?

81
Tree-based Models
82

Extracting from one web site
Use site-specific formatting information e.g.,
the JobTitle is a bold-faced paragraph in column
2
For large well-structured sites, like parsing a
formal language
Extracting from many web sites
Need general solutions to entity extraction,
grouping into records, etc.
Primarily use content information
Must deal with a wide range of ways that users
present data.
Analogous to parsing natural language
Problems are complementary
Site-dependent learning can collect training data
for a site-independent learner
Site-dependent learning can boost accuracy of a
site-independent learner on selected key sites

86
STALKER: Hierarchical boundary finding
Muslea, Minton & Knoblock 99

Main idea
To train a hierarchical extractor, pose a series
of learning problems, one for each node in the
hierarchy
At each stage, extraction is simplified by
knowing about the context.

90
(BEFORE(), AFTERnull)
91
(BEFORE(), AFTERnull)
92
(BEFORE(), AFTERnull)
93
Stalker hierarchical decomposition of two web
sites
94
Stalker summary and results

Rule format
landmark automata format for rules, which
extended BWI's format
E.g. <a>W. Cohen</a> CMU: Web IE </li>
BWI: BEFORE=(<, /, a, >, ANY, :)
STALKER: BEGIN = SkipTo(<, /, a, >), SkipTo(:)
Top-down rule learning algorithm
Carefully chosen ordering between types of rule
specializations
Very fast learning: e.g. 8 examples vs. 274
A lesson: we often control the IE training
data!
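How a chain of SkipTo landmarks locates the start of a field can be sketched in a few lines. This is a simplified illustration, not STALKER itself: the tokenizer, the function names, and the colon used as the second landmark (the transcript lost that character) are assumptions, and real landmark automata also support wildcards and disjunctions.

```python
import re
from typing import List, Optional, Sequence


def tokenize(text: str) -> List[str]:
    """Split into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def skip_to(tokens: Sequence[str], start: int,
            landmark: Sequence[str]) -> Optional[int]:
    """Index just past the first occurrence of landmark at or after start."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if list(tokens[i:i + n]) == list(landmark):
            return i + n
    return None


def apply_rule(tokens: Sequence[str],
               landmarks: Sequence[Sequence[str]]) -> Optional[int]:
    """Apply SkipTo(l1), SkipTo(l2), ... in order; None if any landmark is missing."""
    pos: Optional[int] = 0
    for landmark in landmarks:
        pos = skip_to(tokens, pos, landmark)
        if pos is None:
            return None
    return pos


tokens = tokenize("<a>W. Cohen</a> CMU: Web IE </li>")
# BEGIN = SkipTo(<, /, a, >), SkipTo(:)   (the ":" landmark is an assumption)
begin = apply_rule(tokens, [["<", "/", "a", ">"], [":"]])

if __name__ == "__main__":
    print(begin, tokens[begin:begin + 2])
```

Each SkipTo consumes input up to and including its landmark, so the rule never has to describe the tokens in between; that is part of why so few examples are needed.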

95
Why low sample complexity is important in
wrapper learning
At training time, only four examples are
available, but one would like to generalize to
future pages as well
96
Wrapster: a hybrid approach to representing
wrappers
Cohen, Jensen & Hurst, WWW02

Common representations for web pages include
a rendered image
a DOM tree (tree of HTML markup text)
gives some of the power of hierarchical
decomposition
a sequence of tokens
a bag of words, a sequence of characters, a node
in a directed graph, . . .
Questions
How can we engineer a system to generalize
quickly?
How can we explore representational choices
easily?

97
Wrapster architecture

Bias is an ordered set of builders.
Builders are simple micro-learners.
A single master algorithm co-ordinates learning.
Hybrid top-down/bottom-up rule learning
Terminology
Span: substring of page, created by a predicate
Predicate: subset of span x span, created by a
builder
Builder: a micro-learner, created by hand

98
Wrapster predicates

A predicate is a binary relation on spans
p(s,t) means that t is extracted from s.
Membership in a predicate can be tested
Given (s,t), is p(s,t) true?
Predicates can be executed
EXECUTE(p,s) = { t : p(s,t) }

99
Example Wrapster predicate
Rendered page (http://wasBang.org/aboutus.html):
WasBang.com contact info
Currently we have offices in two locations
Pittsburgh, PA
Provo, UT

DOM tree:
html
  head
  body
    p: WasBang.com .. info
    p: Currently..
    ul
      li
        a: Pittsburgh, PA
      li
        a: Provo, UT
100
Example Wrapster predicate

Example
p(s1,s2) iff s2 are the tokens below an li node
inside a ul node inside s1.
EXECUTE(p,s1) extracts
Pittsburgh, PA
Provo, UT

http://wasBang.org/aboutus.html
WasBang.com contact info
Currently we have offices in two locations
Pittsburgh, PA
Provo, UT

101
Wrapster builders

Builders are based on simple, restricted
languages, for example
L_tagpath: p is defined by tag_1,...,tag_k, and
p_{tag_1,...,tag_k}(s1,s2) is true iff s1 and s2
correspond to DOM nodes and s2 is reached from s1
by following a path ending in tag_1,...,tag_k
EXECUTE(p_{ul,li}, s1) = { "Pittsburgh, PA", "Provo, UT" }
L_bracket: p is defined by a pair of strings
(l,r), and p_{l,r}(s1,s2) is true iff s2 is preceded
by l and followed by r.
EXECUTE(p_{in,locations}, s1) = { "two" }
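A bracket predicate with both a membership test and an EXECUTE operation might look like the sketch below. It is not Wrapster code: the class and method names are hypothetical, spans are represented as half-open token ranges, delimiters are single tokens, and the restriction that the extracted span may not itself contain a delimiter is an assumption added to keep the result set small.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # half-open token range [start, end)


class BracketPredicate:
    """p_{l,r}(s1, s2): s2 lies inside s1, preceded by l and followed by r."""

    def __init__(self, left: str, right: str) -> None:
        self.left = left
        self.right = right

    def execute(self, tokens: Sequence[str], s1: Span) -> List[Span]:
        """EXECUTE(p, s1): every s2 such that p(s1, s2) holds."""
        lo, hi = s1
        found: List[Span] = []
        for i in range(lo, hi):
            if tokens[i] != self.left:
                continue
            for j in range(i + 1, hi):
                if tokens[j] == self.right:
                    if j > i + 1:          # skip empty spans
                        found.append((i + 1, j))
                    break
                if tokens[j] == self.left:
                    break                  # assumption: s2 contains no delimiter
        return found

    def holds(self, tokens: Sequence[str], s1: Span, s2: Span) -> bool:
        """Membership test: is p(s1, s2) true?"""
        return s2 in self.execute(tokens, s1)


tokens = ("WasBang.com contact info Currently we have offices in two "
          "locations Pittsburgh , PA Provo , UT").split()
page: Span = (0, len(tokens))
p = BracketPredicate("in", "locations")
spans = p.execute(tokens, page)

if __name__ == "__main__":
    print([tokens[a:b] for a, b in spans])
```

Implementing the membership test through `execute` keeps the two operations consistent, at the cost of recomputing the full result set on every test.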

102
Wrapster builders

For each language L there is a builder B which
implements
LGG(positive examples of p(s1,s2)): the least
general p in L that covers all the positive
examples (like pairwise generalization)
For L_bracket, longest common prefix and suffix of
the examples.
REFINE(p, examples): a set of p's that cover
some but not all of the examples.
For L_tagpath, extend the path with one additional
tag that appears in the examples.
Builders/languages can be combined
E.g. to construct a builder for (L1 and L2) or
(L1 composeWith L2)
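The LGG operation for the bracket language can be sketched as the longest common suffix of the left contexts plus the longest common prefix of the right contexts. This is an illustration only: it works on tokens rather than raw strings, the function names are hypothetical, and the job-listing text is borrowed from the composing-builders slide.

```python
import re
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # half-open token range [start, end)


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+|[^\w\s]", text)


def common_prefix(seqs: Sequence[Sequence[str]]) -> List[str]:
    """Longest token sequence that every input sequence starts with."""
    prefix: List[str] = []
    for column in zip(*seqs):
        if any(tok != column[0] for tok in column):
            break
        prefix.append(column[0])
    return prefix


def common_suffix(seqs: Sequence[Sequence[str]]) -> List[str]:
    return common_prefix([list(reversed(s)) for s in seqs])[::-1]


def lgg_bracket(tokens: Sequence[str],
                examples: Sequence[Span]) -> Tuple[List[str], List[str]]:
    """Least general (l, r) pair covering all positive example spans."""
    lefts = [list(tokens[:start]) for start, _ in examples]
    rights = [list(tokens[end:]) for _, end in examples]
    return common_suffix(lefts), common_prefix(rights)


def find(tokens: Sequence[str], phrase: str) -> Span:
    """Locate a phrase in the token list (helper for building examples)."""
    words = tokenize(phrase)
    for i in range(len(tokens) - len(words) + 1):
        if list(tokens[i:i + len(words)]) == words:
            return (i, i + len(words))
    raise ValueError(phrase)


tokens = tokenize(
    "Webmaster (New York). Perl, servlets essential. "
    "Librarian (Pittsburgh). MLS required. "
    "Ski Instructor (Vancouver). Snowboarding skills also useful."
)
examples = [find(tokens, city) for city in ("New York", "Pittsburgh", "Vancouver")]
left, right = lgg_bracket(tokens, examples)

if __name__ == "__main__":
    print(left, right)
```

On this text the least general bracket comes out as left `(` and right `)` followed by `.`, because "least general" means keeping every token the examples share; a shorter right delimiter would cover the same examples but be more general.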

103
Wrapster builders - examples

Compose tagpaths and brackets
E.g., extract strings between ( and ) inside
a list item inside an unordered list
Compose tagpaths and language-based extractors
E.g., extract city names inside the first
paragraph
Extract items based on position inside a rendered
table, or properties of the rendered text
E.g., extract items inside any column headed by
text containing the words Job and Title
E.g. extract items in boldfaced italics

104
Composing builders

Composing builders for L_tagpath and L_bracket.
LGG of the locations would be
(p_tags composeWith p_{L,R})
where
tags = ul,li
L = (
R = )

Jobs at WasBang.com
Call (888)-555-1212 now to apply!
Webmaster (New York). Perl, servlets essential.
Librarian (Pittsburgh). MLS required.
Ski Instructor (Vancouver). Snowboarding skills
also useful.
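Composing a tagpath predicate with a bracket predicate can be sketched with the standard-library HTML parser: first extract the text under every ul/li node, then apply the bracket inside each of those spans. The HTML markup below is a made-up rendering of this slide's page, the class and function names are hypothetical, and the parser assumes well-formed markup with explicit closing tags.

```python
import re
from html.parser import HTMLParser
from typing import List, Optional, Sequence


class TagPath(HTMLParser):
    """Collect the text under nodes whose ancestor path ends in `path`."""

    def __init__(self, path: Sequence[str]) -> None:
        super().__init__()
        self.path = list(path)
        self.stack: List[str] = []
        self.spans: List[str] = []
        self._open_depth: Optional[int] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if self._open_depth is None and self.stack[-len(self.path):] == self.path:
            self._open_depth = len(self.stack)
            self._text = []

    def handle_data(self, data):
        if self._open_depth is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if not self.stack or self.stack[-1] != tag:
            return  # ignore stray close tags
        if self._open_depth == len(self.stack):
            self.spans.append("".join(self._text).strip())
            self._open_depth = None
        self.stack.pop()


def execute_tagpath(html: str, path: Sequence[str]) -> List[str]:
    parser = TagPath(path)
    parser.feed(html)
    parser.close()
    return parser.spans


def execute_bracket(text: str, left: str, right: str) -> List[str]:
    return re.findall(re.escape(left) + r"(.*?)" + re.escape(right), text)


def compose(html: str, path: Sequence[str], left: str, right: str) -> List[str]:
    """(p_tags composeWith p_{L,R}): run the bracket inside each tagpath span."""
    return [t for span in execute_tagpath(html, path)
            for t in execute_bracket(span, left, right)]


PAGE = """<html><body>
<h1>Jobs at WasBang.com</h1>
<p>Call (888)-555-1212 now to apply!</p>
<ul>
<li>Webmaster (New York). Perl, servlets essential.</li>
<li>Librarian (Pittsburgh). MLS required.</li>
<li>Ski Instructor (Vancouver). Snowboarding skills also useful.</li>
</ul>
</body></html>"""

locations = compose(PAGE, ["ul", "li"], "(", ")")

if __name__ == "__main__":
    print(locations)
    print(execute_bracket(PAGE, "(", ")"))
```

The bracket on its own also matches the area code in the phone number; restricting it to list items through composition is what removes that false positive.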

105
Composing builders structural/global

Jobs at WasBang.com
Call Alberta Hill at 1-888-555-1212 now to apply!
Webmaster (New York). Perl, servlets essential.
Librarian (Pittsburgh). MLS required.
Ski Instructor (Vancouver). Snowboarding skills
also useful.

Composing builders for L_tagpath and L_city
L_city = {p_city} where p_city(s1,s2) iff s2 is a
city name inside of s1.
LGG of the locations would be
p_tags composeWith p_city

106
Table-based builders
How to represent links to pages about
singers? Builders can be based on a geometric
view of a page.
107
Wrapster results
[Plot: F1 vs. number of examples]
108
Wrapster results
Examples needed for 100% accuracy
109
Broader Issues in IE
110
Broader View
Up to now we have been focused on segmentation
and classification
[Diagram: Create ontology; Spider; Filter by
relevance; IE (Segment, Classify, Associate,
Cluster); Load DB; Database; Query, Search;
Data mine; Document collection; Label training
data; Train extraction models]
111
Broader View
Now touch on some other issues
[Diagram, numbered issues in parentheses: Create
ontology (3); Spider; Filter by relevance; IE
(Tokenize, Segment, Classify, Associate (1),
Cluster (2)); Load DB; Database; Query, Search;
Document collection; Train extraction models (4);
Data mine (5); Label training data]
112
(1) Association as Binary Classification
Christos Faloutsos conferred with Ted Senator,
the KDD 2003 General Chair.
Person
Person
Role
Person-Role (Christos Faloutsos, KDD 2003
General Chair) ? NO
Person-Role ( Ted Senator, KDD 2003
General Chair) ? YES
Do this with SVMs and tree kernels over parse
trees.
Zelenko et al, 2002
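Framing association as binary classification means enumerating every candidate (Person, Role) pair and asking a classifier for a yes/no decision on each. The sketch below shows only that framing. The decision rule is a toy stand-in, not the SVM with tree kernels from the cited work, and the `Mention` type, function names, and token offsets are all hypothetical.

```python
from itertools import product
from typing import Dict, List, NamedTuple, Sequence, Tuple


class Mention(NamedTuple):
    text: str
    kind: str   # entity type, e.g. "Person" or "Role"
    start: int  # token offsets, half-open
    end: int


def candidate_pairs(mentions: Sequence[Mention], kind_a: str,
                    kind_b: str) -> List[Tuple[Mention, Mention]]:
    """Every (a, b) pair of mentions with the requested types."""
    a = [m for m in mentions if m.kind == kind_a]
    b = [m for m in mentions if m.kind == kind_b]
    return list(product(a, b))


def toy_classifier(person: Mention, role: Mention,
                   mentions: Sequence[Mention]) -> bool:
    """Stand-in rule: YES iff no other Person lies between the two mentions."""
    left, right = (person, role) if person.start < role.start else (role, person)
    return not any(
        m.kind == "Person" and m != person
        and left.end <= m.start and m.end <= right.start
        for m in mentions
    )


# "Christos Faloutsos conferred with Ted Senator , the KDD 2003 General Chair ."
mentions = [
    Mention("Christos Faloutsos", "Person", 0, 2),
    Mention("Ted Senator", "Person", 4, 6),
    Mention("KDD 2003 General Chair", "Role", 8, 12),
]

decisions: Dict[Tuple[str, str], bool] = {
    (p.text, r.text): toy_classifier(p, r, mentions)
    for p, r in candidate_pairs(mentions, "Person", "Role")
}

if __name__ == "__main__":
    for pair, answer in decisions.items():
        print(pair, "YES" if answer else "NO")
```

A learned classifier would replace `toy_classifier` and would typically see features or a kernel computed over the parse tree connecting the two mentions rather than a distance heuristic.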
113
(1) Association with Finite State Machines
Ray & Craven, 2001
This enzyme, UBC6, localizes to the endoplasmic
reticulum, with the catalytic domain facing the
cytosol.
DET this, N enzyme, N ubc6, V localizes, PREP to,
ART the, ADJ endoplasmic, N reticulum, PREP with,
ART the, ADJ catalytic, N domain, V facing,
ART the, N cytosol
Subcellular-localization (UBC6, endoplasmic
reticulum)
114
(1) Association using Parse Tree
Miller et al 2000
Simultaneously POS tag, parse, extract &
associate!
Increase space of parse constituents to include
entity and relation tags
Notation   Description
c_h        head constituent category
c_m        modifier constituent category
X_p        X of parent node
t          POS tag
w          word
Parameters                        e.g.
P(c_h | c_p)                      P(vp | s)
P(c_m | c_p, c_hp, c_m-1, w_p)    P(per/np | s, vp, null, said)
P(t_m | c_m, t_h, w_h)            P(per/nnp | per/np, vbd, said)
P(w_m | c_m, t_m, t_h, w_h)       P(nance | per/np, per/nnp, vbd, said)
(This is also a great example of extraction using
a tree model.)
115
(1) Association with Graphical Models
Roth & Yih 2002
Capture arbitrary-distance dependencies among
predictions.
116
(1) Association with Graphical Models
Roth & Yih 2002
Also capture long-distance dependencies among
predictions.
Random variable over the class of relation
between entity 2 and 1, e.g. over lives-in,
is-boss-of, ...
person
Random variable over the class of entity 1, e.g.
over person, location, ...
lives-in
Local language models contribute evidence to
relation classification.
person?
Local language models contribute evidence to
entity classification.
Dependencies between classes of entities and
relations!
Inference with loopy belief propagation.
117
(1) Association with Graphical Models
Roth & Yih 2002
Also capture long-distance dependencies among
predictions.
Random variable over the class of relation
between entity 2 and 1, e.g. over lives-in,
is-boss-of, ...
person
Random variable over the class of entity 1, e.g.
over person, location, ...
lives-in
Local language models contribute evidence to
relation classification.
location
Local language models contribute evidence to
entity classification.
Dependencies between classes of entities and
relations!
Inference with loopy belief propagation.
118
(1) Association with Grouping Labels
Jensen & Cohen, 2001

Create a simple language that reflects a field's
relation to other fields
Language represents ability to define
Disjoint fields
Shared fields
Scope
Create rules that use field labels

119
(1) Semantics of language of labels
120
(1) Grouping labels A simple example
NextNamerecordstart
Name: Box Kite
Company: -
Location: -
Order: -
Cost: 100
Description: -
Color: -
Size: -

Kites
Buy a kite
Box Kite 100
Stunt Kite 300
121
(2) Grouping labels A messy example
nextNamerecordstart
prevlinkCost
Kites
Buy a kite
Box Kite 100
Stunt Kite 300

Name: Box Kite
Company: -
Location: -
Order: -
Cost: 100
Description: Great for kids
Color: blue
Size: small

Box Kite
Great for kids
Detailed specs

Specs
Color: blue
Size: small
pagetypeProduct
122
(2) User interface adding labels to extracted
fields
123
(1) Experimental Evaluation of Grouping Labels
Fixed language, then wrapped 499 new sites, all of
which could be handled.
124
Broader View
Now touch on some other issues
[Diagram, numbered issues in parentheses: Create
ontology (3); Spider; Filter by relevance; IE
(Tokenize, Segment, Classify, Associate (1),
Cluster (2)); Load DB; Database; Query, Search;
Document collection; Train extraction models (4);
Data mine (5); Label training data]
Object Consolidation
125
(2) Learning a Distance Metric Between Records
Borthwick, 2000; Cohen & Richman, 2001; Bilenko &
Mooney, 2002, 2003
Learn Pr({duplicate, not-duplicate} | record1,
record2) with a Maximum Entropy cl
