Title: Ex Information Extraction System
1. Ex Information Extraction System
- Martin Labsky
- labsky_at_vse.cz
- KEG seminar, March 2006
2. Agenda
- Purpose
- Use cases
- Sources of knowledge
- Identifying attribute candidates
- Parsing instance candidates
- Implementation status
3. Purpose
- Extract objects from documents
  - object = an instance of a class from an ontology
  - document = text, possibly with formatting, and other documents from the same source
- Usability
  - make simple things simple
  - make complex things possible
4. Use Cases
- Extraction of objects of a known, well-defined class(es)
- From document collections of any size
- Structured, semi-structured, free-text
- Extraction should improve if
  - documents contain some formatting (e.g. HTML)
  - this formatting is similar within or across document(s)
- Examples
  - Product catalogues (e.g. detailed product descriptions)
  - Weather forecast sites (e.g. forecasts for the next day)
  - Restaurant descriptions (cuisine, opening hours etc.)
  - Emails on a certain topic
  - Contact information
5. Use 3 sources of knowledge
- Ontology
  - the only mandatory source
  - class definitions + IE hooks (e.g. regexps)
- Sample instances
  - possibly coupled with referring documents
  - get to know the typical content and context of extractable items
- Common formatting structure
  - of instances presented
    - in a single document, or
    - among documents from the same source
6. Ontology sample
7. Sample Instances
- see monitors.tsv and .html
8. Common Formatting
- If a document or a group of documents has a common or similar regular structure, this structure can be identified by a wrapper and used to improve extraction (esp. recall)
9. Document understanding
- Known pattern spotting (4)
- ID of possible wrappers (2)
- ID of attribute candidates (2)
- Parsing attribute candidates (4)
10. Known pattern spotting (1)
- Sources of known patterns
  - attribute content patterns
    - specified in EOL
    - induced automatically by generalizing attribute contents in sample instances
  - attribute context patterns
    - specified in EOL
    - induced automatically by generalizing attribute context observed in referring documents
11. Known pattern spotting (2)
- Known phrases and patterns are represented using a single data structure
- Features of the phrase "monitor VIEWSONIC VP201s LCD":

                  monitor  VIEWSONIC  VP201s  LCD
  token ID          230       215       567   719
  lwrcase ID        211       215       456   718
  lemma ID          211       215       456   718
  token type         AL        AL        AL    AN
  capitalization     UC        LC        UC    MX
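The per-token feature record above can be sketched as a small data structure; this is a minimal illustration, not the system's actual API, and the field names and a plain dict standing in for the character trie are assumptions.

```python
from dataclasses import dataclass

# Hypothetical sketch of the per-token feature record from the slide;
# field names are illustrative, not the system's real identifiers.
@dataclass
class TokenFeatures:
    token_id: int        # ID of the exact token form
    lowercase_id: int    # ID of the lowercased form
    lemma_id: int        # ID of the lemma
    token_type: str      # e.g. "AL" (alphabetic), "AN" (alphanumeric)
    capitalization: str  # e.g. "UC", "LC", "MX"

# Toy Vocabulary (the real system stores tokens in a character trie);
# the feature values here are illustrative, not taken from the slide.
vocabulary = {
    "monitor":   TokenFeatures(230, 211, 211, "AL", "LC"),
    "VIEWSONIC": TokenFeatures(215, 215, 215, "AL", "UC"),
    "VP201s":    TokenFeatures(567, 456, 456, "AN", "MX"),
    "LCD":       TokenFeatures(719, 718, 718, "AL", "UC"),
}

phrase = ["monitor", "VIEWSONIC", "VP201s", "LCD"]
features = [vocabulary[t] for t in phrase]
```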
12. Known pattern spotting (3)
- Known phrases and patterns are represented using a single data structure
- Features of the phrase "monitor VIEWSONIC VP201s LCD" (counts kept per attribute):
  - phrase ID: 989
  - lemma phrase ID: 567
  - cnt as monitor_name content: 3
  - cnt as monitor_name L-context: 0
  - cnt as monitor_name R-context: 0
  - cnt as garbage: 0
13. Known pattern spotting (4)
- Pattern generalizing the content of attribute monitor_name:
  - tokens: monitor, viewsonic, <AN MX> (repeated 1-2 times), lcd
  - the fixed tokens are matched by lemma ID (211, 215, 456); all other feature IDs are wildcarded (-1)
  - the generalized slot matches tokens with token type AN and capitalization MX
14. Known pattern spotting (5)
- Pattern generalizing the content of attribute monitor_name (monitor viewsonic <AN MX>{1-2} lcd):
  - pattern ID: 345
  - cnt as monitor_name content: 27
  - cnt as monitor_name L-context: 0
  - cnt as monitor_name R-context: 0
  - cnt as garbage: 0
15. Known pattern spotting (6)
- Data structures
  - all known tokens are stored in a Vocabulary (character trie) along with their features
  - all known phrases and patterns are stored in a PhraseBook (token trie), also with features
- Precision and recall of a known pattern
  - using the stored count features, we get the precision and recall of each pattern with respect to each attribute content, L-context and R-context:
  - precision = c(pattern & attr_content) / c(pattern)
  - recall = c(pattern & attr_content) / c(attr_content)
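The two count-based formulas above can be written directly; the counts in the usage example are illustrative.

```python
# Minimal sketch of the precision/recall formulas from the slide,
# computed from stored count features.
def pattern_precision(c_pattern_and_content: int, c_pattern: int) -> float:
    """precision = c(pattern & attr_content) / c(pattern)"""
    return c_pattern_and_content / c_pattern

def pattern_recall(c_pattern_and_content: int, c_content: int) -> float:
    """recall = c(pattern & attr_content) / c(attr_content)"""
    return c_pattern_and_content / c_content

# e.g. a pattern seen 30 times, 27 of them as monitor_name content,
# out of 90 monitor_name contents in total (invented counts)
p = pattern_precision(27, 30)   # 0.9
r = pattern_recall(27, 90)      # 0.3
```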
16. Document understanding
- Known phrase/pattern spotting (4)
- ID of possible wrappers (2)
- ID of attribute candidates (2)
- Parsing attribute candidates (4)
17. ID of possible wrappers (1)
- Given a collection of documents from the same source:
- for each attribute
  - identify all high-precision phrases (hpps)
  - apply a wrapper induction algorithm, specifying the hpps as labeled samples
  - get n-best wrapper hypotheses
18. ID of possible wrappers (2)
- Start with a simple wrapper induction algorithm
  - for each attribute
    - list the L-contexts, R-contexts, and X-PATHs (LRPs) leading to labeled attribute samples
    - find clusters of samples with similar LRPs
    - for each cluster with size > threshold
      - compute the most specific generalization of the LRP that covers the whole cluster
      - this generalized LRP is hoped to also cover unlabeled attributes
    - the (single) wrapper on output is the set of generalized LRPs
- Able to plug in different wrapper induction algorithms
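For the XPath component of an LRP, one simple notion of "most specific generalization" is to keep the location steps shared by all samples in the cluster and wildcard the rest. This is a hedged sketch under that assumption; the function name and wildcard notation are illustrative, not the system's.

```python
# Hypothetical sketch: generalize a cluster of XPaths by keeping steps
# shared by all samples and wildcarding positions where they differ.
def generalize_xpaths(xpaths):
    split = [p.strip("/").split("/") for p in xpaths]
    if len({len(s) for s in split}) != 1:
        return None  # different depths: no simple positional generalization
    steps = []
    for position in zip(*split):
        # keep the step if all samples agree, otherwise wildcard it
        steps.append(position[0] if len(set(position)) == 1 else "*")
    return "/" + "/".join(steps)

cluster = [
    "/html/body/table/tr[1]/td[2]",
    "/html/body/table/tr[2]/td[2]",
    "/html/body/table/tr[3]/td[2]",
]
generalized = generalize_xpaths(cluster)
```

The generalized path covers the labeled rows and, hopefully, the unlabeled rows of the same table.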
19. Document understanding
- Known phrase/pattern spotting (4)
- ID of possible wrappers (2)
- ID of attribute candidates (2)
- Parsing attribute candidates (4)
20. Attribute candidate (AC) generation
- for each known phrase P in the document collection
  - if P is known as the content of some attribute A
    - create a new AC from this P
  - if P is known as a high-precision L-(R-)context of some attribute A
    - create new ACs from the phrases to the right (left) of P
    - in each such AC, set the feature has_context_of_attribute_A = 1
- for each wrapper WA for attribute A
  - for each phrase P covered by WA
    - if P is not already an AC, create a new AC
    - in the AC, set the feature in_wrapper_of_attribute_A = 1
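The generation rules above can be sketched as follows; the PhraseBook query callables and the dict-based AC representation are hypothetical stand-ins for the system's trie lookups.

```python
# Illustrative sketch of AC generation; is_content_of / is_l_context_of /
# right_neighbor are invented stand-ins for PhraseBook queries.
def generate_acs(phrases, is_content_of, is_l_context_of, right_neighbor):
    acs = []
    for p in phrases:
        for att in is_content_of(p):       # P known as content of attribute att
            acs.append({"phrase": p, "att": att, "features": {}})
        for att in is_l_context_of(p):     # P known as a high-precision L-context
            q = right_neighbor(p)          # phrase to the right of P
            if q is not None:
                acs.append({"phrase": q, "att": att,
                            "features": {f"has_context_of_attribute_{att}": 1}})
    return acs

acs = generate_acs(
    phrases=["VIEWSONIC VP201s", "monitor name:"],
    is_content_of=lambda p: ["monitor_name"] if p == "VIEWSONIC VP201s" else [],
    is_l_context_of=lambda p: ["monitor_name"] if p == "monitor name:" else [],
    right_neighbor=lambda p: "VIEWSONIC VP201s" if p == "monitor name:" else None,
)
```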
21. Attribute candidates
- Properties
  - many overlapping attribute candidates
  - maximum recall, but precision is low
- [Figure: token stream a..l with many overlapping candidates for attributes Att_X, Att_Y, Att_Z]
22. Document understanding
- Known phrase/pattern spotting (4)
- ID of possible wrappers (2)
- ID of attribute candidates (2)
- Parsing attribute candidates (4)
23. Parsing of attribute candidates
- The table below can be converted to a lattice
- A parse is a single path through the lattice
- Many paths are impossible due to ontology constraints
- Many paths still remain possible; we must determine the most probable one
- [Figure: token stream a..l with candidate rows for Att_X, Att_Y, Att_Z and Garbage]
24. Sample parse tree
- [Figure: parse tree with Doc at the root, ICLASS instance nodes below it, and attribute nodes AX, AY, AZ and Garbage heading spans of the tokens a..n]
25. AC parsing algorithm
- Left-to-right, bottom-up parsing
- Decoding phase
  - in each step, the algorithm selects the n most probable non-terminals to become heads of the observed (non-)terminal sequence
  - we support nested attributes, therefore some ACs may become heads of other ACs
  - an instance candidate (IC) may become a head of ACs that do not violate ontology constraints
  - the most probable heads are determined using features of the ACs in the examined AC sequence
  - features = all features assigned directly to the AC or to the underlying phrase
  - features have weights assigned during parser training
26. AC parser training
- Iterative training
  - initial feature weights are set
    - based on counts observed in sample instances
    - based on parameters defined in the ontology
  - the document collection is parsed (decoded) with the current features
  - feature weights are modified in the direction that improves the current parsing result
  - repeat while time allows or until convergence
27-28. AC Parser revised
- Attribute candidates (ACs)
  - AC identification by patterns
    - a matching pattern indicates an AC with some probability
    - patterns are given by the user or induced by the trainer
  - assignment of the conditional probability P(attA|phrase,context)
    - computed from
      - single-pattern conditional probabilities
      - single-pattern reliabilities (weights)
- AC parsing
  - trellis representation
  - algorithm
29. Pattern types by area
- Patterns can be defined for attribute
  - content
    - lcd monitor viewsonic ALPHANUMCAP
    - <FLOAT> <unit>
    - a special case of content pattern = a list of example attribute values
  - L/R context
    - "monitor name"
  - content + L/R context (units are better modeled as content)
    - <int> x <int> <unit>
    - <float> <unit>
  - DOM context
    - BLOCK_LEVEL_ELEMENT A
30. Pattern types by generality
- General patterns
  - expected to appear across multiple websites; used when parsing new websites
- Local (site-specific) patterns
  - all pattern types from the previous slide can have local variants for a specific website
  - we can have several local variants plus a general variant of the same pattern; these will differ in statistics (esp. pattern precision and weight)
  - local patterns are induced while joint-parsing documents with supposedly similar structure (e.g. from a single website)
  - for example, local DOM context patterns can get more detailed than general DOM context patterns, e.g.
    - TD[class=product_name] A (precision=1.0, weight=1.0)
  - statistics for local patterns are computed based on the local website only
  - local patterns are stored for each website (similar to a wrapper) and used when re-parsing the website next time; when deleted, they will be induced again the next time the website is parsed
31. Pattern match types
- Types of pattern matches
  - exact match,
  - approximate phrase match, if the pattern definition allows it, or
  - approximate numeric match for numeric types (int, float)
- Approximate phrase match
  - can use any general phrase distance or similarity measure
  - phrase distance: dist = f(phrase1, phrase2), 0 <= dist < infinity
  - phrase similarity: sim = f(phrase1, phrase2), 0 <= sim <= 1
  - now using a nested edit distance defined on tokens and their types
  - this distance is a black box for now; it returns dist and can compare phrase1 to a set of phrase2 candidates
- Approximate numeric match
  - when searching for values of a numeric attribute, all int or float values found in the analyzed documents are considered, except those not satisfying min or max constraints
  - the user specifies, or the trainer estimates,
    - a probability function, e.g. a simple value-probability table (for discrete values), or
    - a probability density function (pdf), e.g. weighted gaussians (for continuous values)
  - each specific number NUM found in a document can be further represented as
    - pdf(NUM),
    - P(less probable value than NUM | attribute) = sum over t with pdf(t) < pdf(NUM) of pdf(t),
    - or the likelihood relative to the pdf maximum: lik(NUM|attribute) = pdf(NUM) / max_t pdf(t)
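The relative-likelihood representation for a gaussian-mixture pdf can be sketched as below; the mixture parameters and the grid-based approximation of max_t pdf(t) are assumptions for illustration.

```python
import math

# Sketch of the approximate numeric match for a continuous attribute modeled
# as a mixture of weighted gaussians: lik(NUM|att) = pdf(NUM) / max_t pdf(t).
def gaussian(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, components):
    """components: list of (weight, mu, sigma)."""
    return sum(w * gaussian(x, mu, sigma) for w, mu, sigma in components)

def relative_likelihood(x, components, grid):
    # approximate max_t pdf(t) by evaluating the pdf on a grid
    peak = max(mixture_pdf(t, components) for t in grid)
    return mixture_pdf(x, components) / peak

# invented example: monitor prices clustered around 250 and 600
comps = [(0.7, 250.0, 50.0), (0.3, 600.0, 100.0)]
grid = [i * 5.0 for i in range(0, 200)]        # 0 .. 995
lik = relative_likelihood(250.0, comps, grid)  # near 1.0 at the main mode
```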
32. AC conditional probability: computing P(attA|pat)
- P(attA|phrase,ctx) = Σ_pat w_pat · P(attA|pat), with Σ_pat w_pat = 1
- How do we get P(attA|pat)? (how strongly a pattern indicates an AC)
  - exact pattern matches
    - the pattern's precision is estimated by the user, or
    - P(attA|pat) = c(pat indicates attA) / c(pat) in training data
  - approximate pattern matches
    - train a cumulative probability on held-out data (phrase similarity trained on training data)
    - P(attA|PHR) = interpolate(examples)
    - examples are
      - scored using similarity to (distance from) the pattern, and
      - classified as positive (examples of attA) or negative
  - approximate numeric matches
    - for a discrete p.d., the user estimates precisions for all discrete values as if they were separate exact matches, or they are computed from training data
      - P(attA|value) = p.d.(value|attA) · P(attA) / P(value)
    - for a continuous pdf (also possible for a discrete p.d.), train a cumulative probability on held-out data (pdfs/p.d. trained on training data)
    - P(attA|NUM) = interpolate(examples)
    - examples are
      - scored using pdf(NUM), or P(less probable value than NUM|attA), or lik(NUM|attA)
      - classified as positive or negative
- the items in red must come from training data
- examples should be both positive and negative
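The weighted combination at the top of the slide can be sketched directly; the pattern weights and precisions in the example are illustrative.

```python
# Sketch of the conditional combination from the slide: combine per-pattern
# precisions P(attA|pat) with reliabilities w_pat over the matched patterns,
# renormalizing the weights so they sum to 1.
def p_att_given_evidence(matched):
    """matched: list of (w_pat, P(attA|pat)) for the patterns that matched."""
    total_w = sum(w for w, _ in matched)
    return sum(w * p for w, p in matched) / total_w

# invented example: a reliable content pattern and a weaker context pattern
matched = [(3.0, 0.9), (1.0, 0.5)]
p = p_att_given_evidence(matched)   # (3*0.9 + 1*0.5) / 4 = 0.8
```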
33. Approximate matches
- From the above examples, derive a mapping dist → P(attA|dist)
- Other mappings are possible: we could fit a linear or logarithmic curve, e.g. by least squares
- An analogous approach is taken for numeric approximate matches
  - pdf(NUM), lik(NUM|attA) or P(less probable value than NUM|attA) replaces dist(P,attA), and the x scale is reversed
- [Plot: P(attA|dist) on the y axis (0 to 1, 0.5 marked) against dist(P,attA) on the x axis, with ticks at 0, 0.06, 0.12 and 0.50]
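The dist → P(attA|dist) mapping can be realized as piecewise-linear interpolation through points estimated from the scored held-out examples; the anchor points below are illustrative (loosely following the plot's 0.06/0.12/0.50 ticks), not measured values.

```python
import bisect

# Sketch: piecewise-linear interpolation of P(attA|dist) between
# (distance, probability) anchor points estimated from held-out examples.
def interpolate(dist, points):
    """points: list of (dist, probability) pairs, sorted by dist."""
    xs = [x for x, _ in points]
    if dist <= xs[0]:
        return points[0][1]
    if dist >= xs[-1]:
        return points[-1][1]
    i = bisect.bisect_right(xs, dist)
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (dist - x0) / (x1 - x0)

points = [(0.0, 1.0), (0.06, 0.9), (0.12, 0.5), (0.50, 0.0)]
p = interpolate(0.09, points)  # halfway between 0.9 and 0.5 -> 0.7
```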
34. AC conditional probability: computing w_pat
- P(attA|phrase,ctx) = Σ_pat w_pat · P(attA|pat), with Σ_pat w_pat = 1
- How do we get w_pat? (represents pattern reliability)
- For general (site-independent) patterns
  - the user specifies the pattern importance, or
  - reliability is initially computed from
    - the number of pattern examples seen in training data (irrelevant whether the pattern means attA or not)
    - the number of different websites showing this pattern with similar site-specific precision for attA (this indicates the pattern's general usefulness)
  - with held-out data from multiple websites, we can re-estimate w_pat using the EM algorithm
    - we could probably first use the held-out data to update pattern precisions, and then, keeping the precisions fixed, update pattern weights via EM
    - EM: for each labeled held-out instance, accumulate each pattern's contribution to P(attA|phrase,ctx) in accumulator_pat += w_pat · P(attA|pat); after a single run through the held-out data, the new weights are given by normalizing the accumulators
- For site-specific patterns
  - since local patterns are established while joint-parsing documents with similar structure, both their w_pat and P(attA|pat) develop as the joint parse proceeds; w_pat is again based on the number of times the pattern was seen
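The single-pass accumulate-and-normalize update described above can be sketched as follows; the data layout (pattern names per held-out instance) is an assumption for illustration.

```python
# Sketch of the EM-style weight update from the slide: accumulate each
# pattern's contribution w_pat * P(attA|pat) over held-out instances,
# then obtain new weights by normalizing the accumulators.
def reestimate_weights(weights, precisions, heldout_matches):
    """heldout_matches: per held-out instance, the patterns that matched it."""
    acc = {pat: 0.0 for pat in weights}
    for matched in heldout_matches:
        for pat in matched:
            acc[pat] += weights[pat] * precisions[pat]
    total = sum(acc.values())
    return {pat: a / total for pat, a in acc.items()}

# invented example: the content pattern matches twice and is more precise
weights = {"content_pat": 0.5, "ctx_pat": 0.5}
precisions = {"content_pat": 0.9, "ctx_pat": 0.3}
heldout = [["content_pat", "ctx_pat"], ["content_pat"]]
new_w = reestimate_weights(weights, precisions, heldout)
```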
35. Pattern statistics
- Each pattern needs
  - precision: P(attA|pat) = a/(a+b)
  - reliability: w_pat
- Maybe we also need
  - negative precision: P(attA|¬pat) = c/(c+d), or
  - recall: P(pat|attA) = a/(a+c) (this could be relatively easy for users to enter)
  - these are slightly related, e.g. when recall=1 then negative precision=0
- Conditional model variants
  - A. P(attA|phrase,ctx) = Σ_{pat in matched} w_pat · P(attA|pat), with Σ_{pat in matched} w_pat = 1
    - the sum only goes over patterns that match (phrase,ctx); uses 2 parameters per pattern
  - B. P(attA|phrase,ctx) = Σ_{pat in matched} w_pat · P(attA|pat) + Σ_{pat in nonmatched} w_neg_pat · P(attA|¬pat)
    - with Σ_{pat in matched} w_pat + Σ_{pat in nonmatched} w_neg_pat = 1
    - the sum goes over all patterns, using negative precision for patterns that did not match, and a negative reliability w_neg_pat (the negative reliability of a pattern in general != its reliability); this model uses 4 parameters per pattern
- Generative model (only for contrast)
  - assumes independence among patterns (naive Bayes assumption, which is never true in our case)
  - P(attA|phrase,ctx) = P(attA) · Π_pat P(pat|attA) / Π_pat P(pat) (the denominator can be ignored in the argmax_A search; P(attA) is another parameter)
  - however, patterns are typically highly dependent, so the probability produced by dependent patterns is heavily overestimated (and often > 1 :-))
  - smoothing would be necessary, while conditional models (maybe) avoid it
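Conditional model B can be sketched as below; the weights and (negative) precisions in the example are illustrative.

```python
# Sketch of conditional model B: matched patterns contribute via their
# precision P(attA|pat), non-matched patterns via their negative precision
# P(attA|not pat), with the combined weights normalized to sum to 1.
def model_b(matched, nonmatched):
    """matched: (w_pat, P(attA|pat)); nonmatched: (w_neg_pat, P(attA|not pat))."""
    total = sum(w for w, _ in matched) + sum(w for w, _ in nonmatched)
    return (sum(w * p for w, p in matched)
            + sum(w * p for w, p in nonmatched)) / total

# invented example: one matched pattern, one non-matched pattern
p = model_b(matched=[(2.0, 0.9)], nonmatched=[(1.0, 0.3)])  # (1.8+0.3)/3 = 0.7
```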
36. Normalizing weights for conditional models
- Need to ensure Σ_pat w_pat = 1
- Conditional model A (only matching patterns used)
  - Σ_{pat in matched} w_pat = 1
- Conditional model B (all patterns are always used)
  - Σ_{pat in matched} w_pat + Σ_{pat in nonmatched} w_neg_pat = 1
- Both models
  - need an appropriate estimation of pattern reliabilities (weights), and possibly negative reliabilities, so that normalization does no harm
  - it may be problematic that some reliabilities are estimated by users (e.g. on a 1..9 scale) while others are computed from observed pattern frequencies in training documents and across training websites; how shall we integrate these? First, let's look at how to handle them separately
    - if all weights to be normalized are given by the user: w_pat := w_pat / Σ_patX w_patX
    - if all weights are estimated from training data counts, then something like
      - w_pat = log(c_occurrences(pat)) + log(c_documents(pat)) + log(c_websites(pat))
    - and then normalize as usual (including user-estimated reliabilities): w_pat := w_pat / Σ_patX w_patX
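The count-based estimate plus normalization can be sketched directly; the summing of the three log terms follows the "something like" formula on the slide, and the counts are invented (all must be >= 1 so the logs are defined).

```python
import math

# Sketch of the count-based reliability estimate from the slide:
# w_pat = log c_occurrences + log c_documents + log c_websites,
# followed by normalization so the weights sum to 1.
def count_based_weights(counts):
    """counts: pat -> (occurrences, documents, websites), each >= 1."""
    raw = {pat: math.log(o) + math.log(d) + math.log(s)
           for pat, (o, d, s) in counts.items()}
    total = sum(raw.values())
    return {pat: w / total for pat, w in raw.items()}

# invented counts: pat_a is seen far more often and across more sites
counts = {"pat_a": (100, 20, 5), "pat_b": (10, 2, 1)}
w = count_based_weights(counts)
```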
37. Parsing
- AC = attribute candidate
- IC = instance candidate (a set of ACs)
- the goal is to parse a set of documents into valid instances of the classes defined in the extraction ontology
38. AC scoring (1)
- The main problem seems to be the integration of
  - the conditional probabilities P(attA|phrase,ctx), which we computed on the previous slides, with
  - generative probabilities P(proposition|instance of class C)
    - a proposition can be e.g.
      - price_with_tax > price_without_tax,
      - product_name is the first attribute mentioned,
      - the text in the product_picture's alt attribute is similar to product_name,
      - price follows name,
      - instance has 1 value for attribute price_with_tax
    - if a proposition is not true, the complementary probability 1-P is used
    - a proposition is taken into account whenever its source attributes are present in the parsed instance candidate (let's call this proposition set PROPS)
- Combination of proposition probabilities
  - assume that propositions are mutually independent (seems OK)
  - then we can multiply their generative probabilities to get an averaged generative probability of all propositions together, normalized according to the number of propositions used:
  - P_AVG(PROPS|instance of C) = (Π_{prop in PROPS} P(prop|instance of C))^(1/|PROPS|)
  - (computed as logs)
39. AC scoring (2)
- Combination of pattern probabilities
  - view the parsed instance candidate IC as a set of attribute candidates
  - P_AVG(instance of class C|phrases,contexts) = Σ_{A in IC} P(attA|phrase,ctx) / |IC|
  - Extension: each P(attA|phrase,ctx) may be further multiplied by the "engaged-ness" of the attribute, P(part_of_instance|attA), since some attributes appear alone (outside of instances) more often than others
- Combination of
  - P_AVG(propositions|instance of class C)
  - P_AVG(instance of class C|phrases,contexts)
  - into a single probability used as a score for the instance candidate
    - intuitively, multiplying seems reasonable, but it is incorrect; we must justify it somehow
    - we use propositions and their generative probabilities to discriminate among possible parse candidates for the assembled instance
    - we need probabilities here to compete with the probabilities given by patterns
    - if P_AVG(propositions|instance of class C) = 0, then the result must be 0
    - but finally, we want to see something like a conditional P(instance of class C|attributes' phrases, contexts, and relations between them) as an IC's score
    - so let's take P_AVG(instance of class C|phrases,contexts) as a basis and multiply it by the portion of training instances that exhibit the observed propositions; this lowers the base probability proportionally to the scarcity of the observed propositions
  - result: use multiplication, score(IC) =
    - P_AVG(propositions|instance of class C) · P_AVG(instance of class C|phrases,contexts)
  - but experiments are necessary (can be tested in approx. 1 month)
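The proposed score can be sketched as the geometric mean of the proposition probabilities (computed in log space, as the slide suggests) times the arithmetic mean of the attributes' pattern probabilities; the input probabilities in the example are illustrative.

```python
import math

# Sketch of the IC score proposed above: geometric mean of proposition
# probabilities times arithmetic mean of per-attribute pattern probabilities.
def score_ic(prop_probs, att_probs):
    if any(p == 0.0 for p in prop_probs):
        return 0.0  # an impossible proposition zeroes the score
    if prop_probs:
        # geometric mean computed as logs, per the slide
        p_avg_props = math.exp(sum(math.log(p) for p in prop_probs) / len(prop_probs))
    else:
        p_avg_props = 1.0  # no applicable propositions
    p_avg_atts = sum(att_probs) / len(att_probs)
    return p_avg_props * p_avg_atts

# invented example: two propositions hold with prob 0.8 and 0.5;
# three attribute candidates with pattern probabilities 0.9, 0.7, 0.8
s = score_ic([0.8, 0.5], [0.9, 0.7, 0.8])
```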
40. Parsing algorithm (1)
- bottom-up parser
- driven by the candidates with the highest current scores (both instance and attribute candidates); not a left-to-right parser
- uses the DOM to guide the search
- joint parse of multiple documents from the same source
- adds/changes local patterns (especially DOM context patterns) as the joint parse continues, recalculating the probabilities/weights of local patterns
- configurable beam width
41. Parsing algorithm (2)
- treat all documents D from a single source as a single document; identify and score ACs
- INIT_AC_SET = {}, VALID_IC_SET = {}
- do
  - BAC = the best AC not yet in INIT_AC_SET (from attributes with card = 1, or > 1 if any)
  - if (BAC's score < threshold) break
  - add BAC to INIT_AC_SET
  - INIT_IC = {BAC}
  - IC_SET = {INIT_IC}
  - curr_block = parent_block(BAC)
  - while (curr_block != top_block)
    - for all AC in curr_block (ordered by linear token distance from BAC)
      - for all IC in IC_SET
        - if (IC.accepts(AC))
          - create IC2 = IC + {AC}
          - add IC2 to IC_SET
    - if (IC_SET contains a valid IC and too many ACs were refused due to ontology constraints) break
    - curr_block = next_parent_block(curr_block)
- accepts() returns true if the IC can accommodate the AC according to the ontology constraints and if the AC does not overlap with any other AC already present in the IC, with the exception of being embedded in that AC. Adding the new IC2 at the end of the list prolongs the loop going through IC_SET.
- next_parent_block() returns a single parent block for most block elements. For table cells, it returns 4 aggregates of horizontally and vertically neighboring cells, plus the encapsulating table row and column. Calling next_parent_block() on each of these aggregates yields the next aggregate; the call on the last aggregate returns the whole table body.
42-50. Class C: parsing illustration
- Class C: X card=1 (may contain Y), Y card=1..n, Z card=0..n
- [Figure sequence: a token stream a..n inside a DOM block structure (A, TD, TR, TABLE) is parsed bottom-up. Starting from single attribute candidates (AX, AY, AZ, plus Garbage), instance candidates grow step by step (AX+AY; then AX+AY+AZ, AX+AY+AY and further variants) as the search widens to neighboring blocks, until valid instances of C covering the table rows remain.]
51. Aggregation of overlapping ACs
- Performance and clarity improvement: before parsing, aggregate those overlapping ACs that have the same relation to the ACs of other attributes, and let the aggregate take the maximum score of its child ACs. This prevents some multiplication of new ICs. The aggregate only breaks down if features appear during the parse that support only some of its children. At the end of the parse, all remaining aggregates are reduced to their best child.
52. Focused or global parsing
- The algorithm above is "focused", since it concentrates on one AC at a time: all ICs built by the parser in a single loop have the chosen AC as a member. More complex ICs are built incrementally from existing simpler ICs as a larger neighboring area of the document is taken into account. A stopping criterion for taking in further ACs from more distant parts of the document is needed.
- Alternatively, we may do "global" parsing by first creating a single-member IC = {AC} for each AC in the document. Then, in a loop, always choose the best-scoring IC and add the next AC found in the growing context of that IC. Here the IC's score is computed without certain ontological constraints that would penalize partially populated ICs (e.g. missing mandatory attributes). Again, a stopping criterion is needed to prevent high-scoring ICs from growing all over the document. Validity itself is not a good criterion, since (a) valid ICs may still need further attributes, and (b) some ICs will never be valid because they are wrong from the beginning.
53. Global parsing
- How should IC merging be done when extending existing ICs during global parsing? Shall we only merge ICs with single-AC ICs? Should the original ICs always be retained for other possible merges?
54. References
- M. Collins: Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, 2002.
- M. Collins, B. Roark: Incremental Parsing with the Perceptron Algorithm, 2004.
- D. W. Embley: A Conceptual-Modeling Approach to Extracting Data from the Web, 1998.
- V. Crescenzi, G. Mecca, P. Merialdo: RoadRunner: Towards Automatic Data Extraction from Large Web Sites, 2000.
- F. Ciravegna: (LP)2, an Adaptive Algorithm for Information Extraction from Web-related Texts, 2001.