Title: Information Extraction with Tree Automata Induction
1Information Extraction with Tree Automata
Induction
- Raymond Kosala1, Jan Van den Bussche2,
- Maurice Bruynooghe1, Hendrik Blockeel1
- 1 Katholieke Universiteit Leuven, Belgium
- 2 University of Limburg, Belgium
2Outline
- Introduction
- information extraction (IE)
- grammatical inference
- Approach
- k-testable and g-testable algorithms
- Preliminary result
- Further work
3IE from unstructured documents
- Extract certain fields of interest from a text
- Learner is trained with (positive) examples
- Each learner focuses on a single field marked
with x
Company Name
Job Title
Requirement 1
Requirement 2
4Grammatical inference
- ? finite alphabet
- Regular language L ? ?
- Given set of examples (pos. or neg.)
- Task infer a DFA compatible with examples
- Quality criterion
- Exact learning in the limit
- PAC
- etc.
- Large body of work
5IE with grammatical inference
- Mark field x as special token
- Infer DFA for the language L
- S over (? ? x) the field to be extracted is
marked by x - Only positive examples
6IE from structured documents
- Previous works learn string language
- XML or HTML data tree structured
- Natural extension is to learn a tree language
-
-
- x has a b-brother
- Extraction of a field can depend on structural
context
7Learning process
8Testing process
9Page examples
10Why do we need the context?
-- tr -- td
-- td -- lastupdate
(CDATA) -- td
-- b --
12/4/98 -- td -- tr
-- td -- tr -- td
-- td --
organization (CDATA) -- td
-- b -- ABC
-- td -- tr -- td
- Not enough to differentiate the fields of
interest that depend on the structural context. - Can be chosen automatically.
11Tree automata
- Ranked alphabet ? finite set of function symbols
with arities. - E.g. ? a(2), b(2), c(0)
- Tree ? ground term over ?
- a(c,a(b(c,c),b(c,c))) tree with depth 3
- Tree automaton M (?, Q, ?, F).
- ? is a set of transitions of the form v(q1, ,
qn) ? q - Where v ? ? , n is the arity of v , qi and q ? Q
12Example
- Given an automaton M with the following
transitions - ?1 c ? q0
- ?2 a(q0 , q0) ? q0
- ?3 b(q0 , q0) ? accept
- M accepts a tree t ? t has b-node as the root
13Unranked trees
- XML/HTML
- bib
-
- paper report book
paper - The number of children is not fixed by the label
- Two approaches
- 1. Generalize notion of tree automaton to
unranked trees. L transition rules - v(e) ? q , where e is regular expression over Q
- 2. Encode as ranked tree
14Encoding of unranked trees
- There are well-known methods of encoding to
binary trees, we use - encode(T) encodef(T)
- v if F1 F2 ?
- vright(encodef(F2)) if F1 ?, F2 ?
? - encodef(v(F1), F2) vleft(encodef(F1))
if F1 ? ?, F2 ? - v(encodef(F1), encodef(F2))
otherwise - Where T v(F), v ? ? F ?
- F T, F
- Example
-
- becomes
-
15k-testable tree languages
- Languages in which membership can be checked by
just looking at subtrees of length k-1 that
appear in the tree. - k-roots
- k-forks
- k-subtrees
16k-testable tree languages (cont.)
- An example t
- html
-
- head body
-
- title h1 table
-
- f2(t)
- html head body
-
- head body title h1 table
r2(t) html head
body s2(t) body head
table title h1 h1 table title
17k-testable algorithm Rico-Juan, et al.
- Given a set of positive examples T, a positive
integer k - Q, FS, ? Ø
- For each t ?T,
- Let R rk-1(t) F fk(t) S sk-1(t)
- Q Q ? R ? rk-1( F ) ? S
- FS FS ? R
- ? v(t1, ,tm) ? S ? ? ? ?m(v(t1, ,tm))
v(t1, ,tm) - ? v(t1, ,tm) ? F ? ? ? ?m(v(t1, ,tm))
rk-1(v(t1, ,tm))
18g-testable algorithm
- Idea generalize state transitions from forks
that are not important for the extraction. - Important forks are those that contain x and
(possibly) the distinguishing context.
t
gen(t,1)
gen(t,2)
19g-testable algorithm
- Given a set of positive examples T, positive
integer k and l - FS, ?, Ss, CF, OF, OF Ø
- For each t ?T,
- Let R rk-1(t) F fk(t) S sk-1(t)
- Ss Ss ? S
- FS FS ? R
- ? v(t1, ,tm) ? S ? ? ? ?m(v(t1, ,tm))
v(t1, ,tm) - CF CF ? f f ? F, f contains x
- OF OF ? f f ? F, f does not contain x
- For each of ?OF,
- of gen(of,l)
- If of covers one of CF then
- OF OF ? of else OF OF ? of
- Let F OF ? CF
- Q Q ? F ? rk-1( F ) ? Ss
- ? v(t1, ,tm) ? F ? ? ? ?m(v(t1, ,tm))
rk-1(v(t1, ,tm))
20g-testable algorithm example
- A set of examples T (with k 2 and l 1)
- html
html -
- head body head
body -
- title h1 x table title
h1 x -
- F f2(T)
- html head body
body -
- head body title h1 x h1 x
table - FS html
- OF html(head, body), head(title)
- CF body(h1, x), body(h1, x, table)
- OF html(, ), head()
- R r1(T) html
- Ss s1(T) table , title , h1, x
-
21g-testable algorithm example (cont.)
- F html(, ), head(), body(h1, x), body(h1,
x, table) - Transitions from the trees in the subtrees s1(T)
- ? (table) table
- ? (title) title
- ? (h1) h1
- ? (x) x
- Transitions from the trees in the generalized
forks F - ? (html(, )) html
- ? (head()) head
- ? (body(h1, x)) body
- ? (body(h1, x, table)) body
- Q html head body table title h1 x
22Experiment
- Two benchmark datasets Internet Address Finder
(IAF) and Quote Server (QS). - Comparison with HMM, Stalker, and BWI.
- The highlights of our method
- More expressive.
- Doesnt require
- manual specifications of windows length of the
prefix and suffix of the target field (HMM and
BWI) - special tokens of the delimiters such as gt
(Stalker and BWI) - embedded catalog tree (Stalker)
- The limitations
- The field that can be extracted limited to whole
node - Slower when extracting
23Experiment results
- The results in are
- Dataset IAF-altname IAF-org
QS-date QS-vol
Shakespeare - Prec Rec F1 Prec Rec
F1 Prec Rec F1 Prec Rec F1
Prec Rec F1 - --------------------------------------------------
--------------------------------------------------
---------------------- - HMM 1.7 90 3.4 16.8 89.7
28.4 36.3 100 53.3 18.4 96.2 30.9 - Stalker 100 - - 48.0 -
- 0 - - 0
- - - BWI 90.9 43.5 58.8 77.5 45.9 57.7
100 100 100 100 61.9 76.5 - k-testable 100 73.9 85 100 57.9 73.3
100 60.5 75.4 100 73.6 84.8
56.2 90 69.2 - g-testable 100 73.9 85 100 82.6 90.5
100 60.5 75.4 100 73.6 84.8
69.2 90 78.2 - Parameters 4 and (5,2) 2 and (3,2)
2 and (3,2) 5 and (6,5)
3 and (4,2) - F1 is the harmonic mean of recall and precision
- The results of HMM, Stalker and BWI are adopted
from Freitag Kushmerick
24Further work
- More generalization while using bigger context is
achieved, but sometimes the binarisation makes
the context far from the field of interest ? the
generalization cannot go very far - Work on the algorithm that can work directly with
unranked trees.