Title: Web Information Extraction Techniques
Web Information Extraction
- XML
- Select all dramas written by Shakespeare
- <library>
    <poet> <name>Shakespeare</name> <drama>Macbeth</drama> <drama>Hamlet</drama> </poet>
    <poet> <name>Goethe</name> <drama>Faust</drama> <drama>Prometheus</drama> </poet>
  </library>
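As a minimal sketch of the query above, the library document can be parsed with Python's standard `xml.etree.ElementTree` module and filtered by poet name. The constant `LIBRARY_XML` and the helper `dramas_by` are names made up for this illustration; they are not part of the talk.

```python
import xml.etree.ElementTree as ET

# The example document from the slide, with the angle brackets restored.
LIBRARY_XML = """
<library>
  <poet><name>Shakespeare</name><drama>Macbeth</drama><drama>Hamlet</drama></poet>
  <poet><name>Goethe</name><drama>Faust</drama><drama>Prometheus</drama></poet>
</library>
"""

def dramas_by(xml_text, poet_name):
    """Return the text of every <drama> under the <poet> whose <name> matches."""
    root = ET.fromstring(xml_text)
    result = []
    for poet in root.findall("poet"):
        name = poet.findtext("name", default="").strip()
        if name == poet_name:
            result.extend((d.text or "").strip() for d in poet.findall("drama"))
    return result

print(dramas_by(LIBRARY_XML, "Shakespeare"))  # ['Macbeth', 'Hamlet']
```

The extraction step is nothing more than picking out nodes of the parsed tree, which is the view the formalization below makes precise.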
Formalization Step 1
- Example XML file
- <a> <a></a> <c> <a></a> </c> <b> <a></a> <c></c> <a></a> </b> </a>
XML tree (an unranked tree):
a
  a
  c
    a
  b
    a
    c
    a
- Information extraction is the selection of tree nodes
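The step from the XML file to its tree can be sketched in a few lines of Python. The sketch assumes the standard `xml.etree.ElementTree` parser; `EXAMPLE_XML` and `outline` are illustrative names, not something taken from the talk.

```python
import xml.etree.ElementTree as ET

# The example XML file from the slide, with the angle brackets restored.
EXAMPLE_XML = "<a><a></a><c><a></a></c><b><a></a><c></c><a></a></b></a>"

def outline(elem, depth=0):
    """Return the tree as indented label lines, one node per line."""
    lines = ["  " * depth + elem.tag]
    for child in elem:  # unranked: a node may have any number of children
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(EXAMPLE_XML)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

The printed outline has eight nodes: the root a with children a, c and b, one a below c, and a, c, a below b.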
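Formalization Step 2
- Select all a nodes whose parent is a b node
- Node selection languages
  - XPath: //b/a
  - MSO (monadic second-order logic)
Example tree, with the nodes the query selects:
a
  a
  c
    a
  b
    a   (selected)
    c
    a   (selected)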
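A short sketch of the XPath variant, assuming Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. Its paths are evaluated relative to an element, so `//b/a` is written `.//b/a`. The names `EXAMPLE_XML`, `selected` and `by_hand` are illustrative.

```python
import xml.etree.ElementTree as ET

EXAMPLE_XML = "<a><a></a><c><a></a></c><b><a></a><c></c><a></a></b></a>"
root = ET.fromstring(EXAMPLE_XML)

# XPath //b/a, written relative to the root element.
selected = root.findall(".//b/a")

# Cross-check by hand: every node labelled a whose parent is labelled b.
by_hand = [child
           for parent in root.iter() if parent.tag == "b"
           for child in parent if child.tag == "a"]

print(len(selected), selected == by_hand)  # 2 True
```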
Monadic Datalog
- Horn clauses without function symbols
- Predefined predicates (arity in parentheses): firstchild(2), nextsibling(2), label_a(1), label_b(1)
- Other predicates are unary or nullary
- Example query:
  select(X) :- label_a(X), b_is_father(X).
  b_is_father(X) :- firstchild(Y,X), label_b(Y).
  b_is_father(X) :- b_is_father(Y), nextsibling(Y,X).
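The example query can be evaluated with a naive fixpoint loop. The sketch below is an illustration only, not the evaluation algorithm of the talk. It assumes nodes are numbered in document order, and it reads the first b_is_father rule as requiring the parent Y to be labelled b, since otherwise no a node could ever be selected. The helper names `relations` and `evaluate` are made up here.

```python
import xml.etree.ElementTree as ET

EXAMPLE_XML = "<a><a></a><c><a></a></c><b><a></a><c></c><a></a></b></a>"

def relations(root):
    """Number the nodes in document order and extract the predefined predicates."""
    nodes = list(root.iter())
    ident = {id(n): i for i, n in enumerate(nodes)}
    label = {ident[id(n)]: n.tag for n in nodes}
    firstchild, nextsibling = set(), set()
    for n in nodes:
        kids = [ident[id(k)] for k in n]
        if kids:
            firstchild.add((ident[id(n)], kids[0]))
        nextsibling.update(zip(kids, kids[1:]))
    return label, firstchild, nextsibling

def evaluate(label, firstchild, nextsibling):
    """Apply the rules until no new b_is_father facts appear (naive fixpoint)."""
    b_is_father = set()
    changed = True
    while changed:
        changed = False
        # b_is_father(X) :- firstchild(Y, X), label_b(Y).
        new = {x for (y, x) in firstchild if label[y] == "b"}
        # b_is_father(X) :- b_is_father(Y), nextsibling(Y, X).
        new |= {x for (y, x) in nextsibling if y in b_is_father}
        if not new <= b_is_father:
            b_is_father |= new
            changed = True
    # select(X) :- label_a(X), b_is_father(X).
    return sorted(x for x in b_is_father if label[x] == "a")

result = evaluate(*relations(ET.fromstring(EXAMPLE_XML)))
print(result)  # [5, 7]
```

With document-order numbering the root is node 0 and b is node 4, so nodes 5 and 7 are the two a children of b. The loop re-derives facts on every pass, so it only illustrates the semantics and makes no attempt at efficient evaluation.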
Monadic Datalog
- As expressive as MSO (for unary queries over trees)
- Fast: evaluation takes only linear time in the size of the tree
Tree Walking Automata
(godown) → (↓, godown) else (→, godown) else (↑, goright)
(goright) → (→, godown) else (↑, goright) else stop
Moves: ↓ first child, → next sibling, ↑ parent
Example tree:
a
  a
  c
    a
  b
    a
    c
    a
State set: godown, goright
Start at the root in state godown
Run on the example tree, showing the current state at each successive position: godown, godown, godown, godown, goright, godown
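The two-state automaton can be simulated directly. This sketch reads the three moves in the rules as "first child", "next sibling" and "parent", in that order of preference; that is an interpretation of the slides, as are the helper names `build` and `walk`. Nodes are numbered in document order.

```python
import xml.etree.ElementTree as ET

EXAMPLE_XML = "<a><a></a><c><a></a></c><b><a></a><c></c><a></a></b></a>"

def build(root):
    """Return parent, first-child and next-sibling maps keyed by document-order index."""
    nodes = list(root.iter())
    ident = {id(n): i for i, n in enumerate(nodes)}
    parent, down, right = {}, {}, {}
    for n in nodes:
        kids = [ident[id(k)] for k in n]
        if kids:
            down[ident[id(n)]] = kids[0]
        for k in kids:
            parent[k] = ident[id(n)]
        right.update(zip(kids, kids[1:]))
    return parent, down, right

def walk(parent, down, right):
    """Run the automaton from the root; return the visited (node, state) pairs."""
    node, state, trace = 0, "godown", []
    while True:
        trace.append((node, state))
        if state == "godown":
            if node in down:
                node = down[node]
            elif node in right:
                node = right[node]
            elif node in parent:
                node, state = parent[node], "goright"
            else:
                break  # single-node tree: halting here is an assumption
        else:  # state == "goright"
            if node in right:
                node, state = right[node], "godown"
            elif node in parent:
                node = parent[node]
            else:
                break  # "else stop": back at the root
    return trace

trace = walk(*build(ET.fromstring(EXAMPLE_XML)))
print([state for _, state in trace[:6]])
# ['godown', 'godown', 'godown', 'godown', 'goright', 'godown']
```

Under this reading the automaton performs a depth-first traversal that visits every node and halts back at the root in state goright.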
Tree Walking Automata
(godown, a/c) → (↓, godown) else (→, godown) else (↑, goright)
(goright) → (→, godown) else (↑, goright) else stop
(godown, b) → (↓, mark)
(mark) → (→, mark) else (stay, return)
(return) → (←, return) else (stay, godown)
Moves: ↓ first child, → next sibling, ← previous sibling, ↑ parent, stay (no move)
Example tree:
a
  a
  c
    a
  b
    a
    c
    a
State set: godown, goright, mark, return
Mark all nodes a that are reached in state mark
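A sketch of the four-state automaton, under the following assumed reading of the rules: at a b node the automaton descends to the first child in state mark, sweeps right over all siblings, walks back left in state return, and then resumes the traversal in state godown. The move directions, the fallback for a b node without children, and the names `build` and `run` are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

EXAMPLE_XML = "<a><a></a><c><a></a></c><b><a></a><c></c><a></a></b></a>"

def build(root):
    """Return node labels and one neighbour map per move direction."""
    nodes = list(root.iter())
    ident = {id(n): i for i, n in enumerate(nodes)}
    label = [n.tag for n in nodes]
    move = {"down": {}, "up": {}, "right": {}, "left": {}}
    for n in nodes:
        kids = [ident[id(k)] for k in n]
        if kids:
            move["down"][ident[id(n)]] = kids[0]
        for k in kids:
            move["up"][k] = ident[id(n)]
        for lhs, rhs in zip(kids, kids[1:]):
            move["right"][lhs] = rhs
            move["left"][rhs] = lhs
    return label, move

def run(label, move):
    """Walk the tree and return the a nodes reached in state mark."""
    node, state, marked = 0, "godown", []
    while state != "stop":
        if state == "mark" and label[node] == "a":
            marked.append(node)
        if state == "godown" and label[node] == "b" and node in move["down"]:
            options = [("down", "mark")]
        elif state == "godown":
            # a/c nodes; a childless b node also lands here (assumption)
            options = [("down", "godown"), ("right", "godown"), ("up", "goright")]
        elif state == "goright":
            options = [("right", "godown"), ("up", "goright")]
        elif state == "mark":
            options = [("right", "mark"), ("stay", "return")]
        else:  # state == "return"
            options = [("left", "return"), ("stay", "godown")]
        for direction, next_state in options:
            if direction == "stay":
                state = next_state
                break
            if node in move[direction]:
                node, state = move[direction][node], next_state
                break
        else:
            state = "stop"  # no move applies
    return marked

label, move = build(ET.fromstring(EXAMPLE_XML))
marked = run(label, move)
print(marked)  # [5, 7]
```

With nodes numbered in document order, 5 and 7 are the two a children of b, the same nodes that the query "select all a nodes whose parent is a b node" describes.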
Summary
- Talk 1
  - Monadic datalog
  - Comparison to MSO
  - Linear time evaluation
- Talk 2
  - Query automata on unranked trees
  - Tree walking automata, two-way automata, bottom-up automata
  - Comparison to MSO