Web Information Extraction Techniques - PowerPoint PPT Presentation

1
Web Information Extraction Techniques
  • Torsten Stüber

4
Web Information
  • XML
  • <library>
      <poet>
        <name>Shakespeare</name>
        <drama>Macbeth</drama>
        <drama>Hamlet</drama>
      </poet>
      <poet>
        <name>Goethe</name>
        <drama>Faust</drama>
        <drama>Prometheus</drama>
      </poet>
    </library>

5
Web Information Extraction
  • XML
  • Select all dramas written by Shakespeare
  • <library>
      <poet>
        <name>Shakespeare</name>
        <drama>Macbeth</drama>
        <drama>Hamlet</drama>
      </poet>
      <poet>
        <name>Goethe</name>
        <drama>Faust</drama>
        <drama>Prometheus</drama>
      </poet>
    </library>
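
The query on this slide can be tried out directly. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The function name `dramas_by` and the whitespace-free XML literal are illustrative choices, not part of the talk.

```python
import xml.etree.ElementTree as ET

# The library document from the slide, written without padding spaces.
LIBRARY_XML = """
<library>
  <poet>
    <name>Shakespeare</name>
    <drama>Macbeth</drama>
    <drama>Hamlet</drama>
  </poet>
  <poet>
    <name>Goethe</name>
    <drama>Faust</drama>
    <drama>Prometheus</drama>
  </poet>
</library>
"""


def dramas_by(poet_name, xml_text=LIBRARY_XML):
    """Return the titles of all <drama> children of the poet with this name."""
    root = ET.fromstring(xml_text)
    titles = []
    for poet in root.iter("poet"):
        name = poet.findtext("name", default="").strip()
        if name == poet_name:
            titles.extend((d.text or "").strip() for d in poet.findall("drama"))
    return titles


print(dramas_by("Shakespeare"))  # ['Macbeth', 'Hamlet']
```

The loop states the selection condition explicitly: a drama is selected when its parent poet has a name child with the requested text.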

8
Formalization Step 1
  • Example XML file
  • <a> <a></a> <c> <a></a> </c> <b>
    <a></a> <c></c> <a></a> </b> </a>

XML tree (unranked tree): a(a, c(a), b(a, c, a))
9
Formalization Step 1
  • Example XML file
  • <a> <a></a> <c> <a></a> </c> <b>
    <a></a> <c></c> <a></a> </b> </a>
  • Information extraction is selection of tree nodes

XML tree: a(a, c(a), b(a, c, a))
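
To make the first formalization step concrete, here is a small sketch that turns the example XML file into an unranked tree and selects nodes from it. The `(label, children)` representation, the helper names `to_tree`, `show` and `select`, and the use of child-index paths to identify nodes are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

EXAMPLE_XML = "<a><a></a><c><a></a></c><b><a></a><c></c><a></a></b></a>"


def to_tree(element):
    """Convert an element into a (label, children) pair.

    The tree is unranked: a label does not fix the number of children.
    """
    return (element.tag, [to_tree(child) for child in element])


def show(tree):
    """Render a tree in term notation, e.g. b(a, c, a)."""
    label, children = tree
    if not children:
        return label
    return label + "(" + ", ".join(show(c) for c in children) + ")"


def select(tree, predicate, path=()):
    """Return the child-index paths of all nodes whose label satisfies predicate."""
    label, children = tree
    hits = [path] if predicate(label) else []
    for index, child in enumerate(children):
        hits.extend(select(child, predicate, path + (index,)))
    return hits


TREE = to_tree(ET.fromstring(EXAMPLE_XML))
print(show(TREE))                          # a(a, c(a), b(a, c, a))
print(select(TREE, lambda l: l == "c"))    # [(1,), (2, 1)]
```

A query is then simply a function from a tree to a set of its nodes, here given as paths from the root.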
13
Formalization Step 2
  • Select all a's whose parent is a b
  • Node selection languages
  • XPath: //b/a
  • MSO (monadic second-order logic)

Tree: a(a, c(a), b(a, c, a))
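
The XPath expression from this slide can be checked against the example tree. This sketch uses Python's `xml.etree.ElementTree`, which supports only a subset of XPath and evaluates paths relative to an element, hence the leading dot in `.//b/a`. The hand-written comprehension is an assumed equivalent, included only for comparison.

```python
import xml.etree.ElementTree as ET

EXAMPLE_XML = "<a><a/><c><a/></c><b><a/><c/><a/></b></a>"
root = ET.fromstring(EXAMPLE_XML)

# XPath-style selection: every <a> that is a child of some <b>.
selected = root.findall(".//b/a")

# The same selection written out by hand, in document order.
by_hand = [child for parent in root.iter("b") for child in parent if child.tag == "a"]

print(len(selected))  # 2
print(all(x is y for x, y in zip(selected, by_hand)))  # True
```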
15
Monadic Datalog
  • Horn clauses without function symbols
  • Predefined predicates: firstchild(2),
    nextsibling(2), label_a(1), label_b(1)
  • Other predicates are unary or nullary
  • Example query:
    select(X) :- label_a(X), b_is_father(X).
    b_is_father(X) :- firstchild(Y,X), label_b(Y).
    b_is_father(X) :- b_is_father(Y), nextsibling(Y,X).
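
The example query can be evaluated on the example tree with a naive fixpoint loop. This is only a sketch: it is not the linear-time evaluation the talk refers to, the helper names `tree_facts` and `evaluate` are made up, nodes are numbered in document order, and the first `b_is_father` rule is read with the b label applied to the parent Y, so that firstchild(Y,X) means X is the first child of Y.

```python
import xml.etree.ElementTree as ET

EXAMPLE_XML = "<a><a/><c><a/></c><b><a/><c/><a/></b></a>"


def tree_facts(root):
    """Number nodes in document order and return the predefined relations."""
    nodes = list(root.iter())
    ident = {id(n): i for i, n in enumerate(nodes)}
    label = {ident[id(n)]: n.tag for n in nodes}
    firstchild, nextsibling = set(), set()
    for parent in nodes:
        children = list(parent)
        if children:
            firstchild.add((ident[id(parent)], ident[id(children[0])]))
        for left, right in zip(children, children[1:]):
            nextsibling.add((ident[id(left)], ident[id(right)]))
    return label, firstchild, nextsibling


def evaluate(label, firstchild, nextsibling):
    """Apply the rules repeatedly until no new b_is_father facts appear."""
    b_is_father = set()
    changed = True
    while changed:
        changed = False
        # b_is_father(X) :- firstchild(Y, X), label_b(Y).
        new = {x for (y, x) in firstchild if label[y] == "b"}
        # b_is_father(X) :- b_is_father(Y), nextsibling(Y, X).
        new |= {x for (y, x) in nextsibling if y in b_is_father}
        if not new <= b_is_father:
            b_is_father |= new
            changed = True
    # select(X) :- label_a(X), b_is_father(X).
    return sorted(x for x in b_is_father if label[x] == "a")


facts = tree_facts(ET.fromstring(EXAMPLE_XML))
print(evaluate(*facts))  # [5, 7]
```

Nodes 5 and 7 are the two a-labelled children of the b node, the same nodes that //b/a selects.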

16
Monadic Datalog
  • Is as expressive as MSO
  • Fast: evaluation takes only linear time

17
Tree Walking Automata
Tree: a(a, c(a), b(a, c, a))
19
Tree Walking Automata
Tree: a(a, c(a), b(a, c, a))
State set: {godown, goright}
20
Tree Walking Automata
(godown) → (?, godown) else (?, godown) else (?, goright)
(goright) → (?, godown) else (?, goright) else stop
Tree: a(a, c(a), b(a, c, a))
State set: {godown, goright}
Start at the root in state godown
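
The direction symbols in the transition rules did not survive in this transcript, so the following simulation rests on an assumed reading: in state godown the automaton tries down, then right, then up; in state goright it tries right, then up, and otherwise stops. The `Node` class, the `RULES` table and the `run` function are illustrative names. Under this reading the first six states of the run are godown, godown, godown, godown, goright, godown, which agrees with the "Current state" lines on slides 21 to 26 if slide 21 still shows the automaton at the root.

```python
class Node:
    """A node of an unranked tree with parent and next-sibling links."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        self.right = None  # next sibling, if any
        for child in self.children:
            child.parent = self
        for left, right in zip(self.children, self.children[1:]):
            left.right = right


def move(node, direction):
    """Return the neighbour in the given direction, or None if there is none."""
    if direction == "down":
        return node.children[0] if node.children else None
    if direction == "right":
        return node.right
    if direction == "up":
        return node.parent
    raise ValueError(direction)


# Assumed reading of the slide's rules: try each alternative in order.
RULES = {
    "godown": [("down", "godown"), ("right", "godown"), ("up", "goright")],
    "goright": [("right", "godown"), ("up", "goright")],
}


def run(root):
    """Walk the tree from the root and return the visited (label, state) pairs."""
    node, state = root, "godown"
    trace = [(node.label, state)]
    while True:
        for direction, next_state in RULES[state]:
            target = move(node, direction)
            if target is not None:
                node, state = target, next_state
                trace.append((node.label, state))
                break
        else:
            return trace  # no alternative applies: stop


root = Node("a", [Node("a"), Node("c", [Node("a")]), Node("b", [Node("a"), Node("c"), Node("a")])])
trace = run(root)
print("".join(label for label, _ in trace))  # aacacbacaba
```

The walk is a depth-first traversal that needs no stack: the two states are enough to remember whether the subtree below the current node has already been visited.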
21-26
Tree Walking Automata
(godown) → (?, godown) else (?, godown) else (?, goright)
(goright) → (?, godown) else (?, goright) else stop
Tree: a(a, c(a), b(a, c, a))
State set: {godown, goright}
Current state on slides 21 to 26: godown, godown, godown, godown, goright, godown
27
Tree Walking Automata
(godown, a/c) → (?, godown) else (?, godown) else (?, goright)
(goright) → (?, godown) else (?, goright) else stop
(godown, b) → (?, mark)
(mark) → (?, mark) else (stay, return)
(return) → (?, return) else (stay, godown)
Tree: a(a, c(a), b(a, c, a))
State set: {godown, goright, mark, return}
Mark all nodes a that are reached in state mark
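
A sketch of how the extended automaton might select the a nodes below a b node. The lost direction symbols are again filled in by assumption: from a b node the automaton moves down into state mark, in mark it moves right, in return it moves left, and the remaining rules are the depth-first traversal (down, right, up). The fallback moves for a b node without children are an additional assumption, as are all class and function names.

```python
class Node:
    """Tree node with parent and sibling links and a selection flag."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = self.left = self.right = None
        self.marked = False
        for child in self.children:
            child.parent = self
        for l, r in zip(self.children, self.children[1:]):
            l.right, r.left = r, l


def move(node, direction):
    """Return the neighbour in the given direction, or None if there is none."""
    if direction == "stay":
        return node
    if direction == "down":
        return node.children[0] if node.children else None
    return getattr(node, {"up": "parent", "left": "left", "right": "right"}[direction])


def rules(state, label):
    """Assumed transition table: alternatives are tried in order."""
    if state == "godown" and label == "b":
        # Assumption: a childless b falls back to the plain traversal moves.
        return [("down", "mark"), ("right", "godown"), ("up", "goright")]
    if state == "godown":
        return [("down", "godown"), ("right", "godown"), ("up", "goright")]
    if state == "goright":
        return [("right", "godown"), ("up", "goright")]
    if state == "mark":
        return [("right", "mark"), ("stay", "return")]
    if state == "return":
        return [("left", "return"), ("stay", "godown")]
    raise ValueError(state)


def run(root):
    """Walk the tree and mark every a node that is reached in state mark."""
    node, state = root, "godown"
    while True:
        for direction, next_state in rules(state, node.label):
            target = move(node, direction)
            if target is not None:
                node, state = target, next_state
                if state == "mark" and node.label == "a":
                    node.marked = True
                break
        else:
            return  # no alternative applies: stop


b_children = [Node("a"), Node("c"), Node("a")]
root = Node("a", [Node("a"), Node("c", [Node("a")]), Node("b", b_children)])
run(root)
print([n.marked for n in b_children])  # [True, False, True]
```

Under these assumptions the automaton marks exactly the a nodes whose parent is a b, the same nodes that the XPath query //b/a selects.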
28
Summary
  • Talk 1
    • Monadic datalog
    • Comparison to MSO
    • Linear time evaluation
  • Talk 2
    • Query automata on unranked trees
    • Tree walking automata, two-way automata,
      bottom-up automata
    • Comparison to MSO