A Survey of WEB Information Extraction Systems

1
A Survey of WEB Information Extraction Systems

Chia-Hui Chang
National Central University
Sep. 22, 2005

2
Introduction

Abundant information on the Web
Static Web pages
Searchable databases (Deep Web)
Information Integration
Information for life
e.g. shopping agents, travel agents
Data for research purposes
e.g. bioinformatics, auction economy

3
Introduction (Cont.)

Information Extraction (IE)
The task is to identify relevant information in
documents, pull it from a variety of sources, and
aggregate it into a homogeneous form
An IE task is defined by its input and output

4
An IE Task
5
Web Data Extraction
(Figure: two regions labeled Data Record)
6
IE Systems

Wrappers
Programs that perform the task of IE are referred
to as extractors or wrappers.
Wrapper Induction
IE systems are software tools that are designed
to generate wrappers.

7
Various IE Surveys

Muslea
Hsu and Dung
Chang
Kushmerick
Laender
Sarawagi
Kuhlins and Tredwell

8
Related Work: Time

MUC Approaches
AutoSlog [Riloff, 1993], LIEP [Huffman, 1996],
PALKA [Kim, 1995], HASTEN [Krupka, 1995], and
CRYSTAL [Soderland, 1995]
Post-MUC Approaches
WHISK [Soderland, 1999], RAPIER [Califf, 1998],
SRV [Freitag, 1998], WIEN [Kushmerick, 1997],
SoftMealy [Hsu, 1998] and STALKER [Muslea, 1999]

9
Related Work: Automation Degree

Hsu and Dung 1998
hand-crafted wrappers using general programming
languages
specially designed programming languages or tools
heuristic-based wrappers, and
WI approaches

10
Related Work: Automation Degree (Cont.)

Chang and Kuo 2003
systems that need programmers,
systems that need annotation examples,
annotation-free systems and
semi-supervised systems

11
Related Work: Input and Extraction Rules

Muslea 1999
The first class performs IE from free text using
extraction patterns that are mainly based on
syntactic/semantic constraints.
The second class is wrapper induction systems,
which rely on delimiter-based rules.
The third class also performs IE from online
documents; however, the patterns of these tools
are based on both delimiters and
syntactic/semantic constraints.

12
Related Work: Extraction Rules

Kushmerick 2003
Finite-state tools (regular expressions)
Relational learning tools (logic rules)

13
Related Work: Techniques

Laender 2002
languages for wrapper development
HTML-aware tools
NLP-based tools
Wrapper induction tools (e.g., WIEN, SoftMealy
and STALKER)
Modeling-based tools
Ontology-based tools
New Criteria
degree of automation, support for complex
objects, page contents, availability of a GUI,
XML output, support for non-HTML sources,
resilience and adaptiveness.

14
Related Work: Output Targets

Sarawagi VLDB 2002
Record-level
Page-level
Site-level

15
Related Work: Usability

Kuhlins and Tredwell 2002
Commercial
Noncommercial

16
Three Dimensions

Task Domain
Input (Unstructured, semi-structured)
Output Targets (record-level, page-level,
site-level)
Automation Degree
Programmer-involved, learning-based or
annotation-free approaches
Techniques
Regular expression rules vs Prolog-like logic
rules
Deterministic finite-state transducer vs
probabilistic hidden Markov models

17
Task Domain: Input
18
Task Domain: Output

Missing Attributes
Multi-valued Attributes
Multiple Permutations
Nested Data Objects
Various Templates for an attribute
Common Templates for various attributes
Untokenized Attributes

19
Classification by Automation Degree

Manually
TSIMMIS, Minerva, WebOQL, W4F, XWrap
Supervised
WIEN, Stalker, Softmealy
Semi-supervised
IEPAD, OLERA
Unsupervised
DeLa, RoadRunner, EXALG

20
Automation Degree

Page-fetching Support
Annotation Requirement
Output Support
API Support

21
Technologies

Scan passes
Extraction rule types
Learning algorithms
Tokenization schemes
Features used

22
A Survey of Contemporary IE Systems

Manually-constructed IE tools
Programmer-aided
Supervised IE systems
Label-based
Semi-supervised IE systems
Unsupervised IE systems
Annotation-free

24
Manually-constructed IE Systems

TSIMMIS Hammer et al., 1997
Minerva Crescenzi, 1998
WebOQL Arocena and Mendelzon, 1998
W4F Sahuguet and Azavant, 2001
XWrap Liu et al., 2000

25
A Running Example
26
TSIMMIS

Each command is of the form
[variables, source, pattern], where
source specifies the input text to be considered,
pattern specifies how to find the text of
interest within the source, and
variables is a list of variables that hold the
extracted results.
Note
# means save in the variable
* means discard
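
As a rough illustration of the [variables, source, pattern] command form, the following Python sketch interprets a pattern by translating it to a regular expression. It assumes the two pattern markers are `#` (save) and `*` (discard); the function name `run_command`, the sample page, and the variable name `title` are hypothetical, and this is not TSIMMIS's actual specification language or engine.

```python
import re

def run_command(variables, source, pattern):
    """Apply one [variables, source, pattern] command (simplified sketch).

    '#' saves the matched text into the next variable, '*' discards it.
    Literal '#' or '*' characters cannot be expressed in this sketch.
    """
    parts = re.split(r"([#*])", pattern)
    regex = "".join(
        "(.*?)" if part == "#" else ".*?" if part == "*" else re.escape(part)
        for part in parts
    )
    match = re.fullmatch(regex, source, re.DOTALL)
    if match is None:
        return {}
    return dict(zip(variables, (group.strip() for group in match.groups())))

# Hypothetical page in the style of the running example.
page = "<html><body><b>Book Name</b> Data mining <b>Reviews</b></body></html>"
print(run_command(["title"], page, "*<b>Book Name</b>#<b>*"))
# {'title': 'Data mining'}
```

The point of the sketch is only that the programmer writes every pattern by hand, which is why such tools sit at the manual end of the automation scale.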

27
Minerva

The grammar used by Minerva is defined in an EBNF
style

28
WebOQL

Select Z!.Text
From x in browse("pe2.html"), y in x, Z in y
Where x.Tag = "ol" and Z.Text = "Reviewer Name"

29
W4F

WYSIWYG support
Java toolkit
Extraction rule
HTML parse tree (DOM object)
e.g. html.body.ol[0].li[*].pcdata[0].txt
Regular expressions to address finer pieces of
information

30
Supervised IE systems

SRV Freitag, 1998
Rapier Califf and Mooney, 1998
WIEN Kushmerick, 1997
WHISK Soderland, 1999
NoDoSE Adelberg, 1998
Softmealy Hsu and Dung, 1998
Stalker Muslea, 1999
DEByE Laender, 2002b

31
SRV

Single-slot information extraction
Top-down (general to specific) relational
learning algorithm
Positive examples
Negative examples
The learning algorithm works like FOIL
Token-oriented features
Logic rule

Rating extraction rule :- Length(= 1),
Every(numeric true), Every(in_list true).
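
To make the rating rule concrete, here is a minimal Python sketch that treats the rule as a conjunction of predicates over a candidate token fragment. The `Token` class, the `extract` helper, and the sample tokens (including the "Edition 2005" pair, which is not part of the running example) are assumptions for illustration; SRV learns such rules from examples, whereas this sketch only applies one.

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    in_list: bool  # hypothetical feature: the token occurs inside an HTML list

def rating_rule(fragment):
    """Length(= 1), Every(numeric true), Every(in_list true)."""
    return (
        len(fragment) == 1
        and all(tok.text.isdigit() for tok in fragment)
        and all(tok.in_list for tok in fragment)
    )

def extract(tokens, rule, max_len=3):
    """Return the text of every fragment of up to max_len tokens that rule accepts."""
    hits = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            fragment = tokens[start:start + length]
            if len(fragment) == length and rule(fragment):
                hits.append(" ".join(tok.text for tok in fragment))
    return hits

tokens = [
    Token("Edition", False), Token("2005", False),
    Token("Reviewer", True), Token("Name", True), Token("Jeff", True),
    Token("Rating", True), Token("2", True),
]
print(extract(tokens, rating_rule))  # ['2']
```

The numeric token outside the list is rejected by the in_list test, which shows how token-oriented features narrow a single-slot rule.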
32
Rapier

Field-level (Single-slot) data extraction
Bottom-up (specific to general)
The extraction rules consist of 3 parts
Pre-filler
Slot-filler
Post-filler

Book Title extraction rule :-
Pre-filler: word: Book; word: Name; word: </b>
Slot-filler: Length = 2; Tag: [nn, nns]
Post-filler: word: <b>
33
WIEN

LR Wrapper
('Reviewer Name </b>', '<b>', 'Rating </b>', '<b>',
'Text </b>', '</li>')
HLRT Wrapper (Head LR Tail)
OCLR Wrapper (Open-Close LR)
HOCLRT Wrapper
N-LR Wrapper (Nested LR)
N-HLRT Wrapper (Nested HLRT)
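
An LR wrapper such as the one listed first above can be executed with a simple scan: for each attribute, find its left delimiter, then read up to its right delimiter, and repeat until no further record starts. The Python sketch below follows that idea; the function name `exec_lr`, the exact spacing of the delimiters, and the review texts in the sample page are assumptions, since the slide does not show them.

```python
def exec_lr(page, delimiters):
    """Extract records with an LR wrapper given as (left, right) delimiter pairs."""
    records = []
    pos = 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records  # no further record (a partial one is dropped)
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records
            record.append(page[start:end].strip())
            pos = end
        records.append(tuple(record))

# Hypothetical page in the style of the running example.
page = (
    "<ol>"
    "<li><b>Reviewer Name</b> Jeff <b>Rating</b> 2 <b>Text</b> Good book </li>"
    "<li><b>Reviewer Name</b> Jane <b>Rating</b> 6 <b>Text</b> Useful </li>"
    "</ol>"
)
lr_wrapper = [("Reviewer Name</b>", "<b>"), ("Rating</b>", "<b>"), ("Text</b>", "</li>")]
print(exec_lr(page, lr_wrapper))
# [('Jeff', '2', 'Good book'), ('Jane', '6', 'Useful')]
```

The other wrapper classes add more delimiters around this loop, for example a head and a tail delimiter in HLRT to skip text outside the record area.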

34
WHISK

Top-down (general to specific) learning
Example
To generate 3-slot book reviews, it starts with
the empty rule *(*)*(*)*(*)*
Each pair of parentheses indicates a phrase to be
extracted
The phrase in the first pair of parentheses is
bound to variable $1, the second to $2, etc.
The extraction logic is similar to the LR wrapper
for WIEN.

Pattern:: * 'Reviewer Name </b>' (Person) '<b>' * (Digit) '<b>Text</b>' (*) '</li>'
Output:: BookReview {Name $1} {Rating $2} {Comment $3}
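
One way to read the multi-slot pattern above is as a regular expression with three capture groups, one per slot. The Python sketch below is a hand translation under stated assumptions: the semantic class Person is approximated by a single word and Digit by a run of digits, and the sample page and its review texts are hypothetical. It illustrates applying a learned rule, not WHISK's learning procedure.

```python
import re

RULE = re.compile(
    r"Reviewer Name\s*</b>\s*(\w+)\s*<b>"  # slot 1: Person, approximated by one word
    r".*?(\d+)\s*"                          # skip ahead, then slot 2: Digit
    r"<b>\s*Text\s*</b>(.*?)</li>",         # slot 3: everything up to </li>
    re.DOTALL,
)

def extract_reviews(page):
    """Return one BookReview dict per match of the three-slot rule."""
    return [
        {"Name": m.group(1), "Rating": m.group(2), "Comment": m.group(3).strip()}
        for m in RULE.finditer(page)
    ]

# Hypothetical page in the style of the running example.
page = (
    "<ol>"
    "<li><b>Reviewer Name</b> Jeff <b>Rating</b> 2 <b>Text</b> Good book </li>"
    "<li><b>Reviewer Name</b> Jane <b>Rating</b> 6 <b>Text</b> Useful </li>"
    "</ol>"
)
for review in extract_reviews(page):
    print(review)
```

Unlike the single-slot rules of SRV and Rapier, one match here fills all three attributes of a record at once.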
35
NoDoSE

Assume the order of attributes within a record to
be fixed
The user interacts with the system to decompose
the input.
For the running example
a book title (an attribute of type string) and
a list of Reviewer records, each with
RName (string), Rate (integer), and Text
(string).

36
Softmealy

Finite-state transducer
Contextual rules

s<,R>L ::= HTML(<b>) C1Alph(Rating) HTML(</b>)
s<,R>R ::= Spc(-) | Num(-)
s<R,>L ::= Num(-)
s<R,>R ::= NL(-) | HTML(<b>)
37
Stalker

Embedded Category Tree
Multipass Softmealy

38
DEByE

Bottom-up extraction strategy
Comparison
DEByE: the user marks only atomic (attribute)
values to assemble nested tables
NoDoSE: the user decomposes the whole document in
a top-down fashion

39
Semi-supervised Approaches

IEPAD Chang and Lui, 2001
OLERA Chang and Kuo, 2003
Thresher Hogue, 2005

40
IEPAD

Encoding of the input page
Multiple-record pages
Pattern Mining by PAT Tree
Multiple string alignment
For the running example
<li><b>T</b>T<b>T</b>T<b>T</b>T</li>
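
The encoding and pattern discovery steps can be sketched as follows. The page is encoded as a token sequence in which every text segment becomes T, and a repeated pattern is then searched for. Note the substitution: IEPAD mines patterns with a PAT tree and refines them by multiple string alignment, while this sketch uses a naive search for the longest adjacent repeat. The function names and the sample page are hypothetical.

```python
import re

def encode(html):
    """Encode a page as tokens: tags are kept, each text segment becomes 'T'."""
    tokens = []
    for tok in re.findall(r"<[^>]+>|[^<]+", html):
        if tok.startswith("<"):
            tokens.append(tok.lower())
        elif tok.strip():
            tokens.append("T")
    return tokens

def longest_tandem_repeat(tokens):
    """Naive stand-in for PAT-tree mining: longest block repeated back to back."""
    n = len(tokens)
    for length in range(n // 2, 0, -1):
        for start in range(n - 2 * length + 1):
            first = tokens[start:start + length]
            if first == tokens[start + length:start + 2 * length]:
                return first
    return []

# Hypothetical page in the style of the running example.
page = (
    "<ol>"
    "<li><b>Reviewer Name</b> Jeff <b>Rating</b> 2 <b>Text</b> Good book </li>"
    "<li><b>Reviewer Name</b> Jane <b>Rating</b> 6 <b>Text</b> Useful </li>"
    "</ol>"
)
print("".join(longest_tandem_repeat(encode(page))))
# <li><b>T</b>T<b>T</b>T<b>T</b>T</li>
```

No labeled examples are needed to find the pattern; the user only chooses among discovered patterns and assigns attribute names afterwards.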

41
OLERA

Online extraction rule analysis
Enclosing
Drill-down / Roll-up
Attribute Assignment

42
Thresher

Works similarly to OLERA
Applies tree alignment instead of string alignment

43
Unsupervised Approaches

Roadrunner Crescenzi, 2001
DeLa Wang, 2002; 2003
EXALG Arasu and Garcia-Molina, 2003
DEPTA Zhai, et al., 2005

44
Roadrunner

Input: multiple pages with the same template
Matches two input pages at a time

Sample page
01 <html><body>
02 <b>
03 Book Name
04 </b>
05 Data mining
06 <b>
07 Reviews
08 </b>
09 <OL>
10 <LI>
11 <b> Reviewer Name </b>
12 Jeff
13 <b> Rating </b>
14 2
15 <b>Text </b>
16
17 </LI>
18 <LI>
19 <b> Reviewer Name </b>
20 Jane
21 <b> Rating </b>
22 6
23 <b>Text </b>
24
25 </LI>
26 </OL>
27 </body></html>
Wrapper (initially)
01 <html><body>
02 <b>
03 Book Name
04 </b>
05 Databases
06 <b>
07 Reviews
08 </b>
09 <OL>
10 <LI>
11 <b> Reviewer Name </b>
12 John
13 <b> Rating </b>
14 7
15 <b>Text </b>
16
17 </LI>
18 </OL>
19 </body></html>
Parsing the sample page against the initial wrapper
yields four string mismatches and one tag mismatch.
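
The matching step can be illustrated with a small Python sketch that walks the wrapper and the sample page token by token and classifies each difference. It is deliberately simplified: a string mismatch is only recorded (RoadRunner would generalize it into a data field), and the walk stops at the first tag mismatch, where RoadRunner would go on to look for an optional or repeated structure. The function names, the compact page strings, and the review texts are assumptions.

```python
import re

def tokenize(html):
    """Split a page into tag tokens and stripped, non-empty text tokens."""
    tokens = []
    for tok in re.findall(r"<[^>]+>|[^<]+", html):
        tok = tok.strip()
        if tok:
            tokens.append(tok)
    return tokens

def is_tag(token):
    return token.startswith("<") and token.endswith(">")

def compare(wrapper, sample):
    """List (kind, position, wrapper_token, sample_token) up to the first tag mismatch."""
    mismatches = []
    for i, (w, s) in enumerate(zip(wrapper, sample)):
        if w == s:
            continue
        if not is_tag(w) and not is_tag(s):
            mismatches.append(("string", i, w, s))
        else:
            mismatches.append(("tag", i, w, s))
            break
    return mismatches

# Hypothetical pages modeled on the sample page and initial wrapper.
wrapper_page = (
    "<html><body><b>Book Name</b>Databases<b>Reviews</b><OL>"
    "<LI><b>Reviewer Name</b>John<b>Rating</b>7<b>Text</b>Fine</LI>"
    "</OL></body></html>"
)
sample_page = (
    "<html><body><b>Book Name</b>Data mining<b>Reviews</b><OL>"
    "<LI><b>Reviewer Name</b>Jeff<b>Rating</b>2<b>Text</b>Good</LI>"
    "<LI><b>Reviewer Name</b>Jane<b>Rating</b>6<b>Text</b>Great</LI>"
    "</OL></body></html>"
)
for mismatch in compare(tokenize(wrapper_page), tokenize(sample_page)):
    print(mismatch)
```

On these pages the walk reports four string mismatches followed by one tag mismatch, where the wrapper expects the list to close but the sample page opens a second list item.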
45
DeLa