Title: A Survey of WEB Information Extraction Systems
1A Survey of WEB Information Extraction Systems
- Chia-Hui Chang
- National Central University
- Sep. 22, 2005
2Introduction
- Abundant information on the Web
- Static Web pages
- Searchable databases (Deep Web)
- Information Integration
- Information for life
- e.g. shopping agents, travel agents
- Data for research purpose
- e.g. bioinformatics, auction economy
3Introduction (Cont.)
- Information Extraction (IE)
- identifies relevant information in documents, pulling information from a variety of sources and aggregating it into a homogeneous form
- An IE task is defined by its input and output
4An IE Task
5Web Data Extraction
[Figure: Web page with data records marked]
6IE Systems
- Wrappers
- Programs that perform the task of IE are referred to as extractors or wrappers.
- Wrapper Induction
- IE systems are software tools that are designed to generate wrappers.
7Various IE Survey
- Muslea
- Hsu and Dung
- Chang
- Kushmerick
- Laender
- Sarawagi
- Kuhlins and Tredwell
8Related Work Time
- MUC Approaches
- AutoSlog [Riloff, 1993], LIEP [Huffman, 1996], PALKA [Kim, 1995], HASTEN [Krupka, 1995], and CRYSTAL [Soderland, 1995]
- Post-MUC Approaches
- WHISK [Soderland, 1999], RAPIER [Califf, 1998], SRV [Freitag, 1998], WIEN [Kushmerick, 1997], SoftMealy [Hsu, 1998] and STALKER [Muslea, 1999]
9Related Work Automation Degree
- Hsu and Dung 1998
- hand-crafted wrappers using general programming languages
- specially designed programming languages or tools
- heuristic-based wrappers, and
- wrapper induction (WI) approaches
10Related Work Automation Degree
- Chang and Kuo 2003
- systems that need programmers,
- systems that need annotation examples,
- annotation-free systems and
- semi-supervised systems
11Related Work Input and Extraction Rules
- Muslea 1999
- The first class performs IE from free text, using extraction patterns that are mainly based on syntactic/semantic constraints.
- The second class is wrapper induction systems, which rely on delimiter-based rules.
- The third class also performs IE from online documents; however, the patterns of these tools are based on both delimiters and syntactic/semantic constraints.
12Related Work Extraction Rules
- Kushmerick 2003
- Finite-state tools (regular expressions)
- Relational learning tools (logic rules)
13Related Work Techniques
- Laender 2002
- languages for wrapper development
- HTML-aware tools
- NLP-based tools
- Wrapper induction tools (e.g., WIEN, SoftMealy and STALKER)
- Modeling-based tools
- Ontology-based tools
- New Criteria
- degree of automation, support for complex
objects, page contents, availability of a GUI,
XML output, support for non-HTML sources,
resilience and adaptiveness.
14Related Work Output Targets
- Sarawagi VLDB 2002
- Record-level
- Page-level
- Site-level
15Related Work Usability
- Kuhlins and Tredwell 2002
- Commercial
- Noncommercial
16Three Dimensions
- Task Domain
- Input (Unstructured, semi-structured)
- Output Targets (record-level, page-level, site-level)
- Automation Degree
- Programmer-involved, learning-based or annotation-free approaches
- Techniques
- Regular expression rules vs. Prolog-like logic rules
- Deterministic finite-state transducers vs. probabilistic hidden Markov models
17Task Domain Input
18Task Domain Output
- Missing Attributes
- Multi-valued Attributes
- Multiple Permutations
- Nested Data Objects
- Various Templates for an attribute
- Common Templates for various attributes
- Untokenized Attributes
19Classification by Automation Degree
- Manually
- TSIMMIS, Minerva, WebOQL, W4F, XWrap
- Supervised
- WIEN, Stalker, Softmealy
- Semi-supervised
- IEPAD, OLERA
- Unsupervised
- DeLa, RoadRunner, EXALG
20Automation Degree
- Page-fetching Support
- Annotation Requirement
- Output Support
- API Support
21Technologies
- Scan passes
- Extraction rule types
- Learning algorithms
- Tokenization schemes
- Features used
22A Survey of Contemporary IE Systems
- Manually-constructed IE tools
- Programmer-aided
- Supervised IE systems
- Labeled based
- Semi-supervised IE systems
- Unsupervised IE systems
- Annotation-free
24Manually-constructed IE Systems
- TSIMMIS [Hammer et al., 1997]
- Minerva [Crescenzi, 1998]
- WebOQL [Arocena and Mendelzon, 1998]
- W4F [Sahuguet and Azavant, 2001]
- XWrap [Liu et al., 2000]
25A Running Example
26TSIMMIS
- Each command is of the form
- [variables, source, pattern], where
- source specifies the input text to be considered,
- pattern specifies how to find the text of interest within the source, and
- variables is a list of variables that hold the extracted results.
- Note
- # means save in the variable
- * means discard
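A minimal sketch of how one such command could be evaluated, assuming `#` marks text to save and `*` marks text to discard. The helper name `run_command`, the variable names, and the translation to a Python regular expression are illustrative assumptions, not the TSIMMIS specification language itself.

```python
import re


def run_command(variables, source, pattern):
    """Evaluate one [variables, source, pattern] command (hypothetical helper).

    '#' in the pattern saves the matched text in the next variable;
    '*' matches text that is discarded.
    """
    parts = []
    for piece in re.split(r"([#*])", pattern):
        if piece == "#":
            parts.append("(.*?)")          # saved in a variable
        elif piece == "*":
            parts.append(".*?")            # matched, then discarded
        else:
            parts.append(re.escape(piece))  # literal text from the page
    match = re.fullmatch("".join(parts), source, flags=re.DOTALL)
    if match is None:
        return None
    return dict(zip(variables, match.groups()))


page = ("<html><body><b>Book Name</b> Databases <b>Reviews</b>"
        "<ol>...</ol></body></html>")
result = run_command(["title", "rest"], page,
                     "*<b>Book Name</b>#<b>Reviews</b>#")
print(result)
```

The second variable holds the remainder of the page, which a further command could take as its own source.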
27Minerva
- The grammar used by Minerva is defined in an EBNF
style
28WebOQL
- Select Z!.Text
- From x in browse("pe2.html"), y in x, Z in y
- Where x.Tag = "ol" and Z.Text = "Reviewer Name"
29W4F
- WYSIWYG support
- Java toolkit
- Extraction rule
- HTML parse tree (DOM object)
- e.g. html.body.ol[0].li.pcdata[0].txt
- Regular expression to address finer pieces of information
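The two-step idea (a path through the parse tree, then a regular expression for finer pieces) can be sketched as follows. This is not W4F's own Java toolkit or rule language: it uses Python's `xml.etree.ElementTree` on a well-formed snippet, and the review texts "Good" and "Fine" are made-up sample values.

```python
import re
import xml.etree.ElementTree as ET

# Well-formed stand-in for the running example page (sample values assumed).
page = ("<html><body><b>Book Name</b> Databases <b>Reviews</b><ol>"
        "<li><b>Reviewer Name</b> John <b>Rating</b> 7 <b>Text</b> Good </li>"
        "<li><b>Reviewer Name</b> Jane <b>Rating</b> 6 <b>Text</b> Fine </li>"
        "</ol></body></html>")

root = ET.fromstring(page)            # root is the <html> element
ratings = []
for li in root.findall("./body/ol/li"):   # step 1: tree path to each list item
    text = "".join(li.itertext())          # all text inside the <li>
    m = re.search(r"Rating\s*(\d+)", text)  # step 2: regex for the finer piece
    if m:
        ratings.append(int(m.group(1)))

print(ratings)
```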
30Supervised IE systems
- SRV [Freitag, 1998]
- Rapier [Califf and Mooney, 1998]
- WIEN [Kushmerick, 1997]
- WHISK [Soderland, 1999]
- NoDoSE [Adelberg, 1998]
- Softmealy [Hsu and Dung, 1998]
- Stalker [Muslea, 1999]
- DEByE [Laender, 2002b]
31SRV
- Single-slot information extraction
- Top-down (general to specific) relational learning algorithm
- Positive examples
- Negative examples
- Learning algorithm works like FOIL
- Token-oriented features
- Logic rule
Rating extraction rule :- Length(= 1), Every(numeric true), Every(in_list true).
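To make the rule concrete, the sketch below applies its three conditions to candidate fragments: the fragment has one token, every token is numeric, and every token lies inside a list. The token list, the `in_list` flags, and the function names are hypothetical. SRV learns such rules from examples, whereas here the rule is written by hand.

```python
def features(text, in_list):
    """Token-oriented features for one token (illustrative subset)."""
    return {"numeric": text.isdigit(), "in_list": in_list}


def matches_rating_rule(fragment):
    """fragment: list of (token, features) pairs."""
    return (len(fragment) == 1                            # Length(= 1)
            and all(f["numeric"] for _, f in fragment)    # Every(numeric true)
            and all(f["in_list"] for _, f in fragment))   # Every(in_list true)


# (token, is the token inside a list?) -- made-up sample input
tokens = [("Databases", False), ("2005", False), ("John", True), ("7", True)]

fragments = [[(t, features(t, inside))] for t, inside in tokens]
ratings = [frag[0][0] for frag in fragments if matches_rating_rule(frag)]
print(ratings)
```

The token "2005" is numeric but outside the list, so only the `in_list` condition rejects it.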
32Rapier
- Field-level (Single-slot) data extraction
- Bottom-up (specific to general)
- The extraction rules consist of 3 parts
- Pre-filler
- Slot-filler
- Post-filler
Book Title extraction rule:
Pre-filler: word: Book, word: Name, word: </b>
Slot-filler: Length = 2, Tag: [nn, nns]
Post-filler: word: <b>
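A hand-written sketch of how a three-part rule of this kind could be applied to a token sequence. It assumes the length constraint means "at most two tokens" and replaces a real part-of-speech tagger with a tiny made-up lookup table. `apply_rule` and `toy_pos_tag` are hypothetical names, and Rapier itself learns such rules bottom-up rather than taking them as input.

```python
def toy_pos_tag(word):
    """Stand-in for a POS tagger (assumption: a fixed lookup table)."""
    return {"Data": "nns", "mining": "nn"}.get(word, "other")


def apply_rule(tokens, pre, max_len, allowed_tags, post, pos_tag):
    """Return the first filler matching pre-filler / slot-filler / post-filler."""
    for i in range(len(tokens) - len(pre) + 1):
        if tokens[i:i + len(pre)] != pre:
            continue                                   # pre-filler must match
        start = i + len(pre)
        for n in range(1, max_len + 1):                # filler of 1..max_len tokens
            filler = tokens[start:start + n]
            after = tokens[start + n:start + n + len(post)]
            if (len(filler) == n and after == post
                    and all(pos_tag(w) in allowed_tags for w in filler)):
                return filler
    return None


tokens = ["<b>", "Book", "Name", "</b>", "Data", "mining", "<b>", "Reviews", "</b>"]
title = apply_rule(tokens, pre=["Book", "Name", "</b>"], max_len=2,
                   allowed_tags={"nn", "nns"}, post=["<b>"], pos_tag=toy_pos_tag)
print(title)
```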
33WIEN
- LR Wrapper
- (Reviewer name </b>, <b>, Rating </b>, <b>, Text </b>, </li>)
- HLRT Wrapper (Head LR Tail)
- OCLR Wrapper (Open-Close LR)
- HOCLRT Wrapper
- N-LR Wrapper (Nested LR)
- N-HLRT Wrapper (Nested HLRT)
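An LR wrapper is a list of (left, right) delimiter pairs, one per attribute. The sketch below shows one way such a wrapper could be executed over a page. The function name `exec_lr`, the exact delimiter strings, and the review texts in the sample page are assumptions for illustration; learning the delimiters from labeled pages is not shown.

```python
def exec_lr(page, delimiters):
    """Extract records using one (left, right) delimiter pair per attribute."""
    records, pos = [], 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records          # no further record on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records
            record.append(page[start:end].strip())
            pos = end                   # continue scanning after this value
        records.append(tuple(record))


page = ("<html><body><b>Book Name</b> Databases <b>Reviews</b><ol>"
        "<li><b>Reviewer Name</b> John <b>Rating</b> 7 <b>Text</b> Good </li>"
        "<li><b>Reviewer Name</b> Jane <b>Rating</b> 6 <b>Text</b> Fine </li>"
        "</ol></body></html>")
lr_wrapper = [("Reviewer Name</b>", "<b>"), ("Rating</b>", "<b>"), ("Text</b>", "</li>")]
reviews = exec_lr(page, lr_wrapper)
print(reviews)
```

The HLRT and OCLR variants add head/tail or open/close delimiters so that text outside the record area, which happens to contain a left delimiter, is skipped.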
34WHISK
- Top-down (general to specific) learning
- Example
- To generate 3-slot book reviews, it starts with the empty rule *(*)*(*)*(*)*
- Each pair of parentheses indicates a phrase to be extracted
- The phrase in the first pair of parentheses is bound to variable $1, the second to $2, etc.
- The extraction logic is similar to the LR wrapper for WIEN.
Pattern: * Reviewer Name </b> (Person) <b> * (Digit) <b>Text</b> (*) </li>
Output: BookReview {Name $1} {Rating $2} {Comment $3}
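A multi-slot rule of this shape can be approximated with an ordinary regular expression whose capture groups play the role of the bound variables. The translation below is done by hand and is only a sketch: the `PERSON` and `DIGIT` classes are simplified assumptions standing in for semantic classes, and the sample page values are made up.

```python
import re

PERSON = r"[A-Z][a-z]+"   # assumed stand-in for a Person semantic class
DIGIT = r"\d+"            # assumed stand-in for a Digit semantic class

rule = re.compile(
    r"Reviewer Name</b>\s*(" + PERSON + r")\s*<b>"   # slot 1: reviewer name
    r".*?(" + DIGIT + r")\s*<b>Text</b>"             # slot 2: rating
    r"\s*(.*?)\s*</li>",                             # slot 3: comment
    re.DOTALL,
)

page = ("<ol>"
        "<li><b>Reviewer Name</b> John <b>Rating</b> 7 <b>Text</b> Good </li>"
        "<li><b>Reviewer Name</b> Jane <b>Rating</b> 6 <b>Text</b> Fine </li>"
        "</ol>")

reviews = [
    {"Name": m.group(1), "Rating": int(m.group(2)), "Comment": m.group(3)}
    for m in rule.finditer(page)
]
print(reviews)
```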
35NoDoSE
- Assumes the order of attributes within a record to be fixed
- The user interacts with the system to decompose the input.
- For the running example
- a book title (an attribute of type string) and
- a list of Reviewer records
- RName (string), Rate (integer), and Text (string).
36Softmealy
- Finite transducer
- Contextual rules
s<,R>L ::= HTML(<b>) C1Alph(Rating) HTML(</b>)
s<,R>R ::= Spc(-) Num(-)
s<R,>L ::= Num(-)
s<R,>R ::= NL(-) HTML(<b>)
37Stalker
- Embedded Category Tree
- Multipass Softmealy
38DEByE
- Bottom-up extraction strategy
- Comparison
- DEByE: the user marks only atomic (attribute) values to assemble nested tables
- NoDoSE: the user decomposes the whole document in a top-down fashion
39Semi-supervised Approaches
- IEPAD [Chang and Lui, 2001]
- OLERA [Chang and Kuo, 2003]
- Thresher [Hogue, 2005]
40IEPAD
- Encoding of the input page
- Multiple-record pages
- Pattern Mining by PAT Tree
- Multiple string alignment
- For the running example
- <li><b>T</b>T<b>T</b>T<b>T</b>T</li>
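The first two steps can be sketched as follows: the page is encoded as a token string in which every text segment becomes `T`, and the longest token sequence that occurs at least twice without overlap is reported as the record pattern. A brute-force search is swapped in for the PAT tree here, and the alignment step is omitted. The function names and the review texts in the sample page are assumptions.

```python
import re


def encode(page):
    """Encode a page as tag tokens, with every text segment mapped to 'T'."""
    tokens = []
    for tag, text in re.findall(r"(<[^>]+>)|([^<]+)", page):
        if tag:
            tokens.append(tag.lower())
        elif text.strip():
            tokens.append("T")
    return tokens


def longest_repeat(tokens, min_count=2):
    """Longest token sequence with >= min_count non-overlapping occurrences."""
    n = len(tokens)
    for length in range(n // min_count, 0, -1):
        starts_by_pattern = {}
        for i in range(n - length + 1):
            starts_by_pattern.setdefault(tuple(tokens[i:i + length]), []).append(i)
        for pattern, starts in starts_by_pattern.items():
            count, next_free = 0, 0
            for s in starts:                 # greedy non-overlapping count
                if s >= next_free:
                    count, next_free = count + 1, s + length
            if count >= min_count:
                return list(pattern)
    return []


page = ("<html><body><b>Book Name</b> Databases <b>Reviews</b><ol>"
        "<li><b>Reviewer Name</b> John <b>Rating</b> 7 <b>Text</b> Good </li>"
        "<li><b>Reviewer Name</b> Jane <b>Rating</b> 6 <b>Text</b> Fine </li>"
        "</ol></body></html>")
pattern = "".join(longest_repeat(encode(page)))
print(pattern)
```

The search needs at least two records on the page, which is why the approach targets multiple-record pages.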
41OLERA
- Online extraction rule analysis
- Enclosing
- Drill-down / Roll-up
- Attribute Assignment
42Thresher
- Work similar to OLERA
- Apply tree alignment instead of string alignment
43Unsupervised Approaches
- Roadrunner [Crescenzi, 2001]
- DeLa [Wang, 2002; 2003]
- EXALG [Arasu and Garcia-Molina, 2003]
- DEPTA [Zhai et al., 2005]
44Roadrunner
- Input multiple pages with the same template
- Match two input pages at one time
Sample page:
<html><body>
<b>Book Name</b> Data mining
<b>Reviews</b>
<OL>
<LI><b>Reviewer Name</b> Jeff <b>Rating</b> 2 <b>Text</b> </LI>
<LI><b>Reviewer Name</b> Jane <b>Rating</b> 6 <b>Text</b> </LI>
</OL>
</body></html>
Wrapper (initially):
<html><body>
<b>Book Name</b> Databases
<b>Reviews</b>
<OL>
<LI><b>Reviewer Name</b> John <b>Rating</b> 7 <b>Text</b> </LI>
</OL>
</body></html>
Parsing the sample page against the initial wrapper: four string mismatches, one tag mismatch
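A simplified sketch of the matching step: the wrapper and the sample page are tokenized and compared position by position. A string mismatch is generalized to a `#PCDATA` field, and the comparison stops at the first tag mismatch, which is only reported. Resolving tag mismatches (discovering optional or repeated parts) is deliberately left out. The function names and the review texts "Good", "Poor" and "Fine" are assumptions.

```python
import re


def tokenize(page):
    """Split a page into tag tokens and non-empty text tokens."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", page) if t.strip()]


def is_tag(token):
    return token.startswith("<")


def match(wrapper, sample):
    """Return (generalized wrapper prefix, first unresolved tag mismatch or None)."""
    generalized = []
    for w, s in zip(wrapper, sample):
        if w == s:
            generalized.append(w)
        elif not is_tag(w) and not is_tag(s):
            generalized.append("#PCDATA")   # string mismatch -> data field
        else:
            return generalized, (w, s)      # tag mismatch: not resolved here
    return generalized, None


wrapper = tokenize(
    "<html><body><b>Book Name</b>Databases<b>Reviews</b><OL>"
    "<LI><b>Reviewer Name</b>John<b>Rating</b>7<b>Text</b>Good</LI>"
    "</OL></body></html>")
sample = tokenize(
    "<html><body><b>Book Name</b>Data mining<b>Reviews</b><OL>"
    "<LI><b>Reviewer Name</b>Jeff<b>Rating</b>2<b>Text</b>Poor</LI>"
    "<LI><b>Reviewer Name</b>Jane<b>Rating</b>6<b>Text</b>Fine</LI>"
    "</OL></body></html>")

generalized, tag_mismatch = match(wrapper, sample)
print(generalized.count("#PCDATA"), tag_mismatch)
```

On this input the sketch finds four data fields before stopping where the wrapper closes the list and the sample opens a second list item.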
45DeLa
- Similar to IEPAD
- Works for one input page
- Handle nested data structure
- Example
- <P><A>T</A><A>T</A>T</P><P><A>T</A>T</P>
- <P><A>T</A>T</P><P><A>T</A>T</P>
- (<P>(<A>T</A>)*T</P>)*
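A nested pattern of this kind, with a repeated inner group inside a repeated record, can be checked against encoded token strings with an ordinary regular expression. The sketch below assumes the pattern reads as "zero or more records, each holding zero or more links followed by one text item". It only applies such a pattern and does not show how DeLa induces it.

```python
import re

# Nested repetition: records repeat, and links repeat inside each record.
RECORD = r"<P>(?:<A>T</A>)*T</P>"
template = re.compile(r"(?:" + RECORD + r")*")

encoded_pages = [
    "<P><A>T</A><A>T</A>T</P><P><A>T</A>T</P>",
    "<P><A>T</A>T</P><P><A>T</A>T</P>",
]

all_match = all(template.fullmatch(s) is not None for s in encoded_pages)

# Split the first string into records and count the links in each one.
records = re.findall(RECORD, encoded_pages[0])
links_per_record = [r.count("<A>") for r in records]
print(all_match, links_per_record)
```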
46EXALG
- Input multiple pages with the same template
- Techniques
- Differentiating token roles
- Equivalence classes (ECs) form a template
- Tokens with the same occurrence vector
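The grouping step can be sketched as follows: each token gets an occurrence vector (its count in every input page), and tokens with identical vectors fall into one equivalence class. The tiny tokenized pages and the function name are made up for illustration. Differentiating token roles and selecting which classes form the template are not shown.

```python
from collections import defaultdict


def equivalence_classes(pages):
    """Group tokens by occurrence vector (count of the token in each page)."""
    vocabulary = {token for page in pages for token in page}
    classes = defaultdict(list)
    for token in sorted(vocabulary):
        vector = tuple(page.count(token) for page in pages)
        classes[vector].append(token)
    return dict(classes)


# Two pages generated from the same template (made-up sample input).
page1 = ["<html>", "<b>", "Book", "</b>", "Databases",
         "<li>", "Reviewer", "John", "</li>", "</html>"]
page2 = ["<html>", "<b>", "Book", "</b>", "Data", "mining",
         "<li>", "Reviewer", "Jeff", "</li>",
         "<li>", "Reviewer", "Jane", "</li>", "</html>"]

classes = equivalence_classes([page1, page2])
for vector, tokens in sorted(classes.items()):
    print(vector, tokens)
```

Tokens that appear once per page and tokens that appear once per review end up in two separate classes, while data values land in classes with a zero in some page.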
47DEPTA
- Identify data region
- Allow mismatch between data records
- Identify data record
- Data records may not be continuous
- Identify data items
- By partial tree alignment
48Comparison
- How do we differentiate template tokens from data tokens?
- DeLa and DEPTA assume HTML tags are template tokens, while other tokens are data
- IEPAD and OLERA leave the problem to users
- How do we apply the information from multiple pages?
- DeLa and DEPTA conduct the mining from a single page
- Roadrunner and EXALG do the analysis from multiple pages
49Comparison (Cont.)
- Techniques improvement
- From string alignment (IEPAD, RoadRunner) to tree alignment (DEPTA, Thresher)
- From full alignment (IEPAD) to partial alignment (DEPTA)
50Task domain comparison
- Page type
- structured, semi-structured or free-text Web pages
- Non-HTML support
- Extraction level
- Field-level, record-level, page-level
51Task domain comparison (Cont.)
- Extraction target variation
- Missing attributes, multi-valued attributes, multiple attribute permutations
- Template variation
- Untokenized attributes
53Technique-based comparison
- Scan pass
- Single pass vs. multiple passes
- Extraction rule type
- Regular expression vs. logic rules
- Features used
- DOM tree information, POS tags, etc.
- Learning algorithm
- Machine learning vs. pattern mining
- Tokenization schemes
56Conclusion
- Criteria for evaluating IE systems from the task domain
- Comparison of IE systems by automation degree
- The use of various techniques in IE systems
57Future Work
- Page Fetching
- XWrap, W4F, WNDL
- Schema Mapping
- Full information
- Partial information
- Query Interface Integration
58References
- C.-H. Chang, M. Kayed, M. R. Girgis, K. Shaalan,
A survey of Web Information Extraction Systems.