Extracting Semistructured Information from the Web

Slides: 12
Provided by: UMR
Learn more at: https://web.mst.edu

Transcript and Presenter's Notes

1
Extracting Semistructured Information from the Web
J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, A. Crespo (Stanford University)
Presented by Wei Mao
2
  • Introduction
  • Background
    • Rapid growth of the WWW
    • Semistructured data in web pages
    • Difficulty of manipulating web data
  • One solution
    • A configurable extraction program
    • Extraction result represented in OEM
    • A wrapper is used for querying

3
  • A detailed example
  • Weather table

Can we query: "What is the forecast for Vienna for Jan. 28, 1997?"
4
  • Extraction process
  • HTML file
  • Specification file
    • Commands
      • variables, source, pattern
  • Package result into an OEM object
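
The extraction steps listed above can be illustrated with a minimal Python sketch. It is not the authors' extractor: the pattern syntax here is ordinary Python regular expressions standing in for the tool's own pattern language, the names `run_spec` and `to_oem` are hypothetical, and the HTML fragment and its values are invented for illustration. The sketch only shows the general idea of running a sequence of (variables, source, pattern) commands and packaging the bound values as labeled, typed objects in the spirit of OEM.

```python
import re

def run_spec(commands, sources):
    """Run (variables, source, pattern) commands in order.

    Each command applies `pattern` to the text bound to `source` and binds
    the captured groups to the names in `variables`. Later commands may use
    variables bound by earlier ones as their source.
    """
    env = dict(sources)
    for variables, source, pattern in commands:
        match = re.search(pattern, env[source], flags=re.DOTALL)
        if match is None:
            raise ValueError(f"pattern for {variables} did not match {source!r}")
        for name, value in zip(variables, match.groups()):
            env[name] = value.strip()
    return env

def to_oem(label, value):
    """Package a value as a simplified OEM-style object (label, type, value)."""
    if isinstance(value, list):
        return {"label": label, "type": "set", "value": value}
    return {"label": label, "type": "string", "value": value}

# Hypothetical page content; the real weather table is not in the transcript.
page = ("<html><body><h1>Weather</h1><table>"
        "<tr><td>Vienna</td><td>snow</td></tr>"
        "</table></body></html>")

# Hypothetical specification: first isolate the table, then pull two cells.
commands = [
    (["table"], "root", r"<table>(.*?)</table>"),
    (["city", "forecast"], "table", r"<td>(.*?)</td>\s*<td>(.*?)</td>"),
]

env = run_spec(commands, {"root": page})
result = to_oem("weather", [to_oem("city", env["city"]),
                            to_oem("forecast", env["forecast"])])
```

Chaining commands so that one command's variable becomes the next command's source keeps each pattern short, which is one plausible reason for the three-part command form.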

5
The HTML for weather table
6
A sample specification file
7
Extraction result
8
Customizing the extraction result
9
Additional capabilities
  • Extract_table construct
  • Case operator
  • Get(url) operator

Query the extracted result
  • Use an existing wrapper generation tool
  • Only a simple interface is required
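
As a rough illustration of querying an extracted result, the sketch below walks a nested, labeled structure and answers a question of the kind posed on the weather slide. It is an assumption-laden stand-in: it does not use Lorel or any real wrapper tool, the helper names `find_by_label` and `forecast_for` are hypothetical, and the sample data is invented.

```python
def find_by_label(obj, label):
    """Yield every object with the given label, searching depth-first."""
    if obj["label"] == label:
        yield obj
    if isinstance(obj["value"], list):
        for child in obj["value"]:
            yield from find_by_label(child, label)

def forecast_for(root, city):
    """Return the forecast of the first entry whose city matches, else None."""
    for entry in find_by_label(root, "entry"):
        fields = {child["label"]: child["value"] for child in entry["value"]}
        if fields.get("city") == city:
            return fields.get("forecast")
    return None

# Hypothetical extraction result in a simplified OEM-like form.
weather = {"label": "weather", "value": [
    {"label": "entry", "value": [
        {"label": "city", "value": "Vienna"},
        {"label": "forecast", "value": "snow"},
    ]},
    {"label": "entry", "value": [
        {"label": "city", "value": "Paris"},
        {"label": "forecast", "value": "rain"},
    ]},
]}

answer = forecast_for(weather, "Vienna")
```

Because every object carries its own label, the query needs no fixed schema, which matches the slide's point that only a simple interface is required.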

10
Advantages
  • Manipulate web data efficiently
  • Flexible
  • Easy to use
  • Reuses existing systems (OEM, Lorel, HTML parser)

11
Disadvantages
  • Depends on outside input
  • Requires prior knowledge of the structure of the HTML file
  • Requires a specification file