Web Content Extraction through Histogram clustering


1
Web Content Extraction through Histogram clustering
  • Tim Weninger and William H. Hsu
  • Department of Computing and Information Sciences
  • Kansas State University, Manhattan KS
  • ANNIE 2008
  • St. Louis, MO USA

2
Outline
  • Introduction
  • Motivation
  • Related Work
  • The Text-to-Tag Ratio
  • Heuristic
  • Worst Case
  • Methodology
  • Pre-processing
  • Computing clusters
  • Results
  • Evaluation Metrics
  • Results
  • Conclusions and Future Work

3
Introduction: Motivation 1
  • Problem
  • Too much junk in a web page
  • Goal
  • Extract only the content of a page

Taken from The Hutchinson News on 8/14/2008
4
Introduction: Motivation 2 (Example)
5
Related Work 1
  • Naïve Approach
  • Remove all HTML tags
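
The naïve approach can be sketched in a few lines. This is only an illustration: the regular expression, the function name `strip_tags`, and the sample page are assumptions made for the example, not taken from the slides. It shows why the approach falls short: navigation text survives alongside the article text.

```python
import re

# Assumption: anything between "<" and ">" is treated as a tag.
TAG_RE = re.compile(r"<[^>]*>")

def strip_tags(html: str) -> str:
    """Naive extraction: delete every tag and keep whatever text is left."""
    return TAG_RE.sub("", html)

# Hypothetical page fragment: a navigation bar followed by one content paragraph.
page = (
    '<div id="nav"><a href="/">Home</a> | <a href="/sports">Sports</a></div>'
    "<p>City council approves budget.</p>"
)

print(strip_tags(page))
# prints: Home | SportsCity council approves budget.
```

The menu labels "Home" and "Sports" remain in the output, so removing tags alone does not separate content from junk.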

6
Related Work 2
  • Tag Approach
  • Use HTML tags as clues for content
  • Problem: style sheets

7
Text-to-Tag Ratio 1
8
Text-to-Tag Ratio 2
  • Example
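
A minimal sketch of a per-line text-to-tag ratio follows. The slide transcript does not preserve the definition, so this assumes the ratio is the number of non-tag characters on a line divided by the number of tags on that line, and that a line with no tags keeps its text length as the ratio. The names `text_to_tag_ratio` and `ttr_array` and the sample lines are hypothetical.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")

def text_to_tag_ratio(line: str) -> float:
    """Ratio of non-tag characters to tags on one line of HTML."""
    tag_count = len(TAG_RE.findall(line))
    text_length = len(TAG_RE.sub("", line).strip())
    # Assumption: with zero tags, divide by 1 so the ratio equals the text length.
    return text_length / max(tag_count, 1)

def ttr_array(html: str) -> list[float]:
    """One ratio per line of the document."""
    return [text_to_tag_ratio(line) for line in html.splitlines()]

# Hypothetical two-line page: a menu line, then a content line.
sample = (
    '<div><a href="/">Home</a><a href="/news">News</a></div>\n'
    "<p>The city council approved the budget on Monday.</p>"
)
for ratio in ttr_array(sample):
    print(round(ratio, 2))
```

Under these assumptions, tag-heavy menu lines score low and prose lines score high, which is the signal the later clustering steps work with.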

9
Text-to-Tag Ratio 3
  • Worst Case 1
  • Non-HTML or all content pages

10
Text-to-Tag Ratio 4
  • Worst Case 2
  • American Declaration of Independence Web page

American Declaration of Independence TTR computed from digital copy at
http://www.ushistory.org/declaration/document/index.htm
11
Methodology 1
  • Preprocessing
  • Content Blurring
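
Content blurring can be read as smoothing the array of per-line ratios so that isolated spikes and dips are evened out. The sketch below assumes a one-dimensional Gaussian-weighted average; the `radius` and `sigma` values and the edge handling (clamping to the first and last line) are illustrative choices, not parameters given in the slides.

```python
import math

def gaussian_kernel(radius: int, sigma: float) -> list[float]:
    """Normalized Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(values: list[float], radius: int = 2, sigma: float = 1.0) -> list[float]:
    """Blur a ratio array; indices past either end reuse the edge value."""
    kernel = gaussian_kernel(radius, sigma)
    n = len(values)
    result = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp to valid range
            acc += weight * values[j]
        result.append(acc)
    return result

# A single high-ratio line surrounded by low-ratio lines gets spread out.
print([round(v, 2) for v in smooth([0.0, 0.0, 30.0, 0.0, 0.0])])
```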

12
Methodology 2
  • Threshold Clustering
  • Threshold clustering based on standard deviation

Std. dev. is 20.3 TTR for the Hutchinson News document
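
A hedged sketch of threshold clustering: it assumes the cutoff is the standard deviation of the ratio array itself and that every line at or above the cutoff is labelled content. The `scale` parameter and the sample values are hypothetical additions for experimentation.

```python
from statistics import pstdev

def threshold_content(ttr: list[float], scale: float = 1.0) -> list[bool]:
    """Mark a line as content when its ratio reaches scale * std. dev."""
    cutoff = scale * pstdev(ttr)  # population standard deviation of the array
    return [value >= cutoff for value in ttr]

# Hypothetical ratios: two prose lines among three tag-heavy lines.
ratios = [1.0, 2.0, 40.0, 45.0, 1.5]
print(threshold_content(ratios))
# prints: [False, False, True, True, False]
```

Note that `pstdev` raises `StatisticsError` on an empty list, so an empty document needs to be handled before calling this function.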
13
Methodology 3
  • Histogram Clustering in 2-Dimensions
  • Looks for jumps in the moving average of the
    TTRArray

14
Methodology 4
  • Histogram Clustering in 2-Dimensions
  • Moving differences
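
The moving average and moving differences can be sketched as below. The window size, the centered window, and the convention that the first difference is 0.0 are assumptions made for the example; the slides do not specify them.

```python
def moving_average(values: list[float], window: int = 3) -> list[float]:
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    result = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        result.append(sum(values[lo:hi]) / (hi - lo))
    return result

def moving_differences(values: list[float]) -> list[float]:
    """Absolute change from the previous element; the first entry is 0.0."""
    if not values:
        return []
    return [0.0] + [abs(b - a) for a, b in zip(values, values[1:])]

# Hypothetical ratio array with one jump from boilerplate to content.
ratios = [1, 1, 1, 10, 10, 10]
averaged = moving_average(ratios)
print(averaged)                      # [1.0, 1.0, 4.0, 7.0, 10.0, 10.0]
print(moving_differences(averaged))  # [0.0, 0.0, 3.0, 3.0, 3.0, 0.0]
```

Large differences mark the lines where the averaged ratio jumps, which is where a boundary between junk and content is likely to sit.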

15
Methodology 5
  • Histogram Clustering in 2-Dimensions
  • Scatterplots
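
One way to picture clustering in two dimensions is to treat each line as a point and split the points into two groups. The sketch below assumes each point is a (smoothed ratio, moving difference) pair and uses a small two-cluster K-Means with a deterministic start; the feature choice, the initialization, and the sample points are illustrative assumptions, not the authors' configuration.

```python
import math

Point = tuple[float, float]

def kmeans_two_clusters(points: list[Point], iterations: int = 20) -> list[int]:
    """Label each point 0 or 1 using plain K-Means with k = 2."""
    # Assumption: start from the lexicographically smallest and largest points.
    centroids = [min(points), max(points)]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            0 if math.dist(p, centroids[0]) <= math.dist(p, centroids[1]) else 1
            for p in points
        ]
        updated = []
        for cluster in (0, 1):
            members = [p for p, label in zip(points, labels) if label == cluster]
            if members:
                updated.append((
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                ))
            else:
                updated.append(centroids[cluster])  # keep an empty cluster in place
        if updated == centroids:
            break  # assignments have stabilized
        centroids = updated
    return labels

# Hypothetical (ratio, difference) points: three low-ratio lines, two high-ratio lines.
points = [(1.0, 0.0), (1.5, 0.5), (2.0, 0.3), (40.0, 3.0), (45.0, 5.0)]
print(kmeans_two_clusters(points))
# prints: [0, 0, 0, 1, 1]
```

With this initialization, label 1 is the cluster that started from the largest point, so it corresponds to the high-ratio lines.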

16
Methodology 6
  • Evaluation method
  • 176 Pages selected by querying Yahoo search for
    the
  • Gold standard for each page created by a CS
    undergraduate.
  • Metrics computed against gold standard and
    averaged
  • Evaluation Metrics
  • Accuracy, Precision, Recall, ROC curve
  • Evaluation Algorithms
  • Farthest First, K-Means, Expectation Maximization
  • Density and Distance Modes
  • Clustering results are compared to Threshold
    Results
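
The per-page metrics can be sketched as a comparison of predicted and gold-standard labels. This assumes both are given as one boolean per line (True meaning content); the function name `evaluate` and the sample labels are hypothetical, and the ROC curve is left out.

```python
def evaluate(predicted: list[bool], gold: list[bool]) -> dict[str, float]:
    """Accuracy, precision, and recall of predicted labels against a gold standard."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p and g)
    fp = sum(1 for p, g in pairs if p and not g)
    fn = sum(1 for p, g in pairs if not p and g)
    tn = sum(1 for p, g in pairs if not p and not g)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical labels for a five-line page.
predicted = [True, True, False, False, True]
gold = [True, False, False, True, True]
print(evaluate(predicted, gold))
```

Averaging these dictionaries over all pages gives the kind of aggregate figure the evaluation describes.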

17
Results
  • Threshold
  • Clustering

18
Conclusions and Future Work
  • Text-To-Tag Ratio Approach
  • A valid content extraction technique
  • But has limitations
  • Prediction clustering
  • General histogram clustering
  • Uses Gaussian Blurring
  • Analysis of the slope of the tangent line
  • Extracting dimensions and re-clustering

19
Questions?