Content Extraction from HTML Documents

About This Presentation

Title:

Content Extraction from HTML Documents

Slides: 18

Provided by: Ahmad63

Tags: html | content | documents | extraction

Transcript and Presenter's Notes

1
Content Extraction from HTML Documents

2
Current need?

Viewing websites on small-screen handheld
devices
Since web sites are written in HTML, their
content must be translated into formats that
wireless devices can support.

3
Current Solutions

Handcrafting
Custom Web Sites are typically crafted by hand by
a set of content experts
Transcoding
Transcoding replaces HTML tags with suitable
device-specific tags (HDML, WML, etc.)
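
As an illustration of the tag-replacement idea, here is a minimal Python sketch. The mapping table, the function name transcode and the single-card WML wrapper are assumptions made for this example, not part of any vendor's product, and the output is not validated against the WML DTD.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical HTML-to-WML tag mapping, chosen only for illustration.
TAG_MAP = {"p": "p", "b": "b", "i": "i", "a": "a", "h1": "big", "h2": "big"}


class Transcoder(HTMLParser):
    """Replaces mapped HTML tags with WML tags; unmapped tags are dropped."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        target = TAG_MAP.get(tag)
        if target is None:
            return  # tag has no mapping: drop it, but keep its text
        if target == "a":
            href = dict(attrs).get("href") or ""
            self.out.append(f'<a href="{escape(href)}">')
        else:
            self.out.append(f"<{target}>")

    def handle_endtag(self, tag):
        target = TAG_MAP.get(tag)
        if target is not None:
            self.out.append(f"</{target}>")

    def handle_data(self, data):
        # Note: script and style text is not filtered out in this sketch.
        if data.strip():
            self.out.append(escape(data))


def transcode(html_text):
    parser = Transcoder()
    parser.feed(html_text)
    parser.close()
    return '<wml><card id="main">' + "".join(parser.out) + "</card></wml>"


print(transcode("<h1>News</h1><div><p>Hello <b>world</b></p></div>"))
```

Note that every part of the page is carried over to the device; nothing is selected or summarized.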

4
Handcrafting

Automation
Use of XML.
There is no standard XML tagset (Document Type
Definition DTD) in use by vendors.
XML has been available to web designers for the
last 10 years. Examination of websites shows
little use of document structural elements.
Web masters see themselves as artists rather than
programmers.
XML may meet the same fate as SGML, an earlier
attempt to create structured documents.

5
Handcrafting

Take an existing website and make it available for
wireless access. Aether Systems, Mshift and 2Roam
currently offer these types of solutions.
Use a proprietary graphical interface to ease the
development of wireless applications from
scratch. Covigo and iConverse offer this type of
solution.
Let the user do all coding in languages such as
C or Java. ThinAirApps offers this type of
solution.

6
Handcrafting

7
Transcoding

Most web pages have a loose repeating visual
structure. The wireless user gets the same
repeating information with every screen.
Browsing is an unfriendly experience.
Transcoding sends all the information to the
wireless device, which makes browsing
substantially slower over the wireless network.

8
Transcoding

Transcoding was introduced in Japan during
1999-2000. It was widely rejected by the Japanese
users.
Recently, Google and Pixo introduced this
solution for the US market, but have so far
failed to attract the attention of end users.

9
The Alternate Solution

10
Steps to Content Extraction

Structural Analysis: Understanding the
relationship of the various segments with the
document
Decomposition: Breakdown of these segments into
operational units
Contextual Analysis: Employment of context to
revise the segmentation
(Continued)

11
Steps to Content Extraction (Continued)

Labeling > Segment Summary: Extraction of a
low-level summary of the segment
Priority: Estimating the importance of these
segments
Table of Contents (TOC) > Document Summary:
Putting together a summary of the document
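
The decomposition, labeling, priority and table-of-contents steps can be sketched in Python as follows. This is only a rough illustration: the set of block tags, the "first five words" label and the "non-link word count" priority are assumptions made for the example, and the contextual analysis step is left out.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Assumed segment boundaries; a real system would use structural analysis.
BLOCK_TAGS = {"div", "p", "td", "li", "form", "h1", "h2", "h3"}


@dataclass
class Segment:
    text: str
    link_words: int  # words that appeared inside <a> elements


class Segmenter(HTMLParser):
    """Decomposition: cut the page into text segments at block tags."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._words = []
        self._link_words = 0
        self._in_link = 0

    def flush(self):
        if self._words:
            self.segments.append(Segment(" ".join(self._words), self._link_words))
        self._words, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)


def decompose(html_text):
    parser = Segmenter()
    parser.feed(html_text)
    parser.close()
    parser.flush()  # emit any trailing text
    return parser.segments


def label(segment, max_words=5):
    """Labeling: a low-level summary, here simply the first few words."""
    words = segment.text.split()
    return " ".join(words[:max_words]) + (" ..." if len(words) > max_words else "")


def priority(segment):
    """Priority: assumed heuristic, the number of words outside links."""
    return len(segment.text.split()) - segment.link_words


def table_of_contents(html_text):
    """TOC: segment labels ordered from most to least important."""
    ranked = sorted(decompose(html_text), key=priority, reverse=True)
    return [label(s) for s in ranked]
```

With this heuristic, a paragraph of running text ranks above a block made up only of links, so the wireless user sees the story first and the navigation last.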

12
Content Extraction

Proximity Analysis: Relational analysis of
content between segments
Content Classification: Classification into
various types, such as stories, navigation,
links, images and forms
Relationship Analysis
Contextual grammar (Natural Language)
Knowledge modes
Information retrieval techniques
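
A minimal Python sketch of content classification for a single segment is shown below. The rules and thresholds (more than half of the words inside links, at most two words per link for navigation) are assumptions made for illustration; they stand in for the contextual grammar, knowledge and information retrieval techniques listed above.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Collects simple statistics about one HTML segment."""

    def __init__(self):
        super().__init__()
        self.tags = {"a": 0, "img": 0, "form": 0, "input": 0}
        self.words = 0
        self.link_words = 0
        self._in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.tags[tag] += 1
        if tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        count = len(data.split())
        self.words += count
        if self._in_link:
            self.link_words += count


def classify(fragment):
    """Returns 'form', 'image', 'navigation', 'links', 'story' or 'other'."""
    stats = TagCounter()
    stats.feed(fragment)
    stats.close()
    if stats.tags["form"] or stats.tags["input"]:
        return "form"
    if stats.words == 0:
        return "image" if stats.tags["img"] else "other"
    # Assumed threshold: link-dominated segments are navigation or links.
    if stats.link_words / stats.words > 0.5:
        words_per_link = stats.link_words / max(stats.tags["a"], 1)
        return "navigation" if words_per_link <= 2 else "links"
    return "story"
```

For example, classify('<a href="/">Home</a> <a href="/news">News</a>') falls into the navigation class under these rules, while a paragraph of plain text is treated as a story.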

13
Content Extraction: Why do we need it?

14
Content Extraction: Why do we need it?
(Continued)

Easy Configurability: Any such system should be
easily configurable
Rapid Deployment: Should be rapidly deployable
Non-intrusive Design: Should be possible to
transform web sites without modifying the actual
web site
Multiple Views: System integrators should be able
to create multiple views of the same site

15
Advantages of Content Extraction