Content Extraction from HTML Documents

Description:

Title: A Comparative Study of Some Multiple Expert Recognition Strategies
Author: Ahmad Fuad Rezaur Rahman
Last modified by: Ahmad F R Rahman

Slides: 18
Provided by: Ahmad63

Transcript and Presenter's Notes


1
Content Extraction from HTML Documents
  • A. Rahman, H. Alam, R. Hartono
  • Document Analysis and Recognition Team (DART)
  • BCL Computers Inc.
  • Santa Clara, Calif., USA

2
Current need?
  • Viewing web sites on small-screen handheld
    devices
  • Since web sites are written in HTML, we need to
    translate them into formats that wireless
    devices can support.

3
Current Solutions
  • Handcrafting
      • Custom web sites are typically crafted by hand
        by a set of content experts
  • Transcoding
      • Transcoding replaces HTML tags with suitable
        device-specific tags (HDML, WML, etc.)
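
To make the transcoding idea concrete, here is a minimal sketch of one-for-one tag replacement. The tag map is a hypothetical illustration, not a complete or authoritative HTML-to-WML mapping, and the names TAG_MAP, NaiveTranscoder and transcode are assumptions made for this example.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical, illustrative tag map (assumption, not a full HTML-to-WML table).
TAG_MAP = {"h1": "big", "h2": "b", "strong": "b", "em": "i",
           "p": "p", "a": "a", "br": "br"}


class NaiveTranscoder(HTMLParser):
    """Rewrites tags one-for-one and keeps all text, as a plain transcoder would."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        target = TAG_MAP.get(tag)
        if target is None:
            return  # unsupported tag: drop the tag itself, keep its text
        if target == "br":
            self.out.append("<br/>")
        elif target == "a":
            href = dict(attrs).get("href") or ""
            self.out.append('<a href="%s">' % escape(href, quote=True))
        else:
            self.out.append("<%s>" % target)

    def handle_endtag(self, tag):
        target = TAG_MAP.get(tag)
        if target and target != "br":
            self.out.append("</%s>" % target)

    def handle_data(self, data):
        # Text is passed through unchanged (re-escaped for the output markup).
        self.out.append(escape(data, quote=False))


def transcode(html):
    parser = NaiveTranscoder()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(transcode('<h1>News</h1><div><a href="/a">More</a><br></div>'))
```

Note that every piece of text survives the conversion. Only the tags change, so the page is no smaller on the handheld device than it was on the desktop.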

4
Handcrafting
  • Automation
  • Use of XML.
  • There is no standard XML tag set (Document Type
    Definition, DTD) in use by vendors.
  • XML has been available to web designers for the
    last 10 years. Examination of websites shows
    little use of document structural elements.
  • Web masters see themselves as artists rather than
    programmers.
  • XML may meet the same fate as SGML, an earlier
    attempt to create structured documents.

5
Handcrafting
  • Take an existing web site and make it available
    for wireless access. Aether Systems, Mshift and
    2Roam currently offer this type of solution.
  • Use a proprietary graphical interface to ease the
    development of wireless applications from
    scratch. Covigo and iConverse offer this type of
    solution.
  • Let the user do all coding in languages such as
    C or Java. ThinAirApps offers this type of
    solution.

6
Handcrafting
  • Labor intensive
  • Expensive
  • Typically less than 1% of a web site gets
    converted to wireless content.

7
Transcoding
  • Most web pages have a loose repeating visual
    structure. The wireless user gets the same
    repeating information with every screen.
  • Browsing is an unfriendly experience.
  • Transcoding sends all the information to the
    wireless device, which makes browsing
    substantially slower over the wireless network.

8
Transcoding
  • Transcoding was introduced in Japan during
    1999-2000. It was widely rejected by Japanese
    users.
  • Recently, Google and Pixo introduced this
    solution for the US market, but have so far
    failed to attract the attention of end users.

9
The Alternate Solution
  • Separate the content into smaller segments
  • Generate a summary of each segment
  • Prioritize the summaries of the individual
    segments
  • Combine them to form a summary of the overall
    document
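
The four steps can be sketched as a small pipeline. This is only an illustration under stated assumptions: segments are taken to be blank-line-separated text blocks, the summary is the start of the first sentence, and priority is approximated by word count. None of these heuristics comes from the presentation, and all function names are hypothetical.

```python
import re


def split_segments(text):
    """Step 1: separate the content into smaller segments (blank-line blocks)."""
    return [s.strip() for s in re.split(r"\n\s*\n", text) if s.strip()]


def summarize(segment, max_words=8):
    """Step 2: crude summary, the first few words of the first sentence."""
    first_sentence = re.split(r"(?<=[.!?])\s+", segment, maxsplit=1)[0]
    words = first_sentence.split()
    suffix = " ..." if len(words) > max_words else ""
    return " ".join(words[:max_words]) + suffix


def priority(segment):
    """Step 3: assumed importance score, longer segments rank higher."""
    return len(segment.split())


def summarize_document(text):
    """Step 4: combine the prioritized summaries into a document summary."""
    segments = split_segments(text)
    scored = [(priority(s), summarize(s)) for s in segments]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort
    return [summary for _, summary in scored]


page = (
    "Home | News | Sports\n\n"
    "The city council approved the new budget on Monday. "
    "Details follow in the full story.\n\n"
    "Contact us."
)
for line in summarize_document(page):
    print(line)
```

A real system would replace each heuristic with something stronger, but the order of operations stays the same: segment, summarize, prioritize, combine.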

10
Steps to Content Extraction
  • Structural analysis: understanding the
    relationship of the various segments with the
    document
  • Decomposition: breakdown of these segments into
    operational units
  • Contextual analysis: employment of context to
    revise the segmentation
  • (Continued >)

11
Steps to Content Extraction (Continued)
  • Labeling > Segment Summary: extraction of a
    low-level summary of the segment
  • Priority: estimating the importance of these
    segments
  • Table of Contents (TOC) > Document Summary:
    putting together a summary of the document
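
As a rough illustration of how structural analysis, labeling and the table of contents fit together, the sketch below splits an HTML page at its headings, uses each heading as the segment label, and reports a word count that a priority step could use. Treating headings as segment boundaries is an assumption made for this example. The presentation does not say how segments are found, and the names HeadingSegmenter and table_of_contents are hypothetical.

```python
from html.parser import HTMLParser

HEADINGS = {"h1": 1, "h2": 2, "h3": 3}


class HeadingSegmenter(HTMLParser):
    """Starts a new segment at every h1-h3 and collects the text that follows."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = []  # each item: {"level": int, "label": str, "text": str}
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.segments.append({"level": HEADINGS[tag], "label": "", "text": ""})
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        if not self.segments:
            return  # text before the first heading is ignored in this sketch
        key = "label" if self._in_heading else "text"
        self.segments[-1][key] += " " + data


def table_of_contents(html):
    """Returns (heading level, label, word count) for each segment."""
    parser = HeadingSegmenter()
    parser.feed(html)
    parser.close()
    toc = []
    for seg in parser.segments:
        label = " ".join(seg["label"].split())
        toc.append((seg["level"], label, len(seg["text"].split())))
    return toc


doc = ("<h1>Daily News</h1><p>Top stories today.</p>"
       "<h2>Sports</h2><p>Home team wins again in overtime.</p>")
for level, label, words in table_of_contents(doc):
    print("  " * (level - 1) + label, "(%d words)" % words)
```

Many pages of the period used tables rather than headings for layout, so a practical segmenter would need more structural cues than this one.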

12
Content Extraction
  • Proximity analysis: relational analysis of
    content between segments
  • Content classification: classification into
    various types, e.g. stories, navigation,
    links, images, forms, etc.
  • Relationship Analysis
  • Contextual grammar (Natural Language)
  • Knowledge modes
  • Information retrieval techniques
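
Content classification could be approximated with simple counts over a segment's markup. The sketch below is a heuristic illustration only: the thresholds, the category rules and the names SegmentStats and classify_segment are assumptions for this example, not the classifier described in the presentation.

```python
from html.parser import HTMLParser

LINK_TEXT_SHARE = 0.6        # assumed: share of words inside links that marks a link block
NAV_MAX_WORDS_PER_LINK = 2   # assumed: short link texts suggest a navigation bar


class SegmentStats(HTMLParser):
    """Counts links, images, form fields and words in one HTML segment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = self.images = self.form_fields = 0
        self.words = self.link_words = 0
        self._link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += 1
            self._link_depth += 1
        elif tag == "img":
            self.images += 1
        elif tag in ("input", "select", "textarea"):
            self.form_fields += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth:
            self._link_depth -= 1

    def handle_data(self, data):
        n = len(data.split())
        self.words += n
        if self._link_depth:
            self.link_words += n


def classify_segment(html):
    """Returns one of: form, image, navigation, links, story."""
    stats = SegmentStats()
    stats.feed(html)
    stats.close()
    if stats.form_fields:
        return "form"
    if stats.images and stats.words == 0:
        return "image"
    if stats.links and stats.link_words >= LINK_TEXT_SHARE * stats.words:
        words_per_link = stats.link_words / stats.links
        return "navigation" if words_per_link <= NAV_MAX_WORDS_PER_LINK else "links"
    return "story"


print(classify_segment('<a href="/">Home</a> <a href="/n">News</a> <a href="/s">Sports</a>'))
print(classify_segment('<p>The council approved the budget on Monday after a long '
                       'debate. <a href="/b">Read more</a></p>'))
```

The rules are checked in a fixed order, so a segment containing both a form field and links is reported as a form. A fuller classifier would weigh the evidence instead of stopping at the first match.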

13
Content Extraction: Why do we need it?
  • Viewing any web site: any solution to web browsing
    has to be universal
  • High network access: any transformation has to be
    fast and on-the-fly
  • Network usage: network traffic should not increase
    because of these systems
  • (Continued >)

14
Content Extraction: Why do we need it? (Continued)
  • Easy configurability: any such system should be
    easily configurable
  • Rapid deployment: should be rapidly deployable
  • Non-intrusive design: should be possible to
    transform web sites without modifying the actual
    web site
  • Multiple views: system integrators should be able
    to create multiple views of the same site

15
Advantages of Content Extraction
  • Display size
  • Locating information
  • Important content can be on top
  • Multiple levels of abstraction can be created
  • The browsing can use a demand-driven model
  • Faster download
  • More efficient use of small display areas
  • Mapping of the importance of content from the
    original document
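
One way to picture the demand-driven model with multiple levels of abstraction: the device first receives only a prioritized list of labels, and the text of a segment is sent when the user selects it. The sketch below assumes segments and priorities have already been computed elsewhere. The names Segment and DemandDrivenPage are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    label: str
    text: str
    priority: int  # higher means more important (assumed to be computed elsewhere)


class DemandDrivenPage:
    def __init__(self, segments):
        # Important content goes on top.
        self.segments = sorted(segments, key=lambda s: s.priority, reverse=True)

    def overview(self):
        """Level 1: labels only, small enough for a handheld screen."""
        return ["%d. %s" % (i + 1, s.label) for i, s in enumerate(self.segments)]

    def expand(self, number, max_chars=None):
        """Level 2: the text of one segment, sent only when the user asks for it."""
        text = self.segments[number - 1].text
        return text if max_chars is None else text[:max_chars]


page = DemandDrivenPage([
    Segment("Navigation", "Home News Sports", priority=1),
    Segment("Top story", "The council approved the budget.", priority=5),
])
print(page.overview())
print(page.expand(1))
```

Because only the overview is sent up front, the first download stays small no matter how large the original page is.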

16
Supported Devices and Formats
  • PDAs (HTML 3.2)
  • Cell phones
      • USA/Europe
          • WAP
      • Japan
          • iMode (NTT DoCoMo)
          • J-Sky (J-Phone)
          • EZWeb (KDDI)

17
Conclusion
  • Content from web documents can be extracted based
    on:
      • HTML structure
      • Proximity analysis
      • Logical relationship analysis
      • Information retrieval techniques
  • Content can be used effectively to summarize web
    documents
  • Better option compared to handcrafting or
    transcoding
  • Produces a faster browsing experience