Title: Web Data Management Panel Presentation WITS
1 Web Data Management Panel Presentation WITS 97
- Joachim Hammer
- University of Florida
- December 14, 1997
2 World Wide Web
- Easy to use interface (pages and links)
- Ubiquitous
- Excellent storefront
- Lots of valuable information
- Irregular structure (semistructured)
- Highly dynamic
- Difficult searching/navigation
- Limited querying
- Human is query processor
3 Web Management Issues & Research
- Information browsing through the Web
- Effective, simple to program GUI
- MOBIE (Stanford University)
- Dealing with existing (static) Web data
- Can't query static Web pages
- TSIMMIS Data Extraction (Stanford University)
- Information discovery
- Too much quantity, too little quality
- WebMining (University of Florida)
4 MOBIE
- Formats and displays data objects as a web of hypertext documents
- Traverse hyperlinks to explore nested structure and contents
- Based on HTTP and HTML
- Provides simple, world-wide access to information servers
- New way of exploring databases
- Much like readers explore contents of a book
5 Raw Data Object
<collection, b1, a1, ...>
b1 <book, t, a>
t <title, Database and ...>
a <author, Jeff Ullman>
a1 <article, v, w, x>
v <title, Mediators in ...>
w <author-list, ...>
x ...
...
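A minimal sketch of how the objects on this slide could be held in memory, assuming each object is a (label, value) pair stored under its identifier and a value is either an atomic string or a list of child identifiers. The identifier `root` and the `resolve` helper are hypothetical; elided values ("...") are kept as they appear on the slide.

```python
# Hypothetical in-memory form of the slide's objects: id -> (label, value).
# A value is either an atomic string or a list of child object ids.
objects = {
    "root": ("collection", ["b1", "a1"]),  # "root" is an assumed id
    "b1": ("book", ["t", "a"]),
    "t": ("title", "Database and ..."),
    "a": ("author", "Jeff Ullman"),
    "a1": ("article", ["v", "w", "x"]),
    "v": ("title", "Mediators in ..."),
    "w": ("author-list", "..."),
}


def resolve(oid, store):
    """Expand an object id into a nested (label, value) tree.

    Ids missing from the store (such as the elided "x") are returned
    as unresolved placeholders rather than guessed at.
    """
    if oid not in store:
        return ("?", oid)
    label, value = store[oid]
    if isinstance(value, list):
        return (label, [resolve(child, store) for child in value])
    return (label, value)


tree = resolve("root", objects)
```

Keeping identifiers separate from values is what lets a browser or query processor follow references one level at a time instead of loading the whole structure.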
6 Formatted - Hyperlinked
collection
  book
    title Database and ...
    author Jeff Ullman
  article
    title Mediators in the ...
    author Gio Wiederhold
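The hyperlinked view can be approximated as one HTML page per complex object, with atomic children shown inline and complex children shown as links. This is only an illustrative sketch, not MOBIE's actual implementation; the store layout, the `render_page` helper, and the `<id>.html` naming scheme are all assumptions.

```python
from html import escape

# Hypothetical store: id -> (label, value); a value is text or a list of ids.
store = {
    "root": ("collection", ["b1"]),
    "b1": ("book", ["t", "a"]),
    "t": ("title", "Database and ..."),
    "a": ("author", "Jeff Ullman"),
}


def render_page(oid, store):
    """Return an HTML page for one object.

    Atomic children are shown inline; complex children become hyperlinks
    to their own page, so nesting is explored by following links.
    """
    label, value = store[oid]
    lines = [f"<h1>{escape(label)}</h1>", "<ul>"]
    children = value if isinstance(value, list) else []
    for cid in children:
        clabel, cvalue = store[cid]
        if isinstance(cvalue, list):
            # The "<id>.html" naming scheme is an assumption of this sketch.
            lines.append(f'<li><a href="{escape(cid)}.html">{escape(clabel)}</a></li>')
        else:
            lines.append(f"<li>{escape(clabel)}: {escape(cvalue)}</li>")
    lines.append("</ul>")
    return "\n".join(lines)


# One page per complex object.
pages = {oid: render_page(oid, store)
         for oid, (_, value) in store.items() if isinstance(value, list)}
```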
7 Data Extraction & Querying
- Configurable parser (Python)
- Declarative description of HTML source
- Location of data on page
- How to package data into result object
- Regular expression-like syntax
- Human intelligence rather than A.I.
- Returns data as OEM (Object Exchange Model) objects
- TSIMMIS interchange format
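The idea of a declarative, regular-expression-like specification can be sketched as follows. This is not the TSIMMIS specification syntax; the `SPEC` layout, the `extract` helper, and the sample page are hypothetical stand-ins that only show the two parts of a specification: where the data sits on the page, and how it is packaged into labeled result objects.

```python
import re

# Hypothetical HTML page; not taken from the presentation.
PAGE = """
<html><body>
<h2>Database Systems</h2><p>by <b>A. Author</b></p>
<h2>Query Processing</h2><p>by <b>B. Writer</b></p>
</body></html>
"""

# Declarative specification: where the data sits on the page (a pattern)
# and how to package it (a label for the object and for each field).
SPEC = {
    "label": "book",
    "pattern": r"<h2>(?P<title>.*?)</h2><p>by <b>(?P<author>.*?)</b></p>",
}


def extract(page, spec):
    """Apply the spec to a page and return (label, fields) objects."""
    results = []
    for match in re.finditer(spec["pattern"], page):
        fields = list(match.groupdict().items())
        results.append((spec["label"], fields))
    return results


books = extract(PAGE, SPEC)
```

Because the page layout is described as data rather than code, adapting to a changed page means editing the pattern, not rewriting a parser.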
8 Approach
- Extract data from Web page(s)
- On demand
- Periodic/on update
- Use wrapper/DBMS as query processor
[Figure: architecture diagram with boxes labeled World Wide Web, Extractor, Specification, and Wrapper or Persistent Storage, with Query/Result arrows]
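The two extraction modes can be contrasted in a small sketch. It assumes SQLite stands in for the DBMS and that `extract_rows` is a placeholder for a real extractor; the table name, the helper names, and the sample rows are hypothetical.

```python
import sqlite3


def extract_rows():
    """Stand-in for the extractor; returns hypothetical (title, author) rows."""
    return [("Database Systems", "A. Author"), ("Query Processing", "B. Writer")]


def refresh(conn):
    """Periodic/on-update path: re-extract and replace the stored copy."""
    with conn:  # commits on success, rolls back on error
        conn.execute("CREATE TABLE IF NOT EXISTS book (title TEXT, author TEXT)")
        conn.execute("DELETE FROM book")
        conn.executemany("INSERT INTO book VALUES (?, ?)", extract_rows())


def query_on_demand(author):
    """On-demand path: extract at query time and filter in the wrapper."""
    return [title for title, who in extract_rows() if who == author]


conn = sqlite3.connect(":memory:")
refresh(conn)

# With persistent storage, the DBMS acts as the query processor.
stored = [row[0] for row in conn.execute(
    "SELECT title FROM book WHERE author = ?", ("B. Writer",))]

# Without it, the wrapper answers the same query directly.
on_demand = query_on_demand("B. Writer")
```

On-demand extraction always sees the current page but pays the extraction cost per query; the stored copy answers quickly but is only as fresh as the last refresh.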
9 Evaluation
- Better than
- Writing programs
- YACC, PERL, etc.
- Want to do even better
- GUI tool to simplify the generation of the extractor specification
10 Information Discovery
- Improve existing search engines
- Efficient crawling techniques (reduce data shipping)
- Quality ranking of pages
- Apply data mining techniques to categorize web pages
- e.g., clustering algorithms, proximity
- Inferencing of knowledge
- Making connections among entities
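As one illustration of categorizing pages by proximity, the sketch below groups pages with a greedy single-pass clustering over Jaccard similarity of their word sets. This is a generic technique chosen for brevity, not the method used by WebMining; the threshold value and the sample page texts are assumptions.

```python
def jaccard(a, b):
    """Proximity of two word sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(pages, threshold=0.3):
    """Greedy single-pass clustering.

    Each page joins the first cluster whose first member (its seed) is at
    least `threshold` similar; otherwise it starts a new cluster.
    """
    clusters = []  # list of lists of page names
    words = {name: set(text.lower().split()) for name, text in pages.items()}
    for name in pages:
        for members in clusters:
            if jaccard(words[name], words[members[0]]) >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Hypothetical page texts, for illustration only.
pages = {
    "p1": "relational database query optimization",
    "p2": "database query processing and optimization",
    "p3": "hiking trails in north florida",
}
groups = cluster(pages)
```

Here `p1` and `p2` share three of six distinct words (similarity 0.5) and land in one cluster, while `p3` shares none and starts its own.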
11 Goals of Web Management
- Putting vast amounts of previously unavailable data on the Web
- Digital libraries, long distance learning, research, etc.
- Data managed by DBMS
- Leverage existing DBMS technology
- New tools for managing semistructured data
- Support of electronic storefronts
- Dynamic creation of customizable Web pages