Querying the WWW - PowerPoint PPT Presentation

About This Presentation
Title:

Querying the WWW

Slides: 18
Provided by: rsha
Learn more at: https://dsf.berkeley.edu

Transcript and Presenter's Notes



1
Querying the WWW
Alberto O. Mendelzon, George A. Mihaila, Tova Milo
2
Scenarios...
  • Find information about PCs from IBM
    • query: IBM personal computer price
    • can we restrict the search to www.ibm.com?
  • Find a good music store
    • should I ask Yahoo or HotBot or Lycos or ...?
  • Find pages about databases within 2 links of
    Joe's webpage
  • Find recent web pages with the title "Bob's Music
    Store"

3
Problems
  • Queries don't exploit the structure of the data
  • Queries don't exploit the link topology of the data
  • Source selection is hard
    • different search engines have different
      functionality and idiosyncratic behaviour
    • different search engines are good at different tasks

4
Outline
  • Motivation
  • WebSQL
  • Nuts and Bolts
  • Query Locality
  • Good, Bad and Ugly

5
WebSQL
  • Integrate structure/topology constraints with
    textual retrieval
  • Virtual graph model of document network
  • Need to combine navigation and querying
  • Query language that uses document structure
    and can accept constraints on link topology

6
Data Model
  • Relational
  • Each web object is a tuple in the Document relation
    • url, title, text, type, length, modification
      info
  • Hyperlinks are tuples in the Anchor relation
    • base, href, label
    • interior links ( ): within the same document
    • local links ( ): within the same server
    • global links ( ): across servers
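
The two relations and the three link kinds can be sketched in Python. This is only an illustration, not the WebSQL implementation: the field name modif, the function link_kind, and the example.* URLs are assumptions made for this sketch, and "same server" is approximated by comparing host names.

```python
from dataclasses import dataclass
from urllib.parse import urldefrag, urlsplit


@dataclass(frozen=True)
class Document:
    """One tuple of the Document relation."""
    url: str
    title: str
    text: str
    type: str
    length: int
    modif: str  # modification info; the field name is a placeholder


@dataclass(frozen=True)
class Anchor:
    """One tuple of the Anchor relation (a hyperlink)."""
    base: str   # URL of the document that contains the link
    href: str   # URL the link points to
    label: str  # anchor text


def link_kind(anchor: Anchor) -> str:
    """Classify a link as 'interior', 'local', or 'global' (assumed rule)."""
    base_doc, _ = urldefrag(anchor.base)
    href_doc, _ = urldefrag(anchor.href)
    if base_doc == href_doc:
        return "interior"  # same document, only the fragment may differ
    if urlsplit(base_doc).netloc.lower() == urlsplit(href_doc).netloc.lower():
        return "local"     # different document on the same server
    return "global"        # document on another server


if __name__ == "__main__":
    a = Anchor("http://a.example/x.html", "http://b.example/", "elsewhere")
    print(link_kind(a))
```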

7
Examples
  • SELECT x.url, x.title, y.url, y.title
    FROM Document x SUCH THAT
    x MENTIONS "Computer Science",
    Document y SUCH THAT x y
    -- docs within 2 links of something on CS.
  • SELECT d.url, d.title
    FROM Document d SUCH THAT
    "http://www.cs.toronto.edu" d
    WHERE d.title CONTAINS "database"
    -- docs within 2 links of the CS homepage.
  • MENTIONS is answered by a search engine; CONTAINS is
    checked locally
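
A query of the "within 2 links" kind can be read as a bounded breadth-first traversal followed by a local title check. The Python sketch below runs over a tiny in-memory link table; LINKS, TITLES, and the cs.example URLs are hypothetical stand-ins for the real web, and the sketch ignores the distinction between link kinds.

```python
from collections import deque

# Hypothetical toy web: source URL -> URLs it links to.
LINKS = {
    "http://cs.example/": ["http://cs.example/db", "http://cs.example/ai"],
    "http://cs.example/db": ["http://cs.example/db/papers"],
    "http://cs.example/db/papers": ["http://cs.example/db/papers/1997"],
}
TITLES = {
    "http://cs.example/": "CS home",
    "http://cs.example/db": "Database group",
    "http://cs.example/ai": "AI group",
    "http://cs.example/db/papers": "Database papers",
    "http://cs.example/db/papers/1997": "Database papers 1997",
}


def within_links(start, links, max_hops):
    """Return {url: hops} for every URL reachable in at most max_hops links."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if dist[url] == max_hops:
            continue  # do not expand beyond the hop limit
        for target in links.get(url, ()):
            if target not in dist:
                dist[target] = dist[url] + 1
                queue.append(target)
    return dist


def titles_within(start, word, max_hops=2):
    """URLs within max_hops of start whose title contains word."""
    reach = within_links(start, LINKS, max_hops)
    return sorted(u for u in reach if word.lower() in TITLES.get(u, "").lower())


if __name__ == "__main__":
    print(titles_within("http://cs.example/", "database"))
```

The hop limit is what keeps the traversal from wandering over the whole graph: the page three links away is never fetched.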

8
More examples
  • from Toronto
  • Job opportunities for software engineers:
    SELECT e.url
    FROM Document d SUCH THAT
    d MENTIONS "Career Opportunities",
    Document e SUCH THAT d - e
    WHERE e.text CONTAINS "Software Engineer"
  • this query is useful, but ...

9
Outline
  • Motivation
  • WebSQL
  • Nuts and Bolts
  • Query Locality
  • Good, Bad and Ugly

10
Nuts and bolts
  • SELECT Fields(x1, x2, ..., xn)
    FROM Obj x1 SUCH THAT A1, Obj x2 SUCH THAT A2, ...
    WHERE Condition(x1, x2, ..., xn)
  • nested-loops join algorithm:
    for all x1 such that A1 is true
      for all x2 such that A2 is true
        ...
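
The nested-loops evaluation can be sketched as a small recursive evaluator in Python. Everything here is illustrative: evaluate, the generator functions, and the toy tables MENTIONS, LINKS, and TEXT are assumptions of the sketch, and the path from d to e is taken to be a single link.

```python
def evaluate(select, generators, where):
    """Nested-loops evaluation of SELECT ... FROM ... SUCH THAT ... WHERE ...

    generators is a list of (name, candidates) pairs; candidates(bindings)
    returns the values for that variable given the variables bound so far,
    standing in for "all xi such that Ai is true".
    """
    results = []

    def loop(i, bindings):
        if i == len(generators):
            # All variables are bound: apply the WHERE condition.
            if where(bindings):
                results.append(select(bindings))
            return
        name, candidates = generators[i]
        for value in candidates(bindings):
            loop(i + 1, {**bindings, name: value})

    loop(0, {})
    return results


# Hypothetical data standing in for a search-engine index and the web.
MENTIONS = {"Career Opportunities": ["http://co.example/jobs"]}
LINKS = {"http://co.example/jobs": ["http://co.example/jobs/se",
                                    "http://co.example/jobs/sales"]}
TEXT = {"http://co.example/jobs/se": "Software Engineer wanted",
        "http://co.example/jobs/sales": "Sales associate wanted"}

rows = evaluate(
    select=lambda b: b["e"],
    generators=[
        ("d", lambda b: MENTIONS["Career Opportunities"]),  # node predicate
        ("e", lambda b: LINKS.get(b["d"], [])),             # path from d
    ],
    where=lambda b: "Software Engineer" in TEXT.get(b["e"], ""),
)

if __name__ == "__main__":
    print(rows)
```

The order of the generators matters: the candidates for e can only be enumerated once d is bound, which is why the inner loop receives the current bindings.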

11
  • each atomic condition A1 ... Am has one of two forms:
  • Path(from_node, path_expression, to_node)
    • x5 (-) x7
    • enumerate links to check these
  • NodePredicate(node)
    • CONTAINS "Bob's Coffee Place" (x5)
    • query a customizable set of known search
      engines
  • what queries are computable?
    • those that don't have to explore the entire web
  • safe queries: every variable must be
    • either directly solvable in some atomic
      condition, OR
    • directly derivable from another variable in some
      atomic condition
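
One possible reading of the safety rule is a fixpoint computation: start from the directly solvable variables and keep adding any variable derivable from one that is already bound. The Python sketch below encodes that reading. It is an interpretation of the slide, not the formal definition, and is_safe and its argument shapes are hypothetical.

```python
def is_safe(variables, solvable, derivations):
    """Check the safety rule under the fixpoint reading described above.

    variables   -- all variable names used in the query
    solvable    -- variables directly solvable in some atomic condition
    derivations -- (source, target) pairs: target is directly derivable
                   from source in some atomic condition
    """
    bound = set(solvable) & set(variables)
    changed = True
    while changed:
        changed = False
        for source, target in derivations:
            if source in bound and target not in bound:
                bound.add(target)
                changed = True
    # Safe only if every variable ended up bound.
    return set(variables) <= bound


if __name__ == "__main__":
    # d is found by a search engine, e is reached by a path from d.
    print(is_safe({"d", "e"}, {"d"}, [("d", "e")]))
    # Nothing anchors d, so the whole web would have to be explored.
    print(is_safe({"d", "e"}, set(), [("d", "e")]))
```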

12
Query Locality
  • distinguish between access to local and remote
    documents
  • model the communication cost of a query based on:
    • expected number of results from search engines
    • expected size of documents
    • expected number of exterior, interior, remote
      links per document
    • expected cost of network access
  • can identify potentially expensive components of
    a query and warn the user
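
A back-of-the-envelope version of such a cost model can be sketched in Python. The formula, the WebStats fields, and all the numbers are assumptions made for illustration. In particular, the sketch simply charges one rate for documents reached over local links and a higher rate for everything else, which is cruder than the model the slide refers to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WebStats:
    """Hypothetical expected values used by the estimate."""
    hits_per_search: float     # expected results from a search engine
    doc_size_kb: float         # expected document size
    local_links: float         # expected local links per document
    global_links: float        # expected global links per document
    local_cost_per_kb: float   # assumed cost of a cheap (local) fetch
    remote_cost_per_kb: float  # assumed cost of a remote fetch


def estimate_cost(stats, hops, follow_global, warn_above):
    """Return (expected cost, warn) for a search followed by `hops` link steps."""
    docs = stats.hits_per_search  # documents at the current level
    cost = docs * stats.doc_size_kb * stats.remote_cost_per_kb
    for _ in range(hops):
        local_docs = docs * stats.local_links
        global_docs = docs * stats.global_links if follow_global else 0.0
        cost += local_docs * stats.doc_size_kb * stats.local_cost_per_kb
        cost += global_docs * stats.doc_size_kb * stats.remote_cost_per_kb
        docs = local_docs + global_docs
    return cost, cost > warn_above


if __name__ == "__main__":
    stats = WebStats(10, 2.0, 3, 1, 1.0, 5.0)
    print(estimate_cost(stats, hops=1, follow_global=False, warn_above=500))
    print(estimate_cost(stats, hops=2, follow_global=True, warn_above=500))
```

Even with these made-up numbers the shape is visible: each extra hop multiplies the number of fetched documents by the fan-out, and allowing global links is what pushes the estimate past the warning threshold.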

13
The Good
  • Idea of using structure in answering queries
  • topologies can be useful, with a better
    interface...
  • can be used for link maintenance

14
The Bad
  • Too complicated (especially syntax)
  • easy to write queries that explore the entire
    web.
  • does the end user care about topology
    constraints, besides domain constraints?
  • Remote accesses cause a huge slowdown
  • check topology constraints at search engine?
  • availability

15
The Ugly
  • How to avoid back links?
  • Fuzzy queries
  • find me good, inexpensive Chilean restaurants
    that are close by

16
Issues
  • What kinds of path based queries are useful,
    intuitive?
  • How to check the path constraints at the search
    engine?
  • Can hypertext links be viewed as yet another kind
    of link in a semi-structured model?

17
Other Work
  • Other, generic intra-document structure can be
    useful
  • Topology, structure can be used by system
    (instead of by end user)
  • use links to determine quality of site content
  • authority sites -- find www.harvard.edu for query
    on harvard
  • classification -- Cha-Cha
  • Store links at search engine for proximity
    searches
  • can generalize to arbitrary links in a directed
    graph model (Goldman et al. 98)
  • get "see also" info