Structural Web Search Using a Graph-Based Discovery System

1
Structural Web Search Using a Graph-Based
Discovery System
  • Nitish Manocha, Diane J. Cook, and Lawrence B.
    Holder
  • University of Texas at Arlington
  • cook_at_cse.uta.edu
  • http://www-cse.uta.edu/cook

2
Structured Web Search
  • Existing search engines use linear feature match
  • Web contains structural information as well
  • Hyperlink information
  • Web viewed as a graph (Kleinberg)
  • Subdue searches based on structure
  • Use as foundation of a structural search engine
  • Incorporation of WordNet allows for synonym match

3
SUBDUE
  • Discovers structural patterns in input graphs
  • A substructure is a connected subgraph
  • An instance of a substructure is a subgraph that
    is isomorphic to the substructure definition
  • Pattern discovery, classification, clustering
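
The notion of an instance can be illustrated with a small backtracking matcher. This is only a sketch under stated assumptions, not Subdue's own code: the function name `find_instances`, the dictionary/triple graph encoding, and the example graph (which borrows the object/shape/on labels from this slide's figure) are all hypothetical.

```python
def find_instances(graph_v, graph_e, pat_v, pat_e):
    """Return every mapping pattern-vertex -> graph-vertex that preserves
    vertex labels and directed, labelled edges (exact match only)."""
    graph_edges = set(graph_e)      # (src, dst, label) triples
    order = list(pat_v)             # pattern vertices in a fixed order
    results = []

    def consistent(mapping):
        # Every pattern edge with both endpoints mapped must exist in the graph.
        return all((mapping[s], mapping[d], lab) in graph_edges
                   for s, d, lab in pat_e if s in mapping and d in mapping)

    def extend(mapping):
        if len(mapping) == len(order):
            results.append(dict(mapping))
            return
        p = order[len(mapping)]
        for g, label in graph_v.items():
            if label == pat_v[p] and g not in mapping.values():
                mapping[p] = g
                if consistent(mapping):
                    extend(mapping)
                del mapping[p]      # backtrack

    extend({})
    return results

# Hypothetical input: two copies of "an object shaped like a triangle sits on
# an object shaped like a square".
graph_v = {1: "object", 2: "object", 3: "triangle", 4: "square",
           5: "object", 6: "object", 7: "triangle", 8: "square"}
graph_e = [(1, 3, "shape"), (2, 4, "shape"), (1, 2, "on"),
           (5, 7, "shape"), (6, 8, "shape"), (5, 6, "on")]
pat_v = {"a": "object", "b": "object", "t": "triangle", "s": "square"}
pat_e = [("a", "t", "shape"), ("b", "s", "shape"), ("a", "b", "on")]

instances = find_instances(graph_v, graph_e, pat_v, pat_e)
print(len(instances))
```

The matcher is exponential in the worst case, which is acceptable for small query graphs but not a statement about how Subdue itself scales.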

[Figure: three panels titled "Input Database", "Substructure S1 (graph form)", and "Compressed Database"; labels include object, shape, triangle, square, on, C1, R1, and S1.]

4
Subdue Algorithm
  • Start with individual vertices
  • Keep only best substructures on queue
  • Expand substructure by adding edge/vertex
  • Compress graph and repeat to generate
    hierarchical description
  • Optional use of background knowledge
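
The loop described above might be sketched roughly as follows. Several simplifications are assumptions of this sketch rather than features of Subdue: instances are grouped by a sorted list of (source label, edge label, target label) triples instead of a true isomorphism test, the score is simply instances times edges as a stand-in for whatever compression measure the system uses, and the compression and background-knowledge steps are omitted. The name `discover` and its parameters are hypothetical.

```python
from collections import defaultdict

def discover(vertices, edges, beam_width=3, max_edges=3):
    """vertices: {id: label}; edges: list of (src, dst, label).
    Returns (score, signature, instance_count) of the best pattern found."""
    def signature(edge_ids):
        # Approximate identity of a substructure by its labelled edges.
        return tuple(sorted((vertices[edges[e][0]], edges[e][2], vertices[edges[e][1]])
                            for e in edge_ids))

    # Start with individual vertices: an instance is (vertex set, edge set).
    frontier = {(frozenset([v]), frozenset()) for v in vertices}
    best = None
    for _ in range(max_edges):
        groups = defaultdict(set)
        # Expand every instance by one adjacent edge (and its other endpoint).
        for vs, es in frontier:
            for i, (src, dst, _label) in enumerate(edges):
                if i not in es and (src in vs or dst in vs):
                    groups[signature(es | {i})].add((vs | {src, dst}, es | {i}))
        if not groups:
            break
        # Keep only the best substructures on the queue (beam search).
        ranked = sorted(groups.items(),
                        key=lambda item: (-len(item[1]) * len(item[0]), item[0]))[:beam_width]
        top_sig, top_instances = ranked[0]
        score = len(top_instances) * len(top_sig)
        if best is None or score > best[0]:
            best = (score, top_sig, len(top_instances))
        frontier = {inst for _sig, insts in ranked for inst in insts}
    return best

# Hypothetical input: two "object with triangle shape" fragments, one extra "on" edge.
vertices = {1: "object", 2: "triangle", 3: "object", 4: "triangle", 5: "object"}
edges = [(1, 2, "shape"), (3, 4, "shape"), (3, 5, "on")]
best = discover(vertices, edges)
print(best)
```

A hierarchical description would come from replacing the instances of the winning pattern with a single new vertex and running the loop again on the compressed graph.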

5
Inexact Graph Match
  • Some variations may occur between instances
  • Want to abstract over minor differences
  • Difference: cost of transforming one graph to
    make it isomorphic to another
  • Match if cost/size < threshold
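
The acceptance test above can be written out as a short check. This sketch makes two assumptions that the slide does not state: size is taken as the number of vertices plus edges of the first graph, and the cost is computed for a fixed vertex correspondence (shared ids) with one unit per changed vertex label or edge. A real inexact matcher would search for the cheapest correspondence instead. Function names are hypothetical.

```python
def transformation_cost(v1, e1, v2, e2):
    """Crude edit cost between two graphs whose vertices share ids.
    Assumption: one unit per vertex relabel/insert/delete and per edge insert/delete."""
    cost = 0
    for vid in set(v1) | set(v2):
        if v1.get(vid) != v2.get(vid):
            cost += 1
    cost += len(set(e1) ^ set(e2))   # edges present in only one of the graphs
    return cost

def inexact_match(v1, e1, v2, e2, threshold):
    size = len(v1) + len(e1)         # assumption: size = vertices + edges
    return transformation_cost(v1, e1, v2, e2) / size < threshold

# Hypothetical example: the two graphs differ in one vertex label.
query_v = {1: "page", 2: "page", 3: "Subdue"}
query_e = [(1, 2, "hyperlink"), (2, 3, "word")]
other_v = {1: "page", 2: "page", 3: "discovery"}
other_e = [(1, 2, "hyperlink"), (2, 3, "word")]

cost = transformation_cost(query_v, query_e, other_v, other_e)
print(cost, inexact_match(query_v, query_e, other_v, other_e, threshold=0.25))
```

Because the comparison is strict, a graph whose normalized cost equals the threshold does not count as a match.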

6
Application Domains
  • Protein data
  • Human Genome DNA data
  • Spatial-temporal domains
  • Earthquake data
  • Aircraft Safety and Reporting System
  • Telecommunications data
  • Program source code
  • Web data

7
Represent Web as Graph
  • Breadth-first search of domain to generate graph
  • Nodes represent pages / documents
  • Edges represent hyperlinks
  • Additional nodes represent document keywords
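
A minimal sketch of this construction, assuming the site is already available as an in-memory dictionary rather than fetched over the network. The `site` structure, the function name `build_web_graph`, and the choice to share one vertex per distinct keyword are assumptions of this example, not details given on the slide.

```python
from collections import deque

def build_web_graph(site, start, stopwords=frozenset()):
    """site: hypothetical map url -> (list of words, list of linked urls).
    Returns (vertices, edges, page_id) with labelled vertices and edges."""
    vertices, edges = {}, []
    page_id, word_id = {}, {}

    def new_vertex(label):
        vid = len(vertices) + 1
        vertices[vid] = label
        return vid

    page_id[start] = new_vertex("page")
    queue = deque([start])
    while queue:                                 # breadth-first traversal
        url = queue.popleft()
        words, links = site[url]
        for w in dict.fromkeys(words):           # each keyword once per page
            if w in stopwords:
                continue
            if w not in word_id:
                word_id[w] = new_vertex(w)
            edges.append((page_id[url], word_id[w], "word"))
        for target in links:
            if target not in site:               # stay inside the domain
                continue
            if target not in page_id:
                page_id[target] = new_vertex("page")
                queue.append(target)
            edges.append((page_id[url], page_id[target], "hyperlink"))
    return vertices, edges, page_id

# Hypothetical two-page site.
site = {
    "/": (["subdue", "the"], ["/projects"]),
    "/projects": (["subdue"], ["/", "http://elsewhere.example"]),
}
vertices, edges, page_id = build_web_graph(site, "/", stopwords={"the"})
print(vertices)
print(edges)
```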

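[Figure: example web graph of page vertices joined by hyperlink edges, with word edges to keyword vertices such as subdue, texas, projects, university, work, parallel, group, learning, robotics, and planning.]
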
8
WebSubdue's Structural Search
  • Formulate query as graph
  • Use Subdue's predefined substructure option to
    search for instances of the query

9
Query: Find all pages which link to a page
containing the term Subdue
  • Subgraph vertices
  • 1 page
  • URL http://cygnus.uta.edu
  • 7 page
  • URL http://cygnus.uta.edu/projects.html
  • Subdue
  • 1->7 hyperlink
  • 7->8 word
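
The same slide also writes this query as a list of `v` and `d` lines. A small helper that emits lines of that shape is sketched below. The layout `v <id> <label>` and `d <vertex 1> <vertex 2> <label>` is read off the slide; the function name `to_subdue_text` is hypothetical, and the sketch does not claim to cover the full input format.

```python
def to_subdue_text(vertices, edges):
    """Emit one 'v' line per vertex and one 'd' line per edge."""
    lines = [f"v {vid} {label}" for vid, label in sorted(vertices.items())]
    lines += [f"d {src} {dst} {label}" for src, dst, label in edges]
    return "\n".join(lines)

# The query from this slide: a page that links to a page containing "Subdue".
query_vertices = {1: "page", 2: "page", 3: "Subdue"}
query_edges = [(1, 2, "hyperlink"), (2, 3, "word")]

text = to_subdue_text(query_vertices, query_edges)
print(text)
```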

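[Figure: query graph in which a page vertex has a hyperlink edge to a second page vertex, which has a word edge to a Subdue vertex.]
/* Vertex ID Label */
s
v 1 page
v 2 page
v 3 Subdue
/* Edge Vertex 1 Vertex 2 Label */
d 1 2 hyperlink
d 2 3 word
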
10
Search for Presentation Pages
[Figure: query graph of four page vertices connected by five hyperlink edges.]
  • WebSubdue
  • 22 instances
  • AltaVista
  • Query: host:www-cse.uta.edu AND
    image:next_motif.gif AND image:up_motif.gif AND
    image:previous_motif.gif
  • 12 instances

11
Search for Reference Pages
[Figure: query graph of four page vertices and three hyperlink edges.]
  • Search for pages with at least 35 in-links
  • WebSubdue found 5 pages in www-cse
  • AltaVista cannot perform this type of search
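
For intuition, the same question can be answered on a labelled graph by counting incoming hyperlink edges directly. This is a shortcut swapped in for illustration and not the graph-pattern search that WebSubdue performs; the function name, the graph encoding, and the tiny example (with a threshold of 2 rather than 35) are assumptions.

```python
from collections import Counter

def pages_with_min_in_links(vertices, edges, min_in_links):
    """Return ids of page vertices linked to by at least min_in_links distinct pages."""
    links = {(src, dst) for src, dst, label in edges if label == "hyperlink"}
    in_links = Counter(dst for _src, dst in links)
    return sorted(v for v, label in vertices.items()
                  if label == "page" and in_links[v] >= min_in_links)

# Hypothetical graph: page 4 is referenced by pages 1, 2, and 3.
vertices = {1: "page", 2: "page", 3: "page", 4: "page", 5: "robotics"}
edges = [(1, 4, "hyperlink"), (2, 4, "hyperlink"), (3, 4, "hyperlink"),
         (4, 1, "hyperlink"), (4, 5, "word")]

reference_pages = pages_with_min_in_links(vertices, edges, min_in_links=2)
print(reference_pages)
```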

12
Inclusion of WordNet
  • When generating graph
  • Use common stopword list
  • When searching for subgraph instances
  • Morphology functions (e.g. October/Oct, teaching/teach)
  • Synsets
  • Optional allowance of synonyms
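
The stopword and morphology steps can be pictured with hand-made tables. The sketch below does not call WordNet: `STOPWORDS` and `BASE_FORMS` are tiny hypothetical stand-ins that only cover the two examples from this slide, and the function names are invented for illustration.

```python
STOPWORDS = {"the", "a", "of", "in", "and"}            # tiny illustrative list
BASE_FORMS = {"oct": "october", "teaching": "teach"}   # stand-in for morphology functions

def normalize(word):
    """Lower-case a word and map it to a base form when one is known."""
    w = word.lower()
    return BASE_FORMS.get(w, w)

def keywords(text):
    """Keywords that would become word vertices: stopwords dropped, rest normalized."""
    return [normalize(w) for w in text.split() if w.lower() not in STOPWORDS]

def same_word(a, b):
    """Two labels match when their normalized forms agree."""
    return normalize(a) == normalize(b)

print(keywords("Teaching in the Oct term"))
print(same_word("October", "Oct"))
```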

13
Search for pages on jobs in computer science
  • Inexact match: allow one level of synonyms
  • WebSubdue found 33 matches
  • Words include employment, work, job, problem,
    task
  • AltaVista found 2 matches
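
"One level of synonyms" can be read as accepting direct synonyms of the query word but not synonyms of those synonyms. The sketch below illustrates that reading with a hand-made table built from the words listed on this slide; the table, the extra "study" entry, and the function name are hypothetical, and the query word is assumed to be already reduced to its base form.

```python
# Hand-made stand-in for synsets; not taken from WordNet itself.
SYNONYMS = {
    "job": {"employment", "work", "task", "problem"},
    "work": {"study"},   # hypothetical second-level entry
}

def labels_match(query_word, page_word, allow_synonyms=True):
    """Exact match, or a direct (one-level) synonym in either direction."""
    q, p = query_word.lower(), page_word.lower()
    if q == p:
        return True
    if not allow_synonyms:
        return False
    return p in SYNONYMS.get(q, set()) or q in SYNONYMS.get(p, set())

print(labels_match("job", "employment"))   # direct synonym
print(labels_match("job", "study"))        # two levels away
```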

[Figure: query graph of a page vertex with word edges to jobs, computer, and science vertices.]

14
Search for hub and authority pages
  • WebSubdue found 3 hub (and 3 authority) pages
  • AltaVista cannot perform this type of search
  • Inexact match applied with threshold 0.2 (4.2
    transformations allowed)
  • WebSubdue found 13 matches

15
Subdue Learning from Web Data
  • Distinguish professors' and students' web pages
  • Learned concept (professors have box in address
    field)
  • Distinguish online stores' and professors' web
    pages
  • Learned concept (stores have more levels in graph)

[Figure: graph of seven page vertices.]

16
Conclusions
  • WebSubdue can be used to search for structural
    web data
  • Could be enhanced with additional WordNet
    features such as synset path length
  • Efficient structural search necessary for future
    of web search tools

17
To Learn More
cygnus.uta.edu/subdue
cook_at_cse.uta.edu http://www-cse.uta.edu/cook