A Vector Space Search Engine for Web Services - PowerPoint PPT Presentation


1
A Vector Space Search Engine for Web Services
  • PhD Seminar
  • Christian Platzer
  • March 23, 2005

2
Outline
  • Motivation and Introduction
  • The Vector Space Model for Information Retrieval
  • Distributed Vector Spaces and why they are so
    unpopular
  • Applying the concept to Web service descriptions
  • Searching in distributed UDDI registries
  • Conclusions and Future Work (Other Applications,
    possible Case studies)
  • Short Demonstration

3
Motivation and Introduction
  • Where the idea came from
  • Nedine: distributed search of natural language
    documents
  • Information retrieval: the VSM concept
  • VSM capabilities
  • Indexing of large document repositories
  • Fast search of documents (query- and
    document-based)
  • Where it can be used
  • General data processing
  • Indexing

4
The VSM Concept
  • The Term Space
  • General Idea
  • Keywords
  • Vectors
  • Dimensions
  • Advantage of low dimensionality

5
Adding new documents
  • Documents
  • d1: Product, Info
  • d2: Info
  • d1: get, Product, Info
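
A minimal sketch of how a term space might grow as documents are added. The function name `add_document` and the dictionary layout are assumptions for illustration, and the third document is labelled `d3` here only to keep the keys unique.

```python
def add_document(term_space, vectors, doc_id, terms):
    """Add a document; every unseen term opens a new dimension."""
    for term in terms:
        if term not in term_space:
            term_space[term] = len(term_space)  # index of the new dimension
    vec = [0] * len(term_space)
    for term in terms:
        vec[term_space[term]] += 1  # raw term frequency
    vectors[doc_id] = vec
    # Pad all stored vectors so they match the grown term space.
    for old in vectors.values():
        old.extend([0] * (len(term_space) - len(old)))


term_space, vectors = {}, {}
add_document(term_space, vectors, "d1", ["Product", "Info"])
add_document(term_space, vectors, "d2", ["Info"])
add_document(term_space, vectors, "d3", ["get", "Product", "Info"])  # hypothetical label
```

Adding the third document introduces the term "get", so the earlier vectors are padded with a zero in the new dimension.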

6
Weighting
  • Binary weights
  • Whenever a term occurs in document d, it is
    indexed as 1 in the term space
  • Ignores term frequency
  • Ignores document length
  • Perfect distribution capabilities
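
Binary weighting can be sketched as follows. The helper name `binary_vector` and the sample terms are assumptions. Because each weight depends only on the document itself, no collection-wide statistics are needed, which is presumably why this scheme distributes so easily.

```python
def binary_vector(terms, term_space):
    """Return 1 for every term of the term space that occurs in the document."""
    present = set(terms)
    return [1 if term in present else 0 for term in term_space]


term_space = ["Product", "Info", "get"]
# "Info" occurs three times, yet its weight is still 1.
vector = binary_vector(["Info", "Info", "Info", "get"], term_space)
```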

7
Weighting
  • tf x idf
  • Terms are weighted highly if they are
  • frequent in relevant documents, but
  • infrequent in the collection as a whole.
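
The slides do not state the exact formula, so the sketch below assumes one common variant: raw term frequency multiplied by log(N / n_k), where N is the number of documents and n_k the number of documents containing term k. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight each term by tf * log(N / n_k) (assumed variant)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = [["Product", "Info"], ["Info"], ["get", "Product", "Info"]]
weights = tf_idf(docs)
```

With these sample documents, "Info" occurs everywhere and receives weight 0, while the rare term "get" receives the highest weight.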

8
Weighting
  • tf x idf normalization
  • Standard tf x idf rates longer documents higher
    than shorter ones.
  • Normalization maps the values to the interval
    [0, 1]
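
The slides do not say which normalization is used. The sketch below assumes Euclidean (cosine) length normalization, which keeps non-negative weights within [0, 1] and removes the advantage of longer documents. The name `normalize` and the sample weights are assumptions.

```python
import math


def normalize(weights):
    """Divide every weight by the Euclidean length of the vector."""
    length = math.sqrt(sum(w * w for w in weights.values()))
    if length == 0:
        return dict(weights)  # nothing to scale in an all-zero vector
    return {term: w / length for term, w in weights.items()}


short_doc = normalize({"Product": 1.0, "Info": 1.0})
long_doc = normalize({"Product": 5.0, "Info": 5.0})  # same content, five times longer
```

After normalization both documents carry practically the same weights, so document length no longer influences the rating.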

9
Rating
  • Documents
  • Generate a vector for the document
  • Weight it according to the weighting scheme
  • Compare it to all vectors in the vector space
  • Return the documents with the highest similarity
    rating
  • Queries
  • Treated as short documents

10
Rating Algorithms
  • Cosine Measure
  • Computes the cosine of the multidimensional
    angle between two vectors
  • Fast computation
  • Small angles mean a high chance of two vectors
    being semantically related.
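
A minimal sketch of the cosine measure on sparse vectors stored as dictionaries; the representation and the function name `cosine` are assumptions.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


same_direction = cosine({"Info": 1.0}, {"Info": 3.0})
orthogonal = cosine({"Info": 1.0}, {"Product": 1.0})
```

Vectors pointing in the same direction score 1 regardless of their length, and vectors without shared terms score 0.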

11
Rating Algorithms
  • Additional Rating Algorithms
  • The Dice Coefficient
  • The Jaccard Coefficient
  • Overlap Coefficient
  • Asymmetric Coefficients

12
Dimensional reduction
  • Stop Word Lists
  • Stemming algorithms
  • Language dependent
  • Not applicable to all data repositories.
  • Phonetics (?)
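
A rough sketch of both reduction steps. The stop word list is a tiny illustrative sample, and `naive_stem` is a deliberately simple suffix stripper, not the Porter algorithm or whatever stemmer the presented system uses.

```python
STOP_WORDS = {"the", "a", "of", "for", "and", "is"}  # illustrative sample only
SUFFIXES = ("ing", "es", "s")  # checked in this order


def naive_stem(word):
    """Strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def reduce_terms(text):
    """Lowercase, drop stop words, and stem the remaining words."""
    return [naive_stem(w) for w in text.lower().split() if w not in STOP_WORDS]


terms = reduce_terms("Searching the services for products")
```

Both steps shrink the term space: stop words never become dimensions, and inflected forms collapse onto one stem. The suffix rules are English-specific, which illustrates the language dependence noted above.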

13
Distributed Vector Spaces
  • The Problem
  • Term spaces grow independently of each other
  • Term weights depend on term frequencies of other
    documents
  • Query vectors from other vector spaces cannot be
    mapped to the local term space.
  • The Effect
  • Search results are not related to each other
  • Every peer needs a complete index

14
Distributed Vector Spaces
  • The Solution
  • Weighting is done at runtime
  • The term space only keeps the raw term
    frequencies of all vectors
  • For distributed queries, term spaces are merged
    and treated as a large collection

15
Distributed Vector Spaces
  • An example
  • N = N1 + N2 = 6
  • n_k = n_k1 + n_k2 = ?
  • n_Product = 3
  • n_Info = 2
  • Collection C1: N1 = 3
  • Collection C2: N2 = 3

16
Indexing WSDL Files
  • Overview

[Figure: architecture overview. Components shown: User Query; Keyword extractor; Stemmer, Stop words, Weighting schema; Local Vectorspace; Remote Vectorspace; Merger P2P Handler; WSDL Repository; UDDI Registry]

17
Indexing WSDL Files
  • Keyword extraction
  • Data Types (Parameters)
  • Complex Types
  • Elements
  • Element Types
  • Messages (Methods)
  • Message Names
  • Parts
  • Endpoint Description (Service Description)
  • Service Name
  • Endpoint Address
  • Binding method (HTTP, SOAP)
  • Comments
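
A sketch of how such keywords could be pulled from a WSDL file with Python's standard `xml.etree.ElementTree`. The inline WSDL is a made-up miniature and `extract_keywords` is a hypothetical helper, not the extractor presented in the talk. Comments are not covered, since ElementTree discards them by default.

```python
import xml.etree.ElementTree as ET

WSDL = """<definitions name="DemoService"
    xmlns="http://schemas.xmlsoap.org/wsdl/"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/">
  <types>
    <xsd:schema>
      <xsd:complexType name="SearchResult">
        <xsd:all>
          <xsd:element name="searchQuery" type="xsd:string"/>
        </xsd:all>
      </xsd:complexType>
    </xsd:schema>
  </types>
  <message name="doSearch">
    <part name="query" type="xsd:string"/>
  </message>
  <service name="DemoSearchService">
    <port name="SearchPort" binding="SearchBinding">
      <soap:address location="http://example.org/search"/>
    </port>
  </service>
</definitions>"""

NS = {
    "wsdl": "http://schemas.xmlsoap.org/wsdl/",
    "xsd": "http://www.w3.org/2001/XMLSchema",
    "soap": "http://schemas.xmlsoap.org/wsdl/soap/",
}


def extract_keywords(wsdl_text):
    """Collect type, element, message, part and service names plus endpoints."""
    root = ET.fromstring(wsdl_text)
    keywords = []
    for path in (".//xsd:complexType", ".//xsd:element", ".//wsdl:message",
                 ".//wsdl:part", ".//wsdl:service"):
        keywords += [n.get("name") for n in root.findall(path, NS) if n.get("name")]
    keywords += [n.get("location") for n in root.findall(".//soap:address", NS)]
    return keywords


keywords = extract_keywords(WSDL)
```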

18
Indexing WSDL Files (Google sample)
    <!-- WSDL description of the Google Web APIs.
         The Google Web APIs are in beta release. All interfaces are subject to
         change as we refine and extend our APIs. Please see the terms of use
         for more information. -->
    <s:complexType name="GoogleSearchResult">
      <s:all>
        <s:element name="documentFiltering" type="s:boolean" />
        <s:element name="searchComments" type="s:string" />
        <s:element name="estimatedTotalResultsCount" type="s:int" />
        <s:element name="estimateIsExact" type="s:boolean" />
        <s:element name="resultElements" type="s0:ResultElementArray" />
        <s:element name="searchQuery" type="s:string" />
        <s:element name="startIndex" type="s:int" />
        <s:element name="endIndex" type="s:int" />
        <s:element name="searchTips" type="s:string" />
        <s:element name="directoryCategories" type="s0:DirectoryCategoryArray" />
        <s:element name="searchTime" type="s:double" />
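
Names such as estimatedTotalResultsCount bundle several words into one identifier. The slides do not say whether the indexer splits them, so the following is only an assumption about how such names could be turned into individual index terms; `split_identifier` is a hypothetical helper.

```python
import re


def split_identifier(name):
    """Split a camelCase or PascalCase identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [part.lower() for part in parts]


terms = split_identifier("estimatedTotalResultsCount")
type_terms = split_identifier("GoogleSearchResult")
```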

19
UDDI registries
  • UDDI
  • Contained information
  • WSDL to UDDI Mapping
  • Current search engines
  • Distribution
  • No main index required
  • High scalability
  • Easy to upgrade

20
Conclusion and future work
  • Other fields of application
  • Nedine
  • Anti-Spam
  • Future work
  • Implementing the UDDI Indexer
  • Validating the algorithm against natural language
    comparison resources
  • Performance evaluation for distributed
    environments.

21
Conclusion and future work
  • Thank you for your attention
  • Enjoy the demo