Semantic Search Facilitator: Concept and Current State of Development

1
Semantic Search Facilitator: Concept and Current
State of Development
InBCT Tekes PROJECT Chapter 3.1.3: "Industrial
Ontologies and Semantic Web" (year 2004)
2
Industrial Ontologies Group
  • Researchers
  • Vagan Terziyan
  • Oleksandr Kononenko
  • Andriy Zharko
  • Oleksiy Khriyenko
  • Olena Kaykova
  • Olga Klochko
  • Andriy Taranov
  • Contact
  • E-mail: vagan_at_it.jyu.fi
  • Phone: +358 14 260 4618
  • URL: http://www.cs.jyu.fi/ai/OntoGroup

3
Resources
Resources used from InBCT Project in 2004
  • 12 000  EURO salaries for 5 months

4
Semantic-based Enhancement of Information Retrieval
Motivation from the Industrial Ontologies Group:
since there is currently a lack of
annotated resources on the Web, which makes
metadata-based search useless, we should develop
an enhanced Web search tool based on Google and the
WordNet ontology and provide a semantic search user
interface
5
Semantic Web and Information Retrieval
  • Semantic Web promises many advantages and
    benefits, but
  • We are only in transition towards the Semantic
    Web
  • Resources are not yet annotated semantically
  • Not enough metadata is available on the Web for
    smarter search
  • Semantic search of non-semantic data?
  • Yes, why not? We need a Semantic Facilitator!

6
Semantic Facilitator Concept
  • What is it?
  • Search service that uses other services
  • Utilizes other search engines as Web services
    and
  • makes their performance better due to smart
    query generation algorithms
  • Supports search within heterogeneous resources
    (Web pages, Web databases, local file system,
    etc.)
  • Filters returned results based on user
    preferences
  • Intelligent semantic query-based tool that
    really understands what users want to find
  • What is it not?
  • Search engine, indexing tool, registry, etc.
  • Data storage, database browser, etc.

7
Web search - What's the Problem?
  • Searching the Web is not always convenient
  • Polysemy of words gives irrelevant results
  • Synonymy is not supported by search engines =>
    loss of relevant results
  • There is a need to capture semantics from the search
    query

8
Semantic Search Assistant: light version of
Semantic Facilitator
  • Semantic Search Assistant (SSA) is software
    that
  • helps the user obtain more relevant results while
    using a standard search engine (Google) by
    interacting with the WordNet ontology
  • finds possible contexts for words in the search query
  • can broaden or narrow the search query with other
    relevant words and phrases to improve results
  • works with non-annotated documents
  • is not restricted to any particular domain

9
Sense Determination
  • WordNet is an open source ontology, which
    contains information about different meanings of
    a term, synonyms, antonyms and other lexical and
    semantic relations
  • Given several words in a search query, we can
    determine the context (sense) in which each of them
    is used with the help of WordNet
  • by comparing the words' synsets
  • by comparing the words' textual descriptions and
    examples
  • by finding common roots going up the WordNet
    hierarchy tree for each word
  • by asking the user
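
One of the strategies listed above, finding common roots when going up the hierarchy, might be sketched as follows. This is only an illustration under stated assumptions, not the SSA implementation: the tiny `HYPERNYMS` table, the `SENSES` table and the sense labels are hypothetical and stand in for real WordNet data, and "closest common root wins" is one plausible scoring rule among several.

```python
from itertools import product

# Hypothetical miniature hypernym table (child -> parent); stands in for WordNet.
HYPERNYMS = {
    "driver/person": "operator",
    "operator": "person",
    "person": "entity",
    "driver/golf club": "golf equipment",
    "club/golf club": "golf equipment",
    "golf equipment": "artifact",
    "artifact": "entity",
    "driver/software": "program",
    "program": "abstraction",
    "abstraction": "entity",
    "club/association": "organization",
    "organization": "abstraction",
}

# Hypothetical sense inventory for two query words.
SENSES = {
    "driver": ["driver/person", "driver/golf club", "driver/software"],
    "club": ["club/golf club", "club/association"],
}


def path_to_root(node):
    """Return [node, parent, ..., root] by following hypernym links."""
    path = [node]
    while path[-1] in HYPERNYMS:
        path.append(HYPERNYMS[path[-1]])
    return path


def distance_via_common_root(a, b):
    """Hypernym steps from a and from b to their lowest common ancestor."""
    depth_b = {node: steps for steps, node in enumerate(path_to_root(b))}
    for steps_a, node in enumerate(path_to_root(a)):
        if node in depth_b:
            return steps_a + depth_b[node]
    return None  # the two senses share no root


def pick_senses(word_a, word_b):
    """Choose the sense pair whose common root is closest."""
    candidates = []
    for a, b in product(SENSES[word_a], SENSES[word_b]):
        d = distance_via_common_root(a, b)
        if d is not None:
            candidates.append((d, a, b))
    return min(candidates)[1:] if candidates else None


print(pick_senses("driver", "club"))
```

With this toy data the golf senses of both words meet at "golf equipment" after one step each, so that pair is selected; with real WordNet data ties and very general roots would need extra handling.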

10
How does it work?
  1. Gets keyword query
  2. Translates original query into series of queries
    to Google taking into account the semantics of
    keywords
  3. Combines returned results

11
Basic idea
SemanticSearchFacilitator is a semantic interface
for a search mechanism that combines and
integrates a set of existing search
engines. This interface provides additional
semantic enhancement and filtering for complex
information retrieval.
[Diagram: SemanticSearchFacilitator on top of Local search, DB search, Web search and other ("?") search]
12
Ontology Personalization
Ontology Personalization is a mechanism that
allows users to have their own conceptual view and
to use it for semantic querying of search
facilities.
13
Semantic Search Enhancement
Semantic Search Facilitator transforms
concept-based queries, at an intermediate stage
of query execution, into a set of simple
(keyword-based) queries to the search
services in use, which can be of various types (Internet
search engine, database, intranet, etc.). Such a
semantic facilitator knows how to
translate semantic queries into the query formats
of several different search services and can
perform the routine actions that most users carry out in
order to achieve better performance and get more
relevant results.
[Diagram: Query, Common ontology, Personal ontology]
14
Semantic Search Enhancement
Semantic Search Facilitator uses ontologically
defined (WordNet) knowledge about words and
built-in support for advanced Google search query
features in order to construct more efficient
queries from a formal textual description of the
information sought. Semantic Search Facilitator
hides the complexity of a particular search engine's
query language from users and performs the routine
actions that most users carry out in order to achieve
better performance and get more relevant results.
[Diagram: Query, Common (linguistic) ontology, Domain ontology]
15
Capturing Semantics from Search Phrases
Motivation, according to our Ukrainian
colleague Vadim Ermolayev: a Google query should
be transformed based on a domain ontology
22
Semantic Search Assistant
23
Semantic Search Assistant
24
Semantic Search Assistant
Sense Processor
  • Determines the sense of the words in the search query
    when more than one interpretation is possible
  • This determination is used to narrow the
    choice among the many possible sense combinations
  • The combination choice algorithm is based on
    analyzing the items that intersect for different
    words in the search query
  • Intersections can be found, for example, in the
    definitions of words, in their synsets, or in the
    definitions reached through some pointer type (e.g.
    an upper-level hypernym)
  • Is written in Python
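
The combination choice described on this slide could be sketched roughly as below: every combination of senses is scored by how many content words the sense definitions share, and the highest-scoring combination is kept. The `GLOSSES` table is hypothetical (the "driver" definitions are shortened from the WordNet example in this presentation, the "printer" ones are made up), and the overlap score is an assumed stand-in for whatever intersection measure the Sense Processor actually uses.

```python
from itertools import product

# Hypothetical sense definitions; a real implementation would read WordNet.
GLOSSES = {
    "driver": {
        "motorist": "the operator of a motor vehicle",
        "golf club": "a golf club with a near vertical face used for "
                     "hitting long shots from the tee",
        "software": "a program that determines how a computer will "
                    "communicate with a peripheral device",
    },
    "printer": {
        "person": "someone whose occupation is printing",
        "device": "a machine that prints; a peripheral device attached "
                  "to a computer",
    },
}

STOP_WORDS = {"a", "an", "the", "of", "that", "with", "is", "for",
              "how", "will", "to", "from", "whose"}


def content_words(text):
    """Lower-case the text, strip punctuation and drop stop words."""
    words = (w.strip(".,;()").lower() for w in text.split())
    return {w for w in words if w and w not in STOP_WORDS}


def best_combination(query_words):
    """Return (sense names, score) for the best-overlapping sense combination."""
    best, best_score = None, -1
    sense_lists = [list(GLOSSES[w].items()) for w in query_words]
    for combo in product(*sense_lists):
        bags = [content_words(gloss) for _, gloss in combo]
        # Score: number of shared content words, summed over all sense pairs.
        score = sum(len(bags[i] & bags[j])
                    for i in range(len(bags))
                    for j in range(i + 1, len(bags)))
        if score > best_score:
            best = tuple(name for name, _ in combo)
            best_score = score
    return best, best_score


print(best_combination(["driver", "printer"]))
```

Here only the "software" sense of "driver" and the "device" sense of "printer" share words ("computer", "peripheral", "device"), so that combination is chosen. When no definitions intersect, every combination scores zero, which corresponds to the case where the user has to be asked.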

25
Semantic Search Assistant
Google Search Query Generator: builds new
search queries based on the determined semantic sense
of the query and the user's measure of result relevance.
Search Preferences Processor: interprets the user's
preferences for the query.
26
Semantic Search Assistant
  • User Interface
  • Allows sending search queries to search engine,
    getting search results and customization of query
  • Customization
  • selecting measure of importance for each word in
    query
  • prioritizing certain sense for words in query
  • Irrelevant results are excluded by choosing words
    that give negative characteristics to the
    resource

27
Algorithm for the New Query Generation
[Diagram: each Word(i) branches into its senses Sense(i1) ... Sense(ij) ... Sense(ip); each Sense(ij) branches into its synonyms Syn(ij1) ... Syn(ijk) ... Syn(ijm_ij). Each sense carries a relevance scale from -1 to 1.]
Indices: i = 1..n, j = 1..p, k = 1..m_ij
R_ij - relevance of the word's sense
R_i - significance of the word in the query
N_ijk - number of the synonym's senses
n - number of words in the query
p - number of the word's senses
m_ij - number of the word's synonyms in a sense
28
Algorithm for the New Query Generation
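Synonym Quality
$$Q_{ijk} = \frac{\sum_{s=1}^{p} R_{is} \cdot \delta_{s}}{L \cdot N_{ijk}}, \qquad \delta_{s} = 1 \text{ if } Syn_{ijk} \text{ is a member of synset } s \text{ of Word}(i), \text{ otherwise } 0$$
L - number of the synsets of Word(i) which contain Syn_ijk
Dividing by L and N_ijk reduces the absolute value of the synonym quality.
If Q_ijk > 0, the synonym is used via OR in
the query; if Q_ijk < 0, it is used via AND NOT.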
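
The synonym quality rule can be illustrated with a short sketch. The formula on the slide did not survive extraction cleanly, so the code assumes one plausible reading: the sense relevances R_ij of all synsets that contain the synonym are summed, then divided by L (the number of those synsets) and by N_ijk (the number of senses of the synonym itself). The function names and the sample numbers are hypothetical.

```python
def synonym_quality(synonym, sense_relevance, synsets, synonym_sense_count):
    """Assumed reading of Q_ijk.

    sense_relevance     -- list of R_ij values, one per sense of the word
    synsets             -- list of sets of synonyms, one per sense of the word
    synonym_sense_count -- N_ijk, number of senses the synonym itself has
    """
    containing = [j for j, synset in enumerate(synsets) if synonym in synset]
    if not containing:
        return 0.0
    total = sum(sense_relevance[j] for j in containing)
    # Dividing by L and N_ijk reduces the absolute value of the quality.
    return total / (len(containing) * synonym_sense_count)


def operator_for(quality):
    """Positive quality -> OR, negative -> AND NOT, zero -> synonym not used."""
    if quality > 0:
        return "OR"
    if quality < 0:
        return "AND NOT"
    return None


# Hypothetical word with two senses: the first rated +1, the second -1.
relevance = [1.0, -1.0]
synsets = [{"booking"}, {"qualification"}]

q_good = synonym_quality("booking", relevance, synsets, synonym_sense_count=2)
q_bad = synonym_quality("qualification", relevance, synsets, synonym_sense_count=4)
print(q_good, operator_for(q_good))
print(q_bad, operator_for(q_bad))
```

Under this reading, a synonym that has many senses of its own gets a quality closer to zero, so it influences the query less whichever sign it carries.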
29
Algorithm for the New Query Generation
Algorithm 1
[Diagram: each Word(1) ... Word(i) ... Word(n) is expanded with its synonyms (Syn); within a word's group the synonyms are attached via OR (AND NOT); the groups are joined via AND to form the Query.]
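
A minimal sketch of the final assembly step follows: per-word groups of accepted synonyms are joined with OR, rejected synonyms are excluded with a leading minus, and the groups are simply placed side by side, which Google treats as AND. The output format imitates the generated queries shown in the examples of this presentation; the function name `build_query` and the input layout are assumptions made for the illustration.

```python
def build_query(groups):
    """Build a Google-style query string.

    groups -- one (include_terms, exclude_terms) pair per query word;
              include_terms are OR-ed, exclude_terms are negated.
    """
    parts = []
    for include, exclude in groups:
        if include:
            parts.append("(" + " OR ".join(f'"{t}"' for t in include) + ")")
        if exclude:
            parts.append("(" + " ".join(f'-"{t}"' for t in exclude) + ")")
    # Adjacent groups are implicitly AND-ed by the search engine.
    return " ".join(parts)


groups = [
    (["hotel"], []),
    (["booking", "reserve"], ["qualification"]),
    (["bureau", "agency"], ["means"]),
]
print(build_query(groups))
```

Which synonyms end up in the include or exclude list would be decided beforehand, for instance by the sign of the synonym quality.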
30
Algorithm for the New Query Generation
Algorithm 2
[Diagram: as in Algorithm 1, but the synonyms of each Word(i) are first filtered by comparing their quality Q against R_i (filtering based on the significance of the word); the remaining synonyms are attached via OR (AND NOT) and the groups are joined via AND to form the Query.]
31
Google API: Adaptation to search engine
32
We use Google because..
  • Developers write software that connects remotely
    to the Google Web APIs service and access
    Google's index of more than 4 billion web pages
  • Google Web APIs support the same search syntax as
    the Google.com site
  • Communication is performed via the Simple Object
    Access Protocol (SOAP), an XML-based mechanism
    for exchanging typed information

...but it could be virtually any existing
search engine
33
Requests
  • Search query and a set of parameters is submitted
    to the Google Web APIs service and a set of
    search results is returned
  • Cache requests submit a URL to the Google Web
    APIs service and receive the contents of the URL
    when Google's crawlers last visited the page (if
    available)
  • Spelling requests submit a query to the Google
    Web APIs service and receive a suggested spell
    correction for the query (if available)

34
Requests: Additional Commands
  • intitle: restricts your search to the titles of
    web pages
  • inurl: restricts your search to the URLs of web
    pages
  • intext: searches only body text
  • inanchor: searches for text in a page's link
    anchors
  • site: allows you to narrow your search by site or
    top-level domain
  • link: returns a list of pages linking to the
    specified URL
  • cache: finds a copy of the page that Google
    indexed from Google's cache

35
Requests: Additional Commands
  • daterange: limits your search to the particular
    date or range of dates that a page was indexed
  • filetype: searches the suffixes or filename
    extensions
  • related: finds pages which are related to the
    specified page
  • info: provides a page of links to more
    information about the specified URL
  • phonebook: looks up phone numbers
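
The commands above are written as `name:value` terms inside the query string. A small helper like the following could append them to a keyword query; the helper name `with_operators` is made up for this sketch, and it does not check whether a given combination of commands is meaningful to the search engine.

```python
# Command names taken from the two "Additional Commands" slides.
KNOWN_OPERATORS = {
    "intitle", "inurl", "intext", "inanchor", "site", "link", "cache",
    "daterange", "filetype", "related", "info", "phonebook",
}


def with_operators(keywords, **operators):
    """Append name:value command terms to a keyword query."""
    parts = [keywords] if keywords else []
    for name, value in operators.items():
        if name not in KNOWN_OPERATORS:
            raise ValueError(f"unknown search command: {name}")
        parts.append(f"{name}:{value}")
    return " ".join(parts)


print(with_operators("semantic web", site="jyu.fi", filetype="ppt"))
```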

36
Search Results Format: Search Response
  • A search response is returned after each search
    request
  • It contains
  • <documentFiltering> - A Boolean value indicating
    whether filtering was performed on the search
    results. This will be "true" only if (a) you
    requested filtering and (b) filtering actually
    occurred
  • <searchComments> - A text string intended for
    display to an end user. One of the most common
    messages found here is a note that "stop words"
    were removed from the search automatically. (This
    happens for very common words such as "and" and
    "as.")

37
Search Results Format: Search Response (2)
  • <estimatedTotalResultsCount> - The estimated
    total number of results that exist for the query
  • <estimateIsExact> - A Boolean value indicating
    that the estimate is actually the exact value
  • <resultElements> - An array of <resultElement>
    items. This corresponds to the actual list of
    search results
  • <searchQuery> - This is the value of <q> for the
    search request
  • <startIndex> - Indicates the index (1-based) of
    the first search result in <resultElements>

38
Search Results Format: Search Response (3)
  • <endIndex> - Indicates the index (1-based) of
    the last search result in <resultElements>
  • <searchTips> - A text string intended for
    display to the end user. It provides instructive
    suggestions on how to use Google
  • <directoryCategories> - An array of
    <directoryCategory> items. This corresponds to
    the ODP directory matches for this search
  • <searchTime> - Text, floating-point number
    indicating the total server time to return the
    search results, measured in seconds

39
Search Results Format: Result Element
  • <summary> - If the search result has a listing
    in the ODP directory, the ODP summary appears
    here as a text string
  • <URL> - The URL of the search result, returned
    as text, with an absolute URL path
  • <snippet> - A snippet which shows the query in
    context on the URL where it appears. This is
    formatted HTML and usually includes <B> tags
    within it. Note that the query term does not
    always appear in the snippet
  • <title> - The title of the search result,
    returned as HTML

40
Search Results Format: Result Element (2)
  • <cachedSize> - Text (Integer "k"). Indicates
    that a cached version of the <URL> is available;
    size is indicated in kilobytes
  • <relatedInformationPresent> - Boolean indicating
    that the "related" query term is supported for
    this URL
  • <hostName> - When filtering occurs, a maximum
    of two results from any given host is returned.
    When this occurs, the second resultElement that
    comes from that host contains the host name in
    this parameter
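
To show how the fields listed on these slides could be consumed, here is a sketch that parses a simplified response. The element names inside the document come from the slides, but the wrapper element `<searchResponse>`, the flat layout and the sample values are assumptions made for the example; the real SOAP envelope returned by the Google Web APIs service is structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified response; not the actual SOAP message format.
SAMPLE_RESPONSE = """
<searchResponse>
  <documentFiltering>false</documentFiltering>
  <estimatedTotalResultsCount>2</estimatedTotalResultsCount>
  <estimateIsExact>true</estimateIsExact>
  <startIndex>1</startIndex>
  <endIndex>2</endIndex>
  <resultElements>
    <resultElement>
      <URL>http://example.org/hotels</URL>
      <title>Hotel &lt;b&gt;booking&lt;/b&gt;</title>
      <snippet>Find a &lt;b&gt;hotel&lt;/b&gt; quickly</snippet>
    </resultElement>
    <resultElement>
      <URL>http://example.org/agency</URL>
      <title>Booking agency</title>
      <snippet>A reservation bureau</snippet>
    </resultElement>
  </resultElements>
</searchResponse>
"""


def parse_response(xml_text):
    """Extract a few response fields and the list of result elements."""
    root = ET.fromstring(xml_text.strip())
    results = [
        {
            "url": element.findtext("URL"),
            "title": element.findtext("title"),      # HTML, may contain <b> tags
            "snippet": element.findtext("snippet"),  # HTML as well
        }
        for element in root.find("resultElements")
    ]
    return {
        "estimated_total": int(root.findtext("estimatedTotalResultsCount")),
        "estimate_is_exact": root.findtext("estimateIsExact") == "true",
        "start_index": int(root.findtext("startIndex")),  # 1-based
        "end_index": int(root.findtext("endIndex")),      # 1-based
        "results": results,
    }


parsed = parse_response(SAMPLE_RESPONSE)
print(parsed["estimated_total"], [r["url"] for r in parsed["results"]])
```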

41
Limitations of Google APIs
  • Search request length: 2048 bytes
  • Maximum number of words in the query: 10
  • Maximum number of site: terms in the query: 1 (per
    search request)
  • Maximum number of results per query: 10
  • Maximum value of <start> + <maxResults>: 1000
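
Because generated queries grow quickly once synonyms are added, it may help to check a request against these limits before sending it. The sketch below makes some simplifying assumptions: words are counted by splitting on whitespace, the byte limit is applied to the query string only, and the function name `check_request` is made up. How the service itself counts words is not stated on the slide.

```python
MAX_REQUEST_BYTES = 2048
MAX_QUERY_WORDS = 10
MAX_SITE_TERMS = 1
MAX_RESULTS_PER_QUERY = 10
MAX_START_PLUS_RESULTS = 1000


def check_request(query, start=0, max_results=10):
    """Return a list of limit violations (empty if the request looks fine)."""
    problems = []
    words = query.split()  # assumption: whitespace-separated word count
    if len(query.encode("utf-8")) > MAX_REQUEST_BYTES:
        problems.append("request longer than 2048 bytes")
    if len(words) > MAX_QUERY_WORDS:
        problems.append("more than 10 words")
    if sum(w.startswith("site:") for w in words) > MAX_SITE_TERMS:
        problems.append("more than one site: term")
    if max_results > MAX_RESULTS_PER_QUERY:
        problems.append("more than 10 results requested")
    if start + max_results > MAX_START_PLUS_RESULTS:
        problems.append("start + maxResults exceeds 1000")
    return problems


print(check_request("hotel booking agency"))
print(check_request(" ".join(["word"] * 11)))
```

The 10-word limit in particular constrains how many synonyms an enhanced query can carry, which is one reason to filter synonyms before building the query.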

42
WordNet (online access: http://www.cogsci.princeton.edu/cgi-bin/webwn)
43
WordNet 2.0 Search Example
  • Search word: "driver". The noun "driver" has 5
    senses in WordNet.
    1. driver -- (the operator of a motor vehicle)
    2. driver -- (someone who drives animals that pull a vehicle)
    3. driver -- (a golfer who hits the golf ball with a driver)
    4. driver, device driver -- ((computer science) a program that determines how a computer will communicate with a peripheral device)
    5. driver, number one wood -- (a golf club (a wood) with a near vertical face that is used for hitting long shots from the tee)
  • Sense 1: driver -- (the operator of a motor vehicle)
       => busman, bus driver -- (someone who drives a bus)
       => chauffeur -- (a man paid to drive a privately owned car)
       => designated driver -- (the member of a party who is designated to refrain from alcohol and so is sober when it is time to drive home)
       => honker -- (a driver who causes his car's horn to make a loud honking sound; "the honker was fined for disturbing the peace")
       => motorist, automobilist -- (someone who drives (or travels in) an automobile)
       => owner-driver -- (a motorist who owns the car that he/she drives)
       => racer, race driver, automobile driver -- (someone who drives racing cars at high speeds)

44
WordNet Basic Terminology
  • Syntactic category: part of speech (noun, verb,
    adjective, adverb)
  • Synonymic set (synset): a list of synonymic words
    or collocations
  • Every word can have several senses
  • Every sense of a word is associated with the synonyms
    (synset) of the word in that specific sense
  • Synsets are organized in hierarchies interlinked
    by semantic relations

45
WordNet Relations
  • hypernymy / hyponymy
  • (is a kind of / has a kind)
  • hypernym: the more general term; hyponym: the more specific term
  • gull is a kind of bird, gull has a specific kind
    called seagull
  • meronymy / holonymy
  • (part of / has a part)
  • eye is a part of head / head has eyes as a part
  • domain
  • category
  • usage

46
WordNet Organization
  • Building Blocks
  • Word forms: common word orthography
  • Word meanings: by synsets
  • Relations
  • Lexical: between word forms
  • Semantic: between word meanings
  • => Pointers
  • Lexical: pertain only to a specific word
  • Semantic: pertain to all of the words in the
    semantic set.

47
WordNet Organization
Synset1
Synset2
Semantic Relation
Lexical Relation
WordForm1
WordForm2
48
WordNet Organization
  • Nouns and verbs
  • organized into hierarchies based on the
    hypernymy/hyponymy relation between synsets
  • additional pointers are used to indicate other
    relations
  • Adjectives
  • arranged in clusters containing head synsets and
    satellite synsets
  • Adverbs
  • often derived from adjectives
  • sometimes have antonyms
  • therefore the synset for an adverb usually
    contains a lexical pointer to the adjective from
    which it is derived

49
Simple Example of WordNet usage
Syntactic Categories
Synsets
50
Simple Example of PyWordNet usage
Syntactic Categories
Synsets
51
Semantic Search Assistant: prototype
52
Features of SSA
  • Platform independent (written in Java)
  • Works in 2 modes
  • common mode, implements almost all of Google
    functionality
  • extended mode, extends common mode, makes several
    requests with the same semantic sense, returns
    compound results.
  • Keeps results in XML format

53
Common mode
  • SSA has a clear and simple interface, which helps
    the user make advanced Google searches without special
    knowledge
  • SSA transforms the values of the fields into a Google
    request according to the special format that Google
    provides for advanced search

54
Extended mode
  • More powerful mode than the common one
  • SSA takes the user's request and tries to choose
    the most suitable sense with the user's help
  • Makes a set of requests, which extend the user's
    request with synonyms and exclude unsuitable words

55
Extended mode (2)
  • SSA sends the generated and base requests in series
  • It combines the obtained results into one list according
    to the developed rules
  • Makes it possible to highlight the results of a
    specific request for result analysis

56
Generating the set of requests
  • The WordNet API and dictionaries are used for
    generating the set of requests
  • When the user enters the original request, SSA switches
    to a panel where the different senses of the typed
    word are presented

57
Generating the set of requests (2)
  • For every sense presented on this panel, the user
    can see a description (and even an example) extracted
    from the WordNet dictionary
  • The user can also set a rate of correspondence for
    every sense in the range [-1, 1]

58
Making compound result
  • SSA sends generated requests to Google one by one
  • It keeps obtained results for each request
    separately
  • Finally, the user gets an integrated result, which
    is generated according to special rules

59
Integrated results: generating rules
  • The unique identifier for each result is its URL
  • SSA counts the number of appearances of each URL in the returned
    results and sets this number as the index of the
    URL
  • Results with a bigger index are shown first
  • If indexes are equal, results are shown in
    the order Google returned them
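
These integration rules can be sketched in a few lines. The tie-breaking rule on the slide does not say how to order URLs that come from different requests, so the sketch assumes "order of first appearance across all requests"; the function name `integrate` and the sample URLs are hypothetical.

```python
from collections import Counter


def integrate(result_lists):
    """Merge per-request URL lists into one ranked list.

    result_lists -- one list of URLs per request, each in the order
                    the search engine returned them.
    """
    counts = Counter()   # index of a URL = number of its appearances
    first_seen = {}      # assumed tie-breaker: order of first appearance
    for results in result_lists:
        for url in results:
            counts[url] += 1
            first_seen.setdefault(url, len(first_seen))
    # Bigger index first; equal indexes keep the original return order.
    return sorted(counts, key=lambda url: (-counts[url], first_seen[url]))


per_request = [
    ["a.example", "b.example", "c.example"],  # base request
    ["c.example", "d.example"],               # generated request 1
    ["b.example", "c.example"],               # generated request 2
]
print(integrate(per_request))
```

In this sample "c.example" appears in all three result lists and is ranked first, followed by "b.example"; the two URLs that appear once keep their original relative order.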

60
Results analysis
  • After making all requests, SSA shows final
    results
  • All results are also kept in files in XML
    format for further analysis
  • User can highlight results for specific request,
    if there were more than one request

61
Python - WordNet
  • PyWN("pin"), a python interface to WordNet,
    developed by WordNet employees John Asmuth and
    Jesse Fischer
  • PyWordNet is a Python interface to WordNet
    developed by Oliver Steele

62
Java-Python Integration: Jython
[Diagram: pyClass.py (Python) -> jythonc (Jython) -> pyClass.java -> javac -> pyClass.class (inner classes) (Java)]
63
Results
  • Methods for automatic sense determination using the
    WordNet lexical database were studied and the
    corresponding algorithms were implemented
  • An algorithm for new query generation was
    implemented and embedded into the software
    system
  • A user interface for advanced search (with Google
    integration) was developed with Semantic Search
    Assistant functionality

64
Example
  • Initial query
  • hotel reservation agency
  • (1, 7 and 5 senses respectively)
  • Of the first 5 results, only 3 are relevant
  • (results containing the whole sequence of query words
    do not even appear in the first three pages)
  • Generated query
  • ("hotel") ("booking" OR "reserve")
    (-"qualification") ("bureau" OR "agency")
    (-"means")
  • Of the first 5 results, all are relevant
  • (using the synonym "booking" along with
    "reservation" was helpful)

65
Example
  • Results of initial query

Results of generated query
66
More Examples
67
Test 1
Initial query: cork mousepad
68
Test 1
Initial query: cork mousepad
Enhanced query: ("phellem" OR "bobfloat" OR
"bobber" OR "cork" OR "bob") ("mousepad" OR
"mouse mat")
69
Test 2
Initial query: flowers present shop
70
Test 2
Initial query: flowers present shop
Enhanced query: ("flower") (-"heyday" -"prime"
-"efflorescence") ("present") (-"nowadays"
-"present tense") ("store" OR "shop")
(-"workshop")
71
Test 3
Initial query: hotel reservation agency
72
Test 3
Initial query: hotel reservation agency
Enhanced query: ("hotel") ("booking" OR
"reserve") (-"qualification") ("bureau" OR
"agency") (-"means")
73
Test 4
Initial query: zodiac fish
74
Test 4
Initial query: zodiac fish
Enhanced query: ("zodiac") ("pisces" OR "fish" OR
"pisces the fishes")
75
Drawbacks
  • Lack of highly specialized terminology for narrow
    domains in WordNet => it is difficult to get better
    results with SSA in such cases
  • Frequent absence of a sense relation between words
    in whole phrases => the algorithms used have difficulty
    determining the context
  • Presence of several very close senses for many
    terms in WordNet => a word does not clearly belong
    to one sense
  • Possible wrong determination of the part of speech
    of a word in the query => improper synonyms and
    antonyms are used to build the query

76
Possible Improvements and further work
  • Additional Adaptive Learning (for personalized
    context definition)
  • Creating Global Sense Ontology on the basis of
    WordNet Database
  • Improving algorithms for automatic computing of
    relevance indexes
  • Adding algorithms for smart cutting off for
    generated queries
  • Using fuzzy logic for determination of query
    context
  • Adding other lexical databases for supporting
    search in specific domains (like programming,
    medicine)
  • Multilingual support

77
Current status
  • During Jan-May 2004 the main efforts of the InBCT
    Semantic Search Facilitator project were put
    into the research and design of the basic
    features of SSA and the implementation of the
    ontology-based search method.
  • Development of the prototype Semantic Search
    Assistant software has started and a pilot
    version is ready.
  • Starting 1.06.2004, the core of the Industrial
    Ontologies Group starts working on the TEKES project
    "SmartResource: Proactive Self-Maintained
    Resources in Semantic Web"
  • at Agora Center, University of Jyväskylä
  • Further development of SSA (in terms of stability
    and usability) will continue during
    Jul-Sep 2004