Title: Semantic Search Facilitator: Concept and Current State of Development
1Semantic Search Facilitator: Concept and Current
State of Development
InBCT Tekes Project, Chapter 3.1.3: Industrial
Ontologies and Semantic Web (year 2004)
2Industrial Ontologies Group
- Researchers
- Vagan Terziyan
- Oleksandr Kononenko
- Andriy Zharko
- Oleksiy Khriyenko
- Olena Kaykova
- Olga Klochko
- Andriy Taranov
- Contact
- e-mail vagan_at_it.jyu.fi
- Phone 358 14 260 4618
- URL http://www.cs.jyu.fi/ai/OntoGroup
3Resources
Resources used from InBCT Project in 2004
- 12 000 EURO salaries for 5 months
4Semantic-based Enhancement of Information Retrieval
Motivation from the Industrial Ontologies Group:
While the Web still lacks annotated resources,
which makes metadata-based search useless, we
should develop an enhanced Web search tool based
on Google and the WordNet ontology and provide a
semantic search user interface.
5Semantic Web and Information Retrieval
- Semantic Web promises many advantages and benefits, but
- We are only in transition towards the Semantic Web
- Resources are not yet annotated semantically
- Not enough metadata is available in the Web for smarter search
- Semantic search of non-semantic data?
- Yes, why not? We need a Semantic Facilitator!
6Semantic Facilitator Concept
- What is it?
- Search service that uses other services
- Utilizes other search engines as Web services and
- makes their performance better through smart query generation algorithms
- Supports search within heterogeneous resources (Web pages, Web databases, local file system, etc.)
- Filters returned results based on user preferences
- Intelligent semantic query-based tool that really understands what users want to find
- What is it not?
- Search engine, indexing tool, registry, etc.
- Data storage, database browser, etc.
7Web search - What's the Problem?
- Search in the Web is not always convenient
- Polysemy of words gives irrelevant results
- Synonymy is not supported by search engines => loss of relevant results
- There is a need to capture semantics from the search query
8Semantic Search Assistant: light version of
Semantic Facilitator
- Semantic Search Assistant (SSA) is software that
- helps the user obtain more relevant results while using a standard search engine (Google) by interacting with the WordNet ontology
- finds possible contexts for words in the search query
- can broaden or narrow the search query with other relevant words and phrases to improve results
- works with non-annotated documents
- is not restricted to any particular domain
9Sense Determination
- WordNet is an open source ontology, which contains information about the different meanings of a term, its synonyms, antonyms and other lexical and semantic relations
- Given several words in a search query, we can determine in which context (sense) each of them is used with the help of WordNet
- by comparing the words' synsets
- by comparing the words' textual descriptions and examples
- by finding common roots when going up the WordNet hierarchy tree for each word
- by asking the user
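The second technique above (comparing textual descriptions) can be sketched as a simple gloss-overlap score. This is only an illustration under stated assumptions, not the SSA implementation: the tiny SENSES table is a hypothetical stand-in for WordNet, the two "car" glosses and all sense labels are invented for the example, and the stop-word list is arbitrary.

```python
import re
from itertools import product

# Hypothetical miniature sense inventory standing in for WordNet glosses.
# The "driver" glosses follow the WordNet example in this deck; the "car"
# glosses and all sense labels are invented for illustration.
SENSES = {
    "driver": {
        "driver.motorist": "the operator of a motor vehicle",
        "driver.golf_club": "a golf club with a near vertical face "
                            "that is used for hitting long shots from the tee",
    },
    "car": {
        "car.automobile": "a motor vehicle with four wheels",
        "car.cable_car": "a conveyance for passengers suspended from a cable",
    },
}

# Arbitrary stop-word list so that function words do not count as overlap.
STOP_WORDS = {"a", "an", "the", "of", "for", "with", "that", "is", "from", "used"}


def content_words(text):
    """Lower-case a gloss and return its set of non-stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}


def best_sense_combination(query_words, senses=SENSES):
    """Pick the sense combination whose glosses share the most words.

    Returns (tuple of sense labels, overlap score).
    """
    options = [list(senses[word].items()) for word in query_words]
    best, best_score = None, -1
    for combo in product(*options):
        glosses = [content_words(gloss) for _, gloss in combo]
        # Sum the pairwise gloss intersections over all word pairs.
        score = sum(
            len(a & b) for i, a in enumerate(glosses) for b in glosses[i + 1:]
        )
        if score > best_score:
            best, best_score = tuple(label for label, _ in combo), score
    return best, best_score


combo, score = best_sense_combination(["driver", "car"])
print(combo, score)
```

Enumerating every combination grows quickly with the number of words and senses, which is one reason to fall back on asking the user when the overlap is zero or tied.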
10How does it work?
- Gets a keyword query
- Translates the original query into a series of queries to Google, taking into account the semantics of the keywords
- Combines the returned results
11Basic idea
SemanticSearchFacilitator is a semantic interface
to a search mechanism that combines and
integrates a set of existing search engines. This
interface provides additional semantic
enhancement and filtering for complex information
retrieval.
[Diagram: SemanticSearchFacilitator on top of Local search, DB search, Web search and other (?) search]
12Ontology Personalization
Ontology Personalization is a mechanism that
allows users to have their own conceptual view
and to use it for semantic querying of search
facilities.
13Semantic Search Enhancement
Semantic Search Facilitator transforms
concept-based queries, at an intermediate stage
of query execution, into a set of simple
(keyword-based) queries to the search services
in use, which may be of various types (Internet
search engine, database, intranet, etc.). Such a
semantic facilitator knows how to translate
semantic queries into the query formats of
several different search services and can perform
the routine actions that most users carry out in
order to achieve better performance and get more
relevant results.
[Diagram labels: Common ontology, Personal ontology, Query]
14Semantic Search Enhancement
Semantic Search Facilitator uses ontologically
defined (WordNet) knowledge about words and
built-in support for advanced Google search query
features in order to construct more efficient
queries from a formal textual description of the
information sought. It hides the complexity of a
particular search engine's query language from
users and performs the routine actions that most
users carry out in order to achieve better
performance and get more relevant results.
[Diagram labels: Common (linguistic) ontology, Domain ontology, Query]
15Capturing Semantics from Search Phrases
Motivation, according to our Ukrainian colleague
Vadim Ermolayev: a Google query should be
transformed based on a domain ontology
24Semantic Search Assistant
Sense Processor
- Defines the sense of words in the search query if more than one interpretation is possible
- This determination is used to narrow the choice among the many possible sense combinations
- The combination choice algorithm is based on analyzing the items that intersect for different words in the search query
- Intersections can be found, for example, in the definitions of words, in their synsets, or in the definitions reached through some pointer type (e.g. an upper-level hypernym)
- Is written in Python
25Semantic Search Assistant
Google Search Query Generator: builds the new
search queries based on the determined semantic
sense of the query and the user's measure of
result relevance.
Search Preferences Processor: interprets the
user's preferences for the query.
26Semantic Search Assistant
- User Interface
- Allows sending search queries to the search engine, getting search results, and customizing the query
- Customization
- selecting a measure of importance for each word in the query
- prioritizing a certain sense for words in the query
- Irrelevant results are excluded by choosing words that characterize the resource negatively
27Algorithm for the New Query Generation
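[Diagram: each Word(i), i = 1..n, branches into its senses Sense(i1) ... Sense(ij) ... Sense(ip), j = 1..p; each sense Sense(ij) branches into its synonyms Syn(ij1) ... Syn(ijk) ... Syn(ijm_ij), k = 1..m_ij. The senses carry a scale from -1 to 1; a scale from 0 to 1 also appears.]
- R_i: significance of the word in the query
- R_ij: relevance of the word's sense
- N_ijk: number of the synonym's senses
- n: number of words in the query
- p: number of the word's senses
- m_ij: number of the word's synonyms in a sense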
28Algorithm for the New Query Generation
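Synonym Quality
[Formula residue: Q_ijk, the quality of synonym Syn_ijk of Word(i), is built from the sense relevances R_ij summed over j = 1..p, with a term equal to 1 if Syn_ijk is a member of synset_j, together with L and N_ijk]
L: number of the synsets which contain Syn_ijk
Reduction of the synonym quality absolute value
If Q_ijk > 0, the synonym is used via OR in the query; if Q_ijk < 0, it is used via AND NOT.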
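The sign rule above can be sketched as a small query builder. This is an illustration rather than the SSA code: the function name, the input shape and the choice to drop synonyms whose quality is exactly 0 are assumptions. The output style (quoted terms, OR groups, minus-prefixed exclusions, groups separated by spaces for Google's implicit AND) follows the generated queries shown in this deck, although the order of terms inside a group may differ from them.

```python
def build_query(words):
    """Build a Google-style query string.

    `words` is a list of (word, {synonym: quality}) pairs in query order.
    Synonyms with quality > 0 are OR-ed with the word; synonyms with
    quality < 0 are excluded ("AND NOT", written with a leading minus).
    Synonyms with quality == 0 are ignored (an assumption of this sketch).
    """
    groups = []
    for word, synonyms in words:
        include = [word] + [s for s, q in synonyms.items() if q > 0]
        exclude = [s for s, q in synonyms.items() if q < 0]
        groups.append("(" + " OR ".join(f'"{term}"' for term in include) + ")")
        if exclude:
            groups.append("(" + " ".join(f'-"{term}"' for term in exclude) + ")")
    # Groups are separated by spaces, which Google treats as AND.
    return " ".join(groups)


# Quality values here are made up for the example.
print(build_query([
    ("hotel", {}),
    ("agency", {"bureau": 0.5, "means": -1.0}),
]))
```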
29Algorithm for the New Query Generation
Algorithm 1
[Diagram: for each of Word(1) ... Word(i) ... Word(n), the word's synonyms (Syn) are combined via OR (AND NOT); the resulting word groups are joined via AND into the Query]
30Algorithm for the New Query Generation
Algorithm 2
[Diagram: as in Algorithm 1, with an added step of filtering based on the significance of the word (R_i; the diagram shows a ">" comparison involving Q); within each word group the synonyms (Syn) are combined via OR (AND NOT), and the groups are joined via AND into the Query]
31Google API: Adaptation to search engine
32We use Google because...
- Developers write software that connects remotely to the Google Web APIs service and accesses Google's index of more than 4 billion web pages
- Google Web APIs support the same search syntax as the Google.com site
- Communication is performed via the Simple Object Access Protocol (SOAP), an XML-based mechanism for exchanging typed information
...but it could be virtually any existing search engine
33Requests
- Search requests submit a query and a set of parameters to the Google Web APIs service and receive a set of search results
- Cache requests submit a URL to the Google Web APIs service and receive the contents of the URL as of when Google's crawlers last visited the page (if available)
- Spelling requests submit a query to the Google Web APIs service and receive a suggested spelling correction for the query (if available)
34Requests Additional Commands
- intitle: restricts your search to the titles of web pages
- inurl: restricts your search to the URLs of web pages
- intext: searches only body text
- inanchor: searches for text in a page's link anchors
- site: allows you to narrow your search by site or top-level domain
- link: returns a list of pages linking to the specified URL
- cache: finds a copy of the page that Google indexed from Google's cache
35Requests Additional Commands
- daterange: limits your search to a particular date or range of dates on which a page was indexed
- filetype: searches the suffixes or filename extensions
- related: finds pages that are related to the specified page
- info: provides a page of links to more information about the specified URL
- phonebook: looks up phone numbers
36Search Results Format Search Response
- A search response is returned after each search request
- It contains
- <documentFiltering> - A Boolean value indicating whether filtering was performed on the search results. This will be "true" only if (a) you requested filtering and (b) filtering actually occurred
- <searchComments> - A text string intended for display to an end user. One of the most common messages found here is a note that "stop words" were removed from the search automatically. (This happens for very common words such as "and" and "as.")
37Search Results Format Search Response (2)
- <estimatedTotalResultsCount> - The estimated total number of results that exist for the query
- <estimateIsExact> - A Boolean value indicating that the estimate is actually the exact value
- <resultElements> - An array of <resultElement> items. This corresponds to the actual list of search results
- <searchQuery> - This is the value of <q> for the search request
- <startIndex> - Indicates the index (1-based) of the first search result in <resultElements>
38Search Results Format Search Response (3)
- <endIndex> - Indicates the index (1-based) of the last search result in <resultElements>
- <searchTips> - A text string intended for display to the end user. It provides instructive suggestions on how to use Google
- <directoryCategories> - An array of <directoryCategory> items. This corresponds to the ODP directory matches for this search
- <searchTime> - Text, floating-point number indicating the total server time to return the search results, measured in seconds
39Search Results Format Result Element
- <summary> - If the search result has a listing in the ODP directory, the ODP summary appears here as a text string
- <URL> - The URL of the search result, returned as text, with an absolute URL path
- <snippet> - A snippet which shows the query in context on the URL where it appears. This is formatted HTML and usually includes <B> tags within it. Note that the query term does not always appear in the snippet
- <title> - The title of the search result, returned as HTML
40Search Results Format Result Element (2)
- <cachedSize> - Text (Integer "k"). Indicates that a cached version of the <URL> is available; size is indicated in kilobytes
- <relatedInformationPresent> - Boolean indicating that the "related:" query term is supported for this URL
- <hostName> - When filtering occurs, a maximum of two results from any given host is returned. When this occurs, the second resultElement that comes from that host contains the host name in this parameter
41Limitations of Google APIs
- Search request length: 2048 bytes
- Maximum number of words in the query: 10
- Maximum number of site: terms in the query: 1 (per search request)
- Maximum number of results per query: 10
- Maximum value of <start> + <maxResults>: 1000
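A client has to respect these limits before sending anything. The sketch below checks a query string against them and splits a larger result request into pages. The limit values come from the list above; the function names, counting words by whitespace, and treating <start> as a zero-based offset are assumptions of this sketch, since the slide does not say how the service counts words.

```python
MAX_QUERY_BYTES = 2048         # search request length
MAX_QUERY_WORDS = 10           # words per query
MAX_SITE_TERMS = 1             # site: terms per search request
MAX_RESULTS_PER_QUERY = 10     # results per query
MAX_START_PLUS_RESULTS = 1000  # upper bound for start + maxResults


def check_query(query):
    """Return a list of limit violations for a query string (empty if none)."""
    problems = []
    tokens = query.split()  # assumption: words are whitespace-separated
    if len(query.encode("utf-8")) > MAX_QUERY_BYTES:
        problems.append("query longer than 2048 bytes")
    if len(tokens) > MAX_QUERY_WORDS:
        problems.append("more than 10 words")
    if sum(token.startswith("site:") for token in tokens) > MAX_SITE_TERMS:
        problems.append("more than one site: term")
    return problems


def result_pages(wanted):
    """Yield (start, max_results) pairs covering the first `wanted` results.

    `start` is treated as a zero-based offset (an assumption of this sketch).
    """
    wanted = min(wanted, MAX_START_PLUS_RESULTS)
    for start in range(0, wanted, MAX_RESULTS_PER_QUERY):
        yield start, min(MAX_RESULTS_PER_QUERY, wanted - start)


print(check_query("hotel booking agency"))
print(list(result_pages(25)))
```

The 10-word limit is the tightest constraint for synonym-expanded queries, which is why a long expansion may have to be split into several requests.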
42WordNet (online access: http://www.cogsci.princeton.edu/cgi-bin/webwn)
43WordNet 2.0 Search Example
- Search word: "driver"
The noun "driver" has 5 senses in WordNet.
1. driver -- (the operator of a motor vehicle)
2. driver -- (someone who drives animals that pull a vehicle)
3. driver -- (a golfer who hits the golf ball with a driver)
4. driver, device driver -- ((computer science) a program that determines how a computer will communicate with a peripheral device)
5. driver, number one wood -- (a golf club (a wood) with a near vertical face that is used for hitting long shots from the tee)
- Sense 1
driver -- (the operator of a motor vehicle)
   => busman, bus driver -- (someone who drives a bus)
   => chauffeur -- (a man paid to drive a privately owned car)
   => designated driver -- (the member of a party who is designated to refrain from alcohol and so is sober when it is time to drive home)
   => honker -- (a driver who causes his car's horn to make a loud honking sound; "the honker was fined for disturbing the peace")
   => motorist, automobilist -- (someone who drives (or travels in) an automobile)
   => owner-driver -- (a motorist who owns the car that he/she drives)
   => racer, race driver, automobile driver -- (someone who drives racing cars at high speeds)
44WordNet Basic Terminology
- Syntactic category: part of speech (noun, verb, adjective, adverb)
- Synonymic set (synset): list of synonymic words or collocations
- Every word can have several senses
- Every sense of a word is associated with the synonyms (synset) of the word in that specific sense
- Synsets are organized in hierarchies interlinked with semantic relations
45WordNet Relations
- hypernymy / hyponymy
- (is a kind of / has kind)
- hypernym (more general term), hyponym (more specific term)
- gull is a kind of bird; gull has a specific kind called seagull
- meronymy / holonymy
- (part of / has a part)
- eye is a part of head / head has eyes as a part
- domain
- category
- usage
46WordNet Organization
- Building Blocks
- Word forms: common word orthography
- Word meanings: represented by synsets
- Relations
- Lexical: between word forms
- Semantic: between word meanings
- => Pointers
- Lexical: pertain only to a specific word
- Semantic: pertain to all of the words in a synset
47WordNet Organization
[Diagram: Synset1 and Synset2 are linked by a Semantic Relation; WordForm1 and WordForm2 are linked by a Lexical Relation]
48WordNet Organization
- Nouns and verbs
- organized into hierarchies based on the hypernymy/hyponymy relation between synsets
- additional pointers are used to indicate other relations
- Adjectives
- arranged in clusters containing head synsets and satellite synsets
- Adverbs
- often derived from adjectives
- sometimes have antonyms
- therefore the synset for an adverb usually contains a lexical pointer to the adjective from which it is derived
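Because nouns form a hypernym hierarchy, two words can be related by walking upwards until their paths meet, which is the "common roots" idea used for sense determination. The following is a minimal sketch on a hypothetical hand-made fragment of such a hierarchy (only the gull/bird/seagull links come from this deck). Real WordNet synsets can have more than one hypernym, which this single-parent table ignores.

```python
# Hypothetical fragment of a noun hierarchy: synset name -> its hypernym.
# Only gull/bird/seagull are taken from the deck; the rest is invented.
HYPERNYM = {
    "seagull": "gull",
    "gull": "bird",
    "sparrow": "bird",
    "bird": "animal",
    "dog": "animal",
    "animal": None,
}


def hypernym_path(synset, hypernym=HYPERNYM):
    """Return the chain from a synset up to the top of the hierarchy."""
    path = []
    while synset is not None:
        path.append(synset)
        synset = hypernym[synset]
    return path


def common_root(a, b, hypernym=HYPERNYM):
    """Return (nearest shared hypernym, number of links between a and b).

    Returns (None, None) if the two synsets share no ancestor.
    """
    path_b = hypernym_path(b, hypernym)
    for steps_a, node in enumerate(hypernym_path(a, hypernym)):
        if node in path_b:
            return node, steps_a + path_b.index(node)
    return None, None


print(common_root("seagull", "sparrow"))
print(common_root("gull", "dog"))
```

A shorter link count suggests that the two candidate senses belong to the same context, so it can serve as one signal when ranking sense combinations.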
49Simple Example of WordNet usage
Syntactic Categories
Synsets
50Simple Example of PyWordNet usage
Syntactic Categories
Synsets
51Semantic Search Assistant: prototype
52Features of SSA
- Platform independent (written in Java)
- Works in 2 modes
- common mode: implements almost all of Google's functionality
- extended mode: extends the common mode, makes several requests with the same semantic sense, and returns compound results
- Keeps results in XML format
53Common mode
- SSA has a clear and simple interface, which helps
the user make an advanced Google search without
special knowledge
- SSA transforms the values of the fields into a
Google request according to the special format
that Google provides for advanced search
54Extended mode
- A more powerful mode than the common one
- SSA takes the user's request and tries to choose the most suitable sense with the user's help
- Makes a set of requests, which extend the user's request with synonyms and exclude unsuitable words
55Extended mode (2)
- SSA issues the generated and base requests in series
- It combines the obtained results into one according to the developed rules
- Provides the possibility to highlight the results of a specific request for result analysis
56Generating of requests set
- The WordNet API and dictionaries are used for
generating the set of requests
- When the user enters the original request, SSA
switches to a panel where the different senses
of the typed word are presented
57Generating of requests set (2)
- For every sense presented on this panel the user can see a description (and sometimes an example) extracted from the WordNet dictionary
- He/she can also set a rate of correspondence for every sense in the range [-1, 1]
58Making compound result
- SSA sends the generated requests to Google one by one
- It keeps the results obtained for each request separately
- Finally the user gets an integrated result, which is generated according to special rules
59Integrated results: generating rules
- The unique identifier for each result is its URL
- SSA counts how many times each URL appears in the returned results and uses this count as the index for that URL
- Results with a higher index are shown first
- If indexes are equal, results are shown in the order in which Google returned them
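These rules can be sketched in a few lines. The function name and input shape are hypothetical. Two details are assumptions, because the slide does not settle them: a URL repeated within a single request is counted once, and ties are broken by the first position at which a URL was seen across the requests taken in series.

```python
def integrate_results(result_lists):
    """Merge per-request result lists into one ranked list of URLs.

    `result_lists` holds one list of URLs per request, each in the order
    Google returned it. A URL's index is the number of requests in which
    it appears; a higher index ranks first, and ties keep first-seen order.
    """
    counts, first_seen = {}, {}
    position = 0
    for results in result_lists:
        for url in dict.fromkeys(results):  # drop repeats within one request
            counts[url] = counts.get(url, 0) + 1
            first_seen.setdefault(url, position)
            position += 1
    return sorted(counts, key=lambda url: (-counts[url], first_seen[url]))


# Placeholder URLs "a".."d" stand for real result URLs.
print(integrate_results([["a", "b", "c"], ["c", "d", "a"], ["c"]]))
```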
60Results analysis
- After making all requests, SSA shows the final results
- All results are also kept in files in XML format for further analysis
- The user can highlight the results of a specific request if there was more than one request
61Python - WordNet
- PyWN ("pin"), a Python interface to WordNet, developed by WordNet employees John Asmuth and Jesse Fischer
- PyWordNet, a Python interface to WordNet developed by Oliver Steele
62Java-Python Integration Jython
[Diagram: the Python source pyClass.py is translated by jythonc (Jython) into pyClass.java, which javac compiles into pyClass.class (with inner classes) for use from Java]
63Results
- Methods for automatic sense determination using the WordNet lexical database were studied and the corresponding algorithms were implemented
- The algorithm for new query generation was implemented and embedded into the software system
- A user interface for advanced search (with Google integration) was developed with Semantic Search Assistant functionality
64Example
- Initial query
- hotel reservation agency
- (1, 7 and 5 senses respectively)
- Of the first 5 results only 3 are relevant
- (results containing the whole sequence of query words do not appear even in the first three pages)
- Generated query
- ("hotel") ("booking" OR "reserve") (-"qualification") ("bureau" OR "agency") (-"means")
- Of the first 5 results all are relevant
- (using the synonym "booking" along with "reservation" was helpful)
65Example
Results of generated query
66More Examples
67Test 1
Initial query: cork mousepad
68Test 1
Initial query: cork mousepad
Enhanced query: ("phellem" OR "bobfloat" OR "bobber" OR "cork" OR "bob") ("mousepad" OR "mouse mat")
69Test 2
Initial query: flowers present shop
70Test 2
Initial query: flowers present shop
Enhanced query: ("flower") (-"heyday" -"prime" -"efflorescence") ("present") (-"nowadays" -"present tense") ("store" OR "shop") (-"workshop")
71Test 3
Initial query: hotel reservation agency
72Test 3
Initial query: hotel reservation agency
Enhanced query: ("hotel") ("booking" OR "reserve") (-"qualification") ("bureau" OR "agency") (-"means")
73Test 4
Initial query: zodiac fish
74Test 4
Initial query: zodiac fish
Enhanced query: ("zodiac") ("pisces" OR "fish" OR "pisces the fishes")
75Drawbacks
- Lack of highly specialized terminology for narrow domains in WordNet => difficult to get better results with SSA in such cases
- Frequent absence of a sense relation between the words of a whole phrase => difficulty of context determination by the algorithms used
- Presence of several very close senses for many terms in WordNet => a word does not clearly belong to one sense
- Possible wrong determination of the part of speech of a word in the query => improper synonyms and antonyms used for making the query
76Possible Improvements and further work
- Additional adaptive learning (for personalized context definition)
- Creating a Global Sense Ontology on the basis of the WordNet database
- Improving algorithms for automatic computation of relevance indexes
- Adding algorithms for smart truncation of generated queries
- Using fuzzy logic for determination of query context
- Adding other lexical databases to support search in specific domains (like programming, medicine)
- Multilingual support
77Current status
- During Jan-May 2004 the main efforts of the InBCT Semantic Search Facilitator project were put into the research and design of the basic features of SSA and the implementation of the ontology-based search method.
- The development of the prototype Semantic Search Assistant software has been started and a pilot version is ready.
- Starting 1.06.2004 the core part of the Industrial Ontologies Group starts working on the TEKES project "SmartResource: Proactive Self-Maintained Resources in Semantic Web"
- at Agora Center, University of Jyväskylä
- Further development of SSA (in terms of stability and usability) will continue during Jul-Sep 2004