Title: SphereSearch
2Outline
- Where existing search engines fail
- SphereSearch Concepts
- Transformation and Annotation
- Query Language and Scoring
- Experimental Evaluation
- Summary
3Example query 1
Which professors from Saarbrücken do research on XML?
Different terminology in query and Web pages
Director of Department 5 DBS IS
Professor at Saarland University
Abstraction Awareness
4Example query 2
Conferences about XML in Norway 2005?
Context Awareness
5Example query 3
What are the publications of Max Planck?
Max Planck should be instance of concept person,
not of concept institute
Concept Awareness
6SphereSearch Concepts
Goal: Increase recall and precision for hard
queries on linked and heterogeneous data
- Unified search for unstructured, semistructured, and structured data from heterogeneous sources
- Graph-based model, including links
- Annotation engines from NLP to recognize classes of named entities (persons, locations, dates, ...) for concept-aware queries
- Flexible yet simple abstraction-aware query language with context-aware scoring
- Compactness-based scores
7Some Related Work
- Web query languages, e.g., W3QS [VLDB 95], WebOQL [ICDE 95], ...
- Web IR with thesauri, e.g., Qiu et al. [SIGIR 93], Liu et al. [SIGIR 04], ...
- XML IR, e.g., XXL [WebDB 00], XIRQL [SIGIR 01], XSearch [VLDB 93], XRank [SIGMOD 03], ...
- Information extraction, e.g., Lixto, KnowItAll, ...
- Advanced Web graph IR, e.g., BANKS [ICDE 02], Hristidis et al. [VLDB 03], ...
8Outline
- Where existing search engines fail
- SphereSearch Concepts
- Transformation and Annotation
- Query Language and Scoring
- Experimental Evaluation
- Current and Future Work
9Unifying Search on Heterogeneous Data
Databases
10Heuristic Transformation of HTML
Goal: Transform layout tags to semantic annotations
- Headlines: <h1>Experiments</h1><h2>Settings</h2>We evaluated...<h2>Results</h2>Our system...
- Patterns: <b>Topic</b>XML
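The two heuristics above can be illustrated with a minimal sketch. The slide does not show the output of the transformation, so the target form (`<Topic>XML</Topic>`, `<Settings>...</Settings>`), the function names, and the regular expressions below are assumptions for illustration only, not the actual SphereSearch rules.

```python
import re

# Hypothetical rule: "<b>Label</b>value" becomes "<Label>value</Label>".
BOLD_LABEL = re.compile(r"<b>\s*(\w+)\s*:?\s*</b>\s*([^<]+)")

# Hypothetical rule: "<h2>Title</h2>text" becomes "<Title>text</Title>".
# Only single-word headlines are handled in this sketch.
HEADLINE = re.compile(r"<h2>\s*(\w+)\s*</h2>([^<]*)")


def _wrap(match: re.Match) -> str:
    """Wrap the captured text in a tag named after the captured label."""
    tag, text = match.group(1), match.group(2).strip()
    return f"<{tag}>{text}</{tag}>"


def bold_label_to_tag(html: str) -> str:
    """Turn a bold label followed by text into a semantic element."""
    return BOLD_LABEL.sub(_wrap, html)


def headlines_to_sections(html: str) -> str:
    """Turn each second-level headline and the text after it into an element."""
    return HEADLINE.sub(_wrap, html)


if __name__ == "__main__":
    print(bold_label_to_tag("<b>Topic</b>XML"))
    print(headlines_to_sections(
        "<h2>Settings</h2>We evaluated...<h2>Results</h2>Our system..."
    ))
```

A production transformer would work on a parsed DOM rather than on regular expressions; the regex form is used here only to keep the example short.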
11(Almost) Generic XML Data Model
- <Professor> Gerhard Weikum <Course> IR </Course> Saarbrücken <Research> XML </Research></Professor>
Node 1: docid=1, tag=Professor, content=Gerhard Weikum Saarbrücken
Node 2: docid=1, tag=Course, content=IR
Node 3: docid=1, tag=Research, content=XML
Automatic annotation of important concepts
(persons, locations, dates, money amounts) with
tools from Information Extraction
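As a rough illustration of the data model, the following sketch turns the example fragment into one node per element, each carrying a document id, the tag name, and the element's own text. The `Node` class, its field names, and the `xml_to_nodes` helper are assumptions made for this example; the slide only shows the (docid, tag, content) triples and the node numbers.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    node_id: int            # position in document order, starting at 1
    docid: int              # id of the source document
    tag: str                # element name
    content: str            # text directly inside the element (children excluded)
    parent: Optional[int]   # node_id of the parent element, None for the root


def xml_to_nodes(xml_text: str, docid: int) -> List[Node]:
    """Flatten an XML fragment into element nodes in document order."""
    nodes: List[Node] = []

    def visit(elem: ET.Element, parent: Optional[int]) -> None:
        node_id = len(nodes) + 1
        # Own text = text before the first child plus the text after each child.
        parts = [elem.text or ""] + [child.tail or "" for child in elem]
        content = " ".join(" ".join(parts).split())
        nodes.append(Node(node_id, docid, elem.tag, content, parent))
        for child in elem:
            visit(child, node_id)

    visit(ET.fromstring(xml_text), None)
    return nodes


if __name__ == "__main__":
    fragment = (
        "<Professor> Gerhard Weikum <Course> IR </Course> "
        "Saarbrücken <Research> XML </Research></Professor>"
    )
    for node in xml_to_nodes(fragment, docid=1):
        print(node)
```

The parent pointers give the tree edges of the graph; links between documents would be added as further edges on top of these.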
12Information Extraction (IE)
- Named Entity Recognition (NER)
- Named entity: abstract datatype, concept (location, person, ..., IP address)
- Mature (out-of-the-box products, e.g., GATE/ANNIE)
- Extensible
The Pelican Hotel in Salvador, operated
by Roberto Cardoso, offers comfortable rooms
starting at 100 a night, including breakfast.
Please check in before 7pm.
The <company> Pelican Hotel </company>
in <location> Salvador </location>, operated
by <person> Roberto Cardoso </person>,
offers comfortable rooms starting at <price> 100
</price> a night, including breakfast. Please
check in before <time> 7pm </time>.
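To make the annotation step concrete, here is a toy annotator that combines a small gazetteer with one regular expression. It is only a sketch: the gazetteer entries, the time pattern, and the `annotate` function are made up for this example and do not reflect how GATE/ANNIE works internally. Price annotation is left out because the example text carries no currency unit.

```python
import re

# Hypothetical gazetteer; real NER tools rely on much richer resources and rules.
GAZETTEER = {
    "Pelican Hotel": "company",
    "Salvador": "location",
    "Roberto Cardoso": "person",
}

# Hypothetical pattern for clock times such as "7pm" or "7:30 am".
TIME_PATTERN = re.compile(r"\b\d{1,2}(?::\d{2})?\s?(?:am|pm)\b")


def annotate(text: str) -> str:
    """Wrap known entities and clock times in concept tags."""
    for phrase, concept in GAZETTEER.items():
        text = text.replace(phrase, f"<{concept}>{phrase}</{concept}>")
    return TIME_PATTERN.sub(lambda m: f"<time>{m.group(0)}</time>", text)


if __name__ == "__main__":
    print(annotate(
        "The Pelican Hotel in Salvador, operated by Roberto Cardoso, "
        "offers comfortable rooms. Please check in before 7pm."
    ))
```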
13Unifying Search on Heterogeneous Data
Databases
14Annotation-Aware Data Model
- <Professor> Gerhard Weikum <Course>IR</Course> Saarbrücken <Research>XML</Research></Professor>
Node 1: docid=1, tag=Professor, content=Gerhard Weikum Saarbrücken
Node 2: docid=1, tag=Course, content=IR
Node 3: docid=1, tag=Research, content=XML
Annotation introduces new tags
15Data Model for Links
16Architecture
[Figure: architecture diagram, bottom to top]
- Sources: SIGIR Website, Hotel Website, Tourist Guide (XML), Flight Schedule, Graupmann Homepage
- Adapters: Web Adapter, Web Portal Adapter, XML Adapter, EMail Adapter
- IE Processor with Annotators: Annotation Module LOCATION, Annotation Module DATE, Annotation Module PRICE
- INDEX
- Search Engine
- Example annotated content: From=SIGIR, Subject=Notification; Date=15-18 August, Event=SIGIR; Location=Frankfurt, Location=Salvador, Time=1315; Location=Salvador, Price=89; Location=Salvador; Person=Schenkel
17Outline
- Where existing search engines fail
- SphereSearch Concepts
- Transformation and Annotation
- Query Language and Scoring
- Experimental Evaluation
- Current and Future Work
18SphereSearch Queries
- Extended keyword queries
- similarity conditions: ~professor, ~Saarbrücken
- concept-based conditions: person=Max Planck, location=Trondheim
- grouping
- join conditions
- Ranked results with context-aware scoring
19Score Aggregation SphereScore
research XML
1
- Weighted aggregation of local scores in
environment of element (sphere score)
Context awareness
Rewards proximity of terms and compactness of
term distribution
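The idea of a sphere score can be sketched as a breadth-first traversal that adds up the local scores of all nodes near an element, with less weight for nodes further away. The slide does not give the weighting scheme, so the geometric damping factor, the radius, and the function signature below are assumptions for illustration.

```python
from collections import deque
from typing import Dict, Hashable, Iterable


def sphere_score(
    graph: Dict[Hashable, Iterable[Hashable]],
    local_scores: Dict[Hashable, float],
    start: Hashable,
    radius: int = 2,        # assumed maximum distance of the sphere
    damping: float = 0.5,   # assumed weight decay per hop
) -> float:
    """Sum the local scores of all nodes within `radius` hops of `start`,
    weighting a node at distance d by damping ** d."""
    dist = {start: 0}
    queue = deque([start])
    total = 0.0
    while queue:
        node = queue.popleft()
        d = dist[node]
        total += (damping ** d) * local_scores.get(node, 0.0)
        if d < radius:
            for neighbour in graph.get(node, ()):
                if neighbour not in dist:
                    dist[neighbour] = d + 1
                    queue.append(neighbour)
    return total


if __name__ == "__main__":
    # Undirected toy graph: Professor (1) with children Course (2) and Research (3).
    graph = {1: [2, 3], 2: [1], 3: [1]}
    xml_scores = {3: 1.0}  # only the Research node matches the term "XML"
    for node in (1, 2, 3):
        print(node, sphere_score(graph, xml_scores, node))
```

With these settings the Research node itself scores highest, its parent scores less, and the sibling Course node still receives a small score. That is the intended effect: elements close to a matching term are rewarded even if they do not contain it.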
20Similarity Conditions
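Similarity conditions like ~professor, ~Saarbrücken
disambiguation
Query expansion
$\delta\text{-exp}(x) = \{\, w \mid \mathrm{sim}(x, w) > \delta \,\}$
Local score: weighted max over all expansion terms
$s_L(e, \sim\text{professor}) = \max_{t \in \delta\text{-exp}(\text{professor})} \mathrm{sim}(\text{professor}, t) \cdot s_L(e, t)$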
Abstraction awareness
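The expansion set and the weighted maximum can be written down in a few lines. In this sketch the similarity table stands in for a thesaurus or ontology lookup, and the query term itself is added to the expansion with similarity 1.0; both choices, as well as the function names and the sample values, are assumptions and not taken from the slides.

```python
from typing import Dict


def delta_expansion(
    term: str, similarity: Dict[str, Dict[str, float]], delta: float
) -> Dict[str, float]:
    """Return all terms w with sim(term, w) > delta, mapped to their similarity.
    The term itself is included with similarity 1.0 (assumption)."""
    expansion = {w: s for w, s in similarity.get(term, {}).items() if s > delta}
    expansion[term] = 1.0
    return expansion


def local_score_similar(
    element_scores: Dict[str, float],
    term: str,
    similarity: Dict[str, Dict[str, float]],
    delta: float,
) -> float:
    """Local score of an element for ~term: the maximum over all expansion
    terms t of sim(term, t) * local score of the element for t."""
    expansion = delta_expansion(term, similarity, delta)
    return max(sim * element_scores.get(t, 0.0) for t, sim in expansion.items())


if __name__ == "__main__":
    # Hypothetical similarity values for illustration.
    similarity = {"professor": {"lecturer": 0.8, "teacher": 0.6, "student": 0.2}}
    element_scores = {"lecturer": 0.5}  # the element mentions "lecturer" only
    print(delta_expansion("professor", similarity, delta=0.5))
    print(local_score_similar(element_scores, "professor", similarity, delta=0.5))
```

Taking the maximum instead of the sum keeps an element from being rewarded merely for containing many near-synonyms of the same query term.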
21Concept-based conditions
Goal: Exploit explicit (tags) and automatic
annotations in documents
location=Trondheim
Allows similarity and range queries (for
annotated concepts) like location=Trondheim, 1970<date<1980
with concept-specific distance measures
Concept awareness
22Query Groups
Goal: Related terms should occur in the same
context
- Group conditions that relate to the same entity
- professor teaching IR research XML
- professor T(teaching IR) R(research XML)
- SphereScore computed for each group
- Find compact sets with one result for each group
23Scores for Query Results
- Query result R: one result per query group
- Compactness: 1/size of a minimal spanning tree
Context awareness
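A minimal sketch of the compactness measure: given the result nodes (one per query group) and a pairwise distance between them, compute the total weight of a minimum spanning tree with Prim's algorithm and take its reciprocal. How distances between result nodes are obtained (for instance as shortest-path lengths in the data graph) and the value returned for a single-node result are assumptions here; the slide only states that compactness is 1/size of a minimal spanning tree.

```python
from typing import Callable, Hashable, Iterable


def mst_weight(
    nodes: Iterable[Hashable], distance: Callable[[Hashable, Hashable], float]
) -> float:
    """Total edge weight of a minimum spanning tree over `nodes` (Prim)."""
    nodes = list(nodes)
    if len(nodes) < 2:
        return 0.0
    in_tree = {nodes[0]}
    total = 0.0
    while len(in_tree) < len(nodes):
        # Cheapest edge that connects the tree to a node outside of it.
        weight, nxt = min(
            (distance(u, v), v)
            for u in in_tree
            for v in nodes
            if v not in in_tree
        )
        total += weight
        in_tree.add(nxt)
    return total


def compactness(
    nodes: Iterable[Hashable], distance: Callable[[Hashable, Hashable], float]
) -> float:
    """1 / MST size; a result with a single node gets 1.0 (assumption)."""
    size = mst_weight(nodes, distance)
    return 1.0 / size if size > 0 else 1.0


if __name__ == "__main__":
    # Hypothetical distances between three result nodes.
    table = {
        frozenset(("a", "b")): 1.0,
        frozenset(("b", "c")): 2.0,
        frozenset(("a", "c")): 4.0,
    }
    dist = lambda u, v: table[frozenset((u, v))]
    print(mst_weight(["a", "b", "c"], dist), compactness(["a", "b", "c"], dist))
```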
24Join conditions
Goal: Connect results of different query groups
- A(research, XML)
- B(VLDB 2005 paper)
- A.person=B.person
- Dependent on database size, application
- Precomputed
- Computed during query execution
B
A
VLDB 2005
research XML Ralf Schenkel
1.0
2004 2005 R.Schenkel
0.9
- Join conditions do not change the score for a node
- Join conditions create a new link with a specific weight
25Score for Join Conditions
- Join condition A.T=B.S
- For all nodes n1 with type T, n2 with type S, add edge (n1,n2) with weight $(\mathrm{sim}(n_1,n_2))^{-1}$
- sim(n1,n2): content-based similarity
26Outline
- Where existing search engines fail
- SphereSearch Concepts
- Transformation and Annotation
- Query Language and Scoring
- Experimental Evaluation
- Current and Future Work
27Setup for Experiments
No existing benchmark (INEX, TREC, ...) fits
- Three corpora
- Wikipedia
- extended Wikipedia with links to IMDB
- extended DBLP corpus with links to homepages
- 50 Queries like
- A(actor birthday 1970<date<1980) western
- G(California,governor) M(movie)
- A(Madonna,husband) B(director) A.person=B.director
- Opponent: keyword queries with standard TF/IDF-based score (roughly a simplified Google)
28Incremental Language Levels
SSE-Join (join conditions)
SSE-QG (query groups)
SSE-CV (concept-based conditions)
SSE-basic (keywords, SphereScores)
29Experimental Results on Wikipedia
30Experimental Results on Wiki and DBLP
- SphereScores better than local scores
- New SSE features nearly double precision
31Current and Future Work
- Improve graphical user interface
- Refined type-specific similarity measures (like geographic distances) [SIGIR-WS 2005]
- Deep Web search through automatic portal queries
- Parameter tuning with relevance feedback
- Efficiency of query evaluation through precomputation and integrated top-k (TopX talk this afternoon)
32Thank you!