Title: Interactive Query Formulation over Web ServiceAccessed Sources
1Interactive Query Formulationover Web
Service-Accessed Sources
- Michalis Petropoulos
- Alin Deutsch
- Yannis Papakonstantinou
ACM SIGMOD, June 2006
2Large-Scale Data Integration Systems
?
Web Domain
Web Forms Reports
- CNETs Top Combinations
- CNETs Search Desktops
- PCWorlds Product Finder
End User
?
Application Domain
?
?
- CNET Computer
- PCWorld Portals
Developer
Application
Application
?
Integration Domain
Mediator
Integrated Schema
Compatible Combinations of Computers, Routers an
d Printers
Integration Engineer
Source Domain
- Dell Computers by CPU
- Cisco Routers by Rate
- HP Printers by Speed
?
Web Service
Web Service
Web Service
Source Owner
Data Source
Data Source
- Dell Computers
- Cisco Routers
- HP Printers
Source Schema
Source Schema
3Large-Scale Data Integration Systems
?
Web Domain
Web Forms Reports
End User
?
What queries can the mediator answer for me? CL
IDE
Application Domain
?
?
Developer
Application
Application
?
Integration Domain
Mediator
Integrated Schema
Integration Engineer
Source Domain
?
Web Service
Web Service
Web Service
Source Owner
Data Source
Data Source
Source Schema
Source Schema
4Running Example
Parameterized Views
Dell
Cisco
- Schema
- Computers(cid, cpu, ram, price)
- NetCards(cid, rate, standard, interface)
- Views
- V1 ComByCpu(cpu) ? (Computer)
- SELECT DISTINCT Com1.
- FROM Computers Com1
- WHERE Com1.cpucpu
- V2 ComNetByCpuRate(cpu, rate) ?
- (Computer, NetCard)
- SELECT DISTINCT Com1., Net1.
- FROM Computers Com1, Network Net1
- WHERE Com1.cidNet1.cid
Schema Routers(rate, standard, price, type)
Views V3 RouWired() ? (Router) SELECT
DISTINCT Rou1. FROM Routers Rou1 WHERE Rou1.t
ype'Wired' V4 RouWireless() ? (Router) SE
LECT DISTINCT Rou1. FROM Routers Rou1 WHERE R
ou1.type'Wireless'
Wired Routers
Computers for a given cpu
Wireless Routers
Computers NetCards for a given cpu rate
- Conjunctive Queries CQ
- Equality Comparison Conditions
- Parameters
5Running Example
Integrated Schema
?
?
Developer
Application
- Integrated schema puts togetherthe Dell and
Cisco schemas
- Attribute Associations
- (Computers.cid, NetCards.cid)
- (NetCards.rate, Routers.rate)
- (NetCards.standard, Routers.standard)
Mediator
Integrated Schema
V1
V3
V2
V4
Dell
Cisco
6Sophisticated Mediators MakeFeasible Queries
Hard to Predict
- Feasible Queries FQ
- Equivalent CQ query rewritings using the views
- Might involve more than one views
- Order might matter
Feasible
Query Get all P4 Computers, together with thei
r NetCards and their compatible Wireless Route
rs
Query Get all Computers
Infeasible
E
Mediator
B
D
Mediator
A
C
RouWireless()
ComNetByCpuRate(P4, 10)
V1
ComNetByCpuRate(P4, 54)
V4
V2
7Problem
- Large number of sources
- Large number of views (web-services)
- Mediator capabilities
- Developer formulates an application query
- Is an application query feasible?
- If not, how do I know which ones are feasible?
- Previous options
- The developer had to browse the view definitions
and somehow formulate a feasible query
- Or formulate queries until a feasible one is
found(trial-and-error)
- No system-provided guidance
8The CLIDE Solution
CLIDE
?
?
Developer
Application
- A query formulation interface, which
interactively guides the developer toward
feasible queries by employing a coloring scheme
Mediator
Integrated Schema
V1
V3
V2
V4
Dell
Cisco
9QBE-Like Interfaces
Microsoft SQL-Server
10CLIDE Interface
Feasibility Flag
Selection Boxes
- Table, selection, projection and join actions
- Feasibility Flag
- Color-based suggestions
11Example Interaction
Snapshot 1
- Yellow ? required action
- All feasible queries require this action
- White ? optional action
- Feasible queries can be formulatedw/ or w/o
these actions
12Example Interaction
Snapshot 2
- Blue ? required choice of action
- At least one feasible query cannot be formulated
unless this action is performed
C
Mediator
A
ComByCpu(P4)
B
V1
13Example Interaction
Snapshot 2
- Blue ? required choice of action
- At least one of them is required to be taken in
order to reach a feasible query
- Join Lines
- Only yellow and blue are displayed
- Must appear in Attribute Associations
14CLIDE Properties
- Completeness of Suggestions
- Every feasible query can be formulated by
performing yellow and blue actions at every step
- Minimality of Suggestions
- At every step, only a minimal number of actions
is suggested, i.e., the ones that are needed to
preserve completeness
- Rapid Convergence By Following Suggestions
- The shortest sequence of actions from a query to
any feasible query consists of suggested actions
15Example Interaction
Snapshot 3
- ? any other constant
- Red ? prohibited action
- Does not appear in any feasible query
- Lead to Dead End state
16Example Interaction
Snapshot 4
F
Mediator
A
D
RouWireless()
ComNetByCpuRate(P4, rate)
B
E
V4
V2
17CLIDE Properties
- Completeness of Suggestions
- Every feasible query can be formulated by
performing yellow and blue actions at every step
- Minimality of Suggestions
- At every step, only a minimal number of actions
is suggested, i.e., the ones that are needed to
preserve completeness
- Rapid Convergence By Following Suggestions
- The shortest sequence of actions from a query to
any feasible query consists of suggested actions
18Interaction Graph
Join Action
Table Action
Selection Action
Com1.cidNet1.cid
Com1.cpuP4
Com1
Com1.ram
Rou1
Com1.price
Net1
- Nodes are queries
- One for each q?CQ
- Edges are actions
- Table, selection, projection and join actions
- Green nodes are feasible queries
- Infinitely big structure
- All CQ queries
- All possible combinations of actions formulating
them
19Interaction Graph Colors
- Yellow action ?
- Every path from current node n to a feasible node
contains ?
- Blue action ?
- At least one path from current node n to a
feasible node nF contains ?
- There is not path from n to nF that contains a
feasible node
- Red action ?
- No path to a feasible node contains ?
Current Node
Com1.cid
Com1.cpu
Current Node
Com1.cid
Com1.cpu
Com1.cpu
Net1
Com1.cidNet1.cid
Net1.rate54Mbps
Com1.cidNet1.cid
Com1.ram
Com1.cidNet1.cid
Net1.rate54Mbps
Net1.rate54Mbps
Com1.price
Com1.cpu
Rou1
Net1.rateRou1.rate
Com1.cidNet1.cid
Rou1
Net1
Com2
Rou1
Com2
Com2
Com2.cidNet1.cid
Com2.cpuP4
Net1.rate54Mbps
20CLIDE Properties
- Rapid Convergence By Following Suggestions
- The shortest path from the current node to any
feasible node consists of suggested actions
21CLIDE Architecture
?
Actions
Front-End
User
Current Query
Colored Actions Feasibility Flag
Back-End
Color Algorithm
Seed Queries SQ
Parameters Algorithm
Closest Feasible Queries FQC
Closest Feasible Queries Algorithm
Aliases Collapse Rule
Minimal Feasible Extension Queries
Maximally-Contained Rewriter
Views
Schemas
Attribute Associations
- Back-End invoked every time the user performs an
action
- i.e., the user arrives at a new node in the
interactions graph
22Color DeterminedBy a Finite Set of Feasible
Queries
Challenge Infinitely Many Feasible Queries
?
Radius
n
Closest Feasible Queries FQC
Solution
- Finite Set of Feasible Queries
- Closest Feasible Queries FQC
- FQC is sufficient to color actions
Challenge How far can the Closest Feasible
Queries FQC be?
23Closest Feasible Queries FQC Algorithm
Solution Based on Maximally Contained Queries
FQMC
Maximally Contained Queries FQMC
pL Radius
n
Closest Feasible Queries FQC
- Compute maximally contained queries FQMC
- Radius pL is the longest path to a maximally
contained query
- Theorem All FQC queries are reachable via a
path of length p ? pL
Challenge The set of nodes within pL can be too
many to explore
24Closest Feasible Queries FQC Algorithm
Solution Find the Closest Feasible Queries
Directly
Maximally Contained Queries FQMC
n
Closest Feasible Queries FQC
More feasible nodes
- Theorem All queries in FQMC are in FQC
- But not all queries in FQC are in FQMC
- Collapse Aliases to compute FQC \ FQMC
25Remaining Issues of the Algorithm
- Color remaining actions white or red
- Handling projections
- Dealing with parameterized queries
26CLIDE Implementation Issues
- Maximally contained queries need to be
minimized(affect CLIDEs rapid convergence and
minimality)
- ? Implemented the minimization module from
scratch
- ? MiniCon computes mappings between the query
and the views
- ? Exposed these mappings to guide minimization
- Optimizations to bring the response time below 3
sec
- ? Amortize the computation across multiple
interactions steps by observing that query is
changed minimally at every step
- ? Instead of calling MiniCon from scratch at
every step, we incrementally compute the closest
feasible queries
27Thank you
- Play with our demo
- http//www.clide.info
28Maximally Contained Queries FQMC
Maximally Contained Query
Query Q2 Get all Computers with a given cpu
Query Q1 Get all Computers
- Assuming fixed SELECT clause (projection list)
- Covered extensively in literature
- MiniCon, Bucket, InverseRules Algorithms
- FQMC is finite
29Closest Feasible Queries FQC Algorithm
Solution Collapse Aliases
Maximally Contained Feasible Queries FQMC
Closest Feasible Queries FQC
n
- Collapse Aliases to compute FQC \ FQMC
- Check satisfiability
30Color Algorithm
- Yellow and Blue
- An action ? is colored based on which closest
feasible queries it appear in
- Yellow, if ? appears in all queries in FQC
- Blue, if ? appears in at least one (but not all)
query in FQC
- White and Red
- Attach Maximum Projection Lists to Closest
Feasible Queries
- Projections that can be added to a feasible
query, without compromising feasibility
- Projection ? is white if in the maximum
projection list
- Color selections based on projections
31CLIDE Implementation Optimizations
- Views expansion introduce redundancy
- Affects CLIDEs rapid convergence and minimality
- Efficient containment test crucial to redundancy
removal
32CLIDE Performance
Chains of Stars
A-span 7 B-span 3 Selections 4,6,8,10
A