Title: Meaningful Labeling of Integrated Query Interfaces
1. Meaningful Labeling of Integrated Query Interfaces
Eduard C. Dragut (speaker), University of Illinois at Chicago
Clement Yu, University of Illinois at Chicago
Weiyi Meng, SUNY at Binghamton
VLDB 2006, Seoul, Korea
2. A Motivating Scenario
- Looking for a ticket
- Chicago to Seoul, September 10th to September 17th
[Figure: delta.com, orbitz.com, expedia.com]
- A user looking for the best price for a ticket
- Has to explore multiple sources
- It is tedious, frustrating and time-consuming
3. The goal
- Provide a unified way to query multiple sources in the same domain
[Figure: a unified query interface over Web sources (Airfare.com, priceline.com, united.com, delta.com, nwa.com)]
4. Overview: Integrating Query Interfaces
[Figure: the (Deep) Web]
5. Overview: Integrating Query Interfaces
- Integration steps
- Structural merging of query interfaces [He03 et al., Dragut06 et al.]
- Grouping constraints
- Ancestor-descendant relationships
- Determining the domain of each global field in the integrated interface [He03 et al.]
- Meaningful labeling of the integrated interface
- The topic of this presentation
6. Motivation of Naming
- A query interface needs to be easily understood by any user, irrespective of his/her background
- The study of query interfaces in the seven domains used in our experiment revealed that the designers of query interfaces follow some hidden norms
- There are certain relationships between the labels of the fields in the same groups
- E.g., all plurals
- The labels of the (super) groups semantically characterize the set of fields underneath them
- The semantic ambiguity problem
- Synonyms and homonyms are the two sources of naming conflicts [Batini86 et al., Bright94 et al.]
7. The objectives
- The main goal is to provide a systematic way to label fields in the integrated query interface so that the concepts on the integrated query interface are easily understood by ordinary users.
- Validated through a survey
- Provide a set of desirable properties required in order to have consistent labels for the attributes within an integrated interface so that users have no difficulty in understanding it.
- Not covered in detail
8. Naming Algorithm
- The input
- A set of query interfaces in the same domain
- E.g., Airline domain: Delta, AA, NWA, Orbitz, Travelocity
- Each query interface is represented hierarchically [Wu04]
- The mapping between the fields of the query interfaces
- Organized in clusters (e.g., [Wu04 et al.], [B. He03 et al.])
- The set of groups of fields given by the merge algorithm [Dragut06 et al.]
- The integrated query interface given by the merge algorithm as a schema tree [Dragut06 et al.]
[Figure: vacations.net]
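The input listed above can be pictured with a small data-structure sketch. This is only an illustration: the `Node` class, the cluster ids, and the two toy interfaces are hypothetical, and the clusters are derived here from per-leaf annotations, whereas in the talk the mapping is given as input.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """A node of a schema tree: a field (leaf) or a group of fields (internal node)."""
    label: Optional[str]                  # an interface may leave a node unlabeled
    children: List["Node"] = field(default_factory=list)
    cluster: Optional[str] = None         # hypothetical id of the cluster a leaf maps to

    def is_leaf(self) -> bool:
        return not self.children

    def leaf_clusters(self) -> set:
        """Clusters of all descendant leaves (of the node itself if it is a leaf)."""
        if self.is_leaf():
            return {self.cluster}
        out = set()
        for child in self.children:
            out |= child.leaf_clusters()
        return out


# Two made-up airline interfaces, each represented as a schema tree.
delta = Node("Delta", [
    Node("Passengers", [
        Node("Adults", cluster="c_adult"),
        Node("Children", cluster="c_child"),
    ]),
    Node("Class", cluster="c_class"),
])
orbitz = Node("Orbitz", [
    Node("Travelers", [
        Node("Adult", cluster="c_adult"),
        Node("Child", cluster="c_child"),
        Node("Senior", cluster="c_senior"),
    ]),
])


def clusters_of(interfaces: Dict[str, Node]) -> Dict[str, Dict[str, str]]:
    """Mapping as clusters: cluster id -> {interface name: label of its field}."""
    clusters: Dict[str, Dict[str, str]] = {}

    def visit(name: str, node: Node) -> None:
        if node.is_leaf():
            clusters.setdefault(node.cluster, {})[name] = node.label
        for child in node.children:
            visit(name, child)

    for name, root in interfaces.items():
        visit(name, root)
    return clusters


mapping = clusters_of({"delta": delta, "orbitz": orbitz})
```

Each cluster collects the labels that different interfaces give to the same field, which is the raw material for the labeling steps.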
9. An Example of Input
- Three fragments of query interfaces represented hierarchically
- The mapping between them, i.e., the set of clusters
10. Naming Algorithm: Sketch
- Step 1: Consistent labeling of the fields
- Fields in the same group: use the intersect-and-union strategy
- Isolated fields: no consistency required
- Root fields: treated as a group
- Output: each group of fields (or field) has a set of candidate labels, possibly empty
- Step 2: Consistent labeling of the internal nodes
- For each internal node, starting from the lowest level to the root, apply a set of inference rules on labels
- Output: each internal node has a set of candidate labels, possibly empty
- Step 3: Enforce consistency within the entire integrated interface
- Not covered
11. Preliminaries
- Normalization (e.g., [He03 et al.], [Madhavan01 et al.], [Rahm01 et al.])
- E.g., "Adults (18-64)" becomes "adult"
- Semantic relationships among complex labels need to be established
- E.g., synonymy, hypernymy/hyponymy
- Main issues
- Thesauri provide semantic relationships only for individual content words (e.g., WordNet [Fellbaum98])
- How to show that "Area of Study" is a synonym of "Field of Work" in the Job domain?
- How to show that "Class" is a hypernym of "Class of Tickets" in the Airline domain?
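As a rough illustration of the normalization step, the sketch below reduces a label to a set of content words. The stop-word list and the naive singularization rule are assumptions made for this example; they stand in for the proper stemming or lemmatization that the cited normalization techniques would use.

```python
import re

# Small illustrative stop-word list (an assumption, not taken from the talk).
STOP_WORDS = {"of", "the", "a", "an", "in", "for", "to", "and", "or"}


def singular(word: str) -> str:
    """Very naive singularization, standing in for a real stemmer or lemmatizer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def normalize(label: str) -> frozenset:
    """Turn a label into a set of normalized content words."""
    text = re.sub(r"\([^)]*\)", " ", label)        # drop parenthesized comments
    words = re.findall(r"[a-z]+", text.lower())    # keep alphabetic tokens only
    return frozenset(singular(w) for w in words if w not in STOP_WORDS)
```

With these rules, `normalize("Adults (18-64)")` yields the single content word `adult`, matching the example on the slide.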
12. Preliminaries
- Manipulation of labels
- A label is seen as a set of normalized content words
- E.g., {area, study} corresponds to "Area of Study"
- E.g., {field, work} corresponds to "Field of Work"
- "Area of Study" is a synonym of "Field of Work"
- "Area" is a synonym of "Field" (by WordNet)
- "Study" is a synonym of "Work" (by WordNet)
- Most descriptive vs. most general labels
- E.g., Category, Job Category, Area of Work, Function
- Category and Function: too general
- Job Category and Area of Work: descriptive, avoid confusion
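The word-by-word argument for "Area of Study" and "Field of Work" can be sketched as a check over sets of content words. The tiny `WORD_SYNONYMS` table is a stand-in for a thesaurus such as WordNet, and the one-to-one pairing criterion is an assumption made for this illustration, not necessarily the exact definition used in the work.

```python
from itertools import permutations

# Toy stand-in for a thesaurus: unordered pairs of synonymous words.
WORD_SYNONYMS = {
    frozenset({"area", "field"}),
    frozenset({"study", "work"}),
}


def words_synonymous(a: str, b: str) -> bool:
    return a == b or frozenset({a, b}) in WORD_SYNONYMS


def labels_synonymous(x: frozenset, y: frozenset) -> bool:
    """True if the words of x can be paired one-to-one with synonymous words of y."""
    if len(x) != len(y):
        return False
    xs = sorted(x)
    # Labels are short, so trying every pairing is affordable.
    return any(
        all(words_synonymous(a, b) for a, b in zip(xs, perm))
        for perm in permutations(sorted(y))
    )


area_of_study = frozenset({"area", "study"})
field_of_work = frozenset({"field", "work"})
job_category = frozenset({"job", "category"})
```

Here `labels_synonymous(area_of_study, field_of_work)` holds because each word has a synonymous partner, while `job_category` has no such pairing with either label.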
13. Consistent Labeling of Groups of Fields
- Assumption
- The labels given by a query interface for the fields in the same group are consistent
- Organize the labels of a group in a relation-like form, called the group relation
- General idea to build a consistent solution
- Combine multiple rows of consistent labels until a label is assigned to each field in the group
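One possible reading of the general idea is a greedy merge over the rows of the group relation, sketched below. The example rows, the cluster ids, and the greedy order are hypothetical, and consistency is simplified to "the two rows give an identical label to some shared field"; the actual strategy in the work may differ.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]   # cluster id -> label (None if the interface lacks the field)

# Hypothetical group relation for a passenger group: one row per source interface.
group_relation: List[Row] = [
    {"adult": "Adults", "child": "Children", "senior": None},
    {"adult": "Adults", "child": None,       "senior": "Seniors"},
    {"adult": "Adult",  "child": "Child",    "senior": None},
]


def consistent(a: Row, b: Row) -> bool:
    """Simplified check: the rows give the same label to some shared cluster."""
    return any(a[c] is not None and a[c] == b[c] for c in a)


def combine(rows: List[Row]) -> Row:
    """Greedily merge consistent rows until every cluster has a label (or no row helps)."""
    # Start from the row that labels the most clusters.
    solution = dict(max(rows, key=lambda r: sum(v is not None for v in r.values())))
    progress = True
    while progress and None in solution.values():
        progress = False
        for row in rows:
            fills_gap = any(solution[c] is None and row[c] is not None for c in solution)
            if fills_gap and consistent(solution, row):
                for c in solution:
                    if solution[c] is None:
                        solution[c] = row[c]
                progress = True
    return solution
```

On the example relation the first two rows agree on "Adults", so their labels are merged into a solution with all plurals, while the singular row is never mixed in.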
14. Consistent Labeling of Groups of Fields
- Levels of consistency
- String level
- Two distinct tuples belong to this level of consistency if they have the same label for a cluster in the group relation
- Equality level
- Two distinct tuples belong to this level of consistency if they have equal labels for a cluster in the group relation
- Synonymy level
- Two distinct tuples belong to this level of consistency if they have synonym labels for a cluster in the group relation
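The three levels can be sketched as a function that reports the strongest level at which two tuples agree. The interpretation is an assumption: "string" is taken as identical label strings, "equality" as equal sets of normalized content words, and "synonymy" as synonymous normalized labels. The normalizer and the synonym table are toy stand-ins.

```python
from typing import Dict, Optional

Row = Dict[str, Optional[str]]   # cluster id -> label, or None if absent

# Toy stand-ins (assumptions): a crude normalizer and a one-entry synonym table.
TOY_SYNONYMS = {frozenset({"traveler", "passenger"})}


def toy_normalize(label: str) -> frozenset:
    words = [w for w in label.lower().split() if w != "of"]
    return frozenset(w[:-1] if w.endswith("s") and not w.endswith("ss") else w
                     for w in words)


def toy_synonymous(x: frozenset, y: frozenset) -> bool:
    # Simplification: only single-word labels are looked up in the table.
    return len(x) == len(y) == 1 and frozenset(x | y) in TOY_SYNONYMS


def consistency_level(a: Row, b: Row) -> Optional[str]:
    """Strongest level at which two tuples agree on some shared cluster."""
    shared = [c for c in a if a[c] is not None and b.get(c) is not None]
    if any(a[c] == b[c] for c in shared):
        return "string"
    if any(toy_normalize(a[c]) == toy_normalize(b[c]) for c in shared):
        return "equality"
    if any(toy_synonymous(toy_normalize(a[c]), toy_normalize(b[c])) for c in shared):
        return "synonymy"
    return None
```

For instance, tuples labeling a cluster "Adults" and "Adult" agree only at the equality level under this reading, and "Travelers" versus "Passengers" only at the synonymy level.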
15. Consistent Labeling of Internal Nodes
- The problem
- Given an internal node in the integrated interface, determine a label that is semantically suitable for it, i.e., one whose semantics is rich enough to cover the semantics of all its descendant leaf nodes
- An example
- A fragment of the integrated interface of the Real Estate domain
16. Consistent Labeling of Internal Nodes
- In assigning labels to internal nodes we mainly exploit two types of knowledge
- The semantic relationship among the labels of the internal nodes in the individual schema trees
- The relationship between internal nodes of source schema trees with overlapping sets of descendant leaves
- The two types of knowledge are employed to derive a set of logical inference rules among the textual labels
- Some of them will be exemplified next
17. Consistent Labeling of Internal Nodes
- First logical inference
- Informally, consider two internal nodes v1 and v2 of two distinct source schema trees with the property that
- v1's set of descendant leaves is a subset of v2's set of descendant leaves,
- and v1's label is a hypernym of v2's label
- Then the labels of the two nodes are semantically equivalent within the given domain of discourse
- An example
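The first inference can be written as a short predicate. The sketch assumes each internal node is summarized by its label and the set of clusters of its descendant leaves; the `HYPERNYMS` table and the real-estate labels are made up for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set, Tuple


@dataclass(frozen=True)
class InternalNode:
    label: str
    leaves: FrozenSet[str]   # clusters of the descendant leaves


# Toy hypernymy facts as (general label, specific label); illustrative only.
HYPERNYMS: Set[Tuple[str, str]] = {("location", "property location")}


def is_hypernym(general: str, specific: str) -> bool:
    return (general, specific) in HYPERNYMS


def equivalent_by_rule1(v1: InternalNode, v2: InternalNode) -> bool:
    """First inference: v1 covers no more leaves than v2, yet has the more general label."""
    return v1.leaves <= v2.leaves and is_hypernym(v1.label, v2.label)


v1 = InternalNode("location", frozenset({"city", "state"}))
v2 = InternalNode("property location", frozenset({"city", "state", "zip"}))
```

In this toy case `equivalent_by_rule1(v1, v2)` holds: the more general label is attached to the smaller set of fields, so within the domain the two labels can be treated as equivalent.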
18. Consistent Labeling of Internal Nodes
- Second logical inference (the idea)
- The same label is assigned to internal nodes in multiple source query interfaces and the descendant leaves of each such internal node are among those of the internal node in the integrated interface for which a label is sought.
- An example
- Fragment of the integrated query interface
- Within source query interfaces
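The idea of the second inference can be sketched as follows. The slide states only the precondition, so the conclusion coded here is an assumption: a label is kept as a candidate when at least two source nodes carry it and their leaves jointly cover the integrated node. The labels and cluster ids are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set, Tuple

SourceNode = Tuple[str, FrozenSet[str]]   # (label, clusters of descendant leaves)


def candidates_by_rule2(target_leaves: Set[str],
                        source_nodes: List[SourceNode]) -> Set[str]:
    covered: Dict[str, Set[str]] = defaultdict(set)
    count: Dict[str, int] = defaultdict(int)
    for label, leaves in source_nodes:
        if leaves <= target_leaves:      # leaves are among those of the target node
            covered[label] |= leaves
            count[label] += 1
    # Assumption: keep a label used by several sources whose leaves cover the target.
    return {l for l in covered if count[l] >= 2 and covered[l] == target_leaves}


target = {"adult", "child", "senior", "infant"}
sources: List[SourceNode] = [
    ("passengers", frozenset({"adult", "child"})),
    ("passengers", frozenset({"adult", "senior", "infant"})),
    ("travelers", frozenset({"adult", "child"})),
]
```

Here only "passengers" qualifies: it appears in two sources and the union of their leaves equals the leaves of the integrated node.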
19. Consistent Labeling of Internal Nodes
- Third logical inference (hypernymy scenario)
- Informally, consider two internal nodes v1 and v2 of two distinct source schema trees with the property that
- v1's label is a hypernym of v2's label
- Then v1's label semantically covers the union of the descendant nodes of the two nodes.
- An example
- Fragment of the integrated query interface
- Within source query interfaces
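A minimal sketch of the third inference, assuming nodes are given as (label, leaf clusters) pairs and hypernymy is looked up in a toy table. The cluster ids are invented; the "Class" versus "Class of Tickets" pair is borrowed from the Airline example earlier in the talk.

```python
from typing import FrozenSet, Optional, Set, Tuple

LabeledNode = Tuple[str, FrozenSet[str]]   # (label, clusters of descendant leaves)

# Toy hypernymy facts as (general label, specific label); illustrative only.
HYPERNYMS: Set[Tuple[str, str]] = {("class", "class of tickets")}


def covers_by_rule3(v1: LabeledNode, v2: LabeledNode) -> Optional[LabeledNode]:
    """Third inference: if v1's label is a hypernym of v2's, it covers both leaf sets."""
    if (v1[0], v2[0]) in HYPERNYMS:
        return v1[0], v1[1] | v2[1]
    return None


v1: LabeledNode = ("class", frozenset({"cabin"}))
v2: LabeledNode = ("class of tickets", frozenset({"cabin", "fare_type"}))
```

Applied to `v1` and `v2`, the rule returns the general label together with the union of the two leaf sets; in the reverse order it yields nothing.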
20. Where can the instances help?
- Discard labels that are values
- The problem is known as "schema element name as value" [Xu03, Dhamankar04]
- Example: in the Book domain, labels like Hardcover or Paperback are data instances of fields with labels like Format or Binding
- Reconcile most general vs. most descriptive
- The idea is to bound the meaning of the most general label to a more descriptive one
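The first use of instances can be illustrated with a simple filter. It assumes the data instances of the fields in a cluster are available as strings and that a case-insensitive exact match is enough to flag a label as a value; both are simplifications for this sketch.

```python
from typing import Iterable, List


def discard_value_labels(candidates: Iterable[str], instances: Iterable[str]) -> List[str]:
    """Drop candidate labels that also occur as data instances of the cluster's fields."""
    values = {v.lower() for v in instances}
    return [c for c in candidates if c.lower() not in values]


# Book domain example from the slide; "Audio CD" is a made-up extra instance.
labels = ["Format", "Hardcover", "Binding", "Paperback"]
field_instances = ["Hardcover", "Paperback", "Audio CD"]
kept = discard_value_labels(labels, field_instances)
```

Only Format and Binding survive as candidate labels, since Hardcover and Paperback appear among the instances.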
21. Experiment
- Setup
- Seven real-world domains
- Also used in [Wu04 et al.], [Madhavan05 et al.], [Dragut06 et al.]
22. Experiment
- Human acceptance
- Questions asked
- Do you have any difficulty in filling in an entry for each field?
- If you do, identify the fields you had difficulty filling in.
- Are the fields understandable on the source interfaces?
- 11 survey respondents reported the following
23. Example Integrated Interfaces
- Airfare domain integrated interface
- Four people found the group confusing
24. Example Integrated Interfaces
- Auto domain integrated interface
- No surveyed person identified any problem with this integrated query interface
25. End
- Please visit the project web site
- http://www.cs.uic.edu/edragut/QIProject.html
Thank you for your time and patience!