INDEXING DATASPACES by Xin Dong - PowerPoint PPT Presentation

1
INDEXING DATASPACES by Xin Dong and Alon Halevy
  • ITCS 6010
  • FALL 2008
  • Presented by VISHAL SHETH

2
AGENDA
  • Background
  • Motivation
  • Problem Definition
  • Indexing Structure
  • Experimental Evaluation
  • Related Work
  • Conclusion
  • Future Work

3
Background
  • Indexing:
  • A technique for faster query execution and
    result retrieval; an index can be created on one
    or more columns of a DB table
  • More indexes mean faster query performance, but
    also longer transformation/load times
  • Types of indexes: B-Tree, Bitmap
  • Dataspace:
  • A data co-existence approach that forms a
    semantic web of inter-related / similar things,
    e.g. a music dataspace
  • DS indexing vs. DB indexing:

DB INDEXING | DS INDEXING
Indexing on tables of a relational DB from the same source | Indexing on a dataspace with heterogeneous data sources
Data is structured | Data may be structured or unstructured
Underlying DB schema is very well defined (relational) | Underlying schema may or may not be known (DB, XML, Doc, PPT)
4
Motivation
  • Indexing data from disparate data sources is a
    large and challenging problem
  • Queries that combine keywords and structure must
    be answered efficiently
  • Faster execution of queries on semantically
    different data

5
  • Indexing Heterogeneous Data
  • Support queries over different types of data
  • Data may or may not be semantically similar
  • Data may be structured (XML/DB/spreadsheet) or
    unstructured/partially structured files
    (PPT/DOC/email/LaTeX files/web pages)
  • Extract associations / relationships between
    structured data, unstructured data, or both

6
Solution to Indexing Heterogeneous Data
  • Results of queries are typically from different
    sources (XML/tuples)
  • Index (an inverted list) is built whose leaves
    are references to data items in the individual
    sources

7
Solution (Contd.)
  • Data is modeled as a set of triples, called a
    triple base; each triple takes the form
    (instance, attribute, value) or (instance,
    association, instance)
  • An instance is a real-world object described by
    multi-valued attributes
  • An association is a directional relationship
    between two instances (the two directions of a
    particular association are named differently)
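
The triple model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the class name TripleBase, its method names, the inverse association name authoredPaper and the sample values are all hypothetical.

```python
from collections import defaultdict


class TripleBase:
    """Toy triple base holding attribute and association triples."""

    def __init__(self):
        # instance -> attribute -> list of values (attributes are multi-valued)
        self.attributes = defaultdict(lambda: defaultdict(list))
        # instance -> association -> list of associated instances
        self.associations = defaultdict(lambda: defaultdict(list))

    def add_attribute(self, instance, attribute, value):
        """Store an (instance, attribute, value) triple."""
        self.attributes[instance][attribute].append(value)

    def add_association(self, source, name, target, inverse_name):
        """Store an (instance, association, instance) triple in both
        directions; each direction carries its own name."""
        self.associations[source][name].append(target)
        self.associations[target][inverse_name].append(source)


tb = TripleBase()
tb.add_attribute("a1", "title", "Birch")        # hypothetical value
tb.add_attribute("p1", "firstName", "Tian")     # hypothetical value
tb.add_association("a1", "author", "p1", "authoredPaper")

print(dict(tb.associations["p1"]))  # {'authoredPaper': ['a1']}
```

Storing both directions up front keeps the lookup of associated instances symmetric, at the cost of writing each association twice.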

8
Example of a Triple Base
Legend: a = Article instance, p = Person instance,
c = Conference instance. a1 is associated with p1,
p2 and c1.
9
  • Querying Heterogeneous Data
  • Support queries that are independent of the
    structure of the underlying data sources
  • Support queries in which users specify some
    structure, or none at all

10
Solution
  • Two types of queries are proposed
  • Predicate Queries
  • Describe the desired instances by a set of
    predicates
  • Each predicate specifies an attribute value or an
    associated instance
  • Example: "Raghu's Birch paper in Sigmod 1996"
  • Three predicates: (title "Birch"), (author
    "Raghu"), (publishedIn "1996 Sigmod")
  • Definition of a predicate query
  • Each predicate is of the form (v, K1, ...,
    Kn), where v is a verb (an attribute or an
    association) and K1, ..., Kn are keywords
  • v is an attribute → attribute predicate; v is an
    association → association predicate
  • Returned instances need to satisfy at least one
    predicate in the query
  • An instance satisfies an attribute predicate if
    it contains at least one of K1, ..., Kn in the
    values of attribute v or sub-attributes of v
  • An instance o satisfies an association predicate
    if there exists i, 1 ≤ i ≤ n, such that o has an
    association v or sub-association of v with an
    instance o' that has Ki in an attribute value
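
The definition above can be checked against a tiny triple base without any index. The sketch below is a naive evaluation for illustration only: the dictionaries, the function names and the sample values are assumptions, and sub-attributes and sub-associations are left out for brevity.

```python
# Hypothetical triple base, stored as plain dictionaries.
ATTRIBUTES = {
    "a1": {"title": ["Birch clustering"]},
    "p1": {"firstName": ["Tian"]},
    "p2": {"firstName": ["Raghu"]},
    "c1": {"name": ["Sigmod"], "year": ["1996"]},
}
ASSOCIATIONS = {
    "a1": {"author": ["p1", "p2"], "publishedIn": ["c1"]},
}


def contains_keyword(instance, keywords, attribute=None):
    """True if any keyword occurs in the instance's attribute values
    (restricted to one attribute when `attribute` is given)."""
    for name, values in ATTRIBUTES.get(instance, {}).items():
        if attribute is not None and name != attribute:
            continue
        words = {w.lower() for v in values for w in v.split()}
        if any(k.lower() in words for k in keywords):
            return True
    return False


def satisfies(instance, predicate):
    """Check one predicate (v, K1, ..., Kn) against one instance."""
    verb, *keywords = predicate
    # Attribute predicate: a keyword occurs in the values of attribute v.
    if contains_keyword(instance, keywords, attribute=verb):
        return True
    # Association predicate: an instance reached through v has a keyword.
    targets = ASSOCIATIONS.get(instance, {}).get(verb, [])
    return any(contains_keyword(t, keywords) for t in targets)


def predicate_query(predicates):
    """Return the instances that satisfy at least one predicate."""
    instances = set(ATTRIBUTES) | set(ASSOCIATIONS)
    return sorted(i for i in instances
                  if any(satisfies(i, p) for p in predicates))


query = [("title", "Birch"), ("author", "Raghu"),
         ("publishedIn", "1996", "Sigmod")]
print(predicate_query(query))  # ['a1']
```

Every predicate here scans the attribute values of the candidate instances, which is the cost the index structures in the later slides are meant to avoid.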

11
  • Neighborhood Keyword Queries
  • Extends keyword search by considering associations
  • A neighborhood keyword query is a set of
    keywords, K1, . . . ,Kn
  • Definition of a Neighborhood Keyword query
  • An instance satisfies a neighborhood keyword
    query if
  • It contains at least one of K1, . . . ,Kn in
    attribute values. (relevant instance)
  • OR
  • The instance is associated (in either direction)
    with a relevant instance (associated instance)
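
As a rough illustration of this definition, the following sketch separates relevant instances from associated ones using a plain scan. The data, the edge list and the function name are hypothetical; no index is involved yet.

```python
# Hypothetical attribute values per instance.
ATTRIBUTES = {
    "a1": {"title": ["Birch clustering"]},
    "a2": {"title": ["Query optimization"]},
    "p1": {"firstName": ["Tian"]},
    "p2": {"firstName": ["Raghu"]},
    "c1": {"name": ["Sigmod"]},
}
# Directed association edges: (source, association, target).
EDGES = [
    ("a1", "author", "p1"),
    ("a1", "author", "p2"),
    ("a1", "publishedIn", "c1"),
]


def neighborhood_keyword_query(keywords):
    """Return (relevant, associated) instance sets for the keywords."""
    wanted = {k.lower() for k in keywords}
    relevant = set()
    for instance, attrs in ATTRIBUTES.items():
        words = {w.lower()
                 for values in attrs.values()
                 for v in values
                 for w in v.split()}
        if wanted & words:
            relevant.add(instance)
    associated = set()
    for source, _, target in EDGES:
        # An association counts in either direction.
        if source in relevant:
            associated.add(target)
        if target in relevant:
            associated.add(source)
    return relevant, associated - relevant


relevant, associated = neighborhood_keyword_query(["birch"])
print(sorted(relevant), sorted(associated))  # ['a1'] ['c1', 'p1', 'p2']
```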

12
Inverted Lists
  • It is a 2-D table with indexed keywords (as rows)
    and instances (as columns)
  • Concept
  • ith row represents indexed keyword Ki
  • jth column represents instance Ij
  • Cell (Ki, Ij) records the number of occurrences
    (called the occurrence count) of keyword Ki in
    the attributes of Ij
  • Non-zero cell value → instance Ij is indexed on
    Ki
  • Keywords are sorted and arranged in an
    alphabetical order in the list
  • Instances are ordered by their identifiers
  • No structural information present
  • Stored as sorted array or a prefix B-Tree
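
A basic inverted list of this kind might be built as follows. This is a minimal sketch with hypothetical sample values; it stores only the non-zero cells of the 2-D table as nested dictionaries rather than a sorted array or prefix B-Tree.

```python
from collections import Counter, defaultdict

# Hypothetical attribute values per instance.
ATTRIBUTES = {
    "p1": {"firstName": "Tian", "lastName": "Zhang"},
    "p3": {"firstName": "Jeff", "lastName": "Tian"},
}


def build_inverted_list(attributes):
    """Map each keyword to {instance: occurrence count}."""
    index = defaultdict(Counter)
    for instance, attrs in attributes.items():
        for value in attrs.values():
            for word in value.lower().split():
                index[word][instance] += 1
    # Keywords in alphabetical order, instances ordered by identifier.
    return {k: dict(sorted(index[k].items())) for k in sorted(index)}


inverted = build_inverted_list(ATTRIBUTES)
print(inverted["tian"])  # {'p1': 1, 'p3': 1}
```

The lookup returns both p1 and p3 for "tian" and cannot tell a first name from a last name, which is the missing structural information noted in the last bullets.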

13
Inverted Lists (Contd.)
14
Indexing Structure
  • It is an extension of the inverted list that
    addresses some of its issues (missing structural
    information), e.g. is "Tian" a last name or a
    first name?
  • It describes how attributes and associations are
    indexed to support predicate queries
  • Two ways:
  • Indexing attributes → ATtribute Inverted List
    (ATIL)
  • Indexing associations → Attribute-Association
    Inverted List (AAIL)

15
Indexing Attribute
  • Alternative: build an index for each attribute
    (excessive overhead)
  • Alternative: specify the attribute name in the
    cells of the IL (complex query answering)
  • ATIL (k = keyword, a = attribute, I = instance)
  • There is a row in the IL for k//a// when k
    appears in the value of a
  • The cell (k//a//, I) records the occurrence count
  • E.g. attribute predicate (LastName, Tian)
  • Query converted to a keyword query as
    Tian//LastName//
  • Search yields p3 and not p1
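
The k//a// row scheme can be sketched as below. The sample data and function names are assumptions, and keys are lowercased here for simplicity, whereas the slide writes Tian//LastName//.

```python
from collections import Counter, defaultdict

# Hypothetical attribute values per instance.
ATTRIBUTES = {
    "p1": {"firstName": "Tian", "lastName": "Zhang"},
    "p3": {"firstName": "Jeff", "lastName": "Tian"},
}


def build_atil(attributes):
    """Build rows keyed by 'keyword//attribute//'."""
    index = defaultdict(Counter)
    for instance, attrs in attributes.items():
        for attribute, value in attrs.items():
            for word in value.lower().split():
                index[f"{word}//{attribute.lower()}//"][instance] += 1
    return index


def attribute_predicate(index, attribute, keyword):
    """Answer (attribute, keyword) with a single keyword lookup."""
    key = f"{keyword.lower()}//{attribute.lower()}//"
    return dict(index.get(key, {}))


atil = build_atil(ATTRIBUTES)
print(attribute_predicate(atil, "LastName", "Tian"))  # {'p3': 1}
```

Folding the attribute name into the indexed keyword keeps a single index while still distinguishing p3 (last name) from p1 (first name).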

16
Indexing Association
  • Naive approach: perform a keyword search, find the
    set of instances that contain the keywords, then
    find the associated instances of each instance
    (very expensive)
  • AAIL (k = keyword, r = association, I = instance,
    a = attribute)
  • There is a row in the IL for k//r// when k
    appears in the value of an attribute a of an
    instance that I is associated with through r
  • The cell (k//r//, I) records the occurrence count
  • E.g. query "Raghu's paper"
  • It has the association predicate (author
    "Raghu"), which becomes the keyword
    raghu//author//
  • Search yields a1
  • ATIL + association information → slightly slower
    at answering attribute predicates, but speeds up
    answering association predicates
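
One possible reading of the AAIL rows is sketched below: attribute rows as in ATIL, plus one k//r// row per keyword found in an associated instance. The data and names are hypothetical, only one direction of each association is listed, and keys are lowercased.

```python
from collections import Counter, defaultdict

# Hypothetical attribute values per instance.
ATTRIBUTES = {
    "a1": {"title": "Birch"},
    "p1": {"firstName": "Tian"},
    "p2": {"firstName": "Raghu"},
}
# (source, association, target); only one direction is listed here.
EDGES = [("a1", "author", "p1"), ("a1", "author", "p2")]


def build_aail(attributes, edges):
    """Build attribute rows (k//a//) and association rows (k//r//)."""
    index = defaultdict(Counter)
    # Attribute rows, exactly as in ATIL.
    for instance, attrs in attributes.items():
        for attribute, value in attrs.items():
            for word in value.lower().split():
                index[f"{word}//{attribute.lower()}//"][instance] += 1
    # Association rows: keywords of the target, credited to the source.
    for source, association, target in edges:
        for value in attributes.get(target, {}).values():
            for word in value.lower().split():
                index[f"{word}//{association.lower()}//"][source] += 1
    return index


aail = build_aail(ATTRIBUTES, EDGES)
print(dict(aail["raghu//author//"]))  # {'a1': 1}
```

The association predicate is answered by one row lookup instead of a keyword search followed by a traversal of associations.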

17
Indexing Hierarchies
  • Answering predicate queries over attributes that
    form a hierarchy
  • E.g. query (Name, Tian): results p1 and p3
  • Find all the descendants of an attribute
    (FirstName, LastName and NickName)
  • Expand the scope of the query by adding the above
    attributes
  • E.g. Tian//Name// OR Tian//FirstName// and so
    on
  • This incurs multiple index lookups and is hence
    expensive
  • Solutions:
  • Attribute IL with Duplication (Dup-ATIL)
  • Attribute IL with Hierarchies (Hier-ATIL)
  • Hybrid Attribute IL (Hybrid-ATIL)

18
Index With Duplication
  • Duplicate the row of an attribute name for each
    of its ancestors
  • Dup-ATIL (k = keyword, a0 = attribute, a =
    ancestor of a0, I = instance)
  • There is a row in the IL for k//a//
  • The cell (k//a//, I) records the occurrence count
    of k in the values of a of I
  • E.g. query (Name, Tian) → results retrieved:
    p1 and p3
  • Large index size when the hierarchy is long → a
    problem
  • Appropriate when k occurs in many a0 with common
    ancestors
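
The duplication idea can be illustrated as follows. The hierarchy table PARENT, the sample data and the function names are assumptions made for this sketch; keys are lowercased.

```python
from collections import Counter, defaultdict

# Hypothetical attribute hierarchy: child -> parent.
PARENT = {"firstname": "name", "lastname": "name", "nickname": "name"}

# Hypothetical attribute values per instance.
ATTRIBUTES = {
    "p1": {"firstName": "Tian", "lastName": "Zhang"},
    "p3": {"firstName": "Jeff", "lastName": "Tian"},
}


def with_ancestors(attribute):
    """Yield the attribute followed by each of its ancestors."""
    current = attribute.lower()
    while current is not None:
        yield current
        current = PARENT.get(current)


def build_dup_atil(attributes):
    """Add each occurrence under the attribute and every ancestor."""
    index = defaultdict(Counter)
    for instance, attrs in attributes.items():
        for attribute, value in attrs.items():
            for word in value.lower().split():
                for name in with_ancestors(attribute):
                    index[f"{word}//{name}//"][instance] += 1
    return index


dup_atil = build_dup_atil(ATTRIBUTES)
print(dict(dup_atil["tian//name//"]))      # {'p1': 1, 'p3': 1}
print(dict(dup_atil["tian//lastname//"]))  # {'p3': 1}
```

A query on Name is now a single row lookup, but every occurrence is stored once per level of the hierarchy, which is where the index-size concern comes from.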

19
Index with Hierarchy Path
  • The indexed keyword includes the hierarchy path
  • Hier-ATIL (k = keyword, a = attribute, I =
    instance)
  • Hierarchy path a0//...//an// for attribute an
  • There is a row for k//a0//...//an//
  • The cell (k//a0//...//an//, I) records the
    occurrence count of k in I's an attributes
  • E.g. query (Name, Tian) → prefix search
    Tian//Name// → results p1 and p3
  • Answering a query by converting it into a prefix
    search can be more expensive than a keyword
    search
  • Appropriate when k occurs in a few a with common
    ancestors
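
A hierarchy-path index with prefix search might look like the sketch below. The hierarchy table, the sample data and the function names are hypothetical; the sorted dictionary plus `bisect` stands in for whatever sorted structure a real index would use.

```python
from bisect import bisect_left
from collections import Counter, defaultdict

# Hypothetical attribute hierarchy: child -> parent.
PARENT = {"firstname": "name", "lastname": "name", "nickname": "name"}

# Hypothetical attribute values per instance.
ATTRIBUTES = {
    "p1": {"firstName": "Tian", "lastName": "Zhang"},
    "p3": {"firstName": "Jeff", "lastName": "Tian"},
}


def hierarchy_path(attribute):
    """Return 'root//...//attribute//' for an attribute."""
    names = [attribute.lower()]
    while names[-1] in PARENT:
        names.append(PARENT[names[-1]])
    return "".join(f"{n}//" for n in reversed(names))


def build_hier_atil(attributes):
    """Build rows keyed by 'keyword//' followed by the hierarchy path."""
    index = defaultdict(Counter)
    for instance, attrs in attributes.items():
        for attribute, value in attrs.items():
            for word in value.lower().split():
                index[f"{word}//{hierarchy_path(attribute)}"][instance] += 1
    return dict(sorted(index.items()))  # rows sorted by indexed keyword


def prefix_search(index, prefix):
    """Sum the counts of all rows whose key starts with `prefix`."""
    keys = list(index)
    total = Counter()
    for key in keys[bisect_left(keys, prefix):]:
        if not key.startswith(prefix):
            break
        total.update(index[key])
    return dict(total)


hier_atil = build_hier_atil(ATTRIBUTES)
print(prefix_search(hier_atil, "tian//name//"))  # {'p1': 1, 'p3': 1}
```

Each occurrence is stored once, but a query on Name has to read and merge every row under the prefix, which is the extra cost noted in the bullets above.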

20
Hybrid Index
  • Combination of Dup-ATIL and Hier-ATIL
  • Hybrid-ATIL (k = keyword, a0 = attribute, a =
    ancestor of a0, I = instance)
  • Build an IL that answers a prefix-search query
    by reading fewer rows than a threshold (t)
  • Hierarchy path a0//...//an// for attribute an
  • p = k//a0//...//an// is an indexed keyword
  • The cell (p//, I) records the occurrence count of
    k in I's an attributes
  • E.g. query (Name, Jeff) → prefix search
    Jeff//Name// → result p3
  • E.g. query (Name, Tian) → prefix search
    Tian//Name// → results p1 and p3

21
Neighborhood Keyword Queries
  • Keyword Inverted List (KIL)
  • Equal to Hybrid-AAIL
  • Summarizes prefixes ending with a hierarchy path
    and also those corresponding to keywords
  • Keywords (k1, ..., kn) are transformed to a
    prefix search (k1//, ..., kn//)
  • E.g. query "birch" → prefix search birch//
    → results a1, c1, p1, p2
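
The effect of turning a keyword into the prefix search k// can be illustrated as below. This sketch is a simplification under stated assumptions: it builds plain attribute and association rows (both directions) and scans them linearly, it omits the hybrid summarization of the real KIL, and the data, inverse association names and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical attribute values per instance.
ATTRIBUTES = {
    "a1": {"title": "Birch"},
    "p1": {"firstName": "Tian"},
    "p2": {"firstName": "Raghu"},
    "c1": {"name": "Sigmod"},
}
# (source, association, target, inverse association name)
EDGES = [
    ("a1", "author", "p1", "authoredPaper"),
    ("a1", "author", "p2", "authoredPaper"),
    ("a1", "publishedIn", "c1", "publishedPaper"),
]


def build_rows(attributes, edges):
    """Build k//a// rows and k//r// rows for both association directions."""
    rows = defaultdict(Counter)

    def add(instance, label, described):
        # Credit `instance` with every keyword of `described` under `label`.
        for value in attributes.get(described, {}).values():
            for word in value.lower().split():
                rows[f"{word}//{label.lower()}//"][instance] += 1

    for instance, attrs in attributes.items():
        for attribute, value in attrs.items():
            for word in value.lower().split():
                rows[f"{word}//{attribute.lower()}//"][instance] += 1
    for source, name, target, inverse in edges:
        add(source, name, target)      # forward direction
        add(target, inverse, source)   # reverse direction
    return rows


def neighborhood_lookup(rows, keywords):
    """Collect instances from every row that starts with 'keyword//'."""
    hits = set()
    for keyword in keywords:
        prefix = f"{keyword.lower()}//"
        for key, cell in rows.items():  # linear scan, for brevity only
            if key.startswith(prefix):
                hits.update(cell)
    return sorted(hits)


rows = build_rows(ATTRIBUTES, EDGES)
print(neighborhood_lookup(rows, ["birch"]))  # ['a1', 'c1', 'p1', 'p2']
```

The attribute row contributes the relevant instance a1, and the association rows contribute the associated instances c1, p1 and p2.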

22
Experimental Evaluation
  • Indexing structure + text → improves performance
    in answering both types of queries
  • Data set: personal data on a desktop and some
    external sources
  • Associations and relationships extracted from
    disparate items are stored in an RDF file managed
    by Jena
  • RDF: Resource Description Framework
  • Jena: Java framework supporting Semantic Web
    applications
  • The RDF file had 105,320 object instances, 300,354
    attribute values and 468,402 association
    instances; file size 52.4 MB
  • Four types of queries
  • PQAS: Predicate Queries with Attribute (no
    sub-attributes)
  • PQAC: Predicate Queries with Attribute (with
    sub-attributes)
  • PQR: Predicate Queries with association
  • NKQ: Neighborhood Keyword Queries
  • Hardware
  • 4 CPUs (with 3.2 GHz processor and 1 MB cache
    memory)
  • 1 GB memory (RAM)

23
Performance
  • Alternative approaches: NAÏVE (basic IL) and
    SEPIL (3 separate indexes: IL, structure index
    and relationship index)
  • Both return instances without occurrence counts,
    which incurs extra overhead
  • Clauses: introducing some variation (e.g.
    changing the number of keywords)

24
Performance (Contd.)
  • Compare the efficiency of ATIL with a technique
    that creates a separate index for each attribute
  • ATIL reduces indexing time by 63% and
    keyword-lookup time by 33%

25
Related Work
  • Indexing XML
  • Indexing on Structure
  • Schema-driven queries (list all book authors)
  • Does not index text values
  • Indexing on Value
  • Indexes text values and encodes
    parent-child/ancestor-descendant relation
  • Indexing on both
  • Combines indexes on structure and on text
  • Indexing keyword queries in relational DBs
  • DISCOVER, DBXplorer and BANKS require building a
    join network at run time, which is expensive

26
Conclusion
  • Novel indexing approach to support flexible
    querying over dataspaces
  • Inverted lists are used for creating indexes
  • The IL captures the structure, including
    attributes of instances, relationships between
    instances and hierarchies of schema elements
  • The experimental results show that the IL speeds
    up query answering

27
Future Work
  • Extend indexes to support heterogeneous
    (attribute) values
  • Appropriate ranking algorithms