A Framework for Supporting DBMS-like Indexes in the Cloud

Provided by: vldbOrg2
Learn more at: https://www.vldb.org
1
A Framework for Supporting DBMS-like Indexes in
the Cloud
Gang Chen, Hoang Tam Vo, Sai Wu, Beng Chin Ooi,
M. Tamer Özsu
2
Motivation
  • NoSQL systems trade-off
  • K-V model: scalability vs. functionality (ACID,
    indexes)
  • Data selection on primary key is not sufficient
  • Ad-hoc queries on secondary attributes
  • OLTP queries: high selectivity, low latency
    expectation
  • Cloud storage system: huge volume of data
  • Parallel scan: scan 1 TB of data to get 10 tuples?
  • Indexes in the cloud
  • Useful when query selectivity is high
  • Distributed indexes
  • A central server may become a bottleneck
  • Facilitate parallelism and load balance

- The distribution of indexes?
- Scalability in terms of data volume, network size
  and number of indexes?
3
Current State of the Art
  • Asynchronous view maintenance for VLSD
    databases [1]
  • Pre-configured queries vs. ad-hoc data selection
  • Open-source systems
  • Cassandra
  • built-in distributed hash secondary indexes
    (from version 0.7)
  • HBase
  • a secondary index created as another table
    (ongoing) [2]
  • Closed-source systems
  • Megastore [3]
  • consistent local indexes inside an entity group
  • asynchronous global indexes across groups

[1] Asynchronous View Maintenance for VLSD Databases, SIGMOD 2009
[2] http://hbase.apache.org/book.html#secondary.indexes
[3] Megastore: Providing Scalable, Highly Available Storage for
    Interactive Services, CIDR 2011
4
Current State of the Art (2)
  • P2P overlays as global distributed indexes
  • Distributed B-tree-like indexes [1]
  • based on the tree-based overlay BATON
  • Distributed R-tree-like indexes [2]
  • based on the CAN overlay

[1] Efficient B-tree Based Indexing for Cloud Data Processing, VLDB 2010
[2] Indexing Multi-dimensional Data in a Cloud System, SIGMOD 2010
5
Our Focus
  • Context
  • Efficient and elastic database service with
    database functionality (DaaS)
  • Aims
  • Provision of indexing functionality in the
    context of DaaS
  • Efficiency
  • the ability to locate some specific records
    among millions of distributed candidates in real
    time
  • Scalability
  • multiple indexes (of different types) over
    distributed data
  • Extensibility
  • users can define new indexes without knowing the
    structure of the underlying network
  • Performance self-tuning
  • users do not have to tune the system performance
    by themselves

6
Challenges of Distributed Indexes
  • Different overlays are required to support
    different types of indexes
  • BATON for B-tree
  • CAN for R-tree
  • Chord for Hash
  • Overlay routing and maintenance costs are high
  • Load balancing issue
  • Indexed columns have different data distribution
  • Difficult to balance the load of index nodes in
    the presence of multiple indexes

7
Our Approach to providing index functionality in
the cloud
  • Indexing as a service
  • Generic overlay
  • Data mapping
  • Performance self-tuning
  • Result
  • A simple yet efficient and extensible framework
    for developing distributed indexes in the cloud

8
Index Node (architecture diagram)
  • Data mapping: data are transformed into a unified
    Cayley key space
  • Cayley graph manager: Cayley graph for the Chord, CAN
    and BATON overlays
  • Local indexes: index data are distributed to different
    cluster nodes, and each node builds a local index for
    maintaining the index data
  • Buffer manager: part of the local indexes is cached in
    memory
  • Connection manager: decides which TCP/IP connections
    should be maintained
9
Overlay Mapping
  • Two interfaces for mapping a specific type of
    overlay to Cayley graph
  • Generating set
  • Operator
  • Applying the operator to the ID of an index
    node and the generating set generates the
    routing table for that node

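Cayley graph manager (diagram): generating set and operator per overlay
  • BATON: generator $\{2^i \mid i = 0, \ldots, n-1\}$, operator mod $2^n$
  • CAN: generator $\{1_i \mid i = 1, \ldots, n\}$, operator mod $2$
  • Chord: generator $\{2^i \mid i = 0, \ldots, n-1\}$, operator mod $2^n$
  • Index search: a routing algorithm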
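
To make the "generating set plus operator" idea concrete, here is a minimal sketch of how a routing table could be derived from the two interfaces. The function and variable names are hypothetical, and the CAN case assumes a hypercube-style reading of the slide (unit-vector generators combined by component-wise addition mod 2, i.e. XOR); the slides do not spell this out.

```python
def routing_table(node_id, generators, operator):
    """Apply the operator to the node ID and each generator.

    The distinct results (excluding the node itself) are the neighbours.
    """
    neighbours = {operator(node_id, g) for g in generators}
    neighbours.discard(node_id)
    return sorted(neighbours)


n = 4  # identifier length in bits (illustrative value)

# Chord-style overlay: generators 2^i, operator is addition mod 2^n.
chord_generators = [2 ** i for i in range(n)]


def chord_operator(x, g):
    return (x + g) % (2 ** n)


# Assumed hypercube-style CAN: each generator is a unit vector, encoded
# here as a single-bit mask; addition mod 2 per component is XOR.
can_generators = [1 << i for i in range(n)]


def can_operator(x, g):
    return x ^ g


print(routing_table(3, chord_generators, chord_operator))   # [4, 5, 7, 11]
print(routing_table(0b0101, can_generators, can_operator))  # [1, 4, 7, 13]
```

Only the generating set and the operator differ between the two overlays; `routing_table` itself is shared, which is the point of mapping overlays onto one generic structure.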
10
Data Mapping
  • Uniform data mapping
  • Load balance property
  • Uniform data mapping provides 1-balance with the
    assumption that the data distribution is uniform
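
As an illustration only, a uniform mapping can be sketched as a linear scaling of the attribute domain onto an n-bit key space, with the key space split into equal contiguous ranges per node. The function names and the equal-range ownership rule are assumptions, not details taken from the slides.

```python
from collections import Counter


def uniform_key(value, lo, hi, n_bits):
    """Scale a value from the domain [lo, hi) linearly onto [0, 2^n_bits)."""
    if not lo <= value < hi:
        raise ValueError("value outside the indexed domain")
    return int((value - lo) / (hi - lo) * (1 << n_bits))


def owner(key, num_nodes, n_bits):
    """Node owning a key when the key space is split into equal ranges."""
    return (key * num_nodes) >> n_bits


# Uniformly distributed values end up evenly spread over the nodes.
load = Counter(owner(uniform_key(v, 0, 100, 16), 4, 16) for v in range(100))
print(dict(load))  # {0: 25, 1: 25, 2: 25, 3: 25}
```

With skewed values the same mapping would overload some nodes, which motivates the sampling-based mapping on the next slide.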

11
Data Mapping
  • Sampling-based data mapping
  • To deal with skewed data distribution
  • Stratified random sampling [1,2]
  • partition the domain into disjoint subsets
  • take a specified number of samples from each
    subset
  • Done when bulk loading data from external sources
    into cloud databases, e.g., bulk insert of the daily
    feed of new items from partners into an operational
    table
  • Skewed online update
  • perform data migration to re-balance

[1] S. Seshadri and J. F. Naughton. Sampling Issues in Parallel
    Database Systems. EDBT, 1992.
[2] A. Silberstein et al. Efficient Bulk Insertion into a
    Distributed Ordered Table. SIGMOD, 2008.
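
The sampling-based mapping can be sketched roughly as follows. This is an assumption-laden illustration: it uses equal-width strata with proportional allocation (the same sampling fraction per stratum) as one possible reading of "a specified number of samples", and it turns the sample into range boundaries via quantiles. All function names are hypothetical.

```python
import bisect
import random


def stratified_sample(values, lo, hi, num_strata, fraction, rng):
    """Split [lo, hi) into equal-width strata; sample a fraction of each."""
    width = (hi - lo) / num_strata
    strata = [[] for _ in range(num_strata)]
    for v in values:
        strata[min(int((v - lo) / width), num_strata - 1)].append(v)
    sample = []
    for stratum in strata:
        sample.extend(rng.sample(stratum, round(len(stratum) * fraction)))
    return sample


def boundaries_from_sample(sample, num_nodes):
    """Use sample quantiles as split points between consecutive nodes."""
    ordered = sorted(sample)
    return [ordered[len(ordered) * i // num_nodes] for i in range(1, num_nodes)]


def node_for(value, boundaries):
    """Index of the node whose range contains the value."""
    return bisect.bisect_right(boundaries, value)


rng = random.Random(0)
skewed = [100 * rng.random() ** 3 for _ in range(10_000)]  # mass near 0
bounds = boundaries_from_sample(
    stratified_sample(skewed, 0, 100, 10, 0.1, rng), num_nodes=4
)
counts = [0] * 4
for v in skewed:
    counts[node_for(v, bounds)] += 1
print(bounds, counts)
```

Because the split points follow the sampled distribution rather than the raw domain, each node receives a roughly similar share of the skewed data.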
12
Index Building
  • Each cluster node
  • acts as a peer in the P2P overlay
  • maintains local indexes such as hash, B-trees
    and R-trees
  • Index building
  • when data are imported
  • publish the index entries to different indexes
    based on P2P routing protocols
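
A minimal in-memory sketch of the publishing step, assuming each node keeps one local index per indexed column. The `route` function is a hypothetical stand-in for the overlay's routing protocol (a plain checksum here), not the framework's actual routing.

```python
import zlib
from collections import defaultdict


class IndexNode:
    """A cluster node holding one local index (a dict here) per column."""

    def __init__(self):
        self.local_indexes = defaultdict(lambda: defaultdict(set))

    def insert(self, index_name, key, primary_key):
        self.local_indexes[index_name][key].add(primary_key)


def route(index_name, key, nodes):
    """Stand-in for P2P routing: choose a node from a stable checksum."""
    digest = zlib.crc32(repr((index_name, key)).encode())
    return nodes[digest % len(nodes)]


def publish(primary_key, record, indexed_columns, nodes):
    """On import, send one index entry per indexed column to its node."""
    for column in indexed_columns:
        route(column, record[column], nodes).insert(
            column, record[column], primary_key
        )


nodes = [IndexNode() for _ in range(4)]
publish(1, {"title": "Dune", "price": 9}, ["title", "price"], nodes)
print(route("price", 9, nodes).local_indexes["price"][9])  # {1}
```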

13
Index Search
  • Optimization
  • Index+base table vs. index covering plan
  • index entries contain portion of data record
  • support a wider range of queries than
    materialized views
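
The difference between the two plans can be illustrated with a toy example. The table contents and names below are invented for illustration: the index+base plan follows primary keys into the base table, while the covering plan answers the query from the index entries alone.

```python
base_table = {
    1: {"title": "Dune", "price": 9, "stock": 3},
    2: {"title": "Emma", "price": 7, "stock": 0},
    3: {"title": "Ulysses", "price": 9, "stock": 5},
}

# Index+base: the index on price stores only primary keys.
price_index = {9: [1, 3], 7: [2]}

# Covering index: each entry also carries the title column.
price_index_covering = {9: [(1, "Dune"), (3, "Ulysses")], 7: [(2, "Emma")]}


def titles_index_base(price):
    """Look up primary keys in the index, then fetch each base record."""
    return [base_table[pk]["title"] for pk in price_index.get(price, [])]


def titles_index_covering(price):
    """Answer directly from the index entries; no base table access."""
    return [title for _pk, title in price_index_covering.get(price, [])]


print(titles_index_base(9))      # ['Dune', 'Ulysses']
print(titles_index_covering(9))  # ['Dune', 'Ulysses']
```

The covering plan saves one base-table access per matching record, at the cost of larger index entries.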

14
Index Search
  • Range search
  • Process on index nodes in parallel
  • Parallel scan of different indexes
  • Facilitate correlated access across multiple
    indexes
  • Especially useful for equi-join and range join
  • Join order ?
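
A rough sketch of a parallel range search, assuming index keys are range-partitioned across nodes by sorted split points. The partitioning scheme, the thread pool standing in for remote calls, and all names are assumptions made for illustration.

```python
import bisect
from concurrent.futures import ThreadPoolExecutor


def nodes_for_range(lo, hi, boundaries):
    """Indices of the nodes whose key ranges overlap [lo, hi]."""
    first = bisect.bisect_right(boundaries, lo)
    last = bisect.bisect_right(boundaries, hi)
    return range(first, last + 1)


def range_search(lo, hi, boundaries, node_data):
    """Scan only the overlapping nodes, in parallel, and merge in order."""

    def scan(node_index):
        return sorted(k for k in node_data[node_index] if lo <= k <= hi)

    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(scan, nodes_for_range(lo, hi, boundaries)))
    return [key for part in parts for key in part]


boundaries = [10, 20, 30]  # node 0: <10, node 1: [10, 20), ...
node_data = [[1, 5, 9], [10, 15], [20, 25, 29], [30, 42]]
print(range_search(8, 22, boundaries, node_data))  # [9, 10, 15, 20]
```

Node 3 is never contacted for this query, since its range does not overlap [8, 22].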

15
Index Update
  • Two steps
  • Delete the old corresponding index entry
  • Insert the new index entry
  • Index consistency
  • Based on requirements of applications
  • Trade-off between performance and consistency
  • strict enforcement of ACID properties
  • less demanding bulk update

16
Performance Self-tuning
  • Why?
  • Optimize the performance of existing index nodes
    before launching new ones
  • Complex setting of multiple indexes of different
    types
  • Adaptively cache network connections
  • Effectively buffer local indexes

17
Failure and Replication
  • Failures in large clusters are common
  • Replication of index data
  • 24X7 service provision
  • Correct retrieval of index data in the presence
    of failures
  • Two-tier load-adaptive replication [1]
  • first tier: k copies for data reliability
  • second tier: replicas created adaptively with
    query load
  • Replica consistency management
  • Lost updates?
  • System recovery from different types of failures

[1] H.T. Vo, B.C. Ooi, C. Chen. Towards Elastic Transactional
    Cloud Storage with Range Query Support. PVLDB 3(1): 506-517 (2010)
18
Evaluation
  • Settings
  • 64-node in-house cluster and EC2
  • Storage service: HDFS
  • TPC-W (most experiments) and a synthetically
    generated data set (larger number of indexes and
    skewed data)
  • Experiments
  • Index plan vs. full table scan
  • Index covering vs. index+base approach
  • Multiple indexes of different types
  • Handling skewed data
  • Scalability on EC2 (up to 256 nodes)
  • Other results (not covered in this talk)
  • Effect of varying query rate
  • Effect of varying data size
  • Update performance
  • Performance of equi-join and range join queries

19
Index plan vs. full table (parallel) scan
  • Index plan performs much better than the full
    table scan approach
  • Advantage of indexes being able to identify the
    data node that contains the qualified tuple
    quickly
  • Table scan time increases almost linearly along
    with the data set size

20
Index covering vs. index+base approach
  • Index covering outperforms index+base when the
    size of the result set is large
  • Index covering: index entries contain sufficient
    data for answering queries directly
  • Index+base is still useful compared to table scan

21
Multiple indexes of different types
  • Generalized index is superior to
    one-overlay-per-index, and provides the much
    needed scalability
  • Generalized index: one index process maintains
    multiple indexes and self-tunes the performance
    via resource sharing

22
Handling skewed data
  • Storage load distribution
  • Sampling-based data mapping can roughly estimate
    the data distribution and consequently, a certain
    percentage of nodes maintains an equivalent
    percentage of index data
  • Execution load imbalance
  • Sampling-based data mapping distributes data
    among nodes better and therefore, incoming
    queries on skewed data are also distributed
    better

23
Scalability on EC2
  • Elastic scaling property
  • More workload can be handled by adding more nodes
    into the system

24
Conclusions
  • A simple yet efficient and extensible framework
    for supporting indexes in the cloud
  • Main characteristics
  • Support indexes using P2P overlays
  • Provide a high-level abstraction for defining new
    indexes
  • Main benefits
  • Reduce index creation and maintenance cost
  • Provide the much needed scalability
  • multiple indexes of different types over
    distributed data

More info at the epiC project: http://www.comp.nus.edu.sg/epiC
25
Thank you!
Questions & Answers