Title: A Framework for Supporting DBMS-like Indexes in the Cloud

1. A Framework for Supporting DBMS-like Indexes in the Cloud
Gang Chen, Hoang Tam Vo, Sai Wu, Beng Chin Ooi, M. Tamer Özsu
2. Motivation
- NoSQL systems trade-off
  - K-V model: scalability vs. functionality (ACID, indexes)
  - Data selection on the primary key is not sufficient
    - ad-hoc queries on secondary attributes
    - OLTP queries: high selectivity, low-latency expectation
  - Cloud storage systems hold huge volumes of data
    - parallel scan: scan 1 TB of data to get 10 tuples?
- Indexes in the cloud
  - useful when query selectivity is high
  - distributed indexes
    - a central server may become a bottleneck
    - facilitate parallelism and load balance
  - how should the indexes be distributed?
  - scalability in terms of data volume, network size, and number of indexes?
3. Current State of the Art
- Asynchronous view maintenance for VLSD databases [1]
  - pre-configured queries vs. ad-hoc data selection
- Open-source systems
  - Cassandra: built-in distributed hash secondary indexes (from v0.7)
  - HBase: a secondary index created as another table (ongoing) [2]
- Closed-source systems
  - Megastore [3]
    - consistent local indexes inside an entity group
    - asynchronous global indexes across groups

[1] Asynchronous View Maintenance for VLSD Databases. SIGMOD 2009.
[2] http://hbase.apache.org/book.html#secondary.indexes
[3] Megastore: Providing Scalable, Highly Available Storage for Interactive Services. CIDR 2011.
4. Current State of the Art (2)
- P2P overlays as global distributed indexes
  - Distributed B-tree-like indexes [1]
    - based on the tree-based overlay BATON
  - Distributed R-tree-like indexes [2]
    - based on the CAN overlay

[1] Efficient B-tree Based Indexing for Cloud Data Processing. VLDB 2010.
[2] Indexing Multi-dimensional Data in a Cloud System. SIGMOD 2010.
5. Our Focus
- Context
  - efficient and elastic database service with database functionality (DaaS)
- Aims: provision of indexing functionality in the context of DaaS
  - Efficiency: the ability to locate specific records among millions of distributed candidates in real time
  - Scalability: multiple indexes (of different types) over distributed data
  - Extensibility: users can define new indexes without knowing the structure of the underlying network
  - Performance self-tuning: users do not have to tune the system performance themselves
6. Challenges of Distributed Indexes
- Different overlays are required to support different types of indexes
  - BATON for B-tree
  - CAN for R-tree
  - Chord for hash
- Overlay routing and maintenance costs are high
- Load balancing issues
  - indexed columns have different data distributions
  - it is difficult to balance the load of index nodes in the presence of multiple indexes
7. Our Approach to Providing Index Functionality in the Cloud
- Indexing as a service
  - generic overlay
  - data mapping
  - performance self-tuning
- Result
  - a simple yet efficient and extensible framework for developing distributed indexes in the cloud
8. Index Node
- Data mapping: data are transformed into a unified Cayley key space
- Index data are distributed across the cluster nodes
- Each node builds a local index for maintaining its index data
- Part of the local indexes are cached in memory
- A connection manager decides which TCP/IP connections should be maintained

(Figure: node architecture with a Cayley Graph Manager (Chord, CAN, BATON), a Buffer Manager, a Connection Manager, and the Local Indexes.)
9. Overlay Mapping
- Two interfaces for mapping a specific type of overlay to a Cayley graph
  - generating set
  - operator
- Applying the operator to the ID of an index node and to the generating set generates the routing table for that node

(Figure: the Cayley graph manager instantiates each overlay from its generating set and operator, and a routing algorithm performs index search over the resulting graph. BATON: generators {2^i : i = 0, ..., n-1}, operator mod 2^n; CAN: generators {1_i : i = 1, ..., n}, operator mod 2; Chord: generators {2^i : i = 0, ..., n-1}, operator mod 2^n.)
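As a concrete illustration, the Chord case in the figure fits in a few lines: combining the generating set {2^i : i = 0, ..., n-1} with addition mod 2^n yields each node's routing table. The function name below is a hypothetical stand-in; the framework's actual interface is not shown in the talk.

```python
def chord_routing_table(node_id, n):
    # Routing table of `node_id` in a Chord-style Cayley graph:
    # generating set {2^i : i = 0, ..., n-1}, operator: addition mod 2^n.
    return [(node_id + (1 << i)) % (1 << n) for i in range(n)]
```

For example, on an 8-node ring (n = 3), node 0's neighbours are 1, 2 and 4.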
10. Data Mapping
- Uniform data mapping
  - Load balance property: uniform data mapping provides 1-balance under the assumption that the data distribution is uniform
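Uniform data mapping can be sketched as an order-preserving linear scaling of the indexed domain onto the Cayley key space. This is a minimal sketch, assuming a numeric domain [lo, hi) and a key space of size 2^n_bits (the function name is illustrative):

```python
def uniform_map(value, lo, hi, n_bits):
    # Order-preserving linear scaling of [lo, hi) onto keys [0, 2^n_bits).
    # With uniformly distributed values, every node receives an equal key share.
    size = 1 << n_bits
    key = int((value - lo) * size / (hi - lo))
    return min(max(key, 0), size - 1)  # clamp boundary values into range
```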
11. Data Mapping (2)
- Sampling-based data mapping
  - to deal with skewed data distributions
  - stratified random sampling [1, 2]
    - partition the domain into disjoint subsets
    - take a specified number of samples from each subset
  - done when bulk-loading data from external sources into cloud databases, e.g., bulk-inserting the daily feed of new items from partners into an operational table
- Skewed online updates
  - perform data migration to re-balance

[1] Sampling Issues in Parallel Database Systems. S. Seshadri and J. F. Naughton. EDBT 1992.
[2] Efficient Bulk Insertion into a Distributed Ordered Table. A. Silberstein et al. SIGMOD 2008.
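One way to turn the samples into a skew-aware mapping is to build equi-depth partition boundaries, so that each node receives roughly the same number of keys regardless of skew. The equi-depth choice and the function names below are illustrative assumptions, not the paper's exact algorithm:

```python
import bisect

def build_boundaries(samples, n_partitions):
    # Equi-depth boundaries: each partition covers roughly the same
    # number of sampled values, regardless of skew.
    s = sorted(samples)
    return [s[len(s) * i // n_partitions] for i in range(1, n_partitions)]

def partition_of(value, boundaries):
    # Partition index = number of boundaries that are <= value.
    return bisect.bisect_right(boundaries, value)
```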
12. Index Building
- Each cluster node
  - acts as a peer in the P2P overlay
  - maintains local indexes such as hash tables, B-trees, and R-trees
- Index building
  - when data are imported, publish the index entries to the different indexes based on the P2P routing protocols
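Publishing an index entry amounts to routing it over the overlay to the node responsible for its key. A minimal sketch of greedy routing on a Chord-style ring of size 2^n (the function name and the greedy-finger policy are assumptions for illustration):

```python
def chord_route(src, dst, n):
    # Greedy routing on a 2^n ring: at each hop, jump by the largest
    # power of two that does not overshoot the clockwise distance to dst.
    size = 1 << n
    path = [src]
    cur = src
    while cur != dst:
        gap = (dst - cur) % size
        step = 1 << (gap.bit_length() - 1)  # largest power of two <= gap
        cur = (cur + step) % size
        path.append(cur)
    return path
```

Each hop at least halves the remaining distance, so a publish message reaches the responsible node in O(n) hops.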
13. Index Search
- Optimization
  - index-base-table plan vs. index-covering plan
    - index entries contain a portion of the data record
    - support a wider range of queries than materialized views
14. Index Search (2)
- Range search
  - processed on the index nodes in parallel
- Parallel scan of different indexes
  - facilitates correlated access across multiple indexes
  - especially useful for equi-joins and range joins
  - join order?
15. Index Update
- Two steps
  - delete the old corresponding index entry
  - insert the new index entry
- Index consistency
  - based on the requirements of applications
  - trade-off between performance and consistency
    - strict enforcement of ACID properties
    - less demanding: bulk update
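The delete-then-insert step can be sketched against a toy local index, here a dict mapping an index key to a set of record ids (a stand-in for the node's real hash/B-tree/R-tree index; the function name is illustrative):

```python
def update_entry(index, record_id, old_key, new_key):
    # Step 1: delete the stale entry for the record.
    ids = index.get(old_key)
    if ids is not None:
        ids.discard(record_id)
        if not ids:
            del index[old_key]  # drop empty key buckets
    # Step 2: insert the new entry.
    index.setdefault(new_key, set()).add(record_id)
```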
16. Performance Self-tuning
- Why?
  - optimize the performance of existing index nodes before launching new ones
  - complex setting of multiple indexes of different types
- Adaptively cache network connections
- Effectively buffer local indexes
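Adaptive connection caching could follow, for example, an LRU policy; the talk does not spell out the policy, so the sketch below is purely illustrative:

```python
from collections import OrderedDict

class ConnectionCache:
    # Illustrative LRU connection manager: keep at most `capacity`
    # connections open, evicting the least recently used one.

    def __init__(self, capacity):
        self.capacity = capacity
        self.conns = OrderedDict()

    def get(self, peer, open_conn):
        if peer in self.conns:
            self.conns.move_to_end(peer)  # mark as recently used
            return self.conns[peer]
        conn = open_conn(peer)  # open_conn: caller-supplied factory
        self.conns[peer] = conn
        if len(self.conns) > self.capacity:
            self.conns.popitem(last=False)  # evict LRU (real code would close it)
        return conn
```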
17. Failure and Replication
- Failures in large clusters are common
- Replication of index data
  - 24x7 service provision
  - correct retrieval of index data in the presence of failures
- Two-tier load-adaptive replication [1]
  - first tier: k copies for data reliability
  - second tier: replicas created adaptively with the query load
- Replica consistency management
  - lost updates?
- System recovery from different types of failures

[1] Towards Elastic Transactional Cloud Storage with Range Query Support. H.T. Vo, B.C. Ooi, C. Chen. PVLDB 3(1): 506-517 (2010).
18. Evaluation
- Settings
  - 64-node in-house cluster and EC2
  - storage service: HDFS
  - TPC-W (most experiments) and a synthetically generated data set (larger number of indexes and skewed data)
- Experiments
  - index plan vs. full table scan
  - index-covering vs. index-base approach
  - multiple indexes of different types
  - handling skewed data
  - scalability on EC2 (up to 256 nodes)
- Other results (not covered in this talk)
  - effect of varying the query rate
  - effect of varying the data size
  - update performance
  - performance of equi-join and range-join queries
19. Index plan vs. full table (parallel) scan
- The index plan performs much better than the full-table-scan approach
  - advantage of indexes: they quickly identify the data node that contains the qualified tuples
  - table scan time increases almost linearly with the data set size
20. Index covering vs. index-base approach
- Index covering outperforms index-base when the result set is large
  - index covering: index entries contain sufficient data for answering queries directly
- Index-base is still useful compared to a table scan
21. Multiple indexes of different types
- The generalized index is superior to one-overlay-per-index and provides the much-needed scalability
- Generalized index: one index process maintains multiple indexes and self-tunes performance via resource sharing
22. Handling skewed data
- Storage load distribution
  - sampling-based data mapping can roughly estimate the data distribution; consequently, a given percentage of nodes maintains an equivalent percentage of the index data
- Execution load imbalance
  - sampling-based data mapping distributes data among nodes better, and therefore incoming queries on skewed data are also distributed better
23. Scalability on EC2
- Elastic scaling property
  - more workload can be handled by adding more nodes to the system
24. Conclusions
- A simple yet efficient and extensible framework for supporting indexes in the cloud
- Main characteristics
  - supports indexes using P2P overlays
  - provides a high-level abstraction for defining new indexes
- Main benefits
  - reduces index creation and maintenance cost
  - provides the much-needed scalability: multiple indexes of different types over distributed data

More info at the epiC project: http://www.comp.nus.edu.sg/epiC
25. Thank you!
Questions & Answers