Title: A Framework for Supporting DBMS-like Indexes in the Cloud

1. A Framework for Supporting DBMS-like Indexes in the Cloud
Gang Chen, Hoang Tam Vo, Sai Wu, Beng Chin Ooi, M. Tamer Özsu
2. Motivation
- NoSQL systems trade-off
  - K-V model: scalability vs. functionality (ACID, indexes)
  - Data selection on the primary key is not sufficient
    - ad-hoc queries on secondary attributes
    - OLTP queries: high selectivity, low-latency expectation
  - Cloud storage systems hold huge volumes of data
    - parallel scan: scan 1 TB of data to get 10 tuples?
- Indexes in the cloud
  - useful when query selectivity is high
  - distributed indexes
    - a central server may become a bottleneck
    - facilitate parallelism and load balance
  - how should the indexes be distributed?
  - scalability in terms of data volume, network size, and number of indexes?
3. Current State of the Art
- Asynchronous view maintenance for VLSD databases [1]
  - pre-configured queries vs. ad-hoc data selection
- Open-source systems
  - Cassandra: built-in distributed hash secondary indexes (from v0.7)
  - HBase: a secondary index created as another table (ongoing) [2]
- Closed-source systems
  - Megastore [3]
    - consistent local indexes inside an entity group
    - asynchronous global indexes across groups

[1] Asynchronous View Maintenance for VLSD Databases. SIGMOD 2009.
[2] http://hbase.apache.org/book.html#secondary.indexes
[3] Megastore: Providing Scalable, Highly Available Storage for Interactive Services. CIDR 2011.
4. Current State of the Art (2)
- P2P overlays as global distributed indexes
  - Distributed B-tree-like indexes [1]
    - based on the tree-based overlay BATON
  - Distributed R-tree-like indexes [2]
    - based on the CAN overlay

[1] Efficient B-tree Based Indexing for Cloud Data Processing. VLDB 2010.
[2] Indexing Multi-dimensional Data in a Cloud System. SIGMOD 2010.
5. Our Focus
- Context
  - efficient and elastic database service with database functionality (DaaS)
- Aims: provision of indexing functionality in the context of DaaS
  - Efficiency: the ability to locate specific records among millions of distributed candidates in real time
  - Scalability: multiple indexes (of different types) over distributed data
  - Extensibility: users can define new indexes without knowing the structure of the underlying network
  - Performance self-tuning: users do not have to tune the system performance themselves
6. Challenges of Distributed Indexes
- Different overlays are required to support different types of indexes
  - BATON for B-tree
  - CAN for R-tree
  - Chord for hash
- Overlay routing and maintenance costs are high
- Load balancing issues
  - indexed columns have different data distributions
  - it is difficult to balance the load of index nodes in the presence of multiple indexes
7. Our Approach to Providing Index Functionality in the Cloud
- Indexing as a service
  - generic overlay
  - data mapping
  - performance self-tuning
- Result
  - a simple yet efficient and extensible framework for developing distributed indexes in the cloud
8. Index Node
- Data mapping: data are transformed into a unified Cayley key space
- Index data are distributed across the cluster nodes
- Each node builds a local index for maintaining its index data
- Part of the local indexes are cached in memory
- A connection manager decides which TCP/IP connections should be maintained

(Figure: node architecture with a Cayley Graph Manager (Chord, CAN, BATON), a Buffer Manager, a Connection Manager, and the Local Indexes.)
9. Overlay Mapping
- Two interfaces for mapping a specific type of overlay to a Cayley graph
  - generating set
  - operator
- Applying the operator to the ID of an index node and to the generating set generates the routing table for that node

(Figure: the Cayley graph manager instantiates each overlay from its generating set and operator, and a routing algorithm performs index search over the resulting graph. BATON: generators {2^i : i = 0, ..., n-1}, operator mod 2^n; CAN: generators {1_i : i = 1, ..., n}, operator mod 2; Chord: generators {2^i : i = 0, ..., n-1}, operator mod 2^n.)
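As a concrete illustration, the Chord case in the figure fits in a few lines: combining the generating set {2^i : i = 0, ..., n-1} with addition mod 2^n yields each node's routing table. The function name below is a hypothetical stand-in; the framework's actual interface is not shown in the talk.

```python
def chord_routing_table(node_id, n):
    # Routing table of `node_id` in a Chord-style Cayley graph:
    # generating set {2^i : i = 0, ..., n-1}, operator: addition mod 2^n.
    return [(node_id + (1 << i)) % (1 << n) for i in range(n)]
```

For example, on an 8-node ring (n = 3), node 0's neighbours are 1, 2 and 4.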
10. Data Mapping
- Uniform data mapping
  - Load balance property: uniform data mapping provides 1-balance under the assumption that the data distribution is uniform
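Uniform data mapping can be sketched as an order-preserving linear scaling of the indexed domain onto the Cayley key space. This is a minimal sketch, assuming a numeric domain [lo, hi) and a key space of size 2^n_bits (the function name is illustrative):

```python
def uniform_map(value, lo, hi, n_bits):
    # Order-preserving linear scaling of [lo, hi) onto keys [0, 2^n_bits).
    # With uniformly distributed values, every node receives an equal key share.
    size = 1 << n_bits
    key = int((value - lo) * size / (hi - lo))
    return min(max(key, 0), size - 1)  # clamp boundary values into range
```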
11. Data Mapping (2)
- Sampling-based data mapping
  - to deal with skewed data distributions
  - stratified random sampling [1, 2]
    - partition the domain into disjoint subsets
    - take a specified number of samples from each subset
  - done when bulk-loading data from external sources into cloud databases, e.g., bulk-inserting the daily feed of new items from partners into an operational table
- Skewed online updates
  - perform data migration to re-balance

[1] Sampling Issues in Parallel Database Systems. S. Seshadri and J. F. Naughton. EDBT 1992.
[2] Efficient Bulk Insertion into a Distributed Ordered Table. A. Silberstein et al. SIGMOD 2008.
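One way to turn the samples into a skew-aware mapping is to build equi-depth partition boundaries, so that each node receives roughly the same number of keys regardless of skew. The equi-depth choice and the function names below are illustrative assumptions, not the paper's exact algorithm:

```python
import bisect

def build_boundaries(samples, n_partitions):
    # Equi-depth boundaries: each partition covers roughly the same
    # number of sampled values, regardless of skew.
    s = sorted(samples)
    return [s[len(s) * i // n_partitions] for i in range(1, n_partitions)]

def partition_of(value, boundaries):
    # Partition index = number of boundaries that are <= value.
    return bisect.bisect_right(boundaries, value)
```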
12. Index Building
- Each cluster node
  - acts as a peer in the P2P overlay
  - maintains local indexes such as hash tables, B-trees, and R-trees
- Index building
  - when data are imported, publish the index entries to the different indexes based on the P2P routing protocols
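Publishing an index entry amounts to routing it over the overlay to the node responsible for its key. A minimal sketch of greedy routing on a Chord-style ring of size 2^n (the function name and the greedy-finger policy are assumptions for illustration):

```python
def chord_route(src, dst, n):
    # Greedy routing on a 2^n ring: at each hop, jump by the largest
    # power of two that does not overshoot the clockwise distance to dst.
    size = 1 << n
    path = [src]
    cur = src
    while cur != dst:
        gap = (dst - cur) % size
        step = 1 << (gap.bit_length() - 1)  # largest power of two <= gap
        cur = (cur + step) % size
        path.append(cur)
    return path
```

Each hop at least halves the remaining distance, so a publish message reaches the responsible node in O(n) hops.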
13. Index Search
- Optimization
  - index-base-table plan vs. index-covering plan
    - index entries contain a portion of the data record
    - support a wider range of queries than materialized views
14. Index Search (2)
- Range search
  - processed on the index nodes in parallel
- Parallel scan of different indexes
  - facilitates correlated access across multiple indexes
  - especially useful for equi-joins and range joins
  - join order?
15. Index Update
- Two steps
  - delete the old corresponding index entry
  - insert the new index entry
- Index consistency
  - based on the requirements of applications
  - trade-off between performance and consistency
    - strict enforcement of ACID properties
    - less demanding: bulk update
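The delete-then-insert step can be sketched against a toy local index, here a dict mapping an index key to a set of record ids (a stand-in for the node's real hash/B-tree/R-tree index; the function name is illustrative):

```python
def update_entry(index, record_id, old_key, new_key):
    # Step 1: delete the stale entry for the record.
    ids = index.get(old_key)
    if ids is not None:
        ids.discard(record_id)
        if not ids:
            del index[old_key]  # drop empty key buckets
    # Step 2: insert the new entry.
    index.setdefault(new_key, set()).add(record_id)
```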
16. Performance Self-tuning
- Why?
  - optimize the performance of existing index nodes before launching new ones
  - complex setting of multiple indexes of different types
- Adaptively cache network connections
- Effectively buffer local indexes
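Adaptive connection caching could follow, for example, an LRU policy; the talk does not spell out the policy, so the sketch below is purely illustrative:

```python
from collections import OrderedDict

class ConnectionCache:
    # Illustrative LRU connection manager: keep at most `capacity`
    # connections open, evicting the least recently used one.

    def __init__(self, capacity):
        self.capacity = capacity
        self.conns = OrderedDict()

    def get(self, peer, open_conn):
        if peer in self.conns:
            self.conns.move_to_end(peer)  # mark as recently used
            return self.conns[peer]
        conn = open_conn(peer)  # open_conn: caller-supplied factory
        self.conns[peer] = conn
        if len(self.conns) > self.capacity:
            self.conns.popitem(last=False)  # evict LRU (real code would close it)
        return conn
```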
17. Failure and Replication
- Failures in large clusters are common
- Replication of index data
  - 24x7 service provision
  - correct retrieval of index data in the presence of failures
- Two-tier load-adaptive replication [1]
  - first tier: k copies for data reliability
  - second tier: replicas created adaptively with the query load
- Replica consistency management
  - lost updates?
- System recovery from different types of failures

[1] Towards Elastic Transactional Cloud Storage with Range Query Support. H.T. Vo, B.C. Ooi, C. Chen. PVLDB 3(1): 506-517 (2010).
18. Evaluation
- Settings
  - 64-node in-house cluster and EC2
  - storage service: HDFS
  - TPC-W (most experiments) and a synthetically generated data set (larger number of indexes and skewed data)
- Experiments
  - index plan vs. full table scan
  - index-covering vs. index-base approach
  - multiple indexes of different types
  - handling skewed data
  - scalability on EC2 (up to 256 nodes)
- Other results (not covered in this talk)
  - effect of varying the query rate
  - effect of varying the data size
  - update performance
  - performance of equi-join and range-join queries
19. Index plan vs. full table (parallel) scan
- The index plan performs much better than the full-table-scan approach
  - advantage of indexes: they quickly identify the data node that contains the qualified tuples
  - table scan time increases almost linearly with the data set size
20. Index covering vs. index-base approach
- Index covering outperforms index-base when the result set is large
  - index covering: index entries contain sufficient data for answering queries directly
- Index-base is still useful compared to a table scan
21. Multiple indexes of different types
- The generalized index is superior to one-overlay-per-index and provides the much-needed scalability
- Generalized index: one index process maintains multiple indexes and self-tunes performance via resource sharing
22. Handling skewed data
- Storage load distribution
  - sampling-based data mapping can roughly estimate the data distribution; consequently, a given percentage of nodes maintains an equivalent percentage of the index data
- Execution load imbalance
  - sampling-based data mapping distributes data among nodes better, and therefore incoming queries on skewed data are also distributed better
23. Scalability on EC2
- Elastic scaling property
  - more workload can be handled by adding more nodes to the system
24. Conclusions
- A simple yet efficient and extensible framework for supporting indexes in the cloud
- Main characteristics
  - supports indexes using P2P overlays
  - provides a high-level abstraction for defining new indexes
- Main benefits
  - reduces index creation and maintenance cost
  - provides the much-needed scalability: multiple indexes of different types over distributed data

More info at the epiC project: http://www.comp.nus.edu.sg/epiC
25. Thank you!
Questions & Answers