Database Engine Design a.k.a. Research@ DSL

About This Presentation

Slides: 29

Provided by: sercIis

Transcript and Presenter's Notes

1
Database Engine Design a.k.a. Research@DSL

Jayant Haritsa

2
Database Management Systems (DBMS)

Efficient and convenient mechanisms for storing,
querying and maintaining enterprise data
Cornerstone of computer industry
Uses more than 80 percent of computers worldwide
Employs more than 70 percent of computer
professionals
Largest monetary sector of computer business

3
DBMS FEATURES

Handle data of arbitrary size
Income-Tax records are in Petabytes (10^15 bytes)
Self-contained
contains both data and meta-data
Program-Data insulation
application s/w not affected by storage changes

Example: the same record under two storage layouts
Layout 1: SR No | Name | Address | Hostel | GPA
Layout 2: SR No | Name | Address | GPA | Hostel
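
As a rough illustration of program-data insulation, the sketch below stores the same records under the two column orders shown on this slide and runs one unchanged query against both. It uses Python's built-in sqlite3 module. The table names (layout_a, layout_b), the sample students and the GPA threshold are made up for the example and are not from the talk.

```python
import sqlite3

# Hypothetical student records: (sr_no, name, address, hostel, gpa).
STUDENTS = [
    (1, "Asha", "Bangalore", "H1", 7.5),
    (2, "Ravi", "Mysore", "H2", 6.0),
]

def names_with_gpa_above(conn, table, threshold):
    """Application query: it names columns and never relies on their order."""
    # The table name is interpolated only because it is a hard-coded demo value.
    sql = f"SELECT name FROM {table} WHERE gpa > ? ORDER BY name"
    return [name for (name,) in conn.execute(sql, (threshold,))]

conn = sqlite3.connect(":memory:")
# Same attributes, two physical column orders (Hostel/GPA swapped).
conn.execute("CREATE TABLE layout_a (sr_no INTEGER, name TEXT, address TEXT, hostel TEXT, gpa REAL)")
conn.execute("CREATE TABLE layout_b (sr_no INTEGER, name TEXT, address TEXT, gpa REAL, hostel TEXT)")
for table in ("layout_a", "layout_b"):
    conn.executemany(
        f"INSERT INTO {table} (sr_no, name, address, hostel, gpa) VALUES (?, ?, ?, ?, ?)",
        STUDENTS,
    )

result_a = names_with_gpa_above(conn, "layout_a", 7.0)
result_b = names_with_gpa_above(conn, "layout_b", 7.0)
print(result_a, result_b)
```

Both calls return the same list, because the application code refers to columns by name and leaves the layout to the engine.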
4
DBMS FEATURES (contd)

Declarative Access
state what you want, not how to get it
On-the-Fly Questions
ask new questions without writing new programs
PEACE OF MIND
changes to the database are guaranteed to be
immune to subsequent system failures
Sri Sri Ravishankar of the Information World
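
A minimal sketch of declarative access and on-the-fly questions, again using Python's sqlite3 module. The students table, its rows and the helper ask() are hypothetical. The point is only that each new question is a new query string, with no new retrieval program.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (sr_no INTEGER PRIMARY KEY, name TEXT, hostel TEXT, gpa REAL)")
# Hypothetical sample data.
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?)",
    [(1, "Asha", "H1", 7.5), (2, "Ravi", "H2", 6.0), (3, "Meera", "H1", 6.5)],
)

def ask(question_sql, params=()):
    """Hand a question to the engine; the engine decides how to answer it."""
    return conn.execute(question_sql, params).fetchall()

# Declarative: state what is wanted (names with GPA >= 7), not how to scan for it.
high_gpa = ask("SELECT name FROM students WHERE gpa >= ? ORDER BY name", (7.0,))

# On-the-fly: a question nobody planned for, still without a new program.
avg_by_hostel = ask("SELECT hostel, AVG(gpa) FROM students GROUP BY hostel ORDER BY hostel")

print(high_gpa)
print(avg_by_hostel)
```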

5
Current Database Systems

Commercial
IBM DB2 / Oracle / Microsoft SQL Server / Sybase
Public-domain
PostgreSQL / MySQL / Berkeley DB

6
DBMS Myths ?

Databases? Isn't that the boring part of
accounting?
Hazaar dumb Cobol programming!
Maha-bore - almost as dull as watching
Rahul Dravid bat!
High-tech name for data entry!
Will only get job with TCS!
...

7
DBMS Realities ?

Design of database engines has lots of really,
really interesting intellectual problems with
practical impact
theory, algorithms, data structures, experiments,
prototypes
Turing awards
1981 Edgar Codd (relational data model)
1999 Jim Gray (transaction model)
Ullman, Silberschatz, Papadimitriou,
Rajaraman, Patnaik, Balakrishnan,
Jacob/Govindarajan

8
Database Systems Lab (DSL)

Established 1995

9
Research Topics

Real-Time Database Systems
Distributed Transaction Management
OODBMS
Web Databases
Data Mining
XML Databases
Biological Databases
Query Optimization
Multilingual Databases
Music Databases

1995-2000
2000-2005
Last few years
10
Research Trajectory
CORE DB TECHNOLOGY
Access Methods
Transaction Processing
Query Processing
11
Research Techniques

Theory
real-time, data mining, query optimization
Simulation studies
real-time, distributed, web dbms
Empirical evaluation
data mining, biological, multilingual dbms, query
optimization
Prototype development
OODBMS (Flexible Manufacturing: MIDAS, VLSI: DIAS,
Bio-diversity: Oshadhi, Bodhi)
XML (Storage: LegoDB, Compression: XGrind)
Query Optimization (Clustering: Plastic,
Visualization: Picasso)
Multilingual Databases (Cross-lingual SQL: Mira)

12
SPINE: Putting Backbone into Genomic Sequence
Indexing

13
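Standard Genomic Index: Suffix Tree [Weiner 1973]
Vertically-compressed trie of suffixes augmented
with links
Data: GTTAATTACT (positions 0 1 2 3 4 5 6 7 8 9)
Suffix Links (xW -> W)
Tree Edges
Search for Query TTA: found at positions 1 and 5
[Figure: suffix tree of GTTAATTACT with tree edges and suffix links]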
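
To make the search on this slide concrete, here is a small sketch that indexes the slide's data string and looks up TTA. It is a deliberate simplification: an uncompressed suffix trie with no suffix links, built in quadratic time, not the linear-time suffix tree construction the slide refers to. The function names are illustrative only.

```python
def build_suffix_trie(data):
    """Insert every suffix of data into a trie.

    Each node remembers the start positions of all suffixes passing through it,
    so a lookup can report where a query occurs.
    """
    root = {"children": {}, "starts": []}
    for start in range(len(data)):
        node = root
        for ch in data[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start)
    return root

def find_occurrences(root, query):
    """Walk down the trie along query; return the sorted start positions, or []."""
    node = root
    for ch in query:
        node = node["children"].get(ch)
        if node is None:
            return []
    return sorted(node["starts"])

trie = build_suffix_trie("GTTAATTACT")
print(find_occurrences(trie, "TTA"))  # -> [1, 5]
```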
14
Locate all Maximal Matching Substrings [Chang &
Lawler 1990]

For each position in the query sequence Q, locate
all longest matching substrings of at least a
threshold length in the indexed data sequence D
Example: D = GTTAATTACT, Q = CTAATGA,
threshold length = 3
Result (substring <position in D, position in Q>):
TAAT<2,1>, AAT<3,2>
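
The problem statement above can be checked with a brute-force sketch. This is not the suffix-tree algorithm of the talk. It simply compares every query position against every data position, which is enough to reproduce the slide's example. The function names and the result format (substring, position in D, position in Q) are choices made for this illustration.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def maximal_matches(data, query, min_len):
    """For each query position, report the longest match in data if it reaches min_len.

    Returns (substring, data_position, query_position) tuples. If the longest
    match occurs at several data positions, all of them are reported.
    """
    results = []
    for q_pos in range(len(query)):
        lengths = [common_prefix_len(data[d_pos:], query[q_pos:])
                   for d_pos in range(len(data))]
        best = max(lengths, default=0)
        if best >= min_len:
            for d_pos, length in enumerate(lengths):
                if length == best:
                    results.append((query[q_pos:q_pos + best], d_pos, q_pos))
    return results

matches = maximal_matches("GTTAATTACT", "CTAATGA", 3)
print(matches)  # -> [('TAAT', 2, 1), ('AAT', 3, 2)]
```

The brute-force version costs time proportional to the product of the two lengths times the match length. The index-based approach discussed in the following slides exists to avoid exactly that cost.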

15
Maximal Substring Search with Suffix Tree Index
D = GTTAATTACT (positions 0 1 2 3 4 5 6 7 8 9)
Q = CTAATGA, threshold length = 3
[Figure: suffix tree of D, with edge labels and leaf positions, traversed for Q]
16
Features of Suffix Tree Index

Accurate retrieval
no false negatives (unlike BLAST)
Linear Time Complexity for both Construction and
Search!
because of Suffix-links
Widely used
More than 40-50 applications over biological
sequences [Gusfield, 2002]
MUMmer [Celera Genomics], AVID, ...

17
Crippling Limitation

Viable only for sequences that are short enough
for their associated suffix tree to fit
completely in main memory [Baeza-Yates and
Navarro, 2000]
Best that has been built so far is for sequences
of 10 Mbp (Human Genome is 300 times longer!)

18
Difficulties in Supporting Suffix Trees on Long
Sequences - 1

Space overheads are enormous
Order(s) of magnitude larger than data!
Human Genome can be easily stored in main memory
(1 GB), but the index could be of the order of
10-100 GB
Disk-resident suffix trees for long sequences
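
A back-of-the-envelope check of these numbers, as a sketch. It takes the genome length implied by the previous slide (300 times 10 Mbp) and the 10-100 GB index range from this slide. The 2-bits-per-base packing and the use of 10^9 bytes per GB are assumptions made here, not figures from the talk.

```python
GENOME_BASES = 300 * 10_000_000   # 300 x 10 Mbp, as implied by the previous slide
BITS_PER_BASE = 2                 # assumption: A, C, G, T packed into 2 bits each
GB = 10**9                        # assumption: decimal gigabytes

# Raw data size under the packing assumption.
genome_bytes = GENOME_BASES * BITS_PER_BASE // 8

# Index bytes spent per indexed base, for the 10 GB and 100 GB ends of the range.
index_bytes_per_base = [size_gb * GB / GENOME_BASES for size_gb in (10, 100)]

print(genome_bytes / GB)                               # data size in GB
print([round(x, 1) for x in index_bytes_per_base])     # index bytes per base
```

Under these assumptions the data takes about 0.75 GB, while the quoted index sizes correspond to roughly 3 to 33 bytes per base, compared with a quarter of a byte per base for the data itself.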

19
Difficulties in Supporting Suffix Trees on Long
Sequences - 2

Tree Construction on Disk is Very Slow
Due to disk thrashing from random seeks

"The active suffix creeps through the text like a
caterpillar; the corresponding active node swings
through the tree like a butterfly" [Giegerich and
Kurtz, 1995]
20
Difficulties in Supporting Suffix Trees on Long
Sequences - 3

Searching on Disk is Very Slow
Unbalanced Tree Structure
Shape of tree depends on the stochastic
properties of the sequence
Multi-directional traversals cause disk
thrashing
Tree-Edge: Vertical Walk-Down
Suffix-Link: Horizontal Jump-Across

Suffix Tree Search

Edge-Link mesh
Two-phase Search
Locate
Report

Combination of Batman and Spiderman!
21
Alternative Approach: Horizontal Compaction
D = ACCACAC

Merge duplicate paths
Inter-path compaction, not intra-path
More compaction potential
Horizontal compaction
Global Elimination
Vertical compaction
Local Structural Merging
Difficult because of false positives
ACACAC appears present

[Figure: trie of all suffixes of D = ACCACAC]
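
The sketch below builds the plain (uncompacted) trie of all suffixes of the slide's example and counts its nodes, to show how much repeated structure there is to merge. It does not implement SPINE or any compaction. It only shows the starting point, along with the exact membership test that a compacted structure must preserve (ACACAC must stay absent). The function names are illustrative.

```python
def build_suffix_trie(data):
    """Plain trie of all suffixes of data, as nested dicts (one dict per node)."""
    root = {}
    for start in range(len(data)):
        node = root
        for ch in data[start:]:
            node = node.setdefault(ch, {})
    return root

def count_nodes(node):
    """Number of nodes below node (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.values())

def contains(root, query):
    """Exact substring test: True only if query labels a path from the root."""
    node = root
    for ch in query:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("ACCACAC")
print(count_nodes(trie))          # -> 19 nodes for a 7-character sequence
print(contains(trie, "CACAC"))    # -> True
print(contains(trie, "ACACAC"))   # -> False: not a substring of ACCACAC
```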
22
The SPINE Index: A Horizontally-Compacted Trie
Index
Sequence Processing INdexing Engine
23
SPINE Index Structure
D = ACCACAC
Complete horizontal compaction into single linear
chain!!
Root node

Nodes
Forward Edges
Vertebras (Backbone)
Ribs / Ext-Ribs
Backward Edges
Links

[Figure: SPINE index for D, with legend: Link, Rib, Extension rib, Vertebra]
24
Structural Advantages of SPINE w.r.t. Suffix
Trees

Number of nodes is equal to the length of the string,
whereas a suffix tree can have up to twice as many.
Entire data sequence explicitly embedded in index
=> throw away the data!
On-line incremental algorithm (by definition)
do not need to possess entire data sequence in
advance
Node creation order and logical order are the
same => prefix-partitionable

D = ACCA
25
Advantages of SPINE (contd)

Each node represents a set of suffixes whereas in
suffix tree each node represents only a single
suffix
Number of suffixes processed for construction and
searching is smaller
Easy to develop buffering strategies
for persistent implementations

26
SPINE Performance Summary

Data Sets:
E. coli: 3.5 Mbp
C. elegans: 15.5 Mbp
HC21: 28.5 Mbp
HC19: 57.5 Mbp
Baseline: Suffix Tree (MUMmer - Celera Genomics)
SPINE Space:
2/3 of Suffix Tree
SPINE Time:
Construction: 1/2 of Suffix Tree
Searching: 1/2 of Suffix Tree

27
SPINE Summary

First index based on horizontal (inter-path)
compaction of the trie
Collapses into a single linear structure
Improved features and performance w.r.t. suffix
trees, the classical index
Prefix-partitionable (first index to have this
property)
Easily amenable to persistent disk implementation
Retains linear time/space complexity
Better construction speed and capacity
Better search response times
