Database Engine Design a.k.a. Research@ DSL - PowerPoint PPT Presentation

Slides: 29
Provided by: sercIis
Transcript and Presenter's Notes

1
Database Engine Design a.k.a. Research @ DSL
  • Jayant Haritsa

2
Database Management Systems (DBMS)
  • Efficient and convenient mechanisms for storing,
    querying and maintenance of enterprise data
  • Cornerstone of computer industry
  • Uses more than 80 percent of computers worldwide
  • Employs more than 70 percent of computer
    professionals
  • Largest monetary sector of computer business

3
DBMS FEATURES
  • Handle data of arbitrary size
  • Income-Tax records are in Petabytes (10^15 bytes)
  • Self-contained
  • contains both data and meta-data
  • Program-Data insulation
  • application s/w not affected by storage changes

[Figure: the same student record under two storage layouts: (SR No, Name, Address, Hostel, GPA) and (SR No, Name, Address, GPA, Hostel)]
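Program-data insulation can be illustrated with a small sketch. This is only an illustration using Python's built-in sqlite3 module; the table name, column names, and sample record are made up for the example and do not come from the slides. The same application query runs unchanged against two tables whose columns are stored in different orders.

```python
import sqlite3

def gpa_by_name(column_order):
    """Create a students table with the given column order and read one record by column name."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE students ({', '.join(column_order)})")
    # Hypothetical sample record, keyed by column name.
    row = {"sr_no": 1, "name": "Asha", "address": "IISc", "hostel": "B", "gpa": 7.5}
    placeholders = ", ".join(":" + c for c in column_order)
    conn.execute(f"INSERT INTO students VALUES ({placeholders})", row)
    # The application names the columns it wants; it never relies on their position.
    result = conn.execute("SELECT name, gpa FROM students WHERE sr_no = 1").fetchone()
    conn.close()
    return result

layout_a = ["sr_no", "name", "address", "hostel", "gpa"]
layout_b = ["sr_no", "name", "address", "gpa", "hostel"]
print(gpa_by_name(layout_a) == gpa_by_name(layout_b))  # True
```

The application code is identical for both layouts; only the schema definition differs.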
4
DBMS FEATURES (contd)
  • Declarative Access
  • state what you want, not how to get it
  • On-the-Fly Questions
  • ask new questions without writing new programs
  • PEACE OF MIND
  • changes to the database are guaranteed to be
    immune to subsequent system failures
  • Sri Sri Ravishankar of the Information World
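A minimal sketch of declarative access and on-the-fly questions, assuming Python's built-in sqlite3 module as a stand-in engine. The table and its rows are invented for illustration. Each new question is just a new query string; no new access program is written, and the engine chooses how to evaluate it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (sr_no INTEGER, name TEXT, hostel TEXT, gpa REAL)")
# Hypothetical sample data.
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?)",
    [(1, "Asha", "A", 7.5), (2, "Ravi", "B", 6.0), (3, "Meena", "A", 8.0)],
)

def ask(sql):
    """Run an ad hoc query; the query says what is wanted, not how to fetch it."""
    return conn.execute(sql).fetchall()

top_students = ask("SELECT name FROM students WHERE gpa >= 7.0 ORDER BY name")
avg_by_hostel = ask("SELECT hostel, AVG(gpa) FROM students GROUP BY hostel ORDER BY hostel")
print(top_students)   # [('Asha',), ('Meena',)]
print(avg_by_hostel)  # [('A', 7.75), ('B', 6.0)]
```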

5
Current Database Systems
  • Commercial
  • IBM DB2 / Oracle / Microsoft SQL Server / Sybase
  • Public-domain
  • PostgreSQL / MySQL / Berkeley DB

6
DBMS Myths ?
  • Databases? Isn't that the boring part of
    accounting?
  • Hazaar dumb Cobol programming!
  • Maha-bore - almost as dull as watching
    Rahul Dravid bat!
  • High-tech name for data entry!
  • Will only get job with TCS!
  • ...

7
DBMS Realities ?
  • Design of database engines has lots of really,
    really interesting intellectual problems with
    practical impact
  • theory, algorithms, data structures, experiments,
    prototypes
  • Turing awards
  • 1981 Edgar Codd (relational data model)
  • 1998 Jim Gray (transaction model)
  • Ullman, Silberschatz, Papadimitriou,
  • Rajaraman, Patnaik, Balakrishnan,
    Jacob/Govindarajan

8
Database Systems Lab (DSL)
  • Established 1995

9
Research Topics
  • Real-Time Database Systems
  • Distributed Transaction Management
  • OODBMS
  • Web Databases
  • Data Mining
  • XML Databases
  • Biological Databases
  • Query Optimization
  • Multilingual Databases
  • Music Databases

[Timeline labels: 1995-2000, 2000-2005, last few years]
10
Research Trajectory
CORE DB TECHNOLOGY
Access Methods
Transaction Processing
Query Processing
11
Research Techniques
  • Theory
  • real-time, data mining, query optimization
  • Simulation studies
  • real-time, distributed, web dbms
  • Empirical evaluation
  • data mining, biological, multilingual dbms, query
    optimization
  • Prototype development
  • OODBMS (Flexible Manufacturing: MIDAS; VLSI:
    DIAS; Bio-diversity: Oshadhi, Bodhi)
  • XML (Storage: LegoDB; Compression: XGrind)
  • Query Optimization (Clustering: Plastic;
    Visualization: Picasso)
  • Multilingual Databases (Cross-lingual SQL: Mira)

12
SPINE: Putting Backbone into Genomic Sequence
Indexing

13
Standard Genomic Index: Suffix Tree [Weiner 1973]
Vertically-compressed trie of suffixes augmented
with links
Positions: 0 1 2 3 4 5 6 7 8 9
Data: GTTAATTACT
Suffix Links (xW → W)
Tree Edges
Search for Query TTA: matches at positions 5 and 1
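The search on this slide can be reproduced with a much simpler structure. The sketch below builds an uncompressed suffix trie (no vertical compression, no suffix links, quadratic space), so it is not Weiner's linear-time suffix tree, only a small illustration of searching by walking down from the root. The function names are chosen for this example.

```python
def build_suffix_trie(data):
    """Insert every suffix of data into a nested-dict trie (quadratic size, illustration only)."""
    root = {"children": {}, "starts": []}
    for start in range(len(data)):
        node = root
        for ch in data[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            # Every node remembers the start positions of the suffixes passing through it.
            node["starts"].append(start)
    return root

def find_all(root, query):
    """Walk down the trie along query; the node reached lists all match positions."""
    node = root
    for ch in query:
        node = node["children"].get(ch)
        if node is None:
            return []
    return sorted(node["starts"])

trie = build_suffix_trie("GTTAATTACT")
print(find_all(trie, "TTA"))  # [1, 5]
```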
14
Locate all Maximal Matching Substrings [Chang and
Lawler 1990]
  • For each position in the query sequence Q, locate
    all longest matching substrings of at least a
    minimum length in the indexed data sequence D
  • Example: D = GTTAATTACT, Q = CTAATGA,
    minimum length = 3
  • Result
  • TAAT<2,1>, AAT<3,2> (each pair is <position in D,
    position in Q>)
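The example on this slide can be checked with a brute-force sketch. It uses plain substring tests rather than a suffix tree, so it only illustrates the problem definition, not the indexed algorithm; the function name and the tuple layout (match, position in D, position in Q) are choices made for this example.

```python
def maximal_matches(data, query, min_len):
    """For each query position, report the longest match in data if it is long enough."""
    results = []
    for q_pos in range(len(query)):
        length = 0
        # Extend the match while the longer prefix of query[q_pos:] still occurs in data.
        while q_pos + length < len(query) and query[q_pos:q_pos + length + 1] in data:
            length += 1
        if length >= min_len:
            match = query[q_pos:q_pos + length]
            d_pos = data.find(match)
            while d_pos != -1:  # report every occurrence in the data
                results.append((match, d_pos, q_pos))
                d_pos = data.find(match, d_pos + 1)
    return results

print(maximal_matches("GTTAATTACT", "CTAATGA", 3))
# [('TAAT', 2, 1), ('AAT', 3, 2)]
```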

15
Maximal Substring Search with Suffix Tree Index
Positions: 0 1 2 3 4 5 6 7 8 9
D = GTTAATTACT
Q = CTAATGA, minimum length = 3
[Figure: suffix tree of D with edge labels and leaf positions, traversed for the query Q]
16
Features of Suffix Tree Index
  • Accurate retrieval
  • no false negatives (unlike BLAST)
  • Linear Time Complexity for both Construction and
    Search!
  • because of Suffix-links
  • Widely used
  • More than 40-50 applications over biological
    sequences [Gusfield, 2002]
  • MUMmer [Celera Genomics], AVID, ...

17
Crippling Limitation
  • Viable only for sequences that are short enough
    for their associated suffix tree to fit
    completely in main memory [Baeza-Yates and
    Navarro, 2000]
  • Best that has been built so far is for sequences
    of 10 Mbp (Human Genome is 300 times longer!)

18
Difficulties in Supporting Suffix Trees on Long
Sequences - 1
  • Space overheads are enormous
  • Order(s) of magnitude larger than data!
  • Human Genome can be easily stored in main memory
    (1 GB) but the index could be of the order of
    10-100 GB
  • Disk-resident suffix trees for long sequences
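A rough back-of-the-envelope check of these figures, as a sketch. The genome length follows the earlier slide ("300 times longer" than 10 Mbp); the 2-bits-per-base packing is an assumption made here, not something the slides state.

```python
GENOME_BASES = 300 * 10_000_000        # "300 times longer" than 10 Mbp, per the slides
BITS_PER_BASE = 2                      # assumption: A, C, G, T packed into 2 bits each

# Raw data size in GB under the packing assumption.
data_gb = GENOME_BASES * BITS_PER_BASE / 8 / 1e9

def index_bytes_per_base(index_gb):
    """Index bytes spent per base for an index of the given size in GB."""
    return index_gb * 1e9 / GENOME_BASES

low, high = index_bytes_per_base(10), index_bytes_per_base(100)
print(f"data: {data_gb:.2f} GB; index: {low:.1f} to {high:.1f} bytes per base")
# data: 0.75 GB; index: 3.3 to 33.3 bytes per base
```

Under these assumptions the data fits within the 1 GB quoted on the slide, while a 10-100 GB index spends roughly 3 to 33 bytes on every base.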

19
Difficulties in Supporting Suffix Trees on Long
Sequences - 2
  • Tree Construction on Disk is Very Slow
  • Due to disk thrashing from random seeks

"The active suffix creeps through the text like a
caterpillar; corresponding active node swings
through the tree like a butterfly" [Giegerich and
Kurtz, 1995]
20
Difficulties in Supporting Suffix Trees on Long
Sequences - 3
  • Searching on Disk is Very Slow
  • Unbalanced Tree Structure
  • Shape of tree depends on the sequence's stochastic
    properties
  • Multi-directional traversals cause disk
    thrashing
  • Tree-Edge → Vertical Walk-Down
  • Suffix-Link → Horizontal Jump-Across

Suffix Tree Search
  • Edge Link mesh
  • Two phase Search
  • Locate
  • Report

→ Combination of Batman and Spiderman!
21
Alternative Approach: Horizontal Compaction
D = ACCACAC
  • Merge duplicate paths
  • Inter-path compaction, not intra-path
  • More compaction potential
  • Horizontal compaction
  • Global Elimination
  • Vertical compaction
  • Local Structural Merging
  • Difficult because of false positives
  • ACACAC appears present

[Figure: trie of all suffixes of D = ACCACAC]
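To see how much duplication the uncompressed trie carries for this example, the sketch below counts its nodes. It is only an illustration of the starting point for compaction, not an implementation of SPINE; the function name is chosen for this example.

```python
def count_trie_nodes(data):
    """Count non-root nodes in the uncompressed trie of all suffixes of data."""
    root = {}
    nodes = 0
    for start in range(len(data)):
        node = root
        for ch in data[start:]:
            if ch not in node:
                node[ch] = {}   # a new node is created only for an unseen path
                nodes += 1
            node = node[ch]
    return nodes

print(count_trie_nodes("ACCACAC"), len("ACCACAC"))  # 19 7
```

For a 7-character sequence the trie already holds 19 nodes, many of them repeating the same path fragments such as "AC" and "CAC".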
22
The SPINE Index: A Horizontally-Compacted Trie
Index
Sequence Processing INdexing Engine
23
SPINE Index Structure
D = ACCACAC
Complete horizontal compaction into a single linear
chain!!
  • Nodes
  • Forward Edges
  • Vertebras (Backbone)
  • Ribs / Ext-Ribs
  • Backward Edges
  • Links

[Figure: SPINE index for D, with the root node, vertebras, ribs, extension ribs, and links labeled]
24
Structural Advantages of SPINE w.r.t. Suffix
Trees
  • Number of nodes equals the length of the string,
    whereas a suffix tree can have up to twice as many.
  • Entire data sequence explicitly embedded in index
    → throw away the data!
  • On-line incremental algorithm (by definition)
  • do not need to possess entire data sequence in
    advance
  • Node creation order and logical order are the
    same → prefix-partitionable

D = ACCA
25
Advantages of SPINE (contd)
  • Each node represents a set of suffixes whereas in
    suffix tree each node represents only a single
    suffix
  • Number of suffixes processed for construction and
    searching is smaller
  • Easy to develop buffering strategies
    for persistent implementations

26
SPINE Performance Summary
  • Data Sets: E. coli (3.5 Mbp), C. elegans (15.5 Mbp),
    HC21 (28.5 Mbp), HC19 (57.5 Mbp)
  • Suffix Tree (MUMmer - Celera Genomics)
  • SPINE Space
  • 2/3 of Suffix Tree
  • SPINE Time
  • Construction: 1/2 of Suffix Tree
  • Searching: 1/2 of Suffix Tree

27
SPINE Summary
  • First index based on horizontal (inter-path)
    compaction of the trie
  • Collapses into a single linear structure
  • Improved features and performance w.r.t. suffix
    trees, the classical index
  • Prefix-partitionable (first index to have this
    property)
  • Easily amenable to persistent disk implementation
  • Retains linear time/space complexity
  • Better construction speed and capacity
  • Better search response times

28
Full details at http://dsl.serc.iisc.ernet.in

Questions?
29
END PRESENTATION