Title: Class 7: Data Persistence
Overview of Class
- Review of Class Project Progress
- Introduction to Storage Retrieval
- Introduction to Memory Management
- SMP, MPP, NUMA
Motivation for Studying Data Persistence
- To provide an understanding of how data storage management, or data persistence, relates to database design
- To provide the designer with an opportunity to seek improvements in database performance
What is Data Persistence?
- Data in a system can exist either as an instance, that is, temporarily, or in a state of persistence, that is, permanently stored.
- It is this state of permanence that we examine by understanding how storage works and is managed.
Data Storage Management
- The Storage Management layer provides the base
layer upon which the Architectural and
Programming layers operate.
Data Storage Products
- The majority of commercially available database
products use twelve basic techniques to implement
the storage management layer.
Data Storage Techniques
- These techniques originate from the basic nuts-and-bolts origins of data processing and will be easily recognized by individuals trained in traditional structured programming methodologies.
These twelve techniques are:
- Hashing
- Table Indexes
- Memory Resident Tables
- Pointing
- Joins
- RAID
- Blocking
- Buffering
- Compression
- Clustering
- Sequential Storage
- Indexing
Architecture and Storage Differences
- Understanding a database's architecture does not automatically mean an understanding of the database's storage management implementation.
- Essentially, the architecture profile is concerned with logical design, while the storage profile is concerned with physical design.
Performance Criteria
- Computer power, for the purpose of database performance, can be measured by three criteria:
- Memory
- Processing Speed
- Input/Output Speed
MEMORY
- Memory can be used for two basic purposes.
- To store instructions
- To store data that is in use
- The process of tuning is one of balancing these
two, often conflicting, roles.
PROCESSING SPEED
- Processing speed is simply a measure of how many
instructions the CPU can process each second
assuming no other bottlenecks in the process
chain.
I/O SPEED
- I/O speed is always slower than processing speed, for the simple fact that the laws of physics limit the components in a physical storage device.
- I/O speed is typically 10 to 100 times slower than processing speed.
I/O Bound
- Systems are typically I/O bound, meaning the
speed of the I/O system typically sets the pace
at which a system can operate.
Resolving the I/O Bottleneck
- Databases consistently attempt to employ
techniques that limit the impact of the I/O
system on the overall system performance.
I/O Minimization Techniques
- These include:
- I/O Management Techniques
- Blocking
- Buffering
- Compression
- Clustering
- Sequential Storage
I/O Minimization Techniques (continued)
- Data Lookup Capabilities
- Indexing
- Hashing
- Table-indexes
- Memory Resident Tables
- Cross-table Capabilities
I/O Minimization Techniques (continued)
- Pointing
- Dynamic Cross File Strategies
- Explicit Disk Manipulation and Management
- RAID
I/O MINIMIZATION TECHNIQUES
- Blocking and buffering are performed by all computer systems whenever I/O operations are involved.
- Depending on the platform, they may be defined by the operating system, by the job control language (JCL) that runs the program, or by the program itself.
- Many databases override these factors for their own performance improvements.
Blocking
- This is the process of grouping many small records into one larger block. This reduces the number of reads the computer requires and thereby improves system performance.
Blocking
- Blocking groups are also called blocking units, pages, control intervals, or record groups.
- Blocking requires a trade-off between improving I/O performance and forcing the reading of unnecessary data.
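The effect of blocking on read counts can be sketched as follows; the record and block sizes are illustrative assumptions, not taken from any particular database.

```python
# Sketch of blocking: grouping small fixed-size records into larger
# blocks so fewer reads are needed to scan a file.

RECORD_SIZE = 100      # bytes per record (assumed for illustration)
BLOCK_SIZE = 4096      # bytes per block (assumed for illustration)

def reads_required(n_records, records_per_block):
    """Number of block reads needed to scan n_records sequentially."""
    return -(-n_records // records_per_block)  # ceiling division

records_per_block = BLOCK_SIZE // RECORD_SIZE  # 40 records per block

unblocked = reads_required(10_000, 1)               # one record per read
blocked = reads_required(10_000, records_per_block)

print(unblocked, blocked)  # 10000 reads vs. 250 reads
```

The trade-off noted above appears when a query needs only one record: the whole 4096-byte block must still be read.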
Buffering
- This reduces the amount of time spent waiting between I/O operations by allocating memory space for the storage of data blocks anticipated for use.
- Data buffers are measured in the same units as the blocks used to fill them.
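A minimal sketch of the buffering idea, with a simulated disk standing in for real storage; the class name and data are illustrative assumptions.

```python
# Sketch of buffering: keep recently read blocks in memory so a
# repeated request does not trigger another physical read.

class BufferPool:
    def __init__(self, blocks):
        self.blocks = blocks        # simulated disk: block number -> data
        self.buffer = {}            # in-memory copies of blocks
        self.physical_reads = 0

    def read_block(self, n):
        if n not in self.buffer:    # buffer miss: go to "disk"
            self.physical_reads += 1
            self.buffer[n] = self.blocks[n]
        return self.buffer[n]       # buffer hit costs no I/O

disk = {0: b"block0", 1: b"block1"}
pool = BufferPool(disk)
pool.read_block(0)
pool.read_block(0)                  # served from the buffer
pool.read_block(1)
print(pool.physical_reads)          # 2 physical reads for 3 requests
```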
Blocking and Buffering
- When evaluating a database's blocking and buffering criteria, a number of issues should be examined.
Blocking and Buffering Issues
- Are block sizes adjustable or fixed?
- The ability to modify block size relative to
record size improves performance.
Blocking and Buffering Issues
- Does the database control buffering, or is it left to the operating system?
- Database-adjustable buffering improves performance options.
Blocking and Buffering Issues
- Can the memory allocation for buffering be adjusted?
- Optimization of memory relative to buffering requirements improves performance.
Compression
- Compression allows more data to fit into a block, which in turn improves performance.
- Data is compressed before storage takes place, and when the data is read it is immediately decompressed.
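The compress-on-write, decompress-on-read cycle can be sketched with Python's standard zlib module; the record contents are an illustrative assumption.

```python
# Sketch of transparent compression: data is compressed before it is
# written and decompressed immediately after it is read, so more
# logical data fits in each physical block.
import zlib

record = b"customer_name=SMITH;" * 50   # repetitive data compresses well

stored = zlib.compress(record)           # compress before storage
restored = zlib.decompress(stored)       # decompress on read

print(len(record), len(stored))          # 1000 bytes vs. far fewer stored
```

Note the CPU-cost consideration from the slides: both calls burn processor cycles, which is why compression may not pay off on incompressible or rarely re-read data.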
Important considerations for Compression
- What compression method is used?
- There are different compression techniques, each with its own advantages and disadvantages.
Other considerations
- Will the cost of CPU cycles exceed the I/O savings?
- The CPU work required for compression may inhibit any potential performance improvement.
Other considerations
- Is compression optional?
- Some databases don't give the user a choice.
Sequencing
- Sequencing is the ability of the database to store data in the same order as its natural sort. This works well if there is a natural sort, but it is also rudimentary and inflexible.
Clustering
- Clustering is the process of predicting natural record groups likely to be processed at the same time.
- Clustering can be very effective if the groupings match the I/O activity. If this does not happen, performance is likely to be degraded.
DATA LOOKUP CAPABILITIES
- These techniques address methods for finding
specific data.
Indexes
- Indexes provide a way for a system to find one record without having to read through all of them.
- An index lists all the values we may be looking for, in some kind of alphabetic or numeric order, with an address at which the full data record can be found.
Various types of indexes include:
- Binary search style indexes
- B-tree indexes
- Inverted list indexes
- Memory resident tables
- Table-indexes
Binary Search
- The binary search style index is the most primitive type of index.
- In this kind of index, a simple lookup list is developed and the system sequentially searches through it.
Binary Search
- The system knows that common data will be grouped together in the index; therefore, when a grouping such as color reaches the end of a specific value, the system knows to stop searching.
- Indexing also allows more records to be referenced from within a single block, because the index takes less space.
Search Patterns
- Indexes can be used differently with various search patterns, including:
- Simple Index Search Pattern
- Binary Search Pattern
Simple Index Search Pattern
- A simple index search pattern reads through the index sequentially from beginning to end.
- Combined with strategic blocking, this can introduce some efficiencies.
- The average number of reads per search equals the length of the index divided by two.
Binary Search Pattern
- A binary search pattern uses a half-life
algorithm to narrow down to the required
information.
Binary Search Pattern Algorithm
- The first check is in the middle of the index; if the desired value is higher or lower than the checked value, the remaining half of the list, either higher or lower, is checked in the same way.
- This process is repeated until the value is found.
- The maximum number of reads required for a search is equal to the number of times the index can be divided in half, and the average will be less than that.
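The halving algorithm above can be sketched as follows, counting reads so the two search patterns can be compared; the index contents are an illustrative assumption.

```python
# Sketch of the binary (half-interval) search pattern, with a read
# counter to compare against a sequential index scan.
import math

index = list(range(2048))   # a sorted index of 2048 entries (assumed)

def binary_search(idx, target):
    """Return (position, reads performed), or (-1, reads) if absent."""
    lo, hi, reads = 0, len(idx) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        reads += 1                      # one index read per probe
        if idx[mid] == target:
            return mid, reads
        if idx[mid] < target:
            lo = mid + 1                # keep the upper half
        else:
            hi = mid - 1                # keep the lower half
    return -1, reads

pos, reads = binary_search(index, 1500)
# A simple sequential scan averages len(index)/2 = 1024 reads; the
# binary pattern needs at most log2(2048) = 11 reads.
print(pos, reads, math.ceil(math.log2(len(index))))
```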
B-Tree Index
- The B-tree index is a structure composed of multiple indexes whose common foundation is a collection of simple indexes.
B-Tree Index
- A B-tree structure is simply a collection of indexes on a further, more basic index.
- This layering can be implemented on more than one level if necessary.
- The B-tree index is the most common indexing process used in database products.
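The layering idea, an index on a more basic index, can be sketched as a two-level lookup; the page contents are illustrative assumptions, and a real B-tree would also handle balancing and updates.

```python
# Sketch of the layering behind a B-tree: a small top-level index over
# the first key of each lower-level index page. Only the lookup path
# is shown.
import bisect

# Lower level: pages of sorted (key, address) pairs (illustrative).
pages = [
    [(10, "a1"), (20, "a2"), (30, "a3")],
    [(40, "b1"), (50, "b2"), (60, "b3")],
    [(70, "c1"), (80, "c2"), (90, "c3")],
]
# Upper level: the first key on each page.
top = [page[0][0] for page in pages]   # [10, 40, 70]

def lookup(key):
    """Two reads instead of a full scan: pick a page, then search it."""
    page = pages[bisect.bisect_right(top, key) - 1]
    for k, addr in page:
        if k == key:
            return addr
    return None

print(lookup(50))  # "b2"
```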
Inverted List Index
- In an inverted list index all record addresses
having the same index value will be stored in the
same record.
Inverted List Index
- Although a very efficient index, it causes difficulty with record sharing and file maintenance.
- Some database systems will apply algorithms that allow the system to choose the best index system for a particular task on the fly.
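The core idea, all record addresses sharing an index value stored together, can be sketched in a few lines; the records and addresses are illustrative assumptions.

```python
# Sketch of an inverted list index: every record address with the same
# index value is stored under that value, so one lookup finds them all.
records = {
    101: {"color": "red"},
    102: {"color": "blue"},
    103: {"color": "red"},
}

inverted = {}
for addr, rec in records.items():
    inverted.setdefault(rec["color"], []).append(addr)

print(inverted["red"])   # [101, 103]
```

The maintenance difficulty the slide mentions is visible here: updating one record's color means rewriting the shared address list, not just the record.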
Binary Indexes
- Binary indexes use bit fiddling and coding to represent values in an index. This can save both I/O time and processor time.
- This is primarily applicable where there is a small, defined set of possible index values.
Hashing
- Hashing is a non-indexing technique that translates index values directly into a storage address.
- Hashing is heavily reliant on the algorithm that calculates the addresses.
- Hashing can be very fast, but it causes issues with physical storage.
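The key-to-address translation can be sketched as follows; the bucket count and hash function are illustrative assumptions, as production systems use carefully chosen functions.

```python
# Sketch of hashing as a non-index lookup: the key is translated
# directly into a storage address (a bucket number), with no index read.
N_BUCKETS = 8

def bucket_for(key: str) -> int:
    """Map a key straight to a bucket address (toy hash, assumed)."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % N_BUCKETS
    return h

# Store and retrieve without any index traversal:
buckets = [[] for _ in range(N_BUCKETS)]
buckets[bucket_for("SMITH")].append(("SMITH", "record-data"))
found = buckets[bucket_for("SMITH")]
print(found)
```

The physical-storage issue the slide names shows up as collisions: unrelated keys can land in the same bucket, and bucket space must be preallocated.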
Table Indexes
- Table indexes are indexes that include some of
the data fields themselves. This can be
effective when the number of important fields is
small.
Memory Resident Tables
- Memory resident tables are another brute force
type of technique to reduce I/O overhead.
Cross File Capabilities and Storage Management
- Two approaches are evaluated:
- Light Overhead or Direct Approach
- Heavy Overhead or Smart Database Approach
Light Overhead
- The light overhead, or direct, approach requires the user to specify relationships.
- This is also referred to as a navigational database.
Heavy Overhead
- The heavy overhead or smart database approach
uses catalog tables, directories and other lookup
facilities to make relationship decisions for the
user.
Pointers
- Pointers are a technique for relating records from one table to another without using an index.
- The addresses of the corresponding records are embedded in the original record itself.
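Embedding a related record's address in the record itself can be sketched as follows; the table contents and address values are illustrative assumptions.

```python
# Sketch of pointing: the related record's address is embedded in the
# record, so the link is followed directly, with no index consulted.
orders = {
    5001: {"item": "widget", "customer_ptr": 9001},  # embedded address
}
customers = {
    9001: {"name": "SMITH"},
}

order = orders[5001]
customer = customers[order["customer_ptr"]]   # follow the pointer
print(customer["name"])  # SMITH
```

The cost of this directness is maintenance: if the customer record moves, every order holding its old address must be rewritten.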
File Scan Join
- A file scan join is a join performed on the fly
by identifying the matching fields and letting
the database perform full file scans to identify
record matches.
File Scan Join
- This technique is the fallback method for most databases if no other method is available.
- More commonly, indexing and hash keys are used to join files.
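The fallback full-scan matching described above can be sketched as a nested pass over both files; the tables are illustrative assumptions.

```python
# Sketch of a file scan join: with no index or hash key available, the
# database scans both files in full and matches on the join field.
orders = [
    {"order_id": 1, "cust": "A"},
    {"order_id": 2, "cust": "B"},
]
customers = [
    {"cust": "A", "name": "SMITH"},
    {"cust": "B", "name": "JONES"},
]

matches = []
for o in orders:                      # full scan of the first file
    for c in customers:               # full scan of the second file
        if o["cust"] == c["cust"]:    # match on the join field
            matches.append((o["order_id"], c["name"]))

print(matches)  # [(1, 'SMITH'), (2, 'JONES')]
```

The I/O cost is the product of the two file sizes, which is why indexed and hashed joins are preferred whenever possible.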
ORACLE Example for Illustration
- An ORACLE database uses the following techniques:
- Index Unique: use a unique index to find values
- Index Range Scan: scan a range of values within the index only
- Intersection: return only rows that match on two or more tables
ORACLE Example
- ORACLE techniques:
- Merge Join: get two sets of rows, both sorted, and combine them where they match
- Nested Loop: identify a set of criteria in one table and then match them, one at a time, against a second table
- Sequence: access the data from a table that is presorted in the right order
ORACLE Example
- ORACLE techniques:
- Table Access Full: a full table scan
- Table Access Cluster: access a table via its cluster organization
- Table Access Hash: access using the hash key
- Table Access Row ID: find a row based on its ID number
Supported Join Operations
- The most common ways databases support join operations are:
- Full Scan
- Merge
- Nested Loop
- Row ID Manipulation (enhanced B-tree)
- Star Row ID Pointers
Contrasting Light and Heavy Overhead Systems
- In the heavy overhead system the database uses the same techniques; however, the user doesn't have to choose which technique to use in each situation.
- The system will assess and choose the most appropriate technique in the given circumstances.
Explicit Disk Management Approaches
- The easiest method a DBA can use to improve performance is to improve the physical storage of files.
- This involves three approaches:
- Reorganize the database storage area
- Pull data together
- Spread data apart
Disk Management Technology
- Storage subsystems have taken a number of forms over the years, although a few have become dominant in the middle- to higher-end server and system market.
- One particularly effective technology for managing disk throughput, increasing security, and improving performance is RAID.
RAID
- RAID, Redundant Arrays of Inexpensive Disks, is a
method of distributing a database across a number
of physical disk systems. It consists of the
disks, the array controller and the software.
RAID Levels
- RAID is described in terms of levels, from RAID 0 to RAID 6.
- RAID 0: disk striping
- RAID 1: disk mirroring, no striping
- RAID 3: disk striping with parity bit backup
- RAID 5: disk striping with distributed parity
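The parity idea behind the striped RAID levels can be sketched with XOR: the parity stripe is the XOR of the data stripes, so any single lost stripe can be rebuilt from the survivors. The stripe contents are illustrative assumptions.

```python
# Sketch of RAID-style parity: parity = XOR of the data stripes, so a
# single failed stripe is recoverable from the remaining stripes.
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

stripe0 = b"\x01\x02\x03\x04"            # data on disk 0
stripe1 = b"\x10\x20\x30\x40"            # data on disk 1
parity = xor_bytes(stripe0, stripe1)     # written to the parity disk

# Simulate losing stripe1 and rebuilding it from stripe0 and parity:
rebuilt = xor_bytes(stripe0, parity)
print(rebuilt == stripe1)  # True
```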
ENHANCED STORAGE MANAGEMENT
- Databases can sometimes be enhanced through the
implementation of third party programs into the
layers of the system.
Storage Management Evolution
- In the early days of databases, it was common for systems programmers and database experts to rewrite portions of the database system to meet their specific needs.
- It is common now for database systems to have pre-built exit routines that allow the system to call an external program or algorithm.
Exit Routines
- These intervention points often occur at these key points:
- Calls to external security packages
- Calls to sort, compression, and data transportation routines
- Calls to online transaction monitors and processors
- Calls to network management systems
- Calls to data storage, retrieval, and access enhancers
Storage Management Applications
- Object-Oriented databases apply this technology
to a greater extent than all other models.
Storage Management Database Enhancement
- The data definition language (DDL) of the database itself also provides means of database enhancement through the following:
- Stored Procedures
- Triggers
- Column-based Calls
Storage Management Enhancement
- In all three cases the routines being called may
execute other database calls or invoke external
program routines.
Optimizing Hardware Architectures
- Our concern with hardware revolves around four basic elements:
- Disk Storage
- Memory
- Processors (CPUs)
- Disk Connection Pathways (buses)
SMP, MPP and NUMA Hardware Architectures
- Manufacturers approach these issues by addressing:
- physical partitioning across multiple machines
- clustering across multiple machines
- 64-bit, SMP, MPP, and NUMA within individual machines
Physical Partitioning
- Physical partitioning is the division of data between multiple computers.
- In some cases it works effectively; in others, terribly.
Problems with Physical Partitioning
- Difficulty of determining how to distribute data
- Managing multiple CPU usage on the same data
- Complexity overload
- Traffic and transportation overhead
Clustering
- Clustering involves the connection of multiple computers to a single storage array.
- In the clustering model, only one CPU can work on one query at a time.
64-Bit Word Size
- Increasing the word size (64 bits) increases the overall speed of the system.
- However, if the rest of the architecture doesn't change as well, it simply introduces additional bottlenecks.
The Multiple Processor Approach
- Multiple processor approaches introduce methods
for enhancing processor capability and overall
processor speed.
The Multiple Processor Methods
- Methods we will look at include:
- Massively Parallel Processor (MPP)
- Symmetric Multiprocessor (SMP)
- Cache-Coherent, Non-Uniform Memory Access (cc-NUMA)
Massively Parallel Processor (MPP)
- The Massively Parallel Processor (MPP) approach
is a method for placing up to thousands of
processors into a single machine governed by a
set of rules.
Problems with MPP
- The fault of the MPP architecture is the overhead required to coordinate the CPUs.
- Usually, MPP architectures only work well for specialized tasks.
- MPP leads to complex, custom systems.
Symmetric Multiprocessor
- The Symmetric Multiprocessor (SMP) approach is an enhancement of the MPP approach that focuses on allowing true multi-tasking.
- An SMP system dynamically assigns CPU processor time.
Problems with SMP
- If processor time is not evenly distributed, the advantages of parallelism are lost.
- There are physical limitations to the size an SMP system can grow to, due to overhead issues.
Cache-Coherent, Non-Uniform Memory Access (cc-NUMA)
- Cache-coherent, Non-uniform Memory Access (cc-NUMA) architectures allow for true, distributed sharing of major subsystems using shared interconnect bus systems.
- This occurs most notably with memory but is also often used with I/O subsystems.
Advantages of NUMA
- Most NUMA architectures use the symmetric Scalable Coherent Interconnect (SCI) model from the IEEE.
- The system takes advantage of SMP architectures while improving on the bus bandwidth issues.
Summary Review of Disk Management
- Clearly, the I/O-bound limitations imposed on database implementations can defeat an otherwise intelligent design, leaving reduced performance and limited reliability.
- Understanding the optimization tools available to the designer allows you to overcome these limitations with the proper management approach.