Class 7: Data Persistence

About This Presentation

Title:

Class 7: Data Persistence

Description:

... techniques originate from the basic nuts and bolts origins of data processing ... between I/O operations by allocating memory space for the storage of data blocks ... – PowerPoint PPT presentation

Number of Views:283

Avg rating:3.0/5.0

Slides: 86

Provided by: glenndo

Category:

more less

Transcript and Presenter's Notes

Title: Class 7: Data Persistence

1
Class 7Data Persistence
2
Overview of Class

Review of Class Project Progress
Introduction to Storage Retrieval
Introduction to Memory Management
SMP, MMP, NUMA

3
Motivation for Studying Data Relatives

To provide an understanding of how data storage
management or data persistence relates to
database design
To provide the designer with an opportunity to
seek improvements in database performance.

4
What is Data Persistence?

Data in a system can exist either as an instance,
that is, temporarily, or exist in a state of
persistence, that is, it is permanently stored.
It is this state of permanence that we examine by
understanding how storage works and is managed.

5
Data Storage Management

The Storage Management layer provides the base
layer upon which the Architectural and
Programming layers operate.

6
Data Storage Products

The majority of commercially available database
products use twelve basic techniques to implement
the storage management layer.

7
Data Storage Techniques

These techniques originate from the basic nuts
and bolts origins of data processing and will be
easily recognized to individuals trained in
traditional structured programming methodologies.

8
These twelve techniques are

Hashing
Table Indexes
Memory Resident Tables
Pointing
Joins
RAID

Blocking
Buffering
Compression
Clustering
Sequential Storage
Indexing

9
Architecture and Storage Differences

Understanding a databases architecture does not
automatically mean an understanding in the
databases storage management implementation.
Essentially, architecture profile is concerned
with logical design, while the storage profile is
concerned with the physical design.

10
Performance Criteria

Computer power for the purpose of database
performance can be measured by three criteria as
follows
Memory
Processing Speed
Input/Output Speed

11
MEMORY

Memory can be used for two basic purposes.
To store instructions
To store data that is in use
The process of tuning is one of balancing these
two, often conflicting, roles.

12
PROCESSING SPEED

Processing speed is simply a measure of how many
instructions the CPU can process each second
assuming no other bottlenecks in the process
chain.

13
I/O SPEED

I/O speed is always slower than processing speed
for the simple fact that the laws of physics
limit the components in a physical storage
device.
I/O speed is typically slower than processing
speed from 110 to 1100 times.

14
I/O Bound

Systems are typically I/O bound, meaning the
speed of the I/O system typically sets the pace
at which a system can operate.

15
Resolving the I/O Bottleneck

Databases consistently attempt to employ
techniques that limit the impact of the I/O
system on the overall system performance.

16
I/O Minimization Techniques

These include
I/O Management Techniques
Blocking
Buffering
Compression
Clustering
Sequential Storage

17
I/O Minimization Techniques

Continued
Data Lookup Capabilities
Indexing
Hashing
Table-indexes
Memory Resident Tables
Cross-table Capabilities

18
I/O Minimization Techniques

Continued
Pointing
Dynamic Cross File Strategies
Explicit Disk Manipulation and Management
RAID

19
I/O MINIMIZATION TECHNIQUES

Blocking and buffering are performed by all
computer systems whenever I/O operations are
involved.
Depending on the platform they may be defined by
the operating system, the job control language
(JCL) the runs the program or by the program
itself.
Many databases override these factors for there
own performance improvements.

20
Blocking

This is the process of grouping a lot of little
records into one larger block. This reduces the
number of reads the computer required and thereby
improves system performance.

21
Blocking

Blocking groups are also called blocking units,
pages, control intervals, or record groups.
Blocking requires a trade off between improving
I/O performance and forcing the reading of
unnecessary data.

22
Buffering

This reduces the amount of time waiting between
I/O operations by allocating memory space for the
storage of data blocks anticipated for use.
Data buffers are measured in the same units as
the blocks used to fill them.

23
Blocking and Buffering

When evaluating a databases blocking and
buffering criteria a number of issues should be
examined.

24
Blocking and Buffering Issues

Are block sizes adjustable or fixed?
The ability to modify block size relative to
record size improves performance.

25
Blocking and Buffering Issues

Does the database control buffering or is it left
to the operating system?
Database adjustable buffering improves
performance options.

26
Blocking and Buffering Issues

Can the memory allocation for buffering be
adjusted?
Optimization of memory relative to buffering
requirements improves performance.

27
Compression

Compression allows more data to fit into a block,
which in turn improves performance.
Data is compressed before storage takes place and
when the data is read it is immediately
decompressed.

28
Important considerations for Compression

What compression method is used?
There are different compression techniques each
with their own advantages and disadvantages.

29
Other considerations

Will the cost of CPU cycles exceed the I/O
savings?
The CPU work required for compression may inhibit
any potential performance improvement.

30
Other considerations

Is compression optional?
Some databases dont give the user a choice.

31
Sequencing

Sequencing is that ability of the database to
store data in the same order as its natural sort.
This works well if there is a natural sort but
it is also rudimentary and inflexible.

32
Clustering

Clustering is the process of predicting natural
record groups likely to be processed at the same
time.
Clustering can be very effective if the groupings
match the I/O activity. If this does not happen
the performance is likely to be degraded.

33
DATA LOOKUP CAPABILITIES

These techniques address methods for finding
specific data.

34
Indexes

Indexes provide a way for a system to find one
record without having to read through all of
them.
An index lists all the values we may be looking
for in some kind of alphabetic or numeric order
with an address at which the full data record can
be found.

35
Various types of indexes include

Binary search style indexes
B-tree indexes
Inverted list indexes
Memory resident tables
Table-indexes

36
Binary Search

The binary search style index is the most
primitive type of index.
In this kind of index a simple lookup list is
developed and the system sequentially searches
through it.

37
Binary Search

The system knows that common data will be grouped
together in the index therefore when a grouping
such as color reaches the end of a specific value
the system will know to stop searching.
Indexing also allows more records to be
referenced from within a single block because the
index takes less space.

38
Search Patterns

Indexes can be used differently with various
search patterns, including.
Simple Index Search Pattern
Binary Search Pattern

39
Simple Index Search Pattern

A simple index search pattern reads through the
index sequentially from beginning to end.
Combined with strategic blocking this can
introduce some efficiencies.
The average number of reads per search equals the
length of the index divide by two.

40
Binary Search Pattern

A binary search pattern uses a half-life
algorithm to narrow down to the required
information.

41
Binary Search Pattern Algorithm

The first check is in the middle of the index and
if the desired value is higher or lower than the
checked value then the list is checked by half,
either higher or lower, again.
This process is repeated until the value is found
The maximum number of reads required for a search
is equal to the number of times the index can be
divided in half and the average will be less than
that.

42
B-Tree Index

The B-tree index is a structure comprised of
multiple indexes whose common factor will be a
collection of simple indexes.

43
B-Tree Index

A B-tree structure is simply a collection of
indexes on a further more basic index.
This layering can be implemented on more than one
level if necessary.
The B-tree index is the most common indexing
process used in database products.

44
Inverted List Index

In an inverted list index all record addresses
having the same index value will be stored in the
same record.

45
Inverted List Index

Although a very efficient index it causes
difficulty with record sharing and file
maintenance.
Some database systems will apply algorithms that
allow the system to choose the best index system
for a particular task on the fly.

46
Binary Indexes

Binary Indexes use bit fiddling and coding to
represent values in an index. This can save both
I/O time and processor time.
This is primarily applicable where there are a
small defined set of possible index values.

47
Hashing

Hashing is a non-indexing technique that
translates index values directly into a storage
address.
Hashing is heavily reliant on the algorithm the
calculates the addresses.
Hashing can be very fast but it causes issues
with physical storage.

48
Table Indexes

Table indexes are indexes that include some of
the data fields themselves. This can be
effective when the number of important fields is
small.

49
Memory Resident Tables

Memory resident tables are another brute force
type of technique to reduce I/O overhead.

50
Cross File Capabilities and Storage Management

Two approaches are evaluated
Light Overhead or Direct Approach
Heavy Overhead or Smart Database Approach

51
Light Overhead

The light overhead or direct approach requires
the user to specify relationships.
This is also referred to as a navigational
database.

52
Heavy Overhead

The heavy overhead or smart database approach
uses catalog tables, directories and other lookup
facilities to make relationship decisions for the
user.

53
Pointers

Pointers are a technique for relating records
from one table to another without using an index.
The address of the corresponding records are
imbedded in the original record itself.

54
File Scan Join

A file scan join is a join performed on the fly
by identifying the matching fields and letting
the database perform full file scans to identify
record matches.

55
File Scan Join

This technique is the fall back method for most
databases if no other method is available.
More commonly, indexing and hash keys are used to
join files.

56
ORACLE Example for Illustration

An ORACLE database uses the following techniques
Index Unique, use a unique index to find values
Index Range Scan, Scan a range of values within
the index only
Intersection, Return only rows that match on two
or more tables

57
ORACLE Example

ORACLE techniques
Merge Join, Get two sets of rows, both sorted,
and combine them where they match
Nested Loop, Identify a set of criteria in one
table and then match them, one at a time, against
a second table
Sequence, Access the data from a table that is
presorted in the right order

58
ORACLE Example

ORACLE techniques
Table Access Full, A full table scan
Table Access Cluster, Access a table via its
cluster organization
Table Access Hash, Access using the hash key
Table Access Row ID, Find a row based on its ID
number

59
Supported Join Operations

The most common ways databases support join
operations are
Full Scan
Merge
Nested Loop
Row ID Manipulation (enhanced B-tree)
Star Row ID Pointers

60
Contrasting Light and Heavy Overhead Systems

In the heavy overhead system the database uses
the same techniques, however, the user doesnt
have to choose which technique to use in each
situation.
The system will assess and choose the most
appropriate technique in the given circumstances.

61
Explicit Disk Management Approaches

The easiest method a DBA can use to improve
performance is to improve the physical storage of
files.
This requires three approaches
Reorganize database storage area
Pull data together
Spread data apart

62
Disk Management Technology

Storage subsystems have taken a number of forms
over the years, although a few have become
dominant in the middle to higher end server and
system market.
One particularly effective technology for
managing disk throughput, increasing security,
and improving performance is RAID.

63
RAID

RAID, Redundant Arrays of Inexpensive Disks, is a
method of distributing a database across a number
of physical disk systems. It consists of the
disks, the array controller and the software.

64
RAID Levels

RAID is described in terms of levels from RAID 0
to RAID 6.
RAID 0 disk striping
RAID 1 disk mirroring, no striping
RAID 3 disk striping with parity bit backup
RAID 5 disk striping, mirroring with parity

65
ENHANCED STORAGE MANAGEMENT

Databases can sometimes be enhanced through the
implementation of third party programs into the
layers of the system.

66
Storage Management Evolution

In the early days of databases it was common for
systems programmers and database experts to
rewrite portions of the database system to meet
their specific needs.
It is common now for database systems to have
pre-built exit routines that allow the system to
call an external program or algorithm.

67
Exit Routines

These intervention points often occurred at these
key points
Calls to external security packages
Calls to sort, compression and data
transportation routines
Calls to online transaction monitors and
processors
Calls to network management systems
Calls to data storage, retrieval and access
enhancers

68
Storage Management Applications

Object-Oriented databases apply this technology
to a greater extent than all other models.

69
Storage Management Database Enhancement

The data definition language (DDL) of the
database itself also provides means of database
enhancement through the following means
Stored Procedures
Triggers
Column-based Calls

70
Storage Management Enhancement

In all three cases the routines being called may
execute other database calls or invoke external
program routines.

71
Optimizing Hardware Architectures

Our concern with hardware revolves around four
basic elements.
Disk Storage
Memory
Processors (CPUs)
Disk Connection Pathways (buses)

72
SMP, MPP and NUMA Hardware Architectures

Manufacturers approach these issues by
addressing
physical partitioning across multiple machines
clustering across multiple machines
64-bit, SMP, MPP, and NUMA within individual
machines.

73
Physical Partitioning

Physical partitioning is the division of data
between multiple computers.
In some cases it works effectively, in others
terribly.

74
Problems with Physical Partitioning

Difficulty of determining how to distribute
dataManaging multiple CPU usage on the same data
Complexity overload
Traffic and transportation overhead

75
Clustering

Clustering involves the connection of multiple
computers to a single storage array.
In the clustering model only one CPU can work on
one query at a time.

76
64 Bit Word Size

Increasing the word size (64 bit) increases the
overall speed of the system.
However, if the rest of the architecture doesnt
change also it simply introduces additional
bottlenecks.

77
The Multiple Processor Approach

Multiple processor approaches introduce methods
for enhancing processor capability and overall
processor speed.

78
The Multiple Processor Methods

Methods we will look at include
Massively Parallel Processor (MPP)
Symmetric Multiprocessor
Cache-Coherent, Non Uniform Memory Access
(cc-NUMA)

79
Massively Parallel Processor (MPP)

The Massively Parallel Processor (MPP) approach
is a method for placing up to thousands of
processors into a single machine governed by a
set of rules.

80
Problems with MPP

The fault of the MPP architecture is the overhead
required coordinating the CPUs.
Usually MPP architectures only work well for
specialized tasks.
MPP leads to complex, custom systems.

81
Symmetric Multiprocessor

The Symmetric Multiprocessor (SMP) approach is an
enhancement of the MPP approach that focuses on
allowing true multi-tasking.
An SMP system dynamically assigns CPU processor
time.

82
Problems with SMP

If processor time is not evenly distributed then
the advantages of parallelism are lost.
There are physical limitations to the size that
an SMP system can grow due to overhead issues.

83
Cache-Coherent, Non Uniform Memory Access
(cc-NUMA)

Cache-coherent, Non-uniform Memory Access
(cc-NUMA) architectures allow for true,
distributed sharing of major subsystems using
shared interconnect bus systems.
This occurs most notably with memory but is also
often used with I/O subsystems.

84
Advantages of NUMA

Most NUMA architectures use the symmetric
Scalable Coherent Interconnect (SCI) model form
IEEE.
The system takes advantage of SMP architectures
while improving on the bus bandwidth issues.

85
Summary Review of Disk Management

Clearly, the I/O bound limitations imposed on
database implementations can defeat an otherwise
intelligent design with reduced performance and
limited reliability.
Understanding the optimization tools available to
the design allows you to overcome these
limitations with the proper management approach.

Write a Comment

User Comments (0)

About PowerShow.com

Class 7: Data Persistence - PowerPoint PPT Presentation

Class 7: Data Persistence

... techniques originate from the basic nuts and bolts origins of data processing ... between I/O operations by allocating memory space for the storage of data blocks ... – PowerPoint PPT presentation