Title: Topics 9-10: Database optimization for modern machines
1. Topics 9-10: Database optimization for modern machines
- Computer architecture has been changing over the years, so we have to change our programs, too!
- Most database operators, indexes, buffering techniques and storage schemes were developed and optimized over 20 years ago, so we need to re-tune them for the modern reality.
2. Moore's Law: Dictionary Term
- http://info.astrian.net/jargon/terms/m/moore_s_law.html
- Moore's Law /morz law/ prov. The observation that the logic density of silicon integrated circuits has closely followed the curve (bits per square inch) = 2^(t - 1962), where t is time in years; that is, the amount of information storable on a given amount of silicon has roughly doubled every year since the technology was invented. This relation, first uttered in 1964 by semiconductor engineer Gordon Moore (who co-founded Intel four years later), held until the late 1970s, at which point the doubling period slowed to 18 months. The doubling period remained at that value through the time of writing (late 1999). Moore's Law is apparently self-fulfilling. The implication is that somebody, somewhere is going to be able to build a better chip than you if you rest on your laurels, so you'd better start pushing hard on the problem.
3. Features of Modern Machines and Future Trends
- CPU speed, memory bandwidth, and disk bandwidth follow Moore's Law. On the other hand, memory latency improves by only 1% per year.
- Memories are becoming larger, and slower relative to memory bandwidth and CPU speed.
- Most of the time the processor of your PC is idle, waiting for data to be fetched from memory.
- The role of main memory caches is becoming very important in overall performance.
4. Features of Modern Machines and Future Trends (cont'd)
- Disk bandwidth also follows Moore's Law.
- On the other hand, disk seek time improves by only 10% per year.
- Thus a random disk access today may cost as much as 20-50 sequential accesses.
5. Features of Modern Machines and Future Trends (cont'd)
- Another characteristic of modern machines is that their processors (AMD Athlon, Intel P4) have parallel processing units able to process multiple (e.g., 5-9) independent instructions at the same time.
6. The new bottleneck: Memory Access
- Memory is now hierarchical, with two levels of caching.
  (Figure: the CPU and the L1 cache sit on the CPU die; below them the L2 cache, then main memory. Transfer units between the levels: L1 cache-line, L2 cache-line, memory page.)
- Memory latency: the time needed to transfer 1 byte from the main memory to the L2 cache.
- Cache (L2) miss: the requested data is not in the cache and needs to be fetched from main memory.
- Cache-line: the transfer unit from main memory to the cache (e.g., an L2 cache-line is 128 bytes).
7. Example of Memory Latency Effects in Databases
- Employee(id: 4 bytes, name: 20 bytes, age: 4 bytes, gender: 1 byte)
- Disk/memory representation (29 bytes per tuple):
  12000 Mike Chan 30 M | 12305 George Best 27 M | ...
8. Example of Memory Latency Effects in Databases (cont'd)
- Example query: how many girls work for the company?
- SELECT COUNT(*)
- FROM EMPLOYEE
- WHERE gender = 'F'
- What is the cost of this query, assuming that the relation is in main memory?
- We have to access the gender values, compare them to 'F' and add 1 (or 0).
9. Example of Memory Latency Effects in Databases (cont'd)
- Accessing a specific address from memory also loads the information around it into the cache-line (in parallel over a wide bus).
- Example: accessing the gender of the first tuple brings a whole cache-line (128 bytes) into the cache:
  12000 Mike Chan 30 M | 12305 George Best 27 M | ... (29 bytes per tuple)
10. Example of Memory Latency Effects in Databases (cont'd)
- Thus a cache-line at a time is accessed from main memory.
- If the requested information is already in the cache (e.g., the gender of the second tuple), main memory is not accessed.
- Otherwise, a cache miss occurs.
11. Effect of cache misses
- The stride.c program demonstrates how cache misses may dominate the query processing cost.
- http://www.csis.hku.hk/nikos/courses/CSIS7101/stride.c
- Sketch of the inner loop (buffer is a large, randomly initialized memory buffer):
    sum = 0;
    ilimit = 200000;   /* tentative */
    for (i = 0; i < ilimit; i += stride)
        sum += buffer[i];
    printf("sum=%d\n", sum);
- (Illustrated for stride = 5.)
12. Effect of cache misses (cont'd)
13. Re-designing the DBMS
- We need to redesign the DBMS in order to face the new bottleneck: memory access.
- New storage schemes are proposed to minimize memory and disk accesses.
- Query processing techniques are redesigned to take the memory-latency effects into consideration.
- Algorithms are changed to take advantage of the intra-parallelism of instruction execution.
- New instruction types are used (e.g., SIMD instructions).
14. Problems in optimizing the DBMS
- Programming languages do not have control over the replacement policy of memory caches; this is determined by the hardware.
- As a result, we cannot apply buffer management techniques for memory cache management.
- Also, the simple instructions generated by programming languages cannot control the parallel instruction execution capabilities of the machine.
- In many cases, we re-write the programs, trying to fool the cache replacement policy and the instruction execution, in order to make them more efficient.
15. The N-ary storage model
- The N-ary storage model (NSM) stores the information for each tuple as a contiguous sequence of bytes.
- Disk/memory representation (29 bytes per tuple):
  12000 Mike Chan 30 M | 12305 George Best 27 M | ...
16. A Decomposition Storage Model (DSM)
- The table is vertically decomposed, and the information for each attribute is stored sequentially:
  a1: 12000 Mike Chan | 12305 George Best | ...
  a2: 12000 30 | 12305 27 | ...
  a3: 12000 M | 12305 M | ...
17. Properties of DSM
- The relation is decomposed into many binary tables, one for each attribute.
- The attributes of a binary table are a surrogate id and the attribute from the original relation.
- The surrogate id is necessary to bring the information for a tuple back together, by joining the binary tables.
- Example: print the information for the tuple with id = 12305:
  NSM: SELECT * FROM EMPLOYEE WHERE id = 12305
  DSM: SELECT id, name, age, gender FROM a1, a2, a3 WHERE a1.id = 12305 AND a2.id = a1.id AND a3.id = a1.id
18. Properties of DSM (cont'd)
- Advantages of DSM:
  - If the relation has many attributes, but queries involve only a few attributes, then:
    - much less information is read from disk than in NSM;
    - cache misses are fewer than in NSM, because the stride is smaller.
  - Projection queries are very fast.
- Disadvantages of DSM:
  - If a tuple needs to be reconstructed, this requires many joins.
  - The size of the decomposed relation is larger than the original, due to the replication of the surrogate key.
19. A Decomposition Model with Unary tables (Monet)
- The binary tables are reduced to unary tables: the surrogate ids are not materialized (e.g., the gender column is stored simply as M, F, ..., with no surrogates).
- Given the surrogate sid of a tuple, we can compute its attribute value v by
  v = *(table_address + (sid - first_surrogate) * size)
20. Partition Attributes Across (PAX)
- Attempts to combine the advantages of both NSM and DSM, while avoiding their disadvantages.
- The data are split into pages, like in NSM, but the organization within each page is like DSM. Thus:
  - Cache misses are minimized, because the information for a specific attribute is stored compactly in the page.
  - Record reconstruction cost is low, since the tuples are stored like in NSM on disk, but vertically partitioned within each page, where the reconstruction cost is minimal.
21. How do pages look in each scheme?
- NSM: each page has a PAGE HEADER followed by complete tuples stored one after the other (e.g., 12000 Mike Chan 30 M, 12305 George Best 27 M, ...; a second page holds 54647 John Kit 33 M, 94765 Flora Ho 27 F, ...).
- DSM: many tables, each with its own pages; every page holds surrogate-value pairs for a single attribute (e.g., 12000 Mike Chan, 12305 George Best, ... in the name table's pages; 12000 30, 12305 27, ... in the age table's pages).
- PAX: each page has a PAGE HEADER followed by one minipage per attribute (e.g., ids 12000, 12305, ...; names Mike Chan, George Best, ...; ages 30, 27, ...).
22. Comparison between PAX and NSM
- SELECT AVG(age)
- FROM EMPLOYEE
- WHERE id < 20000
- NSM: accessing the age of a tuple loads a whole cache-line of tuple data into the cache (e.g., 12000 Mike Chan 30 M ...), most of which is irrelevant to the query.
- PAX: only the relevant data are fetched to the cache (the id minipage 12000, 12305, ... and the age minipage 30, 27, ...).
23. Comparison between PAX and DSM
- SELECT AVG(age)
- FROM EMPLOYEE
- WHERE name LIKE 'M%'
- DSM: the name pages are scanned first (12000 Mike Chan, 12305 George, ...); the qualifying ids (e.g., 12000, ...) then have to be joined with the pages of the age table to compute the average.
- PAX: the name and age minipages are in the same page, so the join is avoided.
24. Is PAX always better than DSM?
- NO. If the relation has many attributes (e.g., 10) and the query involves only a few (e.g., 2), then the join may be more beneficial than reading whole tuples into memory.
- Example: R(a1, a2, a3, ..., a10)
- Each attribute is 4 bytes long.
- DSM: D1(id,a1), D2(id,a2), ..., D10(id,a10).
- Query:
- SELECT avg(a2)
- FROM R
- WHERE a1 = x
- Query selectivity is very high (1%); the qualifying ids can fit in memory.
- PAX has to read the whole table: 40 bytes x |R|.
- DSM has to read D1, apply the query, and create an intermediate table X with the qualifying ids (in memory). Then it has to read D2 to get the qualifying a2 values that join with X. So the total number of bytes DSM reads from disk is 16 bytes x |R|.
25. Summary
- In modern computer architectures the bottleneck is memory access; in many cases the processor is waiting for data to be fetched from memory.
- Memory access is no longer "random": when we access a location in memory, the data around it are loaded into a fast memory chip (the cache). Access locality is very important.
- Database operators, storage and buffering schemes, and indexes are being optimized to reduce cache misses.
26References
- George P. Copeland, Setrag Khoshafian, A
Decomposition Storage Model. SIGMOD, 1985. - Peter A. Boncz, Stefan Manegold, Martin L.
Kersten, Database Architecture Optimized for the
New Bottleneck Memory Access. VLDB, 1999. - Anastassia Ailamaki, David J. DeWitt, Mark D.
Hill, Marios Skounakis Weaving Relations for
Cache Performance, VLDB, 2001. - P. A. Boncz. Monet A Next-Generation DBMS Kernel
For Query-Intensive Applications. PhD
dissertation. Universiteit van Amsterdam, May
2002.