Title: 13.1 & 13.2
SECONDARY STORAGE MANAGEMENT
202 Sanuja Dabade
13.1.1 Memory Hierarchy
- Devices for data storage have different capacities available
- Cost per byte to store data also varies
- Devices with the smallest capacity offer the fastest speed, at the highest cost per bit
Memory Hierarchy Diagram
- From fastest/smallest to slowest/largest: Cache, Main Memory, Disk (as virtual memory and file system), Tertiary Storage
- Programs and main-memory DBMSs work at the main-memory level; disk-based DBMSs span main memory and disk
13.1.1 Memory Hierarchy
- Cache
- Lowest level of the memory hierarchy, but the fastest
- Data items are copies of certain main-memory locations
- Sometimes values in the cache are changed, and the corresponding changes to main memory are delayed
- The machine looks in the cache for instructions as well as for the data those instructions use
- Holds a limited amount of data
13.1.1 Memory Hierarchy (cont.)
- In a single-processor computer, the data in main memory doesn't have to be updated immediately
- In a multiprocessor computer, the data is updated to main memory immediately; this is called write-through
Main Memory
- Everything that happens in the computer, i.e. instruction execution and data manipulation, works on information that is resident in main memory
- Main memories are random access: one can obtain any byte in the same amount of time
Secondary Storage
- Used to store data and programs when they are not being processed
- More permanent than main memory: data and programs are retained when the power is turned off
- E.g. magnetic disks, hard disks
13.2 SECONDARY STORAGE MANAGEMENT
201 Eilbroun Benjamin
9Tertiary Storage
- Holds data volumes in terabytes
- Used for databases much larger than what can be
stored on disk
13.1.2 Transfer of Data Between Levels
- Data moves between adjacent levels of the hierarchy
- At the secondary or tertiary levels, accessing the desired data or finding the desired place to store it takes a lot of time
- The disk is divided into blocks
- Entire blocks are moved to and from a region of main memory called a buffer
13.1.2 Transfer of Data Between Levels (cont.)
- A key technique for speeding up database operations is to arrange the data so that when one piece of data on a block is needed, it is likely that other data on the same block will be needed at the same time
- The same idea applies to the other hierarchy levels
13.1.3 Volatile and Non-Volatile Storage
- A volatile device forgets what data is stored on it when the power goes off
- A non-volatile device holds data even when the device is turned off
- All secondary and tertiary devices are non-volatile, while main memory is volatile
13.1.4 Virtual Memory
- Typical software executes in virtual memory
- The address space is typically 32 bits, i.e. 2^32 bytes or 4 GB
- Transfer between memory and disk is in terms of blocks
13.2.1 Mechanics of Disks
- The use of secondary storage is one of the important characteristics of a DBMS
- A disk consists of two moving pieces:
- 1. the disk assembly
- 2. the head assembly
- The disk assembly consists of one or more platters
- Platters rotate around a central spindle
- Bits are stored on the upper and lower surfaces of the platters
13.2.1 Mechanics of Disks (cont.)
- Each surface of a disk is divided into tracks
- A track is divided into sectors
- Sectors are segments of the circle, separated by gaps
- The tracks that are at a fixed radius from the center, across all surfaces, form one cylinder
13.2.2 Disk Controller
- One or more disks are controlled by a disk controller
- Disk controllers are capable of:
- Controlling the mechanical actuator that moves the head assembly
- Selecting a sector from among all those in the cylinder at which the heads are positioned
- Transferring bits between the desired sector and main memory
- Possibly buffering an entire track
13.2.3 Disk Access Characteristics
- Accessing (reading/writing) a block requires 3 steps:
- The disk controller positions the head assembly at the cylinder containing the track on which the block is located. This is the seek time.
- The disk controller waits while the first sector of the block moves under the head. This is the rotational latency.
- All the sectors and the gaps between them pass under the head while the disk controller reads or writes the data in these sectors. This is the transfer time.
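A minimal sketch of this three-component access model, in Python. The 6.46 ms average seek and 8.33 ms rotation figures appear in the examples later in this section; the 256-sectors-per-track and 16-sectors-per-block values are assumptions chosen only to roughly reproduce the ~11 ms per 16 KB block quoted for the Megatron 747.

def block_access_time(seek_ms, rotation_ms, sectors_per_track, sectors_per_block):
    """Seek time + average rotational latency + transfer time for one block."""
    rotational_latency = rotation_ms / 2                      # wait half a rotation on average
    transfer = rotation_ms * sectors_per_block / sectors_per_track
    return seek_ms + rotational_latency + transfer

# Roughly reproduces the ~11 ms figure quoted above (assumed geometry).
print(block_access_time(seek_ms=6.46, rotation_ms=8.33,
                        sectors_per_track=256, sectors_per_block=16))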
13.3 Accelerating Access to Secondary Storage
- Several approaches exist for accessing data in secondary storage more efficiently:
- Place blocks that are accessed together in the same cylinder.
- Divide the data among multiple disks.
- Mirror disks.
- Use disk-scheduling algorithms.
- Prefetch blocks into main memory.
- Scheduling latency: the added delay in accessing data caused by a disk-scheduling algorithm.
- Throughput: the number of disk accesses per second that the system can accommodate.
13.3.1 The I/O Model of Computation
- The number of block accesses (disk I/Os) is an important time approximation for an algorithm.
- This number should be minimized.
- Ex 13.3: You want an index on R that identifies the block on which the desired tuple appears, but not where on the block it resides.
- For the Megatron 747 (M747) example, it takes 11 ms to read a 16 KB block.
- A standard microprocessor can execute millions of instructions in 11 ms, making any delay in searching the block for the desired tuple negligible.
13.3.2 Organizing Data by Cylinders
- If we read all blocks on a single track or cylinder consecutively, we can neglect all but the first seek time and the first rotational latency.
- Ex 13.4: We request 1024 blocks of the M747.
- If the data is randomly distributed, the average latency is 10.76 ms by Ex 13.2, making the total latency about 11 s.
- If all blocks are stored consecutively on one cylinder:
- 6.46 ms + (8.33 ms × 16) = 139 ms
- (1 average seek) + (time per rotation) × (number of rotations)
13.3.3 Using Multiple Disks
- If we have n disks, read/write performance will increase by a factor of n.
- Striping: distributing a relation across multiple disks following this pattern:
- Disk 1 holds blocks R1, R1+n, R1+2n, ...
- Disk 2 holds blocks R2, R2+n, R2+2n, ...
- ...
- Disk n holds blocks Rn, R2n, R3n, ...
- Ex 13.5: We request 1024 blocks with n = 4.
- 6.46 ms + (8.33 ms × (16/4)) = 39.8 ms
- (1 average seek) + (time per rotation) × (number of rotations)
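A minimal sketch of the striping rule above: block i of a relation round-robins across the n disks. Disk numbering from 0 and the sample block numbers are illustrative assumptions.

def striped_location(block_number, n_disks):
    """Return (disk index, position on that disk) for a striped relation."""
    return block_number % n_disks, block_number // n_disks

# With n = 4 disks, blocks 0..7 land on disks 0,1,2,3,0,1,2,3.
for b in range(8):
    disk, local = striped_location(b, 4)
    print(f"block {b} -> disk {disk}, position {local}")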
13.3.4 Mirroring Disks
- Mirroring disks: having 2 or more disks hold identical copies of the data.
- Benefit 1: If n disks are mirrors of each other, the system can survive crashes of up to n-1 disks.
- Benefit 2: If we have n disks, read performance increases by a factor of n.
- Performance increases further by having the controller select, for each read, the disk whose head is closest to the desired data block.
13.3.5 Disk Scheduling and the Elevator Problem
- The disk controller can run this algorithm to select which of several pending requests to process first.
- Pseudocode (a runnable sketch follows below):
- requests := [] // all unprocessed data requests
- upon receiving a new data request:
-     requests.add(new request)
- while (requests is not empty):
-     move head to next location
-     if (head location holds data in requests):
-         retrieve the data
-         remove that request from requests
-     if (head reaches the end of the disk):
-         reverse head direction
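A runnable Python sketch of the pseudocode above. It assumes all requests are already pending when the sweep starts; in the worked example on the next slide new requests arrive while the head moves, which is why cylinder 16000 ends up being served on the return sweep there.

def elevator(head, pending, direction=1):
    """Serve cylinder requests by sweeping the head in one direction,
    reversing only when nothing remains ahead of it."""
    order = []
    pending = list(pending)
    while pending:
        ahead = [c for c in pending if (c - head) * direction >= 0]
        if not ahead:              # reached the end of the sweep: reverse
            direction = -direction
            continue
        nxt = min(ahead, key=lambda c: abs(c - head))
        order.append(nxt)
        pending.remove(nxt)
        head = nxt
    return order

print(elevator(0, [8000, 24000, 56000, 16000, 64000, 40000]))
# -> [8000, 16000, 24000, 40000, 56000, 64000] when all requests are known up front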
13.3.5 Disk Scheduling and the Elevator Problem (cont.)
Example timeline (the head starts at cylinder 0 and sweeps toward higher cylinders; new requests arrive while it moves):
- Time 0: requests arrive for cylinders 8000, 24000, and 56000.
- Time 4.3: data at 8000 retrieved; around time 10 a request for 16000 arrives.
- Time 13.6: data at 24000 retrieved; around time 20 a request for 64000 arrives.
- Time 26.9: data at 56000 retrieved; around time 30 a request for 40000 arrives.
- Time 34.2: data at 64000 retrieved; the head reverses direction.
- Time 45.5: data at 40000 retrieved.
- Time 56.8: data at 16000 retrieved.

Cylinder   Time retrieved
8000       4.3
24000      13.6
56000      26.9
64000      34.2
40000      45.5
16000      56.8
13.3.5 Disk Scheduling and the Elevator Problem (cont.)
Completion times for the same requests under the elevator algorithm and under FIFO:

Cylinder   Elevator   FIFO
8000       4.3        4.3
24000      13.6       13.6
56000      26.9       26.9
64000      34.2       59.5
40000      45.5       70.8
16000      56.8       42.2
13.3.6 Prefetching and Large-Scale Buffering
- If, at the application level, we can predict the order in which blocks will be requested, we can load them into main memory before they are needed.
13.4 DISK FAILURE WAYS AND THEIR MITIGATION
203 Priya Gangaraju
Ways in which disks can fail:
- Intermittent failure
- Media decay
- Write failure
- Disk crash
Intermittent Failures
- A read or write operation on a sector succeeds, but only after repeated tries.
- This is the most common form of failure.
- Parity checks can be used to detect this kind of failure.
Media Decay
- A more serious form of failure.
- One or more bits are permanently corrupted.
- It is impossible to read the sector correctly, even after many tries.
- The stable storage technique for organizing a disk is used to cope with this failure.
Write Failure
- An attempt to write a sector is not possible.
- An attempt to retrieve the previously written sector is unsuccessful.
- A possible reason: a power outage while the sector was being written.
- The stable storage technique can be used to cope with this.
33Disk Crash
- Most serious form of disk failure.
- Entire disk becomes unreadable, suddenly and
permanently. - RAID techniques can be used for coping with disk
crashes.
More on Intermittent Failures
- A read failure occurs when we try to read a sector but the correct contents of that sector are not delivered to the disk controller.
- If the controller has a way to tell whether the sector read is good or bad (checksums), it can re-issue the read request when bad data is read.
More on Intermittent Failures (cont.)
- The controller can attempt to write a sector, but the contents written to the sector are not what was intended.
- The only way to check for this kind of failure is to read the sector again.
- One way to perform the check is to read the sector back and compare it with the sector we intended to write.
More on Intermittent Failures (cont.)
- Instead of performing the complete comparison at the disk controller, a simpler way is to read the sector back and check whether a good sector was read.
- If it is a good sector, then the write was correct; otherwise the write was unsuccessful and must be repeated.
Checksums
- A technique used to determine the good/bad status of a sector.
- Each sector has some additional bits, called the checksum, that are set depending on the values of the data bits in that sector.
- If the checksum is not correct on reading, then there was an error in reading.
Checksums (cont.)
- There is a small chance that the block was not read correctly even if the checksum is proper.
- The probability of detecting an error can be increased by using more checksum bits.
Checksum Calculation
- The simplest checksum is based on the parity of all bits in the sector.
- If there is an odd number of 1s among a collection of bits, the bits are said to have odd parity; a parity bit of 1 is added.
- If there is an even number of 1s, the collection of bits is said to have even parity; a parity bit of 0 is added.
Checksum Calculation (cont.)
- The number of 1s among a collection of bits and their parity bit is always even.
- During a write operation, the disk controller calculates the parity bit and appends it to the sequence of bits written in the sector.
- Every sector then has even parity.
Examples
- The sequence of bits 01101000 has an odd number of 1s, so the parity bit is 1 and the sequence with its parity bit is 011010001.
- The sequence of bits 11101110 has even parity, as it has an even number of 1s, so with parity bit 0 the sequence becomes 111011100.
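A minimal Python sketch of the parity-bit checksum just described. Representing a sector as a string of 0s and 1s is an illustrative assumption.

def add_parity(bits):
    """Append a parity bit so the total number of 1s is even."""
    parity = bits.count("1") % 2          # 1 if the data has odd parity
    return bits + str(parity)

def parity_ok(bits_with_parity):
    """A stored sector is 'good' if its 1s (data plus parity bit) are even."""
    return bits_with_parity.count("1") % 2 == 0

print(add_parity("01101000"))   # 011010001, matching the first example
print(add_parity("11101110"))   # 111011100, matching the second example
print(parity_ok("011010001"))   # True
print(parity_ok("011010101"))   # False: a single flipped bit is detected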
Checksum Calculation (cont.)
- Any one-bit error in reading or writing results in a sequence of bits that has odd parity.
- The disk controller can count the number of 1s and thereby detect the error, since the sector no longer has even parity.
Odds
- There is a chance that more than one bit is corrupted and the error goes unnoticed.
- Increasing the number of parity bits increases the chance of detecting errors.
- In general, if there are n independent bits serving as the checksum, the chance of missing an error is only 1 in 2^n.
Stable Storage
- Checksums can detect an error but cannot correct it.
- Sometimes we overwrite the previous contents of a sector and yet cannot read the new contents correctly.
- To deal with these problems, a stable storage policy can be implemented on the disks.
Stable Storage (cont.)
- Sectors are paired, and each pair represents one sector-contents X.
- The left copy of the sector is denoted XL and the right copy XR.
Assumptions
- We assume that copies are written with a sufficient number of parity bits to reduce the chance that a bad sector looks good when the parity checks are considered.
- Also, if the read function returns a good value w for either XL or XR, then it is assumed that w is the true value of X.
Stable-Storage Writing Policy
- 1. Write the value of X into XL. Check that the value has good status, i.e., the parity-check bits are correct in the written copy; if not, repeat the write. If after a set number of write attempts we have not successfully written X into XL, assume that there is a media failure in this sector. A fix-up, such as substituting a spare sector for XL, must be adopted.
- 2. Repeat (1) for XR.
Stable-Storage Reading Policy
- The policy is to alternate trying to read XL and XR until a good value is returned.
- If a good value is not returned after a pre-chosen number of tries, then X is assumed to be truly unreadable.
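A minimal Python sketch of the two policies above. The FakeDisk class, the retry limit, and the sector names are illustrative assumptions standing in for a real disk controller; read_sector returning None stands for a failed parity check.

class FakeDisk:
    """Tiny in-memory stand-in for a disk controller (illustrative only)."""
    def __init__(self):
        self.sectors = {}
    def write_sector(self, sector, value):
        self.sectors[sector] = value
    def read_sector(self, sector):
        return self.sectors.get(sector)     # None = unreadable / bad parity

MAX_TRIES = 3   # assumed retry limit

def stable_write(disk, x_left, x_right, value):
    """Write XL first and verify it; only then write XR."""
    for sector in (x_left, x_right):
        for _ in range(MAX_TRIES):
            disk.write_sector(sector, value)
            if disk.read_sector(sector) == value:   # parity check passed
                break
        else:
            # Persistent failure: a fix-up such as substituting a spare
            # sector for this one would be needed here.
            raise IOError("media failure writing sector " + str(sector))

def stable_read(disk, x_left, x_right):
    """Alternate between XL and XR until a good value is returned."""
    for _ in range(MAX_TRIES):
        for sector in (x_left, x_right):
            value = disk.read_sector(sector)
            if value is not None:                   # good parity
                return value
    raise IOError("X is truly unreadable")

d = FakeDisk()
stable_write(d, "XL", "XR", "payload")
print(stable_read(d, "XL", "XR"))   # payload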
Error-Handling Capabilities
- Media failures:
- If, after storing X in sectors XL and XR, one of them undergoes a media failure and becomes permanently unreadable, we can read X from the other.
- If both sectors have failed, then X cannot be read.
- The probability of both failing is extremely small.
Error-Handling Capabilities (cont.)
- Write failure:
- When writing X, if there is a system failure (such as a power outage), the X in main memory is lost and the copy of X being written may be erroneous.
- Half of the sector may have been written with part of the new value of X, while the other half remains as it was.
Error-Handling Capabilities (cont.)
- The possible cases when the system becomes available again:
- The failure occurred while writing XL. Then XL is considered bad. Since XR was never changed, its status is good. We can copy XR into XL, restoring the old value of X.
- The failure occurred after XL was written. Then XL has good status, and XR, which still has the old value of X, has bad status. We can copy the new value of X from XL into XR.
Recovery from Disk Crashes
- To reduce data loss from disk crashes, schemes that involve redundancy, extending the idea of parity checks or duplicated sectors, can be applied.
- The term used for these strategies is RAID: Redundant Arrays of Independent Disks.
- In general, if the mean time to failure of a disk is n years, then in any given year 1/nth of the surviving disks fail.
Recovery from Disk Crashes (cont.)
- Each of the RAID schemes has data disks and redundant disks.
- Data disks are one or more disks that hold the data.
- Redundant disks are one or more disks that hold information completely determined by the contents of the data disks.
- When any one disk crashes, the other disks can be used to restore the failed disk, avoiding a permanent loss of information.
13.4 DISK FAILURES
204 Xiaqing He
1) Mirroring
- The simplest scheme for recovering from disk crashes.
- How does mirroring work?
- Make two or more copies of the data and save them on different disks.
- Benefits:
- The data survives if one disk fails.
- Having the same data on several disks lets several blocks be accessed at once.
1) Mirroring (cont.)
- With mirroring, when can data be lost?
- The only way data can be lost is if both disks crash.
- Probability estimate. Suppose:
- The mean time to failure of one disk is 10 years.
- The mean time until one of the two disks fails is therefore 5 years.
- Replacing a failed disk takes 3 hours, i.e. 1/2920 of a year.
- So:
- The probability that the mirror disk fails while the first is being replaced is 1/10 × 1/2920 = 1/29,200.
- The probability of data loss by mirroring in a given year is 1/5 × 1/29,200 = 1/146,000.
2) Parity Blocks
- Why change?
- Disadvantage of mirroring: it uses too many redundant disks.
- What's new?
- RAID level 4 uses only one redundant disk.
- How does this one redundant disk work?
- By the modulo-2 sum (XOR):
- The jth bit of the redundant disk is the modulo-2 sum of the jth bits of all the data disks.
- Example:
2) Parity Blocks (cont.) — Example
- Data disks:
- Disk 1: 11110000
- Disk 2: 10101010
- Disk 3: 00111000
- Redundant disk:
- Disk 4: 01100010
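A minimal Python sketch of this modulo-2 (XOR) parity computation, using the bit strings from the example above.

from functools import reduce

def parity_block(blocks):
    """jth bit of the redundant block = XOR of the jth bits of all data blocks."""
    return "".join(str(reduce(lambda a, b: a ^ b, (int(bits[j]) for bits in blocks)))
                   for j in range(len(blocks[0])))

data = ["11110000", "10101010", "00111000"]
print(parity_block(data))   # 01100010, matching Disk 4 above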
2) RAID 4 (cont.)
- Reading:
- Similar to reading blocks from any single disk.
- Writing:
- 1) Change the data disk.
- 2) Change the corresponding block of the redundant disk.
- Why?
- The redundant disk holds the parity checks for the corresponding blocks of all the data disks.
2) RAID 4 (cont.) — Writing
- For a total of N data disks:
- 1) Naive way:
- Read all N data disks and compute the modulo-2 sum of the corresponding blocks.
- Rewrite the redundant disk according to that modulo-2 sum.
- 2) Better way:
- Take the modulo-2 sum of the old and new versions of the data block that was rewritten.
- Flip the bits of the redundant block in exactly the positions where that modulo-2 sum has 1s.
2) RAID 4 (cont.) — Writing Example
- Data disks:
- Disk 1: 11110000
- Disk 2: 10101010 → 01100110
- Disk 3: 00111000
- To do:
- Modulo-2 sum of the old and new versions of disk 2: 11001100
- So we need to flip positions 1, 2, 5, and 6 of the redundant disk.
- Redundant disk:
- Disk 4: 01100010 → 10101110
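A minimal Python sketch of the "better way" above: the redundant block is updated by XOR-ing it with (old XOR new) of the rewritten data block, using the bit strings from this example.

def xor_bits(a, b):
    return "".join(str(int(x) ^ int(y)) for x, y in zip(a, b))

old_disk2, new_disk2 = "10101010", "01100110"
old_parity = "01100010"

change_mask = xor_bits(old_disk2, new_disk2)   # positions that flipped
new_parity = xor_bits(old_parity, change_mask)

print(change_mask)   # 11001100
print(new_parity)    # 10101110, matching the updated Disk 4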
2) RAID 4 (cont.) — Failure Recovery
- Redundant disk crash:
- Swap in a new disk and recompute its contents from all the data disks.
- One of the data disks crashes:
- Swap in a new disk.
- Recompute its data from the other disks, including the remaining data disks and the redundant disk.
- How to recompute? (It is the same rule, which is why this works.)
- Take the modulo-2 sum of all the corresponding bits of all the other disks.
3) An Improvement: RAID 5
- Why is an improvement needed?
- Shortcoming of RAID level 4: it suffers from a bottleneck (every update of a data disk must also read and write the single redundant disk).
- Principle of RAID level 5 (RAID 5):
- Treat each disk as the redundant disk for some of the blocks.
- Why is this feasible?
- The failure-recovery rule for a redundant disk and a data disk is the same:
- take the modulo-2 sum of all the corresponding bits of all the other disks.
- So there is no need to dedicate one disk as the redundant disk and the others as data disks.
3) RAID 5 (cont.)
- How do we decide which blocks of each disk treat that disk as the redundant disk?
- If there are n+1 disks, labeled 0 through n, then we can treat the ith cylinder of disk j as redundant if j is the remainder when i is divided by n+1.
- Example:
3) RAID 5 (cont.) — Example
- n = 3 (so there are 4 disks, labeled 0 through 3; see the sketch below):
- The first disk, labeled 0, is redundant for cylinders 4, 8, 12, ...
- The second disk, labeled 1, is redundant for cylinders 1, 5, 9, ...
- The third disk, labeled 2, is redundant for cylinders 2, 6, 10, ...
- The fourth disk, labeled 3, is redundant for cylinders 3, 7, 11, ...
- Suppose all 4 disks are equally likely to be written. For any one of the 4 disks, the probability that a given write involves it is 1/4 + (3/4 × 1/3) = 1/2.
- In general, for N = m disks: 1/m + ((m-1)/m × 1/(m-1)) = 2/m.
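A minimal Python sketch of the RAID 5 placement rule above: cylinder i's redundant (parity) copy lives on disk i mod (n+1). The disk count and cylinder range are the example's values.

def redundant_disk(cylinder, num_disks):
    return cylinder % num_disks

# With 4 disks (n = 3): disk 0 is redundant for cylinders 0, 4, 8, 12, ...
for disk in range(4):
    cyls = [c for c in range(13) if redundant_disk(c, 4) == disk]
    print("disk", disk, "is redundant for cylinders", cyls)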
4) Coping with Multiple Disk Crashes
- RAID 6:
- Can deal with any number of disk crashes if enough redundant disks are used.
- Example:
- A system of seven disks (four data disks, numbered 1-4, and three redundant disks, numbered 5-7).
- How is the 3×7 matrix set up?
- (Why 3 rows? Because there are 3 redundant disks.)
- 1) Every column is a distinct combination of three 0s and 1s, excluding the all-0s column.
- 2) The column of each redundant disk has a single 1.
- 3) The column of each data disk has at least two 1s.
4) Coping with Multiple Disk Crashes (cont.)
- Reading:
- Read from the data disks and ignore the redundant disks.
- Writing:
- Change the data disk.
- Change the corresponding bits of all the redundant disks whose rows have a 1 in that data disk's column.
4) Coping with Multiple Disk Crashes (cont.)
- In this system with 4 data disks and 3 redundant disks, how can up to 2 disk crashes be corrected?
- Suppose disks a and b have failed:
- Find some row r of the 3×7 matrix in which the columns for a and b differ (say a has 0 and b has 1 in row r).
- Compute the correct b by taking the modulo-2 sum of the corresponding bits from all the disks other than b that have 1 in row r.
- After recovering b, compute the correct a using all the other disks, which are now available.
- Example:
4) Coping with Multiple Disk Crashes (cont.) — Example
- The 3×7 matrix:

              data disks        redundant disks
disk number:  1  2  3  4        5  6  7
row 1:        1  1  1  0        1  0  0
row 2:        1  1  0  1        0  1  0
row 3:        1  0  1  1        0  0  1
4) Coping with Multiple Disk Crashes (cont.) — Example
- First block of each disk:
- 1) 11110000
- 2) 10101010
- 3) 00111000
- 4) 01000001
- 5) 01100010
- 6) 00011011
- 7) 10001001
4) Coping with Multiple Disk Crashes (cont.) — Example
- Two disks crash:
- 1) 11110000
- 2) ????????
- 3) 00111000
- 4) 01000001
- 5) ????????
- 6) 00011011
- 7) 10001001
4) Coping with Multiple Disk Crashes (cont.) — Example
- In the 3×7 matrix, row 2 is one where disks 2 and 5 have different values: disk 2's entry is 1 and disk 5's entry is 0.
- So compute the first block of disk 2 as the modulo-2 sum of the corresponding bits of disks 1, 4, and 6 (the other disks with 1 in row 2).
- Then compute the first block of disk 5 as the modulo-2 sum of the corresponding bits of disks 1, 2, and 3 (the other disks with 1 in row 1).
- 1) 11110000
- 2) ???????? → 10101010
- 3) 00111000
- 4) 01000001
- 5) ???????? → 01100010
- 6) 00011011
- 7) 10001001
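A minimal Python sketch of this two-crash recovery, using the 3×7 matrix and block contents of the example above. Disks are indexed 1-7 and 8-bit strings stand in for whole blocks; the function names are illustrative.

ROWS = {1: [1, 2, 3, 5], 2: [1, 2, 4, 6], 3: [1, 3, 4, 7]}   # disks with a 1 in each row

def xor_bits(a, b):
    return "".join(str(int(x) ^ int(y)) for x, y in zip(a, b))

def recover(blocks, failed_a, failed_b):
    # Pick a row where exactly one of the two failed disks appears.
    row = next(r for r, disks in ROWS.items()
               if (failed_a in disks) != (failed_b in disks))
    first = failed_a if failed_a in ROWS[row] else failed_b
    other = failed_b if first == failed_a else failed_a
    # Rebuild `first` from the surviving disks of that row.
    acc = "0" * 8
    for d in ROWS[row]:
        if d != first:
            acc = xor_bits(acc, blocks[d])
    blocks[first] = acc
    # Now rebuild the other failed disk from any row containing it.
    row2 = next(r for r, disks in ROWS.items() if other in disks)
    acc = "0" * 8
    for d in ROWS[row2]:
        if d != other:
            acc = xor_bits(acc, blocks[d])
    blocks[other] = acc
    return blocks

blocks = {1: "11110000", 2: None, 3: "00111000", 4: "01000001",
          5: None, 6: "00011011", 7: "10001001"}
print(recover(blocks, 2, 5))   # disk 2 -> 10101010, disk 5 -> 01100010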
13.5 ARRANGING DATA ON DISK
205 Meghna Jain
Data elements are represented as records, which are stored in consecutive bytes in the same disk block.
Basic layout technique for storing data: fixed-length records.
Allocation criterion: data should start at a word boundary.
A fixed-length record header contains:
1. A pointer to the record schema.
2. The length of the record.
3. Timestamps to indicate the last modification or last read.
Example
- CREATE TABLE employee(
-   name CHAR(30) PRIMARY KEY,
-   address VARCHAR(255),
-   gender CHAR(1),
-   birthdate DATE
- )
- The record should start at a word boundary and contains a header and four fields: name, address, gender, and birthdate.
Packing Fixed-Length Records into Blocks
- Records are stored in blocks on the disk and are moved into main memory when we need to access or update them.
- A block header is written first, followed by a series of records.
The block header contains the following information:
- Links to one or more other blocks that are part of a network of blocks.
- Information about the role played by this block in such a network.
- Information about which relation the tuples in this block belong to.
- A "directory" giving the offset of each record in the block.
- Timestamp(s) indicating the time of the block's last modification and/or access.
Example
- Along with the header, we pack as many records as fit into one block, as shown in the figure; the remaining space is unused.
13.6 REPRESENTING BLOCK AND RECORD ADDRESSES
206 Ramya Karri
INTRODUCTION
- Addresses of a block and a record:
- In main memory:
- The address of the block is the virtual-memory address of its first byte.
- The address of a record within the block is the virtual-memory address of the first byte of the record.
- In secondary memory: a sequence of bytes describes the location of the block in the overall system.
- This sequence of bytes includes the device ID for the disk, the cylinder number, and so on.
ADDRESSES IN CLIENT-SERVER SYSTEMS
- Addresses in the address space are represented in two ways:
- Physical addresses: byte strings that determine the place within the secondary storage system where the record can be found.
- Logical addresses: arbitrary strings of bytes of some fixed length.
- Physical address bits are used to indicate:
- The host to which the storage is attached
- An identifier for the disk
- The number of the cylinder
- The number of the track
- The offset of the beginning of the record
Addresses in Client-Server Systems (cont.)
- A map table relates logical addresses to physical addresses.

Logical | Physical
LOGICAL AND STRUCTURED ADDRESSES
- What is the purpose of a logical address?
- It gives more flexibility when we:
- Move the record around within the block
- Move the record to another block
- It also gives us options for deciding what to do when a record is deleted.

Record 4 | Record 3 | Record 2 | Record 1
POINTER SWIZZLING
- Having pointers is common in object-relational database systems.
- It is important to learn about the management of pointers.
- Every data item (block, record, etc.) has two addresses:
- A database address: its address on disk
- A memory address, if the item is currently in virtual memory
POINTER SWIZZLING (cont.)
- Translation table: maps database addresses to memory addresses.
- All addressable items in the database have entries in the map table, while only those items currently in memory are mentioned in the translation table.

Dbaddr | Mem-addr
POINTER SWIZZLING (cont.)
- A pointer consists of the following two fields:
- A bit indicating the type of address (database or memory)
- The database or memory address itself
- Example 13.17 (figure): Block 1 is in memory with a swizzled pointer; Block 2 remains on disk with an unswizzled pointer.
EXAMPLE 13.17 (cont.)
- Block 1 has a record with pointers to a second record on the same block and to a record on another block.
- If Block 1 is copied into memory:
- The first pointer, which points within Block 1, can be swizzled so it points directly to the memory address of the target record.
- Since Block 2 is not in memory, we cannot swizzle the second pointer.
POINTER SWIZZLING (cont.)
- Three approaches to swizzling:
- Automatic swizzling:
- As soon as a block is brought into memory, swizzle all relevant pointers.
- Swizzling on demand:
- Only swizzle a pointer if and when it is actually followed (see the sketch below).
- No swizzling:
- Pointers are never swizzled; they are always accessed using the database address.
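A minimal Python sketch of swizzling on demand with a translation table that maps database (disk) addresses to in-memory objects. The class and function names, and the fake load_block, are illustrative assumptions rather than any specific DBMS API.

class Pointer:
    def __init__(self, db_addr):
        self.db_addr = db_addr       # persistent (disk) address
        self.mem_ref = None          # set once swizzled

translation_table = {}               # db_addr -> in-memory object

def load_block(db_addr):
    """Stand-in for reading a block/record from disk into memory."""
    obj = {"db_addr": db_addr, "payload": "record at " + hex(db_addr)}
    translation_table[db_addr] = obj
    return obj

def follow(ptr):
    """Swizzle the pointer the first time it is followed."""
    if ptr.mem_ref is None:
        ptr.mem_ref = translation_table.get(ptr.db_addr) or load_block(ptr.db_addr)
    return ptr.mem_ref

p = Pointer(db_addr=0xB10C)
print(follow(p)["payload"])     # first use: loads the item and swizzles
print(p.mem_ref is follow(p))   # later uses go straight to memory -> True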
PROGRAMMER CONTROL OF SWIZZLING
- Unswizzling:
- When a block is moved from memory back to disk, all pointers must be converted back to database (disk) addresses.
- The translation table is used again for this.
- It is important to have an efficient data structure for the translation table.
PINNED RECORDS AND BLOCKS
- A block in memory is said to be pinned if it cannot be written back to disk safely.
- If block B1 has a swizzled pointer to an item in block B2, then B2 is pinned.
- To unpin a block, we must unswizzle any pointers to it:
- Keep in the translation table the places in memory that hold swizzled pointers to that item.
- Unswizzle those pointers (use the translation table to replace the memory addresses with database (disk) addresses).
13.7 VARIABLE-LENGTH DATA AND RECORDS
221 Eswara Satya Pavan Rajesh Pinapala
Example
Fig 1: A movie-star record with four fixed-length fields (name, address, gender, birth date) at byte offsets 0, 30, 286, and 287; the record is 297 bytes long.
Records with Variable-Length Fields
- An effective way to represent variable-length records is as follows:
- Fixed-length fields are kept ahead of the variable-length fields.
- The record header contains:
- The length of the record
- Pointers to the beginning of all variable-length fields except the first one
Records with Variable-Length Fields
Figure 2: A movie-star record with name and address implemented as variable-length character strings. The header holds the record length and a pointer to the start of the address; the fixed-length fields (birth date, gender) precede the variable-length name and address.
Records with Repeating Fields
- A record may contain a variable number of occurrences of a field F.
- All occurrences of field F are grouped together, and the record header contains a pointer to the first occurrence of F.
- L bytes are devoted to each instance of field F.
- Locating an occurrence of field F within the record (see the sketch below):
- Add to the offset of the first occurrence of F an integer multiple of L (0, L, 2L, 3L, and so on).
- We stop upon reaching the offset of the field following F.
Records with Repeating Fields
Figure 3: A record with a repeating group of references to movies. The header holds the record length, a pointer to the address field, and a pointer to the group of movie pointers, followed by the name, the address, and the pointers to movies.
Records with Repeating Fields
Figure 4: Storing variable-length fields separately from the record. The record keeps header information, the length of and a pointer to the name, the length of and a pointer to the address, and the number of and a pointer to the movie references; the name and address themselves are stored elsewhere.
Records with Repeating Fields
- Advantage:
- Keeping the record itself fixed-length allows records to be searched more efficiently, minimizes the overhead in block headers, and allows records to be moved within or among blocks with minimum effort.
- Disadvantage:
- Storing variable-length components on another block increases the number of disk I/Os needed to examine all components of a record.
Records with Repeating Fields
- A compromise strategy is to allocate a fixed portion of the record for the repeating fields.
- If the number of repeating fields is less than the allocated space, some space is unused.
- If the number of repeating fields is greater than the allocated space, the extra fields are stored in a different location, and
- a pointer to that location and a count of the additional occurrences are stored in the record.
Variable-Format Records
- Records that do not have a fixed schema.
- Variable-format records are represented by a sequence of tagged fields.
- Each tagged field contains:
- The attribute or field name
- The type of the field
- The length of the field
- The value of the field
- Why use tagged fields?
- Information-integration applications
- Records with a very flexible schema
Variable-Format Records
Fig 5: A record with tagged fields. First field: code N (name), string type S, length 14, value "Clint Eastwood". Second field: code R (restaurant owned), string type S, length 16, value "Hog's Breath Inn".
13.8 RECORD MODIFICATIONS
222 Neha Samant
Records That Do Not Fit in a Block
- When the length of a record is greater than the block size, the record is divided and placed into two or more blocks.
- The portion of the record in each block is referred to as a RECORD FRAGMENT.
- A record with two or more fragments is called a SPANNED RECORD.
- A record that does not cross a block boundary is called an UNSPANNED RECORD.
Spanned Records
- Spanned records require the following extra header information:
- A bit indicating whether the piece is a fragment or not
- Bits indicating whether it is the first or last fragment of a record
- Pointers to the next and/or previous fragment of the same record
Records That Do Not Fit in a Block
Figure 6: Storing spanned records across blocks. Block 1 holds record 1 and fragment 2-a of record 2; block 2 holds fragment 2-b of record 2 and record 3; each block and record has its own header.
BLOBs
- Large binary objects are called BLOBs, e.g. audio files and video files.
- Issues include the storage of BLOBs and the retrieval of BLOBs.
Modification Types
- Insertion
- Deletion
- Update
Insertion
- Insertion of records without order:
- Records can be placed in a block's free space or in a new block.
- Insertion of records in a fixed order:
- Space may be available in the block, or
- no space may be available in the block (insert outside the block).
- Complications: structured addresses, and pointers to a record from outside the block.
Insertion in Fixed Order
- Space available within the block:
- Use an offset table in the header of each block, with pointers to the location of each record in the block.
- The records are slid within the block and the pointers in the offset table are adjusted.
Insertion in Fixed Order (cont.)
- No space available within the block (insert outside the block):
- Find space on a nearby block:
- If no space is available on this block, look at the following block in the sorted order of blocks.
- If space is available there, move the highest records of block 1 to block 2 and slide the records around on both blocks.
- Create an overflow block:
- Records can be stored in an overflow block.
- Each block has a place in its header for a pointer to an overflow block.
- The overflow block can itself point to a second overflow block, and so on.
Deletion
- Recovering space after deletion:
- When using an offset table, the records can be slid around the block so that an unused region forms in the center, which can be reused.
- If we cannot slide records, an available-space list can be maintained in the block header.
- The list head goes in the block header, and the available regions hold the links of the list.
Deletion (cont.)
- Use of a tombstone:
- A tombstone is left in place of a deleted record so that pointers to the deleted record do not end up pointing at new records.
- The tombstone is permanent until the entire database is reconstructed.
- If pointers go to fixed locations from which the location of the record is found, then we put the tombstone in that fixed location (see the examples).
- Where a tombstone is placed depends on the nature of the record pointers.
- A map table is used to translate a logical record address to a physical address.
Deletion (cont.)
- Use of a tombstone:
- If we need to replace records by tombstones, place the bit that serves as the tombstone at the beginning of the record.
- This bit remains at the record location, and the subsequent bytes can be reused for another record.
- Record 1 can be replaced, but the tombstone remains; record 2 has no tombstone and can be seen when we follow a pointer to it.
Update
- Fixed-length update:
- No effect on the storage system, as the record occupies the same space as before the update.
- Variable-length update:
- The new value may be longer or shorter than the old one.
Update (cont.)
- Variable-length update (longer value):
- Stored on the same block:
- Slide records
- Create an overflow block
- Stored on another block:
- Move records around that block
- Create a new block for storing variable-length fields
Update (cont.)
- Variable-length update (shorter value):
- Handled like deletion:
- Recover the freed space
- Consolidate space
14.2 B-Trees and Bitmap Indexes
207 Maciej Kicinski
Structure
- A B-tree is a balanced tree, meaning that all paths from the root to a leaf have the same length.
- There is a parameter n associated with each B-tree block. Each block has space for n search keys and n+1 pointers.
- The root may have as few as 1 search key, but all other blocks must be at least half full.
Structure (cont.)
- A typical node has n search keys and n+1 pointers.
- A typical interior node has pointers to the nodes at the next level; it holds keys but no record values.
- A typical leaf has pointers that point to records.
Application
- The search key of the B-tree is the primary key for the data file, and typically:
- The data file is sorted by its primary key, or
- the data file is sorted by an attribute that is not a key, and this attribute is the search key for the B-tree.
Lookup
- At an interior node, choose the correct pointer to follow. This is done by comparing the search value with the keys.
- At a leaf node, choose the key that matches what you are looking for; its associated pointer leads to the data.
Insertion
- When inserting, choose the correct leaf node in which to put the pointer to the data.
- If the node is full, create a new node and split the keys between the two.
- Recursively move up: if a parent cannot accept a pointer to the new node because it is full, split it as well.
- This can end with creating a new root node, if the current root was full.
Deletion
- Perform a lookup to find the node containing the key, and delete it.
- If the node is no longer at least half full, either join it with an adjacent node and recursively delete upward, or move a key over from an adjacent node that is more than half full and adjust the pointer in the parent.
Efficiency
- B-trees allow lookup, insertion, and deletion of records using very few disk I/Os.
- Each level of a B-tree requires one read; you then follow the pointer from that read to the next level or to the final data block.
Efficiency (cont.)
- Three levels are usually sufficient for B-trees: with 255 pointers per block, 255^3 is about 16.6 million.
- You can reduce disk I/Os further by keeping a level of the B-tree in main memory. Keeping the root block with its 255 pointers reduces the reads to 2, and keeping the next level's 255 blocks in memory as well reduces the reads to 1.
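A quick Python check of the capacity arithmetic above, assuming 255 pointers per block.

pointers_per_block = 255
for levels in (1, 2, 3):
    print(levels, "level(s):", pointers_per_block ** levels, "record pointers")
# 3 levels -> 16,581,375, i.e. about 16.6 million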
14.7 B-Trees and Bitmap Indexes
221 Deepti Kundu
Definition
- A bitmap index for a field F is a collection of bit-vectors of length n, one for each possible value that may appear in the field F. [1]
What does that mean?
- Assume a relation R with:
- 2 attributes A and B.
- Attribute A of type Integer and B of type String.
- 6 records, numbered 1 through 6, as shown.

    A   B
1   30  foo
2   30  bar
3   40  baz
4   50  foo
5   40  bar
6   30  baz
Example Continued
- A bitmap index for attribute B (of the relation R above) is:

Value  Vector
foo    100100
bar    010010
baz    001001
Where does that take us?
- A bitmap index is a special kind of database index that uses bitmaps. [2]
- Bitmap indexes have traditionally been considered to work well for data such as gender, which has a small number of distinct values, e.g. male and female, but many occurrences of those values. [2]
A little more
- A bitmap index for attribute A of relation R is:
- A collection of bit-vectors.
- The number of bit-vectors = the number of distinct values of A in R.
- The length of each bit-vector = the cardinality of R.
- The bit-vector for value v has 1 in position i if the ith record has v in attribute A, and 0 there if not. [3]
- Records are allocated permanent numbers. [3]
- There is a mapping between record numbers and record addresses. [3]
Motivation for Bitmap Indexes
- Very efficient when used for partial-match queries. [3]
- They offer the advantage of buckets [2]:
- We can find tuples with several specified attributes without first retrieving all the records that matched each attribute separately.
- They can also help answer range queries. [3]
Another Example
- A sequence of pairs with values of two types:
- (5,d), (79,t), (4,d), (79,d), (5,t), (6,a)
- Bit-vectors for the first component:
- 5: 100010
- 79: 010100
- 4: 001000
- 6: 000001
- Bit-vectors for the second component:
- d: 101100
- t: 010010
- a: 000001
Example Continued
- (5,d), (79,t), (4,d), (79,d), (5,t), (6,a)
- Searching for items is easy: just AND the relevant bit-vectors together.
- To search for (5, d):
- 5: 100010
- d: 101100
- 100010 AND 101100 = 100000
- The location of the record has been found: only record 1 matches.
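A minimal Python sketch of this AND-based lookup, using the bit-vectors of the example above. The dictionary layout is an illustrative assumption.

vectors = {5: "100010", 79: "010100", 4: "001000", 6: "000001",
           "d": "101100", "t": "010010", "a": "000001"}

def matches(*keys):
    """Record numbers (1-based) whose bits survive ANDing the chosen vectors."""
    n = len(next(iter(vectors.values())))
    acc = (1 << n) - 1                      # start with all 1s
    for k in keys:
        acc &= int(vectors[k], 2)           # bitwise AND of the vectors
    return [i + 1 for i in range(n) if acc & (1 << (n - 1 - i))]

print(matches(5, "d"))   # [1]: only record 1 is (5, d)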
Compressed Bitmaps
- Assume:
- The number of records in R is n.
- Attribute A has m distinct values in R.
- The size of a bitmap index on attribute A is then m × n bits.
- If m is large, then the fraction of 1s in a bit-vector is only about 1/m.
- This is an opportunity to encode (compress) the vectors.
- A common encoding approach is called run-length encoding. [1]
Run-Length Encoding
- Represents runs:
- A run is a sequence of i 0s followed by a 1; it is represented by a suitable binary encoding of the integer i.
- A run of i 0s followed by a 1 is encoded by:
- First computing how many bits are needed to represent i, say k.
- Then representing the run by k-1 1s and a single 0, followed by the k bits that represent i in binary.
- The encoding of i = 1 is 01 (k = 1).
- The encoding of i = 0 is 00 (k = 1).
- We concatenate the codes for the runs together, and the resulting sequence of bits is the encoding of the entire bit-vector.
Understanding with an Example
- Let us decode the sequence 11101101001011.
- Starting at the beginning (left-most bit):
- First run: the first 0 is at position 4, so k = 4. The next 4 bits are 1101, so the first integer is i = 13.
- Second run: the remaining code is 001011. The first bit is 0, so k = 1; the next bit is 0, so i = 0.
- Last run: the remaining code is 1011. There is one 1 before the 0, so k = 2; the next 2 bits are 11, so i = 3.
- The run lengths are thus 13, 0, 3, and hence the bit-vector is:
- 0000000000000110001
Managing Bitmap Indexes
- 1) How do you find the bit-vector for a given value efficiently?
- 2) After selecting the results that match, how do you retrieve the corresponding records efficiently?
- 3) When the data changes, how do you alter the bitmap index?
1) Finding bit-vectors
- Think of each bit-vector as the value associated with a key, namely the attribute value. [1]
- Any secondary-storage indexing technique will then retrieve the bit-vector efficiently. [1]
- Create a secondary index with the attribute value as the search key [3]:
- B-tree
- Hash table
2) Finding records
- Create a secondary index with the record number as the search key. [3]
- Or, in other words:
- Once you learn that you need record k, you can use a secondary index with the kth position as the search key. [1]
3) Handling modifications
- Record numbers must remain fixed once assigned.
- Changes to the data file require changes to the bitmap index.
Deletion
- A tombstone replaces the deleted record.
- The corresponding bit in each bit-vector is set to 0.
Insertion
- The record is assigned the next record number.
- A bit of value 0 or 1 is appended to each bit-vector.
- If the new record contains a new value of the attribute, add one new bit-vector.
Modification
- Change the bit corresponding to the old value of the modified record to 0.
- Change the bit corresponding to the new value of the modified record to 1.
- If the new value is a new value of A, then insert a new bit-vector.
18.1 & 18.2 CONCURRENCY CONTROL
210 Chiu Luk
Concurrency Control
- Concurrency control in database management systems (DBMS) ensures that database transactions can be performed concurrently without violating the data integrity of the database.
- Executed transactions should follow the ACID rules. The DBMS must guarantee that only serializable (unless serializability is intentionally relaxed), recoverable schedules are generated.
- It also guarantees that committed transactions won't be lost, and that aborted (rolled back) transactions won't leave effects in the database.
Transaction ACID rules
- Atomicity: either the effects of all of a transaction's operations remain, or none do, when the transaction completes; to the outside world the transaction appears indivisible, atomic.
- Consistency: every transaction must leave the database in a consistent state.
- Isolation: transactions cannot interfere with each other. Providing isolation is the main goal of concurrency control.
- Durability: the effects of successful transactions must persist through crashes.
Serial and Serializable Schedules
- In the field of databases, a schedule is a list of actions (i.e. reading, writing, aborting, committing) from a set of transactions.
- In this example, schedule D is over the set of 3 transactions T1, T2, T3. The schedule describes the actions of the transactions as seen by the DBMS. T1 reads and writes object X, then T2 reads and writes object Y, and finally T3 reads and writes object Z. This is an example of a serial schedule, because the actions of the 3 transactions are not interleaved.
Serial and Serializable Schedules (cont.)
- A schedule that is equivalent to a serial schedule has the serializability property.
- In schedule E, the order in which the actions of the transactions are executed is not the same as in D, but in the end E gives the same result as D.
Serial Schedule: T1 precedes T2

T1                        T2
Read(A); A ← A+100
Write(A)
Read(B); B ← B+100
Write(B)
                          Read(A); A ← A×2
                          Write(A)
                          Read(B); B ← B×2
                          Write(B)
Serial Schedule: T2 precedes T1

T1                        T2
                          Read(A); A ← A×2
                          Write(A)
                          Read(B); B ← B×2
                          Write(B)
Read(A); A ← A+100
Write(A)
Read(B); B ← B+100
Write(B)
A serializable, but not serial, schedule

T1                        T2
Read(A); A ← A+100
Write(A)
                          Read(A); A ← A×2
                          Write(A)
Read(B); B ← B+100
Write(B)
                          Read(B); B ← B×2
                          Write(B)

r1(A); w1(A); r2(A); w2(A); r1(B); w1(B); r2(B); w2(B)
A nonserializable schedule

T1                        T2
Read(A); A ← A+100
Write(A)
                          Read(A); A ← A×2
                          Write(A)
                          Read(B); B ← B×2
                          Write(B)
Read(B); B ← B+100
Write(B)
A schedule that is serializable only because of the detailed behavior of the transactions

T1                        T2
Read(A); A ← A+100
Write(A)
                          Read(A); A ← A×1
                          Write(A)
                          Read(B); B ← B×1
                          Write(B)
Read(B); B ← B+100
Write(B)

- Regardless of the (consistent) initial state, the final state will be consistent.
Non-Conflicting Actions
- Two actions are non-conflicting if, whenever they occur consecutively in a schedule, swapping them does not affect the final state produced by the schedule. Otherwise, they are conflicting.
Conflicting Actions: General Rules
- Two actions of the same transaction always conflict, because their order is fixed:
- r1(A); w1(B)
- Two actions on the same database element conflict if at least one of them is a write:
- r1(A); w2(A)
- w1(A); w2(A)
Conflicting Actions
- Two or more actions are said to be in conflict if:
- The actions belong to different transactions.
- At least one of the actions is a write operation.
- The actions access the same object (read or write).
- The following set of actions is conflicting:
- T1: R(X), T2: W(X), T3: W(X)
- While the following sets of actions are not:
- T1: R(X), T2: R(X), T3: R(X)
- T1: R(X), T2: W(Y), T3: R(X)
Conflict Serializability
- We may take any schedule and make as many non-conflicting swaps as we wish,
- with the goal of turning the schedule into a serial schedule.
- If we can do so, then the original schedule is serializable, because its effect on the database state remains the same as we perform each of the non-conflicting swaps.
- A schedule is said to be conflict-serializable
when the schedule is conflict-equivalent to one
or more serial schedules. - Another definition for conflict-serializability
is that a schedule is conflict-serializable if
and only if there exists an acyclic precedence
graph/serializability graph for the schedule. - Which is conflict-equivalent to the serial
schedule ltT1,T2gt, but not ltT2,T1gt.
Conflict-equivalent / conflict-serializable
- Let Ai and Aj be consecutive non-conflicting actions that belong to different transactions. We can swap Ai and Aj without changing the result.
- Two schedules are conflict-equivalent if they can be turned one into the other by a sequence of non-conflicting swaps of adjacent actions.
- We call a schedule conflict-serializable if it is conflict-equivalent to a serial schedule (see the sketch below).
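A minimal Python sketch of testing conflict-serializability with the precedence-graph criterion mentioned above: add an edge Ti → Tj whenever an action of Ti conflicts with a later action of Tj, and check that the graph is acyclic. The schedule representation is an illustrative assumption.

def conflict_serializable(schedule):
    """schedule: list of (txn, action, element) triples, e.g. (1, 'r', 'A')."""
    nodes = {t for t, _, _ in schedule}
    edges = set()
    for i, (ti, ai, xi) in enumerate(schedule):
        for tj, aj, xj in schedule[i + 1:]:
            if ti != tj and xi == xj and "w" in (ai, aj):
                edges.add((ti, tj))          # Ti must precede Tj
    # Repeatedly remove nodes with no incoming edge; acyclic iff all removable.
    while nodes:
        free = [n for n in nodes if not any(u in nodes and v == n for u, v in edges)]
        if not free:
            return False                     # a cycle remains
        nodes -= set(free)
    return True

# r1(A) w1(A) r2(A) w2(A) r1(B) w1(B) r2(B) w2(B): serializable as <T1, T2>
s = [(1, "r", "A"), (1, "w", "A"), (2, "r", "A"), (2, "w", "A"),
     (1, "r", "B"), (1, "w", "B"), (2, "r", "B"), (2, "w", "B")]
print(conflict_serializable(s))       # True

# T2 writes A before T1, but T1 writes B before T2: cycle, not serializable
bad = [(1, "r", "A"), (2, "w", "A"), (1, "w", "A"), (1, "w", "B"), (2, "w", "B")]
print(conflict_serializable(bad))     # False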
Conflict-serializable (swapping adjacent non-conflicting actions)
T1 T2
R(A)
W(A)
R(A)
R(B)
W(A)
W(B)
R(B)
W(B)
Conflict-serializable (cont.)
T1 T2
R(A)
W(A)
R(B)
R(A)
W(A)
W(B)
R(B)
W(B)
Conflict-serializable (cont.)
T1 T2
R(A)
W(A)
R(A)
R(B)
W(B)
W(A)
R(B)
W(B)
Can't swap this pair!
Conflict-serializable (cont.)
T1 T2
R(A)
W(A)
R(A)
W(B)
R(B)
W(A)
R(B)
W(B)
Serial Schedule
18.3 & 18.4 CONCURRENCY CONTROL
211 Donavon Norwood
INTRODUCTION
- Enforcing serializability by locks:
- Locks
- The locking scheduler
- Two-phase locking
- Locking systems with several lock modes:
- Shared and exclusive locks
- Compatibility matrices
- Upgrading/updating locks
- Incrementing locks
Locks
- Locking works as follows:
- A transaction issues a request.
- The scheduler checks the lock table.
- The scheduler generates a serializable schedule of actions.
Consistency of transactions
- Actions and locks must relate to each other:
- A transaction can read or write an element only if it holds a lock on it and has not yet released the lock.
- Unlocking every locked element is compulsory.
- Legality of schedules:
- No two transactions can hold a lock on the same element at the same time; one must release it before the other acquires it.
Locking scheduler
- Grants lock requests only if doing so yields a legal schedule.
- The lock table stores information about the current locks on the elements.
The locking scheduler (cont.)
- A legal schedule of consistent transactions that is, unfortunately, not serializable.
172Locking sched