Title: File Organisation
1. File Organisation
2. Placing Files on Disk
- A file is a sequence of records
- Records
  - Record type
  - Record fields
    - Data type
    - Number of bytes in a field
      - Fixed
      - Variable
3. Record Characteristics
- A logical view
  - SELECT * FROM STUDENTS, or
  - (Smith, 17, 1, CS), (Brown, 8, 2, CS), or
  - STUDENT(Name, Number, Class, Major)
- A physical view
  - (20 bytes, 4 bytes, 4 bytes, 3 bytes)
  - data types determine record length
  - records can be of fixed or variable length
4. Fixed Versus Variable Length Records
- FIXED LENGTH
  - every record has the same fields
  - a field can be located relative to the record start
- VARIABLE LENGTH FIELDS
  - some fields have unknown length
  - use a field separator
  - use a record terminator
- What if records are smaller than a block? BLOCKING FACTOR
- What if records are larger than a block? SPANNED RECORDS
5. Record Blocking
- Allocating records to disk blocks
- Unspanned records
  - Each record is fully contained in one block
  - Many records fit in one block
  - Blocking factor bfr = the number of records that fit in one block
  - Example: block size B = 1024, fixed record size R = 150, so bfr = ⌊1024/150⌋ = 6 (⌊ ⌋ and ⌈ ⌉ denote the floor and ceiling functions)
- Spanned organization
  - A record is continued on the consecutive block
  - Requires a pointer to the block holding the remainder of the record
  - If records are of variable length, then bfr represents the average number of records per block (the rounding function does not apply)
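The bfr arithmetic above can be checked with a short sketch (the helper names are illustrative, not part of any real system):

```python
import math

def blocking_factor(block_size: int, record_size: int) -> int:
    """Number of fixed-length records that fit in one block (unspanned)."""
    return block_size // record_size          # floor

def blocks_needed(num_records: int, bfr: int) -> int:
    """Blocks required to store the whole file (unspanned organisation)."""
    return math.ceil(num_records / bfr)       # ceiling

B, R = 1024, 150
bfr = blocking_factor(B, R)                   # floor(1024 / 150) = 6
wasted = B - bfr * R                          # unused bytes per block: 1024 - 900 = 124
b = blocks_needed(10_000, bfr)                # ceil(10000 / 6) = 1667 blocks
print(bfr, wasted, b)
```

The unused 124 bytes per block are the price of the unspanned organisation; a spanned organisation would reclaim them at the cost of an extra pointer and possibly an extra block read per record.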
6. File Structure
- A file is a set of pages (disk blocks) storing records
- File header
  - Record format, types of separators
  - Block address(es)
- Blocks allocated
  - Contiguous
  - Linked (use of block pointers)
  - Linked clusters
  - Indexed
7. Searching for a Record
- To search for a record on disk, one or more file blocks are copied into buffers.
- Programs search for the desired record in the buffers, using the information in the file header.
- If the address of the block holding the desired record is not known, the search programs must do a linear search through the file blocks: each file block is copied into a buffer and searched, until either the record is located or all the file blocks have been searched unsuccessfully.
- The goal of a good file organization is to locate the block that contains a desired record with a minimal number of block transfers.
8. Operations on Files
- Because of the complex path from stored data to the user, DBMSs offer a range of I/O operations:
- OPEN - access the file and prepare the file pointer
- FIND (LOCATE) - find the first record
- FINDNEXT
- FINDALL - set operation
- READ
- INSERT
- DELETE
- MODIFY
- CLOSE
- REORGANISE - set operation
- READ-ORDERED (FIND-ORDERED) - set operation
9. File Organization and Access Method
- Difference between the terms
  - file organization, and
  - access method
- A file organization is the organization of the data of a file into records, blocks, and access structures
  - a way of placing records and blocks on the storage medium
- An access method provides a group of operations that can be applied to a file, resulting in retrieval, modification and reorganisation.
- One file organization can accept many different access methods. Some access methods, though, can be applied only to files with a specific file organization.
  - For example, one cannot apply an indexed access method to a file without an index
10. Why Do Access Methods Matter?
- The unit of transfer between disk and main memory is a block
- Data must be in memory for the DBMS to use it
- DBMS memory is handled in units of a page, e.g. 4K, 8K. Pages in memory represent one or more hardware blocks from the disk
- If a single item is needed, the whole block is transferred
- The time taken for an I/O depends on the location of the data on the disk, and is lower if the seek times and rotational delays are small; remember that access time = seek time + rotational delay + transfer time
- The reason many DBMSs do not rely on the OS file system is:
  - higher-level DB operations, e.g. JOIN, have a known pattern of page accesses and can be translated into known sets of I/O operations
  - the buffer manager can PRE-FETCH pages by anticipating the next request. This is especially efficient when the required data are stored CONTIGUOUSLY on disk
11. Simple File Organisations
- Unordered files of records: heap or pile file
  - New records are inserted at EOF, or anywhere
  - Locating a record requires a linear search
  - Insertion is easy
  - Retrieval of an individual record, or retrieval in any order, is difficult (time consuming)
- Question: how many blocks, on average, does one need to read to find a single record?
- Fast: SELECT * FROM COURSE
- Slow: SELECT COUNT(*) FROM COURSE GROUP BY Course_Number
12. Operations on an Unordered File
- Inserting a new record is very efficient
  - The address of the last file block is kept in the file header
  - The last disk block of the file is copied into a buffer page
  - The new record is added (or a new page is opened) and the page is then rewritten back to the disk block
- Searching for a record using any search condition in a file stored in b blocks
  - Linear search through the file, block by block
  - Cost: b/2 block transfers on average, if only one record satisfies the search condition
  - Cost: b block transfers if no records or several records satisfy the search condition; the program must read and search all b blocks in the file
- To delete a record,
  - find its block and copy the block into a buffer page,
  - delete the record from the buffer,
  - rewrite the updated page back to the disk block
- Note: unused space in the block could be used in future for a new record, if suitable (some bookkeeping on unused space in file blocks is necessary)
13. Special Deletion Procedures
- Technique used for record deletion
  - Each record has an extra byte or bit, called a deletion marker, set to 1 at insertion *)
  - Do not remove a deleted record, but reset its deletion marker to 0 when it is deleted
  - A record with deletion marker set to 0 is not used by application programs
  - From time to time, reorganise the file: physically remove deleted records and reclaim unused space
- *) Just for simplicity we assume that the values of deletion markers are 0 or 1. A system can actually choose other characters or combinations of bits as values of deletion markers.
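A minimal sketch of the deletion-marker idea, using a toy in-memory heap (not a real DBMS structure; 1 = live, 0 = deleted, as on the slide):

```python
class HeapFile:
    """Toy in-memory heap file illustrating deletion markers
    (1 = live, 0 = deleted), as described on the slide."""

    def __init__(self):
        self.records = []                        # each record: [marker, key, data]

    def insert(self, key, data):
        self.records.append([1, key, data])      # marker set to 1 at insertion

    def find(self, key):
        # linear search; records with marker 0 are invisible to applications
        for marker, k, data in self.records:
            if marker == 1 and k == key:
                return data
        return None

    def delete(self, key):
        for rec in self.records:
            if rec[0] == 1 and rec[1] == key:
                rec[0] = 0                       # reset the marker, keep the record
                return True
        return False

    def reorganise(self):
        # periodic physical removal of marked records
        self.records = [r for r in self.records if r[0] == 1]

f = HeapFile()
for k in ("R3", "R7", "R10"):
    f.insert(k, f"data-{k}")
f.delete("R7")
print(f.find("R7"))      # None - the marker is 0
f.reorganise()
print(len(f.records))    # 2
```

Deletion costs only one marker update instead of a block rewrite with shifting; the price is wasted space until the next reorganisation.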
14. Simple File Organisations
- Ordered files of records - sequential files
  - still extremely useful in DBMSs (auditing, recovery, security)
- A record field is nominated and records are ordered based on that field
  - Ordering key
- Insertion is expensive
- Retrieval is easy (efficient) if it exploits the sort order
  - binary search reduces time significantly
- Fast: SELECT * FROM COURSE ORDER BY <ordering field>
- Slow: SELECT * FROM COURSE WHERE <any other attribute> = c
15. Retrieval & Update in Sorted Files
- Binary search on the ordering field to find the block containing key k
- b = number of blocks; High = b - 1; Low = 0
- Do while not (Found or NotThere)
  - Mid = ⌊(Low + High) / 2⌋
  - Read block Mid into the buffer
  - If k < key field of the first record in the block
    - Then High = Mid - 1
  - Else If k > key field of the last record in the block
    - Then Low = Mid + 1
  - Else If the record with key k is in the buffer
    - Then Found
    - Else NotThere
- end
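The loop above can be sketched in Python; here each block is modelled as a sorted list of keys, and the function also counts block reads (at most about log2(b)):

```python
def binary_search_blocks(blocks, k):
    """Binary search over the blocks of a file ordered on the key.
    `blocks` is a list of sorted key lists (one list per block).
    Returns (block_number, offset, blocks_read) or None."""
    low, high = 0, len(blocks) - 1
    reads = 0
    while low <= high:                 # Low > High means NotThere
        mid = (low + high) // 2
        block = blocks[mid]            # "read block Mid into the buffer"
        reads += 1
        if k < block[0]:               # before the first record in the block
            high = mid - 1
        elif k > block[-1]:            # after the last record in the block
            low = mid + 1
        else:                          # k can only be in this block, if anywhere
            if k in block:
                return (mid, block.index(k), reads)
            return None                # NotThere
    return None

# 4 blocks, bfr = 3, file ordered on the key
file_blocks = [[2, 5, 8], [11, 14, 17], [20, 23, 26], [29, 32, 35]]
print(binary_search_blocks(file_blocks, 23))   # (2, 1, 2): block 2, offset 1, 2 reads
print(binary_search_blocks(file_blocks, 24))   # None
```

Note that the comparison is against the first and last record of a block, so the search halves the number of candidate *blocks*, not records, at every step.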
16. Operations on an Ordered File
- Searching for records when the criteria are specified in terms of the ordering field:
  - Reading the records in order of the ordering key values is extremely efficient,
  - Finding the next record from the current one in order of the ordering key usually requires no additional block accesses,
    - the next record is in the same block or in the next block
  - Using a search condition based on the value of an ordering key field results in faster access when the binary search technique is used,
  - A binary search can be done on the blocks rather than on the records. A binary search usually accesses log2(b) blocks, whether the record is found or not
- No advantage if the search criterion is specified in terms of non-ordering fields
17. Operations on an Ordered File
- Inserting records is expensive. To insert a record:
  - find its correct position in the file, based on its ordering field value - cost log2(b)
  - make space in the file to insert the record in that position
    - on average, half the records of the file must be moved to make space for the new record
    - these file blocks must be read and rewritten to keep the order; the cost of insertion is then b/2 block transfers
- Deleting a record:
  - Find the record using binary search based on the ordering field value - cost log2(b)
  - Delete the record,
  - Reorganise part of the file (all records after the deleted one, b/2 blocks on average)
- Modifying a record:
  - Find the record using binary search and update as required
18. Operations on an Ordered File
- Alternative ways for more efficient insertion:
  - keep some unused space in each block for new records (not good - the problem returns when that space is filled up)
  - create and maintain a temporary unordered file called an overflow file
    - New records are inserted at the end of the overflow file
    - Periodically, the overflow file is sorted and merged with the main file during file reorganization
    - Searching for a record must involve both files, main and overflow; the cost of searching is thus more expensive, but for a large main file it will still be close to log2(b)
- Alternative way for more efficient deletion:
  - Use the technique based on the deletion marker, as described earlier
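The overflow-file scheme can be sketched as follows (an in-memory toy; the class and method names are illustrative only):

```python
import bisect

class OrderedFileWithOverflow:
    """Ordered main file plus an unordered overflow file, as on the slide."""

    def __init__(self, records):
        self.main = sorted(records)        # ordered on the key
        self.overflow = []                 # unordered; new records appended here

    def insert(self, key):
        self.overflow.append(key)          # cheap: no shifting of the main file

    def find(self, key):
        # binary search in the main file ...
        i = bisect.bisect_left(self.main, key)
        if i < len(self.main) and self.main[i] == key:
            return "main"
        # ... then linear search in the (small) overflow file
        return "overflow" if key in self.overflow else None

    def reorganise(self):
        # periodic merge: sort the overflow and merge it with the main file
        self.main = sorted(self.main + self.overflow)
        self.overflow = []

f = OrderedFileWithOverflow([1, 4, 8, 12])
f.insert(9)
print(f.find(9))     # 'overflow'
f.reorganise()
print(f.find(9))     # 'main'
```

As long as the overflow file stays small relative to the main file, the extra linear scan adds little to the log2(b) cost of the binary search.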
19. Access Properties of Simple Files
- Heap (sequential unordered)
- Ordered (sequential) file
- Note: in this and the following examples, record numbers correspond to values of the ordering field in ascending order
20. Access Properties of Simple Files
- Insert record R15 into a heap file
21. Access Properties of Simple Files
- Insert record R15 into an ordered file
- Notice that all records after R15 have changed their page location or their position on the page
22. Access Properties of Simple Files
- Insert records R15, R9, R17 into an ordered file using an overflow file
- [Diagram: the main file is unchanged; R15, R9 and R17 are appended to the overflow file in arrival order]
- Periodically the overflow file is sorted and merged with the main file
23. Access Properties of Simple Files
- Deletions from a heap: R10, R3, R7
- Simple delete
24. Access Properties of Simple Files
- Deletions from a heap: R10, R3, R7
- Using the deletion marker technique
- Deletion markers are set to 0, and later these records will be physically removed when the file is reorganised
25. Access Properties of Simple Files
- Deletions from an ordered file: R10, R3, R7
- Simple delete
26. Access Properties of Simple Files
- Deletions from an ordered file: R10, R3, R7
- Using the deletion marker technique
- Deletion markers are set to 0, and later these records will be physically removed when the file is reorganised
27. Retrieval and Update in Heaps
- Quick summary
  - can only use linear search
  - insertion is fast
  - deletion and update are slow
  - parameter search (e.g. SELECT ... WHERE) is slow
  - unconditional search can be fast if
    - records are of fixed length
    - records do not span blocks
  - the j-th record can be located by position: it is in block ⌈j / bfr⌉
  - average time to find a single record: b/2
    - (b = number of blocks)
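For fixed-length, unspanned records the position arithmetic can be checked directly. With 0-based block numbers the slide's ⌈j/bfr⌉ becomes (j − 1) div bfr:

```python
def locate(j: int, bfr: int):
    """Block number and offset of the j-th record (1-based) in a file of
    fixed-length, unspanned records. Block numbers are 0-based here."""
    block = (j - 1) // bfr            # 0-based block number
    offset = (j - 1) % bfr            # position within that block
    return block, offset

# bfr = 6: record 14 sits in the third block (number 2), at offset 1
print(locate(14, 6))   # (2, 1)
```

This is what makes "unconditional" positional access fast in a heap: no search is needed, just one computed block read.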
28. Retrieval & Update in Sorted Files
- Quick summary
  - retrieval on the key field is fast - the next record is nearby
  - any other retrieval either requires a sort, or an index, or is as slow as a heap
  - update, delete and insert are slow (find block, update block, rewrite block)
29. Fast Access for Databases: Hashing
- Types of hashing: static or dynamic
- What is the point of hashing?
  - reduce a large address space
  - provide close to direct access
  - provide reasonable performance for all of update, insert, delete, search
- What is a hash function?
  - properties
  - behaviour
- Collisions
- Collision resolution
- Open addressing
- Summary
30. What Is Hashing, and What Is It For?
- direct access to the block containing the desired record
- reduce the number of blocks read or written
- allow for file expansion and contraction with minimal file reorganising
- permit retrieval on hashed fields without re-sorting the file
- no need to allocate contiguous disk areas
- if the file is small, internal hashing; otherwise external
- no direct access other than by hashing
31. A Basic Example of Hashing
- There are 25 rows of seats, with 3 seats per row (75 seats total)
- We have to allocate each person to a row in advance, at random
- We will hash on their family name so as to find the person's row number directly, knowing only the name
- The database is logically a single table
  - ROOM (Name, Age, Attention)
- implemented as a blocked, hashed file
32. The Hashing Process
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
- The hash process is:
  - Loc = 0
  - Until no more characters in YourName:
    - add the alphabetic position of the character to Loc
  - Calculate RowNum = Loc mod 25
33. Examples - Hashed Names
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
NAME       Loc   RowNum (mod 25)
REYE        53    3
ANDERSON    90   15
MCWILLIAM   95   20
TANG        42   17
LEE         22   22
Where is MCWILLIAM? Hash(MCWILLIAM) = 95, so Row 20
35. Name Hashing Example, continued
Name   Row | Name   Row | Name    Row | Name   Row
Lee    22  | Alex   17  | George   7  | Anne    9
West   17  | Rita   23  | Guy      3  | Will    6
James  23  | Jodie  18  | Dave     7  | Wilf    0
Anna    5  | Jill   18  | Don      8  | Walt    6
Anita  18  | Lily    8  | Dixy    12  | Jack    0
Jie    24  | Ash     3  | Jon     14  | Lana    3
Kenny  19  | Ben    21  | Nina    13  | Olga   10
Marie  21  | Kay    12  | May     14  | Fred    8
Lois    5  | Peter  14  | Max     13  | Tania  18
Best   21  | Paul    0  | Nora    23  | Tom    23
Rob    10  | Phil   20  | Cash     6  | Julia   3
Lou    23  | Pat    11  | Foot     6  | Leah    6
Axel   17  | Ed      9  | Tan     10  | Ling   17
36. The Results After 52 Arrivals
RowNo  Sitting  Alloc      RowNo  Sitting  Alloc
0      3                   13     2
1      0                   14     3
2      0                   15     0
3      3        1          16     0
4      0                   17     3        1
5      2                   18     3        1
6      3        2          19     1
7      2                   20     1
8      3                   21     3
9      2                   22     1
10     3                   23     3        2
11     1                   24     1
12     2
37. The Room as a Hashed File
- Each person has a hash key - the name
- Each person is a record
- Each row is a hardware block (bucket)
- Each row number is the address of a bucket
- Records here are fixed-length (and 3 records fit per block)
- The leftover people are collisions (key collisions)
- They will have to be assigned a seat by collision resolution
38. Collision Resolution
- Leave an empty seat in each row
  - under-population: blocks at most 66% full
- A notice at the end of the row: the extra seat for row N can be found at the rear exit
  - the bucket's overflow chain points to an overflow page containing the record
- Everyone stand up while we reallocate seats
  - file reorganisation
39. Collision Resolution
- Strategy 1 (open addressing)
  - Nora is the 4th arrival for row 23
  - Place the new arrival in the next higher-numbered block with a vacancy
- Retrieval - search for Nora
  - Retrieve block 23
  - Read the records in block 23 consecutively. If Nora is not found, try block 24
40. Disadvantages
- May need to read the whole file consecutively on some keys
- Blocks will gradually fill up with out-of-place records
- Deletions cause either immediate or periodic reorganisation
41. Collision Resolution
- Strategy 2
  - Reserve some rows (buckets) for overflow: blocks 25, 26 and 27, or recalculate the hash function with a smaller mod, say 20 instead of 25
  - Julia is then the 4th arrival for block 3
  - Place her in the lowest-numbered overflow block with available space (e.g. block 26), optionally placing a pointer in bucket 3 pointing to block 26
- Retrieval - search for Julia
  - Retrieve block 3
  - Read the records in block 3 consecutively. If Julia is not found, either
    - search the overflow area consecutively, or
    - follow the pointer to block 26 (chaining)
42. Disadvantages
- The overflow area gradually fills up, giving longer retrieval times
- Deletions/additions cause periodic reorganisation
43. Collision Resolution
- More formally:
  - Open addressing: if the location specified by the hash address is occupied, then the subsequent positions are checked in order until an unused (empty) position is found.
  - Chaining: various overflow locations are kept, and a pointer field is added to each record location. A collision is resolved by placing the new record in an unused overflow location and setting the pointer of the occupied hash address location to the address of that overflow location.
  - Multiple hashing: a second hash function is applied if the first results in a collision.
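The first two strategies can be sketched side by side. This is a toy with 5 slots / 5 buckets; the bucket capacity of 2 is an assumption for the demo:

```python
def insert_open_addressing(table, key, h):
    """Linear probing: if slot h(key) is occupied, try the following
    slots (wrapping around) until an empty one is found."""
    n = len(table)
    pos = h(key)
    for i in range(n):
        slot = (pos + i) % n
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("table full")

def insert_chaining(buckets, overflow, key, h):
    """Chaining: a full bucket keeps pointers to records placed in an
    overflow area (bucket capacity 2 is assumed here)."""
    b = buckets[h(key)]
    if len(b["records"]) < 2:
        b["records"].append(key)
    else:
        overflow.append(key)                    # record goes to an overflow location
        b["chain"].append(len(overflow) - 1)    # pointer to that location

h = lambda k: k % 5
table = [None] * 5
for k in (7, 12, 3):                 # 7 and 12 both hash to slot 2
    insert_open_addressing(table, k, h)
print(table)                         # [None, None, 7, 12, 3]: 12 took the next slot

buckets = [{"records": [], "chain": []} for _ in range(5)]
overflow = []
for k in (7, 12, 22, 3):             # 7, 12 and 22 all hash to bucket 2
    insert_chaining(buckets, overflow, k, h)
print(buckets[2], overflow)          # 22 overflows; bucket 2 keeps a pointer to it
```

Note how open addressing pushed record 3 out of its home slot (3 % 5 = 3 was taken by 12): this is exactly the "out-of-place records" disadvantage listed above.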
44. Performance on Hashed Files
- Retrieve (SELECT): very fast if the name is known, otherwise hopeless
  - SELECT * FROM ROOM
    WHERE NAME = 'McWilliam'
  - SELECT * FROM ROOM
    WHERE AGE > 30
- Update: same
  - UPDATE ROOM SET ATTENTION = 'low'
    WHERE NAME = 'McWilliam'
  - UPDATE ROOM SET ATTENTION = 'high'
    WHERE AGE > 50 OR AGE < 10
45. Performance on Hashed Files
- Delete: same as SELECT and UPDATE
  - DELETE FROM ROOM (uses the hash - fast)
    WHERE NAME = 'Nora'
  - DELETE FROM ROOM (can't use the hash - slow)
    WHERE NAME LIKE 'No%'
- Insert: unpredictable
  - INSERT INTO ROOM
    VALUES ('Smyth', 'high')
46. Internal Hashing
- Internal hashing is used as an internal search structure within a program whenever a group of records is accessed exclusively by using the value of one field.
- Applicable to smaller files
- Hashed in main memory: fast lookup in store
- R records, an R-length array
- The hash function transforms the key field into an array subscript in the range 0 to R - 1
  - hash(KeyValue) = KeyValue mod R
- The subscript is the record's address in store
47. External Hashing
- Hashing for disk files is called external hashing
- The address space is made of buckets, each of which holds multiple records
- A bucket is either one disk block or a cluster of contiguous blocks
- The hashing function maps a key into a relative bucket number
- A table maintained in the file header converts the bucket number into the corresponding disk block address
48. External Hashing (static)
- The hashing scheme is called static hashing if a fixed number of buckets M is allocated.
- If a record is to be retrieved with a search condition specified on the key, the number of the bucket potentially containing that record is determined by applying the hashing function to the key, and then that bucket is examined for the desired record. If the record is not in that bucket, the search can continue in the overflow buckets.
49. External Hashing (static)
- Construction of a hashed file:
  - Identify the size of the file, choose a hashing function (according to the anticipated number of buckets) and decide on the collision resolution procedure - for the life of the file
  - Apply the hashing function to each inserted record to get the bucket number, and place the record in the bucket with that number
  - If the bucket is full, apply the selected collision resolution procedure
  - If the number of records in overflow buckets is large, and/or the distribution of records over the buckets is highly non-uniform, then reorganise the file using a changed hashing function (tuning)
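The construction steps above, sketched for a static hashed file. M = 5 buckets and a bucket capacity of 2 are assumptions chosen for the demo:

```python
M, BFR = 5, 2                        # number of buckets and bucket capacity (assumed)
buckets = [[] for _ in range(M)]     # primary buckets
overflow = []                        # shared overflow area

def bucket_no(key):
    return key % M                   # hashing function, fixed for the file's life

def insert(key):
    b = bucket_no(key)
    if len(buckets[b]) < BFR:
        buckets[b].append(key)       # place the record in its bucket
    else:
        overflow.append((b, key))    # bucket full: resolve the collision via overflow

def find(key):
    b = bucket_no(key)               # examine the addressed bucket first ...
    if key in buckets[b]:
        return f"bucket {b}"
    if (b, key) in overflow:         # ... then continue in the overflow area
        return "overflow"
    return None

for k in (10, 15, 20, 7, 13):        # 10, 15 and 20 all hash to bucket 0
    insert(k)
print(find(20))   # in overflow: bucket 0 was already full
print(find(7))    # in bucket 2
```

Keys that hash to a full bucket (here 20) cost an extra probe of the overflow area on every lookup, which is why a large overflow population is the signal to re-tune the hash function.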
50. External Hashing (static)
[Diagram: a key is fed to hash function H; H(key) mod N selects one of the N primary buckets 0 ... N-1; a full bucket points to an overflow page]
51. Problems of Static Hashing
- The number of buckets is fixed
  - shrinkage causes wasted space
  - growth causes long overflow chains
- Solutions
  - reorganise
  - re-hash
  - use dynamic hashing...
52. Extendible Hashing
- Previously, to insert a new record into a full bucket:
  - add an overflow page, or
  - reorganise by doubling the bucket allocation and redistributing the records
- This is a poor solution:
  - the entire file is read
  - twice as many pages have to be written
- Solution: extendible hashing
  - add a directory of pointers to buckets
  - double the number of buckets by doubling the directory
  - split only the bucket that has overflowed
53. Linear Hashing
- It does not have a directory at all
- Instead, it has a family of hash functions to manage dynamic expansion and contraction of the file
- Start with a set number of M buckets 0..M-1 and hashing function mod M
- Split the buckets in linear order when more space is needed. The next hashing function is mod 2M, then mod 4M, and so on, as required
- Example: block capacity is 2 records. Records with key values 72, 62 and 32 collide under hashing function mod 10, but after application of the next hashing function (mod 20) they do not (one bucket contains 72 and 32, and another 62)
- Combines controlled overflow with new space acquisition
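The slide's split example can be checked directly. `rehash_on_split` is a name made up for this sketch; it groups keys under mod m and under the next function mod 2m:

```python
def rehash_on_split(keys, m):
    """Group keys by h0 = key mod m, then by the next function h1 = key mod 2m."""
    old, new = {}, {}
    for k in keys:
        old.setdefault(k % m, []).append(k)        # buckets before the split
        new.setdefault(k % (2 * m), []).append(k)  # buckets after the split
    return old, new

old, new = rehash_on_split([72, 62, 32], 10)
print(old)   # {2: [72, 62, 32]} - all three collide under mod 10
print(new)   # {12: [72, 32], 2: [62]} - no bucket exceeds capacity 2 under mod 20
```

Only the overfull bucket needs to be split and its records redistributed; the rest of the file is untouched, which is the point of linear hashing.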
54. Linear Hashing - Advantages
- Another type of dynamic hashing
- does not require a directory
- manages collisions well
- accommodates insertions and deletions well
- allows overflow chain length to be traded against average space utilisation
- uses several hash functions
55. Hashing Summary
- Comparison of simple and hashed files
- Pages in a hashed file are grouped into buckets (1 block or a cluster of contiguous blocks)
- reduce a large address space
- provide close to direct access
- Static hashing has some disadvantages, which are addressed in dynamic hashing solutions (extendible, linear)
56. Hashing Summary (continued)
- Static hashed files are kept at about 80% occupancy, then reorganised. New pages are added when each existing page is about 80% full
- Hence, the time to read the entire file is about 1.25 times that of a non-hashed file
- Dynamic hashing provides flexibility in the usage of file storage space (expansion and contraction)