Title: File Organisation
1. File Organisation
2. Placing Files on Disk
- A file is a sequence of records
- Records
  - Record type
  - Record fields
    - Data type
    - Number of bytes in a field
      - Fixed
      - Variable
3. Record Characteristics
- A logical view
  - SELECT * FROM STUDENTS, or
  - (Smith, 17, 1, CS), (Brown, 8, 2, CS), or
  - STUDENT(Name, Number, Class, Major)
- A physical view
  - (20 bytes, 4 bytes, 4 bytes, 3 bytes)
  - data types determine record length
  - records can be of fixed or variable length
4. Fixed Versus Variable Length Records
- FIXED LENGTH
  - every record has the same fields
  - a field can be located relative to the record start
- VARIABLE LENGTH FIELDS
  - some fields have unknown length
  - use a field separator
  - use a record terminator
- What if records are smaller than a block? BLOCKING FACTOR
- What if records are larger than a block? SPANNED RECORDS
5. Record Blocking
- Allocating records to disk blocks
- Unspanned records
  - Each record is fully contained in one block
  - Many records fit in one block
  - Blocking factor bfr = the number of records that fit in one block
  - Example: block size B = 1024, fixed record size R = 150, so bfr = ⌊1024/150⌋ = 6 (⌊ ⌋ and ⌈ ⌉ denote the floor and ceiling functions)
- Spanned organization
  - A record is continued on the consecutive block
  - Requires a pointer to the block holding the remainder of the record
  - If records are of variable length, then bfr represents the average number of records per block (the rounding function does not apply)
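The bfr arithmetic above can be checked with a short sketch (the helper names are illustrative, not part of any real system):

```python
import math

def blocking_factor(block_size: int, record_size: int) -> int:
    """Number of fixed-length records that fit in one block (unspanned)."""
    return block_size // record_size          # floor

def blocks_needed(num_records: int, bfr: int) -> int:
    """Blocks required to store the whole file (unspanned organisation)."""
    return math.ceil(num_records / bfr)       # ceiling

B, R = 1024, 150
bfr = blocking_factor(B, R)                   # floor(1024 / 150) = 6
wasted = B - bfr * R                          # unused bytes per block: 1024 - 900 = 124
b = blocks_needed(10_000, bfr)                # ceil(10000 / 6) = 1667 blocks
print(bfr, wasted, b)
```

The unused 124 bytes per block are the price of the unspanned organisation; a spanned organisation would reclaim them at the cost of an extra pointer and possibly an extra block read per record.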
6. File Structure
- A file is a set of pages (disk blocks) storing records
- File header
  - Record format, types of separators
  - Block address(es)
- Blocks allocated
  - Contiguous
  - Linked (use of block pointers)
  - Linked clusters
  - Indexed
7. Searching for a Record
- To search for a record on disk, one or more file blocks are copied into buffers.
- Programs search for the desired record in the buffers, using the information in the file header.
- If the address of the block holding the desired record is not known, the search programs must do a linear search through the file blocks: each file block is copied into a buffer and searched, until either the record is located or all the file blocks have been searched unsuccessfully.
- The goal of a good file organization is to locate the block that contains a desired record with a minimal number of block transfers.
8. Operations on Files
- Because of the complex path from stored data to the user, DBMSs offer a range of I/O operations:
- OPEN - access the file and prepare the file pointer
- FIND (LOCATE) - find the first record
- FINDNEXT
- FINDALL - set operation
- READ
- INSERT
- DELETE
- MODIFY
- CLOSE
- REORGANISE - set operation
- READ-ORDERED (FIND-ORDERED) - set operation
9. File Organization and Access Method
- Difference between the terms
  - file organization, and
  - access method
- A file organization is the organization of the data of a file into records, blocks, and access structures
  - a way of placing records and blocks on the storage medium
- An access method provides a group of operations that can be applied to a file, resulting in retrieval, modification and reorganisation.
- One file organization can accept many different access methods. Some access methods, though, can be applied only to files with a specific file organization.
  - For example, one cannot apply an indexed access method to a file without an index
10. Why Do Access Methods Matter?
- The unit of transfer between disk and main memory is a block
- Data must be in memory for the DBMS to use it
- DBMS memory is handled in units of a page, e.g. 4K, 8K. Pages in memory represent one or more hardware blocks from the disk
- If a single item is needed, the whole block is transferred
- The time taken for an I/O depends on the location of the data on the disk, and is lower if the seek times and rotational delays are small; remember that access time = seek time + rotational delay + transfer time
- The reason many DBMSs do not rely on the OS file system is:
  - higher-level DB operations, e.g. JOIN, have a known pattern of page accesses and can be translated into known sets of I/O operations
  - the buffer manager can PRE-FETCH pages by anticipating the next request. This is especially efficient when the required data are stored CONTIGUOUSLY on disk
11. Simple File Organisations
- Unordered files of records: heap or pile file
  - New records are inserted at EOF, or anywhere
  - Locating a record requires a linear search
  - Insertion is easy
  - Retrieval of an individual record, or retrieval in any order, is difficult (time consuming)
- Question: how many blocks, on average, does one need to read to find a single record?
- Fast: SELECT * FROM COURSE
- Slow: SELECT COUNT(*) FROM COURSE GROUP BY Course_Number
12. Operations on an Unordered File
- Inserting a new record is very efficient
  - The address of the last file block is kept in the file header
  - The last disk block of the file is copied into a buffer page
  - The new record is added (or a new page is opened) and the page is then rewritten back to the disk block
- Searching for a record using any search condition in a file stored in b blocks
  - Linear search through the file, block by block
  - Cost: b/2 block transfers on average, if only one record satisfies the search condition
  - Cost: b block transfers if no records or several records satisfy the search condition; the program must read and search all b blocks in the file
- To delete a record,
  - find its block and copy the block into a buffer page,
  - delete the record from the buffer,
  - rewrite the updated page back to the disk block
- Note: unused space in the block could be used in future for a new record, if suitable (some bookkeeping on unused space in file blocks is necessary)
13. Special Deletion Procedures
- Technique used for record deletion
  - Each record has an extra byte or bit, called a deletion marker, set to 1 at insertion *)
  - Do not remove a deleted record, but reset its deletion marker to 0 when it is deleted
  - A record with deletion marker set to 0 is not used by application programs
  - From time to time, reorganise the file: physically remove deleted records and reclaim unused space
- *) Just for simplicity we assume that the values of deletion markers are 0 or 1. A system can actually choose other characters or combinations of bits as values of deletion markers.
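A minimal sketch of the deletion-marker idea, using a toy in-memory heap (not a real DBMS structure; 1 = live, 0 = deleted, as on the slide):

```python
class HeapFile:
    """Toy in-memory heap file illustrating deletion markers
    (1 = live, 0 = deleted), as described on the slide."""

    def __init__(self):
        self.records = []                        # each record: [marker, key, data]

    def insert(self, key, data):
        self.records.append([1, key, data])      # marker set to 1 at insertion

    def find(self, key):
        # linear search; records with marker 0 are invisible to applications
        for marker, k, data in self.records:
            if marker == 1 and k == key:
                return data
        return None

    def delete(self, key):
        for rec in self.records:
            if rec[0] == 1 and rec[1] == key:
                rec[0] = 0                       # reset the marker, keep the record
                return True
        return False

    def reorganise(self):
        # periodic physical removal of marked records
        self.records = [r for r in self.records if r[0] == 1]

f = HeapFile()
for k in ("R3", "R7", "R10"):
    f.insert(k, f"data-{k}")
f.delete("R7")
print(f.find("R7"))      # None - the marker is 0
f.reorganise()
print(len(f.records))    # 2
```

Deletion costs only one marker update instead of a block rewrite with shifting; the price is wasted space until the next reorganisation.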
14. Simple File Organisations
- Ordered files of records - sequential files
  - still extremely useful in DBMSs (auditing, recovery, security)
- A record field is nominated and records are ordered based on that field
  - Ordering key
- Insertion is expensive
- Retrieval is easy (efficient) if it exploits the sort order
  - binary search reduces time significantly
- Fast: SELECT * FROM COURSE ORDER BY <ordering field>
- Slow: SELECT * FROM COURSE WHERE <any other attribute> = c
15. Retrieval & Update in Sorted Files
- Binary search on the ordering field to find the block containing key k
- b = number of blocks; High = b - 1; Low = 0
- Do while not (Found or NotThere)
  - Mid = ⌊(Low + High) / 2⌋
  - Read block Mid into the buffer
  - If k < key field of the first record in the block
    - Then High = Mid - 1
  - Else If k > key field of the last record in the block
    - Then Low = Mid + 1
  - Else If the record with key k is in the buffer
    - Then Found
    - Else NotThere
- end
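The loop above can be sketched in Python; here each block is modelled as a sorted list of keys, and the function also counts block reads (at most about log2(b)):

```python
def binary_search_blocks(blocks, k):
    """Binary search over the blocks of a file ordered on the key.
    `blocks` is a list of sorted key lists (one list per block).
    Returns (block_number, offset, blocks_read) or None."""
    low, high = 0, len(blocks) - 1
    reads = 0
    while low <= high:                 # Low > High means NotThere
        mid = (low + high) // 2
        block = blocks[mid]            # "read block Mid into the buffer"
        reads += 1
        if k < block[0]:               # before the first record in the block
            high = mid - 1
        elif k > block[-1]:            # after the last record in the block
            low = mid + 1
        else:                          # k can only be in this block, if anywhere
            if k in block:
                return (mid, block.index(k), reads)
            return None                # NotThere
    return None

# 4 blocks, bfr = 3, file ordered on the key
file_blocks = [[2, 5, 8], [11, 14, 17], [20, 23, 26], [29, 32, 35]]
print(binary_search_blocks(file_blocks, 23))   # (2, 1, 2): block 2, offset 1, 2 reads
print(binary_search_blocks(file_blocks, 24))   # None
```

Note that the comparison is against the first and last record of a block, so the search halves the number of candidate *blocks*, not records, at every step.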
16. Operations on an Ordered File
- Searching for records when the criteria are specified in terms of the ordering field:
  - Reading the records in order of the ordering key values is extremely efficient,
  - Finding the next record from the current one in order of the ordering key usually requires no additional block accesses,
    - the next record is in the same block or in the next block
  - Using a search condition based on the value of an ordering key field results in faster access when the binary search technique is used,
  - A binary search can be done on the blocks rather than on the records. A binary search usually accesses log2(b) blocks, whether the record is found or not
- No advantage if the search criterion is specified in terms of non-ordering fields
17. Operations on an Ordered File
- Inserting records is expensive. To insert a record:
  - find its correct position in the file, based on its ordering field value - cost log2(b)
  - make space in the file to insert the record in that position
    - on average, half the records of the file must be moved to make space for the new record
    - these file blocks must be read and rewritten to keep the order; the cost of insertion is then b/2 block transfers
- Deleting a record:
  - Find the record using binary search based on the ordering field value - cost log2(b)
  - Delete the record,
  - Reorganise part of the file (all records after the deleted one, b/2 blocks on average)
- Modifying a record:
  - Find the record using binary search and update as required
18. Operations on an Ordered File
- Alternative ways for more efficient insertion:
  - keep some unused space in each block for new records (not good - the problem returns when that space is filled up)
  - create and maintain a temporary unordered file called an overflow file
    - New records are inserted at the end of the overflow file
    - Periodically, the overflow file is sorted and merged with the main file during file reorganization
    - Searching for a record must involve both files, main and overflow; the cost of searching is thus more expensive, but for a large main file it will still be close to log2(b)
- Alternative way for more efficient deletion:
  - Use the technique based on the deletion marker, as described earlier
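The overflow-file scheme can be sketched as follows (an in-memory toy; the class and method names are illustrative only):

```python
import bisect

class OrderedFileWithOverflow:
    """Ordered main file plus an unordered overflow file, as on the slide."""

    def __init__(self, records):
        self.main = sorted(records)        # ordered on the key
        self.overflow = []                 # unordered; new records appended here

    def insert(self, key):
        self.overflow.append(key)          # cheap: no shifting of the main file

    def find(self, key):
        # binary search in the main file ...
        i = bisect.bisect_left(self.main, key)
        if i < len(self.main) and self.main[i] == key:
            return "main"
        # ... then linear search in the (small) overflow file
        return "overflow" if key in self.overflow else None

    def reorganise(self):
        # periodic merge: sort the overflow and merge it with the main file
        self.main = sorted(self.main + self.overflow)
        self.overflow = []

f = OrderedFileWithOverflow([1, 4, 8, 12])
f.insert(9)
print(f.find(9))     # 'overflow'
f.reorganise()
print(f.find(9))     # 'main'
```

As long as the overflow file stays small relative to the main file, the extra linear scan adds little to the log2(b) cost of the binary search.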
19. Access Properties of Simple Files
- Heap (sequential unordered)
- Ordered (sequential) file
- Note: in this and the following examples, record numbers correspond to values of the ordering field in ascending order
20. Access Properties of Simple Files
- Insert record R15 into a heap file
21. Access Properties of Simple Files
- Insert record R15 into an ordered file
- Notice that all records after R15 have changed their page location or their position on the page
22. Access Properties of Simple Files
- Insert records R15, R9, R17 into an ordered file using an overflow file
- [Diagram: the main file is unchanged; R15, R9 and R17 are appended to the overflow file in arrival order]
- Periodically the overflow file is sorted and merged with the main file
23. Access Properties of Simple Files
- Deletions from a heap: R10, R3, R7
- Simple delete
24. Access Properties of Simple Files
- Deletions from a heap: R10, R3, R7
- Using the deletion marker technique
- Deletion markers are set to 0, and later these records will be physically removed when the file is reorganised
25. Access Properties of Simple Files
- Deletions from an ordered file: R10, R3, R7
- Simple delete
26. Access Properties of Simple Files
- Deletions from an ordered file: R10, R3, R7
- Using the deletion marker technique
- Deletion markers are set to 0, and later these records will be physically removed when the file is reorganised
27. Retrieval and Update in Heaps
- Quick summary
  - can only use linear search
  - insertion is fast
  - deletion and update are slow
  - parameter search (e.g. SELECT ... WHERE) is slow
  - unconditional search can be fast if
    - records are of fixed length
    - records do not span blocks
  - the j-th record can be located by position: it is in block ⌈j / bfr⌉
  - average time to find a single record: b/2
    - (b = number of blocks)
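For fixed-length, unspanned records the position arithmetic can be checked directly. With 0-based block numbers the slide's ⌈j/bfr⌉ becomes (j − 1) div bfr:

```python
def locate(j: int, bfr: int):
    """Block number and offset of the j-th record (1-based) in a file of
    fixed-length, unspanned records. Block numbers are 0-based here."""
    block = (j - 1) // bfr            # 0-based block number
    offset = (j - 1) % bfr            # position within that block
    return block, offset

# bfr = 6: record 14 sits in the third block (number 2), at offset 1
print(locate(14, 6))   # (2, 1)
```

This is what makes "unconditional" positional access fast in a heap: no search is needed, just one computed block read.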
28. Retrieval & Update in Sorted Files
- Quick summary
  - retrieval on the key field is fast - the next record is nearby
  - any other retrieval either requires a sort, or an index, or is as slow as a heap
  - update, delete and insert are slow (find block, update block, rewrite block)
29. Fast Access for Databases: Hashing
- Types of hashing: static or dynamic
- What is the point of hashing?
  - reduce a large address space
  - provide close to direct access
  - provide reasonable performance for all of update, insert, delete, search
- What is a hash function?
  - properties
  - behaviour
- Collisions
- Collision resolution
- Open addressing
- Summary
30. What Is Hashing, and What Is It For?
- direct access to the block containing the desired record
- reduce the number of blocks read or written
- allow for file expansion and contraction with minimal file reorganising
- permit retrieval on hashed fields without re-sorting the file
- no need to allocate contiguous disk areas
- if the file is small, internal hashing; otherwise external
- no direct access other than by hashing
31. A Basic Example of Hashing
- There are 25 rows of seats, with 3 seats per row (75 seats total)
- We have to allocate each person to a row in advance, at random
- We will hash on their family name so as to find the person's row number directly, knowing only the name
- The database is logically a single table
  - ROOM (Name, Age, Attention)
- implemented as a blocked, hashed file
32. The Hashing Process
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
- The hash process is:
  - Loc = 0
  - Until no more characters in YourName:
    - add the alphabetic position of the character to Loc
  - Calculate RowNum = Loc mod 25
33. Examples - Hashed Names
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
NAME       Loc   RowNum (mod 25)
REYE        53    3
ANDERSON    90   15
MCWILLIAM   95   20
TANG        42   17
LEE         22   22
Where is MCWILLIAM? Hash(MCWILLIAM) = 95, so Row 20
35. Name Hashing Example, continued
Name   Row | Name   Row | Name    Row | Name   Row
Lee    22  | Alex   17  | George   7  | Anne    9
West   17  | Rita   23  | Guy      3  | Will    6
James  23  | Jodie  18  | Dave     7  | Wilf    0
Anna    5  | Jill   18  | Don      8  | Walt    6
Anita  18  | Lily    8  | Dixy    12  | Jack    0
Jie    24  | Ash     3  | Jon     14  | Lana    3
Kenny  19  | Ben    21  | Nina    13  | Olga   10
Marie  21  | Kay    12  | May     14  | Fred    8
Lois    5  | Peter  14  | Max     13  | Tania  18
Best   21  | Paul    0  | Nora    23  | Tom    23
Rob    10  | Phil   20  | Cash     6  | Julia   3
Lou    23  | Pat    11  | Foot     6  | Leah    6
Axel   17  | Ed      9  | Tan     10  | Ling   17
36. The Results After 52 Arrivals
RowNo  Sitting  Alloc      RowNo  Sitting  Alloc
0      3                   13     2
1      0                   14     3
2      0                   15     0
3      3        1          16     0
4      0                   17     3        1
5      2                   18     3        1
6      3        2          19     1
7      2                   20     1
8      3                   21     3
9      2                   22     1
10     3                   23     3        2
11     1                   24     1
12     2
37. The Room as a Hashed File
- Each person has a hash key - the name
- Each person is a record
- Each row is a hardware block (bucket)
- Each row number is the address of a bucket
- Records here are fixed-length (and 3 records fit per block)
- The leftover people are collisions (key collisions)
- They will have to be assigned a seat by collision resolution
38. Collision Resolution
- Leave an empty seat in each row
  - under-population: blocks at most 66% full
- A notice at the end of the row: the extra seat for row N can be found at the rear exit
  - the bucket's overflow chain points to an overflow page containing the record
- Everyone stand up while we reallocate seats
  - file reorganisation
39. Collision Resolution
- Strategy 1 (open addressing)
  - Nora is the 4th arrival for row 23
  - Place the new arrival in the next higher-numbered block with a vacancy
- Retrieval - search for Nora
  - Retrieve block 23
  - Read the records in block 23 consecutively. If Nora is not found, try block 24
40. Disadvantages
- May need to read the whole file consecutively on some keys
- Blocks will gradually fill up with out-of-place records
- Deletions cause either immediate or periodic reorganisation
41. Collision Resolution
- Strategy 2
  - Reserve some rows (buckets) for overflow: blocks 25, 26 and 27, or recalculate the hash function with a smaller mod, say 20 instead of 25
  - Julia is then the 4th arrival for block 3
  - Place her in the lowest-numbered overflow block with available space (e.g. block 26), optionally placing a pointer in bucket 3 pointing to block 26
- Retrieval - search for Julia
  - Retrieve block 3
  - Read the records in block 3 consecutively. If Julia is not found, either
    - search the overflow area consecutively, or
    - follow the pointer to block 26 (chaining)
42. Disadvantages
- The overflow area gradually fills up, giving longer retrieval times
- Deletions/additions cause periodic reorganisation
43. Collision Resolution
- More formally:
  - Open addressing: if the location specified by the hash address is occupied, then the subsequent positions are checked in order until an unused (empty) position is found.
  - Chaining: various overflow locations are kept, and a pointer field is added to each record location. A collision is resolved by placing the new record in an unused overflow location and setting the pointer of the occupied hash address location to the address of that overflow location.
  - Multiple hashing: a second hash function is applied if the first results in a collision.
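The first two strategies can be sketched side by side. This is a toy with 5 slots / 5 buckets; the bucket capacity of 2 is an assumption for the demo:

```python
def insert_open_addressing(table, key, h):
    """Linear probing: if slot h(key) is occupied, try the following
    slots (wrapping around) until an empty one is found."""
    n = len(table)
    pos = h(key)
    for i in range(n):
        slot = (pos + i) % n
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("table full")

def insert_chaining(buckets, overflow, key, h):
    """Chaining: a full bucket keeps pointers to records placed in an
    overflow area (bucket capacity 2 is assumed here)."""
    b = buckets[h(key)]
    if len(b["records"]) < 2:
        b["records"].append(key)
    else:
        overflow.append(key)                    # record goes to an overflow location
        b["chain"].append(len(overflow) - 1)    # pointer to that location

h = lambda k: k % 5
table = [None] * 5
for k in (7, 12, 3):                 # 7 and 12 both hash to slot 2
    insert_open_addressing(table, k, h)
print(table)                         # [None, None, 7, 12, 3]: 12 took the next slot

buckets = [{"records": [], "chain": []} for _ in range(5)]
overflow = []
for k in (7, 12, 22, 3):             # 7, 12 and 22 all hash to bucket 2
    insert_chaining(buckets, overflow, k, h)
print(buckets[2], overflow)          # 22 overflows; bucket 2 keeps a pointer to it
```

Note how open addressing pushed record 3 out of its home slot (3 % 5 = 3 was taken by 12): this is exactly the "out-of-place records" disadvantage listed above.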
44. Performance on Hashed Files
- Retrieve (SELECT): very fast if the name is known, otherwise hopeless
  - SELECT * FROM ROOM
    WHERE NAME = 'McWilliam'
  - SELECT * FROM ROOM
    WHERE AGE > 30
- Update: same
  - UPDATE ROOM SET ATTENTION = 'low'
    WHERE NAME = 'McWilliam'
  - UPDATE ROOM SET ATTENTION = 'high'
    WHERE AGE > 50 OR AGE < 10
45. Performance on Hashed Files
- Delete: same as SELECT and UPDATE
  - DELETE FROM ROOM (uses the hash - fast)
    WHERE NAME = 'Nora'
  - DELETE FROM ROOM (can't use the hash - slow)
    WHERE NAME LIKE 'No%'
- Insert: unpredictable
  - INSERT INTO ROOM
    VALUES ('Smyth', 'high')
46. Internal Hashing
- Internal hashing is used as an internal search structure within a program whenever a group of records is accessed exclusively by using the value of one field.
- Applicable to smaller files
- Hashed in main memory: fast lookup in store
- R records, an R-length array
- The hash function transforms the key field into an array subscript in the range 0 to R - 1
  - hash(KeyValue) = KeyValue mod R
- The subscript is the record's address in store
47. External Hashing
- Hashing for disk files is called external hashing
- The address space is made of buckets, each of which holds multiple records
- A bucket is either one disk block or a cluster of contiguous blocks
- The hashing function maps a key into a relative bucket number
- A table maintained in the file header converts the bucket number into the corresponding disk block address
48. External Hashing (static)
- The hashing scheme is called static hashing if a fixed number of buckets M is allocated.
- If a record is to be retrieved with a search condition specified on the key, the number of the bucket potentially containing that record is determined by applying the hashing function to the key, and then that bucket is examined for the desired record. If the record is not in that bucket, the search can continue in the overflow buckets.
49. External Hashing (static)
- Construction of a hashed file:
  - Identify the size of the file, choose a hashing function (according to the anticipated number of buckets) and decide on the collision resolution procedure - for the life of the file
  - Apply the hashing function to each inserted record to get the bucket number, and place the record in the bucket with that number
  - If the bucket is full, apply the selected collision resolution procedure
  - If the number of records in overflow buckets is large, and/or the distribution of records over the buckets is highly non-uniform, then reorganise the file using a changed hashing function (tuning)
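The construction steps above, sketched for a static hashed file. M = 5 buckets and a bucket capacity of 2 are assumptions chosen for the demo:

```python
M, BFR = 5, 2                        # number of buckets and bucket capacity (assumed)
buckets = [[] for _ in range(M)]     # primary buckets
overflow = []                        # shared overflow area

def bucket_no(key):
    return key % M                   # hashing function, fixed for the file's life

def insert(key):
    b = bucket_no(key)
    if len(buckets[b]) < BFR:
        buckets[b].append(key)       # place the record in its bucket
    else:
        overflow.append((b, key))    # bucket full: resolve the collision via overflow

def find(key):
    b = bucket_no(key)               # examine the addressed bucket first ...
    if key in buckets[b]:
        return f"bucket {b}"
    if (b, key) in overflow:         # ... then continue in the overflow area
        return "overflow"
    return None

for k in (10, 15, 20, 7, 13):        # 10, 15 and 20 all hash to bucket 0
    insert(k)
print(find(20))   # in overflow: bucket 0 was already full
print(find(7))    # in bucket 2
```

Keys that hash to a full bucket (here 20) cost an extra probe of the overflow area on every lookup, which is why a large overflow population is the signal to re-tune the hash function.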
50. External Hashing (static)
[Diagram: a key is fed to hash function H; H(key) mod N selects one of the N primary buckets 0 ... N-1; a full bucket points to an overflow page]
51. Problems of Static Hashing
- The number of buckets is fixed
  - shrinkage causes wasted space
  - growth causes long overflow chains
- Solutions
  - reorganise
  - re-hash
  - use dynamic hashing...
52. Extendible Hashing
- Previously, to insert a new record into a full bucket:
  - add an overflow page, or
  - reorganise by doubling the bucket allocation and redistributing the records
- This is a poor solution:
  - the entire file is read
  - twice as many pages have to be written
- Solution: extendible hashing
  - add a directory of pointers to buckets
  - double the number of buckets by doubling the directory
  - split only the bucket that has overflowed
53. Linear Hashing
- It does not have a directory at all
- Instead, it has a family of hash functions to manage dynamic expansion and contraction of the file
- Start with a set number of M buckets 0..M-1 and hashing function mod M
- Split the buckets in linear order when more space is needed. The next hashing function is mod 2M, then mod 4M, and so on, as required
- Example: block capacity is 2 records. Records with key values 72, 62 and 32 collide under hashing function mod 10, but after application of the next hashing function (mod 20) they do not (one bucket contains 72 and 32, and another 62)
- Combines controlled overflow with new space acquisition
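The slide's split example can be checked directly. `rehash_on_split` is a name made up for this sketch; it groups keys under mod m and under the next function mod 2m:

```python
def rehash_on_split(keys, m):
    """Group keys by h0 = key mod m, then by the next function h1 = key mod 2m."""
    old, new = {}, {}
    for k in keys:
        old.setdefault(k % m, []).append(k)        # buckets before the split
        new.setdefault(k % (2 * m), []).append(k)  # buckets after the split
    return old, new

old, new = rehash_on_split([72, 62, 32], 10)
print(old)   # {2: [72, 62, 32]} - all three collide under mod 10
print(new)   # {12: [72, 32], 2: [62]} - no bucket exceeds capacity 2 under mod 20
```

Only the overfull bucket needs to be split and its records redistributed; the rest of the file is untouched, which is the point of linear hashing.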
54. Linear Hashing - Advantages
- Another type of dynamic hashing
- does not require a directory
- manages collisions well
- accommodates insertions and deletions well
- allows overflow chain length to be traded against average space utilisation
- uses several hash functions
55. Hashing Summary
- Comparison of simple and hashed files
- Pages in a hashed file are grouped into buckets (1 block or a cluster of contiguous blocks)
- reduce a large address space
- provide close to direct access
- Static hashing has some disadvantages, which are addressed in dynamic hashing solutions (extendible, linear)
56. Hashing Summary (continued)
- Static hashed files are kept at about 80% occupancy, then reorganised. New pages are added when each existing page is about 80% full
- Hence, the time to read the entire file is about 1.25 times that of a non-hashed file
- Dynamic hashing provides flexibility in the usage of file storage space (expansion and contraction)