1
File Organisation
2
Placing File on Disk
  • File = a sequence of records
  • Records
    • record type
    • record fields
      • data type
      • number of bytes in a field
        • fixed
        • variable

3
Record Characteristics
  • A logical view
  • SELECT * FROM STUDENTS, or
  • (Smith, 17, 1, CS), (Brown, 8, 2, CS), or
  • STUDENT(Name, Number, Class, Major)
  • A physical view
  • (20 bytes, 4 bytes, 4 bytes, 3 bytes)
  • data types determine record length
  • records can be of fixed or variable length

4
Fixed Versus Variable Length Records
  • FIXED LENGTH
  • every record has the same fields
  • each field can be located at a fixed offset relative to the
    record start
  • VARIABLE-LENGTH FIELDS
  • some fields have unknown length
  • use a field separator
  • use a record terminator
  • WHAT IF RECORDS ARE SMALLER THAN A BLOCK? -
    BLOCKING FACTOR
  • WHAT IF RECORDS ARE LARGER THAN A BLOCK? -
    SPANNING RECORDS

5
Record blocking
  • Allocating records to disk blocks
  • Unspanned records
  • Each record is fully contained in one block
  • Many records in one block
  • Blocking factor bfr = the number of records that fit in one
    block
  • Example: block size B = 1024 bytes, fixed record size
    R = 150 bytes
  • bfr = ⌊1024/150⌋ = 6 (⌊ ⌋ is the floor function, ⌈ ⌉ the
    ceiling function); a worked sketch follows below
  • Spanned organization
  • A record is continued on the consecutive block
  • Requires a pointer to the block holding the remainder of the
    record
  • If records are of variable length, then bfr represents the
    average number of records per block (the rounding function
    does not apply)
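A minimal worked sketch of the blocking-factor arithmetic above, in Python; the
record count r = 30000 is an assumed value used only for illustration:

    from math import ceil

    B = 1024            # block size in bytes (from the example)
    R = 150             # fixed record size in bytes (from the example)
    r = 30000           # number of records in the file (assumed for illustration)

    bfr = B // R                 # blocking factor = floor(B / R) = 6 records per block
    wasted = B - bfr * R         # unused bytes per block in the unspanned organisation = 124
    b = ceil(r / bfr)            # blocks needed to store the file = ceil(r / bfr) = 5000

    print(bfr, wasted, b)        # -> 6 124 5000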

6
File structure
  • File as a set of pages (disk blocks) storing records
  • File header
    • record format, types of separators
    • block address(es)
  • Blocks allocated
    • contiguous
    • linked (use of block pointers)
    • linked clusters
    • indexed

7
Searching for a record
  • To search for a record on disk, one or more file blocks are
    copied into buffers.
  • Programs search for the desired record in the buffers, using
    the information in the file header.
  • If the address of the block with the desired record is not
    known, the search programs must do a linear search through
    the file blocks. Each file block is copied into a buffer and
    searched until either the record is located or all the file
    blocks have been searched unsuccessfully (a sketch of this
    block-by-block search appears below).
  • The goal of a good file organization is to locate the block
    that contains a desired record with a minimal number of
    block transfers.
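The block-by-block search described above can be sketched as follows; read_block
and the record layout are hypothetical stand-ins for the real buffer manager and
file-header information, not an actual DBMS interface:

    def linear_search(block_addresses, read_block, key_field, key_value):
        """Scan a file block by block until the desired record is found."""
        transfers = 0
        for address in block_addresses:
            buffer = read_block(address)      # copy one disk block into a main-memory buffer
            transfers += 1
            for record in buffer:             # scan the records held in the buffered block
                if record[key_field] == key_value:
                    return record, transfers  # found: also report how many blocks were read
        return None, transfers                # unsuccessful: every block was read and searched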

8
Operations on Files
  • Because of the complex path from stored data to the user, a
    DBMS offers a range of I/O operations:
  • OPEN - access the file and prepare pointer
  • FIND (LOCATE) - find first record
  • FINDNEXT
  • FINDALL - set
  • READ
  • INSERT
  • DELETE
  • MODIFY
  • CLOSE
  • REORGANISE - set
  • READ-ORDERED (FIND-ORDERED) - set

9
File organization and access method.
  • Difference between the terms
  • file organization and
  • access method.
  • A file organization is the organization of the data of a
    file into records, blocks, and access structures
  • a way of placing records and blocks on the storage medium
  • An access method provides a group of operations that can be
    applied to a file, resulting in retrieval, modification and
    reorganisation
  • One file organization can accept many different access
    methods. Some access methods, though, can be applied only to
    files with a specific file organization.
  • For example, one cannot apply an indexed access
    method to a file without an index

10
Why do Access Methods matter
  • The unit of transfer between disk and main memory
    is a block
  • Data must be in memory for the DBMS to use it
  • DBMS memory is handled in units of a page, e.g.
    4K, 8K. Pages in memory represent one or more
    hardware blocks from the disk
  • If a single item is needed, the whole block is
    transferred
  • The time taken for an I/O depends on the location of the
    data on the disk; it is lower if seek times and rotational
    delays are small. Recall that access time = seek time +
    rotational delay + transfer time
  • The reason many DBMS do not rely on the OS file system is:
  • higher level DB operations, e.g. JOIN, have a
    known pattern of page accesses and can be
    translated into known sets of I/O operations
  • buffer manager can PRE-FETCH pages by
    anticipating the next request. This is
    especially efficient when the required data are
    stored CONTIGUOUSLY on disk

11
Simple File Organisations
  • Unordered files of records: Heap or Pile file
  • New records are inserted at EOF, or anywhere
  • locating a record requires a linear search
  • insertion is easy
  • retrieval of an individual record, or of records in any
    order, is difficult (time consuming)
  • Question: on average, how many blocks does one need to read
    to find a single record?
  • Fast: SELECT * FROM Course
  • Slow: SELECT COUNT(*) FROM Course
         GROUP BY Course_Number

12
Operations on Unordered File
  • Inserting a new record is very efficient
  • The address of the last file block is kept in the
    file header
  • The last disk block of the file is copied into a
    buffer page
  • The new record is added (or a new page is allocated); the
    page is then rewritten back to the disk block.
  • Searching for a record using any search condition
    in a file stored in b blocks
  • Linear search through the file, block by block
  • Cost: b/2 block transfers on average, if only one record
    satisfies the search condition
  • Cost: b block transfers if no record (or several records)
    satisfies the search condition; the program must read and
    search all b blocks in the file (see the cost sketch below)
  • To delete a record,
  • find its block and copy the block into a buffer
    page,
  • delete the record from the buffer,
  • rewrite the updated page back to the disk block.
  • Note: unused space in the block could be reused later for a
    new record, if suitable (some bookkeeping is necessary on
    unused space in file blocks)
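A rough back-of-the-envelope version of the costs above; the block count b is an
assumed example value:

    b = 5000                       # number of blocks in the heap file (assumed)

    insert_cost = 2                # read the last block, add the record, rewrite the block
    search_hit = b / 2             # average transfers when exactly one record matches
    search_miss = b                # transfers when no (or several) records match: scan everything
    delete_cost = b / 2 + 1        # locate the block (on average) plus one block rewrite

    print(insert_cost, search_hit, search_miss, delete_cost)   # -> 2 2500.0 5000 2501.0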

13
Special Deletion Procedures
  • Technique used for record deletion:
  • Each record has an extra byte or bit, called a deletion
    marker, set to 1 at insertion *)
  • Do not physically remove a deleted record; instead, reset
    its deletion marker to 0 when it is deleted
  • A record whose deletion marker is set to 0 is not used by
    application programs
  • From time to time, reorganise the file: physically remove
    deleted records and reclaim unused space (see the sketch
    below)
  • *) Just for simplicity we assume that the values of deletion
    markers are 0 or 1. A system can actually choose other
    characters or combinations of bits as values of deletion
    markers.
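A minimal sketch of the deletion-marker technique; representing each record as a
Python dict with a 'live' field playing the role of the marker (1 = present,
0 = deleted) is an illustrative assumption:

    def delete(record):
        record["live"] = 0            # logical delete: flip the marker, leave the record in place

    def reorganise(blocks):
        """Periodic reorganisation: physically drop records whose marker is 0."""
        return [[r for r in block if r["live"] == 1] for block in blocks]

    blocks = [[{"name": "Smith", "live": 1}, {"name": "Brown", "live": 1}]]
    delete(blocks[0][1])              # Brown is now logically deleted but still stored
    blocks = reorganise(blocks)       # ... and physically removed only at reorganisation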

14
Simple File Organisations
  • Ordered files of records - sequential files
  • still extremely useful in DBMS (auditing, recovery,
    security)
  • A record field is nominated and records are ordered based on
    that field
  • the ordering key
  • insertion is expensive
  • retrieval is easy (efficient) if it exploits the sort order
  • binary search reduces search time significantly
  • Fast: SELECT * FROM Course ORDER BY <ordering field>
  • Slow: SELECT * FROM Course WHERE <any other attribute> = c

15
Retrieval and Update in Sorted Files
  • Binary search on the ordering field to find the block
    holding key k (a runnable version appears below)
  • b = number of blocks; Low = 0; High = b
  • Do while not (Found or NotThere)
  •   Mid = (Low + High) / 2
  •   Read block Mid
  •   If k < key field of the first record in the block
  •     Then High = Mid - 1
  •   Else if k > key field of the last record in the block
  •     Then Low = Mid + 1
  •   Else if the record with key k is in the buffer
  •     Then Found
  •   Else NotThere
  • end
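A runnable Python version of the block-level binary search above, assuming the
file is modelled as a list of blocks and each block is a sorted list of
(key, record) pairs:

    def binary_search_blocks(blocks, k):
        low, high = 0, len(blocks) - 1
        transfers = 0
        while low <= high:
            mid = (low + high) // 2
            block = blocks[mid]              # stands in for "read block Mid into a buffer"
            transfers += 1
            if k < block[0][0]:              # k is before the first key in this block
                high = mid - 1
            elif k > block[-1][0]:           # k is after the last key in this block
                low = mid + 1
            else:                            # k, if present, must be inside this block
                for key, record in block:
                    if key == k:
                        return record, transfers
                return None, transfers       # NotThere
        return None, transfers               # NotThere

    blocks = [[(1, "R1"), (3, "R3")], [(5, "R5"), (8, "R8")], [(9, "R9"), (12, "R12")]]
    print(binary_search_blocks(blocks, 8))   # -> ('R8', 1): roughly log2(b) block reads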

16
Operations on Ordered File
  • Searching for records when criteria are specified
    in terms of ordering field
  • Reading the records in order of the ordering key
    values is extremely efficient,
  • Finding the next record from the current one in
    order of the ordering key usually requires no
    additional block accesses,
  • the next record is in the same block or in the
    next block
  • using a search condition based on the value of an
    ordering key field results in faster access when
    the binary search technique is used,
  • A binary search can be done on the blocks rather than on the
    records. A binary search usually accesses log2(b) blocks,
    whether the record is found or not
  • There is no advantage if the search criterion is specified
    in terms of non-ordering fields

17
Operations on Ordered File
  • Inserting records is expensive. To insert a record:
  • find its correct position in the file, based on its ordering
    field value (cost: log2(b) block accesses)
  • make space in the file to insert the record in that position
  • on average, half the records of the file must be moved to
    make space for the new record
  • these file blocks must be read and rewritten to keep the
    order; the cost of insertion is then b/2 block transfers
  • Deleting a record:
  • find the record using binary search on the ordering field
    value (cost: log2(b))
  • delete the record
  • reorganise part of the file (all records after the deleted
    one: b/2 blocks on average)
  • Modifying a record:
  • find the record using binary search and update as required

18
Operations on Ordered File
  • Alternative ways for more efficient insertion
  • keep some unused space in each block for new
    records (not good - problem returns when that
    space is filled up)
  • create and maintain a temporary unordered file
    called an overflow file.
  • New records are inserted at the end of the
    overflow file
  • Periodically, the overflow file is sorted and
    merged with the main file during file
    reorganization.
  • Searching for a record must involve both files, main and
    overflow; the cost of searching is thus higher, but for a
    large main file it will still be close to log2(b) (see the
    lookup sketch below)
  • Alternative way for more efficient deletion
  • Use the technique based on deletion marker, as
    described earlier
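A minimal sketch of the combined lookup; find_in_main is a hypothetical stand-in
for the binary search on the ordered main file, and the overflow file is modelled
as an unordered list of (key, record) pairs:

    def lookup(find_in_main, main_blocks, overflow, k):
        record = find_in_main(main_blocks, k)   # about log2(b) block accesses on the main file
        if record is not None:
            return record
        for key, rec in overflow:               # linear scan of the (small) unsorted overflow file
            if key == k:
                return rec
        return None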

19
Access Properties of Simple Files
  • Heap (sequential unordered)
  • Ordered (sequential) file
  • Note: in this and the following examples, record numbers
    correspond to values of the ordering field in ascending
    order

20
Access Properties of Simple Files
Insert into Heap file record R15
  • (figures: file pages before and after the insertion)

21
Access Properties of Simple Files
Insert into Ordered file record R15
  • (figures: file pages before and after the insertion)

Notice that all records after R15 have changed
their page location or position on the page
22
Access Properties of Simple Files
Insert into Ordered file records R15, R9, R17
using an overflow file
  • (figures: main file and overflow file before and after the
    insertion; the main file is unchanged, and R15, R9 and R17
    are appended to the overflow file)

Periodically the overflow file is sorted and merged
with the main file
23
Access Properties of Simple Files
  • Deletions from a Heap: R10, R3, R7
  • Simple delete
  • (figures: file pages before and after the delete operations)

24
Access Properties of Simple Files
  • Deletions from a Heap: R10, R3, R7
  • using the deletion marker technique
  • (figures: file pages before and after the delete operations)

Deletion markers are set to 0, and later these
records will be physically removed when the file is
reorganised
25
Access Properties of Simple Files
  • Deletions from an ordered file: R10, R3, R7
  • Simple delete
  • (figures: file pages before and after the delete operations)

26
Access Properties of Simple Files
  • Deletions from an ordered file: R10, R3, R7
  • Using the deletion marker technique
  • (figures: file pages before and after the delete operations)

Deletion markers are set to 0, and later these
records will be physically removed when the file is
reorganised
27
Retrieval and Update In Heaps
  • Quick summary
  • can only use linear search
  • insertion is fast
  • deletion, update are slow
  • parameter search (e.g. SELECT ... WHERE) is slow
  • unconditional search can be fast if
  • records are of fixed length
  • records do not span blocks
  • the j-th record can then be located by position: it is in
    block ⌈j / bfr⌉
  • average time to find a single record = b/2 block reads
    (b = number of blocks)

28
Retrieval and Update in Sorted Files
  • Quick summary
  • retrieval on key field is fast - next record is
    nearby
  • any other retrieval either requires a sort, or an
    index, or is as slow as a heap
  • update, delete, insert are slow (find block,
    update block, rewrite block)

29
FAST ACCESS FOR DATABASE HASHING
  • Types of hashing: static or dynamic
  • What is the point of hashing?
  • reduce a large address space
  • provide close to direct access
  • provide reasonable performance for all of U, I, D, S
    (update, insert, delete, select)
  • What is a hash function?
  • properties
  • behaviour
  • Collisions
  • Collision resolution
  • Open addressing
  • Summary

30
What Is Hashing, and What Is It For?
  • direct access to block containing the desired
    record
  • reduce the number of blocks read or written
  • allow for file expansion and contraction with
    minimal file reorganising
  • permit retrieval on hashed fields without
    re-sorting the file
  • no need to allocate contiguous disk areas
  • if the file is small, use internal hashing; otherwise
    external hashing
  • no direct access other than by hashing

31
A basic example of hashing
  • There are 25 rows of seats, with 3 seats per row
    (75 seats total)
  • We have to allocate each person to a row in
    advance, at random
  • We will hash on their family name so as to find the person's
    row number directly, knowing only the name
  • The database is logically a single table
  • ROOM (Name, Age, Attention)
  • implemented as a blocked, hashed file

32
The hashing process
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
  • The hash process is (a runnable sketch appears below):
  • Loc = 0
  • Until there are no more characters in YourName:
  •   add the alphabetic position of the character to Loc
  • Calculate RowNum = Loc mod 25
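A runnable sketch of the name-hashing scheme above:

    def row_num(name, rows=25):
        """Sum the alphabetic positions (A=1 .. Z=26) of the letters, then take mod rows."""
        loc = sum(ord(c) - ord('A') + 1 for c in name.upper() if c.isalpha())
        return loc % rows

    for name in ["REYE", "ANDERSON", "MCWILLIAM", "TANG", "LEE"]:
        print(name, row_num(name))
    # REYE 3, ANDERSON 15, MCWILLIAM 20, TANG 17, LEE 22 - matching the table on the next slide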

33
Examples - Hashed Names
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
NAME         Loc   RowNum (mod 25)
REYE          53     3
ANDERSON      90    15
MCWILLIAM     95    20
TANG          42    17
LEE           22    22
Where is MCWILLIAM?  Hash(MCWILLIAM) = Row 20
34
(the hashed-names table from the previous slide, repeated)
35
Name Hashing Example continued
Name    Row   Name    Row   Name    Row   Name    Row
Lee      22   Alex     17   George    7   Anne      9
West     17   Rita     23   Guy       3   Will      6
James    23   Jodie    18   Dave      7   Wilf      0
Anna      5   Jill     18   Don       8   Walt      6
Anita    18   Lily      8   Dixy     12   Jack      0
Jie      24   Ash       3   Jon      14   Lana      3
Kenny    19   Ben      21   Nina     13   Olga     10
Marie    21   Kay      12   May      14   Fred      8
Lois      5   Peter    14   Max      13   Tania    18
Best     21   Paul      0   Nora     23   Tom      23
Rob      10   Phil     20   Cash      6   Julia     3
Lou      23   Pat      11   Foot      6   Leah      6
Axel     17   Ed        9   Tan      10   Ling     17
36
The results after 52 arrivals
RowNo  Sitting  Alloc        RowNo  Sitting  Alloc
  0       3                    13      2
  1       0                    14      3
  2       0                    15      0
  3       3       1            16      0
  4       0                    17      3       1
  5       2                    18      3       1
  6       3       2            19      1
  7       2                    20      1
  8       3                    21      3
  9       2                    22      1
 10       3                    23      3       2
 11       1                    24      1
 12       2
(Alloc = arrivals whose hashed row was already full and who must
be seated elsewhere, i.e. the collisions; 45 seated + 7 overflow
= 52 arrivals)

37
The Room as a Hashed File
  • Each person has a hash key - the name
  • Each person is a record
  • Each row is a hardware block (bucket)
  • Each row number is the address of a bucket
  • Records here are fixed-length (and 3 records per
    block)
  • The leftover people are collisions (key
    collisions)
  • They will have to be found a seat by collision
    resolution

38
Collision Resolution
  • Leave an empty seat in each row
  • under-population: blocks only 66% full
  • A notice at the end of the row: "the extra seat for row N
    can be found at the rear exit"
  • the bucket's overflow chain points to an overflow page
    containing the record
  • "Everyone stand up while we reallocate seats"
  • file reorganisation

39
Collision Resolution
  • Strategy 1 (open addressing)
  • Nora is the 4th arrival for row 23
  • Place the new arrival in the next higher-numbered block with
    a vacancy
  • Retrieval - search for Nora
  • Retrieve block 23
  • Read the records in block 23 consecutively; if Nora is not
    found, try block 24, and so on

40
  • Disadvantages
  • May need to read whole file consecutively on some
    keys
  • Blocks will gradually fill up with out-of-place
    records
  • Deletions cause either immediate or periodic
    reorganisation

41
Collision Resolution
  • Strategy 2
  • Reserve some rows (buckets) for overflow (blocks 25, 26 and
    27), or recalculate the hash function with a smaller
    modulus, say 20 instead of 25
  • Julia is then the 4th arrival for block 3
  • Place her in the overflow block with the smallest label that
    has available space (block 26), optionally placing a pointer
    in bucket 3 pointing to block 26
  • Retrieval - search for Julia
  • Retrieve block 3
  • Read the records in block 3 consecutively. If Julia is not
    found, either
  • search the overflow blocks consecutively, or
  • follow the pointer to block 26 (chaining)

42
  • Disadvantages
  • Overflow gradually fills up giving longer
    retrieval times
  • Deletions/additions cause periodic
    reorganisation

43
Collision Resolution
  • More formally (a sketch of open addressing follows below):
  • Open addressing: if the location specified by the hash
    address is occupied, the subsequent positions are checked in
    order until an unused (empty) position is found.
  • Chaining: various overflow locations are kept and a pointer
    field is added to each record location. A collision is
    resolved by placing the new record in an unused overflow
    location and setting the pointer of the occupied
    hash-address location to the address of that overflow
    location.
  • Multiple hashing: a second hash function is applied if the
    first results in a collision.
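A minimal sketch of open addressing on a bucketed file; the table size, bucket
capacity and use of Python's built-in hash() are illustrative assumptions, not
the lecture's code:

    M, BUCKET_SIZE = 25, 3
    buckets = [[] for _ in range(M)]          # each bucket models one block holding up to 3 records

    def insert(key, record):
        start = hash(key) % M
        for i in range(M):                    # probe buckets start, start+1, ... wrapping around
            b = buckets[(start + i) % M]
            if len(b) < BUCKET_SIZE:
                b.append((key, record))
                return (start + i) % M        # bucket where the record actually landed
        raise RuntimeError("file full")       # every bucket is full: reorganisation needed

    def find(key):
        start = hash(key) % M
        for i in range(M):                    # follow the same probe sequence as insertion
            b = buckets[(start + i) % M]
            for k, rec in b:
                if k == key:
                    return rec
            if len(b) < BUCKET_SIZE:          # a non-full bucket ends the probe chain
                return None
        return None

Note that stopping at a non-full bucket is only safe while records are never
physically removed, which is exactly why deletions force the reorganisation
mentioned on the earlier slide.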

44
Performance on Hashed Files
  • Retrieve (SELECT): very fast if the name is known, otherwise
    hopeless
  • SELECT * FROM ROOM
    WHERE NAME = 'McWilliam'        (fast - uses the hash)
  • SELECT * FROM ROOM
    WHERE AGE > 30                  (slow - cannot use the hash)
  • Update: same
  • UPDATE ROOM SET ATTENTION = 'low'
    WHERE NAME = 'McWilliam'
  • UPDATE ROOM SET ATTENTION = 'high'
    WHERE AGE > 50 OR AGE < 10

45
Performance on Hashed Files
  • Delete: same as SELECT, UPDATE
  • DELETE FROM ROOM                (uses hash - fast)
    WHERE NAME = 'Nora'
  • DELETE FROM ROOM                (can't use hash - slow)
    WHERE NAME LIKE 'No%'
  • Insert: unpredictable
  • INSERT INTO ROOM
    VALUES ('Smyth', 'high')

46
Internal Hashing
  • Internal hashing is used as an internal search
    structure within a program whenever a group of
    records is accessed exclusively by using the
    value of one field.
  • Applicable to smaller files
  • Hashed in main memory: fast lookup in store
  • R records, an array of length R
  • The hash function transforms the key field into an array
    subscript in the range 0 to R - 1
  • hash(KeyValue) = KeyValue mod R
  • the subscript is the record's address in store (see the
    sketch below)
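A minimal sketch of internal hashing with an in-memory array of R slots;
collision handling is deliberately omitted, and R = 101 is an assumed value:

    R = 101
    table = [None] * R                       # one slot per possible subscript 0 .. R-1

    def store(key, record):
        table[key % R] = (key, record)       # the subscript is the record's address in store

    def fetch(key):
        slot = table[key % R]
        return slot[1] if slot and slot[0] == key else None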

47
External Hashing
  • Hashing for disk files is called external
    hashing.
  • address space is made of buckets, each of which
    holds multiple records.
  • A bucket is either one disk block or a cluster of
    contiguous blocks.
  • The hashing function maps a key into a relative
    bucket number,
  • A table maintained in the file header converts the bucket
    number into the corresponding disk block address (see the
    sketch below)
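A minimal sketch of the two-step mapping (key -> relative bucket number -> disk
block address); the bucket count M and the contents of the header table are
assumed for illustration:

    M = 8
    bucket_to_block = {i: 1000 + 4 * i for i in range(M)}   # stand-in for the table in the file header

    def block_address(key):
        bucket = key % M                     # the hash function gives a relative bucket number
        return bucket_to_block[bucket]       # the header table converts it to a disk block address

    print(block_address(72))                 # 72 -> bucket 0 -> block 1000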

48
External Hashing (static)
  • The hashing scheme is called static hashing if a
    fixed number of buckets M is allocated.
  • To retrieve a record given a search condition on the key,
    the hashing function is applied to the key to determine the
    number of the bucket that potentially contains the record,
    and that bucket is then examined for the desired record. If
    the record is not in that bucket, the search continues in
    the overflow buckets.

49
External Hashing (static)
  • Construction of a hashed file
  • Identify the size of the file, choose a hashing function
    (according to the anticipated number of buckets) and select
    a collision resolution procedure - these choices hold for
    the life of the file
  • Apply the hashing function to each inserted record to get
    its bucket number and place the record in the bucket with
    that number
  • If the bucket is full, apply the selected collision
    resolution procedure
  • If the number of records in overflow buckets becomes large
    and/or the distribution of records across buckets is highly
    non-uniform, then reorganise the file using a changed
    hashing function (tuning)

50
External Hashing (static)
(figure: a search key is fed to the hash function H; H(key) mod N
selects one of the N primary buckets 0, 1, ..., N-1; a full
primary bucket chains to an overflow page)
51
Problems of Static Hashing
  • Number of buckets is fixed
  • shrinkage causes wasted space
  • growth causes long overflow chains
  • Solutions
  • reorganise
  • re-hash
  • use dynamic hashing...

52
Extendible Hashing
  • Previously, to insert a new record into a full
    bucket
  • add overflow page, or
  • reorganise by doubling the bucket allocation and
    redistributing the records
  • This is a poor solution
  • entire file is read
  • twice as many pages have to be written
  • Solution Extendible Hashing
  • add a directory of pointers to buckets
  • double the number of buckets by doubling the
    directory
  • split only the bucket that has overflowed

53
LINEAR HASHING
  • Linear hashing does not use a directory at all
  • Instead, it uses a family of hash functions to manage
    dynamic expansion and contraction of the file
  • Start with a fixed number M of buckets, numbered 0..M-1,
    with hashing function key mod M
  • Split the buckets in linear order when more space is needed.
    The next hashing function is mod 2M, then mod 4M, and so on
    as required
  • Example: block capacity is 2 records. Records with key
    values 72, 62 and 32 all collide under the hashing function
    mod 10, but after applying the next hashing function, mod
    20, they do not: one bucket contains 72 and 32 and another
    contains 62 (a quick check appears below)
  • Combines controlled overflow with new space acquisition
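A quick check of the mod 10 / mod 20 example above:

    keys = [72, 62, 32]
    print([k % 10 for k in keys])   # [2, 2, 2]   - all three collide in bucket 2
    print([k % 20 for k in keys])   # [12, 2, 12] - after the split, 72 and 32 share a bucket, 62 is alone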

54
Linear Hashing - Advantages
  • Another type of dynamic hashing
  • does not require a directory
  • manages collisions well
  • accommodates insertions and deletions well
  • allows overflow chain length to be traded against
    average space utilisation
  • uses several hash functions

55
HASHING SUMMARY
  • Comparison of Simple and Hashed Files
  • Pages in a hashed file are grouped into buckets
    (1 block or a cluster of contiguous blocks)
  • reduce a large address space
  • provide close to direct access
  • Static hashing has some disadvantages which are
    addressed in dynamic hashing solutions
    (extendible, linear)

56
  • Static hashed files are kept at about 80% occupancy and then
    reorganised. New pages are added when each existing page is
    about 80% full
  • Hence, the time to read the entire file is ≈ 1.25 × that of
    a non-hashed file
  • Dynamic hashing provides flexibility in the use of file
    storage space (expansion and contraction)