Objectives: - PowerPoint PPT Presentation

Chapter 6 Organizing File for Performance. Folk/Zoellick/Riccardi, ... Dynamic space reclaiming for ... The small free slots are placed at the beginning of ... – PowerPoint PPT presentation

Provided by: chuahua
1
Chapter 6 Organizing Files for Performance
  • Objectives
  • To get familiar with
  • Data compression
  • Storage management
  • Internal sorting and binary search

2
Outline
  • Data compression
  • Reclaiming space in files
  • Record deletion
  • Dynamic space reclaiming for fixed-length records
  • Dynamic space reclaiming for variable-length
    records
  • Storage fragmentation
  • Internal sorting and binary search
  • Keysorting

3
Data Compression
  • Data compression encodes a file so that it takes
    up less space. Smaller files
  • Use less storage,
  • Can be transmitted faster,
  • Can be processed faster sequentially.
  • Encoding with a different (compact) notation
  • The State field in the address file requires
    two bytes. However, the 50 states can be encoded
    using 6 bits, which fit in one byte: a 50% space
    saving for each occurrence of the state field.
  • The compact notation is a redundancy reduction
    technique.
  • Costs
  • The file is not readable by humans.
  • Encoding and decoding add processing overhead.
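
The bit count for the compact notation can be checked with a short sketch. The helper names and the two-entry state table are illustrative assumptions, not part of the slides; a real table would list all 50 states.

```python
def bits_needed(n_values: int) -> int:
    """Smallest number of bits that can distinguish n_values different codes."""
    return max(1, (n_values - 1).bit_length())

# Deliberately incomplete, hypothetical table: a real one would hold all 50 states.
STATE_CODES = {"IA": 0, "OK": 1}
CODE_TO_STATE = {code: name for name, code in STATE_CODES.items()}

def encode_state(abbrev: str) -> bytes:
    """Store the state as one byte instead of a two-character field."""
    return bytes([STATE_CODES[abbrev]])

def decode_state(packed: bytes) -> str:
    """Recover the two-letter abbreviation from its one-byte code."""
    return CODE_TO_STATE[packed[0]]
```

The decoding table is exactly the cost the slide mentions: the stored byte is meaningless to a human reader without it.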

4
Data Compression (contd)
  • Suppressing repeating sequences
  • Suitable for sparse arrays or images with regions
    of the same color.
  • Run-length encoding: choose an unused byte value
    to indicate that a run-length code follows that
    byte.
  • Encoding algorithm
  • Read through the data (pixels or values) that
    make up the image or data content, copying the
    data values to the file in sequence, except where
    the same data value occurs more than once in
    succession.
  • Where the same value occurs more than once in
    succession, substitute the following three
    entries:
  • The special run-length code indicator,
  • The data value that is repeated, and
  • The number of times that the value is repeated.
  • Example (byte values in hexadecimal, with ff as
    the run-length code indicator)
  • 50 51 52 52 52 52 52 53 54 54 54 54 54 54 54 55
    52 52 53 53 53 54
  • The encoded sequence is
  • 50 51 ff 52 05 53 ff 54 07 55 ff 52 02 ff 53 03
    54
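
The encoding algorithm on this slide can be sketched as follows. This is a minimal illustration, assuming 0xFF never occurs as a data value and that run counts are capped at 255 so they fit in one byte; the function names are hypothetical.

```python
RUN_MARKER = 0xFF  # assumed to be unused as a data value

def rle_encode(values, marker=RUN_MARKER):
    """Replace every run of two or more equal values by marker, value, count."""
    out = []
    i = 0
    while i < len(values):
        value = values[i]
        run = 1
        # Extend the run while the next value repeats (count capped at 255).
        while i + run < len(values) and values[i + run] == value and run < 255:
            run += 1
        if run > 1:
            out.extend([marker, value, run])
        else:
            out.append(value)
        i += run
    return out

def rle_decode(encoded, marker=RUN_MARKER):
    """Expand marker, value, count triples back into the original sequence."""
    out = []
    i = 0
    while i < len(encoded):
        if encoded[i] == marker:
            out.extend([encoded[i + 1]] * encoded[i + 2])
            i += 3
        else:
            out.append(encoded[i])
            i += 1
    return out
```

Applied to the example sequence above, `rle_encode` produces the encoded sequence shown on the slide. Note that a run of only two values grows from two bytes to three, so a practical encoder might leave such short runs alone.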

5
Data Compression (contd)
  • Variable-length encoding
  • Letters with high frequency are encoded using
    shorter symbols.
  • Letters with low frequency are encoded using
    longer symbols.
  • Huffman code (for a set of seven letters)
  • At most four bits per letter (a fixed-length
    code for seven letters needs at least 3 bits).
  • The string abefd is encoded as
    1010000100100000.
  • Huffman codes are used in some UNIX systems for
    data compression.
  • Irreversible compression techniques
  • Voice coding
  • Some image coding schemes that change pixel
    granularity or reduce color quality
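
Returning to the Huffman example on this slide: the code table itself did not survive in the transcript, so the table below is an assumption. It is a prefix-free code for seven letters that reproduces the encoding of abefd given above; the function names are illustrative.

```python
# Assumed code table (not in the transcript): the frequent letter "a" gets
# one bit, the remaining letters get three or four bits.
HUFFMAN_CODES = {
    "a": "1", "b": "010", "c": "011",
    "d": "0000", "e": "0001", "f": "0010", "g": "0011",
}

def huffman_encode(text, codes=HUFFMAN_CODES):
    """Concatenate the bit string of each letter."""
    return "".join(codes[ch] for ch in text)

def huffman_decode(bits, codes=HUFFMAN_CODES):
    """Read bits until they match a code; prefix-freeness makes this unambiguous."""
    by_code = {code: letter for letter, code in codes.items()}
    letters, current = [], ""
    for bit in bits:
        current += bit
        if current in by_code:
            letters.append(by_code[current])
            current = ""
    if current:
        raise ValueError("bit string ends in the middle of a code")
    return "".join(letters)
```

With this table, abefd takes 16 bits, whereas a fixed 3-bit code would take 15. The variable-length code only pays off when the short codes really do belong to the most frequent letters.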

6
Reclaiming Space in Files
  • File organization with the following operations
  • record insertion
  • record deletion
  • record modification
  • Space reclaiming is needed when
  • deleting fixed-length and variable-length records
  • modifying variable-length records
  • A modification can be treated as a deletion
    followed by an insertion

7
Record Deletion
  • Identifying deleted records
  • Place a special mark in each deleted record.
  • E.g., place an asterisk (*) as the first field in
    a deleted record.
  • Before deletion
  • Ames|John|123 Maple|Stillwater|OK|74075|...
  • Morrison|Sebastian|9035 South Hillcrest|Forest
    Village|OK|78420|
  • Brown|Martha|625 Kimbark|Des Moines|IA|50311|...
  • After deletion
  • Ames|John|123 Maple|Stillwater|OK|74075|...
  • *|rrison|Sebastian|9035 South Hillcrest|Forest
    Village|OK|78420|
  • Brown|Martha|625 Kimbark|Des Moines|IA|50311|...
  • Keep the deleted records around for some time.
  • Delay the disk compaction.
  • Programs must be able to ignore the deleted
    records.
  • Deleted records can be undeleted.
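
Marking a record as deleted and skipping it on reads could look roughly like this. The sketch assumes 64-byte fixed-length records held in a bytearray, with "." as padding; `RECORD_SIZE` and the function names are hypothetical.

```python
RECORD_SIZE = 64  # assumed fixed record length in bytes

def make_record(text: str) -> bytes:
    """Pad a delimited record with dots up to the fixed record size."""
    return text.encode("ascii").ljust(RECORD_SIZE, b".")

def mark_deleted(data: bytearray, rrn: int) -> None:
    """Overwrite the first two bytes of the record with the deletion mark."""
    start = rrn * RECORD_SIZE
    data[start:start + 2] = b"*|"

def live_records(data: bytearray):
    """Yield every record that does not start with the deletion mark."""
    for start in range(0, len(data), RECORD_SIZE):
        record = bytes(data[start:start + RECORD_SIZE])
        if not record.startswith(b"*"):
            yield record
```

Because only two bytes are overwritten, most of the record is still present. That is why it can sit in the file until compaction, but a real undelete would also need the overwritten bytes from somewhere.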

8
Record Deletion (contd)
  • Space reclamation
  • Happens after accumulating a number of deleted
    records.
  • A simple solution is to copy the file by skipping
    the deleted records.
  • Suitable for both fixed-length and
    variable-length records.
  • After space reclamation
  • Ames|John|123 Maple|Stillwater|OK|74075|...
  • Brown|Martha|625 Kimbark|Des Moines|IA|50311|...
  • In-place space reclamation (without copying the
    file) is more complicated and time-consuming.
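
The simple copy-based reclamation can be sketched as a single sequential pass. This assumes fixed-length 64-byte records and a leading "*" as the deletion mark; `compact` is a hypothetical helper working on any binary file-like objects.

```python
import io

RECORD_SIZE = 64  # assumed fixed record length in bytes

def compact(src, dst) -> int:
    """Copy src to dst record by record, skipping records marked as deleted.

    Returns the number of records kept.
    """
    kept = 0
    while True:
        record = src.read(RECORD_SIZE)
        if len(record) < RECORD_SIZE:  # end of file
            break
        if record.startswith(b"*"):    # deleted: do not copy
            continue
        dst.write(record)
        kept += 1
    return kept
```

For variable-length records the same loop applies, except that each step would first read the length field to find out how many bytes to copy or skip.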

9
Dynamic Space Reclaiming -- Fixed-Length Records
  • A naive approach
  • When inserting a new record,
  • search the file record by record;
  • if a deleted record is found, insert the new
    record in the place of the deleted record;
  • otherwise, insert the new record at the end of
    the file.
  • Issues in reclaiming space quickly
  • How to know immediately if there are empty slots
    in the file?
  • How to jump to one of those slots, if they exist?
  • Solution: link all deleted records together in a
    linked list

10
Dynamic Space Reclaiming -- Fixed-Length Records
(contd)
  • Use the linked list of the deleted records as a
    stack
  • Add (push) a recently deleted record of RRN 3 to
    the top of the stack
  • Remove (pop) the free slot at the top of the
    stack and reuse its RRN for an inserted record

11
Dynamic Space Reclaiming -- Fixed-Length Records
(contd)
  • Use the linked list of the deleted records as a
    stack
  • Add (push) a recently deleted record of RRN 3 to
    the top of the stack
  • Insert three new records into the space of the
    deleted records
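
The stack of deleted slots can be sketched in memory as follows. This is an illustrative model and not the slides' implementation: the file is a list of fixed-size slots, the header is the `first_avail` attribute, and each freed slot stores "*" followed by the RRN of the next free slot. `FixedFile`, `RECORD_SIZE` and the link layout are assumptions.

```python
import struct

RECORD_SIZE = 32  # assumed fixed record length in bytes
NO_SLOT = -1      # end-of-list marker, as in the header value -1

class FixedFile:
    def __init__(self):
        self.first_avail = NO_SLOT  # header field: RRN on top of the stack
        self.slots = []             # slot i holds the record with RRN i

    def delete(self, rrn: int) -> None:
        """Push the slot: store the old head inside it, then point the head at it."""
        link = b"*" + struct.pack("<i", self.first_avail)
        self.slots[rrn] = link.ljust(RECORD_SIZE, b".")
        self.first_avail = rrn

    def insert(self, payload: bytes) -> int:
        """Pop a free slot if one exists, otherwise append. Returns the RRN used."""
        record = payload.ljust(RECORD_SIZE, b".")
        if len(record) != RECORD_SIZE:
            raise ValueError("payload longer than the fixed record size")
        if self.first_avail == NO_SLOT:
            self.slots.append(record)
            return len(self.slots) - 1
        rrn = self.first_avail
        (self.first_avail,) = struct.unpack("<i", self.slots[rrn][1:5])
        self.slots[rrn] = record
        return rrn
```

Both questions from the earlier slide are answered by the header: a value of -1 says immediately that there is no free slot, and any other value is the RRN to jump to.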

12
Dynamic Space Reclaiming -- Variable-Length
Records
  • An available list to store the deleted
    variable-length records
  • How to link the deleted records together into a
    list?
  • How to add newly deleted records to the available
    list?
  • How to find and remove records from the available
    list when space is reclaimed?
  • An available list of variable-length records
  • HEAD.FIRST_AVAILABLE: -1
  • 40 Ames|John|123 Maple|Stillwater|OK|74075|64
    Morrison|Sebastian|9035 South Hillcrest|Forest
    Village|OK|78420|45 Brown|Martha|625 Kimbark|Des
    Moines|IA|50311|
  • Delete the second record
  • HEAD.FIRST_AVAILABLE: 43
  • 40 Ames|John|123 Maple|Stillwater|OK|74075|64
    *| -1...........................................
    ...............................................45
    Brown|Martha|625 Kimbark|Des Moines|IA|50311|

13
Dynamic Space Reclaiming -- Variable-Length
Records (contd)
  • When inserting a new record, we need to search
    the available list for a deleted record that is
    large enough to hold it
  • The current available list
  • Insert a record of 55 bytes
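
Searching the available list for a large enough slot and unlinking it might be sketched like this. The list is modeled as a dict from byte offset to (slot size, next offset); in a real file the link would be stored inside the deleted record itself. The function name and data layout are assumptions, and the figure for this slide did not survive, so any values used with it are made up.

```python
NO_SLOT = -1  # end-of-list marker

def take_first_fit(avail: dict, head: int, needed: int):
    """Walk the list from head and unlink the first slot with size >= needed.

    avail maps byte offset -> (slot size, offset of next free slot).
    Returns (offset, new_head); offset is NO_SLOT if nothing is big enough.
    """
    prev, cur = NO_SLOT, head
    while cur != NO_SLOT:
        size, nxt = avail[cur]
        if size >= needed:
            if prev == NO_SLOT:
                head = nxt                         # removing the first node
            else:
                avail[prev] = (avail[prev][0], nxt)  # bypass the removed node
            del avail[cur]
            return cur, head
        prev, cur = cur, nxt
    return NO_SLOT, head
```

When the function returns `NO_SLOT`, the caller would append the new record at the end of the file instead.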

14
Storage Fragmentation
  • Internal fragmentation caused by fixed-length
    records
  • Ames|John|123 Maple|Stillwater|OK|74075|........
    ................
  • Morrison|Sebastian|9035 South Hillcrest|Forest
    Village|OK|78420|
  • Brown|Martha|625 Kimbark|Des Moines|IA|50311|...
    ................
  • Internal fragmentation caused by variable-length
    records
  • The inserted record is shorter than the deleted
    record
  • HEAD.FIRST_AVAILABLE: -1
  • 40 Ames|John|123 Maple|Stillwater|OK|74075|64
    Ham|Al|28 Elm|Ada|OK|70332|.....................
    ................45 Brown|Martha|625 Kimbark|Des
    Moines|IA|50311|
  • Reclaim the unused part of the deleted record
  • HEAD.FIRST_AVAILABLE: 43
  • 40 Ames|John|123 Maple|Stillwater|OK|74075|35
    *| -1..............................26 Ham|Al|28
    Elm|Ada|OK|70332|45 Brown|Martha|625 Kimbark|Des
    Moines|IA|50311|

15
Storage Fragmentation (contd)
  • External fragmentation: caused by continuing to
    insert records until some free space becomes too
    fragmented to be useful
  • Insert a record of 25 bytes
  • HEAD.FIRST_AVAILABLE: 43
  • 40 Ames|John|123 Maple|Stillwater|OK|74075|8
    *| -1.....25 Lee|Ed|Rt 2|Ada|OK|74820|26 Ham|Al|28
    Elm|Ada|OK|70332|45 Brown|Martha|625 Kimbark|Des
    Moines|IA|50311|
  • How to handle external fragmentation
  • Storage compaction: regenerate the file when
    external fragmentation becomes intolerable.
  • Coalescing the holes: combine two record slots on
    the available list if they are physically
    adjacent.
  • Placement strategy: adopt a placement strategy
    that minimizes fragmentation.
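
Coalescing could be sketched as below, assuming the free slots are available as (byte offset, size) pairs. Sorting by offset is a choice made for this sketch to make physical neighbors easy to spot; an available list ordered by size or by deletion time would need a search instead. `coalesce` is a hypothetical name.

```python
def coalesce(holes):
    """Merge free slots that are physically adjacent in the file.

    holes is an iterable of (offset, size) pairs; returns the merged list
    sorted by offset.
    """
    merged = []
    for offset, size in sorted(holes):
        if merged and merged[-1][0] + merged[-1][1] == offset:
            # The previous hole ends exactly where this one starts: join them.
            prev_offset, prev_size = merged[-1]
            merged[-1] = (prev_offset, prev_size + size)
        else:
            merged.append((offset, size))
    return merged
```

Two small holes that are each too short for a new record may become usable once merged, which is what makes coalescing a remedy for external fragmentation.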

16
Placement Strategies
  • First-fit placement strategy: search for the
    first available space that is large enough for
    the inserted record.
  • Least amount of work when we place a newly
    available space on the list.
  • Best-fit placement strategy: search for the
    smallest available space that is large enough
    for the inserted record.
  • Order the available list in ascending order by
    size, then use the first-fit placement strategy.
  • After inserting the new record, the free area
    left over may be too small to be useful, which
    may cause serious external fragmentation.
  • The small free slots accumulate at the beginning
    of the available list, making the search for the
    first-fit space increasingly long as time goes
    on.
  • Worst-fit placement strategy
  • Order the available list in descending order by
    size, then use the first-fit placement strategy.
  • Always insert the new record into the first slot.
    If the first slot is not large enough, the new
    record is inserted at the end of the file.
  • Decreases the chance of external fragmentation.
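
The three strategies differ only in which slot they select. The sketch below picks a slot from an unsorted list of free-slot sizes and returns its index. This gives the same choice as the sorted-list formulation on the slide, but it is an illustration, not the slides' implementation; the function names are hypothetical.

```python
def first_fit(sizes, needed):
    """Index of the first slot that is large enough, or None."""
    for i, size in enumerate(sizes):
        if size >= needed:
            return i
    return None

def best_fit(sizes, needed):
    """Index of the smallest slot that is large enough, or None."""
    candidates = [(size, i) for i, size in enumerate(sizes) if size >= needed]
    return min(candidates)[1] if candidates else None

def worst_fit(sizes, needed):
    """Index of the largest slot if it is large enough, otherwise None."""
    if not sizes:
        return None
    i = max(range(len(sizes)), key=sizes.__getitem__)
    return i if sizes[i] >= needed else None
```

A return value of None corresponds to appending the record at the end of the file. Keeping the list sorted, as the slide describes, moves the cost from insertion time to deletion time: best fit and worst fit then need only a first-fit scan or a look at the first slot.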

17
Binary Search
  • Search by guessing.
  • Use the RRN to jump around the file.
  • Searching a file of n records takes
  • at most ⌊log₂ n⌋ + 1 comparisons in the worst case,
  • about ⌊log₂ n⌋ + 1/2 comparisons on average.
  • Requirements
  • Works only for fixed-length records.
  • The records must be sorted on the search field
    (the key).
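
A binary search that jumps around by RRN might look like this. It is a sketch assuming 16-byte fixed-length records whose first 8 bytes hold a space-padded key, sorted in byte order; the layout and names are illustrative.

```python
import io

RECORD_SIZE = 16  # assumed fixed record length in bytes
KEY_SIZE = 8      # assumed: the key is the first 8 bytes, space-padded

def read_record(f, rrn: int) -> bytes:
    """Direct access: seek to rrn * RECORD_SIZE and read one record."""
    f.seek(rrn * RECORD_SIZE)
    return f.read(RECORD_SIZE)

def binary_search(f, n_records: int, key: bytes) -> int:
    """Return the RRN of the record with the given key, or -1 if absent."""
    target = key.ljust(KEY_SIZE)
    low, high = 0, n_records - 1
    while low <= high:
        mid = (low + high) // 2
        record_key = read_record(f, mid)[:KEY_SIZE]
        if record_key == target:
            return mid
        if record_key < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1
```

Every loop iteration costs one seek and one read, which is why both requirements matter: without fixed-length records the byte offset of record `mid` cannot be computed, and without sorted keys the halving step is meaningless.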

18
Sorting a Disk File in RAM
  • If the records are not in order, they must be
    sorted before we can use binary search.
  • Consider any internal sorting algorithm: bubble
    sort, quick sort, bucket sort, etc.
  • If applied directly to data stored on disk, they
    require many disk accesses (seeking, rotational
    delay) and multiple passes over the list, which
    is extremely slow.
  • If the entire file fits into RAM, load the
    entire contents of the file into RAM and perform
    the internal sort there.
  • The records can then be read sequentially.
  • Much faster if the file is stored sequentially.
  • This is an example of a general rule: minimize
    disk access cost by forcing disk accesses into a
    sequential mode and performing complex, direct
    accesses in memory.
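
The load, sort and write-back pattern can be sketched as follows, assuming the whole file fits in memory and the same hypothetical layout of 16-byte records with an 8-byte leading key.

```python
import io

RECORD_SIZE = 16  # assumed fixed record length in bytes
KEY_SIZE = 8      # assumed: the key is the first 8 bytes of each record

def sort_file_in_ram(src, dst) -> int:
    """Read everything sequentially, sort in memory, write sequentially.

    Returns the number of records sorted.
    """
    data = src.read()  # one sequential pass over the input
    records = [data[i:i + RECORD_SIZE] for i in range(0, len(data), RECORD_SIZE)]
    records.sort(key=lambda record: record[:KEY_SIZE])  # all comparisons in RAM
    dst.write(b"".join(records))  # one sequential pass over the output
    return len(records)
```

All the random access happens inside `records.sort`, in memory; the disk only sees two sequential transfers.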

19
Limitations of Binary Searching and Internal
Sorting
  • Binary searching requires more than one or two
    disk accesses
  • When accessing records by relative record number
    (RRN), we can retrieve a record with a single
    disk access.
  • Ideally, we would combine RRN retrieval (single
    access) with search by key (ease of use).
  • Keeping a file sorted is very expensive
  • If record insertion is as frequent as record
    search, it is expensive to keep records sorted.
  • The alternative is to keep records unsorted and
    use sequential search.
  • An internal sort works only on small files
  • It is not possible to read all records of a large
    file into main memory.
  • Load only the keys into main memory: keysorting.

20
Keysorting
  • Load only the record keys into RAM.
  • Each element of the KEYNODES array has two
    fields, KEY and RRN. There is a one-to-one
    correspondence between KEYNODES elements and
    records in the actual file.
  • The actual sorting process simply sorts the
    KEYNODES array by the KEY field.
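
A keysort can be sketched as below. KEYNODES is modeled as a list of (key, RRN) named tuples; the record layout of 16-byte records with an 8-byte leading key is an assumption, as are the function names.

```python
import io
from typing import NamedTuple

RECORD_SIZE = 16  # assumed fixed record length in bytes
KEY_SIZE = 8      # assumed: the key is the first 8 bytes of each record

class KeyNode(NamedTuple):
    key: bytes
    rrn: int

def build_keynodes(f, n_records: int):
    """One sequential pass: keep only each record's key and its RRN."""
    f.seek(0)
    nodes = []
    for rrn in range(n_records):
        record = f.read(RECORD_SIZE)
        nodes.append(KeyNode(record[:KEY_SIZE], rrn))
    return nodes

def keysort(f, n_records: int):
    """Sort the KEYNODES by key; the records themselves are not moved."""
    return sorted(build_keynodes(f, n_records))
```

After sorting, the RRN fields list the records in key order, so reading the records in that order yields the sorted file.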

21
Limitation of Keysorting
  • The keysort method requires two reads and one
    write for each record.
  • The first pass of reads can be done sequentially,
    sector by sector.
  • The second pass of reads cannot be done
    sequentially. It may require many random seeks
    for these reads.
  • Since the write operations interleave with the
    reads in the second pass, these writes also
    require separate seeks.
  • If only one copy of the records is kept on
    disk, it is not easy to create a sorted version
    of the file from the KEYNODES array.
  • Solution
  • Do not write the sorted file back to disk.
  • Write only the KEYNODES array back to disk, as
    the index file.
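
Writing the sorted KEYNODES out as an index file and using it for lookups might look like this. The on-disk entry format (an 8-byte key followed by a 4-byte little-endian RRN) and all names are assumptions made for this sketch.

```python
import bisect
import io
import struct

KEY_SIZE = 8                             # assumed key width in bytes
ENTRY = struct.Struct(f"<{KEY_SIZE}si")  # one index entry: key, RRN

def write_index(keynodes, index_file) -> None:
    """Write (key, RRN) pairs to the index file in key order."""
    for key, rrn in sorted(keynodes):
        index_file.write(ENTRY.pack(key, rrn))

def load_index(index_file):
    """Read the whole index back into memory as a list of (key, RRN) tuples."""
    index_file.seek(0)
    data = index_file.read()
    return [ENTRY.unpack_from(data, off) for off in range(0, len(data), ENTRY.size)]

def lookup(index, key: bytes) -> int:
    """Binary search the in-memory index; return the RRN, or -1 if absent."""
    target = key.ljust(KEY_SIZE)
    keys = [k for k, _ in index]
    pos = bisect.bisect_left(keys, target)
    if pos < len(index) and index[pos][0] == target:
        return index[pos][1]
    return -1
```

The data file stays in its original order. A lookup does its binary search on the in-memory index and then needs a single seek to `rrn * record size` in the data file, which is the combination of key search and RRN retrieval that the earlier slide called ideal.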