File Organization and Indexing

Provided by: imadr

1
Lecture 26 (11/16/05)
  • File Organization and Indexing

2
Announcements
  • Report 4 write up
  • Exam II guideline

4
  • Sectors: hard-coded on the disk surface and
    cannot be changed
  • Blocks (pages): set during disk formatting
  • The block size B is fixed for each system
    (B = 512 bytes to B = 4096 bytes)
  • Blocks are transferred between disk and main
    memory for processing

5
Disk Storage Devices
  • Preferred secondary storage device for high
    storage capacity and low cost
  • The number of bytes that can be stored
  • A disk pack contains several magnetic disk
    platters connected to a rotating spindle
  • Each platter usually has two surfaces
  • Disks are divided into concentric circular tracks
    on each disk surface
  • A disk can be double-sided or single-sided
  • Typical track capacities vary from 4 to 50 Kbytes

6
Disk Storage Devices
  • Because a track usually contains a large amount
    of information, it is divided into smaller
    sectors
  • The division of a track into sectors is
    hard-coded on the disk surface and cannot be
    changed
  • A track is divided into blocks (pages) during
    disk formatting
  • The block size B is fixed for each system
  • Typical block sizes range from B = 512 bytes to
    B = 4096 bytes
  • Whole blocks are transferred between disk and
    main memory for processing

7
Disk Storage Devices
  • To read/write a block from/to disk
  • A read-write head moves to the track that
    contains the block to be transferred (seek time)
  • Disk rotation moves the block under the
    read-write head for reading or writing
    (rotational delay, or latency)
  • The block is then transferred (transfer time)
  • A physical disk block (hardware) address consists
    of
  • a cylinder number (imaginary collection of tracks
    of same radius from all recorded surfaces)
  • track number or surface number (within the
    cylinder),
  • block number (within track)
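
The three cost components above can be added up to estimate the time for one block access. The following is a minimal sketch, not a model of any particular drive: the function name, its parameters, and the sample figures are all hypothetical, and it assumes the average rotational delay is half a rotation.

```python
def block_access_time_ms(seek_ms, rpm, transfer_mb_per_s, block_bytes):
    """Estimate one block access: seek time + rotational delay + transfer time."""
    # Assumption: on average the disk turns half a rotation before the
    # wanted block arrives under the read-write head.
    rotation_ms = 60_000 / rpm
    avg_rotational_delay_ms = rotation_ms / 2
    # Time to move the bytes of the block once it is under the head.
    transfer_ms = block_bytes / (transfer_mb_per_s * 1_000_000) * 1000
    return seek_ms + avg_rotational_delay_ms + transfer_ms


# Hypothetical drive: 8 ms seek, 7200 rpm, 100 MB/s, one 4096-byte block.
t = block_access_time_ms(seek_ms=8, rpm=7200, transfer_mb_per_s=100, block_bytes=4096)
print(f"{t:.2f} ms")
```

With figures like these, seek time and rotational delay dominate and the transfer itself is a small fraction, which is why file organizations try to minimize the number of block accesses.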

8
Disk Storage Devices
  • A buffer is a contiguous reserved area in main
    memory that holds one or more blocks of data
  • Reading or writing a disk block is time consuming
  • Read: disk → buffer
  • Write: buffer → disk
  • Read/write operations work on either one block or
    a cluster of blocks (must fit in buffer) at a time
  • Principle of locality of reference

9
O.S. Modules
10
Records
  • Data is stored in the form of records, each of
    which is a collection of fields
  • Records contain fields which have values of a
    particular type
  • A record type or record format is the collection
    of field names making up a record, along with
    their data types
  • Fields may be fixed length or variable length
  • E.g. VarChar(10)
  • Fixed- and variable-length records; a file has
    variable-length records if its records
  • Contain variable-length fields
  • Contain repeating groups (multi-valued
    attributes)
  • Contain optional fields (can be null)
  • Are of different record formats (mixed files)
  • E.g. placing the grades of a student next to
    the student's record
  • Usually a file contains records of a single
    record type (or relation)

11
Records
  • A system can easily identify and parse
    fixed-length records
  • Each record has the same size, with the same set
    of fields (and field lengths)
  • For variable-length records
  • with variable-length fields
  • Use a special delimiter or separator to terminate
    fields
  • Delimiter should not appear in fields
  • Or, store the length of the field in bytes
    immediately before the field value
  • with repeating groups
  • Need one separator for the values of the
    repeating group(s) and another for the fields
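
The two options for variable-length fields (a terminating delimiter, or a length stored before each value) can be sketched as follows. This is only an illustration: the separator byte, the 2-byte length prefix, and all function names are assumptions chosen for the example, not something the slides specify.

```python
import struct

SEP = b"\x1f"  # hypothetical separator byte; it must never occur inside a field


def encode_delimited(fields):
    """Terminate every field with the separator."""
    for f in fields:
        if SEP in f:
            raise ValueError("delimiter appears inside a field")
    return b"".join(f + SEP for f in fields)


def decode_delimited(record):
    # Every field ends with SEP, so the piece after the last SEP is empty.
    return record.split(SEP)[:-1]


def encode_length_prefixed(fields):
    """Store a 2-byte big-endian length in front of each field value."""
    return b"".join(struct.pack(">H", len(f)) + f for f in fields)


def decode_length_prefixed(record):
    fields, pos = [], 0
    while pos < len(record):
        (n,) = struct.unpack_from(">H", record, pos)  # read the length
        pos += 2
        fields.append(record[pos:pos + n])            # then the value itself
        pos += n
    return fields


fields = [b"Smith", b"", b"Computer Science"]
assert decode_delimited(encode_delimited(fields)) == fields
assert decode_length_prefixed(encode_length_prefixed(fields)) == fields
```

The length-prefixed form places no restriction on field contents, whereas the delimited form has to reject (or escape) any value that contains the delimiter.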

12
Records
  • with optional fields
  • If there are a lot of optional fields, store
    <field-name, field-value> pairs rather than field
    values only
  • Otherwise, use nulls
  • Need a separator for field-names and
    field-values, a second one for fields and a third
    one for records (why ?)
  • in mixed files
  • Each record must be preceded by a record type
    indicator
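
A small sketch of the <field-name, field-value> layout with its three separators may help. The separator characters and function names here are hypothetical, values are kept as plain strings, and the sketch assumes no separator character occurs inside a name or value.

```python
NAME_VALUE_SEP = "="   # hypothetical: separates a field name from its value
FIELD_SEP = ";"        # hypothetical: separates one field from the next
RECORD_SEP = "\n"      # hypothetical: terminates a record


def encode_record(record):
    """Store only the fields that are present, as name=value pairs."""
    pairs = [
        f"{name}{NAME_VALUE_SEP}{value}"
        for name, value in record.items()
        if value is not None
    ]
    return FIELD_SEP.join(pairs) + RECORD_SEP


def decode_file(text):
    """Split into records, then fields, then name/value pairs."""
    records = []
    for line in text.split(RECORD_SEP):
        if not line:
            continue  # skip the empty piece after the final record separator
        pairs = (pair.split(NAME_VALUE_SEP, 1) for pair in line.split(FIELD_SEP))
        records.append({name: value for name, value in pairs})
    return records


rows = [
    {"name": "Smith", "dept": None, "phone": "555-0100"},
    {"name": "Wong", "dept": "5", "phone": None},
]
text = "".join(encode_record(r) for r in rows)
print(decode_file(text))
```

Absent fields take no space at all in this layout, at the cost of repeating the field name in every record that does have the field.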

13
Variable length fields
Variable length and optional fields
14
Unspanned Block Organization for fixed-length
records
Spanned Block Organization for variable-length
records
  • Blocking factor (bfr): bfr = floor(B/R)
  • Spanned organization is used
  • (1) When utilizing empty space or
  • (2) If R > B

15
Blocking
  • Blocking factor (bfr) refers to the number of
    records per block
  • There may be empty space in a block if an
    integral number of records do not fit in one
    block
  • Record size = R and block size = B
  • bfr = floor(B/R) and the rest (B - bfr × R bytes)
    is empty space
  • Spanned blocking
  • Records can span a number of blocks
  • A pointer at the end of the first block points to
    the block containing the remainder of the record
    (needed if the blocks are not contiguous)
  • Used
  • When utilizing empty space or
  • If R > B
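
The blocking arithmetic above can be checked with a short calculation. This is a sketch under simplifying assumptions: fixed-length records, no block header, and no space charged for the pointers that spanned blocking needs. The function names and the sample sizes are made up for illustration.

```python
import math


def unspanned_layout(num_records, block_size, record_size):
    """Return (bfr, blocks needed, unused bytes per full block)."""
    bfr = block_size // record_size           # bfr = floor(B/R)
    if bfr == 0:
        raise ValueError("R > B: a record does not fit in one block")
    blocks = math.ceil(num_records / bfr)     # whole records only in each block
    unused = block_size - bfr * record_size   # B - bfr * R
    return bfr, blocks, unused


def spanned_blocks(num_records, block_size, record_size):
    """Blocks needed when records may cross block boundaries."""
    # Simplification: ignores the pointer stored at the end of a block.
    return math.ceil(num_records * record_size / block_size)


# Hypothetical file: 1000 records of 100 bytes in 512-byte blocks.
print(unspanned_layout(1000, 512, 100))
print(spanned_blocks(1000, 512, 100))
```

The `ValueError` branch corresponds to case (2) above: once R > B the blocking factor is zero, so unspanned blocking is not possible at all.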

16
Files of Records
  • File records can be unspanned (no record can
    span two blocks) or spanned (a record can be
    stored in more than one block)
  • The physical disk blocks that are allocated to
    hold the records of a file can be contiguous,
    linked, or indexed
  • In a file of fixed-length records, all records
    have the same format
  • Usually, unspanned blocking is used with such
    files
  • Files of variable-length records require
    additional information to be stored in each
    record, such as separator characters and field
    types
  • Usually spanned blocking is used with such files

17
File Headers
  • A file descriptor (or file header) includes
    information that describes the file, such as
  • Record format (Field names/Field order/Field data
    types)
  • Separator characters
  • The addresses of the file blocks on disk
  • To search for a record on disk, one or more
    blocks are copied into memory buffers
  • The buffers are then searched using information
    in the file header
  • What if address is not known?
  • The main goal of file organization is to locate
    the block that contains a desired record with a
    minimal number of block transfers

18
Heap Files
  • Also called unordered or pile files
  • No order is enforced on the records of the file
  • Insert operation
  • New records are inserted at the end of the file
  • Record insertion is quite efficient
  • Search operation
  • To search for a record, a linear search through
    the file records is necessary
  • This requires reading and searching half the file
    blocks on the average, and is hence quite
    expensive
  • Read_Ordered operation
  • Reading the records in order of a particular
    field requires sorting the file records
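
The insert and search costs described above can be seen in a toy in-memory model of a heap file. This is a sketch only: the class, its `bfr` parameter, and the record layout (dicts with an `id` field) are hypothetical, and Python lists stand in for disk blocks.

```python
class HeapFile:
    """Toy unordered file: a list of blocks, each holding up to bfr records."""

    def __init__(self, bfr):
        self.bfr = bfr
        self.blocks = [[]]

    def insert(self, record):
        # New records always go at the end of the file.
        if len(self.blocks[-1]) == self.bfr:
            self.blocks.append([])
        self.blocks[-1].append(record)

    def search(self, field, value):
        """Linear search; returns (record or None, number of blocks read)."""
        for blocks_read, block in enumerate(self.blocks, start=1):
            for record in block:
                if record[field] == value:
                    return record, blocks_read
        return None, len(self.blocks)


heap = HeapFile(bfr=2)
for i in range(10):
    heap.insert({"id": i})

# Average number of blocks read over all successful searches.
reads = [heap.search("id", i)[1] for i in range(10)]
print(sum(reads) / len(reads))
```

An unsuccessful search always reads every block, which is the worst case for this organization.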

19
Unordered Files
  • Delete operation
  • To delete a record we must first find it and then
    either delete it or mark it for deletion
  • The former causes wasted storage within the file
  • Both require periodic reorganization
  • Remove wasted storage or deleted records
  • Modify operation
  • If fixed-length, modify it in its current position
  • If variable-length, delete the old one and then
    reinsert the updated one

20
An Example Heap File
21
Sequential Files
  • Also called sorted or ordered files
  • File records are kept sorted by the values of an
    ordering field
  • Physically sorted
  • Insert operation
  • Insertion is expensive because records must be
    inserted in the correct order
  • It is common to keep a separate unordered
    overflow file for new records to improve
    insertion efficiency
  • This is periodically merged with the main ordered
    file
  • Delete operation
  • Expensive because of physical moving of the rest
    of the records
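
The overflow-file idea can be sketched in a few lines. This is an in-memory illustration with hypothetical names, using plain keys instead of full records: inserts are cheap appends to the overflow area, searches must check both areas, and a periodic reorganization merges the overflow into the sorted main file.

```python
import bisect


class OrderedFileWithOverflow:
    """Toy ordered file: sorted main area plus an unordered overflow area."""

    def __init__(self, keys=()):
        self.main = sorted(keys)
        self.overflow = []

    def insert(self, key):
        # Cheap: no records in the main file have to move.
        self.overflow.append(key)

    def search(self, key):
        # Binary search on the ordered main file first.
        i = bisect.bisect_left(self.main, key)
        if i < len(self.main) and self.main[i] == key:
            return True
        # Then a linear scan of the unordered overflow file.
        return key in self.overflow

    def reorganize(self):
        # Periodic merge of the overflow records into the main file.
        self.main = sorted(self.main + self.overflow)
        self.overflow = []


f = OrderedFileWithOverflow([10, 20, 30])
f.insert(25)
f.insert(5)
print(f.search(25), f.main, f.overflow)
f.reorganize()
print(f.main, f.overflow)
```

The trade-off is that searches slow down as the overflow area grows, so the merge interval has to balance insert cost against search cost.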

23
Ordered Files
  • Search operation
  • A binary search can be used to search for a
    record on its ordering field value
  • This requires reading and searching about
    log2(b) of the b file blocks on average, an
    improvement over linear search
  • Search by a non-ordering field is expensive
  • Read_Ordered operation
  • Reading the records in order of the ordering
    field is quite efficient
  • Reading the records in order of a non-ordering
    field requires sorting the file records
  • Modify operation
  • Non-ordering attribute: modify in place
  • Ordering attribute: delete and then reinsert the
    record in its new correct position
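
Binary search on an ordered file works block by block rather than record by record. The following is a sketch assuming non-empty blocks of sorted keys, with lists standing in for disk blocks; the function name and the test data are hypothetical.

```python
def binary_search_blocks(blocks, target):
    """Search an ordered file; returns (found, number of blocks read)."""
    lo, hi = 0, len(blocks) - 1
    blocks_read = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]          # one block transfer
        blocks_read += 1
        if target < block[0]:        # target lies before this block
            hi = mid - 1
        elif target > block[-1]:     # target lies after this block
            lo = mid + 1
        else:                        # target can only be in this block
            return target in block, blocks_read
    return False, blocks_read


# Hypothetical file: keys 0..63 stored 4 per block, so b = 16 blocks.
blocks = [list(range(i, i + 4)) for i in range(0, 64, 4)]
print(binary_search_blocks(blocks, 0))
print(max(binary_search_blocks(blocks, k)[1] for k in range(64)))
```

With 16 blocks no search reads more than 5 of them, compared with 8 on average for a linear scan of the same file.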

24
Indexed Files
  • A single-level index is an auxiliary file that
    makes it more efficient to search for a record in
    the data file
  • CREATE INDEX SQL command
  • CREATE INDEX part_of_name ON customer (name(10))
  • i.e., using the first 10 characters of the name
    column
  • E.g. ISAM, or MyISAM in MySQL (on the primary key)
  • The index is usually specified on one field of
    the file
  • although it could be specified on several fields
  • Usually, an index is a file of entries
  • <field value, pointer to record>
  • ordered by the field value
  • The index is called an access path on the field
  • Provides another (fast) access mechanism to the
    file
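
An index of <field value, pointer to record> entries ordered by field value can be sketched with a sorted list and the standard `bisect` module. This is an in-memory illustration only: record positions in a Python list stand in for disk pointers, and the function names and sample data are hypothetical.

```python
import bisect


def build_index(records, field):
    """One <field value, pointer> entry per record, ordered by field value."""
    return sorted((rec[field], pos) for pos, rec in enumerate(records))


def lookup(index, value):
    """Binary search the index; return the pointers of all matching records."""
    # (value,) sorts before every (value, pointer) entry with the same value,
    # so this finds the first entry whose field value is >= value.
    i = bisect.bisect_left(index, (value,))
    pointers = []
    while i < len(index) and index[i][0] == value:
        pointers.append(index[i][1])
        i += 1
    return pointers


customers = [{"name": "Wong"}, {"name": "Amir"}, {"name": "Smith"}, {"name": "Amir"}]
index = build_index(customers, "name")
print(index)
print(lookup(index, "Amir"))
```

The data file itself stays in its original order; only the much smaller index is kept sorted.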

25
Indexes as Access Paths
  • The index file usually occupies considerably
    fewer disk blocks than the data file because its
    entries are much smaller
  • Sometimes there are also fewer entries (sparse
    indexes)
  • A binary search on the index yields a pointer to
    the file record (all indexes are sorted)
  • Indexes can also be characterized as dense or
    sparse
  • A dense index has an index entry for every search
    key value (and hence every record) in the data
    file
  • A sparse (or nondense) index, on the other hand,
    has index entries for only some of the search
    values

26
Indexes as Access Paths
  • Primary Index
  • Defined on an ordered data file (by a key field)
  • i.e., no duplicates allowed for ordering field
  • Index provides faster access using the ordering
    field
  • Includes one index entry for each block in the
    data file
  • the index entry has the key field value for the
    first record (usually) in the block, (called
    block anchor)
  • A primary index is a nondense (sparse) index,
    since it includes one entry per disk block of
    the data file (the key of the block's anchor
    record) rather than one for every search value
    in every record
  • One primary index per file
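
A primary index lookup can be sketched as follows, assuming the anchor of each block is the key of its first record and that blocks are non-empty. Lists stand in for disk blocks, block numbers stand in for block pointers, and the function names and keys are hypothetical.

```python
import bisect


def build_primary_index(blocks):
    """One entry per block: the key of its first record (the block anchor)."""
    # The position of an anchor in this list doubles as the block pointer.
    return [block[0] for block in blocks]


def find_block(blocks, anchors, key):
    """Return the number of the block holding key, or None if it is absent."""
    # The candidate block is the last one whose anchor is <= key.
    i = bisect.bisect_right(anchors, key) - 1
    if i < 0:
        return None                      # key is smaller than every anchor
    return i if key in blocks[i] else None


# Hypothetical ordered data file with three records per block.
blocks = [[2, 5, 8], [11, 14, 17], [20, 23, 26]]
anchors = build_primary_index(blocks)
print(anchors)
print(find_block(blocks, anchors, 14))
```

Only one data block is read per lookup: the binary search runs over the anchors, and the index has as many entries as the file has blocks rather than records.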

27
Search for Amir, John
28
Indexes as Access Paths
  • Clustering Index
  • Defined on a data file ordered by a non-key
    field, called the clustering field
  • Unlike a primary index, which requires that the
    ordering field of the data file have a distinct
    value for each record, the clustering field may
    repeat across records
  • Index provides faster access using the ordering
    field
  • Includes one index entry for each distinct value
    of the field
  • the index entry points to the first data block
    that contains records with that field value
  • Dense or sparse?
  • One clustering index per file
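
A clustering index lookup can be sketched in the same style. This illustration assumes non-empty blocks and uses a Python dict as a stand-in for the ordered index entries; the function names, the `dept` field, and the sample data are hypothetical.

```python
def build_clustering_index(blocks, field):
    """One entry per distinct value: the first block that contains it."""
    index = {}
    for block_no, block in enumerate(blocks):
        for record in block:
            index.setdefault(record[field], block_no)  # keep the first block only
    return index


def fetch(blocks, index, field, value):
    """Return every record with the given clustering field value."""
    if value not in index:
        return []
    matches = []
    for block in blocks[index[value]:]:
        matches.extend(r for r in block if r[field] == value)
        # The file is ordered, so once a block ends with a different value
        # no later block can contain the value we are looking for.
        if block[-1][field] != value:
            break
    return matches


# Hypothetical file ordered by department number, three records per block.
depts = [1, 1, 2, 2, 2, 3, 3, 5, 5, 5, 7, 7]
records = [{"dept": d, "emp": n} for n, d in enumerate(depts)]
blocks = [records[i:i + 3] for i in range(0, len(records), 3)]
index = build_clustering_index(blocks, "dept")
print(index)
print([r["emp"] for r in fetch(blocks, index, "dept", 5)])
```

Because records with the same value are stored together, the lookup reads consecutive blocks starting from the one the index points to.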

29
Search for dept. 5 and then 7