Fundamental File Structure Concepts - PowerPoint PPT Presentation

Author: Jim Skon. Last modified by: James Skon. Created: 2/25/1998 12:28:48 PM.

Slides: 63
Provided by: JimS57
1
Fundamental File Structure Concepts
  • Chapter 4

2
Record and Field Structure
  • A record is a collection of fields.
  • A field is used to store information about some
    attribute.
  • The question: when we write records, how do we
    organize the fields in the records
  • so that the information can be recovered
  • so that we save space
  • so that we can process efficiently
  • to maximize record structure flexibility

3
Field Structure issues
  • What if
  • Field values vary greatly
  • Fields are optional

4
Field Delineation methods
  • Fixed length fields
  • Include length with field
  • Separate fields with a delimiter
  • Include keyword expression to identify each field

5
Fixed length fields
  • Easy to implement - use language record
    structures (no parsing)
  • Fields must be declared at maximum length needed

Field:  last  first  address  city  state  zip
Width:  10    10     15       15    2      9
Example: Yeakus, Bill, 123 Pine, Utica, OH, 43050 (each value padded to its field width)
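The fixed-length layout above can be sketched as a plain struct. This is only an illustration: the names FixedPerson and setField are hypothetical, and padding with blanks is an assumption about how short values are stored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Field widths follow the slide's layout: 10, 10, 15, 15, 2, 9 bytes.
// FixedPerson and setField are hypothetical names used for illustration.
struct FixedPerson {
    char last[10];
    char first[10];
    char address[15];
    char city[15];
    char state[2];
    char zip[9];
};

// Copy `value` into a fixed-width field and pad the rest with blanks
// (blank padding is an assumption). Values longer than the field are
// truncated, which is the cost of declaring a maximum length.
void setField(char* field, std::size_t width, const char* value) {
    assert(field != nullptr && value != nullptr);
    std::size_t len = std::strlen(value);
    if (len > width) len = width;
    std::memcpy(field, value, len);
    std::memset(field + len, ' ', width - len);
}
```

Because every member is a char array, the struct occupies exactly the sum of the field widths, so a whole record can be read or written in one call with no parsing.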
6
Include length with field
  • Begin field with length indicator
  • If maximum field length < 256, a byte can be used
    for length

Fields: last, first, address, city, state, zip, each preceded by a length byte
Example values: Yeakus, Bill, 123 Pine, ...
Bytes (hex): 06 59 65 61 6B 75 73 04 42 69 6C 6C 08 31 32 33 20 50 69 6E 65 ...
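A minimal sketch of the length-byte scheme, assuming fields are held in a std::string buffer. The function names appendLengthPrefixed and readLengthPrefixed are hypothetical and do not come from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Append one field as a 1-byte length followed by the field's bytes.
// Returns false (appending nothing) if the length does not fit in a byte.
bool appendLengthPrefixed(std::string& out, const std::string& field) {
    if (field.size() > 255) return false;
    out.push_back(static_cast<char>(static_cast<unsigned char>(field.size())));
    out += field;
    return true;
}

// Read the field that starts at `pos` and advance `pos` past it.
// Returns false if the buffer ends before the field does.
bool readLengthPrefixed(const std::string& in, std::size_t& pos, std::string& field) {
    if (pos >= in.size()) return false;
    std::size_t len = static_cast<unsigned char>(in[pos]);
    if (pos + 1 + len > in.size()) return false;
    field = in.substr(pos + 1, len);
    assert(field.size() == len);
    pos += 1 + len;
    return true;
}
```

Packing "Yeakus" and then "Bill" this way produces the same leading bytes as the hex dump above: 06 followed by six characters, then 04 followed by four.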
7
Separate fields with a delimiter
  • Use a special character not used in data
  • space, comma, tab
  • Also special ASCII control characters, e.g. FS (file
    separator), hex 1C
  • Here we use
  • Also need an end-of-record delimiter

YeakusBill123 PineUticaOH43050
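Parsing delimited fields might look like the sketch below. The function name splitFields is hypothetical, and the '|' used in the usage note is an assumed delimiter chosen for illustration, since the slide's own delimiter character is not visible in this transcript.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split one record into fields on `delim`. Each field is expected to end
// with the delimiter; two adjacent delimiters yield an empty field.
// Text after the last delimiter, if any, is kept as a final field.
std::vector<std::string> splitFields(const std::string& record, char delim) {
    std::vector<std::string> fields;
    std::string current;
    for (char c : record) {
        if (c == delim) {
            fields.push_back(current);
            current.clear();
        } else {
            current.push_back(c);
        }
    }
    if (!current.empty()) fields.push_back(current);
    return fields;
}
```

With '|' as the delimiter, splitFields("Yeakus|Bill|123 Pine|Utica|OH|43050|", '|') yields six fields, and the embedded space in "123 Pine" survives because a space is not the delimiter.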
8
Include keyword expression
  • Keywords label each field
  • A self-describing structure
  • Allows LOTS of flexibility
  • Uses lots of space

LASTYeakusFIRSTBillADDRESS123
Pine CITYUticaSTATEOHZIP43050
9
Optional Fields
  • Fixed length: leave the field blank
  • Field length: use a zero-length field
  • Delimiter: use adjacent delimiters
  • Keywords: just leave the field out

10
Reading a stream of fields
  • Need to break record into fields
  • Fixed length can simply be read into record
    structure
  • Others must be parsed with a parse algorithm

11
Record Structures
  • How do we organize records in a file?
  • Records can be fixed length or variable length
  • Fixed length allows simple direct access lookup
  • Fixed may waste space
  • Variable - how do we find a record's position?

12
Record Structures
  • Fixed Length Records
  • Fixed number of fields in records
  • Variable length
  • prefix each record with a length
  • Use a second file to keep track of record start
    positions
  • Place delimiter between records

13
Fixed Length Records
  • All records same length
  • Record positions can be calculated for direct
    access reads.
  • Does not imply that the sizes or number of
    fields are fixed.
  • Variable length fields inside a fixed length
    record lead to unused space.

14
Fixed number of fields in records
  • Field size could be fixed or variable
  • Fixed
  • results in fixed size records
  • simply read directly into struct
  • Variable sized fields
  • delimited or field lengths
  • Simply count fields while parsing

15
Variable length Records
  • prefix each record with a length
  • Use a second file to keep track of record start
    positions
  • Place delimiter between records

16
Prefix records with a length
  • Allows true variable length records
  • Form of prefix
  • Character number (fixed length)
  • Binary number (write integer without conversion)
  • Must consider Maximum length
  • No direct access (great for sequential access)
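A sketch of the binary-prefix variant, assuming a 16-bit length written without conversion in the machine's own byte order. The names writeRecord and readRecord are hypothetical, and the 16-bit prefix is an assumption that caps a record at 65535 bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Write a record preceded by its length as a raw 16-bit binary integer.
// The prefix width fixes the maximum record length.
bool writeRecord(std::ostream& out, const std::string& rec) {
    if (rec.size() > UINT16_MAX) return false;
    std::uint16_t len = static_cast<std::uint16_t>(rec.size());
    out.write(reinterpret_cast<const char*>(&len), sizeof len);
    out.write(rec.data(), len);
    return out.good();
}

// Read the next record; returns false at end of file or on a short read.
bool readRecord(std::istream& in, std::string& rec) {
    std::uint16_t len = 0;
    if (!in.read(reinterpret_cast<char*>(&len), sizeof len)) return false;
    rec.resize(len);
    if (len > 0 && !in.read(&rec[0], len)) return false;
    return true;
}
```

Reading is strictly sequential: to reach the third record, the first two prefixes must be read so their records can be skipped.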

17
Index of record start addresses
  • A second file is simply a list of offsets to
    successive records
  • Since the offsets are fixed length, this file
    allows direct access, thereby allowing direct
    access to the main file.
  • Problem
  • Maintaining file (adding and deleting records)
  • Cost of index

18
Place delimiter between records
  • Special character not used in record
  • Allows efficient variable size
  • No direct access
  • Bible files - use \n as delimiter

19
Binary data in files
  • Binary reals and integers can be written, and
    read, from a file
  • Need to know byte size of variables used.
  • The sizeof operator returns the size of a type or
    variable in bytes

20
Binary data in files
  • int rsize;
  • char rec_buf[MAX];
  • ...
  • strcpy(rec_buf, "this is a test record");
  • rsize = strlen(rec_buf);
  • write(my_fd, &rsize, sizeof(int)); // write the size
  • write(my_fd, rec_buf, rsize); // write the record
  • ...
  • read(my_fd, &rsize, sizeof(int)); // read the size
  • read(my_fd, rec_buf, rsize); // read the record

21
Viewing Binary file data
  • Use the file dump utility (od - octal dump)
  • od -xc <filename>
  • x - hex output
  • c - character output
  • Useful for viewing what is actually in file

22
Using Classes to Manipulate Buffer
  • Three Classes
  • delimited fields
  • Length-based fields
  • Fixed length fields

23
Class for Delimited fields
  • Consider a class to manage delimited text buffers
  • Allows reading and writing of delimited records
  • Allows packing and unpacking

24
Class for Delimited fields
  • class Person {
  • public:
  • // fields
  • char LastName [11];
  • char FirstName [11];
  • char Address [16];
  • char City [16];
  • char State [3];
  • char ZipCode [10];
  • // Methods next ...

25
Class for Delimited fields
  • class DelimTextBuffer {
  • public:
  • DelimTextBuffer (char Delim = '', int maxBytes = 1000);
  • int Read (istream &);
  • int Write (ostream &) const;
  • int Pack (const char *, int size = -1);
  • int Unpack (char *);
  • private:
  • char Delim;
  • char DelimStr [2]; // zero terminated string for Delim
  • char * Buffer; // character array to hold field values
  • int BufferSize; // size of packed fields
  • int MaxBytes; // maximum number of characters in the buffer
  • int NextByte; // packing/unpacking position in buffer
  • };

26
Class for Delimited fields
  • Packing a buffer
  • Person Bill_Yeakus;
  • DelimTextBuffer buffer;
  • buffer.Pack(Bill_Yeakus.LastName);
  • buffer.Pack(Bill_Yeakus.FirstName);
  • buffer.Pack(Bill_Yeakus.ZipCode);
  • buffer.Write(stream);

27
Class for Delimited fields
  • int DelimTextBuffer::Pack (const char * str, int size)
  • // set the value of the next field of the buffer
  • // if size == -1 (default), use strlen(str) as the
    length of the field
  • {
  • short len; // length of string to be packed
  • if (size >= 0) len = size;
  • else len = strlen (str);
  • if (len > strlen(str)) // str is too short!
  • return FALSE;
  • int start = NextByte; // first character to be packed
  • NextByte += len + 1;
  • if (NextByte > MaxBytes) return FALSE;
  • memcpy (&Buffer[start], str, len);
  • Buffer [start+len] = Delim; // add delimiter
  • BufferSize = NextByte;
  • return TRUE;
  • }

28
Class for Delimited fields
  • int DelimTextBuffer::Write (ostream & stream) const
  • {
  • stream . write ((char *)&BufferSize, sizeof(BufferSize));
  • stream . write (Buffer, BufferSize);
  • return stream . good ();
  • }

29
Class for Delimited fields
  • int DelimTextBuffer::Read (istream & stream)
  • {
  • Clear ();
  • stream . read ((char *)&BufferSize, sizeof(BufferSize));
  • if (stream.fail()) return FALSE;
  • if (BufferSize > MaxBytes) return FALSE; // buffer overflow
  • stream . read (Buffer, BufferSize);
  • return stream . good ();
  • }

30
Class for Delimited fields
  • int DelimTextBuffer::Unpack (char * str)
  • // extract the value of the next field of the buffer
  • {
  • int len = -1; // length of packed string
  • int start = NextByte; // first character to be unpacked
  • for (int i = start; i < BufferSize; i++)
  • if (Buffer[i] == Delim)
  • { len = i - start; break; }
  • if (len == -1) return FALSE; // delimiter not found
  • NextByte += len + 1;
  • if (NextByte > BufferSize) return FALSE;
  • strncpy (str, &Buffer[start], len);
  • str [len] = 0; // zero termination for string
  • return TRUE;
  • }

31
Class for Delimited fields
  • Class Person can be extended to provide
    specialized packing functions

32
Class for Delimited fields
  • int Person::Pack (DelimTextBuffer & Buffer) const
  • // pack the fields into a DelimTextBuffer,
    return TRUE if all succeed, FALSE o/w
  • {
  • int result;
  • Buffer . Clear ();
  • result = Buffer . Pack (LastName);
  • result = result && Buffer . Pack (FirstName);
  • result = result && Buffer . Pack (Address);
  • result = result && Buffer . Pack (City);
  • result = result && Buffer . Pack (State);
  • result = result && Buffer . Pack (ZipCode);
  • return result;
  • }

33
Class for Delimited fields
  • int Person::Unpack (DelimTextBuffer & Buffer)
  • {
  • int result;
  • result = Buffer . Unpack (LastName);
  • result = result && Buffer . Unpack (FirstName);
  • result = result && Buffer . Unpack (Address);
  • result = result && Buffer . Unpack (City);
  • result = result && Buffer . Unpack (State);
  • result = result && Buffer . Unpack (ZipCode);
  • return result;
  • }

34
Class for Fixed Length fields
  • int FixedTextBuffer::AddField (int fieldSize)
  • {
  • if (NumFields == MaxFields) return FALSE;
  • if (BufferSize + fieldSize > MaxChars) return FALSE;
  • FieldSize[NumFields] = fieldSize;
  • NumFields ++;
  • BufferSize += fieldSize;
  • return TRUE;
  • }

35
Class for Fixed Length fields
  • int FixedTextBuffer::Read (istream & stream)
  • {
  • stream . read (Buffer, BufferSize);
  • return stream . good ();
  • }

36
Class for Fixed Length fields
  • int FixedTextBuffer::Write (ostream & stream)
  • {
  • stream . write (Buffer, BufferSize);
  • return stream . good ();
  • }

37
Class for Fixed Length fields
  • int FixedTextBuffer::Pack (const char * str)
  • // set the value of the next field of the buffer
  • {
  • if (NextField == NumFields || !Packing) // buffer is
    full or not packing mode
  • return FALSE;
  • int len = strlen (str);
  • int start = NextCharacter; // first byte to be packed
  • int packSize = FieldSize[NextField]; // number of
    bytes to be packed
  • strncpy (&Buffer[start], str, packSize);
  • NextCharacter += packSize;
  • NextField ++;
  • // if len < packSize, pad with blanks
  • for (int i = start + len; i < NextCharacter; i ++)
  • Buffer[i] = ' ';
  • Buffer [NextCharacter] = 0; // make buffer look
    like a string
  • if (NextField == NumFields) // buffer is full
  • { Packing = FALSE;
  • NextField = NextCharacter = 0; }
  • return TRUE;
  • }

38
Class for Fixed Length fields
  • int FixedTextBuffer::Unpack (char * str)
  • // extract the value of the next field of the buffer
  • {
  • if (NextField == NumFields || Packing) // buffer
    is full or not unpacking mode
  • return FALSE;
  • int start = NextCharacter; // first byte to be unpacked
  • int packSize = FieldSize[NextField]; // number of
    bytes to be unpacked
  • strncpy (str, &Buffer[start], packSize);
  • str [packSize] = 0; // terminate string with zero
  • NextCharacter += packSize;
  • NextField ++;
  • if (NextField == NumFields) Clear (); // all
    fields unpacked
  • return TRUE;
  • }

39
Class for Fixed Length fields
  • void FixedTextBuffer::Print (ostream & stream)
  • {
  • stream << "Buffer has max fields " << MaxFields
    << " and actual " << NumFields << endl
  • << "max bytes " << MaxChars << " and Buffer Size "
    << BufferSize << endl;
  • for (int i = 0; i < NumFields; i++)
  • stream << "\tfield " << i << " size "
    << FieldSize[i] << endl;
  • if (Packing) stream << "\tPacking\n";
  • else stream << "\tnot Packing\n";
  • stream << "Contents '" << Buffer << "'" << endl;
  • }

40
Class for Fixed Length fields
  • class FixedTextBuffer {
  • public:
  • FixedTextBuffer (int maxFields, int maxChars = 1000);
  • int AddField (int fieldSize);
  • int Read (istream &);
  • int Write (ostream &);
  • int Pack (const char *);
  • int Unpack (char *);
  • private:
  • char * Buffer; // character array to hold field values
  • int BufferSize; // sum of the sizes of declared fields
  • int * FieldSize; // array to hold field sizes
  • int MaxChars; // maximum number of characters in the buffer
  • int NextCharacter; // packing/unpacking position in buffer
  • };

41
Class for Fixed Length fields
  • int Person::Pack (FixedTextBuffer & Buffer) const
  • // pack the fields into a FixedTextBuffer,
    return TRUE if all succeed, FALSE o/w
  • {
  • int result;
  • Buffer . Clear ();
  • result = Buffer . Pack (LastName);
  • result = result && Buffer . Pack (FirstName);
  • result = result && Buffer . Pack (Address);
  • result = result && Buffer . Pack (City);
  • result = result && Buffer . Pack (State);
  • result = result && Buffer . Pack (ZipCode);
  • return result;
  • }

42
Class for Fixed Length fields
  • int Person::Unpack (FixedTextBuffer & Buffer)
  • {
  • Clear ();
  • int result;
  • result = Buffer . Unpack (LastName);
  • result = result && Buffer . Unpack (FirstName);
  • result = result && Buffer . Unpack (Address);
  • result = result && Buffer . Unpack (City);
  • result = result && Buffer . Unpack (State);
  • result = result && Buffer . Unpack (ZipCode);
  • return result;
  • }

43
Record Access - Keys
  • Attribute used to identify records
  • Often used to find records
  • Standard or canonical form
  • rules which keys must conform to
  • prevents missing record because key in different
    form
  • Example
  • all capitals
  • Phone in form (nnn) nnn-nnnn

44
Record Access - Keys
  • Keys can be distinct - uniquely identify records
  • Primary keys
  • one-to-one relationship between key value and
    possible entities represented
  • SSN, Student ID
  • Keys can identify a collection of records
  • Secondary keys
  • one-to-many relationship
  • City, position, department

45
Record Access - Keys
  • Primary key desired characteristics
  • unique among collection of entities
  • dataless - what if some entities have no value
    of this type (e.g. SSN)?
  • unchanging

46
Record access
  • Performance of access method
  • how do we compare techniques?
  • Must be careful what events we count.
  • big-oh notation gives us a way to factor out
    all but the most significant factors

47
Record Access - timing
  • Sequential searching
  • Consider file of 4000 records
  • What if no blocking done, and one record per
    block? (500 bytes records, 512 byte blocks)
  • What if cluster size set to 8?
  • always requires O(n), but search is faster by a
    constant factor
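The arithmetic behind these questions can be sketched as below, assuming that a cluster of 8 blocks is fetched in one read and that each block holds one record. The function name readsForFullScan is hypothetical.

```cpp
#include <cassert>

// Number of disk reads needed to scan the whole file when each read
// brings in `recordsPerRead` records (rounded up for a partial last read).
long readsForFullScan(long recordCount, long recordsPerRead) {
    assert(recordCount >= 0 && recordsPerRead > 0);
    return (recordCount + recordsPerRead - 1) / recordsPerRead;
}
```

Under those assumptions, 4000 records at one record per read cost 4000 reads in the worst case and about 2000 on average for a successful search. With 8 records per read the worst case drops to 500 and the average to about 250. The cost still grows linearly with the file size; only the constant changes.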

48
Sequential searching
  • Usually NOT the best method
  • Sometimes it is best
  • Searching for some ASCII pattern (grep)
  • Small files
  • Files rarely searched
  • Searching on secondary key, and a large
    percentage of records match (say 25%)

49
Unix Tools for sequential file processing
  • cat - display a file
  • wc - count lines, words, and characters
  • grep - find lines in file(s) which match regular
    expression.

50
Direct Access
  • Move directly to record without scanning
    preceding data
  • Different languages/OSs support different
    models
  • Byte offset model
  • Programmer must specify offset to record, and
    record size to read.
  • Supports variable size records, skip sequential
    processing
  • Relative Record Number (RRN) model
  • File has a fixed record size (declared at
    creation time)
  • Records are specified by a record number
  • File modeled as a collection of components
  • Higher level of abstraction
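In a byte-offset language, the RRN model can be layered on top by multiplying the record number by the record size. The sketch below assumes records are numbered from 0 and that the file has no header; the name readByRrn is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Read record number `rrn` (counting from 0) from a stream of
// fixed-length records by seeking straight to rrn * recordSize.
bool readByRrn(std::istream& in, long rrn, std::size_t recordSize, std::string& rec) {
    assert(rrn >= 0 && recordSize > 0);
    in.clear();  // forget any earlier end-of-file or failed read
    in.seekg(static_cast<std::streamoff>(rrn) * static_cast<std::streamoff>(recordSize));
    rec.resize(recordSize);
    in.read(&rec[0], static_cast<std::streamsize>(recordSize));
    return in.gcount() == static_cast<std::streamsize>(recordSize);
}
```

No preceding record is touched: the cost of fetching record 1000 is the same as fetching record 0.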

51
Direct Access
  • Different language support
  • RRN support
  • PL/I
  • COBOL
  • Pascal (files are modeled as a collection of
    components (records))
  • FORTRAN
  • Byte offset
  • C

52
Choosing Record Sizes for Direct Access
  • Fixed Length Fields
  • Very easy to parse records - just read into
    record structure!
  • Each field must be maximum length needed!
  • Thus the record must be as long as the sum of the
    maximum field lengths

Field:  last  first  address  city  state  zip
Width:  10    10     15       15    2      9
Example: Yeakus, Bill, 123 Pine, Utica, OH, 43050 (each value padded to its field width)
53
Choosing Record Sizes for Direct Access
  • Variable length fields
  • Each field can be any length
  • since some can be long, others short, overall
    record size may be shorter.
  • This gives more flexibility to fields length
  • Records must be parsed, space wasted for
    delimiter or length bytes.

Example (two records stored back to back): Yeakus, Bill, 123 Pine, Utica, OH, 43050 and Snivenloppinsky, Helmut, 12232 Galmentary Avenue, Spotsdale, NY, 11232
54
Header Records
  • The first record in a direct file may be used to
    store special information
  • Number of records used.
  • Location of first record in key order sequence.
  • Location of first empty record
  • File record structure (meta-data)
  • In languages with the RRN model (e.g. Pascal), the
    variant record facility must be used
  • In C, the header record can be of different size
    from the rest of the file records.

55
Header Records
  • Consider update.c in the text.
  • Header record contains a 2-byte record count.
  • Header size is 32, record size is 64

static struct { short rec_count; char fill[30]; } head;
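With a 32-byte header in front of 64-byte records, every record offset shifts by the header size. The sketch below mirrors the slide's struct but uses std::int16_t to pin the count to 2 bytes, which is an assumption about the intended layout. The names Header, kHeaderSize, kRecordSize and recordOffset are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// A 32-byte header: a 2-byte record count plus 30 bytes of filler.
struct Header {
    std::int16_t rec_count;
    char fill[30];
};
static_assert(sizeof(Header) == 32, "header must occupy exactly 32 bytes");

const long kHeaderSize = 32;  // header size stated on the slide
const long kRecordSize = 64;  // record size stated on the slide

// Byte offset of record `rrn` (counting from 0), skipping the header.
long recordOffset(long rrn) {
    assert(rrn >= 0);
    return kHeaderSize + rrn * kRecordSize;
}
```

Record 0 therefore starts at byte 32 and record 1 at byte 96.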
56
Header Records
  • Must be written when file created
  • Must be rewritten when file changed
  • Must be read when file is opened

57
File Access and Organization
  • File Organization
  • Variable Length Records
  • Fixed Length Records
  • Field Structures (size bytes, delimiters, fixed)
  • File Access
  • Sequential access
  • Direct access
  • Indexed access

58
File Access and Organization
  • Interaction between organization and access
  • Can the file be divided into fields?
  • Is there a higher level of organization to the
    file (metadata)?
  • Do all records have to have the same number of
    fields, bytes?
  • How do we distinguish one record from the next?
  • How do we recognize if a fixed length record
    holds real data or not?

59
File Access and Organization
  • There is often a trade-off between space and
    time
  • Fixed length records - allow direct access, waste
    space
  • Variable length records - require sequential search
  • We also must consider the typical use of the file
    - what are the desired access patterns
  • Selection of a particular organization has
    implications on the allowable types of access

60
Portability and Standardization
  • Differences among Languages
  • Fixed sized records versus byte addressable
    access
  • Differences among Machine Architectures
  • Byte order of binary data
  • May be high order or low order byte first

61
Byte order of binary data
  • High order first (Big Endian)
  • A long int say 45 is stored in memory.
  • It is stored as 00 00 00 2D
  • Suns, Network protocols
  • Low order first (Little Endian)
  • A long int say 45 is stored in memory.
  • It is stored as 2D 00 00 00
  • PCs, VAXs

62
Byte order of binary data
  • If binary data is written to a file, it is
    written in the order stored in memory
  • If the data is later read by a system with a
    different ordering, the number will be incorrect!
  • For the sake of portability, files should be
    written in an agreed upon format (probably Big
    Endian)
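One way to write an integer in an agreed-upon order is to emit its bytes explicitly rather than copying memory. This is a sketch assuming a 32-bit value and big-endian file order; the names putBigEndian32 and getBigEndian32 are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Store `value` into out[0..3] high-order byte first (big endian),
// regardless of the byte order of the machine running the code.
void putBigEndian32(std::uint32_t value, unsigned char out[4]) {
    assert(out != nullptr);
    out[0] = static_cast<unsigned char>(value >> 24);
    out[1] = static_cast<unsigned char>(value >> 16);
    out[2] = static_cast<unsigned char>(value >> 8);
    out[3] = static_cast<unsigned char>(value);
}

// Rebuild the integer from four big-endian bytes.
std::uint32_t getBigEndian32(const unsigned char in[4]) {
    return (static_cast<std::uint32_t>(in[0]) << 24) |
           (static_cast<std::uint32_t>(in[1]) << 16) |
           (static_cast<std::uint32_t>(in[2]) << 8) |
           static_cast<std::uint32_t>(in[3]);
}
```

For the value 45 this produces the bytes 00 00 00 2D on any machine, matching the big-endian example on the previous slide.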