Fundamental File Structure Concepts

Author: Jim Skon. Last modified by: James Skon. Created: 2/25/1998 12:28:48 PM. Format: PowerPoint presentation.

1
Fundamental File Structure Concepts

Chapter 4

2
Record and Field Structure

A record is a collection of fields.
A field is used to store information about some
attribute.
The question: when we write records, how do we
organize the fields in the records
so that the information can be recovered,
so that we save space,
so that we can process them efficiently, and
so that record structure flexibility is maximized?

3
Field Structure issues

What if
Field values vary greatly
Fields are optional

4
Field Delineation methods

Fixed length fields
Include length with field
Separate fields with a delimiter
Include keyword expression to identify each field

5
Fixed length fields

Easy to implement - use language record
structures (no parsing)
Fields must be declared at maximum length needed

Field:  last    first   address   city   state  zip
Width:  10      10      15        15     2      9
Value:  Yeakus  Bill    123 Pine  Utica  OH     43050
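
The fixed-length layout above can be sketched in C++ as follows. This is a minimal illustration, assuming the field widths 10, 10, 15, 15, 2 and 9 from the slide; the helper names fixedField and packPerson are hypothetical and do not come from the presentation.

```cpp
#include <cassert>
#include <string>

// Pad (or truncate) a value so it occupies exactly `width` bytes.
std::string fixedField(const std::string &value, std::size_t width) {
    std::string out = value.substr(0, width);   // cut off anything too long
    out.resize(width, ' ');                     // pad short values with blanks
    return out;
}

// Build one 61-byte record using the widths 10, 10, 15, 15, 2, 9.
std::string packPerson(const std::string &last, const std::string &first,
                       const std::string &address, const std::string &city,
                       const std::string &state, const std::string &zip) {
    std::string rec = fixedField(last, 10) + fixedField(first, 10) +
                      fixedField(address, 15) + fixedField(city, 15) +
                      fixedField(state, 2) + fixedField(zip, 9);
    assert(rec.size() == 61);
    return rec;
}
```

Every record has the same size, so a long value is truncated and a short one wastes the padding bytes, which is the trade-off the slide describes.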
6
Include length with field

Begin each field with a length indicator
If the maximum field length is < 256, a single byte can be used
for the length

Fields: last, first, address, city, state, zip
Length bytes (hex dump; each field starts with its length):
06 59 65 61 6B 75 73        (6) Yeakus
04 42 69 6C 6C              (4) Bill
08 31 32 33 20 50 69 6E 65  (8) 123 Pine
. . .
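
A small sketch of length-prefixed fields, assuming a one-byte length so that no field exceeds 255 bytes. The function names packField and unpackFields are hypothetical, chosen only for this illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Append one field to the record: a one-byte length, then the field's bytes.
// Assumes every field is at most 255 bytes long.
void packField(std::string &record, const std::string &field) {
    assert(field.size() <= 255);
    record.push_back(static_cast<char>(field.size()));
    record += field;
}

// Split a record made of length-prefixed fields back into its fields.
// Returns an empty vector if the record is truncated.
std::vector<std::string> unpackFields(const std::string &record) {
    std::vector<std::string> fields;
    std::size_t pos = 0;
    while (pos < record.size()) {
        std::size_t len = static_cast<unsigned char>(record[pos]);
        ++pos;
        if (pos + len > record.size()) return {};
        fields.push_back(record.substr(pos, len));
        pos += len;
    }
    return fields;
}
```

Packing "Yeakus", "Bill" and "123 Pine" this way yields 21 bytes, with the length bytes 06, 04 and 08 at positions 0, 7 and 12.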
7
Separate fields with a delimiter

Use a special character not used in the data
e.g. space, comma, tab
Also special ASCII control characters such as FS (hex 1C)
Here we use |
Also need an end-of-record delimiter

Yeakus|Bill|123 Pine|Utica|OH|43050|
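
Parsing delimited fields might look like the sketch below. It assumes that '|' terminates each field and never appears in the data; splitFields is a hypothetical helper, not part of the classes presented later.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split a record into fields; each field is terminated by `delim`.
// Adjacent delimiters yield an empty field.
std::vector<std::string> splitFields(const std::string &record, char delim) {
    std::vector<std::string> fields;
    std::size_t start = 0;
    std::size_t end;
    while ((end = record.find(delim, start)) != std::string::npos) {
        fields.push_back(record.substr(start, end - start));
        start = end + 1;
    }
    return fields;   // any text after the last delimiter is ignored
}
```

Because a field ends only where a delimiter is found, values such as "123 Pine" may contain spaces freely.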
8
Include keyword expression

Keywords label each field
A self-describing structure
Allows LOTS of flexibility
Uses lots of space

LAST=Yeakus|FIRST=Bill|ADDRESS=123 Pine|CITY=Utica|STATE=OH|ZIP=43050|
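
A keyword-labelled record can be parsed into a lookup table, as in this sketch. The '=' between keyword and value and the '|' after each field are assumptions of the example, and parseKeywordRecord is a hypothetical name.

```cpp
#include <cassert>
#include <map>
#include <string>

// Parse "KEY=value|KEY=value|..." into a keyword -> value map.
// The '=' and '|' separators are assumptions of this sketch.
std::map<std::string, std::string> parseKeywordRecord(const std::string &record) {
    std::map<std::string, std::string> fields;
    std::size_t start = 0;
    std::size_t end;
    while ((end = record.find('|', start)) != std::string::npos) {
        std::string item = record.substr(start, end - start);
        std::size_t eq = item.find('=');
        if (eq != std::string::npos)
            fields[item.substr(0, eq)] = item.substr(eq + 1);
        start = end + 1;
    }
    return fields;
}
```

An optional field that was left out simply has no entry in the map, and the fields may appear in any order.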
9
Optional Fields

Fixed length: leave blank
Field length: zero-length field
Delimiter: adjacent delimiters
Keywords: just leave out

10
Reading a stream of fields

Need to break record into fields
Fixed length can simply be read into record
structure
Others must be parsed with a parse algorithm

11
Record Structures

How do we organize records in a file?
Records can be fixed length or variable length
Fixed length allows simple direct access lookup
Fixed may waste space
Variable - how do we find a record's position?

12
Record Structures

Fixed Length Records
Fixed number of fields in records
Variable length
prefix each record with a length
Use a second file to keep track of record start
positions
Place delimiter between records

13
Fixed Length Records

All records same length
Record positions can be calculated for direct
access reads.
Does not imply that the sizes or number of
fields are fixed.
Variable-length fields inside fixed-length records lead to unused
space.
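
The position calculation for direct access is a single multiplication, sketched below. It assumes records are numbered from 0 and that the file holds nothing but records; the names recordOffset and readRecord are hypothetical.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Byte offset of record number `rrn` (counting from 0) when every
// record occupies exactly `recordSize` bytes.
long recordOffset(long rrn, long recordSize) {
    return rrn * recordSize;
}

// Read record `rrn` from a stream of fixed-length records.
// Returns an empty string if the record cannot be read in full.
std::string readRecord(std::istream &in, long rrn, long recordSize) {
    std::string rec(static_cast<std::size_t>(recordSize), '\0');
    in.clear();                                  // reset any earlier EOF state
    in.seekg(recordOffset(rrn, recordSize));
    in.read(&rec[0], recordSize);
    if (in.gcount() != recordSize) return std::string();
    return rec;
}
```

No earlier record has to be read to reach record `rrn`, which is what makes fixed-length records suitable for direct access.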

14
Fixed number of fields in records

Field size could be fixed or variable
Fixed
results in fixed size records
simply read directly into struct
Variable sized fields
delimited or field lengths
Simply count fields while parsing

15
Variable length Records

prefix each record with a length
Use a second file to keep track of record start
positions
Place delimiter between records

16
Prefix records with a length

Allows true variable length records
Form of prefix
Character number (fixed length)
Binary number (write integer without conversion)
Must consider the maximum record length
No direct access (great for sequential access)

17
Index of record start addresses

A second file is simply a list of offsets to
successive records
Since the offsets are fixed length, this file
allows direct access, thereby allowing direct access
to the main file.
Problem
Maintaining file (adding and deleting records)
Cost of index

18
Place delimiter between records

Special character not used in record
Allows efficient variable size
No direct access
Bible files - use \n as delimiter

19
Binary data in files

Binary reals and integers can be written to, and
read from, a file
Need to know the byte size of the variables used.
The sizeof operator returns the data size

20
Binary data in files

// needs <string.h> (strcpy, strlen) and <unistd.h> (read, write)
int rsize;
char rec_buf[MAX];
...
strcpy(rec_buf, "this is a test record");
rsize = strlen(rec_buf);
write(my_fd, &rsize, sizeof(int));   // write the size
write(my_fd, rec_buf, rsize);        // write the record
...
read(my_fd, &rsize, sizeof(int));    // read the size
read(my_fd, rec_buf, rsize);         // read the record

21
Viewing Binary file data

Use the file dump utility (od - octal dump)
od -xc <filename>
-x : hex output
-c : character output
Useful for viewing what is actually in file

22
Using Classes to Manipulate Buffer

Three Classes
delimited fields
Length-based fields
Fixed length fields

23
Class for Delimited fields

Consider a class to manage delimited text buffers
Allows reading and writing of delimited records
Allows packing and unpacking

24
Class for Delimited fields

class Person
{
public:
    // fields
    char LastName[11];
    char FirstName[11];
    char Address[16];
    char City[16];
    char State[3];
    char ZipCode[10];
    // Methods next ...
};

25
Class for Delimited fields

class DelimTextBuffer
{
public:
    DelimTextBuffer(char Delim = '|', int maxBytes = 1000);
    int Read(istream &);
    int Write(ostream &) const;
    int Pack(const char *, int size = -1);
    int Unpack(char *);
private:
    char Delim;
    char DelimStr[2];  // zero terminated string for Delim
    char *Buffer;      // character array to hold field values
    int BufferSize;    // size of packed fields
    int MaxBytes;      // maximum number of characters in the buffer
    int NextByte;      // packing/unpacking position in buffer
};

26
Class for Delimited fields

Packing a buffer:
Person Bill_Yeakus;
DelimTextBuffer buffer;
buffer.Pack(Bill_Yeakus.LastName);
buffer.Pack(Bill_Yeakus.FirstName);
buffer.Pack(Bill_Yeakus.ZipCode);
buffer.Write(stream);

27
Class for Delimited fields

int DelimTextBuffer::Pack(const char *str, int size)
// set the value of the next field of the buffer;
// if size = -1 (default) use strlen(str) as the length of the field
{
    short len;  // length of string to be packed
    if (size >= 0) len = size;
    else len = strlen(str);
    if (len > (short)strlen(str))  // str is too short!
        return FALSE;
    int start = NextByte;  // first character to be packed
    NextByte += len + 1;
    if (NextByte > MaxBytes) return FALSE;
    memcpy(&Buffer[start], str, len);
    Buffer[start + len] = Delim;  // add delimiter
    BufferSize = NextByte;
    return TRUE;
}

28
Class for Delimited fields

int DelimTextBuffer::Write(ostream &stream) const
{
    stream.write((char *)&BufferSize, sizeof(BufferSize));
    stream.write(Buffer, BufferSize);
    return stream.good();
}

29
Class for Delimited fields

int DelimTextBuffer::Read(istream &stream)
{
    Clear();
    stream.read((char *)&BufferSize, sizeof(BufferSize));
    if (stream.fail()) return FALSE;
    if (BufferSize > MaxBytes) return FALSE;  // buffer overflow
    stream.read(Buffer, BufferSize);
    return stream.good();
}

30
Class for Delimited fields

int DelimTextBuffer::Unpack(char *str)
// extract the value of the next field of the buffer
{
    int len = -1;          // length of packed string
    int start = NextByte;  // first character to be unpacked
    for (int i = start; i < BufferSize; i++)
        if (Buffer[i] == Delim)
            { len = i - start; break; }
    if (len == -1) return FALSE;  // delimiter not found
    NextByte += len + 1;
    if (NextByte > BufferSize) return FALSE;
    strncpy(str, &Buffer[start], len);
    str[len] = 0;  // zero termination for string
    return TRUE;
}

31
Class for Delimited fields

Class Person can be extended to provide
specialized packing functions

32
Class for Delimited fields

int Person::Pack(DelimTextBuffer &Buffer) const
// pack the fields into a DelimTextBuffer,
// return TRUE if all succeed, FALSE o/w
{
    int result;
    Buffer.Clear();
    result = Buffer.Pack(LastName);
    result = result && Buffer.Pack(FirstName);
    result = result && Buffer.Pack(Address);
    result = result && Buffer.Pack(City);
    result = result && Buffer.Pack(State);
    result = result && Buffer.Pack(ZipCode);
    return result;
}

33
Class for Delimited fields

int Person::Unpack(DelimTextBuffer &Buffer)
{
    int result;
    result = Buffer.Unpack(LastName);
    result = result && Buffer.Unpack(FirstName);
    result = result && Buffer.Unpack(Address);
    result = result && Buffer.Unpack(City);
    result = result && Buffer.Unpack(State);
    result = result && Buffer.Unpack(ZipCode);
    return result;
}

34
Class for Fixed Length fields

int FixedTextBuffer::AddField(int fieldSize)
{
    if (NumFields == MaxFields) return FALSE;
    if (BufferSize + fieldSize > MaxChars) return FALSE;
    FieldSize[NumFields] = fieldSize;
    NumFields++;
    BufferSize += fieldSize;
    return TRUE;
}

35
Class for Fixed Length fields

int FixedTextBuffer::Read(istream &stream)
{
    stream.read(Buffer, BufferSize);
    return stream.good();
}

36
Class for Fixed Length fields

int FixedTextBuffer::Write(ostream &stream)
{
    stream.write(Buffer, BufferSize);
    return stream.good();
}

37
Class for Fixed Length fields

int FixedTextBuffer::Pack(const char *str)
// set the value of the next field of the buffer
{
    if (NextField == NumFields || !Packing)  // buffer is full or not packing mode
        return FALSE;
    int len = strlen(str);
    int start = NextCharacter;            // first byte to be packed
    int packSize = FieldSize[NextField];  // number of bytes to be packed
    strncpy(&Buffer[start], str, packSize);
    NextCharacter += packSize;
    NextField++;
    // if len < packSize, pad with blanks
    for (int i = start + len; i < NextCharacter; i++)
        Buffer[i] = ' ';
    Buffer[NextCharacter] = 0;  // make buffer look like a string
    if (NextField == NumFields)  // buffer is full
    {
        Packing = FALSE;
        NextField = NextCharacter = 0;
    }
    return TRUE;
}

38
Class for Fixed Length fields

int FixedTextBuffer::Unpack(char *str)
// extract the value of the next field of the buffer
{
    if (NextField == NumFields || Packing)  // buffer is full or not unpacking mode
        return FALSE;
    int start = NextCharacter;            // first byte to be unpacked
    int packSize = FieldSize[NextField];  // number of bytes to be unpacked
    strncpy(str, &Buffer[start], packSize);
    str[packSize] = 0;  // terminate string with zero
    NextCharacter += packSize;
    NextField++;
    if (NextField == NumFields) Clear();  // all fields unpacked
    return TRUE;
}

39
Class for Fixed Length fields

void FixedTextBuffer::Print(ostream &stream)
{
    stream << "Buffer has max fields " << MaxFields
           << " and actual " << NumFields << endl
           << "max bytes " << MaxChars
           << " and Buffer Size " << BufferSize << endl;
    for (int i = 0; i < NumFields; i++)
        stream << "\tfield " << i << " size " << FieldSize[i] << endl;
    if (Packing) stream << "\tPacking\n";
    else stream << "\tnot Packing\n";
    stream << "Contents '" << Buffer << "'" << endl;
}

40
Class for Fixed Length fields

class FixedTextBuffer
{
public:
    FixedTextBuffer(int maxFields, int maxChars = 1000);
    int AddField(int fieldSize);
    int Read(istream &);
    int Write(ostream &);
    int Pack(const char *);
    int Unpack(char *);
private:
    char *Buffer;       // character array to hold field values
    int BufferSize;     // sum of the sizes of declared fields
    int *FieldSize;     // array to hold field sizes
    int MaxChars;       // maximum number of characters in the buffer
    int NextCharacter;  // packing/unpacking position in buffer
};

41
Class for Fixed Length fields

int Person::Pack(FixedTextBuffer &Buffer) const
// pack the fields into a FixedTextBuffer,
// return TRUE if all succeed, FALSE o/w
{
    int result;
    Buffer.Clear();
    result = Buffer.Pack(LastName);
    result = result && Buffer.Pack(FirstName);
    result = result && Buffer.Pack(Address);
    result = result && Buffer.Pack(City);
    result = result && Buffer.Pack(State);
    result = result && Buffer.Pack(ZipCode);
    return result;
}

42
Class for Fixed Length fields

int Person::Unpack(FixedTextBuffer &Buffer)
{
    Clear();
    int result;
    result = Buffer.Unpack(LastName);
    result = result && Buffer.Unpack(FirstName);
    result = result && Buffer.Unpack(Address);
    result = result && Buffer.Unpack(City);
    result = result && Buffer.Unpack(State);
    result = result && Buffer.Unpack(ZipCode);
    return result;
}

43
Record Access - Keys

Attribute used to identify records
Often used to find records
Standard or canonical form
rules which keys must conform to
prevents missing a record because its key is in a
different form
Example
all capitals
Phone in form (nnn) nnn-nnnn

44
Record Access - Keys

Keys can be distinct - uniquely identify records
Primary keys
one-to-one relationship between key value and
possible entities represented
SSN, Student ID
Keys can identify a collection of records
Secondary keys
one-to-many relationship
City, position, department

45
Record Access - Keys

Primary key desired characteristics
unique among collection of entities
dataless - what if some entities have no value
of this type (e.g. SSN)?
unchanging

46
Record access

Performance of access method
how do we compare techniques?
Must be careful what events we count.
big-oh notation gives us a way to factor out
all but the most significant factors

47
Record Access - timing

Sequential searching
Consider file of 4000 records
What if no blocking done, and one record per
block? (500 bytes records, 512 byte blocks)
What if cluster size set to 8?
always requires O(n), but search is faster by a
constant factor

48
Sequential searching

Usually NOT the best method
Sometimes it is best
Searching for some ASCII pattern (grep)
Small files
Files rarely searched
Searching on a secondary key, when a large
percentage of records match (say 25%)

49
Unix Tools for sequential file processing

cat - display a file
wc - count lines, words, and characters
grep - find lines in file(s) which match regular
expression.

50
Direct Access

Move directly to record without scanning
preceding data
Different languages/OSs support different
models
Byte offset model
Programmer must specify offset to record, and
record size to read.
Supports variable size records, skip sequential
processing
Relative Record Number (RRN) model
File has a fixed record size (declared at
creation time)
Records are specified by a record number
File modeled as a collection of components
Higher level of abstraction

51
Direct Access

Different language support
RRN support
PL/I
COBOL
Pascal (files are modeled as a collection of
components, i.e. records)
FORTRAN
Byte offset
C

52
Choosing Record Sizes for Direct Access

Fixed Length Fields
Very easy to parse records - just read into
record structure!
Each field must be maximum length needed!
Thus each record must be as long as the sum of the maximum field lengths

Field:  last    first   address   city   state  zip
Width:  10      10      15        15     2      9
Value:  Yeakus  Bill    123 Pine  Utica  OH     43050
53
Choosing Record Sizes for Direct Access

Variable length fields
Each field can be any length
since some can be long, others short, overall
record size may be shorter.
This gives more flexibility in field lengths
Records must be parsed, space wasted for
delimiter or length bytes.

Yeakus|Bill|123 Pine|Utica|OH|43050|
Snivenloppinsky|Helmut|12232 Galmentary Avenue|Spotsdale|NY|11232|
54
Header Records

The first record in a direct file may be used to
store special information
Number of records used.
Location of first record in key order sequence.
Location of first empty record
File record structure (meta-data)
In languages with the RRN model (e.g. Pascal), the variant
record facility must be used
In C, the header record can be of different size
from the rest of the file records.

55
Header Records

Consider update.c in the text.
The header record contains a 2-byte record
count.
Header size is 32, record size is 64

static struct { short rec_count; char fill[30]; } head;
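
With a header in front of the data, every record offset is shifted by the header size. The sketch below assumes the 32-byte header and 64-byte records mentioned above; the function names are hypothetical. The header is written in the machine's native byte order, so a file produced this way is not portable across architectures.

```cpp
#include <cassert>
#include <cstring>
#include <sstream>
#include <string>

const int HEADER_SIZE = 32;
const int RECORD_SIZE = 64;

struct Header {
    short rec_count;   // number of records in the file
    char fill[30];     // pads the header to 32 bytes
};
static_assert(sizeof(Header) == HEADER_SIZE, "header must be 32 bytes");

// Data records start after the header, so the offset is shifted by its size.
long dataRecordOffset(long rrn) {
    return HEADER_SIZE + rrn * RECORD_SIZE;
}

// (Re)write the header at the start of the file.
bool writeHeader(std::iostream &file, short count) {
    Header head;
    std::memset(&head, 0, sizeof head);
    head.rec_count = count;
    file.clear();
    file.seekp(0);
    file.write(reinterpret_cast<const char *>(&head), sizeof head);
    return file.good();
}

// Read the record count back from the header; returns -1 on failure.
short readCount(std::iostream &file) {
    Header head;
    file.clear();
    file.seekg(0);
    if (!file.read(reinterpret_cast<char *>(&head), sizeof head)) return -1;
    return head.rec_count;
}
```

A program would call readCount when it opens the file and writeHeader again whenever a record is added or removed.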
56
Header Records

Must be written when file created
Must be rewritten when file changed
Must be read when file is opened

57
File Access and Organization

File Organization
Variable Length Records
Fixed Length Records
Field Structures (size bytes, delimiters, fixed)
File Access
Sequential access
Direct access
Indexed access

58
File Access and Organization

Interaction between organization and access
Can the file be divided into fields?
Is there a higher level of organization to the
file (metadata)?
Do all records have to have the same number of
fields, bytes?
How do we distinguish one record from the next?
How do we recognize if a fixed length record
holds real data or not?

59
File Access and Organization

There is often a trade-off between space and
time
Fixed length records - allow direct access, waste
space
Variable length records - require sequential search
We also must consider the typical use of the file
- what are the desired access patterns
Selection of a particular organization has
implications on the allowable types of access

60
Portability and Standardization

Differences among Languages
Fixed sized records versus byte addressable
access
Differences among Machine Architectures
Byte order of binary data
May be high order or low order byte first

61
Byte order of binary data

High order first (Big Endian)
A long int say 45 is stored in memory.
It is stored as 00 00 00 2D
Suns, Network protocols
Low order first (Little Endian)
A long int say 45 is stored in memory.
It is stored as 2D 00 00 00
PCs, VAXs

62
Byte order of binary data

If binary data is written to a file, it is
written in the order stored in memory
If the data is later read by a system with a
different ordering, the number will be incorrect!
For the sake of portability, files should be
written in an agreed upon format (probably Big
Endian)
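
One way to write integers in an agreed-upon order is to emit the bytes explicitly instead of copying them from memory. This sketch assumes a 32-bit unsigned value stored big endian; encodeBigEndian and decodeBigEndian are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode a 32-bit value as 4 bytes, high-order byte first (big endian),
// regardless of the byte order of the machine running the code.
std::string encodeBigEndian(std::uint32_t value) {
    std::string bytes(4, '\0');
    bytes[0] = static_cast<char>((value >> 24) & 0xFF);
    bytes[1] = static_cast<char>((value >> 16) & 0xFF);
    bytes[2] = static_cast<char>((value >> 8) & 0xFF);
    bytes[3] = static_cast<char>(value & 0xFF);
    return bytes;
}

// Decode 4 big-endian bytes back into a 32-bit value.
std::uint32_t decodeBigEndian(const std::string &bytes) {
    assert(bytes.size() == 4);
    std::uint32_t value = 0;
    for (unsigned char b : bytes)
        value = (value << 8) | b;
    return value;
}
```

The value 45 encodes as the bytes 00 00 00 2D on every machine, because the shifts operate on the number itself and not on its layout in memory.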
