Title: File Structures
1. File Structures
- Shyh-Kang Jeng
- Department of Electrical Engineering / Graduate Institute of Communication Engineering
- National Taiwan University
2. Logical Records
- High-level programming languages provide primitives for requesting access to files via the operating system
- Logical records
  - Units in a file that are compatible with the application, provided by the high-level language's file-access primitives
- Fields
  - Smaller units of information within a logical record
3. File, Logical Record, and Field
4. Physical Records
5. Operating System and File Access
6. Physical Records and Buffers
- Manipulating the file in terms of physical records (blocks) is handled by the operating system
- The operating system responds to a read request by reading enough physical records, placing the data in a buffer, and making the buffer available to the application
- When storing, the operating system holds the data in a buffer until a complete physical record has accumulated, then transfers the entire physical record to mass storage
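The storing behaviour described above can be sketched in C. This is only an illustration under stated assumptions: `BufferedFile`, `buffered_write`, `flush_block`, and the 8-byte `BLOCK_SIZE` are hypothetical names and values, and the "transfer" to mass storage is simulated by a counter rather than a real device.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 8   /* hypothetical physical record size, in bytes */

/* A toy file that counts how many whole blocks were transferred. */
typedef struct {
    char   buffer[BLOCK_SIZE]; /* data waiting to be written */
    size_t used;               /* bytes currently held in the buffer */
    size_t blocks_written;     /* physical records sent to storage */
} BufferedFile;

/* Stand-in for the transfer of one physical record to mass storage. */
static void flush_block(BufferedFile *f)
{
    f->blocks_written++;
    f->used = 0;
}

/* Accept logical data of any length; transfer only complete blocks. */
void buffered_write(BufferedFile *f, const char *data, size_t len)
{
    while (len > 0) {
        size_t room = BLOCK_SIZE - f->used;
        size_t n = len < room ? len : room;
        memcpy(f->buffer + f->used, data, n);
        f->used += n;
        data += n;
        len -= n;
        if (f->used == BLOCK_SIZE)
            flush_block(f);
    }
}
```

Writing 5 bytes and then another 5 bytes to an empty `BufferedFile` triggers one block transfer and leaves 2 bytes waiting in the buffer.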
7. File Descriptor
- Also called a file control block
- A table storing information about the file being manipulated
  - Device
  - Name of the file
  - Location of the buffer
  - Flag indicating whether to save the file when the application terminates
- Opening and closing the file
  - The processes of creating and discarding a file descriptor
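A file descriptor of this kind might look roughly like the following C structure. All names here (`FileDescriptor`, `open_descriptor`, `close_descriptor`) and the 64-character name limit are assumptions made for illustration; real operating systems keep considerably more information.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical file control block; field names are illustrative. */
typedef struct {
    int   device;        /* which device holds the file */
    char  name[64];      /* name of the file */
    char *buffer;        /* location of the buffer in main memory */
    int   save_on_exit;  /* nonzero: keep the file when the program ends */
} FileDescriptor;

/* "Opening" a file: create and fill in a descriptor. */
FileDescriptor *open_descriptor(int device, const char *name, size_t buffer_size)
{
    FileDescriptor *fd = malloc(sizeof *fd);
    if (fd == NULL)
        return NULL;
    fd->buffer = malloc(buffer_size);
    if (fd->buffer == NULL) {
        free(fd);
        return NULL;
    }
    fd->device = device;
    strncpy(fd->name, name, sizeof fd->name - 1);
    fd->name[sizeof fd->name - 1] = '\0';
    fd->save_on_exit = 1;
    return fd;
}

/* "Closing" a file: discard the descriptor and its buffer. */
void close_descriptor(FileDescriptor *fd)
{
    if (fd != NULL) {
        free(fd->buffer);
        free(fd);
    }
}
```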
8. Opening and Closing Files
- Imperative paradigm
  - Open the file document.txt as DocFile for input
  - Close the file DocFile
- Object-oriented paradigm
  - Create the object DocFile as the input file document.txt
  - Send the message GetCharacter to DocFile to retrieve Symbol
  - Send the message Close to the object DocFile
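In C, the imperative version of these steps maps onto the standard library calls `fopen`, `fgetc`, and `fclose`. The sketch below wraps them in a hypothetical helper, `first_character`, which is not part of any library.

```c
#include <stdio.h>

/* Open the named file for input, fetch one character, and close it.
   Returns the character, or EOF if the file is missing or empty. */
int first_character(const char *path)
{
    FILE *doc_file = fopen(path, "r");  /* open the file as DocFile for input */
    if (doc_file == NULL)
        return EOF;
    int symbol = fgetc(doc_file);       /* GetCharacter: retrieve Symbol */
    fclose(doc_file);                   /* close the file DocFile */
    return symbol;
}
```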
9. Sequential Files
- Accessed serially, from beginning to end
- Examples
  - Audio
  - Video
  - Files containing programs
  - Files containing text documents
10. Processing a Sequential File
- while (the end of the file has not been reached) do
  - (retrieve the next record from the file and process it)
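The loop above might be written in C as follows. The `Record` layout and the function `sum_record_ids` are hypothetical; "processing" a record is reduced to adding its id to a total, and the end of the file is detected when `fread` can no longer deliver a complete record.

```c
#include <stdio.h>

/* Hypothetical fixed-size logical record. */
typedef struct {
    int  id;
    char name[16];
} Record;

/* Read records one after another until the end of the file,
   "processing" each one by adding its id to a running total. */
long sum_record_ids(FILE *file)
{
    Record rec;
    long total = 0;
    while (fread(&rec, sizeof rec, 1, file) == 1)  /* a full record was read */
        total += rec.id;                           /* process the record */
    return total;
}
```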
11. File Allocation Tables (FAT)
- Clusters (4-16 sectors, about 2 KB in Windows)
- The FAT records which clusters are assigned to which file
- Through the FAT, the operating system can retrieve a file in the proper cluster-by-cluster order
- FAT16 (64K clusters, 128 MB)
- FAT32 (4 G clusters, terabytes)
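Retrieving a file in cluster-by-cluster order can be pictured as following a chain through a table. The sketch below is a simplification, not the on-disk FAT layout: it assumes the table is an `int` array in which each entry names the next cluster of the same file, and `FAT_END` and `follow_chain` are hypothetical names.

```c
#include <stddef.h>

#define FAT_END (-1)   /* hypothetical marker for the last cluster of a file */

/* fat[c] holds the number of the cluster that follows cluster c in its file.
   Starting from a file's first cluster, copy the chain into `out`
   (at most `max` entries) and return the number of clusters found. */
size_t follow_chain(const int *fat, int first_cluster, int *out, size_t max)
{
    size_t count = 0;
    int cluster = first_cluster;
    while (cluster != FAT_END && count < max) {
        out[count++] = cluster;
        cluster = fat[cluster];
    }
    return count;
}
```

The `max` limit also keeps the loop from running forever if a damaged table contains a cycle.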
12. Maintaining a File's Order
13. Disk Scanning
14. Detection of EOF
- End-of-file
- Identifying EOF
  - Place a special record (sentinel) at the end of the file
    - retrieve the first record from the file
    - while (the record is not the sentinel) do
      - (retrieve the next record from the file)
  - Leave the task to the operating system
    - while (not EOF) do
      - (retrieve the next record from the file)
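The sentinel approach can be sketched in C as below. It assumes, purely for illustration, that each record is a single `int` and that the value -1 never occurs as real data; `SENTINEL` and `count_until_sentinel` are hypothetical names.

```c
#include <stdio.h>

#define SENTINEL (-1)   /* hypothetical value that never occurs as real data */

/* Count the records stored before the sentinel. Each record is one int. */
int count_until_sentinel(FILE *file)
{
    int record;
    int count = 0;
    /* retrieve the first record from the file */
    if (fread(&record, sizeof record, 1, file) != 1)
        return 0;
    while (record != SENTINEL) {
        count++;
        /* retrieve the next record from the file */
        if (fread(&record, sizeof record, 1, file) != 1)
            break;   /* the file ended without a sentinel */
    }
    return count;
}
```

Anything stored after the sentinel is never read, which is the point of the technique: the program stops on the special record rather than asking the operating system whether the file has ended.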
15. Key Field
- A single field that identifies a logical record in a file
- Example
  - Social security number in an employee file
- Arranging files according to a key field can greatly reduce processing time
- Updating classic sequential files
  - Transaction file (new information)
  - Master file (file to be updated)
16. Merging Two Files (1)
17. Merging Two Files (2)
18. Merging Two Sequential Files
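Merging two sequential files that are both ordered by key field can be sketched as follows. This is the standard two-way merge, shown on in-memory arrays of integer keys rather than on real files; `merge_keys` is a hypothetical name, and when two keys are equal the one from the first sequence is written first.

```c
#include <stddef.h>

/* Merge two key-ordered sequences into `out`, which must have room for
   na + nb keys. Returns the number of keys written. */
size_t merge_keys(const int *a, size_t na, const int *b, size_t nb, int *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (a[i] <= b[j])
            out[k++] = a[i++];   /* next key comes from the first sequence */
        else
            out[k++] = b[j++];   /* next key comes from the second sequence */
    }
    while (i < na) out[k++] = a[i++];   /* copy what is left of the first */
    while (j < nb) out[k++] = b[j++];   /* copy what is left of the second */
    return k;
}
```

Because both inputs are already in key order, each record is examined only once, which is why ordering files by key field pays off.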
19. Text Files
- A sequential file in which each logical record consists of a single encoded character
- ASCII files and Unicode files
- Binary files
- Line endings: carriage return + line feed (PC), line feed (UNIX), carriage return (classic Apple Mac OS)
20. Editors vs. Word Processors
- Editors create and modify strict text files
- Word processors insert nonprintable codes in the file to represent changes in fonts, alignment information, etc.
- Email handles only text files, so word processor output can only be transferred as an attachment
21. A Simple Employee File Implemented as a Text File
22. The First Two Bars of Beethoven's Fifth Symphony
23. Representing Sheet Music by a Text File
    <staff clef = "treble">
      <key>C minor</key><time>2/4</time>
      <measure><rest>qtr</rest>
        <notes>qtr G, qtr G, qtr G</notes>
      </measure>
      <measure><notes>hlf E</notes></measure>
    </staff>
24. Advantages of Representing Music as a Text File
- Can be encoded, modified, stored, and transferred over the Internet
- Software can be written to present the contents of such files as traditional sheet music, or even to play the music on a synthesizer
25. eXtensible Markup Language
- A standard style for designing notation systems (markup languages) for representing data as text files
- Some markup languages following the XML standard
  - MathML
  - SMIL (multimedia presentations)
  - 4ML (music)
  - XHTML
  - WML and MPEG-7
26. Programming Concerns
- Imperative paradigm
  - Apply the procedure ReadFile to retrieve MailRecord from MailList
- Object-oriented paradigm
  - Send the ReadFile message to the object MailList to retrieve MailRecord
- Peripheral devices
  - Apply the procedure ReadFile to retrieve Name from the file KeyBoard
27. Programming Concerns
- Examples
  - Apply the procedure GetCharacter to retrieve Symbol from the file Text
  - Apply the procedure ReadLine to retrieve TextLine from the file Text
  - Apply the procedure Write to place the value of Length in the file Text
  - Apply the procedure ReadFile to retrieve the value of Length from the file Text
  - Apply the procedure ReadFile to retrieve Age from the file KeyBoard
28. Converting data from two's complement notation into ASCII for storage in a text file
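A conversion of this kind can be sketched in C as below: the integer, held internally in binary, is broken into decimal digits, and each digit is replaced by its character code. `int_to_ascii` is a hypothetical helper, and the buffer sizes assume an `int` of at most 32 bits.

```c
#include <stddef.h>

/* Write the decimal ASCII form of `value` into `out` (at least 12 bytes
   for a 32-bit int) and return the number of characters produced. */
size_t int_to_ascii(int value, char *out)
{
    char digits[11];
    size_t n = 0, len = 0;
    /* Work on the magnitude as unsigned so the most negative int is safe. */
    unsigned int magnitude = value < 0 ? 0u - (unsigned int)value
                                       : (unsigned int)value;
    do {
        digits[n++] = (char)('0' + magnitude % 10);  /* character code of digit */
        magnitude /= 10;
    } while (magnitude > 0);

    if (value < 0)
        out[len++] = '-';
    while (n > 0)
        out[len++] = digits[--n];  /* digits were produced in reverse order */
    out[len] = '\0';
    return len;
}
```

For example, the value 25 becomes the two characters '2' and '5', which is what a procedure such as Write must do before placing Length in a text file.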
29. Indices
- An index contains a list of values called keys
- Each key identifies a block of information residing in the related storage structure
- Along with each key, the index holds an entry indicating where the associated block of information is stored
- To find a particular block of information
  - First find the identifying key in the index
  - Then retrieve the block of information stored at the location associated with that key
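The first of the two lookup steps can be sketched in C. The sketch assumes the index is held in memory as an array sorted by key, so that a binary search can be used; the slides do not prescribe a search method, and `IndexEntry` and `index_lookup` are hypothetical names.

```c
#include <stddef.h>

/* One index entry: a key and the location of its block of information. */
typedef struct {
    int  key;
    long location;   /* for example, a byte offset within the file */
} IndexEntry;

/* Binary search of an index sorted by key.
   Returns the stored location, or -1 if the key is absent. */
long index_lookup(const IndexEntry *index, size_t count, int key)
{
    size_t low = 0, high = count;   /* search the range [low, high) */
    while (low < high) {
        size_t mid = low + (high - low) / 2;
        if (index[mid].key == key)
            return index[mid].location;
        if (index[mid].key < key)
            low = mid + 1;
        else
            high = mid;
    }
    return -1;
}
```

The returned location would then be used to position the file and read the block, which is the second step.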
30. Indexed File
31. Inverted File
32. Partial Index
33. Hierarchical Index System
34. File Pointer
- fgetpos( Personnel, &Position )
- fsetpos( Personnel, &Position )
35. Example (1)
    /* FGETPOS.C: This program opens a file and reads
     * bytes at several different locations.
     * Note: it treats fpos_t as an integer (the casts to long and
     * pos = 140). That works only with compilers that define fpos_t
     * as an integer type; standard C leaves it opaque.
     */
    #include <stdio.h>

    int main( void )
    {
       FILE   *stream;
       fpos_t pos;
       char   buffer[20];

       if( (stream = fopen( "fgetpos.c", "rb" )) == NULL )
          printf( "Trouble opening file\n" );
       else
       {
          /* Read some data and then check the position. */
          fread( buffer, sizeof( char ), 10, stream );
          if( fgetpos( stream, &pos ) != 0 )
             perror( "fgetpos error" );
          else
          {
             fread( buffer, sizeof( char ), 10, stream );
             printf( "10 bytes at byte %ld: %.10s\n", (long) pos, buffer );
          }
36. Example (2)
          /* Set a new position and read more data. */
          pos = 140;
          if( fsetpos( stream, &pos ) != 0 )
             perror( "fsetpos error" );
          fread( buffer, sizeof( char ), 10, stream );
          printf( "10 bytes at byte %ld: %.10s\n", (long) pos, buffer );
          fclose( stream );
       }
       return 0;
    }
37. Hashing
- A technique that provides access similar to that of an index structure
- But does not need to maintain an index
- Bucket
  - One of the sections into which the data storage space is divided
- Hash function
  - Algorithm that converts a key value into a bucket number
- Hash files and hash tables
38. Hashing System
39. Hashing a Key
40. Distribution Problems
- It is better to select a hash function that distributes the records evenly among the buckets
- If a dividend and a divisor share a common factor, that factor will be present in the remainder as well
- Example
  - 40 buckets and keys that are multiples of 5
  - The entries will cluster in the buckets associated with the remainders 0, 5, 10, 15, 20, 25, 30, 35
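The clustering in this example can be checked with a short C sketch of division hashing, where the bucket number is the remainder left after dividing the key by the number of buckets. `hash_key` and `buckets_used` are hypothetical names, and the helper is limited to 64 buckets to keep it small.

```c
#include <stddef.h>

/* Division hashing: the bucket number is the remainder of key / buckets. */
unsigned hash_key(unsigned key, unsigned buckets)
{
    return key % buckets;
}

/* Hash the keys step, 2*step, ..., count*step and return how many
   distinct buckets receive at least one entry (buckets must be <= 64). */
unsigned buckets_used(unsigned step, unsigned count, unsigned buckets)
{
    unsigned char hit[64] = {0};
    unsigned used = 0;
    for (unsigned i = 1; i <= count; i++) {
        unsigned b = hash_key(i * step, buckets);
        if (!hit[b]) {
            hit[b] = 1;
            used++;
        }
    }
    return used;
}
```

With 40 buckets, `buckets_used(5, 200, 40)` reports that only 8 buckets are ever used. With 41 buckets, a prime that shares no factor with 5, the same keys reach all 41 buckets.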
41. Collision
- The phenomenon of two keys hashing to the same value
- Tends to cause clustering, and should be avoided
42. Probability Calculation
- 41 buckets
- Probability that eight entries can all be placed in empty buckets (the eighth arriving after 7 have been inserted), assuming each bucket is equally likely
  - (41/41)(40/41)(39/41)(38/41)...(34/41) ≈ 0.482
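The product above can be reproduced with a few lines of C. The function name `no_collision_probability` is hypothetical, and the calculation assumes every key is equally likely to hash to any bucket.

```c
/* Probability that `entries` keys, each equally likely to hash to any of
   `buckets` buckets, all land in different buckets. */
double no_collision_probability(int buckets, int entries)
{
    double p = 1.0;
    for (int k = 0; k < entries; k++)
        p *= (double)(buckets - k) / buckets;  /* k buckets already occupied */
    return p;
}
```

Calling `no_collision_probability(41, 8)` multiplies the eight factors from 41/41 down to 34/41 and gives roughly 0.482, so a collision is more likely than not once only eight entries have been stored.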
43. Handling Bucket Overflow
44. Load Factor
- The ratio of the number of entries actually stored in the structure to the total capacity of the buckets
- As long as the ratio is below 50%, performance is normally good
- If the load factor creeps above 75%, system performance generally degrades
- The system is usually reconstructed with more buckets when the load factor approaches 75%
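The ratio and the 75% rule of thumb can be expressed in C as below. `load_factor` and `should_rebuild` are hypothetical names, and treating "approaches 75%" as "has reached 75%" is a simplifying assumption.

```c
#include <stddef.h>

/* Load factor: entries stored divided by the total capacity of the buckets. */
double load_factor(size_t entries, size_t buckets, size_t bucket_capacity)
{
    return (double)entries / (double)(buckets * bucket_capacity);
}

/* Rule of thumb: rebuild with more buckets once the ratio reaches 75%. */
int should_rebuild(size_t entries, size_t buckets, size_t bucket_capacity)
{
    return load_factor(entries, buckets, bucket_capacity) >= 0.75;
}
```

For 41 single-entry buckets, 20 stored entries give a load factor just under 50%, while 31 entries push it past 75%.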
45. Java Class Hashtable
- table = new Hashtable(capacity, factor)
- Each bucket is a linked list
- The load factor limits the ratio of stored entries to the number of buckets; when the limit is exceeded, the table is enlarged and rehashed
- Methods put and get
46. Hash File
47. Java Class Properties
- A Properties object is in effect a Hashtable
- Initialized from a file by the method load
- Saved in mass storage by the method store
- The file is actually a sequential file consisting of a stream of bits from which the appropriate hash table can be constructed in main memory
48. Exercises
- Review problems
  - 6, 9, 12, 16, 18, 20, 25, 32, 38, 39