Title: Structured Files
1Structured Files
Chapter 19
2What The Record Manager Does
- Storage allocation store tuples in file blocks
- Tuple addressing give tuple an id identifier
- provide fast access via that id.
- Enumeration fast enumeration of all relations
tuples - Content addressing give fast accessible via
attribute values. - Maintenance update/delete a tuple and its
access paths. - Protection support for security
- encrypt or tuple-granularity access
control.
3Outline
- Representing values
- Representing records
- Storing records in pages and across pages
- Organizing records (entry, relative, key, hash)
- Examples of fix/log/log logic.
4Record Allocation in a Page
-
- Recall
- File is a collection of fixed-length pages
(blocks). - File and buffer managers map files to disc/RAM
slot on disk
block
page
Page Dir
Block Head
Block Trailer
Page Head
page body
checksum
5Page Declares
- typedef struct / global page numbers /
- FILENO fileno /file where the page
lives / - uint pageno / page number within the
file / - PAGEID, PAGEIDP / /
- typedef struct
- PAGEID thatsme / identifies the page /
- PAGE_TYPE page_type / see description
above / - OBJID object_id / internal id of the
relation,index,etc. / - LSN safe_up_to / page LSN for the WAL -
protocol / - PAGEID previous / often pages are members
of doubly / - PAGEID next / linked lists /
- PAGE_STATE status / valid,in-doubt,copy of
something,etc/ - int no_entries / entries in page dir
(see below) / - int unused / free bytes not in
freespace / - int freespace / contiguous free bytes
for data / - char stuff / will grow /
- PAGE_HEADER, PAGE_PTR / /
6Different uses of pages
- Data Homogeneous record storage
- Cluster like Data except many different record
types - Index (access path) hashed or B-tree
- Free-space bitmap describes status of ³ 4,000
other pages. - Directory meta-data about this or other files
7Page Directory Points to Records on Page
Page Header
1st Tuple
2nd
2nd Tuple
3rd Tuple
4thTuple
5th Tuple
Tuples are inserted in this direction
Page directory grows in this direction
2
1
3
4
5
- Record id is File, Page, Directory_offset
8Accessing a Record
- Read by TID Lock record shared
- locate page
- Get semaphore shared
- follow directory offset
- copy tuple
- Give semaphore
- Insert by TID Lock record exclusive
- locate page
- Get semaphore exclusive
- Find space
- Insert
- log insert (tid, new value).
- update page LSN, header, directory,
- Give semaphore
9Accessing a Record
- Delete by TID Lock record exclusive
- locate page
- Get semaphore exclusive
- Add record to free space
- Log delete (tid, old value).
- update page LSN, header, directory,
- Give semaphore
- Update TID much like delete--insert
10Finding Space for Insert / Update
- If tuple fits in page contiguous free-space
easy. - If tuple fits in page free space reorganize
(compress) - Physiological logging makes this cheap.
- If tuple does not fit then
- leave forwarding address on page.
- Optionally leave record prefix on page.
- Segment record among several pages.
tid
11Finding space within a file
- Free space table
- Summarizes status of many pages
- (8KB page gt 64Kb gt 500MB of 8KB data pages)
- Good for clustered contiguous allocation
- bitmap should be transaction protected
- If transaction aborts, page is freed again.
- Alternatively, treat bitmap as a hint
- Rebuild periodically.
p1
p2
p3
p4
p5
p6
p7
P19
P20
P21
.
.
.
.
f2
F19
21
f3
f4
f5
f6
f7
.
.
.
.....
.
.
.
.
.
.
f17
Free space directories
12Finding space within a file
- Free space cursor/list
- Chain should be transaction protected
- Else rebuild at restart
- do not trust pointers
- (free page may be allocated).
file catalog
chain of empty pages
.
.
empty_page_anchor
point_of_insert
page for next insert
13Tuple Allocation - I
The first strategy maintains a pointer to the
current block for insert (CBI). When that
block fills up, an empty block is requested from
a system service, which then becomes the new
current block for insert.
CBI
where next?
And so on. This is the sequential insert
strategy.
Questions What happens, when the pointer arrives
at the last block? How do we reclaim space freed
by deleted tuples?
14Incremental Space Expansion - I
When the list of empty blocks is exhausted, there
are two options to find space for new tuples. Let
us assume the following configuration
The first option is to let the CBI pointer
circulate over the set of allocated
blocks, assuming that space is released by
deleted tuples.
And so on. This works as long as enough space is
freed up by deleted tuples. If there are only few
gaps, finding space for a new tuple can become
very expensive, because many blocks have to be
probed sequentially.
The need to probe blocks that are completely
filled can be avoided by maintaining a an array
of bits that contains one bit per block
indicating whether a block is full
15Naming Tuples (records)
Relative byte address file, offset in file
OK for insert-then-read-only DBs record
can't easily grow. deleted space not easily
reclaimed. Tuple Identifier file, page, index
The design shown below. Main
disadvantage expensive reorganization (fixing
overflows)
pageno
dir_index
fileid
nodeid
pageno
fileid
nodeid
dir_index
3
7446
7446
3
7446
7446
5127
this tuple
this tuple
pseudo -TID
16Implementing Database Keys
- Address record via directory
- Address has a ID to allow for invalidation
- ID never reused.
- Pointer can be swizzled.
- Popular with network OO DBs
17Naming Tuples via Primary Key
- Entry Sequenced, Relative primary key is
physical addr - Hash, B-tree primary key is content (primary
key) - Primary Key an alternative to DBkey
- B-tree clusters related data
- ProblemsB-tree access is slower than Hash.Hash
B-tree keys not fixed lengthbut neither is
node.db_key - Benefit key can grow to LARGE databasesGood
for distributed/partitioned data - Its religious.
18Datatype Representation
- E External representation ASCII, ISO Latin1,
Unicode,... - P Programming language representation
- many PL/1, Cobol, C, all have different
VARCHAR - many type mismatches between P and F
- interval, datetime, user,...
- F File representation "native" types (e.g.
null values, ....). - Lots of mapping functions.
- It would be great if F-1(F(x)) x for these
functions, but.... - Called the impedance mismatch between DB and PL
19Datatype Representations
- P _ F Implies a special language
- (all other languages are 2nd class)
- E _ F Use characters for everything.
- Problem E changes from country to country!
- (all other languages are 2nd class)
- No easy way out of this.
- Unicode will help most of us and make E_F more
attractive
20Representing Records
21Representing Records
- struct relations
- Uint relation_no / internal id for the
relation / - char owner / user id of the creator /
- long creation_date / date when it was
created / - PAGENO current_point_of_insert / free space
done via / - PAGENO empty_page_anchor / free space cursor
method / - Uint no_of_attributes /attributes in
relation / - Uint no_of_fixed_atts / fixed-length
attributes / - Uint no_of_var_atts / variable-length
attributes / - struct attributes p_attr / pointer to
the attributes array / - struct attributes / attributes array /
- char attribute_name / external name of
the attribute / - Uint attribute_position / index of the field
in the tuple (1,2,...) / - char attribute_type / this encodes the SQL -
type definition / - Boolean var_length / is it variable_length
field ? / - Boolean nulls_allowed / can field assume NULL
value ? / - char default_value / value assumed if none
stored in tuple / - Uint field_length / maximum length of field
/ - int accumulated_offset / explained later
/
22Representing Records
- Generic header
- (rid, tid, fields)
- all fixed length encoding
- (fat records, fast-simple code
- max lt page path length)
- variable fields have length
- (short records, slow code)
- type-length-value
- (simple slow code, easy reorg)
- fixed ptrs to variables.
- (compact, fast code)
-
general prefix to all tuple
representations
relation-id
number of fields in
tuple-id
the tuple or
actual tuple length
number of
fields
F2
F3
F4
F6
F5
F1
3
4
2
4
8
10
name
n
m
L
tuple
length
F
F
F
F
m
F
4
2
6
F
1
3
5
L
3
4
2
4
n
tuple
length
number of
F1
F2
F3
F4
F5
F6
fields
3
4
2
4
23Representing Records (Reuter Recommends)
24Some Details
- Representing null values
- missing field
- special value
- extra field
- bitmap
- Representing keys
- efficient comparison is important
- store "conditioned" key so simple byte-compare.
- Flip integer sign (so negative sorts low)
- Flip float so exponent first, mantissa second,
flipped signs - Compress varchars.
- MANY refinements.
- Want an order-preserving compression.
25Fat Records (Longer Than a Page)
Record must fit on page. Long fields
segregated to separate page may be good in some
cases (Multi-media DBs) Overflow page
chains Segment record across pages
26Obese records (Longer Than 10 Pages)
- If record is super-large, then may want to index
into it quickly. - Obvious" design is standard tree.
- Record is root of tree.
- Grow levels when one fills.
- Allows blob growth, update,...
27Non-Normalized Relations
28Structured File Definition
29File Layouts
- Unstructured
- a sequence of bytes
- Structured,
- Entry Sequenced.
- Records inserted at end
- Records cannot grow
- key is RBA (relative byte address)
- Relative
- fixed size record slots
- records limited by that size
- key is relative record number
30Associative File Types
- Hashed
- Records addressed by key field(s)
- bucket has list of records
- overflow to other buckets
- or to overflow pages.
- Key Sequenced
- Records addressed by keyfield(s)
- Records in sorted order.
- either sorting or b-tree or...
31Parameters at Create
- Database
- Record type (fields)
- Key
- Organization Entry Sequenced, Relative, Hashed,
Key Sequenced - Block size (page size)
- Extent size (storage area)
- Partitioning (among discs or nodes) by key.
- Attributes access control
- allocation and archive strategy
- transactional
- lifetime, zero on free, and on and on ....
32Parameters at Create
- "Secondary" indices.
- Primary key is....(e.g. customer number).
- Secondary key is social security number
- Non-Unique secondary key is Last_Name, First_name
- Secondary indices can be unique or not
and hashed or Key Sequenced - index is like a table.
- fields of index are
- secondary key, primary key
- So can define index on any
- kind of base table
33Secondary Index Example
- Base table is key-sequenced on CustomerNumber.
- Index table is key sequence on Name-CustomerNumber
. - Index can be a replica of the base table in
another order.Transaction recovery and locking
keeps them consistent. - Tuple management system
- Maintains indices (insert, update, delete)
- Navigates to base table via secondary index as
one request.
34What happens when you open a relation?
- Many files get opened. Read directory (catalog)
- Partitions,
- Indices
35Once OPEN, Application can SCAN the relation
- Scan is a row column subset
- SELECT ltcolumn listgt
- FROM lttablegt
- WHERE ltpredicategt
- With a specified start/stop key
- AND ltkeygt BETWEEN ltlowgt AND lthighgt
- In a specified order (supported by a secondary
index) - ASCENDING DESCENDING
- A locking protocol Serializable Repeatable
Read Committed Read - Uncommited Read Skip Uncommitted
- TIMEOUT ltsecondsgt
36SCAN States
37SCAN States How they change
- On error, scan state does not change.
- On open,
- scan is before after the first last set
element - if scan is ascending descending
- On fetch next
- if not end of set at end of set
- scan is at next before first after last
element - On insert
- scan is at element
- On delete
- scan is at the missing element
38SCAN States How they change
- On update scan position is not affected.
- if tuple moves (because ordering attributes
affected) - scan key position is unchanged
- Can create Halloween problem (give everybody a
10 raise) - But scan enumerates entire set.
39SCAN Data structure
- enum
- SCAN_STATE TOP, ON, BOTTOM, BETWEEN, NIL /
the 5 scan states / - enum
- ISOLATION UNCOMMITTED_READ,..., SERIALIZABLE,
READ_PAST, BOUNCE - typedef struct
- Uint scanid / handle for scan returned by
open_scan/ - TRID owner / which transaction uses the
scan / - FILE fileid / handle of file the scan is
defined on / - char scan_key / specification of scan key
attribute(s) / - char start_key / lower bound of scan
range / - char stop_key / upper bound of scan range
/ - char filter / qualifying predicate for all
tuples in scan / - ISOLATION isol_degree / locking policy
for tuples accessed / - SCAN_STATE scan_state / state of scan
pointer / - char scan_key / scan key the scan is
before, at, or after / - SCANCB, SCANCBP
40Entry Sequenced File Insert
- fix page descriptor page
- find eof page
- fix eof data page
- if no space in page
- lt see next slide for transaction to advance
pagegt - unfix descriptor page
- add record to page (updating on-page directory)
- generate log record (new value) and update page
LSN. - compute lock name of record (based on TID).
- get lock on record
- unfix data page.
- To make this work, MUST be assured lock is
available - Otherwise page sem can (undetected)deadlock with
lock wait - So, UNDO of entry-sequence insert does not free
the space, - it just invalidates the record.
41Entry Sequenced File Insert If EOF page or File
is Full
Top level transaction to extend file
- Begin new transaction (will not abort if insert
aborts) - to extend file EOF page. (leaves insert
transaction) - unfix directory page
- if file full, panic()
- start a top-level transaction
- fix the directory
- advance the page eof updating directory and
freespace - log the changes
- fix the data page
- format it
- log the change
- unfix the directory and data page
- commit the transaction resume insert
transaction - fix directory, fix eof, check to see that there
is room for the record.
42Entry Sequenced Operations
- Delete by RBA.
- get record lock
- (node,
- file,
- RBA) exclusive
- if timeout, deadlock, error
- return error
- Fix page
- Mark record invalid
- Generate log record
- Update page lsn
- Unfix page.
Read by RBA. get record lock (node,
file, RBA) shared if timeout,
deadlock, error return error Fix page if
record valid copy to buffer Unfix page Return
record or null
Note both must test that RBA lt EOF. Update,
ReadNext, ... are similar.
43Relative Files
- Records fit in fixed-length slots
- Operation on slots.
- Separate transactions extend the file EOF
(allocate and format pages)
44Relative Files
- Read Insert Update Delete by key are all
easy - Insert "near" key works by
- Plan A
- look at page
- Look at neighbor pages (left, right, left,
right,...) - Plan B
- allocate overflow page for base page
- Plan C
- Look in free-space bit-map or byte (full) map.
45Key Sequenced or Hashed Files
- Key sequenced
- is subject of next chapter.
46File Clustering
- Different record types kept in same page/file
- For example Master and detail records of an
invoice.Detail records always accessed if master
is. - SituationMaster key InvoiceNoDetail key
InvoiceNo Foreign Key References Master
SequenceNo -
- TechniqueHash or Key sequence Master on
InvoiceNoHash or Key Sequence Detail on
InvoiceNoSequenceNo in same table.
47Clustering different record types in a page
- One disc request gets the entire order.
- Concept works for any storage hierarchy
- Is natural for Hierarchical database systems.
48Summary
- Representing values
- Representing records
- storing records in pages and across pages
- Organizing records (entry, relative, key, hash)
- Examples of fix/log/log logic.