Title: Physical Data Organization and Indexing
1Physical Data Organization and Indexing
- Disk Organization
- Heap Files
- Sorted Files
- Indexing
- Search Trees, B-Trees and B-trees
2Disk Organization
A disk unit contains one or more platters
attached to a rotating spindle. The surfaces of
each platter are coated with magnetic material,
and the direction of magnetism at a particular
spot determines whether that spot records a one
or a zero. When power goes off, the information
stays. Each surface is accessed through an
associated read/write head that is capable of
either detecting or setting the direction of
magnetism on the spot over which it is
positioned. The head is attached to an arm,
which is capable of moving radially, toward
either the center or the circumference of the
platter.
3Figure 11.1Physical organization of a disk
storage unit.
4Information is stored in on a disk surface in
concentric circles called tracks. The tracks
with the same diameter on the various surfaces
are called a cylinder. Each track is divided
into sectors. Sectors are the smallest units of
transfer allowed by the hardware. Disks are
extremely slow devices compared to CPUs. A CPU
can execute hundreds of thousands of instructions
in the time it takes to access a sector. The term
page generally denotes the unit of data
transferred with each I/O operation. A page is
stored on the disk in a disk block, which is a
sequence of adjacent sectors on a track.
5- Access time for disk
- Seek time the time to position the read/write
head over the proper cylinder. - Rotational delay the time to rotate to the
proper sector. - Settling time the time to place the head in
contact with the disk. - Data transfer time the time it takes for the
platter to rotate through the angle subtended by
the sector(s) that contains the data (i.e. the
time to transfer the data from the disk to main
memory).
6Heap Files
- A heap file is also called a pile file or a
sequential file. It is an unordered set of
records. The following operations are supported - Heap files can be created and destroyed.
- Existing heap files can be opened and closed.
- Records are uniquely identified by a record id
(rid). A specific record can be retrieved by
using the record id. - Sequential scans on heap files are also
supported.
7- Records can be inserted. To insert a record, the
last disk block is copied into a buffer the new
record is added and the block is then rewritten
back to the disk. - Records can be deleted. To delete a record, a
program must first find the block that contains
the record the block is copied into a buffer
the record is removed and the block is then
rewritten back to the disk.
8A heap file is an efficient storage structure if
queries on the table involve accessing all rows
and if the order in which the rows are accessed
is not important. For example SELECT FROM
Transcript SELECT AVG(T.Grade)FROM Trandcript
T
9Sorted Files
- A sorted file is a file of which the records are
ordered based on the values of the ordering
field. If the field is a key, it is also called
the ordering key. - Advantages
- Reading the records in order of the ordering key
valuesbecomes extremely efficient. - Finding the next record from the current one in
order of the ordering key usually requires no
additional block accesses. - Binary search can be applied to a sorted file.
10Binary Search Algorithm
if (target lt thingfirst target gt
thinglast) return NOT_FOUND while (first lt
last) mid (firstlast)/2 / truncate to
integer / if (target lt thingmid) last
mid else if (target gt thingmid) first
mid1 else return mid if (target
thinglast) return last return NOT_FOUND
11Inserting and deleting records are expensive
operations for an ordered file because the
records must remain physically ordered. One
option for making insertion more efficient is to
keep some unused space in each block for new
record. Another option is to create a temporary
unordered file called an overflow or transaction
file. With this technique, the actual ordered
file is called the main or master file. New
records are inserted at the end of the overflow
file. Periodically, the overflow file is sorted
and merged with the master file during file
reorganization.
12Indexing
- Single-Level Ordered Indexes
- Primary Indexes
- Clustering Indexes
- Secondary Indexes
- Multilevel Indexes
- B-Trees and B-trees
- Search tree
- B-tree
- B-tree
- Hash Indexing
13Types of Single-Level Ordered Indexes Primary
Indexes If the field to be indexed is a key
field and the records of the data file are
physically ordered on that key (primary key), we
can create a primary index. A primary index
analogous to a book index. Each index entry
contains a term and a pointer ltK(i), P(i)gt.
There is one index entry in the index file for
each block in the data file. The field K(i) has
the value of the primary key field (ordering key
field) for the first record in the block, and
P(i) is a pointer to that block. (Fig. 6.1)
14(No Transcript)
15Clustering Indexes If the field to be indexed
is a nonkey field and the records of the data
file are physically ordered on that key
(clustering field), we can create a clustering
index. A clustering index is an ordered file
whose records has two fields ltK(i), P(i)gt. There
is one index entry in the index file for each
distinct value of the clustering field. The field
K(i) has the value of the clustering field
(physically ordered nonkey field), and P(i) is a
pointer to the first block in the data file that
has a record with the value specified in K(i).
(Fig. 6.2, Fig. 6.3)
16(No Transcript)
17Secondary Indexes If the field to be indexed is
a key field and the records of the data file are
not physically ordered on that key (second key),
we can create a secondary index. A secondary
index is an ordered file whose records has two
fields ltK(i), P(i)gt. There is one index entry in
the index file for each record (dense index). The
field K(i) has the value of the secondary key for
a record, and P(i) is a pointer to the block in
which the record is stored or to the record
itself. (Fig. 6.4 ,Fig. 6.5).
18(No Transcript)
19If the field to be indexed is a nonkey field, we
have several options Option 1. To include
several index entries with the same K(i)
values-one for each record. This would be a
dense index. Option 2. To have variable-length
records for the index entries. We keep a list of
pointers ltP(i,1)gt, ..., P(i,k)gt in the index
entry for K(i). Option 3. P(i) points to a block
of record pointers. Each record pointer in that
block points to one of the data file record with
value K(i) for the indexing field.
20- Multilevel Indexes
- The index file is a file ordered on K(i). We can
create a primary index for the index file. This
index to the original index file (first level or
base level) is called the second level of the
multilevel index. We can repeat this process and
create a primary index (third level) for the
second level. We can repeat this process until
all the entries of some index level t fit in a
single block. This block at the t-th level is
called the top index level. - The multilevel scheme can be used on any type of
index, whether it is primary, clustering, or
secondary--as long as the first level index has
distinct values for K(i). - Fig. 6.6
21A common file organization used in business
data processing is an ordered file with a
multilevel primary index on its ordering key
field. Such an organization is called an indexed
sequential file and was used in a large number of
early IBM systems. Insertion is handled by some
form of overflow file that is merged periodically
with the data file. The index is re-created
during file reorganization.
22(No Transcript)
23- Search Trees, B-Trees and B-trees
- To retain the benefits of using multilevel
indexing while reducing index insertion and
deletion problems, designers adopted a multilevel
index that leaves some space in each of its
blocks for inserting new entries. This is called
a dynamic multilevel index and is often
implemented by using data structures called
B-trees and B-trees,
24- Search Trees
- A search tree of order p is a tree such that each
node contains at most p - 1 search values and p
pointers in the order lt P1, K1, P2, K2, ...,
Pq-1, Kq-1, Pq gt, where qltp each Pi is a
pointer to a child node (or a null pointer) and
each Ki is a search value from some ordered set
of values. All search values are assumed to be
unique. - Two constraints must hold at all times on the
search tree - Within each node, K1 lt K2 lt ... lt Kq-1.
- For all values X in the subtree pointed at by Pi,
we have Ki-1 lt X lt Ki for 1 lt i lt q X lt Ki for i
1 and Ki-1 lt X for i q (see Figure 06.08).
25B-tree Informally, a B-tree of order p is a
balanced search tree in which every internal node
has at most p, and at least CEILING(p/2),
children.
26More formally, a B-tree of order p, can be
defined as follows Each internal node in the
B-tree (Figure 06.10a) is of the form ltP1, ltK1,
Pr1gt , P2, ltK2, Pr2gt , ..., ltKq-1,Prq-1gt , Pqgt
where q lt p. Each Pi is a tree pointer. Each Pri
is a data pointer. Within each node, K1 ltK2 lt ...
lt Kq-1. For all search key field values X in the
subtree pointed at by Pi, we have Ki-1 lt X lt Ki
for 1 lt i lt q X lt Ki for i 1 and Ki-1 lt X for
i q. Each node has at most p tree pointers.
Each node, except the root and leaf nodes, has at
least (p/2) tree pointers. The root node has at
least two tree pointers unless it is the only
node in the tree. All leaf nodes are at the same
level. Leaf nodes have the same structure as
internal nodes except that all of their tree
pointers Pi are null.
27B-tree A B-tree is a B-tree in which data
pointers are stored in the leaves. The structure
of the internal nodes of a B-tree of order p is
as follows Each internal node is of the form
ltP1, K1, P2, K2, ..., Pq-1, Kq-1, Pqgt where q
lt p and each Pi is a tree pointer. Within each
internal node, K1 lt K2 lt ... ltKq-1. For all
search field values X in the subtree pointed at
by Pi, we have Ki-1 lt X lt Ki for 1 lt i lt q X lt
Ki for i 1 and Ki-1 lt X for i q.
28Each internal node has at most p tree pointers.
Each internal node, except the root, has at
least (p/2) tree pointers. The root node has at
least two tree pointers unless it is the only
node in the tree. An internal node with q
pointers, q lt p, has q - 1 search field values.
29The structure of the leaf nodes of a B-tree of
order p is as follows Each leaf node is of the
form ltltK1, Pr1gt , ltK2, Pr2gt, ..., ltKq-1, Prq-1gt,
Pnextgt where q lt p, each Pri is a data pointer,
and Pnext points to the next leaf node of the
B-tree. Within each leaf node, K1 lt K2 lt ... lt
Kq-1, q lt p. Each Pri is a data pointer that
points to the record whose search field value is
Ki or to a file block containing the record (or
to a block of record pointers that point to
records whose search field value is Ki if the
search field is not a key). Each leaf node has
at least (p/2) values. All leaf nodes are at the
same level.
30Figure 11.17Schematic view of a B tree.
31Figure 11.15An example of an ISAM
(index-sequential access method) index.
32Figure 11.18Portion of the index of Figure 11.15
after insertion of an entry for vince.
33Figure 11.19Index subtree of Figure 11.18 after
the insertion of vera has caused the split of a
leaf page.
34Figure 11.20Index subtree of Figure 11.19 after
the insertion of rob has caused the split of a
leaf page and of an index page.
35Figure 11.21B tree that results from the
insertion of vince , vera , and rob into the
index of Figure 11.15.
36- Hash Indexing
- A hash index divides the index entries
corresponding to the data records of a table into
disjoint subsets (buckets) in accordance with
some hash function h. - A bucket is either one disk block or a cluster of
contiguous blocks. The bucket in which an index
entry is inserted is determined by h(v), where v
is the value of the search key. - Hash functions are chosen such that the number of
index entries in each bucket is roughly the same.
37For example, h might be defined as h(v) (avb)
mod M Where a and b are constants chosen to
optimize the way the function randomizes over the
search key values, M is the number of buckets,
and v is a value determined by the search key.
38Figure 11.23Schematic depiction of a hash index.