Notes #10 --- Indexed Sequential Files
TABLE:
Overview of Single-key File TypesBinary vs. k-ary trees
: While binary trees give optimal access performance for internal data structures, the disparity between internal and external times results in larger blocking factors being preferable for external trees. In a k-ary tree, each internal node will contain a number n of keys, where n will be between 1 and k-1, and n+1 pointers. The decision algorithm for inserting newkey will be a generalization of the binary decision algorithm:if (newkey < key_1)
follow pointer_0;
else if (newkey < key_2)
follow pointer_1;
else ...
else /* newkey > key_n */
follow pointer_n;
Typical k are on the order of 20. It is possible to use different k on different levels. In particular, the root level and/or the leaf nodes may have different blocking factors.
Some characteristics/decisions:
We will look at three machine organizations for indexed sequential files, and will later discuss some programming approaches:
ISAM files
The ISAM file structure is characterized by a set of index blocks indexing physical disk structures: tracks of cylinder index blocks (master index), cylinders (cylinder index), and tracks (track index). The first two consist of (key, address) pairs, where key is the maximum key covered in the next level of the index.
The track index consists of pairs of such pairs: the first is the (key,address) pair for the track; the second is a (key,address) pair for overflow from the track (and a pointer to a sector in overflow.
Overflow is either local (cylinder) overflow --- a number of reserved tracks on each cylinder, or global overflow --- one or more cylinders reserved for overflow. Data tracks contain (key,data) pairs. Overflow, whether local or global, consists of (key,address,pointer) triples, so that the overflow from a given track forms a linked list (with items typically inserted at the end). Actually, ISAM records typically also need a number of flags for initialized/uninitialized, active/deleted, etc.
find and insert are fairly straightforward; delete, as in hash tables, uses a lazy delete. When there is either too much overflow, or too many deleted records, the file should be reorganized.
VSAM files
The VSAM file structure is characterized by (1) data blocks (control areas and control intervals) with no necessary physical contiguity, and (2) a threading of index blocks at the level immediately above data (sequence set).
The upper levels, down to the sequence set, are managed more-or-less as a B-tree (to be seen shortly), and should retain some physical locality. The non-contiguity of data blocks allows for graceful handling of non-uniform distributions of inserts, and for non-uniform record length. The threading above data level allows efficient sequential and range access.
New memory is allocated in contiguous control areas, divided into a number of control intervals. Each control interval consists of a number of keys and records, arranged more or less as follows.
data1 data2 ... free_space ... key2 key1 nrecs
len2 len1 free_len ptr2 ptr1 free_ptr
find
is simple. insert and delete are easy unless a control interval becomes full, or has too few items. In these cases, need to use rotations and merge and split algorithms. Reorganization is never necessary, although it may sometimes improve performance.SIS files
A hybrid between ISAM and VSAM (although independently developed).