Notes #9 --- Direct Access Files
Direct access files allow rapid look-up of a single keyed record (typically O(1) average cost). The main issue is the method for converting the key to an address. There are three families of schemes:
- Use the key as the address.
- There is a separate location for each possible key value.
- Hardware-addressed files.
- The physical address is precisely the key value (translated).
- Relative-record files.
- The key is an offset from a base value for the file. It is possible to have the file in several segments (as we saw briefly for sequential files), in which case the base value depends on the range in which the key falls.
- The problem here is that the space needed by the file is proportional not to the size of the file, but to the number of possible key values.
- Compute the address using the key --- hash tables.
- Hash table size is typically much smaller than the number of possible keys, but larger (of course) than the actual file size. This means that many different key values hash to the same address, and also that a single table size will not necessarily remain appropriate for a volatile table.
- Most frequent hash table sizes are p=4k+1 for p prime, and 2^k or 3*2^k (particularly for dynamic hashing).
- Important questions:
- How is the address computed from the key? How is the table organized (e.g., blocking)?
- What is the resulting (average-case) complexity of
- Finding a record which is present?
- Finding one which is not present?
- Inserting a record?
- Deleting a record? (In fact, how does one delete a record?)
- How are collisions handled (a key hashing to an address which is already occupied)?
- How do we handle the situation when the file gets too big for the table?
- Computing the address:
- Several methods: remainder (most common), midsquare, digit extraction, folding, radix conversion, combinations.
- Methods other than remainder typically used in combination with remainder, or for data sets known to be bad for remainder.
- Buckets: typically small, but frequently not single-record.
- Collision resolution:
- Perfect hashing: if we know the keys in advance, we can come up with a function which maps each key to a unique address (see below).
- Probing: linear (1,2,3,...), quadratic (1,-1,4,-4,9,-9,...).
- Chaining: in-table, external (overflow area).
- Note in-table chaining hard to manage with BF>1.
- Rehashing.
- Deletion: Lazy delete and overwrite.
- Use lookup --- dictionary techniques.
- Dictionary is a list of (key,address) pairs, stored as
- sorted or unsorted linked list,
- search tree, or
- multi-level index.
- For large files, the dictionary itself must be accessed as an external file.
- Fairly close in spirit to indexed sequential files.
Questions:
Given a file of n records from a space of N possible keys, and assuming that
- any address computations are unit time,
- keys can be stored with BF = m, and
- the hash table has c cells with blocking factor b:
- What is the worst-case and average-case access time for each scheme?
- How much space overhead does each scheme incur?
- For each scheme (and for hash tables, for each collision-resolution technique),
- how are deletes handled?
- are range queries possible without additional processing?
Additional considerations:
Direct vs. indirect storage: store pointers to data rather than data.
- Dictionaries --- always indirect.
- Other two --- either alternative possible.
- Why would one or the other be preferred?
- Compare the two possibilities for time and space cost.
- Reorganization
- Linear-time read and rehash.
- Dynamic hashing and hash sequences (see below).
- Questions: Is it needed? Why or why not? How is it handled?
- How can we decide when to reorganize (in particular, what information do we need to keep?)?
Modifications of hash tables.
Perfect hashing.
- Assumption:
- _static_ data set of size n, known set of n actual keys (out of a much larger set of N possible keys).
- Objective: find a hash function which maps the keys to the values 0..n-1.
- Result: compact storage and fast access.
- Dynamic hashing.
- Assumptions:
- The data set is very active (lots of inserts (and possibly deletes))
- Size will vary significantly during its lifetime
- Reorganization is too time-consuming.
- Objective: allow the size of the hash table to vary with time, changing incrementally.
- Requires: use of (typically external) chaining for collision resolution.
- Approaches:
- k
-way search trees rooted at primary buckets (Miller).
- Will still need reorganization when tree becomes too deep.
- Table doubling and halving
- cells are split/merged as they are accessed (or if they are way out of synch).
- Result: Maximum number of probes is bounded with high probability; table density remains high.
HOMEWORK.
Consider a disk with sector size 256 bytes, and a data set of records of 2788 bytes, with possible keys 0 to 999, where records with the following keys are received, and are to be inserted in an empty structure, in order:
87, 214, 711, 62, 500, 915, 443, 459, 902,
817, 715, 377, 91, 350, 882, 266, 281.
Suppose we want to use relative record addressing. How many sectors are needed, if
- we use direct storage?
- we use indirect storage?
- How many sectors currently contain actual records?
- What is the data density?
- Suppose we want to use a dictionary, where each dictionary record takes 32 bytes, and the dictionary are maintained in key order. Now suppose the keys of problem #1 are stored in the dictionary.
- What is the maximum number of accesses needed to retrieve a record?
- What is the data density?
- Enter these keys in a hash table of size 29, using probe = key mod 29 as primary key, and resolving collisions using
- Linear probing.
- Linear probing, using a two-pass algorithm.
- External chaining.
- Now consider the following (single) sequence of accesses to the table of problem #3.
insert 451, insert 606, find 91, insert 817,
find 300, delete 459, insert 727, insert 701,
delete 4, insert 973, insert 240, find 843
Indicate for each operation:
the number of probes needed.
any data errors (redundant insert, erroneous delete, data not found).
Answers.
\\Mathsci\sys\PROF\MARLOWTO\COURSES\C2113\C2113N9A.DOC
PROGRAMMING
Preferably in C (but optionally, for less credit, in Pascal) write program(s) to manage the hash table in parts (c) and (d). The table will be an array of size 29 of records/structures consisting of a key, a pointer to a data record (which we won't actually use), and some flags (e.g., for initialization or delete). For external chaining, the structure will need an additional pointer for overflow. You may work in teams.
- Level 1: handle only inserts without errors, using one-pass linear probing.
- Level 2: add finds, deletes, and errors.
- Level 3: use external chaining, where the overflow area is an array of 12 of the key records.