Notes #13 --- Multikey files and complex queries
We will study two types of multikey file organization: tree-based, represented in this course by k-d-trees, and the more common list/pointer-based, here including multirings, multilists, and inverted lists.
The typical multikey organization includes a number of keys, K1, K2, ..., Kn. Each key allows a number of key values, KV1, KV2, ..., KVi_k (where different keys may use the same key value). There is also a primary key, and t ypically the primary key structure of the file is one of those we have discussed previously. For ease of discussion, we will assume that the organization keeps records in primary key order (e.g., indexed sequential, dictionary, relative record). Secondary keys may (or may not) possess any of the basic properties mentioned earlier, that is, singleton, unique, required. The records for a given key are linked, by triples which conceptually are of the form (K,KV,ptr).
k-d-trees
List-structured multikey files.
List/pointer structures are far more common and more robust. There are several different types, the principal difference in which is the method of storing and accessing pointers, and in handling conjunctive (and) queries.
Queries
Recall that we identified basically five types of queries of increasing complexity. K, K1, K2 as usual represent arbitrary keys; KV, KV1, KV2 represent values for those keys.
|
Type |
Form |
Notes |
|
simple |
K = KV |
|
|
range |
K relop KV KV1 relop K relop KV2 |
relop in <, >, <=, >=; = allowed not equal (<>, !=, ~=) limited |
|
boolean |
(K1 relop KV1) boolop (K2 relop KV2) boolop . . . |
boolop in union, intersection, difference |
|
equational |
test for relationships between pairs of KVs of different keys |
e.g., K1.KV >= K2.KV |
|
functionalor relational |
more complicated relationships involving multiple keys, complicated functions, etc. |
|
Boolean constraints use the operators
boolop in { not, and, or, but not, exactly one of, like }
"But not" is the difference operator, and "exactly one of" is essentially the symmetric difference or "exclusive or". Expressions in which the principal operation is "not" are not typically used, as the resulting query is highly inefficien t. like is an operator for approximate matches (with wild cards) on string-valued keys.
Equational (also called relational) constraints can use the relational operators, constants, standard arithmetic operators, min and max. Several clauses may be combined by boolean operators and parentheses.
We want to look primarily at boolean and equational queries (simple and range queries can be handled for primary keys by the single key file structures we have already studied, and for secondary keys are easy special cases of the methods for boolean co nstraints).
We are concerned here principally with the first three types of queries for list-structured multikey files, although I will discuss the fourth briefly. We will assume (for ease of discussion) that none of the keys involved is the primary key, and in bo olean queries (a) that there are exactly two clauses, and (b) usually, that both of them are simple.
Suppose there are k distinct keys, each (for simplicity) with v values. Assume that
These costs will be paid regardless of approach, and are small, so we will ignore them hereafter. The "+ 1" comes from the nil pointer at the end of the list.
The boolean queries we consider are of the form ((K1 = KV1) boolop (K2 = KV2)).
The table below looks at complexity for some queries. For inverted lists, the first line is the cost of list traversal and merger, the second the cost of accessing the records. The write cost in each case is the size of the solution.
Table 1 : Costs of Data Access for Simple Boolean Queries
Table 2 : Costs of Data Access for Boolean Queries with Ranges
Compound queries and temporary files
For inverted lists and more complicated queries, involving three or more clauses, temporary lists of pointers, and for large files, temporary files of pointers, may have to be created. (In contrast, in multilists, the problem is that we need to keep a file buffer for each open file. If there are too many (K,KV) pairs, then we will have to store a file corresponding to the set of records solving a part of the query.)
Exercises
((((K1 = KV1) or (K2 = KV2)) - (K3 = KV3)) d (K4 = KV4))
((((K1 = KV1) and (K2 = KV2)) or ((K3 = KV3) and (K4 = KV4)))
Comparison of multikey list structures
Table 3 : Advantages and Disadvantages of Multikey File Organizations
Characteristics used to decide on structures include:
We consider only the first eight of these here.
Hybrid structures
Introduction to
query optimizationIntroduction to Database management systems
(DBMS)