Notes #13 --- Multikey files and complex queries

We will study two types of multikey file organization: tree-based, represented in this course by k-d-trees, and the more common list/pointer-based, here including multirings, multilists, and inverted lists.

The typical multikey organization includes a number of keys, K1, K2, ..., Kn. Each key allows a number of key values, KV1, KV2, ..., KVi_k (where different keys may use the same key value). There is also a primary key, and t ypically the primary key structure of the file is one of those we have discussed previously. For ease of discussion, we will assume that the organization keeps records in primary key order (e.g., indexed sequential, dictionary, relative record). Secondary keys may (or may not) possess any of the basic properties mentioned earlier, that is, singleton, unique, required. The records for a given key are linked, by triples which conceptually are of the form (K,KV,ptr).

 k-d-trees

List-structured multikey files.

List/pointer structures are far more common and more robust. There are several different types, the principal difference in which is the method of storing and accessing pointers, and in handling conjunctive (and) queries.

Queries

Recall that we identified basically five types of queries of increasing complexity. K, K1, K2 as usual represent arbitrary keys; KV, KV1, KV2 represent values for those keys.

Type

Form

Notes

simple

K = KV

 

range

K relop KV

KV1 relop K relop KV2

relop in <, >, <=, >=; = allowed

not equal (<>, !=, ~=) limited

boolean

(K1 relop KV1) boolop (K2 relop KV2)

boolop . . .

boolop in union, intersection,

difference

equational

test for relationships between pairs of KVs of different keys

e.g.,

K1.KV >= K2.KV

functionalor relational

more complicated relationships involving multiple keys, complicated functions, etc.

 

Boolean constraints use the operators

           boolop in { not, and, or, but not, exactly one of, like }

"But not" is the difference operator, and "exactly one of" is essentially the symmetric difference or "exclusive or". Expressions in which the principal operation is "not" are not typically used, as the resulting query is highly inefficien t. like is an operator for approximate matches (with wild cards) on string-valued keys.

Equational (also called relational) constraints can use the relational operators, constants, standard arithmetic operators, min and max. Several clauses may be combined by boolean operators and parentheses.

We want to look primarily at boolean and equational queries (simple and range queries can be handled for primary keys by the single key file structures we have already studied, and for secondary keys are easy special cases of the methods for boolean co nstraints).

We are concerned here principally with the first three types of queries for list-structured multikey files, although I will discuss the fourth briefly. We will assume (for ease of discussion) that none of the keys involved is the primary key, and in bo olean queries (a) that there are exactly two clauses, and (b) usually, that both of them are simple.

Suppose there are k distinct keys, each (for simplicity) with v values. Assume that

These costs will be paid regardless of approach, and are small, so we will ignore them hereafter. The "+ 1" comes from the nil pointer at the end of the list.

The boolean queries we consider are of the form ((K1 = KV1) boolop (K2 = KV2)).

The table below looks at complexity for some queries. For inverted lists, the first line is the cost of list traversal and merger, the second the cost of accessing the records. The write cost in each case is the size of the solution.

 Table 1 : Costs of Data Access for Simple Boolean Queries

 Table 2 : Costs of Data Access for Boolean Queries with Ranges

Compound queries and temporary files

For inverted lists and more complicated queries, involving three or more clauses, temporary lists of pointers, and for large files, temporary files of pointers, may have to be created. (In contrast, in multilists, the problem is that we need to keep a file buffer for each open file. If there are too many (K,KV) pairs, then we will have to store a file corresponding to the set of records solving a part of the query.)

Exercises

  1. Assume that the four (K,KV) pairs in the following query have sizes 200, 300, 150, and 225, respectively, and that the answers to the subqueries in the natural order of evaluation are 425, 350, and 400. Assume the blocking f actor for inverted lists is 10.
          ((((K1 = KV1) or (K2 = KV2)) - (K3 = KV3)) d (K4 = KV4))
  1. For the same pairs, consider the following query:
          ((((K1 = KV1) and (K2 = KV2)) or ((K3 = KV3) and (K4 = KV4)))

Comparison of multikey list structures

 Table 3 : Advantages and Disadvantages of Multikey File Organizations

Characteristics used to decide on structures include:

  1. number of keys and key values
  2. characteristics of keys (unique, singleton, required, etc.)
  3. typical number of records per (K,KV) pair and variability
  4. typical number of (K,KV) pairs per record and variability
  5. typical number of records per query answer and variability
  6. frequency of various forms of queries
  7. frequency and types of updates
  8. frequency of key and KV insertion/deletion
  9. variability in length of data records
  10. response-time requirements for (a) primary-key retrieval, (b) queries, (c) update
  11. response-time requirements for print, whether in order, and whether KVs are needed
  12. primary-key file organization, total size, and expected changes in size
  13. organization used for key and KV lists, and inverted lists if used
  14. relative costs of processing and secondary memory access

 We consider only the first eight of these here.

 Hybrid structures

Introduction to  query optimization

Introduction to Database management systems  (DBMS)