Relational Keys

02 Feb 1996 updated

The word "key" is much used and abused in the context of relational database design. In pre-relational databases (hierarchtical, networked) and file systems (ISAM, VSAM, et al) "key" often referred to the specific structure and components of a linked list, chain of pointers, or other physical locator outside of the data. It is thus natural, but unfortunate, that today people often associate "key" with a RDBMS "index". We will explain what a key is and how it differs from an index.

According to Codd, Date, and all other experts, a key has only one meaning in relational theory: it is a set of one or more columns whose combined values are unique among all occurrences in a given table. A key is the relational means of specifying uniqueness.

There are only three types of relational keys (foreign keys are another issue and discussed separately):

Candidate Key
As stated above, a candidate key is any set of one or more columns whose combined values are unique among all occurrences (i.e., tuples or rows). Since a null value is not guaranteed to be unique, no component of a candidate key is allowed to be null.

There can be any number of candidate keys in a table (as demonstrated elsewhere). Relational pundits are not in agreement whether zero candidate keys is acceptable, since that would contradict the (debatable) requirement that there must be a primary key.

Primary Key
The primary key of any table is any candidate key of that table which the database designer arbitrarily designates as "primary". The primary key may be selected for convenience, comprehension, performance, or any other reasons. It is entirely proper (albeit often inconvenient) to change the selection of primary key to another candidate key.

As we will see below, there is no property of a primary key which is not shared by all other candidate keys in of the table except this arbitrary designation. Unfortunately E-R methodologies and RDBMS products have come to rely on the primary key almost to the exclusion of the concept of candidate keys, which are rarely supported by software.

Alternate Key
The alternate keys of any table are simply those candidate keys which are not currently selected as the primary key. According to {Date95} (page 115), "... exactly one of those candidate keys [is] chosen as the primary key [and] the remainder, if any, are then called alternate keys." An alternate key is a function of all candidate keys minus the primary key.

Therefore it is completely incorrect and misleading that several data modeling products (e.g., ERwin, Silverrun, PowerDesigner) provide "alternate keys" which are selected arbitrarily by the designer, without qualifying the alternates as unused candidate keys, in fact without any mechanism to even designate candidate keys.

While regrettably few data modeling products recognize candidate keys, most do correctly implement the uniqueness specification of primary and alternate keys. DDL is usually generated either to declare a primary or alternate key, where appropriate (e.g., Oracle 7.0), or to create a unique index (e.g., DB2, Sybase 4.x) for each such key.

There is an insidious trap in the fallacy of not supporting candidate keys and providing arbitrary primary and alternate keys instead. Designers become accustomed to the fact the a primary key is unique and often assign a primary key just to obtain uniqueness. If this requirement of uniqueness per se is not recorded, then there is a danger that at some point someone else may eventually change the primary key (for example, to a system assigned number) and lose the original enforcement of uniqueness on what had been the primary key.

Let's look a this simple association or join table below which holds student class enrollment:

This is the default form, and often the only form taught or recommended for a join table. The primary key of each contribution table is inherited so that this table has a compound primary key. Since the DDL will create a unique index on the two columns, the designer knows that each student-class pair will be unique; i.e., each student may enroll in each class only once.

Several years later a new DBA decides that it is inefficient to use two columns for the primary key where one would do. She adds a "row id" column and makes it the primary key by loading it with a system counter. This is fine as far as an identity for each row. But now nothing prevents a student from enrolling in the same class multiple times!

This happened because the data model did not retain a candidate key property on the two original columns when the primary key was changed. Therefore the new DBA had no direct way of knowing (other than text notes somewhere) that these two columns must still remain unique, even though they are no longer part of the primary key.

Notice here how this could have been handled automatically by the model, if it had captured candidate keys in the first place and then generated alternate keys as a function of those candidates not in the primary key. The original two columns remain unique even after they are no longer primary

So the lesson is: do not confuse a candidate key specification of uniqueness with the arbitrary selection of primary key. As always, {Date95} sums it up nicely, on page 128: "Candidate keys are important because they provide the tuple-level addressing mechanism in the relational model."

We have used the word index only in the context of implementing uniqueness. A key is about uniqueness, not access, while an index is about access, and not necessarily uniqueness. In concept, an index is merely any set of one or more columns from a given table, sorted an any arbitrary sequence. Thus an index is a sorted sub-set of the data in the table to which it refers, with pointers (yes, pointers) to the actual data row. The columns included in an index are not "keys". They are just index columns.

Where to go from here:

Copyright © 1996 Applied Information Science International