Surrogate Key
A surrogate key is a system-generated identifier with no business meaning, used as the primary key of a database table. It is also called a synthetic key or meaningless key. Examples: auto-incrementing integers (BIGSERIAL), UUIDs.
Surrogate keys contrast with natural keys — identifiers that come from the business domain and carry meaning (e.g., customer_email, product_sku, tax_id).
Surrogate Key vs Natural Key
| Surrogate Key | Natural Key | |
|---|---|---|
| Source | System-generated (SERIAL, UUID) | Business domain value |
| Stability | Immutable — never changes | Can change (email updates, sku reuse) |
| Meaning | No business meaning | Encodes business data |
| Privacy | No data exposed in URLs/APIs | Exposes business data |
| Uniqueness | Guaranteed by system | Requires validation and deduplication |
| FK join performance | BIGINT is fast; UUID is slower | Depends on data type and uniqueness |
| SCD support | Multiple rows per entity — surrogate key distinguishes them | Natural key cannot support SCD history rows |
Why Surrogate Keys Are Essential in Data Warehouses
In dimensional modeling (Ralph Kimball methodology), dimension tables always use surrogate keys as their primary key. The natural key (source system business identifier) is retained as a separate column but is not the PK.
Three reasons Kimball mandates surrogate keys for dimensions:
-
Slowly Changing Dimensions (SCD Type 2): When a customer's address changes, SCD Type 2 inserts a new row with a new surrogate key. The fact table FK points to the specific version of the customer as of the transaction date. If the natural key (
customer_id) were the PK, there could only be one row per customer — no history. -
Source system independence: Source system keys change. Companies merge, primary keys get reused after deletions, or you're loading from multiple source systems with overlapping key spaces. Surrogate keys shield the warehouse from all of this.
-
NULL handling: Source systems sometimes send NULL or unknown values for foreign keys. A surrogate key approach handles this with a special "unknown" dimension row (surrogate key = -1 or 0) — impossible with natural keys.
Surrogate Key Implementation
PostgreSQL (integer):
CREATE TABLE dim_customer (
customer_key BIGSERIAL PRIMARY KEY, -- surrogate key
customer_id BIGINT NOT NULL, -- natural key from source
email TEXT NOT NULL,
name TEXT NOT NULL,
valid_from DATE NOT NULL,
valid_to DATE,
is_current BOOLEAN NOT NULL DEFAULT TRUE
);
PostgreSQL (UUID):
CREATE TABLE dim_product (
product_key UUID DEFAULT gen_random_uuid() PRIMARY KEY, -- surrogate key
product_id TEXT NOT NULL, -- natural key (e.g., SKU from ERP)
name TEXT NOT NULL,
category TEXT NOT NULL
);
When to use integer vs UUID:
BIGSERIAL(integer): faster joins, smaller storage, simpler debugging. Best for internal warehouse tables.UUID: better for distributed systems, external API exposure, merging data across environments.
TalkingSchema: Automatic Surrogate Key Generation
When you ask TalkingSchema to design a star schema or data warehouse model, surrogate keys are generated automatically for every dimension table:
Example prompt:
"Create a star schema for our e-commerce analytics. Include dimensions for customers, products, and dates, with SCD Type 2 for customers."
TalkingSchema generates dim_customer.customer_key BIGSERIAL PRIMARY KEY alongside customer_id BIGINT (natural key), valid_from DATE, valid_to DATE, and is_current BOOLEAN — the full SCD Type 2 pattern.
Frequently Asked Questions
Do transactional (OLTP) tables need surrogate keys?
In OLTP schemas, using a surrogate integer or UUID primary key is also recommended, but for different reasons: privacy (avoid exposing business data in URLs), stability (natural keys like email change), and consistency. The standard practice for OLTP is: surrogate PK + unique constraint on the natural key.
What is the "unknown" surrogate key?
A convention from Kimball dimensional modeling: reserve surrogate key value -1 (or 0) in every dimension table to represent "unknown" or "not applicable." When a fact row has a NULL or unresolvable FK, it points to the -1 row instead of NULL. This allows aggregations to work correctly without NULL handling in every query.
Does TalkingSchema use BIGSERIAL or UUID for surrogate keys?
TalkingSchema defaults to BIGSERIAL for dimension table surrogate keys (integer, auto-increment) following standard Kimball practice. You can explicitly request UUID surrogate keys with a prompt: "Use UUID primary keys for all dimension tables."