Slowly Changing Dimensions: SCD Type 1, Type 2, and Type 6 in the Schema

April 18, 202610 min read

ETL is the bullied kid of the data stack. Dashboard wrong? Blame the ETL engineer. Patient count drops 8% on a Monday? ETL. Cost-sharing numbers shift overnight and the CFO wants answers by 9 AM? ETL again. The Slack thread always lands on the same person — who almost certainly isn't the one who introduced the lie.

Let's look at one of those lies. A patient was reclassified from a Bronze insurance tier to a Gold tier in March, but the claims report has been grouping her Bronze claims under the Gold bucket ever since. The ETL job that loaded the new tier overwrote the old value. Every historical encounter that happened under Bronze coverage now looks like Gold.

For context: the healthcare warehouse canvas below shows the source systems and warehouse schema we're working with.

This is what a missing SCD Type 2 decision looks like in production. It's almost never caught by data quality tests, because every row is individually valid. The lie is in the history, not the data. And it didn't originate in the ETL code where it eventually surfaces — it was introduced six months earlier, at the schema-design layer, when someone made an SCD typing decision (or silently failed to) and never wrote it down anywhere outside of a MERGE statement.

I've been designing data warehouse dimension layers long enough to have strong opinions here. The SCD taxonomy is one of the most decision-dense areas of dimensional modeling — eight types with overlapping tradeoffs, no universal agreement on when to use what, and performance consequences that bite teams six to twelve months after they think they've done it right. This post is my attempt to cut through all of it.

The stakes are real. In healthcare analytics, getting the SCD type wrong on an insurance dimension means every cost-sharing analysis and prior-authorization report is built on a fiction. In retail, a wrong SCD decision on a customer segment dimension means your cohort analysis groups customers by who they are now, not who they were when they made the purchase you're analyzing. Reports look fine. Decisions built on them are wrong.

SCD Types 0–7: The full taxonomy (and which ones actually matter)

Ralph Kimball introduced slowly changing dimensions in 1996 with three core types. The third edition of The Data Warehouse Toolkit (2013) formalized Types 0, 4, 5, 6, and 7 in Design Tip #152. As the Kimball Group noted at the time: "We sometimes wish Ralph had given these techniques more descriptive names, such as 'overwrite' instead of 'type 1.' But at this point, the SCD type numbers are part of our industry's vernacular."

That's an understatement. The type numbers are now so embedded that teams debate them without agreeing on definitions.

Here's the full map — with brutal honesty about which ones you'll actually use:

Type	Name	What it does	When you actually use it
0	Retain Original	Attribute never changes once set	Natural keys, enrollment dates, CMS certification numbers
1	Overwrite	New value overwrites old; no history	Name corrections, code description updates, typos
2	Add Row	New row per change; surrogate key changes	Any attribute driving historical analysis
3	Add Column	Previous value stored in a new column	"Current vs. prior" comparisons where only one history step matters
4	Mini-Dimension	Volatile attributes split into a separate table	High-cardinality, frequently-changing demographic clusters
5	Mini-Dimension + Outrigger	Type 4 with a current-profile key in the base dimension	Accessing current mini-dim attributes without joining through a fact table
6	Hybrid (1+2+3)	Type 2 rows + current-value copy overwritten on all rows	When Type 2 query complexity is unbearable but you still need history
7	Dual Key	Fact table carries both surrogate key and durable key	Simultaneously querying "as-was" (surrogate) and "as-is" (durable) perspectives

The ones that matter in practice: Type 1, Type 2, Type 6. Types 3, 4, and 5 are real techniques for real problems, but most dimension layers never need them. Types 0 and 7 are important conceptually but rarely named explicitly in ETL code.

Where SCD typing decisions actually get made today

For most teams, twice — and neither time is when it should be.

The first time is informally, in a Slack thread when the dimension is first modeled. Someone says "let's make it Type 2." Someone says "sounds good." Six months later, nobody can find that thread.

The second time is in the ETL code itself, when an engineer is writing the dbt snapshot config or the Snowflake MERGE and has to choose: insert a new row, overwrite, or both. The original reasoning has evaporated by then. The engineer makes the call again — often without context, often differently from what was agreed.

The cost is bigger than the bug. The most expensive anti-pattern I've seen in production is the opposite of under-historizing: modeling every dimension attribute as Type 2, defensively, because nobody wanted to make the per-attribute decision. That's how you end up with a dim_customer table at 200 million rows and 40 version columns, most of which track changes nobody ever queries. A 10M-row patient dimension with three plan or provider changes per year compounds to 150M rows in five years — and every query that forgets is_current_row = true scans all of them. Snowflake reads more micro-partitions than the query needs, BigQuery scans the full table. The performance tax is real.

The fix isn't a stricter PR review or a better Confluence template. It's moving the decision upstream — out of the ETL code and into the schema itself, before any pipeline is built.

Designing the warehouse in conversation

The rest of this post is built around a real schema: a healthcare analytics warehouse for a hospital network, tracking patient encounters, insurance claims, provider affiliations, and lab results across multiple facilities. Almost every dimension in it makes a non-obvious SCD typing call.

Open full screen ↗

The interesting artifact isn't the final schema — it's that every SCD type decision is named, justified, and captured in the schema before the first ETL pipeline runs.

Open the canvas. You're looking at two things:

Sources — the upstream OLTP footprint as it actually exists in production: an EHR, a claims processing system, a provider directory, and the CMS-published ICD-10 reference feed.
Warehouse — built against those sources: a staging layer, SCD-typed dimensions, encounter and lab facts, and a denormalized patient outcome mart.

The warehouse was designed through a back-and-forth with the AI. With the source schema already imported, you describe the analytical needs the warehouse has to serve — a patient dimension that tracks insurance tier changes, a facility dimension that doesn't, fact tables that join both correctly. The AI scaffolds the dimensions with version-control plumbing already in place, each attribute's SCD type chosen and annotated as a column note.

Hover any dimension column and you see the type and the reasoning:

dim_patient.insurance_tier — Type 2, because cost-sharing analysis depends on the historical value.
dim_patient.current_insurance_tier — Type 6 hybrid, so current-state dashboards don't have to join through version history.
patient_sk and patient_durable_key on fact_patient_encounter — the Type 7 dual-key pattern that lets the same fact join either as-was or as-is.

Six months from now, when an analyst is staring at two columns wondering which to use, the answer is one hover away in the schema — not a Slack search.

The most valuable moment of the conversation, though, comes after the schema is drafted: asking the AI to audit it.

"Review the SCD type assignments across all dimension tables. Flag any attributes that might be over-engineered as Type 2 or under-engineered as Type 1."

The AI flagged race_ethnicity on dim_patient:

"This is currently marked as Type 1 (overwrite). If your analytics use cases include population health stratification by race/ethnicity, and patients can legitimately change their self-reported race/ethnicity over time, this might warrant Type 2. However, if changes are corrections to initial data entry errors, Type 1 is correct. Please confirm the business rule."

The AI doesn't have the business answer. It has the discipline to surface the ambiguity at design time — the kind of decision that otherwise gets missed in Confluence and ETL reviews, then surfaces six months later when a population health report produces inexplicable results.

Where this fits alongside your existing stack

You already have a data stack. dbt for transformations, Snowflake or BigQuery or Databricks for the warehouse, Airflow or Dagster for orchestration, maybe Hudi or Delta Lake for high-frequency CDC into the lakehouse. None of those tools is the right layer to make the SCD type decision in — they're the layer to implement it. The decision belongs upstream of all of them, at the schema-design layer.

That's the gap TalkingSchema fills. The SCD type decision happens in conversation, grounded in the actual schema, and gets captured as a column-level annotation (note: "SCD Type 2 — required for cost-sharing analysis") that becomes the source of truth every downstream tool reads from:

dbt snapshots get configured against the type the schema chose — and the line in schema.yml is no longer a re-litigation of the original decision, just a transcription of it.
Snowflake MERGE statements and BigQuery temporal tables implement the version pattern the schema named, with the partition strategy (sentinel 9999-12-31 for current rows on BigQuery, Streams + Tasks on Snowflake) chosen against an explicit, annotated dimension rather than a guessed one.
Hudi or Delta Lake CDC pipelines know which attributes need point-in-time tracking and which don't — the high-frequency upsert workload only spins up for the columns the schema marked as Type 2 or Type 6.

There's one more thing modern warehouses don't solve: native time-travel isn't SCD Type 2. Snowflake's AT(TIMESTAMP =>) and Delta's VERSION AS OF are bounded by a 7–90 day retention window, and they capture when the warehouse saw a row change — not when the business event actually happened. For five-year claims analysis or a regulatory audit trail, you still need explicit, queryable version rows. Time-travel and SCD2 solve adjacent problems, not the same one.

For SCD-related brainstorming specifically, this is where the value compounds. "Should this attribute be Type 1 or Type 2?" is a domain question dressed up as a technical one. The conversational workflow handles it in three moves:

Describe the business question in plain language ("do we need the insurance tier as it was at the time of a claim, or just today?").
Let the AI propose the SCD type and capture the justification as a column note.
Ask it to audit the whole schema for over- and under-historized attributes before you ship.

That gets to the right answer faster than a six-engineer Zoom call — and produces a schema annotation as a permanent positive side effect. The decision is made, captured, and carried forward.

Open TalkingSchema. Describe your dimension layer, let the AI annotate the SCD type for every attribute, and stop putting your ETL engineer at the bottom of the next 11 PM Slack thread.

SCD Types 0–7: The full taxonomy (and which ones actually matter)​

Where SCD typing decisions actually get made today​

Designing the warehouse in conversation​

Where this fits alongside your existing stack​

SCD Types 0–7: The full taxonomy (and which ones actually matter)

Where SCD typing decisions actually get made today

Designing the warehouse in conversation

Where this fits alongside your existing stack