Article Detail
Data Mart Hygiene First Principles
You can catch contradictions without knowing the whole story.

Purpose & Scope
A pragmatic, early‑phase test strategy for data marts when detailed business rules and complete documentation are not yet available (and with the understanding that full documentation is in progress). It prioritizes fast, logic‑light checks that surface defects early, and evolves into deeper, invariant‑driven testing as understanding grows.
Premise: We can perform meaningful testing without full context, and we’ll do even better once we learn the systems and rules. Start now; deepen over time.
Warning from Experience: Building on whatever already exists—without occasionally revisiting first principles—turns a strong foundation into a house of cards. Eventually, it will collapse.
---
Core Principle — The Marble Test
Idea: You don’t need to count every marble to know two bags match. Compare mass/volume/shape instead of enumerating each item.
Practices:
- Row count parity (source ↔︎ stage ↔︎ mart)
- Partitioned counts (by month, region, key)
- Schema/field presence parity
- Distribution checks (null rates, cardinality bands)
Diagnostic pattern: queries return only discrepancies; empty result = pass.
-- Example: partitioned row count parity by monthWITH src AS (
SELECT DATEFROMPARTS(YEAR(TxDate), MONTH(TxDate), 1) AS MonthStart, COUNT(*) AS Cnt
FROM Source.Transactions GROUP BY DATEFROMPARTS(YEAR(TxDate), MONTH(TxDate), 1)
), dst AS (
SELECT DATEFROMPARTS(YEAR(TxDate), MONTH(TxDate), 1) AS MonthStart, COUNT(*) AS Cnt
FROM Mart.FactTransactions GROUP BY DATEFROMPARTS(YEAR(TxDate), MONTH(TxDate), 1)
)
SELECT COALESCE(s.MonthStart,d.MonthStart) AS MonthStart,
ISNULL(s.Cnt,0) AS SrcCnt, ISNULL(d.Cnt,0) AS DstCnt
FROM src s
FULL OUTER JOIN dst d ON d.MonthStart = s.MonthStart
WHERE ISNULL(s.Cnt,0) <> ISNULL(d.Cnt,0);
---
Corollary A — Checksum Thinking
Idea: For complex structures, compute hashes per logical grain; compare fingerprints instead of contents.
When to use: Validating documents/blobs, dimensional conformance, wide tables, or high‑volume facts.
-- Example: stable row-level fingerprint excluding audit fieldsSELECT BusinessKey,
HASHBYTES('SHA2_256', CONCAT_WS('|', Col1, Col2, Col3)) AS RowHash
INTO #src
FROM Staging.Entity;
SELECT BusinessKey,
HASHBYTES('SHA2_256', CONCAT_WS('|', Col1, Col2, Col3)) AS RowHash
INTO #dst
FROM Mart.DimEntity;
SELECT s.BusinessKey
FROM #src s
FULL OUTER JOIN #dst d ON d.BusinessKey = s.BusinessKey AND d.RowHash = s.RowHash
WHERE d.BusinessKey IS NULL OR s.BusinessKey IS NULL; -- mismatches only
Notes: Ensure deterministic formatting (e.g., normalized dates, null sentinels) before hashing.
---
Corollary B — Spot the Lie
Idea: You don’t need full meaning to catch contradictions. Ask the data questions that can’t all be true at once.
Common patterns:
- Temporal order:
Created <= Modified <= Closed - State coherence: if
Status = ClosedthenClosedDateis NOT NULL - Magnitude sanity: no negative counts where impossible; ratios 0..1
-- Example: temporal contradictionSELECT *FROM WorkOrder
WHERE ClosedDate < CreatedDate;
---
Corollary C — The Mirror Test (Aggregate Reconciliation)
Idea: Reflect source and destination into a shared frame (time bucket, region, product) and look for distortion.
-- Example: monthly revenue reconciliationWITH combined AS (
SELECT DATEFROMPARTS(YEAR(TxDate),MONTH(TxDate),1) AS MonthStart,
SUM(Amount) AS SrcAmt, CAST(0 AS decimal(18,2)) AS DstAmt
FROM Source.Transactions GROUP BY DATEFROMPARTS(YEAR(TxDate),MONTH(TxDate),1)
UNION ALL SELECT DATEFROMPARTS(YEAR(TxDate),MONTH(TxDate),1) AS MonthStart,
CAST(0 AS decimal(18,2)) AS SrcAmt, SUM(Amount) AS DstAmt
FROM Mart.FactTransactions GROUP BY DATEFROMPARTS(YEAR(TxDate),MONTH(TxDate),1)
)
SELECT MonthStart
FROM combined
GROUP BY MonthStart
HAVING ABS(SUM(SrcAmt) - SUM(DstAmt)) > 0.01; -- discrepancies only
---
Post‑Load Constraint Validation (When the Schema Doesn’t Enforce It)
Because many marts skip hard constraints for performance, enforce them after loads:
- Referential integrity: no orphans (FKs must have parents)
- NOT NULL: critical fields populated (keys, dates, amounts)
- Uniqueness: business/display keys (e.g., WorkOrderNumber, CustomerID) are unique
-- Orphan detectionSELECT f.WorkOrderGUID
FROM Mart.FactWork f
LEFT JOIN Mart.DimWorkOrder d ON d.WorkOrderGUID = f.WorkOrderGUID
WHERE d.WorkOrderGUID IS NULL;
-- Business/display key uniquenessSELECT WorkOrderNumber
FROM Mart.DimWorkOrder
GROUP BY WorkOrderNumber
HAVING COUNT(*) > 1;
---
Transition to Deeper Coverage — Invariant‑Driven Testing
Remember: In parallel to the testing effort we are discovering and documenting business rules. Now we extract the value from those rules by using them to validate system behavior.
Insight: One business rule implies several truths (invariants) about state. Some truths are cheaper to test and deliver broader coverage than the rule itself.
Example Rule → Truths
Rule: “A contract must be signed before work begins.”
Truths (test individually):
Contract.SignedDate <= WorkOrder.StartDateWorkOrder.ContractGUID IS NOT NULLContract.Status IN ('Signed','Active')when WorkOrder is Active- If
WorkOrder.Status='Active'thenContract.Status='Signed'
-- 1) Signed before startSELECT w.WorkOrderGUID
FROM Mart.WorkOrder w
JOIN Mart.Contract c ON c.ContractGUID = w.ContractGUID
WHERE c.SignedDate > w.StartDate;
Layering Truths:
- Core invariants: Always true (temporal order, non‑negatives)
- Derived truths: Implied by rules (state dependencies)
- State truths: True only in certain states (if Closed → has ClosedDate)
- Temporal truths: Behavior over time (no 100% drop without explanation)
Use truths to amplify coverage: one rule → many cheap, reusable tests.
---
The Diagnostic Query Pattern
- Each test is a single query that returns only failures
- Empty result set = pass
- Add a test whenever a new defect/edge case is discovered
- Remove/retire tests made obsolete by design changes
Runner skeleton:
- Store tests in metadata table with:
TestName,Severity,SQLText,Owner,Tags - Orchestrator executes each test; captures row count + sample failures
- Report highlights non‑zero failures; zero rows collapse to green
---
Early‑Stage Strategy (When Context Is Thin)
- Start with marble tests (counts, partitions, null rates)
- Add checksums at key grains (dim rows, monthly fact buckets)
- Layer mirror tests for critical KPIs across time/region/product
- Introduce core invariants (temporal order, non‑negative amounts)
- As rules emerge, codify derived truths and state truths
- Convert every production bug into a diagnostic query
Principle: Begin broad and cheap; grow precise where pain appears.
---
Governance & Culture
- Testing doesn’t slow delivery; defects do. Tests surface the cost you already incurred.
- Treat the catalog as operational memory.
- Make results visible.
- Reward bug‑to‑test conversions; celebrate zero‑row runs.
---
This playbook is intentionally small, fast, and expandable. Start with marbles, add checksums, mirror the world, and make the data tell the truth.