Why We Chose Hundreds of Pipelines
It is sometimes easier to be agile when managing a few hundred small processes than when managing a handful of monoliths.

The Spark
At least once a year (and sometimes more often), someone asks me why our Azure Data Factory environment has more than 300 pipelines. It is a fair question, and the honest answer comes down to tradeoffs: finding an approach that works under real-world constraints.
The current solution has lasted because it has kept running reliably under those constraints for years.
Large numbers raise concern, especially when they are tied to infrastructure. What continues to catch me off guard is how difficult it is to explain the decision succinctly. There is never a single perfect architectural design; this is not about right or wrong, but about balancing tradeoffs against current needs and constraints.
I tend toward more granular pipelines because I usually find myself needing solutions that:
- Are developed and supported by the same small team.
- Must run in a single environment.
- Change frequently as requirements evolve.
Let’s do a deep dive into what this means.
---
TL;DR
We chose many small Azure Data Factory pipelines over a few large ones because production systems can fail unpredictably, and recovery matters more than elegance. Smaller pipelines give us restart points, faster hotfixes, clearer observability, simpler testing, and a reliable audit trail. A single monolithic pipeline may appear cleaner on paper, but in practice, it becomes brittle, opaque, and slow to recover. Visible complexity proved safer than hidden complexity.
---
If your team wants to reduce the pipeline count over time, that is worth exploring. The first step, though, is understanding what the current shape buys us operationally.
---
The Seduction of the One Big Pipeline
On the surface, a single large pipeline is appealing. One trigger. One execution graph. One place to look when something goes wrong. With enough parallel branches, retries, and conditional logic, it is entirely possible to orchestrate an entire data mart load inside a single construct.
The problem is not that this can’t be done. In fact, sometimes it is the better solution, but in our case, the operational failure modes outweighed the benefits of consolidation.
Our production systems must tolerate network glitches, unexpected data conditions, and unavailable services. When a run spans many hours and something goes wrong, having many smaller pipelines lets us recover in minutes rather than hours or days.
The longer the pipeline, the greater the risk, as there are more opportunities for transient failures, upstream delays, or unexpected data conditions. When a monolithic pipeline fails nine and a half hours into a run, the technical failure quickly becomes an operational one. Teams are forced to choose between rerunning everything from the beginning or attempting manual recovery under pressure. Neither option scales well.
---
Restart Points as a First-Class Design Principle
Distributed systems fail. Files arrive late. APIs stall. Schemas change without notice. Credentials expire. These are not edge cases; they are the normal operating conditions of long-lived data platforms.
Retries help, but retries alone do not solve the problem. What makes recovery manageable is restartability.
By decomposing the ETL process into many smaller pipelines, we created natural restart boundaries throughout the system. When something fails, we resume from the last known good state rather than rewind the entire workflow. This was not an optimization layered on later; it was a deliberate design choice rooted in the assumption that failure would happen and that recovery speed mattered.
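As a minimal sketch of what a restart boundary can look like, assume a control record that tracks which steps have already succeeded for a given run date; the step names, the in-memory "table," and the helper functions here are illustrative assumptions, not our production schema or ADF configuration.

```python
# Sketch: resume a multi-step load from the last known good state.
# The control "table" is a dict here; in practice it would be persisted.

COMPLETED = {}  # e.g. {"2025-01-03": {"copy_accounts", "transform_accounts"}}

STEPS = [
    ("copy_accounts", lambda: print("copying accounts...")),
    ("transform_accounts", lambda: print("transforming accounts...")),
    ("load_mart", lambda: print("loading data mart...")),
]

def run_with_restart(run_date: str) -> None:
    done = COMPLETED.setdefault(run_date, set())
    for name, step in STEPS:
        if name in done:
            continue  # restart point: skip work that already succeeded
        step()
        done.add(name)  # record progress before moving on

run_with_restart("2025-01-03")
```

If the third step fails, the next invocation for the same run date skips the first two and resumes where the work stopped; that is the entire point of the boundary.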
---
Hotfixes in the Real World
Production issues rarely announce themselves at convenient times. More often than not, they surface midday as partial outages, data-quality issues, or downstream reports that suddenly stop making sense.
With a modular pipeline structure, responding to these issues is straightforward: the affected logic is corrected, only the relevant pipelines are rerun, downstream dependencies are validated, and service is restored. What could have been a multi-hour or multi-day outage becomes a targeted intervention measured in minutes.
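As a sketch of what "rerun only the relevant pipelines" can mean in practice, assume a small map of downstream dependencies; the pipeline names, the dependency map, and the trigger_pipeline stub are illustrative, not our actual factory.

```python
# Sketch: given a fixed pipeline, rerun it plus everything downstream of it,
# leaving unrelated pipelines untouched. Names and dependencies are examples.

DOWNSTREAM = {
    "Copy_Work_Orders": ["Transform_Work_Orders"],
    "Transform_Work_Orders": ["Load_Work_Orders_Mart"],
    "Load_Work_Orders_Mart": [],
}

def affected_pipelines(fixed: str) -> list[str]:
    """Return the fixed pipeline plus its transitive downstream dependents."""
    ordered, queue = [], [fixed]
    while queue:
        name = queue.pop(0)
        if name not in ordered:
            ordered.append(name)
            queue.extend(DOWNSTREAM.get(name, []))
    return ordered

def trigger_pipeline(name: str) -> None:
    print(f"rerunning {name}")  # stand-in for triggering an ADF pipeline run

for pipeline in affected_pipelines("Copy_Work_Orders"):
    trigger_pipeline(pipeline)
```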
A single large pipeline makes this kind of response far more difficult. Hotfixes tend to require full reruns, fragile manual executions, or ad hoc workarounds outside the orchestration framework. Of course, it is always possible to bypass ADF and run SQL directly, but doing so sacrifices auditability and operational visibility—two things that matter when troubleshooting complex systems or operating in regulated environments.
---
Observability Over Aesthetic Simplicity
An unexpected benefit of many pipelines is that system behavior becomes clearly visible over time. Azure Data Factory’s monitoring tools provide duration and execution history at the pipeline level, turning the orchestration layer into a performance map rather than a black box.
This enables visibility into where bottlenecks are emerging, which upstream changes affect downstream workloads, and where optimization efforts will have the greatest impact. A single monolithic pipeline, by contrast, collapses all that detail into a single aggregate runtime, offering little insight into where time is actually spent.
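For reference, here is one way those per-pipeline durations can be pulled programmatically, sketched against the azure-mgmt-datafactory Python SDK. The subscription, resource group, and factory values are placeholders, and model or method names can vary by SDK version, so treat this as an assumption to verify rather than a recipe.

```python
# Sketch: summarize per-pipeline run durations from ADF run history.
# Assumes the azure-identity and azure-mgmt-datafactory packages; the
# resource identifiers below are placeholders.
from collections import defaultdict
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
now = datetime.now(timezone.utc)
runs = client.pipeline_runs.query_by_factory(
    "<resource-group>",
    "<factory-name>",
    RunFilterParameters(last_updated_after=now - timedelta(days=7),
                        last_updated_before=now),
)

durations = defaultdict(list)
for run in runs.value:
    if run.duration_in_ms is not None:
        durations[run.pipeline_name].append(run.duration_in_ms / 60000)

for name, minutes in sorted(durations.items()):
    print(f"{name}: avg {sum(minutes) / len(minutes):.1f} min over {len(minutes)} runs")
```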
This difference becomes easier to see when you compare concrete orchestration patterns side by side rather than debating in the abstract.
---
Four Orchestration Patterns, One Workload
This section compares four approaches to orchestrating the same logical workload in Azure Data Factory. In each case, assume we are copying approximately 100 source tables into a data mart. The data volume, business logic, and downstream dependencies are identical. What changes is how the work is structured, observed, and operated day-to-day.
The goal here is not to declare a single universally correct approach. All four patterns are viable and have been successfully used in real systems. What differs is the operator experience: how easy the system is to understand, how quickly problems can be diagnosed, and how safely the system can be evolved.
---
Examples
Let’s look at a few examples.
If you only read one example, read Example 3.
Example 1: The Monolithic Pipeline
In the first pattern, all copy activities are contained within a single Azure Data Factory pipeline. The pipeline fans out into parallel branches where possible, then synchronizes before moving on to downstream steps. From a distance, this can look tidy: one trigger, one run history, one artifact to manage.
Conceptually, the pipeline might look like this:
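In place of a diagram, a minimal Python sketch can stand in for the shape of this pattern: one orchestration unit, parallel branches, and a single failure domain. The table names and the copy_table helper are illustrative assumptions, not ADF configuration.

```python
# Sketch: one monolithic run. Any single failed copy fails the entire run.
from concurrent.futures import ThreadPoolExecutor

TABLES = ["Accounts", "Work_Orders", "Activities"]  # roughly 100 in practice

def copy_table(table: str) -> None:
    print(f"copying {table}")  # stand-in for a Copy activity

def load_all_data() -> None:
    # Fan out: copy all sources in parallel branches.
    with ThreadPoolExecutor(max_workers=10) as pool:
        list(pool.map(copy_table, TABLES))  # re-raises the first failure
    # Synchronize, then continue with downstream steps.
    print("building downstream mart objects")

load_all_data()
```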
Scaled up, this sketch represents dozens or hundreds of copy activities with varying degrees of parallelism. The exact shape matters less than the experience of reasoning about it as the pipeline grows wider and taller.
From the perspective of Azure Data Factory Monitor, this entire workload collapses into a single line item per day:
| Pipeline Name | Start Date | Duration |
|---|---|---|
| Load_All_Data | 2025 Jan 01 | 2h 03m |
| Load_All_Data | 2025 Jan 02 | 1h 55m |
| Load_All_Data | 2025 Jan 03 | 3h 18m |
This indicates that the pipeline runs once per day and takes approximately two hours. What it does not tell us is where that time is being spent, which copies are slow, or which parts of the workflow would benefit most from optimization.
It is possible to extract that information elsewhere—through logging, custom metrics, or manual analysis—but none of it comes for free. At the orchestration layer itself, operational insight is minimal.
---
Example 2: The Metadata-Driven Pipeline
In the second pattern, copy behavior is driven by metadata. A pipeline reads a control table describing which sources to copy and iterates over that metadata. Parallelism is still possible by adjusting concurrency settings, and the overall pipeline graph becomes significantly simpler.
At a high level, the orchestration looks like this:
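A minimal sketch of the idea, with the control table modeled as a list of rows and concurrency bounded the way a ForEach batch count would be; the column names and rows are assumptions, not our actual control schema.

```python
# Sketch: metadata-driven copy loop. One pipeline reads a control table and
# iterates over it; adding a source means adding a row, not editing a graph.
from concurrent.futures import ThreadPoolExecutor

CONTROL_TABLE = [  # illustrative rows; the real table would live in a database
    {"source": "crm.Accounts", "target": "stage.Accounts", "enabled": True},
    {"source": "erp.Work_Orders", "target": "stage.Work_Orders", "enabled": True},
    {"source": "fsm.Activities", "target": "stage.Activities", "enabled": False},
]

def copy_one(row: dict) -> None:
    print(f"copying {row['source']} -> {row['target']}")  # stand-in for a Copy activity

def load_metadata_driven(batch_count: int = 4) -> None:
    rows = [r for r in CONTROL_TABLE if r["enabled"]]
    with ThreadPoolExecutor(max_workers=batch_count) as pool:
        list(pool.map(copy_one, rows))

load_metadata_driven()
```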
From a design perspective, this approach is elegant. It reduces duplication, centralizes configuration, and makes it easy to add or remove sources without editing the pipeline itself.
In the monitor, however, much of the detail moves out of sight:
| Pipeline Name | Start Date | Duration |
|---|---|---|
| Load_Metadata_Driven | 2025 Jan 01 | 1h 55m |
| Load_Metadata_Driven | 2025 Jan 02 | 1h 55m |
| Load_Metadata_Driven | 2025 Jan 03 | 3h 36m |
This indicates the pipeline runs once per day and often takes around two hours, with occasional spikes. To understand which source is slow (or failing), an operator must click into a specific run, inspect parameters, correlate timestamps, and often cross-reference logs or metadata tables. The information exists, but it is no longer visible at a glance.
This pattern trades pipeline sprawl for operational indirection. That tradeoff can be reasonable, but it shifts effort from design to troubleshooting.
---
Example 3: One Pipeline per Copy
In the third pattern, each copy activity becomes its own pipeline. Pipelines are named semantically for what they do. Orchestration happens at a higher level, but execution and monitoring are fully decomposed.
Conceptually, the workflow now looks like this:
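Sketched in the same style, each copy is now its own named, independently runnable unit with a thin orchestrator on top. The names mirror the monitor table below, but the code itself is illustrative, not ADF configuration.

```python
# Sketch: one pipeline per copy. Each unit can be run, monitored, and rerun
# on its own; the orchestrator only sequences them.

def copy_accounts() -> None:
    print("Copy_Accounts")       # its own pipeline, its own run history

def copy_work_orders() -> None:
    print("Copy_Work_Orders")

def copy_activities() -> None:
    print("Copy_Activities")

PIPELINES = {
    "Copy_Accounts": copy_accounts,
    "Copy_Work_Orders": copy_work_orders,
    "Copy_Activities": copy_activities,
}

def run(names: list[str]) -> None:
    for name in names:
        PIPELINES[name]()  # each call is an independent restart point

run(list(PIPELINES))         # nightly load
run(["Copy_Work_Orders"])    # targeted rerun after a hotfix
```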
From the monitor’s perspective, the same workload now produces immediately useful signals:
| Pipeline Name | Start Date | Duration |
|---|---|---|
| Copy_Accounts | 2025 Jan 01 | 3m 12s |
| Copy_Work_Orders | 2025 Jan 01 | 18m 44s |
| Copy_Activities | 2025 Jan 01 | 41m 02s |
| Copy_Accounts | 2025 Jan 02 | 13m 02s |
| Copy_Work_Orders | 2025 Jan 02 | 28m 12s |
| Copy_Activities | 2025 Jan 02 | 50m 12s |
| Copy_Accounts | 2025 Jan 03 | 2m 55s |
| Copy_Work_Orders | 2025 Jan 03 | 17m 54s |
| Copy_Activities | 2025 Jan 03 | 1h 05m 05s |
Without clicking into any runs, it is obvious where time is being spent and which areas deserve attention. Optimization becomes targeted rather than speculative. Troubleshooting is faster because the failure unit is smaller and more clearly named.
This pattern often explains why pipeline counts grow quickly. What can look like sprawl is frequently just visibility.
---
Example 4: Semantically Grouped Pipelines
The final pattern builds on the previous one by introducing semantic grouping and dependency-aware orchestration. Individual copy pipelines remain atomic, but they are grouped into higher-level workflows based on domain or dependency.
For example:
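A sketch of the grouping idea, with domains and their ordering expressed as data; the group names, pipeline names, and dependencies are illustrative assumptions, not our actual domain model.

```python
# Sketch: atomic copy pipelines grouped into domain workflows, with
# dependencies between groups. A single group can be rerun on its own.

GROUPS = {
    "Load_CRM_Domain": ["Copy_Accounts", "Copy_Contacts"],
    "Load_Field_Service_Domain": ["Copy_Work_Orders", "Copy_Activities"],
    "Build_Mart": ["Transform_Facts", "Refresh_Dimensions"],
}

DEPENDS_ON = {
    "Build_Mart": ["Load_CRM_Domain", "Load_Field_Service_Domain"],
}

def run_pipeline(name: str) -> None:
    print(f"running {name}")  # stand-in for an ADF pipeline run

def run_group(group: str, completed: set[str] | None = None) -> None:
    completed = completed if completed is not None else set()
    for upstream in DEPENDS_ON.get(group, []):
        if upstream not in completed:
            run_group(upstream, completed)
    for pipeline in GROUPS[group]:
        run_pipeline(pipeline)
    completed.add(group)

run_group("Build_Mart")                  # full nightly run
run_group("Load_Field_Service_Domain")   # targeted rerun of one domain
```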
This structure allows operators to trigger, monitor, and rerun entire domains without touching unrelated parts of the system. It also avoids the awkwardness of forcing a midstream start into a long pipeline and then having to let it run to completion.
In practice, this capability saves time, reduces errors, and shortens outages. When issues arise late in the process, teams can rerun just the affected group instead of replaying hours of already successful work.
---
What These Examples Show
All four patterns move data. The differences emerge when systems fail, when changes are required, and when humans need to understand what is happening.
As orchestration boundaries shrink and become more semantically meaningful, monitoring becomes more informative, recovery becomes safer, and continuous improvement becomes easier. None of this is free—but some designs make the cost explicit and manageable, while others defer it to the least convenient moment.
These examples are not about choosing the “right” diagram. They are about deciding which kinds of problems you want to solve upfront, and which ones you are willing to solve later, often under pressure.
---
Debugging Systems, Not Diagrams
There is a persistent assumption that fewer pipelines are easier to debug. In practice, the opposite is often true. Debugging complexity does not scale linearly with size, because it is driven less by the number of steps than by the limits of human comprehension—how many interactions, ordering constraints, and side effects an engineer can reliably reason about at once.
Smaller pipelines behave like well-defined units. You can test them independently, reason about them in isolation, and explain them to new engineers without requiring a guided tour of an enormous execution graph. Large pipelines, by contrast, tend to become untouchable over time—not because changing them is impossible, but because understanding them costs too much.
Hidden complexity feels comforting until something breaks. When it does, the lack of clear boundaries slows diagnosis and makes recovery riskier.
---
On the Fear of “Too Many Pipelines”
The reaction to seeing hundreds of pipelines is often driven by first impressions rather than technical analysis. We rarely panic over hundreds of tables, stored procedures, or scheduled jobs, yet pipelines trigger a different response because they make orchestration complexity visible.
That visibility is a feature, not a flaw. The system is complex because the problem domain is complex. The choice is whether to expose that complexity in manageable pieces or compress it into a single opaque structure. Experience has shown that visible complexity is easier to reason about, easier to support, and ultimately safer.
---
Is This the Best Possible Approach?
Probably not forever.
There is a helpful reminder often attributed to former U.S. Secretary of Defense Donald Rumsfeld: you go to war with the capabilities you have, not the ones you wish you had (paraphrased). The same principle applies to large, long-lived data platforms. When we moved this system to Azure, many of today’s orchestration tools either did not exist yet or were still immature. The architecture reflects the realities, constraints, and risks of that moment in time.
Modern orchestrators and metadata-driven frameworks now offer more sophisticated lineage tracking, task-level observability, and orchestration-as-code patterns. Many of these tools could improve how this system is expressed or documented today. But rebuilding or heavily refactoring an enterprise data platform whenever a new approach emerges is unsustainable. For most organizations, that kind of constant reinvention crowds out the very work the platform exists to enable.
Instead, progress has to be opportunistic. We look for moments when modernization can coexist with other efforts—significant enhancements, platform upgrades, or natural inflection points—rather than chasing every new wave as it emerges.
What these newer tools do not eliminate is segmentation. Even the most advanced orchestration platforms still break work into small, restartable steps. The underlying principle persists across tooling generations because it reflects how reliable systems are actually built.
---
I recognize that inheriting a large, unfamiliar system—especially one with many moving parts—is frustrating. This document isn’t meant to dismiss that frustration, but to explain the constraints under which the system evolved and how it was kept operable over time.
---
Parallel Work, Parallel People
One of the least discussed benefits of decomposing a system into many pipelines has nothing to do with performance, retries, or tooling. It has to do with people.
Enterprise data platforms rarely stand still. Teams change. Contractors rotate in and out. New hires arrive with strong general skills but limited context. Over time, the system must be understood, extended, and repaired by people who did not build it.
In that environment, structure is not an aesthetic choice. It is a safety mechanism.
Onboarding Without Breaking Things
When a system is broken into reasonable, well-named chunks, onboarding becomes a guided process rather than a trial by fire. A new developer or support engineer can be handed a specific pipeline or a small group of related pipelines and told, “Start here.” The blast radius is clear. The dependencies are visible. The consequences of a mistake are limited.
By contrast, onboarding someone to a monolithic pipeline is an exercise in controlled risk. Even small changes require understanding a large execution graph, subtle ordering constraints, and side effects that may not be obvious from the diagram alone. The safest option becomes inaction, which slows progress and concentrates knowledge in the hands of a few long-tenured individuals.
Many pipelines distribute understanding. Monoliths concentrate it.
Safer Parallel Development
Modern teams rarely work on a single task at a time. Bug fixes, enhancements, data corrections, and performance tuning often happen in parallel. A modular pipeline ecosystem naturally supports this reality.
Different developers can work on different pipelines without stepping on one another. Changes can be tested and deployed independently. Hotfixes can be applied surgically, without freezing unrelated work or delaying urgent fixes until a global release window opens.
In a monolithic pipeline, parallel work is theoretically possible but fragile in practice. Changes queue behind one another, coordination costs rise, and each modification carries the risk of unintended interactions. Over time, teams respond by batching changes, which increases both deployment size and the risk per deployment.
Support Teams and Operational Confidence
The same principles apply to support and operations. When pipelines are small and semantically grouped, on-call engineers can diagnose issues more quickly and act with confidence. Rerunning a single pipeline or group of pipelines feels routine rather than dangerous.
In monolithic systems, operational tasks tend to become ceremonial. Reruns require careful planning. Manual interventions are double- and triple-checked, yet they remain stressful. The system may work, but it discourages decisive action.
Confidence matters. Systems that allow people to act confidently recover faster and fail less expensively.
Knowledge Transfer as a Design Constraint
Over long time horizons, the limiting factor in system reliability is not technology. It is knowledge transfer. Small, well-named, purpose-built pipelines serve as living documentation. They teach new team members how the system works simply by existing.
A monolith can be documented, but it rarely teaches. Understanding must be transmitted person to person, often under pressure, often incompletely.
Designing for parallel people is ultimately about designing for continuity. The goal is not just to make today’s work easier, but to ensure that tomorrow’s team can operate the system safely, even when the original builders are no longer in the room.
---
What I Would Evolve
That historical context matters when talking about evolution. The question is not whether the system should change, but how—and at what cost.
If I were redesigning this system today, I would not collapse pipelines to achieve a cleaner diagram. Instead, I would focus on clearer domain grouping, more standardized retry and notification logic, richer metadata, and better automated documentation. The intent is to improve legibility and maintainability without sacrificing the resilience that has proven effective over time.
A well-run library does not have fewer books. It has better organization.
---
Closing: Designing for Recovery
After decades of building and supporting production data systems, one lesson stands out:
Healthy systems are not defined by how rarely they fail, but by how quickly and safely they recover.
The decision to use many pipelines was not driven by fashion or tooling constraints. It was driven by experience—experience with outages, hotfixes, audits, and the realities of keeping data flowing when the world refuses to behave.
You can build a single pipeline that looks elegant on a whiteboard. Or you can build a system designed to survive contact with reality.
I know why we made this choice, but decisions made years ago should always remain open to reconsideration. The challenge is balancing design improvements against other priorities and constraints.