Why Data Engineering Teams Move Slower Than They Should

Jun 29

Your data engineering team probably isn’t slow. They're just spending their time in the wrong place.

Right now, somewhere in your org, a senior engineer is diagnosing a pipeline that failed silently overnight. A field changed shape upstream. The job ran for hours, produced output, and only surfaced the problem after something downstream broke or a number looked wrong in a dashboard. The engineer's morning is now an investigation: which job, which records, what state was the data in before the failure, what needs to be reprocessed, who consumed the bad output. A fix that would have taken two minutes at build time takes most of a day in production.

That's not a one-off. For most teams running Python-based pipelines at scale, it's the background noise of every sprint. The velocity metrics look reasonable and the team seems busy. But when you pull the incident log and categorize where the hours actually went, a significant share of what looks like engineering work is really recovery work. The pipeline didn't fail because anyone made a bad decision, but because the stack was never designed to catch that kind of error early.

TL;DR

Most data engineering teams aren't slow because of headcount or tooling choices. They're slow because their stack lets certain failures stay invisible until production. The bottleneck is error discovery timing: when your tools push problems downstream, your engineers spend their time responding instead of building. The teams that move fastest are the ones whose stack surfaces problems at compile time, not in a 2 a.m. PagerDuty alert.

The Diagnosis Most Teams Skip

When a data engineering team is consistently behind, the instinct is to look at the usual suspects: not enough headcount, too much technical debt, poor requirements from stakeholders. These are real problems. But they're often symptoms of something more structural, and adding engineers to a broken feedback loop just means more engineers spending time on firefighting.

The harder question is where errors surface in your pipeline lifecycle. If most of your production incidents trace back to type mismatches, null handling failures, schema drift, or race conditions in distributed jobs, the problem isn't people or process. It's that your stack is designed to discover those errors as late as possible.

Late error discovery is expensive. A bug caught at build time costs one engineer a few minutes. The same bug caught in a production job that's been silently corrupting a reporting table costs days of investigation, rollback, stakeholder communication, and whatever decisions got made with bad data in the meantime.

Why Dynamic Typing Hurts at Pipeline Scale

Python is the default language for most data engineering teams, and for good reasons. It's fast to write, it has a mature ecosystem, and almost every data engineer already knows it. Those advantages are real, especially in early-stage work where speed of iteration matters most.

The tradeoff shows up at scale. Python is dynamically typed, which means type errors are runtime errors. In a single-process script, that's manageable. In a distributed pipeline processing hundreds of millions of records across a cluster, a type error that goes undetected in development can corrupt output for hours before anyone notices. By the time it surfaces, the debugging process involves reconstructing what the data looked like before the job ran.

This isn't a knock on Python as a language. It's a structural observation about where errors land when your runtime doesn't know what types to expect. The later a mistake is discovered, the more expensive it is to fix.

Teams often try to paper over this with runtime validation libraries, schema registries, and manual testing. These help. But they're adding a layer of error detection that a statically typed language provides by default, at the cost of engineering time and operational complexity.
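
To make the contrast concrete, here is a minimal sketch in plain Scala of parsing untyped input once at the pipeline boundary, so that everything downstream works with a typed record. The names (RawRecord, Purchase, parsePurchase) and the field layout are hypothetical, chosen only for illustration, and the example deliberately avoids any framework.

```scala
object TypedBoundary {
  // Untyped input as it might arrive from an upstream source (hypothetical shape)
  type RawRecord = Map[String, String]

  // The typed record the rest of the pipeline works with
  final case class Purchase(userId: String, amountCents: Long)

  // Parse once at the boundary; every failure is returned as a value instead of
  // surfacing later inside a transformation
  def parsePurchase(raw: RawRecord): Either[String, Purchase] =
    for {
      userId     <- raw.get("user_id").toRight("missing field: user_id")
      amountText <- raw.get("amount_cents").toRight("missing field: amount_cents")
      amount     <- amountText.toLongOption.toRight(s"amount_cents is not an integer: $amountText")
    } yield Purchase(userId, amount)

  // Downstream code only ever receives Purchase values, so a string can never
  // show up where a number is expected
  def totalCents(purchases: Seq[Purchase]): Long =
    purchases.map(_.amountCents).sum
}
```

The parsing step is still a runtime check, because the outside world is untyped. The difference is that it happens in one place, and the compiler guarantees that nothing past that point can see an unparsed value.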

What Mutable State Does to Distributed Systems

The other major source of pipeline instability is mutable shared state, and it's harder to instrument your way out of. In distributed data processing, multiple workers operate on partitions of data concurrently. If the code modifies state that other workers can observe, you introduce a class of timing-dependent bugs that only appear under specific load conditions. They're difficult to reproduce, difficult to test for, and tend to show up at the worst possible time.

The functional programming model treats data as immutable by default. You don't modify a dataset. You derive a new one. Each transformation is a pure function: same input always produces the same output, regardless of what else is running. That property makes pipelines dramatically easier to reason about, test, and parallelize safely.
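
A small sketch of what that looks like in plain Scala, using hypothetical names (Reading, toFahrenheit) and an in-memory Vector as a stand-in for a distributed dataset. It is meant to illustrate the property, not to model how any particular engine schedules work.

```scala
object PureTransforms {
  final case class Reading(sensorId: String, celsius: Double)

  // Pure function: the result depends only on the argument, and the argument is never modified
  def toFahrenheit(r: Reading): Double = r.celsius * 9.0 / 5.0 + 32.0

  // Deriving a new collection leaves the original untouched
  def convertAll(readings: Vector[Reading]): Vector[Double] =
    readings.map(toFahrenheit)

  // Because convertAll is pure, converting partitions separately and concatenating
  // the results gives the same answer as converting the whole dataset at once
  def convertPartitioned(readings: Vector[Reading], partitionSize: Int): Vector[Double] = {
    require(partitionSize > 0, "partitionSize must be positive")
    readings.grouped(partitionSize).flatMap(convertAll).toVector
  }
}
```

The partitioned version gives the same result for any partition size, which is the property that makes it safe to spread the work across workers.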

Apache Spark's core design is built on this principle. Its RDD model is explicitly immutable and functional. When you write Spark jobs in a language that shares those properties natively, the code you write maps cleanly onto how the framework actually works. When you write Spark jobs in Python through PySpark, you're working through an abstraction layer that adds friction and overhead.

The PySpark Overhead Problem That Doesn't Show Up in Demos

PySpark works well for exploration, prototyping, and smaller jobs. At production scale, the Python-JVM boundary starts to matter. When you use Python UDFs or custom functions in PySpark, data has to be serialized out of the JVM into Python, processed, and then serialized back. That round trip adds latency and memory overhead to every job that uses it.

Teams that have moved from Python-heavy Spark implementations to Scala often report the same pattern: the first significant run on a large dataset shows performance characteristics that weren't visible in staging. Not because the logic is wrong, but because the serialization cost compounds at scale in ways that smaller test runs don't surface.

Scala runs natively on the JVM alongside Spark's execution engine. There's no Python-to-JVM serialization boundary. Custom transformations execute in the same process as the data. For jobs that are already expensive in terms of compute, that difference isn't academic. For a detailed look at how these two approaches compare in practice, the Scala vs Python data engineering decision framework covers the tradeoffs in depth.

What Scala's Type System Actually Catches

The practical value of a strong type system in data engineering isn't abstract. It shows up in specific, recurring failure modes that dynamically typed pipelines handle at runtime.

Schema evolution is one of the most common. Upstream data sources change shape. A field gets renamed. A nullable column becomes non-nullable. A numeric field starts arriving as a string in some edge case. In a Python pipeline, these changes are silent until a job fails or, worse, until a job produces subtly wrong output without failing. In Scala, once the schema is encoded as a type, a schema change becomes a type error. Update the type to match the new upstream shape, and the compiler flags every transformation that relied on the old one before the job runs.

Consider a simple transformation on a data record:

scala

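// A strongly typed event record
case class PipelineEvent(
  userId: String,
  eventTs: Long,
  value: Double
)

// Output record; eventTs is a required Long here as well
case class NormalizedEvent(userId: String, eventTs: Long, value: Double)

// If the upstream schema changes eventTs to Option[Long],
// this transformation won't compile until the mismatch is fixed.
// maxValue is the scaling bound, passed in by the caller.
def normalize(event: PipelineEvent, maxValue: Double): NormalizedEvent =
  NormalizedEvent(event.userId, event.eventTs, event.value / maxValue)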

When eventTs becomes Option[Long] in the PipelineEvent definition, the code above won't compile. The engineer has to handle the missing-value case explicitly before the job can run. That's not extra work. That's the system preventing a silent null pointer failure in a production job that might otherwise process for six hours before anyone notices.

Null handling is the other major area. Scala's Option type makes the presence or absence of a value explicit in the type signature. A function that takes an Option[String] cannot pretend the value is always present. The compiler enforces that you handle both cases. Python has no equivalent enforcement at the language level: type hints can describe an optional value, but nothing checks them unless the team adopts and runs a separate type checker. The null check is optional, and optional null checks have a way of going missing under deadline pressure.
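
A minimal sketch of that enforcement, using a hypothetical Profile record with an optional email field. The names are illustrative and not taken from any real pipeline.

```scala
object OptionalFields {
  final case class Profile(userId: String, email: Option[String])

  // The signature admits that email may be absent, so the result is optional too
  def emailDomain(profile: Profile): Option[String] =
    profile.email.flatMap { address =>
      address.split("@") match {
        case Array(_, domain) if domain.nonEmpty => Some(domain)
        case _                                   => None // malformed address
      }
    }

  // The default is chosen explicitly at the point of use
  def domainOrUnknown(profile: Profile): String =
    emailDomain(profile).getOrElse("unknown")
}
```

Writing profile.email.toUpperCase here does not compile, because Option[String] has no such method. The absent case has to be dealt with before the String inside can be used.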

Three Signals Your Stack Is the Problem

None of the symptoms below points immediately to language choice. A VP of Engineering looking at the situation sees a team that seems busy but not productive, and the instinct is to add process: better data contracts, more thorough code review, staging environments that more closely mirror production. Those interventions help at the margin. But if the underlying tooling is designed to discover errors late, better process is fighting the current.

These are the three signals worth paying attention to:

Onboarding takes longer than it should.

If new data engineers need weeks to become productive because the pipeline logic is only understandable by running it, that's a sign the codebase doesn't encode its own contracts. A well-typed pipeline is largely self-documenting. The types tell you what a function expects and what it produces. When that information lives only in tribal knowledge or internal wikis, every new hire starts from scratch.

One upstream schema change creates three days of pipeline fixes.

Schema drift is inevitable. The question is whether your stack catches it at build time or in production. If a renamed field or a newly nullable column requires manual investigation to diagnose after a job fails, the system has no mechanism for surfacing the contract violation early. That three-day fix is almost entirely avoidable.

Sprint velocity looks fine on paper, but the work is reactive.

If you pull up your data engineering team's completed tickets over the last quarter and a significant portion are incident-related, that ratio is telling you something. Velocity metrics that include firefighting overstate actual delivery capacity. The team isn't slow. They're spending their build time cleaning up failures the system should have prevented.

The teams that restructure around earlier error discovery typically report the same outcome: the initial migration takes real investment, but the operational workload drops significantly within a few months. The engineers are building instead of debugging.

How to Audit Your Data Engineering Stack for This Problem

The most useful question isn't "what tools are we using" but "where do our errors surface." Pull your last three months of production incidents and work through three categories:

Type and schema failures.

How many incidents traced back to a type mismatch, a null that wasn't expected, or a schema change upstream that nobody caught before the job ran? If this category is significant, your stack is discovering contract violations too late.

State and concurrency failures.

How many incidents involved race conditions, unexpected data mutations, or non-deterministic output from jobs that should be deterministic? These are the hardest bugs to reproduce and the ones most often attributed to infrastructure rather than code. They're frequently a mutable state problem.

Rework ratio.

Of the engineering time spent in the last quarter, what fraction went to fixing things that were already in production versus building new capabilities? As a rough rule of thumb, a rework ratio above 30 percent is a sign the error discovery loop is broken, not that the team is under-resourced.
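
The audit itself is simple arithmetic once incidents are tagged. The sketch below assumes a hypothetical Incident record with a category and the hours spent on it. The category names mirror the ones above, and any numbers fed into it are whatever your own incident log contains.

```scala
object IncidentAudit {
  sealed trait Category
  case object TypeOrSchema extends Category
  case object StateOrConcurrency extends Category
  case object Other extends Category

  final case class Incident(id: String, category: Category, hoursSpent: Double)

  // Hours of recovery work per incident category
  def hoursByCategory(incidents: Seq[Incident]): Map[Category, Double] =
    incidents.groupBy(_.category).map { case (category, items) =>
      category -> items.map(_.hoursSpent).sum
    }

  // Rework ratio = hours spent fixing production issues / total engineering hours
  def reworkRatio(incidents: Seq[Incident], totalEngineeringHours: Double): Double = {
    require(totalEngineeringHours > 0, "totalEngineeringHours must be positive")
    incidents.map(_.hoursSpent).sum / totalEngineeringHours
  }
}
```

As an illustration with made-up numbers, 40 hours of incident work against 100 total engineering hours gives a rework ratio of 0.4, which is above the 30 percent mark.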

If any of these categories accounts for a meaningful share of your incidents or engineering time, the tooling is likely a contributor. The fix isn't necessarily a full rewrite. A common starting point is migrating the highest-value, highest-incident pipelines first, establishing the Scala and Spark pattern, and extending from there as the team builds fluency.

Apache Spark is written primarily in Scala, and Kafka Streams, though implemented in Java, is a JVM library that ships with an official Scala API. Using those tools from a JVM language removes an abstraction layer that's easy to underestimate in planning and difficult to ignore at production scale.

Where Kafka Fits the Same Pattern

Everything above focuses on batch processing with Spark, but the same structural argument applies to streaming pipelines built on Kafka. Kafka Streams is a first-class JVM library. Its API is designed around typed, immutable stream transformations. When you build Kafka consumers and stream processors in Scala, you're working with the same type system guarantees: a transformation that no longer matches its declared record type is a compile error, transformation contracts are enforced statically, and the functional model maps directly onto how Kafka's stream processing model works.

Teams running Kafka in Python face a similar tradeoff. The Python Kafka client libraries are mature and widely used, but they operate without schema enforcement at the language level. A message format change in a producer can silently break a consumer until a runtime error surfaces. In a high-throughput streaming system, the window between a silent failure and a noticeable data quality problem can be hours. The same compile-time contract enforcement that protects Spark jobs protects Kafka pipelines for the same reasons.

For teams running both Spark batch jobs and Kafka streaming pipelines, the operational benefit of using Scala across both is compounded. The same patterns, the same type discipline, the same immutability model. Engineers move between batch and streaming work without switching mental models or error-handling approaches.
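
One way to picture the shared model is a single typed, pure function that both code paths call. In the sketch below, Click, SectionView, and toSectionView are hypothetical names, a Vector stands in for a batch dataset, and an Iterator stands in for a stream of records. Neither is the Spark or Kafka Streams API.

```scala
object SharedLogic {
  final case class Click(userId: String, path: String)
  final case class SectionView(userId: String, section: String)

  // One pure, typed transformation: first non-empty path segment, or "home" if there is none
  def toSectionView(click: Click): SectionView = {
    val section = click.path.split("/").find(_.nonEmpty).getOrElse("home")
    SectionView(click.userId, section)
  }

  // Batch stand-in: apply the function to a whole in-memory collection
  def runBatch(clicks: Vector[Click]): Vector[SectionView] =
    clicks.map(toSectionView)

  // Streaming stand-in: apply the same function lazily as records arrive
  def runStream(clicks: Iterator[Click]): Iterator[SectionView] =
    clicks.map(toSectionView)
}
```

If the Click type changes, both paths stop compiling at the same function, so the fix is made once instead of being rediscovered separately in batch and streaming code.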

The Hiring Dimension

Scala engineers are harder to hire than Python engineers, and the talent pool is smaller. That's worth acknowledging directly rather than glossing over it.

The practical model that works best for most organizations isn't hiring a full Scala data engineering team from scratch. It's bringing in experienced Scala engineers to establish the architecture, tooling, and patterns, then transitioning day-to-day ownership to an internal team once the foundation is solid. The experienced engineers set the contracts and the standards. The internal team inherits a codebase that enforces correctness by design rather than by convention.

The tradeoff is one of leverage. A smaller team working in a stack that catches errors early can often outdeliver a larger team working in one that doesn't. The productivity difference isn't linear. It compounds over time as the high-correctness team accumulates less technical debt and spends less time on incident response. The Scala talent shortage is real, but for data infrastructure work, the scarcity is often overstated relative to the actual headcount a well-structured engagement requires.

Frequently Asked Questions

Why do data engineering teams move slowly?

Most data engineering slowdowns come from the same root cause: the stack tolerates errors that should be caught before code runs. Schema mismatches, null handling issues, and type coercion bugs fail silently in dynamic languages and only surface in production, creating repeated firefighting cycles that consume engineering time.

Why do data pipelines keep breaking in production?

Pipelines break in production when the tooling doesn't enforce contracts at build time. Dynamic typing, mutable state in distributed systems, and implicit schema handling push error discovery downstream, where the cost of fixing them is highest.

What is the difference between PySpark and Scala for data engineering?

PySpark adds a Python layer on top of the JVM, which introduces serialization overhead whenever Python UDFs or custom functions are involved and removes compile-time type safety. Scala runs natively on the JVM alongside Spark, which means lower overhead and full access to the type system for catching errors before runtime.

Why do companies use Scala for data pipelines?

Scala's type system catches schema and contract violations at compile time, before a job runs. Its native JVM execution eliminates the serialization cost of Python-based Spark APIs, and its immutable-by-default style helps avoid the class of shared-state bugs that commonly cause distributed pipeline failures.

Is Scala better than Python for data engineering at scale?

For teams processing large volumes or running complex distributed jobs, Scala tends to produce more stable pipelines because errors are caught earlier and execution stays inside the JVM alongside Spark's engine. Python is faster to prototype in, but at scale the operational cost of runtime failures often exceeds the speed advantage in development.

How do you migrate from PySpark to Scala Spark?

The most effective approach is a pipeline-by-pipeline migration rather than a full rewrite. Start with the pipelines that generate the most incidents or carry the highest business risk. Rewrite them in Scala, establish the typing patterns and project conventions, and use those as the template for subsequent pipelines. Teams typically find that the second and third pipelines go significantly faster than the first, as the patterns become familiar and the shared utilities get built out.

Alyssa Ehinger