Stop Training and Praying


We need to talk about the elephant in the room: most ML development is embarrassingly wasteful.

I’ve watched teams burn through millions of dollars training frontier models only to discover fundamental bugs after the run completes. I’ve seen engineers religiously check their loss curves like stock tickers, praying their 3-week training job doesn’t crash overnight. And I’ve personally experienced the gut-punch of realizing that a “successful” model with 95% accuracy completely fails on the data slices that actually matter for the business.

This isn’t just about wasted money—though poor data quality alone costs firms an average of $12.9 million annually. It’s about a fundamentally broken development paradigm that’s holding back innovation.

The “Train and Pray” Problem

Here’s how most ML development actually works:

  1. Spend weeks preparing your dataset
  2. Write your model code
  3. Launch a massive training job
  4. Cross your fingers and wait
  5. Discover your bug/data issue/fundamental flaw after burning through your compute budget

Sound familiar? This “train and pray” approach treats ML training like a black box gamble instead of engineering. And the costs are staggering.

The Hardware Reality Check

Let’s start with the inconvenient truth about hardware failures. When Meta trained their OPT-175B model on 992 A100 GPUs, they had 35 manual restarts over two months. The Llama 3.1 405B training? 417 hardware failures in 54 days.

Here’s the math that’ll keep you up at night: a 512-instance cluster has an expected hardware failure every 5 hours. And because most teams checkpoint every 3+ hours to avoid the 30-40 minute save overhead, each failure can waste hours of the most expensive computation on earth.
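
A quick back-of-envelope sketch of that math, under the simple assumption that failures land uniformly at random between checkpoints (so each failure costs, on average, half a checkpoint interval of redone work). The numbers are the illustrative figures quoted above, not measurements:

```python
# Illustrative back-of-envelope only; assumes failures land uniformly
# at random between checkpoints. Figures are the ones quoted above.
mtbf_hours = 5.0               # expected time between failures, 512-instance cluster
ckpt_interval_hours = 3.0      # how often most teams checkpoint
save_overhead_hours = 35 / 60  # ~30-40 min to write one checkpoint

lost_work_per_failure = ckpt_interval_hours / 2            # redone work per failure
redo_fraction = lost_work_per_failure / mtbf_hours         # share of cluster time redone
save_fraction = save_overhead_hours / ckpt_interval_hours  # share spent writing checkpoints

print(f"~{lost_work_per_failure:.1f} h of work redone per failure")
print(f"~{redo_fraction:.0%} of cluster time redone, ~{save_fraction:.0%} spent checkpointing")
```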

The Data Pipeline Disaster

But hardware is just the beginning. Google’s analysis of millions of training jobs revealed that jobs spend 30% of their time just waiting for data. Some models spend up to 70% of their time on I/O, with your expensive GPUs sitting idle while CPUs struggle to decode and preprocess the next batch.
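
You don't need a full profiler to know whether you're in that situation; timing how long each step blocks on the loader is usually enough. A minimal sketch, assuming a PyTorch-style iterable `loader` and a `train_step` callable (both hypothetical stand-ins for your own code):

```python
import time

def profile_data_wait(loader, train_step, num_batches=100):
    """Rough wall-clock split between waiting for data and doing compute.

    For GPU work, call torch.cuda.synchronize() before each clock read so
    asynchronous kernels are counted as compute rather than as data wait.
    """
    data_time = compute_time = 0.0
    end = time.perf_counter()
    for i, batch in enumerate(loader):
        data_time += time.perf_counter() - end       # blocked on the input pipeline
        start = time.perf_counter()
        train_step(batch)                            # forward / backward / optimizer
        compute_time += time.perf_counter() - start
        end = time.perf_counter()
        if i + 1 >= num_batches:
            break
    total = data_time + compute_time
    print(f"data wait: {100 * data_time / total:.1f}% of step time")
```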

This creates a vicious cycle: when your data pipeline is slow, debugging becomes prohibitively expensive. Teams ship with “good enough” data because the cost of another iteration to fix subtle quality issues feels too high. Then they burn weeks of compute training on fundamentally flawed datasets.

The Silent Killers: Logic Errors

The most insidious problems don’t crash your training—they silently corrupt it. Data leakage. Incorrect preprocessing. Subtle bugs like forgetting to apply the same augmentation to both images and segmentation masks, forcing your model to learn an impossible task.
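
The segmentation example is worth spelling out, because the fix is one line once you see it: draw the random parameters once, then apply the identical transform to the image and its mask. A minimal sketch with torchvision's functional API (the function calls are real, the surrounding structure is illustrative):

```python
import random
import torchvision.transforms.functional as TF

def augment_pair(image, mask):
    """Apply the *same* random flip and rotation to an image and its mask."""
    if random.random() < 0.5:        # draw the random decision once...
        image = TF.hflip(image)      # ...apply it to the image
        mask = TF.hflip(mask)        # ...and to the mask
    angle = random.uniform(-10, 10)  # same angle for both tensors
    image = TF.rotate(image, angle)
    mask = TF.rotate(mask, angle)    # rotating only the image would silently
    return image, mask               # corrupt every label in the dataset
```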

These issues often only surface when you discover your “successful” model performs terribly on critical subgroups—like Amazon’s hiring algorithm that discriminated against women because it was trained on historical male-dominated data.

Why Traditional Evaluation Is Broken

Even when training completes successfully, our evaluation methods give us false confidence. The standard practice—evaluating on a random test set with a single aggregate metric—is like judging a car’s safety by its average crash test score across all scenarios.

Random sampling fails to capture critical edge cases. Your model might achieve 99% accuracy overall while completely failing on vulnerable populations or rare but critical scenarios. You declare victory, invest more compute in hyperparameter tuning, maybe even deploy to production—all while sitting on a ticking time bomb.
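
A toy illustration of why the aggregate gives false confidence: when the critical slice is rare, it barely moves the headline number. The figures below are made up purely to show the arithmetic.

```python
# Made-up numbers, purely to show how an aggregate hides a failing slice.
majority = {"n": 9_900, "correct": 9_850}   # ~99.5% accuracy on the common slice
critical = {"n": 100,   "correct": 40}      # 40% accuracy on the slice that matters

overall = (majority["correct"] + critical["correct"]) / (majority["n"] + critical["n"])
print(f"overall accuracy:        {overall:.1%}")                              # 98.9%
print(f"critical-slice accuracy: {critical['correct'] / critical['n']:.1%}")  # 40.0%
```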

A Better Way: Pre-Flight, In-Flight, Post-Flight

The solution isn’t better hardware or faster GPUs. It’s treating ML development like the engineering discipline it should be. I propose a three-phase approach that shifts validation left and makes every compute cycle count:

ML Development Phases (diagram)

Phase 1: Pre-Flight Protocol (Before You Waste Money)

Stop launching training jobs blindly. Before you spin up that expensive cluster, spend a few minutes on fast, cheap checks.
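
What those checks look like depends on your stack, but the principle is that anything that can fail cheaply should fail before the cluster spins up: data splits that leak, a forward pass that produces NaNs, a model that cannot even memorize four examples. Here is a minimal sketch in PyTorch-flavored Python; the dataset fields and helper names are illustrative stand-ins, not a prescribed API.

```python
import torch

def preflight(model, train_ds, val_ds, loss_fn):
    # 1. Data sanity: no empty splits, no train/val overlap by sample id.
    assert len(train_ds) > 0 and len(val_ds) > 0, "empty split"
    leaked = {s["id"] for s in train_ds} & {s["id"] for s in val_ds}
    assert not leaked, f"{len(leaked)} samples leak from train into val"

    # 2. One real forward/backward pass: catches shape, dtype, and device bugs.
    batch = torch.utils.data.default_collate([train_ds[i] for i in range(4)])
    loss = loss_fn(model(batch["x"]), batch["y"])
    loss.backward()
    assert torch.isfinite(loss), "non-finite loss on the very first batch"

    # 3. Overfit a handful of samples: a model that cannot memorize 4 examples
    #    will not learn anything useful from the full dataset either.
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    for _ in range(200):
        opt.zero_grad()
        small_loss = loss_fn(model(batch["x"]), batch["y"])
        small_loss.backward()
        opt.step()
    assert small_loss.item() < 0.1 * loss.item(), "model cannot overfit a tiny batch"
```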

Phase 2: In-Flight Monitoring (Turn Training Into Debugging)

Make your training process transparent. Instead of staring at loss curves and hoping, instrument the run so problems surface while you can still do something about them.
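
In practice that means logging the signals that expose silent corruption early, not just the loss: gradient norms, data-wait fraction, throughput, and hard failures on non-finite values. A sketch of a per-step monitor, again in hypothetical PyTorch-flavored code; the `log` callable stands in for whatever experiment tracker you use:

```python
import math
import torch

def monitor_step(step, model, loss, step_seconds, data_seconds, log):
    """Log the signals that catch a dying run early, not just the loss curve."""
    # clip_grad_norm_ with an infinite max_norm only *measures* the global norm.
    grad_norm = float(torch.nn.utils.clip_grad_norm_(model.parameters(),
                                                     max_norm=float("inf")))
    log({
        "step": step,
        "loss": loss.item(),
        "grad_norm": grad_norm,                          # spikes often precede divergence
        "data_wait_frac": data_seconds / step_seconds,   # creeping input bottleneck
        "steps_per_sec": 1.0 / step_seconds,             # drops hint at stragglers
    })
    # Fail fast instead of burning a week on a run that is already dead.
    if not math.isfinite(loss.item()) or not math.isfinite(grad_norm):
        raise RuntimeError(f"non-finite loss or gradient at step {step}")
```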

Phase 3: Post-Flight Analysis (Beyond Vanity Metrics)

Interrogate your model like your business depends on it. Because it does.
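
Concretely, that means breaking every headline metric down by the slices the business actually cares about, and treating a regression on any critical slice as a failed run regardless of the aggregate. A sketch with pandas; the column names are placeholders for your own prediction log:

```python
import pandas as pd

def slice_report(df, slice_cols, threshold=0.90):
    """Per-slice accuracy from a prediction log with a boolean 'correct' column.

    Returns the slices that fall below `threshold`, worst first.
    """
    reports = []
    for col in slice_cols:                       # e.g. ["region", "device", "language"]
        per_slice = (
            df.groupby(col)["correct"]
              .agg(accuracy="mean", n="size")
              .reset_index()
              .rename(columns={col: "value"})
        )
        per_slice["slice"] = col
        reports.append(per_slice)
    report = pd.concat(reports, ignore_index=True)
    return report[report["accuracy"] < threshold].sort_values("accuracy")
```

Run it once for the new model and once for the incumbent: a slice that regressed is a launch blocker even if the overall number went up.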

Time to Stop the Madness

The current state of ML development is like building rockets by lighting the fuse and hoping for the best. We’ve gotten comfortable with waste because compute felt infinite and model performance kept improving despite our inefficiencies.

But as models get larger and more expensive, we can’t afford this anymore. Every failed training run is a lost opportunity to test a new idea. Every hour debugging preventable errors is an hour not spent on innovation.

The three-phase framework isn’t just about saving money—it’s about accelerating the pace of ML innovation by making our development process more scientific, predictable, and efficient.

The question isn’t whether you can afford to adopt these practices. It’s whether you can afford not to.
