Experiments as First-Class Citizens


Tags: mlops, experiment tracking, machine learning, reproducibility, ml infrastructure, software 2.0, devops

What’s the real weakest link in today’s machine learning lifecycle? Provocatively, it’s how we treat our experiments. Too often, ML experiments are second-class citizens – ephemeral runs tossed aside after extracting a metric or two. We instrument our production apps to the hilt, yet a training run that creates a model (the very core of “Software 2.0”) often isn’t tracked with the same rigor. This post argues that treating experiments as second-class is a fundamental flaw in modern ML, and it’s time to elevate experiment tracking to a first-class infrastructure concern – as indispensable as version control or observability.

The Second-Class Treatment of Experiments Today


In many teams, experiment tracking is an afterthought. Data scientists run dozens of training jobs, copy-paste metrics into spreadsheets, and manually version models by appending dates to file names. This ad-hoc approach is the norm, but it should be unacceptable. Imagine a software team not using Git or not logging production errors – unthinkable, right? Yet in ML, entire model histories vanish into personal notebooks or, worse, get lost entirely. Current ML tooling often treats experiments as a utility or add-on, rather than the beating heart of model development. The result: wasted effort, irreproducible results, and “mystery” models whose exact provenance is anyone’s guess.

Why does this happen? Partly because ML experiments have been seen as disposable – run it, get the accuracy, move on. Unlike code (which we version and review) or pipelines (which we carefully automate), experiments live in a wild west. This second-class status is a glaring weakness. It leads to situations where critical insights are trapped in one engineer’s memory or transient logs. It means teams struggle to reproduce promising results from a few months ago because no one remembers the exact parameter combo or dataset slice used. In a field where experimentation is the engine of progress, treating that engine as a throwaway is courting failure.


Lessons from DevOps and Science


We don’t have to accept this status quo – other disciplines have solved analogous problems. Take DevOps: modern software teams bake observability in from day one. Logging, metrics, and traces are first-class features of any serious application deployment, offering a holistic, automatic view in context, rather than requiring engineers to bolt on monitoring after the fact. Engineers would never say “let’s deploy to prod and maybe later figure out how to monitor it.” Observability isn’t a “nice-to-have”; it’s foundational. Similarly, ML experiment infrastructure should let us observe and record everything about a model run by default – not as a hacky add-on. We need the same mindset for experiments: tracking should be pervasive and automatic, not an optional extra.

We can also draw on the scientific method. In traditional science, lab notebooks and meticulous record-keeping are sacred. Every hypothesis, experimental setup, outcome, and odd observation is logged. Why? Because without a record, an experiment might as well not have happened – it’s not reproducible or trustworthy. A chemist who doesn’t write down experiment details is not taken seriously. Yet ML practitioners (who are essentially scientists training models) often run experiments without any persistent record beyond “model_v3.h5”. The scientific community’s focus on reproducibility and detailed experiment context (even subjective notes) is something ML must embrace. A result only counts if it can be reproduced and audited – and that requires treating the experiment log with the same reverence as a lab notebook or an audit trail.


Experiments: The Core of Software 2.0


There’s also a deeper reason to elevate experiments: in the era of Software 2.0, experiments are the development process. Andrej Karpathy famously described Software 2.0 as the paradigm where we don’t write detailed code; instead we train neural networks by feeding them data and tweaking parameters. In this view, the act of training a model is analogous to writing code – it’s how we “program” the behavior.

If training is the new coding, then experiment tracking is the new version control and debugger. Karpathy pointed out that much of the real work shifts to curating data and running training jobs, while only a small part remains writing glue code. That means our iteration loop is driven by experiments: try a new dataset tweak or architecture, run training, evaluate, repeat. It’s the core loop of ML development. Not treating that loop as a first-class citizen is like a software team neglecting their code repository. We need to recognize that an experiment run is a first-class artifact – as important as a Git commit in a software project. Every experiment carries the “source” of a model (data + code + hyperparameters) and produces its “binary” (the model weights), the same relationship source code and compiled binaries have in traditional software. It’s high time we managed these artifacts with equal seriousness.


Experiment Tracking: Progress So Far (and Why It’s Not Enough)


The good news is the community isn’t starting from scratch. Over the past few years, many teams have adopted experiment tracking tools such as MLflow, Weights & Biases, ClearML, Comet, and others. A cottage industry of experiment management platforms has emerged, all designed to treat ML experiments as first-class citizens in the workflow. These tools provide interfaces to log parameters and metrics, store model artifacts, record code versions, and sometimes even capture the environment or system stats. This is a big step forward – using any structured tracking is usually better than logs floating around or results in someone’s head. Many organizations now require that every training run be logged to a centralized dashboard where any team member can see what was tried.
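
As a concrete (if minimal) illustration, here is the kind of logging these tools enable, sketched with MLflow’s Python API; the experiment name, parameters, and metric values are placeholders, and the artifact call assumes the training step wrote a model.pkl file:

```python
import mlflow

# Group related runs under a named experiment so teammates can find them later.
mlflow.set_experiment("churn-model")

with mlflow.start_run(run_name="baseline-xgb"):
    # Hyperparameters for this run.
    mlflow.log_param("learning_rate", 0.1)
    mlflow.log_param("max_depth", 6)

    # ... training happens here ...

    # Final evaluation metric and the resulting model file.
    mlflow.log_metric("val_auc", 0.87)
    mlflow.log_artifact("model.pkl")  # assumes the training step wrote this file
```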

And yet, we’re still early. Today’s experiment trackers are where version control was in the early 1990s – clearly useful, but nowhere near the seamlessness of git/GitHub integration we take for granted now. In practice, many trackers end up being glorified metric databases; they don’t always integrate into the ML workflow as deeply as they should. Teams adopt a tool, but if it’s clunky, people might still bypass it (“I just ran a quick test, didn’t bother logging it”). In other cases, tools capture the basics but miss the rich context around experiments. In short, we have progress – experiment tracking is on the radar – but treating experiments as truly first-class is more aspiration than reality in 2025.

So what’s missing? Let’s dissect a few pain points with today’s tools and practices:

  • Metrics-First, Context-Last: Most experiment tracking systems are built around logging metrics, hyperparameters, and maybe system logs. That’s necessary, but not sufficient. Real experimental knowledge often includes qualitative feedback and domain-specific evaluation that numbers alone can’t capture. For example, a GAN model’s FID score might improve, but the research notes might say “images look oversharpened” – a subjective judgment that won’t appear in any metric. Current tools have limited support for logging this kind of rich context or subjective evaluation. At best, you might attach a comment or upload sample images manually. First-class experiments demand first-class context. We need to capture the “story” of the experiment – the reason it was run, the intuition behind changes, and the qualitative observations – not just the final accuracy number. Without this, experiment logs lack soul and insight, reducing their value for future researchers. (A sketch of what richer context logging could look like appears after this list.)

  • Broken Lineage (Frankenstein Models): In an ideal world, you could trace any model artifact back through all the experiments and data that produced it. In reality, model lineage often breaks the moment an experiment leaves the original context. Modern ML is collaborative and often “open-source” within a company. One team might take a baseline model from another team’s repository and fine-tune it for a new task. Or a pre-trained model from the outside (say a HuggingFace model) is pulled in and extended. These Frankenstein models – composed from pieces across repos and teams – typically lose their lineage. The experiment tracker in Team A’s system doesn’t know that Team B continued the lineage in another tool, or that an open-source model was a starting point. None of the mainstream tools seamlessly stitch lineage across such boundaries. This is a huge gap: when models move between environments, the provenance should travel with them. Failing to do so means we can’t fully trust or audit models, especially in regulated settings. True first-class treatment means never breaking the chain – every model’s history remains intact, no matter where it goes.

  • Infrastructure and Runtime Blind Spots: Anyone who trains models at scale knows that infrastructure events can make or break an experiment. A training run might slow to a crawl because a cluster node was overcommitted, or an AWS spot instance might get terminated at 90% completion, or a network glitch might silently corrupt a few data samples mid-run. These issues often live in separate logs (cloud monitoring dashboards, console outputs) and are not integrated with experiment tracking. Today’s tracking tools rarely record infrastructure-level telemetry alongside experiment results. As a result, when an experiment gives a weird result, we might not easily see that “oh, it was running on an instance that threw GPU memory errors” or “halfway through, there was a network timeout that stalled data loading.” Integrating infra events – GPU utilization, IO throughput, interruption notices – into the experiment record is essential for truly understanding outcomes. An experiment tracker should be part profiler, part ops monitor, not just a params-metrics database. This is analogous to observability in DevOps: you need context to interpret metrics. Right now, experiment tracking and ML infra monitoring live in silos. First-class experiments would mean merging these streams, so that an experiment’s log automatically notes, for instance, that “this run was degraded by a CPU throttling event at 2:05am.” (A sketch of pulling GPU telemetry into the run record appears after this list.)

  • Opaque Costs: ML experiments don’t just consume data and time – they burn money (think GPU hours on cloud instances). Surprisingly, most experiment tracking setups give you zero insight into the cost of a run. If you ask, “what did this promising experiment cost to get a 0.5% accuracy bump?” you’d have to manually cross-reference cloud bills or use separate tooling. In an era where efficient ML is crucial, this is a major blind spot. We treat cost as someone else’s problem, which leads to either overspending or fear-driven conservatism. Cost transparency per experiment should be a first-class feature: your experiment dashboard should show not just accuracy and latency, but dollars (or compute credits) consumed. This would allow teams to make informed decisions like “Model A is slightly less accurate but 5x cheaper to train than Model B – that’s a better trade-off for our business.” Some forward-thinking teams script their own logging of training time or cloud usage, but it’s not standardized. A future experiment infrastructure might plug into cloud APIs to log cost automatically, or at least estimate it from resource usage metrics. The key is: if experiments are first-class, their resource footprint is tracked just as closely as their performance. (A rough cost-estimation sketch appears after this list.)

  • Intrusive or Rigid Instrumentation: One practical gripe with many current tools is that they demand you contort your workflow to fit their mold. Some experiment tracking solutions make you use a specific project structure, or call their logging APIs everywhere, or execute training through their CLI/agent. This can be intrusive, forcing you to rewrite code or adopt a new ecosystem wholesale. It’s reminiscent of testing frameworks that won’t run unless you restructure your entire app – it raises the adoption barrier. In contrast, a first-class experiment system should meet you where you are. It should be as lightweight as adding a single line to initialize tracking, and everything else “just works” in the background.

    A Note on ClearML's approach

    A positive example here is ClearML, which emphasizes a minimal-footprint approach. You can take an existing training script and add one line like Task.init(project_name, task_name) at the top – and ClearML will automatically capture code, environment, hyperparameters, metrics, and even output plots with basically zero code changes, as described in this ClearML blog post. That’s the kind of low-friction instrumentation that encourages everyone to track experiments because it hardly takes any effort. ClearML’s SDK will even auto-log your argparse parameters and git commit, etc., without you doing anything extra. This least-intrusive style is promising: experiment tracking becomes ambient, not a chore. (A minimal sketch of this one-line pattern appears after this list.)

    However, even with ClearML we see the tension between an easy client-side and a heavy ecosystem. ClearML provides an entire suite – pipeline orchestration, dataset versioning, model registry, remote execution – all tightly integrated. This is powerful, but it can feel opinionated and not yet modular. Adopting the full ClearML stack means running its server and agents, and structuring pipelines in its specific way. Some have criticized it as a “rigid, all-in-one approach” where you’re kind of locked into their way of doing MLOps. For example, ClearML’s orchestration relies on an agent architecture that, while effective, introduces complexity and a learning curve. If you love their experiment tracking but want to use a different pipeline tool, or vice versa, it’s not straightforward to mix and match – the pieces are somewhat intertwined. This highlights an important point: the best experiment tracking is both comprehensive and composable. We want the breadth of capabilities (tracking, pipelines, data, etc.) but without forcing one monolithic platform. ClearML is on the right track with minimal intrusion for logging (in fact, its “zero overhead” integration is often praised), yet the broader ecosystem still feels early in terms of flexibility. The same can be said for others – many tools either do one thing well (metrics tracking) or attempt to do everything but then you must buy into their ecosystem. The sweet spot – a truly first-class experiment framework – likely requires more openness and modularity than we have today.
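
To ground these pain points, a few rough sketches follow. First, the context gap: nothing prevents logging the hypothesis and qualitative observations next to the metrics today; it just isn’t treated as first-class. The sketch below does it with MLflow, and every field name and value is purely illustrative:

```python
import json
import mlflow

# The qualitative "story" of the run, which no metric captures on its own.
notes = {
    "hypothesis": "Stronger augmentation should reduce overfitting on small batches.",
    "qualitative": "FID improved, but samples look oversharpened to the eye.",
    "follow_up": "Try a milder augmentation schedule next.",
}

with mlflow.start_run(run_name="gan-aug-sweep"):
    mlflow.log_metric("fid", 21.4)

    # Persist the narrative context as an artifact right next to the metrics.
    with open("notes.json", "w") as f:
        json.dump(notes, f, indent=2)
    mlflow.log_artifact("notes.json")

    # A tag makes the subjective verdict searchable later.
    mlflow.set_tag("subjective_quality", "oversharpened")
```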
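
Second, the infrastructure blind spot: a small sampler can record GPU utilization and memory into the same run as the training metrics. This sketch uses the NVML Python bindings (pynvml) together with MLflow step metrics; in a real training script the sampling would run in a background thread rather than a toy loop, and the run name is a placeholder:

```python
import time

import mlflow
import pynvml  # NVML Python bindings for querying NVIDIA GPUs

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

with mlflow.start_run(run_name="training-with-telemetry"):
    for step in range(100):
        # ... one training step would run here ...

        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)

        # Record infrastructure telemetry on the same timeline as loss/accuracy.
        mlflow.log_metric("gpu_util_percent", util.gpu, step=step)
        mlflow.log_metric("gpu_mem_used_gb", mem.used / 1e9, step=step)

        time.sleep(1)  # toy pacing; a real sampler would sit in a background thread

pynvml.nvmlShutdown()
```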
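
Third, opaque costs: even without billing-API integration, a rough per-run estimate takes only a few lines. The hourly rate below is an assumed placeholder that you would set per instance type:

```python
import time

import mlflow

# Assumed hourly price of the instance running this job (placeholder value).
HOURLY_RATE_USD = 3.00

with mlflow.start_run(run_name="cost-aware-training"):
    start = time.time()

    # ... training happens here ...

    elapsed_hours = (time.time() - start) / 3600
    mlflow.log_metric("wall_clock_hours", elapsed_hours)
    mlflow.log_metric("estimated_cost_usd", elapsed_hours * HOURLY_RATE_USD)
```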
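
Finally, the low-friction instrumentation the ClearML note describes: a single Task.init call added to an otherwise unmodified script, after which ClearML’s auto-logging picks up the Git commit, installed packages, argparse arguments, and console output. The project and task names here are placeholders:

```python
import argparse

from clearml import Task

# One line of instrumentation; the rest of the script stays exactly as it was.
task = Task.init(project_name="examples", task_name="baseline-training")

parser = argparse.ArgumentParser()
parser.add_argument("--learning-rate", type=float, default=0.001)
args = parser.parse_args()  # ClearML auto-logs these arguments

# ... the existing training loop continues unchanged ...
```

When instrumentation really is one line, there is little excuse left for skipping it on “quick tests.”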


Envisioning an Experiment-First Future


How would ML development look if we truly treated experiments as first-class citizens? Let’s paint that picture. In a future (not too far off, we hope), experiment tracking becomes so ingrained and advanced that it’s part of the fabric of ML work, much like git and continuous integration are for software. Here are some characteristics of an experiment-centric ML infrastructure:

  • Composable and Modular: The experiment tracking system of the future won’t be a one-size-fits-all monolith; it will be composable. You might plug it into your existing data pipeline tool, use its logging component standalone with your custom training loop, or swap out its visualization module for your own. It will offer open APIs and integrations so that it can hook into any stage of your ML workflow. Flexibility will be key – no more forcing you to abandon your favorite tools. Instead, experiment tracking will become a layer that easily interfaces with data versioning systems (like DVC), pipeline schedulers, and model serving platforms. Think of it like a “Git for experiments” that any tool can speak to. This also avoids vendor lock-in – your experiments could be exported or migrated as easily as moving a Git repo, because the data formats and interfaces will be standardized and open.

  • Reproducible by Default: In a first-class experiment world, every run is 100% reproducible automatically. The moment you hit “train,” the system logs not just metrics, but the exact git commit of the code, the Docker container or environment specs, the dataset version or hashes, random seeds, library versions – everything needed to recreate that run is captured without requiring the user to think about it. It’s essentially an automated lab notebook. If someone from another team or an auditor asks, “How was this model trained? Show me exactly,” you could point them to an experiment ID that contains a complete recipe. Re-running that recipe (now or 5 years from now) should yield the same result (barring hardware differences). Achieving this at scale means tight integration with data lineage tools and environment management. We might see experiments-as-code become a thing – where an experiment can be exported as a config or script that fully encapsulates it. Some teams do this with containers and YAML configs today, but in the future it will be seamless. Reproducibility won’t rely on heroics of individual engineers; it will be a baked-in feature. (A sketch of a self-capturing run manifest appears after this list.)

  • Collaborative and Shareable: First-class experiments will be as shareable and collaborative as code in GitHub. Imagine an “experiment hub” where any team member (or even the public, for open research) can browse experiments, comment on them, fork them, and merge improvements. This requires experiment tracking systems to have rich collaboration features: think commenting, tagging, linking experiments to issues or research notes, comparing results visually, and perhaps even version-controlling experiments themselves. In the future, when one data scientist finds a surprising result, they won’t just send a Slack message – they’ll send a link to the exact experiment, where colleagues can dig into the details, discuss inline, and perhaps branch off a new experiment derived from it. Lineage tracking will show these relationships (like a family tree of models). This is analogous to how developers collaborate on code; we need that level of social and collaborative infrastructure for experiments. It makes experiments truly first-class entities that teams work on together, rather than personal one-off runs.

  • Cost and Efficiency Awareness: The experiment platforms of tomorrow will treat computational cost as a first-class metric. Every experiment’s record will include how long it took, what resources it used, and an estimate of cost. Dashboards might display project burn-down charts for GPU hours or budget used vs. accuracy gained. Crucially, this will enable automatic optimization: the system could proactively catch waste (like alert you if two experiments are essentially duplicates, or if an experiment has run beyond a reasonable point of diminishing returns). It could also help decide scheduling – e.g. “This experiment is expensive; maybe run it on spot instances over the weekend.” Tying cost to experiments also improves decision-making at the management level – a CTO could finally answer “how much did it cost us to develop this model?” in a granular way. In short, treating experiments as first-class will bring DevOps-style cost monitoring into the model development process, fostering a culture of efficiency and awareness that is often lacking today.

  • End-to-End Traceability: In the future, nothing about an experiment will be opaque. From data to deployment, every step will be traceable. This means experiment tracking will merge with model deployment tracking – you’ll know which experiment produced the model that is currently in production, and what data that model saw, and so on. If a production model misbehaves, you can trace it back to an experiment run from six months ago and see exactly how it was trained and who approved that experiment. Lineage graphs will span across teams and tools: perhaps using standardized metadata, a model published to a model registry will carry an experiment ID that any org (or any user) can use to fetch its history. This end-to-end traceability is vital for governance (e.g., compliance audits) and for the technical ability to debug complex ML systems. When experiments are first-class citizens, they don’t live and die in a silo – they become part of the permanent record of the ML system, from research to production.
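
None of this requires exotic machinery. As a small taste of the “reproducible by default” idea, here is a sketch of a run manifest that a tracking layer could assemble automatically; the helper function, its fields, and the data path are illustrative rather than any existing API:

```python
import hashlib
import json
import platform
import subprocess
import sys


def build_run_manifest(data_path: str, seed: int) -> dict:
    """Collect enough context to recreate this run later (illustrative fields only)."""
    with open(data_path, "rb") as f:
        data_hash = hashlib.sha256(f.read()).hexdigest()

    return {
        # The seed is recorded here; a real setup would also set it for
        # random/numpy/torch before training starts.
        "random_seed": seed,
        "git_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip(),
        "python_version": sys.version,
        "platform": platform.platform(),
        "installed_packages": subprocess.check_output(
            [sys.executable, "-m", "pip", "freeze"], text=True
        ).splitlines(),
        "dataset_sha256": data_hash,
    }


if __name__ == "__main__":
    manifest = build_run_manifest("data/train.csv", seed=42)  # path is illustrative
    with open("run_manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)
```

Writing such a manifest alongside every run is cheap; the hard part, as argued above, is making it automatic and universal rather than an individual engineer’s heroics.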

In essence, a truly experiment-centric infrastructure would make ML development feel much more robust and mature. It would be like moving from tinkering in a garage to operating on a modern software assembly line – without losing the creative spark of experimentation. We’d gain confidence that we’re not losing information or time with each experiment, and we’d empower larger teams (and cross-team efforts) to scale up experimentation without descending into chaos.


Conclusion: Elevating Experimentation in the ML Lifecycle


It’s time to treat experiments as first-class citizens in ML, not as second-class chores. The provocative stance is that anything less is holding our field back. We have the analogies and prior art to guide us: software engineering gave us version control and agile processes, DevOps gave us observability and automation, and science gave us rigorous experimental protocols. ML can combine these lessons to build something uniquely suited to our needs – an experimentation culture backed by great tools, where no insight is lost and every model’s story is known.

Yes, many teams are already waking up to this reality (as evidenced by the plethora of tracking tools and emerging best practices), but we’re still early in the journey. The current tools, while useful, still show the cracks of second-class treatment – missing context, broken lineage, gaps in integration, and rigidity. By openly critiquing these gaps, we can push the industry toward better solutions. It’s not about one tool vs another; it’s about a mindset shift: experimentation isn’t just a means to an end (a model) – it is a fundamental part of the product.

CTOs and tech leaders should view experiment infrastructure as equal in importance to their data pipeline or CI/CD pipeline. Investing in this pays off in faster iteration, more reliable outcomes, and happier researchers (who spend less time playing detective and more time doing science). For ML practitioners, demanding first-class experiment support is akin to a developer demanding a good IDE and debugger – it’s not a luxury, it’s what you need to do your job properly.

In the coming years, expect to see experimentation platforms evolve dramatically. Perhaps we’ll look back on the days of manually tracking experiments the way we look back on coding without version control – how did we ever live like that? The future of ML belongs to those who can experiment faster, smarter, and more collaboratively. To get there, we must build infrastructure that treats experiments with the respect and centrality they deserve. It’s time to promote experiments from second-class to first-class citizens in our ML universe – and watch our capabilities soar as a result, a sentiment echoed by Neptune.ai and inherent in Karpathy’s Software 2.0 concept.