You’ve been there. You’re building a critical pipeline that relies on consistent output from a Large Language Model. You did everything right. You set temperature=0. You triple-checked the prompt. It should be deterministic. It has to be.

Table of Contents

  1. The Gremlins in the Machine
  2. Gremlin #1: The Batching Mirage
  3. Gremlin #2: Hardware Roulette
  4. So, What Can You Actually Do About It?
  5. Sources

And yet, the model spits out a slightly different response. You start chasing ghosts, wondering if you’re losing your mind. You’re not. You’ve just fallen for one of the biggest myths in applied AI.

Burn this into your brain: temperature=0 only makes the token sampling step deterministic. It forces the model to perform greedy decoding, always picking the token with the highest probability. But it does absolutely nothing to tame the computational chaos happening under the hood before sampling ever occurs.
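
Here is what that means in code. This is a minimal, illustrative sketch of a sampling step, not any provider's actual implementation, and the logit values are made up to show the point:

import torch

def sample_next_token(logits: torch.Tensor, temperature: float) -> int:
    # temperature=0 means greedy decoding: always take the single most likely token.
    if temperature == 0:
        return int(torch.argmax(logits).item())
    # Otherwise, rescale the logits and sample from the resulting distribution.
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1).item())

# Greedy decoding is perfectly deterministic for a fixed set of logits...
logits = torch.tensor([2.00000, 1.99999, -1.0])
print(sample_next_token(logits, temperature=0))  # always token 0

# ...but if the forward pass hands you logits that differ by a hair,
# greedy decoding deterministically picks a different token.
perturbed = torch.tensor([1.99998, 1.99999, -1.0])
print(sample_next_token(perturbed, temperature=0))  # token 1

The sampler is only as deterministic as the logits it is handed.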

Think of it like trying to hit the same spot twice with a medieval cannon. Your temperature=0 setting is like fixing the cannon’s elevation and using the same amount of gunpowder for every shot. In a perfect, theoretical world, every cannonball would land in the same crater. But in the real world, tiny, invisible variables ruin your perfect precision. A gust of wind, a change in humidity, the microscopic expansion of the cannon’s metal. All of these ensure the next shot lands millimeters or even meters away.

Your LLM is that cannon. For a long time, the industry blamed the “wind” on floating-point math. It was a convenient, nerdy explanation. It’s also mostly wrong.

The Gremlins in the Machine

To stop chasing ghosts, you need to know what’s actually haunting your system. The non-determinism you’re seeing is a direct result of how these massive models are run on modern hardware, optimized for throughput, not consistency.

The common hypothesis, the one you have probably heard, blames a combination of Floating-Point Fuzz and GPU concurrency. The story goes that because (a + b) + c isn't bit-for-bit identical to a + (b + c) in floating-point math, the parallel nature of GPUs creates chaos.
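
The non-associativity part is real, and you don't even need a GPU to see it:

# Floating-point addition is not associative: changing the grouping changes the rounding.
a, b, c = 0.1, 1e20, -1e20
print((a + b) + c)   # 0.0  (the 0.1 is lost when it's added to 1e20 first)
print(a + (b + c))   # 0.1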

While that sounds plausible, it doesn’t hold up. Run this on your own GPU:

import torch

A = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
B = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)

# Multiply the same matrices a thousand times and demand bit-identical results.
ref = torch.mm(A, B)
for _ in range(1000):
    assert (torch.mm(A, B) - ref).abs().max().item() == 0

It passes. A thousand times out of a thousand. The individual kernels used in an LLM's forward pass are, in fact, deterministic from run to run, as long as they are given exactly the same inputs.

So where’s the ghost? The real gremlin is far more subtle. It’s about the context of the math changing.

Gremlin #1: The Batching Mirage

The true culprit is a lack of batch invariance. From the inference server’s perspective, the system is deterministic. Given the same batch of user requests, it will always produce the same output.

The problem is, your request is never the only one in the batch.

Think of a commercial kitchen oven. Your job is to bake one perfect, identical cookie every time. If you place a single ball of dough on a baking sheet and put it in the oven for 10 minutes at 350 degrees, you get a specific result. But the kitchen’s goal is to bake as many cookies as possible. Your single cookie gets thrown onto a tray with eleven other cookies of different sizes. Now, the heat distribution in the oven is different. The airflow changes. Your cookie bakes slightly differently. It’s still a cookie, but it’s not identical to the one baked alone.

Your API request is that single cookie. The LLM inference server is the oven. The “other cookies” are the other users’ requests that the server bundles with yours to maximize GPU utilization.

Those tiny differences are all it takes. A token that was winning by a margin of 0.0000001 might now lose. The argmax flips, a different word is chosen, and the rest of the generation veers onto a completely new path. From your perspective, the result is random because you have no control over the server load and, therefore, no control over the batch your request ends up in.
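
You can catch this in the act with the same matrices from before. Whether the difference below actually shows up depends on your GPU and the kernels PyTorch selects, so treat it as an experiment rather than a guarantee:

import torch

A = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
B = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)

# The first row multiplied on its own (a "batch" of one)...
alone = torch.mm(A[:1], B)
# ...versus the same row computed as part of the full batch.
in_batch = torch.mm(A, B)[:1]

# Mathematically identical. In practice the kernel may use a different
# reduction strategy for different shapes, and the results can disagree
# in the last bits.
print((alone - in_batch).abs().max().item())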

Gremlin #2: Hardware Roulette

Even if you could guarantee the exact same batch every time, you’d still be playing a game of hardware roulette. The major cloud providers that serve these models operate massive, heterogeneous fleets of machines. Your API request may be handled by a server with NVIDIA H100 GPUs one minute, and then by one with older A100 GPUs the next.

While these chips are compatible, they are not identical. Different hardware architectures, and even different versions of the underlying CUDA drivers, can implement mathematical operations in slightly different ways. This is a less frequent cause of non-determinism than batching, but it makes true bit-for-bit reproducibility in a public cloud environment a practical impossibility.
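
If you self-host, at least make the roulette visible by logging which hardware and software stack served each generation; this snippet assumes a CUDA build of PyTorch:

import torch

# Record the silicon and the CUDA/cuDNN versions next to every generation,
# so that when outputs drift you can see whether the stack changed underneath you.
print(torch.cuda.get_device_name(0))
print(torch.version.cuda)
print(torch.backends.cudnn.version())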

So, What Can You Actually Do About It?

You can't force a cloud provider to stop batching requests or use a single, homogeneous server fleet. But you're an engineer, and your job is to build systems that work in the real world.

Set Deterministic Parameters. This is table stakes. Always set temperature=0. If the API has a seed parameter, use it. A seed helps control any intentional randomness, but as we’ve seen, it won’t solve the computational gremlins. It’s a necessary but insufficient step.
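
As a concrete sketch, here is what that looks like with the OpenAI Python SDK; the seed parameter and the system_fingerprint field are provider-specific, and the model name is just an example, so adapt it to whatever API you actually call:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Summarize this ticket in one sentence."}],
    temperature=0,  # greedy decoding: deterministic sampling, not deterministic logits
    seed=42,        # pins intentional randomness where the provider supports it
)

# If the backend configuration changes between calls, this fingerprint changes too,
# and bit-for-bit reproducibility was never on the table anyway.
print(response.system_fingerprint)
print(response.choices[0].message.content)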

Design for Resilience. This is the most important lesson. Stop relying on exact string matching. Build robust parsers that can handle slight variations. Use techniques like Self-Consistency prompting. Focus on semantic correctness, not lexical identity. Your system should be robust to a few words changing here and there.
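
For example, here is a sketch that combines lenient parsing with a simple self-consistency vote. The generate callable is a stand-in for whatever call you make to your model, and the approve/reject "verdict" schema is invented for this illustration:

import json
import re
from collections import Counter

def extract_verdict(text: str) -> str | None:
    # Lenient parsing: accept the verdict whether it arrives as bare JSON,
    # JSON wrapped in prose, or a plain "approve"/"reject" buried in the text.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            if isinstance(parsed, dict) and "verdict" in parsed:
                return str(parsed["verdict"]).lower()
        except json.JSONDecodeError:
            pass
    for word in ("approve", "reject"):
        if word in text.lower():
            return word
    return None

def self_consistent_verdict(generate, prompt: str, n: int = 5) -> str | None:
    # Self-consistency: ask several times and keep the majority answer,
    # so one run's wording flip can't change the downstream decision.
    verdicts = [extract_verdict(generate(prompt)) for _ in range(n)]
    votes = Counter(v for v in verdicts if v is not None)
    return votes.most_common(1)[0][0] if votes else None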

Enforce Batch Invariance (If You Can). This is the new frontier. If you are self-hosting an open-source model, you have more control. For projects where determinism is non-negotiable, you can now use libraries designed to solve this exact problem. A new library, thinking-machines-lab/batch-invariant-ops, allows you to swap out standard PyTorch operators for versions that are guaranteed to be batch-invariant. This provides a path to true determinism by ensuring the “oven” bakes your “cookie” the same way, no matter how many other cookies are on the tray.
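
In rough strokes, usage looks like the sketch below, assuming the set_batch_invariant_mode context manager shown in the repository's examples; double-check the project's current README before you build on it:

import torch
from batch_invariant_ops import set_batch_invariant_mode  # assumed import, per the repo's examples

A = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)
B = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)

# Inside the context, supported PyTorch ops are routed to batch-invariant
# implementations, so a row computed alone should match the same row
# computed inside a larger batch.
with set_batch_invariant_mode():
    alone = torch.mm(A[:1], B)
    in_batch = torch.mm(A, B)[:1]
    print((alone - in_batch).abs().max().item())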

Stop treating LLMs like calculators. They are not. They are complex systems running on hardware optimized for speed and throughput, not for the pedantic consistency we expect from traditional software. You can force them to be predictable if you have full control over the hardware and the software stack. If.

Sources

ATTENTION: while reading or watching the sources below, remember that floating-point math on its own isn't the cause of the non-determinism; the lack of batch invariance is.

