
Why LLMs Can't Give the Same Answer Twice



Leon Oliver Wolf
September 16, 2025 - 3 min read

You'd naturally expect some variety when repeatedly asking a large language model (LLM) the same question; that variety is part of what makes these systems useful and creative. The surprise comes when you dial all the randomness settings down to zero, essentially telling the model "be completely predictable," yet it keeps giving you different responses anyway. This stubborn inconsistency persists even when researchers run their own models locally with full control over the software stack.

A new blog post by Horace He, written in collaboration with Thinking Machines Lab, "Defeating Nondeterminism in LLM Inference," digs deep into this phenomenon and uncovers a counterintuitive truth: the primary reason nearly all LLM inference endpoints are nondeterministic is that the load (and thus the batch size) varies nondeterministically.

The Wrong Culprit

For years, the AI community has pointed fingers at GPU parallelism. The prevailing theory (what the authors call the "concurrency + floating point" hypothesis) suggests that concurrent GPU cores finishing in different orders, combined with floating-point arithmetic quirks, creates randomness in the results. Although this explanation isn't entirely wrong, it misses the bigger picture.

The Real Villain: Batch Size

The post reveals a more fundamental issue. When you send a request to an LLM inference server, that request gets batched with others for efficiency. But here's the crucial part: from the user's perspective, the load the server is under is effectively "nondeterministic". The load determines the batch size the kernels run with, and that in turn changes the eventual result of each individual request!

This happens because most GPU kernels aren't "batch-invariant", which is a fancy way of saying that the kernel doesn't produce identical results when it processes an item alone versus when it processes that same item as part of a group. It's a mathematical peculiarity: in a matrix multiplication, each element of the batch should be computed independently of the others, yet when the batch size changes, each element can end up with a different result.
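Here is a minimal PyTorch sketch of what a lack of batch invariance looks like in practice (assuming PyTorch and a CUDA GPU; the matrix sizes are arbitrary, and whether the printed difference is nonzero depends on your hardware and which kernels get selected):

```python
import torch

# Compare one row of a matrix multiplication computed on its own
# versus computed as part of a larger batch.
torch.manual_seed(0)
A = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)
B = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)

out_batched = torch.mm(A, B)     # first row computed within a batch of 2048 rows
out_single = torch.mm(A[:1], B)  # the same first row computed with a batch of 1

# If the kernel were batch-invariant, this would print 0. On many GPUs it
# doesn't, because a different kernel (and a different reduction order) is
# chosen for the two batch sizes.
print((out_batched[:1] - out_single).abs().max())
```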

The root cause traces back to floating-point arithmetic. Unlike the integers we learn in school, floating-point numbers break a fundamental rule: (a + b) + c ≠ a + (b + c). This non-associativity means that adding numbers in different orders produces different results.

Add floating-point numbers in a different order and you can get a different result. Consequently, when batch sizes change, the order in which GPU kernels accumulate intermediate results changes, leading to different final answers.
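A two-line illustration in Python makes this concrete (ordinary double-precision floats; the values are chosen only to make the effect obvious):

```python
# Floating-point addition is not associative: the grouping changes the result.
a, b, c = 0.1, 1e20, -1e20

print((a + b) + c)  # 0.0 -- 0.1 is swallowed when added to the huge value first
print(a + (b + c))  # 0.1 -- the huge values cancel first, so 0.1 survives
```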

Real-World Impact

When tested with Qwen3-235B-A22B-Instruct-2507 and 1,000 sampled completions at a temperature of 0, using the prompt "Tell me about Richard Feynman", the standard setup produced 80 unique completions, the most common of which occurred 78 times. In contrast, with batch-invariant kernels, all 1,000 completions were identical.

But what about the cost? According to the authors, the performance hit is manageable and there is room for optimisation; the current, unoptimised implementation runs at around 60% of the speed of standard vLLM.

The Bigger Picture

The post shows that what seemed like an intractable hardware limitation or an opaque black box may actually be a solvable software problem. Whether understanding the mathematical foundations of our systems will enable us to build truly reproducible, and thus more trustworthy and safe, AI remains to be seen.




Machine Learning · Reproducibility · Batching · Floating-point · Kernels · Non-deterministic