arXiv
The Correctness Illusion in LLM-Generated GPU Kernels
Dipankar Sarkar
Abstract
LLM-generated GPU kernels can pass the standard correctness test and still be wrong. We present the Correctness Illusion: the standard test bed for LLM-generated GPU kernels, KernelBench, under-specifies the input distribution, so kernels that pass it can still fail in production. We propose a 26-op corpus with op-schema-aware seeded fuzzing that catches bugs the standard test misses. We also release an open-source Hugging Face dataset and Space for community use.
The standard benchmark for LLM-generated GPU kernels, KernelBench, under-specifies the input distribution. LLM-generated kernels can pass the benchmark and still be wrong in production. We call this the Correctness Illusion.
We present the Correctness Illusion, an empirical finding that the standard test bed for LLM-generated GPU kernels (KernelBench) samples too narrow a range of inputs to establish correctness. Kernels that pass KernelBench can still fail on edge cases, large inputs, and unusual memory layouts. We propose a 26-op corpus with op-schema-aware seeded fuzzing that catches bugs the standard test misses. We release the corpus on Hugging Face and the harness as open source.
Why this matters
GPU kernels are the inner loop of every modern AI system. A wrong kernel that passes the test can corrupt a model in production. Closing the gap between “passes the test” and “works in production” is the difference between an AI system that ships and an AI system that survives.
Frequently Asked Questions
What is the Correctness Illusion in LLM-Generated GPU Kernels?
The Correctness Illusion is a phenomenon we identified and named: LLM-generated GPU kernels can pass the standard correctness test (KernelBench) and still be wrong in production. The standard test under-specifies the input distribution (too easy, too narrow), so kernels that pass it can still fail on real workloads. We published this paper (arXiv:2606.20128, June 2026) along with a 26-op corpus and an op-schema-aware seeded fuzzing harness that catches bugs the standard test misses.
What is KernelBench and why does it under-specify correctness?
KernelBench, from the Stanford Scaling Intelligence Lab, is the de facto standard benchmark for LLM-generated GPU kernels. It is widely used to evaluate tools like AURA, KernelAgent, and GEAK. The problem: it samples inputs from a narrow distribution, so kernels that pass the test are not validated against edge cases, large inputs, or unusual memory layouts. We propose a 26-op corpus that exercises the full op schema and uses seeded fuzzing to find bugs the standard test misses.
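To make the memory-layout point concrete, here is a toy sketch in plain Python. It is not taken from KernelBench or from our corpus. The StridedView class and both row-sum functions are hypothetical stand-ins, with candidate_row_sums playing the role of a generated kernel that silently assumes a contiguous row-major buffer. A test that only feeds contiguous inputs would pass it.

```python
from dataclasses import dataclass


@dataclass
class StridedView:
    """Toy 2-D tensor view (hypothetical): flat buffer, shape, element strides."""
    data: list
    shape: tuple
    strides: tuple

    def at(self, i, j):
        # Honor the strides when locating element (i, j) in the flat buffer.
        return self.data[i * self.strides[0] + j * self.strides[1]]


def transpose(view):
    """Return a transposed view over the same buffer; no data is copied."""
    return StridedView(view.data,
                       (view.shape[1], view.shape[0]),
                       (view.strides[1], view.strides[0]))


def reference_row_sums(view):
    """Reference implementation: reads every element through the strides."""
    rows, cols = view.shape
    return [sum(view.at(i, j) for j in range(cols)) for i in range(rows)]


def candidate_row_sums(view):
    """Stand-in for a generated kernel: assumes contiguous row-major layout."""
    rows, cols = view.shape
    return [sum(view.data[i * cols + j] for j in range(cols)) for i in range(rows)]


contiguous = StridedView([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], (2, 3), (3, 1))
transposed = transpose(contiguous)  # shape (3, 2), strides (1, 3)

# The narrow test (contiguous input only) cannot tell the two apart.
print(candidate_row_sums(contiguous) == reference_row_sums(contiguous))
# The transposed view exposes the layout assumption.
print(candidate_row_sums(transposed) == reference_row_sums(transposed))
```

On the contiguous input both functions agree. On the transposed view the candidate reads the wrong elements and returns different sums, which is the kind of failure a narrow input distribution never triggers.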
What is op-schema-aware seeded fuzzing?
Op-schema-aware seeded fuzzing is our technique for finding bugs in LLM-generated GPU kernels. We take the op schema (the input shapes, dtypes, and constraints) and generate seeds that exercise its boundaries: large inputs, edge-case values, and unusual memory layouts. The seeds are deterministic and reproducible. The harness compares the LLM-generated kernel's output to a reference implementation and reports discrepancies. We release this as open source on Hugging Face.
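The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released harness: OpSchema, generate_cases, check, and the two softmax functions are hypothetical names, the schema is reduced to a range of 1-D input lengths, and the "kernel" is an ordinary Python function. Python's math.exp raises OverflowError where a GPU kernel would typically produce inf or NaN, so the sketch counts that exception as a discrepancy.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class OpSchema:
    """Hypothetical schema: op name plus the valid range of 1-D input lengths."""
    name: str
    min_len: int
    max_len: int


def reference_softmax(xs):
    """Reference: numerically stable softmax (subtract the max first)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def candidate_softmax(xs):
    """Stand-in for a generated kernel: naive softmax, overflows on large values."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def generate_cases(schema, seed, n_random=20):
    """Build boundary cases from the schema, then seeded random cases."""
    rng = random.Random(seed)  # a fixed seed makes the case list reproducible
    cases = []
    for n in (schema.min_len, schema.max_len):  # boundary lengths
        cases.append(("zeros", [0.0] * n))
        cases.append(("large_magnitude", [1000.0] * n))
        cases.append(("benign", [rng.gauss(0.0, 1.0) for _ in range(n)]))
    for _ in range(n_random):
        n = rng.randint(schema.min_len, schema.max_len)
        scale = rng.choice([1.0, 1e2, 1e3])
        cases.append((f"random_scale_{scale:g}",
                      [rng.gauss(0.0, scale) for _ in range(n)]))
    return cases


def check(candidate, reference, cases, tol=1e-9):
    """Run the candidate against the reference and collect discrepancies."""
    failures = []
    for label, xs in cases:
        expected = reference(xs)
        try:
            got = candidate(xs)
        except (OverflowError, ZeroDivisionError) as exc:
            failures.append((label, type(exc).__name__))
            continue
        mismatch = len(got) != len(expected) or any(
            not math.isclose(g, e, rel_tol=tol, abs_tol=tol)
            for g, e in zip(got, expected))
        if mismatch:
            failures.append((label, "mismatch"))
    return failures


schema = OpSchema(name="softmax", min_len=1, max_len=64)
cases = generate_cases(schema, seed=0)
failures = check(candidate_softmax, reference_softmax, cases)
print(f"{len(failures)} of {len(cases)} cases failed")
```

A single benign input would pass the naive candidate. The schema-derived large_magnitude cases are what expose it, and rerunning with the same seed reproduces the same case list.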
Where can I read the paper and use the corpus?
The paper is on arXiv (2606.20128). The 26-op corpus is on Hugging Face at huggingface.co/dipankarsarkar/gpuemu-corpus. The Hugging Face Space at huggingface.co/dipankarsarkar (The Correctness Illusion) lets you run the harness interactively. The source code is on GitHub at github.com/sarkar-dipankar/gpuemu-corpus.
Why does this matter for production AI?
GPU kernels are the inner loop of every modern AI system. A wrong kernel that passes the test can corrupt a model in production in ways that are nearly impossible to debug. The Correctness Illusion is the gap between 'passes the test' and 'works in production'. Closing this gap is the difference between an AI system that ships and an AI system that survives.