Question 1

What is the Correctness Illusion in LLM-Generated GPU Kernels?

Accepted Answer

The Correctness Illusion is a phenomenon we identified and named: LLM-generated GPU kernels can pass the standard correctness test (KernelBench) and still be wrong in production. The standard test under-specifies the input distribution (its inputs are too easy and too narrow), so a kernel that passes can still fail on real workloads. We describe the phenomenon in a paper (arXiv:2606.20128, June 2026), released together with a 26-op corpus and an op-schema-aware seeded fuzzing harness that catches the bugs the standard test misses.

Question 2

What is KernelBench and why does it under-specify correctness?

Accepted Answer

KernelBench is the de-facto standard benchmark for LLM-generated GPU kernels, from the Stanford Scaling Intelligence Lab. It is widely used to evaluate tools like AURA, KernelAgent, and GEAK. The problem: it samples inputs from a narrow distribution, so kernels that pass the test are not validated against edge cases, large inputs, or unusual memory layouts. We propose a 26-op corpus that exercises the full op schema and uses seeded fuzzing to find bugs the standard test misses.
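To make the under-specification concrete, here is a minimal pure-Python sketch. It is not taken from KernelBench or from the paper's corpus: `TILE`, `tiled_sum_buggy`, and `passes` are hypothetical names, and the tail-handling bug is just one illustrative example of a defect that a fixed, tile-aligned input never exercises.

```python
import random

TILE = 4  # hypothetical tile width, standing in for a GPU block size


def reference_sum(xs):
    """Trusted reference implementation."""
    return sum(xs)


def tiled_sum_buggy(xs):
    """Sums full tiles only; silently drops the tail when len(xs) % TILE != 0."""
    total = 0.0
    full = (len(xs) // TILE) * TILE
    for start in range(0, full, TILE):
        total += sum(xs[start:start + TILE])
    return total


def passes(kernel, xs, tol=1e-6):
    """True when the kernel agrees with the reference within tol."""
    return abs(kernel(xs) - reference_sum(xs)) <= tol


rng = random.Random(0)

# Narrow distribution: one fixed, tile-aligned length with small values.
narrow_input = [rng.uniform(-1.0, 1.0) for _ in range(1024)]

# Edge case: a length that is not a multiple of TILE.
edge_input = [1.0] * 1023

narrow_ok = passes(tiled_sum_buggy, narrow_input)
edge_ok = passes(tiled_sum_buggy, edge_input)
print("tile-aligned input passes:", narrow_ok)
print("unaligned input passes:", edge_ok)
```

The tile-aligned check passes while the 1023-element check fails, because the last three elements are never summed. A test that only ever draws the first kind of input would report this kernel as correct.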

Question 3

What is op-schema-aware seeded fuzzing?

Accepted Answer

Op-schema-aware seeded fuzzing is our technique for finding bugs in LLM-generated GPU kernels. We take the op schema (input shapes, dtypes, and constraints) and generate seeds that exercise its boundaries: large inputs, edge-case values, and unusual memory layouts. The seeds are deterministic, so every failure is reproducible. The harness compares the LLM-generated kernel's output to a reference implementation and reports any discrepancies. We release this as open source on Hugging Face.
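The idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the released harness: `OpSchema`, `boundary_lengths`, `generate_case`, `fuzz`, and the toy `max_buggy` kernel are all hypothetical, and the schema here covers only a 1-D input length range and a value range rather than full shapes, dtypes, and layouts.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class OpSchema:
    """Hypothetical, simplified schema: 1-D length range and value range."""
    name: str
    min_len: int
    max_len: int
    lo: float
    hi: float


def boundary_lengths(schema, tile=4):
    """Lengths at the schema limits and around one assumed tile width."""
    cands = {schema.min_len, schema.min_len + 1,
             schema.max_len - 1, schema.max_len,
             tile - 1, tile, tile + 1}
    return sorted(n for n in cands if schema.min_len <= n <= schema.max_len)


def generate_case(schema, seed):
    """Derive one input purely from (schema, seed), so it is reproducible."""
    rng = random.Random(seed)
    lengths = boundary_lengths(schema)
    # Even seeds walk the boundary lengths; odd seeds sample any legal length.
    if seed % 2 == 0:
        n = lengths[(seed // 2) % len(lengths)]
    else:
        n = rng.randint(schema.min_len, schema.max_len)
    # Seeds cycle through: all values at lo, all values at hi, random values.
    mode = seed % 3
    if mode == 0:
        return [schema.lo] * n
    if mode == 1:
        return [schema.hi] * n
    return [rng.uniform(schema.lo, schema.hi) for _ in range(n)]


def fuzz(kernel, reference, schema, seeds, rel_tol=1e-6):
    """Return (seed, length, got, want) for every disagreement."""
    failures = []
    for seed in seeds:
        xs = generate_case(schema, seed)
        got, want = kernel(xs), reference(xs)
        if abs(got - want) > rel_tol * max(1.0, abs(want)):
            failures.append((seed, len(xs), got, want))
    return failures


def max_buggy(xs):
    """Toy kernel under test: accumulator starts at 0.0 instead of -inf."""
    best = 0.0
    for x in xs:
        best = max(best, x)
    return best


schema = OpSchema(name="reduce_max", min_len=1, max_len=64, lo=-100.0, hi=100.0)
failures = fuzz(max_buggy, max, schema, seeds=range(30))
for seed, n, got, want in failures[:3]:
    print(f"seed={seed} len={n} got={got} want={want}")
```

Because each case depends only on the schema and the seed, a reported seed can be replayed with `generate_case(schema, seed)` to reproduce the failing input. In this sketch, every seed divisible by 3 pins all values to the lower bound, which exposes the zero-initialized accumulator.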

Question 4

Where can I read the paper and use the corpus?

Accepted Answer

The paper is on arXiv (2606.20128). The 26-op corpus is on Hugging Face at huggingface.co/dipankarsarkar/gpuemu-corpus. The Hugging Face Space at huggingface.co/dipankarsarkar (The Correctness Illusion) lets you run the harness interactively. The source code is on GitHub at github.com/sarkar-dipankar/gpuemu-corpus.

Question 5

Why does this matter for production AI?

Accepted Answer

GPU kernels are the inner loop of every modern AI system. A wrong kernel that passes the test can corrupt a model in production in ways that are nearly impossible to debug. The Correctness Illusion is the gap between 'passes the test' and 'works in production'. Closing this gap is the difference between an AI system that ships and an AI system that survives.

The Correctness Illusion in LLM-Generated GPU Kernels

