VecDB@VLDB 2026

An Input-Regime Audit of Conflict Detection for Retrieval-Augmented Generation

Dipankar Sarkar

arXiv preprint (submitted to VecDB@VLDB 2026)

Abstract

Retrieval-Augmented Generation (RAG) systems often produce wrong answers when the retrieved context conflicts with the model's parametric knowledge. We propose an Input-Regime Audit framework that characterises the conflict patterns that cause RAG failures, and we show that the standard CARS score hides both the reasoning behind a rating and ceiling effects. We release the audit toolkit and a benchmark dataset on Hugging Face.

Frequently Asked Questions

What is the Input-Regime Audit framework for RAG?

The Input-Regime Audit is a framework for characterising the conflict patterns that cause Retrieval-Augmented Generation (RAG) systems to fail. It is the practice paper for RAG conflict detection. The framework identifies the input regimes (inter-context conflict, compliance regime, metadata weighting) that lead to failures, and provides a 5-step diagnostic for any RAG system. Submitted to VecDB@VLDB 2026.

What is the CARS score and why does it hide ceiling effects?

CARS (Context-Adherence Rating Score) is the standard metric for evaluating RAG faithfulness. We show in this paper that the CARS score has two failure modes: (1) it hides the reasoning behind the rating, and (2) it has ceiling effects that prevent differentiation between good and great RAG systems. Our Input-Regime Audit complements CARS with diagnostic information about the input regime that drove the rating.
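To make the ceiling effect concrete, here is a minimal sketch of one way to check for it. It is not the paper's analysis: the function name `ceiling_fraction`, the assumption that scores lie on a 0 to 1 scale, the 0.05 tolerance, and the sample scores are all hypothetical and chosen only for illustration.

```python
def ceiling_fraction(scores, max_score=1.0, tol=0.05):
    """Return the share of scores within `tol` of the top of the scale.

    Assumption: scores are floats on a 0..max_score scale. The real CARS
    scale is not specified here, so adjust max_score and tol as needed.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    near_max = sum(1 for s in scores if s >= max_score - tol)
    return near_max / len(scores)


# Made-up scores for two hypothetical systems.
system_a = [0.96, 0.97, 1.0, 0.98, 0.99]
system_b = [0.99, 1.0, 0.98, 0.97, 0.70]

frac_a = ceiling_fraction(system_a)
frac_b = ceiling_fraction(system_b)
print(f"system A near ceiling: {frac_a:.0%}")
print(f"system B near ceiling: {frac_b:.0%}")
```

When most scores for every system under comparison sit near the top of the scale, the metric leaves little room to tell a good system from a great one, which is where per-regime diagnostics become useful.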

What are the 5 steps of the audit?

The 5 steps are: (1) sample 100 to 500 RAG queries across the input regimes; (2) compute CARS for each response; (3) identify the input regimes that drive the lowest CARS scores; (4) for each low-CARS regime, manually inspect the responses to identify the failure mode; (5) revise the retrieval setup and the prompt to address the failure modes, then re-audit. The full methodology is in the paper and on GitHub.
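Steps 2 and 3 can be sketched as a simple group-and-rank pass. This is an illustrative sketch, not the released toolkit: the function name `rank_regimes_by_cars`, the `(regime, score)` record shape, and the scores are assumptions. The regime labels are taken from the framework description above.

```python
from collections import defaultdict
from statistics import mean


def rank_regimes_by_cars(records):
    """Group (regime, cars_score) pairs by regime, lowest mean score first.

    Returns a list of (regime, mean_score, sample_count) tuples.
    Assumption: each record is already labelled with its input regime.
    """
    by_regime = defaultdict(list)
    for regime, score in records:
        by_regime[regime].append(score)
    summary = [
        (regime, mean(scores), len(scores))
        for regime, scores in by_regime.items()
    ]
    return sorted(summary, key=lambda row: row[1])


# Made-up CARS scores, for illustration only.
sample = [
    ("inter-context conflict", 0.25),
    ("inter-context conflict", 0.75),
    ("compliance regime", 1.0),
    ("compliance regime", 0.5),
    ("metadata weighting", 0.25),
    ("metadata weighting", 0.25),
]

ranked = rank_regimes_by_cars(sample)
for regime, avg, count in ranked:
    print(f"{regime}: mean CARS {avg:.2f} over {count} responses")
```

The regimes at the top of the ranking are the candidates for the manual inspection in step 4. Keeping the sample count next to the mean helps avoid over-reading a regime that only a handful of queries cover.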

Where can I read the paper and use the toolkit?

The paper is on arXiv (2606.27396). The reproducibility artefacts are on GitHub at github.com/sarkar-dipankar/ebrag-vecdb-2026-paper (MIT-licensed). The benchmark dataset is on Hugging Face. The full methodology is in Section 3 of the paper.
