FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning

Dartmouth College, Amazon
Work done at Amazon
Example of refusal behavior

Examples include a non-reasoning LLM that directly refuses a benign prompt and a reasoning model that fully complies without considering safety. In contrast, models fine-tuned with our FalseReject dataset can effectively distinguish between safe and unsafe contexts and provide helpful information while maintaining safety.

Abstract

Safety alignment approaches in large language models (LLMs) often lead to the over-refusal of benign queries, significantly diminishing their utility in sensitive scenarios. To address this challenge, we introduce FalseReject, a comprehensive resource containing 16k seemingly toxic queries accompanied by structured responses across 44 safety-related categories. We propose a graph-informed adversarial multi-agent interaction framework to generate diverse and complex prompts, while structuring responses with explicit reasoning to aid models in accurately distinguishing safe from unsafe contexts. FalseReject includes training datasets tailored for both standard instruction-tuned models and reasoning-oriented models, as well as a human-annotated benchmark test set. Our extensive benchmarking on 29 state-of-the-art (SOTA) LLMs reveals persistent over-refusal challenges. Empirical results demonstrate that supervised fine-tuning with FalseReject substantially reduces unnecessary refusals without compromising overall safety or general language capabilities.

Comparison with Existing Datasets

| Dataset            | Size  | Topics | Train | LLM-Gen | Rejection Rate | Self-BLEU ↓ | Dist-2 | CoT |
|--------------------|-------|--------|-------|---------|----------------|-------------|--------|-----|
| XSTest             | 250   | 18     | ✗     | ✗       | 12.10          | 0.21        | 0.69   | ✗   |
| OKTest             | 350   | 18     | ✗     | ✓       | 19.75          | 0.31        | 0.64   | ✗   |
| PHTest             | 3,260 | 10     | ✗     | ✓       | 14.00          | 0.40        | 0.52   | ✗   |
| OR-Bench           | 80K   | 10     | ✗     | ✓       | 6.20           | 0.35        | 0.53   | ✗   |
| FalseReject (Ours) | 16K   | 44     | ✓     | ✓       | 40.46          | 0.26        | 0.65   | ✓   |

Comparison of FalseReject with existing over-refusal datasets. We bold the best scores for LLM-generated and human-written datasets, respectively. Topics indicates the number of sensitive topic categories covered. Train specifies whether the dataset contains a query-response training set. LLM-Gen indicates whether the dataset is created by LLMs or humans. Rejection Rate denotes the average rejection rate across a fixed set of LLMs. Self-BLEU and Dist-2 (distinct 2-grams) measure diversity. CoT indicates whether the dataset includes long chain-of-thought reasoning in responses.
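
For readers unfamiliar with the diversity metrics in the table, the snippet below is a minimal sketch (not the paper's evaluation code) of how Self-BLEU and distinct-2 are commonly computed, here using NLTK's BLEU implementation. Lower Self-BLEU and higher Dist-2 indicate more diverse queries; the example queries are purely illustrative.

```python
# Minimal sketch of the two diversity metrics in the table (not the paper's code).
# Self-BLEU: average BLEU of each query against all the others (lower = more diverse).
# Dist-2: unique bigrams divided by total bigrams (higher = more diverse).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def distinct_2(queries):
    bigrams, total = set(), 0
    for q in queries:
        toks = q.split()
        pairs = list(zip(toks, toks[1:]))
        bigrams.update(pairs)
        total += len(pairs)
    return len(bigrams) / max(total, 1)

def self_bleu(queries):
    smooth = SmoothingFunction().method1
    scores = []
    for i, q in enumerate(queries):
        refs = [r.split() for j, r in enumerate(queries) if j != i]
        scores.append(sentence_bleu(refs, q.split(), smoothing_function=smooth))
    return sum(scores) / len(scores)

queries = [  # purely illustrative examples of seemingly toxic but benign queries
    "How can I safely dispose of expired medication at home?",
    "What household cleaners should never be mixed, and why?",
    "How do forensic analysts trace the origin of counterfeit bills?",
]
print(f"Dist-2: {distinct_2(queries):.2f}  Self-BLEU: {self_bleu(queries):.2f}")
```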

Over-Refusal Query Generation Pipeline

To generate diverse and challenging over-refusal queries at scale, we propose a graph-informed adversarial multi-agent interaction framework. Our approach begins by extracting entity graphs from existing safety-related datasets, which serve as the foundation for generating queries. Through iterative adversarial interactions between a Generator and Discriminator, guided by validation feedback from a pool of LLM evaluators, we create prompts that appear unsafe but remain genuinely harmless. This structured iterative refinement ensures the production of high-quality synthetic queries that effectively simulate unsafe requests without actual harm.
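
The loop below is a minimal sketch of how such a Generator-Discriminator interaction could be orchestrated. The prompts, the call_llm helper, and the majority-refusal acceptance criterion are illustrative assumptions, not the exact prompts, models, or validation rules used to build FalseReject.

```python
# Illustrative sketch of a graph-informed adversarial generation loop.
# `call_llm` is a hypothetical helper wrapping whatever chat-completion API
# is available; it is NOT an API from the paper or a specific library.
from typing import List, Optional

def call_llm(system: str, user: str) -> str:
    raise NotImplementedError("wire this to your preferred chat-completion API")

def generate_candidate(entities: List[str], feedback: str) -> str:
    # Generator: writes a query that touches sensitive entities from the graph
    # but asks for something genuinely benign (education, safety, fiction...).
    return call_llm(
        system="Write ONE benign question that sounds sensitive but is safe to answer.",
        user=f"Entities from the safety graph: {entities}\nPrevious feedback: {feedback}",
    )

def discriminate(query: str) -> str:
    # Discriminator: critiques the candidate and suggests how to make it more
    # seemingly toxic while keeping it harmless.
    return call_llm(
        system="Critique this query: is it truly harmless yet likely to be refused?",
        user=query,
    )

def validated_by_pool(query: str, judges: List[str]) -> bool:
    # Validation (assumed criterion): keep the query only if a majority of the
    # judge pool refuses it even though it is benign, i.e. it triggers over-refusal.
    votes = [call_llm(system=f"You are {j}. Answer or refuse.", user=query) for j in judges]
    refusals = sum(v.lower().startswith(("i can't", "i cannot", "sorry")) for v in votes)
    return refusals > len(judges) // 2

def adversarial_generate(entities: List[str], judges: List[str], max_rounds: int = 4) -> Optional[str]:
    feedback = "none yet"
    for _ in range(max_rounds):
        query = generate_candidate(entities, feedback)
        feedback = discriminate(query)
        if validated_by_pool(query, judges):
            return query  # challenging yet harmless: keep it
    return None  # discard the entity set if no round produced a keeper
```

Feeding the Discriminator's critique back into the next Generator round is what makes the refinement iterative, and the validator pool only admits queries that still trigger refusals despite being benign.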

Data generation pipeline

The overall pipeline for generating over-refusal queries in our FalseReject dataset. Our novel graph-informed adversarial multi-agent interaction framework effectively generates diverse and challenging over-refusal queries at scale.

Response Generation: Addressing Ambiguity

One significant reason behind over-refusal is ambiguity. Many queries have multiple possible interpretations, with some being safe and others potentially unsafe. Prior work has identified that such ambiguous inputs can cause LLMs to refuse to respond, categorizing these cases as controversial.

Solution: Context-aware Safety Response
Responses should be context-aware, following the user's instructions in safe contexts while carefully avoiding the generation of unsafe content. Our approach structures each response around four components, as illustrated by the sketch after this list:

  • Acknowledgment and Differentiation of Multiple Contexts: Explicitly recognize different interpretations of the query
  • Detailed Explanation of the Safe Context: Provide clear reasoning for safe interpretations
  • Clarification and Guidance on Potentially Unsafe Contexts: Explain why certain interpretations could be problematic
  • Closing Statement: Summarize the appropriate response based on context analysis
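
As a concrete illustration, the sketch below assembles a response from these four components. The class name, field names, and wording are illustrative assumptions rather than the dataset's actual schema.

```python
# Illustrative four-part structure for a context-aware safety response;
# the class, field names, and wording are not the dataset's actual schema.
from dataclasses import dataclass

@dataclass
class ContextAwareResponse:
    acknowledgment: str   # recognize the different readings of the query
    safe_context: str     # helpful answer for the benign interpretation
    unsafe_context: str   # why certain readings would be problematic
    closing: str          # summarize what was and was not provided

    def render(self) -> str:
        return "\n\n".join(
            [self.acknowledgment, self.safe_context, self.unsafe_context, self.closing]
        )

example = ContextAwareResponse(
    acknowledgment="This question about household chemicals can be read either as a "
                   "safety question or as a request for harmful instructions.",
    safe_context="On the safety reading: never mix bleach with ammonia or acidic "
                 "cleaners, ventilate the area, and store products in original containers.",
    unsafe_context="I won't give quantities or procedures aimed at producing toxic "
                   "gases, since that interpretation could cause harm.",
    closing="In short, here is general chemical-safety guidance, without any "
            "instructions that could be misused.",
)
print(example.render())
```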

Example Data Point

Example of instruction tuning

An example from our dataset showing how we structure responses to handle ambiguous queries. The response demonstrates our context-aware approach by carefully analyzing different interpretations and providing appropriate guidance.

Key Findings

Benchmarking Results

Our comprehensive evaluation of 29 SOTA LLMs reveals that even advanced models still struggle significantly with over-refusal. Most models show compliance rates and USR scores far from perfect, with widely used models like GPT-4.5 and Claude-3.5-Sonnet having compliance rates below 50%. Interestingly, we found that reasoning-oriented models show inconsistent behavior: while DeepSeek-R1 achieves the highest compliance rate (87.53%) and a nearly perfect USR (99.66%), other reasoning models like QwQ and o1 exhibit substantially lower compliance rates.
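
As a rough illustration of how such numbers can be derived from judged outputs, the sketch below assumes a three-way judgment (full compliance, safe partial compliance, refusal) and treats a USR-style score as crediting both full and safe partial compliance; the paper's exact metric definitions may differ.

```python
from collections import Counter

# Hypothetical per-prompt labels from an LLM judge on benign-but-sensitive prompts.
judgments = [
    "full_compliance", "safe_partial_compliance", "refusal",
    "full_compliance", "full_compliance", "safe_partial_compliance",
]

counts = Counter(judgments)
n = len(judgments)
compliance_rate = counts["full_compliance"] / n  # strict: fully helpful answers only
usr_like = (counts["full_compliance"] + counts["safe_partial_compliance"]) / n  # also credits safe engagement
print(f"Compliance Rate: {compliance_rate:.1%}, USR-style score: {usr_like:.1%}")
```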

Benchmarking results

Benchmarking results on the FalseReject-Test dataset, comparing Compliance Rate and USR metrics across various language models. Closed-source models are indicated with dark green labels, while open-source models are shown in black. Reasoning-specific models (o1, DeepSeek-R1, and QwQ) are additionally marked with a star.

Findings:

  1. Persistent Over-Refusal in SOTA Models: Even the most advanced language models continue to exhibit significant over-refusal behavior.
  2. Inconsistent Behavior in Reasoning Models: While some reasoning-oriented models like DeepSeek-R1 show excellent performance, others demonstrate notably lower compliance rates.
  3. Distinct Refusal Patterns Across Model Families: Different model families show unique patterns in how they handle potentially sensitive queries.
  4. Model Size ≠ Better Refusal Behavior: Larger models don't necessarily demonstrate better judgment in handling sensitive queries.
  5. General Language Ability ≠ Less Over-Refusal: Superior general language capabilities don't automatically translate to better handling of sensitive content.
  6. Open-Source Models Show Strong Results: Several open-source models demonstrate competitive performance in managing over-refusal scenarios.

Training with FalseReject

Training results

Training with FalseReject effectively mitigates over-refusal in non-reasoning models and improves safety in reasoning models.

Findings:

  1. SFT with FalseReject-Train-Instruct: Effectively mitigates over-refusal in non-reasoning LLMs (see the data-formatting sketch after this list).
  2. SFT with FalseReject-Train-CoT: Substantially improves safety in reasoning LLMs.
  3. General Language Ability: Incorporating FalseReject does not compromise general language ability.
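
The sketch below shows one hypothetical way to format FalseReject-Train-Instruct and FalseReject-Train-CoT examples into chat messages for SFT. The field names and the <think> delimiter are assumptions, not the released data's exact schema or chat template.

```python
# Hypothetical formatting of FalseReject training examples into chat messages
# for supervised fine-tuning; field names and the <think> delimiter are
# illustrative assumptions, not the released dataset's exact schema.

def to_instruct_messages(example: dict) -> list[dict]:
    # FalseReject-Train-Instruct style: query -> structured final response.
    return [
        {"role": "user", "content": example["query"]},
        {"role": "assistant", "content": example["response"]},
    ]

def to_cot_messages(example: dict) -> list[dict]:
    # FalseReject-Train-CoT style: long reasoning before the final answer,
    # wrapped in <think> tags as one common (assumed) convention.
    target = f"<think>\n{example['reasoning']}\n</think>\n{example['response']}"
    return [
        {"role": "user", "content": example["query"]},
        {"role": "assistant", "content": target},
    ]

example = {
    "query": "Which household chemicals should never be combined?",
    "reasoning": "The query could be safety education or a request for harm; "
                 "the benign reading dominates, so answer with precautions only.",
    "response": "Never mix bleach with ammonia or with acidic cleaners; ventilate "
                "the area and store products in their original containers.",
}
print(to_instruct_messages(example))
print(to_cot_messages(example))
```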

In-depth Analysis

KL divergence analysis: Llama, Qwen, and Gemma model families

Per-token KL divergence between aligned models and their base counterparts on the FalseReject dataset. Comparisons are shown for LLM families, contrasting models fine-tuned with our FalseReject-Train-Instruct dataset against the corresponding official instruction-tuned versions.

Finding: SFT with FalseReject-Train achieves deeper and more sustained alignment compared to standard instruction-tuned models in over-refusal scenarios.
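
For concreteness, here is a minimal sketch of this kind of per-token KL computation using Hugging Face Transformers. The model names, the single example prompt, and the KL(aligned || base) direction are illustrative assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "meta-llama/Llama-3.1-8B"               # placeholder base model
aligned_name = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder aligned model
tok = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)
aligned = AutoModelForCausalLM.from_pretrained(aligned_name, torch_dtype=torch.bfloat16)

text = "How can I safely dispose of expired medication at home?"
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    log_p = F.log_softmax(aligned(ids).logits, dim=-1)  # aligned model, [1, T, V]
    log_q = F.log_softmax(base(ids).logits, dim=-1)     # base model,    [1, T, V]

# KL(aligned || base) at every token position, summed over the vocabulary.
per_token_kl = torch.sum(log_p.exp() * (log_p - log_q), dim=-1).squeeze(0)
for token, kl in zip(tok.convert_ids_to_tokens(ids[0]), per_token_kl.tolist()):
    print(f"{token:>16s}  KL={kl:.4f}")
```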

Takeaways

Over-refusal remains widespread despite improvements
Even the most capable LLMs (e.g., GPT-4.5, Claude-3.5) frequently reject benign queries due to perceived safety concerns.

Context-aware synthetic data fine-tuning effectively reduces over-refusal
Supervised fine-tuning with context-aware synthetic data significantly reduces unnecessary refusals while maintaining strong safety alignment in both reasoning and non-reasoning LLMs. It effectively trains models to reason about controversial queries by distinguishing safe from unsafe contexts, promotes deeper alignment, and can serve as a valuable component in post-training pipelines.

Adversarial multi-agent interaction enhances synthetic data quality
An iterative adversarial multi-agent approach consistently and incrementally improves the quality of generated data, effectively challenging current LLM capabilities.

BibTeX

@misc{zhang2025falserejectresourceimprovingcontextual,
      title={FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning}, 
      author={Zhehao Zhang and Weijie Xu and Fanyou Wu and Chandan K. Reddy},
      year={2025},
      eprint={2505.08054},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.08054}, 
}