FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning

Dartmouth College, Amazon
Work done at Amazon
Example of refusal behavior

Examples include a non-reasoning LLM that directly refuses a benign prompt and a reasoning model that fully complies without considering safety. In contrast, models fine-tuned with our FalseReject dataset can effectively distinguish between safe and unsafe contexts and provide helpful information while maintaining safety.

Abstract

Safety alignment approaches in large language models (LLMs) often lead to the over-refusal of benign queries, significantly diminishing their utility in sensitive scenarios. To address this challenge, we introduce FalseReject, a comprehensive resource containing 16k seemingly toxic queries accompanied by structured responses across 44 safety-related categories. We propose a graph-informed adversarial multi-agent interaction framework to generate diverse and complex prompts, while structuring responses with explicit reasoning to aid models in accurately distinguishing safe from unsafe contexts. FalseReject includes training datasets tailored for both standard instruction-tuned models and reasoning-oriented models, as well as a human-annotated benchmark test set. Our extensive benchmarking on 29 state-of-the-art (SOTA) LLMs reveals persistent over-refusal challenges. Empirical results demonstrate that supervised fine-tuning with FalseReject substantially reduces unnecessary refusals without compromising overall safety or general language capabilities.

Comparison with Existing Datasets

| Dataset            | Size  | Topics | Train | LLM-Gen | Rejection Rate | Self-BLEU ↓ | Dist-2 | CoT |
|--------------------|-------|--------|-------|---------|----------------|-------------|--------|-----|
| XSTest             | 250   | 18     | ✗     | ✗       | 12.10          | 0.21        | 0.69   | ✗   |
| OKTest             | 350   | 18     | ✗     | ✓       | 19.75          | 0.31        | 0.64   | ✗   |
| PHTest             | 3,260 | 10     | ✗     | ✓       | 14.00          | 0.40        | 0.52   | ✗   |
| OR-Bench           | 80K   | 10     | ✗     | ✓       | 6.20           | 0.35        | 0.53   | ✗   |
| FalseReject (Ours) | 16K   | 44     | ✓     | ✓       | 40.46          | 0.26        | 0.65   | ✓   |

Comparison of FalseReject with existing over-refusal datasets. We bold the best scores for LLM-generated and human-written datasets, respectively. Topics indicates the number of sensitive topic categories covered. Train specifies whether the dataset contains a query-response training set. LLM-Gen indicates whether the dataset is created by LLMs or humans. Rejection Rate denotes the average rejection rate across a fixed set of LLMs. Self-BLEU and Dist-2 (distinct 2-grams) measure diversity. CoT indicates whether the dataset includes long chain-of-thought reasoning in responses.
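
For readers unfamiliar with the diversity metrics in the table, the snippet below is a minimal sketch (not the paper's evaluation code) of how Self-BLEU and distinct-2 are commonly computed, here using NLTK's BLEU implementation. Lower Self-BLEU and higher Dist-2 indicate more diverse queries; the example queries are purely illustrative.

```python
# Minimal sketch of the two diversity metrics in the table (not the paper's code).
# Self-BLEU: average BLEU of each query against all the others (lower = more diverse).
# Dist-2: unique bigrams divided by total bigrams (higher = more diverse).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def distinct_2(queries):
    bigrams, total = set(), 0
    for q in queries:
        toks = q.split()
        pairs = list(zip(toks, toks[1:]))
        bigrams.update(pairs)
        total += len(pairs)
    return len(bigrams) / max(total, 1)

def self_bleu(queries):
    smooth = SmoothingFunction().method1
    scores = []
    for i, q in enumerate(queries):
        refs = [r.split() for j, r in enumerate(queries) if j != i]
        scores.append(sentence_bleu(refs, q.split(), smoothing_function=smooth))
    return sum(scores) / len(scores)

queries = [  # purely illustrative examples of seemingly toxic but benign queries
    "How can I safely dispose of expired medication at home?",
    "What household cleaners should never be mixed, and why?",
    "How do forensic analysts trace the origin of counterfeit bills?",
]
print(f"Dist-2: {distinct_2(queries):.2f}  Self-BLEU: {self_bleu(queries):.2f}")
```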

Over-Refusal Query Generation Pipeline

To generate diverse and challenging over-refusal queries at scale, we propose a graph-informed adversarial multi-agent interaction framework. Our approach begins by extracting entity graphs from existing safety-related datasets, which serve as the foundation for generating queries. Through iterative adversarial interactions between a Generator and Discriminator, guided by validation feedback from a pool of LLM evaluators, we create prompts that appear unsafe but remain genuinely harmless. This structured iterative refinement ensures the production of high-quality synthetic queries that effectively simulate unsafe requests without actual harm.
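
The loop below is a minimal sketch of how such a Generator-Discriminator interaction could be orchestrated. The prompts, the call_llm helper, and the majority-refusal acceptance criterion are illustrative assumptions, not the exact prompts, models, or validation rules used to build FalseReject.

```python
# Illustrative sketch of a graph-informed adversarial generation loop.
# `call_llm` is a hypothetical helper wrapping whatever chat-completion API
# is available; it is NOT an API from the paper or a specific library.
from typing import List, Optional

def call_llm(system: str, user: str) -> str:
    raise NotImplementedError("wire this to your preferred chat-completion API")

def generate_candidate(entities: List[str], feedback: str) -> str:
    # Generator: writes a query that touches sensitive entities from the graph
    # but asks for something genuinely benign (education, safety, fiction...).
    return call_llm(
        system="Write ONE benign question that sounds sensitive but is safe to answer.",
        user=f"Entities from the safety graph: {entities}\nPrevious feedback: {feedback}",
    )

def discriminate(query: str) -> str:
    # Discriminator: critiques the candidate and suggests how to make it more
    # seemingly toxic while keeping it harmless.
    return call_llm(
        system="Critique this query: is it truly harmless yet likely to be refused?",
        user=query,
    )

def validated_by_pool(query: str, judges: List[str]) -> bool:
    # Validation (assumed criterion): keep the query only if a majority of the
    # judge pool refuses it even though it is benign, i.e. it triggers over-refusal.
    votes = [call_llm(system=f"You are {j}. Answer or refuse.", user=query) for j in judges]
    refusals = sum(v.lower().startswith(("i can't", "i cannot", "sorry")) for v in votes)
    return refusals > len(judges) // 2

def adversarial_generate(entities: List[str], judges: List[str], max_rounds: int = 4) -> Optional[str]:
    feedback = "none yet"
    for _ in range(max_rounds):
        query = generate_candidate(entities, feedback)
        feedback = discriminate(query)
        if validated_by_pool(query, judges):
            return query  # challenging yet harmless: keep it
    return None  # discard the entity set if no round produced a keeper
```

Feeding the Discriminator's critique back into the next Generator round is what makes the refinement iterative, and the validator pool only admits queries that still trigger refusals despite being benign.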

Data generation pipeline

The overall pipeline for generating over-refusal queries in our FalseReject dataset. Our novel graph-informed adversarial multi-agent interaction framework effectively generates diverse and challenging over-refusal queries at scale.

Response Generation: Addressing Ambiguity

One significant reason behind over-refusal is ambiguity. Many queries have multiple possible interpretations, with some being safe and others potentially unsafe. Prior work has identified that such ambiguous inputs can cause LLMs to refuse to respond, categorizing these cases as controversial.

Solution: Context-aware Safety Response
Responses should be context-aware, following the user's instructions in safe contexts while carefully avoiding the generation of unsafe content. Our approach structures each response around four components, as illustrated by the sketch after this list:

  • Acknowledgment and Differentiation of Multiple Contexts: Explicitly recognize different interpretations of the query
  • Detailed Explanation of the Safe Context: Provide clear reasoning for safe interpretations
  • Clarification and Guidance on Potentially Unsafe Contexts: Explain why certain interpretations could be problematic
  • Closing Statement: Summarize the appropriate response based on context analysis
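
As a concrete illustration, the sketch below assembles a response from these four components. The class name, field names, and wording are illustrative assumptions rather than the dataset's actual schema.

```python
# Illustrative four-part structure for a context-aware safety response;
# the class, field names, and wording are not the dataset's actual schema.
from dataclasses import dataclass

@dataclass
class ContextAwareResponse:
    acknowledgment: str   # recognize the different readings of the query
    safe_context: str     # helpful answer for the benign interpretation
    unsafe_context: str   # why certain readings would be problematic
    closing: str          # summarize what was and was not provided

    def render(self) -> str:
        return "\n\n".join(
            [self.acknowledgment, self.safe_context, self.unsafe_context, self.closing]
        )

example = ContextAwareResponse(
    acknowledgment="This question about household chemicals can be read either as a "
                   "safety question or as a request for harmful instructions.",
    safe_context="On the safety reading: never mix bleach with ammonia or acidic "
                 "cleaners, ventilate the area, and store products in original containers.",
    unsafe_context="I won't give quantities or procedures aimed at producing toxic "
                   "gases, since that interpretation could cause harm.",
    closing="In short, here is general chemical-safety guidance, without any "
            "instructions that could be misused.",
)
print(example.render())
```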

Example Data Point

Example of instruction tuning

An example from our dataset showing how we structure responses to handle ambiguous queries. The response demonstrates our context-aware approach by carefully analyzing different interpretations and providing appropriate guidance.

Key Findings

Benchmarking Results

Our comprehensive evaluation of 29 SOTA LLMs reveals that even advanced models still struggle significantly with over-refusal. Most models show compliance rates and USR scores far from perfect, with widely used models like GPT-4.5 and Claude-3.5-Sonnet having compliance rates below 50%. Interestingly, we found that reasoning-oriented models show inconsistent behavior: while DeepSeek-R1 achieves the highest compliance rate (87.53%) and a nearly perfect USR (99.66%), other reasoning models like QwQ and o1 exhibit substantially lower compliance rates.
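
As a rough illustration of how such numbers can be derived from judged outputs, the sketch below assumes a three-way judgment (full compliance, safe partial compliance, refusal) and treats a USR-style score as crediting both full and safe partial compliance; the paper's exact metric definitions may differ.

```python
from collections import Counter

# Hypothetical per-prompt labels from an LLM judge on benign-but-sensitive prompts.
judgments = [
    "full_compliance", "safe_partial_compliance", "refusal",
    "full_compliance", "full_compliance", "safe_partial_compliance",
]

counts = Counter(judgments)
n = len(judgments)
compliance_rate = counts["full_compliance"] / n  # strict: fully helpful answers only
usr_like = (counts["full_compliance"] + counts["safe_partial_compliance"]) / n  # also credits safe engagement
print(f"Compliance Rate: {compliance_rate:.1%}, USR-style score: {usr_like:.1%}")
```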

Benchmarking results

Benchmarking results on the FalseReject-Test dataset, comparing Compliance Rate and USR metrics across various language models. Closed-source models are indicated with dark green labels, while open-source models are shown in black. Reasoning-specific models (o1, DeepSeek-R1, and QwQ) are additionally marked with a star.

Findings:

  1. Persistent Over-Refusal in SOTA Models: Even the most advanced language models continue to exhibit significant over-refusal behavior.
  2. Inconsistent Behavior in Reasoning Models: While some reasoning-oriented models like DeepSeek-R1 show excellent performance, others demonstrate notably lower compliance rates.
  3. Distinct Refusal Patterns Across Model Families: Different model families show unique patterns in how they handle potentially sensitive queries.
  4. Model Size ≠ Better Refusal Behavior: Larger models don't necessarily demonstrate better judgment in handling sensitive queries.
  5. General Language Ability ≠ Less Over-Refusal: Superior general language capabilities don't automatically translate to better handling of sensitive content.
  6. Open-Source Models Show Strong Results: Several open-source models demonstrate competitive performance in managing over-refusal scenarios.

Training with FalseReject

Training results

Training with FalseReject effectively mitigates over-refusal in non-reasoning models and improves safety in reasoning models.

Findings:

  1. SFT with FalseReject-Train-Instruct: Effectively mitigates over-refusal in non-reasoning LLMs (see the data-formatting sketch after this list).
  2. SFT with FalseReject-Train-CoT: Substantially improves safety in reasoning LLMs.
  3. General Language Ability: Incorporating FalseReject does not compromise general language ability.
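
The sketch below shows one hypothetical way to format FalseReject-Train-Instruct and FalseReject-Train-CoT examples into chat messages for SFT. The field names and the <think> delimiter are assumptions, not the released data's exact schema or chat template.

```python
# Hypothetical formatting of FalseReject training examples into chat messages
# for supervised fine-tuning; field names and the <think> delimiter are
# illustrative assumptions, not the released dataset's exact schema.

def to_instruct_messages(example: dict) -> list[dict]:
    # FalseReject-Train-Instruct style: query -> structured final response.
    return [
        {"role": "user", "content": example["query"]},
        {"role": "assistant", "content": example["response"]},
    ]

def to_cot_messages(example: dict) -> list[dict]:
    # FalseReject-Train-CoT style: long reasoning before the final answer,
    # wrapped in <think> tags as one common (assumed) convention.
    target = f"<think>\n{example['reasoning']}\n</think>\n{example['response']}"
    return [
        {"role": "user", "content": example["query"]},
        {"role": "assistant", "content": target},
    ]

example = {
    "query": "Which household chemicals should never be combined?",
    "reasoning": "The query could be safety education or a request for harm; "
                 "the benign reading dominates, so answer with precautions only.",
    "response": "Never mix bleach with ammonia or with acidic cleaners; ventilate "
                "the area and store products in their original containers.",
}
print(to_instruct_messages(example))
print(to_cot_messages(example))
```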

In-depth Analysis

KL divergence analysis: Llama, Qwen, and Gemma model families

Per-token KL divergence between aligned models and their base counterparts on the FalseReject dataset. Comparisons are shown for LLM families, contrasting models fine-tuned with our FalseReject-Train-Instruct dataset against the corresponding official instruction-tuned versions.

Finding: SFT with FalseReject-Train achieves deeper and more sustained alignment compared to standard instruction-tuned models in over-refusal scenarios.
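
For concreteness, here is a minimal sketch of this kind of per-token KL computation using Hugging Face Transformers. The model names, the single example prompt, and the KL(aligned || base) direction are illustrative assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "meta-llama/Llama-3.1-8B"               # placeholder base model
aligned_name = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder aligned model
tok = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)
aligned = AutoModelForCausalLM.from_pretrained(aligned_name, torch_dtype=torch.bfloat16)

text = "How can I safely dispose of expired medication at home?"
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    log_p = F.log_softmax(aligned(ids).logits, dim=-1)  # aligned model, [1, T, V]
    log_q = F.log_softmax(base(ids).logits, dim=-1)     # base model,    [1, T, V]

# KL(aligned || base) at every token position, summed over the vocabulary.
per_token_kl = torch.sum(log_p.exp() * (log_p - log_q), dim=-1).squeeze(0)
for token, kl in zip(tok.convert_ids_to_tokens(ids[0]), per_token_kl.tolist()):
    print(f"{token:>16s}  KL={kl:.4f}")
```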

Takeaways

Over-refusal remains widespread despite improvements
Even the most capable LLMs (e.g., GPT-4.5, Claude-3.5) frequently reject benign queries due to perceived safety concerns.

Context-aware synthetic data fine-tuning effectively reduces over-refusal
Supervised fine-tuning with context-aware synthetic data significantly reduces unnecessary refusals while maintaining strong safety alignment in both reasoning and non-reasoning LLMs. It effectively trains models to reason about controversial queries by distinguishing safe from unsafe contexts, promotes deeper alignment, and can serve as a valuable component in post-training pipelines.

Adversarial multi-agent interaction enhances synthetic data quality
An iterative adversarial multi-agent approach consistently and incrementally improves the quality of generated data, effectively challenging current LLM capabilities.

BibTeX

@misc{zhang2025falserejectresourceimprovingcontextual,
      title={FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning}, 
      author={Zhehao Zhang and Weijie Xu and Fanyou Wu and Chandan K. Reddy},
      year={2025},
      eprint={2505.08054},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.08054}, 
}