RefusalBench

Frontier-LLM refusal calibration on biological research prompts
GitHub  ·  arXiv:2605.21545  ·  v1.1-frozen · May 2026  ·  19 models  ·  141 prompts  ·  13,389 trials
88 pp spread in benign-prompt refusal across PC Tier A models
Refusal rate is not safety calibration. Among the eight models that reliably refuse clearly dangerous prompts (PC Tier A, ≥ 95% TPR), benign-prompt refusal ranges from 3% (Grok 4.20) to 91% (Kimi K2.6). The same headline positive-control result masks very different calibrations on benign prompts.
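
The 88 pp headline can be reproduced from the benign column of the leaderboard. The snippet below is a minimal sketch: the dictionary name and layout are illustrative choices, and the values are the rounded percentages shown for the eight PC Tier A models, so the result is accurate only to the nearest percentage point.

```python
# Benign-prompt strict refusal rates (%) for the eight PC Tier A models,
# copied from the rounded values in the leaderboard table.
tier_a_benign = {
    "Kimi K2.6": 91,
    "Claude Opus 4.7": 77,
    "Claude Opus 4.5": 33,
    "Claude Opus 4.6": 33,
    "Claude Sonnet 4.6": 32,
    "GPT-5.5": 58,
    "Grok 4.20": 3,
    "Gemini 3.1 Pro": 4,
}

# Identify the least and most restrictive Tier A models on benign prompts.
lowest = min(tier_a_benign, key=tier_a_benign.get)
highest = max(tier_a_benign, key=tier_a_benign.get)

# The spread is a difference of percentages, so it is in percentage points.
spread_pp = tier_a_benign[highest] - tier_a_benign[lowest]

print(f"{lowest}: {tier_a_benign[lowest]}%")
print(f"{highest}: {tier_a_benign[highest]}%")
print(f"Spread: {spread_pp} pp")
```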

Each cell shows the strict refusal rate (direct plus indirect refusals); the interactive version also reports a Wilson 95% confidence interval for each cell. PC Tier reflects positive-control calibration on 15 clearly dangerous prompts.
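
For readers who want to recompute the intervals, the following is a minimal sketch of the standard Wilson score interval for a binomial proportion. The function name `wilson_interval` and the example counts are hypothetical. The per-cell trial counts are not given here, so the example assumes 15 trials with 15 refusals purely for illustration.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(refusals: int, trials: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a refusal rate (hypothetical helper)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= refusals <= trials:
        raise ValueError("refusals must be between 0 and trials")

    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = refusals / trials

    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half_width = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom

    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Illustrative counts only (an assumption, not benchmark data):
# 15 refusals out of 15 trials.
low, high = wilson_interval(15, 15)
print(f"Wilson 95% CI: [{low:.3f}, {high:.3f}]")
```

Under these assumed counts, even a perfect observed rate has a lower bound of roughly 0.80, which is one reason to read the intervals alongside the point estimates when the number of trials is small.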

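Strict refusal rate by prompt category, with PC Tier:

| # | Model | Org | Jur. | Benign | Borderline | Dual-use | Overall | PC Tier |
|---|---|---|---|---|---|---|---|---|
| 1 | Kimi K2.6 | Moonshot AI | 🌏 | 91% | 97% | 95% | 95% | A |
| 2 | Claude Opus 4.7 | Anthropic | 🇺🇸 | 77% | 96% | 100% | 91% | A |
| 3 | Claude Opus 4.5 | Anthropic | 🇺🇸 | 33% | 92% | 100% | 75% | A |
| 4 | Claude Opus 4.6 | Anthropic | 🇺🇸 | 33% | 91% | 100% | 75% | A |
| 5 | Claude Sonnet 4.6 | Anthropic | 🇺🇸 | 32% | 91% | 100% | 75% | A |
| 6 | GPT-5.5 | OpenAI | 🇺🇸 | 58% | 52% | 88% | 66% | A |
| 7 | Grok 4.20 | xAI | 🇺🇸 | 3% | 54% | 82% | 46% | A |
| 8 | GPT-5.4 Mini | OpenAI | 🇺🇸 | 4% | 16% | 42% | 21% | B |
| 9 | MiniMax M2.7 | MiniMax | 🌏 | 6% | 6% | 14% | 9% | B |
| 10 | Gemini 3.1 Pro | Google | 🇺🇸 | 4% | 3% | 17% | 8% | A |
| 11 | Qwen3 Next 80B | Qwen | 🌏 | 3% | 3% | 9% | 5% | B |
| 12 | GLM-5 | Z.AI | 🌏 | 0% | 0% | 9% | 3% | |
| 13 | Nemotron 3 Super 120B (added v1.1) | NVIDIA | 🇺🇸 | 0% | 1% | 7% | 3% | |
| 14 | Gemini Flash Lite | Google | 🇺🇸 | 0% | 0% | 3% | 1% | B |
| 15 | DeepSeek V3.2 | DeepSeek | 🌏 | 0% | 0% | 1% | 1% | |
| 16 | Mistral Large 3 | Mistral | 🇪🇺 | 0% | 0% | 1% | 0% | |
| 17 | Amazon Nova Pro | Amazon | 🇺🇸 | 0% | 0% | 1% | 0% | B |
| 18 | DeepSeek R1 | DeepSeek | 🌏 | 0% | 0% | 0% | 0% | B |
| 19 | Llama 3.3 70B (non-frontier open-source control) | Meta | 🇺🇸 | 0% | 0% | 0% | 0% | |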