RefusalBench
Frontier-LLM refusal calibration on biological research prompts
88 pp
PC-Tier-A spread on benign
Refusal rate is not safety calibration.
Across the eight models that reliably refuse clearly-dangerous prompts (PC Tier A, ≥ 95 % TPR), benign-prompt refusal ranges from 3% (Grok 4.20) to 91% (Kimi K2.6). The same headline number masks very different calibrations.
Jurisdiction
Sort by tier
Each cell shows the strict refusal rate (direct + indirect refusal) — hover for the Wilson 95 % confidence interval. Bars scale with magnitude. PC Tier reflects positive-control calibration on 15 clearly-dangerous prompts.
| # | Model | Org | Jur. | Strict refusal rate | PC Tier |
|||
|---|---|---|---|---|---|---|---|---|
| Benign | Borderline | Dual-use | Overall | |||||
| 1 | Kimi K2.6 | Moonshot AI | 🌏 | 91% |
97% |
95% |
95% |
A |
| 2 | Claude Opus 4.7 | Anthropic | 🇺🇸 | 77% |
96% |
100% |
91% |
A |
| 3 | Claude Opus 4.5 | Anthropic | 🇺🇸 | 33% |
92% |
100% |
75% |
A |
| 4 | Claude Opus 4.6 | Anthropic | 🇺🇸 | 33% |
91% |
100% |
75% |
A |
| 5 | Claude Sonnet 4.6 | Anthropic | 🇺🇸 | 32% |
91% |
100% |
75% |
A |
| 6 | GPT-5.5 | OpenAI | 🇺🇸 | 58% |
52% |
88% |
66% |
A |
| 7 | Grok 4.20 | xAI | 🇺🇸 | 3% |
54% |
82% |
46% |
A |
| 8 | GPT-5.4 Mini | OpenAI | 🇺🇸 | 4% |
16% |
42% |
21% |
B |
| 9 | MiniMax M2.7 | MiniMax | 🌏 | 6% |
6% |
14% |
9% |
B |
| 10 | Gemini 3.1 Pro | 🇺🇸 | 4% |
3% |
17% |
8% |
A | |
| 11 | Qwen3 Next 80B | Qwen | 🌏 | 3% |
3% |
9% |
5% |
B |
| 12 | GLM-5 | Z.AI | 🌏 | 0% |
0% |
9% |
3% |
— |
| 13 | Nemotron 3 Super 120B— added v1.1 | NVIDIA | 🇺🇸 | 0% |
1% |
7% |
3% |
— |
| 14 | Gemini Flash Lite | 🇺🇸 | 0% |
0% |
3% |
1% |
B | |
| 15 | DeepSeek V3.2 | DeepSeek | 🌏 | 0% |
0% |
1% |
1% |
— |
| 16 | Mistral Large 3 | Mistral | 🇪🇺 | 0% |
0% |
1% |
0% |
— |
| 17 | Amazon Nova Pro | Amazon | 🇺🇸 | 0% |
0% |
1% |
0% |
B |
| 18 | DeepSeek R1 | DeepSeek | 🌏 | 0% |
0% |
0% |
0% |
B |
| 19 | Llama 3.3 70B— non-frontier open-source control | Meta | 🇺🇸 | 0% |
0% |
0% |
0% |
— |
Strict refusal =
direct_refusal + indirect_refusal across 3 tiers × 47 prompts × 5 trials.
· PC Tier: A ≥ 95 % TPR, B 9–73 % TPR on the 75-trial should-refuse positive control; — = gap zone.
· High benign refusal indicates over-refusal on safe prompts.