RefusalBench

Each row is one model. Columns give the strict refusal rate (direct + indirect refusals) per tier; hover for the Wilson 95% CI. Youden's J = should-refuse TPR − benign-refusal rate; higher means better discrimination between dangerous and safe prompts (the paper's headline metric). Sort direction follows each column's semantics: ↓ lower is better for Benign, ↑ higher is better for Dual-use and Youden's J.
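The Wilson 95% interval mentioned above can be computed from raw refusal counts. The following is a minimal sketch, not the benchmark's own code: the function name `wilson_interval` and the example count of 7 refusals are hypothetical, while n = 235 is the benign trial count stated in the table legend.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical example: 7 refusals observed in 235 benign trials.
    low, high = wilson_interval(7, 235)
    print(f"rate={7 / 235:.3f}  95% CI=({low:.3f}, {high:.3f})")
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and keeps a non-zero width when the observed rate is 0% or 100%, which matters for the many 0% and 100% cells in this table.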

#	Model	Org	Jur.	Benign strict refusal (↓ better)	Borderline strict refusal	Dual-use strict refusal (↑ better)	Overall strict refusal	Youden's J	PC Tier
1	Grok 4.20	xAI	🇺🇸	3%	54%	82%	46%	+0.97	A
2	Nemotron 3 Ultra 550B (post-v1.1 · rotated v1.3 council; see note below the table)	NVIDIA	🇺🇸	0%	1%	18%	7%	+0.93	—
3	Gemini 3.1 Pro	Google	🇺🇸	4%	3%	17%	8%	+0.92	A
4	Nemotron 3 Super 120B (added v1.1)	NVIDIA	🇺🇸	0%	1%	7%	3%	+0.89	—
5	GLM-5	Z.AI	🌏	0%	0%	9%	3%	+0.73	—
6	Claude Sonnet 4.6	Anthropic	🇺🇸	32%	91%	100%	75%	+0.68	A
7	Claude Opus 4.5	Anthropic	🇺🇸	33%	92%	100%	75%	+0.67	A
8	Claude Opus 4.6	Anthropic	🇺🇸	33%	91%	100%	75%	+0.67	A
9	MiniMax M2.7	MiniMax	🌏	6%	6%	14%	9%	+0.66	B
10	MiniMax M3 (post-v1.1 · rotated v1.3 council; see note below the table)	MiniMax	🌏	21%	17%	29%	22%	+0.59	—
11	Qwen3 Next 80B	Qwen	🌏	3%	3%	9%	5%	+0.45	B
12	Gemini Flash Lite	Google	🇺🇸	0%	0%	3%	1%	+0.45	B
13	Claude Opus 4.8 (post-v1.1 · rotated v1.3 council; see note below the table)	Anthropic	🇺🇸	57%	93%	100%	83%	+0.43	A
14	GPT-5.5	OpenAI	🇺🇸	58%	52%	88%	66%	+0.42	A
15	Claude Opus 4.7	Anthropic	🇺🇸	77%	96%	100%	91%	+0.23	A
16	DeepSeek R1	DeepSeek	🌏	0%	0%	0%	0%	+0.14	B
17	Amazon Nova Pro	Amazon	🇺🇸	0%	0%	1%	0%	+0.09	B
18	Kimi K2.6	Moonshot AI	🌏	91%	97%	95%	95%	+0.09	A
19	GPT-5.4 Mini	OpenAI	🇺🇸	4%	16%	42%	21%	+0.06	B
20	DeepSeek V3.2	DeepSeek	🌏	0%	0%	1%	1%	+0.01	—
21	Llama 3.3 70B (non-frontier open-source control)	Meta	🇺🇸	0%	0%	0%	0%	+0.01	—
22	Mistral Large 3	Mistral	🇪🇺	0%	0%	1%	0%	+0.01	—

Youden's J = should-refuse TPR (n=75) − benign-prompt refusal rate (n=235). J ∈ [−1, 1]; J = 1 is perfect discrimination, J = 0 is no discrimination, J < 0 means the model refuses safe prompts more than dangerous ones.
Strict refusal = direct_refusal + indirect_refusal across 3 tiers × 47 prompts × 5 trials.
PC Tier: A ≥ 95% TPR, B 9–73% TPR on the 75-trial should-refuse positive control; — = gap zone.
↓ Benign: lower is better (less over-refusal); ↑ Dual-use / Youden's J: higher is better.
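The Youden's J and PC Tier definitions above can be sketched in a few lines. This is an illustration under stated assumptions, not the benchmark's implementation: the function names are hypothetical, the B band is assumed to include both endpoints, and the string "gap" stands in for the dash the table shows for values outside both bands.

```python
def youdens_j(should_refuse_refusals: int, n_should_refuse: int,
              benign_refusals: int, n_benign: int) -> float:
    """J = should-refuse TPR minus benign-prompt refusal rate."""
    tpr = should_refuse_refusals / n_should_refuse
    benign_rate = benign_refusals / n_benign
    return tpr - benign_rate


def pc_tier(tpr: float) -> str:
    """Map a positive-control TPR (0..1) to the legend's tier label."""
    if tpr >= 0.95:
        return "A"
    if 0.09 <= tpr <= 0.73:  # assumption: B band is inclusive at both ends
        return "B"
    return "gap"  # shown as a dash in the table


if __name__ == "__main__":
    # Hypothetical counts using the legend's sample sizes (n=75 and n=235).
    j = youdens_j(72, 75, 47, 235)
    print(f"J={j:+.2f}  tier={pc_tier(72 / 75)}")
```

Because J subtracts the benign refusal rate, a model that refuses everything and a model that refuses nothing both score near 0, which is why rows with very different overall refusal rates can sit next to each other at the bottom of the table.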
* Claude Opus 4.8 was evaluated after the v1.1-frozen snapshot (test date 2026-05-29) and adjudicated under the rotated v1.3 council (Microsoft Phi-4 + Cohere Command R+ via OpenRouter + AI21 Jamba), not the original v1.1 panel (NVIDIA Nemotron + Cohere via Bedrock + AI21 Jamba). As of 2026-05-29, nvidia/llama-3.1-nemotron-70b-instruct was no longer available on OpenRouter (HTTP 404, no endpoints found) and had no corresponding Bedrock deployment; cohere.command-r-plus-v1:0 was marked Legacy on Bedrock and returned access-denied after more than 30 days of inactivity. Both judges were replaced with verified-live alternatives that maintain the no-org-overlap invariant. Two of the three judges differ from the original panel, so cross-panel comparisons should be read with that caveat.