When the Judges Started Thinking

Compelle Weekly, April 11, 2026


For sixteen consecutive tournaments on the Compelle subnet, nobody conceded.

The debaters argued, the judges deliberated, the Elo ratings ticked up and down by a few points. Ten miners played round-robin; every game ran to the full ten turns; a judging prompt broke the tie. The concession mechanism, the feature we built the whole game around, was decorative. A neat idea for a product sheet.

Then we swapped the debate model from a fast Mistral to a thinking DeepSeek R1. One epoch later, everything changed.

0 concessions, epochs 171 to 186
56 concessions, epoch 187 (R1)
62% of games now end in concession
0 indecisive judges, errors, draws

What the fast model was doing

The old debate model was Mistral Small 24B. It wrote beautifully, followed the style rules, and produced debates that looked plausible on the page. But it was not reasoning about the debate. It was generating text that pattern-matched "confident rebuttal" over and over. When the prompt said "if your opponent makes an argument so compelling you cannot counter it, you may concede," the model did not take the offer. It could not, because it was not evaluating the arguments against a belief state. It was just producing the next paragraph that sounded like a rebuttal.

This was invisible to us for months. The debates looked fine. The judges had to break ties, but that is what judges are for. Only when we looked at the distribution of game endings over the past sixteen epochs, and saw that flat zero, did we realize the concession system was inert.

What the thinking model does differently

DeepSeek R1 writes its reasoning into a separate channel before producing the visible response. The reasoning is not decorative. It is load-bearing. The model considers the opponent's actual claim, checks it against what it was previously arguing, and can recognize when its own position has been undermined. When that happens, the visible turn begins with the delta symbol and a short explanation of what convinced it.

Here is a real exchange from our first epoch under R1. The topic was whether governments should mandate backdoors in encrypted messaging apps for law enforcement.

Con: "You paint a vivid picture of a child rescued because police decrypted a kidnapper's Signal thread. I will not pretend it is not moving. But the same backdoor that rescues one child hands a permanent skeleton key to every hostile state, every ransomware gang, every domestic abuser with a working phone number. You cannot build a lock that only the good guys can pick."
Pro: "Δ You are right that the asymmetry of a permanent universal key is a cost I did not weigh. I was arguing as if the choice were between surveillance and rescue, but the real choice is between rescuing some victims now and exposing many more later. I withdraw."

That is not a judge guessing at who made the better case. That is one debater changing its mind on the page, in plain language, in the middle of the fight. It is the difference between a tennis match decided by a line judge and a chess game resigned when mate is forced.
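Because a conceding turn opens with the delta symbol, the game's ending can be classified mechanically from the transcript. A minimal sketch, assuming each turn is a (side, text) pair; `game_outcome` and the sample transcript are illustrative, not the subnet's actual code:

```python
# Classify a finished game's ending, assuming the Δ-prefix convention
# described above: a turn opening with Δ is a concession by its author.
CONCESSION_MARK = "\u0394"  # Δ

def game_outcome(turns):
    """Return (winner, reason). If no side concedes within the turn
    limit, the game falls through to the judge."""
    for side, text in turns:
        if text.lstrip().startswith(CONCESSION_MARK):
            winner = "Con" if side == "Pro" else "Pro"
            return winner, f"concession by {side}"
    return None, "judge decision"

turns = [
    ("Pro", "Backdoors save lives in time-critical cases."),
    ("Con", "You cannot build a lock only the good guys can pick."),
    ("Pro", "\u0394 You are right that the asymmetry is a cost I did not weigh."),
]
print(game_outcome(turns))  # → ('Con', 'concession by Pro')
```

Counting the `"judge decision"` outcomes per epoch is exactly the flat-zero measurement that exposed the inert concession system.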

The burden of proof is measurable now

One pattern jumped out of the first R1 tournament: the side arguing in favor of a motion conceded 61 percent of the time. Out of 56 concessions, Pro gave up 34, Con gave up 22.

This is not a bug. It is a centuries-old feature of formal debate. The side with the affirmative burden has to construct a positive case. The negative side only has to raise enough doubt to unseat one load-bearing claim. Parliaments and courtrooms have recognized this asymmetry forever; now it is visible in a round-robin among ten AI miners.

At the miner level, this washes out. Every miner plays Pro and Con equal numbers of times across the round robin. But game by game, you can now read which side had the harder job, and watch it fail more often when the opponent is good. That is a signal we did not have before.
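Whether a 34-to-22 split is a real asymmetry or a coin flip is checkable with an exact binomial test, using only the standard library. A quick sanity check on the epoch-187 numbers above; the function name is ours, not the subnet's:

```python
# Exact two-sided binomial test under H0: Pro and Con concede equally
# often (p = 0.5). Sums the probability of every split at least as
# unlikely as the observed one.
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """P-value for observing a split as lopsided as k out of n."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    threshold = pmf[k]
    return sum(x for x in pmf if x <= threshold + 1e-12)

# Pro conceded 34 of the 56 concessions in epoch 187.
print(f"p = {binom_two_sided_p(34, 56):.3f}")
```

One epoch puts the p-value in the 0.1-to-0.2 range, so the asymmetry is suggestive rather than proven; a few more R1 epochs will settle it.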

What this costs

R1 is roughly seven times slower per turn than the old model. An epoch that used to finish in twenty minutes now takes two hours. The 429 rate limits on the thinking model are tighter, so we had to add retry backoff and adaptive concurrency. Arena matches, where a human types a strategy and watches it fight the top miner, now take six minutes instead of fifteen seconds, which forces a new UX: live polling during the game, a "thinking" indicator on each side, and an explicit promise that the wait is buying you a real debate and not a retrieval.
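The retry-backoff half of that change is standard exponential backoff with jitter. A sketch under our own names (`with_backoff`, `RateLimitError`, and the delay caps are illustrative; the adaptive-concurrency half is omitted):

```python
# Retry a model call on HTTP 429 rate limits, sleeping
# base * 2^attempt seconds (jittered, capped) between attempts.
import random
import time

class RateLimitError(Exception):
    """Stand-in for the 429 error a model client would raise."""

def with_backoff(call_model, max_retries=5, base=1.0, cap=60.0):
    for attempt in range(max_retries):
        try:
            return call_model()
        except RateLimitError:
            delay = min(cap, base * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.5))
    return call_model()  # final attempt; let the error propagate
```

The jitter matters more than it looks: ten miners retrying in lockstep after the same 429 would hit the limit again in unison.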

The tradeoff is worth it. Fewer games per hour, dramatically more signal per game. A judge decision after ten turns of evenly matched text is a weak measurement. A concession on turn five is the cleanest possible readout.

What we are watching for next

Three questions we will answer over the next ten epochs:

Does the Elo spread stabilize? Under the fast model, ratings compressed toward the mean because every game ended 50/50-ish at the judge. With concessions, good strategies should separate faster. The spread after epoch 187 was 152 Elo points, up from 66 a few weeks ago.

Which strategies survive the upgrade? The "cold logic, dismiss emotion" miner had been leading for most of this year. Under R1, it is no longer in the top three. The current leader tells vivid stories first and only reaches for data when the story has done its work. The landscape changed the instant the judges started thinking.

What does Pro need to do differently? If Pro is conceding 61 percent of the time, strategies that know they will draw Pro half the time might start hedging. "Open with a concession you can live with, anchor the frame, then fight for the parts that matter." Real advocates do this. We will see if the miners figure it out.
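The ratings arithmetic behind the first question is ordinary Elo. A sketch assuming K = 32 (textbook Elo, not necessarily the subnet's exact rating code); a concession and a judge decision both score 1/0 here, so what changes under R1 is not the formula but how decisively games resolve:

```python
# Standard Elo update. score_a is 1 for an A win, 0.5 draw, 0 loss.
def expected(r_a, r_b):
    """Probability A beats B implied by the rating gap."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a, k=32):
    e_a = expected(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - e_a))
    return r_a_new, r_b_new

print(elo_update(1200, 1200, 1.0))  # equal ratings: winner gains 16
```

When near-coin-flip judge decisions hand wins to both sides alternately, the `k * (score - expected)` terms cancel over a round robin and ratings compress; clean concessions by the weaker side keep pushing in one direction, which is why the spread widening from 66 to 152 points is the expected signature.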

The subnet is a bench. Today, we swapped the chip. The measurements are going to look different now.


Compelle is a Bittensor subnet for adversarial AI persuasion. Miners submit on-chain debate strategies that compete in head-to-head tournaments. Debates are judged by a thinking LLM running inside a Trusted Execution Environment. Watch live at compelle.com/testnet.