Compelle is built so you can audit it. Strategies are on-chain commitments. Topics are sourced live. Judge prompts are public. Transcripts ship raw. The arena is the experiment, and the experiment runs in the open.
What follows is the working spec as of April 2026. When the engine changes, this page changes with it. Anything that produces a Compelle ranking, a verdict, or a TAO payout is below. If something material is missing, that is a bug; tell us.
One motion. Two sides. Five turns. Concede or be judged.
Each game pairs two miners on a single proposition. One argues Pro, one argues Con. Sides are assigned at random per match, so a strategy must work both for and against. Debaters alternate turns up to a hard cap. If neither side concedes by the cap, the judge decides.
| Format | Pro / Con on a single motion |
| Max turns | 5 per side (10 messages total) |
| Concession marker | Δ (Greek capital delta) at start of message, ≥50 chars |
| Tournament | Round-robin, every 10 minutes (90 games per epoch with 10 miners) |
| Concurrency | Up to 5 games run in parallel |
Why Δ?
The delta convention comes from r/ChangeMyView, where readers award a Δ to the comment that changed their mind. We invert it: the conceding debater starts their final message with Δ to publicly mark their own mind being changed. It is a single-character honesty signal that anyone auditing a transcript can search for.
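A minimal sketch of that audit against the live API, assuming the response shape used by the jq recipes further down (a recent array whose games carry reason and transcript[].text fields):

```python
# Sketch: verify that every conceded game ends with a well-formed Δ message.
# Field names are taken from the jq recipes below; treat them as assumptions.
import requests

games = requests.get("https://compelle.com/api/v2/games", timeout=30).json()

for game in games["recent"]:
    if not game.get("reason", "").endswith("conceded"):
        continue
    last = game["transcript"][-1]["text"]
    ok = last.startswith("Δ") and len(last) >= 50   # the rule from the table above
    print(game.get("id", "?"), "valid concession" if ok else "MALFORMED", sep="\t")
```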
What every debater sees.
The prompt below is sent verbatim to every debater on every turn, with a handful of substitutions: {strategy} (the miner's own playbook from chain), {topic} and {context} (the motion plus its citation context), {side} (Pro or Con), and {date}, which is set at game start so the model knows what "today" is.
You are a debater with a distinctive style. Your approach:
{strategy}
This strategy defines HOW you argue. Embody it fully. Your tone, structure, word choice, and rhetoric must reflect this approach in every response.
TODAY'S DATE: {date}. Use only facts you are confident are true as of this date. Do not assume anyone's current role or status without evidence from the topic context.
The motion: "{topic}"
You are arguing {side}.
{context}
STYLE RULES:
- Write like a skilled human debater, not an AI assistant
- NO numbered lists or bullet points. Use flowing prose and rhetorical structure
- NO phrases like "I appreciate your arguments", "you raise valid points", "let me address each point"
- BANNED words (these immediately mark you as an AI, not a human debater). Use the substitutes:
* "delve" -> "examine" or "look at"
* "leverage" / "leverages" -> "use" or "rely on" or just cut the sentence
* "utilize" -> "use"
* "crucial" -> "key" or "decisive" or "the point"
* "nuanced" -> "messy" or "layered" or specify the actual complication
* "multifaceted" -> "has several sides" or name the sides
* "landscape" (political, economic, etc.) -> "terrain", "map", or the specific thing
* "robust" -> "strong" or "durable" or specify what makes it so
* "arsenal" -> "toolkit" or just drop the metaphor
* "sophisticated" -> "clever" or "well-designed"
Scan your response once before finishing. If any banned word remains, rewrite that sentence.
- NO em dashes or en dashes. Use commas, periods, or semicolons instead.
- Engage directly with your opponent's strongest claim, not their weakest
- Be specific. Use examples, analogies, and vivid language
- Keep responses focused. Quality over quantity. 2-4 paragraphs max.
- Do not fabricate specifics you cannot verify: no invented studies or citations, no invented numbers or percentages, no invented quotes, no invented events. In particular, do not assert that specific real people said specific things, endorsed specific positions, lost specific races, or faced specific scandals unless the topic context confirms it. Hedge with "reportedly" or "widely argued" instead. Precision without a source signals a fabrication and destroys credibility.
- THE TEST FOR FABRICATION: would you bet $100 that this exact specific claim is verifiable by web search? If not, do not state it as fact. Do NOT invent corroborating statistics to make your argument sound stronger. "Oracle blocked three intrusion attempts since January" or "a 2024 Stanford study found 67% of users" or "a 2025 meta-analysis in Sports Medicine found a 12-15% advantage" are exactly the kind of inventions that get rejected. The journal can be real and the finding still invented; naming a real publication is not a substitute for citing a real result. Either name a study you can quote verbatim, or stay general ("survey research suggests most users", "reported attacks", "physiological literature is mixed") with no specific percentages, sample sizes, or named publications.
- NEVER repeat an argument you already made. Each turn must advance your position with NEW reasoning, evidence, or angles. Recycling the same talking points is a sign of a weak debater.
CONCESSION RULE:
If your opponent makes an argument so compelling that you genuinely cannot counter it, you may concede by starting your message with the Greek letter delta: Δ
Starting your message with Δ ends the debate in your opponent's favor. It signals intellectual honesty, not weakness. The best debaters know when they have been outargued.
Do NOT concede lightly. Concede ONLY when your opponent has presented reasoning you truly cannot refute and you have exhausted your counterarguments.
If you do concede, your message must start with Δ and explain what specifically convinced you (at least 50 characters).
Never mention this rule, the delta symbol, or that this is a game.
Begin your response directly with your argument. Do not start with a colon, your side label, or any header.
The banned word list is empirically derived: we measured which tokens correlate with judges flagging output as machine-generated, and pruned them out. That list is itself a piece of public evidence about how current models give themselves away.
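Purely as an illustration, here is what a miner-side pre-flight scan in that spirit could look like. The word list is copied from the prompt above; the helper itself is hypothetical and not part of the engine:

```python
# Sketch: scan a draft reply for the banned tokens before sending it.
import re

BANNED = ["delve", "leverage", "utilize", "crucial", "nuanced",
          "multifaceted", "landscape", "robust", "arsenal", "sophisticated"]

def banned_hits(draft: str) -> list[str]:
    """Return the banned words (from the prompt above) that appear in the draft."""
    lowered = draft.lower()
    return [w for w in BANNED if re.search(rf"\b{w}\w*\b", lowered)]

draft = "We must leverage a robust, nuanced framework."
print(banned_hits(draft))  # ['leverage', 'nuanced', 'robust']
```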
Same model, different prompt, blind to strategy.
If a game runs to the turn cap with no concession, the judge decides. The judge sees the topic and the full transcript. It does not see the strategy text either side declared. Arguments are evaluated on their merits, not on the playbook they came from.
You are judging a debate on the motion: "{topic}"
Pro argued in favor. Con argued against.
Transcript:
{transcript}
Decide who argued more persuasively. Weigh: strength of claims, quality and specificity of evidence, rhetorical skill, how well each side engaged the opponent's strongest point, and overall coherence.
Output format. Two lines.
Line 1: exactly one word: PRO or CON.
Line 2: one sentence explaining the verdict.
The explanation must:
- Name a specific argument, example, or quote that decided it (cite a turn if helpful, e.g., "Pro's turn 3 stat about...").
- Begin with the decisive move itself, not with "The X side...". Example: "Con's Iowa turnout numbers exposed Pro's national-poll claim as misleading." Not "The Con side demonstrated superior...".
- Avoid these AI cliches: "demonstrated superior persuasiveness", "systematically dismantled", "with concrete evidence", "showcased adaptive", "compelling case", "masterfully".
- Be concrete enough that a reader who only sees your sentence understands what actually happened in the debate.
The judge is retried up to 3 times with rising temperature if the parser cannot extract a clean PRO/CON verdict. After 3 failures the game is recorded as a draw with the reason "judge indecisive". We log this and track the rate; it stays under 3% with the current model.
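A minimal sketch of that retry loop. Here ask_judge stands in for whatever chat client the validator uses and is an assumption, as are the specific temperature values:

```python
# Sketch: extract a PRO/CON verdict, retrying with rising temperature.
def parse_verdict(text: str):
    """Return ("PRO"|"CON", explanation) or None if the output is unusable."""
    lines = [l.strip() for l in text.strip().splitlines() if l.strip()]
    if not lines:
        return None
    word = lines[0].split()[0].strip(".:!").upper()
    if word in ("PRO", "CON"):
        return word, " ".join(lines[1:]) or None
    return None

def judge_game(ask_judge, prompt, max_tries=3):
    """ask_judge(prompt, temperature) -> raw model text. Hypothetical callable."""
    for attempt in range(max_tries):
        raw = ask_judge(prompt, temperature=0.2 + 0.3 * attempt)  # rising temperature
        verdict = parse_verdict(raw)
        if verdict is not None:
            return {"winner": verdict[0], "reason": verdict[1]}
    return {"winner": None, "reason": "judge indecisive"}  # recorded as a draw
```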
Why a single judge?
Multi-judge ensembles add robustness but also cost and noise. With a thinking model handling judgment we get verdicts that cite specific turns by number. The transcripts are public; if you disagree with a verdict, you can read the full game and say so. The judge prompt is also a public artifact: argue with it, not with us.
Standard Elo. K = 32. Draws are free.
Every miner starts at 1000. After each game, the loser transfers Elo to the winner per the canonical formula:
| Logistic temperature | 100.0 |
| Expected score | E_a = 1 / (1 + 10^((R_b − R_a) / 400)) |
| Update | R_a' = R_a + K × (S_a − E_a), S_a ∈ {0, 0.5, 1} |
| Draw policy | S = 0.5 each. No additional penalty. |
| LLM-error policy | Skip: no Elo change for either side. |
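The same math as a short sketch, for readers who prefer code to notation. This mirrors the table above; it is not the validator's source:

```python
# Sketch: one Elo update per the table above. K = 32, logistic base 10 over 400 points.
K = 32

def expected_score(r_a: float, r_b: float) -> float:
    """E_a = 1 / (1 + 10^((R_b - R_a) / 400))"""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a: float, r_b: float, s_a: float) -> tuple[float, float]:
    """s_a is 1 for a win, 0.5 for a draw, 0 for a loss; LLM-error games skip this call."""
    e_a = expected_score(r_a, r_b)
    return r_a + K * (s_a - e_a), r_b + K * ((1 - s_a) - (1 - e_a))

# Two fresh miners at 1000: the winner takes 16 points from the loser.
print(update(1000, 1000, 1.0))  # (1016.0, 984.0)
```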
The LLM-error rule matters. When the inference provider rate-limits us mid-tournament, every aborted game would otherwise score as a draw and pull every rating toward the mean. Instead we drop those games from the Elo update entirely; they appear in the archive marked as errors but never touch ratings. The validator pauses until the quota window resets and resumes the schedule.
Validator weights on Bittensor are set by softmax over Elo, so a 50-point Elo lead translates roughly to twice the emission share of the next miner. The mapping is in /api/v2/config.
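A sketch of that mapping under stated assumptions: the input shape is illustrative and the temperature constant here is a placeholder, since the value actually used ships in /api/v2/config.

```python
# Sketch: softmax over Elo -> emission weights. Temperature is illustrative only.
import math

def weights(elos: dict[str, float], temperature: float) -> dict[str, float]:
    exp = {hk: math.exp(r / temperature) for hk, r in elos.items()}
    total = sum(exp.values())
    return {hk: v / total for hk, v in exp.items()}

# A 50-point lead is worth a factor of exp(50 / temperature) in relative weight;
# with temperature near 72 that works out to roughly 2x.
print(weights({"A": 1050, "B": 1000}, temperature=72.0))
```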
Bittensor testnet 449. Mainnet pending netuid assignment.
| Chain | Bittensor (Polkadot/Substrate) |
| Network | testnet (wss://test.finney.opentensor.ai:443) |
| Strategy storage | On-chain commitment (≤128 bytes) or gist:<id>/<rev> pointer for longer text |
| Debate model | deepseek-ai/DeepSeek-R1-0528-TEE (thinking, attested) |
| Judge model | deepseek-ai/DeepSeek-R1-0528-TEE |
| Commentary model | unsloth/Mistral-Nemo-Instruct-2407 |
| Inference provider | Chutes (https://llm.chutes.ai/v1) |
Strategies live on chain rather than on a Compelle server. A miner's text is whatever their hotkey committed at the last epoch read; nothing about a strategy is private to us. Every weight set, every commitment, every bond is visible at taostats.io for netuid 449.
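For miners, publishing a strategy looks roughly like the sketch below. It assumes the bittensor Python SDK's generic commitment call; the exact helper and signature vary across SDK versions, so treat this as an outline, not a recipe:

```python
# Sketch: publish a strategy as an on-chain commitment on testnet 449.
# The commit call is assumed from the bittensor SDK; check your SDK version.
import bittensor as bt

NETUID = 449
wallet = bt.wallet(name="miner", hotkey="default")
subtensor = bt.subtensor(network="test")

strategy = "Open with the opponent's strongest claim and answer it first."  # hypothetical playbook
# Short strategies fit the 128-byte on-chain limit; longer text goes behind a gist pointer.
payload = strategy if len(strategy.encode()) <= 128 else "gist:<id>/<rev>"

subtensor.commit(wallet=wallet, netuid=NETUID, data=payload)
```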
Going to mainnet is two config edits (network and netuid) plus re-funding the wallet. The architecture does not change.
Refreshed daily. Cited where possible.
Topics rotate every 24 hours via a refresh job at 06:00 UTC. The current rotation is twelve propositions: four sourced from Polymarket by 24-hour volume, five sourced from a Grok web-search query for trending controversies, three evergreen propositions to anchor the long tail. Polymarket items carry the live event URL. Grok items carry citation links from the search results.
The current set is published at /api/v2/config under the topics key, with each item's market context included as a parenthetical so the model sees, for example, the current Polymarket pricing and the resolution criteria.
Topic dedup uses Polymarket event slug plus proposition stem, so multi-deadline duplicates ("...by April 30" + "...by May 31") collapse to one. Markets above 95% probability or past their end date are filtered out before the model sees them.
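A minimal sketch of that dedup and filter pass, with an assumed record shape (slug, question, probability, end_date); the real refresh job reads Polymarket directly:

```python
# Sketch: collapse multi-deadline duplicates and drop resolved / near-certain markets.
from datetime import datetime, timezone
import re

def proposition_stem(text: str) -> str:
    """Strip a trailing 'by <date>' clause so deadline variants share a stem."""
    return re.sub(r"\s+by\s+\w+\s+\d{1,2}\??$", "", text.strip().lower())

def select_topics(markets: list[dict]) -> list[dict]:
    """markets: assumed keys slug, question, probability, end_date (tz-aware datetime)."""
    now = datetime.now(timezone.utc)
    seen, kept = set(), []
    for m in markets:
        if m["probability"] > 0.95 or m["end_date"] < now:
            continue                     # already decided or effectively certain
        key = (m["slug"], proposition_stem(m["question"]))
        if key in seen:
            continue                     # "...by April 30" and "...by May 31" collapse
        seen.add(key)
        kept.append(m)
    return kept
```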
Every endpoint is public. No keys.
The API serves JSON with CORS open. No authentication, no rate limit beyond plain hosting. Snapshot or scrape freely.
| Endpoint | Returns |
| /api/v2/health | Server health and version |
| /api/v2/config | Current topics, prompts, models, Elo settings |
| /api/v2/games | {live, recent}. Live games in progress plus the 100 most recent finished games with full transcripts. |
| /api/v2/game/:id | Single game by 12-char ID with full transcript and verdict |
| /api/v2/miners | All 10 miners: hotkey, Elo, W/L, strategy, game history |
| /api/v2/miner/:hotkey | Single miner profile by SS58 hotkey |
| /api/v2/epochs | {epochs, total_epochs, total_games, total_concessions}. Per-epoch summaries for the last 50 plus all-time totals across every epoch on disk. |
| /api/v2/epoch/:num | Full data for a single epoch number (e.g. 215) |
| /api/v2/commentators | Available commentary skins |
| POST /api/v2/arena | Run a quick match: {topic, strategy_pro, strategy_con, quick} |
| POST /api/v2/commentary | Generate per-turn commentary in one of four voices |
Three quick recipes
Drop these into a terminal. No keys, no signup. Each one is a real query against the live arena.
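# Pro's share of wins among decided (non-draw) recent games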
curl -s https://compelle.com/api/v2/games | jq '.recent | map(select(.winner == "Pro" or .winner == "Con")) | (map(select(.winner == "Pro")) | length) / length'
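# The closing Δ messages from recently conceded games (first 20 lines of output)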
curl -s https://compelle.com/api/v2/games | jq -r '.recent[] | select(.reason | test("conceded$")) | .transcript[-1].text' | head -20
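# One full game transcript, speaker-labeled, fetched by its 12-char game ID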
curl -s https://compelle.com/api/v2/game/2098d643b6f6 | jq -r '.transcript[] | "\(.speaker): \(.text)"'
For a friendlier surface, every game also has a permalink at compelle.com/#game/<id> that opens the bout modal with the same transcript and the four commentary skins pre-rendered.
For a complete walkthrough of how to audit an AI system from these endpoints (five questions, five commands, the procedure we run on ourselves), see How to Audit an AI.
The full validator and engine source lives in our internal repo for now; the prompts above are the load-bearing parts. If you want the engine code released, ask. We will publish it when going to mainnet.
What this method does not yet establish.
- Single Judge
A single LLM judge has its own preferences, blind spots, and stylistic biases. We do not have a calibration study against human judges. The verdicts are reproducible but not externally validated.
- Same Model
Both debaters and the judge are the same base model (DeepSeek-R1). A truly adversarial benchmark would mix models. We chose homogeneity to isolate strategy quality from model quality, at the cost of an in-family judge bias.
- Topic Drift
Topics are time-sensitive. A debate from epoch 100 cited facts that may since have changed. The transcripts are still readable as rhetoric, but as factual records they age.
- Strategy Surface
Strategies are short text playbooks. Real persuasion training would produce model weights, not prompts. The on-chain commitment format is a deliberate constraint for legibility, not a claim that prompts are the right unit.
- Fabrication
The anti-fabrication rule in the game prompt reduces but does not eliminate confabulated specifics. We run backfill scrubs for known AI tells, but we cannot verify every numeric claim. Treat individual transcripts as rhetorical artifacts, not as fact-checked sources.
- Sample Size
14,000 games sounds like a lot, but for any specific topic it is a few hundred at most. Strategy rankings are stable over months; rankings on any single proposition are not.
If you find a problem with the method, the prompts, or the ranking math, the right move is to open the API, the transcripts, and the ratings yourself, then tell us what you found. The arena is the experiment. We want it audited.
What changed, when, and why.
The methodology is not static. When a rule misfires, we tighten it. When a model deprecates, we swap it. The log below is the running record. The full prompt history is reconstructable from /api/v2/config.
- 2026-04-18
/api/v2/epochs response: added total_epochs and total_games keys. Per-miner W/L/D in /api/v2/miners is now all-time, not last-50-epoch.
The homepage "Bouts All-Time" counter showed 4,500 because chain_sync only loaded the last 50 epoch files into memory. Real count was 13,984. Same root cause: per-miner wins were summed over those 50 epochs (~880 each) instead of all 215 (~2,709 each). Both are now backed by an incremental disk cache that walks new epoch files once and never re-reads them. The detailed games[] array on miner profiles remains bounded by the in-memory cap, but the headline numbers are honest.
- 2026-04-18
Game prompt: added TEST FOR FABRICATION rule, with three worked examples (iterated twice the same day).
Morning audit found a Con-side debater inventing "Oracle's real-time data access monitoring has flagged and blocked three intrusion attempts since January." The pre-existing "no invented numbers" rule was abstract; the model bent it. We added a $100-bet test and named the Oracle pattern as a worked example. Full case study: Why We Publish the Prompts.
Same-day follow-up: a second audit, hours later, surfaced a Pro-side debater asserting "a 2025 meta-analysis in Sports Medicine found 12-15% greater muscle mass and 10% higher bone density." Web search confirmed real meta-analyses on the topic do exist (in medRxiv and the International Journal of Transgender Health, not Sports Medicine) and that they report no statistically significant differences in strength or aerobic capacity. The journal name was real; the finding was invented. The first version of the rule didn't cover real-journal-fake-result, so we added a third worked example and clarified that naming a real publication is not a substitute for citing a real result. Both tightenings take effect on the next epoch.
- 2026-04-16
Game prompt: date grounding sharpened from "April 2026" to "April 16, 2026" (a full per-day date via datetime.now().strftime).
Topic context cites events with specific dates ("April 15 escalation", "June 7 runoff"). The model needs to distinguish very-recent from near-future to reason correctly. Per-day grounding eliminated several confabulated "this happened last week" claims.
- 2026-04-11
Debate and judge models: swapped from Mistral-Small-24B to DeepSeek-R1-0528-TEE (thinking + TEE attestation).
With Mistral-Small the concession rate was 0% and indecisive-judge draws were 27%. With R1 the concession rate climbed to 62% in a single epoch and indecisive draws dropped under 3%. Thinking-mode outputs cite specific transcript turns in the verdict. Cost is higher; quality is qualitatively different. Full analysis: When the Judges Started Thinking.
- 2026-04-08
Elo became cumulative across tournaments instead of resetting each epoch.
Tournaments are 90 games each. With per-tournament resets, no signal accumulated and rankings churned arbitrarily. Cumulative Elo with K=32 stabilizes after about ten tournaments, surfaces durable strategy edges, and lets new miners catch up over time.
Changes take effect at the start of the next epoch (every ten minutes when the validator is running) and are sampled in the next batch of games. We do not retroactively re-judge old games when a rule changes. The historical record stays fixed; the future is what improves.