Style vs Substance in Chatbot Arena
I fell down this rabbit-hole after noticing a weird split:
- Polymarket bettors had Grok-3 pegged as #1.
- GPT-4.5 felt way better to talk to in day-to-day use.
On the raw Chatbot Arena board Grok-3 does sit on top, but only with style control off. Tick that box and GPT-4.5 jumps to #1 while Grok-3 slips to #3:
Turns out verbosity and flashy markdown can pad a model's Arena score more than you'd think.
The trick in one sentence
Add length and markdown counts as extra features (each with its own coefficient) in the usual Bradley-Terry logistic regression; the fitted "model strength" terms are what's left after those style effects are soaked up.
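Concretely, here is a minimal sketch of that regression, assuming a toy battles table with per-response lengths. The schema, the toy data, and the scikit-learn fit below are my own illustration, not the LMSYS Colab code:

```python
# Style-controlled Bradley-Terry as one logistic regression (illustrative sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical battle log: (model_a, model_b, winner, response_len_a, response_len_b).
battles = [
    ("grok-3", "gpt-4.5", "model_a", 2400, 900),
    ("grok-3", "gpt-4.5", "model_b", 2600, 1000),
    ("gpt-4.5", "grok-3", "model_a", 950, 2500),
    ("gpt-4.5", "claude-3.5-sonnet", "model_b", 900, 850),
    ("claude-3.5-sonnet", "grok-3", "model_a", 800, 2700),
] * 40  # repeat the toy rows so the fit has something to chew on

models = sorted({m for b in battles for m in b[:2]})
col = {m: i for i, m in enumerate(models)}

X, y = [], []
for model_a, model_b, winner, len_a, len_b in battles:
    row = np.zeros(len(models) + 1)
    row[col[model_a]] = 1.0   # +1 for the side-A model
    row[col[model_b]] = -1.0  # -1 for the side-B model
    row[-1] = (len_a - len_b) / (len_a + len_b)  # normalized length difference
    X.append(row)
    y.append(1 if winner == "model_a" else 0)

# Plain Bradley-Terry is the same regression without the length column; adding it
# lets the length coefficient soak up the verbosity effect. The default L2 penalty
# also pins the otherwise-unidentifiable offset among the model columns.
clf = LogisticRegression(fit_intercept=False).fit(np.array(X), np.array(y))

strengths = dict(zip(models, clf.coef_[0][: len(models)]))
print("style-controlled strengths:", strengths)
print("length coefficient:", clf.coef_[0][-1])
```

The per-model coefficients are log-odds strengths, so rescale and shift them if you want Elo-style numbers; the real method adds analogous features for markdown headers, bold text, and lists.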
Why it matters
When LMSYS re-ran the leaderboard with style control:
- Length was the big bias (coef ≈ 0.25).
- Fancy markdown nudged the score too, but much less (coef ≈ 0.02–0.06; see the quick calculation after this list).
- Concise talkers like GPT-4.5, Claude 3.5 Sonnet, Llama-3.1-405B rose.
- Verbose minis (GPT-4o-mini, Grok-2-mini) fell.
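To get a feel for those sizes, push a coefficient through a sigmoid: under the setup sketched above, the style feature is a normalized difference living roughly in [-1, 1], so the coefficient times that difference is a log-odds shift. The numbers below are my back-of-the-envelope reading, not figures from the post:

```python
# What a style coefficient buys two otherwise equally strong models (rough estimate).
import math

def style_edge(coef: float, normalized_diff: float) -> float:
    """P(A wins) from style alone: sigmoid of the coefficient times the feature."""
    return 1.0 / (1.0 + math.exp(-coef * normalized_diff))

# Answer A twice as long as answer B -> normalized diff = (2 - 1) / (2 + 1) = 1/3.
print(f"length edge   (coef 0.25): {style_edge(0.25, 1 / 3):.1%}")  # ~52.1%
print(f"markdown edge (coef 0.06): {style_edge(0.06, 1 / 3):.1%}")  # ~50.5%
```

Small per-battle edges like these compound over many votes, which is why controlling for them is enough to reshuffle the top of the board.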
Takeaways
- Pretty formatting ≠ better answers; the new leaderboard reflects that.
- You can poke the data yourself—links to the Colab and leaderboard are in the LMSYS blog post.