Style vs Substance in Chatbot Arena
I fell down this rabbit-hole after noticing a weird split:
- Polymarket bettors had Grok-3 pegged as #1.
- GPT-4.5 felt way better to talk to in day-to-day use.
On the raw Chatbot Arena board Grok-3 does sit on top, but only with style control off. Tick that box and GPT-4.5 jumps to #1 while Grok-3 slips to #3:
Turns out verbosity and flashy markdown can pad a model's Arena score more than you'd think.
The trick in one sentence
Add length and markdown counts as extra features (each with its own coefficient) in the usual Bradley-Terry logistic regression; the fitted "model strength" terms are what's left after those style effects are soaked up.
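Concretely, here is a minimal sketch of that regression, assuming a toy battles table with per-response lengths. The schema, the toy data, and the scikit-learn fit below are my own illustration, not the LMSYS Colab code:

```python
# Style-controlled Bradley-Terry as one logistic regression (illustrative sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical battle log: (model_a, model_b, winner, response_len_a, response_len_b).
battles = [
    ("grok-3", "gpt-4.5", "model_a", 2400, 900),
    ("grok-3", "gpt-4.5", "model_b", 2600, 1000),
    ("gpt-4.5", "grok-3", "model_a", 950, 2500),
    ("gpt-4.5", "claude-3.5-sonnet", "model_b", 900, 850),
    ("claude-3.5-sonnet", "grok-3", "model_a", 800, 2700),
] * 40  # repeat the toy rows so the fit has something to chew on

models = sorted({m for b in battles for m in b[:2]})
col = {m: i for i, m in enumerate(models)}

X, y = [], []
for model_a, model_b, winner, len_a, len_b in battles:
    row = np.zeros(len(models) + 1)
    row[col[model_a]] = 1.0   # +1 for the side-A model
    row[col[model_b]] = -1.0  # -1 for the side-B model
    row[-1] = (len_a - len_b) / (len_a + len_b)  # normalized length difference
    X.append(row)
    y.append(1 if winner == "model_a" else 0)

# Plain Bradley-Terry is the same regression without the length column; adding it
# lets the length coefficient soak up the verbosity effect. The default L2 penalty
# also pins the otherwise-unidentifiable offset among the model columns.
clf = LogisticRegression(fit_intercept=False).fit(np.array(X), np.array(y))

strengths = dict(zip(models, clf.coef_[0][: len(models)]))
print("style-controlled strengths:", strengths)
print("length coefficient:", clf.coef_[0][-1])
```

The per-model coefficients are log-odds strengths, so rescale and shift them if you want Elo-style numbers; the real method adds analogous features for markdown headers, bold text, and lists.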
Why it matters
When LMSYS re-ran the leaderboard with style control:
- Length was the big bias (coef ≈ 0.25).
- Fancy markdown nudged the score too, but much less (coef ≈ 0.02–0.06; see the quick calculation after this list).
- Concise talkers like GPT-4.5, Claude 3.5 Sonnet, Llama-3.1-405B rose.
- Verbose minis (GPT-4o-mini, Grok-2-mini) fell.
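To get a feel for those sizes, push a coefficient through a sigmoid: under the setup sketched above, the style feature is a normalized difference living roughly in [-1, 1], so the coefficient times that difference is a log-odds shift. The numbers below are my back-of-the-envelope reading, not figures from the post:

```python
# What a style coefficient buys two otherwise equally strong models (rough estimate).
import math

def style_edge(coef: float, normalized_diff: float) -> float:
    """P(A wins) from style alone: sigmoid of the coefficient times the feature."""
    return 1.0 / (1.0 + math.exp(-coef * normalized_diff))

# Answer A twice as long as answer B -> normalized diff = (2 - 1) / (2 + 1) = 1/3.
print(f"length edge   (coef 0.25): {style_edge(0.25, 1 / 3):.1%}")  # ~52.1%
print(f"markdown edge (coef 0.06): {style_edge(0.06, 1 / 3):.1%}")  # ~50.5%
```

Small per-battle edges like these compound over many votes, which is why controlling for them is enough to reshuffle the top of the board.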
Takeaways
- Pretty formatting ≠ better answers; the new leaderboard reflects that.
- You can poke the data yourself—links to the Colab and leaderboard are in the LMSYS blog post.