I fell down this rabbit-hole after noticing a weird split:

  • Polymarket bettors had Grok-3 pegged as #1.
  • GPT-4.5 felt way better to talk to in day-to-day use.

On the raw Chatbot Arena board Grok-3 does sit on top, but only with style control off. Tick that box and GPT-4.5 jumps to #1 while Grok-3 slips to #3:

[Figure: Arena leaderboard with style control enabled, showing GPT-4.5 at #1 and Grok-3 at #3]

Turns out verbosity and flashy markdown can pad a model's Arena score more than you'd think.

The trick in one sentence

Treat answer length and markdown-element counts as extra covariates in the usual Bradley-Terry logistic regression; the per-model “strength” that comes out of the fit is what's left after those style effects have been soaked up by their own coefficients.
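
Here's a minimal sketch of that regression in Python, with made-up battle data and only a length feature. The real LMSYS pipeline also counts markdown elements and standardizes its features, so this is just to show where the extra coefficient slots in:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy battles: (model_a, model_b, length of A's answer, of B's, did A win?)
# All numbers are invented for illustration.
battles = [
    ("grok-3",     "gpt-4.5",    1400,  600, 1),
    ("gpt-4.5",    "grok-3",      550, 1500, 1),
    ("grok-3",     "claude-3.5", 1300,  700, 1),
    ("claude-3.5", "gpt-4.5",     650,  600, 0),
    ("gpt-4.5",    "claude-3.5",  580,  720, 1),
    ("claude-3.5", "grok-3",      700, 1350, 0),
]

models = sorted({m for a, b, *_ in battles for m in (a, b)})
idx = {m: i for i, m in enumerate(models)}
k = len(models)

X, y = [], []
for a, b, len_a, len_b, a_won in battles:
    row = np.zeros(k + 1)
    row[idx[a]], row[idx[b]] = 1.0, -1.0        # Bradley-Terry design: +A, -B
    row[k] = (len_a - len_b) / (len_a + len_b)  # normalized length difference
    X.append(row)
    y.append(a_won)

# Large C ≈ almost no regularization; the real Arena code has its own fitter.
clf = LogisticRegression(fit_intercept=False, C=1e4, max_iter=1000)
clf.fit(np.array(X), y)

strength = dict(zip(models, clf.coef_[0][:k]))  # style-adjusted skill
beta_len = clf.coef_[0][k]                      # credit that went to sheer length
print(strength, beta_len)
```

With the length column in the design matrix, any win probability that can be explained by "A just wrote more" gets attributed to beta_len instead of inflating A's strength term.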

Why it matters

When LMSYS re-ran the leaderboard with style control:

  • Length was the big bias (coef ≈ 0.25; see the back-of-envelope after this list).
  • Fancy markdown nudged the score too, but much less (coefs ≈ 0.02–0.06).
  • Concise talkers like GPT-4.5, Claude 3.5 Sonnet, and Llama-3.1-405B rose.
  • Verbose minis (GPT-4o-mini, Grok-2-mini) fell.
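
To get a feel for what a coefficient of 0.25 buys you, here's a rough back-of-envelope that treats it as applying directly to the normalized length difference (the actual pipeline standardizes its features first, so the true magnitude will differ):

```python
import math

# Two equally strong models; one answer is twice as long as the other.
beta_len = 0.25              # reported length coefficient
feat = (2 - 1) / (2 + 1)     # normalized length diff ≈ 0.33
p_long_wins = 1 / (1 + math.exp(-beta_len * feat))
print(f"{p_long_wins:.3f}")  # ≈ 0.521: length alone tilts a coin-flip to ~52%
```

A couple of points per matchup sounds small, but compounded over thousands of battles it's enough to shuffle the top of the leaderboard.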

Takeaways