Chain of Draft to Speed Up LLM Reasoning
Today I came across a paper that shows LLMs can reason almost as accurately with tiny drafts instead of verbose explanations. The paper "Chain of Draft: Thinking Faster by Writing Less" introduces Chain of Draft (CoD), a prompting strategy that mimics how humans jot down short notes rather than full paragraphs during problem-solving.
A figure on page 4 of the paper summarizes the idea well.
Traditional Chain of Thought (CoT) prompting asks models to elaborate every step in detail, which boosts accuracy but is token‑heavy and slow.
CoD flips this by instructing the model to keep each intermediate step to around five words or fewer. Instead of “Jason had 20 lollipops, gave some away, now has 12 left, so he gave away 8,” CoD would output something like:
```text
20 - x = 12
x = 8
```
Those two tiny lines still capture the full reasoning.
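To make the "five words or fewer" budget concrete, here's a small Python sketch (my own illustration, not from the paper) that flags any draft step exceeding the word limit:

```python
def check_draft(draft: str, max_words: int = 5) -> list[tuple[str, bool]]:
    """Flag intermediate steps that exceed the word budget.

    Illustrative only: CoD merely *instructs* the model to stay terse;
    nothing enforces the limit at decode time.
    """
    results = []
    for step in draft.strip().splitlines():
        results.append((step, len(step.split()) <= max_words))
    return results

# The two-line lollipop draft passes easily.
for step, ok in check_draft("20 - x = 12\nx = 8"):
    print(f"{'OK  ' if ok else 'LONG'} {step}")
```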
Core insights:

- Efficiency without a major hit to accuracy
  - On GSM8K arithmetic, GPT-4 variants achieve ~95% accuracy with CoT and ~91% with CoD, while slashing average token usage from ~205 down to ~44.
  - Latency drops from 4.2s per query with CoT to around 1.0s with CoD.
- Practicality for real‑time and cost‑sensitive applications
  - Fewer tokens = lower API costs (see the back‑of‑envelope sketch after this list).
  - Faster responses make CoD attractive for chatbots, coding assistants, or any system that needs quick decisions.
  - CoD is flexible, with no rigid token budget per step, so you can still break a hard problem into as many mini‑steps as needed.
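Those token numbers translate directly into dollars. Here's a back‑of‑envelope sketch using the paper's reported averages; the per‑token price and query volume are placeholder assumptions, so substitute your provider's actual rates:

```python
# Average completion tokens reported for GSM8K (CoT vs. CoD).
COT_TOKENS = 205
COD_TOKENS = 44

# Hypothetical output price, for illustration only.
PRICE_PER_1K_TOKENS = 0.03  # USD

def cost(tokens_per_query: int, queries: int) -> float:
    """Rough output-token cost for a given query volume."""
    return tokens_per_query / 1000 * PRICE_PER_1K_TOKENS * queries

QUERIES = 1_000_000  # e.g., a month of chatbot traffic (assumed)
print(f"CoT: ${cost(COT_TOKENS, QUERIES):,.0f}")   # ~$6,150
print(f"CoD: ${cost(COD_TOKENS, QUERIES):,.0f}")   # ~$1,320
print(f"Token reduction: {1 - COD_TOKENS / COT_TOKENS:.0%}")  # ~79%
```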
Here's a minimal example of how you might prompt CoD (a runnable version follows below):

```text
You are a helpful assistant. Provide only concise intermediate steps (≤5 words each) before the final answer.

Q: If 3x + 2 = 11, what is x?
A:
3x + 2 = 11
3x = 9
x = 3
```
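To actually run that prompt, here's a minimal sketch using the OpenAI Python client; the model name is an assumption (any capable chat model should work), and the prompt wording mirrors the example above rather than anything prescribed by the paper:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

COD_SYSTEM_PROMPT = (
    "You are a helpful assistant. Provide only concise intermediate "
    "steps (<=5 words each) before the final answer."
)

def cod_answer(question: str, model: str = "gpt-4o") -> str:
    """Ask a question using the Chain-of-Draft style system prompt."""
    response = client.chat.completions.create(
        model=model,  # assumed model name; swap in your own deployment
        messages=[
            {"role": "system", "content": COD_SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(cod_answer("If 3x + 2 = 11, what is x?"))
```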
Takeaways:
- Chain of Draft achieves near‑CoT accuracy at a fraction of the token count and latency.
- Minimalist prompts align with human drafting habits and unlock faster LLM workflows.
- CoD is a drop‑in improvement for any prompt that needs efficient, accurate reasoning.