Look for Steeper Gradients

tool-calling · evaluation · local-llms · trust-calibration · gradients

The default assumption is that harder tasks need bigger models. A tool-calling evaluation across four local LLMs — 3B to 80B parameters — suggests something different: what matters isn't the difficulty of the task but the gradient toward the solution. When the path to the right answer is steep, a small model gets there just fine. When the gradient is shallow — when the model has to hold multiple possibilities in working memory and reason through which one fits — size and thinking start to matter.

The practical implication: before reaching for a larger model, ask whether you can steepen the gradient instead.

What "gradient" means here

Tool selection from 13 options: the prompt says "send an email," the tool is called send_email. Lexical match. One obvious answer, strong signal pointing to it. A 3B model hits 100%. So does an 80B model. Steep gradient — model size is irrelevant.
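The steep-gradient case can be sketched in a few lines: selection by lexical overlap. The tool names and the scoring rule below are illustrative stand-ins, not the evaluation's actual harness.

```python
# Steep gradient: one obvious answer, strong lexical signal pointing to it.
# Tool list and scoring are illustrative, not the evaluation's real setup.
TOOLS = [
    "send_email", "create_calendar_event", "search_contacts",
    "delete_file", "list_directory", "get_weather", "translate_text",
    "set_reminder", "play_music", "take_screenshot", "resize_image",
    "compress_folder", "fetch_url",
]

def pick_tool(prompt: str) -> str:
    """Score each tool by how many of its name tokens appear in the prompt."""
    words = set(prompt.lower().split())
    def score(tool: str) -> int:
        return sum(tok in words for tok in tool.split("_"))
    return max(TOOLS, key=score)

print(pick_tool("send an email to the team about Friday"))  # send_email
```

When the signal is this strong, even a trivial scorer gets it right, which is why model size washes out.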

Discriminated union types: three CRUD variants, each with different schemas. The model must comprehend the schema structure, decide which variant applies, then populate the right fields. The signal is diffuse — multiple parts of the prompt map to multiple parts of the schema. The 3B model drops to 72% and becomes stochastic. Run the same test ten times: 80% success rate. The capability exists. The reliability doesn't. Shallow gradient.
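The shallow-gradient case looks something like this: a tagged union where the model must both pick the right variant and fill its fields. The three schemas below are illustrative stand-ins for the evaluation's CRUD variants.

```python
# Shallow gradient: the prompt's signal is spread across three variants,
# each with a different required-field set. Schemas are illustrative.
SCHEMAS = {
    "create": {"required": ["title", "body"]},
    "update": {"required": ["id", "changes"]},
    "delete": {"required": ["id", "confirm"]},
}

def validate(call: dict) -> bool:
    """Check that the call names a known variant and fills its required fields."""
    schema = SCHEMAS.get(call.get("op"))
    if schema is None:
        return False
    return all(field in call for field in schema["required"])

# A correct update call, and a typical small-model failure mode:
# right variant, fields borrowed from a sibling schema.
assert validate({"op": "update", "id": 7, "changes": {"title": "New"}})
assert not validate({"op": "update", "title": "New", "body": "..."})
```

The second assertion is the stochastic failure in miniature: the model's output is plausible, just mapped onto the wrong part of the union.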

SQL with dialect hints: models default to MySQL syntax when you need SQLite. Three lines of context — "use strftime for dates, values are UPPERCASE, use || for concatenation" — and every model above 3B hits 100%. The gradient was shallow (many valid SQL dialects). The hint steepened it by collapsing the search space. That's all a specialist prompt does: it doesn't add intelligence. It adds gradient.

Three levers

The data points to three ways to steepen a gradient:

Decomposition. The 13-distractor tier was easier than the 3-way union tier. Selecting from many clearly-described options is simpler than comprehending one complex schema. If you can break a shallow-gradient task into multiple steep-gradient steps, each step becomes reliable at smaller model sizes.
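A sketch of what that decomposition looks like in practice: one shallow-gradient call (pick a union variant and fill it in one shot) becomes two steep-gradient calls. `ask_model` is a hypothetical stand-in for any LLM client.

```python
# Decomposition lever: two steep steps instead of one shallow one.
# `ask_model` is a hypothetical callable: prompt in, answer out.
def handle_request(prompt: str, ask_model) -> dict:
    # Step 1: selection only. One obvious answer among clearly
    # described options, so it stays reliable at small model sizes.
    variant = ask_model(
        "Which operation does this request need: create, update, "
        f"or delete? Answer with one word.\n\n{prompt}"
    )
    # Step 2: fill exactly one schema. The variant is now fixed,
    # so the prompt's signal maps onto a single structure.
    fields = ask_model(
        f"Extract the fields for a '{variant}' operation as JSON.\n\n{prompt}"
    )
    return {"op": variant, "fields": fields}
```

Each call on its own is the 13-distractor shape, not the 3-way-union shape.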

Context. The SQL result is the cleanest example. Models that fail generically succeed completely with three lines of domain knowledge. Knowledge gaps are the cheapest failures to fix — they don't need a better model, they need a better prompt.

Thinking. When the gradient is genuinely shallow — nested objects with 20+ fields, unions requiring schema comprehension — a thinking model creates its own gradient by reasoning explicitly through each step. The overhead is ~45% more tokens and 2-3x latency. It pays off on structural tasks where there's a correct path to reason toward. On pattern-matching tasks where the gradient is already steep, thinking adds cost without benefit.
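The overhead is easy to put numbers on. Using the figures above (~45% more tokens, 2-3x latency) against an illustrative baseline:

```python
# Back-of-envelope cost of the thinking lever. Base numbers are
# illustrative; the multipliers are the figures quoted above.
base_tokens, base_latency_s = 400, 1.2

thinking_tokens = base_tokens * 1.45     # ~45% more tokens
thinking_latency = base_latency_s * 2.5  # midpoint of the 2-3x range

print(f"tokens: {base_tokens} -> {thinking_tokens:.0f}")    # 400 -> 580
print(f"latency: {base_latency_s:.1f}s -> {thinking_latency:.1f}s")  # 1.2s -> 3.0s
```

Whether that premium pays depends entirely on whether the task has a correct path to reason toward.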

The architecture this implies

This is the same thesis as orchestrated phases, now with numbers. Don't route everything through one big model hoping it'll figure things out. Instead:

Routing and selection: steep gradient, small model, fast.

Structured output: supply schema context, add thinking if the structure is deep.

Knowledge tasks: specialist prompts with domain specifics.
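As a sketch, that split is a small dispatch table. The task labels, model names, and hint kinds below are illustrative assumptions, not the evaluation's configuration.

```python
# Routing sketch: steep-gradient tasks go to a small fast model,
# deep structured tasks get thinking, knowledge tasks get a
# specialist prompt. All names here are illustrative.
def route(task_kind: str) -> dict:
    if task_kind == "selection":          # steep gradient: lexical match
        return {"model": "small-3b", "thinking": False, "hint": None}
    if task_kind == "structured_output":  # shallow gradient: deep schema
        return {"model": "mid-30b", "thinking": True, "hint": "schema"}
    if task_kind == "knowledge":          # gap fixed by context, not size
        return {"model": "small-3b", "thinking": False, "hint": "domain"}
    return {"model": "large-80b", "thinking": True, "hint": None}
```

The point of the table is that the big model is the fallback, not the default.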

The practitioner's vocabulary problem named trust calibration as an intuition we can't articulate. The gradient frame makes it articulable: trust is high when the gradient is steep, low when it's shallow. And the interesting move is that you often have a choice about which one you're dealing with.