
Reasoning is Trained, Not Prompted

If you want an LLM that reasons like an expert, don’t look for a magic prompt. Look for the model that’s been “put in the reps.”

By Preetam Jinka

Co-founder and Chief Architect

Jun 03, 2025 · 6 min read

When LLMs first came out, we learned that by prompting them to “think step by step,” they were able to reason through more complicated tasks and get better results. As AI developers, we got used to instructing LLMs like this to improve their behavior. When dedicated reasoning LLMs appeared, I thought, “Huh, neat. I guess this is like telling them to think step by step by default.” But after spending time with true reasoning models, I realized there’s a more fundamental shift—not just in prompting, but in the model’s training.

Forcing a Non-Reasoning Model to “Think”

With a non-reasoning LLM, you can coax it into more careful responses by adding introspective questions in the prompt, for example:

“First, think through the user’s intent. What are they trying to do? Do you need to clarify anything? What kind of response are they expecting from you?”

These sorts of questions help the model simulate an internal planning phase before it dives into the answer. If you did this for every prompt, you might get closer to true reasoning—at least on tasks that match the style of those introspective questions. But that approach is inherently static: you’re always feeding the same scaffold, regardless of how trivial or complex the user’s input is.
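Here's a minimal sketch of what that static approach looks like in practice; the constant and helper names are hypothetical, not taken from any real codebase:

# The same introspective preamble is prepended to every request,
# no matter how trivial or complex the user's input is.
INTROSPECTIVE_SCAFFOLD = (
    "First, think through the user's intent. What are they trying to do? "
    "Do you need to clarify anything? What kind of response are they "
    "expecting from you?\n\n"
)


def build_prompt(user_input: str) -> str:
    """Attach the fixed 'planning' questions ahead of the user's request."""
    return INTROSPECTIVE_SCAFFOLD + "User request: " + user_input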

Dynamic “Reasoning Effort” via Prompt Templates

A more flexible approach is to vary how much you push the model toward deeper thought based on some heuristic or scale. For example, consider this snippet from r1_reasoning_effort.py:

self.replacement_prompts = [
    "\nHold on—let me think deeper about this ",
    "\nPerhaps a deeper perspective would help ",
    "\nLet me think ",
    "\nLet's take a step back and reconsider ",
    "\nLet me ponder this for a moment ",
    "\nI need a moment to reflect on this ",
    "\nLet’s explore this from another angle ",
    "\nThis requires a bit more thought ",
    "\nI should analyze this further ",
    "\nLet me reconsider this from a different perspective ",
    "\nI might need to rethink this ",
    "\nPerhaps there's a more nuanced way to approach this ",
    "\nLet's pause and reflect on this more deeply ",
    "\nI should take a closer look at this ",
    "\nA moment of deeper thought might help "
]

Injecting one of these “thinking” phrases ahead of the user’s request nudges the model to simulate deeper introspection. A “reasoning effort” scale determines which prompt to inject; higher effort selects a more reflective phrase. In effect, you’re algorithmically deciding how much the model should slow down and puzzle through ambiguities before answering.
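The snippet shows only the phrases themselves; the file’s selection logic isn’t reproduced here, but the mapping might look roughly like this sketch, which assumes the list is ordered from the most to the least reflective phrase:

def pick_thinking_phrase(replacement_prompts, effort):
    """Choose a 'thinking' phrase for a reasoning-effort value in [0, 1].

    Illustrative sketch only: it assumes the phrases are roughly ordered from
    most to least reflective, which may not match the actual file.
    """
    effort = min(max(effort, 0.0), 1.0)
    # effort = 1.0 selects index 0 (the heaviest phrase); effort = 0.0 selects the last.
    index = round((1.0 - effort) * (len(replacement_prompts) - 1))
    return replacement_prompts[index]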

But imagine the user simply says “hi” or “What’s 2 + 2?” Inserting any of those heavy prompts becomes overkill. So you need a mechanism—perhaps analyzing prompt length, keywords, or complexity—to dial the reasoning effort up or down. Designing that mechanism in code is tricky and brittle: you end up chasing every new edge case (“What about a two-sentence question?” “How do we handle math vs. open-ended queries?”). Ultimately, you’re trying to patch a system that wasn’t fundamentally built to reason.
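One version of that mechanism might look like the sketch below; the thresholds and keyword list are entirely hypothetical, and the brittleness is visible immediately:

def estimate_effort(user_input: str) -> float:
    """Hypothetical heuristic for how much 'reasoning effort' a request deserves.

    Longer inputs and analytical keywords push the score up; greetings and
    trivial questions stay near zero. Every new kind of input tends to demand
    yet another special case.
    """
    words = user_input.lower().split()
    if len(words) <= 4:              # "hi", "what's 2 + 2?"
        return 0.0
    score = 0.2
    if len(words) > 30:
        score += 0.4
    elif len(words) > 10:
        score += 0.2
    analytical = {"why", "compare", "design", "explain", "prove", "tradeoff"}
    if any(word.strip("?.,!") in analytical for word in words):
        score += 0.4
    return min(score, 1.0)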

Interviewer Analogy: Scripted vs. Seasoned

Think about someone who’s never conducted a real job interview. They study a two-page guide that says things like, “Ask open-ended questions. Probe for details. Look for culture fit.” That’s helpful, and if they follow it, they’ll outperform someone who dives in totally unprepared. But against an experienced interviewer—someone who’s sat across from hundreds of candidates—that rookie’s approach falls flat. The seasoned interviewer doesn’t rely on a printed checklist. They spot subtle red flags, adjust questions on the fly, clarify when an answer is too vague, and can sense when someone is sidestepping. All of that comes from years of practice, not from having a checklist taped to their desk.

Old LLMs forced to “think step by step” are like that rookie interviewer. They may mimic the form of reasoning when you ask explicitly but crumble when the question has any nuance. Their “reasoning” is shallow pattern-matching rather than an internal strategy.

What It Means to Be Truly Trained to Reason

“Reasoning models” aren’t just standard LLMs with extra prompt scaffolding. Their training regimen fundamentally changes how they process inputs. Instead of learning only to predict the next token, they’re exposed (during training) to tasks that require explicit decompositions—breaking questions apart, identifying hidden assumptions, planning how to combine facts, and verifying intermediate steps. In other words, their “instincts” already include a predisposition to analyze and structure a response before generating tokens.

When you prompt these models, you don’t need to add “Think step by step.” They already do it by default. If you watch their outputs, you see them:

  1. Clarify ambiguous phrasing (“Do you mean X or Y?”)
  2. Identify core subproblems (“First, we need to retrieve the user’s intent; then, we must gather relevant facts.”)
  3. Weigh evidence or edge cases (“But there’s an exception if Z happens, which we must check.”)
  4. Outline a plan (“Here’s how I’ll proceed: …”)

All of that happens without you explicitly asking. The internal weights and activation flows have already been shaped to prioritize multi-step reasoning.
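For contrast, the call itself can stay completely bare. This sketch uses the OpenAI Python client with a placeholder model name; the point is what’s absent from the prompt:

from openai import OpenAI

client = OpenAI()

# No "think step by step", no injected reflection phrases: just the question.
# "your-reasoning-model" is a placeholder for whichever reasoning model you use.
response = client.chat.completions.create(
    model="your-reasoning-model",
    messages=[
        {"role": "user", "content": "Should we shard this table or add read replicas first?"}
    ],
)
print(response.choices[0].message.content)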

The Gladwell Principle in Chess—and LLMs

Malcolm Gladwell popularized the “10,000-hour rule”: in complex domains, talent alone doesn’t make you world-class. You reach expert levels by putting in deliberate, focused practice—thousands of hours of interpreting positions, anticipating moves, and building intuition. Chess Grandmasters don’t glance at a board and instantly see the best move—they see patterns because they’ve spent years studying positions, tactics, and strategies until recognizing them becomes reflexive.

Similarly, you can get a vanilla LLM “50% of the way to expertise” by clever prompts—a bit like giving a novice interviewer a cheat sheet. But that last 50%—the kind of deep, flexible reasoning that generalizes to new, unexpected problems—only comes from training data, objectives, and architectures explicitly targeting those skills. No amount of prompt engineering can substitute for the “reps” baked into the model’s pretraining and fine-tuning.

Recent systems like Google’s Gemini illustrate this distinction. Turning on “thinking mode” in Gemini isn’t merely toggling a larger prompt template—it switches to a separately trained variant that has reasoning capabilities woven into its weights.

Gemini Thinking mode

When you flip that switch, you gain a model whose default behavior is to break down tasks, ask itself clarifying questions, and guard against logical jumps. It’s not “prompted” to reason—it already knows how.

In contrast, with a non-reasoning LLM, any additional depth you want must be manually injected at prompt time. You’ll struggle to make that both robust (covering all types of user inputs) and lightweight (so that everyday, straightforward queries aren’t bogged down by unnecessary introspection).

Final Takeaways

  1. Prompting Alone Won’t Create True Expertise. You can use static or dynamic templates to simulate deeper thought—sometimes effectively—but you’ll eventually hit a ceiling.
  2. Reasoning Must Be Trained In. Real multi-step reasoning emerges only when the model’s training objectives and data explicitly reinforce decomposition, verification, and planning.
  3. Analogies Matter. Just as a veteran interviewer or chess master relies on years of practice to spot subtleties, a reasoning LLM emerges from a training process designed to instill those same instincts.
  4. Pick the Right Tool. If you need consistently robust, expert-like reasoning, choose a model built for it. No clever prompt will replace a foundation of deep, targeted training.

If you want an LLM that reasons like an expert, don’t look for a magic prompt—look for the model that’s been “put in the reps.”

