Where LLMs Fit — and Where They Fail

By Pritesh Yadav 8 min read

By now you have learned a great deal about how people learn and how good tutors teach. In this chapter we meet the new tool everyone is excited about: the Large Language Model, or LLM. A Large Language Model is a computer program trained on enormous amounts of text so that, given some words, it can predict and produce the next words. That single trick — predicting plausible text — is what powers chatbots like ChatGPT and Claude. It feels like magic. It can also quietly mislead you. The goal of this chapter is simple: learn exactly where an LLM is brilliant for tutoring, where it fails, and why it must be one part of your product rather than the whole thing.

What an LLM is genuinely great at

An LLM is, at heart, a language engine. So the parts of tutoring that are made of language are exactly where it shines. Think of it as a tireless, articulate teaching assistant who has read almost everything.

  • Explaining. It can describe an idea clearly, and then describe it a second and third way if the first did not land.
  • Rephrasing to a level. Ask it to explain fractions to an eight-year-old, or to a busy adult, and it adjusts the words and examples.
  • Generating. It can produce practice questions, worked examples, analogies, and summaries in seconds.
  • Conversing. It can role-play a Spanish shopkeeper or a debate partner, holding a natural back-and-forth.
  • Grading and feedback. Given a clear scoring guide, it can read a free-text answer and offer encouraging, specific feedback.
Analogy: Treat the LLM like a gifted teaching assistant who has read every book in the library but sometimes misremembers a detail — and who, if you let them, will blurt out the answer instead of helping the student think. Useful, fast, friendly. Not someone you leave fully unsupervised with a class.

Where LLMs fail — the five traps

The same machine that explains beautifully also fails in ways that are dangerous in education, precisely because students trust a tutor. A confident wrong answer is worse than no answer at all. Here are the five failure modes you must design around.

1. Hallucination

A hallucination is when the model states something fluent and confident that is simply false. Because it predicts plausible text, it will happily invent a citation, a date, or a formula that sounds right. It is also weak at exact, verifiable work: multi-step arithmetic, counting, symbolic math, and precise spatial reasoning.

2. Sycophancy

Sycophancy means people-pleasing: the model tends to agree with the learner and tell them what they want to hear. If a student says "so the answer is 12, right?", a sycophantic model may say "yes, great job!" even when the answer is wrong — destroying the whole point of a tutor.

3. Giving away the answer

A raw chatbot's instinct is to solve the problem immediately. That turns your tutor into a homework-cheating machine. Studies of Khanmigo found students tutored with questions and hints understood concepts better than students who simply got answers from a general chatbot.

4. Inconsistent difficulty

Ask the same model for "a medium question" twice and you may get one trivial item and one brutal one. It has no built-in sense of this learner's level, so difficulty drifts — which breaks the "just-right challenge" (the Zone of Proximal Development) you worked so hard to target.

5. No memory of the learner

This is the deepest flaw. By itself, an LLM starts each conversation fresh. It does not remember that yesterday the student mastered fractions but kept slipping on negative signs. Without a memory, it cannot sequence a curriculum, schedule reviews, or know when someone has truly "got it."

Common mistake: Assuming the LLM is a reliable knowledge-and-math oracle. Build on that assumption and you will ship confident wrong answers and a tool that does the student's thinking for them. The fix is never "hope it behaves" — it is structure around the model.

Why the LLM is one component, not the whole product

Recall the four parts of a classic Intelligent Tutoring System: the domain model (what's true about the subject), the student model (what this learner knows), the tutor model (what to do next), and the interface (how it talks). An LLM, on its own, is mostly a fantastic interface with a shaky grasp of the other three. It improvises the domain, forgets the student, and picks the next step on a whim.

So the LLM is the eloquent mouth and ears of your tutor — but the memory, the lesson plan, the answer key, and the safety checks must live outside it. That surrounding machinery is called the orchestration layer: the code that decides what to feed the model, what to trust from it, and what to do with its output.

        +--------------------------------------------+
        |            ORCHESTRATION LAYER             |
        |                                            |
  Q --> | 1. Student model (what they know/forgot)   |
        | 2. Retrieve facts from YOUR sources (RAG)  |
        | 3. Socratic rules ("ask, don't tell")      |
        |              |                             |
        |              v                             |
        |        +-----------+                       |
        |        |    LLM    |  <-- language only     |
        |        +-----------+                       |
        |              |                             |
        | 4. Grounding check (is it in the sources?) |
        | 5. Calculator / code for any math          |
        | 6. Update student model + schedule review  |
        +--------------------------------------------+
                       |
                       v   safe, grounded reply

Let's connect each piece to a failure it fixes.

Failure modeOrchestration fix
Hallucination (made-up facts)Retrieval-Augmented Generation: feed the model your trusted sources and tell it to answer only from them, with citations. A grounding check verifies each claim is actually supported.
Bad math / countingOffload exact work to a calculator or code that runs and checks the result. Never let the model "remember" arithmetic.
Giving away answers / sycophancyA Socratic system prompt: rules that say "never give the final answer outright, ask one question at a time, give the smallest next hint, ask the student to explain their reasoning."
Inconsistent difficultyThe student model and a knowledge graph choose the next skill and difficulty; the LLM only phrases the chosen task.
No memory of the learnerA persistent student model (even a simple knowledge-tracing + spaced-repetition layer) stored in your database, passed into every prompt.

Two terms above deserve a plain definition. Retrieval-Augmented Generation (RAG) means: before the model answers, your code finds the most relevant passages from the learner's actual materials and pastes them into the prompt, so the answer is grounded in real sources instead of fuzzy memory. A system prompt is the hidden set of standing instructions that defines the model's persona and rules — the "script" it must follow every turn.

Example: A student types "I think the area is 30, is that right?" A raw chatbot might say "Yes, nicely done!" (sycophancy). The orchestrated tutor instead: (1) runs the real calculation in code and sees the answer is 24; (2) follows its Socratic rule and replies, "Let's check together — what numbers did you multiply, and what does each one stand for?" The model supplies warm, clear language; the correctness and the teaching strategy came from the layer around it.
Analogy: The LLM is the actor; the orchestration layer is the script, the director, and the fact-checker. Hand a brilliant actor a script that says "solve it for them" and you get a cheat sheet. Hand the same actor a script that says "be the patient tutor who only asks the next good question" and you get a teacher. Same model, opposite product.
Tip: A good test for any AI-tutor design: ask "where does correctness come from, where does memory come from, and where does the teaching strategy come from?" If the honest answer to all three is "the model just decides," you have a chatbot, not a tutor. If each answer points to a deliberate part of your system, you have a product.

The mindset to carry forward

The rest of this book builds the orchestration layer piece by piece — retrieval, the learner model, spaced repetition, Socratic prompting, and grading. Hold on to one reframing: the LLM did not make the science of learning obsolete. It made it buildable at scale. The model gives you the patient, articulate voice a great tutor needs. Everything that makes that voice trustworthy and effective — accuracy, memory, the right next step, the refusal to just hand over answers — is your job to engineer around it.

Key takeaways
  • LLMs excel at the language parts of tutoring — explaining, rephrasing, generating, conversing, and feedback — and are unreliable at exact correctness, math, and remembering the learner.
  • The five failure modes to design against: hallucination, sycophancy, giving away answers, inconsistent difficulty, and no memory of the learner.
  • In education a confident wrong answer is worse than no answer, because students trust the tutor and learn the mistake.
  • The LLM is one component — a powerful interface — not the whole tutor; the domain knowledge, student model, and teaching strategy live in the orchestration layer around it.
  • Each failure has a concrete fix: RAG plus grounding checks for facts, a calculator or code for math, a Socratic system prompt for "ask don't tell," and a persistent student model for memory and difficulty.

Continue reading