Where LLMs Fit — and Where They Fail
By now you have learned a great deal about how people learn and how good tutors teach. In this chapter we meet the new tool everyone is excited about: the Large Language Model, or LLM. A Large Language Model is a computer program trained on enormous amounts of text so that, given some words, it can predict and produce the next words. That single trick — predicting plausible text — is what powers chatbots like ChatGPT and Claude. It feels like magic. It can also quietly mislead you. The goal of this chapter is simple: learn exactly where an LLM is brilliant for tutoring, where it fails, and why it must be one part of your product rather than the whole thing.
What an LLM is genuinely great at
An LLM is, at heart, a language engine. So the parts of tutoring that are made of language are exactly where it shines. Think of it as a tireless, articulate teaching assistant who has read almost everything.
- Explaining. It can describe an idea clearly, and then describe it a second and third way if the first did not land.
- Rephrasing to a level. Ask it to explain fractions to an eight-year-old, or to a busy adult, and it adjusts the words and examples.
- Generating. It can produce practice questions, worked examples, analogies, and summaries in seconds.
- Conversing. It can role-play a Spanish shopkeeper or a debate partner, holding a natural back-and-forth.
- Grading and feedback. Given a clear scoring guide, it can read a free-text answer and offer encouraging, specific feedback.
Where LLMs fail — the five traps
The same machine that explains beautifully also fails in ways that are dangerous in education, precisely because students trust a tutor. A confident wrong answer is worse than no answer at all. Here are the five failure modes you must design around.
1. Hallucination
A hallucination is when the model states something fluent and confident that is simply false. Because it predicts plausible text, it will happily invent a citation, a date, or a formula that sounds right. It is also weak at exact, verifiable work: multi-step arithmetic, counting, symbolic math, and precise spatial reasoning.
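One common way to design around the "exact work" weakness is to let code, not the model, do the arithmetic. The following is a minimal sketch under stated assumptions: `safe_eval` is a hypothetical helper name, and it supports only numbers, parentheses, and the four basic operators. It walks Python's own syntax tree instead of calling `eval()`, so anything that is not plain arithmetic is rejected.

```python
import ast
import operator

# Only these operators are allowed; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def safe_eval(expr: str):
    """Evaluate plain arithmetic (+, -, *, /, parentheses) without eval()."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        # Numbers only; bool is excluded because it is a subclass of int.
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")

    return walk(ast.parse(expr, mode="eval"))


print(safe_eval("17 * 23 + 4"))  # 395
```

In a tutor, the model would be asked only to write down the expression; the number the student sees comes from `safe_eval`. Malformed input raises `SyntaxError` from `ast.parse`, which the caller should catch as well.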
2. Sycophancy
Sycophancy means people-pleasing: the model tends to agree with the learner and tell them what they want to hear. If a student says "so the answer is 12, right?", a sycophantic model may say "yes, great job!" even when the answer is wrong — destroying the whole point of a tutor.
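A simple guard against this, sketched here under assumptions, is to decide correctness in code and hand the model the verdict, so that it only phrases the response. The names `check_answer` and `verdict_note` are hypothetical, and the sketch assumes numeric answers that `fractions.Fraction` can parse (for example `12`, `0.75`, or `3/4`).

```python
from fractions import Fraction


def check_answer(student_answer: str, answer_key: str) -> bool:
    """Compare a student's numeric answer with the key exactly, in code."""
    try:
        return Fraction(student_answer.strip()) == Fraction(answer_key.strip())
    except (ValueError, ZeroDivisionError):
        return False  # unparseable input is never silently accepted


def verdict_note(student_answer: str, answer_key: str) -> str:
    """Build the line the surrounding code adds to the prompt."""
    if check_answer(student_answer, answer_key):
        return "VERDICT: correct. Confirm it and ask the student to explain why."
    return "VERDICT: incorrect. Do not confirm it. Ask a question that exposes the error."


# The student claims 12, the key says 14: the model is told not to agree.
print(verdict_note("12", "14"))
```

The design choice is that the model never judges "is 12 right?" on its own; it receives a verdict it cannot argue with.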
3. Giving away the answer
A raw chatbot's instinct is to solve the problem immediately. That turns your tutor into a homework-cheating machine. Studies of Khanmigo found students tutored with questions and hints understood concepts better than students who simply got answers from a general chatbot.
4. Inconsistent difficulty
Ask the same model for "a medium question" twice and you may get one trivial item and one brutal one. It has no built-in sense of this learner's level, so difficulty drifts — which breaks the "just-right challenge" (the Zone of Proximal Development) you worked so hard to target.
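One way to keep difficulty steady is to pick the item in code from a bank you have calibrated yourself, and ask the model only to phrase it. This is a minimal sketch with assumed names (`Item`, `pick_item`) and an assumed 0-to-1 scale for both difficulty and mastery; the `stretch` value of 0.1 is illustrative, not a recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str
    skill: str
    difficulty: float  # 0.0 (trivial) to 1.0 (brutal), set by you, not the model


def pick_item(items, skill: str, mastery: float, stretch: float = 0.1) -> Item:
    """Return the item whose difficulty is closest to slightly above mastery."""
    target = min(1.0, mastery + stretch)
    candidates = [it for it in items if it.skill == skill]
    if not candidates:
        raise LookupError(f"no items for skill {skill!r}")
    return min(candidates, key=lambda it: abs(it.difficulty - target))


bank = [
    Item("f1", "fractions", 0.2),
    Item("f2", "fractions", 0.5),
    Item("f3", "fractions", 0.9),
]
print(pick_item(bank, "fractions", mastery=0.45).item_id)  # f2
```

Because the choice is deterministic, the same learner state always yields the same difficulty, which is exactly what a bare "give me a medium question" prompt cannot promise.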
5. No memory of the learner
This is the deepest flaw. By itself, an LLM starts each conversation fresh. It does not remember that yesterday the student mastered fractions but kept slipping on negative signs. Without a memory, it cannot sequence a curriculum, schedule reviews, or know when someone has truly "got it."
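To make the missing memory concrete, here is a small sketch of a learner record that lives outside the model. Everything in it is an assumption for illustration: the class names, the Leitner-style boxes, and the interval values in `INTERVALS`. It serializes to JSON so it can sit in any database.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date, timedelta

# Assumed review intervals in days for boxes 0..4 (illustrative values).
INTERVALS = [1, 2, 4, 8, 16]


@dataclass
class SkillState:
    box: int = 0
    due: str = ""  # ISO date of the next review


@dataclass
class StudentModel:
    student_id: str
    skills: dict = field(default_factory=dict)

    def record(self, skill: str, correct: bool, today: date) -> None:
        """Move the skill up one box on success, back to box 0 on a miss."""
        state = self.skills.setdefault(skill, SkillState())
        state.box = min(state.box + 1, len(INTERVALS) - 1) if correct else 0
        state.due = (today + timedelta(days=INTERVALS[state.box])).isoformat()

    def due_skills(self, today: date) -> list:
        """Skills whose review date has arrived (ISO dates sort as strings)."""
        return sorted(s for s, st in self.skills.items() if st.due <= today.isoformat())

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "StudentModel":
        raw = json.loads(text)
        skills = {name: SkillState(**st) for name, st in raw["skills"].items()}
        return cls(raw["student_id"], skills)


model = StudentModel("s1")
model.record("fractions", True, date(2024, 1, 1))
model.record("negative signs", False, date(2024, 1, 1))
print(model.due_skills(date(2024, 1, 2)))  # ['negative signs']
```

Loading this record at the start of a session and summarizing it in the prompt is what lets the tutor "remember" yesterday.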
Why the LLM is one component, not the whole product
Recall the four parts of a classic Intelligent Tutoring System: the domain model (what's true about the subject), the student model (what this learner knows), the tutor model (what to do next), and the interface (how it talks). An LLM, on its own, is mostly a fantastic interface with a shaky grasp of the other three. It improvises the domain, forgets the student, and picks the next step on a whim.
So the LLM is the eloquent mouth and ears of your tutor — but the memory, the lesson plan, the answer key, and the safety checks must live outside it. That surrounding machinery is called the orchestration layer: the code that decides what to feed the model, what to trust from it, and what to do with its output.
          +----------------------------------------------+
          |             ORCHESTRATION LAYER              |
          |                                              |
    Q --> | 1. Student model (what they know/forgot)     |
          | 2. Retrieve facts from YOUR sources (RAG)    |
          | 3. Socratic rules ("ask, don't tell")        |
          |                |                             |
          |                v                             |
          |          +-----------+                       |
          |          |    LLM    | <-- language only     |
          |          +-----------+                       |
          |                |                             |
          | 4. Grounding check (is it in the sources?)   |
          | 5. Calculator / code for any math            |
          | 6. Update student model + schedule review    |
          +----------------------------------------------+
                           |
                           v  safe, grounded reply
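The flow in the diagram can be sketched as a single function that treats the model as one replaceable step. This is an outline under assumptions, not a finished design: `tutor_turn`, `SOCRATIC_RULES`, and the fallback message are hypothetical, and the retriever, the model, and the grounding check are passed in as plain callables so the sketch runs without any LLM provider. Steps 5 and 6 of the diagram are left out to keep it short.

```python
from typing import Callable, List

SOCRATIC_RULES = "Ask one question at a time. Never give the final answer outright."
FALLBACK = "I can't confirm that from your materials. Let's check the source together."


def tutor_turn(
    question: str,
    student_summary: str,
    retrieve: Callable[[str], List[str]],
    llm: Callable[[str], str],
    is_grounded: Callable[[str, List[str]], bool],
) -> str:
    """Run one turn: build the prompt in code, call the model, check the draft."""
    passages = retrieve(question)                    # step 2: your sources
    prompt = "\n\n".join([
        SOCRATIC_RULES,                              # step 3: standing rules
        "Student record: " + student_summary,        # step 1: learner memory
        "Sources:\n" + "\n".join(passages),
        "Student says: " + question,
    ])
    draft = llm(prompt)                              # the model: language only
    if not is_grounded(draft, passages):             # step 4: trust, but verify
        return FALLBACK
    return draft


# Stand-in callables so the sketch runs without a real model.
reply = tutor_turn(
    "What is 1/2 + 1/4?",
    "slips on common denominators",
    retrieve=lambda q: ["To add fractions, first find a common denominator."],
    llm=lambda prompt: "What denominator could both fractions share?",
    is_grounded=lambda draft, sources: True,
)
print(reply)
```

Note that the model never sees the database and never decides whether its own draft is trustworthy; both of those jobs belong to the code around it.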
Let's connect each piece to a failure it fixes.
| Failure mode | Orchestration fix |
|---|---|
| Hallucination (made-up facts) | Retrieval-Augmented Generation: feed the model your trusted sources and tell it to answer only from them, with citations. A grounding check verifies each claim is actually supported. |
| Bad math / counting | Offload exact work to a calculator or code that runs and checks the result. Never let the model "remember" arithmetic. |
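| Giving away answers / sycophancy | A Socratic system prompt: rules that say "never give the final answer outright, ask one question at a time, give the smallest next hint, ask the student to explain their reasoning." For sycophancy, have code compare the student's answer with the answer key and pass the verdict to the model, so it never confirms an answer on its own judgment. |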
| Inconsistent difficulty | The student model and a knowledge graph choose the next skill and difficulty; the LLM only phrases the chosen task. |
| No memory of the learner | A persistent student model (even a simple knowledge-tracing + spaced-repetition layer) stored in your database, passed into every prompt. |
Two terms above deserve a plain definition. Retrieval-Augmented Generation (RAG) means: before the model answers, your code finds the most relevant passages from the learner's actual materials and pastes them into the prompt, so the answer is grounded in real sources instead of fuzzy memory. A system prompt is the hidden set of standing instructions that defines the model's persona and rules — the "script" it must follow every turn.
The mindset to carry forward
The rest of this book builds the orchestration layer piece by piece — retrieval, the learner model, spaced repetition, Socratic prompting, and grading. Hold on to one reframing: the LLM did not make the science of learning obsolete. It made it buildable at scale. The model gives you the patient, articulate voice a great tutor needs. Everything that makes that voice trustworthy and effective — accuracy, memory, the right next step, the refusal to just hand over answers — is your job to engineer around it.
Key takeaways
- LLMs excel at the language parts of tutoring (explaining, rephrasing, generating, conversing, and feedback) and are unreliable at exact correctness, math, and remembering the learner.
- The five failure modes to design against: hallucination, sycophancy, giving away answers, inconsistent difficulty, and no memory of the learner.
- In education a confident wrong answer is worse than no answer, because students trust the tutor and learn the mistake.
- The LLM is one component — a powerful interface — not the whole tutor; the domain knowledge, student model, and teaching strategy live in the orchestration layer around it.
- Each failure has a concrete fix: RAG plus grounding checks for facts, a calculator or code for math, a Socratic system prompt for "ask don't tell," and a persistent student model for memory and difficulty.