Artificial Intelligence and Machine Learning Explained (Advanced)

By Pritesh Yadav, June 22, 2026

In the last two chapters you met the building blocks: computers that follow literal instructions, software, the internet, the cloud, and a first look at artificial intelligence. You learned the most important orienting idea in all of AI — the nesting dolls. This chapter goes deeper. We will open the hood on how modern AI actually learns, how the systems behind ChatGPT, Claude, and Gemini are built, why they are so fluent and yet so prone to confident errors, and how a working professional in 2025–2026 should actually use them.

This is the advanced tier, so we will not stop at "it learns from data." We will look at how it learns, what a neural network is really doing, what a "transformer" is, why these models have a "context window," and why a fluent answer is not the same as a true one. Throughout, hold on to one honesty beat: a computer is fast, literal, and dumb; an AI language model is probabilistic, fluent, and not designed to tell the truth. It is designed to produce likely text. Keep pairing every feeling of "magic" with the plain mechanism underneath, and you will understand AI better than most people who use it daily.

24.1 The orienting map: AI ⊃ ML ⊃ Deep Learning ⊃ Transformers/LLMs

Before any detail, fix the map in your head. These terms are not synonyms; they nest inside each other like Russian dolls.

Artificial intelligence (AI): The broadest idea — machines doing tasks that seem to require human-like intelligence (recognising a face, translating a sentence, playing chess). This is the biggest doll.
Machine learning (ML): A subset of AI where the machine learns patterns from examples instead of being told the rules by a programmer. A doll inside the first.
Deep learning: A subset of ML that uses neural networks with many layers ("deep" = many layers). A smaller doll inside ML.
Transformers / Large Language Models (LLMs): The specific deep-learning design behind today's famous chatbots. The smallest doll, sitting inside deep learning.

+-------------------------------------------------+
|  ARTIFICIAL INTELLIGENCE (any "smart" machine)  |
|  +-------------------------------------------+  |
|  |  MACHINE LEARNING (learns from data)      |  |
|  |  +-------------------------------------+  |  |
|  |  |  DEEP LEARNING (many-layer neural   |  |  |
|  |  |  networks)                          |  |  |
|  |  |  +-------------------------------+  |  |  |
|  |  |  | TRANSFORMERS / LLMs           |  |  |  |
|  |  |  | (ChatGPT, Claude, Gemini)     |  |  |  |
|  |  |  +-------------------------------+  |  |  |
|  |  +-------------------------------------+  |  |
|  +-------------------------------------------+  |
+-------------------------------------------------+

Key takeaway: When people say "AI" in 2025, they almost always mean machine learning, and usually the innermost doll — a transformer-based large language model. The word "AI" is broad; the technology getting all the attention is narrow and specific.

24.2 The core flip: from "rules + data" to "data + answers → rules"

Traditional programming, which you met earlier, works like a recipe. A human writes the rules; the computer applies them to data and produces answers.

TRADITIONAL PROGRAMMING:
   Rules  +  Data   ------>   Answers
   (human writes the rules)

MACHINE LEARNING:
   Data   +  Answers ------>  Rules ("the model")
   (the machine figures out the rules itself)

This flip is the whole revolution. Imagine you wanted a program to recognise a cat in a photo. The old way would force a human to write rules: "if there are two triangular ears, and whiskers, and fur texture…" This fails immediately — what about a cat curled up, in shadow, from behind? You cannot write enough rules.

Machine learning skips the rules. You show the system a hundred thousand photos already labelled "cat" or "not cat," and it works out for itself what visual patterns separate the two. You provided data and answers; it produced the rules.
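The flip can be sketched in a few lines of Python. This is a minimal illustration, not how a real image classifier works: the single "ear pointiness" score per photo, the example values, and the function names are all invented for the sketch.

```python
# Toy illustration of "rules + data -> answers" versus "data + answers -> rules".
# The feature (one made-up "ear pointiness" score per photo) and all numbers
# are hypothetical; real image models learn from raw pixels.

def handwritten_rule(pointiness: float) -> str:
    """Traditional programming: a human picks the threshold."""
    return "cat" if pointiness > 0.9 else "not cat"


def learn_rule(examples: list[tuple[float, str]]) -> float:
    """Machine learning: derive the threshold from labelled examples.

    Places the threshold halfway between the average score of each class.
    """
    cats = [x for x, label in examples if label == "cat"]
    others = [x for x, label in examples if label != "cat"]
    return (sum(cats) / len(cats) + sum(others) / len(others)) / 2


labelled = [(0.8, "cat"), (0.7, "cat"), (0.6, "cat"),
            (0.2, "not cat"), (0.3, "not cat"), (0.1, "not cat")]

threshold = learn_rule(labelled)   # the learned "rule" is just a number


def learned_rule(pointiness: float) -> str:
    return "cat" if pointiness > threshold else "not cat"


print(handwritten_rule(0.65))  # not cat (the human's guessed rule misses it)
print(learned_rule(0.65))      # cat (the threshold learned from examples catches it)
```

The human never told `learned_rule` where the boundary lies; it came out of the labelled examples. Change the examples and the rule changes with them, with no code edits.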

Analogy: You don't teach a small child what a dog is by reciting a definition ("a four-legged carnivorous mammal of the family Canidae…"). You point at dogs — "dog… dog… that's a dog too" — and after enough examples the child just knows, even for a breed they have never seen. Machine learning is that, at industrial scale.

24.3 Training data, features, and the "model"

Three words come up constantly. Define them precisely.

Training data: The examples the system learns from. What matters most is that they are the right examples: representative and correctly labelled.
Features: The attributes of each example that actually matter for the task. For predicting a house price: square footage, location, number of bedrooms. Choosing good features used to be a huge part of the job; deep learning now discovers many of them automatically.
Model: The trained "pattern-machine" that comes out the other end. It is the learned rules, frozen into a file full of numbers, ready to make predictions on new data.
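To make the three terms concrete, here is a minimal sketch using the house-price case. The figures are invented, only one feature (square footage) is used, and the fitting method (ordinary least squares for a straight line) is one simple choice among many.

```python
# Training data: (square footage, sale price) pairs. All values are invented.
training_data = [(1000, 200_000), (1500, 300_000), (2000, 400_000), (2500, 500_000)]


def fit_line(data):
    """Ordinary least squares for one feature: price ~ slope * sqft + intercept."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
             / sum((x - mean_x) ** 2 for x, _ in data))
    intercept = mean_y - slope * mean_x
    return slope, intercept


model = fit_line(training_data)   # the "model" is just these two numbers


def predict(model, sqft):
    slope, intercept = model
    return slope * sqft + intercept


print(round(predict(model, 1800)))  # 360000
```

Once `fit_line` has run, the training data can be thrown away: the prediction for an unseen 1,800 sq ft house uses only the two stored numbers. A large model works the same way in principle, with billions of numbers instead of two.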

Analogy: A student studies five years of past exam papers (training data), notices which topics keep appearing and how questions are phrased (features), and becomes genuinely good at the subject (the model). The model is not the papers; it is the competence built from them.

Common mistake: "More data always makes the model better." Not true. Garbage or biased data produces a garbage or biased model: a recruiting model trained on a company's past hires can quietly learn that company's past discrimination. Quality and representativeness often matter more than raw volume. A clean, balanced set of 50,000 examples can outperform a messy, lopsided 5 million.

24.4 The three ways a machine learns

There are three broad learning styles. You should be able to tell them apart and give an example of each.

Style | What it learns from | Everyday analogy | Real use
Supervised | Labelled examples: each input comes with the correct answer. | Flashcards with the answer on the back. | Spam filter (emails pre-labelled spam / not spam); loan default prediction.
Unsupervised | Unlabelled data: no answers given; it finds hidden groupings. | Sorting a pile of photos into natural piles nobody named. | Customer segmentation (the system discovers "budget shoppers" vs "premium shoppers" on its own).
Reinforcement | Trial and error with rewards and penalties. | Training a dog with treats for good behaviour. | Game-playing AI; robot walking; tuning a chatbot to give helpful answers.
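The unsupervised style is the least intuitive of the three, so here is a small sketch of it: a one-dimensional version of k-means clustering with two groups. The spending figures are invented, and the algorithm is given no labels at all.

```python
# Unsupervised learning in miniature: 1-D k-means with k = 2.
# The spend figures are invented; no labels are given to the algorithm.

def two_means(values, rounds=10):
    """Split numbers into two groups around two moving centres."""
    low, high = min(values), max(values)          # initial centres
    for _ in range(rounds):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(group_low) / len(group_low)     # move each centre to the
        high = sum(group_high) / len(group_high)  # mean of its own group
    return group_low, group_high


monthly_spend = [12, 15, 18, 20, 95, 110, 120, 130]
budget, premium = two_means(monthly_spend)
print(budget)   # [12, 15, 18, 20]
print(premium)  # [95, 110, 120, 130]
```

The code finds the two piles; the names "budget" and "premium" are supplied afterwards by a human looking at what landed in each pile.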

One advanced note worth carrying: modern chatbots use a blend. They are first trained on huge amounts of text by next-word prediction (often called self-supervised learning, because the "correct answer" is simply the next word in the text, so no human labelling is needed), then refined with a reinforcement step where humans rate which answers are better, nudging the model toward responses people find helpful and safe. That refinement step is a big reason ChatGPT felt so much more usable than the raw text-prediction engines that came before it.

24.5 Neural networks: the committee that learns by adjusting trust

Deep learning runs on neural networks. The name is borrowed loosely from the brain, but you do not need any biology to understand them. A neural network is layers of simple units. Each unit takes in numbers, multiplies each by a "weight," adds them up, applies a simple non-linear function (without it, a stack of layers would collapse into one big weighted sum), and passes the result forward.

Weights: The tunable "trust levels" the network learns. A high weight means "this input matters a lot for my decision"; a low weight means "mostly ignore this."
Layers: Stacked rows of units. The input goes in one end, gets transformed row by row, and a final answer comes out the other.
Training (backpropagation): The process of nudging every weight, again and again, to make the network's answers less wrong.

Analogy: Picture an enormous committee arranged in rows. Each member listens to the row before them, weighs how much to trust each speaker, and passes their own vote forward to the next row. At first everyone's trust levels are random, so the final vote is nonsense. Training is the process where, every time the final vote is wrong, you go back through the committee and tell each member: "trust that input a little less, that one a little more." Do this across millions of examples and the committee slowly becomes reliable.

That "go back and adjust" process has a name: backpropagation — the error at the output is pushed backward through the layers, and each weight is tweaked in the direction that would have reduced the error. A modern network can have billions of these weights. You never set them by hand; the training process discovers them. The reason this works at all is sheer scale plus the patient repetition of "predict, measure error, adjust, repeat."
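The "predict, measure error, adjust, repeat" loop can be shown on the smallest possible network: a single unit learning the logical AND of two inputs. This is a sketch under simplifying assumptions. With only one unit there are no hidden layers to push the error back through; backpropagation applies the same kind of update layer by layer using the chain rule. The learning rate and number of passes are arbitrary choices.

```python
import math

# One artificial unit learning the logical AND of two inputs.
# Learning rate and epoch count are arbitrary choices for this sketch.


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, inputs):
    """Weighted sum of inputs plus bias, squashed into the range 0..1."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = [0.0, 0.0], 0.0
learning_rate = 0.5

for epoch in range(5000):
    for inputs, target in examples:
        error = predict(weights, bias, inputs) - target   # measure error
        # Adjust each weight against its share of the error (the gradient
        # of the cross-entropy loss for a sigmoid unit).
        for i, x in enumerate(inputs):
            weights[i] -= learning_rate * error * x
        bias -= learning_rate * error

outputs = [round(predict(weights, bias, inputs)) for inputs, _ in examples]
print(outputs)  # [0, 0, 0, 1]
```

Nobody typed in the final weights. They started at zero and were nudged thousands of times until the unit's answers matched the labels.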

Example: A bank builds a network to flag fraudulent card transactions. Inputs (features) include the amount, the time of day, the location, and how it compares to your usual habits. During training, the network sees millions of past transactions already labelled "fraud" or "fine," and slowly tunes its weights until a £2,000 purchase in another country at 3 a.m., minutes after a £4 coffee at home, lights up as suspicious — even though no human ever wrote that exact rule.

24.6 Overfitting: the student who memorised the wrong thing

The single most important failure mode in machine learning is overfitting — when a model memorises its training examples instead of learning the general pattern, so it performs brilliantly on data it has seen and falls apart on anything new.

Analogy: A student who memorised the exact answers to last year's exam will score 100% if the questions are identical — and crash the moment the wording changes. They learned the answers, not the subject. An overfit model is exactly that student.

This is why practitioners always hold back some data the model never sees during training, then test on it. If the model scores 99% on training data but 60% on the held-back data, it has overfit. The whole craft of ML is steering between two cliffs: a model too simple to capture the pattern (underfitting), and a model so flexible it memorises noise (overfitting).
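The held-back test can be demonstrated with two deliberately simple models. Both the data and the models below are invented for illustration: one model memorises its training pairs outright, the other learns a single threshold.

```python
# Held-back data exposes a model that memorised instead of generalising.
# The data set is invented: the true pattern is "label is 1 when x > 5".

train = [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1), (9, 1)]
held_back = [(0, 0), (4, 0), (6, 1), (10, 1)]   # never used for training

# Model A: pure memorisation. Unseen inputs fall back to a default of 0.
memory = dict(train)


def memoriser(x):
    return memory.get(x, 0)


# Model B: one learned threshold, halfway between the two class averages.
zeros = [x for x, y in train if y == 0]
ones = [x for x, y in train if y == 1]
threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def threshold_model(x):
    return 1 if x > threshold else 0


def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)


print(accuracy(memoriser, train), accuracy(memoriser, held_back))              # 1.0 0.5
print(accuracy(threshold_model, train), accuracy(threshold_model, held_back))  # 1.0 1.0
```

On the training data the two models are indistinguishable. Only the held-back data reveals that the memoriser learned the answers and not the subject.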

Common mistake — correlation vs causation: A model finding a strong pattern does not mean it found a cause. A famous illustration: ice-cream sales and drowning deaths rise together — not because ice cream causes drowning, but because hot weather drives both. ML models are pattern-finders, not explanation-machines. They will happily latch onto a correlation that is useless or even harmful if you act on it as if it were a cause.

24.7 Generative AI: from sorting to creating

Most ML you have met so far classifies or predicts: spam or not, fraud or fine, this price or that. Generative AI does something different — it creates new content: text, images, code, audio, video.

Analogy: A film critic labels and ranks existing films. A screenwriter produces a new script. Traditional ML is the critic; generative AI is the screenwriter. The leap from "judge the input" to "produce a believable new output" is what made AI feel suddenly creative to the public.

The most economically important kind of generative AI right now works on text: the large language model. To understand it properly, we need two pieces of plumbing first — tokens and embeddings.

24.8 Tokens and embeddings: how a machine turns words into meaning

A computer cannot do arithmetic on the word "king." It only handles numbers. So language models convert text into numbers in two steps.

Token: A small chunk of text — often a word or part of a word. "Unbelievable" might split into "un," "believ," and "able." Models read and write in tokens, not letters. (Roughly, 1,000 tokens is about 750 English words.)
Embedding: Each token is turned into a long list of numbers that captures its meaning. Tokens with similar meaning end up with similar number-lists, so they sit near each other in a vast "meaning-space."
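As a rough picture of the token step, here is a toy tokeniser that always takes the longest known piece it can. It is a sketch only: the vocabulary is invented, and production tokenisers learn tens of thousands of pieces from data (for example with byte-pair encoding) and do not necessarily split words this way.

```python
# Toy subword tokeniser: greedy longest-match against a tiny vocabulary.
# The vocabulary is invented for this sketch.

VOCAB = {"un", "believ", "able", "break", "the"}


def tokenise(word, vocab=VOCAB):
    tokens, start = [], 0
    while start < len(word):
        # Try the longest remaining piece first, then shrink.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                tokens.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: emit the single character.
            tokens.append(word[start])
            start += 1
    return tokens


print(tokenise("unbelievable"))  # ['un', 'believ', 'able']
print(tokenise("unbreakable"))   # ['un', 'break', 'able']
```

Because "un" and "able" are shared pieces, the model can handle a word it has never seen whole, as long as it is built from pieces it knows.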

Analogy: Imagine a giant map where every word has coordinates, placed by meaning rather than geography. "King," "queen," and "monarch" cluster together in one neighbourhood; "banana" is far away in a fruit district. Famously, the maths even works directionally: the step from "king" to "queen" is roughly the same as from "man" to "woman." Embeddings give meaning a structure you can do arithmetic on.

This matters far beyond chatbots. Embeddings power semantic search ("find documents about cancelling a subscription" returns results that never use the word "cancel" but mean it), recommendation systems, and the retrieval step we will meet shortly. When you understand that meaning becomes coordinates, a lot of modern AI stops being mysterious.
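The "meaning becomes coordinates" idea can be tried with hand-made vectors. Everything numeric below is invented: real embeddings are learned, have hundreds or thousands of dimensions, and their individual dimensions have no tidy human-readable meaning. Cosine similarity is a common way to measure closeness in such a space, used here as an assumption about how "near" is computed.

```python
import math

# Hand-made 3-number "embeddings": [royalty, maleness, fruitiness].
# These values are invented for illustration only.
EMBEDDINGS = {
    "king":   [0.9,  0.3, 0.0],
    "queen":  [0.9, -0.3, 0.0],
    "man":    [0.1,  0.3, 0.0],
    "woman":  [0.1, -0.3, 0.0],
    "banana": [0.0,  0.0, 1.0],
}


def cosine(a, b):
    """Similarity of direction: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def nearest(vector, exclude=()):
    candidates = {w: v for w, v in EMBEDDINGS.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(vector, candidates[w]))


# Direction arithmetic: king - man + woman
k, m, w = EMBEDDINGS["king"], EMBEDDINGS["man"], EMBEDDINGS["woman"]
target = [a - b + c for a, b, c in zip(k, m, w)]

print(nearest(target, exclude={"king", "man", "woman"}))  # queen
print(cosine(k, EMBEDDINGS["queen"]) > cosine(k, EMBEDDINGS["banana"]))  # True
```

Semantic search works on the same principle: embed the query, embed the documents, and return the documents whose vectors point in the most similar direction.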

24.9 The Large Language Model: astronomically powerful autocomplete

Here is the plain, slightly deflating truth about an LLM. At its core, it does one thing: predict the next token, given all the tokens so far. That's it. ChatGPT, Claude, and Gemini are, mechanically, next-token predictors.

Analogy: Your phone's keyboard suggests the next word as you type ("I'm running a little…" → "late"). An LLM is that same idea, scaled up beyond imagination — trained on a large fraction of the public internet, books, and code, with billions of weights. At that scale, "predict the next likely chunk of text, one piece at a time" produces something that looks like reasoning, explanation, and creativity.

Prompt: "The capital of France is"
   |
   v
[ LLM predicts the most likely next token ]
   |
   v
"Paris"   -> appended to the text
   |
   v
[ predict next token again, now seeing "...is Paris" ]
   ...repeat one token at a time until done...

This single fact explains almost every strength and weakness of these tools. Their fluency comes from having absorbed the statistical patterns of human language. Their unreliability comes from the same source: they are producing likely-sounding text, not consulting a database of verified facts. There is no little fact-checker inside. "Likely" and "true" usually overlap — which is why the output is often correct — but they are not the same thing, and when they diverge, the model has no idea.
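A drastically shrunken version of next-token prediction shows both the loop and the failure mode. The sketch below is a bigram model: it predicts from the previous token only, using counts from a tiny invented corpus. Real LLMs condition on the whole context and are far better at this particular question; the point is the mechanism, where the likeliest continuation wins whether or not it is true.

```python
from collections import Counter, defaultdict

# A bigram model: predicts the next token from the previous token only.
# The corpus is invented and tiny.
corpus = ("the capital of france is paris . "
          "the capital of italy is rome . "
          "the capital of france is paris .").split()

# Count which token follows which.
next_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_counts[current][following] += 1


def generate(prompt_tokens, max_new=5):
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        options = next_counts[tokens[-1]]
        if not options:
            break
        tokens.append(options.most_common(1)[0][0])  # pick the likeliest
        if tokens[-1] == ".":
            break
    return " ".join(tokens)


print(generate("the capital of france is".split()))  # the capital of france is paris .
print(generate("the capital of italy is".split()))   # the capital of italy is paris .
```

The second answer is fluent, confident, and wrong. After "is", the corpus contains "paris" more often than "rome", so "paris" is the likely text. Nothing in the loop checks it against a fact.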

Key takeaway: An LLM is a probability machine for text, not an oracle and not a search engine. It predicts what a knowledgeable person might write next. Treat its output as a confident draft from a fast, well-read assistant who never says "I'm not sure" — useful, but always to be checked.

24.10 The transformer and attention: why modern LLMs actually work

For decades, getting a computer to handle long stretches of language was hard. The breakthrough came in 2017, in a Google research paper with the memorable title "Attention Is All You Need" by Vaswani and colleagues. It introduced the transformer — the architecture behind every major LLM today. The "GPT" in ChatGPT literally stands for "Generative Pre-trained Transformer."

The heart of the transformer is a mechanism called attention. Attention lets the model, for every word, look at all the other words at once and decide which ones matter for understanding it.

Analogy: Consider the sentence: "The dog didn't cross the street because it was tired." What does "it" refer to — the dog or the street? You glance back across the whole sentence and instantly link "it" to "dog" (streets don't get tired). Attention is the model doing exactly that glance: for each word, it weighs how relevant every other word is, and uses those weights to fix the meaning. Change the sentence to "…because it was busy," and attention re-links "it" to "street." Same word, different anchor, resolved by context.

Two technical points are worth knowing at this tier:

It runs in parallel. Older designs read text strictly one word after another, which was slow and forgetful over long passages. The transformer looks at all positions simultaneously, which is both faster on modern hardware and far better at tracking long-range context (linking a pronoun to a noun many sentences earlier).
The mechanism uses "query, key, value." You don't need the maths, but the idea is intuitive: each word sends out a query ("what am I looking for?"), every word advertises a key ("here's what I offer"), the model matches queries to keys to decide relevance, then pulls in the matching values (the actual information). It is like searching a library: your question is the query, the book spines are the keys, and the contents you pull off the shelf are the values.
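For readers who do want a glimpse of the maths, the query, key, value idea fits in a short function. This is a sketch of scaled dot-product attention for a single query, with hand-picked two-number vectors standing in for what a real model would learn. Real transformers also run many such "heads" in parallel and derive queries, keys, and values from learned weight matrices, which are omitted here.

```python
import math


def softmax(scores):
    peak = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)                # relevance of each word, summing to 1
    mixed = [sum(w * v[i] for w, v in zip(weights, values))
             for i in range(len(values[0]))]
    return weights, mixed


words = ["dog", "street", "it"]
# Hypothetical 2-number vectors: [animate-ness, place-ness].
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = keys                                # reuse keys as values for simplicity
query_for_it = [4.0, 0.0]                    # "it ... was tired" seeks something animate

weights, mixed = attend(query_for_it, keys, values)
print(words[weights.index(max(weights))])    # dog
```

The query for "it" matches the key for "dog" most strongly, so most of the attention weight goes there and the value for "dog" dominates the mixed result.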

The quadratic catch: Attention compares every token with every other token. Double the text and you roughly quadruple the work — this is "quadratic cost." It is the main reason a model's "context window" (how much text it can consider at once) is limited and expensive to grow. The whole industry has raced to push context windows from a few thousand tokens to over a million (Claude and Gemini now offer 1M+ token windows), but every expansion fights this quadratic wall.
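The arithmetic behind the quadratic cost is easy to check. The sketch below simply counts token-to-token comparisons for one pass of plain attention; it ignores layers, heads, and the optimisations real systems use to soften the cost.

```python
def attention_comparisons(num_tokens: int) -> int:
    """Every token is compared with every token, itself included."""
    return num_tokens * num_tokens


for n in (1_000, 2_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_comparisons(n):,} comparisons")

print(attention_comparisons(2_000) / attention_comparisons(1_000))  # 4.0
```

Going from a thousand tokens to a million multiplies the text by 1,000 but the comparisons by 1,000,000.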

24.11 Using an LLM well: prompts, context windows, hallucinations, fine-tuning, and RAG

These five terms are the working vocabulary for anyone using LLMs at work in 2025–2026. Learn them precisely.

Prompt: Your instruction or question to the model. The clearer and more specific, the better the output. "Write something about our product" is weak; "Write a 3-sentence product description for a waterproof hiking backpack, aimed at budget-conscious students, friendly tone" is strong.
Context window: The maximum amount of text the model can consider at once (its short-term memory). Everything in the conversation, plus any documents you paste in, must fit. Exceed it and something has to give: chat tools typically drop or compress the earliest material, so the model effectively "forgets" it.
Hallucination: A confident, fluent, and false output. The model invents a court case, a citation, a statistic, or an API that does not exist — because a plausible-sounding answer was the most "likely" text, and the model has no internal sense of truth.
Fine-tuning: Taking a general base model and training it further on a narrow set of examples so it specialises — e.g. tuning a model on a law firm's past contracts so it drafts in the firm's style.
RAG (Retrieval-Augmented Generation): Instead of relying on what the model memorised during training, you first retrieve relevant, up-to-date documents and feed them into the prompt, so the model answers from those facts.
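The "forgetting" behaviour of a context window can be sketched as a simple budget. The function below is hypothetical and makes two loud assumptions: tokens are approximated by whitespace-separated words (real systems count with the model's own tokeniser), and old messages are dropped outright (some tools summarise them instead).

```python
# Keep only the most recent messages that fit a token budget.
# Assumption: one whitespace-separated word counts as one token.

def fit_to_window(messages, max_tokens):
    kept, used = [], 0
    for message in reversed(messages):        # newest first
        cost = len(message.split())
        if used + cost > max_tokens:
            break                             # everything older is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))               # restore chronological order


chat = [
    "My order number is 4417",                # 5 words
    "I ordered 500 business cards",           # 5 words
    "They arrived with a typo",               # 5 words
    "Can you reprint them please",            # 5 words
]
print(fit_to_window(chat, max_tokens=12))
# ['They arrived with a typo', 'Can you reprint them please']
```

With a budget of 12, the order number from the first message never reaches the model. From the user's side it looks as if the assistant forgot; mechanically, it was never shown.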

Analogy for RAG: A closed-book exam forces the student to answer from memory — they might misremember, or make something up. An open-book exam, where you hand them the exact reference pages first, produces far more accurate answers. RAG turns a closed-book LLM into an open-book one. This is how a company chatbot can answer questions about your specific policies, prices, or product catalogue — those facts were never in the model's training, so they are retrieved and supplied at question time.

RAG flow:
  User question
      |
      v
  [Search company documents for relevant pages]   <- embeddings used here
      |
      v
  [Stuff those pages + the question into the prompt]
      |
      v
  [LLM answers, grounded in the supplied pages]

Why does RAG matter so much in practice? Three reasons. It keeps answers current (a model trained last year doesn't know this morning's price change, but a retrieved document does). It keeps answers grounded (less hallucination, because the facts are right there). And it lets you cite sources, so a human can verify — which is essential for any high-stakes use.
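The retrieve-then-prompt flow can be sketched without any AI library at all. In the minimal example below, retrieval is plain word overlap, a crude stand-in for the embedding search a real system would use. The documents, the function names, and the prompt wording are all invented, and the sketch stops at building the prompt, without calling a model.

```python
# Minimal RAG sketch: retrieve the best-matching document, then build a prompt.
# Word overlap stands in for embedding search; all documents are invented.

DOCUMENTS = [
    "Standard delivery to UK addresses takes 3 to 5 working days.",
    "Refunds are issued within 14 days of receiving returned goods.",
    "Business cards are printed on 350gsm card with a matt finish.",
]


def words(text):
    return {w.strip(".,?!").lower() for w in text.split()}


def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question."""
    q = words(question)
    ranked = sorted(documents, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, documents):
    pages = "\n".join(f"- {d}" for d in retrieve(question, documents))
    return ("Answer using only the sources below. "
            "If they do not contain the answer, say so.\n"
            f"Sources:\n{pages}\n"
            f"Question: {question}")


print(build_prompt("How long does delivery take?", DOCUMENTS))
```

Word overlap only works here because the question and the document both contain "delivery". A question phrased as "When will my parcel arrive?" would match nothing, which is the gap embedding-based retrieval is meant to close.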

Common mistake — anthropomorphising the model: "The AI understands me / knows the answer / thinks it through." It does none of these in the human sense. It predicts likely text. Treating its fluency as understanding leads to misplaced trust — and to the second classic error: "it sounds confident, so it must be right." Fluency is not accuracy. An LLM will state a wrong fact with exactly the same smooth confidence as a right one, because confidence is a feature of the writing style it learned, not a signal of truth.

24.12 Putting it together: a real-world AI feature, end to end

Let's trace a concrete example you might actually build or use — a customer-support assistant for an online print shop (the kind of SaaS this very platform is). This ties together almost everything in the chapter.

Example — a support chatbot done properly:

A customer types: "How long does delivery take for 500 business cards to Manchester?" (This is the prompt.)
The system turns the question into an embedding and searches the store's own help documents and shipping rules for the most relevant pages (the retrieval step of RAG).
It puts those pages plus the question into the context window and sends them to the LLM (built on the transformer architecture).
The model uses attention to connect "500," "business cards," and "Manchester" with the right lines in the supplied shipping table, and writes a grounded answer.
Because the facts came from retrieved documents, the answer is current and the system can show the source — so a human can verify and the risk of a hallucinated delivery promise is low.

A naive version skips RAG and just asks the raw LLM. It would invent a plausible-sounding delivery time — and a wrong shipping promise to a customer is a real business problem. The difference between a toy and a trustworthy product is mostly this discipline of grounding and verification.

24.13 Where AI is reshaping work in 2025–2026

You do not need to become a machine-learning engineer to be affected by this. AI is changing how ordinary knowledge work gets done:

Drafting and editing: first drafts of emails, reports, marketing copy, summaries — fast, then refined by a human.
Coding assistance: AI assistants now suggest, complete, and explain code, reshaping how software teams build. (More on the risks below.)
Customer support and search: RAG-based assistants answer from a company's own documents.
Data extraction and translation: pulling structured fields out of messy documents; translating between languages instantly.

The professional skill is no longer "can you produce a draft?" — the machine can. It is "can you direct the machine clearly, judge its output, and catch the errors it makes confidently?" That judgement is now the differentiator.

24.14 Best practices for using AI as a serious tool

Best practice — treat AI like a fast, brilliant, unreliable junior assistant. A good junior does a huge amount of work quickly and is often right — but you would never send their work to a client without reading it. Apply the same rule to AI: delegate freely, verify always, and never let unverified output go into a high-stakes decision (legal, medical, financial, or anything a customer relies on).

Write clear, specific prompts. State the role, the audience, the format, the length, and the tone. Vague in, vague out.
Give it context (the RAG mindset). Paste the relevant document, the example you want it to match, the constraints. The model can only reason about what is in front of it.
Verify every fact that matters. Especially names, numbers, dates, quotes, citations, and code that touches money or security. Fluency is not proof.
Ask it to show its sources or reasoning so you can check the chain, not just the conclusion.
Know its limits. It has a training cutoff (it doesn't know recent events unless you supply them), a context window (don't assume it remembers things you said far earlier in a long chat), and no access to private or live data unless you give it.

Common mistake — "vibe-coding" without understanding: Letting an AI generate an app you don't understand, then shipping it. AI-written code can be subtly wrong, insecure, or inefficient in ways that only surface later — and you cannot debug what you don't understand. The fundamentals from these chapters (how data moves, how requests flow, how to read an error) are what let you supervise the tool instead of being at its mercy. Using a tool you don't understand produces brittle results; the fundamentals never became optional.

24.15 The honesty beat, one more time

Let's end where the chapter's spine has been pointing. There is a thread running through this entire study guide: abstraction — every layer of computing hides the complexity of the one below. AI is the latest, highest layer. A chatbot's friendly conversation hides an LLM; the LLM hides a transformer; the transformer hides attention; attention hides billions of weights tuned by backpropagation; those weights run on neural-network maths; the maths runs on the CPU, memory, and binary you met in the earlier chapters. Pull any thread and the whole stack is there.

The leaky-abstraction reality applies here most of all. When the AI confidently invents a fact, or forgets the start of a long conversation, or gives wrong code, the layer below is leaking through — and your understanding of the mechanism is what lets you spot it. People who think "the AI just knows things" are constantly surprised. People who remember "it predicts likely text and has a limited window and no truth-checker" are rarely surprised; they design around it.

Key takeaway: Modern AI is genuinely powerful and genuinely limited, and both come from the same mechanism: it predicts likely text from patterns in data. Pair every feeling of magic with the plain mechanism. Direct it clearly, ground it in real facts, verify what matters, and never outsource your judgement to a machine that has none. Do that, and AI becomes one of the most powerful tools you will ever use — without quietly leading you off a cliff with a confident smile.

24.16 Chapter recap

AI ⊃ ML ⊃ Deep Learning ⊃ Transformers/LLMs — keep the nesting map fixed; "AI" today usually means the innermost doll.
ML flips programming: instead of rules + data → answers, it learns rules from data + answers.
The pieces: training data, features, model; the three styles are supervised, unsupervised, reinforcement.
A neural network is layers of units with tunable weights, trained by backpropagation; the great enemy is overfitting (memorising, not generalising).
An LLM is scaled-up next-token prediction, built on the transformer and its attention mechanism (with a quadratic cost that limits the context window).
Hallucination is the core risk; RAG (open-book grounding) and verification are the core defences.
Fluency is not truth, confidence is not accuracy, and a tool you don't understand can't be supervised. Direct it, ground it, verify it.
