Frequently Asked Questions

By Pritesh Yadav · 9 min read

Q: What is the difference between RAG and fine-tuning, and when should I use each?

RAG (Retrieval-Augmented Generation) fetches relevant documents at query time and inserts them into the model's context, so the model reasons over fresh, external knowledge without changing its weights. Fine-tuning adjusts the model's weights on examples to change its behavior, tone, or format. Use RAG when you need up-to-date or proprietary facts that change often; use fine-tuning when you need a consistent style, structure, or skill the base model handles poorly. They are complementary, not competing: a fine-tuned model can still retrieve context via RAG.
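To make the "fetch, then insert into context" idea concrete, here is a minimal sketch. It assumes a tiny in-memory corpus and a toy word-overlap scorer; `DOCS`, `retrieve`, and `build_prompt` are hypothetical names for illustration, and a real system would likely use a proper search index and send the prompt to a model.

```python
import re

# Hypothetical in-memory corpus; a real system would load documents from storage.
DOCS = [
    "Refunds are available within 30 days of purchase.",
    "Shipping takes 3 to 5 business days within the EU.",
    "Support is open Monday to Friday, 9am to 5pm.",
]

def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by how many words they share with the query (toy scorer)."""
    q = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Insert the retrieved passages into the prompt; the model's weights are untouched."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )

print(build_prompt("How many days do refunds take?", DOCS))
```

Updating the knowledge here means editing `DOCS`, not retraining anything, which is the property that makes RAG a good fit for facts that change often.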

Q: Is an "agent" just a fancy word for a workflow?

No. A workflow follows a fixed, pre-defined sequence of steps that you wrote, even if some steps call an LLM. An agent decides for itself which steps to take, in what order, and when to stop, typically by choosing tools in a loop based on intermediate results. The practical rule: if you can draw the control flow as a flowchart ahead of time, build a workflow; reach for an agent only when the path genuinely can't be known in advance.
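The contrast can be sketched in a few lines. This is an illustration under stated assumptions, not a framework: `extract` and `summarize` are stub steps, and `toy_policy` is a hypothetical stand-in for the decision an LLM would make inside a real agent.

```python
from typing import Callable

# Stub steps; in a real system some of these would call an LLM or an API.
def extract(text: str) -> str:
    return text.strip()

def summarize(text: str) -> str:
    return text[:20]

def run_workflow(text: str) -> str:
    """Workflow: the order of steps is fixed in code, written ahead of time."""
    return summarize(extract(text))

def run_agent(
    state: str,
    choose: Callable[[str], str],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    """Agent: a policy (normally the model) picks the next tool until it says stop."""
    for _ in range(max_steps):  # hard cap so the loop always terminates
        action = choose(state)
        if action == "stop":
            break
        state = tools[action](state)
    return state

def toy_policy(state: str) -> str:
    """Hypothetical stand-in for the model's decision, based on the current state."""
    if state != state.strip():
        return "extract"
    if len(state) > 20:
        return "summarize"
    return "stop"

tools = {"extract": extract, "summarize": summarize}
text = "   A long input document that needs cleaning and shortening.   "
print(run_workflow(text))
print(run_agent(text, toy_policy, tools))
```

Both paths reach the same result on this input, which is the point: when the flowchart is knowable up front, the workflow version is shorter and easier to test.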

Q: Why do evaluations matter so much? Can't I just eyeball the outputs?

Eyeballing works for a demo but collapses at scale: you can't manually re-check hundreds of cases every time you tweak a prompt or swap a model. Evals turn "it feels better" into a repeatable, measurable score, so you can tell whether a change actually improved things or quietly broke something. Without evals you are flying blind, and small regressions ship silently. Think of evals as the unit tests of LLM engineering: they are the single highest-leverage investment you can make.

Q: How do I even start building an eval set?

Start small and concrete: collect 20–50 real or realistic inputs that represent the cases you care about, including a few hard or edge cases. For each, define what a good answer looks like, whether that's an exact match, a checklist of required facts, or a rubric. Run your system over the set, look at the failures by hand, and only then automate the scoring. A scrappy 30-case eval you actually run beats a perfect 1,000-case eval you keep meaning to build.
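A scrappy eval along these lines might look like the sketch below. It assumes checklist-style scoring (every required fact must appear in the answer); `EVAL_SET`, `toy_system`, and the canned answers are hypothetical placeholders for your own cases and pipeline.

```python
from typing import Callable

# Hypothetical eval cases: each input is paired with facts a good answer must mention.
EVAL_SET = [
    {"input": "What is the refund window?", "required": ["30 days"]},
    {"input": "When is support open?", "required": ["monday", "friday"]},
    {"input": "Do you ship to the EU?", "required": ["yes", "3 to 5"]},
]

def passes(answer: str, required: list[str]) -> bool:
    """Checklist scoring: every required fact must appear in the answer."""
    text = answer.lower()
    return all(fact.lower() in text for fact in required)

def run_eval(system: Callable[[str], str], cases: list[dict]) -> tuple[float, list[dict]]:
    """Run the system over every case; return the pass rate and the failing cases."""
    failures = []
    for case in cases:
        answer = system(case["input"])
        if not passes(answer, case["required"]):
            failures.append({**case, "answer": answer})
    return 1 - len(failures) / len(cases), failures

def toy_system(question: str) -> str:
    """Stand-in for the real pipeline; replace with a call to your own system."""
    canned = {
        "What is the refund window?": "Refunds are accepted within 30 days.",
        "When is support open?": "Support runs Monday to Friday.",
    }
    return canned.get(question, "I don't know.")

score, failures = run_eval(toy_system, EVAL_SET)
print(f"pass rate: {score:.0%}")
for f in failures:  # read the failures by hand before automating further
    print(f["input"], "->", f["answer"])
```

Returning the failing cases alongside the score keeps the "look at the failures by hand" step cheap, which matters more than the number itself early on.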

Q: What is "LLM-as-judge" and can I trust it?

LLM-as-judge means using a language model to score or grade another model's output against a rubric, which scales evaluation far beyond what humans can review by hand. It's useful but not automatically trustworthy: judges can be biased toward longer answers, toward the first option shown, or toward outputs that sound confident. To trust it, calibrate the judge against a sample of human-labeled cases, give it a clear rubric with examples, and ask for a short justification before the score. Treat the judge as an instrument you must validate, not an oracle.
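One way to calibrate a judge is to compare its labels with human labels on the same sample. The sketch below assumes binary pass/fail labels and made-up data; it reports raw agreement plus Cohen's kappa, which is one common chance-corrected measure and a choice made here for illustration, not something this FAQ prescribes.

```python
# Hypothetical labels for the same ten outputs: 1 = acceptable, 0 = not acceptable.
human = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
judge = [1, 1, 1, 1, 0, 0, 1, 0, 0, 1]

def agreement(human: list[int], judge: list[int]) -> float:
    """Share of cases where the judge gives the same label as the human."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def cohen_kappa(human: list[int], judge: list[int]) -> float:
    """Agreement corrected for chance, for binary 0/1 labels."""
    n = len(human)
    observed = agreement(human, judge)
    p_h, p_j = sum(human) / n, sum(judge) / n
    # Probability that both raters agree purely by chance.
    expected = p_h * p_j + (1 - p_h) * (1 - p_j)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

print(f"agreement: {agreement(human, judge):.2f}")
print(f"kappa:     {cohen_kappa(human, judge):.2f}")

# Inspect the disagreements: they show where the rubric or the judge needs work.
for i, (h, j) in enumerate(zip(human, judge)):
    if h != j:
        print(f"case {i}: human={h} judge={j}")
```

If agreement is low, fix the rubric or the judge prompt before trusting any score the judge produces at scale.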

Q: What exactly is a hallucination, and why does it happen?

A hallucination is when the model produces confident, fluent text that is factually wrong or unsupported by its sources. It happens because language models predict plausible next words, not verified truth, so when they lack the right information they fill the gap with something that merely sounds right. They don't "know what they don't know" by default. You reduce hallucination by grounding answers in retrieved sources, asking the model to cite or say "I don't know," and verifying claims against the provided context.

Q: What is the difference between prompt engineering and context engineering?

Prompt engineering is crafting the instructions and wording you give the model: the phrasing, role, examples, and output format. Context engineering is the broader discipline of deciding what information goes into the model's window at all: which retrieved documents, prior messages, tool results, and system instructions, and in what order and amount. Prompt engineering is one piece of context engineering. As systems grow, the question shifts from "what should I say?" to "what should the model see, and what should I leave out?"

Q: Do I actually need a vector database?

Not always. A vector database shines when you have a large corpus and need semantic search, which finds passages by meaning rather than exact keywords. But for small or stable document sets, keyword search, a plain database, or even stuffing everything into the context window can work fine and be far simpler. Start with the simplest retrieval that meets your accuracy bar, and add a vector store only when scale or semantic matching demands it. Many "we need a vector DB" projects really needed better chunking and a basic search.
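For a small corpus, semantic matching itself doesn't require a database: a brute-force scan over in-memory vectors is often enough. The sketch below assumes you already have embeddings; the three-dimensional vectors in `INDEX` are hand-written placeholders, not output from a real embedding model.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; real ones would come from an embedding model.
INDEX = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "support hours": [0.0, 0.2, 0.9],
}

def search(query_vec: list[float], index: dict[str, list[float]], k: int = 1) -> list[str]:
    """Brute-force search: score every entry and return the k most similar names."""
    ranked = sorted(index, key=lambda name: cosine(query_vec, index[name]), reverse=True)
    return ranked[:k]

print(search([0.8, 0.2, 0.1], INDEX, k=2))
```

A linear scan like this costs one similarity computation per document per query, so it stops being practical as the corpus grows; that is the point at which a dedicated vector store starts to earn its complexity.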

Q: When should I not use AI or an LLM at all?

Avoid LLMs when a deterministic rule, a database query, or simple code would do the job more reliably and cheaply, such as math, exact lookups, or strict business logic. Be cautious in high-stakes domains (medical, legal, financial) without a human in the loop and strong guardrails. Skip them when you can't tolerate the variability and occasional errors that are inherent to generative models. The mature engineer's instinct is to ask "does this genuinely need probabilistic language understanding?" before reaching for a model.

Q: What does "context window" mean, and is a bigger one always better?

The context window is the maximum amount of text (measured in tokens) the model can consider at once, including your prompt, retrieved data, conversation history, and its own answer. A bigger window lets you include more, but it isn't a free win: cost and latency rise with length, and models often pay less attention to information buried in the middle of a long context. More relevant context beats more context. Curating what goes in usually outperforms dumping everything in and hoping the model finds the needle.
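Curating context can be as simple as packing the most relevant passages into a fixed token budget. This sketch assumes each passage already carries a relevance score, and it estimates tokens from word counts at roughly four tokens per three words; a real system would use the model's own tokenizer. `pack_context` and the sample scores are hypothetical.

```python
import math

def approx_tokens(text: str) -> int:
    """Crude token estimate (assumption): about 4 tokens for every 3 words."""
    return math.ceil(len(text.split()) * 4 / 3)

def pack_context(passages: list[tuple[float, str]], budget: int) -> list[str]:
    """Keep the highest-scoring passages that still fit within the token budget."""
    chosen, used = [], 0
    for score, text in sorted(passages, key=lambda p: p[0], reverse=True):
        cost = approx_tokens(text)
        if used + cost <= budget:
            chosen.append(text)
            used += cost
    return chosen

# Hypothetical (relevance score, passage) pairs from an earlier retrieval step.
passages = [
    (0.9, "Refunds are available within 30 days of purchase."),
    (0.2, "Our office has a new coffee machine."),
    (0.7, "Refund requests need the original order number."),
]

for line in pack_context(passages, budget=22):
    print(line)
```

With a budget of 22 estimated tokens, the two refund passages fit and the low-scoring one is left out, which is the "more relevant context beats more context" rule expressed as code.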

Q: What is the difference between retrieval quality and generation quality?

In a RAG system, retrieval is finding the right source material, and generation is the model writing a good answer from it. These fail independently: the model can write beautifully from wrong documents, or botch an answer despite having perfect sources. Diagnosing problems means checking both separately: did we fetch the correct passages, and did the model use them faithfully? Most RAG failures trace back to retrieval, so evaluate that stage first before blaming the model.
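Checking the two stages separately can be sketched as follows. It assumes each eval case records which passage IDs were retrieved, which IDs a human marked as relevant, and whether the final answer was judged correct; the `cases` data and the `diagnose` helper are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant passage IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

# Hypothetical log of one eval run: retrieved IDs, gold labels, and answer verdicts.
cases = [
    {"retrieved": ["d1", "d7", "d3"], "relevant": {"d1", "d3"}, "answer_ok": True},
    {"retrieved": ["d4", "d9", "d2"], "relevant": {"d5"}, "answer_ok": False},
    {"retrieved": ["d6", "d8", "d0"], "relevant": {"d6"}, "answer_ok": False},
]

def diagnose(case: dict, k: int = 3) -> str:
    """Attribute a failure to the stage that caused it."""
    if case["answer_ok"]:
        return "ok"
    if recall_at_k(case["retrieved"], case["relevant"], k) < 1.0:
        return "retrieval"  # the right passages never reached the model
    return "generation"  # the sources were present, the answer was still wrong

for case in cases:
    print(diagnose(case))
```

Counting how many failures land in each bucket tells you which stage to work on first, rather than guessing.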

Q: How are tokens related to cost and limits, and why should I care?

Models read and write in tokens (chunks of text that average roughly three-quarters of an English word), and hosted APIs typically bill per token for both input and output, often at different rates. This means long prompts, big retrieved documents, and verbose answers all cost money and add latency. Caring about tokens keeps you efficient: trimming redundant context, summarizing history, and capping output length can cut cost dramatically without hurting quality. It also explains why "just send everything" is rarely the right design.
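The arithmetic is worth doing once for your own numbers. The sketch below uses made-up prices and traffic; `PRICE_PER_M_INPUT`, `PRICE_PER_M_OUTPUT`, and `REQUESTS_PER_DAY` are assumptions for illustration, so substitute the figures from your provider's price list.

```python
# Hypothetical prices in USD per million tokens; check your provider's price list.
PRICE_PER_M_INPUT = 3.00
PRICE_PER_M_OUTPUT = 15.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request: input and output tokens are priced separately."""
    return (
        input_tokens / 1_000_000 * PRICE_PER_M_INPUT
        + output_tokens / 1_000_000 * PRICE_PER_M_OUTPUT
    )

REQUESTS_PER_DAY = 1_000  # assumed traffic

# Before: a long prompt with everything attached. After: trimmed context, capped output.
before = request_cost(input_tokens=6_000, output_tokens=500)
after = request_cost(input_tokens=2_000, output_tokens=300)

print(f"before: ${before * REQUESTS_PER_DAY:.2f}/day")
print(f"after:  ${after * REQUESTS_PER_DAY:.2f}/day")
print(f"saving: {1 - after / before:.0%}")
```

Under these assumed prices, trimming the prompt from 6,000 to 2,000 tokens and the answer from 500 to 300 cuts the bill by more than half.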

Q: What is "AI product judgment" and why is it listed as a core pillar?

AI product judgment is the skill of deciding whether and how AI should be used in a product, not just how to make a model respond. It covers choosing the right use case, designing for graceful failure, setting user expectations, and knowing when a simpler non-AI solution wins. It's a pillar because the hardest mistakes are rarely technical: they're shipping AI where it doesn't belong, or exposing raw model uncertainty to users who can't handle it. Good engineering can't rescue a bad product decision.

Q: How should an AI feature handle being wrong, since the model sometimes will be?

Design for failure from the start, because a probabilistic system will produce errors no matter how good it is. Give the model an escape hatch ("I'm not sure" beats a confident wrong answer), show sources so users can verify, and keep a human in the loop for consequential actions. Make the cost of a mistake recoverable: prefer drafts the user approves over actions taken automatically. The goal isn't zero errors; it's that errors are visible, correctable, and low-stakes.

Q: What is "temperature," and when should I change it?

Temperature controls randomness in the model's word choices: lower values make outputs more focused and repeatable, higher values make them more varied and creative. For factual tasks, extraction, classification, or anything you want consistent and testable, use a low temperature. For brainstorming, copywriting, or generating diverse options, raise it. When you're building evals, keep temperature low so results are reproducible enough to compare.
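A small numeric sketch shows the effect. It assumes the common formulation in which the model's raw scores (logits) are divided by the temperature before a softmax; the logits are made up, and a given provider's exact sampling pipeline may differ.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, probability mass concentrates on the top-scoring token, so repeated runs agree more often; as it rises, the distribution flattens and the less likely tokens get sampled more.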

Q: My prompt works once but fails unpredictably later. What's going on?

LLM outputs are sampled, which makes them non-deterministic: the same input can yield different outputs, so a prompt that works in one test isn't proven reliable. A single success is an anecdote, not evidence. The fix is to run the prompt across many varied inputs and measure how often it succeeds, which is exactly what an eval set gives you. Treat reliability as a percentage you measure, not a binary you assume from one happy demo.
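Measuring reliability as a percentage can be sketched like this. The model here is simulated: `make_flaky_model` is a hypothetical stand-in that returns valid JSON only some of the time, and in practice `run` would call your real model.

```python
import json
import random
from typing import Callable

def success_rate(
    run: Callable[[str], str],
    check: Callable[[str], bool],
    prompt: str,
    trials: int = 200,
) -> float:
    """Run the same prompt many times and return the fraction of passing outputs."""
    passed = sum(check(run(prompt)) for _ in range(trials))
    return passed / trials

def make_flaky_model(seed: int, p_good: float = 0.8) -> Callable[[str], str]:
    """Hypothetical stand-in for a sampled LLM: valid JSON only some of the time."""
    rng = random.Random(seed)

    def run(prompt: str) -> str:
        if rng.random() < p_good:
            return '{"status": "ok"}'
        return "Sure! Here you go: ok"

    return run

def is_json_object(output: str) -> bool:
    """Pass only if the output parses as a JSON object."""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False

rate = success_rate(make_flaky_model(seed=0), is_json_object, "Return JSON.")
print(f"success rate: {rate:.1%}")
```

A single lucky run of this stub would look like a working prompt; two hundred runs show the failure rate you would otherwise discover in production.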

Q: How do tools and function calling fit into agents?

Tools, exposed through a mechanism often called function calling, let a model do things beyond generating text: search the web, query a database, run code, or call an API. You describe each tool's purpose and inputs, and the model chooses when to invoke one and with what arguments; your code executes the call and returns the result, which the model reads before continuing. This is what makes agents capable of acting in the world rather than just talking. Clear tool descriptions and well-scoped, safe tools matter as much as the prompt, because a confused model will call the wrong tool or pass bad arguments.
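The application side of tool use can be sketched without any particular provider's API. The format below, a JSON object with `name` and `arguments`, is an assumption for illustration, as are `TOOLS`, `get_weather`, and `dispatch`; real APIs define their own schemas.

```python
import json

def get_weather(city: str) -> str:
    """Stub tool; a real one would call a weather API."""
    return f"Sunny in {city}"

# Hypothetical registry: the description is what the model would be shown.
TOOLS = {
    "get_weather": {
        "description": "Return the current weather for a city.",
        "parameters": {"city": str},
        "handler": get_weather,
    },
}

def dispatch(tool_call_json: str) -> str:
    """Validate a model-proposed tool call and run it, or return an error message."""
    try:
        call = json.loads(tool_call_json)
    except json.JSONDecodeError:
        return "error: tool call was not valid JSON"
    if not isinstance(call, dict):
        return "error: tool call must be a JSON object"
    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        return f"error: unknown tool {name!r}"
    tool = TOOLS[name]
    args = call.get("arguments", {})
    expected = tool["parameters"]
    if not isinstance(args, dict) or set(args) != set(expected):
        return f"error: expected arguments {sorted(expected)}"
    for arg_name, arg_type in expected.items():
        if not isinstance(args[arg_name], arg_type):
            return f"error: argument {arg_name!r} must be {arg_type.__name__}"
    return tool["handler"](**args)

print(dispatch('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
print(dispatch('{"name": "delete_database", "arguments": {}}'))
```

Returning an error string instead of raising lets the error go back to the model as a tool result, so it has a chance to correct a wrong tool name or bad arguments on the next turn.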

Q: How do I decide between a single prompt, a workflow, or a full agent?

Climb the ladder only as far as the problem forces you. Start with a single prompt; if the task needs multiple reliable steps you can predefine, build a workflow that chains prompts and code; reach for an agent only when the steps genuinely can't be planned ahead because they depend on what's discovered along the way. Each rung up adds power but also cost, latency, and unpredictability. The most common and expensive mistake is building an autonomous agent for a job a simple workflow would have done more reliably.

Q: What's the single most important habit for someone starting out in LLM engineering?

Look at your data. Read actual model outputs, especially the failures, before theorizing about fixes, because the real failure modes are almost never the ones you'd guess. Build a small eval set early so every change is measured rather than vibed. And stay biased toward the simplest approach that works, adding retrieval, agents, or fine-tuning only when evidence shows you need them. Curiosity about your outputs plus measurement beats clever architecture every time.
