Detection, Monitoring & Incident Response

By Pritesh Yadav

Everything you have learned so far tries to prevent attacks. This section is about what happens when prevention fails — because eventually it will. A clever phishing email, a leaked password, an unpatched library, a careless vendor: sooner or later something gets through. The mature security mindset, called "assume breach", accepts this reality and changes the goal. Instead of betting everything on a perfect wall, you also invest in detecting an intruder quickly, responding cleanly, and limiting the blast radius (how much damage one compromise can cause). A team that can spot and evict an attacker in hours suffers a minor scare; a team that takes months suffers a catastrophe.

Analogy: A bank does not rely only on its vault door. It also has cameras, motion sensors, guards, alarms, and a rehearsed plan for what to do when the alarm trips. Locks are prevention; cameras and guards are detection and response. You need both.

13.1 Why detection pays off: dwell time, MTTD, and MTTR

Three terms define how good a team is at this:

MTTD (Mean Time To Detect)
The average time from when an attacker gets in to when you notice.
MTTR (Mean Time To Respond/Remediate)
The average time from noticing to fully containing and cleaning up.
Dwell time
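How long an attacker is present in your environment. Strictly, it runs from initial compromise to detection (this is how Mandiant measures it); informally it is often stretched to roughly MTTD + MTTR, the time until the attacker is actually evicted. The single most important number to drive down.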
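
To make the definitions concrete, here is a minimal sketch that computes MTTD and MTTR from a list of past incidents. The incident timestamps are made up for illustration, and it assumes you can estimate when each compromise began (in practice this usually comes from forensics after the fact).

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident history: (compromised, detected, resolved), all in UTC.
incidents = [
    (datetime(2025, 3, 1, 8, 0), datetime(2025, 3, 3, 8, 0), datetime(2025, 3, 4, 8, 0)),
    (datetime(2025, 5, 10, 0, 0), datetime(2025, 5, 10, 12, 0), datetime(2025, 5, 11, 0, 0)),
]


def hours(delta):
    """Convert a timedelta to hours."""
    return delta.total_seconds() / 3600


# MTTD: average gap between compromise and detection.
mttd = mean(hours(detected - compromised) for compromised, detected, _ in incidents)

# MTTR: average gap between detection and full cleanup.
mttr = mean(hours(resolved - detected) for _, detected, resolved in incidents)

print(f"MTTD: {mttd:.1f} h")  # 30.0 h
print(f"MTTR: {mttr:.1f} h")  # 18.0 h
```

Tracking these two averages per quarter is a simple way to see whether detection and response are actually getting faster.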

The 2025–2026 numbers explain why this matters. Mandiant's M-Trends 2026 reports a global median dwell time of 14 days (up from 11 — attackers are evading detection better). IBM's Cost of a Data Breach 2025 found organizations averaged roughly 158–181 days just to identify a breach plus about 60 more to contain it — a lifecycle near 241 days. The financial sting: the global average breach cost is $4.44M, the US average is $10.22M (highest in the world), and healthcare leads sectors at $7.42M. Notably, "shadow AI" (employees using unsanctioned AI tools) added about $670K per breach, while organizations using AI and automation extensively cut their breach lifecycle by ~80 days and saved ~$1.9M. The lesson: every day shaved off dwell time is money saved.

Key takeaway: You cannot prevent every breach, but you control how long it lasts. Detection and response speed — not perfect prevention — is the real measure of a mature security program.

13.2 Logging: what to log, and what never to log

You cannot detect or investigate what you did not record. Logging means writing down security-relevant events as they happen. The OWASP Logging Cheat Sheet and OWASP Top 10:2025 category A09 spell out what to capture. Every log line should answer five questions:

  • WHO — the user or source (account, IP address).
  • WHAT — the action taken.
  • WHEN — a synchronized timestamp in UTC (so logs from different servers line up).
  • WHERE — the host or service it happened on.
  • OUTCOME — success or failure.

Log the security-meaningful events: login success and failure, authorization failures (someone tried to access what they shouldn't), session start/end, input-validation failures, account and privilege changes, admin actions, data exports, and configuration changes. Logs must be tamper-resistant — append-only and forwarded off the machine immediately, so an attacker who compromises a server cannot quietly erase their tracks. Retain them long enough for compliance (often a year or more) and store them where they can be correlated together.
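
As a rough illustration of the five questions, the sketch below builds one structured log line as JSON. The function name `security_event` and the field names are assumptions chosen for this example, not a standard schema; the point is that every event carries who, what, when (UTC), where, and outcome.

```python
import json
from datetime import datetime, timezone


def security_event(who, what, where, outcome, source_ip=None):
    """Build one JSON log line answering WHO, WHAT, WHEN, WHERE, OUTCOME."""
    event = {
        "when": datetime.now(timezone.utc).isoformat(),  # always UTC
        "who": who,
        "source_ip": source_ip,
        "what": what,
        "where": where,
        "outcome": outcome,
    }
    return json.dumps(event, sort_keys=True)


# Example: a failed login recorded by a hypothetical auth service.
line = security_event("alice", "login", "auth-service-1", "failure", "203.0.113.7")
print(line)
```

One JSON object per line keeps the events easy to forward off-host and to correlate later.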

Equally important is what you must never log. This ties directly to this project's "no secrets/PII in logs" rule:

  • Passwords, session tokens, access tokens, API keys, encryption keys, database connection strings.
  • Full payment card numbers (PAN) or CVV, government IDs / SSNs, health data, source code, and sensitive personal data.

These must be removed, masked, hashed, or pseudonymized before writing. Two weakness IDs name the failures: CWE-532 (inserting sensitive data into logs) and CWE-117 (log injection — an attacker plants newline characters in input to forge fake log lines and hide their activity). Always neutralize newlines in logged user input.
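
A minimal sketch of both defenses follows: masking sensitive fields (CWE-532) and escaping newlines in user-controlled values (CWE-117). The key list and the `scrub` helper are illustrative assumptions; a real deployment would extend the list and usually apply this inside the logging library's filter or formatter.

```python
# Hypothetical list of field names that must never reach a log file.
SENSITIVE_KEYS = {"password", "token", "api_key", "authorization", "card_number", "cvv"}


def neutralize(value):
    """Escape CR/LF so user input cannot start a forged log line."""
    return value.replace("\r", "\\r").replace("\n", "\\n")


def scrub(fields):
    """Return a copy of fields that is safe to write to a log."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        else:
            clean[key] = neutralize(str(value))
    return clean


# The attacker-supplied username tries to inject a fake "admin login ok" line.
print(scrub({"user": "bob\nINFO admin login ok", "password": "hunter2"}))
```

Matching on key names only catches secrets that arrive in well-named fields; secrets embedded in free text (for example, a token inside a URL) need pattern-based scrubbing as well.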

Common mistake: Leaving debug logging at verbose level in production. It quietly dumps full request bodies — including auth tokens and card numbers — into log files. Logs then become the breach: a stolen log archive hands the attacker a stepping stone to everything else.
Example: OWASP Top 10:2025 (released late 2025, replacing the 2021 list) deliberately renamed A09 from "Security Logging and Monitoring Failures" to "Security Logging and Alerting Failures." The word "Monitoring" was swapped for "Alerting" to make a point: the real failure mode is an event that nobody is paged about, not the absence of a pretty dashboard.

13.3 The tooling stack — SIEM, EDR, XDR, SOAR, SOC

Beginners constantly confuse these acronyms. Here is the clean separation, with one analogy that holds them together.

Analogy: Think of a building's security. SIEM = the camera DVR that records every feed for later review. EDR = a guard watching one specific door very closely. XDR = the control room that fuses every camera, door, and sensor into one picture. SOAR = automatic door locks that slam shut when a trip-wire fires. SOC = the human guards running it all 24/7.

| Tool | Full name | What it does | Examples / notes |
|------|-----------|--------------|------------------|
| SIEM | Security Information & Event Management | Central log aggregation, correlation, long-term storage, compliance & historical search | Splunk, Microsoft Sentinel, Elastic; weakness = alert fatigue, heavy tuning |
| EDR | Endpoint Detection & Response | Deep view of one endpoint (processes, files, behavior); can isolate a host | CrowdStrike Falcon, SentinelOne, Defender for Endpoint |
| XDR | Extended Detection & Response | Correlates endpoint + network + cloud + email + identity into ONE prioritized incident | AI-driven; reduces 10 alerts to 1 story |
| SOAR | Security Orchestration, Automation & Response | Runs automated playbooks (auto-disable account, quarantine host) | Automates the repetitive 3am steps |
| SOC | Security Operations Center | The human team + process running all the above, around the clock | MDR = an outsourced/managed SOC + EDR |

A common maturity ladder: EDR (eyes on endpoints) → SIEM (central view + compliance) → SOAR (automate the repeats) → XDR (unified context). Smaller teams that cannot staff a 24/7 SOC often buy MDR (Managed Detection & Response) — renting the team and tooling.
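
To show what "runs automated playbooks" means in practice, here is a toy SOAR-style playbook. Everything in it is hypothetical: `disable_account` and `isolate_host` are stand-ins that only record what they would do, where a real platform would call your identity provider and EDR, and the severity threshold of 8 is an arbitrary assumption.

```python
def disable_account(user, actions):
    """Stand-in for an identity-provider call; only records the action."""
    actions.append(f"disable_account:{user}")


def isolate_host(host, actions):
    """Stand-in for an EDR network-isolation call; only records the action."""
    actions.append(f"isolate_host:{host}")


def run_playbook(alert):
    """Decide automated steps for one alert and return them in order."""
    actions = []
    if alert["type"] == "impossible_travel":
        disable_account(alert["user"], actions)
    # Assumed rule: only isolate hosts automatically for high-severity alerts.
    if alert.get("host") and alert["severity"] >= 8:
        isolate_host(alert["host"], actions)
    actions.append("notify_on_call")  # a human is always told what happened
    return actions


alert = {"type": "impossible_travel", "user": "alice", "host": "laptop-42", "severity": 9}
print(run_playbook(alert))
```

Keeping the human notification unconditional reflects a common design choice: automation handles the first minutes, but a person still reviews what was done.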

13.4 Detection types and MITRE ATT&CK

Detections come in two flavors. Signature/rule-based detection looks for known bad things — specific malware hashes, known-bad IP addresses, or patterns called IOCs (Indicators of Compromise). It is precise but blind to anything new. Anomaly/behavioral detection learns what "normal" looks like and flags deviations. A key form is UEBA (User & Entity Behavior Analytics), which catches things like impossible-travel logins (the same account signs in from London and Tokyo ten minutes apart), off-hours mass downloads, or lateral movement between machines. Threat intelligence feeds enrich both by supplying fresh lists of known-bad IPs and file hashes.
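
The impossible-travel check is simple enough to sketch. The version below uses the haversine formula for great-circle distance and flags two logins if the implied speed exceeds a threshold. The 1000 km/h limit (roughly airliner speed) and the city coordinates are assumptions for illustration; real products also have to cope with VPNs and inaccurate IP geolocation.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

MAX_PLAUSIBLE_KMH = 1000  # assumed threshold, roughly a commercial flight


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km = mean Earth radius


def is_impossible_travel(login_a, login_b):
    """Each login is (timestamp, latitude, longitude)."""
    (t1, lat1, lon1), (t2, lat2, lon2) = login_a, login_b
    elapsed_hours = abs((t2 - t1).total_seconds()) / 3600
    distance = haversine_km(lat1, lon1, lat2, lon2)
    if elapsed_hours == 0:
        return distance > 0  # same instant, two different places
    return distance / elapsed_hours > MAX_PLAUSIBLE_KMH


london = (datetime(2025, 6, 1, 9, 0), 51.5074, -0.1278)
tokyo = (datetime(2025, 6, 1, 9, 10), 35.6762, 139.6503)
print(is_impossible_travel(london, tokyo))  # True: ~9,500 km in ten minutes
```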

MITRE ATT&CK is a free, globally used knowledge base of real-world attacker behavior. It organizes attacks into tactics (the attacker's goal — the "why") and techniques (the specific method — the "how"), each with a Txxxx ID (for example, T1566 = phishing). The current release is v18.1 (December 2025); the Enterprise matrix has 14 tactics spanning the attack lifecycle: Reconnaissance, Resource Development, Initial Access, Execution, Persistence, Privilege Escalation, Defense Evasion, Credential Access, Discovery, Lateral Movement, Collection, Command & Control, Exfiltration, and Impact — plus 200+ techniques. Teams use it to map their detections ("do we cover T1566?"), find coverage gaps, and speak a common language during incidents.
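
Coverage mapping can be as simple as a set difference. In this sketch the detection rule names are invented and the list of "priority" techniques is an assumption a team would choose for itself; the technique IDs and names are real ATT&CK entries.

```python
# Hypothetical detection rules, each tagged with the ATT&CK techniques it covers.
detections = {
    "suspicious_email_attachment": ["T1566"],
    "many_failed_logins": ["T1110"],
    "mass_file_encryption": ["T1486"],
}

# Techniques this (imaginary) team has decided matter most to it.
priority_techniques = {
    "T1566": "Phishing",
    "T1078": "Valid Accounts",
    "T1110": "Brute Force",
    "T1021": "Remote Services",
    "T1486": "Data Encrypted for Impact",
}

covered = {tid for ids in detections.values() for tid in ids}
gaps = sorted(set(priority_techniques) - covered)

for tid in gaps:
    print(f"No detection for {tid} ({priority_techniques[tid]})")
# No detection for T1021 (Remote Services)
# No detection for T1078 (Valid Accounts)
```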

13.5 The incident response lifecycle — NIST and SANS

An incident response (IR) process gives the team a known sequence to follow under pressure. Learn both common models.

NIST shifted in 2025. The old, widely-taught SP 800-61 Rev 2 (four phases: Preparation → Detection & Analysis → Containment/Eradication/Recovery → Post-Incident Activity) was withdrawn in April 2025. The new SP 800-61 Rev 3 drops the rigid sequence and instead maps IR onto the six functions of the Cybersecurity Framework (CSF) 2.0: Govern, Identify, Protect, Detect, Respond, Recover — treating "Improve" as something continuous, not a one-time post-mortem. The four-phase model still appears everywhere in practice and on exams, so know both.

SANS PICERL is the dominant operational mnemonic — six steps:

  1. Preparation — tools, plans, and training before anything happens.
  2. Identification — confirm an incident is real and scope it.
  3. Containment — isolate first (short-term), then plan a clean rebuild (long-term).
  4. Eradication — remove the malware, backdoors, and persistence the attacker planted.
  5. Recovery — restore from known-good backups and watch for reinfection.
  6. Lessons Learned — improve so it doesn't recur.

  SANS PICERL  vs  NIST (older 4-phase)

  Preparation ............... Preparation
  Identification ............ Detection & Analysis
  Containment  \
  Eradication   >........... Containment / Eradication / Recovery
  Recovery     /
  Lessons Learned ........... Post-Incident Activity
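
The same mapping can be written as data, which is handy if an incident tracker needs to report phases in either vocabulary. This is only a sketch: the `Phase` enum and `advance` helper are hypothetical names, and enforcing strict order is a simplifying assumption (real incidents often loop back).

```python
from enum import IntEnum


class Phase(IntEnum):
    PREPARATION = 1
    IDENTIFICATION = 2
    CONTAINMENT = 3
    ERADICATION = 4
    RECOVERY = 5
    LESSONS_LEARNED = 6


# SANS PICERL step -> phase name in the older NIST four-phase model.
NIST_4_PHASE = {
    Phase.PREPARATION: "Preparation",
    Phase.IDENTIFICATION: "Detection & Analysis",
    Phase.CONTAINMENT: "Containment / Eradication / Recovery",
    Phase.ERADICATION: "Containment / Eradication / Recovery",
    Phase.RECOVERY: "Containment / Eradication / Recovery",
    Phase.LESSONS_LEARNED: "Post-Incident Activity",
}


def advance(current):
    """Move to the next step; no step can be skipped."""
    if current is Phase.LESSONS_LEARNED:
        return Phase.PREPARATION  # lessons feed the next round of preparation
    return Phase(current + 1)


print(advance(Phase.CONTAINMENT).name)    # ERADICATION
print(NIST_4_PHASE[Phase.ERADICATION])    # Containment / Eradication / Recovery
```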

13.6 Playbooks, on-call, and severity levels

Playbooks are pre-written, step-by-step responses for specific scenarios (ransomware, business email compromise, data exfiltration) so responders execute a tested plan instead of improvising at 3am. An on-call rotation assigns a primary responder, a backup, and an escalation path. Severity levels set the urgency and who gets pulled in. A common scheme:

| Severity | Meaning | Response |
|----------|---------|----------|
| SEV1 | Critical: full outage or data at risk | Page on-call + backup + lead; ack ≤5 min; dedicated incident channel + Commander; stakeholder updates every 15 min; status page; mandatory blameless retro ≤48h |
| SEV2 | Major degradation, workaround exists | Ack ≤15 min; updates every 30 min |
| SEV3/4 | Minor or cosmetic | Handled during business hours |

Always name an Incident Commander — the person who coordinates and decides, kept separate from the hands-on responders doing the technical work, so nobody is both fighting the fire and running the room.
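
A severity scheme is easier to enforce when it lives in code or config rather than in a wiki page. The sketch below encodes the acknowledgement and update targets from the scheme above and checks whether an acknowledgement was late. The `SeverityPolicy` class and `ack_breached` function are hypothetical names for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    ack_within: Optional[timedelta]    # None = no paging deadline
    update_every: Optional[timedelta]  # None = no fixed update cadence


POLICIES = {
    "SEV1": SeverityPolicy(timedelta(minutes=5), timedelta(minutes=15)),
    "SEV2": SeverityPolicy(timedelta(minutes=15), timedelta(minutes=30)),
    "SEV3": SeverityPolicy(None, None),  # handled during business hours
    "SEV4": SeverityPolicy(None, None),
}


def ack_breached(severity, paged_at, acked_at):
    """True if the responder acknowledged later than the policy allows."""
    policy = POLICIES[severity]
    if policy.ack_within is None:
        return False
    return acked_at - paged_at > policy.ack_within


paged = datetime(2025, 6, 1, 2, 18)
acked = paged + timedelta(minutes=6)
print(ack_breached("SEV1", paged, acked))  # True: 6 min exceeds the 5 min target
print(ack_breached("SEV2", paged, acked))  # False: within 15 min
```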

13.7 Breach notification duties — the clocks start at "awareness"

A breach is not only a technical event; it triggers legal deadlines. Crucially, most clocks start when you become aware (reasonably certain a breach occurred), not when the investigation is finished.

| Regime | Deadline | Notes |
|--------|----------|-------|
| GDPR Art. 33 (EU) | 72 hours to the supervisory authority | Phased reporting allowed; Art. 34 = notify individuals if high risk; fines up to €10M or 2% global turnover |
| SEC Form 8-K Item 1.05 (US public cos.) | 4 business days after deciding it is "material" | Effective Dec 2023 |
| US state and sector laws | Varies (e.g., CA ~30 days; HIPAA, a federal rule, 60 days for health data) | All 50 states + DC have laws; Oklahoma's overhaul (SB 626) effective Jan 1, 2026 |

A single breach can start dozens of overlapping clocks at once. Map your obligations before an incident — legal counsel scrambling on day one is too late.
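
Mapping the clocks ahead of time can start with a small deadline calculator. This sketch covers only the two fixed deadlines listed above, and it makes a loud simplifying assumption: business days are Monday to Friday with no holiday calendar. The dates are invented, and none of this is legal advice; counsel decides what "awareness" and "material" mean for a given incident.

```python
from datetime import date, datetime, timedelta, timezone


def gdpr_deadline(aware_at):
    """GDPR Art. 33: 72 hours from the moment of awareness."""
    return aware_at + timedelta(hours=72)


def add_business_days(start, days):
    """Count forward Mon-Fri days. Assumption: public holidays are ignored."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 = Monday-Friday
            remaining -= 1
    return current


def sec_8k_deadline(materiality_decided_on):
    """SEC Item 1.05: four business days after the materiality decision."""
    return add_business_days(materiality_decided_on, 4)


aware = datetime(2025, 6, 2, 2, 18, tzinfo=timezone.utc)
print(gdpr_deadline(aware))               # 2025-06-05 02:18:00+00:00
print(sec_8k_deadline(date(2025, 6, 5)))  # 2025-06-11 (Thursday -> next Wednesday)
```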

13.8 Forensics basics: volatility and chain of custody

If you ever need evidence, whether for legal action, insurance, or just understanding what happened, handle it correctly. Order of volatility (RFC 3227) says collect the most fragile data first, because it disappears fastest: CPU registers/cache → routing and ARP tables, process table, and RAM (live memory, gone the instant power is cut) → temporary file systems → disk → remote logs and monitoring data → physical configuration and network topology → archival media. Chain of custody is an unbroken, documented record of who collected, handled, stored, and transferred each piece of evidence, and when. It is required for the evidence to be admissible in court. Use write-blockers, compute a SHA-256 hash at acquisition and verify it later, and always work on copies, never the original.
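
The hash-then-verify step looks like this in practice. The sketch hashes an evidence file in chunks (disk and memory images are too large to read at once) and builds a simple custody record. The record's field names and the `custody_entry` helper are assumptions for illustration; real cases use whatever form your legal team or forensic tool prescribes.

```python
import hashlib
from datetime import datetime, timezone


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in 1 MiB chunks so large images fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def custody_entry(path, handler, action):
    """One line of a chain-of-custody log: what, who, when, and the hash."""
    return {
        "evidence": path,
        "sha256": sha256_of(path),
        "handler": handler,
        "action": action,
        "at": datetime.now(timezone.utc).isoformat(),
    }


def verify(path, recorded_hash):
    """Re-hash the file and confirm it still matches the acquisition hash."""
    return sha256_of(path) == recorded_hash
```

A typical flow is to call `custody_entry(image_path, "analyst name", "acquired")` at collection time, store the record somewhere tamper-resistant, and call `verify` on the working copy before each analysis session.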

Common mistake: Pulling the power plug on a compromised machine to "stop the attack." That instantly destroys RAM — which often holds the encryption keys, running malware, and network connections you most need. Image memory first, then disconnect from the network.

13.9 Practice the plan: tabletops and blameless postmortems

Tabletop exercises are discussion-based dry runs of a scenario with the whole team — no live systems touched. Someone reads out a scenario ("our SIEM just flagged a bulk customer export at 2am") and the team talks through exactly what they would do, exposing gaps in the plan before a real attacker finds them. CISA publishes free tabletop packages; run them at least annually and after major changes.

Blameless postmortems (popularized by Google SRE, borrowed from aviation and medicine) assume everyone acted in good faith with the information they had. They focus on systemic causes — missing tooling, ambiguous runbooks, untested code, unclear ownership — never on punishing an individual. The reason is practical: a blame culture makes people hide incidents, which destroys your detection. The output is a timeline, the impact, contributing causes, and tracked action items with named owners.

13.10 A worked incident: the Qantas breach (2025)

Example: In 2025, attackers compromised a weakly defended third-party customer-service platform Qantas relied on — supply-chain initial access — and used it to reach 5.7 million customer records. Unusual activity was detected on June 30, 2025 and contained "within hours," but forensic investigation took weeks to reveal the full scope. In October 2025, after Qantas refused to pay, the extortion group published the stolen data on the dark web. Teaching beats: detection was fast, but the way in was a trusted vendor — a vendor's risk is your risk. Containment is not resolution. Refusing ransom does not stop a leak. And every notification clock (GDPR's 72 hours, US state laws) started at the moment of awareness.

Here is a clean hypothetical that ties the whole section together:

  02:14  SIEM fires: impossible-travel login + bulk export
  02:18  On-call paged -> declares SEV1, names Incident Commander
  02:25  EDR ISOLATES host (containment)
         ...but first captures a RAM image (forensics)
  02:40  Identity team rotates leaked credential;
         hunts for backdoors/persistence (eradication)
  04:00  Restore from known-good backup; monitor (recovery)
  04:30  Legal confirms GDPR 72h clock (running since awareness
         at ~02:18); checks SEC materiality
  next day  Blameless postmortem -> root cause:
            a CI/CD token was logged in plaintext  (-> 13.2!)

Notice the postmortem's root cause loops straight back to the start of this section: a secret written into a log file. Good logging hygiene, fast detection, a rehearsed response, and an honest postmortem are one continuous loop.

Common mistakes

  • Alert fatigue — drowning in low-value alerts so the real one is missed. This is the A09 failure mode. Tune relentlessly.
  • Logging secrets, tokens, or PII (CWE-532) and not synchronizing log timestamps across hosts.
  • Pulling power before capturing memory — destroying volatile evidence forever.
  • Treating containment as "done" and skipping eradication, leaving the attacker's backdoor in place.
  • Having no IR plan or tabletop until the real incident hits.
  • Blameful postmortems that make people hide future incidents.
  • Missing the notification deadline because "we weren't 100% sure yet" — the clock starts at awareness.

Best practices

Best practice: Forward logs off-host to a SIEM in append-only form, scrub secrets/PII at the source, write a few well-tested playbooks for your top scenarios, run a tabletop at least yearly, name an Incident Commander separate from responders, image RAM before disk, map your breach-notification clocks ahead of time, and close every postmortem with tracked, owned action items.
Key takeaway: Assume breach. Prevention buys you time; detection and response decide your fate. Log the right things (and never secrets), centralize and correlate them, detect both known IOCs and behavioral anomalies, follow a rehearsed lifecycle (SANS PICERL / NIST CSF), capture forensics before you destroy them, respect the legal clocks that start at awareness, and learn from every incident without blame. Shrinking dwell time is the single highest-leverage security investment a team can make.
