Detection, Monitoring & Incident Response
Everything you have learned so far tries to prevent attacks. This section is about what happens when prevention fails — because eventually it will. A clever phishing email, a leaked password, an unpatched library, a careless vendor: sooner or later something gets through. The mature security mindset, called "assume breach", accepts this reality and changes the goal. Instead of betting everything on a perfect wall, you also invest in detecting an intruder quickly, responding cleanly, and limiting the blast radius (how much damage one compromise can cause). A team that can spot and evict an attacker in hours suffers a minor scare; a team that takes months suffers a catastrophe.
13.1 Why detection pays off: dwell time, MTTD, and MTTR
Three terms define how good a team is at this:
- MTTD (Mean Time To Detect): the average time from when an attacker gets in to when you notice.
- MTTR (Mean Time To Respond/Remediate): the average time from noticing to fully containing and cleaning up.
- Dwell time: how long an attacker is inside your environment. Mandiant's reports measure it from compromise to detection, so it tracks MTTD; some teams count through to eviction, roughly MTTD + MTTR. Either way, it is the single most important number to drive down.
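As a rough illustration of how these averages fall out of incident records, here is a minimal sketch. The `INCIDENTS` data and the function names are hypothetical, and it assumes you can reconstruct when each compromise began, which in practice is often only known after the investigation.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (compromised, detected, contained), all in UTC.
INCIDENTS = [
    (datetime(2025, 3, 1, 8, 0), datetime(2025, 3, 4, 8, 0), datetime(2025, 3, 5, 8, 0)),
    (datetime(2025, 6, 10, 0, 0), datetime(2025, 6, 11, 0, 0), datetime(2025, 6, 11, 12, 0)),
]

def mean_hours(deltas):
    """Average a series of timedeltas and express the result in hours."""
    return mean(d.total_seconds() for d in deltas) / 3600

def summarize(incidents):
    """Compute MTTD, MTTR, and their sum (compromise to containment)."""
    mttd = mean_hours(detected - compromised for compromised, detected, _ in incidents)
    mttr = mean_hours(contained - detected for _, detected, contained in incidents)
    return {"mttd_hours": mttd, "mttr_hours": mttr, "total_hours": mttd + mttr}

print(summarize(INCIDENTS))
```

With the sample data, detection took 72 and 24 hours and cleanup took 24 and 12 hours, so the sketch reports an MTTD of 48 hours and an MTTR of 18 hours.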
The 2025–2026 numbers explain why this matters. Mandiant's M-Trends 2026 reports a global median dwell time of 14 days (up from 11 — attackers are evading detection better). IBM's Cost of a Data Breach 2025 found organizations averaged roughly 158–181 days just to identify a breach plus about 60 more to contain it — a lifecycle near 241 days. The financial sting: the global average breach cost is $4.44M, the US average is $10.22M (highest in the world), and healthcare leads sectors at $7.42M. Notably, "shadow AI" (employees using unsanctioned AI tools) added about $670K per breach, while organizations using AI and automation extensively cut their breach lifecycle by ~80 days and saved ~$1.9M. The lesson: every day shaved off dwell time is money saved.
13.2 Logging: what to log, and what never to log
You cannot detect or investigate what you did not record. Logging means writing down security-relevant events as they happen. The OWASP Logging Cheat Sheet and OWASP Top 10:2025 category A09 spell out what to capture. Every log line should answer five questions:
- WHO — the user or source (account, IP address).
- WHAT — the action taken.
- WHEN — a synchronized timestamp in UTC (so logs from different servers line up).
- WHERE — the host or service it happened on.
- OUTCOME — success or failure.
Log the security-meaningful events: login success and failure, authorization failures (someone tried to access what they shouldn't), session start/end, input-validation failures, account and privilege changes, admin actions, data exports, and configuration changes. Logs must be tamper-resistant — append-only and forwarded off the machine immediately, so an attacker who compromises a server cannot quietly erase their tracks. Retain them long enough for compliance (often a year or more) and store them where they can be correlated together.
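One way to make sure every event answers the five questions is to build log lines through a single helper. The sketch below is illustrative only: the `security_event` function and its field names are assumptions, not part of any OWASP specification.

```python
import json
from datetime import datetime, timezone

def security_event(who, what, where, outcome, source_ip=None, when=None):
    """Build one JSON log line answering WHO / WHAT / WHEN / WHERE / OUTCOME."""
    if outcome not in ("success", "failure"):
        raise ValueError("outcome must be 'success' or 'failure'")
    # Default to the current time in UTC so logs from different hosts line up.
    when = when or datetime.now(timezone.utc)
    event = {
        "who": who,
        "source_ip": source_ip,
        "what": what,
        "when": when.astimezone(timezone.utc).isoformat(),
        "where": where,
        "outcome": outcome,
    }
    return json.dumps(event, sort_keys=True)

print(security_event("alice", "login", "auth-01", "failure", source_ip="203.0.113.7"))
```

Emitting one JSON object per line keeps events easy to forward off the host and to correlate later. Tamper resistance still depends on where the lines are shipped, not on this helper.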
Equally important is what you must never log. This ties directly to this project's "no secrets/PII in logs" rule:
- Passwords, session tokens, access tokens, API keys, encryption keys, database connection strings.
- Full payment card numbers (PAN) or CVV, government IDs / SSNs, health data, source code, and sensitive personal data.
These must be removed, masked, hashed, or pseudonymized before writing. Two weakness IDs name the failures: CWE-532 (inserting sensitive data into logs) and CWE-117 (log injection — an attacker plants newline characters in input to forge fake log lines and hide their activity). Always neutralize newlines in logged user input.
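The two weaknesses can be addressed in one pre-write step. This is a minimal sketch under stated assumptions: the `SENSITIVE_KEYS` list and the card-number pattern are illustrative and deliberately crude, and a real deployment would need a broader, reviewed set of rules.

```python
import re

# Hypothetical list of field names whose values must never reach a log.
SENSITIVE_KEYS = {"password", "token", "api_key", "secret", "authorization"}

# Crude assumption: any run of 13 to 19 digits is treated as a card number.
PAN_RE = re.compile(r"\b\d{13,19}\b")

def neutralize(value):
    """Escape CR/LF so user input cannot start a forged log line (CWE-117)."""
    return str(value).replace("\r", "\\r").replace("\n", "\\n")

def mask_pan(text):
    """Keep only the last four digits of anything that looks like a card number."""
    return PAN_RE.sub(lambda m: "*" * (len(m.group()) - 4) + m.group()[-4:], text)

def scrub(fields):
    """Return a copy of `fields` that is safer to write to a log (CWE-532)."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        else:
            clean[key] = mask_pan(neutralize(value))
    return clean

print(scrub({"user": "bob\nfake admin login ok", "password": "hunter2",
             "note": "card 4111111111111111"}))
```

Redacting by field name only catches secrets that arrive in well-named fields. Secrets embedded in free text need separate scanning.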
13.3 The tooling stack — SIEM, EDR, XDR, SOAR, SOC
Beginners constantly confuse these acronyms. Here is the clean separation, with one analogy that holds them together.
| Tool | Full name | What it does | Examples / weakness |
|---|---|---|---|
| SIEM | Security Information & Event Management | Central log aggregation, correlation, long-term storage, compliance & historical search | Splunk, Microsoft Sentinel, Elastic; weakness = alert fatigue, heavy tuning |
| EDR | Endpoint Detection & Response | Deep view of one endpoint (processes, files, behavior); can isolate a host | CrowdStrike Falcon, SentinelOne, Defender for Endpoint |
| XDR | Extended Detection & Response | Correlates endpoint + network + cloud + email + identity into ONE prioritized incident | AI-driven; reduces 10 alerts to 1 story |
| SOAR | Security Orchestration, Automation & Response | Runs automated playbooks (auto-disable account, quarantine host) | Automates the repetitive 3am steps |
| SOC | Security Operations Center | The human team + process running all the above, around the clock | MDR = an outsourced/managed SOC + EDR |
A common maturity ladder: EDR (eyes on endpoints) → SIEM (central view + compliance) → SOAR (automate the repeats) → XDR (unified context). Smaller teams that cannot staff a 24/7 SOC often buy MDR (Managed Detection & Response) — renting the team and tooling.
13.4 Detection types and MITRE ATT&CK
Detections come in two flavors. Signature/rule-based detection looks for known bad things — specific malware hashes, known-bad IP addresses, or patterns called IOCs (Indicators of Compromise). It is precise but blind to anything new. Anomaly/behavioral detection learns what "normal" looks like and flags deviations. A key form is UEBA (User & Entity Behavior Analytics), which catches things like impossible-travel logins (the same account signs in from London and Tokyo ten minutes apart), off-hours mass downloads, or lateral movement between machines. Threat intelligence feeds enrich both by supplying fresh lists of known-bad IPs and file hashes.
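The impossible-travel check reduces to a speed calculation between two logins. The sketch below assumes each login already carries a geolocated latitude and longitude (in practice derived from the source IP, which is imprecise) and uses an assumed threshold of 1000 km/h. Names such as `is_impossible_travel` are hypothetical.

```python
from datetime import datetime, timezone
from math import radians, sin, cos, asin, sqrt

MAX_PLAUSIBLE_KMH = 1000  # assumed threshold, roughly airliner speed

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on Earth, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def is_impossible_travel(login_a, login_b, max_kmh=MAX_PLAUSIBLE_KMH):
    """Each login is (timestamp, latitude, longitude) for the same account."""
    (t1, lat1, lon1), (t2, lat2, lon2) = sorted([login_a, login_b])
    hours = (t2 - t1).total_seconds() / 3600
    distance = haversine_km(lat1, lon1, lat2, lon2)
    if hours == 0:
        return distance > 0  # two places at the same instant
    return distance / hours > max_kmh

london = (datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc), 51.5074, -0.1278)
tokyo = (datetime(2025, 1, 1, 9, 10, tzinfo=timezone.utc), 35.6762, 139.6503)
print(is_impossible_travel(london, tokyo))
```

A production rule would also need allowances for VPNs, mobile carriers, and geolocation error, all of which produce false positives for a pure speed test.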
MITRE ATT&CK is a free, globally used knowledge base of real-world attacker behavior. It organizes attacks into tactics (the attacker's goal — the "why") and techniques (the specific method — the "how"), each with a Txxxx ID (for example, T1566 = phishing). The current release is v18.1 (December 2025); the Enterprise matrix has 14 tactics spanning the attack lifecycle: Reconnaissance, Resource Development, Initial Access, Execution, Persistence, Privilege Escalation, Defense Evasion, Credential Access, Discovery, Lateral Movement, Collection, Command & Control, Exfiltration, and Impact — plus 200+ techniques. Teams use it to map their detections ("do we cover T1566?"), find coverage gaps, and speak a common language during incidents.
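Coverage mapping can be as simple as a set difference between the techniques you care about and the techniques your rules are tagged with. In this sketch the rule names and the priority list are hypothetical; the technique IDs are real ATT&CK identifiers (T1566 Phishing, T1053 Scheduled Task/Job, T1110 Brute Force, T1078 Valid Accounts, T1021 Remote Services).

```python
# Hypothetical detection rules, each tagged with the ATT&CK technique IDs it covers.
DETECTIONS = {
    "suspicious-attachment-opened": {"T1566"},
    "new-scheduled-task": {"T1053"},
    "password-spray": {"T1110"},
}

# Techniques this (hypothetical) team has decided it must be able to detect.
PRIORITY_TECHNIQUES = {"T1566", "T1053", "T1110", "T1078", "T1021"}

def coverage_gaps(detections, required):
    """Return the required technique IDs that no detection rule is tagged with."""
    covered = set().union(*detections.values())
    return sorted(required - covered)

print(coverage_gaps(DETECTIONS, PRIORITY_TECHNIQUES))  # ['T1021', 'T1078']
```

A tag only says a rule exists for the technique, not that the rule works, so teams typically pair this kind of map with periodic testing of the detections themselves.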
13.5 The incident response lifecycle — NIST and SANS
An incident response (IR) process gives the team a known sequence to follow under pressure. Learn both common models.
NIST shifted in 2025. The old, widely-taught SP 800-61 Rev 2 (four phases: Preparation → Detection & Analysis → Containment/Eradication/Recovery → Post-Incident Activity) was withdrawn in April 2025. The new SP 800-61 Rev 3 drops the rigid sequence and instead maps IR onto the six functions of the Cybersecurity Framework (CSF) 2.0: Govern, Identify, Protect, Detect, Respond, Recover — treating "Improve" as something continuous, not a one-time post-mortem. The four-phase model still appears everywhere in practice and on exams, so know both.
SANS PICERL is the dominant operational mnemonic — six steps:
- Preparation — tools, plans, and training before anything happens.
- Identification — confirm an incident is real and scope it.
- Containment — isolate first (short-term), then plan a clean rebuild (long-term).
- Eradication — remove the malware, backdoors, and persistence the attacker planted.
- Recovery — restore from known-good backups and watch for reinfection.
- Lessons Learned — improve so it doesn't recur.
| SANS PICERL | NIST (older four-phase) |
|---|---|
| Preparation | Preparation |
| Identification | Detection & Analysis |
| Containment, Eradication, Recovery | Containment / Eradication / Recovery |
| Lessons Learned | Post-Incident Activity |
13.6 Playbooks, on-call, and severity levels
Playbooks are pre-written, step-by-step responses for specific scenarios (ransomware, business email compromise, data exfiltration) so responders execute a tested plan instead of improvising at 3am. An on-call rotation assigns a primary responder, a backup, and an escalation path. Severity levels set the urgency and who gets pulled in. A common scheme:
| Severity | Meaning | Response |
|---|---|---|
| SEV1 | Critical: full outage or data at risk | Page on-call + backup + lead; ack ≤5 min; dedicated incident channel + Commander; stakeholder updates every 15 min; status page; mandatory blameless retro ≤48h |
| SEV2 | Major degradation, workaround exists | Ack ≤15 min; updates every 30 min |
| SEV3/4 | Minor or cosmetic | Handled during business hours |
Always name an Incident Commander — the person who coordinates and decides, kept separate from the hands-on responders doing the technical work, so nobody is both fighting the fire and running the room.
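A severity scheme like the one in the table is easier to enforce when it is encoded as data that paging tools can read. The following is a sketch only: the `SeverityPolicy` type is hypothetical, and the SEV2 paging list is an assumption, since the table does not say who is paged at that level.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass(frozen=True)
class SeverityPolicy:
    ack_within: Optional[timedelta]    # None means "handled during business hours"
    update_every: Optional[timedelta]  # cadence of stakeholder updates
    page: tuple                        # roles to page immediately

POLICIES = {
    "SEV1": SeverityPolicy(timedelta(minutes=5), timedelta(minutes=15),
                           ("on-call", "backup", "lead")),
    # Assumption: SEV2 pages only the primary on-call responder.
    "SEV2": SeverityPolicy(timedelta(minutes=15), timedelta(minutes=30), ("on-call",)),
    "SEV3": SeverityPolicy(None, None, ()),
    "SEV4": SeverityPolicy(None, None, ()),
}

def ack_deadline(severity, declared_at):
    """Return when the page must be acknowledged, or None if there is no paging SLA."""
    policy = POLICIES[severity]
    return None if policy.ack_within is None else declared_at + policy.ack_within

declared = datetime(2025, 1, 1, 2, 18, tzinfo=timezone.utc)
print(ack_deadline("SEV1", declared))
```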
13.7 Breach notification duties — the clocks start at "awareness"
A breach is not only a technical event; it triggers legal deadlines. Crucially, most clocks start when you become aware (reasonably certain a breach occurred), not when the investigation is finished.
| Regime | Deadline | Notes |
|---|---|---|
| GDPR Art. 33 (EU) | 72 hours to the supervisory authority | Phased reporting allowed; Art. 34 = notify individuals if high risk; fines up to €10M or 2% of global annual turnover, whichever is higher |
| SEC Form 8-K Item 1.05 (US public cos.) | 4 business days after deciding it is "material" | Effective Dec 2023 |
| US state and sector laws | Varies (e.g., CA ~30 days; federal HIPAA 60 days for health data) | All 50 states + DC have laws; Oklahoma's overhaul (SB 626) effective Jan 1, 2026 |
A single breach can start dozens of overlapping clocks at once. Map your obligations before an incident — legal counsel scrambling on day one is too late.
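Two of the clocks in the table can be computed mechanically once the trigger time is recorded. This sketch makes simplifying assumptions: business days are taken as Monday to Friday with no public-holiday calendar, and the function names are hypothetical. It is an aid for tracking, not legal advice on when a clock actually started.

```python
from datetime import date, datetime, timedelta, timezone

def gdpr_deadline(aware_at):
    """72 hours after the moment of awareness (GDPR Art. 33)."""
    return aware_at + timedelta(hours=72)

def add_business_days(start, days):
    """Step forward `days` weekdays. Public holidays are NOT handled here."""
    current = start
    while days > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            days -= 1
    return current

def sec_8k_deadline(materiality_decided_on):
    """Four business days after the materiality determination (Form 8-K Item 1.05)."""
    return add_business_days(materiality_decided_on, 4)

aware = datetime(2025, 3, 6, 2, 18, tzinfo=timezone.utc)  # a Thursday
print(gdpr_deadline(aware))               # 2025-03-09 02:18:00+00:00
print(sec_8k_deadline(date(2025, 3, 6)))  # 2025-03-12
```

Note that the GDPR deadline in this example lands on a Sunday: the 72 hours are calendar hours, while the SEC count skips the weekend.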
13.8 Forensics basics: volatility and chain of custody
If you ever need evidence — for legal action, insurance, or just understanding what happened — handle it correctly. Order of volatility (RFC 3227) says collect the most fragile data first, because it disappears fastest: CPU registers/cache → RAM (live memory, gone the instant power is cut) → network connections/ARP tables → disk → logs/archives → physical configuration. Chain of custody is an unbroken, documented record of who collected, handled, stored, and transferred each piece of evidence, and when — required for it to be admissible in court. Use write-blockers, compute a SHA-256 hash at acquisition and verify it later, and always work on copies, never the original.
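The hash-at-acquisition and verify-later steps can be sketched with the standard library. The `custody_entry` record layout here is a hypothetical illustration of what a chain-of-custody log might capture, not a legally vetted format.

```python
import hashlib
from datetime import datetime, timezone

def sha256_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large disk or memory images fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def custody_entry(path, handler, action):
    """Record who did what to which evidence file, with its hash at that moment."""
    return {
        "evidence": str(path),
        "sha256": sha256_file(path),
        "handler": handler,
        "action": action,
        "at_utc": datetime.now(timezone.utc).isoformat(),
    }

def verify(path, recorded_sha256):
    """True if the file still matches the hash recorded at acquisition."""
    return sha256_file(path) == recorded_sha256
```

A typical flow would call `custody_entry(image_path, "analyst name", "acquired")` once at collection, then `verify(copy_path, entry["sha256"])` on each working copy before analysis.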
13.9 Practice the plan: tabletops and blameless postmortems
Tabletop exercises are discussion-based dry runs of a scenario with the whole team — no live systems touched. Someone reads out a scenario ("our SIEM just flagged a bulk customer export at 2am") and the team talks through exactly what they would do, exposing gaps in the plan before a real attacker finds them. CISA publishes free tabletop packages; run them at least annually and after major changes.
Blameless postmortems (popularized by Google SRE, borrowed from aviation and medicine) assume everyone acted in good faith with the information they had. They focus on systemic causes — missing tooling, ambiguous runbooks, untested code, unclear ownership — never on punishing an individual. The reason is practical: a blame culture makes people hide incidents, which destroys your detection. The output is a timeline, the impact, contributing causes, and tracked action items with named owners.
13.10 A worked incident: a hypothetical timeline
Here is a clean hypothetical that ties the whole section together:
- 02:14 SIEM fires: impossible-travel login + bulk export
- 02:18 On-call paged -> declares SEV1, names Incident Commander
- 02:25 EDR isolates the host (containment), but first captures a RAM image (forensics)
- 02:40 Identity team rotates the leaked credential; hunts for backdoors/persistence (eradication)
- 04:00 Restore from known-good backup; monitor (recovery)
- 04:30 Legal records when the GDPR 72h clock started (at awareness, per 13.7); checks SEC materiality
- next day Blameless postmortem -> root cause: a CI/CD token was logged in plaintext (-> 13.2!)
Notice the postmortem's root cause loops straight back to the start of this section: a secret written into a log file. Good logging hygiene, fast detection, a rehearsed response, and an honest postmortem are one continuous loop.
Common mistakes
- Alert fatigue — drowning in low-value alerts so the real one is missed. This is the A09 failure mode. Tune relentlessly.
- Logging secrets, tokens, or PII (CWE-532).
- Not synchronizing log timestamps across hosts, so events from different servers cannot be lined up.
- Pulling power before capturing memory — destroying volatile evidence forever.
- Treating containment as "done" and skipping eradication, leaving the attacker's backdoor in place.
- Having no IR plan or tabletop until the real incident hits.
- Blameful postmortems that make people hide future incidents.
- Missing the notification deadline because "we weren't 100% sure yet" — the clock starts at awareness.