Privacy Engineering Fundamentals
Most engineers learn security first and assume privacy is "more of the same." It is not. Security protects data from people who should not have it (attackers, outsiders). Privacy governs what people who are allowed to touch the data may actually do with it. You can have flawless security and terrible privacy at the same time: the database is encrypted, access is locked down, MFA is everywhere — and then the company quietly sells that data, keeps it forever, or repurposes it for something the user never agreed to. No attacker, no breach, still a privacy failure. Privacy engineering is the discipline of building privacy requirements — collect less, use only for the stated reason, honour user rights — directly into the system's architecture, rather than writing a policy PDF and hoping for the best.
10.1 The data taxonomy (get these terms right)
Students constantly confuse these four terms. They have different definitions and different legal weight.
- PII (Personally Identifiable Information; US term): information that identifies a person. NIST splits it into linked data (directly identifying: name, Social Security Number, email, phone) and linkable data (identifying only when combined with other data).
- Personal data (GDPR Art.4(1); much broader): ANY information relating to an identified or identifiable person. This deliberately includes IP addresses, cookie IDs, device IDs, location, and pseudonyms, which is far wider than the US idea of "PII." (GDPR = General Data Protection Regulation, the EU's privacy law.)
- Sensitive data / special categories (GDPR Art.9): race/ethnicity, political opinions, religion, trade-union membership, genetic data, biometrics used for identification, health, and sex life/orientation. Processing these is prohibited by default unless a narrow exception applies. US sectoral analogues: HIPAA (health), GLBA (financial), COPPA (children under 13), and CCPA/CPRA "sensitive personal information" (SSN, precise geolocation, biometrics, message contents).
- Quasi-identifiers: fields that are not unique on their own but become uniquely identifying in combination. The classic example is ZIP code, birth date, and sex, which Latanya Sweeney estimated uniquely identify roughly 87% of the US population.
10.2 The four foundational principles
These come from GDPR Art.5 and are mirrored in the NIST Privacy Framework and CCPA/CPRA. They are the levers a builder actually pulls.
- Data minimization — collect only what the stated purpose needs. This is the single most powerful engineering control: you cannot leak, mishandle, or be forced to disclose data you never stored.
- Purpose limitation — data collected for purpose A may not be silently reused for purpose B. Engineering pattern: tag every data element with its purpose at ingestion, then enforce that purpose at query time.
- Storage limitation / retention — keep data only as long as needed, then delete on a schedule (a TTL, "time to live"). "Keep forever by default" is a violation.
- Accountability — you must be able to prove you do the above (records, logs, impact assessments). Art.5 also lists lawfulness/fairness/transparency, accuracy, and integrity/confidentiality.
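The purpose-limitation and retention patterns above can be sketched in a few lines. This is a minimal illustration, not a reference design: `TaggedValue`, `PurposeError`, `read`, and `purge_expired` are hypothetical names, and the sketch assumes every stored element carries its purposes and TTL from the moment of ingestion.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TaggedValue:
    """A stored value carrying the purposes and retention it was collected under."""
    value: str
    purposes: frozenset[str]   # purposes declared at ingestion (hypothetical labels)
    collected_at: datetime
    ttl: timedelta             # retention period for this element

    def expired(self, now: datetime) -> bool:
        return now >= self.collected_at + self.ttl


class PurposeError(PermissionError):
    """Raised when data is requested for a purpose it was not collected for."""


def read(item: TaggedValue, purpose: str, now: datetime) -> str:
    """Release the value only for a declared purpose and within retention."""
    if item.expired(now):
        raise LookupError("retention period elapsed; value should be purged")
    if purpose not in item.purposes:
        raise PurposeError(f"not collected for purpose {purpose!r}")
    return item.value


def purge_expired(store: dict[str, TaggedValue], now: datetime) -> list[str]:
    """Delete every expired element and return the keys that were removed."""
    dead = [key for key, item in store.items() if item.expired(now)]
    for key in dead:
        del store[key]
    return dead


t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
email = TaggedValue("a@example.com", frozenset({"order_receipts"}), t0, timedelta(days=30))
```

Here `read(email, "order_receipts", ...)` succeeds inside the 30-day window, while `read(email, "marketing", ...)` raises `PurposeError`. In a real system the same check would sit in the query layer, and `purge_expired` would run on a schedule.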
10.3 Privacy by Design — and by Default
The seven principles of Privacy by Design (PbD) were created by Dr. Ann Cavoukian (then Ontario's Information and Privacy Commissioner) in the 1990s, endorsed by a resolution of the International Conference of Data Protection and Privacy Commissioners in 2010, and later encoded into GDPR Art.25 as "Data Protection by Design and by Default."
- Proactive, not reactive; preventative, not remedial.
- Privacy as the default setting — zero action required by the user to be protected; the strictest setting ships on by default (opt-IN to share, not opt-out).
- Privacy embedded into the design, not bolted on later.
- Full functionality — positive-sum, not zero-sum (reject the false "privacy vs. features" trade-off).
- End-to-end security across the full data lifecycle (no privacy without security).
- Visibility and transparency — keep it open and auditable.
- Respect for user privacy — strong defaults, clear notice, easy controls.
PbD was criticised as vague and aspirational — which is exactly why GDPR Art.25 turned "by design AND by default" into a legal obligation, and why the NIST Privacy Framework 1.1 (public draft CSWP 40, released 14 Apr 2025, aligned to Cybersecurity Framework 2.0) operationalises it into auditable functions: Identify-P, Govern-P, Control-P, Communicate-P, Protect-P.
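One way to make "privacy as the default setting" concrete is to encode it in the type that holds user settings, so a new account is protected without anyone doing anything. The sketch below is illustrative only; `SharingSettings`, its fields, and `opt_in` are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SharingSettings:
    """Hypothetical per-user settings; every field defaults to the most private value."""
    profile_public: bool = False
    share_with_partners: bool = False
    personalized_ads: bool = False
    location_history: bool = False


def opt_in(settings: SharingSettings, **choices: bool) -> SharingSettings:
    """Return new settings reflecting an explicit user choice; nothing flips implicitly."""
    return replace(settings, **choices)


new_user = SharingSettings()                      # user took no action: nothing is shared
after_choice = opt_in(new_user, personalized_ads=True)
```

Because the object is immutable and the defaults are all `False`, sharing can only be switched on through an explicit call, which also gives a natural place to record the choice.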
10.4 Anonymization vs pseudonymization vs de-identification
This is the most-tested and most-misunderstood distinction in the whole field.
| Technique | What it does | Reversible? | Still "personal data"? |
|---|---|---|---|
| Pseudonymization (GDPR 4(5)) | Replace identifiers with a token; keep the mapping key stored separately and secured | Yes (with the key) | Yes — still fully in GDPR scope; you only get reduced obligations |
| Anonymization | Irreversibly strip identifiability so the person can never be re-identified by any reasonably likely means | No | No — out of GDPR scope entirely (but genuinely achieving this is very hard) |
| De-identification (US/HIPAA umbrella) | Remove identifiers via HIPAA Safe Harbor (strip 18 listed identifiers) or Expert Determination | Often yes | Roughly equivalent to EU pseudonymized, NOT anonymous |
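
A minimal sketch of the pseudonymization row, assuming a keyed hash (HMAC-SHA256) as the tokenization scheme. This is one common choice, not the only one; the `pseudonymize` helper and the sample records are hypothetical.

```python
import hashlib
import hmac
import secrets


def pseudonymize(identifier: str, key: bytes) -> str:
    """Map an identifier to a stable token using HMAC-SHA256 under a secret key."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Assumption: the key is held apart from the pseudonymized dataset (e.g. in a key vault).
key = secrets.token_bytes(32)

records = [{"email": "ana@example.com", "plan": "pro"},
           {"email": "bo@example.com", "plan": "free"}]
pseudonymized = [{"user_token": pseudonymize(r["email"], key), "plan": r["plan"]}
                 for r in records]
```

The output is still personal data: anyone holding the key can recompute the token for a known email and re-link the record, which is exactly why the table keeps pseudonymized data inside GDPR scope.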
10.5 Re-identification risk — the cautionary tales
"We removed the names, so it's anonymous" is the most expensive sentence in privacy. Three real incidents prove why.
- AOL search leak (2006): AOL published ~20M search queries from ~658,000 users, replacing names with numbers. NYT reporters re-identified user #4417749 as Thelma Arnold, a 62-year-old in Lilburn, Georgia, from queries like "landscapers in Lilburn, Ga." The CTO resigned; two staff were fired. Behavioral history is itself a fingerprint.
- Netflix Prize (2006–08): Netflix released 100M "anonymized" ratings. Researchers Narayanan & Shmatikov cross-referenced public IMDb ratings and re-identified users: knowing as few as 8 of a subscriber's ratings and their approximate dates was enough to uniquely identify 99% of records. Auxiliary (background) data defeats anonymization, and sparse high-dimensional data is uniquely identifying (the "curse of dimensionality").
- Strava heatmap (2018, recurring through 2025): Aggregated "anonymized" fitness maps revealed secret military base layouts (soldiers jogging the perimeter); a 2023 study pinpointed individuals' home addresses. Even aggregated data leaks when activity is sparse and distinctive.
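The mechanism shared by these incidents is a linkage attack: join the "anonymized" release against auxiliary data on the fields both have in common. The toy sketch below uses entirely made-up records and a hypothetical `link` helper to show how little it takes.

```python
released = [  # made-up "anonymized" release: names removed, quasi-identifiers kept
    {"zip": "30047", "birth_year": 1944, "sex": "F", "diagnosis": "asthma"},
    {"zip": "30047", "birth_year": 1990, "sex": "M", "diagnosis": "flu"},
    {"zip": "10001", "birth_year": 1990, "sex": "M", "diagnosis": "diabetes"},
]
public = [  # made-up auxiliary data the attacker already holds
    {"name": "Pat Example", "zip": "30047", "birth_year": 1944, "sex": "F"},
    {"name": "Sam Sample", "zip": "10001", "birth_year": 1990, "sex": "M"},
]
QUASI = ("zip", "birth_year", "sex")


def link(released, public, quasi=QUASI):
    """Return {name: released_record} for each auxiliary row matching exactly one released row."""
    matches = {}
    for person in public:
        key = tuple(person[q] for q in quasi)
        hits = [r for r in released if tuple(r[q] for q in quasi) == key]
        if len(hits) == 1:          # a unique match is a re-identification
            matches[person["name"]] = hits[0]
    return matches


reidentified = link(released, public)
```

No names were in the release, yet both people in the auxiliary list end up attached to a diagnosis.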
From syntactic tricks to mathematical guarantees
k-anonymity (Sweeney): a dataset is k-anonymous if every record looks identical to at least k−1 others on its quasi-identifiers, achieved by generalization (age 34 → "30–39") or suppression. Extensions l-diversity and t-closeness patch its holes, but all three are syntactic and still fall to background-knowledge attacks.
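As a rough sketch of the definition above, the following measures k for a table and shows how generalizing age into ten-year bands raises it. The helper names (`generalize_age`, `k_of`) and the sample rows are hypothetical.

```python
from collections import Counter


def generalize_age(age: int) -> str:
    """Generalize an exact age into a ten-year band, e.g. 34 -> '30-39'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def k_of(rows: list[dict], quasi_identifiers: tuple[str, ...]) -> int:
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


raw = [{"age": 34, "zip": "30047"}, {"age": 37, "zip": "30047"},
       {"age": 31, "zip": "30047"}, {"age": 52, "zip": "30047"},
       {"age": 58, "zip": "30047"}]
generalized = [{"age": generalize_age(r["age"]), "zip": r["zip"]} for r in raw]

k_before = k_of(raw, ("age", "zip"))          # every exact age is unique
k_after = k_of(generalized, ("age", "zip"))   # smallest band now holds two people
```

A dataset is k-anonymous for any k up to the value `k_of` returns. The check says nothing about what an attacker already knows, which is the weakness the paragraph above points out.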
Differential privacy (DP) is the rigorous gold standard. You add carefully calibrated random noise so that whether any single person is in the dataset barely changes the output, bounded by a parameter epsilon (the "privacy budget": lower epsilon = more noise = more privacy, less utility). Its killer feature: the guarantee holds regardless of any auxiliary data an attacker has. One 2025 study reported that DP at epsilon=1.0 cut re-identification risk below 0.1% with negligible utility loss, though results of this kind depend on the dataset and the queries involved.
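To make the role of epsilon concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query. Assumptions: adding or removing one person changes the count by at most 1 (sensitivity 1), and the function names are illustrative. Production DP libraries also handle floating-point subtleties that this sketch ignores.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_count(values: list[bool], epsilon: float, rng: random.Random) -> float:
    """Release a noisy count. A count has sensitivity 1, so the noise scale is 1/epsilon."""
    sensitivity = 1.0
    return sum(values) + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)                       # seeded only to make the demo repeatable
has_condition = [True] * 100 + [False] * 50  # made-up data; true count is 100
noisy_strict = dp_count(has_condition, epsilon=0.1, rng=rng)  # more noise, more privacy
noisy_loose = dp_count(has_condition, epsilon=1.0, rng=rng)   # less noise, less privacy
```

The expected absolute error equals the noise scale, so epsilon=0.1 gives answers that are off by about 10 on average, and epsilon=1.0 by about 1.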

| | Central DP (trusted curator) | Local DP (no trusted curator) |
|---|---|---|
| Data flow | user → raw data → curator adds noise → output | user → noise added on device → server → aggregate |
| Trust required | The curator sees raw data and must be trusted | None: the server never sees raw data |
| Noise for the same epsilon | Less | More |
| Example | US Census 2020 | Apple iOS, Google RAPPOR |

Real deployments: the US Census Bureau (2020 Census, "TopDown" algorithm; the first decennial census protected with DP, and controversial for distorting small-area counts), Apple (iOS 10, 2016, local DP for typing/emoji telemetry), and Google RAPPOR (2014, Chrome; first internet-scale local DP).
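Local DP is easiest to see through classic randomized response, the building block that RAPPOR extends. The sketch below is that textbook mechanism for a single yes/no value, not the specific algorithms Apple or Google deployed; the function names are illustrative.

```python
import math
import random


def randomize(bit: bool, epsilon: float, rng: random.Random) -> bool:
    """On-device step: keep the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_truth = math.exp(epsilon) / (1 + math.exp(epsilon))
    return bit if rng.random() < p_truth else not bit


def estimate_rate(reports: list[bool], epsilon: float) -> float:
    """Server step: debias the noisy reports to estimate the true fraction of True values."""
    p_truth = math.exp(epsilon) / (1 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth)) / (2 * p_truth - 1)


rng = random.Random(1)                              # seeded only for repeatability
true_bits = [i < 30_000 for i in range(100_000)]    # made-up population, 30% True
reports = [randomize(b, epsilon=1.0, rng=rng) for b in true_bits]
estimate = estimate_rate(reports, epsilon=1.0)
```

The server only ever receives `reports`, each of which is plausibly deniable, yet the debiased estimate lands close to the true 30%. The price is variance: far more users are needed than under central DP for the same accuracy.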
10.6 Consent and lawful basis
Under GDPR Art.6 you must have at least one of exactly six lawful bases to process personal data: (a) consent, (b) contract, (c) legal obligation, (d) vital interests, (e) public task, (f) legitimate interests.
Engineering implication: consent is state. It must be recorded (who, what, when, which version of the notice — a "consent receipt"), enforced at processing time by a consent-checking gate, and honoured on withdrawal. A checkbox on a signup form is not consent management.
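"Consent is state" can be sketched as an append-only log plus a gate that is consulted at processing time. This is a simplified illustration: `ConsentEvent`, `ConsentLedger`, and `send_newsletter` are hypothetical names, and the sketch assumes events are appended in chronological order.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ConsentEvent:
    """One entry in an append-only consent log (a 'consent receipt')."""
    subject_id: str
    purpose: str
    notice_version: str
    granted: bool        # False records a withdrawal
    at: datetime


class ConsentLedger:
    def __init__(self) -> None:
        self._events: list[ConsentEvent] = []

    def record(self, event: ConsentEvent) -> None:
        self._events.append(event)   # never overwrite: the history is the evidence

    def allows(self, subject_id: str, purpose: str) -> bool:
        """The most recent event for this subject and purpose decides; default is no consent."""
        for event in reversed(self._events):
            if event.subject_id == subject_id and event.purpose == purpose:
                return event.granted
        return False


def send_newsletter(ledger: ConsentLedger, subject_id: str) -> bool:
    """Processing gate: check consent at the moment of use, not at signup."""
    if not ledger.allows(subject_id, "newsletter"):
        return False
    # ... actual sending would happen here ...
    return True


ledger = ConsentLedger()
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
ledger.record(ConsentEvent("u1", "newsletter", "notice-v3", True, now))
```

Withdrawal is just another event with `granted=False`, so the gate starts refusing immediately while the earlier grant remains on record for accountability.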
10.7 Data lifecycle, mapping, and subject rights
You cannot protect, delete, or report on data you cannot find. Maintain a Record of Processing Activities (RoPA, mandatory under GDPR Art.30) plus a living data inventory/map: for every data element — what it is, where it lives (every DB, cache, log, backup, third-party processor, analytics pipeline), why it was collected, lawful basis, who it is shared with, and retention period. Lifecycle: collect → store → use → share → retain → destroy, with controls at each stage. High-risk processing also requires a DPIA (Data Protection Impact Assessment).
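A data map is most useful when it is machine-readable, because any job that must touch all of a person's data can then be driven from it. The sketch below is one possible shape, with hypothetical names (`DataMapEntry`, `stores_to_query`, `fan_out`) and made-up store names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataMapEntry:
    element: str                   # what it is
    locations: tuple[str, ...]     # every store that holds a copy
    purpose: str                   # why it was collected
    lawful_basis: str              # one of the Art.6 bases
    shared_with: tuple[str, ...]   # processors and other recipients
    retention_days: int


DATA_MAP = [
    DataMapEntry("email", ("users_db", "crm_vendor", "app_logs"), "account login",
                 "contract", ("crm_vendor",), 730),
    DataMapEntry("ip_address", ("app_logs", "cdn_logs"), "abuse prevention",
                 "legitimate interests", (), 90),
]


def stores_to_query(data_map: list[DataMapEntry]) -> list[str]:
    """Every distinct location a subject-wide job must visit, in a stable order."""
    return sorted({loc for entry in data_map for loc in entry.locations})


def fan_out(subject_id: str, data_map: list[DataMapEntry], fetchers: dict) -> dict:
    """Call one fetcher per store; fail loudly if the map names a store with no fetcher."""
    results = {}
    for store in stores_to_query(data_map):
        if store not in fetchers:
            raise KeyError(f"no fetcher registered for store {store!r}")
        results[store] = fetchers[store](subject_id)
    return results
```

Failing on a missing fetcher is deliberate: a store that appears in the map but cannot be queried is a gap in coverage, and silently skipping it would make the result look complete when it is not.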
Data Subject Rights (GDPR Arts.15–22; CCPA/CPRA parallels) must be built as first-class infrastructure, not manual ops:
- Access / DSAR (Art.15): confirm what you hold and give a copy plus purposes, categories, recipients, retention. Respond within one month (extendable by two further months for complex or numerous requests). Build a "collect everything for subject X" pipeline that fans out across every store in the data map.
- Erasure / right to be forgotten (Art.17): the hardest to engineer. A soft-delete flag is not erasure; regulators reject it because the data is still there and an admin can still read it. You need a hard purge across primary stores, indexes, caches, and derived data. For immutable backups, use crypto-shredding: encrypt each subject's data with a per-subject key, then destroy the key. Without the key, the ciphertext is computationally indistinguishable from random noise, even inside untouched backups. This is widely used as a practical route to erasure, but confirm it against current guidance from your regulator rather than assuming blanket acceptance. Pair it with a replayable deletion ledger so any restored backup re-applies the deletion.
- Portability (Art.20): hand back the data the subject gave you in a structured, machine-readable format (JSON/CSV) for reuse or switching providers — distinct from access.
- Plus rectification (16), restriction (18), objection (21), and rights over automated decision-making (22).
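The crypto-shredding idea from the erasure item can be sketched as follows. Loud assumption: to stay runnable on the standard library alone, the cipher here is a toy keystream built from SHAKE-256, standing in for a vetted authenticated cipher such as AES-GCM. It must not be used to protect real data. `SubjectVault` and its methods are hypothetical names.

```python
import hashlib
import secrets


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher: XOR data with SHAKE-256(key || nonce). Stand-in for a real AEAD."""
    stream = hashlib.shake_256(key + nonce).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


class SubjectVault:
    """Per-subject keys live in a small mutable key store; ciphertext may sit in backups."""

    def __init__(self) -> None:
        self._keys: dict[str, bytes] = {}

    def encrypt(self, subject_id: str, plaintext: bytes) -> bytes:
        key = self._keys.setdefault(subject_id, secrets.token_bytes(32))
        nonce = secrets.token_bytes(16)          # fresh nonce per record
        return nonce + _keystream_xor(key, nonce, plaintext)

    def decrypt(self, subject_id: str, blob: bytes) -> bytes:
        key = self._keys[subject_id]             # KeyError once the subject is shredded
        return _keystream_xor(key, blob[:16], blob[16:])

    def shred(self, subject_id: str) -> None:
        """Erase the subject by destroying the only copy of their key."""
        self._keys.pop(subject_id, None)


vault = SubjectVault()
blob = vault.encrypt("u1", b"12 Example Street")
backup_copy = bytes(blob)                        # pretend this went to an immutable backup
```

After `vault.shred("u1")`, `backup_copy` still exists byte for byte but can no longer be decrypted. The scheme only works if the key store itself is kept out of the immutable backups, or its own backups are purged on the same schedule.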
10.8 Privacy-Enhancing Technologies (PETs)
| PET | What it lets you do | 2025 caveat |
|---|---|---|
| Differential privacy | Publish stats/telemetry with a provable per-person guarantee | Tune epsilon; some utility loss |
| Fully homomorphic encryption (FHE) | Compute on encrypted data without decrypting it | Can be millions of times slower; still niche |
| Secure multi-party computation (SMPC) | Several parties jointly compute a result without revealing their inputs (e.g. banks detecting fraud without sharing customer lists) | Network/coordination overhead |
| Federated learning | Train an ML model where raw data stays on each device; only model updates leave (can be DP-noised) | Updates can still leak; pair with DP |
| Synthetic data | Generate artificial records mirroring real statistics, no 1:1 link to real people | Must verify it doesn't memorize originals |
Also in the toolkit: trusted execution environments (TEEs/enclaves), tokenization, and zero-knowledge proofs. For risk cataloguing, the OWASP Top 10 Privacy Risks is the privacy counterpart to the security Top 10.
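To give a feel for the SMPC row, here is a minimal sketch of a secure sum using additive secret sharing. It assumes honest-but-curious parties and inputs whose total is smaller than the public modulus. Real protocols add authentication and protection against malicious parties. All names are illustrative.

```python
import secrets

MODULUS = 2**61 - 1   # public prime; all arithmetic is done modulo this value


def share(secret: int, n_parties: int) -> list[int]:
    """Split a secret into n random-looking shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares


def secure_sum(inputs: list[int]) -> int:
    """Simulate the protocol: each party shares its input, each party adds what it holds."""
    n = len(inputs)
    all_shares = [share(x, n) for x in inputs]     # row i holds party i's outgoing shares
    # Party j only ever sees column j: one random-looking share from every party.
    partials = [sum(row[j] for row in all_shares) % MODULUS for j in range(n)]
    return sum(partials) % MODULUS                 # only the partial sums are published


fraud_counts = [120, 45, 300]                      # made-up private inputs of three banks
total = secure_sum(fraud_counts)
```

Each party learns the total and nothing else, because any single share is uniformly random on its own. The overhead noted in the table shows up here as the n-by-n exchange of shares.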
Common mistakes
- Treating anonymization as a checkbox — just stripping names (AOL and Netflix prove it fails against auxiliary data).
- Equating "de-identified" with "anonymous."
- Using consent as the only/default lawful basis, or treating pre-ticked boxes and opt-out schemes as consent.
- Passing off a soft-delete flag as erasure.
- Collecting "just in case" with no retention TTL.
- Forgetting backups, logs, caches, and third parties when deleting.
- Having no data inventory, so DSARs can never be fully answered.
- Confusing privacy with security — assuming "we encrypt it" means "we're private."
Best practices
- Minimize at the point of collection — the only data you can never mishandle is the data you never took.
- Tag every data element with its purpose and retention at ingestion; enforce both downstream.
- Default to the strictest privacy setting (opt-in, not opt-out).
- Build access/delete/export as automated subject-rights pipelines, not manual heroics.
- Prefer crypto-shredding for erasure that must survive in immutable backups.
- Choose differential privacy over k-anonymity when releasing data externally.
- Run a DPIA for any high-risk processing; keep a living RoPA/data map.