Privacy Engineering Fundamentals

By Pritesh Yadav 12 min read

Most engineers learn security first and assume privacy is "more of the same." It is not. Security protects data from people who should not have it (attackers, outsiders). Privacy governs what people who are allowed to touch the data may actually do with it. You can have flawless security and terrible privacy at the same time: the database is encrypted, access is locked down, MFA is everywhere — and then the company quietly sells that data, keeps it forever, or repurposes it for something the user never agreed to. No attacker, no breach, still a privacy failure. Privacy engineering is the discipline of building privacy requirements — collect less, use only for the stated reason, honour user rights — directly into the system's architecture, rather than writing a policy PDF and hoping for the best.

Analogy: Security is the locks on the building. Privacy is the rule book for which rooms each keyholder may enter and what they may do once inside. A janitor with a master key reading a patient's file broke no lock — that is a privacy violation, not a security breach.
Example: A hospital employee with legitimate database access queries a famous patient's medical record out of curiosity. Every access control worked exactly as designed. Nothing was "hacked." It is still one of the most serious privacy violations there is — and a fireable, sometimes criminal, offence.

10.1 The data taxonomy (get these terms right)

Students constantly confuse these four terms. They have different definitions and different legal weight.

PII (Personally Identifiable Information — US term)
Information that identifies a person. NIST splits it into linked data (directly identifying: name, Social Security Number, email, phone) and linkable data (identifying only when combined with other data).
Personal data (GDPR Art.4(1) — much broader)
ANY information relating to an identified or identifiable person. This deliberately includes IP addresses, cookie IDs, device IDs, location, and pseudonyms — far wider than the US idea of "PII." (GDPR = General Data Protection Regulation, the EU's privacy law.)
Sensitive data / special categories (GDPR Art.9)
Race/ethnicity, political opinions, religion, trade-union membership, genetic data, biometrics used for identification, health, and sex life/orientation. Processing these is prohibited by default unless a narrow exception applies. US sectoral analogues: HIPAA (health), GLBA (financial), COPPA (children under 13), and CCPA/CPRA "sensitive personal information" (SSN, precise geolocation, biometrics, message contents).
Quasi-identifiers
Fields that are not unique on their own but become uniquely identifying in combination.
Example: Latanya Sweeney showed that ZIP code + birthdate + sex uniquely identifies about 87% of the US population. Using this, she re-identified the Massachusetts governor's "anonymized" hospital records and mailed them to his office. None of those three fields is "PII" on its own — together they are a fingerprint.
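To see why quasi-identifiers matter, it can help to measure how many rows in a dataset are singled out by a given combination of fields. The sketch below is illustrative only: the records are made-up toy data and `unique_fraction` is a hypothetical helper, not part of any library.

```python
from collections import Counter

# Hypothetical toy records: no names, only quasi-identifiers plus a sensitive field.
records = [
    {"zip": "02138", "birthdate": "1950-01-01", "sex": "M", "diagnosis": "A"},
    {"zip": "02138", "birthdate": "1962-03-14", "sex": "F", "diagnosis": "B"},
    {"zip": "02139", "birthdate": "1962-03-14", "sex": "F", "diagnosis": "C"},
    {"zip": "02139", "birthdate": "1962-03-14", "sex": "F", "diagnosis": "D"},
]

def unique_fraction(rows, quasi_identifiers):
    """Fraction of rows whose quasi-identifier combination appears exactly once."""
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(rows)

sex_only = unique_fraction(records, ["sex"])
combined = unique_fraction(records, ["zip", "birthdate", "sex"])
print(sex_only, combined)
```

Running the same measurement on a real table before release gives a rough first signal of re-identification exposure: the more fields you combine, the larger the unique fraction tends to become.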

10.2 The four foundational principles

These come from GDPR Art.5 and are mirrored in the NIST Privacy Framework and CCPA/CPRA. They are the levers a builder actually pulls.

  • Data minimization — collect only what the stated purpose needs. This is the single most powerful engineering control: you cannot leak, mishandle, or be forced to disclose data you never stored.
  • Purpose limitation — data collected for purpose A may not be silently reused for purpose B. Engineering pattern: tag every data element with its purpose at ingestion, then enforce that purpose at query time.
  • Storage limitation / retention — keep data only as long as needed, then delete on a schedule (a TTL, "time to live"). "Keep forever by default" is a violation.
  • Accountability: you must be able to prove you do the above (records, logs, impact assessments). The remaining Art.5 principles are lawfulness/fairness/transparency, accuracy, and integrity/confidentiality.
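The purpose-tagging and retention patterns in the list above can be sketched as a small wrapper that travels with each value. This is a minimal illustration under stated assumptions: `TaggedValue`, `PolicyViolation`, `read`, and `expired` are hypothetical names, and a real system would enforce the same checks in the query layer rather than in application code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class TaggedValue:
    value: str
    purposes: frozenset       # purposes declared at ingestion
    collected_at: datetime
    retention: timedelta      # TTL attached at ingestion

class PolicyViolation(PermissionError):
    """Raised when a read breaks purpose limitation or storage limitation."""

def read(item, purpose, now):
    """Return the value only if the purpose was declared and the TTL has not passed."""
    if purpose not in item.purposes:
        raise PolicyViolation(f"purpose {purpose!r} was not declared at ingestion")
    if now >= item.collected_at + item.retention:
        raise PolicyViolation("retention period expired; value is due for deletion")
    return item.value

def expired(items, now):
    """Items that a scheduled purge job should delete."""
    return [i for i in items if now >= i.collected_at + i.retention]

t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
email = TaggedValue("a@example.com", frozenset({"order_receipts"}), t0, timedelta(days=90))

print(read(email, "order_receipts", t0 + timedelta(days=1)))
print(expired([email], t0 + timedelta(days=91)))
```

Attaching the tags at ingestion is the design choice that matters: a read for an undeclared purpose such as "marketing" fails loudly instead of silently succeeding.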

10.3 Privacy by Design — and by Default

The seven principles of Privacy by Design (PbD) were created by Dr. Ann Cavoukian (Ontario's privacy commissioner) in the 1990s, adopted as a global standard in 2010, and later encoded into GDPR Art.25 as "Data Protection by Design and by Default."

  1. Proactive, not reactive; preventative, not remedial.
  2. Privacy as the default setting — zero action required by the user to be protected; the strictest setting ships on by default (opt-IN to share, not opt-out).
  3. Privacy embedded into the design, not bolted on later.
  4. Full functionality — positive-sum, not zero-sum (reject the false "privacy vs. features" trade-off).
  5. End-to-end security across the full data lifecycle (no privacy without security).
  6. Visibility and transparency — keep it open and auditable.
  7. Respect for user privacy — strong defaults, clear notice, easy controls.

PbD was criticised as vague and aspirational — which is exactly why GDPR Art.25 turned "by design AND by default" into a legal obligation, and why the NIST Privacy Framework 1.1 (public draft CSWP 40, released 14 Apr 2025, aligned to Cybersecurity Framework 2.0) operationalises it into auditable functions: Identify-P, Govern-P, Control-P, Communicate-P, Protect-P.

10.4 Anonymization vs pseudonymization vs de-identification

This is the most-tested and most-misunderstood distinction in the whole field.

| Technique | What it does | Reversible? | Still "personal data"? |
| --- | --- | --- | --- |
| Pseudonymization (GDPR Art.4(5)) | Replace identifiers with a token; keep the mapping key stored separately and secured | Yes (with the key) | Yes: still fully in GDPR scope; you only get reduced obligations |
| Anonymization | Irreversibly strip identifiability so the person can never be re-identified by any reasonably likely means | No | No: out of GDPR scope entirely (but genuinely achieving this is very hard) |
| De-identification (US/HIPAA umbrella) | Remove identifiers via HIPAA Safe Harbor (strip 18 listed identifiers) or Expert Determination | Often yes | Roughly equivalent to EU pseudonymized, NOT anonymous |

Common mistake: Treating "de-identified" (US) as "anonymous." They are not the same. De-identified data usually still carries re-identification risk and is closer to pseudonymized. A Feb 2025 German court ruling (Hanover) added nuance: pseudonymized data can be effectively anonymous to a recipient who has no means or motive to re-identify it. Anonymity is contextual, not absolute.
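One common way to implement the pseudonymization row of the table above is a keyed hash (HMAC) over the identifier. The sketch below is an assumption about how you might do it, not a prescribed method: `pseudonymize` is a hypothetical helper, and the key is generated inline only for the demo, whereas in practice it would live in a separate, access-controlled key store.

```python
import hashlib
import hmac
import secrets

def pseudonymize(identifier: str, key: bytes) -> str:
    """Keyed, deterministic token: the same identifier and key give the same token."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()

# Demo only: a real key is stored apart from the pseudonymized dataset.
key = secrets.token_bytes(32)

token_a = pseudonymize("alice@example.com", key)
token_b = pseudonymize("alice@example.com", key)
other_key_token = pseudonymize("alice@example.com", secrets.token_bytes(32))

print(token_a == token_b)          # records for the same person still join
print(token_a == other_key_token)  # a different key yields unrelated tokens
```

Anyone holding the key can recompute the token for a known identifier and re-link the records, which is exactly why the output remains personal data rather than anonymous data.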

10.5 Re-identification risk — the cautionary tales

"We removed the names, so it's anonymous" is the most expensive sentence in privacy. Three real incidents prove why.

  • AOL search leak (2006): AOL published ~20M search queries from ~658,000 users, replacing names with numbers. NYT reporters re-identified user #4417749 as Thelma Arnold, a 62-year-old in Lilburn, Georgia, from queries like "landscapers in Lilburn, Ga." The CTO resigned; two staff were fired. Behavioral history is itself a fingerprint.
  • Netflix Prize (2006–08): Netflix released 100M "anonymized" ratings. Researchers Narayanan & Shmatikov cross-referenced public IMDb ratings to re-identify individual subscribers, and showed that knowing as few as 8 of a subscriber's ratings with approximate dates is enough to single out 99% of records. Auxiliary (background) data defeats anonymization; sparse high-dimensional data is uniquely identifying (the "curse of dimensionality").
  • Strava heatmap (2018, recurring through 2025): Aggregated "anonymized" fitness maps revealed secret military base layouts (soldiers jogging the perimeter); a 2023 study pinpointed individuals' home addresses. Even aggregated data leaks when activity is sparse and distinctive.

From syntactic tricks to mathematical guarantees

k-anonymity (Sweeney): a dataset is k-anonymous if every record looks identical to at least k−1 others on its quasi-identifiers, achieved by generalization (age 34 → "30–39") or suppression. Extensions l-diversity and t-closeness patch its holes, but all three are syntactic and still fall to background-knowledge attacks.
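A minimal sketch of the k-anonymity check and of generalization follows, using made-up rows. The helpers `generalize_age` and `k_of` are hypothetical names, and truncating the ZIP code to three digits is just one possible generalization rule chosen for the example.

```python
from collections import Counter

def generalize_age(age: int) -> str:
    """Generalize an exact age into its ten-year band, e.g. 34 becomes '30-39'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def k_of(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(counts.values())

raw = [
    {"age": 34, "zip": "02138"},
    {"age": 37, "zip": "02139"},
    {"age": 31, "zip": "02138"},
    {"age": 45, "zip": "02141"},
    {"age": 48, "zip": "02142"},
]

# Generalize both quasi-identifiers: age to a band, ZIP to its first three digits.
generalized = [
    {"age": generalize_age(r["age"]), "zip": r["zip"][:3] + "**"}
    for r in raw
]

k_raw = k_of(raw, ["age", "zip"])
k_generalized = k_of(generalized, ["age", "zip"])
print(k_raw, k_generalized)
```

The raw table has k = 1 because every row is unique; after generalization the smallest group has two rows. Note that this only measures the syntactic property and says nothing about what an attacker with background knowledge can infer.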

Differential privacy (DP) is the rigorous gold standard. You add carefully calibrated random noise so that whether any single person is in the dataset barely changes the output, bounded by a parameter epsilon (the "privacy budget" — lower epsilon = more noise = more privacy, less utility). Its killer feature: the guarantee holds regardless of any auxiliary data an attacker has. A 2025 study found DP at epsilon=1.0 cut re-identification risk below 0.1% with negligible utility loss.
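The noise-versus-epsilon trade-off can be illustrated with the Laplace mechanism on a counting query, where adding or removing one person changes the true answer by at most 1. This is a teaching sketch under stated assumptions: `laplace_noise`, `dp_count`, and `mean_abs_error` are hypothetical helpers, and production systems would normally rely on a vetted DP library because naive floating-point sampling has known weaknesses.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def dp_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Counting query with sensitivity 1, so the noise scale is 1 / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1.0 / epsilon, rng)

def mean_abs_error(epsilon: float, rng: random.Random, trials: int = 5000) -> float:
    """Average distance between the noisy answer and the true count of 1000."""
    return sum(abs(dp_count(1000, epsilon, rng) - 1000) for _ in range(trials)) / trials

rng = random.Random(0)
err_strict = mean_abs_error(0.1, rng)  # small epsilon: more noise, more privacy
err_loose = mean_abs_error(1.0, rng)   # larger epsilon: less noise, less privacy
print(round(err_strict, 2), round(err_loose, 2))
```

The expected absolute error equals the noise scale, so dropping epsilon from 1.0 to 0.1 makes the published count roughly ten times noisier.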

  Central DP (trusted curator)        Local DP (no trust needed)
  --------------------------          --------------------------
   user -> raw -> [CURATOR             user -> [+noise on device]
                   adds noise]  -> out         -> server -> aggregate
   must trust curator, less noise      no trusted party, more noise
   e.g. US Census 2020                 e.g. Apple iOS, Google RAPPOR

Real deployments: the US Census Bureau (2020, "TopDown" algorithm: the first decennial census protected with DP, and controversial for distorting small-area counts), Apple (iOS 10, 2016, local DP for typing/emoji telemetry), and Google RAPPOR (2014, Chrome; first internet-scale local DP).
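The local model can be illustrated with classic randomized response, the simplest local-DP mechanism: each device flips its own answer with a known probability before anything leaves it, and the server corrects for the flips in aggregate. This is a sketch of the general idea, not the algorithm Apple or Google actually ship; `randomize` and `estimate_rate` are hypothetical helpers and the population is simulated.

```python
import math
import random

def randomize(truth: bool, epsilon: float, rng: random.Random) -> bool:
    """On-device step: report the truth with probability e^eps / (e^eps + 1)."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return truth if rng.random() < p_truth else not truth

def estimate_rate(reports, epsilon: float) -> float:
    """Server step: unbiased estimate of the true 'yes' rate from noisy reports."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1)
    observed = sum(reports) / len(reports)
    # observed = (1 - p) + true_rate * (2p - 1), solved for true_rate
    return (observed - (1 - p)) / (2 * p - 1)

rng = random.Random(42)
true_answers = [i < 3000 for i in range(10000)]  # simulated users, 30% true "yes"
reports = [randomize(a, 1.0, rng) for a in true_answers]
estimate = estimate_rate(reports, 1.0)
print(round(estimate, 3))
```

The server never sees a trustworthy individual answer, yet the aggregate estimate lands close to the true 30%. The price is higher variance than the central model for the same epsilon, which is the "more noise" in the diagram above.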

10.6 Consent and lawful basis

Under GDPR Art.6 you must have at least one of exactly six lawful bases to process personal data: (a) consent, (b) contract, (c) legal obligation, (d) vital interests, (e) public task, (f) legitimate interests.

Common mistake: Treating consent as the default or only basis. Often "contract" or "legitimate interests" is the right basis (you don't ask consent to store the shipping address you need to ship the order). Valid consent must be freely given, specific, informed, unambiguous, and an affirmative opt-IN — no pre-ticked boxes (CJEU Planet49) — and as easy to withdraw as to give. Stacking several bases "just in case" is itself a transparency violation (per EDPB).

Engineering implication: consent is state. It must be recorded (who, what, when, which version of the notice — a "consent receipt"), enforced at processing time by a consent-checking gate, and honoured on withdrawal. A checkbox on a signup form is not consent management.
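"Consent is state" can be sketched as an append-only ledger of consent events plus a gate that every processing path must pass through. All names here (`ConsentEvent`, `ConsentLedger`, `send_newsletter`, the "newsletter" purpose) are hypothetical, and the in-memory list stands in for whatever durable store you would actually use.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class ConsentEvent:
    """One line of the consent receipt: who, what, when, which notice version."""
    subject_id: str
    purpose: str
    action: str            # "grant" or "withdraw"
    notice_version: str
    at: datetime

class ConsentLedger:
    def __init__(self):
        self._events = []  # append-only, so the history stays auditable

    def record(self, event: ConsentEvent) -> None:
        self._events.append(event)

    def allows(self, subject_id: str, purpose: str) -> bool:
        """True only if the latest event for this subject and purpose is a grant."""
        latest = None
        for e in self._events:
            if e.subject_id == subject_id and e.purpose == purpose:
                latest = e
        return latest is not None and latest.action == "grant"

def send_newsletter(ledger: ConsentLedger, subject_id: str) -> str:
    """Processing gate: refuse unless live consent exists for this purpose."""
    if not ledger.allows(subject_id, "newsletter"):
        return "skipped"
    return "sent"

ledger = ConsentLedger()
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
before_grant = send_newsletter(ledger, "u1")
ledger.record(ConsentEvent("u1", "newsletter", "grant", "notice-v3", now))
after_grant = send_newsletter(ledger, "u1")
ledger.record(ConsentEvent("u1", "newsletter", "withdraw", "notice-v3", now))
after_withdraw = send_newsletter(ledger, "u1")
print(before_grant, after_grant, after_withdraw)
```

Keeping grants and withdrawals as separate events, rather than overwriting a boolean, is what lets you later prove which notice version a user agreed to and when.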

10.7 Data lifecycle, mapping, and subject rights

You cannot protect, delete, or report on data you cannot find. Maintain a Record of Processing Activities (RoPA, mandatory under GDPR Art.30) plus a living data inventory/map: for every data element — what it is, where it lives (every DB, cache, log, backup, third-party processor, analytics pipeline), why it was collected, lawful basis, who it is shared with, and retention period. Lifecycle: collect → store → use → share → retain → destroy, with controls at each stage. High-risk processing also requires a DPIA (Data Protection Impact Assessment).

Data Subject Rights (GDPR Arts.15–22; CCPA/CPRA parallels) must be built as first-class infrastructure, not manual ops:

  • Access / DSAR (Art.15): confirm what you hold and give a copy plus purposes, categories, recipients, retention. Respond within 1 month (extendable +2 for complex cases). Build a "collect everything for subject X" pipeline that fans out across every store in the data map.
  • Erasure / right to be forgotten (Art.17): the hardest to engineer. A soft-delete flag is not erasure; regulators reject it because an admin can still read the data. You need hard purge across primary stores, indexes, caches, and derived data. For immutable backups, use crypto-shredding: encrypt each subject's data with a per-subject key, then destroy the key. With the key gone, the ciphertext is computationally indistinguishable from random noise even inside untouched backups (EDPB-accepted as valid erasure), provided no copy of the key survives anywhere. Pair with a replayable deletion ledger so any restored backup re-applies the deletion.
  • Portability (Art.20): hand back the data the subject gave you in a structured, machine-readable format (JSON/CSV) for reuse or switching providers — distinct from access.
  • Plus rectification (16), restriction (18), objection (21), and rights over automated decision-making (22).
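The access fan-out and the replayable deletion ledger described in the list above can be sketched together. This is a simplified illustration: `Store` is a hypothetical stand-in for any system in your data map, `SubjectRights` is an invented coordinator, and the sketch shows hard purge plus ledger replay only (crypto-shredding would need a real encryption library and is not shown).

```python
class Store:
    """Hypothetical stand-in for one system in the data map (DB, cache, log index)."""
    def __init__(self, name, rows):
        self.name = name
        self.rows = dict(rows)     # subject_id -> data held about that subject

    def export(self, subject_id):
        return self.rows.get(subject_id)

    def purge(self, subject_id):
        """Hard delete; returns True if something was actually removed."""
        return self.rows.pop(subject_id, None) is not None

class SubjectRights:
    def __init__(self, stores):
        self.stores = stores
        # In practice the ledger should hold pseudonymous IDs, not raw identifiers.
        self.deletion_ledger = []

    def access(self, subject_id):
        """DSAR: collect everything held about the subject across every store."""
        found = {}
        for store in self.stores:
            data = store.export(subject_id)
            if data is not None:
                found[store.name] = data
        return found

    def erase(self, subject_id):
        """Erasure: log the request first, then purge from every store."""
        self.deletion_ledger.append(subject_id)
        return [s.name for s in self.stores if s.purge(subject_id)]

    def replay(self, restored_store):
        """Re-apply every past erasure to a store restored from backup."""
        return [sid for sid in self.deletion_ledger if restored_store.purge(sid)]

db = Store("orders_db", {"u1": {"address": "1 Main St"}, "u2": {"address": "2 Oak Ave"}})
cache = Store("session_cache", {"u1": {"last_seen": "2025-01-01"}})
rights = SubjectRights([db, cache])

report = rights.access("u1")
purged_from = rights.erase("u1")
backup = Store("orders_db_restored", {"u1": {"address": "1 Main St"}, "u2": {"address": "2 Oak Ave"}})
replayed = rights.replay(backup)
print(report, purged_from, replayed)
```

The pipeline is only as complete as the list of stores it is given, which is why the data map has to come first.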

10.8 Privacy-Enhancing Technologies (PETs)

| PET | What it lets you do | 2025 caveat |
| --- | --- | --- |
| Differential privacy | Publish stats/telemetry with a provable per-person guarantee | Tune epsilon; some utility loss |
| Homomorphic encryption (FHE) | Compute on encrypted data without decrypting it | Can be millions of times slower; still niche |
| Secure multi-party computation (SMPC) | Several parties jointly compute a result without revealing their inputs (e.g. banks detecting fraud without sharing customer lists) | Network/coordination overhead |
| Federated learning | Train an ML model where raw data stays on each device; only model updates leave (can be DP-noised) | Updates can still leak; pair with DP |
| Synthetic data | Generate artificial records mirroring real statistics, no 1:1 link to real people | Must verify it doesn't memorize originals |

Also in the toolkit: trusted execution environments (TEEs/enclaves), tokenization, and zero-knowledge proofs. For risk cataloguing, the OWASP Top 10 Privacy Risks is the privacy counterpart to the security Top 10.
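To make the SMPC row concrete, here is a sketch of additive secret sharing, one of the simplest building blocks: several parties learn the sum of their inputs while no single share reveals anything about an individual input. The function names, the modulus, and the bank figures are all assumptions for the demo, and the whole exchange is simulated in one process rather than over a network.

```python
import secrets

MODULUS = 2**61 - 1  # all arithmetic is done modulo this value

def share(secret: int, n_parties: int):
    """Split `secret` into n random-looking shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares

def secure_sum(private_inputs):
    n = len(private_inputs)
    # Each party splits its own input and sends one share to every party.
    all_shares = [share(x, n) for x in private_inputs]
    # Party j adds up the shares it received and publishes only that partial sum.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)]
    # Combining the published partial sums reveals the total and nothing else.
    return sum(partial_sums) % MODULUS

bank_fraud_counts = [12, 7, 30]  # hypothetical private inputs, one per bank
total = secure_sum(bank_fraud_counts)
print(total)
```

Each message in the protocol is either a uniformly random share or a sum of such shares, which hints at the coordination overhead listed as the caveat: every party has to talk to every other party.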

Best practice: Current numbers worth citing: the IBM Cost of a Data Breach 2025 reports a global average of USD 4.44M (first decline in five years, credited to AI-assisted detection), a US record high of USD 10.22M, and "shadow AI" adding about USD 670K to the average. The EU AI Act is in force (prohibited practices applied from Feb 2025); the proposed "AI Omnibus" would defer most high-risk obligations toward Dec 2027. The Act is relevant here because it introduces Fundamental Rights Impact Assessments.

Common mistakes

  • Treating anonymization as a checkbox — just stripping names (AOL and Netflix prove it fails against auxiliary data).
  • Equating "de-identified" with "anonymous."
  • Using consent as the only/default lawful basis, or pre-ticked opt-out boxes.
  • Passing off a soft-delete flag as erasure.
  • Collecting "just in case" with no retention TTL.
  • Forgetting backups, logs, caches, and third parties when deleting.
  • Having no data inventory, so DSARs can never be fully answered.
  • Confusing privacy with security — assuming "we encrypt it" means "we're private."

Best practices

  • Minimize at the point of collection — the only data you can never mishandle is the data you never took.
  • Tag every data element with its purpose and retention at ingestion; enforce both downstream.
  • Default to the strictest privacy setting (opt-in, not opt-out).
  • Build access/delete/export as automated subject-rights pipelines, not manual heroics.
  • Prefer crypto-shredding for erasure that must survive in immutable backups.
  • Choose differential privacy over k-anonymity when releasing data externally.
  • Run a DPIA for any high-risk processing; keep a living RoPA/data map.

Key takeaway: Security keeps the wrong people out; privacy governs what the right people may do once they are in. Privacy is an engineering discipline, not a policy document: minimize what you collect, bind data to its purpose, set retention TTLs, and ship the strictest setting by default. "Removing names" is not anonymization: auxiliary data and quasi-identifiers re-identify people (AOL, Netflix, Strava), so reach for differential privacy when you must share. Treat consent as enforceable state, and build subject rights (access, erasure via crypto-shredding, portability) as real infrastructure. None of it works without a data map that tells you where every byte lives.
