Inside a PDF: Structure, Graphics & Fonts

By Pritesh Yadav 12 min read

You send PDFs to the print shop every day, but what is actually inside one? Understanding the anatomy of a PDF is the single best way to stop the most common print disasters: wrong fonts, muddy colour, and pixelated images. This section opens the file up and shows you the gears turning inside — in plain English, no programming required.

8.1 What a PDF really is

Most people imagine a PDF as a "picture of a page." It is closer to a small database of objects plus a map telling the reader where each object lives. It is not a top-to-bottom stream of text like a web page. PDF is defined by the open ISO standard ISO 32000 (PDF 1.7 = ISO 32000-1, 2008; PDF 2.0 = ISO 32000-2, first published in 2017 and revised in 2020). It began as Adobe's format and is now a published international standard.

Analogy: Think of a PDF as a shipping container with a manifest taped to the back door. The contents are boxes (objects). The manifest lists the exact shelf (byte position) of every box. A sticker on the door says "the manifest is on the last page, and the master inventory is box #1." You read the door sticker first, then jump straight to any box — you never unpack the whole container.

The four physical parts

+------------------------------------------------+
| 1. HEADER      %PDF-1.7  (version line)        |
+------------------------------------------------+
| 2. BODY        all the objects:                |
|                pages, text, fonts, images, art |
+------------------------------------------------+
| 3. XREF TABLE  index: byte offset of every obj |
+------------------------------------------------+
| 4. TRAILER     where xref starts + which obj   |
|                is the Root (Catalog)           |
+------------------------------------------------+
      ^ a reader opens the file from the BOTTOM

Header
The first line, e.g. %PDF-1.7, declaring the spec version. It is usually followed by a line of high-byte binary characters so email/FTP tools treat the file as binary, not text.
Body
The actual content — every object that makes up the document.
Cross-reference (xref) table
An index giving the exact byte offset of every object, so the reader can jump straight to any object without scanning the whole file. This is what makes a PDF random-access.
Trailer
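A tiny dictionary near the very end of the file. It tells the reader which object is the document /Root (the Catalog), and it also carries /Size (the number of xref entries, which is the highest object number plus one) and /ID. Right after the dictionary, the startxref keyword gives the byte offset where the xref table starts, and the %%EOF marker closes the file.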
Key takeaway: A reader opens a PDF by reading the end first (trailer → xref), then jumps directly to the objects it needs — the opposite of how you read a book.
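For the curious, the read-from-the-end behaviour can be sketched in a few lines of Python. This is a minimal illustration rather than a real parser: the sample file is a hand-built skeleton, and the helper name read_tail and the 1024-byte window are assumptions made for the example.

```python
import re

def read_tail(pdf: bytes):
    """Return (xref_offset, root_object_number) by looking only at the file's end."""
    tail = pdf[-1024:]  # assumption: trailer and startxref fit in the last 1 KB
    # Take the LAST match in case the tail holds more than one xref section.
    xref_offset = int(re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)[-1])
    root = int(re.findall(rb"/Root\s+(\d+)\s+\d+\s+R", tail)[-1])
    return xref_offset, root

# Build a tiny skeleton file (hypothetical; just enough structure to demonstrate).
objects = [
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
    b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n",
]
pdf = b"%PDF-1.7\n"
offsets = []
for obj in objects:
    offsets.append(len(pdf))      # remember where each object starts
    pdf += obj
xref_at = len(pdf)                # the xref table starts right after the body
pdf += b"xref\n0 3\n0000000000 65535 f\r\n"
for off in offsets:
    pdf += b"%010d 00000 n\r\n" % off
pdf += b"trailer\n<< /Size 3 /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % xref_at

xref_offset, root = read_tail(pdf)
print("xref table starts at byte", xref_offset, "and the Catalog is object", root)
```

Note that the function never touches the body: it learns where the index lives and which object is the Catalog from the last few bytes alone.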

8.2 Objects — the eight building blocks

Everything in a PDF body is made of exactly eight object types: Boolean, Numeric (integer or real), String, Name, Array, Dictionary, Stream, and Null. Four of them do most of the work:

Name
A token starting with /, e.g. /Type, /Font, /MediaBox. Used as keys and identifiers — not the same thing as a text string.
Array
An ordered list in square brackets, e.g. [0 0 612 792] (a page box measured in points).
Dictionary
Key→value pairs wrapped in << >>, e.g. << /Type /Page /MediaBox [0 0 612 792] >>. This is the workhorse object.
Stream
A dictionary followed by raw bytes between stream … endstream. Used for anything large or binary: page content, embedded fonts, images. The dictionary carries /Length and a /Filter (compression, usually /FlateDecode = zlib/deflate).
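As a rough sketch of what /FlateDecode means in practice, the snippet below wraps a small piece of page content in a stream and unpacks it again with Python's standard zlib module. The content string and variable names are made up for the example, and a real stream would sit inside a numbered object.

```python
import re
import zlib

# Hypothetical page content: draw the word "Hello" in font /F1 at 12 pt.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"

# Writing: compress the bytes and record their length in the dictionary.
compressed = zlib.compress(content)
stream_obj = (
    b"<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(compressed)
    + compressed
    + b"\nendstream"
)

# Reading: split at the first "stream" keyword, trust /Length, then inflate.
header, _, rest = stream_obj.partition(b"stream\n")
length = int(re.search(rb"/Length\s+(\d+)", header).group(1))
decoded = zlib.decompress(rest[:length])

print(decoded.decode("ascii"))
```

The reader relies on /Length rather than searching for endstream, because compressed bytes can contain any byte sequence.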

Any object can be made an indirect object by giving it an ID, written as 12 0 obj … endobj (object number 12, generation 0). Other objects point at it with 12 0 R — the "R" means reference. The xref table is simply the index of where every numbered object lives.

How the xref table works

Each classic xref entry is exactly 20 bytes so the table is itself randomly addressable: a 10-digit byte offset (zero-padded) + space + a 5-digit generation number + space + a one-character flag n (in-use) or f (free), then a 2-byte end-of-line.
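The fixed 20-byte layout is what makes lookup a matter of simple arithmetic. The following sketch parses one entry and then jumps to the entry for a given object number; the function names and the sample offsets are assumptions for illustration, and it only handles a table whose first entry is object 0.

```python
ENTRY_SIZE = 20  # 10-digit offset + space + 5-digit generation + space + flag + 2-byte EOL

def parse_xref_entry(entry: bytes):
    """Split one classic xref entry into (offset, generation, in_use)."""
    if len(entry) != ENTRY_SIZE:
        raise ValueError("a classic xref entry is exactly 20 bytes")
    offset = int(entry[0:10])          # zero-padded byte offset
    generation = int(entry[11:16])     # 5-digit generation number
    in_use = entry[17:18] == b"n"      # 'n' = in use, 'f' = free
    return offset, generation, in_use

def entry_for(table: bytes, obj_num: int):
    """Jump straight to the entry for obj_num (table must start at object 0)."""
    start = obj_num * ENTRY_SIZE
    return parse_xref_entry(table[start:start + ENTRY_SIZE])

# Hypothetical three-entry table.
table = (
    b"0000000000 65535 f\r\n"
    b"0000000009 00000 n\r\n"
    b"0000000058 00000 n\r\n"
)
print(entry_for(table, 2))
```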

Common mistake: Assuming "deleting" content in a PDF removes it. PDF supports incremental updates: an editor can append its changes to the end of the file with a new xref section and a new trailer that chains to the old one via /Prev, leaving the original bytes in place. This is why edited PDFs grow, and why "removed" text can sometimes be recovered, which is a real privacy gotcha.

PDF 1.5+ can also compress the index into a cross-reference stream and pack objects into object streams, giving smaller files that are harder to read in a hex editor.

8.3 The document hierarchy

Trailer /Root --> Catalog (/Type /Catalog)
                     |
                     v
                  Pages node (/Type /Pages)  <- root of page tree
                   /        \
              Pages          Page (/Type /Page)
              /    \             |  /MediaBox  [0 0 612 792]
           Page   Page          |  /Contents  -> content stream
                                |  /Resources -> fonts, images, colors

The trailer's /Root points to the Catalog, the top of the tree. The Catalog points to the Pages node — the root of the page tree. The page tree is a (usually balanced) tree of branch nodes (/Pages) and leaf nodes (/Page). Balancing it means a 5,000-page document can find page 4,000 quickly without walking a flat list.
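To see why a tree beats a flat list, here is a small sketch of page lookup. It leans on two standard keys that the text above does not spell out: /Kids (a node's children) and /Count (how many leaf pages sit beneath a /Pages node). The Python dictionaries, the helper names and the "Name" label are all stand-ins invented for the example.

```python
def page(name):
    # "Name" is a made-up label so the demo can tell pages apart.
    return {"Type": "Page", "Name": name}

def pages(*kids):
    # /Count = number of leaf pages below this node.
    count = sum(k["Count"] if k["Type"] == "Pages" else 1 for k in kids)
    return {"Type": "Pages", "Kids": list(kids), "Count": count}

def find_page(node, index):
    """Return the leaf page for a zero-based index, skipping whole subtrees."""
    if node["Type"] == "Page":
        if index != 0:
            raise IndexError("page index out of range")
        return node
    for kid in node["Kids"]:
        count = kid["Count"] if kid["Type"] == "Pages" else 1
        if index < count:
            return find_page(kid, index)   # the page is somewhere under this kid
        index -= count                     # skip this subtree without opening it
    raise IndexError("page index out of range")

# Same shape as the diagram: a branch with two pages, then a single page.
root = pages(pages(page("p1"), page("p2")), page("p3"))
print(find_page(root, 2)["Name"])
```

Because each branch advertises its /Count, the lookup can step over an entire subtree in one comparison instead of visiting every page in it.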

Each Page dictionary holds /MediaBox (the physical sheet size), /Contents (the stream that draws the page), /Resources (the fonts, images and colour spaces it uses), and optionally the print boxes /CropBox, /BleedBox, /TrimBox and /ArtBox — where TrimBox = the final cut size and BleedBox = trim + bleed.

Key takeaway: PDF measures everything in PostScript points: 1 pt = 1/72 inch. US Letter = [0 0 612 792] (8.5×11 in), A4 ≈ [0 0 595 842]. The origin is the bottom-left corner and Y increases upward — the maths convention, not the screen convention.
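The arithmetic behind those numbers is short enough to check by hand or in a few lines of Python. The helper names below are assumptions for the example, and the box values are rounded to whole points as in the text.

```python
POINTS_PER_INCH = 72
MM_PER_INCH = 25.4

def mm_to_pt(mm):
    """Convert millimetres to PostScript points."""
    return mm / MM_PER_INCH * POINTS_PER_INCH

def media_box(width_mm, height_mm):
    """Build a /MediaBox array with the origin at the bottom-left corner."""
    return [0, 0, round(mm_to_pt(width_mm)), round(mm_to_pt(height_mm))]

def to_pdf_y(y_from_top, page_height_pt):
    """Flip a screen-style Y (measured down from the top) into PDF's upward Y."""
    return page_height_pt - y_from_top

print(media_box(210, 297))      # A4
print(media_box(215.9, 279.4))  # US Letter (8.5 x 11 in)
print(to_pdf_y(72, 792))        # one inch below the top edge of a Letter page
```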

8.4 Content streams — how the page is "drawn"

A page's appearance comes from its content stream, which is a tiny postfix (Reverse-Polish) program: the operands come first, then the operator. For example 1 0 0 RG sets the stroke colour to red, then 100 100 m moves the pen. It is an imperative paint program — there is no "this is a paragraph," only "put this glyph at this X/Y, draw this line, fill this path." Meaning and structure have to be reconstructed afterwards, which is exactly why text extraction and reflow are hard.
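The postfix idea can be sketched as a toy reader that collects operands until an operator arrives. It is deliberately simplified: it only understands whitespace-separated numbers and operator names (no strings, names, arrays or comments), and parse_content is a name invented for the example.

```python
def parse_content(stream: str):
    """Group a postfix token stream into (operator, operands) pairs."""
    instructions = []
    operands = []
    for token in stream.split():
        try:
            operands.append(float(token))       # numbers pile up as operands
        except ValueError:
            instructions.append((token, operands))  # anything else is an operator
            operands = []
    return instructions

# Set stroke colour to red, move to (100, 100), line to (200, 100), stroke.
for operator, operands in parse_content("1 0 0 RG 100 100 m 200 100 l S"):
    print(operator, operands)
```

Nothing in the result says "heading" or "paragraph"; the stream is only a list of drawing commands in the order they run.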

Graphics-state operators

  • q / Q: save / restore the graphics state on a stack. Everything between them runs in isolation (colour, line width, clipping, transform). Always paired.
  • cm: concatenate a transformation matrix onto the CTM (current transformation matrix), so that everything drawn afterward is scaled, rotated, skewed or moved (six numbers).
  • w: line width; J / j: line caps / joins; d: dash pattern; gs: apply an extended graphics state (transparency, blend mode).
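A minimal sketch of how q, Q and cm interact, assuming a graphics state that holds nothing but the CTM (a real one also carries colour, line width, clipping and more). The class and function names are invented for the example; the six numbers a b c d e f map a point to (a·x + c·y + e, b·x + d·y + f), and each new cm matrix is applied to points before the matrices that were already in place.

```python
IDENTITY = (1, 0, 0, 1, 0, 0)

def concat(m, ctm):
    """Return m x ctm: the new matrix acts on points first, then the old CTM."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = ctm
    return (
        a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2,
    )

def apply(ctm, x, y):
    """Transform a user-space point through the CTM."""
    a, b, c, d, e, f = ctm
    return (a * x + c * y + e, b * x + d * y + f)

class GraphicsState:
    """Toy graphics state that tracks only the CTM."""
    def __init__(self):
        self.ctm = IDENTITY
        self.stack = []
    def q(self):
        self.stack.append(self.ctm)     # save
    def Q(self):
        self.ctm = self.stack.pop()     # restore
    def cm(self, *m):
        self.ctm = concat(m, self.ctm)

gs = GraphicsState()
gs.q()
gs.cm(1, 0, 0, 1, 100, 50)   # move the origin to (100, 50)
gs.cm(2, 0, 0, 2, 0, 0)      # then double the scale
print(apply(gs.ctm, 10, 10))
gs.Q()                       # everything above is undone
print(gs.ctm)
```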

Vector paths (resolution-independent)

Construct with m (moveto), l (lineto), c (cubic Bézier curve), re (rectangle) and h (closepath). Paint with S (stroke), f (fill, nonzero winding rule), f* (fill, even-odd rule), B (fill and stroke), and W/W* (use the path as a clipping path). The two fill rules decide which regions of overlapping or nested shapes count as "inside." Even-odd gives a donut its hole however the two rings are drawn; nonzero leaves the hole only if the inner ring runs in the opposite direction to the outer one.
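The difference between the two fill rules is easiest to see on a donut. The sketch below casts a horizontal ray from a test point and counts how the path edges cross it; the function name, the polygon-only paths and the coordinates are assumptions made for the example (real paths also contain curves).

```python
def winding_and_crossings(point, subpaths):
    """Cast a ray to the right of `point`; return (winding number, crossing count)."""
    px, py = point
    winding = crossings = 0
    for path in subpaths:
        for (x1, y1), (x2, y2) in zip(path, path[1:] + path[:1]):
            if (y1 <= py) != (y2 <= py):                 # edge straddles the ray's height
                x_at = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
                if x_at > px:                            # crossing lies to the right
                    crossings += 1
                    winding += 1 if y2 > y1 else -1      # upward edge +1, downward -1
    return winding, crossings

outer = [(0, 0), (10, 0), (10, 10), (0, 10)]          # counter-clockwise
inner_same = [(4, 4), (6, 4), (6, 6), (4, 6)]         # counter-clockwise too
inner_reversed = [(4, 4), (4, 6), (6, 6), (6, 4)]     # clockwise

for label, inner in (("same direction", inner_same), ("reversed", inner_reversed)):
    winding, crossings = winding_and_crossings((5, 5), [outer, inner])
    print(label,
          "| nonzero fills the middle:", winding != 0,
          "| even-odd fills the middle:", crossings % 2 == 1)
```

With both rings drawn in the same direction, only the even-odd rule leaves the middle empty; once the inner ring is reversed, both rules agree on the hole.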

Images

A raster image is an XObject of subtype Image — a stream listed in the page's /Resources, carrying /Width, /Height, /BitsPerComponent, /ColorSpace and a /Filter (DCTDecode = JPEG, JPXDecode = JPEG2000, FlateDecode = lossless, CCITTFaxDecode/JBIG2Decode = bilevel scans). It is drawn by setting a CTM that maps a 1×1 unit square to the desired size and position, then calling Do.

Common mistake: Scaling a placed raster up via the matrix. Because the image's scale lives in the CTM, not the pixels, enlarging that 1×1 square too far silently drops the effective resolution below 300 dpi. It looks crisp on screen yet prints pixelated.
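The effective resolution is just pixels divided by placed inches, which a short calculation makes concrete. The helper name and the sample numbers are assumptions; for an unrotated image placed with w 0 0 h x y cm, the placed width in points is the first number of the matrix.

```python
def effective_ppi(pixels_wide, placed_width_pt):
    """Pixels per inch once the image is stretched to placed_width_pt points."""
    placed_inches = placed_width_pt / 72   # 72 points per inch
    return pixels_wide / placed_inches

# A 1200-pixel-wide photo placed 4 inches wide (288 pt)...
print(effective_ppi(1200, 288))
# ...and the same photo stretched to 8 inches wide (576 pt).
print(effective_ppi(1200, 576))
```

The pixel data is identical in both cases; only the matrix changed, which is why the problem is easy to miss on screen.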

Text

Text lives inside a text object between BT and ET. Tf sets the font and size (/F1 12 Tf), Td/Tm position it, Tj shows a string of glyphs, and TJ shows them with per-glyph kerning. Crucially, the string in Tj contains glyph codes, decoded to real characters through the font's Encoding and, for searchable text, a ToUnicode map.

Common mistake: Forgetting /ToUnicode. The page prints perfectly but copies out as garbage and is unreadable to search and accessibility tools.
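A toy illustration of why the map matters: in a subset font the glyph codes are often arbitrary small numbers, so without a lookup table they mean nothing as text. The codes, the dictionary and the function name here are invented for the example and stand in for a real ToUnicode CMap.

```python
REPLACEMENT = "\ufffd"  # the "unknown character" symbol

def decode_glyphs(codes: bytes, to_unicode: dict) -> str:
    """Turn single-byte glyph codes into text using a code -> character map."""
    return "".join(to_unicode.get(code, REPLACEMENT) for code in codes)

# Hypothetical subset font: glyphs were numbered in the order they were first used.
to_unicode = {1: "H", 2: "e", 3: "l", 4: "o"}
shown = b"\x01\x02\x03\x03\x04"   # what the Tj string actually contains

print(decode_glyphs(shown, to_unicode))   # with the map
print(decode_glyphs(shown, {}))           # without it
```

The page draws identically in both cases, because drawing only needs the glyph codes; copying, searching and screen readers need the characters.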

8.5 Fonts — the number-one cause of print disasters

Font types PDF supports

Type | What it is | Print note
Type 1 | Classic Adobe PostScript outline font | Legacy only; Adobe ended authoring support in Jan 2023
TrueType | Apple/MS quadratic-outline font | Very common, well supported
OpenType | Modern wrapper holding either TrueType (glyf) or PostScript (CFF) outlines | PDF 1.6+; advanced typography
Type 0 / Composite (CID-keyed) | "Big alphabet" wrapper (CIDFontType0 or CIDFontType2) | Required for >256 glyphs: CJK, large Unicode, complex scripts
Type 3 | Glyphs drawn as arbitrary PDF graphics | Rare; often non-scalable, usually undesirable for print

Embedding vs. subsetting

Embedding
The font program is stored inside the PDF (a /FontFile, /FontFile2 or /FontFile3 stream referenced from the font descriptor, depending on the font format). It guarantees the same glyph shapes and widths on any viewer or RIP.
Subsetting
Embed only the glyphs actually used. Shrinks the file dramatically and is the prepress default. Subset fonts get a name prefixed with 6 uppercase letters + a "+", e.g. ABCDEF+Helvetica. Spotting XXXXXX+Name in the font list confirms it is embedded and subset.
Analogy: Embedding a font is like packing the typewriter with the letter. If you ship only the text and assume the recipient owns the same typewriter, they'll retype it on whatever machine they have (Courier) — different key widths, so the whole letter re-flows and no longer fits the page.
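Checking for the subset prefix is a simple pattern match. The sketch below assumes you already have a list of font names (for example, copied from a preflight report); the function name and the sample names are made up for illustration.

```python
import re

# Six uppercase letters, a plus sign, then the real font name.
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+(.+)$")

def subset_base_name(font_name):
    """Return the underlying font name if font_name carries a subset prefix, else None."""
    match = SUBSET_PREFIX.match(font_name)
    return match.group(1) if match else None

for name in ["ABCDEF+Helvetica", "Helvetica", "QWERTY+MyriadPro-Bold"]:
    print(name, "->", subset_base_name(name))
```

A name without the prefix is not automatically a problem, since a font can be embedded in full; it is simply a prompt to check the font list more closely.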

Why missing or unembedded fonts ruin print

If a font is not embedded, the viewer or RIP must substitute. Acrobat fakes a match using Multiple-Master substitutes (Adobe Serif MM / Adobe Sans MM); a print RIP often just drops to Courier or Helvetica. Substitutes have different glyph shapes and different advance widths, so text reflows: words overlap, lines re-wrap, paragraphs push off the page, and special characters or logos turn into wrong glyphs or empty boxes (□). Print RIPs are stricter than screen viewers, so a file can look fine on screen and still substitute on the press.
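A quick calculation shows how much a substitute can move text. Advance widths are stored in thousandths of the font size; the Helvetica figures below are quoted from the commonly published metrics and should be treated as illustrative, while Courier is monospaced at 600 units per glyph. The function name is an assumption for the example.

```python
# Advance widths in 1/1000 of the font size (illustrative values).
HELVETICA = {"H": 722, "e": 556, "l": 222, "o": 556}
COURIER_WIDTH = 600  # every Courier glyph has the same advance

def line_width_pt(text, font_size, widths=None):
    """Width of a line in points; widths=None means a 600-unit monospaced font."""
    units = sum(widths[ch] if widths else COURIER_WIDTH for ch in text)
    return units * font_size / 1000

designed = line_width_pt("Hello", 12, HELVETICA)
substituted = line_width_pt("Hello", 12)
print(f"designed: {designed:.3f} pt, substituted: {substituted:.3f} pt")
print(f"the substitute runs {substituted / designed - 1:.0%} wider")
```

Spread over a full line of body text, a difference of that size is enough to push words onto the next line, which is the reflow described above.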

The Standard 14 ("Base-14") fonts (Helvetica, Times, Courier — four styles each — plus Symbol and ZapfDingbats) were historically never embedded because every PostScript device had them. For modern print you should embed them anyway.

Example — the Courier surprise: A designer sends a brochure in a licensed display font but forgets to embed it. On their Mac it's perfect; on the shop's RIP every headline renders in Courier, lines rewrap, a two-line headline becomes three, and it collides with the photo. It previewed fine only because the designer's machine had the font installed.
Best practice: Export to PDF/X-4 (or X-1a for legacy CMYK-flat work). PDF/X and PDF/A both require every font to be embedded — a non-embedded font is a preflight failure, not a warning. Then preflight (Acrobat Pro, Enfocus PitStop, callas pdfToolbox) and confirm the XXXXXX+FontName subset prefixes are present before plating.

8.6 Colour spaces in PDF

Colour space | What it is | Print suitability
DeviceGray | One 0–1 channel (g/G) | Fine for grayscale / black-only
DeviceRGB | Red/Green/Blue (rg/RG), screen-oriented, no defined gamut | Poor as a final print space; looks different on every device
DeviceCMYK | Cyan/Magenta/Yellow/Key ink % (k/K), press-oriented, no press/paper profile | Colour undefined until you know the printing condition
ICCBased | RGB/CMYK/Gray with an embedded ICC profile pinning an exact appearance | The industry standard for colour-managed work
CIE-based (CalGray, CalRGB, Lab) | Device-independent reference spaces; Lab is the absolute reference | Used internally for conversions
Separation | A single named spot colorant (Pantone, varnish, white) + alternate space + tint transform | Real press uses a dedicated plate; others approximate
DeviceN | Several named colorants at once (multi-spot, duotones) | Same idea as Separation, N inks
Indexed | A palette into a base space | Small fixed colour sets, paletted images

For PDF/X, a single OutputIntent declares the intended printing condition (e.g. GRACoL2013, FOGRA39/FOGRA51, US Web Coated SWOP) — the reference everything in the file is meant to be reproduced against.

Analogy: DeviceCMYK vs. ICCBased is "add salt" vs. "add 5 g of salt." Device numbers are vague proportions that taste different in every kitchen (press); an ICC profile or OutputIntent pins the recipe to a known kitchen so the dish comes out the same everywhere.
Example — RGB black turns muddy: A logo built in DeviceRGB pure black (0,0,0) gets converted at the press into rich black (high C, M, Y and K), so lettering that should have been crisp 100% K prints fuzzy and registration-sensitive instead. Fixing the colour space before plates are made is the prepress job.
Example — Pantone prints as process: Packaging uses a Separation colour "PANTONE 286 C," expecting a fifth plate. The job is set up as CMYK-only, so the RIP falls back to the Separation's alternate colour space (via its tint transform) and the brand blue prints as a dull process approximation. This is the classic spot-vs-process mismatch.
Common mistake: Delivering DeviceRGB, or untagged DeviceCMYK, for print. With no defined gamut or printing condition, colour is unpredictable. Final artwork should be ICCBased or carry a PDF/X OutputIntent.
Best practice: Colour-manage explicitly — tag artwork with ICC profiles, keep spot colours as named Separation channels only when you genuinely want extra plates, and convert to the press's output condition (FOGRA/GRACoL) on purpose, never by accident.
Section summary:
  • A PDF is a random-access database of objects indexed by an xref table; readers open it from the end (trailer → xref → Catalog → page tree).
  • Pages are painted by a postfix content stream of operators — vectors (m/l/c/f), images (1×1-square + Do), and text (BT…ET, glyph codes decoded via Encoding/ToUnicode); measured in points (1 pt = 1/72 in) from a bottom-left origin.
  • Fonts are the top print risk: always embed and subset (look for the XXXXXX+Name prefix); unembedded fonts substitute to Courier/Helvetica and reflow text — RIPs are stricter than screen viewers.
  • For colour, prefer ICCBased over device spaces, keep Separation for true spot plates, and declare an OutputIntent so the press condition is unambiguous.
  • Ship PDF/X-4 and preflight before plating — it enforces font embedding and surfaces colour, transparency and resolution problems before they cost a print run.
