Inside a PDF: Structure, Graphics & Fonts
You send PDFs to the print shop every day, but what is actually inside one? Understanding the anatomy of a PDF is the single best way to stop the most common print disasters: wrong fonts, muddy colour, and pixelated images. This section opens the file up and shows you the gears turning inside — in plain English, no programming required.
8.1 What a PDF really is
Most people imagine a PDF as a "picture of a page." It is closer to a small database of objects plus a map telling the reader where each object lives. It is not a top-to-bottom stream of text like a web page. PDF is defined by the open ISO standard ISO 32000 (PDF 1.7 = ISO 32000-1, 2008; PDF 2.0 = ISO 32000-2, first published in 2017 and revised in 2020). It began as Adobe's format and is now a published international standard.
The four physical parts
+-----------------------------------------------+
| 1. HEADER     %PDF-1.7 (version line)         |
+-----------------------------------------------+
| 2. BODY       all the objects:                |
|               pages, text, fonts, images, art |
+-----------------------------------------------+
| 3. XREF TABLE index: byte offset of every obj |
+-----------------------------------------------+
| 4. TRAILER    where xref starts + which obj   |
|               is the Root (Catalog)           |
+-----------------------------------------------+
        ^ a reader opens the file from the BOTTOM
- Header
- The first line, e.g. %PDF-1.7, declaring the spec version. It is usually followed by a line of high-byte binary characters so email/FTP tools treat the file as binary, not text.
- Body
- The actual content: every object that makes up the document.
- Cross-reference (xref) table
- An index giving the exact byte offset of every object, so the reader can jump straight to any object without scanning the whole file. This is what makes a PDF random-access.
- Trailer
- A tiny dictionary at the very end. It tells the reader two critical things: where the xref table starts (startxref) and which object is the document /Root (the Catalog). It also carries /Size (object count) and /ID.
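To make the "open from the bottom" idea concrete, here is a minimal Python sketch that reads the end of a file the way a viewer would: find the last startxref, then find /Root. The sample bytes, the offset 1234 and the function names are invented for illustration, and real readers parse the trailer properly instead of using regular expressions.

```python
import re

# A hand-written stand-in for the last few lines of a PDF file.
# The offset and object numbers are made up for illustration.
SAMPLE_TAIL = (
    b"trailer\n"
    b"<< /Size 6 /Root 1 0 R >>\n"
    b"startxref\n"
    b"1234\n"
    b"%%EOF\n"
)


def find_startxref(data: bytes) -> int:
    """Return the byte offset written after the last 'startxref' keyword."""
    pos = data.rfind(b"startxref")  # search backwards from the end of the file
    if pos == -1:
        raise ValueError("no startxref keyword found")
    match = re.match(rb"startxref\s+(\d+)", data[pos:])
    if match is None:
        raise ValueError("startxref is not followed by an offset")
    return int(match.group(1))


def find_root(data: bytes) -> tuple:
    """Return (object number, generation) of the trailer's /Root reference."""
    match = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", data)
    if match is None:
        raise ValueError("no /Root entry found")
    return int(match.group(1)), int(match.group(2))


print(find_startxref(SAMPLE_TAIL))  # prints 1234
print(find_root(SAMPLE_TAIL))       # prints (1, 0)
```

With those two values a reader knows where the index lives and which object is the Catalog, without having read anything else in the file.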
8.2 Objects — the eight building blocks
Everything in a PDF body is made of exactly eight object types: Boolean, Numeric (integer or real), String, Name, Array, Dictionary, Stream, and Null. Four of them do most of the work:
- Name
- A token starting with /, e.g. /Type, /Font, /MediaBox. Used as keys and identifiers, and not the same thing as a text string.
- Array
- An ordered list in square brackets, e.g. [0 0 612 792] (a page box measured in points).
- Dictionary
- Key→value pairs wrapped in << >>, e.g. << /Type /Page /MediaBox [0 0 612 792] >>. This is the workhorse object.
- Stream
- A dictionary followed by raw bytes between stream … endstream. Used for anything large or binary: page content, embedded fonts, images. The dictionary carries /Length and a /Filter (compression, usually /FlateDecode = zlib/deflate).
Any object can be made an indirect object by giving it an ID, written as 12 0 obj … endobj (object number 12, generation 0). Other objects point at it with 12 0 R — the "R" means reference. The xref table is simply the index of where every numbered object lives.
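As a rough illustration of the "12 0 obj" and "12 0 R" notation, the Python sketch below scans a hand-written object for definitions and references. The sample object is invented, and the regular expressions are a deliberate simplification: a real parser tokenises the file, because the same digit patterns can also occur inside strings and streams.

```python
import re

# An invented indirect object: number 12, generation 0, pointing at two others.
SAMPLE_BODY = b"""\
12 0 obj
<< /Type /Page /Parent 3 0 R /Contents 15 0 R >>
endobj
"""

OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")  # "12 0 obj" defines an object
REFERENCE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")     # "3 0 R" points at an object


def as_pairs(pattern, data: bytes) -> list:
    """Return every (object number, generation) pair the pattern finds."""
    return [(int(m.group(1)), int(m.group(2))) for m in pattern.finditer(data)]


defined = as_pairs(OBJ_HEADER, SAMPLE_BODY)
referenced = as_pairs(REFERENCE, SAMPLE_BODY)

print(defined)     # prints [(12, 0)]
print(referenced)  # prints [(3, 0), (15, 0)]
```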
How the xref table works
Each classic xref entry is exactly 20 bytes so the table is itself randomly addressable: a 10-digit byte offset (zero-padded) + space + a 5-digit generation number + space + a one-character flag n (in-use) or f (free), then a 2-byte end-of-line.
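The fixed 20-byte layout can be sketched in a few lines of Python. The function names here are hypothetical, and the example uses CR+LF as the 2-byte end-of-line; the point is only that a fixed width lets a reader compute where entry number i sits instead of searching for it.

```python
ENTRY_SIZE = 20  # bytes per classic xref entry


def format_xref_entry(offset: int, generation: int = 0, in_use: bool = True) -> bytes:
    """Build one entry: 10-digit offset, space, 5-digit generation, space, flag, EOL."""
    flag = "n" if in_use else "f"
    return f"{offset:010d} {generation:05d} {flag}\r\n".encode("ascii")


def parse_xref_entry(entry: bytes) -> tuple:
    """Split one 20-byte entry back into (offset, generation, in_use)."""
    if len(entry) != ENTRY_SIZE:
        raise ValueError("a classic xref entry must be exactly 20 bytes")
    offset = int(entry[0:10])
    generation = int(entry[11:16])
    in_use = entry[17:18] == b"n"
    return offset, generation, in_use


def entry_position(first_entry_offset: int, index: int) -> int:
    """Byte position of entry `index`, counted from the first entry of a section."""
    return first_entry_offset + ENTRY_SIZE * index


entry = format_xref_entry(17)
print(entry)                    # prints b'0000000017 00000 n\r\n'
print(len(entry))               # prints 20
print(parse_xref_entry(entry))  # prints (17, 0, True)
print(entry_position(500, 3))   # prints 560
```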
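When a PDF is edited and saved incrementally, the changes are appended to the end of the file together with a new xref section and a new trailer that points back to the previous xref via /Prev; the original bytes stay. This is why edited PDFs grow, and why "removed" text can sometimes be recovered, which is a real privacy gotcha. PDF 1.5+ can compress the index into a cross-reference stream and pack objects into object streams, giving smaller files that are harder to read in a hex editor.
8.3 The document hierarchy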
Trailer /Root --> Catalog (/Type /Catalog)
                     |
                     v
                  Pages node (/Type /Pages)   <- root of page tree
                   /        \
               Pages        Page (/Type /Page)
               /   \          |  /MediaBox  [0 0 612 792]
            Page   Page       |  /Contents  -> content stream
                              |  /Resources -> fonts, images, colors
The trailer's /Root points to the Catalog, the top of the tree. The Catalog points to the Pages node — the root of the page tree. The page tree is a (usually balanced) tree of branch nodes (/Pages) and leaf nodes (/Page). Balancing it means a 5,000-page document can find page 4,000 quickly without walking a flat list.
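The Python sketch below shows why a tree beats a flat list, using plain dictionaries as stand-ins for PDF objects. It relies on the /Count entry that the specification gives every /Pages node (the number of leaf pages beneath it), which is not described above; the "Label" key and the helper names are invented for the example.

```python
def make_page(label: str) -> dict:
    """A leaf node. 'Label' is a made-up key so the result is easy to read."""
    return {"Type": "Page", "Label": label}


def make_pages(kids: list) -> dict:
    """A branch node whose Count is the number of leaf pages underneath it."""
    count = sum(kid["Count"] if kid["Type"] == "Pages" else 1 for kid in kids)
    return {"Type": "Pages", "Kids": kids, "Count": count}


def find_page(node: dict, index: int) -> dict:
    """Return the leaf for a 0-based page index, skipping whole subtrees by Count."""
    if node["Type"] == "Page":
        if index != 0:
            raise IndexError("page index out of range")
        return node
    for kid in node["Kids"]:
        count = kid["Count"] if kid["Type"] == "Pages" else 1
        if index < count:
            return find_page(kid, index)  # the page is somewhere under this kid
        index -= count                    # skip this subtree without opening it
    raise IndexError("page index out of range")


# Same shape as the diagram: a branch holding two pages, then a single page.
ROOT = make_pages([
    make_pages([make_page("page 1"), make_page("page 2")]),
    make_page("page 3"),
])

print(ROOT["Count"])                  # prints 3
print(find_page(ROOT, 2)["Label"])    # prints page 3
```

Each level of the tree rules out whole branches with a single comparison, so the work grows with the depth of the tree and not with the number of pages.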
Each Page dictionary holds /MediaBox (the physical sheet size), /Contents (the stream that draws the page), /Resources (the fonts, images and colour spaces it uses), and optionally the print boxes /CropBox, /BleedBox, /TrimBox and /ArtBox — where TrimBox = the final cut size and BleedBox = trim + bleed.
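As a worked example of how the boxes relate, the Python sketch below derives box values from a trim size and a bleed amount, converting millimetres to PDF points (1 pt = 1/72 in). The A4 trim and the 3 mm bleed are assumptions chosen for illustration, and the MediaBox is set equal to the BleedBox to keep things short; production files often use a larger MediaBox to leave room for printer's marks.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def mm_to_pt(mm: float) -> float:
    """Convert millimetres to PDF points (1 pt = 1/72 in)."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


def page_boxes(trim_w_mm: float, trim_h_mm: float, bleed_mm: float) -> dict:
    """Return box arrays [left, bottom, right, top] in points."""
    w = mm_to_pt(trim_w_mm)
    h = mm_to_pt(trim_h_mm)
    bleed = mm_to_pt(bleed_mm)
    outer = [0.0, 0.0, w + 2 * bleed, h + 2 * bleed]
    return {
        "MediaBox": outer,
        "BleedBox": list(outer),  # trim plus bleed on all four sides
        "TrimBox": [bleed, bleed, bleed + w, bleed + h],  # the final cut size
    }


boxes = page_boxes(210, 297, 3)  # assumed job: A4 trim with 3 mm bleed
for name, box in boxes.items():
    print(name, [round(value, 2) for value in box])
```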
Units are points (1 pt = 1/72 in): US Letter is [0 0 612 792] (8.5×11 in), A4 ≈ [0 0 595 842]. The origin is the bottom-left corner and Y increases upward, which is the maths convention, not the screen convention.
8.4 Content streams — how the page is "drawn"
A page's appearance comes from its content stream, which is a tiny postfix (Reverse-Polish) program: the operands come first, then the operator. For example 1 0 0 RG sets the stroke colour to red, then 100 100 m moves the pen. It is an imperative paint program — there is no "this is a paragraph," only "put this glyph at this X/Y, draw this line, fill this path." Meaning and structure have to be reconstructed afterwards, which is exactly why text extraction and reflow are hard.
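The "operands first, operator last" rule is easy to see in code. This Python sketch groups the tokens of a small, invented content stream into (operator, operands) pairs. It is a toy under a stated assumption: it only understands numeric operands, whereas real content streams also carry names, strings, arrays and inline data.

```python
def parse_content_stream(source: str) -> list:
    """Group whitespace-separated tokens into (operator, [operands]) pairs."""
    operations = []
    operands = []
    for token in source.split():
        try:
            operands.append(float(token))  # numbers pile up as operands...
        except ValueError:
            operations.append((token, operands))  # ...until an operator uses them
            operands = []
    return operations


# Red stroke colour, move to (100, 100), line to (200, 100), stroke the path.
SAMPLE = "1 0 0 RG 100 100 m 200 100 l S"

for operator, args in parse_content_stream(SAMPLE):
    print(operator, args)
```

Nothing in the output says "this is a rule under a heading"; it is only a colour, two coordinates and a stroke, which is why structure has to be inferred afterwards.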
Graphics-state operators
- q / Q: save / restore the graphics state on a stack. Everything between them runs in isolation (colour, line width, clipping, transform). Always paired.
- cm: concatenate a matrix onto the current transformation matrix (the CTM) to scale, rotate, skew or move everything drawn afterward (six numbers).
- w: line width; J / j: line caps / joins; d: dash pattern; gs: apply an extended graphics state (transparency, blend mode).
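A minimal Python sketch of the q / Q stack and the cm operator follows. The class and helper names are invented, and only the CTM is tracked (a real graphics state also holds colour, line width, clipping and more). The six numbers a b c d e f map a point as x' = a·x + c·y + e and y' = b·x + d·y + f, and each cm applies its matrix before whatever the CTM already contained.

```python
IDENTITY = (1, 0, 0, 1, 0, 0)


def multiply(m1: tuple, m2: tuple) -> tuple:
    """Combine two six-number matrices so that m1 is applied first, then m2."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def apply(m: tuple, x: float, y: float) -> tuple:
    """Transform one point by a six-number matrix."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


class GraphicsState:
    """Tracks only the CTM, to show how q, Q and cm interact."""

    def __init__(self):
        self.ctm = IDENTITY
        self._stack = []

    def q(self):
        self._stack.append(self.ctm)  # save

    def Q(self):
        self.ctm = self._stack.pop()  # restore

    def cm(self, a, b, c, d, e, f):
        self.ctm = multiply((a, b, c, d, e, f), self.ctm)


gs = GraphicsState()
gs.q()
gs.cm(2, 0, 0, 2, 0, 0)      # scale everything by 2
gs.cm(1, 0, 0, 1, 10, 20)    # then shift by (10, 20) inside the scaled space
print(apply(gs.ctm, 1, 1))   # prints (22, 42)
gs.Q()
print(gs.ctm == IDENTITY)    # prints True
```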
Vector paths (resolution-independent)
Construct with m (moveto), l (lineto), c (cubic Bézier curve), re (rectangle) and h (closepath). Paint with S (stroke), f (fill, nonzero winding rule), f* (fill, even-odd rule), B (fill and stroke), and W/W* (use the path as a clipping path). The two fill rules decide which regions of overlapping or nested shapes count as "inside": even-odd gives a donut a real hole however its two outlines are drawn, while nonzero winding only does so when the inner outline runs in the opposite direction to the outer one.
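The difference between the two fill rules can be checked with a short Python sketch. It casts a horizontal ray from a test point and counts edge crossings: even-odd looks only at whether the count is odd, while nonzero winding adds +1 for edges crossing upward and -1 for edges crossing downward. The square "donut" and all names are invented for illustration, and the code handles straight-edged polygons only, not Bézier curves.

```python
def ray_counts(point: tuple, subpaths: list) -> tuple:
    """Return (winding number, crossing count) for a ray going right from point."""
    px, py = point
    winding = 0
    crossings = 0
    for path in subpaths:
        for i in range(len(path)):
            x1, y1 = path[i]
            x2, y2 = path[(i + 1) % len(path)]  # closes the subpath automatically
            if (y1 <= py) != (y2 <= py):        # edge straddles the ray's height
                x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
                if x_cross > px:                # crossing lies to the right
                    crossings += 1
                    winding += 1 if y2 > y1 else -1
    return winding, crossings


def inside_nonzero(point, subpaths) -> bool:
    return ray_counts(point, subpaths)[0] != 0


def inside_even_odd(point, subpaths) -> bool:
    return ray_counts(point, subpaths)[1] % 2 == 1


OUTER = [(0, 0), (10, 0), (10, 10), (0, 10)]          # counter-clockwise
INNER_SAME = [(3, 3), (7, 3), (7, 7), (3, 7)]          # counter-clockwise too
INNER_REVERSED = [(3, 3), (3, 7), (7, 7), (7, 3)]      # clockwise

centre = (5, 5)
print(inside_even_odd(centre, [OUTER, INNER_SAME]))      # prints False (hole)
print(inside_nonzero(centre, [OUTER, INNER_SAME]))       # prints True (filled in)
print(inside_nonzero(centre, [OUTER, INNER_REVERSED]))   # prints False (hole)
```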
Images
A raster image is an XObject of subtype Image — a stream listed in the page's /Resources, carrying /Width, /Height, /BitsPerComponent, /ColorSpace and a /Filter (DCTDecode = JPEG, JPXDecode = JPEG2000, FlateDecode = lossless, CCITTFaxDecode/JBIG2Decode = bilevel scans). It is drawn by setting a CTM that maps a 1×1 unit square to the desired size and position, then calling Do.
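Because the CTM decides how large the 1×1 image square ends up on the page, it also decides the image's effective resolution, which is where pixelated print comes from. The Python sketch below estimates pixels per inch from the pixel dimensions and the six CTM numbers. The function name and the sample numbers are invented, and it assumes the CTM in force when Do is called is already known.

```python
import math

POINTS_PER_INCH = 72.0


def effective_ppi(width_px: int, height_px: int, ctm: tuple) -> tuple:
    """Return (horizontal, vertical) pixels per inch for an image drawn under ctm."""
    a, b, c, d, _e, _f = ctm
    placed_width_in = math.hypot(a, b) / POINTS_PER_INCH   # length of the unit x edge
    placed_height_in = math.hypot(c, d) / POINTS_PER_INCH  # length of the unit y edge
    return width_px / placed_width_in, height_px / placed_height_in


# A 1200 x 900 pixel photo placed 4 in x 3 in (288 x 216 pt).
print(effective_ppi(1200, 900, (288, 0, 0, 216, 72, 72)))  # prints (300.0, 300.0)

# The same photo stretched to twice the size: half the resolution.
print(effective_ppi(1200, 900, (576, 0, 0, 432, 72, 72)))  # prints (150.0, 150.0)
```

The pixel count never changed between the two calls; only the placement did, which is why enlarging an image in layout software lowers its printed quality.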
Text
Text lives inside a text object between BT and ET. Tf sets the font and size (/F1 12 Tf), Td/Tm position it, Tj shows a string of glyphs, and TJ shows them with per-glyph kerning. Crucially, the string in Tj contains glyph codes, decoded to real characters through the font's Encoding and, for searchable text, a ToUnicode map.
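To show why glyph codes are not the same as characters, this Python sketch decodes the same three bytes with and without a mapping table. The font, its codes and the table are all hypothetical: a subset font is imagined to have numbered its three glyphs 1, 2 and 3, and a plain dictionary stands in for a real ToUnicode CMap.

```python
from typing import Optional

# Hypothetical subset font: glyph code -> the character that glyph represents.
TO_UNICODE = {0x01: "P", 0x02: "D", 0x03: "F"}

# The bytes a Tj operator would show with that font selected.
SHOWN_CODES = bytes([0x01, 0x02, 0x03])


def extract_text(codes: bytes, to_unicode: Optional[dict]) -> str:
    """Turn glyph codes into text, guessing blindly when no map is available."""
    if to_unicode is None:
        # With no map, a tool can only treat the codes as if they were characters.
        return codes.decode("latin-1")
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


print(repr(extract_text(SHOWN_CODES, TO_UNICODE)))  # prints 'PDF'
print(repr(extract_text(SHOWN_CODES, None)))        # prints '\x01\x02\x03'
```

The page would look identical in both cases, because drawing uses the glyph outlines directly; only copying, searching and screen readers depend on the map.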
A font can be embedded with a custom encoding but without a /ToUnicode map. The page prints perfectly but copies out as garbage and is unreadable to search and accessibility tools.
8.5 Fonts — the number-one cause of print disasters
Font types PDF supports
| Type | What it is | Print note |
|---|---|---|
| Type 1 | Classic Adobe PostScript outline font | Legacy only — Adobe ended authoring support in Jan 2023 |
| TrueType | Apple/MS quadratic-outline font | Very common, well supported |
| OpenType | Modern wrapper holding either TrueType (glyf) or PostScript (CFF) outlines | PDF 1.6+; advanced typography |
| Type 0 / Composite (CID-keyed) | "Big alphabet" wrapper (CIDFontType0 or CIDFontType2) | Required for >256 glyphs: CJK, large Unicode, complex scripts |
| Type 3 | Glyphs drawn as arbitrary PDF graphics | Rare; often non-scalable — usually undesirable for print |
Embedding vs. subsetting
- Embedding
- The font program is stored inside the PDF (a FontFile, FontFile2 or FontFile3 stream in the font descriptor, depending on the font format). It guarantees identical output on any viewer or RIP.
- Subsetting
- Embed only the glyphs actually used. Shrinks the file dramatically and is the prepress default. Subset fonts get a name prefixed with 6 uppercase letters + a "+", e.g. ABCDEF+Helvetica. Spotting XXXXXX+Name in the font list confirms it is embedded and subset.
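The subset naming rule is simple enough to check mechanically. The Python sketch below tests a font name for the six-uppercase-letters-plus-"+" prefix and strips it to recover the base name. The function names and sample font names are invented, and the check works on name strings only; it does not open a PDF or confirm that a font file stream is actually present.

```python
import re

# Six uppercase letters followed by "+", at the very start of the name.
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def is_subset_name(font_name: str) -> bool:
    """True when the name carries a subset tag such as ABCDEF+."""
    return SUBSET_PREFIX.match(font_name) is not None


def base_name(font_name: str) -> str:
    """Strip the subset tag, if any, to get the underlying font name."""
    return SUBSET_PREFIX.sub("", font_name)


for name in ["ABCDEF+Helvetica", "Helvetica", "abcdef+Helvetica"]:
    print(name, is_subset_name(name), base_name(name))
```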
Why missing or unembedded fonts ruin print
If a font is not embedded, the viewer or RIP must substitute. Acrobat fakes a match using Multiple-Master substitutes (Adobe Serif MM / Adobe Sans MM); a print RIP often just drops to Courier or Helvetica. Substitutes have different glyph shapes and different advance widths, so text reflows: words overlap, lines re-wrap, paragraphs push off the page, and special characters or logos turn into wrong glyphs or empty boxes (□). Print RIPs are stricter than screen viewers, so a file can look fine on screen and still substitute on the press.
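The reflow effect comes down to arithmetic: a line's width is the sum of its glyphs' advance widths, scaled by the font size. The Python sketch below compares one word set in a proportional font and in a monospaced substitute. The proportional widths are illustrative values, not taken from any real font; the monospaced figure of 600 units per glyph matches Courier, and widths are expressed in the usual 1/1000 of the font size.

```python
# Illustrative advance widths in 1/1000 of the font size (not from a real font).
PROPORTIONAL_WIDTHS = {"m": 800, "i": 250, "n": 550, "a": 550, "l": 250}
MONOSPACE_WIDTH = 600  # every Courier glyph advances by 600 units


def text_width_pt(text: str, font_size: int, widths: dict = None) -> float:
    """Width of a run of text in points; with no table, assume the monospaced width."""
    units = sum(MONOSPACE_WIDTH if widths is None else widths[ch] for ch in text)
    return units * font_size / 1000


word = "minimal"
original = text_width_pt(word, 12, PROPORTIONAL_WIDTHS)
substituted = text_width_pt(word, 12)

print(original)      # prints 41.4
print(substituted)   # prints 50.4
print(round(substituted / original, 2))  # prints 1.22
```

In this made-up case the substituted word is about 22% wider. Spread across a full line, that is enough to push words into each other or onto the next line, since the content stream positioned the text for the original widths.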
The Standard 14 ("Base-14") fonts (Helvetica, Times, Courier — four styles each — plus Symbol and ZapfDingbats) were historically never embedded because every PostScript device had them. For modern print you should embed them anyway.
Check that every font is embedded and that XXXXXX+FontName subset prefixes are present before plating.
8.6 Colour spaces in PDF
| Colour space | What it is | Print suitability |
|---|---|---|
| DeviceGray | One 0–1 channel (g/G) | Fine for grayscale / black-only |
| DeviceRGB | Red/Green/Blue (rg/RG), screen-oriented, no defined gamut | Poor as a final print space — looks different on every device |
| DeviceCMYK | Cyan/Magenta/Yellow/Key ink % (k/K), press-oriented, no press/paper profile | Colour undefined until you know the printing condition |
| ICCBased | RGB/CMYK/Gray with an embedded ICC profile pinning an exact appearance | The industry standard for colour-managed work |
| CIE-based (CalGray, CalRGB, Lab) | Device-independent reference spaces; Lab is the absolute reference | Used internally for conversions |
| Separation | A single named spot colorant (Pantone, varnish, white) + alternate space + tint transform | Real press uses a dedicated plate; others approximate |
| DeviceN | Several named colorants at once (multi-spot, duotones) | Same idea as Separation, N inks |
| Indexed | A palette into a base space | Small fixed colour sets, paletted images |
For PDF/X, a single OutputIntent declares the intended printing condition (e.g. GRACoL2013, FOGRA39/FOGRA51, US Web Coated SWOP) — the reference everything in the file is meant to be reproduced against.
- A PDF is a random-access database of objects indexed by an xref table; readers open it from the end (trailer → xref → Catalog → page tree).
- Pages are painted by a postfix content stream of operators: vectors (m/l/c/f), images (1×1 square + Do), and text (BT…ET, glyph codes decoded via Encoding/ToUnicode); measured in points (1 pt = 1/72 in) from a bottom-left origin.
- Fonts are the top print risk: always embed and subset (look for the XXXXXX+Name prefix); unembedded fonts substitute to Courier/Helvetica and reflow text, and RIPs are stricter than screen viewers.
- For colour, prefer ICCBased over device spaces, keep Separation for true spot plates, and declare an OutputIntent so the press condition is unambiguous.
- Ship PDF/X-4 and preflight before plating — it enforces font embedding and surfaces colour, transparency and resolution problems before they cost a print run.