Production Readiness & Reliability Reference

By Pritesh Yadav

Canonical reliability reference for Print-Flow-360 (Laravel 11 + PostgreSQL, admin Nuxt, storefront Nuxt, Vue designer, Node pdf-service with BullMQ). Authored 2026-06-15. Based on a current-state codebase scan plus web best-practice research.

Why this doc is serious: print jobs are PHYSICAL. A lost order, a missing print-ready file, or a re-run of a paid job costs real money and ships the wrong thing (or nothing) to a paying customer. Reliability here is part of the fulfillment contract, not an ops nicety.


1. Executive summary

In plain language: the platform can currently lose paying customers’ data and their print-ready files with no way to get them back. The code is well-structured (clean tenancy, services, snapshots) but the reliability infrastructure around it is mostly absent or wired to development-grade defaults. One app (pdf-service) is genuinely well-instrumented; almost nothing else is.

Top risks, most dangerous first

  1. No database backups at all (CRITICAL — data loss). No backup package is installed (spatie/laravel-backup is absent from composer.json), no pg_dump/pg_basebackup/WAL archiving, no scheduled backup task in routes/console.php, and no restore path. A search of the codebase for backup/restore/dump/snapshot returns zero matches. If either PostgreSQL instance (live_db central, admin_db) is lost or corrupted, every tenant’s orders, customers, payments, and products are gone permanently. This is the single most dangerous finding.

  2. QUEUE_CONNECTION=sync (CRITICAL — paid-job loss). All 18 queued jobs run inline inside the web request (.env:57, .env.example:57). There is no async execution, no retry, no failed_jobs capture on a mid-request crash, and a PHP timeout silently destroys the work with no trace. For anything that produces or notifies about a paid customer’s file, a transient hiccup = permanently lost work, possibly after payment succeeded.

  3. Paid orders can be downloaded without their print-ready PDF, silently (CRITICAL — wrong/incomplete fulfillment). OrderController.downloadArtwork() checks each file exists and silently continues past missing ones (OrderController.php:504-513). If the print_ready_file link is lost (the DB update is “best-effort” and never throws — printReadyPersistService.js:39-59), the customer’s ZIP is missing the production artwork with no warning, and the print job can proceed without production-grade files.

  4. Error monitoring is one-app-only (HIGH — blind to failures). Only pdf-service has real observability (Sentry + Prometheus + health checks + correlation IDs). Laravel has local file logs + an error_logs DB table but no Sentry/APM. The storefront has Honeybadger (deferred-load, possibly unwired). The designer and docs apps have zero error monitoring. Backend API and designer failures are effectively invisible.

  5. Single PostgreSQL host, no replica, no failover, no PITR (HIGH). config/database.php has no read replica array and no secondary connection; central and admin DBs sit on the same localhost:5432. There is no point-in-time recovery, so even with a future nightly dump the best achievable RPO would be a full day.

  6. Redis not configured; BullMQ effectively off (HIGH). No REDIS_* vars in either app; cache driver is database. The pdf-service has full BullMQ retry/backoff infrastructure but it is disabled by default (no REDIS_HOST), so PDF generation also runs synchronously.

Environment note: .env shows APP_ENV=development, APP_DEBUG=true, file cache/session, but contains real AWS/Stripe credentials — i.e. this is a dev-configured environment in active use. The recommendations below assume you will stand up a properly-configured production environment; do not ship these dev defaults.


2. Current state

2.1 Backups / Disaster Recovery

Area | Exists today | Evidence | Missing / risk
Backup tooling | Nothing | composer.json (no spatie/laravel-backup); app/Console/Commands (26 commands, none backup); grep backup/restore/dump/snapshot = 0 hits | No logical or physical backup of any DB
Scheduled backups | None | routes/console.php:1-71 (only auth-clear, sanctum-prune, reminders, campaigns, invoices) | No backup window; nothing offsite
PITR / WAL archiving | None | config/database.php (no read/replica/failover); grep WAL = 0 | Cannot recover to a point in time
Read replica / HA | None | config/database.php (single host localhost:5432) | DB host is a single point of failure
File (S3) durability | S3 is primary disk | config/filesystems.php:48-59, .env:56, bucket printflow360 | Versioning/replication/Object-Lock not verified/enabled; tenancy filesystem bootstrapper disabled (config/tenancy.php:34) so tenants likely share a bucket without disk-level isolation
Test DB sandbox | Separate printflow360_test | .env.testing:35-45 | Useful as a restore-drill target (not yet used as one)
Recovery capability | None | grep results; no backup packages | Only recovery is manual DBA intervention

2.2 Error monitoring / observability

App | Exists today | Evidence | Missing / risk
pdf-service | Excellent: Sentry (uncaught/unhandled/5xx), Prometheus /metrics, /health + /health/ready + /health/startup, X-Request-Id correlation, Pino JSON logs | pdf-service/src/lib/sentry.js:1-105, index.js:43,79,87-98, lib/metrics.js, routes/health.js:13-54, middleware/requestLogger.js:8-28 | None significant
Laravel API | Moderate: 11 log channels, ErrorLogService persists 5xx + unhandled to error_logs table, built-in /up health, partial X-Request-Id (forwarded to pdf-service only) | bootstrap/app.php:27,82-134, config/logging.php, app/Services/ErrorLog/ErrorLogService.php:1-165, PdfServiceClient.php:303-305 | No Sentry/APM, no global request-id middleware, no request_id on error_logs, /up has no custom checks
Storefront (Nuxt) | Weak: Honeybadger, deferred-loaded on idle; client-error ingest endpoint exists | frontstore/.../honeybadger.client.ts:1-24, StorefrontErrorLogController.php, package.json:40 | Honeybadger may not be reporting; no Sentry; no confirmed client→backend error flow
Admin (Nuxt) | (not separately evidenced) | | No confirmed error monitoring
Designer (Vue) | None | designer/package.json (no Sentry/Honeybadger) | Zero visibility into a customer-facing editor
Docs | None / unknown | docs/package.json | Likely none
Distributed tracing | Partial: X-Request-Id pdf↔Laravel only | PdfServiceClient.php:280-309 | No Sentry/Datadog/Jaeger trace stitching across the stack

2.3 Queues & background work

Area | Exists today | Evidence | Missing / risk
Laravel queue driver | sync (inline, no async) | .env:57, .env.example:57 | No retries, no isolation, mid-request crash = silent loss
failed_jobs table | Defined but only fed on thrown exceptions | 0001_01_01_000002_create_jobs_table.php:37-45, config/queue.php:106-110 | Useless under sync for crashes/timeouts
Per-job retry/backoff | Almost none (2/18 set anything) | SendEmailCampaignJob.php:23,25 ($tries=1), CampaignDispatcherJob.php:22 ($timeout=300) | 16 jobs have no tries/backoff/timeout
Horizon | Not installed | composer.json (no laravel/horizon) | No queue UI, retry, or worker health
Redis | Not configured | no REDIS_* in .env/.env.example; config/cache.php:18 default database | No persistent/Redis queue; BullMQ can’t run
pdf-service BullMQ | Built (3 attempts, exp backoff, 7d failed / 24h completed retention) but disabled | pdf-service/src/queue/pdfQueue.js:35-40, config/env.js:43-53 | No REDIS_HOST → PDF jobs run sync
Laravel → pdf-service calls | Synchronous HTTP, retry only on connection error (2x, 5s connect/30s req) | PdfServiceClient.php:268-275 | No recovery on generation failure/timeout; imposition also sync (:235-239)
pdf-service flags | All delegation flags default off; silent fallback to DOMPDF (strict_mode=false) | config/pdf_service.php:46-57 | No print jobs use pdf-service yet
Cross-system job tracking | None | pdf-service/src/db/models/PDFJob.js (own DB, not linked to failed_jobs) | No end-to-end job recovery between Laravel and Node
Worker concurrency | WORKER_CONCURRENCY=2 (unused while BullMQ off) | pdf-service/src/config/env.js:51 | Potential bottleneck for bulk print once enabled

2.4 Print-file durability & integrity

Area | Exists today | Evidence | Missing / risk
Generation & storage | 300-DPI PDFs (PDFKit+sharp) → per-tenant S3 or local; new timestamped file per regen | pdfStorageService.js:40-66, S3Storage.js:54-69 | Old files never deleted → file proliferation on retries
Link to design | persistPrintReadyFile() writes designer_documents.print_ready_file; order snapshots it at checkout w/ live fallback | commit 12847afa, printReadyPersistService.js:28-60, StorefrontCheckoutController::resolveDesignPrintReadyFile() | DB link write is best-effort, never throws (:39-59) → silent link loss
Corruption validation | Only buffer.length > 0 | pdfStorageService.js:82-84, pdfGenerator.js:205-211 | Malformed-but-nonzero PDF can be stored & marked complete
Idempotency | None: each retry regenerates a NEW file | pdfQueue.js:36-39, pdfStorageService.js:25-26,81-94 | No dedup; orphan files accumulate
Order download w/ missing file | Silently skipped | OrderController.php:504-513 | Customer ZIP missing print PDF, no indication (CRITICAL)
Paid-order-without-file scenario | Snapshot may be null + link may be lost → no artwork, no alert | OrderController.php:428-440,504-513, printReadyPersistService.js:56-59 | Print job runs without production artwork (CRITICAL)
Secret coupling | APP_KEY + INTERNAL_API_SECRET must match Laravel exactly | tokenService.js, internalAuth.js, config/env.js:22,41,88-89 | Mismatch → 401 on all /internal/* (ops footgun)

2.5 Health / uptime

Area | Exists today | Evidence | Missing / risk
pdf-service probes | liveness/readiness/startup with per-dependency status + latency | routes/health.js:13-54, lib/healthcheck.js:17-124 | None
Laravel health | Built-in /up only | bootstrap/app.php:27 | No readiness check (DB/Redis/queue/disk/backup), no token-gated JSON
Nuxt apps | No /livez or /readyz evidenced | |
External uptime/synthetics | None evidenced | | No UptimeRobot/Better Stack, no checkout/PDF synthetic, no backup/scheduler heartbeat

2.6 Deploy / infrastructure

Area | Exists today | Evidence | Missing / risk
Environment posture | Dev-grade config in active use | .env:1-50,161 (APP_ENV=development, APP_DEBUG=true, file cache/session, SUBSCRIPTION_MODE=local) | Not a hardened production env
Tenancy DR nuance | DB-per-tenant via stancl; DatabaseTenancyBootstrapper disabled | config/tenancy.php:32,42-78, app/Models/Tenant.php:12-19 | Restore plan must cover central + tenant data consistently
IaC / secrets backup | None evidenced | | No infra-as-code / config backup tier
Migration rollback strategy | 274 central migrations, no documented rollback/backup-before-migrate | database/migrations | Risky schema changes have no safety net

3. Target architecture & recommendations

Sized for a small team: prefer managed and cheap; avoid over-engineering (no active/active, no Kubernetes complexity until revenue demands it).

3.1 Backups / DR — adopt a 3-tier backup model

  • Tier 1 — PostgreSQL PITR (the must-have). Use pgBackRest (gold-standard, all-in-one) or WAL-G (simplest to S3, ideal under ~100GB) for continuous archiving: weekly base backup + daily incremental + WAL archiving with archive_timeout ≈ 300s. This gives ~5-minute RPO. If you’d rather not self-host, a managed Postgres with PITR (RDS/Aurora 1s–35d, Cloud SQL, DigitalOcean daily+WAL 7d, Neon) removes the operational load — strongly preferred for a small team. Avoid Supabase free tier for production data — it has no backups/PITR.
  • Tier 2 — Offsite logical net. Add spatie/laravel-backup for nightly pg_dump ZIPs (DB + app files) to a separate S3 bucket in another region, encrypted, with Object Lock (immutability). This is portable and easy for single-table/single-tenant restores, but is NOT a PITR replacement (only as fresh as the last dump).
  • Tier 3 — Print-file protection. Enable S3 bucket versioning + cross-region (or 2nd-bucket) replication + Object Lock on the printflow360 bucket. Make file deletion soft (quarantine / lifecycle-expire after a window longer than backup retention) so a restored older DB row still finds its object.
  • Follow 3-2-1-1-0: 3 copies, 2 media, 1 offsite, 1 immutable, 0 errors. Retention: daily 7–30d / weekly 4–12w / monthly 12mo.
  • DB + object-store consistency on restore (the project’s #1 trap). The DB stores relative paths (HasImageFields/FileHelper); a point-in-time DB restore does NOT roll S3 back, producing orphaned files (safe) or missing files (the customer-facing silent-lie class this codebase guards against). Mitigate with versioning + replication + soft-deletes + a post-restore reconciliation job that diffs every DB file path against the bucket per tenant base path, restores prior versions for missing objects, and lists orphans — surfaced in plain language, never a broken download. Also re-verify order/quote/invoice snapshot file references resolve after restore.
  • “A backup you’ve never restored isn’t a backup.” Automate a weekly restore drill (cron/CI: restore base+WAL into a throwaway DB → integrity checks: key-table row counts, FK/constraint validation, app smoke test → DB-vs-bucket path diff → report → teardown). Use printflow360_test as the sandbox. Measure and trend: restore duration vs RTO, recovered point vs RPO, validation pass/fail, integrity diff, manual-step count.
  • DR strategy tier: start at robust Backup & Restore with continuous archiving (hits the targets below cheaply). Graduate to Pilot Light (always-on replicated DB + IaC-deployable app) only when an hours-long RTO becomes unacceptable.
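
The post-restore reconciliation described above reduces to a set difference per tenant. A minimal JavaScript sketch, assuming the restored database's file paths and the bucket's object keys have already been loaded into arrays and are both relative to the tenant base path. The function name reconcile and the sample paths are hypothetical and do not come from the codebase.

```javascript
// Sketch of the post-restore DB-vs-bucket diff for ONE tenant.
// Assumptions (hypothetical): dbPaths holds the relative file paths read from
// the restored database; bucketKeys holds the object keys listed under the
// tenant base path, with that base path already stripped.
function reconcile(dbPaths, bucketKeys) {
  const inDb = new Set(dbPaths.filter(Boolean)); // skip null/empty columns
  const inBucket = new Set(bucketKeys);

  // Referenced by a DB row but absent from the bucket: customer-facing risk,
  // candidates for restoring a prior object version.
  const missing = [...inDb].filter((p) => !inBucket.has(p)).sort();

  // Present in the bucket but referenced by no row: safe, listed for cleanup.
  const orphans = [...inBucket].filter((k) => !inDb.has(k)).sort();

  return { missing, orphans, ok: missing.length === 0 };
}

const report = reconcile(
  ['orders/1001/print.pdf', 'orders/1002/print.pdf', null],
  ['orders/1001/print.pdf', 'orders/0999/print.pdf'],
);
// report.missing contains 'orders/1002/print.pdf'
// report.orphans contains 'orders/0999/print.pdf'
console.log(report);
```

In a real drill the two arrays would come from a query over every file-path column and from a paginated bucket listing; the diff itself stays the same.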

3.2 Error monitoring / observability — standardize on Sentry across all apps

  • Laravel: composer require sentry/sentry-laravel, php artisan sentry:publish, wire Integration::handles($exceptions) in bootstrap/app.php. Keep send_default_pii=false; redact email/phone/address/card via before_send + server-side Data Scrubbing. Tag tenant/store/user by UUID only (matches HasUuid) — never email/name. Add a global request-id middleware and store request_id on error_logs.
  • Both Nuxt apps: @sentry/nuxt as separate Sentry projects (admin vs storefront). Upload source maps on production build (the most common setup failure if skipped). Session Replay: replaysSessionSampleRate=0, replaysOnErrorSampleRate=1.0, and do not ship replay on /checkout, /cart, /profile without masking review (customer PII). Server-side Sentry needs the built sentry.server.config.mjs loaded via --import.
  • Designer (Vue) and docs: add @sentry/vue — currently zero visibility into a customer-facing editor.
  • pdf-service: already excellent; confirm Sentry is actually enabled in production (@sentry/node, with instrument.js required FIRST and setupExpressErrorHandler registered after the routes).
  • Distributed tracing: propagate sentry-trace + baggage; add the pdf-service internal host to trace_propagation_targets so a Laravel→pdf-service trace stitches. Use one canonical SENTRY_RELEASE (git SHA) and identical SENTRY_ENVIRONMENT across every instrumented app or traces won’t stitch.
  • Cost control: prefer a traces_sampler (1.0 for errors/checkout/admin/slow paths; ~0.05–0.1 normal; 0 for /up, health, static, and the high-volume /info, /promotional-bars, and ISR routes). Sample traces, not errors.
  • Alerting: gate on rate/severity/tags, not “any new issue”; ignore expected exceptions (Validation/404/Auth/client-abort); mark vendor/ and node_modules frames as out-of-app for grouping.
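
The cost-control rule above can be expressed as a small sampler function on the Node side. This is a sketch under assumptions: it assumes the sampling context exposes the transaction name as name (for example "GET /up") and any upstream decision as parentSampled, as recent @sentry/node versions do, and the route patterns are illustrative rather than the project's real route list.

```javascript
// Hypothetical route patterns; adjust to the real route table.
const NEVER_TRACE = [
  /^\/up$/,
  /^\/health(\/|$)/,
  /^\/metrics$/,
  /^\/info(\/|$)/,
  /^\/promotional-bars(\/|$)/,
  /\.(js|css|png|svg|ico)$/, // static assets
];
const ALWAYS_TRACE = [/^\/checkout(\/|$)/, /^\/admin(\/|$)/];
const DEFAULT_RATE = 0.05;

function tracesSampler({ name = '', parentSampled } = {}) {
  // Honour an upstream decision so a distributed trace is kept or dropped whole.
  if (typeof parentSampled === 'boolean') return parentSampled ? 1 : 0;

  const path = name.replace(/^[A-Z]+\s+/, ''); // "GET /up" -> "/up"
  if (NEVER_TRACE.some((re) => re.test(path))) return 0;
  if (ALWAYS_TRACE.some((re) => re.test(path))) return 1;
  return DEFAULT_RATE;
}

// Wiring (not executed here):
// Sentry.init({ dsn: process.env.SENTRY_DSN, tracesSampler });
```

The sampler only affects traces. Error events are sent regardless of the rate returned here, which is what "sample traces, not errors" relies on.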

3.3 Queues — Redis + Horizon, idempotent jobs, alerting

  • Set QUEUE_CONNECTION=redis (treat sync as dev-only). Run Laravel Horizon with a dedicated supervisor + queue for print-file generation so an email backlog never starves file work. Keep failed_jobs as the dead-letter store; prune only after alerting/retry workflows exist.
  • Dedicated, persistent, HA Redis for queues (separate instance from cache): appendonly yes, appendfsync everysec, maxmemory-policy=noeviction (the only policy that keeps queue correctness), plus a replica + Sentinel/cluster. Many managed Redis offerings ship without persistence and with eviction enabled, so you must set these explicitly. A worker killed mid-job has that job redelivered and re-run (safe only if the job is idempotent); Redis data loss without AOF loses every job not yet persisted.
  • Idempotency is the #1 correctness rule (both Laravel & BullMQ are at-least-once). The file-producing job must key on order_id + file_version (or content hash): on entry, if the print-ready file already exists, no-op and reuse — never regenerate, double-charge, or re-notify. Persist a “generated” marker (DB unique constraint/upsert).
  • Transactional enqueue: set 'after_commit' => true (or ->afterCommit()) so a job never runs before the order+payment rows commit; consider the transactional-outbox pattern for the Laravel→Node handoff.
  • Bound every job: $tries (e.g. 5), array $backoff with jitter ([10,30,60,120]), $timeout < worker timeout, $failOnTimeout=true. Wrap flaky external calls (pdf-service, S3) with ThrottlesExceptions. Implement failed() to set the order’s file status to generation_failed, notify staff, and give the customer a recovery action — never leave a paid order silently stuck.
  • BullMQ (pdf-service): enable by setting REDIS_HOST; keep attempts + exponential backoff + jitter; bound removeOnComplete (e.g. {age:3600,count:1000}) and keep removeOnFail generous (failed print jobs = paid orders needing manual recovery); trap SIGTERM → worker.close() for graceful drain on deploy; raise lockDuration above worst-case render time to avoid false stalls; return only the relative/S3 path in returnvalue (repo rule). Model the pipeline (preflight → render → thumbnail → notify) with FlowProducer + fail-parent.
  • Alert on three signals per queue: backlog depth, oldest-job age (>5 min on the print-file queue = paid customers waiting), and failure/dead-letter growth. Page a human on any failed paid-order job.
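
The idempotency rule above (key on order_id + file_version, reuse an existing file, never regenerate) can be sketched for the pdf-service side as follows. Everything here is an assumption made for illustration: generatePrintFile, the render callback, and the job fields orderId and fileVersion are hypothetical names, and the in-memory Map stands in for a database table with a unique constraint on (order_id, file_version).

```javascript
// In-memory stand-in for a table with UNIQUE(order_id, file_version).
// In production the marker would live in PostgreSQL so it survives restarts.
const generatedByKey = new Map();

// `render` is a hypothetical async function that produces the PDF and
// resolves to its relative storage path.
function generatePrintFile(job, render) {
  const key = `${job.orderId}:${job.fileVersion}`;

  // Retry or duplicate delivery: reuse the earlier (or in-flight) result.
  if (generatedByKey.has(key)) {
    return generatedByKey.get(key).then((path) => ({ path, reused: true }));
  }

  // First delivery: record the in-flight promise before rendering starts, so
  // a concurrent duplicate in this process waits instead of rendering again.
  const pending = Promise.resolve().then(() => render(job));
  generatedByKey.set(key, pending);

  // A failed render must stay retryable, so drop the marker on failure.
  pending.catch(() => generatedByKey.delete(key));

  return pending.then((path) => ({ path, reused: false }));
}
```

With at-least-once delivery the second call for the same order and file version resolves to the same relative path with reused set to true, so no second file is written and no second notification needs to go out.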

3.4 Print-file durability — close the silent-loss gaps

  • Never silently skip a missing print file. Fix OrderController.downloadArtwork() (:504-513) to surface a plain-language error + recovery path (re-generate / contact support) instead of continue.
  • Make the print-ready link write reliable. persistPrintReadyFile() must retry and, on persistent failure, fail the job / flag the order — not swallow the error (printReadyPersistService.js:39-59).
  • Add PDF integrity validation beyond length>0: verify PDF header/EOF marker (and ideally a checksum) before marking a job complete.
  • Add idempotency + cleanup so retries reuse the existing file rather than proliferating timestamped orphans.
  • Pre-fulfillment guard: before a print job is actioned, assert the print-ready artwork resolves in storage; block/alert if not.
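
The integrity check recommended above (header, EOF marker, checksum) might look like the following. It is a structural smoke test under stated assumptions, not a full PDF parse: the function name validatePdf and the 1024-byte tail window are choices made for this sketch, not existing pdf-service code.

```javascript
const crypto = require('node:crypto');

// Returns { ok: true, sha256 } for a structurally plausible PDF buffer,
// or { ok: false, reason } otherwise.
function validatePdf(buffer) {
  if (!Buffer.isBuffer(buffer) || buffer.length === 0) {
    return { ok: false, reason: 'empty' };
  }

  // A well-formed PDF starts with the "%PDF-" header.
  if (buffer.subarray(0, 5).toString('latin1') !== '%PDF-') {
    return { ok: false, reason: 'missing %PDF- header' };
  }

  // The "%%EOF" marker should appear near the end; trailing newlines are allowed.
  const tail = buffer.subarray(Math.max(0, buffer.length - 1024)).toString('latin1');
  if (!tail.includes('%%EOF')) {
    return { ok: false, reason: 'missing %%EOF marker' };
  }

  // The checksum can be stored next to the file path and re-verified later.
  const sha256 = crypto.createHash('sha256').update(buffer).digest('hex');
  return { ok: true, sha256 };
}
```

A truncated upload typically fails the %%EOF check, which the current buffer.length > 0 test would let through. Storing the SHA-256 also gives the pre-fulfillment guard something to compare against.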

3.5 Health / uptime — 3-tier probes + external watchers

  • Laravel: keep /up as liveness; add spatie/laravel-health readiness (DB, Redis, queue, disk, backup-age) returning token-gated 200/503 JSON, plus custom S3 + pdf-service checks.
  • Nuxt + Node: expose /livez (process only) and /readyz (deps with parallel timeouts). Never put dependencies in a liveness probe (a DB blip must not trigger a restart loop).
  • External: point UptimeRobot or Better Stack at the readiness endpoints; Healthchecks.io heartbeats on backups, the scheduler, and the BullMQ worker (catches “the cron silently stopped”); a Checkly/Playwright synthetic for checkout + PDF generation (the revenue paths). Alerts: dedup + escalate + link a runbook.
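
The liveness/readiness split above can be sketched for the Node and Nuxt server side like this. The function names and the shape of the checks object are hypothetical; real checks would ping PostgreSQL, Redis, S3, and so on.

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Liveness: process only. No dependencies, so a DB blip never restarts the app.
function liveness() {
  return { status: 200 };
}

// Readiness: run every dependency check in parallel, each with its own timeout.
// `checks` maps a name to a function returning a promise (hypothetical shape).
async function readiness(checks, timeoutMs = 2000) {
  const names = Object.keys(checks);
  const settled = await Promise.allSettled(
    names.map((name) => withTimeout(Promise.resolve().then(checks[name]), timeoutMs)),
  );

  const results = {};
  settled.forEach((outcome, i) => {
    results[names[i]] = outcome.status === 'fulfilled'
      ? 'ok'
      : `fail: ${(outcome.reason && outcome.reason.message) || outcome.reason}`;
  });

  const ready = settled.every((outcome) => outcome.status === 'fulfilled');
  return { status: ready ? 200 : 503, results };
}
```

Because the checks run in parallel, the worst-case response time of /readyz is bounded by the single timeout rather than the sum of all checks, which keeps the endpoint usable for external uptime monitors.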

3.6 Deploy / infrastructure

  • Stand up a real production env: APP_ENV=production, APP_DEBUG=false, Redis-backed cache/session.
  • Infra-as-code in git + nightly config/secret backup (Tier 2 in the RPO/RTO table below).
  • Backup before migrate for risky schema changes; document rollback per migration.
  • Keep APP_KEY and INTERNAL_API_SECRET/PDF_SERVICE_INTERNAL_SECRET in sync across Laravel and pdf-service (mismatch = 401 storm).
  • Write the DR runbook: activation criteria; named roles (Incident Commander / Restore Operator / Comms Lead) with backups; call tree (DB host, S3, payment, DNS providers); decision tree by failure type; numbered copy-pasteable restore steps (provision host → restore base+WAL to target time → repoint at S3 → run reconciliation → repoint DNS); validation checklist; post-incident actual-vs-target review. Add a single-tenant accidental-deletion branch (S3 versioning + pg_dump filtered by tenant_id) — far cheaper than full PITR. Keep it in version control; rehearse during game days.

RPO / RTO targets by tier:

Tier | Data / system | RPO target | RTO target | How achieved
1 Critical | PostgreSQL (orders, customers, payments, products) | 5 min | 1–4 hrs | Weekly base + daily incremental backups + WAL archiving (archive_timeout=300s) for PITR; restore to new host
1 Critical | S3 print-ready files / designs / proofs | near-zero (last write) | 1–4 hrs | S3 versioning + cross-region/2nd-bucket replication
2 Important | App config, IaC, secrets, queue state | 24 hrs | 4–8 hrs | Infra-as-code in git + nightly config/secret backup
3 Low | Thumbnails, derived/cache assets, regenerable renders | 24 hrs+ | best-effort | Regenerate via pdf-service; replicate only if cheap

DB RPO can tighten to ~1 min (lower archive_timeout or add streaming replication) but 5 min is the cost/benefit sweet spot. Sub-30-min RTO needs warm standby — overkill until revenue justifies it.
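
To make the targets measurable during a restore drill, the comparison of "recovered point vs RPO" and "restore duration vs RTO" is simple timestamp arithmetic. A sketch, assuming the drill records four timestamps; the field names and the evaluateDrill helper are hypothetical, and the defaults use the Tier 1 PostgreSQL targets (5 min RPO, upper bound of the 1–4 hr RTO).

```javascript
// Tier 1 PostgreSQL targets: RPO 5 min, RTO 4 hrs (upper bound of 1–4 hrs).
const TIER1_DB_TARGET = { rpoSeconds: 5 * 60, rtoSeconds: 4 * 60 * 60 };

// `drill` holds four Date values recorded by the restore drill (hypothetical names):
//   failureAt         simulated moment of failure
//   recoveredTo       latest point in time the restored DB actually contains
//   restoreStartedAt  when the restore began
//   restoreFinishedAt when validation passed and the DB was usable
function evaluateDrill(drill, target = TIER1_DB_TARGET) {
  // Subtracting two Date objects yields milliseconds.
  const dataLossSeconds = (drill.failureAt - drill.recoveredTo) / 1000;
  const restoreSeconds = (drill.restoreFinishedAt - drill.restoreStartedAt) / 1000;

  return {
    dataLossSeconds,
    restoreSeconds,
    rpoMet: dataLossSeconds <= target.rpoSeconds,
    rtoMet: restoreSeconds <= target.rtoSeconds,
  };
}

const result = evaluateDrill({
  failureAt: new Date('2026-06-15T10:00:00Z'),
  recoveredTo: new Date('2026-06-15T09:57:00Z'),
  restoreStartedAt: new Date('2026-06-15T10:05:00Z'),
  restoreFinishedAt: new Date('2026-06-15T11:35:00Z'),
});
console.log(result);
```

In this example the data loss is 180 seconds and the restore takes 5400 seconds, so both Tier 1 targets are met. Trending these two numbers across weekly drills shows whether recovery is drifting toward the limits.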


4. Prioritized action plan

P0 = data-loss / paid-job-loss risks. One ordered list.

Priority | Item | Why it matters for this platform (paid-print-job risk) | Rough effort
P0 | Stand up PostgreSQL backups: managed PITR or self-hosted pgBackRest/WAL-G (weekly base + daily + WAL to offsite S3, encrypted) | Today a DB loss wipes every tenant’s orders/customers/payments permanently; there is no recovery at all | 1–3 days (managed faster)
P0 | Add Tier-2 spatie/laravel-backup nightly dump to a 2nd-region, Object-Lock bucket | Offsite immutable copy survives ransomware/accidental drop; enables per-tenant restore | 0.5 day
P0 | Enable S3 versioning + replication + Object Lock on printflow360; make file deletes soft | Lost/overwritten print-ready files = wrong physical product to a paying customer; lets restore find old objects | 0.5–1 day
P0 | Switch QUEUE_CONNECTION off sync (→ Redis) + run Horizon with a dedicated print-file queue | Inline jobs lose paid work on any crash/timeout, sometimes after payment, with no failed_jobs trace | 1–2 days (+Redis)
P0 | Fix silent print-file gaps: error (not silent skip) in downloadArtwork() (:504-513); make persistPrintReadyFile() retry/fail instead of swallow | A paid order can be “downloaded” or printed with the production PDF silently missing | 1 day
P0 | First automated restore drill + DR runbook (incl. DB↔S3 reconciliation, per-tenant) | An untested backup is not a backup; proves you can actually recover paid-order data before you need to | 1–2 days
P1 | Idempotent file-generation job (key on order+file version; reuse existing file) + afterCommit dispatch | At-least-once retries must not regenerate, double-charge, or re-notify; must not run before payment commits | 1 day
P1 | Sentry across Laravel + both Nuxt apps + designer (UUID-only tags, PII scrubbed, shared release/env) | Backend/designer failures are currently invisible, so paid-order errors go unnoticed | 1–2 days
P1 | Persistent HA Redis for queues (AOF, noeviction, replica), separate from cache; enable BullMQ in pdf-service | Redis is the durability boundary for queued paid-order work; default eviction/no-persistence loses jobs | 1 day
P1 | Failure alerting: Horizon failed jobs + BullMQ failed set + queue depth/age → page staff; set order file status + recovery path | A failed paid-order file must page a human and never leave the order silently stuck | 1 day
P1 | spatie/laravel-health readiness + external uptime (Better Stack/UptimeRobot) + Healthchecks.io heartbeats on backups/scheduler/worker | Detects a silently-dead scheduler/worker/backup before customers do | 1 day
P1 | Bound all 18 jobs (tries/backoff+jitter/timeout/failed() handler) | Email/SMS/notification failures currently vanish with no retry | 0.5–1 day
P2 | PDF integrity validation (header/EOF/checksum) before marking complete | Prevents storing a malformed-but-nonzero “print-ready” file that prints as garbage | 0.5 day
P2 | Old print-file cleanup / dedup to stop timestamped orphan proliferation | Controls storage cost from repeated regen; complements idempotency | 0.5 day
P2 | Checkout + PDF-generation synthetic monitor (Checkly/Playwright) | Catches a broken revenue path before a customer hits it | 0.5–1 day
P2 | Production env hardening (APP_ENV=production, APP_DEBUG=false, IaC, secret backup) + backup-before-migrate | Removes dev-grade exposure and schema-change risk | 1 day
P2 | Read replica / Pilot Light DB (graduate from backup-restore) | Only when hours-long RTO becomes unacceptable; not yet warranted | 2–4 days

5. References

Local source-of-truth files

  • CLAUDE.md, readme/PDF_SERVICE.md, app/Services/PdfService/PdfServiceClient.php, config/pdf_service.php
