# Troubleshooting

## Stream-log file for a fresh session is absent or empty

**Symptom:** Operator opens a new admin session, sends one turn, sees the agent reply, then `logs-read sessionKey=<…>` returns `file-not-found` or zero bytes.

**Invariant:** For every new session, the stream-log file exists on disk iff at least one token byte has been emitted, and contains the token bytes from the moment the first token returns to the operator. The single-writer mandate (2026-05-14) mechanically enforces both halves of the contract: the single writer module at `platform/ui/app/lib/claude-agent/stream-log-writer.ts` opens the file lazily on `streamLog.writeToken` (the SDK first-byte site at [`stream-parser.ts:296`](../../../ui/app/lib/claude-agent/stream-parser.ts#L296)), and the build gate `platform/ui/scripts/check-stream-log-writer.mjs` rejects every external `appendFileSync`/`createWriteStream` against the `claude-agent-stream-*` pattern at CI time. The first-token invariant is bound by `platform/scripts/__tests__/first-token-creates-stream-log.test.sh`: one operator turn, one token, `claude-agent-stream-<sessionKey>.log` exists and contains the token bytes — pass iff file present and bytes present. The hourly adherence runner `platform/scripts/log-adherence-check.sh` extends the device-side check with a duplicate-basename diagnostic (`dup-basenames=N` in the `[log-tee] adherence-check` line); `dup>0` is a P0 page meaning the writer collapse regressed.

**Diagnose if it ever recurs:** run `bash platform/scripts/__tests__/first-token-creates-stream-log.test.sh` from the install. Pass = invariant holds; any other exit = the writer-side existence contract is broken and one `[log-tee] missing-on-resolve sessionKey=<8> surface=<…>` line on `server.log` is the operator-visible signal (P0). For the duplicate-file class specifically (the 2026-05-14 recurrence trigger), `bash platform/scripts/log-adherence-check.sh` returns non-zero whenever any sessionKey has more than one `claude-agent-stream-<sk>.log` across account dirs.
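
For a quick manual look at the duplicate-file class without running the adherence script, the check can be approximated as below. This is a minimal sketch, not the script's actual logic: `find_dup_stream_logs` is a hypothetical helper, and it assumes the stream logs live somewhere under the accounts directory passed in and that archived directories under `.trash/` should be ignored.

```bash
# Hypothetical helper (not part of the platform): print every stream-log
# basename that exists more than once under the given accounts directory.
# Assumption: archived account dirs under .trash/ are not counted.
find_dup_stream_logs() {
  local accounts_dir="$1"
  find "$accounts_dir" -type f -name 'claude-agent-stream-*.log' \
       ! -path '*/.trash/*' 2>/dev/null \
    | awk -F/ '{ print $NF }' \
    | sort \
    | uniq -d
}
```

Usage: `find_dup_stream_logs <install>/data/accounts`. Empty output means no basename is duplicated; any printed basename is a candidate for the P0 described above.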

## Retrieving evidence from an rc-spawn session

rc-spawn sessions (those started via the sidebar or the `claude rc --spawn` daemon) do not write a per-account stream log under `data/accounts/<id>/logs/`. Their evidence is the Claude Code JSONL transcript in the configDir:

```
<CLAUDE_CONFIG_DIR>/projects/<slug>/<uuid>.jsonl                      # parent session
<…>/projects/<slug>/<uuid>.meta.json                                  # bridgeIds persistent map
<…>/projects/<slug>/<uuid>/subagents/agent-<hex>.jsonl               # each subagent
<…>/projects/<slug>/<uuid>/subagents/agent-<hex>.meta.json           # {"agentType",…}
```
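
When `logs-read.sh` is not at hand, the layout above can be walked directly. The sketch below only lists the parent and subagent transcript files for one session; `list_session_transcripts` is a hypothetical helper, it expects the full `<uuid>` (no prefix matching), and it performs none of the key resolution that `logs-read.sh` does.

```bash
# Hypothetical helper: list the parent transcript and every subagent
# transcript for one session uuid under a Claude config dir.
# Matches only the .jsonl transcripts, not the .meta.json sidecars.
list_session_transcripts() {
  local config_dir="$1" uuid="$2"
  find "$config_dir/projects" -type f \
       \( -path "*/$uuid.jsonl" -o -path "*/$uuid/subagents/agent-*.jsonl" \) \
       2>/dev/null \
    | LC_ALL=C sort
}
```

Usage: `list_session_transcripts "$CLAUDE_CONFIG_DIR" <uuid>`.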

**Retrieve a session's merged timeline:** `logs-read.sh <key>` with a bare key (no second argument) maps the key to the local `<uuid>` and prints one timestamp-ordered timeline merging the parent transcript with every subagent transcript. The key is resolved in order: a matching `<uuid>.jsonl` on disk; a `sessions/<pid>.json` whose `bridgeSessionId` matches; a `<uuid>.meta.json` whose `bridgeIds` carries the suffix (persistent — survives PID-file cleanup on clean exit); and finally a content scan of the top-level transcripts as last resort. Any accepted key form works: the `claude.ai` `session_<id>`, its bare suffix, or the `<uuid>` (or a unique uuid prefix).

Every subagent `is_error` tool_result is flagged inline as `‼ SUBAGENT ERROR` with the agent type, the failing tool, and the error text. The parent session's own tool errors appear as `‼ tool error`. The two are never conflated.

**Audit all silently-failed subagents:** `logs-read.sh --scan-subagent-errors [N]` walks every `subagents/agent-*.jsonl` under the configDir and lists each one carrying an `is_error` result — agent type, parent session, failing tool, error text. Optional `N` limits the scan to the `N` most-recently-modified transcripts. Use this when a delivery failure was reported but no reproduction is available.

**Quick recipes:**

```bash
# A session's merged parent+subagent timeline (subagent errors flagged inline)
~/maxy-code/platform/scripts/logs-read.sh session_<id>

# Standing audit: every subagent transcript that failed silently
~/maxy-code/platform/scripts/logs-read.sh --scan-subagent-errors

# Limit audit to the 50 most-recent transcripts
~/maxy-code/platform/scripts/logs-read.sh --scan-subagent-errors 50
```

Note: passing an explicit second argument (e.g. `logs-read.sh <key> agent-stream`) still reads the legacy per-account stream log — the bare-key JSONL path is the default when no type is given.

## A JavaScript-rendered page comes back empty from WebFetch or `url-get`

**Symptom:** A page that needs JavaScript to show its content returns empty or a shell document from `WebFetch` (summary) or `url-get` (verbatim, server-rendered).

**Resolution:** Use the `browser` core plugin's `browser-render` tool. It renders the page in the device's per-brand Chromium over the Chrome DevTools Protocol (the same browser the VNC viewer shows) and returns the rendered HTML plus visible text. It attaches to the already-running Chromium on `127.0.0.1:${CDP_PORT}` — nothing is downloaded or installed mid-session.

**Diagnose if it ever recurs:** grep the per-conversation stream log for `[browser-render]`. `rendered=true domBytes=<n>` is the healthy signal. `rendered=false outcome=cdp-unreachable` means no Chromium is listening on the brand's CDP port — confirm with `curl 127.0.0.1:<cdpPort>/json/version`. Other outcomes (`navigate-failed`, `load-timeout`, `evaluate-failed`) name the failed CDP step.
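
The grep described above can be condensed into a per-outcome tally. This is a sketch under the assumption that `rendered=false` and `outcome=<name>` appear on the same `[browser-render]` line, as in the examples above; `browser_render_failures` is a hypothetical helper name.

```bash
# Hypothetical helper: tally failed browser-render outcomes in a stream log.
# Prints one "outcome=<name> <count>" line per distinct failure outcome.
browser_render_failures() {
  grep -F '[browser-render]' "$1" \
    | grep -F 'rendered=false' \
    | grep -oE 'outcome=[a-z-]+' \
    | sort | uniq -c \
    | awk '{ print $2, $1 }'
}
```

Empty output means no failed render was logged in that file.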

## First user-domain write rejected by `[graph-write-gate] reject reason=no-admin-user`

**Symptom:** Admin chat reports "couldn't save that — set up your business profile first" or `[graph-write-gate] reject reason=no-admin-user` appears in `server.log` on the operator's first non-bootstrap write (a website, service, opening hours, etc.). Reproduces on Minimal-onboarded installs from before the seed-stamping fix shipped.

**Diagnose:** Tail the gate reject and self-heal lines together:

```
grep -E "adminuser-self-heal|graph-write-gate.*reject" <server.log>
```

- `[adminuser-self-heal] healed=1 …` followed by no `[graph-write-gate] reject` lines on subsequent writes — heal fired, the gate is now passing. Operator can retry.
- `[adminuser-self-heal] healed=0 …` + `[graph-write-gate] reject … subReason=admin-user-no-accountid` — heal couldn't reach the broken node. Most likely cause: the env-side `ACCOUNT_ID` doesn't match any `:AdminUser.userId`. Cross-check `users.json[0].userId` against `MATCH (au:AdminUser) RETURN au.userId, au.accountId` — if the userId mismatches, the post-Task-904 `[admin-invariant]` line in the same log will show `direction=users-without-account` and the repair is to align the stores per `.docs/agents.md` § "Three-store admin auth invariant", not to retry the heal.
- `[graph-write-gate] reject … subReason=no-admin-user-node` — the graph has no `:AdminUser` at all. Re-run the seed (`platform/scripts/seed-neo4j.sh`) under the install's env vars; the boot self-heal won't help because there's nothing to heal.

The `subReason=admin-user-no-accountid` path should be impossible on any install whose admin server has booted at least once after the boot self-heal shipped — if it fires, the diagnostic recipe is the cross-check above, not "rerun the heal."
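
The three cases above can be told apart from the most recent matching log line. The following is a rough sketch, not platform tooling: `triage_write_gate` and its output labels are hypothetical, and it assumes the last self-heal or reject line in the file reflects the current state.

```bash
# Hypothetical helper: classify the latest self-heal / gate-reject line.
# Output labels are illustrative only.
triage_write_gate() {
  local log="$1" last
  last=$(grep -E 'adminuser-self-heal|graph-write-gate.*reject' "$log" | tail -n 1)
  case "$last" in
    "")                                 echo "no-signal" ;;
    *subReason=no-admin-user-node*)     echo "reseed" ;;             # no :AdminUser node
    *subReason=admin-user-no-accountid*) echo "cross-check-stores" ;; # heal could not reach node
    *"[adminuser-self-heal] healed=1"*) echo "healed-retry" ;;       # heal fired, no later reject
    *)                                  echo "unclassified" ;;
  esac
}
```

Usage: `triage_write_gate <server.log>`. A `healed-retry` result only holds while no later reject line follows the heal, which is why the helper looks at the last matching line.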

## Fresh install opens to "Set your remote password" on the LAN URL

**Symptom:** On a brand-new device, the LAN URL printed by `create-maxy` (e.g. `http://maxy.local:19200`) opens to a remote-password setup page instead of admin onboarding. This was a Task-647-era regression and should not occur on any install that includes the Task-679 classifier.

**Diagnose:** On the Pi, grep the UI server log for the gate's disambiguation fields:

```
tail -200 ~/.maxy/logs/maxy-ui.log | rg '\[remote-auth\].*resolvedKind='
```

- `resolvedKind=lan` on a `login required` or `not configured` line means the classifier sees the request as local — if the browser is still on the remote-auth page, something cached the older page before the fix shipped (hard-refresh the tab).
- `resolvedKind=external` means the request chain presents as remote (routable IP in the first `x-forwarded-for` hop). On a LAN-only browser this points to a proxy or VPN rewriting headers between the browser and the Pi.
- `resolvedKind=unknown` is a defect — the classifier could not identify the TCP peer. Capture the log line and file it; do not work around it.
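
The mapping from `resolvedKind` to next step can be scripted for the most recent gate line. This is a sketch that uses plain `grep` instead of `rg`; `remote_auth_hint` is a hypothetical helper and its hint texts simply paraphrase the list above.

```bash
# Hypothetical helper: read the latest resolvedKind from the UI server log
# and print the matching next step from the list above.
remote_auth_hint() {
  local kind
  kind=$(grep -F '[remote-auth]' "$1" \
           | grep -oE 'resolvedKind=[a-z]+' \
           | tail -n 1 \
           | cut -d= -f2)
  case "$kind" in
    lan)      echo "lan: hard-refresh the tab" ;;
    external) echo "external: look for a proxy or VPN rewriting headers" ;;
    unknown)  echo "unknown: capture the log line and file a defect" ;;
    *)        echo "no resolvedKind line found" ;;
  esac
}
```

Usage: `remote_auth_hint ~/.maxy/logs/maxy-ui.log`.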

**Fix:** If all three fields confirm the LAN shape and the gate still refuses, upgrade the platform (`Software Update` from admin chat) to pick up the Task-679 classifier.

---

## Remote sign-in is rejected with "Remote access requires TLS"

**Symptom:** Posting the remote-auth password returns a plain-text `400 Remote access requires TLS` response instead of completing sign-in.

**What this means:** The login endpoint will only issue a session cookie when the request arrived over HTTPS (via the Cloudflare tunnel). Browsers silently drop `Set-Cookie: Secure` on plain-HTTP responses, so minting a cookie there would produce a dead-end redirect. An earlier fix replaced that silent failure with this loud one.

**Fix:** Reach the admin surface through the tunnel hostname (e.g. `https://admin.<your-domain>`), not an IP or plain-HTTP URL. If you need LAN access, use the LAN URL (`http://<hostname>.local:<port>`) — LAN never hits the remote-auth endpoint.

---

## Agent Not Responding

**Symptom:** You send a message and nothing comes back, or the response never arrives.

**Check:**
1. Ask Maxy: "Check system status" — the `system-status` tool will report whether all services are running
2. Check the platform logs: ask Maxy "Show me the recent logs"
3. If the admin agent itself won't start: restart the platform (see below)

**Common causes:**
- Claude API connectivity issue — check your Claude OAuth connection is still valid
- Platform process has stopped — restart it
- Network issue if accessing remotely — check your Cloudflare tunnel is running

**If the chat shows a single `[agent-loop-stop] same error twice — aborting` line and stops:** Maxy hit the same structured tool failure twice in a row inside one turn (e.g. a permission gate refused the same write twice, or two `Read` calls hit the same missing file). The runtime aborted the turn after the second occurrence to save tokens instead of running until the SDK turn budget exhausted. The blocker text names the tool and the first line of the error. Resolve the underlying cause (re-run the named skill, fix the missing prerequisite, etc.) and tap "Continue" — the next turn truly resumes the prior SDK session via the synthetic-tool-result contract, so Maxy picks up where it aborted instead of cold-querying its own session list. To see the diagnostic, ask Maxy: "Show me the most recent stall-recovery log line." Greppable post-deploy invariants: `[agent-loop-stop] reason=identical-tool-failure tool=<name> errorSignature=<sha8> toolInputDigest=<sha8>` followed by `[stall-recovery] kind=agent_loop_stop … handoff=resume-first` and on the next turn `[stall-resume] consumed kind=agent_loop_stop toolUseId=<8> priorSessionId=<8>`. The fallback path (when the SDK session id was lost) emits `handoff=metadata-only` + `[recovery-handoff] generated/consumed reason=agent-loop-stop` and the chat button reads "Start over" instead of "Continue". A `[recovery-handoff] WARN missing-on-cold-create` line means the fallback briefing wasn't persisted — surface to support.

**If a background task goes silent and the chat shows "A background task went silent — K of M completed":** Maxy's subagent stopped emitting progress for over 2 minutes. Tap "Continue" — the next turn resumes the prior session and reads a synthetic tool_result describing what completed before the pause, so the agent re-plans without losing the work it had done. Most stalls are upstream API latency rather than the subagent's approach failing — the resume-first path treats both correctly. Greppable post-deploy invariants: `[stall-recovery] kind=subagent_stalled … completed=<K>/? handoff=resume-first` followed by `[stall-resume] consumed kind=subagent_stalled toolUseId=<8>` on the next turn. If the button reads "Start over" instead, the parent's pending tool_use_id was not captured — the fallback path took over; the prior conversation is preserved as a `<recovery-context>` block in the cold-started session.
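
To predict which button the chat should show after a stall, the `handoff=` field of the latest `[stall-recovery]` line can be read directly. This is a sketch only: `expected_recovery_button` is a hypothetical helper, and it assumes the fallback path logs `handoff=metadata-only` for both stall kinds, which the text above states explicitly only for `agent_loop_stop`.

```bash
# Hypothetical helper: map the latest stall-recovery handoff mode to the
# chat button it should produce.
expected_recovery_button() {
  local mode
  mode=$(grep -F '[stall-recovery]' "$1" \
           | grep -oE 'handoff=[a-z-]+' \
           | tail -n 1 \
           | cut -d= -f2)
  case "$mode" in
    resume-first)  echo "Continue" ;;
    metadata-only) echo "Start over" ;;
    *)             echo "unknown" ;;
  esac
}
```

If the printed label disagrees with what the operator sees in chat, capture both the log line and a screenshot before retrying.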

**Agent searches the filesystem after uploading a zip.** If you uploaded a zip and the agent burns several turns running `find` / `Glob` instead of unzipping, that is the symptom of the recovery-retry attachment-context regression (now closed by the recovery context preservation contract in `.docs/agents.md`). Greppable confirmation is the `[context-overflow-recovery] retry … attachmentsCarried=<n>` line in the conversation stream log. If you see `[context-overflow-recovery] WARN attachment-context-lost`, the regression has returned — surface to support.

**Turn budget exhausted with a horizontal rule separating two assistant turns.** When Maxy reaches its turn budget and the doubled retry also runs out, the chat now shows a one-paragraph assistant message that opens with `error_max_turns turns=A→B` (initial budget → final budget) followed by the recovery copy: "I reached my turn budget of N before I could finish this request. Try sending a smaller or more focused request, or ask me to use higher effort." That message is persisted to the graph, so the next page-refresh still shows it. The thin horizontal rule labelled "Session restored after timeout." that appears above your following turn signals that the prior turn forced a cold SDK-session restart inside the same conversation (pool eviction) — the agent's response after the rule is from a fresh SDK session even though the conversation thread is unchanged. Greppable post-deploy invariants: `[context-overflow-recovery] exhausted cause=max-turns-interrupted` count equals `[admin-persist] writer=persistMessageExhaust outcome=ok` count for the same sessionId window, and one `[session-store] storeAgentSessionId` line marks the cold-restart that drove the on-screen rule.
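
The count-equality invariant above lends itself to a one-shot check. The sketch below compares the two counts in a single log file; `check_exhaust_persisted` is a hypothetical helper, and narrowing the input to one sessionId window is left to the caller.

```bash
# Hypothetical helper: compare exhausted-turn count with persisted-message
# count. Prints both counts; exit status is 0 only when they match.
check_exhaust_persisted() {
  local log="$1" exhausted persisted
  exhausted=$(grep -cF '[context-overflow-recovery] exhausted cause=max-turns-interrupted' "$log")
  persisted=$(grep -cE '\[admin-persist\].*writer=persistMessageExhaust outcome=ok' "$log")
  echo "exhausted=$exhausted persisted=$persisted"
  [ "$exhausted" -eq "$persisted" ]
}
```

A non-zero exit suggests at least one exhausted turn whose recovery message was not persisted.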


**A turn rendered in chat is missing on next page-refresh.** Before the 2026-05-07 mandate this was a class of silent failure: Neo4j persists were wrapped in a no-op error catch, so a write that threw left the artefact "rendered then disappeared on resume". The 2026-05-07 mandate makes JSONL canonical: the resume route reads the SDK transcript file at `~/.claude/projects/<project-key>/<sessionId>.jsonl` first, supplements from Neo4j, and triggers async heal-on-resume writes for any turn the JSONL has but Neo4j does not. A refreshed conversation therefore always renders what the SDK saw, regardless of write outcome. If a heal write itself fails, the chat shows a top-of-conversation banner naming the count; if every heal succeeds, the resume is silent and the missing rows are quietly restored to Neo4j. Greppable post-deploy invariants in the per-session stream log (`logs/claude-agent-stream-<sessionKey>.log`): `[admin-resume] reason=<…> source=<jsonl|jsonl-missing|neo4j-only>` (one per resume), `[admin-persist] convId=<8> writer=<…> outcome=<ok|fail|skip>` (per persist site), `[admin-persist-heal] convId=<8> turnIndex=<n> outcome=<ok|fail>` (per heal write). To force-audit a specific conversation against its Neo4j projection without re-executing it, run `tsx platform/scripts/admin-persist-audit.ts --conversation-id=<uuid> --account-id=<uuid> --session-id=<uuid>`. A non-zero exit plus per-divergence `[admin-persist-audit] expected=<message|component> missing reason=neo4j-row-absent` lines name what would have been silently lost pre-mandate.

**Wrong Claude account answering on a multi-brand device.** On a host running both Maxy and Real Agent, each brand's admin agent reads its own `~/${brand.configDir}/.claude/.credentials.json`; there is no longer a shared `~/.claude/` thrashing them against one another. If a brand reports auth failures or appears to be operating against the wrong subscription, check three things:
1. `grep "\[claude-auth\] init" ~/.${brand}/logs/server.log | tail -1` — the resolved path must end with `~/.${brand}/.claude/.credentials.json`. If a `[claude-auth] WARN cross-brand-path-detected` line is present, the runtime is still pointing at `~/.claude/`; the brand main service did not pick up the `Environment=CLAUDE_CONFIG_DIR=` setting (re-run the brand installer to refresh the unit file).
2. `diff <(jq .claudeAiOauth.accessToken ~/.maxy/.claude/.credentials.json) <(jq .claudeAiOauth.accessToken ~/.realagent/.claude/.credentials.json)` — must be non-empty after each brand's operator has run `claude /login` against distinct Anthropic accounts; if it's empty, both brands are still logged in to the same account (operator action, not a code bug).
3. `grep "\[install\] claude-creds pickup" ~/.${brand}/logs/install-*.log` — fires once on the first post-Task-923 install of any brand and moves the legacy `~/.claude/.credentials.json` into that brand's path. Subsequent brands install with no credentials and require a fresh `claude /login` inside that brand's chat (which writes to the brand-scoped path because the systemd unit env is in scope).

**All sessions on the brand stopped responding after a token expiry.** Symptom on the operator side: every spawn dies at `pid-file-timeout` and the dashboard health probe reports auth dead. Diagnose the OAuth refresh path before anything else:

1. `tail -n 300 ~/.${brand}/logs/server.log | grep -E 'auth-refresh|auth-health|invalid_grant'` — `op=lock-acquired` proves the cross-process lock is in play (Task 576). `op=skipped-fresh` means a sibling process (the admin server or a `claude` binary) already rotated the tokens during the lock wait — expected, healthy. `op=renewed expiresAt=…` is the only line that means a network refresh actually ran.
2. `outcome=fail-token` or `invalid_grant` lines mean Anthropic rejected the refresh token itself (revoked or expired beyond the rotation window). The brand needs a fresh `claude /login`. Pre-576 the most common cause was the admin server and a spawned `claude` racing to rotate the same single-use refresh token; that race is now serialised by the file lock at `~/.${brand}/.claude/.credentials.json.lock` and a re-read after the lock skips redundant refreshes.
3. `grep '\[auth-health\]' ~/.${brand}/logs/server.log | tail -n 5` — the heartbeat fires every five minutes. `status=dead expiresIn=...` means the refresh token is gone; only a re-login fixes it. `status=ok` heartbeats with no spawns in between mean the credentials file is healthy and the failure lives elsewhere.
4. The spawn-failure surface now carries `reason=auth-refresh-failed` (with `authStatus` in the JSON body) instead of generic `pid-file-timeout` whenever the credentials file is in `dead` or `expired` state at the moment of failure — visible in `grep '\[spawn-failed\]'` on server.log.
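
Steps 1 to 3 can be folded into a single verdict over the tail of `server.log`. This is a sketch with hypothetical names (`auth_refresh_verdict` and its output labels); it assumes the field spellings quoted above and treats any rejection signal in the window as decisive, even if a later refresh succeeded.

```bash
# Hypothetical helper: summarise the OAuth refresh state from the last
# 300 lines of a server log. Labels are illustrative only.
auth_refresh_verdict() {
  local recent
  recent=$(tail -n 300 "$1" | grep -E 'auth-refresh|auth-health|invalid_grant')
  if printf '%s\n' "$recent" | grep -qE 'outcome=fail-token|invalid_grant|status=dead'; then
    echo "relogin-required"      # refresh token rejected or gone
  elif printf '%s\n' "$recent" | grep -qF 'op=renewed'; then
    echo "refreshed"             # a network refresh actually ran
  elif printf '%s\n' "$recent" | grep -qF 'op=skipped-fresh'; then
    echo "sibling-refreshed"     # another process rotated the tokens
  else
    echo "no-refresh-activity"
  fi
}
```

Usage: `auth_refresh_verdict ~/.${brand}/logs/server.log`. A `relogin-required` verdict corresponds to the fresh `claude /login` described in step 2.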

---

## Memory Not Working

**Symptom:** Maxy doesn't remember things you've told it, or search returns nothing.

**Check:**
1. Ask Maxy: "Check the Neo4j connection"
2. Ask Maxy: "Search memory for [something you know was stored]"

**Common causes:**
- Neo4j service stopped — restart the platform, which restarts Neo4j
- Memory index is stale — ask Maxy: "Reindex memory"

---

## Telegram Bot Not Receiving Messages

**Symptom:** You send a message to the bot and nothing happens.

**Check:**
1. Confirm the bot token is correct: ask Maxy "What Telegram bot token is configured?"
2. Verify the bot is running: send `/start` to the bot in Telegram
3. Check the MCP server logs: ask Maxy "Show Telegram plugin logs"

**Common causes:**
- Bot token changed (if you regenerated it in BotFather) — update it by telling Maxy "Update my Telegram bot token"
- Webhook not connected — restart the platform

---

## Plugin Errors

**Symptom:** A tool fails with an error, or a plugin says it can't connect.

**Check:**
1. Ask Maxy: "Show me recent errors"
2. Ask Maxy: "Restart the [plugin name] plugin"

**Common causes:**
- Missing environment variable (API key, token) — the error message will name it; ask Maxy to help configure it
- MCP server crashed — restarting the platform restarts all MCP servers

---

## Cannot Mount the SMB Share

**Symptom:** Mounting `smb://<hostname>.local` (or `\\<hostname>.local\<brand>`) fails with a "logon failure" or the share does not appear in your network browser.

**Check:**
1. Confirm you have set a PIN in the admin UI at least once. On a fresh Pi or Hetzner box the `smbpasswd` entry does not exist until the first set-pin runs — mounts before that point always fail.
2. Use the install owner as the username (`admin` on a Pi or Hetzner box; the Linux user that ran the installer on a self-hosted laptop) and the current Maxy PIN as the password. The SMB password is not stored separately — it is the PIN.
3. If `<hostname>.local` does not resolve from your client, mount by LAN IP instead (`smb://192.168.1.50` on macOS, `\\192.168.1.50\<brand>` on Windows).
4. Rotate the PIN in the admin UI. That re-triggers the `smbpasswd` sync on the device. If the resync log line reads `[set-pin] smbpasswd sync failed owner=<unknown> rc=-1 reason=install-owner-file-missing`, restore `~/.<brand>/.install-owner` from the installer log.

See [Samba Share](./samba.md) for the full credential model and per-OS mount syntax.

---

## Restarting the Platform

From the admin interface, ask Maxy: "Restart the platform."

If Maxy itself isn't responding (the page loads but the agent won't connect), try refreshing the browser. If the page itself won't load, the platform process may have stopped — power-cycle the Raspberry Pi by unplugging and reconnecting power, then wait a minute for services to restart automatically.

---

## Checking Logs

Ask Maxy: "Show me the logs" or "Show errors from the last hour."

For specific plugin logs: "Show Telegram logs" or "Show contacts plugin logs."

Maxy has access to all platform logs and can filter them for you.

---

## Cloudflare Tunnel Down (Remote Access Broken)

**Symptom:** You can reach Maxy on your local network but not via your public domain.

**Check:** Ask Maxy "Check the Cloudflare tunnel status."

**Fix:** Ask Maxy "Restart the Cloudflare tunnel."

If the tunnel won't reconnect, re-run the Cloudflare setup: ask Maxy "Reconnect Cloudflare."

If the initial Cloudflare login fails during setup, Maxy will fall back to asking you for a connection key. You can create one in the Cloudflare dashboard (Maxy will guide you through this in the browser).

**If you switched Cloudflare accounts or are stuck on the wrong one:** ask Maxy "Reset my Cloudflare login and start over." This is a clean reset — Maxy clears every stored credential, then opens a fresh browser sign-in. The next sign-in binds to whichever Cloudflare account you choose, with no risk of the previous account's stored credentials silently coming back.

---

## "Bad Gateway" or holding page during an upgrade

`maxy-edge.service` (always-on front door) classifies upstream errors and serves a brand-aware response. There are two distinct user-visible shapes; the right one depends on what failed.

**Branded holding page (brand logo + "Starting") for ~10 s during an upgrade — this is expected and self-healing.** The edge process binds the public port immediately, but `maxy.service` (the upstream UI) takes ~10 s after restart to apply the neo4j schema and mount its 11 routes. Any browser navigation that lands during that window gets a self-contained HTML holding page that polls `/api/health` and reloads automatically once the upstream binds. The page renders the brand logo (inlined as a base64 data URI at edge boot from `<install>/server/public/brand/<assets.logo>`) and the brand display/body fonts (loaded from fonts.googleapis.com) — both paths bypass the unavailable upstream so the page never makes a same-origin asset fetch. When `brand.logoContainsName` is true the logo replaces the productName text; otherwise the page falls back to "Maxy is starting". No operator action required. The diagnostic line in `~/.maxy/logs/edge.log` is `[edge] upstream http error path=… err=connect ECONNREFUSED 127.0.0.1:<UPSTREAM_PORT> err-class=econnrefused-coldstart upstream=…` and disappears as soon as upstream binds. Boot-time confirmation that the logo resolved: `[edge] brand=<name> holding-logo=inlined assets-dir=<path>` — `holding-logo=missing` means the logo file wasn't found at `assets-dir`, the page degrades to text-only.

**Branded plain-text 502 ("Bad Gateway (Maxy unavailable)") — real upstream failure, not cold-start.** Any error class other than `ECONNREFUSED` (timeouts, resets, host-unreachable) returns the existing 502 path. The diagnostic line carries `err-class=other`. Read the log with `tail -200 ~/.maxy/logs/edge.log | rg 'err-class=other'` and check `~/.maxy/logs/server.log` for upstream stack traces — the upstream itself is the source.

**Continuous `err-class=econnrefused-coldstart` for >30 s past the last `[edge] listening` line** indicates the upstream never binds — the upgrade or boot has stalled. Recover via `sudo systemctl --user status maxy.service` and check the action runner log per the next section. Permanent-failure UI escalation (turning the holding page into an error after N seconds) is intentionally deferred.

**The literal string `maxy-ui` should never appear in `edge.log` or in any user-visible 502 body**, regardless of brand. If it does, the edge is running stale code — re-bundle and re-publish.

**Verifying the holding page locally:** `curl -sS -H 'Accept: text/html' http://127.0.0.1:<EDGE_PORT>/` while `maxy.service` is stopped should return HTML containing the brand `productName`. The `Accept: text/html` header is required — non-html clients (default `curl`, `fetch`, XHR) get the branded plain-text 502 instead, so the holding page's own `/api/health` polls don't break themselves during cold-start.
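
For a quick read of `edge.log` along the lines above, the error classes can be tallied and the stale-code marker checked in one pass. A minimal sketch: `edge_log_report` is a hypothetical helper, and it does not implement the 30-second stall rule because that depends on the log's timestamp format.

```bash
# Hypothetical helper: tally err-class values on upstream error lines and
# flag the literal "maxy-ui" string, which should never appear in edge.log.
edge_log_report() {
  local log="$1"
  grep -F '[edge] upstream http error' "$log" \
    | grep -oE 'err-class=[a-z-]+' \
    | sort | uniq -c \
    | awk '{ print $2, $1 }'
  if grep -qF 'maxy-ui' "$log"; then
    echo "stale-edge-code: literal maxy-ui present"
  fi
}
```

Usage: `edge_log_report ~/.maxy/logs/edge.log`. Only `err-class=other` counts point at a real upstream failure; a short burst of `econnrefused-coldstart` during an upgrade is expected.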

---


## Software update and Cloudflare setup

Both flows run on the native Claude Code PTY surface in admin chat (Task 287). The retired action-runner / terminal-modal troubleshooting sections that lived here have been removed because those surfaces no longer exist; failures now manifest as plain stderr from the agent-invoked Bash command, visible in chat.

- **Software update.** Re-run `npx -y @rubytech/create-<brand>@latest` from a shell; if the installer fails, its stdout is the diagnostic record. HeaderMenu turns sage when `installed === latest`.
- **Cloudflare setup.** The agent invokes `cloudflared` directly via Bash, following the cloudflare plugin's `plugins/cloudflare/references/manual-setup.md`. Failures surface as cloudflared's literal stderr plus a non-zero exit. Recovery paths live in `plugins/cloudflare/references/reset-guide.md` and `plugins/cloudflare/references/manual-setup.md`.

## Orphan Account Directory Archived to `.trash/`

**What happened:** During upgrade, the installer detected multiple account directories under `~/maxy/data/accounts/` and identified one as live (its `admins` list matches the device's `users.json`). Non-matching siblings are archived — not deleted — under `~/maxy/data/accounts/.trash/<uuid>-<ISO8601-ts>/`.

**Installer signal:** Look for these lines in the installer log or admin terminal output:

```
==> [seed] identity-match: kept=<uuid-short> via userId=<first-8>
==> [seed] swept orphan: <uuid-short> →.trash/<uuid-short>-<ts>
==> [seed] orphan sweep: moved N → ~/maxy/data/accounts/.trash/
```

**Rollback (if the wrong account was kept):** The archive is preserved verbatim. Stop the platform, move the desired directory back, restart:

```bash
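# Stop the UI service before touching account directories
sudo systemctl --user stop maxy-ui
# Archive the currently live directory under .trash/ with a UTC timestamp
mv ~/maxy/data/accounts/<live-uuid> ~/maxy/data/accounts/.trash/<live-uuid>-$(date -u +%Y%m%dT%H%M%SZ)
# Restore the archived directory under its original uuid
mv ~/maxy/data/accounts/.trash/<archived-uuid>-<ts> ~/maxy/data/accounts/<archived-uuid>
# Start the UI service again
sudo systemctl --user start maxy-ui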
```
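
The two `mv` steps are easy to get wrong under pressure, so a guarded version may help. This is a sketch under stated assumptions: `swap_account_dirs` is a hypothetical helper, it takes the archived directory name and the bare uuid as separate arguments, and it deliberately leaves stopping and starting the service to the operator.

```bash
# Hypothetical helper: archive the live account dir and restore an archived
# one, refusing to run if any precondition is not met.
swap_account_dirs() {
  local accounts_dir="$1" live_uuid="$2" archived_name="$3" archived_uuid="$4"
  local trash="$accounts_dir/.trash" ts
  ts=$(date -u +%Y%m%dT%H%M%SZ)
  [ -d "$accounts_dir/$live_uuid" ] || { echo "live dir not found" >&2; return 1; }
  [ -d "$trash/$archived_name" ]   || { echo "archived dir not found" >&2; return 1; }
  [ ! -e "$accounts_dir/$archived_uuid" ] || { echo "restore target exists" >&2; return 1; }
  # Archive first, then restore; stop at the first failure.
  mv "$accounts_dir/$live_uuid" "$trash/$live_uuid-$ts" \
    && mv "$trash/$archived_name" "$accounts_dir/$archived_uuid"
}
```

Usage: `swap_account_dirs ~/maxy/data/accounts <live-uuid> <archived-uuid>-<ts> <archived-uuid>`, run while the platform is stopped.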

**`.trash/` retention:** Archived directories are kept indefinitely. The platform never auto-empties `.trash/`. When you're confident the archived orphans are truly obsolete, remove the directory manually: `rm -rf ~/maxy/data/accounts/.trash/<uuid>-<ts>/`.

**Installer aborted with "identity-match FAILED":** Multi-account installs where no sibling matches `users.json[0].userId` abort loud — the installer refuses to pick one and refuses to sweep. Resolution: inspect `account.json` in each candidate dir (listed in the abort output), identify the correct owner, move the other(s) aside manually, then re-run the installer.

**A chat turn looks broken — assistant bubble never rendered:** Open `claude-agent-stream-<sessionKey>.log` and grep for `[sse-client]`. The five phases (`connected`, `event_received`, `render_complete`, `error`, `close`) tell the story in order. Missing `connected` = the chat fetch never returned 200; missing `event_received` = the server emitted nothing or the client lost the stream before the first frame; missing `render_complete` = the reducer never committed the assistant bubble (persist_ack never arrived).
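
The phase walk above can be automated to name the first missing phase. This is a sketch only: `first_missing_sse_phase` is a hypothetical helper, and it assumes each phase name appears as a whole word on a `[sse-client]` line (the exact line format is not specified here).

```bash
# Hypothetical helper: report the first success-path phase that never
# appears on a [sse-client] line, or "none" if all three are present.
first_missing_sse_phase() {
  local log="$1" phase
  for phase in connected event_received render_complete; do
    if ! grep -F '[sse-client]' "$log" | grep -qw "$phase"; then
      echo "$phase"
      return 0
    fi
  done
  echo "none"
}
```

The printed phase maps directly onto the three "missing" cases described above.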

## Admin DevTools console floods with `onboarding-banner-mount` or `sessions-poll` lines

**Regression symptom.** Open DevTools on the admin shell at `/` with `onboardingComplete=false`, leave the page idle for a minute, then scroll back through the console. Thousands of `[admin-ui] onboarding-banner-mount onboardingComplete=false` lines (one per AdminShell render, ~40/min driven by the 3s sessions poll) with no per-tick poll telemetry indicate that the banner-mount log has regressed back into the render body.

**Steady-state invariants at `/`:**

- `grep -c '\[admin-ui\] onboarding-banner-mount' ~/.maxy/logs/admin-ui-console.log` equals page-load count plus onboarding-flip count, not the render count. A sustained climb at idle means the banner-mount log has regressed back into the render body.
- `grep -c '\[admin-ui\] sessions-poll' ~/.maxy/logs/admin-ui-console.log` over a 60-minute idle window equals zero. The hook no longer installs a `setInterval`; every `sessions-poll` line is operator-triggered (initial mount, refresh button, post-mutation refetch). One or more lines during operator idle means `setInterval` was reinstated.
- `outcome=error` lines name a real fetch failure on an operator-triggered refetch, set the `error` field, and surface in the sidebar.

**Reconcile signal:**

- `grep -c '\[admin-ui\] sidebar-meta-pane-reconcile' ~/.maxy/logs/admin-ui-console.log` should equal the count of End / Resume / Purge clicks while the metadata pane was open. A `to=gone` line without a paired Close click means the pane's auto-close logic regressed.
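
The three counts used in the invariants above can be collected in one call and compared before and after an idle window. A minimal sketch; `admin_ui_counts` is a hypothetical helper name.

```bash
# Hypothetical helper: print the line count for each [admin-ui] tag that
# the invariants above rely on, one "tag=count" pair per line.
admin_ui_counts() {
  local log="$1" tag
  for tag in onboarding-banner-mount sessions-poll sidebar-meta-pane-reconcile; do
    printf '%s=%s\n' "$tag" "$(grep -cF "[admin-ui] $tag" "$log")"
  done
}
```

Usage: run `admin_ui_counts ~/.maxy/logs/admin-ui-console.log` twice, once at the start and once at the end of an idle window. Any growth in `sessions-poll` or `onboarding-banner-mount` between the two runs points at the regressions described above.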

**Why this matters.** The render-body log was misleading: it read as "the admin agent is checking onboarding state continuously", when in fact `onboardingComplete` had not changed at all. The fix moved the log into `useEffect(…, [])` and then dropped the per-tick poll entirely, so a quiet console is now the steady state. With both fixes in place, console output is a faithful record of what the page actually did on each operator click.
