Hardik 0814b040e1 fix(automation): bound + detach the Claude run so a hang can't wedge or stick

First live run on PR #126 left the foreground shell stuck: Claude pushed the
commit itself and then did not exit, so the supervisor code after it (push +
ack + handled marker) never ran. Under cron that also holds the flock lock,
freezing every later run, and the comment never gets marked handled (so it
would be re-processed forever).

- Wrap the Claude invocation in `setsid timeout -k 30s "$CLAUDE_TIMEOUT"`
  (default 30m). `setsid` detaches from the controlling terminal so a lingering
  child can't stick an interactive run; `timeout` returns control to the
  supervisor, which still pushes any commits (idempotent if Claude pushed) and
  writes the handled marker. A timed-out run (rc=124) is logged, not fatal.
- Give this watcher its own dev port (devPort, default 3101) distinct from the
  issue watcher's 3100, and reap it after each run -- no cross-watcher kill.
- Reinforce the prompt: stop any dev server and END THE TURN; never push.

Adds claudeTimeout + devPort to the example config and documents both.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-24 16:27:22 +05:30

19 KiB

Raw Blame History

Automated issue-to-deploy pipeline

End-to-end flow from a user clicking Report Issue in the portal to a fix running in production:

Portal header (bug icon)                          [App/components/layout/report-issue-button.tsx]
        │  server action → Forgejo API
        ▼
Forgejo issue  (label: portal)                    [git.pelagiamarine.com/shad0w/pelagia-portal]
        │  polled every 10 min by Windows Scheduled Task "PelagiaClaudeIssueWatcher"
        ▼
TRIAGE  (watcher phase 1)                          [dev PC, headless Claude Code, analysis only]
        │  Claude reads the issue + repo, posts a requirements-breakdown comment,
        │  and routes it: adds `claude-queue` (auto-fixable) or `interactive` (human)
        ▼
FIX  (watcher phase 2, only for claude-queue)      [headless Claude Code in C:\...\src\pelagia-autofix]
        │  Claude implements + verifies fix; watcher pushes branch claude/issue-N
        │  and opens a PR (label: claude-pr)
        ▼
Human review: merge the PR, then create a release tag vX.Y.Z
        │  tag push triggers .forgejo/workflows/deploy.yml
        ▼
forgejo-runner on pms1 (pm2: forgejo-runner, label "host")
        │  checks out the tag in ~/pms, pnpm install + build + prisma migrate deploy
        ▼
pm2 restart ppms  →  live at pms.pelagiamarine.com

interactive-routed issues stop after triage for a human to pick up (run with Claude in a steered session). The triage breakdown comment is plain (no bot marker) so, for claude-queue issues, the fix stage reads it back as refined requirements.

Contribution policy (all changes via PR)

Every change lands through a pull request — no direct pushes to master. This applies to humans and to the automated pipeline alike (the watcher already opens PRs).

Each PR must include:

Tests for any code change. Model: the integration test on claude/issue-12 — it targets the prod-mirror test DB, anchors on existing rows, inserts fixtures via raw SQL (schema-tolerant), isolates them with a unique prefix, and cleans up in afterEach. Docs/config/automation-only PRs are exempt.
Docs updates where relevant (App/README.md, App/CLAUDE.md, Docs/, this file, CHANGELOG.md).

Enforcement — .forgejo/workflows/pr-checks.yml runs on every PR into master:

Test-presence gate: a PR touching App/app|lib|components|hooks with no test change fails. Justify genuine exceptions in the PR body for a reviewer to override.
Type-check: pnpm type-check must be clean across the whole project (tests included). The test suite's old type baseline was repaired when this gate landed.
Unit tests: pnpm test must pass.

All three are hard gates. pnpm lint is intentionally not run — it currently requires an interactive ESLint migration (a follow-up). Integration tests are type-checked here but executed against the pelagia_test DB by the autofix / locally (not in this shared CI, to avoid prod-mirror schema drift).

A PULL_REQUEST_TEMPLATE.md carries the checklist.

Components

Piece	Where	Notes
Report Issue button	`App/components/layout/report-issue-button.tsx` + `report-issue-actions.ts`	Any signed-in user; files issue with only the `portal` label (triage routes it)
Forgejo helper	`App/lib/forgejo.ts`	Needs `FORGEJO_URL`, `FORGEJO_REPO`, `FORGEJO_TOKEN` env (token scope: `write:issue`)
Issue watcher (active)	`automation/claude-issue-watcher.sh` on pms1	Bash port; runs 24/7 via cron. Config + logs under `~/issue-watcher/`
Issue watcher (Windows, disabled)	`automation/claude-issue-watcher.ps1`	PowerShell original. `PelagiaClaudeIssueWatcher` task is disabled (pms1 is the sole worker; two pollers would race)
PR review-comment watcher	`automation/claude-pr-review-watcher.sh` on pms1	Addresses `claude-review:` comments on Claude-raised PRs. Own cron entry, own clone (`~/pelagia-pr-review`), own config + lock. See below
Forgejo helper	`App/lib/forgejo.ts`	Needs `FORGEJO_URL`, `FORGEJO_REPO`, `FORGEJO_TOKEN` env (token scope: `write:issue`)
Deploy workflow	`.forgejo/workflows/deploy.yml`	Triggers on `v*` tags; runs on the `host` runner
Runner	pms1 `~/forgejo-runner`, pm2 process `forgejo-runner`	Registered as `pms1-host` with labels `host`, `docker`

Where the watcher runs (pms1)

The watcher runs on pms1 under cron (every 10 min), polling Forgejo over the local loopback (http://127.0.0.1:3001).

Script: ~/issue-watcher/claude-issue-watcher.sh (source: automation/claude-issue-watcher.sh)
Config: ~/issue-watcher/watcher.config.json (gitignored; holds the token + claudeExe = the nvm claude path)
Work clone: ~/pelagia-autofix (separate from the deployed ~/pms)
Logs: ~/issue-watcher/logs/ (watcher-<date>.log, per-issue claude-*.log, cron.log)
Crontab: */10 * * * * PATH=<nvm bin>:... ~/issue-watcher/claude-issue-watcher.sh >> ~/issue-watcher/logs/cron.log 2>&1

Auth: Claude Code must be signed in on pms1 (ssh in, run claude, complete the login → writes ~/.claude/.credentials.json). The watcher has a preflight that no-ops until those credentials exist, so cron can be enabled before sign-in and activates automatically once signed in. (An ANTHROPIC_API_KEY env var also satisfies it.)

The Windows variant (.ps1 + register-watcher-task.ps1) is the portable fallback; re-enable its task only if pms1 is unavailable, and disable one before enabling the other.

PR review-comment watcher

Where the issue watcher turns issues into PRs, the PR review-comment watcher (automation/claude-pr-review-watcher.sh) closes the loop on the other side: it addresses review comments left on the PRs Claude already raised. This is how you iterate on an automated PR without dropping into an interactive session — leave a comment, Claude pushes a follow-up commit.

How to use it (as a reviewer): on any open Claude-raised PR, leave a comment that starts with the marker claude-review: — the text after the marker is the instruction. It works in three places:

the PR conversation (a normal PR comment),
a review summary (the overall body of a submitted review),
an inline / on-file comment (Claude is given the file, line, and diff hunk).

Example inline comment on App/lib/foo.ts:

claude-review: this should null-check order.vendor before dereferencing it, and add a test for the null case.

What the watcher does each run (every 10 min via cron):

Lists open PRs Claude raised — head branch starts with prBranchPrefix (claude/) or the PR is labelled claude-pr.
Collects every claude-review: comment from repo collaborators only (write access; the repo owner is always included). Comments from anyone else, and the bot's own comments, are ignored. This is the safety gate — only trusted users can make Claude push code.
Skips comments already handled in a previous run (tracked by a hidden  marker the bot stamps on its acknowledgements, so a 10-minute poll never redoes the same comment).
Checks out the PR's own branch in ~/pelagia-pr-review, runs headless Claude Code with the collected instructions (+ the same pelagia_test / port-3100 test environment the fixer uses), then pushes the new commit(s) to the same branch — updating the open PR in place.
Acknowledges: posts a reply listing what it addressed (with the handled marker) and adds a 🚀 reaction to each handled PR-conversation comment.

If Claude judges a comment unclear, out of scope, or too risky to do unattended (migrations, payments, permissions), it makes no commit for it and the watcher posts a "produced no change — a human may need to take these" reply. The comments are still marked handled so the poll doesn't loop on them; re-comment with a clearer claude-review: instruction to retry.

Deploy on pms1 (mirrors the issue watcher):

# 1. Place the script + config alongside the issue watcher
cp automation/claude-pr-review-watcher.sh        ~/pr-review-watcher/
cp automation/pr-review-watcher.config.example.json ~/pr-review-watcher/pr-review-watcher.config.json
# 2. Edit the config: real token (scope write:repository,write:issue), claudeExe = `which claude`
# 3. Add a crontab entry, OFFSET from the issue watcher so the two don't run at the same minute:
#    5,15,25,35,45,55 * * * * PATH=<nvm bin>:$PATH ~/pr-review-watcher/claude-pr-review-watcher.sh >> ~/pr-review-watcher/logs/cron.log 2>&1

Token scope: needs write:repository (push to the PR branch) plus write:issue (post comments + reactions) — one scope more than the issue watcher.
Own everything: separate clone (~/pelagia-pr-review), config (pr-review-watcher.config.json), and lock (.pr-review-watcher.lock) so it never races the issue watcher. Logs land in the same logs/ dir (pr-review-<date>.log, per-PR claude-pr-<n>-*.log).
Same auth preflight as the issue watcher — no-ops until Claude Code is signed in on pms1 (or ANTHROPIC_API_KEY is set).
Bounded + detached run: each Claude invocation is wrapped in setsid timeout (claudeTimeout, default 30m). setsid detaches it from any terminal (so a manual run can't leave your shell stuck on a lingering child); timeout guarantees control returns to the supervisor — so even a stuck/misbehaving run still gets its commits pushed and its handled: marker written, and can never wedge the flock lock for later cron runs. Runtime checks use port 3101 (devPort), distinct from the issue watcher's 3100, and the watcher reaps that port after every run.
A Windows .ps1 port is not provided yet (pms1 is the sole worker); port it from claude-issue-watcher.ps1 only if you need a failover.

Updating the deployed copy: update-pr-review-watcher.sh refreshes the watcher script in one command, from a dedicated self-update checkout (~/pr-review-watcher/.src) that never races the issue watcher's clone. Copy it once, then:

cp automation/update-pr-review-watcher.sh ~/pr-review-watcher/   # one-time
~/pr-review-watcher/update-pr-review-watcher.sh                  # pull from master
~/pr-review-watcher/update-pr-review-watcher.sh some/branch      # or a branch (pre-merge testing)

It reads the live config for the token/URL, never clobbers the config, and self-updates.

Test database (for autofix verification)

So the fix stage can verify against realistic data without touching production:

pelagia_test — a PostgreSQL database on pms1, owned by pelagia_user, that is a daily mirror of production (pelagia). Created once as superuser; refreshed by automation/refresh-test-db.sh via cron at 03:30 (pg_dump pelagia | psql pelagia_test).
The autofix clone's ~/pelagia-autofix/App/.env points DATABASE_URL at pelagia_test and runs in safe dev mode — no Resend/SSO secrets, so email is console-logged and storage is local. NEXTAUTH_URL/PORT are set to 3100 (production app is on 3000).
The fix prompt tells Claude it may run integration tests against this DB (set -a; . ./.env; set +a; pnpm test:integration) and may start a dev server on port 3100 only, stopping it by port (fuser -k 3100/tcp) — never a broad pkill next, which would take down production (it also runs a next-server).

Because the test DB is refreshed daily, anything the autofix writes to it (test data, schema experiments) is disposable. Schema-migration issues are routed to interactive by triage, so the unattended fixer should not be altering the schema anyway.

Staging (smoke test before deploy)

automation/staging-up.sh (deployed to ~/issue-watcher/ on pms1) brings up a staging instance of the latest master so changes can be clicked through before a release tag deploys them to prod.

Checkout: ~/pelagia-staging (separate from ~/pms and ~/pelagia-autofix)
Process: pm2 ppms-staging on port 3200, against the prod-mirror test DB (pelagia_test), safe dev mode (console email, local storage, SSO disabled).
Auto-refresh: .forgejo/workflows/staging.yml rebuilds staging on every push to master (i.e. every merged PR) on the host runner, so staging always tracks the trunk. It runs ~/issue-watcher/staging-up.sh; concurrent runs are coalesced (newest master wins). Also triggerable on demand (workflow_dispatch).
Manual refresh / restart: re-run ~/issue-watcher/staging-up.sh.
Stop: pm2 delete ppms-staging.
Access is SSH-tunnel only — the dev server binds to 127.0.0.1:3200, so it is not reachable from the public internet. Open a tunnel and browse http://localhost:3200: ssh -L 3200:localhost:3200 shad0w@<pms1>. On Windows, the desktop shortcut "Pelagia Staging (tunnel)" (automation/staging-tunnel.cmd) opens the tunnel and the browser in one click.
A fixed banner "INTERNAL DEV / STAGING - NOT PRODUCTION" is shown (driven by NEXT_PUBLIC_ENV_LABEL in the staging .env; the EnvBanner component renders nothing when the var is unset, so production is unaffected).
Log in with a password user (SSO is off here), e.g. admin@pelagiamarine.com.

Issue label lifecycle

portal ──(triage)──▶ triaged + claude-queue ─▶ claude-working ─▶ claude-pr | claude-failed
              └─────▶ triaged + interactive   (stops here — handle with Claude interactively)

Triage owns routing for every portal issue. Each untriaged portal issue is triaged once (maxTriagePerRun per run); triage adds triaged, a routing label (claude-queue or interactive), a type label (bug or feature), and posts a breakdown. Triage skips an issue only once it carries triaged, interactive, claude-working, claude-pr, or claude-failed.
claude-queue alone does NOT skip triage on a portal issue. The Report Issue button may stamp claude-queue at creation; triage still claims the issue and decides routing (stripping the stray claude-queue if it routes to interactive). This is why triage works even if an older button build is deployed.
claude-queue → claude-working → claude-pr (PR opened) or claude-failed.
To retry a failed issue, re-add claude-queue (and remove claude-failed).
To queue a non-portal issue for Claude (skipping triage), add claude-queue directly — triage never claims issues without the portal label.
To force a portal issue straight to fix, add triaged + claude-queue yourself.

Releasing

⚠️ Release tags MUST be v-prefixed (e.g. v0.2.2). deploy.yml triggers only on v* tags — a bare tag like 0.2.2 will NOT deploy (the runner ignores it and prod stays on the previous version). Push the tag specifically; pushing master alone never deploys.

After merging PR(s) on master:

git pull
git tag v0.2.2          # MUST start with "v"; semver: patch = fixes, minor = features
git push pms1 v0.2.2    # pushing the v* tag is what triggers the deploy

The runner checks out the tag in ~/pms, runs pnpm install + build + prisma migrate deploy, pm2 restart ppms, and verifies /login returns 200. Watch progress under Actions on the Forgejo repo, or pm2 logs forgejo-runner on pms1.

Microservices (GstService / EpfoService / PdfService)

The standalone Playwright services are deployed by the same v* tag as the app. ~/pms was historically a sparse checkout limited to App/, so the service folders never landed on disk; the deploy now disables sparse-checkout (idempotent) to materialise the whole tree before managing the services. After the app restart, deploy.yml:

expands the working tree (sparse-checkout disable) and exports the few secrets the services need out of ~/pms/App/.env (PDF_SERVICE_TOKEN, ALLOWED_ORIGIN, EPFO_LIVE) — never PORT or the runner's ephemeral FORGEJO_TOKEN;
for each service folder present, runs npm install + npx playwright install chromium + npm run build;
runs pm2 startOrReload ecosystem.config.js --update-env — which creates the pm2 processes on the first release and reloads them on every release after — then pm2 save;
health-checks :3003 / :3004 / :3005 (/health → 200).

ecosystem.config.js (repo root) is the source of truth: canonical pm2 names gst-service / epfo-service / pdf-service, fixed ports, and it registers only services whose folder is checked out (so a not-yet-merged service is skipped, and adopted automatically once its PR lands).

One-time alignment: if a service is already running on pms1 under a different pm2 name, delete it once (pm2 delete <old-name> && pm2 save) so the canonical process can bind its port — otherwise the new one fails on a port clash. PdfService additionally needs Chromium system libs the first time (npx playwright install --with-deps chromium, which needs sudo); the deploy's plain playwright install chromium only fetches the browser binary.

Operational notes

The watcher runs Claude Code with --dangerously-skip-permissions inside the dedicated pelagia-autofix clone — never point workDir at your main checkout.
Watcher only works issues while this PC is on; queued issues are picked up on the next run after boot (-StartWhenAvailable).
Tokens: portal-report-issue (write:issue, used by the app) and claude-watcher (write:issue + write:repository, used by the watcher). Both belong to the shad0w Forgejo account. Rotate via docker exec -u 1000 forgejo forgejo admin user generate-access-token ....
Server-side env for the button lives in ~/pms/App/.env on pms1 (FORGEJO_URL=http://127.0.0.1:3001 so it does not depend on the tunnel).

Known Forgejo 10 bug: clicking Update branch on a PR (or pushing to its head branch) can make the page show "This pull request is broken due to missing fork information" even though the PR is fine (API still reports mergeable: true). Fix: close and reopen the PR — via the UI, or:

$h = @{ Authorization = "token <claude-watcher token>" }
Invoke-RestMethod -Method Patch -Headers $h -ContentType application/json `
  -Uri https://git.pelagiamarine.com/api/v1/repos/shad0w/pelagia-portal/pulls/<N> -Body '{"state":"closed"}'
Invoke-RestMethod -Method Patch -Headers $h -ContentType application/json `
  -Uri https://git.pelagiamarine.com/api/v1/repos/shad0w/pelagia-portal/pulls/<N> -Body '{"state":"open"}'

Fixed upstream in newer Gitea/Forgejo — resolves itself if Forgejo is upgraded past v10.

19 KiB Raw Blame History