First live run on PR #126 left the foreground shell stuck: Claude pushed the commit itself and then did not exit, so the supervisor code after it (push + ack + handled marker) never ran. Under cron that also holds the flock lock, freezing every later run, and the comment never gets marked handled (so it would be re-processed forever). - Wrap the Claude invocation in `setsid timeout -k 30s "$CLAUDE_TIMEOUT"` (default 30m). `setsid` detaches from the controlling terminal so a lingering child can't stick an interactive run; `timeout` returns control to the supervisor, which still pushes any commits (idempotent if Claude pushed) and writes the handled marker. A timed-out run (rc=124) is logged, not fatal. - Give this watcher its own dev port (devPort, default 3101) distinct from the issue watcher's 3100, and reap it after each run -- no cross-watcher kill. - Reinforce the prompt: stop any dev server and END THE TURN; never push. Adds claudeTimeout + devPort to the example config and documents both. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
19 KiB
Automated issue-to-deploy pipeline
End-to-end flow from a user clicking Report Issue in the portal to a fix running in production:
Portal header (bug icon) [App/components/layout/report-issue-button.tsx]
│ server action → Forgejo API
▼
Forgejo issue (label: portal) [git.pelagiamarine.com/shad0w/pelagia-portal]
│ polled every 10 min by Windows Scheduled Task "PelagiaClaudeIssueWatcher"
▼
TRIAGE (watcher phase 1) [dev PC, headless Claude Code, analysis only]
│ Claude reads the issue + repo, posts a requirements-breakdown comment,
│ and routes it: adds `claude-queue` (auto-fixable) or `interactive` (human)
▼
FIX (watcher phase 2, only for claude-queue) [headless Claude Code in C:\...\src\pelagia-autofix]
│ Claude implements + verifies fix; watcher pushes branch claude/issue-N
│ and opens a PR (label: claude-pr)
▼
Human review: merge the PR, then create a release tag vX.Y.Z
│ tag push triggers .forgejo/workflows/deploy.yml
▼
forgejo-runner on pms1 (pm2: forgejo-runner, label "host")
│ checks out the tag in ~/pms, pnpm install + build + prisma migrate deploy
▼
pm2 restart ppms → live at pms.pelagiamarine.com
interactive-routed issues stop after triage for a human to pick up (run with
Claude in a steered session). The triage breakdown comment is plain (no bot
marker) so, for claude-queue issues, the fix stage reads it back as refined
requirements.
Contribution policy (all changes via PR)
Every change lands through a pull request — no direct pushes to master. This applies
to humans and to the automated pipeline alike (the watcher already opens PRs).
Each PR must include:
- Tests for any code change. Model: the integration test on
claude/issue-12— it targets the prod-mirror test DB, anchors on existing rows, inserts fixtures via raw SQL (schema-tolerant), isolates them with a unique prefix, and cleans up inafterEach. Docs/config/automation-only PRs are exempt. - Docs updates where relevant (
App/README.md,App/CLAUDE.md,Docs/, this file,CHANGELOG.md).
Enforcement — .forgejo/workflows/pr-checks.yml
runs on every PR into master:
- Test-presence gate: a PR touching
App/app|lib|components|hookswith no test change fails. Justify genuine exceptions in the PR body for a reviewer to override. - Type-check:
pnpm type-checkmust be clean across the whole project (tests included). The test suite's old type baseline was repaired when this gate landed. - Unit tests:
pnpm testmust pass.
All three are hard gates. pnpm lint is intentionally not run — it currently
requires an interactive ESLint migration (a follow-up). Integration tests are
type-checked here but executed against the pelagia_test DB by the autofix / locally
(not in this shared CI, to avoid prod-mirror schema drift).
A PULL_REQUEST_TEMPLATE.md carries the checklist.
Components
| Piece | Where | Notes |
|---|---|---|
| Report Issue button | App/components/layout/report-issue-button.tsx + report-issue-actions.ts |
Any signed-in user; files issue with only the portal label (triage routes it) |
| Forgejo helper | App/lib/forgejo.ts |
Needs FORGEJO_URL, FORGEJO_REPO, FORGEJO_TOKEN env (token scope: write:issue) |
| Issue watcher (active) | automation/claude-issue-watcher.sh on pms1 |
Bash port; runs 24/7 via cron. Config + logs under ~/issue-watcher/ |
| Issue watcher (Windows, disabled) | automation/claude-issue-watcher.ps1 |
PowerShell original. PelagiaClaudeIssueWatcher task is disabled (pms1 is the sole worker; two pollers would race) |
| PR review-comment watcher | automation/claude-pr-review-watcher.sh on pms1 |
Addresses claude-review: comments on Claude-raised PRs. Own cron entry, own clone (~/pelagia-pr-review), own config + lock. See below |
| Forgejo helper | App/lib/forgejo.ts |
Needs FORGEJO_URL, FORGEJO_REPO, FORGEJO_TOKEN env (token scope: write:issue) |
| Deploy workflow | .forgejo/workflows/deploy.yml |
Triggers on v* tags; runs on the host runner |
| Runner | pms1 ~/forgejo-runner, pm2 process forgejo-runner |
Registered as pms1-host with labels host, docker |
Where the watcher runs (pms1)
The watcher runs on pms1 under cron (every 10 min), polling Forgejo over the
local loopback (http://127.0.0.1:3001).
- Script:
~/issue-watcher/claude-issue-watcher.sh(source:automation/claude-issue-watcher.sh) - Config:
~/issue-watcher/watcher.config.json(gitignored; holds the token +claudeExe= the nvmclaudepath) - Work clone:
~/pelagia-autofix(separate from the deployed~/pms) - Logs:
~/issue-watcher/logs/(watcher-<date>.log, per-issueclaude-*.log,cron.log) - Crontab:
*/10 * * * * PATH=<nvm bin>:... ~/issue-watcher/claude-issue-watcher.sh >> ~/issue-watcher/logs/cron.log 2>&1
Auth: Claude Code must be signed in on pms1 (ssh in, run claude, complete
the login → writes ~/.claude/.credentials.json). The watcher has a preflight that
no-ops until those credentials exist, so cron can be enabled before sign-in and
activates automatically once signed in. (An ANTHROPIC_API_KEY env var also satisfies it.)
The Windows variant (.ps1 + register-watcher-task.ps1) is the portable fallback;
re-enable its task only if pms1 is unavailable, and disable one before enabling the other.
PR review-comment watcher
Where the issue watcher turns issues into PRs, the PR review-comment watcher
(automation/claude-pr-review-watcher.sh) closes the
loop on the other side: it addresses review comments left on the PRs Claude already
raised. This is how you iterate on an automated PR without dropping into an
interactive session — leave a comment, Claude pushes a follow-up commit.
How to use it (as a reviewer): on any open Claude-raised PR, leave a comment that
starts with the marker claude-review: — the text after the marker is the
instruction. It works in three places:
- the PR conversation (a normal PR comment),
- a review summary (the overall body of a submitted review),
- an inline / on-file comment (Claude is given the file, line, and diff hunk).
Example inline comment on App/lib/foo.ts:
claude-review:this should null-checkorder.vendorbefore dereferencing it, and add a test for the null case.
What the watcher does each run (every 10 min via cron):
- Lists open PRs Claude raised — head branch starts with
prBranchPrefix(claude/) or the PR is labelledclaude-pr. - Collects every
claude-review:comment from repo collaborators only (write access; the repo owner is always included). Comments from anyone else, and the bot's own comments, are ignored. This is the safety gate — only trusted users can make Claude push code. - Skips comments already handled in a previous run (tracked by a hidden
<!-- ppms-review-bot handled: … -->marker the bot stamps on its acknowledgements, so a 10-minute poll never redoes the same comment). - Checks out the PR's own branch in
~/pelagia-pr-review, runs headless Claude Code with the collected instructions (+ the samepelagia_test/ port-3100 test environment the fixer uses), then pushes the new commit(s) to the same branch — updating the open PR in place. - Acknowledges: posts a reply listing what it addressed (with the handled marker) and adds a 🚀 reaction to each handled PR-conversation comment.
If Claude judges a comment unclear, out of scope, or too risky to do unattended
(migrations, payments, permissions), it makes no commit for it and the watcher posts a
"produced no change — a human may need to take these" reply. The comments are still
marked handled so the poll doesn't loop on them; re-comment with a clearer
claude-review: instruction to retry.
Deploy on pms1 (mirrors the issue watcher):
# 1. Place the script + config alongside the issue watcher
cp automation/claude-pr-review-watcher.sh ~/pr-review-watcher/
cp automation/pr-review-watcher.config.example.json ~/pr-review-watcher/pr-review-watcher.config.json
# 2. Edit the config: real token (scope write:repository,write:issue), claudeExe = `which claude`
# 3. Add a crontab entry, OFFSET from the issue watcher so the two don't run at the same minute:
# 5,15,25,35,45,55 * * * * PATH=<nvm bin>:$PATH ~/pr-review-watcher/claude-pr-review-watcher.sh >> ~/pr-review-watcher/logs/cron.log 2>&1
- Token scope: needs
write:repository(push to the PR branch) pluswrite:issue(post comments + reactions) — one scope more than the issue watcher. - Own everything: separate clone (
~/pelagia-pr-review), config (pr-review-watcher.config.json), and lock (.pr-review-watcher.lock) so it never races the issue watcher. Logs land in the samelogs/dir (pr-review-<date>.log, per-PRclaude-pr-<n>-*.log). - Same auth preflight as the issue watcher — no-ops until Claude Code is signed in
on pms1 (or
ANTHROPIC_API_KEYis set). - Bounded + detached run: each Claude invocation is wrapped in
setsid timeout(claudeTimeout, default30m).setsiddetaches it from any terminal (so a manual run can't leave your shell stuck on a lingering child);timeoutguarantees control returns to the supervisor — so even a stuck/misbehaving run still gets its commits pushed and itshandled:marker written, and can never wedge the flock lock for later cron runs. Runtime checks use port3101(devPort), distinct from the issue watcher's3100, and the watcher reaps that port after every run. - A Windows
.ps1port is not provided yet (pms1 is the sole worker); port it fromclaude-issue-watcher.ps1only if you need a failover.
Updating the deployed copy: update-pr-review-watcher.sh refreshes the watcher
script in one command, from a dedicated self-update checkout (~/pr-review-watcher/.src)
that never races the issue watcher's clone. Copy it once, then:
cp automation/update-pr-review-watcher.sh ~/pr-review-watcher/ # one-time
~/pr-review-watcher/update-pr-review-watcher.sh # pull from master
~/pr-review-watcher/update-pr-review-watcher.sh some/branch # or a branch (pre-merge testing)
It reads the live config for the token/URL, never clobbers the config, and self-updates.
Test database (for autofix verification)
So the fix stage can verify against realistic data without touching production:
pelagia_test— a PostgreSQL database on pms1, owned bypelagia_user, that is a daily mirror of production (pelagia). Created once as superuser; refreshed byautomation/refresh-test-db.shvia cron at 03:30 (pg_dump pelagia | psql pelagia_test).- The autofix clone's
~/pelagia-autofix/App/.envpointsDATABASE_URLatpelagia_testand runs in safe dev mode — no Resend/SSO secrets, so email is console-logged and storage is local.NEXTAUTH_URL/PORTare set to 3100 (production app is on 3000). - The fix prompt tells Claude it may run integration tests against this DB
(
set -a; . ./.env; set +a; pnpm test:integration) and may start a dev server on port 3100 only, stopping it by port (fuser -k 3100/tcp) — never a broadpkill next, which would take down production (it also runs anext-server).
Because the test DB is refreshed daily, anything the autofix writes to it (test data,
schema experiments) is disposable. Schema-migration issues are routed to interactive
by triage, so the unattended fixer should not be altering the schema anyway.
Staging (smoke test before deploy)
automation/staging-up.sh (deployed to ~/issue-watcher/ on pms1) brings up a
staging instance of the latest master so changes can be clicked through
before a release tag deploys them to prod.
- Checkout:
~/pelagia-staging(separate from~/pmsand~/pelagia-autofix) - Process: pm2
ppms-stagingon port 3200, against the prod-mirror test DB (pelagia_test), safe dev mode (console email, local storage, SSO disabled). - Auto-refresh:
.forgejo/workflows/staging.ymlrebuilds staging on every push tomaster(i.e. every merged PR) on the host runner, so staging always tracks the trunk. It runs~/issue-watcher/staging-up.sh; concurrent runs are coalesced (newest master wins). Also triggerable on demand (workflow_dispatch). - Manual refresh / restart: re-run
~/issue-watcher/staging-up.sh. - Stop:
pm2 delete ppms-staging. - Access is SSH-tunnel only — the dev server binds to
127.0.0.1:3200, so it is not reachable from the public internet. Open a tunnel and browsehttp://localhost:3200:ssh -L 3200:localhost:3200 shad0w@<pms1>. On Windows, the desktop shortcut "Pelagia Staging (tunnel)" (automation/staging-tunnel.cmd) opens the tunnel and the browser in one click. - A fixed banner "INTERNAL DEV / STAGING - NOT PRODUCTION" is shown (driven by
NEXT_PUBLIC_ENV_LABELin the staging.env; theEnvBannercomponent renders nothing when the var is unset, so production is unaffected). - Log in with a password user (SSO is off here), e.g.
admin@pelagiamarine.com.
Issue label lifecycle
portal ──(triage)──▶ triaged + claude-queue ─▶ claude-working ─▶ claude-pr | claude-failed
└─────▶ triaged + interactive (stops here — handle with Claude interactively)
- Triage owns routing for every
portalissue. Each untriaged portal issue is triaged once (maxTriagePerRunper run); triage addstriaged, a routing label (claude-queueorinteractive), a type label (bugorfeature), and posts a breakdown. Triage skips an issue only once it carriestriaged,interactive,claude-working,claude-pr, orclaude-failed. claude-queuealone does NOT skip triage on a portal issue. The Report Issue button may stampclaude-queueat creation; triage still claims the issue and decides routing (stripping the strayclaude-queueif it routes tointeractive). This is why triage works even if an older button build is deployed.claude-queue→claude-working→claude-pr(PR opened) orclaude-failed.- To retry a failed issue, re-add
claude-queue(and removeclaude-failed). - To queue a non-portal issue for Claude (skipping triage), add
claude-queuedirectly — triage never claims issues without theportallabel. - To force a portal issue straight to fix, add
triaged+claude-queueyourself.
Releasing
⚠️ Release tags MUST be
v-prefixed (e.g.v0.2.2).deploy.ymltriggers only onv*tags — a bare tag like0.2.2will NOT deploy (the runner ignores it and prod stays on the previous version). Push the tag specifically; pushingmasteralone never deploys.
After merging PR(s) on master:
git pull
git tag v0.2.2 # MUST start with "v"; semver: patch = fixes, minor = features
git push pms1 v0.2.2 # pushing the v* tag is what triggers the deploy
The runner checks out the tag in ~/pms, runs pnpm install + build +
prisma migrate deploy, pm2 restart ppms, and verifies /login returns 200. Watch
progress under Actions on the Forgejo repo, or pm2 logs forgejo-runner on pms1.
Microservices (GstService / EpfoService / PdfService)
The standalone Playwright services are deployed by the same v* tag as the app.
~/pms was historically a sparse checkout limited to App/, so the service
folders never landed on disk; the deploy now disables sparse-checkout (idempotent)
to materialise the whole tree before managing the services. After the app restart,
deploy.yml:
- expands the working tree (sparse-checkout disable) and exports the few secrets
the services need out of
~/pms/App/.env(PDF_SERVICE_TOKEN,ALLOWED_ORIGIN,EPFO_LIVE) — neverPORTor the runner's ephemeralFORGEJO_TOKEN; - for each service folder present, runs
npm install+npx playwright install chromium+npm run build; - runs
pm2 startOrReload ecosystem.config.js --update-env— which creates the pm2 processes on the first release and reloads them on every release after — thenpm2 save; - health-checks
:3003/:3004/:3005(/health→ 200).
ecosystem.config.js (repo root) is the source of truth: canonical pm2 names
gst-service / epfo-service / pdf-service, fixed ports, and it registers
only services whose folder is checked out (so a not-yet-merged service is
skipped, and adopted automatically once its PR lands).
One-time alignment: if a service is already running on pms1 under a different
pm2 name, delete it once (pm2 delete <old-name> && pm2 save) so the canonical
process can bind its port — otherwise the new one fails on a port clash. PdfService
additionally needs Chromium system libs the first time (npx playwright install --with-deps chromium, which needs sudo); the deploy's plain playwright install chromium only fetches the browser binary.
Operational notes
-
The watcher runs Claude Code with
--dangerously-skip-permissionsinside the dedicatedpelagia-autofixclone — never pointworkDirat your main checkout. -
Watcher only works issues while this PC is on; queued issues are picked up on the next run after boot (
-StartWhenAvailable). -
Tokens:
portal-report-issue(write:issue, used by the app) andclaude-watcher(write:issue + write:repository, used by the watcher). Both belong to theshad0wForgejo account. Rotate viadocker exec -u 1000 forgejo forgejo admin user generate-access-token .... -
Server-side env for the button lives in
~/pms/App/.envon pms1 (FORGEJO_URL=http://127.0.0.1:3001so it does not depend on the tunnel). -
Known Forgejo 10 bug: clicking Update branch on a PR (or pushing to its head branch) can make the page show "This pull request is broken due to missing fork information" even though the PR is fine (API still reports
mergeable: true). Fix: close and reopen the PR — via the UI, or:$h = @{ Authorization = "token <claude-watcher token>" } Invoke-RestMethod -Method Patch -Headers $h -ContentType application/json ` -Uri https://git.pelagiamarine.com/api/v1/repos/shad0w/pelagia-portal/pulls/<N> -Body '{"state":"closed"}' Invoke-RestMethod -Method Patch -Headers $h -ContentType application/json ` -Uri https://git.pelagiamarine.com/api/v1/repos/shad0w/pelagia-portal/pulls/<N> -Body '{"state":"open"}'Fixed upstream in newer Gitea/Forgejo — resolves itself if Forgejo is upgraded past v10.