pelagia-portal/automation/README.md
Hardik 7acd86e3dd
All checks were successful
PR checks / checks (pull_request) Successful in 47s
PR checks / integration (pull_request) Successful in 32s
feat(automation): watcher that addresses claude-review: comments on Claude PRs
Sibling to claude-issue-watcher.sh: polls open Claude-raised PRs (head
branch under claude/, or labelled claude-pr) for review comments carrying
the marker `claude-review:` — in the PR conversation, review summaries, or
inline on-file comments — and runs headless Claude Code on the PR's own
branch to address them, pushing the follow-up commit(s) to the same branch.

- Authorization gate: only repo collaborators (write access) + the owner can
  trigger it; the bot's own comments are ignored.
- Idempotent: handled comments are tracked by a hidden marker on the bot's
  acknowledgements, so the 10-min poll never redoes a comment.
- Own clone (~/pelagia-pr-review), config, and lock so it never races the
  issue watcher. Token needs write:repository + write:issue.

Adds the script, an example config, .gitignore entries for the live
config/lock, and an automation/README.md section with deploy + cron steps.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 15:21:04 +05:30

18 KiB

Automated issue-to-deploy pipeline

End-to-end flow from a user clicking Report Issue in the portal to a fix running in production:

Portal header (bug icon)                          [App/components/layout/report-issue-button.tsx]
        │  server action → Forgejo API
        ▼
Forgejo issue  (label: portal)                    [git.pelagiamarine.com/shad0w/pelagia-portal]
        │  polled every 10 min by Windows Scheduled Task "PelagiaClaudeIssueWatcher"
        ▼
TRIAGE  (watcher phase 1)                          [dev PC, headless Claude Code, analysis only]
        │  Claude reads the issue + repo, posts a requirements-breakdown comment,
        │  and routes it: adds `claude-queue` (auto-fixable) or `interactive` (human)
        ▼
FIX  (watcher phase 2, only for claude-queue)      [headless Claude Code in C:\...\src\pelagia-autofix]
        │  Claude implements + verifies fix; watcher pushes branch claude/issue-N
        │  and opens a PR (label: claude-pr)
        ▼
Human review: merge the PR, then create a release tag vX.Y.Z
        │  tag push triggers .forgejo/workflows/deploy.yml
        ▼
forgejo-runner on pms1 (pm2: forgejo-runner, label "host")
        │  checks out the tag in ~/pms, pnpm install + build + prisma migrate deploy
        ▼
pm2 restart ppms  →  live at pms.pelagiamarine.com

interactive-routed issues stop after triage for a human to pick up (run with Claude in a steered session). The triage breakdown comment is plain (no bot marker) so, for claude-queue issues, the fix stage reads it back as refined requirements.

Contribution policy (all changes via PR)

Every change lands through a pull request — no direct pushes to master. This applies to humans and to the automated pipeline alike (the watcher already opens PRs).

Each PR must include:

  • Tests for any code change. Model: the integration test on claude/issue-12 — it targets the prod-mirror test DB, anchors on existing rows, inserts fixtures via raw SQL (schema-tolerant), isolates them with a unique prefix, and cleans up in afterEach. Docs/config/automation-only PRs are exempt.
  • Docs updates where relevant (App/README.md, App/CLAUDE.md, Docs/, this file, CHANGELOG.md).

Enforcement.forgejo/workflows/pr-checks.yml runs on every PR into master:

  1. Test-presence gate: a PR touching App/app|lib|components|hooks with no test change fails. Justify genuine exceptions in the PR body for a reviewer to override.
  2. Type-check: pnpm type-check must be clean across the whole project (tests included). The test suite's old type baseline was repaired when this gate landed.
  3. Unit tests: pnpm test must pass.

All three are hard gates. pnpm lint is intentionally not run — it currently requires an interactive ESLint migration (a follow-up). Integration tests are type-checked here but executed against the pelagia_test DB by the autofix / locally (not in this shared CI, to avoid prod-mirror schema drift).

A PULL_REQUEST_TEMPLATE.md carries the checklist.

Components

Piece Where Notes
Report Issue button App/components/layout/report-issue-button.tsx + report-issue-actions.ts Any signed-in user; files issue with only the portal label (triage routes it)
Forgejo helper App/lib/forgejo.ts Needs FORGEJO_URL, FORGEJO_REPO, FORGEJO_TOKEN env (token scope: write:issue)
Issue watcher (active) automation/claude-issue-watcher.sh on pms1 Bash port; runs 24/7 via cron. Config + logs under ~/issue-watcher/
Issue watcher (Windows, disabled) automation/claude-issue-watcher.ps1 PowerShell original. PelagiaClaudeIssueWatcher task is disabled (pms1 is the sole worker; two pollers would race)
PR review-comment watcher automation/claude-pr-review-watcher.sh on pms1 Addresses claude-review: comments on Claude-raised PRs. Own cron entry, own clone (~/pelagia-pr-review), own config + lock. See below
Forgejo helper App/lib/forgejo.ts Needs FORGEJO_URL, FORGEJO_REPO, FORGEJO_TOKEN env (token scope: write:issue)
Deploy workflow .forgejo/workflows/deploy.yml Triggers on v* tags; runs on the host runner
Runner pms1 ~/forgejo-runner, pm2 process forgejo-runner Registered as pms1-host with labels host, docker

Where the watcher runs (pms1)

The watcher runs on pms1 under cron (every 10 min), polling Forgejo over the local loopback (http://127.0.0.1:3001).

  • Script: ~/issue-watcher/claude-issue-watcher.sh (source: automation/claude-issue-watcher.sh)
  • Config: ~/issue-watcher/watcher.config.json (gitignored; holds the token + claudeExe = the nvm claude path)
  • Work clone: ~/pelagia-autofix (separate from the deployed ~/pms)
  • Logs: ~/issue-watcher/logs/ (watcher-<date>.log, per-issue claude-*.log, cron.log)
  • Crontab: */10 * * * * PATH=<nvm bin>:... ~/issue-watcher/claude-issue-watcher.sh >> ~/issue-watcher/logs/cron.log 2>&1

Auth: Claude Code must be signed in on pms1 (ssh in, run claude, complete the login → writes ~/.claude/.credentials.json). The watcher has a preflight that no-ops until those credentials exist, so cron can be enabled before sign-in and activates automatically once signed in. (An ANTHROPIC_API_KEY env var also satisfies it.)

The Windows variant (.ps1 + register-watcher-task.ps1) is the portable fallback; re-enable its task only if pms1 is unavailable, and disable one before enabling the other.

PR review-comment watcher

Where the issue watcher turns issues into PRs, the PR review-comment watcher (automation/claude-pr-review-watcher.sh) closes the loop on the other side: it addresses review comments left on the PRs Claude already raised. This is how you iterate on an automated PR without dropping into an interactive session — leave a comment, Claude pushes a follow-up commit.

How to use it (as a reviewer): on any open Claude-raised PR, leave a comment that starts with the marker claude-review: — the text after the marker is the instruction. It works in three places:

  • the PR conversation (a normal PR comment),
  • a review summary (the overall body of a submitted review),
  • an inline / on-file comment (Claude is given the file, line, and diff hunk).

Example inline comment on App/lib/foo.ts:

claude-review: this should null-check order.vendor before dereferencing it, and add a test for the null case.

What the watcher does each run (every 10 min via cron):

  1. Lists open PRs Claude raised — head branch starts with prBranchPrefix (claude/) or the PR is labelled claude-pr.
  2. Collects every claude-review: comment from repo collaborators only (write access; the repo owner is always included). Comments from anyone else, and the bot's own comments, are ignored. This is the safety gate — only trusted users can make Claude push code.
  3. Skips comments already handled in a previous run (tracked by a hidden <!-- ppms-review-bot handled: … --> marker the bot stamps on its acknowledgements, so a 10-minute poll never redoes the same comment).
  4. Checks out the PR's own branch in ~/pelagia-pr-review, runs headless Claude Code with the collected instructions (+ the same pelagia_test / port-3100 test environment the fixer uses), then pushes the new commit(s) to the same branch — updating the open PR in place.
  5. Acknowledges: posts a reply listing what it addressed (with the handled marker) and adds a 🚀 reaction to each handled PR-conversation comment.

If Claude judges a comment unclear, out of scope, or too risky to do unattended (migrations, payments, permissions), it makes no commit for it and the watcher posts a "produced no change — a human may need to take these" reply. The comments are still marked handled so the poll doesn't loop on them; re-comment with a clearer claude-review: instruction to retry.

Deploy on pms1 (mirrors the issue watcher):

# 1. Place the script + config alongside the issue watcher
cp automation/claude-pr-review-watcher.sh        ~/pr-review-watcher/
cp automation/pr-review-watcher.config.example.json ~/pr-review-watcher/pr-review-watcher.config.json
# 2. Edit the config: real token (scope write:repository,write:issue), claudeExe = `which claude`
# 3. Add a crontab entry, OFFSET from the issue watcher so the two don't run at the same minute:
#    5,15,25,35,45,55 * * * * PATH=<nvm bin>:$PATH ~/pr-review-watcher/claude-pr-review-watcher.sh >> ~/pr-review-watcher/logs/cron.log 2>&1
  • Token scope: needs write:repository (push to the PR branch) plus write:issue (post comments + reactions) — one scope more than the issue watcher.
  • Own everything: separate clone (~/pelagia-pr-review), config (pr-review-watcher.config.json), and lock (.pr-review-watcher.lock) so it never races the issue watcher. Logs land in the same logs/ dir (pr-review-<date>.log, per-PR claude-pr-<n>-*.log).
  • Same auth preflight as the issue watcher — no-ops until Claude Code is signed in on pms1 (or ANTHROPIC_API_KEY is set).
  • A Windows .ps1 port is not provided yet (pms1 is the sole worker); port it from claude-issue-watcher.ps1 only if you need a failover.

Test database (for autofix verification)

So the fix stage can verify against realistic data without touching production:

  • pelagia_test — a PostgreSQL database on pms1, owned by pelagia_user, that is a daily mirror of production (pelagia). Created once as superuser; refreshed by automation/refresh-test-db.sh via cron at 03:30 (pg_dump pelagia | psql pelagia_test).
  • The autofix clone's ~/pelagia-autofix/App/.env points DATABASE_URL at pelagia_test and runs in safe dev mode — no Resend/SSO secrets, so email is console-logged and storage is local. NEXTAUTH_URL/PORT are set to 3100 (production app is on 3000).
  • The fix prompt tells Claude it may run integration tests against this DB (set -a; . ./.env; set +a; pnpm test:integration) and may start a dev server on port 3100 only, stopping it by port (fuser -k 3100/tcp) — never a broad pkill next, which would take down production (it also runs a next-server).

Because the test DB is refreshed daily, anything the autofix writes to it (test data, schema experiments) is disposable. Schema-migration issues are routed to interactive by triage, so the unattended fixer should not be altering the schema anyway.

Staging (smoke test before deploy)

automation/staging-up.sh (deployed to ~/issue-watcher/ on pms1) brings up a staging instance of the latest master so changes can be clicked through before a release tag deploys them to prod.

  • Checkout: ~/pelagia-staging (separate from ~/pms and ~/pelagia-autofix)
  • Process: pm2 ppms-staging on port 3200, against the prod-mirror test DB (pelagia_test), safe dev mode (console email, local storage, SSO disabled).
  • Auto-refresh: .forgejo/workflows/staging.yml rebuilds staging on every push to master (i.e. every merged PR) on the host runner, so staging always tracks the trunk. It runs ~/issue-watcher/staging-up.sh; concurrent runs are coalesced (newest master wins). Also triggerable on demand (workflow_dispatch).
  • Manual refresh / restart: re-run ~/issue-watcher/staging-up.sh.
  • Stop: pm2 delete ppms-staging.
  • Access is SSH-tunnel only — the dev server binds to 127.0.0.1:3200, so it is not reachable from the public internet. Open a tunnel and browse http://localhost:3200: ssh -L 3200:localhost:3200 shad0w@<pms1>. On Windows, the desktop shortcut "Pelagia Staging (tunnel)" (automation/staging-tunnel.cmd) opens the tunnel and the browser in one click.
  • A fixed banner "INTERNAL DEV / STAGING - NOT PRODUCTION" is shown (driven by NEXT_PUBLIC_ENV_LABEL in the staging .env; the EnvBanner component renders nothing when the var is unset, so production is unaffected).
  • Log in with a password user (SSO is off here), e.g. admin@pelagiamarine.com.

Issue label lifecycle

portal ──(triage)──▶ triaged + claude-queue ─▶ claude-working ─▶ claude-pr | claude-failed
              └─────▶ triaged + interactive   (stops here — handle with Claude interactively)
  • Triage owns routing for every portal issue. Each untriaged portal issue is triaged once (maxTriagePerRun per run); triage adds triaged, a routing label (claude-queue or interactive), a type label (bug or feature), and posts a breakdown. Triage skips an issue only once it carries triaged, interactive, claude-working, claude-pr, or claude-failed.
  • claude-queue alone does NOT skip triage on a portal issue. The Report Issue button may stamp claude-queue at creation; triage still claims the issue and decides routing (stripping the stray claude-queue if it routes to interactive). This is why triage works even if an older button build is deployed.
  • claude-queueclaude-workingclaude-pr (PR opened) or claude-failed.
  • To retry a failed issue, re-add claude-queue (and remove claude-failed).
  • To queue a non-portal issue for Claude (skipping triage), add claude-queue directly — triage never claims issues without the portal label.
  • To force a portal issue straight to fix, add triaged + claude-queue yourself.

Releasing

⚠️ Release tags MUST be v-prefixed (e.g. v0.2.2). deploy.yml triggers only on v* tags — a bare tag like 0.2.2 will NOT deploy (the runner ignores it and prod stays on the previous version). Push the tag specifically; pushing master alone never deploys.

After merging PR(s) on master:

git pull
git tag v0.2.2          # MUST start with "v"; semver: patch = fixes, minor = features
git push pms1 v0.2.2    # pushing the v* tag is what triggers the deploy

The runner checks out the tag in ~/pms, runs pnpm install + build + prisma migrate deploy, pm2 restart ppms, and verifies /login returns 200. Watch progress under Actions on the Forgejo repo, or pm2 logs forgejo-runner on pms1.

Microservices (GstService / EpfoService / PdfService)

The standalone Playwright services are deployed by the same v* tag as the app. ~/pms was historically a sparse checkout limited to App/, so the service folders never landed on disk; the deploy now disables sparse-checkout (idempotent) to materialise the whole tree before managing the services. After the app restart, deploy.yml:

  1. expands the working tree (sparse-checkout disable) and exports the few secrets the services need out of ~/pms/App/.env (PDF_SERVICE_TOKEN, ALLOWED_ORIGIN, EPFO_LIVE) — never PORT or the runner's ephemeral FORGEJO_TOKEN;
  2. for each service folder present, runs npm install + npx playwright install chromium + npm run build;
  3. runs pm2 startOrReload ecosystem.config.js --update-env — which creates the pm2 processes on the first release and reloads them on every release after — then pm2 save;
  4. health-checks :3003 / :3004 / :3005 (/health → 200).

ecosystem.config.js (repo root) is the source of truth: canonical pm2 names gst-service / epfo-service / pdf-service, fixed ports, and it registers only services whose folder is checked out (so a not-yet-merged service is skipped, and adopted automatically once its PR lands).

One-time alignment: if a service is already running on pms1 under a different pm2 name, delete it once (pm2 delete <old-name> && pm2 save) so the canonical process can bind its port — otherwise the new one fails on a port clash. PdfService additionally needs Chromium system libs the first time (npx playwright install --with-deps chromium, which needs sudo); the deploy's plain playwright install chromium only fetches the browser binary.

Operational notes

  • The watcher runs Claude Code with --dangerously-skip-permissions inside the dedicated pelagia-autofix clone — never point workDir at your main checkout.

  • Watcher only works issues while this PC is on; queued issues are picked up on the next run after boot (-StartWhenAvailable).

  • Tokens: portal-report-issue (write:issue, used by the app) and claude-watcher (write:issue + write:repository, used by the watcher). Both belong to the shad0w Forgejo account. Rotate via docker exec -u 1000 forgejo forgejo admin user generate-access-token ....

  • Server-side env for the button lives in ~/pms/App/.env on pms1 (FORGEJO_URL=http://127.0.0.1:3001 so it does not depend on the tunnel).

  • Known Forgejo 10 bug: clicking Update branch on a PR (or pushing to its head branch) can make the page show "This pull request is broken due to missing fork information" even though the PR is fine (API still reports mergeable: true). Fix: close and reopen the PR — via the UI, or:

    $h = @{ Authorization = "token <claude-watcher token>" }
    Invoke-RestMethod -Method Patch -Headers $h -ContentType application/json `
      -Uri https://git.pelagiamarine.com/api/v1/repos/shad0w/pelagia-portal/pulls/<N> -Body '{"state":"closed"}'
    Invoke-RestMethod -Method Patch -Headers $h -ContentType application/json `
      -Uri https://git.pelagiamarine.com/api/v1/repos/shad0w/pelagia-portal/pulls/<N> -Body '{"state":"open"}'
    

    Fixed upstream in newer Gitea/Forgejo — resolves itself if Forgejo is upgraded past v10.