site-agent-pro 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (81)
  1. package/README.md +689 -0
  2. package/dist/auth/credentialStore.js +62 -0
  3. package/dist/auth/inbox.js +193 -0
  4. package/dist/auth/profile.js +379 -0
  5. package/dist/auth/runner.js +1124 -0
  6. package/dist/backend/dashboardData.js +194 -0
  7. package/dist/backend/runArtifacts.js +48 -0
  8. package/dist/backend/runRepository.js +93 -0
  9. package/dist/bin.js +2 -0
  10. package/dist/cli/backfillSiteChecks.js +143 -0
  11. package/dist/cli/run.js +309 -0
  12. package/dist/cli/trade.js +69 -0
  13. package/dist/config.js +199 -0
  14. package/dist/core/agentProfiles.js +55 -0
  15. package/dist/core/aggregateReport.js +382 -0
  16. package/dist/core/audit.js +30 -0
  17. package/dist/core/customTaskSuite.js +148 -0
  18. package/dist/core/evaluator.js +217 -0
  19. package/dist/core/executor.js +788 -0
  20. package/dist/core/fallbackReport.js +335 -0
  21. package/dist/core/formHeuristics.js +411 -0
  22. package/dist/core/gameplaySummary.js +164 -0
  23. package/dist/core/interaction.js +202 -0
  24. package/dist/core/pageState.js +201 -0
  25. package/dist/core/planner.js +1669 -0
  26. package/dist/core/processSubmissionBatch.js +204 -0
  27. package/dist/core/runAuditJob.js +170 -0
  28. package/dist/core/runner.js +2352 -0
  29. package/dist/core/siteBrief.js +107 -0
  30. package/dist/core/siteChecks.js +1526 -0
  31. package/dist/core/taskDirectives.js +279 -0
  32. package/dist/core/taskHeuristics.js +263 -0
  33. package/dist/dashboard/client.js +1256 -0
  34. package/dist/dashboard/contracts.js +95 -0
  35. package/dist/dashboard/narrative.js +277 -0
  36. package/dist/dashboard/server.js +458 -0
  37. package/dist/dashboard/theme.js +888 -0
  38. package/dist/index.js +84 -0
  39. package/dist/llm/client.js +188 -0
  40. package/dist/paystack/account.js +123 -0
  41. package/dist/paystack/client.js +100 -0
  42. package/dist/paystack/index.js +13 -0
  43. package/dist/paystack/test-paystack.js +83 -0
  44. package/dist/paystack/transfer.js +138 -0
  45. package/dist/paystack/types.js +74 -0
  46. package/dist/paystack/webhook.js +121 -0
  47. package/dist/prompts/browserAgent.js +124 -0
  48. package/dist/prompts/reviewer.js +71 -0
  49. package/dist/reporting/clickReplay.js +290 -0
  50. package/dist/reporting/html.js +930 -0
  51. package/dist/reporting/markdown.js +238 -0
  52. package/dist/reporting/template.js +1141 -0
  53. package/dist/schemas/types.js +361 -0
  54. package/dist/submissions/customTasks.js +196 -0
  55. package/dist/submissions/html.js +770 -0
  56. package/dist/submissions/model.js +56 -0
  57. package/dist/submissions/publicUrl.js +76 -0
  58. package/dist/submissions/service.js +74 -0
  59. package/dist/submissions/store.js +37 -0
  60. package/dist/submissions/types.js +65 -0
  61. package/dist/trade/engine.js +241 -0
  62. package/dist/trade/evm/erc20.js +44 -0
  63. package/dist/trade/extractor.js +148 -0
  64. package/dist/trade/policy.js +35 -0
  65. package/dist/trade/session.js +31 -0
  66. package/dist/trade/types.js +107 -0
  67. package/dist/trade/validator.js +148 -0
  68. package/dist/utils/files.js +59 -0
  69. package/dist/utils/log.js +24 -0
  70. package/dist/utils/playwrightCompat.js +14 -0
  71. package/dist/utils/time.js +3 -0
  72. package/dist/wallet/provider.js +345 -0
  73. package/dist/wallet/relay.js +129 -0
  74. package/dist/wallet/wallet.js +178 -0
  75. package/docs/01-installation.md +134 -0
  76. package/docs/02-running-your-first-audit.md +136 -0
  77. package/docs/03-configuration.md +233 -0
  78. package/docs/04-how-the-agent-thinks.md +41 -0
  79. package/docs/05-extending-personas-and-tasks.md +42 -0
  80. package/docs/06-hardening-for-production.md +92 -0
  81. package/package.json +60 -0
@@ -0,0 +1,134 @@
1
+ # 01 - Installation
2
+
3
+ There are two ways to use site-agent-pro. Choose the one that fits your workflow.
4
+
5
+ ---
6
+
7
+ ## Track A — devDependency (recommended for most developers)
8
+
9
+ This is the right choice if you want to audit your own product as you build it, run audits in CI, or call site-agent-pro from scripts and test files.
10
+
11
+ ### 1. Prerequisites
12
+
13
+ - Node.js 20.10 or newer
14
+ - npm 10 or newer
15
+
16
+ ```bash
17
+ node -v
18
+ npm -v
19
+ ```
20
+
21
+ ### 2. Install in your project
22
+
23
+ ```bash
24
+ npm install --save-dev site-agent-pro
25
+ ```
26
+
27
+ ### 3. Install the Playwright browser
28
+
29
+ ```bash
30
+ npx playwright install chromium
31
+ ```
32
+
33
+ This only needs to be done once per machine or CI environment.
34
+
35
+ ### 4. Set your API key
36
+
37
+ Create a `.env` file (or set environment variables) in your project root:
38
+
39
+ ```bash
40
+ OPENAI_API_KEY=your_real_key_here
41
+ ```
42
+
43
+ Or use Ollama for local/offline development (no API key needed):
44
+
45
+ ```bash
46
+ LLM_PROVIDER=ollama
47
+ OLLAMA_MODEL=llama3.1:8b
48
+ ```
49
+
50
+ ### 5. Run your first audit
51
+
52
+ ```bash
53
+ # Against your running dev server
54
+ npx site-agent-pro --url http://localhost:3000 --task "Check the homepage CTA"
55
+
56
+ # Or add it to your package.json scripts
57
+ # "audit": "site-agent-pro --url http://localhost:3000 --task 'Check the homepage'"
58
+ # npm run audit
59
+ ```
60
+
61
+ ### 6. Or use it programmatically
62
+
63
+ ```ts
64
+ import { runAudit } from "site-agent-pro";
65
+
66
+ const result = await runAudit({
67
+   url: "http://localhost:3000",
68
+   tasks: ["Check the homepage CTA", "Try the signup flow"],
69
+ });
70
+
71
+ console.log(`Score: ${result.report.overall_score}/10`);
72
+
73
+ if (result.report.overall_score < 7) {
74
+   process.exit(1); // Fail CI
75
+ }
76
+ ```
77
+
78
+ ---
79
+
80
+ ## Track B — Clone and run (for contributors and self-hosting)
81
+
82
+ Use this if you want to run the full web dashboard, modify the source, or self-host the submission server.
83
+
84
+ ### 1. Prerequisites
85
+
86
+ - Node.js 20.10 or newer
87
+ - npm 10 or newer
88
+ - Git
89
+
90
+ ```bash
91
+ node -v
92
+ npm -v
93
+ ```
94
+
95
+ ### 2. Clone the repository
96
+
97
+ ```bash
98
+ git clone https://github.com/your-org/site-agent-pro.git
99
+ cd site-agent-pro
100
+ ```
101
+
102
+ ### 3. Install dependencies
103
+
104
+ ```bash
105
+ npm install
106
+ ```
107
+
108
+ ### 4. Install the Playwright browser
109
+
110
+ ```bash
111
+ npm run browser:install
112
+ ```
113
+
114
+ ### 5. Create your environment file
115
+
116
+ ```bash
117
+ cp .env.example .env
118
+ ```
119
+
120
+ ### 6. Add your OpenAI API key
121
+
122
+ Open `.env` and set:
123
+
124
+ ```bash
125
+ OPENAI_API_KEY=your_real_key_here
126
+ ```
127
+
128
+ ### 7. Confirm TypeScript builds cleanly
129
+
130
+ ```bash
131
+ npm run check
132
+ ```
133
+
134
+ If this fails, do not keep going and pretend everything is fine. Fix the error first.
@@ -0,0 +1,136 @@
1
+ # 02 - Running Your First Audit
2
+
3
+ ## Which command to use?
4
+
5
+ | Setup | Command |
6
+ |---|---|
7
+ | Installed as npm devDependency | `site-agent-pro --url ... --task "..."` |
8
+ | Cloned from source | `npm run dev -- --url ... --task "..."` |
9
+
10
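+ All examples below use the `site-agent-pro` command. With a devDependency install the binary lives in `node_modules/.bin`, so run it through `npx site-agent-pro` or an npm script. If you cloned the repo, replace it with `npm run dev --`.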
11
+
12
+ ---
13
+
14
+ ## 1. Start with a simple public site
15
+
16
+ ```bash
17
+ site-agent-pro --url https://example.com --task "Open pricing and compare the visible plans before signup"
18
+ ```
19
+
20
+ This creates a new run directory in `runs/`.
21
+ If you want the full local product flow, start the app with `npm run dashboard` and submit the URL through `http://localhost:4173/`.
22
+
23
+ ---
24
+
25
+ ## 2. Run against your own localhost dev server
26
+
27
+ ```bash
28
+ # Start your app first
29
+ npm run dev
30
+
31
+ # Then in another terminal
32
+ site-agent-pro --url http://localhost:3000 --task "Check the homepage CTA"
33
+ ```
34
+
35
+ This is the core use case for side-by-side development: catch UX issues as you build, not after.
36
+
37
+ ---
38
+
39
+ ## 3. Inspect the output
40
+
41
+ Each run produces a timestamped directory in `runs/` containing:
42
+
43
+ - `inputs.json`
44
+ - `raw-events.json`
45
+ - `task-results.json`
46
+ - `accessibility.json`
47
+ - `report.json`
48
+ - `report.html`
49
+ - `report.md`
50
+ - `click-replay.webp` (animated replay)
51
+ - `*.webm` (full video recording if `RECORD_VIDEO=true`)
52
+
53
+ Open `report.html` in your browser for a readable, standalone report.
54
+
55
+ ---
56
+
57
+ ## 4. Run in a visible browser
58
+
59
+ Use this while debugging interaction issues:
60
+
61
+ ```bash
62
+ site-agent-pro --url https://example.com --task "Open pricing" --headed
63
+ ```
64
+
65
+ ---
66
+
67
+ ## 5. Run as a mobile user
68
+
69
+ ```bash
70
+ site-agent-pro --url https://example.com --task "Check the mobile nav" --mobile
71
+ ```
72
+
73
+ ---
74
+
75
+ ## 6. Bootstrap an authenticated session
76
+
77
+ When the site requires signup, email verification, OTP, or login before the important content is visible:
78
+
79
+ ```bash
80
+ site-agent-pro --url https://example.com \
81
+ --task "Reach the account dashboard and confirm billing is visible" \
82
+ --auth-flow --signup-url /register --login-url /login --access-url /dashboard \
83
+ --headed
84
+ ```
85
+
86
+ If you only want the authenticated Playwright session file and not the task run:
87
+
88
+ ```bash
89
+ site-agent-pro --url https://example.com \
90
+ --auth-only --signup-url /register --login-url /login --access-url /dashboard
91
+ ```
92
+
93
+ This writes `auth-flow.json` into the run directory and saves the authenticated `storageState` so future runs can reuse it directly.
94
+
95
+ ---
96
+
97
+ ## 7. Use it programmatically (devDependency users)
98
+
99
+ Instead of the CLI, call site-agent-pro from a script or test file:
100
+
101
+ ```ts
102
+ import { runAudit } from "site-agent-pro";
103
+
104
+ const result = await runAudit({
105
+   url: "http://localhost:3000",
106
+   tasks: [
107
+     "Open pricing and compare the visible plans before signup",
108
+     "Click the sign-up button and check the form fields",
109
+   ],
110
+ });
111
+
112
+ console.log(`Score: ${result.report.overall_score}/10`);
113
+ console.log(`Strengths: ${result.report.strengths.join(", ")}`);
114
+ console.log(`Top fixes: ${result.report.top_fixes.join(", ")}`);
115
+ ```
116
+
117
+ ---
118
+
119
+ ## 8. Read the task output correctly
120
+
121
+ Do not treat the overall score as objective truth.
122
+ Use the output to answer:
123
+ - what users could do
124
+ - where they got stuck
125
+ - what broke trust
126
+ - what to fix first
127
+
128
+ ---
129
+
130
+ ## 9. Use the hosted output flow (self-hosted setup only)
131
+
132
+ When you run the local app server:
133
+ - submit a public URL from `/`
134
+ - check status at `/submissions/<submission-id>`
135
+ - open the unique public task-output link at `/r/<token>`
136
+ - download the finished output from `/dashboard`
@@ -0,0 +1,233 @@
1
+ # 03 - Configuration
2
+
3
+ ## Environment variables
4
+
5
+ ### `OPENAI_API_KEY`
6
+ Required unless you set `LLM_PROVIDER=ollama`. Your OpenAI API key.
7
+
8
+ ### `OPENAI_MODEL`
9
+ Default: `gpt-5`
10
+
11
+ Change this if you want a different compatible model.
12
+
13
+ ### `APP_BASE_URL`
14
+ Default: `http://localhost:4173`
15
+
16
+ Used when building hosted task-output links.
17
+
18
+ ### `HEADLESS`
19
+ Default: `true`
20
+
21
+ Set to `false` if you want the browser visible by default.
22
+
23
+ ### `MAX_SESSION_DURATION_MS`
24
+ Default: `600000`
25
+
26
+ Caps a single audit at 10 minutes in V1.
27
+ The code enforces a hard ceiling of 600 seconds even if you set a larger value.
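The ceiling described above can be sketched as a small clamp. This is only an illustration under stated assumptions: `resolveSessionDurationMs` and both constants are hypothetical names, and the package's real parsing logic may differ.

```typescript
// Hypothetical helper; not the package's actual implementation.
const HARD_CEILING_MS = 600_000; // 600 seconds = 10 minutes
const DEFAULT_SESSION_MS = 600_000;

function resolveSessionDurationMs(raw: string | undefined): number {
  const parsed = Number(raw);
  // Fall back to the default when the variable is unset, empty, or not a positive number.
  if (raw === undefined || raw.trim() === "" || !Number.isFinite(parsed) || parsed <= 0) {
    return DEFAULT_SESSION_MS;
  }
  // Never exceed the hard ceiling, whatever the environment says.
  return Math.min(Math.floor(parsed), HARD_CEILING_MS);
}
```

Under this sketch, setting `MAX_SESSION_DURATION_MS=900000` would still yield a 600000 ms session, while `120000` would be honored as is.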
28
+
29
+ ### `MAX_STEPS_PER_TASK`
30
+ Default: `32`
31
+
32
+ The default now leans toward a forensic investigation across multiple focused coverage lanes instead of one vague exploration pass.
33
+ The runner also preserves time for later tasks and supplemental site checks, so increasing this does not guarantee more useful coverage.
34
+ Raise this only when tasks genuinely require even longer flows. Bigger numbers can still make the agent wander if the site has poor signals.
35
+
36
+ ### `ACTION_DELAY_MS`
37
+ Default: `600`
38
+
39
+ Extra delay between actions. Useful when sites animate heavily.
40
+
41
+ ### `NAVIGATION_TIMEOUT_MS`
42
+ Default: `25000`
43
+
44
+ Increase this for painfully slow sites.
45
+ This timeout also affects the supplemental site probes that power performance, SEO, security, mobile, and content coverage.
46
+
47
+ ### `REPORT_TTL_DAYS`
48
+ Default: `30`
49
+
50
+ Hosted public task-output links expire after this many days.
51
+
52
+ ### `RECORD_VIDEO`
53
+ Default: `false`
54
+
55
+ When set to `true`, Playwright captures a full video recording of every browser session. These recordings are saved in the run directory and are viewable in the dashboard alongside the animated WebP replays.
56
+
57
+ ### `PLAYWRIGHT_STORAGE_STATE_PATH`
58
+ Default: unset
59
+
60
+ Optional path to a Playwright `storageState` JSON file.
61
+ Use this when your approved test lane already has a legitimate verified or authenticated session and you want the CLI or local app to reuse it automatically.
62
+
63
+ ### `DASHBOARD_PORT`
64
+ Default: `4173`
65
+
66
+ Port used by the local app server.
67
+
68
+ ### `DASHBOARD_HOST`
69
+ Default: `127.0.0.1`
70
+
71
+ Host binding used by the local app server.
72
+
73
+ ## Coverage playbook
74
+
75
+ If you want the fewest possible `blocked` metrics:
76
+
77
+ - Prefer sites or QA lanes that are reachable without CAPTCHA, Cloudflare challenges, or geo/IP throttling.
78
+ - Reuse a legitimate session with `PLAYWRIGHT_STORAGE_STATE_PATH` or `--storage-state` when important paths sit behind login or verification.
79
+ - Raise `NAVIGATION_TIMEOUT_MS` for slow sites before raising `MAX_STEPS_PER_TASK`.
80
+ - Keep `MAX_SESSION_DURATION_MS` near the 10-minute ceiling for deeper task runs.
81
+ - Use multiple agent perspectives in the submission form when you want broader behavioral coverage, not just deeper repetition from one agent.
82
+
83
+ ## Auth bootstrap variables
84
+
85
+ These are needed when you bootstrap a new auth identity with `--auth-flow` or `--auth-only`.
86
+ After a successful auth run, the runner also caches the working credentials in `.auth/credentials.json` keyed by target origin, so later runs against the same site can reuse them automatically.
87
+
88
+ ### `AUTH_TEST_EMAIL`
89
+ Required for auth bootstrap.
90
+
91
+ The base mailbox address the runner uses for signup and login.
92
+ On the first signup attempt it uses this exact address.
93
+ If the site says the account already exists, the runner now retries with fresh plus-address aliases such as `name+siteagent-...@domain.com` so it can keep registering without manual edits.
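The retry behavior above might look roughly like the following. The suffix scheme (the attempt number) and both function names are assumptions made for illustration; the docs elide the real suffix, so treat this as a sketch of plus-addressing in general, not of the runner's actual alias format.

```typescript
// Hypothetical sketch; the real suffix format is not documented.
function buildSignupAlias(baseEmail: string, suffix: string): string {
  const at = baseEmail.lastIndexOf("@");
  if (at <= 0) throw new Error(`Not an email address: ${baseEmail}`);
  // Drop any existing plus tag so retries do not stack tags.
  const local = baseEmail.slice(0, at).split("+")[0];
  const domain = baseEmail.slice(at + 1);
  return `${local}+siteagent-${suffix}@${domain}`;
}

function signupCandidates(baseEmail: string, maxAttempts: number): string[] {
  // Attempt 1 uses the exact base address; later attempts use fresh aliases.
  const candidates = [baseEmail];
  for (let attempt = 2; attempt <= maxAttempts; attempt++) {
    candidates.push(buildSignupAlias(baseEmail, String(attempt)));
  }
  return candidates;
}
```

Plus-addressing only helps if the mail provider delivers `name+tag@domain.com` to the `name@domain.com` inbox, so it is worth confirming that before relying on automatic retries.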
94
+
95
+ ### `AUTH_TEST_PASSWORD`
96
+ Required for auth bootstrap.
97
+
98
+ The password the runner uses for both signup and login.
99
+
100
+ ### `AUTH_TEST_USERNAME`
101
+ Optional.
102
+
103
+ Use this when the site expects a username field that is different from the email address. If omitted, the runner derives a fallback username from the configured email address.
104
+
105
+ ### `AUTH_TEST_FIRST_NAME` through `AUTH_TEST_COMPANY`
106
+ Defaults are provided in `.env.example`.
107
+
108
+ These values are used to fill visible signup fields such as name, phone, address, city, state, postal code, country, and company.
109
+ When the runner has to retry signup with a fresh identity, it also adds small numeric variations to these details so sites that enforce uniqueness beyond email are less likely to reject the retry.
110
+
111
+ ### `AUTH_IMAP_HOST`, `AUTH_IMAP_PORT`, `AUTH_IMAP_SECURE`, `AUTH_IMAP_USER`, `AUTH_IMAP_PASSWORD`, `AUTH_IMAP_MAILBOX`
112
+
113
+ Configure the real inbox the runner should poll for OTP or verification emails.
114
+ The auth bootstrap uses IMAP mailbox access, not a browser-driven webmail tab.
115
+
116
+ ### `AUTH_EMAIL_POLL_TIMEOUT_MS`
117
+ Default: `180000`
118
+
119
+ How long to wait for the verification email before failing the auth bootstrap.
120
+
121
+ ### `AUTH_EMAIL_POLL_INTERVAL_MS`
122
+ Default: `5000`
123
+
124
+ How frequently to poll the inbox for a new message.
125
+
126
+ ### `AUTH_OTP_LENGTH`
127
+ Default: `6`
128
+
129
+ Expected OTP length for numeric code extraction.
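As a rough illustration of length-based numeric extraction, the sketch below pulls the first standalone run of exactly `length` digits out of an email body. `extractOtp` is a hypothetical name, and the package's own matcher may apply extra heuristics.

```typescript
// Hypothetical sketch of numeric OTP extraction by expected length.
function extractOtp(body: string, length = 6): string | null {
  // (?<!\d) and (?!\d) stop the match from slicing a longer number.
  const pattern = new RegExp(`(?<!\\d)\\d{${length}}(?!\\d)`);
  const match = body.match(pattern);
  return match ? match[0] : null;
}
```

The digit boundaries matter: without them, a six-digit pattern would happily match the first six digits of an eight-digit order number.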
130
+
131
+ ### `AUTH_EMAIL_FROM_FILTER`
132
+ Optional.
133
+
134
+ Use this when the mailbox receives lots of unrelated email and you want to constrain matching to a specific sender.
135
+
136
+ ### `AUTH_EMAIL_SUBJECT_FILTER`
137
+ Optional.
138
+
139
+ Use this when the mailbox receives lots of unrelated email and you want to constrain matching to a specific subject fragment.
140
+
141
+ ### `AUTH_GENERATED_IDENTITY_MAX_ATTEMPTS`
142
+ Default: `5`
143
+
144
+ How many signup identities the runner should try before giving up when the site keeps reporting that the account already exists.
145
+
146
+ ### `AUTH_SIGNUP_URL`, `AUTH_LOGIN_URL`, `AUTH_ACCESS_URL`
147
+ Optional.
148
+
149
+ Default auth flow URLs used by the CLI when you do not pass `--signup-url`, `--login-url`, or `--access-url`.
150
+
151
+ If auth credentials are configured and a normal task run lands on a real login or registration wall, the runner can also attempt an automatic in-session signup/login recovery using the current blocked page as the protected destination to re-open.
152
+
153
+ ### `AUTH_SESSION_STATE_PATH`
154
+ Default: `.auth/session.json`
155
+
156
+ Where the authenticated Playwright session is saved if you do not explicitly pass `--save-storage-state`.
157
+
158
+ ## CLI flags
159
+
160
+ ### `--url`
161
+ Required website URL.
162
+
163
+ ### `--task`
164
+ Required for task runs. Repeat it for each accepted task you want the agent to perform.
165
+
166
+ Example:
167
+
168
+ ```bash
169
+ npm run dev -- --url https://example.com --task "Open pricing and compare the visible plans" --task "Reach the signup page without creating an account"
170
+ ```
171
+
172
+ ### `--headed`
173
+ Shows the browser.
174
+
175
+ ### `--mobile`
176
+ Uses a mobile browser profile.
177
+
178
+ ### `--ignore-https-errors`
179
+ Allows invalid or self-signed HTTPS certificates.
180
+
181
+ Useful for local development sites such as:
182
+
183
+ ```bash
184
+ npm run dev -- --url https://localhost:3000 --ignore-https-errors
185
+ ```
186
+
187
+ ### `--storage-state`
188
+ Loads a Playwright `storageState` JSON file for a single run.
189
+
190
+ Example:
191
+
192
+ ```bash
193
+ npm run dev -- --url https://example.com --storage-state .auth/session.json
194
+ ```
195
+
196
+ ### `--save-storage-state`
197
+ Saves the Playwright `storageState` JSON after the run finishes.
198
+
199
+ Example:
200
+
201
+ ```bash
202
+ npm run dev -- --url https://example.com --storage-state .auth/session.json --save-storage-state .auth/session.json
203
+ ```
204
+
205
+ ### `--auth-flow`
206
+ Runs the auth bootstrap first, then continues the accepted task run with the authenticated session.
207
+
208
+ Example:
209
+
210
+ ```bash
211
+ npm run dev -- --url https://example.com --auth-flow --signup-url /register --login-url /login --access-url /app
212
+ ```
213
+
214
+ ### `--auth-only`
215
+ Runs only the auth bootstrap and saves the authenticated session without generating a task output.
216
+
217
+ ### `--signup-url`
218
+ Optional absolute or relative signup URL for auth bootstrap.
219
+
220
+ ### `--login-url`
221
+ Optional absolute or relative login URL for auth bootstrap.
222
+
223
+ ### `--access-url`
224
+ Optional absolute or relative protected URL used to confirm the session can reach authenticated content after login.
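One plausible way to accept both absolute and relative values for these flags is the standard `URL` constructor, resolving against the `--url` value. This is an assumption about how resolution could work, not a description of the CLI's internals, and `resolveAuthUrl` is a hypothetical name.

```typescript
// Hypothetical sketch: resolve an auth flag value against the site URL.
function resolveAuthUrl(siteUrl: string, pathOrUrl: string): string {
  // new URL(input, base) ignores `base` when `input` is already absolute.
  return new URL(pathOrUrl, siteUrl).href;
}
```

With this approach `/register` resolves against the site origin, while a full URL on another host (for example a separate auth domain) passes through unchanged.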
225
+
226
+ ## Local app routes
227
+
228
+ After running `npm run dashboard`:
229
+ - `/` is the public submission form
230
+ - `/dashboard` is the internal run dashboard
231
+ - `/submissions/<id>` is the submission status page
232
+ - `/r/<token>` is the public task-output link
233
+ - `/outputs/<run-id>` is the standalone HTML output route
@@ -0,0 +1,41 @@
1
+ # 04 - How the Agent Thinks
2
+
3
+ ## The execution loop
4
+
5
+ For each task, the system does this:
6
+
7
+ 1. capture visible page state
8
+ 2. ask the model for the next realistic user action
9
+ 3. execute the action with guarded locators
10
+ 4. log what happened
11
+ 5. repeat until the task ends or the step limit is hit
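The five steps above can be sketched as a loop with injected dependencies. Every type and function name here is hypothetical, and the sketch is synchronous for brevity; a real browser-driven runner would be asynchronous and considerably larger.

```typescript
// All names are hypothetical; this only mirrors the shape of the loop.
type PageState = { url: string; title: string; visibleText: string };
type Action = { kind: "click" | "type" | "finish"; target?: string };
type StepLog = { step: number; action: Action; urlBefore: string };

interface LoopDeps {
  captureState: () => PageState; // 1. capture visible page state
  planNext: (state: PageState, history: StepLog[]) => Action; // 2. ask the model
  execute: (action: Action) => void; // 3. execute with guarded locators
}

function runTask(deps: LoopDeps, maxSteps: number): StepLog[] {
  const history: StepLog[] = [];
  for (let step = 1; step <= maxSteps; step++) {
    const state = deps.captureState();
    const action = deps.planNext(state, history);
    if (action.kind === "finish") break; // 5a. the planner says the task ended
    deps.execute(action);
    history.push({ step, action, urlBefore: state.url }); // 4. log what happened
  }
  return history; // 5b. also reached when the step limit is hit
}
```

Passing the history back into the planner on each step is what lets it avoid repeating an action that already failed.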
12
+
13
+ ## Why the planner and evaluator are separate
14
+
15
+ If one model both acts and judges, it will flatter itself and invent success.
16
+ That is weak design.
17
+
18
+ This project separates:
19
+ - **planner**: chooses the next action
20
+ - **evaluator**: reviews the evidence afterward
21
+
22
+ ## What the planner sees
23
+
24
+ The planner gets:
25
+ - page title and URL
26
+ - visible body text excerpt
27
+ - visible interactive elements
28
+ - headings
29
+ - modal hints
30
+ - previous action history
31
+
32
+ ## What the planner does not get
33
+
34
+ It does not get:
35
+ - hidden DOM content
36
+ - fake claims that something succeeded
37
+
38
+ ## Why this matters
39
+
40
+ The goal is a system that behaves like a regular user.
41
+ Regular users do not inspect invisible elements or parse the entire DOM perfectly.
@@ -0,0 +1,42 @@
1
+ # 05 - Extending Accepted Tasks
2
+
3
+ ## Define tasks from explicit input
4
+
5
+ There are no built-in personas or default task files anymore.
6
+ Each run is driven by accepted tasks submitted from the dashboard or passed to the CLI with repeated `--task` flags.
7
+
8
+ Example CLI input:
9
+
10
+ ```bash
11
+ npm run dev -- --url https://example.com \
12
+ --task "Find the pricing page and compare the visible plans" \
13
+ --task "Open the contact path and confirm whether support is easy to reach"
14
+ ```
15
+
16
+ For game-oriented runs, write the requested behavior directly into the accepted tasks. Example: read the visible how-to-play section, reach a playable state, and play five rounds while recording wins and losses.
17
+
18
+ ## Good task design
19
+
20
+ A good task is:
21
+ - concrete
22
+ - time-bounded
23
+ - observable
24
+ - easy to judge from evidence
25
+ - complementary with the other tasks in the suite
26
+
27
+ Good task sets usually split coverage into a few lanes such as:
28
+ - main journey and orientation
29
+ - discovery and information architecture
30
+ - conversion and trust
31
+ - suspicious interactions and recovery states
32
+
33
+ That gives the runner broader coverage without asking one task to explain the whole site alone.
34
+
35
+ ## Bad task design
36
+
37
+ Trash tasks look like this:
38
+ - “Explore the site”
39
+ - “See if it is good”
40
+ - “Understand everything”
41
+
42
+ Those are vague, hard to score, and guaranteed to produce mush.
@@ -0,0 +1,92 @@
1
+ # 06 - Hardening for Production
2
+
3
+ ## 0. CI integration (devDependency users)
4
+
5
+ This is the natural endpoint of side-by-side development: once you trust the audit scores on your own product, automate them.
6
+
7
+ ### Run site-agent-pro in CI against a preview URL
8
+
9
+ ```yaml
10
+ # Example: GitHub Actions
11
+ - name: Run site-agent-pro audit
12
+   env:
13
+     OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
14
+   run: |
15
+     npx site-agent-pro --url "${{ env.PREVIEW_URL }}" \
16
+       --task "Check the homepage CTA" \
17
+       --task "Open pricing and compare the visible plans"
18
+ ```
19
+
20
+ > [!TIP]
21
+ > You can also pass secrets directly via CLI flags (e.g., `--openai-api-key ${{ secrets.OPENAI_API_KEY }}`) if you prefer not to map them as environment variables.
22
+
23
+ ### Gate on a minimum score using the programmatic API
24
+
25
+ Create an `audit.mjs` script in your project:
26
+
27
+ ```ts
28
+ import { runAudit } from "site-agent-pro";
29
+
30
+ const result = await runAudit({
31
+   url: process.env.PREVIEW_URL ?? "http://localhost:3000",
32
+   tasks: [
33
+     "Check the homepage CTA",
34
+     "Open pricing and compare the visible plans",
35
+   ],
36
+ });
37
+
38
+ console.log(`Score: ${result.report.overall_score}/10`);
39
+ result.report.top_fixes.forEach((fix) => console.log(`Fix: ${fix}`));
40
+
41
+ if (result.report.overall_score < 7) {
42
+   console.error("Audit score below threshold. Failing build.");
43
+   process.exit(1);
44
+ }
45
+ ```
46
+
47
+ ```yaml
48
+ - name: Run audit gate
49
+   run: node audit.mjs
50
+ ```
51
+
52
+ > **Important:** Do not add CI gating until you have manually reviewed enough runs on your own product to understand the score range. Arbitrary thresholds will create false failures and destroy trust in the tool.
53
+
54
+ ---
55
+
56
+ ## 1. Add retries carefully
57
+
58
+ Retrying every failed action blindly is lazy and dangerous.
59
+ Add retries only for:
60
+ - network hiccups
61
+ - delayed rendering
62
+ - slow client-side routing
63
+
64
+ ## 2. Improve task completion checks
65
+
66
+ The current completion logic is conservative but still heuristic.
67
+ Production systems should add explicit validators per task.
68
+
69
+ Examples:
70
+ - pricing check should detect real price patterns
71
+ - contact check should verify actual support/contact details
72
+ - signup check should verify the next-step form is real and usable
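As one example of an explicit validator, a pricing check could look for real price patterns in the captured visible text. The pattern, the currency list, and both function names below are illustrative assumptions, not part of the package.

```typescript
// Hypothetical validator; the currencies and pattern are illustrative only.
const PRICE_PATTERN =
  /[$€£]\s?\d+(?:,\d{3})*(?:\.\d{1,2})?|\d+(?:,\d{3})*(?:\.\d{1,2})?\s?(?:USD|EUR|GBP)\b/g;

function findPrices(visibleText: string): string[] {
  // String.prototype.match with a global regex returns every match, or null.
  return visibleText.match(PRICE_PATTERN) ?? [];
}

function pricingTaskPassed(visibleText: string, minPlans = 2): boolean {
  // "Compare the visible plans" needs at least two distinct prices on screen.
  return new Set(findPrices(visibleText)).size >= minPlans;
}
```

A validator like this turns "the agent reached something that looks like pricing" into a check against evidence, which is easier to trust than a model's own claim of success.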
73
+
74
+ ## 3. Improve event-aware evaluation
75
+
76
+ Right now the evaluator relies on interaction logs, task outcomes, and accessibility findings.
77
+ If you want better judgment later, enrich the structured events instead of adding guesswork.
78
+
79
+ ## 4. Add category-specific task sets
80
+
81
+ Use different task sets for:
82
+ - SaaS marketing sites
83
+ - ecommerce stores
84
+ - docs portals
85
+ - recruiting pages
86
+ - local business websites
87
+
88
+ ## 5. Add CI only after manual trust is earned
89
+
90
+ Do not turn this into a pipeline gate until you have manually reviewed enough runs to understand its failure modes.
91
+
92
+