@ainyc/canonry 4.18.1 → 4.19.1

@@ -88,6 +88,7 @@ GA4 is a first-class signal alongside citation tracking. Connect once with `cano
  | `references/aeo-analysis.md` | Interpreting sweep output, diagnosing regressions, planning content fixes |
  | `references/indexing.md` | Submitting URLs, checking GSC/Bing coverage, fixing indexing gaps |
  | `references/wordpress-integration.md` | Connecting to WordPress, editing pages, pushing staging → live |
+ | `references/server-side-traffic.md` | Wiring server-log evidence (Cloud Run today; WordPress / others later) for AI Visibility — Server-Side. Connect, sync, manage sources, troubleshoot. |
 
  ---
 
@@ -0,0 +1,167 @@
+ # Server-side traffic (AI Visibility — Server-Side)
+
+ Server-side traffic ingestion captures **what AI engines actually do in
+ your server logs** (bots crawling pages, AI products sending
+ click-through arrivals), in addition to the citation data that measures
+ **what models say** about you. The two surfaces are independent.
+
+ ## When to use it
+
+ Reach for server-side traffic when an analyst or operator asks:
+
+ - *"Is GPTBot / ClaudeBot / PerplexityBot actually fetching my pages?"*
+ - *"Which paths are AI engines paying attention to?"*
+ - *"Are users clicking through from chatgpt.com / claude.ai / etc.?"*
+ - *"My citation rate is fine but there's no traffic. Why?"*
+
+ GA4 referrals (chatgpt.com → your site) catch click-throughs after they
+ land. Server logs catch the upstream bot activity AND referrals at the
+ edge, including arrivals GA4 missed because of cookie consent, ad
+ blockers, or analytics gaps.
+
+ ## Architecture
+
+ Two hourly rollup tables plus a bounded sample table, all populated from
+ server-log adapters:
+
+ | Table | What's in it |
+ |---|---|
+ | `crawler_events_hourly` | One row per `(project, source, hour, bot, verification, path, status)`: bot crawls rolled up by hour |
+ | `ai_referral_events_hourly` | One row per `(project, source, hour, product, source_domain, evidence_type, landing_path, status)`: click-through arrivals rolled up by hour |
+ | `raw_event_samples` | Bounded forensic samples (≤100 per sync) for spot-checking |
+
+ Each `traffic_sources` row is one server-log integration for a project.
+ Today's only adapter is `cloud-run`; future adapters slot in by
+ implementing the same contract.
+
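The hourly rollup described above can be sketched as a group-and-count over raw log events. This is a minimal illustration, not canonry's implementation: `RawCrawlerEvent`, `HourlyCrawlerRow`, `rollUpByHour`, and the `hits` column are hypothetical names, the `"verified"` / `"unverified"` values are assumed, and `project` / `source` are left out of the key on the assumption that a single source is being rolled up.

```typescript
// Hypothetical raw event shape; the fields mirror the rollup key in the table
// above, but this is NOT canonry's actual schema.
interface RawCrawlerEvent {
  timestamp: string; // ISO-8601, e.g. "2026-04-01T10:17:42Z"
  bot: string;
  verification: "verified" | "unverified";
  path: string;
  status: number;
}

// Hypothetical rollup row: one per (hour, bot, verification, path, status).
interface HourlyCrawlerRow {
  hour: string; // start of the UTC hour, ISO-8601
  bot: string;
  verification: "verified" | "unverified";
  path: string;
  status: number;
  hits: number;
}

// Truncate a timestamp to the start of its UTC hour.
function hourBucket(timestamp: string): string {
  const d = new Date(timestamp);
  d.setUTCMinutes(0, 0, 0);
  return d.toISOString();
}

// Group events by the rollup key and count hits per group.
function rollUpByHour(events: RawCrawlerEvent[]): HourlyCrawlerRow[] {
  const rows = new Map<string, HourlyCrawlerRow>();
  for (const e of events) {
    const hour = hourBucket(e.timestamp);
    // JSON.stringify gives an unambiguous composite key even if a path
    // contains separator characters.
    const key = JSON.stringify([hour, e.bot, e.verification, e.path, e.status]);
    const existing = rows.get(key);
    if (existing !== undefined) {
      existing.hits += 1;
    } else {
      rows.set(key, {
        hour,
        bot: e.bot,
        verification: e.verification,
        path: e.path,
        status: e.status,
        hits: 1,
      });
    }
  }
  return Array.from(rows.values());
}
```

Two fetches of the same path by the same bot within one hour collapse into a single row with `hits: 2`; a fetch in the next hour starts a new row.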
+ ## Connecting a Cloud Run source
+
+ ```bash
+ # 1. Create a service account in the Cloud project that hosts the Cloud Run
+ #    service. Grant it `roles/logging.viewer`. Download the JSON key.
+
+ # 2. Connect from canonry CLI:
+ canonry traffic connect cloud-run <project> \
+   --gcp-project <gcp-project-id> \
+   --service-account-key <path/to/key.json>
+
+ # 3. (Optional) narrow to a specific service or location:
+ canonry traffic connect cloud-run <project> \
+   --gcp-project <id> \
+   --service-account-key <path> \
+   --service my-service-name \
+   --location us-east1
+ ```
+
+ Credentials are stored in `~/.canonry/config.yaml` (not the DB). The
+ canonical key lives only on the host that runs `canonry serve`. The
+ sync flow does NOT echo the private key back in any response.
+
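Before running the connect command, it can help to confirm that the downloaded file really is a service-account key. The sketch below checks a few fields that Google Cloud service-account key files carry (`type`, `project_id`, `client_email`, `private_key`). `checkServiceAccountKey` is a hypothetical helper for illustration and is not part of the canonry CLI; whether canonry performs a similar check on connect is not stated here.

```typescript
// Hypothetical pre-flight check, not a canonry API. Returns a list of
// problems; an empty list means the key file looks structurally plausible.
function checkServiceAccountKey(json: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(json);
  } catch {
    return ["not valid JSON"];
  }
  if (typeof parsed !== "object" || parsed === null) {
    return ["not a JSON object"];
  }
  const key = parsed as Record<string, unknown>;
  const problems: string[] = [];
  if (key["type"] !== "service_account") {
    problems.push('"type" is not "service_account"');
  }
  for (const field of ["project_id", "client_email", "private_key"]) {
    const value = key[field];
    if (typeof value !== "string" || value.length === 0) {
      problems.push(`missing "${field}"`);
    }
  }
  return problems;
}
```

This only validates the file's shape. It does not prove the key is still active or that the account holds `roles/logging.viewer`.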
+ ## Syncing data
+
+ ```bash
+ # Manual sync: defaults to a 30-day lookback on the first run; subsequent
+ # runs are clamped forward to lastSyncedAt to avoid re-pulling.
+ canonry traffic sync <project> --source <id>
+
+ # Override the lookback window (minutes):
+ canonry traffic sync <project> --source <id> --since-minutes 4320  # 3 days
+ ```
+
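The window rule in the comments above can be sketched as follows. This is an illustration under stated assumptions: `resolveWindowStart` is a hypothetical name, and it assumes the clamp to `lastSyncedAt` also applies when `--since-minutes` is passed, which the text above does not confirm.

```typescript
// 30-day default lookback, expressed in minutes to match --since-minutes.
const DEFAULT_LOOKBACK_MINUTES = 30 * 24 * 60;

// Hypothetical helper: pick the start of the sync window.
// - First run (no lastSyncedAt): now minus the lookback.
// - Later runs: never start earlier than lastSyncedAt.
function resolveWindowStart(
  now: Date,
  lastSyncedAt: Date | null,
  sinceMinutes: number = DEFAULT_LOOKBACK_MINUTES,
): Date {
  const lookbackStart = new Date(now.getTime() - sinceMinutes * 60 * 1000);
  if (lastSyncedAt !== null && lastSyncedAt.getTime() > lookbackStart.getTime()) {
    return lastSyncedAt; // clamp forward to avoid re-pulling
  }
  return lookbackStart;
}
```

With `now` at 2026-04-30T00:00Z, a first run starts at 2026-03-31T00:00Z, while a source last synced twelve hours earlier starts from that sync time instead.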
+ Cross-sync dedupe via the `last_event_ids` ring buffer means re-running a
+ sync over an overlapping window cannot double-count rolled-up hourly
+ hits. Syncs are therefore safe to schedule (see "Scheduling" below) or
+ to trigger from CI.
+
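A ring buffer of recently seen event IDs is one way to get this kind of cross-sync dedupe. The sketch below shows the general technique only; `RecentEventIds`, `admit`, and the capacity are hypothetical, and the real `last_event_ids` structure may differ.

```typescript
// Fixed-capacity memory of recently seen event IDs (illustrative sketch).
class RecentEventIds {
  private readonly capacity: number;
  private readonly slots: string[] = [];
  private readonly seen: Set<string> = new Set();
  private next = 0; // index of the oldest slot once the buffer is full

  constructor(capacity: number) {
    if (capacity < 1) throw new Error("capacity must be at least 1");
    this.capacity = capacity;
  }

  // Returns true if the ID is new (and records it); false if it was
  // already seen, in which case the caller skips the event.
  admit(id: string): boolean {
    if (this.seen.has(id)) return false;
    if (this.slots.length < this.capacity) {
      this.slots.push(id);
    } else {
      // Buffer is full: overwrite the oldest entry and forget its ID.
      const evicted = this.slots[this.next];
      if (evicted !== undefined) this.seen.delete(evicted);
      this.slots[this.next] = id;
      this.next = (this.next + 1) % this.capacity;
    }
    this.seen.add(id);
    return true;
  }
}
```

During a sync, only events for which `admit(id)` returns `true` would be counted into the hourly rollups. In this sketch the buffer remembers at most `capacity` IDs, so the capacity has to cover the overlap between consecutive sync windows.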
+ ## Inspecting source state
+
+ ```bash
+ # All sources with last-24h totals + latest sync run (single-call):
+ canonry traffic status <project> --format json
+
+ # Just the source list:
+ canonry traffic sources <project> --format json
+
+ # Windowed events (defaults to last 24h):
+ canonry traffic events <project> --kind crawler --limit 200 --format json
+ canonry traffic events <project> --kind ai-referral --since 2026-04-01 --until 2026-04-30
+ ```
+
+ The `traffic status` composite returns the same per-source detail
+ (24h crawler hits, AI-referral arrivals, raw-event-sample count, latest
+ sync-run summary) whether you reach it via the CLI, the API, or the
+ MCP `canonry_traffic_status` tool.
+
+ ## Where the data shows up
+
+ | Surface | What's rendered |
+ |---|---|
+ | Project dashboard `/projects/:name/activity` | Live source table + 24h totals + GA4 referrals (combined view) |
+ | Top-level `/traffic` route | Cross-project source admin (connect, sync, archive) |
+ | `canonry report <project>` (HTML + SPA) | "AI Visibility — Server-Side" section, ranked above Indexing Health |
+ | `canonry doctor --project <name>` | `traffic.source.connected`, `recent-data`, `credentials`, `scopes` checks |
+ | MCP toolkit `traffic` | Tools: `canonry_traffic_status`, `_sources_list`, `_source_get`, `_events`, `_connect_cloud_run`, `_sync` |
+
+ ## Doctor signals
+
+ The doctor checks are adapter-agnostic. When they fail or warn:
+
+ | Check | Code | What to do |
+ |---|---|---|
+ | `traffic.source.connected` | `traffic.source.none` | No source is connected. Run `canonry traffic connect cloud-run …`. |
+ | `traffic.source.connected` | `traffic.source.all-errored` | Re-connect the source. The check's `details.lastError` shows the underlying reason. |
+ | `traffic.source.recent-data` | `traffic.recent-data.stale` | Last sync was >7d ago. Run `canonry traffic sync …` or schedule a recurring sync. |
+ | `traffic.source.recent-data` | `traffic.recent-data.empty` | Source connected but no data in 30d. Verify config and credentials with `canonry traffic sources <project>`. |
+ | `traffic.source.credentials` | `traffic.credentials.resolve-failed` | Service-account key in `~/.canonry/config.yaml` is invalid or expired. Re-connect. |
+
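For tooling that consumes doctor output, the table above translates into a small lookup. The following is a sketch: `remediationFor` and `isSyncStale` are hypothetical helpers, the codes and the 7-day threshold are taken from the table, and how canonry computes staleness internally is an assumption.

```typescript
// Remediation text keyed by doctor code (codes copied from the table above).
const DOCTOR_REMEDIATION = new Map<string, string>([
  ["traffic.source.none", "Connect a source: canonry traffic connect cloud-run …"],
  ["traffic.source.all-errored", "Re-connect the source; see details.lastError for the reason."],
  ["traffic.recent-data.stale", "Run canonry traffic sync … or schedule a recurring sync."],
  ["traffic.recent-data.empty", "Verify config and credentials with canonry traffic sources <project>."],
  ["traffic.credentials.resolve-failed", "Service-account key is invalid or expired; re-connect."],
]);

// Hypothetical helper: look up the suggested action for a doctor code.
function remediationFor(code: string): string {
  return DOCTOR_REMEDIATION.get(code) ?? `No remediation known for "${code}".`;
}

// Hypothetical helper: true when the last sync is more than 7 days old,
// matching the ">7d" rule in the table.
function isSyncStale(lastSyncAt: Date, now: Date): boolean {
  const sevenDaysMs = 7 * 24 * 60 * 60 * 1000;
  return now.getTime() - lastSyncAt.getTime() > sevenDaysMs;
}
```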
+ ## Scheduling
+
+ `canonry schedule` supports `--kind traffic-sync`. Recurring syncs are
+ safe because of the `last_event_ids` cross-sync dedupe ring buffer
+ described above. Recommended cadence:
+
+ | Cadence | Use case |
+ |---|---|
+ | `0 */6 * * *` (every 6h) | Production agencies tracking active client sites |
+ | `0 0 * * *` (daily) | Lower-traffic sites or local dev |
+ | Manual only | First few weeks while validating data |
+
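To make the six-hour cadence concrete, the sketch below computes the next firing time of `0 */6 * * *`, which fires at minute 0 of hours 0, 6, 12, and 18. `nextSixHourlyRun` is a hypothetical helper, and evaluating the expression in UTC is an assumption; the timezone canonry's scheduler uses is not stated here.

```typescript
// Next firing time of `0 */6 * * *` strictly after `after`, assuming UTC.
function nextSixHourlyRun(after: Date): Date {
  const next = new Date(after.getTime());
  next.setUTCMinutes(0, 0, 0); // drop minutes, seconds, milliseconds
  const hour = next.getUTCHours();
  // Round down to the current 6-hour boundary, then step to the next one.
  // Hour 24 rolls over to 00:00 of the following day.
  next.setUTCHours(hour - (hour % 6) + 6);
  return next;
}
```

A sync finishing at 10:17 UTC would next run at 12:00 UTC; one finishing at 23:30 UTC would next run at 00:00 UTC the following day.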
+ ## Telemetry
+
+ Every sync, successful or failed, emits a `traffic.synced` event to the
+ canonry telemetry pipeline:
+
+ ```jsonc
+ {
+   "event": "traffic.synced",
+   "errorCode": "PROVIDER_AUTH", // present only when status='failed'
+   "properties": {
+     "status": "completed",      // or "failed"
+     "sourceType": "cloud-run",  // adapter type
+     "sourceId": "<uuid>",       // opaque
+     "pulledEvents": 234,
+     "crawlerHits": 200,
+     "aiReferralHits": 12,
+     "durationMs": 4150
+   }
+ }
+ ```
+
+ Counts are aggregate. The `sourceId` is an opaque UUID. No raw paths,
+ domains, or PII are surfaced.
+
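A consumer of this event could model the payload with a type and a runtime guard. The field names below come from the example above; the type name `TrafficSyncedEvent`, the guard `isTrafficSyncedEvent`, and the choice to reject an `errorCode` on a completed sync are assumptions made for illustration.

```typescript
// Shape of the event as shown in the example payload above.
interface TrafficSyncedEvent {
  event: "traffic.synced";
  errorCode?: string; // present only when status is "failed"
  properties: {
    status: "completed" | "failed";
    sourceType: string;
    sourceId: string;
    pulledEvents: number;
    crawlerHits: number;
    aiReferralHits: number;
    durationMs: number;
  };
}

// Hypothetical runtime guard for untyped telemetry payloads.
function isTrafficSyncedEvent(value: unknown): value is TrafficSyncedEvent {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  if (v["event"] !== "traffic.synced") return false;

  const p = v["properties"];
  if (typeof p !== "object" || p === null) return false;
  const props = p as Record<string, unknown>;

  const status = props["status"];
  if (status !== "completed" && status !== "failed") return false;
  // Assumption: an errorCode on a completed sync is treated as malformed.
  if (status === "completed" && v["errorCode"] !== undefined) return false;
  if (v["errorCode"] !== undefined && typeof v["errorCode"] !== "string") return false;

  const numericFields = ["pulledEvents", "crawlerHits", "aiReferralHits", "durationMs"];
  return (
    typeof props["sourceType"] === "string" &&
    typeof props["sourceId"] === "string" &&
    numericFields.every((k) => typeof props[k] === "number")
  );
}
```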
+ ## Limits & caveats
+
+ - **Path-level citation cross-reference is not implemented yet.** The
+   citation store is domain-grain (`query_snapshots.cited_domains`). A
+   future iteration that lands URL-grain citation evidence will extend
+   the `topCrawledPaths` entry with a `citationState` flag. Until then,
+   treat the report's crawled-paths table as "engine attention": the
+   signal is that the bot fetched the path, not that it was cited.
+ - **Verified vs unverified.** The headline numbers count only
+   rDNS-verified hits. Unverified bots claim a known UA but couldn't be
+   cross-confirmed via reverse DNS, so they may be the real bot or an
+   imitator. Don't promote unverified counts in client-facing copy.
+ - **Cloud Run only in v1.** WordPress plugin and other adapters are
+   planned. The doctor checks and the report renderer are already
+   adapter-agnostic, so adding a new adapter is just a new entry in
+   `traffic_sources.source_type` and a `TrafficSourceValidator`
+ registration.
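
The verified/unverified split above can be sketched as a simple partition of rolled-up rows. `BotHitRow`, `summarizeBotHits`, the `hits` field, and the `"verified"` / `"unverified"` values are hypothetical names chosen for illustration, not canonry's schema.

```typescript
type Verification = "verified" | "unverified";

// Hypothetical rolled-up row: hit count per bot and verification state.
interface BotHitRow {
  bot: string;
  verification: Verification;
  hits: number;
}

interface HitSummary {
  headline: number;   // rDNS-verified hits only
  unverified: number; // reported separately, never folded into the headline
}

function summarizeBotHits(rows: BotHitRow[]): HitSummary {
  let headline = 0;
  let unverified = 0;
  for (const row of rows) {
    if (row.verification === "verified") {
      headline += row.hits;
    } else {
      unverified += row.hits;
    }
  }
  return { headline, unverified };
}
```

Keeping the two totals in separate fields makes it harder to surface unverified counts in client-facing copy by accident.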