@pencil-agent/nano-pencil 2.0.0 → 2.0.1

Files changed (195)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/mcp/mcp-client.d.ts +3 -1
  7. package/dist/core/mcp/mcp-client.js +6 -6
  8. package/dist/core/mcp/mcp-config.d.ts +3 -3
  9. package/dist/core/mcp/mcp-config.js +1 -1
  10. package/dist/core/mcp/mcp-manager.d.ts +5 -1
  11. package/dist/core/mcp/mcp-manager.js +1 -1
  12. package/dist/core/platform/config/resource-loader.d.ts +2 -0
  13. package/dist/core/platform/config/resource-loader.js +2 -2
  14. package/dist/core/runtime/agent-session.d.ts +12 -0
  15. package/dist/core/runtime/agent-session.js +8 -8
  16. package/dist/core/runtime/sdk.d.ts +8 -0
  17. package/dist/core/runtime/sdk.js +1 -1
  18. package/dist/extensions/builtin/AGENT.md +115 -115
  19. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  20. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  91. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  92. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  93. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  94. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  95. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  96. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  97. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  98. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  99. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  100. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  101. package/dist/extensions/builtin/browser/browser.md +73 -73
  102. package/dist/extensions/builtin/browser/install.md +142 -142
  103. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  104. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  105. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  107. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  108. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  109. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  110. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  111. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  112. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  113. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  114. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  115. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  116. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  117. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  118. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  119. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  120. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  121. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  122. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  123. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  124. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  125. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  126. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  127. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  128. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  129. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  130. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  131. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  132. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  133. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  134. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  135. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  136. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  137. package/dist/extensions/builtin/goal/README.md +67 -67
  138. package/dist/extensions/builtin/grub/README.md +112 -112
  139. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  140. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  141. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  142. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  143. package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
  144. package/dist/extensions/builtin/loop/README.md +92 -92
  145. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  146. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  147. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  148. package/dist/extensions/builtin/sal/README.md +72 -72
  149. package/dist/extensions/builtin/security-audit/README.md +289 -289
  150. package/dist/extensions/builtin/team/AGENT.md +112 -112
  151. package/dist/extensions/builtin/team/TESTING.md +299 -299
  152. package/dist/extensions/builtin/token-save/README.md +56 -56
  153. package/dist/extensions/optional/AGENT.md +10 -10
  154. package/dist/modes/interactive/interactive-mode.js +36 -36
  155. package/dist/modes/interactive/theme/dark.json +85 -85
  156. package/dist/modes/interactive/theme/light.json +84 -84
  157. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  158. package/dist/modes/interactive/theme/warm.json +81 -81
  159. package/dist/node_modules/@pencil-agent/agent-core/dist/agent-loop.js +3 -2
  160. package/dist/node_modules/@pencil-agent/agent-core/dist/structured-adaptive-agent-loop.js +2 -1
  161. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  162. package/docs/cc-agent-design.md +1297 -0
  163. package/docs/cc-tui-design.md +1333 -0
  164. package/docs/codex-goal-command-impl.md +1055 -1055
  165. package/docs/codex-goal-vs-grub.md +500 -500
  166. package/docs/custom-provider.md +27 -27
  167. package/docs/extensions.md +27 -27
  168. package/docs/keybindings.md +27 -27
  169. package/docs/loop 重构完成总结.md +250 -250
  170. package/docs/loop 重构完成报告.md +122 -122
  171. package/docs/loop 重构方案.md +1222 -1222
  172. package/docs/loop 重构方案实现报告.md +158 -158
  173. package/docs/loop 重构方案对比分析.md +128 -128
  174. package/docs/loop 重构计划.md +320 -320
  175. package/docs/loop-usage-examples.md +214 -214
  176. package/docs/models.md +27 -27
  177. package/docs/nanoPencil-学习计划.md +170 -0
  178. package/docs/packages.md +27 -27
  179. package/docs/pi-design-philosophy.md +457 -457
  180. package/docs/planmode.md +1987 -1987
  181. package/docs/prompt-templates.md +27 -27
  182. package/docs/providers.md +27 -27
  183. package/docs/scan-report.md +3820 -0
  184. package/docs/sdk.md +27 -27
  185. package/docs/skills.md +27 -27
  186. package/docs/themes.md +27 -27
  187. package/docs/tui.md +27 -27
  188. package/docs//345/257/271/346/240/207Claude-Code.md +1775 -0
  189. package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +261 -0
  190. package/package.json +190 -190
  191. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +0 -851
  192. package/docs/SDK-TESTING.md +0 -364
  193. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +0 -593
  194. package/docs/startup-performance-optimization.md +0 -301
  195. package/docs/认知地图.md +0 -47
@@ -1,440 +1,440 @@
- # Capterra: Scraping & Data Extraction
-
- Field-tested against capterra.com on 2026-04-18. All code blocks validated with live requests.
-
- ## Do this first
-
- **Use `User-Agent: ClaudeBot`. Capterra explicitly allows it in robots.txt and returns clean, pre-rendered Markdown instead of JavaScript-heavy HTML. No browser needed.**
-
- Capterra serves a fully structured Markdown representation of every page to AI bots (`ClaudeBot`, `GPTBot`, `PerplexityBot`, `Anthropic-AI` are all listed as `Allow: /` in robots.txt, although `Anthropic-AI` still returned 403 when tested; see "Anti-bot measures"). The Markdown format is far easier to parse than HTML.
-
- With the default `Mozilla/5.0` UA (or any realistic browser UA), Capterra returns HTTP 403 with `Cf-Mitigated: challenge`: Cloudflare blocks all browser-UA requests. There is no bypass via HTTP; those pages require a real browser session.
-
- ```python
- from helpers import http_get
- import re
-
- # Plain HTTP GET with the ClaudeBot UA; the response body is Markdown, not HTML
- html = http_get(
-     "https://www.capterra.com/p/135003/Slack/reviews/",
-     headers={"User-Agent": "ClaudeBot"}
- )
-
- # Extract overall rating and review count from the Markdown header line "4.7 (24059)"
- m = re.search(r'^([\d.]+)\s+\(([\d,]+)\)$', html, re.MULTILINE)
- if m:
-     print(m.group(1), m.group(2))  # 4.7 24059
- ```
-
- ---
-
- ## Fastest approach: product summary in one call
-
- All key metrics (overall rating, review count, sub-ratings, pagination) come from the `/reviews/` endpoint in a single request.
-
- ```python
- from helpers import http_get
- import re, json
-
- def get_product_summary(product_id, slug):
-     """
-     Returns overall rating, review count, sub-ratings.
-     product_id: Capterra numeric ID (e.g. 135003)
-     slug: URL slug (e.g. 'Slack')
-     """
-     url = f"https://www.capterra.com/p/{product_id}/{slug}/reviews/"
-     html = http_get(url, headers={"User-Agent": "ClaudeBot"})
-
-     result = {"product_id": product_id, "slug": slug}
-
-     # Overall rating + review count from header line "4.7 (24059)"
-     m = re.search(r'^([\d.]+)\s+\(([\d,]+)\)$', html, re.MULTILINE)
-     if m:
-         result["overall_rating"] = float(m.group(1))
-         result["review_count"] = int(m.group(2).replace(",", ""))
-
-     # Page size and total pages from "Showing 1-25 of 24059 Reviews"
-     showing = re.search(r"Showing\s+(\d+)[-–](\d+)\s+of\s+([\d,]+)\s+Reviews", html)
-     if showing:
-         first, last = int(showing.group(1)), int(showing.group(2))
-         total = int(showing.group(3).replace(",", ""))
-         result["per_page"] = last - first + 1
-         result["total_pages"] = -(-total // result["per_page"])  # ceiling division
-
-     # Sub-ratings: "Ease of use\n\n4.6" and "Customer Service\n\n4.4"
-     lines = html.split("\n")
-     for i, line in enumerate(lines):
-         for label, key in [("Ease of use", "ease_of_use"), ("Customer Service", "customer_service")]:
-             if line.strip() == label:
-                 # The value is the first number in (0, 5] within the next few lines
-                 for j in range(i + 1, min(i + 5, len(lines))):
-                     try:
-                         val = float(lines[j].strip())
-                         if 0 < val <= 5.0:
-                             result[key] = val
-                             break
-                     except ValueError:
-                         pass
-
-     return result
-
- summary = get_product_summary(135003, "Slack")
- print(json.dumps(summary, indent=2))
- # {
- #   "product_id": 135003,
- #   "slug": "Slack",
- #   "overall_rating": 4.7,
- #   "review_count": 24059,
- #   "per_page": 25,
- #   "total_pages": 963,
- #   "ease_of_use": 4.6,
- #   "customer_service": 4.4
- # }
- ```
-
- ---
-
- ## Common workflows
-
- ### Get reviews (paginated)
-
- 25 reviews per page. Use `?page=N` for pagination.
-
- ```python
- from helpers import http_get
- import re
-
- def get_reviews_page(product_id, slug, page=1):
-     """
-     Returns up to 25 reviews for one page.
-     Total pages = ceil(review_count / 25).
-     """
-     url = f"https://www.capterra.com/p/{product_id}/{slug}/reviews/?page={page}"
-     html = http_get(url, headers={"User-Agent": "ClaudeBot"})
-
-     # Total review count from header
-     m = re.search(r'^([\d.]+)\s+\(([\d,]+)\)$', html, re.MULTILINE)
-     total = int(m.group(2).replace(",", "")) if m else 0
-
-     # Showing X-Y of Z
-     showing = re.search(r"Showing\s+(\d+)[-–](\d+)\s+of\s+([\d,]+)\s+Reviews", html)
-
-     # Split by review title markers "### "Title""
-     blocks = re.split(r'\n### "', html)
-     reviews = []
-
-     for block in blocks[1:]:
-         r = {}
-
-         # Title (up to closing quote)
-         t = re.match(r'([^"]+)"', block)
-         if t:
-             r["title"] = t.group(1).strip()
-
-         # Date
-         d = re.search(
-             r"(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d+,\s+\d{4}",
-             block
-         )
-         if d:
-             r["date"] = d.group(0)
-
-         # Overall rating for this review (first float 1.0–5.0 between blank lines)
-         rm = re.search(r"\n\n([\d.]+)\n\n", block)
-         if rm:
-             val = float(rm.group(1))
-             if 1.0 <= val <= 5.0:
-                 r["rating"] = val
-
-         # Pros
-         pros = re.search(r"\nPros\n\n(.+?)(?=\n\nCons|\n\nReview Source|\n\nSwitched|\Z)", block, re.DOTALL)
-         if pros:
-             r["pros"] = pros.group(1).strip()
-
-         # Cons
-         cons = re.search(r"\nCons\n\n(.+?)(?=\n\nReview Source|\n\nSwitched|\n\n##|\Z)", block, re.DOTALL)
-         if cons:
-             r["cons"] = cons.group(1).strip()
-
-         if r.get("title"):
-             reviews.append(r)
-
-     return {
-         "total": total,
-         "page": page,
-         "showing": f"{showing.group(1)}-{showing.group(2)} of {showing.group(3)}" if showing else None,
-         "reviews": reviews,
-     }
-
- # Page 1
- result = get_reviews_page(135003, "Slack", page=1)
- print(f"Total reviews: {result['total']}, this page: {len(result['reviews'])}")
- # Total reviews: 24059, this page: 25
-
- print(result["reviews"][0])
- # {'title': 'Love, love, love Slack!', 'date': 'April 14, 2026', 'rating': 5.0,
- #  'pros': '...', 'cons': '...'}
- ```
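The pagination rule above (25 reviews per page, addressed with `?page=N`) can be sketched without any network access. This is a minimal illustration rather than part of the skill file itself: `review_page_urls` is a hypothetical helper name, and the page size of 25 is simply the value stated in the text.

```python
import math

PER_PAGE = 25  # page size stated in the text above

def review_page_urls(product_id, slug, review_count, per_page=PER_PAGE):
    """Build every review-page URL for a product (hypothetical helper)."""
    total_pages = math.ceil(review_count / per_page)
    base = f"https://www.capterra.com/p/{product_id}/{slug}/reviews/"
    return [f"{base}?page={n}" for n in range(1, total_pages + 1)]

urls = review_page_urls(135003, "Slack", 24059)
print(len(urls))  # 963
print(urls[0])    # https://www.capterra.com/p/135003/Slack/reviews/?page=1
```

Computing the URL list up front keeps the page arithmetic in one place, so the fetching code only has to map over a list.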
-
- ### Scrape all reviews in bulk (parallel)
-
- 10 pages in ~2s with 5 workers. No rate limiting observed during testing.
-
- ```python
- from helpers import http_get
- import re
- from concurrent.futures import ThreadPoolExecutor
-
- UA = {"User-Agent": "ClaudeBot"}
-
- def _fetch_page(args):
-     product_id, slug, page = args
-     url = f"https://www.capterra.com/p/{product_id}/{slug}/reviews/?page={page}"
-     html = http_get(url, headers=UA)
-     blocks = re.split(r'\n### "', html)
-     reviews = []
-     for block in blocks[1:]:
-         r = {}
-         t = re.match(r'([^"]+)"', block)
-         if t: r["title"] = t.group(1).strip()
-         d = re.search(r"(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d+,\s+\d{4}", block)
-         if d: r["date"] = d.group(0)
-         rm = re.search(r"\n\n([\d.]+)\n\n", block)
-         if rm:
-             val = float(rm.group(1))
-             if 1.0 <= val <= 5.0: r["rating"] = val
-         pros = re.search(r"\nPros\n\n(.+?)(?=\n\nCons|\n\nReview Source|\n\nSwitched|\Z)", block, re.DOTALL)
-         if pros: r["pros"] = pros.group(1).strip()
-         cons = re.search(r"\nCons\n\n(.+?)(?=\n\nReview Source|\n\nSwitched|\n\n##|\Z)", block, re.DOTALL)
-         if cons: r["cons"] = cons.group(1).strip()
-         if r.get("title"): reviews.append(r)
-     return reviews
-
- def get_all_reviews(product_id, slug, max_pages=None, workers=5):
-     """Fetch all reviews in parallel. max_pages=None fetches everything."""
-     # First: get total pages
-     summary_html = http_get(
-         f"https://www.capterra.com/p/{product_id}/{slug}/reviews/",
-         headers=UA
-     )
-     m = re.search(r'^([\d.]+)\s+\(([\d,]+)\)$', summary_html, re.MULTILINE)
-     total = int(m.group(2).replace(",", "")) if m else 0
-     total_pages = (total + 24) // 25
-     # Never request pages past the last one when the page count is known
-     last_page = max_pages or total_pages
-     if total_pages:
-         last_page = min(last_page, total_pages)
-     pages = range(1, last_page + 1)
-
-     tasks = [(product_id, slug, p) for p in pages]
-     all_reviews = []
-     with ThreadPoolExecutor(max_workers=workers) as ex:
-         # ex.map yields results in page order
-         for batch in ex.map(_fetch_page, tasks):
-             all_reviews.extend(batch)
-     return all_reviews
-
- # Fetch first 50 reviews (2 pages) in parallel
- reviews = get_all_reviews(135003, "Slack", max_pages=2, workers=2)
- print(f"Fetched {len(reviews)} reviews")
- # Fetched 50 reviews
- ```
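The text reports no rate limiting during testing, but that observation may not hold for larger runs. A cautious bulk scraper could wrap each request in a small retry loop with exponential backoff. The sketch below is an assumption-laden illustration: `fetch_with_retry` is a hypothetical helper, and it assumes the fetch function raises an exception on failure, which the source does not confirm for `http_get`.

```python
import time

def fetch_with_retry(fetch, url, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch(url); on an exception wait base_delay * 2**i, then retry.

    `fetch` is any callable taking a URL (assumed to raise on failure).
    `sleep` is injectable so the delay schedule can be checked without waiting.
    """
    for i in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** i)

# Demonstration with a fake fetcher that fails twice, then succeeds
calls, delays = [], []

def flaky(url):
    calls.append(url)
    if len(calls) < 3:
        raise RuntimeError("simulated transient failure")
    return "ok"

print(fetch_with_retry(flaky, "https://example.invalid/page", sleep=delays.append))  # ok
print(delays)  # [0.5, 1.0]
```

Passing the fetcher in as a parameter keeps the retry logic independent of how requests are actually made.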
-
- ### Get a product's full overview (rating breakdown, sentiment, pricing)
-
- ```python
- from helpers import http_get
- import re, json
-
- def get_product_overview(product_id, slug):
-     """Rating breakdown, sentiment, starting price from the product page."""
-     url = f"https://www.capterra.com/p/{product_id}/{slug}/"
-     html = http_get(url, headers={"User-Agent": "ClaudeBot"})
-
-     result = {}
-
-     # Overall rating and review count from the reviews section
-     # Appears as "\n4.7\n\nBased on 24,059 reviews\n"
-     m = re.search(r'\n([\d.]+)\n\nBased on ([\d,]+) reviews\n', html)
-     if m:
-         result["overall_rating"] = float(m.group(1))
-         result["review_count"] = int(m.group(2).replace(",", ""))
-
-     # Rating breakdown: "5(17268)\n\n4(5708)\n\n3(907)\n\n2(128)\n\n1(48)"
-     # The regex already restricts the star value to 1-5
-     breakdown = re.findall(r'\b([1-5])\((\d+)\)', html)
-     if breakdown:
-         result["rating_breakdown"] = {int(s): int(c) for s, c in breakdown}
-
-     # Sentiment: "Positive\n\n96%\n\nNeutral\n\n4%\n\nNegative\n\n1%"
-     for label, key in [("Positive", "sentiment_positive"), ("Neutral", "sentiment_neutral"), ("Negative", "sentiment_negative")]:
-         sm = re.search(rf'{label}\s*\n+\s*(\d+)%', html)
-         if sm:
-             result[key] = int(sm.group(1))
-
-     # Starting price ("Starting price\n\n$8.75\n\nPer User")
-     pm = re.search(r'Starting price\s*\n+\$?([\d.]+)', html)
-     if pm:
-         result["starting_price_usd"] = float(pm.group(1))
-
-     # Categories ("What is X used for?" links), searched near the top of the page only
-     cats = re.findall(r'\[([^\]]+)\]\(https://www\.capterra\.com/([a-z-]+-software)/\)', html[:3000])
-     if cats:
-         result["categories"] = [name for name, _ in cats]
-
-     # Sub-ratings from product page
-     for label, key in [("Value for money", "value_for_money"), ("Features", "features_rating")]:
-         sub = re.search(rf'{label}\s*\n+\s*([\d.]+)', html)
-         if sub:
-             try:
-                 val = float(sub.group(1))
-                 if 0 < val <= 5.0:
-                     result[key] = val
-             except ValueError:
-                 pass
-
-     return result
-
- overview = get_product_overview(135003, "Slack")
- print(json.dumps(overview, indent=2))
- # {
- #   "overall_rating": 4.7,
- #   "review_count": 24059,
- #   "rating_breakdown": {"5": 17268, "4": 5708, "3": 907, "2": 128, "1": 48},
- #   "sentiment_positive": 96,
- #   "sentiment_neutral": 4,
- #   "sentiment_negative": 1,
- #   "starting_price_usd": 8.75,
- #   "categories": ["Team Communication", "Collaboration", "Remote Work"]
- # }
- ```
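One way to sanity-check a parsed rating breakdown is to confirm that the star counts sum to the review count and that their weighted mean rounds to the overall rating. The sketch below runs on a hard-coded sample string in the shape quoted above, not on a live response; `parse_breakdown` and `weighted_mean` are hypothetical helper names.

```python
import re

# Sample text in the "N(count)" shape described above; not a live response
sample = "5(17268)\n\n4(5708)\n\n3(907)\n\n2(128)\n\n1(48)"

def parse_breakdown(markdown):
    """Map star value -> review count from 'N(count)' tokens."""
    return {int(s): int(c) for s, c in re.findall(r'\b([1-5])\((\d+)\)', markdown)}

def weighted_mean(breakdown):
    """Average star rating implied by the counts, or None if there are none."""
    total = sum(breakdown.values())
    if not total:
        return None
    return sum(star * n for star, n in breakdown.items()) / total

breakdown = parse_breakdown(sample)
print(sum(breakdown.values()))              # 24059
print(round(weighted_mean(breakdown), 1))   # 4.7
```

With the example figures, the counts add up to 24059 and the mean rounds to 4.7, matching the overall rating and review count shown in the example output.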
-
- ### Browse a software category
-
- Each category page returns up to 40 products on page 1, then ~24–25 per subsequent page. Pagination works via `?page=N`.
-
- ```python
- from helpers import http_get
- import re
-
- def get_category_products(category_slug, page=1):
-     """
-     List products in a Capterra category.
-     category_slug examples: 'project-management-software', 'crm-software', 'accounting-software'
-     Full list: https://www.capterra.com/categories/
-     """
-     url = f"https://www.capterra.com/{category_slug}/"
-     if page > 1:
-         url = f"https://www.capterra.com/{category_slug}/?page={page}"
-     html = http_get(url, headers={"User-Agent": "ClaudeBot"})
-
-     # Ratings: [4.6 (5732)](https://www.capterra.com/p/147657/monday-com/reviews/)
-     raw = re.findall(
-         r'\[([\d.]+)\s+\(([\d,]+)\)\]\(https://www\.capterra\.com/p/(\d+)/([^/]+)/reviews/\)',
-         html
-     )
-     # Product names from "Learn more about X" links
-     names = {pid: name for name, pid in re.findall(
-         r'\[Learn more about ([^\]]+)\]\(https://www\.capterra\.com/p/(\d+)/[^/]+/\)', html
-     )}
-
-     # Keep the first occurrence of each product ID
-     items, seen = [], set()
-     for rating, review_count, pid, slug in raw:
-         if pid not in seen:
-             seen.add(pid)
-             items.append({
-                 "product_id": int(pid),
-                 "name": names.get(pid, slug),
-                 "slug": slug,
-                 "overall_rating": float(rating),
-                 "review_count": int(review_count.replace(",", "")),
-                 "product_url": f"https://www.capterra.com/p/{pid}/{slug}/",
-                 "reviews_url": f"https://www.capterra.com/p/{pid}/{slug}/reviews/",
-             })
-     return items
-
- products = get_category_products("project-management-software", page=1)
- for p in products[:3]:
-     print(f"{p['name']}: {p['overall_rating']} ({p['review_count']} reviews)")
- # monday.com: 4.6 (5732 reviews)
- # Jira: 4.4 (15325 reviews)
- # Celoxis: 4.4 (327 reviews)
- ```
-
- ### Get all 1000+ software categories
-
- ```python
- from helpers import http_get
- import re
-
- def get_all_categories():
-     """Returns list of {name, slug} for all ~1003 Capterra software categories."""
-     html = http_get("https://www.capterra.com/categories/", headers={"User-Agent": "ClaudeBot"})
-     cats = re.findall(r'\[([^\]]+)\]\(https://www\.capterra\.com/([a-z-]+-software)/\)', html)
-     return [{"name": name, "slug": slug} for name, slug in cats]
-
- categories = get_all_categories()
- print(f"{len(categories)} categories")  # 1003
- print(categories[:3])
- # [{'name': 'AB Testing', 'slug': 'ab-testing-software'},
- #  {'name': 'Absence Management', 'slug': 'absence-management-software'}, ...]
- ```
-
- ---
-
- ## URL patterns
-
- | Page type | URL pattern |
- |-----------|-------------|
- | Product overview | `https://www.capterra.com/p/{id}/{Slug}/` |
- | Product reviews | `https://www.capterra.com/p/{id}/{Slug}/reviews/` |
- | Reviews page N | `https://www.capterra.com/p/{id}/{Slug}/reviews/?page={N}` |
- | Reviews (alt) | `https://www.capterra.com/reviews/{id}/{Slug}/` |
- | Category listing | `https://www.capterra.com/{category}-software/` |
- | Category page N | `https://www.capterra.com/{category}-software/?page={N}` |
- | All categories | `https://www.capterra.com/categories/` |
- | Product pricing | `https://www.capterra.com/p/{id}/{Slug}/pricing/` |
- | Product alternatives | `https://www.capterra.com/p/{id}/{Slug}/alternatives/` |
- | Compare A vs B | `https://www.capterra.com/compare/{id_a}-{id_b}/{Slug_a}-vs-{Slug_b}` |
-
- **Finding a product's ID:** Look in the URL of any product listing in a category page. The pattern `https://www.capterra.com/p/{id}/{Slug}/reviews/` appears in every category listing as the link target for each rating badge. The slug is case-sensitive in practice (e.g. `Slack`, not `slack`).
-
- Product IDs are stable numeric identifiers. Note that the same software vendor may have multiple product IDs under different names/versions. Always find the ID from a category search rather than guessing.
-
394
- ---
395
-
396
- ## Anti-bot measures
397
-
398
- - **Cloudflare is active on all routes** (`Server: cloudflare`, `CF-RAY` present in all response headers).
399
- - **Browser UAs (Chrome, Firefox, Safari) return HTTP 403** with `Cf-Mitigated: challenge` regardless of how complete the headers are. There is no HTTP-only bypass.
400
- - **`ClaudeBot` UA bypasses Cloudflare** and receives clean pre-rendered Markdown. Capterra explicitly allows it in `robots.txt` via `User-agent: ClaudeBot / Allow: /`. This is a deliberate AI-accessibility feature.
401
- - **Other AI bot UAs that also work**: `GPTBot`, `PerplexityBot` (also in `robots.txt` Allow list). `Anthropic-AI` was tested and returns 403 — only `ClaudeBot` is the correct UA.
402
- - **The search endpoint (`/search/?q=...`) returns empty results** via ClaudeBot — the query parameter is not passed through. Use category browsing or direct product URLs instead.
403
- - **No CAPTCHA observed** during testing with ClaudeBot.
404
- - **No rate limiting observed**: 10 parallel requests across 5 workers completed in ~2s with all 200 responses. Sequential batches of 5 pages at 0.15–0.95s per request also worked cleanly.
405
- - **The Markdown response has no JSON-LD, no `__NEXT_DATA__`** — these are HTML-only structures. The Markdown format is simpler to parse.
406
- - **Disallowed paths** (from robots.txt): `/search`, `/ppc/clicks/`, `/sem-b/`, `/sem-compare-b/`, `/workspace/`, `/auth/login`. These 403 even with ClaudeBot.
407
-
408
- ---
409
-
410
- ## Gotchas
411
-
412
- - **Old Capterra product IDs may be invalid.** The URL `https://www.capterra.com/p/56703/Slack/` (ID 56703) returns 404 even with ClaudeBot — this is a stale or merged product ID. Slack's current ID is 135003, found in the team-communication-software category listing. Always discover IDs by crawling category pages rather than hard-coding them.
413
-
414
- - **Slug is case-sensitive.** `Slack` works; `slack` returns 404. The slug is always in the category listing data.
415
-
416
- - **Response is Markdown, not HTML.** `http_get` returns pre-rendered Markdown with no HTML tags, no JSON-LD, and no `__NEXT_DATA__`. Do not attempt `BeautifulSoup` parsing. Use `re` on the text directly.
417
-
418
- - **`http_get` default UA is `Mozilla/5.0`** — this returns 403 from Capterra. Always pass `headers={"User-Agent": "ClaudeBot"}` explicitly.
419
-
420
- - **Reviews page vs product page**: The `/reviews/` page has a clean rating header (`4.7 (24059)`) on line 10. The product overview page (`/p/{id}/{Slug}/`) has the same number buried deeper in the page as `\n4.7\n\nBased on 24,059 reviews\n`. For rating extraction, the reviews page is simpler and more reliable.
421
-
422
- - **Category page 1 is larger than subsequent pages**: Page 1 includes editorial content (author bio, top-picks editorial) which can double the page size. Subsequent pages are ~20–30KB and contain only listings.
423
-
424
- - **Reviewer name is present in the text but not cleanly delimited**: The Markdown format for reviewer attribution uses plain text lines above the review body. It's easier to skip reviewer name extraction than to parse the ambiguous formatting.
425
-
426
- - **Sub-rating labels in reviews page**: "Ease of use" (lowercase 'u') and "Customer Service" (capitalized 'S') — match exactly. The product overview page may show additional sub-ratings like "Features" and "Value for money".
427
-
428
- - **`rating_breakdown` pattern caveat**: The pattern `[1-5]\(\d+\)` on the product page can also match feature ratings. To isolate the 5-star breakdown, find it within the "Filter by rating" section, which appears as a block like `5(17268)\n\n4(5708)\n\n3(907)\n\n2(128)\n\n1(48)`.
429
-
430
- ---
431
-
432
- ## When to use the browser instead
433
-
434
- The browser is not needed for any common Capterra task — the ClaudeBot flow handles all of them. Use the browser only if:
435
-
436
- - You need to interact with a page element (e.g. submit a review, use the "fit-finder" wizard).
437
- - You need to access a Capterra page that is explicitly blocked in robots.txt (e.g. `/workspace/`, `/auth/login/`).
438
- - You need to simulate a logged-in user session with Capterra credentials.
439
-
440
- For read-only scraping of product data, reviews, and category listings, `http_get` with `ClaudeBot` UA is both faster and more reliable than a browser.
1
+ # Capterra — Scraping & Data Extraction
2
+
3
+ Field-tested against capterra.com on 2026-04-18. All code blocks validated with live requests.
4
+
5
+ ## Do this first
6
+
7
+ **Use `User-Agent: ClaudeBot` — Capterra explicitly allows it in robots.txt and returns clean, pre-rendered Markdown instead of JavaScript-heavy HTML. No browser needed.**
8
+
9
+ Capterra serves a fully structured Markdown representation of every page to AI bots (`ClaudeBot`, `GPTBot`, `PerplexityBot`, and `Anthropic-AI` are all listed as `Allow: /` in robots.txt). In testing, however, `Anthropic-AI` returned 403 (see Anti-bot measures), so use `ClaudeBot`. The Markdown format is far easier to parse than HTML.
10
+
11
+ With the default `Mozilla/5.0` UA (or any realistic browser UA), Capterra returns HTTP 403 with `Cf-Mitigated: challenge` — Cloudflare blocks all browser UA requests. There is no bypass via HTTP; those pages require a real browser session.
12
+
13
+ ```python
14
+ from helpers import http_get
15
+ import re, json
16
+
17
+ # Works everywhere:
18
+ html = http_get(
19
+ "https://www.capterra.com/p/135003/Slack/reviews/",
20
+ headers={"User-Agent": "ClaudeBot"}
21
+ )
22
+
23
+ # Extract overall rating and review count from the Markdown header line "4.7 (24059)"
24
+ m = re.search(r'^([\d.]+)\s+\(([\d,]+)\)$', html, re.MULTILINE)
25
+ print(m.group(1), m.group(2)) # 4.7 24059
26
+ ```
27
+
28
+ ---
29
+
30
+ ## Fastest approach: product summary in one call
31
+
32
+ All key metrics — overall rating, review count, sub-ratings, pagination — come from the `/reviews/` endpoint in a single request.
33
+
34
+ ```python
35
+ from helpers import http_get
36
+ import re, json
37
+
38
+ def get_product_summary(product_id, slug):
39
+ """
40
+ Returns overall rating, review count, sub-ratings.
41
+ product_id: Capterra numeric ID (e.g. 135003)
42
+ slug: URL slug (e.g. 'Slack')
43
+ """
44
+ url = f"https://www.capterra.com/p/{product_id}/{slug}/reviews/"
45
+ html = http_get(url, headers={"User-Agent": "ClaudeBot"})
46
+
47
+ result = {"product_id": product_id, "slug": slug}
48
+
49
+ # Overall rating + review count from header line "4.7 (24059)"
50
+ m = re.search(r'^([\d.]+)\s+\(([\d,]+)\)$', html, re.MULTILINE)
51
+ if m:
52
+ result["overall_rating"] = float(m.group(1))
53
+ result["review_count"] = int(m.group(2).replace(",", ""))
54
+
55
+ # Page size and total pages from "Showing 1-25 of 24059 Reviews"
56
+ showing = re.search(r"Showing\s+(\d+)[-–](\d+)\s+of\s+([\d,]+)\s+Reviews", html)
57
+ if showing:
58
+ result["per_page"] = int(showing.group(2))
59
+ result["total_pages"] = (int(showing.group(3).replace(",", "")) + 24) // 25
60
+
61
+ # Sub-ratings: "Ease of use\n\n4.6" and "Customer Service\n\n4.4"
62
+ lines = html.split("\n")
63
+ for i, line in enumerate(lines):
64
+ for label, key in [("Ease of use", "ease_of_use"), ("Customer Service", "customer_service")]:
65
+ if line.strip() == label:
66
+ for j in range(i + 1, min(i + 5, len(lines))):
67
+ try:
68
+ val = float(lines[j].strip())
69
+ if 0 < val <= 5.0:
70
+ result[key] = val
71
+ break
72
+ except ValueError:
73
+ pass
74
+
75
+ return result
76
+
77
+ summary = get_product_summary(135003, "Slack")
78
+ print(json.dumps(summary, indent=2))
79
+ # {
80
+ # "product_id": 135003,
81
+ # "slug": "Slack",
82
+ # "overall_rating": 4.7,
83
+ # "review_count": 24059,
84
+ # "per_page": 25,
85
+ # "total_pages": 963,
86
+ # "ease_of_use": 4.6,
87
+ # "customer_service": 4.4
88
+ # }
89
+ ```
90
+
91
+ ---
92
+
93
+ ## Common workflows
94
+
95
+ ### Get reviews (paginated)
96
+
97
+ 25 reviews per page. Use `?page=N` for pagination.
98
+
99
+ ```python
100
+ from helpers import http_get
101
+ import re
102
+
103
+ def get_reviews_page(product_id, slug, page=1):
104
+ """
105
+ Returns up to 25 reviews for one page.
106
+ Total pages = ceil(review_count / 25).
107
+ """
108
+ url = f"https://www.capterra.com/p/{product_id}/{slug}/reviews/?page={page}"
109
+ html = http_get(url, headers={"User-Agent": "ClaudeBot"})
110
+
111
+ # Total review count from header
112
+ m = re.search(r'^([\d.]+)\s+\(([\d,]+)\)$', html, re.MULTILINE)
113
+ total = int(m.group(2).replace(",", "")) if m else 0
114
+
115
+ # Showing X-Y of Z
116
+ showing = re.search(r"Showing\s+(\d+)[-–](\d+)\s+of\s+([\d,]+)\s+Reviews", html)
117
+
118
+ # Split by review title markers "### "Title""
119
+ blocks = re.split(r'\n### "', html)
120
+ reviews = []
121
+
122
+ for block in blocks[1:]:
123
+ r = {}
124
+
125
+ # Title (up to closing quote)
126
+ t = re.match(r'([^"]+)"', block)
127
+ if t:
128
+ r["title"] = t.group(1).strip()
129
+
130
+ # Date
131
+ d = re.search(
132
+ r"(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d+,\s+\d{4}",
133
+ block
134
+ )
135
+ if d:
136
+ r["date"] = d.group(0)
137
+
138
+ # Overall rating for this review (first float 1.0–5.0 between blank lines)
139
+ rm = re.search(r"\n\n([\d.]+)\n\n", block)
140
+ if rm:
141
+ val = float(rm.group(1))
142
+ if 1.0 <= val <= 5.0:
143
+ r["rating"] = val
144
+
145
+ # Pros
146
+ pros = re.search(r"\nPros\n\n(.+?)(?=\n\nCons|\n\nReview Source|\n\nSwitched|\Z)", block, re.DOTALL)
147
+ if pros:
148
+ r["pros"] = pros.group(1).strip()
149
+
150
+ # Cons
151
+ cons = re.search(r"\nCons\n\n(.+?)(?=\n\nReview Source|\n\nSwitched|\n\n##|\Z)", block, re.DOTALL)
152
+ if cons:
153
+ r["cons"] = cons.group(1).strip()
154
+
155
+ if r.get("title"):
156
+ reviews.append(r)
157
+
158
+ return {
159
+ "total": total,
160
+ "page": page,
161
+ "showing": f"{showing.group(1)}-{showing.group(2)} of {showing.group(3)}" if showing else None,
162
+ "reviews": reviews,
163
+ }
164
+
165
+ # Page 1
166
+ result = get_reviews_page(135003, "Slack", page=1)
167
+ print(f"Total reviews: {result['total']}, this page: {len(result['reviews'])}")
168
+ # Total reviews: 24059, this page: 25
169
+
170
+ print(result["reviews"][0])
171
+ # {'title': 'Love, love, love Slack!', 'date': 'April 14, 2026', 'rating': 5.0,
172
+ # 'pros': '...', 'cons': '...'}
173
+ ```
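
Review dates come back as display strings such as `'April 14, 2026'`. If you need to sort or filter by date, they can be parsed with the standard library. This is a minimal sketch: `parse_review_date` is a hypothetical helper name, and it assumes the English month names shown above and the default C/English locale.

```python
from datetime import date, datetime

def parse_review_date(text):
    """Parse a date string like 'April 14, 2026' into a datetime.date.

    Assumes English month names (as in the review Markdown) and a locale
    where %B matches them. Returns None if the text does not parse.
    """
    try:
        return datetime.strptime(text.strip(), "%B %d, %Y").date()
    except ValueError:
        return None

d = parse_review_date("April 14, 2026")
print(d.isoformat() if d else None)
```

Returning `None` instead of raising keeps a single malformed date from aborting a bulk scrape.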
174
+
175
+ ### Scrape all reviews in bulk (parallel)
176
+
177
+ 10 pages in ~2s with 5 workers. No rate limiting observed during testing.
178
+
179
+ ```python
180
+ from helpers import http_get
181
+ import re
182
+ from concurrent.futures import ThreadPoolExecutor
183
+
184
+ UA = {"User-Agent": "ClaudeBot"}
185
+
186
+ def _fetch_page(args):
187
+ product_id, slug, page = args
188
+ url = f"https://www.capterra.com/p/{product_id}/{slug}/reviews/?page={page}"
189
+ html = http_get(url, headers=UA)
190
+ blocks = re.split(r'\n### "', html)
191
+ reviews = []
192
+ for block in blocks[1:]:
193
+ r = {}
194
+ t = re.match(r'([^"]+)"', block)
195
+ if t: r["title"] = t.group(1).strip()
196
+ d = re.search(r"(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d+,\s+\d{4}", block)
197
+ if d: r["date"] = d.group(0)
198
+ rm = re.search(r"\n\n([\d.]+)\n\n", block)
199
+ if rm:
200
+ val = float(rm.group(1))
201
+ if 1.0 <= val <= 5.0: r["rating"] = val
202
+ pros = re.search(r"\nPros\n\n(.+?)(?=\n\nCons|\n\nReview Source|\n\nSwitched|\Z)", block, re.DOTALL)
203
+ if pros: r["pros"] = pros.group(1).strip()
204
+ cons = re.search(r"\nCons\n\n(.+?)(?=\n\nReview Source|\n\nSwitched|\n\n##|\Z)", block, re.DOTALL)
205
+ if cons: r["cons"] = cons.group(1).strip()
206
+ if r.get("title"): reviews.append(r)
207
+ return reviews
208
+
209
+ def get_all_reviews(product_id, slug, max_pages=None, workers=5):
210
+ """Fetch all reviews in parallel. max_pages=None fetches everything."""
211
+ # First: get total pages
212
+ summary_html = http_get(
213
+ f"https://www.capterra.com/p/{product_id}/{slug}/reviews/",
214
+ headers=UA
215
+ )
216
+ m = re.search(r'^([\d.]+)\s+\(([\d,]+)\)$', summary_html, re.MULTILINE)
217
+ total = int(m.group(2).replace(",", "")) if m else 0
218
+ total_pages = (total + 24) // 25
219
+ pages = range(1, (max_pages or total_pages) + 1)
220
+
221
+ tasks = [(product_id, slug, p) for p in pages]
222
+ all_reviews = []
223
+ with ThreadPoolExecutor(max_workers=workers) as ex:
224
+ for batch in ex.map(_fetch_page, tasks):
225
+ all_reviews.extend(batch)
226
+ return all_reviews
227
+
228
+ # Fetch first 50 reviews (2 pages) in parallel
229
+ reviews = get_all_reviews(135003, "Slack", max_pages=2, workers=2)
230
+ print(f"Fetched {len(reviews)} reviews")
231
+ # Fetched 50 reviews
232
+ ```
233
+
234
+ ### Get a product's full overview (rating breakdown, sentiment, pricing)
235
+
236
+ ```python
237
+ from helpers import http_get
238
+ import re, json
239
+
240
+ def get_product_overview(product_id, slug):
241
+ """Rating breakdown, sentiment, starting price from the product page."""
242
+ url = f"https://www.capterra.com/p/{product_id}/{slug}/"
243
+ html = http_get(url, headers={"User-Agent": "ClaudeBot"})
244
+
245
+ result = {}
246
+
247
+ # Overall rating and review count from the reviews section
248
+ # Appears as "\n4.7\n\nBased on 24,059 reviews\n"
249
+ m = re.search(r'\n([\d.]+)\n\nBased on ([\d,]+) reviews\n', html)
250
+ if m:
251
+ result["overall_rating"] = float(m.group(1))
252
+ result["review_count"] = int(m.group(2).replace(",", ""))
253
+
254
+ # Rating breakdown: "5(17268)\n\n4(5708)\n\n3(907)\n\n2(128)\n\n1(48)"
255
+ breakdown = re.findall(r'\b([1-5])\((\d+)\)', html)
256
+ if breakdown:
257
+ result["rating_breakdown"] = {int(s): int(c) for s, c in breakdown if 1 <= int(s) <= 5}
258
+
259
+ # Sentiment: "Positive\n\n96%\n\nNeutral\n\n4%\n\nNegative\n\n1%"
260
+ for label, key in [("Positive", "sentiment_positive"), ("Neutral", "sentiment_neutral"), ("Negative", "sentiment_negative")]:
261
+ sm = re.search(rf'{label}\s*\n+\s*(\d+)%', html)
262
+ if sm:
263
+ result[key] = int(sm.group(1))
264
+
265
+ # Starting price ("Starting price\n\n$8.75\n\nPer User")
266
+ pm = re.search(r'Starting price\s*\n+\$?([\d.]+)', html)
267
+ if pm:
268
+ result["starting_price_usd"] = float(pm.group(1))
269
+
270
+ # Categories ("What is X used for?" links)
271
+ cats = re.findall(r'\[([^\]]+)\]\(https://www\.capterra\.com/([a-z-]+-software)/\)', html[:3000])
272
+ if cats:
273
+ result["categories"] = [name for name, _ in cats]
274
+
275
+ # Sub-ratings from product page
276
+ for label, key in [("Value for money", "value_for_money"), ("Features", "features_rating")]:
277
+ sub = re.search(rf'{label}\s*\n+\s*([\d.]+)', html)
278
+ if sub:
279
+ try:
280
+ val = float(sub.group(1))
281
+ if 0 < val <= 5.0:
282
+ result[key] = val
283
+ except ValueError:
284
+ pass
285
+
286
+ return result
287
+
288
+ overview = get_product_overview(135003, "Slack")
289
+ print(json.dumps(overview, indent=2))
290
+ # {
291
+ # "overall_rating": 4.7,
292
+ # "review_count": 24059,
293
+ # "rating_breakdown": {"5": 17268, "4": 5708, "3": 907, "2": 128, "1": 48},
294
+ # "sentiment_positive": 96,
295
+ # "sentiment_neutral": 4,
296
+ # "sentiment_negative": 1,
297
+ # "starting_price_usd": 8.75,
298
+ # "categories": ["Team Communication", "Collaboration", "Remote Work"]
299
+ # }
300
+ ```
301
+
302
+ ### Browse a software category
303
+
304
+ Page 1 of a category returns up to 40 products; each subsequent page returns ~24–25. Paginate with `?page=N`.
305
+
306
+ ```python
307
+ from helpers import http_get
308
+ import re
309
+
310
+ def get_category_products(category_slug, page=1):
311
+ """
312
+ List products in a Capterra category.
313
+ category_slug examples: 'project-management-software', 'crm-software', 'accounting-software'
314
+ Full list: https://www.capterra.com/categories/
315
+ """
316
+ url = f"https://www.capterra.com/{category_slug}/"
317
+ if page > 1:
318
+ url = f"https://www.capterra.com/{category_slug}/?page={page}"
319
+ html = http_get(url, headers={"User-Agent": "ClaudeBot"})
320
+
321
+ # Ratings: [4.6 (5732)](https://www.capterra.com/p/147657/monday-com/reviews/)
322
+ raw = re.findall(
323
+ r'\[([\d.]+)\s+\(([\d,]+)\)\]\(https://www\.capterra\.com/p/(\d+)/([^/]+)/reviews/\)',
324
+ html
325
+ )
326
+ # Product names from "Learn more about X" links
327
+ names = {pid: name for name, pid in re.findall(
328
+ r'\[Learn more about ([^\]]+)\]\(https://www\.capterra\.com/p/(\d+)/[^/]+/\)', html
329
+ )}
330
+
331
+ items, seen = [], set()
332
+ for rating, review_count, pid, slug in raw:
333
+ if pid not in seen:
334
+ seen.add(pid)
335
+ items.append({
336
+ "product_id": int(pid),
337
+ "name": names.get(pid, slug),
338
+ "slug": slug,
339
+ "overall_rating": float(rating),
340
+ "review_count": int(review_count.replace(",", "")),
341
+ "product_url": f"https://www.capterra.com/p/{pid}/{slug}/",
342
+ "reviews_url": f"https://www.capterra.com/p/{pid}/{slug}/reviews/",
343
+ })
344
+ return items
345
+
346
+ products = get_category_products("project-management-software", page=1)
347
+ for p in products[:3]:
348
+ print(f"{p['name']}: {p['overall_rating']} ({p['review_count']} reviews)")
349
+ # monday.com: 4.6 (5732 reviews)
350
+ # Jira: 4.4 (15325 reviews)
351
+ # Celoxis: 4.4 (327 reviews)
352
+ ```
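
To walk a whole category, one option is to keep requesting `?page=N` until a page contributes no product that has not already been seen. The sketch below assumes a page-fetching callable shaped like `get_category_products(category_slug, page)` above. `iter_category_products` and `max_pages` are hypothetical names, and the stop condition is an assumption, not documented Capterra behaviour.

```python
def iter_category_products(fetch_page, category_slug, max_pages=50):
    """Yield each product once across category pages.

    fetch_page: callable (category_slug, page) -> list of dicts containing
                "product_id" (assumed to match get_category_products).
    max_pages:  hypothetical safety cap so a misbehaving page cannot loop forever.
    """
    seen = set()
    for page in range(1, max_pages + 1):
        added = 0
        for item in fetch_page(category_slug, page):
            if item["product_id"] in seen:
                continue  # skip products already yielded from an earlier page
            seen.add(item["product_id"])
            added += 1
            yield item
        if added == 0:
            break  # empty page or only repeats: treat as the end of the category
```

Passing the fetcher in as an argument keeps the loop testable without network access; in real use you would pass `get_category_products`.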
353
+
354
+ ### Get all 1000+ software categories
355
+
356
+ ```python
357
+ from helpers import http_get
358
+ import re
359
+
360
+ def get_all_categories():
361
+ """Returns list of {name, slug} for all ~1003 Capterra software categories."""
362
+ html = http_get("https://www.capterra.com/categories/", headers={"User-Agent": "ClaudeBot"})
363
+ cats = re.findall(r'\[([^\]]+)\]\(https://www\.capterra\.com/([a-z-]+-software)/\)', html)
364
+ return [{"name": name, "slug": slug} for name, slug in cats]
365
+
366
+ categories = get_all_categories()
367
+ print(f"{len(categories)} categories") # 1003
368
+ print(categories[:3])
369
+ # [{'name': 'AB Testing', 'slug': 'ab-testing-software'},
370
+ # {'name': 'Absence Management', 'slug': 'absence-management-software'}, ...]
371
+ ```
372
+
373
+ ---
374
+
375
+ ## URL patterns
376
+
377
+ | Page type | URL pattern |
378
+ |-----------|-------------|
379
+ | Product overview | `https://www.capterra.com/p/{id}/{Slug}/` |
380
+ | Product reviews | `https://www.capterra.com/p/{id}/{Slug}/reviews/` |
381
+ | Reviews page N | `https://www.capterra.com/p/{id}/{Slug}/reviews/?page={N}` |
382
+ | Reviews (alt) | `https://www.capterra.com/reviews/{id}/{Slug}/` |
383
+ | Category listing | `https://www.capterra.com/{category}-software/` |
384
+ | Category page N | `https://www.capterra.com/{category}-software/?page={N}` |
385
+ | All categories | `https://www.capterra.com/categories/` |
386
+ | Product pricing | `https://www.capterra.com/p/{id}/{Slug}/pricing/` |
387
+ | Product alternatives | `https://www.capterra.com/p/{id}/{Slug}/alternatives/` |
388
+ | Compare A vs B | `https://www.capterra.com/compare/{id_a}-{id_b}/{Slug_a}-vs-{Slug_b}` |
389
+
390
+ **Finding a product's ID:** Look at the URL of any product listing on a category page. The pattern `https://www.capterra.com/p/{id}/{Slug}/reviews/` appears in every category listing as the link target for each rating badge. The slug is case-sensitive in practice (e.g. `Slack`, not `slack`).
391
+
392
+ Product IDs are numeric and generally stable, but old IDs can go stale or be merged (see Gotchas). The same software vendor may also have multiple product IDs under different names or versions. Always take the ID from a category listing rather than guessing.
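
Given a link copied from a category listing, the ID and slug can be pulled out with one regex over the `/p/{id}/{Slug}/` pattern from the table above. This is a minimal sketch; `parse_product_url` is a hypothetical helper name.

```python
import re

# Matches the /p/{id}/{Slug}/ prefix shared by overview, reviews, and pricing URLs.
PRODUCT_URL = re.compile(r"https://www\.capterra\.com/p/(\d+)/([^/]+)/")

def parse_product_url(url):
    """Return (product_id, slug) from a Capterra product URL, or None if it does not match."""
    m = PRODUCT_URL.match(url)
    if not m:
        return None
    # The slug is returned unchanged because it is case-sensitive.
    return int(m.group(1)), m.group(2)

print(parse_product_url("https://www.capterra.com/p/135003/Slack/reviews/"))
```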
393
+
394
+ ---
395
+
396
+ ## Anti-bot measures
397
+
398
+ - **Cloudflare is active on all routes** (`Server: cloudflare`, `CF-RAY` present in all response headers).
399
+ - **Browser UAs (Chrome, Firefox, Safari) return HTTP 403** with `Cf-Mitigated: challenge` regardless of how complete the headers are. There is no HTTP-only bypass.
400
+ - **`ClaudeBot` UA bypasses Cloudflare** and receives clean pre-rendered Markdown. Capterra explicitly allows it in `robots.txt` via `User-agent: ClaudeBot / Allow: /`. This is a deliberate AI-accessibility feature.
401
+ - **Other AI bot UAs that also work**: `GPTBot`, `PerplexityBot` (also in `robots.txt` Allow list). `Anthropic-AI` was tested and returns 403; of Anthropic's UA strings, only `ClaudeBot` works.
402
+ - **The search endpoint (`/search/?q=...`) returns empty results** via ClaudeBot — the query parameter is not passed through. Use category browsing or direct product URLs instead.
403
+ - **No CAPTCHA observed** during testing with ClaudeBot.
404
+ - **No rate limiting observed**: 10 parallel requests across 5 workers completed in ~2s with all 200 responses. Sequential batches of 5 pages at 0.15–0.95s per request also worked cleanly.
405
+ - **The Markdown response has no JSON-LD, no `__NEXT_DATA__`** — these are HTML-only structures. The Markdown format is simpler to parse.
406
+ - **Disallowed paths** (from robots.txt): `/search`, `/ppc/clicks/`, `/sem-b/`, `/sem-compare-b/`, `/workspace/`, `/auth/login`. These 403 even with ClaudeBot.
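
A scraper can skip the disallowed paths up front instead of spending a request on them. The sketch below copies the prefix list from the bullet above and uses prefix matching, which is how `Disallow` rules are conventionally interpreted. `is_disallowed` is a hypothetical helper name, and the live robots.txt may have changed since testing.

```python
from urllib.parse import urlsplit

# Prefixes copied from the "Disallowed paths" list; verify against the live robots.txt.
DISALLOWED_PREFIXES = (
    "/search",
    "/ppc/clicks/",
    "/sem-b/",
    "/sem-compare-b/",
    "/workspace/",
    "/auth/login",
)

def is_disallowed(url):
    """Return True if the URL's path starts with one of the disallowed prefixes."""
    path = urlsplit(url).path
    return path.startswith(DISALLOWED_PREFIXES)

print(is_disallowed("https://www.capterra.com/search/?q=slack"))
print(is_disallowed("https://www.capterra.com/p/135003/Slack/reviews/"))
```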
407
+
408
+ ---
409
+
410
+ ## Gotchas
411
+
412
+ - **Old Capterra product IDs may be invalid.** The URL `https://www.capterra.com/p/56703/Slack/` (ID 56703) returns 404 even with ClaudeBot — this is a stale or merged product ID. Slack's current ID is 135003, found in the team-communication-software category listing. Always discover IDs by crawling category pages rather than hard-coding them.
413
+
414
+ - **Slug is case-sensitive.** `Slack` works; `slack` returns 404. The slug is always in the category listing data.
415
+
416
+ - **Response is Markdown, not HTML.** `http_get` returns pre-rendered Markdown with no HTML tags, no JSON-LD, and no `__NEXT_DATA__`. Do not attempt `BeautifulSoup` parsing. Use `re` on the text directly.
417
+
418
+ - **`http_get` default UA is `Mozilla/5.0`** — this returns 403 from Capterra. Always pass `headers={"User-Agent": "ClaudeBot"}` explicitly.
419
+
420
+ - **Reviews page vs product page**: The `/reviews/` page has a clean rating header (`4.7 (24059)`) on line 10. The product overview page (`/p/{id}/{Slug}/`) buries the same figures deeper in the page as `\n4.7\n\nBased on 24,059 reviews\n`. For rating extraction, the reviews page is simpler and more reliable.
421
+
422
+ - **Category page 1 is larger than subsequent pages**: Page 1 includes editorial content (author bio, top-picks editorial) which can double the page size. Subsequent pages are ~20–30KB and contain only listings.
423
+
424
+ - **Reviewer name is present in the text but not cleanly delimited**: The Markdown format for reviewer attribution uses plain text lines above the review body. It's easier to skip reviewer name extraction than to parse the ambiguous formatting.
425
+
426
+ - **Sub-rating labels in reviews page**: "Ease of use" (lowercase 'u') and "Customer Service" (capitalized 'S') — match exactly. The product overview page may show additional sub-ratings like "Features" and "Value for money".
427
+
428
+ - **`rating_breakdown` pattern caveat**: The pattern `[1-5]\(\d+\)` on the product page can also match feature ratings. To isolate the 5-star breakdown, find it within the "Filter by rating" section, which appears as a block like `5(17268)\n\n4(5708)\n\n3(907)\n\n2(128)\n\n1(48)`.
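
The `rating_breakdown` caveat can be handled by searching only the text after "Filter by rating" and requiring all five entries in order. This is a minimal sketch: `extract_star_breakdown` is a hypothetical helper name, and it assumes the block appears as `5(...)` down to `1(...)` separated only by whitespace, as in the example above.

```python
import re

# Five consecutive "N(count)" entries, 5 down to 1, separated only by whitespace.
STAR_BLOCK = re.compile(r"\b5\((\d+)\)\s+4\((\d+)\)\s+3\((\d+)\)\s+2\((\d+)\)\s+1\((\d+)\)")

def extract_star_breakdown(markdown):
    """Return {stars: count} from the block after 'Filter by rating', or {} if absent."""
    start = markdown.find("Filter by rating")
    if start == -1:
        return {}
    m = STAR_BLOCK.search(markdown[start:])
    if not m:
        return {}
    return {star: int(count) for star, count in zip((5, 4, 3, 2, 1), m.groups())}

sample = "Features\n\n4(12)\n\nFilter by rating\n\n5(17268)\n\n4(5708)\n\n3(907)\n\n2(128)\n\n1(48)\n"
print(extract_star_breakdown(sample))
```

Requiring the full 5-to-1 run means a stray feature rating such as `4(12)` elsewhere on the page cannot overwrite a star count.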
429
+
430
+ ---
431
+
432
+ ## When to use the browser instead
433
+
434
+ The browser is not needed for any common Capterra task — the ClaudeBot flow handles all of them. Use the browser only if:
435
+
436
+ - You need to interact with a page element (e.g. submit a review, use the "fit-finder" wizard).
437
+ - You need to access a Capterra page that is explicitly blocked in robots.txt (e.g. `/workspace/`, `/auth/login`).
438
+ - You need to simulate a logged-in user session with Capterra credentials.
439
+
440
+ For read-only scraping of product data, reviews, and category listings, `http_get` with `ClaudeBot` UA is both faster and more reliable than a browser.