@pencil-agent/nano-pencil 2.0.1 → 2.0.2

Files changed (186)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/model/custom-providers.js +1 -1
  7. package/dist/core/model-registry.js +5 -5
  8. package/dist/extensions/builtin/AGENT.md +115 -115
  9. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  10. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  11. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  12. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  13. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  14. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  15. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  16. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  17. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  18. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  19. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  20. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  91. package/dist/extensions/builtin/browser/browser.md +73 -73
  92. package/dist/extensions/builtin/browser/install.md +142 -142
  93. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  94. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  95. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  96. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  97. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  98. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  99. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  100. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  101. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  102. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  103. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  104. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  105. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  107. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  108. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  109. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  110. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  111. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  112. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  113. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  114. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  115. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  116. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  117. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  118. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  119. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  120. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  121. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  122. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  123. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  124. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  125. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  126. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  127. package/dist/extensions/builtin/goal/README.md +67 -67
  128. package/dist/extensions/builtin/grub/README.md +112 -112
  129. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  130. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  131. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  132. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  133. package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
  134. package/dist/extensions/builtin/loop/README.md +92 -92
  135. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  136. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  137. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  138. package/dist/extensions/builtin/sal/README.md +72 -72
  139. package/dist/extensions/builtin/security-audit/README.md +289 -289
  140. package/dist/extensions/builtin/team/AGENT.md +112 -112
  141. package/dist/extensions/builtin/team/TESTING.md +299 -299
  142. package/dist/extensions/builtin/token-save/README.md +56 -56
  143. package/dist/extensions/optional/AGENT.md +10 -10
  144. package/dist/modes/interactive/controllers/input-submit-controller.js +2 -2
  145. package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
  146. package/dist/modes/interactive/interactive-mode.js +19 -19
  147. package/dist/modes/interactive/theme/dark.json +85 -85
  148. package/dist/modes/interactive/theme/light.json +84 -84
  149. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  150. package/dist/modes/interactive/theme/warm.json +81 -81
  151. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  152. package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
  153. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
  154. package/docs/SDK-TESTING.md +364 -0
  155. package/docs/codex-goal-command-impl.md +1055 -1055
  156. package/docs/codex-goal-vs-grub.md +500 -500
  157. package/docs/custom-provider.md +27 -27
  158. package/docs/extensions.md +27 -27
  159. package/docs/keybindings.md +27 -27
  160. package/docs/loop 重构完成总结.md +250 -250
  161. package/docs/loop 重构完成报告.md +122 -122
  162. package/docs/loop 重构方案.md +1222 -1222
  163. package/docs/loop 重构方案实现报告.md +158 -158
  164. package/docs/loop 重构方案对比分析.md +128 -128
  165. package/docs/loop 重构计划.md +320 -320
  166. package/docs/loop-usage-examples.md +214 -214
  167. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
  168. package/docs/models.md +27 -27
  169. package/docs/packages.md +27 -27
  170. package/docs/pi-design-philosophy.md +457 -457
  171. package/docs/planmode.md +1987 -1987
  172. package/docs/prompt-templates.md +27 -27
  173. package/docs/providers.md +27 -27
  174. package/docs/sdk.md +27 -27
  175. package/docs/skills.md +27 -27
  176. package/docs/startup-performance-optimization.md +301 -0
  177. package/docs/themes.md +27 -27
  178. package/docs/tui.md +27 -27
  179. package/docs/认知地图.md +47 -0
  180. package/package.json +190 -190
  181. package/docs/cc-agent-design.md +0 -1297
  182. package/docs/cc-tui-design.md +0 -1333
  183. package/docs/nanoPencil-学习计划.md +0 -170
  184. package/docs/scan-report.md +0 -3820
  185. package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
  186. package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +0 -261
@@ -1,440 +1,440 @@
- # Capterra: Scraping & Data Extraction
-
- Field-tested against capterra.com on 2026-04-18. All code blocks validated with live requests.
-
- ## Do this first
-
- **Use `User-Agent: ClaudeBot`. Capterra explicitly allows it in robots.txt and returns clean, pre-rendered Markdown instead of JavaScript-heavy HTML. No browser needed.**
-
- Capterra serves a fully structured Markdown representation of every page to AI bots (`ClaudeBot`, `GPTBot`, `PerplexityBot`, and `Anthropic-AI` are all listed as `Allow: /` in robots.txt, although `Anthropic-AI` returned 403 in testing). The Markdown format is far easier to parse than HTML.
-
- With the default `Mozilla/5.0` UA (or any realistic browser UA), Capterra returns HTTP 403 with `Cf-Mitigated: challenge`: Cloudflare blocks all browser UA requests. There is no bypass via HTTP; those pages require a real browser session.
-
- ```python
- from helpers import http_get
- import re
-
- # Works everywhere (the response body is Markdown, despite the variable name):
- html = http_get(
-     "https://www.capterra.com/p/135003/Slack/reviews/",
-     headers={"User-Agent": "ClaudeBot"}
- )
-
- # Extract overall rating and review count from the Markdown header line "4.7 (24059)"
- m = re.search(r'^([\d.]+)\s+\(([\d,]+)\)$', html, re.MULTILINE)
- if m:  # guard against a layout change or a blocked response
-     print(m.group(1), m.group(2))  # 4.7 24059
- ```
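
A minimal sketch of the User-Agent behaviour described above, runnable without network access. `build_request` and `is_cloudflare_challenge` are hypothetical helper names, not part of `helpers`. They use the standard-library `urllib.request.Request` and assume the blocked response carries a 403 status with a `Cf-Mitigated: challenge` header, as the notes report.

```python
from urllib.request import Request

CLAUDEBOT_UA = "ClaudeBot"

def build_request(url: str, user_agent: str = CLAUDEBOT_UA) -> Request:
    """Build (but do not send) a GET request carrying the bot User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})

def is_cloudflare_challenge(status: int, headers: dict) -> bool:
    """True for the 403 + `Cf-Mitigated: challenge` response described above."""
    # Header names are case-insensitive, so normalise before the lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    return status == 403 and lowered.get("cf-mitigated", "").lower() == "challenge"

req = build_request("https://www.capterra.com/p/135003/Slack/reviews/")
print(req.get_header("User-agent"))  # urllib stores the key as "User-agent"
print(is_cloudflare_challenge(403, {"Cf-Mitigated": "challenge"}))
print(is_cloudflare_challenge(200, {"Server": "cloudflare"}))
```

Checking for the challenge explicitly lets a caller tell "blocked by Cloudflare" apart from an ordinary 403 on a disallowed path.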
-
- ---
-
- ## Fastest approach: product summary in one call
-
- All key metrics (overall rating, review count, sub-ratings, pagination) come from the `/reviews/` endpoint in a single request.
-
- ```python
- from helpers import http_get
- import re, json
-
- def get_product_summary(product_id, slug):
-     """
-     Returns overall rating, review count, sub-ratings.
-     product_id: Capterra numeric ID (e.g. 135003)
-     slug: URL slug (e.g. 'Slack')
-     """
-     url = f"https://www.capterra.com/p/{product_id}/{slug}/reviews/"
-     html = http_get(url, headers={"User-Agent": "ClaudeBot"})
-
-     result = {"product_id": product_id, "slug": slug}
-
-     # Overall rating + review count from header line "4.7 (24059)"
-     m = re.search(r'^([\d.]+)\s+\(([\d,]+)\)$', html, re.MULTILINE)
-     if m:
-         result["overall_rating"] = float(m.group(1))
-         result["review_count"] = int(m.group(2).replace(",", ""))
-
-     # Page size and total pages from "Showing 1-25 of 24059 Reviews"
-     showing = re.search(r"Showing\s+(\d+)[-–](\d+)\s+of\s+([\d,]+)\s+Reviews", html)
-     if showing:
-         result["per_page"] = int(showing.group(2))
-         result["total_pages"] = (int(showing.group(3).replace(",", "")) + 24) // 25
-
-     # Sub-ratings: "Ease of use\n\n4.6" and "Customer Service\n\n4.4"
-     lines = html.split("\n")
-     for i, line in enumerate(lines):
-         for label, key in [("Ease of use", "ease_of_use"), ("Customer Service", "customer_service")]:
-             if line.strip() == label:
-                 for j in range(i + 1, min(i + 5, len(lines))):
-                     try:
-                         val = float(lines[j].strip())
-                         if 0 < val <= 5.0:
-                             result[key] = val
-                             break
-                     except ValueError:
-                         pass
-
-     return result
-
- summary = get_product_summary(135003, "Slack")
- print(json.dumps(summary, indent=2))
- # {
- #   "product_id": 135003,
- #   "slug": "Slack",
- #   "overall_rating": 4.7,
- #   "review_count": 24059,
- #   "per_page": 25,
- #   "total_pages": 963,
- #   "ease_of_use": 4.6,
- #   "customer_service": 4.4
- # }
- ```
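
The page arithmetic in the summary can be checked in isolation. The sketch below assumes the 25-per-page size reported in these notes; `total_pages` and `page_range` are illustrative names, not part of `helpers`.

```python
import math

def total_pages(review_count: int, per_page: int = 25) -> int:
    """Ceiling division: number of pages needed for `review_count` reviews."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    return math.ceil(review_count / per_page)

def page_range(page: int, review_count: int, per_page: int = 25) -> tuple:
    """1-based (first, last) review index on `page`, as in 'Showing X-Y of Z'."""
    first = (page - 1) * per_page + 1
    last = min(page * per_page, review_count)
    return first, last

print(total_pages(24059))      # 963
print(page_range(963, 24059))  # (24051, 24059)
```

`math.ceil(n / 25)` and the integer form `(n + 24) // 25` used in the summary code give the same result for non-negative counts.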
-
- ---
-
- ## Common workflows
-
- ### Get reviews (paginated)
-
- 25 reviews per page. Use `?page=N` for pagination.
-
- ```python
- from helpers import http_get
- import re
-
- def get_reviews_page(product_id, slug, page=1):
-     """
-     Returns up to 25 reviews for one page.
-     Total pages = ceil(review_count / 25).
-     """
-     url = f"https://www.capterra.com/p/{product_id}/{slug}/reviews/?page={page}"
-     html = http_get(url, headers={"User-Agent": "ClaudeBot"})
-
-     # Total review count from header
-     m = re.search(r'^([\d.]+)\s+\(([\d,]+)\)$', html, re.MULTILINE)
-     total = int(m.group(2).replace(",", "")) if m else 0
-
-     # Showing X-Y of Z
-     showing = re.search(r"Showing\s+(\d+)[-–](\d+)\s+of\s+([\d,]+)\s+Reviews", html)
-
-     # Split by review title markers "### "Title""
-     blocks = re.split(r'\n### "', html)
-     reviews = []
-
-     for block in blocks[1:]:
-         r = {}
-
-         # Title (up to closing quote)
-         t = re.match(r'([^"]+)"', block)
-         if t:
-             r["title"] = t.group(1).strip()
-
-         # Date
-         d = re.search(
-             r"(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d+,\s+\d{4}",
-             block
-         )
-         if d:
-             r["date"] = d.group(0)
-
-         # Overall rating for this review (first float 1.0–5.0 between blank lines)
-         rm = re.search(r"\n\n([\d.]+)\n\n", block)
-         if rm:
-             val = float(rm.group(1))
-             if 1.0 <= val <= 5.0:
-                 r["rating"] = val
-
-         # Pros
-         pros = re.search(r"\nPros\n\n(.+?)(?=\n\nCons|\n\nReview Source|\n\nSwitched|\Z)", block, re.DOTALL)
-         if pros:
-             r["pros"] = pros.group(1).strip()
-
-         # Cons
-         cons = re.search(r"\nCons\n\n(.+?)(?=\n\nReview Source|\n\nSwitched|\n\n##|\Z)", block, re.DOTALL)
-         if cons:
-             r["cons"] = cons.group(1).strip()
-
-         if r.get("title"):
-             reviews.append(r)
-
-     return {
-         "total": total,
-         "page": page,
-         "showing": f"{showing.group(1)}-{showing.group(2)} of {showing.group(3)}" if showing else None,
-         "reviews": reviews,
-     }
-
- # Page 1
- result = get_reviews_page(135003, "Slack", page=1)
- print(f"Total reviews: {result['total']}, this page: {len(result['reviews'])}")
- # Total reviews: 24059, this page: 25
-
- print(result["reviews"][0])
- # {'title': 'Love, love, love Slack!', 'date': 'April 14, 2026', 'rating': 5.0,
- #  'pros': '...', 'cons': '...'}
- ```
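
The per-review rating lookup above inspects only the first standalone number in a block, so a block whose first such number falls outside 1.0 to 5.0 ends up with no rating. A possible hardening, sketched here against a made-up block (the sample text is hypothetical, not captured from Capterra), scans every standalone number and returns the first one in range. `first_rating` is an illustrative name.

```python
import re

# A number on its own line between blank lines; the lookahead leaves the
# trailing blank line unconsumed so back-to-back numbers are all visited.
NUMBER_LINE = re.compile(r"\n\n([\d.]+)(?=\n\n)")

def first_rating(block: str):
    """Return the first standalone number in [1.0, 5.0], or None."""
    for match in NUMBER_LINE.finditer(block):
        try:
            value = float(match.group(1))
        except ValueError:  # "..." also matches [\d.]+ but is not a float
            continue
        if 1.0 <= value <= 5.0:
            return value
    return None

# Hypothetical block: "24" precedes the real rating "4.5".
sample = 'Solid chat app"\n\nUsed for 2 years\n\n24\n\n4.5\n\nPros\n\nFast search\n\nCons\n\nNoisy'
print(first_rating(sample))  # 4.5
```

Whether real review blocks ever contain an earlier out-of-range number is not established by these notes; the sketch only removes that assumption.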
-
- ### Scrape all reviews in bulk (parallel)
-
- 10 pages in ~2s with 5 workers. No rate limiting observed during testing.
-
- ```python
- from helpers import http_get
- import re
- from concurrent.futures import ThreadPoolExecutor
-
- UA = {"User-Agent": "ClaudeBot"}
-
- def _fetch_page(args):
-     product_id, slug, page = args
-     url = f"https://www.capterra.com/p/{product_id}/{slug}/reviews/?page={page}"
-     html = http_get(url, headers=UA)
-     blocks = re.split(r'\n### "', html)
-     reviews = []
-     for block in blocks[1:]:
-         r = {}
-         t = re.match(r'([^"]+)"', block)
-         if t: r["title"] = t.group(1).strip()
-         d = re.search(r"(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d+,\s+\d{4}", block)
-         if d: r["date"] = d.group(0)
-         rm = re.search(r"\n\n([\d.]+)\n\n", block)
-         if rm:
-             val = float(rm.group(1))
-             if 1.0 <= val <= 5.0: r["rating"] = val
-         pros = re.search(r"\nPros\n\n(.+?)(?=\n\nCons|\n\nReview Source|\n\nSwitched|\Z)", block, re.DOTALL)
-         if pros: r["pros"] = pros.group(1).strip()
-         cons = re.search(r"\nCons\n\n(.+?)(?=\n\nReview Source|\n\nSwitched|\n\n##|\Z)", block, re.DOTALL)
-         if cons: r["cons"] = cons.group(1).strip()
-         if r.get("title"): reviews.append(r)
-     return reviews
-
- def get_all_reviews(product_id, slug, max_pages=None, workers=5):
-     """Fetch all reviews in parallel. max_pages=None fetches everything."""
-     # First: get total pages
-     summary_html = http_get(
-         f"https://www.capterra.com/p/{product_id}/{slug}/reviews/",
-         headers=UA
-     )
-     m = re.search(r'^([\d.]+)\s+\(([\d,]+)\)$', summary_html, re.MULTILINE)
-     total = int(m.group(2).replace(",", "")) if m else 0
-     total_pages = (total + 24) // 25
-     # Never request pages past the last one, even if max_pages is larger
-     last_page = total_pages if max_pages is None else min(max_pages, total_pages)
-     pages = range(1, last_page + 1)
-
-     tasks = [(product_id, slug, p) for p in pages]
-     all_reviews = []
-     with ThreadPoolExecutor(max_workers=workers) as ex:
-         for batch in ex.map(_fetch_page, tasks):
-             all_reviews.extend(batch)
-     return all_reviews
-
- # Fetch first 50 reviews (2 pages) in parallel
- reviews = get_all_reviews(135003, "Slack", max_pages=2, workers=2)
- print(f"Fetched {len(reviews)} reviews")
- # Fetched 50 reviews
- ```
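
No rate limiting was observed, but a long crawl may still hit transient failures. As a precaution, a page fetcher can be wrapped with retries and exponential backoff. This is a sketch under assumptions: `with_retries` is a hypothetical helper, and `flaky_fetch` is a stand-in used only to show the behaviour offline.

```python
import time

def with_retries(fetch, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Wrap `fetch` so failed calls are retried with exponential backoff."""
    def wrapped(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return fetch(*args, **kwargs)
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of attempts: surface the last error
                sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return wrapped

# Stand-in fetcher that fails twice, then succeeds.
calls = {"n": 0}
def flaky_fetch(page):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return f"page {page}"

delays = []  # record the delays instead of actually sleeping
safe_fetch = with_retries(flaky_fetch, sleep=delays.append)
print(safe_fetch(1), delays)  # page 1 [0.5, 1.0]
```

In the bulk scraper this would be applied as `ex.map(with_retries(_fetch_page), tasks)`, leaving the parsing code unchanged.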
-
- ### Get a product's full overview (rating breakdown, sentiment, pricing)
-
- ```python
- from helpers import http_get
- import re, json
-
- def get_product_overview(product_id, slug):
-     """Rating breakdown, sentiment, starting price from the product page."""
-     url = f"https://www.capterra.com/p/{product_id}/{slug}/"
-     html = http_get(url, headers={"User-Agent": "ClaudeBot"})
-
-     result = {}
-
-     # Overall rating and review count from the reviews section
-     # Appears as "\n4.7\n\nBased on 24,059 reviews\n"
-     m = re.search(r'\n([\d.]+)\n\nBased on ([\d,]+) reviews\n', html)
-     if m:
-         result["overall_rating"] = float(m.group(1))
-         result["review_count"] = int(m.group(2).replace(",", ""))
-
-     # Rating breakdown: "5(17268)\n\n4(5708)\n\n3(907)\n\n2(128)\n\n1(48)"
-     breakdown = re.findall(r'\b([1-5])\((\d+)\)', html)
-     if breakdown:
-         result["rating_breakdown"] = {int(s): int(c) for s, c in breakdown if 1 <= int(s) <= 5}
-
-     # Sentiment: "Positive\n\n96%\n\nNeutral\n\n4%\n\nNegative\n\n1%"
-     for label, key in [("Positive", "sentiment_positive"), ("Neutral", "sentiment_neutral"), ("Negative", "sentiment_negative")]:
-         sm = re.search(rf'{label}\s*\n+\s*(\d+)%', html)
-         if sm:
-             result[key] = int(sm.group(1))
-
-     # Starting price ("Starting price\n\n$8.75\n\nPer User")
-     pm = re.search(r'Starting price\s*\n+\$?([\d.]+)', html)
-     if pm:
-         result["starting_price_usd"] = float(pm.group(1))
-
-     # Categories ("What is X used for?" links)
-     cats = re.findall(r'\[([^\]]+)\]\(https://www\.capterra\.com/([a-z-]+-software)/\)', html[:3000])
-     if cats:
-         result["categories"] = [name for name, _ in cats]
-
-     # Sub-ratings from product page
-     for label, key in [("Value for money", "value_for_money"), ("Features", "features_rating")]:
-         sub = re.search(rf'{label}\s*\n+\s*([\d.]+)', html)
-         if sub:
-             try:
-                 val = float(sub.group(1))
-                 if 0 < val <= 5.0:
-                     result[key] = val
-             except ValueError:
-                 pass
-
-     return result
-
- overview = get_product_overview(135003, "Slack")
- print(json.dumps(overview, indent=2))
- # {
- #   "overall_rating": 4.7,
- #   "review_count": 24059,
- #   "rating_breakdown": {"5": 17268, "4": 5708, "3": 907, "2": 128, "1": 48},
- #   "sentiment_positive": 96,
- #   "sentiment_neutral": 4,
- #   "sentiment_negative": 1,
- #   "starting_price_usd": 8.75,
- #   "categories": ["Team Communication", "Collaboration", "Remote Work"]
- # }
- ```
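
A quick consistency check on the parsed overview: the star counts should sum to the review count, and their weighted mean should round to the headline rating. The sketch below uses the Slack figures quoted above; `breakdown_stats` is an illustrative name.

```python
def breakdown_stats(breakdown: dict) -> tuple:
    """Return (total_reviews, mean_rating) from a {stars: count} mapping."""
    total = sum(breakdown.values())
    if total == 0:
        return 0, None
    # int(stars) also accepts the string keys produced by a JSON round trip.
    mean = sum(int(stars) * count for stars, count in breakdown.items()) / total
    return total, mean

slack = {5: 17268, 4: 5708, 3: 907, 2: 128, 1: 48}
total, mean = breakdown_stats(slack)
print(total, round(mean, 1))  # 24059 4.7
```

A mismatch between `total` and `review_count` would suggest that the `[1-5](N)` pattern picked up unrelated text on the page.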
-
- ### Browse a software category
-
- Each category page returns up to 40 products on page 1, then ~24–25 per subsequent page. Pagination works via `?page=N`.
-
- ```python
- from helpers import http_get
- import re
-
- def get_category_products(category_slug, page=1):
-     """
-     List products in a Capterra category.
-     category_slug examples: 'project-management-software', 'crm-software', 'accounting-software'
-     Full list: https://www.capterra.com/categories/
-     """
-     url = f"https://www.capterra.com/{category_slug}/"
-     if page > 1:
-         url = f"https://www.capterra.com/{category_slug}/?page={page}"
-     html = http_get(url, headers={"User-Agent": "ClaudeBot"})
-
-     # Ratings: [4.6 (5732)](https://www.capterra.com/p/147657/monday-com/reviews/)
-     raw = re.findall(
-         r'\[([\d.]+)\s+\(([\d,]+)\)\]\(https://www\.capterra\.com/p/(\d+)/([^/]+)/reviews/\)',
-         html
-     )
-     # Product names from "Learn more about X" links
-     names = {pid: name for name, pid in re.findall(
-         r'\[Learn more about ([^\]]+)\]\(https://www\.capterra\.com/p/(\d+)/[^/]+/\)', html
-     )}
-
-     items, seen = [], set()
-     for rating, review_count, pid, slug in raw:
-         if pid not in seen:
-             seen.add(pid)
-             items.append({
-                 "product_id": int(pid),
-                 "name": names.get(pid, slug),
-                 "slug": slug,
-                 "overall_rating": float(rating),
-                 "review_count": int(review_count.replace(",", "")),
-                 "product_url": f"https://www.capterra.com/p/{pid}/{slug}/",
-                 "reviews_url": f"https://www.capterra.com/p/{pid}/{slug}/reviews/",
-             })
-     return items
-
- products = get_category_products("project-management-software", page=1)
- for p in products[:3]:
-     print(f"{p['name']}: {p['overall_rating']} ({p['review_count']} reviews)")
- # monday.com: 4.6 (5732 reviews)
- # Jira: 4.4 (15325 reviews)
- # Celoxis: 4.4 (327 reviews)
- ```
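
Because page sizes vary (up to 40, then roughly 24 to 25), a whole-category crawl cannot compute the page count up front. One option, sketched here with a fake page source so it runs offline, is to keep requesting pages until one adds no new product IDs. `iter_category` and `FAKE_PAGES` are hypothetical; how Capterra responds to pages past the end is an assumption, not something these notes confirm.

```python
def iter_category(fetch_page, max_pages=50):
    """Yield unique products across pages; stop when a page adds nothing new."""
    seen = set()
    for page in range(1, max_pages + 1):
        added = 0
        for item in fetch_page(page):
            if item["product_id"] in seen:
                continue  # already yielded from an earlier page
            seen.add(item["product_id"])
            added += 1
            yield item
        if added == 0:
            break  # empty or fully repeated page: assume the listing ended

# Stand-in for get_category_products: two fake pages, then nothing.
FAKE_PAGES = {
    1: [{"product_id": 1}, {"product_id": 2}],
    2: [{"product_id": 2}, {"product_id": 3}],
    3: [],
}
products = list(iter_category(lambda page: FAKE_PAGES.get(page, [])))
print([p["product_id"] for p in products])  # [1, 2, 3]
```

With the real fetcher this would be called as `iter_category(lambda p: get_category_products("crm-software", page=p))`; `max_pages` is a safety cap in case the site repeats content indefinitely.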
-
- ### Get all 1000+ software categories
-
- ```python
- from helpers import http_get
- import re
-
- def get_all_categories():
-     """Returns list of {name, slug} for all ~1003 Capterra software categories."""
-     html = http_get("https://www.capterra.com/categories/", headers={"User-Agent": "ClaudeBot"})
-     cats = re.findall(r'\[([^\]]+)\]\(https://www\.capterra\.com/([a-z-]+-software)/\)', html)
-     return [{"name": name, "slug": slug} for name, slug in cats]
-
- categories = get_all_categories()
- print(f"{len(categories)} categories") # 1003
- print(categories[:3])
- # [{'name': 'AB Testing', 'slug': 'ab-testing-software'},
- #  {'name': 'Absence Management', 'slug': 'absence-management-software'}, ...]
- ```
-
- ---
-
- ## URL patterns
-
- | Page type | URL pattern |
- |-----------|-------------|
- | Product overview | `https://www.capterra.com/p/{id}/{Slug}/` |
- | Product reviews | `https://www.capterra.com/p/{id}/{Slug}/reviews/` |
- | Reviews page N | `https://www.capterra.com/p/{id}/{Slug}/reviews/?page={N}` |
- | Reviews (alt) | `https://www.capterra.com/reviews/{id}/{Slug}/` |
- | Category listing | `https://www.capterra.com/{category}-software/` |
- | Category page N | `https://www.capterra.com/{category}-software/?page={N}` |
- | All categories | `https://www.capterra.com/categories/` |
- | Product pricing | `https://www.capterra.com/p/{id}/{Slug}/pricing/` |
- | Product alternatives | `https://www.capterra.com/p/{id}/{Slug}/alternatives/` |
- | Compare A vs B | `https://www.capterra.com/compare/{id_a}-{id_b}/{Slug_a}-vs-{Slug_b}` |
-
- **Finding a product's ID:** Look in the URL of any product listing in a category page. The pattern `https://www.capterra.com/p/{id}/{Slug}/reviews/` appears in every category listing as the link target for each rating badge. The slug is case-sensitive in practice (e.g. `Slack`, not `slack`).
-
- Product IDs are stable numeric identifiers. Note that the same software vendor may have multiple product IDs under different names/versions. Always find the ID from a category search rather than guessing.
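
The product rows of the table can be captured in a small builder and parser pair. This is a sketch; `product_url` and `parse_product_url` are illustrative names, and only the `/p/{id}/{Slug}/...` patterns listed above are covered.

```python
import re

BASE = "https://www.capterra.com"
PRODUCT_SUFFIXES = {
    "overview": "",
    "reviews": "reviews/",
    "pricing": "pricing/",
    "alternatives": "alternatives/",
}

def product_url(product_id: int, slug: str, kind: str = "overview", page: int = 1) -> str:
    """Build a product URL from the table; the slug's case is preserved."""
    url = f"{BASE}/p/{product_id}/{slug}/{PRODUCT_SUFFIXES[kind]}"
    # ?page=N is only documented for the reviews listing.
    if kind == "reviews" and page > 1:
        url += f"?page={page}"
    return url

def parse_product_url(url: str):
    """Extract (product_id, slug) from a /p/{id}/{Slug}/... URL, else None."""
    m = re.match(r"https://www\.capterra\.com/p/(\d+)/([^/]+)/", url)
    return (int(m.group(1)), m.group(2)) if m else None

print(product_url(135003, "Slack", "reviews", page=2))
print(parse_product_url("https://www.capterra.com/p/147657/monday-com/reviews/"))
```

Parsing the ID and slug out of a category listing link and feeding them straight back into `product_url` avoids retyping the case-sensitive slug.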
393
-
394
- ---
395
-
396
- ## Anti-bot measures
397
-
398
- - **Cloudflare is active on all routes** (`Server: cloudflare`, `CF-RAY` present in all response headers).
399
- - **Browser UAs (Chrome, Firefox, Safari) return HTTP 403** with `Cf-Mitigated: challenge` regardless of how complete the headers are. There is no HTTP-only bypass.
400
- - **`ClaudeBot` UA bypasses Cloudflare** and receives clean pre-rendered Markdown. Capterra explicitly allows it in `robots.txt` via `User-agent: ClaudeBot / Allow: /`. This is a deliberate AI-accessibility feature.
401
- - **Other AI bot UAs that also work**: `GPTBot`, `PerplexityBot` (also in `robots.txt` Allow list). `Anthropic-AI` was tested and returns 403; of the two Anthropic UAs, only `ClaudeBot` works.
402
- - **The search endpoint (`/search/?q=...`) returns empty results** via ClaudeBot — the query parameter is not passed through. Use category browsing or direct product URLs instead.
403
- - **No CAPTCHA observed** during testing with ClaudeBot.
404
- - **No rate limiting observed**: 10 parallel requests across 5 workers completed in ~2s with all 200 responses. Sequential batches of 5 pages at 0.15–0.95s per request also worked cleanly.
405
- - **The Markdown response has no JSON-LD, no `__NEXT_DATA__`** — these are HTML-only structures. The Markdown format is simpler to parse.
406
- - **Disallowed paths** (from robots.txt): `/search`, `/ppc/clicks/`, `/sem-b/`, `/sem-compare-b/`, `/workspace/`, `/auth/login`. These 403 even with ClaudeBot.
407
-
408
- ---
409
-
410
- ## Gotchas
411
-
412
- - **Old Capterra product IDs may be invalid.** The URL `https://www.capterra.com/p/56703/Slack/` (ID 56703) returns 404 even with ClaudeBot — this is a stale or merged product ID. Slack's current ID is 135003, found in the team-communication-software category listing. Always discover IDs by crawling category pages rather than hard-coding them.
413
-
414
- - **Slug is case-sensitive.** `Slack` works; `slack` returns 404. The slug is always in the category listing data.
415
-
416
- - **Response is Markdown, not HTML.** `http_get` returns pre-rendered Markdown with no HTML tags, no JSON-LD, and no `__NEXT_DATA__`. Do not attempt `BeautifulSoup` parsing. Use `re` on the text directly.
417
-
418
- - **`http_get` default UA is `Mozilla/5.0`** — this returns 403 from Capterra. Always pass `headers={"User-Agent": "ClaudeBot"}` explicitly.
419
-
420
- - **Reviews page vs product page**: The `/reviews/` page has a clean rating header (`4.7 (24059)`) on line 10. The product overview page (`/p/{id}/{Slug}/`) has the same number buried deeper in the page as `\n4.7\n\nBased on 24,059 reviews\n`. For rating extraction, the reviews page is simpler and more reliable.
421
-
422
- - **Category page 1 is larger than subsequent pages**: Page 1 includes editorial content (author bio, top-picks editorial) which can double the page size. Subsequent pages are ~20–30KB and contain only listings.
423
-
424
- - **Reviewer name is present in the text but not cleanly delimited**: The Markdown format for reviewer attribution uses plain text lines above the review body. It's easier to skip reviewer name extraction than to parse the ambiguous formatting.
425
-
426
- - **Sub-rating labels in reviews page**: "Ease of use" (lowercase 'u') and "Customer Service" (capitalized 'S') — match exactly. The product overview page may show additional sub-ratings like "Features" and "Value for money".
427
-
428
- - **`rating_breakdown` pattern caveat**: The pattern `[1-5]\(\d+\)` on the product page can also match feature ratings. To isolate the 5-star breakdown, find it within the "Filter by rating" section, which appears as a block like `5(17268)\n\n4(5708)\n\n3(907)\n\n2(128)\n\n1(48)`.
429
-
430
- ---
431
-
432
- ## When to use the browser instead
433
-
434
- The browser is not needed for any common Capterra task — the ClaudeBot flow handles all of them. Use the browser only if:
435
-
436
- - You need to interact with a page element (e.g. submit a review, use the "fit-finder" wizard).
437
- - You need to access a Capterra page that is explicitly blocked in robots.txt (e.g. `/workspace/`, `/auth/login/`).
438
- - You need to simulate a logged-in user session with Capterra credentials.
439
-
440
- For read-only scraping of product data, reviews, and category listings, `http_get` with `ClaudeBot` UA is both faster and more reliable than a browser.
1
+ # Capterra — Scraping & Data Extraction
2
+
3
+ Field-tested against capterra.com on 2026-04-18. All code blocks validated with live requests.
4
+
5
+ ## Do this first
6
+
7
+ **Use `User-Agent: ClaudeBot` — Capterra explicitly allows it in robots.txt and returns clean, pre-rendered Markdown instead of JavaScript-heavy HTML. No browser needed.**
8
+
9
+ Capterra serves a fully structured Markdown representation of every page to AI bots. `ClaudeBot`, `GPTBot`, `PerplexityBot`, and `Anthropic-AI` are all listed as `Allow: /` in robots.txt, although `Anthropic-AI` returned 403 in testing (see Anti-bot measures). The Markdown format is far easier to parse than HTML.
10
+
11
+ With the default `Mozilla/5.0` UA (or any realistic browser UA), Capterra returns HTTP 403 with `Cf-Mitigated: challenge`: Cloudflare blocks all browser-UA requests. There is no HTTP-only bypass for a browser UA; fetching with one requires a real browser session.
12
+
13
+ ```python
14
+ from helpers import http_get
15
+ import re, json
16
+
17
+ # Works everywhere:
18
+ html = http_get(
19
+ "https://www.capterra.com/p/135003/Slack/reviews/",
20
+ headers={"User-Agent": "ClaudeBot"}
21
+ )
22
+
23
+ # Extract overall rating and review count from the Markdown header line "4.7 (24059)"
24
+ m = re.search(r'^([\d.]+)\s+\(([\d,]+)\)$', html, re.MULTILINE)
25
+ print(m.group(1), m.group(2)) # 4.7 24059
26
+ ```
27
+
28
+ ---
29
+
30
+ ## Fastest approach: product summary in one call
31
+
32
+ All key metrics — overall rating, review count, sub-ratings, pagination — come from the `/reviews/` endpoint in a single request.
33
+
34
+ ```python
35
+ from helpers import http_get
+ import re, json
+
+ def get_product_summary(product_id, slug):
+     """
+     Returns overall rating, review count, sub-ratings.
+     product_id: Capterra numeric ID (e.g. 135003)
+     slug: URL slug (e.g. 'Slack')
+     """
+     url = f"https://www.capterra.com/p/{product_id}/{slug}/reviews/"
+     html = http_get(url, headers={"User-Agent": "ClaudeBot"})
+
+     result = {"product_id": product_id, "slug": slug}
+
+     # Overall rating + review count from header line "4.7 (24059)"
+     m = re.search(r'^([\d.]+)\s+\(([\d,]+)\)$', html, re.MULTILINE)
+     if m:
+         result["overall_rating"] = float(m.group(1))
+         result["review_count"] = int(m.group(2).replace(",", ""))
+
+     # Page size and total pages from "Showing 1-25 of 24059 Reviews"
+     showing = re.search(r"Showing\s+(\d+)[-–](\d+)\s+of\s+([\d,]+)\s+Reviews", html)
+     if showing:
+         result["per_page"] = int(showing.group(2))
+         result["total_pages"] = (int(showing.group(3).replace(",", "")) + 24) // 25
+
+     # Sub-ratings: "Ease of use\n\n4.6" and "Customer Service\n\n4.4"
+     lines = html.split("\n")
+     for i, line in enumerate(lines):
+         for label, key in [("Ease of use", "ease_of_use"), ("Customer Service", "customer_service")]:
+             if line.strip() == label:
+                 for j in range(i + 1, min(i + 5, len(lines))):
+                     try:
+                         val = float(lines[j].strip())
+                         if 0 < val <= 5.0:
+                             result[key] = val
+                             break
+                     except ValueError:
+                         pass
+
+     return result
+
+ summary = get_product_summary(135003, "Slack")
+ print(json.dumps(summary, indent=2))
+ # {
+ #   "product_id": 135003,
+ #   "slug": "Slack",
+ #   "overall_rating": 4.7,
+ #   "review_count": 24059,
+ #   "per_page": 25,
+ #   "total_pages": 963,
+ #   "ease_of_use": 4.6,
+ #   "customer_service": 4.4
+ # }
89
+ ```
90
+
91
+ ---
92
+
93
+ ## Common workflows
94
+
95
+ ### Get reviews (paginated)
96
+
97
+ 25 reviews per page. Use `?page=N` for pagination.
98
+
99
+ ```python
100
+ from helpers import http_get
+ import re
+
+ def get_reviews_page(product_id, slug, page=1):
+     """
+     Returns up to 25 reviews for one page.
+     Total pages = ceil(review_count / 25).
+     """
+     url = f"https://www.capterra.com/p/{product_id}/{slug}/reviews/?page={page}"
+     html = http_get(url, headers={"User-Agent": "ClaudeBot"})
+
+     # Total review count from header
+     m = re.search(r'^([\d.]+)\s+\(([\d,]+)\)$', html, re.MULTILINE)
+     total = int(m.group(2).replace(",", "")) if m else 0
+
+     # Showing X-Y of Z
+     showing = re.search(r"Showing\s+(\d+)[-–](\d+)\s+of\s+([\d,]+)\s+Reviews", html)
+
+     # Split by review title markers "### "Title""
+     blocks = re.split(r'\n### "', html)
+     reviews = []
+
+     for block in blocks[1:]:
+         r = {}
+
+         # Title (up to closing quote)
+         t = re.match(r'([^"]+)"', block)
+         if t:
+             r["title"] = t.group(1).strip()
+
+         # Date
+         d = re.search(
+             r"(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d+,\s+\d{4}",
+             block
+         )
+         if d:
+             r["date"] = d.group(0)
+
+         # Overall rating for this review (first float 1.0–5.0 between blank lines)
+         rm = re.search(r"\n\n([\d.]+)\n\n", block)
+         if rm:
+             val = float(rm.group(1))
+             if 1.0 <= val <= 5.0:
+                 r["rating"] = val
+
+         # Pros
+         pros = re.search(r"\nPros\n\n(.+?)(?=\n\nCons|\n\nReview Source|\n\nSwitched|\Z)", block, re.DOTALL)
+         if pros:
+             r["pros"] = pros.group(1).strip()
+
+         # Cons
+         cons = re.search(r"\nCons\n\n(.+?)(?=\n\nReview Source|\n\nSwitched|\n\n##|\Z)", block, re.DOTALL)
+         if cons:
+             r["cons"] = cons.group(1).strip()
+
+         if r.get("title"):
+             reviews.append(r)
+
+     return {
+         "total": total,
+         "page": page,
+         "showing": f"{showing.group(1)}-{showing.group(2)} of {showing.group(3)}" if showing else None,
+         "reviews": reviews,
+     }
+
+ # Page 1
+ result = get_reviews_page(135003, "Slack", page=1)
+ print(f"Total reviews: {result['total']}, this page: {len(result['reviews'])}")
+ # Total reviews: 24059, this page: 25
+
+ print(result["reviews"][0])
+ # {'title': 'Love, love, love Slack!', 'date': 'April 14, 2026', 'rating': 5.0,
+ #  'pros': '...', 'cons': '...'}
173
+ ```
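
The page arithmetic behind `?page=N` can be sketched offline. This is a minimal illustration that assumes the 25-per-page size reported in this document still holds; `total_pages` and `review_page_urls` are hypothetical helper names, not part of `helpers`.

```python
import math

PER_PAGE = 25  # page size reported in this document; treat it as an assumption


def total_pages(review_count, per_page=PER_PAGE):
    """Number of pages needed to hold review_count reviews."""
    return math.ceil(review_count / per_page)


def review_page_urls(product_id, slug, review_count, max_pages=None):
    """Build the ?page=N review URLs for one product (hypothetical helper)."""
    pages = total_pages(review_count)
    if max_pages is not None:
        pages = min(pages, max_pages)  # never go past the last real page
    base = f"https://www.capterra.com/p/{product_id}/{slug}/reviews/"
    return [f"{base}?page={n}" for n in range(1, pages + 1)]


urls = review_page_urls(135003, "Slack", 24059, max_pages=3)
print(total_pages(24059))  # 963
print(urls[-1])
```

Capping at the computed page count keeps a caller-supplied `max_pages` from requesting pages that do not exist.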
174
+
175
+ ### Scrape all reviews in bulk (parallel)
176
+
177
+ 10 pages in ~2s with 5 workers. No rate limiting observed during testing.
178
+
179
+ ```python
180
+ from helpers import http_get
+ import re
+ from concurrent.futures import ThreadPoolExecutor
+
+ UA = {"User-Agent": "ClaudeBot"}
+
+ def _fetch_page(args):
+     product_id, slug, page = args
+     url = f"https://www.capterra.com/p/{product_id}/{slug}/reviews/?page={page}"
+     html = http_get(url, headers=UA)
+     blocks = re.split(r'\n### "', html)
+     reviews = []
+     for block in blocks[1:]:
+         r = {}
+         t = re.match(r'([^"]+)"', block)
+         if t: r["title"] = t.group(1).strip()
+         d = re.search(r"(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d+,\s+\d{4}", block)
+         if d: r["date"] = d.group(0)
+         rm = re.search(r"\n\n([\d.]+)\n\n", block)
+         if rm:
+             val = float(rm.group(1))
+             if 1.0 <= val <= 5.0: r["rating"] = val
+         pros = re.search(r"\nPros\n\n(.+?)(?=\n\nCons|\n\nReview Source|\n\nSwitched|\Z)", block, re.DOTALL)
+         if pros: r["pros"] = pros.group(1).strip()
+         cons = re.search(r"\nCons\n\n(.+?)(?=\n\nReview Source|\n\nSwitched|\n\n##|\Z)", block, re.DOTALL)
+         if cons: r["cons"] = cons.group(1).strip()
+         if r.get("title"): reviews.append(r)
+     return reviews
+
+ def get_all_reviews(product_id, slug, max_pages=None, workers=5):
+     """Fetch all reviews in parallel. max_pages=None fetches everything."""
+     # First: get total pages
+     summary_html = http_get(
+         f"https://www.capterra.com/p/{product_id}/{slug}/reviews/",
+         headers=UA
+     )
+     m = re.search(r'^([\d.]+)\s+\(([\d,]+)\)$', summary_html, re.MULTILINE)
+     total = int(m.group(2).replace(",", "")) if m else 0
+     total_pages = (total + 24) // 25
+     # Cap max_pages at the real page count when the count is known
+     last_page = max_pages or total_pages
+     if total_pages:
+         last_page = min(last_page, total_pages)
+     pages = range(1, last_page + 1)
+
+     tasks = [(product_id, slug, p) for p in pages]
+     all_reviews = []
+     with ThreadPoolExecutor(max_workers=workers) as ex:
+         for batch in ex.map(_fetch_page, tasks):
+             all_reviews.extend(batch)
+     return all_reviews
+
+ # Fetch first 50 reviews (2 pages) in parallel
+ reviews = get_all_reviews(135003, "Slack", max_pages=2, workers=2)
+ print(f"Fetched {len(reviews)} reviews")
+ # Fetched 50 reviews
232
+ ```
233
+
234
+ ### Get a product's full overview (rating breakdown, sentiment, pricing)
235
+
236
+ ```python
237
+ from helpers import http_get
+ import re, json
+
+ def get_product_overview(product_id, slug):
+     """Rating breakdown, sentiment, starting price from the product page."""
+     url = f"https://www.capterra.com/p/{product_id}/{slug}/"
+     html = http_get(url, headers={"User-Agent": "ClaudeBot"})
+
+     result = {}
+
+     # Overall rating and review count from the reviews section
+     # Appears as "\n4.7\n\nBased on 24,059 reviews\n"
+     m = re.search(r'\n([\d.]+)\n\nBased on ([\d,]+) reviews\n', html)
+     if m:
+         result["overall_rating"] = float(m.group(1))
+         result["review_count"] = int(m.group(2).replace(",", ""))
+
+     # Rating breakdown: "5(17268)\n\n4(5708)\n\n3(907)\n\n2(128)\n\n1(48)"
+     # Caveat: this pattern can also match feature ratings (see Gotchas)
+     breakdown = re.findall(r'\b([1-5])\((\d+)\)', html)
+     if breakdown:
+         result["rating_breakdown"] = {int(s): int(c) for s, c in breakdown if 1 <= int(s) <= 5}
+
+     # Sentiment: "Positive\n\n96%\n\nNeutral\n\n4%\n\nNegative\n\n1%"
+     for label, key in [("Positive", "sentiment_positive"), ("Neutral", "sentiment_neutral"), ("Negative", "sentiment_negative")]:
+         sm = re.search(rf'{label}\s*\n+\s*(\d+)%', html)
+         if sm:
+             result[key] = int(sm.group(1))
+
+     # Starting price ("Starting price\n\n$8.75\n\nPer User")
+     pm = re.search(r'Starting price\s*\n+\$?([\d.]+)', html)
+     if pm:
+         result["starting_price_usd"] = float(pm.group(1))
+
+     # Categories ("What is X used for?" links)
+     cats = re.findall(r'\[([^\]]+)\]\(https://www\.capterra\.com/([a-z-]+-software)/\)', html[:3000])
+     if cats:
+         result["categories"] = [name for name, _ in cats]
+
+     # Sub-ratings from product page
+     for label, key in [("Value for money", "value_for_money"), ("Features", "features_rating")]:
+         sub = re.search(rf'{label}\s*\n+\s*([\d.]+)', html)
+         if sub:
+             try:
+                 val = float(sub.group(1))
+                 if 0 < val <= 5.0:
+                     result[key] = val
+             except ValueError:
+                 pass
+
+     return result
+
+ overview = get_product_overview(135003, "Slack")
+ print(json.dumps(overview, indent=2))
+ # {
+ #   "overall_rating": 4.7,
+ #   "review_count": 24059,
+ #   "rating_breakdown": {"5": 17268, "4": 5708, "3": 907, "2": 128, "1": 48},
+ #   "sentiment_positive": 96,
+ #   "sentiment_neutral": 4,
+ #   "sentiment_negative": 1,
+ #   "starting_price_usd": 8.75,
+ #   "categories": ["Team Communication", "Collaboration", "Remote Work"]
+ # }
300
+ ```
301
+
302
+ ### Browse a software category
303
+
304
+ Each category page returns up to 40 products on page 1, then ~24–25 per subsequent page. Pagination works via `?page=N`.
305
+
306
+ ```python
307
+ from helpers import http_get
+ import re
+
+ def get_category_products(category_slug, page=1):
+     """
+     List products in a Capterra category.
+     category_slug examples: 'project-management-software', 'crm-software', 'accounting-software'
+     Full list: https://www.capterra.com/categories/
+     """
+     url = f"https://www.capterra.com/{category_slug}/"
+     if page > 1:
+         url = f"https://www.capterra.com/{category_slug}/?page={page}"
+     html = http_get(url, headers={"User-Agent": "ClaudeBot"})
+
+     # Ratings: [4.6 (5732)](https://www.capterra.com/p/147657/monday-com/reviews/)
+     raw = re.findall(
+         r'\[([\d.]+)\s+\(([\d,]+)\)\]\(https://www\.capterra\.com/p/(\d+)/([^/]+)/reviews/\)',
+         html
+     )
+     # Product names from "Learn more about X" links
+     names = {pid: name for name, pid in re.findall(
+         r'\[Learn more about ([^\]]+)\]\(https://www\.capterra\.com/p/(\d+)/[^/]+/\)', html
+     )}
+
+     items, seen = [], set()
+     for rating, review_count, pid, slug in raw:
+         if pid not in seen:
+             seen.add(pid)
+             items.append({
+                 "product_id": int(pid),
+                 "name": names.get(pid, slug),
+                 "slug": slug,
+                 "overall_rating": float(rating),
+                 "review_count": int(review_count.replace(",", "")),
+                 "product_url": f"https://www.capterra.com/p/{pid}/{slug}/",
+                 "reviews_url": f"https://www.capterra.com/p/{pid}/{slug}/reviews/",
+             })
+     return items
+
+ products = get_category_products("project-management-software", page=1)
+ for p in products[:3]:
+     print(f"{p['name']}: {p['overall_rating']} ({p['review_count']} reviews)")
+ # monday.com: 4.6 (5732 reviews)
+ # Jira: 4.4 (15325 reviews)
+ # Celoxis: 4.4 (327 reviews)
352
+ ```
353
+
354
+ ### Get all 1000+ software categories
355
+
356
+ ```python
357
+ from helpers import http_get
+ import re
+
+ def get_all_categories():
+     """Returns list of {name, slug} for all ~1003 Capterra software categories."""
+     html = http_get("https://www.capterra.com/categories/", headers={"User-Agent": "ClaudeBot"})
+     cats = re.findall(r'\[([^\]]+)\]\(https://www\.capterra\.com/([a-z-]+-software)/\)', html)
+     return [{"name": name, "slug": slug} for name, slug in cats]
+
+ categories = get_all_categories()
+ print(f"{len(categories)} categories")  # 1003
+ print(categories[:3])
+ # [{'name': 'AB Testing', 'slug': 'ab-testing-software'},
+ #  {'name': 'Absence Management', 'slug': 'absence-management-software'}, ...]
371
+ ```
372
+
373
+ ---
374
+
375
+ ## URL patterns
376
+
377
+ | Page type | URL pattern |
378
+ |-----------|-------------|
379
+ | Product overview | `https://www.capterra.com/p/{id}/{Slug}/` |
380
+ | Product reviews | `https://www.capterra.com/p/{id}/{Slug}/reviews/` |
381
+ | Reviews page N | `https://www.capterra.com/p/{id}/{Slug}/reviews/?page={N}` |
382
+ | Reviews (alt) | `https://www.capterra.com/reviews/{id}/{Slug}/` |
383
+ | Category listing | `https://www.capterra.com/{category}-software/` |
384
+ | Category page N | `https://www.capterra.com/{category}-software/?page={N}` |
385
+ | All categories | `https://www.capterra.com/categories/` |
386
+ | Product pricing | `https://www.capterra.com/p/{id}/{Slug}/pricing/` |
387
+ | Product alternatives | `https://www.capterra.com/p/{id}/{Slug}/alternatives/` |
388
+ | Compare A vs B | `https://www.capterra.com/compare/{id_a}-{id_b}/{Slug_a}-vs-{Slug_b}` |
389
+
390
+ **Finding a product's ID:** Look in the URL of any product listing in a category page. The pattern `https://www.capterra.com/p/{id}/{Slug}/reviews/` appears in every category listing as the link target for each rating badge. The slug is case-sensitive in practice (e.g. `Slack`, not `slack`).
391
+
392
+ Product IDs are numeric identifiers that are generally stable, though old IDs can go stale (see Gotchas). Note that the same software vendor may have multiple product IDs under different names/versions. Always take the ID from a category listing rather than guessing.
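
Recovering an ID and slug from a rating-badge link can be tried offline on a pasted snippet. The sketch below assumes the badge link format quoted in this document's category example; `parse_badges` is a hypothetical helper name, and the sample string is that same example link rather than a live response.

```python
import re

# Rating-badge link format as quoted in this document's category example.
BADGE = re.compile(
    r'\[([\d.]+)\s+\(([\d,]+)\)\]'
    r'\(https://www\.capterra\.com/p/(\d+)/([^/]+)/reviews/\)'
)


def parse_badges(markdown):
    """Return one dict per rating badge found in category-page Markdown."""
    found = []
    for rating, count, pid, slug in BADGE.findall(markdown):
        found.append({
            "product_id": int(pid),
            "slug": slug,  # keep the original casing; the slug is case-sensitive
            "overall_rating": float(rating),
            "review_count": int(count.replace(",", "")),
            "reviews_url": f"https://www.capterra.com/p/{pid}/{slug}/reviews/",
        })
    return found


sample = "[4.6 (5732)](https://www.capterra.com/p/147657/monday-com/reviews/)"
badges = parse_badges(sample)
print(badges[0]["product_id"], badges[0]["slug"])  # 147657 monday-com
```

Carrying the slug through unchanged avoids the lowercase-slug 404 described under Gotchas.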
393
+
394
+ ---
395
+
396
+ ## Anti-bot measures
397
+
398
+ - **Cloudflare is active on all routes** (`Server: cloudflare`, `CF-RAY` present in all response headers).
399
+ - **Browser UAs (Chrome, Firefox, Safari) return HTTP 403** with `Cf-Mitigated: challenge` regardless of how complete the headers are. There is no HTTP-only bypass.
400
+ - **`ClaudeBot` UA bypasses Cloudflare** and receives clean pre-rendered Markdown. Capterra explicitly allows it in `robots.txt` via `User-agent: ClaudeBot / Allow: /`. This is a deliberate AI-accessibility feature.
401
+ - **Other AI bot UAs that also work**: `GPTBot`, `PerplexityBot` (also in `robots.txt` Allow list). `Anthropic-AI` was tested and returns 403; of the two Anthropic UAs, only `ClaudeBot` works.
402
+ - **The search endpoint (`/search/?q=...`) returns empty results** via ClaudeBot — the query parameter is not passed through. Use category browsing or direct product URLs instead.
403
+ - **No CAPTCHA observed** during testing with ClaudeBot.
404
+ - **No rate limiting observed**: 10 parallel requests across 5 workers completed in ~2s with all 200 responses. Sequential batches of 5 pages at 0.15–0.95s per request also worked cleanly.
405
+ - **The Markdown response has no JSON-LD, no `__NEXT_DATA__`** — these are HTML-only structures. The Markdown format is simpler to parse.
406
+ - **Disallowed paths** (from robots.txt): `/search`, `/ppc/clicks/`, `/sem-b/`, `/sem-compare-b/`, `/workspace/`, `/auth/login`. These 403 even with ClaudeBot.
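
A pre-flight check against the disallowed paths can save a wasted request. This is a rough sketch, assuming the prefix list above is still current (it is copied from this document, not re-read from robots.txt); `is_fetchable` is a hypothetical helper name.

```python
from urllib.parse import urlparse

# Disallowed prefixes as listed in this document (assumption: still accurate).
DISALLOWED = (
    "/search",
    "/ppc/clicks/",
    "/sem-b/",
    "/sem-compare-b/",
    "/workspace/",
    "/auth/login",
)


def is_fetchable(url):
    """True if the URL path does not start with a disallowed prefix.

    Prefix matching mirrors how robots.txt rules are evaluated.
    """
    path = urlparse(url).path
    return not any(path.startswith(prefix) for prefix in DISALLOWED)


print(is_fetchable("https://www.capterra.com/p/135003/Slack/reviews/"))  # True
print(is_fetchable("https://www.capterra.com/search/?q=slack"))          # False
```

Because the match is prefix-based, any path that merely begins with `/search` is rejected too; tighten the list if that turns out to be too broad.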
407
+
408
+ ---
409
+
410
+ ## Gotchas
411
+
412
+ - **Old Capterra product IDs may be invalid.** The URL `https://www.capterra.com/p/56703/Slack/` (ID 56703) returns 404 even with ClaudeBot — this is a stale or merged product ID. Slack's current ID is 135003, found in the team-communication-software category listing. Always discover IDs by crawling category pages rather than hard-coding them.
413
+
414
+ - **Slug is case-sensitive.** `Slack` works; `slack` returns 404. The slug is always in the category listing data.
415
+
416
+ - **Response is Markdown, not HTML.** `http_get` returns pre-rendered Markdown with no HTML tags, no JSON-LD, and no `__NEXT_DATA__`. Do not attempt `BeautifulSoup` parsing. Use `re` on the text directly.
417
+
418
+ - **`http_get` default UA is `Mozilla/5.0`** — this returns 403 from Capterra. Always pass `headers={"User-Agent": "ClaudeBot"}` explicitly.
419
+
420
+ - **Reviews page vs product page**: The `/reviews/` page has a clean rating header (`4.7 (24059)`) on line 10. The product overview page (`/p/{id}/{Slug}/`) has the same number buried deeper in the page as `\n4.7\n\nBased on 24,059 reviews\n`. For rating extraction, the reviews page is simpler and more reliable.
421
+
422
+ - **Category page 1 is larger than subsequent pages**: Page 1 includes editorial content (author bio, top-picks editorial) which can double the page size. Subsequent pages are ~20–30KB and contain only listings.
423
+
424
+ - **Reviewer name is present in the text but not cleanly delimited**: The Markdown format for reviewer attribution uses plain text lines above the review body. It's easier to skip reviewer name extraction than to parse the ambiguous formatting.
425
+
426
+ - **Sub-rating labels in reviews page**: "Ease of use" (lowercase 'u') and "Customer Service" (capitalized 'S') — match exactly. The product overview page may show additional sub-ratings like "Features" and "Value for money".
427
+
428
+ - **`rating_breakdown` pattern caveat**: The pattern `[1-5]\(\d+\)` on the product page can also match feature ratings. To isolate the 5-star breakdown, find it within the "Filter by rating" section, which appears as a block like `5(17268)\n\n4(5708)\n\n3(907)\n\n2(128)\n\n1(48)`.
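
One way to isolate the five-star breakdown is to require all five entries in sequence rather than matching each `N(count)` on its own. The sketch below assumes the block appears exactly as quoted above; `parse_rating_breakdown` is a hypothetical helper name, and the stray `4(12)` in the sample is made up to stand in for a feature rating.

```python
import re

# Five consecutive "N(count)" entries, 5 down to 1, separated by whitespace.
BREAKDOWN = re.compile(
    r'\b5\((\d+)\)\s+4\((\d+)\)\s+3\((\d+)\)\s+2\((\d+)\)\s+1\((\d+)\)'
)


def parse_rating_breakdown(markdown):
    """Return {stars: count} for the 5-to-1 block, or None if it is absent."""
    m = BREAKDOWN.search(markdown)
    if not m:
        return None
    return {star: int(count) for star, count in zip((5, 4, 3, 2, 1), m.groups())}


# Invented sample: a stray feature-style rating followed by the real block.
sample = (
    "Features\n\n4(12)\n\n"
    "Filter by rating\n\n5(17268)\n\n4(5708)\n\n3(907)\n\n2(128)\n\n1(48)\n"
)
print(parse_rating_breakdown(sample))
# {5: 17268, 4: 5708, 3: 907, 2: 128, 1: 48}
```

As a sanity check, the five counts from the quoted block sum to 24059, the review count reported for the same product.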
429
+
430
+ ---
431
+
432
+ ## When to use the browser instead
433
+
434
+ The browser is not needed for any common Capterra task — the ClaudeBot flow handles all of them. Use the browser only if:
435
+
436
+ - You need to interact with a page element (e.g. submit a review, use the "fit-finder" wizard).
437
+ - You need to access a Capterra page that is explicitly blocked in robots.txt (e.g. `/workspace/`, `/auth/login/`).
438
+ - You need to simulate a logged-in user session with Capterra credentials.
439
+
440
+ For read-only scraping of product data, reviews, and category listings, `http_get` with `ClaudeBot` UA is both faster and more reliable than a browser.