@pencil-agent/nano-pencil 2.0.0 → 2.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (195)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/mcp/mcp-client.d.ts +3 -1
  7. package/dist/core/mcp/mcp-client.js +6 -6
  8. package/dist/core/mcp/mcp-config.d.ts +3 -3
  9. package/dist/core/mcp/mcp-config.js +1 -1
  10. package/dist/core/mcp/mcp-manager.d.ts +5 -1
  11. package/dist/core/mcp/mcp-manager.js +1 -1
  12. package/dist/core/platform/config/resource-loader.d.ts +2 -0
  13. package/dist/core/platform/config/resource-loader.js +2 -2
  14. package/dist/core/runtime/agent-session.d.ts +12 -0
  15. package/dist/core/runtime/agent-session.js +8 -8
  16. package/dist/core/runtime/sdk.d.ts +8 -0
  17. package/dist/core/runtime/sdk.js +1 -1
  18. package/dist/extensions/builtin/AGENT.md +115 -115
  19. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  20. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  91. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  92. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  93. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  94. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  95. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  96. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  97. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  98. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  99. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  100. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  101. package/dist/extensions/builtin/browser/browser.md +73 -73
  102. package/dist/extensions/builtin/browser/install.md +142 -142
  103. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  104. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  105. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  107. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  108. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  109. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  110. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  111. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  112. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  113. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  114. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  115. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  116. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  117. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  118. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  119. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  120. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  121. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  122. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  123. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  124. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  125. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  126. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  127. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  128. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  129. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  130. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  131. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  132. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  133. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  134. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  135. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  136. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  137. package/dist/extensions/builtin/goal/README.md +67 -67
  138. package/dist/extensions/builtin/grub/README.md +112 -112
  139. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  140. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  141. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  142. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  143. package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
  144. package/dist/extensions/builtin/loop/README.md +92 -92
  145. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  146. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  147. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  148. package/dist/extensions/builtin/sal/README.md +72 -72
  149. package/dist/extensions/builtin/security-audit/README.md +289 -289
  150. package/dist/extensions/builtin/team/AGENT.md +112 -112
  151. package/dist/extensions/builtin/team/TESTING.md +299 -299
  152. package/dist/extensions/builtin/token-save/README.md +56 -56
  153. package/dist/extensions/optional/AGENT.md +10 -10
  154. package/dist/modes/interactive/interactive-mode.js +36 -36
  155. package/dist/modes/interactive/theme/dark.json +85 -85
  156. package/dist/modes/interactive/theme/light.json +84 -84
  157. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  158. package/dist/modes/interactive/theme/warm.json +81 -81
  159. package/dist/node_modules/@pencil-agent/agent-core/dist/agent-loop.js +3 -2
  160. package/dist/node_modules/@pencil-agent/agent-core/dist/structured-adaptive-agent-loop.js +2 -1
  161. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  162. package/docs/cc-agent-design.md +1297 -0
  163. package/docs/cc-tui-design.md +1333 -0
  164. package/docs/codex-goal-command-impl.md +1055 -1055
  165. package/docs/codex-goal-vs-grub.md +500 -500
  166. package/docs/custom-provider.md +27 -27
  167. package/docs/extensions.md +27 -27
  168. package/docs/keybindings.md +27 -27
  169. package/docs/loop 重构完成总结.md +250 -250
  170. package/docs/loop 重构完成报告.md +122 -122
  171. package/docs/loop 重构方案.md +1222 -1222
  172. package/docs/loop 重构方案实现报告.md +158 -158
  173. package/docs/loop 重构方案对比分析.md +128 -128
  174. package/docs/loop 重构计划.md +320 -320
  175. package/docs/loop-usage-examples.md +214 -214
  176. package/docs/models.md +27 -27
  177. package/docs/nanoPencil-学习计划.md +170 -0
  178. package/docs/packages.md +27 -27
  179. package/docs/pi-design-philosophy.md +457 -457
  180. package/docs/planmode.md +1987 -1987
  181. package/docs/prompt-templates.md +27 -27
  182. package/docs/providers.md +27 -27
  183. package/docs/scan-report.md +3820 -0
  184. package/docs/sdk.md +27 -27
  185. package/docs/skills.md +27 -27
  186. package/docs/themes.md +27 -27
  187. package/docs/tui.md +27 -27
  188. package/docs//345/257/271/346/240/207Claude-Code.md +1775 -0
  189. package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +261 -0
  190. package/package.json +190 -190
  191. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +0 -851
  192. package/docs/SDK-TESTING.md +0 -364
  193. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +0 -593
  194. package/docs/startup-performance-optimization.md +0 -301
  195. package/docs/认知地图.md +0 -47
@@ -1,543 +1,543 @@
1
- # Glassdoor — Company Data, Reviews, Jobs & Salaries
2
-
3
- Field-tested against glassdoor.com on 2026-04-18.
4
-
5
- ## Anti-bot verdict: browser required, no http_get workaround exists
6
-
7
- **`http_get` returns HTTP 403 on every Glassdoor URL without exception.**
8
-
9
- Tested endpoints (all 403):
10
- - `/Reviews/Google-Reviews-E9079.htm`
11
- - `/Overview/Working-at-Google-EI_IE9079.htm`
12
- - `/Job/jobs.htm?sc.keyword=software+engineer`
13
- - `/Salaries/software-engineer-salary-SRCH_KO0,17.htm`
14
- - `/graph` (GraphQL)
15
- - `sitemap.xml`
16
-
17
- UAs tested (all blocked): `Mozilla/5.0`, full Chrome 124, Googlebot, `curl/7.88.1`.
18
-
19
- **Stack:** Cloudflare Bot Management (`Server: cloudflare`, `Cf-Mitigated: challenge`).
20
- Challenge type: `managed` (JS-executed browser fingerprint check, no CAPTCHA widget, no user click
21
- required in a real browser). Cookie-only bypass also fails — the `__cf_bm` cookie returned in the
22
- 403 response is bound to the browser TLS fingerprint and does not grant access when replayed.
23
-
24
- `api.glassdoor.com` (the old public partner API) returned `410 Gone` — permanently shut down.
25
-
26
- **Use `goto_url()` + `wait()` exclusively. Never use `http_get` for Glassdoor.**
27
-
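Reader's note: the block signature reported above (HTTP 403 together with `Server: cloudflare` and `Cf-Mitigated: challenge`) can be recognized mechanically. The following is a minimal sketch, assuming you already hold a status code and a header mapping from whatever client issued the request. `looks_like_cf_challenge` is a hypothetical name and is not a helper shipped in this package.

```python
from typing import Mapping


def looks_like_cf_challenge(status: int, headers: Mapping[str, str]) -> bool:
    """Return True when a response matches the block signature described above."""
    # HTTP header names are case-insensitive, so normalise before comparing.
    lowered = {k.lower(): v.lower() for k, v in headers.items()}
    return (
        status == 403
        and lowered.get("server") == "cloudflare"
        and lowered.get("cf-mitigated") == "challenge"
    )
```

A positive result is a signal to switch to the browser path (`goto_url()` plus `wait()`), not something to retry over plain HTTP.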
28
- ---
29
-
30
- ## Do this first: open in a new tab, wait for CF to resolve
31
-
32
- ```python
33
- new_tab("https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm")
34
- wait_for_load()
35
- wait(5) # CF managed challenge runs for ~2-4s after readyState=complete
36
- ```
37
-
38
- `wait(5)` is mandatory. CF's managed challenge executes JS fingerprinting probes after the DOM
39
- is ready. Extracting before this resolves returns an empty or partial page.
40
-
41
- Verify you are past the challenge before extracting:
42
-
43
- ```python
44
- title = js("document.title")
45
- url = page_info()["url"]
46
- if "Security" in title or "__cf_chl_tk" in url:
47
-     # CF challenge did not resolve yet — wait longer
48
-     wait(5)
49
-     title = js("document.title")
50
-     assert "Security" not in title, "Still on CF block page"
51
- ```
52
-
53
- ---
54
-
55
- ## URL patterns
56
-
57
- | Goal | URL |
58
- |---|---|
59
- | Company reviews | `/Reviews/{Company-slug}-Reviews-E{employer_id}.htm` |
60
- | Company overview | `/Overview/Working-at-{Company-slug}-EI_IE{employer_id}.htm` |
61
- | Company jobs | `/Jobs/{Company-slug}-Jobs-E{employer_id}.htm` |
62
- | Keyword job search | `/Job/jobs.htm?sc.keyword={keyword}` |
63
- | Keyword + location | `/Job/jobs.htm?sc.keyword={keyword}&locT=C&locKeyword={city}` |
64
- | Remote jobs | `/Job/jobs.htm?sc.keyword={keyword}&remoteWorkType=1` |
65
- | Job search page 2+ | append `&p=2`, `&p=3` |
66
- | Salary page | `/Salaries/{role-slug}-salary-SRCH_KO0,{len}.htm` |
67
-
68
- Employer IDs and company slugs are stable. Example: Google = `EI_IE9079`, slug = `Google`.
69
-
70
- Find the employer ID from a search result URL or the company's Glassdoor page URL.
71
-
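Reader's note: the URL patterns in the table can be captured as small builder functions. This is a sketch, not code from the package. The function names are hypothetical, and the slug rule in `salary_url` (lower-case, spaces replaced by hyphens) is an assumption inferred from the `software-engineer` example.

```python
from urllib.parse import quote_plus

BASE = "https://www.glassdoor.com"


def reviews_url(slug: str, employer_id: int) -> str:
    return f"{BASE}/Reviews/{slug}-Reviews-E{employer_id}.htm"


def overview_url(slug: str, employer_id: int) -> str:
    return f"{BASE}/Overview/Working-at-{slug}-EI_IE{employer_id}.htm"


def job_search_url(keyword: str, page: int = 1, remote: bool = False) -> str:
    url = f"{BASE}/Job/jobs.htm?sc.keyword={quote_plus(keyword)}"
    if remote:
        url += "&remoteWorkType=1"
    if page > 1:
        # The table only documents &p=N for page 2 onwards.
        url += f"&p={page}"
    return url


def salary_url(role: str) -> str:
    # Assumption: slug = lower-cased role with whitespace collapsed to hyphens.
    slug = "-".join(role.lower().split())
    # The number after "KO0," is the character count of the slug.
    return f"{BASE}/Salaries/{slug}-salary-SRCH_KO0,{len(slug)}.htm"
```

For example, `salary_url("Software Engineer")` reproduces the salary endpoint listed under the tested endpoints, including the `17` suffix.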
72
- ---
73
-
74
- ## Workflow 1: Job search — extract result cards
75
-
76
- Glassdoor renders job cards client-side. Wait 5 seconds after load before extracting.
77
-
78
- ```python
79
- import json
80
- from urllib.parse import quote_plus
81
-
82
- query = "software engineer"
83
- new_tab(f"https://www.glassdoor.com/Job/jobs.htm?sc.keyword={quote_plus(query)}")
84
- wait_for_load()
85
- wait(5) # CF challenge + JS render
86
-
87
- # Dismiss cookie banner if present (GDPR regions)
88
- dismiss_cookie_banner()
89
-
90
- jobs = js("""
91
- (function() {
92
- // Primary selector as of 2026-04
93
- var cards = document.querySelectorAll('li[data-jobid]');
94
- if (!cards.length) {
95
- // Fallback: class-based (Next.js CSS modules use hashed suffixes — match prefix)
96
- cards = document.querySelectorAll('[class*="JobsList_jobListItem"]');
97
- }
98
- var out = [];
99
- for (var i = 0; i < cards.length; i++) {
100
- var c = cards[i];
101
- var jobId = c.getAttribute('data-jobid') || '';
102
- var titleEl = c.querySelector('[data-test="job-title"], a[class*="JobCard_jobTitle"]');
103
- var compEl = c.querySelector('[data-test="employer-name"], [class*="JobCard_employer"]');
104
- var locEl = c.querySelector('[data-test="emp-location"], [class*="JobCard_location"]');
105
- var salEl = c.querySelector('[data-test="detailSalary"], [class*="salary"], .salaryEstimate');
106
- var ratingEl = c.querySelector('[data-test="rating"], [class*="ratingNumber"]');
107
- var linkEl = c.querySelector('a[href*="/job-listing/"], a[href*="glassdoor.com/job"]');
108
- out.push({
109
- jobId,
110
- title: titleEl ? titleEl.innerText.trim() : '',
111
- company: compEl ? compEl.innerText.trim() : '',
112
- location: locEl ? locEl.innerText.trim() : '',
113
- salary: salEl ? salEl.innerText.trim() : '',
114
- rating: ratingEl ? ratingEl.innerText.trim() : '',
115
- url: linkEl ? linkEl.href : '',
116
- });
117
- }
118
- return JSON.stringify(out.filter(j => j.title));
119
- })()
120
- """)
121
-
122
- results = json.loads(jobs)
123
- for r in results:
124
-     print(r["title"], "|", r["company"], "|", r["location"])
125
- ```
126
-
127
- **If `results` is empty:** take a screenshot and check which page you are on. Glassdoor often
128
- serves a different layout under A/B tests. The screenshot will reveal the actual card selector.
129
-
130
- ```python
131
- capture_screenshot("/tmp/glassdoor_jobs.png")
132
- # Inspect the image, then adjust the querySelectorAll selector above
133
- ```
134
-
135
- ---
136
-
137
- ## Workflow 2: Job search pagination
138
-
139
- Glassdoor paginates via `&p=N` on the job search URL.
140
-
141
- ```python
142
- import json
143
- from urllib.parse import quote_plus
144
-
145
- query = "data scientist"
146
- all_jobs = []
147
-
148
- for page in range(1, 4): # pages 1-3, ~10 cards each
149
-     url = f"https://www.glassdoor.com/Job/jobs.htm?sc.keyword={quote_plus(query)}&p={page}"
150
-     goto_url(url)
151
-     wait_for_load()
152
-     wait(5 if page == 1 else 3)  # first page needs CF wait; subsequent pages are faster
153
-
154
-     if page == 1:
155
-         dismiss_cookie_banner()
156
-
157
-     batch_json = js("""
158
-     (function() {
159
-       var cards = document.querySelectorAll('li[data-jobid], [class*="JobsList_jobListItem"]');
160
-       var out = [];
161
-       for (var i = 0; i < cards.length; i++) {
162
-         var c = cards[i];
163
-         var jobId = c.getAttribute('data-jobid') || '';
164
-         var titleEl = c.querySelector('[data-test="job-title"], a[class*="JobCard_jobTitle"]');
165
-         var compEl = c.querySelector('[data-test="employer-name"]');
166
-         var locEl = c.querySelector('[data-test="emp-location"]');
167
-         var salEl = c.querySelector('[data-test="detailSalary"], [class*="salary"]');
168
-         var linkEl = c.querySelector('a[href*="/job-listing/"]');
169
-         out.push({
170
-           jobId,
171
-           title: titleEl ? titleEl.innerText.trim() : '',
172
-           company: compEl ? compEl.innerText.trim() : '',
173
-           location: locEl ? locEl.innerText.trim() : '',
174
-           salary: salEl ? salEl.innerText.trim() : '',
175
-           url: linkEl ? linkEl.href : '',
176
-         });
177
-       }
178
-       return JSON.stringify(out.filter(j => j.title));
179
-     })()
180
-     """)
181
-
182
-     batch = json.loads(batch_json)
183
-     if not batch:
184
-         break  # no more results
185
-     all_jobs.extend(batch)
186
-
187
- print(f"Collected {len(all_jobs)} jobs across {page} pages")
188
- ```
189
-
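Reader's note: the pagination loop appends every batch as-is. If consecutive pages happen to repeat a card (an assumption; the source does not say whether they do), the records can be de-duplicated afterwards. A minimal sketch, where `dedupe_jobs` is a hypothetical helper operating on the dictionaries produced by the extraction above:

```python
from typing import Iterable


def dedupe_jobs(jobs: Iterable[dict]) -> list[dict]:
    """Keep the first occurrence of each job, preserving input order."""
    seen = set()
    unique = []
    for job in jobs:
        # Cards matched by the class-based fallback may lack data-jobid,
        # so fall back to the (title, company) pair as the identity.
        key = job.get("jobId") or (job.get("title"), job.get("company"))
        if key in seen:
            continue
        seen.add(key)
        unique.append(job)
    return unique
```

Calling `dedupe_jobs(all_jobs)` after the loop leaves the collection order untouched while dropping repeats.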
190
- ---
191
-
192
- ## Workflow 3: Company overview — rating and review count
193
-
194
- Navigate to the company Overview or Reviews page. These pages require login for full content but the
195
- summary header (overall rating, review count, recommend %) is visible without login.
196
-
197
- ```python
198
- import json, re
199
-
200
- # Example: Google (employer_id=9079)
201
- employer_id = 9079
202
- company_slug = "Google"
203
-
204
- goto_url(f"https://www.glassdoor.com/Overview/Working-at-{company_slug}-EI_IE{employer_id}.htm")
205
- wait_for_load()
206
- wait(5) # CF challenge
207
-
208
- # Try __NEXT_DATA__ first — fastest and most complete
209
- next_data_raw = js("document.getElementById('__NEXT_DATA__') ? document.getElementById('__NEXT_DATA__').textContent : null")
210
-
211
- if next_data_raw:
212
- nd = json.loads(next_data_raw)
213
- # Company data lives under props.pageProps — path varies by page type
214
- # Try employer overview path
215
- props = nd.get("props", {}).get("pageProps", {})
216
- employer = props.get("employer") or props.get("employerOverview")
217
- if employer:
218
- print("Rating:", employer.get("overallRating"))
219
- print("Reviews:", employer.get("reviewCount") or employer.get("numberOfReviews"))
220
- print("Name:", employer.get("name") or employer.get("shortName"))
221
- else:
222
- # Fall back to DOM selectors
223
- summary = js("""
224
- (function() {
225
- var ratingEl = document.querySelector('[data-test="rating"], .ratingNumber, [class*="ratingNum"]');
226
- var countEl = document.querySelector('[data-test="reviewCount"], .reviewCount, [class*="reviewCount"]');
227
- var nameEl = document.querySelector('h1[data-test="employer-name"], [class*="EmployerProfile_name"]');
228
- var recEl = document.querySelector('[data-test="recommend"], [class*="recommend"]');
229
- return JSON.stringify({
230
- rating: ratingEl ? ratingEl.innerText.trim() : '',
231
- reviews: countEl ? countEl.innerText.trim() : '',
232
- name: nameEl ? nameEl.innerText.trim() : '',
233
- recommend: recEl ? recEl.innerText.trim() : '',
234
- });
235
- })()
236
- """)
237
- print(json.loads(summary))
238
- ```
239
-
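Reader's note: the `__NEXT_DATA__` lookup in this workflow can be isolated into a pure function, which makes the key fallbacks easy to test without a browser. This is a sketch under the same assumptions as the workflow itself: the key names (`employer`, `employerOverview`, `overallRating`, and so on) are the ones the guide tries, and `employer_summary` is a hypothetical name.

```python
from typing import Optional


def employer_summary(next_data: dict) -> Optional[dict]:
    """Pull name, rating and review count out of a parsed __NEXT_DATA__ payload."""
    props = next_data.get("props", {}).get("pageProps", {})
    employer = props.get("employer") or props.get("employerOverview")
    if not employer:
        return None  # caller should fall back to the DOM selectors
    return {
        "name": employer.get("name") or employer.get("shortName"),
        "rating": employer.get("overallRating"),
        "reviews": employer.get("reviewCount") or employer.get("numberOfReviews"),
    }
```

A `None` result means neither key was present, which is the case where the DOM fallback applies.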
240
- ---
241
-
242
- ## Workflow 4: Company reviews page — extract individual reviews
243
-
244
- Reviews pages show up to ~10 reviews per page without login. A login modal appears after scrolling.
245
- Extract before scrolling.
246
-
247
- ```python
248
- import json
249
-
250
- employer_id = 9079
251
- company_slug = "Google"
252
-
253
- goto_url(f"https://www.glassdoor.com/Reviews/{company_slug}-Reviews-E{employer_id}.htm")
254
- wait_for_load()
255
- wait(5)
256
-
257
- dismiss_cookie_banner()
258
-
259
- reviews = js("""
260
- (function() {
261
- // Review cards — confirmed selector pattern
262
- var cards = document.querySelectorAll('[id^="empReview_"], [data-test="review-card"], [class*="ReviewCard"]');
263
- if (!cards.length) {
264
- cards = document.querySelectorAll('article[class*="review"]');
265
- }
266
- var out = [];
267
- for (var i = 0; i < cards.length; i++) {
268
- var c = cards[i];
269
-
270
- // Overall star rating (1-5)
271
- var starsEl = c.querySelector('[data-test="review-rating"], [class*="starRating"], span[class*="ratingNumber"]');
272
- var stars = starsEl ? starsEl.innerText.trim() : '';
273
-
274
- // Pros / Cons text
275
- var prosEl = c.querySelector('[data-test="pros"], [class*="pros"], p[class*="pros"]');
276
- var consEl = c.querySelector('[data-test="cons"], [class*="cons"], p[class*="cons"]');
277
- var pros = prosEl ? prosEl.innerText.trim() : '';
278
- var cons = consEl ? consEl.innerText.trim() : '';
279
-
280
- // Review title
281
- var titleEl = c.querySelector('[data-test="review-title"], h2[class*="reviewTitle"], [class*="title"] a');
282
- var title = titleEl ? titleEl.innerText.trim() : '';
283
-
284
- // Job title of reviewer
285
- var jobTitleEl = c.querySelector('[data-test="reviewer-job-title"], [class*="reviewerInfo"], [class*="authorJobTitle"]');
286
- var jobTitle = jobTitleEl ? jobTitleEl.innerText.trim() : '';
287
-
288
- // Date
289
- var dateEl = c.querySelector('time, [data-test="review-date"], [class*="reviewDate"]');
290
- var date = dateEl ? (dateEl.getAttribute('datetime') || dateEl.innerText.trim()) : '';
291
-
292
- if (pros || cons || title) {
293
- out.push({stars, title, jobTitle, pros, cons, date});
294
- }
295
- }
296
- return JSON.stringify(out);
297
- })()
298
- """)
299
-
300
- results = json.loads(reviews)
301
- for r in results:
302
-     print(f"{r['stars']}★ | {r['title']} | {r['jobTitle']}")
303
-     print(f" + {r['pros'][:100]}")
304
-     print(f" - {r['cons'][:100]}")
305
- ```
306
-
307
- ---
308
-
309
- ## Workflow 5: Salary page — extract reported salary data
310
-
311
- ```python
312
- import json
313
- from urllib.parse import quote_plus
314
-
315
- # Salary pages use slug + character-count in the URL (n = len(role_slug))
316
- role = "software-engineer"
317
- n = len(role) # 17 for "software-engineer"
318
-
319
- goto_url(f"https://www.glassdoor.com/Salaries/{role}-salary-SRCH_KO0,{n}.htm")
320
- wait_for_load()
321
- wait(5)
322
-
323
- # Try __NEXT_DATA__ for structured salary data
324
- next_data_raw = js("document.getElementById('__NEXT_DATA__') ? document.getElementById('__NEXT_DATA__').textContent : null")
325
-
326
- if next_data_raw:
327
-     nd = json.loads(next_data_raw)
328
-     # Salary data is typically under props.pageProps.salaryData or .salaryEstimate
329
-     props = nd.get("props", {}).get("pageProps", {})
330
-     salary_data = props.get("salaryData") or props.get("payData")
331
-     if salary_data:
332
-         print(json.dumps(salary_data, indent=2))
333
-
334
- # DOM fallback
335
- salary_summary = js("""
336
- (function() {
337
- var medianEl = document.querySelector('[data-test="salary-estimate"], [class*="salaryEstimate"], [class*="median"]');
338
- var rangeEl = document.querySelector('[data-test="salary-range"], [class*="salaryRange"]');
339
- var countEl = document.querySelector('[data-test="salary-count"], [class*="salaryCount"]');
340
- return JSON.stringify({
341
- median: medianEl ? medianEl.innerText.trim() : '',
342
- range: rangeEl ? rangeEl.innerText.trim() : '',
343
- count: countEl ? countEl.innerText.trim() : '',
344
- });
345
- })()
346
- """)
347
- print(json.loads(salary_summary))
348
- ```
349
-
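Reader's note: the guide remarks that the location of data inside `__NEXT_DATA__` varies by page type. When the fixed `props.pageProps` lookup comes back empty, one option is to search the whole payload for the candidate keys. A minimal sketch; `find_keys` is a hypothetical helper, and the candidate key names are the ones the workflow already tries.

```python
from typing import Any, Iterator


def find_keys(node: Any, wanted: set) -> Iterator[tuple]:
    """Yield (path, value) for every dict key in `wanted`, at any depth."""
    stack = [((), node)]
    while stack:
        path, current = stack.pop()
        if isinstance(current, dict):
            for key, value in current.items():
                if key in wanted:
                    yield path + (key,), value
                # Keep descending so nested matches are found as well.
                stack.append((path + (key,), value))
        elif isinstance(current, list):
            for index, value in enumerate(current):
                stack.append((path + (index,), value))
```

Printing the yielded paths for `{"salaryData", "payData"}` shows where the data actually sits on the page at hand, which can then be hard-coded.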
350
- ---
351
-
352
- ## Handling the login modal
353
-
354
- Glassdoor shows a sign-in modal:
355
- - On Reviews/Salary pages: after viewing ~3-5 items (scroll-triggered)
356
- - On job detail pages: often immediately
357
-
358
- Dismiss it before extracting anything that requires scrolling:
359
-
360
- ```python
361
- def dismiss_glassdoor_login_modal():
362
-     """Close the Glassdoor sign-in modal. Safe to call if no modal is present."""
363
-     closed = js("""
364
-     (function() {
365
-       var selectors = [
366
-         '[alt="Close"]',
367
-         'button[class*="modal_closeIcon"]',
368
-         '[data-test="close-modal"]',
369
-         '[aria-label="Close"]',
370
-         'button[data-test="CloseButton"]',
371
-         '[class*="CloseButton"]',
372
-       ];
373
-       for (var i = 0; i < selectors.length; i++) {
374
-         var btn = document.querySelector(selectors[i]);
375
-         if (btn && btn.offsetParent !== null) {
376
-           btn.click();
377
-           return selectors[i];
378
-         }
379
-       }
380
-       return null;
381
-     })()
382
-     """)
383
-     if closed:
384
-         wait(1)
385
-     return closed
386
-
387
- def dismiss_cookie_banner():
388
- """Dismiss GDPR consent overlay. Safe to call even if no banner is present."""
389
- dismissed = js("""
390
- (function() {
391
- var selectors = [
392
- 'button[data-test="accept-cookies"]',
393
- '#onetrust-accept-btn-handler',
394
- 'button[id*="accept-all"]',
395
- 'button[class*="accept"]',
396
- 'button[class*="consent"]',
397
- ];
398
- for (var i = 0; i < selectors.length; i++) {
399
- var btn = document.querySelector(selectors[i]);
400
- if (btn && btn.offsetParent !== null) {
401
- btn.click();
402
- return selectors[i];
403
- }
404
- }
405
- return null;
406
- })()
407
- """)
408
- if dismissed:
409
- wait(1)
410
- return dismissed
411
- ```
412
-
413
- For Reviews/Salary pages: call `dismiss_glassdoor_login_modal()` immediately after the initial
414
- wait, before any scrolling. Once you scroll down, the modal blocks the page and the X button
415
- may itself be outside the viewport.
416
-
417
- ---
418
-
419
- ## Detecting whether you are past the CF challenge
420
-
421
- After `goto_url()` + `wait(5)`, confirm you are on the real page:
422
-
423
- ```python
424
- def glassdoor_is_cf_blocked() -> bool:
425
- """True if the CF managed challenge is still running."""
426
- title = js("document.title") or ""
427
- url = page_info()["url"]
428
- return "Security" in title or "__cf_chl_tk" in url
429
-
430
- # Usage
431
- goto_url("https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm")
432
- wait_for_load()
433
- wait(5)
434
-
435
- if glassdoor_is_cf_blocked():
436
- wait(10) # give CF extra time
437
- if glassdoor_is_cf_blocked():
438
- capture_screenshot("/tmp/glassdoor_cf_block.png")
439
- raise RuntimeError("CF challenge did not resolve — check screenshot")
440
- ```
441
-
442
- ---
443
-
444
- ## Glassdoor company ID lookup
445
-
446
- Glassdoor uses numeric employer IDs (e.g., Google = 9079, Apple = 1138, Meta = 40772).
447
- To find the ID for any company:
448
-
449
- ```python
450
- from urllib.parse import quote_plus
451
-
452
- company_name = "OpenAI"
453
- goto_url(f"https://www.glassdoor.com/Search/results.htm?keyword={quote_plus(company_name)}&locT=N")
454
- wait_for_load()
455
- wait(5)
456
-
457
- # Extract company cards from search results
458
- companies = js("""
459
- (function() {
460
- var cards = document.querySelectorAll('[data-test="employer-card"], [class*="EmployerCard"], [class*="employer-card"]');
461
- var out = [];
462
- for (var i = 0; i < cards.length; i++) {
463
- var c = cards[i];
464
- var link = c.querySelector('a[href*="Overview"], a[href*="Reviews"]');
465
- if (!link) continue;
466
- var href = link.href;
467
- // Extract employer ID: EI_IE{id} or E{id}
468
- var m = href.match(/E(?:I_IE)?(\d+)/);
469
- var empId = m ? m[1] : '';
470
- var nameEl = c.querySelector('[class*="EmployerCard_name"], h2, [class*="name"]');
471
- out.push({
472
- empId,
473
- name: nameEl ? nameEl.innerText.trim() : '',
474
- href,
475
- });
476
- }
477
- return JSON.stringify(out);
478
- })()
479
- """)
480
-
481
- import json
482
- for c in json.loads(companies):
483
- print(c["empId"], c["name"], c["href"][:60])
484
- ```
485
-
486
- ---
487
-
488
- ## Gotchas
489
-
490
- - **`http_get` is permanently blocked.** Cloudflare Bot Management blocks every IP-level request
491
- with a JS managed challenge. No User-Agent, cookie, or header combination bypasses it. The
492
- `__cf_bm` cookie returned in the 403 response is TLS-fingerprint-bound and cannot be replayed.
493
- `api.glassdoor.com` is 410 Gone (shut down). Only real Chrome via CDP works.
494
-
495
- - **`wait(5)` minimum after `wait_for_load()`.** CF's managed challenge runs for 2-4 seconds after
496
- `readyState = complete`. Extracting too early returns the challenge page HTML, not Glassdoor
497
- content. If you get empty results or the title is "Security | Glassdoor", wait longer.
498
-
499
- - **Login modal triggers on scroll, not on load.** Extract all visible content immediately on page
500
- load before any scrolling. Call `dismiss_glassdoor_login_modal()` right after the initial wait —
501
- before issuing any `scroll()` calls.
502
-
503
- - **Glassdoor shows ~10 cards without login.** Reviews and salary pages are severely limited
504
- without an account. Job search cards are more accessible (~10-15 per page). If you need 30+
505
- reviews, a logged-in session is required.
506
-
507
- - **CSS class names use Next.js hashed suffixes.** Selectors like `[class*="JobCard_jobTitle"]`
508
- match despite the hash suffix (e.g., `JobCard_jobTitle__abc12`). Never hardcode the full hashed
509
- class name — it changes with deployments. Always use `[class*="prefix"]`.
510
-
511
- - **`__NEXT_DATA__` is the fast path.** When accessible, Glassdoor's Next.js pages embed all page
512
- data in `<script id="__NEXT_DATA__" type="application/json">`. Parse it before falling back to
513
- DOM queries. Data path varies by page type: look under `props.pageProps.employer`,
514
- `props.pageProps.salaryData`, `props.pageProps.jobListings`, etc.
515
-
516
- - **Company URL slugs and IDs are stable.** The employer ID (e.g., `9079` for Google) never
517
- changes. Slugs occasionally change when a company rebrands — always verify by following the
518
- canonical redirect from a search result.
519
-
520
- - **Rate limiting.** Glassdoor rate-limits by IP after ~5 company-page loads per minute.
521
- Use `wait(5)` between consecutive company page navigations. Salary and reviews pages are heavier
522
- — use `wait(8)` between those.
523
-
524
- - **Salary URL requires character-count parameter.** The `SRCH_KO0,{n}` fragment encodes
525
- `0` (start of role name) and `n` (end, i.e., `len(role_slug)`). For `"software-engineer"` (17
526
- chars): `SRCH_KO0,17`. Wrong count returns a 404.
527
-
528
- - **`locKeyword` vs `locId` for location filter.** `locKeyword=San+Francisco` works without
529
- knowing Glassdoor's internal city ID. `locT=C` means city-type location. For metro areas,
530
- also try `locT=M`. Omit `locId` unless you have the exact numeric ID from a Glassdoor URL.
531
-
532
- - **PerimeterX is also active as a secondary layer.** After passing CF, Glassdoor runs behavioral
533
- fingerprinting. Rapid automated scrolling, mouse movement, or navigation patterns may trigger a
534
- secondary block. Mitigate with `wait(2)` between actions and avoid scripted mouse movement.
535
-
536
- - **Review and salary data require login on some accounts.** Anonymous sessions get a subset of
537
- data. If a field returns empty consistently, the page may require authentication before surfacing
538
- that data in the DOM or `__NEXT_DATA__`.
539
-
540
- - **`goto_url()` vs `new_tab()` for first navigation.** Use `new_tab()` for the very first Glassdoor
541
- page in a session. If the harness is attached to a non-Glassdoor tab, `goto_url()` can silently
542
- fail to pass the CF challenge because the existing tab may not have a clean origin context.
543
- After the first successful load, `goto_url()` works fine for subsequent Glassdoor navigations.
1
+ # Glassdoor — Company Data, Reviews, Jobs & Salaries
2
+
3
+ Field-tested against glassdoor.com on 2026-04-18.
4
+
5
+ ## Anti-bot verdict: browser required, no http_get workaround exists
6
+
7
+ **`http_get` returns HTTP 403 on every Glassdoor URL without exception.**
8
+
9
+ Tested endpoints (all 403):
10
+ - `/Reviews/Google-Reviews-E9079.htm`
11
+ - `/Overview/Working-at-Google-EI_IE9079.htm`
12
+ - `/Job/jobs.htm?sc.keyword=software+engineer`
13
+ - `/Salaries/software-engineer-salary-SRCH_KO0,17.htm`
14
+ - `/graph` (GraphQL)
15
+ - `sitemap.xml`
16
+
17
+ UAs tested (all blocked): `Mozilla/5.0`, full Chrome 124, Googlebot, `curl/7.88.1`.
18
+
19
+ **Stack:** Cloudflare Bot Management (`Server: cloudflare`, `Cf-Mitigated: challenge`).
20
+ Challenge type: `managed` (JS-executed browser fingerprint check, no CAPTCHA widget, no user click
21
+ required in a real browser). Cookie-only bypass also fails — the `__cf_bm` cookie returned in the
22
+ 403 response is bound to the browser TLS fingerprint and does not grant access when replayed.
23
+
24
+ `api.glassdoor.com` (the old public partner API) returned `410 Gone` — permanently shut down.
25
+
26
+ **Use `goto_url()` + `wait()` exclusively. Never use `http_get` for Glassdoor.**
27
+
28
+ ---
29
+
30
+ ## Do this first: open in a new tab, wait for CF to resolve
31
+
32
+ ```python
33
+ new_tab("https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm")
34
+ wait_for_load()
35
+ wait(5) # CF managed challenge runs for ~2-4s after readyState=complete
36
+ ```
37
+
38
+ `wait(5)` is mandatory. CF's managed challenge executes JS fingerprinting probes after the DOM
39
+ is ready. Extracting before this resolves returns an empty or partial page.
40
+
41
+ Verify you are past the challenge before extracting:
42
+
43
+ ```python
44
+ title = js("document.title") or ""  # js() may return None before the document has a title
+ url = page_info()["url"]
+ if "Security" in title or "__cf_chl_tk" in url:
+     # CF challenge did not resolve yet; wait longer
+     wait(5)
+     title = js("document.title") or ""
+     assert "Security" not in title, "Still on CF block page"
51
+ ```
52
+
53
+ ---
54
+
55
+ ## URL patterns
56
+
57
+ | Goal | URL |
58
+ |---|---|
59
+ | Company reviews | `/Reviews/{Company-slug}-Reviews-E{employer_id}.htm` |
60
+ | Company overview | `/Overview/Working-at-{Company-slug}-EI_IE{employer_id}.htm` |
61
+ | Company jobs | `/Jobs/{Company-slug}-Jobs-E{employer_id}.htm` |
62
+ | Keyword job search | `/Job/jobs.htm?sc.keyword={keyword}` |
63
+ | Keyword + location | `/Job/jobs.htm?sc.keyword={keyword}&locT=C&locKeyword={city}` |
64
+ | Remote jobs | `/Job/jobs.htm?sc.keyword={keyword}&remoteWorkType=1` |
65
+ | Job search page 2+ | append `&p=2`, `&p=3` |
66
+ | Salary page | `/Salaries/{role-slug}-salary-SRCH_KO0,{len}.htm` |
67
+
68
+ Employer IDs are stable, and slugs rarely change (usually only after a rebrand). Example: Google = `EI_IE9079`, slug = `Google`.
69
+
70
+ Find the employer ID from a search result URL or the company's Glassdoor page URL.
71
+
72
+ ---
73
+
74
+ ## Workflow 1: Job search — extract result cards
75
+
76
+ Glassdoor renders job cards client-side. Wait 5 seconds after load before extracting.
77
+
78
+ ```python
79
+ import json
80
+ from urllib.parse import quote_plus
81
+
82
+ query = "software engineer"
83
+ new_tab(f"https://www.glassdoor.com/Job/jobs.htm?sc.keyword={quote_plus(query)}")
84
+ wait_for_load()
85
+ wait(5) # CF challenge + JS render
86
+
87
+ # Dismiss cookie banner if present (GDPR regions)
88
+ dismiss_cookie_banner()
89
+
90
+ jobs = js("""
91
+ (function() {
92
+ // Primary selector as of 2026-04
93
+ var cards = document.querySelectorAll('li[data-jobid]');
94
+ if (!cards.length) {
95
+ // Fallback: class-based (Next.js CSS modules use hashed suffixes — match prefix)
96
+ cards = document.querySelectorAll('[class*="JobsList_jobListItem"]');
97
+ }
98
+ var out = [];
99
+ for (var i = 0; i < cards.length; i++) {
100
+ var c = cards[i];
101
+ var jobId = c.getAttribute('data-jobid') || '';
102
+ var titleEl = c.querySelector('[data-test="job-title"], a[class*="JobCard_jobTitle"]');
103
+ var compEl = c.querySelector('[data-test="employer-name"], [class*="JobCard_employer"]');
104
+ var locEl = c.querySelector('[data-test="emp-location"], [class*="JobCard_location"]');
105
+ var salEl = c.querySelector('[data-test="detailSalary"], [class*="salary"], .salaryEstimate');
106
+ var ratingEl = c.querySelector('[data-test="rating"], [class*="ratingNumber"]');
107
+ var linkEl = c.querySelector('a[href*="/job-listing/"], a[href*="glassdoor.com/job"]');
108
+ out.push({
109
+ jobId,
110
+ title: titleEl ? titleEl.innerText.trim() : '',
111
+ company: compEl ? compEl.innerText.trim() : '',
112
+ location: locEl ? locEl.innerText.trim() : '',
113
+ salary: salEl ? salEl.innerText.trim() : '',
114
+ rating: ratingEl ? ratingEl.innerText.trim() : '',
115
+ url: linkEl ? linkEl.href : '',
116
+ });
117
+ }
118
+ return JSON.stringify(out.filter(j => j.title));
119
+ })()
120
+ """)
121
+
122
+ results = json.loads(jobs)
123
+ for r in results:
+     print(r["title"], "|", r["company"], "|", r["location"])
125
+ ```
126
+
127
+ **If `results` is empty:** take a screenshot and check which page you are on. Glassdoor often
128
+ serves a different layout under A/B tests. The screenshot will reveal the actual card selector.
129
+
130
+ ```python
131
+ capture_screenshot("/tmp/glassdoor_jobs.png")
132
+ # Inspect the image, then adjust the querySelectorAll selector above
133
+ ```
134
+
135
+ ---
136
+
137
+ ## Workflow 2: Job search pagination
138
+
139
+ Glassdoor paginates via `&p=N` on the job search URL.
140
+
141
+ ```python
142
+ import json
143
+ from urllib.parse import quote_plus
144
+
145
+ query = "data scientist"
146
+ all_jobs = []
147
+
148
+ pages_fetched = 0
+ for page in range(1, 4):  # pages 1-3, ~10 cards each
+     url = f"https://www.glassdoor.com/Job/jobs.htm?sc.keyword={quote_plus(query)}&p={page}"
+     goto_url(url)
+     wait_for_load()
+     wait(5 if page == 1 else 3)  # first page needs CF wait; subsequent pages are faster
+
+     if page == 1:
+         dismiss_cookie_banner()
+
+     batch_json = js("""
+     (function() {
+         var cards = document.querySelectorAll('li[data-jobid], [class*="JobsList_jobListItem"]');
+         var out = [];
+         for (var i = 0; i < cards.length; i++) {
+             var c = cards[i];
+             var jobId = c.getAttribute('data-jobid') || '';
+             var titleEl = c.querySelector('[data-test="job-title"], a[class*="JobCard_jobTitle"]');
+             var compEl = c.querySelector('[data-test="employer-name"]');
+             var locEl = c.querySelector('[data-test="emp-location"]');
+             var salEl = c.querySelector('[data-test="detailSalary"], [class*="salary"]');
+             var linkEl = c.querySelector('a[href*="/job-listing/"]');
+             out.push({
+                 jobId,
+                 title: titleEl ? titleEl.innerText.trim() : '',
+                 company: compEl ? compEl.innerText.trim() : '',
+                 location: locEl ? locEl.innerText.trim() : '',
+                 salary: salEl ? salEl.innerText.trim() : '',
+                 url: linkEl ? linkEl.href : '',
+             });
+         }
+         return JSON.stringify(out.filter(j => j.title));
+     })()
+     """)
+
+     batch = json.loads(batch_json)
+     if not batch:
+         break  # no more results
+     all_jobs.extend(batch)
+     pages_fetched += 1  # count only pages that returned cards
+
+ print(f"Collected {len(all_jobs)} jobs across {pages_fetched} pages")
188
+ ```
189
+
190
+ ---
191
+
192
+ ## Workflow 3: Company overview — rating and review count
193
+
194
+ Navigate to the company Overview or Reviews page. These pages require login for full content but the
195
+ summary header (overall rating, review count, recommend %) is visible without login.
196
+
197
+ ```python
198
+ import json, re
199
+
200
+ # Example: Google (employer_id=9079)
201
+ employer_id = 9079
202
+ company_slug = "Google"
203
+
204
+ goto_url(f"https://www.glassdoor.com/Overview/Working-at-{company_slug}-EI_IE{employer_id}.htm")
205
+ wait_for_load()
206
+ wait(5) # CF challenge
207
+
208
+ # Try __NEXT_DATA__ first — fastest and most complete
209
+ next_data_raw = js("document.getElementById('__NEXT_DATA__') ? document.getElementById('__NEXT_DATA__').textContent : null")
210
+
211
+ employer = None
+ if next_data_raw:
+     nd = json.loads(next_data_raw)
+     # Company data lives under props.pageProps; the path varies by page type
+     # Try employer overview path
+     props = nd.get("props", {}).get("pageProps", {})
+     employer = props.get("employer") or props.get("employerOverview")
+
+ if employer:
+     print("Rating:", employer.get("overallRating"))
+     print("Reviews:", employer.get("reviewCount") or employer.get("numberOfReviews"))
+     print("Name:", employer.get("name") or employer.get("shortName"))
+ else:
+     # Fall back to DOM selectors when __NEXT_DATA__ is missing or has no employer object
+     summary = js("""
+     (function() {
+         var ratingEl = document.querySelector('[data-test="rating"], .ratingNumber, [class*="ratingNum"]');
+         var countEl = document.querySelector('[data-test="reviewCount"], .reviewCount, [class*="reviewCount"]');
+         var nameEl = document.querySelector('h1[data-test="employer-name"], [class*="EmployerProfile_name"]');
+         var recEl = document.querySelector('[data-test="recommend"], [class*="recommend"]');
+         return JSON.stringify({
+             rating: ratingEl ? ratingEl.innerText.trim() : '',
+             reviews: countEl ? countEl.innerText.trim() : '',
+             name: nameEl ? nameEl.innerText.trim() : '',
+             recommend: recEl ? recEl.innerText.trim() : '',
+         });
+     })()
+     """)
+     print(json.loads(summary))
238
+ ```
239
+
240
+ ---
241
+
242
+ ## Workflow 4: Company reviews page — extract individual reviews
243
+
244
+ Reviews pages show up to ~10 reviews per page without login. A login modal appears after scrolling.
245
+ Extract before scrolling.
246
+
247
+ ```python
248
+ import json
249
+
250
+ employer_id = 9079
251
+ company_slug = "Google"
252
+
253
+ goto_url(f"https://www.glassdoor.com/Reviews/{company_slug}-Reviews-E{employer_id}.htm")
254
+ wait_for_load()
255
+ wait(5)
256
+
257
+ dismiss_cookie_banner()
258
+
259
+ reviews = js("""
260
+ (function() {
261
+ // Review cards — confirmed selector pattern
262
+ var cards = document.querySelectorAll('[id^="empReview_"], [data-test="review-card"], [class*="ReviewCard"]');
263
+ if (!cards.length) {
264
+ cards = document.querySelectorAll('article[class*="review"]');
265
+ }
266
+ var out = [];
267
+ for (var i = 0; i < cards.length; i++) {
268
+ var c = cards[i];
269
+
270
+ // Overall star rating (1-5)
271
+ var starsEl = c.querySelector('[data-test="review-rating"], [class*="starRating"], span[class*="ratingNumber"]');
272
+ var stars = starsEl ? starsEl.innerText.trim() : '';
273
+
274
+ // Pros / Cons text
275
+ var prosEl = c.querySelector('[data-test="pros"], [class*="pros"], p[class*="pros"]');
276
+ var consEl = c.querySelector('[data-test="cons"], [class*="cons"], p[class*="cons"]');
277
+ var pros = prosEl ? prosEl.innerText.trim() : '';
278
+ var cons = consEl ? consEl.innerText.trim() : '';
279
+
280
+ // Review title
281
+ var titleEl = c.querySelector('[data-test="review-title"], h2[class*="reviewTitle"], [class*="title"] a');
282
+ var title = titleEl ? titleEl.innerText.trim() : '';
283
+
284
+ // Job title of reviewer
285
+ var jobTitleEl = c.querySelector('[data-test="reviewer-job-title"], [class*="reviewerInfo"], [class*="authorJobTitle"]');
286
+ var jobTitle = jobTitleEl ? jobTitleEl.innerText.trim() : '';
287
+
288
+ // Date
289
+ var dateEl = c.querySelector('time, [data-test="review-date"], [class*="reviewDate"]');
290
+ var date = dateEl ? (dateEl.getAttribute('datetime') || dateEl.innerText.trim()) : '';
291
+
292
+ if (pros || cons || title) {
293
+ out.push({stars, title, jobTitle, pros, cons, date});
294
+ }
295
+ }
296
+ return JSON.stringify(out);
297
+ })()
298
+ """)
299
+
300
+ results = json.loads(reviews)
301
+ for r in results:
+     print(f"{r['stars']}★ | {r['title']} | {r['jobTitle']}")
+     print(f" + {r['pros'][:100]}")
+     print(f" - {r['cons'][:100]}")
305
+ ```
306
+
307
+ ---
308
+
309
+ ## Workflow 5: Salary page — extract reported salary data
310
+
311
+ ```python
312
+ import json
313
+ from urllib.parse import quote_plus
314
+
315
+ # Salary pages use slug + character-count in the URL (n = len(role_slug))
316
+ role = "software-engineer"
317
+ n = len(role) # 17 for "software-engineer"
318
+
319
+ goto_url(f"https://www.glassdoor.com/Salaries/{role}-salary-SRCH_KO0,{n}.htm")
320
+ wait_for_load()
321
+ wait(5)
322
+
323
+ # Try __NEXT_DATA__ for structured salary data
324
+ next_data_raw = js("document.getElementById('__NEXT_DATA__') ? document.getElementById('__NEXT_DATA__').textContent : null")
325
+
326
+ if next_data_raw:
+     nd = json.loads(next_data_raw)
+     # Salary data is typically under props.pageProps.salaryData or .payData
+     props = nd.get("props", {}).get("pageProps", {})
+     salary_data = props.get("salaryData") or props.get("payData")
+     if salary_data:
+         print(json.dumps(salary_data, indent=2))
333
+
334
+ # DOM fallback
335
+ salary_summary = js("""
336
+ (function() {
337
+ var medianEl = document.querySelector('[data-test="salary-estimate"], [class*="salaryEstimate"], [class*="median"]');
338
+ var rangeEl = document.querySelector('[data-test="salary-range"], [class*="salaryRange"]');
339
+ var countEl = document.querySelector('[data-test="salary-count"], [class*="salaryCount"]');
340
+ return JSON.stringify({
341
+ median: medianEl ? medianEl.innerText.trim() : '',
342
+ range: rangeEl ? rangeEl.innerText.trim() : '',
343
+ count: countEl ? countEl.innerText.trim() : '',
344
+ });
345
+ })()
346
+ """)
347
+ print(json.loads(salary_summary))
348
+ ```
349
+
350
+ ---
351
+
352
+ ## Handling the login modal
353
+
354
+ Glassdoor shows a sign-in modal:
355
+ - On Reviews/Salary pages: after viewing ~3-5 items (scroll-triggered)
356
+ - On job detail pages: often immediately
357
+
358
+ Dismiss it before extracting anything that requires scrolling:
359
+
360
+ ```python
361
+ def dismiss_glassdoor_login_modal():
+     """Close the Glassdoor sign-in modal. Safe to call if no modal is present."""
+     closed = js("""
+     (function() {
+         var selectors = [
+             '[alt="Close"]',
+             'button[class*="modal_closeIcon"]',
+             '[data-test="close-modal"]',
+             '[aria-label="Close"]',
+             'button[data-test="CloseButton"]',
+             '[class*="CloseButton"]',
+         ];
+         for (var i = 0; i < selectors.length; i++) {
+             var btn = document.querySelector(selectors[i]);
+             // offsetParent is null for hidden elements, so only visible buttons are clicked
+             if (btn && btn.offsetParent !== null) {
+                 btn.click();
+                 return selectors[i];
+             }
+         }
+         return null;
+     })()
+     """)
+     if closed:
+         wait(1)
+     return closed
+
+ def dismiss_cookie_banner():
+     """Dismiss GDPR consent overlay. Safe to call even if no banner is present."""
+     dismissed = js("""
+     (function() {
+         var selectors = [
+             'button[data-test="accept-cookies"]',
+             '#onetrust-accept-btn-handler',
+             'button[id*="accept-all"]',
+             'button[class*="accept"]',
+             'button[class*="consent"]',
+         ];
+         for (var i = 0; i < selectors.length; i++) {
+             var btn = document.querySelector(selectors[i]);
+             if (btn && btn.offsetParent !== null) {
+                 btn.click();
+                 return selectors[i];
+             }
+         }
+         return null;
+     })()
+     """)
+     if dismissed:
+         wait(1)
+     return dismissed
411
+ ```
412
+
413
+ For Reviews/Salary pages: call `dismiss_glassdoor_login_modal()` immediately after the initial
414
+ wait, before any scrolling. Once you scroll down, the modal blocks the page and the X button
415
+ may itself be outside the viewport.
416
+
417
+ ---
418
+
419
+ ## Detecting whether you are past the CF challenge
420
+
421
+ After `goto_url()` + `wait(5)`, confirm you are on the real page:
422
+
423
+ ```python
424
+ def glassdoor_is_cf_blocked() -> bool:
+     """True if the CF managed challenge is still running."""
+     title = js("document.title") or ""
+     url = page_info()["url"]
+     return "Security" in title or "__cf_chl_tk" in url
+
+ # Usage
+ goto_url("https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm")
+ wait_for_load()
+ wait(5)
+
+ if glassdoor_is_cf_blocked():
+     wait(10)  # give CF extra time
+     if glassdoor_is_cf_blocked():
+         capture_screenshot("/tmp/glassdoor_cf_block.png")
+         raise RuntimeError("CF challenge did not resolve; check screenshot")
440
+ ```
441
+
442
+ ---
443
+
444
+ ## Glassdoor company ID lookup
445
+
446
+ Glassdoor uses numeric employer IDs (e.g., Google = 9079, Apple = 1138, Meta = 40772).
447
+ To find the ID for any company:
448
+
449
+ ```python
450
+ from urllib.parse import quote_plus
451
+
452
+ company_name = "OpenAI"
453
+ goto_url(f"https://www.glassdoor.com/Search/results.htm?keyword={quote_plus(company_name)}&locT=N")
454
+ wait_for_load()
455
+ wait(5)
456
+
457
+ # Extract company cards from search results
458
+ companies = js("""
459
+ (function() {
460
+ var cards = document.querySelectorAll('[data-test="employer-card"], [class*="EmployerCard"], [class*="employer-card"]');
461
+ var out = [];
462
+ for (var i = 0; i < cards.length; i++) {
463
+ var c = cards[i];
464
+ var link = c.querySelector('a[href*="Overview"], a[href*="Reviews"]');
465
+ if (!link) continue;
466
+ var href = link.href;
467
+ // Extract employer ID: EI_IE{id} or E{id}
468
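+ var m = href.match(/-E(?:I_IE)?(\d+)(?=[._]|$)/); // anchored on the hyphen so an "E<digit>" inside a company name is not mistaken for the ID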
469
+ var empId = m ? m[1] : '';
470
+ var nameEl = c.querySelector('[class*="EmployerCard_name"], h2, [class*="name"]');
471
+ out.push({
472
+ empId,
473
+ name: nameEl ? nameEl.innerText.trim() : '',
474
+ href,
475
+ });
476
+ }
477
+ return JSON.stringify(out);
478
+ })()
479
+ """)
480
+
481
+ import json
482
+ for c in json.loads(companies):
+     print(c["empId"], c["name"], c["href"][:60])
484
+ ```
485
+
486
+ ---
487
+
488
+ ## Gotchas
489
+
490
+ - **`http_get` is permanently blocked.** Cloudflare Bot Management blocks every plain HTTP request
491
+ with a JS managed challenge. No User-Agent, cookie, or header combination bypasses it. The
492
+ `__cf_bm` cookie returned in the 403 response is TLS-fingerprint-bound and cannot be replayed.
493
+ `api.glassdoor.com` is 410 Gone (shut down). Only real Chrome via CDP works.
494
+
495
+ - **`wait(5)` minimum after `wait_for_load()`.** CF's managed challenge runs for 2-4 seconds after
496
+ `readyState = complete`. Extracting too early returns the challenge page HTML, not Glassdoor
497
+ content. If you get empty results or the title is "Security | Glassdoor", wait longer.
498
+
499
+ - **On Reviews/Salary pages, the login modal triggers on scroll, not on load.** Extract all visible content immediately on page
500
+ load before any scrolling. Call `dismiss_glassdoor_login_modal()` right after the initial wait —
501
+ before issuing any `scroll()` calls.
502
+
503
+ - **Glassdoor shows ~10 cards without login.** Reviews and salary pages are severely limited
504
+ without an account. Job search cards are more accessible (~10-15 per page). If you need 30+
505
+ reviews, a logged-in session is required.
506
+
507
+ - **CSS class names use Next.js hashed suffixes.** Selectors like `[class*="JobCard_jobTitle"]`
508
+ match despite the hash suffix (e.g., `JobCard_jobTitle__abc12`). Never hardcode the full hashed
509
+ class name — it changes with deployments. Always use `[class*="prefix"]`.
510
+
511
+ - **`__NEXT_DATA__` is the fast path.** When accessible, Glassdoor's Next.js pages embed all page
512
+ data in `<script id="__NEXT_DATA__" type="application/json">`. Parse it before falling back to
513
+ DOM queries. Data path varies by page type: look under `props.pageProps.employer`,
514
+ `props.pageProps.salaryData`, `props.pageProps.jobListings`, etc.
515
+
516
+ - **Employer IDs are stable; URL slugs mostly are.** The employer ID (e.g., `9079` for Google) never
517
+ changes. Slugs occasionally change when a company rebrands — always verify by following the
518
+ canonical redirect from a search result.
519
+
520
+ - **Rate limiting.** Glassdoor rate-limits by IP after ~5 company-page loads per minute.
521
+ Use `wait(5)` between consecutive company page navigations. Salary and reviews pages are heavier
522
+ — use `wait(8)` between those.
523
+
524
+ - **Salary URL requires character-count parameter.** The `SRCH_KO0,{n}` fragment encodes
525
+ `0` (start of role name) and `n` (end, i.e., `len(role_slug)`). For `"software-engineer"` (17
526
+ chars): `SRCH_KO0,17`. Wrong count returns a 404.
527
+
528
+ - **`locKeyword` vs `locId` for location filter.** `locKeyword=San+Francisco` works without
529
+ knowing Glassdoor's internal city ID. `locT=C` means city-type location. For metro areas,
530
+ also try `locT=M`. Omit `locId` unless you have the exact numeric ID from a Glassdoor URL.
531
+
532
+ - **PerimeterX is also active as a secondary layer.** After passing CF, Glassdoor runs behavioral
533
+ fingerprinting. Rapid automated scrolling, mouse movement, or navigation patterns may trigger a
534
+ secondary block. Mitigate with `wait(2)` between actions and avoid scripted mouse movement.
535
+
536
+ - **Some review and salary data require login.** Anonymous sessions get a subset of
537
+ data. If a field returns empty consistently, the page may require authentication before surfacing
538
+ that data in the DOM or `__NEXT_DATA__`.
539
+
540
+ - **`goto_url()` vs `new_tab()` for first navigation.** Use `new_tab()` for the very first Glassdoor
541
+ page in a session. If the harness is attached to a non-Glassdoor tab, `goto_url()` can silently
542
+ fail to pass the CF challenge because the existing tab may not have a clean origin context.
543
+ After the first successful load, `goto_url()` works fine for subsequent Glassdoor navigations.
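
The rate-limiting gotcha above recommends a fixed gap between page loads. One way to enforce it without over-waiting is to sleep only for whatever part of the gap has not already elapsed during extraction. A minimal sketch: `Pacer` is a hypothetical helper, and the 8-second gap is simply the figure quoted above for salary and reviews pages.

```python
import time

class Pacer:
    """Enforce a minimum gap (in seconds) between consecutive page loads."""

    def __init__(self, min_gap, clock=time.monotonic, sleep=time.sleep):
        self.min_gap = min_gap
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous pause() call

    def pause(self):
        """Sleep just long enough that min_gap seconds separate successive calls."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_gap - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

pacer = Pacer(min_gap=8)
pacer.pause()  # first call returns immediately
```

Calling `pacer.pause()` before each `goto_url()` keeps loads at least `min_gap` seconds apart.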