@pencil-agent/nano-pencil 2.0.1 → 2.0.2

Files changed (186)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/model/custom-providers.js +1 -1
  7. package/dist/core/model-registry.js +5 -5
  8. package/dist/extensions/builtin/AGENT.md +115 -115
  9. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  10. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  11. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  12. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  13. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  14. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  15. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  16. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  17. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  18. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  19. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  20. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  91. package/dist/extensions/builtin/browser/browser.md +73 -73
  92. package/dist/extensions/builtin/browser/install.md +142 -142
  93. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  94. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  95. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  96. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  97. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  98. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  99. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  100. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  101. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  102. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  103. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  104. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  105. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  107. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  108. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  109. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  110. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  111. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  112. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  113. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  114. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  115. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  116. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  117. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  118. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  119. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  120. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  121. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  122. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  123. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  124. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  125. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  126. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  127. package/dist/extensions/builtin/goal/README.md +67 -67
  128. package/dist/extensions/builtin/grub/README.md +112 -112
  129. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  130. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  131. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  132. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  133. package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
  134. package/dist/extensions/builtin/loop/README.md +92 -92
  135. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  136. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  137. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  138. package/dist/extensions/builtin/sal/README.md +72 -72
  139. package/dist/extensions/builtin/security-audit/README.md +289 -289
  140. package/dist/extensions/builtin/team/AGENT.md +112 -112
  141. package/dist/extensions/builtin/team/TESTING.md +299 -299
  142. package/dist/extensions/builtin/token-save/README.md +56 -56
  143. package/dist/extensions/optional/AGENT.md +10 -10
  144. package/dist/modes/interactive/controllers/input-submit-controller.js +2 -2
  145. package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
  146. package/dist/modes/interactive/interactive-mode.js +19 -19
  147. package/dist/modes/interactive/theme/dark.json +85 -85
  148. package/dist/modes/interactive/theme/light.json +84 -84
  149. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  150. package/dist/modes/interactive/theme/warm.json +81 -81
  151. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  152. package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
  153. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
  154. package/docs/SDK-TESTING.md +364 -0
  155. package/docs/codex-goal-command-impl.md +1055 -1055
  156. package/docs/codex-goal-vs-grub.md +500 -500
  157. package/docs/custom-provider.md +27 -27
  158. package/docs/extensions.md +27 -27
  159. package/docs/keybindings.md +27 -27
  160. package/docs/loop 重构完成总结.md +250 -250
  161. package/docs/loop 重构完成报告.md +122 -122
  162. package/docs/loop 重构方案.md +1222 -1222
  163. package/docs/loop 重构方案实现报告.md +158 -158
  164. package/docs/loop 重构方案对比分析.md +128 -128
  165. package/docs/loop 重构计划.md +320 -320
  166. package/docs/loop-usage-examples.md +214 -214
  167. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
  168. package/docs/models.md +27 -27
  169. package/docs/packages.md +27 -27
  170. package/docs/pi-design-philosophy.md +457 -457
  171. package/docs/planmode.md +1987 -1987
  172. package/docs/prompt-templates.md +27 -27
  173. package/docs/providers.md +27 -27
  174. package/docs/sdk.md +27 -27
  175. package/docs/skills.md +27 -27
  176. package/docs/startup-performance-optimization.md +301 -0
  177. package/docs/themes.md +27 -27
  178. package/docs/tui.md +27 -27
  179. package/docs/认知地图.md +47 -0
  180. package/package.json +190 -190
  181. package/docs/cc-agent-design.md +0 -1297
  182. package/docs/cc-tui-design.md +0 -1333
  183. package/docs/nanoPencil-学习计划.md +0 -170
  184. package/docs/scan-report.md +0 -3820
  185. package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
  186. package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +0 -261
@@ -1,543 +1,543 @@
1
- # Glassdoor — Company Data, Reviews, Jobs & Salaries
2
-
3
- Field-tested against glassdoor.com on 2026-04-18.
4
-
5
- ## Anti-bot verdict: browser required, no http_get workaround exists
6
-
7
- **`http_get` returns HTTP 403 on every Glassdoor URL without exception.**
8
-
9
- Tested endpoints (all 403):
10
- - `/Reviews/Google-Reviews-E9079.htm`
11
- - `/Overview/Working-at-Google-EI_IE9079.htm`
12
- - `/Job/jobs.htm?sc.keyword=software+engineer`
13
- - `/Salaries/software-engineer-salary-SRCH_KO0,17.htm`
14
- - `/graph` (GraphQL)
15
- - `sitemap.xml`
16
-
17
- UAs tested (all blocked): `Mozilla/5.0`, full Chrome 124, Googlebot, `curl/7.88.1`.
18
-
19
- **Stack:** Cloudflare Bot Management (`Server: cloudflare`, `Cf-Mitigated: challenge`).
20
- Challenge type: `managed` (JS-executed browser fingerprint check, no CAPTCHA widget, no user click
21
- required in a real browser). Cookie-only bypass also fails — the `__cf_bm` cookie returned in the
22
- 403 response is bound to the browser TLS fingerprint and does not grant access when replayed.
23
-
24
- `api.glassdoor.com` (the old public partner API) returned `410 Gone` — permanently shut down.
25
-
26
- **Use `goto_url()` + `wait()` exclusively. Never use `http_get` for Glassdoor.**
27
-
28
- ---
29
-
30
- ## Do this first: open in a new tab, wait for CF to resolve
31
-
32
- ```python
33
- new_tab("https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm")
34
- wait_for_load()
35
- wait(5) # CF managed challenge runs for ~2-4s after readyState=complete
36
- ```
37
-
38
- `wait(5)` is mandatory. CF's managed challenge executes JS fingerprinting probes after the DOM
39
- is ready. Extracting before this resolves returns an empty or partial page.
40
-
41
- Verify you are past the challenge before extracting:
42
-
43
- ```python
44
- title = js("document.title")
45
- url = page_info()["url"]
46
- if "Security" in title or "__cf_chl_tk" in url:
47
-     # CF challenge did not resolve yet — wait longer
48
-     wait(5)
49
-     title = js("document.title")
50
- assert "Security" not in title, "Still on CF block page"
51
- ```
52
-
53
- ---
54
-
55
- ## URL patterns
56
-
57
- | Goal | URL |
58
- |---|---|
59
- | Company reviews | `/Reviews/{Company-slug}-Reviews-E{employer_id}.htm` |
60
- | Company overview | `/Overview/Working-at-{Company-slug}-EI_IE{employer_id}.htm` |
61
- | Company jobs | `/Jobs/{Company-slug}-Jobs-E{employer_id}.htm` |
62
- | Keyword job search | `/Job/jobs.htm?sc.keyword={keyword}` |
63
- | Keyword + location | `/Job/jobs.htm?sc.keyword={keyword}&locT=C&locKeyword={city}` |
64
- | Remote jobs | `/Job/jobs.htm?sc.keyword={keyword}&remoteWorkType=1` |
65
- | Job search page 2+ | append `&p=2`, `&p=3` |
66
- | Salary page | `/Salaries/{role-slug}-salary-SRCH_KO0,{len}.htm` |
67
-
68
- Employer IDs and company slugs are stable. Example: Google = `EI_IE9079`, slug = `Google`.
69
-
70
- Find the employer ID from a search result URL or the company's Glassdoor page URL.
71
-
72
- ---
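The URL patterns in the table above can be collected into a few small builder functions. This is a minimal sketch, not part of the package: the function names (`reviews_url`, `overview_url`, `job_search_url`, `salary_url`) and the `BASE` constant are hypothetical, and the patterns are copied from the table as tested on the stated date, so they may have changed since.

```python
from urllib.parse import quote_plus

# Assumption: the public site root used throughout the examples.
BASE = "https://www.glassdoor.com"


def reviews_url(company_slug: str, employer_id: int) -> str:
    """Company reviews page, e.g. Google / 9079."""
    return f"{BASE}/Reviews/{company_slug}-Reviews-E{employer_id}.htm"


def overview_url(company_slug: str, employer_id: int) -> str:
    """Company overview page."""
    return f"{BASE}/Overview/Working-at-{company_slug}-EI_IE{employer_id}.htm"


def job_search_url(keyword: str, page: int = 1, remote: bool = False) -> str:
    """Keyword job search; page 2+ appends &p=N as in the table."""
    url = f"{BASE}/Job/jobs.htm?sc.keyword={quote_plus(keyword)}"
    if remote:
        url += "&remoteWorkType=1"
    if page > 1:
        url += f"&p={page}"
    return url


def salary_url(role_slug: str) -> str:
    """Salary page; the number after KO0, is the length of the role slug."""
    return f"{BASE}/Salaries/{role_slug}-salary-SRCH_KO0,{len(role_slug)}.htm"
```

For example, `salary_url("software-engineer")` yields a path ending in `SRCH_KO0,17.htm`, matching the tested endpoint listed earlier.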
73
-
74
- ## Workflow 1: Job search — extract result cards
75
-
76
- Glassdoor renders job cards client-side. Wait 5 seconds after load before extracting.
77
-
78
- ```python
79
- import json
80
- from urllib.parse import quote_plus
81
-
82
- query = "software engineer"
83
- new_tab(f"https://www.glassdoor.com/Job/jobs.htm?sc.keyword={quote_plus(query)}")
84
- wait_for_load()
85
- wait(5) # CF challenge + JS render
86
-
87
- # Dismiss cookie banner if present (GDPR regions)
88
- dismiss_cookie_banner()
89
-
90
- jobs = js("""
91
- (function() {
92
- // Primary selector as of 2026-04
93
- var cards = document.querySelectorAll('li[data-jobid]');
94
- if (!cards.length) {
95
- // Fallback: class-based (Next.js CSS modules use hashed suffixes — match prefix)
96
- cards = document.querySelectorAll('[class*="JobsList_jobListItem"]');
97
- }
98
- var out = [];
99
- for (var i = 0; i < cards.length; i++) {
100
- var c = cards[i];
101
- var jobId = c.getAttribute('data-jobid') || '';
102
- var titleEl = c.querySelector('[data-test="job-title"], a[class*="JobCard_jobTitle"]');
103
- var compEl = c.querySelector('[data-test="employer-name"], [class*="JobCard_employer"]');
104
- var locEl = c.querySelector('[data-test="emp-location"], [class*="JobCard_location"]');
105
- var salEl = c.querySelector('[data-test="detailSalary"], [class*="salary"], .salaryEstimate');
106
- var ratingEl = c.querySelector('[data-test="rating"], [class*="ratingNumber"]');
107
- var linkEl = c.querySelector('a[href*="/job-listing/"], a[href*="glassdoor.com/job"]');
108
- out.push({
109
- jobId,
110
- title: titleEl ? titleEl.innerText.trim() : '',
111
- company: compEl ? compEl.innerText.trim() : '',
112
- location: locEl ? locEl.innerText.trim() : '',
113
- salary: salEl ? salEl.innerText.trim() : '',
114
- rating: ratingEl ? ratingEl.innerText.trim() : '',
115
- url: linkEl ? linkEl.href : '',
116
- });
117
- }
118
- return JSON.stringify(out.filter(j => j.title));
119
- })()
120
- """)
121
-
122
- results = json.loads(jobs)
123
- for r in results:
124
-     print(r["title"], "|", r["company"], "|", r["location"])
125
- ```
126
-
127
- **If `results` is empty:** take a screenshot and check which page you are on. Glassdoor often
128
- serves a different layout under A/B tests. The screenshot will reveal the actual card selector.
129
-
130
- ```python
131
- capture_screenshot("/tmp/glassdoor_jobs.png")
132
- # Inspect the image, then adjust the querySelectorAll selector above
133
- ```
134
-
135
- ---
136
-
137
- ## Workflow 2: Job search pagination
138
-
139
- Glassdoor paginates via `&p=N` on the job search URL.
140
-
141
- ```python
142
- import json
143
- from urllib.parse import quote_plus
144
-
145
- query = "data scientist"
146
- all_jobs = []
147
-
148
- for page in range(1, 4): # pages 1-3, ~10 cards each
149
-     url = f"https://www.glassdoor.com/Job/jobs.htm?sc.keyword={quote_plus(query)}&p={page}"
150
-     goto_url(url)
151
-     wait_for_load()
152
-     wait(5 if page == 1 else 3)  # first page needs CF wait; subsequent pages are faster
153
-
154
-     if page == 1:
155
-         dismiss_cookie_banner()
156
-
157
-     batch_json = js("""
158
-     (function() {
159
-       var cards = document.querySelectorAll('li[data-jobid], [class*="JobsList_jobListItem"]');
160
-       var out = [];
161
-       for (var i = 0; i < cards.length; i++) {
162
-         var c = cards[i];
163
-         var jobId = c.getAttribute('data-jobid') || '';
164
-         var titleEl = c.querySelector('[data-test="job-title"], a[class*="JobCard_jobTitle"]');
165
-         var compEl = c.querySelector('[data-test="employer-name"]');
166
-         var locEl = c.querySelector('[data-test="emp-location"]');
167
-         var salEl = c.querySelector('[data-test="detailSalary"], [class*="salary"]');
168
-         var linkEl = c.querySelector('a[href*="/job-listing/"]');
169
-         out.push({
170
-           jobId,
171
-           title: titleEl ? titleEl.innerText.trim() : '',
172
-           company: compEl ? compEl.innerText.trim() : '',
173
-           location: locEl ? locEl.innerText.trim() : '',
174
-           salary: salEl ? salEl.innerText.trim() : '',
175
-           url: linkEl ? linkEl.href : '',
176
-         });
177
-       }
178
-       return JSON.stringify(out.filter(j => j.title));
179
-     })()
180
-     """)
181
-
182
-     batch = json.loads(batch_json)
183
-     if not batch:
184
-         break  # no more results
185
-     all_jobs.extend(batch)
186
-
187
- print(f"Collected {len(all_jobs)} jobs across {page} pages")
188
- ```
189
-
190
- ---
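When batches from several pages are concatenated, the same card could in principle appear more than once. The source does not report that Glassdoor repeats cards across pages, so treat this as a defensive assumption. The sketch below uses a hypothetical helper, `dedupe_jobs`, that keeps the first occurrence of each `jobId`, falling back to `url` when the ID is empty, and passes through records that have neither.

```python
def dedupe_jobs(jobs):
    """Return jobs with duplicates removed, preserving first-seen order.

    Each job is a dict shaped like the records built in the pagination
    loop: it may carry 'jobId' and 'url' keys, either of which can be ''.
    """
    seen = set()
    unique = []
    for job in jobs:
        # Prefer the job ID; fall back to the listing URL.
        key = job.get("jobId") or job.get("url")
        if key:
            if key in seen:
                continue  # already collected on an earlier page
            seen.add(key)
        # Records with no usable key are kept as-is rather than guessed at.
        unique.append(job)
    return unique
```

A typical call would be `all_jobs = dedupe_jobs(all_jobs)` after the loop finishes.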
191
-
192
- ## Workflow 3: Company overview — rating and review count
193
-
194
- Navigate to the company Overview or Reviews page. These pages require login for full content but the
195
- summary header (overall rating, review count, recommend %) is visible without login.
196
-
197
- ```python
198
- import json, re
199
-
200
- # Example: Google (employer_id=9079)
201
- employer_id = 9079
202
- company_slug = "Google"
203
-
204
- goto_url(f"https://www.glassdoor.com/Overview/Working-at-{company_slug}-EI_IE{employer_id}.htm")
205
- wait_for_load()
206
- wait(5) # CF challenge
207
-
208
- # Try __NEXT_DATA__ first — fastest and most complete
209
- next_data_raw = js("document.getElementById('__NEXT_DATA__') ? document.getElementById('__NEXT_DATA__').textContent : null")
210
-
211
- if next_data_raw:
212
-     nd = json.loads(next_data_raw)
213
-     # Company data lives under props.pageProps — path varies by page type
214
-     # Try employer overview path
215
-     props = nd.get("props", {}).get("pageProps", {})
216
-     employer = props.get("employer") or props.get("employerOverview")
217
-     if employer:
218
-         print("Rating:", employer.get("overallRating"))
219
-         print("Reviews:", employer.get("reviewCount") or employer.get("numberOfReviews"))
220
-         print("Name:", employer.get("name") or employer.get("shortName"))
221
- else:
222
-     # Fall back to DOM selectors
223
-     summary = js("""
224
-     (function() {
225
-       var ratingEl = document.querySelector('[data-test="rating"], .ratingNumber, [class*="ratingNum"]');
226
-       var countEl = document.querySelector('[data-test="reviewCount"], .reviewCount, [class*="reviewCount"]');
227
-       var nameEl = document.querySelector('h1[data-test="employer-name"], [class*="EmployerProfile_name"]');
228
-       var recEl = document.querySelector('[data-test="recommend"], [class*="recommend"]');
229
-       return JSON.stringify({
230
-         rating: ratingEl ? ratingEl.innerText.trim() : '',
231
-         reviews: countEl ? countEl.innerText.trim() : '',
232
-         name: nameEl ? nameEl.innerText.trim() : '',
233
-         recommend: recEl ? recEl.innerText.trim() : '',
234
-       });
235
-     })()
236
-     """)
237
-     print(json.loads(summary))
238
- ```
239
-
240
- ---
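The `__NEXT_DATA__` lookup above tries several candidate keys with `or`, which also skips legitimate falsy values such as a review count of `0`. One possible alternative is a small helper that returns the first key whose value is not `None`. The helper name `first_present` is hypothetical, and the sample payload below is invented purely for illustration; it reuses the key names from the workflow and is not captured Glassdoor data.

```python
import json


def first_present(mapping, keys):
    """Return the value of the first key in `keys` that is present and not None."""
    for key in keys:
        value = mapping.get(key)
        if value is not None:
            return value
    return None


# Made-up payload shaped like the structure the workflow expects.
sample = json.loads("""
{"props": {"pageProps": {"employerOverview":
    {"overallRating": 4.3, "numberOfReviews": 0, "shortName": "Example"}}}}
""")

props = sample.get("props", {}).get("pageProps", {})
employer = first_present(props, ["employer", "employerOverview"]) or {}

rating = first_present(employer, ["overallRating"])
reviews = first_present(employer, ["reviewCount", "numberOfReviews"])
name = first_present(employer, ["name", "shortName"])
print(rating, reviews, name)
```

With `or`, the zero review count in this sample would come back as `None`; with `first_present` it is preserved as `0`.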
241
-
242
- ## Workflow 4: Company reviews page — extract individual reviews
243
-
244
- Reviews pages show up to ~10 reviews per page without login. A login modal appears after scrolling.
245
- Extract before scrolling.
246
-
247
- ```python
248
- import json
249
-
250
- employer_id = 9079
251
- company_slug = "Google"
252
-
253
- goto_url(f"https://www.glassdoor.com/Reviews/{company_slug}-Reviews-E{employer_id}.htm")
254
- wait_for_load()
255
- wait(5)
256
-
257
- dismiss_cookie_banner()
258
-
259
- reviews = js("""
260
- (function() {
261
- // Review cards — confirmed selector pattern
262
- var cards = document.querySelectorAll('[id^="empReview_"], [data-test="review-card"], [class*="ReviewCard"]');
263
- if (!cards.length) {
264
- cards = document.querySelectorAll('article[class*="review"]');
265
- }
266
- var out = [];
267
- for (var i = 0; i < cards.length; i++) {
268
- var c = cards[i];
269
-
270
- // Overall star rating (1-5)
271
- var starsEl = c.querySelector('[data-test="review-rating"], [class*="starRating"], span[class*="ratingNumber"]');
272
- var stars = starsEl ? starsEl.innerText.trim() : '';
273
-
274
- // Pros / Cons text
275
- var prosEl = c.querySelector('[data-test="pros"], [class*="pros"], p[class*="pros"]');
276
- var consEl = c.querySelector('[data-test="cons"], [class*="cons"], p[class*="cons"]');
277
- var pros = prosEl ? prosEl.innerText.trim() : '';
278
- var cons = consEl ? consEl.innerText.trim() : '';
279
-
280
- // Review title
281
- var titleEl = c.querySelector('[data-test="review-title"], h2[class*="reviewTitle"], [class*="title"] a');
282
- var title = titleEl ? titleEl.innerText.trim() : '';
283
-
284
- // Job title of reviewer
285
- var jobTitleEl = c.querySelector('[data-test="reviewer-job-title"], [class*="reviewerInfo"], [class*="authorJobTitle"]');
286
- var jobTitle = jobTitleEl ? jobTitleEl.innerText.trim() : '';
287
-
288
- // Date
289
- var dateEl = c.querySelector('time, [data-test="review-date"], [class*="reviewDate"]');
290
- var date = dateEl ? (dateEl.getAttribute('datetime') || dateEl.innerText.trim()) : '';
291
-
292
- if (pros || cons || title) {
293
- out.push({stars, title, jobTitle, pros, cons, date});
294
- }
295
- }
296
- return JSON.stringify(out);
297
- })()
298
- """)
299
-
300
- results = json.loads(reviews)
301
- for r in results:
302
-     print(f"{r['stars']}★ | {r['title']} | {r['jobTitle']}")
303
-     print(f" + {r['pros'][:100]}")
304
-     print(f" - {r['cons'][:100]}")
305
- ```
306
-
307
- ---
308
-
309
- ## Workflow 5: Salary page — extract reported salary data
310
-
311
- ```python
312
- import json
313
- from urllib.parse import quote_plus
314
-
315
- # Salary pages use slug + character-count in the URL (n = len(role_slug))
316
- role = "software-engineer"
317
- n = len(role) # 17 for "software-engineer"
318
-
319
- goto_url(f"https://www.glassdoor.com/Salaries/{role}-salary-SRCH_KO0,{n}.htm")
320
- wait_for_load()
321
- wait(5)
322
-
323
- # Try __NEXT_DATA__ for structured salary data
324
- next_data_raw = js("document.getElementById('__NEXT_DATA__') ? document.getElementById('__NEXT_DATA__').textContent : null")
325
-
326
- if next_data_raw:
327
-     nd = json.loads(next_data_raw)
328
-     # Salary data is typically under props.pageProps.salaryData or .salaryEstimate
329
-     props = nd.get("props", {}).get("pageProps", {})
330
-     salary_data = props.get("salaryData") or props.get("payData")
331
-     if salary_data:
332
-         print(json.dumps(salary_data, indent=2))
333
-
334
- # DOM fallback
335
- salary_summary = js("""
336
- (function() {
337
- var medianEl = document.querySelector('[data-test="salary-estimate"], [class*="salaryEstimate"], [class*="median"]');
338
- var rangeEl = document.querySelector('[data-test="salary-range"], [class*="salaryRange"]');
339
- var countEl = document.querySelector('[data-test="salary-count"], [class*="salaryCount"]');
340
- return JSON.stringify({
341
- median: medianEl ? medianEl.innerText.trim() : '',
342
- range: rangeEl ? rangeEl.innerText.trim() : '',
343
- count: countEl ? countEl.innerText.trim() : '',
344
- });
345
- })()
346
- """)
347
- print(json.loads(salary_summary))
348
- ```
349
-
350
- ---
351
-
352
- ## Handling the login modal
353
-
354
- Glassdoor shows a sign-in modal:
355
- - On Reviews/Salary pages: after viewing ~3-5 items (scroll-triggered)
356
- - On job detail pages: often immediately
357
-
358
- Dismiss it before extracting anything that requires scrolling:
359
-
360
- ```python
361
- def dismiss_glassdoor_login_modal():
362
-     """Close the Glassdoor sign-in modal. Safe to call if no modal is present."""
363
-     closed = js("""
364
-     (function() {
365
-       var selectors = [
366
-         '[alt="Close"]',
367
-         'button[class*="modal_closeIcon"]',
368
-         '[data-test="close-modal"]',
369
-         '[aria-label="Close"]',
370
-         'button[data-test="CloseButton"]',
371
-         '[class*="CloseButton"]',
372
-       ];
373
-       for (var i = 0; i < selectors.length; i++) {
374
-         var btn = document.querySelector(selectors[i]);
375
-         if (btn && btn.offsetParent !== null) {
376
-           btn.click();
377
-           return selectors[i];
378
-         }
379
-       }
380
-       return null;
381
-     })()
382
-     """)
383
-     if closed:
384
-         wait(1)
385
-     return closed
386
-
387
- def dismiss_cookie_banner():
388
-     """Dismiss GDPR consent overlay. Safe to call even if no banner is present."""
389
-     dismissed = js("""
390
-     (function() {
391
-       var selectors = [
392
-         'button[data-test="accept-cookies"]',
393
-         '#onetrust-accept-btn-handler',
394
-         'button[id*="accept-all"]',
395
-         'button[class*="accept"]',
396
-         'button[class*="consent"]',
397
-       ];
398
-       for (var i = 0; i < selectors.length; i++) {
399
-         var btn = document.querySelector(selectors[i]);
400
-         if (btn && btn.offsetParent !== null) {
401
-           btn.click();
402
-           return selectors[i];
403
-         }
404
-       }
405
-       return null;
406
-     })()
407
-     """)
408
-     if dismissed:
409
-         wait(1)
410
-     return dismissed
411
- ```
412
-
413
- For Reviews/Salary pages: call `dismiss_glassdoor_login_modal()` immediately after the initial
414
- wait, before any scrolling. Once you scroll down, the modal blocks the page and the X button
415
- may itself be outside the viewport.
416
-
417
- ---
418
-
419
- ## Detecting whether you are past the CF challenge
420
-
421
- After `goto_url()` + `wait(5)`, confirm you are on the real page:
422
-
423
- ```python
424
- def glassdoor_is_cf_blocked() -> bool:
425
- """True if the CF managed challenge is still running."""
426
- title = js("document.title") or ""
427
- url = page_info()["url"]
428
- return "Security" in title or "__cf_chl_tk" in url
429
-
430
- # Usage
431
- goto_url("https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm")
432
- wait_for_load()
433
- wait(5)
434
-
435
- if glassdoor_is_cf_blocked():
436
- wait(10) # give CF extra time
437
- if glassdoor_is_cf_blocked():
438
- capture_screenshot("/tmp/glassdoor_cf_block.png")
439
- raise RuntimeError("CF challenge did not resolve — check screenshot")
440
- ```
441
-
442
- ---
443
-
444
- ## Glassdoor company ID lookup
445
-
446
- Glassdoor uses numeric employer IDs (e.g., Google = 9079, Apple = 1138, Meta = 40772).
447
- To find the ID for any company:
448
-
449
- ```python
450
- from urllib.parse import quote_plus
451
-
452
- company_name = "OpenAI"
453
- goto_url(f"https://www.glassdoor.com/Search/results.htm?keyword={quote_plus(company_name)}&locT=N")
454
- wait_for_load()
455
- wait(5)
456
-
457
- # Extract company cards from search results
458
- companies = js("""
459
- (function() {
460
- var cards = document.querySelectorAll('[data-test="employer-card"], [class*="EmployerCard"], [class*="employer-card"]');
461
- var out = [];
462
- for (var i = 0; i < cards.length; i++) {
463
- var c = cards[i];
464
- var link = c.querySelector('a[href*="Overview"], a[href*="Reviews"]');
465
- if (!link) continue;
466
- var href = link.href;
467
- // Extract employer ID: EI_IE{id} or E{id}
468
- var m = href.match(/E(?:I_IE)?(\d+)/);
469
- var empId = m ? m[1] : '';
470
- var nameEl = c.querySelector('[class*="EmployerCard_name"], h2, [class*="name"]');
471
- out.push({
472
- empId,
473
- name: nameEl ? nameEl.innerText.trim() : '',
474
- href,
475
- });
476
- }
477
- return JSON.stringify(out);
478
- })()
479
- """)
480
-
481
- import json
482
- for c in json.loads(companies):
483
- print(c["empId"], c["name"], c["href"][:60])
484
- ```
485
-
486
- ---
487
-
488
- ## Gotchas
489
-
490
- - **`http_get` is permanently blocked.** Cloudflare Bot Management blocks every IP-level request
491
- with a JS managed challenge. No User-Agent, cookie, or header combination bypasses it. The
492
- `__cf_bm` cookie returned in the 403 response is TLS-fingerprint-bound and cannot be replayed.
493
- `api.glassdoor.com` is 410 Gone (shut down). Only real Chrome via CDP works.
494
-
495
- - **`wait(5)` minimum after `wait_for_load()`.** CF's managed challenge runs for 2-4 seconds after
496
- `readyState = complete`. Extracting too early returns the challenge page HTML, not Glassdoor
497
- content. If you get empty results or the title is "Security | Glassdoor", wait longer.
498
-
499
- - **Login modal triggers on scroll, not on load.** Extract all visible content immediately on page
500
- load before any scrolling. Call `dismiss_glassdoor_login_modal()` right after the initial wait —
501
- before issuing any `scroll()` calls.
502
-
503
- - **Glassdoor shows ~10 cards without login.** Reviews and salary pages are severely limited
504
- without an account. Job search cards are more accessible (~10-15 per page). If you need 30+
505
- reviews, a logged-in session is required.
506
-
507
- - **CSS class names use Next.js hashed suffixes.** Selectors like `[class*="JobCard_jobTitle"]`
508
- match despite the hash suffix (e.g., `JobCard_jobTitle__abc12`). Never hardcode the full hashed
509
- class name — it changes with deployments. Always use `[class*="prefix"]`.
510
-
511
- - **`__NEXT_DATA__` is the fast path.** When accessible, Glassdoor's Next.js pages embed all page
512
- data in `<script id="__NEXT_DATA__" type="application/json">`. Parse it before falling back to
513
- DOM queries. Data path varies by page type: look under `props.pageProps.employer`,
514
- `props.pageProps.salaryData`, `props.pageProps.jobListings`, etc.
515
-
516
- - **Company URL slugs and IDs are stable.** The employer ID (e.g., `9079` for Google) never
517
- changes. Slugs occasionally change when a company rebrands — always verify by following the
518
- canonical redirect from a search result.
519
-
520
- - **Rate limiting.** Glassdoor rate-limits by IP after ~5 company-page loads per minute.
521
- Use `wait(5)` between consecutive company page navigations. Salary and reviews pages are heavier
522
- — use `wait(8)` between those.
523
-
524
- - **Salary URL requires character-count parameter.** The `SRCH_KO0,{n}` fragment encodes
525
- `0` (start of role name) and `n` (end, i.e., `len(role_slug)`). For `"software-engineer"` (17
526
- chars): `SRCH_KO0,17`. Wrong count returns a 404.
527
-
528
- - **`locKeyword` vs `locId` for location filter.** `locKeyword=San+Francisco` works without
529
- knowing Glassdoor's internal city ID. `locT=C` means city-type location. For metro areas,
530
- also try `locT=M`. Omit `locId` unless you have the exact numeric ID from a Glassdoor URL.
531
-
532
- - **PerimeterX is also active as a secondary layer.** After passing CF, Glassdoor runs behavioral
533
- fingerprinting. Rapid automated scrolling, mouse movement, or navigation patterns may trigger a
534
- secondary block. Mitigate with `wait(2)` between actions and avoid scripted mouse movement.
535
-
536
- - **Review and salary data require login on some accounts.** Anonymous sessions get a subset of
537
- data. If a field returns empty consistently, the page may require authentication before surfacing
538
- that data in the DOM or `__NEXT_DATA__`.
539
-
540
- - **`goto_url()` vs `new_tab()` for first navigation.** Use `new_tab()` for the very first Glassdoor
541
- page in a session. If the harness is attached to a non-Glassdoor tab, `goto_url()` can silently
542
- fail to pass the CF challenge because the existing tab may not have a clean origin context.
543
- After the first successful load, `goto_url()` works fine for subsequent Glassdoor navigations.
1
+ # Glassdoor — Company Data, Reviews, Jobs & Salaries
2
+
3
+ Field-tested against glassdoor.com on 2026-04-18.
4
+
5
+ ## Anti-bot verdict: browser required, no http_get workaround exists
6
+
7
+ **`http_get` returns HTTP 403 on every Glassdoor URL without exception.**
8
+
9
+ Tested endpoints (all 403):
10
+ - `/Reviews/Google-Reviews-E9079.htm`
11
+ - `/Overview/Working-at-Google-EI_IE9079.htm`
12
+ - `/Job/jobs.htm?sc.keyword=software+engineer`
13
+ - `/Salaries/software-engineer-salary-SRCH_KO0,17.htm`
14
+ - `/graph` (GraphQL)
15
+ - `sitemap.xml`
16
+
17
+ UAs tested (all blocked): `Mozilla/5.0`, full Chrome 124, Googlebot, `curl/7.88.1`.
18
+
19
+ **Stack:** Cloudflare Bot Management (`Server: cloudflare`, `Cf-Mitigated: challenge`).
20
+ Challenge type: `managed` (JS-executed browser fingerprint check, no CAPTCHA widget, no user click
21
+ required in a real browser). Cookie-only bypass also fails — the `__cf_bm` cookie returned in the
22
+ 403 response is bound to the browser TLS fingerprint and does not grant access when replayed.
23
+
24
+ `api.glassdoor.com` (the old public partner API) returned `410 Gone` — permanently shut down.
25
+
26
+ **Use `goto_url()` + `wait()` exclusively. Never use `http_get` for Glassdoor.**
27
+
28
+ ---
29
+
30
+ ## Do this first: open in a new tab, wait for CF to resolve
31
+
32
+ ```python
33
+ new_tab("https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm")
34
+ wait_for_load()
35
+ wait(5) # CF managed challenge runs for ~2-4s after readyState=complete
36
+ ```
37
+
38
+ `wait(5)` is mandatory. CF's managed challenge executes JS fingerprinting probes after the DOM
39
+ is ready. Extracting before this resolves returns an empty or partial page.
40
+
41
+ Verify you are past the challenge before extracting:
42
+
43
+ ```python
44
+ title = js("document.title") or ""  # guard against a None result
45
+ url = page_info()["url"]
46
+ if "Security" in title or "__cf_chl_tk" in url:
47
+     # CF challenge did not resolve yet; wait longer
48
+     wait(5)
49
+     title = js("document.title") or ""
50
+     assert "Security" not in title, "Still on CF block page"
51
+ ```
52
+
53
+ ---
54
+
55
+ ## URL patterns
56
+
57
+ | Goal | URL |
58
+ |---|---|
59
+ | Company reviews | `/Reviews/{Company-slug}-Reviews-E{employer_id}.htm` |
60
+ | Company overview | `/Overview/Working-at-{Company-slug}-EI_IE{employer_id}.htm` |
61
+ | Company jobs | `/Jobs/{Company-slug}-Jobs-E{employer_id}.htm` |
62
+ | Keyword job search | `/Job/jobs.htm?sc.keyword={keyword}` |
63
+ | Keyword + location | `/Job/jobs.htm?sc.keyword={keyword}&locT=C&locKeyword={city}` |
64
+ | Remote jobs | `/Job/jobs.htm?sc.keyword={keyword}&remoteWorkType=1` |
65
+ | Job search page 2+ | append `&p=2`, `&p=3` |
66
+ | Salary page | `/Salaries/{role-slug}-salary-SRCH_KO0,{len}.htm` |
67
+
68
+ Employer IDs are stable; slugs change only occasionally (for example, after a rebrand). Example: Google = `EI_IE9079`, slug = `Google`.
69
+
70
+ Find the employer ID from a search result URL or the company's Glassdoor page URL.
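
The URL patterns in the table above can be assembled with a few small helpers. This is a minimal sketch, assuming the slug and employer ID are already known; the function names are hypothetical and not part of the harness.

```python
from urllib.parse import quote_plus

BASE = "https://www.glassdoor.com"


def reviews_url(company_slug, employer_id):
    """Company reviews page: /Reviews/{slug}-Reviews-E{id}.htm"""
    return f"{BASE}/Reviews/{company_slug}-Reviews-E{employer_id}.htm"


def overview_url(company_slug, employer_id):
    """Company overview page: /Overview/Working-at-{slug}-EI_IE{id}.htm"""
    return f"{BASE}/Overview/Working-at-{company_slug}-EI_IE{employer_id}.htm"


def job_search_url(keyword, city="", page=1):
    """Keyword job search, optionally filtered by city and paginated with &p=N."""
    url = f"{BASE}/Job/jobs.htm?sc.keyword={quote_plus(keyword)}"
    if city:
        url += f"&locT=C&locKeyword={quote_plus(city)}"
    if page > 1:
        url += f"&p={page}"
    return url


def salary_url(role_slug):
    """Salary page; the trailing number is the character count of the role slug."""
    return f"{BASE}/Salaries/{role_slug}-salary-SRCH_KO0,{len(role_slug)}.htm"
```

Building the salary URL from `len(role_slug)` avoids hand-counting characters, which is the easiest way to get the count wrong.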
71
+
72
+ ---
73
+
74
+ ## Workflow 1: Job search — extract result cards
75
+
76
+ Glassdoor renders job cards client-side. Wait 5 seconds after load before extracting.
77
+
78
+ ```python
79
+ import json
80
+ from urllib.parse import quote_plus
81
+
82
+ query = "software engineer"
83
+ new_tab(f"https://www.glassdoor.com/Job/jobs.htm?sc.keyword={quote_plus(query)}")
84
+ wait_for_load()
85
+ wait(5) # CF challenge + JS render
86
+
87
+ # Dismiss cookie banner if present (GDPR regions)
88
+ dismiss_cookie_banner()
89
+
90
+ jobs = js("""
91
+ (function() {
92
+ // Primary selector as of 2026-04
93
+ var cards = document.querySelectorAll('li[data-jobid]');
94
+ if (!cards.length) {
95
+ // Fallback: class-based (Next.js CSS modules use hashed suffixes — match prefix)
96
+ cards = document.querySelectorAll('[class*="JobsList_jobListItem"]');
97
+ }
98
+ var out = [];
99
+ for (var i = 0; i < cards.length; i++) {
100
+ var c = cards[i];
101
+ var jobId = c.getAttribute('data-jobid') || '';
102
+ var titleEl = c.querySelector('[data-test="job-title"], a[class*="JobCard_jobTitle"]');
103
+ var compEl = c.querySelector('[data-test="employer-name"], [class*="JobCard_employer"]');
104
+ var locEl = c.querySelector('[data-test="emp-location"], [class*="JobCard_location"]');
105
+ var salEl = c.querySelector('[data-test="detailSalary"], [class*="salary"], .salaryEstimate');
106
+ var ratingEl = c.querySelector('[data-test="rating"], [class*="ratingNumber"]');
107
+ var linkEl = c.querySelector('a[href*="/job-listing/"], a[href*="glassdoor.com/job"]');
108
+ out.push({
109
+ jobId,
110
+ title: titleEl ? titleEl.innerText.trim() : '',
111
+ company: compEl ? compEl.innerText.trim() : '',
112
+ location: locEl ? locEl.innerText.trim() : '',
113
+ salary: salEl ? salEl.innerText.trim() : '',
114
+ rating: ratingEl ? ratingEl.innerText.trim() : '',
115
+ url: linkEl ? linkEl.href : '',
116
+ });
117
+ }
118
+ return JSON.stringify(out.filter(j => j.title));
119
+ })()
120
+ """)
121
+
122
+ results = json.loads(jobs)
123
+ for r in results:
124
+     print(r["title"], "|", r["company"], "|", r["location"])
125
+ ```
126
+
127
+ **If `results` is empty:** take a screenshot and check which page you are on. Glassdoor often
128
+ serves a different layout under A/B tests. The screenshot will reveal the actual card selector.
129
+
130
+ ```python
131
+ capture_screenshot("/tmp/glassdoor_jobs.png")
132
+ # Inspect the image, then adjust the querySelectorAll selector above
133
+ ```
134
+
135
+ ---
136
+
137
+ ## Workflow 2: Job search pagination
138
+
139
+ Glassdoor paginates via `&p=N` on the job search URL.
140
+
141
+ ```python
142
+ import json
143
+ from urllib.parse import quote_plus
144
+
145
+ query = "data scientist"
146
+ all_jobs = []
147
+
148
+ for page in range(1, 4): # pages 1-3, ~10 cards each
149
+     url = f"https://www.glassdoor.com/Job/jobs.htm?sc.keyword={quote_plus(query)}&p={page}"
150
+     goto_url(url)
151
+     wait_for_load()
152
+     wait(5 if page == 1 else 3)  # first page needs CF wait; subsequent pages are faster
153
+
154
+     if page == 1:
155
+         dismiss_cookie_banner()
156
+
157
+     batch_json = js("""
158
+ (function() {
159
+ var cards = document.querySelectorAll('li[data-jobid], [class*="JobsList_jobListItem"]');
160
+ var out = [];
161
+ for (var i = 0; i < cards.length; i++) {
162
+ var c = cards[i];
163
+ var jobId = c.getAttribute('data-jobid') || '';
164
+ var titleEl = c.querySelector('[data-test="job-title"], a[class*="JobCard_jobTitle"]');
165
+ var compEl = c.querySelector('[data-test="employer-name"]');
166
+ var locEl = c.querySelector('[data-test="emp-location"]');
167
+ var salEl = c.querySelector('[data-test="detailSalary"], [class*="salary"]');
168
+ var linkEl = c.querySelector('a[href*="/job-listing/"]');
169
+ out.push({
170
+ jobId,
171
+ title: titleEl ? titleEl.innerText.trim() : '',
172
+ company: compEl ? compEl.innerText.trim() : '',
173
+ location: locEl ? locEl.innerText.trim() : '',
174
+ salary: salEl ? salEl.innerText.trim() : '',
175
+ url: linkEl ? linkEl.href : '',
176
+ });
177
+ }
178
+ return JSON.stringify(out.filter(j => j.title));
179
+ })()
180
+ """)
181
+
182
+     batch = json.loads(batch_json)
183
+     if not batch:
184
+         break  # no more results
185
+     all_jobs.extend(batch)
186
+
187
+ print(f"Collected {len(all_jobs)} jobs (last page requested: {page})")
188
+ ```
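
Consecutive result pages may repeat some cards. That is an assumption, not something the field test above verified, but deduplicating is cheap. A minimal sketch that keys on the `jobId` and `url` fields produced by the extraction above; `dedupe_jobs` is a hypothetical helper name.

```python
def dedupe_jobs(jobs):
    """Keep the first occurrence of each job, keyed by jobId, else by url."""
    seen = set()
    unique = []
    for job in jobs:
        if job.get("jobId"):
            key = ("id", job["jobId"])
        elif job.get("url"):
            key = ("url", job["url"])
        else:
            unique.append(job)  # nothing to key on, so keep it
            continue
        if key in seen:
            continue
        seen.add(key)
        unique.append(job)
    return unique
```

Call it once after the loop, for example `all_jobs = dedupe_jobs(all_jobs)`.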
189
+
190
+ ---
191
+
192
+ ## Workflow 3: Company overview — rating and review count
193
+
194
+ Navigate to the company Overview or Reviews page. These pages require login for full content, but the
195
+ summary header (overall rating, review count, recommend %) is visible without login.
196
+
197
+ ```python
198
+ import json, re
199
+
200
+ # Example: Google (employer_id=9079)
201
+ employer_id = 9079
202
+ company_slug = "Google"
203
+
204
+ goto_url(f"https://www.glassdoor.com/Overview/Working-at-{company_slug}-EI_IE{employer_id}.htm")
205
+ wait_for_load()
206
+ wait(5) # CF challenge
207
+
208
+ # Try __NEXT_DATA__ first — fastest and most complete
209
+ next_data_raw = js("document.getElementById('__NEXT_DATA__') ? document.getElementById('__NEXT_DATA__').textContent : null")
210
+
211
+ if next_data_raw:
212
+     nd = json.loads(next_data_raw)
213
+     # Company data lives under props.pageProps; the path varies by page type
214
+     # Try employer overview path
215
+     props = nd.get("props", {}).get("pageProps", {})
216
+     employer = props.get("employer") or props.get("employerOverview")
217
+     if employer:
218
+         print("Rating:", employer.get("overallRating"))
219
+         print("Reviews:", employer.get("reviewCount") or employer.get("numberOfReviews"))
220
+         print("Name:", employer.get("name") or employer.get("shortName"))
221
+ else:
222
+     # Fall back to DOM selectors
223
+     summary = js("""
224
+     (function() {
225
+       var ratingEl = document.querySelector('[data-test="rating"], .ratingNumber, [class*="ratingNum"]');
226
+       var countEl = document.querySelector('[data-test="reviewCount"], .reviewCount, [class*="reviewCount"]');
227
+       var nameEl = document.querySelector('h1[data-test="employer-name"], [class*="EmployerProfile_name"]');
228
+       var recEl = document.querySelector('[data-test="recommend"], [class*="recommend"]');
229
+       return JSON.stringify({
230
+         rating: ratingEl ? ratingEl.innerText.trim() : '',
231
+         reviews: countEl ? countEl.innerText.trim() : '',
232
+         name: nameEl ? nameEl.innerText.trim() : '',
233
+         recommend: recEl ? recEl.innerText.trim() : '',
234
+       });
235
+     })()
236
+     """)
237
+     print(json.loads(summary))
238
+ ```
239
+
240
+ ---
241
+
242
+ ## Workflow 4: Company reviews page — extract individual reviews
243
+
244
+ Reviews pages show up to ~10 reviews per page without login. A login modal appears after scrolling.
245
+ Extract before scrolling.
246
+
247
+ ```python
248
+ import json
249
+
250
+ employer_id = 9079
251
+ company_slug = "Google"
252
+
253
+ goto_url(f"https://www.glassdoor.com/Reviews/{company_slug}-Reviews-E{employer_id}.htm")
254
+ wait_for_load()
255
+ wait(5)
256
+
257
+ dismiss_cookie_banner()
258
+
259
+ reviews = js("""
260
+ (function() {
261
+ // Review cards — confirmed selector pattern
262
+ var cards = document.querySelectorAll('[id^="empReview_"], [data-test="review-card"], [class*="ReviewCard"]');
263
+ if (!cards.length) {
264
+ cards = document.querySelectorAll('article[class*="review"]');
265
+ }
266
+ var out = [];
267
+ for (var i = 0; i < cards.length; i++) {
268
+ var c = cards[i];
269
+
270
+ // Overall star rating (1-5)
271
+ var starsEl = c.querySelector('[data-test="review-rating"], [class*="starRating"], span[class*="ratingNumber"]');
272
+ var stars = starsEl ? starsEl.innerText.trim() : '';
273
+
274
+ // Pros / Cons text
275
+ var prosEl = c.querySelector('[data-test="pros"], [class*="pros"], p[class*="pros"]');
276
+ var consEl = c.querySelector('[data-test="cons"], [class*="cons"], p[class*="cons"]');
277
+ var pros = prosEl ? prosEl.innerText.trim() : '';
278
+ var cons = consEl ? consEl.innerText.trim() : '';
279
+
280
+ // Review title
281
+ var titleEl = c.querySelector('[data-test="review-title"], h2[class*="reviewTitle"], [class*="title"] a');
282
+ var title = titleEl ? titleEl.innerText.trim() : '';
283
+
284
+ // Job title of reviewer
285
+ var jobTitleEl = c.querySelector('[data-test="reviewer-job-title"], [class*="reviewerInfo"], [class*="authorJobTitle"]');
286
+ var jobTitle = jobTitleEl ? jobTitleEl.innerText.trim() : '';
287
+
288
+ // Date
289
+ var dateEl = c.querySelector('time, [data-test="review-date"], [class*="reviewDate"]');
290
+ var date = dateEl ? (dateEl.getAttribute('datetime') || dateEl.innerText.trim()) : '';
291
+
292
+ if (pros || cons || title) {
293
+ out.push({stars, title, jobTitle, pros, cons, date});
294
+ }
295
+ }
296
+ return JSON.stringify(out);
297
+ })()
298
+ """)
299
+
300
+ results = json.loads(reviews)
301
+ for r in results:
302
+     print(f"{r['stars']}★ | {r['title']} | {r['jobTitle']}")
303
+     print(f" + {r['pros'][:100]}")
304
+     print(f" - {r['cons'][:100]}")
305
+ ```
306
+
307
+ ---
308
+
309
+ ## Workflow 5: Salary page — extract reported salary data
310
+
311
+ ```python
312
+ import json
313
+ from urllib.parse import quote_plus
314
+
315
+ # Salary pages use slug + character-count in the URL (n = len(role_slug))
316
+ role = "software-engineer"
317
+ n = len(role) # 17 for "software-engineer"
318
+
319
+ goto_url(f"https://www.glassdoor.com/Salaries/{role}-salary-SRCH_KO0,{n}.htm")
320
+ wait_for_load()
321
+ wait(5)
322
+
323
+ # Try __NEXT_DATA__ for structured salary data
324
+ next_data_raw = js("document.getElementById('__NEXT_DATA__') ? document.getElementById('__NEXT_DATA__').textContent : null")
325
+
326
+ if next_data_raw:
327
+     nd = json.loads(next_data_raw)
328
+     # Salary data sits under props.pageProps; salaryData and payData are the keys tried here
329
+     props = nd.get("props", {}).get("pageProps", {})
330
+     salary_data = props.get("salaryData") or props.get("payData")
331
+     if salary_data:
332
+         print(json.dumps(salary_data, indent=2))
333
+
334
+ # DOM fallback
335
+ salary_summary = js("""
336
+ (function() {
337
+ var medianEl = document.querySelector('[data-test="salary-estimate"], [class*="salaryEstimate"], [class*="median"]');
338
+ var rangeEl = document.querySelector('[data-test="salary-range"], [class*="salaryRange"]');
339
+ var countEl = document.querySelector('[data-test="salary-count"], [class*="salaryCount"]');
340
+ return JSON.stringify({
341
+ median: medianEl ? medianEl.innerText.trim() : '',
342
+ range: rangeEl ? rangeEl.innerText.trim() : '',
343
+ count: countEl ? countEl.innerText.trim() : '',
344
+ });
345
+ })()
346
+ """)
347
+ print(json.loads(salary_summary))
348
+ ```
349
+
350
+ ---
351
+
352
+ ## Handling the login modal
353
+
354
+ Glassdoor shows a sign-in modal:
355
+ - On Reviews/Salary pages: after viewing ~3-5 items (scroll-triggered)
356
+ - On job detail pages: often immediately
357
+
358
+ Dismiss it before extracting anything that requires scrolling:
359
+
360
+ ```python
361
+ def dismiss_glassdoor_login_modal():
362
+     """Close the Glassdoor sign-in modal. Safe to call if no modal is present."""
363
+     closed = js("""
364
+     (function() {
365
+       var selectors = [
366
+         '[alt="Close"]',
367
+         'button[class*="modal_closeIcon"]',
368
+         '[data-test="close-modal"]',
369
+         '[aria-label="Close"]',
370
+         'button[data-test="CloseButton"]',
371
+         '[class*="CloseButton"]',
372
+       ];
373
+       for (var i = 0; i < selectors.length; i++) {
374
+         var btn = document.querySelector(selectors[i]);
375
+         if (btn && btn.offsetParent !== null) {
376
+           btn.click();
377
+           return selectors[i];
378
+         }
379
+       }
380
+       return null;
381
+     })()
382
+     """)
383
+     if closed:
384
+         wait(1)
385
+     return closed
386
+
387
+ def dismiss_cookie_banner():
388
+     """Dismiss GDPR consent overlay. Safe to call even if no banner is present."""
389
+     dismissed = js("""
390
+     (function() {
391
+       var selectors = [
392
+         'button[data-test="accept-cookies"]',
393
+         '#onetrust-accept-btn-handler',
394
+         'button[id*="accept-all"]',
395
+         'button[class*="accept"]',
396
+         'button[class*="consent"]',
397
+       ];
398
+       for (var i = 0; i < selectors.length; i++) {
399
+         var btn = document.querySelector(selectors[i]);
400
+         if (btn && btn.offsetParent !== null) {
401
+           btn.click();
402
+           return selectors[i];
403
+         }
404
+       }
405
+       return null;
406
+     })()
407
+     """)
408
+     if dismissed:
409
+         wait(1)
410
+     return dismissed
411
+ ```
412
+
413
+ For Reviews/Salary pages: call `dismiss_glassdoor_login_modal()` immediately after the initial
414
+ wait, before any scrolling. Once you scroll down, the modal blocks the page and the X button
415
+ may itself be outside the viewport.
416
+
417
+ ---
418
+
419
+ ## Detecting whether you are past the CF challenge
420
+
421
+ After `goto_url()` + `wait(5)`, confirm you are on the real page:
422
+
423
+ ```python
424
+ def glassdoor_is_cf_blocked() -> bool:
425
+     """True if the CF managed challenge is still running."""
426
+     title = js("document.title") or ""
427
+     url = page_info()["url"]
428
+     return "Security" in title or "__cf_chl_tk" in url
429
+
430
+ # Usage
431
+ goto_url("https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm")
432
+ wait_for_load()
433
+ wait(5)
434
+
435
+ if glassdoor_is_cf_blocked():
436
+     wait(10)  # give CF extra time
437
+     if glassdoor_is_cf_blocked():
438
+         capture_screenshot("/tmp/glassdoor_cf_block.png")
439
+         raise RuntimeError("CF challenge did not resolve; check screenshot")
440
+ ```
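
The fixed `wait(5)` then `wait(10)` sequence above can be generalised into a bounded polling loop. A minimal sketch under stated assumptions: `wait_until_unblocked` is a hypothetical helper, and the block check and wait function are passed in as parameters so the loop does not depend on the harness directly.

```python
def wait_until_unblocked(is_blocked, wait, attempts=4, delay=5):
    """Poll is_blocked() up to `attempts` times, waiting `delay` seconds between checks.

    Returns True as soon as is_blocked() reports False, and False if every
    check still reports a block.
    """
    for attempt in range(attempts):
        if not is_blocked():
            return True
        if attempt < attempts - 1:
            wait(delay)  # no wait after the final check
    return False
```

With the helpers defined in this document, a call would look like `wait_until_unblocked(glassdoor_is_cf_blocked, wait)`. On a `False` result, capture a screenshot and raise as shown above.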
441
+
442
+ ---
443
+
444
+ ## Glassdoor company ID lookup
445
+
446
+ Glassdoor uses numeric employer IDs (e.g., Google = 9079, Apple = 1138, Meta = 40772).
447
+ To find the ID for any company:
448
+
449
+ ```python
450
+ from urllib.parse import quote_plus
451
+
452
+ company_name = "OpenAI"
453
+ goto_url(f"https://www.glassdoor.com/Search/results.htm?keyword={quote_plus(company_name)}&locT=N")
454
+ wait_for_load()
455
+ wait(5)
456
+
457
+ # Extract company cards from search results
458
+ companies = js("""
459
+ (function() {
460
+ var cards = document.querySelectorAll('[data-test="employer-card"], [class*="EmployerCard"], [class*="employer-card"]');
461
+ var out = [];
462
+ for (var i = 0; i < cards.length; i++) {
463
+ var c = cards[i];
464
+ var link = c.querySelector('a[href*="Overview"], a[href*="Reviews"]');
465
+ if (!link) continue;
466
+ var href = link.href;
467
+ // Extract employer ID: EI_IE{id} or E{id}
468
+ var m = href.match(/E(?:I_IE)?(\d+)/);
469
+ var empId = m ? m[1] : '';
470
+ var nameEl = c.querySelector('[class*="EmployerCard_name"], h2, [class*="name"]');
471
+ out.push({
472
+ empId,
473
+ name: nameEl ? nameEl.innerText.trim() : '',
474
+ href,
475
+ });
476
+ }
477
+ return JSON.stringify(out);
478
+ })()
479
+ """)
480
+
481
+ import json
482
+ for c in json.loads(companies):
483
+     print(c["empId"], c["name"], c["href"][:60])
484
+ ```
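
The same ID extraction can be done on the Python side when only a URL is available. A minimal sketch; `employer_id_from_url` is a hypothetical helper. Unlike the JS pattern above, it requires a leading hyphen and takes the last match in the URL path, on the assumption that the ID always follows the slug.

```python
import re
from urllib.parse import urlsplit

# Matches both forms used in company URLs: -E{id} and -EI_IE{id}
_EMPLOYER_ID = re.compile(r"-E(?:I_IE)?(\d+)")


def employer_id_from_url(url):
    """Return the numeric employer ID from a Glassdoor company URL, or None."""
    matches = _EMPLOYER_ID.findall(urlsplit(url).path)
    return int(matches[-1]) if matches else None
```

Taking the last match keeps a slug that itself contains `-E` followed by digits from being mistaken for the ID.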
485
+
486
+ ---
487
+
488
+ ## Gotchas
489
+
490
+ - **`http_get` is permanently blocked.** Cloudflare Bot Management blocks every IP-level request
491
+ with a JS managed challenge. No User-Agent, cookie, or header combination bypasses it. The
492
+ `__cf_bm` cookie returned in the 403 response is TLS-fingerprint-bound and cannot be replayed.
493
+ `api.glassdoor.com` is 410 Gone (shut down). Only real Chrome via CDP works.
494
+
495
+ - **`wait(5)` minimum after `wait_for_load()`.** CF's managed challenge runs for 2-4 seconds after
496
+ `readyState = complete`. Extracting too early returns the challenge page HTML, not Glassdoor
497
+ content. If you get empty results or the title is "Security | Glassdoor", wait longer.
498
+
499
+ - **Login modal triggers on scroll, not on load.** Extract all visible content immediately on page
500
+ load before any scrolling. Call `dismiss_glassdoor_login_modal()` right after the initial wait —
501
+ before issuing any `scroll()` calls.
502
+
503
+ - **Glassdoor shows ~10 cards without login.** Reviews and salary pages are severely limited
504
+ without an account. Job search cards are more accessible (~10-15 per page). If you need 30+
505
+ reviews, a logged-in session is required.
506
+
507
+ - **CSS class names use Next.js hashed suffixes.** Selectors like `[class*="JobCard_jobTitle"]`
508
+ match despite the hash suffix (e.g., `JobCard_jobTitle__abc12`). Never hardcode the full hashed
509
+ class name — it changes with deployments. Always use `[class*="prefix"]`.
510
+
511
+ - **`__NEXT_DATA__` is the fast path.** When accessible, Glassdoor's Next.js pages embed all page
512
+ data in `<script id="__NEXT_DATA__" type="application/json">`. Parse it before falling back to
513
+ DOM queries. Data path varies by page type: look under `props.pageProps.employer`,
514
+ `props.pageProps.salaryData`, `props.pageProps.jobListings`, etc.
515
+
516
+ - **Employer IDs are stable; URL slugs mostly are.** The employer ID (e.g., `9079` for Google) never
517
+   changes. Slugs occasionally change when a company rebrands, so always verify by following the
518
+   canonical redirect from a search result.
519
+
520
+ - **Rate limiting.** Glassdoor rate-limits by IP after ~5 company-page loads per minute.
521
+ Use `wait(5)` between consecutive company page navigations. Salary and reviews pages are heavier
522
+ — use `wait(8)` between those.
523
+
524
+ - **Salary URL requires character-count parameter.** The `SRCH_KO0,{n}` path segment encodes
525
+   `0` (start of role name) and `n` (end, i.e., `len(role_slug)`). For `"software-engineer"` (17
526
+   chars): `SRCH_KO0,17`. A wrong count returns a 404.
527
+
528
+ - **`locKeyword` vs `locId` for location filter.** `locKeyword=San+Francisco` works without
529
+ knowing Glassdoor's internal city ID. `locT=C` means city-type location. For metro areas,
530
+ also try `locT=M`. Omit `locId` unless you have the exact numeric ID from a Glassdoor URL.
531
+
532
+ - **PerimeterX is also active as a secondary layer.** After passing CF, Glassdoor runs behavioral
533
+ fingerprinting. Rapid automated scrolling, mouse movement, or navigation patterns may trigger a
534
+ secondary block. Mitigate with `wait(2)` between actions and avoid scripted mouse movement.
535
+
536
+ - **Some review and salary data require login.** Anonymous sessions get a subset of
537
+   data. If a field consistently returns empty, the page may require authentication before surfacing
538
+   that data in the DOM or `__NEXT_DATA__`.
539
+
540
+ - **`goto_url()` vs `new_tab()` for first navigation.** Use `new_tab()` for the very first Glassdoor
541
+ page in a session. If the harness is attached to a non-Glassdoor tab, `goto_url()` can silently
542
+ fail to pass the CF challenge because the existing tab may not have a clean origin context.
543
+ After the first successful load, `goto_url()` works fine for subsequent Glassdoor navigations.
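
Because the `__NEXT_DATA__` path varies by page type (see the fast-path gotcha above), a recursive key search can stand in for hardcoded paths. A minimal sketch; `find_first_key` is a hypothetical helper, and the sample payload in the usage note is made up for illustration rather than captured from Glassdoor.

```python
def find_first_key(node, wanted):
    """Depth-first search of parsed JSON for the first non-empty key in `wanted`.

    Returns a (key, value) tuple, or None if no wanted key holds a value.
    """
    if isinstance(node, dict):
        # Check this level first so shallow matches win over deeper ones
        for key, value in node.items():
            if key in wanted and value not in (None, {}, [], ""):
                return key, value
        for value in node.values():
            found = find_first_key(value, wanted)
            if found:
                return found
    elif isinstance(node, list):
        for item in node:
            found = find_first_key(item, wanted)
            if found:
                return found
    return None
```

For example, `find_first_key(nd, {"employer", "employerOverview"})` on an illustrative payload such as `{"props": {"pageProps": {"employerOverview": {"overallRating": 4.3}}}}` returns the `employerOverview` entry. If it returns `None`, fall back to the DOM selectors.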