@pencil-agent/nano-pencil 2.0.1 → 2.0.2
- package/README.md +267 -267
- package/dist/build-meta.json +3 -3
- package/dist/core/export-html/AGENT.md +11 -11
- package/dist/core/export-html/template.css +971 -971
- package/dist/core/export-html/template.html +54 -54
- package/dist/core/model/custom-providers.js +1 -1
- package/dist/core/model-registry.js +5 -5
- package/dist/extensions/builtin/AGENT.md +115 -115
- package/dist/extensions/builtin/browser/AGENT.md +17 -17
- package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
- package/dist/extensions/builtin/browser/browser.md +73 -73
- package/dist/extensions/builtin/browser/install.md +142 -142
- package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
- package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
- package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
- package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
- package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
- package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
- package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
- package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
- package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
- package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
- package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
- package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
- package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
- package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
- package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
- package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
- package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
- package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
- package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
- package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
- package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
- package/dist/extensions/builtin/goal/README.md +67 -67
- package/dist/extensions/builtin/grub/README.md +112 -112
- package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
- package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
- package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
- package/dist/extensions/builtin/link-world/linkworld.md +313 -313
- package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
- package/dist/extensions/builtin/loop/README.md +92 -92
- package/dist/extensions/builtin/mcp/figma-design.md +68 -68
- package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
- package/dist/extensions/builtin/recap/AGENT.md +15 -15
- package/dist/extensions/builtin/sal/README.md +72 -72
- package/dist/extensions/builtin/security-audit/README.md +289 -289
- package/dist/extensions/builtin/team/AGENT.md +112 -112
- package/dist/extensions/builtin/team/TESTING.md +299 -299
- package/dist/extensions/builtin/token-save/README.md +56 -56
- package/dist/extensions/optional/AGENT.md +10 -10
- package/dist/modes/interactive/controllers/input-submit-controller.js +2 -2
- package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
- package/dist/modes/interactive/interactive-mode.js +19 -19
- package/dist/modes/interactive/theme/dark.json +85 -85
- package/dist/modes/interactive/theme/light.json +84 -84
- package/dist/modes/interactive/theme/theme-schema.json +335 -335
- package/dist/modes/interactive/theme/warm.json +81 -81
- package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
- package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
- package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
- package/docs/SDK-TESTING.md +364 -0
- package/docs/codex-goal-command-impl.md +1055 -1055
- package/docs/codex-goal-vs-grub.md +500 -500
- package/docs/custom-provider.md +27 -27
- package/docs/extensions.md +27 -27
- package/docs/keybindings.md +27 -27
- package/docs/loop 重构完成总结.md +250 -250
- package/docs/loop 重构完成报告.md +122 -122
- package/docs/loop 重构方案.md +1222 -1222
- package/docs/loop 重构方案实现报告.md +158 -158
- package/docs/loop 重构方案对比分析.md +128 -128
- package/docs/loop 重构计划.md +320 -320
- package/docs/loop-usage-examples.md +214 -214
- package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
- package/docs/models.md +27 -27
- package/docs/packages.md +27 -27
- package/docs/pi-design-philosophy.md +457 -457
- package/docs/planmode.md +1987 -1987
- package/docs/prompt-templates.md +27 -27
- package/docs/providers.md +27 -27
- package/docs/sdk.md +27 -27
- package/docs/skills.md +27 -27
- package/docs/startup-performance-optimization.md +301 -0
- package/docs/themes.md +27 -27
- package/docs/tui.md +27 -27
- package/docs/认知地图.md +47 -0
- package/package.json +190 -190
- package/docs/cc-agent-design.md +0 -1297
- package/docs/cc-tui-design.md +0 -1333
- package/docs/nanoPencil-学习计划.md +0 -170
- package/docs/scan-report.md +0 -3820
- package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
- package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +0 -261
@@ -1,543 +1,543 @@
# Glassdoor: Company Data, Reviews, Jobs & Salaries

Field-tested against glassdoor.com on 2026-04-18.

## Anti-bot verdict: browser required, no http_get workaround exists

**`http_get` returns HTTP 403 on every Glassdoor URL without exception.**

Tested endpoints (all 403):
- `/Reviews/Google-Reviews-E9079.htm`
- `/Overview/Working-at-Google-EI_IE9079.htm`
- `/Job/jobs.htm?sc.keyword=software+engineer`
- `/Salaries/software-engineer-salary-SRCH_KO0,17.htm`
- `/graph` (GraphQL)
- `sitemap.xml`

UAs tested (all blocked): `Mozilla/5.0`, full Chrome 124, Googlebot, `curl/7.88.1`.

**Stack:** Cloudflare Bot Management (`Server: cloudflare`, `Cf-Mitigated: challenge`).
Challenge type: `managed` (JS-executed browser fingerprint check, no CAPTCHA widget, no user click
required in a real browser). Cookie-only bypass also fails: the `__cf_bm` cookie returned in the
403 response is bound to the browser TLS fingerprint and does not grant access when replayed.

`api.glassdoor.com` (the old public partner API) returned `410 Gone`, so it is permanently shut down.

**Use `goto_url()` + `wait()` exclusively. Never use `http_get` for Glassdoor.**

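The block signature reported above (HTTP 403 together with `Cf-Mitigated: challenge`) can be recognised mechanically when diagnosing a failed request. The following is a minimal sketch, assuming the response is available as a status code plus a header mapping; `looks_like_cf_challenge` is a hypothetical helper name and is not part of the browser harness.

```python
def looks_like_cf_challenge(status: int, headers: dict) -> bool:
    """Return True when a response matches the Cloudflare block signature.

    Hypothetical helper: checks for HTTP 403 plus a `Cf-Mitigated: challenge`
    header, the combination observed in the field test above.
    """
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {str(k).lower(): str(v).lower() for k, v in headers.items()}
    return status == 403 and lowered.get("cf-mitigated") == "challenge"
```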

---

## Do this first: open in a new tab, wait for CF to resolve

```python
new_tab("https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm")
wait_for_load()
wait(5)  # CF managed challenge runs for ~2-4s after readyState=complete
```

`wait(5)` is mandatory. CF's managed challenge executes JS fingerprinting probes after the DOM
is ready. Extracting before this resolves returns an empty or partial page.

Verify you are past the challenge before extracting:

```python
title = js("document.title")
url = page_info()["url"]
if "Security" in title or "__cf_chl_tk" in url:
    # CF challenge did not resolve yet; wait longer
    wait(5)
    title = js("document.title")
    assert "Security" not in title, "Still on CF block page"
```

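The single retry above can be generalised into a bounded polling loop. This is a sketch under stated assumptions rather than harness code: `wait_past_challenge` is a hypothetical name, and the page accessors are passed in as callables so the logic does not depend on any particular helper. In the harness it might be called as `wait_past_challenge(lambda: js("document.title"), lambda: page_info()["url"], sleep=wait)`, reusing the `js`, `page_info` and `wait` helpers shown earlier.

```python
import time
from typing import Callable


def wait_past_challenge(
    get_title: Callable[[], str],
    get_url: Callable[[], str],
    sleep: Callable[[float], None] = time.sleep,
    attempts: int = 3,
    delay: float = 5.0,
) -> bool:
    """Poll until the page no longer looks like the CF interstitial.

    Uses the same two signals as the check above: "Security" in the title
    or "__cf_chl_tk" in the URL. Returns True once the page is clear and
    False if it is still blocked after `attempts` checks.
    """
    for attempt in range(attempts):
        blocked = "Security" in get_title() or "__cf_chl_tk" in get_url()
        if not blocked:
            return True
        # Sleep only between checks, not after the final one.
        if attempt < attempts - 1:
            sleep(delay)
    return False
```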

---

## URL patterns

| Goal | URL |
|---|---|
| Company reviews | `/Reviews/{Company-slug}-Reviews-E{employer_id}.htm` |
| Company overview | `/Overview/Working-at-{Company-slug}-EI_IE{employer_id}.htm` |
| Company jobs | `/Jobs/{Company-slug}-Jobs-E{employer_id}.htm` |
| Keyword job search | `/Job/jobs.htm?sc.keyword={keyword}` |
| Keyword + location | `/Job/jobs.htm?sc.keyword={keyword}&locT=C&locKeyword={city}` |
| Remote jobs | `/Job/jobs.htm?sc.keyword={keyword}&remoteWorkType=1` |
| Job search page 2+ | append `&p=2`, `&p=3` |
| Salary page | `/Salaries/{role-slug}-salary-SRCH_KO0,{len}.htm` |

Employer IDs and company slugs are stable. Example: Google has employer ID `9079` (written `E9079`
in Reviews URLs and `EI_IE9079` in Overview URLs) and slug `Google`.

Find the employer ID from a search result URL or the company's Glassdoor page URL.

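The patterns in the table can be wrapped in small builder functions so that workflows do not hand-assemble URLs. The sketch below is illustrative only: all function names are hypothetical, `{len}` is taken to be the character count of the role slug (as in the tested `software-engineer` / `17` example), and combining the location and remote parameters in one URL is an assumption the table does not confirm.

```python
import re
from typing import Optional
from urllib.parse import quote_plus

BASE = "https://www.glassdoor.com"


def reviews_url(slug: str, employer_id: int) -> str:
    return f"{BASE}/Reviews/{slug}-Reviews-E{employer_id}.htm"


def overview_url(slug: str, employer_id: int) -> str:
    return f"{BASE}/Overview/Working-at-{slug}-EI_IE{employer_id}.htm"


def job_search_url(keyword: str, city: Optional[str] = None,
                   remote: bool = False, page: int = 1) -> str:
    url = f"{BASE}/Job/jobs.htm?sc.keyword={quote_plus(keyword)}"
    if city:
        url += f"&locT=C&locKeyword={quote_plus(city)}"
    if remote:
        url += "&remoteWorkType=1"  # assumption: may be combined with a city
    if page > 1:
        url += f"&p={page}"
    return url


def salary_url(role_slug: str) -> str:
    # The number after "KO0," is the length of the role slug.
    return f"{BASE}/Salaries/{role_slug}-salary-SRCH_KO0,{len(role_slug)}.htm"


def employer_id_from_url(url: str) -> Optional[int]:
    """Pull the numeric employer ID out of a Reviews, Jobs or Overview URL."""
    match = re.search(r"(?:EI_IE|-E)(\d+)\.htm", url)
    return int(match.group(1)) if match else None
```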

---

## Workflow 1: Job search (extract result cards)

Glassdoor renders job cards client-side. Wait 5 seconds after load before extracting.

```python
import json
from urllib.parse import quote_plus

query = "software engineer"
new_tab(f"https://www.glassdoor.com/Job/jobs.htm?sc.keyword={quote_plus(query)}")
wait_for_load()
wait(5)  # CF challenge + JS render

# Dismiss cookie banner if present (GDPR regions).
# dismiss_cookie_banner() is defined under "Handling the login modal".
dismiss_cookie_banner()

jobs = js("""
(function() {
  // Primary selector as of 2026-04
  var cards = document.querySelectorAll('li[data-jobid]');
  if (!cards.length) {
    // Fallback: class-based (Next.js CSS modules use hashed suffixes, so match the prefix)
    cards = document.querySelectorAll('[class*="JobsList_jobListItem"]');
  }
  var out = [];
  for (var i = 0; i < cards.length; i++) {
    var c = cards[i];
    var jobId = c.getAttribute('data-jobid') || '';
    var titleEl = c.querySelector('[data-test="job-title"], a[class*="JobCard_jobTitle"]');
    var compEl = c.querySelector('[data-test="employer-name"], [class*="JobCard_employer"]');
    var locEl = c.querySelector('[data-test="emp-location"], [class*="JobCard_location"]');
    var salEl = c.querySelector('[data-test="detailSalary"], [class*="salary"], .salaryEstimate');
    var ratingEl = c.querySelector('[data-test="rating"], [class*="ratingNumber"]');
    var linkEl = c.querySelector('a[href*="/job-listing/"], a[href*="glassdoor.com/job"]');
    out.push({
      jobId,
      title: titleEl ? titleEl.innerText.trim() : '',
      company: compEl ? compEl.innerText.trim() : '',
      location: locEl ? locEl.innerText.trim() : '',
      salary: salEl ? salEl.innerText.trim() : '',
      rating: ratingEl ? ratingEl.innerText.trim() : '',
      url: linkEl ? linkEl.href : '',
    });
  }
  // Drop cards where no title element matched
  return JSON.stringify(out.filter(j => j.title));
})()
""")

results = json.loads(jobs)
for r in results:
    print(r["title"], "|", r["company"], "|", r["location"])
```

**If `results` is empty:** take a screenshot and check which page you are on. Glassdoor often
serves a different layout under A/B tests. The screenshot shows which layout was served, so you
can adjust the card selector to match it.

```python
capture_screenshot("/tmp/glassdoor_jobs.png")
# Inspect the image, then adjust the querySelectorAll selector above
```


---

## Workflow 2: Job search pagination

Glassdoor paginates via `&p=N` on the job search URL.

```python
import json
from urllib.parse import quote_plus

query = "data scientist"
all_jobs = []
pages_collected = 0  # counts only pages that returned cards

for page in range(1, 4):  # pages 1-3, ~10 cards each
    url = f"https://www.glassdoor.com/Job/jobs.htm?sc.keyword={quote_plus(query)}&p={page}"
    goto_url(url)
    wait_for_load()
    wait(5 if page == 1 else 3)  # first page needs CF wait; subsequent pages are faster

    if page == 1:
        dismiss_cookie_banner()

    batch_json = js("""
    (function() {
      var cards = document.querySelectorAll('li[data-jobid], [class*="JobsList_jobListItem"]');
      var out = [];
      for (var i = 0; i < cards.length; i++) {
        var c = cards[i];
        var jobId = c.getAttribute('data-jobid') || '';
        var titleEl = c.querySelector('[data-test="job-title"], a[class*="JobCard_jobTitle"]');
        var compEl = c.querySelector('[data-test="employer-name"]');
        var locEl = c.querySelector('[data-test="emp-location"]');
        var salEl = c.querySelector('[data-test="detailSalary"], [class*="salary"]');
        var linkEl = c.querySelector('a[href*="/job-listing/"]');
        out.push({
          jobId,
          title: titleEl ? titleEl.innerText.trim() : '',
          company: compEl ? compEl.innerText.trim() : '',
          location: locEl ? locEl.innerText.trim() : '',
          salary: salEl ? salEl.innerText.trim() : '',
          url: linkEl ? linkEl.href : '',
        });
      }
      return JSON.stringify(out.filter(j => j.title));
    })()
    """)

    batch = json.loads(batch_json)
    if not batch:
        break  # no more results
    all_jobs.extend(batch)
    pages_collected += 1

print(f"Collected {len(all_jobs)} jobs across {pages_collected} pages")
```

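When cards from several pages are merged into one list, a deduplication pass is a cheap safeguard. Whether Glassdoor actually repeats cards across pages was not measured in the field test, so treat this as a defensive sketch; `dedupe_jobs` is a hypothetical helper that operates on the dictionaries produced by the extraction script above.

```python
def dedupe_jobs(jobs):
    """Drop repeated job cards, keeping the first occurrence of each.

    Cards are keyed by `jobId` when it is non-empty, otherwise by the
    (title, company, location) triple.
    """
    seen = set()
    unique = []
    for job in jobs:
        key = job.get("jobId") or (
            job.get("title"), job.get("company"), job.get("location")
        )
        if key in seen:
            continue
        seen.add(key)
        unique.append(job)
    return unique
```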

---

## Workflow 3: Company overview (rating and review count)

Navigate to the company Overview or Reviews page. These pages require login for full content, but the
summary header (overall rating, review count, recommend %) is visible without login.

```python
import json

# Example: Google (employer_id=9079)
employer_id = 9079
company_slug = "Google"

goto_url(f"https://www.glassdoor.com/Overview/Working-at-{company_slug}-EI_IE{employer_id}.htm")
wait_for_load()
wait(5)  # CF challenge

# Try __NEXT_DATA__ first: fastest and most complete
next_data_raw = js("document.getElementById('__NEXT_DATA__') ? document.getElementById('__NEXT_DATA__').textContent : null")

employer = None
if next_data_raw:
    nd = json.loads(next_data_raw)
    # Company data lives under props.pageProps; the path varies by page type.
    # Try the employer overview path.
    props = nd.get("props", {}).get("pageProps", {})
    employer = props.get("employer") or props.get("employerOverview")

if employer:
    print("Rating:", employer.get("overallRating"))
    print("Reviews:", employer.get("reviewCount") or employer.get("numberOfReviews"))
    print("Name:", employer.get("name") or employer.get("shortName"))
else:
    # Fall back to DOM selectors when __NEXT_DATA__ is missing or has no employer entry
    summary = js("""
    (function() {
      var ratingEl = document.querySelector('[data-test="rating"], .ratingNumber, [class*="ratingNum"]');
      var countEl = document.querySelector('[data-test="reviewCount"], .reviewCount, [class*="reviewCount"]');
      var nameEl = document.querySelector('h1[data-test="employer-name"], [class*="EmployerProfile_name"]');
      var recEl = document.querySelector('[data-test="recommend"], [class*="recommend"]');
      return JSON.stringify({
        rating: ratingEl ? ratingEl.innerText.trim() : '',
        reviews: countEl ? countEl.innerText.trim() : '',
        name: nameEl ? nameEl.innerText.trim() : '',
        recommend: recEl ? recEl.innerText.trim() : '',
      });
    })()
    """)
    print(json.loads(summary))
```

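The DOM fallback returns display strings rather than numbers. The sketch below shows one way to normalise them; the input formats (`"4.3"`, `"12,345 reviews"`, `"1.2K Reviews"`) are assumptions about how Glassdoor renders these fields and were not confirmed by the field test, and both function names are hypothetical.

```python
import re
from typing import Optional


def parse_rating(text: str) -> Optional[float]:
    """Return the first decimal number in `text`, or None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


def parse_count(text: str) -> Optional[int]:
    """Parse a count such as "12,345 reviews" or "1.2K Reviews" (assumed formats)."""
    # The trailing \b keeps a following word like "members" from being read as an "M" suffix.
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*([KkMm])?\b", text)
    if not match:
        return None
    value = float(match.group(1).replace(",", ""))
    scale = {"k": 1_000, "m": 1_000_000}.get((match.group(2) or "").lower(), 1)
    return int(round(value * scale))
```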

---

## Workflow 4: Company reviews page (extract individual reviews)

Reviews pages show up to ~10 reviews per page without login. A login modal appears after scrolling.
Extract before scrolling.

```python
import json

employer_id = 9079
company_slug = "Google"

goto_url(f"https://www.glassdoor.com/Reviews/{company_slug}-Reviews-E{employer_id}.htm")
wait_for_load()
wait(5)

dismiss_cookie_banner()

reviews = js("""
(function() {
  // Review cards: confirmed selector pattern
  var cards = document.querySelectorAll('[id^="empReview_"], [data-test="review-card"], [class*="ReviewCard"]');
  if (!cards.length) {
    cards = document.querySelectorAll('article[class*="review"]');
  }
  var out = [];
  for (var i = 0; i < cards.length; i++) {
    var c = cards[i];

    // Overall star rating (1-5)
    var starsEl = c.querySelector('[data-test="review-rating"], [class*="starRating"], span[class*="ratingNumber"]');
    var stars = starsEl ? starsEl.innerText.trim() : '';

    // Pros / Cons text
    var prosEl = c.querySelector('[data-test="pros"], [class*="pros"], p[class*="pros"]');
    var consEl = c.querySelector('[data-test="cons"], [class*="cons"], p[class*="cons"]');
    var pros = prosEl ? prosEl.innerText.trim() : '';
    var cons = consEl ? consEl.innerText.trim() : '';

    // Review title
    var titleEl = c.querySelector('[data-test="review-title"], h2[class*="reviewTitle"], [class*="title"] a');
    var title = titleEl ? titleEl.innerText.trim() : '';

    // Job title of reviewer
    var jobTitleEl = c.querySelector('[data-test="reviewer-job-title"], [class*="reviewerInfo"], [class*="authorJobTitle"]');
    var jobTitle = jobTitleEl ? jobTitleEl.innerText.trim() : '';

    // Date
    var dateEl = c.querySelector('time, [data-test="review-date"], [class*="reviewDate"]');
    var date = dateEl ? (dateEl.getAttribute('datetime') || dateEl.innerText.trim()) : '';

    if (pros || cons || title) {
      out.push({stars, title, jobTitle, pros, cons, date});
    }
  }
  return JSON.stringify(out);
})()
""")

results = json.loads(reviews)
for r in results:
    print(f"{r['stars']}★ | {r['title']} | {r['jobTitle']}")
    print(f"  + {r['pros'][:100]}")
    print(f"  - {r['cons'][:100]}")
```


---

## Workflow 5: Salary page (extract reported salary data)

```python
import json

# Salary pages use slug + character-count in the URL (n = len(role_slug))
role = "software-engineer"
n = len(role)  # 17 for "software-engineer"

goto_url(f"https://www.glassdoor.com/Salaries/{role}-salary-SRCH_KO0,{n}.htm")
wait_for_load()
wait(5)

# Try __NEXT_DATA__ for structured salary data
next_data_raw = js("document.getElementById('__NEXT_DATA__') ? document.getElementById('__NEXT_DATA__').textContent : null")

if next_data_raw:
    nd = json.loads(next_data_raw)
    # Salary data is typically under props.pageProps.salaryData or .salaryEstimate;
    # .payData is checked as well because the key varies by page.
    props = nd.get("props", {}).get("pageProps", {})
    salary_data = (
        props.get("salaryData")
        or props.get("salaryEstimate")
        or props.get("payData")
    )
    if salary_data:
        print(json.dumps(salary_data, indent=2))

# DOM fallback
salary_summary = js("""
(function() {
  var medianEl = document.querySelector('[data-test="salary-estimate"], [class*="salaryEstimate"], [class*="median"]');
  var rangeEl = document.querySelector('[data-test="salary-range"], [class*="salaryRange"]');
  var countEl = document.querySelector('[data-test="salary-count"], [class*="salaryCount"]');
  return JSON.stringify({
    median: medianEl ? medianEl.innerText.trim() : '',
    range: rangeEl ? rangeEl.innerText.trim() : '',
    count: countEl ? countEl.innerText.trim() : '',
  });
})()
""")
print(json.loads(salary_summary))
```

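The `median` and `range` fields from the DOM fallback are display strings. A small parser can turn them into integers for comparison or sorting. This is a sketch that assumes US-dollar text such as `"$120K - $180K"` or `"$95,000/yr"`; the actual strings Glassdoor renders were not recorded in the field test, and `parse_salary_amounts` is a hypothetical name.

```python
import re
from typing import List


def parse_salary_amounts(text: str) -> List[int]:
    """Extract dollar amounts from text like "$120K - $180K" (assumed format).

    Returns the amounts in the order they appear, as whole dollars.
    """
    amounts = []
    pattern = r"\$\s*(\d[\d,]*(?:\.\d+)?)\s*([KkMm])?\b"
    for number, suffix in re.findall(pattern, text):
        value = float(number.replace(",", ""))
        scale = {"k": 1_000, "m": 1_000_000}.get(suffix.lower(), 1)
        amounts.append(int(round(value * scale)))
    return amounts
```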

---

## Handling the login modal

Glassdoor shows a sign-in modal:
- On Reviews/Salary pages: after viewing ~3-5 items (scroll-triggered)
- On job detail pages: often immediately

Dismiss it before extracting anything that requires scrolling:

```python
def dismiss_glassdoor_login_modal():
    """Close the Glassdoor sign-in modal. Safe to call if no modal is present."""
    closed = js("""
    (function() {
      var selectors = [
        '[alt="Close"]',
        'button[class*="modal_closeIcon"]',
        '[data-test="close-modal"]',
        '[aria-label="Close"]',
        'button[data-test="CloseButton"]',
        '[class*="CloseButton"]',
      ];
      for (var i = 0; i < selectors.length; i++) {
        var btn = document.querySelector(selectors[i]);
        // offsetParent is null for hidden elements, so only click visible buttons
        if (btn && btn.offsetParent !== null) {
          btn.click();
          return selectors[i];
        }
      }
      return null;
    })()
    """)
    if closed:
        wait(1)
    return closed

def dismiss_cookie_banner():
    """Dismiss GDPR consent overlay. Safe to call even if no banner is present."""
    dismissed = js("""
    (function() {
      var selectors = [
        'button[data-test="accept-cookies"]',
        '#onetrust-accept-btn-handler',
        'button[id*="accept-all"]',
        'button[class*="accept"]',
        'button[class*="consent"]',
      ];
      for (var i = 0; i < selectors.length; i++) {
        var btn = document.querySelector(selectors[i]);
        if (btn && btn.offsetParent !== null) {
          btn.click();
          return selectors[i];
        }
      }
      return null;
    })()
    """)
    if dismissed:
        wait(1)
    return dismissed
```

For Reviews/Salary pages: call `dismiss_glassdoor_login_modal()` immediately after the initial
|
|
414
|
-
wait, before any scrolling. Once you scroll down, the modal blocks the page and the X button
|
|
415
|
-
may itself be outside the viewport.
|
|
416
|
-
|
|
417
|
-
---
|
|
418
|
-
|
|
419
|
-
## Detecting whether you are past the CF challenge
|
|
420
|
-
|
|
421
|
-
After `goto_url()` + `wait(5)`, confirm you are on the real page:
|
|
422
|
-
|
|
423
|
-
```python
|
|
424
|
-
def glassdoor_is_cf_blocked() -> bool:
|
|
425
|
-
"""True if the CF managed challenge is still running."""
|
|
426
|
-
title = js("document.title") or ""
|
|
427
|
-
url = page_info()["url"]
|
|
428
|
-
return "Security" in title or "__cf_chl_tk" in url
|
|
429
|
-
|
|
430
|
-
# Usage
|
|
431
|
-
goto_url("https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm")
|
|
432
|
-
wait_for_load()
|
|
433
|
-
wait(5)
|
|
434
|
-
|
|
435
|
-
if glassdoor_is_cf_blocked():
|
|
436
|
-
wait(10) # give CF extra time
|
|
437
|
-
if glassdoor_is_cf_blocked():
|
|
438
|
-
capture_screenshot("/tmp/glassdoor_cf_block.png")
|
|
439
|
-
raise RuntimeError("CF challenge did not resolve — check screenshot")
|
|
440
|
-
```
|
|
441
|
-
|
|
442
|
-
---
|
|
443
|
-
|
|
444
|
-
## Glassdoor company ID lookup
|
|
445
|
-
|
|
446
|
-
Glassdoor uses numeric employer IDs (e.g., Google = 9079, Apple = 1138, Meta = 40772).
|
|
447
|
-
To find the ID for any company:
|
|
448
|
-
|
|
449
|
-
```python
|
|
450
|
-
from urllib.parse import quote_plus
|
|
451
|
-
|
|
452
|
-
company_name = "OpenAI"
|
|
453
|
-
goto_url(f"https://www.glassdoor.com/Search/results.htm?keyword={quote_plus(company_name)}&locT=N")
|
|
454
|
-
wait_for_load()
|
|
455
|
-
wait(5)
|
|
456
|
-
|
|
457
|
-
# Extract company cards from search results
|
|
458
|
-
companies = js("""
|
|
459
|
-
(function() {
|
|
460
|
-
var cards = document.querySelectorAll('[data-test="employer-card"], [class*="EmployerCard"], [class*="employer-card"]');
|
|
461
|
-
var out = [];
|
|
462
|
-
for (var i = 0; i < cards.length; i++) {
|
|
463
|
-
var c = cards[i];
|
|
464
|
-
var link = c.querySelector('a[href*="Overview"], a[href*="Reviews"]');
|
|
465
|
-
if (!link) continue;
|
|
466
|
-
var href = link.href;
|
|
467
|
-
// Extract employer ID: EI_IE{id} or E{id}
|
|
468
|
-
var m = href.match(/E(?:I_IE)?(\d+)/);
|
|
469
|
-
var empId = m ? m[1] : '';
|
|
470
|
-
var nameEl = c.querySelector('[class*="EmployerCard_name"], h2, [class*="name"]');
|
|
471
|
-
out.push({
|
|
472
|
-
empId,
|
|
473
|
-
name: nameEl ? nameEl.innerText.trim() : '',
|
|
474
|
-
href,
|
|
475
|
-
});
|
|
476
|
-
}
|
|
477
|
-
return JSON.stringify(out);
|
|
478
|
-
})()
|
|
479
|
-
""")
|
|
480
|
-
|
|
481
|
-
import json
|
|
482
|
-
for c in json.loads(companies):
|
|
483
|
-
print(c["empId"], c["name"], c["href"][:60])
|
|
484
|
-
```
|
|
485
|
-
|
|
486
|
-
---
|
|
487
|
-
|
|
488
|
-
## Gotchas
|
|
489
|
-
|
|
490
|
-
- **`http_get` is permanently blocked.** Cloudflare Bot Management blocks every IP-level request
|
|
491
|
-
with a JS managed challenge. No User-Agent, cookie, or header combination bypasses it. The
|
|
492
|
-
`__cf_bm` cookie returned in the 403 response is TLS-fingerprint-bound and cannot be replayed.
|
|
493
|
-
`api.glassdoor.com` is 410 Gone (shut down). Only real Chrome via CDP works.
|
|
494
|
-
|
|
495
|
-
- **`wait(5)` minimum after `wait_for_load()`.** CF's managed challenge runs for 2-4 seconds after
|
|
496
|
-
`readyState = complete`. Extracting too early returns the challenge page HTML, not Glassdoor
|
|
497
|
-
content. If you get empty results or the title is "Security | Glassdoor", wait longer.
|
|
498
|
-
|
|
499
|
-
- **Login modal triggers on scroll, not on load.** Extract all visible content immediately on page
|
|
500
|
-
load before any scrolling. Call `dismiss_glassdoor_login_modal()` right after the initial wait —
|
|
501
|
-
before issuing any `scroll()` calls.
|
|
502
|
-
|
|
503
|
-
- **Glassdoor shows ~10 cards without login.** Reviews and salary pages are severely limited
|
|
504
|
-
without an account. Job search cards are more accessible (~10-15 per page). If you need 30+
|
|
505
|
-
reviews, a logged-in session is required.
|
|
506
|
-
|
|
507
|
-
- **CSS class names use Next.js hashed suffixes.** Selectors like `[class*="JobCard_jobTitle"]`
|
|
508
|
-
match despite the hash suffix (e.g., `JobCard_jobTitle__abc12`). Never hardcode the full hashed
|
|
509
|
-
class name — it changes with deployments. Always use `[class*="prefix"]`.
|
|
510
|
-
|
|
511
|
-
- **`__NEXT_DATA__` is the fast path.** When accessible, Glassdoor's Next.js pages embed all page
|
|
512
|
-
data in `<script id="__NEXT_DATA__" type="application/json">`. Parse it before falling back to
|
|
513
|
-
DOM queries. Data path varies by page type: look under `props.pageProps.employer`,
|
|
514
|
-
`props.pageProps.salaryData`, `props.pageProps.jobListings`, etc.
|
|
515
|
-
|
|
516
|
-
- **Company URL slugs and IDs are stable.** The employer ID (e.g., `9079` for Google) never
|
|
517
|
-
changes. Slugs occasionally change when a company rebrands — always verify by following the
|
|
518
|
-
canonical redirect from a search result.
|
|
519
|
-
|
|
520
|
-
- **Rate limiting.** Glassdoor rate-limits by IP after ~5 company-page loads per minute.
|
|
521
|
-
Use `wait(5)` between consecutive company page navigations. Salary and reviews pages are heavier
|
|
522
|
-
— use `wait(8)` between those.
|
|
523
|
-
|
|
524
|
-
- **Salary URL requires character-count parameter.** The `SRCH_KO0,{n}` fragment encodes
|
|
525
|
-
`0` (start of role name) and `n` (end, i.e., `len(role_slug)`). For `"software-engineer"` (17
|
|
526
|
-
chars): `SRCH_KO0,17`. Wrong count returns a 404.
|
|
527
|
-
|
|
528
|
-
- **`locKeyword` vs `locId` for location filter.** `locKeyword=San+Francisco` works without
|
|
529
|
-
knowing Glassdoor's internal city ID. `locT=C` means city-type location. For metro areas,
|
|
530
|
-
also try `locT=M`. Omit `locId` unless you have the exact numeric ID from a Glassdoor URL.
|
|
531
|
-
|
|
532
|
-
- **PerimeterX is also active as a secondary layer.** After passing CF, Glassdoor runs behavioral
|
|
533
|
-
fingerprinting. Rapid automated scrolling, mouse movement, or navigation patterns may trigger a
|
|
534
|
-
secondary block. Mitigate with `wait(2)` between actions and avoid scripted mouse movement.
|
|
535
|
-
|
|
536
|
-
- **Review and salary data require login on some accounts.** Anonymous sessions get a subset of
|
|
537
|
-
data. If a field returns empty consistently, the page may require authentication before surfacing
|
|
538
|
-
that data in the DOM or `__NEXT_DATA__`.
|
|
539
|
-
|
|
540
|
-
- **`goto_url()` vs `new_tab()` for first navigation.** Use `new_tab()` for the very first Glassdoor
|
|
541
|
-
page in a session. If the harness is attached to a non-Glassdoor tab, `goto_url()` can silently
|
|
542
|
-
fail to pass the CF challenge because the existing tab may not have a clean origin context.
|
|
543
|
-
After the first successful load, `goto_url()` works fine for subsequent Glassdoor navigations.
|
|
1
|
+
# Glassdoor — Company Data, Reviews, Jobs & Salaries
|
|
2
|
+
|
|
3
|
+
Field-tested against glassdoor.com on 2026-04-18.
|
|
4
|
+
|
|
5
|
+
## Anti-bot verdict: browser required, no http_get workaround exists
|
|
6
|
+
|
|
7
|
+
**`http_get` returns HTTP 403 on every Glassdoor URL without exception.**
|
|
8
|
+
|
|
9
|
+
Tested endpoints (all 403):
|
|
10
|
+
- `/Reviews/Google-Reviews-E9079.htm`
|
|
11
|
+
- `/Overview/Working-at-Google-EI_IE9079.htm`
|
|
12
|
+
- `/Job/jobs.htm?sc.keyword=software+engineer`
|
|
13
|
+
- `/Salaries/software-engineer-salary-SRCH_KO0,17.htm`
|
|
14
|
+
- `/graph` (GraphQL)
|
|
15
|
+
- `sitemap.xml`
|
|
16
|
+
|
|
17
|
+
UAs tested (all blocked): `Mozilla/5.0`, full Chrome 124, Googlebot, `curl/7.88.1`.
|
|
18
|
+
|
|
19
|
+
**Stack:** Cloudflare Bot Management (`Server: cloudflare`, `Cf-Mitigated: challenge`).
|
|
20
|
+
Challenge type: `managed` (JS-executed browser fingerprint check, no CAPTCHA widget, no user click
|
|
21
|
+
required in a real browser). Cookie-only bypass also fails — the `__cf_bm` cookie returned in the
|
|
22
|
+
403 response is bound to the browser TLS fingerprint and does not grant access when replayed.
|
|
23
|
+
|
|
24
|
+
`api.glassdoor.com` (the old public partner API) returned `410 Gone` — permanently shut down.
|
|
25
|
+
|
|
26
|
+
**Use browser navigation (`new_tab()` or `goto_url()`, followed by `wait()`) exclusively. Never use `http_get` for Glassdoor.**
|
|
27
|
+
|
|
28
|
+
---
|
|
29
|
+
|
|
30
|
+
## Do this first: open in a new tab, wait for CF to resolve
|
|
31
|
+
|
|
32
|
+
```python
|
|
33
|
+
new_tab("https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm")
|
|
34
|
+
wait_for_load()
|
|
35
|
+
wait(5) # CF managed challenge runs for ~2-4s after readyState=complete
|
|
36
|
+
```
|
|
37
|
+
|
|
38
|
+
`wait(5)` is mandatory. CF's managed challenge executes JS fingerprinting probes after the DOM
|
|
39
|
+
is ready. Extracting before this resolves returns an empty or partial page.
|
|
40
|
+
|
|
41
|
+
Verify you are past the challenge before extracting:
|
|
42
|
+
|
|
43
|
+
```python
|
|
44
|
+
title = js("document.title") or ""
|
|
45
|
+
url = page_info()["url"]
|
|
46
|
+
if "Security" in title or "__cf_chl_tk" in url:
|
|
47
|
+
# CF challenge did not resolve yet — wait longer
|
|
48
|
+
    wait(5)
|
|
49
|
+
    title = js("document.title") or ""
|
|
50
|
+
    assert "Security" not in title, "Still on CF block page"
|
|
51
|
+
```
|
|
52
|
+
|
|
53
|
+
---
|
|
54
|
+
|
|
55
|
+
## URL patterns
|
|
56
|
+
|
|
57
|
+
| Goal | URL |
|
|
58
|
+
|---|---|
|
|
59
|
+
| Company reviews | `/Reviews/{Company-slug}-Reviews-E{employer_id}.htm` |
|
|
60
|
+
| Company overview | `/Overview/Working-at-{Company-slug}-EI_IE{employer_id}.htm` |
|
|
61
|
+
| Company jobs | `/Jobs/{Company-slug}-Jobs-E{employer_id}.htm` |
|
|
62
|
+
| Keyword job search | `/Job/jobs.htm?sc.keyword={keyword}` |
|
|
63
|
+
| Keyword + location | `/Job/jobs.htm?sc.keyword={keyword}&locT=C&locKeyword={city}` |
|
|
64
|
+
| Remote jobs | `/Job/jobs.htm?sc.keyword={keyword}&remoteWorkType=1` |
|
|
65
|
+
| Job search page 2+ | append `&p=2`, `&p=3` |
|
|
66
|
+
| Salary page | `/Salaries/{role-slug}-salary-SRCH_KO0,{len}.htm` |
|
|
67
|
+
|
|
68
|
+
Employer IDs are stable; company slugs are mostly stable but can change when a company rebrands. Example: Google = `EI_IE9079`, slug = `Google`.
|
|
69
|
+
|
|
70
|
+
Find the employer ID from a search result URL or the company's Glassdoor page URL.
|
|
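The patterns in the table can be assembled with a few small helpers. This is a minimal sketch: the function names are illustrative and not part of the harness, and it assumes keywords and cities are encoded with `quote_plus`, as the workflows in this document do.

```python
from typing import Optional
from urllib.parse import quote_plus

BASE = "https://www.glassdoor.com"


def reviews_url(slug: str, employer_id: int) -> str:
    """Company reviews page: /Reviews/{slug}-Reviews-E{id}.htm"""
    return f"{BASE}/Reviews/{slug}-Reviews-E{employer_id}.htm"


def overview_url(slug: str, employer_id: int) -> str:
    """Company overview page: /Overview/Working-at-{slug}-EI_IE{id}.htm"""
    return f"{BASE}/Overview/Working-at-{slug}-EI_IE{employer_id}.htm"


def job_search_url(keyword: str, city: Optional[str] = None, page: int = 1) -> str:
    """Keyword job search, optionally filtered by city and paginated with &p=N."""
    url = f"{BASE}/Job/jobs.htm?sc.keyword={quote_plus(keyword)}"
    if city:
        url += f"&locT=C&locKeyword={quote_plus(city)}"
    if page > 1:
        url += f"&p={page}"
    return url
```

Building URLs in one place keeps the `E{id}` and `EI_IE{id}` variants from being mixed up between page types.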
71
|
+
|
|
72
|
+
---
|
|
73
|
+
|
|
74
|
+
## Workflow 1: Job search — extract result cards
|
|
75
|
+
|
|
76
|
+
Glassdoor renders job cards client-side. Wait 5 seconds after load before extracting.
|
|
77
|
+
|
|
78
|
+
```python
|
|
79
|
+
import json
|
|
80
|
+
from urllib.parse import quote_plus
|
|
81
|
+
|
|
82
|
+
query = "software engineer"
|
|
83
|
+
new_tab(f"https://www.glassdoor.com/Job/jobs.htm?sc.keyword={quote_plus(query)}")
|
|
84
|
+
wait_for_load()
|
|
85
|
+
wait(5) # CF challenge + JS render
|
|
86
|
+
|
|
87
|
+
# Dismiss cookie banner if present (GDPR regions)
|
|
88
|
+
dismiss_cookie_banner()
|
|
89
|
+
|
|
90
|
+
jobs = js("""
|
|
91
|
+
(function() {
|
|
92
|
+
// Primary selector as of 2026-04
|
|
93
|
+
var cards = document.querySelectorAll('li[data-jobid]');
|
|
94
|
+
if (!cards.length) {
|
|
95
|
+
// Fallback: class-based (Next.js CSS modules use hashed suffixes — match prefix)
|
|
96
|
+
cards = document.querySelectorAll('[class*="JobsList_jobListItem"]');
|
|
97
|
+
}
|
|
98
|
+
var out = [];
|
|
99
|
+
for (var i = 0; i < cards.length; i++) {
|
|
100
|
+
var c = cards[i];
|
|
101
|
+
var jobId = c.getAttribute('data-jobid') || '';
|
|
102
|
+
var titleEl = c.querySelector('[data-test="job-title"], a[class*="JobCard_jobTitle"]');
|
|
103
|
+
var compEl = c.querySelector('[data-test="employer-name"], [class*="JobCard_employer"]');
|
|
104
|
+
var locEl = c.querySelector('[data-test="emp-location"], [class*="JobCard_location"]');
|
|
105
|
+
var salEl = c.querySelector('[data-test="detailSalary"], [class*="salary"], .salaryEstimate');
|
|
106
|
+
var ratingEl = c.querySelector('[data-test="rating"], [class*="ratingNumber"]');
|
|
107
|
+
var linkEl = c.querySelector('a[href*="/job-listing/"], a[href*="glassdoor.com/job"]');
|
|
108
|
+
out.push({
|
|
109
|
+
jobId,
|
|
110
|
+
title: titleEl ? titleEl.innerText.trim() : '',
|
|
111
|
+
company: compEl ? compEl.innerText.trim() : '',
|
|
112
|
+
location: locEl ? locEl.innerText.trim() : '',
|
|
113
|
+
salary: salEl ? salEl.innerText.trim() : '',
|
|
114
|
+
rating: ratingEl ? ratingEl.innerText.trim() : '',
|
|
115
|
+
url: linkEl ? linkEl.href : '',
|
|
116
|
+
});
|
|
117
|
+
}
|
|
118
|
+
return JSON.stringify(out.filter(j => j.title));
|
|
119
|
+
})()
|
|
120
|
+
""")
|
|
121
|
+
|
|
122
|
+
results = json.loads(jobs)
|
|
123
|
+
for r in results:
|
|
124
|
+
    print(r["title"], "|", r["company"], "|", r["location"])
|
|
125
|
+
```
|
|
126
|
+
|
|
127
|
+
**If `results` is empty:** take a screenshot and check which page you are on. Glassdoor often
|
|
128
|
+
serves a different layout under A/B tests. The screenshot shows which layout you received; inspect its DOM to find the card selector that matches.
|
|
129
|
+
|
|
130
|
+
```python
|
|
131
|
+
capture_screenshot("/tmp/glassdoor_jobs.png")
|
|
132
|
+
# Inspect the image, then adjust the querySelectorAll selector above
|
|
133
|
+
```
|
|
134
|
+
|
|
135
|
+
---
|
|
136
|
+
|
|
137
|
+
## Workflow 2: Job search pagination
|
|
138
|
+
|
|
139
|
+
Glassdoor paginates via `&p=N` on the job search URL.
|
|
140
|
+
|
|
141
|
+
```python
|
|
142
|
+
import json
|
|
143
|
+
from urllib.parse import quote_plus
|
|
144
|
+
|
|
145
|
+
query = "data scientist"
|
|
146
|
+
all_jobs = []
|
|
147
|
+
|
|
148
|
+
for page in range(1, 4): # pages 1-3, ~10 cards each
|
|
149
|
+
    url = f"https://www.glassdoor.com/Job/jobs.htm?sc.keyword={quote_plus(query)}&p={page}"
|
|
150
|
+
    goto_url(url)
|
|
151
|
+
    wait_for_load()
|
|
152
|
+
    wait(5 if page == 1 else 3)  # first page needs CF wait; subsequent pages are faster
|
|
153
|
+
|
|
154
|
+
    if page == 1:
|
|
155
|
+
        dismiss_cookie_banner()
|
|
156
|
+
|
|
157
|
+
    batch_json = js("""
|
|
158
|
+
(function() {
|
|
159
|
+
var cards = document.querySelectorAll('li[data-jobid], [class*="JobsList_jobListItem"]');
|
|
160
|
+
var out = [];
|
|
161
|
+
for (var i = 0; i < cards.length; i++) {
|
|
162
|
+
var c = cards[i];
|
|
163
|
+
var jobId = c.getAttribute('data-jobid') || '';
|
|
164
|
+
var titleEl = c.querySelector('[data-test="job-title"], a[class*="JobCard_jobTitle"]');
|
|
165
|
+
var compEl = c.querySelector('[data-test="employer-name"]');
|
|
166
|
+
var locEl = c.querySelector('[data-test="emp-location"]');
|
|
167
|
+
var salEl = c.querySelector('[data-test="detailSalary"], [class*="salary"]');
|
|
168
|
+
var linkEl = c.querySelector('a[href*="/job-listing/"]');
|
|
169
|
+
out.push({
|
|
170
|
+
jobId,
|
|
171
|
+
title: titleEl ? titleEl.innerText.trim() : '',
|
|
172
|
+
company: compEl ? compEl.innerText.trim() : '',
|
|
173
|
+
location: locEl ? locEl.innerText.trim() : '',
|
|
174
|
+
salary: salEl ? salEl.innerText.trim() : '',
|
|
175
|
+
url: linkEl ? linkEl.href : '',
|
|
176
|
+
});
|
|
177
|
+
}
|
|
178
|
+
return JSON.stringify(out.filter(j => j.title));
|
|
179
|
+
})()
|
|
180
|
+
""")
|
|
181
|
+
|
|
182
|
+
    batch = json.loads(batch_json)
|
|
183
|
+
    if not batch:
|
|
184
|
+
        break  # no more results
|
|
185
|
+
    all_jobs.extend(batch)
|
|
186
|
+
|
|
187
|
+
print(f"Collected {len(all_jobs)} jobs (stopped at page {page})")
|
|
188
|
+
```
|
|
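When collecting several pages, the same card can plausibly appear more than once (for example a listing repeated on consecutive pages; this is an assumption, not something verified in this document). A small, hypothetical helper that keeps the first occurrence of each job, keyed by `jobId` and falling back to `url`, might look like this:

```python
def dedupe_jobs(jobs):
    """Return jobs with duplicates removed, preserving the original order.

    A job is identified by its jobId, or by its url when jobId is empty.
    Records with neither key are kept as-is because they cannot be compared.
    """
    seen = set()
    unique = []
    for job in jobs:
        key = job.get("jobId") or job.get("url")
        if not key:
            unique.append(job)
            continue
        if key in seen:
            continue
        seen.add(key)
        unique.append(job)
    return unique
```

Calling `dedupe_jobs(all_jobs)` after the loop leaves the per-page extraction untouched and only filters the combined list.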
189
|
+
|
|
190
|
+
---
|
|
191
|
+
|
|
192
|
+
## Workflow 3: Company overview — rating and review count
|
|
193
|
+
|
|
194
|
+
Navigate to the company Overview or Reviews page. These pages require login for full content, but the
|
|
195
|
+
summary header (overall rating, review count, recommend %) is visible without login.
|
|
196
|
+
|
|
197
|
+
```python
|
|
198
|
+
import json
|
|
199
|
+
|
|
200
|
+
# Example: Google (employer_id=9079)
|
|
201
|
+
employer_id = 9079
|
|
202
|
+
company_slug = "Google"
|
|
203
|
+
|
|
204
|
+
goto_url(f"https://www.glassdoor.com/Overview/Working-at-{company_slug}-EI_IE{employer_id}.htm")
|
|
205
|
+
wait_for_load()
|
|
206
|
+
wait(5) # CF challenge
|
|
207
|
+
|
|
208
|
+
# Try __NEXT_DATA__ first — fastest and most complete
|
|
209
|
+
next_data_raw = js("document.getElementById('__NEXT_DATA__') ? document.getElementById('__NEXT_DATA__').textContent : null")
|
|
210
|
+
|
|
211
|
+
if next_data_raw:
|
|
212
|
+
    nd = json.loads(next_data_raw)
|
|
213
|
+
# Company data lives under props.pageProps — path varies by page type
|
|
214
|
+
# Try employer overview path
|
|
215
|
+
    props = nd.get("props", {}).get("pageProps", {})
|
|
216
|
+
    employer = props.get("employer") or props.get("employerOverview")
|
|
217
|
+
    if employer:
|
|
218
|
+
        print("Rating:", employer.get("overallRating"))
|
|
219
|
+
        print("Reviews:", employer.get("reviewCount") or employer.get("numberOfReviews"))
|
|
220
|
+
        print("Name:", employer.get("name") or employer.get("shortName"))
|
|
221
|
+
else:
|
|
222
|
+
# Fall back to DOM selectors
|
|
223
|
+
    summary = js("""
|
|
224
|
+
(function() {
|
|
225
|
+
var ratingEl = document.querySelector('[data-test="rating"], .ratingNumber, [class*="ratingNum"]');
|
|
226
|
+
var countEl = document.querySelector('[data-test="reviewCount"], .reviewCount, [class*="reviewCount"]');
|
|
227
|
+
var nameEl = document.querySelector('h1[data-test="employer-name"], [class*="EmployerProfile_name"]');
|
|
228
|
+
var recEl = document.querySelector('[data-test="recommend"], [class*="recommend"]');
|
|
229
|
+
return JSON.stringify({
|
|
230
|
+
rating: ratingEl ? ratingEl.innerText.trim() : '',
|
|
231
|
+
reviews: countEl ? countEl.innerText.trim() : '',
|
|
232
|
+
name: nameEl ? nameEl.innerText.trim() : '',
|
|
233
|
+
recommend: recEl ? recEl.innerText.trim() : '',
|
|
234
|
+
});
|
|
235
|
+
})()
|
|
236
|
+
""")
|
|
237
|
+
    print(json.loads(summary))
|
|
238
|
+
```
|
|
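The `__NEXT_DATA__` lookup can be isolated into a pure function so the caller has a single signal for when to use the DOM selectors. The sketch below is hypothetical: `employer_summary` is not a harness function, and the keys it tries are the same ones tried in the workflow above, which are themselves not guaranteed to be present.

```python
import json


def employer_summary(next_data_raw):
    """Parse the raw __NEXT_DATA__ text and return a summary dict, or None.

    None means "no usable employer data here", whether the script tag was
    missing, the JSON was malformed, or the expected keys were absent.
    """
    if not next_data_raw:
        return None
    try:
        nd = json.loads(next_data_raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(nd, dict):
        return None
    props = (nd.get("props") or {}).get("pageProps") or {}
    employer = props.get("employer") or props.get("employerOverview")
    if not isinstance(employer, dict):
        return None
    return {
        "name": employer.get("name") or employer.get("shortName"),
        "rating": employer.get("overallRating"),
        "reviews": employer.get("reviewCount") or employer.get("numberOfReviews"),
    }
```

Treating every failure mode as `None` means the DOM fallback runs both when the script tag is absent and when it is present but lacks employer data.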
239
|
+
|
|
240
|
+
---
|
|
241
|
+
|
|
242
|
+
## Workflow 4: Company reviews page — extract individual reviews
|
|
243
|
+
|
|
244
|
+
Reviews pages show up to ~10 reviews per page without login. A login modal appears after scrolling.
|
|
245
|
+
Extract before scrolling.
|
|
246
|
+
|
|
247
|
+
```python
|
|
248
|
+
import json
|
|
249
|
+
|
|
250
|
+
employer_id = 9079
|
|
251
|
+
company_slug = "Google"
|
|
252
|
+
|
|
253
|
+
goto_url(f"https://www.glassdoor.com/Reviews/{company_slug}-Reviews-E{employer_id}.htm")
|
|
254
|
+
wait_for_load()
|
|
255
|
+
wait(5)
|
|
256
|
+
|
|
257
|
+
dismiss_cookie_banner()
|
|
258
|
+
|
|
259
|
+
reviews = js("""
|
|
260
|
+
(function() {
|
|
261
|
+
// Review cards — confirmed selector pattern
|
|
262
|
+
var cards = document.querySelectorAll('[id^="empReview_"], [data-test="review-card"], [class*="ReviewCard"]');
|
|
263
|
+
if (!cards.length) {
|
|
264
|
+
cards = document.querySelectorAll('article[class*="review"]');
|
|
265
|
+
}
|
|
266
|
+
var out = [];
|
|
267
|
+
for (var i = 0; i < cards.length; i++) {
|
|
268
|
+
var c = cards[i];
|
|
269
|
+
|
|
270
|
+
// Overall star rating (1-5)
|
|
271
|
+
var starsEl = c.querySelector('[data-test="review-rating"], [class*="starRating"], span[class*="ratingNumber"]');
|
|
272
|
+
var stars = starsEl ? starsEl.innerText.trim() : '';
|
|
273
|
+
|
|
274
|
+
// Pros / Cons text
|
|
275
|
+
var prosEl = c.querySelector('[data-test="pros"], [class*="pros"], p[class*="pros"]');
|
|
276
|
+
var consEl = c.querySelector('[data-test="cons"], [class*="cons"], p[class*="cons"]');
|
|
277
|
+
var pros = prosEl ? prosEl.innerText.trim() : '';
|
|
278
|
+
var cons = consEl ? consEl.innerText.trim() : '';
|
|
279
|
+
|
|
280
|
+
// Review title
|
|
281
|
+
var titleEl = c.querySelector('[data-test="review-title"], h2[class*="reviewTitle"], [class*="title"] a');
|
|
282
|
+
var title = titleEl ? titleEl.innerText.trim() : '';
|
|
283
|
+
|
|
284
|
+
// Job title of reviewer
|
|
285
|
+
var jobTitleEl = c.querySelector('[data-test="reviewer-job-title"], [class*="reviewerInfo"], [class*="authorJobTitle"]');
|
|
286
|
+
var jobTitle = jobTitleEl ? jobTitleEl.innerText.trim() : '';
|
|
287
|
+
|
|
288
|
+
// Date
|
|
289
|
+
var dateEl = c.querySelector('time, [data-test="review-date"], [class*="reviewDate"]');
|
|
290
|
+
var date = dateEl ? (dateEl.getAttribute('datetime') || dateEl.innerText.trim()) : '';
|
|
291
|
+
|
|
292
|
+
if (pros || cons || title) {
|
|
293
|
+
out.push({stars, title, jobTitle, pros, cons, date});
|
|
294
|
+
}
|
|
295
|
+
}
|
|
296
|
+
return JSON.stringify(out);
|
|
297
|
+
})()
|
|
298
|
+
""")
|
|
299
|
+
|
|
300
|
+
results = json.loads(reviews)
|
|
301
|
+
for r in results:
|
|
302
|
+
    print(f"{r['stars']}★ | {r['title']} | {r['jobTitle']}")
|
|
303
|
+
    print(f" + {r['pros'][:100]}")
|
|
304
|
+
    print(f" - {r['cons'][:100]}")
|
|
305
|
+
```
|
|
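The `stars` field extracted above is raw `innerText`, so it may carry extra characters around the number. A minimal, hypothetical normalizer, assuming the rating appears as the first number in the text and lies between 0 and 5, could be:

```python
import re


def parse_rating(text):
    """Return the first number in text as a float, or None if absent or outside 0-5."""
    match = re.search(r"\d+(?:\.\d+)?", text or "")
    if not match:
        return None
    value = float(match.group())
    return value if 0.0 <= value <= 5.0 else None
```

Returning `None` rather than `0.0` keeps "no rating found" distinguishable from a genuinely low score when the results are aggregated later.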
306
|
+
|
|
307
|
+
---
|
|
308
|
+
|
|
309
|
+
## Workflow 5: Salary page — extract reported salary data
|
|
310
|
+
|
|
311
|
+
```python
|
|
312
|
+
import json
|
|
313
|
+
from urllib.parse import quote_plus
|
|
314
|
+
|
|
315
|
+
# Salary pages use slug + character-count in the URL (n = len(role_slug))
|
|
316
|
+
role = "software-engineer"
|
|
317
|
+
n = len(role) # 17 for "software-engineer"
|
|
318
|
+
|
|
319
|
+
goto_url(f"https://www.glassdoor.com/Salaries/{role}-salary-SRCH_KO0,{n}.htm")
|
|
320
|
+
wait_for_load()
|
|
321
|
+
wait(5)
|
|
322
|
+
|
|
323
|
+
# Try __NEXT_DATA__ for structured salary data
|
|
324
|
+
next_data_raw = js("document.getElementById('__NEXT_DATA__') ? document.getElementById('__NEXT_DATA__').textContent : null")
|
|
325
|
+
|
|
326
|
+
if next_data_raw:
|
|
327
|
+
    nd = json.loads(next_data_raw)
|
|
328
|
+
    # Salary data is typically under props.pageProps.salaryData or props.pageProps.payData
|
|
329
|
+
    props = nd.get("props", {}).get("pageProps", {})
|
|
330
|
+
    salary_data = props.get("salaryData") or props.get("payData")
|
|
331
|
+
    if salary_data:
|
|
332
|
+
        print(json.dumps(salary_data, indent=2))
|
|
333
|
+
|
|
334
|
+
# DOM fallback
|
|
335
|
+
salary_summary = js("""
|
|
336
|
+
(function() {
|
|
337
|
+
var medianEl = document.querySelector('[data-test="salary-estimate"], [class*="salaryEstimate"], [class*="median"]');
|
|
338
|
+
var rangeEl = document.querySelector('[data-test="salary-range"], [class*="salaryRange"]');
|
|
339
|
+
var countEl = document.querySelector('[data-test="salary-count"], [class*="salaryCount"]');
|
|
340
|
+
return JSON.stringify({
|
|
341
|
+
median: medianEl ? medianEl.innerText.trim() : '',
|
|
342
|
+
range: rangeEl ? rangeEl.innerText.trim() : '',
|
|
343
|
+
count: countEl ? countEl.innerText.trim() : '',
|
|
344
|
+
});
|
|
345
|
+
})()
|
|
346
|
+
""")
|
|
347
|
+
print(json.loads(salary_summary))
|
|
348
|
+
```
|
|
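The salary URL depends on the slug and its length agreeing, so it helps to derive both from one input. The helper below is a sketch under an assumption: the slug rule (lowercase, runs of other characters collapsed to a single hyphen) is extrapolated from the single `software-engineer` example and is not confirmed for other roles.

```python
import re


def salary_url(role: str) -> str:
    """Build a salary page URL from a free-text role name.

    The slug rule here is an assumption based on one documented example.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", role.lower()).strip("-")
    if not slug:
        raise ValueError("role must contain at least one letter or digit")
    # SRCH_KO0,{n}: n is the character count of the slug
    return f"https://www.glassdoor.com/Salaries/{slug}-salary-SRCH_KO0,{len(slug)}.htm"
```

Computing `len(slug)` from the same string that goes into the path avoids a mismatched count when the role name changes.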
349
|
+
|
|
350
|
+
---
|
|
351
|
+
|
|
352
|
+
## Handling the login modal
|
|
353
|
+
|
|
354
|
+
Glassdoor shows a sign-in modal:
|
|
355
|
+
- On Reviews/Salary pages: after viewing ~3-5 items (scroll-triggered)
|
|
356
|
+
- On job detail pages: often immediately
|
|
357
|
+
|
|
358
|
+
Dismiss it before extracting anything that requires scrolling:
|
|
359
|
+
|
|
360
|
+
```python
|
|
361
|
+
def dismiss_glassdoor_login_modal():
|
|
362
|
+
    """Close the Glassdoor sign-in modal. Safe to call if no modal is present."""
|
|
363
|
+
    closed = js("""
|
|
364
|
+
(function() {
|
|
365
|
+
var selectors = [
|
|
366
|
+
'[alt="Close"]',
|
|
367
|
+
'button[class*="modal_closeIcon"]',
|
|
368
|
+
'[data-test="close-modal"]',
|
|
369
|
+
'[aria-label="Close"]',
|
|
370
|
+
'button[data-test="CloseButton"]',
|
|
371
|
+
'[class*="CloseButton"]',
|
|
372
|
+
];
|
|
373
|
+
for (var i = 0; i < selectors.length; i++) {
|
|
374
|
+
var btn = document.querySelector(selectors[i]);
|
|
375
|
+
if (btn && btn.offsetParent !== null) {
|
|
376
|
+
btn.click();
|
|
377
|
+
return selectors[i];
|
|
378
|
+
}
|
|
379
|
+
}
|
|
380
|
+
return null;
|
|
381
|
+
})()
|
|
382
|
+
""")
|
|
383
|
+
    if closed:
|
|
384
|
+
        wait(1)
|
|
385
|
+
    return closed
|
|
386
|
+
|
|
387
|
+
def dismiss_cookie_banner():
|
|
388
|
+
    """Dismiss GDPR consent overlay. Safe to call even if no banner is present."""
|
|
389
|
+
    dismissed = js("""
|
|
390
|
+
(function() {
|
|
391
|
+
var selectors = [
|
|
392
|
+
'button[data-test="accept-cookies"]',
|
|
393
|
+
'#onetrust-accept-btn-handler',
|
|
394
|
+
'button[id*="accept-all"]',
|
|
395
|
+
'button[class*="accept"]',
|
|
396
|
+
'button[class*="consent"]',
|
|
397
|
+
];
|
|
398
|
+
for (var i = 0; i < selectors.length; i++) {
|
|
399
|
+
var btn = document.querySelector(selectors[i]);
|
|
400
|
+
if (btn && btn.offsetParent !== null) {
|
|
401
|
+
btn.click();
|
|
402
|
+
return selectors[i];
|
|
403
|
+
}
|
|
404
|
+
}
|
|
405
|
+
return null;
|
|
406
|
+
})()
|
|
407
|
+
""")
|
|
408
|
+
    if dismissed:
|
|
409
|
+
        wait(1)
|
|
410
|
+
    return dismissed
|
|
411
|
+
```
|
|
412
|
+
|
|
413
|
+
For Reviews/Salary pages: call `dismiss_glassdoor_login_modal()` immediately after the initial
|
|
414
|
+
wait, before any scrolling. Once you scroll down, the modal blocks the page and the X button
|
|
415
|
+
may itself be outside the viewport.
|
|
416
|
+
|
|
417
|
+
---
|
|
418
|
+
|
|
419
|
+
## Detecting whether you are past the CF challenge
|
|
420
|
+
|
|
421
|
+
After `goto_url()` + `wait(5)`, confirm you are on the real page:
|
|
422
|
+
|
|
423
|
+
```python
|
|
424
|
+
def glassdoor_is_cf_blocked() -> bool:
|
|
425
|
+
    """True if the CF managed challenge is still running."""
|
|
426
|
+
    title = js("document.title") or ""
|
|
427
|
+
    url = page_info()["url"]
|
|
428
|
+
    return "Security" in title or "__cf_chl_tk" in url
|
|
429
|
+
|
|
430
|
+
# Usage
|
|
431
|
+
goto_url("https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm")
|
|
432
|
+
wait_for_load()
|
|
433
|
+
wait(5)
|
|
434
|
+
|
|
435
|
+
if glassdoor_is_cf_blocked():
|
|
436
|
+
    wait(10)  # give CF extra time
|
|
437
|
+
if glassdoor_is_cf_blocked():
|
|
438
|
+
    capture_screenshot("/tmp/glassdoor_cf_block.png")
|
|
439
|
+
    raise RuntimeError("CF challenge did not resolve; check screenshot")
|
|
440
|
+
```
|
|
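The check-wait-check pattern above generalizes to a small polling loop. This is a sketch, not part of the harness: `wait_until_unblocked` is a hypothetical name, and the check and sleep functions are passed in so the loop can be exercised without a browser.

```python
import time


def wait_until_unblocked(is_blocked, sleep=time.sleep, attempts=3, delay=5.0):
    """Poll is_blocked() up to `attempts` times, sleeping `delay` seconds between checks.

    Returns True as soon as a check reports the page is not blocked,
    and False if every check still reports a block.
    """
    for attempt in range(attempts):
        if not is_blocked():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no sleep after the final failed check
    return False
```

In a session this would be called as `wait_until_unblocked(glassdoor_is_cf_blocked, sleep=wait)`, assuming the harness `wait()` takes seconds as its use throughout this document suggests; a `False` result is the point at which to capture a screenshot and stop.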
441
|
+
|
|
442
|
+
---
|
|
443
|
+
|
|
444
|
+
## Glassdoor company ID lookup
|
|
445
|
+
|
|
446
|
+
Glassdoor uses numeric employer IDs (e.g., Google = 9079, Apple = 1138, Meta = 40772).
|
|
447
|
+
To find the ID for any company:
|
|
448
|
+
|
|
449
|
+
```python
|
|
450
|
+
from urllib.parse import quote_plus
|
|
451
|
+
|
|
452
|
+
company_name = "OpenAI"
|
|
453
|
+
goto_url(f"https://www.glassdoor.com/Search/results.htm?keyword={quote_plus(company_name)}&locT=N")
|
|
454
|
+
wait_for_load()
|
|
455
|
+
wait(5)
|
|
456
|
+
|
|
457
|
+
# Extract company cards from search results
|
|
458
|
+
companies = js("""
|
|
459
|
+
(function() {
|
|
460
|
+
var cards = document.querySelectorAll('[data-test="employer-card"], [class*="EmployerCard"], [class*="employer-card"]');
|
|
461
|
+
var out = [];
|
|
462
|
+
for (var i = 0; i < cards.length; i++) {
|
|
463
|
+
var c = cards[i];
|
|
464
|
+
var link = c.querySelector('a[href*="Overview"], a[href*="Reviews"]');
|
|
465
|
+
if (!link) continue;
|
|
466
|
+
var href = link.href;
|
|
467
|
+
// Extract employer ID: EI_IE{id} or E{id}
|
|
468
|
+
var m = href.match(/E(?:I_IE)?(\d+)/);
|
|
469
|
+
var empId = m ? m[1] : '';
|
|
470
|
+
var nameEl = c.querySelector('[class*="EmployerCard_name"], h2, [class*="name"]');
|
|
471
|
+
out.push({
|
|
472
|
+
empId,
|
|
473
|
+
name: nameEl ? nameEl.innerText.trim() : '',
|
|
474
|
+
href,
|
|
475
|
+
});
|
|
476
|
+
}
|
|
477
|
+
return JSON.stringify(out);
|
|
478
|
+
})()
|
|
479
|
+
""")
|
|
480
|
+
|
|
481
|
+
import json
|
|
482
|
+
for c in json.loads(companies):
|
|
483
|
+
    print(c["empId"], c["name"], c["href"][:60])
|
|
484
|
+
```
|
|
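The same ID extraction can be done on the Python side from any company URL. The sketch below uses a slightly stricter pattern than the JavaScript one: it requires a hyphen before the `E` and a `.`, `_`, or end of string after the digits, so a slug that itself begins with `E` and a digit (the `E2Example` slug in the usage note is made up for illustration) is not mistaken for the ID. The function name is hypothetical.

```python
import re
from typing import Optional

# Matches "-E9079." and "-EI_IE9079." as they appear in the documented URL patterns.
_EMPLOYER_ID = re.compile(r"-E(?:I_IE)?(\d+)(?=[._]|$)")


def employer_id_from_url(url: str) -> Optional[int]:
    """Return the numeric employer ID embedded in a Glassdoor company URL, or None."""
    match = _EMPLOYER_ID.search(url)
    return int(match.group(1)) if match else None
```

For `https://www.glassdoor.com/Reviews/E2Example-Reviews-E12345.htm` this returns `12345`, whereas an unanchored `E(\d+)` search would stop at the `2` in the slug.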
485
|
+
|
|
486
|
+
---
|
|
487
|
+
|
|
488
|
+
## Gotchas
|
|
489
|
+
|
|
490
|
+
- **`http_get` is permanently blocked.** Cloudflare Bot Management blocks every non-browser HTTP request
|
|
491
|
+
with a JS managed challenge. No User-Agent, cookie, or header combination bypasses it. The
|
|
492
|
+
`__cf_bm` cookie returned in the 403 response is TLS-fingerprint-bound and cannot be replayed.
|
|
493
|
+
`api.glassdoor.com` is 410 Gone (shut down). Only real Chrome via CDP works.
|
|
494
|
+
|
|
495
|
+
- **`wait(5)` minimum after `wait_for_load()`.** CF's managed challenge runs for 2-4 seconds after
|
|
496
|
+
`readyState = complete`. Extracting too early returns the challenge page HTML, not Glassdoor
|
|
497
|
+
content. If you get empty results or the title is "Security | Glassdoor", wait longer.
|
|
498
|
+
|
|
499
|
+
- **Login modal triggers on scroll, not on load.** Extract all visible content immediately on page
|
|
500
|
+
load before any scrolling. Call `dismiss_glassdoor_login_modal()` right after the initial wait —
|
|
501
|
+
before issuing any `scroll()` calls.
|
|
502
|
+
|
|
503
|
+
- **Glassdoor shows ~10 cards without login.** Reviews and salary pages are severely limited
|
|
504
|
+
without an account. Job search cards are more accessible (~10-15 per page). If you need 30+
|
|
505
|
+
reviews, a logged-in session is required.
|
|
506
|
+
|
|
507
|
+
- **CSS class names use Next.js hashed suffixes.** Selectors like `[class*="JobCard_jobTitle"]`
|
|
508
|
+
match despite the hash suffix (e.g., `JobCard_jobTitle__abc12`). Never hardcode the full hashed
|
|
509
|
+
class name — it changes with deployments. Always use `[class*="prefix"]`.
|
|
510
|
+
|
|
511
|
+
- **`__NEXT_DATA__` is the fast path.** When accessible, Glassdoor's Next.js pages embed all page
|
|
512
|
+
data in `<script id="__NEXT_DATA__" type="application/json">`. Parse it before falling back to
|
|
513
|
+
DOM queries. Data path varies by page type: look under `props.pageProps.employer`,
|
|
514
|
+
`props.pageProps.salaryData`, `props.pageProps.jobListings`, etc.
|
|
515
|
+
|
|
516
|
+
- **Employer IDs are stable; slugs mostly are.** The employer ID (e.g., `9079` for Google) never
|
|
517
|
+
changes. Slugs occasionally change when a company rebrands — always verify by following the
|
|
518
|
+
canonical redirect from a search result.
|
|
519
|
+
|
|
520
|
+
- **Rate limiting.** Glassdoor rate-limits by IP after ~5 company-page loads per minute.
  Use `wait(5)` between consecutive company page navigations. Salary and reviews pages are
  heavier, so use `wait(8)` between those.

- **Salary URL requires a character-count parameter.** The `SRCH_KO0,{n}` fragment encodes
  `0` (start of role name) and `n` (end, i.e., `len(role_slug)`). For `"software-engineer"` (17
  chars): `SRCH_KO0,17`. A wrong count returns a 404.

- **`locKeyword` vs `locId` for location filter.** `locKeyword=San+Francisco` works without
  knowing Glassdoor's internal city ID. `locT=C` means city-type location. For metro areas,
  also try `locT=M`. Omit `locId` unless you have the exact numeric ID from a Glassdoor URL.

- **PerimeterX is also active as a secondary layer.** After passing CF, Glassdoor runs behavioral
  fingerprinting. Rapid automated scrolling, mouse movement, or navigation patterns may trigger a
  secondary block. Mitigate with `wait(2)` between actions and avoid scripted mouse movement.

- **Review and salary data require login on some pages.** Anonymous sessions get a subset of
  data. If a field consistently returns empty, the page may require authentication before
  surfacing that data in the DOM or `__NEXT_DATA__`.

- **`goto_url()` vs `new_tab()` for first navigation.** Use `new_tab()` for the very first Glassdoor
  page in a session. If the harness is attached to a non-Glassdoor tab, `goto_url()` can silently
  fail to pass the CF challenge because the existing tab may not have a clean origin context.
  After the first successful load, `goto_url()` works fine for subsequent Glassdoor navigations.
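
The challenge-page and `__NEXT_DATA__` gotchas above can be checked in plain Python once you have the page title and HTML as strings. The sketch below is illustrative only: `is_challenge_page`, `extract_next_data`, and `pick_page_props` are hypothetical helper names, not part of the harness, and the key list covers only the three `pageProps` paths named above. How you obtain the title and HTML from the browser is left to the existing helpers.

```python
import json
import re

# Title of the Cloudflare interstitial, as described in the wait(5) gotcha.
CHALLENGE_TITLE = "Security | Glassdoor"

# Keys under props.pageProps named in the notes above. Real payloads may use
# other keys, so treat this tuple as a starting point (assumption).
PAGE_PROPS_KEYS = ("employer", "salaryData", "jobListings")

_NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def is_challenge_page(title):
    """True when the title is the challenge interstitial, i.e. wait longer."""
    return title.strip() == CHALLENGE_TITLE


def extract_next_data(html):
    """Return the parsed __NEXT_DATA__ JSON, or None if absent or malformed."""
    match = _NEXT_DATA_RE.search(html)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


def pick_page_props(next_data):
    """Return (key, value) for the first non-empty known pageProps key, else None."""
    props = (next_data or {}).get("props") or {}
    page_props = props.get("pageProps") or {}
    for key in PAGE_PROPS_KEYS:
        if page_props.get(key):
            return key, page_props[key]
    return None
```

If `extract_next_data` returns `None` or `pick_page_props` finds nothing, fall back to the `[class*="prefix"]` DOM queries, and keep in mind that an empty result on an anonymous session may mean the data is behind login rather than missing.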
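
The salary-URL, selector, and rate-limit gotchas above reduce to small string and lookup rules, sketched here in Python. All three function names and the page-kind labels are hypothetical and chosen for this example. The delays are the `wait(5)` and `wait(8)` values from the rate-limiting note, and falling back to the longer delay for unknown page kinds is this sketch's own conservative choice.

```python
def salary_ko_fragment(role_slug):
    """Build the SRCH_KO0,{n} fragment, where n is len(role_slug)."""
    if not role_slug:
        raise ValueError("role_slug must be non-empty")
    return f"SRCH_KO0,{len(role_slug)}"


def class_prefix_selector(prefix):
    """Build a [class*="prefix"] selector that tolerates hashed suffixes."""
    return f'[class*="{prefix}"]'


# Seconds to pause between consecutive navigations, per the rate-limit note.
# The dictionary keys are labels invented for this sketch.
NAV_DELAY_SECONDS = {"company": 5, "salary": 8, "reviews": 8}


def nav_delay(page_kind):
    """Return the pause for a page kind; unknown kinds get the longer delay."""
    return NAV_DELAY_SECONDS.get(page_kind, 8)


print(salary_ko_fragment("software-engineer"))    # SRCH_KO0,17
print(class_prefix_selector("JobCard_jobTitle"))  # [class*="JobCard_jobTitle"]
print(nav_delay("company"), nav_delay("salary"))  # 5 8
```

Computing the count from the slug itself avoids the 404 that a hand-typed length can cause, and the value returned by `nav_delay` can be passed straight to `wait()` between page loads.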