@pencil-agent/nano-pencil 2.0.0-beta.8 → 2.0.0
- package/README.md +267 -267
- package/dist/build-meta.json +3 -3
- package/dist/core/export-html/AGENT.md +11 -11
- package/dist/core/export-html/template.css +971 -971
- package/dist/core/export-html/template.html +54 -54
- package/dist/core/extensions-host/index.d.ts +1 -1
- package/dist/core/extensions-host/loader.js +1 -1
- package/dist/core/extensions-host/runner.d.ts +1 -0
- package/dist/core/extensions-host/runner.js +2 -2
- package/dist/core/extensions-host/types.d.ts +17 -22
- package/dist/core/lib/ai/src/types.d.ts +12 -2
- package/dist/core/persona/persona-manager.js +5 -2
- package/dist/core/runtime/agent-session.js +3 -3
- package/dist/core/runtime/extension-core-bindings.d.ts +1 -0
- package/dist/core/runtime/extension-core-bindings.js +2 -2
- package/dist/extensions/builtin/AGENT.md +115 -115
- package/dist/extensions/builtin/browser/AGENT.md +17 -17
- package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
- package/dist/extensions/builtin/browser/browser.md +73 -73
- package/dist/extensions/builtin/browser/install.md +142 -142
- package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
- package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
- package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
- package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
- package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
- package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
- package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
- package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
- package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
- package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
- package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
- package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
- package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
- package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
- package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
- package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
- package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
- package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
- package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
- package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
- package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
- package/dist/extensions/builtin/goal/README.md +67 -67
- package/dist/extensions/builtin/goal/goal-controller.d.ts +39 -10
- package/dist/extensions/builtin/goal/goal-controller.js +1 -1
- package/dist/extensions/builtin/goal/goal-format.js +1 -1
- package/dist/extensions/builtin/goal/goal-prompts.d.ts +2 -0
- package/dist/extensions/builtin/goal/goal-prompts.js +5 -4
- package/dist/extensions/builtin/goal/goal-store.js +1 -1
- package/dist/extensions/builtin/goal/index.d.ts +1 -1
- package/dist/extensions/builtin/goal/index.js +10 -7
- package/dist/extensions/builtin/grub/README.md +112 -112
- package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
- package/dist/extensions/builtin/link-world/index.js +6 -6
- package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
- package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
- package/dist/extensions/builtin/link-world/linkworld.md +313 -313
- package/dist/extensions/builtin/link-world/{network-routing.md → network-routing/network-routing.md} +67 -67
- package/dist/extensions/builtin/loop/README.md +92 -92
- package/dist/extensions/builtin/mcp/figma-design.md +68 -68
- package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
- package/dist/extensions/builtin/plan/index.js +1 -1
- package/dist/extensions/builtin/recap/AGENT.md +15 -15
- package/dist/extensions/builtin/sal/README.md +72 -72
- package/dist/extensions/builtin/security-audit/README.md +289 -289
- package/dist/extensions/builtin/task/task-store.d.ts +4 -0
- package/dist/extensions/builtin/task/task-store.js +1 -1
- package/dist/extensions/builtin/team/AGENT.md +112 -112
- package/dist/extensions/builtin/team/TESTING.md +299 -299
- package/dist/extensions/builtin/token-save/README.md +56 -56
- package/dist/extensions/optional/AGENT.md +10 -10
- package/dist/index.d.ts +5 -30
- package/dist/index.js +1 -1
- package/dist/models.d.ts +7 -0
- package/dist/models.js +1 -0
- package/dist/modes/interactive/components/footer.js +1 -1
- package/dist/modes/interactive/components/task-status-panel.d.ts +36 -0
- package/dist/modes/interactive/components/task-status-panel.js +1 -0
- package/dist/modes/interactive/controllers/stream-render-controller.d.ts +7 -0
- package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
- package/dist/modes/interactive/interactive-mode.js +40 -40
- package/dist/modes/interactive/state/interactive-state.d.ts +2 -0
- package/dist/modes/interactive/state/interactive-state.js +1 -1
- package/dist/modes/interactive/theme/dark.json +85 -85
- package/dist/modes/interactive/theme/light.json +84 -84
- package/dist/modes/interactive/theme/theme-schema.json +335 -335
- package/dist/modes/interactive/theme/warm.json +81 -81
- package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
- package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
- package/dist/node_modules/@pencil-agent/ai/dist/providers/anthropic.js +2 -2
- package/dist/node_modules/@pencil-agent/ai/dist/providers/openai-completions.js +5 -5
- package/dist/node_modules/@pencil-agent/ai/dist/providers/openai-responses.js +1 -1
- package/dist/node_modules/@pencil-agent/ai/dist/stream.js +1 -1
- package/dist/packages/protocol/src/commands.d.ts +33 -0
- package/dist/packages/protocol/src/flags.d.ts +20 -0
- package/dist/packages/protocol/src/hooks.d.ts +17 -0
- package/dist/packages/protocol/src/hooks.js +0 -0
- package/dist/packages/{extension-sdk → protocol}/src/index.d.ts +7 -4
- package/dist/packages/protocol/src/index.js +1 -0
- package/dist/packages/{extension-sdk → protocol}/src/lifecycle.d.ts +15 -27
- package/dist/packages/protocol/src/lifecycle.js +0 -0
- package/dist/packages/{extension-sdk → protocol}/src/tools.d.ts +1 -1
- package/dist/packages/protocol/src/tools.js +0 -0
- package/dist/public-config.d.ts +12 -0
- package/dist/public-config.js +1 -0
- package/dist/runtime.d.ts +9 -0
- package/dist/runtime.js +1 -0
- package/dist/session-compaction.d.ts +7 -0
- package/dist/session-compaction.js +1 -0
- package/dist/session.d.ts +7 -0
- package/dist/session.js +1 -0
- package/dist/skills.d.ts +7 -0
- package/dist/skills.js +1 -0
- package/dist/tools.d.ts +7 -0
- package/dist/tools.js +1 -0
- package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
- package/docs/SDK-TESTING.md +364 -0
- package/docs/codex-goal-command-impl.md +1055 -1055
- package/docs/codex-goal-vs-grub.md +500 -500
- package/docs/custom-provider.md +27 -27
- package/docs/extensions.md +27 -27
- package/docs/keybindings.md +27 -27
- package/docs/loop 重构完成总结.md +250 -250
- package/docs/loop 重构完成报告.md +122 -122
- package/docs/loop 重构方案.md +1222 -1222
- package/docs/loop 重构方案实现报告.md +158 -158
- package/docs/loop 重构方案对比分析.md +128 -128
- package/docs/loop 重构计划.md +320 -320
- package/docs/loop-usage-examples.md +214 -214
- package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
- package/docs/models.md +27 -27
- package/docs/packages.md +27 -27
- package/docs/pi-design-philosophy.md +457 -457
- package/docs/planmode.md +1987 -1987
- package/docs/prompt-templates.md +27 -27
- package/docs/providers.md +27 -27
- package/docs/sdk.md +27 -27
- package/docs/skills.md +27 -27
- package/docs/startup-performance-optimization.md +301 -0
- package/docs/themes.md +27 -27
- package/docs/tui.md +27 -27
- package/docs/认知地图.md +47 -0
- package/package.json +190 -162
- package/dist/packages/extension-sdk/src/index.js +0 -1
- package/docs/cc-agent-design.md +0 -1297
- package/docs/cc-tui-design.md +0 -1333
- package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
- package/dist/packages/{extension-sdk/src/lifecycle.js → protocol/src/commands.js} +0 -0
- package/dist/packages/{extension-sdk/src/tools.js → protocol/src/flags.js} +0 -0
package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md
CHANGED
@@ -1,1021 +1,1021 @@
-# Job Boards — Indeed, Glassdoor, Stepstone
-
-Covers: `indeed.com`, `glassdoor.com`, `stepstone.de`
-
----
-
-## Do this first: construct search URLs directly
-
-Never type into the search box on the homepage — bot detection triggers immediately. Build search URLs directly and navigate straight to results.
-
-```python
-from urllib.parse import quote_plus
-
-# Indeed — English (US)
-query, location = "Python developer", "San Francisco"
-goto_url(f"https://www.indeed.com/jobs?q={quote_plus(query)}&l={quote_plus(location)}")
-wait_for_load()
-wait(2)
-
-# Indeed — last 24 hours
-goto_url(f"https://www.indeed.com/jobs?q={quote_plus(query)}&l={quote_plus(location)}&fromage=1")
-wait_for_load()
-wait(2)
-
-# Glassdoor — public search (no login required for result cards)
-goto_url(f"https://www.glassdoor.com/Job/jobs.htm?sc.keyword={quote_plus(query)}")
-wait_for_load()
-wait(2)
-
-# Stepstone (Germany)
-keyword, city = "Data Scientist", "Berlin"
-goto_url(f"https://www.stepstone.de/jobs/{quote_plus(keyword)}/in-{quote_plus(city)}.html")
-wait_for_load()
-wait(2)
-```
-
----
-
-## URL patterns
-
-### Indeed
-
-| Goal | URL pattern |
-|---|---|
-| Keyword + location | `/jobs?q={title}&l={location}` |
-| Last 24 hours | `/jobs?q={title}&l={location}&fromage=1` |
-| Last 3 days | `/jobs?q={title}&l={location}&fromage=3` |
-| Last week | `/jobs?q={title}&l={location}&fromage=7` |
-| Remote only | `/jobs?q={title}&remotejob=032b3046-06a3-4876-8dfd-474eb5e7ed11` |
-| Full-time only | `/jobs?q={title}&l={location}&jt=fulltime` |
-| Part-time | `/jobs?q={title}&l={location}&jt=parttime` |
-| With salary | `/jobs?q={title}&l={location}&rbl=%24{min}%2B` |
-| Page 2 (results 11-20) | append `&start=10` |
-| Page 3 (results 21-30) | append `&start=20` |
-| Job detail page | `https://www.indeed.com/viewjob?jk={job_key}` |
-
-**Indeed country variants**: `.co.uk`, `.de`, `.fr`, `.com.au` — same URL structure, different base domain.
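The Indeed rows above look like ordinary query-string parameters, so they can be assembled by one small function. The following is a minimal sketch under that assumption: `build_indeed_url` and its argument names are hypothetical and are not part of the browser harness; only the parameter names (`q`, `l`, `fromage`, `jt`, `start`) come from the table.

```python
from urllib.parse import urlencode


def build_indeed_url(query, location="", *, domain="www.indeed.com",
                     fromage=None, job_type=None, start=0):
    """Hypothetical helper: build an Indeed search URL from the table's parameters."""
    params = {"q": query}
    if location:
        params["l"] = location
    if fromage is not None:
        params["fromage"] = fromage   # posting age in days (1, 3 or 7 in the table)
    if job_type:
        params["jt"] = job_type       # "fulltime" or "parttime"
    if start:
        params["start"] = start       # 10 for page 2, 20 for page 3
    # urlencode applies quote_plus to every value, so spaces become "+"
    return f"https://{domain}/jobs?{urlencode(params)}"


us_url = build_indeed_url("Python developer", "San Francisco", fromage=1)
uk_url = build_indeed_url("Python developer", "London", domain="www.indeed.co.uk", start=10)
```

Passing the domain as an argument keeps the country variants on the same code path, which matches the note that only the base domain differs.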
-
-### Glassdoor
-
-| Goal | URL pattern |
-|---|---|
-| Keyword search | `/Job/jobs.htm?sc.keyword={title}` |
-| Keyword + city name | `/Job/jobs.htm?sc.keyword={title}&locT=C&locKeyword={city}` |
-| Remote filter | `/Job/jobs.htm?sc.keyword={title}&remoteWorkType=1` |
-| Next page | append `&p=2`, `&p=3` |
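The Glassdoor rows can be combined the same way. This is a sketch only: `build_glassdoor_url` is a hypothetical name, and combining the city, remote, and page parameters in a single URL is an assumption, since the table lists them one at a time.

```python
from urllib.parse import urlencode


def build_glassdoor_url(keyword, city="", *, remote=False, page=1):
    """Hypothetical helper: build a Glassdoor job-search URL from the table's parameters."""
    params = {"sc.keyword": keyword}
    if city:
        params["locT"] = "C"            # "C" marks a city-name lookup in the table
        params["locKeyword"] = city
    if remote:
        params["remoteWorkType"] = 1
    if page > 1:
        params["p"] = page              # page 1 needs no parameter
    return "https://www.glassdoor.com/Job/jobs.htm?" + urlencode(params)


first_page = build_glassdoor_url("product manager")
second_page = build_glassdoor_url("product manager", "Berlin", page=2)
```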
-
-### Stepstone (Germany)
-
-| Goal | URL pattern |
-|---|---|
-| Keyword in city | `/jobs/{keyword}/in-{city}.html` |
-| Page 2 | `/jobs/{keyword}/in-{city}/page-2.html` |
-| Page 3 | `/jobs/{keyword}/in-{city}/page-3.html` |
-| Full-time | `/jobs/{keyword}/in-{city}.html?of=1` |
-
-For Stepstone, keyword and city go directly in the path — encode spaces as `-`:
-```python
-kw_path = keyword.replace(" ", "-")
-city_path = city.replace(" ", "-")
-goto_url(f"https://www.stepstone.de/jobs/{kw_path}/in-{city_path}.html")
-```
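The path rules above, including the `page-N` form from the table, can be wrapped in one function. This is a minimal sketch: `build_stepstone_url` is a hypothetical name, and appending `?of=1` to a paged URL is an assumption, because the table only shows the full-time filter on the first page.

```python
def build_stepstone_url(keyword, city, page=1, full_time=False):
    """Hypothetical helper: build a Stepstone results URL with spaces encoded as '-'."""
    kw_path = keyword.strip().replace(" ", "-")
    city_path = city.strip().replace(" ", "-")
    base = f"https://www.stepstone.de/jobs/{kw_path}/in-{city_path}"
    # Page 1 is "<base>.html"; later pages move into a "/page-N.html" segment
    url = f"{base}.html" if page <= 1 else f"{base}/page-{page}.html"
    if full_time:
        url += "?of=1"  # assumption: the filter combines with paging
    return url


page_one = build_stepstone_url("Data Scientist", "Berlin")
page_two = build_stepstone_url("Data Scientist", "Berlin", page=2)
```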
-
----
-
-## Cookie / consent banner dismissal
-
-Indeed (EU/UK) and Glassdoor show GDPR consent overlays. Dismiss before extraction.
-
-```python
-def dismiss_cookie_banner():
-    """Try common consent button patterns. Safe to call even if no banner is present."""
-    dismissed = js("""
-    (function() {
-      // Indeed: "Accept all cookies" button
-      var selectors = [
-        'button[id*="onetrust-accept"]',
-        'button[id*="accept-all"]',
-        '#onetrust-accept-btn-handler',
-        'button[data-testid="cookie-consent-accept"]',
-        // Glassdoor: consent modal
-        'button[data-test="accept-cookies"]',
-        // Generic patterns
-        'button[class*="accept"]',
-        'button[class*="consent"]',
-      ];
-      for (var i = 0; i < selectors.length; i++) {
-        var btn = document.querySelector(selectors[i]);
-        if (btn && btn.offsetParent !== null) {
-          btn.click();
-          return selectors[i];
-        }
-      }
-      return null;
-    })()
-    """)
-    if dismissed:
-        wait(1)
-    return dismissed
-```
-
-Call immediately after `wait_for_load()` on `.co.uk`, `.de`, or `glassdoor.com`:
-
-```python
-goto_url("https://www.indeed.co.uk/jobs?q=Python+developer&l=London")
-wait_for_load()
-wait(2)
-dismiss_cookie_banner()
-wait(1)
-```
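The host rule above can be expressed as a small predicate when a script visits several sites. This is a sketch only: `needs_consent_dismissal` and `CONSENT_HOST_SUFFIXES` are hypothetical names, and the suffix list simply restates the three hosts mentioned above.

```python
from urllib.parse import urlparse

# Suffixes taken from the guidance above; extend as needed (hypothetical constant)
CONSENT_HOST_SUFFIXES = (".co.uk", ".de", "glassdoor.com")


def needs_consent_dismissal(url):
    """Return True when the URL's host matches one of the consent-banner suffixes."""
    host = (urlparse(url).hostname or "").lower()
    return host.endswith(CONSENT_HOST_SUFFIXES)


uk_needs_it = needs_consent_dismissal("https://www.indeed.co.uk/jobs?q=Python+developer&l=London")
us_needs_it = needs_consent_dismissal("https://www.indeed.com/jobs?q=Python+developer")
```

Because `dismiss_cookie_banner()` is described as safe to call when no banner is present, calling it unconditionally is an equally valid choice; the predicate only saves the extra wait.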
-
----
-
-## Workflow 1: Indeed — search result card extraction
-
-Each result card on Indeed carries a `data-jk` attribute (the job key). Use it to construct direct URLs.
-
-```python
-import json
-from urllib.parse import quote_plus
-
-query, location = "machine learning engineer", "New York"
-goto_url(f"https://www.indeed.com/jobs?q={quote_plus(query)}&l={quote_plus(location)}")
-wait_for_load()
-wait(2)
-dismiss_cookie_banner()
-
-jobs = js("""
-(function() {
-  // Cards live in <div data-jk="..."> or <li> with data-jk attribute
-  var cards = document.querySelectorAll('[data-jk]');
-  var out = [];
-  for (var i = 0; i < cards.length; i++) {
-    var c = cards[i];
-    var jk = c.getAttribute('data-jk') || '';
-    if (!jk) continue;
-
-    // Title
-    var titleEl = c.querySelector('h2.jobTitle span[title], h2.jobTitle span:not(.visually-hidden), [data-testid="job-title"]');
-    var title = titleEl ? titleEl.innerText.trim() : '';
-
-    // Company name
-    var compEl = c.querySelector('[data-testid="company-name"], .companyName, span[data-testid="company-name"]');
-    var company = compEl ? compEl.innerText.trim() : '';
-
-    // Location
-    var locEl = c.querySelector('[data-testid="text-location"], .companyLocation');
-    var location = locEl ? locEl.innerText.trim() : '';
-
-    // Salary — may not always be present in the card
-    var salEl = c.querySelector('[data-testid="attribute_snippet_testid"], .salary-snippet-container, .metadata.salary-snippet');
-    var salary = salEl ? salEl.innerText.trim() : '';
-
-    // Posting date / age
-    var dateEl = c.querySelector('[data-testid="myJobsStateDate"], span.date, .result-link-bar-container .date');
-    var posted = dateEl ? dateEl.innerText.trim() : '';
-
-    // Direct URL via job key
-    var url = 'https://www.indeed.com/viewjob?jk=' + jk;
-
-    if (title) {
-      out.push({jk, title, company, location, salary, posted, url});
-    }
-  }
-  return JSON.stringify(out);
-})()
-""")
-
-results = json.loads(jobs)
-for r in results:
-    print(r)
-# Typically returns 10–15 cards per page
-```
-
----
-
-## Workflow 2: Indeed — pagination (multi-page extraction)
-
-Indeed paginates using `&start=N` where N increments by 10 per page.
-
-```python
-import json
-from urllib.parse import quote_plus
-
-query, location = "data scientist", "remote"
-base_url = f"https://www.indeed.com/jobs?q={quote_plus(query)}&l={quote_plus(location)}"
-
-all_jobs = []
-
-for page in range(3): # 3 pages = up to ~30 results
-    start = page * 10
-    url = base_url if start == 0 else f"{base_url}&start={start}"
-    goto_url(url)
-    wait_for_load()
-    wait(2) # mandatory — bot detection is aggressive on rapid loads
-
-    if page == 0:
-        dismiss_cookie_banner()
-
-    batch_json = js("""
-    (function() {
-      var cards = document.querySelectorAll('[data-jk]');
-      var out = [];
-      for (var i = 0; i < cards.length; i++) {
-        var c = cards[i];
-        var jk = c.getAttribute('data-jk') || '';
-        if (!jk) continue;
-        var titleEl = c.querySelector('h2.jobTitle span[title], [data-testid="job-title"]');
-        var compEl = c.querySelector('[data-testid="company-name"], .companyName');
-        var locEl = c.querySelector('[data-testid="text-location"], .companyLocation');
-        var salEl = c.querySelector('[data-testid="attribute_snippet_testid"], .salary-snippet-container');
-        var dateEl = c.querySelector('[data-testid="myJobsStateDate"], span.date');
-        out.push({
-          jk,
-          title: titleEl ? titleEl.innerText.trim() : '',
-          company: compEl ? compEl.innerText.trim() : '',
-          location: locEl ? locEl.innerText.trim() : '',
-          salary: salEl ? salEl.innerText.trim() : '',
-          posted: dateEl ? dateEl.innerText.trim() : '',
-          url: 'https://www.indeed.com/viewjob?jk=' + jk,
-        });
-      }
-      return JSON.stringify(out.filter(j => j.title));
-    })()
-    """)
-
-    batch = json.loads(batch_json)
-    if not batch:
-        break # no results on this page — stop
-    all_jobs.extend(batch)
-
-print(f"Collected {len(all_jobs)} jobs across {page+1} pages")
-```
-
-**For `fromage` (date filter) + pagination**: keep the `fromage` param in the base URL:
-```python
-base_url = f"https://www.indeed.com/jobs?q={quote_plus(query)}&l={quote_plus(location)}&fromage=1"
-```
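The `&start=N` arithmetic can be separated from the browser calls so it is easy to check on its own. The sketch below is illustrative: `page_urls` and `dedupe_by_jk` are hypothetical helpers, and the de-duplication step assumes (without confirmation from the source) that a card might show up on more than one page.

```python
def page_urls(base_url, pages):
    """Yield one URL per results page; page 0 is base_url, later pages add &start=N."""
    for page in range(pages):
        start = page * 10  # Indeed's offset grows by 10 per page
        yield base_url if start == 0 else f"{base_url}&start={start}"


def dedupe_by_jk(jobs):
    """Keep the first record seen for each job key, preserving order."""
    seen = set()
    unique = []
    for job in jobs:
        if job["jk"] in seen:
            continue
        seen.add(job["jk"])
        unique.append(job)
    return unique


base = "https://www.indeed.com/jobs?q=data+scientist&l=remote&fromage=1"
urls = list(page_urls(base, 3))
unique_jobs = dedupe_by_jk([{"jk": "a"}, {"jk": "b"}, {"jk": "a"}])
```

Any filter such as `fromage` stays in `base_url`, so every generated page URL carries it.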
-
----
-
-## Workflow 3: Indeed — job detail page extraction
-
-Fetch the full job description from the detail page. The `viewjob?jk=` URL is canonical and stable.
-
-```python
-import json, re
-
-def get_indeed_job_detail(jk: str) -> dict:
-    """Fetch full job details from an Indeed job key."""
-    goto_url(f"https://www.indeed.com/viewjob?jk={jk}")
-    wait_for_load()
-    wait(2)
-
-    detail = js("""
-    (function() {
-      // Title
-      var titleEl = document.querySelector('[data-testid="jobsearch-JobInfoHeader-title"], h1.jobsearch-JobInfoHeader-title');
-      var title = titleEl ? titleEl.innerText.trim() : '';
-
-      // Company
-      var compEl = document.querySelector('[data-testid="inlineHeader-companyName"] a, [data-company-name="true"]');
-      var company = compEl ? compEl.innerText.trim() : '';
-
-      // Location
-      var locEl = document.querySelector('[data-testid="inlineHeader-companyLocation"], [data-testid="job-location"]');
-      var location = locEl ? locEl.innerText.trim() : '';
-
-      // Salary — shown when available in header
-      var salEl = document.querySelector('[data-testid="jobsearch-OtherJobDetailsContainer"] [aria-label*="alary"], #salaryInfoAndJobType span');
-      var salary = salEl ? salEl.innerText.trim() : '';
-
-      // Full job description text
-      var descEl = document.getElementById('jobDescriptionText');
-      var description = descEl ? descEl.innerText.trim() : '';
-
-      // Job type (Full-time, Part-time, Contract, etc.)
-      var typeEl = document.querySelector('[data-testid="attribute_snippet_testid"]');
-      var jobType = typeEl ? typeEl.innerText.trim() : '';
-
-      // "Apply on company site" link — external application URL
-      var externalBtn = document.querySelector('[data-jk][href*="indeed.com/applystart"], a[href*="indeed.com/applystart"]');
-      var externalUrl = externalBtn ? externalBtn.href : '';
-
-      return JSON.stringify({title, company, location, salary, jobType, description, externalUrl});
-    })()
-    """)
-    return json.loads(detail)
-
-# Example
-detail = get_indeed_job_detail("abc123def456xyz")
-print(detail["title"], "—", detail["salary"])
-print(detail["description"][:500]) # first 500 chars
-```
|
|
315
|
-
|
|
316
|
-
---
|
|
317
|
-
|
|
318
|
-
## Workflow 4: Glassdoor — search result extraction
|
|
319
|
-
|
|
320
|
-
Glassdoor shows a login modal after a few scrolls. Extract cards from the first visible load before triggering that wall.
|
|
321
|
-
|
|
322
|
-
```python
|
|
323
|
-
import json
|
|
324
|
-
from urllib.parse import quote_plus
|
|
325
|
-
|
|
326
|
-
query = "product manager"
|
|
327
|
-
goto_url(f"https://www.glassdoor.com/Job/jobs.htm?sc.keyword={quote_plus(query)}")
|
|
328
|
-
wait_for_load()
|
|
329
|
-
wait(3) # Glassdoor JS rendering takes longer
|
|
330
|
-
|
|
331
|
-
# Dismiss cookie banner if present
|
|
332
|
-
dismiss_cookie_banner()
|
|
333
|
-
|
|
334
|
-
# Extract cards before any scroll (avoid triggering login modal)
|
|
335
|
-
jobs = js("""
|
|
336
|
-
(function() {
|
|
337
|
-
// Glassdoor job cards: li[data-jobid] or li with a JobsList_jobListItem class
|
|
338
|
-
var cards = document.querySelectorAll('li[data-jobid], li[class*="JobsList_jobListItem"]');
|
|
339
|
-
if (!cards.length) {
|
|
340
|
-
// Fallback: try generic article cards
|
|
341
|
-
cards = document.querySelectorAll('[data-test="jobListing"], [id^="job-listing-"]');
|
|
342
|
-
}
|
|
343
|
-
var out = [];
|
|
344
|
-
for (var i = 0; i < cards.length; i++) {
|
|
345
|
-
var c = cards[i];
|
|
346
|
-
|
|
347
|
-
// Job ID (used for canonical URL)
|
|
348
|
-
var jobId = c.getAttribute('data-jobid') || c.getAttribute('data-id') || '';
|
|
349
|
-
|
|
350
|
-
// Title
|
|
351
|
-
var titleEl = c.querySelector('[data-test="job-title"], a[class*="JobCard_jobTitle"], .job-title');
|
|
352
|
-
var title = titleEl ? titleEl.innerText.trim() : '';
|
|
353
|
-
|
|
354
|
-
// Company
|
|
355
|
-
var compEl = c.querySelector('[data-test="employer-name"], [class*="JobCard_employer"], .employer-name');
|
|
356
|
-
var company = compEl ? compEl.innerText.trim() : '';
|
|
357
|
-
|
|
358
|
-
// Location
|
|
359
|
-
var locEl = c.querySelector('[data-test="emp-location"], [class*="JobCard_location"], .location');
|
|
360
|
-
var location = locEl ? locEl.innerText.trim() : '';
|
|
361
|
-
|
|
362
|
-
// Salary estimate (not always shown in card)
|
|
363
|
-
var salEl = c.querySelector('[data-test="detailSalary"], [class*="salary"], .salaryEstimate');
|
|
364
|
-
var salary = salEl ? salEl.innerText.trim() : '';
|
|
365
|
-
|
|
366
|
-
// Company rating
|
|
367
|
-
var ratingEl = c.querySelector('[data-test="rating"], [class*="ratingNumber"], .rating');
|
|
368
|
-
var rating = ratingEl ? ratingEl.innerText.trim() : '';
|
|
369
|
-
|
|
370
|
-
// Canonical URL
|
|
371
|
-
var linkEl = c.querySelector('a[href*="/job-listing/"], a[href*="glassdoor.com/job"]');
|
|
372
|
-
var url = linkEl ? linkEl.href : (jobId ? 'https://www.glassdoor.com/job-listing/glassdoor-jl' + jobId + '.htm' : '');
|
|
373
|
-
|
|
374
|
-
if (title) out.push({jobId, title, company, location, salary, rating, url});
|
|
375
|
-
}
|
|
376
|
-
return JSON.stringify(out);
|
|
377
|
-
})()
|
|
378
|
-
""")
|
|
379
|
-
|
|
380
|
-
results = json.loads(jobs)
|
|
381
|
-
for r in results:
|
|
382
|
-
print(r)
|
|
383
|
-
```
|
|
384
|
-
|
|
385
|
-
**If `results` is an empty list**, Glassdoor has probably changed its DOM structure, or a login modal or block page is covering the results. Take a screenshot and inspect:
|
|
386
|
-
|
|
387
|
-
```python
|
|
388
|
-
capture_screenshot()
|
|
389
|
-
# Look for the actual card selector, then update the querySelectorAll above
|
|
390
|
-
```
|
|
391
|
-
|
|
392
|
-
---
|
|
393
|
-
|
|
394
|
-
## Workflow 5: Glassdoor — handling the login wall
|
|
395
|
-
|
|
396
|
-
Glassdoor increasingly shows a login modal after viewing a few listings. Detect and dismiss it.
|
|
397
|
-
|
|
398
|
-
```python
|
|
399
|
-
def dismiss_glassdoor_login_modal():
|
|
400
|
-
"""Close the Glassdoor sign-in / register modal if it appears."""
|
|
401
|
-
closed = js("""
|
|
402
|
-
(function() {
|
|
403
|
-
// Close button on the modal
|
|
404
|
-
var closeBtn = document.querySelector(
|
|
405
|
-
'[alt="Close"], button[class*="modal_closeIcon"], [data-test="close-modal"]'
|
|
406
|
-
);
|
|
407
|
-
if (closeBtn && closeBtn.offsetParent !== null) {
|
|
408
|
-
closeBtn.click();
|
|
409
|
-
return 'closed';
|
|
410
|
-
}
|
|
411
|
-
// Sometimes the modal has an X with aria-label
|
|
412
|
-
var ariaClose = document.querySelector('[aria-label="Close"]');
|
|
413
|
-
if (ariaClose && ariaClose.offsetParent !== null) {
|
|
414
|
-
ariaClose.click();
|
|
415
|
-
return 'aria-closed';
|
|
416
|
-
}
|
|
417
|
-
return null;
|
|
418
|
-
})()
|
|
419
|
-
""")
|
|
420
|
-
if closed:
|
|
421
|
-
wait(1)
|
|
422
|
-
return closed
|
|
423
|
-
|
|
424
|
-
# Strategy: extract as much as possible before the modal appears
|
|
425
|
-
# If the modal blocks results, dismiss it and try again
|
|
426
|
-
result = dismiss_glassdoor_login_modal()
|
|
427
|
-
if result:
|
|
428
|
-
wait(1)
|
|
429
|
-
# Re-run extraction after dismissal
|
|
430
|
-
```
|
|
431
|
-
|
|
432
|
-
If the modal is persistent and cannot be closed, switch to Indeed for the same search — it has more accessible public results.
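
The extract, dismiss, retry, then fall back sequence described above can be sketched as a small wrapper. This is only an illustration: the `extract` and `dismiss` callables are hypothetical stand-ins for the Workflow 4 card extraction and `dismiss_glassdoor_login_modal()`, and the function name and `max_attempts` parameter are assumptions, not part of the helper library.

```python
from typing import Callable, Optional


def extract_with_modal_retry(
    extract: Callable[[], list],
    dismiss: Callable[[], Optional[str]],
    max_attempts: int = 2,
) -> list:
    """Run `extract`; if it returns nothing, dismiss the modal and retry.

    Both callables are hypothetical: `extract` returns a list of job dicts,
    `dismiss` returns a truthy value when it actually closed a modal.
    """
    for _ in range(max_attempts):
        results = extract()
        if results:
            return results
        if not dismiss():
            break  # nothing was dismissed, so retrying would not help
    return []  # caller can fall back to Indeed for the same search
```

An empty return value is the signal to switch to Indeed rather than keep retrying against a persistent modal.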
|
|
433
|
-
|
|
434
|
-
---
|
|
435
|
-
|
|
436
|
-
## Workflow 6: Stepstone (German) — job extraction
|
|
437
|
-
|
|
438
|
-
Stepstone is server-rendered. Most data can be extracted with `http_get` for speed, or via `goto_url` + `js()` for dynamic content.
|
|
439
|
-
|
|
440
|
-
```python
|
|
441
|
-
import json, re
|
|
442
|
-
from urllib.parse import quote_plus
|
|
443
|
-
|
|
444
|
-
keyword = "Sachbearbeiter Einkauf"
|
|
445
|
-
city = "Regensburg"
|
|
446
|
-
|
|
447
|
-
# Stepstone encodes keyword/city in the path
|
|
448
|
-
kw_path = keyword.replace(" ", "-")
|
|
449
|
-
city_path = city.replace(" ", "-")
|
|
450
|
-
|
|
451
|
-
goto_url(f"https://www.stepstone.de/jobs/{kw_path}/in-{city_path}.html")
|
|
452
|
-
wait_for_load()
|
|
453
|
-
wait(2)
|
|
454
|
-
dismiss_cookie_banner()
|
|
455
|
-
|
|
456
|
-
jobs = js("""
|
|
457
|
-
(function() {
|
|
458
|
-
// Stepstone result cards
|
|
459
|
-
var cards = document.querySelectorAll(
|
|
460
|
-
'article[data-at="job-item"], [data-genesis-element="JOB_CARD"], article.sc-fhzFiK'
|
|
461
|
-
);
|
|
462
|
-
var out = [];
|
|
463
|
-
for (var i = 0; i < cards.length; i++) {
|
|
464
|
-
var c = cards[i];
|
|
465
|
-
|
|
466
|
-
// Title
|
|
467
|
-
var titleEl = c.querySelector('h2[data-at="job-item-title"] a, [data-at="job-title"], .listing__title a');
|
|
468
|
-
var title = titleEl ? titleEl.innerText.trim() : '';
|
|
469
|
-
var url = titleEl ? (titleEl.href || '') : '';
|
|
470
|
-
|
|
471
|
-
// Company
|
|
472
|
-
var compEl = c.querySelector('[data-at="job-item-company-name"], [data-at="company-name"], .listing__company');
|
|
473
|
-
var company = compEl ? compEl.innerText.trim() : '';
|
|
474
|
-
|
|
475
|
-
// Location
|
|
476
|
-
var locEl = c.querySelector('[data-at="job-item-location"], .listing__location');
|
|
477
|
-
var location = locEl ? locEl.innerText.trim() : '';
|
|
478
|
-
|
|
479
|
-
// Posting date
|
|
480
|
-
var dateEl = c.querySelector('[data-at="job-posting-date"], time, .listing__date');
|
|
481
|
-
var posted = dateEl ? (dateEl.getAttribute('datetime') || dateEl.innerText.trim()) : '';
|
|
482
|
-
|
|
483
|
-
if (title) out.push({title, company, location, posted, url});
|
|
484
|
-
}
|
|
485
|
-
return JSON.stringify(out);
|
|
486
|
-
})()
|
|
487
|
-
""")
|
|
488
|
-
|
|
489
|
-
results = json.loads(jobs)
|
|
490
|
-
for r in results:
|
|
491
|
-
print(r)
|
|
492
|
-
```
|
|
493
|
-
|
|
494
|
-
### Stepstone pagination
|
|
495
|
-
|
|
496
|
-
```python
|
|
497
|
-
import json
|
|
498
|
-
|
|
499
|
-
all_jobs = []
|
|
500
|
-
for page in range(1, 4): # pages 1-3
|
|
501
|
-
if page == 1:
|
|
502
|
-
url = f"https://www.stepstone.de/jobs/{kw_path}/in-{city_path}.html"
|
|
503
|
-
else:
|
|
504
|
-
url = f"https://www.stepstone.de/jobs/{kw_path}/in-{city_path}/page-{page}.html"
|
|
505
|
-
|
|
506
|
-
goto_url(url)
|
|
507
|
-
wait_for_load()
|
|
508
|
-
wait(2)
|
|
509
|
-
|
|
510
|
-
if page == 1:
|
|
511
|
-
dismiss_cookie_banner()
|
|
512
|
-
|
|
513
|
-
batch_json = js("""
|
|
514
|
-
(function() {
|
|
515
|
-
var cards = document.querySelectorAll('article[data-at="job-item"], [data-genesis-element="JOB_CARD"]');
|
|
516
|
-
var out = [];
|
|
517
|
-
for (var i = 0; i < cards.length; i++) {
|
|
518
|
-
var c = cards[i];
|
|
519
|
-
var titleEl = c.querySelector('[data-at="job-item-title"] a, [data-at="job-title"]');
|
|
520
|
-
var compEl = c.querySelector('[data-at="job-item-company-name"]');
|
|
521
|
-
var locEl = c.querySelector('[data-at="job-item-location"]');
|
|
522
|
-
var dateEl = c.querySelector('time');
|
|
523
|
-
out.push({
|
|
524
|
-
title: titleEl ? titleEl.innerText.trim() : '',
|
|
525
|
-
company: compEl ? compEl.innerText.trim() : '',
|
|
526
|
-
location: locEl ? locEl.innerText.trim() : '',
|
|
527
|
-
posted: dateEl ? dateEl.getAttribute('datetime') || dateEl.innerText.trim() : '',
|
|
528
|
-
url: titleEl ? titleEl.href : '',
|
|
529
|
-
});
|
|
530
|
-
}
|
|
531
|
-
return JSON.stringify(out.filter(j => j.title));
|
|
532
|
-
})()
|
|
533
|
-
""")
|
|
534
|
-
|
|
535
|
-
batch = json.loads(batch_json)
|
|
536
|
-
if not batch:
|
|
537
|
-
break
|
|
538
|
-
all_jobs.extend(batch)
|
|
539
|
-
|
|
540
|
-
print(f"Stepstone: {len(all_jobs)} jobs collected")
|
|
541
|
-
```
|
|
542
|
-
|
|
543
|
-
---
|
|
544
|
-
|
|
545
|
-
## Indeed job key (jk) — direct URL construction
|
|
546
|
-
|
|
547
|
-
Indeed search result links go through a tracking redirect. **Do not use those redirect URLs.** Instead, extract the `data-jk` attribute directly for the stable canonical URL.
|
|
548
|
-
|
|
549
|
-
```python
|
|
550
|
-
# Correct approach: extract data-jk from the card
|
|
551
|
-
job_keys = js("""
|
|
552
|
-
JSON.stringify(
|
|
553
|
-
Array.from(document.querySelectorAll('[data-jk]'))
|
|
554
|
-
.map(el => el.getAttribute('data-jk'))
|
|
555
|
-
.filter(jk => jk && jk.length > 0)
|
|
556
|
-
.filter((jk, i, arr) => arr.indexOf(jk) === i) // dedupe
|
|
557
|
-
)
|
|
558
|
-
""")
|
|
559
|
-
import json
|
|
560
|
-
jks = json.loads(job_keys)
|
|
561
|
-
|
|
562
|
-
# Canonical job detail URL for any job key:
|
|
563
|
-
for jk in jks:
|
|
564
|
-
direct_url = f"https://www.indeed.com/viewjob?jk={jk}"
|
|
565
|
-
print(direct_url)
|
|
566
|
-
```
|
|
567
|
-
|
|
568
|
-
If you already have a redirect URL and need to extract the `jk` from it:
|
|
569
|
-
|
|
570
|
-
```python
|
|
571
|
-
import re
|
|
572
|
-
def extract_jk(url: str) -> str | None:
|
|
573
|
-
m = re.search(r'[?&]jk=([a-f0-9]+)', url)
|
|
574
|
-
return m.group(1) if m else None
|
|
575
|
-
```
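
As an alternative to the regex, the `jk` parameter can be read with the standard library's query-string parser. This is a minimal sketch; the function name `extract_jk_from_query` is an assumption, and unlike the regex above it does not restrict the value to hexadecimal characters.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_jk_from_query(url: str) -> Optional[str]:
    """Return the first `jk` query parameter of `url`, or None if absent."""
    values = parse_qs(urlparse(url).query).get("jk")
    return values[0] if values else None
```

`parse_qs` drops blank values by default, so a URL ending in `jk=` also yields `None`.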
|
|
576
|
-
|
|
577
|
-
---
|
|
578
|
-
|
|
579
|
-
## Salary extraction and normalization
|
|
580
|
-
|
|
581
|
-
Salary appears in different places and formats depending on the job and site.
|
|
582
|
-
|
|
583
|
-
### Indeed salary patterns
|
|
584
|
-
|
|
585
|
-
```python
|
|
586
|
-
import re
|
|
587
|
-
|
|
588
|
-
def parse_indeed_salary(raw: str) -> dict:
|
|
589
|
-
"""
|
|
590
|
-
Parse Indeed salary strings like:
|
|
591
|
-
"$85,000 - $110,000 a year"
|
|
592
|
-
"Up to $65 an hour"
|
|
593
|
-
"$25 - $30 an hour"
|
|
594
|
-
"From $120,000 a year"
|
|
595
|
-
"Employer est.: $90,000 - $120,000 a year"
|
|
596
|
-
Returns: {raw, low, high, period, source}
|
|
597
|
-
"""
|
|
598
|
-
if not raw:
|
|
599
|
-
return {"raw": raw, "low": None, "high": None, "period": None, "source": None}
|
|
600
|
-
|
|
601
|
-
source = None
|
|
602
|
-
if "Employer est." in raw:
|
|
603
|
-
source = "employer"
|
|
604
|
-
raw = raw.replace("Employer est.:", "").strip()
|
|
605
|
-
elif "Glassdoor est." in raw:
|
|
606
|
-
source = "glassdoor"
|
|
607
|
-
raw = raw.replace("Glassdoor est.:", "").strip()
|
|
608
|
-
|
|
609
|
-
raw_clean = raw.replace(",", "")
|
|
610
|
-
|
|
611
|
-
# Period
|
|
612
|
-
period = None
|
|
613
|
-
if "a year" in raw or "per year" in raw or "/yr" in raw:
|
|
614
|
-
period = "year"
|
|
615
|
-
elif "an hour" in raw or "per hour" in raw or "/hr" in raw:
|
|
616
|
-
period = "hour"
|
|
617
|
-
elif "a month" in raw or "per month" in raw:
|
|
618
|
-
period = "month"
|
|
619
|
-
|
|
620
|
-
# Range: two dollar amounts
|
|
621
|
-
range_m = re.findall(r'\$?([\d]+(?:\.\d+)?)', raw_clean)
|
|
622
|
-
low = float(range_m[0]) if len(range_m) >= 1 else None
|
|
623
|
-
high = float(range_m[1]) if len(range_m) >= 2 else low
|
|
624
|
-
|
|
625
|
-
return {"raw": raw, "low": low, "high": high, "period": period, "source": source}
|
|
626
|
-
|
|
627
|
-
# Examples
|
|
628
|
-
parse_indeed_salary("$85,000 - $110,000 a year")
|
|
629
|
-
# -> {"raw": "$85,000 - $110,000 a year", "low": 85000.0, "high": 110000.0, "period": "year", "source": None}
|
|
630
|
-
|
|
631
|
-
parse_indeed_salary("Employer est.: $90,000 - $120,000 a year")
|
|
632
|
-
# -> {"raw": "$90,000 - $120,000 a year", "low": 90000.0, "high": 120000.0, "period": "year", "source": "employer"}
|
|
633
|
-
|
|
634
|
-
parse_indeed_salary("Up to $65 an hour")
|
|
635
|
-
# -> {"raw": "Up to $65 an hour", "low": 65.0, "high": 65.0, "period": "hour", "source": None}
|
|
636
|
-
```
|
|
637
|
-
|
|
638
|
-
### Glassdoor salary note
|
|
639
|
-
|
|
640
|
-
Glassdoor shows two types of salary estimates:
|
|
641
|
-
- **"Employer est."** — the company provided a range in the job post
|
|
642
|
-
- **"Glassdoor est."** — Glassdoor estimated based on similar roles; shown with "(est.)" in the card
|
|
643
|
-
|
|
644
|
-
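Both are shown as text inside the card. Parse them the same way as Indeed, but note that Glassdoor cards often abbreviate thousands (for example "$90K - $120K"), which `parse_indeed_salary` reads as 90 and 120 unless the `K` suffix is expanded first.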
|
|
645
|
-
|
|
646
|
-
If the salary is absent in the search result card, it is only available on the job detail page (requires a click through to the individual listing).
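
Glassdoor card text such as "$90K - $120K (Glassdoor est.)" abbreviates thousands, so a parser needs to expand the `K` suffix. The sketch below is one way to do that; the function name `parse_abbreviated_salary` is an assumption, and it deliberately leaves out pay-period detection to stay short.

```python
import re
from typing import Optional


def parse_abbreviated_salary(raw: str) -> dict:
    """Parse strings such as "$90K - $120K (Glassdoor est.)" into numbers."""
    source: Optional[str] = None
    if "Employer est." in raw:
        source = "employer"
    elif "Glassdoor est." in raw:
        source = "glassdoor"

    amounts = []
    # Capture each dollar amount plus an optional K suffix, e.g. "$90K" or "$85,000".
    for number, suffix in re.findall(r"\$\s*([\d,]+(?:\.\d+)?)\s*([kK]?)", raw):
        value = float(number.replace(",", ""))
        if suffix:
            value *= 1000  # "90K" means 90,000
        amounts.append(value)

    low = amounts[0] if amounts else None
    high = amounts[1] if len(amounts) > 1 else low
    return {"raw": raw, "low": low, "high": high, "source": source}
```

Strings without a `K` suffix, such as "$85,000 - $110,000 a year", pass through unchanged, so the same function can be tried on Indeed text as well.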
|
|
647
|
-
|
|
648
|
-
---
|
|
649
|
-
|
|
650
|
-
## Date normalization ("3 days ago" → actual date)
|
|
651
|
-
|
|
652
|
-
All three sites use relative timestamps. Convert to absolute dates when needed.
|
|
653
|
-
|
|
654
|
-
```python
|
|
655
|
-
import re
|
|
656
|
-
from datetime import datetime, timedelta
|
|
657
|
-
|
|
658
|
-
def parse_relative_date(text: str, reference_date: datetime | None = None) -> datetime | None:
|
|
659
|
-
"""
|
|
660
|
-
Convert relative job posting dates to datetime objects.
|
|
661
|
-
Handles: "Just posted", "Today", "1 day ago", "3 days ago", "30+ days ago"
|
|
662
|
-
"""
|
|
663
|
-
if reference_date is None:
|
|
664
|
-
reference_date = datetime.utcnow()
|
|
665
|
-
|
|
666
|
-
text = text.strip().lower()
|
|
667
|
-
|
|
668
|
-
if not text or text in ("", "unknown"):
|
|
669
|
-
return None
|
|
670
|
-
if text in ("just posted", "today", "active today"):
|
|
671
|
-
return reference_date
|
|
672
|
-
if "hour" in text:
|
|
673
|
-
m = re.search(r'(\d+)', text)
|
|
674
|
-
hours = int(m.group(1)) if m else 1
|
|
675
|
-
return reference_date - timedelta(hours=hours)
|
|
676
|
-
if "day" in text:
|
|
677
|
-
m = re.search(r'(\d+)', text)
|
|
678
|
-
days = int(m.group(1)) if m else 1
|
|
679
|
-
return reference_date - timedelta(days=days)
|
|
680
|
-
if "week" in text:
|
|
681
|
-
m = re.search(r'(\d+)', text)
|
|
682
|
-
weeks = int(m.group(1)) if m else 1
|
|
683
|
-
return reference_date - timedelta(weeks=weeks)
|
|
684
|
-
if "month" in text:
|
|
685
|
-
m = re.search(r'(\d+)', text)
|
|
686
|
-
months = int(m.group(1)) if m else 1
|
|
687
|
-
return reference_date - timedelta(days=months * 30)
|
|
688
|
-
if "30+" in text:
|
|
689
|
-
return reference_date - timedelta(days=30)
|
|
690
|
-
|
|
691
|
-
return None # unparseable
|
|
692
|
-
|
|
693
|
-
# Examples
|
|
694
|
-
parse_relative_date("3 days ago") # datetime ~3 days before now
|
|
695
|
-
parse_relative_date("Just posted") # datetime.utcnow()
|
|
696
|
-
parse_relative_date("30+ days ago") # datetime 30 days ago
|
|
697
|
-
```
|
|
698
|
-
|
|
699
|
-
---
|
|
700
|
-
|
|
701
|
-
## Workflow 7: Fast bulk extraction with `http_get` (no browser)
|
|
702
|
-
|
|
703
|
-
For Indeed, the raw HTML of search results contains structured JSON in a `window.mosaic.providerData` script tag. Parsing it is usually faster than DOM extraction, although it breaks if Indeed renames that key.
|
|
704
|
-
|
|
705
|
-
```python
|
|
706
|
-
import json, re
|
|
707
|
-
from urllib.parse import quote_plus
|
|
708
|
-
|
|
709
|
-
def indeed_http_search(query: str, location: str = "", fromage: int = 0, start: int = 0) -> list[dict]:
|
|
710
|
-
"""
|
|
711
|
-
Extract Indeed jobs via HTTP (no browser). Parses the embedded JSON payload.
|
|
712
|
-
Returns up to ~15 jobs per call.
|
|
713
|
-
"""
|
|
714
|
-
params = f"q={quote_plus(query)}&l={quote_plus(location)}&start={start}"
|
|
715
|
-
if fromage:
|
|
716
|
-
params += f"&fromage={fromage}"
|
|
717
|
-
|
|
718
|
-
html = http_get(
|
|
719
|
-
f"https://www.indeed.com/jobs?{params}",
|
|
720
|
-
headers={
|
|
721
|
-
"Accept-Language": "en-US,en;q=0.9",
|
|
722
|
-
"Accept": "text/html,application/xhtml+xml",
|
|
723
|
-
}
|
|
724
|
-
)
|
|
725
|
-
|
|
726
|
-
# Check for CAPTCHA before parsing
|
|
727
|
-
if "captcha" in html.lower() or "robot check" in html.lower():
|
|
728
|
-
return [] # fall back to browser-based extraction
|
|
729
|
-
|
|
730
|
-
# Indeed embeds job data in window.mosaic.providerData["mosaic-provider-jobcards"]
|
|
731
|
-
m = re.search(
|
|
732
|
-
r'window\.mosaic\.providerData\["mosaic-provider-jobcards"\]\s*=\s*(\{.*?\});',
|
|
733
|
-
html, re.DOTALL
|
|
734
|
-
)
|
|
735
|
-
if not m:
|
|
736
|
-
return []
|
|
737
|
-
|
|
738
|
-
try:
|
|
739
|
-
data = json.loads(m.group(1))
|
|
740
|
-
except json.JSONDecodeError:
|
|
741
|
-
return []
|
|
742
|
-
|
|
743
|
-
results_list = (
|
|
744
|
-
data
|
|
745
|
-
.get("metaData", {})
|
|
746
|
-
.get("mosaicProviderJobCardsModel", {})
|
|
747
|
-
.get("results", [])
|
|
748
|
-
)
|
|
749
|
-
|
|
750
|
-
jobs = []
|
|
751
|
-
for r in results_list:
|
|
752
|
-
jk = r.get("jobkey", "")
|
|
753
|
-
jobs.append({
|
|
754
|
-
"jk": jk,
|
|
755
|
-
"title": r.get("title", ""),
|
|
756
|
-
"company": r.get("company", ""),
|
|
757
|
-
"location": r.get("formattedLocation", ""),
|
|
758
|
-
"salary": (r.get("salarySnippet") or {}).get("text", ""),
|
|
759
|
-
"posted": r.get("formattedRelativeTime", ""),
|
|
760
|
-
"url": f"https://www.indeed.com/viewjob?jk={jk}",
|
|
761
|
-
"snippet": r.get("snippet", ""), # short description preview
|
|
762
|
-
})
|
|
763
|
-
return jobs
|
|
764
|
-
|
|
765
|
-
# Example — last 24h remote jobs
|
|
766
|
-
jobs = indeed_http_search("software engineer", "remote", fromage=1)
|
|
767
|
-
for j in jobs:
|
|
768
|
-
print(j["title"], "|", j["company"], "|", j["salary"])
|
|
769
|
-
```
|
|
770
|
-
|
|
771
|
-
If `indeed_http_search` returns no results (CAPTCHA or structure change), fall back to the `goto_url` + `js()` browser workflow above.
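
The non-greedy pattern `(\{.*?\});` stops at the first `};` it meets, which may fall inside the payload itself and leave truncated JSON. A more tolerant sketch locates the assignment and lets `json.JSONDecoder.raw_decode` find the end of the object. The helper name `extract_embedded_json` is an assumption; the marker string is the one used in the workflow above.

```python
import json
import re
from typing import Optional

# Matches the assignment prefix, including whitespace after "=".
MARKER = re.compile(r'window\.mosaic\.providerData\["mosaic-provider-jobcards"\]\s*=\s*')


def extract_embedded_json(html: str) -> Optional[dict]:
    """Decode the JSON object that follows the providerData assignment."""
    m = MARKER.search(html)
    if not m:
        return None
    try:
        # raw_decode parses one complete JSON value starting at the given
        # index and ignores whatever follows it, so nested braces or "};"
        # inside string values do not cut the payload short.
        obj, _end = json.JSONDecoder().raw_decode(html, m.end())
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None
```

A `None` result covers both a missing marker and undecodable JSON, so the caller can treat either case as the cue to use the browser path.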
|
|
772
|
-
|
|
773
|
-
---
|
|
774
|
-
|
|
775
|
-
## Workflow 8: "Easy Apply" vs external application detection
|
|
776
|
-
|
|
777
|
-
Some Indeed listings accept applications on Indeed itself ("Easy Apply"), while others redirect to the company site. Detect which type a listing is before deciding how to proceed.
|
|
778
|
-
|
|
779
|
-
```python
|
|
780
|
-
def get_application_type(jk: str) -> dict:
|
|
781
|
-
"""Returns {type: 'easy_apply'|'external'|'unknown', externalUrl: str|None}"""
|
|
782
|
-
goto_url(f"https://www.indeed.com/viewjob?jk={jk}")
|
|
783
|
-
wait_for_load()
|
|
784
|
-
wait(2)
|
|
785
|
-
|
|
786
|
-
return js("""
|
|
787
|
-
(function() {
|
|
788
|
-
// "Apply now" button pointing to /applystart = indeed-hosted Easy Apply
|
|
789
|
-
var easyBtn = document.querySelector(
|
|
790
|
-
'button[data-testid="applyButton"], [id="indeedApplyButton"], button[class*="IndeedApplyButton"]'
|
|
791
|
-
);
|
|
792
|
-
// "Apply on company site" button
|
|
793
|
-
var extBtn = document.querySelector(
|
|
794
|
-
'a[data-testid="applyButton"][href*="indeed.com/applystart"], a[href*="indeed.com/applystart"]'
|
|
795
|
-
);
|
|
796
|
-
// External redirect — check the main CTA
|
|
797
|
-
var mainCta = document.querySelector('[data-testid="applyButton"]');
|
|
798
|
-
var ctaHref = (mainCta && mainCta.href) || '';  // <button> elements have no href, so default to ''
|
|
799
|
-
|
|
800
|
-
if (easyBtn && !ctaHref.includes('apply.indeed')) {
|
|
801
|
-
return {type: 'easy_apply', externalUrl: null};
|
|
802
|
-
}
|
|
803
|
-
if (extBtn || (ctaHref && !ctaHref.includes('indeed.com/viewjob'))) {
|
|
804
|
-
return {type: 'external', externalUrl: ctaHref || null};
|
|
805
|
-
}
|
|
806
|
-
return {type: 'unknown', externalUrl: null};
|
|
807
|
-
})()
|
|
808
|
-
""")
|
|
809
|
-
```
|
|
810
|
-
|
|
811
|
-
---
|
|
812
|
-
|
|
813
|
-
## Bot detection and rate limiting
|
|
814
|
-
|
|
815
|
-
Indeed and Glassdoor have active bot detection. Exceeding the limits below leads to CAPTCHA walls, IP blocks, or silently degraded results (cards with empty fields).
|
|
816
|
-
|
|
817
|
-
### Safe request cadence
|
|
818
|
-
|
|
819
|
-
```python
|
|
820
|
-
# Minimum wait between page loads
|
|
821
|
-
INTER_PAGE_WAIT = 2.5 # seconds — don't go below 2
|
|
822
|
-
|
|
823
|
-
# Between job detail page fetches
|
|
824
|
-
INTER_DETAIL_WAIT = 3.0 # seconds
|
|
825
|
-
|
|
826
|
-
# http_get concurrency limit
|
|
827
|
-
MAX_HTTP_CONCURRENT = 2 # never more than 2 at once for Indeed/Glassdoor
|
|
828
|
-
```
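
One way to apply these constants consistently is a small throttle object that sleeps only for the time still owed since the previous request. This is a sketch under assumptions: the `Throttle` class is hypothetical, and the injectable `clock` and `sleep` parameters exist only to make it testable.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between successive calls to `wait()`."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the previous call was released

    def wait(self) -> float:
        """Sleep just long enough to honour the interval; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

For example, create `Throttle(INTER_PAGE_WAIT)` once and call its `wait()` before each `goto_url`, with a second instance built from `INTER_DETAIL_WAIT` for detail pages.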
|
|
829
|
-
|
|
830
|
-
### CAPTCHA detection
|
|
831
|
-
|
|
832
|
-
```python
|
|
833
|
-
def is_captcha_page() -> bool:
|
|
834
|
-
"""Check if the current page is a CAPTCHA or block page."""
|
|
835
|
-
url = page_info()["url"]
|
|
836
|
-
title = js("document.title") or ""
|
|
837
|
-
body_text = js("document.body ? document.body.innerText.substring(0, 500) : ''") or ""
|
|
838
|
-
|
|
839
|
-
signals = [
|
|
840
|
-
"captcha" in url.lower(),
|
|
841
|
-
"robot" in title.lower(),
|
|
842
|
-
"are you a human" in body_text.lower(),
|
|
843
|
-
"verify you are human" in body_text.lower(),
|
|
844
|
-
"unusual traffic" in body_text.lower(),
|
|
845
|
-
"indeed.com/error" in url,
|
|
846
|
-
"sorry" in title.lower() and "indeed" in url,
|
|
847
|
-
]
|
|
848
|
-
return any(signals)
|
|
849
|
-
|
|
850
|
-
# Use after every goto_url call:
|
|
851
|
-
goto_url(some_url)
|
|
852
|
-
wait_for_load()
|
|
853
|
-
wait(2)
|
|
854
|
-
if is_captcha_page():
|
|
855
|
-
capture_screenshot()
|
|
856
|
-
# Wait longer and retry once
|
|
857
|
-
wait(10)
|
|
858
|
-
goto_url(some_url)
|
|
859
|
-
wait_for_load()
|
|
860
|
-
wait(3)
|
|
861
|
-
```
|
|
862
|
-
|
|
863
|
-
### Glassdoor session hygiene
|
|
864
|
-
|
|
865
|
-
Glassdoor's bot detection is more fingerprint-based. If results stop loading:
|
|
866
|
-
|
|
867
|
-
1. Take a `capture_screenshot()` — confirm whether it is a login modal vs a block page
|
|
868
|
-
2. Dismiss any login modal first (`dismiss_glassdoor_login_modal()`)
|
|
869
|
-
3. If a block page appears, pause 30+ seconds before retrying
|
|
870
|
-
4. Switch to Indeed for the same query — results are similar and bot tolerance is higher
|
|
871
|
-
|
|
872
|
-
---
|
|
873
|
-
|
|
874
|
-
## Filtering by date, job type, and salary
|
|
875
|
-
|
|
876
|
-
### Indeed URL filter parameters
|
|
877
|
-
|
|
878
|
-
```python
|
|
879
|
-
from urllib.parse import quote_plus
|
|
880
|
-
|
|
881
|
-
def build_indeed_url(
|
|
882
|
-
query: str,
|
|
883
|
-
location: str = "",
|
|
884
|
-
fromage: int = 0, # days: 1=last 24h, 3=last 3 days, 7=last week
|
|
885
|
-
job_type: str = "", # "fulltime", "parttime", "contract", "internship", "temporary"
|
|
886
|
-
remote: bool = False,
|
|
887
|
-
start: int = 0,
|
|
888
|
-
) -> str:
|
|
889
|
-
base = f"https://www.indeed.com/jobs?q={quote_plus(query)}&l={quote_plus(location)}"
|
|
890
|
-
if fromage:
|
|
891
|
-
base += f"&fromage={fromage}"
|
|
892
|
-
if job_type:
|
|
893
|
-
base += f"&jt={job_type}"
|
|
894
|
-
if remote:
|
|
895
|
-
base += "&remotejob=032b3046-06a3-4876-8dfd-474eb5e7ed11"
|
|
896
|
-
if start:
|
|
897
|
-
base += f"&start={start}"
|
|
898
|
-
return base
|
|
899
|
-
|
|
900
|
-
# Examples
|
|
901
|
-
url = build_indeed_url("backend engineer", "Austin, TX", fromage=7, job_type="fulltime")
|
|
902
|
-
url = build_indeed_url("data analyst", remote=True, fromage=1)
|
|
903
|
-
```
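
The same URL can be assembled with `urllib.parse.urlencode`, which escapes every value rather than only `query` and `location`. This is a sketch, not a replacement: the name `build_indeed_url_encoded` and the constant `REMOTE_FILTER_ID` are assumptions, and the filter value is copied unchanged from `build_indeed_url` above.

```python
from urllib.parse import urlencode

# Value copied from build_indeed_url above; treated as an opaque identifier.
REMOTE_FILTER_ID = "032b3046-06a3-4876-8dfd-474eb5e7ed11"


def build_indeed_url_encoded(
    query: str,
    location: str = "",
    fromage: int = 0,
    job_type: str = "",
    remote: bool = False,
    start: int = 0,
) -> str:
    """Build an Indeed search URL, adding only the filters that are set."""
    params = {"q": query, "l": location}
    if fromage:
        params["fromage"] = fromage
    if job_type:
        params["jt"] = job_type
    if remote:
        params["remotejob"] = REMOTE_FILTER_ID
    if start:
        params["start"] = start
    # urlencode applies quote_plus to each key and value by default.
    return "https://www.indeed.com/jobs?" + urlencode(params)
```

Because dictionaries keep insertion order, the parameters appear in the same order as in the hand-built version.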
|
|
904
|
-
|
|
905
|
-
---
|
|
906
|
-
|
|
907
|
-
## Collecting N results across pages
|
|
908
|
-
|
|
909
|
-
```python
|
|
910
|
-
import json
|
|
911
|
-
from urllib.parse import quote_plus
|
|
912
|
-
|
|
913
|
-
def collect_indeed_jobs(query: str, location: str = "", max_results: int = 20,
|
|
914
|
-
fromage: int = 0, job_type: str = "") -> list[dict]:
|
|
915
|
-
"""
|
|
916
|
-
Collect up to max_results jobs from Indeed across multiple pages.
|
|
917
|
-
Waits between pages to avoid bot detection.
|
|
918
|
-
"""
|
|
919
|
-
all_jobs = []
|
|
920
|
-
seen_jks = set()
|
|
921
|
-
page = 0
|
|
922
|
-
|
|
923
|
-
while len(all_jobs) < max_results:
|
|
924
|
-
start = page * 10
|
|
925
|
-
url = build_indeed_url(query, location, fromage=fromage, job_type=job_type, start=start)
|
|
926
|
-
goto_url(url)
|
|
927
|
-
wait_for_load()
|
|
928
|
-
wait(2.5)
|
|
929
|
-
|
|
930
|
-
if page == 0:
|
|
931
|
-
dismiss_cookie_banner()
|
|
932
|
-
|
|
933
|
-
if is_captcha_page():
|
|
934
|
-
print(f"CAPTCHA on page {page+1}, stopping")
|
|
935
|
-
break
|
|
936
|
-
|
|
937
|
-
batch_json = js("""
|
|
938
|
-
(function() {
|
|
939
|
-
var cards = document.querySelectorAll('[data-jk]');
|
|
940
|
-
var out = [];
|
|
941
|
-
for (var i = 0; i < cards.length; i++) {
|
|
942
|
-
var c = cards[i];
|
|
943
|
-
var jk = c.getAttribute('data-jk') || '';
|
|
944
|
-
if (!jk) continue;
|
|
945
|
-
var titleEl = c.querySelector('h2.jobTitle span[title], [data-testid="job-title"]');
|
|
946
|
-
var compEl = c.querySelector('[data-testid="company-name"], .companyName');
|
|
947
|
-
var locEl = c.querySelector('[data-testid="text-location"], .companyLocation');
|
|
948
|
-
var salEl = c.querySelector('[data-testid="attribute_snippet_testid"], .salary-snippet-container');
|
|
949
|
-
var dateEl = c.querySelector('[data-testid="myJobsStateDate"], span.date');
|
|
950
|
-
out.push({
|
|
951
|
-
jk,
|
|
952
|
-
title: titleEl ? titleEl.innerText.trim() : '',
|
|
953
|
-
company: compEl ? compEl.innerText.trim() : '',
|
|
954
|
-
location: locEl ? locEl.innerText.trim() : '',
|
|
955
|
-
salary: salEl ? salEl.innerText.trim() : '',
|
|
956
|
-
posted: dateEl ? dateEl.innerText.trim() : '',
|
|
957
|
-
url: 'https://www.indeed.com/viewjob?jk=' + jk,
|
|
958
|
-
});
|
|
959
|
-
}
|
|
960
|
-
return JSON.stringify(out.filter(j => j.title && j.jk));
|
|
961
|
-
})()
|
|
962
|
-
""")
|
|
963
|
-
|
|
964
|
-
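batch = [j for j in json.loads(batch_json) if j["jk"] not in seen_jks]  # drop already-seen keys so a page with nothing new ends the loop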
|
|
965
|
-
if not batch:
|
|
966
|
-
break # no more results
|
|
967
|
-
|
|
968
|
-
new_jobs = [j for j in batch if j["jk"] not in seen_jks]
|
|
969
|
-
seen_jks.update(j["jk"] for j in new_jobs)
|
|
970
|
-
all_jobs.extend(new_jobs)
|
|
971
|
-
page += 1
|
|
972
|
-
|
|
973
|
-
return all_jobs[:max_results]
|
|
974
|
-
|
|
975
|
-
# Examples
|
|
976
|
-
jobs = collect_indeed_jobs("Python developer", "San Francisco", max_results=20)
|
|
977
|
-
jobs = collect_indeed_jobs("remote software engineer", fromage=1, max_results=10)
|
|
978
|
-
jobs = collect_indeed_jobs("machine learning engineer", max_results=30, fromage=7, job_type="fulltime")
|
|
979
|
-
```
|
|
980
|
-
|
|
981
|
-
---
|
|
982
|
-
|
|
983
|
-
## Gotchas
|
|
984
|
-
|
|
985
|
-
- **`data-jk` is the job key, not a DOM id** — Always use `[data-jk]` to select cards, not `#job-...` ids which vary by page layout and A/B test variant.
|
|
986
|
-
|
|
987
|
-
- **Indeed redirect links are NOT stable URLs** — Anchor `href` values in search results go through `https://www.indeed.com/rc/clk?...` tracking redirects which expire. Always extract `data-jk` from the card and construct `https://www.indeed.com/viewjob?jk={jk}` yourself.
|
|
988
|
-
|
|
989
|
-
- **Salary is on the detail page, not the card** — Many listings show no salary in the search result card. If salary is required, fetch the individual `viewjob?jk=` page and extract it there. Budget `wait(3)` per detail page and do not fetch more than 5 detail pages per minute.
|
|
990
|
-
|
|
991
|
-
- **"Employer est." vs "Glassdoor est."** — These are two distinct data signals. Employer estimates come from the job post itself; Glassdoor estimates are crowd-sourced. The distinction matters when reporting salary accuracy to users.
|
|
992
|
-
|
|
993
|
-
- **Glassdoor login modal appears after 2-3 scrolls** — Extract all visible cards immediately on load before scrolling. If you need to load more results via scroll/infinite scroll, dismiss the modal first.
|
|
994
|
-
|
|
995
|
-
- **Glassdoor public results are limited** — Without login, Glassdoor shows ~10-15 cards. If the task requires 30+ results, use Indeed instead (no login required, up to ~15 per page with full pagination).
|
|
996
|
-
|
|
997
|
-
- **Stepstone uses path-based URL routing, not query params** — Spaces in keyword or city must be replaced with `-` for the path, not `%20` or `+`. `quote_plus()` is wrong for path segments. Use `.replace(" ", "-")`.
|
|
998
|
-
|
|
999
|
-
- **Stepstone pagination is in the path** — `/page-2.html`, `/page-3.html` — not `?page=2`. There is no `&start=N` param as in Indeed.
|
|
1000
|
-
|
|
1001
|
-
- **`http_get` for Glassdoor fails more often** — Glassdoor requires JS to render job cards. Use the browser path for Glassdoor. `http_get` only works reliably for Indeed and Stepstone where server-rendered HTML contains structured data.
|
|
1002
|
-
|
|
1003
|
-
- **Indeed embeds JSON in a `<script>` tag** — The `window.mosaic.providerData` block in the HTML source is the fastest extraction path but it can break if Indeed changes the key. Always have the DOM-based `js()` approach as a fallback.
|
|
1004
|
-
|
|
1005
|
-
- **Date strings are relative, not absolute** — "3 days ago", "30+ days ago", "Just posted" — none of these are machine-parseable dates without a reference point. Use `datetime.utcnow()` as the reference. "30+" means at least 30 days ago; treat as stale.
|
|
1006
|
-
|
|
1007
|
-
- **`fromage=1` on Indeed means "last 24 hours" but uses the listing creation date, not the apply-by date** — Fresh listings can appear in `fromage=3` results a day later due to indexing lag.
|
|
1008
|
-
|
|
1009
|
-
- **Indeed CAPTCHA appears as a clean-looking page** with an image puzzle or just a "continue" button — it will not raise an error. Always check `is_captcha_page()` before assuming extraction results are valid.
|
|
1010
|
-
|
|
1011
|
-
- **Glassdoor location IDs for `locT=C&locId=`** — Programmatic location filtering by ID requires a separate city-ID lookup (Glassdoor's internal city registry). For basic scraping, omit `locId` and use `locKeyword=` with the city name instead — results are less precise but don't require a lookup step.
|
|
1012
|
-
|
|
1013
|
-
- **User-agent matters** — `http_get` uses `Mozilla/5.0` by default (see `helpers.py`). For Indeed `http_get`, also set `Accept-Language: en-US,en;q=0.9` to avoid getting German or localized results based on IP geolocation.
|
|
1014
|
-
|
|
1015
|
-
- **Stepstone cookie modal is fullscreen** — On first load, Stepstone shows a fullscreen consent overlay that blocks the entire page. Always call `dismiss_cookie_banner()` before any extraction. If the overlay cannot be dismissed with the generic pattern, use a coordinate click: `capture_screenshot()` first to find the "Alle akzeptieren" (Accept all) button position, then `click_at_xy(x, y)`.
|
|
1016
|
-
|
|
1017
|
-
- **Glassdoor salary in card vs detail** — Salary text in the card may be truncated ("$90K - $120K (Glassdoor est.)"). The full salary breakdown (base, bonus, total comp) is only on the job detail page, which requires a click through.
|
|
1018
|
-
|
|
1019
|
-
- **"Easy Apply" listings may not have an external URL** — If the job only has an Indeed-hosted application, there is no company site URL. The `externalUrl` will be `null` — this is expected, not a scraping failure.
|
|
1020
|
-
|
|
1021
|
-
- **Empty cards on Indeed mobile breakpoints** — If the browser viewport is very narrow, Indeed may render a different card layout with different selectors. Keep viewport at normal desktop width (1280px+) to get consistent `[data-jk]` card rendering.
|
|
1
|
+
# Job Boards — Indeed, Glassdoor, Stepstone
|
|
2
|
+
|
|
3
|
+
Covers: `indeed.com`, `glassdoor.com`, `stepstone.de`
|
|
4
|
+
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## Do this first: construct search URLs directly
|
|
8
|
+
|
|
9
|
+
Never type into the search box on the homepage — bot detection triggers immediately. Build search URLs directly and navigate straight to results.
|
|
10
|
+
|
|
11
|
+
```python
|
|
12
|
+
from urllib.parse import quote_plus
|
|
13
|
+
|
|
14
|
+
# Indeed — English (US)
|
|
15
|
+
query, location = "Python developer", "San Francisco"
|
|
16
|
+
goto_url(f"https://www.indeed.com/jobs?q={quote_plus(query)}&l={quote_plus(location)}")
|
|
17
|
+
wait_for_load()
|
|
18
|
+
wait(2)
|
|
19
|
+
|
|
20
|
+
# Indeed — last 24 hours
|
|
21
|
+
goto_url(f"https://www.indeed.com/jobs?q={quote_plus(query)}&l={quote_plus(location)}&fromage=1")
|
|
22
|
+
wait_for_load()
|
|
23
|
+
wait(2)
|
|
24
|
+
|
|
25
|
+
# Glassdoor — public search (no login required for result cards)
|
|
26
|
+
goto_url(f"https://www.glassdoor.com/Job/jobs.htm?sc.keyword={quote_plus(query)}")
|
|
27
|
+
wait_for_load()
|
|
28
|
+
wait(2)
|
|
29
|
+
|
|
30
|
+
# Stepstone (Germany)
|
|
31
|
+
keyword, city = "Data Scientist", "Berlin"
|
|
32
|
+
goto_url(f"https://www.stepstone.de/jobs/{keyword.replace(' ', '-')}/in-{city.replace(' ', '-')}.html")
|
|
33
|
+
wait_for_load()
|
|
34
|
+
wait(2)
|
|
35
|
+
```
|
|
36
|
+
|
|
37
|
+
---
|
|
38
|
+
|
|
39
|
+
## URL patterns
|
|
40
|
+
|
|
41
|
+
### Indeed
|
|
42
|
+
|
|
43
|
+
| Goal | URL pattern |
|
|
44
|
+
|---|---|
|
|
45
|
+
| Keyword + location | `/jobs?q={title}&l={location}` |
|
|
46
|
+
| Last 24 hours | `/jobs?q={title}&l={location}&fromage=1` |
|
|
47
|
+
| Last 3 days | `/jobs?q={title}&l={location}&fromage=3` |
|
|
48
|
+
| Last week | `/jobs?q={title}&l={location}&fromage=7` |
|
|
49
|
+
| Remote only | `/jobs?q={title}&remotejob=032b3046-06a3-4876-8dfd-474eb5e7ed11` |
|
|
50
|
+
| Full-time only | `/jobs?q={title}&l={location}&jt=fulltime` |
|
|
51
|
+
| Part-time | `/jobs?q={title}&l={location}&jt=parttime` |
|
|
52
|
+
| With salary | `/jobs?q={title}&l={location}&rbl=%24{min}%2B` |
|
|
53
|
+
| Page 2 (results 11-20) | append `&start=10` |
|
|
54
|
+
| Page 3 (results 21-30) | append `&start=20` |
|
|
55
|
+
| Job detail page | `https://www.indeed.com/viewjob?jk={job_key}` |
|
|
56
|
+
|
|
57
|
+
**Indeed country variants**: `.co.uk`, `.de`, `.fr`, `.com.au` — same URL structure, different base domain.
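
The table rows can be combined into a single URL builder. This is a minimal sketch, not part of the original helpers: `build_indeed_url` and its keyword arguments are hypothetical names, and only the `q`, `l`, `fromage`, `jt`, and `start` parameters listed in the table are used.

```python
from urllib.parse import urlencode


def build_indeed_url(query, location="", *, fromage=None, job_type=None,
                     start=0, domain="www.indeed.com"):
    """Assemble an Indeed search URL (hypothetical helper, for illustration)."""
    params = {"q": query}
    if location:
        params["l"] = location
    if fromage:
        params["fromage"] = fromage   # 1, 3 or 7 (days)
    if job_type:
        params["jt"] = job_type       # "fulltime" or "parttime"
    if start:
        params["start"] = start       # 10 = page 2, 20 = page 3
    # urlencode applies quote_plus to each value, so spaces become "+"
    return f"https://{domain}/jobs?{urlencode(params)}"


print(build_indeed_url("Python developer", "San Francisco", fromage=1, start=10))
```

For a country variant, pass the other base domain, for example `domain="www.indeed.co.uk"`.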
|
|
58
|
+
|
|
59
|
+
### Glassdoor
|
|
60
|
+
|
|
61
|
+
| Goal | URL pattern |
|
|
62
|
+
|---|---|
|
|
63
|
+
| Keyword search | `/Job/jobs.htm?sc.keyword={title}` |
|
|
64
|
+
| Keyword + city name | `/Job/jobs.htm?sc.keyword={title}&locT=C&locKeyword={city}` |
|
|
65
|
+
| Remote filter | `/Job/jobs.htm?sc.keyword={title}&remoteWorkType=1` |
|
|
66
|
+
| Next page | append `&p=2`, `&p=3` |
|
|
67
|
+
|
|
68
|
+
### Stepstone (Germany)
|
|
69
|
+
|
|
70
|
+
| Goal | URL pattern |
|
|
71
|
+
|---|---|
|
|
72
|
+
| Keyword in city | `/jobs/{keyword}/in-{city}.html` |
|
|
73
|
+
| Page 2 | `/jobs/{keyword}/in-{city}/page-2.html` |
|
|
74
|
+
| Page 3 | `/jobs/{keyword}/in-{city}/page-3.html` |
|
|
75
|
+
| Full-time | `/jobs/{keyword}/in-{city}.html?of=1` |
|
|
76
|
+
|
|
77
|
+
For Stepstone, keyword and city go directly in the path — encode spaces as `-`:
|
|
78
|
+
```python
|
|
79
|
+
kw_path = keyword.replace(" ", "-")
|
|
80
|
+
city_path = city.replace(" ", "-")
|
|
81
|
+
goto_url(f"https://www.stepstone.de/jobs/{kw_path}/in-{city_path}.html")
|
|
82
|
+
```
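
The path rules above, including the `page-N` suffix from the table, can be wrapped in one function. A sketch under stated assumptions: `stepstone_url` is a hypothetical helper, and percent-encoding non-ASCII characters (umlauts) with `urllib.parse.quote` is an assumption that has not been verified against Stepstone.

```python
from urllib.parse import quote


def stepstone_url(keyword, city, page=1):
    """Build a Stepstone results URL (hypothetical helper, for illustration)."""
    # Collapse whitespace runs to single hyphens, then percent-encode the rest.
    kw_path = quote("-".join(keyword.split()))
    city_path = quote("-".join(city.split()))
    base = f"https://www.stepstone.de/jobs/{kw_path}/in-{city_path}"
    # Page 1 has no page suffix; later pages use /page-N.html
    return f"{base}.html" if page <= 1 else f"{base}/page-{page}.html"


print(stepstone_url("Data Scientist", "Berlin"))
print(stepstone_url("Data Scientist", "Berlin", page=2))
```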
|
|
83
|
+
|
|
84
|
+
---
|
|
85
|
+
|
|
86
|
+
## Cookie / consent banner dismissal
|
|
87
|
+
|
|
88
|
+
Indeed (EU/UK), Glassdoor, and Stepstone show GDPR consent overlays. Dismiss them before extraction.
|
|
89
|
+
|
|
90
|
+
```python
|
|
91
|
+
def dismiss_cookie_banner():
|
|
92
|
+
"""Try common consent button patterns. Safe to call even if no banner is present."""
|
|
93
|
+
dismissed = js("""
|
|
94
|
+
(function() {
|
|
95
|
+
// Indeed: "Accept all cookies" button
|
|
96
|
+
var selectors = [
|
|
97
|
+
'button[id*="onetrust-accept"]',
|
|
98
|
+
'button[id*="accept-all"]',
|
|
99
|
+
'#onetrust-accept-btn-handler',
|
|
100
|
+
'button[data-testid="cookie-consent-accept"]',
|
|
101
|
+
// Glassdoor: consent modal
|
|
102
|
+
'button[data-test="accept-cookies"]',
|
|
103
|
+
// Generic patterns
|
|
104
|
+
'button[class*="accept"]',
|
|
105
|
+
'button[class*="consent"]',
|
|
106
|
+
];
|
|
107
|
+
for (var i = 0; i < selectors.length; i++) {
|
|
108
|
+
var btn = document.querySelector(selectors[i]);
|
|
109
|
+
if (btn && btn.offsetParent !== null) {
|
|
110
|
+
btn.click();
|
|
111
|
+
return selectors[i];
|
|
112
|
+
}
|
|
113
|
+
}
|
|
114
|
+
return null;
|
|
115
|
+
})()
|
|
116
|
+
""")
|
|
117
|
+
if dismissed:
|
|
118
|
+
wait(1)
|
|
119
|
+
return dismissed
|
|
120
|
+
```
|
|
121
|
+
|
|
122
|
+
Call immediately after `wait_for_load()` on `.co.uk`, `.de`, or `glassdoor.com`:
|
|
123
|
+
|
|
124
|
+
```python
|
|
125
|
+
goto_url("https://www.indeed.co.uk/jobs?q=Python+developer&l=London")
|
|
126
|
+
wait_for_load()
|
|
127
|
+
wait(2)
|
|
128
|
+
dismiss_cookie_banner()
|
|
129
|
+
wait(1)
|
|
130
|
+
```
|
|
131
|
+
|
|
132
|
+
---
|
|
133
|
+
|
|
134
|
+
## Workflow 1: Indeed — search result card extraction
|
|
135
|
+
|
|
136
|
+
Each result card on Indeed carries a `data-jk` attribute (the job key). Use it to construct direct URLs.
|
|
137
|
+
|
|
138
|
+
```python
|
|
139
|
+
import json
|
|
140
|
+
from urllib.parse import quote_plus
|
|
141
|
+
|
|
142
|
+
query, location = "machine learning engineer", "New York"
|
|
143
|
+
goto_url(f"https://www.indeed.com/jobs?q={quote_plus(query)}&l={quote_plus(location)}")
|
|
144
|
+
wait_for_load()
|
|
145
|
+
wait(2)
|
|
146
|
+
dismiss_cookie_banner()
|
|
147
|
+
|
|
148
|
+
jobs = js("""
|
|
149
|
+
(function() {
|
|
150
|
+
// Cards live in <div data-jk="..."> or <li> with data-jk attribute
|
|
151
|
+
var cards = document.querySelectorAll('[data-jk]');
|
|
152
|
+
var out = [];
|
|
153
|
+
for (var i = 0; i < cards.length; i++) {
|
|
154
|
+
var c = cards[i];
|
|
155
|
+
var jk = c.getAttribute('data-jk') || '';
|
|
156
|
+
if (!jk) continue;
|
|
157
|
+
|
|
158
|
+
// Title
|
|
159
|
+
var titleEl = c.querySelector('h2.jobTitle span[title], h2.jobTitle span:not(.visually-hidden), [data-testid="job-title"]');
|
|
160
|
+
var title = titleEl ? titleEl.innerText.trim() : '';
|
|
161
|
+
|
|
162
|
+
// Company name
|
|
163
|
+
var compEl = c.querySelector('[data-testid="company-name"], .companyName, span[data-testid="company-name"]');
|
|
164
|
+
var company = compEl ? compEl.innerText.trim() : '';
|
|
165
|
+
|
|
166
|
+
// Location
|
|
167
|
+
var locEl = c.querySelector('[data-testid="text-location"], .companyLocation');
|
|
168
|
+
var location = locEl ? locEl.innerText.trim() : '';
|
|
169
|
+
|
|
170
|
+
// Salary — may not always be present in the card
|
|
171
|
+
var salEl = c.querySelector('[data-testid="attribute_snippet_testid"], .salary-snippet-container, .metadata.salary-snippet');
|
|
172
|
+
var salary = salEl ? salEl.innerText.trim() : '';
|
|
173
|
+
|
|
174
|
+
// Posting date / age
|
|
175
|
+
var dateEl = c.querySelector('[data-testid="myJobsStateDate"], span.date, .result-link-bar-container .date');
|
|
176
|
+
var posted = dateEl ? dateEl.innerText.trim() : '';
|
|
177
|
+
|
|
178
|
+
// Direct URL via job key
|
|
179
|
+
var url = 'https://www.indeed.com/viewjob?jk=' + jk;
|
|
180
|
+
|
|
181
|
+
if (title) {
|
|
182
|
+
out.push({jk, title, company, location, salary, posted, url});
|
|
183
|
+
}
|
|
184
|
+
}
|
|
185
|
+
return JSON.stringify(out);
|
|
186
|
+
})()
|
|
187
|
+
""")
|
|
188
|
+
|
|
189
|
+
results = json.loads(jobs)
|
|
190
|
+
for r in results:
|
|
191
|
+
print(r)
|
|
192
|
+
# Typically returns 10–15 cards per page
|
|
193
|
+
```
|
|
194
|
+
|
|
195
|
+
---
|
|
196
|
+
|
|
197
|
+
## Workflow 2: Indeed — pagination (multi-page extraction)
|
|
198
|
+
|
|
199
|
+
Indeed paginates using `&start=N` where N increments by 10 per page.
|
|
200
|
+
|
|
201
|
+
```python
|
|
202
|
+
import json
|
|
203
|
+
from urllib.parse import quote_plus
|
|
204
|
+
|
|
205
|
+
query, location = "data scientist", "remote"
|
|
206
|
+
base_url = f"https://www.indeed.com/jobs?q={quote_plus(query)}&l={quote_plus(location)}"
|
|
207
|
+
|
|
208
|
+
all_jobs = []
|
|
209
|
+
|
|
210
|
+
for page in range(3): # 3 pages = up to ~30 results
|
|
211
|
+
start = page * 10
|
|
212
|
+
url = base_url if start == 0 else f"{base_url}&start={start}"
|
|
213
|
+
goto_url(url)
|
|
214
|
+
wait_for_load()
|
|
215
|
+
wait(2) # mandatory — bot detection is aggressive on rapid loads
|
|
216
|
+
|
|
217
|
+
if page == 0:
|
|
218
|
+
dismiss_cookie_banner()
|
|
219
|
+
|
|
220
|
+
batch_json = js("""
|
|
221
|
+
(function() {
|
|
222
|
+
var cards = document.querySelectorAll('[data-jk]');
|
|
223
|
+
var out = [];
|
|
224
|
+
for (var i = 0; i < cards.length; i++) {
|
|
225
|
+
var c = cards[i];
|
|
226
|
+
var jk = c.getAttribute('data-jk') || '';
|
|
227
|
+
if (!jk) continue;
|
|
228
|
+
var titleEl = c.querySelector('h2.jobTitle span[title], [data-testid="job-title"]');
|
|
229
|
+
var compEl = c.querySelector('[data-testid="company-name"], .companyName');
|
|
230
|
+
var locEl = c.querySelector('[data-testid="text-location"], .companyLocation');
|
|
231
|
+
var salEl = c.querySelector('[data-testid="attribute_snippet_testid"], .salary-snippet-container');
|
|
232
|
+
var dateEl = c.querySelector('[data-testid="myJobsStateDate"], span.date');
|
|
233
|
+
out.push({
|
|
234
|
+
jk,
|
|
235
|
+
title: titleEl ? titleEl.innerText.trim() : '',
|
|
236
|
+
company: compEl ? compEl.innerText.trim() : '',
|
|
237
|
+
location: locEl ? locEl.innerText.trim() : '',
|
|
238
|
+
salary: salEl ? salEl.innerText.trim() : '',
|
|
239
|
+
posted: dateEl ? dateEl.innerText.trim() : '',
|
|
240
|
+
url: 'https://www.indeed.com/viewjob?jk=' + jk,
|
|
241
|
+
});
|
|
242
|
+
}
|
|
243
|
+
return JSON.stringify(out.filter(j => j.title));
|
|
244
|
+
})()
|
|
245
|
+
""")
|
|
246
|
+
|
|
247
|
+
batch = json.loads(batch_json)
|
|
248
|
+
if not batch:
|
|
249
|
+
break # no results on this page — stop
|
|
250
|
+
all_jobs.extend(batch)
|
|
251
|
+
|
|
252
|
+
print(f"Collected {len(all_jobs)} jobs across {page+1} pages")
|
|
253
|
+
```
|
|
254
|
+
|
|
255
|
+
**For `fromage` (date filter) + pagination**: keep the `fromage` param in the base URL:
|
|
256
|
+
```python
|
|
257
|
+
base_url = f"https://www.indeed.com/jobs?q={quote_plus(query)}&l={quote_plus(location)}&fromage=1"
|
|
258
|
+
```
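
When batches from several pages are merged, it may be worth dropping repeated job keys, as the job-key section does within a single page. A minimal sketch, assuming each job dict carries the `jk` field produced by the extraction above; `dedupe_by_jk` is a hypothetical helper name.

```python
def dedupe_by_jk(jobs):
    """Keep the first occurrence of each job key, preserving order."""
    seen = set()
    unique = []
    for job in jobs:
        jk = job.get("jk")
        if not jk or jk in seen:
            continue  # skip cards without a key and repeats
        seen.add(jk)
        unique.append(job)
    return unique


sample = [
    {"jk": "a1b2", "title": "Data Scientist"},
    {"jk": "c3d4", "title": "ML Engineer"},
    {"jk": "a1b2", "title": "Data Scientist"},  # same listing seen on a later page
]
print(len(dedupe_by_jk(sample)))
```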
|
|
259
|
+
|
|
260
|
+
---
|
|
261
|
+
|
|
262
|
+
## Workflow 3: Indeed — job detail page extraction
|
|
263
|
+
|
|
264
|
+
Fetch the full job description from the detail page. The `viewjob?jk=` URL is canonical and stable.
|
|
265
|
+
|
|
266
|
+
```python
|
|
267
|
+
import json, re
|
|
268
|
+
|
|
269
|
+
def get_indeed_job_detail(jk: str) -> dict:
|
|
270
|
+
"""Fetch full job details from an Indeed job key."""
|
|
271
|
+
goto_url(f"https://www.indeed.com/viewjob?jk={jk}")
|
|
272
|
+
wait_for_load()
|
|
273
|
+
wait(2)
|
|
274
|
+
|
|
275
|
+
detail = js("""
|
|
276
|
+
(function() {
|
|
277
|
+
// Title
|
|
278
|
+
var titleEl = document.querySelector('[data-testid="jobsearch-JobInfoHeader-title"], h1.jobsearch-JobInfoHeader-title');
|
|
279
|
+
var title = titleEl ? titleEl.innerText.trim() : '';
|
|
280
|
+
|
|
281
|
+
// Company
|
|
282
|
+
var compEl = document.querySelector('[data-testid="inlineHeader-companyName"] a, [data-company-name="true"]');
|
|
283
|
+
var company = compEl ? compEl.innerText.trim() : '';
|
|
284
|
+
|
|
285
|
+
// Location
|
|
286
|
+
var locEl = document.querySelector('[data-testid="inlineHeader-companyLocation"], [data-testid="job-location"]');
|
|
287
|
+
var location = locEl ? locEl.innerText.trim() : '';
|
|
288
|
+
|
|
289
|
+
// Salary — shown when available in header
|
|
290
|
+
var salEl = document.querySelector('[data-testid="jobsearch-OtherJobDetailsContainer"] [aria-label*="alary"], #salaryInfoAndJobType span');
|
|
291
|
+
var salary = salEl ? salEl.innerText.trim() : '';
|
|
292
|
+
|
|
293
|
+
// Full job description text
|
|
294
|
+
var descEl = document.getElementById('jobDescriptionText');
|
|
295
|
+
var description = descEl ? descEl.innerText.trim() : '';
|
|
296
|
+
|
|
297
|
+
// Job type (Full-time, Part-time, Contract, etc.)
|
|
298
|
+
var typeEl = document.querySelector('[data-testid="attribute_snippet_testid"]');
|
|
299
|
+
var jobType = typeEl ? typeEl.innerText.trim() : '';
|
|
300
|
+
|
|
301
|
+
// "Apply on company site" link — external application URL
|
|
302
|
+
var externalBtn = document.querySelector('[data-jk][href*="indeed.com/applystart"], a[href*="indeed.com/applystart"]');
|
|
303
|
+
var externalUrl = externalBtn ? externalBtn.href : '';
|
|
304
|
+
|
|
305
|
+
return JSON.stringify({title, company, location, salary, jobType, description, externalUrl});
|
|
306
|
+
})()
|
|
307
|
+
""")
|
|
308
|
+
return json.loads(detail)
|
|
309
|
+
|
|
310
|
+
# Example
|
|
311
|
+
detail = get_indeed_job_detail("abc123def4567890")  # placeholder job key (hex string)
|
|
312
|
+
print(detail["title"], "—", detail["salary"])
|
|
313
|
+
print(detail["description"][:500]) # first 500 chars
|
|
314
|
+
```
|
|
315
|
+
|
|
316
|
+
---
|
|
317
|
+
|
|
318
|
+
## Workflow 4: Glassdoor — search result extraction
|
|
319
|
+
|
|
320
|
+
Glassdoor shows a login modal after a few scrolls. Extract cards from the first visible load before triggering that wall.
|
|
321
|
+
|
|
322
|
+
```python
|
|
323
|
+
import json
|
|
324
|
+
from urllib.parse import quote_plus
|
|
325
|
+
|
|
326
|
+
query = "product manager"
|
|
327
|
+
goto_url(f"https://www.glassdoor.com/Job/jobs.htm?sc.keyword={quote_plus(query)}")
|
|
328
|
+
wait_for_load()
|
|
329
|
+
wait(3) # Glassdoor JS rendering takes longer
|
|
330
|
+
|
|
331
|
+
# Dismiss cookie banner if present
|
|
332
|
+
dismiss_cookie_banner()
|
|
333
|
+
|
|
334
|
+
# Extract cards before any scroll (avoid triggering login modal)
|
|
335
|
+
jobs = js("""
|
|
336
|
+
(function() {
|
|
337
|
+
// Glassdoor job cards: li[data-jobid] or li with a JobsList_jobListItem class
|
|
338
|
+
var cards = document.querySelectorAll('li[data-jobid], li[class*="JobsList_jobListItem"]');
|
|
339
|
+
if (!cards.length) {
|
|
340
|
+
// Fallback: try generic article cards
|
|
341
|
+
cards = document.querySelectorAll('[data-test="jobListing"], [id^="job-listing-"]');
|
|
342
|
+
}
|
|
343
|
+
var out = [];
|
|
344
|
+
for (var i = 0; i < cards.length; i++) {
|
|
345
|
+
var c = cards[i];
|
|
346
|
+
|
|
347
|
+
// Job ID (used for canonical URL)
|
|
348
|
+
var jobId = c.getAttribute('data-jobid') || c.getAttribute('data-id') || '';
|
|
349
|
+
|
|
350
|
+
// Title
|
|
351
|
+
var titleEl = c.querySelector('[data-test="job-title"], a[class*="JobCard_jobTitle"], .job-title');
|
|
352
|
+
var title = titleEl ? titleEl.innerText.trim() : '';
|
|
353
|
+
|
|
354
|
+
// Company
|
|
355
|
+
var compEl = c.querySelector('[data-test="employer-name"], [class*="JobCard_employer"], .employer-name');
|
|
356
|
+
var company = compEl ? compEl.innerText.trim() : '';
|
|
357
|
+
|
|
358
|
+
// Location
|
|
359
|
+
var locEl = c.querySelector('[data-test="emp-location"], [class*="JobCard_location"], .location');
|
|
360
|
+
var location = locEl ? locEl.innerText.trim() : '';
|
|
361
|
+
|
|
362
|
+
// Salary estimate (not always shown in card)
|
|
363
|
+
var salEl = c.querySelector('[data-test="detailSalary"], [class*="salary"], .salaryEstimate');
|
|
364
|
+
var salary = salEl ? salEl.innerText.trim() : '';
|
|
365
|
+
|
|
366
|
+
// Company rating
|
|
367
|
+
var ratingEl = c.querySelector('[data-test="rating"], [class*="ratingNumber"], .rating');
|
|
368
|
+
var rating = ratingEl ? ratingEl.innerText.trim() : '';
|
|
369
|
+
|
|
370
|
+
// Canonical URL
|
|
371
|
+
var linkEl = c.querySelector('a[href*="/job-listing/"], a[href*="glassdoor.com/job"]');
|
|
372
|
+
var url = linkEl ? linkEl.href : (jobId ? 'https://www.glassdoor.com/job-listing/glassdoor-jl' + jobId + '.htm' : '');
|
|
373
|
+
|
|
374
|
+
if (title) out.push({jobId, title, company, location, salary, rating, url});
|
|
375
|
+
}
|
|
376
|
+
return JSON.stringify(out);
|
|
377
|
+
})()
|
|
378
|
+
""")
|
|
379
|
+
|
|
380
|
+
results = json.loads(jobs)
|
|
381
|
+
for r in results:
|
|
382
|
+
print(r)
|
|
383
|
+
```
|
|
384
|
+
|
|
385
|
+
**If `results` is an empty list**, Glassdoor has probably changed its DOM structure or served a login wall. Take a screenshot and inspect:
|
|
386
|
+
|
|
387
|
+
```python
|
|
388
|
+
capture_screenshot()
|
|
389
|
+
# Look for the actual card selector, then update the querySelectorAll above
|
|
390
|
+
```
|
|
391
|
+
|
|
392
|
+
---
|
|
393
|
+
|
|
394
|
+
## Workflow 5: Glassdoor — handling the login wall
|
|
395
|
+
|
|
396
|
+
Glassdoor increasingly shows a login modal after viewing a few listings. Detect and dismiss it.
|
|
397
|
+
|
|
398
|
+
```python
|
|
399
|
+
def dismiss_glassdoor_login_modal():
|
|
400
|
+
"""Close the Glassdoor sign-in / register modal if it appears."""
|
|
401
|
+
closed = js("""
|
|
402
|
+
(function() {
|
|
403
|
+
// Close button on the modal
|
|
404
|
+
var closeBtn = document.querySelector(
|
|
405
|
+
'[alt="Close"], button[class*="modal_closeIcon"], [data-test="close-modal"]'
|
|
406
|
+
);
|
|
407
|
+
if (closeBtn && closeBtn.offsetParent !== null) {
|
|
408
|
+
closeBtn.click();
|
|
409
|
+
return 'closed';
|
|
410
|
+
}
|
|
411
|
+
// Sometimes the modal has an X with aria-label
|
|
412
|
+
var ariaClose = document.querySelector('[aria-label="Close"]');
|
|
413
|
+
if (ariaClose && ariaClose.offsetParent !== null) {
|
|
414
|
+
ariaClose.click();
|
|
415
|
+
return 'aria-closed';
|
|
416
|
+
}
|
|
417
|
+
return null;
|
|
418
|
+
})()
|
|
419
|
+
""")
|
|
420
|
+
if closed:
|
|
421
|
+
wait(1)
|
|
422
|
+
return closed
|
|
423
|
+
|
|
424
|
+
# Strategy: extract as much as possible before the modal appears
|
|
425
|
+
# If the modal blocks results, dismiss it and try again
|
|
426
|
+
result = dismiss_glassdoor_login_modal()
|
|
427
|
+
if result:
|
|
428
|
+
wait(1)
|
|
429
|
+
# Re-run extraction after dismissal
|
|
430
|
+
```
|
|
431
|
+
|
|
432
|
+
If the modal is persistent and cannot be closed, switch to Indeed for the same search — it has more accessible public results.
|
|
433
|
+
|
|
434
|
+
---
|
|
435
|
+
|
|
436
|
+
## Workflow 6: Stepstone (German) — job extraction
|
|
437
|
+
|
|
438
|
+
Stepstone is server-rendered. Most data can be extracted with `http_get` for speed, or via `goto_url` + `js()` for dynamic content.
|
|
439
|
+
|
|
440
|
+
```python
|
|
441
|
+
import json, re
|
|
442
|
+
from urllib.parse import quote_plus
|
|
443
|
+
|
|
444
|
+
keyword = "Sachbearbeiter Einkauf"
|
|
445
|
+
city = "Regensburg"
|
|
446
|
+
|
|
447
|
+
# Stepstone encodes keyword/city in the path
|
|
448
|
+
kw_path = keyword.replace(" ", "-")
|
|
449
|
+
city_path = city.replace(" ", "-")
|
|
450
|
+
|
|
451
|
+
goto_url(f"https://www.stepstone.de/jobs/{kw_path}/in-{city_path}.html")
|
|
452
|
+
wait_for_load()
|
|
453
|
+
wait(2)
|
|
454
|
+
dismiss_cookie_banner()
|
|
455
|
+
|
|
456
|
+
jobs = js("""
|
|
457
|
+
(function() {
|
|
458
|
+
// Stepstone result cards
|
|
459
|
+
var cards = document.querySelectorAll(
|
|
460
|
+
'article[data-at="job-item"], [data-genesis-element="JOB_CARD"], article.sc-fhzFiK'
|
|
461
|
+
);
|
|
462
|
+
var out = [];
|
|
463
|
+
for (var i = 0; i < cards.length; i++) {
|
|
464
|
+
var c = cards[i];
|
|
465
|
+
|
|
466
|
+
// Title
|
|
467
|
+
var titleEl = c.querySelector('h2[data-at="job-item-title"] a, [data-at="job-title"], .listing__title a');
|
|
468
|
+
var title = titleEl ? titleEl.innerText.trim() : '';
|
|
469
|
+
var url = titleEl ? (titleEl.href || '') : '';
|
|
470
|
+
|
|
471
|
+
// Company
|
|
472
|
+
var compEl = c.querySelector('[data-at="job-item-company-name"], [data-at="company-name"], .listing__company');
|
|
473
|
+
var company = compEl ? compEl.innerText.trim() : '';
|
|
474
|
+
|
|
475
|
+
// Location
|
|
476
|
+
var locEl = c.querySelector('[data-at="job-item-location"], .listing__location');
|
|
477
|
+
var location = locEl ? locEl.innerText.trim() : '';
|
|
478
|
+
|
|
479
|
+
// Posting date
|
|
480
|
+
var dateEl = c.querySelector('[data-at="job-posting-date"], time, .listing__date');
|
|
481
|
+
var posted = dateEl ? (dateEl.getAttribute('datetime') || dateEl.innerText.trim()) : '';
|
|
482
|
+
|
|
483
|
+
if (title) out.push({title, company, location, posted, url});
|
|
484
|
+
}
|
|
485
|
+
return JSON.stringify(out);
|
|
486
|
+
})()
|
|
487
|
+
""")
|
|
488
|
+
|
|
489
|
+
results = json.loads(jobs)
|
|
490
|
+
for r in results:
|
|
491
|
+
print(r)
|
|
492
|
+
```
|
|
493
|
+
|
|
494
|
+
### Stepstone pagination
|
|
495
|
+
|
|
496
|
+
```python
|
|
497
|
+
import json
|
|
498
|
+
|
|
499
|
+
all_jobs = []
|
|
500
|
+
for page in range(1, 4): # pages 1-3
|
|
501
|
+
if page == 1:
|
|
502
|
+
url = f"https://www.stepstone.de/jobs/{kw_path}/in-{city_path}.html"
|
|
503
|
+
else:
|
|
504
|
+
url = f"https://www.stepstone.de/jobs/{kw_path}/in-{city_path}/page-{page}.html"
|
|
505
|
+
|
|
506
|
+
goto_url(url)
|
|
507
|
+
wait_for_load()
|
|
508
|
+
wait(2)
|
|
509
|
+
|
|
510
|
+
if page == 1:
|
|
511
|
+
dismiss_cookie_banner()
|
|
512
|
+
|
|
513
|
+
batch_json = js("""
|
|
514
|
+
(function() {
|
|
515
|
+
var cards = document.querySelectorAll('article[data-at="job-item"], [data-genesis-element="JOB_CARD"]');
|
|
516
|
+
var out = [];
|
|
517
|
+
for (var i = 0; i < cards.length; i++) {
|
|
518
|
+
var c = cards[i];
|
|
519
|
+
var titleEl = c.querySelector('[data-at="job-item-title"] a, [data-at="job-title"]');
|
|
520
|
+
var compEl = c.querySelector('[data-at="job-item-company-name"]');
|
|
521
|
+
var locEl = c.querySelector('[data-at="job-item-location"]');
|
|
522
|
+
var dateEl = c.querySelector('time');
|
|
523
|
+
out.push({
|
|
524
|
+
title: titleEl ? titleEl.innerText.trim() : '',
|
|
525
|
+
company: compEl ? compEl.innerText.trim() : '',
|
|
526
|
+
location: locEl ? locEl.innerText.trim() : '',
|
|
527
|
+
posted: dateEl ? dateEl.getAttribute('datetime') || dateEl.innerText.trim() : '',
|
|
528
|
+
url: titleEl ? titleEl.href : '',
|
|
529
|
+
});
|
|
530
|
+
}
|
|
531
|
+
return JSON.stringify(out.filter(j => j.title));
|
|
532
|
+
})()
|
|
533
|
+
""")
|
|
534
|
+
|
|
535
|
+
batch = json.loads(batch_json)
|
|
536
|
+
if not batch:
|
|
537
|
+
break
|
|
538
|
+
all_jobs.extend(batch)
|
|
539
|
+
|
|
540
|
+
print(f"Stepstone: {len(all_jobs)} jobs collected")
|
|
541
|
+
```
|
|
542
|
+
|
|
543
|
+
---
|
|
544
|
+
|
|
545
|
+
## Indeed job key (jk) — direct URL construction
|
|
546
|
+
|
|
547
|
+
Indeed search result links go through a tracking redirect. **Do not use those redirect URLs.** Instead, extract the `data-jk` attribute directly for the stable canonical URL.
|
|
548
|
+
|
|
549
|
+
```python
|
|
550
|
+
# Correct approach: extract data-jk from the card
|
|
551
|
+
job_keys = js("""
|
|
552
|
+
JSON.stringify(
|
|
553
|
+
Array.from(document.querySelectorAll('[data-jk]'))
|
|
554
|
+
.map(el => el.getAttribute('data-jk'))
|
|
555
|
+
.filter(jk => jk && jk.length > 0)
|
|
556
|
+
.filter((jk, i, arr) => arr.indexOf(jk) === i) // dedupe
|
|
557
|
+
)
|
|
558
|
+
""")
|
|
559
|
+
import json
|
|
560
|
+
jks = json.loads(job_keys)
|
|
561
|
+
|
|
562
|
+
# Canonical job detail URL for any job key:
|
|
563
|
+
for jk in jks:
|
|
564
|
+
direct_url = f"https://www.indeed.com/viewjob?jk={jk}"
|
|
565
|
+
print(direct_url)
|
|
566
|
+
```
|
|
567
|
+
|
|
568
|
+
If you already have a redirect URL and need to extract the `jk` from it:
|
|
569
|
+
|
|
570
|
+
```python
|
|
571
|
+
import re
|
|
572
|
+
def extract_jk(url: str) -> str | None:
|
|
573
|
+
m = re.search(r'[?&]jk=([a-f0-9]+)', url)
|
|
574
|
+
return m.group(1) if m else None
|
|
575
|
+
```
|
|
576
|
+
|
|
577
|
+
---
|
|
578
|
+
|
|
579
|
+
## Salary extraction and normalization
|
|
580
|
+
|
|
581
|
+
Salary appears in different places and formats depending on the job and site.
|
|
582
|
+
|
|
583
|
+
### Indeed salary patterns
|
|
584
|
+
|
|
585
|
+
```python
|
|
586
|
+
import re
|
|
587
|
+
|
|
588
|
+
def parse_indeed_salary(raw: str) -> dict:
|
|
589
|
+
"""
|
|
590
|
+
Parse Indeed salary strings like:
|
|
591
|
+
"$85,000 - $110,000 a year"
|
|
592
|
+
"Up to $65 an hour"
|
|
593
|
+
"$25 - $30 an hour"
|
|
594
|
+
"From $120,000 a year"
|
|
595
|
+
"Employer est.: $90,000 - $120,000 a year"
|
|
596
|
+
Returns: {raw, low, high, period, source}
|
|
597
|
+
"""
|
|
598
|
+
if not raw:
|
|
599
|
+
return {"raw": raw, "low": None, "high": None, "period": None, "source": None}
|
|
600
|
+
|
|
601
|
+
source = None
|
|
602
|
+
if "Employer est." in raw:
|
|
603
|
+
source = "employer"
|
|
604
|
+
raw = raw.replace("Employer est.:", "").strip()
|
|
605
|
+
elif "Glassdoor est." in raw:
|
|
606
|
+
source = "glassdoor"
|
|
607
|
+
raw = raw.replace("Glassdoor est.:", "").replace("(Glassdoor est.)", "").strip()
|
|
608
|
+
|
|
609
|
+
raw_clean = raw.replace(",", "")
|
|
610
|
+
|
|
611
|
+
# Period
|
|
612
|
+
period = None
|
|
613
|
+
if "a year" in raw or "per year" in raw or "/yr" in raw:
|
|
614
|
+
period = "year"
|
|
615
|
+
elif "an hour" in raw or "per hour" in raw or "/hr" in raw:
|
|
616
|
+
period = "hour"
|
|
617
|
+
elif "a month" in raw or "per month" in raw:
|
|
618
|
+
period = "month"
|
|
619
|
+
|
|
620
|
+
# Range: two dollar amounts
|
|
621
|
+
range_m = re.findall(r'\$?([\d]+(?:\.\d+)?)', raw_clean)
|
|
622
|
+
low = float(range_m[0]) if len(range_m) >= 1 else None
|
|
623
|
+
high = float(range_m[1]) if len(range_m) >= 2 else low
|
|
624
|
+
|
|
625
|
+
return {"raw": raw, "low": low, "high": high, "period": period, "source": source}
|
|
626
|
+
|
|
627
|
+
# Examples
|
|
628
|
+
parse_indeed_salary("$85,000 - $110,000 a year")
|
|
629
|
+
# -> {"raw": "$85,000 - $110,000 a year", "low": 85000.0, "high": 110000.0, "period": "year", "source": None}
|
|
630
|
+
|
|
631
|
+
parse_indeed_salary("Employer est.: $90,000 - $120,000 a year")
|
|
632
|
+
# -> {"raw": "$90,000 - $120,000 a year", "low": 90000.0, "high": 120000.0, "period": "year", "source": "employer"}
|
|
633
|
+
|
|
634
|
+
parse_indeed_salary("Up to $65 an hour")
|
|
635
|
+
# -> {"raw": "Up to $65 an hour", "low": 65.0, "high": 65.0, "period": "hour", "source": None}
|
|
636
|
+
```
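
To compare hourly, monthly, and yearly figures, the parsed result can be converted to one period. A sketch under a loud assumption: hourly pay is scaled by 2,080 hours (40 hours × 52 weeks), which will not fit part-time or contract roles. `annualize` is a hypothetical helper that takes the dict shape returned by `parse_indeed_salary`.

```python
HOURS_PER_YEAR = 40 * 52  # assumption: full-time schedule, 2080 hours


def annualize(parsed):
    """Return (low, high) as yearly amounts, or (None, None) if unknown."""
    factor = {"year": 1, "month": 12, "hour": HOURS_PER_YEAR}.get(parsed.get("period"))
    low = parsed.get("low")
    if factor is None or low is None:
        return (None, None)
    high = parsed.get("high")
    if high is None:
        high = low
    return (low * factor, high * factor)


print(annualize({"low": 25.0, "high": 30.0, "period": "hour"}))
# -> (52000.0, 62400.0)
```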
|
|
637
|
+
|
|
638
|
+
### Glassdoor salary note
|
|
639
|
+
|
|
640
|
+
Glassdoor shows two types of salary estimates:
|
|
641
|
+
- **"Employer est."** — the company provided a range in the job post
|
|
642
|
+
- **"Glassdoor est."** — Glassdoor estimated based on similar roles; shown with "(est.)" in the card
|
|
643
|
+
|
|
644
|
+
Both are shown as text inside the card. Parse the same way as Indeed.
|
|
645
|
+
|
|
646
|
+
If the salary is absent in the search result card, it is only available on the job detail page (requires a click through to the individual listing).
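
Glassdoor cards often abbreviate thousands, as in "$90K - $120K (Glassdoor est.)". `parse_indeed_salary` as written does not scale the `K` suffix (it would return 90.0), so a suffix-aware variant may be needed. A minimal sketch; `parse_glassdoor_salary` is a hypothetical name, and the only formats it assumes are the ones quoted in this document.

```python
import re


def parse_glassdoor_salary(raw):
    """Parse strings like "$90K - $120K (Glassdoor est.)" into numbers."""
    if not raw:
        return {"low": None, "high": None, "source": None}

    source = None
    if "Employer est." in raw:
        source = "employer"
    elif "Glassdoor est." in raw:
        source = "glassdoor"

    amounts = []
    # A dollar amount, optionally followed directly by K (thousands)
    for number, suffix in re.findall(r'\$\s*([\d,]+(?:\.\d+)?)([Kk]?)', raw):
        value = float(number.replace(",", ""))
        if suffix:
            value *= 1_000
        amounts.append(value)

    low = amounts[0] if amounts else None
    high = amounts[1] if len(amounts) > 1 else low
    return {"low": low, "high": high, "source": source}


print(parse_glassdoor_salary("$90K - $120K (Glassdoor est.)"))
# -> {'low': 90000.0, 'high': 120000.0, 'source': 'glassdoor'}
```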
|
|
647
|
+
|
|
648
|
+
---
|
|
649
|
+
|
|
650
|
+
## Date normalization ("3 days ago" → actual date)
|
|
651
|
+
|
|
652
|
+
All three sites use relative timestamps. Convert to absolute dates when needed.
|
|
653
|
+
|
|
654
|
+
```python
|
|
655
|
+
import re
|
|
656
|
+
from datetime import datetime, timedelta
|
|
657
|
+
|
|
658
|
+
def parse_relative_date(text: str, reference_date: datetime | None = None) -> datetime | None:
|
|
659
|
+
"""
|
|
660
|
+
Convert relative job posting dates to datetime objects.
|
|
661
|
+
Handles: "Just posted", "Today", "1 day ago", "3 days ago", "30+ days ago"
|
|
662
|
+
"""
|
|
663
|
+
if reference_date is None:
|
|
664
|
+
reference_date = datetime.utcnow()
|
|
665
|
+
|
|
666
|
+
text = text.strip().lower()
|
|
667
|
+
|
|
668
|
+
if not text or text in ("", "unknown"):
|
|
669
|
+
return None
|
|
670
|
+
if text in ("just posted", "today", "active today"):
|
|
671
|
+
return reference_date
|
|
672
|
+
if "hour" in text:
|
|
673
|
+
m = re.search(r'(\d+)', text)
|
|
674
|
+
hours = int(m.group(1)) if m else 1
|
|
675
|
+
return reference_date - timedelta(hours=hours)
|
|
676
|
+
if "day" in text:
|
|
677
|
+
m = re.search(r'(\d+)', text)
|
|
678
|
+
days = int(m.group(1)) if m else 1
|
|
679
|
+
return reference_date - timedelta(days=days)
|
|
680
|
+
if "week" in text:
|
|
681
|
+
m = re.search(r'(\d+)', text)
|
|
682
|
+
weeks = int(m.group(1)) if m else 1
|
|
683
|
+
return reference_date - timedelta(weeks=weeks)
|
|
684
|
+
if "month" in text:
|
|
685
|
+
m = re.search(r'(\d+)', text)
|
|
686
|
+
months = int(m.group(1)) if m else 1
|
|
687
|
+
return reference_date - timedelta(days=months * 30)
|
|
688
|
+
if "30+" in text:
|
|
689
|
+
return reference_date - timedelta(days=30)
|
|
690
|
+
|
|
691
|
+
return None # unparseable
|
|
692
|
+
|
|
693
|
+
# Examples
|
|
694
|
+
parse_relative_date("3 days ago") # datetime ~3 days before now
|
|
695
|
+
parse_relative_date("Just posted") # datetime.utcnow()
|
|
696
|
+
parse_relative_date("30+ days ago") # datetime 30 days ago
|
|
697
|
+
```
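
`datetime.utcnow()` returns a naive datetime and is deprecated as of Python 3.12. Where timezone-aware values are preferred, a compact variant could look like the sketch below. `posting_date_utc` is a hypothetical name; it keeps the 30-days-per-month approximation of `parse_relative_date` and, unlike that function, returns `None` for phrases without a digit.

```python
import re
from datetime import datetime, timedelta, timezone

UNIT_TO_DAYS = {"day": 1, "week": 7, "month": 30}  # month approximated as 30 days


def posting_date_utc(text, reference=None):
    """Convert a relative posting age to an aware UTC datetime, or None."""
    reference = reference or datetime.now(timezone.utc)
    text = (text or "").strip().lower()
    if text in ("just posted", "today", "active today"):
        return reference
    # Matches "3 days ago", "30+ days ago", "2 weeks ago", "5 hours ago"
    m = re.search(r'(\d+)\+?\s*(hour|day|week|month)', text)
    if not m:
        return None
    count, unit = int(m.group(1)), m.group(2)
    if unit == "hour":
        return reference - timedelta(hours=count)
    return reference - timedelta(days=count * UNIT_TO_DAYS[unit])


ref = datetime(2024, 5, 31, 12, 0, tzinfo=timezone.utc)
print(posting_date_utc("3 days ago", ref).isoformat())
# -> 2024-05-28T12:00:00+00:00
```

Passing a fixed `reference` keeps results reproducible when the same scraped text is reprocessed later.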
|
|
698
|
+
|
|
699
|
+
---
|
|
700
|
+
|
|
701
|
+
## Workflow 7: Fast bulk extraction with `http_get` (no browser)
|
|
702
|
+
|
|
703
|
+
For Indeed, the raw HTML of search results contains structured JSON in a `window.mosaic.providerData` script tag. This is faster and more reliable than DOM extraction.
|
|
704
|
+
|
|
705
|
+
```python
|
|
706
|
+
import json, re
|
|
707
|
+
from urllib.parse import quote_plus
|
|
708
|
+
|
|
709
|
+
def indeed_http_search(query: str, location: str = "", fromage: int = 0, start: int = 0) -> list[dict]:
    """
    Extract Indeed jobs via HTTP (no browser). Parses the embedded JSON payload.
    Returns up to ~15 jobs per call.
    """
    params = f"q={quote_plus(query)}&l={quote_plus(location)}&start={start}"
    if fromage:
        params += f"&fromage={fromage}"

    html = http_get(
        f"https://www.indeed.com/jobs?{params}",
        headers={
            "Accept-Language": "en-US,en;q=0.9",
            "Accept": "text/html,application/xhtml+xml",
        }
    )

    # Check for CAPTCHA before parsing
    if "captcha" in html.lower() or "robot check" in html.lower():
        return []  # fall back to browser-based extraction

    # Indeed embeds job data in window.mosaic.providerData["mosaic-provider-jobcards"]
    # Note: the non-greedy match ends at the first "};". If that sequence occurs inside
    # a string value, the capture is truncated and json.loads() fails below.
    m = re.search(
        r'window\.mosaic\.providerData\["mosaic-provider-jobcards"\]\s*=\s*(\{.*?\});',
        html, re.DOTALL
    )
    if not m:
        return []

    try:
        data = json.loads(m.group(1))
    except json.JSONDecodeError:
        return []

    results_list = (
        data
        .get("metaData", {})
        .get("mosaicProviderJobCardsModel", {})
        .get("results", [])
    )

    jobs = []
    for r in results_list:
        jk = r.get("jobkey", "")
        jobs.append({
            "jk": jk,
            "title": r.get("title", ""),
            "company": r.get("company", ""),
            "location": r.get("formattedLocation", ""),
            # salarySnippet may be missing or null, so guard before reading "text"
            "salary": (r.get("salarySnippet") or {}).get("text", ""),
            "posted": r.get("formattedRelativeTime", ""),
            "url": f"https://www.indeed.com/viewjob?jk={jk}",
            "snippet": r.get("snippet", ""),  # short description preview
        })
    return jobs

# Example: remote jobs posted in the last 24h
jobs = indeed_http_search("software engineer", "remote", fromage=1)
for j in jobs:
    print(j["title"], "|", j["company"], "|", j["salary"])
```

If `indeed_http_search` returns an empty list (CAPTCHA, a changed page structure, or a payload that fails to parse), fall back to the `goto` + `js()` browser workflow above.
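
The fallback rule above can be sketched as a small wrapper. This is a minimal illustration, not part of the skill's helpers: `search_with_fallback` is a hypothetical name, and both search paths are passed in as callables so the wrapper makes no assumptions about how they are implemented. In practice the first callable might wrap `indeed_http_search(...)` and the second a browser-based collector.

```python
from typing import Callable


def search_with_fallback(
    http_search: Callable[[], list[dict]],
    browser_search: Callable[[], list[dict]],
) -> tuple[list[dict], str]:
    """Try the cheap HTTP path first; use the browser path only if it yields nothing.

    Returns (jobs, source) where source is "http" or "browser".
    """
    try:
        jobs = http_search()
    except Exception:
        # Assumption: a network or parsing error is treated like an empty result,
        # so the browser path still gets a chance to run.
        jobs = []
    if jobs:
        return jobs, "http"
    return browser_search(), "browser"


# Hypothetical usage with stand-in callables:
jobs, source = search_with_fallback(
    http_search=lambda: [],                 # e.g. CAPTCHA hit, nothing parsed
    browser_search=lambda: [{"jk": "abc"}]  # e.g. DOM-based extraction
)
print(source, len(jobs))
```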

---

## Workflow 8: "Easy Apply" vs external application detection

Some Indeed listings accept applications directly on Indeed ("Easy Apply"), while others redirect to the company site. Detect which type a listing is before deciding how to proceed.

```python
def get_application_type(jk: str) -> dict:
    """Returns {type: 'easy_apply'|'external'|'unknown', externalUrl: str|None}"""
    goto_url(f"https://www.indeed.com/viewjob?jk={jk}")
    wait_for_load()
    wait(2)

    return js("""
    (function() {
      // "Apply now" button pointing to /applystart = Indeed-hosted Easy Apply
      var easyBtn = document.querySelector(
        'button[data-testid="applyButton"], [id="indeedApplyButton"], button[class*="IndeedApplyButton"]'
      );
      // "Apply on company site" button
      var extBtn = document.querySelector(
        'a[data-testid="applyButton"][href*="indeed.com/applystart"], a[href*="indeed.com/applystart"]'
      );
      // External redirect: check the main CTA
      var mainCta = document.querySelector('[data-testid="applyButton"]');
      // <button> elements have no href, so default to '' to keep .includes() safe
      var ctaHref = (mainCta && mainCta.href) || '';

      if (easyBtn && !ctaHref.includes('apply.indeed')) {
        return {type: 'easy_apply', externalUrl: null};
      }
      if (extBtn || (ctaHref && !ctaHref.includes('indeed.com/viewjob'))) {
        return {type: 'external', externalUrl: ctaHref || null};
      }
      return {type: 'unknown', externalUrl: null};
    })()
    """)
```
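
One way to use the detector on a batch of search results is sketched below. It is an illustration under assumptions: `annotate_application_types` is a hypothetical helper, `lookup` stands in for `get_application_type`, and `pause` stands in for whatever wait you use between detail-page loads. Passing both in keeps the sketch runnable without a browser.

```python
from typing import Callable


def annotate_application_types(
    jobs: list[dict],
    lookup: Callable[[str], dict],
    pause: Callable[[], None] = lambda: None,
) -> list[dict]:
    """Return copies of jobs with 'apply_type' and 'external_url' keys added."""
    annotated = []
    for i, job in enumerate(jobs):
        if i:
            pause()  # space out detail-page loads; skipped before the first job
        info = lookup(job["jk"]) or {}
        annotated.append({
            **job,
            "apply_type": info.get("type", "unknown"),
            "external_url": info.get("externalUrl"),  # None for Easy Apply listings
        })
    return annotated


# Hypothetical usage with a stand-in lookup instead of the real browser call:
fake_results = {
    "a1": {"type": "easy_apply", "externalUrl": None},
    "b2": {"type": "external", "externalUrl": "https://example.com/careers/1"},
}
annotated = annotate_application_types(
    [{"jk": "a1"}, {"jk": "b2"}, {"jk": "c3"}],
    lookup=lambda jk: fake_results.get(jk),
)
external_only = [j for j in annotated if j["apply_type"] == "external"]
```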

---

## Bot detection and rate limiting

Indeed and Glassdoor run active bot detection. Exceeding the limits below leads to CAPTCHA walls, IP blocks, or silently degraded results (cards with empty fields).

### Safe request cadence

```python
# Minimum wait between page loads
INTER_PAGE_WAIT = 2.5    # seconds; don't go below 2

# Between job detail page fetches
INTER_DETAIL_WAIT = 3.0  # seconds

# http_get concurrency limit
MAX_HTTP_CONCURRENT = 2  # never more than 2 at once for Indeed/Glassdoor
```
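
A minimal sketch of how a minimum gap such as `INTER_PAGE_WAIT` could be enforced in code. The `Throttle` class is hypothetical, and the random jitter is an added assumption rather than something the limits above require. The clock and sleep functions are injectable so the behavior can be checked without real waiting.

```python
import random
import time


class Throttle:
    """Enforce a minimum gap between calls, plus optional random jitter."""

    def __init__(self, min_interval: float, jitter: float = 0.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.jitter = jitter      # assumption: extra 0..jitter seconds per call
        self._clock = clock
        self._sleep = sleep
        self._last = None         # time of the previous call; None before the first

    def wait(self) -> float:
        """Sleep as long as needed; return the number of seconds slept."""
        delay = 0.0
        if self._last is not None:
            target = self.min_interval + random.uniform(0.0, self.jitter)
            delay = max(0.0, target - (self._clock() - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay


# Hypothetical usage: call page_throttle.wait() right before each page load.
page_throttle = Throttle(min_interval=2.5, jitter=1.0)
```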

### CAPTCHA detection

```python
def is_captcha_page() -> bool:
    """Check if the current page is a CAPTCHA or block page."""
    url = page_info()["url"]
    title = js("document.title") or ""
    body_text = js("document.body ? document.body.innerText.substring(0, 500) : ''") or ""

    signals = [
        "captcha" in url.lower(),
        "robot" in title.lower(),
        "are you a human" in body_text.lower(),
        "verify you are human" in body_text.lower(),
        "unusual traffic" in body_text.lower(),
        "indeed.com/error" in url,
        "sorry" in title.lower() and "indeed" in url,
    ]
    return any(signals)

# Use after every goto:
goto_url(some_url)
wait_for_load()
wait(2)
if is_captcha_page():
    capture_screenshot()
    # Wait longer and retry once
    wait(10)
    goto_url(some_url)
    wait_for_load()
    wait(3)
```
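
The retry step above does not check whether the second load succeeded. A hedged sketch of a loop that does is shown here. `load_with_captcha_retry` is a hypothetical helper; `load`, `is_blocked`, and `sleep` are stand-ins for the page-load sequence, `is_captcha_page`, and `wait`. The default delays of 10 and 30 seconds are assumptions loosely based on the waits mentioned in this section.

```python
from typing import Callable


def load_with_captcha_retry(
    load: Callable[[], None],
    is_blocked: Callable[[], bool],
    sleep: Callable[[float], None],
    delays: tuple[float, ...] = (10.0, 30.0),
) -> bool:
    """Load a page; while it is blocked, wait and reload.

    Returns True as soon as the page is clear, False if it is still
    blocked after one reload per entry in `delays`.
    """
    load()
    for delay in delays:
        if not is_blocked():
            return True
        sleep(delay)
        load()
    return not is_blocked()


# Hypothetical usage with stand-ins: blocked on the first load, clear on the second.
state = {"loads": 0}
slept: list[float] = []

def fake_load() -> None:
    state["loads"] += 1

ok = load_with_captcha_retry(
    load=fake_load,
    is_blocked=lambda: state["loads"] < 2,
    sleep=slept.append,
)
```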

### Glassdoor session hygiene

Glassdoor's bot detection relies more heavily on browser fingerprinting. If results stop loading:

1. Call `capture_screenshot()` to confirm whether you are looking at a login modal or a block page.
2. Dismiss any login modal first (`dismiss_glassdoor_login_modal()`).
3. If a block page appears, pause 30+ seconds before retrying.
4. Switch to Indeed for the same query; results are similar and its bot tolerance is higher.

---

## Filtering by date, job type, and salary

### Indeed URL filter parameters

```python
from urllib.parse import quote_plus

def build_indeed_url(
    query: str,
    location: str = "",
    fromage: int = 0,      # days: 1=last 24h, 3=last 3 days, 7=last week
    job_type: str = "",    # "fulltime", "parttime", "contract", "internship", "temporary"
    remote: bool = False,
    start: int = 0,
) -> str:
    base = f"https://www.indeed.com/jobs?q={quote_plus(query)}&l={quote_plus(location)}"
    if fromage:
        base += f"&fromage={fromage}"
    if job_type:
        base += f"&jt={job_type}"
    if remote:
        base += "&remotejob=032b3046-06a3-4876-8dfd-474eb5e7ed11"
    if start:
        base += f"&start={start}"
    return base

# Examples
url = build_indeed_url("backend engineer", "Austin, TX", fromage=7, job_type="fulltime")
url = build_indeed_url("data analyst", remote=True, fromage=1)
```
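
The same query string can also be assembled with `urllib.parse.urlencode`, which applies `quote_plus` to every value by default. The sketch below is an alternative shape, not a replacement for `build_indeed_url`: `build_indeed_url_encoded` is a hypothetical name, and filters are passed by their raw parameter names (`fromage`, `jt`, `remotejob`, `start`) as used above.

```python
from urllib.parse import urlencode


def build_indeed_url_encoded(query: str, location: str = "", **filters: object) -> str:
    """Build an Indeed search URL; filters with falsy values are omitted."""
    params: dict[str, object] = {"q": query, "l": location}
    # Keep only filters that were actually set (non-empty, non-zero).
    params.update({name: value for name, value in filters.items() if value})
    return "https://www.indeed.com/jobs?" + urlencode(params)


# Hypothetical usage, mirroring the first example above:
url = build_indeed_url_encoded("backend engineer", "Austin, TX", fromage=7, jt="fulltime")
print(url)
```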

---

## Collecting N results across pages

```python
import json
from urllib.parse import quote_plus

def collect_indeed_jobs(query: str, location: str = "", max_results: int = 20,
                        fromage: int = 0, job_type: str = "") -> list[dict]:
    """
    Collect up to max_results jobs from Indeed across multiple pages.
    Waits between pages to avoid bot detection.
    """
    all_jobs = []
    seen_jks = set()
    page = 0

    while len(all_jobs) < max_results:
        start = page * 10
        url = build_indeed_url(query, location, fromage=fromage, job_type=job_type, start=start)
        goto_url(url)
        wait_for_load()
        wait(2.5)

        if page == 0:
            dismiss_cookie_banner()

        if is_captcha_page():
            print(f"CAPTCHA on page {page+1}, stopping")
            break

        batch_json = js("""
        (function() {
          var cards = document.querySelectorAll('[data-jk]');
          var out = [];
          for (var i = 0; i < cards.length; i++) {
            var c = cards[i];
            var jk = c.getAttribute('data-jk') || '';
            if (!jk) continue;
            var titleEl = c.querySelector('h2.jobTitle span[title], [data-testid="job-title"]');
            var compEl  = c.querySelector('[data-testid="company-name"], .companyName');
            var locEl   = c.querySelector('[data-testid="text-location"], .companyLocation');
            var salEl   = c.querySelector('[data-testid="attribute_snippet_testid"], .salary-snippet-container');
            var dateEl  = c.querySelector('[data-testid="myJobsStateDate"], span.date');
            out.push({
              jk,
              title:    titleEl ? titleEl.innerText.trim() : '',
              company:  compEl  ? compEl.innerText.trim()  : '',
              location: locEl   ? locEl.innerText.trim()   : '',
              salary:   salEl   ? salEl.innerText.trim()   : '',
              posted:   dateEl  ? dateEl.innerText.trim()  : '',
              url:      'https://www.indeed.com/viewjob?jk=' + jk,
            });
          }
          return JSON.stringify(out.filter(j => j.title && j.jk));
        })()
        """)

        # js() may return None if the script fails; treat that as an empty page
        batch = json.loads(batch_json or "[]")
        if not batch:
            break  # no more results

        # De-duplicate within the page as well as across pages
        new_jobs = []
        for j in batch:
            if j["jk"] not in seen_jks:
                seen_jks.add(j["jk"])
                new_jobs.append(j)
        if not new_jobs:
            break  # page only repeated earlier results; stop instead of looping forever

        all_jobs.extend(new_jobs)
        page += 1

    return all_jobs[:max_results]

# Examples
jobs = collect_indeed_jobs("Python developer", "San Francisco", max_results=20)
jobs = collect_indeed_jobs("remote software engineer", fromage=1, max_results=10)
jobs = collect_indeed_jobs("machine learning engineer", max_results=30, fromage=7, job_type="fulltime")
```

---

## Gotchas

- **`data-jk` is the job key, not a DOM id**: Always select cards with `[data-jk]`, not `#job-...` ids, which vary by page layout and A/B test variant.

- **Indeed redirect links are NOT stable URLs**: Anchor `href` values in search results go through `https://www.indeed.com/rc/clk?...` tracking redirects, which expire. Always extract `data-jk` from the card and construct `https://www.indeed.com/viewjob?jk={jk}` yourself.

- **Salary is often on the detail page, not the card**: Many listings show no salary in the search result card. If salary is required, fetch the individual `viewjob?jk=` page and extract it there. Budget `wait(3)` per detail page and do not fetch more than 5 detail pages per minute.

- **"Employer est." vs "Glassdoor est."**: These are two distinct data signals. Employer estimates come from the job post itself; Glassdoor estimates are crowd-sourced. The distinction matters when reporting salary accuracy to users.

- **Glassdoor login modal appears after 2-3 scrolls**: Extract all visible cards immediately on load, before scrolling. If you need to load more results by scrolling, dismiss the modal first.

- **Glassdoor public results are limited**: Without login, Glassdoor shows ~10-15 cards. If the task requires 30+ results, use Indeed instead (no login required, up to ~15 per page with full pagination).

- **Stepstone uses path-based URL routing, not query params**: Spaces in the keyword or city must be replaced with `-` for the path, not `%20` or `+`. `quote_plus()` is wrong for path segments. Use `.replace(" ", "-")`.

- **Stepstone pagination is in the path**: Pages are `/page-2.html`, `/page-3.html`, not `?page=2`. There is no `&start=N` param as in Indeed.

- **`http_get` for Glassdoor fails more often**: Glassdoor requires JS to render job cards, so use the browser path for Glassdoor. `http_get` only works reliably for Indeed and Stepstone, where the server-rendered HTML contains structured data.

- **Indeed embeds JSON in a `<script>` tag**: The `window.mosaic.providerData` block in the HTML source is the fastest extraction path, but it can break if Indeed changes the key. Always keep the DOM-based `js()` approach as a fallback.

- **Date strings are relative, not absolute**: Strings such as "3 days ago", "30+ days ago", and "Just posted" are not machine-parseable dates without a reference point. Use the current UTC time as the reference (`datetime.now(timezone.utc)`; `datetime.utcnow()` is deprecated since Python 3.12). "30+" means at least 30 days ago; treat it as stale.

- **`fromage=1` on Indeed means "last 24 hours" but uses the listing creation date, not the apply-by date**: Because of indexing lag, a fresh listing can be missing from `fromage=1` results and first appear in `fromage=3` results a day later.

- **Indeed CAPTCHA appears as a clean-looking page** with an image puzzle or just a "continue" button; it will not raise an error. Always check `is_captcha_page()` before assuming extraction results are valid.

- **Glassdoor location IDs for `locT=C&locId=`**: Programmatic location filtering by ID requires a separate city-ID lookup (Glassdoor's internal city registry). For basic scraping, omit `locId` and use `locKeyword=` with the city name instead. Results are less precise but need no lookup step.

- **User-agent matters**: `http_get` uses `Mozilla/5.0` by default (see `helpers.py`). For Indeed `http_get`, also set `Accept-Language: en-US,en;q=0.9` to avoid getting German or other localized results based on IP geolocation.

- **Stepstone cookie modal is fullscreen**: On first load, Stepstone shows a fullscreen consent overlay that blocks the entire page. Always call `dismiss_cookie_banner()` before any extraction. If the overlay cannot be dismissed with the generic pattern, use a coordinate click: call `capture_screenshot()` first to find the position of the "Alle akzeptieren" (Accept all) button, then `click_at_xy(x, y)`.

- **Glassdoor salary in card vs detail**: Salary text in the card may be truncated ("$90K - $120K (Glassdoor est.)"). The full salary breakdown (base, bonus, total comp) is only on the job detail page, which requires a click-through.

- **"Easy Apply" listings may not have an external URL**: If the job only has an Indeed-hosted application, there is no company site URL. The `externalUrl` will be `null`; this is expected, not a scraping failure.

- **Empty cards on Indeed mobile breakpoints**: If the browser viewport is very narrow, Indeed may render a different card layout with different selectors. Keep the viewport at normal desktop width (1280px+) to get consistent `[data-jk]` card rendering.
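
The relative-date gotcha above can be illustrated with a small parser. This is a sketch under assumptions: `parse_relative_posted` is a hypothetical helper, it only recognizes "Just posted", "Today", and "N hour(s)/day(s) ago" patterns (with an optional `+`), and anything else returns `None` rather than a guess. The reference time is injectable so results are reproducible.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_relative_posted(
    text: str, now: Optional[datetime] = None
) -> tuple[Optional[datetime], bool]:
    """Convert a relative posting string to an approximate UTC datetime.

    Returns (posted_at, at_least_that_old). The flag is True for strings
    like "30+ days ago", where the real posting time may be even earlier.
    """
    now = now or datetime.now(timezone.utc)
    s = text.strip().lower()
    if s in ("just posted", "today"):  # "today" is an assumed variant
        return now, False
    m = re.search(r"(\d+)(\+?)\s+(hour|day)s?\s+ago", s)
    if not m:
        return None, False  # unknown format: do not guess
    amount = int(m.group(1))
    delta = timedelta(hours=amount) if m.group(3) == "hour" else timedelta(days=amount)
    return now - delta, m.group(2) == "+"


# Hypothetical usage with a fixed reference time:
ref = datetime(2024, 1, 31, tzinfo=timezone.utc)
posted_at, stale = parse_relative_posted("30+ days ago", now=ref)
```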