@pencil-agent/nano-pencil 2.0.1 → 2.0.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +267 -267
- package/dist/build-meta.json +3 -3
- package/dist/core/export-html/AGENT.md +11 -11
- package/dist/core/export-html/template.css +971 -971
- package/dist/core/export-html/template.html +54 -54
- package/dist/core/model/custom-providers.js +1 -1
- package/dist/core/model-registry.js +5 -5
- package/dist/extensions/builtin/AGENT.md +115 -115
- package/dist/extensions/builtin/browser/AGENT.md +17 -17
- package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
- package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
- package/dist/extensions/builtin/browser/browser.md +73 -73
- package/dist/extensions/builtin/browser/install.md +142 -142
- package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
- package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
- package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
- package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
- package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
- package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
- package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
- package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
- package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
- package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
- package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
- package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
- package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
- package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
- package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
- package/dist/extensions/builtin/debug/index.js +9 -9
- package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
- package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
- package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
- package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
- package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
- package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
- package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
- package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
- package/dist/extensions/builtin/goal/README.md +67 -67
- package/dist/extensions/builtin/goal/index.js +6 -6
- package/dist/extensions/builtin/grub/README.md +112 -112
- package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
- package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
- package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
- package/dist/extensions/builtin/link-world/linkworld.md +313 -313
- package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
- package/dist/extensions/builtin/loop/README.md +92 -92
- package/dist/extensions/builtin/mcp/figma-design.md +68 -68
- package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
- package/dist/extensions/builtin/recap/AGENT.md +15 -15
- package/dist/extensions/builtin/sal/README.md +72 -72
- package/dist/extensions/builtin/security-audit/README.md +289 -289
- package/dist/extensions/builtin/team/AGENT.md +112 -112
- package/dist/extensions/builtin/team/TESTING.md +299 -299
- package/dist/extensions/builtin/token-save/README.md +56 -56
- package/dist/extensions/optional/AGENT.md +10 -10
- package/dist/modes/interactive/controllers/input-submit-controller.js +2 -2
- package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
- package/dist/modes/interactive/interactive-mode.js +19 -19
- package/dist/modes/interactive/theme/dark.json +85 -85
- package/dist/modes/interactive/theme/light.json +84 -84
- package/dist/modes/interactive/theme/theme-schema.json +335 -335
- package/dist/modes/interactive/theme/warm.json +81 -81
- package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
- package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
- package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
- package/docs/SDK-TESTING.md +364 -0
- package/docs/codex-goal-command-impl.md +1055 -1055
- package/docs/codex-goal-vs-grub.md +500 -500
- package/docs/custom-provider.md +27 -27
- package/docs/extensions.md +27 -27
- package/docs/keybindings.md +27 -27
- package/docs/loop 重构完成总结.md +250 -250
- package/docs/loop 重构完成报告.md +122 -122
- package/docs/loop 重构方案.md +1222 -1222
- package/docs/loop 重构方案实现报告.md +158 -158
- package/docs/loop 重构方案对比分析.md +128 -128
- package/docs/loop 重构计划.md +320 -320
- package/docs/loop-usage-examples.md +214 -214
- package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
- package/docs/models.md +27 -27
- package/docs/packages.md +27 -27
- package/docs/pi-design-philosophy.md +457 -457
- package/docs/planmode.md +1987 -1987
- package/docs/prompt-templates.md +27 -27
- package/docs/providers.md +27 -27
- package/docs/sdk.md +27 -27
- package/docs/skills.md +27 -27
- package/docs/startup-performance-optimization.md +301 -0
- package/docs/themes.md +27 -27
- package/docs/tui.md +27 -27
- package/docs/认知地图.md +47 -0
- package/package.json +190 -190
- package/docs/cc-agent-design.md +0 -1297
- package/docs/cc-tui-design.md +0 -1333
- package/docs/nanoPencil-学习计划.md +0 -170
- package/docs/scan-report.md +0 -3820
- package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
- package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +0 -261
|
@@ -1,295 +1,295 @@
|
|
|
1
|
-
# Facebook Pages — mining a public Page's feed for posts + external URLs
|
|
2
|
-
|
|
3
|
-
Companion to `groups.md`. Most of the DOM surface is shared because FB renders
|
|
4
|
-
post articles from the same React component in both contexts — the differences
|
|
5
|
-
are the **URL shapes**, the **sort options**, and the **rate-limit ceiling**
|
|
6
|
-
(Pages are public, so FB is a little more forgiving than in member-gated Groups).
|
|
7
|
-
|
|
8
|
-
**Requires:** a real Chrome driven by Browser Harness. Logged-in is recommended
|
|
9
|
-
but not strictly required — FB Pages are public. Logged-out sessions get more
|
|
10
|
-
aggressive "see more" gating and an interstitial login prompt that breaks the
|
|
11
|
-
scroll loop after ~5 posts. Stay signed in.
|
|
12
|
-
|
|
13
|
-
## What this skill is for
|
|
14
|
-
|
|
15
|
-
1. Pull the N most recent posts from a named FB Page (brand, publisher, local business)
|
|
16
|
-
2. Harvest every external URL the Page has linked out to
|
|
17
|
-
3. Grab Page metadata — follower count, category, website, verified status
|
|
18
|
-
4. Hand the outbound URL list to Firecrawl (or `http_get`) for downstream extraction
|
|
19
|
-
|
|
20
|
-
It is NOT for: leaving comments, reacting, messaging the Page, or any write action.
|
|
21
|
-
|
|
22
|
-
## URL patterns
|
|
23
|
-
|
|
24
|
-
Pages can be addressed by either a vanity slug (`/BoatingOntario.ca`) or a
|
|
25
|
-
numeric Page ID (`/100064...`). Vanity is more legible; numeric is more stable
|
|
26
|
-
(vanities can be changed by the page owner).
|
|
27
|
-
|
|
28
|
-
| What | URL |
|
|
29
|
-
|------|-----|
|
|
30
|
-
| Page main feed (default tab) | `https://www.facebook.com/{vanity_or_id}` |
|
|
31
|
-
| Page Posts tab (canonical post feed) | `https://www.facebook.com/{vanity_or_id}/posts` |
|
|
32
|
-
| Page About | `https://www.facebook.com/{vanity_or_id}/about` |
|
|
33
|
-
| Page Reviews | `https://www.facebook.com/{vanity_or_id}/reviews` |
|
|
34
|
-
| Page Videos | `https://www.facebook.com/{vanity_or_id}/videos` |
|
|
35
|
-
| Page Events | `https://www.facebook.com/{vanity_or_id}/events` |
|
|
36
|
-
| Single post (vanity permalink) | `https://www.facebook.com/{vanity_or_id}/posts/pfbid{...}` |
|
|
37
|
-
| Single post (legacy permalink) | `https://www.facebook.com/permalink.php?story_fbid={story_id}&id={page_id}` |
|
|
38
|
-
| Single post (story permalink) | `https://www.facebook.com/story.php?story_fbid={story_id}&id={page_id}` |
|
|
39
|
-
| Page-search (find a Page by name) | `https://www.facebook.com/search/pages/?q={query}` |
|
|
40
|
-
|
|
41
|
-
Unlike Groups, Pages do **not** support `?sorting_setting=CHRONOLOGICAL` — the
|
|
42
|
-
Posts tab is the closest thing to a chronological view, and it's reverse-chrono
|
|
43
|
-
by default. Don't rely on perfect ordering: pinned posts always appear first,
|
|
44
|
-
and FB occasionally reorders the top few based on engagement.
|
|
45
|
-
|
|
46
|
-
## DOM anchors
|
|
47
|
-
|
|
48
|
-
Post-article anchors are **the same as groups.md** because the feed component
|
|
49
|
-
is shared. Page-chrome anchors (header, about-rail, tabs) are specific to Pages.
|
|
50
|
-
|
|
51
|
-
| Anchor | Selector | Notes |
|
|
52
|
-
|--------|----------|-------|
|
|
53
|
-
| Page display name | `h1` (first on page) | Stable — FB has rendered Page name as the top-level `h1` for years |
|
|
54
|
-
| Verified badge | `h1 svg[aria-label*="Verified"]` | Present on verified Pages only |
|
|
55
|
-
| Follower/like count | `a[href$="/followers/"], a[href$="/friends_likes/"]` | Text node contains the count — parse with a regex |
|
|
56
|
-
| Category line | `div[role="main"] span:has(a[href*="/pages/category/"])` | Sits under the name in the header |
|
|
57
|
-
| Website link in header | `a[href^="https://l.facebook.com/l.php"][href*="u="]` inside the About rail | Same redirector wrapper as post links — decode before using |
|
|
58
|
-
| Each post container | `div[role="article"]` | Same as groups |
|
|
59
|
-
| Post permalink | `a[href*="/posts/"][href*="pfbid"], a[href*="/permalink.php"], a[href*="/story.php"]` | Page posts use `pfbid...` style or the legacy `permalink.php`/`story.php` shapes |
|
|
60
|
-
| Post body text | `div[data-ad-preview="message"], div[data-ad-comet-preview="message"]` | Same as groups |
|
|
61
|
-
| Post author | `h3 a, h4 a, strong a` | On a Page, this is always the Page itself — useful only for sanity checking you're still on the right Page |
|
|
62
|
-
| Post timestamp | `a[href*="/posts/"] abbr, a[role="link"] > span > span` | Hover returns absolute time; relative string is fine for sorting |
|
|
63
|
-
| External link (FB redirector) | `a[href^="https://l.facebook.com/l.php?u="]` | Decode the `u=` param |
|
|
64
|
-
| "See more" on long posts | `div[role="button"]`, filtered in JS by an `innerText` of `See more` (`:contains()` is a jQuery extension and throws in `querySelectorAll`) | Click before reading body or posts get truncated |
|
|
65
|
-
|
|
66
|
-
If a selector stops returning results, run the self-inspection block at the
|
|
67
|
-
bottom and update this table — that's the workflow, not a fallback.
|
|
68
|
-
|
|
69
|
-
## Extracting Page metadata (header block)
|
|
70
|
-
|
|
71
|
-
Unlike a Group, a Page's header carries useful signal on its own — category,
|
|
72
|
-
verified, follower count, website. Pull it in one JS call before you start
|
|
73
|
-
scrolling the feed.
|
|
74
|
-
|
|
75
|
-
```python
|
|
76
|
-
meta = js("""
|
|
77
|
-
({
|
|
78
|
-
name: document.querySelector('h1')?.innerText || null,
|
|
79
|
-
verified: !!document.querySelector('h1 svg[aria-label*="Verified"]'),
|
|
80
|
-
followers: (Array.from(document.querySelectorAll('a'))
|
|
81
|
-
      .find(a => /followers[/]?$/.test(a.getAttribute('href')||''))?.innerText) || null,
|
|
82
|
-
likes: (Array.from(document.querySelectorAll('a'))
|
|
83
|
-
      .find(a => /friends_likes[/]?$/.test(a.getAttribute('href')||''))?.innerText) || null,
|
|
84
|
-
category: (Array.from(document.querySelectorAll('a[href*="/pages/category/"]'))[0]?.innerText) || null,
|
|
85
|
-
website_redirector: (Array.from(document.querySelectorAll('a[href^="https://l.facebook.com/l.php"]'))
|
|
86
|
-
.find(a => !a.closest('div[role="article"]'))?.href) || null,
|
|
87
|
-
})
|
|
88
|
-
""")
|
|
89
|
-
```
|
|
90
|
-
|
|
91
|
-
Decode `website_redirector` with the same helper as post links (see below).
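
The `followers` and `likes` fields come back as display strings rather than numbers. Below is a minimal sketch of the regex parse mentioned in the DOM anchors table. It assumes English-locale strings such as `12K followers` or `3,456 likes` with `K`/`M`/`B` abbreviations; the exact text FB renders is an assumption, and `parse_count` is a hypothetical helper name.

```python
import re

# Multipliers for the abbreviations this sketch assumes FB uses.
_SUFFIX = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

# A number with optional thousands separators and decimals, then an
# optional K/M/B suffix that must end at a word boundary (so the "M"
# in "Members" is not mistaken for "million").
_COUNT_RE = re.compile(r"(\d[\d,]*(?:\.\d+)?)(?:\s*([KMB])\b)?")

def parse_count(text):
    """Turn '12K followers' into 12000; return None if no number is found."""
    if not text:
        return None
    m = _COUNT_RE.search(text)
    if not m:
        return None
    number = float(m.group(1).replace(",", ""))
    return int(round(number * _SUFFIX[m.group(2) or ""]))
```

Abbreviated counts are rounded by FB before they reach the DOM, so the parsed value is an approximation and is best used for ranking or bucketing, not exact comparison.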
|
|
92
|
-
|
|
93
|
-
## Scrolling the feed (lazy load)
|
|
94
|
-
|
|
95
|
-
Same collect-as-you-go pattern as groups. FB virtualizes the Page feed too —
|
|
96
|
-
scrolled-past posts unmount, so scroll-then-collect loses them.
|
|
97
|
-
|
|
98
|
-
```python
|
|
99
|
-
seen = {} # permalink -> dict
|
|
100
|
-
TARGET = 50
|
|
101
|
-
MAX_SCROLLS = 30
|
|
102
|
-
|
|
103
|
-
for i in range(MAX_SCROLLS):
|
|
104
|
-
batch = js("""
|
|
105
|
-
Array.from(document.querySelectorAll('div[role="article"]')).map(el => {
|
|
106
|
-
const link = el.querySelector('a[href*="/posts/"][href*="pfbid"], a[href*="/permalink.php"], a[href*="/story.php"]');
|
|
107
|
-
const body = el.querySelector('div[data-ad-preview="message"], div[data-ad-comet-preview="message"]');
|
|
108
|
-
const time = el.querySelector('abbr, a[role="link"] > span > span');
|
|
109
|
-
const externals = Array.from(el.querySelectorAll('a[href^="https://l.facebook.com/l.php?u="]'))
|
|
110
|
-
.map(a => a.href);
|
|
111
|
-
return {
|
|
112
|
-
url: link?.href || null,
|
|
113
|
-
time: time?.innerText || null,
|
|
114
|
-
body: body?.innerText?.slice(0, 4000) || null,
|
|
115
|
-
externals: externals,
|
|
116
|
-
};
|
|
117
|
-
}).filter(p => p.url)
|
|
118
|
-
""") or []
|
|
119
|
-
for p in batch:
|
|
120
|
-
seen.setdefault(p["url"], p)
|
|
121
|
-
if len(seen) >= TARGET:
|
|
122
|
-
break
|
|
123
|
-
scroll(640, 400, dy=900)
|
|
124
|
-
wait(2.5)
|
|
125
|
-
```
|
|
126
|
-
|
|
127
|
-
Notes:
|
|
128
|
-
- Page feeds are usually **less dense** than active Group feeds — a slow Page
|
|
129
|
-
may only render 8–15 posts total before you hit the footer. Use
|
|
130
|
-
  two consecutive iterations that add no new posts as the stop condition.
|
|
131
|
-
- Pinned posts re-appear at the top on every fresh load. The `seen` dict
|
|
132
|
-
dedupes them naturally via permalink.
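
A sketch of the collect loop with a stop condition of this kind. It uses two hypothetical callables so the loop logic can be read and tested apart from the browser: `fetch_batch` stands in for the `js(...)` collection call and `advance` for the scroll-and-wait step. It counts iterations that add no new permalinks, which covers empty batches as well as batches that only repeat posts already seen.

```python
def collect_posts(fetch_batch, advance, target=50, max_scrolls=30, max_empty=2):
    """Collect posts keyed by permalink until a stop condition fires.

    fetch_batch: hypothetical callable returning a list of dicts with a "url" key
    advance:     hypothetical callable that scrolls and waits
    """
    seen = {}
    empty_streak = 0
    for _ in range(max_scrolls):
        before = len(seen)
        for post in fetch_batch() or []:
            seen.setdefault(post["url"], post)  # first sighting wins
        # Reset the streak whenever this iteration found something new.
        empty_streak = empty_streak + 1 if len(seen) == before else 0
        if len(seen) >= target or empty_streak >= max_empty:
            break
        advance()
    return seen
```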
|
|
133
|
-
|
|
134
|
-
## Decoding the external-URL redirector
|
|
135
|
-
|
|
136
|
-
Identical to groups.md — every outbound link is wrapped in
|
|
137
|
-
`https://l.facebook.com/l.php?u={URL-encoded real URL}&h=...`. Strip the wrapper.
|
|
138
|
-
|
|
139
|
-
```python
|
|
140
|
-
from urllib.parse import urlparse, parse_qs, unquote
|
|
141
|
-
def decode_fb_link(href):
|
|
142
|
-
if not href.startswith("https://l.facebook.com/l.php"):
|
|
143
|
-
return href
|
|
144
|
-
q = parse_qs(urlparse(href).query)
|
|
145
|
-
    return q["u"][0] if "u" in q else href  # parse_qs already percent-decodes; a second unquote would corrupt escapes such as %2520
|
|
146
|
-
```
|
|
147
|
-
|
|
148
|
-
## Handoff to Firecrawl
|
|
149
|
-
|
|
150
|
-
Same pattern as groups — Pages are the walled-garden surface that Harness is
|
|
151
|
-
good at; the external URLs the Page has shared are public and better suited to
|
|
152
|
-
Firecrawl's schema-native extraction.
|
|
153
|
-
|
|
154
|
-
```python
|
|
155
|
-
external_urls = sorted({decode_fb_link(x) for p in seen.values() for x in p["externals"]})
|
|
156
|
-
print(f"harvested {len(external_urls)} unique external URLs from Page")
|
|
157
|
-
# In the calling conversation:
|
|
158
|
-
# firecrawl_extract(urls=external_urls, prompt="...", schema={...})
|
|
159
|
-
```
|
|
160
|
-
|
|
161
|
-
## Rate-limit discipline
|
|
162
|
-
|
|
163
|
-
Pages are public, so the ceiling is higher than Groups — but the account-level
|
|
164
|
-
detection still applies, because you're driving a real logged-in session.
|
|
165
|
-
|
|
166
|
-
- **≥2 seconds between scrolls** inside the collect loop
|
|
167
|
-
- **≥2 seconds between Pages** if you're sweeping multiple (down from 3s for Groups)
|
|
168
|
-
- **No more than ~12 Pages per hour** for sustained monitoring (up from 6 Groups/hr)
|
|
169
|
-
- **Don't re-open the same Page within 10 minutes** — repeated hits inside a
|
|
170
|
-
short window is a heuristic that triggers soft-throttling even on public content
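
The per-Page limits above can be enforced in code rather than by convention. Below is a minimal sketch, assuming a hypothetical `PagePacer` helper (not part of Browser Harness) that tracks when each Page was opened and reports how long to wait before opening the next one. The defaults mirror the numbers in the list, and the clock is injectable so the logic can be checked without sleeping.

```python
import time

class PagePacer:
    """Tracks Page opens and reports how long to wait before the next one."""

    def __init__(self, min_gap=2.0, revisit_cooldown=600.0,
                 max_per_hour=12, clock=time.monotonic):
        self.min_gap = min_gap
        self.revisit_cooldown = revisit_cooldown
        self.max_per_hour = max_per_hour
        self.clock = clock
        self.opened = []       # timestamps of every open, oldest first
        self.last_seen = {}    # page -> timestamp of its most recent open

    def wait_time(self, page):
        """Seconds to wait before `page` may be opened (0.0 if allowed now)."""
        now = self.clock()
        waits = [0.0]
        if self.opened:
            waits.append(self.opened[-1] + self.min_gap - now)
        if page in self.last_seen:
            waits.append(self.last_seen[page] + self.revisit_cooldown - now)
        recent = [t for t in self.opened if now - t < 3600]
        if len(recent) >= self.max_per_hour:
            # Wait until enough old opens fall out of the one-hour window.
            waits.append(recent[-self.max_per_hour] + 3600 - now)
        return max(waits)

    def record(self, page):
        """Call right after opening `page`."""
        now = self.clock()
        self.opened.append(now)
        self.last_seen[page] = now
```

In a sweep, wait for `pacer.wait_time(page)` seconds before each `goto_url`, then call `pacer.record(page)`. The between-scroll delay stays inside the collect loop and is not handled here.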
|
|
171
|
-
|
|
172
|
-
Symptoms of over-pacing: the "See more" links on long posts stop being clickable,
|
|
173
|
-
the login interstitial appears even though you're signed in, or the URL silently
|
|
174
|
-
redirects to `/login/device-based/`. If any of those fire, **stop**, let Jay look
|
|
175
|
-
at the screen, and don't try to auto-resolve.
|
|
176
|
-
|
|
177
|
-
## Self-inspection block (run when selectors stop working)
|
|
178
|
-
|
|
179
|
-
```python
|
|
180
|
-
print(js("""
|
|
181
|
-
({
|
|
182
|
-
articles: document.querySelectorAll('div[role="article"]').length,
|
|
183
|
-
body_preview_a: document.querySelectorAll('div[data-ad-preview="message"]').length,
|
|
184
|
-
body_preview_b: document.querySelectorAll('div[data-ad-comet-preview="message"]').length,
|
|
185
|
-
external_redirectors: document.querySelectorAll('a[href^="https://l.facebook.com/l.php?u="]').length,
|
|
186
|
-
pfbid_posts: document.querySelectorAll('a[href*="/posts/"][href*="pfbid"]').length,
|
|
187
|
-
permalink_php: document.querySelectorAll('a[href*="/permalink.php"]').length,
|
|
188
|
-
story_php: document.querySelectorAll('a[href*="/story.php"]').length,
|
|
189
|
-
h1_present: !!document.querySelector('h1'),
|
|
190
|
-
})
|
|
191
|
-
"""))
|
|
192
|
-
# If any count is 0 on a Page you know has posts, the selector drifted.
|
|
193
|
-
# Open DevTools, inspect a post, find the new stable attribute, update the
|
|
194
|
-
# DOM anchors table above.
|
|
195
|
-
```
|
|
196
|
-
|
|
197
|
-
## Full example — mine one Page, emit JSON for downstream tools
|
|
198
|
-
|
|
199
|
-
```bash
|
|
200
|
-
cd ~/Developer/browser-harness && uv run browser-harness <<'PY'
|
|
201
|
-
import json, sys
|
|
202
|
-
from urllib.parse import urlparse, parse_qs, unquote
|
|
203
|
-
|
|
204
|
-
PAGE = "BoatingOntario.ca" # vanity slug OR numeric Page ID
|
|
205
|
-
TARGET = 30
|
|
206
|
-
MAX_SCROLLS = 25
|
|
207
|
-
|
|
208
|
-
goto_url(f"https://www.facebook.com/{PAGE}/posts")
|
|
209
|
-
wait_for_load()
|
|
210
|
-
wait(3)
|
|
211
|
-
|
|
212
|
-
info = page_info()
|
|
213
|
-
if "/checkpoint/" in info["url"] or "/login" in info["url"]:
|
|
214
|
-
sys.exit("AUTH_WALL — stop and have the account re-verify.")
|
|
215
|
-
|
|
216
|
-
# Header metadata
|
|
217
|
-
meta = js("""
|
|
218
|
-
({
|
|
219
|
-
name: document.querySelector('h1')?.innerText || null,
|
|
220
|
-
verified: !!document.querySelector('h1 svg[aria-label*="Verified"]'),
|
|
221
|
-
category: (Array.from(document.querySelectorAll('a[href*="/pages/category/"]'))[0]?.innerText) || null,
|
|
222
|
-
followers: (Array.from(document.querySelectorAll('a'))
|
|
223
|
-
      .find(a => /followers[/]?$/.test(a.getAttribute('href')||''))?.innerText) || null,
|
|
224
|
-
website_redirector: (Array.from(document.querySelectorAll('a[href^="https://l.facebook.com/l.php"]'))
|
|
225
|
-
.find(a => !a.closest('div[role="article"]'))?.href) || null,
|
|
226
|
-
})
|
|
227
|
-
""")
|
|
228
|
-
|
|
229
|
-
# Feed sweep
|
|
230
|
-
seen = {}
|
|
231
|
-
empty_streak = 0
|
|
232
|
-
for _ in range(MAX_SCROLLS):
|
|
233
|
-
batch = js("""
|
|
234
|
-
Array.from(document.querySelectorAll('div[role="article"]')).map(el => {
|
|
235
|
-
const link = el.querySelector('a[href*="/posts/"][href*="pfbid"], a[href*="/permalink.php"], a[href*="/story.php"]');
|
|
236
|
-
const body = el.querySelector('div[data-ad-preview="message"], div[data-ad-comet-preview="message"]');
|
|
237
|
-
const time = el.querySelector('abbr, a[role="link"] > span > span');
|
|
238
|
-
const externals = Array.from(el.querySelectorAll('a[href^="https://l.facebook.com/l.php?u="]')).map(a => a.href);
|
|
239
|
-
return { url: link?.href, time: time?.innerText,
|
|
240
|
-
body: body?.innerText?.slice(0, 4000), externals };
|
|
241
|
-
}).filter(p => p.url)
|
|
242
|
-
""") or []
|
|
243
|
-
before = len(seen)
|
|
244
|
-
for p in batch:
|
|
245
|
-
seen.setdefault(p["url"], p)
|
|
246
|
-
empty_streak = empty_streak + 1 if len(seen) == before else 0
|
|
247
|
-
if len(seen) >= TARGET or empty_streak >= 2:
|
|
248
|
-
break
|
|
249
|
-
scroll(640, 400, dy=900)
|
|
250
|
-
wait(2.5)
|
|
251
|
-
|
|
252
|
-
def decode(u):
|
|
253
|
-
if not u.startswith("https://l.facebook.com/l.php"): return u
|
|
254
|
-
q = parse_qs(urlparse(u).query)
|
|
255
|
-
    return q["u"][0] if "u" in q else u  # parse_qs already percent-decodes the value
|
|
256
|
-
|
|
257
|
-
posts = list(seen.values())
|
|
258
|
-
if meta.get("website_redirector"):
|
|
259
|
-
meta["website"] = decode(meta["website_redirector"])
|
|
260
|
-
all_externals = sorted({decode(x) for p in posts for x in p["externals"]})
|
|
261
|
-
capture_screenshot(f"/tmp/fb-page-{PAGE}.png", full=True)
|
|
262
|
-
print(json.dumps({
|
|
263
|
-
"page": PAGE,
|
|
264
|
-
"meta": meta,
|
|
265
|
-
"post_count": len(posts),
|
|
266
|
-
"posts": posts,
|
|
267
|
-
"external_urls": all_externals,
|
|
268
|
-
}, ensure_ascii=False))
|
|
269
|
-
PY
|
|
270
|
-
```
|
|
271
|
-
|
|
272
|
-
The stdout JSON is the handoff payload — parse it in the calling agent and
|
|
273
|
-
route `external_urls` into `firecrawl_extract`, route `meta` into a
|
|
274
|
-
competitor-intel table, or feed `posts` into keyword matching.
|
|
275
|
-
|
|
276
|
-
## When to reach for pages.md vs groups.md
|
|
277
|
-
|
|
278
|
-
| If the URL is... | Use |
|
|
279
|
-
|------------------|-----|
|
|
280
|
-
| `facebook.com/groups/{id_or_slug}` | `groups.md` |
|
|
281
|
-
| `facebook.com/{vanity}` or `facebook.com/{numeric_id}` | `pages.md` |
|
|
282
|
-
| `facebook.com/profile.php?id={id}` | neither — that's a **personal profile**, different DOM and much stricter rate limits |
|
|
283
|
-
| `facebook.com/marketplace/...` | neither — dedicated Marketplace skill needed |
|
|
284
|
-
|
|
285
|
-
A quick way to tell Pages from personal profiles when the URL shape is
|
|
286
|
-
ambiguous: Pages have an `h1` with a verified-badge slot and a category link
|
|
287
|
-
underneath; personal profiles have a cover photo component and a "Friends" tab.
|
|
288
|
-
|
|
289
|
-
## Gotchas log (append when you hit something new)
|
|
290
|
-
|
|
291
|
-
- **Initial version:** Post-article selectors inherited from `groups.md` because
|
|
292
|
-
FB renders the feed article component identically across Group and Page
|
|
293
|
-
contexts. Run the self-inspection block on first live use to confirm no drift
|
|
294
|
-
since the groups.md verification date, and append a note here with what you
|
|
295
|
-
found.
|
|
1
|
+
# Facebook Pages — mining a public Page's feed for posts + external URLs
|
|
2
|
+
|
|
3
|
+
Companion to `groups.md`. Most of the DOM surface is shared because FB renders
|
|
4
|
+
post articles from the same React component in both contexts — the differences
|
|
5
|
+
are the **URL shapes**, the **sort options**, and the **rate-limit ceiling**
|
|
6
|
+
(Pages are public, so FB is a little more forgiving than in member-gated Groups).
|
|
7
|
+
|
|
8
|
+
**Requires:** a real Chrome driven by Browser Harness. Logged-in is recommended
|
|
9
|
+
but not strictly required — FB Pages are public. Logged-out sessions get more
|
|
10
|
+
aggressive "see more" gating and an interstitial login prompt that breaks the
|
|
11
|
+
scroll loop after ~5 posts. Stay signed in.
|
|
12
|
+
|
|
13
|
+
## What this skill is for
|
|
14
|
+
|
|
15
|
+
1. Pull the N most recent posts from a named FB Page (brand, publisher, local business)
|
|
16
|
+
2. Harvest every external URL the Page has linked out to
|
|
17
|
+
3. Grab Page metadata — follower count, category, website, verified status
|
|
18
|
+
4. Hand the outbound URL list to Firecrawl (or `http_get`) for downstream extraction
|
|
19
|
+
|
|
20
|
+
It is NOT for: leaving comments, reacting, messaging the Page, or any write action.
|
|
21
|
+
|
|
22
|
+
## URL patterns
|
|
23
|
+
|
|
24
|
+
Pages can be addressed by either a vanity slug (`/BoatingOntario.ca`) or a
|
|
25
|
+
numeric Page ID (`/100064...`). Vanity is more legible; numeric is more stable
|
|
26
|
+
(vanities can be changed by the page owner).
|
|
27
|
+
|
|
28
|
+
| What | URL |
|------|-----|
| Page main feed (default tab) | `https://www.facebook.com/{vanity_or_id}` |
| Page Posts tab (canonical post feed) | `https://www.facebook.com/{vanity_or_id}/posts` |
| Page About | `https://www.facebook.com/{vanity_or_id}/about` |
| Page Reviews | `https://www.facebook.com/{vanity_or_id}/reviews` |
| Page Videos | `https://www.facebook.com/{vanity_or_id}/videos` |
| Page Events | `https://www.facebook.com/{vanity_or_id}/events` |
| Single post (vanity permalink) | `https://www.facebook.com/{vanity_or_id}/posts/pfbid{...}` |
| Single post (legacy permalink) | `https://www.facebook.com/permalink.php?story_fbid={story_id}&id={page_id}` |
| Single post (story permalink) | `https://www.facebook.com/story.php?story_fbid={story_id}&id={page_id}` |
| Page-search (find a Page by name) | `https://www.facebook.com/search/pages/?q={query}` |
|
|
40
|
+
|
|
41
|
+
Unlike Groups, Pages do **not** support `?sorting_setting=CHRONOLOGICAL` — the
|
|
42
|
+
Posts tab is the closest thing to a chronological view, and it's reverse-chrono
|
|
43
|
+
by default. Don't rely on perfect ordering: pinned posts always appear first,
|
|
44
|
+
and FB occasionally reorders the top few based on engagement.
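
The URL shapes in the table above can be assembled mechanically. The sketch below is one way to do it; `page_url`, `page_search_url`, and the `TABS` mapping are hypothetical names introduced here for illustration, not part of Browser Harness.

```python
from urllib.parse import quote, urlencode

BASE = "https://www.facebook.com"

# Tab name -> path suffix, taken from the URL patterns table.
TABS = {
    "feed": "",
    "posts": "/posts",
    "about": "/about",
    "reviews": "/reviews",
    "videos": "/videos",
    "events": "/events",
}

def page_url(vanity_or_id, tab="posts"):
    """Build the URL for one tab of a Page (hypothetical helper)."""
    return f"{BASE}/{quote(str(vanity_or_id), safe='')}{TABS[tab]}"

def page_search_url(query):
    """Build the Page-search URL for a free-text query (hypothetical helper)."""
    return f"{BASE}/search/pages/?" + urlencode({"q": query})

print(page_url("BoatingOntario.ca"))
print(page_url("BoatingOntario.ca", tab="about"))
print(page_search_url("Boating Ontario"))
```

Defaulting to the Posts tab follows the note above that it is the closest thing to a chronological view.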
|
|
45
|
+
|
|
46
|
+
## DOM anchors
|
|
47
|
+
|
|
48
|
+
Post-article anchors are **the same as groups.md** because the feed component
|
|
49
|
+
is shared. Page-chrome anchors (header, about-rail, tabs) are specific to Pages.
|
|
50
|
+
|
|
51
|
+
| Anchor | Selector | Notes |
|--------|----------|-------|
| Page display name | `h1` (first on page) | Stable: FB has rendered the Page name as the top-level `h1` for years |
| Verified badge | `h1 svg[aria-label*="Verified"]` | Present on verified Pages only |
| Follower/like count | `a[href$="/followers/"], a[href$="/friends_likes/"]` | Text node contains the count; parse with a regex |
| Category line | `div[role="main"] span:has(a[href*="/pages/category/"])` | Sits under the name in the header |
| Website link in header | `a[href^="https://l.facebook.com/l.php"][href*="u="]` inside the About rail | Same redirector wrapper as post links; decode before using |
| Each post container | `div[role="article"]` | Same as groups |
| Post permalink | `a[href*="/posts/"][href*="pfbid"], a[href*="/permalink.php"], a[href*="/story.php"]` | Page posts use `pfbid...` style or the legacy `permalink.php`/`story.php` shapes |
| Post body text | `div[data-ad-preview="message"], div[data-ad-comet-preview="message"]` | Same as groups |
| Post author | `h3 a, h4 a, strong a` | On a Page, this is always the Page itself, so it is useful only as a sanity check that you're still on the right Page |
| Post timestamp | `a[href*="/posts/"] abbr, a[role="link"] > span > span` | Hover returns absolute time; relative string is fine for sorting |
| External link (FB redirector) | `a[href^="https://l.facebook.com/l.php?u="]` | Decode the `u=` param |
| "See more" on long posts | `div[role="button"]` containing a `span` whose text is "See more" | `:contains()` is a jQuery extension and throws in `querySelector`, so filter candidates by `innerText` in JS. Click before reading body or posts get truncated |
|
|
65
|
+
|
|
66
|
+
If a selector stops returning results, run the self-inspection block at the
|
|
67
|
+
bottom and update this table — that's the workflow, not a fallback.
|
|
68
|
+
|
|
69
|
+
## Extracting Page metadata (header block)
|
|
70
|
+
|
|
71
|
+
Unlike a Group, a Page's header carries useful signal on its own — category,
|
|
72
|
+
verified, follower count, website. Pull it in one JS call before you start
|
|
73
|
+
scrolling the feed.
|
|
74
|
+
|
|
75
|
+
```python
meta = js("""
({
  name: document.querySelector('h1')?.innerText || null,
  verified: !!document.querySelector('h1 svg[aria-label*="Verified"]'),
  followers: (Array.from(document.querySelectorAll('a'))
    .find(a => /followers[/]?$/.test(a.getAttribute('href')||''))?.innerText) || null,
  likes: (Array.from(document.querySelectorAll('a'))
    .find(a => /friends_likes[/]?$/.test(a.getAttribute('href')||''))?.innerText) || null,
  category: (Array.from(document.querySelectorAll('a[href*="/pages/category/"]'))[0]?.innerText) || null,
  website_redirector: (Array.from(document.querySelectorAll('a[href^="https://l.facebook.com/l.php"]'))
    .find(a => !a.closest('div[role="article"]'))?.href) || null,
})
""")
```
|
|
90
|
+
|
|
91
|
+
Decode `website_redirector` with the same helper as post links (see below).
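
The `followers` and `likes` fields come back as display text rather than numbers. A minimal sketch of the regex step follows, assuming the text looks like `12.5K followers`, `3,456 followers`, or `1.2M likes`; those formats are an assumption, not something verified against a live Page, and `parse_count` is a hypothetical helper name.

```python
import re

# Multipliers for the abbreviated suffixes this sketch assumes FB uses.
_SUFFIX = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

def parse_count(text):
    """Turn a display string such as '12.5K followers' into an int, or None."""
    if not text:
        return None
    m = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*([KMB]?)\b", text, re.IGNORECASE)
    if not m:
        return None
    number = float(m.group(1).replace(",", ""))
    return int(round(number * _SUFFIX[m.group(2).upper()]))

print(parse_count("12.5K followers"))
print(parse_count("3,456 followers"))
print(parse_count(None))
```

Abbreviated counts are rounded by FB before they reach the DOM, so the result is an approximation of the true follower count.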
|
|
92
|
+
|
|
93
|
+
## Scrolling the feed (lazy load)
|
|
94
|
+
|
|
95
|
+
Same collect-as-you-go pattern as groups. FB virtualizes the Page feed too —
|
|
96
|
+
scrolled-past posts unmount, so scroll-then-collect loses them.
|
|
97
|
+
|
|
98
|
+
```python
seen = {}        # permalink -> dict
TARGET = 50
MAX_SCROLLS = 30

for i in range(MAX_SCROLLS):
    # Snapshot whatever articles are mounted right now
    batch = js("""
      Array.from(document.querySelectorAll('div[role="article"]')).map(el => {
        const link = el.querySelector('a[href*="/posts/"][href*="pfbid"], a[href*="/permalink.php"], a[href*="/story.php"]');
        const body = el.querySelector('div[data-ad-preview="message"], div[data-ad-comet-preview="message"]');
        const time = el.querySelector('abbr, a[role="link"] > span > span');
        const externals = Array.from(el.querySelectorAll('a[href^="https://l.facebook.com/l.php?u="]'))
          .map(a => a.href);
        return {
          url: link?.href || null,
          time: time?.innerText || null,
          body: body?.innerText?.slice(0, 4000) || null,
          externals: externals,
        };
      }).filter(p => p.url)
    """) or []
    for p in batch:
        seen.setdefault(p["url"], p)   # first sighting wins
    if len(seen) >= TARGET:
        break
    scroll(640, 400, dy=900)
    wait(2.5)
```
|
|
126
|
+
|
|
127
|
+
Notes:
|
|
128
|
+
- Page feeds are usually **less dense** than active Group feeds — a slow Page
|
|
129
|
+
may only render 8–15 posts total before you hit the footer. Use
|
|
130
|
+
  two consecutive scrolls that add no new permalinks to `seen` as a stop condition.
|
|
131
|
+
- Pinned posts re-appear at the top on every fresh load. The `seen` dict
|
|
132
|
+
dedupes them naturally via permalink.
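
The two notes above (stop after two rounds with nothing new, dedupe by permalink) can be checked without a browser. The sketch below replaces the live `js(...)` call with a list of made-up snapshots; `collect` and the `p1`..`p4` permalinks are hypothetical stand-ins.

```python
def collect(batches, target=50):
    """Merge successive feed snapshots; stop on target or two stale rounds."""
    seen = {}
    stale_rounds = 0
    for batch in batches:
        before = len(seen)
        for post in batch:
            seen.setdefault(post["url"], post)  # dedupe by permalink
        # A round is stale when it surfaced no permalink we had not seen.
        stale_rounds = stale_rounds + 1 if len(seen) == before else 0
        if len(seen) >= target or stale_rounds >= 2:
            break
    return list(seen.values())

# Hypothetical snapshots: the pinned post "p1" shows up in every round.
rounds = [
    [{"url": "p1"}, {"url": "p2"}],
    [{"url": "p1"}, {"url": "p3"}],
    [{"url": "p1"}, {"url": "p3"}],  # stale round 1
    [{"url": "p1"}, {"url": "p3"}],  # stale round 2, loop stops here
    [{"url": "p4"}],                 # never reached
]
posts = collect(rounds)
print([p["url"] for p in posts])
```

Counting stale rounds rather than empty batches matters because a virtualized feed usually still has a few articles mounted at the footer, so the batch itself is rarely empty.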
|
|
133
|
+
|
|
134
|
+
## Decoding the external-URL redirector
|
|
135
|
+
|
|
136
|
+
Identical to groups.md — every outbound link is wrapped in
|
|
137
|
+
`https://l.facebook.com/l.php?u={URL-encoded real URL}&h=...`. Strip the wrapper.
|
|
138
|
+
|
|
139
|
+
```python
from urllib.parse import urlparse, parse_qs

def decode_fb_link(href):
    """Return the real target of an l.facebook.com redirector link."""
    if not href.startswith("https://l.facebook.com/l.php"):
        return href
    q = parse_qs(urlparse(href).query)
    # parse_qs already percent-decodes the value; a second unquote would
    # corrupt targets that contain an escaped "%" of their own.
    return q["u"][0] if "u" in q else href
```
|
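
To see the unwrapping end to end, the sketch below builds a redirector link for a made-up target and decodes it again. The target URL and the `h` token are invented for illustration, and `unwrap_redirector` is a hypothetical name. It takes the value from `parse_qs` as-is, relying on the standard-library behavior that `parse_qs` percent-decodes query values once.

```python
from urllib.parse import urlparse, parse_qs, quote

def unwrap_redirector(href):
    """Return the real target of an l.facebook.com redirector link."""
    parsed = urlparse(href)
    if parsed.netloc != "l.facebook.com" or parsed.path != "/l.php":
        return href  # not a redirector, pass through unchanged
    values = parse_qs(parsed.query).get("u")
    # parse_qs has already percent-decoded the value once.
    return values[0] if values else href

# Hypothetical target; note the %20 that must survive decoding.
target = "https://example.com/docs/a%20b?ref=fb"
wrapped = "https://l.facebook.com/l.php?u=" + quote(target, safe="") + "&h=HYPOTHETICAL"

print(unwrap_redirector(wrapped))
print(unwrap_redirector("https://example.com/plain"))
```

Decoding exactly once keeps the `%20` in the target intact; decoding twice would turn it into a literal space and change the URL.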
|
147
|
+
|
|
148
|
+
## Handoff to Firecrawl
|
|
149
|
+
|
|
150
|
+
Same pattern as groups — Pages are the walled-garden surface that Harness is
|
|
151
|
+
good at; the external URLs the Page has shared are public and better suited to
|
|
152
|
+
Firecrawl's schema-native extraction.
|
|
153
|
+
|
|
154
|
+
```python
external_urls = sorted({decode_fb_link(x) for p in seen.values() for x in p["externals"]})
print(f"harvested {len(external_urls)} unique external URLs from Page")
# In the calling conversation:
#   firecrawl_extract(urls=external_urls, prompt="...", schema={...})
```
|
|
160
|
+
|
|
161
|
+
## Rate-limit discipline
|
|
162
|
+
|
|
163
|
+
Pages are public, so the ceiling is higher than Groups — but the account-level
|
|
164
|
+
detection still applies, because you're driving a real logged-in session.
|
|
165
|
+
|
|
166
|
+
- **≥2 seconds between scrolls** inside the collect loop
|
|
167
|
+
- **≥2 seconds between Pages** if you're sweeping multiple (down from 3s for Groups)
|
|
168
|
+
- **No more than ~12 Pages per hour** for sustained monitoring (up from 6 Groups/hr)
|
|
169
|
+
- **Don't re-open the same Page within 10 minutes** — repeated hits inside a
|
|
170
|
+
  short window are a heuristic that triggers soft-throttling even on public content
|
|
171
|
+
|
|
172
|
+
Symptoms of over-pacing: the "See more" links on long posts stop being clickable,
|
|
173
|
+
the login interstitial appears even though you're signed in, or the URL silently
|
|
174
|
+
redirects to `/login/device-based/`. If any of those fire, **stop**, let Jay look
|
|
175
|
+
at the screen, and don't try to auto-resolve.
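
The pacing rules above (at least 2 seconds between Pages, roughly 12 Pages per hour, no re-open of the same Page within 10 minutes) can be enforced with a small bookkeeping object. This is a sketch only: `PagePacer` is a hypothetical helper, not part of Browser Harness, and the clock is injectable so the logic can be exercised without sleeping.

```python
import time

class PagePacer:
    """Tracks Page visits and reports how long to wait before the next one."""

    def __init__(self, min_gap=2.0, per_hour=12, reopen_after=600.0,
                 clock=time.monotonic):
        self.min_gap = min_gap            # seconds between any two Pages
        self.per_hour = per_hour          # sustained hourly ceiling
        self.reopen_after = reopen_after  # seconds before revisiting a Page
        self.clock = clock
        self.opened = []                  # list of (timestamp, page)

    def seconds_until_allowed(self, page):
        now = self.clock()
        recent = [(t, p) for t, p in self.opened if now - t < 3600]
        waits = [0.0]
        if recent:
            waits.append(recent[-1][0] + self.min_gap - now)
        if len(recent) >= self.per_hour:
            # Wait until the oldest visit in the current window ages out.
            waits.append(recent[-self.per_hour][0] + 3600 - now)
        same = [t for t, p in recent if p == page]
        if same:
            waits.append(same[-1] + self.reopen_after - now)
        return max(waits)

    def record(self, page):
        self.opened.append((self.clock(), page))

# Demo with a fake clock; "PageA" and "PageB" are placeholder names.
fake_now = [0.0]
pacer = PagePacer(clock=lambda: fake_now[0])
pacer.record("PageA")
fake_now[0] = 1.0
print(pacer.seconds_until_allowed("PageB"))  # min-gap rule applies
print(pacer.seconds_until_allowed("PageA"))  # re-open rule applies
```

In a real sweep the caller would `wait(pacer.seconds_until_allowed(page))` before `goto_url`, then call `record(page)`. The helper only paces; it does not replace the stop-and-escalate rule when a throttling symptom appears.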
|
|
176
|
+
|
|
177
|
+
## Self-inspection block (run when selectors stop working)
|
|
178
|
+
|
|
179
|
+
```python
print(js("""
({
  articles: document.querySelectorAll('div[role="article"]').length,
  body_preview_a: document.querySelectorAll('div[data-ad-preview="message"]').length,
  body_preview_b: document.querySelectorAll('div[data-ad-comet-preview="message"]').length,
  external_redirectors: document.querySelectorAll('a[href^="https://l.facebook.com/l.php?u="]').length,
  pfbid_posts: document.querySelectorAll('a[href*="/posts/"][href*="pfbid"]').length,
  permalink_php: document.querySelectorAll('a[href*="/permalink.php"]').length,
  story_php: document.querySelectorAll('a[href*="/story.php"]').length,
  h1_present: !!document.querySelector('h1'),
})
"""))
# If any count is 0 on a Page you know has posts, the selector drifted.
# Open DevTools, inspect a post, find the new stable attribute, update the
# DOM anchors table above.
```
|
|
196
|
+
|
|
197
|
+
## Full example — mine one Page, emit JSON for downstream tools
|
|
198
|
+
|
|
199
|
+
```bash
cd ~/Developer/browser-harness && uv run browser-harness <<'PY'
import json, sys
from urllib.parse import urlparse, parse_qs

PAGE = "BoatingOntario.ca"   # vanity slug OR numeric Page ID
TARGET = 30
MAX_SCROLLS = 25

goto_url(f"https://www.facebook.com/{PAGE}/posts")
wait_for_load()
wait(3)

info = page_info()
if "/checkpoint/" in info["url"] or "/login" in info["url"]:
    sys.exit("AUTH_WALL: stop and have the account re-verify.")

# Header metadata
meta = js("""
({
  name: document.querySelector('h1')?.innerText || null,
  verified: !!document.querySelector('h1 svg[aria-label*="Verified"]'),
  category: (Array.from(document.querySelectorAll('a[href*="/pages/category/"]'))[0]?.innerText) || null,
  followers: (Array.from(document.querySelectorAll('a'))
    .find(a => /followers[/]?$/.test(a.getAttribute('href')||''))?.innerText) || null,
  website_redirector: (Array.from(document.querySelectorAll('a[href^="https://l.facebook.com/l.php"]'))
    .find(a => !a.closest('div[role="article"]'))?.href) || null,
})
""")

# Feed sweep: collect on every iteration, because scrolled-past posts unmount
seen = {}
empty_streak = 0
for _ in range(MAX_SCROLLS):
    batch = js("""
      Array.from(document.querySelectorAll('div[role="article"]')).map(el => {
        const link = el.querySelector('a[href*="/posts/"][href*="pfbid"], a[href*="/permalink.php"], a[href*="/story.php"]');
        const body = el.querySelector('div[data-ad-preview="message"], div[data-ad-comet-preview="message"]');
        const time = el.querySelector('abbr, a[role="link"] > span > span');
        const externals = Array.from(el.querySelectorAll('a[href^="https://l.facebook.com/l.php?u="]')).map(a => a.href);
        return { url: link?.href, time: time?.innerText,
                 body: body?.innerText?.slice(0, 4000), externals };
      }).filter(p => p.url)
    """) or []
    before = len(seen)
    for p in batch:
        seen.setdefault(p["url"], p)
    # Count consecutive scrolls that surfaced no new permalinks
    empty_streak = empty_streak + 1 if len(seen) == before else 0
    if len(seen) >= TARGET or empty_streak >= 2:
        break
    scroll(640, 400, dy=900)
    wait(2.5)

def decode(u):
    if not u.startswith("https://l.facebook.com/l.php"):
        return u
    q = parse_qs(urlparse(u).query)
    # parse_qs already percent-decodes the value, so no second unquote
    return q["u"][0] if "u" in q else u

posts = list(seen.values())
if meta.get("website_redirector"):
    meta["website"] = decode(meta["website_redirector"])
all_externals = sorted({decode(x) for p in posts for x in p["externals"]})
capture_screenshot(f"/tmp/fb-page-{PAGE}.png", full=True)
print(json.dumps({
    "page": PAGE,
    "meta": meta,
    "post_count": len(posts),
    "posts": posts,
    "external_urls": all_externals,
}, ensure_ascii=False))
PY
```
|
|
271
|
+
|
|
272
|
+
The stdout JSON is the handoff payload — parse it in the calling agent and
|
|
273
|
+
route `external_urls` into `firecrawl_extract`, route `meta` into a
|
|
274
|
+
competitor-intel table, or feed `posts` into keyword matching.
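
On the receiving side, the payload is plain JSON with the five keys printed by the script. The sketch below parses a stand-in payload and runs a simple keyword match over `posts`; the sample values, the `ExamplePage` name, and the keyword list are all placeholders, not real output.

```python
import json

# Hypothetical stdout captured from the harness run; values are placeholders.
stdout = json.dumps({
    "page": "ExamplePage",
    "meta": {"name": "Example Page", "verified": False},
    "post_count": 2,
    "posts": [
        {"url": "https://www.facebook.com/ExamplePage/posts/pfbid-a",
         "body": "Spring boat show dates", "externals": []},
        {"url": "https://www.facebook.com/ExamplePage/posts/pfbid-b",
         "externals": []},  # no body key: the post had no text
    ],
    "external_urls": ["https://example.com/show"],
})

payload = json.loads(stdout)
keywords = ("boat show", "regatta")  # hypothetical watch list

# Use .get() because posts without text may lack the "body" key entirely.
matches = [
    p["url"] for p in payload["posts"]
    if p.get("body") and any(k in p["body"].lower() for k in keywords)
]
print(matches)
print(payload["external_urls"])
```

Reading `body` defensively is deliberate: image-only posts can arrive without that field, and a bare `p["body"]` lookup would raise `KeyError` on them.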
|
|
275
|
+
|
|
276
|
+
## When to reach for pages.md vs groups.md
|
|
277
|
+
|
|
278
|
+
| If the URL is... | Use |
|------------------|-----|
| `facebook.com/groups/{id_or_slug}` | `groups.md` |
| `facebook.com/{vanity}` or `facebook.com/{numeric_id}` | `pages.md` |
| `facebook.com/profile.php?id={id}` | neither: that's a **personal profile**, with a different DOM and much stricter rate limits |
| `facebook.com/marketplace/...` | neither: dedicated Marketplace skill needed |
|
|
284
|
+
|
|
285
|
+
A quick way to tell Pages from personal profiles when the URL shape is
|
|
286
|
+
ambiguous: Pages have an `h1` with a verified-badge slot and a category link
|
|
287
|
+
underneath; personal profiles have a cover photo component and a "Friends" tab.
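
The routing table can be expressed as a small function for agents that receive arbitrary Facebook URLs. This is a sketch under stated assumptions: `pick_skill` is a hypothetical name, only `facebook.com` and `www.facebook.com` hosts are handled, and any single-segment path that is not a known exclusion is treated as a Page. It cannot tell a Page from a personal profile with a vanity slug, which still needs the DOM check described above.

```python
from urllib.parse import urlparse

def pick_skill(url):
    """Return 'groups.md', 'pages.md', or None for shapes neither file covers."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in ("facebook.com", "www.facebook.com"):
        return None
    parts = [p for p in parsed.path.split("/") if p]
    if not parts:
        return None                   # bare home page
    first = parts[0].lower()
    if first == "groups":
        return "groups.md" if len(parts) > 1 else None
    if first in ("profile.php", "marketplace"):
        return None                   # personal profile or Marketplace
    return "pages.md"                 # vanity slug or numeric Page ID

for candidate in (
    "https://www.facebook.com/groups/12345",
    "https://www.facebook.com/BoatingOntario.ca",
    "https://www.facebook.com/profile.php?id=1",
    "https://www.facebook.com/marketplace/item/1",
):
    print(candidate, "->", pick_skill(candidate))
```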
|
|
288
|
+
|
|
289
|
+
## Gotchas log (append when you hit something new)
|
|
290
|
+
|
|
291
|
+
- **Initial version:** Post-article selectors inherited from `groups.md` because
|
|
292
|
+
FB renders the feed article component identically across Group and Page
|
|
293
|
+
contexts. Run the self-inspection block on first live use to confirm no drift
|
|
294
|
+
since the groups.md verification date, and append a note here with what you
|
|
295
|
+
found.
|