@pencil-agent/nano-pencil 2.0.0-beta.8 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (241)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/extensions-host/index.d.ts +1 -1
  7. package/dist/core/extensions-host/loader.js +1 -1
  8. package/dist/core/extensions-host/runner.d.ts +1 -0
  9. package/dist/core/extensions-host/runner.js +2 -2
  10. package/dist/core/extensions-host/types.d.ts +17 -22
  11. package/dist/core/lib/ai/src/types.d.ts +12 -2
  12. package/dist/core/persona/persona-manager.js +5 -2
  13. package/dist/core/runtime/agent-session.js +3 -3
  14. package/dist/core/runtime/extension-core-bindings.d.ts +1 -0
  15. package/dist/core/runtime/extension-core-bindings.js +2 -2
  16. package/dist/extensions/builtin/AGENT.md +115 -115
  17. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  18. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  19. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  20. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  91. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  92. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  93. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  94. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  95. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  96. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  97. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  98. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  99. package/dist/extensions/builtin/browser/browser.md +73 -73
  100. package/dist/extensions/builtin/browser/install.md +142 -142
  101. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  102. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  103. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  104. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  105. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  107. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  108. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  109. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  110. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  111. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  112. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  113. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  114. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  115. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  116. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  117. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  118. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  119. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  120. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  121. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  122. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  123. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  124. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  125. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  126. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  127. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  128. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  129. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  130. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  131. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  132. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  133. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  134. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  135. package/dist/extensions/builtin/goal/README.md +67 -67
  136. package/dist/extensions/builtin/goal/goal-controller.d.ts +39 -10
  137. package/dist/extensions/builtin/goal/goal-controller.js +1 -1
  138. package/dist/extensions/builtin/goal/goal-format.js +1 -1
  139. package/dist/extensions/builtin/goal/goal-prompts.d.ts +2 -0
  140. package/dist/extensions/builtin/goal/goal-prompts.js +5 -4
  141. package/dist/extensions/builtin/goal/goal-store.js +1 -1
  142. package/dist/extensions/builtin/goal/index.d.ts +1 -1
  143. package/dist/extensions/builtin/goal/index.js +10 -7
  144. package/dist/extensions/builtin/grub/README.md +112 -112
  145. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  146. package/dist/extensions/builtin/link-world/index.js +6 -6
  147. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  148. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  149. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  150. package/dist/extensions/builtin/link-world/{network-routing.md → network-routing/network-routing.md} +67 -67
  151. package/dist/extensions/builtin/loop/README.md +92 -92
  152. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  153. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  154. package/dist/extensions/builtin/plan/index.js +1 -1
  155. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  156. package/dist/extensions/builtin/sal/README.md +72 -72
  157. package/dist/extensions/builtin/security-audit/README.md +289 -289
  158. package/dist/extensions/builtin/task/task-store.d.ts +4 -0
  159. package/dist/extensions/builtin/task/task-store.js +1 -1
  160. package/dist/extensions/builtin/team/AGENT.md +112 -112
  161. package/dist/extensions/builtin/team/TESTING.md +299 -299
  162. package/dist/extensions/builtin/token-save/README.md +56 -56
  163. package/dist/extensions/optional/AGENT.md +10 -10
  164. package/dist/index.d.ts +5 -30
  165. package/dist/index.js +1 -1
  166. package/dist/models.d.ts +7 -0
  167. package/dist/models.js +1 -0
  168. package/dist/modes/interactive/components/footer.js +1 -1
  169. package/dist/modes/interactive/components/task-status-panel.d.ts +36 -0
  170. package/dist/modes/interactive/components/task-status-panel.js +1 -0
  171. package/dist/modes/interactive/controllers/stream-render-controller.d.ts +7 -0
  172. package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
  173. package/dist/modes/interactive/interactive-mode.js +40 -40
  174. package/dist/modes/interactive/state/interactive-state.d.ts +2 -0
  175. package/dist/modes/interactive/state/interactive-state.js +1 -1
  176. package/dist/modes/interactive/theme/dark.json +85 -85
  177. package/dist/modes/interactive/theme/light.json +84 -84
  178. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  179. package/dist/modes/interactive/theme/warm.json +81 -81
  180. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  181. package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
  182. package/dist/node_modules/@pencil-agent/ai/dist/providers/anthropic.js +2 -2
  183. package/dist/node_modules/@pencil-agent/ai/dist/providers/openai-completions.js +5 -5
  184. package/dist/node_modules/@pencil-agent/ai/dist/providers/openai-responses.js +1 -1
  185. package/dist/node_modules/@pencil-agent/ai/dist/stream.js +1 -1
  186. package/dist/packages/protocol/src/commands.d.ts +33 -0
  187. package/dist/packages/protocol/src/flags.d.ts +20 -0
  188. package/dist/packages/protocol/src/hooks.d.ts +17 -0
  189. package/dist/packages/protocol/src/hooks.js +0 -0
  190. package/dist/packages/{extension-sdk → protocol}/src/index.d.ts +7 -4
  191. package/dist/packages/protocol/src/index.js +1 -0
  192. package/dist/packages/{extension-sdk → protocol}/src/lifecycle.d.ts +15 -27
  193. package/dist/packages/protocol/src/lifecycle.js +0 -0
  194. package/dist/packages/{extension-sdk → protocol}/src/tools.d.ts +1 -1
  195. package/dist/packages/protocol/src/tools.js +0 -0
  196. package/dist/public-config.d.ts +12 -0
  197. package/dist/public-config.js +1 -0
  198. package/dist/runtime.d.ts +9 -0
  199. package/dist/runtime.js +1 -0
  200. package/dist/session-compaction.d.ts +7 -0
  201. package/dist/session-compaction.js +1 -0
  202. package/dist/session.d.ts +7 -0
  203. package/dist/session.js +1 -0
  204. package/dist/skills.d.ts +7 -0
  205. package/dist/skills.js +1 -0
  206. package/dist/tools.d.ts +7 -0
  207. package/dist/tools.js +1 -0
  208. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
  209. package/docs/SDK-TESTING.md +364 -0
  210. package/docs/codex-goal-command-impl.md +1055 -1055
  211. package/docs/codex-goal-vs-grub.md +500 -500
  212. package/docs/custom-provider.md +27 -27
  213. package/docs/extensions.md +27 -27
  214. package/docs/keybindings.md +27 -27
  215. package/docs/loop 重构完成总结.md +250 -250
  216. package/docs/loop 重构完成报告.md +122 -122
  217. package/docs/loop 重构方案.md +1222 -1222
  218. package/docs/loop 重构方案实现报告.md +158 -158
  219. package/docs/loop 重构方案对比分析.md +128 -128
  220. package/docs/loop 重构计划.md +320 -320
  221. package/docs/loop-usage-examples.md +214 -214
  222. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
  223. package/docs/models.md +27 -27
  224. package/docs/packages.md +27 -27
  225. package/docs/pi-design-philosophy.md +457 -457
  226. package/docs/planmode.md +1987 -1987
  227. package/docs/prompt-templates.md +27 -27
  228. package/docs/providers.md +27 -27
  229. package/docs/sdk.md +27 -27
  230. package/docs/skills.md +27 -27
  231. package/docs/startup-performance-optimization.md +301 -0
  232. package/docs/themes.md +27 -27
  233. package/docs/tui.md +27 -27
  234. package/docs/认知地图.md +47 -0
  235. package/package.json +190 -162
  236. package/dist/packages/extension-sdk/src/index.js +0 -1
  237. package/docs/cc-agent-design.md +0 -1297
  238. package/docs/cc-tui-design.md +0 -1333
  239. package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
  240. package/dist/packages/{extension-sdk/src/lifecycle.js → protocol/src/commands.js} +0 -0
  241. package/dist/packages/{extension-sdk/src/tools.js → protocol/src/flags.js} +0 -0
@@ -1,361 +1,361 @@
- # SEC EDGAR — Scraping & Data Extraction
-
- `https://www.sec.gov` / `https://data.sec.gov` / `https://efts.sec.gov` — all public data, no auth required. Every workflow here is pure `http_get` — no browser needed.
-
- ## Do this first
-
- **SEC.gov requires a custom User-Agent on `www.sec.gov` and `data.sec.gov`. Always pass `headers=UA` or you get 403.**
-
- ```python
- import json
- UA = {"User-Agent": "browser-harness research@example.com"}
- # Format required: "CompanyName contact@email.com"
- # "Mozilla/5.0" (http_get default) works on efts.sec.gov and data.sec.gov
- # but FAILS on www.sec.gov (company_tickers.json, Archives/, etc.)
- ```
-
- Start with `company_tickers.json` to resolve any ticker → CIK in one call, then branch to whichever endpoint you need.
-
- ```python
- import json
- UA = {"User-Agent": "browser-harness research@example.com"}
- tickers = json.loads(http_get("https://www.sec.gov/files/company_tickers.json", headers=UA))
- # 10,391 public companies, ~50KB, always fresh
- # Entry format: {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."}
-
- # Look up by ticker (exact, case-sensitive in the data)
- aapl = next(v for v in tickers.values() if v['ticker'] == 'AAPL')
- # {'cik_str': 320193, 'ticker': 'AAPL', 'title': 'Apple Inc.'}
-
- # CIK is an int here; pad to 10 digits for API URLs
- cik = str(aapl['cik_str']).zfill(10) # "0000320193"
- ```
-
- ## Common workflows
-
- ### Ticker / name → CIK lookup
-
- ```python
- import json
- UA = {"User-Agent": "browser-harness research@example.com"}
- tickers = json.loads(http_get("https://www.sec.gov/files/company_tickers.json", headers=UA))
-
- # By ticker
- tsla = next((v for v in tickers.values() if v['ticker'] == 'TSLA'), None)
- # {'cik_str': 1318605, 'ticker': 'TSLA', 'title': 'Tesla, Inc.'}
-
- # By partial name match
- apples = [v for v in tickers.values() if 'APPLE' in v['title'].upper()]
- # [{'cik_str': 320193, 'ticker': 'AAPL', 'title': 'Apple Inc.'}, ...]
- ```
-
- ### Company submissions (metadata + recent filings list)
-
- ```python
- import json
- UA = {"User-Agent": "browser-harness research@example.com"}
- cik = "0000320193" # Apple - always zero-pad to 10 digits
- data = json.loads(http_get(f"https://data.sec.gov/submissions/CIK{cik}.json", headers=UA))
-
- print(data['name']) # "Apple Inc."
- print(data['cik']) # "0000320193"
- print(data['sic']) # "3571"
- print(data['sicDescription']) # "Electronic Computers"
- print(data['tickers']) # ["AAPL"]
- print(data['exchanges']) # ["Nasdaq"]
-
- # Most recent ~1,000 filings are in data['filings']['recent']
- recent = data['filings']['recent']
- # Fields per filing (parallel arrays, same index):
- # accessionNumber, filingDate, reportDate, form, primaryDocument,
- # primaryDocDescription, size, isXBRL, items, fileNumber
-
- # Filter for 10-K and 10-Q only
- filings_10k = [
-     (f, d, a, doc)
-     for f, d, a, doc in zip(
-         recent['form'], recent['filingDate'],
-         recent['accessionNumber'], recent['primaryDocument']
-     )
-     if f in ('10-K', '10-Q')
- ]
- # Result: [('10-Q', '2026-01-30', '0000320193-26-000006', 'aapl-20251227.htm'), ...]
- ```
-
- ### Build direct filing document URL
-
- ```python
- # Given accessionNumber and primaryDocument from submissions JSON:
- accn = "0000320193-25-000079"
- doc = "aapl-20250927.htm"
- cik = "320193" # int part only (no leading zeros) for Archives path
-
- accn_nodash = accn.replace("-", "")
- url = f"https://www.sec.gov/Archives/edgar/data/{cik}/{accn_nodash}/{doc}"
- # https://www.sec.gov/Archives/edgar/data/320193/000032019325000079/aapl-20250927.htm
-
- # Full 10-K is 1.5MB of XBRL-tagged HTML — use http_get for text extraction
- content = http_get(url, headers=UA) # UA required on www.sec.gov
- ```
-
- ### XBRL financial data — single company, one concept over time
-
- ```python
- import json
- UA = {"User-Agent": "browser-harness research@example.com"}
- cik_padded = "0000320193"
-
- # companyconcept: one metric, all reported values (quarterly + annual)
- data = json.loads(http_get(
-     f"https://data.sec.gov/api/xbrl/companyconcept/CIK{cik_padded}/us-gaap/Assets.json",
-     headers=UA
- ))
- # data keys: cik, taxonomy, tag, label, description, entityName, units
- # data['units']['USD'] -> list of {end, val, accn, fy, fp, form, filed}
-
- entries = data['units']['USD']
-
- # Deduplicate: same period re-reported across multiple filings — keep latest
- def annual_series(entries):
-     seen = {}
-     for e in entries:
-         if e.get('form') == '10-K' and e.get('fp') == 'FY':
-             end = e['end']
-             if end not in seen or e['filed'] > seen[end]['filed']:
-                 seen[end] = e
-     return [seen[k] for k in sorted(seen)]
-
- assets = annual_series(entries)
- for e in assets[-5:]:
-     print(f"{e['end']} ${e['val']/1e9:.1f}B")
- # 2021-09-25 $351.0B
- # 2022-09-24 $352.8B
- # 2023-09-30 $352.6B
- # 2024-09-28 $365.0B
- # 2025-09-27 $359.2B
- ```
-
- ### XBRL financial data — all US-GAAP metrics for a company
-
- ```python
- import json
- UA = {"User-Agent": "browser-harness research@example.com"}
-
- # companyfacts: all reported XBRL concepts in one ~5MB call
- data = json.loads(http_get(
-     "https://data.sec.gov/api/xbrl/companyfacts/CIK0000320193.json",
-     headers=UA
- ))
- # data['entityName'] = "Apple Inc."
- # data['facts'] = {'us-gaap': {...503 concepts...}, 'dei': {...}}
-
- usgaap = data['facts']['us-gaap']
- print(len(usgaap)) # 503 concepts for Apple
-
- # Common concept names (companies vary — check what's available):
- # Revenue: RevenueFromContractWithCustomerExcludingAssessedTax (post-2018 standard)
- # SalesRevenueNet (older filings)
- # Revenues (some companies still use)
- # Net income: NetIncomeLoss
- # Assets: Assets
- # Cash: CashAndCashEquivalentsAtCarryingValue
- # EPS: EarningsPerShareBasic, EarningsPerShareDiluted
-
- # Find all revenue-related concepts this company reported:
- revenue_keys = [k for k in usgaap if 'Revenue' in k]
-
- # Extract annual revenue — handle company-specific concept name
- for concept in ['RevenueFromContractWithCustomerExcludingAssessedTax', 'SalesRevenueNet', 'Revenues']:
-     if concept in usgaap:
-         entries = usgaap[concept]['units'].get('USD', [])
-         annual = {}
-         for e in entries:
-             if e.get('form') == '10-K' and e.get('fp') == 'FY':
-                 end = e['end']
-                 if end not in annual or e['filed'] > annual[end]['filed']:
-                     annual[end] = e
-         if annual:
-             print(f"Using: {concept}")
-             for end in sorted(annual)[-3:]:
-                 print(f" {end} ${annual[end]['val']/1e9:.1f}B")
-             break
- # Apple output:
- # Using: RevenueFromContractWithCustomerExcludingAssessedTax
- # 2023-09-30 $383.3B
- # 2024-09-28 $391.0B
- # 2025-09-27 $416.2B
- ```
-
- ### Cross-company financial comparison (XBRL frames)
-
- ```python
- import json
- UA = {"User-Agent": "browser-harness research@example.com"}
-
- # frames: one concept, one period, all companies that reported it
- # Period formats:
- # CY2024 = calendar year 2024 (annual)
- # CY2024Q4I = Q4 2024 instantaneous (balance sheet items)
- # CY2024Q4 = Q4 2024 duration (income statement items)
-
- # Top companies by annual revenue (2024)
- data = json.loads(http_get(
-     "https://data.sec.gov/api/xbrl/frames/us-gaap/RevenueFromContractWithCustomerExcludingAssessedTax/USD/CY2024.json",
-     headers=UA
- ))
- companies = sorted(data['data'], key=lambda x: x['val'], reverse=True)
- # data['data'] entries: {accn, cik, entityName, loc, start, end, val}
- for c in companies[:5]:
-     print(f"{c['entityName']:<40} ${c['val']/1e9:.0f}B")
- # Walmart Inc. $675B
- # AMAZON.COM, INC. $638B
- # Apple Inc. $391B
- # McKESSON CORPORATION $359B
- # Alphabet Inc. $350B
-
- # Total assets snapshot end of 2024 (balance sheet = instantaneous)
- data2 = json.loads(http_get(
-     "https://data.sec.gov/api/xbrl/frames/us-gaap/Assets/USD/CY2024Q4I.json",
-     headers=UA
- ))
- # 6,229 companies for this frame
- ```
-
- ### Full-text search across all filings
-
- ```python
- import json
- UA = {"User-Agent": "browser-harness research@example.com"}
-
- # Search for any phrase across filing documents
- # Params: q (quoted phrase), forms (comma-separated), dateRange=custom,
- # startdt, enddt, size (max 100), from (offset for pagination)
- url = (
-     "https://efts.sec.gov/LATEST/search-index"
-     "?q=%22climate+risk%22"
-     "&forms=10-K"
-     "&dateRange=custom&startdt=2024-01-01"
-     "&size=10&from=0"
- )
- data = json.loads(http_get(url, headers=UA))
- # Note: default http_get UA (Mozilla/5.0) works fine on efts.sec.gov
-
- print(data['hits']['total']['value']) # e.g. 1438 matching documents
- hits = data['hits']['hits'] # up to 100 per call
-
- for h in hits:
-     src = h['_source']
-     # Key fields: display_names, ciks, form, file_date, adsh (accession), period_ending
-     name = src['display_names'][0] if src.get('display_names') else '?'
-     cik = src['ciks'][0] if src.get('ciks') else '?'
-     print(f"{name} form={src['form']} filed={src['file_date']} accn={src['adsh']}")
-
- # Pagination: max 100 per page, use from= to walk through results
- # Page 2: from=100, Page 3: from=200, etc.
- for page in range(0, 300, 100):
-     page_url = url + f"&from={page}"
-     page_data = json.loads(http_get(page_url, headers=UA))
-     if not page_data['hits']['hits']:
-         break
-     # process...
-
- # Aggregations — group hits by entity, SIC, state
- aggs = data['aggregations']
- top_entities = aggs['entity_filter']['buckets'] # [{key: "Name (CIK...)", doc_count: N}, ...]
- top_sics = aggs['sic_filter']['buckets']
- top_states = aggs['biz_states_filter']['buckets']
- ```
-
- ### Find a company's CIK by name search (via search aggregations)
-
- ```python
- import json, re
- UA = {"User-Agent": "browser-harness research@example.com"}
-
- # Best method: company_tickers.json (fastest, all tickers)
- tickers = json.loads(http_get("https://www.sec.gov/files/company_tickers.json", headers=UA))
- msft = next(v for v in tickers.values() if v['ticker'] == 'MSFT')
- # CIK = msft['cik_str'] → 789019
-
- # Alternative: full-text search aggregations (finds CIK from company name)
- data = json.loads(http_get(
-     "https://efts.sec.gov/LATEST/search-index?q=%22microsoft+corporation%22&forms=10-K",
-     headers=UA
- ))
- buckets = data['aggregations']['entity_filter']['buckets']
- # [{'key': 'MICROSOFT CORP (MSFT) (CIK 0000789019)', 'doc_count': 11}, ...]
- for b in buckets[:3]:
-     m = re.search(r'\(CIK (\d+)\)', b['key'])
-     if m:
-         print(f"{b['key'][:50]} → CIK {m.group(1)}")
- ```
-
- ### Parallel fetching (multiple companies)
-
- ```python
- import json
- from concurrent.futures import ThreadPoolExecutor
-
- UA = {"User-Agent": "browser-harness research@example.com"}
-
- def get_company_meta(ticker_cik):
-     ticker, cik = ticker_cik
-     subs = json.loads(http_get(f"https://data.sec.gov/submissions/CIK{cik}.json", headers=UA))
-     return {"ticker": ticker, "name": subs['name'], "sic": subs['sic']}
-
- companies = [("AAPL", "0000320193"), ("TSLA", "0001318605"), ("MSFT", "0000789019")]
- with ThreadPoolExecutor(max_workers=3) as ex:
-     results = list(ex.map(get_company_meta, companies))
- # Confirmed: 3 requests complete in ~0.28s
- # SEC rate limit: 10 req/sec — stay at max_workers ≤ 8 to be safe
- ```
-
313
- ## API reference
314
-
315
- | Endpoint | What it returns | UA required |
316
- |---|---|---|
317
- | `www.sec.gov/files/company_tickers.json` | All 10,391 tickers → CIK mapping | YES |
318
- | `data.sec.gov/submissions/CIK{10-digit}.json` | Company meta + ~1000 recent filings | YES |
319
- | `data.sec.gov/api/xbrl/companyfacts/CIK{10-digit}.json` | All XBRL facts (~5MB) | YES |
320
- | `data.sec.gov/api/xbrl/companyconcept/CIK{10-digit}/{taxonomy}/{concept}.json` | One concept, all values | YES |
321
- | `data.sec.gov/api/xbrl/frames/{taxonomy}/{concept}/{unit}/{period}.json` | All companies for one period | YES |
322
- | `efts.sec.gov/LATEST/search-index?q=...` | Full-text search across filings | NO (Mozilla/5.0 works) |
323
- | `www.sec.gov/Archives/edgar/data/{cik}/{accn-nodash}/{doc}` | Actual filing document | YES |
324
-
325
- `data.sec.gov` accepts `Mozilla/5.0` (the http_get default). `www.sec.gov` (Archives, company_tickers) requires the `"CompanyName email@example.com"` format.
326
-
327
- ## Rate limits
328
-
329
- SEC documents a **10 requests/second** limit. In practice:
330
- - 12 rapid sequential calls to `data.sec.gov` completed in 2.4s (5 req/s) with no throttling.
331
- - 3 parallel calls completed in 0.28s without issue.
332
- - Stay at `max_workers ≤ 8` for ThreadPoolExecutor to respect the 10 req/s ceiling.
333
- - No per-day or per-hour cap documented; the 10/s limit is the only stated constraint.
334
-
335
- ## Gotchas
336
-
337
- - **`www.sec.gov` returns 403 with `Mozilla/5.0` UA** — The http_get default (`"Mozilla/5.0"`) works on `data.sec.gov` and `efts.sec.gov` but is blocked on `www.sec.gov`. Always pass `headers=UA` where UA includes your company name and email. Confirmed: `"python-requests/2.28"` → 403.
338
-
339
- - **`data.sec.gov` is more permissive** — `Mozilla/5.0` works on `data.sec.gov` (submissions, xbrl). But always use the proper UA anyway — SEC's policy page explicitly requires it and they can add stricter checks at any time.
340
-
341
- - **XBRL contains duplicate entries per period** — The same fiscal year end date appears multiple times when a company restates or re-files. Each entry has a `filed` date and `accn` (accession). To get the canonical value, deduplicate by `end` keeping the entry with the latest `filed` date.
342
-
343
- - **Revenue concept name varies by company and era** — There is no single canonical "revenue" concept. Apple uses `RevenueFromContractWithCustomerExcludingAssessedTax`. Microsoft uses the same for recent years, but older filings used `SalesRevenueNet`. Always check which concepts are actually present: `[k for k in usgaap if 'Revenue' in k]`.
344
-
345
- - **`fp` field for annual filings is `'FY'`, but quarterly values also appear in 10-K** — A 10-K re-reports each quarter (fp=Q1, Q2, Q3) plus the full year (fp=FY). Filter on both `form == '10-K'` AND `fp == 'FY'` to get only annual totals.
346
-
347
- - **`companyfacts` is ~5MB per company** — For a single metric, use `companyconcept` instead (much smaller). Only use `companyfacts` when you need multiple concepts from the same company.
348
-
349
- - **`submissions` recent filings cap at ~1,000** — Very old filings don't appear. If you need historical data before that window, use the `filings.files` array in submissions JSON to find older filing JSON pages (`data.sec.gov/submissions/CIK{cik}-submissions-001.json`, etc.).
350
-
351
- - **`adsh` in search results is the accession number** — The search index returns `adsh` (no dashes). To build the document URL, insert dashes: `adsh[:10] + '-' + adsh[10:12] + '-' + adsh[12:]`, or use the `accessionNumber` field from submissions JSON (which already has dashes).
352
-
353
- - **`size` param is capped at 100** — Requesting `size=200` silently returns 100 hits. Walk results with `from=0`, `from=100`, etc. Maximum reachable index is 10,000 (Elasticsearch default).
354
-
355
- - **Search total `'gte'` relation means >10,000 hits** — When `total['relation'] == 'gte'`, there are more than 10,000 results (only first 10,000 accessible). Narrow with `dateRange` or `forms` filters.
356
-
357
- - **`company_tickers.json` covers only exchange-listed companies** — ~10,391 entries. Many SEC filers (private companies, bond issuers, FHLBs) have CIKs but no ticker. Find them via the full-text search aggregations or `submissions` lookup if you have the CIK.
358
-
359
- - **Filing document is XBRL-tagged HTML, 1–2MB** — Retrieving the actual 10-K HTML works but is large. For financial data extraction, always prefer the XBRL API endpoints over parsing the document.
360
-
361
- - **CIK format gotcha** — `company_tickers.json` returns `cik_str` as an int (`320193`). The submissions and xbrl APIs require a 10-digit zero-padded string in the filename (`CIK0000320193`). Always use `str(cik).zfill(10)` when building URLs.
1
+ # SEC EDGAR — Scraping & Data Extraction
2
+
3
+ `https://www.sec.gov` / `https://data.sec.gov` / `https://efts.sec.gov` — all public data, no auth required. Every workflow here is pure `http_get` — no browser needed.
4
+
5
+ ## Do this first
6
+
7
+ **SEC policy requires a declared User-Agent on every request. `www.sec.gov` returns 403 without one; `data.sec.gov` currently accepts the `http_get` default but may tighten at any time. Always pass `headers=UA`.**
8
+
9
+ ```python
10
+ import json
11
+ UA = {"User-Agent": "browser-harness research@example.com"}
12
+ # Format required: "CompanyName contact@email.com"
13
+ # "Mozilla/5.0" (http_get default) works on efts.sec.gov and data.sec.gov
14
+ # but FAILS on www.sec.gov (company_tickers.json, Archives/, etc.)
15
+ ```
16
+
17
+ Start with `company_tickers.json` to resolve any ticker → CIK in one call, then branch to whichever endpoint you need.
18
+
19
+ ```python
20
+ import json
21
+ UA = {"User-Agent": "browser-harness research@example.com"}
22
+ tickers = json.loads(http_get("https://www.sec.gov/files/company_tickers.json", headers=UA))
23
+ # 10,391 public companies, ~50KB, always fresh
24
+ # Entry format: {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."}
25
+
26
+ # Look up by ticker (exact, case-sensitive in the data)
27
+ aapl = next(v for v in tickers.values() if v['ticker'] == 'AAPL')
28
+ # {'cik_str': 320193, 'ticker': 'AAPL', 'title': 'Apple Inc.'}
29
+
30
+ # CIK is an int here; pad to 10 digits for API URLs
31
+ cik = str(aapl['cik_str']).zfill(10) # "0000320193"
32
+ ```
33
+
34
+ ## Common workflows
35
+
36
+ ### Ticker / name → CIK lookup
37
+
38
+ ```python
39
+ import json
40
+ UA = {"User-Agent": "browser-harness research@example.com"}
41
+ tickers = json.loads(http_get("https://www.sec.gov/files/company_tickers.json", headers=UA))
42
+
43
+ # By ticker
44
+ tsla = next((v for v in tickers.values() if v['ticker'] == 'TSLA'), None)
45
+ # {'cik_str': 1318605, 'ticker': 'TSLA', 'title': 'Tesla, Inc.'}
46
+
47
+ # By partial name match
48
+ apples = [v for v in tickers.values() if 'APPLE' in v['title'].upper()]
49
+ # [{'cik_str': 320193, 'ticker': 'AAPL', 'title': 'Apple Inc.'}, ...]
50
+ ```
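Each `next(...)` scan above walks the whole mapping. When resolving many tickers, a one-time index may be more convenient. The sketch below is a minimal illustration, assuming the payload is a dict whose values have the `cik_str` / `ticker` / `title` shape shown above; the `"0"` and `"1"` keys in the sample and the helper name `build_ticker_index` are hypothetical.

```python
def build_ticker_index(tickers):
    """Map upper-cased ticker -> entry for constant-time lookups."""
    return {entry["ticker"].upper(): entry for entry in tickers.values()}


# Hypothetical two-entry sample in the shape documented above
sample_tickers = {
    "0": {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."},
    "1": {"cik_str": 1318605, "ticker": "TSLA", "title": "Tesla, Inc."},
}

index = build_ticker_index(sample_tickers)
entry = index.get("tsla".upper())          # case-insensitive lookup
padded = str(entry["cik_str"]).zfill(10)   # 10-digit form for data.sec.gov URLs
print(entry["title"], padded)
```

Unknown tickers come back as `None` from `index.get(...)`, which mirrors the `next(..., None)` form used above.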
51
+
52
+ ### Company submissions (metadata + recent filings list)
53
+
54
+ ```python
55
+ import json
56
+ UA = {"User-Agent": "browser-harness research@example.com"}
57
+ cik = "0000320193" # Apple - always zero-pad to 10 digits
58
+ data = json.loads(http_get(f"https://data.sec.gov/submissions/CIK{cik}.json", headers=UA))
59
+
60
+ print(data['name']) # "Apple Inc."
61
+ print(data['cik']) # "0000320193"
62
+ print(data['sic']) # "3571"
63
+ print(data['sicDescription']) # "Electronic Computers"
64
+ print(data['tickers']) # ["AAPL"]
65
+ print(data['exchanges']) # ["Nasdaq"]
66
+
67
+ # Most recent ~1,000 filings are in data['filings']['recent']
68
+ recent = data['filings']['recent']
69
+ # Fields per filing (parallel arrays, same index):
70
+ # accessionNumber, filingDate, reportDate, form, primaryDocument,
71
+ # primaryDocDescription, size, isXBRL, items, fileNumber
72
+
73
+ # Filter for 10-K and 10-Q only
74
+ filings_10k = [
75
+     (f, d, a, doc)
76
+     for f, d, a, doc in zip(
77
+         recent['form'], recent['filingDate'],
78
+         recent['accessionNumber'], recent['primaryDocument']
79
+     )
80
+     if f in ('10-K', '10-Q')
81
+ ]
82
+ # Result: [('10-Q', '2026-01-30', '0000320193-26-000006', 'aapl-20251227.htm'), ...]
83
+ ```
84
+
85
+ ### Build direct filing document URL
86
+
87
+ ```python
88
+ # Given accessionNumber and primaryDocument from submissions JSON:
89
+ accn = "0000320193-25-000079"
90
+ doc = "aapl-20250927.htm"
91
+ cik = "320193" # int part only (no leading zeros) for Archives path
92
+
93
+ accn_nodash = accn.replace("-", "")
94
+ url = f"https://www.sec.gov/Archives/edgar/data/{cik}/{accn_nodash}/{doc}"
95
+ # https://www.sec.gov/Archives/edgar/data/320193/000032019325000079/aapl-20250927.htm
96
+
97
+ # Full 10-K is 1.5MB of XBRL-tagged HTML — use http_get for text extraction
98
+ content = http_get(url, headers={"User-Agent": "browser-harness research@example.com"}) # declared UA required on www.sec.gov
99
+ ```
100
+
101
+ ### XBRL financial data — single company, one concept over time
102
+
103
+ ```python
104
+ import json
105
+ UA = {"User-Agent": "browser-harness research@example.com"}
106
+ cik_padded = "0000320193"
107
+
108
+ # companyconcept: one metric, all reported values (quarterly + annual)
109
+ data = json.loads(http_get(
110
+ f"https://data.sec.gov/api/xbrl/companyconcept/CIK{cik_padded}/us-gaap/Assets.json",
111
+ headers=UA
112
+ ))
113
+ # data keys: cik, taxonomy, tag, label, description, entityName, units
114
+ # data['units']['USD'] -> list of {end, val, accn, fy, fp, form, filed}
115
+
116
+ entries = data['units']['USD']
117
+
118
+ # Deduplicate: same period re-reported across multiple filings — keep latest
119
+ def annual_series(entries):
120
+     seen = {}
121
+     for e in entries:
122
+         if e.get('form') == '10-K' and e.get('fp') == 'FY':
123
+             end = e['end']
124
+             if end not in seen or e['filed'] > seen[end]['filed']:
125
+                 seen[end] = e
126
+     return [seen[k] for k in sorted(seen)]
127
+
128
+ assets = annual_series(entries)
129
+ for e in assets[-5:]:
130
+     print(f"{e['end']} ${e['val']/1e9:.1f}B")
131
+ # 2021-09-25 $351.0B
132
+ # 2022-09-24 $352.8B
133
+ # 2023-09-30 $352.6B
134
+ # 2024-09-28 $365.0B
135
+ # 2025-09-27 $359.2B
136
+ ```
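The deduplication step above can be checked offline without any network call. The sketch below is a minimal illustration run against made-up entries (the values and filing dates are hypothetical, not real filings). It keeps only `form == '10-K'` and `fp == 'FY'` rows and, for each `end` date, the row with the latest `filed` date.

```python
def annual_series(entries):
    """Keep one 10-K/FY entry per period end, preferring the latest filing."""
    seen = {}
    for e in entries:
        if e.get("form") == "10-K" and e.get("fp") == "FY":
            end = e["end"]
            # ISO-8601 date strings sort chronologically, so plain comparison works
            if end not in seen or e["filed"] > seen[end]["filed"]:
                seen[end] = e
    return [seen[k] for k in sorted(seen)]


# Hypothetical sample: one restated year, one quarterly row, one clean year
sample = [
    {"end": "2023-09-30", "val": 100, "form": "10-K", "fp": "FY", "filed": "2023-11-03"},
    {"end": "2023-09-30", "val": 101, "form": "10-K", "fp": "FY", "filed": "2024-11-01"},
    {"end": "2023-12-30", "val": 55, "form": "10-Q", "fp": "Q1", "filed": "2024-02-02"},
    {"end": "2024-09-28", "val": 120, "form": "10-K", "fp": "FY", "filed": "2024-11-01"},
]

series = annual_series(sample)
print([(e["end"], e["val"]) for e in series])
```

The restated 2023 row (filed later) wins over the original, and the 10-Q row is dropped entirely.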
137
+
138
+ ### XBRL financial data — all US-GAAP metrics for a company
139
+
140
+ ```python
141
+ import json
142
+ UA = {"User-Agent": "browser-harness research@example.com"}
143
+
144
+ # companyfacts: all reported XBRL concepts in one ~5MB call
145
+ data = json.loads(http_get(
146
+ "https://data.sec.gov/api/xbrl/companyfacts/CIK0000320193.json",
147
+ headers=UA
148
+ ))
149
+ # data['entityName'] = "Apple Inc."
150
+ # data['facts'] = {'us-gaap': {...503 concepts...}, 'dei': {...}}
151
+
152
+ usgaap = data['facts']['us-gaap']
153
+ print(len(usgaap)) # 503 concepts for Apple
154
+
155
+ # Common concept names (companies vary — check what's available):
156
+ # Revenue: RevenueFromContractWithCustomerExcludingAssessedTax (post-2018 standard)
157
+ # SalesRevenueNet (older filings)
158
+ # Revenues (some companies still use)
159
+ # Net income: NetIncomeLoss
160
+ # Assets: Assets
161
+ # Cash: CashAndCashEquivalentsAtCarryingValue
162
+ # EPS: EarningsPerShareBasic, EarningsPerShareDiluted
163
+
164
+ # Find all revenue-related concepts this company reported:
165
+ revenue_keys = [k for k in usgaap if 'Revenue' in k]
166
+
167
+ # Extract annual revenue — handle company-specific concept name
168
+ for concept in ['RevenueFromContractWithCustomerExcludingAssessedTax', 'SalesRevenueNet', 'Revenues']:
169
+     if concept in usgaap:
170
+         entries = usgaap[concept]['units'].get('USD', [])
171
+         annual = {}
172
+         for e in entries:
173
+             if e.get('form') == '10-K' and e.get('fp') == 'FY':
174
+                 end = e['end']
175
+                 if end not in annual or e['filed'] > annual[end]['filed']:
176
+                     annual[end] = e
177
+         if annual:
178
+             print(f"Using: {concept}")
179
+             for end in sorted(annual)[-3:]:
180
+                 print(f" {end} ${annual[end]['val']/1e9:.1f}B")
181
+             break
182
+ # Apple output:
183
+ # Using: RevenueFromContractWithCustomerExcludingAssessedTax
184
+ # 2023-09-30 $383.3B
185
+ # 2024-09-28 $391.0B
186
+ # 2025-09-27 $416.2B
187
+ ```
188
+
189
+ ### Cross-company financial comparison (XBRL frames)
190
+
191
+ ```python
192
+ import json
193
+ UA = {"User-Agent": "browser-harness research@example.com"}
194
+
195
+ # frames: one concept, one period, all companies that reported it
196
+ # Period formats:
197
+ # CY2024 = calendar year 2024 (annual)
198
+ # CY2024Q4I = Q4 2024 instantaneous (balance sheet items)
199
+ # CY2024Q4 = Q4 2024 duration (income statement items)
200
+
201
+ # Top companies by annual revenue (2024)
202
+ data = json.loads(http_get(
203
+ "https://data.sec.gov/api/xbrl/frames/us-gaap/RevenueFromContractWithCustomerExcludingAssessedTax/USD/CY2024.json",
204
+ headers=UA
205
+ ))
206
+ companies = sorted(data['data'], key=lambda x: x['val'], reverse=True)
207
+ # data['data'] entries: {accn, cik, entityName, loc, start, end, val}
208
+ for c in companies[:5]:
209
+     print(f"{c['entityName']:<40} ${c['val']/1e9:.0f}B")
210
+ # Walmart Inc. $675B
211
+ # AMAZON.COM, INC. $638B
212
+ # Apple Inc. $391B
213
+ # McKESSON CORPORATION $359B
214
+ # Alphabet Inc. $350B
215
+
216
+ # Total assets snapshot end of 2024 (balance sheet = instantaneous)
217
+ data2 = json.loads(http_get(
218
+ "https://data.sec.gov/api/xbrl/frames/us-gaap/Assets/USD/CY2024Q4I.json",
219
+ headers=UA
220
+ ))
221
+ # 6,229 companies for this frame
222
+ ```
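The period strings in the frames URLs follow a small grammar. The sketch below is a minimal illustration built only from the three formats listed above (`CY2024`, `CY2024Q4`, `CY2024Q4I`); the helper names `frame_period` and `frame_url` are hypothetical, and whether other combinations are accepted by the API is not confirmed here.

```python
def frame_period(year, quarter=None, instantaneous=False):
    """Build a frames period string such as CY2024, CY2024Q4 or CY2024Q4I."""
    if quarter is not None and quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1-4")
    if instantaneous and quarter is None:
        # Only the quarterly instantaneous form appears in the notes above
        raise ValueError("instantaneous periods need a quarter")
    period = f"CY{year}"
    if quarter is not None:
        period += f"Q{quarter}"
    if instantaneous:
        period += "I"
    return period


def frame_url(concept, period, taxonomy="us-gaap", unit="USD"):
    """Assemble the frames endpoint URL from its four path components."""
    return (
        "https://data.sec.gov/api/xbrl/frames/"
        f"{taxonomy}/{concept}/{unit}/{period}.json"
    )


print(frame_url("Assets", frame_period(2024, 4, instantaneous=True)))
```

Balance-sheet concepts such as `Assets` take the instantaneous form, while income-statement concepts take the duration form.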
223
+
224
+ ### Full-text search across all filings
225
+
226
+ ```python
227
+ import json
228
+ UA = {"User-Agent": "browser-harness research@example.com"}
229
+
230
+ # Search for any phrase across filing documents
231
+ # Params: q (quoted phrase), forms (comma-separated), dateRange=custom,
232
+ # startdt, enddt, size (max 100), from (offset for pagination)
233
+ url = (
234
+ "https://efts.sec.gov/LATEST/search-index"
235
+ "?q=%22climate+risk%22"
236
+ "&forms=10-K"
237
+ "&dateRange=custom&startdt=2024-01-01"
238
+ "&size=10&from=0"
239
+ )
240
+ data = json.loads(http_get(url, headers=UA))
241
+ # Note: default http_get UA (Mozilla/5.0) works fine on efts.sec.gov
242
+
243
+ print(data['hits']['total']['value']) # e.g. 1438 matching documents
244
+ hits = data['hits']['hits'] # up to 100 per call
245
+
246
+ for h in hits:
247
+     src = h['_source']
248
+     # Key fields: display_names, ciks, form, file_date, adsh (accession), period_ending
249
+     name = src['display_names'][0] if src.get('display_names') else '?'
250
+     cik = src['ciks'][0] if src.get('ciks') else '?'
251
+     print(f"{name} form={src['form']} filed={src['file_date']} accn={src['adsh']}")
252
+
253
+ # Pagination: max 100 per page, use from= to walk through results
254
+ # Page 2: from=100, Page 3: from=200, etc.
255
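+ for page in range(0, 300, 100):
256
+     page_url = url.replace("&size=10&from=0", f"&size=100&from={page}") # swap in page size and offset instead of appending a second from=
257
+     page_data = json.loads(http_get(page_url, headers=UA))
258
+     if not page_data['hits']['hits']:
259
+         break
260
+     # process...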
261
+
262
+ # Aggregations — group hits by entity, SIC, state
263
+ aggs = data['aggregations']
264
+ top_entities = aggs['entity_filter']['buckets'] # [{key: "Name (CIK...)", doc_count: N}, ...]
265
+ top_sics = aggs['sic_filter']['buckets']
266
+ top_states = aggs['biz_states_filter']['buckets']
267
+ ```
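Hand-assembled query strings are easy to get wrong once paging is involved. The sketch below is a minimal illustration that builds one URL per page with `urllib.parse.urlencode`, using only the parameters named above (`q`, `forms`, `dateRange`, `startdt`, `enddt`, `size`, `from`) and the 100-per-page and 10,000-offset limits noted in this file. The helper name `search_page_urls` and the `max_hits` argument are hypothetical.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://efts.sec.gov/LATEST/search-index"
PAGE_SIZE = 100      # per-request cap described in this document
MAX_OFFSET = 10_000  # highest reachable result index described in this document


def search_page_urls(phrase, forms=None, startdt=None, enddt=None, max_hits=300):
    """Return one search URL per page, each with its own from= offset."""
    params = {"q": f'"{phrase}"'}  # quoted phrase search
    if forms:
        params["forms"] = ",".join(forms)
    if startdt or enddt:
        params["dateRange"] = "custom"
        if startdt:
            params["startdt"] = startdt
        if enddt:
            params["enddt"] = enddt
    limit = min(max_hits, MAX_OFFSET)
    urls = []
    for offset in range(0, limit, PAGE_SIZE):
        page = dict(params, size=PAGE_SIZE)
        page["from"] = offset
        urls.append(f"{SEARCH_BASE}?{urlencode(page)}")
    return urls


urls = search_page_urls("climate risk", forms=["10-K"], startdt="2024-01-01")
print(urls[1])
```

Each URL can then be passed to `http_get` in turn, stopping early when a page comes back with no hits.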
268
+
269
+ ### Find a company's CIK by name search (via search aggregations)
270
+
271
+ ```python
272
+ import json, re
273
+ UA = {"User-Agent": "browser-harness research@example.com"}
274
+
275
+ # Best method: company_tickers.json (fastest, all tickers)
276
+ tickers = json.loads(http_get("https://www.sec.gov/files/company_tickers.json", headers=UA))
277
+ msft = next(v for v in tickers.values() if v['ticker'] == 'MSFT')
278
+ # CIK = msft['cik_str'] → 789019
279
+
280
+ # Alternative: full-text search aggregations (finds CIK from company name)
281
+ data = json.loads(http_get(
282
+ "https://efts.sec.gov/LATEST/search-index?q=%22microsoft+corporation%22&forms=10-K",
283
+ headers=UA
284
+ ))
285
+ buckets = data['aggregations']['entity_filter']['buckets']
286
+ # [{'key': 'MICROSOFT CORP (MSFT) (CIK 0000789019)', 'doc_count': 11}, ...]
287
+ for b in buckets[:3]:
288
+     m = re.search(r'\(CIK (\d+)\)', b['key'])
289
+     if m:
290
+         print(f"{b['key'][:50]} → CIK {m.group(1)}")
291
+ ```
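The bucket key packs name, ticker, and CIK into one string. The sketch below is a minimal illustration that splits it apart, assuming keys shaped like the sample above (`NAME (TICKER) (CIK 0000789019)`). Treating the parenthesised group before the CIK as a comma-separated ticker list is an assumption, so a company name that itself ends in parentheses would be misread. The helper name `parse_entity_key` is hypothetical.

```python
import re

_CIK_RE = re.compile(r"\(CIK (\d+)\)\s*$")
_TICKERS_RE = re.compile(r"\(([^()]+)\)\s*$")


def parse_entity_key(key):
    """Split an entity_filter bucket key into name, tickers and padded CIK."""
    m = _CIK_RE.search(key)
    if not m:
        return None
    cik = m.group(1).zfill(10)
    rest = key[: m.start()].rstrip()
    tickers = []
    t = _TICKERS_RE.search(rest)
    if t:
        # Assumed format: one or more tickers separated by commas
        tickers = [s.strip() for s in t.group(1).split(",")]
        rest = rest[: t.start()].rstrip()
    return {"name": rest, "tickers": tickers, "cik": cik}


print(parse_entity_key("MICROSOFT CORP (MSFT) (CIK 0000789019)"))
```

Keys without a ticker group come back with an empty `tickers` list, and keys without a CIK return `None`.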
292
+
293
+ ### Parallel fetching (multiple companies)
294
+
295
+ ```python
296
+ import json
297
+ from concurrent.futures import ThreadPoolExecutor
298
+
299
+ UA = {"User-Agent": "browser-harness research@example.com"}
300
+
301
+ def get_company_meta(ticker_cik):
302
+     ticker, cik = ticker_cik
303
+     subs = json.loads(http_get(f"https://data.sec.gov/submissions/CIK{cik}.json", headers=UA))
304
+     return {"ticker": ticker, "name": subs['name'], "sic": subs['sic']}
305
+
306
+ companies = [("AAPL", "0000320193"), ("TSLA", "0001318605"), ("MSFT", "0000789019")]
307
+ with ThreadPoolExecutor(max_workers=3) as ex:
308
+     results = list(ex.map(get_company_meta, companies))
309
+ # Confirmed: 3 requests complete in ~0.28s
310
+ # SEC rate limit: 10 req/sec; max_workers caps concurrency, not request rate, so keep it at 8 or below and throttle large batches
311
+ ```
312
+
313
+ ## API reference
314
+
315
+ | Endpoint | What it returns | UA required |
316
+ |---|---|---|
317
+ | `www.sec.gov/files/company_tickers.json` | All 10,391 tickers → CIK mapping | YES |
318
+ | `data.sec.gov/submissions/CIK{10-digit}.json` | Company meta + ~1000 recent filings | YES |
319
+ | `data.sec.gov/api/xbrl/companyfacts/CIK{10-digit}.json` | All XBRL facts (~5MB) | YES |
320
+ | `data.sec.gov/api/xbrl/companyconcept/CIK{10-digit}/{taxonomy}/{concept}.json` | One concept, all values | YES |
321
+ | `data.sec.gov/api/xbrl/frames/{taxonomy}/{concept}/{unit}/{period}.json` | All companies for one period | YES |
322
+ | `efts.sec.gov/LATEST/search-index?q=...` | Full-text search across filings | NO (Mozilla/5.0 works) |
323
+ | `www.sec.gov/Archives/edgar/data/{cik}/{accn-nodash}/{doc}` | Actual filing document | YES |
324
+
325
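+ The YES entries for `data.sec.gov` reflect SEC policy rather than current enforcement: `data.sec.gov` still accepts `Mozilla/5.0` (the http_get default). `www.sec.gov` (Archives, company_tickers) returns 403 unless the UA uses the `"CompanyName email@example.com"` format.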
326
+
327
+ ## Rate limits
328
+
329
+ SEC documents a **10 requests/second** limit. In practice:
330
+ - 12 rapid sequential calls to `data.sec.gov` completed in 2.4s (5 req/s) with no throttling.
331
+ - 3 parallel calls completed in 0.28s without issue.
332
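+ - Keep `max_workers ≤ 8` for ThreadPoolExecutor, but note that worker count caps concurrency, not request rate: at the ~0.2s per request measured above, 8 busy workers could reach roughly 40 req/s. Add explicit throttling for large batches.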
333
+ - No per-day or per-hour cap documented; the 10/s limit is the only stated constraint.
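A thread pool alone does not enforce a requests-per-second ceiling. The sketch below is a minimal illustration of a shared limiter that spaces calls evenly across threads. The class name `RateLimiter`, the `throttled` wrapper, and the choice of 8 req/s as a margin under the documented 10 req/s are all assumptions for illustration.

```python
import threading
import time


class RateLimiter:
    """Hand out evenly spaced time slots so calls never exceed rate_per_sec."""

    def __init__(self, rate_per_sec):
        self.interval = 1.0 / rate_per_sec
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def wait(self):
        # Reserve the next free slot under the lock, then sleep outside it
        with self.lock:
            now = time.monotonic()
            slot = max(self.next_slot, now)
            self.next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


limiter = RateLimiter(8)  # assumed safety margin below the documented 10 req/s


def throttled(fn):
    """Wrap fn so every call first waits for a slot from the shared limiter."""
    def wrapper(*args, **kwargs):
        limiter.wait()
        return fn(*args, **kwargs)
    return wrapper
```

With this in place, a pool call such as `ex.map(throttled(get_company_meta), companies)` stays under the chosen rate regardless of how many workers are running.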
334
+
335
+ ## Gotchas
336
+
337
+ - **`www.sec.gov` returns 403 with `Mozilla/5.0` UA** — The http_get default (`"Mozilla/5.0"`) works on `data.sec.gov` and `efts.sec.gov` but is blocked on `www.sec.gov`. Always pass `headers=UA` where UA includes your company name and email. Confirmed: `"python-requests/2.28"` → 403.
338
+
339
+ - **`data.sec.gov` is more permissive** — `Mozilla/5.0` works on `data.sec.gov` (submissions, xbrl). But always use the proper UA anyway — SEC's policy page explicitly requires it and they can add stricter checks at any time.
340
+
341
+ - **XBRL contains duplicate entries per period** — The same fiscal year end date appears multiple times when a company restates or re-files. Each entry has a `filed` date and `accn` (accession). To get the canonical value, deduplicate by `end` keeping the entry with the latest `filed` date.
342
+
343
+ - **Revenue concept name varies by company and era** — There is no single canonical "revenue" concept. Apple uses `RevenueFromContractWithCustomerExcludingAssessedTax`. Microsoft uses the same for recent years, but older filings used `SalesRevenueNet`. Always check which concepts are actually present: `[k for k in usgaap if 'Revenue' in k]`.
344
+
345
+ - **`fp` field for annual filings is `'FY'`, but quarterly values also appear in 10-K** — A 10-K re-reports each quarter (fp=Q1, Q2, Q3) plus the full year (fp=FY). Filter on both `form == '10-K'` AND `fp == 'FY'` to get only annual totals.
346
+
347
+ - **`companyfacts` is ~5MB per company** — For a single metric, use `companyconcept` instead (much smaller). Only use `companyfacts` when you need multiple concepts from the same company.
348
+
349
+ - **`submissions` recent filings cap at ~1,000** — Very old filings don't appear. If you need historical data before that window, use the `filings.files` array in submissions JSON to find older filing JSON pages (`data.sec.gov/submissions/CIK{cik}-submissions-001.json`, etc.).
350
+
351
+ - **`adsh` in search results is the accession number** — The search index returns `adsh` (no dashes). To build the document URL, insert dashes: `adsh[:10] + '-' + adsh[10:12] + '-' + adsh[12:]`, or use the `accessionNumber` field from submissions JSON (which already has dashes).
352
+
353
+ - **`size` param is capped at 100** — Requesting `size=200` silently returns 100 hits. Walk results with `from=0`, `from=100`, etc. Maximum reachable index is 10,000 (Elasticsearch default).
354
+
355
+ - **Search total `'gte'` relation means >10,000 hits** — When `total['relation'] == 'gte'`, there are more than 10,000 results (only first 10,000 accessible). Narrow with `dateRange` or `forms` filters.
356
+
357
+ - **`company_tickers.json` covers only exchange-listed companies** — ~10,391 entries. Many SEC filers (private companies, bond issuers, FHLBs) have CIKs but no ticker. Find them via the full-text search aggregations or `submissions` lookup if you have the CIK.
358
+
359
+ - **Filing document is XBRL-tagged HTML, 1–2MB** — Retrieving the actual 10-K HTML works but is large. For financial data extraction, always prefer the XBRL API endpoints over parsing the document.
360
+
361
+ - **CIK format gotcha** — `company_tickers.json` returns `cik_str` as an int (`320193`). The submissions and xbrl APIs require a 10-digit zero-padded string in the filename (`CIK0000320193`). Always use `str(cik).zfill(10)` when building URLs.
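Several of the gotchas above come down to string formatting of CIKs and accession numbers. The sketch below is a minimal illustration collecting those conversions in one place; the helper names are hypothetical. `dashed_accession` accepts input with or without dashes, so it does not depend on which form a given endpoint returns.

```python
def pad_cik(cik):
    """10-digit zero-padded CIK for data.sec.gov URLs (accepts int or str)."""
    return str(int(cik)).zfill(10)


def bare_cik(cik):
    """CIK without leading zeros, as used in the Archives path."""
    return str(int(cik))


def dashed_accession(accn):
    """Normalise an 18-digit accession number to the 10-2-6 dashed form."""
    digits = accn.replace("-", "")
    if len(digits) != 18 or not digits.isdigit():
        raise ValueError(f"unexpected accession number: {accn!r}")
    return f"{digits[:10]}-{digits[10:12]}-{digits[12:]}"


def archive_url(cik, accn, doc):
    """Build the filing document URL from CIK, accession number and file name."""
    nodash = accn.replace("-", "")
    return f"https://www.sec.gov/Archives/edgar/data/{bare_cik(cik)}/{nodash}/{doc}"


print(pad_cik(320193))
print(dashed_accession("000032019325000079"))
print(archive_url("0000320193", "0000320193-25-000079", "aapl-20250927.htm"))
```

The last call reproduces the Apple filing URL shown in the "Build direct filing document URL" example.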