@pencil-agent/nano-pencil 2.0.0 → 2.0.1

Files changed (195)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/mcp/mcp-client.d.ts +3 -1
  7. package/dist/core/mcp/mcp-client.js +6 -6
  8. package/dist/core/mcp/mcp-config.d.ts +3 -3
  9. package/dist/core/mcp/mcp-config.js +1 -1
  10. package/dist/core/mcp/mcp-manager.d.ts +5 -1
  11. package/dist/core/mcp/mcp-manager.js +1 -1
  12. package/dist/core/platform/config/resource-loader.d.ts +2 -0
  13. package/dist/core/platform/config/resource-loader.js +2 -2
  14. package/dist/core/runtime/agent-session.d.ts +12 -0
  15. package/dist/core/runtime/agent-session.js +8 -8
  16. package/dist/core/runtime/sdk.d.ts +8 -0
  17. package/dist/core/runtime/sdk.js +1 -1
  18. package/dist/extensions/builtin/AGENT.md +115 -115
  19. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  20. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  91. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  92. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  93. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  94. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  95. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  96. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  97. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  98. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  99. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  100. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  101. package/dist/extensions/builtin/browser/browser.md +73 -73
  102. package/dist/extensions/builtin/browser/install.md +142 -142
  103. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  104. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  105. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  107. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  108. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  109. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  110. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  111. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  112. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  113. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  114. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  115. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  116. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  117. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  118. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  119. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  120. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  121. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  122. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  123. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  124. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  125. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  126. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  127. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  128. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  129. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  130. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  131. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  132. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  133. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  134. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  135. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  136. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  137. package/dist/extensions/builtin/goal/README.md +67 -67
  138. package/dist/extensions/builtin/grub/README.md +112 -112
  139. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  140. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  141. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  142. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  143. package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
  144. package/dist/extensions/builtin/loop/README.md +92 -92
  145. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  146. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  147. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  148. package/dist/extensions/builtin/sal/README.md +72 -72
  149. package/dist/extensions/builtin/security-audit/README.md +289 -289
  150. package/dist/extensions/builtin/team/AGENT.md +112 -112
  151. package/dist/extensions/builtin/team/TESTING.md +299 -299
  152. package/dist/extensions/builtin/token-save/README.md +56 -56
  153. package/dist/extensions/optional/AGENT.md +10 -10
  154. package/dist/modes/interactive/interactive-mode.js +36 -36
  155. package/dist/modes/interactive/theme/dark.json +85 -85
  156. package/dist/modes/interactive/theme/light.json +84 -84
  157. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  158. package/dist/modes/interactive/theme/warm.json +81 -81
  159. package/dist/node_modules/@pencil-agent/agent-core/dist/agent-loop.js +3 -2
  160. package/dist/node_modules/@pencil-agent/agent-core/dist/structured-adaptive-agent-loop.js +2 -1
  161. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  162. package/docs/cc-agent-design.md +1297 -0
  163. package/docs/cc-tui-design.md +1333 -0
  164. package/docs/codex-goal-command-impl.md +1055 -1055
  165. package/docs/codex-goal-vs-grub.md +500 -500
  166. package/docs/custom-provider.md +27 -27
  167. package/docs/extensions.md +27 -27
  168. package/docs/keybindings.md +27 -27
  169. package/docs/loop 重构完成总结.md +250 -250
  170. package/docs/loop 重构完成报告.md +122 -122
  171. package/docs/loop 重构方案.md +1222 -1222
  172. package/docs/loop 重构方案实现报告.md +158 -158
  173. package/docs/loop 重构方案对比分析.md +128 -128
  174. package/docs/loop 重构计划.md +320 -320
  175. package/docs/loop-usage-examples.md +214 -214
  176. package/docs/models.md +27 -27
  177. package/docs/nanoPencil-学习计划.md +170 -0
  178. package/docs/packages.md +27 -27
  179. package/docs/pi-design-philosophy.md +457 -457
  180. package/docs/planmode.md +1987 -1987
  181. package/docs/prompt-templates.md +27 -27
  182. package/docs/providers.md +27 -27
  183. package/docs/scan-report.md +3820 -0
  184. package/docs/sdk.md +27 -27
  185. package/docs/skills.md +27 -27
  186. package/docs/themes.md +27 -27
  187. package/docs/tui.md +27 -27
  188. package/docs//345/257/271/346/240/207Claude-Code.md +1775 -0
  189. package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +261 -0
  190. package/package.json +190 -190
  191. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +0 -851
  192. package/docs/SDK-TESTING.md +0 -364
  193. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +0 -593
  194. package/docs/startup-performance-optimization.md +0 -301
  195. package/docs/认知地图.md +0 -47
@@ -1,361 +1,361 @@
1
- # SEC EDGAR — Scraping & Data Extraction
2
-
3
- `https://www.sec.gov` / `https://data.sec.gov` / `https://efts.sec.gov` — all public data, no auth required. Every workflow here is pure `http_get` — no browser needed.
4
-
5
- ## Do this first
6
-
7
- **SEC policy requires a declared User-Agent in the form "CompanyName contact@email.com". `www.sec.gov` returns 403 without it; `data.sec.gov` currently accepts the default, but always pass `headers=UA`.**
8
-
9
- ```python
10
- import json
11
- UA = {"User-Agent": "browser-harness research@example.com"}
12
- # Format required: "CompanyName contact@email.com"
13
- # "Mozilla/5.0" (http_get default) works on efts.sec.gov and data.sec.gov
14
- # but FAILS on www.sec.gov (company_tickers.json, Archives/, etc.)
15
- ```
16
-
17
- Start with `company_tickers.json` to resolve any ticker → CIK in one call, then branch to whichever endpoint you need.
18
-
19
- ```python
20
- import json
21
- UA = {"User-Agent": "browser-harness research@example.com"}
22
- tickers = json.loads(http_get("https://www.sec.gov/files/company_tickers.json", headers=UA))
23
- # 10,391 public companies, ~50KB, always fresh
24
- # Entry format: {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."}
25
-
26
- # Look up by ticker (exact, case-sensitive in the data)
27
- aapl = next(v for v in tickers.values() if v['ticker'] == 'AAPL')
28
- # {'cik_str': 320193, 'ticker': 'AAPL', 'title': 'Apple Inc.'}
29
-
30
- # CIK is an int here; pad to 10 digits for API URLs
31
- cik = str(aapl['cik_str']).zfill(10) # "0000320193"
32
- ```
33
-
34
- ## Common workflows
35
-
36
- ### Ticker / name → CIK lookup
37
-
38
- ```python
39
- import json
40
- UA = {"User-Agent": "browser-harness research@example.com"}
41
- tickers = json.loads(http_get("https://www.sec.gov/files/company_tickers.json", headers=UA))
42
-
43
- # By ticker
44
- tsla = next((v for v in tickers.values() if v['ticker'] == 'TSLA'), None)
45
- # {'cik_str': 1318605, 'ticker': 'TSLA', 'title': 'Tesla, Inc.'}
46
-
47
- # By partial name match
48
- apples = [v for v in tickers.values() if 'APPLE' in v['title'].upper()]
49
- # [{'cik_str': 320193, 'ticker': 'AAPL', 'title': 'Apple Inc.'}, ...]
50
- ```
51
-
52
- ### Company submissions (metadata + recent filings list)
53
-
54
- ```python
55
- import json
56
- UA = {"User-Agent": "browser-harness research@example.com"}
57
- cik = "0000320193" # Apple - always zero-pad to 10 digits
58
- data = json.loads(http_get(f"https://data.sec.gov/submissions/CIK{cik}.json", headers=UA))
59
-
60
- print(data['name']) # "Apple Inc."
61
- print(data['cik']) # "0000320193"
62
- print(data['sic']) # "3571"
63
- print(data['sicDescription']) # "Electronic Computers"
64
- print(data['tickers']) # ["AAPL"]
65
- print(data['exchanges']) # ["Nasdaq"]
66
-
67
- # Most recent ~1,000 filings are in data['filings']['recent']
68
- recent = data['filings']['recent']
69
- # Fields per filing (parallel arrays, same index):
70
- # accessionNumber, filingDate, reportDate, form, primaryDocument,
71
- # primaryDocDescription, size, isXBRL, items, fileNumber
72
-
73
- # Filter for 10-K and 10-Q only
74
- periodic_filings = [
75
-     (f, d, a, doc)
76
-     for f, d, a, doc in zip(
77
-         recent['form'], recent['filingDate'],
78
-         recent['accessionNumber'], recent['primaryDocument']
79
-     )
80
-     if f in ('10-K', '10-Q')
81
- ]
82
- # Result: [('10-Q', '2026-01-30', '0000320193-26-000006', 'aapl-20251227.htm'), ...]
83
- ```
84
-
85
- ### Build direct filing document URL
86
-
87
- ```python
88
- # Given accessionNumber and primaryDocument from submissions JSON:
89
- accn = "0000320193-25-000079"
90
- doc = "aapl-20250927.htm"
91
- cik = "320193" # int part only (no leading zeros) for Archives path
92
-
93
- accn_nodash = accn.replace("-", "")
94
- url = f"https://www.sec.gov/Archives/edgar/data/{cik}/{accn_nodash}/{doc}"
95
- # https://www.sec.gov/Archives/edgar/data/320193/000032019325000079/aapl-20250927.htm
96
-
97
- # Full 10-K is 1.5MB of XBRL-tagged HTML — use http_get for text extraction
98
- content = http_get(url, headers=UA) # UA dict as defined in the blocks above; required on www.sec.gov
99
- ```
100
-
101
- ### XBRL financial data — single company, one concept over time
102
-
103
- ```python
104
- import json
105
- UA = {"User-Agent": "browser-harness research@example.com"}
106
- cik_padded = "0000320193"
107
-
108
- # companyconcept: one metric, all reported values (quarterly + annual)
109
- data = json.loads(http_get(
110
-     f"https://data.sec.gov/api/xbrl/companyconcept/CIK{cik_padded}/us-gaap/Assets.json",
111
-     headers=UA
112
- ))
113
- # data keys: cik, taxonomy, tag, label, description, entityName, units
114
- # data['units']['USD'] -> list of {end, val, accn, fy, fp, form, filed}
115
-
116
- entries = data['units']['USD']
117
-
118
- # Deduplicate: same period re-reported across multiple filings — keep latest
119
- def annual_series(entries):
120
-     seen = {}
121
-     for e in entries:
122
-         if e.get('form') == '10-K' and e.get('fp') == 'FY':
123
-             end = e['end']
124
-             if end not in seen or e['filed'] > seen[end]['filed']:
125
-                 seen[end] = e
126
-     return [seen[k] for k in sorted(seen)]
127
-
128
- assets = annual_series(entries)
129
- for e in assets[-5:]:
130
-     print(f"{e['end']} ${e['val']/1e9:.1f}B")
131
- # 2021-09-25 $351.0B
132
- # 2022-09-24 $352.8B
133
- # 2023-09-30 $352.6B
134
- # 2024-09-28 $365.0B
135
- # 2025-09-27 $359.2B
136
- ```
137
-
138
- ### XBRL financial data — all US-GAAP metrics for a company
139
-
140
- ```python
141
- import json
142
- UA = {"User-Agent": "browser-harness research@example.com"}
143
-
144
- # companyfacts: all reported XBRL concepts in one ~5MB call
145
- data = json.loads(http_get(
146
-     "https://data.sec.gov/api/xbrl/companyfacts/CIK0000320193.json",
147
-     headers=UA
148
- ))
149
- # data['entityName'] = "Apple Inc."
150
- # data['facts'] = {'us-gaap': {...503 concepts...}, 'dei': {...}}
151
-
152
- usgaap = data['facts']['us-gaap']
153
- print(len(usgaap)) # 503 concepts for Apple
154
-
155
- # Common concept names (companies vary — check what's available):
156
- # Revenue: RevenueFromContractWithCustomerExcludingAssessedTax (post-2018 standard)
157
- # SalesRevenueNet (older filings)
158
- # Revenues (some companies still use)
159
- # Net income: NetIncomeLoss
160
- # Assets: Assets
161
- # Cash: CashAndCashEquivalentsAtCarryingValue
162
- # EPS: EarningsPerShareBasic, EarningsPerShareDiluted
163
-
164
- # Find all revenue-related concepts this company reported:
165
- revenue_keys = [k for k in usgaap if 'Revenue' in k]
166
-
167
- # Extract annual revenue — handle company-specific concept name
168
- for concept in ['RevenueFromContractWithCustomerExcludingAssessedTax', 'SalesRevenueNet', 'Revenues']:
169
-     if concept in usgaap:
170
-         entries = usgaap[concept]['units'].get('USD', [])
171
-         annual = {}
172
-         for e in entries:
173
-             if e.get('form') == '10-K' and e.get('fp') == 'FY':
174
-                 end = e['end']
175
-                 if end not in annual or e['filed'] > annual[end]['filed']:
176
-                     annual[end] = e
177
-         if annual:
178
-             print(f"Using: {concept}")
179
-             for end in sorted(annual)[-3:]:
180
-                 print(f" {end} ${annual[end]['val']/1e9:.1f}B")
181
-             break
182
- # Apple output:
183
- # Using: RevenueFromContractWithCustomerExcludingAssessedTax
184
- # 2023-09-30 $383.3B
185
- # 2024-09-28 $391.0B
186
- # 2025-09-27 $416.2B
187
- ```
188
-
189
- ### Cross-company financial comparison (XBRL frames)
190
-
191
- ```python
192
- import json
193
- UA = {"User-Agent": "browser-harness research@example.com"}
194
-
195
- # frames: one concept, one period, all companies that reported it
196
- # Period formats:
197
- # CY2024 = calendar year 2024 (annual)
198
- # CY2024Q4I = Q4 2024 instantaneous (balance sheet items)
199
- # CY2024Q4 = Q4 2024 duration (income statement items)
200
-
201
- # Top companies by annual revenue (2024)
202
- data = json.loads(http_get(
203
-     "https://data.sec.gov/api/xbrl/frames/us-gaap/RevenueFromContractWithCustomerExcludingAssessedTax/USD/CY2024.json",
204
-     headers=UA
205
- ))
206
- companies = sorted(data['data'], key=lambda x: x['val'], reverse=True)
207
- # data['data'] entries: {accn, cik, entityName, loc, start, end, val}
208
- for c in companies[:5]:
209
-     print(f"{c['entityName']:<40} ${c['val']/1e9:.0f}B")
210
- # Walmart Inc. $675B
211
- # AMAZON.COM, INC. $638B
212
- # Apple Inc. $391B
213
- # McKESSON CORPORATION $359B
214
- # Alphabet Inc. $350B
215
-
216
- # Total assets snapshot end of 2024 (balance sheet = instantaneous)
217
- data2 = json.loads(http_get(
218
-     "https://data.sec.gov/api/xbrl/frames/us-gaap/Assets/USD/CY2024Q4I.json",
219
-     headers=UA
220
- ))
221
- # 6,229 companies for this frame
222
- ```
223
-
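The period formats listed in the comments above can be generated rather than typed by hand. The following is a minimal sketch, not part of the SEC API or the harness: `frame_period` and `frame_url` are hypothetical helper names, and the sketch assumes (as the three formats above suggest) that instantaneous frames always carry a quarter.

```python
from typing import Optional

FRAMES_BASE = "https://data.sec.gov/api/xbrl/frames"


def frame_period(year: int, quarter: Optional[int] = None, instant: bool = False) -> str:
    """Build a frames period string: CY2024, CY2024Q4 or CY2024Q4I."""
    if quarter is not None and quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3 or 4")
    if instant and quarter is None:
        # Assumption: balance-sheet (instantaneous) frames are always quarterly.
        raise ValueError("instantaneous frames need a quarter, e.g. CY2024Q4I")
    period = f"CY{year}"
    if quarter is not None:
        period += f"Q{quarter}"
    if instant:
        period += "I"
    return period


def frame_url(concept: str, period: str, taxonomy: str = "us-gaap", unit: str = "USD") -> str:
    """Assemble the frames endpoint URL for one concept and one period."""
    return f"{FRAMES_BASE}/{taxonomy}/{concept}/{unit}/{period}.json"


assets_url = frame_url("Assets", frame_period(2024, quarter=4, instant=True))
```

Here `assets_url` reproduces the total-assets URL used in the block above, so the result can be passed straight to `http_get(..., headers=UA)`.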
224
- ### Full-text search across all filings
225
-
226
- ```python
227
- import json
228
- UA = {"User-Agent": "browser-harness research@example.com"}
229
-
230
- # Search for any phrase across filing documents
231
- # Params: q (quoted phrase), forms (comma-separated), dateRange=custom,
232
- # startdt, enddt, size (max 100), from (offset for pagination)
233
- url = (
234
-     "https://efts.sec.gov/LATEST/search-index"
235
-     "?q=%22climate+risk%22"
236
-     "&forms=10-K"
237
-     "&dateRange=custom&startdt=2024-01-01"
238
-     "&size=10&from=0"
239
- )
240
- data = json.loads(http_get(url, headers=UA))
241
- # Note: default http_get UA (Mozilla/5.0) works fine on efts.sec.gov
242
-
243
- print(data['hits']['total']['value']) # e.g. 1438 matching documents
244
- hits = data['hits']['hits'] # up to 100 per call
245
-
246
- for h in hits:
247
-     src = h['_source']
247
-     # Key fields: display_names, ciks, form, file_date, adsh (accession), period_ending
248
-     name = src['display_names'][0] if src.get('display_names') else '?'
249
-     cik = src['ciks'][0] if src.get('ciks') else '?'
250
-     print(f"{name} form={src['form']} filed={src['file_date']} accn={src['adsh']}")
252
-
253
- # Pagination: max 100 per page, use from= to walk through results
254
- # Page 2: from=100, Page 3: from=200, etc.
255
- for page in range(0, 300, 100):
256
-     page_url = url.replace("&size=10&from=0", f"&size=100&from={page}")  # replace, don't append a second from=
256
-     page_data = json.loads(http_get(page_url, headers=UA))
257
-     if not page_data['hits']['hits']:
258
-         break
259
-     # process...
261
-
262
- # Aggregations — group hits by entity, SIC, state
263
- aggs = data['aggregations']
264
- top_entities = aggs['entity_filter']['buckets'] # [{key: "Name (CIK...)", doc_count: N}, ...]
265
- top_sics = aggs['sic_filter']['buckets']
266
- top_states = aggs['biz_states_filter']['buckets']
267
- ```
268
-
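The hand-written query string above can also be assembled with `urllib.parse.urlencode`, which avoids duplicated `from=` parameters when paginating. This is a sketch under stated assumptions: `search_url` and `page_urls` are hypothetical helpers, and the 100-hit page cap and 10,000 offset ceiling are taken from the notes in this file rather than from official documentation.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://efts.sec.gov/LATEST/search-index"
MAX_PAGE_SIZE = 100   # per this file: larger sizes are silently capped
MAX_OFFSET = 10_000   # per this file: highest reachable result index


def search_url(phrase, forms=None, startdt=None, enddt=None, size=MAX_PAGE_SIZE, offset=0):
    """Build a full-text search URL for an exact (quoted) phrase."""
    params = {"q": f'"{phrase}"'}
    if forms:
        params["forms"] = ",".join(forms)
    if startdt or enddt:
        params["dateRange"] = "custom"
        if startdt:
            params["startdt"] = startdt
        if enddt:
            params["enddt"] = enddt
    params["size"] = min(size, MAX_PAGE_SIZE)
    params["from"] = offset
    return f"{SEARCH_BASE}?{urlencode(params)}"


def page_urls(phrase, max_results=300, **filters):
    """Yield one URL per page of 100 hits, up to max_results."""
    for offset in range(0, min(max_results, MAX_OFFSET), MAX_PAGE_SIZE):
        yield search_url(phrase, offset=offset, **filters)


first_page = search_url("climate risk", forms=["10-K"], startdt="2024-01-01", size=10)
```

Each URL from `page_urls("climate risk", forms=["10-K"])` can be fetched with `http_get(url, headers=UA)`; stop when a page returns no hits.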
269
- ### Find a company's CIK by name search (via search aggregations)
270
-
271
- ```python
272
- import json, re
273
- UA = {"User-Agent": "browser-harness research@example.com"}
274
-
275
- # Best method: company_tickers.json (fastest, all tickers)
276
- tickers = json.loads(http_get("https://www.sec.gov/files/company_tickers.json", headers=UA))
277
- msft = next(v for v in tickers.values() if v['ticker'] == 'MSFT')
278
- # CIK = msft['cik_str'] → 789019
279
-
280
- # Alternative: full-text search aggregations (finds CIK from company name)
281
- data = json.loads(http_get(
282
-     "https://efts.sec.gov/LATEST/search-index?q=%22microsoft+corporation%22&forms=10-K",
283
-     headers=UA
284
- ))
285
- buckets = data['aggregations']['entity_filter']['buckets']
286
- # [{'key': 'MICROSOFT CORP (MSFT) (CIK 0000789019)', 'doc_count': 11}, ...]
287
- for b in buckets[:3]:
288
-     m = re.search(r'\(CIK (\d+)\)', b['key'])
289
-     if m:
290
-         print(f"{b['key'][:50]} → CIK {m.group(1)}")
291
- ```
292
-
293
- ### Parallel fetching (multiple companies)
294
-
295
- ```python
296
- import json
297
- from concurrent.futures import ThreadPoolExecutor
298
-
299
- UA = {"User-Agent": "browser-harness research@example.com"}
300
-
301
- def get_company_meta(ticker_cik):
302
-     ticker, cik = ticker_cik
303
-     subs = json.loads(http_get(f"https://data.sec.gov/submissions/CIK{cik}.json", headers=UA))
304
-     return {"ticker": ticker, "name": subs['name'], "sic": subs['sic']}
305
-
306
- companies = [("AAPL", "0000320193"), ("TSLA", "0001318605"), ("MSFT", "0000789019")]
307
- with ThreadPoolExecutor(max_workers=3) as ex:
308
-     results = list(ex.map(get_company_meta, companies))
309
- # Confirmed: 3 requests complete in ~0.28s
310
- # SEC rate limit: 10 req/sec. max_workers caps concurrency, not rate; throttle explicitly for large batches
311
- ```
312
-
313
- ## API reference
314
-
315
- | Endpoint | What it returns | UA required |
316
- |---|---|---|
317
- | `www.sec.gov/files/company_tickers.json` | All 10,391 tickers → CIK mapping | YES |
318
- | `data.sec.gov/submissions/CIK{10-digit}.json` | Company meta + ~1000 recent filings | YES |
319
- | `data.sec.gov/api/xbrl/companyfacts/CIK{10-digit}.json` | All XBRL facts (~5MB) | YES |
320
- | `data.sec.gov/api/xbrl/companyconcept/CIK{10-digit}/{taxonomy}/{concept}.json` | One concept, all values | YES |
321
- | `data.sec.gov/api/xbrl/frames/{taxonomy}/{concept}/{unit}/{period}.json` | All companies for one period | YES |
322
- | `efts.sec.gov/LATEST/search-index?q=...` | Full-text search across filings | NO (Mozilla/5.0 works) |
323
- | `www.sec.gov/Archives/edgar/data/{cik}/{accn-nodash}/{doc}` | Actual filing document | YES |
324
-
325
- The "UA required" column follows SEC policy. In practice `data.sec.gov` also accepts `Mozilla/5.0` (the http_get default), while `www.sec.gov` (Archives, company_tickers) enforces the `"CompanyName email@example.com"` format.
326
-
327
- ## Rate limits
328
-
329
- SEC documents a **10 requests/second** limit. In practice:
330
- - 12 rapid sequential calls to `data.sec.gov` completed in 2.4s (5 req/s) with no throttling.
331
- - 3 parallel calls completed in 0.28s without issue.
332
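- - `max_workers` caps concurrency, not request rate: at the ~0.2 s per call measured above, 8 busy workers could issue about 40 req/s. Keep `max_workers ≤ 8` and add an explicit throttle for large batches to respect the 10 req/s ceiling.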
333
- - No per-day or per-hour cap documented; the 10/s limit is the only stated constraint.
334
-
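The 10 requests/second ceiling above can be enforced explicitly instead of relying on worker count. The following is a minimal sketch of a thread-safe throttle; the `Throttle` class is hypothetical (not part of the harness), and the 8 req/s default is an assumed safety margin below the documented limit.

```python
import threading
import time


class Throttle:
    """Space calls at least 1/rate_per_sec seconds apart across all threads."""

    def __init__(self, rate_per_sec: float = 8.0):
        self._interval = 1.0 / rate_per_sec
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def wait(self) -> None:
        # Reserve the next free time slot under the lock, then sleep outside it
        # so other threads can reserve their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self._interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


# Demonstration with a fast rate so it finishes quickly:
demo = Throttle(rate_per_sec=50)
start = time.monotonic()
for _ in range(6):
    demo.wait()
elapsed = time.monotonic() - start  # about 5 intervals of 0.02 s
```

In the parallel-fetching example, a shared `throttle = Throttle()` with `throttle.wait()` called just before each `http_get` would keep the combined request rate of all workers under the limit.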
335
- ## Gotchas
336
-
337
- - **`www.sec.gov` returns 403 with `Mozilla/5.0` UA** — The http_get default (`"Mozilla/5.0"`) works on `data.sec.gov` and `efts.sec.gov` but is blocked on `www.sec.gov`. Always pass `headers=UA` where UA includes your company name and email. Confirmed: `"python-requests/2.28"` → 403.
338
-
339
- - **`data.sec.gov` is more permissive** — `Mozilla/5.0` works on `data.sec.gov` (submissions, xbrl). But always use the proper UA anyway — SEC's policy page explicitly requires it and they can add stricter checks at any time.
340
-
341
- - **XBRL contains duplicate entries per period** — The same fiscal year end date appears multiple times when a company restates or re-files. Each entry has a `filed` date and `accn` (accession). To get the canonical value, deduplicate by `end` keeping the entry with the latest `filed` date.
342
-
343
- - **Revenue concept name varies by company and era** — There is no single canonical "revenue" concept. Apple uses `RevenueFromContractWithCustomerExcludingAssessedTax`. Microsoft uses the same for recent years, but older filings used `SalesRevenueNet`. Always check which concepts are actually present: `[k for k in usgaap if 'Revenue' in k]`.
344
-
345
- - **`fp` field for annual filings is `'FY'`, but quarterly values also appear in 10-K** — A 10-K re-reports each quarter (fp=Q1, Q2, Q3) plus the full year (fp=FY). Filter on both `form == '10-K'` AND `fp == 'FY'` to get only annual totals.
346
-
347
- - **`companyfacts` is ~5MB per company** — For a single metric, use `companyconcept` instead (much smaller). Only use `companyfacts` when you need multiple concepts from the same company.
348
-
349
- - **`submissions` recent filings cap at ~1,000** — Very old filings don't appear. If you need historical data before that window, use the `filings.files` array in submissions JSON to find older filing JSON pages (`data.sec.gov/submissions/CIK{cik}-submissions-001.json`, etc.).
350
-
351
- - **`adsh` in search results is the accession number** — The search index returns `adsh` (no dashes). To build the document URL, insert dashes: `adsh[:10] + '-' + adsh[10:12] + '-' + adsh[12:]`, or use the `accessionNumber` field from submissions JSON (which already has dashes).
352
-
353
- - **`size` param is capped at 100** — Requesting `size=200` silently returns 100 hits. Walk results with `from=0`, `from=100`, etc. Maximum reachable index is 10,000 (Elasticsearch default).
354
-
355
- - **Search total `'gte'` relation means >10,000 hits** — When `total['relation'] == 'gte'`, there are more than 10,000 results (only first 10,000 accessible). Narrow with `dateRange` or `forms` filters.
356
-
357
- - **`company_tickers.json` covers only exchange-listed companies** — ~10,391 entries. Many SEC filers (private companies, bond issuers, FHLBs) have CIKs but no ticker. Find them via the full-text search aggregations or `submissions` lookup if you have the CIK.
358
-
359
- - **Filing document is XBRL-tagged HTML, 1–2MB** — Retrieving the actual 10-K HTML works but is large. For financial data extraction, always prefer the XBRL API endpoints over parsing the document.
360
-
361
- - **CIK format gotcha** — `company_tickers.json` returns `cik_str` as an int (`320193`). The submissions and xbrl APIs require a 10-digit zero-padded string in the filename (`CIK0000320193`). Always use `str(cik).zfill(10)` when building URLs.
1
+ # SEC EDGAR — Scraping & Data Extraction
2
+
3
+ `https://www.sec.gov` / `https://data.sec.gov` / `https://efts.sec.gov` — all public data, no auth required. Every workflow here is pure `http_get` — no browser needed.
4
+
5
+ ## Do this first
6
+
7
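+ **SEC policy asks for a descriptive User-Agent on every request. `www.sec.gov` enforces it and returns 403 for the `http_get` default; `data.sec.gov` and `efts.sec.gov` currently accept the default, but always pass `headers=UA` anyway.**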
8
+
9
+ ```python
10
+ import json
11
+ UA = {"User-Agent": "browser-harness research@example.com"}
12
+ # Format required: "CompanyName contact@email.com"
13
+ # "Mozilla/5.0" (http_get default) works on efts.sec.gov and data.sec.gov
14
+ # but FAILS on www.sec.gov (company_tickers.json, Archives/, etc.)
15
+ ```
16
+
17
+ Start with `company_tickers.json` to resolve any ticker → CIK in one call, then branch to whichever endpoint you need.
18
+
19
+ ```python
20
+ import json
21
+ UA = {"User-Agent": "browser-harness research@example.com"}
22
+ tickers = json.loads(http_get("https://www.sec.gov/files/company_tickers.json", headers=UA))
23
+ # 10,391 public companies, ~50KB, always fresh
24
+ # Entry format: {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."}
25
+
26
+ # Look up by ticker (exact, case-sensitive in the data)
27
+ aapl = next(v for v in tickers.values() if v['ticker'] == 'AAPL')
28
+ # {'cik_str': 320193, 'ticker': 'AAPL', 'title': 'Apple Inc.'}
29
+
30
+ # CIK is an int here; pad to 10 digits for API URLs
31
+ cik = str(aapl['cik_str']).zfill(10) # "0000320193"
32
+ ```
33
+
34
+ ## Common workflows
35
+
36
+ ### Ticker / name → CIK lookup
37
+
38
+ ```python
39
+ import json
40
+ UA = {"User-Agent": "browser-harness research@example.com"}
41
+ tickers = json.loads(http_get("https://www.sec.gov/files/company_tickers.json", headers=UA))
42
+
43
+ # By ticker
44
+ tsla = next((v for v in tickers.values() if v['ticker'] == 'TSLA'), None)
45
+ # {'cik_str': 1318605, 'ticker': 'TSLA', 'title': 'Tesla, Inc.'}
46
+
47
+ # By partial name match
48
+ apples = [v for v in tickers.values() if 'APPLE' in v['title'].upper()]
49
+ # [{'cik_str': 320193, 'ticker': 'AAPL', 'title': 'Apple Inc.'}, ...]
50
+ ```
51
+
52
+ ### Company submissions (metadata + recent filings list)
53
+
54
+ ```python
55
+ import json
56
+ UA = {"User-Agent": "browser-harness research@example.com"}
57
+ cik = "0000320193" # Apple - always zero-pad to 10 digits
58
+ data = json.loads(http_get(f"https://data.sec.gov/submissions/CIK{cik}.json", headers=UA))
59
+
60
+ print(data['name']) # "Apple Inc."
61
+ print(data['cik']) # "0000320193"
62
+ print(data['sic']) # "3571"
63
+ print(data['sicDescription']) # "Electronic Computers"
64
+ print(data['tickers']) # ["AAPL"]
65
+ print(data['exchanges']) # ["Nasdaq"]
66
+
67
+ # Most recent ~1,000 filings are in data['filings']['recent']
68
+ recent = data['filings']['recent']
69
+ # Fields per filing (parallel arrays, same index):
70
+ # accessionNumber, filingDate, reportDate, form, primaryDocument,
71
+ # primaryDocDescription, size, isXBRL, items, fileNumber
72
+
73
+ # Filter for 10-K and 10-Q only
74
+ filings_10k = [
75
+ (f, d, a, doc)
76
+ for f, d, a, doc in zip(
77
+ recent['form'], recent['filingDate'],
78
+ recent['accessionNumber'], recent['primaryDocument']
79
+ )
80
+ if f in ('10-K', '10-Q')
81
+ ]
82
+ # Result: [('10-Q', '2026-01-30', '0000320193-26-000006', 'aapl-20251227.htm'), ...]
83
+ ```
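The parallel-array layout above is easy to misalign when filtering on more than one field. A minimal sketch of one way to turn it into one dict per filing follows. The helper name `filings_as_rows` and the sample values are illustrative assumptions, not part of the SEC API or the harness.

```python
def filings_as_rows(recent, fields=("form", "filingDate", "accessionNumber", "primaryDocument")):
    """Zip the parallel arrays from data['filings']['recent'] into one dict per filing."""
    columns = [recent[name] for name in fields]
    return [dict(zip(fields, values)) for values in zip(*columns)]

# Synthetic stand-in for data['filings']['recent']; every value is a placeholder.
sample_recent = {
    "form": ["10-Q", "8-K", "10-K"],
    "filingDate": ["2000-03-01", "2000-02-01", "2000-01-01"],
    "accessionNumber": ["0000000000-00-000003", "0000000000-00-000002", "0000000000-00-000001"],
    "primaryDocument": ["doc-c.htm", "doc-b.htm", "doc-a.htm"],
}

rows = filings_as_rows(sample_recent)
# Same filter as the tuple version above, but fields are addressed by name.
periodic = [r for r in rows if r["form"] in ("10-K", "10-Q")]
```

Rows keep the original order of the arrays, so `rows[0]` corresponds to index 0 in every column.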
84
+
85
+ ### Build direct filing document URL
86
+
87
+ ```python
88
+ # Given accessionNumber and primaryDocument from submissions JSON:
89
+ accn = "0000320193-25-000079"
90
+ doc = "aapl-20250927.htm"
91
+ cik = "320193" # int part only (no leading zeros) for Archives path
92
+
93
+ accn_nodash = accn.replace("-", "")
94
+ url = f"https://www.sec.gov/Archives/edgar/data/{cik}/{accn_nodash}/{doc}"
95
+ # https://www.sec.gov/Archives/edgar/data/320193/000032019325000079/aapl-20250927.htm
96
+
97
+ # Full 10-K is 1.5MB of XBRL-tagged HTML — use http_get for text extraction
98
+ content = http_get(url, headers=UA) # UA required on www.sec.gov
99
+ ```
100
+
101
+ ### XBRL financial data — single company, one concept over time
102
+
103
+ ```python
104
+ import json
105
+ UA = {"User-Agent": "browser-harness research@example.com"}
106
+ cik_padded = "0000320193"
107
+
108
+ # companyconcept: one metric, all reported values (quarterly + annual)
109
+ data = json.loads(http_get(
110
+ f"https://data.sec.gov/api/xbrl/companyconcept/CIK{cik_padded}/us-gaap/Assets.json",
111
+ headers=UA
112
+ ))
113
+ # data keys: cik, taxonomy, tag, label, description, entityName, units
114
+ # data['units']['USD'] -> list of {end, val, accn, fy, fp, form, filed}
115
+
116
+ entries = data['units']['USD']
117
+
118
+ # Deduplicate: same period re-reported across multiple filings — keep latest
119
+ def annual_series(entries):
120
+ seen = {}
121
+ for e in entries:
122
+ if e.get('form') == '10-K' and e.get('fp') == 'FY':
123
+ end = e['end']
124
+ if end not in seen or e['filed'] > seen[end]['filed']:
125
+ seen[end] = e
126
+ return [seen[k] for k in sorted(seen)]
127
+
128
+ assets = annual_series(entries)
129
+ for e in assets[-5:]:
130
+ print(f"{e['end']} ${e['val']/1e9:.1f}B")
131
+ # 2021-09-25 $351.0B
132
+ # 2022-09-24 $352.8B
133
+ # 2023-09-30 $352.6B
134
+ # 2024-09-28 $365.0B
135
+ # 2025-09-27 $359.2B
136
+ ```
137
+
138
+ ### XBRL financial data — all US-GAAP metrics for a company
139
+
140
+ ```python
141
+ import json
142
+ UA = {"User-Agent": "browser-harness research@example.com"}
143
+
144
+ # companyfacts: all reported XBRL concepts in one ~5MB call
145
+ data = json.loads(http_get(
146
+ "https://data.sec.gov/api/xbrl/companyfacts/CIK0000320193.json",
147
+ headers=UA
148
+ ))
149
+ # data['entityName'] = "Apple Inc."
150
+ # data['facts'] = {'us-gaap': {...503 concepts...}, 'dei': {...}}
151
+
152
+ usgaap = data['facts']['us-gaap']
153
+ print(len(usgaap)) # 503 concepts for Apple
154
+
155
+ # Common concept names (companies vary — check what's available):
156
+ # Revenue: RevenueFromContractWithCustomerExcludingAssessedTax (post-2018 standard)
157
+ # SalesRevenueNet (older filings)
158
+ # Revenues (some companies still use)
159
+ # Net income: NetIncomeLoss
160
+ # Assets: Assets
161
+ # Cash: CashAndCashEquivalentsAtCarryingValue
162
+ # EPS: EarningsPerShareBasic, EarningsPerShareDiluted
163
+
164
+ # Find all revenue-related concepts this company reported:
165
+ revenue_keys = [k for k in usgaap if 'Revenue' in k]
166
+
167
+ # Extract annual revenue — handle company-specific concept name
168
+ for concept in ['RevenueFromContractWithCustomerExcludingAssessedTax', 'SalesRevenueNet', 'Revenues']:
169
+ if concept in usgaap:
170
+ entries = usgaap[concept]['units'].get('USD', [])
171
+ annual = {}
172
+ for e in entries:
173
+ if e.get('form') == '10-K' and e.get('fp') == 'FY':
174
+ end = e['end']
175
+ if end not in annual or e['filed'] > annual[end]['filed']:
176
+ annual[end] = e
177
+ if annual:
178
+ print(f"Using: {concept}")
179
+ for end in sorted(annual)[-3:]:
180
+ print(f" {end} ${annual[end]['val']/1e9:.1f}B")
181
+ break
182
+ # Apple output:
183
+ # Using: RevenueFromContractWithCustomerExcludingAssessedTax
184
+ # 2023-09-30 $383.3B
185
+ # 2024-09-28 $391.0B
186
+ # 2025-09-27 $416.2B
187
+ ```
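For duration concepts such as revenue, a shorter period can share the same `end` date as the full year, so deduplicating on `end` alone could keep the wrong value. The sketch below adds a span check. It assumes duration entries carry a `start` field alongside `end` (as the frames entries above do); the 350 to 380 day window and the function names are illustrative choices, not SEC rules.

```python
from datetime import date

def is_full_year(entry, min_days=350, max_days=380):
    """True when a duration fact spans roughly one fiscal year (assumed window)."""
    start = date.fromisoformat(entry["start"])
    end = date.fromisoformat(entry["end"])
    return min_days <= (end - start).days <= max_days

def annual_duration_series(entries):
    """Latest-filed full-year value per period end, from 10-K / FY entries only."""
    latest = {}
    for e in entries:
        if e.get("form") != "10-K" or e.get("fp") != "FY":
            continue
        if "start" not in e or not is_full_year(e):
            continue  # skip instant facts and sub-annual durations
        end = e["end"]
        if end not in latest or e["filed"] > latest[end]["filed"]:
            latest[end] = e
    return [latest[k] for k in sorted(latest)]

# Synthetic entries (values are placeholders, not real filings).
sample = [
    {"start": "2023-01-01", "end": "2023-12-31", "val": 400, "form": "10-K", "fp": "FY", "filed": "2024-02-01"},
    {"start": "2023-10-01", "end": "2023-12-31", "val": 110, "form": "10-K", "fp": "FY", "filed": "2024-02-01"},
    {"start": "2023-01-01", "end": "2023-12-31", "val": 405, "form": "10-K", "fp": "FY", "filed": "2025-02-01"},
]
series = annual_duration_series(sample)
```

Balance-sheet concepts such as `Assets` are instantaneous and have no span to check, so the simpler `end`-only deduplication remains appropriate for them.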
188
+
189
+ ### Cross-company financial comparison (XBRL frames)
190
+
191
+ ```python
192
+ import json
193
+ UA = {"User-Agent": "browser-harness research@example.com"}
194
+
195
+ # frames: one concept, one period, all companies that reported it
196
+ # Period formats:
197
+ # CY2024 = calendar year 2024 (annual)
198
+ # CY2024Q4I = Q4 2024 instantaneous (balance sheet items)
199
+ # CY2024Q4 = Q4 2024 duration (income statement items)
200
+
201
+ # Top companies by annual revenue (2024)
202
+ data = json.loads(http_get(
203
+ "https://data.sec.gov/api/xbrl/frames/us-gaap/RevenueFromContractWithCustomerExcludingAssessedTax/USD/CY2024.json",
204
+ headers=UA
205
+ ))
206
+ companies = sorted(data['data'], key=lambda x: x['val'], reverse=True)
207
+ # data['data'] entries: {accn, cik, entityName, loc, start, end, val}
208
+ for c in companies[:5]:
209
+ print(f"{c['entityName']:<40} ${c['val']/1e9:.0f}B")
210
+ # Walmart Inc. $675B
211
+ # AMAZON.COM, INC. $638B
212
+ # Apple Inc. $391B
213
+ # McKESSON CORPORATION $359B
214
+ # Alphabet Inc. $350B
215
+
216
+ # Total assets snapshot end of 2024 (balance sheet = instantaneous)
217
+ data2 = json.loads(http_get(
218
+ "https://data.sec.gov/api/xbrl/frames/us-gaap/Assets/USD/CY2024Q4I.json",
219
+ headers=UA
220
+ ))
221
+ # 6,229 companies for this frame
222
+ ```
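The period formats listed above can be built programmatically instead of typed by hand. This is a small sketch covering only the three forms shown in this document; `frame_period` and `frame_url` are hypothetical helper names.

```python
def frame_period(year, quarter=None, instant=False):
    """Build a frames period string: CY2024, CY2024Q4, or CY2024Q4I."""
    if quarter is not None and quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3, or 4")
    if instant and quarter is None:
        # Only quarter-level instantaneous frames are documented above.
        raise ValueError("instantaneous frames need a quarter")
    period = f"CY{year}"
    if quarter is not None:
        period += f"Q{quarter}"
    if instant:
        period += "I"
    return period

def frame_url(concept, period, taxonomy="us-gaap", unit="USD"):
    """Assemble the frames endpoint URL from the API reference pattern."""
    return f"https://data.sec.gov/api/xbrl/frames/{taxonomy}/{concept}/{unit}/{period}.json"

assets_url = frame_url("Assets", frame_period(2024, quarter=4, instant=True))
```

Use `instant=True` for balance-sheet concepts and leave it off for income-statement concepts, matching the comments in the block above.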
223
+
224
+ ### Full-text search across all filings
225
+
226
+ ```python
227
+ import json
228
+ UA = {"User-Agent": "browser-harness research@example.com"}
229
+
230
+ # Search for any phrase across filing documents
231
+ # Params: q (quoted phrase), forms (comma-separated), dateRange=custom,
232
+ # startdt, enddt, size (max 100), from (offset for pagination)
233
+ url = (
234
+ "https://efts.sec.gov/LATEST/search-index"
235
+ "?q=%22climate+risk%22"
236
+ "&forms=10-K"
237
+ "&dateRange=custom&startdt=2024-01-01"
238
+ "&size=10&from=0"
239
+ )
240
+ data = json.loads(http_get(url, headers=UA))
241
+ # Note: default http_get UA (Mozilla/5.0) works fine on efts.sec.gov
242
+
243
+ print(data['hits']['total']['value']) # e.g. 1438 matching documents
244
+ hits = data['hits']['hits'] # up to 100 per call
245
+
246
+ for h in hits:
247
+ src = h['_source']
248
+ # Key fields: display_names, ciks, form, file_date, adsh (accession), period_ending
249
+ name = src['display_names'][0] if src.get('display_names') else '?'
250
+ cik = src['ciks'][0] if src.get('ciks') else '?'
251
+ print(f"{name} form={src['form']} filed={src['file_date']} accn={src['adsh']}")
252
+
253
+ # Pagination: max 100 per page, use from= to walk through results
254
+ # Page 2: from=100, Page 3: from=200, etc.
255
+ for page in range(0, 300, 100):
256
+     page_url = url.replace("&size=10&from=0", f"&size=100&from={page}")  # swap size/offset in place; appending would duplicate from=
257
+ page_data = json.loads(http_get(page_url, headers=UA))
258
+ if not page_data['hits']['hits']:
259
+ break
260
+ # process...
261
+
262
+ # Aggregations — group hits by entity, SIC, state
263
+ aggs = data['aggregations']
264
+ top_entities = aggs['entity_filter']['buckets'] # [{key: "Name (CIK...)", doc_count: N}, ...]
265
+ top_sics = aggs['sic_filter']['buckets']
266
+ top_states = aggs['biz_states_filter']['buckets']
267
+ ```
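Hand-assembled query strings are easy to get wrong when paginating. The sketch below builds the search URL with the standard library's `urlencode`, using only the parameters named above (`q`, `forms`, `dateRange`, `startdt`, `enddt`, `size`, `from`). The function name `search_url` and its keyword arguments are assumptions made for illustration.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://efts.sec.gov/LATEST/search-index"

def search_url(phrase, forms=None, startdt=None, enddt=None, size=100, offset=0):
    """Build a full-text search URL for an exact phrase."""
    params = {"q": f'"{phrase}"'}  # quotes request an exact phrase match
    if forms:
        params["forms"] = ",".join(forms)
    if startdt or enddt:
        params["dateRange"] = "custom"
        if startdt:
            params["startdt"] = startdt
        if enddt:
            params["enddt"] = enddt
    params["size"] = min(size, 100)  # the endpoint caps size at 100
    params["from"] = offset
    return f"{SEARCH_BASE}?{urlencode(params)}"

first_page = search_url("climate risk", forms=["10-K"], startdt="2024-01-01", size=10)
# One URL per page of 100 hits; fetch each with http_get(url, headers=UA).
page_urls = [search_url("climate risk", forms=["10-K"], offset=o) for o in range(0, 300, 100)]
```

Because each page URL is built from scratch, the `from` parameter appears exactly once per request.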
268
+
269
+ ### Find a company's CIK by name search (via search aggregations)
270
+
271
+ ```python
272
+ import json, re
273
+ UA = {"User-Agent": "browser-harness research@example.com"}
274
+
275
+ # Best method: company_tickers.json (fastest, all tickers)
276
+ tickers = json.loads(http_get("https://www.sec.gov/files/company_tickers.json", headers=UA))
277
+ msft = next(v for v in tickers.values() if v['ticker'] == 'MSFT')
278
+ # CIK = msft['cik_str'] → 789019
279
+
280
+ # Alternative: full-text search aggregations (finds CIK from company name)
281
+ data = json.loads(http_get(
282
+ "https://efts.sec.gov/LATEST/search-index?q=%22microsoft+corporation%22&forms=10-K",
283
+ headers=UA
284
+ ))
285
+ buckets = data['aggregations']['entity_filter']['buckets']
286
+ # [{'key': 'MICROSOFT CORP (MSFT) (CIK 0000789019)', 'doc_count': 11}, ...]
287
+ for b in buckets[:3]:
288
+ m = re.search(r'\(CIK (\d+)\)', b['key'])
289
+ if m:
290
+ print(f"{b['key'][:50]} → CIK {m.group(1)}")
291
+ ```
292
+
293
+ ### Parallel fetching (multiple companies)
294
+
295
+ ```python
296
+ import json
297
+ from concurrent.futures import ThreadPoolExecutor
298
+
299
+ UA = {"User-Agent": "browser-harness research@example.com"}
300
+
301
+ def get_company_meta(ticker_cik):
302
+ ticker, cik = ticker_cik
303
+ subs = json.loads(http_get(f"https://data.sec.gov/submissions/CIK{cik}.json", headers=UA))
304
+ return {"ticker": ticker, "name": subs['name'], "sic": subs['sic']}
305
+
306
+ companies = [("AAPL", "0000320193"), ("TSLA", "0001318605"), ("MSFT", "0000789019")]
307
+ with ThreadPoolExecutor(max_workers=3) as ex:
308
+ results = list(ex.map(get_company_meta, companies))
309
+ # Confirmed: 3 requests complete in ~0.28s
310
+ # SEC rate limit: 10 req/sec — stay at max_workers ≤ 8 to be safe
311
+ ```
312
+
313
+ ## API reference
314
+
315
+ | Endpoint | What it returns | UA required |
316
+ |---|---|---|
317
+ | `www.sec.gov/files/company_tickers.json` | All 10,391 tickers → CIK mapping | YES |
318
+ | `data.sec.gov/submissions/CIK{10-digit}.json` | Company meta + ~1000 recent filings | YES |
319
+ | `data.sec.gov/api/xbrl/companyfacts/CIK{10-digit}.json` | All XBRL facts (~5MB) | YES |
320
+ | `data.sec.gov/api/xbrl/companyconcept/CIK{10-digit}/{taxonomy}/{concept}.json` | One concept, all values | YES |
321
+ | `data.sec.gov/api/xbrl/frames/{taxonomy}/{concept}/{unit}/{period}.json` | All companies for one period | YES |
322
+ | `efts.sec.gov/LATEST/search-index?q=...` | Full-text search across filings | NO (Mozilla/5.0 works) |
323
+ | `www.sec.gov/Archives/edgar/data/{cik}/{accn-nodash}/{doc}` | Actual filing document | YES |
324
+
325
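+ The "UA required" column reflects SEC policy, not observed enforcement. `data.sec.gov` currently accepts `Mozilla/5.0` (the http_get default), while `www.sec.gov` (Archives, company_tickers) rejects it and requires the `"CompanyName email@example.com"` format.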
326
+
327
+ ## Rate limits
328
+
329
+ SEC documents a **10 requests/second** limit. In practice:
330
+ - 12 rapid sequential calls to `data.sec.gov` completed in 2.4s (5 req/s) with no throttling.
331
+ - 3 parallel calls completed in 0.28s without issue.
332
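+ - Keep `max_workers ≤ 8` for ThreadPoolExecutor as a conservative default. Worker count limits concurrency, not request rate, so fast responses can still exceed 10 req/s unless calls are also throttled.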
333
+ - No per-day or per-hour cap documented; the 10/s limit is the only stated constraint.
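Since a worker cap alone does not bound requests per second, one option is a shared limiter that spaces calls across threads. The following is a minimal sketch, not part of the harness: `RateLimiter` is a hypothetical class, and the injectable `clock` and `sleep` arguments exist only so the spacing logic can be checked without real waiting.

```python
import threading
import time

class RateLimiter:
    """Hands out time slots at least 1/max_per_second apart, shared across threads."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = None

    def wait(self):
        """Block until this caller's slot arrives; return the delay that was applied."""
        with self._lock:
            now = self._clock()
            if self._next_slot is None or self._next_slot < now:
                self._next_slot = now  # idle long enough: the next slot is immediate
            delay = self._next_slot - now
            self._next_slot += self._interval  # reserve the following slot
        if delay > 0:
            self._sleep(delay)  # sleep outside the lock so other threads can reserve slots
        return delay

# Intended use inside a worker (http_get and UA come from the harness examples above):
#   limiter = RateLimiter(8)
#   def fetch(url):
#       limiter.wait()
#       return http_get(url, headers=UA)
```

Setting the limiter below 10 per second leaves some headroom for retries and for other processes sharing the same IP address.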
334
+
335
+ ## Gotchas
336
+
337
+ - **`www.sec.gov` returns 403 with `Mozilla/5.0` UA** — The http_get default (`"Mozilla/5.0"`) works on `data.sec.gov` and `efts.sec.gov` but is blocked on `www.sec.gov`. Always pass `headers=UA` where UA includes your company name and email. Confirmed: `"python-requests/2.28"` → 403.
338
+
339
+ - **`data.sec.gov` is more permissive** — `Mozilla/5.0` works on `data.sec.gov` (submissions, xbrl). But always use the proper UA anyway — SEC's policy page explicitly requires it and they can add stricter checks at any time.
340
+
341
+ - **XBRL contains duplicate entries per period** — The same fiscal year end date appears multiple times when a company restates or re-files. Each entry has a `filed` date and `accn` (accession). To get the canonical value, deduplicate by `end` keeping the entry with the latest `filed` date.
342
+
343
+ - **Revenue concept name varies by company and era** — There is no single canonical "revenue" concept. Apple uses `RevenueFromContractWithCustomerExcludingAssessedTax`. Microsoft uses the same for recent years, but older filings used `SalesRevenueNet`. Always check which concepts are actually present: `[k for k in usgaap if 'Revenue' in k]`.
344
+
345
+ - **`fp` field for annual filings is `'FY'`, but quarterly values also appear in 10-K** — A 10-K re-reports each quarter (fp=Q1, Q2, Q3) plus the full year (fp=FY). Filter on both `form == '10-K'` AND `fp == 'FY'` to get only annual totals.
346
+
347
+ - **`companyfacts` is ~5MB per company** — For a single metric, use `companyconcept` instead (much smaller). Only use `companyfacts` when you need multiple concepts from the same company.
348
+
349
+ - **`submissions` recent filings cap at ~1,000** — Very old filings don't appear. If you need historical data before that window, use the `filings.files` array in submissions JSON to find older filing JSON pages (`data.sec.gov/submissions/CIK{cik}-submissions-001.json`, etc.).
350
+
351
+ - **`adsh` in search results is the accession number** — The search index returns `adsh` (no dashes). To build the document URL, insert dashes: `adsh[:10] + '-' + adsh[10:12] + '-' + adsh[12:]`, or use the `accessionNumber` field from submissions JSON (which already has dashes).
352
+
353
+ - **`size` param is capped at 100** — Requesting `size=200` silently returns 100 hits. Walk results with `from=0`, `from=100`, etc. Maximum reachable index is 10,000 (Elasticsearch default).
354
+
355
+ - **Search total `'gte'` relation means the count is a lower bound** — When `total['relation'] == 'gte'`, there are at least 10,000 results and only the first 10,000 are reachable. Narrow with `dateRange` or `forms` filters.
356
+
357
+ - **`company_tickers.json` covers only exchange-listed companies** — ~10,391 entries. Many SEC filers (private companies, bond issuers, FHLBs) have CIKs but no ticker. Find them via the full-text search aggregations or `submissions` lookup if you have the CIK.
358
+
359
+ - **Filing document is XBRL-tagged HTML, 1–2MB** — Retrieving the actual 10-K HTML works but is large. For financial data extraction, always prefer the XBRL API endpoints over parsing the document.
360
+
361
+ - **CIK format gotcha** — `company_tickers.json` returns `cik_str` as an int (`320193`). The submissions and xbrl APIs require a 10-digit zero-padded string in the filename (`CIK0000320193`). Always use `str(cik).zfill(10)` when building URLs.
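The identifier rules in the gotchas above (zero-padded CIK for the APIs, unpadded CIK and undashed accession for Archives paths) can be collected into a few small helpers. This is a sketch under those stated rules; the function names are hypothetical.

```python
def pad_cik(cik):
    """10-digit zero-padded CIK for data.sec.gov URLs, from an int or a string."""
    return str(int(cik)).zfill(10)

def dashed_accession(accn):
    """Normalize an accession number to the dashed 10-2-6 form."""
    digits = accn.replace("-", "")
    if len(digits) != 18 or not digits.isdigit():
        raise ValueError(f"unexpected accession number: {accn!r}")
    return f"{digits[:10]}-{digits[10:12]}-{digits[12:]}"

def archive_url(cik, accn, doc):
    """Filing document URL: unpadded CIK, accession without dashes."""
    return f"https://www.sec.gov/Archives/edgar/data/{int(cik)}/{accn.replace('-', '')}/{doc}"

url = archive_url("0000320193", "0000320193-25-000079", "aapl-20250927.htm")
```

`dashed_accession` accepts either form, so it can be applied to values from the search index and from submissions JSON alike.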