@pencil-agent/nano-pencil 2.0.1 → 2.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (186)
  1. package/README.md +267 -267
  2. package/dist/build-meta.json +3 -3
  3. package/dist/core/export-html/AGENT.md +11 -11
  4. package/dist/core/export-html/template.css +971 -971
  5. package/dist/core/export-html/template.html +54 -54
  6. package/dist/core/model/custom-providers.js +1 -1
  7. package/dist/core/model-registry.js +5 -5
  8. package/dist/extensions/builtin/AGENT.md +115 -115
  9. package/dist/extensions/builtin/browser/AGENT.md +17 -17
  10. package/dist/extensions/builtin/browser/agent-workspace/agent_helpers.py +12 -12
  11. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/amazon/product-search.md +198 -198
  12. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/archive-org/scraping.md +341 -341
  13. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv/scraping.md +311 -311
  14. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/arxiv-bulk/scraping.md +333 -333
  15. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/atlas/overview.md +70 -70
  16. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/booking-com/scraping.md +578 -578
  17. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/capterra/scraping.md +440 -440
  18. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/centilebrain/generate-estimates.md +110 -110
  19. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coingecko/scraping.md +325 -325
  20. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coinmarketcap/scraping.md +463 -463
  21. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/coursera/scraping.md +360 -360
  22. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/craigslist/scraping.md +390 -390
  23. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/crossref/scraping.md +568 -568
  24. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/dev-to/scraping.md +323 -323
  25. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/duckduckgo/scraping.md +349 -349
  26. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/ebay/scraping.md +435 -435
  27. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/etsy/scraping.md +506 -506
  28. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/eventbrite/scraping.md +363 -363
  29. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/expedia/automation.md +168 -168
  30. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/groups.md +236 -236
  31. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/facebook/pages.md +295 -295
  32. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/framer/editor.md +108 -108
  33. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/fred/scraping.md +493 -493
  34. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/g2/scraping.md +580 -580
  35. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/genius/scraping.md +511 -511
  36. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/repo-actions.md +65 -65
  37. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/github/scraping.md +184 -184
  38. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/glassdoor/scraping.md +543 -543
  39. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gmail/compose.md +122 -122
  40. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/goodreads/scraping.md +461 -461
  41. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/gutenberg/scraping.md +383 -383
  42. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/hackernews/scraping.md +243 -243
  43. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/howlongtobeat/scraping.md +473 -473
  44. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/imdb/scraping.md +271 -271
  45. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/itch-io/scraping.md +436 -436
  46. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md +1021 -1021
  47. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/letterboxd/scraping.md +349 -349
  48. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/linkedin/invitation-manager.md +109 -109
  49. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/loom/folder-enumeration.md +170 -170
  50. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/macrotrends/scraping.md +537 -537
  51. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/article-hydration.md +120 -120
  52. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/medium/scraping.md +414 -414
  53. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/metacritic/scraping.md +477 -477
  54. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/musicbrainz/scraping.md +478 -478
  55. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/nasa/scraping.md +339 -339
  56. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/news-aggregation/multi-source.md +205 -205
  57. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/open-library/scraping.md +472 -472
  58. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openalex/scraping.md +470 -470
  59. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/openstreetmap/scraping.md +490 -490
  60. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/package-registries/npm-pypi.md +478 -478
  61. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/polymarket/scraping.md +234 -234
  62. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/producthunt/scraping.md +307 -307
  63. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/pubmed/scraping.md +421 -421
  64. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/quora/scraping.md +364 -364
  65. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rawg/scraping.md +352 -352
  66. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/reddit/scraping.md +124 -124
  67. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/rest-countries/scraping.md +233 -233
  68. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/sec-edgar/scraping.md +361 -361
  69. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/README.md +36 -36
  70. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/embedded-apps.md +72 -72
  71. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/knowledge-base.md +109 -109
  72. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/shopify-admin/polaris-inputs.md +137 -137
  73. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/soundcloud/scraping.md +362 -362
  74. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/spotify/scraping.md +339 -339
  75. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/stackoverflow/scraping.md +435 -435
  76. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/steam/scraping.md +575 -575
  77. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/substack/scraping.md +338 -338
  78. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/thetechgeeks/pricing.md +52 -52
  79. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tiktok/upload.md +107 -107
  80. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/tradingview/scraping.md +309 -309
  81. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trello/boards-and-lists.md +88 -88
  82. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/trustpilot/scraping.md +375 -375
  83. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/walmart/scraping.md +444 -444
  84. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wayback-machine/scraping.md +306 -306
  85. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/weather/scraping.md +398 -398
  86. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/wellfound/scraping.md +596 -596
  87. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/world-bank/scraping.md +356 -356
  88. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/xiaohongshu/scraping.md +84 -84
  89. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/youtube/scraping.md +418 -418
  90. package/dist/extensions/builtin/browser/agent-workspace/domain-skills/zillow/scraping.md +433 -433
  91. package/dist/extensions/builtin/browser/browser.md +73 -73
  92. package/dist/extensions/builtin/browser/install.md +142 -142
  93. package/dist/extensions/builtin/browser/interaction-skills/connection.md +48 -48
  94. package/dist/extensions/builtin/browser/interaction-skills/cookies.md +3 -3
  95. package/dist/extensions/builtin/browser/interaction-skills/cross-origin-iframes.md +3 -3
  96. package/dist/extensions/builtin/browser/interaction-skills/dialogs.md +64 -64
  97. package/dist/extensions/builtin/browser/interaction-skills/downloads.md +3 -3
  98. package/dist/extensions/builtin/browser/interaction-skills/drag-and-drop.md +3 -3
  99. package/dist/extensions/builtin/browser/interaction-skills/dropdowns.md +3 -3
  100. package/dist/extensions/builtin/browser/interaction-skills/iframes.md +3 -3
  101. package/dist/extensions/builtin/browser/interaction-skills/network-requests.md +3 -3
  102. package/dist/extensions/builtin/browser/interaction-skills/print-as-pdf.md +3 -3
  103. package/dist/extensions/builtin/browser/interaction-skills/profile-sync.md +90 -90
  104. package/dist/extensions/builtin/browser/interaction-skills/screenshots.md +17 -17
  105. package/dist/extensions/builtin/browser/interaction-skills/scrolling.md +3 -3
  106. package/dist/extensions/builtin/browser/interaction-skills/shadow-dom.md +3 -3
  107. package/dist/extensions/builtin/browser/interaction-skills/tabs.md +69 -69
  108. package/dist/extensions/builtin/browser/interaction-skills/uploads.md +1 -1
  109. package/dist/extensions/builtin/browser/interaction-skills/viewport.md +3 -3
  110. package/dist/extensions/builtin/browser/src/browser_harness/AGENT.md +15 -15
  111. package/dist/extensions/builtin/browser/src/browser_harness/__init__.py +8 -8
  112. package/dist/extensions/builtin/browser/src/browser_harness/_ipc.py +90 -90
  113. package/dist/extensions/builtin/browser/src/browser_harness/admin.py +722 -722
  114. package/dist/extensions/builtin/browser/src/browser_harness/daemon.py +328 -328
  115. package/dist/extensions/builtin/browser/src/browser_harness/helpers.py +396 -396
  116. package/dist/extensions/builtin/browser/src/browser_harness/run.py +103 -103
  117. package/dist/extensions/builtin/discipline/skills/brainstorming/SKILL.md +33 -33
  118. package/dist/extensions/builtin/discipline/skills/executing-plans/SKILL.md +25 -25
  119. package/dist/extensions/builtin/discipline/skills/finishing-development-branch/SKILL.md +25 -25
  120. package/dist/extensions/builtin/discipline/skills/receiving-code-review/SKILL.md +22 -22
  121. package/dist/extensions/builtin/discipline/skills/requesting-code-review/SKILL.md +31 -31
  122. package/dist/extensions/builtin/discipline/skills/systematic-debugging/SKILL.md +28 -28
  123. package/dist/extensions/builtin/discipline/skills/test-driven-development/SKILL.md +32 -32
  124. package/dist/extensions/builtin/discipline/skills/using-git-worktrees/SKILL.md +25 -25
  125. package/dist/extensions/builtin/discipline/skills/verification-before-completion/SKILL.md +27 -27
  126. package/dist/extensions/builtin/discipline/skills/writing-plans/SKILL.md +26 -26
  127. package/dist/extensions/builtin/goal/README.md +67 -67
  128. package/dist/extensions/builtin/grub/README.md +112 -112
  129. package/dist/extensions/builtin/link-world/agent-workspace/README.md +16 -16
  130. package/dist/extensions/builtin/link-world/internet-search/internet-search.md +65 -65
  131. package/dist/extensions/builtin/link-world/link-world-agent.md +82 -82
  132. package/dist/extensions/builtin/link-world/linkworld.md +313 -313
  133. package/dist/extensions/builtin/link-world/network-routing/network-routing.md +67 -67
  134. package/dist/extensions/builtin/loop/README.md +92 -92
  135. package/dist/extensions/builtin/mcp/figma-design.md +68 -68
  136. package/dist/extensions/builtin/mcp/mcp-management.md +85 -85
  137. package/dist/extensions/builtin/recap/AGENT.md +15 -15
  138. package/dist/extensions/builtin/sal/README.md +72 -72
  139. package/dist/extensions/builtin/security-audit/README.md +289 -289
  140. package/dist/extensions/builtin/team/AGENT.md +112 -112
  141. package/dist/extensions/builtin/team/TESTING.md +299 -299
  142. package/dist/extensions/builtin/token-save/README.md +56 -56
  143. package/dist/extensions/optional/AGENT.md +10 -10
  144. package/dist/modes/interactive/controllers/input-submit-controller.js +2 -2
  145. package/dist/modes/interactive/controllers/stream-render-controller.js +2 -2
  146. package/dist/modes/interactive/interactive-mode.js +19 -19
  147. package/dist/modes/interactive/theme/dark.json +85 -85
  148. package/dist/modes/interactive/theme/light.json +84 -84
  149. package/dist/modes/interactive/theme/theme-schema.json +335 -335
  150. package/dist/modes/interactive/theme/warm.json +81 -81
  151. package/dist/node_modules/@pencil-agent/ai/dist/cli.js +0 -0
  152. package/dist/node_modules/@pencil-agent/ai/dist/models.generated.js +1 -1
  153. package/docs/ACP/345/215/217/350/256/256/351/233/206/346/210/220/345/274/200/345/217/221/346/226/207/346/241/243.md +851 -0
  154. package/docs/SDK-TESTING.md +364 -0
  155. package/docs/codex-goal-command-impl.md +1055 -1055
  156. package/docs/codex-goal-vs-grub.md +500 -500
  157. package/docs/custom-provider.md +27 -27
  158. package/docs/extensions.md +27 -27
  159. package/docs/keybindings.md +27 -27
  160. package/docs/loop 重构完成总结.md +250 -250
  161. package/docs/loop 重构完成报告.md +122 -122
  162. package/docs/loop 重构方案.md +1222 -1222
  163. package/docs/loop 重构方案实现报告.md +158 -158
  164. package/docs/loop 重构方案对比分析.md +128 -128
  165. package/docs/loop 重构计划.md +320 -320
  166. package/docs/loop-usage-examples.md +214 -214
  167. package/docs/mem-core/346/212/200/346/234/257/346/226/207/346/241/243.md +593 -0
  168. package/docs/models.md +27 -27
  169. package/docs/packages.md +27 -27
  170. package/docs/pi-design-philosophy.md +457 -457
  171. package/docs/planmode.md +1987 -1987
  172. package/docs/prompt-templates.md +27 -27
  173. package/docs/providers.md +27 -27
  174. package/docs/sdk.md +27 -27
  175. package/docs/skills.md +27 -27
  176. package/docs/startup-performance-optimization.md +301 -0
  177. package/docs/themes.md +27 -27
  178. package/docs/tui.md +27 -27
  179. package/docs/认知地图.md +47 -0
  180. package/package.json +190 -190
  181. package/docs/cc-agent-design.md +0 -1297
  182. package/docs/cc-tui-design.md +0 -1333
  183. package/docs/nanoPencil-学习计划.md +0 -170
  184. package/docs/scan-report.md +0 -3820
  185. package/docs//345/257/271/346/240/207Claude-Code.md +0 -1775
  186. package/docs//351/230/277/351/207/214/345/267/264/345/267/264/350/264/242/346/212/245/345/210/206/346/236/220/344/271/246.md +0 -261
@@ -1,361 +1,361 @@
1
- # SEC EDGAR — Scraping & Data Extraction
2
-
3
- `https://www.sec.gov` / `https://data.sec.gov` / `https://efts.sec.gov` — all public data, no auth required. Every workflow here is pure `http_get` — no browser needed.
4
-
5
- ## Do this first
6
-
7
- **SEC.gov requires a custom User-Agent on `www.sec.gov` and `data.sec.gov`. Always pass `headers=UA` or you get 403.**
8
-
9
- ```python
10
- import json
11
- UA = {"User-Agent": "browser-harness research@example.com"}
12
- # Format required: "CompanyName contact@email.com"
13
- # "Mozilla/5.0" (http_get default) works on efts.sec.gov and data.sec.gov
14
- # but FAILS on www.sec.gov (company_tickers.json, Archives/, etc.)
15
- ```
16
-
17
- Start with `company_tickers.json` to resolve any ticker → CIK in one call, then branch to whichever endpoint you need.
18
-
19
- ```python
20
- import json
21
- UA = {"User-Agent": "browser-harness research@example.com"}
22
- tickers = json.loads(http_get("https://www.sec.gov/files/company_tickers.json", headers=UA))
23
- # 10,391 public companies, ~50KB, always fresh
24
- # Entry format: {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."}
25
-
26
- # Look up by ticker (exact, case-sensitive in the data)
27
- aapl = next(v for v in tickers.values() if v['ticker'] == 'AAPL')
28
- # {'cik_str': 320193, 'ticker': 'AAPL', 'title': 'Apple Inc.'}
29
-
30
- # CIK is an int here; pad to 10 digits for API URLs
31
- cik = str(aapl['cik_str']).zfill(10) # "0000320193"
32
- ```
33
-
34
- ## Common workflows
35
-
36
- ### Ticker / name → CIK lookup
37
-
38
- ```python
39
- import json
40
- UA = {"User-Agent": "browser-harness research@example.com"}
41
- tickers = json.loads(http_get("https://www.sec.gov/files/company_tickers.json", headers=UA))
42
-
43
- # By ticker
44
- tsla = next((v for v in tickers.values() if v['ticker'] == 'TSLA'), None)
45
- # {'cik_str': 1318605, 'ticker': 'TSLA', 'title': 'Tesla, Inc.'}
46
-
47
- # By partial name match
48
- apples = [v for v in tickers.values() if 'APPLE' in v['title'].upper()]
49
- # [{'cik_str': 320193, 'ticker': 'AAPL', 'title': 'Apple Inc.'}, ...]
50
- ```
51
-
52
- ### Company submissions (metadata + recent filings list)
53
-
54
- ```python
55
- import json
56
- UA = {"User-Agent": "browser-harness research@example.com"}
57
- cik = "0000320193" # Apple - always zero-pad to 10 digits
58
- data = json.loads(http_get(f"https://data.sec.gov/submissions/CIK{cik}.json", headers=UA))
59
-
60
- print(data['name']) # "Apple Inc."
61
- print(data['cik']) # "0000320193"
62
- print(data['sic']) # "3571"
63
- print(data['sicDescription']) # "Electronic Computers"
64
- print(data['tickers']) # ["AAPL"]
65
- print(data['exchanges']) # ["Nasdaq"]
66
-
67
- # Most recent ~1,000 filings are in data['filings']['recent']
68
- recent = data['filings']['recent']
69
- # Fields per filing (parallel arrays, same index):
70
- # accessionNumber, filingDate, reportDate, form, primaryDocument,
71
- # primaryDocDescription, size, isXBRL, items, fileNumber
72
-
73
- # Filter for 10-K and 10-Q only
74
- filings_10k = [
75
-     (f, d, a, doc)
76
-     for f, d, a, doc in zip(
77
-         recent['form'], recent['filingDate'],
78
-         recent['accessionNumber'], recent['primaryDocument']
79
-     )
80
-     if f in ('10-K', '10-Q')
81
- ]
82
- # Result: [('10-Q', '2026-01-30', '0000320193-26-000006', 'aapl-20251227.htm'), ...]
83
- ```
84
-
85
- ### Build direct filing document URL
86
-
87
- ```python
88
- # Given accessionNumber and primaryDocument from submissions JSON:
89
- accn = "0000320193-25-000079"
90
- doc = "aapl-20250927.htm"
91
- cik = "320193" # int part only (no leading zeros) for Archives path
92
-
93
- accn_nodash = accn.replace("-", "")
94
- url = f"https://www.sec.gov/Archives/edgar/data/{cik}/{accn_nodash}/{doc}"
95
- # https://www.sec.gov/Archives/edgar/data/320193/000032019325000079/aapl-20250927.htm
96
-
97
- # Full 10-K is 1.5MB of XBRL-tagged HTML — use http_get for text extraction
98
- content = http_get(url, headers=UA) # UA required on www.sec.gov
99
- ```
100
-
101
- ### XBRL financial data — single company, one concept over time
102
-
103
- ```python
104
- import json
105
- UA = {"User-Agent": "browser-harness research@example.com"}
106
- cik_padded = "0000320193"
107
-
108
- # companyconcept: one metric, all reported values (quarterly + annual)
109
- data = json.loads(http_get(
110
-     f"https://data.sec.gov/api/xbrl/companyconcept/CIK{cik_padded}/us-gaap/Assets.json",
111
-     headers=UA
112
- ))
113
- # data keys: cik, taxonomy, tag, label, description, entityName, units
114
- # data['units']['USD'] -> list of {end, val, accn, fy, fp, form, filed}
115
-
116
- entries = data['units']['USD']
117
-
118
- # Deduplicate: same period re-reported across multiple filings — keep latest
119
- def annual_series(entries):
120
-     seen = {}
121
-     for e in entries:
122
-         if e.get('form') == '10-K' and e.get('fp') == 'FY':
123
-             end = e['end']
124
-             if end not in seen or e['filed'] > seen[end]['filed']:
125
-                 seen[end] = e
126
-     return [seen[k] for k in sorted(seen)]
127
-
128
- assets = annual_series(entries)
129
- for e in assets[-5:]:
130
-     print(f"{e['end']} ${e['val']/1e9:.1f}B")
131
- # 2021-09-25 $351.0B
132
- # 2022-09-24 $352.8B
133
- # 2023-09-30 $352.6B
134
- # 2024-09-28 $365.0B
135
- # 2025-09-27 $359.2B
136
- ```
137
-
138
- ### XBRL financial data — all US-GAAP metrics for a company
139
-
140
- ```python
141
- import json
142
- UA = {"User-Agent": "browser-harness research@example.com"}
143
-
144
- # companyfacts: all reported XBRL concepts in one ~5MB call
145
- data = json.loads(http_get(
145
-     "https://data.sec.gov/api/xbrl/companyfacts/CIK0000320193.json",
146
-     headers=UA
148
- ))
149
- # data['entityName'] = "Apple Inc."
150
- # data['facts'] = {'us-gaap': {...503 concepts...}, 'dei': {...}}
151
-
152
- usgaap = data['facts']['us-gaap']
153
- print(len(usgaap)) # 503 concepts for Apple
154
-
155
- # Common concept names (companies vary — check what's available):
156
- # Revenue: RevenueFromContractWithCustomerExcludingAssessedTax (post-2018 standard)
157
- # SalesRevenueNet (older filings)
158
- # Revenues (some companies still use)
159
- # Net income: NetIncomeLoss
160
- # Assets: Assets
161
- # Cash: CashAndCashEquivalentsAtCarryingValue
162
- # EPS: EarningsPerShareBasic, EarningsPerShareDiluted
163
-
164
- # Find all revenue-related concepts this company reported:
165
- revenue_keys = [k for k in usgaap if 'Revenue' in k]
166
-
167
- # Extract annual revenue — handle company-specific concept name
168
- for concept in ['RevenueFromContractWithCustomerExcludingAssessedTax', 'SalesRevenueNet', 'Revenues']:
169
-     if concept in usgaap:
169
-         entries = usgaap[concept]['units'].get('USD', [])
170
-         annual = {}
171
-         for e in entries:
172
-             if e.get('form') == '10-K' and e.get('fp') == 'FY':
173
-                 end = e['end']
174
-                 if end not in annual or e['filed'] > annual[end]['filed']:
175
-                     annual[end] = e
176
-         if annual:
177
-             print(f"Using: {concept}")
178
-             for end in sorted(annual)[-3:]:
179
-                 print(f" {end} ${annual[end]['val']/1e9:.1f}B")
180
-             break
182
- # Apple output:
183
- # Using: RevenueFromContractWithCustomerExcludingAssessedTax
184
- # 2023-09-30 $383.3B
185
- # 2024-09-28 $391.0B
186
- # 2025-09-27 $416.2B
187
- ```
188
-
189
- ### Cross-company financial comparison (XBRL frames)
190
-
191
- ```python
192
- import json
193
- UA = {"User-Agent": "browser-harness research@example.com"}
194
-
195
- # frames: one concept, one period, all companies that reported it
196
- # Period formats:
197
- # CY2024 = calendar year 2024 (annual)
198
- # CY2024Q4I = Q4 2024 instantaneous (balance sheet items)
199
- # CY2024Q4 = Q4 2024 duration (income statement items)
200
-
201
- # Top companies by annual revenue (2024)
202
- data = json.loads(http_get(
202
-     "https://data.sec.gov/api/xbrl/frames/us-gaap/RevenueFromContractWithCustomerExcludingAssessedTax/USD/CY2024.json",
203
-     headers=UA
205
- ))
206
- companies = sorted(data['data'], key=lambda x: x['val'], reverse=True)
207
- # data['data'] entries: {accn, cik, entityName, loc, start, end, val}
208
- for c in companies[:5]:
208
-     print(f"{c['entityName']:<40} ${c['val']/1e9:.0f}B")
210
- # Walmart Inc. $675B
211
- # AMAZON.COM, INC. $638B
212
- # Apple Inc. $391B
213
- # McKESSON CORPORATION $359B
214
- # Alphabet Inc. $350B
215
-
216
- # Total assets snapshot end of 2024 (balance sheet = instantaneous)
217
- data2 = json.loads(http_get(
217
-     "https://data.sec.gov/api/xbrl/frames/us-gaap/Assets/USD/CY2024Q4I.json",
218
-     headers=UA
220
- ))
221
- # 6,229 companies for this frame
222
- ```
223
-
224
- ### Full-text search across all filings
225
-
226
- ```python
227
- import json
228
- UA = {"User-Agent": "browser-harness research@example.com"}
229
-
230
- # Search for any phrase across filing documents
231
- # Params: q (quoted phrase), forms (comma-separated), dateRange=custom,
232
- # startdt, enddt, size (max 100), from (offset for pagination)
233
- url = (
234
-     "https://efts.sec.gov/LATEST/search-index"
235
-     "?q=%22climate+risk%22"
236
-     "&forms=10-K"
237
-     "&dateRange=custom&startdt=2024-01-01"
238
-     "&size=10&from=0"
239
- )
240
- data = json.loads(http_get(url, headers=UA))
241
- # Note: default http_get UA (Mozilla/5.0) works fine on efts.sec.gov
242
-
243
- print(data['hits']['total']['value']) # e.g. 1438 matching documents
244
- hits = data['hits']['hits'] # up to 100 per call
245
-
246
- for h in hits:
247
-     src = h['_source']
247
-     # Key fields: display_names, ciks, form, file_date, adsh (accession), period_ending
248
-     name = src['display_names'][0] if src.get('display_names') else '?'
249
-     cik = src['ciks'][0] if src.get('ciks') else '?'
250
-     print(f"{name} form={src['form']} filed={src['file_date']} accn={src['adsh']}")
252
-
253
- # Pagination: max 100 per page, use from= to walk through results
254
- # Page 2: from=100, Page 3: from=200, etc.
255
- for page in range(0, 300, 100):
256
-     page_url = url + f"&from={page}"
256
-     page_data = json.loads(http_get(page_url, headers=UA))
257
-     if not page_data['hits']['hits']:
258
-         break
259
-     # process...
261
-
262
- # Aggregations — group hits by entity, SIC, state
263
- aggs = data['aggregations']
264
- top_entities = aggs['entity_filter']['buckets'] # [{key: "Name (CIK...)", doc_count: N}, ...]
265
- top_sics = aggs['sic_filter']['buckets']
266
- top_states = aggs['biz_states_filter']['buckets']
267
- ```
268
-
269
- ### Find a company's CIK by name search (via search aggregations)
270
-
271
- ```python
272
- import json, re
273
- UA = {"User-Agent": "browser-harness research@example.com"}
274
-
275
- # Best method: company_tickers.json (fastest, all tickers)
276
- tickers = json.loads(http_get("https://www.sec.gov/files/company_tickers.json", headers=UA))
277
- msft = next(v for v in tickers.values() if v['ticker'] == 'MSFT')
278
- # CIK = msft['cik_str'] → 789019
279
-
280
- # Alternative: full-text search aggregations (finds CIK from company name)
281
- data = json.loads(http_get(
281
-     "https://efts.sec.gov/LATEST/search-index?q=%22microsoft+corporation%22&forms=10-K",
282
-     headers=UA
284
- ))
285
- buckets = data['aggregations']['entity_filter']['buckets']
286
- # [{'key': 'MICROSOFT CORP (MSFT) (CIK 0000789019)', 'doc_count': 11}, ...]
287
- for b in buckets[:3]:
288
-     m = re.search(r'\(CIK (\d+)\)', b['key'])
288
-     if m:
289
-         print(f"{b['key'][:50]} → CIK {m.group(1)}")
291
- ```
292
-
293
- ### Parallel fetching (multiple companies)
294
-
295
- ```python
296
- import json
297
- from concurrent.futures import ThreadPoolExecutor
298
-
299
- UA = {"User-Agent": "browser-harness research@example.com"}
300
-
301
- def get_company_meta(ticker_cik):
302
-     ticker, cik = ticker_cik
302
-     subs = json.loads(http_get(f"https://data.sec.gov/submissions/CIK{cik}.json", headers=UA))
303
-     return {"ticker": ticker, "name": subs['name'], "sic": subs['sic']}
305
-
306
- companies = [("AAPL", "0000320193"), ("TSLA", "0001318605"), ("MSFT", "0000789019")]
307
- with ThreadPoolExecutor(max_workers=3) as ex:
307
-     results = list(ex.map(get_company_meta, companies))
309
- # Confirmed: 3 requests complete in ~0.28s
310
- # SEC rate limit: 10 req/sec — stay at max_workers ≤ 8 to be safe
311
- ```
312
-
313
- ## API reference
314
-
315
- | Endpoint | What it returns | UA required |
316
- |---|---|---|
317
- | `www.sec.gov/files/company_tickers.json` | All 10,391 tickers → CIK mapping | YES |
318
- | `data.sec.gov/submissions/CIK{10-digit}.json` | Company meta + ~1000 recent filings | YES |
319
- | `data.sec.gov/api/xbrl/companyfacts/CIK{10-digit}.json` | All XBRL facts (~5MB) | YES |
320
- | `data.sec.gov/api/xbrl/companyconcept/CIK{10-digit}/{taxonomy}/{concept}.json` | One concept, all values | YES |
321
- | `data.sec.gov/api/xbrl/frames/{taxonomy}/{concept}/{unit}/{period}.json` | All companies for one period | YES |
322
- | `efts.sec.gov/LATEST/search-index?q=...` | Full-text search across filings | NO (Mozilla/5.0 works) |
323
- | `www.sec.gov/Archives/edgar/data/{cik}/{accn-nodash}/{doc}` | Actual filing document | YES |
324
-
325
- `data.sec.gov` accepts `Mozilla/5.0` (the http_get default). `www.sec.gov` (Archives, company_tickers) requires the `"CompanyName email@example.com"` format.
326
-
327
- ## Rate limits
328
-
329
- SEC documents a **10 requests/second** limit. In practice:
330
- - 12 rapid sequential calls to `data.sec.gov` completed in 2.4s (5 req/s) with no throttling.
331
- - 3 parallel calls completed in 0.28s without issue.
332
- - Stay at `max_workers ≤ 8` for ThreadPoolExecutor to respect the 10 req/s ceiling.
333
- - No per-day or per-hour cap documented; the 10/s limit is the only stated constraint.
334
-
335
- ## Gotchas
336
-
337
- - **`www.sec.gov` returns 403 with `Mozilla/5.0` UA** — The http_get default (`"Mozilla/5.0"`) works on `data.sec.gov` and `efts.sec.gov` but is blocked on `www.sec.gov`. Always pass `headers=UA` where UA includes your company name and email. Confirmed: `"python-requests/2.28"` → 403.
338
-
339
- - **`data.sec.gov` is more permissive** — `Mozilla/5.0` works on `data.sec.gov` (submissions, xbrl). But always use the proper UA anyway — SEC's policy page explicitly requires it and they can add stricter checks at any time.
340
-
341
- - **XBRL contains duplicate entries per period** — The same fiscal year end date appears multiple times when a company restates or re-files. Each entry has a `filed` date and `accn` (accession). To get the canonical value, deduplicate by `end` keeping the entry with the latest `filed` date.
342
-
343
- - **Revenue concept name varies by company and era** — There is no single canonical "revenue" concept. Apple uses `RevenueFromContractWithCustomerExcludingAssessedTax`. Microsoft uses the same for recent years, but older filings used `SalesRevenueNet`. Always check which concepts are actually present: `[k for k in usgaap if 'Revenue' in k]`.
344
-
345
- - **`fp` field for annual filings is `'FY'`, but quarterly values also appear in 10-K** — A 10-K re-reports each quarter (fp=Q1, Q2, Q3) plus the full year (fp=FY). Filter on both `form == '10-K'` AND `fp == 'FY'` to get only annual totals.
346
-
347
- - **`companyfacts` is ~5MB per company** — For a single metric, use `companyconcept` instead (much smaller). Only use `companyfacts` when you need multiple concepts from the same company.
348
-
349
- - **`submissions` recent filings cap at ~1,000** — Very old filings don't appear. If you need historical data before that window, use the `filings.files` array in submissions JSON to find older filing JSON pages (`data.sec.gov/submissions/CIK{cik}-submissions-001.json`, etc.).
350
-
351
- - **`adsh` in search results is the accession number** — The search index returns `adsh` (no dashes). To build the document URL, insert dashes: `adsh[:10] + '-' + adsh[10:12] + '-' + adsh[12:]`, or use the `accessionNumber` field from submissions JSON (which already has dashes).
352
-
353
- - **`size` param is capped at 100** — Requesting `size=200` silently returns 100 hits. Walk results with `from=0`, `from=100`, etc. Maximum reachable index is 10,000 (Elasticsearch default).
354
-
355
- - **Search total `'gte'` relation means >10,000 hits** — When `total['relation'] == 'gte'`, there are more than 10,000 results (only first 10,000 accessible). Narrow with `dateRange` or `forms` filters.
356
-
357
- - **`company_tickers.json` covers only exchange-listed companies** — ~10,391 entries. Many SEC filers (private companies, bond issuers, FHLBs) have CIKs but no ticker. Find them via the full-text search aggregations or `submissions` lookup if you have the CIK.
358
-
359
- - **Filing document is XBRL-tagged HTML, 1–2MB** — Retrieving the actual 10-K HTML works but is large. For financial data extraction, always prefer the XBRL API endpoints over parsing the document.
360
-
361
- - **CIK format gotcha** — `company_tickers.json` returns `cik_str` as an int (`320193`). The submissions and xbrl APIs require a 10-digit zero-padded string in the filename (`CIK0000320193`). Always use `str(cik).zfill(10)` when building URLs.
1
+ # SEC EDGAR — Scraping & Data Extraction
2
+
3
+ `https://www.sec.gov` / `https://data.sec.gov` / `https://efts.sec.gov` — all public data, no auth required. Every workflow here is pure `http_get` — no browser needed.
4
+
5
+ ## Do this first
6
+
7
+ **`www.sec.gov` returns 403 unless you send a custom User-Agent, and SEC policy asks for one on `data.sec.gov` as well. Always pass `headers=UA`.**
8
+
9
+ ```python
10
+ import json
11
+ UA = {"User-Agent": "browser-harness research@example.com"}
12
+ # Format required: "CompanyName contact@email.com"
13
+ # "Mozilla/5.0" (http_get default) works on efts.sec.gov and data.sec.gov
14
+ # but FAILS on www.sec.gov (company_tickers.json, Archives/, etc.)
15
+ ```
16
+
17
+ Start with `company_tickers.json` to resolve any ticker → CIK in one call, then branch to whichever endpoint you need.
18
+
19
+ ```python
20
+ import json
21
+ UA = {"User-Agent": "browser-harness research@example.com"}
22
+ tickers = json.loads(http_get("https://www.sec.gov/files/company_tickers.json", headers=UA))
23
+ # 10,391 public companies, ~50KB, always fresh
24
+ # Entry format: {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."}
25
+
26
+ # Look up by ticker (exact, case-sensitive in the data)
27
+ aapl = next(v for v in tickers.values() if v['ticker'] == 'AAPL')
28
+ # {'cik_str': 320193, 'ticker': 'AAPL', 'title': 'Apple Inc.'}
29
+
30
+ # CIK is an int here; pad to 10 digits for API URLs
31
+ cik = str(aapl['cik_str']).zfill(10) # "0000320193"
32
+ ```
33
+
34
+ ## Common workflows
35
+
36
+ ### Ticker / name → CIK lookup
37
+
38
+ ```python
39
+ import json
40
+ UA = {"User-Agent": "browser-harness research@example.com"}
41
+ tickers = json.loads(http_get("https://www.sec.gov/files/company_tickers.json", headers=UA))
42
+
43
+ # By ticker
44
+ tsla = next((v for v in tickers.values() if v['ticker'] == 'TSLA'), None)
45
+ # {'cik_str': 1318605, 'ticker': 'TSLA', 'title': 'Tesla, Inc.'}
46
+
47
+ # By partial name match
48
+ apples = [v for v in tickers.values() if 'APPLE' in v['title'].upper()]
49
+ # [{'cik_str': 320193, 'ticker': 'AAPL', 'title': 'Apple Inc.'}, ...]
50
+ ```
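The lookups above scan every entry on each call. When many tickers need resolving, a one-time index may be more convenient. This is a minimal sketch that assumes the entry format shown above (`cik_str`, `ticker`, `title`); `build_cik_index` is a hypothetical helper name, and the sample data is hand-written rather than fetched.

```python
def build_cik_index(tickers):
    """Map upper-cased ticker -> 10-digit zero-padded CIK string."""
    return {
        entry["ticker"].upper(): str(entry["cik_str"]).zfill(10)
        for entry in tickers.values()
    }

# Hand-written sample in the same shape as company_tickers.json
sample = {
    "0": {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."},
    "1": {"cik_str": 1318605, "ticker": "TSLA", "title": "Tesla, Inc."},
}
index = build_cik_index(sample)
print(index["AAPL"])  # 0000320193
```

The index stores the padded form because that is what the `data.sec.gov` URLs expect; use `int(index["AAPL"])` when the unpadded CIK is needed.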
51
+
52
+ ### Company submissions (metadata + recent filings list)
53
+
54
+ ```python
55
+ import json
56
+ UA = {"User-Agent": "browser-harness research@example.com"}
57
+ cik = "0000320193" # Apple - always zero-pad to 10 digits
58
+ data = json.loads(http_get(f"https://data.sec.gov/submissions/CIK{cik}.json", headers=UA))
59
+
60
+ print(data['name']) # "Apple Inc."
61
+ print(data['cik']) # "0000320193"
62
+ print(data['sic']) # "3571"
63
+ print(data['sicDescription']) # "Electronic Computers"
64
+ print(data['tickers']) # ["AAPL"]
65
+ print(data['exchanges']) # ["Nasdaq"]
66
+
67
+ # Most recent ~1,000 filings are in data['filings']['recent']
68
+ recent = data['filings']['recent']
69
+ # Fields per filing (parallel arrays, same index):
70
+ # accessionNumber, filingDate, reportDate, form, primaryDocument,
71
+ # primaryDocDescription, size, isXBRL, items, fileNumber
72
+
73
+ # Filter for 10-K and 10-Q only
74
+ periodic_filings = [
75
+ (f, d, a, doc)
76
+ for f, d, a, doc in zip(
77
+ recent['form'], recent['filingDate'],
78
+ recent['accessionNumber'], recent['primaryDocument']
79
+ )
80
+ if f in ('10-K', '10-Q')
81
+ ]
82
+ # Result: [('10-Q', '2026-01-30', '0000320193-26-000006', 'aapl-20251227.htm'), ...]
83
+ ```
84
+
85
+ ### Build direct filing document URL
86
+
87
+ ```python
88
+ # Given accessionNumber and primaryDocument from submissions JSON:
89
+ accn = "0000320193-25-000079"
90
+ doc = "aapl-20250927.htm"
91
+ cik = "320193" # int part only (no leading zeros) for Archives path
92
+
93
+ accn_nodash = accn.replace("-", "")
94
+ url = f"https://www.sec.gov/Archives/edgar/data/{cik}/{accn_nodash}/{doc}"
95
+ # https://www.sec.gov/Archives/edgar/data/320193/000032019325000079/aapl-20250927.htm
96
+
97
+ # Full 10-K is 1.5MB of XBRL-tagged HTML — use http_get for text extraction
98
+ content = http_get(url, headers={"User-Agent": "browser-harness research@example.com"})  # UA required on www.sec.gov
99
+ ```
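The URL construction above can be wrapped in a small helper that accepts a CIK in either form (int or zero-padded string) and an accession number with or without dashes. This is a sketch only; `filing_url` and `with_dashes` are hypothetical names, and the 18-digit check assumes the 10-2-6 accession layout implied by the examples in this document.

```python
ARCHIVES_BASE = "https://www.sec.gov/Archives/edgar/data"

def _accession_digits(accession):
    """Return the 18 digits of an accession number, dashes removed."""
    digits = str(accession).replace("-", "")
    if len(digits) != 18 or not digits.isdigit():
        raise ValueError(f"unexpected accession number: {accession!r}")
    return digits

def with_dashes(accession):
    """Format an accession number as 10-2-6 digits separated by dashes."""
    d = _accession_digits(accession)
    return f"{d[:10]}-{d[10:12]}-{d[12:]}"

def filing_url(cik, accession, doc):
    """Build the Archives URL; the CIK loses its leading zeros in the path."""
    cik_part = str(int(cik))
    return f"{ARCHIVES_BASE}/{cik_part}/{_accession_digits(accession)}/{doc}"

print(filing_url("0000320193", "0000320193-25-000079", "aapl-20250927.htm"))
```

Normalizing through `_accession_digits` means the same helper works whether the accession number came from submissions JSON or from search results.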
100
+
101
+ ### XBRL financial data — single company, one concept over time
102
+
103
+ ```python
104
+ import json
105
+ UA = {"User-Agent": "browser-harness research@example.com"}
106
+ cik_padded = "0000320193"
107
+
108
+ # companyconcept: one metric, all reported values (quarterly + annual)
109
+ data = json.loads(http_get(
110
+ f"https://data.sec.gov/api/xbrl/companyconcept/CIK{cik_padded}/us-gaap/Assets.json",
111
+ headers=UA
112
+ ))
113
+ # data keys: cik, taxonomy, tag, label, description, entityName, units
114
+ # data['units']['USD'] -> list of {end, val, accn, fy, fp, form, filed}
115
+
116
+ entries = data['units']['USD']
117
+
118
+ # Deduplicate: same period re-reported across multiple filings — keep latest
119
+ def annual_series(entries):
120
+ seen = {}
121
+ for e in entries:
122
+ if e.get('form') == '10-K' and e.get('fp') == 'FY':
123
+ end = e['end']
124
+ if end not in seen or e['filed'] > seen[end]['filed']:
125
+ seen[end] = e
126
+ return [seen[k] for k in sorted(seen)]
127
+
128
+ assets = annual_series(entries)
129
+ for e in assets[-5:]:
130
+ print(f"{e['end']} ${e['val']/1e9:.1f}B")
131
+ # 2021-09-25 $351.0B
132
+ # 2022-09-24 $352.8B
133
+ # 2023-09-30 $352.6B
134
+ # 2024-09-28 $365.0B
135
+ # 2025-09-27 $359.2B
136
+ ```
137
+
138
+ ### XBRL financial data — all US-GAAP metrics for a company
139
+
140
+ ```python
141
+ import json
142
+ UA = {"User-Agent": "browser-harness research@example.com"}
143
+
144
+ # companyfacts: all reported XBRL concepts in one ~5MB call
145
+ data = json.loads(http_get(
146
+ "https://data.sec.gov/api/xbrl/companyfacts/CIK0000320193.json",
147
+ headers=UA
148
+ ))
149
+ # data['entityName'] = "Apple Inc."
150
+ # data['facts'] = {'us-gaap': {...503 concepts...}, 'dei': {...}}
151
+
152
+ usgaap = data['facts']['us-gaap']
153
+ print(len(usgaap)) # 503 concepts for Apple
154
+
155
+ # Common concept names (companies vary — check what's available):
156
+ # Revenue: RevenueFromContractWithCustomerExcludingAssessedTax (post-2018 standard)
157
+ # SalesRevenueNet (older filings)
158
+ # Revenues (some companies still use)
159
+ # Net income: NetIncomeLoss
160
+ # Assets: Assets
161
+ # Cash: CashAndCashEquivalentsAtCarryingValue
162
+ # EPS: EarningsPerShareBasic, EarningsPerShareDiluted
163
+
164
+ # Find all revenue-related concepts this company reported:
165
+ revenue_keys = [k for k in usgaap if 'Revenue' in k]
166
+
167
+ # Extract annual revenue — handle company-specific concept name
168
+ for concept in ['RevenueFromContractWithCustomerExcludingAssessedTax', 'SalesRevenueNet', 'Revenues']:
169
+ if concept in usgaap:
170
+ entries = usgaap[concept]['units'].get('USD', [])
171
+ annual = {}
172
+ for e in entries:
173
+ if e.get('form') == '10-K' and e.get('fp') == 'FY':
174
+ end = e['end']
175
+ if end not in annual or e['filed'] > annual[end]['filed']:
176
+ annual[end] = e
177
+ if annual:
178
+ print(f"Using: {concept}")
179
+ for end in sorted(annual)[-3:]:
180
+ print(f" {end} ${annual[end]['val']/1e9:.1f}B")
181
+ break
182
+ # Apple output:
183
+ # Using: RevenueFromContractWithCustomerExcludingAssessedTax
184
+ # 2023-09-30 $383.3B
185
+ # 2024-09-28 $391.0B
186
+ # 2025-09-27 $416.2B
187
+ ```
188
+
189
+ ### Cross-company financial comparison (XBRL frames)
190
+
191
+ ```python
192
+ import json
193
+ UA = {"User-Agent": "browser-harness research@example.com"}
194
+
195
+ # frames: one concept, one period, all companies that reported it
196
+ # Period formats:
197
+ # CY2024 = calendar year 2024 (annual)
198
+ # CY2024Q4I = Q4 2024 instantaneous (balance sheet items)
199
+ # CY2024Q4 = Q4 2024 duration (income statement items)
200
+
201
+ # Top companies by annual revenue (2024)
202
+ data = json.loads(http_get(
203
+ "https://data.sec.gov/api/xbrl/frames/us-gaap/RevenueFromContractWithCustomerExcludingAssessedTax/USD/CY2024.json",
204
+ headers=UA
205
+ ))
206
+ companies = sorted(data['data'], key=lambda x: x['val'], reverse=True)
207
+ # data['data'] entries: {accn, cik, entityName, loc, start, end, val}
208
+ for c in companies[:5]:
209
+ print(f"{c['entityName']:<40} ${c['val']/1e9:.0f}B")
210
+ # Walmart Inc. $675B
211
+ # AMAZON.COM, INC. $638B
212
+ # Apple Inc. $391B
213
+ # McKESSON CORPORATION $359B
214
+ # Alphabet Inc. $350B
215
+
216
+ # Total assets snapshot end of 2024 (balance sheet = instantaneous)
217
+ data2 = json.loads(http_get(
218
+ "https://data.sec.gov/api/xbrl/frames/us-gaap/Assets/USD/CY2024Q4I.json",
219
+ headers=UA
220
+ ))
221
+ # 6,229 companies for this frame
222
+ ```
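The period strings listed in the comments above are easy to mistype, so a small builder may help. The sketch below covers only the three formats named in this document (`CY2024`, `CY2024Q4`, `CY2024Q4I`) and follows the frames URL pattern from the API reference; `frame_period` and `frame_url` are hypothetical helper names.

```python
def frame_period(year, quarter=None, instant=False):
    """Build a frames period string such as CY2024, CY2024Q4 or CY2024Q4I."""
    if quarter is None:
        if instant:
            # Only the quarterly instantaneous form is covered by this sketch.
            raise ValueError("instant=True needs a quarter in this sketch")
        return f"CY{year}"
    if quarter not in (1, 2, 3, 4):
        raise ValueError(f"quarter must be 1-4, got {quarter!r}")
    return f"CY{year}Q{quarter}{'I' if instant else ''}"

def frame_url(taxonomy, concept, unit, period):
    """Assemble the frames endpoint URL for one concept and one period."""
    return f"https://data.sec.gov/api/xbrl/frames/{taxonomy}/{concept}/{unit}/{period}.json"

print(frame_url("us-gaap", "Assets", "USD", frame_period(2024, 4, instant=True)))
```

Balance sheet concepts such as `Assets` take the instantaneous form, while income statement concepts take the duration forms.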
223
+
224
+ ### Full-text search across all filings
225
+
226
+ ```python
227
+ import json
228
+ UA = {"User-Agent": "browser-harness research@example.com"}
229
+
230
+ # Search for any phrase across filing documents
231
+ # Params: q (quoted phrase), forms (comma-separated), dateRange=custom,
232
+ # startdt, enddt, size (max 100), from (offset for pagination)
233
+ url = (
234
+ "https://efts.sec.gov/LATEST/search-index"
235
+ "?q=%22climate+risk%22"
236
+ "&forms=10-K"
237
+ "&dateRange=custom&startdt=2024-01-01"
238
+ "&size=10&from=0"
239
+ )
240
+ data = json.loads(http_get(url, headers=UA))
241
+ # Note: default http_get UA (Mozilla/5.0) works fine on efts.sec.gov
242
+
243
+ print(data['hits']['total']['value']) # e.g. 1438 matching documents
244
+ hits = data['hits']['hits'] # up to 100 per call
245
+
246
+ for h in hits:
247
+ src = h['_source']
248
+ # Key fields: display_names, ciks, form, file_date, adsh (accession), period_ending
249
+ name = src['display_names'][0] if src.get('display_names') else '?'
250
+ cik = src['ciks'][0] if src.get('ciks') else '?'
251
+     print(f"{name} cik={cik} form={src['form']} filed={src['file_date']} accn={src['adsh']}")
252
+
253
+ # Pagination: max 100 per page, use from= to walk through results
254
+ # Page 2: from=100, Page 3: from=200, etc.
255
+ for page in range(0, 300, 100):
256
+     page_url = url.replace("&size=10&from=0", f"&size=100&from={page}")
257
+ page_data = json.loads(http_get(page_url, headers=UA))
258
+ if not page_data['hits']['hits']:
259
+ break
260
+ # process...
261
+
262
+ # Aggregations — group hits by entity, SIC, state
263
+ aggs = data['aggregations']
264
+ top_entities = aggs['entity_filter']['buckets'] # [{key: "Name (CIK...)", doc_count: N}, ...]
265
+ top_sics = aggs['sic_filter']['buckets']
266
+ top_states = aggs['biz_states_filter']['buckets']
267
+ ```
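Building the query string by hand makes it easy to end up with a duplicated `from` parameter or a mismatched page size. The sketch below builds each page URL from a parameter dict and derives the page offsets from the `hits.total` object. It assumes the parameter names, the 100-per-page cap and the 10,000-result window described in this document; `search_url` and `page_offsets` are hypothetical helper names.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://efts.sec.gov/LATEST/search-index"
PAGE_SIZE = 100      # per-page cap reported in this document
MAX_WINDOW = 10_000  # highest reachable result index reported in this document

def search_url(phrase, forms="10-K", startdt=None, enddt=None, offset=0):
    """Build one page URL for an exact-phrase search."""
    params = {"q": f'"{phrase}"', "forms": forms}
    if startdt or enddt:
        params["dateRange"] = "custom"
        if startdt:
            params["startdt"] = startdt
        if enddt:
            params["enddt"] = enddt
    params["size"] = PAGE_SIZE
    params["from"] = offset
    return f"{SEARCH_BASE}?{urlencode(params)}"

def page_offsets(total):
    """List the `from` offsets needed to walk every reachable hit."""
    reachable = min(total["value"], MAX_WINDOW)
    return list(range(0, reachable, PAGE_SIZE))

print(search_url("climate risk", startdt="2024-01-01", offset=100))
```

A caller would fetch the first page, read `data['hits']['total']`, and then loop over `page_offsets(total)[1:]` for the remaining pages.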
268
+
269
+ ### Find a company's CIK by name search (via search aggregations)
270
+
271
+ ```python
272
+ import json, re
273
+ UA = {"User-Agent": "browser-harness research@example.com"}
274
+
275
+ # Best method: company_tickers.json (fastest, all tickers)
276
+ tickers = json.loads(http_get("https://www.sec.gov/files/company_tickers.json", headers=UA))
277
+ msft = next(v for v in tickers.values() if v['ticker'] == 'MSFT')
278
+ # CIK = msft['cik_str'] → 789019
279
+
280
+ # Alternative: full-text search aggregations (finds CIK from company name)
281
+ data = json.loads(http_get(
282
+ "https://efts.sec.gov/LATEST/search-index?q=%22microsoft+corporation%22&forms=10-K",
283
+ headers=UA
284
+ ))
285
+ buckets = data['aggregations']['entity_filter']['buckets']
286
+ # [{'key': 'MICROSOFT CORP (MSFT) (CIK 0000789019)', 'doc_count': 11}, ...]
287
+ for b in buckets[:3]:
288
+ m = re.search(r'\(CIK (\d+)\)', b['key'])
289
+ if m:
290
+ print(f"{b['key'][:50]} → CIK {m.group(1)}")
291
+ ```
292
+
293
+ ### Parallel fetching (multiple companies)
294
+
295
+ ```python
296
+ import json
297
+ from concurrent.futures import ThreadPoolExecutor
298
+
299
+ UA = {"User-Agent": "browser-harness research@example.com"}
300
+
301
+ def get_company_meta(ticker_cik):
302
+ ticker, cik = ticker_cik
303
+ subs = json.loads(http_get(f"https://data.sec.gov/submissions/CIK{cik}.json", headers=UA))
304
+ return {"ticker": ticker, "name": subs['name'], "sic": subs['sic']}
305
+
306
+ companies = [("AAPL", "0000320193"), ("TSLA", "0001318605"), ("MSFT", "0000789019")]
307
+ with ThreadPoolExecutor(max_workers=3) as ex:
308
+ results = list(ex.map(get_company_meta, companies))
309
+ # Confirmed: 3 requests complete in ~0.28s
310
+ # SEC rate limit: 10 req/sec; max_workers bounds concurrency only, so throttle large batches
311
+ ```
312
+
313
+ ## API reference
314
+
315
+ | Endpoint | What it returns | UA required |
316
+ |---|---|---|
317
+ | `www.sec.gov/files/company_tickers.json` | All 10,391 tickers → CIK mapping | YES |
318
+ | `data.sec.gov/submissions/CIK{10-digit}.json` | Company meta + ~1000 recent filings | YES |
319
+ | `data.sec.gov/api/xbrl/companyfacts/CIK{10-digit}.json` | All XBRL facts (~5MB) | YES |
320
+ | `data.sec.gov/api/xbrl/companyconcept/CIK{10-digit}/{taxonomy}/{concept}.json` | One concept, all values | YES |
321
+ | `data.sec.gov/api/xbrl/frames/{taxonomy}/{concept}/{unit}/{period}.json` | All companies for one period | YES |
322
+ | `efts.sec.gov/LATEST/search-index?q=...` | Full-text search across filings | NO (Mozilla/5.0 works) |
323
+ | `www.sec.gov/Archives/edgar/data/{cik}/{accn-nodash}/{doc}` | Actual filing document | YES |
324
+
325
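+ The table marks `data.sec.gov` as YES because SEC policy requires a descriptive User-Agent, although the host currently accepts `Mozilla/5.0` (the http_get default). `www.sec.gov` (Archives, company_tickers) enforces the `"CompanyName email@example.com"` format and returns 403 otherwise.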
326
+
327
+ ## Rate limits
328
+
329
+ SEC documents a **10 requests/second** limit. In practice:
330
+ - 12 rapid sequential calls to `data.sec.gov` completed in 2.4s (5 req/s) with no throttling.
331
+ - 3 parallel calls completed in 0.28s without issue.
332
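+ - `max_workers ≤ 8` for ThreadPoolExecutor limits concurrency, not request rate. At the ~0.28s latency measured above, 8 busy workers could issue well over 10 req/s, so add an explicit throttle for large batches.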
333
+ - No per-day or per-hour cap documented; the 10/s limit is the only stated constraint.
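One way to stay under the documented ceiling regardless of worker count is to space request starts at a fixed minimum interval shared by all threads. The sketch below is one possible approach, not something taken from the package; `Throttle` is a hypothetical class, and the `clock` and `sleep` parameters exist only so the timing can be tested without real waiting.

```python
import threading
import time

class Throttle:
    """Space calls to wait() at least 1/rate_per_sec seconds apart, across threads."""

    def __init__(self, rate_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate_per_sec
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = 0.0  # earliest time the next call may start

    def wait(self):
        # Reserve the next free slot under the lock, then sleep outside it
        # so other threads can reserve their own slots meanwhile.
        with self._lock:
            now = self._clock()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)

throttle = Throttle(rate_per_sec=8)  # headroom below the 10 req/s limit
for _ in range(3):
    throttle.wait()  # call this immediately before each http_get
```

Inside a ThreadPoolExecutor worker, calling `throttle.wait()` right before `http_get` keeps the combined request rate at or below the configured value no matter how many workers are running.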
334
+
335
+ ## Gotchas
336
+
337
+ - **`www.sec.gov` returns 403 with `Mozilla/5.0` UA** — The http_get default (`"Mozilla/5.0"`) works on `data.sec.gov` and `efts.sec.gov` but is blocked on `www.sec.gov`. Always pass `headers=UA` where UA includes your company name and email. Confirmed: `"python-requests/2.28"` → 403.
338
+
339
+ - **`data.sec.gov` is more permissive** — `Mozilla/5.0` works on `data.sec.gov` (submissions, xbrl). But always use the proper UA anyway — SEC's policy page explicitly requires it and they can add stricter checks at any time.
340
+
341
+ - **XBRL contains duplicate entries per period** — The same fiscal year end date appears multiple times when a company restates or re-files. Each entry has a `filed` date and `accn` (accession). To get the canonical value, deduplicate by `end` keeping the entry with the latest `filed` date.
342
+
343
+ - **Revenue concept name varies by company and era** — There is no single canonical "revenue" concept. Apple uses `RevenueFromContractWithCustomerExcludingAssessedTax`. Microsoft uses the same for recent years, but older filings used `SalesRevenueNet`. Always check which concepts are actually present: `[k for k in usgaap if 'Revenue' in k]`.
344
+
345
+ - **`fp` field for annual filings is `'FY'`, but quarterly values also appear in 10-K** — A 10-K re-reports each quarter (fp=Q1, Q2, Q3) plus the full year (fp=FY). Filter on both `form == '10-K'` AND `fp == 'FY'` to get only annual totals.
346
+
347
+ - **`companyfacts` is ~5MB per company** — For a single metric, use `companyconcept` instead (much smaller). Only use `companyfacts` when you need multiple concepts from the same company.
348
+
349
+ - **`submissions` recent filings cap at ~1,000** — Very old filings don't appear. If you need historical data before that window, use the `filings.files` array in submissions JSON to find older filing JSON pages (`data.sec.gov/submissions/CIK{cik}-submissions-001.json`, etc.).
350
+
351
+ - **`adsh` in search results is the accession number** — The search index returns `adsh` (no dashes). To build the document URL, insert dashes: `adsh[:10] + '-' + adsh[10:12] + '-' + adsh[12:]`, or use the `accessionNumber` field from submissions JSON (which already has dashes).
352
+
353
+ - **`size` param is capped at 100** — Requesting `size=200` silently returns 100 hits. Walk results with `from=0`, `from=100`, etc. Maximum reachable index is 10,000 (Elasticsearch default).
354
+
355
+ - **Search total `'gte'` relation means at least 10,000 hits** — When `total['relation'] == 'gte'`, `total['value']` is a lower bound rather than an exact count, and only the first 10,000 results are accessible. Narrow with `dateRange` or `forms` filters.
356
+
357
+ - **`company_tickers.json` covers only exchange-listed companies** — ~10,391 entries. Many SEC filers (private companies, bond issuers, FHLBs) have CIKs but no ticker. Find them via the full-text search aggregations or `submissions` lookup if you have the CIK.
358
+
359
+ - **Filing document is XBRL-tagged HTML, 1–2MB** — Retrieving the actual 10-K HTML works but is large. For financial data extraction, always prefer the XBRL API endpoints over parsing the document.
360
+
361
+ - **CIK format gotcha** — `company_tickers.json` returns `cik_str` as an int (`320193`). The submissions and xbrl APIs require a 10-digit zero-padded string in the filename (`CIK0000320193`). Always use `str(cik).zfill(10)` when building URLs.
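For the `submissions` cap described above, the older pages can be listed from the `filings.files` array. The sketch below assumes each entry in that array carries a `name` key holding the page filename (for example `CIK0000320193-submissions-001.json`); that key is an assumption to verify against a real response. `older_submission_urls` is a hypothetical helper, and the sample dict is hand-written.

```python
def older_submission_urls(submissions):
    """Return URLs of older filing pages listed under filings.files, if any."""
    files = submissions.get("filings", {}).get("files", [])
    return [
        f"https://data.sec.gov/submissions/{entry['name']}"
        for entry in files
        if "name" in entry  # assumed key; entries without it are skipped
    ]

# Hand-written sample shaped like the relevant part of a submissions response
sample = {
    "filings": {
        "recent": {},
        "files": [{"name": "CIK0000320193-submissions-001.json"}],
    }
}
print(older_submission_urls(sample))
```

Each returned URL can be fetched with the same `headers=UA` as the main submissions call.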