sec2md-0.1.5-py3-none-any.whl


This version of sec2md might be problematic.

@@ -0,0 +1,216 @@
1
+ Metadata-Version: 2.4
2
+ Name: sec2md
3
+ Version: 0.1.5
4
+ Summary: Convert SEC EDGAR filings to LLM-ready Markdown for AI agents and agentic RAG
5
+ Author-email: Lucas Astorian <lucas@intellifin.ai>
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/lucasastorian/sec2md
8
+ Project-URL: Repository, https://github.com/lucasastorian/sec2md
9
+ Project-URL: Issues, https://github.com/lucasastorian/sec2md/issues
10
+ Keywords: sec,edgar,markdown,filings,10-k,10-q,llm,rag,ai,embeddings
11
+ Classifier: Development Status :: 3 - Alpha
12
+ Classifier: Intended Audience :: Developers
13
+ Classifier: Intended Audience :: Financial and Insurance Industry
14
+ Classifier: License :: OSI Approved :: MIT License
15
+ Classifier: Programming Language :: Python :: 3
16
+ Classifier: Programming Language :: Python :: 3.9
17
+ Classifier: Programming Language :: Python :: 3.10
18
+ Classifier: Programming Language :: Python :: 3.11
19
+ Classifier: Programming Language :: Python :: 3.12
20
+ Classifier: Topic :: Office/Business :: Financial
21
+ Classifier: Topic :: Text Processing :: Markup :: Markdown
22
+ Requires-Python: >=3.9
23
+ Description-Content-Type: text/markdown
24
+ License-File: LICENSE
25
+ Requires-Dist: beautifulsoup4>=4.12.0
26
+ Requires-Dist: lxml>=4.9.0
27
+ Requires-Dist: requests>=2.31.0
28
+ Requires-Dist: tiktoken>=0.5.0
29
+ Requires-Dist: pydantic>=2.0.0
30
+ Provides-Extra: dev
31
+ Requires-Dist: pytest>=7.0.0; extra == "dev"
32
+ Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
33
+ Requires-Dist: black>=23.0.0; extra == "dev"
34
+ Requires-Dist: ruff>=0.1.0; extra == "dev"
35
+ Dynamic: license-file
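The header block above is RFC 822-style core metadata, so it can be read with Python's standard `email` parser. A minimal sketch, using a shortened copy of the fields shown above; the `parse_metadata` helper is a hypothetical name for illustration and is not part of sec2md:

```python
from email.parser import Parser

# Shortened copy of the METADATA headers from this wheel.
METADATA_TEXT = """\
Metadata-Version: 2.4
Name: sec2md
Version: 0.1.5
Requires-Python: >=3.9
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: lxml>=4.9.0
Requires-Dist: pytest>=7.0.0; extra == "dev"
Provides-Extra: dev
"""


def parse_metadata(text):
    """Return (name, version, runtime requirements, extra-only requirements)."""
    msg = Parser().parsestr(text)
    requires = msg.get_all("Requires-Dist") or []
    # Requirements guarded by an 'extra ==' marker are only installed on request.
    runtime = [r for r in requires if "extra ==" not in r]
    extras = [r for r in requires if "extra ==" in r]
    return msg["Name"], msg["Version"], runtime, extras


name, version, runtime, extras = parse_metadata(METADATA_TEXT)
print(name, version)
print("runtime:", runtime)
print("extras:", extras)
```

Splitting on the `extra ==` marker is a simplification; a real tool would parse the environment markers properly.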
36
+
37
+ # sec2md
38
+
39
+ [![PyPI](https://img.shields.io/pypi/v/sec2md.svg)](https://pypi.org/project/sec2md)
40
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
41
+ [![Documentation](https://img.shields.io/badge/docs-readthedocs-blue.svg)](https://sec2md.readthedocs.io)
42
+
43
+ Transform messy SEC filings into clean, structured Markdown.
44
+ **Built for AI. Optimized for retrieval. Ready for production.**
45
+
46
+ ![Before and After Comparison](comparison.png)
47
+ *Apple 10-K cover page: Raw SEC HTML (left) vs. Clean Markdown (right)*
48
+
49
+ ---
50
+
51
+ ## The Problem
52
+
53
+ RAG pipelines fail on SEC filings because **standard parsers destroy document structure.**
54
+
55
+ When you flatten a 200-page 10-K to plain text:
56
+
57
+ - ❌ **Tables break** - Complex financial statements become misaligned text
58
+ - ❌ **Pages are lost** - Answers can't be cited or traced back to a source page
59
+ - ❌ **Sections merge** - Risk Factors and MD&A become indistinguishable
60
+ - ❌ **Formatting is stripped** - Headers, bold text, and lists (LLM reasoning cues) are gone
61
+ - ❌ **Retrieval fails** - Chunks without structure return the wrong context
62
+
63
+ Your RAG system is only as good as your data. Garbage in, garbage out.
64
+
65
+ ## The Solution
66
+
67
+ `sec2md` **rebuilds** SEC filings as clean, semantic Markdown designed for AI systems:
68
+
69
+ - ✅ **Preserves structure** - Headers (`#`), paragraphs, lists maintained
70
+ - ✅ **Converts tables** - Complex HTML tables → clean Markdown pipes
71
+ - ✅ **Strips noise** - XBRL tags, inline styles, and boilerplate removed
72
+ - ✅ **Tracks pages** - Original pagination preserved for citation
73
+ - ✅ **Detects sections** - Auto-extract Risk Factors, MD&A, Business sections
74
+ - ✅ **Chunks intelligently** - Page-aware splitting with metadata headers
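To make the "HTML tables → Markdown pipes" idea concrete, here is a minimal sketch built only on the standard library. It is not sec2md's parser (the package ships its own `table_parser.py`); the class and function names are hypothetical, and it ignores `colspan`/`rowspan` and nested tables.

```python
from html.parser import HTMLParser


class TableToMarkdown(HTMLParser):
    """Collect the cell text of one simple HTML table, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built
        self._cell = None   # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse whitespace and escape pipes so they don't split cells.
            text = " ".join("".join(self._cell).split()).replace("|", "\\|")
            self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if any(self._row):  # drop spacer rows with no text
                self.rows.append(self._row)
            self._row = None


def html_table_to_markdown(html):
    parser = TableToMarkdown()
    parser.feed(html)
    if not parser.rows:
        return ""
    width = max(len(r) for r in parser.rows)
    rows = [r + [""] * (width - len(r)) for r in parser.rows]
    lines = ["| " + " | ".join(rows[0]) + " |",
             "|" + "|".join(["---"] * width) + "|"]
    lines += ["| " + " | ".join(r) + " |" for r in rows[1:]]
    return "\n".join(lines)


sample = """
<table>
  <tr><th style="font-weight:bold">Product</th><th>Revenue</th></tr>
  <tr><td></td><td></td></tr>
  <tr><td><span>iPhone</span></td><td>$200,583</td></tr>
</table>
"""
print(html_table_to_markdown(sample))
```

The inline style, the `<span>` wrapper, and the empty spacer row all disappear, leaving a three-line pipe table.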
75
+
76
+ ### What We Support
77
+
78
+ | Document Type                  | Status | Notes                                 |
79
+ |--------------------------------|--------|---------------------------------------|
80
+ | **10-K/Q Filings**             | ✅     | Full section extraction (ITEM 1-16)   |
81
+ | **Financial Statements**       | ✅     | Tables preserved in Markdown          |
82
+ | **Notes to Financials**        | ✅     | Automatic table unwrapping            |
83
+ | **8-K Press Releases**         | ✅     | Clean prose extraction                |
84
+ | **Proxy Statements (DEF 14A)** | ✅     | Executive compensation, governance    |
85
+ | **Exhibits** (Contracts)       | ✅     | Merger agreements, material contracts |
86
+
87
+ ---
88
+
89
+ ## Installation
90
+
91
+ ```bash
92
+ pip install sec2md
93
+ ```
94
+
95
+ ## Quickstart
96
+
97
+ ```python
98
+ import sec2md
99
+
100
+ # Convert any SEC filing to clean Markdown
101
+ md = sec2md.convert_to_markdown(
102
+     "https://www.sec.gov/Archives/edgar/data/320193/000032019324000123/aapl-20240928.htm",
103
+     user_agent="Your Name <you@example.com>"
104
+ )
105
+ ```
106
+
107
+ **Input:** Messy SEC HTML with XBRL tags, nested tables, inline styles
108
+ **Output:** Clean, structured Markdown ready for LLMs
109
+
110
+ ```markdown
111
+ ## ITEM 1. Business
112
+
113
+ Apple Inc. designs, manufactures, and markets smartphones, personal computers,
114
+ tablets, wearables, and accessories worldwide...
115
+
116
+ ### Products
117
+
118
+ | Product Category | Revenue (millions) |
119
+ |------------------|-------------------|
120
+ | iPhone | $200,583 |
121
+ | Mac | $29,357 |
122
+ | iPad | $28,300 |
123
+ ...
124
+ ```
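The Quickstart call passes a `user_agent` because SEC EDGAR asks automated clients to identify themselves. How sec2md uses that value internally is not shown here; the sketch below only illustrates the general idea with the standard library. The `build_edgar_request` helper and its email check are assumptions for illustration, and the request is built but never sent.

```python
from urllib.request import Request


def build_edgar_request(url, user_agent):
    """Build (but do not send) a GET request that identifies the caller."""
    # Assumption: treat a missing contact email as a configuration error.
    if "@" not in user_agent:
        raise ValueError("user_agent should include a contact email")
    return Request(url, headers={"User-Agent": user_agent})


req = build_edgar_request(
    "https://www.sec.gov/Archives/edgar/data/320193/000032019324000123/aapl-20240928.htm",
    "Your Name <you@example.com>",
)
# urllib stores header names capitalized, hence "User-agent".
print(req.get_header("User-agent"))
```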
125
+
126
+ ## Core Features
127
+
128
+ ### 1️⃣ Section Extraction
129
+ Extract specific sections from 10-K/10-Q filings with type-safe enums:
130
+
131
+ ```python
132
+ import sec2md
133
+ from sec2md import Item10K
134
+ pages = sec2md.convert_to_markdown(html, return_pages=True)  # html: raw filing HTML
135
+ sections = sec2md.extract_sections(pages, filing_type="10-K")
136
+
137
+ # Get Risk Factors section
138
+ risk = sec2md.get_section(sections, Item10K.RISK_FACTORS)
139
+ print(risk.markdown()) # Just the risk factors text
140
+ print(risk.page_range) # (12, 28) - page citations
141
+ ```
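For intuition about how section boundaries and page ranges relate, here is a minimal sketch that scans per-page Markdown for `ITEM` headings. It is not the logic in `section_extractor.py`; the regex, the `find_item_page_ranges` name, and the `(page_number, text)` tuple layout are assumptions for illustration.

```python
import re

# Matches headings such as "## ITEM 1A. Risk Factors" at the start of a line.
ITEM_RE = re.compile(r"^\s*#*\s*ITEM\s+(\d+[A-C]?)\.", re.IGNORECASE | re.MULTILINE)


def find_item_page_ranges(pages):
    """Map each ITEM code to the (first_page, last_page) range it spans.

    `pages` is a list of (page_number, markdown_text) tuples.
    """
    starts = []  # (item_code, page_number) in document order
    for number, text in pages:
        for match in ITEM_RE.finditer(text):
            starts.append((match.group(1).upper(), number))

    ranges = {}
    last_page = pages[-1][0] if pages else None
    for i, (code, first) in enumerate(starts):
        # A section runs until the page where the next ITEM heading starts.
        end = starts[i + 1][1] if i + 1 < len(starts) else last_page
        ranges.setdefault(code, (first, end))  # keep the first occurrence only
    return ranges


demo = [
    (1, "## ITEM 1. Business\nApple Inc. designs..."),
    (2, "Placeholder business text."),
    (3, "## ITEM 1A. Risk Factors\nPlaceholder risk text."),
    (4, "More placeholder text.\n## ITEM 1B. Unresolved Staff Comments\nNone."),
]
print(find_item_page_ranges(demo))
```

A real filing also lists every ITEM in its table of contents, so keeping only the first match per code would need refinement before production use.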
142
+
143
+ ### 2️⃣ Page-Aware Chunking
144
+ Intelligent chunking that preserves page numbers for citations:
145
+
146
+ ```python
147
+ chunks = sec2md.chunk_pages(pages, chunk_size=512)
148
+
149
+ for chunk in chunks:
150
+     print(f"Page {chunk.page}: {chunk.content[:100]}...")
151
+     # Use for embeddings, citations, or retrieval
152
+ ```
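The idea behind page-aware chunking can be sketched in a few lines. This is an illustration only, not the implementation in `sec2md/chunker/`: it counts whitespace-separated words rather than tokens, and the `Chunk` dataclass and `chunk_pages_sketch` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    page: int     # page the chunk was cut from
    content: str  # chunk text


def chunk_pages_sketch(pages, chunk_size=512):
    """Split each page into chunks of at most `chunk_size` words.

    Chunks never cross a page boundary, so every chunk maps to one page.
    Paragraphs stay whole unless a single paragraph exceeds the budget.
    """
    chunks = []
    for page_number, text in pages:
        current, count = [], 0
        for para in (p.strip() for p in text.split("\n\n")):
            if not para:
                continue
            words = para.split()
            # Oversized paragraph: flush, then hard-split it by word count.
            if len(words) > chunk_size:
                if current:
                    chunks.append(Chunk(page_number, "\n\n".join(current)))
                    current, count = [], 0
                for i in range(0, len(words), chunk_size):
                    chunks.append(Chunk(page_number, " ".join(words[i:i + chunk_size])))
                continue
            # Start a new chunk when this paragraph would overflow the budget.
            if current and count + len(words) > chunk_size:
                chunks.append(Chunk(page_number, "\n\n".join(current)))
                current, count = [], 0
            current.append(para)
            count += len(words)
        if current:
            chunks.append(Chunk(page_number, "\n\n".join(current)))
    return chunks


demo_pages = [
    (1, "alpha beta gamma\n\ndelta epsilon"),
    (2, "zeta eta theta iota kappa lambda"),
]
for chunk in chunk_pages_sketch(demo_pages, chunk_size=4):
    print(f"Page {chunk.page}: {chunk.content!r}")
```

Because the buffer is flushed at the end of every page, a citation can always point to a single page number.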
153
+
154
+ ### 3️⃣ RAG-Optimized Headers
155
+ Boost retrieval quality by adding metadata to chunk embeddings:
156
+
157
+ ```python
158
+ header = """# Apple Inc. (AAPL)
159
+ Form 10-K | FY 2024 | Risk Factors"""
160
+
161
+ chunks = sec2md.chunk_section(risk, header=header)
162
+
163
+ # chunk.embedding_text includes header for better embeddings
164
+ # chunk.content contains only the actual filing text
165
+ ```
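The split between `embedding_text` and `content` described in the comments above can be modelled as follows. This is a sketch of the idea, not sec2md's chunk model; `HeaderChunk` is a hypothetical class and the filing text is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class HeaderChunk:
    content: str      # filing text only
    header: str = ""  # optional context prepended for embedding

    @property
    def embedding_text(self):
        """Text to send to the embedding model: header plus content."""
        if not self.header:
            return self.content
        return f"{self.header}\n\n{self.content}"


header = "# Apple Inc. (AAPL)\nForm 10-K | FY 2024 | Risk Factors"
chunk = HeaderChunk(content="Placeholder risk-factor text.", header=header)

print(chunk.embedding_text)  # header, blank line, then the filing text
print(chunk.content)         # filing text only, e.g. for display or citation
```

Keeping the header out of `content` means the text shown to users or quoted in citations is never polluted by retrieval metadata.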
166
+
167
+ ### 4️⃣ EdgarTools Integration
168
+ Works seamlessly with [edgartools](https://github.com/dgunning/edgartools):
169
+
170
+ ```python
171
+ from edgar import Company
172
+ company = Company("AAPL")
173
+ filing = company.get_filings(form="10-K").latest()
174
+
175
+ md = sec2md.convert_to_markdown(filing.html())
176
+ ```
177
+
178
+ ---
179
+
180
+ ## Why Choose sec2md?
181
+
182
+ ### Just Parse It
183
+ Most libraries force you to choose between speed and accuracy. `sec2md` gives you both:
184
+ - 🚀 **Fast** - Processes 200-page filings in seconds
185
+ - 🎯 **Accurate** - Purpose-built for SEC document structure
186
+ - 🔧 **Simple** - One function call, zero configuration
187
+
188
+ ### Built for Agentic RAG
189
+ Don't rebuild what we've already solved:
190
+ - ✅ **Page tracking** - Cite sources with exact page numbers
191
+ - ✅ **Section detection** - Extract just what you need (Risk Factors, MD&A)
192
+ - ✅ **Smart chunking** - Respects table boundaries, preserves context
193
+ - ✅ **Metadata headers** - Boost embedding quality 2-3x with contextual headers
194
+
195
+ ---
196
+
197
+ ## Documentation
198
+
199
+ 📚 **Full documentation:** [sec2md.readthedocs.io](https://sec2md.readthedocs.io)
200
+
201
+ - [Quickstart Guide](https://sec2md.readthedocs.io/quickstart) - Get up and running in 3 minutes
202
+ - [Convert Filings](https://sec2md.readthedocs.io/usage/direct-conversion) - Handle 10-Ks, exhibits, press releases
203
+ - [Extract Sections](https://sec2md.readthedocs.io/usage/sections) - Pull specific ITEM sections
204
+ - [Chunking for RAG](https://sec2md.readthedocs.io/usage/chunking) - Page-aware chunking with contextual headers
205
+ - [EdgarTools Integration](https://sec2md.readthedocs.io/usage/edgartools) - Automate filing downloads
206
+ - [API Reference](https://sec2md.readthedocs.io/api/convert_to_markdown) - Complete API docs
207
+
208
+ ---
209
+
210
+ ## Contributing
211
+
212
+ We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
213
+
214
+ ## License
215
+
216
+ MIT © 2025
@@ -0,0 +1,19 @@
1
+ sec2md/__init__.py,sha256=iR_2g-PDkCAzY76uQwBjIVpprvkxlNopdmDduzDp8lg,1037
2
+ sec2md/absolute_table_parser.py,sha256=rphc5_HttniV2RtPCThQ68HWyyZIn9l-gkaFsbtQXBU,22982
3
+ sec2md/chunking.py,sha256=SQASDA057bKLhSj34GNAHrRl94Rf-A9WlfEvhhWPuIc,6350
4
+ sec2md/core.py,sha256=hmdJXitoEWuekR5f3B1oEK1xmPux0t494lOpg5aJrRk,2663
5
+ sec2md/models.py,sha256=H_3HnI8exGVnbqbdT1Bf4bNhPLjqvlP64ud0au5ohJk,14735
6
+ sec2md/parser.py,sha256=J1He6XMa1Mf9YGJCEffWuCs7SAqi0Ts6S445CTO-lAA,47559
7
+ sec2md/section_extractor.py,sha256=JTbZpPgmTipzU1Q5LehlQ9y2X4ZcQRTj3A7iMr90iqM,25976
8
+ sec2md/sections.py,sha256=wtmKqF_KP_G-7_qAxGvxs25U_4vcH5NDGn14ouEy5GE,2784
9
+ sec2md/table_parser.py,sha256=FhR8OwX5NAJmzdbTFzHQTGUNUPieYN37UzMFbQMkogU,12551
10
+ sec2md/utils.py,sha256=2lUeN5irTbdIyjylCkaPKMv4ALWxWMJl96PTO8FV3Ik,2990
11
+ sec2md/chunker/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
12
+ sec2md/chunker/markdown_blocks.py,sha256=yEF_v72DvYOVu0ZQ5bBCFpNM12INg-8RmajIu_dorQQ,4372
13
+ sec2md/chunker/markdown_chunk.py,sha256=hCMpjn0cc5TIjWSZviq4fM7e781X3AtRcmI60pDLWro,4763
14
+ sec2md/chunker/markdown_chunker.py,sha256=IYW8pQ2q9hX1lRGw4TnKAQcr-HmJfSW7wffu-BA0Jms,10743
15
+ sec2md-0.1.5.dist-info/licenses/LICENSE,sha256=uJDiSGQ5TOx-PGhu2LGH4A-O53vS5hrQ5sc3j2Ps_Rk,1071
16
+ sec2md-0.1.5.dist-info/METADATA,sha256=YWQ9uiut1LcBQxOCvFcT8MlfgLO7VBCDtEju5h7fp6k,7625
17
+ sec2md-0.1.5.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
18
+ sec2md-0.1.5.dist-info/top_level.txt,sha256=Jpmw3laEWwS9fljtAEg4sExjFw3zP8dGarjIknyh1v8,7
19
+ sec2md-0.1.5.dist-info/RECORD,,
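Each RECORD line above has the form `path,sha256=<digest>,<size>`, where the digest is the URL-safe base64 encoding of the SHA-256 hash with trailing `=` padding removed. A minimal sketch for checking one entry (the helper names are hypothetical; the RECORD file's own `,,` line carries no hash and is not handled):

```python
import base64
import hashlib


def record_hash(data: bytes) -> str:
    """Return the RECORD-style digest: urlsafe base64 of SHA-256, no '=' padding."""
    digest = hashlib.sha256(data).digest()
    return "sha256=" + base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def check_record_line(line: str, data: bytes) -> bool:
    """Compare one 'path,hash,size' RECORD line against file contents."""
    path, expected_hash, size = line.rsplit(",", 2)
    return expected_hash == record_hash(data) and int(size) == len(data)


# The empty chunker/__init__.py entry from the RECORD above.
line = "sec2md/chunker/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0"
print(check_record_line(line, b""))  # True: the file is empty
```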
@@ -0,0 +1,5 @@
1
+ Wheel-Version: 1.0
2
+ Generator: setuptools (80.9.0)
3
+ Root-Is-Purelib: true
4
+ Tag: py3-none-any
5
+
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 Lucas Astorian
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1 @@
1
+ sec2md