llmsbrieftxt-1.6.0-py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.


@@ -0,0 +1,420 @@
Metadata-Version: 2.4
Name: llmsbrieftxt
Version: 1.6.0
Summary: Generate llms-brief.txt files from documentation websites using AI
Project-URL: Homepage, https://github.com/stevennevins/llmsbrief
Project-URL: Repository, https://github.com/stevennevins/llmsbrief
Project-URL: Issues, https://github.com/stevennevins/llmsbrief/issues
Project-URL: Documentation, https://github.com/stevennevins/llmsbrief#readme
Author: llmsbrieftxt contributors
License: MIT
License-File: LICENSE
Keywords: ai,crawling,documentation,llm,llms-brief,llmstxt,openai,summarization,web-scraping
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Software Development :: Documentation
Classifier: Topic :: Text Processing :: Markup :: Markdown
Requires-Python: >=3.10
Requires-Dist: beautifulsoup4>=4.13.5
Requires-Dist: httpx<1.0.0,>=0.28.1
Requires-Dist: openai<2.0.0,>=1.54.0
Requires-Dist: pydantic<3.0.0,>=2.10.1
Requires-Dist: tenacity<10.0.0,>=9.1.2
Requires-Dist: tqdm<5.0.0,>=4.66.0
Requires-Dist: trafilatura>=2.0.0
Requires-Dist: ultimate-sitemap-parser>=1.6.0
Description-Content-Type: text/markdown

# llmsbrieftxt

Generate llms-brief.txt files from any documentation website using AI. A focused, production-ready CLI tool that does one thing exceptionally well.

## Quick Start

```bash
# Install
pip install llmsbrieftxt

# Set your OpenAI API key
export OPENAI_API_KEY="sk-your-api-key-here"

# Generate llms-brief.txt from a documentation site
llmtxt https://docs.python.org/3/

# Preview URLs before processing
llmtxt https://react.dev --show-urls

# Use a different model
llmtxt https://react.dev --model gpt-4o
```

## What It Does

Crawls documentation websites, extracts content, and uses OpenAI to generate structured llms-brief.txt files. Each entry contains a title, URL, keywords, and a one-line summary, making it easy for LLMs and developers to navigate documentation.

**Key Features:**
- **Smart Crawling**: Breadth-first discovery up to depth 3, with URL deduplication
- **Content Extraction**: HTML to Markdown using trafilatura
- **AI Summarization**: Structured output using OpenAI
- **Automatic Caching**: Summaries cached in `.llmsbrieftxt_cache/` to avoid reprocessing
- **Production-Ready**: Clean output, proper error handling, scriptable

## Installation

```bash
# With pip
pip install llmsbrieftxt

# With uv (recommended)
uv pip install llmsbrieftxt
```

## Prerequisites

- **Python 3.10+**
- **OpenAI API Key**: Required for generating summaries
  ```bash
  export OPENAI_API_KEY="sk-your-api-key-here"
  ```

## Usage

### Basic Command

```bash
llmtxt <url> [options]
```

Output is automatically saved to `~/.claude/docs/<domain>.txt` (e.g., `docs.python.org.txt`).

### Options

- `--output PATH` - Custom output path (default: `~/.claude/docs/<domain>.txt`)
- `--model MODEL` - OpenAI model to use (default: `gpt-5-mini`)
- `--max-concurrent-summaries N` - Concurrent LLM requests (default: 10)
- `--show-urls` - Preview discovered URLs with cost estimate (no API calls)
- `--max-urls N` - Limit number of URLs to process
- `--depth N` - Maximum crawl depth (default: 3)
- `--cache-dir PATH` - Cache directory path (default: `.llmsbrieftxt_cache`)
- `--use-cache-only` - Use only cached summaries, skip API calls for new pages
- `--force-refresh` - Ignore cache and regenerate all summaries

### Examples

```bash
# Basic usage - saves to ~/.claude/docs/docs.python.org.txt
llmtxt https://docs.python.org/3/

# Use a different model
llmtxt https://react.dev --model gpt-4o

# Preview URLs with cost estimate before processing (no API calls)
llmtxt https://react.dev --show-urls

# Limit scope for testing
llmtxt https://docs.python.org --max-urls 50

# Custom crawl depth (explore deeper or shallower)
llmtxt https://example.com --depth 2

# Use only cached summaries (no API calls)
llmtxt https://docs.python.org/3/ --use-cache-only

# Force refresh all summaries (ignore cache)
llmtxt https://docs.python.org/3/ --force-refresh

# Custom cache directory
llmtxt https://example.com --cache-dir /tmp/my-cache

# Custom output location
llmtxt https://react.dev --output ./my-docs/react.txt

# Process with higher concurrency (if you have high rate limits)
llmtxt https://fastapi.tiangolo.com --max-concurrent-summaries 20
```

## Searching and Listing

This tool focuses on **generating** llms-brief.txt files. For searching and listing, use standard Unix tools:

### Search Documentation

```bash
# Search all docs
rg "async functions" ~/.claude/docs/

# Search specific file
rg "hooks" ~/.claude/docs/react.dev.txt

# Case-insensitive search
rg -i "error handling" ~/.claude/docs/

# Show context around matches
rg -C 2 "api" ~/.claude/docs/

# Or use grep
grep -r "async" ~/.claude/docs/
```

### List Documentation

```bash
# List all docs
ls ~/.claude/docs/

# List with details
ls -lh ~/.claude/docs/

# Count entries in a file
grep -c "^Title:" ~/.claude/docs/react.dev.txt

# Find all docs and show sizes
find ~/.claude/docs/ -name "*.txt" -exec wc -l {} +
```

**Why use standard tools?** They're:
- Already installed on your system
- More powerful and flexible
- Well-documented
- Composable with other commands
- Faster than any custom implementation

## How It Works

### URL Discovery

The tool uses a breadth-first search strategy:
- Explores links up to 3 levels deep from your starting URL
- Automatically excludes assets (CSS, JS, images) and non-documentation pages
- Normalizes URLs to prevent duplicate processing
- Discovers 100-300+ pages on typical documentation sites
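
The discovery loop described above can be sketched as a depth-limited breadth-first search over page links. This is an illustration of the idea, not the package's actual crawler; `fetch_links` is a hypothetical stand-in for fetching a page and extracting its links, and the normalization shown (dropping fragments and trailing slashes) is only one example of deduplication:

```python
from collections import deque
from urllib.parse import urldefrag


def normalize(url: str) -> str:
    # Drop the #fragment and any trailing slash so equivalent URLs dedupe.
    url, _fragment = urldefrag(url)
    return url.rstrip("/")


def crawl(start: str, fetch_links, max_depth: int = 3) -> set[str]:
    """Breadth-first discovery up to max_depth levels from start."""
    start = normalize(start)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # Do not expand links beyond the depth limit.
        for link in fetch_links(url):
            link = normalize(link)
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen
```

Because the queue is processed in FIFO order, pages near the start URL are discovered before deeper ones, which is what makes `--max-urls` cut off the least central pages first.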

### Content Processing Pipeline

```
URL Discovery → Content Extraction → LLM Summarization → File Generation
```

1. **Crawl**: Discover all documentation URLs
2. **Extract**: Convert HTML to Markdown using trafilatura
3. **Summarize**: Generate structured summaries using OpenAI
4. **Cache**: Store summaries in `.llmsbrieftxt_cache/` for reuse
5. **Generate**: Compile into searchable llms-brief.txt format

### Output Format

Each entry in the generated file contains:
```
Title: [Page Name](URL)
Keywords: searchable, terms, functions, concepts
Summary: One-line description of page content

```
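
Because entries are blank-line-separated `Field: value` blocks, they are easy to post-process with a few lines of Python. A minimal sketch (the field names come from the format above; the parser itself is illustrative, not part of the package):

```python
def parse_entries(text: str) -> list[dict[str, str]]:
    """Split llms-brief.txt-style text into one dict per entry."""
    entries = []
    for block in text.split("\n\n"):
        entry = {}
        for line in block.strip().splitlines():
            if ":" in line:
                # Split on the first colon only, so URLs in values survive.
                key, value = line.split(":", 1)
                entry[key.strip().lower()] = value.strip()
        if "title" in entry:
            entries.append(entry)
    return entries
```

For example, `parse_entries(Path("~/.claude/docs/react.dev.txt").expanduser().read_text())` would yield dicts with `title`, `keywords`, and `summary` keys.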

## Development

### Setup

```bash
# Clone and install with dev dependencies
git clone https://github.com/stevennevins/llmsbrief.git
cd llmsbrief
uv sync --group dev
```

### Running Tests

```bash
# All tests
uv run pytest

# Unit tests only
uv run pytest tests/unit/

# Specific test file
uv run pytest tests/unit/test_cli.py

# With verbose output
uv run pytest -v
```

### E2E Testing with Ollama (No API Costs)

For testing without OpenAI API costs, use [Ollama](https://ollama.com) as a local LLM provider:

```bash
# 1. Install Ollama (one-time setup)
curl -fsSL https://ollama.com/install.sh | sh
# Or download from: https://ollama.com/download

# 2. Start Ollama service
ollama serve &

# 3. Pull a lightweight model
ollama pull tinyllama        # 637MB, fastest
# Or: ollama pull phi3:mini  # 2.3GB, better quality

# 4. Run E2E tests with Ollama
export OPENAI_BASE_URL="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama-dummy-key"
uv run pytest tests/integration/test_ollama_e2e.py -v

# 5. Or test the CLI directly
llmtxt https://example.com --model tinyllama --max-urls 5 --depth 1
```

**Benefits:**
- ✅ Zero API costs - runs completely locally
- ✅ OpenAI-compatible endpoint
- ✅ Same code path as production
- ✅ Cached in GitHub Actions for CI/CD

**Recommended Models:**
- `tinyllama` (637MB) - Fastest, great for CI/CD
- `phi3:mini` (2.3GB) - Better quality, still fast
- `gemma2:2b` (1.6GB) - Balanced option

### Code Quality

```bash
# Lint code
uv run ruff check llmsbrieftxt/ tests/

# Format code
uv run ruff format llmsbrieftxt/ tests/

# Type checking
uv run mypy llmsbrieftxt/
```

## Configuration

### Default Settings

- **Crawl Depth**: 3 levels (configurable via `--depth`)
- **Output Location**: `~/.claude/docs/<domain>.txt` (configurable via `--output`)
- **Cache Directory**: `.llmsbrieftxt_cache/` (configurable via `--cache-dir`)
- **OpenAI Model**: `gpt-5-mini` (configurable via `--model`)
- **Concurrent Requests**: 10 (configurable via `--max-concurrent-summaries`)

### Environment Variables

- `OPENAI_API_KEY` - Required for all operations
- `OPENAI_BASE_URL` - Optional. Set to use OpenAI-compatible endpoints (e.g., Ollama at `http://localhost:11434/v1`)

## Usage Tips

### Managing API Costs

- **Preview with cost estimate**: Use `--show-urls` to see discovered URLs and the estimated API cost before processing
- **Limit scope**: Use `--max-urls` to limit processing during testing
- **Automatic caching**: Summaries are cached automatically, so rerunning is cheap
- **Cache-only mode**: Use `--use-cache-only` to generate output from cache without API calls
- **Force refresh**: Use `--force-refresh` when you need to regenerate all summaries
- **Cost-effective model**: The default model, `gpt-5-mini`, is cost-effective for most documentation

### Controlling Crawl Depth

- **Default depth (3)**: Good for most documentation sites (100-300 pages)
- **Shallow crawl (1-2)**: Use for large sites or to focus on main pages only
- **Deep crawl (4-5)**: Use for small sites or comprehensive coverage
- Example: `llmtxt https://example.com --depth 2 --show-urls` to preview scope

### Cache Management

- **Default location**: `.llmsbrieftxt_cache/` in the current directory
- **Custom location**: Use `--cache-dir` for shared caches or different organization
- **Cache benefits**: Speeds up reruns, reduces API costs, enables incremental updates
- **Failed URLs tracking**: Failed URLs are written to `failed_urls.txt` next to the output file

### Organizing Documentation

All docs are saved to `~/.claude/docs/` by domain name:
```
~/.claude/docs/
├── docs.python.org.txt
├── react.dev.txt
├── pytorch.org.txt
└── fastapi.tiangolo.com.txt
```

This makes it easy for Claude Code and other tools to find and reference documentation.

## Integrations

### Claude Code

This tool is designed to work seamlessly with Claude Code. Once you've generated documentation files, Claude can search and reference them during development sessions.

### MCP Servers

Generated llms-brief.txt files can be served via MCP (Model Context Protocol) servers. See the [mcpdoc project](https://github.com/langchain-ai/mcpdoc) for an example integration.

## Troubleshooting

### API Key Issues

```bash
# Verify the API key is set
echo $OPENAI_API_KEY

# Set it if missing
export OPENAI_API_KEY="sk-your-api-key-here"
```

### Rate Limiting

If you hit rate limits, reduce concurrent requests:
```bash
llmtxt https://example.com --max-concurrent-summaries 5
```
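
The `--max-concurrent-summaries` knob corresponds to the standard bounded-concurrency pattern in asyncio: run all requests concurrently, but let a semaphore cap how many are in flight at once. A minimal sketch of the idea, where `summarize` is a hypothetical stand-in for one LLM request:

```python
import asyncio


async def summarize(url: str) -> str:
    # Hypothetical stand-in for a single LLM summarization request.
    await asyncio.sleep(0)
    return f"summary of {url}"


async def summarize_all(urls: list[str], max_concurrent: int = 10) -> list[str]:
    """Summarize every URL, with at most max_concurrent requests in flight."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> str:
        async with sem:  # Blocks here once max_concurrent slots are taken.
            return await summarize(url)

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(bounded(u) for u in urls))
```

Lowering the semaphore limit is exactly what `--max-concurrent-summaries 5` does: fewer simultaneous requests, so you stay under the provider's rate limit at the cost of a longer total run.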

### Large Documentation Sites

For very large sites (500+ pages):
1. Start with `--show-urls` to see the scope
2. Use `--max-urls` to process in batches
3. Increase `--max-concurrent-summaries` if you have high rate limits

## Migrating from 0.x

Version 1.0.0 removed the `search` and `list` subcommands in favor of Unix tools:

```bash
# Before (v0.x)
llmsbrieftxt generate https://docs.python.org/3/
llmsbrieftxt search "async"
llmsbrieftxt list

# After (v1.0.0)
llmtxt https://docs.python.org/3/
rg "async" ~/.claude/docs/
ls ~/.claude/docs/
```

**Why the change?** Focus on doing one thing well. Search and list are better served by mature, powerful Unix tools you already have.

## License

MIT

## Contributing

Contributions welcome! Please:
1. Run tests: `uv run pytest`
2. Lint code: `uv run ruff check llmsbrieftxt/ tests/`
3. Format code: `uv run ruff format llmsbrieftxt/ tests/`
4. Check types: `uv run mypy llmsbrieftxt/`
5. Submit a PR

## Links

- **Homepage**: https://github.com/stevennevins/llmsbrief
- **Issues**: https://github.com/stevennevins/llmsbrief/issues
- **llms.txt Spec**: https://llmstxt.org/
@@ -0,0 +1,16 @@
llmsbrieftxt/__init__.py,sha256=baAcEjLSYFIeNZF51tOMmA_zAMhN8HvKael-UU-Ruec,22
llmsbrieftxt/cli.py,sha256=TSSSKtDydMpa6rApZ6sJQwCgGkMXf2cSeDe_lp80F1g,8440
llmsbrieftxt/constants.py,sha256=cjV_W5MqfVINM78__6eKnFPOGPHAI4ZYz8GqbIEEKz8,2565
llmsbrieftxt/crawler.py,sha256=ryt6pZ8Ed5vzEa78qeu93eSDlSyuFBqePlYZZMUFvGM,12553
llmsbrieftxt/doc_loader.py,sha256=dGeHnEVCqtTQgdowMCFxrhrmh3QV5n8l3TIOgDYaU9g,5167
llmsbrieftxt/extractor.py,sha256=28jckOcYf7u5zmZrhOZ-PmcWvPwTLZhMHxISSkFdeXk,1955
llmsbrieftxt/main.py,sha256=5R6cAKFou9_FCluHQaktHKQU_nn_n3asnveB_g7o3yA,14346
llmsbrieftxt/schema.py,sha256=ix9666XBpSbHUuYF1-jIK88sijK5Cvaer6gwbdLlWfs,2186
llmsbrieftxt/summarizer.py,sha256=bv5CLc_0yxFefoXXBt8R_ztqsk4i4yAEiFv8LX93B04,11015
llmsbrieftxt/url_filters.py,sha256=1KWO9yfPEqOIFXVts5xraErVQKPDAw4Nls3yuXzbRE8,2182
llmsbrieftxt/url_utils.py,sha256=vFc_MNyLZ6QflhDF0oyiZJPYuF2_GyQmtKK7etwCmcs,2212
llmsbrieftxt-1.6.0.dist-info/METADATA,sha256=S91kMFwJNIb4b8PRsOEdlHNLT3Ay4F8ZZkA_QQnAcqo,12140
llmsbrieftxt-1.6.0.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
llmsbrieftxt-1.6.0.dist-info/entry_points.txt,sha256=lY7gjN9DS7cv3Kd3LjezvgFBum7BhpMHSPGvdCzBtFU,49
llmsbrieftxt-1.6.0.dist-info/licenses/LICENSE,sha256=Bf6uF7ggkMcXEXAdu2lGR7u-voH5CJIWOzU5vnKQVJI,1082
llmsbrieftxt-1.6.0.dist-info/RECORD,,
@@ -0,0 +1,4 @@
Wheel-Version: 1.0
Generator: hatchling 1.27.0
Root-Is-Purelib: true
Tag: py3-none-any
@@ -0,0 +1,2 @@
[console_scripts]
llmtxt = llmsbrieftxt.cli:main
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 llmsbrieftxt contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.