llmsbrieftxt-1.6.0-py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.


@@ -0,0 +1,420 @@
Metadata-Version: 2.4
Name: llmsbrieftxt
Version: 1.6.0
Summary: Generate llms-brief.txt files from documentation websites using AI
Project-URL: Homepage, https://github.com/stevennevins/llmsbrief
Project-URL: Repository, https://github.com/stevennevins/llmsbrief
Project-URL: Issues, https://github.com/stevennevins/llmsbrief/issues
Project-URL: Documentation, https://github.com/stevennevins/llmsbrief#readme
Author: llmsbrieftxt contributors
License: MIT
License-File: LICENSE
Keywords: ai,crawling,documentation,llm,llms-brief,llmstxt,openai,summarization,web-scraping
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Software Development :: Documentation
Classifier: Topic :: Text Processing :: Markup :: Markdown
Requires-Python: >=3.10
Requires-Dist: beautifulsoup4>=4.13.5
Requires-Dist: httpx<1.0.0,>=0.28.1
Requires-Dist: openai<2.0.0,>=1.54.0
Requires-Dist: pydantic<3.0.0,>=2.10.1
Requires-Dist: tenacity<10.0.0,>=9.1.2
Requires-Dist: tqdm<5.0.0,>=4.66.0
Requires-Dist: trafilatura>=2.0.0
Requires-Dist: ultimate-sitemap-parser>=1.6.0
Description-Content-Type: text/markdown

# llmsbrieftxt

Generate llms-brief.txt files from any documentation website using AI. A focused, production-ready CLI tool that does one thing exceptionally well.

## Quick Start

```bash
# Install
pip install llmsbrieftxt

# Set your OpenAI API key
export OPENAI_API_KEY="sk-your-api-key-here"

# Generate llms-brief.txt from a documentation site
llmtxt https://docs.python.org/3/

# Preview URLs before processing
llmtxt https://react.dev --show-urls

# Use a different model
llmtxt https://react.dev --model gpt-4o
```

## What It Does

Crawls documentation websites, extracts content, and uses OpenAI to generate structured llms-brief.txt files. Each entry contains a title, URL, keywords, and a one-line summary, making it easy for LLMs and developers to navigate documentation.

**Key Features:**
- **Smart Crawling**: Breadth-first discovery up to depth 3, with URL deduplication
- **Content Extraction**: HTML to Markdown using trafilatura
- **AI Summarization**: Structured output using OpenAI
- **Automatic Caching**: Summaries cached in `.llmsbrieftxt_cache/` to avoid reprocessing
- **Production-Ready**: Clean output, proper error handling, scriptable

## Installation

```bash
# With pip
pip install llmsbrieftxt

# With uv (recommended)
uv pip install llmsbrieftxt
```

## Prerequisites

- **Python 3.10+**
- **OpenAI API Key**: Required for generating summaries
  ```bash
  export OPENAI_API_KEY="sk-your-api-key-here"
  ```

## Usage

### Basic Command

```bash
llmtxt <url> [options]
```

Output is automatically saved to `~/.claude/docs/<domain>.txt` (e.g., `docs.python.org.txt`).

### Options

- `--output PATH` - Custom output path (default: `~/.claude/docs/<domain>.txt`)
- `--model MODEL` - OpenAI model to use (default: `gpt-5-mini`)
- `--max-concurrent-summaries N` - Concurrent LLM requests (default: 10)
- `--show-urls` - Preview discovered URLs with cost estimate (no API calls)
- `--max-urls N` - Limit number of URLs to process
- `--depth N` - Maximum crawl depth (default: 3)
- `--cache-dir PATH` - Cache directory path (default: `.llmsbrieftxt_cache`)
- `--use-cache-only` - Use only cached summaries, skip API calls for new pages
- `--force-refresh` - Ignore cache and regenerate all summaries

### Examples

```bash
# Basic usage - saves to ~/.claude/docs/docs.python.org.txt
llmtxt https://docs.python.org/3/

# Use a different model
llmtxt https://react.dev --model gpt-4o

# Preview URLs with cost estimate before processing (no API calls)
llmtxt https://react.dev --show-urls

# Limit scope for testing
llmtxt https://docs.python.org --max-urls 50

# Custom crawl depth (explore deeper or shallower)
llmtxt https://example.com --depth 2

# Use only cached summaries (no API calls)
llmtxt https://docs.python.org/3/ --use-cache-only

# Force refresh all summaries (ignore cache)
llmtxt https://docs.python.org/3/ --force-refresh

# Custom cache directory
llmtxt https://example.com --cache-dir /tmp/my-cache

# Custom output location
llmtxt https://react.dev --output ./my-docs/react.txt

# Process with higher concurrency (if you have high rate limits)
llmtxt https://fastapi.tiangolo.com --max-concurrent-summaries 20
```

## Searching and Listing

This tool focuses on **generating** llms-brief.txt files. For searching and listing, use standard Unix tools:

### Search Documentation

```bash
# Search all docs
rg "async functions" ~/.claude/docs/

# Search specific file
rg "hooks" ~/.claude/docs/react.dev.txt

# Case-insensitive search
rg -i "error handling" ~/.claude/docs/

# Show context around matches
rg -C 2 "api" ~/.claude/docs/

# Or use grep
grep -r "async" ~/.claude/docs/
```

### List Documentation

```bash
# List all docs
ls ~/.claude/docs/

# List with details
ls -lh ~/.claude/docs/

# Count entries in a file
grep -c "^Title:" ~/.claude/docs/react.dev.txt

# Find all docs and show sizes
find ~/.claude/docs/ -name "*.txt" -exec wc -l {} +
```

**Why use standard tools?** They're:
- Already installed on your system
- More powerful and flexible
- Well-documented
- Composable with other commands
- Faster than any custom implementation

## How It Works

### URL Discovery

The tool uses a breadth-first search strategy:
- Explores links up to 3 levels deep from your starting URL
- Automatically excludes assets (CSS, JS, images) and non-documentation pages
- Normalizes URLs to prevent duplicate processing
- Discovers 100-300+ pages on typical documentation sites
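
The discovery loop described above can be sketched as a depth-limited breadth-first search over page links. This is an illustration of the idea, not the package's actual crawler; `fetch_links` is a hypothetical stand-in for fetching a page and extracting its links, and the normalization shown (dropping fragments and trailing slashes) is only one example of deduplication:

```python
from collections import deque
from urllib.parse import urldefrag


def normalize(url: str) -> str:
    # Drop the #fragment and any trailing slash so equivalent URLs dedupe.
    url, _fragment = urldefrag(url)
    return url.rstrip("/")


def crawl(start: str, fetch_links, max_depth: int = 3) -> set[str]:
    """Breadth-first discovery up to max_depth levels from start."""
    start = normalize(start)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # Do not expand links beyond the depth limit.
        for link in fetch_links(url):
            link = normalize(link)
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen
```

Because the queue is processed in FIFO order, pages near the start URL are discovered before deeper ones, which is what makes `--max-urls` cut off the least central pages first.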

### Content Processing Pipeline

```
URL Discovery → Content Extraction → LLM Summarization → File Generation
```

1. **Crawl**: Discover all documentation URLs
2. **Extract**: Convert HTML to Markdown using trafilatura
3. **Summarize**: Generate structured summaries using OpenAI
4. **Cache**: Store summaries in `.llmsbrieftxt_cache/` for reuse
5. **Generate**: Compile into searchable llms-brief.txt format

### Output Format

Each entry in the generated file contains:
```
Title: [Page Name](URL)
Keywords: searchable, terms, functions, concepts
Summary: One-line description of page content

```
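
Because entries are blank-line-separated `Field: value` blocks, they are easy to post-process with a few lines of Python. A minimal sketch (the field names come from the format above; the parser itself is illustrative, not part of the package):

```python
def parse_entries(text: str) -> list[dict[str, str]]:
    """Split llms-brief.txt-style text into one dict per entry."""
    entries = []
    for block in text.split("\n\n"):
        entry = {}
        for line in block.strip().splitlines():
            if ":" in line:
                # Split on the first colon only, so URLs in values survive.
                key, value = line.split(":", 1)
                entry[key.strip().lower()] = value.strip()
        if "title" in entry:
            entries.append(entry)
    return entries
```

For example, `parse_entries(Path("~/.claude/docs/react.dev.txt").expanduser().read_text())` would yield dicts with `title`, `keywords`, and `summary` keys.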

## Development

### Setup

```bash
# Clone and install with dev dependencies
git clone https://github.com/stevennevins/llmsbrief.git
cd llmsbrief
uv sync --group dev
```

### Running Tests

```bash
# All tests
uv run pytest

# Unit tests only
uv run pytest tests/unit/

# Specific test file
uv run pytest tests/unit/test_cli.py

# With verbose output
uv run pytest -v
```

### E2E Testing with Ollama (No API Costs)

For testing without OpenAI API costs, use [Ollama](https://ollama.com) as a local LLM provider:

```bash
# 1. Install Ollama (one-time setup)
curl -fsSL https://ollama.com/install.sh | sh
# Or download from: https://ollama.com/download

# 2. Start Ollama service
ollama serve &

# 3. Pull a lightweight model
ollama pull tinyllama        # 637MB, fastest
# Or: ollama pull phi3:mini  # 2.3GB, better quality

# 4. Run E2E tests with Ollama
export OPENAI_BASE_URL="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama-dummy-key"
uv run pytest tests/integration/test_ollama_e2e.py -v

# 5. Or test the CLI directly
llmtxt https://example.com --model tinyllama --max-urls 5 --depth 1
```

**Benefits:**
- ✅ Zero API costs - runs completely locally
- ✅ OpenAI-compatible endpoint
- ✅ Same code path as production
- ✅ Cached in GitHub Actions for CI/CD

**Recommended Models:**
- `tinyllama` (637MB) - Fastest, great for CI/CD
- `phi3:mini` (2.3GB) - Better quality, still fast
- `gemma2:2b` (1.6GB) - Balanced option

### Code Quality

```bash
# Lint code
uv run ruff check llmsbrieftxt/ tests/

# Format code
uv run ruff format llmsbrieftxt/ tests/

# Type checking
uv run mypy llmsbrieftxt/
```

## Configuration

### Default Settings

- **Crawl Depth**: 3 levels (configurable via `--depth`)
- **Output Location**: `~/.claude/docs/<domain>.txt` (configurable via `--output`)
- **Cache Directory**: `.llmsbrieftxt_cache/` (configurable via `--cache-dir`)
- **OpenAI Model**: `gpt-5-mini` (configurable via `--model`)
- **Concurrent Requests**: 10 (configurable via `--max-concurrent-summaries`)

### Environment Variables

- `OPENAI_API_KEY` - Required for all operations
- `OPENAI_BASE_URL` - Optional. Set to use OpenAI-compatible endpoints (e.g., Ollama at `http://localhost:11434/v1`)

## Usage Tips

### Managing API Costs

- **Preview with cost estimate**: Use `--show-urls` to see discovered URLs and the estimated API cost before processing
- **Limit scope**: Use `--max-urls` to limit processing during testing
- **Automatic caching**: Summaries are cached automatically, so rerunning is cheap
- **Cache-only mode**: Use `--use-cache-only` to generate output from cache without API calls
- **Force refresh**: Use `--force-refresh` when you need to regenerate all summaries
- **Cost-effective model**: The default model, `gpt-5-mini`, is cost-effective for most documentation

### Controlling Crawl Depth

- **Default depth (3)**: Good for most documentation sites (100-300 pages)
- **Shallow crawl (1-2)**: Use for large sites or to focus on main pages only
- **Deep crawl (4-5)**: Use for small sites or comprehensive coverage
- Example: `llmtxt https://example.com --depth 2 --show-urls` to preview scope

### Cache Management

- **Default location**: `.llmsbrieftxt_cache/` in the current directory
- **Custom location**: Use `--cache-dir` for shared caches or different organization
- **Cache benefits**: Speeds up reruns, reduces API costs, enables incremental updates
- **Failed URLs tracking**: Failed URLs are written to `failed_urls.txt` next to the output file

### Organizing Documentation

All docs are saved to `~/.claude/docs/` by domain name:
```
~/.claude/docs/
├── docs.python.org.txt
├── react.dev.txt
├── pytorch.org.txt
└── fastapi.tiangolo.com.txt
```

This makes it easy for Claude Code and other tools to find and reference documentation.

## Integrations

### Claude Code

This tool is designed to work seamlessly with Claude Code. Once you've generated documentation files, Claude can search and reference them during development sessions.

### MCP Servers

Generated llms-brief.txt files can be served via MCP (Model Context Protocol) servers. See the [mcpdoc project](https://github.com/langchain-ai/mcpdoc) for an example integration.

## Troubleshooting

### API Key Issues

```bash
# Verify the API key is set
echo $OPENAI_API_KEY

# Set it if missing
export OPENAI_API_KEY="sk-your-api-key-here"
```

### Rate Limiting

If you hit rate limits, reduce concurrent requests:
```bash
llmtxt https://example.com --max-concurrent-summaries 5
```
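
The `--max-concurrent-summaries` knob corresponds to the standard bounded-concurrency pattern in asyncio: run all requests concurrently, but let a semaphore cap how many are in flight at once. A minimal sketch of the idea, where `summarize` is a hypothetical stand-in for one LLM request:

```python
import asyncio


async def summarize(url: str) -> str:
    # Hypothetical stand-in for a single LLM summarization request.
    await asyncio.sleep(0)
    return f"summary of {url}"


async def summarize_all(urls: list[str], max_concurrent: int = 10) -> list[str]:
    """Summarize every URL, with at most max_concurrent requests in flight."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> str:
        async with sem:  # Blocks here once max_concurrent slots are taken.
            return await summarize(url)

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(bounded(u) for u in urls))
```

Lowering the semaphore limit is exactly what `--max-concurrent-summaries 5` does: fewer simultaneous requests, so you stay under the provider's rate limit at the cost of a longer total run.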

### Large Documentation Sites

For very large sites (500+ pages):
1. Start with `--show-urls` to see the scope
2. Use `--max-urls` to process in batches
3. Increase `--max-concurrent-summaries` if you have high rate limits

## Migrating from 0.x

Version 1.0.0 removed the `search` and `list` subcommands in favor of Unix tools:

```bash
# Before (v0.x)
llmsbrieftxt generate https://docs.python.org/3/
llmsbrieftxt search "async"
llmsbrieftxt list

# After (v1.0.0)
llmtxt https://docs.python.org/3/
rg "async" ~/.claude/docs/
ls ~/.claude/docs/
```

**Why the change?** Focus on doing one thing well. Search and list are better served by mature, powerful Unix tools you already have.

## License

MIT

## Contributing

Contributions welcome! Please:
1. Run tests: `uv run pytest`
2. Lint code: `uv run ruff check llmsbrieftxt/ tests/`
3. Format code: `uv run ruff format llmsbrieftxt/ tests/`
4. Check types: `uv run mypy llmsbrieftxt/`
5. Submit a PR

## Links

- **Homepage**: https://github.com/stevennevins/llmsbrief
- **Issues**: https://github.com/stevennevins/llmsbrief/issues
- **llms.txt Spec**: https://llmstxt.org/
@@ -0,0 +1,16 @@
llmsbrieftxt/__init__.py,sha256=baAcEjLSYFIeNZF51tOMmA_zAMhN8HvKael-UU-Ruec,22
llmsbrieftxt/cli.py,sha256=TSSSKtDydMpa6rApZ6sJQwCgGkMXf2cSeDe_lp80F1g,8440
llmsbrieftxt/constants.py,sha256=cjV_W5MqfVINM78__6eKnFPOGPHAI4ZYz8GqbIEEKz8,2565
llmsbrieftxt/crawler.py,sha256=ryt6pZ8Ed5vzEa78qeu93eSDlSyuFBqePlYZZMUFvGM,12553
llmsbrieftxt/doc_loader.py,sha256=dGeHnEVCqtTQgdowMCFxrhrmh3QV5n8l3TIOgDYaU9g,5167
llmsbrieftxt/extractor.py,sha256=28jckOcYf7u5zmZrhOZ-PmcWvPwTLZhMHxISSkFdeXk,1955
llmsbrieftxt/main.py,sha256=5R6cAKFou9_FCluHQaktHKQU_nn_n3asnveB_g7o3yA,14346
llmsbrieftxt/schema.py,sha256=ix9666XBpSbHUuYF1-jIK88sijK5Cvaer6gwbdLlWfs,2186
llmsbrieftxt/summarizer.py,sha256=bv5CLc_0yxFefoXXBt8R_ztqsk4i4yAEiFv8LX93B04,11015
llmsbrieftxt/url_filters.py,sha256=1KWO9yfPEqOIFXVts5xraErVQKPDAw4Nls3yuXzbRE8,2182
llmsbrieftxt/url_utils.py,sha256=vFc_MNyLZ6QflhDF0oyiZJPYuF2_GyQmtKK7etwCmcs,2212
llmsbrieftxt-1.6.0.dist-info/METADATA,sha256=S91kMFwJNIb4b8PRsOEdlHNLT3Ay4F8ZZkA_QQnAcqo,12140
llmsbrieftxt-1.6.0.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
llmsbrieftxt-1.6.0.dist-info/entry_points.txt,sha256=lY7gjN9DS7cv3Kd3LjezvgFBum7BhpMHSPGvdCzBtFU,49
llmsbrieftxt-1.6.0.dist-info/licenses/LICENSE,sha256=Bf6uF7ggkMcXEXAdu2lGR7u-voH5CJIWOzU5vnKQVJI,1082
llmsbrieftxt-1.6.0.dist-info/RECORD,,
@@ -0,0 +1,4 @@
Wheel-Version: 1.0
Generator: hatchling 1.27.0
Root-Is-Purelib: true
Tag: py3-none-any
@@ -0,0 +1,2 @@
[console_scripts]
llmtxt = llmsbrieftxt.cli:main
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 llmsbrieftxt contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.