any2summary 0.0.1__py3-none-any.whl

@@ -0,0 +1,202 @@
1
+ Metadata-Version: 2.4
2
+ Name: any2summary
3
+ Version: 0.0.1
4
+ Summary: any2summary is a command-line toolkit that handles the entire pipeline for podcasts, videos, and long-form articles—download, transcription, optional Azure speaker diarization, and Markdown summarization—directly on your local machine.
5
+ Author: Lee
6
+ License: Proprietary
7
+ Requires-Python: >=3.10
8
+ Description-Content-Type: text/markdown
9
+ Requires-Dist: openai>=1.30.0
10
+ Requires-Dist: youtube-transcript-api>=0.6.2
11
+ Requires-Dist: yt-dlp>=2024.3.10
12
+ Requires-Dist: httpx[socks]>=0.27.0
13
+ Provides-Extra: dev
14
+ Requires-Dist: pytest>=8.0.0; extra == "dev"
15
+
16
+ # any2summary
17
+
18
+ `any2summary` is a command-line toolkit that handles the entire pipeline for podcasts, videos, and long-form articles—download, transcription, optional Azure speaker diarization, and Markdown summarization—directly on your local machine. The CLI emits structured JSON by default, and when Azure summarization is enabled it also writes Markdown with a cover, table of contents, and timeline table so long-form content can drop into your note-taking system with minimal effort.
19
+
20
+ > 📘 Looking for the Simplified Chinese version? See `README.zh.md` in the project root. Both documents share the same structure and should stay in sync.
21
+
22
+ ## Use Cases
23
+ - **YouTube / Bilibili / Spotify / Apple Podcasts**: fetch captions when available, or download audio plus run Azure OpenAI `gpt-4o-transcribe-diarize` for transcripts and speaker labels.
24
+ - **Web articles / documentation**: fall back to article mode when audio cannot be downloaded, capturing page text and metadata before summarization.
25
+ - **Batch processing**: pass a comma-separated list to `--url`; the CLI processes links concurrently and prints results in the original order.
26
+
27
+ ## Feature Highlights
28
+ - `youtube-transcript-api` + `yt_dlp` + `ffmpeg` handle caption/audio retrieval with automatic Referer, User-Agent, and Android fallback tuning to avoid 403 errors.
29
+ - Audio longer than Azure’s 1,500-second limit is split into ≤1,400-second WAV chunks and uploaded sequentially; streaming mode refreshes progress in real time.
30
+ - Azure diarization results align with existing captions; when Azure returns empty segments the CLI falls back to the downloaded subtitles to keep the pipeline moving.
31
+ - Audio-only links or captionless videos automatically trigger the Azure transcription flow; add `--force-azure-diarization` to invoke Azure even when captions exist.
32
+ - `--azure-summary` calls Azure GPT-5 (Responses API or Chat Completions) to generate Markdown summaries and copies them into `ANY2SUMMARY_OUTBOX_DIR` (defaults to an Obsidian outbox folder).
33
+ - Article mode (`fetch_article_assets`) caches `article_raw.html`, `article_content.txt`, and `article_metadata.json`, then applies `ARTICLE_SUMMARY_PROMPT`; `--article-summary-prompt-file` overrides the default.
34
+ - `--clean-cache` clears cached artifacts for the current URL; `ANY2SUMMARY_DOTENV` points to a `.env` file that is loaded automatically, and legacy `PODCAST_TRANSFORMER_*` variables remain supported.
35
+ - CLI output is always indented JSON; in batch mode each job prints a separate JSON document, making it easy to stream-parse.
36
+
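The chunking rule above (audio over 1,500 seconds is split into chunks of at most 1,400 seconds) can be sketched as follows. This is a minimal illustration of the arithmetic only, not the package's actual code: `plan_chunks` and both constants are hypothetical names, and the real CLI may place chunk boundaries differently.

```python
from typing import List, Tuple

AZURE_LIMIT_SECONDS = 1500.0  # per-request limit quoted in this README
CHUNK_SECONDS = 1400.0        # chunk ceiling quoted in this README


def plan_chunks(duration: float, chunk: float = CHUNK_SECONDS) -> List[Tuple[float, float]]:
    """Return (start, end) offsets in seconds that cover the whole recording."""
    if duration <= 0:
        return []
    if duration <= AZURE_LIMIT_SECONDS:
        # Short enough to upload in a single request.
        return [(0.0, duration)]
    spans: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration:
        end = min(start + chunk, duration)
        spans.append((start, end))
        start = end
    return spans


if __name__ == "__main__":
    # A one-hour recording needs three sequential uploads.
    for start, end in plan_chunks(3600.0):
        print(f"{start:7.1f}s -> {end:7.1f}s")
```

Keeping the chunk ceiling below the service limit leaves headroom for rounding when `ffmpeg` cuts the WAV files.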
37
+ ## Quick Start
38
+
39
+ ### Prerequisites
40
+ - Python 3.10+
41
+ - `ffmpeg` (install via `brew install ffmpeg` on macOS or follow the official docs for other platforms)
42
+ - Network access to YouTube/your target site plus Azure OpenAI (adjust the proxy variables in `setup_and_run.sh` if needed)
43
+ - Azure OpenAI resource and deployments for transcription/summary features
44
+
45
+ ### Installation Options
46
+ 1. **PyPI (recommended):** `pip install any2summary`
47
+ 2. **From source:** `cd any2summary && pip install .`
48
+ 3. **Manual dependencies:** `pip install youtube-transcript-api yt-dlp openai "httpx[socks]"`
49
+ 4. **Bootstrap script:** `cd any2summary && ./setup_and_run.sh --help` (creates `.venv`, installs dependencies, and exports the proxy variables defined near the top of the script)
50
+
51
+ ### Minimal Example
52
+ ```bash
53
+ python -m any2summary.cli \
54
+ --url "https://www.youtube.com/watch?v=<video-id>" \
55
+ --language en
56
+ ```
57
+ - Captions are returned as JSON by default. When the target lacks captions, Azure transcription triggers automatically. Add `--force-azure-diarization` to invoke Azure even if captions already exist.
58
+ - Supply multiple comma-separated links in `--url` to process them concurrently while preserving order.
59
+
60
+ ### Sample Script
61
+ ```bash
62
+ ./run_example.sh "https://www.youtube.com/watch?v=<video-id>"
63
+ ```
64
+ The script loads `.env` located in the same directory and calls `setup_and_run.sh`, making it convenient to verify Azure credentials.
65
+
66
+ ## CLI Reference
67
+
68
+ | Argument | Type / Default | Required | Description | Typical Usage |
69
+ | --- | --- | --- | --- | --- |
70
+ | `--url` | String, comma-separated | ✔ | Video/audio/article URLs; processed concurrently, with results printed in the given order | Batch caption/summary export |
71
+ | `--language` | String, default `en` | | Preferred language for captions/transcripts | Control transcript language |
72
+ | `--fallback-language` | Repeatable | | Extra language codes to try when the primary one is missing | Cross-language resilience |
73
+ | `-V/--version` | Flag | | Display version and exit | Verify installed version |
74
+ | `--azure-streaming` / `--no-azure-streaming` | Boolean, default on | | Whether Azure transcription streams chunk updates | Minimize CLI noise or keep progress bars |
75
+ | `--force-azure-diarization` | Flag | | Force Azure diarization even when captions are available (ignored for article links; automatically on for Apple Podcasts & similar audio URLs) | Ensure Azure results every time |
76
+ | `--azure-summary` | Flag | | Use Azure GPT-5 to produce Markdown summaries/timelines saved to `summary.md` in cache | Generate polished summaries |
77
+ | `--summary-prompt-file` | Path | | Custom prompt for audio/video summaries (defaults to `./prompts/summary_prompt.txt`) | Tailor summary tone |
78
+ | `--article-summary-prompt-file` | Path | | Custom prompt for article mode when `--azure-summary` is enabled (defaults to `./prompts/article_prompt.txt`) | Tune article summarization |
79
+ | `--max-speakers` | Integer | | Upper bound for Azure diarization speaker count | Interview/meeting constraints |
80
+ | `--known-speaker` | `name=path.wav`, repeatable | | Provide reference audio clips to improve speaker labeling | Identify recurring hosts |
81
+ | `--known-speaker-name` | String, repeatable | | Supply speaker names without audio samples | Give Azure semantic hints |
82
+ | `--clean-cache` | Flag | | Remove cached artifacts for the current URL before processing | Force re-download/re-transcribe |
83
+
84
+ > **Notes:** Article mode ignores `--summary-prompt-file` and `--force-azure-diarization` to ensure web pages always use the article-specific prompt. Conversely, Apple Podcasts and similar audio sources automatically fall back to the Azure pipeline even without `--force-azure-diarization`.
85
+
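As an illustration of the repeatable `name=path.wav` format accepted by `--known-speaker`, the sketch below parses such values with `argparse`. It is a standalone demo, not the parser in `any2summary.cli`; `speaker_spec` is a hypothetical helper and the validation rules are assumptions.

```python
import argparse
from pathlib import Path
from typing import Tuple


def speaker_spec(value: str) -> Tuple[str, Path]:
    """Split 'name=path.wav' into a (name, path) pair, rejecting malformed input."""
    name, sep, path = value.partition("=")
    if not sep or not name.strip() or not path.strip():
        raise argparse.ArgumentTypeError(f"expected name=path.wav, got {value!r}")
    return name.strip(), Path(path.strip())


# Demo parser only; the real CLI defines many more options.
parser = argparse.ArgumentParser(prog="known-speaker-demo")
parser.add_argument("--known-speaker", action="append", type=speaker_spec, default=[])

args = parser.parse_args(
    ["--known-speaker", "Host=./samples/host.wav",
     "--known-speaker", "Guest=./samples/guest.wav"]
)
for name, path in args.known_speaker:
    print(name, path)
```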
86
+ ## Environment Variables & Config
87
+
88
+ | Variable | Default / Source | Purpose |
89
+ | --- | --- | --- |
90
+ | `ANY2SUMMARY_DOTENV` | `.env` in working dir | Path to the `.env` file that is loaded automatically; the legacy `PODCAST_TRANSFORMER_DOTENV` is also honored |
91
+ | `ANY2SUMMARY_CACHE_DIR` | `~/.cache/any2summary` | Override cache location (subdirectories keyed by host/video ID) |
92
+ | `ANY2SUMMARY_OUTBOX_DIR` | `~/Library/.../Obsidian Vault/010 outbox` | Destination for Markdown copies; set to disable or redirect |
93
+ | `ANY2SUMMARY_YTDLP_UA` | Desktop Chrome UA | Custom UA for `yt_dlp`; Android fallback overrides when needed |
94
+ | `ANY2SUMMARY_YTDLP_COOKIES` | Empty | Path to `cookies.txt` for login-only content |
95
+ | `ANY2SUMMARY_DEBUG_PAYLOAD` | Empty | If set, save `debug_payload_*.json` in cache directories |
96
+ | `AZURE_OPENAI_API_KEY` / `AZURE_OPENAI_ENDPOINT` | None | Required for all Azure features |
97
+ | `AZURE_OPENAI_API_VERSION` | `2025-03-01-preview` | Azure diarization API version |
98
+ | `AZURE_OPENAI_TRANSCRIBE_DEPLOYMENT` | `gpt-4o-transcribe-diarize` | Transcription/diarization deployment name |
99
+ | `AZURE_OPENAI_SUMMARY_DEPLOYMENT` | `llab-gpt-5-pro` | Summary model deployment |
100
+ | `AZURE_OPENAI_DOMAIN_DEPLOYMENT` | Uses summary deployment | Infers domain tags from summaries |
101
+ | `AZURE_OPENAI_SUMMARY_API_VERSION` | `2025-01-01-preview` | API version for Chat Completions mode |
102
+ | `AZURE_OPENAI_USE_RESPONSES` | Based on deployment suffix | Opt into Responses API (`1/true/yes` or `*-pro`) |
103
+ | `AZURE_OPENAI_RESPONSES_BASE_URL` | Derived from endpoint | Override Responses API base URL |
104
+ | `AZURE_OPENAI_CHUNKING_STRATEGY` | `auto` | Strategy string/JSON sent to Azure transcription |
105
+ | Proxy vars | Exported in `setup_and_run.sh` | Defaults to localhost:7890 for http/https/all_proxy |
106
+
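Two rules from the table can be sketched in code: the legacy `PODCAST_TRANSFORMER_*` fallback and the `AZURE_OPENAI_USE_RESPONSES` default. The precedence shown here (new prefix wins over legacy, an explicit flag wins over the `-pro` suffix) is an assumption, and `read_setting` and `use_responses_api` are hypothetical helpers rather than functions from the package.

```python
import os
from typing import Mapping, Optional


def read_setting(suffix: str, env: Mapping[str, str] = os.environ,
                 default: Optional[str] = None) -> Optional[str]:
    """Look up ANY2SUMMARY_<suffix>, then the legacy PODCAST_TRANSFORMER_<suffix>."""
    for prefix in ("ANY2SUMMARY_", "PODCAST_TRANSFORMER_"):
        value = env.get(prefix + suffix)
        if value:
            return value
    return default


def use_responses_api(env: Mapping[str, str] = os.environ) -> bool:
    """Explicit flag first (assumed); otherwise infer from a '-pro' deployment suffix."""
    flag = env.get("AZURE_OPENAI_USE_RESPONSES", "").strip().lower()
    if flag:
        return flag in {"1", "true", "yes"}
    deployment = env.get("AZURE_OPENAI_SUMMARY_DEPLOYMENT", "llab-gpt-5-pro")
    return deployment.endswith("-pro")


if __name__ == "__main__":
    print(read_setting("CACHE_DIR", default="~/.cache/any2summary"))
    print(use_responses_api())
```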
107
+ ## Typical Workflows
108
+
109
+ ### 1. Captions + timeline only (no Azure)
110
+ ```bash
111
+ python -m any2summary.cli --url "https://youtu.be/<id>" --language zh
112
+ ```
113
+ Emits `segments` with timestamps and text—ideal for additional scripting or downstream tooling.
114
+
115
+ ### 2. Speaker diarization + summary
116
+ ```bash
117
+ ANY2SUMMARY_DOTENV=./.env \
118
+ python -m any2summary.cli \
119
+ --url "https://www.youtube.com/watch?v=<video-id>" \
120
+ --language en \
121
+ --force-azure-diarization \
122
+ --azure-summary \
123
+ --summary-prompt-file ./prompts/summary_prompt.txt \
124
+ --known-speaker "Host=./samples/host.wav"
125
+ ```
126
+ - Audio is cached under `~/.cache/any2summary/youtube/<video-id>/` and split when needed.
127
+ - JSON output includes inline `summary`/`timeline` plus `summary_path` pointing to Markdown files; a copy is placed under `ANY2SUMMARY_OUTBOX_DIR`.
128
+
129
+ ### 3. Article mode
130
+ ```bash
131
+ python -m any2summary.cli \
132
+ --url "https://example.com/blog/post" \
133
+ --language zh \
134
+ --azure-summary \
135
+ --article-summary-prompt-file ./prompts/article_prompt.txt
136
+ ```
137
+ - `fetch_article_assets` stores `article_raw.html`, `article_content.txt`, and `article_metadata.json`.
138
+ - The workflow always applies the article-specific prompt and ignores `--summary-prompt-file` / `--force-azure-diarization`.
139
+
140
+ ### 4. Multiple URLs in parallel
141
+ ```bash
142
+ python -m any2summary.cli \
143
+ --url "https://youtu.be/A1,https://podcasts.apple.com/episode/B2" \
144
+ --azure-summary
145
+ ```
146
+ - Each job prints a JSON block in the original order; failures are reported to stderr as `[URL] error message` without stopping remaining tasks.
147
+
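Because batch mode prints one indented JSON document per job, stdout is a sequence of concatenated documents rather than a single JSON value. One way to consume it, assuming the documents are separated only by whitespace, is `json.JSONDecoder.raw_decode`. The `url` key in the sample is a placeholder, not a documented field of the real output.

```python
import json
from typing import Any, Iterator


def iter_json_documents(text: str) -> Iterator[Any]:
    """Yield each JSON document from a string of concatenated documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Stand-in for captured CLI stdout; the key name is hypothetical.
sample = '{\n  "url": "first"\n}\n{\n  "url": "second"\n}\n'
documents = list(iter_json_documents(sample))
print(len(documents))
```

In practice the input string would come from `sys.stdin.read()` or from a captured subprocess stdout.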
148
+ ## Cache Layout
149
+ - Default cache root: `~/.cache/any2summary/<host_or_id>/`, containing:
150
+ - `audio.*`: downloaded audio (split files named `audio_partXXX.wav`)
151
+ - `captions.json`: caption segments
152
+ - `segments.json`: merged Azure transcripts
153
+ - `summary.md`, `timeline.md`: Markdown exports
154
+ - `article_raw.html` / `article_content.txt` / `article_metadata.json`: article mode artifacts
155
+ - `--clean-cache` wipes the directory before processing.
156
+ - Set `ANY2SUMMARY_CACHE_DIR` to relocate caches to another drive or shared path.
157
+
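A small sketch for inspecting a cache directory against the layout above. It assumes each job directory holds the listed files directly; `cache_root` and `present_artifacts` are hypothetical helpers, and the real CLI may nest directories differently (for example by host and video ID).

```python
import os
from pathlib import Path
from typing import Dict, Mapping

# File names taken from the cache layout list in this README.
ARTIFACTS = (
    "captions.json", "segments.json", "summary.md", "timeline.md",
    "article_raw.html", "article_content.txt", "article_metadata.json",
)


def cache_root(env: Mapping[str, str] = os.environ) -> Path:
    """Honor ANY2SUMMARY_CACHE_DIR, else fall back to ~/.cache/any2summary."""
    override = env.get("ANY2SUMMARY_CACHE_DIR")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "any2summary"


def present_artifacts(job_dir: Path) -> Dict[str, bool]:
    """Report which of the documented artifacts exist in one job directory."""
    return {name: (job_dir / name).is_file() for name in ARTIFACTS}


if __name__ == "__main__":
    print(cache_root())
```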
158
+ ## Advanced Customization & Debugging
159
+ - **Prompt overrides:** keep dedicated prompt files per source type and pass them via `--summary-prompt-file` / `--article-summary-prompt-file`.
160
+ - **Default prompt management:** editing `prompts/summary_prompt.txt` or `prompts/article_prompt.txt` immediately updates the CLI’s built-in behavior.
161
+ - **Speaker accuracy:** use `--known-speaker name=sample.wav` or `--known-speaker-name` hints to improve Azure labels.
162
+ - **Azure streaming:** enabled by default; disable with `--no-azure-streaming` in CI or log-sensitive environments.
163
+ - **Android fallback:** `yt_dlp` automatically retries with Android settings on YouTube 403 errors; provide cookies through `ANY2SUMMARY_YTDLP_COOKIES` for gated content.
164
+ - **Payload debugging:** set `ANY2SUMMARY_DEBUG_PAYLOAD=1` to dump raw Azure responses as JSON in the cache folder.
165
+ - **Batch throughput:** a `ThreadPoolExecutor` caps concurrency at CPU count; split large batches manually if you need throttling.
166
+
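The batch behavior described above (concurrency capped at CPU count, results in input order, failures reported to stderr without stopping other jobs) can be approximated as below. This is a sketch of the general pattern, not the code in `any2summary.cli`; `run_batch` and `fake_job` are hypothetical.

```python
import os
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def run_batch(urls: List[str], worker: Callable[[str], str]) -> List[Optional[str]]:
    """Run worker on every URL concurrently and return results in input order."""
    max_workers = max(1, min(len(urls), os.cpu_count() or 1))
    results: List[Optional[str]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(worker, url) for url in urls]
        # Iterating the futures in submission order preserves the original order.
        for url, future in zip(urls, futures):
            try:
                results.append(future.result())
            except Exception as exc:
                print(f"[{url}] {exc}", file=sys.stderr)
                results.append(None)  # keep the slot so later jobs stay aligned
    return results


def fake_job(url: str) -> str:
    """Stand-in for a real download/transcribe job."""
    if "bad" in url:
        raise ValueError("download failed")
    return url.upper()


if __name__ == "__main__":
    print(run_batch("a,bad,c".split(","), fake_job))
```

To throttle further than the CPU-count cap, the URL list can be sliced into smaller batches before each call.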
167
+ ## Scripts & Docker
168
+
169
+ ### setup_and_run.sh
170
+ - Creates `.venv`, installs dependencies, and exports proxy variables (`http_proxy/https_proxy/all_proxy` to `127.0.0.1:7890` by default). Edit the script to match your proxy port.
171
+ - Accepts the full CLI argument list (e.g., `./setup_and_run.sh --url <...> --azure-summary`) and is suitable for teammates who prefer shell scripts over Python invocations.
172
+
173
+ ### Docker
174
+ ```bash
175
+ docker build -t any2summary ./any2summary
176
+ docker run --rm \
177
+ --env-file ./any2summary/.env \
178
+ -v "$HOME/.cache/any2summary:/app/.cache/any2summary" \
179
+ any2summary \
180
+ --url "https://www.youtube.com/watch?v=<video-id>" \
181
+ --language en
182
+ ```
183
+ - Pass Azure credentials via `--env-file` and mount the cache directory to avoid repeated downloads/transcriptions.
184
+
185
+ ## Testing
186
+ ```bash
187
+ cd any2summary
188
+ PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest test/test_cli.py test/test_cli_article.py
189
+
190
+ # From the repo root:
191
+ PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest any2summary/test/
192
+ pytest test/ -q # regression + integration suites
193
+ ```
194
+
195
+ ## FAQ
196
+ - **403 Forbidden / audio download fails**: verify the URL is publicly accessible; for login-required content, provide cookies via `ANY2SUMMARY_YTDLP_COOKIES` or rely on the default proxy in `setup_and_run.sh`.
197
+ - **Azure credential errors**: ensure `.env` or environment vars define `AZURE_OPENAI_API_KEY` and `AZURE_OPENAI_ENDPOINT`, and set deployment names when summaries are required.
198
+ - **Audio too long**: the CLI auto-splits WAV files and retries; if stale oversized files linger, run with `--clean-cache` first.
199
+ - **Empty article summaries**: confirm `--azure-summary` is enabled and the article is reachable; provide a custom `--article-summary-prompt-file` if necessary.
200
+ - **Disk usage**: periodically clean `ANY2SUMMARY_CACHE_DIR`, or rerun old tasks with `--clean-cache` to drop their stale artifacts.
201
+
202
+ Before publishing, verify that README updates, sample commands, and prompt descriptions align with the current CLI behavior to avoid mismatches for new users.
@@ -0,0 +1,7 @@
1
+ any2summary/__init__.py,sha256=2KK4sZffALShfIYo7tep__xTdtNPLTbxOIwVKpAelzg,103
2
+ any2summary/cli.py,sha256=vfAowYa_1awBbdqq9gV3NkmQGGzDNdImX_Cke9DrLwk,115582
3
+ any2summary-0.0.1.dist-info/METADATA,sha256=vVOCJDQfsCWmqjfTiroixI777JipegHKh6iXIJsB0Lc,13642
4
+ any2summary-0.0.1.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
5
+ any2summary-0.0.1.dist-info/entry_points.txt,sha256=D4wiGLnG6yUA3Z6DplrC83REhWGI8R1wOxDiG_QHvzw,53
6
+ any2summary-0.0.1.dist-info/top_level.txt,sha256=yAEyvK40syQ4qaCXvTK5Ld4W-teheppS2WaRxgVCeJU,12
7
+ any2summary-0.0.1.dist-info/RECORD,,
@@ -0,0 +1,5 @@
1
+ Wheel-Version: 1.0
2
+ Generator: setuptools (80.9.0)
3
+ Root-Is-Purelib: true
4
+ Tag: py3-none-any
5
+
@@ -0,0 +1,2 @@
1
+ [console_scripts]
2
+ any2summary = any2summary.cli:main
@@ -0,0 +1 @@
1
+ any2summary