any2summary 0.0.1__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,202 @@
Metadata-Version: 2.4
Name: any2summary
Version: 0.0.1
Summary: any2summary is a command-line toolkit that handles the entire pipeline for podcasts, videos, and long-form articles—download, transcription, optional Azure speaker diarization, and Markdown summarization—directly on your local machine.
Author: Lee
License: Proprietary
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: openai>=1.30.0
Requires-Dist: youtube-transcript-api>=0.6.2
Requires-Dist: yt-dlp>=2024.3.10
Requires-Dist: httpx[socks]>=0.27.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"

# any2summary

`any2summary` is a command-line toolkit that handles the entire pipeline for podcasts, videos, and long-form articles (download, transcription, optional Azure speaker diarization, and Markdown summarization) directly on your local machine. The CLI emits structured JSON by default, and when Azure summarization is enabled it also writes Markdown with a cover, table of contents, and timeline table so long-form content can drop into your note-taking system with minimal effort.

> 📘 Looking for the Simplified Chinese version? See `README.zh.md` in the project root. Both documents share the same structure and should stay in sync.

## Use Cases
- **YouTube / Bilibili / Spotify / Apple Podcasts**: fetch captions when available, or download audio plus run Azure OpenAI `gpt-4o-transcribe-diarize` for transcripts and speaker labels.
- **Web articles / documentation**: fall back to article mode when audio cannot be downloaded, capturing page text and metadata before summarization.
- **Batch processing**: pass a comma-separated list to `--url`; the CLI processes links concurrently and prints results in the original order.

## Feature Highlights
- `youtube-transcript-api` + `yt_dlp` + `ffmpeg` handle caption/audio retrieval with automatic Referer, User-Agent, and Android fallback tuning to avoid 403 errors.
- Audio longer than Azure’s 1,500-second limit is split into ≤1,400-second WAV chunks and uploaded sequentially; streaming mode refreshes progress in real time.
- Azure diarization results align with existing captions; when Azure returns empty segments the CLI falls back to the downloaded subtitles to keep the pipeline moving.
- Audio-only links or captionless videos automatically trigger the Azure transcription flow; add `--force-azure-diarization` to invoke Azure even when captions exist.
- `--azure-summary` calls Azure GPT-5 (Responses API or Chat Completions) to generate Markdown summaries and copies them into `ANY2SUMMARY_OUTBOX_DIR` (defaults to an Obsidian outbox folder).
- Article mode (`fetch_article_assets`) caches `article_raw.html`, `article_content.txt`, and `article_metadata.json`, then applies `ARTICLE_SUMMARY_PROMPT`; `--article-summary-prompt-file` overrides the default.
- `--clean-cache` clears cached artifacts for the current URL; `ANY2SUMMARY_DOTENV` automatically loads a `.env` file and remains compatible with legacy `PODCAST_TRANSFORMER_*` variables.
- CLI output is always indented JSON; in batch mode each job prints a separate JSON document, making it easy to stream-parse.
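
The chunking rule in the list above (audio over 1,500 seconds is cut into pieces of at most 1,400 seconds) can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's implementation: `plan_chunks` and both constant names are hypothetical, and the real CLI may choose different boundaries.

```python
import math

# Both figures are the ones quoted in the README; the names are hypothetical.
AZURE_LIMIT_SECONDS = 1500
MAX_CHUNK_SECONDS = 1400


def plan_chunks(duration: float) -> list[tuple[float, float]]:
    """Return (start, end) offsets in seconds for each chunk to upload."""
    if duration <= AZURE_LIMIT_SECONDS:
        # Short enough to send in one request.
        return [(0.0, float(duration))]
    count = math.ceil(duration / MAX_CHUNK_SECONDS)
    bounds = []
    for index in range(count):
        start = index * MAX_CHUNK_SECONDS
        end = min(start + MAX_CHUNK_SECONDS, duration)
        bounds.append((float(start), float(end)))
    return bounds
```

Each `(start, end)` pair could then be handed to `ffmpeg` to cut one WAV chunk; the exact `ffmpeg` invocation the CLI uses is not shown in this README.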

## Quick Start

### Prerequisites
- Python 3.10+
- `ffmpeg` (install via `brew install ffmpeg` on macOS or follow the official docs for other platforms)
- Network access to YouTube/your target site plus Azure OpenAI (adjust the proxy variables in `setup_and_run.sh` if needed)
- Azure OpenAI resource and deployments for transcription/summary features

### Installation Options
1. **PyPI (recommended):** `pip install any2summary`
2. **From source:** `cd any2summary && pip install .`
3. **Manual dependencies:** `pip install youtube-transcript-api yt-dlp openai "httpx[socks]"`
4. **Bootstrap script:** `cd any2summary && ./setup_and_run.sh --help` (creates `.venv`, installs deps, and exports proxy variables near the top)

### Minimal Example
```bash
python -m any2summary.cli \
  --url "https://www.youtube.com/watch?v=<video-id>" \
  --language en
```
- Captions are returned as JSON by default. When the target lacks captions, Azure transcription triggers automatically. Add `--force-azure-diarization` to invoke Azure even if captions already exist.
- Supply multiple comma-separated links in `--url` to process them concurrently while preserving order.

### Sample Script
```bash
./run_example.sh "https://www.youtube.com/watch?v=<video-id>"
```
The script loads `.env` located in the same directory and calls `setup_and_run.sh`, making it convenient to verify Azure credentials.

## CLI Reference

| Argument | Type / Default | Required | Description | Typical Usage |
| --- | --- | --- | --- | --- |
| `--url` | String, comma-separated | ✔ | Video/audio/article URLs; processed concurrently, with results printed in the given order | Batch caption/summary export |
| `--language` | String, default `en` | | Preferred language for captions/transcripts | Control transcript language |
| `--fallback-language` | Repeatable | | Extra language codes to try when the primary one is missing | Cross-language resilience |
| `-V/--version` | Flag | | Display version and exit | Verify installed version |
| `--azure-streaming` / `--no-azure-streaming` | Boolean, default on | | Whether Azure transcription streams chunk updates | Minimize CLI noise or keep progress bars |
| `--force-azure-diarization` | Flag | | Force Azure diarization even when captions are available (ignored for article links; automatically on for Apple Podcasts & similar audio URLs) | Ensure Azure results every time |
| `--azure-summary` | Flag | | Use Azure GPT-5 to produce Markdown summaries/timelines saved to `summary.md` in cache | Generate polished summaries |
| `--summary-prompt-file` | Path | | Custom prompt for audio/video summaries (defaults to `./prompts/summary_prompt.txt`) | Tailor summary tone |
| `--article-summary-prompt-file` | Path | | Custom prompt for article mode when `--azure-summary` is enabled (defaults to `./prompts/article_prompt.txt`) | Tune article summarization |
| `--max-speakers` | Integer | | Upper bound for Azure diarization speaker count | Interview/meeting constraints |
| `--known-speaker` | `name=path.wav`, repeatable | | Provide reference audio clips to improve speaker labeling | Identify recurring hosts |
| `--known-speaker-name` | String, repeatable | | Supply speaker names without audio samples | Give Azure semantic hints |
| `--clean-cache` | Flag | | Remove cached artifacts for the current URL before processing | Force re-download/re-transcribe |

> **Notes:** Article mode ignores `--summary-prompt-file` and `--force-azure-diarization` to ensure web pages always use the article-specific prompt. Conversely, Apple Podcasts and similar audio sources automatically fall back to the Azure pipeline even without `--force-azure-diarization`.
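
To make the argument shapes in the table concrete, here is a minimal `argparse` sketch covering a few of them. It is not the parser shipped in `any2summary/cli.py`: `split_urls`, `parse_known_speaker`, and `build_parser` are hypothetical names, and only the flag names and defaults are taken from the table.

```python
import argparse


def split_urls(value: str) -> list[str]:
    """Split a comma-separated --url value, dropping empty entries."""
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_known_speaker(value: str) -> tuple[str, str]:
    """Parse one `name=path.wav` pair into (name, path)."""
    name, separator, path = value.partition("=")
    if not separator or not name.strip() or not path.strip():
        raise argparse.ArgumentTypeError(f"expected name=path.wav, got {value!r}")
    return name.strip(), path.strip()


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="any2summary-sketch")
    parser.add_argument("--url", required=True, type=split_urls)
    parser.add_argument("--language", default="en")
    # Repeatable flags collect their values into lists.
    parser.add_argument("--fallback-language", action="append", default=[])
    parser.add_argument("--known-speaker", action="append", default=[],
                        type=parse_known_speaker)
    # BooleanOptionalAction also generates the --no-azure-streaming form.
    parser.add_argument("--azure-streaming",
                        action=argparse.BooleanOptionalAction, default=True)
    return parser
```

Splitting on commas means a URL that itself contains a comma would be cut in two; whether the real CLI handles that case is not stated here.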

## Environment Variables & Config

| Variable | Default / Source | Purpose |
| --- | --- | --- |
| `ANY2SUMMARY_DOTENV` | `.env` in working dir | Auto-loaded `.env`; also honors `PODCAST_TRANSFORMER_DOTENV` |
| `ANY2SUMMARY_CACHE_DIR` | `~/.cache/any2summary` | Override cache location (subdirectories keyed by host/video ID) |
| `ANY2SUMMARY_OUTBOX_DIR` | `~/Library/.../Obsidian Vault/010 outbox` | Destination for Markdown copies; set to disable or redirect |
| `ANY2SUMMARY_YTDLP_UA` | Desktop Chrome UA | Custom UA for `yt_dlp`; Android fallback overrides when needed |
| `ANY2SUMMARY_YTDLP_COOKIES` | Empty | Path to `cookies.txt` for login-only content |
| `ANY2SUMMARY_DEBUG_PAYLOAD` | Empty | If set, save `debug_payload_*.json` in cache directories |
| `AZURE_OPENAI_API_KEY` / `AZURE_OPENAI_ENDPOINT` | None | Required for all Azure features |
| `AZURE_OPENAI_API_VERSION` | `2025-03-01-preview` | Azure diarization API version |
| `AZURE_OPENAI_TRANSCRIBE_DEPLOYMENT` | `gpt-4o-transcribe-diarize` | Transcription/diarization deployment name |
| `AZURE_OPENAI_SUMMARY_DEPLOYMENT` | `llab-gpt-5-pro` | Summary model deployment |
| `AZURE_OPENAI_DOMAIN_DEPLOYMENT` | Uses summary deployment | Infers domain tags from summaries |
| `AZURE_OPENAI_SUMMARY_API_VERSION` | `2025-01-01-preview` | API version for Chat Completions mode |
| `AZURE_OPENAI_USE_RESPONSES` | Based on deployment suffix | Opt into the Responses API (`1`/`true`/`yes`, or a `*-pro` deployment name) |
| `AZURE_OPENAI_RESPONSES_BASE_URL` | Derived from endpoint | Override Responses API base URL |
| `AZURE_OPENAI_CHUNKING_STRATEGY` | `auto` | Strategy string/JSON sent to Azure transcription |
| Proxy vars | Exported in `setup_and_run.sh` | Defaults to localhost:7890 for http/https/all_proxy |
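
Two of the rules in this table, the legacy `PODCAST_TRANSFORMER_*` fallback and the Responses API opt-in, can be sketched as below. The helper names `read_setting` and `use_responses_api` are hypothetical, and the precedence shown (an explicit `AZURE_OPENAI_USE_RESPONSES` value wins over the deployment suffix, and any value other than `1`/`true`/`yes` turns it off) is an assumption rather than documented behavior.

```python
import os
from typing import Mapping, Optional

CURRENT_PREFIX = "ANY2SUMMARY_"
LEGACY_PREFIX = "PODCAST_TRANSFORMER_"
TRUTHY = {"1", "true", "yes"}


def read_setting(suffix: str,
                 env: Optional[Mapping[str, str]] = None,
                 default: Optional[str] = None) -> Optional[str]:
    """Look up ANY2SUMMARY_<suffix>, then the legacy PODCAST_TRANSFORMER_<suffix>."""
    env = os.environ if env is None else env
    for prefix in (CURRENT_PREFIX, LEGACY_PREFIX):
        value = env.get(prefix + suffix)
        if value:
            return value
    return default


def use_responses_api(env: Optional[Mapping[str, str]] = None) -> bool:
    """Decide between the Responses API and Chat Completions (assumed precedence)."""
    env = os.environ if env is None else env
    flag = env.get("AZURE_OPENAI_USE_RESPONSES", "").strip().lower()
    if flag:
        # Assumption: an explicit value overrides the suffix check.
        return flag in TRUTHY
    deployment = env.get("AZURE_OPENAI_SUMMARY_DEPLOYMENT", "llab-gpt-5-pro")
    return deployment.endswith("-pro")
```

Passing a plain dictionary as `env` keeps the helpers easy to test without touching the real process environment.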

## Typical Workflows

### 1. Captions + timeline only (no Azure)
```bash
python -m any2summary.cli --url "https://youtu.be/<id>" --language zh
```
Emits `segments` with timestamps and text, which is ideal for additional scripting or downstream tooling.

### 2. Speaker diarization + summary
```bash
ANY2SUMMARY_DOTENV=./.env \
python -m any2summary.cli \
  --url "https://www.youtube.com/watch?v=<video-id>" \
  --language en \
  --force-azure-diarization \
  --azure-summary \
  --summary-prompt-file ./prompts/summary_prompt.txt \
  --known-speaker "Host=./samples/host.wav"
```
- Audio is cached under `~/.cache/any2summary/youtube/<video-id>/` and split when needed.
- JSON output includes inline `summary`/`timeline` plus `summary_path` pointing to Markdown files; a copy is placed under `ANY2SUMMARY_OUTBOX_DIR`.

### 3. Article mode
```bash
python -m any2summary.cli \
  --url "https://example.com/blog/post" \
  --language zh \
  --azure-summary \
  --article-summary-prompt-file ./prompts/article_prompt.txt
```
- `fetch_article_assets` stores `article_raw.html`, `article_content.txt`, and `article_metadata.json`.
- The workflow always applies the article-specific prompt and ignores `--summary-prompt-file` / `--force-azure-diarization`.

### 4. Multiple URLs in parallel
```bash
python -m any2summary.cli \
  --url "https://youtu.be/A1,https://podcasts.apple.com/episode/B2" \
  --azure-summary
```
- Each job prints a JSON block in the original order; failures are reported to stderr as `[URL] error message` without stopping remaining tasks.
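
Because batch mode prints one indented JSON document per job back to back, the combined stdout is not a single valid JSON value. One way to consume it, sketched here with the standard library's `json.JSONDecoder.raw_decode`, is to decode documents one after another. `iter_json_documents` is a hypothetical helper, not part of the package.

```python
import json
from typing import Iterator


def iter_json_documents(text: str) -> Iterator[object]:
    """Yield each JSON document found in a string of concatenated documents."""
    decoder = json.JSONDecoder()
    position = 0
    while position < len(text):
        # raw_decode does not accept leading whitespace, so skip it first.
        while position < len(text) and text[position].isspace():
            position += 1
        if position >= len(text):
            break
        document, position = decoder.raw_decode(text, position)
        yield document
```

A typical use would be piping the CLI output into a small script that calls `iter_json_documents(sys.stdin.read())` and then reads fields such as `segments` or `summary_path` from each result.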

## Cache Layout
- Default cache root: `~/.cache/any2summary/<host_or_id>/`, containing:
  - `audio.*`: downloaded audio (split files named `audio_partXXX.wav`)
  - `captions.json`: caption segments
  - `segments.json`: merged Azure transcripts
  - `summary.md`, `timeline.md`: Markdown exports
  - `article_raw.html` / `article_content.txt` / `article_metadata.json`: article mode artifacts
- `--clean-cache` wipes the directory before processing.
- Set `ANY2SUMMARY_CACHE_DIR` to relocate caches to another drive or shared path.
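
For scripts that inspect the cache, the layout above can be probed with a few lines of `pathlib`. This is a sketch only: `cache_root` and `list_artifacts` are hypothetical helpers, and how the CLI derives the per-URL subdirectory name is not reproduced here.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# File names listed in the README's cache layout.
ARTIFACT_NAMES = [
    "captions.json",
    "segments.json",
    "summary.md",
    "timeline.md",
    "article_raw.html",
    "article_content.txt",
    "article_metadata.json",
]


def cache_root(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the cache root, honoring ANY2SUMMARY_CACHE_DIR when it is set."""
    env = os.environ if env is None else env
    override = env.get("ANY2SUMMARY_CACHE_DIR")
    if override:
        return Path(override).expanduser()
    return Path("~/.cache/any2summary").expanduser()


def list_artifacts(directory: Path) -> dict[str, bool]:
    """Report which of the documented artifacts exist in one cache directory."""
    return {name: (directory / name).is_file() for name in ARTIFACT_NAMES}
```

For example, `list_artifacts(cache_root() / "youtube" / "<video-id>")` would show whether a summary has already been generated for that video, assuming the directory naming seen in the diarization workflow.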

## Advanced Customization & Debugging
- **Prompt overrides:** keep dedicated prompt files per source type and pass them via `--summary-prompt-file` / `--article-summary-prompt-file`.
- **Default prompt management:** editing `prompts/summary_prompt.txt` or `prompts/article_prompt.txt` immediately updates the CLI’s built-in behavior.
- **Speaker accuracy:** use `--known-speaker name=sample.wav` or `--known-speaker-name` hints to improve Azure labels.
- **Azure streaming:** enabled by default; disable with `--no-azure-streaming` in CI or log-sensitive environments.
- **Android fallback:** `yt_dlp` automatically retries with Android settings on YouTube 403 errors; provide cookies through `ANY2SUMMARY_YTDLP_COOKIES` for gated content.
- **Payload debugging:** set `ANY2SUMMARY_DEBUG_PAYLOAD=1` to dump raw Azure responses as JSON in the cache folder.
- **Batch throughput:** a `ThreadPoolExecutor` caps concurrency at CPU count; split large batches manually if you need throttling.
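
The batch behavior described above (concurrent jobs, results in the original order, failures reported without stopping the rest) follows a common `ThreadPoolExecutor` pattern. The sketch below shows that pattern under assumptions: `run_batch` and its `worker` callback are hypothetical, and the package's own scheduling code may differ.

```python
import os
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Sequence


def run_batch(urls: Sequence[str],
              worker: Callable[[str], object],
              max_workers: Optional[int] = None) -> list:
    """Run worker(url) concurrently and return results in input order."""
    limit = max_workers or os.cpu_count() or 1
    results = []
    with ThreadPoolExecutor(max_workers=limit) as pool:
        # Submitting in order and reading futures in the same order keeps
        # the output aligned with the input, whichever job finishes first.
        futures = [pool.submit(worker, url) for url in urls]
        for url, future in zip(urls, futures):
            try:
                results.append(future.result())
            except Exception as exc:
                # Mirror the "[URL] error message" stderr format from the README.
                print(f"[{url}] {exc}", file=sys.stderr)
                results.append(None)
    return results
```

Passing a small `max_workers` value is one way to throttle such a batch by hand when a rate limit matters more than throughput.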

## Scripts & Docker

### setup_and_run.sh
- Creates `.venv`, installs dependencies, and exports proxy variables (`http_proxy/https_proxy/all_proxy` to `127.0.0.1:7890` by default). Edit the script to match your proxy port.
- Accepts the full CLI argument list (e.g., `./setup_and_run.sh --url <...> --azure-summary`) and is suitable for teammates who prefer shell scripts over Python invocations.

### Docker
```bash
docker build -t any2summary ./any2summary
docker run --rm \
  --env-file ./any2summary/.env \
  -v "$HOME/.cache/any2summary:/app/.cache/any2summary" \
  any2summary \
  --url "https://www.youtube.com/watch?v=<video-id>" \
  --language en
```
- Pass Azure credentials via `--env-file` and mount the cache directory to avoid repeated downloads/transcriptions.

## Testing
```bash
cd any2summary
PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest test/test_cli.py test/test_cli_article.py

# From the repo root:
PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest any2summary/test/
pytest test/ -q  # regression + integration suites
```

## FAQ
- **403 Forbidden / audio download fails**: verify the URL is publicly accessible; for login-required content, provide cookies via `ANY2SUMMARY_YTDLP_COOKIES` or rely on the default proxy in `setup_and_run.sh`.
- **Azure credential errors**: ensure `.env` or environment vars define `AZURE_OPENAI_API_KEY` and `AZURE_OPENAI_ENDPOINT`, and set deployment names when summaries are required.
- **Audio too long**: the CLI auto-splits WAV files and retries; if stale oversized files linger, run with `--clean-cache` first.
- **Empty article summaries**: confirm `--azure-summary` is enabled and the article is reachable; provide a custom `--article-summary-prompt-file` if necessary.
- **Disk usage**: periodically clean `ANY2SUMMARY_CACHE_DIR` or combine it with `--clean-cache` on old tasks.

Before publishing, verify that README updates, sample commands, and prompt descriptions align with the current CLI behavior to avoid mismatches for new users.
@@ -0,0 +1,7 @@
any2summary/__init__.py,sha256=2KK4sZffALShfIYo7tep__xTdtNPLTbxOIwVKpAelzg,103
any2summary/cli.py,sha256=vfAowYa_1awBbdqq9gV3NkmQGGzDNdImX_Cke9DrLwk,115582
any2summary-0.0.1.dist-info/METADATA,sha256=vVOCJDQfsCWmqjfTiroixI777JipegHKh6iXIJsB0Lc,13642
any2summary-0.0.1.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
any2summary-0.0.1.dist-info/entry_points.txt,sha256=D4wiGLnG6yUA3Z6DplrC83REhWGI8R1wOxDiG_QHvzw,53
any2summary-0.0.1.dist-info/top_level.txt,sha256=yAEyvK40syQ4qaCXvTK5Ld4W-teheppS2WaRxgVCeJU,12
any2summary-0.0.1.dist-info/RECORD,,
@@ -0,0 +1 @@
any2summary