@steipete/summarize 0.2.0 → 0.4.0

Files changed (55)
  1. package/CHANGELOG.md +33 -3
  2. package/README.md +41 -9
  3. package/dist/cli.cjs +5209 -740
  4. package/dist/cli.cjs.map +4 -4
  5. package/dist/esm/content/link-preview/client.js +6 -0
  6. package/dist/esm/content/link-preview/client.js.map +1 -1
  7. package/dist/esm/content/link-preview/transcript/index.js +6 -0
  8. package/dist/esm/content/link-preview/transcript/index.js.map +1 -1
  9. package/dist/esm/content/link-preview/transcript/providers/youtube/yt-dlp.js +213 -0
  10. package/dist/esm/content/link-preview/transcript/providers/youtube/yt-dlp.js.map +1 -0
  11. package/dist/esm/content/link-preview/transcript/providers/youtube.js +40 -2
  12. package/dist/esm/content/link-preview/transcript/providers/youtube.js.map +1 -1
  13. package/dist/esm/flags.js +20 -1
  14. package/dist/esm/flags.js.map +1 -1
  15. package/dist/esm/llm/generate-text.js +51 -14
  16. package/dist/esm/llm/generate-text.js.map +1 -1
  17. package/dist/esm/llm/html-to-markdown.js +3 -2
  18. package/dist/esm/llm/html-to-markdown.js.map +1 -1
  19. package/dist/esm/markitdown.js +54 -0
  20. package/dist/esm/markitdown.js.map +1 -0
  21. package/dist/esm/prompts/file.js +19 -0
  22. package/dist/esm/prompts/file.js.map +1 -1
  23. package/dist/esm/prompts/index.js +1 -1
  24. package/dist/esm/prompts/index.js.map +1 -1
  25. package/dist/esm/run.js +302 -44
  26. package/dist/esm/run.js.map +1 -1
  27. package/dist/esm/version.js +1 -1
  28. package/dist/types/content/link-preview/client.d.ts +3 -0
  29. package/dist/types/content/link-preview/content/types.d.ts +1 -1
  30. package/dist/types/content/link-preview/deps.d.ts +3 -0
  31. package/dist/types/content/link-preview/transcript/providers/youtube/yt-dlp.d.ts +15 -0
  32. package/dist/types/content/link-preview/transcript/types.d.ts +4 -0
  33. package/dist/types/flags.d.ts +5 -1
  34. package/dist/types/llm/generate-text.d.ts +8 -2
  35. package/dist/types/llm/html-to-markdown.d.ts +4 -1
  36. package/dist/types/markitdown.d.ts +10 -0
  37. package/dist/types/prompts/file.d.ts +7 -0
  38. package/dist/types/prompts/index.d.ts +1 -1
  39. package/dist/types/run.d.ts +3 -1
  40. package/dist/types/version.d.ts +1 -1
  41. package/docs/README.md +1 -1
  42. package/docs/extract-only.md +10 -7
  43. package/docs/firecrawl.md +2 -2
  44. package/docs/site/docs/config.html +3 -3
  45. package/docs/site/docs/extract-only.html +7 -5
  46. package/docs/site/docs/firecrawl.html +6 -6
  47. package/docs/site/docs/index.html +2 -2
  48. package/docs/site/docs/llm.html +2 -2
  49. package/docs/site/docs/openai.html +2 -2
  50. package/docs/site/docs/website.html +7 -4
  51. package/docs/site/docs/youtube.html +2 -2
  52. package/docs/site/index.html +1 -1
  53. package/docs/website.md +10 -7
  54. package/docs/youtube.md +6 -3
  55. package/package.json +5 -1
package/CHANGELOG.md CHANGED
@@ -1,10 +1,40 @@
  # Changelog
 
+ ## 0.4.0 - 2025-12-21
+
+ ### Changes
+
+ - Add URL extraction mode via `--extract` (deprecated alias: `--extract-only`) with `--format md|text`.
+ - Rename HTML→Markdown conversion flag to `--markdown-mode` (deprecated alias: `--markdown`).
+ - Add `--preprocess off|auto|always` and a `uvx markitdown` fallback for Markdown extraction and unsupported file attachments (when `--format md` is used).
+
+ ## 0.3.0 - 2025-12-20
+ ### Changes
+
+ - Add yt-dlp audio transcription fallback for YouTube; prefer OpenAI Whisper with FAL fallback. Thanks @dougvk.
+ - Add `--no-playlist` to yt-dlp downloads to avoid transcript mismatches.
+ - Run yt-dlp after web + Apify in `--youtube auto`, and error early for missing keys in `--youtube yt-dlp`.
+ - Require Node 22+.
+ - Respect `OPENAI_BASE_URL` when set, even with OpenRouter keys.
+ - Apply OpenRouter provider ordering headers to HTML→Markdown conversion.
+ - Add OpenRouter configuration tests. Thanks @dougvk for the initial OpenRouter support.
+ - Build and ship a Bun bytecode arm64 binary for Homebrew.
+
+ ### Tests
+
+ - Add coverage for yt-dlp ordering, missing-key errors, and helper paths.
+ - Add live coverage for yt-dlp transcript mode and missing-caption YouTube pages.
+
+ ### Dev
+
+ - Add `Dockerfile.test` for containerized yt-dlp testing.
+
  ## 0.2.0 - 2025-12-20
 
  ### Changes
 
- - Remove map-reduce summarization; reject inputs that exceed the model’s context window.
+ - Add native OpenRouter support via `OPENROUTER_API_KEY` with optional provider ordering (`OPENROUTER_PROVIDERS`).
+ - Remove map-reduce summarization; reject inputs that exceed the model's context window.
  - Preflight text prompts with the GPT tokenizer and the model’s max input tokens.
  - Reject text files over 10 MB before tokenization.
  - Reject too-small numeric `--length` and `--max-output-tokens` values.
@@ -71,7 +101,7 @@ First public release.
  - `--max-output-tokens <count>` (optional hard cap)
  - `--timeout <duration>` (default `2m`)
  - `--stream auto|on|off`, `--render auto|md-live|md|plain`
- - `--extract-only` (URLs only; no summary)
+ - `--extract` (URLs only; no summary; deprecated alias: `--extract-only`)
  - `--json` (structured output incl. input config, prompt, extracted content, LLM metadata, and metrics)
  - `--metrics off|on|detailed` (default `on`)
  - `--verbose`
@@ -80,7 +110,7 @@ First public release.
 
  - Websites: fetch + extract “article-ish” content + normalization for prompts.
  - Firecrawl fallback for blocked/thin sites (`--firecrawl off|auto|always`, via `FIRECRAWL_API_KEY`).
- - Markdown extraction for websites in `--extract-only` mode (`--markdown off|auto|llm`).
+ - Markdown extraction for websites in `--extract` mode (`--format md|text`, `--markdown-mode off|auto|llm`).
  - YouTube (`--youtube auto|web|apify`):
  - best-effort transcript endpoints
  - optional Apify fallback (requires `APIFY_API_TOKEN`; single actor `faVsWy9VTSNVIhWpR`)
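Note on the flag renames in this changelog: `--extract-only` and `--markdown` remain accepted as deprecated aliases of `--extract` and `--markdown-mode`. The sketch below shows one way such aliasing could be handled before argument parsing. It is not taken from the package's source: `normalizeFlags` and `DEPRECATED_ALIASES` are hypothetical names, and the warning text is an assumption.

```typescript
// Illustrative sketch only: `normalizeFlags` and `DEPRECATED_ALIASES` are
// hypothetical names, not the package's actual implementation.
const DEPRECATED_ALIASES: { [flag: string]: string } = {
  "--extract-only": "--extract",
  "--markdown": "--markdown-mode",
};

interface NormalizedFlags {
  args: string[];     // argv with deprecated flags rewritten
  warnings: string[]; // one message per deprecated flag encountered
}

function normalizeFlags(argv: string[]): NormalizedFlags {
  const args: string[] = [];
  const warnings: string[] = [];
  for (const token of argv) {
    // Support both "--flag value" and "--flag=value" spellings.
    const eq = token.indexOf("=");
    const name = eq === -1 ? token : token.slice(0, eq);
    const suffix = eq === -1 ? "" : token.slice(eq);
    // Own-property check so tokens like "constructor" never match.
    if (!Object.prototype.hasOwnProperty.call(DEPRECATED_ALIASES, name)) {
      args.push(token); // not a deprecated flag: pass through unchanged
      continue;
    }
    const current = DEPRECATED_ALIASES[name];
    args.push(current + suffix);
    warnings.push(name + " is deprecated; use " + current);
  }
  return { args, warnings };
}
```

With this sketch, `normalizeFlags(["https://example.com", "--extract-only", "--markdown=llm"])` rewrites the last two tokens to `--extract` and `--markdown-mode=llm` and records two warnings.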
package/README.md CHANGED
@@ -11,6 +11,8 @@ It streams output by default on TTY and renders Markdown to ANSI (via `markdansi
 
  ## Install
 
+ Requires Node 22+.
+
  - npx (no install):
 
  ```bash
@@ -23,6 +25,8 @@ npx -y @steipete/summarize "https://example.com" --model google/gemini-3-flash-p
  brew install steipete/tap/summarize
  ```
 
+ Apple Silicon only (arm64).
+
  ## Quickstart
 
  ```bash
@@ -108,7 +112,10 @@ npx -y @steipete/summarize <input> [flags]
  - `--max-output-tokens <count>`: hard cap for LLM output tokens (optional)
  - `--stream auto|on|off`: stream LLM output (`auto` = TTY only; disabled in `--json` mode)
  - `--render auto|md-live|md|plain`: Markdown rendering (`auto` = best default for TTY)
- - `--extract-only`: print extracted content and exit (no summary) — only for URLs
+ - `--format md|text`: website/file content format (default `text`)
+ - `--preprocess off|auto|always`: preprocess files (only with `--format md`) for model compatibility (default `auto`)
+ - `--extract`: print extracted content and exit (no summary) — only for URLs
+ - Deprecated alias: `--extract-only`
  - `--json`: machine-readable output with diagnostics, prompt, `metrics`, and optional summary
  - `--verbose`: debug/diagnostics on stderr
  - `--metrics off|on|detailed`: metrics output (default `on`; `detailed` prints a breakdown to stderr)
@@ -118,14 +125,23 @@ npx -y @steipete/summarize <input> [flags]
  Non-YouTube URLs go through a “fetch → extract” pipeline. When the direct fetch/extraction is blocked or too thin, `--firecrawl auto` can fall back to Firecrawl (if configured).
 
  - `--firecrawl off|auto|always` (default `auto`)
- - `--markdown off|auto|llm` (default `auto`; only affects `--extract-only` for non-YouTube URLs)
- - Plain-text mode: use `--firecrawl off --markdown off`.
+ - `--extract --format md|text` (default `text`)
+ - `--markdown-mode off|auto|llm` (default `auto`; only affects `--format md` for non-YouTube URLs)
+ - Plain-text mode: use `--format text`.
+
+ ## YouTube transcripts
+
+ `--youtube auto` tries best-effort web transcript endpoints first. When captions aren't available, it falls back to:
 
- ## YouTube transcripts (Apify fallback)
+ 1. **Apify** (if `APIFY_API_TOKEN` is set): Uses a scraping actor (`faVsWy9VTSNVIhWpR`)
+ 2. **yt-dlp + Whisper** (if `YT_DLP_PATH` is set): Downloads audio via yt-dlp, transcribes with OpenAI Whisper if `OPENAI_API_KEY` is set, otherwise falls back to FAL (`FAL_KEY`)
 
- `--youtube auto` tries best-effort web transcript endpoints first, then falls back to Apify *only if* `APIFY_API_TOKEN` is set.
+ Environment variables for yt-dlp mode:
+ - `YT_DLP_PATH` - path to yt-dlp binary
+ - `OPENAI_API_KEY` - OpenAI Whisper transcription (preferred)
+ - `FAL_KEY` - FAL AI Whisper fallback
 
- Apify uses a single actor (`faVsWy9VTSNVIhWpR`). It costs money but tends to be more reliable.
+ Apify costs money but tends to be more reliable when captions exist.
 
  ## Configuration
 
@@ -160,13 +176,29 @@ Set the key matching your chosen `--model`:
 
  OpenRouter (OpenAI-compatible):
 
- - Set `OPENAI_BASE_URL=https://openrouter.ai/api/v1`
- - Prefer `OPENROUTER_API_KEY=...` (instead of reusing `OPENAI_API_KEY`)
- - Use OpenRouter models via the `openai/...` prefix, e.g. `--model openai/xiaomi/mimo-v2-flash:free`
+ - Set `OPENROUTER_API_KEY=...` to route `openai/...` models through OpenRouter
+ - Use OpenRouter models via the `openai/...` prefix, e.g. `--model openai/openai/gpt-oss-20b`
+ - Optional: `OPENROUTER_PROVIDERS=...` to specify provider fallback order (e.g. `groq,google-vertex`)
+
+ Example:
+
+ ```bash
+ OPENROUTER_API_KEY=sk-or-... summarize "https://example.com" --model openai/openai/gpt-oss-20b
+ ```
+
+ With provider ordering (falls back through providers in order):
+
+ ```bash
+ OPENROUTER_API_KEY=sk-or-... OPENROUTER_PROVIDERS="groq,google-vertex" summarize "https://example.com"
+ ```
+
+ Legacy: `OPENAI_BASE_URL=https://openrouter.ai/api/v1` with `OPENAI_API_KEY` also works.
 
  Optional services:
 
  - `FIRECRAWL_API_KEY` (website extraction fallback)
+ - `YT_DLP_PATH` (path to yt-dlp binary for audio extraction)
+ - `FAL_KEY` (FAL AI API key for audio transcription via Whisper)
  - `APIFY_API_TOKEN` (YouTube transcript fallback)
 
  ## Model limits
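
Note on the `--youtube auto` order documented in this README diff: web transcript endpoints come first, then Apify if `APIFY_API_TOKEN` is set, then yt-dlp if `YT_DLP_PATH` is set, with OpenAI Whisper preferred over FAL for transcription. A minimal sketch of that ordering follows. The function names `planAutoMode` and `pickTranscriber` and the step labels are hypothetical, and skipping yt-dlp when no transcription key is present is an assumption rather than confirmed behaviour of the package.

```typescript
// Illustrative sketch only: names and step labels are hypothetical and do not
// come from the package's source.
type Env = { [name: string]: string | undefined };

function isSet(value: string | undefined): boolean {
  return value !== undefined && value.trim() !== "";
}

// OpenAI Whisper is preferred; FAL is used only when no OpenAI key is set.
function pickTranscriber(env: Env): "openai-whisper" | "fal" | null {
  if (isSet(env["OPENAI_API_KEY"])) return "openai-whisper";
  if (isSet(env["FAL_KEY"])) return "fal";
  return null;
}

// Ordered list of transcript sources to try in `--youtube auto`.
function planAutoMode(env: Env): string[] {
  const steps: string[] = ["web"]; // best-effort web endpoints always go first
  if (isSet(env["APIFY_API_TOKEN"])) steps.push("apify");
  const transcriber = pickTranscriber(env);
  // Assumption: yt-dlp is skipped when no transcription key is available.
  if (isSet(env["YT_DLP_PATH"]) && transcriber !== null) {
    steps.push("yt-dlp:" + transcriber);
  }
  return steps;
}
```

For example, an environment with only `YT_DLP_PATH` and `FAL_KEY` set would plan `web` followed by `yt-dlp:fal`, while an empty environment would plan `web` alone.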