@steipete/summarize 0.2.0 → 0.4.0
- package/CHANGELOG.md +33 -3
- package/README.md +41 -9
- package/dist/cli.cjs +5209 -740
- package/dist/cli.cjs.map +4 -4
- package/dist/esm/content/link-preview/client.js +6 -0
- package/dist/esm/content/link-preview/client.js.map +1 -1
- package/dist/esm/content/link-preview/transcript/index.js +6 -0
- package/dist/esm/content/link-preview/transcript/index.js.map +1 -1
- package/dist/esm/content/link-preview/transcript/providers/youtube/yt-dlp.js +213 -0
- package/dist/esm/content/link-preview/transcript/providers/youtube/yt-dlp.js.map +1 -0
- package/dist/esm/content/link-preview/transcript/providers/youtube.js +40 -2
- package/dist/esm/content/link-preview/transcript/providers/youtube.js.map +1 -1
- package/dist/esm/flags.js +20 -1
- package/dist/esm/flags.js.map +1 -1
- package/dist/esm/llm/generate-text.js +51 -14
- package/dist/esm/llm/generate-text.js.map +1 -1
- package/dist/esm/llm/html-to-markdown.js +3 -2
- package/dist/esm/llm/html-to-markdown.js.map +1 -1
- package/dist/esm/markitdown.js +54 -0
- package/dist/esm/markitdown.js.map +1 -0
- package/dist/esm/prompts/file.js +19 -0
- package/dist/esm/prompts/file.js.map +1 -1
- package/dist/esm/prompts/index.js +1 -1
- package/dist/esm/prompts/index.js.map +1 -1
- package/dist/esm/run.js +302 -44
- package/dist/esm/run.js.map +1 -1
- package/dist/esm/version.js +1 -1
- package/dist/types/content/link-preview/client.d.ts +3 -0
- package/dist/types/content/link-preview/content/types.d.ts +1 -1
- package/dist/types/content/link-preview/deps.d.ts +3 -0
- package/dist/types/content/link-preview/transcript/providers/youtube/yt-dlp.d.ts +15 -0
- package/dist/types/content/link-preview/transcript/types.d.ts +4 -0
- package/dist/types/flags.d.ts +5 -1
- package/dist/types/llm/generate-text.d.ts +8 -2
- package/dist/types/llm/html-to-markdown.d.ts +4 -1
- package/dist/types/markitdown.d.ts +10 -0
- package/dist/types/prompts/file.d.ts +7 -0
- package/dist/types/prompts/index.d.ts +1 -1
- package/dist/types/run.d.ts +3 -1
- package/dist/types/version.d.ts +1 -1
- package/docs/README.md +1 -1
- package/docs/extract-only.md +10 -7
- package/docs/firecrawl.md +2 -2
- package/docs/site/docs/config.html +3 -3
- package/docs/site/docs/extract-only.html +7 -5
- package/docs/site/docs/firecrawl.html +6 -6
- package/docs/site/docs/index.html +2 -2
- package/docs/site/docs/llm.html +2 -2
- package/docs/site/docs/openai.html +2 -2
- package/docs/site/docs/website.html +7 -4
- package/docs/site/docs/youtube.html +2 -2
- package/docs/site/index.html +1 -1
- package/docs/website.md +10 -7
- package/docs/youtube.md +6 -3
- package/package.json +5 -1
@@ -4,7 +4,7 @@
 <meta charset="utf-8" />
 <meta name="viewport" content="width=device-width,initial-scale=1" />
 <meta name="color-scheme" content="dark light" />
-<title>Extract
+<title>Extract — summarize</title>
 <link rel="canonical" href="https://summarize.sh/docs/extract-only" />
 <link rel="preconnect" href="https://fonts.googleapis.com" />
 <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
@@ -31,7 +31,7 @@
 <a href="./index.html">Overview</a>
 <a href="./website.html">Website mode</a>
 <a href="./youtube.html">YouTube mode</a>
-<a href="./extract-only.html">Extract
+<a href="./extract-only.html">Extract</a>
 <a href="./llm.html">LLM</a>
 <a href="./openai.html">OpenAI</a>
 <a href="./firecrawl.html">Firecrawl</a>
@@ -40,7 +40,7 @@

 <article class="doc reveal">
 <p class="kicker">mode</p>
-<h1>Extract
+<h1>Extract</h1>
 <p>Print the extracted content and stop. No summary call.</p>

 <h2>Usage</h2>
@@ -50,7 +50,7 @@
 <span class="terminal__dot terminal__dot--b"></span>
 <span class="terminal__dot terminal__dot--c"></span>
 </div>
-<pre><code id="ex-only">summarize --extract
+<pre><code id="ex-only">summarize --extract "https://example.com/article"</code></pre>
 </div>
 <div class="copyRow">
 <span class="hint">Good for piping into your own tooling.</span>
@@ -63,7 +63,9 @@
 <li><code>--verbose</code> — show which extractor ran and why.</li>
 <li><code>--timeout</code> — tune crawling budget (<code>2m</code> default).</li>
 <li><code>--firecrawl off|auto|always</code> — choose the fallback strategy.</li>
-<li><code>--
+<li><code>--format md|text</code> — choose extracted output format.</li>
+<li><code>--markdown-mode off|auto|llm</code> — for non-YouTube URLs with <code>--format md</code>, control HTML→Markdown conversion (LLM) + fallbacks.</li>
+<li><code>--preprocess off|auto|always</code> — controls markitdown usage (default <code>auto</code>).</li>
 </ul>
 </article>
 </section>
@@ -31,7 +31,7 @@
 <a href="./index.html">Overview</a>
 <a href="./website.html">Website mode</a>
 <a href="./youtube.html">YouTube mode</a>
-<a href="./extract-only.html">Extract
+<a href="./extract-only.html">Extract</a>
 <a href="./llm.html">LLM</a>
 <a href="./openai.html">OpenAI</a>
 <a href="./firecrawl.html">Firecrawl</a>
@@ -41,7 +41,7 @@
 <article class="doc reveal">
 <p class="kicker">extractor</p>
 <h1>Firecrawl</h1>
-<p>Used as a fallback when HTML extraction looks blocked or too thin
+<p>Used as a fallback when HTML extraction looks blocked or too thin — and as a preferred Markdown source in extract mode (when configured).</p>

 <h2>Key</h2>
 <ul>
@@ -49,14 +49,14 @@
 <li>Control behavior with <code>--firecrawl off|auto|always</code>.</li>
 </ul>

-<h2>Extract
+<h2>Extract + Markdown</h2>
 <ul>
-<li><code>--extract
-<li><code>--
+<li><code>--extract</code> prints the extracted content.</li>
+<li><code>--extract --format md</code> outputs Markdown for non-YouTube URLs.</li>
 </ul>

 <div class="note">
-If you only want plain text: use <code>--
+If you only want plain text: use <code>--extract --format text</code>.
 </div>
 </article>
 </section>
@@ -34,7 +34,7 @@
 <div class="pillRow">
 <span class="pill"><span class="pill__dot" aria-hidden="true"></span> Website</span>
 <span class="pill"><span class="pill__dot" aria-hidden="true" style="background: var(--accent2)"></span> YouTube</span>
-<span class="pill"><span class="pill__dot" aria-hidden="true" style="background: var(--accent3)"></span> Extract
+<span class="pill"><span class="pill__dot" aria-hidden="true" style="background: var(--accent3)"></span> Extract</span>
 </div>
 </div>
 </div>
@@ -52,7 +52,7 @@
 <div class="small">docs/youtube.md</div>
 </a>
 <a class="card reveal" href="./extract-only.html">
-<h2>Extract
+<h2>Extract</h2>
 <p>Get the cleaned content and stop; perfect for piping.</p>
 <div class="small">docs/extract-only.md</div>
 </a>
package/docs/site/docs/llm.html CHANGED
@@ -31,7 +31,7 @@
 <a href="./index.html">Overview</a>
 <a href="./website.html">Website mode</a>
 <a href="./youtube.html">YouTube mode</a>
-<a href="./extract-only.html">Extract
+<a href="./extract-only.html">Extract</a>
 <a href="./llm.html">LLM</a>
 <a href="./openai.html">OpenAI</a>
 <a href="./firecrawl.html">Firecrawl</a>
@@ -53,7 +53,7 @@
 <h2>Practical advice</h2>
 <ul>
 <li>Pin <code>--model</code> for stable output.</li>
-<li>When using <code>--markdown llm</code>, provider fallback is disabled by design.</li>
+<li>When using <code>--markdown-mode llm</code>, provider fallback is disabled by design.</li>
 <li>For audits / tooling, prefer <code>--json</code> + fixed model.</li>
 </ul>
 </article>
@@ -31,7 +31,7 @@
 <a href="./index.html">Overview</a>
 <a href="./website.html">Website mode</a>
 <a href="./youtube.html">YouTube mode</a>
-<a href="./extract-only.html">Extract
+<a href="./extract-only.html">Extract</a>
 <a href="./llm.html">LLM</a>
 <a href="./openai.html">OpenAI</a>
 <a href="./firecrawl.html">Firecrawl</a>
@@ -45,7 +45,7 @@

 <h2>Notes</h2>
 <ul>
-<li>Some modes (like <code>--extract
+<li>Some modes (like <code>--extract</code>) don’t need an LLM at all.</li>
 <li>When output is used downstream, prefer <code>--json</code> and pin <code>--model</code>.</li>
 </ul>

@@ -31,7 +31,7 @@
 <a href="./index.html">Overview</a>
 <a href="./website.html">Website mode</a>
 <a href="./youtube.html">YouTube mode</a>
-<a href="./extract-only.html">Extract
+<a href="./extract-only.html">Extract</a>
 <a href="./llm.html">LLM</a>
 <a href="./openai.html">OpenAI</a>
 <a href="./firecrawl.html">Firecrawl</a>
@@ -41,20 +41,23 @@
 <article class="doc reveal">
 <p class="kicker">mode</p>
 <h1>Website mode</h1>
-<p>Fetch HTML → extract “article-ish” content → normalize to clean text. If extraction looks blocked or too thin, retry via Firecrawl Markdown (optional)
+<p>Fetch HTML → extract “article-ish” content → normalize to clean text. If extraction looks blocked or too thin, retry via Firecrawl Markdown (optional). With <code>--format md</code>, the CLI prefers Firecrawl Markdown when configured and can also convert HTML → Markdown via <code>--markdown-mode</code> (LLM) or <code>uvx markitdown</code>.</p>

 <h2>Flags</h2>
 <ul>
 <li><code>--firecrawl off|auto|always</code></li>
 <li><code>--timeout 30s|2m|5000ms</code> (default <code>2m</code>)</li>
-<li><code>--extract
+<li><code>--extract</code> (print extracted content; no summary call)</li>
+<li><code>--format md|text</code> (default <code>text</code>)</li>
+<li><code>--markdown-mode off|auto|llm</code> (only with <code>--format md</code>)</li>
+<li><code>--preprocess off|auto|always</code> (default <code>auto</code>; controls markitdown usage)</li>
 <li><code>--json</code> (emit a single JSON object)</li>
 <li><code>--verbose</code> (progress + which extractor was used)</li>
 <li><code>--metrics off|on|detailed</code></li>
 </ul>

 <div class="note">
-Plain-text mode: <code>--
+Plain-text mode: <code>--extract --format text</code>.
 </div>
 </article>
 </section>
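The `--timeout` values listed in the hunk above (`30s`, `2m`, `5000ms`, default `2m`) suggest a small duration grammar. The following is a minimal sketch of how such values could be parsed into milliseconds. The function name `parseDurationMs` and the decision to reject anything other than the three listed suffixes are assumptions for illustration, not the package's actual implementation.

```typescript
// Hypothetical helper, not taken from the package source.
// Accepts only the forms shown in the docs: "<n>ms", "<n>s", "<n>m".
function parseDurationMs(input: string): number {
  const match = /^(\d+)(ms|s|m)$/.exec(input.trim());
  if (match === null) {
    throw new Error(`Unsupported duration: ${input}`);
  }
  const value = Number(match[1]);
  const unit = match[2];
  // Convert the unit suffix to a millisecond multiplier.
  const factor = unit === "ms" ? 1 : unit === "s" ? 1_000 : 60_000;
  return value * factor;
}

// The documented default is `2m`.
const DEFAULT_TIMEOUT_MS = parseDurationMs("2m");
```

Rejecting unknown input loudly, rather than silently falling back to the default, is a design choice of this sketch; the real CLI may behave differently.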
@@ -31,7 +31,7 @@
 <a href="./index.html">Overview</a>
 <a href="./website.html">Website mode</a>
 <a href="./youtube.html">YouTube mode</a>
-<a href="./extract-only.html">Extract
+<a href="./extract-only.html">Extract</a>
 <a href="./llm.html">LLM</a>
 <a href="./openai.html">OpenAI</a>
 <a href="./firecrawl.html">Firecrawl</a>
@@ -45,7 +45,7 @@

 <h2>Tip</h2>
 <ul>
-<li>If you only want the transcript: use <code>--extract
+<li>If you only want the transcript: use <code>--extract</code>.</li>
 <li>For pipelines: add <code>--json</code>.</li>
 </ul>
 </article>
package/docs/site/index.html CHANGED
@@ -104,7 +104,7 @@
 </div>
 <div class="card reveal">
 <h2>Built for pipelines</h2>
-<p><code>--extract
+<p><code>--extract</code>, <code>--json</code>, and <code>--metrics</code> make it scriptable.</p>
 <div class="small">Compose it with your own tools</div>
 </div>
 <div class="card reveal">
package/docs/website.md CHANGED
@@ -7,21 +7,24 @@ Use this for non-YouTube URLs.
 - Fetches the page HTML.
 - Extracts “article-ish” content and normalizes it into clean text.
 - If extraction looks blocked or too thin, it can retry via Firecrawl (Markdown).
--
--
+- With `--format md`, the CLI prefers Firecrawl Markdown by default when `FIRECRAWL_API_KEY` is configured.
+- With `--format md`, `--markdown-mode auto|llm` can also convert HTML → Markdown via an LLM using the configured `--model` (no provider fallback).
+- With `--format md`, `--markdown-mode auto` may fall back to `uvx markitdown` when available (disable with `--preprocess off`).

 ## Flags

 - `--firecrawl off|auto|always`
-- `--
--
+- `--format md|text` (default: `text`)
+- `--markdown-mode off|auto|llm` (default: `auto`; only affects `--format md` for non-YouTube URLs)
+- `--preprocess off|auto|always` (default: `auto`; controls markitdown usage; `always` only affects file inputs)
+- Plain-text mode: use `--format text`.
 - `--timeout 30s|30|2m|5000ms` (default: `2m`)
-- `--extract
+- `--extract` (print extracted content; no summary LLM call)
 - `--json` (emit a single JSON object)
 - `--verbose` (progress + which extractor was used)
 - `--metrics off|on|detailed` (default: `on`; `detailed` prints a breakdown to stderr)

 ## API keys

-- Optional: `FIRECRAWL_API_KEY` (for the Firecrawl fallback)
-- Optional: `XAI_API_KEY` / `OPENAI_API_KEY` / `GEMINI_API_KEY` (also accepts `GOOGLE_GENERATIVE_AI_API_KEY` / `GOOGLE_API_KEY`) (required only when `--markdown llm` is used, or when `--markdown auto` falls back to LLM conversion)
+- Optional: `FIRECRAWL_API_KEY` (for the Firecrawl fallback / preferred Markdown output)
+- Optional: `XAI_API_KEY` / `OPENAI_API_KEY` / `GEMINI_API_KEY` (also accepts `GOOGLE_GENERATIVE_AI_API_KEY` / `GOOGLE_API_KEY`) (required only when `--markdown-mode llm` is used, or when `--markdown-mode auto` falls back to LLM conversion)
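The `--format md` notes in the website.md hunk above describe up to three Markdown sources: Firecrawl, LLM conversion, and `uvx markitdown`. The sketch below models which of them would be candidates for a given configuration. All names here (`planMarkdownSources`, `MarkdownPlanInput`, and its fields) are hypothetical. Two points are assumptions, not something the changed docs state: that LLM conversion is tried before markitdown, and that `--markdown-mode off` disables both conversions.

```typescript
type MarkdownMode = "off" | "auto" | "llm";
type PreprocessMode = "off" | "auto" | "always";
type MarkdownSource = "firecrawl" | "llm" | "markitdown";

// Hypothetical input shape; the real CLI reads flags and environment variables.
interface MarkdownPlanInput {
  markdownMode: MarkdownMode;
  preprocess: PreprocessMode;
  hasFirecrawlKey: boolean; // FIRECRAWL_API_KEY is set
  hasLlmKey: boolean; // one of the documented LLM API keys is set
  hasUvx: boolean; // `uvx markitdown` is available
}

// Returns candidate Markdown sources in the order this sketch would try them.
function planMarkdownSources(input: MarkdownPlanInput): MarkdownSource[] {
  const plan: MarkdownSource[] = [];
  // Documented: Firecrawl Markdown is preferred when the key is configured.
  if (input.hasFirecrawlKey) plan.push("firecrawl");
  // Documented: `auto` and `llm` can convert HTML to Markdown via an LLM.
  if (input.markdownMode !== "off" && input.hasLlmKey) plan.push("llm");
  // Documented: `auto` may fall back to markitdown unless `--preprocess off`.
  if (input.markdownMode === "auto" && input.preprocess !== "off" && input.hasUvx) {
    plan.push("markitdown");
  }
  return plan;
}
```

An empty result would mean none of the Markdown sources apply for that configuration; what the CLI does in that case is not covered by this sketch.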
package/docs/youtube.md CHANGED
@@ -2,11 +2,12 @@

 YouTube URLs use transcript-first extraction.

-## `--youtube auto|web|apify`
+## `--youtube auto|web|apify|yt-dlp`

-- `auto` (default): try `youtubei` → `captionTracks` → Apify (if token exists)
+- `auto` (default): try `youtubei` → `captionTracks` → Apify (if token exists) → `yt-dlp` (if configured)
 - `web`: try `youtubei` → `captionTracks` only
 - `apify`: Apify only
+- `yt-dlp`: download audio + transcribe (OpenAI Whisper preferred; FAL fallback)

 ## `youtubei` vs `captionTracks`

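The `--youtube` modes in the hunk above amount to an ordered list of transcript providers per mode. A minimal sketch of that mapping follows. The function name `transcriptProviderOrder` and the two boolean inputs are hypothetical; only the ordering itself is taken from the changed docs.

```typescript
type YoutubeMode = "auto" | "web" | "apify" | "yt-dlp";
type TranscriptProvider = "youtubei" | "captionTracks" | "apify" | "yt-dlp";

// Hypothetical availability flags; the real CLI derives these from its environment.
interface ProviderAvailability {
  hasApifyToken: boolean;
  ytDlpConfigured: boolean;
}

// Returns the providers to try, in order, for a given `--youtube` mode.
function transcriptProviderOrder(
  mode: YoutubeMode,
  available: ProviderAvailability,
): TranscriptProvider[] {
  switch (mode) {
    case "web":
      return ["youtubei", "captionTracks"];
    case "apify":
      return ["apify"];
    case "yt-dlp":
      return ["yt-dlp"];
    case "auto": {
      // Web providers first, then optional fallbacks when they are available.
      const order: TranscriptProvider[] = ["youtubei", "captionTracks"];
      if (available.hasApifyToken) order.push("apify");
      if (available.ytDlpConfigured) order.push("yt-dlp");
      return order;
    }
  }
}
```

With an Apify token but no `yt-dlp` setup, `auto` yields `youtubei`, `captionTracks`, then `apify`, which matches the pre-0.4.0 chain shown on the removed line.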
@@ -24,9 +25,11 @@ YouTube URLs use transcript-first extraction.
 - If no transcript is available, we still extract `ytInitialPlayerResponse.videoDetails.shortDescription` so YouTube links can still summarize meaningfully.
 - Apify is an optional fallback (needs `APIFY_API_TOKEN`).
 - By default, we use the actor id `faVsWy9VTSNVIhWpR` (Pinto Studio’s “Youtube Transcript Scraper”).
+- `yt-dlp` requires `YT_DLP_PATH` and either `OPENAI_API_KEY` (preferred) or `FAL_KEY`.
+- If OpenAI transcription fails and `FAL_KEY` is set, we fall back to FAL automatically.

 ## Example

 ```bash
-pnpm summarize -- --extract
+pnpm summarize -- --extract "https://www.youtube.com/watch?v=I845O57ZSy4&t=11s"
 ```
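The two added `yt-dlp` notes in the hunk above (required variables, then the OpenAI-to-FAL fallback) can be sketched as a small plan-then-try routine. Everything named here (`planYtDlpTranscribers`, `transcribeWithFallback`, the `attempt` callback) is hypothetical, and the sketch is synchronous for brevity; real transcription calls would be asynchronous.

```typescript
type Transcriber = "openai" | "fal";

// Decide which transcribers to try, based on the documented environment variables.
function planYtDlpTranscribers(env: Record<string, string | undefined>): Transcriber[] {
  if (!env["YT_DLP_PATH"]) {
    throw new Error("yt-dlp mode requires YT_DLP_PATH");
  }
  const plan: Transcriber[] = [];
  if (env["OPENAI_API_KEY"]) plan.push("openai"); // preferred
  if (env["FAL_KEY"]) plan.push("fal"); // fallback
  if (plan.length === 0) {
    throw new Error("yt-dlp mode requires OPENAI_API_KEY or FAL_KEY");
  }
  return plan;
}

// Try each transcriber in order; rethrow the last error if all of them fail.
function transcribeWithFallback(
  plan: Transcriber[],
  attempt: (transcriber: Transcriber) => string,
): string {
  let lastError: unknown = new Error("no transcriber configured");
  for (const transcriber of plan) {
    try {
      return attempt(transcriber);
    } catch (error) {
      lastError = error;
    }
  }
  throw lastError;
}
```

In a Node.js script the plan could be built from `process.env`, with `attempt` wrapping whichever transcription client is in use.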
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
 "name": "@steipete/summarize",
-"version": "0.
+"version": "0.4.0",
 "description": "Link → clean text → summary.",
 "type": "module",
 "bin": {
@@ -30,7 +30,11 @@
 "README.md",
 "LICENSE"
 ],
+"engines": {
+"node": ">=22"
+},
 "dependencies": {
+"@fal-ai/client": "^1.2.1",
 "cheerio": "^1.1.2",
 "es-toolkit": "^1.43.0",
 "gpt-tokenizer": "^3.4.0",