@nanocollective/get-md 1.1.0 → 1.1.1
- package/README.md +19 -327
- package/package.json +1 -1
package/README.md
CHANGED
@@ -1,359 +1,51 @@
 # get-md
 
-A fast, lightweight HTML to Markdown converter optimized for LLM consumption.
+A fast, lightweight HTML to Markdown converter optimized for LLM consumption.
 
 ## Features
 
-- **Lightning-fast
-- **Intelligent extraction
-- **
-- **
-- **
-- **TypeScript**: Full type definitions included
-- **Zero downloads**: Works instantly (AI model optional, ~1GB)
-- **Lightweight**: Small package size (~10MB)
-- **React Native compatible**: Full support including content extraction!
+- **Lightning-fast** (<100ms) with optional AI-powered conversion
+- **Intelligent extraction** using Mozilla Readability
+- **CLI included** for command-line usage
+- **TypeScript** with full type definitions
+- **React Native compatible**
 
 ## Installation
 
 ```bash
 npm install @nanocollective/get-md
-# or
-pnpm add @nanocollective/get-md
-# or
-yarn add @nanocollective/get-md
 ```
 
 ## Quick Start
 
-###
+### Library
 
 ```typescript
 import { convertToMarkdown } from "@nanocollective/get-md";
 
-// From HTML string
-const result = await convertToMarkdown("<h1>Hello</h1><p>World</p>");
-console.log(result.markdown);
-// # Hello
-//
-// World
-
-// From URL (auto-detected)
 const result = await convertToMarkdown("https://example.com");
-console.log(result.metadata.title);
-
-// From URL with custom timeout and headers
-const result = await convertToMarkdown("https://example.com", {
-  timeout: 10000,
-  headers: { Authorization: "Bearer token" },
-});
-
-// Force URL mode if auto-detection fails
-const result = await convertToMarkdown("example.com", { isUrl: true });
-```
-
-### As a CLI
-
-```bash
-# From stdin
-echo '<h1>Hello</h1>' | getmd
-
-# From file
-getmd input.html
-
-# From URL
-getmd https://example.com
-
-# Save to file
-getmd input.html -o output.md
-```
-
-## API
-
-### `convertToMarkdown(html, options?)`
-
-Convert HTML to clean, LLM-optimized Markdown.
-
-**Parameters:**
-
-- `html` (string): Raw HTML string or URL to fetch
-- `options` (MarkdownOptions): Conversion options
-
-**Returns:** `Promise<MarkdownResult>`
-
-**Options:**
-
-````typescript
-{
-  // Content options
-  extractContent?: boolean; // Use Readability extraction (default: true)
-  includeMeta?: boolean; // Include YAML frontmatter (default: true)
-  includeImages?: boolean; // Include images (default: true)
-  includeLinks?: boolean; // Include links (default: true)
-  includeTables?: boolean; // Include tables (default: true)
-  aggressiveCleanup?: boolean; // Remove ads, nav, etc. (default: true)
-  maxLength?: number; // Max output length (default: 1000000)
-  baseUrl?: string; // Base URL for resolving relative links
-
-  // URL fetch options (only used when input is a URL)
-  isUrl?: boolean; // Force treat input as URL (default: auto-detect)
-  timeout?: number; // Request timeout in ms (default: 15000)
-  followRedirects?: boolean; // Follow redirects (default: true)
-  maxRedirects?: number; // Max redirects to follow (default: 5)
-  headers?: Record<string, string>; // Custom HTTP headers
-  userAgent?: string; // Custom user agent
-}
-```
-
-## CLI Usage
-
-```bash
-getmd [input] [options]
-
-Options:
-  -o, --output <file> Output file (default: stdout)
-  --no-extract Disable Readability content extraction
-  --no-frontmatter Exclude metadata from YAML frontmatter
-  --no-images Remove images from output
-  --no-links Remove links from output
-  --no-tables Remove tables from output
-  --max-length <n> Maximum output length (default: 1000000)
-  --base-url <url> Base URL for resolving relative links
-  -v, --verbose Verbose output
-  -h, --help Display help
-````
-
-## Examples
-
-### Basic Conversion
-
-```typescript
-import { convertToMarkdown } from "@nanocollective/getmd";
-
-const html = `
-  <article>
-    <h1>My Article</h1>
-    <p>This is a <strong>test</strong>.</p>
-  </article>
-`;
-
-const result = await convertToMarkdown(html);
-console.log(result.markdown);
-```
-
-### With Metadata
-
-```typescript
-// Metadata is included by default
-const result = await convertToMarkdown(html);
 console.log(result.markdown);
-// ---
-// title: "My Article"
-// author: "John Doe"
-// readingTime: 5
-// ---
-//
-// # My Article
-// ...
-
-// To exclude metadata:
-const resultNoMeta = await convertToMarkdown(html, { includeMeta: false });
 ```
 
-### CLI
+### CLI
 
 ```bash
-
-getmd article.html -o article.md
-
-# Fetch from URL
-getmd https://blog.example.com/post -o post.md
-
-# Remove images and links
-getmd article.html --no-images --no-links -o clean.md
-
-# Exclude frontmatter metadata
-getmd article.html --no-frontmatter -o clean.md
-```
-
-## React Native Support
-
-get-md **fully supports React Native** including content extraction! We use `happy-dom-without-node` instead of JSDOM, which works across Node.js, React Native, and browser environments.
-
-```typescript
-import { convertToMarkdown } from "@nanocollective/get-md";
-
-// Works in React Native with full features!
-const result = await convertToMarkdown(html, {
-  extractContent: true, // Readability extraction works!
-  includeMeta: true,
-  // ... all other options work
-});
+getmd https://example.com -o output.md
 ```
 
-
-- ✅ HTML to Markdown conversion
-- ✅ Mozilla Readability content extraction
-- ✅ Metadata extraction
-- ✅ Content cleaning and optimization
-- ✅ All formatting options
+## Documentation
 
-
+For full documentation, see [docs/index.md](docs/index.md):
 
-
-
-
-
-### When to Use LLM Conversion
-
-| Use Case | Recommended Method |
-|----------|-------------------|
-| High-quality single conversions | LLM (better structure) |
-| Complex documentation sites | LLM (semantic understanding) |
-| Batch processing (1000+ pages) | Turndown (speed) |
-| CI/CD pipelines | Turndown (fast, deterministic) |
-| Real-time conversion | Turndown (sub-second) |
-
-### Quick Start
-
-```typescript
-import { convertToMarkdown, checkLLMModel, downloadLLMModel } from '@nanocollective/get-md';
-
-// 1. Check if model is available
-const status = await checkLLMModel();
-
-if (!status.available) {
-  // 2. Download model (one-time, ~986MB)
-  await downloadLLMModel({
-    onProgress: (downloaded, total, percentage) => {
-      console.log(`Downloading: ${percentage.toFixed(1)}%`);
-    }
-  });
-}
-
-// 3. Convert with LLM
-const result = await convertToMarkdown('https://example.com', {
-  useLLM: true,
-  onLLMEvent: (event) => {
-    if (event.type === 'conversion-complete') {
-      console.log(`Done in ${event.duration}ms`);
-    }
-  }
-});
-```
-
-### CLI Usage
-
-```bash
-# Check model status
-getmd --model-info
-
-# Download model (one-time)
-getmd --download-model
-
-# Convert with LLM
-getmd https://example.com --use-llm
-
-# Compare Turndown vs LLM
-getmd https://example.com --compare -o comparison.md
-
-# Show default model path
-getmd --model-path
-
-# Remove model
-getmd --remove-model
-```
-
-### Configuration File
-
-You can set default options in a config file. Create `.getmdrc` or `get-md.config.json` in your project or home directory:
-
-```json
-{
-  "useLLM": true,
-  "llmTemperature": 0.1,
-  "llmFallback": true,
-  "extractContent": true,
-  "includeMeta": true
-}
-```
-
-CLI flags override config file settings. Use `getmd --show-config` to see current configuration.
-
-### LLM Options
-
-```typescript
-{
-  useLLM?: boolean; // Use LLM for conversion (default: false)
-  llmModelPath?: string; // Custom model path (optional)
-  llmTemperature?: number; // Generation temperature (default: 0.1)
-  llmMaxTokens?: number; // Max tokens (default: 512000)
-  llmFallback?: boolean; // Fallback to Turndown on error (default: true)
-
-  // Event callbacks
-  onLLMEvent?: (event: LLMEvent) => void;
-  onDownloadProgress?: (downloaded, total, percentage) => void;
-  onModelStatus?: (status) => void;
-  onConversionProgress?: (progress) => void;
-}
-```
-
-### Requirements
-
-- **Disk Space**: ~1GB for the model file
-- **RAM**: 2-4GB during inference (8GB+ system recommended)
-- **Speed**: 5-10 seconds per typical webpage (vs <100ms for Turndown)
-
-### Model Management
-
-```typescript
-import {
-  checkLLMModel,
-  downloadLLMModel,
-  removeLLMModel,
-  getLLMModelInfo
-} from '@nanocollective/get-md';
-
-// Check model availability
-const status = await checkLLMModel();
-console.log(status.available); // true/false
-console.log(status.sizeFormatted); // "986MB"
-
-// Get model information
-const info = getLLMModelInfo();
-console.log(info.defaultPath); // ~/.get-md/models/ReaderLM-v2-Q4_K_M.gguf
-console.log(info.recommendedModel); // "ReaderLM-v2-Q4_K_M"
-
-// Remove model
-await removeLLMModel();
-```
-
-## Why get-md?
-
-### For LLMs
-
-- **Consistent output**: Deterministic markdown formatting helps LLMs learn patterns
-- **Clean structure**: Proper heading hierarchy, list formatting, and spacing
-- **Noise removal**: Automatically removes ads, navigation, footers, etc.
-- **Fast processing**: <100ms per document enables real-time workflows
-
-### vs Other Tools
-
-- **Faster than LLM-based extractors**: No model inference overhead
-- **More reliable**: Deterministic output, no hallucinations
-- **Cheaper**: No API costs
-- **Privacy-friendly**: Runs locally, no data sent to third parties
+- [API Reference](docs/api.md)
+- [CLI Usage](docs/cli.md)
+- [LLM-Powered Conversion](docs/llm.md)
+- [React Native Support](docs/react-native.md)
 
 ## Community
 
-We're a small community-led team building local and privacy-first AI solutions under the [Nano Collective](https://nanocollective.org)
-
-**If you want to contribute to the code:**
-
-- Read our detailed [CONTRIBUTING.md](CONTRIBUTING.md) guide for information on development setup, coding standards, and how to submit your changes.
-
-**If you want to be part of our community or help with other aspects like design or marketing:**
-
-- Join our Discord server to connect with other users, ask questions, share ideas, and get help: [Join our Discord server](https://discord.gg/ktPDV6rekE)
+We're a small community-led team building local and privacy-first AI solutions under the [Nano Collective](https://nanocollective.org).
 
---
+- [Contributing Guide](CONTRIBUTING.md)
+- [Discord Server](https://discord.gg/ktPDV6rekE)
+- [GitHub Issues](https://github.com/nanocollective/get-md/issues)
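
Reviewer note: the Configuration File section removed from the README in this release says that CLI flags override config file settings. A minimal sketch of that precedence follows. It assumes a hypothetical `resolveOptions` helper and a simplified `Options` type, neither of which is part of get-md's published API. Only the option names and default values come from the removed README text, and the package's real merge logic may differ.

```typescript
// Hypothetical option shape. Field names come from the removed README text,
// but this type is an illustration, not get-md's actual MarkdownOptions.
interface Options {
  useLLM?: boolean;
  llmTemperature?: number;
  llmFallback?: boolean;
  extractContent?: boolean;
  includeMeta?: boolean;
}

// Defaults as documented in the removed README sections.
const DEFAULTS: Required<Options> = {
  useLLM: false,
  llmTemperature: 0.1,
  llmFallback: true,
  extractContent: true,
  includeMeta: true,
};

// Hypothetical helper: for each option, a CLI flag wins over the config file,
// and the config file wins over the documented default.
function resolveOptions(fileConfig: Options, cliFlags: Options): Required<Options> {
  return {
    useLLM: cliFlags.useLLM ?? fileConfig.useLLM ?? DEFAULTS.useLLM,
    llmTemperature:
      cliFlags.llmTemperature ?? fileConfig.llmTemperature ?? DEFAULTS.llmTemperature,
    llmFallback: cliFlags.llmFallback ?? fileConfig.llmFallback ?? DEFAULTS.llmFallback,
    extractContent:
      cliFlags.extractContent ?? fileConfig.extractContent ?? DEFAULTS.extractContent,
    includeMeta: cliFlags.includeMeta ?? fileConfig.includeMeta ?? DEFAULTS.includeMeta,
  };
}

// Config file enables the LLM and metadata; the command line turns metadata off.
const resolved = resolveOptions(
  { useLLM: true, includeMeta: true },
  { includeMeta: false },
);
console.log(resolved);
```

Using `??` rather than `||` keeps an explicit `false` or `0` from a higher-priority source from being mistaken for "not set".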