@nanocollective/get-md 1.1.0 → 1.1.1

Files changed (2)
  1. package/README.md +19 -327
  2. package/package.json +1 -1
package/README.md CHANGED
@@ -1,359 +1,51 @@
  # get-md
 
- A fast, lightweight HTML to Markdown converter optimized for LLM consumption. Uses proven parsing libraries to deliver clean, well-structured markdown with intelligent content extraction and noise filtering.
+ A fast, lightweight HTML to Markdown converter optimized for LLM consumption.
 
  ## Features
 
- - **Lightning-fast**: Converts HTML to Markdown in <100ms
- - **Intelligent extraction**: Uses Mozilla Readability to extract main content
- - **LLM-optimized**: Consistent formatting perfect for AI consumption
- - **Optional AI conversion**: Higher quality output with local ReaderLM-v2 model
- - **CLI included**: Use from the command line or as a library
- - **TypeScript**: Full type definitions included
- - **Zero downloads**: Works instantly (AI model optional, ~1GB)
- - **Lightweight**: Small package size (~10MB)
- - **React Native compatible**: Full support including content extraction!
+ - **Lightning-fast** (<100ms) with optional AI-powered conversion
+ - **Intelligent extraction** using Mozilla Readability
+ - **CLI included** for command-line usage
+ - **TypeScript** with full type definitions
+ - **React Native compatible**
 
  ## Installation
 
  ```bash
  npm install @nanocollective/get-md
- # or
- pnpm add @nanocollective/get-md
- # or
- yarn add @nanocollective/get-md
  ```
 
  ## Quick Start
 
- ### As a Library
+ ### Library
 
  ```typescript
  import { convertToMarkdown } from "@nanocollective/get-md";
 
- // From HTML string
- const result = await convertToMarkdown("<h1>Hello</h1><p>World</p>");
- console.log(result.markdown);
- // # Hello
- //
- // World
-
- // From URL (auto-detected)
  const result = await convertToMarkdown("https://example.com");
- console.log(result.metadata.title);
-
- // From URL with custom timeout and headers
- const result = await convertToMarkdown("https://example.com", {
-   timeout: 10000,
-   headers: { Authorization: "Bearer token" },
- });
-
- // Force URL mode if auto-detection fails
- const result = await convertToMarkdown("example.com", { isUrl: true });
- ```
-
- ### As a CLI
-
- ```bash
- # From stdin
- echo '<h1>Hello</h1>' | getmd
-
- # From file
- getmd input.html
-
- # From URL
- getmd https://example.com
-
- # Save to file
- getmd input.html -o output.md
- ```
-
- ## API
-
- ### `convertToMarkdown(html, options?)`
-
- Convert HTML to clean, LLM-optimized Markdown.
-
- **Parameters:**
-
- - `html` (string): Raw HTML string or URL to fetch
- - `options` (MarkdownOptions): Conversion options
-
- **Returns:** `Promise<MarkdownResult>`
-
- **Options:**
-
- ````typescript
- {
-   // Content options
-   extractContent?: boolean; // Use Readability extraction (default: true)
-   includeMeta?: boolean; // Include YAML frontmatter (default: true)
-   includeImages?: boolean; // Include images (default: true)
-   includeLinks?: boolean; // Include links (default: true)
-   includeTables?: boolean; // Include tables (default: true)
-   aggressiveCleanup?: boolean; // Remove ads, nav, etc. (default: true)
-   maxLength?: number; // Max output length (default: 1000000)
-   baseUrl?: string; // Base URL for resolving relative links
-
-   // URL fetch options (only used when input is a URL)
-   isUrl?: boolean; // Force treat input as URL (default: auto-detect)
-   timeout?: number; // Request timeout in ms (default: 15000)
-   followRedirects?: boolean; // Follow redirects (default: true)
-   maxRedirects?: number; // Max redirects to follow (default: 5)
-   headers?: Record<string, string>; // Custom HTTP headers
-   userAgent?: string; // Custom user agent
- }
- ```
-
- ## CLI Usage
-
- ```bash
- getmd [input] [options]
-
- Options:
-   -o, --output <file>  Output file (default: stdout)
-   --no-extract         Disable Readability content extraction
-   --no-frontmatter     Exclude metadata from YAML frontmatter
-   --no-images          Remove images from output
-   --no-links           Remove links from output
-   --no-tables          Remove tables from output
-   --max-length <n>     Maximum output length (default: 1000000)
-   --base-url <url>     Base URL for resolving relative links
-   -v, --verbose        Verbose output
-   -h, --help           Display help
- ````
-
- ## Examples
-
- ### Basic Conversion
-
- ```typescript
- import { convertToMarkdown } from "@nanocollective/getmd";
-
- const html = `
- <article>
- <h1>My Article</h1>
- <p>This is a <strong>test</strong>.</p>
- </article>
- `;
-
- const result = await convertToMarkdown(html);
- console.log(result.markdown);
- ```
-
- ### With Metadata
-
- ```typescript
- // Metadata is included by default
- const result = await convertToMarkdown(html);
  console.log(result.markdown);
- // ---
- // title: "My Article"
- // author: "John Doe"
- // readingTime: 5
- // ---
- //
- // # My Article
- // ...
-
- // To exclude metadata:
- const resultNoMeta = await convertToMarkdown(html, { includeMeta: false });
  ```
 
- ### CLI Examples
+ ### CLI
 
  ```bash
- # Convert HTML file (frontmatter included by default)
- getmd article.html -o article.md
-
- # Fetch from URL
- getmd https://blog.example.com/post -o post.md
-
- # Remove images and links
- getmd article.html --no-images --no-links -o clean.md
-
- # Exclude frontmatter metadata
- getmd article.html --no-frontmatter -o clean.md
- ```
-
- ## React Native Support
-
- get-md **fully supports React Native** including content extraction! We use `happy-dom-without-node` instead of JSDOM, which works across Node.js, React Native, and browser environments.
-
- ```typescript
- import { convertToMarkdown } from "@nanocollective/get-md";
-
- // Works in React Native with full features!
- const result = await convertToMarkdown(html, {
-   extractContent: true, // Readability extraction works!
-   includeMeta: true,
-   // ... all other options work
- });
+ getmd https://example.com -o output.md
  ```
 
- **All features work in React Native:**
- - ✅ HTML to Markdown conversion
- - ✅ Mozilla Readability content extraction
- - ✅ Metadata extraction
- - ✅ Content cleaning and optimization
- - ✅ All formatting options
+ ## Documentation
 
- No special configuration needed!
+ For full documentation, see [docs/index.md](docs/index.md):
 
- ## LLM-Powered Conversion (Optional)
-
- For even higher quality markdown output, get-md supports optional AI-powered conversion using a local LLM model. This uses [ReaderLM-v2](https://huggingface.co/jinaai/ReaderLM-v2), a model specifically trained for HTML-to-Markdown conversion.
-
- ### When to Use LLM Conversion
-
- | Use Case | Recommended Method |
- |----------|-------------------|
- | High-quality single conversions | LLM (better structure) |
- | Complex documentation sites | LLM (semantic understanding) |
- | Batch processing (1000+ pages) | Turndown (speed) |
- | CI/CD pipelines | Turndown (fast, deterministic) |
- | Real-time conversion | Turndown (sub-second) |
-
- ### Quick Start
-
- ```typescript
- import { convertToMarkdown, checkLLMModel, downloadLLMModel } from '@nanocollective/get-md';
-
- // 1. Check if model is available
- const status = await checkLLMModel();
-
- if (!status.available) {
-   // 2. Download model (one-time, ~986MB)
-   await downloadLLMModel({
-     onProgress: (downloaded, total, percentage) => {
-       console.log(`Downloading: ${percentage.toFixed(1)}%`);
-     }
-   });
- }
-
- // 3. Convert with LLM
- const result = await convertToMarkdown('https://example.com', {
-   useLLM: true,
-   onLLMEvent: (event) => {
-     if (event.type === 'conversion-complete') {
-       console.log(`Done in ${event.duration}ms`);
-     }
-   }
- });
- ```
-
- ### CLI Usage
-
- ```bash
- # Check model status
- getmd --model-info
-
- # Download model (one-time)
- getmd --download-model
-
- # Convert with LLM
- getmd https://example.com --use-llm
-
- # Compare Turndown vs LLM
- getmd https://example.com --compare -o comparison.md
-
- # Show default model path
- getmd --model-path
-
- # Remove model
- getmd --remove-model
- ```
-
- ### Configuration File
-
- You can set default options in a config file. Create `.getmdrc` or `get-md.config.json` in your project or home directory:
-
- ```json
- {
-   "useLLM": true,
-   "llmTemperature": 0.1,
-   "llmFallback": true,
-   "extractContent": true,
-   "includeMeta": true
- }
- ```
-
- CLI flags override config file settings. Use `getmd --show-config` to see current configuration.
-
- ### LLM Options
-
- ```typescript
- {
-   useLLM?: boolean; // Use LLM for conversion (default: false)
-   llmModelPath?: string; // Custom model path (optional)
-   llmTemperature?: number; // Generation temperature (default: 0.1)
-   llmMaxTokens?: number; // Max tokens (default: 512000)
-   llmFallback?: boolean; // Fallback to Turndown on error (default: true)
-
-   // Event callbacks
-   onLLMEvent?: (event: LLMEvent) => void;
-   onDownloadProgress?: (downloaded, total, percentage) => void;
-   onModelStatus?: (status) => void;
-   onConversionProgress?: (progress) => void;
- }
- ```
-
- ### Requirements
-
- - **Disk Space**: ~1GB for the model file
- - **RAM**: 2-4GB during inference (8GB+ system recommended)
- - **Speed**: 5-10 seconds per typical webpage (vs <100ms for Turndown)
-
- ### Model Management
-
- ```typescript
- import {
-   checkLLMModel,
-   downloadLLMModel,
-   removeLLMModel,
-   getLLMModelInfo
- } from '@nanocollective/get-md';
-
- // Check model availability
- const status = await checkLLMModel();
- console.log(status.available); // true/false
- console.log(status.sizeFormatted); // "986MB"
-
- // Get model information
- const info = getLLMModelInfo();
- console.log(info.defaultPath); // ~/.get-md/models/ReaderLM-v2-Q4_K_M.gguf
- console.log(info.recommendedModel); // "ReaderLM-v2-Q4_K_M"
-
- // Remove model
- await removeLLMModel();
- ```
-
- ## Why get-md?
-
- ### For LLMs
-
- - **Consistent output**: Deterministic markdown formatting helps LLMs learn patterns
- - **Clean structure**: Proper heading hierarchy, list formatting, and spacing
- - **Noise removal**: Automatically removes ads, navigation, footers, etc.
- - **Fast processing**: <100ms per document enables real-time workflows
-
- ### vs Other Tools
-
- - **Faster than LLM-based extractors**: No model inference overhead
- - **More reliable**: Deterministic output, no hallucinations
- - **Cheaper**: No API costs
- - **Privacy-friendly**: Runs locally, no data sent to third parties
+ - [API Reference](docs/api.md)
+ - [CLI Usage](docs/cli.md)
+ - [LLM-Powered Conversion](docs/llm.md)
+ - [React Native Support](docs/react-native.md)
 
  ## Community
 
- We're a small community-led team building local and privacy-first AI solutions under the [Nano Collective](https://nanocollective.org) and would love your help! Whether you're interested in contributing code, documentation, or just being part of our community, there are several ways to get involved.
-
- **If you want to contribute to the code:**
-
- - Read our detailed [CONTRIBUTING.md](CONTRIBUTING.md) guide for information on development setup, coding standards, and how to submit your changes.
-
- **If you want to be part of our community or help with other aspects like design or marketing:**
-
- - Join our Discord server to connect with other users, ask questions, share ideas, and get help: [Join our Discord server](https://discord.gg/ktPDV6rekE)
+ We're a small community-led team building local and privacy-first AI solutions under the [Nano Collective](https://nanocollective.org).
 
- - Head to our GitHub issues or discussions to open and join current conversations with others in the community.
+ - [Contributing Guide](CONTRIBUTING.md)
+ - [Discord Server](https://discord.gg/ktPDV6rekE)
+ - [GitHub Issues](https://github.com/nanocollective/get-md/issues)
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@nanocollective/get-md",
-   "version": "1.1.0",
+   "version": "1.1.1",
    "description": "Fast HTML to Markdown converter optimized for LLM consumption",
    "type": "module",
    "main": "./dist/index.js",
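
The `version` field moves from 1.1.0 to 1.1.1, which under semantic versioning is a patch-level bump, in line with this release touching only the README and the version number. A minimal sketch of that classification follows. `parseVersion` and `classifyBump` are hypothetical helpers written for illustration; they are not part of get-md, and they assume plain `MAJOR.MINOR.PATCH` strings with no pre-release or build suffix.

```typescript
// Hypothetical helpers (not part of get-md): report which component differs
// between two plain MAJOR.MINOR.PATCH version strings.
type Bump = "major" | "minor" | "patch" | "none";

// Split "1.1.0" into [1, 1, 0]; anything else (e.g. "1.1.0-beta") is rejected.
function parseVersion(version: string): [number, number, number] {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (match === null) {
    throw new Error(`Unsupported version string: ${version}`);
  }
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// The most significant component that changed determines the bump type.
function classifyBump(from: string, to: string): Bump {
  const [fromMajor, fromMinor, fromPatch] = parseVersion(from);
  const [toMajor, toMinor, toPatch] = parseVersion(to);
  if (toMajor !== fromMajor) return "major";
  if (toMinor !== fromMinor) return "minor";
  if (toPatch !== fromPatch) return "patch";
  return "none";
}

console.log(classifyBump("1.1.0", "1.1.1")); // prints "patch"
```

The sketch only reports which component changed; it does not distinguish upgrades from downgrades.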