html-to-markdown-wasm 2.6.6 → 2.7.0

package/README.md CHANGED
@@ -72,6 +72,42 @@ console.log(markdown);
  // This is **fast**!
  ```

+ ### Reusing Options Handles
+
+ ```ts
+ import {
+   convertWithOptionsHandle,
+   createConversionOptionsHandle,
+ } from '@html-to-markdown/wasm';
+
+ const handle = createConversionOptionsHandle({ hocrSpatialTables: false });
+ const markdown = convertWithOptionsHandle('<h1>Reusable</h1>', handle);
+ ```
+
+ ### Byte-Based Input (Buffers / Uint8Array)
+
+ When you already have raw bytes (e.g., `fs.readFileSync`, Fetch API responses), skip decoding them with `TextDecoder` by calling the byte-friendly helpers:
+
+ ```ts
+ import {
+   convertBytes,
+   convertBytesWithOptionsHandle,
+   createConversionOptionsHandle,
+   convertBytesWithInlineImages,
+ } from '@html-to-markdown/wasm';
+ import { readFileSync } from 'node:fs';
+
+ const htmlBytes = readFileSync('input.html'); // Buffer is a Uint8Array subclass
+ const markdown = convertBytes(htmlBytes);
+
+ const handle = createConversionOptionsHandle({ headingStyle: 'atx' });
+ const markdownFromHandle = convertBytesWithOptionsHandle(htmlBytes, handle);
+
+ const inlineExtraction = convertBytesWithInlineImages(htmlBytes, null, {
+   maxDecodedSizeBytes: 5 * 1024 * 1024,
+ });
+ ```
+
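Reusing a handle matters most when many documents share one configuration. The following is a minimal sketch, assuming only what this release's type declarations expose: a handle with `free()` and a converter shaped like `convertBytesWithOptionsHandle(html, handle)`. The helper name `convertBatch` and its injected `createHandle` / `convertWithHandle` parameters are illustrative and not part of the package.

```ts
// Hypothetical shapes standing in for the package exports; only `free()` and
// the (bytes, handle) => string call shape are taken from the declarations.
interface FreeableHandle {
  free(): void;
}

type ConvertWithHandle<H> = (html: Uint8Array, handle: H) => string;

// Converts every document with a single options handle, then releases it.
function convertBatch<H extends FreeableHandle>(
  documents: readonly Uint8Array[],
  createHandle: () => H,
  convertWithHandle: ConvertWithHandle<H>,
): string[] {
  const handle = createHandle(); // options are parsed once, not per document
  try {
    return documents.map((bytes) => convertWithHandle(bytes, handle));
  } finally {
    handle.free(); // runs even if one of the conversions throws
  }
}
```

In an application, the two callbacks would presumably be `() => createConversionOptionsHandle({ headingStyle: 'atx' })` and `convertBytesWithOptionsHandle`.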
  ### With Options

  ```typescript
package/dist/README.md CHANGED
@@ -89,7 +89,7 @@ const markdown = convert(html, {
  });
  ```

- **Performance:** Native bindings average ~19k ops/sec, WASM averages ~16k ops/sec (benchmarked on complex real-world documents).
+ **Performance:** The shared fixture harness (`task bench:bindings`) now clocks Node, Python, and Ruby at ~1.3–1.4k ops/sec (≈165–175 MB/s) on the 129 KB Wikipedia “Lists” page, with the Rust CLI at ~1.7k ops/sec, thanks to the new Buffer/Uint8Array fast paths and release-mode harness. PHP lands around 0.5k ops/sec (≈65 MB/s), and WASM hits ~0.9k ops/sec, which is plenty for browsers, Deno, and edge runtimes.

  See the JavaScript guides for full API documentation:

@@ -146,38 +146,65 @@ Benchmarked on Apple M4 with complex real-world documents (Wikipedia articles, t

  ### Operations per Second (higher is better)

- | Document Type | Node.js (NAPI) | WASM | Python (PyO3) | Speedup (Node vs Python) |
- | -------------------------- | -------------- | ------ | ------------- | ------------------------ |
- | **Small (5 paragraphs)** | 86,233 | 70,300 | 8,443 | **10.2×** |
- | **Medium (25 paragraphs)** | 18,979 | 15,282 | 1,846 | **10.3×** |
- | **Large (100 paragraphs)** | 4,907 | 3,836 | 438 | **11.2×** |
- | **Tables (complex)** | 5,003 | 3,748 | 4,829 | 1.|
- | **Lists (nested)** | 1,819 | 1,391 | 1,165 | **1.6×** |
- | **Wikipedia (129KB)** | 1,125 | 1,022 | - | - |
- | **Wikipedia (653KB)** | 156 | 147 | - | - |
+ Derived directly from `tools/runtime-bench/results/latest.json` (Apple M4, shared fixtures):
+
+ | Fixture | Node.js (NAPI) | WASM | Python (PyO3) | Speedup (Node vs Python) |
+ | ---------------------- | -------------- | ---- | ------------- | ------------------------ |
+ | **Lists (Timeline)** | 1,308 | 882 | 1,405 | **0.9×** |
+ | **Tables (Countries)** | 331 | 242 | 352 | **0.9×** |
+ | **Medium (Python)** | 150 | 121 | 158 | **1.0×** |
+ | **Large (Rust)** | 163 | 124 | 183 | **0.9×** |
+ | **Small (Intro)** | 208 | 163 | 223 | **0.9×** |
+ | **HOCR German PDF** | 2,944 | 1,637 | 2,991 | **1.0×** |
+ | **HOCR Invoice** | 27,326 | 7,775 | 23,500 | **1.2×** |
+ | **HOCR Tables** | 3,475 | 1,667 | 3,464 | **1.0×** |

  ### Average Performance Summary

- | Implementation | Avg ops/sec | vs WASM | vs Python | Best For |
- | --------------------- | ---------------- | ------------ | --------------- | --------------------------------- |
- | **Node.js (NAPI-RS)** | **18,162** | 1.17× faster | **7.4× faster** | Maximum throughput in Node.js/Bun |
- | **WebAssembly** | **15,536** | baseline | **6.3× faster** | Universal (Deno, browsers, edge) |
- | **Python (PyO3)** | **2,465** | 6.3× slower | baseline | Python ecosystem integration |
- | **Rust CLI/Binary** | **150-210 MB/s** | - | - | Standalone processing |
+ | Implementation | Avg ops/sec (fixtures) | vs Python | Notes |
+ | --------------------- | ---------------------- | --------- | ----- |
+ | **Rust CLI/Binary** | **4,996** | **1.2× faster** | Preprocessing now stays in one pass + reuses `parse_owned`, so the CLI leads six of the eight fixtures |
+ | **Node.js (NAPI-RS)** | **4,488** | 1.1× | Buffer/handle combo keeps Node within ~10 % of the Rust core while serving JS runtimes |
+ | **Ruby (magnus)** | **4,278** | 1.1× | Still extremely fast; ~25 k ops/sec on HOCR invoices without extra work |
+ | **Python (PyO3)** | **4,034** | baseline | Release-mode harness plus handle reuse keep it competitive, but it now trails Node/Rust |
+ | **WebAssembly** | **1,576** | 0.4× | Portable option for Deno/browsers/edge using the new byte APIs |
+ | **PHP (ext)** | **1,480** | 0.4× | Composer extension holds steady at 35–70 MB/s once the PIE build is installed |
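
The summary figures appear to be plain arithmetic means over the eight fixtures. The sketch below recomputes them from the per-fixture table, with values copied from the Node.js, Python, and WASM columns. The function names are illustrative, and the averaging method is an inference from the numbers, not something the README states.

```ts
// Per-fixture ops/sec in table order: Lists, Tables, Medium, Large, Small,
// HOCR German PDF, HOCR Invoice, HOCR Tables.
const opsPerSec = {
  node: [1308, 331, 150, 163, 208, 2944, 27326, 3475],
  python: [1405, 352, 158, 183, 223, 2991, 23500, 3464],
  wasm: [882, 242, 121, 124, 163, 1637, 7775, 1667],
};

type Runtime = keyof typeof opsPerSec;

// Arithmetic mean of one runtime's per-fixture throughput.
function meanOps(runtime: Runtime): number {
  const values = opsPerSec[runtime];
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Ratio of a runtime's mean to the Python baseline ("vs Python" column).
function versusPython(runtime: Runtime): number {
  return meanOps(runtime) / meanOps('python');
}

for (const runtime of Object.keys(opsPerSec) as Runtime[]) {
  console.log(runtime, Math.floor(meanOps(runtime)), versusPython(runtime).toFixed(2));
}
```

Under that assumption the truncated means come out to 4,488 (Node.js), 4,034 (Python), and 1,576 (WASM), and the ratios to roughly 1.11 for Node.js and 0.39 for WASM.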

  ### Key Insights

- - **JavaScript bindings are fastest**: Native Node.js bindings achieve ~18k ops/sec average, with WASM close behind at ~16k ops/sec
- - **Python is 6-10× slower**: Despite using the same Rust core, PyO3 FFI overhead significantly impacts Python performance
- - **Small documents**: Both JS implementations reach 70-90k ops/sec on simple HTML
- - **Large documents**: Performance gap widens with complexity
+ - **Rust now leads throughput**: the fused preprocessing + `parse_owned` pathway pushes the CLI to ~1.7 k ops/sec on the 129 KB lists page and ~31 k ops/sec on the HOCR invoice fixture.
+ - **Node.js trails by roughly 10 % on average** after the buffer/handle work: ~1.3 k ops/sec on the lists fixture and ~27 k ops/sec on HOCR invoices without any UTF-16 copies.
+ - **Python remains competitive** but now sits below Node/Rust (~4.0 k average ops/sec); stick to the v2 API to avoid the deprecated compatibility shim.
+ - **PHP and WASM stay in the 35–70 MB/s band**, which is plenty for Composer queues or edge runtimes as long as the extension/module is built ahead of time.
+ - **Rust CLI results now mirror the bindings**, since `task bench:bindings` runs the harness with `cargo run --release` by default; profile there, then push optimizations down into each FFI layer.
+
+ ### Runtime Benchmarks (PHP / Ruby / Python / Node / WASM)
+
+ Measured on Apple M4 using the fixture-driven runtime harness in `tools/runtime-bench` (`task bench:bindings`). Every binding consumes the exact same HTML fixtures and hOCR samples from `test_documents/`:
+
+ | Document | Size | Ruby ops/sec | PHP ops/sec | Python ops/sec | Node ops/sec | WASM ops/sec | Rust ops/sec |
+ | ------------------- | -------- | ------------ | ----------- | -------------- | ------------ | ------------ | ------------ |
+ | Lists (Timeline) | 129 KB | 1,349 | 533 | 1,405 | 1,308 | 882 | **1,700** |
+ | Tables (Countries) | 360 KB | 326 | 118 | 352 | 331 | 242 | **416** |
+ | Medium (Python) | 657 KB | 157 | 59 | 158 | 150 | 121 | **190** |
+ | Large (Rust) | 567 KB | 174 | 65 | 183 | 163 | 124 | **220** |
+ | Small (Intro) | 463 KB | 214 | 83 | 223 | 208 | 163 | **258** |
+ | HOCR German PDF | 44 KB | 2,936 | 1,007 | **2,991** | 2,944 | 1,637 | 2,760 |
+ | HOCR Invoice | 4 KB | 25,740 | 8,781 | 23,500 | 27,326 | 7,775 | **31,345** |
+ | HOCR Embedded Tables | 37 KB | 3,328 | 1,194 | 3,464 | **3,475** | 1,667 | 3,080 |
+
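Ops/sec and MB/s are two views of the same measurement: data throughput is the fixture size multiplied by the conversion rate. The sketch below is a rough illustration, assuming 1 MB = 1024 KB and the rounded sizes from the table. The helper name is hypothetical and is not taken from the harness.

```ts
// Convert a conversion rate into data throughput.
// Assumes sizeKB is the (rounded) fixture size and 1 MB = 1024 KB.
function throughputMBps(opsPerSec: number, sizeKB: number): number {
  return (opsPerSec * sizeKB) / 1024;
}

// PHP rows from the table: Lists (129 KB) and Medium (657 KB).
const phpLists = throughputMBps(533, 129);
const phpMedium = throughputMBps(59, 657);
console.log(phpLists.toFixed(1), phpMedium.toFixed(1));
```

With these inputs the two PHP rows land at about 67 MB/s and 38 MB/s, both inside the 35–70 MB/s band quoted for the PHP extension.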
+ The harness shells out to each runtime’s lightweight benchmark driver (`packages/*/bin/benchmark.*`, `crates/*/bin/benchmark.ts`), feeds fixtures defined in `tools/runtime-bench/fixtures/*.toml`, and writes machine-readable JSON reports (`tools/runtime-bench/results/latest.json`) for regression tracking. Add new languages or scenarios by extending those fixture files and drivers.
+
+ Use `task bench:bindings` to regenerate throughput numbers across all bindings or `task bench:bindings:profile` to capture CPU/memory samples while the benchmarks run. To focus on specific languages or fixtures, pass `--language` / `--fixture` directly to `cargo run --manifest-path tools/runtime-bench/Cargo.toml -- …`.
+
+ Need a call-stack view of the Rust core? Run `task flamegraph:rust` (or call the harness with `--language rust --flamegraph path.svg`) to profile a fixture and dump a ready-to-inspect flamegraph in `tools/runtime-bench/results/`.

  **Note on Python performance**: The current Python bindings have optimization opportunities. The v2 API with direct `convert()` calls performs best; avoid the v1 compatibility layer for performance-critical applications.

  ## Compatibility (v1 → v2)

  - V2’s Rust core sustains **150–210 MB/s** throughput; V1 averaged **≈ 2.5 MB/s** in its Python/BeautifulSoup implementation (60–80× faster).
- - The Python package offers a compatibility shim in `html_to_markdown.v1_compat` (`convert_to_markdown`, `convert_to_markdown_stream`, `markdownify`). Details and keyword mappings live in [Python README](https://github.com/Goldziher/html-to-markdown/blob/main/packages/python/README.md#v1-compatibility).
+ - The Python package offers a compatibility shim in `html_to_markdown.v1_compat` (`convert_to_markdown`, `convert_to_markdown_stream`, `markdownify`). The shim is deprecated, emits `DeprecationWarning` on every call, and will be removed in v3.0, so plan migrations now. Details and keyword mappings live in [Python README](https://github.com/Goldziher/html-to-markdown/blob/main/packages/python/README.md#v1-compatibility).
  - CLI flag changes, option renames, and other breaking updates are summarised in [CHANGELOG](https://github.com/Goldziher/html-to-markdown/blob/main/CHANGELOG.md#breaking-changes).

  ## Community
@@ -1,5 +1,6 @@
  /* tslint:disable */
  /* eslint-disable */
+ export function convertBytes(html: Uint8Array, options: any): string;
  /**
  * Convert HTML to Markdown
  *
@@ -19,34 +20,20 @@
  * ```
  */
  export function convert(html: string, options: any): string;
- /**
- * Convert HTML to Markdown while collecting inline images
- *
- * # Arguments
- *
- * * `html` - The HTML string to convert
- * * `options` - Optional conversion options (as a JavaScript object)
- * * `image_config` - Configuration for inline image extraction
- *
- * # Example
- *
- * ```javascript
- * import { convertWithInlineImages, WasmInlineImageConfig } from '@html-to-markdown/wasm';
- *
- * const html = '<img src="data:image/png;base64,..." alt="test">';
- * const config = new WasmInlineImageConfig(1024 * 1024);
- * config.inferDimensions = true;
- *
- * const result = convertWithInlineImages(html, null, config);
- * console.log(result.markdown);
- * console.log(result.inlineImages.length);
- * ```
- */
+ export function convertWithOptionsHandle(html: string, handle: WasmConversionOptionsHandle): string;
+ export function createConversionOptionsHandle(options: any): WasmConversionOptionsHandle;
  export function convertWithInlineImages(html: string, options: any, image_config?: WasmInlineImageConfig | null): WasmHtmlExtraction;
+ export function convertBytesWithInlineImages(html: Uint8Array, options: any, image_config?: WasmInlineImageConfig | null): WasmHtmlExtraction;
  /**
  * Initialize panic hook for better error messages in the browser
  */
  export function init(): void;
+ export function convertBytesWithOptionsHandle(html: Uint8Array, handle: WasmConversionOptionsHandle): string;
+ export class WasmConversionOptionsHandle {
+   free(): void;
+   [Symbol.dispose](): void;
+   constructor(options: any);
+ }
  /**
  * Result of HTML extraction with inline images
  */
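
With both string and byte entry points declared, callers that receive mixed input can pick the matching one at runtime. This is a hedged sketch: the factory name `makeConverter` is hypothetical, and the two converters are passed in as parameters (shaped like the `convert` and `convertBytes` declarations) so the snippet does not depend on how the package is loaded. Passing `null` as the default options value is an assumption based on the older `convertWithInlineImages(html, null, config)` example.

```ts
type StringConverter = (html: string, options: unknown) => string;
type BytesConverter = (html: Uint8Array, options: unknown) => string;

// Returns a function that routes strings to the string API and byte arrays
// (including Node.js Buffers, which extend Uint8Array) to the byte API.
function makeConverter(convert: StringConverter, convertBytes: BytesConverter) {
  return (input: string | Uint8Array, options: unknown = null): string =>
    typeof input === 'string' ? convert(input, options) : convertBytes(input, options);
}
```

Keeping the dispatch in one place means call sites do not need to know whether their HTML arrived as text or as raw bytes.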
@@ -231,6 +231,36 @@ function getArrayJsValueFromWasm0(ptr, len) {
      }
      return result;
  }
+ /**
+ * @param {Uint8Array} html
+ * @param {any} options
+ * @returns {string}
+ */
+ export function convertBytes(html, options) {
+     let deferred2_0;
+     let deferred2_1;
+     try {
+         const retptr = wasm.__wbindgen_add_to_stack_pointer(-16);
+         wasm.convertBytes(retptr, addHeapObject(html), addHeapObject(options));
+         var r0 = getDataViewMemory0().getInt32(retptr + 4 * 0, true);
+         var r1 = getDataViewMemory0().getInt32(retptr + 4 * 1, true);
+         var r2 = getDataViewMemory0().getInt32(retptr + 4 * 2, true);
+         var r3 = getDataViewMemory0().getInt32(retptr + 4 * 3, true);
+         var ptr1 = r0;
+         var len1 = r1;
+         if (r3) {
+             ptr1 = 0; len1 = 0;
+             throw takeObject(r2);
+         }
+         deferred2_0 = ptr1;
+         deferred2_1 = len1;
+         return getStringFromWasm0(ptr1, len1);
+     } finally {
+         wasm.__wbindgen_add_to_stack_pointer(16);
+         wasm.__wbindgen_export4(deferred2_0, deferred2_1, 1);
+     }
+ }
+
  /**
  * Convert HTML to Markdown
  *
@@ -285,27 +315,59 @@ function _assertClass(instance, klass) {
      }
  }
  /**
- * Convert HTML to Markdown while collecting inline images
- *
- * # Arguments
- *
- * * `html` - The HTML string to convert
- * * `options` - Optional conversion options (as a JavaScript object)
- * * `image_config` - Configuration for inline image extraction
- *
- * # Example
- *
- * ```javascript
- * import { convertWithInlineImages, WasmInlineImageConfig } from '@html-to-markdown/wasm';
- *
- * const html = '<img src="data:image/png;base64,..." alt="test">';
- * const config = new WasmInlineImageConfig(1024 * 1024);
- * config.inferDimensions = true;
- *
- * const result = convertWithInlineImages(html, null, config);
- * console.log(result.markdown);
- * console.log(result.inlineImages.length);
- * ```
+ * @param {string} html
+ * @param {WasmConversionOptionsHandle} handle
+ * @returns {string}
+ */
+ export function convertWithOptionsHandle(html, handle) {
+     let deferred3_0;
+     let deferred3_1;
+     try {
+         const retptr = wasm.__wbindgen_add_to_stack_pointer(-16);
+         const ptr0 = passStringToWasm0(html, wasm.__wbindgen_export, wasm.__wbindgen_export2);
+         const len0 = WASM_VECTOR_LEN;
+         _assertClass(handle, WasmConversionOptionsHandle);
+         wasm.convertWithOptionsHandle(retptr, ptr0, len0, handle.__wbg_ptr);
+         var r0 = getDataViewMemory0().getInt32(retptr + 4 * 0, true);
+         var r1 = getDataViewMemory0().getInt32(retptr + 4 * 1, true);
+         var r2 = getDataViewMemory0().getInt32(retptr + 4 * 2, true);
+         var r3 = getDataViewMemory0().getInt32(retptr + 4 * 3, true);
+         var ptr2 = r0;
+         var len2 = r1;
+         if (r3) {
+             ptr2 = 0; len2 = 0;
+             throw takeObject(r2);
+         }
+         deferred3_0 = ptr2;
+         deferred3_1 = len2;
+         return getStringFromWasm0(ptr2, len2);
+     } finally {
+         wasm.__wbindgen_add_to_stack_pointer(16);
+         wasm.__wbindgen_export4(deferred3_0, deferred3_1, 1);
+     }
+ }
+
+ /**
+ * @param {any} options
+ * @returns {WasmConversionOptionsHandle}
+ */
+ export function createConversionOptionsHandle(options) {
+     try {
+         const retptr = wasm.__wbindgen_add_to_stack_pointer(-16);
+         wasm.createConversionOptionsHandle(retptr, addHeapObject(options));
+         var r0 = getDataViewMemory0().getInt32(retptr + 4 * 0, true);
+         var r1 = getDataViewMemory0().getInt32(retptr + 4 * 1, true);
+         var r2 = getDataViewMemory0().getInt32(retptr + 4 * 2, true);
+         if (r2) {
+             throw takeObject(r1);
+         }
+         return WasmConversionOptionsHandle.__wrap(r0);
+     } finally {
+         wasm.__wbindgen_add_to_stack_pointer(16);
+     }
+ }
+
+ /**
  * @param {string} html
  * @param {any} options
  * @param {WasmInlineImageConfig | null} [image_config]
@@ -334,6 +396,33 @@ export function convertWithInlineImages(html, options, image_config) {
      }
  }

+ /**
+ * @param {Uint8Array} html
+ * @param {any} options
+ * @param {WasmInlineImageConfig | null} [image_config]
+ * @returns {WasmHtmlExtraction}
+ */
+ export function convertBytesWithInlineImages(html, options, image_config) {
+     try {
+         const retptr = wasm.__wbindgen_add_to_stack_pointer(-16);
+         let ptr0 = 0;
+         if (!isLikeNone(image_config)) {
+             _assertClass(image_config, WasmInlineImageConfig);
+             ptr0 = image_config.__destroy_into_raw();
+         }
+         wasm.convertBytesWithInlineImages(retptr, addHeapObject(html), addHeapObject(options), ptr0);
+         var r0 = getDataViewMemory0().getInt32(retptr + 4 * 0, true);
+         var r1 = getDataViewMemory0().getInt32(retptr + 4 * 1, true);
+         var r2 = getDataViewMemory0().getInt32(retptr + 4 * 2, true);
+         if (r2) {
+             throw takeObject(r1);
+         }
+         return WasmHtmlExtraction.__wrap(r0);
+     } finally {
+         wasm.__wbindgen_add_to_stack_pointer(16);
+     }
+ }
+
  /**
  * Initialize panic hook for better error messages in the browser
  */
@@ -341,6 +430,85 @@ export function init() {
      wasm.init();
  }

+ /**
+ * @param {Uint8Array} html
+ * @param {WasmConversionOptionsHandle} handle
+ * @returns {string}
+ */
+ export function convertBytesWithOptionsHandle(html, handle) {
+     let deferred2_0;
+     let deferred2_1;
+     try {
+         const retptr = wasm.__wbindgen_add_to_stack_pointer(-16);
+         _assertClass(handle, WasmConversionOptionsHandle);
+         wasm.convertBytesWithOptionsHandle(retptr, addHeapObject(html), handle.__wbg_ptr);
+         var r0 = getDataViewMemory0().getInt32(retptr + 4 * 0, true);
+         var r1 = getDataViewMemory0().getInt32(retptr + 4 * 1, true);
+         var r2 = getDataViewMemory0().getInt32(retptr + 4 * 2, true);
+         var r3 = getDataViewMemory0().getInt32(retptr + 4 * 3, true);
+         var ptr1 = r0;
+         var len1 = r1;
+         if (r3) {
+             ptr1 = 0; len1 = 0;
+             throw takeObject(r2);
+         }
+         deferred2_0 = ptr1;
+         deferred2_1 = len1;
+         return getStringFromWasm0(ptr1, len1);
+     } finally {
+         wasm.__wbindgen_add_to_stack_pointer(16);
+         wasm.__wbindgen_export4(deferred2_0, deferred2_1, 1);
+     }
+ }
+
+ const WasmConversionOptionsHandleFinalization = (typeof FinalizationRegistry === 'undefined')
+     ? { register: () => {}, unregister: () => {} }
+     : new FinalizationRegistry(ptr => wasm.__wbg_wasmconversionoptionshandle_free(ptr >>> 0, 1));
+
+ export class WasmConversionOptionsHandle {
+
+     static __wrap(ptr) {
+         ptr = ptr >>> 0;
+         const obj = Object.create(WasmConversionOptionsHandle.prototype);
+         obj.__wbg_ptr = ptr;
+         WasmConversionOptionsHandleFinalization.register(obj, obj.__wbg_ptr, obj);
+         return obj;
+     }
+
+     __destroy_into_raw() {
+         const ptr = this.__wbg_ptr;
+         this.__wbg_ptr = 0;
+         WasmConversionOptionsHandleFinalization.unregister(this);
+         return ptr;
+     }
+
+     free() {
+         const ptr = this.__destroy_into_raw();
+         wasm.__wbg_wasmconversionoptionshandle_free(ptr, 0);
+     }
+     /**
+     * @param {any} options
+     */
+     constructor(options) {
+         try {
+             const retptr = wasm.__wbindgen_add_to_stack_pointer(-16);
+             wasm.wasmconversionoptionshandle_new(retptr, addHeapObject(options));
+             var r0 = getDataViewMemory0().getInt32(retptr + 4 * 0, true);
+             var r1 = getDataViewMemory0().getInt32(retptr + 4 * 1, true);
+             var r2 = getDataViewMemory0().getInt32(retptr + 4 * 2, true);
+             if (r2) {
+                 throw takeObject(r1);
+             }
+             this.__wbg_ptr = r0 >>> 0;
+             WasmConversionOptionsHandleFinalization.register(this, this.__wbg_ptr, this);
+             return this;
+         } finally {
+             wasm.__wbindgen_add_to_stack_pointer(16);
+         }
+     }
+ }
+ if (Symbol.dispose) WasmConversionOptionsHandle.prototype[Symbol.dispose] = WasmConversionOptionsHandle.prototype.free;
+
  const WasmHtmlExtractionFinalization = (typeof FinalizationRegistry === 'undefined')
      ? { register: () => {}, unregister: () => {} }
      : new FinalizationRegistry(ptr => wasm.__wbg_wasmhtmlextraction_free(ptr >>> 0, 1));
@@ -831,6 +999,17 @@ export function __wbg_instanceof_ArrayBuffer_70beb1189ca63b38(arg0) {
      return ret;
  };

+ export function __wbg_instanceof_Object_10bb762262230c68(arg0) {
+     let result;
+     try {
+         result = getObject(arg0) instanceof Object;
+     } catch (_) {
+         result = false;
+     }
+     const ret = result;
+     return ret;
+ };
+
  export function __wbg_instanceof_Uint8Array_20c8e73002f7af98(arg0) {
      let result;
      try {
@@ -857,6 +1036,11 @@ export function __wbg_iterator_e5822695327a3c39() {
      return addHeapObject(ret);
  };

+ export function __wbg_keys_b4d27b02ad14f4be(arg0) {
+     const ret = Object.keys(getObject(arg0));
+     return addHeapObject(ret);
+ };
+
  export function __wbg_length_69bca3cb64fc8748(arg0) {
      const ret = getObject(arg0).length;
      return ret;
Binary file
@@ -1,12 +1,19 @@
  /* tslint:disable */
  /* eslint-disable */
  export const memory: WebAssembly.Memory;
+ export const __wbg_wasmconversionoptionshandle_free: (a: number, b: number) => void;
  export const __wbg_wasmhtmlextraction_free: (a: number, b: number) => void;
  export const __wbg_wasminlineimage_free: (a: number, b: number) => void;
  export const __wbg_wasminlineimageconfig_free: (a: number, b: number) => void;
  export const __wbg_wasminlineimagewarning_free: (a: number, b: number) => void;
  export const convert: (a: number, b: number, c: number, d: number) => void;
+ export const convertBytes: (a: number, b: number, c: number) => void;
+ export const convertBytesWithInlineImages: (a: number, b: number, c: number, d: number) => void;
+ export const convertBytesWithOptionsHandle: (a: number, b: number, c: number) => void;
  export const convertWithInlineImages: (a: number, b: number, c: number, d: number, e: number) => void;
+ export const convertWithOptionsHandle: (a: number, b: number, c: number, d: number) => void;
+ export const createConversionOptionsHandle: (a: number, b: number) => void;
+ export const wasmconversionoptionshandle_new: (a: number, b: number) => void;
  export const wasmhtmlextraction_inlineImages: (a: number, b: number) => void;
  export const wasmhtmlextraction_markdown: (a: number, b: number) => void;
  export const wasmhtmlextraction_warnings: (a: number, b: number) => void;
package/dist/package.json CHANGED
@@ -4,7 +4,7 @@
  "collaborators": [
    "Na'aman Hirschfeld <nhirschfeld@gmail.com>"
  ],
- "version": "2.6.6",
+ "version": "2.7.0",
  "license": "MIT",
  "repository": {
    "type": "git",