npm - html-to-markdown-wasm - Versions diffs - 2.12.1 → 2.14.1 - Mend

html-to-markdown-wasm 2.12.1 → 2.14.1

This diff shows the contents of publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects the changes between package versions as they appear in their public registries.

Files changed (11)

package/README.md +107 -0
package/dist/README.md +420 -13
package/dist/html_to_markdown_wasm_bg.wasm +0 -0
package/dist/package.json +1 -1
package/dist-node/README.md +420 -13
package/dist-node/html_to_markdown_wasm_bg.wasm +0 -0
package/dist-node/package.json +1 -1
package/dist-web/README.md +420 -13
package/dist-web/html_to_markdown_wasm_bg.wasm +0 -0
package/dist-web/package.json +1 -1
package/package.json +2 -2

package/README.md CHANGED

@@ -291,6 +291,113 @@ for (const img of result.inlineImages) {
 }
 ```
+## Metadata Extraction
+Extract document metadata (headers, links, images, structured data) alongside Markdown conversion:
+```typescript
+import { convertWithMetadata, WasmMetadataConfig } from 'html-to-markdown-wasm';
+const html = `
+  <html lang="en">
+    <head><title>My Article</title></head>
+    <body>
+      <h1>Main Title</h1>
+      <p>Content with <a href="https://example.com">a link</a></p>
+      <img src="https://example.com/image.jpg" alt="Example image">
+    </body>
+  </html>
+`;
+const config = new WasmMetadataConfig();
+config.extractHeaders = true;
+config.extractLinks = true;
+config.extractImages = true;
+config.extractStructuredData = true;
+config.maxStructuredDataSize = 1_000_000; // 1MB limit
+const result = convertWithMetadata(html, null, config);
+console.log(result.markdown);
+console.log('Document metadata:', result.metadata.document);
+// {
+//   title: 'My Article',
+//   language: 'en',
+//   ...
+// }
+console.log('Headers:', result.metadata.headers);
+// [
+//   { level: 1, text: 'Main Title', id: undefined, depth: 0, htmlOffset: ... }
+// ]
+console.log('Links:', result.metadata.links);
+// [
+//   {
+//     href: 'https://example.com',
+//     text: 'a link',
+//     linkType: 'external',
+//     rel: [],
+//     ...
+//   }
+// ]
+console.log('Images:', result.metadata.images);
+// [
+//   {
+//     src: 'https://example.com/image.jpg',
+//     alt: 'Example image',
+//     imageType: 'external',
+//     ...
+//   }
+// ]
+```
+### Metadata Configuration
+The `WasmMetadataConfig` class controls what metadata is extracted:
+```typescript
+import { WasmMetadataConfig } from 'html-to-markdown-wasm';
+const config = new WasmMetadataConfig();
+// Enable/disable extraction types
+config.extractHeaders = true;              // h1-h6 elements
+config.extractLinks = true;                // <a> elements with link type classification
+config.extractImages = true;               // <img> and <svg> elements
+config.extractStructuredData = true;       // JSON-LD, Microdata, RDFa
+// Limit structured data size to prevent memory exhaustion
+config.maxStructuredDataSize = 1_000_000;  // 1MB default
+```
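Note on `maxStructuredDataSize`: the README says the limit exists to prevent memory exhaustion but does not show how it is applied. The sketch below illustrates the general idea of a byte budget only. `collectWithinBudget` is a hypothetical helper, not part of `html-to-markdown-wasm`, and the package's real behaviour when the limit is reached may differ.

```typescript
// Hypothetical helper (not a library API): keep structured-data blocks in
// document order until the UTF-8 byte budget would be exceeded.
function collectWithinBudget(blocks: string[], maxBytes: number): string[] {
  const encoder = new TextEncoder();
  const kept: string[] = [];
  let used = 0;
  for (const block of blocks) {
    const size = encoder.encode(block).length; // size in bytes, not characters
    if (used + size > maxBytes) break;         // the next block would exceed the cap
    kept.push(block);
    used += size;
  }
  return kept;
}

// Three small JSON-LD payloads of 19, 18, and 24 bytes.
const jsonLdBlocks = [
  '{"@type":"Article"}',
  '{"@type":"Person"}',
  '{"@type":"Organization"}',
];
console.log(collectWithinBudget(jsonLdBlocks, 40));
```

With a 40-byte budget the first two blocks fit (37 bytes) and the third is left out.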
+### Metadata Structure
+The returned metadata object includes:
+- **document**: Document-level metadata (title, description, keywords, language, OG tags, Twitter cards, etc.)
+- **headers**: Array of header elements with level, text, id, and document position
+- **links**: Array of links with href, text, type (anchor/internal/external/email/phone), and rel attributes
+- **images**: Array of images with src, alt text, dimensions, and type classification (dataUri/external/relative/svg)
+- **structuredData**: Array of JSON-LD, Microdata, and RDFa blocks
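Note on link types: the five values listed for links (anchor/internal/external/email/phone) can be illustrated with a small classifier. This is a minimal sketch of how such a classification might work, not the package's actual rules. `LinkType` and `classifyHref` are hypothetical names introduced here for illustration.

```typescript
// Hypothetical illustration only: the real classification happens inside the
// library and may treat edge cases differently.
type LinkType = 'anchor' | 'internal' | 'external' | 'email' | 'phone';

function classifyHref(href: string): LinkType {
  const value = href.trim();
  if (value.startsWith('#')) return 'anchor'; // same-page fragment
  if (/^mailto:/i.test(value)) return 'email';
  if (/^tel:/i.test(value)) return 'phone';
  // Any other explicit scheme, or a protocol-relative URL, is treated as external.
  if (/^[a-z][a-z0-9+.-]*:/i.test(value) || value.startsWith('//')) return 'external';
  return 'internal'; // relative or root-relative path
}

const sampleHrefs = ['#intro', '/blog', 'https://example.com', 'mailto:team@example.com', 'tel:+15550100'];
for (const href of sampleHrefs) {
  console.log(href, '->', classifyHref(href));
}
```

Under these assumed rules, `https://example.com` is classified as `external`, which matches the `linkType` shown in the README's sample output.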
+### Byte-Based Input
+Convert bytes directly with metadata extraction:
+```typescript
+import { convertBytesWithMetadata, WasmMetadataConfig } from 'html-to-markdown-wasm';
+import { readFileSync } from 'node:fs';
+const htmlBytes = readFileSync('article.html');
+const config = new WasmMetadataConfig();
+const result = convertBytesWithMetadata(htmlBytes, null, config);
+console.log(result.markdown);
+console.log(result.metadata);
+```
 ## Build Targets
 Three build targets are provided for different environments:

package/dist/README.md CHANGED

@@ -28,6 +28,7 @@ Experience WebAssembly-powered HTML to Markdown conversion instantly in your bro
 - **Blazing Fast**: Rust-powered core delivers 10-80× faster conversion than pure Python alternatives
 - **Universal**: Works everywhere - Node.js, Bun, Deno, browsers, Python, Rust, and standalone CLI
 - **Smart Conversion**: Handles complex documents including nested tables, code blocks, task lists, and hOCR OCR output
+- **Metadata Extraction**: Extract document metadata (title, description, headers, links, images) alongside conversion
 - **Highly Configurable**: Control heading styles, code block fences, list formatting, whitespace handling, and HTML sanitization
 - **Tag Preservation**: Keep specific HTML tags unconverted when markdown isn't expressive enough
 - **Secure by Default**: Built-in HTML sanitization prevents malicious content
@@ -35,19 +36,22 @@ Experience WebAssembly-powered HTML to Markdown conversion instantly in your bro
 ## Documentation
-- **JavaScript/TypeScript guides**:
-    - Node.js/Bun (native) – [Node.js README](https://github.com/Goldziher/html-to-markdown/blob/main/crates/html-to-markdown-node/README.md)
-    - WebAssembly (universal) – [WASM README](https://github.com/Goldziher/html-to-markdown/blob/main/crates/html-to-markdown-wasm/README.md)
-    - TypeScript wrapper – [TypeScript README](https://github.com/Goldziher/html-to-markdown/blob/main/packages/typescript/README.md)
-- **Python guide** – [Python README](https://github.com/Goldziher/html-to-markdown/blob/main/packages/python/README.md)
-- **PHP guides**:
-    - PHP wrapper package – [PHP README](https://github.com/Goldziher/html-to-markdown/blob/main/packages/php/README.md)
-    - PHP extension (PIE) – [Extension README](https://github.com/Goldziher/html-to-markdown/blob/main/packages/php-ext/README.md)
-- **Ruby guide** – [Ruby README](https://github.com/Goldziher/html-to-markdown/blob/main/packages/ruby/README.md)
-- **Elixir guide** – [Elixir README](https://github.com/Goldziher/html-to-markdown/blob/main/packages/elixir/README.md)
-- **Rust guide** – [Rust README](https://github.com/Goldziher/html-to-markdown/blob/main/crates/html-to-markdown/README.md)
-- **Contributing** – [CONTRIBUTING.md](https://github.com/Goldziher/html-to-markdown/blob/main/CONTRIBUTING.md) ⭐ Start here!
-- **Changelog** – [CHANGELOG.md](https://github.com/Goldziher/html-to-markdown/blob/main/CHANGELOG.md)
+**Language Guides & API References:**
+- **Python** – [README](https://github.com/Goldziher/html-to-markdown/blob/main/packages/python/README.md) with metadata extraction, inline images, hOCR workflows
+- **JavaScript/TypeScript** – [Node.js](https://github.com/Goldziher/html-to-markdown/blob/main/crates/html-to-markdown-node/README.md) | [TypeScript](https://github.com/Goldziher/html-to-markdown/blob/main/packages/typescript/README.md) | [WASM](https://github.com/Goldziher/html-to-markdown/blob/main/crates/html-to-markdown-wasm/README.md)
+- **Ruby** – [README](https://github.com/Goldziher/html-to-markdown/blob/main/packages/ruby/README.md) with RBS types, Steep type checking
+- **PHP** – [Package](https://github.com/Goldziher/html-to-markdown/blob/main/packages/php/README.md) | [Extension (PIE)](https://github.com/Goldziher/html-to-markdown/blob/main/packages/php-ext/README.md)
+- **Go** – [README](https://github.com/Goldziher/html-to-markdown/blob/main/packages/go/README.md) with FFI bindings
+- **Java** – [README](https://github.com/Goldziher/html-to-markdown/blob/main/packages/java/README.md) with Panama FFI, Maven/Gradle setup
+- **C#/.NET** – [README](https://github.com/Goldziher/html-to-markdown/blob/main/packages/csharp/README.md) with NuGet distribution
+- **Elixir** – [README](https://github.com/Goldziher/html-to-markdown/blob/main/packages/elixir/README.md) with Rustler NIF bindings
+- **Rust** – [README](https://github.com/Goldziher/html-to-markdown/blob/main/crates/html-to-markdown/README.md) with core API, error handling, advanced features
+**Project Resources:**
+- **Contributing** – [CONTRIBUTING.md](https://github.com/Goldziher/html-to-markdown/blob/main/CONTRIBUTING.md) ⭐ Start here for development
+- **Changelog** – [CHANGELOG.md](https://github.com/Goldziher/html-to-markdown/blob/main/CHANGELOG.md) – Version history and breaking changes
 ## Installation
@@ -102,6 +106,44 @@ See the JavaScript guides for full API documentation:
 - [Node.js/Bun guide](https://github.com/Goldziher/html-to-markdown/tree/main/crates/html-to-markdown-node)
 - [WebAssembly guide](https://github.com/Goldziher/html-to-markdown/tree/main/crates/html-to-markdown-wasm)
+### Metadata extraction (all languages)
+```typescript
+import { convertWithMetadata } from 'html-to-markdown-node';
+const html = `
+  <html>
+    <head>
+      <title>Example</title>
+      <meta name="description" content="Demo page">
+      <link rel="canonical" href="https://example.com/page">
+    </head>
+    <body>
+      <h1 id="welcome">Welcome</h1>
+      <a href="https://example.com" rel="nofollow external">Example link</a>
+      <img src="https://example.com/image.jpg" alt="Hero" width="640" height="480">
+    </body>
+  </html>
+`;
+const { markdown, metadata } = await convertWithMetadata(
+  html,
+  { headingStyle: 'Atx' },
+  { extract_links: true, extract_images: true, extract_headers: true },
+);
+console.log(markdown);
+// metadata.document.title === 'Example'
+// metadata.links[0].rel === ['nofollow', 'external']
+// metadata.images[0].dimensions === [640, 480]
+```
+Equivalent APIs are available in every binding:
+- Python: `convert_with_metadata(html, options=None, metadata_config=None)`
+- Ruby: `HtmlToMarkdown.convert_with_metadata(html, options = nil, metadata_config = nil)`
+- PHP: `convert_with_metadata(string $html, ?array $options = null, ?array $metadataConfig = null)`
 ### CLI
 ```bash
@@ -119,6 +161,371 @@ html-to-markdown --url https://example.com > output.md
 html-to-markdown --url https://example.com --user-agent "Mozilla/5.0" > output.md
 ```
+### Metadata Extraction
+Extract document metadata alongside HTML-to-Markdown conversion. All bindings support identical APIs:
+#### CLI Examples
+```bash
+# Basic metadata extraction with conversion
+html-to-markdown input.html --with-metadata -o output.json
+# Extract document metadata (title, description, language, etc.)
+html-to-markdown input.html --with-metadata --extract-document
+# Extract headers and links
+html-to-markdown input.html --with-metadata --extract-headers --extract-links
+# Extract all metadata types with conversion
+html-to-markdown input.html --with-metadata \
+  --extract-document \
+  --extract-headers \
+  --extract-links \
+  --extract-images \
+  --extract-structured-data \
+  -o metadata.json
+# Fetch and extract from remote URL
+html-to-markdown --url https://example.com --with-metadata -o output.json
+# Web scraping with preprocessing and metadata
+html-to-markdown page.html --preprocess --preset aggressive \
+  --with-metadata --extract-links --extract-images
+```
+Output format (JSON):
+```json
+{
+  "markdown": "# Title\n\nContent here...",
+  "metadata": {
+    "document": {
+      "title": "Page Title",
+      "description": "Meta description",
+      "charset": "utf-8",
+      "language": "en"
+    },
+    "headers": [
+      { "level": 1, "text": "Title", "id": "title" }
+    ],
+    "links": [
+      {
+        "text": "Example",
+        "href": "https://example.com",
+        "title": null,
+        "rel": ["external"]
+      }
+    ],
+    "images": [
+      {
+        "src": "https://example.com/image.jpg",
+        "alt": "Hero image",
+        "title": null,
+        "dimensions": [640, 480]
+      }
+    ]
+  }
+}
+```
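Note on consuming the JSON output: the shape above can be read from a script. The following is a minimal sketch, assuming the CLI output matches the sample shown. The `CliOutput` interface and the `summarize` helper are hypothetical names, and the optional fields reflect an assumption that a section may be absent when its `--extract-*` flag is not passed.

```typescript
// Shapes inferred from the sample output above; not an official type definition.
interface CliHeader { level: number; text: string; id?: string | null }
interface CliLink { text: string; href: string; title: string | null; rel: string[] }
interface CliImage { src: string; alt: string; title: string | null; dimensions?: [number, number] | null }
interface CliOutput {
  markdown: string;
  metadata: {
    document?: { title?: string; description?: string; charset?: string; language?: string };
    headers?: CliHeader[];
    links?: CliLink[];
    images?: CliImage[];
  };
}

interface Summary { title: string | null; headerCount: number; linkHrefs: string[]; imageAlts: string[] }

// Parse the CLI's JSON and pull out a few fields, tolerating missing sections.
function summarize(json: string): Summary {
  const parsed = JSON.parse(json) as CliOutput;
  const { document: doc, headers = [], links = [], images = [] } = parsed.metadata;
  return {
    title: doc?.title ?? null,
    headerCount: headers.length,
    linkHrefs: links.map((link) => link.href),
    imageAlts: images.map((image) => image.alt),
  };
}

// Same data as the sample output above, inlined so the sketch is self-contained.
const sample = JSON.stringify({
  markdown: '# Title\n\nContent here...',
  metadata: {
    document: { title: 'Page Title', description: 'Meta description', charset: 'utf-8', language: 'en' },
    headers: [{ level: 1, text: 'Title', id: 'title' }],
    links: [{ text: 'Example', href: 'https://example.com', title: null, rel: ['external'] }],
    images: [{ src: 'https://example.com/image.jpg', alt: 'Hero image', title: null, dimensions: [640, 480] }],
  },
});
console.log(summarize(sample));
```

In practice the string would come from the file written with `-o output.json`, for example via `readFileSync('output.json', 'utf8')` from `node:fs`.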
+#### Python Example
+```python
+from html_to_markdown import convert_with_metadata
+html = '''
+<html>
+  <head>
+    <title>Product Guide</title>
+    <meta name="description" content="Complete product documentation">
+  </head>
+  <body>
+    <h1>Getting Started</h1>
+    <p>Visit our <a href="https://example.com">website</a> for more.</p>
+    <img src="https://example.com/guide.jpg" alt="Setup diagram" width="800" height="600">
+  </body>
+</html>
+'''
+markdown, metadata = convert_with_metadata(
+    html,
+    options={'heading_style': 'Atx'},
+    metadata_config={
+        'extract_document': True,
+        'extract_headers': True,
+        'extract_links': True,
+        'extract_images': True,
+    }
+)
+print(markdown)
+print(f"Title: {metadata['document']['title']}")
+print(f"Links found: {len(metadata['links'])}")
+```
+#### TypeScript/Node.js Example
+```typescript
+import { convertWithMetadata } from 'html-to-markdown-node';
+const html = `
+  <html>
+    <head>
+      <title>Article</title>
+      <meta name="description" content="Tech article">
+    </head>
+    <body>
+      <h1>Web Performance</h1>
+      <p>Read our <a href="/blog">blog</a> for tips.</p>
+      <img src="/perf.png" alt="Chart" width="1200" height="630">
+    </body>
+  </html>
+`;
+const { markdown, metadata } = await convertWithMetadata(html, {
+  headingStyle: 'Atx',
+}, {
+  extract_document: true,
+  extract_headers: true,
+  extract_links: true,
+  extract_images: true,
+});
+console.log(markdown);
+console.log(`Found ${metadata.headers.length} headers`);
+console.log(`Found ${metadata.links.length} links`);
+```
+#### Ruby Example
+```ruby
+require 'html_to_markdown'
+html = <<~HTML
+  <html>
+    <head>
+      <title>Documentation</title>
+      <meta name="description" content="API Reference">
+    </head>
+    <body>
+      <h2>Installation</h2>
+      <p>See our <a href="https://github.com">GitHub</a>.</p>
+      <img src="https://example.com/diagram.svg" alt="Architecture" width="960" height="540">
+    </body>
+  </html>
+HTML
+markdown, metadata = HtmlToMarkdown.convert_with_metadata(
+  html,
+  options: { heading_style: :atx },
+  metadata_config: {
+    extract_document: true,
+    extract_headers: true,
+    extract_links: true,
+    extract_images: true,
+  }
+)
+puts markdown
+puts "Title: #{metadata[:document][:title]}"
+puts "Images: #{metadata[:images].length}"
+```
+#### PHP Example
+```php
+<?php
+use HtmlToMarkdown\HtmlToMarkdown;
+$html = <<<HTML
+<html>
+  <head>
+    <title>Tutorial</title>
+    <meta name="description" content="Step-by-step guide">
+  </head>
+  <body>
+    <h1>Getting Started</h1>
+    <p>Check our <a href="https://example.com/guide">guide</a>.</p>
+    <img src="https://example.com/steps.png" alt="Steps" width="1024" height="768">
+  </body>
+</html>
+HTML;
+[$markdown, $metadata] = convert_with_metadata(
+    $html,
+    options: ['heading_style' => 'Atx'],
+    metadataConfig: [
+        'extract_document' => true,
+        'extract_headers' => true,
+        'extract_links' => true,
+        'extract_images' => true,
+    ]
+);
+echo "Title: " . $metadata['document']['title'] . "\n";
+echo "Found " . count($metadata['links']) . " links\n";
+```
+#### Go Example
+```go
+package main
+import (
+	"encoding/json"
+	"fmt"
+	"log"
+	"github.com/Goldziher/html-to-markdown/packages/go/htmltomarkdown"
+)
+func main() {
+	html := `
+	<html>
+		<head>
+			<title>Developer Guide</title>
+			<meta name="description" content="Complete API reference">
+		</head>
+		<body>
+			<h1>API Overview</h1>
+			<p>Learn more at our <a href="https://api.example.com/docs">API docs</a>.</p>
+			<img src="https://example.com/api-flow.png" alt="API Flow" width="1280" height="720">
+		</body>
+	</html>
+	`
+	markdown, metadata, err := htmltomarkdown.ConvertWithMetadata(html, &htmltomarkdown.MetadataConfig{
+		ExtractDocument:     true,
+		ExtractHeaders:      true,
+		ExtractLinks:        true,
+		ExtractImages:       true,
+		ExtractStructuredData: false,
+	})
+	if err != nil {
+		log.Fatal(err)
+	}
+	fmt.Println("Markdown:", markdown)
+	fmt.Printf("Title: %s\n", metadata.Document.Title)
+	fmt.Printf("Found %d links\n", len(metadata.Links))
+	// Marshal to JSON if needed
+	jsonBytes, _ := json.MarshalIndent(metadata, "", "  ")
+	fmt.Println(string(jsonBytes))
+}
+```
+#### Java Example
+```java
+import io.github.goldziher.htmltomarkdown.HtmlToMarkdown;
+import io.github.goldziher.htmltomarkdown.ConversionResult;
+import com.google.gson.Gson;
+import com.google.gson.GsonBuilder;
+public class MetadataExample {
+    public static void main(String[] args) {
+        String html = """
+            <html>
+              <head>
+                <title>Java Guide</title>
+                <meta name="description" content="Complete Java bindings documentation">
+              </head>
+              <body>
+                <h1>Quick Start</h1>
+                <p>Visit our <a href="https://github.com/Goldziher/html-to-markdown">GitHub</a>.</p>
+                <img src="https://example.com/java-flow.png" alt="Flow diagram" width="1024" height="576">
+              </body>
+            </html>
+            """;
+        try {
+            ConversionResult result = HtmlToMarkdown.convertWithMetadata(
+                html,
+                new HtmlToMarkdown.MetadataOptions()
+                    .extractDocument(true)
+                    .extractHeaders(true)
+                    .extractLinks(true)
+                    .extractImages(true)
+            );
+            System.out.println("Markdown:\n" + result.getMarkdown());
+            System.out.println("Title: " + result.getMetadata().getDocument().getTitle());
+            System.out.println("Links found: " + result.getMetadata().getLinks().size());
+            // Pretty-print metadata as JSON
+            Gson gson = new GsonBuilder().setPrettyPrinting().create();
+            System.out.println(gson.toJson(result.getMetadata()));
+        } catch (HtmlToMarkdown.ConversionException e) {
+            System.err.println("Conversion failed: " + e.getMessage());
+        }
+    }
+}
+```
+#### C# Example
+```csharp
+using HtmlToMarkdown;
+using System.Text.Json;
+var html = @"
+<html>
+  <head>
+    <title>C# Guide</title>
+    <meta name=""description"" content=""Official C# bindings documentation"">
+  </head>
+  <body>
+    <h1>Introduction</h1>
+    <p>See our <a href=""https://github.com/Goldziher/html-to-markdown"">repository</a>.</p>
+    <img src=""https://example.com/csharp-arch.png"" alt=""Architecture"" width=""1200"" height=""675"">
+  </body>
+</html>
+";
+try
+{
+    var result = HtmlToMarkdownConverter.ConvertWithMetadata(
+        html,
+        new MetadataConfig
+        {
+            ExtractDocument = true,
+            ExtractHeaders = true,
+            ExtractLinks = true,
+            ExtractImages = true,
+        }
+    );
+    Console.WriteLine("Markdown:");
+    Console.WriteLine(result.Markdown);
+    Console.WriteLine($"Title: {result.Metadata.Document.Title}");
+    Console.WriteLine($"Links found: {result.Metadata.Links.Count}");
+    // Serialize metadata to JSON
+    var options = new JsonSerializerOptions { WriteIndented = true };
+    var json = JsonSerializer.Serialize(result.Metadata, options);
+    Console.WriteLine(json);
+}
+catch (HtmlToMarkdownException ex)
+{
+    Console.Error.WriteLine($"Conversion failed: {ex.Message}");
+}
+```
+See the individual binding READMEs for detailed metadata extraction options:
+- **Python** – [Python README](https://github.com/Goldziher/html-to-markdown/blob/main/packages/python/README.md)
+- **TypeScript/Node.js** – [Node.js README](https://github.com/Goldziher/html-to-markdown/blob/main/crates/html-to-markdown-node/README.md) | [TypeScript README](https://github.com/Goldziher/html-to-markdown/blob/main/packages/typescript/README.md)
+- **Ruby** – [Ruby README](https://github.com/Goldziher/html-to-markdown/blob/main/packages/ruby/README.md)
+- **PHP** – [PHP README](https://github.com/Goldziher/html-to-markdown/blob/main/packages/php/README.md)
+- **Go** – [Go README](https://github.com/Goldziher/html-to-markdown/blob/main/packages/go/README.md)
+- **Java** – [Java README](https://github.com/Goldziher/html-to-markdown/blob/main/packages/java/README.md)
+- **C#/.NET** – [C# README](https://github.com/Goldziher/html-to-markdown/blob/main/packages/csharp/README.md)
+- **WebAssembly** – [WASM README](https://github.com/Goldziher/html-to-markdown/blob/main/crates/html-to-markdown-wasm/README.md)
+- **Rust** – [Rust README](https://github.com/Goldziher/html-to-markdown/blob/main/crates/html-to-markdown/README.md)
 ### Python (v2 API)
 ```python

package/dist/html_to_markdown_wasm_bg.wasm CHANGED

Binary file

package/dist/package.json CHANGED

@@ -4,7 +4,7 @@
   "collaborators": [
     "Na'aman Hirschfeld <nhirschfeld@gmail.com>"
   ],
-  "version": "2.12.1",
+  "version": "2.14.1",
   "license": "MIT",
   "repository": {
     "type": "git",