npm - xscrape - Versions diffs - 3.0.2 → 3.0.4 - Mend

xscrape 3.0.2 → 3.0.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (4) hide show

package/README.md CHANGED Viewed

@@ -11,47 +11,78 @@
 <p align="center">
 <a href="https://opensource.org/licenses/MIT" rel="nofollow"><img src="https://img.shields.io/github/license/johnie/xscrape" alt="License"></a>
 <a href="https://www.npmjs.com/package/xscrape" rel="nofollow"><img src="https://img.shields.io/npm/v/xscrape.svg" alt="npm"></a>
+<a href="https://github.com/johnie/xscrape/actions"><img src="https://github.com/johnie/xscrape/actions/workflows/ci.yml/badge.svg" alt="Build Status"></a>
 <a href="https://github.com/johnie/xscrape" rel="nofollow"><img src="https://img.shields.io/github/stars/johnie/xscrape" alt="stars"></a>
 </p>
 <br/>
 <br/>
-## Features
+## Overview
+xscrape is a powerful HTML scraping library that combines the flexibility of query selectors with the safety of schema validation. It works with any validation library that implements the [Standard Schema](https://standardschema.dev) specification, including Zod, Valibot, ArkType, and Effect Schema.
-  * **HTML Parsing**: Extract data from HTML using CSS selectors with the help of [cheerio](https://github.com/cheeriojs/cheerio).
-  * **Flexible Schema Validation**: Validate and transform extracted data with any validation library that implements the [Standard Schema](https://standardschema.dev), such as Zod, Valibot, ArkType, and Effect Schema.
-  * **Custom Transformations**: Provide custom transformations for extracted attributes.
-  * **Default Values**: Define default values for missing data fields through your chosen schema library's features.
-  * **Nested Field Support**: Define and extract nested data structures from HTML elements.
+## Features
------
+- **HTML Parsing**: Extract data from HTML using query selectors powered by [cheerio](https://github.com/cheeriojs/cheerio)
+- **Universal Schema Support**: Works with any [Standard Schema](https://standardschema.dev) compatible library
+- **Type Safety**: Full TypeScript support with inferred types from your schemas
+- **Flexible Extraction**: Support for nested objects, arrays, and custom transformation functions
+- **Error Handling**: Comprehensive error handling with detailed validation feedback
+- **Custom Transformations**: Apply post-processing transformations to validated data
+- **Default Values**: Handle missing data gracefully through schema defaults
 ## Installation
-To install this library, use your preferred package manager:
+Install xscrape with your preferred package manager:
 ```bash
+npm install xscrape
+# or
 pnpm add xscrape
 # or
-npm install xscrape
+bun add xscrape
 ```
-You will also need to install your chosen schema validation library, for example, Zod:
+## Quick Start
-```bash
-pnpm add zod
-# or
-npm install zod
+```typescript
+import { defineScraper } from 'xscrape';
+import { z } from 'zod';
+// Define your schema
+const schema = z.object({
+  title: z.string(),
+  description: z.string(),
+  keywords: z.array(z.string()),
+  views: z.coerce.number(),
+});
+// Create a scraper
+const scraper = defineScraper({
+  schema,
+  extract: {
+    title: { selector: 'title' },
+    description: { selector: 'meta[name="description"]', value: 'content' },
+    keywords: {
+      selector: 'meta[name="keywords"]',
+      value: (el) => el.attribs['content']?.split(',') || [],
+    },
+    views: { selector: 'meta[name="views"]', value: 'content' },
+  },
+});
+// Use the scraper
+const { data, error } = await scraper(htmlString);
 ```
------
+## Usage Examples
-## Usage
+### Basic Extraction
-Below is an example of how to use `xscrape` with a Zod schema to extract and transform data from an HTML document.
+Extract basic metadata from an HTML page:
-```ts
+```typescript
 import { defineScraper } from 'xscrape';
 import { z } from 'zod';
@@ -59,27 +90,12 @@ const scraper = defineScraper({
   schema: z.object({
     title: z.string(),
     description: z.string(),
-    keywords: z.array(z.string()),
-    views: z.coerce.number(),
+    author: z.string(),
   }),
   extract: {
-    title: {
-      selector: 'title',
-    },
-    description: {
-      selector: 'meta[name="description"]',
-      value: 'content',
-    },
-    keywords: {
-      selector: 'meta[name="keywords"]',
-      value(el) {
-        return el.attribs['content']?.split(',');
-      },
-    },
-    views: {
-      selector: 'meta[name="views"]',
-      value: 'content',
-    },
+    title: { selector: 'title' },
+    description: { selector: 'meta[name="description"]', value: 'content' },
+    author: { selector: 'meta[name="author"]', value: 'content' },
   },
 });
@@ -87,115 +103,320 @@ const html = `
 <!DOCTYPE html>
 <html>
 <head>
-  <meta name="description" content="An example description.">
-  <meta name="keywords" content="typescript,html,parsing">
-  <meta name="views" content="1234">
-  <title>Example Title</title>
+  <title>My Blog Post</title>
+  <meta name="description" content="An interesting blog post">
+  <meta name="author" content="John Doe">
 </head>
-<body></body>
+<body>...</body>
 </html>
 `;
 const { data, error } = await scraper(html);
-console.log(data);
-// Outputs:
-// {
-//   title: 'Example Title',
-//   description: 'An example description.',
-//   keywords: ['typescript', 'html', 'parsing'],
-//   views: 1234
-// }
+// data: { title: "My Blog Post", description: "An interesting blog post", author: "John Doe" }
 ```
 ### Handling Missing Data
-You can handle missing data by using the features of your chosen schema library, such as default values in Zod.
-```ts
-import { defineScraper } from 'xscrape';
-import { z } from 'zod';
+Use schema defaults to handle missing data gracefully:
+```typescript
 const scraper = defineScraper({
   schema: z.object({
-    title: z.string().default('No title'),
-    description: z.string().default('No description'),
+    title: z.string().default('Untitled'),
+    description: z.string().default('No description available'),
+    publishedAt: z.string().optional(),
     views: z.coerce.number().default(0),
   }),
   extract: {
-    title: {
-      selector: 'title',
-    },
-    description: {
-      selector: 'meta[name="description"]',
-      value: 'content',
-    },
-    views: {
-      selector: 'meta[name="views"]',
-      value: 'content',
-    },
+    title: { selector: 'title' },
+    description: { selector: 'meta[name="description"]', value: 'content' },
+    publishedAt: { selector: 'meta[name="published"]', value: 'content' },
+    views: { selector: 'meta[name="views"]', value: 'content' },
   },
 });
+// Even with incomplete HTML, you get sensible defaults
+const { data } = await scraper('<html><head><title>Test</title></head></html>');
+// data: { title: "Test", description: "No description available", views: 0 }
 ```
-### Nested Fields
+### Extracting Arrays
-`xscrape` also supports extracting nested data structures.
+Extract multiple elements as arrays:
-```ts
-import { defineScraper } from 'xscrape';
-import { z } from 'zod';
+```typescript
+const scraper = defineScraper({
+  schema: z.object({
+    links: z.array(z.string()),
+    headings: z.array(z.string()),
+  }),
+  extract: {
+    links: [{ selector: 'a', value: 'href' }],
+    headings: [{ selector: 'h1, h2, h3' }],
+  },
+});
+const html = `
+<html>
+<body>
+  <h1>Main Title</h1>
+  <h2>Subtitle</h2>
+  <a href="/page1">Link 1</a>
+  <a href="/page2">Link 2</a>
+</body>
+</html>
+`;
+const { data } = await scraper(html);
+// data: {
+//   links: ["/page1", "/page2"],
+//   headings: ["Main Title", "Subtitle"]
+// }
+```
+### Nested Objects
+Extract complex nested data structures:
+```typescript
 const scraper = defineScraper({
   schema: z.object({
     title: z.string(),
-    image: z.object({
-      url: z.string().url(),
+    socialMedia: z.object({
+      image: z.string().url(),
       width: z.coerce.number(),
       height: z.coerce.number(),
-    }).default({ url: '', width: 0, height: 0 }).optional(),
+      type: z.string(),
+    }),
   }),
   extract: {
-    title: {
-      selector: 'title',
-    },
-    image: {
+    title: { selector: 'title' },
+    socialMedia: {
       selector: 'head',
       value: {
-        url: {
-          selector: 'meta[property="og:image"]',
-          value: 'content',
-        },
-        width: {
-          selector: 'meta[property="og:image:width"]',
-          value: 'content',
-        },
-        height: {
-          selector: 'meta[property="og:image:height"]',
-          value: 'content',
-        },
+        image: { selector: 'meta[property="og:image"]', value: 'content' },
+        width: { selector: 'meta[property="og:image:width"]', value: 'content' },
+        height: { selector: 'meta[property="og:image:height"]', value: 'content' },
+        type: { selector: 'meta[property="og:type"]', value: 'content' },
+      },
+    },
+  },
+});
+```
+### Custom Value Transformations
+Apply custom logic to extracted values:
+```typescript
+const scraper = defineScraper({
+  schema: z.object({
+    tags: z.array(z.string()),
+    publishedDate: z.date(),
+    readingTime: z.number(),
+  }),
+  extract: {
+    tags: {
+      selector: 'meta[name="keywords"]',
+      value: (el) => el.attribs['content']?.split(',').map(tag => tag.trim()) || [],
+    },
+    publishedDate: {
+      selector: 'meta[name="published"]',
+      value: (el) => new Date(el.attribs['content']),
+    },
+    readingTime: {
+      selector: 'article',
+      value: (el) => {
+        const text = el.text();
+        const wordsPerMinute = 200;
+        const wordCount = text.split(/\s+/).length;
+        return Math.ceil(wordCount / wordsPerMinute);
       },
     },
   },
 });
 ```
------
+### Post-Processing with Transform
+Apply transformations to the validated data:
+```typescript
+const scraper = defineScraper({
+  schema: z.object({
+    title: z.string(),
+    description: z.string(),
+    tags: z.array(z.string()),
+  }),
+  extract: {
+    title: { selector: 'title' },
+    description: { selector: 'meta[name="description"]', value: 'content' },
+    tags: {
+      selector: 'meta[name="keywords"]',
+      value: (el) => el.attribs['content']?.split(',') || [],
+    },
+  },
+  transform: (data) => ({
+    ...data,
+    slug: data.title.toLowerCase().replace(/\s+/g, '-'),
+    tagCount: data.tags.length,
+    summary: data.description.substring(0, 100) + '...',
+  }),
+});
+```
+## Schema Library Examples
+### Zod
-## Configuration
+```typescript
+import { z } from 'zod';
-The `defineScraper` function accepts a configuration object with the following properties:
+const schema = z.object({
+  title: z.string(),
+  price: z.coerce.number(),
+  inStock: z.boolean().default(false),
+});
+```
-  * **`schema`**: A schema object from any library that implements the [Standard Schema](https://standardschema.dev) interface. This schema defines the shape and validation rules for the extracted data.
-  * **`extract`**: An object that determines how fields are extracted from the HTML using CSS selectors.
-  * **`transform`** (optional): A function to apply custom transformations to the validated data.
+### Valibot
+```typescript
+import * as v from 'valibot';
+const schema = v.object({
+  title: v.string(),
+  price: v.pipe(v.string(), v.transform(Number)),
+  inStock: v.optional(v.boolean(), false),
+});
+```
------
+### ArkType
+```typescript
+import { type } from 'arktype';
+const schema = type({
+  title: 'string',
+  price: 'number',
+  inStock: 'boolean = false',
+});
+```
+### Effect Schema
+```typescript
+import { Schema } from 'effect';
+const schema = Schema.Struct({
+  title: Schema.String,
+  price: Schema.NumberFromString,
+  inStock: Schema.optionalWith(Schema.Boolean, { default: () => false }),
+});
+```
+## API Reference
+### `defineScraper(config)`
+Creates a scraper function with the specified configuration.
+#### Parameters
+- `config.schema`: A Standard Schema compatible schema object
+- `config.extract`: Extraction configuration object
+- `config.transform?`: Optional post-processing function
+#### Returns
+A scraper function that takes HTML string and returns `Promise<{ data?: T, error?: unknown }>`.
+### Extraction Configuration
+The `extract` object defines how to extract data from HTML:
+```typescript
+type ExtractConfig = {
+  [key: string]: ExtractDescriptor | [ExtractDescriptor];
+};
+type ExtractDescriptor = {
+  selector: string;
+  value?: string | ((el: Element) => any) | ExtractConfig;
+};
+```
+#### Properties
+- `selector`: CSS selector to find elements
+- `value`: How to extract the value:
+  - `string`: Attribute name (e.g., `'href'`, `'content'`)
+  - `function`: Custom extraction function
+  - `object`: Nested extraction configuration
+  - `undefined`: Extract text content
+#### Array Extraction
+Wrap the descriptor in an array to extract multiple elements:
+```typescript
+{
+  links: [{ selector: 'a', value: 'href' }]
+}
+```
+## Error Handling
+xscrape provides comprehensive error handling:
+```typescript
+const { data, error } = await scraper(html);
+if (error) {
+  // Handle validation errors, extraction errors, or transform errors
+  console.error('Scraping failed:', error);
+} else {
+  // Use the validated data
+  console.log('Extracted data:', data);
+}
+```
+## Best Practices
+1. **Use Specific Selectors**: Be as specific as possible with CSS selectors to avoid unexpected matches
+2. **Handle Missing Data**: Use schema defaults or optional fields for data that might not be present
+3. **Validate URLs**: Use URL validation in your schema for href attributes
+4. **Transform Data Early**: Use custom value functions rather than post-processing when possible
+5. **Type Safety**: Let TypeScript infer types from your schema for better developer experience
+## Common Use Cases
+- **Web Scraping**: Extract structured data from websites
+- **Meta Tag Extraction**: Get social media and SEO metadata
+- **Content Migration**: Transform HTML content to structured data
+- **Testing**: Validate HTML structure in tests
+- **RSS/Feed Processing**: Extract article data from HTML feeds
+## Performance Considerations
+- xscrape uses cheerio for fast HTML parsing
+- Schema validation is performed once after extraction
+- Consider using streaming for large HTML documents
+- Cache scrapers when processing many similar documents
 ## Contributing
-Contributions are welcome\! Please see the [Contributing Guide](https://github.com/johnie/xscrape/blob/main/CONTRIBUTING.md) for more information.
+We welcome contributions! Please see our [Contributing Guide](https://github.com/johnie/xscrape/blob/main/CONTRIBUTING.md) for details.
 ## License
-This project is licensed under the MIT License. See the [LICENSE](https://github.com/johnie/xscrape/blob/main/LICENSE) file for details.
+MIT License. See the [LICENSE](https://github.com/johnie/xscrape/blob/main/LICENSE) file for details.
+## Related Projects
+- [cheerio](https://github.com/cheeriojs/cheerio) - jQuery-like server-side HTML parsing
+- [Standard Schema](https://standardschema.dev) - Universal schema specification
+- [Zod](https://zod.dev) - TypeScript-first schema validation
+- [Valibot](https://valibot.dev) - Modular and type-safe schema library
+- [Effect](https://effect.website) - Maximum Type-safety (incl. error handling)
+- [ArkType](https://arktype.io) - TypeScript's 1:1 validator, optimized from editor to runtime

package/dist/index.cjs CHANGED Viewed

@@ -55,14 +55,10 @@ function defineScraper(config) {
         };
       }
       if (config.transform) {
-        try {
-          const transformed = await Promise.resolve(
-            config.transform(validationResult.value)
-          );
-          return { data: transformed };
-        } catch (error) {
-          return { error };
-        }
+        const transformed = await Promise.resolve(
+          config.transform(validationResult.value)
+        );
+        return { data: transformed };
       }
       return { data: validationResult.value };
     } catch (error) {

package/dist/index.js CHANGED Viewed

@@ -19,14 +19,10 @@ function defineScraper(config) {
         };
       }
       if (config.transform) {
-        try {
-          const transformed = await Promise.resolve(
-            config.transform(validationResult.value)
-          );
-          return { data: transformed };
-        } catch (error) {
-          return { error };
-        }
+        const transformed = await Promise.resolve(
+          config.transform(validationResult.value)
+        );
+        return { data: transformed };
       }
       return { data: validationResult.value };
     } catch (error) {

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "xscrape",
-  "version": "3.0.2",
+  "version": "3.0.4",
   "description": "A flexible and powerful library designed to extract and transform data from HTML documents using user-defined schemas",
   "main": "dist/index.js",
   "exports": {
@@ -41,35 +41,39 @@
   "homepage": "https://github.com/johnie/xscrape#readme",
   "devDependencies": {
     "@arethetypeswrong/cli": "^0.18.2",
-    "@biomejs/biome": "2.1.1",
+    "@biomejs/biome": "2.1.2",
     "@changesets/changelog-github": "^0.5.1",
     "@changesets/cli": "^2.29.5",
     "arktype": "^2.1.20",
-    "effect": "^3.16.12",
+    "effect": "^3.17.0",
     "jsdom": "^26.1.0",
     "lefthook": "^1.12.2",
     "tsup": "^8.5.0",
     "typescript": "^5.8.3",
     "valibot": "^1.1.0",
-    "vite": "^7.0.4",
+    "vite": "^7.0.5",
+    "vitepress": "^1.6.3",
     "vitest": "^3.2.4",
-    "zod": "^4.0.2"
+    "zod": "^4.0.5"
   },
   "dependencies": {
     "@standard-schema/spec": "^1.0.0",
-    "cheerio": "^1.1.0",
+    "cheerio": "^1.1.2",
     "domhandler": "^5.0.3"
   },
   "scripts": {
     "build": "tsup",
-    "ci": "npm run build && npm run check-format && npm run check-exports && npm run lint && npm run test",
-    "lint": "tsc",
+    "ci": "pnpm run build && pnpm run lint && pnpm run typecheck && pnpm run check-exports && pnpm run test",
+    "typecheck": "tsc",
     "test": "vitest run",
     "test:watch": "vitest",
     "format": "biome format --write ./src",
-    "check-format": "biome check ./src",
+    "lint": "biome check ./src",
     "check-exports": "attw --pack .",
-    "local-release": "npm run ci && changeset version && changeset publish",
-    "release": "npm run ci && changeset publish"
+    "local-release": "pnpm run ci && changeset version && changeset publish",
+    "release": "pnpm run ci && changeset publish",
+    "docs:dev": "vitepress dev docs",
+    "docs:build": "vitepress build docs",
+    "docs:preview": "vitepress preview docs"
   }
 }