crawl4ai 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2024 Crawl4AI Community
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,429 @@
+ # Crawl4AI TypeScript SDK
+
+ A type-safe TypeScript SDK for the Crawl4AI REST API. Built for modern JavaScript/TypeScript environments with full Bun and Node.js compatibility.
+
+ ## 🚀 Features
+
+ - **Full TypeScript Support** - Complete type definitions for all API endpoints and responses
+ - **Bun & Node.js Compatible** - Works seamlessly in both runtimes
+ - **Modern Async/Await** - Promise-based API for all operations
+ - **Comprehensive Coverage** - All Crawl4AI endpoints including specialized features
+ - **Smart Error Handling** - Custom error classes with retry logic and timeouts
+ - **Batch Processing** - Efficiently crawl multiple URLs in a single request
+ - **Input Validation** - Built-in URL validation and parameter checking
+ - **Debug Mode** - Optional request/response logging for development
+ - **Zero Dependencies** - Uses only native fetch API
+
+ ## 📦 Installation
+
+ ### Using Bun (Recommended)
+
+ ```bash
+ bun add crawl4ai
+ ```
+
+ ### Using npm/yarn
+
+ ```bash
+ npm install crawl4ai
+ # or
+ yarn add crawl4ai
+ ```
+
+ ## 📚 About Crawl4AI
+
+ > ⚠️ **Unofficial Package**: This is an unofficial TypeScript SDK, originally created for personal use, that provides a type-safe way to interact with Crawl4AI's REST API.
+
+ - **Official Project**: [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
+ - **Official Documentation**: [https://docs.crawl4ai.com/](https://docs.crawl4ai.com/)
+
+ ## 🏗️ Prerequisites
+
+ 1. **Crawl4AI Server Running**
+
+    You can use the hosted version or run your own:
+
+    ```bash
+    # Using Docker
+    docker run -p 11235:11235 unclecode/crawl4ai:latest
+
+    # With LLM support
+    docker run -p 11235:11235 \
+      -e OPENAI_API_KEY=your_openai_key \
+      -e ANTHROPIC_API_KEY=your_anthropic_key \
+      unclecode/crawl4ai:latest
+    ```
+
+ 2. **TypeScript** (if using TypeScript)
+
+    ```bash
+    bun add -d typescript
+    ```
+
+ ## 🚀 Quick Start
+
+ ### Basic Usage
+
+ ```typescript
+ import Crawl4AI from 'crawl4ai';
+
+ // Initialize the client
+ const client = new Crawl4AI({
+   baseUrl: 'http://localhost:11235', // your Crawl4AI server (local or hosted)
+   apiToken: 'your_token_here', // Optional
+   timeout: 30000,
+   debug: true // Enable request/response logging
+ });
+
+ // Perform a basic crawl
+ const results = await client.crawl({
+   urls: 'https://example.com',
+   browser_config: {
+     headless: true,
+     viewport: { width: 1920, height: 1080 }
+   },
+   crawler_config: {
+     cache_mode: 'bypass',
+     word_count_threshold: 10
+   }
+ });
+
+ const result = results[0]; // API returns array of results
+ console.log('Title:', result.metadata?.title);
+ console.log('Content:', result.markdown?.slice(0, 200));
+ ```
+
+ ### Configuration Options
+
+ ```typescript
+ const client = new Crawl4AI({
+   baseUrl: 'http://localhost:11235',
+   apiToken: 'optional_api_token',
+   timeout: 60000, // Request timeout in ms
+   retries: 3, // Number of retry attempts
+   retryDelay: 1000, // Delay between retries in ms
+   throwOnError: true, // Throw on HTTP errors
+   debug: false, // Enable debug logging
+   defaultHeaders: { // Additional headers
+     'User-Agent': 'MyApp/1.0'
+   }
+ });
+ ```
+
+ ## 📖 API Reference
+
+ ### Core Methods
+
+ #### `crawl(request)` - Main Crawl Endpoint
+ Crawl one or more URLs with full configuration options:
+
+ ```typescript
+ const results = await client.crawl({
+   urls: ['https://example.com', 'https://example.org'],
+   browser_config: {
+     headless: true,
+     simulate_user: true,
+     magic: true // Anti-detection features
+   },
+   crawler_config: {
+     cache_mode: 'bypass',
+     extraction_strategy: {
+       type: 'json_css',
+       params: { /* CSS extraction config */ }
+     }
+   }
+ });
+ ```
+
+ ### Content Generation
+
+ #### `markdown(request)` - Get Markdown
+ Extract markdown with various filters:
+
+ ```typescript
+ const markdown = await client.markdown({
+   url: 'https://example.com',
+   f: 'fit', // 'raw' | 'fit' | 'bm25' | 'llm'
+   q: 'search query for bm25/llm filters'
+ });
+ ```
+
+ #### `html(request)` - Get Processed HTML
+ Get sanitized HTML for schema extraction:
+
+ ```typescript
+ const html = await client.html({
+   url: 'https://example.com'
+ });
+ ```
+
+ #### `screenshot(request)` - Capture Screenshot
+ Capture full-page screenshots:
+
+ ```typescript
+ const screenshotBase64 = await client.screenshot({
+   url: 'https://example.com',
+   screenshot_wait_for: 2, // Wait 2 seconds before capture
+   output_path: '/path/to/save.png' // Optional: save to file
+ });
+ ```
+
+ #### `pdf(request)` - Generate PDF
+ Generate PDF documents:
+
+ ```typescript
+ const pdfData = await client.pdf({
+   url: 'https://example.com',
+   output_path: '/path/to/save.pdf' // Optional: save to file
+ });
+ ```
+
+ ### JavaScript Execution
+
+ #### `executeJs(request)` - Run JavaScript
+ Execute JavaScript on the page and get full crawl results:
+
+ ```typescript
+ const result = await client.executeJs({
+   url: 'https://example.com',
+   scripts: [
+     'return document.title;',
+     'return document.querySelectorAll("a").length;',
+     'window.scrollTo(0, document.body.scrollHeight);'
+   ]
+ });
+
+ console.log('JS Results:', result.js_execution_result);
+ ```
+
+ ### AI/LLM Features
+
+ #### `ask(params)` - Get Library Context
+ Get Crawl4AI documentation for AI assistants:
+
+ ```typescript
+ const answer = await client.ask({
+   query: 'extraction strategies',
+   context_type: 'doc', // 'code' | 'doc' | 'all'
+   max_results: 10
+ });
+ ```
+
+ #### `llm(url, query)` - LLM Endpoint
+ Process URLs with LLM:
+
+ ```typescript
+ const response = await client.llm(
+   'https://example.com',
+   'What is the main purpose of this website?'
+ );
+ ```
+
+ ### Utility Methods
+
+ ```typescript
+ // Test connection
+ const isConnected = await client.testConnection();
+ // Same check, but throw with error details on failure
+ const isConnectedOrThrow = await client.testConnection({ throwOnError: true });
+
+ // Get health status
+ const health = await client.health();
+
+ // Get API version
+ const version = await client.version();
+ // Same call, but throw with error details on failure
+ const versionOrThrow = await client.version({ throwOnError: true });
+
+ // Get Prometheus metrics
+ const metrics = await client.metrics();
+
+ // Update configuration
+ client.setApiToken('new_token');
+ client.setBaseUrl('https://new-url.com');
+ client.setDebug(true);
+ ```
+
+ ## 🎯 Data Extraction Strategies
+
+ ### CSS Selector Extraction
+
+ Extract structured data using CSS selectors:
+
+ ```typescript
+ const results = await client.crawl({
+   urls: 'https://news.ycombinator.com',
+   crawler_config: {
+     extraction_strategy: {
+       type: 'json_css',
+       params: {
+         schema: {
+           baseSelector: '.athing',
+           fields: [
+             {
+               name: 'title',
+               selector: '.titleline > a',
+               type: 'text'
+             },
+             {
+               name: 'url',
+               selector: '.titleline > a',
+               type: 'href'
+             },
+             {
+               name: 'score',
+               selector: '+ tr .score',
+               type: 'text'
+             }
+           ]
+         }
+       }
+     }
+   }
+ });
+
+ const posts = JSON.parse(results[0].extracted_content || '[]');
+ ```
+
+ ### LLM-Based Extraction
+
+ Use AI models for intelligent data extraction:
+
+ ```typescript
+ const results = await client.crawl({
+   urls: 'https://www.bbc.com/news',
+   crawler_config: {
+     extraction_strategy: {
+       type: 'llm',
+       params: {
+         provider: 'openai/gpt-4o-mini',
+         api_token: process.env.OPENAI_API_KEY,
+         schema: {
+           type: 'object',
+           properties: {
+             headline: { type: 'string' },
+             summary: { type: 'string' },
+             author: { type: 'string' },
+             tags: {
+               type: 'array',
+               items: { type: 'string' }
+             }
+           }
+         },
+         extraction_type: 'schema',
+         instruction: 'Extract news articles with their key information'
+       }
+     }
+   }
+ });
+ ```
+
+ ### Cosine Similarity Extraction
+
+ Filter content based on semantic similarity:
+
+ ```typescript
+ const results = await client.crawl({
+   urls: 'https://example.com/blog',
+   crawler_config: {
+     extraction_strategy: {
+       type: 'cosine',
+       params: {
+         semantic_filter: 'artificial intelligence machine learning',
+         word_count_threshold: 50,
+         max_dist: 0.3,
+         top_k: 5
+       }
+     }
+   }
+ });
+ ```
+
+ ## 🛠️ Error Handling
+
+ The SDK provides custom error handling with detailed information:
+
+ ```typescript
+ import { Crawl4AIError } from 'crawl4ai';
+
+ try {
+   const results = await client.crawl({ urls: 'https://example.com' });
+ } catch (error) {
+   if (error instanceof Crawl4AIError) {
+     console.error('API Error:', error.message);
+     console.error('Status:', error.status);
+     console.error('Details:', error.data);
+   } else {
+     console.error('Unexpected error:', error);
+   }
+ }
+ ```
+
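+ The package also exports more specific error classes and type guards (`isRateLimitError`, `isAuthError`, `isNetworkError`, and more). As a minimal sketch, a retry helper that backs off on rate limits (the `crawlWithBackoff` name is illustrative, and the sketch assumes `retryAfter` is reported in seconds):
+
+ ```typescript
+ import Crawl4AI, { isNetworkError, isRateLimitError } from 'crawl4ai';
+
+ const client = new Crawl4AI({ baseUrl: 'http://localhost:11235' });
+
+ async function crawlWithBackoff(url: string) {
+   try {
+     return await client.crawl({ urls: url });
+   } catch (error) {
+     if (isRateLimitError(error)) {
+       // Wait for the suggested interval (fall back to 5 seconds), then retry once
+       const waitMs = (error.retryAfter ?? 5) * 1000;
+       await new Promise(resolve => setTimeout(resolve, waitMs));
+       return client.crawl({ urls: url });
+     }
+     if (isNetworkError(error)) {
+       console.error('Could not reach the Crawl4AI server:', error.message);
+     }
+     throw error;
+   }
+ }
+ ```
+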
363
+ ## 🧪 Testing
364
+
365
+ Run the test suite:
366
+
367
+ ```bash
368
+ # Run all tests
369
+ bun test
370
+
371
+ # Run specific test file
372
+ bun test src/sdk.test.ts
373
+ ```
374
+
375
+ ## 📚 Examples
376
+
377
+ Run the included examples:
378
+
379
+ ```bash
380
+ # Basic usage
381
+ bun run example:basic
382
+
383
+ # Advanced features
384
+ bun run example:advanced
385
+
386
+ # LLM extraction
387
+ bun run example:llm
388
+ ```
389
+
390
+ ## 🔒 Security & Best Practices
391
+
392
+ ### Authentication
393
+
394
+ Always use API tokens in production:
395
+
396
+ ```typescript
397
+ const client = new Crawl4AI({
398
+ baseUrl: 'https://your-crawl4ai-server.com',
399
+ apiToken: process.env.CRAWL4AI_API_TOKEN
400
+ });
401
+ ```
402
+
403
+ ### Rate Limiting
404
+
405
+ Implement client-side throttling:
406
+
407
+ ```typescript
408
+ // Sequential processing with delays
409
+ for (const url of urls) {
410
+ const results = await client.crawl({ urls: url });
411
+ await new Promise(resolve => setTimeout(resolve, 1000)); // 1s delay
412
+ }
413
+ ```
414
+
415
+ ### Input Validation
416
+
417
+ The SDK automatically validates URLs before making requests. Invalid URLs will throw a `Crawl4AIError`.
418
+
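+ For instance, a malformed URL is rejected before any request is sent. A short sketch (depending on the input, the error may be the more specific `RequestValidationError` subclass, which extends `Crawl4AIError`):
+
+ ```typescript
+ import Crawl4AI, { Crawl4AIError } from 'crawl4ai';
+
+ const client = new Crawl4AI({ baseUrl: 'http://localhost:11235' });
+
+ try {
+   await client.crawl({ urls: 'not-a-valid-url' });
+ } catch (error) {
+   if (error instanceof Crawl4AIError) {
+     console.error('Rejected before the request was sent:', error.message);
+   }
+ }
+ ```
+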
+ ## 🤝 Contributing
+
+ Contributions are welcome! Please feel free to submit a Pull Request.
+
+ ## 📄 License
+
+ This SDK is released under the MIT License.
+
+ ## 🙏 Acknowledgments
+
+ Built for the amazing [Crawl4AI](https://github.com/unclecode/crawl4ai) project by [@unclecode](https://github.com/unclecode) and the Crawl4AI community.
@@ -0,0 +1,96 @@
+ /**
+  * Custom error classes for Crawl4AI SDK
+  */
+ import type { ValidationError } from './types';
+ /**
+  * Base error class for all Crawl4AI errors
+  */
+ export declare class Crawl4AIError extends Error {
+     status?: number;
+     statusText?: string;
+     data?: ValidationError | Record<string, unknown>;
+     request?: {
+         url: string;
+         method: string;
+         headers?: Record<string, string>;
+         body?: unknown;
+     };
+     constructor(message: string, status?: number, statusText?: string, data?: ValidationError | Record<string, unknown>);
+ }
+ /**
+  * Network-related errors (timeouts, connection failures)
+  */
+ export declare class NetworkError extends Crawl4AIError {
+     constructor(message: string, cause?: Error);
+ }
+ /**
+  * Request timeout error
+  */
+ export declare class TimeoutError extends NetworkError {
+     timeout: number;
+     constructor(timeout: number, url?: string);
+ }
+ /**
+  * Validation errors for request parameters
+  */
+ export declare class RequestValidationError extends Crawl4AIError {
+     field?: string;
+     value?: unknown;
+     constructor(message: string, field?: string, value?: unknown);
+ }
+ /**
+  * Rate limiting error
+  */
+ export declare class RateLimitError extends Crawl4AIError {
+     retryAfter?: number;
+     limit?: number;
+     remaining?: number;
+     reset?: Date;
+     constructor(message: string, retryAfter?: number, headers?: Record<string, string>);
+ }
+ /**
+  * Authentication/Authorization errors
+  */
+ export declare class AuthError extends Crawl4AIError {
+     constructor(message?: string, status?: number);
+ }
+ /**
+  * Server errors (5xx)
+  */
+ export declare class ServerError extends Crawl4AIError {
+     constructor(message?: string, status?: number, statusText?: string);
+ }
+ /**
+  * Resource not found error
+  */
+ export declare class NotFoundError extends Crawl4AIError {
+     resource?: string;
+     constructor(resource?: string);
+ }
+ /**
+  * Response parsing error
+  */
+ export declare class ParseError extends Crawl4AIError {
+     responseText?: string;
+     constructor(message: string, responseText?: string);
+ }
+ /**
+  * Type guard to check if an error is a Crawl4AI error
+  */
+ export declare function isCrawl4AIError(error: unknown): error is Crawl4AIError;
+ /**
+  * Type guard to check if an error is a rate limit error
+  */
+ export declare function isRateLimitError(error: unknown): error is RateLimitError;
+ /**
+  * Type guard to check if an error is an auth error
+  */
+ export declare function isAuthError(error: unknown): error is AuthError;
+ /**
+  * Type guard to check if an error is a network error
+  */
+ export declare function isNetworkError(error: unknown): error is NetworkError;
+ /**
+  * Helper to create appropriate error based on status code
+  */
+ export declare function createHttpError(status: number, statusText: string, message?: string, data?: unknown, headers?: Record<string, string>): Crawl4AIError;
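+ /**
+  * Usage sketch (illustrative only; the endpoint path and server URL below are
+  * assumptions, not part of this package): normalize a raw HTTP response into
+  * one of the error classes above and narrow it with the type guards.
+  *
+  * ```ts
+  * import { createHttpError, isAuthError, isRateLimitError } from 'crawl4ai';
+  *
+  * const res = await fetch('http://localhost:11235/health');
+  * if (!res.ok) {
+  *   const err = createHttpError(res.status, res.statusText);
+  *   if (isRateLimitError(err)) {
+  *     // back off, optionally using err.retryAfter
+  *   } else if (isAuthError(err)) {
+  *     // supply or refresh the API token
+  *   }
+  *   throw err;
+  * }
+  * ```
+  */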
@@ -0,0 +1,7 @@
+ /**
+  * Crawl4AI TypeScript SDK
+  * Export all types and classes
+  */
+ export * from './errors';
+ export { Crawl4AI, default } from './sdk';
+ export * from './types';