clean-web-scraper 3.2.1 → 3.2.2

This diff compares the contents of two publicly released versions of the package, as published to one of the supported registries. It is provided for informational purposes only and reflects the packages exactly as they appear in their public registries.
Files changed (2)
  1. package/README.md +28 -6
  2. package/package.json +1 -1
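To reproduce this comparison locally, newer npm releases ship an `npm diff` command that pulls both published versions from the registry and prints the same unified diff. A minimal invocation, assuming a sufficiently recent npm is installed:

```bash
# Fetch both published versions from the registry and show a unified diff
npm diff --diff=clean-web-scraper@3.2.1 --diff=clean-web-scraper@3.2.2
```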
package/README.md CHANGED
@@ -11,10 +11,10 @@ A powerful Node.js web scraper that extracts clean, readable content from websit
 - 🚫 Excludes unwanted paths from scraping
 - 🔄 Handles relative and absolute URLs like a pro
 - 🎯 No duplicate page visits
-- 📊 Generates JSONL output file for ML training
-- 📊 AI-friendly clean text and csv output (perfect for LLM fine-tuning!)
+- 🤖 AI-friendly output formats (JSONL, CSV, clean text)
 - 📊 Rich metadata extraction
 - 📁 Combine results from multiple scrapers into a unified dataset
+- 🎯 Turn any website into an AI training dataset
 
 ## 🛠️ Prerequisites
 
@@ -58,15 +58,37 @@ const scraper = new WebScraper({
   metadataFields: ['title', 'description'] // Optional: Specify metadata fields to include
 });
 await scraper.start();
-
-// Combine results from multiple scrapers
-await WebScraper.combineResults('./combined-dataset', [scraper1, scraper2]);
 ```
 
 ```bash
 node example-usage.js
 ```
 
+## 💻 Advanced Usage: Multi-Site Scraping
+
+```js
+const WebScraper = require('clean-web-scraper');
+
+// Scrape documentation website
+const docsScraper = new WebScraper({
+  baseURL: 'https://docs.example.com',
+  scrapResultPath: './datasets/docs'
+});
+
+// Scrape blog website
+const blogScraper = new WebScraper({
+  baseURL: 'https://blog.example.com',
+  scrapResultPath: './datasets/blog'
+});
+
+// Start scraping both sites
+await docsScraper.start();
+await blogScraper.start();
+
+// Combine all scraped content into a single dataset
+await WebScraper.combineResults('./combined-dataset', [docsScraper, blogScraper]);
+```
+
 ## 📤 Output
 
 Your AI-ready content is saved in a clean, structured format:
@@ -100,7 +122,7 @@ example.com/
 
 ## 🤖 AI/LLM Training Ready
 
-The output is specifically formatted for AI training purposes:
+The output is specifically formatted for AI training and fine-tuning purposes:
 
 - Clean, processed text without HTML markup
 - Multiple formats (JSONL, CSV, text files)
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "clean-web-scraper",
-  "version": "3.2.1",
+  "version": "3.2.2",
   "main": "main.js",
   "scripts": {
     "start": "node main.js",