clean-web-scraper 3.2.4 → 3.2.5

Files changed (2)
  1. package/README.md +11 -10
  2. package/package.json +1 -1
package/README.md CHANGED
@@ -14,7 +14,6 @@ A powerful Node.js web scraper that extracts clean, readable content from websit
  - 🤖 AI-friendly output formats (JSONL, CSV, clean text)
  - 📊 Rich metadata extraction
  - 📁 Combine results from multiple scrapers into a unified dataset
- - 🎯 Turn any website into an AI training dataset
 
  ## 🛠️ Prerequisites
 
@@ -57,10 +56,6 @@ const scraper = new WebScraper({
  await scraper.start();
  ```
 
- ```bash
- node example-usage.js
- ```
-
  ## 💻 Advanced Usage: Multi-Site Scraping
 
  ```js
@@ -92,16 +87,18 @@ await blogScraper.start();
  await WebScraper.combineResults('./combined', [docsScraper, blogScraper]);
  ```
 
+ ```bash
+ node example-usage.js
+ ```
+
  ## 📤 Output
 
  Your AI-ready content is saved in a clean, structured format:
 
- - 📁 Base folder: ./folderPath/example.com/
+ - 📁 Base folder: `./folderPath/example.com/`
  - 📑 Files preserve original URL paths
- - 📝 Pure text format, perfect for LLM training and fine-tuning
- - 🤖 No HTML, no mess - just clean, structured text ready for AI consumption
- - 📊 JSONL output for ML training
- - 📈 CSV output with clean text content
+ - 🤖 No HTML, no noise - just clean, structured text (`.txt` files)
+ - 📊 `JSONL` and `CSV` outputs, ready for AI consumption and model training
 
  ```bash
  example.com/
@@ -141,10 +138,13 @@ combined/
 
  ### 📝 Text Files (*.txt)
 
+ ```text
  The actual article content starts here. This is the clean, processed text of the article that was extracted from the webpage
+ ```
 
  ### 📑 Text Files with Metadata (texts_with_metadata/*.txt)
 
+ ```text
  title: My Awesome Page
  description: This is a great article about coding
  author: John Doe
@@ -154,6 +154,7 @@ dateScraped: 2024-01-20T10:30:00Z
  \-\-\-
 
  The actual article content starts here. This is the clean, processed text of the article that was extracted from the webpage.
+ ```
 
  ### 📊 JSONL Files (train.jsonl)
 
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "clean-web-scraper",
- "version": "3.2.4",
+ "version": "3.2.5",
  "main": "main.js",
  "scripts": {
  "start": "node main.js",