crawlee-one 3.0.0 → 3.0.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/dist/README.md +241 -0
- package/package.json +1 -1
package/dist/README.md
ADDED
@@ -0,0 +1,241 @@
# CrawleeOne

[](https://www.npmjs.com/package/crawlee-one)
[](https://www.npmjs.com/package/crawlee-one)
[](https://github.com/jurooravec/crawlee-one/blob/main/LICENSE)
[](https://www.typescriptlang.org/)
[](https://nodejs.org/)
[](https://github.com/jurooravec/crawlee-one)

**Production-ready web scraping. Out of the box.**

CrawleeOne wraps [Crawlee](https://crawlee.dev/) with everything production scrapers need -- data transforms, privacy compliance, error tracking, caching, and more -- in a single function call. Write the extraction logic. CrawleeOne handles the rest.

Works seamlessly with [Apify](https://apify.com/), but the storage backend is pluggable -- you're not locked in.

```sh
npm install crawlee-one
```

## Quick start

```ts
import { crawleeOne } from 'crawlee-one';

await crawleeOne({
  type: 'cheerio',
  routes: {
    mainPage: {
      match: /example\.com\/home/i,
      handler: async (ctx) => {
        const { $, pushData, pushRequests } = ctx;
        await pushData([{ title: $('h1').text() }], {
          privacyMask: { author: true },
        });
        await pushRequests([{ url: 'https://example.com/page/2' }]);
      },
    },
    otherPage: {
      match: (url, ctx) => url.startsWith('/') && ctx.$('.author').length > 0,
      handler: async (ctx) => {
        /* ... */
      },
    },
  },
});
```

That's it. No `Actor.main()` boilerplate, no manual router setup, no input wiring. CrawleeOne handles initialization, routing, input resolution, error handling, and teardown.

## Why CrawleeOne?

### One function. Full crawler.

Replace 100+ lines of Actor + Router + input boilerplate with a single `crawleeOne()` call.

### Switch strategies, not code.

Go from `cheerio` to `playwright` by changing one prop. Your route handlers stay the same.

### Reshape output without touching scraper code.

Users filter, transform, rename, and limit results via input config -- no code changes needed.

```json
{
  "outputPickFields": ["name", "email"],
  "outputRenameFields": { "photo": "media.photos[0].url" },
  "outputMaxEntries": 500,
  "outputFilter": "(entry) => entry.rating > 4.0"
}
```
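
Conceptually, these options amount to a small pipeline over each scraped entry. Here is a minimal sketch of the idea -- not crawlee-one's actual implementation, and covering only pick/filter/limit (path-based renaming as in `outputRenameFields` is more involved):

```typescript
type Entry = Record<string, unknown>;

// Hypothetical option names mirror the JSON config above.
interface OutputOptions {
  outputPickFields?: string[];
  outputMaxEntries?: number;
  outputFilter?: (entry: Entry) => boolean;
}

// Apply filter -> pick -> limit, in that order.
function transformOutput(entries: Entry[], opts: OutputOptions): Entry[] {
  let result = entries;
  if (opts.outputFilter) result = result.filter(opts.outputFilter);
  if (opts.outputPickFields) {
    const fields = opts.outputPickFields;
    result = result.map((entry) =>
      Object.fromEntries(fields.filter((f) => f in entry).map((f) => [f, entry[f]]))
    );
  }
  if (opts.outputMaxEntries != null) result = result.slice(0, opts.outputMaxEntries);
  return result;
}
```

The point of pushing this into input config is that end users can reshape the dataset without forking the scraper.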

### Fully typed out of the box.

Route handlers and context objects are typed based on your crawler type. TypeScript knows whether you have `ctx.page` or `ctx.$` -- no extra setup.
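
The mechanism behind this kind of typing is an ordinary discriminated union. A simplified sketch (stand-in types, not the library's real context shapes):

```typescript
// Simplified stand-ins for the crawler-specific contexts.
interface CheerioCtx {
  type: 'cheerio';
  $: (selector: string) => { text: () => string };
}
interface PlaywrightCtx {
  type: 'playwright';
  page: { url: () => string };
}
type Ctx = CheerioCtx | PlaywrightCtx;

// Narrowing on `type` is what gives each handler the right context shape:
// inside the branch, TypeScript knows `$` exists and `page` does not.
function extractTitle(ctx: Ctx): string {
  if (ctx.type === 'cheerio') return ctx.$('h1').text();
  return ctx.page.url(); // `ctx` is PlaywrightCtx here
}
```

Because the `type` prop drives the union, switching a crawler from `cheerio` to `playwright` re-types every handler automatically.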

### Privacy compliance, built in.

Mark fields as personal data. CrawleeOne redacts them automatically when `includePersonalData` is off.
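
Conceptually, redaction is just walking the mask and replacing marked fields. An illustrative sketch (the mask shape mirrors the `privacyMask` option from the quick start; the function name and redaction placeholder are made up for the example):

```typescript
type PrivacyMask = Record<string, boolean | undefined>;

// Replace masked fields unless the user explicitly opted in to personal data.
function applyPrivacyMask<T extends Record<string, unknown>>(
  entry: T,
  mask: PrivacyMask,
  includePersonalData: boolean
): Record<string, unknown> {
  if (includePersonalData) return entry;
  const copy: Record<string, unknown> = { ...entry };
  for (const key of Object.keys(mask)) {
    if (mask[key] && key in copy) copy[key] = '<redacted>';
  }
  return copy;
}
```

Declaring the mask next to `pushData` keeps the privacy decision in one place, while the on/off toggle stays in user-facing input.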

### Incremental scraping.

Only process entries you haven't seen before. Built-in cache with KeyValueStore tracks what's been scraped across runs.
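
The idea reduces to a "seen before?" check keyed by entry ID. A minimal concept sketch -- an in-memory `Set` stands in for the KeyValueStore, which in the real feature is what lets the seen-IDs survive across runs:

```typescript
// Minimal incremental cache. The class and method names are illustrative only.
class IncrementalCache {
  private seen = new Set<string>();

  // Return only entries whose ID has not been processed before,
  // and record those IDs as seen for subsequent calls.
  filterNew<T>(entries: T[], getId: (entry: T) => string): T[] {
    const fresh = entries.filter((e) => !this.seen.has(getId(e)));
    for (const e of fresh) this.seen.add(getId(e));
    return fresh;
  }
}
```

Re-running a scrape then only pays for genuinely new entries instead of re-processing the whole site.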

### Errors captured, not lost.

Failed requests are saved to a dataset automatically. Plug in Sentry with one line, or implement your own telemetry.

### Match routes by URL or content.

Regex, functions, or both. CrawleeOne auto-routes unlabeled requests to the right handler.
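
Resolution boils down to trying each route's matcher -- a regex tested against the URL, or a predicate function -- and taking the first hit. A rough sketch of the concept (not the library's actual router; the real predicates also receive the page context, as `otherPage` in the quick start shows):

```typescript
type Matcher = RegExp | ((url: string) => boolean);

interface Route {
  name: string;
  match: Matcher;
}

// Return the name of the first route whose matcher accepts the URL.
function resolveRoute(url: string, routes: Route[]): string | undefined {
  for (const route of routes) {
    const hit =
      route.match instanceof RegExp ? route.match.test(url) : route.match(url);
    if (hit) return route.name;
  }
  return undefined;
}
```

First-match-wins means route order doubles as priority, so put the most specific matchers first.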

[See all features](./packages/crawlee-one/docs/features.md)

## Before and after

<details>
<summary>What CrawleeOne replaces (click to expand)</summary>

**With CrawleeOne:**

```ts
await crawleeOne({
  type: 'cheerio',
  routes: {
    mainPage: {
      match: /example\.com\/home/i,
      handler: async (ctx) => {
        const data = [
          /* ... */
        ];
        await ctx.pushData(data, { privacyMask: { author: true } });
        await ctx.pushRequests([{ url: 'https://...' }]);
      },
    },
  },
});
```

**Without CrawleeOne (vanilla Crawlee + Apify):**

```ts
import { Actor } from 'apify';
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

await Actor.main(async () => {
  const rawInput = await Actor.getInput();
  const input = {
    ...rawInput,
    ...(await fetchInput(rawInput.inputFromUrl)),
    ...(await runFunc(rawInput.inputFromFunc)),
  };

  const router = createCheerioRouter();

  router.addHandler('mainPage', async (ctx) => {
    await onBeforeHandler(ctx);
    const data = [
      /* ... */
    ];
    const finalData = await transformAndFilterData(data, ctx, input);
    const dataset = await Actor.openDataset(input.datasetId);
    await dataset.pushData(finalData);
    const reqs = ['https://...'].map((url) => ({ url }));
    const finalReqs = await transformAndFilterReqs(reqs, ctx, input);
    const queue = await Actor.openRequestQueue(input.requestQueueId);
    await queue.addRequests(finalReqs);
    await onAfterHandler(ctx);
  });

  router.addDefaultHandler(async (ctx) => {
    await onBeforeHandler(ctx);
    const url = ctx.request.loadedUrl || ctx.request.url;
    if (url.match(/example\.com\/home/i)) {
      const req = { url, userData: { label: 'mainPage' } };
      const finalReqs = await transformAndFilterReqs([req], ctx, input);
      const queue = await Actor.openRequestQueue(input.requestQueueId);
      await queue.addRequests(finalReqs);
    }
    await onAfterHandler(ctx);
  });

  const crawler = new CheerioCrawler({ ...input, requestHandler: router });
  await crawler.run(['https://...']);
});
```

And that's far from everything -- the vanilla version still doesn't include data transforms, privacy masking, error tracking, caching, or input validation.

</details>

## Common use cases

CrawleeOne scrapers support these out of the box, all configurable via input:

| Use case | What it does |
| -------- | ------------ |
| **[Import URLs](./packages/crawlee-one/docs/playbook-01-import-urls.md)** | Load URLs from databases, datasets, or custom functions. |
| **[Data transforms](./packages/crawlee-one/docs/playbook-03-results-mapping-simple.md)** | Rename, select, limit, and reshape output without code changes. |
| **[Request filtering](./packages/crawlee-one/docs/playbook-06-requests-mapping-filtering.md)** | Control what gets scraped to save time and money. |
| **[Caching](./packages/crawlee-one/docs/playbook-07-caching.md)** | Incremental scraping -- only process new entries. |
| **[Privacy compliance](./packages/crawlee-one/docs/playbook-10-privacy-compliance.md)** | Redact personal data with a single toggle. |
| **[Error capture](./packages/crawlee-one/docs/playbook-11-errors.md)** | Centralized error tracking across scrapers. |

[See all 12 use cases](./packages/crawlee-one/docs/use-cases.md)

## Getting started

### Installation

```sh
npm install crawlee-one
```

### For scraper developers

1. Read the [getting started guide](./packages/crawlee-one/docs/getting-started.md) for a full walkthrough of `crawleeOne()` and its options.
2. See [example projects](#example-projects) for real-world usage.
3. Managing multiple crawlers in one project? Use [codegen](./packages/crawlee-one/docs/codegen.md) to generate typed helper functions from a config file.

### For end users

Scrapers built with CrawleeOne are configurable by end users (via the Apify platform). Transform, filter, limit, and reshape scraped data and requests -- all through input fields, no code changes needed.

[User guide](./packages/crawlee-one/docs/user-guide.md)

## Documentation

| Document | Description |
| -------- | ----------- |
| [Getting started](./packages/crawlee-one/docs/getting-started.md) | Developer guide with full `crawleeOne()` options reference. |
| [Features](./packages/crawlee-one/docs/features.md) | Complete feature catalog with code examples. |
| [Use cases](./packages/crawlee-one/docs/use-cases.md) | All 12 use cases with links to detailed guides. |
| [Input reference](./packages/crawlee-one/docs/reference-input.md) | All available input fields. |
| [Deploying to Apify](./packages/crawlee-one/docs/deploying-to-apify.md) | Step-by-step Apify deployment guide. |
| [Codegen](./packages/crawlee-one/docs/codegen.md) | Generate typed crawler definitions from config. |
| [Integrations](./packages/crawlee-one/docs/integrations.md) | Custom telemetry and storage backends. |
| [User guide](./packages/crawlee-one/docs/user-guide.md) | Guide for end users of CrawleeOne scrapers. |
| [API reference](./packages/crawlee-one/docs/typedoc/globals.md) | Auto-generated TypeScript API docs. |
| [Crawlee & Apify overview](./packages/crawlee-one/docs/scraping-workflow-summary.md) | Background on how Crawlee and Apify work. |

## Example projects

- [SKCRIS Scraper](https://github.com/JuroOravec/apify-actor-skcris) -- Slovak research database scraper.
- [Profesia.sk Scraper](https://github.com/JuroOravec/apify-actor-profesia-sk) -- Slovak job board scraper.

## Contributing

Found a bug or have a feature request? Please [open an issue](https://github.com/jurooravec/crawlee-one/issues).

When contributing code, please fork the repo and submit a pull request. See [CONTRIBUTING.md](./CONTRIBUTING.md) for dev setup and guidelines.

## Development

Want to build, test, or hack on CrawleeOne? The [development guide](./packages/crawlee-one/docs/development/README.md) covers prerequisites, all npm scripts, project structure, architecture, and testing strategy.

## Supporting CrawleeOne

CrawleeOne is a labour of love. If you find it useful, you can support the project on [Buy Me a Coffee](https://www.buymeacoffee.com/jurooravec).
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "crawlee-one",
-  "version": "3.0.0",
+  "version": "3.0.1",
   "type": "module",
   "private": false,
   "description": "Production-ready web scraping in a single function call. Built on Crawlee. Data transforms, caching, privacy compliance, and error tracking -- out of the box.",