@crawlkit-sh/sdk 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +24 -0
- package/LICENSE +21 -0
- package/README.md +386 -0
- package/dist/index.cjs +745 -0
- package/dist/index.cjs.map +1 -0
- package/dist/index.d.cts +1416 -0
- package/dist/index.d.ts +1416 -0
- package/dist/index.js +734 -0
- package/dist/index.js.map +1 -0
- package/package.json +82 -0
package/CHANGELOG.md
ADDED
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [1.0.0] - 2024-01-25

### Added

- Initial release of @crawlkit-sh/sdk
- Core scraping functionality (`scrape()`)
- AI-powered data extraction (`extract()`)
- Web search via DuckDuckGo (`search()`)
- Full-page screenshots (`screenshot()`)
- LinkedIn scraping (`linkedin.company()`, `linkedin.person()`)
- Instagram scraping (`instagram.profile()`, `instagram.content()`)
- Google Play Store data (`appstore.playstoreReviews()`, `appstore.playstoreDetail()`)
- Apple App Store data (`appstore.appstoreReviews()`)
- Comprehensive TypeScript types
- Custom error classes for better error handling
- ESM and CommonJS support
- Zero runtime dependencies

package/LICENSE
ADDED
MIT License

Copyright (c) 2024 CrawlKit

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

package/README.md
ADDED
# CrawlKit SDK

[npm](https://www.npmjs.com/package/@crawlkit-sh/sdk) · [TypeScript](https://www.typescriptlang.org/) · [License: MIT](https://opensource.org/licenses/MIT)

> Official TypeScript/JavaScript SDK for [CrawlKit](https://crawlkit.sh) - the modern web scraping API

Turn any website into structured data with a single API call. CrawlKit handles proxies, JavaScript rendering, anti-bot detection, and data extraction so you can focus on building.

## Features

- **Web Scraping** - Convert any webpage to clean Markdown, HTML, or raw content
- **AI Data Extraction** - Extract structured data using JSON Schema with an LLM
- **Web Search** - Search the web via the DuckDuckGo API
- **Screenshots** - Capture full-page screenshots
- **LinkedIn Scraping** - Scrape company profiles and person profiles
- **Instagram Scraping** - Scrape profiles and posts/reels
- **App Store Data** - Fetch reviews and details from Google Play & the Apple App Store
- **Browser Automation** - Click, type, scroll, and execute JavaScript
- **TypeScript First** - Full type safety with comprehensive type definitions
- **Zero Dependencies** - Uses native fetch; works in Node.js 18+ and browsers

## Installation

```bash
npm install @crawlkit-sh/sdk
```

```bash
yarn add @crawlkit-sh/sdk
```

```bash
pnpm add @crawlkit-sh/sdk
```

## Quick Start

```typescript
import { CrawlKit } from '@crawlkit-sh/sdk';

const crawlkit = new CrawlKit({ apiKey: 'ck_your_api_key' });

// Scrape a webpage
const page = await crawlkit.scrape({ url: 'https://example.com' });
console.log(page.markdown);
console.log(page.metadata.title);
```

Get your API key at [crawlkit.sh](https://crawlkit.sh).

## Examples

### Web Scraping

Scrape any webpage and get clean, structured content:

```typescript
const result = await crawlkit.scrape({
  url: 'https://example.com/blog/article',
  options: {
    onlyMainContent: true, // Remove navigation, footers, etc.
    waitFor: '#content',   // Wait for an element before scraping
  }
});

console.log(result.markdown);        // Clean Markdown content
console.log(result.html);            // Cleaned HTML
console.log(result.metadata.title);  // Page title
console.log(result.metadata.author); // Author, if available
console.log(result.links.internal);  // Internal links found
console.log(result.links.external);  // External links found
```

### AI-Powered Data Extraction

Extract structured data from any page using JSON Schema:

```typescript
interface Product {
  name: string;
  price: number;
  currency: string;
  description: string;
  inStock: boolean;
  reviews: { rating: number; count: number };
}

const result = await crawlkit.extract<Product>({
  url: 'https://example.com/product/123',
  schema: {
    type: 'object',
    properties: {
      name: { type: 'string' },
      price: { type: 'number' },
      currency: { type: 'string' },
      description: { type: 'string' },
      inStock: { type: 'boolean' },
      reviews: {
        type: 'object',
        properties: {
          rating: { type: 'number' },
          count: { type: 'number' }
        }
      }
    }
  },
  options: {
    prompt: 'Extract product information from this e-commerce page'
  }
});

// TypeScript knows result.json is Product
console.log(`${result.json.name}: $${result.json.price}`);
console.log(`In stock: ${result.json.inStock}`);
```

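Note that the generic parameter is a compile-time promise only: at runtime, `result.json` is whatever the model produced. If you want a runtime check as well, a hand-written type guard works; the sketch below is our own helper (not part of the SDK) for the `Product` shape from the example above:

```typescript
interface Product {
  name: string;
  price: number;
  currency: string;
  description: string;
  inStock: boolean;
  reviews: { rating: number; count: number };
}

// Hypothetical helper: narrow an unknown payload to Product at runtime.
function isProduct(value: unknown): value is Product {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  const reviews = v.reviews as Record<string, unknown> | undefined;
  return (
    typeof v.name === 'string' &&
    typeof v.price === 'number' &&
    typeof v.currency === 'string' &&
    typeof v.description === 'string' &&
    typeof v.inStock === 'boolean' &&
    typeof reviews === 'object' && reviews !== null &&
    typeof reviews.rating === 'number' &&
    typeof reviews.count === 'number'
  );
}
```

With this in place, `if (!isProduct(result.json)) { /* handle a malformed extraction */ }` catches bad payloads before they propagate.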
### Browser Automation

Handle SPAs, dynamic content, and interactive pages:

```typescript
const result = await crawlkit.scrape({
  url: 'https://example.com/spa',
  options: {
    waitFor: '.content-loaded',
    actions: [
      { type: 'click', selector: '#accept-cookies' },
      { type: 'wait', milliseconds: 1000 },
      { type: 'click', selector: '#load-more' },
      { type: 'scroll', direction: 'down' },
      { type: 'type', selector: '#search', text: 'query' },
      { type: 'press', key: 'Enter' },
      { type: 'wait', milliseconds: 2000 },
    ]
  }
});
```

### Web Search

Search the web and get structured results:

```typescript
const result = await crawlkit.search({
  query: 'typescript best practices 2024',
  options: {
    maxResults: 20,
    timeRange: 'm', // Past month: 'd', 'w', 'm', 'y'
    region: 'us-en'
  }
});

for (const item of result.results) {
  console.log(`${item.position}. ${item.title}`);
  console.log(`   ${item.url}`);
  console.log(`   ${item.snippet}\n`);
}
```

### Screenshots

Capture full-page screenshots:

```typescript
const result = await crawlkit.screenshot({
  url: 'https://example.com',
  options: {
    width: 1920,
    height: 1080,
    waitForSelector: '#main-content'
  }
});

console.log('Screenshot URL:', result.url);
```

### LinkedIn Scraping

Scrape LinkedIn company and person profiles:

```typescript
// Company profile
const company = await crawlkit.linkedin.company({
  url: 'https://www.linkedin.com/company/openai',
  options: { includeJobs: true }
});

console.log(company.company.name);
console.log(company.company.industry);
console.log(company.company.followers);
console.log(company.company.description);
console.log(company.company.employees);
console.log(company.company.jobs);

// Person profiles (batch up to 10)
const people = await crawlkit.linkedin.person({
  url: [
    'https://www.linkedin.com/in/user1',
    'https://www.linkedin.com/in/user2'
  ]
});

console.log(`Success: ${people.successCount}, Failed: ${people.failedCount}`);
people.persons.forEach(p => console.log(p.person));
```

### Instagram Scraping

Scrape Instagram profiles and content:

```typescript
// Profile
const profile = await crawlkit.instagram.profile({
  username: 'instagram'
});

console.log(profile.profile.full_name);
console.log(profile.profile.follower_count);
console.log(profile.profile.following_count);
console.log(profile.profile.biography);
console.log(profile.profile.posts); // Recent posts

// Post/Reel content
const post = await crawlkit.instagram.content({
  shortcode: 'CxIIgCCq8mg' // or full URL
});

console.log(post.post.like_count);
console.log(post.post.comment_count);
console.log(post.post.video_url);
console.log(post.post.caption);
```

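If you collect post URLs from elsewhere and would rather normalize them yourself before calling `content()`, a small parser is enough. This helper is our own sketch, not part of the SDK:

```typescript
// Hypothetical helper: pull the shortcode out of a post/reel/IGTV URL,
// or return the input unchanged if it already looks like a shortcode.
function toShortcode(input: string): string {
  const match = input.match(/instagram\.com\/(?:p|reel|tv)\/([A-Za-z0-9_-]+)/);
  return match ? match[1] : input;
}
```

For example, `toShortcode('https://www.instagram.com/p/CxIIgCCq8mg/')` and `toShortcode('CxIIgCCq8mg')` both yield `'CxIIgCCq8mg'`.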
### App Store Data

Fetch app reviews and details:

```typescript
// Google Play Store reviews with pagination
let cursor: string | null = null;
do {
  const reviews = await crawlkit.appstore.playstoreReviews({
    appId: 'com.example.app',
    cursor,
    options: { lang: 'en' }
  });

  reviews.reviews.forEach(r => {
    console.log(`${r.rating}/5: ${r.text}`);
    if (r.developerReply) {
      console.log(`  Reply: ${r.developerReply.text}`);
    }
  });

  cursor = reviews.pagination.nextCursor;
} while (cursor);

// Google Play Store app details
const details = await crawlkit.appstore.playstoreDetail({
  appId: 'com.example.app'
});

console.log(details.appName);
console.log(details.rating);
console.log(details.installs);
console.log(details.description);

// Apple App Store reviews
const iosReviews = await crawlkit.appstore.appstoreReviews({
  appId: '123456789'
});
```

## Error Handling

The SDK provides typed error classes for different scenarios:

```typescript
import {
  CrawlKit,
  CrawlKitError,
  AuthenticationError,
  InsufficientCreditsError,
  ValidationError,
  RateLimitError,
  TimeoutError,
  NotFoundError,
  NetworkError
} from '@crawlkit-sh/sdk';

try {
  const result = await crawlkit.scrape({ url: 'https://example.com' });
} catch (error) {
  if (error instanceof AuthenticationError) {
    console.log('Invalid API key');
  } else if (error instanceof InsufficientCreditsError) {
    console.log(`Not enough credits. Available: ${error.creditsRemaining}`);
  } else if (error instanceof RateLimitError) {
    console.log('Rate limit exceeded, please slow down');
  } else if (error instanceof TimeoutError) {
    console.log('Request timed out');
  } else if (error instanceof ValidationError) {
    console.log(`Invalid request: ${error.message}`);
  } else if (error instanceof NetworkError) {
    console.log(`Network error [${error.code}]: ${error.message}`);
  } else if (error instanceof CrawlKitError) {
    console.log(`API Error [${error.code}]: ${error.message}`);
    console.log(`Status: ${error.statusCode}`);
    if (error.creditsRefunded) {
      console.log(`Credits refunded: ${error.creditsRefunded}`);
    }
  }
}
```

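A `RateLimitError` or `TimeoutError` usually warrants retrying after a pause. The SDK itself does not retry for you; a generic backoff wrapper like this sketch works (the delay schedule and the `isRetryable` predicate are our own choices, not SDK behavior):

```typescript
// Sketch: retry an async call with exponential backoff.
// Which errors count as retryable is left to the caller's predicate.
async function withRetry<T>(
  fn: () => Promise<T>,
  isRetryable: (error: unknown) => boolean,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (!isRetryable(error) || attempt === maxAttempts - 1) throw error;
      // Wait 500 ms, 1000 ms, 2000 ms, ... before the next attempt
      await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}
```

Usage might look like `withRetry(() => crawlkit.scrape({ url }), e => e instanceof RateLimitError || e instanceof TimeoutError)`.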
## Configuration

```typescript
const crawlkit = new CrawlKit({
  // Required: Your API key (get it at crawlkit.sh)
  apiKey: 'ck_your_api_key',

  // Optional: Custom base URL (default: https://api.crawlkit.sh)
  baseUrl: 'https://api.crawlkit.sh',

  // Optional: Default timeout in ms (default: 30000)
  timeout: 60000,

  // Optional: Custom fetch implementation
  fetch: customFetch
});
```

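The `fetch` option is handy for cross-cutting concerns such as request logging. A minimal sketch, assuming the option accepts any fetch-compatible function (the structural `FetchLike` type here is ours, kept narrow so the example stands alone):

```typescript
// Minimal structural type so this sketch compiles without DOM typings.
type FetchLike = (input: string | URL, init?: { method?: string }) => Promise<{ status: number }>;

// Wrap any fetch-compatible function with request-timing logs.
function withLogging(baseFetch: FetchLike): FetchLike {
  return async (input, init) => {
    const started = Date.now();
    const response = await baseFetch(input, init);
    console.log(`${init?.method ?? 'GET'} ${String(input)} -> ${response.status} (${Date.now() - started}ms)`);
    return response;
  };
}
```

You would then pass something like `withLogging(fetch)` (cast to the SDK's expected fetch type as needed) as the `fetch` option.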
## Credit Costs

| Operation | Credits |
|-----------|---------|
| `scrape()` | 1 |
| `extract()` | 5 |
| `search()` | 1 per page (~10 results) |
| `screenshot()` | 1 |
| `linkedin.company()` | 1 |
| `linkedin.person()` | 3 per URL |
| `instagram.profile()` | 1 |
| `instagram.content()` | 1 |
| `appstore.playstoreReviews()` | 1 per page |
| `appstore.playstoreDetail()` | 1 |
| `appstore.appstoreReviews()` | 1 per page |

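For budgeting, the table translates directly into a small estimator. The costs below just mirror the table; the operation keys and the `units` multiplier (number of URLs or pages) are our own naming for this sketch:

```typescript
// Credit costs per the table above; per-URL / per-page operations
// are multiplied by the number of units requested.
const CREDIT_COSTS: Record<string, number> = {
  scrape: 1,
  extract: 5,
  search: 1,           // per page (~10 results)
  screenshot: 1,
  linkedinCompany: 1,
  linkedinPerson: 3,   // per URL
  instagramProfile: 1,
  instagramContent: 1,
  playstoreReviews: 1, // per page
  playstoreDetail: 1,
  appstoreReviews: 1,  // per page
};

function estimateCredits(calls: Array<{ op: string; units?: number }>): number {
  return calls.reduce((total, { op, units = 1 }) => {
    const cost = CREDIT_COSTS[op];
    if (cost === undefined) throw new Error(`Unknown operation: ${op}`);
    return total + cost * units;
  }, 0);
}
```

For example, one `scrape()` plus one `extract()` costs 6 credits, and a batch of 10 LinkedIn person URLs costs 30.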
## TypeScript Support

This SDK is written in TypeScript and provides comprehensive type definitions for all methods and responses. Enable strict mode in your `tsconfig.json` for the best experience:

```json
{
  "compilerOptions": {
    "strict": true
  }
}
```

## Requirements

- Node.js 18.0.0 or higher (for native fetch support)
- Or any modern browser with fetch support

## Documentation

For detailed API documentation and guides, visit [docs.crawlkit.sh](https://docs.crawlkit.sh).

## Support

- [GitHub Issues](https://github.com/crawlkit/sdk/issues)
- [Documentation](https://docs.crawlkit.sh)
- Email: support@crawlkit.sh

## License

MIT License - see [LICENSE](LICENSE) for details.

---

Built with love by [CrawlKit](https://crawlkit.sh)