orangeslice 1.6.1 → 1.7.1
- package/dist/apify.d.ts +57 -0
- package/dist/apify.js +127 -0
- package/dist/cli.js +18 -7
- package/dist/generateObject.d.ts +34 -0
- package/dist/generateObject.js +86 -0
- package/dist/geo.d.ts +50 -0
- package/dist/geo.js +92 -0
- package/dist/index.d.ts +32 -3
- package/dist/index.js +24 -3
- package/docs/AGENTS.md +94 -384
- package/docs/apify.md +133 -0
- package/docs/b2b.md +178 -0
- package/docs/browser.md +173 -0
- package/docs/serp.md +167 -0
- package/docs/strategies.md +250 -0
- package/package.json +2 -2
- /package/docs/{B2B_CROSS_TABLE_TEST_FINDINGS.md → b2b-docs/B2B_CROSS_TABLE_TEST_FINDINGS.md} +0 -0
- /package/docs/{B2B_DATABASE.md → b2b-docs/B2B_DATABASE.md} +0 -0
- /package/docs/{B2B_DATABASE_TEST_FINDINGS.md → b2b-docs/B2B_DATABASE_TEST_FINDINGS.md} +0 -0
- /package/docs/{B2B_EMPLOYEE_SEARCH.md → b2b-docs/B2B_EMPLOYEE_SEARCH.md} +0 -0
- /package/docs/{B2B_GENERALIZATION_RULES.md → b2b-docs/B2B_GENERALIZATION_RULES.md} +0 -0
- /package/docs/{B2B_NLP_QUERY_MAPPINGS.md → b2b-docs/B2B_NLP_QUERY_MAPPINGS.md} +0 -0
- /package/docs/{B2B_NORMALIZED_VS_DENORMALIZED.md → b2b-docs/B2B_NORMALIZED_VS_DENORMALIZED.md} +0 -0
- /package/docs/{B2B_SCHEMA.md → b2b-docs/B2B_SCHEMA.md} +0 -0
- /package/docs/{B2B_SQL_COMPREHENSIVE_TEST_FINDINGS.md → b2b-docs/B2B_SQL_COMPREHENSIVE_TEST_FINDINGS.md} +0 -0
- /package/docs/{B2B_TABLE_INDICES.ts → b2b-docs/B2B_TABLE_INDICES.ts} +0 -0
package/docs/AGENTS.md
CHANGED
|
@@ -1,462 +1,172 @@
|
|
|
1
1
|
# Sales Research Agent
|
|
2
2
|
|
|
3
|
-
You are a B2B sales research agent with access to
|
|
4
|
-
- **1.15 billion LinkedIn profiles** and millions of companies
|
|
5
|
-
- **Google Search** (SERP API)
|
|
6
|
-
- **Website scraping** (Firecrawl + Browser automation)
|
|
3
|
+
You are a B2B sales research agent with access to powerful data tools.
|
|
7
4
|
|
|
8
|
-
|
|
9
|
-
|
|
10
|
-
## Tools
|
|
5
|
+
## Quick Start
|
|
11
6
|
|
|
12
7
|
```typescript
|
|
13
8
|
import { orangeslice } from 'orangeslice';
|
|
14
9
|
|
|
15
|
-
//
|
|
16
|
-
orangeslice.b2b.sql(query)
|
|
17
|
-
|
|
18
|
-
// Google Search
|
|
19
|
-
orangeslice.serp.search(query, options?)
|
|
20
|
-
|
|
21
|
-
// Website Scraping (simple)
|
|
22
|
-
orangeslice.firecrawl.scrape(url, limit?)
|
|
23
|
-
|
|
24
|
-
// Browser Automation (Playwright)
|
|
25
|
-
orangeslice.browser.execute(code, options?)
|
|
10
|
+
// All calls are automatically rate-limited and queued
|
|
26
11
|
```
|
|
27
12
|
|
|
28
|
-
|
|
13
|
+
## Available Tools
|
|
29
14
|
|
|
30
|
-
|
|
15
|
+
| Tool | Purpose | Docs |
|
|
16
|
+
|------|---------|------|
|
|
17
|
+
| `b2b` | 1.15B LinkedIn profiles, company data | [b2b.md](./b2b.md) |
|
|
18
|
+
| `serp` | Google Search with dorking | [serp.md](./serp.md) |
|
|
19
|
+
| `firecrawl` | Static website scraping | Quick API below |
|
|
20
|
+
| `browser` | Dynamic pages (Playwright) | [browser.md](./browser.md) |
|
|
21
|
+
| `generateObject` | AI structured output | Quick API below |
|
|
22
|
+
| `apify` | Pre-built web scrapers | [apify.md](./apify.md) |
|
|
23
|
+
| `geo` | Address parsing/geocoding | Quick API below |
|
|
31
24
|
|
|
32
|
-
|
|
25
|
+
---
|
|
33
26
|
|
|
34
|
-
|
|
27
|
+
## Core Principle: Verify Everything
|
|
35
28
|
|
|
36
|
-
|
|
37
|
-
2. **Verify before proceeding** — SERP results need verification. LinkedIn data needs enrichment.
|
|
38
|
-
3. **Understand the request** — "AI companies" might mean pure-play AI startups OR large companies using AI.
|
|
29
|
+
**SERP results need verification.** Dorking is fast but returns false positives.
|
|
39
30
|
|
|
40
|
-
**The pattern:**
|
|
41
31
|
```
|
|
42
32
|
User: "Find AI CRM companies"
|
|
43
33
|
|
|
44
|
-
❌ BAD:
|
|
34
|
+
❌ BAD: Return raw SERP results
|
|
45
35
|
✅ GOOD:
|
|
46
|
-
1.
|
|
47
|
-
2. Get LinkedIn URLs from results
|
|
36
|
+
1. Dork: "AI CRM" site:linkedin.com/company
|
|
37
|
+
2. Get LinkedIn URLs from results
|
|
48
38
|
3. Enrich each via B2B database
|
|
49
|
-
4. Verify: "Is this actually an AI CRM
|
|
39
|
+
4. Verify: "Is this actually an AI CRM?"
|
|
50
40
|
```
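Step 2 of the pattern above ("Get LinkedIn URLs from results") can be sketched as a small pure helper. This is only an illustration: `linkedInCompanySlugs` is a hypothetical name that is not part of the orangeslice API, and it assumes you have already collected the result URLs from a SERP call as plain strings.

```typescript
// Hypothetical helper (not part of the orangeslice API).
// Pulls unique LinkedIn company slugs out of a list of result URLs so that
// each company can be enriched and verified before anything is returned.
function linkedInCompanySlugs(resultUrls: string[]): string[] {
  // Matches linkedin.com/company/<slug> with an optional subdomain (www, uk, ...).
  const companyUrl = /^https?:\/\/(?:[a-z0-9-]+\.)?linkedin\.com\/company\/([^\/?#]+)/i;
  const slugs: string[] = [];
  for (const url of resultUrls) {
    const match = companyUrl.exec(url.trim());
    const slug = match ? match[1] : undefined;
    if (slug) {
      const normalized = slug.toLowerCase();
      // Deduplicate: the same company often appears under several URLs.
      if (slugs.indexOf(normalized) === -1) {
        slugs.push(normalized);
      }
    }
  }
  return slugs;
}
```

Profile URLs (`/in/...`) and non-LinkedIn domains are dropped, so only candidates that can be looked up as companies move on to the enrichment and verification steps.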
|
|
51
41
|
|
|
52
42
|
---
|
|
53
43
|
|
|
54
|
-
##
|
|
55
|
-
|
|
56
|
-
### 1. Direct Query with Filters (Preferred)
|
|
57
|
-
|
|
58
|
-
Use when criteria is directly searchable:
|
|
59
|
-
|
|
60
|
-
- **Google dorking** — `"AI CRM" site:linkedin.com/company`
|
|
61
|
-
- **B2B database** — industry, company size, funding, job titles
|
|
62
|
-
|
|
63
|
-
### 2. Search → Enrich → Qualify
|
|
64
|
-
|
|
65
|
-
Use when criteria can't be searched directly:
|
|
66
|
-
|
|
67
|
-
- "Companies that recently switched CRMs"
|
|
68
|
-
- "Are they actively hiring for this role?"
|
|
69
|
-
- "Do they use [specific tool]?"
|
|
70
|
-
|
|
71
|
-
**For these:** Pull a broad list → enrich → qualify with AI
|
|
72
|
-
|
|
73
|
-
---
|
|
74
|
-
|
|
75
|
-
## Google Dorking Cheatsheet
|
|
76
|
-
|
|
77
|
-
### Core Operators
|
|
78
|
-
|
|
79
|
-
| Operator | Example | Effect |
|
|
80
|
-
| ----------- | -------------------- | ------------------ |
|
|
81
|
-
| `"..."` | `"exact phrase"` | Match exact text |
|
|
82
|
-
| `OR` | `CEO OR Founder` | Match either term |
|
|
83
|
-
| `-` | `startup -jobs` | Exclude term |
|
|
84
|
-
| `site:` | `site:linkedin.com` | Restrict to domain |
|
|
85
|
-
| `inurl:` | `inurl:status` | URL must contain |
|
|
86
|
-
| `intitle:` | `intitle:"series A"` | Title must contain |
|
|
87
|
-
|
|
88
|
-
### Platform Dorks
|
|
89
|
-
|
|
90
|
-
| Goal | Dork |
|
|
91
|
-
| ------------------ | --------------------------------------------------- |
|
|
92
|
-
| LinkedIn profiles | `site:linkedin.com/in "query"` |
|
|
93
|
-
| LinkedIn companies | `site:linkedin.com/company "query"` |
|
|
94
|
-
| LinkedIn posts | `site:linkedin.com/posts "query"` |
|
|
95
|
-
| Twitter/X posts | `site:x.com inurl:status "query"` |
|
|
96
|
-
| Twitter/X profiles | `site:x.com -inurl:status "query"` |
|
|
97
|
-
| Reddit threads | `site:reddit.com "query"` |
|
|
98
|
-
| Crunchbase | `site:crunchbase.com/organization "query"` |
|
|
99
|
-
|
|
100
|
-
### B2B Prospecting Dorks
|
|
101
|
-
|
|
102
|
-
```
|
|
103
|
-
# Find employees at company
|
|
104
|
-
"Stripe" site:linkedin.com/in
|
|
105
|
-
|
|
106
|
-
# Find leadership
|
|
107
|
-
"Acme Corp" CEO OR Founder OR "Co-founder" site:linkedin.com/in
|
|
44
|
+
## Quick APIs
|
|
108
45
|
|
|
109
|
-
|
|
110
|
-
"VP Sales" "Series A" site:linkedin.com/in
|
|
111
|
-
|
|
112
|
-
# Find company pages by criteria
|
|
113
|
-
"YC W24" site:linkedin.com/company
|
|
114
|
-
"Series B" fintech site:linkedin.com/company
|
|
115
|
-
|
|
116
|
-
# Find companies by product category
|
|
117
|
-
"AI CRM" OR "AI-powered CRM" site:linkedin.com/company
|
|
118
|
-
```
|
|
119
|
-
|
|
120
|
-
### Time Filters
|
|
121
|
-
|
|
122
|
-
| Value | Period |
|
|
123
|
-
| ------- | ---------- |
|
|
124
|
-
| `qdr:d` | Past 24h |
|
|
125
|
-
| `qdr:w` | Past week |
|
|
126
|
-
| `qdr:m` | Past month |
|
|
127
|
-
| `qdr:y` | Past year |
|
|
46
|
+
### firecrawl - Static Web Scraping
|
|
128
47
|
|
|
129
48
|
```typescript
|
|
130
|
-
|
|
131
|
-
|
|
49
|
+
// Scrape a single page
|
|
50
|
+
const { markdown, socialUrls } = await orangeslice.firecrawl.scrape("https://stripe.com/about");
|
|
132
51
|
|
|
133
|
-
|
|
134
|
-
|
|
135
|
-
SERP is cheap. Run 10-30 variations in parallel:
|
|
136
|
-
|
|
137
|
-
| Dimension | Variations |
|
|
138
|
-
| --------- | ----------------------------------------------- |
|
|
139
|
-
| Name | Full name, initials, nicknames |
|
|
140
|
-
| Company | Full name, abbreviation, domain |
|
|
141
|
-
| Title | CEO/Founder/Chief, VP/Director, formal/informal |
|
|
142
|
-
| Location | City, metro area, state |
|
|
143
|
-
|
|
144
|
-
```typescript
|
|
145
|
-
const queries = [
|
|
146
|
-
`"John Smith" "Acme" site:linkedin.com/in`,
|
|
147
|
-
`"J. Smith" Acme site:linkedin.com/in`,
|
|
148
|
-
`"John Smith" CEO site:linkedin.com/in`,
|
|
149
|
-
];
|
|
150
|
-
const results = await Promise.all(queries.map(q => orangeslice.serp.search(q)));
|
|
52
|
+
// Crawl multiple pages (limit)
|
|
53
|
+
const { data } = await orangeslice.firecrawl.scrape("https://stripe.com", 5);
|
|
151
54
|
```
|
|
152
55
|
|
|
153
|
-
|
|
154
|
-
|
|
155
|
-
**Dorking is fast but returns false positives.** Always verify:
|
|
156
|
-
|
|
157
|
-
1. **Enrich via B2B database** — Get actual company/person data
|
|
158
|
-
2. **Scrape website** — Check product page, about page
|
|
159
|
-
3. **AI classification** — "Based on [data], does this match [criteria]?"
|
|
56
|
+
**When to use:** Static content, simple pages, getting social URLs.
|
|
57
|
+
**Don't use for:** JavaScript-heavy pages, login-protected content → use `browser`
|
|
160
58
|
|
|
161
59
|
---
|
|
162
60
|
|
|
163
|
-
|
|
164
|
-
|
|
165
|
-
**Standard pattern: Search → Scrape → Extract**
|
|
61
|
+
### generateObject - AI Structured Output
|
|
166
62
|
|
|
167
63
|
```typescript
|
|
168
|
-
//
|
|
169
|
-
const
|
|
170
|
-
|
|
64
|
+
// Extract structured data from text
|
|
65
|
+
const result = await orangeslice.generateObject.generate({
|
|
66
|
+
prompt: "Extract company info: Apple Inc was founded in 1976 by Steve Jobs",
|
|
67
|
+
schema: {
|
|
68
|
+
type: "object",
|
|
69
|
+
properties: {
|
|
70
|
+
company: { type: "string" },
|
|
71
|
+
year: { type: "number" },
|
|
72
|
+
founder: { type: "string" }
|
|
73
|
+
},
|
|
74
|
+
required: ["company", "year"]
|
|
75
|
+
}
|
|
171
76
|
});
|
|
77
|
+
// { company: "Apple Inc", year: 1976, founder: "Steve Jobs" }
|
|
172
78
|
|
|
173
|
-
//
|
|
174
|
-
const
|
|
175
|
-
|
|
176
|
-
|
|
177
|
-
|
|
178
|
-
|
|
179
|
-
|
|
180
|
-
### When to Use Each Tool
|
|
181
|
-
|
|
182
|
-
| Use Search → Scrape → Extract | Use `browser.execute` instead |
|
|
183
|
-
| -------------------------------- | ----------------------------- |
|
|
184
|
-
| Data spread across unknown pages | Same template across pages |
|
|
185
|
-
| Varied/unknown page structure | Need specific CSS selectors |
|
|
186
|
-
| One-off enrichment | Scraping lists or many pages |
|
|
187
|
-
|
|
188
|
-
---
|
|
189
|
-
|
|
190
|
-
## Social Listening
|
|
191
|
-
|
|
192
|
-
Find posts mentioning topics, brands, or keywords.
|
|
193
|
-
|
|
194
|
-
### Finding Posts: Use Dorking
|
|
195
|
-
|
|
79
|
+
// Convenience method
|
|
80
|
+
const data = await orangeslice.generateObject.extract(
|
|
81
|
+
"Some text with data...",
|
|
82
|
+
{ type: "object", properties: { ... } },
|
|
83
|
+
"Optional instructions"
|
|
84
|
+
);
|
|
196
85
|
```
|
|
197
|
-
# LinkedIn posts mentioning topic
|
|
198
|
-
"AI sales tools" site:linkedin.com/posts
|
|
199
86
|
|
|
200
|
-
|
|
201
|
-
"competitor name" site:x.com inurl:status
|
|
202
|
-
|
|
203
|
-
# Reddit discussions
|
|
204
|
-
"product name" site:reddit.com
|
|
205
|
-
```
|
|
206
|
-
|
|
207
|
-
### Common Problem: Sellers vs. Complainers
|
|
208
|
-
|
|
209
|
-
Users want to find people **complaining about** tools. But searches return mostly **people selling** alternatives.
|
|
210
|
-
|
|
211
|
-
**Filter with verification:**
|
|
212
|
-
- Enrich author profile to check if they're in sales
|
|
213
|
-
- Check post sentiment and context
|
|
87
|
+
**When to use:** Parsing unstructured text, classifying content, extracting fields.
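In keeping with the "verify everything" principle, it may be worth checking a generated object against the schema's `required` list before trusting it. The sketch below is an assumption-laden illustration: `ObjectSchemaSketch` and `missingRequiredFields` are hypothetical names that are not part of the orangeslice API, and the schema shape simply mirrors the example above.

```typescript
// Hypothetical types and helper (not part of the orangeslice API).
interface ObjectSchemaSketch {
  type: "object";
  properties: Record<string, { type: string }>;
  required?: string[];
}

// Returns the names of required fields that are missing (undefined or null)
// in a generated object, so the caller can retry or discard the result.
function missingRequiredFields(
  schema: ObjectSchemaSketch,
  generated: Record<string, unknown>
): string[] {
  const required = schema.required || [];
  return required.filter(
    (field) => generated[field] === undefined || generated[field] === null
  );
}
```

An empty array means every required field is present; anything else lists the fields to re-extract.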
|
|
214
88
|
|
|
215
89
|
---
|
|
216
90
|
|
|
217
|
-
|
|
218
|
-
|
|
219
|
-
**Scale:** 1.15B profiles, 2.6B positions, 1.48B jobs. Naive queries timeout.
|
|
220
|
-
|
|
221
|
-
### Fast Lookups (Indexed)
|
|
222
|
-
|
|
223
|
-
```sql
|
|
224
|
-
-- Company by domain (FAST)
|
|
225
|
-
SELECT * FROM linkedin_company WHERE domain = 'stripe.com';
|
|
226
|
-
|
|
227
|
-
-- Company by universal_name (FAST)
|
|
228
|
-
SELECT * FROM linkedin_company WHERE universal_name = 'stripe';
|
|
229
|
-
|
|
230
|
-
-- Employees at company (FAST - by company ID)
|
|
231
|
-
SELECT lp.first_name, lp.last_name, pos.title
|
|
232
|
-
FROM linkedin_profile lp
|
|
233
|
-
JOIN linkedin_profile_position3 pos ON pos.linkedin_profile_id = lp.id
|
|
234
|
-
WHERE pos.linkedin_company_id = 2135371
|
|
235
|
-
AND pos.end_date IS NULL
|
|
236
|
-
LIMIT 50;
|
|
237
|
-
```
|
|
238
|
-
|
|
239
|
-
### Slow Queries (Will Timeout)
|
|
240
|
-
|
|
241
|
-
```sql
|
|
242
|
-
-- ❌ Text search on names (no index)
|
|
243
|
-
WHERE company_name ILIKE '%stripe%'
|
|
244
|
-
|
|
245
|
-
-- ❌ Headline search without company filter
|
|
246
|
-
WHERE headline ILIKE '%sales%'
|
|
247
|
-
|
|
248
|
-
-- ❌ COUNT on huge companies
|
|
249
|
-
SELECT COUNT(*) FROM ... WHERE linkedin_company_id = 1586
|
|
250
|
-
```
|
|
251
|
-
|
|
252
|
-
### Indexed Columns
|
|
253
|
-
|
|
254
|
-
| Table | Indexed Columns |
|
|
255
|
-
| ----------------------------- | ---------------------------------------- |
|
|
256
|
-
| `linkedin_company` | `id`, `universal_name`, `domain` |
|
|
257
|
-
| `linkedin_profile` | `id`, `linkedin_user_id` |
|
|
258
|
-
| `linkedin_profile_position3` | `linkedin_profile_id`, `linkedin_company_id` |
|
|
259
|
-
| `linkedin_job` | `linkedin_company_id`, `title_id` |
|
|
260
|
-
| `linkedin_crunchbase_funding` | `linkedin_company_id` |
|
|
261
|
-
|
|
262
|
-
### Company Size Performance
|
|
263
|
-
|
|
264
|
-
| Company Size | Simple Query | Aggregations |
|
|
265
|
-
|--------------|--------------|--------------|
|
|
266
|
-
| Small (<1K) | 4-20ms | 5-50ms |
|
|
267
|
-
| Medium (1K-10K) | 10-30ms | 100-500ms |
|
|
268
|
-
| Large (10K-100K) | 10-40ms | 1-15s |
|
|
269
|
-
| Massive (100K+) | 15-65ms | **TIMEOUT** |
|
|
270
|
-
|
|
271
|
-
**For Amazon/Google:** Only use simple `LIMIT` queries.
|
|
272
|
-
|
|
273
|
-
### Common Company IDs
|
|
274
|
-
|
|
275
|
-
| Company | ID | Employees |
|
|
276
|
-
|---------|----------|-----------|
|
|
277
|
-
| Amazon | 1586 | 770K |
|
|
278
|
-
| Google | 1441 | 330K |
|
|
279
|
-
| Stripe | 2135371 | ~9K |
|
|
280
|
-
| OpenAI | 11130470 | ~7K |
|
|
281
|
-
| Ramp | 1406226 | ~3.5K |
|
|
282
|
-
|
|
283
|
-
### Title Search Patterns
|
|
284
|
-
|
|
285
|
-
| Role | ILIKE Pattern |
|
|
286
|
-
|-----------|--------------------------------------------|
|
|
287
|
-
| C-Suite | `ceo%`, `cto%`, `cfo%`, `%chief%` |
|
|
288
|
-
| VPs | `%vp %`, `%vice president%` |
|
|
289
|
-
| Directors | `%director%`, `%head of%` |
|
|
290
|
-
| Sales | `%account exec%`, `%sales rep%`, `%ae %` |
|
|
291
|
-
| SDRs | `%sales development%`, `%sdr%`, `%bdr%` |
|
|
292
|
-
| Engineering | `%engineer%`, `%developer%` |
|
|
293
|
-
| Recruiters | `%recruit%`, `%talent%`, `%sourcer%` |
|
|
294
|
-
| Legal | `%lawyer%`, `%attorney%`, `%counsel%` |
|
|
295
|
-
|
|
296
|
-
### Hiring Queries
|
|
297
|
-
|
|
298
|
-
**MUST filter for active jobs:**
|
|
299
|
-
|
|
300
|
-
```sql
|
|
301
|
-
EXISTS (
|
|
302
|
-
SELECT 1 FROM linkedin_job j
|
|
303
|
-
WHERE j.linkedin_company_id = lc.id
|
|
304
|
-
AND j.closed_since IS NULL
|
|
305
|
-
AND (j.valid_until IS NULL OR j.valid_until > NOW())
|
|
306
|
-
AND j.posted_date >= CURRENT_DATE - INTERVAL '90 days'
|
|
307
|
-
)
|
|
308
|
-
```
|
|
309
|
-
|
|
310
|
-
### Query Strategy
|
|
311
|
-
|
|
312
|
-
**LinkedIn DB times out?** Immediately SERP it:
|
|
313
|
-
```
|
|
314
|
-
site:linkedin.com/company [query]
|
|
315
|
-
```
|
|
316
|
-
|
|
317
|
-
**Complex criteria?** Decompose:
|
|
318
|
-
1. Simple indexed query → get IDs
|
|
319
|
-
2. Enrich with additional data
|
|
320
|
-
3. Filter/qualify results
|
|
321
|
-
|
|
322
|
-
---
|
|
91
|
+
### geo - Address Parsing & Geocoding
|
|
323
92
|
|
|
324
|
-
## Browser Automation (Playwright)
|
|
325
|
-
|
|
326
|
-
Execute Playwright code with `page` in scope.
|
|
327
|
-
|
|
328
|
-
### When to Use
|
|
329
|
-
|
|
330
|
-
- **Firecrawl** — Static pages, simple content extraction
|
|
331
|
-
- **Browser** — Dynamic/JS pages, complex interactions, bot-protected sites
|
|
332
|
-
|
|
333
|
-
### Basic Usage
|
|
334
|
-
|
|
335
|
-
```typescript
|
|
336
|
-
const response = await orangeslice.browser.execute(`
|
|
337
|
-
await page.goto("https://example.com", { waitUntil: 'domcontentloaded' });
|
|
338
|
-
return await page.evaluate(() => {
|
|
339
|
-
return [...document.querySelectorAll('.item')].map(el => ({
|
|
340
|
-
title: el.querySelector('h2')?.textContent?.trim(),
|
|
341
|
-
url: el.querySelector('a')?.href
|
|
342
|
-
}));
|
|
343
|
-
});
|
|
344
|
-
`);
|
|
345
|
-
// response = { success: true, result: [...] }
|
|
346
|
-
```
|
|
347
|
-
|
|
348
|
-
### Workflow: Analyze → Extract
|
|
349
|
-
|
|
350
|
-
**Step 1: Discover selectors**
|
|
351
|
-
```typescript
|
|
352
|
-
const response = await orangeslice.browser.execute(`
|
|
353
|
-
await page.goto(url, { waitUntil: 'domcontentloaded' });
|
|
354
|
-
return await page._snapshotForAI();
|
|
355
|
-
`);
|
|
356
|
-
// Analyze snapshot to find CSS selectors
|
|
357
|
-
```
|
|
358
|
-
|
|
359
|
-
**Step 2: Extract with discovered selectors**
|
|
360
93
|
```typescript
|
|
361
|
-
|
|
362
|
-
|
|
363
|
-
|
|
364
|
-
|
|
365
|
-
|
|
366
|
-
|
|
367
|
-
|
|
368
|
-
|
|
369
|
-
|
|
370
|
-
|
|
371
|
-
|
|
94
|
+
// Parse full address
|
|
95
|
+
const parsed = await orangeslice.geo.parseAddress("1600 Amphitheatre Parkway, Mountain View, CA");
|
|
96
|
+
// {
|
|
97
|
+
// streetNumber: "1600",
|
|
98
|
+
// route: "Amphitheatre Parkway",
|
|
99
|
+
// city: "Mountain View",
|
|
100
|
+
// state: "California",
|
|
101
|
+
// postalCode: "94043",
|
|
102
|
+
// country: "United States",
|
|
103
|
+
// lat: 37.4224764,
|
|
104
|
+
// lng: -122.0842499
|
|
105
|
+
// }
|
|
372
106
|
|
|
373
|
-
|
|
107
|
+
// Just get coordinates
|
|
108
|
+
const { lat, lng } = await orangeslice.geo.geocode("Times Square, NYC");
|
|
374
109
|
|
|
375
|
-
|
|
376
|
-
const
|
|
377
|
-
// Navigate to entry page (passes bot check once)
|
|
378
|
-
await page.goto(entryUrl, { waitUntil: 'domcontentloaded' });
|
|
379
|
-
|
|
380
|
-
// Get all URLs to visit
|
|
381
|
-
const urls = await page.evaluate(() =>
|
|
382
|
-
[...document.querySelectorAll('a.link')].map(a => a.href)
|
|
383
|
-
);
|
|
384
|
-
|
|
385
|
-
// Visit each IN THE SAME SESSION
|
|
386
|
-
const results = [];
|
|
387
|
-
for (const url of urls.slice(0, 10)) {
|
|
388
|
-
await page.goto(url, { waitUntil: 'domcontentloaded' });
|
|
389
|
-
const data = await page.evaluate(() => ({
|
|
390
|
-
title: document.querySelector('h1')?.textContent?.trim()
|
|
391
|
-
}));
|
|
392
|
-
results.push(data);
|
|
393
|
-
}
|
|
394
|
-
return results;
|
|
395
|
-
`);
|
|
110
|
+
// Just city/state
|
|
111
|
+
const { city, state } = await orangeslice.geo.getCityState("123 Main St, Boston, MA");
|
|
396
112
|
```
|
|
397
113
|
|
|
398
|
-
|
|
399
|
-
|
|
400
|
-
1. **Always use `{ waitUntil: 'domcontentloaded' }`** — Prevents hanging
|
|
401
|
-
2. **Check `response.success`** — Don't just destructure `result`
|
|
402
|
-
3. **Analyze before extracting** — Use `_snapshotForAI()` to find selectors
|
|
403
|
-
4. **Return objects, not HTML** — Use `page.evaluate()` for structured data
|
|
404
|
-
5. **3 minute hard limit** — Plan multi-page scrapes accordingly
|
|
114
|
+
**When to use:** Normalizing addresses, getting coordinates, location-based filtering.
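For location-based filtering, the `lat`/`lng` pairs returned above can be compared with a great-circle distance. The following is a minimal sketch using the standard haversine formula; `LatLngSketch`, `haversineKm`, and `withinRadiusKm` are hypothetical helpers that are not part of the orangeslice API.

```typescript
// Hypothetical helpers (not part of the orangeslice API).
interface LatLngSketch {
  lat: number;
  lng: number;
}

// Great-circle distance in kilometres between two points (haversine formula).
function haversineKm(a: LatLngSketch, b: LatLngSketch): number {
  const toRad = (deg: number): number => (deg * Math.PI) / 180;
  const earthRadiusKm = 6371; // mean Earth radius
  const sinHalfLat = Math.sin(toRad(b.lat - a.lat) / 2);
  const sinHalfLng = Math.sin(toRad(b.lng - a.lng) / 2);
  const h =
    sinHalfLat * sinHalfLat +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * sinHalfLng * sinHalfLng;
  // Clamp to 1 to guard against floating-point overshoot before asin.
  return 2 * earthRadiusKm * Math.asin(Math.min(1, Math.sqrt(h)));
}

// True when `point` lies within `radiusKm` of `center`.
function withinRadiusKm(
  center: LatLngSketch,
  point: LatLngSketch,
  radiusKm: number
): boolean {
  return haversineKm(center, point) <= radiusKm;
}
```

A typical use would be geocoding each prospect's address once, then keeping only those where `withinRadiusKm(office, prospect, 50)` is true.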
|
|
405
115
|
|
|
406
116
|
---
|
|
407
117
|
|
|
408
118
|
## Rate Limits
|
|
409
119
|
|
|
410
|
-
| Function
|
|
411
|
-
|
|
412
|
-
| `b2b`
|
|
413
|
-
| `serp`
|
|
414
|
-
| `firecrawl` | 2
|
|
415
|
-
| `browser`
|
|
120
|
+
| Function | Concurrency | Min Delay |
|
|
121
|
+
|----------|-------------|-----------|
|
|
122
|
+
| `b2b` | 2 | 100ms |
|
|
123
|
+
| `serp` | 2 | 200ms |
|
|
124
|
+
| `firecrawl` | 2 | 500ms |
|
|
125
|
+
| `browser` | 2 | 500ms |
|
|
126
|
+
| `generateObject` | 2 | 200ms |
|
|
127
|
+
| `apify` | 2 | 500ms |
|
|
128
|
+
| `geo` | 2 | 100ms |
|
|
416
129
|
|
|
417
|
-
All calls
|
|
130
|
+
All calls queue automatically. Safe to fire many in parallel.
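Queued calls are safe to fire in parallel, but they still take time to drain. The sketch below gives a rough estimate of how long a batch needs just to be dispatched. It rests on an assumption the docs do not confirm: that "Min Delay" is the minimum gap between consecutive call starts for one function. `RATE_LIMITS_SKETCH` and `minQueueTimeMs` are hypothetical names, not part of the orangeslice API; the numbers are copied from the table above.

```typescript
// Values copied from the Rate Limits table; the names are hypothetical.
const RATE_LIMITS_SKETCH: Record<string, { concurrency: number; minDelayMs: number }> = {
  b2b: { concurrency: 2, minDelayMs: 100 },
  serp: { concurrency: 2, minDelayMs: 200 },
  firecrawl: { concurrency: 2, minDelayMs: 500 },
  browser: { concurrency: 2, minDelayMs: 500 },
  generateObject: { concurrency: 2, minDelayMs: 200 },
  apify: { concurrency: 2, minDelayMs: 500 },
  geo: { concurrency: 2, minDelayMs: 100 },
};

// ASSUMPTION: the min delay applies between consecutive call starts.
// Under that assumption, this is the earliest the last call can start;
// it ignores how long each call actually runs.
function minQueueTimeMs(fn: string, callCount: number): number {
  const limit = RATE_LIMITS_SKETCH[fn];
  if (!limit) {
    throw new Error("Unknown function: " + fn);
  }
  if (callCount <= 1) {
    return 0;
  }
  return (callCount - 1) * limit.minDelayMs;
}
```

Under this reading, 30 parallel `serp` queries would need at least 29 × 200 ms, about 5.8 seconds, before the last one is even sent.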
|
|
418
131
|
|
|
419
132
|
---
|
|
420
133
|
|
|
421
|
-
##
|
|
134
|
+
## Restrictions
|
|
422
135
|
|
|
423
|
-
❌ **No direct contact data** — Email
|
|
424
|
-
❌ **No Indeed data** — Indeed tables
|
|
425
|
-
❌ **No traffic
|
|
136
|
+
❌ **No direct contact data** — Email/phone restricted
|
|
137
|
+
❌ **No Indeed data** — Indeed tables restricted
|
|
138
|
+
❌ **No traffic data** — Domain analytics restricted
|
|
426
139
|
|
|
427
140
|
---
|
|
428
141
|
|
|
429
142
|
## Example: Full Research Flow
|
|
430
143
|
|
|
431
|
-
**User:** "Research Ramp - give me everything"
|
|
432
|
-
|
|
433
144
|
```typescript
|
|
434
|
-
|
|
435
|
-
|
|
436
|
-
// 1. B2B Database - Company info
|
|
145
|
+
// Research a company end-to-end
|
|
437
146
|
const company = await orangeslice.b2b.sql(`
|
|
438
|
-
SELECT
|
|
439
|
-
FROM linkedin_company WHERE domain = 'ramp.com'
|
|
147
|
+
SELECT * FROM linkedin_company WHERE domain = 'ramp.com'
|
|
440
148
|
`);
|
|
441
149
|
|
|
442
|
-
// 2. B2B Database - Leadership team
|
|
443
150
|
const leadership = await orangeslice.b2b.sql(`
|
|
444
|
-
SELECT lp.first_name, lp.last_name,
|
|
151
|
+
SELECT lp.first_name, lp.last_name, pos.title
|
|
445
152
|
FROM linkedin_profile lp
|
|
446
153
|
JOIN linkedin_profile_position3 pos ON pos.linkedin_profile_id = lp.id
|
|
447
|
-
WHERE pos.linkedin_company_id =
|
|
154
|
+
WHERE pos.linkedin_company_id = ${company[0].id}
|
|
448
155
|
AND pos.end_date IS NULL
|
|
449
|
-
AND
|
|
450
|
-
LIMIT
|
|
156
|
+
AND (pos.title ILIKE '%ceo%' OR pos.title ILIKE '%cto%')
|
|
157
|
+
LIMIT 10
|
|
451
158
|
`);
|
|
452
159
|
|
|
453
|
-
|
|
454
|
-
const news = await orangeslice.serp.search("Ramp fintech funding 2024", { tbs: "qdr:m" });
|
|
160
|
+
const news = await orangeslice.serp.search("Ramp fintech", { tbs: "qdr:m" });
|
|
455
161
|
|
|
456
|
-
// 4. Website Scraping - About page + socials
|
|
457
162
|
const about = await orangeslice.firecrawl.scrape("https://ramp.com/about");
|
|
458
163
|
```
|
|
459
164
|
|
|
460
165
|
---
|
|
461
166
|
|
|
462
|
-
**
|
|
167
|
+
**See detailed docs:**
|
|
168
|
+
- [b2b.md](./b2b.md) — Database schema, performance, query patterns
|
|
169
|
+
- [serp.md](./serp.md) — Google dorking cheatsheet, verification
|
|
170
|
+
- [browser.md](./browser.md) — Playwright automation patterns
|
|
171
|
+
- [apify.md](./apify.md) — Pre-built scrapers for social, maps, etc.
|
|
172
|
+
- [strategies.md](./strategies.md) — Prospecting and enrichment patterns
|
package/docs/apify.md
ADDED
|
@@ -0,0 +1,133 @@
|
|
|
1
|
+
# Apify Actors
|
|
2
|
+
|
|
3
|
+
Pre-built web scrapers for social media, Google Maps, e-commerce, and more.
|
|
4
|
+
|
|
5
|
+
```typescript
|
|
6
|
+
import { orangeslice } from 'orangeslice';
|
|
7
|
+
|
|
8
|
+
// Run an actor
|
|
9
|
+
const results = await orangeslice.apify.run("username/actor-name", { input: "params" });
|
|
10
|
+
|
|
11
|
+
// Search for actors
|
|
12
|
+
const { actors } = await orangeslice.apify.search("linkedin scraper");
|
|
13
|
+
|
|
14
|
+
// Get actor input schema (what params it accepts)
|
|
15
|
+
const schema = await orangeslice.apify.getInputSchema("apify/web-scraper");
|
|
16
|
+
```
|
|
17
|
+
|
|
18
|
+
---
|
|
19
|
+
|
|
20
|
+
## Workflow
|
|
21
|
+
|
|
22
|
+
1. **Search** for an actor that does what you need
|
|
23
|
+
2. **Get input schema** to understand required params
|
|
24
|
+
3. **Run** the actor with your inputs
|
|
25
|
+
4. Results are returned when the actor completes
|
|
26
|
+
|
|
27
|
+
---
|
|
28
|
+
|
|
29
|
+
## Searching for Actors
|
|
30
|
+
|
|
31
|
+
```typescript
|
|
32
|
+
const { actors, total } = await orangeslice.apify.search("google maps reviews", 10);
|
|
33
|
+
|
|
34
|
+
// actors = [{
|
|
35
|
+
// actorId: "compass/crawler-google-places",
|
|
36
|
+
// title: "Google Maps Scraper",
|
|
37
|
+
// description: "Scrape Google Maps...",
|
|
38
|
+
// stats: { totalRuns: 1000000 },
|
|
39
|
+
// pricing: { ... }
|
|
40
|
+
// }, ...]
|
|
41
|
+
```
|
|
42
|
+
|
|
43
|
+
---
|
|
44
|
+
|
|
45
|
+
## Getting Input Schema
|
|
46
|
+
|
|
47
|
+
Before running, check what params the actor needs:
|
|
48
|
+
|
|
49
|
+
```typescript
|
|
50
|
+
const schema = await orangeslice.apify.getInputSchema("compass/crawler-google-places");
|
|
51
|
+
|
|
52
|
+
console.log(schema.inputProperties);
|
|
53
|
+
// {
|
|
54
|
+
// searchStringsArray: { type: "array", description: "Search queries" },
|
|
55
|
+
// maxReviews: { type: "integer", description: "Max reviews per place" },
|
|
56
|
+
// language: { type: "string", default: "en" },
|
|
57
|
+
// ...
|
|
58
|
+
// }
|
|
59
|
+
```
|
|
60
|
+
|
|
61
|
+
---
|
|
62
|
+
|
|
63
|
+
## Running Actors
|
|
64
|
+
|
|
65
|
+
```typescript
|
|
66
|
+
// Google Maps reviews
|
|
67
|
+
const reviews = await orangeslice.apify.run("compass/crawler-google-places", {
|
|
68
|
+
searchStringsArray: ["restaurants in NYC"],
|
|
69
|
+
maxReviews: 20,
|
|
70
|
+
language: "en"
|
|
71
|
+
});
|
|
72
|
+
|
|
73
|
+
// With dataset params (limit results)
|
|
74
|
+
const results = await orangeslice.apify.run("apify/web-scraper",
|
|
75
|
+
{ startUrls: [{ url: "https://example.com" }] },
|
|
76
|
+
{ limit: 100 } // Only return first 100 items
|
|
77
|
+
);
|
|
78
|
+
```
|
|
79
|
+
|
|
80
|
+
---
|
|
81
|
+
|
|
82
|
+
## Popular Actors
|
|
83
|
+
|
|
84
|
+
| Use Case | Actor | Example Input |
|
|
85
|
+
|----------|-------|---------------|
|
|
86
|
+
| Google Maps | `compass/crawler-google-places` | `{ searchStringsArray: ["cafes SF"] }` |
|
|
87
|
+
| Google Search | `apify/google-search-scraper` | `{ queries: "site:linkedin.com CEO" }` |
|
|
88
|
+
| Instagram | `apify/instagram-scraper` | `{ directUrls: ["https://instagram.com/user"] }` |
|
|
89
|
+
| Twitter/X | `apidojo/tweet-scraper` | `{ searchTerms: ["#startup"] }` |
|
|
90
|
+
| LinkedIn | `anchor/linkedin-profile-scraper` | `{ profileUrls: [...] }` |
|
|
91
|
+
| YouTube | `streamers/youtube-scraper` | `{ searchKeywords: ["tech reviews"] }` |
|
|
92
|
+
| TikTok | `clockworks/tiktok-scraper` | `{ profiles: ["@username"] }` |
|
|
93
|
+
| Amazon | `junglee/amazon-scraper` | `{ keyword: "laptop stand" }` |
|
|
94
|
+
| Yelp | `yin/yelp-scraper` | `{ searchTerms: "plumber", location: "NYC" }` |
|
|
95
|
+
|
|
96
|
+
---
|
|
97
|
+
|
|
98
|
+
## Dataset Params
|
|
99
|
+
|
|
100
|
+
Control the results returned:
|
|
101
|
+
|
|
102
|
+
```typescript
|
|
103
|
+
await orangeslice.apify.run(actor, input, {
|
|
104
|
+
limit: 100, // Max items to return
|
|
105
|
+
offset: 0, // Skip first N items
|
|
106
|
+
clean: true, // Remove empty fields
|
|
107
|
+
fields: ["name", "url"], // Only these fields
|
|
108
|
+
unwind: "reviews" // Flatten nested array
|
|
109
|
+
});
|
|
110
|
+
```
|
|
111
|
+
|
|
112
|
+
---
|
|
113
|
+
|
|
114
|
+
## Timeouts
|
|
115
|
+
|
|
116
|
+
- Actors run asynchronously and are polled for completion
|
|
117
|
+
- Max wait: 5 minutes
|
|
118
|
+
- Large scrapes may time out; use smaller batches
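One way to keep runs under the wait limit is to split a long input list into fixed-size batches and run the actor once per batch. The helper below is a generic sketch; `toBatches` is a hypothetical name and not part of the orangeslice API, and the right batch size depends on the actor.

```typescript
// Hypothetical helper (not part of the orangeslice API).
// Splits `items` into consecutive batches of at most `batchSize` elements.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (Math.floor(batchSize) !== batchSize || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

Each batch could then be passed as its own input (for example as `searchStringsArray`) in a separate `apify.run` call, with the resulting arrays concatenated afterwards.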
|
|
119
|
+
|
|
120
|
+
---
|
|
121
|
+
|
|
122
|
+
## Response Format
|
|
123
|
+
|
|
124
|
+
Returns the dataset items directly as an array:
|
|
125
|
+
|
|
126
|
+
```typescript
|
|
127
|
+
const results = await orangeslice.apify.run(...);
|
|
128
|
+
// results = [
|
|
129
|
+
// { name: "...", address: "...", rating: 4.5 },
|
|
130
|
+
// { name: "...", address: "...", rating: 4.2 },
|
|
131
|
+
// ...
|
|
132
|
+
// ]
|
|
133
|
+
```
|