orangeslice 2.1.0 → 2.1.2

@@ -0,0 +1,337 @@
1
+ ---
2
+ description: Search Crunchbase with SQL
3
+ ---
4
+
5
+ # Crunchbase Search
6
+
7
+ Run SQL against `public.crunchbase_scraper_lean` for startup/company prospecting.
8
+
9
+ ```typescript
10
+ const rows = await services.crunchbase.search({
11
+ sql: `
12
+ SELECT name, website_url, linkedin_url
13
+ FROM public.crunchbase_scraper_lean
14
+ WHERE operating_status = 'active'
15
+ LIMIT 25
16
+ `
17
+ });
18
+
19
+ // rows: Record<string, unknown>[]
20
+ return rows;
21
+ ```
22
+
23
+ ## Contract (Hard Rules)
24
+
25
+ - Query **only** `public.crunchbase_scraper_lean`.
26
+ - **Only one statement** is allowed.
27
+ - **Only SELECT** queries are allowed (`WITH ... SELECT` is fine).
28
+ - Always include `LIMIT` (recommended `<= 100`).
29
+ - This is an external service path, not `ctx.sql()`.
30
+ - Cost is 1 credit per returned row (the reserve estimate is derived from `LIMIT`).
31
+
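The hard rules above can be checked locally before a query is sent. The following is a minimal sketch, not part of the package: `checkCrunchbaseSql` and `ContractCheck` are hypothetical names, and the regex checks are deliberately naive (a semicolon inside a string literal, or a `WITH ... DELETE`, would be misjudged). The service presumably enforces its own rules regardless.

```typescript
// Hypothetical pre-flight check for the contract rules; names are illustrative only.
const ALLOWED_TABLE = "public.crunchbase_scraper_lean";

interface ContractCheck {
  problems: string[]; // hard-rule violations
  warnings: string[]; // recommendations only
  limit: number | null; // parsed LIMIT, an upper bound on credits spent
}

function checkCrunchbaseSql(sql: string): ContractCheck {
  const problems: string[] = [];
  const warnings: string[] = [];
  // Drop one trailing semicolon so "... LIMIT 10;" still counts as a single statement.
  const body = sql.trim().replace(/;\s*$/, "");

  if (body.includes(";")) problems.push("only one statement is allowed");
  if (!/^(select|with)\b/i.test(body)) problems.push("only SELECT queries are allowed");
  if (!body.toLowerCase().includes(ALLOWED_TABLE)) {
    problems.push(`query must read from ${ALLOWED_TABLE}`);
  }

  // Only a numeric LIMIT at the very end of the query is recognized.
  const match = /\blimit\s+(\d+)\s*$/i.exec(body);
  const limit = match ? Number(match[1]) : null;
  if (limit === null) problems.push("missing LIMIT");
  else if (limit > 100) warnings.push("LIMIT is above the recommended 100");

  return { problems, warnings, limit };
}
```

Because credits are charged per returned row, the parsed `limit` doubles as a worst-case credit estimate for the call.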
32
+ ## Return Type
33
+
34
+ `services.crunchbase.search()` returns rows directly:
35
+
36
+ ```typescript
37
+ Record<string, unknown>[]
38
+ ```
39
+
40
+ No `{ rows, count }` envelope.
41
+
42
+ ```typescript
43
+ const rows = await services.crunchbase.search({ sql: "SELECT name FROM public.crunchbase_scraper_lean LIMIT 10" });
44
+ const count = rows.length;
45
+ ```
46
+
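Since every value arrives as `unknown`, callers will usually want to narrow rows into a typed shape before using them. A minimal sketch, assuming the query selected `name`, `website_url`, and `linkedin_url`; `CompanyLead`, `asText`, and `toCompanyLeads` are hypothetical names, not part of the package:

```typescript
type Row = Record<string, unknown>;

// Hypothetical target shape for the three columns selected in the examples above.
interface CompanyLead {
  name: string;
  websiteUrl: string | null;
  linkedinUrl: string | null;
}

// Treats anything that is not a non-empty string as missing.
function asText(value: unknown): string | null {
  return typeof value === "string" && value.length > 0 ? value : null;
}

function toCompanyLeads(rows: Row[]): CompanyLead[] {
  const leads: CompanyLead[] = [];
  for (const row of rows) {
    const name = asText(row["name"]);
    if (name === null) continue; // `name` is nullable in the schema, so skip unnamed rows
    leads.push({
      name,
      websiteUrl: asText(row["website_url"]),
      linkedinUrl: asText(row["linkedin_url"]),
    });
  }
  return leads;
}
```

Skipping rows without a name is a design choice of this sketch; a caller that needs every row could keep them and make `name` nullable instead.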
47
+ ## Live Schema (Verified)
48
+
49
+ Source of truth: live DB introspection of `public.crunchbase_scraper_lean`.
50
+
51
+ | Column | Type | Nullable |
52
+ | ---------------------------- | ------------- | -------- |
53
+ | `id` | `bigint` | no |
54
+ | `uuid` | `text` | yes |
55
+ | `name` | `text` | yes |
56
+ | `link` | `text` | yes |
57
+ | `type` | `text` | yes |
58
+ | `operating_status` | `text` | yes |
59
+ | `company_type` | `text` | yes |
60
+ | `short_description` | `text` | yes |
61
+ | `description` | `text` | yes |
62
+ | `website_url` | `text` | yes |
63
+ | `linkedin_url` | `text` | yes |
64
+ | `twitter_url` | `text` | yes |
65
+ | `facebook_url` | `text` | yes |
66
+ | `contact_email` | `text` | yes |
67
+ | `phone_number` | `text` | yes |
68
+ | `hq_postal_code` | `text` | yes |
69
+ | `primary_category` | `text` | yes |
70
+ | `categories` | `jsonb` | no |
71
+ | `category_groups` | `jsonb` | no |
72
+ | `location_identifiers` | `jsonb` | no |
73
+ | `location_group_identifiers` | `jsonb` | no |
74
+ | `num_employees_enum` | `integer` | yes |
75
+ | `revenue_range` | `text` | yes |
76
+ | `funding_stage` | `text` | yes |
77
+ | `funding_total_usd` | `numeric` | yes |
78
+ | `last_funding_total_usd` | `numeric` | yes |
79
+ | `last_funding_type` | `text` | yes |
80
+ | `last_funding_date` | `date` | yes |
81
+ | `num_funding_rounds` | `integer` | yes |
82
+ | `num_investors` | `integer` | yes |
83
+ | `num_lead_investors` | `integer` | yes |
84
+ | `rank_org_company` | `integer` | yes |
85
+ | `rank_org` | `integer` | yes |
86
+ | `rank_delta_d7` | `integer` | yes |
87
+ | `rank_delta_d30` | `integer` | yes |
88
+ | `rank_delta_d90` | `integer` | yes |
89
+ | `growth_score_tier` | `text` | yes |
90
+ | `heat_score_tier` | `text` | yes |
91
+ | `ipo_status` | `text` | yes |
92
+ | `went_public_on` | `date` | yes |
93
+ | `imported_at` | `timestamptz` | no |
94
+
95
+ ## Enum Catalog (Verified Distinct Values)
96
+
97
+ These values were observed in live production data.
98
+
99
+ ### `operating_status`
100
+
101
+ - `active`
102
+ - `closed`
103
+
104
+ ### `company_type`
105
+
106
+ - `for_profit`
107
+ - `non_profit`
108
+
109
+ ### `type`
110
+
111
+ - `organization`
112
+
113
+ ### `funding_stage`
114
+
115
+ - `seed`
116
+ - `early_stage_venture`
117
+ - `m_and_a`
118
+ - `late_stage_venture`
119
+ - `ipo`
120
+
121
+ ### `last_funding_type`
122
+
123
+ - `seed`
124
+ - `series_a`
125
+ - `series_b`
126
+ - `series_c`
127
+
128
+ ### `revenue_range`
129
+
130
+ - `r_00000000`
131
+ - `r_00001000`
132
+ - `r_00010000`
133
+ - `r_00050000`
134
+ - `r_00100000`
135
+ - `r_00500000`
136
+ - `r_01000000`
137
+ - `r_10000000`
138
+
139
+ ### `growth_score_tier`
140
+
141
+ - `c100_high`
142
+ - `c200_medium`
143
+ - `c300_low`
144
+
145
+ ### `heat_score_tier`
146
+
147
+ - `c100_high`
148
+ - `c200_medium`
149
+ - `c300_low`
150
+
151
+ ### `ipo_status`
152
+
153
+ - `private`
154
+ - `public`
155
+ - `delisted`
156
+
157
+ ### `num_employees_enum`
158
+
159
+ Column exists, but currently sparse/null in this dataset.
160
+
161
+ ## JSONB Array Fields
162
+
163
+ `categories`, `category_groups`, `location_identifiers`, and `location_group_identifiers` are `jsonb` arrays.
164
+
165
+ Do **not** treat them as `text[]` with `&& ARRAY[...]::text[]`.
166
+ Use `jsonb_array_elements_text(...)` with `EXISTS`, for example:
167
+
168
+ ```sql
169
+ AND EXISTS (
170
+ SELECT 1
171
+ FROM jsonb_array_elements_text(categories) AS c(category)
172
+ WHERE category IN ('Health Care', 'Biotechnology')
173
+ )
174
+ ```
175
+
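When the tag list comes from user input or a spreadsheet column, the `EXISTS` fragment can be generated rather than hand-written. A minimal sketch: `jsonbArrayContainsAny` and `quoteLiteral` are hypothetical helpers, and the quoting assumes PostgreSQL's default `standard_conforming_strings = on`, where doubling single quotes is sufficient for a string literal.

```typescript
// The four jsonb array columns listed in the schema.
type JsonbArrayColumn =
  | "categories"
  | "category_groups"
  | "location_identifiers"
  | "location_group_identifiers";

// Escapes a value as a SQL string literal by doubling embedded single quotes.
function quoteLiteral(value: string): string {
  return `'${value.replace(/'/g, "''")}'`;
}

// Builds an EXISTS predicate that is true when the array holds any of `values`.
function jsonbArrayContainsAny(column: JsonbArrayColumn, values: string[]): string {
  if (values.length === 0) throw new Error("values must not be empty");
  const list = values.map(quoteLiteral).join(", ");
  return `EXISTS (SELECT 1 FROM jsonb_array_elements_text(${column}) AS t(item) WHERE item IN (${list}))`;
}
```

Restricting `column` to a union of known names keeps arbitrary identifiers out of the generated SQL; only the values are free-form, and those pass through `quoteLiteral`.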
176
+ ## Recommended Query Patterns
177
+
178
+ | Pattern | Why |
179
+ | ------------------------------------------------------- | ---------------------------------- |
180
+ | Equality / `IN` filters on enum columns | Fast and stable |
181
+ | Date windows on `last_funding_date` | Strong recency control |
182
+ | Numeric ranges on `funding_total_usd` | Good segmentation |
183
+ | `EXISTS + jsonb_array_elements_text` for tags/locations | Works with current schema |
184
+ | Explicit narrow column lists | Lower payload and faster execution |
185
+
186
+ ## Banned / Avoided Patterns
187
+
188
+ | Pattern | Why | Better Alternative |
189
+ | ---------------------------------------------------------------------------- | ----------------------------------- | --------------------------------------------------- |
190
+ | Missing `LIMIT` | Unbounded scans + excessive credits | Always add `LIMIT` |
191
+ | `SELECT *` for production pulls | Larger payload and cost | Select only needed columns |
192
+ | Leading-wildcard scans on long text (`ILIKE '%term%'`) across a broad dataset | Expensive text scans | Use enum/date/range filters first, then narrow text |
193
+ | Heavy aggregations (`COUNT(*)`, `DISTINCT`, wide `GROUP BY`) on large slices | Slow and expensive | Pull scoped rows, aggregate in code |
194
+ | Unscoped global sorts on large sets | Expensive sort operations | Filter first, sort smaller result sets |
195
+ | Multi-table joins for routine prospecting | More planner risk and latency | Stay on lean table only |
196
+
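The "pull scoped rows, aggregate in code" alternative in the table above can be as small as a grouping loop over the returned array. A minimal sketch; `countBy` is a hypothetical helper, and the `"(null)"` bucket label is an arbitrary choice for nullable columns:

```typescript
// Counts rows per distinct value of one column, entirely on the client side.
function countBy(rows: Record<string, unknown>[], column: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const row of rows) {
    const value = row[column];
    // Most columns are nullable, so non-string values share one bucket.
    const key = typeof value === "string" ? value : "(null)";
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```

For example, `countBy(rows, "funding_stage")` over a filtered, `LIMIT`-bounded result gives a per-stage breakdown without running `GROUP BY` on the service.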
197
+ ## Canonical Prospecting Queries
198
+
199
+ ### 1) US early-stage SaaS/AI, currently active
200
+
201
+ ```sql
202
+ SELECT
203
+ name,
204
+ website_url,
205
+ linkedin_url,
206
+ funding_stage,
207
+ num_employees_enum,
208
+ last_funding_date
209
+ FROM public.crunchbase_scraper_lean
210
+ WHERE operating_status = 'active'
211
+ AND funding_stage IN ('seed', 'early_stage_venture')
212
+ AND EXISTS (
213
+ SELECT 1
214
+ FROM jsonb_array_elements_text(categories) AS c(category)
215
+ WHERE category IN ('SaaS', 'Artificial Intelligence (AI)')
216
+ )
217
+ AND EXISTS (
218
+ SELECT 1
219
+ FROM jsonb_array_elements_text(location_identifiers) AS l(location)
220
+ WHERE location = 'United States'
221
+ )
222
+ LIMIT 100;
223
+ ```
224
+
225
+ ### 2) Recently funded (last 12 months)
226
+
227
+ ```sql
228
+ SELECT
229
+ name,
230
+ website_url,
231
+ last_funding_type,
232
+ last_funding_date,
233
+ last_funding_total_usd,
234
+ funding_total_usd
235
+ FROM public.crunchbase_scraper_lean
236
+ WHERE operating_status = 'active'
237
+ AND last_funding_date >= CURRENT_DATE - INTERVAL '12 months'
238
+ AND last_funding_type IN ('seed', 'series_a', 'series_b')
239
+ ORDER BY last_funding_date DESC NULLS LAST
240
+ LIMIT 100;
241
+ ```
242
+
243
+ ### 3) Bay Area fintech companies with meaningful funding
244
+
245
+ ```sql
246
+ SELECT
247
+ name,
248
+ website_url,
249
+ funding_stage,
250
+ funding_total_usd,
251
+ num_employees_enum
252
+ FROM public.crunchbase_scraper_lean
253
+ WHERE operating_status = 'active'
254
+ AND EXISTS (
255
+ SELECT 1
256
+ FROM jsonb_array_elements_text(categories) AS c(category)
257
+ WHERE category IN ('FinTech', 'Financial Services')
258
+ )
259
+ AND EXISTS (
260
+ SELECT 1
261
+ FROM jsonb_array_elements_text(location_group_identifiers) AS g(location_group)
262
+ WHERE location_group = 'San Francisco Bay Area'
263
+ )
264
+ AND funding_total_usd >= 5000000
265
+ LIMIT 75;
266
+ ```
267
+
268
+ ### 4) Non-profits with health focus
269
+
270
+ ```sql
271
+ SELECT
272
+ name,
273
+ website_url,
274
+ company_type,
275
+ categories,
276
+ location_identifiers
277
+ FROM public.crunchbase_scraper_lean
278
+ WHERE company_type = 'non_profit'
279
+ AND EXISTS (
280
+ SELECT 1
281
+ FROM jsonb_array_elements_text(categories) AS c(category)
282
+ WHERE category ILIKE ANY (ARRAY['%health%', '%medical%', '%biotech%', '%pharma%', '%telemedicine%'])
283
+ )
284
+ LIMIT 100;
285
+ ```
286
+
287
+ ### 5) Healthtech seed to series B (safe column set)
288
+
289
+ ```sql
290
+ SELECT
291
+ name,
292
+ website_url,
293
+ linkedin_url,
294
+ short_description,
295
+ funding_stage,
296
+ last_funding_type,
297
+ last_funding_date,
298
+ funding_total_usd,
299
+ num_employees_enum,
300
+ categories,
301
+ location_identifiers,
302
+ num_investors,
303
+ num_funding_rounds
304
+ FROM public.crunchbase_scraper_lean
305
+ WHERE operating_status = 'active'
306
+ AND last_funding_type IN ('seed', 'series_a', 'series_b')
307
+ AND EXISTS (
308
+ SELECT 1
309
+ FROM jsonb_array_elements_text(categories) AS c(category)
310
+ WHERE category ILIKE ANY (ARRAY['%health%', '%medical%', '%biotech%', '%pharma%', '%telemedicine%'])
311
+ )
312
+ ORDER BY last_funding_date DESC NULLS LAST
313
+ LIMIT 100;
314
+ ```
315
+
316
+ ## Usage Pattern (Spreadsheet Code)
317
+
318
+ ```typescript
319
+ const rows = await services.crunchbase.search({
320
+ sql: `
321
+ SELECT name, website_url, linkedin_url
322
+ FROM public.crunchbase_scraper_lean
323
+ WHERE operating_status = 'active'
324
+ LIMIT 20
325
+ `
326
+ });
327
+
328
+ // rows is already an array of objects
329
+ return rows;
330
+ ```
331
+
332
+ ## Troubleshooting
333
+
334
+ - `column "..." does not exist` -> you are using an old or nonexistent column name; check the column list under "Live Schema (Verified)".
335
+ - `only public.crunchbase_scraper_lean is allowed` -> query references a disallowed table.
336
+ - `only SELECT queries are allowed` -> remove `INSERT/UPDATE/DELETE`, keep read-only SQL.
337
+ - Empty results with no error -> usually a value casing mismatch (use lowercase enum values like `active`, `series_a`).
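
Silent empty results from mis-cased enum values can be caught before the query runs by comparing filter values against the observed catalog. A minimal sketch: `ENUM_CATALOG` copies a subset of the values listed in the Enum Catalog section, and `enumValueHint` is a hypothetical helper. Because the catalog records observed values only, a miss is a hint rather than proof that the value is invalid.

```typescript
// Values copied from the Enum Catalog section; only some columns are included.
const ENUM_CATALOG: Record<string, readonly string[]> = {
  operating_status: ["active", "closed"],
  company_type: ["for_profit", "non_profit"],
  funding_stage: ["seed", "early_stage_venture", "m_and_a", "late_stage_venture", "ipo"],
  last_funding_type: ["seed", "series_a", "series_b", "series_c"],
  ipo_status: ["private", "public", "delisted"],
};

// Returns null when the value looks fine, otherwise a human-readable hint.
function enumValueHint(column: string, value: string): string | null {
  const known = ENUM_CATALOG[column];
  if (known === undefined) return null; // not a catalogued enum column
  if (known.includes(value)) return null;
  // Try the lowercase, underscore-separated form used by the dataset.
  const normalized = value.trim().toLowerCase().replace(/[\s-]+/g, "_");
  return known.includes(normalized)
    ? `use '${normalized}' instead of '${value}'`
    : `'${value}' is not an observed value for ${column}`;
}
```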
@@ -2,12 +2,12 @@
2
2
  - **apify**: Run any of 10,000+ Apify actors for web scraping, social media, e-commerce, and more.
3
3
  - **browser**: Kernel browser automation - spin up cloud browsers, execute Playwright code, take screenshots. **Use this for scraping structured lists of repeated data** (e.g., product listings, search results, table rows) where you know the DOM structure. Also ideal for **intercepting network requests** to discover underlying APIs, then paginate those APIs directly in your code (faster & cheaper than clicking through pages). Perfect for JS-heavy sites that don't work with simple HTTP scraping.
4
4
  - **company**: company data (getting employees at the company, getting company data, getting open jobs).
5
+ - **crunchbase**: SQL search over the lean Crunchbase company table (`public.crunchbase_scraper_lean`) for startup prospecting.
5
6
  - **person**: finding a person's LinkedIn URL, enriching it from LinkedIn, contact info, and searching for specific people / groups on LinkedIn
6
7
  - **geo**: parsing address
7
8
  - **googleMaps**: search businesses via Google Maps.
8
9
  - **email**: send transactional notification emails through Orange Slice's managed sender.
9
10
  - **scrape**: website scraper, sitemap scraper
10
11
  - **web**: SERP
11
- - **healthcare**: Query the NPI (National Provider Identifier) database for healthcare organizations by specialty, location, or name. Contains 1.8M+ providers.
12
12
  - **predictLeads**: company intelligence datasets (financing events, technologies, products, job openings, news, and related company data).
13
13
  - **guides**: agent notes & operational docs (see [Error Handling Cheatsheet](../error-handling-cheatsheet.md))
@@ -207,6 +207,38 @@ JOIN linkedin_company lc ON ...
207
207
  - **Use `lp` alias** for person tables
208
208
  - **Default to US**: `lp.location_country_code = 'US'`
209
209
 
210
+ ## Return Type
211
+
212
+ `services.person.linkedin.search()` returns an object envelope:
213
+
214
+ ```typescript
215
+ {
216
+ rows: Record<string, unknown>[];
217
+ count: number;
218
+ }
219
+ ```
220
+
221
+ - `rows`: Result rows from your SQL query, with exactly the columns you selected.
222
+ - `count`: Number of rows returned in `rows`.
223
+
224
+ Example:
225
+
226
+ ```typescript
227
+ const searchResult = await services.person.linkedin.search({
228
+ sql: `
229
+ SELECT
230
+ lp.first_name,
231
+ lp.last_name,
232
+ lp.public_profile_url AS lp_linkedin_url
233
+ FROM linkedin_profile lp
234
+ WHERE lp.location_country_code = 'US'
235
+ LIMIT 10
236
+ `
237
+ });
238
+
239
+ return searchResult.rows; // Most spreadsheet snippets should return rows
240
+ ```
241
+
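Because `services.person.linkedin.search()` returns an envelope while `services.crunchbase.search()` returns a bare array, code that handles both can normalize the result first. A minimal sketch; `unwrapRows` and `RowEnvelope` are hypothetical names, not part of the package:

```typescript
type Row = Record<string, unknown>;

// Shape of the envelope described above.
interface RowEnvelope {
  rows: Row[];
  count: number;
}

// Accepts either return shape and always yields the row array.
function unwrapRows(result: Row[] | RowEnvelope): Row[] {
  return Array.isArray(result) ? result : result.rows;
}
```

With this helper, a spreadsheet snippet can `return unwrapRows(result)` regardless of which service produced `result`.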
210
242
  ---
211
243
 
212
244
  ## Table Aliases
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "orangeslice",
3
- "version": "2.1.0",
3
+ "version": "2.1.2",
4
4
  "description": "B2B LinkedIn database prospector - 1.15B profiles, 85M companies",
5
5
  "main": "dist/index.js",
6
6
  "types": "dist/index.d.ts",