devpost-scraper 0.1.0.tar.gz → 0.3.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -11,3 +11,8 @@ DEVPOST_SESSION=
 # GitHub personal access token for higher API rate limits (5000/hr vs 60/hr)
 # Generate at https://github.com/settings/tokens (no scopes needed, public data only)
 GITHUB_TOKEN=
+
+# Customer.io Track API credentials (required for --emit-events)
+# Find at https://fly.customer.io/settings/api_credentials
+CUSTOMERIO_SITE_ID=
+CUSTOMERIO_API_KEY=
@@ -6,6 +6,8 @@ uv.lock
 start.sh
 .venv/
 __pycache__/
+*.db
 .backboard/
 dist/
-*.egg-info/
+*.egg-info/
+emails/
@@ -0,0 +1,181 @@
+Metadata-Version: 2.4
+Name: devpost-scraper
+Version: 0.3.0
+Summary: CLI for extracting Devpost data with Backboard tool-calling and exporting results to CSV.
+Requires-Python: >=3.11
+Requires-Dist: backboard-sdk>=1.5.9
+Requires-Dist: beautifulsoup4>=4.12.0
+Requires-Dist: httpx>=0.27.0
+Requires-Dist: pydantic>=2.7.0
+Requires-Dist: python-dotenv>=1.0.1
+Description-Content-Type: text/markdown
+
+# Devpost Scraper
+
+CLI toolkit for extracting Devpost hackathon data, enriching participants with emails,
+storing results in SQLite, and emitting Customer.io events.
+
+Three commands:
+
+| Command | Purpose |
+|---|---|
+| `devpost-scraper` | Search Devpost projects by keyword, enrich with emails, export CSV |
+| `devpost-participants` | Scrape a single hackathon's participant list, export CSV |
+| `devpost-harvest` | Walk the hackathon listing, scrape all participants, store in SQLite, emit delta events |
+
+## Requirements
+
+- Python 3.11+
+- [`uv`](https://docs.astral.sh/uv/)
+- A [Backboard](https://app.backboard.io) API key (for `devpost-scraper` only)
+
+## Install
+
+```bash
+uv sync
+```
+
+## Environment
+
+Copy `.env.example` → `.env` and fill in:
+
+| Variable | Required for | Notes |
+|---|---|---|
+| `BACKBOARD_API_KEY` | `devpost-scraper` | Backboard account key |
+| `DEVPOST_ASSISTANT_ID` | auto | Persisted on first run |
+| `DEVPOST_SESSION` | `devpost-participants`, `devpost-harvest` | `_devpost` cookie from browser DevTools |
+| `GITHUB_TOKEN` | optional | GitHub PAT for 5000 req/hr (vs 60). No scopes needed |
+| `CUSTOMERIO_SITE_ID` | `--emit-events` | Customer.io Track API |
+| `CUSTOMERIO_API_KEY` | `--emit-events` | Customer.io Track API |
+
+---
+
+## devpost-scraper
+
+Search Devpost projects by keyword, enrich each with detail page + author email, export CSV.
+
+```bash
+uv run devpost-scraper "ai agents" --output results.csv
+uv run devpost-scraper "climate tech" "developer tools" -o results.csv
+
+# Or via start.sh
+./start.sh "ai agents" --output results.csv
+```
+
+---
+
+## devpost-participants
+
+Scrape a single hackathon's participant list and export to CSV.
+
+```bash
+# First time — pass session cookie
+uv run devpost-participants "https://authorizedtoact.devpost.com/participants" \
+  --jwt "<_devpost cookie value>" -o participants.csv
+
+# Reuse saved session from .env
+uv run devpost-participants "https://authorizedtoact.devpost.com/participants" -o out.csv
+
+# Skip email enrichment
+uv run devpost-participants "https://..." --no-email -o out.csv
+
+# Emit Customer.io events after scrape
+uv run devpost-participants "https://..." --emit-events -o out.csv
+```
+
+---
+
+## devpost-harvest
+
+Automated pipeline: walk the hackathon listing → scrape participants → store in SQLite → emit Customer.io events for delta (new) participants.
+
+### Basic usage
+
+```bash
+# Scrape 3 pages of open hackathons (27 hackathons), enrich new participants, emit events
+uv run devpost-harvest --emit-events
+
+# Fast first run — scrape without email enrichment
+uv run devpost-harvest --no-email
+```
+
+### Flags
+
+| Flag | Default | Description |
+|---|---|---|
+| `--pages N` | `3` | Number of hackathon listing pages to fetch (9 per page) |
+| `--hackathons N` | `0` (all) | Only process the first N hackathons from the listing |
+| `--jwt TOKEN` | `.env` | Devpost `_devpost` session cookie |
+| `--db PATH` | `devpost_harvest.db` | SQLite database path |
+| `--status {open,ended,upcoming}` | `open` | Hackathon status filter (repeatable) |
+| `--max-participants N` | `0` (unlimited) | Cap participants scraped per hackathon |
+| `--no-email` | off | Skip email enrichment entirely (even for new participants) |
+| `--emit-events` | off | Emit Customer.io events for unemitted participants during scrape |
+| `--emit-unsent` | off | Skip scraping — just emit events for all unsent participants in DB |
+| `--rescrape` | off | Re-scrape hackathons already scraped in a previous run |
+
+### How it works
+
+```
+Phase 1: Discover hackathons
+  GET /api/hackathons?status[]=open → paginated JSON listing
+
+Phase 2: Per hackathon
+  2a. Fast scan — scrape all participant pages (no enrichment, ~1 req per 20 participants)
+  2b. Upsert into SQLite → detect delta (new participants not previously in DB)
+  2c. Email-enrich delta only — GitHub API + link walking (skipped with --no-email)
+  2d. Emit Customer.io events for unemitted participants (only with --emit-events)
+```
+
+### Delta logic
+
+On subsequent runs, the fast scan re-fetches participant lists but only new participants
+(not previously in SQLite) get the expensive email enrichment. Already-emitted participants
+are never re-emitted. This makes re-runs fast and safe to repeat.
+
+### Common workflows
+
+```bash
+# Initial bulk scrape (no events yet)
+uv run devpost-harvest --pages 5
+
+# Emit all unsent events from the DB (no scraping, no JWT needed)
+uv run devpost-harvest --emit-unsent
+
+# Quick delta check on first hackathon only
+uv run devpost-harvest --hackathons 1 --rescrape --emit-events
+
+# Re-scan all hackathons for new participants, enrich + emit
+uv run devpost-harvest --rescrape --emit-events
+
+# Include ended hackathons
+uv run devpost-harvest --status open --status ended
+
+# Fast delta scan (skip email enrichment for new participants too)
+uv run devpost-harvest --rescrape --no-email
+```
+
+### SQLite schema
+
+The database (`devpost_harvest.db`) has two tables:
+
+- **`hackathons`** — id, url, title, org, state, dates, registrations, prize, themes.
+  `last_scraped_at` is set after participants are scraped.
+- **`participants`** — (hackathon_url, username) primary key, enrichment fields,
+  `first_seen_at`, `last_seen_at`, `event_emitted_at`.
+
+### Customer.io events
+
+Event name: `devpost_hackathon`. Uses participant email as the Customer.io user ID.
+
+Event data: hackathon_url, hackathon_title, username, name, specialty, profile_url, github_url, linkedin_url.
+
+Email templates in `emails/` use `{{customer.first_name}}` and `{{event.*}}` Liquid variables.
+
+---
+
+## Development
+
+```bash
+uv run python -m devpost_scraper.cli "ai agents" --output out.csv
+```
@@ -0,0 +1,169 @@
+# Devpost Scraper
+
+CLI toolkit for extracting Devpost hackathon data, enriching participants with emails,
+storing results in SQLite, and emitting Customer.io events.
+
+Three commands:
+
+| Command | Purpose |
+|---|---|
+| `devpost-scraper` | Search Devpost projects by keyword, enrich with emails, export CSV |
+| `devpost-participants` | Scrape a single hackathon's participant list, export CSV |
+| `devpost-harvest` | Walk the hackathon listing, scrape all participants, store in SQLite, emit delta events |
+
+## Requirements
+
+- Python 3.11+
+- [`uv`](https://docs.astral.sh/uv/)
+- A [Backboard](https://app.backboard.io) API key (for `devpost-scraper` only)
+
+## Install
+
+```bash
+uv sync
+```
+
+## Environment
+
+Copy `.env.example` → `.env` and fill in:
+
+| Variable | Required for | Notes |
+|---|---|---|
+| `BACKBOARD_API_KEY` | `devpost-scraper` | Backboard account key |
+| `DEVPOST_ASSISTANT_ID` | auto | Persisted on first run |
+| `DEVPOST_SESSION` | `devpost-participants`, `devpost-harvest` | `_devpost` cookie from browser DevTools |
+| `GITHUB_TOKEN` | optional | GitHub PAT for 5000 req/hr (vs 60). No scopes needed |
+| `CUSTOMERIO_SITE_ID` | `--emit-events` | Customer.io Track API |
+| `CUSTOMERIO_API_KEY` | `--emit-events` | Customer.io Track API |
+
+---
+
+## devpost-scraper
+
+Search Devpost projects by keyword, enrich each with detail page + author email, export CSV.
+
+```bash
+uv run devpost-scraper "ai agents" --output results.csv
+uv run devpost-scraper "climate tech" "developer tools" -o results.csv
+
+# Or via start.sh
+./start.sh "ai agents" --output results.csv
+```
+
+---
+
+## devpost-participants
+
+Scrape a single hackathon's participant list and export to CSV.
+
+```bash
+# First time — pass session cookie
+uv run devpost-participants "https://authorizedtoact.devpost.com/participants" \
+  --jwt "<_devpost cookie value>" -o participants.csv
+
+# Reuse saved session from .env
+uv run devpost-participants "https://authorizedtoact.devpost.com/participants" -o out.csv
+
+# Skip email enrichment
+uv run devpost-participants "https://..." --no-email -o out.csv
+
+# Emit Customer.io events after scrape
+uv run devpost-participants "https://..." --emit-events -o out.csv
+```
+
+---
+
+## devpost-harvest
+
+Automated pipeline: walk the hackathon listing → scrape participants → store in SQLite → emit Customer.io events for delta (new) participants.
+
+### Basic usage
+
+```bash
+# Scrape 3 pages of open hackathons (27 hackathons), enrich new participants, emit events
+uv run devpost-harvest --emit-events
+
+# Fast first run — scrape without email enrichment
+uv run devpost-harvest --no-email
+```
+
+### Flags
+
+| Flag | Default | Description |
+|---|---|---|
+| `--pages N` | `3` | Number of hackathon listing pages to fetch (9 per page) |
+| `--hackathons N` | `0` (all) | Only process the first N hackathons from the listing |
+| `--jwt TOKEN` | `.env` | Devpost `_devpost` session cookie |
+| `--db PATH` | `devpost_harvest.db` | SQLite database path |
+| `--status {open,ended,upcoming}` | `open` | Hackathon status filter (repeatable) |
+| `--max-participants N` | `0` (unlimited) | Cap participants scraped per hackathon |
+| `--no-email` | off | Skip email enrichment entirely (even for new participants) |
+| `--emit-events` | off | Emit Customer.io events for unemitted participants during scrape |
+| `--emit-unsent` | off | Skip scraping — just emit events for all unsent participants in DB |
+| `--rescrape` | off | Re-scrape hackathons already scraped in a previous run |
+
+### How it works
+
+```
+Phase 1: Discover hackathons
+  GET /api/hackathons?status[]=open → paginated JSON listing
+
+Phase 2: Per hackathon
+  2a. Fast scan — scrape all participant pages (no enrichment, ~1 req per 20 participants)
+  2b. Upsert into SQLite → detect delta (new participants not previously in DB)
+  2c. Email-enrich delta only — GitHub API + link walking (skipped with --no-email)
+  2d. Emit Customer.io events for unemitted participants (only with --emit-events)
+```
+
+### Delta logic
+
+On subsequent runs, the fast scan re-fetches participant lists but only new participants
+(not previously in SQLite) get the expensive email enrichment. Already-emitted participants
+are never re-emitted. This makes re-runs fast and safe to repeat.
+
+### Common workflows
+
+```bash
+# Initial bulk scrape (no events yet)
+uv run devpost-harvest --pages 5
+
+# Emit all unsent events from the DB (no scraping, no JWT needed)
+uv run devpost-harvest --emit-unsent
+
+# Quick delta check on first hackathon only
+uv run devpost-harvest --hackathons 1 --rescrape --emit-events
+
+# Re-scan all hackathons for new participants, enrich + emit
+uv run devpost-harvest --rescrape --emit-events
+
+# Include ended hackathons
+uv run devpost-harvest --status open --status ended
+
+# Fast delta scan (skip email enrichment for new participants too)
+uv run devpost-harvest --rescrape --no-email
+```
+
+### SQLite schema
+
+The database (`devpost_harvest.db`) has two tables:
+
+- **`hackathons`** — id, url, title, org, state, dates, registrations, prize, themes.
+  `last_scraped_at` is set after participants are scraped.
+- **`participants`** — (hackathon_url, username) primary key, enrichment fields,
+  `first_seen_at`, `last_seen_at`, `event_emitted_at`.
+
+### Customer.io events
+
+Event name: `devpost_hackathon`. Uses participant email as the Customer.io user ID.
+
+Event data: hackathon_url, hackathon_title, username, name, specialty, profile_url, github_url, linkedin_url.
+
+Email templates in `emails/` use `{{customer.first_name}}` and `{{event.*}}` Liquid variables.
+
+---
+
+## Development
+
+```bash
+uv run python -m devpost_scraper.cli "ai agents" --output out.csv
+```
@@ -1,6 +1,6 @@
 [project]
 name = "devpost-scraper"
-version = "0.1.0"
+version = "0.3.0"
 description = "CLI for extracting Devpost data with Backboard tool-calling and exporting results to CSV."
 readme = "README.md"
 requires-python = ">=3.11"
@@ -15,6 +15,7 @@ dependencies = [
 [project.scripts]
 devpost-scraper = "devpost_scraper.cli:main"
 devpost-participants = "devpost_scraper.cli:participants_main"
+devpost-harvest = "devpost_scraper.cli:harvest_main"

 [build-system]
 requires = ["hatchling"]
@@ -17,13 +17,15 @@ from devpost_scraper.backboard_client import (
     ensure_assistant,
     run_in_thread,
 )
+from devpost_scraper.customerio import emit_hackathon_events
 from devpost_scraper.csv_export import write_projects
-from devpost_scraper.models import DevpostProject, HackathonParticipant
+from devpost_scraper.models import DevpostProject, Hackathon, HackathonParticipant
 from devpost_scraper.scraper import (
     find_author_email,
     find_participant_email,
     get_hackathon_participants,
     get_project_details,
+    list_hackathons,
     search_projects,
 )

@@ -228,6 +230,7 @@ async def _run_participants(
     jwt_token: str,
     output: str | None,
     no_email: bool,
+    emit_events: bool = False,
 ) -> None:
     all_participants: list[HackathonParticipant] = []
     page = 1
@@ -303,6 +306,9 @@ async def _run_participants(
         writer.writerows(rows)
     print(buf.getvalue())

+    if emit_events:
+        await emit_hackathon_events(all_participants)
+

 def participants_main() -> None:
     load_dotenv(_ENV_FILE, override=True)
@@ -334,6 +340,12 @@ def participants_main() -> None:
         default=False,
         help="Skip email enrichment (faster)",
     )
+    parser.add_argument(
+        "--emit-events",
+        action="store_true",
+        default=False,
+        help="Emit devpost_hackathon events to Customer.io (requires CUSTOMERIO_SITE_ID and CUSTOMERIO_API_KEY in .env)",
+    )
     args = parser.parse_args()

     if not args.output:
@@ -360,5 +372,289 @@ def participants_main() -> None:
             jwt_token=jwt_token,
             output=args.output,
             no_email=args.no_email,
+            emit_events=args.emit_events,
+        )
+    )
+
+
+# ---------------------------------------------------------------------------
+# devpost-harvest: walk hackathon listing → scrape participants → delta emit
+# ---------------------------------------------------------------------------
+
+async def _run_harvest(
+    pages: int,
+    jwt_token: str,
+    db_path: str,
+    no_email: bool,
+    emit_events: bool,
+    rescrape: bool,
+    max_participants: int = 0,
+    max_hackathons: int = 0,
+    statuses: list[str] | None = None,
+) -> None:
+    from devpost_scraper.db import HarvestDB
+
+    db = HarvestDB(db_path)
+
+    # Phase 1: discover hackathons
+    all_hackathons: list[Hackathon] = []
+    for page in range(1, pages + 1):
+        print(f"[harvest] Fetching hackathon listing page {page}…", file=sys.stderr)
+        data = await list_hackathons(page=page, statuses=statuses)
+        batch = data.get("hackathons", [])
+        if not batch:
+            print(f"[harvest] No hackathons on page {page}, stopping.", file=sys.stderr)
+            break
+        for raw in batch:
+            h = Hackathon(**raw)
+            if h.invite_only:
+                print(f" [skip] invite-only: {h.title}", file=sys.stderr)
+                continue
+            db.upsert_hackathon(h)
+            all_hackathons.append(h)
+            if max_hackathons and len(all_hackathons) >= max_hackathons:
+                break
+        print(f"[harvest] Page {page}: {len(batch)} hackathons ({len(all_hackathons)} total)", file=sys.stderr)
+        if max_hackathons and len(all_hackathons) >= max_hackathons:
+            break
+
+    if not all_hackathons:
+        print("[harvest] No hackathons found.", file=sys.stderr)
+        db.close()
+        return
+
+    # Phase 2: for each hackathon, scrape participants
+    total_new = 0
+    total_emitted = 0
+
+    for h in all_hackathons:
+        if not rescrape and db.hackathon_scraped(h.url):
+            print(f" [cached] {h.title} — already scraped, skipping (use --rescrape to force)", file=sys.stderr)
+            continue
+
+        print(f"\n[harvest] {h.title} ({h.url})", file=sys.stderr)
+        print(f" registrations: {h.registrations_count}, state: {h.open_state}", file=sys.stderr)
+
+        # Phase 2a: fast scan — scrape all participant pages (no enrichment)
+        participants: list[HackathonParticipant] = []
+        ppage = 1
+        while True:
+            try:
+                data = await get_hackathon_participants(h.url, jwt_token, page=ppage)
+            except Exception as exc:
+                print(f" [warn] participants fetch failed page {ppage}: {exc}", file=sys.stderr)
+                break

+            batch = data.get("participants", [])
+            has_more = data.get("has_more", False)
+
+            if not batch:
+                if ppage == 1:
+                    print(f" [info] No participants found (may need auth)", file=sys.stderr)
+                break
+
+            if max_participants and len(participants) + len(batch) > max_participants:
+                batch = batch[:max_participants - len(participants)]
+                has_more = False
+
+            for raw in batch:
+                participants.append(
+                    HackathonParticipant(
+                        hackathon_url=h.url,
+                        hackathon_title=h.title,
+                        username=raw.get("username", ""),
+                        name=raw.get("name", ""),
+                        specialty=raw.get("specialty", ""),
+                        profile_url=raw.get("profile_url", ""),
+                    )
+                )
+
+            if not has_more:
+                break
+            ppage += 1
+
+        if not participants:
+            db.mark_hackathon_scraped(h.url)
+            continue
+
+        print(f" [scan] {len(participants)} participants across {ppage} pages", file=sys.stderr)
+
+        # Phase 2b: upsert → detect delta
+        new_participants = db.upsert_participants(participants)
+        total_new += len(new_participants)
+        print(f" [db] {len(new_participants)} new, {len(participants) - len(new_participants)} existing", file=sys.stderr)
+
+        # Phase 2c: email-enrich only the delta
+        if new_participants and not no_email:
+            print(f" [enrich] enriching {len(new_participants)} new participants…", file=sys.stderr)
+            for p in new_participants:
+                if not p.profile_url:
+                    continue
+                try:
+                    email_data = await find_participant_email(p.profile_url)
+                    p.email = email_data.get("email", "")
+                    p.github_url = email_data.get("github_url", "")
+                    p.linkedin_url = email_data.get("linkedin_url", "")
+                    if p.email:
+                        print(f" [email] {p.email} ← {p.username}", file=sys.stderr)
+                    db.update_participant_enrichment(p)
+                except Exception as exc:
+                    print(f" [warn] enrich failed for {p.username}: {exc}", file=sys.stderr)
+
+        # Phase 2d: emit events for unemitted participants
+        if emit_events:
+            unemitted = db.get_unemitted_participants(h.url)
+            if unemitted:
+                print(f" [cio] Emitting events for {len(unemitted)} unemitted participants…", file=sys.stderr)
+                await emit_hackathon_events(unemitted)
+                for p in unemitted:
+                    db.mark_event_emitted(h.url, p.username)
+                total_emitted += len(unemitted)
+
+        db.mark_hackathon_scraped(h.url)
+
+    # Summary
+    stats = db.stats()
+    print(f"\n{'=' * 60}", file=sys.stderr)
+    print(f"[harvest] Done.", file=sys.stderr)
+    print(f" hackathons in db: {stats['hackathons']}", file=sys.stderr)
+    print(f" participants in db: {stats['participants']}", file=sys.stderr)
+    print(f" with email: {stats['with_email']}", file=sys.stderr)
+    print(f" new this run: {total_new}", file=sys.stderr)
+    print(f" events emitted (total): {stats['events_emitted']}", file=sys.stderr)
+    if total_emitted:
+        print(f" events emitted (this run): {total_emitted}", file=sys.stderr)
+    print(f" db: {db_path}", file=sys.stderr)
+    db.close()
+
+
+async def _run_emit_unsent(db_path: str) -> None:
+    from devpost_scraper.db import HarvestDB
+
+    db = HarvestDB(db_path)
+    unemitted = db.all_unemitted_participants()
+
+    if not unemitted:
+        print("[emit-unsent] No unsent participants with emails in DB.", file=sys.stderr)
+        db.close()
+        return
+
+    print(f"[emit-unsent] {len(unemitted)} participants to emit", file=sys.stderr)
+    await emit_hackathon_events(unemitted)
+
+    for p in unemitted:
+        db.mark_event_emitted(p.hackathon_url, p.username)
+
+    stats = db.stats()
+    print(f"\n[emit-unsent] Done. {len(unemitted)} events emitted.", file=sys.stderr)
+    print(f" events emitted (total): {stats['events_emitted']}", file=sys.stderr)
+    db.close()
+
+
+def harvest_main() -> None:
+    load_dotenv(_ENV_FILE, override=True)
+
+    parser = argparse.ArgumentParser(
+        prog="devpost-harvest",
+        description=(
+            "Walk the Devpost hackathon listing, scrape participants per hackathon, "
+            "store in SQLite, and emit Customer.io events for new (delta) participants."
+        ),
+    )
+    parser.add_argument(
+        "--pages",
+        type=int,
+        default=3,
+        help="Number of hackathon listing pages to fetch (9 hackathons/page, default: 3)",
+    )
+    parser.add_argument(
+        "--jwt",
+        metavar="TOKEN",
+        default=None,
+        help="Value of the _devpost session cookie. Falls back to DEVPOST_SESSION in .env",
+    )
+    parser.add_argument(
+        "--db",
+        metavar="PATH",
+        default="devpost_harvest.db",
+        help="SQLite database path (default: devpost_harvest.db)",
+    )
+    parser.add_argument(
+        "--no-email",
+        action="store_true",
+        default=False,
+        help="Skip email enrichment (much faster)",
+    )
+    parser.add_argument(
+        "--emit-events",
+        action="store_true",
+        default=False,
+        help="Emit Customer.io events for delta participants during scrape",
+    )
+    parser.add_argument(
+        "--emit-unsent",
+        action="store_true",
+        default=False,
+        help="Skip scraping — just emit Customer.io events for all unsent participants in the DB",
+    )
+    parser.add_argument(
+        "--rescrape",
+        action="store_true",
+        default=False,
+        help="Re-scrape hackathons that were already scraped in a previous run",
+    )
+    parser.add_argument(
+        "--max-participants",
+        type=int,
+        default=0,
+        metavar="N",
+        help="Cap participants scraped per hackathon (0 = unlimited, default: 0)",
+    )
+    parser.add_argument(
+        "--hackathons",
+        type=int,
+        default=0,
+        metavar="N",
+        help="Only process the first N hackathons from the listing (0 = all, default: 0)",
+    )
+    parser.add_argument(
+        "--status",
+        action="append",
+        choices=["open", "ended", "upcoming"],
+        default=None,
+        dest="statuses",
+        help="Hackathon status filter (repeatable, default: open). e.g. --status open --status ended",
+    )
+    args = parser.parse_args()
+
+    if args.statuses is None:
+        args.statuses = ["open"]
+
+    if args.emit_unsent:
+        asyncio.run(_run_emit_unsent(db_path=args.db))
+        return
+
+    jwt_token = args.jwt or os.getenv(_PARTICIPANTS_JWT_KEY, "").strip()
+    if not jwt_token:
+        raise SystemExit(
+            "[error] No session cookie. Pass --jwt TOKEN or set DEVPOST_SESSION in .env\n"
+            " Copy the _devpost cookie value from browser DevTools → Application → Cookies"
+        )
+
+    if args.jwt:
+        _ENV_FILE.touch(exist_ok=True)
+        set_key(str(_ENV_FILE), _PARTICIPANTS_JWT_KEY, args.jwt)
+
+    asyncio.run(
+        _run_harvest(
+            pages=args.pages,
+            jwt_token=jwt_token,
+            db_path=args.db,
+            no_email=args.no_email,
+            emit_events=args.emit_events,
+            rescrape=args.rescrape,
+            max_participants=args.max_participants,
+            max_hackathons=args.hackathons,
+            statuses=args.statuses,
         )
     )
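
The `--max-participants` cap in the fast-scan loop above truncates the final batch and then stops pagination. A standalone sketch of that capping logic (the helper name `cap_batch` is illustrative, not part of the package):

```python
def cap_batch(collected: int, batch: list, max_participants: int) -> tuple[list, bool]:
    """Return (possibly truncated batch, whether to keep paginating)."""
    # 0 means unlimited, matching the CLI flag's default.
    if max_participants and collected + len(batch) > max_participants:
        return batch[: max_participants - collected], False
    return batch, True


# Three pages of 20 participants each, capped at 45: 20 + 20 + 5.
pages = [list(range(20)) for _ in range(3)]
total: list = []
for page in pages:
    batch, more = cap_batch(len(total), page, 45)
    total.extend(batch)
    if not more:
        break
print(len(total))  # 45
```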
@@ -0,0 +1,87 @@
+from __future__ import annotations
+
+import logging
+import os
+import sys
+
+import httpx
+
+from devpost_scraper.models import DevpostHackathonEvent, HackathonParticipant
+
+logger = logging.getLogger(__name__)
+
+_TRACK_API_URL = "https://track.customer.io/api/v1"
+_EVENT_NAME = "devpost_hackathon"
+
+
+class CustomerIOService:
+    """Async Customer.io Track API client (httpx + basic auth)."""
+
+    def __init__(self, site_id: str, api_key: str) -> None:
+        self._auth = (site_id, api_key)
+
+    async def identify_user(self, user_id: str, email: str, **attrs: str) -> bool:
+        payload = {"email": email, **{k: v for k, v in attrs.items() if v}}
+        url = f"{_TRACK_API_URL}/customers/{user_id}"
+        async with httpx.AsyncClient() as client:
+            resp = await client.put(url, json=payload, auth=self._auth, timeout=10.0)
+        if resp.status_code == 200:
+            return True
+        logger.error("[cio] identify %s → %s %s", user_id, resp.status_code, resp.text)
+        return False
+
+    async def track_event(self, user_id: str, event_name: str, data: dict) -> bool:
+        url = f"{_TRACK_API_URL}/customers/{user_id}/events"
+        payload = {"name": event_name, "data": data}
+        async with httpx.AsyncClient() as client:
+            resp = await client.post(url, json=payload, auth=self._auth, timeout=10.0)
+        if resp.status_code == 200:
+            return True
+        logger.error("[cio] track %s/%s → %s %s", user_id, event_name, resp.status_code, resp.text)
+        return False
+
+
+def _build_service() -> CustomerIOService:
+    site_id = os.getenv("CUSTOMERIO_SITE_ID", "").strip()
+    api_key = os.getenv("CUSTOMERIO_API_KEY", "").strip()
+    if not site_id or not api_key:
+        raise SystemExit(
+            "[error] CUSTOMERIO_SITE_ID and CUSTOMERIO_API_KEY must be set in .env"
+        )
+    return CustomerIOService(site_id, api_key)
+
+
+async def emit_hackathon_events(participants: list[HackathonParticipant]) -> None:
+    eligible = [p for p in participants if p.email]
+    if not eligible:
+        print("[cio] No participants with emails — skipping event emission", file=sys.stderr)
+        return
+
+    svc = _build_service()
+    sent = 0
+
+    for p in eligible:
+        event = DevpostHackathonEvent(
+            hackathon_url=p.hackathon_url,
+            hackathon_title=p.hackathon_title,
+            username=p.username,
+            name=p.name,
+            specialty=p.specialty,
+            profile_url=p.profile_url,
+            github_url=p.github_url,
+            linkedin_url=p.linkedin_url,
+        )
+
+        name_parts = p.name.split(maxsplit=1)
+        first = name_parts[0] if name_parts else ""
+        last = name_parts[1] if len(name_parts) > 1 else ""
+
+        await svc.identify_user(p.email, email=p.email, first_name=first, last_name=last)
+        ok = await svc.track_event(p.email, _EVENT_NAME, event.model_dump())
+        if ok:
+            sent += 1
+            print(f" [cio] {_EVENT_NAME} → {p.email}", file=sys.stderr)
+        else:
+            print(f" [cio] FAILED {p.email}", file=sys.stderr)
+
+    print(f"[cio] Emitted {sent}/{len(eligible)} events", file=sys.stderr)
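
Per eligible participant, the client above issues two Track API requests: an identify (`PUT /customers/{id}`) and an event (`POST /customers/{id}/events`), both authenticated with HTTP basic auth over the site ID and API key. A stdlib-only sketch of the wire shape, using dummy credentials (all values here are illustrative):

```python
import base64
import json

site_id, api_key = "SITE", "KEY"  # dummy credentials, never real ones
auth_header = "Basic " + base64.b64encode(f"{site_id}:{api_key}".encode()).decode()

# Body of the identify call (PUT /api/v1/customers/{email})
identify_payload = {"email": "ada@example.com", "first_name": "Ada", "last_name": "Lovelace"}

# Body of the event call (POST /api/v1/customers/{email}/events)
event_payload = {
    "name": "devpost_hackathon",
    "data": {"hackathon_url": "https://example.devpost.com", "username": "ada"},
}

print(auth_header)              # Basic U0lURTpLRVk=
print(json.dumps(event_payload, indent=2))
```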
@@ -0,0 +1,224 @@
+from __future__ import annotations
+
+import sqlite3
+from datetime import datetime, timezone
+from pathlib import Path
+
+from devpost_scraper.models import Hackathon, HackathonParticipant
+
+_DEFAULT_DB = "devpost_harvest.db"
+
+_SCHEMA = """\
+CREATE TABLE IF NOT EXISTS hackathons (
+    id INTEGER PRIMARY KEY,
+    url TEXT UNIQUE NOT NULL,
+    title TEXT,
+    organization_name TEXT,
+    open_state TEXT,
+    submission_period_dates TEXT,
+    registrations_count INTEGER,
+    prize_amount TEXT,
+    themes TEXT,
+    invite_only INTEGER,
+    first_seen_at TEXT NOT NULL,
+    last_scraped_at TEXT
+);
+
+CREATE TABLE IF NOT EXISTS participants (
+    hackathon_url TEXT NOT NULL,
+    hackathon_title TEXT,
+    username TEXT NOT NULL,
+    name TEXT,
+    specialty TEXT,
+    profile_url TEXT,
+    github_url TEXT,
+    linkedin_url TEXT,
+    email TEXT,
+    first_seen_at TEXT NOT NULL,
+    last_seen_at TEXT NOT NULL,
+    event_emitted_at TEXT,
+    PRIMARY KEY (hackathon_url, username)
+);
+"""
+
+
+class HarvestDB:
+    def __init__(self, db_path: str = _DEFAULT_DB) -> None:
+        self._path = Path(db_path)
+        self._conn = sqlite3.connect(str(self._path))
+        self._conn.row_factory = sqlite3.Row
+        self._conn.executescript(_SCHEMA)
+        self._migrate()
+
+    def _migrate(self) -> None:
+        cols = {r[1] for r in self._conn.execute("PRAGMA table_info(participants)").fetchall()}
+        if "hackathon_title" not in cols:
+            self._conn.execute("ALTER TABLE participants ADD COLUMN hackathon_title TEXT")
+            self._conn.commit()
+
+    def close(self) -> None:
+        self._conn.close()
+
+    def upsert_hackathon(self, h: Hackathon) -> None:
+        now = _now_iso()
+        self._conn.execute(
+            """INSERT INTO hackathons
+               (id, url, title, organization_name, open_state,
+                submission_period_dates, registrations_count, prize_amount,
+                themes, invite_only, first_seen_at)
+               VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+               ON CONFLICT(id) DO UPDATE SET
+                   title=excluded.title,
+                   organization_name=excluded.organization_name,
+                   open_state=excluded.open_state,
+                   submission_period_dates=excluded.submission_period_dates,
+                   registrations_count=excluded.registrations_count,
+                   prize_amount=excluded.prize_amount,
+                   themes=excluded.themes,
+                   invite_only=excluded.invite_only
+            """,
+            (
+                h.id, h.url, h.title, h.organization_name, h.open_state,
+                h.submission_period_dates, h.registrations_count, h.prize_amount,
+                h.themes, int(h.invite_only), now,
+            ),
+        )
+        self._conn.commit()
+
+    def upsert_participants(
+        self, participants: list[HackathonParticipant],
+    ) -> list[HackathonParticipant]:
+        """Insert or update participants. Returns only the NEW ones (not previously seen)."""
+        now = _now_iso()
+        new: list[HackathonParticipant] = []
+
+        for p in participants:
+            existing = self._conn.execute(
+                "SELECT 1 FROM participants WHERE hackathon_url=? AND username=?",
+                (p.hackathon_url, p.username),
+            ).fetchone()
+
+            if existing is None:
+                new.append(p)
+                self._conn.execute(
+                    """INSERT INTO participants
+                       (hackathon_url, hackathon_title, username, name, specialty,
+                        profile_url, github_url, linkedin_url, email,
+                        first_seen_at, last_seen_at)
+                       VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+                    """,
+                    (
+                        p.hackathon_url, p.hackathon_title, p.username, p.name,
+                        p.specialty, p.profile_url, p.github_url, p.linkedin_url,
+                        p.email, now, now,
+                    ),
+                )
+            else:
+                self._conn.execute(
+                    """UPDATE participants
+                       SET hackathon_title=?, name=?, specialty=?, profile_url=?,
+                           github_url=?, linkedin_url=?, email=?, last_seen_at=?
+                       WHERE hackathon_url=? AND username=?
+                    """,
+                    (
+                        p.hackathon_title, p.name, p.specialty, p.profile_url,
+                        p.github_url, p.linkedin_url, p.email, now,
+                        p.hackathon_url, p.username,
+                    ),
+                )
+
+        self._conn.commit()
+        return new
+
+    def update_participant_enrichment(self, p: HackathonParticipant) -> None:
+        """Update email/github/linkedin fields for an existing participant."""
+        self._conn.execute(
+            """UPDATE participants
+               SET email=?, github_url=?, linkedin_url=?, last_seen_at=?
+               WHERE hackathon_url=? AND username=?""",
+ WHERE hackathon_url=? AND username=?""",
139
+ (p.email, p.github_url, p.linkedin_url, _now_iso(), p.hackathon_url, p.username),
140
+ )
141
+ self._conn.commit()
142
+
143
+ def mark_event_emitted(self, hackathon_url: str, username: str) -> None:
144
+ self._conn.execute(
145
+ "UPDATE participants SET event_emitted_at=? WHERE hackathon_url=? AND username=?",
146
+ (_now_iso(), hackathon_url, username),
147
+ )
148
+ self._conn.commit()
149
+
150
+ def get_unemitted_participants(self, hackathon_url: str) -> list[HackathonParticipant]:
151
+ """Return participants that have an email but haven't had events emitted yet."""
152
+ rows = self._conn.execute(
153
+ """SELECT * FROM participants
154
+ WHERE hackathon_url=? AND email != '' AND event_emitted_at IS NULL""",
155
+ (hackathon_url,),
156
+ ).fetchall()
157
+ return [
158
+ HackathonParticipant(
159
+ hackathon_url=r["hackathon_url"],
160
+ hackathon_title=r["hackathon_title"] or "",
161
+ username=r["username"],
162
+ name=r["name"] or "",
163
+ specialty=r["specialty"] or "",
164
+ profile_url=r["profile_url"] or "",
165
+ github_url=r["github_url"] or "",
166
+ linkedin_url=r["linkedin_url"] or "",
167
+ email=r["email"] or "",
168
+ )
169
+ for r in rows
170
+ ]
171
+
172
+ def all_unemitted_participants(self) -> list[HackathonParticipant]:
173
+ """Return all participants across all hackathons with email but no event emitted."""
174
+ rows = self._conn.execute(
175
+ "SELECT * FROM participants WHERE email != '' AND event_emitted_at IS NULL"
176
+ ).fetchall()
177
+ return [
178
+ HackathonParticipant(
179
+ hackathon_url=r["hackathon_url"],
180
+ hackathon_title=r["hackathon_title"] or "",
181
+ username=r["username"],
182
+ name=r["name"] or "",
183
+ specialty=r["specialty"] or "",
184
+ profile_url=r["profile_url"] or "",
185
+ github_url=r["github_url"] or "",
186
+ linkedin_url=r["linkedin_url"] or "",
187
+ email=r["email"] or "",
188
+ )
189
+ for r in rows
190
+ ]
191
+
192
+ def hackathon_scraped(self, hackathon_url: str) -> bool:
193
+ row = self._conn.execute(
194
+ "SELECT last_scraped_at FROM hackathons WHERE url=?",
195
+ (hackathon_url,),
196
+ ).fetchone()
197
+ return row is not None and row["last_scraped_at"] is not None
198
+
199
+ def mark_hackathon_scraped(self, hackathon_url: str) -> None:
200
+ self._conn.execute(
201
+ "UPDATE hackathons SET last_scraped_at=? WHERE url=?",
202
+ (_now_iso(), hackathon_url),
203
+ )
204
+ self._conn.commit()
205
+
206
+ def stats(self) -> dict[str, int]:
207
+ hcount = self._conn.execute("SELECT COUNT(*) FROM hackathons").fetchone()[0]
208
+ pcount = self._conn.execute("SELECT COUNT(*) FROM participants").fetchone()[0]
209
+ emitted = self._conn.execute(
210
+ "SELECT COUNT(*) FROM participants WHERE event_emitted_at IS NOT NULL"
211
+ ).fetchone()[0]
212
+ with_email = self._conn.execute(
213
+ "SELECT COUNT(*) FROM participants WHERE email != ''"
214
+ ).fetchone()[0]
215
+ return {
216
+ "hackathons": hcount,
217
+ "participants": pcount,
218
+ "with_email": with_email,
219
+ "events_emitted": emitted,
220
+ }
221
+
222
+
223
+ def _now_iso() -> str:
224
+ return datetime.now(timezone.utc).isoformat()
@@ -3,10 +3,26 @@ from __future__ import annotations
3
3
  from pydantic import BaseModel, ConfigDict
4
4
 
5
5
 
6
+ class Hackathon(BaseModel):
7
+ model_config = ConfigDict(extra="ignore")
8
+
9
+ id: int
10
+ url: str
11
+ title: str = ""
12
+ organization_name: str = ""
13
+ open_state: str = ""
14
+ submission_period_dates: str = ""
15
+ registrations_count: int = 0
16
+ prize_amount: str = ""
17
+ themes: str = ""
18
+ invite_only: bool = False
19
+
20
+
6
21
  class HackathonParticipant(BaseModel):
7
22
  model_config = ConfigDict(extra="ignore")
8
23
 
9
24
  hackathon_url: str = ""
25
+ hackathon_title: str = ""
10
26
  username: str = ""
11
27
  name: str = ""
12
28
  specialty: str = ""
@@ -17,7 +33,20 @@ class HackathonParticipant(BaseModel):
17
33
 
18
34
  @classmethod
19
35
  def fieldnames(cls) -> list[str]:
20
- return ["hackathon_url", "username", "name", "specialty", "profile_url", "github_url", "linkedin_url", "email"]
36
+ return ["hackathon_url", "hackathon_title", "username", "name", "specialty", "profile_url", "github_url", "linkedin_url", "email"]
37
+
38
+
39
+ class DevpostHackathonEvent(BaseModel):
40
+ """Payload for the Customer.io ``devpost_hackathon`` event."""
41
+
42
+ hackathon_url: str
43
+ hackathon_title: str
44
+ username: str
45
+ name: str
46
+ specialty: str
47
+ profile_url: str
48
+ github_url: str
49
+ linkedin_url: str
21
50
 
22
51
 
23
52
  class DevpostProject(BaseModel):
@@ -23,6 +23,7 @@ _WALKABLE_DOMAINS = {
23
23
  }
24
24
 
25
25
  _SEARCH_URL = "https://devpost.com/software/search"
26
+ _HACKATHONS_API_URL = "https://devpost.com/api/hackathons"
26
27
  _GITHUB_API_URL = "https://api.github.com/users"
27
28
 
28
29
 
@@ -83,6 +84,58 @@ async def search_projects(query: str, page: int = 1) -> dict[str, Any]:
83
84
  }
84
85
 
85
86
 
87
+ async def list_hackathons(
88
+ page: int = 1,
89
+ statuses: list[str] | None = None,
90
+ ) -> dict[str, Any]:
91
+ """Fetch one page of hackathons from the Devpost API.
92
+ Returns {"hackathons": [...], "total_count": int, "per_page": int}.
93
+ """
94
+ if statuses is None:
95
+ statuses = ["open"]
96
+
97
+ params: list[tuple[str, str]] = [("page", str(page))]
98
+ for s in statuses:
99
+ params.append(("status[]", s))
100
+
101
+ async with httpx.AsyncClient(timeout=30.0, follow_redirects=True) as client:
102
+ resp = await client.get(
103
+ _HACKATHONS_API_URL,
104
+ params=params,
105
+ headers={**_JSON_HEADERS, "Accept": "application/json"},
106
+ )
107
+ resp.raise_for_status()
108
+ data = resp.json()
109
+
110
+ hackathons = []
111
+ for item in data.get("hackathons", []):
112
+ themes = item.get("themes") or []
113
+ theme_names = ", ".join(t.get("name", "") for t in themes if t.get("name"))
114
+
115
+ prize_raw = item.get("prize_amount", "") or ""
116
+ prize_clean = re.sub(r"<[^>]+>", "", prize_raw).strip()
117
+
118
+ hackathons.append({
119
+ "id": item.get("id", 0),
120
+ "title": item.get("title", ""),
121
+ "url": item.get("url", "").rstrip("/"),
122
+ "organization_name": item.get("organization_name", ""),
123
+ "open_state": item.get("open_state", ""),
124
+ "submission_period_dates": item.get("submission_period_dates", ""),
125
+ "registrations_count": item.get("registrations_count", 0),
126
+ "prize_amount": prize_clean,
127
+ "themes": theme_names,
128
+ "invite_only": bool(item.get("invite_only")),
129
+ })
130
+
131
+ meta = data.get("meta", {})
132
+ return {
133
+ "hackathons": hackathons,
134
+ "total_count": meta.get("total_count", 0),
135
+ "per_page": meta.get("per_page", 9),
136
+ }
137
+
138
+
86
139
  async def get_project_details(url: str) -> dict[str, Any]:
87
140
  """Fetch a Devpost project page and extract detail fields."""
88
141
  async with httpx.AsyncClient(timeout=30.0, follow_redirects=True) as client:
@@ -1,101 +0,0 @@
1
- Metadata-Version: 2.4
2
- Name: devpost-scraper
3
- Version: 0.1.0
4
- Summary: CLI for extracting Devpost data with Backboard tool-calling and exporting results to CSV.
5
- Requires-Python: >=3.11
6
- Requires-Dist: backboard-sdk>=1.5.9
7
- Requires-Dist: beautifulsoup4>=4.12.0
8
- Requires-Dist: httpx>=0.27.0
9
- Requires-Dist: pydantic>=2.7.0
10
- Requires-Dist: python-dotenv>=1.0.1
11
- Description-Content-Type: text/markdown
12
-
13
- # Devpost Scraper
14
-
15
- CLI for extracting Devpost project data with a Backboard assistant that can call a Devpost MCP tool server and export structured results to CSV.
16
-
17
- ## Requirements
18
-
19
- - Python 3.11+
20
- - `uv`
21
- - Node.js / `npx` available on your machine
22
- - A Backboard API key
23
-
24
- ## Environment
25
-
26
- Create a `.env` file from `.env.example` and set:
27
-
28
- - `BACKBOARD_API_KEY`
29
- - `BACKBOARD_MODEL` (optional)
30
- - `DEVPOST_ASSISTANT_NAME` (optional)
31
-
32
- ## MCP server
33
-
34
- This project is designed to use a Devpost MCP server with this configuration:
35
-
36
- ```json
37
- {
38
- "mcpServers": {
39
- "devpost": {
40
- "command": "npx",
41
- "args": ["devpost-mcp-server"]
42
- }
43
- }
44
- }
45
- ```
46
-
47
- ## Install
48
-
49
- ```bash
50
- uv sync
51
- ```
52
-
53
- ## Run
54
-
55
- ```bash
56
- uv run devpost-scraper "ai agents" --output ai_agents.csv
57
- uv run devpost-scraper "developer tools" "climate tech" --output results.csv
58
- ```
59
-
60
- You can also use the startup script:
61
-
62
- ```bash
63
- ./start.sh "ai agents" --output ai_agents.csv
64
- ```
65
-
66
- ## What it does
67
-
68
- 1. Creates or reuses a Backboard assistant configured for Devpost extraction.
69
- 2. Creates a thread for the run.
70
- 3. Sends a prompt that asks the assistant to use the Devpost MCP toolset.
71
- 4. Handles tool-calling loops until the assistant returns completed structured content.
72
- 5. Parses the structured JSON result.
73
- 6. Writes the extracted rows to CSV.
74
-
75
- ## Expected output shape
76
-
77
- Each extracted row should contain fields like:
78
-
79
- - `search_term`
80
- - `project_title`
81
- - `tagline`
82
- - `project_url`
83
- - `hackathon_name`
84
- - `hackathon_url`
85
- - `summary`
86
- - `built_with`
87
- - `prizes`
88
- - `submission_date`
89
- - `team_size`
90
-
91
- ## Notes
92
-
93
- - The CLI is intentionally API-heavy and UI-free.
94
- - The Backboard assistant must have access to the Devpost MCP tools in the environment where it runs.
95
- - If your Backboard account or environment requires additional tool registration, wire that into the assistant creation flow in the client module.
96
-
97
- ## Development
98
-
99
- ```bash
100
- uv run python -m devpost_scraper.cli "ai agents" --output out.csv
101
- ```
@@ -1,89 +0,0 @@
1
- # Devpost Scraper
2
-
3
- CLI for extracting Devpost project data with a Backboard assistant that can call a Devpost MCP tool server and export structured results to CSV.
4
-
5
- ## Requirements
6
-
7
- - Python 3.11+
8
- - `uv`
9
- - Node.js / `npx` available on your machine
10
- - A Backboard API key
11
-
12
- ## Environment
13
-
14
- Create a `.env` file from `.env.example` and set:
15
-
16
- - `BACKBOARD_API_KEY`
17
- - `BACKBOARD_MODEL` (optional)
18
- - `DEVPOST_ASSISTANT_NAME` (optional)
19
-
20
- ## MCP server
21
-
22
- This project is designed to use a Devpost MCP server with this configuration:
23
-
24
- ```json
25
- {
26
- "mcpServers": {
27
- "devpost": {
28
- "command": "npx",
29
- "args": ["devpost-mcp-server"]
30
- }
31
- }
32
- }
33
- ```
34
-
35
- ## Install
36
-
37
- ```bash
38
- uv sync
39
- ```
40
-
41
- ## Run
42
-
43
- ```bash
44
- uv run devpost-scraper "ai agents" --output ai_agents.csv
45
- uv run devpost-scraper "developer tools" "climate tech" --output results.csv
46
- ```
47
-
48
- You can also use the startup script:
49
-
50
- ```bash
51
- ./start.sh "ai agents" --output ai_agents.csv
52
- ```
53
-
54
- ## What it does
55
-
56
- 1. Creates or reuses a Backboard assistant configured for Devpost extraction.
57
- 2. Creates a thread for the run.
58
- 3. Sends a prompt that asks the assistant to use the Devpost MCP toolset.
59
- 4. Handles tool-calling loops until the assistant returns completed structured content.
60
- 5. Parses the structured JSON result.
61
- 6. Writes the extracted rows to CSV.
62
-
63
- ## Expected output shape
64
-
65
- Each extracted row should contain fields like:
66
-
67
- - `search_term`
68
- - `project_title`
69
- - `tagline`
70
- - `project_url`
71
- - `hackathon_name`
72
- - `hackathon_url`
73
- - `summary`
74
- - `built_with`
75
- - `prizes`
76
- - `submission_date`
77
- - `team_size`
78
-
79
- ## Notes
80
-
81
- - The CLI is intentionally API-heavy and UI-free.
82
- - The Backboard assistant must have access to the Devpost MCP tools in the environment where it runs.
83
- - If your Backboard account or environment requires additional tool registration, wire that into the assistant creation flow in the client module.
84
-
85
- ## Development
86
-
87
- ```bash
88
- uv run python -m devpost_scraper.cli "ai agents" --output out.csv
89
- ```