prospector_engine 0.1.1

Files changed (59)
  1. checksums.yaml +7 -0
  2. data/MIT-LICENSE +20 -0
  3. data/README.md +333 -0
  4. data/Rakefile +9 -0
  5. data/app/CLAUDE.md +43 -0
  6. data/app/assets/stylesheets/prospector/application.css +476 -0
  7. data/app/controllers/prospector/application_controller.rb +16 -0
  8. data/app/controllers/prospector/candidates_controller.rb +31 -0
  9. data/app/controllers/prospector/keyword_generations_controller.rb +10 -0
  10. data/app/controllers/prospector/keywords_controller.rb +38 -0
  11. data/app/controllers/prospector/run_bulk_approvals_controller.rb +13 -0
  12. data/app/controllers/prospector/run_cancellations_controller.rb +9 -0
  13. data/app/controllers/prospector/run_reclassifications_controller.rb +21 -0
  14. data/app/controllers/prospector/run_restarts_controller.rb +14 -0
  15. data/app/controllers/prospector/run_retries_controller.rb +14 -0
  16. data/app/controllers/prospector/runs_controller.rb +47 -0
  17. data/app/jobs/prospector/application_job.rb +5 -0
  18. data/app/jobs/prospector/bulk_approve_job.rb +14 -0
  19. data/app/jobs/prospector/classify_job.rb +17 -0
  20. data/app/jobs/prospector/fetch_job.rb +8 -0
  21. data/app/models/prospector/application_record.rb +6 -0
  22. data/app/models/prospector/candidate.rb +93 -0
  23. data/app/models/prospector/classification_run.rb +15 -0
  24. data/app/models/prospector/keyword.rb +16 -0
  25. data/app/models/prospector/run.rb +94 -0
  26. data/app/views/prospector/candidates/show.html.erb +63 -0
  27. data/app/views/prospector/keywords/index.html.erb +72 -0
  28. data/app/views/prospector/layouts/prospector.html.erb +38 -0
  29. data/app/views/prospector/runs/index.html.erb +33 -0
  30. data/app/views/prospector/runs/new.html.erb +109 -0
  31. data/app/views/prospector/runs/show.html.erb +111 -0
  32. data/config/routes.rb +15 -0
  33. data/db/prospector_schema.rb +81 -0
  34. data/lib/generators/prospector/install/install_generator.rb +31 -0
  35. data/lib/generators/prospector/install/templates/create_prospector_tables.rb +83 -0
  36. data/lib/generators/prospector/install/templates/prospector.rb +37 -0
  37. data/lib/prospector/CLAUDE.md +52 -0
  38. data/lib/prospector/classification/runner.rb +105 -0
  39. data/lib/prospector/configuration.rb +56 -0
  40. data/lib/prospector/engine.rb +18 -0
  41. data/lib/prospector/enrichment/contact_scraper.rb +188 -0
  42. data/lib/prospector/error.rb +8 -0
  43. data/lib/prospector/geography/base.rb +40 -0
  44. data/lib/prospector/geography/bounding_box.rb +58 -0
  45. data/lib/prospector/geography/city.rb +29 -0
  46. data/lib/prospector/geography/coordinates.rb +43 -0
  47. data/lib/prospector/geography/metro_area.rb +74 -0
  48. data/lib/prospector/geography/zip_code.rb +25 -0
  49. data/lib/prospector/keywords/generator.rb +74 -0
  50. data/lib/prospector/pipeline/normalizer.rb +57 -0
  51. data/lib/prospector/pipeline/orchestrator.rb +151 -0
  52. data/lib/prospector/sources/base.rb +13 -0
  53. data/lib/prospector/sources/google_places/adapter.rb +92 -0
  54. data/lib/prospector/sources/google_places/client.rb +58 -0
  55. data/lib/prospector/sources/google_places/us_address_validator.rb +24 -0
  56. data/lib/prospector/sources/result.rb +21 -0
  57. data/lib/prospector/version.rb +3 -0
  58. data/lib/prospector.rb +20 -0
  59. metadata +185 -0
checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: 46798902589ee83ac3a63b7cf36fcb67b61d4a84fbf2ac07819b04a778439099
4
+ data.tar.gz: 87cb71c8d3c832156306184d31f9a7d3e8f45343e33bceb266ae3cc095a61130
5
+ SHA512:
6
+ metadata.gz: 68982b0d84db8931d2e691f365d0032a5bfbff4f0f9ea40fcafcabeab8e0783699bb8c6e202d2529c083fa639e22ac77aae961978ddb7b0663d4414f9f255868
7
+ data.tar.gz: 1876729cef3a17d3a09a717de8ef392c4e67a611b6d164d36b0b985e37dda846f16d3814273dabbfe551a01adae5d9b6999f577b88ce82a389bc102ba57d46de
data/MIT-LICENSE ADDED
@@ -0,0 +1,20 @@
1
+ Copyright 2026 AxiumFoundry
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,333 @@
1
+ # Prospector
2
+
3
+ A Rails engine for discovering businesses from multiple sources with AI-powered keyword generation and classification.
4
+
5
+ Prospector handles the full discovery pipeline: generate search keywords with AI, fetch business listings from external APIs, classify results for domain relevance, and present an admin review interface -- all inside your existing Rails app.
6
+
7
+ ## Features
8
+
9
+ - **Multi-source adapters** -- Ships with Google Places. Pluggable interface for adding Yelp, Bing, OpenStreetMap, or custom sources.
10
+ - **AI keyword generation** -- On-demand LLM-powered keyword generation for any business domain. Keywords are stored and reused across runs.
11
+ - **AI classification** -- Automatically classifies discovered businesses for domain relevance. Non-relevant results are auto-rejected.
12
+ - **Flexible geography** -- Search by metro area, city, coordinates + radius, ZIP code, or bounding box.
13
+ - **Admin UI** -- Mountable admin interface with self-contained CSS. Review candidates, approve/reject, bulk approve, trigger reclassification.
14
+ - **Keyword management** -- Admin UI for viewing, adding, toggling, and AI-generating search keywords per category.
15
+ - **Background jobs** -- Fetch, classify, and bulk approve jobs integrate with your existing queue (Solid Queue, Sidekiq, etc.).
16
+ - **Turbo Streams** -- Real-time updates when Turbo is present. Gracefully degrades without it.
17
+ - **Domain-agnostic** -- Works for motorcycle services, aviation operators, restaurants, or any business vertical.
18
+
19
+ ## Requirements
20
+
21
+ - Ruby >= 3.1
22
+ - Rails >= 7.1
23
+ - PostgreSQL (uses `jsonb` columns)
24
+ - A Google Places API key (for the built-in adapter)
25
+ - An LLM API key compatible with [ruby_llm](https://github.com/crmne/ruby_llm) (for keyword generation and classification)
26
+ - A classifier class inheriting from [LlmClassifier::Classifier](https://github.com/AxiumFoundry/llm_classifier)
27
+
28
+ ## Installation
29
+
30
+ Add to your Gemfile:
31
+
32
+ ```ruby
33
+ gem "prospector_engine", github: "AxiumFoundry/prospector_engine"
34
+ ```
35
+
36
+ Run the install generator:
37
+
38
+ ```bash
39
+ bin/rails generate prospector:install
40
+ bin/rails db:migrate
41
+ ```
42
+
43
+ This creates:
44
+ - `config/initializers/prospector.rb` -- configuration file
45
+ - A migration for the four Prospector tables
46
+ - A route mounting the engine at `/prospector`
47
+
48
+ ## Configuration
49
+
50
+ Edit `config/initializers/prospector.rb`:
51
+
52
+ ```ruby
53
+ Prospector.configure do |config|
54
+ # Required: domain slug for keyword generation and classification context
55
+ config.domain = "motorcycle_services"
56
+
57
+ # Required: called when an admin approves a candidate.
58
+ # Receives a Prospector::Candidate instance.
59
+ # Runs AFTER the candidate status is committed (safe to enqueue jobs).
60
+ config.on_approve do |candidate|
61
+ data = candidate.normalized_data
62
+ Business.create!(
63
+ business_name: data["business_name"],
64
+ street_address: data["street_address"],
65
+ city: data["city"],
66
+ state: data["state"],
67
+ zip_code: data["zip_code"],
68
+ phone_number: data["phone_number"],
69
+ website: data["website"],
70
+ latitude: data["latitude"],
71
+ longitude: data["longitude"],
72
+ service_types: candidate.llm_categories,
73
+ import_source: "prospector"
74
+ )
75
+ end
76
+
77
+ # Required: admin authentication.
78
+ # Receives the controller instance.
79
+ config.authenticate_admin_with do |controller|
80
+ controller.current_user&.admin?
81
+ end
82
+
83
+ # Required: classifier class (inherits LlmClassifier::Classifier).
84
+ # Defines categories, model, and classification rules for your domain.
85
+ # See "Defining a Classifier" below.
86
+ config.classifier = MotorcycleClassifier
87
+
88
+ # Optional: check for duplicates before creating candidates.
89
+ # Return true to skip the candidate.
90
+ config.duplicate_check do |source_uid:, name:, **|
91
+ Business.exists?(["import_metadata->>'place_id' = ?", source_uid])
92
+ end
93
+
94
+ # Optional: default source adapter (default: :google_places)
95
+ config.default_source = :google_places
96
+
97
+ # Optional: default AI model (passed to classifier when not overridden)
98
+ config.default_classifier_model = "anthropic:claude-sonnet-4-20250514"
99
+
100
+ # Optional: job queue name (default: :default)
101
+ config.queue_name = :prospector
102
+ end
103
+ ```
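The initializer above follows the common Ruby pattern of a `configure` method that yields a settings object, where calls such as `config.on_approve do ... end` store a block for later use. The sketch below shows that pattern in plain Ruby. It is an illustration only: `DemoEngine` and its `Configuration` class are hypothetical stand-ins, not Prospector's actual code, and only the option names `domain`, `queue_name`, and `on_approve` come from the README.

```ruby
# Hypothetical stand-in for an engine's configuration object.
module DemoEngine
  class Configuration
    attr_accessor :domain, :queue_name

    def initialize
      @queue_name = :default # mirrors the documented default queue name
    end

    # With a block: store it. Without a block: return the stored callback.
    def on_approve(&block)
      @on_approve = block if block
      @on_approve
    end
  end

  # Memoized settings object shared by the whole process.
  def self.config
    @config ||= Configuration.new
  end

  # Yields the settings object so an initializer can fill it in.
  def self.configure
    yield config
  end
end

DemoEngine.configure do |config|
  config.domain = "motorcycle_services"
  config.on_approve { |candidate| "approved #{candidate}" }
end

# Later, the engine would invoke the stored block with the approved record.
result = DemoEngine.config.on_approve.call("Candidate#1")
```

Storing the block rather than running it lets the host app decide what approval means while the engine decides when to call it.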
104
+
105
+ Set the required environment variables:
106
+
107
+ ```bash
108
+ GOOGLE_MAPS_API_KEY=your_google_places_api_key
109
+ ANTHROPIC_API_KEY=your_anthropic_api_key # or whatever ruby_llm needs
110
+ ```
111
+
112
+ ## Usage
113
+
114
+ ### Admin UI
115
+
116
+ Visit `/prospector` in your browser. The admin interface provides:
117
+
118
+ - **Runs** -- Create, monitor, retry, restart, and cancel discovery runs
119
+ - **Candidates** -- Review discovered businesses, approve or reject individually or in bulk
120
+ - **Keywords** -- View, add, toggle, and AI-generate search keywords per category
121
+
122
+ ### Creating a Run
123
+
124
+ 1. Click "New Run" in the admin UI
125
+ 2. Select a geography type (metro area, city, coordinates, ZIP code, or bounding box)
126
+ 3. Fill in the geography details
127
+ 4. Click "Start Run"
128
+
129
+ The run progresses through these stages automatically:
130
+
131
+ ```
132
+ pending -> running (fetching from source) -> classifying (AI classification) -> completed
133
+ ```
134
+
135
+ ### Defining a Classifier
136
+
137
+ Prospector uses [llm_classifier](https://github.com/AxiumFoundry/llm_classifier) for AI classification. Define a classifier for your domain:
138
+
139
+ ```ruby
140
+ # app/classifiers/motorcycle_classifier.rb
141
+ class MotorcycleClassifier < LlmClassifier::Classifier
142
+ categories :mechanic, :instructor, :gear, :dealership, :storage,
143
+ :detailing, :training, :towing, :insurance, :touring,
144
+ :clubs, :parts
145
+
146
+ model "anthropic:claude-sonnet-4-20250514"
147
+ multi_label true
148
+ require_categories true # fail when no categories match (auto-reject)
149
+
150
+ system_prompt <<~PROMPT
151
+ You are classifying businesses for the motorcycle services domain.
152
+ Determine which categories this business belongs to based on its
153
+ name, address, website, and any other available information.
154
+ PROMPT
155
+
156
+ knowledge do
157
+ motorcycle_brands %w[Harley-Davidson Honda Yamaha Kawasaki Suzuki Ducati Triumph]
158
+ service_descriptions({
159
+ mechanic: "Motorcycle repair and maintenance",
160
+ instructor: "Motorcycle riding lessons and schools",
161
+ gear: "Motorcycle apparel and equipment retail"
162
+ })
163
+ end
164
+ end
165
+ ```
166
+
167
+ Then set it in your Prospector config:
168
+
169
+ ```ruby
170
+ config.classifier = MotorcycleClassifier
171
+ ```
172
+
173
+ The classifier receives a hash with `name`, `address`, `website`, `description`, and `source_types` keys. It returns an `LlmClassifier::Result` with `.categories`, `.confidence`, and `.reasoning`.
174
+
175
+ ### Programmatic Usage
176
+
177
+ ```ruby
178
+ # Create and start a run
179
+ run = Prospector::Run.create!(
180
+ source_adapter: "google_places",
181
+ geography_type: "metro_area",
182
+ geography_data: { "name" => "Dallas-Fort Worth", "primary_state" => "TX" },
183
+ categories: ["mechanic", "instructor", "gear"]
184
+ )
185
+ Prospector::FetchJob.perform_later(run.id)
186
+
187
+ # Seed keywords manually (instead of LLM generation)
188
+ Prospector::Keyword.create!(
189
+ domain: "motorcycle_services",
190
+ category: "mechanic",
191
+ keyword: "motorcycle repair shop",
192
+ source: "manual"
193
+ )
194
+
195
+ # Approve a candidate programmatically
196
+ candidate = Prospector::Candidate.find(id)
197
+ candidate.approve! # fires the on_approve callback
198
+ ```
199
+
200
+ ### Geography Types
201
+
202
+ ```ruby
203
+ # Metro area (text search)
204
+ Prospector::Geography::MetroArea.new(name: "San Francisco", primary_state: "CA")
205
+
206
+ # City (text search)
207
+ Prospector::Geography::City.new(city: "Austin", state: "TX")
208
+
209
+ # Coordinates (nearby search)
210
+ Prospector::Geography::Coordinates.new(lat: 30.267, lng: -97.743, radius_meters: 10_000)
211
+
212
+ # ZIP code (text search)
213
+ Prospector::Geography::ZipCode.new(zip: "75201")
214
+
215
+ # Bounding box (nearby search, converted to center + radius)
216
+ Prospector::Geography::BoundingBox.new(ne_lat: 33.0, ne_lng: -96.0, sw_lat: 32.0, sw_lng: -97.0)
217
+ ```
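The bounding-box comment above says the box is converted to a center plus radius for a nearby search, but the README does not show how. One plausible reading is the midpoint of the two corners with a radius that reaches the farther corner. The sketch below is plain Ruby under that assumption. `BoxToCircle`, `haversine_m`, and `EARTH_RADIUS_M` are hypothetical names and are not part of Prospector.

```ruby
# Hypothetical helper; an assumption about the conversion, not Prospector's code.
module BoxToCircle
  EARTH_RADIUS_M = 6_371_000.0 # mean Earth radius in metres

  # Great-circle distance in metres between two lat/lng points (haversine formula).
  def self.haversine_m(lat1, lng1, lat2, lng2)
    to_rad = Math::PI / 180
    dlat = (lat2 - lat1) * to_rad
    dlng = (lng2 - lng1) * to_rad
    a = Math.sin(dlat / 2)**2 +
        Math.cos(lat1 * to_rad) * Math.cos(lat2 * to_rad) * Math.sin(dlng / 2)**2
    2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a))
  end

  # Returns the box midpoint and a radius that covers both corners.
  # Does not handle boxes that cross the antimeridian.
  def self.call(ne_lat:, ne_lng:, sw_lat:, sw_lng:)
    lat = (ne_lat + sw_lat) / 2.0
    lng = (ne_lng + sw_lng) / 2.0
    radius = [
      haversine_m(lat, lng, ne_lat, ne_lng),
      haversine_m(lat, lng, sw_lat, sw_lng)
    ].max
    { lat: lat, lng: lng, radius_meters: radius.ceil }
  end
end

circle = BoxToCircle.call(ne_lat: 33.0, ne_lng: -96.0, sw_lat: 32.0, sw_lng: -97.0)
```

A circle that covers the whole box also covers area outside it, so results from such a search may need filtering back to the box afterwards.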
218
+
219
+ ## Custom Source Adapters
220
+
221
+ Prospector ships with Google Places. To add another source:
222
+
223
+ ```ruby
224
+ # lib/my_app/yelp_adapter.rb
225
+ class MyApp::YelpAdapter < Prospector::Sources::Base
226
+ def self.adapter_key = "yelp"
227
+
228
+ def fetch(geography:, keywords:)
229
+ # Call the Yelp API for each keyword + geography combination.
230
+ # Return an Array of Prospector::Sources::Result.
231
+ keywords.flat_map do |keyword|
232
+ search_yelp(keyword, geography).map do |biz|
233
+ Prospector::Sources::Result.new(
234
+ uid: biz["id"],
235
+ name: biz["name"],
236
+ formatted_address: biz["location"]["display_address"].join(", "),
237
+ latitude: biz["coordinates"]["latitude"],
238
+ longitude: biz["coordinates"]["longitude"],
239
+ phone_number: biz["phone"],
240
+ website: biz["url"],
241
+ description: nil,
242
+ category: keyword,
243
+ hours: nil,
244
+ rating: biz["rating"],
245
+ rating_count: biz["review_count"],
246
+ types: biz["categories"].map { |c| c["alias"] },
247
+ raw: biz
248
+ )
249
+ end
250
+ end
251
+ end
252
+ end
253
+ ```
254
+
255
+ Register it in your initializer:
256
+
257
+ ```ruby
258
+ Prospector.configure do |config|
259
+ config.register_source :yelp, MyApp::YelpAdapter
260
+ end
261
+ ```
262
+
263
+ Then select "Yelp" as the source when creating a run.
264
+
265
+ ## Database Tables
266
+
267
+ Prospector creates four tables, all prefixed with `prospector_`:
268
+
269
+ | Table | Purpose |
270
+ |-------|---------|
271
+ | `prospector_runs` | Discovery runs with status, geography, and progress counters |
272
+ | `prospector_candidates` | Discovered businesses pending review |
273
+ | `prospector_classification_runs` | Tracks AI reclassification operations |
274
+ | `prospector_keywords` | Stored search keywords per domain and category |
275
+
276
+ ## Architecture
277
+
278
+ ```
279
+ Prospector::Run (state machine)
280
+ |
281
+ |-- FetchJob
282
+ | |-- Keywords::Generator (DB cache -> LLM fallback)
283
+ | |-- Sources::GooglePlaces::Adapter (or custom adapter)
284
+ | |-- Pipeline::Orchestrator
285
+ | | |-- Pipeline::Normalizer (address parsing)
286
+ | | |-- Duplicate checking (via config.duplicate_check)
287
+ | | '-- Creates Prospector::Candidate records
288
+ | '-- Enqueues ClassifyJob
289
+ |
290
+ |-- ClassifyJob
291
+ | '-- Classification::Runner
292
+ | |-- Calls LLM via ruby_llm
293
+ | |-- Stores categories, confidence, reasoning in metadata
294
+ | '-- Auto-rejects candidates with no relevant categories
295
+ |
296
+ '-- Admin UI
297
+ |-- Review candidates (approve / reject / restore)
298
+ |-- Bulk approve
299
+ '-- Trigger reclassification with different models
300
+ ```
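The `Keywords::Generator (DB cache -> LLM fallback)` step in the diagram above describes a lookup that prefers stored keywords and only generates new ones when none exist. A minimal sketch of that idea in plain Ruby follows. `KeywordLookup`, its in-memory `store` hash, and the `generator` lambda are hypothetical stand-ins for the database table and the LLM call, and the real generator may behave differently.

```ruby
# Hypothetical sketch of "stored keywords first, generator as fallback".
class KeywordLookup
  def initialize(store:, generator:)
    @store = store          # Hash of [domain, category] => Array of keyword strings
    @generator = generator  # callable standing in for an LLM request
  end

  def keywords_for(domain:, category:)
    key = [domain, category]
    cached = @store[key]
    return cached if cached && !cached.empty?

    # Nothing stored yet: generate once and keep the result for later runs.
    @store[key] = @generator.call(domain, category)
  end
end

calls = 0
generator = lambda do |_domain, category|
  calls += 1
  ["#{category} near me"]
end

lookup = KeywordLookup.new(
  store: { ["motorcycle_services", "mechanic"] => ["motorcycle repair shop"] },
  generator: generator
)

stored = lookup.keywords_for(domain: "motorcycle_services", category: "mechanic")
generated = lookup.keywords_for(domain: "motorcycle_services", category: "gear")
again = lookup.keywords_for(domain: "motorcycle_services", category: "gear")
```

In this sketch the generator runs once for the `gear` category, and the second lookup is served from the store.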
301
+
302
+ ## Run States
303
+
304
+ ```
305
+ pending --> running --> classifying --> completed
306
+ | |
307
+ v v
308
+ failed cancelled
309
+ ^ ^
310
+ |__________________________|
311
+ (retry / restart)
312
+ ```
313
+
314
+ ## Development
315
+
316
+ The gem ships with a devcontainer for containerized development:
317
+
318
+ ```bash
319
+ # Start the devcontainer
320
+ docker compose -f .devcontainer/compose.yaml up -d --build
321
+
322
+ # Run tests (inside the container)
323
+ docker exec -w /workspaces/prospector prospector-app-1 bundle exec rake test
324
+
325
+ # Run a single test file
326
+ docker exec -w /workspaces/prospector prospector-app-1 bundle exec ruby -Itest test/models/run_test.rb
327
+ ```
328
+
329
+ The test suite uses a dummy Rails app (`test/dummy/`) with PostgreSQL.
330
+
331
+ ## License
332
+
333
+ MIT License. See [MIT-LICENSE](MIT-LICENSE).
data/Rakefile ADDED
@@ -0,0 +1,9 @@
1
+ require "bundler/gem_tasks"
2
+ require "rake/testtask"
3
+
4
+ Rake::TestTask.new(:test) do |t|
5
+ t.libs << "test"
6
+ t.pattern = "test/**/*_test.rb"
7
+ end
8
+
9
+ task default: :test
data/app/CLAUDE.md ADDED
@@ -0,0 +1,43 @@
1
+ # App Code
2
+
3
+ ## Models (`models/prospector/`)
4
+
5
+ All models inherit from `Prospector::ApplicationRecord` which sets `table_name_prefix = "prospector_"`.
6
+
7
+ - `Run` - Central orchestration record. State machine via explicit predicate methods. Associations: has_many candidates, has_many classification_runs. Key methods: `cancel!`, `reset_for_retry!`, `restart!`, `geography`/`geography=`.
8
+ - `Candidate` - Raw scraped business record. Key methods: `approve!` (fires on_approve callback OUTSIDE transaction), `reject!(reason:)`, `restore_to_pending!`, `approvable?`. Metadata stores normalized_data, llm_categories, llm_confidence, llm_reasoning.
9
+ - `ClassificationRun` - Tracks reclassification operations per run.
10
+ - `Keyword` - Stored search keywords. Scoped by domain + category. `keywords_for(domain:, category:)` returns active keyword strings.
11
+
12
+ ## Controllers (`controllers/prospector/`)
13
+
14
+ All inherit from `Prospector::ApplicationController` which runs `authenticate_prospector_admin!` before_action.
15
+
16
+ - `RunsController` - index (list runs), show (run detail + candidates), new/create (start a run)
17
+ - `CandidatesController` - show (candidate detail), update (approve/reject/restore with state guards)
18
+ - `RunRetriesController` - create (retry failed run, guards retryable?)
19
+ - `RunRestartsController` - create (restart completed/failed run, guards restartable?)
20
+ - `RunCancellationsController` - create (cancel active run, guards cancellable?)
21
+ - `RunReclassificationsController` - create (trigger reclassification with chosen model)
22
+ - `RunBulkApprovalsController` - create (bulk approve pending candidates, guards completed?/classifying?)
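The guards named in the controller list above (`retryable?`, `restartable?`, `cancellable?`) can be sketched as predicates over a status string. This is a minimal illustration in plain Ruby, not the `Run` model itself. `RunSketch` is a hypothetical class, and treating "active" as pending, running, or classifying is an assumption, since the notes only say "active run".

```ruby
# Hypothetical stand-in for a run record; not Prospector::Run.
class RunSketch
  STATUSES = %w[pending running classifying completed failed cancelled].freeze
  ACTIVE = %w[pending running classifying].freeze # assumption: what "active" means

  attr_reader :status

  def initialize(status)
    raise ArgumentError, "unknown status: #{status}" unless STATUSES.include?(status)

    @status = status
  end

  # Retry applies to failed runs only.
  def retryable?
    status == "failed"
  end

  # Restart applies to completed or failed runs.
  def restartable?
    %w[completed failed].include?(status)
  end

  # Cancel applies to runs that are still in progress.
  def cancellable?
    ACTIVE.include?(status)
  end

  def cancel!
    raise "cannot cancel a #{status} run" unless cancellable?

    @status = "cancelled"
    self
  end
end

run = RunSketch.new("running").cancel!
```

Keeping the guard in the model and checking it in the controller before acting means a stale page cannot cancel a run that has already finished.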
23
+
24
+ ## Jobs (`jobs/prospector/`)
25
+
26
+ All inherit from `Prospector::ApplicationJob` which uses the configured queue name.
27
+
28
+ - `FetchJob` - Delegates to `Pipeline::Orchestrator`. Enqueued by RunsController#create.
29
+ - `ClassifyJob` - Delegates to `Classification::Runner`. Enqueued after fetch completes.
30
+ - `BulkApproveJob` - Approves all approvable pending candidates in a run.
31
+
32
+ ## Views (`views/prospector/`)
33
+
34
+ Self-contained admin UI. Layout at `layouts/prospector.html.erb` with scoped CSS (`prospector/application.css`).
35
+
36
+ - `runs/index` - List of all runs with status badges
37
+ - `runs/show` - Run detail with stats, action buttons, candidate tabs (pending/approved/rejected)
38
+ - `runs/new` - Form to create a new run (geography type, source, categories)
39
+ - `candidates/show` - Candidate detail with AI classification results
40
+
41
+ ## Assets (`assets/stylesheets/prospector/`)
42
+
43
+ - `application.css` - Self-contained CSS scoped under `.prospector-ui`. No Tailwind dependency. CSS custom properties for theming.