completion-kit 0.5.9 → 0.5.11

Files changed (36)
  1. checksums.yaml +4 -4
  2. data/README.md +15 -15
  3. data/app/assets/images/completion_kit/favicon.ico +0 -0
  4. data/app/assets/images/completion_kit/logo.png +0 -0
  5. data/app/assets/stylesheets/completion_kit/application.css +38 -7
  6. data/app/controllers/completion_kit/api/v1/runs_controller.rb +1 -1
  7. data/app/controllers/completion_kit/api_reference_controller.rb +6 -0
  8. data/app/controllers/completion_kit/datasets_controller.rb +10 -0
  9. data/app/controllers/completion_kit/mcp_controller.rb +2 -2
  10. data/app/controllers/completion_kit/runs_controller.rb +8 -2
  11. data/app/jobs/completion_kit/judge_review_job.rb +1 -1
  12. data/app/models/completion_kit/mcp_session.rb +29 -0
  13. data/app/models/completion_kit/run.rb +55 -10
  14. data/app/services/completion_kit/mcp_dispatcher.rb +1 -3
  15. data/app/services/completion_kit/mcp_tools/runs.rb +6 -4
  16. data/app/views/completion_kit/api_reference/_body.html.erb +47 -23
  17. data/app/views/completion_kit/api_reference/_resource_card.html.erb +24 -0
  18. data/app/views/completion_kit/api_reference/_resource_list.html.erb +10 -0
  19. data/app/views/completion_kit/api_reference/index.html.erb +7 -1
  20. data/app/views/completion_kit/datasets/show.html.erb +2 -18
  21. data/app/views/completion_kit/prompts/show.html.erb +8 -26
  22. data/app/views/completion_kit/responses/show.html.erb +26 -11
  23. data/app/views/completion_kit/runs/_form.html.erb +51 -4
  24. data/app/views/completion_kit/runs/_row.html.erb +6 -2
  25. data/app/views/completion_kit/runs/_status_header.html.erb +5 -1
  26. data/app/views/completion_kit/runs/_table.html.erb +19 -0
  27. data/app/views/completion_kit/runs/edit.html.erb +6 -2
  28. data/app/views/completion_kit/runs/index.html.erb +1 -17
  29. data/app/views/completion_kit/runs/show.html.erb +24 -15
  30. data/app/views/layouts/completion_kit/application.html.erb +2 -2
  31. data/db/migrate/20260513000001_create_completion_kit_mcp_sessions.rb +12 -0
  32. data/db/migrate/20260514000001_allow_judge_only_runs.rb +6 -0
  33. data/lib/completion_kit/engine.rb +2 -1
  34. data/lib/completion_kit/version.rb +1 -1
  35. metadata +9 -2
  36. data/app/assets/images/completion_kit/logo.svg +0 -6
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 2e1641413d1ed8d27bb6344094b788f33ab521e4d738bf1468fba148185a58f6
- data.tar.gz: 60b3676c0b8100430a7841a6281b5656184198583b40565c0ef5f7b24663ebc4
+ metadata.gz: ed531ae29162bb91d2c463c3ff4eb20b5da469b9b7a21baddf5054a0ccc15041
+ data.tar.gz: b86aea95b2e1cf73abf6514093565dc07b12dc0f4fe5c5c5c8b80db3fbdfa83d
  SHA512:
- metadata.gz: 003f3af4e5eaa28bc5c9b800e3788a63134a8bc9750d170ed03c6c1707b2e57675c79e25994abd763fb133eaec484f07133e12a74be9179f90c360b1d290dafd
- data.tar.gz: 6b73b15e6c1eb9af3d8f5228997553a7489db9c948a2e7298d2c9b40eb19b27a431af84b52368716bac0705f4c62acc2b5b7a8f937192445416b131c59b661f9
+ metadata.gz: 04ae500020e71d52c41073c36a6741bc47b94a06ceec6548720d6022a60ce7422be8a354d627c39a7be8174af2ce65219041c5d99ad175157c7bf4b4eaf8f056
+ data.tar.gz: 261daeeb1555b3aecb8e2e18edb7f14ebdc37f974c2713ecbe12c43a281e8109edc45222be0f6039c325a0428e89c1c11a1a7104f0a36ace2b618fb2ef1cb7e8
data/README.md CHANGED
@@ -14,21 +14,23 @@ Run every prompt against real data. Score each output with an LLM judge against
 
  It's the difference between "this prompt seems to work" and "this prompt scores 4.3 out of 5 across 200 inputs, up from 3.8 last version."
 
- **[completionkit.com](https://completionkit.com)** | **[RubyGems](https://rubygems.org/gems/completion-kit)**
+ **[Start on completionkit.com](https://completionkit.com)** | **[RubyGems](https://rubygems.org/gems/completion-kit)**
 
- > **CompletionKit Cloud** hosted, managed CompletionKit with zero setup. Same engine, run for you. See plans at [completionkit.com/pricing](https://completionkit.com/pricing).
+ > **Just want to use it?** [CompletionKit Cloud](https://completionkit.com) is the same engine, fully hosted zero install, no Rails ops, plans at [completionkit.com/pricing](https://completionkit.com/pricing).
 
  ![Test run with scored results](https://raw.githubusercontent.com/homemade-software-inc/completion-kit/main/docs/screenshots/test-run.png)
 
- ## Quick Start
+ ## Three ways to run it
 
- ### Use CompletionKit Cloud
+ Same engine, same UI, same REST API and MCP server — pick the deployment that fits.
 
- The fastest way to start no install, no servers to run. Sign up at [completionkit.com](https://completionkit.com) and you get the same engine you'd self-host, hosted for you. Best fit if you want to skip the Rails ops.
+ ### 1. Hosted — [completionkit.com](https://completionkit.com) (recommended)
 
- ### Or run the standalone app
+ The fastest path. Sign up and you're running on the same engine you'd self-host, without touching a Rails app. No `db:migrate`, no Puma, no Solid Queue, no provider key management — multi-tenant workspaces, your team logs in, you go. Plans at [completionkit.com/pricing](https://completionkit.com/pricing).
 
- Self-host the same engine. No existing Rails app needed.
+ ### 2. Self-hosted the bundled standalone Rails app
+
+ Run it on your own infra. No existing Rails app required; Postgres + any Rails-friendly host (Fly, Render, Heroku, Docker, …).
 
  ```bash
  git clone https://github.com/homemade-software-inc/completion-kit.git
@@ -38,7 +40,7 @@ bin/rails completion_kit:install:migrations
  bin/rails db:migrate
  ```
 
- Then run **both** processes — a web server and a Solid Queue worker. In two terminals:
+ Run **both** a web server and a Solid Queue worker. In two terminals:
 
  ```bash
  bin/rails server
@@ -50,9 +52,9 @@ bin/jobs
 
  Or with [foreman](https://github.com/ddollar/foreman) in one terminal: `foreman start -f Procfile.dev`.
 
- Visit `http://localhost:3000`. Add a provider credential (Settings), create a prompt, upload a CSV dataset, and run it.
+ Visit `http://localhost:3000`. Add a provider credential (Settings), create a prompt, upload a CSV dataset, and run it. See [Deploying self-hosted](#deploying-self-hosted) for the production-env setup.
 
- ### Or mount as an engine in your existing Rails app
+ ### 3. Rails engine mount into your existing Rails app
 
  ```ruby
  gem "completion-kit"
@@ -63,11 +65,9 @@ bin/rails generate completion_kit:install
  bin/rails db:migrate
  ```
 
- The engine mounts at `/completion_kit` in your app. CompletionKit's generate and judge flows enqueue Active Job jobs (`CompletionKit::GenerateRowJob`, `CompletionKit::JudgeReviewJob`, `CompletionKit::RunCompletionCheckJob`), so your host app needs an Active Job adapter that actually processes them — Solid Queue, Sidekiq, GoodJob, etc. The `:async` adapter is **not** suitable for production: it runs jobs in the web Puma's thread pool with no durability and no retry, and a long LLM call will block request handling.
-
- ### Host-app layout integration
+ The engine mounts at `/completion_kit`. Generate / judge flows enqueue Active Job jobs (`CompletionKit::GenerateRowJob`, `CompletionKit::JudgeReviewJob`, `CompletionKit::RunCompletionCheckJob`), so your host app needs an Active Job adapter that actually processes them — Solid Queue, Sidekiq, GoodJob, etc. The `:async` adapter is **not** suitable for production: it runs jobs in the web Puma's thread pool with no durability and no retry, and a long LLM call will block request handling.
 
- If your host app overrides the engine layout (e.g. `layout "application"` on engine controllers, or rendering engine views inside your own shell), include both the engine's stylesheet and JavaScript in that layout:
+ **Host-app layout integration.** If your host app overrides the engine layout (e.g. `layout "application"` on engine controllers, or rendering engine views inside your own shell), include both the engine's stylesheet and JavaScript in that layout:
 
  ```erb
  <%= stylesheet_link_tag "completion_kit/application", media: "all" %>
@@ -183,7 +183,7 @@ CompletionKit runs a [Model Context Protocol](https://modelcontextprotocol.io) s
 
  The in-app API reference page has install snippets you can copy straight into your MCP client config.
 
- ## Deploying the standalone app
+ ## Deploying self-hosted
 
  Any Rails-friendly host works (Fly, Heroku, Render, Docker, etc.). Point it at a Postgres instance via `DATABASE_URL`, set your provider env vars, and run `cd standalone && bin/rails db:migrate` on each deploy.
 
data/app/assets/stylesheets/completion_kit/application.css CHANGED
@@ -126,13 +126,21 @@ form.button_to {
  font-weight: 700;
  letter-spacing: 0.02em;
  text-decoration: none;
- color: var(--ck-accent);
+ color: #3AD0E6;
  }
 
  .ck-brand img {
  display: block;
  }
 
+ .ck-brand__name {
+ padding-top: 0.75rem;
+ }
+
+ .ck-brand__kit {
+ color: #AFEDF7;
+ }
+
  .ck-topbar__copy {
  display: none;
  }
@@ -262,7 +270,7 @@ form.button_to {
  .ck-meta-copy,
  .ck-note,
  .ck-hint {
- font-size: 0.95rem;
+ font-size: 0.9rem;
  line-height: 1.6;
  }
 
@@ -1269,6 +1277,13 @@ tr:hover .ck-chip--publish {
  color: var(--ck-accent);
  }
 
+ /* the main prompt template block on prompts/show — bigger padding + a
+ touch more line-height since this is the page's primary content */
+ .ck-code--prompt {
+ padding: 1.5rem;
+ line-height: 1.75;
+ }
+
  .ck-note-box {
  background: var(--ck-surface-soft);
  border: 1px solid var(--ck-line);
@@ -1522,6 +1537,9 @@ tr:hover .ck-chip--publish {
  display: grid;
  gap: 0.4rem;
  }
+ .ck-field[hidden] {
+ display: none;
+ }
 
  .ck-field--spacious {
  margin-top: 0.3rem;
@@ -1855,7 +1873,18 @@ tr:hover .ck-chip--publish {
  background: var(--ck-bg-strong);
  overflow: auto;
  max-height: 60vh;
+ scrollbar-width: thin;
+ scrollbar-color: var(--ck-line-strong) transparent;
  }
+ .ck-csv-table-wrap::-webkit-scrollbar { width: 10px; height: 10px; }
+ .ck-csv-table-wrap::-webkit-scrollbar-track { background: transparent; }
+ .ck-csv-table-wrap::-webkit-scrollbar-thumb {
+ background: var(--ck-line-strong);
+ border-radius: 5px;
+ border: 2px solid var(--ck-bg-strong);
+ }
+ .ck-csv-table-wrap::-webkit-scrollbar-thumb:hover { background: var(--ck-muted); }
+ .ck-csv-table-wrap::-webkit-scrollbar-corner { background: transparent; }
 
  .ck-modal__body .ck-csv-table-wrap {
  margin-top: 0;
@@ -2763,8 +2792,10 @@ select.ck-input {
 
  /* the metrics field stacks several sub-sections (hint, groups, divider, tag
  filter, checkboxes) — give it more vertical breathing room than a plain field,
- and extra separation from the run-tags field that follows it */
- #metrics-field {
+ and extra separation from the run-tags field that follows it. Only when the
+ checkboxes are actually present, though — when there are no metrics the field
+ is just "label + warning" and should be a normal compact field. */
+ #metrics-field:has(.ck-metric-checkboxes) {
  gap: 0.85rem;
  margin-bottom: 1.25rem;
  }
@@ -3571,7 +3602,7 @@ a.ck-metric-group-pill {
  }
 
  .ck-mcp-tool__desc {
- font-size: 0.8rem;
+ font-size: 0.9rem;
  color: var(--ck-muted);
  }
 
@@ -3595,7 +3626,7 @@ a.ck-metric-group-pill {
  gap: 0.5rem;
  padding: 0.6rem 0.85rem;
  font-family: var(--ck-mono);
- font-size: 0.78rem;
+ font-size: 0.9rem;
  font-weight: 500;
  color: var(--ck-text);
  border-bottom: 1px solid var(--ck-line);
@@ -3669,7 +3700,7 @@ a.ck-metric-group-pill {
  }
 
  .ck-api-prompt-card__desc {
- font-size: 0.78rem;
+ font-size: 0.9rem;
  color: var(--ck-muted);
  margin: 0.2rem 0 0;
  line-height: 1.4;
data/app/controllers/completion_kit/api/v1/runs_controller.rb CHANGED
@@ -76,7 +76,7 @@ module CompletionKit
  end
 
  def run_params
- params.permit(:name, :prompt_id, :dataset_id, :judge_model, :temperature,
+ params.permit(:name, :prompt_id, :dataset_id, :judge_model, :temperature, :output_column,
  metric_ids: [], tag_names: [])
  end
  end
data/app/controllers/completion_kit/api_reference_controller.rb CHANGED
@@ -2,6 +2,12 @@ module CompletionKit
  class ApiReferenceController < ApplicationController
  def index
  @published_prompts = Prompt.current_versions.order(name: :asc)
+ @recent_runs = Run.includes(:prompt).order(created_at: :desc).limit(10)
+ @datasets = Dataset.order(name: :asc)
+ @metrics = Metric.order(name: :asc)
+ @metric_groups = MetricGroup.includes(:metrics).order(name: :asc)
+ @tags = Tag.order(name: :asc)
+ @provider_credentials = ProviderCredential.order(:provider)
  @token = CompletionKit.config.api_token
  @base_url = request.base_url + request.script_name
  end
data/app/controllers/completion_kit/datasets_controller.rb CHANGED
@@ -9,6 +9,16 @@ module CompletionKit
 
  def show
  @runs = @dataset.runs.includes(:prompt, :responses).order(created_at: :desc)
+ respond_to do |format|
+ format.html
+ format.csv do
+ slug = @dataset.name.to_s.parameterize.presence || "dataset-#{@dataset.id}"
+ send_data @dataset.csv_data.to_s,
+ type: "text/csv",
+ filename: "#{slug}.csv",
+ disposition: "attachment"
+ end
+ end
  end
 
  def new
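The new `format.csv` branch names the download after the dataset and falls back to `dataset-<id>` when the name produces an empty slug. A minimal plain-Ruby sketch of that fallback follows. It is an illustration, not the gem's code: `simple_parameterize` is a hypothetical, ASCII-only approximation of ActiveSupport's `String#parameterize`, and `csv_filename` is a hypothetical helper.

```ruby
# Hypothetical, ASCII-only approximation of ActiveSupport's String#parameterize:
# lowercase, collapse runs of non-alphanumerics into "-", trim edge dashes.
def simple_parameterize(text)
  text.to_s.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-+|-+\z/, "")
end

# Hypothetical helper mirroring the slug-or-fallback logic in the diff.
def csv_filename(name, id)
  slug = simple_parameterize(name)
  slug = "dataset-#{id}" if slug.empty? # same role as `.presence || ...`
  "#{slug}.csv"
end

puts csv_filename("Support Tickets (Q3)", 7)
puts csv_filename(nil, 7)
```

The fallback matters because a name made only of punctuation, or a missing name, would otherwise yield a file called `.csv`.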
data/app/controllers/completion_kit/mcp_controller.rb CHANGED
@@ -17,7 +17,7 @@ module CompletionKit
  end
 
  session_id = request.headers["Mcp-Session-Id"]
- unless session_id && Rails.cache.exist?("mcp_session:#{session_id}")
+ unless McpSession.active?(session_id)
  render json: jsonrpc_error(request_body["id"], -32000, "Session not initialized. Send initialize first."), status: :bad_request
  return
  end
@@ -40,7 +40,7 @@ module CompletionKit
 
  def destroy
  session_id = request.headers["Mcp-Session-Id"]
- Rails.cache.delete("mcp_session:#{session_id}") if session_id
+ McpSession.destroy_session(session_id) if session_id
  head :ok
  end
 
data/app/controllers/completion_kit/runs_controller.rb CHANGED
@@ -84,6 +84,7 @@ module CompletionKit
  dataset_id: @run.dataset_id,
  judge_model: @run.judge_model,
  temperature: @run.temperature,
+ output_column: @run.output_column,
  tag_names: @run.tag_names,
  status: "pending"
  )
@@ -108,6 +109,11 @@ module CompletionKit
  end
 
  def suggest
+ if @run.prompt.nil?
+ redirect_to run_path(@run), alert: "Judge-only runs don't have a prompt to improve."
+ return
+ end
+
  service = PromptImprovementService.new(@run)
  result = service.suggest
  suggestion = @run.suggestions.create!(
@@ -159,13 +165,13 @@ module CompletionKit
  end
 
  def run_params
- params.require(:run).permit(:name, :prompt_id, :dataset_id, :judge_model, :temperature, metric_ids: [], tag_names: [])
+ params.require(:run).permit(:name, :prompt_id, :dataset_id, :judge_model, :temperature, :output_column, metric_ids: [], tag_names: [])
  end
 
  # Editing a run that already has results forks a new run — but only when a
  # field that affects generation or judging changed. Renaming or retagging is
  # pure metadata and updates the run in place.
- GENERATION_RUN_FIELDS = %i[prompt_id dataset_id judge_model temperature].freeze
+ GENERATION_RUN_FIELDS = %i[prompt_id dataset_id judge_model temperature output_column].freeze
 
  def run_generation_changed?
  GENERATION_RUN_FIELDS.each do |field|
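The hunk cuts off before the body of `run_generation_changed?`, so the exact comparison is not visible here. As a rough sketch of the rule described in the comment (fork only when a generation or judging field changed, update in place for metadata), the following plain-Ruby example may help. The `generation_changed?` helper and its string-based comparison are assumptions for illustration, not the controller's actual logic.

```ruby
GENERATION_RUN_FIELDS = %i[prompt_id dataset_id judge_model temperature output_column].freeze

# Assumption: a field counts as changed when it is present in the submitted
# params and its string form differs from the run's current value.
def generation_changed?(current, submitted)
  GENERATION_RUN_FIELDS.any? do |field|
    submitted.key?(field) && submitted[field].to_s != current[field].to_s
  end
end

current = { name: "Baseline", prompt_id: 1, dataset_id: 2, judge_model: "m",
            temperature: 0.7, output_column: "actual_output" }

puts generation_changed?(current, { name: "Renamed" })               # metadata only
puts generation_changed?(current, { output_column: "model_answer" }) # affects judging
```

With `output_column` now in the list, pointing a judge-only run at a different column is treated like any other generation-affecting edit.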
data/app/jobs/completion_kit/judge_review_job.rb CHANGED
@@ -54,7 +54,7 @@ module CompletionKit
  evaluation = judge.evaluate(
  response.response_text,
  response.expected_output,
- run.prompt.template,
+ run.prompt&.template,
  criteria: metric.instruction.to_s,
  rubric_text: metric.display_rubric_text,
  input_data: response.input_data
data/app/models/completion_kit/mcp_session.rb ADDED
@@ -0,0 +1,29 @@
+ module CompletionKit
+ # MCP session marker — one row per active client session, kept in the
+ # database so sessions survive Puma restarts, deploys, and Rails.cache
+ # eviction. Expired rows are opportunistically pruned on every new
+ # session start, so the table stays bounded by recent activity.
+ class McpSession < ApplicationRecord
+ self.table_name = "completion_kit_mcp_sessions"
+
+ SESSION_TTL = 1.hour
+
+ def self.start!
+ prune_expired!
+ create!(session_id: SecureRandom.uuid, expires_at: SESSION_TTL.from_now).session_id
+ end
+
+ def self.active?(session_id)
+ return false if session_id.blank?
+ where(session_id: session_id).where("expires_at > ?", Time.current).exists?
+ end
+
+ def self.destroy_session(session_id)
+ where(session_id: session_id).delete_all
+ end
+
+ def self.prune_expired!
+ where("expires_at < ?", Time.current).delete_all
+ end
+ end
+ end
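To make the TTL and prune-on-start behaviour of `McpSession` easier to follow, here is a minimal in-memory sketch in plain Ruby. It is not the gem's implementation: `InMemoryMcpSessions` and its injectable `clock` are hypothetical stand-ins for the ActiveRecord table, assumed only so the example runs without Rails or a database.

```ruby
require "securerandom"

# Hypothetical in-memory stand-in for the completion_kit_mcp_sessions table.
class InMemoryMcpSessions
  SESSION_TTL = 3600 # seconds, mirroring the 1.hour TTL in the diff

  def initialize(clock: -> { Time.now })
    @clock = clock
    @rows = {} # session_id => expires_at
  end

  # Prune expired rows first, then create a session and return its id.
  def start!
    prune_expired!
    session_id = SecureRandom.uuid
    @rows[session_id] = @clock.call + SESSION_TTL
    session_id
  end

  # A session is active only if it exists and has not yet expired.
  def active?(session_id)
    return false if session_id.to_s.strip.empty?
    expires_at = @rows[session_id]
    !expires_at.nil? && expires_at > @clock.call
  end

  def destroy_session(session_id)
    @rows.delete(session_id)
  end

  def prune_expired!
    now = @clock.call
    @rows.delete_if { |_id, expires_at| expires_at < now }
  end

  def size
    @rows.size
  end
end
```

Pruning inside `start!` keeps the store bounded by recent activity without a scheduled cleanup job, at the cost of one extra delete per new session.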
data/app/models/completion_kit/run.rb CHANGED
@@ -5,7 +5,7 @@ module CompletionKit
 
  STATUSES = %w[pending running completed failed].freeze
 
- belongs_to :prompt
+ belongs_to :prompt, optional: true
  belongs_to :dataset, optional: true
  has_many :responses, dependent: :destroy
  has_many :run_metrics, -> { order(:position) }, dependent: :destroy
@@ -15,10 +15,18 @@ module CompletionKit
  validates :name, presence: true
  validates :status, inclusion: { in: STATUSES }
  validate :dataset_supplies_prompt_variables
+ validate :judge_only_run_supplies_output_column
 
  before_validation :set_default_status, on: :create
  before_validation :set_auto_name, on: :create
 
+ # A judge-only run grades a pre-existing column on the dataset instead of
+ # generating new outputs. No prompt is attached; the response text is read
+ # from row[output_column]; no LLM generation happens.
+ def judge_only?
+ prompt.nil?
+ end
+
  def missing_dataset_variables
  return [] unless prompt
  vars = prompt.variables
@@ -89,9 +97,14 @@ module CompletionKit
 
  return fail_with_summary!("Dataset has no rows") if rows.empty?
 
- client = LlmClient.for_model(prompt.llm_model, ApiConfig.for_model(prompt.llm_model))
- unless client.configured?
- return fail_with_summary!("LLM API not configured: #{client.configuration_errors.join(', ')}")
+ if judge_only?
+ column = output_column.presence || "actual_output"
+ return fail_with_summary!("Dataset has no \"#{column}\" column") unless dataset && dataset.headers.include?(column)
+ else
+ client = LlmClient.for_model(prompt.llm_model, ApiConfig.for_model(prompt.llm_model))
+ unless client.configured?
+ return fail_with_summary!("LLM API not configured: #{client.configuration_errors.join(', ')}")
+ end
  end
 
  transaction do
@@ -105,14 +118,27 @@ module CompletionKit
  )
  rows.each_with_index do |row, index|
  input = row.empty? ? nil : row.to_json
- response = responses.create!(
+ attrs = {
  status: "pending",
  row_index: index,
  input_data: input,
  expected_output: row["expected_output"]
- )
- GenerateRowJob.perform_later(id, response.id)
+ }
+ if judge_only?
+ attrs[:status] = "succeeded"
+ attrs[:response_text] = row[output_column.presence || "actual_output"].to_s
+ end
+
+ response = responses.create!(attrs)
+
+ if judge_only?
+ metrics.each { |m| JudgeReviewJob.perform_later(response.id, m.id) } if judge_configured?
+ else
+ GenerateRowJob.perform_later(id, response.id)
+ end
  end
+
+ RunCompletionCheckJob.perform_later(id) if judge_only?
  end
 
  broadcast_ui
@@ -168,6 +194,7 @@ module CompletionKit
  {
  id: id, name: name, status: status, prompt_id: prompt_id,
  dataset_id: dataset_id, judge_model: judge_model, temperature: temperature,
+ output_column: output_column,
  created_at: created_at, updated_at: updated_at,
  responses_count: responses.count, avg_score: avg_score,
  progress_current: snap[:generated_done],
@@ -274,10 +301,14 @@ module CompletionKit
 
  def set_auto_name
  return if name.present?
- return unless prompt.present?
 
- count = Run.where(prompt_id: prompt_id).count + 1
- self.name = "#{prompt.name} — v#{prompt.version_number} ##{count}"
+ if prompt.present?
+ count = Run.where(prompt_id: prompt_id).count + 1
+ self.name = "#{prompt.name} — v#{prompt.version_number} ##{count}"
+ elsif dataset.present?
+ count = Run.where(prompt_id: nil, dataset_id: dataset.id).count + 1
+ self.name = "#{dataset.name} — judge-only ##{count}"
+ end
  end
 
  def dataset_supplies_prompt_variables
@@ -290,5 +321,19 @@ module CompletionKit
  errors.add(:dataset_id, "is missing columns required by the prompt: #{missing.join(', ')}")
  end
  end
+
+ def judge_only_run_supplies_output_column
+ return if prompt.present?
+
+ if dataset.nil?
+ errors.add(:dataset_id, "is required for a judge-only run (no prompt)")
+ return
+ end
+
+ column = output_column.presence || "actual_output"
+ unless dataset.headers.include?(column)
+ errors.add(:output_column, "\"#{column}\" is not a column on dataset \"#{dataset.name}\"")
+ end
+ end
  end
  end
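The judge-only path in `run.rb` boils down to two rules: the graded column defaults to `actual_output` when `output_column` is blank, and each response is created already `succeeded` with its text copied from that column. A minimal sketch of those rules in plain Ruby follows, assuming rows are plain hashes keyed by column name. `resolve_output_column` and `response_attrs` are hypothetical helper names introduced for illustration, and no jobs are enqueued here.

```ruby
require "json"

DEFAULT_OUTPUT_COLUMN = "actual_output"

# Blank or nil falls back to the default, like `output_column.presence || ...`.
def resolve_output_column(output_column)
  output_column.to_s.strip.empty? ? DEFAULT_OUTPUT_COLUMN : output_column
end

# Build the attributes for one response row (hypothetical helper).
def response_attrs(row, index, judge_only:, output_column: nil)
  attrs = {
    status: "pending",
    row_index: index,
    input_data: row.empty? ? nil : JSON.generate(row),
    expected_output: row["expected_output"]
  }
  if judge_only
    # Nothing is generated: the text to grade already sits in the dataset.
    attrs[:status] = "succeeded"
    attrs[:response_text] = row[resolve_output_column(output_column)].to_s
  end
  attrs
end

row = { "question" => "2+2?", "expected_output" => "4", "actual_output" => "4" }
p response_attrs(row, 0, judge_only: true)
p response_attrs(row, 0, judge_only: false)
```

In the diff, the generated path then enqueues `GenerateRowJob`, while the judge-only path enqueues `JudgeReviewJob` per metric and a single `RunCompletionCheckJob` at the end.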
data/app/services/completion_kit/mcp_dispatcher.rb CHANGED
@@ -6,10 +6,8 @@ module CompletionKit
  PROTOCOL_VERSION = "2025-03-26"
 
  def self.initialize_session
- session_id = SecureRandom.uuid
- Rails.cache.write("mcp_session:#{session_id}", true, expires_in: 1.hour)
  {
- session_id: session_id,
+ session_id: McpSession.start!,
  protocolVersion: PROTOCOL_VERSION,
  serverInfo: {name: "CompletionKit", version: CompletionKit::VERSION},
  capabilities: {tools: {listChanged: false}}
data/app/services/completion_kit/mcp_tools/runs.rb CHANGED
@@ -15,16 +15,17 @@ module CompletionKit
  handler: :get
  },
  "runs_create" => {
- description: "Create a run",
+ description: "Create a run. Omit prompt_id and provide output_column for a judge-only run that grades a pre-existing dataset column instead of generating new outputs.",
  inputSchema: {
  type: "object",
  properties: {
  name: {type: "string"}, prompt_id: {type: "integer"},
  dataset_id: {type: "integer"}, judge_model: {type: "string"},
+ output_column: {type: "string", description: "Dataset column to grade when prompt_id is omitted; defaults to \"actual_output\"."},
  metric_ids: {type: "array", items: {type: "integer"}},
  tag_names: {type: "array", items: {type: "string"}}
  },
- required: ["name", "prompt_id"]
+ required: ["name"]
  },
  handler: :create
  },
@@ -35,6 +36,7 @@ module CompletionKit
  properties: {
  id: {type: "integer"}, name: {type: "string"},
  dataset_id: {type: "integer"}, judge_model: {type: "string"},
+ output_column: {type: "string"},
  metric_ids: {type: "array", items: {type: "integer"}},
  tag_names: {type: "array", items: {type: "string"}}
  },
@@ -63,7 +65,7 @@ module CompletionKit
  end
 
  def self.create(args)
- run = Run.new(args.slice("name", "prompt_id", "dataset_id", "judge_model"))
+ run = Run.new(args.slice("name", "prompt_id", "dataset_id", "judge_model", "output_column"))
  if run.save
  run.replace_metrics!(args["metric_ids"])
  run.update!(tag_names: args["tag_names"]) if args.key?("tag_names")
@@ -75,7 +77,7 @@ module CompletionKit
 
  def self.update(args)
  run = Run.find(args["id"])
- if run.update(args.except("id", "metric_ids", "tag_names").slice("name", "dataset_id", "judge_model", "output_column"))
+ if run.update(args.except("id", "metric_ids", "tag_names").slice("name", "dataset_id", "judge_model", "output_column"))
  run.replace_metrics!(args["metric_ids"]) if args.key?("metric_ids")
  run.update!(tag_names: args["tag_names"]) if args.key?("tag_names")
  text_result(run.reload.as_json)
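With `prompt_id` no longer required, an MCP client could ask `runs_create` for a judge-only run by omitting it and naming the column to grade. The sketch below only builds the request body, assuming the server accepts the standard MCP JSON-RPC `tools/call` shape. The `runs_create_request` helper and the ids 42, 3 and 5 are placeholders for illustration, and the endpoint URL is deliberately left out because it does not appear in this diff.

```ruby
require "json"

# Build a JSON-RPC 2.0 "tools/call" request for the runs_create tool.
# Hypothetical helper; argument names follow the inputSchema in the diff.
def runs_create_request(id:, name:, dataset_id:, output_column: nil, prompt_id: nil, metric_ids: [])
  arguments = { "name" => name, "dataset_id" => dataset_id, "metric_ids" => metric_ids }
  arguments["prompt_id"] = prompt_id if prompt_id             # omit for judge-only
  arguments["output_column"] = output_column if output_column # column to grade
  {
    "jsonrpc" => "2.0",
    "id" => id,
    "method" => "tools/call",
    "params" => { "name" => "runs_create", "arguments" => arguments }
  }
end

request = runs_create_request(
  id: 1,
  name: "Baseline answers",
  dataset_id: 42,                # placeholder dataset id
  output_column: "model_answer", # placeholder column name
  metric_ids: [3, 5]             # placeholder metric ids
)
puts JSON.pretty_generate(request)
```

A client would send this body along with the `Mcp-Session-Id` header obtained from `initialize`, since the controller rejects requests whose session is not active.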