completion-kit 0.5.9 → 0.5.11
- checksums.yaml +4 -4
- data/README.md +15 -15
- data/app/assets/images/completion_kit/favicon.ico +0 -0
- data/app/assets/images/completion_kit/logo.png +0 -0
- data/app/assets/stylesheets/completion_kit/application.css +38 -7
- data/app/controllers/completion_kit/api/v1/runs_controller.rb +1 -1
- data/app/controllers/completion_kit/api_reference_controller.rb +6 -0
- data/app/controllers/completion_kit/datasets_controller.rb +10 -0
- data/app/controllers/completion_kit/mcp_controller.rb +2 -2
- data/app/controllers/completion_kit/runs_controller.rb +8 -2
- data/app/jobs/completion_kit/judge_review_job.rb +1 -1
- data/app/models/completion_kit/mcp_session.rb +29 -0
- data/app/models/completion_kit/run.rb +55 -10
- data/app/services/completion_kit/mcp_dispatcher.rb +1 -3
- data/app/services/completion_kit/mcp_tools/runs.rb +6 -4
- data/app/views/completion_kit/api_reference/_body.html.erb +47 -23
- data/app/views/completion_kit/api_reference/_resource_card.html.erb +24 -0
- data/app/views/completion_kit/api_reference/_resource_list.html.erb +10 -0
- data/app/views/completion_kit/api_reference/index.html.erb +7 -1
- data/app/views/completion_kit/datasets/show.html.erb +2 -18
- data/app/views/completion_kit/prompts/show.html.erb +8 -26
- data/app/views/completion_kit/responses/show.html.erb +26 -11
- data/app/views/completion_kit/runs/_form.html.erb +51 -4
- data/app/views/completion_kit/runs/_row.html.erb +6 -2
- data/app/views/completion_kit/runs/_status_header.html.erb +5 -1
- data/app/views/completion_kit/runs/_table.html.erb +19 -0
- data/app/views/completion_kit/runs/edit.html.erb +6 -2
- data/app/views/completion_kit/runs/index.html.erb +1 -17
- data/app/views/completion_kit/runs/show.html.erb +24 -15
- data/app/views/layouts/completion_kit/application.html.erb +2 -2
- data/db/migrate/20260513000001_create_completion_kit_mcp_sessions.rb +12 -0
- data/db/migrate/20260514000001_allow_judge_only_runs.rb +6 -0
- data/lib/completion_kit/engine.rb +2 -1
- data/lib/completion_kit/version.rb +1 -1
- metadata +9 -2
- data/app/assets/images/completion_kit/logo.svg +0 -6
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ed531ae29162bb91d2c463c3ff4eb20b5da469b9b7a21baddf5054a0ccc15041
+  data.tar.gz: b86aea95b2e1cf73abf6514093565dc07b12dc0f4fe5c5c5c8b80db3fbdfa83d
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 04ae500020e71d52c41073c36a6741bc47b94a06ceec6548720d6022a60ce7422be8a354d627c39a7be8174af2ce65219041c5d99ad175157c7bf4b4eaf8f056
+  data.tar.gz: 261daeeb1555b3aecb8e2e18edb7f14ebdc37f974c2713ecbe12c43a281e8109edc45222be0f6039c325a0428e89c1c11a1a7104f0a36ace2b618fb2ef1cb7e8
data/README.md
CHANGED
@@ -14,21 +14,23 @@ Run every prompt against real data. Score each output with an LLM judge against
 
 It's the difference between "this prompt seems to work" and "this prompt scores 4.3 out of 5 across 200 inputs, up from 3.8 last version."
 
-**[completionkit.com](https://completionkit.com)** | **[RubyGems](https://rubygems.org/gems/completion-kit)**
+**[Start on completionkit.com →](https://completionkit.com)** | **[RubyGems](https://rubygems.org/gems/completion-kit)**
 
-> **
+> **Just want to use it?** [CompletionKit Cloud](https://completionkit.com) is the same engine, fully hosted — zero install, no Rails ops, plans at [completionkit.com/pricing](https://completionkit.com/pricing).
 
 
 
-##
+## Three ways to run it
 
-
+Same engine, same UI, same REST API and MCP server — pick the deployment that fits.
 
-
+### 1. Hosted — [completionkit.com](https://completionkit.com) (recommended)
 
-
+The fastest path. Sign up and you're running on the same engine you'd self-host, without touching a Rails app. No `db:migrate`, no Puma, no Solid Queue, no provider key management — multi-tenant workspaces, your team logs in, you go. Plans at [completionkit.com/pricing](https://completionkit.com/pricing).
 
-Self-
+### 2. Self-hosted — the bundled standalone Rails app
+
+Run it on your own infra. No existing Rails app required; Postgres + any Rails-friendly host (Fly, Render, Heroku, Docker, …).
 
 ```bash
 git clone https://github.com/homemade-software-inc/completion-kit.git
@@ -38,7 +40,7 @@ bin/rails completion_kit:install:migrations
 bin/rails db:migrate
 ```
 
-
+Run **both** a web server and a Solid Queue worker. In two terminals:
 
 ```bash
 bin/rails server
@@ -50,9 +52,9 @@ bin/jobs
 
 Or with [foreman](https://github.com/ddollar/foreman) in one terminal: `foreman start -f Procfile.dev`.
 
-Visit `http://localhost:3000`. Add a provider credential (Settings), create a prompt, upload a CSV dataset, and run it.
+Visit `http://localhost:3000`. Add a provider credential (Settings), create a prompt, upload a CSV dataset, and run it. See [Deploying self-hosted](#deploying-self-hosted) for the production-env setup.
 
-###
+### 3. Rails engine — mount into your existing Rails app
 
 ```ruby
 gem "completion-kit"
@@ -63,11 +65,9 @@ bin/rails generate completion_kit:install
 bin/rails db:migrate
 ```
 
-The engine mounts at `/completion_kit
-
-### Host-app layout integration
+The engine mounts at `/completion_kit`. Generate / judge flows enqueue Active Job jobs (`CompletionKit::GenerateRowJob`, `CompletionKit::JudgeReviewJob`, `CompletionKit::RunCompletionCheckJob`), so your host app needs an Active Job adapter that actually processes them — Solid Queue, Sidekiq, GoodJob, etc. The `:async` adapter is **not** suitable for production: it runs jobs in the web Puma's thread pool with no durability and no retry, and a long LLM call will block request handling.
 
-If your host app overrides the engine layout (e.g. `layout "application"` on engine controllers, or rendering engine views inside your own shell), include both the engine's stylesheet and JavaScript in that layout:
+**Host-app layout integration.** If your host app overrides the engine layout (e.g. `layout "application"` on engine controllers, or rendering engine views inside your own shell), include both the engine's stylesheet and JavaScript in that layout:
 
 ```erb
 <%= stylesheet_link_tag "completion_kit/application", media: "all" %>
@@ -183,7 +183,7 @@ CompletionKit runs a [Model Context Protocol](https://modelcontextprotocol.io) s
 
 The in-app API reference page has install snippets you can copy straight into your MCP client config.
 
-## Deploying
+## Deploying self-hosted
 
 Any Rails-friendly host works (Fly, Heroku, Render, Docker, etc.). Point it at a Postgres instance via `DATABASE_URL`, set your provider env vars, and run `cd standalone && bin/rails db:migrate` on each deploy.
 
data/app/assets/images/completion_kit/favicon.ico
Binary file
data/app/assets/images/completion_kit/logo.png
Binary file
data/app/assets/stylesheets/completion_kit/application.css
CHANGED
@@ -126,13 +126,21 @@ form.button_to {
   font-weight: 700;
   letter-spacing: 0.02em;
   text-decoration: none;
-  color:
+  color: #3AD0E6;
 }
 
 .ck-brand img {
   display: block;
 }
 
+.ck-brand__name {
+  padding-top: 0.75rem;
+}
+
+.ck-brand__kit {
+  color: #AFEDF7;
+}
+
 .ck-topbar__copy {
   display: none;
 }
@@ -262,7 +270,7 @@ form.button_to {
 .ck-meta-copy,
 .ck-note,
 .ck-hint {
-  font-size: 0.
+  font-size: 0.9rem;
   line-height: 1.6;
 }
 
@@ -1269,6 +1277,13 @@ tr:hover .ck-chip--publish {
   color: var(--ck-accent);
 }
 
+/* the main prompt template block on prompts/show — bigger padding + a
+   touch more line-height since this is the page's primary content */
+.ck-code--prompt {
+  padding: 1.5rem;
+  line-height: 1.75;
+}
+
 .ck-note-box {
   background: var(--ck-surface-soft);
   border: 1px solid var(--ck-line);
@@ -1522,6 +1537,9 @@ tr:hover .ck-chip--publish {
   display: grid;
   gap: 0.4rem;
 }
+.ck-field[hidden] {
+  display: none;
+}
 
 .ck-field--spacious {
   margin-top: 0.3rem;
@@ -1855,7 +1873,18 @@ tr:hover .ck-chip--publish {
   background: var(--ck-bg-strong);
   overflow: auto;
   max-height: 60vh;
+  scrollbar-width: thin;
+  scrollbar-color: var(--ck-line-strong) transparent;
 }
+.ck-csv-table-wrap::-webkit-scrollbar { width: 10px; height: 10px; }
+.ck-csv-table-wrap::-webkit-scrollbar-track { background: transparent; }
+.ck-csv-table-wrap::-webkit-scrollbar-thumb {
+  background: var(--ck-line-strong);
+  border-radius: 5px;
+  border: 2px solid var(--ck-bg-strong);
+}
+.ck-csv-table-wrap::-webkit-scrollbar-thumb:hover { background: var(--ck-muted); }
+.ck-csv-table-wrap::-webkit-scrollbar-corner { background: transparent; }
 
 .ck-modal__body .ck-csv-table-wrap {
   margin-top: 0;
@@ -2763,8 +2792,10 @@ select.ck-input {
 
 /* the metrics field stacks several sub-sections (hint, groups, divider, tag
    filter, checkboxes) — give it more vertical breathing room than a plain field,
-   and extra separation from the run-tags field that follows it
-
+   and extra separation from the run-tags field that follows it. Only when the
+   checkboxes are actually present, though — when there are no metrics the field
+   is just "label + warning" and should be a normal compact field. */
+#metrics-field:has(.ck-metric-checkboxes) {
   gap: 0.85rem;
   margin-bottom: 1.25rem;
 }
@@ -3571,7 +3602,7 @@ a.ck-metric-group-pill {
 }
 
 .ck-mcp-tool__desc {
-  font-size: 0.
+  font-size: 0.9rem;
   color: var(--ck-muted);
 }
 
@@ -3595,7 +3626,7 @@ a.ck-metric-group-pill {
   gap: 0.5rem;
   padding: 0.6rem 0.85rem;
   font-family: var(--ck-mono);
-  font-size: 0.
+  font-size: 0.9rem;
   font-weight: 500;
   color: var(--ck-text);
   border-bottom: 1px solid var(--ck-line);
@@ -3669,7 +3700,7 @@ a.ck-metric-group-pill {
 }
 
 .ck-api-prompt-card__desc {
-  font-size: 0.
+  font-size: 0.9rem;
   color: var(--ck-muted);
   margin: 0.2rem 0 0;
   line-height: 1.4;
data/app/controllers/completion_kit/api/v1/runs_controller.rb
CHANGED
@@ -76,7 +76,7 @@ module CompletionKit
         end
 
         def run_params
-          params.permit(:name, :prompt_id, :dataset_id, :judge_model, :temperature,
+          params.permit(:name, :prompt_id, :dataset_id, :judge_model, :temperature, :output_column,
                         metric_ids: [], tag_names: [])
         end
       end
data/app/controllers/completion_kit/api_reference_controller.rb
CHANGED
@@ -2,6 +2,12 @@ module CompletionKit
   class ApiReferenceController < ApplicationController
     def index
       @published_prompts = Prompt.current_versions.order(name: :asc)
+      @recent_runs = Run.includes(:prompt).order(created_at: :desc).limit(10)
+      @datasets = Dataset.order(name: :asc)
+      @metrics = Metric.order(name: :asc)
+      @metric_groups = MetricGroup.includes(:metrics).order(name: :asc)
+      @tags = Tag.order(name: :asc)
+      @provider_credentials = ProviderCredential.order(:provider)
       @token = CompletionKit.config.api_token
       @base_url = request.base_url + request.script_name
     end
data/app/controllers/completion_kit/datasets_controller.rb
CHANGED
@@ -9,6 +9,16 @@ module CompletionKit
 
     def show
       @runs = @dataset.runs.includes(:prompt, :responses).order(created_at: :desc)
+      respond_to do |format|
+        format.html
+        format.csv do
+          slug = @dataset.name.to_s.parameterize.presence || "dataset-#{@dataset.id}"
+          send_data @dataset.csv_data.to_s,
+                    type: "text/csv",
+                    filename: "#{slug}.csv",
+                    disposition: "attachment"
+        end
+      end
     end
 
     def new
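The CSV branch above derives the download filename from the dataset name and falls back to `dataset-<id>` when the name yields an empty slug. A minimal plain-Ruby sketch of that fallback follows. `approximate_parameterize` and `csv_filename` are hypothetical names, and the regex is only a rough stand-in for ActiveSupport's `parameterize` (it does not transliterate accented characters), so the sketch runs without Rails.

```ruby
# Rough stand-in for ActiveSupport's String#parameterize (an assumption made
# so this runs without Rails): lowercase, collapse runs of non-alphanumeric
# characters into "-", then trim leading and trailing dashes.
def approximate_parameterize(text)
  text.to_s.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-+|-+\z/, "")
end

# Hypothetical helper mirroring the controller's filename logic: use the slug
# when it is non-empty, otherwise fall back to "dataset-<id>".
def csv_filename(name, id)
  slug = approximate_parameterize(name)
  slug = "dataset-#{id}" if slug.empty?
  "#{slug}.csv"
end

puts csv_filename("Support Tickets (Q3)", 12) # support-tickets-q3.csv
puts csv_filename("???", 12)                  # dataset-12.csv
```

The fallback matters because a name made only of punctuation (or a nil name) would otherwise produce a download called `.csv`.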
data/app/controllers/completion_kit/mcp_controller.rb
CHANGED
@@ -17,7 +17,7 @@ module CompletionKit
       end
 
       session_id = request.headers["Mcp-Session-Id"]
-      unless
+      unless McpSession.active?(session_id)
         render json: jsonrpc_error(request_body["id"], -32000, "Session not initialized. Send initialize first."), status: :bad_request
         return
       end
@@ -40,7 +40,7 @@ module CompletionKit
 
     def destroy
       session_id = request.headers["Mcp-Session-Id"]
-
+      McpSession.destroy_session(session_id) if session_id
       head :ok
     end
 
data/app/controllers/completion_kit/runs_controller.rb
CHANGED
@@ -84,6 +84,7 @@ module CompletionKit
         dataset_id: @run.dataset_id,
         judge_model: @run.judge_model,
         temperature: @run.temperature,
+        output_column: @run.output_column,
         tag_names: @run.tag_names,
         status: "pending"
       )
@@ -108,6 +109,11 @@ module CompletionKit
     end
 
     def suggest
+      if @run.prompt.nil?
+        redirect_to run_path(@run), alert: "Judge-only runs don't have a prompt to improve."
+        return
+      end
+
       service = PromptImprovementService.new(@run)
       result = service.suggest
       suggestion = @run.suggestions.create!(
@@ -159,13 +165,13 @@ module CompletionKit
     end
 
     def run_params
-      params.require(:run).permit(:name, :prompt_id, :dataset_id, :judge_model, :temperature, metric_ids: [], tag_names: [])
+      params.require(:run).permit(:name, :prompt_id, :dataset_id, :judge_model, :temperature, :output_column, metric_ids: [], tag_names: [])
     end
 
     # Editing a run that already has results forks a new run — but only when a
     # field that affects generation or judging changed. Renaming or retagging is
     # pure metadata and updates the run in place.
-    GENERATION_RUN_FIELDS = %i[prompt_id dataset_id judge_model temperature].freeze
+    GENERATION_RUN_FIELDS = %i[prompt_id dataset_id judge_model temperature output_column].freeze
 
     def run_generation_changed?
       GENERATION_RUN_FIELDS.each do |field|
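The comment above says an edit forks a new run only when a generation-affecting field changed, and this release adds `output_column` to that list. The body of `run_generation_changed?` is cut off in the diff, so the following is only a sketch of one way such a check could work, not the gem's implementation. `changed_generation_fields` and the sample hashes are hypothetical, and comparing values as strings is an assumption based on form params arriving as strings.

```ruby
# Field list copied from the diff; everything below it is illustrative.
GENERATION_RUN_FIELDS = %i[prompt_id dataset_id judge_model temperature output_column].freeze

# Hypothetical comparison: returns the generation-affecting fields whose
# submitted value differs from the stored one. Values are compared as strings
# (an assumption), and fields absent from the submission count as unchanged.
def changed_generation_fields(stored, submitted)
  GENERATION_RUN_FIELDS.select do |field|
    submitted.key?(field) && submitted[field].to_s != stored[field].to_s
  end
end

stored = { name: "Baseline", prompt_id: 4, dataset_id: 9,
           judge_model: "judge-a", temperature: 0.2, output_column: nil }

# Renaming only touches metadata, so nothing generation-related changed.
rename = changed_generation_fields(stored, { name: "Renamed", dataset_id: "9" })

# Pointing a judge-only run at another column changes what gets graded.
regrade = changed_generation_fields(stored, { output_column: "model_answer" })

puts rename.inspect
puts regrade.inspect
```

Under this sketch an empty result would mean "update in place" and a non-empty result would mean "fork a new run".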
data/app/jobs/completion_kit/judge_review_job.rb
CHANGED
@@ -54,7 +54,7 @@ module CompletionKit
         evaluation = judge.evaluate(
           response.response_text,
           response.expected_output,
-          run.prompt
+          run.prompt&.template,
           criteria: metric.instruction.to_s,
           rubric_text: metric.display_rubric_text,
           input_data: response.input_data
data/app/models/completion_kit/mcp_session.rb
ADDED
@@ -0,0 +1,29 @@
+module CompletionKit
+  # MCP session marker — one row per active client session, kept in the
+  # database so sessions survive Puma restarts, deploys, and Rails.cache
+  # eviction. Expired rows are opportunistically pruned on every new
+  # session start, so the table stays bounded by recent activity.
+  class McpSession < ApplicationRecord
+    self.table_name = "completion_kit_mcp_sessions"
+
+    SESSION_TTL = 1.hour
+
+    def self.start!
+      prune_expired!
+      create!(session_id: SecureRandom.uuid, expires_at: SESSION_TTL.from_now).session_id
+    end
+
+    def self.active?(session_id)
+      return false if session_id.blank?
+      where(session_id: session_id).where("expires_at > ?", Time.current).exists?
+    end
+
+    def self.destroy_session(session_id)
+      where(session_id: session_id).delete_all
+    end
+
+    def self.prune_expired!
+      where("expires_at < ?", Time.current).delete_all
+    end
+  end
+end
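The session lifecycle in `McpSession` (start, check, destroy, prune on start) can be exercised without a database. The sketch below is a hypothetical in-memory stand-in, not part of the gem: `InMemorySessionStore`, its `clock:` parameter, and the `size` helper are assumptions added so that expiry can be shown deterministically. Only the TTL of one hour and the comparison operators come from the model.

```ruby
require "securerandom"

# Hypothetical in-memory stand-in for the completion_kit_mcp_sessions table.
class InMemorySessionStore
  SESSION_TTL = 3600 # seconds, matching the model's 1.hour

  # clock is injectable (an assumption for this sketch) so expiry is testable.
  def initialize(clock: -> { Time.now })
    @clock = clock
    @rows = {} # session_id => expires_at
  end

  # Prunes expired rows first, then records a new session and returns its id.
  def start!
    prune_expired!
    session_id = SecureRandom.uuid
    @rows[session_id] = @clock.call + SESSION_TTL
    session_id
  end

  # A session is active only while its expiry is strictly in the future.
  def active?(session_id)
    return false if session_id.nil? || session_id.empty?
    expires_at = @rows[session_id]
    !expires_at.nil? && expires_at > @clock.call
  end

  def destroy_session(session_id)
    @rows.delete(session_id)
  end

  # Removes rows whose expiry is strictly in the past.
  def prune_expired!
    @rows.delete_if { |_id, expires_at| expires_at < @clock.call }
  end

  def size
    @rows.size
  end
end

now = Time.at(0)
store = InMemorySessionStore.new(clock: -> { now })
first = store.start!
puts store.active?(first) # true: just created
now += 3601               # move the fake clock past the one-hour TTL
puts store.active?(first) # false: expired, though not yet pruned
```

As in the model, an expired row lingers until the next `start!` call, which is why the table size tracks recent activity rather than total history.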
data/app/models/completion_kit/run.rb
CHANGED
@@ -5,7 +5,7 @@ module CompletionKit
 
     STATUSES = %w[pending running completed failed].freeze
 
-    belongs_to :prompt
+    belongs_to :prompt, optional: true
     belongs_to :dataset, optional: true
     has_many :responses, dependent: :destroy
     has_many :run_metrics, -> { order(:position) }, dependent: :destroy
@@ -15,10 +15,18 @@ module CompletionKit
     validates :name, presence: true
     validates :status, inclusion: { in: STATUSES }
     validate :dataset_supplies_prompt_variables
+    validate :judge_only_run_supplies_output_column
 
     before_validation :set_default_status, on: :create
     before_validation :set_auto_name, on: :create
 
+    # A judge-only run grades a pre-existing column on the dataset instead of
+    # generating new outputs. No prompt is attached; the response text is read
+    # from row[output_column]; no LLM generation happens.
+    def judge_only?
+      prompt.nil?
+    end
+
     def missing_dataset_variables
       return [] unless prompt
       vars = prompt.variables
@@ -89,9 +97,14 @@ module CompletionKit
 
       return fail_with_summary!("Dataset has no rows") if rows.empty?
 
-
-
-      return fail_with_summary!("
+      if judge_only?
+        column = output_column.presence || "actual_output"
+        return fail_with_summary!("Dataset has no \"#{column}\" column") unless dataset && dataset.headers.include?(column)
+      else
+        client = LlmClient.for_model(prompt.llm_model, ApiConfig.for_model(prompt.llm_model))
+        unless client.configured?
+          return fail_with_summary!("LLM API not configured: #{client.configuration_errors.join(', ')}")
+        end
       end
 
       transaction do
@@ -105,14 +118,27 @@ module CompletionKit
         )
         rows.each_with_index do |row, index|
           input = row.empty? ? nil : row.to_json
-
+          attrs = {
             status: "pending",
             row_index: index,
             input_data: input,
             expected_output: row["expected_output"]
-
-
+          }
+          if judge_only?
+            attrs[:status] = "succeeded"
+            attrs[:response_text] = row[output_column.presence || "actual_output"].to_s
+          end
+
+          response = responses.create!(attrs)
+
+          if judge_only?
+            metrics.each { |m| JudgeReviewJob.perform_later(response.id, m.id) } if judge_configured?
+          else
+            GenerateRowJob.perform_later(id, response.id)
+          end
         end
+
+        RunCompletionCheckJob.perform_later(id) if judge_only?
       end
 
       broadcast_ui
@@ -168,6 +194,7 @@ module CompletionKit
       {
         id: id, name: name, status: status, prompt_id: prompt_id,
         dataset_id: dataset_id, judge_model: judge_model, temperature: temperature,
+        output_column: output_column,
         created_at: created_at, updated_at: updated_at,
         responses_count: responses.count, avg_score: avg_score,
         progress_current: snap[:generated_done],
@@ -274,10 +301,14 @@ module CompletionKit
 
     def set_auto_name
       return if name.present?
-      return unless prompt.present?
 
-
-
+      if prompt.present?
+        count = Run.where(prompt_id: prompt_id).count + 1
+        self.name = "#{prompt.name} — v#{prompt.version_number} ##{count}"
+      elsif dataset.present?
+        count = Run.where(prompt_id: nil, dataset_id: dataset.id).count + 1
+        self.name = "#{dataset.name} — judge-only ##{count}"
+      end
     end
 
     def dataset_supplies_prompt_variables
@@ -290,5 +321,19 @@ module CompletionKit
         errors.add(:dataset_id, "is missing columns required by the prompt: #{missing.join(', ')}")
       end
     end
+
+    def judge_only_run_supplies_output_column
+      return if prompt.present?
+
+      if dataset.nil?
+        errors.add(:dataset_id, "is required for a judge-only run (no prompt)")
+        return
+      end
+
+      column = output_column.presence || "actual_output"
+      unless dataset.headers.include?(column)
+        errors.add(:output_column, "\"#{column}\" is not a column on dataset \"#{dataset.name}\"")
+      end
+    end
   end
 end
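For a judge-only run, the model reads each response from `row[output_column]`, defaults the column to `"actual_output"`, and rejects datasets that lack it. The plain-Ruby sketch below walks one row through that path. The helper names (`resolve_output_column`, `output_column_error`, `judge_only_response_attrs`) and the sample row are hypothetical, and the blank check is a stand-in for ActiveSupport's `.presence`, so treat it as an illustration rather than the gem's code.

```ruby
require "json"

DEFAULT_OUTPUT_COLUMN = "actual_output" # default named in the diff

# Stand-in for `output_column.presence || "actual_output"`: nil, empty, or
# whitespace-only values fall back to the default; anything else is kept as-is.
def resolve_output_column(output_column)
  output_column.to_s.strip.empty? ? DEFAULT_OUTPUT_COLUMN : output_column
end

# Hypothetical validation helper: returns nil when the column exists in the
# dataset headers, or an error message when it does not.
def output_column_error(headers, output_column)
  column = resolve_output_column(output_column)
  return nil if headers.include?(column)
  "\"#{column}\" is not a column on the dataset"
end

# Builds the attributes for one pre-graded response row. The status is
# "succeeded" straight away because no generation step runs.
def judge_only_response_attrs(row, index, output_column)
  {
    status: "succeeded",
    row_index: index,
    input_data: row.empty? ? nil : row.to_json,
    expected_output: row["expected_output"],
    response_text: row[resolve_output_column(output_column)].to_s
  }
end

row = { "question" => "2 + 2?", "expected_output" => "4", "vendor_answer" => "four" }

puts output_column_error(row.keys, nil).inspect             # error: no "actual_output" column
puts output_column_error(row.keys, "vendor_answer").inspect # nil: column exists
puts judge_only_response_attrs(row, 0, "vendor_answer")[:response_text]
```

Note that a missing cell becomes an empty string through `to_s`, so the judge still receives a response to grade.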
data/app/services/completion_kit/mcp_dispatcher.rb
CHANGED
@@ -6,10 +6,8 @@ module CompletionKit
     PROTOCOL_VERSION = "2025-03-26"
 
     def self.initialize_session
-      session_id = SecureRandom.uuid
-      Rails.cache.write("mcp_session:#{session_id}", true, expires_in: 1.hour)
       {
-        session_id:
+        session_id: McpSession.start!,
         protocolVersion: PROTOCOL_VERSION,
         serverInfo: {name: "CompletionKit", version: CompletionKit::VERSION},
         capabilities: {tools: {listChanged: false}}
data/app/services/completion_kit/mcp_tools/runs.rb
CHANGED
@@ -15,16 +15,17 @@ module CompletionKit
           handler: :get
         },
         "runs_create" => {
-          description: "Create a run",
+          description: "Create a run. Omit prompt_id and provide output_column for a judge-only run that grades a pre-existing dataset column instead of generating new outputs.",
           inputSchema: {
             type: "object",
             properties: {
               name: {type: "string"}, prompt_id: {type: "integer"},
               dataset_id: {type: "integer"}, judge_model: {type: "string"},
+              output_column: {type: "string", description: "Dataset column to grade when prompt_id is omitted; defaults to \"actual_output\"."},
               metric_ids: {type: "array", items: {type: "integer"}},
               tag_names: {type: "array", items: {type: "string"}}
             },
-            required: ["name"
+            required: ["name"]
           },
           handler: :create
         },
@@ -35,6 +36,7 @@ module CompletionKit
             properties: {
               id: {type: "integer"}, name: {type: "string"},
               dataset_id: {type: "integer"}, judge_model: {type: "string"},
+              output_column: {type: "string"},
               metric_ids: {type: "array", items: {type: "integer"}},
               tag_names: {type: "array", items: {type: "string"}}
             },
@@ -63,7 +65,7 @@ module CompletionKit
       end
 
       def self.create(args)
-        run = Run.new(args.slice("name", "prompt_id", "dataset_id", "judge_model"))
+        run = Run.new(args.slice("name", "prompt_id", "dataset_id", "judge_model", "output_column"))
         if run.save
           run.replace_metrics!(args["metric_ids"])
           run.update!(tag_names: args["tag_names"]) if args.key?("tag_names")
@@ -75,7 +77,7 @@ module CompletionKit
 
       def self.update(args)
         run = Run.find(args["id"])
-        if run.update(args.except("id", "metric_ids", "tag_names").slice("name", "dataset_id", "judge_model", "output_column"))
+        if run.update(args.except("id", "metric_ids", "tag_names").slice("name", "dataset_id", "judge_model", "output_column"))
           run.replace_metrics!(args["metric_ids"]) if args.key?("metric_ids")
           run.update!(tag_names: args["tag_names"]) if args.key?("tag_names")
           text_result(run.reload.as_json)
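With `prompt_id` no longer required, an MCP client can ask `runs_create` for a judge-only run by omitting it and naming the column to grade. The sketch below only builds the JSON-RPC request body and does not send it. `runs_create_request` is a hypothetical helper, and the run name, dataset id, column name, and metric ids are made-up values. The `tools/call` envelope follows the Model Context Protocol, and per the controller diff the request would also need an `Mcp-Session-Id` header from a prior `initialize` call.

```ruby
require "json"

# Hypothetical helper: builds a JSON-RPC 2.0 "tools/call" body that invokes
# the runs_create tool with the given arguments.
def runs_create_request(id:, arguments:)
  JSON.generate({
    jsonrpc: "2.0",
    id: id,
    method: "tools/call",
    params: { name: "runs_create", arguments: arguments }
  })
end

# Judge-only run: no prompt_id, so the run grades an existing dataset column.
# All values below are illustrative.
body = runs_create_request(
  id: 1,
  arguments: {
    name: "Grade vendor answers",   # the only field the schema requires
    dataset_id: 42,
    output_column: "vendor_answer", # omit to use the "actual_output" default
    metric_ids: [3, 5]
  }
)

puts body
```

Leaving out `output_column` as well would make the server look for an `actual_output` column, and the run would fail validation if the dataset has none.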