completion-kit 0.5.10 → 0.5.11
- checksums.yaml +4 -4
- data/README.md +15 -15
- data/app/assets/stylesheets/completion_kit/application.css +3 -0
- data/app/controllers/completion_kit/api/v1/runs_controller.rb +1 -1
- data/app/controllers/completion_kit/runs_controller.rb +8 -2
- data/app/jobs/completion_kit/judge_review_job.rb +1 -1
- data/app/models/completion_kit/run.rb +55 -10
- data/app/services/completion_kit/mcp_tools/runs.rb +6 -4
- data/app/views/completion_kit/api_reference/_body.html.erb +1 -1
- data/app/views/completion_kit/responses/show.html.erb +26 -11
- data/app/views/completion_kit/runs/_form.html.erb +50 -3
- data/app/views/completion_kit/runs/_row.html.erb +6 -2
- data/app/views/completion_kit/runs/_status_header.html.erb +5 -1
- data/app/views/completion_kit/runs/edit.html.erb +6 -2
- data/app/views/completion_kit/runs/show.html.erb +24 -15
- data/db/migrate/20260514000001_allow_judge_only_runs.rb +6 -0
- data/lib/completion_kit/version.rb +1 -1
- metadata +2 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-metadata.gz:
-data.tar.gz:
+metadata.gz: ed531ae29162bb91d2c463c3ff4eb20b5da469b9b7a21baddf5054a0ccc15041
+data.tar.gz: b86aea95b2e1cf73abf6514093565dc07b12dc0f4fe5c5c5c8b80db3fbdfa83d
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 04ae500020e71d52c41073c36a6741bc47b94a06ceec6548720d6022a60ce7422be8a354d627c39a7be8174af2ce65219041c5d99ad175157c7bf4b4eaf8f056
+data.tar.gz: 261daeeb1555b3aecb8e2e18edb7f14ebdc37f974c2713ecbe12c43a281e8109edc45222be0f6039c325a0428e89c1c11a1a7104f0a36ace2b618fb2ef1cb7e8
data/README.md
CHANGED
@@ -14,21 +14,23 @@ Run every prompt against real data. Score each output with an LLM judge against
 
 It's the difference between "this prompt seems to work" and "this prompt scores 4.3 out of 5 across 200 inputs, up from 3.8 last version."
 
-**[completionkit.com](https://completionkit.com)** | **[RubyGems](https://rubygems.org/gems/completion-kit)**
+**[Start on completionkit.com →](https://completionkit.com)** | **[RubyGems](https://rubygems.org/gems/completion-kit)**
 
-> **
+> **Just want to use it?** [CompletionKit Cloud](https://completionkit.com) is the same engine, fully hosted — zero install, no Rails ops, plans at [completionkit.com/pricing](https://completionkit.com/pricing).
 
 
 
-##
+## Three ways to run it
 
-
+Same engine, same UI, same REST API and MCP server — pick the deployment that fits.
 
-
+### 1. Hosted — [completionkit.com](https://completionkit.com) (recommended)
 
-
+The fastest path. Sign up and you're running on the same engine you'd self-host, without touching a Rails app. No `db:migrate`, no Puma, no Solid Queue, no provider key management — multi-tenant workspaces, your team logs in, you go. Plans at [completionkit.com/pricing](https://completionkit.com/pricing).
 
-Self-
+### 2. Self-hosted — the bundled standalone Rails app
+
+Run it on your own infra. No existing Rails app required; Postgres + any Rails-friendly host (Fly, Render, Heroku, Docker, …).
 
 ```bash
 git clone https://github.com/homemade-software-inc/completion-kit.git
@@ -38,7 +40,7 @@ bin/rails completion_kit:install:migrations
 bin/rails db:migrate
 ```
 
-
+Run **both** a web server and a Solid Queue worker. In two terminals:
 
 ```bash
 bin/rails server
@@ -50,9 +52,9 @@ bin/jobs
 
 Or with [foreman](https://github.com/ddollar/foreman) in one terminal: `foreman start -f Procfile.dev`.
 
-Visit `http://localhost:3000`. Add a provider credential (Settings), create a prompt, upload a CSV dataset, and run it.
+Visit `http://localhost:3000`. Add a provider credential (Settings), create a prompt, upload a CSV dataset, and run it. See [Deploying self-hosted](#deploying-self-hosted) for the production-env setup.
 
-###
+### 3. Rails engine — mount into your existing Rails app
 
 ```ruby
 gem "completion-kit"
@@ -63,11 +65,9 @@ bin/rails generate completion_kit:install
 bin/rails db:migrate
 ```
 
-The engine mounts at `/completion_kit
-
-### Host-app layout integration
+The engine mounts at `/completion_kit`. Generate / judge flows enqueue Active Job jobs (`CompletionKit::GenerateRowJob`, `CompletionKit::JudgeReviewJob`, `CompletionKit::RunCompletionCheckJob`), so your host app needs an Active Job adapter that actually processes them — Solid Queue, Sidekiq, GoodJob, etc. The `:async` adapter is **not** suitable for production: it runs jobs in the web Puma's thread pool with no durability and no retry, and a long LLM call will block request handling.
 
-If your host app overrides the engine layout (e.g. `layout "application"` on engine controllers, or rendering engine views inside your own shell), include both the engine's stylesheet and JavaScript in that layout:
+**Host-app layout integration.** If your host app overrides the engine layout (e.g. `layout "application"` on engine controllers, or rendering engine views inside your own shell), include both the engine's stylesheet and JavaScript in that layout:
 
 ```erb
 <%= stylesheet_link_tag "completion_kit/application", media: "all" %>
@@ -183,7 +183,7 @@ CompletionKit runs a [Model Context Protocol](https://modelcontextprotocol.io) s
 
 The in-app API reference page has install snippets you can copy straight into your MCP client config.
 
-## Deploying
+## Deploying self-hosted
 
 Any Rails-friendly host works (Fly, Heroku, Render, Docker, etc.). Point it at a Postgres instance via `DATABASE_URL`, set your provider env vars, and run `cd standalone && bin/rails db:migrate` on each deploy.
 
data/app/controllers/completion_kit/api/v1/runs_controller.rb
CHANGED
@@ -76,7 +76,7 @@ module CompletionKit
 end
 
 def run_params
-params.permit(:name, :prompt_id, :dataset_id, :judge_model, :temperature,
+params.permit(:name, :prompt_id, :dataset_id, :judge_model, :temperature, :output_column,
 metric_ids: [], tag_names: [])
 end
 end
data/app/controllers/completion_kit/runs_controller.rb
CHANGED
@@ -84,6 +84,7 @@ module CompletionKit
 dataset_id: @run.dataset_id,
 judge_model: @run.judge_model,
 temperature: @run.temperature,
+output_column: @run.output_column,
 tag_names: @run.tag_names,
 status: "pending"
 )
@@ -108,6 +109,11 @@ module CompletionKit
 end
 
 def suggest
+if @run.prompt.nil?
+redirect_to run_path(@run), alert: "Judge-only runs don't have a prompt to improve."
+return
+end
+
 service = PromptImprovementService.new(@run)
 result = service.suggest
 suggestion = @run.suggestions.create!(
@@ -159,13 +165,13 @@ module CompletionKit
 end
 
 def run_params
-params.require(:run).permit(:name, :prompt_id, :dataset_id, :judge_model, :temperature, metric_ids: [], tag_names: [])
+params.require(:run).permit(:name, :prompt_id, :dataset_id, :judge_model, :temperature, :output_column, metric_ids: [], tag_names: [])
 end
 
 # Editing a run that already has results forks a new run — but only when a
 # field that affects generation or judging changed. Renaming or retagging is
 # pure metadata and updates the run in place.
-GENERATION_RUN_FIELDS = %i[prompt_id dataset_id judge_model temperature].freeze
+GENERATION_RUN_FIELDS = %i[prompt_id dataset_id judge_model temperature output_column].freeze
 
 def run_generation_changed?
 GENERATION_RUN_FIELDS.each do |field|
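The comment in this hunk says an edit forks a new run only when a field that affects generation or judging changes. A minimal Ruby sketch of that decision follows, assuming plain hashes stand in for the stored run and the submitted params. `generation_changed?` and its arguments are hypothetical names for illustration; the body of the gem's own `run_generation_changed?` is not shown in this diff.

```ruby
# Hypothetical stand-in for the controller check. The field list mirrors
# GENERATION_RUN_FIELDS from the diff; everything else is an assumption.
def generation_changed?(current, submitted,
                        fields: %i[prompt_id dataset_id judge_model temperature output_column])
  fields.any? do |field|
    # Only fields that were actually submitted can count as a change.
    next false unless submitted.key?(field)

    # Compare as strings because form params arrive as strings.
    submitted[field].to_s != current[field].to_s
  end
end

stored = { name: "Baseline", prompt_id: nil, dataset_id: 7, output_column: "actual_output" }

renamed  = generation_changed?(stored, { name: "Renamed" })               # metadata only
recolumn = generation_changed?(stored, { output_column: "model_answer" }) # affects judging
puts "rename forks: #{renamed}, column change forks: #{recolumn}"
```

With `output_column` in the list, pointing an existing judge-only run at a different column is treated like changing the dataset: under this sketch it would fork rather than update in place.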
data/app/jobs/completion_kit/judge_review_job.rb
CHANGED
@@ -54,7 +54,7 @@ module CompletionKit
 evaluation = judge.evaluate(
 response.response_text,
 response.expected_output,
-run.prompt
+run.prompt&.template,
 criteria: metric.instruction.to_s,
 rubric_text: metric.display_rubric_text,
 input_data: response.input_data
data/app/models/completion_kit/run.rb
CHANGED
@@ -5,7 +5,7 @@ module CompletionKit
 
 STATUSES = %w[pending running completed failed].freeze
 
-belongs_to :prompt
+belongs_to :prompt, optional: true
 belongs_to :dataset, optional: true
 has_many :responses, dependent: :destroy
 has_many :run_metrics, -> { order(:position) }, dependent: :destroy
@@ -15,10 +15,18 @@ module CompletionKit
 validates :name, presence: true
 validates :status, inclusion: { in: STATUSES }
 validate :dataset_supplies_prompt_variables
+validate :judge_only_run_supplies_output_column
 
 before_validation :set_default_status, on: :create
 before_validation :set_auto_name, on: :create
 
+# A judge-only run grades a pre-existing column on the dataset instead of
+# generating new outputs. No prompt is attached; the response text is read
+# from row[output_column]; no LLM generation happens.
+def judge_only?
+prompt.nil?
+end
+
 def missing_dataset_variables
 return [] unless prompt
 vars = prompt.variables
@@ -89,9 +97,14 @@ module CompletionKit
 
 return fail_with_summary!("Dataset has no rows") if rows.empty?
 
-
-
-return fail_with_summary!("
+if judge_only?
+column = output_column.presence || "actual_output"
+return fail_with_summary!("Dataset has no \"#{column}\" column") unless dataset && dataset.headers.include?(column)
+else
+client = LlmClient.for_model(prompt.llm_model, ApiConfig.for_model(prompt.llm_model))
+unless client.configured?
+return fail_with_summary!("LLM API not configured: #{client.configuration_errors.join(', ')}")
+end
 end
 
 transaction do
@@ -105,14 +118,27 @@ module CompletionKit
 )
 rows.each_with_index do |row, index|
 input = row.empty? ? nil : row.to_json
-
+attrs = {
 status: "pending",
 row_index: index,
 input_data: input,
 expected_output: row["expected_output"]
-
-
+}
+if judge_only?
+attrs[:status] = "succeeded"
+attrs[:response_text] = row[output_column.presence || "actual_output"].to_s
+end
+
+response = responses.create!(attrs)
+
+if judge_only?
+metrics.each { |m| JudgeReviewJob.perform_later(response.id, m.id) } if judge_configured?
+else
+GenerateRowJob.perform_later(id, response.id)
+end
 end
+
+RunCompletionCheckJob.perform_later(id) if judge_only?
 end
 
 broadcast_ui
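As a rough illustration of the loop in this hunk, the per-row attributes can be built in plain Ruby without Rails. This is a sketch under stated assumptions: `build_response_attrs` is a hypothetical helper, rows are assumed to be string-keyed hashes, and the blank check stands in for ActiveSupport's `presence`.

```ruby
require "json"

# Hypothetical helper mirroring the attrs hash built for each dataset row.
def build_response_attrs(row, index, judge_only:, output_column: nil)
  attrs = {
    status: "pending",
    row_index: index,
    input_data: row.empty? ? nil : JSON.generate(row),
    expected_output: row["expected_output"]
  }

  if judge_only
    # Blank or missing column names fall back to "actual_output".
    column = output_column.to_s.strip.empty? ? "actual_output" : output_column
    # No generation step: the text to grade is copied from the dataset row.
    attrs[:status] = "succeeded"
    attrs[:response_text] = row[column].to_s
  end

  attrs
end

row = { "question" => "2 + 2?", "expected_output" => "4", "actual_output" => "four" }
attrs = build_response_attrs(row, 0, judge_only: true)
puts attrs[:status]
puts attrs[:response_text]
```

In the diff itself, a judge-only row is then handed straight to `JudgeReviewJob` for each metric, while a regular row goes to `GenerateRowJob` first.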
@@ -168,6 +194,7 @@ module CompletionKit
 {
 id: id, name: name, status: status, prompt_id: prompt_id,
 dataset_id: dataset_id, judge_model: judge_model, temperature: temperature,
+output_column: output_column,
 created_at: created_at, updated_at: updated_at,
 responses_count: responses.count, avg_score: avg_score,
 progress_current: snap[:generated_done],
@@ -274,10 +301,14 @@ module CompletionKit
 
 def set_auto_name
 return if name.present?
-return unless prompt.present?
 
-
-
+if prompt.present?
+count = Run.where(prompt_id: prompt_id).count + 1
+self.name = "#{prompt.name} — v#{prompt.version_number} ##{count}"
+elsif dataset.present?
+count = Run.where(prompt_id: nil, dataset_id: dataset.id).count + 1
+self.name = "#{dataset.name} — judge-only ##{count}"
+end
 end
 
 def dataset_supplies_prompt_variables
@@ -290,5 +321,19 @@ module CompletionKit
 errors.add(:dataset_id, "is missing columns required by the prompt: #{missing.join(', ')}")
 end
 end
+
+def judge_only_run_supplies_output_column
+return if prompt.present?
+
+if dataset.nil?
+errors.add(:dataset_id, "is required for a judge-only run (no prompt)")
+return
+end
+
+column = output_column.presence || "actual_output"
+unless dataset.headers.include?(column)
+errors.add(:output_column, "\"#{column}\" is not a column on dataset \"#{dataset.name}\"")
+end
+end
 end
 end
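The validation added in this hunk boils down to three cases, which can be traced in a standalone Ruby sketch. `judge_only_errors` and its keyword arguments are hypothetical; the real method is an Active Record validation that writes to `errors`, and the blank check here is an assumption standing in for `presence`.

```ruby
# Hypothetical plain-Ruby version of the judge-only validation rules.
def judge_only_errors(prompt_id:, dataset_headers:, output_column: nil)
  # Runs with a prompt are regular runs; the judge-only rule does not apply.
  return [] unless prompt_id.nil?

  # A judge-only run has nothing to grade without a dataset.
  return ["dataset is required for a judge-only run (no prompt)"] if dataset_headers.nil?

  # Blank column names fall back to the default column.
  column = output_column.to_s.strip.empty? ? "actual_output" : output_column
  return [] if dataset_headers.include?(column)

  ["\"#{column}\" is not a column on the dataset"]
end

headers = %w[question expected_output actual_output]
p judge_only_errors(prompt_id: nil, dataset_headers: headers)
p judge_only_errors(prompt_id: nil, dataset_headers: headers, output_column: "answer")
p judge_only_errors(prompt_id: nil, dataset_headers: nil)
```

The same default-column fallback appears again in `start!`-style code earlier in the diff, so a run that passes validation can still fail later if the dataset's headers change before it executes.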
data/app/services/completion_kit/mcp_tools/runs.rb
CHANGED
@@ -15,16 +15,17 @@ module CompletionKit
 handler: :get
 },
 "runs_create" => {
-description: "Create a run",
+description: "Create a run. Omit prompt_id and provide output_column for a judge-only run that grades a pre-existing dataset column instead of generating new outputs.",
 inputSchema: {
 type: "object",
 properties: {
 name: {type: "string"}, prompt_id: {type: "integer"},
 dataset_id: {type: "integer"}, judge_model: {type: "string"},
+output_column: {type: "string", description: "Dataset column to grade when prompt_id is omitted; defaults to \"actual_output\"."},
 metric_ids: {type: "array", items: {type: "integer"}},
 tag_names: {type: "array", items: {type: "string"}}
 },
-required: ["name"
+required: ["name"]
 },
 handler: :create
 },
@@ -35,6 +36,7 @@ module CompletionKit
 properties: {
 id: {type: "integer"}, name: {type: "string"},
 dataset_id: {type: "integer"}, judge_model: {type: "string"},
+output_column: {type: "string"},
 metric_ids: {type: "array", items: {type: "integer"}},
 tag_names: {type: "array", items: {type: "string"}}
 },
@@ -63,7 +65,7 @@ module CompletionKit
 end
 
 def self.create(args)
-run = Run.new(args.slice("name", "prompt_id", "dataset_id", "judge_model"))
+run = Run.new(args.slice("name", "prompt_id", "dataset_id", "judge_model", "output_column"))
 if run.save
 run.replace_metrics!(args["metric_ids"])
 run.update!(tag_names: args["tag_names"]) if args.key?("tag_names")
@@ -75,7 +77,7 @@ module CompletionKit
 
 def self.update(args)
 run = Run.find(args["id"])
-if run.update(args.except("id", "metric_ids", "tag_names").slice("name", "dataset_id", "judge_model"))
+if run.update(args.except("id", "metric_ids", "tag_names").slice("name", "dataset_id", "judge_model", "output_column"))
 run.replace_metrics!(args["metric_ids"]) if args.key?("metric_ids")
 run.update!(tag_names: args["tag_names"]) if args.key?("tag_names")
 text_result(run.reload.as_json)
data/app/views/completion_kit/api_reference/_body.html.erb
CHANGED
@@ -121,7 +121,7 @@
 <div class="ck-api-endpoint">
 <p class="ck-api-method"><span class="ck-chip ck-chip--soft">POST</span> /api/v1/runs</p>
 <p class="ck-meta-copy">Create a new run.</p>
-<p class="ck-api-params"><strong>
+<p class="ck-api-params"><strong>Optional:</strong> <code>name</code>, <code>prompt_id</code>, <code>dataset_id</code>, <code>metric_ids</code>, <code>judge_model</code>, <code>output_column</code> (judge-only: omit <code>prompt_id</code> and grade a dataset column instead, default <code>actual_output</code>)</p>
 <%= render "completion_kit/api_reference/example", base_url: base_url, token: token, real_token: real_token, cmd: "curl -X POST #{base_url}/api/v1/runs \\\n -H \"Authorization: Bearer #{token}\" \\\n -H \"Content-Type: application/json\" \\\n -d '{\"prompt_id\": 1, \"dataset_id\": 1, \"metric_ids\": [1, 2]}'" %>
 </div>
 <div class="ck-api-endpoint">
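The curl example in the reference still shows a prompt-based run. For comparison, here is a hedged Ruby sketch of what a judge-only request to the same endpoint might look like, built with the standard library and not sent. The base URL, token, judge model name, and the `build_judge_only_request` helper are all placeholders or assumptions; only the path, the headers, and the parameter names come from the diff.

```ruby
require "json"
require "net/http"
require "uri"

# Hypothetical helper: builds (but does not send) a judge-only run request.
def build_judge_only_request(base_url, token, dataset_id:, metric_ids:, judge_model:,
                             output_column: "actual_output")
  uri = URI("#{base_url}/api/v1/runs")
  request = Net::HTTP::Post.new(uri)
  request["Authorization"] = "Bearer #{token}"
  request["Content-Type"] = "application/json"

  # prompt_id is deliberately left out: per the reference text, omitting it
  # makes the run grade an existing dataset column instead of generating.
  payload = {
    dataset_id: dataset_id,
    metric_ids: metric_ids,
    judge_model: judge_model,
    output_column: output_column
  }
  request.body = JSON.generate(payload)
  request
end

# Placeholder values; substitute your own host, token, and judge model.
request = build_judge_only_request(
  "http://localhost:3000", "YOUR_API_TOKEN",
  dataset_id: 1, metric_ids: [1, 2], judge_model: "example-judge-model"
)
puts request.path
puts request.body
```

Sending it would be a matter of passing `request` to `Net::HTTP.start(...) { |http| http.request(request) }` against a running instance, which this sketch leaves out on purpose.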
data/app/views/completion_kit/responses/show.html.erb
CHANGED
@@ -1,6 +1,10 @@
 <ol class="ck-breadcrumb">
-
-
+<% if @run.prompt %>
+<li><%= link_to "Prompts", prompts_path %></li>
+<li><%= link_to @run.prompt.name, prompt_path(@run.prompt) %></li>
+<% else %>
+<li><%= link_to "Runs", runs_path %></li>
+<% end %>
 <li><%= link_to @run.name, run_path(@run) %></li>
 <li>Response #<%= @response_number %></li>
 </ol>
@@ -30,20 +34,29 @@
 <span class="ck-run-config__key">Run</span>
 <%= link_to @run.name, run_path(@run), class: "ck-link" %>
 </div>
-
-<
-
-
+<% if @run.prompt %>
+<div class="ck-run-config__row">
+<span class="ck-run-config__key">Prompt</span>
+<%= link_to @run.prompt.display_name, prompt_path(@run.prompt), class: "ck-link" %>
+</div>
+<% else %>
+<div class="ck-run-config__row">
+<span class="ck-run-config__key">Output</span>
+<span>Dataset column <code><%= @run.output_column.presence || "actual_output" %></code></span>
+</div>
+<% end %>
 <% if @run.dataset %>
 <div class="ck-run-config__row">
 <span class="ck-run-config__key">Dataset</span>
 <%= link_to @run.dataset.name, dataset_path(@run.dataset), class: "ck-link" %>
 </div>
 <% end %>
-
-<
-
-
+<% if @run.prompt %>
+<div class="ck-run-config__row">
+<span class="ck-run-config__key">Model</span>
+<span style="text-transform: none;"><%= @run.prompt.llm_model %></span>
+</div>
+<% end %>
 <% if @run.judge_model.present? %>
 <div class="ck-run-config__row">
 <span class="ck-run-config__key">Judge</span>
@@ -60,7 +73,9 @@
 <section class="ck-card--spaced">
 <div class="ck-prompt-preview__header">
 <p class="ck-kicker">Response</p>
-
+<% if @run.prompt %>
+<span class="ck-chip ck-chip--soft" style="text-transform: none;"><%= @run.prompt.llm_model %></span>
+<% end %>
 </div>
 <pre class="ck-code"><%= @response.response_text %></pre>
 </section>
data/app/views/completion_kit/runs/_form.html.erb
CHANGED
@@ -17,6 +17,17 @@
 </div>
 
 <div class="ck-field">
+<label class="ck-checkbox-label">
+<%= check_box_tag "run[judge_only]", "1", run.persisted? && run.judge_only?, id: "run_judge_only", class: "ck-checkbox" %>
+<span class="ck-checkbox-label__box" aria-hidden="true"></span>
+<span class="ck-checkbox-label__body">
+<span class="ck-checkbox-label__text">Judge-only run</span>
+<span class="ck-checkbox-label__hint">Grade an existing column on the dataset instead of running a prompt. Roughly half the LLM calls per row.</span>
+</span>
+</label>
+</div>
+
+<div class="ck-field" id="prompt-field">
 <%= form.label :prompt_id, "Prompt", class: "ck-label" %>
 <%= form.select :prompt_id,
 @prompts.map { |p|
@@ -43,6 +54,12 @@
 </div>
 </div>
 
+<div class="ck-field" id="output-column-field" hidden>
+<%= form.label :output_column, "Output column", class: "ck-label" %>
+<%= form.text_field :output_column, value: run.output_column.presence || "actual_output", class: "ck-input", id: "run_output_column", placeholder: "actual_output" %>
+<p class="ck-field-hint">Name of the dataset column whose value will be graded as the response. Defaults to <code>actual_output</code>.</p>
+</div>
+
 <div class="ck-field" id="dataset-field">
 <%= form.label :dataset_id, "Dataset", class: "ck-label" %>
 <% if @datasets.empty? %>
@@ -157,6 +174,15 @@
 function updateRunForm() {
 var promptEl = document.getElementById('run_prompt_id');
 var judgeEl = document.getElementById('run_judge_model');
+var judgeOnlyEl = document.getElementById('run_judge_only');
+var judgeOnly = !!(judgeOnlyEl && judgeOnlyEl.checked);
+var promptField = document.getElementById('prompt-field');
+var outputColumnField = document.getElementById('output-column-field');
+var outputColumnEl = document.getElementById('run_output_column');
+if (promptField) promptField.hidden = judgeOnly;
+if (outputColumnField) outputColumnField.hidden = !judgeOnly;
+if (judgeOnly && promptEl) promptEl.value = '';
+
 var prompt = promptEl ? promptEl.value : '';
 var judge = judgeEl ? judgeEl.value : '';
 var metrics = document.querySelectorAll('input[name="run[metric_ids][]"]:checked');
@@ -222,11 +248,28 @@ function updateRunForm() {
 }
 }
 
-var valid
+var valid;
+if (judgeOnly) {
+valid = !!dataset;
+if (dataset && datasetEl && outputColumnEl) {
+var headersJudge = (datasetEl.options[datasetEl.selectedIndex] && datasetEl.options[datasetEl.selectedIndex].dataset.headers ? datasetEl.options[datasetEl.selectedIndex].dataset.headers.split(/,\s*/) : []).filter(Boolean);
+var col = (outputColumnEl.value || 'actual_output').trim();
+if (col === '' || headersJudge.indexOf(col) === -1) {
+valid = false;
+if (datasetField) datasetField.className = 'ck-field ck-field--error';
+if (datasetHint) datasetHint.textContent = "Dataset has no \"" + col + "\" column — pick a different output column or dataset.";
+}
+} else if (!dataset) {
+if (datasetField) datasetField.className = 'ck-field ck-field--info';
+if (datasetHint) datasetHint.textContent = 'Judge-only runs need a dataset that supplies the output column.';
+}
+} else {
+valid = prompt !== '';
+if (hasVars && !dataset) valid = false;
+if (missingVars.length > 0) valid = false;
+}
 if (judge && metrics.length === 0) valid = false;
 if (!judge && metrics.length > 0) valid = false;
-if (hasVars && !dataset) valid = false;
-if (missingVars.length > 0) valid = false;
 if (submitBtn) submitBtn.disabled = !valid;
 
 ckUpdateMetricGroupsState();
@@ -260,9 +303,13 @@ function ckUpdateMetricGroupsState() {
 var judgeEl = document.getElementById('run_judge_model');
 var promptEl = document.getElementById('run_prompt_id');
 var datasetEl = document.getElementById('run_dataset_id');
+var judgeOnlyEl = document.getElementById('run_judge_only');
+var outputColumnEl = document.getElementById('run_output_column');
 if (judgeEl) judgeEl.addEventListener('change', updateRunForm);
 if (promptEl) promptEl.addEventListener('change', updateRunForm);
 if (datasetEl) datasetEl.addEventListener('change', updateRunForm);
+if (judgeOnlyEl) judgeOnlyEl.addEventListener('change', updateRunForm);
+if (outputColumnEl) outputColumnEl.addEventListener('input', updateRunForm);
 document.querySelectorAll('input[name="run[metric_ids][]"]').forEach(function(cb) {
 cb.addEventListener('change', updateRunForm);
 });
data/app/views/completion_kit/runs/_row.html.erb
CHANGED
@@ -6,8 +6,12 @@
 <strong><%= run.name %></strong>
 </span>
 <div class="ck-runs-table__config">
-
-
+<% if run.prompt %>
+<%= link_to run.prompt.name, prompt_path(run.prompt), class: "ck-runs-table__config-link", onclick: "event.stopPropagation();" %>
+<span class="ck-runs-table__version">v<%= run.prompt.version_number %></span>
+<% else %>
+<span class="ck-runs-table__version">Judge-only</span>
+<% end %>
 <% if run.dataset %>
 <span class="ck-runs-table__sep">·</span>
 <%= link_to run.dataset.name, dataset_path(run.dataset), class: "ck-runs-table__config-link", onclick: "event.stopPropagation();" %>
data/app/views/completion_kit/runs/_status_header.html.erb
CHANGED
@@ -19,7 +19,11 @@
 <span class="ck-status-badge__label"><%= run.status.upcase %></span>
 </span>
 <h1 class="ck-title"><%= run.name %></h1>
-
+<% if run.prompt %>
+<p class="ck-meta-copy"><%= link_to run.prompt.display_name, prompt_path(run.prompt), class: "ck-link" %> <span class="ck-chip" style="text-transform: none;"><%= run.prompt.llm_model %></span></p>
+<% else %>
+<p class="ck-meta-copy">Judge-only run — grading column <code><%= run.output_column.presence || "actual_output" %></code><% if run.dataset %> on <%= link_to run.dataset.name, dataset_path(run.dataset), class: "ck-link" %><% end %></p>
+<% end %>
 </div>
 <%= render "completion_kit/runs/actions", run: run %>
 </section>
data/app/views/completion_kit/runs/edit.html.erb
CHANGED
@@ -1,6 +1,10 @@
 <ol class="ck-breadcrumb">
-
-
+<% if @run.prompt %>
+<li><%= link_to "Prompts", prompts_path %></li>
+<li><%= link_to @run.prompt.name, prompt_path(@run.prompt) %></li>
+<% else %>
+<li><%= link_to "Runs", runs_path %></li>
+<% end %>
 <li><%= link_to @run.name, run_path(@run) %></li>
 <li>Edit</li>
 </ol>
data/app/views/completion_kit/runs/show.html.erb
CHANGED
@@ -59,24 +59,33 @@
 </div>
 </div>
 
-
-<div class="ck-prompt-
-<
-
-
-
-
-
-<%=
-
+<% if @run.prompt %>
+<div class="ck-prompt-preview">
+<div class="ck-prompt-preview__header">
+<p class="ck-kicker">Prompt</p>
+<% latest_suggestion = @run.suggestions.order(created_at: :desc).first %>
+<% if latest_suggestion %>
+<%= link_to "View suggestion", suggestion_path(latest_suggestion, from: "run"), class: ck_button_classes(:light, variant: :outline) + " ck-button--sm" %>
+<% elsif @run.status == "completed" && @run.responses.joins(:reviews).exists? %>
+<%= button_to suggest_run_path(@run), method: :post, class: ck_button_classes(:light, variant: :outline) + " ck-button--sm", form_class: "inline-block" do %>
+<%= heroicon_tag "sparkles", variant: :outline, class: "ck-magic-icon", "aria-hidden": "true" %>
+Suggest improvements
+<% end %>
 <% end %>
+</div>
+<p class="ck-prompt-preview__text" id="prompt_text"><%= @run.prompt.template %></p>
+<% if @run.prompt.template.length > 200 %>
+<button type="button" class="ck-disclosure-toggle" id="prompt_toggle" aria-expanded="false" aria-controls="prompt_text" onclick="var t=document.getElementById('prompt_text');var l=this;var expanded=t.classList.toggle('ck-prompt-preview__text--expanded');l.firstChild.textContent=expanded?'Show less':'Show more';l.setAttribute('aria-expanded',expanded?'true':'false')"><span>Show more</span></button>
 <% end %>
 </div>
-
-
-<
-
-</div>
+<% else %>
+<div class="ck-prompt-preview">
+<div class="ck-prompt-preview__header">
+<p class="ck-kicker">Output source</p>
+</div>
+<p class="ck-prompt-preview__text">Dataset column <code><%= @run.output_column.presence || "actual_output" %></code> — no prompt generated these outputs.</p>
+</div>
+<% end %>
 
 <% if @run.dataset %>
 <dialog id="dataset-preview-<%= @run.id %>" class="ck-modal" onclick="if(event.target===this)this.close()">
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: completion-kit
 version: !ruby/object:Gem::Version
-version: 0.5.
+version: 0.5.11
 platform: ruby
 authors:
 - Damien Bastin
@@ -381,6 +381,7 @@ files:
 - db/migrate/20260509000001_create_completion_kit_tags.rb
 - db/migrate/20260509000002_create_completion_kit_taggings.rb
 - db/migrate/20260513000001_create_completion_kit_mcp_sessions.rb
+- db/migrate/20260514000001_allow_judge_only_runs.rb
 - lib/completion-kit.rb
 - lib/completion_kit.rb
 - lib/completion_kit/concurrency_check.rb