completion-kit 0.1.0.rc1 → 0.1.0

Files changed (41)
  1. checksums.yaml +4 -4
  2. data/README.md +83 -88
  3. data/app/controllers/completion_kit/api/v1/metric_groups_controller.rb +62 -0
  4. data/app/controllers/completion_kit/api/v1/metrics_controller.rb +1 -1
  5. data/app/controllers/completion_kit/metric_groups_controller.rb +67 -0
  6. data/app/controllers/completion_kit/metrics_controller.rb +2 -2
  7. data/app/controllers/completion_kit/runs_controller.rb +1 -1
  8. data/app/models/completion_kit/metric.rb +3 -5
  9. data/app/models/completion_kit/metric_group.rb +22 -0
  10. data/app/models/completion_kit/metric_group_membership.rb +20 -0
  11. data/app/models/completion_kit/run.rb +0 -1
  12. data/app/services/completion_kit/judge_service.rb +3 -7
  13. data/app/services/completion_kit/mcp_dispatcher.rb +2 -2
  14. data/app/services/completion_kit/mcp_tools/{criteria.rb → metric_groups.rb} +28 -28
  15. data/app/services/completion_kit/mcp_tools/metrics.rb +2 -4
  16. data/app/views/completion_kit/api_reference/index.html.erb +11 -11
  17. data/app/views/completion_kit/metric_groups/_form.html.erb +46 -0
  18. data/app/views/completion_kit/metric_groups/edit.html.erb +13 -0
  19. data/app/views/completion_kit/metric_groups/index.html.erb +41 -0
  20. data/app/views/completion_kit/metric_groups/new.html.erb +12 -0
  21. data/app/views/completion_kit/{criteria → metric_groups}/show.html.erb +8 -9
  22. data/app/views/completion_kit/metrics/_form.html.erb +2 -23
  23. data/app/views/completion_kit/metrics/index.html.erb +13 -5
  24. data/app/views/completion_kit/metrics/show.html.erb +1 -12
  25. data/app/views/completion_kit/runs/_form.html.erb +5 -5
  26. data/app/views/layouts/completion_kit/application.html.erb +4 -1
  27. data/config/routes.rb +2 -2
  28. data/db/migrate/20260416000001_remove_evaluation_steps_from_metrics.rb +5 -0
  29. data/db/migrate/20260417000001_rename_criteria_to_metric_groups.rb +13 -0
  30. data/lib/completion_kit/engine.rb +1 -7
  31. data/lib/completion_kit/version.rb +1 -1
  32. metadata +21 -21
  33. data/app/assets/javascripts/completion_kit/evaluation_steps_controller.js +0 -25
  34. data/app/controllers/completion_kit/api/v1/criteria_controller.rb +0 -62
  35. data/app/controllers/completion_kit/criteria_controller.rb +0 -67
  36. data/app/models/completion_kit/criteria.rb +0 -22
  37. data/app/models/completion_kit/criteria_membership.rb +0 -20
  38. data/app/views/completion_kit/criteria/_form.html.erb +0 -46
  39. data/app/views/completion_kit/criteria/edit.html.erb +0 -14
  40. data/app/views/completion_kit/criteria/index.html.erb +0 -37
  41. data/app/views/completion_kit/criteria/new.html.erb +0 -13
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: cdecf15deb685a524186a2bd4ba48268e10475da0d2cc2914969318893268f70
-   data.tar.gz: ce395abf147434f9a825f79902e47b171b074038d20c06bf7b86b80ba70eaa00
+   metadata.gz: f8b122b978bb3d74051e14d734203da11cd951ef6ea0cb60b0459215812000e9
+   data.tar.gz: 458b93ab81bf13dcaf1fd7431e6934899855b4e6513ef16e82b4cd085dcb3e67
  SHA512:
-   metadata.gz: 9d13fd1d1863c87ca0f7ed78eb1853ac921ff39862b84c678e5b5c9977f1832edcfb50c52b2371f2dc88ad6360be22a78d39ef53385239939a1ac9297df444ff
-   data.tar.gz: bf95df5b178ccfe455350d216a894836a85cfeff5c117d0206b87211ef638c1d690790350168b5a9ad84e8e3895add8c89acdea98738a1c0d2025115679d939f
+   metadata.gz: d3c49afda3bb03eda67c89df4b5a8144da2259e8002ff35e906fa7516e0d115c0c7e3549a6b2de78f5478b72fefdcf758a0f6e042f83d3befc430c0bb26e7fa0
+   data.tar.gz: ed5fb37dcacd8c3bc1947d75b7ff83e4abde0fbcc418944ac7f3f4dc837679eb6d5cce5c144546ddd59aa4bc6548bdb83ac03c4b13428b916ef6e40b0f510e10
data/README.md CHANGED
@@ -1,29 +1,44 @@
  <p align="center">
- <img src="https://raw.githubusercontent.com/homemade-software-inc/completion-kit/main/docs/logo.png" alt="CompletionKit logo" width="120" />
+ <img src="https://raw.githubusercontent.com/homemade-software-inc/completion-kit/main/docs/logo.png" alt="CompletionKit" width="360" />
  </p>
 
- # CompletionKit
-
- [![CI](https://github.com/homemade-software-inc/completion-kit/actions/workflows/ci.yml/badge.svg)](https://github.com/homemade-software-inc/completion-kit/actions/workflows/ci.yml)
- ![coverage](https://img.shields.io/badge/coverage-100%25-brightgreen)
- ![dependencies](https://img.shields.io/badge/dependencies-7-blue)
- [![Dependabot](https://img.shields.io/badge/dependabot-enabled-blue?logo=dependabot)](https://github.com/homemade-software-inc/completion-kit/network/updates)
-
- You need to know whether your prompts produce the output you expect, consistently, across real data. CompletionKit gives you that, inside your Rails app.
+ <p align="center">
+ <a href="https://github.com/homemade-software-inc/completion-kit/actions/workflows/ci.yml"><img src="https://github.com/homemade-software-inc/completion-kit/actions/workflows/ci.yml/badge.svg" alt="CI" /></a>
+ <img src="https://img.shields.io/badge/coverage-100%25-brightgreen" alt="coverage" />
+ </p>
 
- Mount the engine, bring your prompts and datasets, and every input runs through a model you pick. Each output is scored against your own metrics and rubrics by an LLM-as-judge. When you change a prompt, re-run the same dataset and see exactly what got better and what broke — and when the scores tell you something's off, CompletionKit can suggest an improved version of the prompt based on the reviews, which you inspect as a diff and apply as a new version.
+ Your prompts need tests too.
 
- Drive it from the web UI, from the REST API, or from Claude Code and other MCP-aware agents via the built-in Model Context Protocol server. All three share the same state your prompts, runs, datasets, and scores are one source of truth.
+ Run every prompt against real data. Score each output with an LLM judge against criteria you define. Change anything: the prompt, the model, the temperature, the dataset. Re-run and see exactly what got better and what broke. When the scores tell you something's off, CompletionKit suggests an improved prompt based on the judge's actual feedback on your runs. You inspect the diff, apply it as a new version, and verify the improvement.
 
  It's the difference between "this prompt seems to work" and "this prompt scores 4.3 out of 5 across 200 inputs, up from 3.8 last version."
 
+ **[completionkit.com](https://completionkit.com)** | **[RubyGems](https://rubygems.org/gems/completion-kit)**
+
  ![Prompts index](https://raw.githubusercontent.com/homemade-software-inc/completion-kit/main/docs/screenshots/prompts.png)
 
  ![Prompt detail with metrics and rubrics](https://raw.githubusercontent.com/homemade-software-inc/completion-kit/main/docs/screenshots/prompt-detail.png)
 
  ![Test run with scored results](https://raw.githubusercontent.com/homemade-software-inc/completion-kit/main/docs/screenshots/test-run.png)
 
- ## Setup
+ ## Quick Start
+
+ ### Run the standalone app
+
+ The fastest way to start. No existing Rails app needed.
+
+ ```bash
+ git clone https://github.com/homemade-software-inc/completion-kit.git
+ cd completion-kit/standalone
+ bundle install
+ bin/rails completion_kit:install:migrations
+ bin/rails db:migrate
+ bin/rails server
+ ```
+
+ Visit `http://localhost:3000`. Add a provider credential (Settings), create a prompt, upload a CSV dataset, and run it.
+
+ ### Or mount as an engine in your existing Rails app
 
  ```ruby
  gem "completion-kit"
@@ -34,26 +49,40 @@ bin/rails generate completion_kit:install
  bin/rails db:migrate
  ```
 
- Set your provider keys via environment variables or the generated initializer:
+ The engine mounts at `/completion_kit` in your app.
 
- ```bash
- OPENAI_API_KEY=...
- ANTHROPIC_API_KEY=...
- LLAMA_API_KEY=...
- LLAMA_API_ENDPOINT=...
- ```
+ ## Providers
+
+ CompletionKit discovers available models from each provider's API automatically.
 
- Available models are discovered dynamically from each provider's API.
+ | Provider | Env vars | What it covers |
+ |----------|----------|----------------|
+ | **OpenAI** | `OPENAI_API_KEY` | GPT-5, GPT-4.1, GPT-4o, etc. |
+ | **Anthropic** | `ANTHROPIC_API_KEY` | Claude Opus, Sonnet, Haiku |
+ | **Ollama / local endpoint** | `OLLAMA_API_ENDPOINT` (default: `http://localhost:11434/v1`) | Any model you've `ollama pull`-ed, or any OpenAI-compatible local server (vLLM, LM Studio, llama.cpp) |
+ | **OpenRouter** | `OPENROUTER_API_KEY` | 100+ models from 30+ providers through one API key |
 
- ### Encryption keys
+ Set these as environment variables or configure them in the generated initializer. You can also add provider credentials through the web UI under Settings.
 
- Provider API keys are stored using [Rails Active Record encryption](https://guides.rubyonrails.org/active_record_encryption.html), so the host app must have encryption keys configured. If you haven't set them up already:
+ ### Encryption
+
+ Provider API keys are encrypted at rest using [Active Record encryption](https://guides.rubyonrails.org/active_record_encryption.html). You need three encryption keys configured before the app will boot in production.
+
+ Generate them:
 
  ```bash
  bin/rails db:encryption:init
  ```
 
- Copy the generated keys into `config/credentials.yml.enc` under `active_record_encryption`, or set the equivalent environment variables. CompletionKit won't boot without valid keys in production.
+ Then set them as environment variables:
+
+ ```bash
+ COMPLETION_KIT_ENCRYPTION_PRIMARY_KEY=<generated value>
+ COMPLETION_KIT_ENCRYPTION_DETERMINISTIC_KEY=<generated value>
+ COMPLETION_KIT_ENCRYPTION_KEY_DERIVATION_SALT=<generated value>
+ ```
+
+ Or add them to `config/credentials.yml.enc` under `active_record_encryption`. In development, the standalone app uses built-in fallback values so you can skip this step locally.
 
  ## Authentication
 
@@ -62,7 +91,6 @@ CompletionKit requires authentication in production. In development, routes are
  ### Basic Auth (recommended for simple setups)
 
  ```ruby
- # config/initializers/completion_kit.rb
  CompletionKit.configure do |c|
    c.username = "admin"
    c.password = ENV["COMPLETION_KIT_PASSWORD"]
@@ -72,58 +100,49 @@ end
  ### Custom Auth (Devise, etc.)
 
  ```ruby
- # config/initializers/completion_kit.rb
  CompletionKit.configure do |c|
    c.auth_strategy = ->(controller) { controller.authenticate_user! }
  end
  ```
 
- Only one mode can be active — setting both raises a `ConfigurationError`.
+ Only one mode can be active.
 
- ## Usage
+ ## How it works
 
- 1. Create a prompt with `{{variable}}` placeholders
- 2. Create a test run and paste CSV data (headers match variable names)
- 3. Generate outputs, run AI review, inspect scored results
+ 1. **Create a prompt** with `{{variable}}` placeholders
+ 2. **Upload a dataset.** A CSV where column headers match the variable names.
+ 3. **Run it** against a model and score outputs with an LLM judge against criteria you define.
+ 4. **Iterate.** Change the prompt, the model, the temperature, or the dataset and re-run. CompletionKit versions your prompts so you can always compare against previous results.
+ 5. **Get suggestions.** When scores drop, ask CompletionKit for an AI-generated improvement. The suggestion is based on the judge's actual per-response feedback, not generic prompt-engineering advice. Inspect the diff and apply it as a new version.
 
- ## Programmatic access
+ ## Concepts
 
- CompletionKit exposes every resource through both a REST JSON API and an MCP server. Both share the same bearer-token auth, so configure once and use either interface:
+ - **Prompt.** A versioned template with `{{variable}}` placeholders. Publishing freezes the template; editing a published prompt creates a new version.
+ - **Dataset.** A CSV of real inputs. Each row becomes one test case.
+ - **Run.** One execution of a prompt against a dataset. Captures every input (model, temperature, metrics) and stores all outputs and scores.
+ - **Response.** The model's output for one dataset row, with reviews attached.
+ - **Metric.** An evaluation dimension with a name, instruction, evaluation steps, and a 1-5 star scoring scale. The LLM judge uses this to score each response.
+ - **Metric Group.** A reusable group of metrics you can apply to a run as a set.
+ - **Provider Credential.** An API key for a model provider. Encrypted at rest, never returned through the API.
+
+ ## REST API
+
+ Every resource is accessible via a bearer-token JSON API:
 
  ```ruby
- # config/initializers/completion_kit.rb
  CompletionKit.configure { |c| c.api_token = ENV["COMPLETION_KIT_API_TOKEN"] }
  ```
 
- ### Concepts
-
- These are the objects you'll work with, whether through the UI, the REST API, or the MCP server:
-
- - **Prompt** — A named, versioned template with `{{variable}}` placeholders. Publishing a prompt freezes its template so runs always reference a known version; editing a published prompt creates a new version.
- - **Dataset** — A CSV of real inputs. Column headers match the prompt's `{{variable}}` names, and each row becomes one test case.
- - **Run** — A single execution of a prompt against a dataset. Tracks progress, stores outputs, and records which metrics were used for scoring.
- - **Response** — The model's output for one row of the dataset, with any reviews attached.
- - **Metric** — One evaluation dimension: a name, an instruction, evaluation steps, and 1–5-star rubric bands. The judge uses a metric to score a response.
- - **Criteria** — A named, reusable bundle of metrics you can apply to a run in one step.
- - **Provider Credential** — An API key for a model provider (OpenAI, Anthropic, Ollama, OpenRouter). Encrypted at rest using Rails' Active Record encryption, and never returned through the API.
-
- ### REST API
-
  ```bash
  curl -H "Authorization: Bearer $TOKEN" \
    http://localhost:3000/completion_kit/api/v1/prompts
-
- curl -X POST http://localhost:3000/completion_kit/api/v1/prompts \
-   -H "Authorization: Bearer $TOKEN" \
-   -H "Content-Type: application/json" \
-   -d '{"name":"summarizer","template":"Summarize: {{text}}","llm_model":"gpt-4.1"}'
  ```
 
- Mount the engine, then visit **`/completion_kit/api_reference`** in your running app for per-endpoint documentation with copy-to-clipboard curl examples pre-filled with your token.
+ Visit `/completion_kit/api_reference` in your running app for per-endpoint docs with copy-to-clipboard curl examples.
 
- ### MCP server
+ ## MCP server
 
- CompletionKit also runs a [Model Context Protocol](https://modelcontextprotocol.io) server at the `/mcp` path within the engine mount, exposing the same resources as 36 tools (one per CRUD action plus process actions like `runs_generate` and `prompts_publish`). Point Claude Code, Cursor, or any other MCP client at it:
+ CompletionKit runs a [Model Context Protocol](https://modelcontextprotocol.io) server at `/completion_kit/mcp`, exposing every resource as tools that MCP-aware clients (Claude Code, Cursor, etc.) can drive directly:
 
  ```json
  {
@@ -136,40 +155,21 @@ CompletionKit also runs a [Model Context Protocol](https://modelcontextprotocol.
  }
  ```
 
- The in-app API reference page also ships install snippets you can copy straight into your MCP client config.
-
- ## Standalone App
-
- CompletionKit ships with a standalone Rails app you can deploy as a hosted service.
-
- ### Quick Start
-
- ```bash
- cd standalone
- bundle install
- bin/rails completion_kit:install:migrations
- bin/rails db:migrate
- bin/rails server
- ```
-
- Visit `http://localhost:3000` for the home page, or `http://localhost:3000/completion_kit` for the engine UI.
+ The in-app API reference page has install snippets you can copy straight into your MCP client config.
 
- ### Configuration
+ ## Deploying the standalone app
 
- Set environment variables:
+ Any Rails-friendly host works (Fly, Heroku, Render, Docker, etc.). Point it at a Postgres instance via `DATABASE_URL`, set your provider env vars, and run `cd standalone && bin/rails db:migrate` on each deploy.
 
  | Variable | Purpose | Default |
  |----------|---------|---------|
- | `COMPLETION_KIT_API_TOKEN` | Bearer token for REST API and MCP access | (none API disabled) |
+ | `COMPLETION_KIT_API_TOKEN` | Bearer token for REST API and MCP | (none, API disabled) |
  | `COMPLETION_KIT_USERNAME` | Web UI login username | `admin` |
- | `COMPLETION_KIT_PASSWORD` | Web UI login password | (none open in dev) |
- | `DATABASE_URL` | PostgreSQL connection string (production) | SQLite in dev |
-
- ### Deploying
+ | `COMPLETION_KIT_PASSWORD` | Web UI login password | (none, open in dev) |
 
- Any Rails-friendly host works Fly, Heroku, Render, self-managed Docker, etc. Point your host at a Postgres instance via `DATABASE_URL`, set the environment variables above, and run `cd standalone && bin/rails db:migrate` on each deploy.
+ You also need the three `COMPLETION_KIT_ENCRYPTION_*` keys from the [Encryption](#encryption) section above.
 
- When the gem ships a new engine migration, install it into your standalone app locally and commit the generated file before pushing:
+ When the gem ships a new migration, install it locally and commit before pushing:
 
  ```bash
  cd standalone
@@ -178,14 +178,9 @@ bin/rails db:migrate
  git add db/migrate/ && git commit -m "install new engine migration"
  ```
 
- That way your host's `db:migrate` picks up the new file on the next deploy. Don't run `completion_kit:install:migrations` on the host itself — migration files are source artifacts, they belong in git.
-
- ## Development
+ ## Contributing
 
- ```bash
- bundle install
- bundle exec rspec
- ```
+ See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, testing, and pull request guidelines.
 
  ## License
 
data/app/controllers/completion_kit/api/v1/metric_groups_controller.rb ADDED
@@ -0,0 +1,62 @@
+ module CompletionKit
+   module Api
+     module V1
+       class MetricGroupsController < BaseController
+         before_action :set_metric_group, only: [:show, :update, :destroy]
+
+         def index
+           render json: MetricGroup.order(created_at: :desc)
+         end
+
+         def show
+           render json: @metric_group
+         end
+
+         def create
+           metric_group = MetricGroup.new(metric_group_params.except(:metric_ids))
+           if metric_group.save
+             replace_metric_memberships(metric_group, params[:metric_ids]) if params.key?(:metric_ids)
+             render json: metric_group.reload, status: :created
+           else
+             render json: {errors: metric_group.errors}, status: :unprocessable_entity
+           end
+         end
+
+         def update
+           if @metric_group.update(metric_group_params.except(:metric_ids))
+             replace_metric_memberships(@metric_group, params[:metric_ids]) if params.key?(:metric_ids)
+             render json: @metric_group.reload
+           else
+             render json: {errors: @metric_group.errors}, status: :unprocessable_entity
+           end
+         end
+
+         def destroy
+           @metric_group.destroy!
+           head :no_content
+         end
+
+         private
+
+         def set_metric_group
+           @metric_group = MetricGroup.find(params[:id])
+         rescue ActiveRecord::RecordNotFound
+           not_found
+         end
+
+         def metric_group_params
+           params.permit(:name, :description, metric_ids: [])
+         end
+
+         def replace_metric_memberships(metric_group, metric_ids)
+           return unless metric_ids
+
+           metric_group.metric_group_memberships.delete_all
+           Array(metric_ids).reject(&:blank?).each_with_index do |metric_id, index|
+             metric_group.metric_group_memberships.create!(metric_id: metric_id, position: index + 1)
+           end
+         end
+       end
+     end
+   end
+ end
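The membership-replacement step in this controller can be sketched outside Rails. This is a minimal illustration, not the gem's code: `Membership` and `rebuild_memberships` are hypothetical names, and the inline blank check stands in for ActiveSupport's `blank?`. It shows that blank ids are dropped before indexing, so the stored positions stay contiguous and 1-based.

```ruby
# Hypothetical plain-Ruby stand-in for a membership row (the gem uses ActiveRecord).
Membership = Struct.new(:metric_id, :position)

# Rebuilds the full membership list from submitted ids, mirroring the
# delete-all-then-recreate approach: blanks are rejected first, then each
# remaining id gets a 1-based position in submission order.
def rebuild_memberships(metric_ids)
  Array(metric_ids)
    .reject { |id| id.nil? || id.to_s.strip.empty? } # stand-in for blank?
    .each_with_index
    .map { |id, index| Membership.new(id, index + 1) }
end

memberships = rebuild_memberships(["7", "", "3", nil, "9"])
memberships.each { |m| puts "metric #{m.metric_id} -> position #{m.position}" }
```

Because the list is rebuilt wholesale, reordering metrics only requires submitting the ids in the new order.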
data/app/controllers/completion_kit/api/v1/metrics_controller.rb CHANGED
@@ -43,7 +43,7 @@ module CompletionKit
          end
 
          def metric_params
-           params.permit(:name, :instruction, evaluation_steps: [], rubric_bands: [:stars, :description])
+           params.permit(:name, :instruction, rubric_bands: [:stars, :description])
          end
        end
      end
data/app/controllers/completion_kit/metric_groups_controller.rb ADDED
@@ -0,0 +1,67 @@
+ module CompletionKit
+   class MetricGroupsController < ApplicationController
+     before_action :set_metric_group, only: [:show, :edit, :update, :destroy]
+
+     def index
+       @metric_groups = MetricGroup.includes(:metrics).order(:name)
+     end
+
+     def show
+     end
+
+     def new
+       @metric_group = MetricGroup.new
+       @metrics = Metric.order(:name)
+     end
+
+     def edit
+       @metrics = Metric.order(:name)
+     end
+
+     def create
+       @metric_group = MetricGroup.new(metric_group_params.except(:metric_ids))
+       @metrics = Metric.order(:name)
+
+       if @metric_group.save
+         replace_metric_memberships
+         redirect_to metric_group_path(@metric_group), notice: "Metric group was successfully created."
+       else
+         render :new, status: :unprocessable_entity
+       end
+     end
+
+     def update
+       @metrics = Metric.order(:name)
+
+       if @metric_group.update(metric_group_params.except(:metric_ids))
+         replace_metric_memberships
+         redirect_to metric_group_path(@metric_group), notice: "Metric group was successfully updated."
+       else
+         render :edit, status: :unprocessable_entity
+       end
+     end
+
+     def destroy
+       @metric_group.destroy
+       redirect_to metric_groups_path, notice: "Metric group was successfully destroyed."
+     end
+
+     private
+
+     def set_metric_group
+       @metric_group = MetricGroup.find(params[:id])
+     end
+
+     def metric_group_params
+       params.require(:metric_group).permit(:name, :description, metric_ids: [])
+     end
+
+     def replace_metric_memberships
+       metric_ids = Array(metric_group_params[:metric_ids]).reject(&:blank?)
+       @metric_group.metric_group_memberships.delete_all
+       metric_ids.each_with_index do |metric_id, index|
+         @metric_group.metric_group_memberships.create!(metric_id: metric_id, position: index + 1)
+       end
+     end
+   end
+ end
data/app/controllers/completion_kit/metrics_controller.rb CHANGED
@@ -3,7 +3,7 @@ module CompletionKit
      before_action :set_metric, only: [:show, :edit, :update, :destroy]
 
      def index
-       @metrics = Metric.order(:name)
+       @metrics = Metric.includes(:metric_groups).order(:name)
      end
 
      def show
@@ -46,7 +46,7 @@ module CompletionKit
      end
 
      def metric_params
-       params.require(:metric).permit(:name, :instruction, evaluation_steps: [], rubric_bands: [:stars, :description])
+       params.require(:metric).permit(:name, :instruction, rubric_bands: [:stars, :description])
      end
    end
  end
data/app/controllers/completion_kit/runs_controller.rb CHANGED
@@ -112,7 +112,7 @@ module CompletionKit
      def load_form_collections
        @prompts = Prompt.order(:name)
        @datasets = Dataset.order(:name)
-       @criterias = Criteria.includes(:metrics).order(:name)
+       @metric_groups = MetricGroup.includes(:metrics).order(:name)
        @all_metrics = Metric.order(:name)
      end
 
data/app/models/completion_kit/metric.rb CHANGED
@@ -8,12 +8,11 @@ module CompletionKit
        { "stars" => 1, "description" => "Fails to meet the criteria. Major errors or completely off-target." }
      ].freeze
 
-     has_many :criteria_memberships, dependent: :destroy
-     has_many :criterias, through: :criteria_memberships, source: :criteria
+     has_many :metric_group_memberships, dependent: :destroy
+     has_many :metric_groups, through: :metric_group_memberships, source: :metric_group
      has_many :reviews, dependent: :nullify
 
      serialize :rubric_bands, coder: JSON
-     serialize :evaluation_steps, coder: JSON
 
      validates :name, presence: true
      validates :key, uniqueness: true, allow_nil: true
@@ -74,7 +73,7 @@ module CompletionKit
      def as_json(options = {})
        {
          id: id, name: name, key: key, instruction: instruction,
-         evaluation_steps: evaluation_steps, rubric_bands: rubric_bands,
+         rubric_bands: rubric_bands,
          created_at: created_at, updated_at: updated_at
        }
      end
@@ -86,7 +85,6 @@ module CompletionKit
      end
 
      def set_defaults
-       self.evaluation_steps ||= []
        self.rubric_bands = self.class.default_rubric_bands if rubric_bands.blank?
      end
 
data/app/models/completion_kit/metric_group.rb ADDED
@@ -0,0 +1,22 @@
+ module CompletionKit
+   class MetricGroup < ApplicationRecord
+     self.table_name = "completion_kit_metric_groups"
+
+     has_many :metric_group_memberships, -> { order(:position, :id) }, dependent: :destroy
+     has_many :metrics, through: :metric_group_memberships
+
+     validates :name, presence: true
+
+     def ordered_metrics
+       metric_group_memberships.includes(:metric).map(&:metric).compact
+     end
+
+     def as_json(options = {})
+       {
+         id: id, name: name, description: description,
+         created_at: created_at, updated_at: updated_at,
+         metric_ids: metric_ids
+       }
+     end
+   end
+ end
data/app/models/completion_kit/metric_group_membership.rb ADDED
@@ -0,0 +1,20 @@
+ module CompletionKit
+   class MetricGroupMembership < ApplicationRecord
+     self.table_name = "completion_kit_metric_group_memberships"
+
+     belongs_to :metric_group, class_name: "CompletionKit::MetricGroup", foreign_key: "metric_group_id"
+     belongs_to :metric
+
+     validates :metric_id, uniqueness: { scope: :metric_group_id }
+
+     before_validation :set_default_position
+
+     private
+
+     def set_default_position
+       return if position.present? || metric_group.blank?
+
+       self.position = metric_group.metric_group_memberships.maximum(:position).to_i + 1
+     end
+   end
+ end
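The default-position rule in `set_default_position` relies on `nil.to_i` being `0`, so the first membership in an empty group lands at position 1. A minimal plain-Ruby sketch of that arithmetic follows; `next_position` is a hypothetical helper, and an array of integers stands in for the `maximum(:position)` query.

```ruby
# Hypothetical helper: computes the next position from the positions already
# taken. Array#max returns nil for an empty array, and nil.to_i is 0, which
# mirrors how an aggregate over zero rows is handled in the model.
def next_position(existing_positions)
  existing_positions.max.to_i + 1
end

puts next_position([])        # first membership in an empty group
puts next_position([1, 2, 5]) # appends after the highest position, gaps included
```

Gaps in existing positions are not filled; the new row is always appended after the current maximum.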
data/app/models/completion_kit/run.rb CHANGED
@@ -114,7 +114,6 @@ module CompletionKit
            response.expected_output,
            prompt.template,
            criteria: metric.respond_to?(:instruction) ? metric.instruction.to_s : "",
-           evaluation_steps: metric.respond_to?(:evaluation_steps) ? metric.evaluation_steps : nil,
            rubric_text: metric.respond_to?(:display_rubric_text) ? metric.display_rubric_text : nil,
            input_data: response.input_data
          )
data/app/services/completion_kit/judge_service.rb CHANGED
@@ -8,11 +8,11 @@ module CompletionKit
        @judge_client = LlmClient.for_model(@judge_model, ApiConfig.for_model(@judge_model))
      end
 
-     def evaluate(output, expected_output = nil, prompt = nil, criteria: nil, evaluation_steps: nil, rubric_text: nil, human_examples: nil, input_data: nil, **_extras)
+     def evaluate(output, expected_output = nil, prompt = nil, criteria: nil, rubric_text: nil, human_examples: nil, input_data: nil, **_extras)
        return { score: 1, feedback: "Judge not configured" } unless @judge_client.configured?
 
        judge_prompt = build_judge_prompt(output, expected_output, prompt,
-         criteria: criteria, evaluation_steps: evaluation_steps,
+         criteria: criteria,
          rubric_text: rubric_text, human_examples: human_examples,
          input_data: input_data)
 
@@ -27,7 +27,7 @@ module CompletionKit
 
      private
 
-     def build_judge_prompt(output, expected_output, prompt, criteria: nil, evaluation_steps: nil, rubric_text: nil, human_examples: nil, input_data: nil)
+     def build_judge_prompt(output, expected_output, prompt, criteria: nil, rubric_text: nil, human_examples: nil, input_data: nil)
        judge_prompt = <<~PROMPT
          You are an expert evaluator. You MUST respond with ONLY two lines in this exact format, nothing else:
 
@@ -44,10 +44,6 @@ module CompletionKit
          judge_prompt += "\nCriteria: #{criteria}\n"
        end
 
-       if evaluation_steps.present? && evaluation_steps.any?
-         judge_prompt += "\nEvaluation steps:\n#{evaluation_steps.each_with_index.map { |step, i| "#{i + 1}. #{step}" }.join("\n")}\n"
-       end
-
        if human_examples.present?
          judge_prompt += "\nCalibration examples:\n"
          human_examples.each_with_index do |example, index|
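Because `evaluate` keeps its `**_extras` keyword splat, a caller that still passes `evaluation_steps:` should be accepted and the value silently ignored rather than raising. The sketch below illustrates that Ruby behaviour only; `evaluate_sketch` is a hypothetical method with a reduced signature, not the service's real one.

```ruby
# Hypothetical reduced signature: named keywords are bound normally, and any
# unknown keyword is collected into _extras instead of raising ArgumentError.
def evaluate_sketch(output, criteria: nil, rubric_text: nil, **_extras)
  { criteria: criteria, rubric_text: rubric_text }
end

# An older call site that still sends evaluation_steps: keeps working.
result = evaluate_sketch("some output", criteria: "accuracy", evaluation_steps: ["check facts"])
puts result.inspect
```

The trade-off is that a misspelled keyword is swallowed just as quietly as a retired one.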
data/app/services/completion_kit/mcp_dispatcher.rb CHANGED
@@ -33,7 +33,7 @@ module CompletionKit
          McpTools::Responses.definitions +
          McpTools::Datasets.definitions +
          McpTools::Metrics.definitions +
-         McpTools::Criteria.definitions +
+         McpTools::MetricGroups.definitions +
          McpTools::ProviderCredentials.definitions
      end
 
@@ -44,7 +44,7 @@ module CompletionKit
        when /\Aresponses_/ then McpTools::Responses.call(name, arguments)
        when /\Adatasets_/ then McpTools::Datasets.call(name, arguments)
        when /\Ametrics_/ then McpTools::Metrics.call(name, arguments)
-       when /\Acriteria_/ then McpTools::Criteria.call(name, arguments)
+       when /\Ametric_groups_/ then McpTools::MetricGroups.call(name, arguments)
        when /\Aprovider_credentials_/ then McpTools::ProviderCredentials.call(name, arguments)
        else raise MethodNotFound, "Unknown tool: #{name}"
        end
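The dispatcher routes by tool-name prefix, and the rename means names starting with `criteria_` now fall through to the error branch. A minimal sketch of that routing, under stated assumptions: `route` is a hypothetical function, the tool names used (`metrics_list`, `metric_groups_list`, `criteria_list`) are illustrative and not confirmed by the diff, and `ArgumentError` stands in for the gem's `MethodNotFound`.

```ruby
# Hypothetical router mirroring the prefix-matching case statement.
# \A anchors each pattern at the start of the string, so "metric_groups_..."
# never matches /\Ametrics_/ (the character after "metric" is "_", not "s").
def route(name)
  case name
  when /\Ametrics_/ then :metrics
  when /\Ametric_groups_/ then :metric_groups
  else raise ArgumentError, "Unknown tool: #{name}" # stand-in for MethodNotFound
  end
end

puts route("metrics_list")
puts route("metric_groups_list")

begin
  route("criteria_list")
rescue ArgumentError => e
  puts e.message
end
```

Clients that cached the old `criteria_*` tool names would presumably need to refresh their tool list after upgrading.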