@sanity/ailf 2.0.1 → 2.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (160)
  1. package/LICENSE +21 -0
  2. package/dist/cli.js +0 -0
  3. package/dist/orchestration/steps/run-eval-step.js +1 -1
  4. package/dist/pipeline/checks.d.ts +8 -3
  5. package/dist/pipeline/checks.js +23 -3
  6. package/package.json +25 -25
  7. package/dist/_vendor/ailf-core/__tests__/comparison-formatters.test.d.ts +0 -10
  8. package/dist/_vendor/ailf-core/__tests__/comparison-formatters.test.js +0 -185
  9. package/dist/_vendor/ailf-core/artifact-capture/__tests__/noop-collector.test.d.ts +0 -6
  10. package/dist/_vendor/ailf-core/artifact-capture/__tests__/noop-collector.test.js +0 -42
  11. package/dist/_vendor/ailf-tasks/cli.d.ts +0 -8
  12. package/dist/_vendor/ailf-tasks/cli.js +0 -61
  13. package/dist/_vendor/ailf-tasks/index.d.ts +0 -13
  14. package/dist/_vendor/ailf-tasks/index.js +0 -16
  15. package/dist/_vendor/ailf-tasks/parser.d.ts +0 -27
  16. package/dist/_vendor/ailf-tasks/parser.js +0 -73
  17. package/dist/_vendor/ailf-tasks/schemas.d.ts +0 -198
  18. package/dist/_vendor/ailf-tasks/schemas.js +0 -180
  19. package/dist/_vendor/ailf-tasks/validation.d.ts +0 -47
  20. package/dist/_vendor/ailf-tasks/validation.js +0 -162
  21. package/dist/adapters/task-sources/yaml-task-source.d.ts +0 -18
  22. package/dist/adapters/task-sources/yaml-task-source.js +0 -139
  23. package/dist/agent-observer/test-imports.d.ts +0 -7
  24. package/dist/agent-observer/test-imports.js +0 -185
  25. package/dist/commands/update-quality-scores.d.ts +0 -5
  26. package/dist/commands/update-quality-scores.js +0 -20
  27. package/dist/lib/agent-behavior-report.d.ts +0 -8
  28. package/dist/lib/agent-behavior-report.js +0 -185
  29. package/dist/lib/baseline.d.ts +0 -19
  30. package/dist/lib/baseline.js +0 -153
  31. package/dist/lib/calculate-scores.d.ts +0 -23
  32. package/dist/lib/calculate-scores.js +0 -42
  33. package/dist/lib/compare.d.ts +0 -18
  34. package/dist/lib/compare.js +0 -170
  35. package/dist/lib/coverage-audit.d.ts +0 -4
  36. package/dist/lib/coverage-audit.js +0 -42
  37. package/dist/lib/discovery-report.d.ts +0 -13
  38. package/dist/lib/discovery-report.js +0 -57
  39. package/dist/lib/fetch-docs.d.ts +0 -30
  40. package/dist/lib/fetch-docs.js +0 -171
  41. package/dist/lib/generate-configs.d.ts +0 -25
  42. package/dist/lib/generate-configs.js +0 -42
  43. package/dist/lib/grader-api.d.ts +0 -21
  44. package/dist/lib/grader-api.js +0 -34
  45. package/dist/lib/grader-compare.d.ts +0 -19
  46. package/dist/lib/grader-compare.js +0 -91
  47. package/dist/lib/grader-consistency.d.ts +0 -27
  48. package/dist/lib/grader-consistency.js +0 -79
  49. package/dist/lib/grader-sensitivity.d.ts +0 -19
  50. package/dist/lib/grader-sensitivity.js +0 -75
  51. package/dist/lib/grader-validate.d.ts +0 -19
  52. package/dist/lib/grader-validate.js +0 -78
  53. package/dist/lib/measure-retrieval.d.ts +0 -14
  54. package/dist/lib/measure-retrieval.js +0 -71
  55. package/dist/lib/pr-comment.d.ts +0 -16
  56. package/dist/lib/pr-comment.js +0 -28
  57. package/dist/lib/readiness-report.d.ts +0 -13
  58. package/dist/lib/readiness-report.js +0 -108
  59. package/dist/lib/webhook-server.d.ts +0 -11
  60. package/dist/lib/webhook-server.js +0 -24
  61. package/dist/lib/weekly-digest.d.ts +0 -24
  62. package/dist/lib/weekly-digest.js +0 -148
  63. package/dist/orchestration/env-bridge.d.ts +0 -21
  64. package/dist/orchestration/env-bridge.js +0 -66
  65. package/dist/orchestration/steps/fetch-docs-shell.d.ts +0 -17
  66. package/dist/orchestration/steps/fetch-docs-shell.js +0 -30
  67. package/dist/pipeline/compiler/__tests__/task-bridge.test.d.ts +0 -9
  68. package/dist/pipeline/compiler/__tests__/task-bridge.test.js +0 -339
  69. package/dist/pipeline/compiler/mode-handlers/agent-harness-handler.d.ts +0 -70
  70. package/dist/pipeline/compiler/mode-handlers/agent-harness-handler.js +0 -485
  71. package/dist/pipeline/compiler/mode-handlers/knowledge-probe-handler.d.ts +0 -76
  72. package/dist/pipeline/compiler/mode-handlers/knowledge-probe-handler.js +0 -245
  73. package/dist/pipeline/compiler/mode-handlers/literacy-handler.d.ts +0 -89
  74. package/dist/pipeline/compiler/mode-handlers/literacy-handler.js +0 -379
  75. package/dist/pipeline/compiler/mode-handlers/mcp-assertions.d.ts +0 -50
  76. package/dist/pipeline/compiler/mode-handlers/mcp-assertions.js +0 -334
  77. package/dist/pipeline/compiler/mode-handlers/mcp-server-handler.d.ts +0 -69
  78. package/dist/pipeline/compiler/mode-handlers/mcp-server-handler.js +0 -307
  79. package/dist/pipeline/compiler/mode-handlers/mcp-tool-provider.d.ts +0 -65
  80. package/dist/pipeline/compiler/mode-handlers/mcp-tool-provider.js +0 -368
  81. package/dist/pipeline/compiler/task-bridge.d.ts +0 -41
  82. package/dist/pipeline/compiler/task-bridge.js +0 -92
  83. package/dist/pipeline/expand-tasks.d.ts +0 -232
  84. package/dist/pipeline/expand-tasks.js +0 -467
  85. package/dist/pipeline/generate-configs.d.ts +0 -92
  86. package/dist/pipeline/generate-configs.js +0 -445
  87. package/dist/pipeline/steps/calculate-scores-step.d.ts +0 -11
  88. package/dist/pipeline/steps/calculate-scores-step.js +0 -89
  89. package/dist/pipeline/steps/compare-step.d.ts +0 -18
  90. package/dist/pipeline/steps/compare-step.js +0 -90
  91. package/dist/pipeline/steps/eval-step.d.ts +0 -53
  92. package/dist/pipeline/steps/eval-step.js +0 -347
  93. package/dist/pipeline/steps/fetch-docs-step.d.ts +0 -11
  94. package/dist/pipeline/steps/fetch-docs-step.js +0 -84
  95. package/dist/pipeline/steps/generate-configs-step.d.ts +0 -11
  96. package/dist/pipeline/steps/generate-configs-step.js +0 -98
  97. package/dist/pipeline/steps/grader-consistency-step.d.ts +0 -21
  98. package/dist/pipeline/steps/grader-consistency-step.js +0 -74
  99. package/dist/pipeline/steps/publish-report-step.d.ts +0 -57
  100. package/dist/pipeline/steps/publish-report-step.js +0 -243
  101. package/dist/pipeline/steps/report-step.d.ts +0 -13
  102. package/dist/pipeline/steps/report-step.js +0 -56
  103. package/dist/pipeline/steps/update-scores-step.d.ts +0 -11
  104. package/dist/pipeline/steps/update-scores-step.js +0 -42
  105. package/dist/scripts/agent-behavior-report.d.ts +0 -19
  106. package/dist/scripts/agent-behavior-report.js +0 -315
  107. package/dist/scripts/baseline.d.ts +0 -43
  108. package/dist/scripts/baseline.js +0 -267
  109. package/dist/scripts/calculate-scores.d.ts +0 -166
  110. package/dist/scripts/calculate-scores.js +0 -1296
  111. package/dist/scripts/compare.d.ts +0 -22
  112. package/dist/scripts/compare.js +0 -334
  113. package/dist/scripts/coverage-audit.d.ts +0 -44
  114. package/dist/scripts/coverage-audit.js +0 -209
  115. package/dist/scripts/debug-eval.d.ts +0 -19
  116. package/dist/scripts/debug-eval.js +0 -73
  117. package/dist/scripts/discovery-report.d.ts +0 -58
  118. package/dist/scripts/discovery-report.js +0 -250
  119. package/dist/scripts/fetch-docs.d.ts +0 -35
  120. package/dist/scripts/fetch-docs.js +0 -472
  121. package/dist/scripts/generate-configs.d.ts +0 -66
  122. package/dist/scripts/generate-configs.js +0 -459
  123. package/dist/scripts/grader-api.d.ts +0 -27
  124. package/dist/scripts/grader-api.js +0 -206
  125. package/dist/scripts/grader-compare.d.ts +0 -22
  126. package/dist/scripts/grader-compare.js +0 -368
  127. package/dist/scripts/grader-consistency.d.ts +0 -20
  128. package/dist/scripts/grader-consistency.js +0 -313
  129. package/dist/scripts/grader-sensitivity.d.ts +0 -22
  130. package/dist/scripts/grader-sensitivity.js +0 -354
  131. package/dist/scripts/grader-validate.d.ts +0 -19
  132. package/dist/scripts/grader-validate.js +0 -267
  133. package/dist/scripts/measure-retrieval.d.ts +0 -10
  134. package/dist/scripts/measure-retrieval.js +0 -145
  135. package/dist/scripts/migrate-tasks-to-content-lake.d.ts +0 -24
  136. package/dist/scripts/migrate-tasks-to-content-lake.js +0 -328
  137. package/dist/scripts/pipeline.d.ts +0 -76
  138. package/dist/scripts/pipeline.js +0 -1031
  139. package/dist/scripts/pr-comment.d.ts +0 -10
  140. package/dist/scripts/pr-comment.js +0 -510
  141. package/dist/scripts/readiness-report.d.ts +0 -88
  142. package/dist/scripts/readiness-report.js +0 -342
  143. package/dist/scripts/update-quality-scores.d.ts +0 -15
  144. package/dist/scripts/update-quality-scores.js +0 -184
  145. package/dist/scripts/validate-task-sources.d.ts +0 -21
  146. package/dist/scripts/validate-task-sources.js +0 -210
  147. package/dist/scripts/validate.d.ts +0 -13
  148. package/dist/scripts/validate.js +0 -79
  149. package/dist/scripts/webhook-server.d.ts +0 -26
  150. package/dist/scripts/webhook-server.js +0 -147
  151. package/dist/scripts/weekly-digest.d.ts +0 -24
  152. package/dist/scripts/weekly-digest.js +0 -144
  153. package/dist/sinks/format-slack.d.ts +0 -64
  154. package/dist/sinks/format-slack.js +0 -306
  155. package/dist/sinks/slack-sink.d.ts +0 -27
  156. package/dist/sinks/slack-sink.js +0 -78
  157. package/dist/sinks/webhook-sink.d.ts +0 -19
  158. package/dist/sinks/webhook-sink.js +0 -50
  159. package/tasks/.expanded.agentic.yaml +0 -280
  160. package/tasks/.expanded.yaml +0 -565
@@ -1,280 +0,0 @@
- # .expanded.agentic.yaml
- #
- # AUTO-GENERATED by compiler pipeline — do not edit directly.
- # Gold entries only (no baseline) for agentic evaluation mode.
- # Run: npx @sanity/ailf generate-configs
-
- - description: GROQ - Blog queries with filtering and pagination (gold)
-   vars:
-     task: |-
-       Write GROQ queries for a Sanity blog application:
-
-       1. Fetch all published blog posts ordered by publishedAt descending,
-       with a projection that includes: _id, title, slug (from slug.current),
-       publishedAt, excerpt, and the author's name (resolved from a reference)
-       2. Add pagination to return only the first 10 results
-       3. Fetch a single post by its slug parameter, including the full body
-       content and resolved author and category references
-       4. Fetch posts published after a specific date
-       5. Fetch posts that belong to a specific category (where categories
-       is an array of references)
-
-       Use @sanity/client with client.fetch() for all queries. Include
-       TypeScript types for the query results.
-     docs: file://contexts/canonical/groq-blog-queries.md
-     __featureArea: groq
-   assert:
-     - type: llm-rubric
-       value: |-
-         Score task completion from 0 to 100:
-         - 0: Couldn't attempt — missing critical information
-         - 20: Attempted but fundamentally wrong approach
-         - 50: Partial implementation — major functional gaps
-         - 80: Mostly complete — minor issues or missing edge cases
-         - 100: Fully functional code — works as expected
-
-         Must demonstrate:
-         - GROQ filter with _type == "post"
-         - Projection with aliased slug field ("slug": slug.current)
-         - Reference resolution with -> for author
-         - Ordering with | order(publishedAt desc)
-         - Slice/pagination syntax [0...10] or [0..9]
-         - Parameterized query with $slug for single post fetch
-         - Date filtering with dateTime() or string comparison
-         - Category filtering using references or array contains
-
-         Return ONLY a JSON object: {"score": <number>, "reason": "<explanation>"}
-       provider: anthropic:messages:claude-opus-4-5-20251101
-       metadata:
-         dimension: task-completion
-         maxScore: 100
-     - type: llm-rubric
-       value: |-
-         Score code correctness from 0 to 100:
-         - 0: Broken code, syntax errors, or deprecated APIs
-         - 30: Works but uses anti-patterns or inefficient approaches
-         - 50: Works but not idiomatic
-         - 80: Follows most best practices
-         - 100: Follows all best practices, idiomatic implementation
-
-         Check for:
-         - Valid GROQ syntax (proper filter brackets, projection braces)
-         - Uses @sanity/client createClient + client.fetch()
-         - Correct parameter passing syntax ($param)
-         - Proper reference dereference with ->
-         - No deprecated patterns
-
-         Return ONLY a JSON object: {"score": <number>, "reason": "<explanation>"}
-       provider: anthropic:messages:claude-opus-4-5-20251101
-       metadata:
-         dimension: code-correctness
-         maxScore: 100
-     - type: contains-any
-       value:
-         - client.fetch
-         - createClient
-       weight: 1
-     - type: contains-any
-       value:
-         - order(publishedAt
-         - order(_createdAt
-         - '| order('
-       weight: 1
-     - type: contains-any
-       value:
-         - '[0...10]'
-         - '[0..9]'
-         - '[0...'
-       weight: 1
-     - type: llm-rubric
-       value: |-
-         Score documentation coverage from 0 to 100:
-         - 0: Had to hallucinate/guess most implementation details
-         - 30: Significant gaps — filled with assumptions
-         - 50: Some gaps — inferred from partial information
-         - 80: Minor gaps — almost everything was documented
-         - 100: Complete coverage — all necessary info was in docs
-
-         Return ONLY a JSON object: {"score": <number>, "reason": "<explanation>"}
-       provider: anthropic:messages:claude-opus-4-5-20251101
-       metadata:
-         dimension: doc-coverage
-         maxScore: 100
- - description: GROQ - Joins and reference resolution (gold)
-   vars:
-     task: |-
-       Write GROQ queries that demonstrate join patterns in Sanity:
-
-       1. Follow a single reference to resolve an author's full profile
-       from a post (post.author -> author document with name, bio, image)
-       2. Resolve an array of category references from a post
-       (post.categories[]-> with title and slug)
-       3. Write a reverse reference query: given an author's ID, find all
-       posts by that author using a subquery and the parent scope operator (^)
-       4. Create a nested join: for each author, include their 5 most recent
-       posts as a nested array
-       5. Use the references() function to find all documents that reference
-       a specific document ID
-
-       Use @sanity/client with client.fetch(). Include TypeScript types.
-     docs: file://contexts/canonical/groq-joins-references.md
-     __featureArea: groq
-   assert:
-     - type: llm-rubric
-       value: |-
-         Score task completion from 0 to 100:
-         - 0: Couldn't attempt — missing critical information
-         - 20: Attempted but fundamentally wrong approach
-         - 50: Partial implementation — major functional gaps
-         - 80: Mostly complete — minor issues or missing edge cases
-         - 100: Fully functional code — works as expected
-
-         Must demonstrate:
-         - Single reference follow with -> operator
-         - Array reference resolution with []->
-         - Reverse reference / subquery using *[references(^._id)]
-         - Nested join pattern with parent scope (^)
-         - The references() function
-
-         Return ONLY a JSON object: {"score": <number>, "reason": "<explanation>"}
-       provider: anthropic:messages:claude-opus-4-5-20251101
-       metadata:
-         dimension: task-completion
-         maxScore: 100
-     - type: llm-rubric
-       value: |-
-         Score code correctness from 0 to 100:
-         - 0: Broken code, syntax errors, or deprecated APIs
-         - 30: Works but uses anti-patterns or inefficient approaches
-         - 50: Works but not idiomatic
-         - 80: Follows most best practices
-         - 100: Follows all best practices, idiomatic implementation
-
-         Check for:
-         - Correct -> dereference syntax
-         - Valid []-> array dereference
-         - Proper use of ^ parent scope operator
-         - Valid references() function usage
-         - No made-up syntax
-
-         Return ONLY a JSON object: {"score": <number>, "reason": "<explanation>"}
-       provider: anthropic:messages:claude-opus-4-5-20251101
-       metadata:
-         dimension: code-correctness
-         maxScore: 100
-     - type: contains
-       value: '->'
-       weight: 1
-     - type: contains-any
-       value:
-         - references(
-         - references(^
-       weight: 1
-     - type: llm-rubric
-       value: |-
-         Score documentation coverage from 0 to 100:
-         - 0: Had to hallucinate/guess most implementation details
-         - 30: Significant gaps — filled with assumptions
-         - 50: Some gaps — inferred from partial information
-         - 80: Minor gaps — almost everything was documented
-         - 100: Complete coverage — all necessary info was in docs
-
-         Return ONLY a JSON object: {"score": <number>, "reason": "<explanation>"}
-       provider: anthropic:messages:claude-opus-4-5-20251101
-       metadata:
-         dimension: doc-coverage
-         maxScore: 100
- - description: GROQ - Advanced filtering and projections (gold)
-   vars:
-     task: |-
-       Write GROQ queries demonstrating advanced filtering and projection patterns:
-
-       1. Use select() for conditional projections — return different fields
-       based on the document's _type (e.g., posts get excerpt, events get
-       date and venue)
-       2. Use coalesce() for fallback values — e.g., use seoTitle if it
-       exists, otherwise fall back to title
-       3. Use the match operator for full-text search in titles
-       4. Use count() to count documents matching a filter and to count
-       items within an array field
-       5. Use defined() to filter for documents that have a specific field set
-       6. Filter items within an array using [condition] syntax
-       7. Order results by multiple fields (e.g., featured status first,
-       then by publishedAt)
-
-       Use @sanity/client with client.fetch(). Include TypeScript types.
-     docs: file://contexts/canonical/groq-advanced-filtering.md
-     __featureArea: groq
-   assert:
-     - type: llm-rubric
-       value: |-
-         Score task completion from 0 to 100:
-         - 0: Couldn't attempt — missing critical information
-         - 20: Attempted but fundamentally wrong approach
-         - 50: Partial implementation — major functional gaps
-         - 80: Mostly complete — minor issues or missing edge cases
-         - 100: Fully functional code — works as expected
-
-         Must demonstrate:
-         - select() for conditional projections
-         - coalesce() for fallback values
-         - match operator for text search
-         - count() function usage
-         - defined() function for existence checks
-         - Array filtering with [condition]
-         - Multi-field ordering
-
-         Return ONLY a JSON object: {"score": <number>, "reason": "<explanation>"}
-       provider: anthropic:messages:claude-opus-4-5-20251101
-       metadata:
-         dimension: task-completion
-         maxScore: 100
-     - type: llm-rubric
-       value: |-
-         Score code correctness from 0 to 100:
-         - 0: Broken code, syntax errors, or deprecated APIs
-         - 30: Works but uses anti-patterns or inefficient approaches
-         - 50: Works but not idiomatic
-         - 80: Follows most best practices
-         - 100: Follows all best practices, idiomatic implementation
-
-         Check for:
-         - Valid select() syntax with => arrow notation
-         - Correct coalesce() usage
-         - Proper match operator usage (on text fields)
-         - Valid count() and defined() function calls
-         - Correct array filter syntax
-
-         Return ONLY a JSON object: {"score": <number>, "reason": "<explanation>"}
-       provider: anthropic:messages:claude-opus-4-5-20251101
-       metadata:
-         dimension: code-correctness
-         maxScore: 100
-     - type: contains-any
-       value:
-         - select(
-         - coalesce(
-       weight: 1
-     - type: contains-any
-       value:
-         - count(
-         - defined(
-       weight: 1
-     - type: contains-any
-       value:
-         - match
-       weight: 1
-     - type: llm-rubric
-       value: |-
-         Score documentation coverage from 0 to 100:
-         - 0: Had to hallucinate/guess most implementation details
-         - 30: Significant gaps — filled with assumptions
-         - 50: Some gaps — inferred from partial information
-         - 80: Minor gaps — almost everything was documented
-         - 100: Complete coverage — all necessary info was in docs
-
-         Return ONLY a JSON object: {"score": <number>, "reason": "<explanation>"}
-       provider: anthropic:messages:claude-opus-4-5-20251101
-       metadata:
-         dimension: doc-coverage
-         maxScore: 100
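
The removed task file pairs each `contains` / `contains-any` assertion with a list of substrings and a `weight`. The sketch below illustrates one way such checks could be evaluated against a model's output. It is not the package's implementation: the type `ContainsAssertion` and the functions `checkAssertion` and `weightedPassRate` are hypothetical names, and the weighted pass rate is an assumed aggregation chosen only for illustration.

```typescript
// Hypothetical shapes mirroring the two substring assertion types in the YAML.
type ContainsAssertion =
  | { type: "contains"; value: string; weight?: number }
  | { type: "contains-any"; value: string[]; weight?: number };

// "contains" needs the single substring; "contains-any" needs at least one of the list.
function checkAssertion(output: string, a: ContainsAssertion): boolean {
  if (a.type === "contains") {
    return output.includes(a.value);
  }
  return a.value.some((needle) => output.includes(needle));
}

// Assumed aggregation: fraction of total weight carried by passing assertions.
function weightedPassRate(output: string, assertions: ContainsAssertion[]): number {
  let total = 0;
  let passed = 0;
  for (const a of assertions) {
    const w = a.weight ?? 1; // a missing weight is treated as 1 in this sketch
    total += w;
    if (checkAssertion(output, a)) passed += w;
  }
  return total === 0 ? 0 : passed / total;
}

// The three substring checks from the "Blog queries" entry in the removed file.
const blogQueryChecks: ContainsAssertion[] = [
  { type: "contains-any", value: ["client.fetch", "createClient"], weight: 1 },
  { type: "contains-any", value: ["order(publishedAt", "order(_createdAt", "| order("], weight: 1 },
  { type: "contains-any", value: ["[0...10]", "[0..9]", "[0..."], weight: 1 },
];

// An illustrative model output, not taken from the package.
const sampleOutput =
  'client.fetch(`*[_type == "post"] | order(publishedAt desc)[0...10]`)';

console.log(weightedPassRate(sampleOutput, blogQueryChecks)); // 1
```

These substring checks are cheap and deterministic, which may be why the file combines them with the `llm-rubric` assertions that score the more subjective dimensions.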