@artemiskit/cli 0.1.8 → 0.2.2

package/CHANGELOG.md CHANGED
@@ -1,5 +1,144 @@
  # @artemiskit/cli
 
+ ## 0.2.2
+
+ ### Patch Changes
+
+ - d5ca7c6: Add baseline command and CI mode for regression detection
+
+ ### New Features
+
+ - **Baseline Command**: New `akit baseline` command with `set`, `list`, `get`, `remove` subcommands
+
+ - Lookup by run ID (default) or scenario name (`--scenario` flag)
+ - Store and manage baseline metrics for regression comparison
+
+ - **CI Mode**: New `--ci` flag for machine-readable output
+
+ - Outputs environment variable format for easy parsing
+ - Auto-detects CI environments (GitHub Actions, GitLab CI, etc.)
+ - Suppresses colors and spinners
+
+ - **Summary Formats**: New `--summary` flag with `json`, `text`, `security` formats
+
+ - JSON summary for pipeline parsing
+ - Security summary for compliance reporting
+
+ - **Regression Detection**: New `--baseline` and `--threshold` flags
+ - Compare runs against saved baselines
+ - Configurable regression threshold (default 5%)
+ - Exit code 1 on regression detection
+
+ - Updated dependencies [d5ca7c6]
+ - @artemiskit/core@0.2.2
+ - @artemiskit/adapter-openai@0.1.9
+ - @artemiskit/adapter-vercel-ai@0.1.9
+ - @artemiskit/redteam@0.2.2
+ - @artemiskit/reports@0.2.2
+
+ ## 0.2.1
+
+ ### Patch Changes
+
+ - fix: improve LLM grader compatibility with reasoning models
+
+ - Remove temperature parameter from LLM grader (reasoning models like o1, o3, gpt-5-mini only support temperature=1)
+ - Increase maxTokens from 200 to 1000 to accommodate reasoning models that use tokens for internal thinking
+ - Improve grader prompt for stricter JSON-only output format
+ - Add fallback parsing for malformed JSON responses
+ - Add markdown code block stripping from grader responses
+ - Add `modelFamily` configuration option to Azure OpenAI provider for correct parameter detection when deployment names differ from model names
+
+ - Updated dependencies
+ - @artemiskit/core@0.2.1
+ - @artemiskit/adapter-openai@0.1.8
+ - @artemiskit/adapter-vercel-ai@0.1.8
+ - @artemiskit/redteam@0.2.1
+ - @artemiskit/reports@0.2.1
+
+ ## 0.2.0
+
+ ### Minor Changes
+
+ - d2c3835: ## v0.2.0 - Enhanced Evaluation Features
+
+ ### CLI (`@artemiskit/cli`)
+
+ #### New Features
+
+ - **Multi-turn mutations**: Added `--mutations multi_turn` flag for red team testing with 4 built-in strategies:
+ - `gradual_escalation`: Gradually intensifies requests over conversation turns
+ - `context_switching`: Shifts topics to lower defenses before attack
+ - `persona_building`: Establishes trust through roleplay
+ - `distraction`: Uses side discussions to slip in harmful requests
+ - **Custom multi-turn conversations**: Support for array prompts in red team scenarios (consistent with `run` command format). The last user message becomes the attack target, preceding messages form conversation context.
+ - **Custom attacks**: Added `--custom-attacks` flag to load custom attack patterns from YAML files with template variables and variations.
+ - **Encoding mutations**: Added `--mutations encoding` for obfuscation attacks (base64, ROT13, hex, unicode).
+ - **Directory scanning**: Run all scenarios in a directory with `akit run scenarios/`
+ - **Glob pattern matching**: Use patterns like `akit run scenarios/**/*.yaml`
+ - **Parallel execution**: Added `--parallel` flag for concurrent scenario execution
+ - **Scenario tags**: Filter scenarios with `--tags` flag
+
+ ### Core (`@artemiskit/core`)
+
+ #### New Features
+
+ - **Combined matchers**: New `type: combined` expectation with `operator: and|or` for complex assertion logic
+ - **`not_contains` expectation**: Negative containment check to ensure responses don't include specific text
+ - **`similarity` expectation**: Semantic similarity matching with two modes:
+ - Embedding-based: Uses vector embeddings for fast semantic comparison
+ - LLM-based fallback: Uses LLM to evaluate semantic similarity when embeddings unavailable
+ - Configurable threshold (default 0.75)
+ - **`inline` expectation**: Safe expression-based custom matchers in YAML using JavaScript-like expressions (e.g., `response.length > 100`, `response.includes('hello')`)
+ - **p90 latency metric**: Added p90 percentile to stress test latency metrics
+ - **Token usage tracking**: Monitor token consumption per request in stress tests
+ - **Cost estimation**: Estimate API costs with model pricing data
+
+ ### Red Team (`@artemiskit/redteam`)
+
+ #### New Features
+
+ - **MultiTurnMutation class**: Full implementation with strategy support and custom conversation prefixes
+ - **Custom attack loader**: Parse and load custom attack patterns from YAML
+ - **Encoding mutation**: Obfuscate attack payloads using various encoding schemes
+ - **CVSS-like severity scoring**: Detailed attack severity scoring with:
+ - `CvssScore` interface with attack vector, complexity, impact metrics
+ - `CvssCalculator` class for score calculation and aggregation
+ - Predefined scores for all mutations and detection categories
+ - Human-readable score descriptions and vector strings
+
+ ### Reports (`@artemiskit/reports`)
+
+ #### New Features
+
+ - **Run comparison HTML report**: Visual diff between two runs showing:
+ - Metrics overview with baseline vs current comparison
+ - Change summary (regressions, improvements, unchanged)
+ - Case-by-case comparison table with filtering
+ - Side-by-side response comparison for each case
+ - **Comparison JSON export**: Structured comparison data for programmatic use
+
+ ### CLI Enhancements
+
+ - **Compare command `--html` flag**: Generate HTML comparison report
+ - **Compare command `--json` flag**: Generate JSON comparison data
+
+ ### Documentation
+
+ - Updated all CLI command documentation
+ - Added comprehensive examples for custom multi-turn scenarios
+ - Documented combined matchers and `not_contains` expectations
+ - Added mutation strategy reference tables
+
+ ### Patch Changes
+
+ - Updated dependencies [d2c3835]
+ - @artemiskit/core@0.2.0
+ - @artemiskit/redteam@0.2.0
+ - @artemiskit/reports@0.2.0
+ - @artemiskit/adapter-openai@0.1.7
+ - @artemiskit/adapter-vercel-ai@0.1.7
+
  ## 0.1.8
 
  ### Patch Changes
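
Note on the 0.2.2 regression detection entry in the hunk above: the changelog states only that runs are compared against a saved baseline, that the threshold defaults to 5%, and that a regression produces exit code 1. It does not say which metric is compared or whether the threshold is absolute or relative. The sketch below is one plausible reading, assuming the metric is a success rate and the threshold is an absolute drop. `RunMetrics`, `detectRegression`, and `exitCodeFor` are hypothetical names for illustration, not part of `@artemiskit/cli`.

```typescript
// Hypothetical shape: the changelog does not describe the stored baseline format.
interface RunMetrics {
  successRate: number; // fraction of passing cases, in [0, 1]
}

interface RegressionResult {
  drop: number; // baseline minus current; positive means the run got worse
  regressed: boolean;
}

// Assumption: a regression is an absolute drop in success rate that exceeds
// the threshold. 0.05 mirrors the "default 5%" stated in the changelog.
function detectRegression(
  baseline: RunMetrics,
  current: RunMetrics,
  threshold = 0.05,
): RegressionResult {
  const drop = baseline.successRate - current.successRate;
  return { drop, regressed: drop > threshold };
}

// The changelog states exit code 1 on regression; 0 otherwise is assumed.
function exitCodeFor(result: RegressionResult): number {
  return result.regressed ? 1 : 0;
}

const result = detectRegression({ successRate: 0.9 }, { successRate: 0.8 });
console.log(result.regressed, exitCodeFor(result));
```

In a CI job, a wrapper like this would pass the return value of `exitCodeFor` to `process.exit` so that the pipeline step fails when the drop exceeds the threshold.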
package/bin/artemis.ts CHANGED
File without changes
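
Note on the 0.2.1 entry in the CHANGELOG.md hunk: it lists markdown code block stripping and fallback parsing for malformed JSON in grader responses, without showing how either is done. The following is a minimal sketch of that general technique, not the package's implementation. The `GraderVerdict` shape (`score`, `reason`) and both function names are assumptions made for illustration.

```typescript
// Hypothetical verdict shape: the changelog does not document the grader's JSON schema.
interface GraderVerdict {
  score: number;
  reason: string;
}

// Remove a single surrounding markdown fence (with or without a language tag), if present.
function stripCodeFence(raw: string): string {
  const trimmed = raw.trim();
  const match = trimmed.match(/^```[a-zA-Z]*\s*\n?([\s\S]*?)\n?```$/);
  return (match?.[1] ?? trimmed).trim();
}

// Try strict JSON first, then the outermost {...} span, then a bare "score" pattern.
function parseGraderResponse(raw: string): GraderVerdict | null {
  const text = stripCodeFence(raw);
  const candidates = [text];
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start !== -1 && end > start) {
    candidates.push(text.slice(start, end + 1));
  }
  for (const candidate of candidates) {
    try {
      const parsed: unknown = JSON.parse(candidate);
      if (typeof parsed === "object" && parsed !== null) {
        const obj = parsed as { score?: unknown; reason?: unknown };
        if (typeof obj.score === "number") {
          return {
            score: obj.score,
            reason: typeof obj.reason === "string" ? obj.reason : "",
          };
        }
      }
    } catch {
      // Not valid JSON; fall through to the next candidate.
    }
  }
  // Last resort for truncated or malformed output: pull out a numeric score.
  const scoreMatch = text.match(/"?score"?\s*[:=]\s*(-?\d+(?:\.\d+)?)/i);
  return scoreMatch ? { score: Number(scoreMatch[1]), reason: "" } : null;
}

console.log(parseGraderResponse('```json\n{"score": 0.9, "reason": "ok"}\n```'));
```

Returning `null` when nothing can be recovered leaves the caller to decide whether an unparseable verdict counts as a failed case or triggers a retry.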