PyPI - pytest-skill-engineering - Versions diffs - 0.6.4__tar.gz → 0.6.6__tar.gz - Mend

pytest-skill-engineering 0.6.4 (tar.gz) → 0.6.6 (tar.gz)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (75)

{pytest_skill_engineering-0.6.4 → pytest_skill_engineering-0.6.6}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: pytest-skill-engineering
-Version: 0.6.4
+Version: 0.6.6
 Summary: The testing framework for skill engineering. Test tool descriptions, prompt templates, agent skills, and custom agents with real LLMs. AI analyzes results and tells you what to fix.
 Project-URL: Homepage, https://github.com/sbroenne/pytest-skill-engineering
 Project-URL: Repository, https://github.com/sbroenne/pytest-skill-engineering
@@ -28,6 +28,7 @@ Requires-Dist: nh3>=0.3.3
 Requires-Dist: pydantic>=2.0
 Requires-Dist: pytest>=9.0
 Requires-Dist: python-frontmatter>=1.1.0
+Requires-Dist: trio>=0.33.0
 Provides-Extra: dev
 Requires-Dist: pre-commit>=4.5; extra == 'dev'
 Requires-Dist: pyright>=1.1.408; extra == 'dev'
@@ -62,7 +63,7 @@ Test MCP servers, CLI tools, Agent Skills, and custom agents using the **real Gi
 Your MCP server passes all unit tests. Then a user tries it in GitHub Copilot and:
 - Copilot picks the wrong tool
-- Passes garbage parameters
+- Passes garbage parameters
 - Can't recover from errors
 - Ignores your skill's instructions
@@ -97,7 +98,7 @@ async def test_balance_query(copilot_eval):
         max_turns=10,
     )
     result = await copilot_eval(agent, "What's my checking balance?")
     assert result.success
     assert result.tool_was_called("get_balance")
 ```
@@ -148,7 +149,7 @@ The AI-powered report needs a model to generate insights. Configure it in `pypro
 ```toml
 [tool.pytest.ini_options]
-addopts = "--aitest-summary-model=copilot/gpt-5-mini"
+addopts = "--aitest-summary-model=copilot/gpt-5.4-mini"
 ```
 You can also use Azure OpenAI or other providers if you prefer — see [Configuration](https://sbroenne.github.io/pytest-skill-engineering/reference/configuration/).

{pytest_skill_engineering-0.6.4 → pytest_skill_engineering-0.6.6}/README.md

@@ -14,7 +14,7 @@ Test MCP servers, CLI tools, Agent Skills, and custom agents using the **real Gi
 Your MCP server passes all unit tests. Then a user tries it in GitHub Copilot and:
 - Copilot picks the wrong tool
-- Passes garbage parameters
+- Passes garbage parameters
 - Can't recover from errors
 - Ignores your skill's instructions
@@ -49,7 +49,7 @@ async def test_balance_query(copilot_eval):
         max_turns=10,
     )
     result = await copilot_eval(agent, "What's my checking balance?")
     assert result.success
     assert result.tool_was_called("get_balance")
 ```
@@ -100,7 +100,7 @@ The AI-powered report needs a model to generate insights. Configure it in `pypro
 ```toml
 [tool.pytest.ini_options]
-addopts = "--aitest-summary-model=copilot/gpt-5-mini"
+addopts = "--aitest-summary-model=copilot/gpt-5.4-mini"
 ```
 You can also use Azure OpenAI or other providers if you prefer — see [Configuration](https://sbroenne.github.io/pytest-skill-engineering/reference/configuration/).

{pytest_skill_engineering-0.6.4 → pytest_skill_engineering-0.6.6}/pyproject.toml

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "pytest-skill-engineering"
-version = "0.6.4"
+version = "0.6.6"
 description = "The testing framework for skill engineering. Test tool descriptions, prompt templates, agent skills, and custom agents with real LLMs. AI analyzes results and tells you what to fix."
 readme = "README.md"
 license = { text = "MIT" }
@@ -35,6 +35,7 @@ dependencies = [
     "python-frontmatter>=1.1.0",
     "nh3>=0.3.3",
     "github-copilot-sdk>=0.2.2",
+    "trio>=0.33.0",
 ]
 [project.optional-dependencies]
@@ -119,7 +120,7 @@ filterwarnings = [
 # This demonstrates the recommended setup - configure once in pyproject.toml.
 # LLM auth is handled by the GitHub Copilot SDK (gh auth login or GITHUB_TOKEN)
 addopts = """
---aitest-summary-model=copilot/gpt-5.4
+--aitest-summary-model=copilot/gpt-5.5
 --aitest-html=aitest-reports/report.html
 """
 markers = [

{pytest_skill_engineering-0.6.4 → pytest_skill_engineering-0.6.6}/src/pytest_skill_engineering/copilot/judge.py

@@ -39,7 +39,7 @@ def _approve_all_permissions(*_args: Any, **_kwargs: Any) -> Any:
     """Approve all permission requests using the current SDK result type."""
     from copilot.session import PermissionRequestResult  # noqa: PLC0415
-    return PermissionRequestResult(kind="approved")
+    return PermissionRequestResult(kind="approve-once")
 def _get_data_field(event: Any, field: str, default: Any = None) -> Any:

{pytest_skill_engineering-0.6.4 → pytest_skill_engineering-0.6.6}/src/pytest_skill_engineering/copilot/runner.py

@@ -115,7 +115,7 @@ def _approve_all_permissions(*_args: Any, **_kwargs: Any) -> Any:
     """Approve all permission requests using the current SDK result type."""
     from copilot.session import PermissionRequestResult
-    return PermissionRequestResult(kind="approved")
+    return PermissionRequestResult(kind="approve-once")
 def _is_transient_error(error: str | None) -> bool:

{pytest_skill_engineering-0.6.4 → pytest_skill_engineering-0.6.6}/src/pytest_skill_engineering/core/evals.py

@@ -38,7 +38,7 @@ Example usage::
     from pytest_skill_engineering import Eval, Provider
     agent = Eval.from_agent_file(
         ".github/agents/reviewer.agent.md",
-        provider=Provider(model="azure/gpt-5-mini"),
+        provider=Provider(model="azure/gpt-5.4-mini"),
     )
     # Use with CopilotEval

{pytest_skill_engineering-0.6.4 → pytest_skill_engineering-0.6.6}/src/pytest_skill_engineering/core/plugin.py

@@ -23,7 +23,7 @@ Example::
     from pytest_skill_engineering import Eval, Provider
     agent = Eval.from_plugin(
         "my-plugin/",
-        provider=Provider(model="azure/gpt-5-mini"),
+        provider=Provider(model="azure/gpt-5.4-mini"),
     )
 """

{pytest_skill_engineering-0.6.4 → pytest_skill_engineering-0.6.6}/src/pytest_skill_engineering/execution/clarification.py

@@ -38,7 +38,7 @@ async def check_clarification(
     Args:
         response_text: The agent's final response text to classify.
-        judge_model: Model string (e.g. "gpt-5-mini", "claude-sonnet-4").
+        judge_model: Model string (e.g. "gpt-5.4-mini", "claude-sonnet-4").
         timeout_seconds: Timeout for the judge LLM call.
     Returns:

{pytest_skill_engineering-0.6.4 → pytest_skill_engineering-0.6.6}/src/pytest_skill_engineering/execution/rate_limiter.py

@@ -5,7 +5,7 @@ tokens per minute (tpm). Rate limiters are shared across all engine instances
 using the same model, so concurrent tests respect deployment limits.
 Usage:
-    limiter = get_rate_limiter("azure/gpt-5-mini", rpm=10, tpm=10000)
+    limiter = get_rate_limiter("azure/gpt-5.4-mini", rpm=10, tpm=10000)
     await limiter.acquire()  # Waits if rate limit would be exceeded
     # ... make API call ...
     limiter.record_tokens(1500)  # Track token usage for tpm enforcement
@@ -38,7 +38,7 @@ def get_rate_limiter(
     restrictive limits (minimum of old and new values).
     Args:
-        model: Model identifier string (e.g. "azure/gpt-5-mini").
+        model: Model identifier string (e.g. "azure/gpt-5.4-mini").
         rpm: Requests per minute limit.
         tpm: Tokens per minute limit.

{pytest_skill_engineering-0.6.4 → pytest_skill_engineering-0.6.6}/src/pytest_skill_engineering/fixtures/factories.py

@@ -18,7 +18,7 @@ def skill_factory() -> Callable[[Path | str], Skill]:
         def test_with_skill(skill_factory, eval_run):
             skill = skill_factory("path/to/my-skill")
             agent = Eval(
-                provider=Provider(model="azure/gpt-5-mini"),
+                provider=Provider(model="azure/gpt-5.4-mini"),
                 skill=skill,
             )
             result = await eval_run(agent, "Do something with the skill")

{pytest_skill_engineering-0.6.4 → pytest_skill_engineering-0.6.6}/src/pytest_skill_engineering/fixtures/llm_assert.py

@@ -12,7 +12,7 @@ from dataclasses import dataclass
 import pytest
-_LLM_MODEL_DEFAULT = "copilot/gpt-5-mini"
+_LLM_MODEL_DEFAULT = "copilot/gpt-5.4-mini"
 @dataclass(slots=True)
@@ -111,17 +111,19 @@ def llm_assert(request: pytest.FixtureRequest) -> LLMAssert:
     The judge model is resolved in this order:
     1. ``--llm-model`` if explicitly set
     2. ``--aitest-summary-model`` (same model for analysis and assertions)
-    3. ``copilot/gpt-5-mini`` as final fallback
+    3. ``copilot/gpt-5.4-mini`` as final fallback
     Example::
         def test_response(llm_assert):
             assert llm_assert("Your balance is $1,500", "mentions a dollar amount")
     """
-    model_str: str = request.config.getoption("--llm-model")
+    model_str = request.config.getoption("--llm-model")
+    if not isinstance(model_str, str):
+        model_str = _LLM_MODEL_DEFAULT
     if model_str == _LLM_MODEL_DEFAULT:
         # Not explicitly set — fall back to summary model if available
         summary_model = request.config.getoption("--aitest-summary-model", default=None)
-        if summary_model:
+        if isinstance(summary_model, str) and summary_model:
             model_str = summary_model
     return LLMAssert(model=model_str)
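
Reviewer note: the hunk above adds `isinstance` guards around the option lookups while keeping the documented fallback order (`--llm-model`, then `--aitest-summary-model`, then the default). The sketch below restates that logic as a standalone function so it can be exercised without pytest. `resolve_judge_model` and `DEFAULT_JUDGE_MODEL` are hypothetical names, not part of the package.

```python
# Default judge model as shown in the 0.6.6 side of the diff.
DEFAULT_JUDGE_MODEL = "copilot/gpt-5.4-mini"


def resolve_judge_model(llm_model: object, summary_model: object) -> str:
    """Hypothetical helper mirroring the fixture's fallback order."""
    # Non-string option values (None, a mock, ...) fall back to the default.
    model = llm_model if isinstance(llm_model, str) else DEFAULT_JUDGE_MODEL
    # A value equal to the default is treated as "not explicitly set",
    # so a non-empty string summary model takes over.
    if model == DEFAULT_JUDGE_MODEL and isinstance(summary_model, str) and summary_model:
        model = summary_model
    return model
```

One consequence of comparing against the default: passing `--llm-model` with exactly the default value is indistinguishable from not passing it, so the summary model still wins in that case.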

{pytest_skill_engineering-0.6.4 → pytest_skill_engineering-0.6.6}/src/pytest_skill_engineering/fixtures/llm_assert_image.py

@@ -74,7 +74,7 @@ def llm_assert_image(request: pytest.FixtureRequest) -> LLMAssertImage:
     1. ``--llm-vision-model`` if explicitly set
     2. ``--llm-model`` (same model for text and image assertions)
     3. ``--aitest-summary-model``
-    4. ``copilot/gpt-5-mini`` as final fallback
+    4. ``copilot/gpt-5.4-mini`` as final fallback
     NOTE: This fixture currently raises NotImplementedError when called,
     as the Copilot SDK does not yet support image inputs in a documented way.
@@ -86,19 +86,22 @@ def llm_assert_image(request: pytest.FixtureRequest) -> LLMAssertImage:
             screenshots = result.tool_images_for("screenshot")
             assert llm_assert_image(screenshots[-1], "shows a bar chart")
     """
-    _LLM_MODEL_DEFAULT = "copilot/gpt-5-mini"  # noqa: N806
+    _LLM_MODEL_DEFAULT = "copilot/gpt-5.4-mini"  # noqa: N806
     # Try vision-specific model first
-    vision_model_str: str | None = request.config.getoption("--llm-vision-model", default=None)
+    vision_model_option = request.config.getoption("--llm-vision-model", default=None)
+    vision_model_str = vision_model_option if isinstance(vision_model_option, str) else None
     if vision_model_str:
         model_str = vision_model_str
     else:
         # Fall back to llm-model → summary model → default
         model_str = request.config.getoption("--llm-model")
+        if not isinstance(model_str, str):
+            model_str = _LLM_MODEL_DEFAULT
         if model_str == _LLM_MODEL_DEFAULT:
             summary_model = request.config.getoption("--aitest-summary-model", default=None)
-            if summary_model:
+            if isinstance(summary_model, str) and summary_model:
                 model_str = summary_model
     return LLMAssertImage(model=model_str)

{pytest_skill_engineering-0.6.4 → pytest_skill_engineering-0.6.6}/src/pytest_skill_engineering/fixtures/llm_score.py

@@ -20,9 +20,12 @@ import pytest
 from pytest_skill_engineering.core.scoring import (
     ScoreResult,
     ScoringDimension,
+    assert_score,
 )
-_LLM_MODEL_DEFAULT = "copilot/gpt-5-mini"
+_LLM_MODEL_DEFAULT = "copilot/gpt-5.4-mini"
+__all__ = ["LLMScore", "ScoreResult", "ScoringDimension", "assert_score", "llm_score"]
 # ---------------------------------------------------------------------------
@@ -240,7 +243,7 @@ def llm_score(request: pytest.FixtureRequest) -> LLMScore:
     1. ``--llm-model`` if explicitly set
     2. ``--aitest-summary-model`` (shared analysis model)
-    3. ``copilot/gpt-5-mini`` as final fallback
+    3. ``copilot/gpt-5.4-mini`` as final fallback
     Example::
@@ -254,9 +257,11 @@ def llm_score(request: pytest.FixtureRequest) -> LLMScore:
             result = llm_score(my_text, rubric)
             assert_score(result, min_total=7)
     """
-    model_str: str = request.config.getoption("--llm-model")
+    model_str = request.config.getoption("--llm-model")
+    if not isinstance(model_str, str):
+        model_str = _LLM_MODEL_DEFAULT
     if model_str == _LLM_MODEL_DEFAULT:
         summary_model = request.config.getoption("--aitest-summary-model", default=None)
-        if summary_model:
+        if isinstance(summary_model, str) and summary_model:
             model_str = summary_model
     return LLMScore(model=model_str)

{pytest_skill_engineering-0.6.4 → pytest_skill_engineering-0.6.6}/src/pytest_skill_engineering/plugin.py

@@ -181,7 +181,8 @@ def pytest_generate_tests(metafunc: pytest.Metafunc) -> None:
     receives the parameter even though it does not declare the fixture
     explicitly.
     """
-    count: int = metafunc.config.getoption("--aitest-iterations", default=1)
+    count_option = metafunc.config.getoption("--aitest-iterations", default=1)
+    count = count_option if isinstance(count_option, int) else 1
     if count <= 1:
         return
     metafunc.fixturenames.append("_aitest_iteration")
@@ -360,7 +361,7 @@ def _add_junit_properties(
         <testcase name="test_balance">
           <properties>
             <property name="aitest.agent.name" value="banking-agent"/>
-            <property name="aitest.model" value="gpt-5-mini"/>
+            <property name="aitest.model" value="gpt-5.4-mini"/>
             <property name="aitest.skill" value="financial-advisor"/>
             <property name="aitest.tools.called" value="get_balance,transfer"/>
           </properties>
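
Reviewer note: the first hunk of this file replaces a bare type annotation on the `--aitest-iterations` value with a runtime check that falls back to 1. A small standalone sketch of that coercion is shown here. `coerce_iterations` and `should_parametrize` are hypothetical helper names used only to make the behavior testable in isolation.

```python
def coerce_iterations(option_value: object) -> int:
    """Return the iteration count, falling back to 1 for non-int values."""
    return option_value if isinstance(option_value, int) else 1


def should_parametrize(option_value: object) -> bool:
    """Tests are only parametrized when more than one iteration is requested."""
    return coerce_iterations(option_value) > 1
```

Because `bool` is a subclass of `int` in Python, a value of `True` passes the `isinstance` check and compares as 1, so it does not trigger parametrization.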

{pytest_skill_engineering-0.6.4 → pytest_skill_engineering-0.6.6}/src/pytest_skill_engineering/plugin_options.py

@@ -24,7 +24,7 @@ def add_aitest_options(group: OptionGroup) -> None:
         default=None,
         help=(
             "Model for AI analysis. Required when generating reports. "
-            "Use the most capable model you can afford (e.g., gpt-5.1-chat, claude-opus-4)."
+            "Use the most capable model you can afford (e.g., gpt-5.5, claude-opus-4)."
         ),
     )
@@ -107,10 +107,10 @@ def add_aitest_options(group: OptionGroup) -> None:
     # LLM judge model for llm_assert fixture
     group.addoption(
         "--llm-model",
-        default="copilot/gpt-5-mini",
+        default="copilot/gpt-5.4-mini",
         help=(
             "Model for llm_assert semantic assertions. "
-            "Defaults to --aitest-summary-model if set, otherwise copilot/gpt-5-mini."
+            "Defaults to --aitest-summary-model if set, otherwise copilot/gpt-5.4-mini."
         ),
     )

{pytest_skill_engineering-0.6.4 → pytest_skill_engineering-0.6.6}/src/pytest_skill_engineering/plugin_report.py

@@ -107,7 +107,8 @@ def generate_structured_insights(
         from pytest_skill_engineering.reporting.insights import generate_insights
         # Require dedicated summary model - no fallback
-        model = config.getoption("--aitest-summary-model")
+        model_option = config.getoption("--aitest-summary-model")
+        model = model_option if isinstance(model_option, str) else None
         if not model:
             if required:
                 raise pytest.UsageError(
@@ -196,7 +197,7 @@ def generate_structured_insights(
                 model=model,
                 min_pass_rate=config.getoption("--aitest-min-pass-rate"),
                 analysis_prompt=analysis_prompt,
-                compact=config.getoption("--aitest-summary-compact"),
+                compact=config.getoption("--aitest-summary-compact") is True,
             )
         # Use asyncio.run() instead of deprecated get_event_loop().run_until_complete()
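
Reviewer note: both hunks above narrow loosely typed option values before use: the summary model becomes `str` or `None`, and the compact flag becomes a strict `bool` via `is True`. The sketch below shows the same two coercions together. `ReportOptions` and `normalize_report_options` are hypothetical names and do not exist in the package.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReportOptions:
    """Hypothetical container for the two normalized option values."""
    model: Optional[str]
    compact: bool


def normalize_report_options(model_option: object, compact_option: object) -> ReportOptions:
    # Anything that is not a string is treated as "no model configured".
    model = model_option if isinstance(model_option, str) else None
    # Identity check: only the literal True enables compact mode;
    # truthy values such as 1 or "yes" do not.
    compact = compact_option is True
    return ReportOptions(model=model, compact=compact)
```

The identity check is stricter than a truthiness test, which matters when the option value might be a mock or some other truthy non-boolean object.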

{pytest_skill_engineering-0.6.4 → pytest_skill_engineering-0.6.6}/src/pytest_skill_engineering/prompts/ai_summary.md

@@ -5,7 +5,7 @@ You are analyzing test results for **pytest-skill-engineering**, a skill enginee
 ## Key Concepts
 An **Eval** is a complete test configuration — the harness that exercises the skill stack:
-- **Model**: The LLM (e.g., `gpt-5-mini`, `gpt-4.1`)
+- **Model**: The LLM (e.g., `gpt-5.4-mini`, `gpt-4.1`)
 - **MCP/CLI Servers**: The tools being tested (tool descriptions + schemas)
 - **MCP Prompt Templates**: Slash-command prompts bundled with MCP servers (e.g., `/mcp.servername.promptname`)
 - **Skill**: Optional domain knowledge injected into context
@@ -309,7 +309,7 @@ Use these sections as needed (skip sections with no content):
 - **Effective**: Eval followed instructions correctly
 - **Mixed**: Some tests passed, others showed confusion
 - **Ineffective**: Instructions ignored or misunderstood
-- **Model-specific effectiveness**: Instructions that fail with one model may succeed with another. If a variant was tested with multiple models (e.g., `gpt-5-mini + detailed` failed but `gpt-4.1 + detailed` passed), label it **mixed** — NOT ineffective. Only label instructions **ineffective** if they failed across ALL models tested. Always qualify: "ineffective with gpt-5-mini" rather than just "ineffective".
+- **Model-specific effectiveness**: Instructions that fail with one model may succeed with another. If a variant was tested with multiple models (e.g., `gpt-5.4-mini + detailed` failed but `gpt-4.1 + detailed` passed), label it **mixed** — NOT ineffective. Only label instructions **ineffective** if they failed across ALL models tested. Always qualify: "ineffective with gpt-5.4-mini" rather than just "ineffective".
 - Note token bloat: "150 tokens of examples could be removed"
 ### Skill Feedback
@@ -372,7 +372,7 @@ Use these sections as needed (skip sections with no content):
     - **Gauge color values**: green=#4ade80, amber=#facc15, red=#f87171, blue=#60a5fa
 12. **Use pre-computed numbers** — The input includes a "Pre-computed Eval Statistics" section with exact values for pass rates, costs, tokens, winner designation, and aggregate stats (total tests, failures, agents, avg turns). Use these numbers verbatim. Never estimate or approximate.
 13. **Cost comparisons must use actual data** — When comparing costs between agents, use the **actual per-test cost** from the pre-computed statistics (total cost ÷ number of tests). Never cite model list pricing or theoretical cost differences. A cheaper model may use more tokens, making the realized cost difference much smaller than the per-token price difference. For example, if model A costs $0.0018/test and model B costs $0.0025/test, say "~28% cheaper" — NOT "85% cheaper" or "6× cheaper" based on list pricing.
-14. **Instruction labels must be model-specific** — Never label custom agent instructions as globally "ineffective" or globally "effective" when tested with multiple models and produced different outcomes. If `gpt-5-mini + detailed` failed but `gpt-4.1 + detailed` passed, the instructions are "mixed" — effective with gpt-4.1, ineffective with gpt-5-mini. The same applies to the Optimizations section: do not say "restrict [instructions] usage" if they work correctly with some models.
+14. **Instruction labels must be model-specific** — Never label custom agent instructions as globally "ineffective" or globally "effective" when tested with multiple models and produced different outcomes. If `gpt-5.4-mini + detailed` failed but `gpt-4.1 + detailed` passed, the instructions are "mixed" — effective with gpt-4.1, ineffective with gpt-5.4-mini. The same applies to the Optimizations section: do not say "restrict [instructions] usage" if they work correctly with some models.
 15. **Bullet lists need a blank line before them** — In markdown, a list must be preceded by a blank line to render correctly. NEVER put a bullet list directly after a `**bold label:**` on the next line — the markdown parser will collapse them into a single paragraph. Use `####` headings instead of bold labels when you need a label followed by a list.
 16. **Iteration awareness** — When iteration data is present ("Iter Pass Rate" in Pre-computed Eval Statistics), factor consistency into your recommendation. An agent with 100% pass rate at 5/5 iterations is more reliable than one with 100% pass rate at 3/5 iterations. Flag tests with <100% iteration pass rate as **flaky** in your analysis. When no iteration data is present, skip all iteration-related analysis.
 17. **Score awareness** — When LLM score data is present (`LLM Score: X/Y (Z%)`), mention the weighted score in the Winner Card summary and note any dimensions below 70% in the analysis. When no score data exists, skip all score-related commentary.

{pytest_skill_engineering-0.6.4 → pytest_skill_engineering-0.6.6}/src/pytest_skill_engineering/prompts/coding_agent_analysis.md

@@ -288,7 +288,7 @@ Use these sections as needed (skip sections with no content):
 - **Effective**: Eval followed instructions and completed tasks correctly
 - **Mixed**: Some tasks succeeded, others showed the agent ignoring or misunderstanding instructions
 - **Ineffective**: Instructions were ignored or produced worse behavior
-- **Model-specific effectiveness**: An instruction that fails with one model may succeed with another. If an instruction variant was tested with multiple models (e.g., `gpt-5-mini + verbose` failed but `gpt-4.1 + verbose` passed), label it **mixed** — NOT ineffective. Only label an instruction **ineffective** if it failed across ALL models it was tested with. Always qualify: "ineffective with gpt-5-mini" rather than just "ineffective".
+- **Model-specific effectiveness**: An instruction that fails with one model may succeed with another. If an instruction variant was tested with multiple models (e.g., `gpt-5.4-mini + verbose` failed but `gpt-4.1 + verbose` passed), label it **mixed** — NOT ineffective. Only label an instruction **ineffective** if it failed across ALL models it was tested with. Always qualify: "ineffective with gpt-5.4-mini" rather than just "ineffective".
 - Always show the problematic instruction text and a concrete replacement
 ### Tool Usage
@@ -357,5 +357,5 @@ Use these sections as needed (skip sections with no content):
     - **No inline color styles** — use only the CSS class names (green, blue, amber, red) on metric-card and metric-value
 12. **Use pre-computed numbers** — The input includes a "Pre-computed Eval Statistics" section with exact values for pass rates, costs, tokens, winner designation, and aggregate stats (total tests, failures, agents, avg turns). Use these numbers verbatim. Never estimate or approximate.
 13. **Cost comparisons must use actual data** — When comparing costs between agents, use the **actual per-test cost** from the pre-computed statistics (total cost ÷ number of tests). Never cite model list pricing or theoretical cost differences. A cheaper model may use more tokens, making the realized cost difference much smaller than the per-token price difference.
-14. **Instruction labels must be model-specific** — Never label instructions as globally "ineffective" or globally "effective" when tested with multiple models producing different outcomes. If `gpt-5-mini + verbose` failed but `gpt-4.1 + verbose` passed, the instructions are "mixed" — effective with gpt-4.1, ineffective with gpt-5-mini.
+14. **Instruction labels must be model-specific** — Never label instructions as globally "ineffective" or globally "effective" when tested with multiple models producing different outcomes. If `gpt-5.4-mini + verbose` failed but `gpt-4.1 + verbose` passed, the instructions are "mixed" — effective with gpt-4.1, ineffective with gpt-5.4-mini.
 15. **Bullet lists need a blank line before them** — In markdown, a list must be preceded by a blank line to render correctly. NEVER put a bullet list directly after a `**bold label:**` on the next line — the markdown parser will collapse them into a single paragraph. Use `####` headings instead of bold labels when you need a label followed by a list.

{pytest_skill_engineering-0.6.4 → pytest_skill_engineering-0.6.6}/src/pytest_skill_engineering/reporting/insights.py

@@ -479,7 +479,7 @@ async def generate_insights(
     custom_agent_info: list[CustomAgentInfo] | None = None,
     prompt_names: list[str] | None = None,
     instruction_file_info: list[InstructionFileInfo] | None = None,
-    model: str = "copilot/gpt-5-mini",
+    model: str = "copilot/gpt-5.4-mini",
     cache_dir: Path | None = None,
     min_pass_rate: int | None = None,
     analysis_prompt: str | None = None,
@@ -496,7 +496,7 @@ async def generate_insights(
         custom_agent_info: Custom agent metadata (optional)
         prompt_names: Names of prompt files tested (optional)
         instruction_file_info: Custom instruction file metadata (optional)
-        model: Model identifier (e.g., "copilot/gpt-5-mini", "azure/gpt-5-mini")
+        model: Model identifier (e.g., "copilot/gpt-5.4-mini", "azure/gpt-5.4-mini")
         cache_dir: Directory for caching results (optional)
         min_pass_rate: Minimum pass rate threshold for disqualifying agents
         analysis_prompt: Custom analysis prompt text. If None, uses the built-in