agent-release-gates 0.1.0__tar.gz

Files changed (314)
  1. agent_release_gates-0.1.0/.dockerignore +22 -0
  2. agent_release_gates-0.1.0/.gitattributes +1 -0
  3. agent_release_gates-0.1.0/.github/ISSUE_TEMPLATE/bug_report.yml +46 -0
  4. agent_release_gates-0.1.0/.github/ISSUE_TEMPLATE/config.yml +11 -0
  5. agent_release_gates-0.1.0/.github/ISSUE_TEMPLATE/eval_improvement.yml +51 -0
  6. agent_release_gates-0.1.0/.github/ISSUE_TEMPLATE/external_review.yml +47 -0
  7. agent_release_gates-0.1.0/.github/pull_request_template.md +17 -0
  8. agent_release_gates-0.1.0/.github/workflows/ci.yml +53 -0
  9. agent_release_gates-0.1.0/.github/workflows/pages.yml +59 -0
  10. agent_release_gates-0.1.0/.gitignore +25 -0
  11. agent_release_gates-0.1.0/.streamlit/config.toml +2 -0
  12. agent_release_gates-0.1.0/CONTRIBUTING.md +69 -0
  13. agent_release_gates-0.1.0/Dockerfile +34 -0
  14. agent_release_gates-0.1.0/LICENSE +21 -0
  15. agent_release_gates-0.1.0/PKG-INFO +201 -0
  16. agent_release_gates-0.1.0/README.md +172 -0
  17. agent_release_gates-0.1.0/app/streamlit_app.py +2581 -0
  18. agent_release_gates-0.1.0/config/action_risk_policy.yaml +29 -0
  19. agent_release_gates-0.1.0/config/incident_release_policy.json +10 -0
  20. agent_release_gates-0.1.0/config/safety_taxonomy.yaml +86 -0
  21. agent_release_gates-0.1.0/data/eval/golden_cases.jsonl +358 -0
  22. agent_release_gates-0.1.0/data/eval/human_calibration_cases.jsonl +24 -0
  23. agent_release_gates-0.1.0/data/eval/red_team_cases.jsonl +60 -0
  24. agent_release_gates-0.1.0/data/eval/safety_challenge_cases.jsonl +40 -0
  25. agent_release_gates-0.1.0/data/eval/safety_prevalence_cases.jsonl +80 -0
  26. agent_release_gates-0.1.0/data/eval/safety_secondary_review_validation_cases.jsonl +39 -0
  27. agent_release_gates-0.1.0/data/eval_cases/goal_conflict_cases.jsonl +12 -0
  28. agent_release_gates-0.1.0/data/eval_cases/incident_regression_cases.jsonl +8 -0
  29. agent_release_gates-0.1.0/data/eval_cases/instruction_hierarchy_cases.jsonl +12 -0
  30. agent_release_gates-0.1.0/data/eval_cases/memory_context_pollution_cases.jsonl +12 -0
  31. agent_release_gates-0.1.0/data/incidents/incident_cases.jsonl +8 -0
  32. agent_release_gates-0.1.0/data/incidents/trace_events.jsonl +16 -0
  33. agent_release_gates-0.1.0/data/public/techqa_rag_eval_sample.jsonl +160 -0
  34. agent_release_gates-0.1.0/data/public/wixqa_public_rag_sample.jsonl +80 -0
  35. agent_release_gates-0.1.0/data/review/external_human_review_label_template.csv +25 -0
  36. agent_release_gates-0.1.0/data/review/external_human_review_packet.csv +29 -0
  37. agent_release_gates-0.1.0/data/review/external_human_review_reviewer_guide.md +44 -0
  38. agent_release_gates-0.1.0/data/synthetic/raw_docs/runbooks.jsonl +24 -0
  39. agent_release_gates-0.1.0/data/synthetic/raw_tickets/tickets.jsonl +180 -0
  40. agent_release_gates-0.1.0/docker-compose.yml +58 -0
  41. agent_release_gates-0.1.0/docs/agent_safety_intervention_study.md +45 -0
  42. agent_release_gates-0.1.0/docs/architecture.md +51 -0
  43. agent_release_gates-0.1.0/docs/baseline_v1_summary.md +21 -0
  44. agent_release_gates-0.1.0/docs/benchmark_card.md +86 -0
  45. agent_release_gates-0.1.0/docs/candidate_results_schema.md +188 -0
  46. agent_release_gates-0.1.0/docs/dashboard.md +48 -0
  47. agent_release_gates-0.1.0/docs/data_dictionary.md +400 -0
  48. agent_release_gates-0.1.0/docs/dataset_card.md +94 -0
  49. agent_release_gates-0.1.0/docs/evaluate_your_agent_quickstart.md +160 -0
  50. agent_release_gates-0.1.0/docs/evaluation_plan.md +229 -0
  51. agent_release_gates-0.1.0/docs/external_human_review_protocol.md +62 -0
  52. agent_release_gates-0.1.0/docs/failure_taxonomy.md +64 -0
  53. agent_release_gates-0.1.0/docs/incident_pack_schema.md +150 -0
  54. agent_release_gates-0.1.0/docs/inspect_quickstart.md +47 -0
  55. agent_release_gates-0.1.0/docs/model_card.md +119 -0
  56. agent_release_gates-0.1.0/docs/operations_runbook.md +301 -0
  57. agent_release_gates-0.1.0/docs/publishing.md +54 -0
  58. agent_release_gates-0.1.0/docs/requirements_matrix.md +13 -0
  59. agent_release_gates-0.1.0/docs/research_roadmap.md +88 -0
  60. agent_release_gates-0.1.0/docs/reviewer_handoff_pack.md +100 -0
  61. agent_release_gates-0.1.0/docs/safety_threshold_decision_memo.md +68 -0
  62. agent_release_gates-0.1.0/docs/technical_artifacts.md +93 -0
  63. agent_release_gates-0.1.0/docs/threat_model.md +54 -0
  64. agent_release_gates-0.1.0/examples/incident_pack_minimal/agent_run_log.jsonl +1 -0
  65. agent_release_gates-0.1.0/examples/incident_pack_minimal/candidate_results_pass.jsonl +1 -0
  66. agent_release_gates-0.1.0/examples/incident_pack_minimal/incident_cases.jsonl +1 -0
  67. agent_release_gates-0.1.0/examples/incident_pack_minimal/incident_release_policy.json +10 -0
  68. agent_release_gates-0.1.0/examples/incident_pack_minimal/langchain_trace_log.jsonl +1 -0
  69. agent_release_gates-0.1.0/examples/incident_pack_minimal/trace_events.jsonl +2 -0
  70. agent_release_gates-0.1.0/internal_ai_agent_project_plan.md +108 -0
  71. agent_release_gates-0.1.0/ops/otel-collector-config.yaml +24 -0
  72. agent_release_gates-0.1.0/pyproject.toml +89 -0
  73. agent_release_gates-0.1.0/reports/action_gate_intervention.json +121 -0
  74. agent_release_gates-0.1.0/reports/action_gate_intervention.md +20 -0
  75. agent_release_gates-0.1.0/reports/agent_eval_cases.jsonl +180 -0
  76. agent_release_gates-0.1.0/reports/agent_eval_summary.json +27 -0
  77. agent_release_gates-0.1.0/reports/agent_otel_spans.jsonl +50 -0
  78. agent_release_gates-0.1.0/reports/agent_safety_intervention_study.json +55 -0
  79. agent_release_gates-0.1.0/reports/agent_trace_examples.jsonl +10 -0
  80. agent_release_gates-0.1.0/reports/baseline_eval_cases.jsonl +358 -0
  81. agent_release_gates-0.1.0/reports/baseline_eval_summary.json +1043 -0
  82. agent_release_gates-0.1.0/reports/baseline_v1_summary.json +32 -0
  83. agent_release_gates-0.1.0/reports/collector_export_preview.json +17 -0
  84. agent_release_gates-0.1.0/reports/dataset_profile.json +193 -0
  85. agent_release_gates-0.1.0/reports/embedding_eval_cases.jsonl +358 -0
  86. agent_release_gates-0.1.0/reports/embedding_eval_summary.json +502 -0
  87. agent_release_gates-0.1.0/reports/eval_comparison.json +42 -0
  88. agent_release_gates-0.1.0/reports/evaluation_gates.json +163 -0
  89. agent_release_gates-0.1.0/reports/evaluation_history.json +87 -0
  90. agent_release_gates-0.1.0/reports/evaluation_report.html +156 -0
  91. agent_release_gates-0.1.0/reports/evaluation_report.md +1021 -0
  92. agent_release_gates-0.1.0/reports/evaluation_report.pdf +2590 -0
  93. agent_release_gates-0.1.0/reports/external_human_review_cases.jsonl +0 -0
  94. agent_release_gates-0.1.0/reports/external_human_review_manifest.json +56 -0
  95. agent_release_gates-0.1.0/reports/external_human_review_summary.json +47 -0
  96. agent_release_gates-0.1.0/reports/extraction_eval_cases.jsonl +180 -0
  97. agent_release_gates-0.1.0/reports/extraction_eval_summary.json +16 -0
  98. agent_release_gates-0.1.0/reports/failure_taxonomy_summary.json +143 -0
  99. agent_release_gates-0.1.0/reports/goal_conflict_intervention.json +212 -0
  100. agent_release_gates-0.1.0/reports/goal_conflict_intervention.md +33 -0
  101. agent_release_gates-0.1.0/reports/human_calibration_cases.jsonl +24 -0
  102. agent_release_gates-0.1.0/reports/human_calibration_summary.json +311 -0
  103. agent_release_gates-0.1.0/reports/hybrid_eval_cases.jsonl +358 -0
  104. agent_release_gates-0.1.0/reports/hybrid_eval_summary.json +564 -0
  105. agent_release_gates-0.1.0/reports/improved_eval_cases.jsonl +358 -0
  106. agent_release_gates-0.1.0/reports/improved_eval_summary.json +775 -0
  107. agent_release_gates-0.1.0/reports/incident_memo_INC-2026-0001.md +19 -0
  108. agent_release_gates-0.1.0/reports/incident_memo_INC-2026-0002.md +19 -0
  109. agent_release_gates-0.1.0/reports/incident_memo_INC-2026-0003.md +19 -0
  110. agent_release_gates-0.1.0/reports/incident_memo_INC-2026-0004.md +19 -0
  111. agent_release_gates-0.1.0/reports/incident_memo_INC-2026-0005.md +19 -0
  112. agent_release_gates-0.1.0/reports/incident_memo_INC-2026-0006.md +19 -0
  113. agent_release_gates-0.1.0/reports/incident_memo_INC-2026-0007.md +19 -0
  114. agent_release_gates-0.1.0/reports/incident_memo_INC-2026-0008.md +19 -0
  115. agent_release_gates-0.1.0/reports/incident_release_gates.json +89 -0
  116. agent_release_gates-0.1.0/reports/incident_replay_runs.jsonl +8 -0
  117. agent_release_gates-0.1.0/reports/incident_replay_summary.json +87 -0
  118. agent_release_gates-0.1.0/reports/incident_response_plan.json +251 -0
  119. agent_release_gates-0.1.0/reports/instruction_hierarchy_intervention.json +143 -0
  120. agent_release_gates-0.1.0/reports/instruction_hierarchy_intervention.md +20 -0
  121. agent_release_gates-0.1.0/reports/judge_reliability_cases.jsonl +24 -0
  122. agent_release_gates-0.1.0/reports/judge_reliability_summary.json +271 -0
  123. agent_release_gates-0.1.0/reports/memory_context_intervention.json +197 -0
  124. agent_release_gates-0.1.0/reports/memory_context_intervention.md +33 -0
  125. agent_release_gates-0.1.0/reports/model_judge_adapter_status.json +19 -0
  126. agent_release_gates-0.1.0/reports/model_judge_provider_comparison.json +932 -0
  127. agent_release_gates-0.1.0/reports/model_judge_reviewed_summary.json +155 -0
  128. agent_release_gates-0.1.0/reports/model_judge_reviewed_summary_anthropic.json +143 -0
  129. agent_release_gates-0.1.0/reports/multi_model_comparison_plan.json +98 -0
  130. agent_release_gates-0.1.0/reports/observability_otel_spans.jsonl +1328 -0
  131. agent_release_gates-0.1.0/reports/observability_trace_index.json +1074 -0
  132. agent_release_gates-0.1.0/reports/public_rag_findings.json +93 -0
  133. agent_release_gates-0.1.0/reports/public_rag_model_reranker_adapter_status.json +36 -0
  134. agent_release_gates-0.1.0/reports/public_rag_model_reranker_packet.jsonl +24 -0
  135. agent_release_gates-0.1.0/reports/public_rag_reranker_eval.json +443 -0
  136. agent_release_gates-0.1.0/reports/public_rag_reranking_opportunity.json +370 -0
  137. agent_release_gates-0.1.0/reports/rag_grounding_intervention.json +593 -0
  138. agent_release_gates-0.1.0/reports/rag_grounding_intervention.md +33 -0
  139. agent_release_gates-0.1.0/reports/retriever_comparison.json +59 -0
  140. agent_release_gates-0.1.0/reports/retriever_metric_snapshots.json +87 -0
  141. agent_release_gates-0.1.0/reports/safety_adjudication_notes.json +640 -0
  142. agent_release_gates-0.1.0/reports/safety_classifier_eval_cases.jsonl +120 -0
  143. agent_release_gates-0.1.0/reports/safety_classifier_eval_summary.json +483 -0
  144. agent_release_gates-0.1.0/reports/safety_classifier_intervention_study.json +161 -0
  145. agent_release_gates-0.1.0/reports/safety_classifier_intervention_study.md +21 -0
  146. agent_release_gates-0.1.0/reports/safety_human_review_simulation.json +264 -0
  147. agent_release_gates-0.1.0/reports/safety_mitigation_impact.json +59 -0
  148. agent_release_gates-0.1.0/reports/safety_reviewer_disagreement_slices.json +180 -0
  149. agent_release_gates-0.1.0/reports/safety_secondary_review_band_analysis.json +74 -0
  150. agent_release_gates-0.1.0/reports/safety_secondary_review_floor_validation.json +1573 -0
  151. agent_release_gates-0.1.0/reports/safety_secondary_review_operating_recommendation.json +90 -0
  152. agent_release_gates-0.1.0/reports/safety_threshold_decision_memo.json +48 -0
  153. agent_release_gates-0.1.0/reports/safety_threshold_retuning.json +168 -0
  154. agent_release_gates-0.1.0/reports/safety_threshold_sweep.json +92 -0
  155. agent_release_gates-0.1.0/reports/security_eval_cases.jsonl +120 -0
  156. agent_release_gates-0.1.0/reports/security_eval_summary.json +195 -0
  157. agent_release_gates-0.1.0/reports/techqa_public_benchmark_profile.json +32 -0
  158. agent_release_gates-0.1.0/reports/techqa_public_rag_cases.jsonl +160 -0
  159. agent_release_gates-0.1.0/reports/techqa_public_rag_summary.json +326 -0
  160. agent_release_gates-0.1.0/reports/techqa_public_retriever_cases.jsonl +320 -0
  161. agent_release_gates-0.1.0/reports/techqa_public_retriever_comparison.json +65 -0
  162. agent_release_gates-0.1.0/reports/vector_eval_cases.jsonl +358 -0
  163. agent_release_gates-0.1.0/reports/vector_eval_summary.json +502 -0
  164. agent_release_gates-0.1.0/reports/wixqa_public_benchmark_profile.json +32 -0
  165. agent_release_gates-0.1.0/reports/wixqa_public_rag_cases.jsonl +80 -0
  166. agent_release_gates-0.1.0/reports/wixqa_public_rag_summary.json +297 -0
  167. agent_release_gates-0.1.0/reports/wixqa_public_retriever_cases.jsonl +160 -0
  168. agent_release_gates-0.1.0/reports/wixqa_public_retriever_comparison.json +58 -0
  169. agent_release_gates-0.1.0/requirements.txt +1 -0
  170. agent_release_gates-0.1.0/runtime.txt +1 -0
  171. agent_release_gates-0.1.0/schemas/candidate_results_v1.schema.json +95 -0
  172. agent_release_gates-0.1.0/schemas/incident_pack_v1.schema.json +187 -0
  173. agent_release_gates-0.1.0/scripts/agent_safety.py +15 -0
  174. agent_release_gates-0.1.0/scripts/build_public_site.py +1340 -0
  175. agent_release_gates-0.1.0/scripts/check_otel_collector_deployment.py +177 -0
  176. agent_release_gates-0.1.0/scripts/evaluate_action_gates.py +17 -0
  177. agent_release_gates-0.1.0/scripts/evaluate_goal_conflict_intervention.py +24 -0
  178. agent_release_gates-0.1.0/scripts/evaluate_grounding_interventions.py +17 -0
  179. agent_release_gates-0.1.0/scripts/evaluate_instruction_hierarchy.py +19 -0
  180. agent_release_gates-0.1.0/scripts/evaluate_memory_context_intervention.py +17 -0
  181. agent_release_gates-0.1.0/scripts/export_candidate_results.py +76 -0
  182. agent_release_gates-0.1.0/scripts/export_otel_collector.py +69 -0
  183. agent_release_gates-0.1.0/scripts/export_public_report.py +15 -0
  184. agent_release_gates-0.1.0/scripts/generate_synthetic_data.py +15 -0
  185. agent_release_gates-0.1.0/scripts/prepare_external_human_review.py +15 -0
  186. agent_release_gates-0.1.0/scripts/prepare_multi_model_comparison.py +17 -0
  187. agent_release_gates-0.1.0/scripts/prepare_public_rag_model_reranker_packet.py +33 -0
  188. agent_release_gates-0.1.0/scripts/prepare_techqa_public_benchmark.py +38 -0
  189. agent_release_gates-0.1.0/scripts/prepare_wixqa_public_benchmark.py +38 -0
  190. agent_release_gates-0.1.0/scripts/promote_model_judge_result.py +67 -0
  191. agent_release_gates-0.1.0/scripts/run_agent_eval.py +17 -0
  192. agent_release_gates-0.1.0/scripts/run_all_evals.py +445 -0
  193. agent_release_gates-0.1.0/scripts/run_baseline_eval.py +19 -0
  194. agent_release_gates-0.1.0/scripts/run_extraction_eval.py +16 -0
  195. agent_release_gates-0.1.0/scripts/run_incident_replay.py +71 -0
  196. agent_release_gates-0.1.0/scripts/run_model_judge_eval.py +248 -0
  197. agent_release_gates-0.1.0/scripts/run_provider_embedding_eval.py +133 -0
  198. agent_release_gates-0.1.0/scripts/run_real_agent_replay.py +65 -0
  199. agent_release_gates-0.1.0/scripts/run_safety_classifier_eval.py +22 -0
  200. agent_release_gates-0.1.0/scripts/run_security_eval.py +16 -0
  201. agent_release_gates-0.1.0/scripts/run_techqa_public_eval.py +18 -0
  202. agent_release_gates-0.1.0/scripts/run_wixqa_public_eval.py +18 -0
  203. agent_release_gates-0.1.0/scripts/smoke_otel_collector.py +56 -0
  204. agent_release_gates-0.1.0/src/internal_ai_agent/__init__.py +5 -0
  205. agent_release_gates-0.1.0/src/internal_ai_agent/agent/__init__.py +1 -0
  206. agent_release_gates-0.1.0/src/internal_ai_agent/agent/schemas.py +47 -0
  207. agent_release_gates-0.1.0/src/internal_ai_agent/agent/tools.py +86 -0
  208. agent_release_gates-0.1.0/src/internal_ai_agent/agent/workflow.py +254 -0
  209. agent_release_gates-0.1.0/src/internal_ai_agent/api/__init__.py +1 -0
  210. agent_release_gates-0.1.0/src/internal_ai_agent/api/main.py +206 -0
  211. agent_release_gates-0.1.0/src/internal_ai_agent/api/schemas.py +50 -0
  212. agent_release_gates-0.1.0/src/internal_ai_agent/cli.py +111 -0
  213. agent_release_gates-0.1.0/src/internal_ai_agent/dashboard/__init__.py +1 -0
  214. agent_release_gates-0.1.0/src/internal_ai_agent/dashboard/data.py +1763 -0
  215. agent_release_gates-0.1.0/src/internal_ai_agent/data/__init__.py +1 -0
  216. agent_release_gates-0.1.0/src/internal_ai_agent/data/synthetic.py +2526 -0
  217. agent_release_gates-0.1.0/src/internal_ai_agent/evals/__init__.py +1 -0
  218. agent_release_gates-0.1.0/src/internal_ai_agent/evals/agent.py +185 -0
  219. agent_release_gates-0.1.0/src/internal_ai_agent/evals/candidate_results_export.py +724 -0
  220. agent_release_gates-0.1.0/src/internal_ai_agent/evals/dataset_profile.py +148 -0
  221. agent_release_gates-0.1.0/src/internal_ai_agent/evals/external_review.py +535 -0
  222. agent_release_gates-0.1.0/src/internal_ai_agent/evals/extraction.py +71 -0
  223. agent_release_gates-0.1.0/src/internal_ai_agent/evals/failure_taxonomy.py +218 -0
  224. agent_release_gates-0.1.0/src/internal_ai_agent/evals/gates.py +264 -0
  225. agent_release_gates-0.1.0/src/internal_ai_agent/evals/goal_conflict_intervention.py +363 -0
  226. agent_release_gates-0.1.0/src/internal_ai_agent/evals/human_calibration.py +617 -0
  227. agent_release_gates-0.1.0/src/internal_ai_agent/evals/incident_replay.py +2288 -0
  228. agent_release_gates-0.1.0/src/internal_ai_agent/evals/intervention_study.py +1175 -0
  229. agent_release_gates-0.1.0/src/internal_ai_agent/evals/memory_context_intervention.py +368 -0
  230. agent_release_gates-0.1.0/src/internal_ai_agent/evals/model_judge.py +319 -0
  231. agent_release_gates-0.1.0/src/internal_ai_agent/evals/model_judge_promotion.py +105 -0
  232. agent_release_gates-0.1.0/src/internal_ai_agent/evals/multi_model_comparison.py +355 -0
  233. agent_release_gates-0.1.0/src/internal_ai_agent/evals/multilingual_safety.py +117 -0
  234. agent_release_gates-0.1.0/src/internal_ai_agent/evals/nist_rmf_mapping.py +160 -0
  235. agent_release_gates-0.1.0/src/internal_ai_agent/evals/public_rag_findings.py +229 -0
  236. agent_release_gates-0.1.0/src/internal_ai_agent/evals/public_rag_model_reranker.py +283 -0
  237. agent_release_gates-0.1.0/src/internal_ai_agent/evals/public_rag_reranker.py +390 -0
  238. agent_release_gates-0.1.0/src/internal_ai_agent/evals/public_rag_reranking.py +214 -0
  239. agent_release_gates-0.1.0/src/internal_ai_agent/evals/rag_grounding_intervention.py +449 -0
  240. agent_release_gates-0.1.0/src/internal_ai_agent/evals/runner.py +767 -0
  241. agent_release_gates-0.1.0/src/internal_ai_agent/evals/safety_classifier.py +1890 -0
  242. agent_release_gates-0.1.0/src/internal_ai_agent/evals/security.py +221 -0
  243. agent_release_gates-0.1.0/src/internal_ai_agent/evals/techqa_public.py +693 -0
  244. agent_release_gates-0.1.0/src/internal_ai_agent/evals/wixqa_public.py +616 -0
  245. agent_release_gates-0.1.0/src/internal_ai_agent/extraction/__init__.py +1 -0
  246. agent_release_gates-0.1.0/src/internal_ai_agent/extraction/schemas.py +28 -0
  247. agent_release_gates-0.1.0/src/internal_ai_agent/extraction/service.py +99 -0
  248. agent_release_gates-0.1.0/src/internal_ai_agent/inspect_suite/__init__.py +8 -0
  249. agent_release_gates-0.1.0/src/internal_ai_agent/inspect_suite/_registry.py +9 -0
  250. agent_release_gates-0.1.0/src/internal_ai_agent/inspect_suite/scorers.py +33 -0
  251. agent_release_gates-0.1.0/src/internal_ai_agent/inspect_suite/scoring.py +40 -0
  252. agent_release_gates-0.1.0/src/internal_ai_agent/inspect_suite/tasks.py +34 -0
  253. agent_release_gates-0.1.0/src/internal_ai_agent/io.py +23 -0
  254. agent_release_gates-0.1.0/src/internal_ai_agent/observability/__init__.py +1 -0
  255. agent_release_gates-0.1.0/src/internal_ai_agent/observability/audit.py +59 -0
  256. agent_release_gates-0.1.0/src/internal_ai_agent/observability/collector.py +246 -0
  257. agent_release_gates-0.1.0/src/internal_ai_agent/observability/collector_deployment.py +70 -0
  258. agent_release_gates-0.1.0/src/internal_ai_agent/observability/collector_smoke.py +143 -0
  259. agent_release_gates-0.1.0/src/internal_ai_agent/observability/otel.py +763 -0
  260. agent_release_gates-0.1.0/src/internal_ai_agent/observability/trace_index.py +192 -0
  261. agent_release_gates-0.1.0/src/internal_ai_agent/providers/__init__.py +1 -0
  262. agent_release_gates-0.1.0/src/internal_ai_agent/providers/agent_runner.py +258 -0
  263. agent_release_gates-0.1.0/src/internal_ai_agent/providers/anthropic_judge.py +223 -0
  264. agent_release_gates-0.1.0/src/internal_ai_agent/providers/openai_embeddings.py +113 -0
  265. agent_release_gates-0.1.0/src/internal_ai_agent/providers/openai_judge.py +214 -0
  266. agent_release_gates-0.1.0/src/internal_ai_agent/rag/__init__.py +1 -0
  267. agent_release_gates-0.1.0/src/internal_ai_agent/rag/baseline.py +1168 -0
  268. agent_release_gates-0.1.0/src/internal_ai_agent/reporting/__init__.py +1 -0
  269. agent_release_gates-0.1.0/src/internal_ai_agent/reporting/public_report.py +2327 -0
  270. agent_release_gates-0.1.0/src/internal_ai_agent/security/__init__.py +1 -0
  271. agent_release_gates-0.1.0/src/internal_ai_agent/security/action_safety.py +89 -0
  272. agent_release_gates-0.1.0/src/internal_ai_agent/security/policy.py +178 -0
  273. agent_release_gates-0.1.0/streamlit_app.py +6 -0
  274. agent_release_gates-0.1.0/tests/unit/golden/golden_cases.json +4874 -0
  275. agent_release_gates-0.1.0/tests/unit/golden/public_report.md +722 -0
  276. agent_release_gates-0.1.0/tests/unit/test_action_safety.py +137 -0
  277. agent_release_gates-0.1.0/tests/unit/test_agent_eval.py +41 -0
  278. agent_release_gates-0.1.0/tests/unit/test_agent_runner.py +120 -0
  279. agent_release_gates-0.1.0/tests/unit/test_agent_safety_cli.py +233 -0
  280. agent_release_gates-0.1.0/tests/unit/test_agent_workflow.py +88 -0
  281. agent_release_gates-0.1.0/tests/unit/test_api.py +278 -0
  282. agent_release_gates-0.1.0/tests/unit/test_baseline.py +488 -0
  283. agent_release_gates-0.1.0/tests/unit/test_candidate_results_export.py +719 -0
  284. agent_release_gates-0.1.0/tests/unit/test_cli.py +22 -0
  285. agent_release_gates-0.1.0/tests/unit/test_collector.py +263 -0
  286. agent_release_gates-0.1.0/tests/unit/test_dashboard_data.py +934 -0
  287. agent_release_gates-0.1.0/tests/unit/test_extraction.py +41 -0
  288. agent_release_gates-0.1.0/tests/unit/test_failure_taxonomy.py +57 -0
  289. agent_release_gates-0.1.0/tests/unit/test_gates.py +125 -0
  290. agent_release_gates-0.1.0/tests/unit/test_human_calibration.py +113 -0
  291. agent_release_gates-0.1.0/tests/unit/test_incident_replay.py +870 -0
  292. agent_release_gates-0.1.0/tests/unit/test_inspect_scoring.py +58 -0
  293. agent_release_gates-0.1.0/tests/unit/test_inspect_suite.py +65 -0
  294. agent_release_gates-0.1.0/tests/unit/test_model_judge.py +400 -0
  295. agent_release_gates-0.1.0/tests/unit/test_multi_model_comparison.py +133 -0
  296. agent_release_gates-0.1.0/tests/unit/test_multilingual_safety.py +30 -0
  297. agent_release_gates-0.1.0/tests/unit/test_nist_rmf_mapping.py +52 -0
  298. agent_release_gates-0.1.0/tests/unit/test_otel.py +384 -0
  299. agent_release_gates-0.1.0/tests/unit/test_provider_embeddings.py +110 -0
  300. agent_release_gates-0.1.0/tests/unit/test_public_rag_findings.py +112 -0
  301. agent_release_gates-0.1.0/tests/unit/test_public_rag_model_reranker.py +40 -0
  302. agent_release_gates-0.1.0/tests/unit/test_public_rag_reranker.py +43 -0
  303. agent_release_gates-0.1.0/tests/unit/test_public_rag_reranking.py +75 -0
  304. agent_release_gates-0.1.0/tests/unit/test_public_report.py +367 -0
  305. agent_release_gates-0.1.0/tests/unit/test_public_site.py +85 -0
  306. agent_release_gates-0.1.0/tests/unit/test_run_all_evals.py +210 -0
  307. agent_release_gates-0.1.0/tests/unit/test_safety_classifier_eval.py +212 -0
  308. agent_release_gates-0.1.0/tests/unit/test_security_eval.py +137 -0
  309. agent_release_gates-0.1.0/tests/unit/test_streamlit_app.py +20 -0
  310. agent_release_gates-0.1.0/tests/unit/test_synthetic_data.py +111 -0
  311. agent_release_gates-0.1.0/tests/unit/test_techqa_public_eval.py +86 -0
  312. agent_release_gates-0.1.0/tests/unit/test_trace_index.py +63 -0
  313. agent_release_gates-0.1.0/tests/unit/test_wixqa_public_eval.py +116 -0
  314. agent_release_gates-0.1.0/uv.lock +1417 -0
@@ -0,0 +1,22 @@
+ .git/
+ .github/
+ .venv/
+ __pycache__/
+ *.py[cod]
+ .pytest_cache/
+ .ruff_cache/
+ .coverage
+ htmlcov/
+ .env
+ .env.*
+ !.env.example
+ dist/
+ build/
+ *.egg-info/
+ internal_ai_agent_project_plan.md
+ reports/collector_export_smoke.json
+ reports/collector_deployment_check.json
+ reports/provider_embedding_eval_status.json
+ reports/provider_embedding_eval_summary.json
+ reports/provider_embedding_eval_cases.jsonl
+ public/
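The ignore list above pairs the broad patterns `.env` and `.env.*` with the exception `!.env.example`. As a rough illustration of how such an exception resolves, here is a minimal sketch that assumes last-match-wins semantics and uses Python's `fnmatchcase` as a stand-in for Docker's own matcher. The helper `is_ignored` is hypothetical, handles only top-level names, and is not part of the package.

```python
from fnmatch import fnmatchcase

# A subset of the patterns from the hunk above, in the same order.
PATTERNS = ["__pycache__/", "*.py[cod]", ".env", ".env.*", "!.env.example"]


def is_ignored(name: str, patterns: list[str]) -> bool:
    """Return True if a top-level name is excluded; the last matching pattern wins."""
    ignored = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        body = pattern[1:] if negated else pattern
        # Approximation: strip a trailing slash and match the bare name only.
        if fnmatchcase(name, body.rstrip("/")):
            ignored = not negated
    return ignored


if __name__ == "__main__":
    for name in [".env", ".env.local", ".env.example", "module.pyc", "README.md"]:
        print(name, is_ignored(name, PATTERNS))
```

Under these assumptions `.env.example` stays in the build context because the negated pattern comes after `.env.*`; moving it before the broad pattern would cause it to be excluded again.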
@@ -0,0 +1 @@
+ *.pdf binary
@@ -0,0 +1,46 @@
+ name: Bug or reproducibility issue
+ description: Report a failing command, broken artifact, dashboard issue, or reproducibility problem.
+ title: "[Bug]: "
+ labels: ["bug"]
+ body:
+   - type: markdown
+     attributes:
+       value: |
+         Please include enough context to reproduce the issue. Do not include API keys, credentials, private data, production logs, or confidential examples.
+   - type: textarea
+     id: what_happened
+     attributes:
+       label: What happened?
+       description: Describe the failure or unexpected behavior.
+     validations:
+       required: true
+   - type: textarea
+     id: command
+     attributes:
+       label: Command or page
+       description: Paste the command, URL, or page where the issue occurred. Redact secrets.
+       placeholder: "uv run python scripts/run_all_evals.py"
+     validations:
+       required: false
+   - type: textarea
+     id: expected
+     attributes:
+       label: Expected behavior
+       description: What did you expect to happen?
+     validations:
+       required: true
+   - type: textarea
+     id: environment
+     attributes:
+       label: Environment
+       description: OS, Python version, browser, or relevant setup detail.
+       placeholder: "Windows 11, Python 3.12, uv installed"
+     validations:
+       required: false
+   - type: checkboxes
+     id: checks
+     attributes:
+       label: Safety check
+       options:
+         - label: I have removed credentials, personal data, private company data, and confidential logs.
+           required: true
@@ -0,0 +1,11 @@
+ blank_issues_enabled: false
+ contact_links:
+   - name: Reviewer handoff pack
+     url: https://rosscyking1115.github.io/agent-release-gates/reviewer_handoff_pack.md
+     about: Read the external-review instructions before volunteering to label cases.
+   - name: Public dashboard
+     url: https://agent-evaluation-lab.streamlit.app/
+     about: Explore the interactive evaluation dashboard.
+   - name: Full public report
+     url: https://rosscyking1115.github.io/agent-release-gates/evaluation_report.html
+     about: Review the generated evaluation report and current limitations.
@@ -0,0 +1,51 @@
+ name: Benchmark or evaluation improvement
+ description: Suggest a dataset, metric, failure mode, intervention, or documentation improvement.
+ title: "[Eval improvement]: "
+ labels: ["evaluation", "enhancement"]
+ body:
+   - type: markdown
+     attributes:
+       value: |
+         Use this for benchmark-quality improvements. Please keep the project boundary conservative: no private company documents, credentials, production logs, customer data, employee data, or confidential workflows.
+   - type: dropdown
+     id: area
+     attributes:
+       label: Area
+       options:
+         - Retrieval or RAG grounding
+         - Safety classifier or prevalence estimation
+         - Agent/tool governance
+         - Memory/context pollution
+         - Goal-conflict arbitration
+         - Multi-model comparison
+         - Human review or judge reliability
+         - Public report or dashboard
+         - Documentation
+         - Other
+     validations:
+       required: true
+   - type: textarea
+     id: proposal
+     attributes:
+       label: Proposal
+       description: What should change, and why would it improve evaluation quality?
+       placeholder: "Describe the failure mode, metric, dataset, or report improvement."
+     validations:
+       required: true
+   - type: textarea
+     id: reproducibility
+     attributes:
+       label: Reproducibility plan
+       description: How could the change be tested or regenerated?
+       placeholder: "Example: add cases, update expected metrics, run scripts/run_all_evals.py and pytest."
+     validations:
+       required: false
+   - type: checkboxes
+     id: data_boundary
+     attributes:
+       label: Data boundary
+       options:
+         - label: This suggestion does not require private, confidential, personal, or production data.
+           required: true
+         - label: If using a public dataset, I can provide source, license, and sampling details.
+           required: false
@@ -0,0 +1,47 @@
+ name: External review volunteer
+ description: Volunteer to independently review calibration cases or benchmark claims.
+ title: "[External review]: "
+ labels: ["external-review"]
+ body:
+   - type: markdown
+     attributes:
+       value: |
+         Thank you for helping review Agent Release Safety Gates.
+
+         Please do not paste completed labels, personal information, credentials, private documents, or confidential examples into a public issue. This issue is only for volunteering, clarifying scope, or reporting review-process questions.
+   - type: dropdown
+     id: review_type
+     attributes:
+       label: Review type
+       options:
+         - Safety-label calibration cases
+         - Report or benchmark-methodology review
+         - Dataset-boundary review
+         - Reproducibility review
+         - Other review
+     validations:
+       required: true
+   - type: checkboxes
+     id: independence
+     attributes:
+       label: Independence check
+       options:
+         - label: I have not inspected maintainer labels, classifier outputs, hosted judge outputs, or source code for the cases I plan to label.
+           required: false
+         - label: I can use a pseudonymous reviewer id and avoid sharing personal information in label files.
+           required: true
+   - type: textarea
+     id: context
+     attributes:
+       label: Relevant context
+       description: Briefly describe your review background or what you want to review. Do not include private data.
+       placeholder: "Example: I can independently label the 24 safety calibration cases this week."
+     validations:
+       required: true
+   - type: textarea
+     id: questions
+     attributes:
+       label: Questions or constraints
+       description: Add any process questions or availability constraints.
+     validations:
+       required: false
@@ -0,0 +1,17 @@
+ ## Summary
+
+ Describe the evaluation, documentation, dashboard, or reproducibility change.
+
+ ## Checks
+
+ - [ ] Synthetic and public-data tracks remain separated.
+ - [ ] No private company documents, customer data, employee data, credentials, production logs, or confidential workflows were added.
+ - [ ] New or changed metrics include limitations.
+ - [ ] Generated reports or public-site artifacts were regenerated when needed.
+ - [ ] Tests or checks were run, or the reason they were skipped is documented.
+
+ ## Verification
+
+ ```text
+ Paste relevant commands and results here.
+ ```
@@ -0,0 +1,53 @@
1
+ name: ci
2
+
3
+ on:
4
+ push:
5
+ branches: ["main"]
6
+ pull_request:
7
+ workflow_dispatch:
8
+
9
+ jobs:
10
+ test:
11
+ runs-on: ubuntu-latest
12
+ steps:
13
+ - name: Check out repository
14
+ uses: actions/checkout@v6
15
+
16
+ - name: Set up Python
17
+ uses: actions/setup-python@v6
18
+ with:
19
+ python-version: "3.12"
20
+
21
+ - name: Install uv
22
+ run: python -m pip install uv
23
+
24
+ - name: Install dependencies
25
+ run: uv sync --locked --dev
26
+
27
+ - name: Lint
28
+ run: uv run ruff check .
29
+
30
+ - name: Test
31
+ run: uv run pytest
32
+
33
+ - name: Regenerate deterministic eval reports
34
+ run: uv run python scripts/run_all_evals.py
35
+
36
+ - name: Smoke OTLP collector export
37
+ run: uv run python scripts/smoke_otel_collector.py
38
+
39
+ - name: Dry-run provider embedding eval
40
+ run: uv run python scripts/run_provider_embedding_eval.py
41
+
42
+ - name: Start OpenTelemetry Collector
43
+ run: docker compose --profile observability up -d otel-collector
44
+
45
+ - name: Check OpenTelemetry Collector deployment
46
+ run: uv run python scripts/check_otel_collector_deployment.py
47
+
48
+ - name: Stop OpenTelemetry Collector
49
+ if: always()
50
+ run: docker compose --profile observability down
51
+
52
+ - name: Build Docker image
53
+ run: docker build -t agent-release-gates:ci .
agent_release_gates-0.1.0/.github/workflows/pages.yml +59 -0
@@ -0,0 +1,59 @@
1
+ name: pages
2
+
3
+ on:
4
+ push:
5
+ branches: ["main"]
6
+ workflow_dispatch:
7
+
8
+ permissions:
9
+ contents: read
10
+ pages: write
11
+ id-token: write
12
+
13
+ env:
14
+ FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
15
+
16
+ concurrency:
17
+ group: "pages"
18
+ cancel-in-progress: false
19
+
20
+ jobs:
21
+ deploy:
22
+ environment:
23
+ name: github-pages
24
+ url: ${{ steps.deployment.outputs.page_url }}
25
+ runs-on: ubuntu-latest
26
+ steps:
27
+ - name: Check out repository
28
+ uses: actions/checkout@v6
29
+
30
+ - name: Set up Python
31
+ uses: actions/setup-python@v6
32
+ with:
33
+ python-version: "3.12"
34
+
35
+ - name: Install uv
36
+ run: python -m pip install uv
37
+
38
+ - name: Install dependencies
39
+ run: uv sync --locked --dev
40
+
41
+ - name: Regenerate deterministic eval reports
42
+ run: uv run python scripts/run_all_evals.py
43
+
44
+ - name: Build public site
45
+ run: uv run python scripts/build_public_site.py
46
+
47
+ - name: Configure Pages
48
+ uses: actions/configure-pages@v5
49
+ with:
50
+ enablement: true
51
+
52
+ - name: Upload Pages artifact
53
+ uses: actions/upload-pages-artifact@v3
54
+ with:
55
+ path: public
56
+
57
+ - name: Deploy to GitHub Pages
58
+ id: deployment
59
+ uses: actions/deploy-pages@v4
agent_release_gates-0.1.0/.gitignore +25 -0
@@ -0,0 +1,25 @@
1
+ .venv/
2
+ __pycache__/
3
+ *.py[cod]
4
+ .pytest_cache/
5
+ .ruff_cache/
6
+ .coverage
7
+ htmlcov/
8
+ .env
9
+ .env.*
10
+ !.env.example
11
+ dist/
12
+ build/
13
+ *.egg-info/
14
+ reports/collector_export_smoke.json
15
+ reports/collector_deployment_check.json
16
+ reports/provider_embedding_eval_status.json
17
+ reports/provider_embedding_eval_summary.json
18
+ reports/provider_embedding_eval_cases.jsonl
19
+ reports/model_judge_eval_status.json
20
+ reports/model_judge_eval_summary.json
21
+ reports/model_judge_eval_cases.jsonl
22
+ data/public/techqa_train.json
23
+ data/public/wixqa_expertwritten_test.jsonl
24
+ data/public/wixqa_kb_corpus.jsonl
25
+ /public/
agent_release_gates-0.1.0/.streamlit/config.toml +2 -0
@@ -0,0 +1,2 @@
1
+ [browser]
2
+ gatherUsageStats = false
agent_release_gates-0.1.0/CONTRIBUTING.md +69 -0
@@ -0,0 +1,69 @@
1
+ # Contributing
2
+
3
+ Thanks for taking a look at Agent Release Safety Gates.
4
+
5
+ This project is intended to be a public benchmark-style artifact. Contributions should improve evaluation quality, reproducibility, safety analysis, or clarity. Please keep the data boundary conservative: do not add private company documents, customer data, employee data, credentials, production logs, or confidential workflows.
6
+
7
+ ## Good Contribution Areas
8
+
9
+ - hand-authored benign, ambiguous, unsafe, prompt-injection, weak-evidence, excessive-agency, and tool-misuse cases
10
+ - clearer failure taxonomy labels
11
+ - public benchmark expansion with reproducible sampling
12
+ - judge-reliability experiments
13
+ - human-review calibration workflow
14
+ - multi-model adapters with explicit model ids, dates, settings, and limitations
15
+ - report, dashboard, or documentation improvements that make results easier to inspect
16
+ - tests that prevent benchmark or reporting regressions
17
+
18
+ ## Opening Issues
19
+
20
+ Use GitHub issues for public, non-sensitive project discussion:
21
+
22
+ - External review volunteer: use this to offer independent labels, methodology review, reproducibility review, or dataset-boundary review.
23
+ - Benchmark or evaluation improvement: use this for proposed datasets, metrics, failure modes, interventions, or report improvements.
24
+ - Bug or reproducibility issue: use this for failing commands, broken artifacts, dashboard problems, or unclear setup.
25
+
26
+ Do not paste completed reviewer labels, credentials, personal data, private company data, production logs, or confidential examples into public issues.
27
+
28
+ ## Data Rules
29
+
30
+ - Use synthetic internal-operations data unless a public dataset is explicitly documented.
31
+ - Keep public datasets separated from the synthetic benchmark in code, docs, and reports.
32
+ - Do not add real personal data or confidential business procedures.
33
+ - Do not imply the project measures any real organization's internal AI system.
34
+
35
+ ## Result Rules
36
+
37
+ - Do not publish provider-backed or multi-model results unless the run is reproducible and the model settings are documented.
38
+ - Separate synthetic labels, simulated reviewer labels, real human labels, and LLM-as-judge labels.
39
+ - Include limitations when adding new metrics.
40
+ - Add or update tests when changing scoring, reports, or release gates.
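One way to keep the four label sources from the Result Rules apart is to tag every label record with its source and refuse anything untagged. The sketch below is illustrative only: `LabelSource`, the `label_source` field, and `split_by_source` are hypothetical names, not part of this package.

```python
from collections import defaultdict
from enum import Enum


class LabelSource(str, Enum):
    """Hypothetical tags for the four label sources named in the Result Rules."""

    SYNTHETIC = "synthetic"
    SIMULATED_REVIEWER = "simulated_reviewer"
    HUMAN = "human"
    LLM_JUDGE = "llm_judge"


def split_by_source(records: list[dict]) -> dict[LabelSource, list[dict]]:
    """Group label records by source so each track is reported separately.

    Raises KeyError if a record has no `label_source` field and ValueError
    if the value is not one of the known sources.
    """
    grouped: dict[LabelSource, list[dict]] = defaultdict(list)
    for record in records:
        source = LabelSource(record["label_source"])
        grouped[source].append(record)
    return dict(grouped)
```

Failing loudly on an unknown source is a deliberate choice here: a silently defaulted label would end up mixed into another track.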
41
+
42
+ ## Local Checks
43
+
44
+ Run these before opening a pull request:
45
+
46
+ ```powershell
47
+ uv run ruff check .
48
+ uv run pytest
49
+ uv run python scripts/run_all_evals.py
50
+ uv run python scripts/build_public_site.py
51
+ ```
52
+
53
+ Optional checks:
54
+
55
+ ```powershell
56
+ uv run python scripts/smoke_otel_collector.py
57
+ docker compose --profile observability up -d otel-collector
58
+ uv run python scripts/check_otel_collector_deployment.py
59
+ docker compose --profile observability down
60
+ docker build -t agent-release-gates:local .
61
+ ```
62
+
63
+ ## Pull Request Checklist
64
+
65
+ - The change has a clear evaluation or documentation purpose.
66
+ - Synthetic and public-data tracks remain separated.
67
+ - New metrics are reproducible.
68
+ - New limitations are documented.
69
+ - Tests or generated artifacts are updated when needed.
agent_release_gates-0.1.0/Dockerfile +34 -0
@@ -0,0 +1,34 @@
1
+ FROM ghcr.io/astral-sh/uv:python3.12-bookworm-slim
2
+
3
+ WORKDIR /app
4
+
5
+ ENV PYTHONDONTWRITEBYTECODE=1
6
+ ENV PYTHONUNBUFFERED=1
7
+ ENV UV_LINK_MODE=copy
8
+ ENV STREAMLIT_BROWSER_GATHER_USAGE_STATS=false
9
+ # Sync the env once at build time; runtime `uv run` uses it without re-syncing
10
+ # (which would otherwise drop the optional api/dashboard extras).
11
+ ENV UV_NO_SYNC=1
12
+
13
+ COPY pyproject.toml uv.lock README.md streamlit_app.py ./
14
+ COPY src ./src
15
+
16
+ RUN uv sync --locked --no-dev --extra api --extra dashboard
17
+
18
+ COPY app ./app
19
+ COPY data ./data
20
+ COPY docs ./docs
21
+ COPY reports ./reports
22
+ COPY scripts ./scripts
23
+ COPY .streamlit ./.streamlit
24
+
25
+ RUN uv run --no-dev python scripts/run_all_evals.py
26
+
27
+ RUN useradd --create-home --shell /usr/sbin/nologin appuser \
28
+ && chown -R appuser:appuser /app
29
+
30
+ USER appuser
31
+
32
+ EXPOSE 8000
33
+
34
+ CMD ["uv", "run", "--no-dev", "uvicorn", "internal_ai_agent.api.main:app", "--host", "0.0.0.0", "--port", "8000"]
agent_release_gates-0.1.0/LICENSE +21 -0
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 rosscyking1115
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
agent_release_gates-0.1.0/PKG-INFO +201 -0
@@ -0,0 +1,201 @@
1
+ Metadata-Version: 2.4
2
+ Name: agent-release-gates
3
+ Version: 0.1.0
4
+ Summary: Release-readiness gates for AI agents: replay known incidents, apply policy-as-code gates, and produce ship/warn/block evidence before an agent, prompt, model, or tool-policy change ships.
5
+ Project-URL: Homepage, https://github.com/rosscyking1115/agent-release-gates
6
+ Project-URL: Repository, https://github.com/rosscyking1115/agent-release-gates
7
+ Project-URL: Documentation, https://rosscyking1115.github.io/agent-release-gates/
8
+ Project-URL: Issues, https://github.com/rosscyking1115/agent-release-gates/issues
9
+ Author: rosscyking1115
10
+ License-Expression: MIT
11
+ License-File: LICENSE
12
+ Keywords: agent,ai-safety,evaluation,inspect-ai,llm,red-team,release-gate
13
+ Classifier: Development Status :: 4 - Beta
14
+ Classifier: Intended Audience :: Developers
15
+ Classifier: Programming Language :: Python :: 3
16
+ Classifier: Programming Language :: Python :: 3.12
17
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
18
+ Classifier: Topic :: Software Development :: Quality Assurance
19
+ Requires-Python: >=3.12
20
+ Requires-Dist: pydantic>=2.8.0
21
+ Provides-Extra: api
22
+ Requires-Dist: fastapi>=0.115.0; extra == 'api'
23
+ Requires-Dist: uvicorn[standard]>=0.30.0; extra == 'api'
24
+ Provides-Extra: dashboard
25
+ Requires-Dist: altair>=6.0.0; extra == 'dashboard'
26
+ Requires-Dist: pandas>=2.2.0; extra == 'dashboard'
27
+ Requires-Dist: streamlit>=1.38.0; extra == 'dashboard'
28
+ Description-Content-Type: text/markdown
29
+
30
+ # Agent Release Safety Gates
31
+
32
+ A public release-readiness system for testing whether AI-agent workflow changes remain grounded, safe, auditable, and useful under retrieval, refusal, prompt-injection, incident replay, and approval-gated tool-use conditions.
33
+
34
+ This project is not a clone, assessment, or reverse-engineering attempt of any company's internal AI system. The controlled operations benchmark is synthetic by design; TechQA and WixQA are used separately as public retrieval-validation datasets.
35
+
36
+ ## Live Project
37
+
38
+ - Public project page: https://rosscyking1115.github.io/agent-release-gates/
39
+ - Full evaluation report (HTML): https://rosscyking1115.github.io/agent-release-gates/evaluation_report.html
40
+ - PDF report: https://rosscyking1115.github.io/agent-release-gates/evaluation_report.pdf
41
+ - Interactive dashboard: run locally (see [Run Locally](#run-locally)) or deploy on
42
+ Streamlit Cloud — see the [dashboard deployment guide](docs/dashboard.md). A hosted
43
+ instance must be set to public visibility to be shareable.
44
+
45
+ ## What This Project Does
46
+
47
+ The project evaluates an AI-agent workflow across five release questions:
48
+
49
+ - Does the agent retrieve the right evidence and cite it?
50
+ - Does it abstain or refuse when evidence is weak, unsafe, or prompt-injected?
51
+ - Does it require approval before mock side-effecting tool calls?
52
+ - Does it leave enough trace, audit, and monitoring evidence for review?
53
+ - Does it pass incident replay and policy-as-code release gates?
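The five questions above feed a ship / warn / block outcome. A minimal sketch of how individual gate results could roll up into that outcome follows. It assumes each gate is marked as blocking or non-blocking; `GateResult`, `Decision`, and `decide` are hypothetical names and do not reflect the package's actual policy format.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(str, Enum):
    SHIP = "ship"
    WARN = "warn"
    BLOCK = "block"


@dataclass(frozen=True)
class GateResult:
    name: str       # hypothetical gate name, e.g. "approval_before_side_effects"
    passed: bool    # did the candidate satisfy this gate?
    blocking: bool  # assumption: a failed blocking gate stops the release


def decide(results: list[GateResult]) -> Decision:
    """Roll up gate results: any failed blocking gate blocks, any other failure warns."""
    failed = [r for r in results if not r.passed]
    if any(r.blocking for r in failed):
        return Decision.BLOCK
    if failed:
        return Decision.WARN
    return Decision.SHIP
```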
54
+
55
+ The result is a reproducible evaluation artifact rather than a one-off dashboard: deterministic eval runners, generated reports, CI checks, Dockerized local execution, a Streamlit dashboard, and a GitHub Pages report site.
56
+
57
+ ## Product Direction
58
+
59
+ **Agent Release Safety Gates** is a release-readiness workflow for replaying known agent incidents, applying policy gates, and producing evidence before a changed agent, prompt, model, or tool policy ships. Its deterministic evaluation benchmark and runners are the evidence layer behind that workflow.
60
+
61
+ The first module is an Incident Replay Suite that turns redacted synthetic incidents into regression fixtures, replay results, release gates, and incident memos.
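As a rough illustration of what a replay check might look like, the sketch below compares the actions observed in a replay against a per-incident list of forbidden actions. The field names are hypothetical and do not follow the schema in `docs/incident_pack_schema.md`. Treating "closure" as "no must-not violation on replay" is also an assumption made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    incident_id: str
    must_not: frozenset[str]  # actions the agent must never take for this incident


def must_not_violations(incident: Incident, observed_actions: list[str]) -> list[str]:
    """Return the forbidden actions that appeared in the replayed run, sorted."""
    return sorted(incident.must_not & set(observed_actions))


def closure_rate(incidents: list[Incident], replays: dict[str, list[str]]) -> float:
    """Share of incidents whose replay shows no must-not violation.

    An incident with no recorded replay counts as closed only because it has
    no observed actions; a stricter check could treat it as a failure.
    """
    if not incidents:
        return 0.0
    closed = sum(
        1
        for incident in incidents
        if not must_not_violations(incident, replays.get(incident.incident_id, []))
    )
    return closed / len(incidents)
```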
62
+
63
+ To evaluate an external agent, use the public quickstart:
64
+
65
+ - [Evaluate your agent quickstart](docs/evaluate_your_agent_quickstart.md)
66
+ - [Incident pack schema](docs/incident_pack_schema.md)
67
+ - [Candidate results schema](docs/candidate_results_schema.md)
68
+
69
+ ## Current Evidence Snapshot
70
+
71
+ | Area | Current result |
72
+ | --- | --- |
73
+ | Controlled benchmark | 358 synthetic golden cases, 60 red-team cases, 180 synthetic operations tickets |
74
+ | Retrieval | 100.00% synthetic retrieval hit rate@3 with local TF-IDF/vector-style retrievers |
75
+ | Public RAG validation | 160 TechQA cases and 80 WixQA cases evaluated separately from the synthetic benchmark |
76
+ | Safety | 90.91% classifier recall, 0 high-severity false negatives in the current challenge set |
77
+ | Agent governance | 100.00% mock side-effect block rate and approval audit rate |
78
+ | Incident replay | 8 seeded synthetic incidents replayed, 100.00% closure rate, 0 replay must-not violations |
79
+ | Intervention study | 3 deterministic safety studies plus public RAG grounding and memory/context studies |
80
+ | Hosted judge calibration | Reviewed OpenAI and Anthropic judge runs with public-safe provider comparison |
81
+
82
+ These results are engineering evidence over controlled benchmarks. They are not claims of real-world production performance.
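For readers unfamiliar with the two headline metrics in the table, the sketch below shows one common way to compute hit rate@k and recall. It is not the project's scoring code, and the counts in the usage note are illustrative, not taken from the project's reports.

```python
def hit_rate_at_k(
    ranked_ids: list[list[str]],
    relevant_ids: list[list[str]],
    k: int = 3,
) -> float:
    """Fraction of queries whose top-k ranked ids include at least one relevant id."""
    if not ranked_ids:
        return 0.0
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if set(ranked[:k]) & set(relevant)
    )
    return hits / len(ranked_ids)


def recall(true_positives: int, false_negatives: int) -> float:
    """Share of truly unsafe cases that the classifier flagged."""
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0
```

With illustrative counts of 10 flagged unsafe cases and 1 missed case, `f"{recall(10, 1):.2%}"` renders as `90.91%`. A headline recall figure alone does not show how severe the missed cases were, which is why the table reports high-severity false negatives separately.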
83
+
84
+ ## Key Findings
85
+
86
+ - Safety metrics are not meaningful alone; the project reports over-review cost, benign auto-blocks, weak-evidence handling, and unsafe misses beside the headline scores.
87
+ - Layered safeguards reduce selected prompt-injection, unsafe-action, and unsafe-request failures in controlled studies while making review burden visible.
88
+ - Public TechQA and WixQA retrieval tracks help test whether the RAG harness works beyond self-contained synthetic data.
89
+ - Public RAG grounding thresholds reduce unsupported answer attempts while making abstention and review cost visible.
90
+ - Memory/context controls reduce polluted-memory following while preserving benign memory usefulness.
91
+ - Goal-conflict arbitration reduces unsafe goal-following while preserving benign task completion.
92
+ - Synthetic operations data remains useful for controlled tests that would be unsafe or impractical to run on confidential real workflows.
93
+ - The strongest next validation step is independent human labelling, followed by broader multi-model comparison.
94
+
95
+ ## What Is Included
96
+
97
+ - Evaluation runners for retrieval, extraction, safety classification, controlled-agent behavior, and observability.
98
+ - Baseline-vs-intervention studies for instruction hierarchy, action-risk gates, and safety classifier review policy.
99
+ - Public RAG grounding and abstention intervention study over TechQA and WixQA.
100
+ - Memory/context pollution intervention study covering stale, injected, and cross-user memory.
101
+ - Goal-conflict intervention study covering safety, evidence, privacy, and tool-risk arbitration.
102
+ - Incident replay suite with seeded incidents, replay matrix, release gates, regression fixtures, and generated memos.
103
+ - Public benchmark documentation, dataset boundaries, failure taxonomy, and external-review packet.
104
+ - Candidate-results exporters for generic agent logs and LangChain/LangSmith-style traces.
105
+ - Streamlit dashboard for interactive inspection.
106
+ - GitHub Pages report and PDF for public review.
107
+ - CI, Docker, Docker Compose, linting, tests, and deterministic report regeneration.
108
+
109
+ ## Install
110
+
111
+ Once published to PyPI (see [publishing guide](docs/publishing.md)), the core install
112
+ is lean — `pydantic` only — and gives you the CLI, the Inspect suite, the real-agent
113
+ runner, and the scoring logic. The API and dashboard are opt-in extras:
114
+
115
+ ```bash
116
+ pip install agent-release-gates # CLI + Inspect suite + scoring
117
+ pip install "agent-release-gates[api]" # + FastAPI evidence service
118
+ pip install "agent-release-gates[dashboard]" # + Streamlit dashboard deps
119
+ pip install agent-release-gates inspect_ai # to run under Inspect
120
+ ```
121
+
122
+ ```bash
123
+ agent-safety release-gate # ship / warn / block
124
+ inspect eval agent-release-gates/incident_replay --model openai/gpt-4.1-mini
125
+ ```
126
+
127
+ > Not yet on PyPI — build it yourself with `uv build`, or run from source below.
128
+
129
+ ## Run Locally
130
+
131
+ ```powershell
132
+ uv sync
133
+ uv run python scripts/run_all_evals.py
134
+ # Release gate (installed console command); exits non-zero on a blocking failure.
135
+ uv run agent-safety release-gate --policy config/incident_release_policy.json
136
+ # Interactive dashboard.
137
+ uv run streamlit run streamlit_app.py --server.port 8510
138
+ ```
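Because the release gate exits non-zero on a blocking failure, a wrapper script can use the exit code alone to decide whether to continue. The helper below is a minimal sketch; `no_blocking_failure` is a hypothetical name, and the only behaviour it relies on is the exit-code convention stated in the comment above.

```python
import subprocess


def no_blocking_failure(command: list[str]) -> bool:
    """Run a gate command and report whether it exited with status 0.

    check=False keeps subprocess from raising on a non-zero exit, so the
    caller decides what a blocking failure means for the pipeline.
    """
    completed = subprocess.run(command, check=False)
    return completed.returncode == 0


# Example call (requires uv and the project checkout):
# no_blocking_failure(
#     ["uv", "run", "agent-safety", "release-gate",
#      "--policy", "config/incident_release_policy.json"]
# )
```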
139
+
140
+ Open `http://localhost:8510`. Run the API and dashboard together with
141
+ `docker compose up --build`, then open `http://localhost:8510` and
142
+ `http://localhost:8000/health`.
143
+
144
+ Drive a real LLM through the release gate, or run the suite under Inspect:
145
+
146
+ ```powershell
147
+ # Any OpenAI-compatible / self-hosted open model endpoint.
148
+ $env:AGENT_RUNNER_API_KEY = "..."
149
+ uv run python scripts/run_real_agent_replay.py
150
+
151
+ # Inspect (UK AISI) -- optional peer dependency.
152
+ uv pip install inspect_ai
153
+ inspect eval agent-release-gates/incident_replay --model openai/gpt-4.1-mini
154
+ ```
155
+
156
+ ## Verification
157
+
158
+ ```powershell
159
+ uv run ruff check .
160
+ uv run pytest
161
+ uv run python scripts/run_all_evals.py
162
+ uv run agent-safety release-gate --policy config/incident_release_policy.json
163
+ uv run python scripts/build_public_site.py
164
+ docker build -t agent-release-safety-gates:local .
165
+ ```
166
+
167
+ CI runs linting, tests, deterministic report regeneration, local OpenTelemetry smoke testing, a dry-run provider embedding eval, Dockerized collector verification, and Docker build verification.
168
+
169
+ ## Review Materials
170
+
171
+ - Evaluate your agent quickstart: [docs/evaluate_your_agent_quickstart.md](docs/evaluate_your_agent_quickstart.md)
172
+ - Benchmark card: [docs/benchmark_card.md](docs/benchmark_card.md)
173
+ - Agent safety intervention study: [docs/agent_safety_intervention_study.md](docs/agent_safety_intervention_study.md)
174
+ - RAG grounding intervention report: [reports/rag_grounding_intervention.md](reports/rag_grounding_intervention.md)
175
+ - Memory context intervention report: [reports/memory_context_intervention.md](reports/memory_context_intervention.md)
176
+ - Goal conflict intervention report: [reports/goal_conflict_intervention.md](reports/goal_conflict_intervention.md)
177
+ - Incident pack schema: [docs/incident_pack_schema.md](docs/incident_pack_schema.md)
178
+ - Candidate results schema: [docs/candidate_results_schema.md](docs/candidate_results_schema.md)
179
+ - Incident replay summary: [reports/incident_replay_summary.json](reports/incident_replay_summary.json)
180
+ - Dataset card: [docs/dataset_card.md](docs/dataset_card.md)
181
+ - Failure taxonomy: [docs/failure_taxonomy.md](docs/failure_taxonomy.md)
182
+ - External reviewer handoff pack: [docs/reviewer_handoff_pack.md](docs/reviewer_handoff_pack.md)
183
+ - Technical artifact index: [docs/technical_artifacts.md](docs/technical_artifacts.md)
184
+ - Contribution guide: [CONTRIBUTING.md](CONTRIBUTING.md)
185
+
186
+ ## Current Limitations
187
+
188
+ - The controlled benchmark is synthetic and still partly templated.
189
+ - Public TechQA and WixQA tracks use compact samples, not the full upstream datasets.
190
+ - Human-review labels are currently simulated workflow labels; independent reviewer labels are prepared but not yet published.
191
+ - Hosted model evidence includes reviewed judge-calibration runs, not a broad multi-model agent comparison.
192
+ - Provider-backed embedding and reranker adapters are prepared, but credentialed hosted results are not claimed until reviewed.
193
+
194
+ ## Roadmap
195
+
196
+ - Collect independent human labels using the prepared review packet.
197
+ - Add reproducible multi-model comparison across hosted and open-source models.
198
+ - Expand public RAG validation beyond the current compact TechQA and WixQA samples.
199
+ - Add more framework-specific candidate-results exporters for common agent runners.
200
+ - Expand the paper-style intervention report with external reviewer disagreement analysis.
201
+ - Invite external review through issues and contribution guidelines.