PyPI - bigquery-agent-analytics - Versions diffs - 0.2.2__tar.gz → 0.2.3__tar.gz - Mend

bigquery-agent-analytics 0.2.2tar.gz → 0.2.3tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (240)

bigquery_agent_analytics-0.2.3/CHANGELOG.md ADDED

@@ -0,0 +1,131 @@
+# Changelog
+All notable changes to `bigquery-agent-analytics` are documented in this file.
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [Unreleased]
+### Fixed
+- **LLM-as-Judge AI.GENERATE path now executes against current
+  BigQuery.** Earlier versions emitted a table-valued
+  ``FROM session_traces, AI.GENERATE(...) AS result`` shape with
+  ``output_schema`` and a flat ``model_params`` dict. Current
+  ``AI.GENERATE`` is a scalar function that returns a STRUCT;
+  the table-valued form raises ``Table-valued function not found``
+  and the flat ``model_params`` raises ``does not conform to the
+  GenerateContent request body``. Mocked unit tests passed because
+  they bypassed real query execution. The SDK now renders a
+  ``SELECT AI.GENERATE(...).score, ...`` query with a
+  ``generationConfig``-wrapped ``model_params`` and ``output_schema``
+  on the scalar form, runs against live BigQuery, and unwraps the
+  returned struct's ``score`` / ``justification`` / ``status``
+  fields.
+- **LLM-as-Judge AI.GENERATE / ML.GENERATE_TEXT now uses the full
+  Python prompt template.** Previously both BQ-native paths sent
+  only ``prompt_template.split('{trace_text}')[0]`` to BigQuery,
+  silently dropping every instruction that followed the
+  placeholders — including the per-criterion output-format spec
+  the judge model needs to score consistently with the
+  API-fallback path. The two BQ paths and the Python API path now
+  produce comparable scores against the same prompt.
+### Added
+- ``evaluators.render_ai_generate_judge_query(...)`` is the new
+  entry point that builds the AI.GENERATE batch SQL.
+  ``connection_id`` is optional — when omitted the call uses
+  end-user credentials; when supplied it inlines the
+  ``connection_id =>`` argument so callers can route through a
+  service-account-owned connection when their environment
+  requires it.
+- ``Client.connection_id`` already existed; it is now plumbed
+  through to ``_ai_generate_judge`` so a connection set at client
+  construction propagates to the judge SQL automatically.
+- Live BigQuery integration tests for the LLM-judge AI.GENERATE
+  path (``tests/test_ai_generate_judge_live.py``). Skipped by
+  default; opt in with ``BQAA_RUN_LIVE_TESTS=1`` plus
+  ``PROJECT_ID`` / ``DATASET_ID``. Three tests cover SQL parse
+  acceptance, expected result-schema column names, and the
+  ``connection_id`` escape hatch when
+  ``BQAA_AI_GENERATE_CONNECTION_ID`` is set. Catches the class of
+  mock-divergence bug that let the prior broken template ship.
+- ``EvaluationReport.details["execution_mode"]`` is now populated
+  for LLM-as-Judge runs with one of ``ai_generate``,
+  ``ml_generate_text``, ``api_fallback``, or ``no_op`` — matching
+  the value space the categorical evaluator already exposes. When
+  an earlier tier raised before a later tier succeeded,
+  ``details["fallback_reason"]`` carries the chained exception
+  messages in attempt order, so CI and dashboards can audit which
+  path actually ran.
+- ``evaluators.split_judge_prompt_template(prompt_template)`` is
+  the helper the SQL paths use to safely substitute the template
+  into ``CONCAT()``; exposed publicly for downstream code that
+  needs the same shape.
+- ``bq-agent-sdk evaluate --exit-code`` FAIL lines now carry a
+  bounded ``feedback="…"`` snippet drawn from
+  ``SessionScore.llm_feedback`` for LLM-judge failures. The
+  snippet collapses internal whitespace to a single space,
+  truncates to 120 characters with an ellipsis, and is omitted
+  entirely for code-based metrics (which leave ``llm_feedback``
+  empty). CI logs now explain *why* the judge said the session
+  failed without forcing the reader to chase the JSON output.
+### Changed
+- ``--strict`` help text and ``SDK.md §4`` clarified to match shipped
+  behavior. ``--strict`` is a *visibility* knob — it stamps
+  ``details['parse_error']=True`` on AI.GENERATE/ML.GENERATE_TEXT
+  judge rows whose ``scores`` dict is empty, and adds a report-level
+  ``parse_errors`` counter. It does **not** flip any session's
+  pass/fail outcome: both BQ-native judge methods compute ``passed``
+  as ``bool(scores) and all(...)``, so empty-scores rows already
+  fail without the flag. API-fallback parse errors coerce to
+  ``score=0.0``, so they fail as low-score failures rather than
+  parse errors. For pass/fail-only CI consumers ``--strict`` is a
+  no-op; reach for it when a dashboard needs to tell "no parseable
+  score" apart from "low score."
+## [0.2.2] - 2026-04-24
+### Changed (breaking)
+- **Prebuilt `CodeEvaluator` gates now compare raw observed values
+  directly against the user-supplied budget.** `CodeEvaluator.latency`,
+  `.turn_count`, `.error_rate`, `.token_efficiency`, `.ttft`, and
+  `.cost_per_session` return `1.0` when the observed metric is within
+  budget and `0.0` otherwise. The previous implementation scored sessions
+  on a normalized `1.0 - (observed / budget)` scale against a `0.5` pass
+  cutoff, which effectively fired every gate at roughly half the budget
+  the user typed (e.g. `latency(threshold_ms=5000)` failed sessions at
+  `avg_latency_ms > 2500`). Users relying on the old sub-budget fail
+  behavior should lower their budgets to match their intent.
+- The scheduled streaming evaluator (`streaming_observability_v1`) uses
+  the same raw-budget gate semantics for consistency with the prebuilt
+  `CodeEvaluator` factories.
+### Added
+- `CodeEvaluator.add_metric` accepts `observed_key`, `observed_fn`, and
+  `budget` arguments that flow into `SessionScore.details[f"metric_{name}"]`
+  for downstream reporting. The CLI uses these to emit readable failure
+  lines without re-running the scorer.
+- `bq-agent-sdk evaluate --exit-code` now prints a per-session failure
+  summary on stderr before exiting non-zero. Each line names the
+  session_id, failing metric, observed value, and the budget it blew
+  through. Output is capped at the first 10 failing sessions to keep
+  CI logs scannable.
+- `bq-agent-sdk categorical-eval` gains `--exit-code`,
+  `--min-pass-rate`, and `--pass-category METRIC=CATEGORY`
+  (repeatable) flags. Declare which classification counts as passing
+  per metric, set a minimum pass rate across the run, and fail CI when
+  any metric falls below it. Multiple pass categories per metric are
+  OR'd together (e.g. `--pass-category tone=positive --pass-category
+  tone=neutral`). Missing metric names warn on stderr without failing
+  the run so configuration mistakes are visible in CI logs.
+## [0.2.1]
+- See `git log` for prior changes.
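
The 0.2.2 gate change recorded in this changelog is easiest to see with numbers. The following is a minimal sketch, not the SDK's implementation: `old_gate_passes` and `new_gate_passes` are hypothetical helper names, and the exact boundary handling (`score >= cutoff`, `observed <= budget`) is an assumption, since the changelog only says "within budget".

```python
def old_gate_passes(observed: float, budget: float, cutoff: float = 0.5) -> bool:
    """Pre-0.2.2 scoring as the changelog describes it (boundary handling assumed)."""
    score = 1.0 - (observed / budget)  # normalized score
    return score >= cutoff


def new_gate_passes(observed: float, budget: float) -> bool:
    """0.2.2+ scoring: compare the raw observed value against the budget."""
    return observed <= budget


# A session averaging 2600 ms against latency(threshold_ms=5000):
print(old_gate_passes(2600, 5000))  # False: 1 - 0.52 = 0.48, below the 0.5 cutoff
print(new_gate_passes(2600, 5000))  # True: 2600 is within the 5000 ms budget
```

Under the old scale the session fails at just over half the budget the user typed; under the new one it only fails once the observed value exceeds the budget itself.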

{bigquery_agent_analytics-0.2.2 → bigquery_agent_analytics-0.2.3}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: bigquery-agent-analytics
-Version: 0.2.2
+Version: 0.2.3
 Summary: SDK for analyzing and evaluating agent traces stored in BigQuery.
 Author: Google LLC
 License-Expression: Apache-2.0

{bigquery_agent_analytics-0.2.2 → bigquery_agent_analytics-0.2.3}/SDK.md RENAMED

@@ -275,7 +275,36 @@ print(report.summary())
 ### Strict Mode
-When `strict=True`, sessions where the LLM judge returns empty or unparseable output are marked as **failed** instead of silently passing. Operational counters are placed in `report.details` (not `aggregate_scores`) so downstream consumers can treat scores as purely normalized metrics:
+`strict=True` adds **parse-error visibility** — it does not flip
+any session's pass/fail outcome. Both BQ-native judge methods set
+`passed = bool(scores) and all(score >= threshold for score in
+scores.values())`, so a row whose `scores` dict is empty (the
+judge model returned no parseable output) already fails. Without
+`strict=True` you can't tell from the report whether a failed
+session failed because the judge gave a low score or because the
+judge gave nothing parseable at all.
+`strict=True` walks the merged report and:
+- Stamps `SessionScore.details["parse_error"] = True` on every
+  session whose `scores` dict is empty.
+- Adds a report-level `details["parse_errors"]` count plus
+  `details["parse_error_rate"]` (fraction of `total_sessions`).
+The API-fallback path coerces malformed model output to
+`score=0.0` and always populates `scores`, so its failures look
+like low-score failures rather than parse errors. `strict=True`
+won't surface them as parse errors today; it's an AI.GENERATE /
+ML.GENERATE_TEXT visibility knob in practice.
+For pass/fail-only consumers (CI gates with `--exit-code`),
+`strict=True` is a no-op. Reach for it when a dashboard or
+investigation needs to distinguish "no parseable score" from
+"low score" failures.
+Operational counters are placed in `report.details` (not
+`aggregate_scores`) so downstream consumers can treat scores as
+purely normalized metrics:
 ```python
 report = client.evaluate(

bigquery_agent_analytics-0.2.3/examples/ci/README.md ADDED

@@ -0,0 +1,42 @@
+# `examples/ci/`
+Reference CI artifacts for agent quality gates backed by
+BigQuery Agent Analytics.
+## `evaluate_thresholds.yml`
+Drop-in GitHub Actions workflow that runs four deterministic
+budgets (latency, token usage, tool error rate, turn count) on
+every PR, scoring the last 24 hours of production traces from an
+`agent_events` BigQuery table. Exits non-zero when any session
+breaches its budget, so a bad merge lights up the PR status
+before code ships.
+See the companion Medium post, *Your Agent Events Table Is Also a
+Test Suite*, for the narrative, threshold-setting guidance, and
+the companion categorical-eval gate that pairs naturally with
+this workflow.
+### Quick start
+1. Copy `evaluate_thresholds.yml` to `.github/workflows/` in
+   your agent repo.
+2. Set repository variables `PROJECT_ID` and `DATASET_ID` to the
+   GCP project + BigQuery dataset where your `agent_events` table
+   lives.
+3. Set the repository secret `GCP_SA_KEY` to a service-account JSON
+   with `bigquery.jobUser` + `bigquery.dataViewer` on the dataset.
+4. Replace `calendar_assistant` with your agent's name in all four
+   `--agent-id` flags inside the workflow.
+5. Tune the four `--threshold` numbers against your own production
+   distribution. A defensible starting point for each is "p95 of
+   the last 30 days + 10% buffer"; revisit after week one of CI
+   gating.
+### Requirements
+- `bigquery-agent-analytics >= 0.2.2` — earlier releases shipped
+  normalized `1.0 - observed/budget` gate scoring with a `0.5`
+  pass cutoff, which fires every gate at roughly half the budget
+  the user typed. 0.2.2 switched to raw-budget binary gates so
+  the `--threshold` value means what it says.
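
The README's suggested starting point, "p95 of the last 30 days + 10% buffer", can be sketched as below. This is an illustration only: `suggested_threshold` is a hypothetical helper that is not part of the SDK, the nearest-rank percentile method is one of several reasonable choices, and the samples are assumed to be per-session observations (for example latencies in milliseconds) that you have already exported from BigQuery.

```python
import math


def suggested_threshold(
    samples: list[float], percentile: float = 95.0, buffer: float = 0.10
) -> float:
    """Nearest-rank percentile of the samples, padded by a relative buffer."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # 1-based nearest rank: the smallest rank covering `percentile` percent of samples.
    rank = math.ceil(percentile * len(ordered) / 100.0)
    return ordered[max(rank, 1) - 1] * (1.0 + buffer)


# Hypothetical per-session latencies (ms) from the last 30 days.
latencies_ms = [float(v) for v in range(100, 2100, 100)]  # 100, 200, ..., 2000
print(suggested_threshold(latencies_ms))
```

The result is a candidate value for `--threshold`; rounding it to a memorable number before committing it to the workflow is usually fine.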

bigquery_agent_analytics-0.2.3/examples/ci/evaluate_thresholds.yml ADDED

@@ -0,0 +1,78 @@
+# .github/workflows/evaluate_thresholds.yml
+#
+# Reference GitHub Actions workflow that gates every PR against the
+# last 24 hours of production traces stored in an `agent_events`
+# BigQuery table. Four deterministic budgets run as separate steps
+# so a red PR status tells you which gate regressed.
+#
+# Companion to the Medium post "Your Agent Events Table Is Also a
+# Test Suite." See the post for the narrative and for the sidebar
+# on picking initial threshold values from 30-day production data.
+#
+# Requires bigquery-agent-analytics >= 0.2.2 — the first release
+# with the raw-budget `--threshold` semantics and the tight
+# `--exit-code` failure output this workflow depends on.
+#
+# To adopt this workflow in your own agent repo:
+#   1. Copy this file to .github/workflows/evaluate_thresholds.yml.
+#   2. Set repo variables PROJECT_ID and DATASET_ID to the GCP
+#      project + BigQuery dataset where your agent_events table
+#      lives.
+#   3. Set the repo secret GCP_SA_KEY to a service account JSON
+#      with bigquery.jobUser + bigquery.dataViewer on the dataset.
+#   4. Replace `calendar_assistant` with your agent's name in all
+#      four --agent-id flags.
+#   5. Tune the four --threshold numbers against your own
+#      production distribution. A defensible starting point for
+#      each is "p95 of last 30 days + 10% buffer"; revisit after
+#      week one of CI gating.
+name: Agent quality gate
+on:
+  pull_request:
+    paths:
+      - 'agents/**'
+      - 'prompts/**'
+jobs:
+  gate:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with: { python-version: '3.12' }
+      - run: pip install 'bigquery-agent-analytics>=0.2.2,<0.3.0'
+      - uses: google-github-actions/auth@v2
+        with: { credentials_json: '${{ secrets.GCP_SA_KEY }}' }
+      - name: Latency budget
+        run: >
+          bq-agent-sdk evaluate --evaluator=latency --threshold=5000
+          --last=24h --agent-id=calendar_assistant --exit-code
+          --project-id=${{ vars.PROJECT_ID }}
+          --dataset-id=${{ vars.DATASET_ID }}
+      - name: Token budget
+        # Tune this to your agent's real token distribution. A short
+        # system prompt + few-turn sessions will land in the low
+        # thousands; production agents with longer instructions and
+        # multi-turn tool chains typically want tens of thousands.
+        # Run `bq-agent-sdk evaluate --evaluator=token_efficiency
+        # --last=30d` without `--exit-code` once to see your own
+        # baseline before picking a number.
+        run: >
+          bq-agent-sdk evaluate --evaluator=token_efficiency --threshold=5000
+          --last=24h --agent-id=calendar_assistant --exit-code
+          --project-id=${{ vars.PROJECT_ID }}
+          --dataset-id=${{ vars.DATASET_ID }}
+      - name: Tool error rate
+        run: >
+          bq-agent-sdk evaluate --evaluator=error_rate --threshold=0.1
+          --last=24h --agent-id=calendar_assistant --exit-code
+          --project-id=${{ vars.PROJECT_ID }}
+          --dataset-id=${{ vars.DATASET_ID }}
+      - name: Turn count
+        run: >
+          bq-agent-sdk evaluate --evaluator=turn_count --threshold=10
+          --last=24h --agent-id=calendar_assistant --exit-code
+          --project-id=${{ vars.PROJECT_ID }}
+          --dataset-id=${{ vars.DATASET_ID }}

{bigquery_agent_analytics-0.2.2 → bigquery_agent_analytics-0.2.3}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "bigquery-agent-analytics"
-version = "0.2.2"
+version = "0.2.3"
 description = "SDK for analyzing and evaluating agent traces stored in BigQuery."
 readme = "README.md"
 license = "Apache-2.0"

{bigquery_agent_analytics-0.2.2 → bigquery_agent_analytics-0.2.3}/src/bigquery_agent_analytics/cli.py RENAMED

@@ -302,7 +302,14 @@ def evaluate(
     ),
     strict: bool = typer.Option(
         False,
-        help="Fail sessions with unparseable judge output.",
+        help=(
+            "Stamp parse-error metadata on AI.GENERATE judge rows with"
+            " empty or NULL typed output. Those rows already fail"
+            " (empty score < threshold); --strict adds"
+            " details['parse_error']=True and a report-level"
+            " parse_errors counter so dashboards can tell 'no"
+            " parseable score' apart from 'low score' failures."
+        ),
     ),
     endpoint: Optional[str] = typer.Option(
         None,
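
The reworded `--strict` help text describes a visibility change rather than a pass/fail change. A rough sketch of that behavior follows, using plain dicts in place of the SDK's `SessionScore` and `EvaluationReport` objects. `judge_passed` mirrors the `passed` expression quoted in SDK.md; `apply_strict` is a hypothetical name, not the SDK's function.

```python
from typing import Any


def judge_passed(scores: dict[str, float], threshold: float) -> bool:
    # Same shape as the expression quoted in SDK.md: empty scores already fail.
    return bool(scores) and all(s >= threshold for s in scores.values())


def apply_strict(sessions: list[dict[str, Any]]) -> dict[str, Any]:
    """Stamp parse_error on empty-score sessions; return report-level counters."""
    parse_errors = 0
    for session in sessions:
        if not session["scores"]:
            session.setdefault("details", {})["parse_error"] = True
            parse_errors += 1
    total = len(sessions)
    return {
        "parse_errors": parse_errors,
        "parse_error_rate": parse_errors / total if total else 0.0,
    }


sessions = [{"scores": {"correctness": 0.2}}, {"scores": {}}]
before = [judge_passed(s["scores"], 0.5) for s in sessions]
counters = apply_strict(sessions)
after = [judge_passed(s["scores"], 0.5) for s in sessions]
print(before == after)  # True: strict mode did not change any outcome
print(counters)
```

Both sessions fail with or without the flag; only the second one gains the `parse_error` stamp that lets a dashboard separate "no parseable score" from "low score".
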
@@ -368,6 +375,31 @@ def evaluate(
     raise typer.Exit(code=2)
+_FEEDBACK_SNIPPET_MAX = 120
+def _format_feedback_snippet(
+    feedback: Optional[str], max_chars: int = _FEEDBACK_SNIPPET_MAX
+) -> Optional[str]:
+  """Return a single-line, bounded snippet of an LLM-judge justification.
+  Collapses internal whitespace runs (including newlines) to a single
+  space so the snippet fits on one CI log line, then truncates to
+  ``max_chars`` with a trailing ``…`` when the original was longer.
+  Returns ``None`` for empty / whitespace-only input so callers can
+  cleanly skip the field.
+  """
+  if not feedback:
+    return None
+  collapsed = " ".join(feedback.split())
+  if not collapsed:
+    return None
+  if len(collapsed) <= max_chars:
+    return collapsed
+  # Reserve one char for the ellipsis to keep the visual width capped.
+  return collapsed[: max_chars - 1].rstrip() + "\u2026"
 def _emit_evaluate_failures(
     report: EvaluationReport, max_sessions: int = 10
 ) -> None:
@@ -377,10 +409,14 @@ def _emit_evaluate_failures(
   Prefers the raw observed + budget pair (``CodeEvaluator`` prebuilts);
   falls back to score + threshold when the metric didn't declare
   observed/budget (custom ``add_metric`` users, ``LLMAsJudge``
-  criteria). A failing session is guaranteed to produce at least one
-  FAIL line — never just the summary header.
-  Capped at ``max_sessions`` most-recent failures so CI logs stay scannable.
+  criteria). For LLM-judge failures the line also carries a bounded
+  ``feedback="…"`` snippet drawn from ``SessionScore.llm_feedback``
+  so CI logs explain *why* the judge said the session failed without
+  forcing the reader to chase the JSON output.
+  A failing session is guaranteed to produce at least one FAIL line —
+  never just the summary header. Capped at ``max_sessions`` most-recent
+  failures so CI logs stay scannable.
   """
   failed = [s for s in report.session_scores if not s.passed]
   if not failed:
@@ -393,6 +429,7 @@ def _emit_evaluate_failures(
   )
   shown = failed[:max_sessions]
   for s in shown:
+    feedback_snippet = _format_feedback_snippet(s.llm_feedback)
     emitted_for_session = False
     for metric_name, score in s.scores.items():
       detail = s.details.get(f"metric_{metric_name}") or {}
@@ -433,6 +470,12 @@ def _emit_evaluate_failures(
       parts.append(f"score={score:.4g}")
       if threshold is not None and isinstance(threshold, (int, float)):
         parts.append(f"threshold={threshold:.4g}")
+      # LLM judges populate ``SessionScore.llm_feedback`` with the
+      # judge's justification. Surface a bounded snippet on the FAIL
+      # line so CI logs explain *why* without dumping the full JSON.
+      # Code-based metrics leave ``llm_feedback`` empty and skip this.
+      if feedback_snippet is not None:
+        parts.append(f'feedback="{feedback_snippet}"')
       typer.echo("  " + " ".join(parts), err=True)
       emitted_for_session = True
@@ -441,10 +484,12 @@ def _emit_evaluate_failures(
     # while the session itself is flagged failed (a bug upstream) — we
     # still point the reader at the session id.
     if not emitted_for_session:
-      typer.echo(
-          f"  FAIL session={s.session_id} (no per-metric detail available)",
-          err=True,
-      )
+      fallback = f"  FAIL session={s.session_id}"
+      if feedback_snippet is not None:
+        fallback += f' feedback="{feedback_snippet}"'
+      else:
+        fallback += " (no per-metric detail available)"
+      typer.echo(fallback, err=True)
   if len(failed) > len(shown):
     typer.echo(
         f"  ... {len(failed) - len(shown)} more failing session(s) "
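
To see what the new `feedback="…"` field looks like in practice, the snippet logic from this diff can be exercised on its own. The block below is a standalone copy of `_format_feedback_snippet` under a public name chosen for illustration; the sample feedback strings are made up.

```python
from typing import Optional

FEEDBACK_SNIPPET_MAX = 120


def format_feedback_snippet(
    feedback: Optional[str], max_chars: int = FEEDBACK_SNIPPET_MAX
) -> Optional[str]:
    """Collapse whitespace to single spaces and cap the result at max_chars."""
    if not feedback:
        return None
    collapsed = " ".join(feedback.split())
    if not collapsed:
        return None
    if len(collapsed) <= max_chars:
        return collapsed
    # One character is reserved for the ellipsis so the width stays capped.
    return collapsed[: max_chars - 1].rstrip() + "\u2026"


print(format_feedback_snippet("Missed the\n  booking   step."))  # Missed the booking step.
print(format_feedback_snippet("   "))                            # None
print(len(format_feedback_snippet("x" * 200)))                   # 120
```

Because of the `rstrip()` call, a truncated snippet can come out slightly shorter than `max_chars` when the cut lands on a space, but it is never longer.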

{bigquery_agent_analytics-0.2.2 → bigquery_agent_analytics-0.2.3}/src/bigquery_agent_analytics/client.py RENAMED

@@ -78,8 +78,10 @@ from .evaluators import DEFAULT_ENDPOINT
 from .evaluators import EvaluationReport
 from .evaluators import LLM_JUDGE_BATCH_QUERY
 from .evaluators import LLMAsJudge
+from .evaluators import render_ai_generate_judge_query
 from .evaluators import SESSION_SUMMARY_QUERY
 from .evaluators import SessionScore
+from .evaluators import split_judge_prompt_template
 from .feedback import AnalysisConfig
 from .feedback import compute_drift
 from .feedback import compute_question_distribution
@@ -975,14 +977,27 @@ class Client:
     then falls back to the Gemini API.  Each path evaluates
     every criterion in the evaluator and merges the per-session
     scores into a single report.
+    Stamps ``report.details["execution_mode"]`` with one of
+    ``ai_generate``, ``ml_generate_text``, ``api_fallback`` so the
+    caller (and CI gates) can audit which path actually ran.
+    When an earlier tier raised before a later tier succeeded,
+    ``report.details["fallback_reason"]`` carries the chained
+    exception messages in attempt order. (The naming mirrors the
+    categorical evaluator's ``execution_mode`` value space for
+    consistency.)
     """
     criteria = evaluator._criteria
     if not criteria:
-      return _build_report(
+      report = _build_report(
           evaluator_name=evaluator.name,
           dataset=f"{self._table_ref} WHERE {where}",
           session_scores=[],
       )
+      report.details["execution_mode"] = "no_op"
+      return report
+    fallback_reasons: list[str] = []
     # Try AI.GENERATE (new path) when endpoint is not a legacy ref
     if not self._is_legacy_model_ref(self.endpoint):
@@ -997,17 +1012,20 @@ class Client:
               params,
           )
           criterion_reports.append((criterion, report))
-        return _merge_criterion_reports(
+        merged = _merge_criterion_reports(
             evaluator.name,
             f"{self._table_ref} WHERE {where}",
             criteria,
             criterion_reports,
         )
+        merged.details["execution_mode"] = "ai_generate"
+        return merged
       except Exception as e:
         logger.debug(
             "AI.GENERATE judge failed, trying legacy: %s",
             e,
         )
+        fallback_reasons.append(f"ai_generate: {e}")
     # Try legacy BQML batch evaluation
     text_model = (
@@ -1028,20 +1046,29 @@ class Client:
             text_model,
         )
         criterion_reports.append((criterion, report))
-      return _merge_criterion_reports(
+      merged = _merge_criterion_reports(
           evaluator.name,
           f"{self._table_ref} WHERE {where}",
           criteria,
           criterion_reports,
       )
+      merged.details["execution_mode"] = "ml_generate_text"
+      if fallback_reasons:
+        merged.details["fallback_reason"] = "; ".join(fallback_reasons)
+      return merged
     except Exception as e:
       logger.debug(
           "BQML judge failed, falling back to API: %s",
           e,
       )
+      fallback_reasons.append(f"ml_generate_text: {e}")
     # Fallback: fetch traces using same table/filter, evaluate via API
-    return self._api_judge(evaluator, table, where, params)
+    api_report = self._api_judge(evaluator, table, where, params)
+    api_report.details["execution_mode"] = "api_fallback"
+    if fallback_reasons:
+      api_report.details["fallback_reason"] = "; ".join(fallback_reasons)
+    return api_report
   def _ai_generate_judge(
       self,
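
The tiered fallback and the `execution_mode` / `fallback_reason` bookkeeping in the preceding hunks can be reduced to a small pattern. The sketch below is a simplification under stated assumptions: `run_with_fallback` is a hypothetical helper, each tier is modeled as a zero-argument callable returning a plain `details` dict, and, as in the diff, only the last tier's exception is allowed to propagate.

```python
from typing import Any, Callable


def run_with_fallback(
    tiers: list[tuple[str, Callable[[], dict[str, Any]]]],
) -> dict[str, Any]:
    """Try each (mode, runner) in order and record which one succeeded."""
    reasons: list[str] = []
    for index, (mode, runner) in enumerate(tiers):
        is_last = index == len(tiers) - 1
        try:
            details = runner()
        except Exception as exc:
            if is_last:
                raise  # nothing left to fall back to
            reasons.append(f"{mode}: {exc}")
            continue
        details["execution_mode"] = mode
        if reasons:
            # Chained messages from earlier tiers, in attempt order.
            details["fallback_reason"] = "; ".join(reasons)
        return details
    raise ValueError("no tiers supplied")


def broken_tier() -> dict[str, Any]:
    raise RuntimeError("Table-valued function not found")


details = run_with_fallback(
    [("ai_generate", broken_tier), ("ml_generate_text", lambda: {})]
)
print(details)
```

A caller or CI gate can then assert on `details["execution_mode"]` to catch runs that silently degraded to a slower or less consistent path.
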
@@ -1054,20 +1081,22 @@ class Client:
     """Evaluates using BigQuery AI.GENERATE with typed output."""
     from google.cloud import bigquery as bq
+    prefix, middle, suffix = split_judge_prompt_template(
+        criterion.prompt_template
+    )
     judge_params = list(params) + [
-        bq.ScalarQueryParameter(
-            "judge_prompt",
-            "STRING",
-            criterion.prompt_template.split("{trace_text}")[0],
-        ),
+        bq.ScalarQueryParameter("judge_prompt_prefix", "STRING", prefix),
+        bq.ScalarQueryParameter("judge_prompt_middle", "STRING", middle),
+        bq.ScalarQueryParameter("judge_prompt_suffix", "STRING", suffix),
     ]
-    query = AI_GENERATE_JUDGE_BATCH_QUERY.format(
+    query = render_ai_generate_judge_query(
         project=self.project_id,
         dataset=self.dataset_id,
         table=table,
         where=where,
         endpoint=self.endpoint,
+        connection_id=self.connection_id,
     )
     job_config = bq.QueryJobConfig(
         query_parameters=judge_params,
@@ -1121,12 +1150,13 @@ class Client:
     """Evaluates using BigQuery ML.GENERATE_TEXT."""
     from google.cloud import bigquery as bq
+    prefix, middle, suffix = split_judge_prompt_template(
+        criterion.prompt_template
+    )
     judge_params = list(params) + [
-        bq.ScalarQueryParameter(
-            "judge_prompt",
-            "STRING",
-            criterion.prompt_template.split("{trace_text}")[0],
-        ),
+        bq.ScalarQueryParameter("judge_prompt_prefix", "STRING", prefix),
+        bq.ScalarQueryParameter("judge_prompt_middle", "STRING", middle),
+        bq.ScalarQueryParameter("judge_prompt_suffix", "STRING", suffix),
     ]
     query = LLM_JUDGE_BATCH_QUERY.format(
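
The diff does not show the body of `split_judge_prompt_template`, only that it returns a `(prefix, middle, suffix)` triple that replaces the old `split("{trace_text}")[0]` truncation. The sketch below is therefore a guess at the shape, not the SDK's code: the second placeholder name `{final_response}` is an assumption, `split_template_sketch` and `reassemble` are hypothetical names, and the handling of a missing placeholder is one arbitrary choice.

```python
TRACE_PLACEHOLDER = "{trace_text}"
RESPONSE_PLACEHOLDER = "{final_response}"  # assumed name; not shown in the diff


def split_template_sketch(template: str) -> tuple[str, str, str]:
    """Split a template into (prefix, middle, suffix) around two placeholders."""
    prefix, found_trace, rest = template.partition(TRACE_PLACEHOLDER)
    if not found_trace:
        return template, "", ""
    middle, found_response, suffix = rest.partition(RESPONSE_PLACEHOLDER)
    if not found_response:
        # Arbitrary choice for this sketch: keep the remaining text as `middle`.
        return prefix, rest, ""
    return prefix, middle, suffix


def reassemble(parts: tuple[str, str, str], trace_text: str, final_response: str) -> str:
    """Python stand-in for a SQL CONCAT(prefix, trace, middle, response, suffix)."""
    prefix, middle, suffix = parts
    return prefix + trace_text + middle + final_response + suffix


template = "Score this.\nTrace: {trace_text}\nAnswer: {final_response}\nReply as JSON."
parts = split_template_sketch(template)
print(parts[2])  # the instructions that the old split()[0] call dropped
```

Passing the three pieces as separate query parameters keeps the instructions after the placeholders in the prompt while still letting the per-row trace text be spliced in on the SQL side.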
