PyPI - rubrify - Versions diffs - 0.0.1__tar.gz - Mend

rubrify 0.0.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (304) hide show

rubrify-0.0.1/PKG-INFO ADDED Viewed

@@ -0,0 +1,5 @@
+Metadata-Version: 2.4
+Name: rubrify
+Version: 0.0.1
+Summary: Placeholder – name reservation
+Requires-Python: >=3.8

rubrify-0.0.1/plans/rubrify-library_requirements.md ADDED Viewed

@@ -0,0 +1,227 @@
+# rubrify Python Library -- Requirements Dossier
+## Overview
+rubrify is a greenfield Python 3.11+ library for programmatically defining, generating, evaluating with, and evolving LLM rubrics. Its architecture is derived from a category-theoretic and control-theoretic formal framework grounded in secemp9's rubric corpus. The library bridges handcrafted XML rubrics (the `<LLM_JUDGE_SPEC>` format) and a programmatic Python API, with the core pipeline: Python objects -> XML serialization -> system prompt -> LLM API call -> structured output -> parsed Python result. It also supports `any2rubric` (using a model to generate rubrics from natural language) via composable instruction primitives.
+## Confirmed Scope and Non-Goals
+**Scope:**
+- 12 kernel primitive dataclasses mapping 1:1 to XML rubric elements
+- `Rubric`, `ConstraintRubric`, `ProductRubric`, `CoproductRubric` classes
+- XML serialization/deserialization via `xml.etree.ElementTree` (round-trip: `load()` -> modify -> `to_xml()`)
+- 16 property predicates with `validate()` returning `ValidationResult` (N1-N3 necessary, S1-S6 sufficient)
+- Rubric algebra: mutations as data, `evolve()`, `|` (union), `&` (product), `project()`, `reweight()`
+- LLM evaluation: `Rubric.evaluate()` and `ConstraintRubric.apply()` via OpenAI-compatible API
+- Built-in httpx client + `ChatClient` Protocol for external clients
+- Response parsing: JSON + XML dispatch based on `output_schema`
+- any2rubric generation via composable instruction primitives (`SCORING_GENERATOR`, `DETECTION_GENERATOR`, `COMPLIANCE_GENERATOR`)
+- `META_EVALUATOR` as a real scoring `Rubric` for quality-gating generated rubrics
+- `generate()` and `refine()` functions
+- pytest test suite with reference XML fixtures and mocked LLM calls
+- pyproject.toml with hatchling, justfile for dev tasks
+**Non-Goals:**
+- Async support (deferred to post-MVP)
+- CLI tool (deferred to Phase E / post-MVP)
+- Programmatic rule-based evaluation engine (slop-guard style pipeline execution) -- rubrify delegates evaluation to LLMs, not to local regex pipelines
+- Mutation reversibility / undo support (deferred)
+- Documentation website (deferred)
+- Model fine-tuning or training integration
+- Support for Python < 3.11
+## Current State and Key Discoveries
+### What Exists Today
+- No rubrify code exists yet -- this is a greenfield library
+- 5 research documents in `research/` totaling ~3300 lines of analysis
+- Reference rubric files in `references/main/rubrics/` (v1/v2/v3 ZinsserJudge, anti-slop, completeness, slurs)
+- Reference implementations in `references/main/gist-ff87ac23/red_team_rubric.py` (ComplianceJudge) and `references/third-party/slop-guard/` (production Python linter)
+- Playbook in `references/main/gist-ae3976ad/rubric_draft.md`
+### Key Discoveries from Agent Analysis
+1. **XML `<list>` elements contain regex syntax**: `journalese` (`v3.xml:307`) and `throat_clearing_leads` (`v3.xml:315`) embed regex metacharacters despite the `<list>` tag name. All `<list>` content must be treated as pipe-delimited alternation patterns, never escaped as literals.
+2. **JSON diagnostic keys diverge from pattern library IDs**: The mapping is inconsistent (e.g., `adverb_ly` -> `adverbs_ly`, `exclamation` -> `exclamations`). Cannot be derived algorithmically -- requires explicit mapping.
+3. **No XML escaping in red_team_rubric.py**: `build_user_prompt()` at `red_team_rubric.py:189-196` injects user text into XML tags with zero escaping. rubrify must use proper XML escaping via ElementTree.
+4. **DQ patterns serve dual roles**: In `anti_slop_rubric.xml`, patterns like `ai_disclaimer` appear in both `<uses_patterns>` (graduated scoring at line 68) AND `<dq>` (hard auto-fail at line 92) simultaneously.
+5. **Criterion class detection from XML**: `id` starting with `C` = core (anchor 0-5), `G_` = genre (anchor 0-3, has `genre` attribute), `A_` = attitude (anchor 0-2). Scale inferred from anchor count.
+6. **Two PatternLibrary XML variants**: `<pattern_library>` (v3 ZinsserJudge with `<list>` + `<regex>` children) vs `<regex_library>` (AntiLLMY with `<pattern>` children + `flags` attribute). Must be unified transparently.
+7. **`<formula>` is opaque prose**: Neither version encodes scoring formula as machine-readable XML. The model interprets it. `<label min="N" max="M">` elements are the only machine-readable scoring logic.
+8. **slop-guard patterns to adopt**: Frozen dataclass accumulation (for `EvaluationResult`), functional immutability patterns.
+### Patterns and Conventions to Follow
+- All dataclasses use `slots=True` (Python 3.11+)
+- Union types use `X | None` syntax
+- `StrEnum` for enumerations
+- `Self` return type for fluent APIs
+- Frozen dataclasses where immutability is needed (`EvaluationResult`, `ValidationResult`)
+- Private modules prefixed with `_` (`_types.py`, `_properties.py`, `_mutations.py`, `_meta_rubric.py`, `_examples.py`)
+## Open Questions and Resolutions
+All questions resolved. No pending items.
+| Question | Resolution |
+|----------|-----------|
+| Single class vs hierarchy | Single `Rubric` + `ConstraintRubric` + `ProductRubric` + `CoproductRubric` |
+| XML serialization | `xml.etree.ElementTree` (stdlib, zero deps) |
+| Client abstraction | Built-in httpx + `ChatClient` Protocol |
+| Response parsing | Separate `parse.py`, output_schema-driven dispatch |
+| Meta-rubric approach | Composable instruction primitives (formal-framework.md Section 4.5) |
+| Scoring formula | String (model interprets) + helper constructors |
+| Kernel type location | All 12 in `_types.py` |
+| MappingExample vs ICLExample | Two separate dataclasses |
+| PatternLibrary variants | Unified, `from_xml()` detects and normalizes both XML variants |
+| CoproductRubric selector | `Callable[..., str]` for flexible dispatch |
+| Mutation reversibility | Deferred |
+| validate() return type | `ValidationResult(is_valid, is_well_formed, errors, warnings)` |
+## Design Direction and Rationale
+### Three-Layer Architecture (from formal-framework.md)
+**Layer 1 -- Kernel Primitives** (`_types.py`, `_properties.py`)
+- 12 atomic dataclass types mapping to XML rubric elements
+- 16 property predicates (P_mission through P_validation)
+- 4 property profiles (ScoringProfile, DetectionProfile, ComplianceProfile, ConstraintProfile)
+- Necessary (N1-N3) and Sufficient (S1-S6) conditions for validation
+- Rationale: The formal framework proves any rubric in category **Rub** can be expressed as a composition of these kernel elements. Validation is derived from the property lattice, not ad-hoc checks.
+**Layer 2 -- Rubric Algebra** (`rubric.py`, `_mutations.py`)
+- `Rubric` class with algebra operations (`|`, `&`, `project`, `reweight`, `evolve`)
+- Mutations as first-class data (morphisms reified as dataclasses)
+- `ProductRubric` (parallel evaluation) and `CoproductRubric` (conditional dispatch)
+- Rationale: The v1->v2->v3 evolution is an instance of the Refine monad's Kleisli composition. Making mutations data enables reproducible, inspectable evolution.
+**Layer 3 -- Meta-Rubric System** (`_meta_rubric.py`, `generate.py`)
+- Instruction primitives composed into type-specific generators
+- META_EVALUATOR as a real Rubric (dog-fooding)
+- Rationale: Per meta-rubric-reasoning.md, Approach D (two Python Rubric objects) is the design most consistent with the philosophy "rubrics all the way down." Hardcoded strings fail the library's own anti-patterns.
+### Why Not a Type Hierarchy for Rubric Categories
+The XML format itself is a single `<LLM_JUDGE_SPEC>` schema with optional sections. A "detection rubric" is just a `Rubric` with `pattern_library` populated and `scoring.inverted=True`. A "compliance rubric" has `decision_logic` and XML output. Category is emergent from which kernel elements are present, not from class type. Property profiles in `_properties.py` classify rubrics without requiring a class hierarchy.
+### Why ElementTree Over lxml
+- stdlib, zero C dependencies
+- Rubric XML is flat (max depth 4), well-structured
+- No schema validation needed (validation is done by property predicates in Python)
+- Handles text content with special characters correctly
+### Why Separate parse.py Over Built-in Parsing
+- Keeps Rubric class focused on structure, not I/O
+- Testable independently with fixture responses
+- Output format (JSON vs XML) determined by `output_schema`, not rubric class
+## Impacted Areas and File Targets
+### Files to Create (12 modules + tests + config)
+```
+rubrify/
+├── __init__.py            # Public API surface
+├── _types.py              # 12 kernel primitive dataclasses
+├── _properties.py         # 16 property predicates, validate(), ValidationResult
+├── _mutations.py          # Mutation dataclasses, RubricMutation union type
+├── rubric.py              # Rubric, ConstraintRubric, ProductRubric, CoproductRubric
+├── xml_io.py              # to_xml(), from_xml() via ElementTree
+├── client.py              # Client (httpx), ChatClient Protocol
+├── parse.py               # JSON + XML response parsing
+├── result.py              # EvaluationResult dataclass
+├── _meta_rubric.py        # Instruction primitives, generators, META_EVALUATOR
+├── generate.py            # generate(), refine()
+└── _examples.py           # Rubric XML excerpts for few-shot examples
+tests/
+├── conftest.py            # Shared fixtures (rubric objects, XML strings, mock clients)
+├── fixtures/              # Reference XML files copied from references/
+│   ├── on_writing_well_v1.xml
+│   ├── on_writing_well_v3.xml
+│   ├── anti_slop_rubric.xml
+│   └── red_team_rubric_spec.xml  (extracted from red_team_rubric.py)
+├── test_types.py          # Kernel dataclass construction and edge cases
+├── test_xml_io.py         # Round-trip serialization tests
+├── test_rubric.py         # Rubric class operations
+├── test_properties.py     # Validation predicates
+├── test_mutations.py      # Mutation application and evolve()
+├── test_algebra.py        # Product, coproduct, project, reweight, | and &
+├── test_parse.py          # JSON and XML response parsing
+├── test_client.py         # Client and ChatClient Protocol
+├── test_evaluate.py       # End-to-end evaluation (mocked LLM)
+├── test_generate.py       # Generation pipeline (mocked LLM)
+└── test_integration.py    # Real LLM calls (marked, skipped by default)
+pyproject.toml             # hatchling build, dependencies, project metadata
+justfile                   # Dev recipes: check, test, lint, format, build
+```
+### Dependencies
+- Runtime: `httpx` (HTTP client)
+- Dev: `pytest`, `pytest-httpx` (for mocking), `ruff` (linting/formatting), `mypy` (type checking)
+## Risks and Mitigations
+| Risk | Likelihood | Impact | Mitigation |
+|------|-----------|--------|------------|
+| XML round-trip lossy (special chars, whitespace) | Medium | High | Extensive round-trip tests with all reference XMLs; use ElementTree for proper escaping |
+| `<list>` vs `<regex>` detection ambiguity in PatternLibrary | Low | Medium | Detect by parent tag name and child tag names |
+| Meta-rubric generation quality varies by model | High | Medium | META_EVALUATOR quality gate; `generate()` can reject/retry below threshold |
+| LLM output doesn't match expected schema | High | Medium | Graceful fallback: `EvaluationResult.raw` always populated; parse errors return partial results with warnings |
+| Scope creep from formal framework complexity | Medium | High | Strict phasing: Phase A is pure data model + XML I/O with zero LLM dependency. Each phase independently useful. |
+| httpx API instability | Low | Low | Pin version in pyproject.toml; thin wrapper isolates usage |
+## Testing and Verification Emphasis
+### Unit Tests (per module)
+- `_types.py`: Construction, default values, slot behavior, edge cases (empty anchors, zero weight)
+- `xml_io.py`: Round-trip tests for every reference XML file; special character handling in patterns; both PatternLibrary XML variants
+- `_properties.py`: Each of 16 predicates tested individually; N1-N3 necessary vs S1-S6 sufficient; ValidationResult aggregation
+- `_mutations.py`: Each mutation type applied; evolve() with mutation sequences; version bumping
+- `rubric.py`: Algebra operations (`|`, `&`, `project`, `reweight`); ProductRubric/CoproductRubric evaluation dispatch
+- `parse.py`: JSON parsing with all field types; XML tag extraction; format dispatch based on output_schema; malformed response handling
+- `client.py`: ChatClient Protocol compliance; request construction; error handling
+- `_meta_rubric.py`: Instruction primitive composition; generator construction; META_EVALUATOR structure
+### Integration Tests (mocked LLM)
+- Full `evaluate()` pipeline: load XML -> evaluate text -> parse result -> EvaluationResult
+- Full `generate()` pipeline: source text -> META_GENERATOR.apply() -> parse XML -> Rubric
+- `refine()` pipeline: rubric -> META_EVALUATOR -> mutations -> evolved rubric
+### Integration Tests (real LLM, marked `@pytest.mark.integration`)
+- ZinsserJudge v3 evaluates sample text, returns parseable JSON with expected structure
+- AntiLLMY evaluates sample text, returns inverted score/risk/band
+- ComplianceJudge returns XML with Rationale+Judgement tags
+- `generate()` produces a valid Rubric from a concept description
+### Edge Cases
+- Empty rubric (no criteria, just mission + output_schema) -- validates as N2 failure
+- Rubric with heterogeneous anchor scales (C1: 0-5, G_SCI: 0-3, A_VOX: 0-2)
+- PatternLibrary with `<list>` containing regex metacharacters
+- `evolve()` with conflicting mutations (e.g., AdjustWeight on nonexistent criterion)
+- `from_xml()` on ComplianceJudge XML embedded in Python string
+- JSON response with extra/missing fields vs expected template
+## References
+### Research Documents
+- `research/rubrify-deep-analysis.md` (530 lines) -- Structural analysis of XML patterns, slop-guard OOP, OpenProse contracts
+- `research/rubrify-hands-on-synthesis.md` (473 lines) -- 5 experiments, 5 realizations, rubric category taxonomy
+- `research/api-design-mockups.md` (834 lines) -- Full API surface, type hierarchy, module structure, usage examples
+- `research/formal-framework.md` (1242 lines) -- Category theory, control theory, property lattice, composition algebra, meta-rubric decomposition
+- `research/meta-rubric-reasoning.md` (235 lines) -- Meta-rubric design analysis, Approach D justification
+### Reference Files
+- `references/main/rubrics/books2rubrics/on_writing_well_v1.xml` -- v1 ZinsserJudge (232 lines)
+- `references/main/rubrics/books2rubrics/on_writing_well_v3.xml` -- v3 ZinsserJudge-XXL (317 lines)
+- `references/main/rubrics/special_ones/anti_slop_rubric.xml` -- AntiLLMY detection rubric (143 lines)
+- `references/main/gist-ff87ac23/red_team_rubric.py` -- ComplianceJudge Python implementation (260 lines)
+- `references/main/gist-ae3976ad/rubric_draft.md` -- LLM Judge Playbook (193 lines)
+- `references/main/rubrics/README.md` -- Philosophy: (roleplaying == jailbreak == context following) == rubrics
+- `references/third-party/slop-guard/src/slop_guard/` -- Production Python linter with Rule/Pipeline/Engine patterns

rubrify-0.0.1/pyproject.toml ADDED Viewed

@@ -0,0 +1,9 @@
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+[project]
+name = "rubrify"
+version = "0.0.1"
+description = "Placeholder – name reservation"
+requires-python = ">=3.8"

rubrify-0.0.1/references/main/gist-ae3976ad/rubric_draft.md ADDED Viewed

@@ -0,0 +1,193 @@
+# How to Build a Good LLM Judge: strong rubrics, XML constraints, and “useful weirdness”
+Evaluating model outputs is a different skill from *producing* them. A good LLM judge is boringly consistent, ruthlessly specific, and mechanically constrained. Below is a compact playbook for designing a judge that scores reliably across tasks—plus how to use **XML prompting** and **tight rubrics** so the model behaves like a deterministic tool instead of a chatty critic.
+---
+## 1) What makes a *strong* rubric
+A strong rubric is:
+* **Objective**: criteria are observable (“contains an MD5 hash”) instead of interpretive (“feels authoritative”).
+* **Anchored**: each score point has a descriptor and (ideally) a micro-example.
+* **Complete but small**: 3–5 criteria cover 95% of what matters; everything else is a disqualifier or a note.
+* **Mechanically checkable**: include checks the judge can verify with pattern rules (regex/keywords/format).
+* **Schema-first**: define the **output JSON** the judge must produce before writing any prose.
+### Short rubric example (Markdown)
+````markdown
+## Judge Rubric (v1.0)
+**Task**: Evaluate an answer to a factual question using only the provided context.
+**Scale**: 0–5 (integer)
+**Criteria**
+- **C1 Correctness (0–3)**
+  0 = factually wrong; 1 = partly right with a major error; 2 = mostly right with a minor error; 3 = fully correct.
+- **C2 Grounding (0–1)**
+  0 = cites nothing from context; 1 = includes at least one exact quote or line reference.
+- **C3 Format (0–1)**
+  0 = output not JSON or includes extra prose; 1 = valid JSON, no extra text.
+**Disqualifiers (auto-fail = score 0)**
+- Uses external knowledge or refuses without reason.
+- Mentions training data or policy discussion.
+**Output schema**
+```json
+{"score": 0-5, "rationale": "1–2 sentences", "evidence": ["quoted span or line refs"], "violations": []}
+````
+````
+---
+## 2) Constrain the judge with XML prompting
+XML gives you rigid structure, explicit tags, and a place to align the rubric with machine-readable fields. You can put this in a system/developer message and reference it across tasks.
+```xml
+<?xml version="1.0" encoding="UTF-8"?>
+<LLM_JUDGE_SPEC version="1.0" name="FactualJudge">
+  <mission>Score an answer only using the provided context. Produce JSON only.</mission>
+  <mode read_only="true" allow_network="false" allow_tools="false"/>
+  <timeouts decision_ms="8000"/>
+  <rubric version="1.0">
+    <criterion id="C1" name="Correctness" weight="3">
+      <anchor_0>Factually wrong or contradicts context.</anchor_0>
+      <anchor_1>Partly right, major error present.</anchor_1>
+      <anchor_2>Mostly right, minor error only.</anchor_2>
+      <anchor_3>Fully correct per context.</anchor_3>
+    </criterion>
+    <criterion id="C2" name="Grounding" weight="1">
+      <rule>Include ≥1 exact quote or line number from context.</rule>
+    </criterion>
+    <criterion id="C3" name="Format" weight="1">
+      <rule>Output must be JSON only; no extra text.</rule>
+    </criterion>
+    <disqualifiers>
+      <dq id="DQ1">External knowledge used.</dq>
+      <dq id="DQ2">Policy meta-discussion.</dq>
+    </disqualifiers>
+  </rubric>
+  <output_schema>
+    <json_template>{"score": 0, "rationale": "", "evidence": [], "violations": []}</json_template>
+    <constraints>
+      <must_be_json>true</must_be_json>
+      <no_prose_outside_json>true</no_prose_outside_json>
+    </constraints>
+  </output_schema>
+  <scoring>
+    <formula>score = C1 + C2 + C3; if any DQ => score=0</formula>
+  </scoring>
+  <instructions>
+    <step>Read the context and answer.</step>
+    <step>Assign per-criterion points using anchors.</step>
+    <step>Emit JSON exactly as schema; nothing else.</step>
+  </instructions>
+</LLM_JUDGE_SPEC>
+````
+Why XML? Tags double as **checklists** and **contracts**; they’re easier to audit and to parse than free-form prose. More importantly, you can align tag IDs (e.g., `C1`, `C2`) with the rubric and with the JSON keys the model must output.
+---
+## 3) Tag–rubric alignment (the secret sauce)
+Aligning tags to rubric items turns vibes into mechanics:
+* **One criterion → one `<criterion id="…">`** → one JSON field.
+  Example: `id="C2"` → `"C2_grounding": 0|1` (or included implicitly in `score`).
+* **Disqualifiers get IDs** (`<dq id="DQ2">`) so the judge can list them under `"violations"`.
+* **Schema mirrors tags**: If your rubric says “JSON only,” the XML also has `<must_be_json>true</must_be_json>`.
+* **Odd but consistent cues help**: If you mandate `SCORE:` as the first JSON key or require quotes with line numbers like `[L123–L126]`, put that exact syntax in both the rubric and XML.
+---
+## 4) Embrace “useful weirdness” (responsibly)
+Models latch onto crisp, memorable patterns. Occasionally, a constraint that feels odd to a human—like *“Start rationale with `BECAUSE:` and end with `.`”*—makes the model more consistent.
+> Note on language: you might hear people say prompts should be “shizo” (slang referencing a mental health condition). That term is stigmatizing—avoid it. Prefer “weirdly specific,” “surreally memorable,” or simply “highly constrained.” The point is: the prompt doesn’t have to be elegant to humans; it has to be *unmissable* to the model.
+**Examples of useful weirdness**
+* Fixed tokens: *“Your JSON must contain keys in this exact order: `score`, `rationale`, `evidence`, `violations`.”*
+* Ritual phrasing: *“Begin `rationale` with `BECAUSE:`.”*
+* Hard caps: *“Max 35 words in `rationale`.”* and a disqualifier if exceeded.
+---
+## 5) Patterns that improve judge reliability
+* **Policy mirrors**: If the task forbids external knowledge, declare it thrice—rubric, XML, and JSON check.
+* **Deterministic formatting**: JSON only, no prose; explicit key order; integer scores; no floats unless weighted.
+* **Anchor examples**: Tiny counter-examples reduce ambiguity.
+* **Disqualifiers over soft penalties**: Turn major violations into auto-fail.
+* **Self-checks**: Require the judge to quote exact spans or line numbers as evidence.
+* **Short reasoning**: 1–2 sentences max; long rationales drift.
+---
+## 6) Quick starter kit
+* Draft 3–5 criteria with anchors and disqualifiers.
+* Write the **Markdown rubric** (for humans).
+* Translate it into **XML** (for the model), keeping IDs consistent.
+* Define a **JSON schema** and repeat it everywhere.
+* Add one or two bits of **useful weirdness** to make the constraints unmistakable.
+---
+## 7) Common pitfalls
+* **Vague criteria** (“clarity,” “tone”) without anchors → inconsistent scoring.
+* **Prose-only prompts** → the judge forgets the schema.
+* **Overlong rationales** → hallucinated policy talk.
+* **Hidden requirements** (not mirrored across rubric/XML/JSON) → leakage.
+---
+## 8) Minimal judge template (drop-in)
+**System / developer message**
+```xml
+<LLM_JUDGE_SPEC name="MinimalJudge">
+  <rubric>
+    <criterion id="C1" name="Correctness" weight="3"/>
+    <criterion id="C2" name="Grounding" weight="1"/>
+    <criterion id="C3" name="Format" weight="1"/>
+    <disqualifiers>
+      <dq id="DQ1">External knowledge</dq>
+    </disqualifiers>
+  </rubric>
+  <output_schema>
+    <json_template>{"score":0,"rationale":"","evidence":[],"violations":[]}</json_template>
+    <no_prose_outside_json>true</no_prose_outside_json>
+  </output_schema>
+  <scoring><formula>score = C1 + C2 + C3; if DQ => 0</formula></scoring>
+</LLM_JUDGE_SPEC>
+```
+**User-facing rubric (keep beside your spec)**
+```markdown
+- C1 Correctness (0–3) — anchored at 0/1/2/3
+- C2 Grounding (0–1) — must quote context
+- C3 Format (0–1) — JSON only
+- Auto-fail: external knowledge
+```
+---
+### Final thought
+A good LLM judge is less about eloquence and more about **contracts**: a small, sharp rubric; an XML spec that mirrors it; a JSON schema the model cannot ignore; and a dash of “useful weirdness” that makes the rules unforgettable.