PyPI - pio-arch - Versions diffs - 0.1.0__tar.gz - Mend

pio-arch 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (56)

pio_arch-0.1.0/.gitignore +51 -0
pio_arch-0.1.0/.pre-commit-config.yaml +17 -0
pio_arch-0.1.0/AGENT_GUIDE.shared.md +313 -0
pio_arch-0.1.0/LEARNING_GUIDE.md +659 -0
pio_arch-0.1.0/LICENSE +21 -0
pio_arch-0.1.0/PKG-INFO +234 -0
pio_arch-0.1.0/README.md +193 -0
pio_arch-0.1.0/context.md +500 -0
pio_arch-0.1.0/docs/Makefile +24 -0
pio_arch-0.1.0/docs/api/embedder.rst +7 -0
pio_arch-0.1.0/docs/conf.py +39 -0
pio_arch-0.1.0/docs/index.rst +12 -0
pio_arch-0.1.0/notebooks/full_pio_walkthrough.ipynb +687 -0
pio_arch-0.1.0/notebooks/generate_pio_synthetic_data.ipynb +396 -0
pio_arch-0.1.0/notebooks/optuna_sweep.ipynb +578 -0
pio_arch-0.1.0/notebooks/pio_data_engineering.ipynb +521 -0
pio_arch-0.1.0/notebooks/pio_encoder_design.md +141 -0
pio_arch-0.1.0/notebooks/pio_prototype.ipynb +97 -0
pio_arch-0.1.0/notebooks/pio_sentence_model_prototype.ipynb +487 -0
pio_arch-0.1.0/notebooks/pio_synthetic_50k.parquet +0 -0
pio_arch-0.1.0/notebooks/pio_text_embedding_plan.md +338 -0
pio_arch-0.1.0/notebooks/sentence_only_pio.ipynb +588 -0
pio_arch-0.1.0/notebooks/universal_feature_transformer_demo.ipynb +873 -0
pio_arch-0.1.0/pio_arch/__init__.py +15 -0
pio_arch-0.1.0/pio_arch/models/__init__.py +22 -0
pio_arch-0.1.0/pio_arch/models/context_encoder.py +178 -0
pio_arch-0.1.0/pio_arch/models/embedder.py +132 -0
pio_arch-0.1.0/pio_arch/models/pio_attention.py +368 -0
pio_arch-0.1.0/pio_arch/models/sentence_encoder.py +237 -0
pio_arch-0.1.0/pio_arch/utils/__init__.py +0 -0
pio_arch-0.1.0/pio_arch/utils/collate.py +109 -0
pio_arch-0.1.0/pio_arch/utils/data.py +159 -0
pio_arch-0.1.0/pio_arch/utils/rff.py +63 -0
pio_arch-0.1.0/pio_arch/utils/sentence_pool.py +115 -0
pio_arch-0.1.0/pio_arch/utils/text_dropout.py +62 -0
pio_arch-0.1.0/pio_arch/utils/train.py +242 -0
pio_arch-0.1.0/pyproject.toml +183 -0
pio_arch-0.1.0/scripts/build_walkthrough_notebooks.py +1610 -0
pio_arch-0.1.0/scripts/smoke_test_notebooks.py +494 -0
pio_arch-0.1.0/tests/__init__.py +0 -0
pio_arch-0.1.0/tests/conftest.py +42 -0
pio_arch-0.1.0/tests/integration/__init__.py +0 -0
pio_arch-0.1.0/tests/integration/test_embedder_integration.py +48 -0
pio_arch-0.1.0/tests/integration/test_pio_attention_integration.py +110 -0
pio_arch-0.1.0/tests/integration/test_uft_integration.py +47 -0
pio_arch-0.1.0/tests/test_collate.py +116 -0
pio_arch-0.1.0/tests/test_context_encoder.py +242 -0
pio_arch-0.1.0/tests/test_data.py +174 -0
pio_arch-0.1.0/tests/test_embedder.py +413 -0
pio_arch-0.1.0/tests/test_hello.py +14 -0
pio_arch-0.1.0/tests/test_pio_attention.py +422 -0
pio_arch-0.1.0/tests/test_rff.py +91 -0
pio_arch-0.1.0/tests/test_sentence_encoder.py +251 -0
pio_arch-0.1.0/tests/test_sentence_pool.py +165 -0
pio_arch-0.1.0/tests/test_text_dropout.py +87 -0
pio_arch-0.1.0/tests/test_train.py +188 -0

pio_arch-0.1.0/.gitignore ADDED

@@ -0,0 +1,51 @@
+# Python
+.venv/
+__pycache__/
+*.pyc
+*.pyo
+*.pyd
+*.egg-info/
+dist/
+build/
+.eggs/
+*.whl
+*.tar.gz
+# Testing & coverage
+.coverage
+.coverage.*
+htmlcov/
+.pytest_cache/
+# Type checking & linting
+.mypy_cache/
+.ruff_cache/
+# uv
+uv.lock
+# PyTorch artifacts
+*.pt
+*.pth
+*.ckpt
+checkpoints/
+runs/
+outputs/
+# Jupyter
+.ipynb_checkpoints/
+# Sphinx docs build output
+docs/_build/
+# IDE
+.vscode/
+.idea/
+*.swp
+# macOS
+.DS_Store
+# Secrets
+.env
+*.key

pio_arch-0.1.0/.pre-commit-config.yaml ADDED

@@ -0,0 +1,17 @@
+repos:
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.15.11
+    hooks:
+      - id: ruff
+        args: [--fix]
+      - id: ruff-format
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.6.0
+    hooks:
+      - id: trailing-whitespace
+      - id: end-of-file-fixer
+      - id: check-yaml
+      - id: check-toml
+      - id: check-merge-conflict
+      - id: debug-statements

pio_arch-0.1.0/AGENT_GUIDE.shared.md ADDED

@@ -0,0 +1,313 @@
+# pio-arch — Shared Agent Guide
+This is the shared project guide for non-Claude coding agents. `AGENTS.md` and
+`GEMINI.md` are symlinks to this file so Codex, Gemini CLI, and similar tools
+read the same durable project guidance with no duplicated body text.
+Keep `.claude/CLAUDE.md` as the Claude-specific guide for Claude Code features,
+then mirror durable cross-agent rules here when they should also apply outside
+Claude.
+For deeper role-specific guidance, read:
+- `.claude/CLAUDE.md` - primary project guide and architecture overview
+- `context.md` - architecture reference material
+- `.claude/agents/ml-engineer.md` - model implementation and PyTorch guidance
+- `.claude/agents/qa-reviewer.md` - mathematical/model review checklist
+- `.claude/agents/unit-tester.md` - pytest, coverage, and ruff workflow
+- `.claude/agents/integration-tester.md` - end-to-end synthetic validation
+- `.claude/agents/python-expert.md` - packaging, dependencies, and tooling
+- `.claude/agents/data-scientist.md` - end-user notebooks and examples
+- `.claude/skills/implement-model/SKILL.md` - Claude-specific model pipeline
+- `.claude/uv-guide.md` - uv reference for conda users
+## Project Summary
+This repo contains PyTorch implementations of permutation-invariant neural
+network architectures for tabular, mixed-row, and text/set-structured data.
+Models consume variable-length unordered rows and produce fixed-size
+representations or per-task predictions.
+Core constraint: `f(X) = f(pi(X))` for any permutation `pi`.
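The constraint can be checked numerically. The following is a minimal, dependency-free sketch rather than code from the package: `masked_mean` is a hypothetical stand-in for an invariant model, written in plain Python instead of PyTorch so that it runs anywhere. It shuffles the rows together with their padding mask and compares the pooled outputs.

```python
import math
import random


def masked_mean(rows, padding_mask):
    """Average the unpadded rows; padding_mask[i] is True where row i is padding."""
    kept = [row for row, is_pad in zip(rows, padding_mask) if not is_pad]
    if not kept:
        return [0.0] * len(rows[0])
    # math.fsum is exactly rounded, so the result does not depend on row order.
    return [math.fsum(col) / len(kept) for col in zip(*kept)]


rng = random.Random(0)
rows = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
padding_mask = [False, False, False, False, True, True]

# Apply the same permutation pi to the rows and to the mask.
perm = list(range(len(rows)))
rng.shuffle(perm)
shuffled_rows = [rows[i] for i in perm]
shuffled_mask = [padding_mask[i] for i in perm]

original = masked_mean(rows, padding_mask)
permuted = masked_mean(shuffled_rows, shuffled_mask)
max_diff = max(abs(a - b) for a, b in zip(original, permuted))
```

With a real model the comparison would normally use a small tolerance, since floating-point reductions on tensors are not guaranteed to be order-independent.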
+## Architectures
+Implemented model files:
+- `pio_arch/models/embedder.py` - `UniversalFeatureTransformer`, type-aware scalar and
+  discrete feature embedding.
+- `pio_arch/models/context_encoder.py` - `ContextRowEncoder`, mixed numeric/discrete/text
+  row encoder with feature-type embeddings.
+- `pio_arch/models/sentence_encoder.py` - sentence-set baseline, task heads, `TaskSpec`,
+  and masked multi-task loss utilities.
+- `pio_arch/models/pio_attention.py` - latent-bottleneck attention model for context
+  sets, with optional context self-attention, latent self-attention, task-query
+  pooling, and per-task or shared heads. Structurally Perceiver IO
+  (Jaegle et al., 2022).
+Utilities under `pio_arch/utils/`:
+- `pio_arch/utils/rff.py` - Random Fourier Feature helpers shared by embedders.
+- `pio_arch/utils/collate.py` - `sentence_set_collate` (tensor-only).
+- `pio_arch/utils/data.py` - `SentenceSetDataset` and `collate_sentence_set_batch`
+  for trainer-ready dict batches.
+- `pio_arch/utils/train.py` - `Trainer`, `train_one_epoch`, `evaluate`,
+  `make_masked_multitask_loss_fn`, `move_batch`.
+- `pio_arch/utils/text_dropout.py` - `TextRowDropout` (training-time row dropout).
+- `pio_arch/utils/sentence_pool.py` - `pool_sentence_embeddings`, a DeepSets-style
+  aggregator that collapses `[B, N, text_dim]` to `[B, text_dim]`.
+Historical architecture notes and unimplemented reference designs live in
+`context.md`; do not treat those snippets as current source files.
+## Input Embedding
+`pio_arch/models/embedder.py` contains `UniversalFeatureTransformer`, the shared raw
+feature embedding layer.
+```python
+model = UniversalFeatureTransformer(
+    feature_vocab_size=500,
+    dim=64,          # must be even
+    rff_sigma=1.0,
+)
+# forward(feature_ids, feature_values, is_numerical=None) -> [B, n, dim]
+```
+Arguments:
+- `feature_ids: [B, n]` int tensor; used for categorical/sentinel/missing values
+- `feature_values: [B, n]` float tensor; used for numerical scalar values
+- `is_numerical: [B, n]` bool tensor or `None`; `None` means all numerical
+Conventions:
+- Numerical positions use fixed Random Fourier Features.
+- Categorical, sentinel, and missing positions use `nn.Embedding`.
+- ID `0` is reserved for missing/padding and returns a zero vector.
+- All categorical/sentinel/missing indexing is the caller's responsibility.
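The conventions above can be illustrated with a small sketch. It is not the package's implementation: the internals of `rff.py` and `embedder.py` are not shown here, so the Gaussian frequency distribution, the `sqrt(2 / dim)` scaling, and every function name below are assumptions. It shows one common form of fixed Random Fourier Features (cosine and sine halves, which is one reason an even `dim` is natural) next to a lookup table whose row `0` is all zeros.

```python
import math
import random


def make_rff_frequencies(dim, sigma, seed=0):
    """Draw dim // 2 fixed frequencies with standard deviation 1 / sigma (assumed form)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even: half cosine features, half sine features")
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0 / sigma) for _ in range(dim // 2)]


def rff_embed(value, frequencies):
    """Map a scalar to [cos(w * x), ..., sin(w * x), ...] scaled to unit norm."""
    scale = math.sqrt(1.0 / len(frequencies))  # equals sqrt(2 / dim)
    cos_part = [scale * math.cos(w * value) for w in frequencies]
    sin_part = [scale * math.sin(w * value) for w in frequencies]
    return cos_part + sin_part


def make_embedding_table(vocab_size, dim, seed=1):
    """Random lookup table standing in for nn.Embedding; row 0 is the zero vector."""
    rng = random.Random(seed)
    table = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]
    table[0] = [0.0] * dim  # ID 0 is reserved for missing/padding
    return table


def embed_feature(feature_id, feature_value, is_numerical, frequencies, table):
    """Numerical positions use RFFs; all other positions use the lookup table."""
    if is_numerical:
        return rff_embed(feature_value, frequencies)
    return table[feature_id]


frequencies = make_rff_frequencies(dim=8, sigma=1.0)
table = make_embedding_table(vocab_size=5, dim=8)
numeric_vec = embed_feature(3, 0.7, True, frequencies, table)
missing_vec = embed_feature(0, 0.0, False, frequencies, table)
```

Because `cos^2 + sin^2 = 1` for each frequency, every numerical embedding in this sketch has unit norm regardless of the input value.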
+## Mixed Context Rows
+`pio_arch/models/context_encoder.py` contains `ContextRowEncoder`, the current mixed-row
+input path for PIO-style models.
+```python
+model = ContextRowEncoder(
+    discrete_vocab_size=1000,
+    feature_type_vocab_size=64,
+    value_dim=32,          # must be even
+    feature_type_dim=32,
+    text_input_dim=384,    # optional; required only for text rows
+)
+# forward(value_ids, scalar_values, row_kinds, feature_type_ids, padding_mask,
+#         text_values=None) -> [B, N, value_dim + feature_type_dim]
+```
+Arguments:
+- `value_ids: [B, N]` long tensor for discrete rows.
+- `scalar_values: [B, N]` float tensor for numeric rows.
+- `row_kinds: [B, N]` long tensor using `ContextRowKind.NUMERIC`,
+  `ContextRowKind.DISCRETE`, or `ContextRowKind.TEXT`.
+- `feature_type_ids: [B, N]` long tensor identifying the row's feature type.
+- `padding_mask: [B, N]` bool tensor with `True` at padded rows.
+- `text_values: [B, N, text_input_dim]` optional precomputed text embeddings.
+Conventions:
+- Numeric rows use fixed RFFs.
+- Discrete rows use `nn.Embedding` with ID `0` reserved for padding/unknown.
+- Text rows require `text_input_dim` at construction and `text_values` at
+  forward time.
+- Padding rows are zeroed before returning.
+- The row encoder is permutation-equivariant; downstream pooling/attention is
+  responsible for producing invariant predictions.
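As a rough illustration of the argument layout, the sketch below pads variable-length samples into rectangular `[B, N]` lists with `True` at padded rows. It is an assumption-laden example, not package code: `RowKind` is a hypothetical stand-in for `ContextRowKind` (whose integer values are not shown), `pad_context_rows` is an invented helper, text rows are left out, and the filler values used for padded rows are arbitrary.

```python
from enum import IntEnum


class RowKind(IntEnum):
    """Hypothetical stand-in for ContextRowKind; the real integer values may differ."""

    NUMERIC = 0
    DISCRETE = 1


def pad_context_rows(samples):
    """Pad samples, each a list of (kind, feature_type_id, value), to [B, N] lists."""
    max_rows = max((len(sample) for sample in samples), default=0)
    batch = {
        "value_ids": [],
        "scalar_values": [],
        "row_kinds": [],
        "feature_type_ids": [],
        "padding_mask": [],
    }
    for sample in samples:
        n_pad = max_rows - len(sample)
        numeric = [kind == RowKind.NUMERIC for kind, _, _ in sample]
        values = [value for _, _, value in sample]
        # Discrete rows carry an ID, numeric rows carry a scalar; the unused slot is 0.
        batch["value_ids"].append(
            [0 if is_num else int(v) for is_num, v in zip(numeric, values)] + [0] * n_pad
        )
        batch["scalar_values"].append(
            [float(v) if is_num else 0.0 for is_num, v in zip(numeric, values)] + [0.0] * n_pad
        )
        # Filler kind and feature type for padded rows are arbitrary (assumption).
        batch["row_kinds"].append(
            [int(kind) for kind, _, _ in sample] + [int(RowKind.NUMERIC)] * n_pad
        )
        batch["feature_type_ids"].append([type_id for _, type_id, _ in sample] + [0] * n_pad)
        # True marks padded rows, matching the padding_mask convention.
        batch["padding_mask"].append([False] * len(sample) + [True] * n_pad)
    return batch


samples = [
    [(RowKind.NUMERIC, 3, 0.5), (RowKind.DISCRETE, 7, 42)],
    [(RowKind.NUMERIC, 3, -1.25)],
]
batch = pad_context_rows(samples)
```

Each nested list could then be converted to a tensor of the dtype listed in the arguments (long, float, or bool) before being passed to the encoder.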
+## Sentence and PIO Models
+`pio_arch/models/sentence_encoder.py` provides a sentence-only baseline:
+- `SentenceSetEncoder(sentence_embeddings, padding_mask) -> [B, model_dim]`
+  projects externally generated sentence embeddings, masks padded rows, and
+  pools with a masked mean.
+- `SentenceSetMultiTaskModel` wraps the encoder with task-specific heads.
+- `TaskSpec(name, kind, weight=1.0)` supports `kind="binary"` and
+  `kind="regression"`.
+- `masked_multitask_loss(predictions, targets, target_mask, tasks)` computes
+  weighted BCE-with-logits or MSE over observed targets only.
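The masked loss described in the last bullet can be sketched in plain Python. This is an illustration under stated assumptions, not the package's `masked_multitask_loss`: the dict-of-lists layout, the convention that a `True` mask entry means "observed", the summing of weighted per-task means, and the task names `churn` and `spend` are all hypothetical.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def masked_multitask_loss_sketch(predictions, targets, target_mask, tasks):
    """Sum of weighted per-task mean losses, each averaged over observed targets only.

    predictions, targets, target_mask: dict from task name to per-sample lists.
    tasks: list of (name, kind, weight) tuples, with kind "binary" or "regression".
    """
    total = 0.0
    for name, kind, weight in tasks:
        observed = [
            (p, t)
            for p, t, is_observed in zip(predictions[name], targets[name], target_mask[name])
            if is_observed
        ]
        if not observed:
            continue  # a task with no observed targets contributes nothing
        if kind == "binary":
            losses = [bce_with_logits(p, t) for p, t in observed]
        elif kind == "regression":
            losses = [(p - t) ** 2 for p, t in observed]
        else:
            raise ValueError(f"unsupported task kind: {kind!r}")
        total += weight * sum(losses) / len(observed)
    return total


tasks = [("churn", "binary", 2.0), ("spend", "regression", 1.0)]
predictions = {"churn": [0.0, 3.0], "spend": [1.0, 5.0]}
targets = {"churn": [1.0, 0.0], "spend": [0.0, 0.0]}
target_mask = {"churn": [True, False], "spend": [True, False]}
loss = masked_multitask_loss_sketch(predictions, targets, target_mask, tasks)
```

Dividing by the number of observed targets per task, rather than by the batch size, keeps sparsely labeled tasks from being down-weighted simply because most of their targets are missing.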
+`pio_arch/models/pio_attention.py` contains the current attention architecture:
+```python
+model = PIOAttentionModel(
+    input_dim=context_dim,
+    tasks=[TaskSpec("target", "binary")],
+    model_dim=64,           # context-stage dim
+    num_latents=8,
+    latent_dim=None,        # defaults to model_dim
+    task_dim=None,          # defaults to latent_dim
+    num_context_self_attn_blocks=0,
+    num_latent_self_attn_blocks=1,
+    num_heads=4,
+    head_mode="per_task",   # or "shared"
+)
+# forward(context, padding_mask) -> dict[str, [B, 1]]
+# encode_latents(context, padding_mask) -> [B, num_latents, latent_dim]
+```
+Each of `model_dim`, `latent_dim`, and `task_dim` must be divisible by
+`num_heads`. Set them independently to shrink or grow the latent bottleneck
+relative to the context and task queries.
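The defaulting and divisibility rules can be written down as a small check. This is a sketch only: `resolve_pio_dims` is a hypothetical helper, and the way `PIOAttentionModel` validates its arguments internally is not shown in this guide.

```python
def resolve_pio_dims(model_dim, num_heads, latent_dim=None, task_dim=None):
    """Apply the documented defaults, then require divisibility by num_heads."""
    latent_dim = model_dim if latent_dim is None else latent_dim  # defaults to model_dim
    task_dim = latent_dim if task_dim is None else task_dim  # defaults to latent_dim
    dims = {"model_dim": model_dim, "latent_dim": latent_dim, "task_dim": task_dim}
    for name, value in dims.items():
        if value % num_heads != 0:
            raise ValueError(f"{name}={value} is not divisible by num_heads={num_heads}")
    return dims


default_dims = resolve_pio_dims(model_dim=64, num_heads=4)
narrow_dims = resolve_pio_dims(model_dim=64, num_heads=4, latent_dim=32)
```

In the second call the latent bottleneck is narrower than the context stage, and `task_dim` follows `latent_dim` because it was left unset.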
+PIO processing order:
+1. Project context rows to `model_dim`.
+2. Optionally apply context self-attention blocks.
+3. Cross-attend learnable latent queries to context rows.
+4. Optionally apply latent self-attention blocks.
+5. Cross-attend learnable task queries to latents.
+6. Predict one `[B, 1]` tensor per task.
+The model handles all-padded samples with a safe attention mask so PyTorch
+`MultiheadAttention` does not produce NaNs.
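To see why an all-padded sample is a problem, consider a softmax in which every key is masked to negative infinity: the weights become NaN. The sketch below reproduces that in plain Python and shows one common workaround, which is to unmask a single key for fully padded samples and zero their output afterwards. It is an assumption about the general technique, not the package's code: how `PIOAttentionModel` builds its safe mask is not shown, and both function names are hypothetical.

```python
import math

NEG_INF = float("-inf")


def masked_softmax(scores, key_padding_mask):
    """Softmax over keys, with padded keys (mask True) set to -inf first."""
    masked = [NEG_INF if is_pad else s for s, is_pad in zip(scores, key_padding_mask)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]  # -inf minus -inf is NaN
    total = sum(exps)
    return [e / total for e in exps]


def make_safe_key_padding_mask(padding_mask):
    """Unmask the first key of fully padded samples; also report which samples those are."""
    safe_mask, all_padded = [], []
    for row in padding_mask:
        fully_padded = len(row) > 0 and all(row)
        all_padded.append(fully_padded)
        safe_mask.append([False] + list(row[1:]) if fully_padded else list(row))
    return safe_mask, all_padded


scores = [0.3, -1.2, 0.8]
padding_mask = [[True, True, True], [False, True, True]]

naive_weights = masked_softmax(scores, padding_mask[0])
safe_mask, all_padded = make_safe_key_padding_mask(padding_mask)
safe_weights = masked_softmax(scores, safe_mask[0])

# Outputs of fully padded samples would be zeroed afterwards using all_padded,
# so the artificially unmasked key never influences a prediction.
```

Samples that already have at least one real row pass through `make_safe_key_padding_mask` unchanged.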
+## Development Rules
+- Python 3.12+, PyTorch only for neural network code.
+- Use the project venv: `.venv/bin/python`, `.venv/bin/pytest`,
+  `.venv/bin/ruff`.
+- Each architecture lives in its own file under `pio_arch/models/`.
+- Shared utilities live in `pio_arch/utils/`.
+- Most set-model forwards use explicit padding masks where
+  `padding_mask: [B, N]` is bool with `True` at padded positions.
+- `PIOAttentionModel.forward(context, padding_mask)` returns a dictionary from
+  task name to `[B, 1]` prediction tensor.
+- `SentenceSetEncoder.forward(sentence_embeddings, padding_mask)` returns
+  `[B, model_dim]`.
+- `ContextRowEncoder.forward()` and `UniversalFeatureTransformer.forward()` have
+  their own signatures; do not force them into the downstream model interface.
+- Use `batch_first=True` in every `nn.MultiheadAttention`.
+- Prefer pre-norm attention blocks: normalize query/key/value inputs before
+  attention, then use residual feed-forward updates.
+- Add type hints to public methods.
+- Do not add positional encodings to set models; that breaks permutation
+  invariance.
+- Apply masks before aggregation so padded values cannot leak into outputs.
+- In attention modules, pass padded rows through `key_padding_mask` and re-zero
+  masked context rows after updates where those rows remain in the set.
+- Keep list-valued input semantics distinct: `None` means upstream field
+  missing, `[]` means present but empty, and non-empty lists mean observed
+  values. Do not use empty strings as missing markers.
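The last rule, on list-valued inputs, can be made concrete with a short sketch. `classify_list_field` is a hypothetical helper used only for illustration; rejecting string inputs is one possible way to enforce the "no empty-string markers" rule, not necessarily how the package does it.

```python
from typing import Optional, Sequence


def classify_list_field(values: Optional[Sequence[str]]) -> str:
    """Distinguish a missing field, a present-but-empty field, and observed values."""
    if values is None:
        return "missing"  # the upstream field was absent
    if isinstance(values, str):
        # A bare string, including "", is not a valid list-valued input.
        raise TypeError("expected a list or None, not a string")
    if len(values) == 0:
        return "empty"  # the field was present but had no entries
    return "observed"


states = [classify_list_field(v) for v in (None, [], ["first note", "second note"])]
```

Keeping `None` and `[]` distinct lets downstream code decide separately whether a missing field and an empty field should be encoded the same way.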
+## Testing Expectations
+Run tests through the venv:
+```bash
+.venv/bin/pytest
+.venv/bin/pytest tests/test_embedder.py -v
+.venv/bin/pytest tests/test_context_encoder.py tests/test_pio_attention.py -v
+```
+Implemented coverage currently includes:
+- `tests/test_embedder.py` - `UniversalFeatureTransformer` shape, RFF,
+  categorical path, and permutation-equivariance checks.
+- `tests/test_context_encoder.py` - row-kind paths, padding zeroing, validation,
+  and permutation-equivariance checks.
+- `tests/test_sentence_encoder.py` - sentence-set pooling, multi-task heads, and
+  masked loss checks.
+- `tests/test_pio_attention.py` - attention block shapes, PIO task outputs,
+  all-padded sample safety, permutation invariance, and masked-context
+  corruption checks.
+- `tests/integration/` - UFT, embedder, and PIO attention integration tests.
+Each downstream set model should have:
+- Shape test for its public output contract.
+- Permutation-invariance test: shuffled rows and mask produce the same output.
+- Masking test: corrupting padded positions does not affect outputs.
+- All-padded sample test when attention is involved.
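A masking test of the kind listed above might look like the following sketch. The two toy models are hypothetical stand-ins written in plain Python, not package code; the point is only that corrupting padded rows must leave the output unchanged, and that the check fails for a model that ignores its mask.

```python
def masked_mean_model(rows, padding_mask):
    """Toy invariant model: mean over rows where padding_mask is False."""
    kept = [row for row, is_pad in zip(rows, padding_mask) if not is_pad]
    if not kept:
        return [0.0] * len(rows[0])  # all-padded sample: defined, finite output
    return [sum(col) / len(kept) for col in zip(*kept)]


def leaky_mean_model(rows, padding_mask):
    """Deliberately buggy model that ignores the mask."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def padding_is_ignored(model, rows, padding_mask, corrupt_value=1e6):
    """Overwrite padded rows with a large value and compare the outputs."""
    corrupted = [
        [corrupt_value] * len(row) if is_pad else list(row)
        for row, is_pad in zip(rows, padding_mask)
    ]
    return model(rows, padding_mask) == model(corrupted, padding_mask)


rows = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
padding_mask = [False, False, True]

correct_passes = padding_is_ignored(masked_mean_model, rows, padding_mask)
leak_detected = not padding_is_ignored(leaky_mean_model, rows, padding_mask)
all_padded_output = masked_mean_model(rows, [True, True, True])
```

In a pytest suite each of these checks would become its own test function, with the toy model replaced by the model under test and exact equality replaced by a tolerance-based comparison.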
+For substantial changes, also run:
+```bash
+.venv/bin/ruff format --check pio_arch/models/ pio_arch/utils/ tests/
+.venv/bin/ruff check pio_arch/models/ pio_arch/utils/ tests/
+env PRE_COMMIT_HOME=/private/tmp/pinn_models_precommit_cache .venv/bin/pre-commit run --all-files
+```
+## Model Review Checklist
+When reviewing or implementing model code, check:
+- No positional encoding or order-dependent feature injection.
+- Aggregation is order agnostic: sum, mean, PMA seeds, CLS latent, or slot mean.
+- Padded positions are excluded from attention via `key_padding_mask`.
+- Masked positions are excluded before pooling/aggregation.
+- Shapes are consistently `[B, N, dim]` internally.
+- Invariant encoders return fixed-size `[B, dim]` tensors or dictionaries of
+  per-task `[B, 1]` predictions.
+- Standard attention softmax is over keys.
+- Residual connections and normalization are present in transformer-style blocks.
+- Learnable latents or task queries are not initialized to all zeros.
+- Masked multi-task losses normalize each task over observed targets only.
+## Dependencies
+- Runtime dependencies are declared inline in `[project] dependencies` in
+  `pyproject.toml` and should stay minimal: `torch>=2.0` and `numpy>=1.24`.
+  The torch constraint is forked with an environment marker because PyTorch
+  dropped macOS x86_64 wheels at torch 2.3.0; Intel Mac users are capped at
+  `<2.3`, everyone else stays free to upgrade.
+- Model-development extras live under `[project.optional-dependencies]` as
+  `model-dev`, including notebook/prototype tooling such as `polars`,
+  `pyarrow`, `sentence-transformers`, `transformers`, and `optuna`
+  (hyperparameter sweeps).
+- Test, lint, docs, and dev tooling live in PEP 735 `[dependency-groups]`.
+- See `.claude/agents/python-expert.md` and `.claude/uv-guide.md` before
+  changing packaging or dependency configuration.
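The forked torch constraint described in the first bullet can be paraphrased in Python. This is a sketch of the selection logic only: the exact requirement strings and marker expression in `pyproject.toml` are not shown in this guide, so the strings below are assumptions built from the stated bounds. `sys_platform` and `platform_machine` are the standard environment marker names, which correspond to `sys.platform` and `platform.machine()`.

```python
import platform
import sys


def torch_requirement(sys_platform=None, platform_machine=None):
    """Pick the torch requirement an installer would select for a given environment."""
    sys_platform = sys.platform if sys_platform is None else sys_platform
    platform_machine = platform.machine() if platform_machine is None else platform_machine
    if sys_platform == "darwin" and platform_machine == "x86_64":
        return "torch>=2.0,<2.3"  # Intel Macs: no macOS x86_64 wheels from torch 2.3.0 on
    return "torch>=2.0"  # everyone else stays free to upgrade


intel_mac = torch_requirement("darwin", "x86_64")
apple_silicon = torch_requirement("darwin", "arm64")
linux = torch_requirement("linux", "x86_64")
```

Calling `torch_requirement()` with no arguments reports the branch that applies to the current interpreter.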
+## Documentation
+Sphinx docs live in `docs/`:
+```bash
+.venv/bin/sphinx-build -b html docs docs/_build/html
+```
+Example notebooks live in `notebooks/`.
+PIO prototype notes live in:
+- `PIO_HANDOFF.md` - current handoff and recommended next steps.
+- `notebooks/pio_encoder_design.md` - implemented encoder/attention design.
+- `notebooks/pio_sentence_model_prototype.ipynb` - sentence-only prototype.
+- `notebooks/pio_text_embedding_plan.md` - text embedding and deployment notes.
+## Symlink / Redundancy Layout
+Current layout:
+- `AGENT_GUIDE.shared.md` - single maintained body of cross-agent guidance
+- `AGENTS.md -> AGENT_GUIDE.shared.md` - Codex and other AGENTS-aware tools
+- `GEMINI.md -> AGENT_GUIDE.shared.md` - Gemini CLI project context
+- `.claude/CLAUDE.md` - Claude Code-specific guide and imports
+Do not make these files direct symlinks to `.claude/CLAUDE.md` unless every
+agent that reads this repo understands Claude's `@file` import syntax and
+Claude-specific conventions. A direct symlink would minimize files but would
+couple Codex and Gemini CLI to Claude Code mechanics such as slash skills,
+subagent frontmatter, and settings.
+When updating guidance:
+1. Put tool-neutral project rules in this file.
+2. Put Claude-specific orchestration, skills, permissions, or subagent details
+   under `.claude/`.
+3. If a tool cannot follow symlinks, replace its symlink with a tiny wrapper that
+   imports this file using that tool's supported include syntax.