tacit-citadel 0.1.0__tar.gz

Files changed (16)

tacit_citadel-0.1.0/.gitignore +218 -0
tacit_citadel-0.1.0/PKG-INFO +312 -0
tacit_citadel-0.1.0/README.md +294 -0
tacit_citadel-0.1.0/policy.yaml +110 -0
tacit_citadel-0.1.0/pyproject.toml +51 -0
tacit_citadel-0.1.0/sample.jsonl +1 -0
tacit_citadel-0.1.0/src/tacit_citadel/__init__.py +5 -0
tacit_citadel-0.1.0/src/tacit_citadel/actions.py +242 -0
tacit_citadel-0.1.0/src/tacit_citadel/cli.py +25 -0
tacit_citadel-0.1.0/src/tacit_citadel/llm.py +222 -0
tacit_citadel-0.1.0/src/tacit_citadel/main.py +108 -0
tacit_citadel-0.1.0/src/tacit_citadel/policy.py +117 -0
tacit_citadel-0.1.0/tests/test_actions.py +231 -0
tacit_citadel-0.1.0/tests/test_llm.py +278 -0
tacit_citadel-0.1.0/tests/test_main.py +277 -0
tacit_citadel-0.1.0/tests/test_policy.py +199 -0

tacit_citadel-0.1.0/.gitignore ADDED

@@ -0,0 +1,218 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[codz]
+*$py.class
+# C extensions
+*.so
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+# PyInstaller
+#   Usually these files are written by a python script from a template
+#   before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py.cover
+.hypothesis/
+.pytest_cache/
+cover/
+# Translations
+*.mo
+*.pot
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+# Flask stuff:
+instance/
+.webassets-cache
+# Scrapy stuff:
+.scrapy
+# Sphinx documentation
+docs/_build/
+# PyBuilder
+.pybuilder/
+target/
+# Jupyter Notebook
+.ipynb_checkpoints
+# IPython
+profile_default/
+ipython_config.py
+# pyenv
+#   For a library or package, you might want to ignore these files since the code is
+#   intended to run in multiple environments; otherwise, check them in:
+# .python-version
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+# Pipfile.lock
+# UV
+#   Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
+#   This is especially recommended for binary packages to ensure reproducibility, and is more
+#   commonly ignored for libraries.
+# uv.lock
+# poetry
+#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+#   This is especially recommended for binary packages to ensure reproducibility, and is more
+#   commonly ignored for libraries.
+#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+# poetry.lock
+# poetry.toml
+# pdm
+#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+#   pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python.
+#   https://pdm-project.org/en/latest/usage/project/#working-with-version-control
+# pdm.lock
+# pdm.toml
+.pdm-python
+.pdm-build/
+# pixi
+#   Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control.
+# pixi.lock
+#   Pixi creates a virtual environment in the .pixi directory, just like venv module creates one
+#   in the .venv directory. It is recommended not to include this directory in version control.
+.pixi
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+# Redis
+*.rdb
+*.aof
+*.pid
+# RabbitMQ
+mnesia/
+rabbitmq/
+rabbitmq-data/
+# ActiveMQ
+activemq-data/
+# SageMath parsed files
+*.sage.py
+# Environments
+.env
+.envrc
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+# Spyder project settings
+.spyderproject
+.spyproject
+# Rope project settings
+.ropeproject
+# mkdocs documentation
+/site
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+# Pyre type checker
+.pyre/
+# pytype static type analyzer
+.pytype/
+# Cython debug symbols
+cython_debug/
+# PyCharm
+#   JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+#   be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+#   and can be added to the global gitignore or merged into this file.  For a more nuclear
+#   option (not recommended) you can uncomment the following to ignore the entire idea folder.
+# .idea/
+# Abstra
+#   Abstra is an AI-powered process automation framework.
+#   Ignore directories containing user credentials, local state, and settings.
+#   Learn more at https://abstra.io/docs
+.abstra/
+# Visual Studio Code
+#   Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore
+#   that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
+#   and can be added to the global gitignore or merged into this file. However, if you prefer,
+#   you could uncomment the following to ignore the entire vscode folder
+# .vscode/
+# Temporary file for partial code execution
+tempCodeRunnerFile.py
+# Ruff stuff:
+.ruff_cache/
+# PyPI configuration file
+.pypirc
+# Marimo
+marimo/_static/
+marimo/_lsp/
+__marimo__/
+# Streamlit
+.streamlit/secrets.toml

tacit_citadel-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,312 @@
+Metadata-Version: 2.4
+Name: tacit-citadel
+Version: 0.1.0
+Summary: GPU-powered Structured Data De-identification Engine
+Requires-Python: <3.14,>=3.11
+Requires-Dist: click>=8.4.1
+Requires-Dist: numpy>=2; python_full_version >= '3.13'
+Requires-Dist: openai>=2.41.1
+Requires-Dist: presidio-analyzer>=2.2.362
+Requires-Dist: presidio-anonymizer>=2.2.362
+Requires-Dist: pydantic>=2.13.4
+Requires-Dist: pydash>=8.0.6
+Requires-Dist: pyjq>=2.6.0
+Requires-Dist: pyyaml>=6.0.3
+Provides-Extra: cuda
+Requires-Dist: spacy[cuda12x]>=3.8.14; (python_full_version == '3.12.*' and sys_platform == 'linux') and extra == 'cuda'
+Description-Content-Type: text/markdown
+# Citadel
+Citadel is a policy-driven de-identification tool for JSONL training and
+evaluation data. It reads one JSON object per line, applies a versioned YAML
+policy, and writes a compact de-identified JSONL file beside the input.
+The normal command is:
+```bash
+uv run tacit.citadel policy.yaml input.jsonl
+```
+For `input.jsonl`, Citadel writes:
+```text
+input.citadel.jsonl
+```
+The CLI currently takes exactly two positional arguments: the policy file and
+the input JSONL file. The output path is always derived from the input path.
+## Setup
+Citadel is packaged as `tacit-citadel` and exposes the `tacit.citadel` console
+script.
+The project supports Python `>=3.11,<3.14`; the checked-in `.python-version`
+currently selects Python 3.11.
+```bash
+uv sync
+```
+`pyjq` is a runtime dependency and builds native code when no compatible wheel
+is available. On macOS, make sure Xcode command line tools and the autotools
+chain are installed. If setup fails with `No such file or directory:
+'autoreconf'`, install the missing tools and rerun `uv sync`.
+```bash
+brew install autoconf automake libtool
+```
+If the `pyjq` build finds the command line tools but fails with `stdlib.h` not
+found, pass the active macOS SDK path into the build:
+```bash
+SDKROOT=$(xcrun --show-sdk-path) uv sync
+```
+CUDA-enabled spaCy is available as an optional extra for Linux CPython 3.12:
+```bash
+uv sync --extra cuda
+```
+The default install path does not install CUDA spaCy packages.
+## Usage
+Run the sample policy against the sample record:
+```bash
+uv run tacit.citadel policy.yaml sample.jsonl
+```
+This creates:
+```text
+sample.citadel.jsonl
+```
+On success, Citadel prints a short report:
+```text
+output: sample.citadel.jsonl
+records processed: 1
+fields changed: 7
+llm calls: 1
+epoch seed: 1787680000
+```
+The `epoch seed` is generated from the current Unix time unless `process_file`
+is called directly with an explicit `epoch_seed`.
+## Input
+Citadel expects JSONL. Each line must be a complete JSON object.
+```json
+{"client_id":"007","intake_details":{"date":"2026-01-05","weight":102.4}}
+```
+Non-object JSONL lines fail the run. Citadel processes records in chunks of 50
+and writes one compact JSON object per output line.
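The input contract above (one JSON object per line, chunks of 50, compact output) might be sketched as follows. The function names are hypothetical and the error types are assumptions; the package may report failures differently, and its handling of blank lines is not documented here.

```python
import json
from itertools import islice
from typing import Iterable, Iterator

CHUNK_SIZE = 50  # chunk size stated in the README


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per line; fail on invalid JSON or non-object values."""
    for number, line in enumerate(lines, start=1):
        # json.loads raises json.JSONDecodeError for invalid JSON. In this
        # sketch an empty line also fails (assumption).
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(
                f"line {number}: expected a JSON object, got {type(record).__name__}"
            )
        yield record


def chunked(records: Iterable[dict], size: int = CHUNK_SIZE) -> Iterator[list[dict]]:
    """Group records into lists of at most `size` items."""
    iterator = iter(records)
    while chunk := list(islice(iterator, size)):
        yield chunk


def to_compact_line(record: dict) -> str:
    """Serialize without spaces after separators, one object per line."""
    return json.dumps(record, separators=(",", ":"), ensure_ascii=False)
```

Both helpers are generators, so a large file is never held in memory as a whole; only the current chunk is materialized.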
+## Policy
+Policy files are YAML mappings validated with Pydantic. Extra fields are
+rejected.
+Required top-level fields:
+```yaml
+version: 1
+name: nourish-intake-and-trajectory
+description: De-identification policy for Nourish-style records.
+llm:
+  base_url: http://127.0.0.1:8000/v1
+  model: google/gemma-4-12B-it-qat-w4a16-ct
+  temperature: 1.0
+  top_p: 0.95
+  top_k: 64
+rules:
+  - path: .client_id
+    action: drop
+```
+Each rule has:
+```yaml
+- path: .jq.selector
+  action: drop
+  required: true
+  params: {}
+```
+`path` is a jq selector. Citadel resolves selectors through `pyjq` and applies
+actions to the concrete JSON locations returned by `path(...)`.
+`required` defaults to `true`. If a required rule matches nothing, the run
+fails. Use `required: false` for sparse paths that are absent from some records.
+## Actions
+Citadel currently supports four actions.
+### `drop`
+Removes the matched object field.
+```yaml
+- path: .client_id
+  action: drop
+```
+`drop` only deletes object fields. It does not remove array elements.
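To illustrate applying `drop` at a concrete location, here is a minimal sketch that works on a path array of the kind jq's `path(...)` emits, for example `["trajectories", 0, "date"]`. It does not call `pyjq`; the helper `drop_at` is hypothetical, and raising `TypeError` when the target is an array element is an assumption of this sketch, not confirmed package behaviour.

```python
from typing import Any


def drop_at(record: Any, path: list) -> None:
    """Delete the object field addressed by a jq-style path array, in place."""
    if not path:
        raise ValueError("cannot drop the root value")
    parent = record
    # Walk to the container that holds the final key.
    for key in path[:-1]:
        parent = parent[key]
    # Assumption: a target inside an array is treated as an error.
    if not isinstance(parent, dict):
        raise TypeError("drop only deletes object fields, not array elements")
    del parent[path[-1]]


record = {"client_id": "007", "trajectories": [{"type": "set_target", "date": "2026-01-12"}]}
drop_at(record, ["client_id"])
drop_at(record, ["trajectories", 0, "date"])
print(record)  # {'trajectories': [{'type': 'set_target'}]}
```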
+### `fuzz_number`
+Shifts numeric values while preserving approximate modelling signal. Boolean
+and non-numeric values are rejected.
+Percent mode:
+```yaml
+- path: .intake_details.weight
+  action: fuzz_number
+  params:
+    mode: percent
+    max_percent: 5
+    precision: 1
+```
+Range mode:
+```yaml
+- path: .intake_details.age
+  action: fuzz_number
+  params:
+    mode: range
+    min_delta: -2
+    max_delta: 2
+    step: 1
+```
+The random generator is seeded once per run. Integer inputs stay integers when
+the fuzzed value is integral.
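The two `fuzz_number` modes can be sketched as below. This is an illustration under stated assumptions, not the package's implementation: the function names are hypothetical, the uniform distribution is assumed, and the way `step` discretizes the range is a guess based on the parameter names.

```python
import random
import time


def _require_number(value):
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(f"fuzz_number needs a number, got {type(value).__name__}")


def _keep_int(original, fuzzed):
    # Integer inputs stay integers when the fuzzed value is integral.
    if isinstance(original, int) and float(fuzzed).is_integer():
        return int(fuzzed)
    return fuzzed


def fuzz_percent(value, rng, max_percent, precision):
    """Scale by a random factor within +/- max_percent, then round."""
    _require_number(value)
    shift = rng.uniform(-max_percent, max_percent) / 100  # assumed uniform
    return _keep_int(value, round(value * (1 + shift), precision))


def fuzz_range(value, rng, min_delta, max_delta, step):
    """Add a random delta drawn from min_delta, min_delta + step, ..., up to max_delta."""
    _require_number(value)
    steps = int((max_delta - min_delta) // step)
    delta = min_delta + step * rng.randint(0, steps)
    return _keep_int(value, value + delta)


# One generator per run, seeded from the current Unix time (int conversion assumed).
epoch_seed = int(time.time())
rng = random.Random(epoch_seed)
print(fuzz_percent(102.4, rng, max_percent=5, precision=1))
print(fuzz_range(41, rng, min_delta=-2, max_delta=2, step=1))
```

Passing the same `epoch_seed` to `random.Random` reproduces the same sequence of shifts, which is what makes a run repeatable when the seed is supplied explicitly.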
+### `date_offset`
+Replaces an absolute date with a human-readable offset from an anchor date in
+the same record.
+```yaml
+- path: .trajectories[] | select(.type == "set_target").date
+  action: date_offset
+  required: false
+  params:
+    anchor_path: .intake_details.date
+    output: human_relative
+```
+Supported output strings are:
+```text
+same day
+N day after
+N days after
+N day before
+N days before
+```
+Date values must be strings accepted by Python's ISO date/datetime parser.
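A minimal sketch of the `human_relative` output, assuming the offset is counted in whole calendar days and any time-of-day component is ignored. The function name is hypothetical and the code is not taken from the package.

```python
from datetime import datetime


def human_relative(value: str, anchor: str) -> str:
    """Describe `value` as a day offset from `anchor`; both are ISO strings."""
    # datetime.fromisoformat accepts date-only and datetime strings.
    target_day = datetime.fromisoformat(value).date()
    anchor_day = datetime.fromisoformat(anchor).date()
    days = (target_day - anchor_day).days
    if days == 0:
        return "same day"
    unit = "day" if abs(days) == 1 else "days"
    direction = "after" if days > 0 else "before"
    return f"{abs(days)} {unit} {direction}"


print(human_relative("2026-01-12", "2026-01-05"))  # 7 days after
print(human_relative("2026-01-04", "2026-01-05"))  # 1 day before
```

Comparing `.date()` values rather than full datetimes avoids errors when one string carries a timezone and the other does not.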
+### `llm_rewrite`
+Queues selected string fields for rewriting through an OpenAI-compatible chat
+completion endpoint.
+```yaml
+- path: .trajectories[] | select(.type == "messages").thread[].content
+  action: llm_rewrite
+  required: false
+  params:
+    system_prompt: You are a high-recall sensitive-data anonymizer.
+    user_prompt: |
+      Rewrite the INPUT text by replacing sensitive values with typed
+      placeholders. Return only the rewritten text.
+      INPUT
+      {{content}}
+```
+Only the matched field value is sent to the model. `{{content}}` in the system
+or user prompt is replaced with that selected text.
+The LLM client uses the policy's `llm.base_url`, `llm.model`, `temperature`,
+`top_p`, and `top_k`. The API key is set to `not-needed`, which matches local
+OpenAI-compatible servers such as vLLM.
+Within a run, duplicate source text is rewritten once and reused from an
+in-memory cache. Cache misses in the same chunk are submitted concurrently.
+If a rewrite request fails or is cancelled, Citadel writes
+`<LLM_REWRITE_FAILED>` into that field and continues the run.
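The caching, concurrency, and failure behaviour described above could look roughly like this sketch. It uses a caller-supplied coroutine instead of the real OpenAI-compatible client, all names are hypothetical, and not caching failed rewrites is an assumption of this sketch.

```python
import asyncio
from typing import Awaitable, Callable

FAILED = "<LLM_REWRITE_FAILED>"  # placeholder named in the README


def render_prompt(template: str, content: str) -> str:
    """Substitute the selected field text for {{content}}."""
    return template.replace("{{content}}", content)


async def rewrite_chunk(
    texts: list[str],
    rewrite: Callable[[str], Awaitable[str]],
    cache: dict[str, str],
) -> list[str]:
    """Rewrite each text once, reuse cached results, tolerate failures."""
    # Unique cache misses, in first-seen order.
    misses = list(dict.fromkeys(t for t in texts if t not in cache))
    # Submit all misses concurrently; exceptions and cancellations come
    # back as result values instead of aborting the whole chunk.
    results = await asyncio.gather(
        *(rewrite(t) for t in misses), return_exceptions=True
    )
    for text, result in zip(misses, results):
        if not isinstance(result, BaseException):
            cache[text] = result  # assumption: only successes are cached
    return [cache[t] if t in cache else FAILED for t in texts]
```

Passing the same `cache` dictionary to every chunk of a run gives the once-per-run rewrite of duplicate text that the README describes.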
+To smoke-test a local rewrite server directly:
+```bash
+uv run python -m tacit_citadel.llm \
+  --base-url http://127.0.0.1:8000/v1 \
+  --model google/gemma-4-12B-it-qat-w4a16-ct \
+  --text "Hi Jamie, your appointment is on January 12."
+```
+## Processing Model
+For each run, Citadel:
+1. Validates the policy YAML.
+2. Opens the input JSONL file.
+3. Parses each line as a JSON object.
+4. Applies policy rules in order.
+5. Resolves jq selectors to concrete JSON locations.
+6. Queues and runs LLM rewrites for each 50-record chunk.
+7. Writes compact JSONL to a temporary output file.
+8. Atomically replaces the derived output path after the full run succeeds.
+9. Prints a short report.
+If a fatal error occurs before replacement, Citadel deletes the temporary file.
+An existing output file is preserved.
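Steps 7 and 8 and the cleanup rule can be sketched with a temporary file in the output directory followed by `os.replace`. This illustrates the general technique; the helper name and temp-file naming are hypothetical and the package may implement it differently.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable


def write_atomically(output_path: Path, lines: Iterable[str]) -> None:
    """Write lines to a temp file, then swap it into place on success."""
    # Create the temp file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=output_path.parent, prefix=output_path.name, suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            for line in lines:
                handle.write(line + "\n")
        # Replaces any existing output file in a single step.
        os.replace(tmp_name, output_path)
    except BaseException:
        # On any failure, remove the temp file and leave the old output alone.
        Path(tmp_name).unlink(missing_ok=True)
        raise
```

Because the existing output is only touched by the final `os.replace`, a run that fails midway leaves the previous file byte-for-byte intact.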
+## Failure Behavior
+Citadel fails the run for:
+* missing policy or input files
+* invalid policy YAML or unsupported policy fields
+* invalid JSONL
+* JSONL lines that are not objects
+* invalid jq selectors
+* unmatched required rule paths
+* action type errors, such as applying `fuzz_number` to a string
+* invalid or missing `date_offset` anchors
+LLM rewrite request failures are nonfatal. The failed field is replaced with
+`<LLM_REWRITE_FAILED>` and processing continues.
+## Development
+Run the test suite:
+```bash
+uv run pytest
+```
+Run the lightweight checks:
+```bash
+uv run ruff check .
+uv run ty check
+```