tacit-citadel 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,218 @@
1
+ # Byte-compiled / optimized / DLL files
2
+ __pycache__/
3
+ *.py[codz]
4
+ *$py.class
5
+
6
+ # C extensions
7
+ *.so
8
+
9
+ # Distribution / packaging
10
+ .Python
11
+ build/
12
+ develop-eggs/
13
+ dist/
14
+ downloads/
15
+ eggs/
16
+ .eggs/
17
+ lib/
18
+ lib64/
19
+ parts/
20
+ sdist/
21
+ var/
22
+ wheels/
23
+ share/python-wheels/
24
+ *.egg-info/
25
+ .installed.cfg
26
+ *.egg
27
+ MANIFEST
28
+
29
+ # PyInstaller
30
+ # Usually these files are written by a python script from a template
31
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
32
+ *.manifest
33
+ *.spec
34
+
35
+ # Installer logs
36
+ pip-log.txt
37
+ pip-delete-this-directory.txt
38
+
39
+ # Unit test / coverage reports
40
+ htmlcov/
41
+ .tox/
42
+ .nox/
43
+ .coverage
44
+ .coverage.*
45
+ .cache
46
+ nosetests.xml
47
+ coverage.xml
48
+ *.cover
49
+ *.py.cover
50
+ .hypothesis/
51
+ .pytest_cache/
52
+ cover/
53
+
54
+ # Translations
55
+ *.mo
56
+ *.pot
57
+
58
+ # Django stuff:
59
+ *.log
60
+ local_settings.py
61
+ db.sqlite3
62
+ db.sqlite3-journal
63
+
64
+ # Flask stuff:
65
+ instance/
66
+ .webassets-cache
67
+
68
+ # Scrapy stuff:
69
+ .scrapy
70
+
71
+ # Sphinx documentation
72
+ docs/_build/
73
+
74
+ # PyBuilder
75
+ .pybuilder/
76
+ target/
77
+
78
+ # Jupyter Notebook
79
+ .ipynb_checkpoints
80
+
81
+ # IPython
82
+ profile_default/
83
+ ipython_config.py
84
+
85
+ # pyenv
86
+ # For a library or package, you might want to ignore these files since the code is
87
+ # intended to run in multiple environments; otherwise, check them in:
88
+ # .python-version
89
+
90
+ # pipenv
91
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
92
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
93
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
94
+ # install all needed dependencies.
95
+ # Pipfile.lock
96
+
97
+ # UV
98
+ # Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
99
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
100
+ # commonly ignored for libraries.
101
+ # uv.lock
102
+
103
+ # poetry
104
+ # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
105
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
106
+ # commonly ignored for libraries.
107
+ # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
108
+ # poetry.lock
109
+ # poetry.toml
110
+
111
+ # pdm
112
+ # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
113
+ # pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python.
114
+ # https://pdm-project.org/en/latest/usage/project/#working-with-version-control
115
+ # pdm.lock
116
+ # pdm.toml
117
+ .pdm-python
118
+ .pdm-build/
119
+
120
+ # pixi
121
+ # Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control.
122
+ # pixi.lock
123
+ # Pixi creates a virtual environment in the .pixi directory, just like venv module creates one
124
+ # in the .venv directory. It is recommended not to include this directory in version control.
125
+ .pixi
126
+
127
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
128
+ __pypackages__/
129
+
130
+ # Celery stuff
131
+ celerybeat-schedule
132
+ celerybeat.pid
133
+
134
+ # Redis
135
+ *.rdb
136
+ *.aof
137
+ *.pid
138
+
139
+ # RabbitMQ
140
+ mnesia/
141
+ rabbitmq/
142
+ rabbitmq-data/
143
+
144
+ # ActiveMQ
145
+ activemq-data/
146
+
147
+ # SageMath parsed files
148
+ *.sage.py
149
+
150
+ # Environments
151
+ .env
152
+ .envrc
153
+ .venv
154
+ env/
155
+ venv/
156
+ ENV/
157
+ env.bak/
158
+ venv.bak/
159
+
160
+ # Spyder project settings
161
+ .spyderproject
162
+ .spyproject
163
+
164
+ # Rope project settings
165
+ .ropeproject
166
+
167
+ # mkdocs documentation
168
+ /site
169
+
170
+ # mypy
171
+ .mypy_cache/
172
+ .dmypy.json
173
+ dmypy.json
174
+
175
+ # Pyre type checker
176
+ .pyre/
177
+
178
+ # pytype static type analyzer
179
+ .pytype/
180
+
181
+ # Cython debug symbols
182
+ cython_debug/
183
+
184
+ # PyCharm
185
+ # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
186
+ # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
187
+ # and can be added to the global gitignore or merged into this file. For a more nuclear
188
+ # option (not recommended) you can uncomment the following to ignore the entire idea folder.
189
+ # .idea/
190
+
191
+ # Abstra
192
+ # Abstra is an AI-powered process automation framework.
193
+ # Ignore directories containing user credentials, local state, and settings.
194
+ # Learn more at https://abstra.io/docs
195
+ .abstra/
196
+
197
+ # Visual Studio Code
198
+ # Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore
199
+ # that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
200
+ # and can be added to the global gitignore or merged into this file. However, if you prefer,
201
+ # you could uncomment the following to ignore the entire vscode folder
202
+ # .vscode/
203
+ # Temporary file for partial code execution
204
+ tempCodeRunnerFile.py
205
+
206
+ # Ruff stuff:
207
+ .ruff_cache/
208
+
209
+ # PyPI configuration file
210
+ .pypirc
211
+
212
+ # Marimo
213
+ marimo/_static/
214
+ marimo/_lsp/
215
+ __marimo__/
216
+
217
+ # Streamlit
218
+ .streamlit/secrets.toml
@@ -0,0 +1,312 @@
1
+ Metadata-Version: 2.4
2
+ Name: tacit-citadel
3
+ Version: 0.1.0
4
+ Summary: GPU-powered Structured Data De-identification Engine
5
+ Requires-Python: <3.14,>=3.11
6
+ Requires-Dist: click>=8.4.1
7
+ Requires-Dist: numpy>=2; python_full_version >= '3.13'
8
+ Requires-Dist: openai>=2.41.1
9
+ Requires-Dist: presidio-analyzer>=2.2.362
10
+ Requires-Dist: presidio-anonymizer>=2.2.362
11
+ Requires-Dist: pydantic>=2.13.4
12
+ Requires-Dist: pydash>=8.0.6
13
+ Requires-Dist: pyjq>=2.6.0
14
+ Requires-Dist: pyyaml>=6.0.3
15
+ Provides-Extra: cuda
16
+ Requires-Dist: spacy[cuda12x]>=3.8.14; (python_full_version == '3.12.*' and sys_platform == 'linux') and extra == 'cuda'
17
+ Description-Content-Type: text/markdown
18
+
19
+ # Citadel
20
+
21
+ Citadel is a policy-driven de-identification tool for JSONL training and
22
+ evaluation data. It reads one JSON object per line, applies a versioned YAML
23
+ policy, and writes a compact de-identified JSONL file beside the input.
24
+
25
+ The normal command is:
26
+
27
+ ```bash
28
+ uv run tacit.citadel policy.yaml input.jsonl
29
+ ```
30
+
31
+ For `input.jsonl`, Citadel writes:
32
+
33
+ ```text
34
+ input.citadel.jsonl
35
+ ```
36
+
37
+ The CLI currently takes exactly two positional arguments: the policy file and
38
+ the input JSONL file. The output path is always derived from the input path.
39
+
40
+ ## Setup
41
+
42
+ Citadel is packaged as `tacit-citadel` and exposes the `tacit.citadel` console
43
+ script.
44
+
45
+ The project supports Python `>=3.11,<3.14`; the checked-in `.python-version`
46
+ currently selects Python 3.11.
47
+
48
+ ```bash
49
+ uv sync
50
+ ```
51
+
52
+ `pyjq` is a runtime dependency and builds native code when no compatible wheel
53
+ is available. On macOS, make sure Xcode command line tools and the autotools
54
+ chain are installed. If setup fails with `No such file or directory:
55
+ 'autoreconf'`, install the missing tools and rerun `uv sync`.
56
+
57
+ ```bash
58
+ brew install autoconf automake libtool
59
+ ```
60
+
61
+ If the `pyjq` build finds the command line tools but fails with `stdlib.h` not
62
+ found, pass the active macOS SDK path into the build:
63
+
64
+ ```bash
65
+ SDKROOT=$(xcrun --show-sdk-path) uv sync
66
+ ```
67
+
68
+ CUDA-enabled spaCy is available as an optional extra for Linux CPython 3.12:
69
+
70
+ ```bash
71
+ uv sync --extra cuda
72
+ ```
73
+
74
+ The default install path does not install CUDA spaCy packages.
75
+
76
+ ## Usage
77
+
78
+ Run the sample policy against the sample record:
79
+
80
+ ```bash
81
+ uv run tacit.citadel policy.yaml sample.jsonl
82
+ ```
83
+
84
+ This creates:
85
+
86
+ ```text
87
+ sample.citadel.jsonl
88
+ ```
89
+
90
+ On success, Citadel prints a short report:
91
+
92
+ ```text
93
+ output: sample.citadel.jsonl
94
+ records processed: 1
95
+ fields changed: 7
96
+ llm calls: 1
97
+ epoch seed: 1787680000
98
+ ```
99
+
100
+ The `epoch seed` is generated from the current Unix time unless `process_file`
101
+ is called directly with an explicit `epoch_seed`.
102
+
103
+ ## Input
104
+
105
+ Citadel expects JSONL. Each line must be a complete JSON object.
106
+
107
+ ```json
108
+ {"client_id":"007","intake_details":{"date":"2026-01-05","weight":102.4}}
109
+ ```
110
+
111
+ Non-object JSONL lines fail the run. Citadel processes records in chunks of 50
112
+ and writes one compact JSON object per output line.
113
+
114
+ ## Policy
115
+
116
+ Policy files are YAML mappings validated with Pydantic. Extra fields are
117
+ rejected.
118
+
119
+ Required top-level fields:
120
+
121
+ ```yaml
122
+ version: 1
123
+ name: nourish-intake-and-trajectory
124
+ description: De-identification policy for Nourish-style records.
125
+
126
+ llm:
127
+ base_url: http://127.0.0.1:8000/v1
128
+ model: google/gemma-4-12B-it-qat-w4a16-ct
129
+ temperature: 1.0
130
+ top_p: 0.95
131
+ top_k: 64
132
+
133
+ rules:
134
+ - path: .client_id
135
+ action: drop
136
+ ```
137
+
138
+ Each rule has:
139
+
140
+ ```yaml
141
+ - path: .jq.selector
142
+ action: drop
143
+ required: true
144
+ params: {}
145
+ ```
146
+
147
+ `path` is a jq selector. Citadel resolves selectors through `pyjq` and applies
148
+ actions to the concrete JSON locations returned by `path(...)`.
149
+
150
+ `required` defaults to `true`. If a required rule matches nothing, the run
151
+ fails. Use `required: false` for sparse paths that are absent from some records.
152
+
153
+ ## Actions
154
+
155
+ Citadel currently supports four actions.
156
+
157
+ ### `drop`
158
+
159
+ Removes the matched object field.
160
+
161
+ ```yaml
162
+ - path: .client_id
163
+ action: drop
164
+ ```
165
+
166
+ `drop` only deletes object fields. It does not remove array elements.
167
+
168
+ ### `fuzz_number`
169
+
170
+ Shifts numeric values while preserving approximate modelling signal. Boolean
171
+ and non-numeric values are rejected.
172
+
173
+ Percent mode:
174
+
175
+ ```yaml
176
+ - path: .intake_details.weight
177
+ action: fuzz_number
178
+ params:
179
+ mode: percent
180
+ max_percent: 5
181
+ precision: 1
182
+ ```
183
+
184
+ Range mode:
185
+
186
+ ```yaml
187
+ - path: .intake_details.age
188
+ action: fuzz_number
189
+ params:
190
+ mode: range
191
+ min_delta: -2
192
+ max_delta: 2
193
+ step: 1
194
+ ```
195
+
196
+ The random generator is seeded once per run. Integer inputs stay integers when
197
+ the fuzzed value is integral.
198
+
199
+ ### `date_offset`
200
+
201
+ Replaces an absolute date with a human-readable offset from an anchor date in
202
+ the same record.
203
+
204
+ ```yaml
205
+ - path: .trajectories[] | select(.type == "set_target").date
206
+ action: date_offset
207
+ required: false
208
+ params:
209
+ anchor_path: .intake_details.date
210
+ output: human_relative
211
+ ```
212
+
213
+ Supported output strings are:
214
+
215
+ ```text
216
+ same day
217
+ N day after
218
+ N days after
219
+ N day before
220
+ N days before
221
+ ```
222
+
223
+ Date values must be strings accepted by Python's ISO date/datetime parser.
224
+
225
+ ### `llm_rewrite`
226
+
227
+ Queues selected string fields for rewriting through an OpenAI-compatible chat
228
+ completion endpoint.
229
+
230
+ ```yaml
231
+ - path: .trajectories[] | select(.type == "messages").thread[].content
232
+ action: llm_rewrite
233
+ required: false
234
+ params:
235
+ system_prompt: You are a high-recall sensitive-data anonymizer.
236
+ user_prompt: |
237
+ Rewrite the INPUT text by replacing sensitive values with typed
238
+ placeholders. Return only the rewritten text.
239
+
240
+ INPUT
241
+ {{content}}
242
+ ```
243
+
244
+ Only the matched field value is sent to the model. `{{content}}` in the system
245
+ or user prompt is replaced with that selected text.
246
+
247
+ The LLM client uses the policy's `llm.base_url`, `llm.model`, `temperature`,
248
+ `top_p`, and `top_k`. The API key is set to `not-needed`, which matches local
249
+ OpenAI-compatible servers such as vLLM.
250
+
251
+ Within a run, duplicate source text is rewritten once and reused from an
252
+ in-memory cache. Cache misses in the same chunk are submitted concurrently.
253
+
254
+ If a rewrite request fails or is cancelled, Citadel writes
255
+ `<LLM_REWRITE_FAILED>` into that field and continues the run.
256
+
257
+ To smoke-test a local rewrite server directly:
258
+
259
+ ```bash
260
+ uv run python -m tacit_citadel.llm \
261
+ --base-url http://127.0.0.1:8000/v1 \
262
+ --model google/gemma-4-12B-it-qat-w4a16-ct \
263
+ --text "Hi Jamie, your appointment is on January 12."
264
+ ```
265
+
266
+ ## Processing Model
267
+
268
+ For each run, Citadel:
269
+
270
+ 1. Validates the policy YAML.
271
+ 2. Opens the input JSONL file.
272
+ 3. Parses each line as a JSON object.
273
+ 4. Applies policy rules in order.
274
+ 5. Resolves jq selectors to concrete JSON locations.
275
+ 6. Queues and runs LLM rewrites for each 50-record chunk.
276
+ 7. Writes compact JSONL to a temporary output file.
277
+ 8. Atomically replaces the derived output path after the full run succeeds.
278
+ 9. Prints a short report.
279
+
280
+ If a fatal error occurs before replacement, Citadel deletes the temporary file.
281
+ An existing output file is preserved.
282
+
283
+ ## Failure Behavior
284
+
285
+ Citadel fails the run for:
286
+
287
+ * missing policy or input files
288
+ * invalid policy YAML or unsupported policy fields
289
+ * invalid JSONL
290
+ * JSONL lines that are not objects
291
+ * invalid jq selectors
292
+ * unmatched required rule paths
293
+ * action type errors, such as applying `fuzz_number` to a string
294
+ * invalid or missing `date_offset` anchors
295
+
296
+ LLM rewrite request failures are nonfatal. The failed field is replaced with
297
+ `<LLM_REWRITE_FAILED>` and processing continues.
298
+
299
+ ## Development
300
+
301
+ Run the test suite:
302
+
303
+ ```bash
304
+ uv run pytest
305
+ ```
306
+
307
+ Run the lightweight checks:
308
+
309
+ ```bash
310
+ uv run ruff check .
311
+ uv run ty check
312
+ ```