file-observer 1.0.0__py3-none-any.whl

@@ -0,0 +1,345 @@
+ Metadata-Version: 2.4
+ Name: file-observer
+ Version: 1.0.0
+ Summary: Know what's in your files before you open them. Deterministic file observation engine with cryptographic vector identity.
+ Author-email: Russell Pfister <russalo@russalo.com>
+ License: AGPL-3.0-or-later
+ Project-URL: Homepage, https://github.com/russalo/file-observer
+ Project-URL: Repository, https://github.com/russalo/file-observer
+ Project-URL: Issues, https://github.com/russalo/file-observer/issues
+ Project-URL: Documentation, https://github.com/russalo/file-observer/blob/main/docs/README.md
+ Project-URL: Changelog, https://github.com/russalo/file-observer/blob/main/docs/HISTORY.md
+ Keywords: file-analysis,metadata,observation,document-pipeline,manifest,deterministic,provenance,audit,chatlog,vector,corpus
+ Classifier: Development Status :: 5 - Production/Stable
+ Classifier: Intended Audience :: Developers
+ Classifier: Intended Audience :: Information Technology
+ Classifier: Intended Audience :: System Administrators
+ Classifier: License :: OSI Approved :: GNU Affero General Public License v3 or later (AGPLv3+)
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Programming Language :: Python :: 3.13
+ Classifier: Topic :: File Formats
+ Classifier: Topic :: System :: Filesystems
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
+ Classifier: Topic :: Software Development :: Quality Assurance
+ Classifier: Typing :: Typed
+ Requires-Python: >=3.12
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ License-File: LICENSE-AGPL
+ Requires-Dist: python-magic
+ Requires-Dist: chardet
+ Provides-Extra: yaml
+ Requires-Dist: PyYAML; extra == "yaml"
+ Provides-Extra: msg
+ Requires-Dist: olefile; extra == "msg"
+ Provides-Extra: security
+ Requires-Dist: defusedxml; extra == "security"
+ Provides-Extra: dev
+ Requires-Dist: pytest; extra == "dev"
+ Requires-Dist: PyYAML; extra == "dev"
+ Requires-Dist: olefile; extra == "dev"
+ Requires-Dist: defusedxml; extra == "dev"
+ Dynamic: license-file
+
+ # File Observer
+
+ **Know what's in your files before you open them.**
+
+ File Observer scans directories and tells you exactly what's inside — file types, metadata, conversation patterns, author fingerprints, structural signals — all in a deterministic JSON manifest. It reads everything. It changes nothing.
+
+ ```bash
+ pip install file-observer
+ fo ./your-project --specialists
+ ```
+
+ ```
+ Scanned 4,366 files (3,526 text, 840 binary) in 31 directories.
+
+ 1,163 supported (336 with specialist metadata). 3,203 unsupported extensions.
+ Quality: 676 clean, 3,690 degraded. 4 safety flags, 2 polyglots.
+
+ Vectors: author_aggregate found 64 distinct authors across 114 files.
+ chatlog matched 22 files. reference_tokens ran on 806 files (2,164 URLs,
+ 382 paths, 262 @mentions). filename_patterns matched 84 of 4366 files.
+
+ Largest directories: tika-parsers (2,037), tika-pipes (459), tika-core (440).
+ ```
+
+ That's the human-readable summary. The full manifest has per-file metadata, provenance traces, vector digests, and a signed integrity envelope.
+
+ | | |
+ |---|---|
+ | **Package** | `file-observer` |
+ | **CLI** | `file-observer` or `fo` (shorthand) |
+ | **Version** | `1.0.0` |
+ | **Schema** | `1.0` |
+ | **Python** | `>= 3.12` |
+ | **License** | [AGPL-3.0](../LICENSE) (commercial license available) |
+ | **Tests** | 564 passed, validated against 12 corpora / 28,756 files |
+
+ ---
+
+ ## Why File Observer?
+
+ **Your pipeline needs to know what it's processing before it processes it.** File Observer is the observation layer that sits at the front of any document pipeline — ingestion, classification, OCR, embedding, audit. It tells the pipeline what's coming without touching the files.
+
+ - **Deterministic.** Same files + same config = identical manifest, every time. Cross-environment variance is explained, never hidden.
+ - **Auditable.** Every derived field has a provenance trace — which method, which trigger, which inputs. Nothing is a black box.
+ - **Honest.** `null` means "not observed within bounds," not "not present." Safety flags are observations, not assessments. The scanner records; the consumer interprets.
+ - **Verified.** Cryptographic identity digests on every vector. HMAC-signed manifests. Chain-of-custody across incremental scans.
+
+ ---
+
+ ## What it observes
+
+ ### 25 file types, 4 capability tiers
+
+ | Tier | Runs for | What it extracts |
+ |---|---|---|
+ | **Universal** | Every file | Identity, checksum, MIME, file signatures, polyglot detection, routing flags |
+ | **Baseline** | Text files | Encoding, preview, tags, frontmatter, chatlog detection, reference tokens, filename patterns |
+ | **Structural** | Text files | Title, headings, CSV headers, JSON/YAML/XML/TOML keys, technology hints |
+ | **Specialist** | Supported formats (opt-in) | PDF pages, image dimensions, email envelopes, spreadsheet structure, document metadata |
+
+ Supported specialist formats: `.pdf`, `.png`, `.jpg`, `.msg`, `.eml`, `.xlsx`, `.xls`, `.docx`, `.doc`, `.rtf`, `.jsonl`
+
+ ### 4 observation vectors with cryptographic identity
+
+ | Vector | What it finds |
+ |---|---|
+ | **chatlog** | Conversation patterns — turns, speakers, section markers. Works on `.txt`, `.md`, `.jsonl`. |
+ | **reference_tokens** | @mentions, wiki links, code blocks, URLs, emails, file paths, ticket numbers |
+ | **author_aggregate** | Cross-format author normalization. Spots template defaults vs real humans. |
+ | **filename_patterns** | Date prefixes, version markers, numbered revisions, template names, UUIDs, copy suffixes |
+
+ Each vector carries an identity digest (SHA-256). Same digest = same rules + same tuning = same output. Always.
+
+ ### Safety and integrity
+
+ - **Safety flags** — detects JavaScript in PDFs, macros in DOCX, OLE objects in RTF, external entities in XML
+ - **Manifest checksum** — SHA-256 over the canonical manifest
+ - **HMAC signatures** — optional signed manifests for audit chains
+ - **Delta scanning** — track added/modified/removed files across incremental scans
+ - **Per-directory summary** — corpus shape visible at a glance
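
The checksum and HMAC bullets above can be illustrated with the standard library alone. This is a minimal sketch, not File Observer's implementation: the helper names (`canonical_bytes`, `manifest_checksum`, `sign_manifest`, `verify_manifest`) are hypothetical, and the canonical form used here (JSON with sorted keys and compact separators) is an assumption, since the README does not say how the scanner canonicalizes a manifest.

```python
import hashlib
import hmac
import json


def canonical_bytes(manifest: dict) -> bytes:
    """Encode a manifest so that key order never affects the bytes (assumed canonical form)."""
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")


def manifest_checksum(manifest: dict) -> str:
    """SHA-256 hex digest over the canonical encoding."""
    return hashlib.sha256(canonical_bytes(manifest)).hexdigest()


def sign_manifest(manifest: dict, key: bytes) -> str:
    """HMAC-SHA256 signature over the canonical encoding."""
    return hmac.new(key, canonical_bytes(manifest), hashlib.sha256).hexdigest()


def verify_manifest(manifest: dict, key: bytes, signature: str) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_manifest(manifest, key), signature)


# Hypothetical manifest content and key, for illustration only.
manifest = {"schema": "1.0", "files": [{"path": "a.txt", "checksum": "abc"}]}
key = b"example-key"
signature = sign_manifest(manifest, key)
tampered = {**manifest, "files": []}
```

A consumer holding the same key can call `verify_manifest` before trusting a manifest; any change to the content, such as the `tampered` copy above, makes verification fail.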
+
+ ---
+
+ ## Quick start
+
+ ### Install
+
+ ```bash
+ pip install file-observer
+
+ # Optional: specialist format support
+ pip install "file-observer[msg]"       # .msg/.doc/.xls (OLE2 formats)
+ pip install "file-observer[security]"  # Hardened XML parsing
+ pip install "file-observer[dev]"       # Full dev environment
+ ```
+
+ System requirement: `libmagic` for content-based MIME detection.
+ ```bash
+ sudo apt install libmagic1     # Debian/Ubuntu
+ brew install libmagic          # macOS
+ pip install python-magic-bin   # Windows
+ ```
+
+ ### Scan
+
+ ```bash
+ # Quick scan
+ fo ./project
+
+ # Deep scan with specialist metadata
+ fo ./project --specialists
+
+ # Named profile with JSONL output
+ fo ./project --profile deep_extract --format jsonl
+
+ # Delta scan against a previous manifest, signed
+ fo ./project --previous-manifest ./last.json --signing-key-file ./key
+ ```
+
+ ### Use in code
+
+ ```python
+ from pathlib import Path
+ from scanner import Scanner, ScannerConfig, manifest_to_json
+
+ config = ScannerConfig(enable_specialists=True)
+ manifest = Scanner(source_dir=Path("./documents"), config=config).scan()
+
+ # Human-readable summary
+ print(manifest.summary)
+
+ # Find conversation logs
+ for f in manifest.files:
+     if f.is_chatlog and f.specialist_metadata:
+         chat = f.specialist_metadata["chatlog"]
+         print(f"{f.path}: {chat['turn_count']} turns, {chat['speaker_labels']}")
+
+ # Triage via quality block
+ q = manifest.quality
+ print(f"{q.clean_files}/{q.total_files} clean, {q.safety_flags} safety flags")
+
+ # Write manifest
+ Path("manifest.json").write_text(manifest_to_json(manifest))
+ ```
+
+ Every scan also produces a standalone Markdown report (`report_v{version}_{timestamp}.md`) — readable in any browser, shareable, no JSON parsing required.
+
+ ---
+
+ ## Use cases
+
+ ### Document pipeline preprocessing
+ Point File Observer at an incoming document folder before your ingestor touches it. Know which files need OCR, which have specialist metadata, which are mislabeled, and which carry safety flags — before processing begins.
+
+ ### AI training data curation
+ Scanning AI conversation logs, knowledge bases, and document corpora? File Observer detects chatlog patterns in `.txt`, `.md`, and `.jsonl` files, counts turns and speakers, and surfaces reference tokens (URLs, @mentions, code blocks) across thousands of files. Built for the datasets that train and evaluate language models.
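
To give a feel for what surfacing reference tokens involves, here is a rough sketch that pulls URLs and @mentions out of a string with regular expressions. The patterns and the `reference_tokens` function are illustrative assumptions, not the rules the `reference_tokens` vector actually applies, which the README does not spell out.

```python
import re

# Assumed patterns: http(s) URLs up to whitespace or a bracket, and @mentions
# that are not preceded by a word character (so email addresses are skipped).
URL_RE = re.compile(r"https?://[^\s<>()\[\]]+")
MENTION_RE = re.compile(r"(?<![\w@])@([A-Za-z0-9_][A-Za-z0-9_-]*)")


def reference_tokens(text: str) -> dict[str, list[str]]:
    """Return URLs and @mention names found in text, in order of appearance."""
    # Trailing sentence punctuation is trimmed from URLs.
    urls = [m.group(0).rstrip(".,;:!?") for m in URL_RE.finditer(text)]
    mentions = [m.group(1) for m in MENTION_RE.finditer(text)]
    return {"urls": urls, "mentions": mentions}


sample = "Ping @alice and @bob-2 about https://example.com/docs, or mail carol@example.com."
tokens = reference_tokens(sample)
```

The lookbehind in `MENTION_RE` is the design choice worth noting: without it, the domain part of every email address would be counted as a mention.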
+
+ ### Audit and compliance
+ Every field has a provenance trace. Every vector has a cryptographic identity digest. Manifests can be HMAC-signed with chain-of-custody across incremental scans. When the auditor asks "how do you know this file contains X?" — the manifest answers.
+
+ ### Knowledge management and vault analysis
+ Run File Observer against an Obsidian vault, a Confluence export, or a shared drive. The per-directory summary shows corpus shape instantly. Reference tokens reveal link density, cross-references, and structural patterns. Author aggregation spots template defaults vs real contributors.
+
+ ### Migration and deduplication
+ Moving files between systems? File Observer gives you checksums, MIME analysis, format signatures, and polyglot detection for every file. Delta scanning tracks what changed between runs. Filename patterns catch copy suffixes, numbered revisions, and UUID-named files.
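
The idea behind delta scanning can be sketched as a comparison of two path-to-checksum mappings, one from the previous run and one from the current run. The function name `diff_checksums` and the plain-dict inputs are assumptions for illustration; the scanner's own delta logic and manifest layout may differ.

```python
def diff_checksums(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Classify paths as added, modified, or removed between two scans."""
    added = sorted(p for p in current if p not in previous)
    removed = sorted(p for p in previous if p not in current)
    # A path present in both scans counts as modified only if its checksum changed.
    modified = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    return {"added": added, "modified": modified, "removed": removed}


# Hypothetical checksums from two runs over the same directory.
previous = {"a.txt": "111", "b.txt": "222", "c.txt": "333"}
current = {"a.txt": "111", "b.txt": "999", "d.txt": "444"}
delta = diff_checksums(previous, current)
```

Note that a renamed file shows up as one removal plus one addition in this scheme; matching the two by checksum would be a separate step.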
+
+ ### Security triage
+ Safety flags surface JavaScript in PDFs, macros in DOCX files, OLE objects in RTF, and external entities in XML — without opening or executing anything. Feed the flags into your security pipeline for automated quarantine decisions.
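
A consumer-side triage step might look like the sketch below, which walks a parsed manifest and collects every file that carries at least one safety flag. It assumes the JSON manifest exposes a top-level `files` list whose entries have `path` and `safety_flags` keys, mirroring the `FileRecord` fields listed in the API reference; the flag value `pdf_javascript` and the function name `flagged_files` are made up for the example.

```python
import json


def flagged_files(manifest: dict) -> list[tuple[str, list[str]]]:
    """Return (path, flags) pairs for records with a non-empty safety_flags list."""
    flagged = []
    for record in manifest.get("files", []):
        # Treat a missing or null field as "no flags observed".
        flags = record.get("safety_flags") or []
        if flags:
            flagged.append((record["path"], flags))
    return flagged


# Inline sample standing in for a real manifest.json (hypothetical content).
sample = json.loads(
    '{"files": ['
    '{"path": "a.pdf", "safety_flags": ["pdf_javascript"]},'
    '{"path": "b.txt", "safety_flags": []},'
    '{"path": "c.md", "safety_flags": null}'
    "]}"
)
quarantine = flagged_files(sample)
```

In keeping with the README's framing, this step only lists observations; deciding whether a flagged file is actually dangerous remains the consumer's call.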
+
+ ---
+
+ ## How it works
+
+ ```
+ fo ./corpus --specialists
+ |
+ +-- Universal tier    Every file: checksum, MIME, signatures, routing
+ +-- Baseline tier     Text files: encoding, preview, tags, chatlog detection
+ +-- Structural tier   Text files: title, headings, keys, technology hints
+ +-- Specialist tier   Format-specific: PDF, images, email, spreadsheets, documents
+ +-- Vector pass       chatlog, reference_tokens, filename_patterns (per-file)
+ +-- Corpus vectors    author_aggregate (after all files processed)
+ +-- Summary           Human-readable paragraph + per-directory breakdown
+ |
+ +-- Output: manifest.json + report.md
+ ```
+
+ One file failure never halts the scan. Errors are captured per-file, per-stage. The manifest is always complete.
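
The per-file, per-stage error capture described above is a common pattern, sketched here in generic form. Everything in this block is hypothetical (the `observe` function, the stage names, the shape of the error entries); it shows how one failing stage can be recorded without stopping later stages, not how the scanner itself is written.

```python
from pathlib import Path
from typing import Callable

Stage = tuple[str, Callable[[Path], dict]]


def observe(path: Path, stages: list[Stage]) -> dict:
    """Run each stage in order, recording failures instead of raising."""
    record: dict = {"path": str(path), "errors": []}
    for name, stage in stages:
        try:
            record.update(stage(path))
        except Exception as exc:  # deliberate catch-all: one stage must not halt the rest
            record["errors"].append(
                {"stage": name, "error": f"{type(exc).__name__}: {exc}"}
            )
    return record


def suffix_stage(path: Path) -> dict:
    return {"suffix": path.suffix}


def broken_stage(path: Path) -> dict:
    raise ValueError("unreadable header")


def done_stage(path: Path) -> dict:
    return {"done": True}


record = observe(
    Path("report.pdf"),
    [("identity", suffix_stage), ("specialist", broken_stage), ("summary", done_stage)],
)
```

Here the failing `specialist` stage leaves an entry in `record["errors"]`, while the stages before and after it still contribute their fields.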
+
+ ---
+
+ ## Configurable depth
+
+ | Profile | Baseline | Specialists | Use case |
+ |---|---|---|---|
+ | `fast_sort` | 8KB | Off | Quick triage, file routing |
+ | `general` | 64KB | Off | Standard observation |
+ | `deep_extract` | 1MB | On | Full metadata extraction |
+
+ Per-extension overrides let you give specific formats more budget:
+ ```bash
+ fo ./docs --specialists --extension-override .pdf:specialist_budget=524288
+ ```
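
One way to picture how a profile and a per-extension override combine is the sketch below. It assumes the "Baseline" column corresponds to `baseline_max_bytes` (with KB meaning 1,024 bytes) and "Specialists" to `enable_specialists`, and it parses the override string shown above into an extension, a key, and an integer value. The `PROFILES` table and both helper functions are hypothetical; the scanner's actual resolution rules are not documented here.

```python
# Assumed mapping from the profile table to ScannerConfig-style settings.
PROFILES = {
    "fast_sort": {"baseline_max_bytes": 8 * 1024, "enable_specialists": False},
    "general": {"baseline_max_bytes": 64 * 1024, "enable_specialists": False},
    "deep_extract": {"baseline_max_bytes": 1024 * 1024, "enable_specialists": True},
}


def parse_override(spec: str) -> tuple[str, str, int]:
    """Split '.pdf:specialist_budget=524288' into (extension, key, value)."""
    extension, _, assignment = spec.partition(":")
    key, _, raw = assignment.partition("=")
    if not extension.startswith(".") or not key or not raw:
        raise ValueError(f"malformed override: {spec!r}")
    return extension.lower(), key, int(raw)


def effective_settings(profile: str, extension: str, overrides: list[str]) -> dict:
    """Start from the profile defaults, then apply overrides matching the extension."""
    settings = dict(PROFILES[profile])
    for spec in overrides:
        ext, key, value = parse_override(spec)
        if ext == extension.lower():
            settings[key] = value
    return settings


overrides = [".pdf:specialist_budget=524288"]
pdf_settings = effective_settings("deep_extract", ".pdf", overrides)
txt_settings = effective_settings("deep_extract", ".txt", overrides)
```

With these inputs, only the `.pdf` settings pick up the larger `specialist_budget`; other extensions keep the profile defaults.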
+
+ ---
+
+ ## Validated at scale
+
+ File Observer has been tested against 12 real-world corpora totaling 28,756 files with **zero errors**:
+
+ | Corpus | Files | What it tested |
+ |---|---|---|
+ | Apache Tika | 4,366 | 152 document specialists, 69 PDFs, 57 spreadsheets, 13 emails |
+ | OBS Studio | 5,201 | Large C/C++ project, 91 filename patterns |
+ | AutoGPT | 3,945 | AI platform, 208 chatlog detections, 1,612 @mentions |
+ | FastAPI | 3,002 | Documentation-heavy Python, chatlog tuning validation |
+ | OpenPreserve | 753 | Adversarial format samples, 285 PDFs |
+ | Claude Code logs | 125 | Real AI conversation transcripts, JSONL chatlog detection |
+ | Flask, tmux, self-scan | 11K+ | Diverse code repos |
+
+ ---
+
+ ## Documentation
+
+ | Document | What it covers |
+ |---|---|
+ | [HISTORY.md](HISTORY.md) | Every version from v0.1 to v1.0, with specs and compliance reports |
+ | [PUBLIC_CONTRACT.md](PUBLIC_CONTRACT.md) | Consumer stability commitments — what you can rely on |
+ | [CONVENTIONS.md](CONVENTIONS.md) | Internal naming, versioning, and tracking |
+ | [v1.0.0 RFC Specification](v1.0.0_RFC_Specification.md) | Current release spec — schema freeze, binding contract |
+
+ ---
+
+ ## API Reference
+
+ ### Core classes
+
+ ```python
+ Scanner(source_dir: Path, config: ScannerConfig | None = None)
+ Scanner.scan() -> ScanManifest
+ ```
+
+ ### Configuration
+
+ ```python
+ ScannerConfig(
+     enable_specialists=False,   # Enable format-specific extraction
+     preview_max_chars=1000,     # Content preview length
+     sample_size=8192,           # Binary detection sample
+     baseline_max_bytes=65536,   # Text decode limit
+     specialist_budget=131072,   # OOXML read budget
+     format="json",              # "json" or "jsonl"
+     exclude_hidden=False,       # Skip dot-files
+     ignore_file=None,           # Path to .scannerignore
+     previous_manifest=None,     # Delta scan reference
+     signing_key=None,           # HMAC signing key
+ )
+ ```
+
+ ### Output
+
+ ```python
+ manifest_to_json(manifest)      # Pretty-printed JSON
+ manifest_to_jsonl(manifest)     # NDJSON streaming format
+ manifest_to_markdown(manifest)  # Human-readable report
+ ```
+
+ ### Key data classes
+
+ - **`ScanManifest`** — top-level: context, stats, quality, vectors_collected, summary, files[]
+ - **`FileRecord`** — per-file: path, mime, checksum, encoding, specialist_metadata, reference_tokens, filename_patterns, safety_flags, signal_provenance, errors
+ - **`ScanContext`** — environment fingerprint: versions, platform, dependencies
+ - **`VectorRecord`** — vector identity, digest, scope, applied count, summary
+
+ ---
+
+ ## Contributing
+
+ We welcome contributions. See [CONTRIBUTING.md](../CONTRIBUTING.md) for the full guide.
+
+ **Quick version:**
+ 1. Fork and clone
+ 2. `pip install -e ".[dev]"` and run tests
+ 3. Sign the [CLA](../CLA.md) on your first PR
+ 4. One concern per PR, tests required, determinism preserved
+
+ ---
+
+ ## License
+
+ File Observer is dual-licensed:
+
+ - **Open source** under [AGPL-3.0](../LICENSE-AGPL) — use freely, contribute back
+ - **Commercial license** available for SaaS, proprietary embedding, and distribution without source disclosure
+
+ Internal use under AGPL requires no commercial license. Contact russalo@russalo.com for commercial terms.
+
+ ---
+
+ *Built by [Russalo](https://russalo.com). The scanner records. The consumer interprets. The identity digest makes the recording auditable.*
@@ -0,0 +1,9 @@
+ file_observer-1.0.0.dist-info/licenses/LICENSE,sha256=huQwXqt2aPs4OQqQLw7J_SGW_YbR1d7Da5wO9jfnGDc,1968
+ file_observer-1.0.0.dist-info/licenses/LICENSE-AGPL,sha256=DZak_2itbUtvHzD3E7GNUYSRK6jdOJ-GqncQ2weavLA,34523
+ scanner/__init__.py,sha256=p7ITX6WV6nn4_rb1UGAkcivgzqWOhN5SYeyASQFwwJk,216
+ scanner/scanner.py,sha256=LRO81pbIRqu3jyzSte2dcrGbPZaTG-XayhWsy92jXZM,146133
+ file_observer-1.0.0.dist-info/METADATA,sha256=Q3bPmD_UUe9QWW2gNQkEVeHBLdKbbJM0m_1IFDU763M,14141
+ file_observer-1.0.0.dist-info/WHEEL,sha256=aeYiig01lYGDzBgS8HxWXOg3uV61G9ijOsup-k9o1sk,91
+ file_observer-1.0.0.dist-info/entry_points.txt,sha256=9RNi9R5O58N4gRf4048sMyQroUXKxTkUaoGIu_XpWGk,81
+ file_observer-1.0.0.dist-info/top_level.txt,sha256=uVdh7ZIJC9rjZQex7pba18nBuRRxFhjkFu4vHPiWTcw,8
+ file_observer-1.0.0.dist-info/RECORD,,
@@ -0,0 +1,5 @@
+ Wheel-Version: 1.0
+ Generator: setuptools (82.0.1)
+ Root-Is-Purelib: true
+ Tag: py3-none-any
+
@@ -0,0 +1,3 @@
+ [console_scripts]
+ file-observer = scanner.scanner:main
+ fo = scanner.scanner:main
@@ -0,0 +1,55 @@
+ File Observer
+ Copyright (c) 2026 Russell Pfister. All rights reserved.
+
+ This software is dual-licensed:
+
+ 1. OPEN SOURCE LICENSE (AGPL v3)
+
+ For open source use, this software is licensed under the
+ GNU Affero General Public License, Version 3 (AGPL-3.0).
+
+ You may use, modify, and distribute this software under the
+ terms of the AGPL-3.0, provided that:
+
+ - Any modified versions you distribute are also licensed under AGPL-3.0
+ - If you run a modified version as a network service, you must make
+   the source code of your modified version available to users of
+   that service
+
+ The full text of the AGPL-3.0 is available at:
+ https://www.gnu.org/licenses/agpl-3.0.html
+
+ A copy is included in this repository at LICENSE-AGPL.
+
+ 2. COMMERCIAL LICENSE
+
+ For use cases where the AGPL-3.0 terms are not suitable —
+ including but not limited to:
+
+ - Embedding scanner in proprietary software without source disclosure
+ - Offering scanner as part of a commercial SaaS or cloud service
+   without releasing your service code under AGPL
+ - Distributing scanner in proprietary products
+ - Obtaining support, warranty, or indemnification
+
+ A separate commercial license is available from Russalo LLC.
+
+ Contact: russalo@russalo.com
+
+ WHICH LICENSE APPLIES TO YOU?
+
+ - If you are using scanner internally (not offering it as a service
+   to others), the AGPL allows this without restriction. You do not
+   need a commercial license for internal use.
+
+ - If you are using scanner in an open source project licensed under
+   a compatible copyleft license, the AGPL applies.
+
+ - If you are offering scanner (or a modified version) as a network
+   service to third parties and do not wish to release your service
+   code under the AGPL, you need a commercial license.
+
+ - If you are embedding scanner in proprietary software for
+   distribution, you need a commercial license.
+
+ If you are unsure which license applies, contact russalo@russalo.com.