filedge 0.1.0__tar.gz (PyPI package version diff)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (131)

filedge-0.1.0/.coverage +0 -0
filedge-0.1.0/.github/ISSUE_TEMPLATE/bug_report.yml +66 -0
filedge-0.1.0/.github/ISSUE_TEMPLATE/config.yml +5 -0
filedge-0.1.0/.github/ISSUE_TEMPLATE/feature_request.yml +32 -0
filedge-0.1.0/.github/dependabot.yml +25 -0
filedge-0.1.0/.github/pull_request_template.md +25 -0
filedge-0.1.0/.github/workflows/bigquery-integration.yml +35 -0
filedge-0.1.0/.github/workflows/ci.yml +77 -0
filedge-0.1.0/.github/workflows/docs.yml +44 -0
filedge-0.1.0/.github/workflows/release.yml +50 -0
filedge-0.1.0/.gitignore +10 -0
filedge-0.1.0/AGENTS.md +17 -0
filedge-0.1.0/CLAUDE.md +17 -0
filedge-0.1.0/CONTEXT.md +145 -0
filedge-0.1.0/CONTRIBUTING.md +101 -0
filedge-0.1.0/LICENSE +201 -0
filedge-0.1.0/PKG-INFO +40 -0
filedge-0.1.0/README.md +127 -0
filedge-0.1.0/SECURITY.md +47 -0
filedge-0.1.0/docs/PRD.md +139 -0
filedge-0.1.0/docs/adr/0001-single-transaction-commit.md +3 -0
filedge-0.1.0/docs/adr/0002-content-hash-as-idempotency-key.md +3 -0
filedge-0.1.0/docs/adr/0003-strict-mode-validation.md +3 -0
filedge-0.1.0/docs/adr/0004-audit-connector-split.md +38 -0
filedge-0.1.0/docs/adr/0005-sftp-out-of-scope.md +22 -0
filedge-0.1.0/docs/adr/0006-api-sources-fetched-to-files.md +16 -0
filedge-0.1.0/docs/adr/0007-queue-source-ingestion-model.md +16 -0
filedge-0.1.0/docs/adr/0008-schema-inference-confidence-tiers.md +7 -0
filedge-0.1.0/docs/adr/0009-warehouse-cdc-applied-file-markers.md +34 -0
filedge-0.1.0/docs/agents/domain.md +44 -0
filedge-0.1.0/docs/agents/issue-tracker.md +22 -0
filedge-0.1.0/docs/agents/triage-labels.md +15 -0
filedge-0.1.0/docs/architecture/decisions.md +97 -0
filedge-0.1.0/docs/architecture/index.md +76 -0
filedge-0.1.0/docs/getting-started.md +196 -0
filedge-0.1.0/docs/guides/api-sources.md +190 -0
filedge-0.1.0/docs/guides/cdc-files.md +131 -0
filedge-0.1.0/docs/guides/compact.md +106 -0
filedge-0.1.0/docs/guides/crash-retry.md +124 -0
filedge-0.1.0/docs/guides/healthcheck.md +71 -0
filedge-0.1.0/docs/guides/inspect.md +138 -0
filedge-0.1.0/docs/guides/observability.md +109 -0
filedge-0.1.0/docs/guides/preview.md +106 -0
filedge-0.1.0/docs/guides/queue-sources.md +140 -0
filedge-0.1.0/docs/guides/requeue.md +123 -0
filedge-0.1.0/docs/guides/run.md +229 -0
filedge-0.1.0/docs/guides/scale.md +223 -0
filedge-0.1.0/docs/guides/validate.md +101 -0
filedge-0.1.0/docs/index.md +63 -0
filedge-0.1.0/docs/reference/cli.md +207 -0
filedge-0.1.0/docs/reference/column-types.md +43 -0
filedge-0.1.0/docs/reference/connectors.md +158 -0
filedge-0.1.0/docs/reference/pipeline-yaml.md +180 -0
filedge-0.1.0/docs/release-checklist.md +102 -0
filedge-0.1.0/example/pipeline.yaml +26 -0
filedge-0.1.0/filedge/__init__.py +0 -0
filedge-0.1.0/filedge/cdc.py +49 -0
filedge-0.1.0/filedge/cli.py +526 -0
filedge-0.1.0/filedge/column_types.py +79 -0
filedge-0.1.0/filedge/compactor.py +158 -0
filedge-0.1.0/filedge/config.py +102 -0
filedge-0.1.0/filedge/connectors/__init__.py +110 -0
filedge-0.1.0/filedge/connectors/bigquery.py +309 -0
filedge-0.1.0/filedge/connectors/databricks.py +527 -0
filedge-0.1.0/filedge/connectors/duckdb.py +181 -0
filedge-0.1.0/filedge/connectors/postgres.py +163 -0
filedge-0.1.0/filedge/connectors/sqlite.py +149 -0
filedge-0.1.0/filedge/db.py +297 -0
filedge-0.1.0/filedge/filesystem.py +73 -0
filedge-0.1.0/filedge/hashing.py +11 -0
filedge-0.1.0/filedge/health.py +104 -0
filedge-0.1.0/filedge/inferrer.py +135 -0
filedge-0.1.0/filedge/inspect_formatter.py +43 -0
filedge-0.1.0/filedge/loader.py +66 -0
filedge-0.1.0/filedge/log.py +84 -0
filedge-0.1.0/filedge/parser.py +69 -0
filedge-0.1.0/filedge/pipeline.py +151 -0
filedge-0.1.0/filedge/preview_formatter.py +63 -0
filedge-0.1.0/filedge/progress.py +232 -0
filedge-0.1.0/filedge/schema.py +84 -0
filedge-0.1.0/filedge/tracing.py +131 -0
filedge-0.1.0/filedge/transform.py +31 -0
filedge-0.1.0/filedge/validate_formatter.py +33 -0
filedge-0.1.0/filedge/validator.py +51 -0
filedge-0.1.0/mkdocs.yml +66 -0
filedge-0.1.0/pyproject.toml +44 -0
filedge-0.1.0/tests/__init__.py +0 -0
filedge-0.1.0/tests/conftest.py +10 -0
filedge-0.1.0/tests/fixtures/sample.csv +5 -0
filedge-0.1.0/tests/fixtures/sample.ndjson +3 -0
filedge-0.1.0/tests/test_cli.py +419 -0
filedge-0.1.0/tests/test_cli_inspect.py +111 -0
filedge-0.1.0/tests/test_cli_inspect_remote.py +96 -0
filedge-0.1.0/tests/test_cli_parquet.py +66 -0
filedge-0.1.0/tests/test_cli_preview.py +110 -0
filedge-0.1.0/tests/test_cli_validate.py +147 -0
filedge-0.1.0/tests/test_compactor.py +234 -0
filedge-0.1.0/tests/test_config.py +197 -0
filedge-0.1.0/tests/test_connector_bigquery.py +237 -0
filedge-0.1.0/tests/test_connector_bigquery_unit.py +214 -0
filedge-0.1.0/tests/test_connector_databricks.py +330 -0
filedge-0.1.0/tests/test_connector_databricks_integration.py +231 -0
filedge-0.1.0/tests/test_connector_duckdb.py +382 -0
filedge-0.1.0/tests/test_connector_health.py +157 -0
filedge-0.1.0/tests/test_connector_postgres.py +265 -0
filedge-0.1.0/tests/test_connector_sqlite.py +284 -0
filedge-0.1.0/tests/test_connectors.py +58 -0
filedge-0.1.0/tests/test_crash_retry.py +127 -0
filedge-0.1.0/tests/test_db.py +259 -0
filedge-0.1.0/tests/test_encoding.py +76 -0
filedge-0.1.0/tests/test_filesystem.py +113 -0
filedge-0.1.0/tests/test_hashing.py +31 -0
filedge-0.1.0/tests/test_health.py +310 -0
filedge-0.1.0/tests/test_inferrer.py +161 -0
filedge-0.1.0/tests/test_inspect_formatter.py +75 -0
filedge-0.1.0/tests/test_loader.py +213 -0
filedge-0.1.0/tests/test_log.py +89 -0
filedge-0.1.0/tests/test_otel_logs.py +134 -0
filedge-0.1.0/tests/test_parquet_inferrer.py +89 -0
filedge-0.1.0/tests/test_parquet_parser.py +72 -0
filedge-0.1.0/tests/test_parser.py +57 -0
filedge-0.1.0/tests/test_pipeline.py +248 -0
filedge-0.1.0/tests/test_pipeline_progress.py +134 -0
filedge-0.1.0/tests/test_preview_formatter.py +86 -0
filedge-0.1.0/tests/test_schema.py +60 -0
filedge-0.1.0/tests/test_tracing.py +225 -0
filedge-0.1.0/tests/test_tracing_optional_imports.py +164 -0
filedge-0.1.0/tests/test_transform.py +100 -0
filedge-0.1.0/tests/test_validate_formatter.py +76 -0
filedge-0.1.0/tests/test_validator.py +89 -0
filedge-0.1.0/uv.lock +2610 -0

filedge-0.1.0/.coverage ADDED

Binary file

filedge-0.1.0/.github/ISSUE_TEMPLATE/bug_report.yml ADDED

@@ -0,0 +1,66 @@
+name: Bug report
+description: A reproducible bug in filedge
+title: "[bug]: "
+labels: ["bug", "needs-triage"]
+body:
+  - type: markdown
+    attributes:
+      value: |
+        Thanks for taking the time to file a bug. The more reproducible the
+        report, the faster it can be fixed.
+  - type: textarea
+    id: what-happened
+    attributes:
+      label: What happened?
+      description: What did you expect, and what actually occurred?
+      placeholder: |
+        Expected: ...
+        Actual: ...
+    validations:
+      required: true
+  - type: textarea
+    id: reproduce
+    attributes:
+      label: Steps to reproduce
+      description: Minimal `pipeline.yaml`, sample input file, and CLI invocation.
+      render: shell
+    validations:
+      required: true
+  - type: input
+    id: version
+    attributes:
+      label: Filedge version / commit SHA
+      placeholder: "0.1.0 or abc1234"
+    validations:
+      required: true
+  - type: dropdown
+    id: connector
+    attributes:
+      label: Connector
+      options:
+        - sqlite
+        - postgres
+        - bigquery
+        - databricks
+        - other / not applicable
+    validations:
+      required: true
+  - type: input
+    id: python
+    attributes:
+      label: Python version
+      placeholder: "3.13"
+    validations:
+      required: true
+  - type: textarea
+    id: logs
+    attributes:
+      label: Logs / traceback
+      description: Full traceback if there was one. Audit DB state (`filedge status --json`) helps too.
+      render: shell

filedge-0.1.0/.github/ISSUE_TEMPLATE/config.yml ADDED

@@ -0,0 +1,5 @@
+blank_issues_enabled: false
+contact_links:
+  - name: Security vulnerability
+    url: https://github.com/tongqqiu/filedge/security/advisories/new
+    about: Please report security vulnerabilities privately, not in the public tracker.

filedge-0.1.0/.github/ISSUE_TEMPLATE/feature_request.yml ADDED

@@ -0,0 +1,32 @@
+name: Feature request
+description: Suggest a new capability or enhancement
+title: "[feat]: "
+labels: ["enhancement", "needs-triage"]
+body:
+  - type: textarea
+    id: problem
+    attributes:
+      label: What problem would this solve?
+      description: Describe the use case or pain point. Avoid jumping to a solution.
+    validations:
+      required: true
+  - type: textarea
+    id: proposal
+    attributes:
+      label: Proposed change
+      description: Optional. What would the API / CLI / config look like?
+  - type: textarea
+    id: alternatives
+    attributes:
+      label: Alternatives considered
+      description: Workarounds you have tried or other approaches you considered.
+  - type: checkboxes
+    id: scope
+    attributes:
+      label: Scope
+      options:
+        - label: This is a breaking change to config or CLI surface
+        - label: I would be willing to send a PR for this

filedge-0.1.0/.github/dependabot.yml ADDED

@@ -0,0 +1,25 @@
+version: 2
+updates:
+  - package-ecosystem: "pip"
+    directory: "/"
+    schedule:
+      interval: "weekly"
+      day: "monday"
+    open-pull-requests-limit: 5
+    labels:
+      - "dependencies"
+    groups:
+      python-minor-and-patch:
+        update-types:
+          - "minor"
+          - "patch"
+  - package-ecosystem: "github-actions"
+    directory: "/"
+    schedule:
+      interval: "weekly"
+      day: "monday"
+    open-pull-requests-limit: 5
+    labels:
+      - "dependencies"
+      - "github-actions"

filedge-0.1.0/.github/pull_request_template.md ADDED

@@ -0,0 +1,25 @@
+## Summary
+<!-- What does this PR do, and why? Link the issue it closes. -->
+Closes #
+## Changes
+<!-- High-level bullets. The diff covers the detail. -->
+-
+-
+## Test plan
+<!-- How did you verify this works? -->
+- [ ] `uv run ruff check .` passes
+- [ ] `uv run pytest --cov=filedge` passes
+- [ ] New tests added for new behaviour (or n/a)
+- [ ] Docs / README / ADR updated (or n/a)
+## Notes for reviewer
+<!-- Anything reviewers should pay particular attention to: tricky bits, follow-ups, alternatives considered. -->

filedge-0.1.0/.github/workflows/bigquery-integration.yml ADDED

@@ -0,0 +1,35 @@
+name: BigQuery Integration
+on:
+  push:
+    branches: [main]
+  workflow_dispatch:
+permissions:
+  contents: read
+  id-token: write
+jobs:
+  bigquery-integration:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6
+      - uses: google-github-actions/auth@v3
+        with:
+          workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
+          service_account: ${{ secrets.GCP_BIGQUERY_TEST_SERVICE_ACCOUNT }}
+      - uses: astral-sh/setup-uv@v7
+        with:
+          python-version: "3.13"
+          enable-caching: true
+      - run: uv sync --extra dev --extra bigquery
+      - name: Test BigQuery connector
+        env:
+          FILEDGE_BIGQUERY_INTEGRATION: "1"
+          BIGQUERY_PROJECT: ${{ vars.BIGQUERY_PROJECT }}
+          BIGQUERY_DATASET: ${{ vars.BIGQUERY_DATASET }}
+        run: uv run pytest tests/test_connector_bigquery.py

filedge-0.1.0/.github/workflows/ci.yml ADDED

@@ -0,0 +1,77 @@
+name: CI
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+jobs:
+  dependency-resolution:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6
+      - uses: astral-sh/setup-uv@v7
+        with:
+          python-version: "3.13"
+          enable-caching: true
+      - run: uv lock --check
+      - run: uv sync --extra s3 --extra gcs
+  lint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6
+      - uses: astral-sh/setup-uv@v7
+        with:
+          python-version: "3.13"
+          enable-caching: true
+      - run: uv sync --extra dev
+      - run: uv run ruff check .
+  test:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.11", "3.13"]
+    services:
+      postgres:
+        image: postgres:16
+        env:
+          POSTGRES_PASSWORD: etl
+          POSTGRES_DB: etldb
+        ports:
+          - 5432:5432
+        options: >-
+          --health-cmd pg_isready
+          --health-interval 10s
+          --health-timeout 5s
+          --health-retries 5
+    steps:
+      - uses: actions/checkout@v6
+      - uses: astral-sh/setup-uv@v7
+        with:
+          python-version: ${{ matrix.python-version }}
+          enable-caching: true
+      - run: uv sync --extra dev --extra postgres --extra duckdb
+      - name: Test
+        env:
+          DATABASE_URL: postgresql://postgres:etl@localhost/etldb
+        run: uv run pytest --cov=filedge --cov-report=term-missing --cov-report=xml
+      - name: Upload coverage to Codecov
+        if: matrix.python-version == '3.13'
+        uses: codecov/codecov-action@v6
+        with:
+          files: ./coverage.xml
+          fail_ci_if_error: false
+          token: ${{ secrets.CODECOV_TOKEN }}

filedge-0.1.0/.github/workflows/docs.yml ADDED

@@ -0,0 +1,44 @@
+name: Deploy docs
+on:
+  push:
+    branches:
+      - main
+  workflow_dispatch:
+permissions:
+  contents: read
+  pages: write
+  id-token: write
+concurrency:
+  group: "pages"
+  cancel-in-progress: false
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6
+      - uses: astral-sh/setup-uv@v7
+      - name: Install docs dependencies
+        run: uv sync --extra docs
+      - name: Build site
+        run: uv run mkdocs build --strict
+      - uses: actions/upload-pages-artifact@v5
+        with:
+          path: site/
+  deploy:
+    environment:
+      name: github-pages
+      url: ${{ steps.deployment.outputs.page_url }}
+    runs-on: ubuntu-latest
+    needs: build
+    steps:
+      - uses: actions/deploy-pages@v5
+        id: deployment

filedge-0.1.0/.github/workflows/release.yml ADDED

@@ -0,0 +1,50 @@
+name: Release
+on:
+  push:
+    tags:
+      - "v*"
+permissions:
+  contents: read
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6
+      - uses: astral-sh/setup-uv@v7
+        with:
+          python-version: "3.13"
+          enable-caching: true
+      - run: uv sync --extra dev
+      - run: uv run ruff check .
+      - run: uv run pytest --cov=filedge
+      - run: uv build
+      - uses: actions/upload-artifact@v4
+        with:
+          name: dist
+          path: dist/
+  publish:
+    needs: build
+    runs-on: ubuntu-latest
+    environment: pypi
+    permissions:
+      id-token: write  # required for OIDC trusted publishing
+    steps:
+      - uses: actions/download-artifact@v4
+        with:
+          name: dist
+          path: dist/
+      - uses: astral-sh/setup-uv@v7
+        with:
+          python-version: "3.13"
+      - run: uv publish

filedge-0.1.0/.gitignore ADDED

@@ -0,0 +1,10 @@
+__pycache__/
+*.py[cod]
+*.egg-info/
+dist/
+build/
+.venv/
+venv/
+*.db
+.env
+site/

filedge-0.1.0/AGENTS.md ADDED

@@ -0,0 +1,17 @@
+# Filedge
+A batch ETL system designed for reliable file ingestion.
+## Agent skills
+### Issue tracker
+Issues live in GitHub Issues for this repo. See `docs/agents/issue-tracker.md`.
+### Triage labels
+Default label vocabulary (needs-triage, needs-info, ready-for-agent, ready-for-human, wontfix). See `docs/agents/triage-labels.md`.
+### Domain docs
+Single-context repo — one `CONTEXT.md` at root and `docs/adr/`. See `docs/agents/domain.md`.

filedge-0.1.0/CLAUDE.md ADDED

@@ -0,0 +1,17 @@
+# Filedge
+A batch ETL system designed for reliable file ingestion.
+## Agent skills
+### Issue tracker
+Issues live in GitHub Issues for this repo. See `docs/agents/issue-tracker.md`.
+### Triage labels
+Default label vocabulary (needs-triage, needs-info, ready-for-agent, ready-for-human, wontfix). See `docs/agents/triage-labels.md`.
+### Domain docs
+Single-context repo — one `CONTEXT.md` at root and `docs/adr/`. See `docs/agents/domain.md`.

filedge-0.1.0/CONTEXT.md ADDED

@@ -0,0 +1,145 @@
+# Context: Filedge
+A batch ETL system designed for reliable data ingestion from files, APIs, and message queues, targeting the failure modes that Airflow + Spark + data warehouse stacks handle poorly.
+---
+## Glossary
+### File
+The atomic unit of work. A single raw input file must either be fully loaded into the destination or not at all — partial states are not permitted.
+### Content Hash
+The primary idempotency key for a File. Computed as SHA-256 of the file's bytes. Two files with the same Content Hash are treated as identical data regardless of filename. Stored alongside the filename in the audit record.
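Reviewer note: the Content Hash definition above can be illustrated with a minimal sketch. It assumes chunked reads so that large files are not held in memory; the function name `content_hash` and the chunk size are hypothetical and are not taken from `filedge/hashing.py`, whose contents are not shown in this part of the diff.

```python
import hashlib
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file's bytes.

    Hypothetical helper: the name and chunk size are illustrative only.
    """
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because only the bytes are hashed, two files with different names but identical content produce the same key, which matches the glossary's statement that the filename does not affect identity.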
+### Partial Load Corruption
+The #1 failure mode this system is designed to prevent. Occurs when a pipeline job fails mid-run and leaves the destination in a half-written state, causing subsequent retries to produce duplicates or skip records.
+### Commit
+The act of successfully applying one File to the Destination and then marking that File `COMMITTED` in the Audit DB. Because the Audit DB and Destination may be separate systems, retry safety comes from Connector-level idempotency keyed by Content Hash rather than one shared transaction.
+### Run
+A single execution of `filedge run` — a short-lived process that scans the Watched Directory, enqueues new Files as PENDING, processes them through the pipeline, and exits. Triggered by an external scheduler (cron, Airflow, Kubernetes CronJob). Stale PROCESSING locks older than a configured timeout are reclaimed at the start of each Run.
+### Streaming Load
+Files are processed in row batches (configurable size, default 1,000) rather than loaded entirely into memory. The Connector writes the File as one idempotent unit and commits only when the full File is processed, keeping memory bounded by `batch_size`.
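Reviewer note: a rough sketch of the row batching described under Streaming Load, assuming rows arrive as any iterable. The helper name `iter_batches` is hypothetical; only the default of 1,000 rows comes from the glossary entry.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def iter_batches(rows: Iterable[T], batch_size: int = 1000) -> Iterator[list[T]]:
    """Yield lists of at most batch_size rows without materializing the input."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    # islice pulls at most batch_size items; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch
```

Only one batch is held at a time, so memory stays proportional to `batch_size` rather than to file size.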
+### Append-Only Load
+The default Write Mode (`write_mode: append`): records from each File are inserted into the destination table without replacing prior records. The ETL layer does not resolve whether a re-dropped file is a correction or a supplement — that is downstream responsibility, resolvable via provenance columns. Two Files with the same filename but different content hashes produce two distinct sets of rows in the destination. See also: Write Mode.
+### Column Tolerance
+Extra columns in a source file (not declared in Pipeline Config) are silently ignored. Missing columns declared as required in Pipeline Config cause the File to fail in Strict Mode. This asymmetry makes the pipeline tolerant of upstream additions while strict about data contract violations.
+### Column Type
+The executable meaning of a Pipeline Config column's `type:` value. Supported values are `string`, `integer`, `float`, `date`, `timestamp`, and `boolean`. `date` means ISO `YYYY-MM-DD`; Schema Inference must not suggest `date` for non-ISO date-like values that Transform cannot load.
+### Table Initialization
+On first Run against a new destination table, the system creates the table from the Pipeline Config schema (including provenance columns). After the table exists, any mismatch between the YAML and the live table causes the Run to fail loudly with a clear diff — no auto-migration. Schema changes require explicit operator action.
+### Runtime
+Python. The implementation language for the ingestion system, CLI, and all pipeline components.
+### Operator CLI
+A command-line interface for system observation and control. `filedge status` prints file counts by state, recent failures, and retry counts. Supports `--json` for machine-readable output. `filedge inspect <file>` runs Schema Inference on a file and prints a suggested `columns:` block. The stable interface over audit DB queries — future web UI would use the same backing queries.
+### Schema Inference
+The process of sampling the first N rows of a File (default 1,000, configurable via `--sample-rows`) and producing a suggested `columns:` block ready to paste into a Pipeline Config, alongside a human-readable summary. Each inferred column carries a Confidence Tier. Invoked via `filedge inspect <file>`. Format is auto-detected from file extension with a `--format` override. The YAML block goes to stdout; the summary goes to stderr, keeping them composable with shell redirection. NDJSON nested objects are surfaced as top-level `string` columns with a warning listing the nested keys — the pipeline has no flattening Transform, so suggesting dot-notation paths would produce a config that cannot be executed.
+_Avoid_: schema detection, type inference, column discovery.
+### Confidence Tier
+An annotation attached to each column in Schema Inference output, expressing how strongly the evidence supports the inferred type and `required:` value. Three tiers: **high** (all sampled values parse cleanly, no nulls); **low** (most values parse but exceptions found — null count or unparseable values shown); **ambiguous** (evidence is genuinely conflicting — e.g. two date formats detected, or values that could be boolean or integer). Operators are expected to review low and ambiguous columns before committing the config to production.
+_Avoid_: confidence score, inference quality, certainty level.
+### Pipeline Config
+A `pipeline.yaml` file that declares how a single ingestion pipeline behaves. Contains the file format, column mappings (source name → destination name + type), destination table name, connector settings, write mode, and retry cap. The operator interface for configuring ingestion — no code changes required for schema mapping updates.
+### Audit Record
+Two-level audit: (1) file-level — captures filename, content hash, state, attempt count, timestamps, and worker identity; (2) row-level provenance — every destination row carries `_source_file_hash` and `_ingested_at` columns linking it back to the File that produced or last changed the current row. Row-level provenance is non-negotiable: it is the basis for data lineage, debugging, and compliance.
+### Audit DB
+The relational database (SQLite for development, PostgreSQL for production) that holds the file-level audit records and drives the state machine (PENDING → PROCESSING → COMMITTED/FAILED). This is the control plane — it is always a SQL database with full transaction support, separate from the Destination.
+### Connector
+A pluggable adapter that owns all interactions with a specific Destination backend: creating or validating the destination table, writing rows, and enforcing write-mode semantics. The Connector is the only component that knows about the Destination's SDK, DDL dialect, and bulk-load API. Adding a new Destination means writing a new Connector — no changes to the pipeline or audit logic. Built-in Connectors: `sqlite`, `postgres`, `bigquery`, `databricks`, `duckdb`.
+### BigQuery Connector
+A Connector that writes rows to a BigQuery table via NDJSON staging and a bulk load job. Idempotency in append mode is achieved by encoding the destination table and `file_hash` in the BigQuery job ID: if a job with the same ID already exists and succeeded, the retry is a no-op. **Known limitation**: BigQuery only retains job metadata for 7 days. A retry of the same file more than 7 days after the original ingestion will submit a new job and produce duplicate rows. For pipelines where files may be re-ingested after this window, use `write_mode: truncate` or implement a pre-load DML DELETE.
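Reviewer note: the job-ID idempotency idea above might look roughly like the following. The exact ID format used by the BigQuery Connector is not visible in this part of the diff, so the `filedge_` prefix, the separator, and the function name `load_job_id` are all assumptions. The sanitizing step reflects BigQuery's rule that job IDs may contain only letters, digits, underscores, and dashes.

```python
import re


def load_job_id(table: str, file_hash: str, prefix: str = "filedge") -> str:
    """Build a deterministic job ID from the destination table and file hash.

    Hypothetical format: the real connector may encode these differently.
    """
    # Replace characters that are not valid in a BigQuery job ID.
    safe_table = re.sub(r"[^A-Za-z0-9_-]", "_", table)
    return f"{prefix}_{safe_table}_{file_hash}"
```

The same table and hash always map to the same ID, so a retry can detect a previously successful job. One caveat of this particular sketch: sanitizing can map distinct table names (for example `a.b` and `a_b`) to the same string.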
+### DuckDB Connector
+A Connector that writes rows to a `.duckdb` file on disk. Targeted at local analytics and lightweight deployments where a full OLAP warehouse (BigQuery, Databricks) is overkill. DuckDB is file-based and supports only one writer at a time — the Connector fails fast with a clear error if the file is locked by another process rather than retrying. DuckDB is a destination only; the audit DB always remains SQLite or PostgreSQL. Rows are written via standard batched `executemany`; bulk Parquet loading is a future optimization.
+### Destination
+The system where ingested rows land. Decoupled from the Audit DB — each has its own connection and transaction scope. Because rows and the audit COMMITTED marker can no longer be written in a single transaction, the Connector is responsible for making `write_rows` idempotent per `file_hash`, so retries produce the same destination state as a first write.
+### Write Mode
+The strategy a Connector uses when writing a File's rows to the Destination table. Declared as `write_mode` in `pipeline.yaml`. Supported modes: `append` (default) — rows are added alongside prior records; `truncate` — the table is wiped then replaced with this File's rows; `cdc` — a CDC File is applied as SCD Type 1 changes by business key. Write Modes must preserve retry safety for a File identified by Content Hash.
+### CDC File
+A File containing change data capture records that describe inserts, updates, and deletes from an upstream system. A CDC File is still a File: it is complete before it reaches the Watched Directory, identified by Content Hash, processed under Strict Mode, and visible in the Audit DB. Filedge applies CDC Files as SCD Type 1 current-state changes; when multiple changes for the same business key appear in one File, the configured sequence column identifies the final change. Ties for the same key and sequence are invalid because row order is not a portable contract. SCD Type 2 history is not part of this term.
+_Avoid_: CDC source, replication stream.
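Reviewer note: the within-File rule for CDC Files (the sequence column picks the final change per business key, and ties are invalid) can be sketched as below. This is an illustration under assumed row shapes (dict-like rows), not the logic in `filedge/cdc.py`; the names `final_changes`, `key_column`, and `sequence_column` are hypothetical.

```python
from typing import Any, Hashable, Iterable, Mapping


def final_changes(
    rows: Iterable[Mapping[str, Any]],
    key_column: str,
    sequence_column: str,
) -> dict[Hashable, Mapping[str, Any]]:
    """Return the highest-sequence row per business key.

    Raises ValueError when two rows share both key and sequence, because
    row order is not treated as a tie-breaker.
    """
    seen: set[tuple[Hashable, Any]] = set()
    latest: dict[Hashable, Mapping[str, Any]] = {}
    for row in rows:
        key, seq = row[key_column], row[sequence_column]
        if (key, seq) in seen:
            raise ValueError(f"duplicate sequence {seq!r} for key {key!r}")
        seen.add((key, seq))
        # Keep the row only if it is the newest change seen for this key.
        if key not in latest or seq > latest[key][sequence_column]:
            latest[key] = row
    return latest
```

The result does not depend on the order of rows inside the File, which is the property the glossary asks for when it rejects ties.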
+### CDC File Order
+The order in which CDC Files are applied when more than one File changes the same business key. Filedge processes Files in sorted path order during a Run, so upstream materializers must name or partition CDC Files so that sorted path order matches the intended change order. Filedge does not infer cross-File ordering from row-level sequence values in the current SCD Type 1 model.
+_Avoid_: CDC checkpoint, global sequence.
+### Applied File Marker
+A Destination-side record that a Connector writes after successfully applying a File whose retry safety cannot rely on row-level `_source_file_hash` alone. Used for CDC Files in warehouse Destinations where replaying the same File would otherwise re-apply business-key mutations. Complements the Audit DB; it does not replace the Audit Record.
+_Avoid_: checkpoint, CDC ledger.
+### Connector Registry
+The internal mapping from a `connector.type` string (e.g. `bigquery`) to a Connector implementation class. Resolved lazily at instantiation time so that missing optional SDK dependencies surface as a clear error only when the Connector is actually used. Declared in `pipeline.yaml` under a `connector:` block; secrets (API tokens, service account credentials) come from environment variables, never from YAML.
+### Retry
+Automatic re-attempt of a FAILED File with exponential backoff, up to a configured max attempt count (default: 3). After the cap is reached, the File enters terminal FAILED state requiring explicit human re-queue (resetting state to PENDING). Prevents bad files from burning retries indefinitely.
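Reviewer note: a small sketch of the retry policy described above. Only the default cap of 3 attempts comes from the glossary; the base delay, the ceiling, and both function names are assumptions made for illustration.

```python
def backoff_seconds(attempt: int, base: float = 2.0, ceiling: float = 300.0) -> float:
    """Delay before the next try, doubling per completed attempt (1-indexed).

    The base and ceiling values are hypothetical.
    """
    if attempt < 1:
        raise ValueError("attempt must be at least 1")
    return min(ceiling, base * 2 ** (attempt - 1))


def is_retryable(attempt_count: int, max_attempts: int = 3) -> bool:
    """True while the File has attempts left; False means terminal FAILED."""
    return attempt_count < max_attempts
```

With these assumed values the delays after attempts 1, 2, and 3 are 2, 4, and 8 seconds, and a File that has used all 3 attempts is no longer retried automatically.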
+### Strict Mode
+The validation policy for a File load: if any row fails schema validation, the entire File fails — no records are committed. This preserves the ability to reason about completeness. Lenient partial commits are not supported; a dead-letter quarantine is a future addition.
+### Transform
+A declarative, configuration-driven step that maps source column names to destination column names and coerces types (e.g. string → integer, ISO string → timestamp). Rejects rows that don't conform to the declared schema. No business logic — that belongs in the application layer consuming the destination.
+### Compaction
+A pre-processing step that merges many small Files in a source prefix into fewer, larger NDJSON files in a separate output prefix before ingestion. Solves the small-files problem common with event streams and cloud object stores — reducing object-store listing cost and enabling bulk loads into cloud warehouses. Invoked via `filedge compact` as a separate CLI command, scheduled before `filedge run`. Compaction reads via fsspec (no extra dependencies), groups files by count (`--max-files`), writes NDJSON with optional gzip compression (`--compress`), and names output files by timestamp and batch index. Originals in the source prefix are never modified. The output prefix becomes the Watched Directory for the subsequent `filedge run`.
+### Parser
+A pluggable component that takes a File path and yields rows. Implementations exist for CSV and newline-delimited JSON. Format is detected by file extension or per-directory configuration. Adding new formats (Parquet, Avro) is a new Parser implementation, not a system redesign.
+### Watched Directory
+The landing zone polled on a schedule to discover new Files. Accepts a local path or a cloud URI (`gs://`, `s3://`). The system scans the location on every Run, computes content hashes, filters out already-COMMITTED files, and enqueues new ones as PENDING. The Watched Directory is assumed to contain only complete, transfer-ready files — partial transfers and in-flight writes are the responsibility of whatever process deposits files there. SFTP is not a supported source; see ADR-0005.
+For large-scale deployments where object-store listing cost or latency becomes a concern, operators should use time-partitioned prefixes — e.g. `s3://bucket/landing/2026-05-23/` — and update the `filedge run --dir` argument daily. This keeps each Run's listing bounded to that day's files without requiring the pipeline to move or delete objects after ingestion.
+### File States
+The four states a File passes through: `PENDING` (discovered, not yet claimed), `PROCESSING` (claimed by a worker — acts as a distributed lock via content hash), `COMMITTED` (fully loaded, transaction complete), `FAILED` (load attempt failed, eligible for retry or human review). A file whose content hash is already `COMMITTED` is never admitted to the pipeline — it is silently deduplicated at the entry point.
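Reviewer note: the four File States can be read as a small state machine. The sketch below encodes the transitions the glossary names (PENDING to PROCESSING, PROCESSING to COMMITTED or FAILED, FAILED back to PENDING on re-queue). Modeling stale-lock reclamation as PROCESSING back to PENDING is an assumption, as are the names `FileState`, `ALLOWED`, and `can_transition`.

```python
from enum import Enum


class FileState(str, Enum):
    PENDING = "PENDING"
    PROCESSING = "PROCESSING"
    COMMITTED = "COMMITTED"
    FAILED = "FAILED"


# Allowed next states for each state; COMMITTED is terminal.
ALLOWED: dict[FileState, set[FileState]] = {
    FileState.PENDING: {FileState.PROCESSING},
    FileState.PROCESSING: {
        FileState.COMMITTED,
        FileState.FAILED,
        FileState.PENDING,  # assumption: how a stale lock is reclaimed
    },
    FileState.FAILED: {FileState.PENDING},
    FileState.COMMITTED: set(),
}


def can_transition(current: FileState, target: FileState) -> bool:
    """Return True if moving from current to target is permitted."""
    return target in ALLOWED[current]
```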
+### Target User
+Data engineering teams at fintech companies where file ingestion is business-critical and high-visibility auditability is a compliance requirement. Every file must be traceable from source to destination row, and the audit trail must be uniform across all data sources — whether data starts as file drops, API exports, or queue messages materialized as files.
+_Avoid_: General data engineering teams, analytics teams.
+### API Source
+A data source that delivers records via HTTP API rather than file drops. Examples: Stripe, Salesforce, HubSpot, Jira, GitHub. API Sources are not polled directly by Filedge. They must be materialized by an upstream Fetcher as complete Files in a Watched Directory before `filedge run` ingests them.
+_Avoid_: API connector, API pipeline.
+### Fetcher
+A component or external job that pulls data from an API Source on a schedule, handles pagination, authentication, rate limiting, and incremental cursor management, and writes complete NDJSON files to the Watched Directory. The Fetcher is the API-source equivalent of the rclone sync layer for SFTP (ADR-0005): useful upstream plumbing, not Filedge's core ingestion layer. dlt, Airbyte, Meltano, vendor exports, and custom scripts can all be Fetchers. Only complete files should reach the Watched Directory; partial fetches must remain in staging or be deleted.
+_Avoid_: API connector, extractor, source connector.
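The rule that only complete files reach the Watched Directory is commonly implemented as "write to staging, then rename". The following is a minimal sketch of that pattern for a custom-script Fetcher, assuming local paths on a single filesystem (where `os.replace` is atomic). The function name `promote_ndjson` and the `.partial` suffix are hypothetical; object stores need a different promotion step.

```python
import json
import os
from pathlib import Path
from typing import Iterable


def promote_ndjson(
    records: Iterable[dict], staging_dir: Path, watched_dir: Path, name: str
) -> Path:
    """Write records as NDJSON in staging, then move the finished file into place."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    watched_dir.mkdir(parents=True, exist_ok=True)
    partial = staging_dir / f"{name}.partial"
    try:
        with partial.open("w", encoding="utf-8") as fh:
            for record in records:
                fh.write(json.dumps(record, sort_keys=True) + "\n")
            fh.flush()
            os.fsync(fh.fileno())  # make sure the bytes are on disk before promoting
    except BaseException:
        # A failed or interrupted fetch must not leave a partial file behind.
        partial.unlink(missing_ok=True)
        raise
    final = watched_dir / name
    os.replace(partial, final)  # atomic when both paths share a filesystem
    return final
```

If the record iterator raises midway (for example, a pagination request fails), the partial file is deleted and nothing appears in the Watched Directory.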
+### Fetch Lock
+A Fetcher-owned concurrency guard that prevents two fetches for the same API Source from racing to promote partial files to the Watched Directory. It may be a filesystem lock, scheduler-level mutual exclusion, or a lock in the Fetcher's own state store. It is not an Audit DB record and is not part of the `filedge run` state machine.
+_Avoid_: fetch mutex, distributed lock.
+### Queue Source
+A data source that delivers records through a message broker such as Kafka, SQS, or Kinesis. Queue Sources are not consumed directly by Filedge. They must be materialized by an upstream Queue Materializer as complete Files in a Watched Directory before `filedge run` ingests them.
+_Avoid_: streaming source, event source, message queue connector.
+### Queue Materializer
+A component or external job that consumes records from a Queue Source, groups them into complete files, writes them to a staging area, and promotes them into the Watched Directory only after the file is complete. Examples include Kafka Connect S3 Sink, Kafka Connect GCS Sink, Flink, Spark Structured Streaming, Vector, Benthos, cloud-native delivery services, and custom consumers. The Queue Materializer owns consumer groups, offsets, rebalances, decoding, schema registry integration, poison-message handling, and delivery cadence. Filedge starts at the File boundary.
+_Avoid_: queue connector, streaming ingestion engine, `filedge consume`.
+### Offset Range Metadata
+Optional provenance metadata recorded by a Queue Materializer in the filename, object metadata, or sidecar manifest. For Kafka, a useful convention is `{topic}.{partition}.{start_offset}-{end_offset}.ndjson`. Offset Range Metadata helps operators trace a File back to queue positions, but it is not Filedge's idempotency key. Filedge still deduplicates by Content Hash.
+_Avoid_: offset range key, consumer checkpoint.
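For operators who adopt the Kafka filename convention above, the provenance can be recovered with a small parser. This is a sketch under that assumption only; `OffsetRange` and `parse_offset_range` are hypothetical names, not part of Filedge, and the result is for tracing, not deduplication.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches {topic}.{partition}.{start_offset}-{end_offset}.ndjson.
# The topic group is greedy so that topic names containing dots still parse.
_PATTERN = re.compile(
    r"^(?P<topic>.+)\.(?P<partition>\d+)\.(?P<start>\d+)-(?P<end>\d+)\.ndjson$"
)


@dataclass(frozen=True)
class OffsetRange:
    topic: str
    partition: int
    start_offset: int
    end_offset: int


def parse_offset_range(filename: str) -> Optional[OffsetRange]:
    """Return the offset range encoded in a filename, or None if it does not match."""
    match = _PATTERN.match(filename)
    if match is None:
        return None
    return OffsetRange(
        topic=match["topic"],
        partition=int(match["partition"]),
        start_offset=int(match["start"]),
        end_offset=int(match["end"]),
    )
```

A filename that does not follow the convention simply yields `None`, which fits the metadata being optional.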
+### Sources Config
+A Fetcher-specific config file that declares how an API Source is pulled: which endpoints to fetch, the incremental key or cursor, credentials lookup, and the staging/landing paths. This is outside Filedge's core config surface. `pipeline.yaml` remains the Filedge config for ingesting the resulting Files.
+_Avoid_: fetch config, source pipeline.

filedge-0.1.0/CONTRIBUTING.md ADDED

@@ -0,0 +1,101 @@
+# Contributing to Filedge
+Thanks for your interest in contributing! This guide covers how to set up a
+development environment, run the test suite, and submit a change.
+## Development setup
+Filedge uses [uv](https://docs.astral.sh/uv/) for environment and dependency
+management. Install uv first, then:
+```bash
+git clone https://github.com/tongqqiu/filedge.git
+cd filedge
+uv sync --extra dev
+```
+For connector-specific work, add the matching extra:
+```bash
+uv sync --extra dev --extra postgres   # Postgres connector
+uv sync --extra dev --extra bigquery   # BigQuery connector
+uv sync --extra dev --extra databricks # Databricks connector
+uv sync --extra dev --extra duckdb     # DuckDB / Parquet
+```
+## Running checks locally
+Before opening a PR, run the same checks CI does:
+```bash
+uv run ruff check .                              # lint
+uv run pytest --cov=filedge --cov-report=term-missing  # tests + coverage
+```
+The Postgres test suite requires a running Postgres instance. Either start one
+locally and export `DATABASE_URL`, or rely on CI to cover that path.
+```bash
+export DATABASE_URL=postgresql://postgres:etl@localhost/etldb
+```
+Live BigQuery / Databricks integration tests are opt-in via env flags — see
+[README.md](README.md#bigquery) for the variables.
+## Submitting a change
+1. **Open an issue first** for anything beyond a small fix. It's faster to align
+   on approach in an issue than to redo a PR.
+2. **Branch from `main`** — name branches like `fix/...`, `feat/...`,
+   `docs/...`, `chore/...`.
+3. **Write or update tests** alongside the change. We expect new code paths to
+   have coverage; CI runs the suite with `--cov`.
+4. **Keep PRs focused.** One concern per PR. Mechanical refactors should be
+   separate from behavioural changes.
+5. **Update docs.** If you change CLI flags, config keys, or connector
+   behaviour, update the relevant page in `docs/` and the README.
+6. **Add an ADR** for architectural decisions. See `docs/adr/` for the format —
+   each ADR is a short markdown file with Context / Decision / Consequences.
+7. **Run the full check suite locally** (lint + pytest) before pushing.
+All changes go through pull request review — no direct pushes to `main`.
+## Commit messages
+Short, imperative subject lines (`Add ...`, `Fix ...`, `Refactor ...`), wrapped
+at ~72 chars. The body explains *why*, not *what* — the diff already shows the
+what.
+## Code style
+- Python 3.11+. Type hints on public functions.
+- `ruff` for lint and format. Default ruleset; do not disable rules without
+  justification.
+- Prefer explicit over clever. This is a reliability-focused codebase — clarity
+  beats brevity.
+- No new top-level dependencies without discussion in an issue. Optional
+  features go behind an extra in `pyproject.toml`.
+## Releasing
+See [docs/release-checklist.md](docs/release-checklist.md) for the step-by-step process — build verification, docs build, CLI smoke test, PyPI publish, and post-publish install-doc update.
+## Reporting bugs
+Use the bug issue template. Include:
+- Filedge version / commit SHA
+- Python version
+- Connector type (sqlite / postgres / bigquery / databricks)
+- Minimal `pipeline.yaml` and sample input that reproduces the issue
+- Full traceback or audit-DB state if relevant
+## Security issues
+**Do not file security issues in the public tracker.** See
+[SECURITY.md](SECURITY.md) for the private reporting process.
+## License
+By contributing, you agree that your contributions will be licensed under the
+Apache License 2.0 — the same license as the rest of the project.