PyPI - data-contract-validator - Versions diffs - 1.1.0__tar.gz → 1.1.7__tar.gz - Mend

data-contract-validator 1.1.0tar.gz → 1.1.7tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (29)

data_contract_validator-1.1.7/CHANGELOG.md ADDED

@@ -0,0 +1,235 @@
+# Changelog
+All notable changes to this project will be documented in this file.
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [Unreleased]
+## [1.1.7] - 2026-07-04
+### Changed
+- **The generated CI workflow now defaults `GITHUB_TOKEN` to
+  `secrets.API_REPO_TOKEN`** — a token you create yourself — instead of the
+  auto-provided `secrets.GITHUB_TOKEN`, which only has access to the repo
+  the workflow runs in. A personal access token works identically for
+  public and private targets, so this removes the silent-failure case
+  entirely rather than just documenting it (1.1.6's fix). The `GITHUB_TOKEN`
+  env block is still omitted entirely for a `local` target.
+### Added
+- The generated CI workflow now includes a commented scaffold for
+  `dbt deps && dbt docs generate`, unlocking Tier 1 (real warehouse types)
+  in CI instead of that only being mentioned in prose docs. Commented out
+  by default since it needs the user's warehouse adapter and credentials
+  filled in, which can't be inferred.
+## [1.1.6] - 2026-07-03
+### Fixed
+- **The generated CI workflow silently assumed the default
+  `secrets.GITHUB_TOKEN` could read the target API repo.** That token only
+  has access to the repo the workflow itself runs in — if `target.*.repo`
+  is a *different*, private repo, validation would fail on every PR with no
+  indication why. The generated workflow now documents the fix inline
+  (a personal-access-token secret pointed at that specific repo) and skips
+  the `GITHUB_TOKEN` env block entirely for a `local` target, which never
+  talks to the GitHub API at all.
+## [1.1.5] - 2026-07-03
+### Fixed
+- **Python `int` was mapped to the narrower `INTEGER` canonical rank,
+  producing a false "type mismatch" warning against any dbt column typed
+  `bigint`** (a very common type for count/id columns). Python's `int` is
+  arbitrary-precision, unlike a fixed-width SQL `INTEGER` column, so there's
+  no real truncation risk — it's now mapped to the wider `BIGINT` rank.
+  A genuinely fractional source (`DECIMAL`/`FLOAT`) is still flagged.
+### Added
+- `init --interactive` now offers to set up a pre-commit hook as part of the
+  same wizard, instead of requiring a separate `setup-precommit` invocation.
+  (The GitHub Actions CI workflow was already created automatically by
+  `init` for both the interactive and non-interactive paths — only the
+  pre-commit step needed folding in.)
+## [1.1.4] - 2026-07-03
+### Changed
+- **`table=True` SQLModel classes are no longer skipped during extraction.**
+  Whether a table is meant to come from dbt is business knowledge that can't
+  be recovered from the Python source — two structurally identical
+  `table=True` classes can need opposite treatment (one is a normal dbt-fed
+  table an API also returns directly; another is populated by a separate
+  pipeline like Kafka and was never meant to have a dbt model). Blanket-
+  skipping every `table=True` class silently exempted the former case from
+  validation too, which is likely the more common pattern — defeating the
+  tool's purpose for it. `table=True` classes are now validated like any
+  other target.
+- Added `mapping.exclude: [<table>, ...]` so the latter case (genuinely no
+  source model, e.g. Kafka-populated) can be stated explicitly instead of
+  inferred from `table=True`. Excluded tables are skipped entirely and never
+  produce a "missing table" issue.
+## [1.1.3] - 2026-07-03
+Supersedes 1.1.2, which was only ever published to TestPyPI for verification
+and never released to production PyPI.
+### Fixed
+- **`table=True` SQLModel classes were incorrectly evaluated as required API
+  contracts.** The standard `class Foo(SQLModel, table=True)` syntax puts
+  `table=True` on the class definition's own keywords, not nested inside a
+  `Call` base — the skip check only looked in the latter, so DB-only tables
+  never matched and produced permanent, unfixable "missing table" criticals.
+- **Explicit `__tablename__` is now resolved and used as the target table
+  name**, instead of only the class-name-derived guess. A class like
+  `VideoViewed` with `__tablename__ = "int_unified_video_viewed"` now matches
+  its real source model without needing a manual `mapping.tables` entry.
+- **`init --interactive` no longer guesses local vs. GitHub from the path's
+  shape.** A local relative path like `app/models` (the wizard's own
+  suggested default) is syntactically identical to a GitHub `org/repo`
+  string, and was always guessed as a repo, producing a nonsensical
+  `app/models/app/models` GitHub target. The wizard now asks explicitly
+  ("local project or a different GitHub repo?") before asking for the
+  path, and asks for the repo and the path within it as separate prompts.
+### Added
+- `init --interactive` and `contract-validator test` now verify a configured
+  GitHub target path actually exists via the GitHub API, instead of silently
+  accepting a stale or typo'd path.
+- GitHub API error messages hint at setting `GITHUB_TOKEN` when an
+  unauthenticated 404 is ambiguous with a private repo.
+### Changed
+- **`contract-validator init` no longer silently overwrites an existing
+  `.retl-validator.yml` or generated workflow file.** Re-running `init` (e.g.
+  after upgrading to pick up a newer version's config defaults) now refuses
+  and exits if either file already exists — pass `--force` to regenerate
+  them from scratch. Previously this was an unconditional overwrite with no
+  confirmation, which could silently destroy hand-added `mapping` entries.
+## [1.1.1] - 2026-06-30
+### Added
+- **Automatic plural/singular table & column matching.** dbt models are
+  conventionally plural (`users`) while Pydantic classes are singular
+  (`User` → `user`); these now match automatically with no `mapping` needed.
+  Candidate forms are only matched against names that actually exist on the
+  other side, so it never over-strips (`address` is never mistaken for
+  `addres`). Explicit `mapping` still takes precedence.
+## [1.1.0] - 2026-06-30
+This release is focused on **accuracy** — making a red check always mean a real
+problem and a green check genuinely safe, so the tool can be trusted to gate a
+deploy.
+### Added
+- **Canonical type system** (`core/types.py`): every extractor now normalizes
+  its native types (warehouse SQL types, Python hints) into a shared, neutral
+  vocabulary (`CanonicalType`). The validator compares canonical types instead
+  of raw strings, eliminating the bulk of false "type mismatch" warnings
+  (e.g. dbt `varchar` vs Pydantic `str` are now correctly equal).
+  - Dialect-aware normalization: Snowflake `NUMBER(38,0)`→bigint, BigQuery
+    `INT64`/`FLOAT64`, Redshift `SUPER`, Postgres `jsonb`, and more.
+- **Tiered dbt extraction** with graceful degradation:
+  1. `catalog.json` — real warehouse types (high confidence).
+  2. `sqlglot` — a proper SQL parser. Handles CTEs, `||`, window functions, and
+     quoted identifiers that the old regex parser mangled. Detects `SELECT *`
+     and flags the schema as incomplete.
+  3. regex — last-resort best effort (low confidence, never hard-fails).
+- **Confidence-aware validation**: when source columns can't be fully resolved
+  (e.g. `SELECT *`), a missing column is reported as a **warning, not a
+  build-blocking critical**. Type warnings are suppressed for low-confidence
+  (regex-tier) sources. This is the core false-positive guard.
+- **Explicit mapping config** (`mapping:` in `.retl-validator.yml`) for when
+  name heuristics aren't enough — map a target table/column to a differently
+  named source model/column:
+  ```yaml
+  mapping:
+    tables:
+      user_analytics: user_analytics_summary
+    columns:
+      user_analytics:
+        userId: user_id
+  ```
+- **Name normalization**: tables/columns now match across snake_case, camelCase
+  and casing differences (`userId` == `user_id` == `USER_ID`).
+### Changed
+- `Schema` now carries `confidence` and `is_complete` (via `metadata`).
+- `BaseExtractor` no longer contains Python-specific type mapping; type
+  normalization lives in the canonical type system. Added `_make_column` helper.
+- Added `sqlglot` as a dependency (imported optionally; falls back to regex if
+  absent).
+### Fixed
+- Hardened GitHub API rate-limit handling against non-dict response headers
+  (previously could raise when headers weren't a mapping).
+## [1.0.5] - 2025-01-24
+### Fixed
+- **CRITICAL**: Fixed missing return statement in `DBTExtractor.extract_schemas()` that could cause it to return `None` instead of a dictionary
+  - Added fallback to SQL file parsing when manifest.json is unavailable
+  - Now works reliably with or without DBT CLI installed
+- **HIGH**: Fixed function signature mismatch in `_test_configuration()` causing TypeError on `--dry-run` command
+  - Added missing `disable_manifest` parameter
+  - Enhanced to display manifest parsing status
+- **MEDIUM**: Replaced bare exception handler in `_try_compile_dbt()` with specific exception types
+  - Now properly handles TimeoutExpired, FileNotFoundError
+  - Provides helpful error messages instead of silent failures
+  - Respects keyboard interrupts
+- **MEDIUM**: Removed unused `fastapi_directory` parameter from CLI
+  - Simplified API - use `--fastapi-local` for both files and directories
+- **MEDIUM**: Added comprehensive YAML error handling with user-friendly messages
+  - Catches malformed YAML files with helpful suggestions
+  - Validates required configuration sections
+  - Provides clear error messages instead of Python tracebacks
+- **LOW**: Added GitHub API rate limiting detection and handling
+  - Monitors rate limit headers and warns when limits are low
+  - Provides helpful guidance to use GITHUB_TOKEN for higher limits
+  - Better error messages for 403 and 404 responses
+### Improved
+- Enhanced error messages throughout the application
+- Better support for different use-cases:
+  - DBT projects with or without manifest.json
+  - Local files and directories for FastAPI models
+  - GitHub repositories with rate limit awareness
+  - Configuration validation with clear error reporting
+## [1.0.0] - 2025-01-XX
+### Added
+- Initial release of Data Contract Validator
+- DBT schema extraction from SQL files and manifest.json
+- FastAPI/Pydantic model extraction from local files and GitHub repos
+- Command-line interface with multiple output formats
+- GitHub Actions integration
+- Contract validation with critical/warning/info severity levels
+- Support for multiple repositories and complex validation scenarios
+### Features
+- ✅ DBT model schema extraction
+- ✅ FastAPI/Pydantic schema extraction
+- ✅ Cross-repository validation
+- ✅ GitHub Actions workflows
+- ✅ Multiple output formats (terminal, JSON, GitHub Actions)
+- ✅ Comprehensive error reporting with suggested fixes
+- ✅ Type compatibility checking
+- ✅ Missing table/column detection
+### Known Limitations
+- Only supports DBT and FastAPI currently
+- Requires manual installation of DBT CLI
+- Limited type inference from SQL
+- No support for complex nested types
+[Unreleased]: https://github.com/OGsiji/data-contract-validator/compare/v1.1.1...HEAD
+[1.1.1]: https://github.com/OGsiji/data-contract-validator/releases/tag/v1.1.1
+[1.1.0]: https://github.com/OGsiji/data-contract-validator/releases/tag/v1.1.0
+[1.0.5]: https://github.com/OGsiji/data-contract-validator/releases/tag/v1.0.5
+[1.0.0]: https://github.com/OGsiji/data-contract-validator/releases/tag/v1.0.0
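
The reasoning behind the 1.1.5 fix (Python `int` compared against a dbt `bigint` column) can be illustrated with a minimal sketch. It is not taken from the package: `Rank`, `PYTHON_TO_RANK`, `SQL_TO_RANK` and `is_type_mismatch` are hypothetical names, and the real `CanonicalType` in `core/types.py` may be organised quite differently. The only behaviour borrowed from the changelog is that `int` maps to the wider `BIGINT` rank and that a fractional source is still flagged.

```python
from enum import IntEnum


class Rank(IntEnum):
    """Hypothetical numeric ranks, ordered from narrowest to widest."""

    INTEGER = 1
    BIGINT = 2
    DECIMAL = 3
    FLOAT = 4


# Assumed mapping from Python hints to a rank. Mapping int to BIGINT mirrors
# the 1.1.5 changelog entry; the float entry is illustrative only.
PYTHON_TO_RANK = {"int": Rank.BIGINT, "float": Rank.FLOAT}

# Assumed mapping from warehouse SQL types to a rank.
SQL_TO_RANK = {
    "integer": Rank.INTEGER,
    "bigint": Rank.BIGINT,
    "decimal": Rank.DECIMAL,
    "float": Rank.FLOAT,
}


def is_type_mismatch(sql_type: str, python_hint: str) -> bool:
    """Flag a source column whose rank is wider than the target's rank."""
    return SQL_TO_RANK[sql_type.lower()] > PYTHON_TO_RANK[python_hint]


print(is_type_mismatch("bigint", "int"))   # source fits the target
print(is_type_mismatch("decimal", "int"))  # fractional source, flagged
```

Under these assumptions, mapping `int` to `Rank.INTEGER` instead would make `bigint` compare as wider than the target, which is the false positive the entry describes.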

{data_contract_validator-1.1.0/data_contract_validator.egg-info → data_contract_validator-1.1.7}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: data-contract-validator
-Version: 1.1.0
+Version: 1.1.7
 Summary: Validate data contracts between dbt models and FastAPI/Pydantic APIs with accurate, low-false-positive schema checks
 Author-email: Ogunniran Siji <ogunniransiji@gmail.com>
 Maintainer-email: Ogunniran Siji <ogunniransiji@gmail.com>
@@ -101,6 +101,80 @@ contract-validator test
 contract-validator validate
 ```
+## 🚀 Getting started, step by step
+If you're setting this up on a project for the first time, the order below
+avoids the sharp edges:
+1. **Install into the same environment dbt runs in** (not a separate venv) —
+   the tool needs to see your dbt project:
+   ```bash
+   pip install data-contract-validator
+   ```
+   Already have `.retl-validator.yml` committed by a teammate? Skip to step 5.
+2. **Generate the config + CI workflow** (one-time):
+   ```bash
+   contract-validator init --interactive
+   ```
+   You'll be asked: where your dbt project is, which API framework you use,
+   whether your models live in this local project or a different GitHub
+   repo, and then the local path (or the `org/repo` + path within it). It's
+   asked explicitly rather than guessed from the path's shape — a local path
+   like `app/models` is syntactically identical to a GitHub `org/repo`
+   string, so there's no reliable way to infer which one you mean. If you
+   pick GitHub, it checks the path actually exists before writing the
+   config — so a typo surfaces here instead of at `validate` time.
+   `init` refuses to touch an existing `.retl-validator.yml` or workflow
+   file — it won't clobber hand-added `mapping` entries just because you
+   upgraded the package and re-ran `init`. Pass `--force` if you really want
+   to regenerate them from the new version's defaults.
+3. **Pre-commit hook**: `init --interactive` asks whether you want one set
+   up right after creating the config and CI workflow — say yes there and
+   it's done. To add one later (or if you used non-interactive `init`,
+   which doesn't prompt), run it standalone:
+   ```bash
+   contract-validator setup-precommit --install-hooks
+   ```
+4. **If the target repo is private, set a token** before running anything
+   that talks to GitHub locally:
+   ```bash
+   export GITHUB_TOKEN=$(gh auth token)   # or a PAT with repo read access
+   ```
+   See [Private GitHub repos need `GITHUB_TOKEN`](#private-github-repos-need-github_token) below for why this is easy to miss.
+5. **Sanity-check the setup**:
+   ```bash
+   contract-validator test
+   ```
+   Confirms the config parses, the dbt project is found, and the target
+   (local path or GitHub path) is reachable. If this fails, `validate` will
+   fail the same way — fix it here first.
+6. **Run it**:
+   ```bash
+   contract-validator validate
+   ```
+7. **When it reports a critical issue, diagnose before assuming your dbt
+   model is wrong**:
+   - Real missing column/table → fix the dbt model.
+   - Target name doesn't match the dbt model by convention (renamed/prefixed)
+     → add an entry under `mapping.tables` in `.retl-validator.yml` (see
+     [When do I need `mapping`?](#when-do-i-need-mapping)).
+   - A table that's genuinely populated by something other than dbt (e.g. a
+     separate streaming pipeline) and has no source model on purpose → add
+     it to `mapping.exclude`. `table=True` alone is **not** used to infer
+     this automatically — see [FastAPI side](#fastapi-side) for why.
+8. **For accurate type-checking** (not just column-presence checks), run
+   `dbt docs generate` before `validate` so it picks up `catalog.json` (Tier 1,
+   real warehouse types) instead of inferring from SQL text — see
+   [How extraction works](#-how-extraction-works-and-why-its-accurate) below.
 ### One-off validation (no config file)
 ```bash
@@ -132,13 +206,25 @@ CI job.
 > 💡 **Tip:** run `dbt docs generate` in CI before validating to unlock Tier 1
 > (real types). Without it, you still get accurate column-presence checks from
-> Tier 2.
+> Tier 2. The workflow `init` generates includes this step already, commented
+> out — it needs your warehouse adapter and credentials filled in, which
+> can't be guessed, so it isn't active by default.
 ### FastAPI side
 Pydantic / SQLModel classes are parsed from source with Python's `ast` (no
-imports executed). `Optional[...]` controls whether a field is required;
-`table=True` SQLModel classes (DB tables, not API contracts) are skipped.
+imports executed). `Optional[...]` controls whether a field is required.
+An explicit `__tablename__` is used as the table name when present;
+otherwise the class name is converted to `snake_case`.
+`table=True` SQLModel classes are validated the same as any other class —
+they are **not** skipped. Whether a table is meant to come from dbt is
+business knowledge that isn't recoverable from the Python source: two
+structurally identical `table=True` classes can need opposite treatment (one
+is a normal dbt-fed table your API also returns directly; another is
+populated by a Kafka stream and was never meant to have a dbt model). Use
+`mapping.exclude` to state the latter case explicitly rather than relying on
+`table=True` to imply it.
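
As a rough illustration of the parsing described above, the following sketch uses Python's `ast` module to read `table=True` from a class definition's own keywords and to resolve an explicit `__tablename__`. It is an assumption-laden sketch, not the package's extractor: `snake_case` and `describe_models` are hypothetical helpers, and only plain `__tablename__ = "..."` assignments are handled.

```python
import ast
import re


def snake_case(name: str) -> str:
    """Convert a class name such as VideoViewed to video_viewed."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def describe_models(source: str) -> dict:
    """Map each class's table name to whether it declares table=True."""
    found = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ClassDef):
            continue
        # `class Foo(SQLModel, table=True)` stores table=True in
        # ClassDef.keywords, not inside any of the base expressions.
        is_table = any(
            kw.arg == "table"
            and isinstance(kw.value, ast.Constant)
            and kw.value.value is True
            for kw in node.keywords
        )
        table_name = snake_case(node.name)
        for stmt in node.body:
            # An explicit __tablename__ = "..." overrides the derived name.
            if (
                isinstance(stmt, ast.Assign)
                and any(
                    isinstance(target, ast.Name) and target.id == "__tablename__"
                    for target in stmt.targets
                )
                and isinstance(stmt.value, ast.Constant)
                and isinstance(stmt.value.value, str)
            ):
                table_name = stmt.value.value
        found[table_name] = is_table
    return found


# The source is only parsed, never imported, so SQLModel and BaseModel do
# not need to be installed for this to run.
SOURCE = '''
class VideoViewed(SQLModel, table=True):
    __tablename__ = "int_unified_video_viewed"
    user_id: int

class UserAnalytics(BaseModel):
    user_id: int
'''

print(describe_models(SOURCE))
```

The sketch only reports the `table=True` flag. In line with the text above, deciding whether such a table should be validated is left to `mapping.exclude`, not to the flag.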
 ## 🚦 What gets flagged
@@ -193,18 +279,103 @@ mapping:
     user_analytics:
       # target column : source column
       userId: user_id
+  # Target tables with no source model on purpose (e.g. Kafka-populated,
+  # not dbt) -- see "When do I need mapping?" below.
+  exclude:
+    - feed_interaction
 validation:
   fail_on: ["missing_tables", "missing_required_columns"]
   warn_on: ["type_mismatches", "missing_optional_columns"]
 ```
+### Private GitHub repos need `GITHUB_TOKEN`
+If `target.*.repo` points at a private repository, `contract-validator`
+needs a token with read access to it. Where that token comes from is
+different locally vs. in CI — and the CI case has a sharp edge worth
+understanding before it silently fails on a PR.
+**Locally**, set the `GITHUB_TOKEN` environment variable before running the
+CLI. On bash/zsh that's `export` (there's nothing to install — `export` just
+makes the variable visible to the `contract-validator` process you run
+next):
+```bash
+export GITHUB_TOKEN=$(gh auth token)   # or a PAT with repo read access
+contract-validator validate
+```
+GitHub's API answers an unauthenticated request to a private path with a
+404, not a 403, so without a token this looks identical to a plain typo in
+`path`. Both `contract-validator init --interactive` and
+`contract-validator test` check that `target.*.path` actually exists, and
+point you at this cause if the lookup returns 404 with no token set.
+**In CI**, the workflow `init` generates for a GitHub target wires up
+`GITHUB_TOKEN: ${{ secrets.API_REPO_TOKEN }}`, a token **you** create,
+*not* the auto-provided `secrets.GITHUB_TOKEN`. That auto-provided token
+only has access to **the repository the workflow is running in**, so if
+your dbt repo and your API repo are different repos, it cannot read a
+private target, and the failure gives no indication why. A PAT works
+identically for public and private targets, so there's no reason to default
+to the token that only sometimes works. To finish the setup the generated
+workflow expects:
+1. Create a token with read access to the *target* repo — a
+   [fine-grained PAT](https://github.com/settings/personal-access-tokens/new)
+   scoped to just that repo's Contents (read-only) is the least-privilege
+   option; a classic PAT with the `repo` scope also works.
+2. In the repo running the workflow (your dbt repo): **Settings → Secrets
+   and variables → Actions → New repository secret**. Name it
+   **`API_REPO_TOKEN`** exactly (that's the name the generated workflow
+   already references) and paste the token as the value.
+   > ⚠️ **GitHub rejects any secret name starting with `GITHUB_`** — it's a
+   > reserved prefix. You cannot create a secret literally called
+   > `GITHUB_TOKEN`; that's not a naming suggestion, the UI will refuse it.
+   > That's exactly why the workflow's secret is named `API_REPO_TOKEN`
+   > instead, even though the environment variable it feeds is `GITHUB_TOKEN`
+   > — two different things with confusingly similar names:
+   > ```yaml
+   > env:
+   >   GITHUB_TOKEN: ${{ secrets.API_REPO_TOKEN }}
+   > #  ^^^^^^^^^^^   local variable name, can be anything -- the CLI
+   > #                just needs it called GITHUB_TOKEN to find it
+   > #                            ^^^^^^^^^^^^^^ the *secret's* name --
+   > #                            this is what GitHub restricts
+   > ```
+Skip all of this for a `local` target — `init` omits the whole `env:` block
+since a local target never talks to the GitHub API at all.
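
The ambiguity between a typo and a missing token can be sketched as a small status-to-hint function. This is a hypothetical helper, not the package's code: `explain_github_failure` and its message wording are assumptions, and only the 404 and 403 behaviours mentioned in this document are modelled.

```python
import os
from typing import Mapping, Optional


def explain_github_failure(
    status: int, env: Optional[Mapping[str, str]] = None
) -> str:
    """Turn an HTTP status from a GitHub path lookup into a hint."""
    env = os.environ if env is None else env
    has_token = bool(env.get("GITHUB_TOKEN"))
    if status == 404 and not has_token:
        # Unauthenticated requests to a private repo also return 404, so a
        # typo and a missing token cannot be told apart from the status.
        return "Path not found. If the repo is private, set GITHUB_TOKEN and retry."
    if status == 404:
        return "Path not found. Check the repo and path for typos, and the token's access."
    if status == 403:
        return "Access refused or rate limited. A GITHUB_TOKEN raises the rate limit."
    return f"Unexpected GitHub API status {status}."


# Passing env explicitly keeps the example independent of the real environment.
print(explain_github_failure(404, env={}))
print(explain_github_failure(404, env={"GITHUB_TOKEN": "example"}))
```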
 ### When do I need `mapping`?
-By default, names are matched across `snake_case` / `camelCase` / casing
-(`UserAnalytics` → `user_analytics`, `userId` → `user_id`). Reach for `mapping`
-only when a model or column is named so differently that the convention can't
-bridge it (e.g. Pydantic `user_id` ↔ dbt `customer_identifier`).
+Most of the time you don't. Names are matched automatically across:
+- `snake_case` / `camelCase` / casing — `UserAnalytics` → `user_analytics`, `userId` → `user_id`
+- **plural ↔ singular** — dbt's plural `users` matches Pydantic's `User` (→ `user`)
+  with no config (and it won't over-match — `address` is never confused with `addres`).
+Reach for `mapping.tables` / `mapping.columns` only when a model or column is
+named so differently that convention can't bridge it (e.g. Pydantic
+`user_id` ↔ dbt `customer_identifier`).
+`mapping.exclude` is different — it's not about renamed models, it's for a
+target table that has **no source model on purpose**, because it's
+populated by something other than dbt (a Kafka stream, a cron job, etc.).
+This can't be inferred from the code (a `table=True` SQLModel class looks
+identical whether or not dbt is supposed to feed it), so it has to be a
+deliberate, human-stated exception:
+```yaml
+mapping:
+  exclude:
+    - feed_interaction
+    - affiliate_reward
+```
+Anything not listed is validated normally — including `table=True` classes,
+which are treated the same as any other target and are not silently skipped.
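
The matching order described in this section (explicit `mapping` first, then case and plural/singular variants that are only accepted when they exist on the other side) can be sketched as follows. The functions `normalize`, `candidate_forms` and `match_source` are hypothetical, and the pluralisation rules are deliberately naive; the package's actual heuristics may differ.

```python
from typing import Dict, List, Optional


def normalize(name: str) -> str:
    """Fold snake_case, camelCase and casing differences into one key."""
    return name.replace("_", "").lower()


def candidate_forms(key: str) -> List[str]:
    """Return the key itself plus naive plural and singular variants."""
    forms = [key, key + "s", key + "es"]
    if key.endswith("ies"):
        forms.append(key[:-3] + "y")
    if key.endswith("es"):
        forms.append(key[:-2])
    if key.endswith("s"):
        forms.append(key[:-1])
    if key.endswith("y"):
        forms.append(key[:-1] + "ies")
    return forms


def match_source(
    target: str,
    sources: List[str],
    mapping: Optional[Dict[str, str]] = None,
) -> Optional[str]:
    """Find the source name for a target name; explicit mapping wins."""
    if mapping and target in mapping:
        return mapping[target]
    by_key = {normalize(source): source for source in sources}
    for form in candidate_forms(normalize(target)):
        # A variant only counts if a source with that name really exists,
        # so "address" never collapses to a non-existent "addres".
        if form in by_key:
            return by_key[form]
    return None


print(match_source("User", ["users", "orders"]))
print(match_source("userId", ["USER_ID"]))
print(match_source("address", ["orders"]))
print(match_source("user_id", ["customer_identifier"],
                   {"user_id": "customer_identifier"}))
```

Checking variants against existing names, instead of stripping suffixes unconditionally, is what keeps the heuristic from inventing matches; anything it cannot bridge falls through to `None`, which is where an explicit `mapping` entry would be needed.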
 ## 🐍 Python API
@@ -248,9 +419,16 @@ jobs:
       # Optional: `dbt docs generate` here for real warehouse types (Tier 1)
       - run: contract-validator validate --output github
         env:
-          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          GITHUB_TOKEN: ${{ secrets.API_REPO_TOKEN }}
 ```
+`GITHUB_TOKEN` here is only needed if `target` is a `github` repo (`init`
+omits the whole `env:` block for a `local` target). `secrets.API_REPO_TOKEN`
+is a token you create yourself, not GitHub's auto-provided
+`secrets.GITHUB_TOKEN` — see
+[Private GitHub repos need `GITHUB_TOKEN`](#private-github-repos-need-github_token)
+above for why, and how to set it up.
 ### Pre-commit
 ```bash
