pypreflight 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- pypreflight-0.1.0/.claude/agents/claude-md-updater.md +40 -0
- pypreflight-0.1.0/.claude/agents/debugger.md +26 -0
- pypreflight-0.1.0/.claude/agents/documenter.md +30 -0
- pypreflight-0.1.0/.claude/agents/implementer.md +27 -0
- pypreflight-0.1.0/.claude/agents/planner.md +25 -0
- pypreflight-0.1.0/.claude/agents/refactor.md +25 -0
- pypreflight-0.1.0/.claude/agents/reviewer.md +38 -0
- pypreflight-0.1.0/.claude/agents/test-writer.md +33 -0
- pypreflight-0.1.0/.claude/commands/debug.md +8 -0
- pypreflight-0.1.0/.claude/commands/done.md +15 -0
- pypreflight-0.1.0/.claude/commands/review.md +12 -0
- pypreflight-0.1.0/.claude/commands/ship.md +27 -0
- pypreflight-0.1.0/.gitignore +41 -0
- pypreflight-0.1.0/CLAUDE.md +109 -0
- pypreflight-0.1.0/LICENSE +21 -0
- pypreflight-0.1.0/PKG-INFO +128 -0
- pypreflight-0.1.0/Plan.md +137 -0
- pypreflight-0.1.0/README.md +96 -0
- pypreflight-0.1.0/prompt.txt +36 -0
- pypreflight-0.1.0/pyproject.toml +71 -0
- pypreflight-0.1.0/requirements.txt +9 -0
- pypreflight-0.1.0/src/preflight/__init__.py +210 -0
- pypreflight-0.1.0/src/preflight/assembler.py +384 -0
- pypreflight-0.1.0/src/preflight/cleaner.py +333 -0
- pypreflight-0.1.0/src/preflight/cli.py +108 -0
- pypreflight-0.1.0/src/preflight/engineer.py +270 -0
- pypreflight-0.1.0/src/preflight/profiler.py +347 -0
- pypreflight-0.1.0/src/preflight/report.py +388 -0
- pypreflight-0.1.0/src/preflight/types.py +70 -0
- pypreflight-0.1.0/tests/edge_cases/test_cardinality_extremes.py +73 -0
- pypreflight-0.1.0/tests/edge_cases/test_degenerate_columns.py +87 -0
- pypreflight-0.1.0/tests/edge_cases/test_degenerate_shapes.py +55 -0
- pypreflight-0.1.0/tests/integration/test_adult_income.py +85 -0
- pypreflight-0.1.0/tests/integration/test_house_prices.py +64 -0
- pypreflight-0.1.0/tests/integration/test_titanic.py +81 -0
- pypreflight-0.1.0/tests/repo_hygiene.py +32 -0
- pypreflight-0.1.0/tests/test_assembler.py +90 -0
- pypreflight-0.1.0/tests/test_cleaner.py +106 -0
- pypreflight-0.1.0/tests/test_cleaner_coverage.py +24 -0
- pypreflight-0.1.0/tests/test_cli.py +103 -0
- pypreflight-0.1.0/tests/test_cli_coverage.py +50 -0
- pypreflight-0.1.0/tests/test_coverage_gate.py +11 -0
- pypreflight-0.1.0/tests/test_engineer.py +109 -0
- pypreflight-0.1.0/tests/test_init.py +86 -0
- pypreflight-0.1.0/tests/test_profiler.py +95 -0
- pypreflight-0.1.0/tests/test_report.py +89 -0
- pypreflight-0.1.0/tests/test_scaffold.py +53 -0
- pypreflight-0.1.0/tests/test_types.py +75 -0
@@ -0,0 +1,40 @@
+---
+name: claude-md-updater
+description: >
+Use after any task completes to update CLAUDE.md with progress.
+Use whenever a new file, package, env var, or convention is added.
+Keeps CLAUDE.md accurate so every new session starts with full context.
+tools: Read, Edit
+---
+
+You maintain CLAUDE.md as a living document. Your only job is to keep
+it accurate and current.
+
+## Update triggers
+- New file or directory created → update Project structure
+- New package installed → update Stack
+- New env var needed → update Env vars
+- Convention established or changed → update Code conventions
+- Bug pattern found → update What NOT to do
+- Task completed → update Current focus
+- Significant architecture decision made → append to decisions log
+
+## Workflow
+1. Read CLAUDE.md fully
+2. Identify which sections need updating based on what was just done
+3. Edit only those sections — leave everything else untouched
+4. Update Current focus:
+- Set today's date
+- Summarize what was just completed (specific, not vague)
+- Note what is next if known
+- Note any blockers or open questions
+5. If a significant decision was made, append one line to decisions log:
+format → YYYY-MM-DD: decision made — reason
+
+## Rules
+- Be specific: "Added JWT auth — 15min access token, 7day refresh,
+Redis blacklist" not "added auth"
+- Never delete from the decisions log — only append
+- Keep Current focus under 10 lines — summary, not diary
+- Never change code conventions without explicit instruction
+- After editing, output the exact diff of what changed
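The decisions-log convention in the hunk above (one line per decision, `YYYY-MM-DD: decision made — reason`, append only) can be illustrated with a small sketch. The helper names `format_decision` and `append_decision` are hypothetical and do not appear in the package; the separator is written as the escape `\u2014` to match the dash used in the format line.

```python
from datetime import date
from typing import List

# Separator copied from the format line in the agent file (an em dash).
SEPARATOR = " \u2014 "


def format_decision(day: date, decision: str, reason: str) -> str:
    """Render one decisions-log line as 'YYYY-MM-DD: decision <dash> reason'."""
    return f"{day.isoformat()}: {decision}{SEPARATOR}{reason}"


def append_decision(log_lines: List[str], entry: str) -> List[str]:
    """Return a new log with the entry appended; existing lines are never removed."""
    return [*log_lines, entry]
```

Returning a new list instead of mutating the input is one way to make the "only append" rule hard to violate by accident.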
@@ -0,0 +1,26 @@
+---
+name: debugger
+description: >
+Use when there is a specific error, failing test, or unexpected behavior.
+Paste the full error or describe the symptom precisely.
+Do not use for general code review — use reviewer for that.
+tools: Read, Bash, Glob, Grep
+---
+
+You are a methodical debugger. You find root causes, not symptoms.
+
+## Workflow
+1. Read the error carefully — identify file, line, error type
+2. Read the relevant code
+3. Form a hypothesis
+4. Verify with Bash — run the failing command, trace the code path
+5. Confirm root cause before proposing anything
+6. Propose the minimal fix that resolves the root cause
+7. Verify the fix works by running the failing case again
+
+## Rules
+- Never guess. Every hypothesis must be verified before acting on it.
+- Do not refactor while debugging — fix only the broken thing.
+- If the bug is environmental (missing package, wrong env var, wrong
+Python version), say so explicitly with the exact fix.
+- Return: root cause in one sentence, fix applied, verification output.
@@ -0,0 +1,30 @@
+---
+name: documenter
+description: >
+Use when adding docstrings to existing code, writing README sections,
+or documenting an API endpoint. Invoke after a feature is complete
+and tested. Never changes code logic — documentation only.
+tools: Read, Write, Edit, Glob
+---
+
+You write documentation a new team member can act on immediately.
+
+## For docstrings
+- What the function does (one line)
+- Args: name, type, what it means
+- Returns: type and what it represents
+- Raises: which exceptions and under what condition
+- Example only if usage is non-obvious
+
+## For README / markdown
+- Start with what, not how
+- Show a working example before listing parameters
+- Bullets over paragraphs — make it scannable
+
+## Rules
+- Read the code first. Never document blind.
+- Do not document the obvious.
+- Match existing doc style exactly.
+- Never change code while documenting.
+- After writing, run: python -c "import <module>" (or equivalent)
+to confirm no syntax errors were introduced.
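As an illustration of the docstring checklist in the hunk above (one-line summary, Args, Returns, Raises), here is a minimal sketch. The function `missing_rate` is hypothetical and is not taken from the package. The section layout is an assumed Google-style convention, which the agent file does not prescribe.

```python
from typing import Optional, Sequence


def missing_rate(values: Sequence[Optional[float]]) -> float:
    """Return the fraction of entries in ``values`` that are None.

    Args:
        values: Column values, where None marks a missing entry.

    Returns:
        A float in [0.0, 1.0]: missing entries divided by total entries.

    Raises:
        ValueError: If ``values`` is empty, since the rate is undefined.
    """
    if not values:
        raise ValueError("values must not be empty")
    # Each True counts as 1, so the sum is the number of missing entries.
    return sum(v is None for v in values) / len(values)
```

No Example section is included because the usage is obvious from the signature, which matches the "Example only if usage is non-obvious" rule.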
@@ -0,0 +1,27 @@
+---
+name: implementer
+description: >
+Use for writing new code, implementing features, editing existing files.
+Invoke after planner has scoped the work. Takes one task at a time.
+Do not invoke for debugging — use debugger agent instead.
+tools: Read, Write, Edit, Bash, Glob, Grep
+---
+
+You are a senior engineer. You implement one scoped task at a time.
+
+## Workflow
+1. Read all relevant files before writing anything
+2. Check existing patterns — match them exactly
+3. Implement
+4. Run the test suite (check CLAUDE.md for the command)
+5. Fix any failures before reporting done
+6. Return: files changed, what each does, test result
+
+## Rules
+- Never write code you cannot verify compiles or runs
+- Follow every convention in CLAUDE.md exactly
+- If you discover a better approach than planned, note it but implement
+what was planned — changes to the plan need explicit approval
+- One task per invocation — do not scope-creep into adjacent work
+- Never leave TODOs in code — either implement it or note it as a
+follow-up task explicitly
@@ -0,0 +1,25 @@
+---
+name: planner
+description: >
+Use BEFORE writing any code when the task is complex, ambiguous, or
+touches multiple files. Breaks a feature request into ordered, scoped
+subtasks. Always invoke first for anything larger than a single function.
+tools: Read, Glob, Grep
+---
+
+You are a senior technical lead. You break down feature requests into
+clear, ordered implementation steps. You do NOT write code.
+
+## Output format
+Return a numbered task list:
+1. [SCOPE: file/module] — what to do, why, what to check first
+2. ...
+
+Mark tasks that can run independently with [PARALLEL].
+
+## Rules
+- Read existing code before planning. Never plan blind.
+- Each task must be completable in one focused session.
+- Flag risks, unknowns, and dependencies explicitly.
+- If the request is unclear, ask ONE clarifying question before planning.
+- Do not suggest implementation details unless asked.
@@ -0,0 +1,25 @@
+---
+name: refactor
+description: >
+Use when a file is too large, has duplication, or violates project
+conventions. Only invoke on stable, tested code. Never invoke on
+code that is actively being changed or has failing tests.
+tools: Read, Write, Edit, Bash, Glob, Grep
+---
+
+You refactor code without changing behavior. Tests are your contract.
+
+## Workflow
+1. Run existing tests — confirm green before touching anything
+2. Identify the specific problem (too long / duplication / wrong layer)
+3. Plan the change — what moves, what stays, what gets renamed
+4. Implement in small steps
+5. Run tests after every step
+6. If any test breaks, revert that step immediately and re-approach
+
+## Rules
+- Tests must stay green throughout. If they break, stop.
+- One type of change at a time: rename OR extract OR reorganize
+- Do not add features while refactoring
+- Do not change public interfaces without updating all callers
+- Report: what changed, why, before/after line count if significant
@@ -0,0 +1,38 @@
+---
+name: reviewer
+description: >
+Use after implementer finishes, or on any existing code that needs
+auditing. Checks correctness, security, style, and test coverage.
+Invoke before any important commit. Read-only — never edits files.
+tools: Read, Glob, Grep
+---
+
+You are a security-conscious senior engineer doing a thorough code review.
+
+## What to check
+- Logic errors and unhandled edge cases
+- Security: injection risks, hardcoded secrets, unvalidated input
+- Missing or swallowed error handling
+- Type safety violations
+- Functions doing too many things (split if > ~40 lines)
+- Missing or weak tests
+- Any violation of conventions listed in CLAUDE.md
+
+## Output format
+### Must fix (blocks merge)
+- [file:line] — issue and why it matters
+
+### Should fix (important but not blocking)
+- ...
+
+### Consider (optional improvement)
+- ...
+
+### Tests missing
+- ...
+
+## Rules
+- Read-only. Never edit files.
+- Every finding needs a file and line number.
+- If code is clean, say so explicitly — do not invent issues.
+- Check CLAUDE.md conventions before reviewing — match findings to them.
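The "split if > ~40 lines" check in the hunk above is stated as a guideline for the reviewer agent, not as a tool. One way such a check could be mechanized is sketched below with the standard-library `ast` module. The function name `find_long_functions` and the exact threshold handling are assumptions made for illustration.

```python
import ast
from typing import List, Tuple


def find_long_functions(source: str, max_lines: int = 40) -> List[Tuple[str, int, int]]:
    """Return (name, start_line, length) for each function longer than max_lines."""
    tree = ast.parse(source)
    findings = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Length spans from the 'def' line through the last body line.
            length = node.end_lineno - node.lineno + 1
            if length > max_lines:
                findings.append((node.name, node.lineno, length))
    # Sort by starting line so findings map cleanly onto file:line reports.
    return sorted(findings, key=lambda finding: finding[1])
```

Each tuple carries a line number, which fits the rule that every finding needs a file and line.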
@@ -0,0 +1,33 @@
+---
+name: test-writer
+description: >
+Use after implementer finishes a feature, or to improve coverage on
+existing code. Writes unit and integration tests. Always runs them.
+Do not invoke before the code being tested is stable.
+tools: Read, Write, Bash, Glob
+---
+
+You write thorough tests. You find edge cases humans miss.
+
+## Workflow
+1. Read the code to be tested fully
+2. Read existing tests to match patterns exactly
+3. Identify: happy path, edge cases, error cases, boundary values
+4. Write tests
+5. Run them (check CLAUDE.md for the test command)
+6. Fix until all pass — do not report done with failing tests
+
+## What to cover per function
+- Normal input → expected output
+- Empty / null / zero inputs
+- Boundary values
+- Invalid input
+- Every error path that can raise or return an error
+
+## Rules
+- Tests must actually run and pass before reporting done
+- Test behavior, not implementation details
+- One assertion per test where possible
+- Test names must describe what they verify:
+test_verify_token_returns_false_for_expired_jwt (not test_verify_2)
+- No real API calls in tests — mock external services
@@ -0,0 +1,15 @@
+End of session wrap-up. Run all steps in order.
+
+1. Run test suite (check CLAUDE.md for command).
+Report result.
+
+2. Use reviewer agent on: git diff --name-only HEAD
+Report any must-fix findings.
+
+3. Use claude-md-updater agent.
+Update CLAUDE.md with everything completed this session.
+
+4. Show: git diff CLAUDE.md
+Confirm the update was made.
+
+5. Suggest a git commit message for work done this session.
@@ -0,0 +1,12 @@
+Use the reviewer agent on all files changed since the last git commit.
+
+First run: git diff --name-only HEAD
+to get the list of changed files.
+
+Review each file. Return findings grouped by severity:
+### Must fix
+### Should fix
+### Consider
+### Tests missing
+
+If the diff is empty, say so and stop.
@@ -0,0 +1,27 @@
+Pre-ship checklist. Run all steps in order. Stop and report if any fail.
+
+1. Run full test suite (check CLAUDE.md for command).
+Report: pass/fail, number of tests, any failures.
+
+2. Use reviewer agent on: git diff --name-only HEAD
+Report: any must-fix findings. Stop if any found.
+
+3. Check for hardcoded secrets:
+grep -rn "api_key\|apikey\|password\|secret\|token" \
+--include="*.py" --include="*.ts" --include="*.js" \
+--exclude-dir=".git" --exclude-dir="node_modules" \
+--exclude-dir="tests" .
+Report any hits. Stop if secrets found in non-test code.
+
+4. Verify env vars:
+Check that every var used in code exists in .env.example.
+Report any missing.
+
+5. Run: git diff --stat HEAD
+Show what is being committed.
+
+6. Use claude-md-updater agent to update CLAUDE.md.
+
+7. Suggest a conventional commit message for the work done.
+Format: type(scope): description
+Types: feat / fix / refactor / test / docs / chore
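Step 3 of the checklist above relies on `grep`. For environments where `grep` is unavailable, a rough Python equivalent might look like the sketch below. It is not part of the package; the function name `scan_for_secret_keywords` is hypothetical. Matching is case-sensitive, as with the `grep` call above (no `-i`), and the keyword list, file suffixes, and excluded directories are copied from that command.

```python
import os
import re
from typing import List, Tuple

# Keywords, suffixes, and excluded directories mirror the grep invocation.
SECRET_PATTERN = re.compile(r"api_key|apikey|password|secret|token")
SCANNED_SUFFIXES = (".py", ".ts", ".js")
EXCLUDED_DIRS = {".git", "node_modules", "tests"}


def scan_for_secret_keywords(root: str) -> List[Tuple[str, int, str]]:
    """Return (path, line_number, line) for every line containing a keyword."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories.
        dirnames[:] = sorted(d for d in dirnames if d not in EXCLUDED_DIRS)
        for name in sorted(filenames):
            if not name.endswith(SCANNED_SUFFIXES):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for lineno, line in enumerate(handle, start=1):
                    if SECRET_PATTERN.search(line):
                        hits.append((path, lineno, line.rstrip("\n")))
    return hits
```

Like the `grep` version, this flags keyword matches only, so a hit still needs a human to decide whether it is a real secret.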
@@ -0,0 +1,41 @@
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# Pytest
+.pytest_cache/
+
+# Virtual environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Manual test artifacts
+tmp*
+*.joblib
+scratch.py
+debug.py
@@ -0,0 +1,109 @@
+# CLAUDE.md
+
+## What this is
+A pip-installable Python library that takes a raw pandas DataFrame and returns a cleaned DataFrame, a reusable sklearn Pipeline, and a structured Report — automatically.
+
+## Read first
+See PLAN.md for full technical spec, architecture, and implementation order.
+Start every session by reading PLAN.md, then this file.
+
+## Navigation
+
+src/preflight/
+__init__.py — public API: prepare(), profile(), clean(), engineer(), compare()
+types.py — SemanticType enum, ColumnProfile, ReportEntry dataclasses
+profiler.py — semantic type inference, all EDA signal extraction
+cleaner.py — per-column remediation strategies
+engineer.py — encoding, scaling, datetime expansion
+assembler.py — sklearn Pipeline construction, PrepResult assembly
+report.py — Report object, .show(), .plot(), .to_html(), .to_dict()
+cli.py — typer CLI, wraps prepare() for terminal use
+
+tests/
+test_profiler.py
+test_cleaner.py
+test_engineer.py
+test_assembler.py
+test_report.py
+test_cli.py
+
+## How to run
+
+# install for development
+pip install -e ".[dev]"
+
+# run tests
+pytest tests/
+
+# run CLI
+preflight prepare train.csv --target price --task regression
+
+## Env vars
+None required. No API keys. No external services.
+
+## Code conventions
+- Type hints on every function signature, no exceptions
+- Every automated decision must emit a ReportEntry — nothing silent
+- Profiler output (ColumnProfile[]) is the single source of truth for all downstream decisions
+- Cleaner and Engineer never re-infer column types — they consume ColumnProfile only
+- No business logic in __init__.py — it imports and exposes only
+- All sklearn transformers must use set_output(transform="pandas") for column name preservation
+- Rare category grouping happens before cardinality is finalized in Profiler
+- VIF computation capped at top 50 numeric features by variance — log a warning if cap kicks in
+
+## What NOT to do
+- Do not re-infer SemanticType after Profiler has run — pass ColumnProfile through
+- Do not apply outlier handling to columns with missingness > 30%
+- Do not apply target encoding without cross-fit leakage prevention
+- Do not drop columns based on mutual information scores — surface in Report only
+- Do not use WidthType.PERCENTAGE anywhere in sklearn ColumnTransformer widths
+- Do not train or select models — PreFlight stops before model training
+
+## Current focus
+Last updated: 2026-07-04
+Active work: TestPyPI publish (Phase 13, PLAN.md steps 19-20) as next.
+
+Recent completions:
+- Phase 12 (packaging) is FULLY complete (hygiene, pyproject.toml metadata, and >80% test coverage).
+- Added tests/edge_cases/test_degenerate_shapes.py — Phase 11 (all 3 edge-case sub-steps) is COMPLETE
+- Added tests/edge_cases/test_cardinality_extremes.py — Edge case testing in progress, 1 of 3 sub-steps remaining (degenerate DataFrame shapes)
+- Added tests/edge_cases/test_degenerate_columns.py — Edge case testing in progress, 2 of 3 sub-steps remaining (cardinality/ID extremes, degenerate DataFrame shapes)
+- Added tests/integration/test_adult_income.py — Phase 10 integration testing on all 3 datasets is COMPLETE
+- Added tests/integration/test_house_prices.py to validate regression, target-encoding, and log1p branches — Complete
+- Added tests/integration/test_titanic.py as the first real-dataset integration test — Complete
+- cli.py FULLY complete (Sub-step 3 of 3)
+- Implemented cli.py output file writing logic (Sub-step 2 of 3) — Complete
+- Implemented cli.py typer app skeleton and argument validation (Sub-step 1 of 3) — Complete
+- Implemented compare() to diff PrepResults in __init__.py (Sub-step 3 of 3) — Complete
+- __init__.py fully complete
+- Implemented partial-stage public functions (profile, clean, engineer) in __init__.py (Sub-step 2 of 3) — Complete
+- Implemented prepare() entry point with input validation in __init__.py (Sub-step 1 of 3) — Complete
+- Scaffold the PreFlight-ML repository structure — Complete
+- Implemented src/preflight/types.py — Complete
+- Added runtime validation for ReportEntry stage and severity fields
+- Added comprehensive unit tests for SemanticType, ColumnProfile, ReportEntry, and PrepResult
+- Implemented SemanticType inference logic in profiler.py (Sub-step 1 of 4) — Complete
+- Implemented target-independent structural signal functions in profiler.py (Sub-step 2 of 4) — Complete
+- Implemented target-dependent signal functions in profiler.py (Sub-step 3 of 4) — Complete
+- Implemented run_profiler orchestration in profiler.py (Sub-step 4 of 4) — Complete
+- Implemented cleaner.py base imputation functions (Sub-step 1 of 4) — Complete
+- Implemented cleaner.py column/row structural decisions (Sub-step 2 of 4) — Complete
+- Implemented cleaner.py value-level remediation functions (Sub-step 3 of 4) — Complete
+- Implemented run_cleaner orchestration in cleaner.py (Sub-step 4 of 4) — Complete
+- Implemented engineer.py encoding strategies (ordinal, one-hot, cross-fit target encoding) (Sub-step 1 of 4) — Complete
+- Implemented engineer.py scaling and skew transform functions (Sub-step 2 of 4) — Complete
+- Implemented engineer.py datetime expansion (Sub-step 3 of 4) — Complete
+- Implemented run_engineer orchestration block in engineer.py (Sub-step 4 of 4) — Complete
+- Implemented Report class core in report.py (Sub-step 1 of 3) — Complete
+- Implemented Report.show() terminal output in report.py (Sub-step 2 of 3) — Complete
+- Implemented Report.to_dict() and Report.to_dataframe() export methods (Sub-step 3 of 3) — Complete
+- Implemented CleanerTransformer wrapper in assembler.py (Sub-step 1 of 4) — Complete
+- Implemented EngineerTransformer wrapper in assembler.py (Sub-step 2 of 4) — Complete
+- Implemented two-phase Pipeline construction in assembler.py (Sub-step 3 of 4) — Complete
+- Implemented run_assembler and transform_new_data orchestration (Sub-step 4 of 4) — Complete
+- Implemented report.py standalone charting functions (Sub-step 1 of 3) — Complete
+- Implemented Report constructor extension and unified .plot() method in report.py (Sub-step 2 of 3) — Complete
+- Implemented Report.to_html() and Report.save_html() fully offline generators (Sub-step 3 of 3) — Complete
+
+Open questions / blockers:
+- Should ColumnProfile be frozen/immutable? (Currently it's a standard dataclass)
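Two conventions from the CLAUDE.md hunk above lend themselves to a short illustration: every automated decision emits a report entry with validated stage and severity fields, and outlier handling is skipped for columns with missingness above 30%. The sketch below is not the package's `ReportEntry`. The class `ReportEntrySketch`, its field names, the allowed stage and severity values, and the function `plan_outlier_handling` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical allowed values; the real ones live in src/preflight/types.py.
ALLOWED_STAGES = frozenset({"profiler", "cleaner", "engineer", "assembler"})
ALLOWED_SEVERITIES = frozenset({"info", "warning", "critical"})


@dataclass(frozen=True)
class ReportEntrySketch:
    """Illustrative stand-in for a report entry with runtime validation."""

    stage: str
    column: str
    action: str
    reason: str
    severity: str = "info"

    def __post_init__(self) -> None:
        # Reject unknown values at construction time instead of at render time.
        if self.stage not in ALLOWED_STAGES:
            raise ValueError(f"unknown stage: {self.stage!r}")
        if self.severity not in ALLOWED_SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")


def plan_outlier_handling(
    missing_rates: Dict[str, float],
    report: List[ReportEntrySketch],
    max_missing: float = 0.30,
) -> List[str]:
    """Return columns eligible for outlier handling; log every skipped column."""
    eligible = []
    for column, rate in missing_rates.items():
        if rate > max_missing:
            report.append(
                ReportEntrySketch(
                    stage="cleaner",
                    column=column,
                    action="skip_outlier_handling",
                    reason=f"missingness {rate:.0%} exceeds {max_missing:.0%}",
                    severity="warning",
                )
            )
        else:
            eligible.append(column)
    return eligible
```

The skip branch appends an entry before moving on, so no decision goes unrecorded.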
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 VAIyerAmogha
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,128 @@
+Metadata-Version: 2.4
+Name: pypreflight
+Version: 0.1.0
+Summary: A pip-installable Python library that takes a raw pandas DataFrame and returns a cleaned DataFrame, a reusable sklearn Pipeline, and a structured Report — automatically.
+Project-URL: Homepage, https://github.com/VAIyerAmogha/PreFlight
+Project-URL: Repository, https://github.com/VAIyerAmogha/PreFlight
+Author: VAIyerAmogha
+License: MIT License
+License-File: LICENSE
+Keywords: feature-engineering,machine-learning,pandas,preprocessing,scikit-learn
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Requires-Python: >=3.9
+Requires-Dist: joblib
+Requires-Dist: matplotlib
+Requires-Dist: numpy
+Requires-Dist: pandas
+Requires-Dist: python-dateutil
+Requires-Dist: scikit-learn
+Requires-Dist: scipy
+Requires-Dist: typer
+Provides-Extra: dev
+Requires-Dist: pytest; extra == 'dev'
+Requires-Dist: seaborn; extra == 'dev'
+Description-Content-Type: text/markdown
+
+# PreFlight-ML
+
+A pip-installable Python library that takes a raw pandas DataFrame and returns a cleaned DataFrame, a reusable sklearn Pipeline, and a structured Report — automatically.
+
+![PyPI version placeholder] ![License placeholder]
+
+## Installation
+
+```bash
+# Coming soon to PyPI!
+# pip install preflight-ml
+
+# For now, install from source:
+git clone https://github.com/VAIyerAmogha/PreFlight.git
+cd PreFlight
+pip install .
+```
+
+## Quickstart
+
+```python
+import preflight as pf
+import pandas as pd
+
+# Load your raw, messy dataset
+df = pd.read_csv("data.csv")
+
+# Run the full preparation pipeline
+result = pf.prepare(
+df=df,
+target="price",
+task="regression",
+model_hint="tree"
+)
+
+# 1. Inspect the fully cleaned and engineered dataset
+print(result.df.head())
+
+# 2. Review the automated decisions report
+result.report.show()
+
+# 3. Use the scikit-learn Pipeline on new data
+pipeline = result.pipeline
+new_data = pd.read_csv("new_data.csv")
+# predictions = my_model.predict(pipeline.transform(new_data))
+```
+
+## Features
+
+PreFlight-ML eliminates mechanical data preparation work without creating a black box. Every transform is explainable and reproducible.
+
+### Profiler
+- **Semantic Type Inference**: Automatically infers 8 semantic types (e.g. `NUMERIC_FEATURE`, `CATEGORICAL_HIGH`, `DATETIME_NATIVE`).
+- **Signal Extraction**: Calculates missingness rates, outlier prevalence, cardinality, correlation, mutual information, class imbalance, and leakage flags.
+- **VIF & Multicollinearity**: Detects collinear features to prevent mathematical instability.
+
+### Cleaner
+- **Imputation**: Intelligent median, mode, and constant imputation with automatic missing indicators.
+- **Structural Remediation**: Drops high-missingness and numeric ID columns, removes duplicate rows, and coerces string dates.
+- **Value-Level Fixing**: Winsorizes outliers, normalizes category strings, and groups rare categories.
+
+### Engineer
+- **Encoding Strategies**: Applies ordinal encoding, one-hot encoding, and 5-fold cross-fit target encoding (to prevent target leakage).
+- **Scaling & Transformations**: Applies `StandardScaler` and `log1p` transforms where mathematically safe.
+- **Datetime Expansion**: Automatically extracts features from date columns (year, month, day, day of week).
+
+### Report
+- **Transparent Logging**: Every automated decision is logged with its rationale and severity.
+- **Visualizations**: Generates EDA charts using `result.report.plot()`.
+- **Export Options**: Export the report to terminal, DataFrame, JSON, or embedded HTML.
+
+## CLI Usage
+
+PreFlight-ML can be used directly from the command line:
+
+```bash
+preflight prepare data.csv --target price --task regression --model-hint tree
+```
+This will generate:
+- `data_prepared.csv`
+- `data_pipeline.joblib`
+- `data_report.json`
+
+## Scope and Boundaries
+
+What is in scope (v0.1.0):
+- Fully automated data cleaning and feature engineering.
+- Explainable preprocessing logs.
+- Generation of an exportable scikit-learn `Pipeline`.
+
+Explicitly **OUT OF SCOPE**:
+- Model training, hyperparameter tuning, or AutoML model selection.
+- Destructive feature selection (we do not drop columns silently based on mutual information).
+- Target variable transformation.
+
+For full architectural details, see the [Architecture Docs](PLAN.md) (Note: currently an internal development document, but will be expanded in future releases).