pypreflight 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (48)
  1. pypreflight-0.1.0/.claude/agents/claude-md-updater.md +40 -0
  2. pypreflight-0.1.0/.claude/agents/debugger.md +26 -0
  3. pypreflight-0.1.0/.claude/agents/documenter.md +30 -0
  4. pypreflight-0.1.0/.claude/agents/implementer.md +27 -0
  5. pypreflight-0.1.0/.claude/agents/planner.md +25 -0
  6. pypreflight-0.1.0/.claude/agents/refactor.md +25 -0
  7. pypreflight-0.1.0/.claude/agents/reviewer.md +38 -0
  8. pypreflight-0.1.0/.claude/agents/test-writer.md +33 -0
  9. pypreflight-0.1.0/.claude/commands/debug.md +8 -0
  10. pypreflight-0.1.0/.claude/commands/done.md +15 -0
  11. pypreflight-0.1.0/.claude/commands/review.md +12 -0
  12. pypreflight-0.1.0/.claude/commands/ship.md +27 -0
  13. pypreflight-0.1.0/.gitignore +41 -0
  14. pypreflight-0.1.0/CLAUDE.md +109 -0
  15. pypreflight-0.1.0/LICENSE +21 -0
  16. pypreflight-0.1.0/PKG-INFO +128 -0
  17. pypreflight-0.1.0/Plan.md +137 -0
  18. pypreflight-0.1.0/README.md +96 -0
  19. pypreflight-0.1.0/prompt.txt +36 -0
  20. pypreflight-0.1.0/pyproject.toml +71 -0
  21. pypreflight-0.1.0/requirements.txt +9 -0
  22. pypreflight-0.1.0/src/preflight/__init__.py +210 -0
  23. pypreflight-0.1.0/src/preflight/assembler.py +384 -0
  24. pypreflight-0.1.0/src/preflight/cleaner.py +333 -0
  25. pypreflight-0.1.0/src/preflight/cli.py +108 -0
  26. pypreflight-0.1.0/src/preflight/engineer.py +270 -0
  27. pypreflight-0.1.0/src/preflight/profiler.py +347 -0
  28. pypreflight-0.1.0/src/preflight/report.py +388 -0
  29. pypreflight-0.1.0/src/preflight/types.py +70 -0
  30. pypreflight-0.1.0/tests/edge_cases/test_cardinality_extremes.py +73 -0
  31. pypreflight-0.1.0/tests/edge_cases/test_degenerate_columns.py +87 -0
  32. pypreflight-0.1.0/tests/edge_cases/test_degenerate_shapes.py +55 -0
  33. pypreflight-0.1.0/tests/integration/test_adult_income.py +85 -0
  34. pypreflight-0.1.0/tests/integration/test_house_prices.py +64 -0
  35. pypreflight-0.1.0/tests/integration/test_titanic.py +81 -0
  36. pypreflight-0.1.0/tests/repo_hygiene.py +32 -0
  37. pypreflight-0.1.0/tests/test_assembler.py +90 -0
  38. pypreflight-0.1.0/tests/test_cleaner.py +106 -0
  39. pypreflight-0.1.0/tests/test_cleaner_coverage.py +24 -0
  40. pypreflight-0.1.0/tests/test_cli.py +103 -0
  41. pypreflight-0.1.0/tests/test_cli_coverage.py +50 -0
  42. pypreflight-0.1.0/tests/test_coverage_gate.py +11 -0
  43. pypreflight-0.1.0/tests/test_engineer.py +109 -0
  44. pypreflight-0.1.0/tests/test_init.py +86 -0
  45. pypreflight-0.1.0/tests/test_profiler.py +95 -0
  46. pypreflight-0.1.0/tests/test_report.py +89 -0
  47. pypreflight-0.1.0/tests/test_scaffold.py +53 -0
  48. pypreflight-0.1.0/tests/test_types.py +75 -0
@@ -0,0 +1,40 @@
1
+ ---
2
+ name: claude-md-updater
3
+ description: >
4
+ Use after any task completes to update CLAUDE.md with progress.
5
+ Use whenever a new file, package, env var, or convention is added.
6
+ Keeps CLAUDE.md accurate so every new session starts with full context.
7
+ tools: Read, Edit
8
+ ---
9
+
10
+ You maintain CLAUDE.md as a living document. Your only job is to keep
11
+ it accurate and current.
12
+
13
+ ## Update triggers
14
+ - New file or directory created → update Project structure
15
+ - New package installed → update Stack
16
+ - New env var needed → update Env vars
17
+ - Convention established or changed → update Code conventions
18
+ - Bug pattern found → update What NOT to do
19
+ - Task completed → update Current focus
20
+ - Significant architecture decision made → append to decisions log
21
+
22
+ ## Workflow
23
+ 1. Read CLAUDE.md fully
24
+ 2. Identify which sections need updating based on what was just done
25
+ 3. Edit only those sections — leave everything else untouched
26
+ 4. Update Current focus:
27
+ - Set today's date
28
+ - Summarize what was just completed (specific, not vague)
29
+ - Note what is next if known
30
+ - Note any blockers or open questions
31
+ 5. If a significant decision was made, append one line to decisions log:
32
+ format → YYYY-MM-DD: decision made — reason
33
+
34
+ ## Rules
35
+ - Be specific: "Added JWT auth — 15min access token, 7day refresh,
36
+ Redis blacklist" not "added auth"
37
+ - Never delete from the decisions log — only append
38
+ - Keep Current focus under 10 lines — summary, not diary
39
+ - Never change code conventions without explicit instruction
40
+ - After editing, output the exact diff of what changed
@@ -0,0 +1,26 @@
1
+ ---
2
+ name: debugger
3
+ description: >
4
+ Use when there is a specific error, failing test, or unexpected behavior.
5
+ Paste the full error or describe the symptom precisely.
6
+ Do not use for general code review — use reviewer for that.
7
+ tools: Read, Bash, Glob, Grep
8
+ ---
9
+
10
+ You are a methodical debugger. You find root causes, not symptoms.
11
+
12
+ ## Workflow
13
+ 1. Read the error carefully — identify file, line, error type
14
+ 2. Read the relevant code
15
+ 3. Form a hypothesis
16
+ 4. Verify with Bash — run the failing command, trace the code path
17
+ 5. Confirm root cause before proposing anything
18
+ 6. Propose the minimal fix that resolves the root cause
19
+ 7. Verify the fix works by running the failing case again
20
+
21
+ ## Rules
22
+ - Never guess. Every hypothesis must be verified before acting on it.
23
+ - Do not refactor while debugging — fix only the broken thing.
24
+ - If the bug is environmental (missing package, wrong env var, wrong
25
+ Python version), say so explicitly with the exact fix.
26
+ - Return: root cause in one sentence, fix applied, verification output.
@@ -0,0 +1,30 @@
1
+ ---
2
+ name: documenter
3
+ description: >
4
+ Use when adding docstrings to existing code, writing README sections,
5
+ or documenting an API endpoint. Invoke after a feature is complete
6
+ and tested. Never changes code logic — documentation only.
7
+ tools: Read, Write, Edit, Glob
8
+ ---
9
+
10
+ You write documentation a new team member can act on immediately.
11
+
12
+ ## For docstrings
13
+ - What the function does (one line)
14
+ - Args: name, type, what it means
15
+ - Returns: type and what it represents
16
+ - Raises: which exceptions and under what condition
17
+ - Example only if usage is non-obvious
18
+
19
+ ## For README / markdown
20
+ - Start with what, not how
21
+ - Show a working example before listing parameters
22
+ - Bullets over paragraphs — make it scannable
23
+
24
+ ## Rules
25
+ - Read the code first. Never document blind.
26
+ - Do not document the obvious.
27
+ - Match existing doc style exactly.
28
+ - Never change code while documenting.
29
+ - After writing, run: python -c "import <module>" (or equivalent)
30
+ to confirm no syntax errors were introduced.
@@ -0,0 +1,27 @@
1
+ ---
2
+ name: implementer
3
+ description: >
4
+ Use for writing new code, implementing features, editing existing files.
5
+ Invoke after planner has scoped the work. Takes one task at a time.
6
+ Do not invoke for debugging — use debugger agent instead.
7
+ tools: Read, Write, Edit, Bash, Glob, Grep
8
+ ---
9
+
10
+ You are a senior engineer. You implement one scoped task at a time.
11
+
12
+ ## Workflow
13
+ 1. Read all relevant files before writing anything
14
+ 2. Check existing patterns — match them exactly
15
+ 3. Implement
16
+ 4. Run the test suite (check CLAUDE.md for the command)
17
+ 5. Fix any failures before reporting done
18
+ 6. Return: files changed, what each does, test result
19
+
20
+ ## Rules
21
+ - Never write code you cannot verify compiles or runs
22
+ - Follow every convention in CLAUDE.md exactly
23
+ - If you discover a better approach than planned, note it but implement
24
+ what was planned — changes to the plan need explicit approval
25
+ - One task per invocation — do not scope-creep into adjacent work
26
+ - Never leave TODOs in code — either implement it or note it as a
27
+ follow-up task explicitly
@@ -0,0 +1,25 @@
1
+ ---
2
+ name: planner
3
+ description: >
4
+ Use BEFORE writing any code when the task is complex, ambiguous, or
5
+ touches multiple files. Breaks a feature request into ordered, scoped
6
+ subtasks. Always invoke first for anything larger than a single function.
7
+ tools: Read, Glob, Grep
8
+ ---
9
+
10
+ You are a senior technical lead. You break down feature requests into
11
+ clear, ordered implementation steps. You do NOT write code.
12
+
13
+ ## Output format
14
+ Return a numbered task list:
15
+ 1. [SCOPE: file/module] — what to do, why, what to check first
16
+ 2. ...
17
+
18
+ Mark tasks that can run independently with [PARALLEL].
19
+
20
+ ## Rules
21
+ - Read existing code before planning. Never plan blind.
22
+ - Each task must be completable in one focused session.
23
+ - Flag risks, unknowns, and dependencies explicitly.
24
+ - If the request is unclear, ask ONE clarifying question before planning.
25
+ - Do not suggest implementation details unless asked.
@@ -0,0 +1,25 @@
1
+ ---
2
+ name: refactor
3
+ description: >
4
+ Use when a file is too large, has duplication, or violates project
5
+ conventions. Only invoke on stable, tested code. Never invoke on
6
+ code that is actively being changed or has failing tests.
7
+ tools: Read, Write, Edit, Bash, Glob, Grep
8
+ ---
9
+
10
+ You refactor code without changing behavior. Tests are your contract.
11
+
12
+ ## Workflow
13
+ 1. Run existing tests — confirm green before touching anything
14
+ 2. Identify the specific problem (too long / duplication / wrong layer)
15
+ 3. Plan the change — what moves, what stays, what gets renamed
16
+ 4. Implement in small steps
17
+ 5. Run tests after every step
18
+ 6. If any test breaks, revert that step immediately and re-approach
19
+
20
+ ## Rules
21
+ - Tests must stay green throughout. If they break, stop.
22
+ - One type of change at a time: rename OR extract OR reorganize
23
+ - Do not add features while refactoring
24
+ - Do not change public interfaces without updating all callers
25
+ - Report: what changed, why, before/after line count if significant
@@ -0,0 +1,38 @@
1
+ ---
2
+ name: reviewer
3
+ description: >
4
+ Use after implementer finishes, or on any existing code that needs
5
+ auditing. Checks correctness, security, style, and test coverage.
6
+ Invoke before any important commit. Read-only — never edits files.
7
+ tools: Read, Glob, Grep
8
+ ---
9
+
10
+ You are a security-conscious senior engineer doing a thorough code review.
11
+
12
+ ## What to check
13
+ - Logic errors and unhandled edge cases
14
+ - Security: injection risks, hardcoded secrets, unvalidated input
15
+ - Missing or swallowed error handling
16
+ - Type safety violations
17
+ - Functions doing too many things (split if > ~40 lines)
18
+ - Missing or weak tests
19
+ - Any violation of conventions listed in CLAUDE.md
20
+
21
+ ## Output format
22
+ ### Must fix (blocks merge)
23
+ - [file:line] — issue and why it matters
24
+
25
+ ### Should fix (important but not blocking)
26
+ - ...
27
+
28
+ ### Consider (optional improvement)
29
+ - ...
30
+
31
+ ### Tests missing
32
+ - ...
33
+
34
+ ## Rules
35
+ - Read-only. Never edit files.
36
+ - Every finding needs a file and line number.
37
+ - If code is clean, say so explicitly — do not invent issues.
38
+ - Check CLAUDE.md conventions before reviewing — match findings to them.
@@ -0,0 +1,33 @@
1
+ ---
2
+ name: test-writer
3
+ description: >
4
+ Use after implementer finishes a feature, or to improve coverage on
5
+ existing code. Writes unit and integration tests. Always runs them.
6
+ Do not invoke before the code being tested is stable.
7
+ tools: Read, Write, Bash, Glob
8
+ ---
9
+
10
+ You write thorough tests. You find edge cases humans miss.
11
+
12
+ ## Workflow
13
+ 1. Read the code to be tested fully
14
+ 2. Read existing tests to match patterns exactly
15
+ 3. Identify: happy path, edge cases, error cases, boundary values
16
+ 4. Write tests
17
+ 5. Run them (check CLAUDE.md for the test command)
18
+ 6. Fix until all pass — do not report done with failing tests
19
+
20
+ ## What to cover per function
21
+ - Normal input → expected output
22
+ - Empty / null / zero inputs
23
+ - Boundary values
24
+ - Invalid input
25
+ - Every error path that can raise or return an error
26
+
27
+ ## Rules
28
+ - Tests must actually run and pass before reporting done
29
+ - Test behavior, not implementation details
30
+ - One assertion per test where possible
31
+ - Test names must describe what they verify:
32
+ test_verify_token_returns_false_for_expired_jwt (not test_verify_2)
33
+ - No real API calls in tests — mock external services
@@ -0,0 +1,8 @@
1
+ Use the debugger agent on the error or symptom described in $ARGUMENTS.
2
+
3
+ If no error is provided, ask for:
4
+ 1. The full error message or unexpected behavior
5
+ 2. The command or action that triggers it
6
+ 3. What the expected behavior is
7
+
8
+ Then diagnose.
@@ -0,0 +1,15 @@
1
+ End of session wrap-up. Run all steps in order.
2
+
3
+ 1. Run test suite (check CLAUDE.md for command).
4
+ Report result.
5
+
6
+ 2. Use reviewer agent on: git diff --name-only HEAD
7
+ Report any must-fix findings.
8
+
9
+ 3. Use claude-md-updater agent.
10
+ Update CLAUDE.md with everything completed this session.
11
+
12
+ 4. Show: git diff CLAUDE.md
13
+ Confirm the update was made.
14
+
15
+ 5. Suggest a git commit message for work done this session.
@@ -0,0 +1,12 @@
1
+ Use the reviewer agent on all files changed since the last git commit.
2
+
3
+ First run: git diff --name-only HEAD
4
+ to get the list of changed files.
5
+
6
+ Review each file. Return findings grouped by severity:
7
+ ### Must fix
8
+ ### Should fix
9
+ ### Consider
10
+ ### Tests missing
11
+
12
+ If the diff is empty, say so and stop.
@@ -0,0 +1,27 @@
1
+ Pre-ship checklist. Run all steps in order. Stop and report if any fail.
2
+
3
+ 1. Run full test suite (check CLAUDE.md for command).
4
+ Report: pass/fail, number of tests, any failures.
5
+
6
+ 2. Use reviewer agent on: git diff --name-only HEAD
7
+ Report: any must-fix findings. Stop if any found.
8
+
9
+ 3. Check for hardcoded secrets:
10
+ grep -rn "api_key\|apikey\|password\|secret\|token" \
11
+ --include="*.py" --include="*.ts" --include="*.js" \
12
+ --exclude-dir=".git" --exclude-dir="node_modules" \
13
+ --exclude-dir="tests" .
14
+ Report any hits. Stop if secrets found in non-test code.
15
+
16
+ 4. Verify env vars:
17
+ Check that every var used in code exists in .env.example.
18
+ Report any missing.
19
+
20
+ 5. Run: git diff --stat HEAD
21
+ Show what is being committed.
22
+
23
+ 6. Use claude-md-updater agent to update CLAUDE.md.
24
+
25
+ 7. Suggest a conventional commit message for the work done.
26
+ Format: type(scope): description
27
+ Types: feat / fix / refactor / test / docs / chore
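Note on step 3 of the checklist above: the secrets check shells out to `grep`. For anyone who wants the same check from Python, the scan could be sketched roughly as follows. The function name `find_secret_hits` and its return shape are hypothetical and are not part of pypreflight. The sketch mirrors the flags shown in the command: a case-sensitive match, only `.py`, `.ts`, and `.js` files, and no descent into `.git`, `node_modules`, or `tests` directories.

```python
import os
import re
from typing import List, Tuple

# Same alternation as the grep pattern in step 3 (case-sensitive, like grep's default).
SECRET_PATTERN = re.compile(r"api_key|apikey|password|secret|token")
INCLUDED_SUFFIXES = (".py", ".ts", ".js")
EXCLUDED_DIRS = {".git", "node_modules", "tests"}


def find_secret_hits(root: str) -> List[Tuple[str, int, str]]:
    """Return (path, line_number, line) for every line matching SECRET_PATTERN."""
    hits: List[Tuple[str, int, str]] = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk never descends into them.
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIRS]
        for name in sorted(filenames):
            if not name.endswith(INCLUDED_SUFFIXES):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if SECRET_PATTERN.search(line):
                        hits.append((path, number, line.rstrip("\n")))
    return hits
```

A non-empty result corresponds to the "Stop if secrets found in non-test code" condition. As with the grep version, a match only flags a suspicious identifier and still needs a human look.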
@@ -0,0 +1,41 @@
1
+ # Python
2
+ __pycache__/
3
+ *.py[cod]
4
+ *$py.class
5
+ *.so
6
+ .Python
7
+ build/
8
+ develop-eggs/
9
+ dist/
10
+ downloads/
11
+ eggs/
12
+ .eggs/
13
+ lib/
14
+ lib64/
15
+ parts/
16
+ sdist/
17
+ var/
18
+ wheels/
19
+ share/python-wheels/
20
+ *.egg-info/
21
+ .installed.cfg
22
+ *.egg
23
+ MANIFEST
24
+
25
+ # Pytest
26
+ .pytest_cache/
27
+
28
+ # Virtual environments
29
+ .env
30
+ .venv
31
+ env/
32
+ venv/
33
+ ENV/
34
+ env.bak/
35
+ venv.bak/
36
+
37
+ # Manual test artifacts
38
+ tmp*
39
+ *.joblib
40
+ scratch.py
41
+ debug.py
@@ -0,0 +1,109 @@
1
+ # CLAUDE.md
2
+
3
+ ## What this is
4
+ A pip-installable Python library that takes a raw pandas DataFrame and returns a cleaned DataFrame, a reusable sklearn Pipeline, and a structured Report — automatically.
5
+
6
+ ## Read first
7
+ See PLAN.md for full technical spec, architecture, and implementation order.
8
+ Start every session by reading PLAN.md, then this file.
9
+
10
+ ## Navigation
11
+
12
+ src/preflight/
13
+ __init__.py — public API: prepare(), profile(), clean(), engineer(), compare()
14
+ types.py — SemanticType enum, ColumnProfile, ReportEntry dataclasses
15
+ profiler.py — semantic type inference, all EDA signal extraction
16
+ cleaner.py — per-column remediation strategies
17
+ engineer.py — encoding, scaling, datetime expansion
18
+ assembler.py — sklearn Pipeline construction, PrepResult assembly
19
+ report.py — Report object, .show(), .plot(), .to_html(), .to_dict()
20
+ cli.py — typer CLI, wraps prepare() for terminal use
21
+
22
+ tests/
23
+ test_profiler.py
24
+ test_cleaner.py
25
+ test_engineer.py
26
+ test_assembler.py
27
+ test_report.py
28
+ test_cli.py
29
+
30
+ ## How to run
31
+
32
+ # install for development
33
+ pip install -e ".[dev]"
34
+
35
+ # run tests
36
+ pytest tests/
37
+
38
+ # run CLI
39
+ preflight prepare train.csv --target price --task regression
40
+
41
+ ## Env vars
42
+ None required. No API keys. No external services.
43
+
44
+ ## Code conventions
45
+ - Type hints on every function signature, no exceptions
46
+ - Every automated decision must emit a ReportEntry — nothing silent
47
+ - Profiler output (ColumnProfile[]) is the single source of truth for all downstream decisions
48
+ - Cleaner and Engineer never re-infer column types — they consume ColumnProfile only
49
+ - No business logic in __init__.py — it imports and exposes only
50
+ - All sklearn transformers must use set_output(transform="pandas") for column name preservation
51
+ - Rare category grouping happens before cardinality is finalized in Profiler
52
+ - VIF computation capped at top 50 numeric features by variance — log a warning if cap kicks in
53
+
54
+ ## What NOT to do
55
+ - Do not re-infer SemanticType after Profiler has run — pass ColumnProfile through
56
+ - Do not apply outlier handling to columns with missingness > 30%
57
+ - Do not apply target encoding without cross-fit leakage prevention
58
+ - Do not drop columns based on mutual information scores — surface in Report only
59
+ - Do not use WidthType.PERCENTAGE anywhere in sklearn ColumnTransformer widths
60
+ - Do not train or select models — PreFlight stops before model training
61
+
62
+ ## Current focus
63
+ Last updated: 2026-07-04
64
+ Active work: next is the TestPyPI publish (Phase 13, PLAN.md steps 19-20).
65
+
66
+ Recent completions:
67
+ - Phase 12 (packaging) is FULLY complete (hygiene, pyproject.toml metadata, and >80% test coverage).
68
+ - Added tests/edge_cases/test_degenerate_shapes.py — Phase 11 (all 3 edge-case sub-steps) is COMPLETE
69
+ - Added tests/edge_cases/test_cardinality_extremes.py — Edge case testing in progress, 1 of 3 sub-steps remaining (degenerate DataFrame shapes)
70
+ - Added tests/edge_cases/test_degenerate_columns.py — Edge case testing in progress, 2 of 3 sub-steps remaining (cardinality/ID extremes, degenerate DataFrame shapes)
71
+ - Added tests/integration/test_adult_income.py — Phase 10 integration testing on all 3 datasets is COMPLETE
72
+ - Added tests/integration/test_house_prices.py to validate regression, target-encoding, and log1p branches — Complete
73
+ - Added tests/integration/test_titanic.py as the first real-dataset integration test — Complete
74
+ - cli.py FULLY complete (Sub-step 3 of 3)
75
+ - Implemented cli.py output file writing logic (Sub-step 2 of 3) — Complete
76
+ - Implemented cli.py typer app skeleton and argument validation (Sub-step 1 of 3) — Complete
77
+ - Implemented compare() to diff PrepResults in __init__.py (Sub-step 3 of 3) — Complete
78
+ - __init__.py fully complete
79
+ - Implemented partial-stage public functions (profile, clean, engineer) in __init__.py (Sub-step 2 of 3) — Complete
80
+ - Implemented prepare() entry point with input validation in __init__.py (Sub-step 1 of 3) — Complete
81
+ - Scaffolded the PreFlight-ML repository structure — Complete
82
+ - Implemented src/preflight/types.py — Complete
83
+ - Added runtime validation for ReportEntry stage and severity fields
84
+ - Added comprehensive unit tests for SemanticType, ColumnProfile, ReportEntry, and PrepResult
85
+ - Implemented SemanticType inference logic in profiler.py (Sub-step 1 of 4) — Complete
86
+ - Implemented target-independent structural signal functions in profiler.py (Sub-step 2 of 4) — Complete
87
+ - Implemented target-dependent signal functions in profiler.py (Sub-step 3 of 4) — Complete
88
+ - Implemented run_profiler orchestration in profiler.py (Sub-step 4 of 4) — Complete
89
+ - Implemented cleaner.py base imputation functions (Sub-step 1 of 4) — Complete
90
+ - Implemented cleaner.py column/row structural decisions (Sub-step 2 of 4) — Complete
91
+ - Implemented cleaner.py value-level remediation functions (Sub-step 3 of 4) — Complete
92
+ - Implemented run_cleaner orchestration in cleaner.py (Sub-step 4 of 4) — Complete
93
+ - Implemented engineer.py encoding strategies (ordinal, one-hot, cross-fit target encoding) (Sub-step 1 of 4) — Complete
94
+ - Implemented engineer.py scaling and skew transform functions (Sub-step 2 of 4) — Complete
95
+ - Implemented engineer.py datetime expansion (Sub-step 3 of 4) — Complete
96
+ - Implemented run_engineer orchestration block in engineer.py (Sub-step 4 of 4) — Complete
97
+ - Implemented Report class core in report.py (Sub-step 1 of 3) — Complete
98
+ - Implemented Report.show() terminal output in report.py (Sub-step 2 of 3) — Complete
99
+ - Implemented Report.to_dict() and Report.to_dataframe() export methods (Sub-step 3 of 3) — Complete
100
+ - Implemented CleanerTransformer wrapper in assembler.py (Sub-step 1 of 4) — Complete
101
+ - Implemented EngineerTransformer wrapper in assembler.py (Sub-step 2 of 4) — Complete
102
+ - Implemented two-phase Pipeline construction in assembler.py (Sub-step 3 of 4) — Complete
103
+ - Implemented run_assembler and transform_new_data orchestration (Sub-step 4 of 4) — Complete
104
+ - Implemented report.py standalone charting functions (Sub-step 1 of 3) — Complete
105
+ - Implemented Report constructor extension and unified .plot() method in report.py (Sub-step 2 of 3) — Complete
106
+ - Implemented Report.to_html() and Report.save_html() fully offline generators (Sub-step 3 of 3) — Complete
107
+
108
+ Open questions / blockers:
109
+ - Should ColumnProfile be frozen/immutable? (Currently it's a standard dataclass)
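Note on the CLAUDE.md conventions above: the rule that VIF computation is capped at the top 50 numeric features by variance, with a warning when the cap applies, could look roughly like the sketch below. This is an illustration under stated assumptions, not the code in profiler.py. The helper name `select_vif_columns`, the logger name, the dict-of-lists input, and the use of population variance are all choices made for this example.

```python
import logging
import statistics
from typing import Dict, List, Sequence

logger = logging.getLogger("vif_cap_sketch")  # hypothetical logger name


def select_vif_columns(
    columns: Dict[str, Sequence[float]], cap: int = 50
) -> List[str]:
    """Return at most `cap` column names, keeping those with the highest variance."""
    if len(columns) <= cap:
        return list(columns)
    # Rank by variance, largest first; sorted() is stable, so ties keep input order.
    ranked = sorted(
        columns,
        key=lambda name: statistics.pvariance(columns[name]),
        reverse=True,
    )
    logger.warning(
        "VIF cap applied: keeping %d of %d numeric columns by variance",
        cap,
        len(columns),
    )
    return ranked[:cap]
```

Ranking by raw variance favors columns on larger scales, so a real implementation might standardize first; the convention as written does not say either way.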
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 VAIyerAmogha
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,128 @@
1
+ Metadata-Version: 2.4
2
+ Name: pypreflight
3
+ Version: 0.1.0
4
+ Summary: A pip-installable Python library that takes a raw pandas DataFrame and returns a cleaned DataFrame, a reusable sklearn Pipeline, and a structured Report — automatically.
5
+ Project-URL: Homepage, https://github.com/VAIyerAmogha/PreFlight
6
+ Project-URL: Repository, https://github.com/VAIyerAmogha/PreFlight
7
+ Author: VAIyerAmogha
8
+ License: MIT License
9
+ License-File: LICENSE
10
+ Keywords: feature-engineering,machine-learning,pandas,preprocessing,scikit-learn
11
+ Classifier: Development Status :: 3 - Alpha
12
+ Classifier: Intended Audience :: Developers
13
+ Classifier: License :: OSI Approved :: MIT License
14
+ Classifier: Programming Language :: Python :: 3.9
15
+ Classifier: Programming Language :: Python :: 3.10
16
+ Classifier: Programming Language :: Python :: 3.11
17
+ Classifier: Programming Language :: Python :: 3.12
18
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
19
+ Requires-Python: >=3.9
20
+ Requires-Dist: joblib
21
+ Requires-Dist: matplotlib
22
+ Requires-Dist: numpy
23
+ Requires-Dist: pandas
24
+ Requires-Dist: python-dateutil
25
+ Requires-Dist: scikit-learn
26
+ Requires-Dist: scipy
27
+ Requires-Dist: typer
28
+ Provides-Extra: dev
29
+ Requires-Dist: pytest; extra == 'dev'
30
+ Requires-Dist: seaborn; extra == 'dev'
31
+ Description-Content-Type: text/markdown
32
+
33
+ # PreFlight-ML
34
+
35
+ A pip-installable Python library that takes a raw pandas DataFrame and returns a cleaned DataFrame, a reusable sklearn Pipeline, and a structured Report — automatically.
36
+
37
+ ![PyPI version placeholder] ![License placeholder]
38
+
39
+ ## Installation
40
+
41
+ ```bash
42
+ # Coming soon to PyPI!
43
+ # pip install pypreflight
44
+
45
+ # For now, install from source:
46
+ git clone https://github.com/VAIyerAmogha/PreFlight.git
47
+ cd PreFlight
48
+ pip install .
49
+ ```
50
+
51
+ ## Quickstart
52
+
53
+ ```python
54
+ import preflight as pf
55
+ import pandas as pd
56
+
57
+ # Load your raw, messy dataset
58
+ df = pd.read_csv("data.csv")
59
+
60
+ # Run the full preparation pipeline
61
+ result = pf.prepare(
62
+ df=df,
63
+ target="price",
64
+ task="regression",
65
+ model_hint="tree"
66
+ )
67
+
68
+ # 1. Inspect the fully cleaned and engineered dataset
69
+ print(result.df.head())
70
+
71
+ # 2. Review the automated decisions report
72
+ result.report.show()
73
+
74
+ # 3. Use the scikit-learn Pipeline on new data
75
+ pipeline = result.pipeline
76
+ new_data = pd.read_csv("new_data.csv")
77
+ # predictions = my_model.predict(pipeline.transform(new_data))
78
+ ```
79
+
80
+ ## Features
81
+
82
+ PreFlight-ML eliminates mechanical data preparation work without creating a black box. Every transform is explainable and reproducible.
83
+
84
+ ### Profiler
85
+ - **Semantic Type Inference**: Automatically infers 8 semantic types (e.g. `NUMERIC_FEATURE`, `CATEGORICAL_HIGH`, `DATETIME_NATIVE`).
86
+ - **Signal Extraction**: Calculates missingness rates, outlier prevalence, cardinality, correlation, mutual information, class imbalance, and leakage flags.
87
+ - **VIF & Multicollinearity**: Detects collinear features to prevent mathematical instability.
88
+
89
+ ### Cleaner
90
+ - **Imputation**: Intelligent median, mode, and constant imputation with automatic missing indicators.
91
+ - **Structural Remediation**: Drops high-missingness and numeric ID columns, removes duplicate rows, and coerces string dates.
92
+ - **Value-Level Fixing**: Winsorizes outliers, normalizes category strings, and groups rare categories.
93
+
94
+ ### Engineer
95
+ - **Encoding Strategies**: Applies ordinal encoding, one-hot encoding, and 5-fold cross-fit target encoding (to prevent target leakage).
96
+ - **Scaling & Transformations**: Applies `StandardScaler` and `log1p` transforms where mathematically safe.
97
+ - **Datetime Expansion**: Automatically extracts features from date columns (year, month, day, day of week).
98
+
99
+ ### Report
100
+ - **Transparent Logging**: Every automated decision is logged with its rationale and severity.
101
+ - **Visualizations**: Generates EDA charts using `result.report.plot()`.
102
+ - **Export Options**: Export the report to terminal, DataFrame, JSON, or embedded HTML.
103
+
104
+ ## CLI Usage
105
+
106
+ PreFlight-ML can be used directly from the command line:
107
+
108
+ ```bash
109
+ preflight prepare data.csv --target price --task regression --model-hint tree
110
+ ```
111
+ This will generate:
112
+ - `data_prepared.csv`
113
+ - `data_pipeline.joblib`
114
+ - `data_report.json`
115
+
116
+ ## Scope and Boundaries
117
+
118
+ What is in scope (v0.1.0):
119
+ - Fully automated data cleaning and feature engineering.
120
+ - Explainable preprocessing logs.
121
+ - Generation of an exportable scikit-learn `Pipeline`.
122
+
123
+ Explicitly **OUT OF SCOPE**:
124
+ - Model training, hyperparameter tuning, or AutoML model selection.
125
+ - Destructive feature selection (we do not drop columns silently based on mutual information).
126
+ - Target variable transformation.
127
+
128
+ For full architectural details, see the [Architecture Docs](PLAN.md) (Note: currently an internal development document, but will be expanded in future releases).
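Note on the Engineer feature list above: it names 5-fold cross-fit target encoding as the safeguard against target leakage. As a rough illustration of the idea (not the implementation in engineer.py), each row's category is encoded with the target mean computed from the other folds only, so a row never sees its own target value. The function name `cross_fit_target_encode`, the round-robin fold assignment, and the fallback to the training folds' overall mean are assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def cross_fit_target_encode(
    categories: Sequence[str], targets: Sequence[float], n_folds: int = 5
) -> List[float]:
    """Encode each category with the target mean from the other folds only."""
    if len(categories) != len(targets):
        raise ValueError("categories and targets must have the same length")
    n_rows = len(categories)
    encoded = [0.0] * n_rows
    for fold in range(n_folds):
        # Round-robin split: row i is held out when i % n_folds == fold.
        sums: Dict[str, float] = defaultdict(float)
        counts: Dict[str, int] = defaultdict(int)
        total, seen = 0.0, 0
        for i in range(n_rows):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                total += targets[i]
                seen += 1
        fallback = total / seen if seen else 0.0
        for i in range(fold, n_rows, n_folds):
            cat = categories[i]
            # Category unseen in the training folds: use their overall mean instead.
            encoded[i] = sums[cat] / counts[cat] if counts[cat] else fallback
    return encoded
```

For example, with categories `["a", "a", "b", "b"]`, targets `[1.0, 3.0, 10.0, 20.0]`, and `n_folds=2`, the first row is encoded from the second row's target and vice versa, giving `[3.0, 1.0, 20.0, 10.0]`. A production version would usually shuffle rows before splitting and smooth the means for small categories.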