falsegreen 0.2.0__tar.gz → 0.2.2__tar.gz

Files changed (37)
  1. {falsegreen-0.2.0 → falsegreen-0.2.2}/.github/workflows/ci.yml +2 -6
  2. {falsegreen-0.2.0 → falsegreen-0.2.2}/.github/workflows/release-drafter.yml +1 -1
  3. {falsegreen-0.2.0 → falsegreen-0.2.2}/.github/workflows/release.yml +2 -2
  4. {falsegreen-0.2.0 → falsegreen-0.2.2}/CHANGELOG.md +22 -6
  5. {falsegreen-0.2.0 → falsegreen-0.2.2}/CONTRIBUTING.md +12 -26
  6. {falsegreen-0.2.0 → falsegreen-0.2.2}/CREDITS.md +10 -7
  7. {falsegreen-0.2.0 → falsegreen-0.2.2}/PKG-INFO +48 -116
  8. {falsegreen-0.2.0 → falsegreen-0.2.2}/README.md +47 -115
  9. {falsegreen-0.2.0 → falsegreen-0.2.2}/pyproject.toml +1 -1
  10. {falsegreen-0.2.0 → falsegreen-0.2.2}/src/falsegreen/__init__.py +1 -1
  11. {falsegreen-0.2.0 → falsegreen-0.2.2}/src/falsegreen/scanner.py +2 -2
  12. {falsegreen-0.2.0 → falsegreen-0.2.2}/tests/test_scanner.py +13 -0
  13. falsegreen-0.2.0/.claude-plugin/marketplace.json +0 -20
  14. falsegreen-0.2.0/.claude-plugin/plugin.json +0 -11
  15. falsegreen-0.2.0/skills/falsegreen/README.md +0 -22
  16. falsegreen-0.2.0/skills/falsegreen/SKILL.md +0 -278
  17. falsegreen-0.2.0/skills/falsegreen/examples/bad_tests_sample.py +0 -96
  18. falsegreen-0.2.0/skills/falsegreen/reference.md +0 -365
  19. falsegreen-0.2.0/skills/falsegreen/scripts/scan.py +0 -1625
  20. {falsegreen-0.2.0 → falsegreen-0.2.2}/.gitattributes +0 -0
  21. {falsegreen-0.2.0 → falsegreen-0.2.2}/.github/CODEOWNERS +0 -0
  22. {falsegreen-0.2.0 → falsegreen-0.2.2}/.github/ISSUE_TEMPLATE/bug_report.md +0 -0
  23. {falsegreen-0.2.0 → falsegreen-0.2.2}/.github/ISSUE_TEMPLATE/config.yml +0 -0
  24. {falsegreen-0.2.0 → falsegreen-0.2.2}/.github/ISSUE_TEMPLATE/feature_request.md +0 -0
  25. {falsegreen-0.2.0 → falsegreen-0.2.2}/.github/dependabot.yml +0 -0
  26. {falsegreen-0.2.0 → falsegreen-0.2.2}/.github/pull_request_template.md +0 -0
  27. {falsegreen-0.2.0 → falsegreen-0.2.2}/.github/release-drafter.yml +0 -0
  28. {falsegreen-0.2.0 → falsegreen-0.2.2}/.gitignore +0 -0
  29. {falsegreen-0.2.0 → falsegreen-0.2.2}/.pre-commit-hooks.yaml +0 -0
  30. {falsegreen-0.2.0 → falsegreen-0.2.2}/CODE_OF_CONDUCT.md +0 -0
  31. {falsegreen-0.2.0 → falsegreen-0.2.2}/LICENSE +0 -0
  32. {falsegreen-0.2.0 → falsegreen-0.2.2}/RELEASE.md +0 -0
  33. {falsegreen-0.2.0 → falsegreen-0.2.2}/SECURITY.md +0 -0
  34. {falsegreen-0.2.0 → falsegreen-0.2.2}/docs/guide.md +0 -0
  35. {falsegreen-0.2.0 → falsegreen-0.2.2}/requirements-dev.txt +0 -0
  36. {falsegreen-0.2.0 → falsegreen-0.2.2}/src/falsegreen/__main__.py +0 -0
  37. {falsegreen-0.2.0 → falsegreen-0.2.2}/src/falsegreen/hook_install.py +0 -0
@@ -24,9 +24,5 @@ jobs:
24
24
  run: ruff check src tests
25
25
  - name: Test
26
26
  run: pytest -q
27
- - name: Bundled skill scanner must match the package
28
- run: diff -u src/falsegreen/scanner.py skills/falsegreen/scripts/scan.py
29
- - name: Self-scan (must flag the demo, must not flag itself)
30
- run: |
31
- python -m falsegreen skills/falsegreen/examples/bad_tests_sample.py || true
32
- python -m falsegreen src tests
27
+ - name: Self-scan (must not flag itself)
28
+ run: python -m falsegreen src tests
@@ -16,6 +16,6 @@ jobs:
16
16
  pull-requests: write
17
17
  runs-on: ubuntu-latest
18
18
  steps:
19
- - uses: release-drafter/release-drafter@v6
19
+ - uses: release-drafter/release-drafter@v7
20
20
  env:
21
21
  GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
@@ -31,7 +31,7 @@ jobs:
31
31
  python -m build
32
32
  python -m twine check dist/*
33
33
  - name: Upload dist artifact
34
- uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4
34
+ uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
35
35
  with:
36
36
  name: dist
37
37
  path: dist/
@@ -45,7 +45,7 @@ jobs:
45
45
  id-token: write # OIDC: the only credential the publish step needs
46
46
  steps:
47
47
  - name: Download dist artifact
48
- uses: actions/download-artifact@d3f86a106a0bac45b974a628896c90dbdf5c8093 # v4
48
+ uses: actions/download-artifact@3e5f45b2cfb9172054b4087a40e8e0b5a5461e7c # v8.0.1
49
49
  with:
50
50
  name: dist
51
51
  path: dist/
@@ -6,6 +6,21 @@ to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
6
6
 
7
7
  ## [Unreleased]
8
8
 
9
+ ## [0.2.2] - 2026-06-08
10
+
11
+ ### Changed
12
+ - Skill and Claude plugin removed from this repo — the LLM semantic pass, the
13
+ detection reference, and multi-language support now live in
14
+ [falsegreen-skill](https://github.com/vinicq/falsegreen-skill).
15
+ - README, CONTRIBUTING, and CREDITS updated to reflect the split.
16
+
17
+ ## [0.2.1] - 2026-06-08
18
+
19
+ ### Fixed
20
+ - C2 (HIGH) no longer flags an empty body under sympy's `@SKIP` decorator
21
+ (`from sympy.testing.pytest import SKIP`), which raises `Skipped` at runtime —
22
+ same semantics as `@pytest.mark.skip`. Found validating sympy.
23
+
9
24
  ## [0.2.0] - 2026-06-05
10
25
 
11
26
  ### Fixed
@@ -72,16 +87,17 @@ First release.
72
87
  - C20 (HIGH): assertion in dead code after `return`/`raise`/`fail()`. C21 (LOW):
73
88
  every assertion conditional, none runs unconditionally. Both from the rotten-
74
89
  green-test line of work (Soares 2023).
75
- - Claude Code skill (`/falsegreen`) for the semantic pass: judges a test's
76
- expected value against intended behavior using an oracle hierarchy and a
77
- test-intent classification step (catches cases 12 and 18).
78
- - Distribution as a pip package, a `pre-commit` hook, and a Claude plugin.
79
- - Plain-language guide (`docs/guide.md`), detection reference, and a demo file.
90
+ - Distribution as a pip package and a `pre-commit` hook.
91
+ - Plain-language guide (`docs/guide.md`); the detection reference and LLM semantic
92
+ pass live in [falsegreen-skill](https://github.com/vinicq/falsegreen-skill).
80
93
 
81
94
  ### Validated
82
95
  - Two real-project passes (bailiff, md-bridge) settled the rules and fixed three
83
96
  false positives: C6 on called boolean predicates, C1 on literal-collection
84
97
  loops, and C7 on `f() is f()` (the lru_cache / singleton identity test).
85
98
 
86
- [Unreleased]: https://github.com/vinicq/falsegreen/compare/v0.1.0...HEAD
99
+ [Unreleased]: https://github.com/vinicq/falsegreen/compare/v0.2.2...HEAD
100
+ [0.2.2]: https://github.com/vinicq/falsegreen/compare/v0.2.1...v0.2.2
101
+ [0.2.1]: https://github.com/vinicq/falsegreen/compare/v0.2.0...v0.2.1
102
+ [0.2.0]: https://github.com/vinicq/falsegreen/compare/v0.1.0...v0.2.0
87
103
  [0.1.0]: https://github.com/vinicq/falsegreen/releases/tag/v0.1.0
@@ -19,49 +19,35 @@ Then branch, change, add a test, and open a pull request.
19
19
 
20
20
  ## How the project is built
21
21
 
22
- Two layers, one repo:
22
+ One module, one job: `src/falsegreen/scanner.py` is a zero-dependency AST pass.
23
+ It parses test files, never imports or runs them. Each pattern is a case code
24
+ (`C1`, `C5`, `C13`, ...). HIGH-confidence codes block a commit; LOW only warn.
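A rough sketch of the HIGH-blocks, LOW-warns split, assuming the gate is expressed as a process exit code (a pre-commit hook blocks the commit when its command exits non-zero). The `Finding` class and `gate` function are hypothetical names for illustration, not the scanner's API, and the confidences attached to the sample codes are placeholders.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Finding:
    code: str        # case code such as "C5"
    confidence: str  # "HIGH" or "LOW"
    line: int


def gate(findings: Iterable[Finding]) -> int:
    """Print every finding; return 1 if any is HIGH, else 0."""
    findings = list(findings)
    for f in findings:
        label = "BLOCK" if f.confidence == "HIGH" else "warn"
        print(f"{label} {f.code} line {f.line}")
    # Only HIGH findings turn the exit code non-zero; LOW findings are printed but pass.
    return 1 if any(f.confidence == "HIGH" for f in findings) else 0


only_low = [Finding("C13", "LOW", 4)]
mixed = [Finding("C5", "HIGH", 2), Finding("C13", "LOW", 4)]
```

Here `gate(only_low)` returns 0 and `gate(mixed)` returns 1, which is the behavior the paragraph describes.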
23
25
 
24
- - **Scanner** (`src/falsegreen/scanner.py`): a zero-dependency AST pass. It parses
25
- test files, it never imports or runs them. Each pattern is a case code
26
- (`C1`, `C5`, `C13`, ...). HIGH-confidence codes block a commit; LOW only warn.
27
- - **Skill** (`skills/falsegreen/`): the Claude Code semantic pass. It bundles a
28
- byte-identical copy of the scanner at `skills/falsegreen/scripts/scan.py`; CI
29
- fails if it drifts from `src/falsegreen/scanner.py`.
30
-
31
- The plain-language rubric is `docs/guide.md`; the detection reference is
32
- `skills/falsegreen/reference.md`.
26
+ The plain-language rubric is `docs/guide.md`. The LLM semantic pass and the
27
+ multi-language detection reference live in
28
+ [falsegreen-skill](https://github.com/vinicq/falsegreen-skill).
33
29
 
34
30
  ## Filing an issue
35
31
 
36
32
  A useful bug report for a false positive includes the smallest test snippet that
37
33
  gets wrongly flagged, the code falsegreen emitted, and what you expected. For a
38
- false negative, show the bad test that slipped through. Use the demo file
39
- `skills/falsegreen/examples/bad_tests_sample.py` as a format reference.
34
+ false negative, show the bad test that slipped through.
40
35
 
41
36
  ## Adding or changing a detection rule
42
37
 
43
- This is the most common contribution. A rule touches up to five places, and the
38
+ This is the most common contribution. A rule touches up to three places, and the
44
39
  pull request needs all that apply:
45
40
 
46
41
  1. **Logic** in `src/falsegreen/scanner.py`. Decide HIGH vs LOW. The rule of
47
42
  thumb: HIGH only if a legitimate test can almost never trigger it, because
48
43
  HIGH blocks commits. When in doubt, ship it LOW.
49
- 2. **Reference** entry in `skills/falsegreen/reference.md` (what it looks like,
50
- why it fools you, confidence, the tool it maps to).
51
- 3. **Guide** entry in `docs/guide.md` if it is a new case, in the same
44
+ 2. **Guide** entry in `docs/guide.md` if it is a new case, in the same
52
45
  real-world-analogy style as the others.
53
- 4. **Tests** in `tests/test_scanner.py`: one test proving the rule fires on the
46
+ 3. **Tests** in `tests/test_scanner.py`: one test proving the rule fires on the
54
47
  bad pattern, and at least one proving it does NOT fire on the legitimate
55
48
  look-alike. The second test matters more than the first.
56
- 5. **Skill prose** in `skills/falsegreen/SKILL.md`, *only if* the change alters a
57
- confidence level, an exemption, a flag, or the operator's mental model. CI
58
- byte-checks `scripts/scan.py` against the scanner, so detector *logic* is
59
- mirrored automatically; the SKILL.md prose and its flag list are NOT, so they
60
- must be kept consistent with `reference.md` and the README CLI section by hand.
61
-
62
- Then run `pytest`, `python -m falsegreen src tests` (must stay clean), and
63
- `diff src/falsegreen/scanner.py skills/falsegreen/scripts/scan.py` (must be
64
- identical, copy the file if you changed the scanner).
49
+
50
+ Then run `pytest` and `python -m falsegreen src tests` (must stay clean).
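The fires / does-not-fire pairing from step 3 might look like the sketch below. To stay self-contained it uses a toy detector, `self_comparisons`, which is a hypothetical stand-in and not the function exposed by `scanner.py`. The look-alike it must stay quiet on is the `f() is f()` identity test that the CHANGELOG records as a former C7 false positive.

```python
import ast


def self_comparisons(source):
    """Return line numbers of asserts that compare an expression with itself."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Assert) and isinstance(node.test, ast.Compare)):
            continue
        compare = node.test
        if len(compare.comparators) != 1:
            continue
        left, right = compare.left, compare.comparators[0]
        if isinstance(left, ast.Call):
            continue  # f() is f() can be a legitimate cache or singleton identity test
        if ast.dump(left) == ast.dump(right):
            hits.append(node.lineno)
    return hits


def test_fires_on_self_comparison():
    src = "def test_x():\n    x = 1\n    assert x == x\n"
    assert self_comparisons(src) == [3]


def test_quiet_on_identity_of_two_calls():
    # The legitimate look-alike: two calls that should return the same object.
    src = "def test_cached():\n    assert get() is get()\n"
    assert self_comparisons(src) == []
```

The second test is the one that protects users from a blocking false positive, which is why the contributing guide weights it above the first.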
65
51
 
66
52
  ### Off-by-default codes
67
53
 
@@ -1,8 +1,9 @@
1
1
  # Credits and academic references
2
2
 
3
3
  falsegreen builds on published research in test smells and rotten green tests. The
4
- work below shaped its concepts, its rule catalog, and the design of its two layers
5
- (deterministic scanner plus an LLM semantic pass). Credit to the authors.
4
+ work below shaped its concepts, its rule catalog, and the design of the deterministic
5
+ scanner. The LLM semantic pass and multi-language support live in
6
+ [falsegreen-skill](https://github.com/vinicq/falsegreen-skill). Credit to the authors.
6
7
 
7
8
  ## Conceptual foundation
8
9
 
@@ -42,8 +43,9 @@ Marcelo d'Amorim, Márcio Ribeiro, Gustavo Soares
42
43
  ([@gustavoasoares](https://github.com/gustavoasoares)), Eduardo Almeida, Elvys
43
44
  Soares ([@elvyssoares](https://github.com/elvyssoares)). SBES 2025. arXiv:2504.07277. Empirical evidence that small local models in
44
45
  agent-based workflows detect and refactor test smells (Phi-4-14B, pass@5 of 75.3%;
45
- six generated pull requests merged into open-source projects). Backs falsegreen's
46
- LLM semantic pass and the AI-applies-the-fix path of the dual-use report.
46
+ six generated pull requests merged into open-source projects). Backs
47
+ [falsegreen-skill](https://github.com/vinicq/falsegreen-skill)'s LLM semantic pass
48
+ and the AI-applies-the-fix path of the dual-use report.
47
49
 
48
50
  **Evaluating LLMs Effectiveness in Detecting and Correcting Test Smells: An
49
51
  Empirical Study.** E. G. Santana Jr., Jander Pereira Santos Junior, Erlon P.
@@ -58,8 +60,9 @@ and its multi-agent verify idea.
58
60
  **Evaluating Large Language Models in Detecting Test Smells.** Keila Lucas, Rohit
59
61
  Gheyi, Elvys Soares, Márcio Ribeiro, Ivan Machado. SBES 2024. arXiv:2407.19261.
60
62
  LLMs detected 21 of 30 test smell types across seven languages (ChatGPT-4 best).
61
- Backs falsegreen's choice to handle cross-language coverage in the language-agnostic
62
- semantic pass rather than in the Python-only scanner.
63
+ Backs [falsegreen-skill](https://github.com/vinicq/falsegreen-skill)'s choice to handle
64
+ cross-language coverage in the language-agnostic semantic pass rather than in the
65
+ Python-only scanner.
63
66
 
64
67
  **Test smells in LLM-Generated Unit Tests.** Wendkûuni C. Ouédraogo, Yinghua Li,
65
68
  Xueqi Dang, Xunzhu Tang, Anil Koyuncu, Jacques Klein, David Lo, Tegawendé F.
@@ -82,7 +85,7 @@ Dalton Nicodemos Jorge ([@daltonjorge](https://github.com/daltonjorge)). PhD the
82
85
  UFCG, 2023. Advisors Patrícia D. L. Machado, Wilkerson L. Andrade. Tool STEEL:
83
86
  <https://github.com/daltonjorge/steel>. Its JavaScript Exception Test smell (a
84
87
  `try/catch` that swallows the thrown error) and assertion-in-`forEach`-over-empty
85
- sharpened the skill's "Frontend cues by language" with two J1 cues for Jest/Vitest.
88
+ sharpened falsegreen-skill's "Frontend cues by language" with two J1 cues for Jest/Vitest.
86
89
 
87
90
  **Detecção de smells em testes automatizados em diferentes linguagens de
88
91
  programação.** Gustavo Augusto Calazans Lopes. TCC, UFAL, 2023. Advisor Márcio de
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: falsegreen
3
- Version: 0.2.0
3
+ Version: 0.2.2
4
4
  Summary: Find unit tests that give false positives: green tests that protect nothing, and tests that pass while asserting the wrong expected value.
5
5
  Project-URL: Homepage, https://github.com/vinicq/falsegreen
6
6
  Project-URL: Issues, https://github.com/vinicq/falsegreen/issues
@@ -39,17 +39,18 @@ each test against more than twenty mechanical smells, the ones a parser can prov
39
39
  an assertion that never runs, a check that is empty or always true, a swallowed
40
40
  exception, a mock of the unit under test, an assertion stranded in dead code, a
41
41
  weak truthiness check, an async test that never awaits. High-confidence findings
42
- block the commit; the rest warn. The Claude Code skill then does the part a parser
43
- cannot: it reads the production code and judges whether each test asserts the
44
- *right* value, measured against the intended behavior rather than the current
45
- (possibly buggy) output.
42
+ block the commit; the rest warn. The semantic layer judging whether each test
43
+ asserts the *right* value against intended behavior lives in
44
+ [falsegreen-skill](https://github.com/vinicq/falsegreen-skill), the companion
45
+ LLM-based tool that covers Python and other languages.
46
46
 
47
47
  The checks are grounded in the rotten-green-test research (Soares 2023; Delplanque
48
48
  et al., ICSE 2019) and cross-walked against the published test-smell catalog. See
49
49
  [CREDITS.md](CREDITS.md).
50
50
 
51
- > Live on PyPI: `pip install falsegreen`. Also a pre-commit hook and a Claude Code
52
- > plugin (see the three install paths below).
51
+ > Live on PyPI: `pip install falsegreen`. Also available as a pre-commit hook
52
+ > (see install paths below). For the LLM semantic pass, see
53
+ > [falsegreen-skill](https://github.com/vinicq/falsegreen-skill).
53
54
 
54
55
  ---
55
56
 
@@ -60,10 +61,8 @@ et al., ICSE 2019) and cross-walked against the published test-smell catalog. Se
60
61
  - [What it validates, how, and why](#what-it-validates-how-and-why)
61
62
  - [The two layers](#the-two-layers)
62
63
  - [Download and use: the three ways](#download-and-use-the-three-ways)
63
- - [1. As a Python package (CLI, no skill needed)](#1-as-a-python-package-cli-no-skill-needed)
64
+ - [1. As a Python package (CLI)](#1-as-a-python-package-cli)
64
65
  - [2. As a pre-commit hook](#2-as-a-pre-commit-hook)
65
- - [3. As a Claude Code skill (the semantic pass)](#3-as-a-claude-code-skill-the-semantic-pass)
66
- - [With the skill vs without the skill](#with-the-skill-vs-without-the-skill)
67
66
  - [Configuration](#configuration)
68
67
  - [Technologies used](#technologies-used)
69
68
  - [How it compares](#how-it-compares)
@@ -131,9 +130,9 @@ positive, and a labeled characterization snapshot is not a frozen bug. That
131
130
  classification step keeps the tool from flagging legitimate styles.
132
131
 
133
132
  The plain-language guide behind every case, with a real-world analogy and a
134
- before/after for each, is in [`docs/guide.md`](docs/guide.md). The detection
135
- reference that maps each code to its scanner code and to established tooling is in
136
- [`skills/falsegreen/reference.md`](skills/falsegreen/reference.md).
133
+ before/after for each, is in [`docs/guide.md`](docs/guide.md). The full detection
134
+ reference (code-to-tooling mapping, J1–J6 judgment index) lives in
135
+ [`falsegreen-skill`](https://github.com/vinicq/falsegreen-skill).
137
136
 
138
137
  The basis is the rotten-green-test research: a passing test that holds an
139
138
  assertion which never runs (Elvys Soares, *A Multimethod Study of Test Smells*,
@@ -147,12 +146,10 @@ and the specific thing falsegreen took from each one, is in [CREDITS.md](CREDITS
147
146
 
148
147
  ## What it validates, how, and why
149
148
 
150
- The catalog has 18 named cases across the five families, and the scanner now ships
151
- 21 codes (the five families are the scanner-facing view; the semantic pass asks the
152
- same questions as six judgments, J1 to J6, which is the LLM-facing view of the same
153
- thing, mapped code by code in [`reference.md`](skills/falsegreen/reference.md)). A
154
- case is caught either by the deterministic **scanner** (a code like `C5`) or only
155
- by the **semantic** pass (it needs to read the production code). HIGH-confidence
149
+ The catalog has 18 named cases across the five families. The scanner ships 21 codes
150
+ covering all mechanically-detectable patterns. Cases that require reading production
151
+ intent (10, 11, 12, 15, 18) are handled by
152
+ [falsegreen-skill](https://github.com/vinicq/falsegreen-skill). HIGH-confidence
156
153
  scanner findings block a commit; LOW ones warn.
157
154
 
158
155
  | # | Case | Why it fools you | Detected by | Conf |
@@ -203,11 +200,9 @@ and stays quiet on them.
203
200
  by structure. A parser sees a mock but cannot tell whether it replaced an edge
204
201
  (network, disk, clock) or the thing under test. It sees an arithmetic expression
205
202
  but cannot tell whether the expected value was derived independently or copied
206
- from the code. The `/falsegreen` skill reads the production code, derives the
207
- intended behavior from the oracle hierarchy, compares it against what the test
208
- asserts, and when they disagree, names which side is wrong. It is told to favor
209
- precision over recall and to ground a verdict in a cited contract line, never in
210
- the code's current output alone.
203
+ from the code. That judgment requires reading the production code against an
204
+ independent oracle, which is what
205
+ [falsegreen-skill](https://github.com/vinicq/falsegreen-skill) does.
211
206
 
212
207
  **Why two confidence levels.** A blocking gate that cries wolf gets disabled. So
213
208
  only near-certain, mechanically-unambiguous patterns are HIGH (they block). The
@@ -235,26 +230,10 @@ different natures.
235
230
  stays-clean regression test, and a re-scan brought the HIGH count to 0 across all
236
231
  8 projects. Each false positive is recorded as it is fixed, with its regression
237
232
  tests, in the commit history and the CHANGELOG.
238
- - **The semantic pass (LLM, any language).** Cross-language coverage runs through
239
- this pass, so its reliability is measured, not assumed. The validation is a
240
- benchmark corpus: tests planted with a known ground truth, a test that mocks the
241
- unit under test, one that copies the expected value from current output, one that
242
- re-implements the production formula, in Python and in other languages, scored for
243
- precision and recall with precision held above recall. Because the pass runs on an
244
- LLM it is non-deterministic, so this is a periodic skill-validation artifact, not
245
- a CI gate. The first labeled corpus has 24 Python cases (10 rotten, 14 sound)
246
- across cases 10, 11, 12, and 18, with sound look-alikes and plain controls. Run
247
- blind on a small model (Claude Haiku), the pass scored precision 1.00 (no false
248
- alarms on the 14 sound tests), recall 0.70, and 1.00 recall on the clear-cut
249
- smells; the only misses were borderline cases (a pure-delegation passthrough, a
250
- trivial one-operator formula) where the precision-first guardrail defers to
251
- "sound". That is the evidence behind the design claim that a small model is
252
- enough for a precision-first semantic pass. The number to grow is recall: a
253
- larger corpus, a second annotator, and multi-vote runs are the next step. A
254
- second corpus of 20 TypeScript cases (Jest/Vitest) reproduced the pattern:
255
- precision 1.00, recall 0.625, with the only misses being the same boundary
256
- cases, evidence that the pass carries across languages and frameworks, not just
257
- Python.
233
+ - **The semantic pass (LLM).** Validation for the LLM-based semantic layer is
234
+ tracked in [falsegreen-skill](https://github.com/vinicq/falsegreen-skill), where
235
+ benchmark corpora for Python and TypeScript are maintained with precision/recall
236
+ measurements.
258
237
 
259
238
  ---
260
239
 
@@ -292,28 +271,19 @@ that maintainability layer well; run them alongside falsegreen.
292
271
 
293
272
  | Layer | What it is | When it runs | Catches |
294
273
  |---|---|---|---|
295
- | **Scanner** | Zero-dependency AST analysis (Python/pytest), one self-contained module | CLI, CI, pre-commit | the mechanical patterns (21 codes) |
296
- | **Semantic pass** | A Claude Code skill (`/falsegreen`) that reads the code | on demand, in Claude Code | the bug-freezing patterns no static tool can see (cases 10/11/12/15/18) |
274
+ | **Scanner** (this repo) | Zero-dependency AST analysis (Python/pytest) | CLI, CI, pre-commit | 21 mechanical codes |
275
+ | **Semantic pass** ([falsegreen-skill](https://github.com/vinicq/falsegreen-skill)) | LLM-based analysis, Python + other languages | on demand | bug-freezing patterns no static tool can see (cases 10/11/12/15/18) |
297
276
 
298
277
  The scanner is the fast, deterministic pre-filter. It overlaps in part with
299
278
  `ruff`'s `PT` rules and with research tools like PyNose, and that overlap is fine:
300
- run them together. The semantic pass is the part nobody else ships, and it is the
301
- reason the project exists.
302
-
303
- The semantic pass runs on whatever Claude model your Claude Code session uses. It
304
- is not pinned to one model, and it does not need a frontier one: the research it
305
- draws on (Agentic LMs, SBES 2025; Santana Jr. et al., 2025) shows that small,
306
- locally-runnable models detect and refactor these patterns well. The value is in
307
- the protocol, not in any single model.
279
+ run them together. For the semantic layer, and for TypeScript, JavaScript, Java,
280
+ and other languages, use [falsegreen-skill](https://github.com/vinicq/falsegreen-skill).
308
281
 
309
282
  ---
310
283
 
311
- ## Download and use: the three ways
312
-
313
- Pick one or combine them. The CLI and pre-commit need no Claude Code; the skill
314
- adds the semantic pass on top.
284
+ ## Download and use
315
285
 
316
- ### 1. As a Python package (CLI, no skill needed)
286
+ ### 1. As a Python package (CLI)
317
287
 
318
288
  Install from PyPI:
319
289
 
@@ -343,12 +313,6 @@ code scanning / PR annotations; `--format junit` emits JUnit XML (HIGH ->
343
313
  finding. Wire those into any CI step. No third-party runtime dependencies; Python
344
314
  3.8+.
345
315
 
346
- Try it on the bundled demo (one bad test per case):
347
-
348
- ```bash
349
- pipx run falsegreen skills/falsegreen/examples/bad_tests_sample.py
350
- ```
351
-
352
316
  ### 2. As a pre-commit hook
353
317
 
354
318
  This is the standard, version-pinned way to gate every commit. Add to your
@@ -373,42 +337,13 @@ python -m falsegreen.hook_install --repo . # install
373
337
  python -m falsegreen.hook_install --uninstall # remove
374
338
  ```
375
339
 
376
- ### 3. As a Claude Code skill (the semantic pass)
340
+ ### 3. With the semantic pass (multi-language)
377
341
 
378
- Install the plugin:
379
-
380
- ```
381
- /plugin marketplace add vinicq/falsegreen
382
- ```
383
-
384
- Then, in a Claude Code session, run:
385
-
386
- ```
387
- /falsegreen
388
- ```
389
-
390
- against a diff or a module. The skill triages the scanner output first, then does
391
- the semantic work: for each test it finds the unit under test, derives the
392
- intended behavior from the oracle hierarchy, and reports tests that pass while
393
- asserting the wrong thing, with the cited evidence and a concrete fix. It is
394
- read-only by default (it proposes fixes, it does not edit your tests unless you
395
- ask).
396
-
397
- The scanner is bundled inside the skill, so the plugin works on its own. On
398
- another Agent Skills client that does not define `${CLAUDE_SKILL_DIR}`, install
399
- the package (`pip install falsegreen`) and the skill falls back to the CLI.
400
-
401
- ### With the skill vs without the skill
402
-
403
- - **Without the skill** (CLI / pre-commit / CI): you get the deterministic
404
- scanner. It catches the 16 mechanical codes and blocks commits on the
405
- high-confidence ones. This is everything a non-Claude-Code user needs and runs
406
- anywhere Python runs.
407
- - **With the skill** (`/falsegreen` in Claude Code): you additionally get the
408
- semantic pass, which catches the five code-aware cases (10, 11, 12, 15, 18),
409
- including the headline one: a test that is green while its expected value
410
- contradicts the spec. No static tool, this one included, can find that on its
411
- own.
342
+ For cases that require reading production intent — mocking the unit under test,
343
+ copying expected from current output, re-implementing the formula — use
344
+ [falsegreen-skill](https://github.com/vinicq/falsegreen-skill). It covers Python,
345
+ TypeScript, JavaScript, Java, and other languages via an LLM-based analysis using
346
+ the same case catalog.
412
347
 
413
348
  ---
414
349
 
@@ -468,12 +403,9 @@ override.
468
403
  - **Packaging:** `hatchling` build backend, SPDX license metadata (PEP 639),
469
404
  console entry point, distributed on PyPI.
470
405
  - **Distribution:** a [pre-commit](https://pre-commit.com) hook
471
- (`.pre-commit-hooks.yaml`) and a Claude Code plugin following the
472
- [Agent Skills](https://agentskills.io) open standard (`SKILL.md` plus a
473
- `.claude-plugin/` marketplace manifest).
406
+ (`.pre-commit-hooks.yaml`), distributed on PyPI.
474
407
  - **CI:** GitHub Actions across Python 3.8 / 3.11 / 3.13, running `ruff`,
475
- `pytest`, a self-scan (the tool must stay clean on its own code), and a
476
- drift-check that the bundled scanner copy matches the package byte for byte.
408
+ `pytest`, and a self-scan (the tool must stay clean on its own code).
477
409
 
478
410
  ---
479
411
 
@@ -481,17 +413,20 @@ override.
481
413
 
482
414
  - **ruff / flake8-pytest-style** - mature, fast lint rules. Overlaps on broad
483
415
  `raises` (PT011) and assert-in-except (PT017). Run both. falsegreen adds
484
- uncollected tests, always-true asserts, self-comparison, mock typos, and the
485
- semantic pass.
416
+ uncollected tests, always-true asserts, self-comparison, and mock typos.
486
417
  - **PyNose / pytest-smell / TEMPY** - test-smell catalogs from research. Broader
487
418
  taxonomy, but no commit gate and no oracle-correctness check.
488
419
  - **mutmut / cosmic-ray** - mutation testing, the most honest measure of whether a
489
420
  green suite fails when the code is wrong. Complementary and heavier. falsegreen
490
421
  is the cheap pre-filter you run on every commit; mutation testing is the deep
491
422
  audit you run on the suites that matter.
423
+ - **[falsegreen-skill](https://github.com/vinicq/falsegreen-skill)** - the LLM
424
+ companion for the semantic pass (cases 10/11/12/15/18) and for TypeScript,
425
+ JavaScript, Java, and other languages.
492
426
 
493
- The defensible gap: nobody else combines a deterministic commit gate with a
494
- code-as-evidence semantic pass aimed at oracle correctness (cases 12 and 18).
427
+ The defensible gap: a deterministic commit gate that catches the mechanical
428
+ false-positive patterns with zero runtime dependencies, paired with an LLM
429
+ semantic layer that catches the oracle-correctness cases no static tool can see.
495
430
 
496
431
  ---
497
432
 
@@ -499,20 +434,17 @@ code-as-evidence semantic pass aimed at oracle correctness (cases 12 and 18).
499
434
 
500
435
  ```
501
436
  falsegreen/
502
- src/falsegreen/scanner.py the deterministic scanner (canonical)
437
+ src/falsegreen/scanner.py the deterministic scanner
503
438
  src/falsegreen/hook_install.py raw git-hook installer
504
- skills/falsegreen/
505
- SKILL.md the semantic-pass protocol
506
- reference.md the 18-case detection rubric
507
- scripts/scan.py bundled scanner (kept identical to the package)
508
- examples/bad_tests_sample.py one bad test per case (demo + regression)
509
439
  docs/guide.md plain-language guide to every case
510
440
  tests/test_scanner.py the scanner's own tests
511
441
  .pre-commit-hooks.yaml pre-commit integration
512
- .claude-plugin/ plugin + marketplace manifests
513
442
  pyproject.toml packaging
514
443
  ```
515
444
 
445
+ The LLM skill, the semantic-pass protocol, and the multi-language case reference
446
+ live in [falsegreen-skill](https://github.com/vinicq/falsegreen-skill).
447
+
516
448
  ---
517
449
 
518
450
  ## Contributing, security, license