falsegreen 0.2.0__tar.gz → 0.2.2__tar.gz
- {falsegreen-0.2.0 → falsegreen-0.2.2}/.github/workflows/ci.yml +2 -6
- {falsegreen-0.2.0 → falsegreen-0.2.2}/.github/workflows/release-drafter.yml +1 -1
- {falsegreen-0.2.0 → falsegreen-0.2.2}/.github/workflows/release.yml +2 -2
- {falsegreen-0.2.0 → falsegreen-0.2.2}/CHANGELOG.md +22 -6
- {falsegreen-0.2.0 → falsegreen-0.2.2}/CONTRIBUTING.md +12 -26
- {falsegreen-0.2.0 → falsegreen-0.2.2}/CREDITS.md +10 -7
- {falsegreen-0.2.0 → falsegreen-0.2.2}/PKG-INFO +48 -116
- {falsegreen-0.2.0 → falsegreen-0.2.2}/README.md +47 -115
- {falsegreen-0.2.0 → falsegreen-0.2.2}/pyproject.toml +1 -1
- {falsegreen-0.2.0 → falsegreen-0.2.2}/src/falsegreen/__init__.py +1 -1
- {falsegreen-0.2.0 → falsegreen-0.2.2}/src/falsegreen/scanner.py +2 -2
- {falsegreen-0.2.0 → falsegreen-0.2.2}/tests/test_scanner.py +13 -0
- falsegreen-0.2.0/.claude-plugin/marketplace.json +0 -20
- falsegreen-0.2.0/.claude-plugin/plugin.json +0 -11
- falsegreen-0.2.0/skills/falsegreen/README.md +0 -22
- falsegreen-0.2.0/skills/falsegreen/SKILL.md +0 -278
- falsegreen-0.2.0/skills/falsegreen/examples/bad_tests_sample.py +0 -96
- falsegreen-0.2.0/skills/falsegreen/reference.md +0 -365
- falsegreen-0.2.0/skills/falsegreen/scripts/scan.py +0 -1625
- {falsegreen-0.2.0 → falsegreen-0.2.2}/.gitattributes +0 -0
- {falsegreen-0.2.0 → falsegreen-0.2.2}/.github/CODEOWNERS +0 -0
- {falsegreen-0.2.0 → falsegreen-0.2.2}/.github/ISSUE_TEMPLATE/bug_report.md +0 -0
- {falsegreen-0.2.0 → falsegreen-0.2.2}/.github/ISSUE_TEMPLATE/config.yml +0 -0
- {falsegreen-0.2.0 → falsegreen-0.2.2}/.github/ISSUE_TEMPLATE/feature_request.md +0 -0
- {falsegreen-0.2.0 → falsegreen-0.2.2}/.github/dependabot.yml +0 -0
- {falsegreen-0.2.0 → falsegreen-0.2.2}/.github/pull_request_template.md +0 -0
- {falsegreen-0.2.0 → falsegreen-0.2.2}/.github/release-drafter.yml +0 -0
- {falsegreen-0.2.0 → falsegreen-0.2.2}/.gitignore +0 -0
- {falsegreen-0.2.0 → falsegreen-0.2.2}/.pre-commit-hooks.yaml +0 -0
- {falsegreen-0.2.0 → falsegreen-0.2.2}/CODE_OF_CONDUCT.md +0 -0
- {falsegreen-0.2.0 → falsegreen-0.2.2}/LICENSE +0 -0
- {falsegreen-0.2.0 → falsegreen-0.2.2}/RELEASE.md +0 -0
- {falsegreen-0.2.0 → falsegreen-0.2.2}/SECURITY.md +0 -0
- {falsegreen-0.2.0 → falsegreen-0.2.2}/docs/guide.md +0 -0
- {falsegreen-0.2.0 → falsegreen-0.2.2}/requirements-dev.txt +0 -0
- {falsegreen-0.2.0 → falsegreen-0.2.2}/src/falsegreen/__main__.py +0 -0
- {falsegreen-0.2.0 → falsegreen-0.2.2}/src/falsegreen/hook_install.py +0 -0
|
@@ -24,9 +24,5 @@ jobs:
  run: ruff check src tests
  - name: Test
  run: pytest -q
- - name:
- run:
- - name: Self-scan (must flag the demo, must not flag itself)
- run: |
- python -m falsegreen skills/falsegreen/examples/bad_tests_sample.py || true
- python -m falsegreen src tests
+ - name: Self-scan (must not flag itself)
+ run: python -m falsegreen src tests
@@ -31,7 +31,7 @@ jobs:
  python -m build
  python -m twine check dist/*
  - name: Upload dist artifact
- uses: actions/upload-artifact@
+ uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
  with:
  name: dist
  path: dist/
@@ -45,7 +45,7 @@ jobs:
  id-token: write # OIDC: the only credential the publish step needs
  steps:
  - name: Download dist artifact
- uses: actions/download-artifact@
+ uses: actions/download-artifact@3e5f45b2cfb9172054b4087a40e8e0b5a5461e7c # v8.0.1
  with:
  name: dist
  path: dist/
@@ -6,6 +6,21 @@ to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

  ## [Unreleased]

+ ## [0.2.2] - 2026-06-08
+
+ ### Changed
+ - Skill and Claude plugin removed from this repo — the LLM semantic pass, the
+ detection reference, and multi-language support now live in
+ [falsegreen-skill](https://github.com/vinicq/falsegreen-skill).
+ - README, CONTRIBUTING, and CREDITS updated to reflect the split.
+
+ ## [0.2.1] - 2026-06-08
+
+ ### Fixed
+ - C2 (HIGH) no longer flags an empty body under sympy's `@SKIP` decorator
+ (`from sympy.testing.pytest import SKIP`), which raises `Skipped` at runtime —
+ same semantics as `@pytest.mark.skip`. Found validating sympy.
+
  ## [0.2.0] - 2026-06-05

  ### Fixed
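The 0.2.1 entry above can be illustrated with a small sketch. This is not falsegreen's scanner code; it assumes that a C2-style rule means "test function with an empty body" and shows why a skip decorator would exempt such a test. The helper names (`decorator_name`, `is_empty_body`, `empty_unskipped_tests`), the `SKIP_DECORATORS` set, and the sample source are all hypothetical. The sample is only parsed with `ast`, never imported, so neither sympy nor pytest needs to be installed.

```python
import ast

# Hypothetical sample source: parsed only, never imported or executed.
SAMPLE = '''
from sympy.testing.pytest import SKIP
import pytest

@SKIP("not implemented yet")
def test_skipped_by_sympy():
    pass

@pytest.mark.skip(reason="later")
def test_skipped_by_pytest():
    pass

def test_empty_and_unmarked():
    pass
'''

# Assumed exemption list for this sketch, not the scanner's real one.
SKIP_DECORATORS = {"SKIP", "pytest.mark.skip"}


def decorator_name(node):
    """Return a dotted name for a decorator, unwrapping a call like @SKIP("x")."""
    if isinstance(node, ast.Call):
        node = node.func
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
    return ".".join(reversed(parts))


def is_empty_body(func):
    """True when the body holds only `pass`, `...`, or a docstring."""
    for stmt in func.body:
        if isinstance(stmt, ast.Pass):
            continue
        if isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant):
            continue
        return False
    return True


def empty_unskipped_tests(source):
    """List test functions that have an empty body and no skip decorator."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test"):
            skipped = any(
                decorator_name(d) in SKIP_DECORATORS for d in node.decorator_list
            )
            if is_empty_body(node) and not skipped:
                flagged.append(node.name)
    return flagged


print(empty_unskipped_tests(SAMPLE))
```

Only the unmarked empty test is reported; the two skipped ones are left alone because their bodies never run as assertions anyway.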
@@ -72,16 +87,17 @@ First release.
  - C20 (HIGH): assertion in dead code after `return`/`raise`/`fail()`. C21 (LOW):
  every assertion conditional, none runs unconditionally. Both from the rotten-
  green-test line of work (Soares 2023).
- -
-
-
- - Distribution as a pip package, a `pre-commit` hook, and a Claude plugin.
- - Plain-language guide (`docs/guide.md`), detection reference, and a demo file.
+ - Distribution as a pip package and a `pre-commit` hook.
+ - Plain-language guide (`docs/guide.md`); the detection reference and LLM semantic
+ pass live in [falsegreen-skill](https://github.com/vinicq/falsegreen-skill).

  ### Validated
  - Two real-project passes (bailiff, md-bridge) settled the rules and fixed three
  false positives: C6 on called boolean predicates, C1 on literal-collection
  loops, and C7 on `f() is f()` (the lru_cache / singleton identity test).

- [Unreleased]: https://github.com/vinicq/falsegreen/compare/v0.
+ [Unreleased]: https://github.com/vinicq/falsegreen/compare/v0.2.2...HEAD
+ [0.2.2]: https://github.com/vinicq/falsegreen/compare/v0.2.1...v0.2.2
+ [0.2.1]: https://github.com/vinicq/falsegreen/compare/v0.2.0...v0.2.1
+ [0.2.0]: https://github.com/vinicq/falsegreen/compare/v0.1.0...v0.2.0
  [0.1.0]: https://github.com/vinicq/falsegreen/releases/tag/v0.1.0
@@ -19,49 +19,35 @@ Then branch, change, add a test, and open a pull request.

  ## How the project is built

-
+ One module, one job: `src/falsegreen/scanner.py` is a zero-dependency AST pass.
+ It parses test files, never imports or runs them. Each pattern is a case code
+ (`C1`, `C5`, `C13`, ...). HIGH-confidence codes block a commit; LOW only warn.

- -
-
-
- - **Skill** (`skills/falsegreen/`): the Claude Code semantic pass. It bundles a
- byte-identical copy of the scanner at `skills/falsegreen/scripts/scan.py`; CI
- fails if it drifts from `src/falsegreen/scanner.py`.
-
- The plain-language rubric is `docs/guide.md`; the detection reference is
- `skills/falsegreen/reference.md`.
+ The plain-language rubric is `docs/guide.md`. The LLM semantic pass and the
+ multi-language detection reference live in
+ [falsegreen-skill](https://github.com/vinicq/falsegreen-skill).

  ## Filing an issue

  A useful bug report for a false positive includes the smallest test snippet that
  gets wrongly flagged, the code falsegreen emitted, and what you expected. For a
- false negative, show the bad test that slipped through.
- `skills/falsegreen/examples/bad_tests_sample.py` as a format reference.
+ false negative, show the bad test that slipped through.

  ## Adding or changing a detection rule

- This is the most common contribution. A rule touches up to
+ This is the most common contribution. A rule touches up to three places, and the
  pull request needs all that apply:

  1. **Logic** in `src/falsegreen/scanner.py`. Decide HIGH vs LOW. The rule of
  thumb: HIGH only if a legitimate test can almost never trigger it, because
  HIGH blocks commits. When in doubt, ship it LOW.
- 2. **
- why it fools you, confidence, the tool it maps to).
- 3. **Guide** entry in `docs/guide.md` if it is a new case, in the same
+ 2. **Guide** entry in `docs/guide.md` if it is a new case, in the same
  real-world-analogy style as the others.
-
+ 3. **Tests** in `tests/test_scanner.py`: one test proving the rule fires on the
  bad pattern, and at least one proving it does NOT fire on the legitimate
  look-alike. The second test matters more than the first.
-
-
- byte-checks `scripts/scan.py` against the scanner, so detector *logic* is
- mirrored automatically; the SKILL.md prose and its flag list are NOT, so they
- must be kept consistent with `reference.md` and the README CLI section by hand.
-
- Then run `pytest`, `python -m falsegreen src tests` (must stay clean), and
- `diff src/falsegreen/scanner.py skills/falsegreen/scripts/scan.py` (must be
- identical, copy the file if you changed the scanner).
+
+ Then run `pytest` and `python -m falsegreen src tests` (must stay clean).

  ### Off-by-default codes

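The "fires on the bad pattern, stays quiet on the look-alike" pairing described in the CONTRIBUTING hunk above could look roughly like the sketch below. It is an illustration under stated assumptions, not code from `tests/test_scanner.py`: `asserts_after_exit` is a hypothetical toy rule in the spirit of C20 (assertion in dead code after `return`/`raise`), and it only inspects top-level statements of a test function. Like the scanner as described, it parses source with `ast` and never runs it.

```python
import ast


def asserts_after_exit(source):
    """Toy rule (hypothetical): line numbers of `assert` statements that come
    after a top-level `return` or `raise` in a test function."""
    findings = []
    for func in ast.walk(ast.parse(source)):
        if not (isinstance(func, ast.FunctionDef) and func.name.startswith("test")):
            continue
        exited = False
        for stmt in func.body:  # top-level statements only, for brevity
            if exited and isinstance(stmt, ast.Assert):
                findings.append(stmt.lineno)
            if isinstance(stmt, (ast.Return, ast.Raise)):
                exited = True
    return findings


# Bad pattern: the assertion can never execute, so the test is always green.
BAD = """
def test_total():
    result = 2 + 2
    return
    assert result == 5
"""

# Legitimate look-alike: the return is guarded, the assertion still runs.
LOOKALIKE = """
def test_total():
    result = 2 + 2
    if result is None:
        return
    assert result == 4
"""


def test_rule_fires_on_dead_assert():
    assert asserts_after_exit(BAD) == [5]


def test_rule_stays_quiet_on_guarded_return():
    assert asserts_after_exit(LOOKALIKE) == []
```

The second test is the one that guards against false positives: without it, a rule that flagged every `return` near an `assert` would still pass its own suite.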
@@ -1,8 +1,9 @@
  # Credits and academic references

  falsegreen builds on published research in test smells and rotten green tests. The
- work below shaped its concepts, its rule catalog, and the design of
-
+ work below shaped its concepts, its rule catalog, and the design of the deterministic
+ scanner. The LLM semantic pass and multi-language support live in
+ [falsegreen-skill](https://github.com/vinicq/falsegreen-skill). Credit to the authors.

  ## Conceptual foundation

@@ -42,8 +43,9 @@ Marcelo d'Amorim, Márcio Ribeiro, Gustavo Soares
  ([@gustavoasoares](https://github.com/gustavoasoares)), Eduardo Almeida, Elvys
  Soares ([@elvyssoares](https://github.com/elvyssoares)). SBES 2025. arXiv:2504.07277. Empirical evidence that small local models in
  agent-based workflows detect and refactor test smells (Phi-4-14B, pass@5 of 75.3%;
- six generated pull requests merged into open-source projects). Backs
- LLM semantic pass
+ six generated pull requests merged into open-source projects). Backs
+ [falsegreen-skill](https://github.com/vinicq/falsegreen-skill)'s LLM semantic pass
+ and the AI-applies-the-fix path of the dual-use report.

  **Evaluating LLMs Effectiveness in Detecting and Correcting Test Smells: An
  Empirical Study.** E. G. Santana Jr., Jander Pereira Santos Junior, Erlon P.
@@ -58,8 +60,9 @@ and its multi-agent verify idea.
  **Evaluating Large Language Models in Detecting Test Smells.** Keila Lucas, Rohit
  Gheyi, Elvys Soares, Márcio Ribeiro, Ivan Machado. SBES 2024. arXiv:2407.19261.
  LLMs detected 21 of 30 test smell types across seven languages (ChatGPT-4 best).
- Backs falsegreen's choice to handle
- semantic pass rather than in the
+ Backs [falsegreen-skill](https://github.com/vinicq/falsegreen-skill)'s choice to handle
+ cross-language coverage in the language-agnostic semantic pass rather than in the
+ Python-only scanner.

  **Test smells in LLM-Generated Unit Tests.** Wendkûuni C. Ouédraogo, Yinghua Li,
  Xueqi Dang, Xunzhu Tang, Anil Koyuncu, Jacques Klein, David Lo, Tegawendé F.
@@ -82,7 +85,7 @@ Dalton Nicodemos Jorge ([@daltonjorge](https://github.com/daltonjorge)). PhD the
  UFCG, 2023. Advisors Patrícia D. L. Machado, Wilkerson L. Andrade. Tool STEEL:
  <https://github.com/daltonjorge/steel>. Its JavaScript Exception Test smell (a
  `try/catch` that swallows the thrown error) and assertion-in-`forEach`-over-empty
- sharpened
+ sharpened falsegreen-skill's "Frontend cues by language" with two J1 cues for Jest/Vitest.

  **Detecção de smells em testes automatizados em diferentes linguagens de
  programação.** Gustavo Augusto Calazans Lopes. TCC, UFAL, 2023. Advisor Márcio de
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: falsegreen
- Version: 0.2.
+ Version: 0.2.2
  Summary: Find unit tests that give false positives: green tests that protect nothing, and tests that pass while asserting the wrong expected value.
  Project-URL: Homepage, https://github.com/vinicq/falsegreen
  Project-URL: Issues, https://github.com/vinicq/falsegreen/issues
@@ -39,17 +39,18 @@ each test against more than twenty mechanical smells, the ones a parser can prov
  an assertion that never runs, a check that is empty or always true, a swallowed
  exception, a mock of the unit under test, an assertion stranded in dead code, a
  weak truthiness check, an async test that never awaits. High-confidence findings
- block the commit; the rest warn. The
-
-
-
+ block the commit; the rest warn. The semantic layer — judging whether each test
+ asserts the *right* value against intended behavior — lives in
+ [falsegreen-skill](https://github.com/vinicq/falsegreen-skill), the companion
+ LLM-based tool that covers Python and other languages.

  The checks are grounded in the rotten-green-test research (Soares 2023; Delplanque
  et al., ICSE 2019) and cross-walked against the published test-smell catalog. See
  [CREDITS.md](CREDITS.md).

- > Live on PyPI: `pip install falsegreen`. Also a pre-commit hook
- >
+ > Live on PyPI: `pip install falsegreen`. Also available as a pre-commit hook
+ > (see install paths below). For the LLM semantic pass, see
+ > [falsegreen-skill](https://github.com/vinicq/falsegreen-skill).

  ---

@@ -60,10 +61,8 @@ et al., ICSE 2019) and cross-walked against the published test-smell catalog. Se
  - [What it validates, how, and why](#what-it-validates-how-and-why)
  - [The two layers](#the-two-layers)
  - [Download and use: the three ways](#download-and-use-the-three-ways)
- - [1. As a Python package (CLI
+ - [1. As a Python package (CLI)](#1-as-a-python-package-cli)
  - [2. As a pre-commit hook](#2-as-a-pre-commit-hook)
- - [3. As a Claude Code skill (the semantic pass)](#3-as-a-claude-code-skill-the-semantic-pass)
- - [With the skill vs without the skill](#with-the-skill-vs-without-the-skill)
  - [Configuration](#configuration)
  - [Technologies used](#technologies-used)
  - [How it compares](#how-it-compares)
@@ -131,9 +130,9 @@ positive, and a labeled characterization snapshot is not a frozen bug. That
  classification step keeps the tool from flagging legitimate styles.

  The plain-language guide behind every case, with a real-world analogy and a
- before/after for each, is in [`docs/guide.md`](docs/guide.md). The detection
- reference
- [`
+ before/after for each, is in [`docs/guide.md`](docs/guide.md). The full detection
+ reference (code-to-tooling mapping, J1–J6 judgment index) lives in
+ [`falsegreen-skill`](https://github.com/vinicq/falsegreen-skill).

  The basis is the rotten-green-test research: a passing test that holds an
  assertion which never runs (Elvys Soares, *A Multimethod Study of Test Smells*,
@@ -147,12 +146,10 @@ and the specific thing falsegreen took from each one, is in [CREDITS.md](CREDITS

  ## What it validates, how, and why

- The catalog has 18 named cases across the five families
-
-
-
- case is caught either by the deterministic **scanner** (a code like `C5`) or only
- by the **semantic** pass (it needs to read the production code). HIGH-confidence
+ The catalog has 18 named cases across the five families. The scanner ships 21 codes
+ covering all mechanically-detectable patterns. Cases that require reading production
+ intent (10, 11, 12, 15, 18) are handled by
+ [falsegreen-skill](https://github.com/vinicq/falsegreen-skill). HIGH-confidence
  scanner findings block a commit; LOW ones warn.

  | # | Case | Why it fools you | Detected by | Conf |
@@ -203,11 +200,9 @@ and stays quiet on them.
  by structure. A parser sees a mock but cannot tell whether it replaced an edge
  (network, disk, clock) or the thing under test. It sees an arithmetic expression
  but cannot tell whether the expected value was derived independently or copied
- from the code.
-
-
- precision over recall and to ground a verdict in a cited contract line, never in
- the code's current output alone.
+ from the code. That judgment requires reading the production code against an
+ independent oracle — that is what
+ [falsegreen-skill](https://github.com/vinicq/falsegreen-skill) does.

  **Why two confidence levels.** A blocking gate that cries wolf gets disabled. So
  only near-certain, mechanically-unambiguous patterns are HIGH (they block). The
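The two-level gate described in the README hunk above (HIGH blocks, LOW warns) can be sketched as follows. This is a minimal illustration, not falsegreen's implementation: the `Finding` dataclass, the `gate` function, the `[BLOCK]`/`[warn]` labels, and the 0/1 exit statuses are assumptions made for the example. The C20 = HIGH and C21 = LOW pairings are taken from the CHANGELOG text in this diff.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    code: str        # case code, e.g. "C20"
    confidence: str  # "HIGH" or "LOW"
    message: str


def gate(findings: List[Finding]) -> int:
    """Print every finding and return an exit status (assumed convention):
    1 if at least one HIGH finding is present, otherwise 0."""
    for f in findings:
        label = "BLOCK" if f.confidence == "HIGH" else "warn"
        print(f"[{label}] {f.code}: {f.message}")
    return 1 if any(f.confidence == "HIGH" for f in findings) else 0


low_only = [Finding("C21", "LOW", "every assertion is conditional")]
with_high = low_only + [Finding("C20", "HIGH", "assertion in dead code after return")]

print(gate(low_only))   # LOW findings are reported but do not fail the commit
print(gate(with_high))  # a single HIGH finding is enough to fail it
```

Keeping the blocking decision in one small function makes the policy easy to audit: demoting a noisy rule from HIGH to LOW changes what blocks without touching the detector itself.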
@@ -235,26 +230,10 @@ different natures.
  stays-clean regression test, and a re-scan brought the HIGH count to 0 across all
  8 projects. Each false positive is recorded as it is fixed, with its regression
  tests, in the commit history and the CHANGELOG.
- - **The semantic pass (LLM
-
- benchmark
-
- re-implements the production formula, in Python and in other languages, scored for
- precision and recall with precision held above recall. Because the pass runs on an
- LLM it is non-deterministic, so this is a periodic skill-validation artifact, not
- a CI gate. The first labeled corpus has 24 Python cases (10 rotten, 14 sound)
- across cases 10, 11, 12, and 18, with sound look-alikes and plain controls. Run
- blind on a small model (Claude Haiku), the pass scored precision 1.00 (no false
- alarms on the 14 sound tests), recall 0.70, and 1.00 recall on the clear-cut
- smells; the only misses were borderline cases (a pure-delegation passthrough, a
- trivial one-operator formula) where the precision-first guardrail defers to
- "sound". That is the evidence behind the design claim that a small model is
- enough for a precision-first semantic pass. The number to grow is recall: a
- larger corpus, a second annotator, and multi-vote runs are the next step. A
- second corpus of 20 TypeScript cases (Jest/Vitest) reproduced the pattern:
- precision 1.00, recall 0.625, with the only misses being the same boundary
- cases, evidence that the pass carries across languages and frameworks, not just
- Python.
+ - **The semantic pass (LLM).** Validation for the LLM-based semantic layer is
+ tracked in [falsegreen-skill](https://github.com/vinicq/falsegreen-skill), where
+ benchmark corpora for Python and TypeScript are maintained with precision/recall
+ measurements.

  ---

@@ -292,28 +271,19 @@ that maintainability layer well; run them alongside falsegreen.

  | Layer | What it is | When it runs | Catches |
  |---|---|---|---|
- | **Scanner** | Zero-dependency AST analysis (Python/pytest)
- | **Semantic pass**
+ | **Scanner** (this repo) | Zero-dependency AST analysis (Python/pytest) | CLI, CI, pre-commit | 21 mechanical codes |
+ | **Semantic pass** ([falsegreen-skill](https://github.com/vinicq/falsegreen-skill)) | LLM-based analysis, Python + other languages | on demand | bug-freezing patterns no static tool can see (cases 10/11/12/15/18) |

  The scanner is the fast, deterministic pre-filter. It overlaps in part with
  `ruff`'s `PT` rules and with research tools like PyNose, and that overlap is fine:
- run them together.
-
-
- The semantic pass runs on whatever Claude model your Claude Code session uses. It
- is not pinned to one model, and it does not need a frontier one: the research it
- draws on (Agentic LMs, SBES 2025; Santana Jr. et al., 2025) shows that small,
- locally-runnable models detect and refactor these patterns well. The value is in
- the protocol, not in any single model.
+ run them together. For the semantic layer — and for TypeScript, JavaScript, Java,
+ and other languages — use [falsegreen-skill](https://github.com/vinicq/falsegreen-skill).

  ---

- ## Download and use
-
- Pick one or combine them. The CLI and pre-commit need no Claude Code; the skill
- adds the semantic pass on top.
+ ## Download and use

- ### 1. As a Python package (CLI
+ ### 1. As a Python package (CLI)

  Install from PyPI:

@@ -343,12 +313,6 @@ code scanning / PR annotations; `--format junit` emits JUnit XML (HIGH ->
  finding. Wire those into any CI step. No third-party runtime dependencies; Python
  3.8+.

- Try it on the bundled demo (one bad test per case):
-
- ```bash
- pipx run falsegreen skills/falsegreen/examples/bad_tests_sample.py
- ```
-
  ### 2. As a pre-commit hook

  This is the standard, version-pinned way to gate every commit. Add to your
@@ -373,42 +337,13 @@ python -m falsegreen.hook_install --repo . # install
  python -m falsegreen.hook_install --uninstall # remove
  ```

- ### 3.
+ ### 3. With the semantic pass (multi-language)

-
-
-
-
-
-
- Then, in a Claude Code session, run:
-
- ```
- /falsegreen
- ```
-
- against a diff or a module. The skill triages the scanner output first, then does
- the semantic work: for each test it finds the unit under test, derives the
- intended behavior from the oracle hierarchy, and reports tests that pass while
- asserting the wrong thing, with the cited evidence and a concrete fix. It is
- read-only by default (it proposes fixes, it does not edit your tests unless you
- ask).
-
- The scanner is bundled inside the skill, so the plugin works on its own. On
- another Agent Skills client that does not define `${CLAUDE_SKILL_DIR}`, install
- the package (`pip install falsegreen`) and the skill falls back to the CLI.
-
- ### With the skill vs without the skill
-
- - **Without the skill** (CLI / pre-commit / CI): you get the deterministic
- scanner. It catches the 16 mechanical codes and blocks commits on the
- high-confidence ones. This is everything a non-Claude-Code user needs and runs
- anywhere Python runs.
- - **With the skill** (`/falsegreen` in Claude Code): you additionally get the
- semantic pass, which catches the five code-aware cases (10, 11, 12, 15, 18),
- including the headline one: a test that is green while its expected value
- contradicts the spec. No static tool, this one included, can find that on its
- own.
+ For cases that require reading production intent — mocking the unit under test,
+ copying expected from current output, re-implementing the formula — use
+ [falsegreen-skill](https://github.com/vinicq/falsegreen-skill). It covers Python,
+ TypeScript, JavaScript, Java, and other languages via an LLM-based analysis using
+ the same case catalog.

  ---

@@ -468,12 +403,9 @@ override.
  - **Packaging:** `hatchling` build backend, SPDX license metadata (PEP 639),
  console entry point, distributed on PyPI.
  - **Distribution:** a [pre-commit](https://pre-commit.com) hook
- (`.pre-commit-hooks.yaml`)
- [Agent Skills](https://agentskills.io) open standard (`SKILL.md` plus a
- `.claude-plugin/` marketplace manifest).
+ (`.pre-commit-hooks.yaml`), distributed on PyPI.
  - **CI:** GitHub Actions across Python 3.8 / 3.11 / 3.13, running `ruff`,
- `pytest`, a self-scan (the tool must stay clean on its own code)
- drift-check that the bundled scanner copy matches the package byte for byte.
+ `pytest`, and a self-scan (the tool must stay clean on its own code).

  ---

@@ -481,17 +413,20 @@ override.

  - **ruff / flake8-pytest-style** - mature, fast lint rules. Overlaps on broad
  `raises` (PT011) and assert-in-except (PT017). Run both. falsegreen adds
- uncollected tests, always-true asserts, self-comparison, mock typos
- semantic pass.
+ uncollected tests, always-true asserts, self-comparison, and mock typos.
  - **PyNose / pytest-smell / TEMPY** - test-smell catalogs from research. Broader
  taxonomy, but no commit gate and no oracle-correctness check.
  - **mutmut / cosmic-ray** - mutation testing, the most honest measure of whether a
  green suite fails when the code is wrong. Complementary and heavier. falsegreen
  is the cheap pre-filter you run on every commit; mutation testing is the deep
  audit you run on the suites that matter.
+ - **[falsegreen-skill](https://github.com/vinicq/falsegreen-skill)** - the LLM
+ companion for the semantic pass (cases 10/11/12/15/18) and for TypeScript,
+ JavaScript, Java, and other languages.

- The defensible gap:
-
+ The defensible gap: a deterministic commit gate that catches the mechanical
+ false-positive patterns with zero runtime dependencies, paired with an LLM
+ semantic layer that catches the oracle-correctness cases no static tool can see.

  ---

@@ -499,20 +434,17 @@ code-as-evidence semantic pass aimed at oracle correctness (cases 12 and 18).

  ```
  falsegreen/
- src/falsegreen/scanner.py the deterministic scanner
+ src/falsegreen/scanner.py the deterministic scanner
  src/falsegreen/hook_install.py raw git-hook installer
- skills/falsegreen/
- SKILL.md the semantic-pass protocol
- reference.md the 18-case detection rubric
- scripts/scan.py bundled scanner (kept identical to the package)
- examples/bad_tests_sample.py one bad test per case (demo + regression)
  docs/guide.md plain-language guide to every case
  tests/test_scanner.py the scanner's own tests
  .pre-commit-hooks.yaml pre-commit integration
- .claude-plugin/ plugin + marketplace manifests
  pyproject.toml packaging
  ```

+ The LLM skill, the semantic-pass protocol, and the multi-language case reference
+ live in [falsegreen-skill](https://github.com/vinicq/falsegreen-skill).
+
  ---

  ## Contributing, security, license