bepo 0.2.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
bepo-0.2.0/.gitignore ADDED
@@ -0,0 +1,11 @@
1
+ __pycache__/
2
+ *.py[cod]
3
+ *$py.class
4
+ *.egg-info/
5
+ dist/
6
+ build/
7
+ .pytest_cache/
8
+ .eggs/
9
+ *.egg
10
+ .venv/
11
+ venv/
bepo-0.2.0/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 Andrew Park
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
bepo-0.2.0/PKG-INFO ADDED
@@ -0,0 +1,194 @@
1
+ Metadata-Version: 2.4
2
+ Name: bepo
3
+ Version: 0.2.0
4
+ Summary: Detect duplicate PRs in GitHub repos
5
+ Project-URL: Homepage, https://github.com/aardpark/bepo
6
+ Project-URL: Repository, https://github.com/aardpark/bepo
7
+ Author: Andrew Park
8
+ License-Expression: MIT
9
+ License-File: LICENSE
10
+ Keywords: cli,detection,duplicate,github,pull-request
11
+ Classifier: Development Status :: 4 - Beta
12
+ Classifier: Environment :: Console
13
+ Classifier: Intended Audience :: Developers
14
+ Classifier: License :: OSI Approved :: MIT License
15
+ Classifier: Programming Language :: Python :: 3
16
+ Classifier: Programming Language :: Python :: 3.10
17
+ Classifier: Programming Language :: Python :: 3.11
18
+ Classifier: Programming Language :: Python :: 3.12
19
+ Classifier: Topic :: Software Development :: Quality Assurance
20
+ Requires-Python: >=3.10
21
+ Description-Content-Type: text/markdown
22
+
23
+ # bepo
24
+
25
+ Detect duplicate pull requests in GitHub repos.
26
+
27
+ No ML, no embeddings, no API keys. Just static analysis of diffs.
28
+
29
+ A maintainer with 100 open PRs can run `bepo check --repo foo/bar` and in 5 minutes get a ranked list of "you should look at these pairs." That saves hours of manual review.
30
+
31
+ ## The Problem
32
+
33
+ Large repos waste engineering time on duplicate PRs. When multiple contributors fix the same bug independently, only one PR gets merged — the rest is wasted effort.
34
+
35
+ **This actually happens.** We analyzed 100 PRs from [OpenClaw](https://github.com/openclaw/openclaw) and found:
36
+
37
+ | Cluster | PRs | What happened |
38
+ |---------|-----|---------------|
39
+ | Matrix startup bug | **4 PRs** | 4 engineers independently fixed `startupGraceMs = 0` → `5000` |
40
+ | Media token regex | 2 PRs | Identical fix submitted twice |
41
+ | Feishu bitable config | 2 PRs | Same multi-account config fix |
42
+
43
+ **8 duplicate PRs across 3 bug fixes.** That's real engineering time wasted.
44
+
45
+ ## Proof: OpenClaw Analysis
46
+
47
+ We ran bepo on OpenClaw's open PRs. Here's what it found:
48
+
49
+ ```
50
+ $ bepo check --repo openclaw/openclaw --limit 100
51
+
52
+ #20025 <-> #19973
53
+ Similarity: 86%
54
+ Reason: Both fix #19843 ← Same issue!
55
+
56
+ #19868 <-> #19855
57
+ Similarity: 81%
58
+ Reason: Same files: parse.ts, pi-embedded-subscribe.tools.ts
59
+
60
+ #19871 <-> #19853
61
+ Similarity: 100%
62
+ Reason: Same files: bitable.ts, config-schema.ts, tools-config.ts
63
+ ```
64
+
65
+ **Verified manually:**
66
+
67
+ | PR Pair | Similarity | Verdict |
68
+ |---------|------------|---------|
69
+ | #20025 ↔ #19973 (Matrix) | 86% | ✅ TRUE DUPLICATE — both change `startupGraceMs` from 0 to 5000 |
70
+ | #19868 ↔ #19855 (regex) | 81% | ✅ TRUE DUPLICATE — identical PR titles |
71
+ | #19871 ↔ #19853 (Feishu) | 100% | ✅ TRUE DUPLICATE — same files, same fix |
72
+ | #19996 ↔ #19993 (unrelated) | 20% | ✅ Correctly NOT flagged |
73
+
74
+ **Precision: 80%** (4/5 flagged clusters were true duplicates)
75
+
76
+ ## More Examples
77
+
78
+ **VSCode** — Found PRs touching same files for same feature:
79
+ ```
80
+ #295823 <-> #295822
81
+ Similarity: 77%
82
+ Reason: Same files: chatModel.ts, chatForkActions.ts
83
+
84
+ Both: "Use metadata flag for fork detection"
85
+ ```
86
+
87
+ **Next.js** — Found related test updates:
88
+ ```
89
+ #90121 <-> #90120
90
+ Similarity: 86%
91
+ Reason: Same files: test/
92
+ ```
93
+
94
+ ## Install
95
+
96
+ ```bash
97
+ pip install bepo
98
+ ```
99
+
100
+ Requires [GitHub CLI](https://cli.github.com/) (`gh`) to be installed and authenticated.
101
+
102
+ ## Usage
103
+
104
+ ```bash
105
+ # Check a repo for duplicate PRs
106
+ bepo check --repo owner/repo
107
+
108
+ # Adjust sensitivity (default: 0.4, higher = stricter)
109
+ bepo check --repo owner/repo --threshold 0.5
110
+
111
+ # Check more PRs
112
+ bepo check --repo owner/repo --limit 100
113
+
114
+ # JSON output for CI
115
+ bepo check --repo owner/repo --json
116
+ ```
117
+
118
+ ## How It Works
119
+
120
+ bepo fingerprints each PR by extracting:
121
+
122
+ | Signal | Weight | What it catches |
123
+ |--------|--------|-----------------|
124
+ | Same issue ref (#123) | 10.0 | Definite duplicate |
125
+ | Same code changes | 8.0 | Identical lines added/removed |
126
+ | Same files touched | 6.0 | PRs modifying same code |
127
+ | Same feature domain | 3.0 | auth, messaging, database, etc. |
128
+ | Same imports | 1.0 | Similar dependencies |
129
+
130
+ Then computes pairwise Jaccard similarity.
131
+
132
+ **That's it.** No embeddings, no LLM calls. Just:
133
+ - Parse `+++ b/path` from diffs
134
+ - Regex for `#\d+` issue refs
135
+ - Compare actual code changes
136
+ - Set intersection for similarity
137
+
138
+ ~300 lines of Python.
139
+
140
+ ## As a Library
141
+
142
+ ```python
143
+ from bepo import fingerprint_pr, find_duplicates
144
+
145
+ # Fingerprint PRs
146
+ fp1 = fingerprint_pr("#123", diff1, title="Fix auth", body="Fixes #456")
147
+ fp2 = fingerprint_pr("#124", diff2, title="Auth fix", body="Fixes #456")
148
+
149
+ # Find duplicates
150
+ dups = find_duplicates([fp1, fp2], threshold=0.4)
151
+ for d in dups:
152
+ print(f"{d.pr_a} ↔ {d.pr_b}: {d.similarity:.0%}")
153
+ print(f" Shared issues: {d.shared_issues}")
154
+ print(f" Shared files: {d.shared_files}")
155
+ ```
156
+
157
+ ## GitHub Action
158
+
159
+ ```yaml
160
+ name: PR Duplicate Check
161
+ on: [pull_request]
162
+
163
+ jobs:
164
+ check:
165
+ runs-on: ubuntu-latest
166
+ steps:
167
+ - uses: actions/checkout@v4
168
+ - uses: actions/setup-python@v5
169
+ with:
170
+ python-version: '3.11'
171
+ - run: pip install bepo
172
+ - run: bepo check --repo ${{ github.repository }} --json
173
+ env:
174
+ GH_TOKEN: ${{ github.token }}
175
+ ```
176
+
177
+ ## Why This Works
178
+
179
+ Duplicates share obvious signals:
180
+ - **Same code** = Identical changes (639 shared lines caught SoundChain duplicates)
181
+ - **Same issue ref** = Same bug report (#19843 appeared in 4 Matrix PRs)
182
+ - **Same files** = Same bug location (100% overlap for Feishu cluster)
183
+
184
+ Code overlap and issue refs catch most duplicates. Simple works.
185
+
186
+ ## Origin Story
187
+
188
+ This tool was vibe-coded in a single session with Claude.
189
+
190
+ We tried a few approaches and kept finding that simpler signals outperformed fancier ones. File overlap and issue refs catch most duplicates. Sometimes the obvious solution is the right one.
191
+
192
+ ## License
193
+
194
+ MIT
bepo-0.2.0/README.md ADDED
@@ -0,0 +1,172 @@
1
+ # bepo
2
+
3
+ Detect duplicate pull requests in GitHub repos.
4
+
5
+ No ML, no embeddings, no API keys. Just static analysis of diffs.
6
+
7
+ A maintainer with 100 open PRs can run `bepo check --repo foo/bar` and in 5 minutes get a ranked list of "you should look at these pairs." That saves hours of manual review.
8
+
9
+ ## The Problem
10
+
11
+ Large repos waste engineering time on duplicate PRs. When multiple contributors fix the same bug independently, only one PR gets merged — the rest is wasted effort.
12
+
13
+ **This actually happens.** We analyzed 100 PRs from [OpenClaw](https://github.com/openclaw/openclaw) and found:
14
+
15
+ | Cluster | PRs | What happened |
16
+ |---------|-----|---------------|
17
+ | Matrix startup bug | **4 PRs** | 4 engineers independently fixed `startupGraceMs = 0` → `5000` |
18
+ | Media token regex | 2 PRs | Identical fix submitted twice |
19
+ | Feishu bitable config | 2 PRs | Same multi-account config fix |
20
+
21
+ **8 duplicate PRs across 3 bug fixes.** That's real engineering time wasted.
22
+
23
+ ## Proof: OpenClaw Analysis
24
+
25
+ We ran bepo on OpenClaw's open PRs. Here's what it found:
26
+
27
+ ```
28
+ $ bepo check --repo openclaw/openclaw --limit 100
29
+
30
+ #20025 <-> #19973
31
+ Similarity: 86%
32
+ Reason: Both fix #19843 ← Same issue!
33
+
34
+ #19868 <-> #19855
35
+ Similarity: 81%
36
+ Reason: Same files: parse.ts, pi-embedded-subscribe.tools.ts
37
+
38
+ #19871 <-> #19853
39
+ Similarity: 100%
40
+ Reason: Same files: bitable.ts, config-schema.ts, tools-config.ts
41
+ ```
42
+
43
+ **Verified manually:**
44
+
45
+ | PR Pair | Similarity | Verdict |
46
+ |---------|------------|---------|
47
+ | #20025 ↔ #19973 (Matrix) | 86% | ✅ TRUE DUPLICATE — both change `startupGraceMs` from 0 to 5000 |
48
+ | #19868 ↔ #19855 (regex) | 81% | ✅ TRUE DUPLICATE — identical PR titles |
49
+ | #19871 ↔ #19853 (Feishu) | 100% | ✅ TRUE DUPLICATE — same files, same fix |
50
+ | #19996 ↔ #19993 (unrelated) | 20% | ✅ Correctly NOT flagged |
51
+
52
+ **Precision: 80%** (4/5 flagged clusters were true duplicates)
53
+
54
+ ## More Examples
55
+
56
+ **VSCode** — Found PRs touching same files for same feature:
57
+ ```
58
+ #295823 <-> #295822
59
+ Similarity: 77%
60
+ Reason: Same files: chatModel.ts, chatForkActions.ts
61
+
62
+ Both: "Use metadata flag for fork detection"
63
+ ```
64
+
65
+ **Next.js** — Found related test updates:
66
+ ```
67
+ #90121 <-> #90120
68
+ Similarity: 86%
69
+ Reason: Same files: test/
70
+ ```
71
+
72
+ ## Install
73
+
74
+ ```bash
75
+ pip install bepo
76
+ ```
77
+
78
+ Requires [GitHub CLI](https://cli.github.com/) (`gh`) to be installed and authenticated.
79
+
80
+ ## Usage
81
+
82
+ ```bash
83
+ # Check a repo for duplicate PRs
84
+ bepo check --repo owner/repo
85
+
86
+ # Adjust sensitivity (default: 0.4, higher = stricter)
87
+ bepo check --repo owner/repo --threshold 0.5
88
+
89
+ # Check more PRs
90
+ bepo check --repo owner/repo --limit 100
91
+
92
+ # JSON output for CI
93
+ bepo check --repo owner/repo --json
94
+ ```
95
+
96
+ ## How It Works
97
+
98
+ bepo fingerprints each PR by extracting:
99
+
100
+ | Signal | Weight | What it catches |
101
+ |--------|--------|-----------------|
102
+ | Same issue ref (#123) | 10.0 | Definite duplicate |
103
+ | Same code changes | 8.0 | Identical lines added/removed |
104
+ | Same files touched | 6.0 | PRs modifying same code |
105
+ | Same feature domain | 3.0 | auth, messaging, database, etc. |
106
+ | Same imports | 1.0 | Similar dependencies |
107
+
108
+ Then computes pairwise Jaccard similarity.
109
+
110
+ **That's it.** No embeddings, no LLM calls. Just:
111
+ - Parse `+++ b/path` from diffs
112
+ - Regex for `#\d+` issue refs
113
+ - Compare actual code changes
114
+ - Set intersection for similarity
115
+
116
+ ~300 lines of Python.
117
+
118
+ ## As a Library
119
+
120
+ ```python
121
+ from bepo import fingerprint_pr, find_duplicates
122
+
123
+ # Fingerprint PRs
124
+ fp1 = fingerprint_pr("#123", diff1, title="Fix auth", body="Fixes #456")
125
+ fp2 = fingerprint_pr("#124", diff2, title="Auth fix", body="Fixes #456")
126
+
127
+ # Find duplicates
128
+ dups = find_duplicates([fp1, fp2], threshold=0.4)
129
+ for d in dups:
130
+ print(f"{d.pr_a} ↔ {d.pr_b}: {d.similarity:.0%}")
131
+ print(f" Shared issues: {d.shared_issues}")
132
+ print(f" Shared files: {d.shared_files}")
133
+ ```
134
+
135
+ ## GitHub Action
136
+
137
+ ```yaml
138
+ name: PR Duplicate Check
139
+ on: [pull_request]
140
+
141
+ jobs:
142
+ check:
143
+ runs-on: ubuntu-latest
144
+ steps:
145
+ - uses: actions/checkout@v4
146
+ - uses: actions/setup-python@v5
147
+ with:
148
+ python-version: '3.11'
149
+ - run: pip install bepo
150
+ - run: bepo check --repo ${{ github.repository }} --json
151
+ env:
152
+ GH_TOKEN: ${{ github.token }}
153
+ ```
154
+
155
+ ## Why This Works
156
+
157
+ Duplicates share obvious signals:
158
+ - **Same code** = Identical changes (639 shared lines caught SoundChain duplicates)
159
+ - **Same issue ref** = Same bug report (#19843 appeared in 4 Matrix PRs)
160
+ - **Same files** = Same bug location (100% overlap for Feishu cluster)
161
+
162
+ Code overlap and issue refs catch most duplicates. Simple works.
163
+
164
+ ## Origin Story
165
+
166
+ This tool was vibe-coded in a single session with Claude.
167
+
168
+ We tried a few approaches and kept finding that simpler signals outperformed fancier ones. File overlap and issue refs catch most duplicates. Sometimes the obvious solution is the right one.
169
+
170
+ ## License
171
+
172
+ MIT
@@ -0,0 +1,21 @@
1
+ """bepo - PR duplicate detection for GitHub repos.
2
+
3
+ Detects duplicate and related pull requests using static analysis.
4
+ No ML, no embeddings, just smart diff parsing.
5
+ """
6
+
7
+ __version__ = "0.2.0"
8
+
9
+ from .fingerprint import (
10
+ fingerprint_pr,
11
+ find_duplicates,
12
+ Fingerprint,
13
+ Duplicate,
14
+ )
15
+
16
+ __all__ = [
17
+ "fingerprint_pr",
18
+ "find_duplicates",
19
+ "Fingerprint",
20
+ "Duplicate",
21
+ ]
bepo-0.2.0/bepo/cli.py ADDED
@@ -0,0 +1,227 @@
1
+ """CLI for bepo."""
2
+ from __future__ import annotations
3
+
4
+ import argparse
5
+ import hashlib
6
+ import json
7
+ import os
8
+ import subprocess
9
+ import sys
10
+ import time
11
+ from pathlib import Path
12
+ from .fingerprint import fingerprint_pr, find_duplicates
13
+
14
+
15
+ # ANSI color codes
16
+ class Colors:
17
+ BOLD = '\033[1m'
18
+ DIM = '\033[2m'
19
+ GREEN = '\033[92m'
20
+ YELLOW = '\033[93m'
21
+ RED = '\033[91m'
22
+ CYAN = '\033[96m'
23
+ RESET = '\033[0m'
24
+
25
+ @classmethod
26
+ def disable(cls):
27
+ cls.BOLD = cls.DIM = cls.GREEN = cls.YELLOW = ''
28
+ cls.RED = cls.CYAN = cls.RESET = ''
29
+
30
+
31
+ def get_cache_dir() -> Path:
32
+ """Get cache directory, creating if needed."""
33
+ cache_dir = Path.home() / '.cache' / 'bepo'
34
+ cache_dir.mkdir(parents=True, exist_ok=True)
35
+ return cache_dir
36
+
37
+
38
+ def get_cache_key(repo: str, pr_num: int) -> str:
39
+ """Generate cache key for a PR diff."""
40
+ return hashlib.md5(f"{repo}:{pr_num}".encode()).hexdigest()[:12]
41
+
42
+
43
+ def load_cached_diff(repo: str, pr_num: int) -> str | None:
44
+ """Load diff from cache if exists and fresh (< 1 hour)."""
45
+ cache_file = get_cache_dir() / f"{get_cache_key(repo, pr_num)}.diff"
46
+ if cache_file.exists():
47
+ # Check if cache is fresh (< 1 hour)
48
+ age = time.time() - cache_file.stat().st_mtime
49
+ if age < 3600:
50
+ return cache_file.read_text()
51
+ return None
52
+
53
+
54
+ def save_cached_diff(repo: str, pr_num: int, diff: str):
55
+ """Save diff to cache."""
56
+ cache_file = get_cache_dir() / f"{get_cache_key(repo, pr_num)}.diff"
57
+ cache_file.write_text(diff)
58
+
59
+
60
+ def fetch_prs(repo: str, limit: int = 50) -> list[dict]:
61
+ """Fetch open PRs from GitHub."""
62
+ result = subprocess.run(
63
+ ['gh', 'pr', 'list', '--repo', repo, '--state', 'open',
64
+ '--limit', str(limit), '--json', 'number,title,body'],
65
+ capture_output=True, text=True
66
+ )
67
+ if result.returncode != 0:
68
+ print(f"{Colors.RED}Error fetching PRs: {result.stderr}{Colors.RESET}", file=sys.stderr)
69
+ sys.exit(1)
70
+ return json.loads(result.stdout)
71
+
72
+
73
+ def fetch_diff(repo: str, pr_num: int, use_cache: bool = True) -> str:
74
+ """Fetch diff for a PR, using cache if available."""
75
+ if use_cache:
76
+ cached = load_cached_diff(repo, pr_num)
77
+ if cached is not None:
78
+ return cached
79
+
80
+ result = subprocess.run(
81
+ ['gh', 'pr', 'diff', str(pr_num), '--repo', repo],
82
+ capture_output=True, text=True
83
+ )
84
+ diff = result.stdout if result.returncode == 0 else ""
85
+
86
+ if diff and use_cache:
87
+ save_cached_diff(repo, pr_num, diff)
88
+
89
+ return diff
90
+
91
+
92
+ def print_progress(current: int, total: int, pr_num: int, cached: bool = False):
93
+ """Print progress indicator."""
94
+ bar_width = 20
95
+ filled = int(bar_width * current / total)
96
+ bar = '█' * filled + '░' * (bar_width - filled)
97
+ cache_indicator = f" {Colors.DIM}(cached){Colors.RESET}" if cached else ""
98
+ print(f"\r {Colors.DIM}[{bar}]{Colors.RESET} {current}/{total} PR#{pr_num}{cache_indicator} ",
99
+ end='', file=sys.stderr, flush=True)
100
+
101
+
102
+ def main():
103
+ parser = argparse.ArgumentParser(
104
+ description='bepo - detect duplicate PRs',
105
+ formatter_class=argparse.RawDescriptionHelpFormatter,
106
+ epilog='''
107
+ Examples:
108
+ bepo check --repo owner/repo
109
+ bepo check --repo owner/repo --threshold 0.5
110
+ bepo check --repo owner/repo --json
111
+ bepo check --repo owner/repo --no-cache
112
+ '''
113
+ )
114
+ subparsers = parser.add_subparsers(dest='command', required=True)
115
+
116
+ # check command
117
+ check = subparsers.add_parser('check', help='Check repo for duplicate PRs')
118
+ check.add_argument('--repo', '-r', required=True, help='GitHub repo (owner/repo)')
119
+ check.add_argument('--threshold', '-t', type=float, default=0.4,
120
+ help='Similarity threshold (default: 0.4)')
121
+ check.add_argument('--limit', '-l', type=int, default=50,
122
+ help='Max PRs to check (default: 50)')
123
+ check.add_argument('--json', action='store_true', help='Output JSON')
124
+ check.add_argument('--no-cache', action='store_true', help='Disable diff caching')
125
+ check.add_argument('--no-color', action='store_true', help='Disable colored output')
126
+
127
+ # clear-cache command
128
+ clear = subparsers.add_parser('clear-cache', help='Clear cached diffs')
129
+
130
+ args = parser.parse_args()
131
+
132
+ if args.command == 'clear-cache':
133
+ run_clear_cache()
134
+ elif args.command == 'check':
135
+ if args.no_color or not sys.stderr.isatty():
136
+ Colors.disable()
137
+ run_check(args)
138
+
139
+
140
+ def run_clear_cache():
141
+ """Clear the diff cache."""
142
+ cache_dir = get_cache_dir()
143
+ count = 0
144
+ for f in cache_dir.glob('*.diff'):
145
+ f.unlink()
146
+ count += 1
147
+ print(f"Cleared {count} cached diffs.")
148
+
149
+
150
+ def run_check(args):
151
+ """Run duplicate check on a repo."""
152
+ use_cache = not args.no_cache
153
+
154
+ print(f"{Colors.BOLD}Fetching PRs from {args.repo}...{Colors.RESET}", file=sys.stderr)
155
+ prs = fetch_prs(args.repo, args.limit)
156
+ print(f"Found {Colors.CYAN}{len(prs)}{Colors.RESET} open PRs\n", file=sys.stderr)
157
+
158
+ # Fingerprint each PR
159
+ fingerprints = []
160
+ cache_hits = 0
161
+ start_time = time.time()
162
+
163
+ for i, pr in enumerate(prs, 1):
164
+ pr_num = pr['number']
165
+ cached = load_cached_diff(args.repo, pr_num) is not None if use_cache else False
166
+ if cached:
167
+ cache_hits += 1
168
+ print_progress(i, len(prs), pr_num, cached)
169
+
170
+ diff = fetch_diff(args.repo, pr_num, use_cache=use_cache)
171
+ if diff:
172
+ fp = fingerprint_pr(
173
+ f"#{pr_num}",
174
+ diff,
175
+ title=pr.get('title', ''),
176
+ body=pr.get('body', '') or '',
177
+ )
178
+ fingerprints.append(fp)
179
+
180
+ elapsed = time.time() - start_time
181
+ print(f"\r{' ' * 60}\r", end='', file=sys.stderr) # Clear progress line
182
+
183
+ if use_cache and cache_hits > 0:
184
+ print(f"{Colors.DIM}Analyzed {len(prs)} PRs in {elapsed:.1f}s ({cache_hits} cached){Colors.RESET}\n",
185
+ file=sys.stderr)
186
+ else:
187
+ print(f"{Colors.DIM}Analyzed {len(prs)} PRs in {elapsed:.1f}s{Colors.RESET}\n", file=sys.stderr)
188
+
189
+ # Find duplicates
190
+ duplicates = find_duplicates(fingerprints, threshold=args.threshold)
191
+
192
+ if args.json:
193
+ output = [
194
+ {
195
+ 'pr_a': d.pr_a,
196
+ 'pr_b': d.pr_b,
197
+ 'similarity': round(d.similarity, 3),
198
+ 'shared_issues': d.shared_issues,
199
+ 'shared_files': d.shared_files,
200
+ 'shared_code_lines': d.shared_code_lines,
201
+ 'reason': d.reason,
202
+ }
203
+ for d in duplicates
204
+ ]
205
+ print(json.dumps(output, indent=2))
206
+ else:
207
+ if not duplicates:
208
+ print(f"{Colors.GREEN}No duplicates found.{Colors.RESET}")
209
+ else:
210
+ print(f"{Colors.BOLD}Found {len(duplicates)} potential duplicates:{Colors.RESET}\n")
211
+ for d in duplicates:
212
+ # Color code by similarity
213
+ if d.similarity >= 0.8:
214
+ sim_color = Colors.RED
215
+ elif d.similarity >= 0.6:
216
+ sim_color = Colors.YELLOW
217
+ else:
218
+ sim_color = Colors.RESET
219
+
220
+ print(f"{Colors.BOLD}{d.pr_a} <-> {d.pr_b}{Colors.RESET}")
221
+ print(f" {sim_color}Similarity: {d.similarity:.0%}{Colors.RESET}")
222
+ print(f" {Colors.DIM}Reason: {d.reason}{Colors.RESET}")
223
+ print()
224
+
225
+
226
+ if __name__ == '__main__':
227
+ main()
@@ -0,0 +1,289 @@
1
+ """PR fingerprinting and duplicate detection."""
2
+ from __future__ import annotations
3
+
4
+ import re
5
+ from dataclasses import dataclass, field
6
+
7
+
8
+ @dataclass
9
+ class Fingerprint:
10
+ """Fingerprint of a PR's changes."""
11
+ pr_id: str
12
+ files_touched: list[str] = field(default_factory=list)
13
+ domains: list[str] = field(default_factory=list)
14
+ issue_refs: list[str] = field(default_factory=list)
15
+ imports: list[str] = field(default_factory=list)
16
+ code_lines: list[str] = field(default_factory=list) # normalized changed lines
17
+
18
+ def similarity(self, other: "Fingerprint") -> float:
19
+ """Compute similarity score between two fingerprints.
20
+
21
+ Weights:
22
+ - Issue ref overlap (10.0): Same issue = definite duplicate
23
+ - Code content overlap (8.0): Same changes = definite duplicate
24
+ - File path overlap (5.0): Same files = likely related
25
+ - Domain overlap (3.0): Same feature area
26
+ - Imports (1.0): Similar dependencies
27
+ """
28
+ scores = []
29
+ weights = []
30
+
31
+ # Issue reference overlap - strongest signal
32
+ if self.issue_refs and other.issue_refs:
33
+ r1, r2 = set(self.issue_refs), set(other.issue_refs)
34
+ if r1 & r2:
35
+ scores.append(1.0)
36
+ weights.append(10.0)
37
+
38
+ # Code content similarity - what actually changed
39
+ code_sim = 0.0
40
+ if self.code_lines and other.code_lines:
41
+ c1, c2 = set(self.code_lines), set(other.code_lines)
42
+ if c1 | c2:
43
+ code_sim = len(c1 & c2) / len(c1 | c2)
44
+ if code_sim > 0.1: # only count if meaningful overlap
45
+ scores.append(code_sim)
46
+ weights.append(8.0)
47
+
48
+ # Exact file overlap (strongest file signal)
49
+ exact_files_a = set(self.files_touched)
50
+ exact_files_b = set(other.files_touched)
51
+ exact_overlap = exact_files_a & exact_files_b
52
+ if exact_overlap:
53
+ # Same exact files + code overlap = very likely duplicate
54
+ file_sim = len(exact_overlap) / len(exact_files_a | exact_files_b)
55
+ if code_sim > 0.2:
56
+ # Files AND code match - boost significantly
57
+ scores.append(file_sim)
58
+ weights.append(6.0)
59
+ else:
60
+ # Same files but different code - lower weight
61
+ scores.append(file_sim)
62
+ weights.append(3.0)
63
+ else:
64
+ # Directory level similarity (weaker signal)
65
+ def get_dirs(files):
66
+ dirs = set()
67
+ for f in files:
68
+ parts = f.split('/')
69
+ if len(parts) >= 2:
70
+ dirs.add('/'.join(parts[:-1]))
71
+ return dirs
72
+ f1, f2 = get_dirs(self.files_touched), get_dirs(other.files_touched)
73
+ if f1 | f2:
74
+ scores.append(len(f1 & f2) / len(f1 | f2))
75
+ weights.append(2.0) # lower weight for dir-only match
76
+
77
+ # Domain similarity
78
+ if self.domains or other.domains:
79
+ d1, d2 = set(self.domains), set(other.domains)
80
+ if d1 | d2:
81
+ scores.append(len(d1 & d2) / len(d1 | d2))
82
+ weights.append(3.0)
83
+
84
+ # Import similarity
85
+ if self.imports or other.imports:
86
+ i1, i2 = set(self.imports), set(other.imports)
87
+ if i1 | i2:
88
+ scores.append(len(i1 & i2) / len(i1 | i2))
89
+ weights.append(1.0)
90
+
91
+ if not scores:
92
+ return 0.0
93
+ return sum(s * w for s, w in zip(scores, weights)) / sum(weights)
94
+
95
+
96
+ @dataclass
97
+ class Duplicate:
98
+ """A pair of similar PRs."""
99
+ pr_a: str
100
+ pr_b: str
101
+ similarity: float
102
+ shared_issues: list[str]
103
+ shared_files: list[str]
104
+ shared_domains: list[str]
105
+ shared_code_lines: int # count of overlapping code lines
106
+ reason: str
107
+
108
+
109
+ # Domain keywords for categorizing PRs
110
+ DOMAINS = {
111
+ # Messaging
112
+ "telegram": "messaging", "whatsapp": "messaging", "discord": "messaging",
113
+ "slack": "messaging", "matrix": "messaging", "twilio": "messaging",
114
+ # Auth
115
+ "auth": "auth", "login": "auth", "oauth": "auth", "jwt": "auth",
116
+ "session": "auth", "token": "auth",
117
+ # Data
118
+ "cache": "cache", "redis": "cache", "database": "database", "db": "database",
119
+ "postgres": "database", "mysql": "database", "mongo": "database",
120
+ # API
121
+ "api": "api", "endpoint": "api", "route": "api", "graphql": "api", "webhook": "api",
122
+ # Scheduling
123
+ "cron": "scheduling", "schedule": "scheduling", "job": "scheduling", "queue": "scheduling",
124
+ # AI
125
+ "llm": "ai", "model": "ai", "embedding": "ai", "openai": "ai", "anthropic": "ai",
126
+ # Media
127
+ "media": "media", "image": "media", "video": "media", "upload": "media",
128
+ # Config
129
+ "config": "config", "setting": "config", "env": "config",
130
+ # Observability
131
+ "log": "observability", "trace": "observability", "metric": "observability",
132
+ # Plugins
133
+ "plugin": "plugin", "extension": "plugin",
134
+ }
135
+
136
+
137
+ def _extract_files(diff: str) -> list[str]:
138
+ """Extract file paths from diff."""
139
+ files = []
140
+ for line in diff.split('\n'):
141
+ if line.startswith('+++ b/'):
142
+ files.append(line[6:])
143
+ return list(set(files))
144
+
145
+
146
+ def _extract_issues(text: str) -> list[str]:
147
+ """Extract issue references (#123)."""
148
+ return list(set(m.group(1) for m in re.finditer(r'#(\d+)', text)))
149
+
150
+
151
+ def _extract_domains(files: list[str], code: str) -> list[str]:
152
+ """Extract domains from file paths and code."""
153
+ domains = set()
154
+ text = ' '.join(files).lower() + ' ' + code.lower()
155
+ for keyword, domain in DOMAINS.items():
156
+ if keyword in text:
157
+ domains.add(domain)
158
+ return list(domains)
159
+
160
+
161
+ def _extract_imports(diff: str) -> list[str]:
162
+ """Extract imports from diff."""
163
+ imports = []
164
+ for line in diff.split('\n'):
165
+ if line.startswith('+') and not line.startswith('+++'):
166
+ line = line[1:]
167
+ m = re.search(r'import\s+.*\s+from\s+["\']([^"\']+)["\']', line)
168
+ if m:
169
+ imports.append(m.group(1))
170
+ m = re.search(r'require\s*\(\s*["\']([^"\']+)["\']', line)
171
+ if m:
172
+ imports.append(m.group(1))
173
+ return list(set(imports))
174
+
175
+
176
+ def _extract_new_code(diff: str) -> str:
177
+ """Extract added lines from diff."""
178
+ lines = []
179
+ for line in diff.split('\n'):
180
+ if line.startswith('+') and not line.startswith('+++'):
181
+ lines.append(line[1:])
182
+ return '\n'.join(lines)
183
+
184
+
185
+ def _extract_code_lines(diff: str) -> list[str]:
186
+ """Extract normalized code lines from diff for content comparison.
187
+
188
+ Normalizes lines by stripping whitespace and filtering noise.
189
+ Returns unique meaningful lines that represent actual code changes.
190
+ """
191
+ lines = set()
192
+ for line in diff.split('\n'):
193
+ # Get added and removed lines (both matter for comparison)
194
+ if line.startswith('+') and not line.startswith('+++'):
195
+ code = line[1:].strip()
196
+ elif line.startswith('-') and not line.startswith('---'):
197
+ code = line[1:].strip()
198
+ else:
199
+ continue
200
+
201
+ # Skip empty lines and trivial changes
202
+ if not code:
203
+ continue
204
+ if len(code) < 4: # skip tiny fragments
205
+ continue
206
+ if code.startswith('//') or code.startswith('#') or code.startswith('*'):
207
+ continue # skip comments
208
+ if code in ('{', '}', '(', ')', '[', ']', 'else', 'return', 'break', 'continue'):
209
+ continue # skip trivial syntax
210
+
211
+ lines.add(code)
212
+
213
+ return list(lines)
214
+
215
+
216
+ def fingerprint_pr(
217
+ pr_id: str,
218
+ diff: str,
219
+ title: str = "",
220
+ body: str = "",
221
+ ) -> Fingerprint:
222
+ """Create a fingerprint from a PR diff."""
223
+ new_code = _extract_new_code(diff)
224
+ files = _extract_files(diff)
225
+
226
+ return Fingerprint(
227
+ pr_id=pr_id,
228
+ files_touched=files,
229
+ domains=_extract_domains(files, new_code),
230
+ issue_refs=_extract_issues(f"{title} {body} {diff}"),
231
+ imports=_extract_imports(diff),
232
+ code_lines=_extract_code_lines(diff),
233
+ )
234
+
235
+
236
+ def find_duplicates(
237
+ fingerprints: list[Fingerprint],
238
+ threshold: float = 0.4,
239
+ ) -> list[Duplicate]:
240
+ """Find duplicate PRs from a list of fingerprints."""
241
+ duplicates = []
242
+
243
+ for i, fp_a in enumerate(fingerprints):
244
+ for j, fp_b in enumerate(fingerprints):
245
+ if i >= j:
246
+ continue
247
+
248
+ sim = fp_a.similarity(fp_b)
249
+ if sim < threshold:
250
+ continue
251
+
252
+ shared_issues = list(set(fp_a.issue_refs) & set(fp_b.issue_refs))
253
+ shared_domains = list(set(fp_a.domains) & set(fp_b.domains))
254
+ shared_code = set(fp_a.code_lines) & set(fp_b.code_lines)
255
+ shared_code_count = len(shared_code)
256
+
257
+ def get_dirs(files):
258
+ return set('/'.join(f.split('/')[:-1]) for f in files if '/' in f)
259
+ shared_files = list(get_dirs(fp_a.files_touched) & get_dirs(fp_b.files_touched))
260
+
261
+ # Determine reason - prioritize strongest signals
262
+ if shared_issues:
263
+ reason = f"Both fix #{', #'.join(shared_issues[:2])}"
264
+ elif shared_code_count >= 3:
265
+ # Show a sample of shared code
266
+ sample = list(shared_code)[:1][0]
267
+ if len(sample) > 40:
268
+ sample = sample[:40] + "..."
269
+ reason = f"Same code: {shared_code_count} lines overlap"
270
+ elif shared_files:
271
+ reason = f"Same files: {', '.join(f.split('/')[-1] for f in shared_files[:2])}"
272
+ elif shared_domains:
273
+ reason = f"Same domain: {', '.join(shared_domains[:2])}"
274
+ else:
275
+ reason = f"Similar ({sim:.0%})"
276
+
277
+ duplicates.append(Duplicate(
278
+ pr_a=fp_a.pr_id,
279
+ pr_b=fp_b.pr_id,
280
+ similarity=sim,
281
+ shared_issues=shared_issues,
282
+ shared_files=shared_files,
283
+ shared_domains=shared_domains,
284
+ shared_code_lines=shared_code_count,
285
+ reason=reason,
286
+ ))
287
+
288
+ duplicates.sort(key=lambda d: d.similarity, reverse=True)
289
+ return duplicates
@@ -0,0 +1,31 @@
1
+ [build-system]
2
+ requires = ["hatchling"]
3
+ build-backend = "hatchling.build"
4
+
5
+ [project]
6
+ name = "bepo"
7
+ version = "0.2.0"
8
+ description = "Detect duplicate PRs in GitHub repos"
9
+ readme = "README.md"
10
+ license = "MIT"
11
+ requires-python = ">=3.10"
12
+ authors = [{ name = "Andrew Park" }]
13
+ keywords = ["github", "pull-request", "duplicate", "detection", "cli"]
14
+ classifiers = [
15
+ "Development Status :: 4 - Beta",
16
+ "Environment :: Console",
17
+ "Intended Audience :: Developers",
18
+ "License :: OSI Approved :: MIT License",
19
+ "Programming Language :: Python :: 3",
20
+ "Programming Language :: Python :: 3.10",
21
+ "Programming Language :: Python :: 3.11",
22
+ "Programming Language :: Python :: 3.12",
23
+ "Topic :: Software Development :: Quality Assurance",
24
+ ]
25
+
26
+ [project.scripts]
27
+ bepo = "bepo.cli:main"
28
+
29
+ [project.urls]
30
+ Homepage = "https://github.com/aardpark/bepo"
31
+ Repository = "https://github.com/aardpark/bepo"
@@ -0,0 +1,71 @@
1
+ """Tests for bepo fingerprinting."""
2
+ from bepo import fingerprint_pr, find_duplicates
3
+
4
+
5
+ def test_fingerprint_extracts_files():
6
+ diff = """+++ b/src/auth/login.ts
7
+ - old code
8
+ + new code
9
+ +++ b/src/auth/logout.ts
10
+ + more code
11
+ """
12
+ fp = fingerprint_pr("#1", diff)
13
+ assert "src/auth/login.ts" in fp.files_touched
14
+ assert "src/auth/logout.ts" in fp.files_touched
15
+
16
+
17
+ def test_fingerprint_extracts_issues():
18
+ fp = fingerprint_pr("#1", "", title="Fix bug", body="Fixes #123 and #456")
19
+ assert "123" in fp.issue_refs
20
+ assert "456" in fp.issue_refs
21
+
22
+
23
+ def test_fingerprint_extracts_domains():
24
+ diff = "+++ b/src/telegram/handler.ts\n+ code"
25
+ fp = fingerprint_pr("#1", diff)
26
+ assert "messaging" in fp.domains
27
+
28
+
29
+ def test_similar_prs_detected():
30
+ diff1 = "+++ b/src/auth/login.ts\n+ code"
31
+ diff2 = "+++ b/src/auth/login.ts\n+ other code"
32
+
33
+ fp1 = fingerprint_pr("#1", diff1, body="Fixes #100")
34
+ fp2 = fingerprint_pr("#2", diff2, body="Fixes #100")
35
+
36
+ dups = find_duplicates([fp1, fp2], threshold=0.3)
37
+ assert len(dups) == 1
38
+ assert dups[0].similarity > 0.5
39
+ assert "100" in dups[0].shared_issues
40
+
41
+
42
+ def test_different_prs_not_flagged():
43
+ diff1 = "+++ b/src/auth/login.ts\n+ const user = await authenticate(token)"
44
+ diff2 = "+++ b/src/payments/stripe.ts\n+ const charge = await stripe.charges.create(amount)"
45
+
46
+ fp1 = fingerprint_pr("#1", diff1)
47
+ fp2 = fingerprint_pr("#2", diff2)
48
+
49
+ dups = find_duplicates([fp1, fp2], threshold=0.5)
50
+ assert len(dups) == 0
51
+
52
+
53
+ def test_same_code_detected():
54
+ """PRs with identical code changes should be flagged."""
55
+ diff1 = """+++ b/src/config.ts
56
+ + startupGraceMs = 5000
57
+ + retryCount = 3
58
+ + timeout = 30000
59
+ """
60
+ diff2 = """+++ b/src/config.ts
61
+ + startupGraceMs = 5000
62
+ + retryCount = 3
63
+ + timeout = 30000
64
+ """
65
+ fp1 = fingerprint_pr("#1", diff1)
66
+ fp2 = fingerprint_pr("#2", diff2)
67
+
68
+ dups = find_duplicates([fp1, fp2], threshold=0.3)
69
+ assert len(dups) == 1
70
+ assert dups[0].shared_code_lines >= 3
71
+ assert "code" in dups[0].reason.lower() or dups[0].similarity > 0.8