gitlab-harvester 0.1.0__tar.gz
- gitlab_harvester-0.1.0/LICENSE +21 -0
- gitlab_harvester-0.1.0/PKG-INFO +329 -0
- gitlab_harvester-0.1.0/README.md +301 -0
- gitlab_harvester-0.1.0/gitlab_harvester.egg-info/PKG-INFO +329 -0
- gitlab_harvester-0.1.0/gitlab_harvester.egg-info/SOURCES.txt +15 -0
- gitlab_harvester-0.1.0/gitlab_harvester.egg-info/dependency_links.txt +1 -0
- gitlab_harvester-0.1.0/gitlab_harvester.egg-info/entry_points.txt +2 -0
- gitlab_harvester-0.1.0/gitlab_harvester.egg-info/requires.txt +3 -0
- gitlab_harvester-0.1.0/gitlab_harvester.egg-info/top_level.txt +2 -0
- gitlab_harvester-0.1.0/gitlab_harvester.py +229 -0
- gitlab_harvester-0.1.0/glh/__init__.py +57 -0
- gitlab_harvester-0.1.0/glh/cli.py +292 -0
- gitlab_harvester-0.1.0/glh/harvester.py +1120 -0
- gitlab_harvester-0.1.0/glh/planner.py +265 -0
- gitlab_harvester-0.1.0/glh/session.py +35 -0
- gitlab_harvester-0.1.0/pyproject.toml +64 -0
- gitlab_harvester-0.1.0/setup.cfg +4 -0
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Cur1iosity

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,329 @@
Metadata-Version: 2.4
Name: gitlab-harvester
Version: 0.1.0
Summary: Build GitLab instance project index (JSONL) and search repositories for sensitive keywords.
Author-email: Cur1 <cur1iosity@protonmail.com>
License: MIT
Project-URL: Homepage, https://github.com/Cur1iosity/GitlabHarvester
Project-URL: Repository, https://github.com/Cur1iosity/GitlabHarvester
Project-URL: Issues, https://github.com/Cur1iosity/GitlabHarvester/issues
Keywords: gitlab,security,osint,redteam,index,scraping,searching,ndjson,jsonl,secret-detection
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: MIT License
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Topic :: Security
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Utilities
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: python-gitlab>=8.0.0
Requires-Dist: tqdm>=4.66.0
Requires-Dist: requests>=2.31.0
Dynamic: license-file

# GitlabHarvester

**Global term search across an entire GitLab instance, especially useful for GitLab CE.**

GitLab Community Edition does not provide instance-wide code search the way GitLab EE can.
**GitlabHarvester** fills this gap: it builds a lightweight **Instance Project Index (JSONL/NDJSON)** and performs term search across repositories **without cloning** them.

The tool is conceptually similar to utilities like *gitlab-finder* (Node.js), but implemented in modern Python with streaming output, branch planning and resumable sessions.

---

## Why this tool matters

- GitLab CE → no global code search
- Web UI search → limited and unreliable
- Cloning thousands of repos → slow and disk-heavy

**GitlabHarvester** lets you search the whole instance using only the API.

---

## Features

- ✅ **Instance-wide keyword search** for GitLab CE
- ✅ **No cloning required**: API-based
- ✅ **Project Index (JSONL/NDJSON)** for repeatable runs
- ✅ Branch strategies:
  - `default`: scan only the default branch (fast)
  - `all`: scan all indexed branches
  - `N`: scan up to N branches
- ✅ Fork strategies (explained below)
- ✅ **Session output + resume**
- ✅ Low memory footprint

---

## Requirements

- Python **3.11+**
- GitLab token with **read_api** permissions

---

## Installation

### Using pipx (recommended)
```bash
git clone https://github.com/Cur1iosity/GitlabHarvester.git
cd GitlabHarvester
pipx install .
```

or

```bash
pipx install git+https://github.com/Cur1iosity/GitlabHarvester.git
```

After that you can run the tool directly:

```bash
gitlab-harvester --help
```

### Classic pip install
```bash
git clone https://github.com/Cur1iosity/GitlabHarvester.git
cd GitlabHarvester
pip install .
```

or

```bash
pip install git+https://github.com/Cur1iosity/GitlabHarvester.git
```

---

## Quick Start (the index builds automatically)

You **do not need to build the project index manually**.
When you run a search, the index is created on the fly if it does not exist.

### Search a single keyword

```bash
gitlab-harvester -H https://gitlab.example.com -t "$TOKEN" --search "password"
```

### Search using a file with keywords

```bash
gitlab-harvester -H https://gitlab.example.com -t "$TOKEN" --terms-file keywords.txt
```

### Build only the index (optional)

This step is useful only if you want to prepare the index in advance:

```bash
gitlab-harvester -H https://gitlab.example.com -t "$TOKEN" --dump-only
```

---

## Branch control

There are two independent controls:

- `--index-branches`: which branches are stored in the index
- `--scan-branches`: which branches are actually scanned

### Examples

```bash
# Index only default branches, but scan up to 10
gitlab-harvester -H ... -t ... --scan-branches 10
```

```bash
# Store all branches and scan all
gitlab-harvester -H ... -t ... --index-branches all --scan-branches all
```

Shorthand:

```bash
gitlab-harvester -H ... -t ... --branches 10
```

---

## Fork strategies (important)

```bash
--forks skip|include|branch-diff|all-branches
```

### What they mean

- **skip**
  Forked projects are completely ignored.
  Good when forks are mostly duplicates and noise.

- **include**
  Forks are treated like normal projects.
  Simple and predictable, but may rescan identical branches.

- **branch-diff** (recommended)
  Smart mode:
  - always scans the fork's default branch
  - scans base branches (`main`, `master`, `develop`, `dev`)
  - scans only **branches unique to the fork** compared to upstream

  → best signal/noise ratio.

- **all-branches**
  Scan every branch of every fork: most exhaustive and slowest.

### Example

```bash
gitlab-harvester -H ... -t ... --terms-file keywords.txt --forks branch-diff --fork-diff-bases main,master,develop,dev
```
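
To make the `branch-diff` rules concrete, here is a minimal Python sketch of the selection logic as the list above describes it. It compares branch names only, which is an assumption; the real tool may compare commits or use other data. The name `plan_fork_branches` is hypothetical.

```python
DEFAULT_BASES = ("main", "master", "develop", "dev")


def plan_fork_branches(
    fork_default: str,
    fork_branches: list[str],
    upstream_branches: list[str],
    bases: tuple[str, ...] = DEFAULT_BASES,
) -> list[str]:
    """Return the fork branches to scan in a branch-diff style plan (sketch)."""
    fork_set = set(fork_branches)
    upstream_set = set(upstream_branches)

    planned = [fork_default]                          # 1. always the fork's default branch
    planned += [b for b in bases if b in fork_set]    # 2. base branches that exist in the fork
    planned += sorted(fork_set - upstream_set)        # 3. branch names absent from upstream

    # Drop repeats while keeping the order above.
    return list(dict.fromkeys(planned))
```

For a fork with branches `main`, `dev`, `hotfix-x`, `feature-a` and an upstream with `main`, `dev`, `feature-a`, the plan contains `main`, `dev` and `hotfix-x`; `feature-a` is skipped because upstream already has a branch of that name.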

---

## Session & resume

Results are written to JSONL session files.

```bash
gitlab-harvester -H ... -t ... --terms-file keywords.txt --session audit_run
```

Resume:

```bash
gitlab-harvester -H ... -t ... --terms-file keywords.txt --session-file audit_run.jsonl --resume
```
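
One way an append-only JSONL session can support resuming is sketched below. The record layout (`type`, `project_id`, `branch`) and both function names are assumptions made for illustration; the actual session format of the tool is not documented here.

```python
import json
from pathlib import Path


def load_done(session_path: Path) -> set[tuple[int, str]]:
    """Collect (project_id, branch) pairs already marked by checkpoint records."""
    done: set[tuple[int, str]] = set()
    if not session_path.exists():
        return done
    with session_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                # Tolerate a truncated line left behind by an interrupted run.
                continue
            if isinstance(rec, dict) and rec.get("type") == "checkpoint":
                done.add((rec["project_id"], rec["branch"]))
    return done


def append_record(session_path: Path, record: dict) -> None:
    """Append one record as a single JSON line."""
    with session_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```

On resume, a scanner built this way would call `load_done` once and skip every project/branch pair in the returned set.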

---

## Output

- **Project Index (JSONL)**: metadata + project entries
- **Session file (JSONL)**: hits + checkpoints

---

## Usage
```bash
gitlab-harvester --help

usage: gitlab-harvester [-h] -H HOST -t TOKEN [-bs BATCH_SIZE] [--index-file INDEX_FILE] [--dump-projects] [--dump-only] [-b BRANCHES] [--index-branches INDEX_BRANCHES] [--scan-branches SCAN_BRANCHES]
                        [--branches-per-page BRANCHES_PER_PAGE] [--forks {skip,include,branch-diff,all-branches}] [--fork-diff-bases FORK_DIFF_BASES] [-s SEARCH | -f TERMS_FILE] [--session SESSION |
                        --session-file SESSION_FILE] [-o OUTPUT] [--resume]

Collect and use an Instance Project Index from a GitLab instance.

options:
  -h, --help            show this help message and exit
  -H, --host HOST       GitLab host (e.g., gitlab.example.com).
  -t, --token TOKEN     GitLab token with read_api permissions.
  -bs, --batch-size BATCH_SIZE
                        Projects per page for GitLab API requests (default: 100).
  --index-file INDEX_FILE
                        Path to Instance Project Index file (JSONL/NDJSON). Defaults to instance-specific name.
  --dump-projects       Rebuild the Instance Project Index even if it already exists.
  --dump-only           Only build the Instance Project Index and exit.
  -b, --branches BRANCHES
                        Shorthand for setting both --index-branches and --scan-branches.
  --index-branches INDEX_BRANCHES
                        Branch depth for building the Project Index: 'default' (store only default branch), 'all' (store all), or N limit.
  --scan-branches SCAN_BRANCHES
                        Branch scope for scanning: omit -> scan default only; 'all' -> scan all branches from index; N -> scan up to N branches (default + N-1).
  --branches-per-page BRANCHES_PER_PAGE
                        Branches per page for GitLab API requests (default: 100).
  --forks {skip,include,branch-diff,all-branches}
                        How to handle forked projects during search: skip (ignore forks), include (treat as regular projects), branch-diff (scan only base + unique branches vs upstream), all-branches (scan
                        every branch of forks).
  --fork-diff-bases FORK_DIFF_BASES
                        Comma-separated list of branch names always scanned in forks when --forks=branch-diff (default: main,master,develop,dev).
  -s, --search SEARCH   Single search term.
  -f, --terms-file TERMS_FILE
                        File with search terms (one per line).
  --session SESSION     Session name for results output (writes <name>.jsonl).
  --session-file SESSION_FILE
                        Explicit path for session results file (JSONL).
  -o, --output OUTPUT   Output file for results (optional).
  --resume              Resume search using an existing session file (if supported).
```
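
The usage line shows two mutually exclusive pairs: `-s`/`-f` and `--session`/`--session-file`. As a rough illustration of how such a command line can be declared with `argparse`, here is a sketch covering a subset of the options. It is reconstructed from the help text only and is not the package's actual `cli.py`; the function name `build_parser` is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a subset of the documented options (sketch)."""
    parser = argparse.ArgumentParser(prog="gitlab-harvester")
    parser.add_argument("-H", "--host", required=True)
    parser.add_argument("-t", "--token", required=True)
    parser.add_argument(
        "--forks",
        choices=["skip", "include", "branch-diff", "all-branches"],
    )
    parser.add_argument("--fork-diff-bases", default="main,master,develop,dev")

    # Either a single term or a terms file, never both.
    terms = parser.add_mutually_exclusive_group()
    terms.add_argument("-s", "--search")
    terms.add_argument("-f", "--terms-file")

    # Either a session name or an explicit session file path.
    session = parser.add_mutually_exclusive_group()
    session.add_argument("--session")
    session.add_argument("--session-file")

    parser.add_argument("--resume", action="store_true")
    return parser
```

Passing both `-s` and `-f` to a parser built this way makes `argparse` exit with a usage error, which matches the `[-s SEARCH | -f TERMS_FILE]` notation in the help output.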

## Useful notes

### Deduplicate results (context unique)

Searching across forks and mirrors often produces context duplicates: identical file fragments that appear in multiple repositories or branches.
Removing them is useful when:

- you only need to confirm that a secret/keyword is present,
- the same leaked token appears in dozens of forks,
- you want to reduce a 1–5 GB session file to a human-reviewable size.

The dedup script keeps only one record per unique content, while preserving the original JSONL structure.

What it does:

- hashes normalized search content,
- keeps the first occurrence,
- drops identical matches from other projects/branches.

Run:
```bash
python scripts/dedup.py \
  --input session_20250312.jsonl \
  --output session_20250312_dedup.jsonl
```

Options:

- `--no-normalize`: treat content strictly (no whitespace normalization)
- `--sqlite /path/db.sqlite`: external store for very large files.

**This is not classic deduplication by location: different repositories are preserved, but identical content matches are unified.**
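
The three steps listed under "What it does" can be sketched as follows. This is a minimal illustration, not `scripts/dedup.py` itself: the field name `content`, the whitespace-collapsing normalization, and the choice of SHA-256 are all assumptions. Hashes are held in an in-memory set here, whereas the `--sqlite` option suggests the real script can keep them on disk.

```python
import hashlib
import json
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """Collapse all whitespace runs to single spaces."""
    return " ".join(text.split())


def dedup_records(
    lines: Iterable[str],
    field: str = "content",      # hypothetical name of the matched-content field
    do_normalize: bool = True,
) -> Iterator[str]:
    """Yield JSONL lines, keeping only the first record per unique content."""
    seen: set[str] = set()
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        rec = json.loads(stripped)
        value = rec.get(field) if isinstance(rec, dict) else None
        if not isinstance(value, str):
            # Records without matched content (e.g. checkpoints) pass through.
            yield stripped
            continue
        key_text = normalize(value) if do_normalize else value
        digest = hashlib.sha256(key_text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield stripped
```

Because only a fixed-size digest is stored per unique fragment, memory use grows with the number of distinct fragments rather than with the size of the session file.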

### Convert JSONL to JSON

Session files are stored as JSONL for streaming and resume support.
For manual analysis you may want a single JSON document.

Run:
```bash
python scripts/convert_jsonl_to_json.py \
  --input session_20250312_dedup.jsonl \
  --output session_20250312.json
```

The converter produces compact, minified JSON.
For readable formatting, use jq:

```bash
jq . session_20250312.json > session_20250312_pretty.json
```
Why convert:
- easier browsing in editors,
- compatibility with SIEM/ETL tools,
- convenient diff between sessions.
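
For reference, the conversion itself is small. The sketch below reads a JSONL file and writes one minified JSON array; it assumes an array is the desired top-level shape, which the text above does not state, and it loads all records into memory, so it suits deduplicated files rather than multi-gigabyte sessions. The function name `jsonl_to_json` is hypothetical.

```python
import json
from pathlib import Path


def jsonl_to_json(src: Path, dst: Path) -> int:
    """Write all JSONL records from src into dst as one minified JSON array."""
    records = []
    with src.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():  # skip blank lines
                records.append(json.loads(line))
    with dst.open("w", encoding="utf-8") as fh:
        # separators=(",", ":") removes the spaces json.dump adds by default.
        json.dump(records, fh, ensure_ascii=False, separators=(",", ":"))
    return len(records)
```

The return value is the number of records written, which is handy as a quick sanity check against `wc -l` on the input.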

## Security note

Use only on GitLab instances where you have authorization.

---

## License

MIT