datalore-cli 0.4.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- datalore_cli-0.4.1/LICENSE +21 -0
- datalore_cli-0.4.1/PKG-INFO +347 -0
- datalore_cli-0.4.1/README.md +318 -0
- datalore_cli-0.4.1/datalore_cli.egg-info/PKG-INFO +347 -0
- datalore_cli-0.4.1/datalore_cli.egg-info/SOURCES.txt +14 -0
- datalore_cli-0.4.1/datalore_cli.egg-info/dependency_links.txt +1 -0
- datalore_cli-0.4.1/datalore_cli.egg-info/entry_points.txt +2 -0
- datalore_cli-0.4.1/datalore_cli.egg-info/requires.txt +5 -0
- datalore_cli-0.4.1/datalore_cli.egg-info/top_level.txt +1 -0
- datalore_cli-0.4.1/datalore_tool/__init__.py +5 -0
- datalore_cli-0.4.1/datalore_tool/__main__.py +5 -0
- datalore_cli-0.4.1/datalore_tool/cli.py +502 -0
- datalore_cli-0.4.1/datalore_tool/core.py +1547 -0
- datalore_cli-0.4.1/pyproject.toml +42 -0
- datalore_cli-0.4.1/setup.cfg +4 -0
- datalore_cli-0.4.1/tests/test_cli.py +393 -0
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Mihajlo Micic

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,347 @@
Metadata-Version: 2.4
Name: datalore-cli
Version: 0.4.1
Summary: Deterministic local data analysis CLI for coding agents.
Author: Mihajlo Micic
License-Expression: MIT
Project-URL: Repository, https://github.com/micic-mihajlo/datalore-cli
Project-URL: Issues, https://github.com/micic-mihajlo/datalore-cli/issues
Keywords: cli,data-analysis,agents,codex,claude-code,pandas
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas
Requires-Dist: matplotlib
Requires-Dist: seaborn
Requires-Dist: scikit-learn
Requires-Dist: numpy
Dynamic: license-file

# Datalore

Datalore is a deterministic local data-analysis CLI designed to amplify coding agents.

The core idea is simple: when Codex, Claude Code, or a human operator needs a reliable answer about a local dataset, they should call a stable binary with explicit arguments and parse structured output instead of improvising pandas, sklearn, and matplotlib every time.

## What It Does

- Inspect local CSV, Excel, and JSON datasets
- Profile columns, dtypes, missing values, and duplicates
- Generate structured descriptive statistics
- Generate structured correlation reports
- Run deterministic linear regression
- Apply deterministic cleaning and export operations
- Compare two datasets structurally and at the row level
- Generate reproducible plot artifacts
- Diagnose runtime/install issues with a doctor command
- Emit either human-readable text or machine-readable JSON
- Optionally persist the full JSON envelope to a report file
- Keep outputs under `artifacts/` by default

## Install

Current install flow, from any machine where you want to analyze local data:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install 'git+https://github.com/micic-mihajlo/datalore-cli.git'
```

After that, the public interface is just:

```bash
datalore --help
```

The repo-local `./datalore` wrapper is only a development convenience for contributors working inside this repository.

If you are developing Datalore itself, install it in editable mode from the repo root:

```bash
pip install -e .
```

## Use In Your Own Project

Move into the folder that contains your data:

```bash
cd /path/to/my/data
datalore init
```

That creates:
- `AGENTS.md`
- `CLAUDE.md`
- `./datalore`
- `artifacts/`
- a `.gitignore` entry for `artifacts/`

After that, both you and your coding agent can work from that folder directly:

```bash
./datalore files . --format json
./datalore profile sales.csv --format json
./datalore clean sales.csv --rename sales=>revenue --output artifacts/datasets/sales_clean.csv --format json
```

Why the wrapper matters:
- humans can keep using `datalore`
- agents are usually safer using `./datalore` because it binds the project to the exact Python interpreter that installed Datalore

## CLI Usage

Inspect a dataset:

```bash
datalore inspect data.csv
```

Profile a dataset in JSON mode for agent consumption:

```bash
datalore profile data.csv --format json
```

Run a simple regression:

```bash
datalore regression data.csv --target last --format json
```

Constrain the predictors:

```bash
datalore regression data.csv --target revenue --predictors spend,visits,leads --format json
```

Create a plot with a stable output path:

```bash
datalore plot data.csv --kind histogram --x revenue --output artifacts/plots/revenue_hist.png --format json
```

Generate a correlation heatmap:

```bash
datalore plot data.csv --kind correlation-heatmap --format json
```

Compare two datasets:

```bash
datalore compare before.csv after.csv --format json
datalore compare before.csv after.csv --key-columns id --format json
```

Generate structured summary statistics:

```bash
./datalore summary data.csv --format json
```

Generate a structured correlation report:

```bash
./datalore correlate data.csv --target revenue --min-abs-correlation 0.3 --format json
```

Clean and export a dataset deterministically:

```bash
datalore clean data.csv --fill-numeric median --drop-duplicates --output artifacts/datasets/data_clean.csv --format json
```

Apply deterministic transforms without falling back to ad hoc pandas:

```bash
datalore clean data.csv \
  --rename sales=>revenue \
  --derive sales_per_customer=>revenue\|/\|customers \
  --filter 'customers>=24' \
  --limit 100 \
  --output artifacts/datasets/data_prepared.csv \
  --format json
```

Write the full structured result to disk for downstream tooling:

```bash
datalore profile data.csv --format json --report-file artifacts/reports/profile.json
```
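The persisted report is just the JSON envelope written to disk, so downstream tooling can consume it with nothing but the standard library. A minimal sketch of that consumption step, assuming only the four documented envelope keys (`schema_version`, `command`, `result`, `error`); the values inside `result` below are invented for illustration, not a documented payload:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample envelope: only the four top-level keys are
# documented; the "result" fields are made up for this sketch.
sample = {
    "schema_version": 1,
    "command": "profile",
    "result": {"rows": 120, "columns": 6},
    "error": None,
}
report = Path(tempfile.mkdtemp()) / "profile.json"
report.write_text(json.dumps(sample))

def load_report(path):
    """Load a --report-file envelope and return its result payload."""
    envelope = json.loads(Path(path).read_text())
    missing = {"schema_version", "command", "result", "error"} - envelope.keys()
    if missing:
        raise ValueError(f"not a datalore envelope, missing: {sorted(missing)}")
    if envelope["error"]:
        raise RuntimeError(f"{envelope['command']} failed: {envelope['error']}")
    return envelope["result"]

profile = load_report(report)
```

Checking the envelope keys before trusting the payload keeps chained workflows from silently consuming a half-written or non-datalore file.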
Find local datasets, including gitignored files in the repo:

```bash
datalore files . --format json
```

Inspect the runtime/install state:

```bash
datalore doctor --format json
```

Initialize a non-repo project for Codex and Claude Code:

```bash
datalore init
datalore init --agent codex
```

## Command Surface

`datalore inspect`
- Quick shape, dtype, missing-value, duplicate, and preview summary.

`datalore profile`
- Adds per-column detail on top of `inspect`.

`datalore regression`
- Runs a deterministic linear regression using numeric columns only.

`datalore summary`
- Emits machine-readable descriptive statistics for numeric and categorical columns.

`datalore correlate`
- Emits a machine-readable correlation matrix, strongest pairs, and optional target-column ranking.

`datalore clean`
- Applies deterministic cleaning and export operations such as filter, derive, rename, select, fill, drop NA, dedupe, sort, limit, and standardize.
- The transform pipeline runs in a fixed order: `rename -> derive -> filter -> select -> fill -> drop NA -> dedupe -> standardize -> sort -> limit`.
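To make the fixed ordering concrete, here is a rough pandas equivalent of the multi-flag `clean` example from the usage section above. This is a conceptual sketch of what rename-then-derive-then-filter-then-limit means, not Datalore's actual implementation:

```python
import pandas as pd

# Conceptual equivalent of:
#   datalore clean data.csv --rename sales=>revenue \
#     --derive sales_per_customer=>revenue|/|customers \
#     --filter 'customers>=24' --limit 100
# The steps run in the documented fixed order, so the filter and the
# derived column both see the *renamed* column names.
df = pd.DataFrame({
    "sales": [100, 200, 300],
    "customers": [20, 25, 30],
})

cleaned = (
    df.rename(columns={"sales": "revenue"})                                # rename
      .assign(sales_per_customer=lambda d: d["revenue"] / d["customers"])  # derive
      .query("customers >= 24")                                            # filter
      .head(100)                                                           # limit
)
```

Because ordering is fixed, a `--filter` expression can safely reference columns introduced by `--rename` or `--derive`, which is exactly what makes the pipeline deterministic across runs.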
`datalore compare`
- Compares two datasets for shape, schema, dtype, missing-value, duplicate, and row-level changes. With `--key-columns`, it can report changed rows keyed by business identifiers.

`datalore files`
- Discovers local dataset files by walking the workspace, including gitignored inputs that `rg --files` would skip.

`datalore plot`
- Generates histogram, scatter, line, bar, or correlation heatmap artifacts.

`datalore doctor`
- Reports interpreter, artifact-root, and dependency status for install/runtime debugging.

`datalore init`
- Scaffolds `AGENTS.md`, `CLAUDE.md`, a local `./datalore` wrapper, `artifacts/`, and an `artifacts/` ignore rule in the current project.

## Reliability

- JSON output uses a stable envelope with `schema_version`, `command`, `result`, and `error`
- every command supports `--report-file` for downstream automation
- `datalore` works from any folder after installation
- `./datalore` prefers the repo venv automatically when developing inside this repo
- `datalore doctor` gives a stable diagnostics entrypoint when the runtime is suspect
- startup no longer crashes on `--help` if scientific dependencies are missing
- CI runs editable install, distribution builds, and the CLI test suite on multiple Python versions via `.github/workflows/ci.yml`
## Agent-Friendly Conventions

In a normal user project, prefer:

```bash
./datalore <command> ... --format json
```

Why:
- JSON output is stable and easy to parse
- exit codes are predictable
- artifact paths are explicit
- the analysis method is fixed rather than improvised
- `--report-file` gives you a stable on-disk result for chained workflows
- `datalore files` avoids the common failure mode where the dataset is gitignored and invisible to `rg --files`
- `datalore init` gives Codex and Claude Code repo-local instructions plus a reliable local wrapper in the user's own data project
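In script form, that convention boils down to: invoke the wrapper with `--format json`, parse stdout, and branch on the envelope's `error` field. A hedged sketch of that glue code, assuming only the envelope keys listed in the Reliability section; `run_datalore` is illustration, not a shipped API:

```python
import json
import subprocess

def unwrap(envelope):
    """Return the result payload, or raise if the envelope reports an error."""
    if envelope.get("error"):
        raise RuntimeError(f"{envelope.get('command')} failed: {envelope['error']}")
    return envelope["result"]

def run_datalore(*args):
    """Invoke the project-local wrapper in JSON mode and unwrap its output."""
    proc = subprocess.run(
        ["./datalore", *args, "--format", "json"],
        capture_output=True, text=True, check=True,
    )
    return unwrap(json.loads(proc.stdout))

# Usage, in a project where `datalore init` created the wrapper:
#   profile = run_datalore("profile", "sales.csv")
```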
Inside the Datalore repository itself, contributors can still prefer:

```bash
./datalore <command> ... --format json
```

Repo-native instructions are included for both ecosystems:
- `AGENTS.md` for Codex
- `CLAUDE.md` for Claude Code
- `.agents/skills/` for Codex skills
- `.claude/skills/` for Claude Code skills

These instructions tell agents when to use Datalore instead of writing ad hoc analysis code.

## Artifacts

Generated outputs go under `artifacts/` by default:

- `artifacts/plots/`
- `artifacts/datasets/`
- `artifacts/reports/`
- `artifacts/scripts/`

Override the root with:

```bash
export DATALORE_ARTIFACT_DIR=/path/to/artifacts
```

## Publishing

The repository is set up for PyPI Trusted Publishing through GitHub Actions.

Release flow:

```bash
git tag v0.4.1
git push origin v0.4.1
gh release create v0.4.1 --generate-notes
```

Before the first release, create or register the `datalore-cli` project on PyPI and add a Trusted Publisher that matches:

- owner: `micic-mihajlo`
- repository: `datalore-cli`
- workflow: `.github/workflows/publish.yml`
- environment: `pypi`

If the project does not exist yet on PyPI, use PyPI's pending publisher flow for the first release. After that, publishing happens from GitHub Actions without local API tokens.

## Testing

Run the CLI test suite:

```bash
.venv/bin/python -m unittest discover -s tests -v
```

Smoke test the fixture dataset manually:

```bash
./datalore files tests --format json
./datalore inspect tests/fixtures/mini_dataset.csv --format json
./datalore summary tests/fixtures/mini_dataset.csv --format json
./datalore correlate tests/fixtures/mini_dataset.csv --target sales --format json
./datalore regression tests/fixtures/mini_dataset.csv --target sales --predictors marketing,customers --format json
./datalore clean tests/fixtures/mini_dataset.csv --rename sales=>revenue --derive sales_per_customer=>revenue\|/\|customers --filter 'customers>=24' --limit 3 --format json
./datalore clean tests/fixtures/mini_dataset_changed.csv --fill-numeric median --output artifacts/datasets/mini_clean.csv --format json
./datalore compare tests/fixtures/mini_dataset.csv tests/fixtures/mini_dataset_changed.csv --key-columns month --format json
./datalore doctor --format json
```

Smoke test the installed package from outside the repo:

```bash
mkdir -p /tmp/datalore-demo
cp tests/fixtures/mini_dataset.csv /tmp/datalore-demo/
cd /tmp/datalore-demo
datalore init
./datalore files . --format json
./datalore profile mini_dataset.csv --format json
```