cocina 0.0.2__tar.gz
- cocina-0.0.2/LICENSE.md +19 -0
- cocina-0.0.2/PKG-INFO +370 -0
- cocina-0.0.2/README.md +346 -0
- cocina-0.0.2/cocina/__init__.py +8 -0
- cocina-0.0.2/cocina/cli.py +257 -0
- cocina-0.0.2/cocina/config_handler.py +704 -0
- cocina-0.0.2/cocina/constants.py +40 -0
- cocina-0.0.2/cocina/printer.py +350 -0
- cocina-0.0.2/cocina/utils.py +457 -0
- cocina-0.0.2/cocina.egg-info/PKG-INFO +370 -0
- cocina-0.0.2/cocina.egg-info/SOURCES.txt +16 -0
- cocina-0.0.2/cocina.egg-info/dependency_links.txt +1 -0
- cocina-0.0.2/cocina.egg-info/entry_points.txt +2 -0
- cocina-0.0.2/cocina.egg-info/requires.txt +3 -0
- cocina-0.0.2/cocina.egg-info/top_level.txt +1 -0
- cocina-0.0.2/pyproject.toml +56 -0
- cocina-0.0.2/setup.cfg +8 -0
cocina-0.0.2/LICENSE.md
ADDED
|
@@ -0,0 +1,19 @@
|
|
|
1
|
+
License
|
|
2
|
+
================================================================================
|
|
3
|
+
The source code in this repository and the data made available through this repo / tool are licensed separately.
|
|
4
|
+
|
|
5
|
+
Source code
|
|
6
|
+
--------------------------------------------------------------------------------
|
|
7
|
+
Source code is made available under the BSD License:
|
|
8
|
+
|
|
9
|
+
Copyright 2024 (c) Regents of University of California ([The Eric and Wendy Schmidt Center for Data Science and the Environment at UC Berkeley](https://dse.berkeley.edu/), [Benioff Ocean Science Laboratory](https://bosl.ucsb.edu/)).
|
|
10
|
+
|
|
11
|
+
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
|
|
12
|
+
|
|
13
|
+
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
|
|
14
|
+
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
|
|
15
|
+
3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
|
|
16
|
+
|
|
17
|
+
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|
18
|
+
|
|
19
|
+
Copyright 2024 (c) Regents of University of California ([The Eric and Wendy Schmidt Center for Data Science and the Environment at UC Berkeley](https://dse.berkeley.edu/)).
|
cocina-0.0.2/PKG-INFO
ADDED
|
@@ -0,0 +1,370 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: cocina
|
|
3
|
+
Version: 0.0.2
|
|
4
|
+
Summary: A collection of tools for building structured Python projects
|
|
5
|
+
Author-email: Brookie Guzder-Williams <bguzder-williams@berkeley.edu>
|
|
6
|
+
License: CC-BY-4.0
|
|
7
|
+
Classifier: Development Status :: 4 - Beta
|
|
8
|
+
Classifier: Intended Audience :: Developers
|
|
9
|
+
Classifier: Programming Language :: Python :: 3
|
|
10
|
+
Classifier: Programming Language :: Python :: 3.8
|
|
11
|
+
Classifier: Programming Language :: Python :: 3.9
|
|
12
|
+
Classifier: Programming Language :: Python :: 3.10
|
|
13
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
14
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
15
|
+
Classifier: Topic :: Software Development :: Libraries
|
|
16
|
+
Classifier: Topic :: Scientific/Engineering
|
|
17
|
+
Requires-Python: >=3.8
|
|
18
|
+
Description-Content-Type: text/markdown
|
|
19
|
+
License-File: LICENSE.md
|
|
20
|
+
Requires-Dist: pyyaml
|
|
21
|
+
Requires-Dist: click<9,>=8.2.1
|
|
22
|
+
Requires-Dist: build
|
|
23
|
+
Dynamic: license-file
|
|
24
|
+
|
|
25
|
+
# Cocina (WIP: status-broken moving to "cocina")
|
|
26
|
+
|
|
27
|
+
Cocina (cocina) is a collection of tools for building structured Python projects. It provides sophisticated configuration management, job execution capabilities, and a professional CLI interface.
|
|
28
|
+
|
|
29
|
+
## Core Components
|
|
30
|
+
|
|
31
|
+
1. **[ConfigHandler](#confighandler)** - Unified configuration management, constants, and environment variables
|
|
32
|
+
2. **[ConfigArgs](#configargs)** - Job-specific configuration loading with structured argument access
|
|
33
|
+
3. **[CLI](#cli)** - Command-line interface for project initialization and job execution
|
|
34
|
+
|
|
35
|
+
---
|
|
36
|
+
|
|
37
|
+
## Table of Contents
|
|
38
|
+
|
|
39
|
+
- [Getting Started](#getting-started)
|
|
40
|
+
- [Overview](#overview)
|
|
41
|
+
- [Example](#example)
|
|
42
|
+
- [Advanced Features](#advanced-features)
|
|
43
|
+
- [cocina Configuration](#cocina-configuration)
|
|
44
|
+
- [Configuration Files](#configuration-files)
|
|
45
|
+
- [ConfigHandler](#confighandler)
|
|
46
|
+
- [ConfigArgs](#configargs)
|
|
47
|
+
- [CLI](#cli)
|
|
48
|
+
- [Initialize Project](#initialize-project)
|
|
49
|
+
- [Run Jobs](#run-jobs)
|
|
50
|
+
- [Tools](#tools)
|
|
51
|
+
- [Printer](#printer)
|
|
52
|
+
- [Timer](#timer)
|
|
53
|
+
- [Development](#development)
|
|
54
|
+
- [Documentation](#documentation)
|
|
55
|
+
|
|
56
|
+
|
|
57
|
+
---
|
|
58
|
+
|
|
59
|
+
## Getting Started
|
|
60
|
+
|
|
61
|
+
**INSTALL**:
|
|
62
|
+
|
|
63
|
+
```bash
|
|
64
|
+
git clone https://github.com/SchmidtDSE/project_kit.git
|
|
65
|
+
```
|
|
66
|
+
|
|
67
|
+
Add to your `pyproject.toml`:
|
|
68
|
+
```toml
|
|
69
|
+
[tool.pixi.pypi-dependencies]
|
|
70
|
+
project_kit = { path = "path/to/project_kit", editable = true }
|
|
71
|
+
```
|
|
72
|
+
|
|
73
|
+
**INITIALIZE**:
|
|
74
|
+
|
|
75
|
+
```bash
|
|
76
|
+
pixi run cocina init --log_dir logs --package your_package_name
|
|
77
|
+
```
|
|
78
|
+
|
|
79
|
+
> See [cocina Configuration](#cocina-configuration) for detailed initialization options.
|
|
80
|
+
|
|
81
|
+
|
|
82
|
+
---
|
|
83
|
+
|
|
84
|
+
### Overview
|
|
85
|
+
|
|
86
|
+
Cocina separates **configuration** (values that can change) from **constants** (values that never change) and **job arguments** (run-specific parameters).
|
|
87
|
+
|
|
88
|
+
#### Key Concepts
|
|
89
|
+
|
|
90
|
+
- **ConfigHandler** (`ch`) - Manages constants and project configuration
|
|
91
|
+
  - Constants: `your_module/constants.py` (protected from modification)
|
|
92
|
+
  - General Config: `config/config.yaml`
|
|
93
|
+
  - Env Config: `config/<environment-name>.yaml`
|
|
94
|
+
  - Usage: `ch.DATABASE_URL`, `ch.get('MAX_SCALE', 1000)`
|
|
95
|
+
|
|
96
|
+
- **ConfigArgs** (`ca`) - Manages job-specific run configurations
|
|
97
|
+
  - Job configs: `config/args/job_name.yaml`
|
|
98
|
+
  - Usage: To run method `method_name`: `method_name(*ca.method_name.args, **ca.method_name.kwargs)`
|
|
99
|
+
|
|
100
|
+
**Note**: names of configuration and job directories and files can be customized in [.cocina](#cocina-configuration).
|
|
101
|
+
|
|
102
|
+
#### Before and After
|
|
103
|
+
|
|
104
|
+
**Traditional approach:**
|
|
105
|
+
```python
|
|
106
|
+
SOURCE = "path/to/src.parquet"
|
|
107
|
+
OUTPUT_DEST = "path/to/output"
|
|
108
|
+
|
|
109
|
+
def main():
|
|
110
|
+
    data = load_data(SOURCE, limit=1000, debug=True)
|
|
111
|
+
    data = process_data(data, scale=100, validate=False)
|
|
112
|
+
    save_data(data, OUTPUT_DEST, format="json")
|
|
113
|
+
|
|
114
|
+
if __name__ == "__main__":
|
|
115
|
+
    main()
|
|
116
|
+
```
|
|
117
|
+
|
|
118
|
+
**With Cocina:**
|
|
119
|
+
```python
|
|
120
|
+
def run(config_args):
|
|
121
|
+
    data = load_data(*config_args.load_data.args, **config_args.load_data.kwargs)
|
|
122
|
+
    data = process_data(data, *config_args.process_data.args, **config_args.process_data.kwargs)
|
|
123
|
+
    save_data(data, *config_args.save_data.args, **config_args.save_data.kwargs)
|
|
124
|
+
```
|
|
125
|
+
|
|
126
|
+
All parameters are now externalized to YAML configuration files, making scripts reusable and maintainable. CLI management and argument parsing are handled through the cocina [CLI](#cli).
|
|
127
|
+
|
|
128
|
+
### Example
|
|
129
|
+
|
|
130
|
+
**Project Structure:**
|
|
131
|
+
```
|
|
132
|
+
my_project/
|
|
133
|
+
├── my_package/  # Python package
|
|
134
|
+
│   ├── constants.py  # Project Constants (protected from modification)
|
|
135
|
+
│   ├── ...  # Modules
|
|
136
|
+
│   └── data_manager.py  # Named example python module
|
|
137
|
+
├── config/
|
|
138
|
+
│   ├── config.yaml  # Main configuration
|
|
139
|
+
│   ├── prod.yaml  # Production configuration overrides
|
|
140
|
+
│   └── args/
|
|
141
|
+
│       └── data_pipeline.yaml  # Job configuration
|
|
142
|
+
└── jobs/
|
|
143
|
+
    └── data_pipeline.py  # Job implementation
|
|
144
|
+
```
|
|
145
|
+
|
|
146
|
+
**Configuration (`config/args/data_pipeline.yaml`):**
|
|
147
|
+
```yaml
|
|
148
|
+
extract_data:
|
|
149
|
+
  args: ["source_table"]
|
|
150
|
+
  kwargs:
|
|
151
|
+
    limit: 1000
|
|
152
|
+
    debug: false
|
|
153
|
+
|
|
154
|
+
transform_data:
|
|
155
|
+
  scale: 100
|
|
156
|
+
  validate: true
|
|
157
|
+
|
|
158
|
+
save_data:
|
|
159
|
+
- "output_table"
|
|
160
|
+
```
|
|
161
|
+
|
|
162
|
+
**Job Implementation (`jobs/data_pipeline.py`):**
|
|
163
|
+
```python
|
|
164
|
+
def run(config_args, printer=None):
|
|
165
|
+
    data = extract_data(*config_args.extract_data.args, **config_args.extract_data.kwargs)
|
|
166
|
+
    data = transform_data(data, *config_args.transform_data.args, **config_args.transform_data.kwargs)
|
|
167
|
+
    save_data(data, *config_args.save_data.args, **config_args.save_data.kwargs)
|
|
168
|
+
```
|
|
169
|
+
|
|
170
|
+
**Running Jobs:**
|
|
171
|
+
```bash
|
|
172
|
+
# Default environment
|
|
173
|
+
pixi run cocina job data_pipeline
|
|
174
|
+
|
|
175
|
+
# Production environment
|
|
176
|
+
pixi run cocina job data_pipeline --env prod
|
|
177
|
+
```
|
|
178
|
+
|
|
179
|
+
#### RUN AND MAIN METHODS
|
|
180
|
+
|
|
181
|
+
When running a job, the CLI requires the job module to define one of the following: a `run` method that takes `config_args: ConfigArgs` and `printer: Printer`, a `run` method that takes only `config_args: ConfigArgs`, or a `main` method that takes no arguments.
|
|
182
|
+
|
|
183
|
+
Priority ordering is:
|
|
184
|
+
|
|
185
|
+
1. `run(config_args, printer)` | passing both a `ConfigArgs` and `Printer` instance
|
|
186
|
+
2. `run(config_args)` | passing a `ConfigArgs` instance
|
|
187
|
+
3. `main()` | for jobs without configuration (legacy scripts)
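
The priority order above can be illustrated with a small dispatcher. This is only a sketch of the idea, not cocina's actual implementation: `call_entry_point` is a hypothetical helper, the parameter-count check via `inspect.signature` is an assumption about how the choice could be made, and the job modules are stand-ins built with `SimpleNamespace`.

```python
import inspect
from types import SimpleNamespace


def call_entry_point(module, config_args, printer):
    """Hypothetical dispatcher: prefer run(config_args, printer), then run(config_args), then main()."""
    run = getattr(module, "run", None)
    if callable(run):
        # Count the declared parameters to decide whether a printer can be passed.
        n_params = len(inspect.signature(run).parameters)
        if n_params >= 2:
            return run(config_args, printer)
        return run(config_args)
    main = getattr(module, "main", None)
    if callable(main):
        # Legacy scripts: no configuration is passed at all.
        return main()
    raise AttributeError("job module defines neither run() nor main()")


# Stand-in "job modules" for illustration only.
job_a = SimpleNamespace(run=lambda config_args, printer=None: ("run+printer", config_args, printer))
job_b = SimpleNamespace(run=lambda config_args: ("run", config_args))
job_c = SimpleNamespace(main=lambda: "main")

for job in (job_a, job_b, job_c):
    print(call_entry_point(job, "config-args", "printer"))
```

Checking the signature rather than catching a `TypeError` avoids masking errors raised inside the job itself.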
|
|
188
|
+
|
|
189
|
+
|
|
190
|
+
#### USER CODEBASE/NOTEBOOKS
|
|
191
|
+
|
|
192
|
+
Although the main focus is on building and running configured "jobs", [ConfigArgs](#configargs) can also be used in your code (a notebook for example):
|
|
193
|
+
|
|
194
|
+
```python
|
|
195
|
+
# Load job-specific configuration
|
|
196
|
+
ca = ConfigArgs('job_group_1.job_a1')
|
|
197
|
+
jobs.job_group_1.job_a1.step_1(*ca.step_1.args, **ca.step_1.kwargs)
|
|
198
|
+
```
|
|
199
|
+
|
|
200
|
+
---
|
|
201
|
+
|
|
202
|
+
## cocina Configuration
|
|
203
|
+
|
|
204
|
+
The `.cocina` file contains project settings and must be in your project root. It defines:
|
|
205
|
+
- Configuration file locations and naming conventions
|
|
206
|
+
- Project root directory location
|
|
207
|
+
- Environment variable names
|
|
208
|
+
|
|
209
|
+
**Required:** Every project must have a `.cocina` file at the root.
|
|
210
|
+
|
|
211
|
+
**Options** (for `cocina init`):
|
|
212
|
+
- `--log_dir`: Enable automatic log file creation
|
|
213
|
+
- `--package`: Specify main package for constants loading
|
|
214
|
+
- `--force`: Overwrite existing `.cocina` file
|
|
215
|
+
|
|
216
|
+
---
|
|
217
|
+
|
|
218
|
+
## Configuration Files
|
|
219
|
+
|
|
220
|
+
Cocina uses YAML files in the `config/` directory:
|
|
221
|
+
|
|
222
|
+
```
|
|
223
|
+
config/
|
|
224
|
+
├── config.yaml # Main configuration
|
|
225
|
+
├── dev.yaml # Development environment overrides
|
|
226
|
+
├── prod.yaml # Production environment overrides
|
|
227
|
+
└── args/ # Job-specific configurations
|
|
228
|
+
    ├── job_name.yaml # Individual job config
|
|
229
|
+
    └── group_name/ # Grouped job configs
|
|
230
|
+
        └── job_a.yaml
|
|
231
|
+
```
|
|
232
|
+
|
|
233
|
+
**Configuration Types:**
|
|
234
|
+
- **Main Config**: `config.yaml` - shared across all environments
|
|
235
|
+
- **Environment Config**: `{env}.yaml` - environment-specific overrides
|
|
236
|
+
- **Job Config**: `args/{job}.yaml` - job-specific parameters and arguments
|
|
237
|
+
|
|
238
|
+
### ConfigHandler
|
|
239
|
+
|
|
240
|
+
Manages constants and main configuration with environment support.
|
|
241
|
+
|
|
242
|
+
```python
|
|
243
|
+
from cocina.config_handler import ConfigHandler
|
|
244
|
+
|
|
245
|
+
ch = ConfigHandler()
|
|
246
|
+
print(ch.DATABASE_URL) # From config.yaml
|
|
247
|
+
print(ch.MAX_SCALE) # From constants.py (protected)
|
|
248
|
+
```
|
|
249
|
+
|
|
250
|
+
**Features:**
|
|
251
|
+
- Loads constants from `your_package/constants.py`
|
|
252
|
+
- Loads configuration from `config/config.yaml`
|
|
253
|
+
- Environment-specific overrides from `config/{env}.yaml`
|
|
254
|
+
- Dict-style and attribute access patterns
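
To make the layering described above concrete, here is a minimal sketch that uses plain dictionaries in place of `constants.py` and the YAML files. `LayeredConfig` is a hypothetical stand-in, not the real `ConfigHandler`. In particular, raising an error when a config key collides with a constant is an assumption about what "protected" means here.

```python
class LayeredConfig:
    """Hypothetical stand-in: constants + main config + environment overrides."""

    def __init__(self, constants, config, env_config=None):
        env_config = env_config or {}
        # Assumption: constants are protected, so config files may not redefine them.
        clash = set(constants) & (set(config) | set(env_config))
        if clash:
            raise ValueError(f"config may not override constants: {sorted(clash)}")
        # Later layers win: environment values override the main config.
        self._values = {**config, **env_config, **constants}

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        values = self.__dict__.get("_values", {})
        try:
            return values[name]
        except KeyError:
            raise AttributeError(name) from None

    def __getitem__(self, key):
        return self._values[key]

    def get(self, key, default=None):
        return self._values.get(key, default)


ch = LayeredConfig(
    constants={"MAX_SCALE": 1000},
    config={"DATABASE_URL": "sqlite:///dev.db", "DEBUG": True},
    env_config={"DATABASE_URL": "postgres://prod-host/db"},
)
print(ch.DATABASE_URL)       # environment override wins over config.yaml
print(ch["MAX_SCALE"])       # dict-style access to a constant
print(ch.get("MISSING", 0))  # default for an unknown key
```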
|
|
255
|
+
|
|
256
|
+
### ConfigArgs
|
|
257
|
+
|
|
258
|
+
Loads job-specific configurations with structured argument access.
|
|
259
|
+
|
|
260
|
+
```python
|
|
261
|
+
from cocina.config_handler import ConfigArgs
|
|
262
|
+
|
|
263
|
+
ca = ConfigArgs('data_pipeline')
|
|
264
|
+
# Access method arguments
|
|
265
|
+
ca.extract_data.args # ["source_table"]
|
|
266
|
+
ca.extract_data.kwargs # {"limit": 1000, "debug": False}
|
|
267
|
+
```
|
|
268
|
+
|
|
269
|
+
**YAML Configuration Parsing:**
|
|
270
|
+
- Dict with `args`/`kwargs` keys → extracts args and kwargs
|
|
271
|
+
- Dict without special keys → `args=[]`, `kwargs=dict`
|
|
272
|
+
- List/tuple → `args=value`, `kwargs={}`
|
|
273
|
+
- Single value → `args=[value]`, `kwargs={}`
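
The four parsing rules above amount to a small normalization step. The sketch below shows one way to express them. `normalize_entry` is a hypothetical name, and ignoring extra keys next to `args`/`kwargs` is an assumption, since the README does not say how mixed dicts are handled. The sample values mirror the `data_pipeline.yaml` example after YAML parsing.

```python
from typing import Any, Dict, List, Tuple


def normalize_entry(value: Any) -> Tuple[List[Any], Dict[str, Any]]:
    """Hypothetical helper: map one parsed YAML value to (args, kwargs)."""
    if isinstance(value, dict):
        if "args" in value or "kwargs" in value:
            # Dict with special keys: pull out args and kwargs explicitly.
            return list(value.get("args") or []), dict(value.get("kwargs") or {})
        # Plain dict: every key becomes a keyword argument.
        return [], dict(value)
    if isinstance(value, (list, tuple)):
        # Sequence: positional arguments only.
        return list(value), {}
    # Scalar: a single positional argument.
    return [value], {}


# Values as they would look after parsing the data_pipeline.yaml example.
job_config = {
    "extract_data": {"args": ["source_table"], "kwargs": {"limit": 1000, "debug": False}},
    "transform_data": {"scale": 100, "validate": True},
    "save_data": ["output_table"],
}
for name, value in job_config.items():
    args, kwargs = normalize_entry(value)
    print(name, args, kwargs)
```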
|
|
274
|
+
|
|
275
|
+
**Features:**
|
|
276
|
+
- Environment-specific overrides
|
|
277
|
+
- Reference resolution from main config
|
|
278
|
+
- Dynamic value substitution
|
|
279
|
+
|
|
280
|
+
---
|
|
281
|
+
|
|
282
|
+
## CLI
|
|
283
|
+
|
|
284
|
+
### Initialize Project
|
|
285
|
+
|
|
286
|
+
```bash
|
|
287
|
+
pixi run cocina init --log_dir logs --package your_package
|
|
288
|
+
```
|
|
289
|
+
|
|
290
|
+
### Run Jobs
|
|
291
|
+
|
|
292
|
+
```bash
|
|
293
|
+
# Run a single job
|
|
294
|
+
pixi run cocina job data_pipeline
|
|
295
|
+
|
|
296
|
+
# Run with specific environment
|
|
297
|
+
pixi run cocina job data_pipeline --env prod
|
|
298
|
+
|
|
299
|
+
# Run multiple jobs
|
|
300
|
+
pixi run cocina job job1 job2 job3
|
|
301
|
+
|
|
302
|
+
# Dry run (validate without executing)
|
|
303
|
+
pixi run cocina job data_pipeline --dry_run
|
|
304
|
+
```
|
|
305
|
+
|
|
306
|
+
**Options:**
|
|
307
|
+
- `--env`: Environment configuration to use (dev, prod, etc.)
|
|
308
|
+
- `--verbose`: Enable detailed output
|
|
309
|
+
- `--dry_run`: Validate configuration without running
|
|
310
|
+
|
|
311
|
+
|
|
312
|
+
---
|
|
313
|
+
|
|
314
|
+
## Tools
|
|
315
|
+
|
|
316
|
+
### Printer
|
|
317
|
+
Professional output with timestamps, headers, and optional file logging.
|
|
318
|
+
|
|
319
|
+
```python
|
|
320
|
+
from cocina.printer import Printer
|
|
321
|
+
|
|
322
|
+
printer = Printer(header='MyApp')
|
|
323
|
+
printer.start('Processing begins')
|
|
324
|
+
printer.message('Status update', count=42, status='ok')
|
|
325
|
+
printer.stop('Complete')
|
|
326
|
+
```
|
|
327
|
+
|
|
328
|
+
### Timer
|
|
329
|
+
Simple timing functionality with duration tracking.
|
|
330
|
+
|
|
331
|
+
```python
|
|
332
|
+
from cocina.utils import Timer
|
|
333
|
+
|
|
334
|
+
timer = Timer()
|
|
335
|
+
timer.start() # Start timing
|
|
336
|
+
print(timer.state()) # Current elapsed time
|
|
337
|
+
print(timer.now()) # Current timestamp
|
|
338
|
+
stop_time = timer.stop() # Stop timing
|
|
339
|
+
print(timer.delta()) # Total duration string
|
|
340
|
+
```
|
|
341
|
+
|
|
342
|
+
> See [complete documentation](docs/) for all utility functions and helpers.
|
|
343
|
+
|
|
344
|
+
---
|
|
345
|
+
|
|
346
|
+
## Development
|
|
347
|
+
|
|
348
|
+
**Requirements:** Managed with [Pixi](https://pixi.sh/latest) - no manual environment setup needed.
|
|
349
|
+
|
|
350
|
+
```bash
|
|
351
|
+
# All commands use pixi
|
|
352
|
+
pixi run jupyter lab
|
|
353
|
+
```
|
|
354
|
+
|
|
355
|
+
**Style:** Follows PEP8 standards. See [setup.cfg](./setup.cfg) for project-specific rules.
|
|
356
|
+
|
|
357
|
+
---
|
|
358
|
+
|
|
359
|
+
## Documentation
|
|
360
|
+
|
|
361
|
+
- **[Getting Started](/wiki/getting-started)** - Installation, initialization, and first job
|
|
362
|
+
- **[Configuration Guide](/wiki/configuration)** - Complete configuration management
|
|
363
|
+
- **[Job System](/wiki/jobs)** - Creating and running jobs
|
|
364
|
+
- **[CLI Reference](/wiki/cli)** - Command-line interface
|
|
365
|
+
- **[Examples](/wiki/examples)** - Detailed usage examples
|
|
366
|
+
- **[Advanced Topics](/wiki/advanced)** - Complex patterns and extensions
|
|
367
|
+
|
|
368
|
+
## License
|
|
369
|
+
|
|
370
|
+
CC-BY-4.0
|