cocina 0.0.2__tar.gz

@@ -0,0 +1,19 @@
1
+ License
2
+ ================================================================================
3
+ The source code in this repository and the data made available through this repo / tool are licensed separately.
4
+
5
+ Source code
6
+ --------------------------------------------------------------------------------
7
+ Source code is made available under the BSD License:
8
+
9
+ Copyright 2024 (c) Regents of University of California ([The Eric and Wendy Schmidt Center for Data Science and the Environment at UC Berkeley](https://dse.berkeley.edu/), [Benioff Ocean Science Laboratory](https://bosl.ucsb.edu/)).
10
+
11
+ Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
12
+
13
+ 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
14
+ 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
15
+ 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
16
+
17
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
18
+
19
+ Copyright 2024 (c) Regents of University of California ([The Eric and Wendy Schmidt Center for Data Science and the Environment at UC Berkeley](https://dse.berkeley.edu/)).
cocina-0.0.2/PKG-INFO ADDED
@@ -0,0 +1,370 @@
1
+ Metadata-Version: 2.4
2
+ Name: cocina
3
+ Version: 0.0.2
4
+ Summary: A collection of tools for building structured Python projects
5
+ Author-email: Brookie Guzder-Williams <bguzder-williams@berkeley.edu>
6
+ License: CC-BY-4.0
7
+ Classifier: Development Status :: 4 - Beta
8
+ Classifier: Intended Audience :: Developers
9
+ Classifier: Programming Language :: Python :: 3
10
+ Classifier: Programming Language :: Python :: 3.8
11
+ Classifier: Programming Language :: Python :: 3.9
12
+ Classifier: Programming Language :: Python :: 3.10
13
+ Classifier: Programming Language :: Python :: 3.11
14
+ Classifier: Programming Language :: Python :: 3.12
15
+ Classifier: Topic :: Software Development :: Libraries
16
+ Classifier: Topic :: Scientific/Engineering
17
+ Requires-Python: >=3.8
18
+ Description-Content-Type: text/markdown
19
+ License-File: LICENSE.md
20
+ Requires-Dist: pyyaml
21
+ Requires-Dist: click<9,>=8.2.1
22
+ Requires-Dist: build
23
+ Dynamic: license-file
24
+
25
+ # Cocina (WIP: status-broken moving to "cocina")
26
+
27
+ Cocina (cocina) is a collection of tools for building structured Python projects. It provides configuration management, job execution, and a command-line interface (CLI).
28
+
29
+ ## Core Components
30
+
31
+ 1. **[ConfigHandler](#confighandler)** - Unified configuration management, constants, and environment variables
32
+ 2. **[ConfigArgs](#configargs)** - Job-specific configuration loading with structured argument access
33
+ 3. **[CLI](#cli)** - Command-line interface for project initialization and job execution
34
+
35
+ ---
36
+
37
+ ## Table of Contents
38
+
39
+ - [Getting Started](#getting-started)
40
+   - [Overview](#overview)
41
+   - [Example](#example)
42
+   - [Advanced Features](#advanced-features)
43
+ - [cocina Configuration](#cocina-configuration)
44
+ - [Configuration Files](#configuration-files)
45
+   - [ConfigHandler](#confighandler)
46
+   - [ConfigArgs](#configargs)
47
+ - [CLI](#cli)
48
+   - [Initialize Project](#initialize-project)
49
+   - [Run Jobs](#run-jobs)
50
+ - [Tools](#tools)
51
+   - [Printer](#printer)
52
+   - [Timer](#timer)
53
+ - [Development](#development)
54
+ - [Documentation](#documentation)
55
+
56
+
57
+ ---
58
+
59
+ ## Getting Started
60
+
61
+ **INSTALL**:
62
+
63
+ ```bash
64
+ git clone https://github.com/SchmidtDSE/project_kit.git
65
+ ```
66
+
67
+ Add to your `pyproject.toml`:
68
+ ```toml
69
+ [tool.pixi.pypi-dependencies]
70
+ project_kit = { path = "path/to/project_kit", editable = true }
71
+ ```
72
+
73
+ **INITIALIZE**:
74
+
75
+ ```bash
76
+ pixi run cocina init --log_dir logs --package your_package_name
77
+ ```
78
+
79
+ > See [cocina Configuration](#cocina-configuration) for detailed initialization options.
80
+
81
+
82
+ ---
83
+
84
+ ### Overview
85
+
86
+ Cocina separates **configuration** (values that can change) from **constants** (values that never change) and **job arguments** (run-specific parameters).
87
+
88
+ #### Key Concepts
89
+
90
+ - **ConfigHandler** (`ch`) - Manages constants and project configuration
91
+   - Constants: `your_module/constants.py` (protected from modification)
92
+   - General Config: `config/config.yaml`
93
+   - Env Config: `config/<environment-name>.yaml`
94
+   - Usage: `ch.DATABASE_URL`, `ch.get('MAX_SCALE', 1000)`
95
+
96
+ - **ConfigArgs** (`ca`) - Manages job-specific run configurations
97
+   - Job configs: `config/args/job_name.yaml`
98
+   - Usage: To run method `method_name`: `method_name(*ca.method_name.args, **ca.method_name.kwargs)`
99
+
100
+ **Note**: names of configuration and job directories and files can be customized in [.cocina](#cocina-configuration).
101
+
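
As a rough illustration of the `args`/`kwargs` calling pattern above, the sketch below uses a stand-in object built with `types.SimpleNamespace` in place of a real `ConfigArgs` instance. The `load_data` function and its values are hypothetical and exist only to show how the unpacking works.

```python
from types import SimpleNamespace

# Stand-in for a ConfigArgs instance (assumption: the real object exposes
# `.args` and `.kwargs` for each configured method name, as described above).
ca = SimpleNamespace(
    load_data=SimpleNamespace(
        args=["path/to/src.parquet"],
        kwargs={"limit": 1000, "debug": True},
    ),
)


def load_data(source, limit=None, debug=False):
    """Hypothetical job step that simply echoes what it received."""
    return {"source": source, "limit": limit, "debug": debug}


# Positional values come from `.args`, keyword values from `.kwargs`.
result = load_data(*ca.load_data.args, **ca.load_data.kwargs)
```

The call site never names a concrete value, so changing `limit` or `debug` only requires editing the YAML file.
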
102
+ #### Before and After
103
+
104
+ **Traditional approach:**
105
+ ```python
106
+ SOURCE = "path/to/src.parquet"
107
+ OUTPUT_DEST = "path/to/output"
108
+
109
+ def main():
110
+     data = load_data(SOURCE, limit=1000, debug=True)
111
+     data = process_data(data, scale=100, validate=False)
112
+     save_data(data, OUTPUT_DEST, format="json")
113
+
114
+ if __name__ == "__main__":
115
+     main()
116
+ ```
117
+
118
+ **With Cocina:**
119
+ ```python
120
+ def run(config_args):
121
+     data = load_data(*config_args.load_data.args, **config_args.load_data.kwargs)
122
+     data = process_data(data, *config_args.process_data.args, **config_args.process_data.kwargs)
123
+     save_data(data, *config_args.save_data.args, **config_args.save_data.kwargs)
124
+ ```
125
+
126
+ All parameters are now externalized to YAML configuration files, making scripts reusable and maintainable. CLI management and argument parsing are handled through the cocina [CLI](#cli).
127
+
128
+ ### Example
129
+
130
+ **Project Structure:**
131
+ ```
132
+ my_project/
133
+ ├── my_package/                  # Python package
134
+ │   ├── constants.py             # Project Constants (protected from modification)
135
+ │   ├── ...                      # Modules
136
+ │   └── data_manager.py          # Named example python module
137
+ ├── config/
138
+ │   ├── config.yaml              # Main configuration
139
+ │   ├── prod.yaml                # Production configuration overrides
140
+ │   └── args/
141
+ │       └── data_pipeline.yaml   # Job configuration
142
+ └── jobs/
143
+     └── data_pipeline.py         # Job implementation
144
+ ```
145
+
146
+ **Configuration (`config/args/data_pipeline.yaml`):**
147
+ ```yaml
148
+ extract_data:
149
+   args: ["source_table"]
150
+   kwargs:
151
+     limit: 1000
152
+     debug: false
153
+
154
+ transform_data:
155
+   scale: 100
156
+   validate: true
157
+
158
+ save_data:
159
+   - "output_table"
160
+ ```
161
+
162
+ **Job Implementation (`jobs/data_pipeline.py`):**
163
+ ```python
164
+ def run(config_args, printer=None):
165
+     data = extract_data(*config_args.extract_data.args, **config_args.extract_data.kwargs)
166
+     data = transform_data(data, *config_args.transform_data.args, **config_args.transform_data.kwargs)
167
+     save_data(data, *config_args.save_data.args, **config_args.save_data.kwargs)
168
+ ```
169
+
170
+ **Running Jobs:**
171
+ ```bash
172
+ # Default environment
173
+ pixi run cocina job data_pipeline
174
+
175
+ # Production environment
176
+ pixi run cocina job data_pipeline --env prod
177
+ ```
178
+
179
+ #### RUN AND MAIN METHODS
180
+
181
+ When running a job, the CLI requires the job module to define one of the following: a `run` method that takes `config_args: ConfigArgs` and `printer: Printer`, a `run` method that takes only `config_args: ConfigArgs`, or a `main` method that takes no arguments.
182
+
183
+ Priority ordering is:
184
+
185
+ 1. `run(config_args, printer)` | passing both a `ConfigArgs` and `Printer` instance
186
+ 2. `run(config_args)` | passing a `ConfigArgs` instance
187
+ 3. `main()` | for jobs without configuration (legacy scripts)
188
+
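
The priority ordering above could be implemented along the following lines. This is only a sketch of the selection logic, not cocina's actual dispatcher: `call_job_entrypoint` is a hypothetical name, and the job "modules" are stand-ins built with `types.SimpleNamespace`.

```python
import inspect
from types import SimpleNamespace


def call_job_entrypoint(job_module, config_args, printer):
    """Hypothetical dispatcher following the priority order listed above."""
    run = getattr(job_module, "run", None)
    if callable(run):
        # Count declared parameters to decide whether `run` accepts a printer.
        n_params = len(inspect.signature(run).parameters)
        if n_params >= 2:
            return run(config_args, printer)
        return run(config_args)
    main = getattr(job_module, "main", None)
    if callable(main):
        # Legacy scripts: no configuration is passed.
        return main()
    raise AttributeError("job module must define run(...) or main()")


# Stand-in job modules, for demonstration only.
job_with_printer = SimpleNamespace(run=lambda config_args, printer=None: "run+printer")
job_config_only = SimpleNamespace(run=lambda config_args: "run")
legacy_job = SimpleNamespace(main=lambda: "main")
```

Inspecting the signature rather than catching `TypeError` avoids masking errors raised inside the job itself.
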
189
+
190
+ #### USER CODEBASE/NOTEBOOKS
191
+
192
+ Although the main focus is on building and running configured "jobs", [ConfigArgs](#configargs) can also be used in your code (a notebook for example):
193
+
194
+ ```python
195
+ # Load job-specific configuration
196
+ ca = ConfigArgs('job_group_1.job_a1')
197
+ jobs.job_group_1.job_a1.step_1(*ca.step_1.args, **ca.step_1.kwargs)
198
+ ```
199
+
200
+ ---
201
+
202
+ ## cocina Configuration
203
+
204
+ The `.cocina` file contains project settings and must be in your project root. It defines:
205
+ - Configuration file locations and naming conventions
206
+ - Project root directory location
207
+ - Environment variable names
208
+
209
+ **Required:** Every project must have a `.cocina` file at the root.
210
+
211
+ **Options** (passed to `cocina init`):
212
+ - `--log_dir`: Enable automatic log file creation
213
+ - `--package`: Specify main package for constants loading
214
+ - `--force`: Overwrite existing `.cocina` file
215
+
216
+ ---
217
+
218
+ ## Configuration Files
219
+
220
+ Cocina uses YAML files in the `config/` directory:
221
+
222
+ ```
223
+ config/
224
+ ├── config.yaml              # Main configuration
225
+ ├── dev.yaml                 # Development environment overrides
226
+ ├── prod.yaml                # Production environment overrides
227
+ └── args/                    # Job-specific configurations
228
+     ├── job_name.yaml        # Individual job config
229
+     └── group_name/          # Grouped job configs
230
+         └── job_a.yaml
231
+ ```
232
+
233
+ **Configuration Types:**
234
+ - **Main Config**: `config.yaml` - shared across all environments
235
+ - **Environment Config**: `{env}.yaml` - environment-specific overrides
236
+ - **Job Config**: `args/{job}.yaml` - job-specific parameters and arguments
237
+
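
To make the relationship between the main and environment configs concrete, here is a minimal sketch of an override step. It assumes a simple top-level merge in which environment values replace main values key by key; cocina's actual merge semantics (for example, how nested mappings are handled) may differ. `merge_config` and the sample values are hypothetical, and plain dicts stand in for the parsed YAML files.

```python
def merge_config(main_config, env_config=None):
    """Return a new dict where env values override main values (top level only)."""
    merged = dict(main_config)        # copy so the inputs are left untouched
    merged.update(env_config or {})   # env-specific keys win
    return merged


# Plain dicts standing in for parsed config.yaml and prod.yaml.
main_cfg = {"DATABASE_URL": "sqlite:///dev.db", "BATCH_SIZE": 100}
prod_cfg = {"DATABASE_URL": "postgresql://prod-host/db"}

merged_cfg = merge_config(main_cfg, prod_cfg)
```
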
238
+ ### ConfigHandler
239
+
240
+ Manages constants and main configuration with environment support.
241
+
242
+ ```python
243
+ from cocina.config_handler import ConfigHandler
244
+
245
+ ch = ConfigHandler()
246
+ print(ch.DATABASE_URL) # From config.yaml
247
+ print(ch.MAX_SCALE) # From constants.py (protected)
248
+ ```
249
+
250
+ **Features:**
251
+ - Loads constants from `your_package/constants.py`
252
+ - Loads configuration from `config/config.yaml`
253
+ - Environment-specific overrides from `config/{env}.yaml`
254
+ - Dict-style and attribute access patterns
255
+
256
+ ### ConfigArgs
257
+
258
+ Loads job-specific configurations with structured argument access.
259
+
260
+ ```python
261
+ from cocina.config_handler import ConfigArgs
262
+
263
+ ca = ConfigArgs('data_pipeline')
264
+ # Access method arguments
265
+ ca.extract_data.args # ["source_table"]
266
+ ca.extract_data.kwargs # {"limit": 1000, "debug": False}
267
+ ```
268
+
269
+ **YAML Configuration Parsing:**
270
+ - Dict with `args`/`kwargs` keys → extracts args and kwargs
271
+ - Dict without special keys → `args=[]`, `kwargs=dict`
272
+ - List/tuple → `args=value`, `kwargs={}`
273
+ - Single value → `args=[value]`, `kwargs={}`
274
+
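
The four parsing rules above can be sketched as a single normalization function. This is an illustration of the rules as stated, not the library's implementation; `split_args_kwargs` is a hypothetical name.

```python
from typing import Any, Dict, List, Tuple


def split_args_kwargs(value: Any) -> Tuple[List[Any], Dict[str, Any]]:
    """Normalize one parsed YAML entry into (args, kwargs) per the rules above."""
    if isinstance(value, dict):
        if "args" in value or "kwargs" in value:
            # Dict with special keys: extract them, defaulting to empty.
            return list(value.get("args") or []), dict(value.get("kwargs") or {})
        # Dict without special keys: everything is a keyword argument.
        return [], dict(value)
    if isinstance(value, (list, tuple)):
        # Sequence: everything is positional.
        return list(value), {}
    # Single value: one positional argument.
    return [value], {}
```

Applied to the `data_pipeline.yaml` example, `extract_data` yields both positional and keyword arguments, `transform_data` yields only keyword arguments, and `save_data` yields a single positional argument.
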
275
+ **Features:**
276
+ - Environment-specific overrides
277
+ - Reference resolution from main config
278
+ - Dynamic value substitution
279
+
280
+ ---
281
+
282
+ ## CLI
283
+
284
+ ### Initialize Project
285
+
286
+ ```bash
287
+ pixi run cocina init --log_dir logs --package your_package
288
+ ```
289
+
290
+ ### Run Jobs
291
+
292
+ ```bash
293
+ # Run a single job
294
+ pixi run cocina job data_pipeline
295
+
296
+ # Run with specific environment
297
+ pixi run cocina job data_pipeline --env prod
298
+
299
+ # Run multiple jobs
300
+ pixi run cocina job job1 job2 job3
301
+
302
+ # Dry run (validate without executing)
303
+ pixi run cocina job data_pipeline --dry_run
304
+ ```
305
+
306
+ **Options:**
307
+ - `--env`: Environment configuration to use (dev, prod, etc.)
308
+ - `--verbose`: Enable detailed output
309
+ - `--dry_run`: Validate configuration without running
310
+
311
+
312
+ ---
313
+
314
+ ## Tools
315
+
316
+ ### Printer
317
+ Professional output with timestamps, headers, and optional file logging.
318
+
319
+ ```python
320
+ from cocina.printer import Printer
321
+
322
+ printer = Printer(header='MyApp')
323
+ printer.start('Processing begins')
324
+ printer.message('Status update', count=42, status='ok')
325
+ printer.stop('Complete')
326
+ ```
327
+
328
+ ### Timer
329
+ Simple timing functionality with duration tracking.
330
+
331
+ ```python
332
+ from cocina.utils import Timer
333
+
334
+ timer = Timer()
335
+ timer.start() # Start timing
336
+ print(timer.state()) # Current elapsed time
337
+ print(timer.now()) # Current timestamp
338
+ stop_time = timer.stop() # Stop timing
339
+ print(timer.delta()) # Total duration string
340
+ ```
341
+
342
+ > See [complete documentation](docs/) for all utility functions and helpers.
343
+
344
+ ---
345
+
346
+ ## Development
347
+
348
+ **Requirements:** Managed with [Pixi](https://pixi.sh/latest) - no manual environment setup needed.
349
+
350
+ ```bash
351
+ # All commands use pixi
352
+ pixi run jupyter lab
353
+ ```
354
+
355
+ **Style:** Follows PEP8 standards. See [setup.cfg](./setup.cfg) for project-specific rules.
356
+
357
+ ---
358
+
359
+ ## Documentation
360
+
361
+ - **[Getting Started](/wiki/getting-started)** - Installation, initialization, and first job
362
+ - **[Configuration Guide](/wiki/configuration)** - Complete configuration management
363
+ - **[Job System](/wiki/jobs)** - Creating and running jobs
364
+ - **[CLI Reference](/wiki/cli)** - Command-line interface
365
+ - **[Examples](/wiki/examples)** - Detailed usage examples
366
+ - **[Advanced Topics](/wiki/advanced)** - Complex patterns and extensions
367
+
368
+ ## License
369
+
370
+ CC-BY-4.0