GenTS 0.1.0__tar.gz

@@ -0,0 +1,39 @@
1
+ # This workflow will upload a Python Package using Twine when a release is created
2
+ # For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python#publishing-to-package-registries
3
+
4
+ # This workflow uses actions that are not certified by GitHub.
5
+ # They are provided by a third-party and are governed by
6
+ # separate terms of service, privacy policy, and support
7
+ # documentation.
8
+
9
+ name: Upload Python Package
10
+
11
+ on:
12
+   release:
13
+     types: [published]
14
+
15
+ permissions:
16
+   contents: read
17
+
18
+ jobs:
19
+   deploy:
20
+
21
+     runs-on: ubuntu-latest
22
+
23
+     steps:
24
+     - uses: actions/checkout@v4
25
+     - name: Set up Python
26
+       uses: actions/setup-python@v3
27
+       with:
28
+         python-version: '3.10.x'
29
+     - name: Install dependencies
30
+       run: |
31
+         python -m pip install --upgrade pip
32
+         pip install build
33
+     - name: Build package
34
+       run: python -m build
35
+     - name: Publish package
36
+       uses: pypa/gh-action-pypi-publish@27b31702a0e7fc50959f5ad993c78deac1bdfc29
37
+       with:
38
+         user: __token__
39
+         password: ${{ secrets.PYPI_API_TOKEN }}
gents-0.1.0/.gitignore ADDED
@@ -0,0 +1,162 @@
1
+ # Byte-compiled / optimized / DLL files
2
+ __pycache__/
3
+ *.py[cod]
4
+ *$py.class
5
+
6
+ # C extensions
7
+ *.so
8
+
9
+ # Distribution / packaging
10
+ .Python
11
+ build/
12
+ develop-eggs/
13
+ dist/
14
+ downloads/
15
+ eggs/
16
+ .eggs/
17
+ lib/
18
+ lib64/
19
+ parts/
20
+ sdist/
21
+ var/
22
+ wheels/
23
+ share/python-wheels/
24
+ *.egg-info/
25
+ .installed.cfg
26
+ *.egg
27
+ MANIFEST
28
+
29
+ # PyInstaller
30
+ # Usually these files are written by a python script from a template
31
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
32
+ *.manifest
33
+ *.spec
34
+
35
+ # Installer logs
36
+ pip-log.txt
37
+ pip-delete-this-directory.txt
38
+
39
+ # Unit test / coverage reports
40
+ htmlcov/
41
+ .tox/
42
+ .nox/
43
+ .coverage
44
+ .coverage.*
45
+ .cache
46
+ nosetests.xml
47
+ coverage.xml
48
+ *.cover
49
+ *.py,cover
50
+ .hypothesis/
51
+ .pytest_cache/
52
+ cover/
53
+
54
+ # Translations
55
+ *.mo
56
+ *.pot
57
+
58
+ # Django stuff:
59
+ *.log
60
+ local_settings.py
61
+ db.sqlite3
62
+ db.sqlite3-journal
63
+
64
+ # Flask stuff:
65
+ instance/
66
+ .webassets-cache
67
+
68
+ # Scrapy stuff:
69
+ .scrapy
70
+
71
+ # Sphinx documentation
72
+ docs/_build/
73
+
74
+ # PyBuilder
75
+ .pybuilder/
76
+ target/
77
+
78
+ # Jupyter Notebook
79
+ .ipynb_checkpoints
80
+
81
+ # IPython
82
+ profile_default/
83
+ ipython_config.py
84
+
85
+ # pyenv
86
+ # For a library or package, you might want to ignore these files since the code is
87
+ # intended to run in multiple environments; otherwise, check them in:
88
+ # .python-version
89
+
90
+ # pipenv
91
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
92
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
93
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
94
+ # install all needed dependencies.
95
+ #Pipfile.lock
96
+
97
+ # poetry
98
+ # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
99
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
100
+ # commonly ignored for libraries.
101
+ # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
102
+ #poetry.lock
103
+
104
+ # pdm
105
+ # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
106
+ #pdm.lock
107
+ # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
108
+ # in version control.
109
+ # https://pdm.fming.dev/latest/usage/project/#working-with-version-control
110
+ .pdm.toml
111
+ .pdm-python
112
+ .pdm-build/
113
+
114
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
115
+ __pypackages__/
116
+
117
+ # Celery stuff
118
+ celerybeat-schedule
119
+ celerybeat.pid
120
+
121
+ # SageMath parsed files
122
+ *.sage.py
123
+
124
+ # Environments
125
+ .env
126
+ .venv
127
+ env/
128
+ venv/
129
+ ENV/
130
+ env.bak/
131
+ venv.bak/
132
+
133
+ # Spyder project settings
134
+ .spyderproject
135
+ .spyproject
136
+
137
+ # Rope project settings
138
+ .ropeproject
139
+
140
+ # mkdocs documentation
141
+ /site
142
+
143
+ # mypy
144
+ .mypy_cache/
145
+ .dmypy.json
146
+ dmypy.json
147
+
148
+ # Pyre type checker
149
+ .pyre/
150
+
151
+ # pytype static type analyzer
152
+ .pytype/
153
+
154
+ # Cython debug symbols
155
+ cython_debug/
156
+
157
+ # PyCharm
158
+ # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
159
+ # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
160
+ # and can be added to the global gitignore or merged into this file. For a more nuclear
161
+ # option (not recommended) you can uncomment the following to ignore the entire idea folder.
162
+ #.idea/
@@ -0,0 +1,179 @@
1
+ Metadata-Version: 2.1
2
+ Name: GenTS
3
+ Version: 0.1.0
4
+ Summary: A useful tool for post-processing CESM2 history files into the timeseries format.
5
+ Author-email: Cameron Cummins <cameron.cummins@utexas.edu>
6
+ Project-URL: Homepage, https://github.com/AgentOxygen/GenTS
7
+ Project-URL: Issues, https://github.com/AgentOxygen/GenTS
8
+ Classifier: Programming Language :: Python :: 3
9
+ Classifier: License :: OSI Approved :: MIT License
10
+ Classifier: Operating System :: OS Independent
11
+ Requires-Python: >=3.10.13
12
+ Description-Content-Type: text/markdown
13
+ License-File: LICENSE
14
+
15
+ # Time Series Generation
16
+
17
+ >Cameron Cummins<br>
18
+ Computational Engineer<br>
19
+ Contact: cameron.cummins@utexas.edu<br>
20
+ Webpage: [https://www.jsg.utexas.edu/student/cameron_cummins](https://www.jsg.utexas.edu/student/cameron_cummins)<br>
21
+ Affiliation: Persad Aero-Climate Lab, The University of Texas at Austin
22
+
23
+ >Adam Phillips<br>
24
+ Mentor and Advisor<br>
25
+ Contact: asphilli@ucar.edu<br>
26
+ Webpage: [https://staff.cgd.ucar.edu/asphilli/](https://staff.cgd.ucar.edu/asphilli/)<br>
27
+ Affiliation: Climate and Global Dynamics Lab, National Center for Atmospheric Research
28
+
29
+ ## *This Project is still in Development*
30
+ It works pretty well and can likely handle most cases, but some features aren't finished yet. Stay tuned for more!
31
+
32
+ ## Converting from History Files to Time Series Files
33
+ Original model output produced by CESM2 is stored by timestep, with multiple geophysical variables stored in a single netCDF3 file together. These files are referred to as **history files** as each file effectively represents a snapshot of many variables at separate moments in time. This is intuitive from the model's perspective because variables are computed together, after solving many differential equations, at each step in the time series. However, this format is cumbersome for scientists because analysis is often performed over a large number of timesteps for only a few variables. A more practical format is to store all the timesteps for single variables in separate files. Datasets stored in this manner are known as **time series files**.
34
+
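The regrouping described above can be illustrated with a small, self-contained sketch. It uses plain Python dictionaries as a stand-in for netCDF files, and the variable names and values are illustrative assumptions; it is not how the package itself reads or writes data.

```python
# Toy stand-in for history files: one record per timestep, many variables each.
# The variable names and values are illustrative only.
history_files = [
    {"time": 0, "TS": 287.1, "PRECT": 2.9e-8},
    {"time": 1, "TS": 287.4, "PRECT": 3.1e-8},
    {"time": 2, "TS": 286.9, "PRECT": 2.7e-8},
]


def to_time_series(history):
    """Regroup per-timestep records into one series per variable."""
    series = {}
    for record in sorted(history, key=lambda r: r["time"]):
        for name, value in record.items():
            if name == "time":
                continue
            entry = series.setdefault(name, {"time": [], "values": []})
            entry["time"].append(record["time"])
            entry["values"].append(value)
    return series


time_series = to_time_series(history_files)
print(time_series["TS"]["values"])  # [287.1, 287.4, 286.9]
```

Each key of `time_series` plays the role of one time series file: a single variable with all of its timesteps.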
35
+ ## Capabilities and Current Limitations
36
+ - [x] Functional core to convert history file to time series file in parallel with Dask
37
+ - [x] Automatic path parsing
38
+ - [x] Compatible with all model components of CESM
39
+ - [x] Compatible with large ensembles
40
+ - [x] Compatible with non-rectilinear grids
41
+ - [x] Automatic path resolving and group detection (detects different experiments, timesteps, and other sub-directories)
42
+ - [x] Customizable output directory structure that automatically reflects input directory structure
43
+ - [x] Adjustable time chunk size
44
+ - [x] Adjustable time range selection
45
+ - [x] Customizable variable selection
46
+ - [x] Adjustable file compression
47
+ - [x] Resumable process (such as exceeding wall time or encountering an error)
48
+ - [x] Documented API
49
+ - [x] Fully written in Python (no NCO, CDO, or other subprocess commands)
50
+ - [ ] (WIP) Verification tools for ensuring file integrity
51
+ - [ ] (WIP) Command line interface
52
+ - [ ] (WIP) Automatic cluster configuration/recommendation
53
+
54
+ ## Dependencies
55
+ Ensure that you have the following packages installed. If you are using Casper, activate conda environment "npl-2024b".
56
+ - `python >= 3.11.9`
57
+ - `dask >= 2024.7.0`
58
+ - `dask_jobqueue >= 0.8.5`
59
+ - `netCDF4 >= 1.7.1`
60
+ - `numpy >= 1.26.4`
61
+
62
+
63
+ ## How To Use
64
+ The process of converting history files into time series files is almost entirely I/O dependent and benefits heavily from parallelism that increases data throughput. We leverage Dask to parallelize this process: reading in history files and writing out their respective time series datasets across multiple worker processes. These functions can be run without Dask, but they will execute in serial and likely be significantly slower.
65
+
66
+ Start by cloning this repository to a directory of your choosing. This can be done by navigating to a directory and running the following command, assuming you have `git` installed:
67
+ ````
68
+ git clone https://github.com/AgentOxygen/timeseries_generation.git
69
+ ````
70
+ You can then either import `timeseries_generation.py` as a module in an interactive Jupyter Notebook (recommended) or in a custom Python script. ~~You may also execute it as a command from the terminal~~(not implemented yet).
71
+
72
+ For those that learn faster by doing than reading, there is a template notebook available called `template_notebook.ipynb`.
73
+
74
+ ### Jupyter Notebook
75
+ To use this package in an interactive Jupyter notebook, create a notebook in the repository directory and start a Dask cluster using either a LocalCluster or your HPC's job-queue. Be sure to activate an environment with the correct dependencies (this can be done either at the start of the Jupyter Server or by selecting the appropriate kernel in the bottom left bar). Here is an example of a possible cluster on Casper using the NPL 2024b conda environment and the PBS job scheduler:
76
+ ````
77
+ from timeseries_generation import ModelOutputDatabase
78
+ from dask_jobqueue import PBSCluster
79
+ from dask.distributed import Client
80
+
81
+
82
+ cluster = PBSCluster(
83
+     cores=20,
84
+     memory='40GB',
85
+     processes=20,
86
+     queue='casper',
87
+     resource_spec='select=1:ncpus=20:mem=40GB',
88
+     account='PROJECT CODE GOES HERE',
89
+     walltime='02:00:00',
90
+     local_directory="/local_scratch/"
91
+ )
92
+
93
+ cluster.scale(200)
94
+ client = Client(cluster)
95
+ ````
96
+ Here, we spin up 200 workers, each on one core with 2 GB of memory (so 200 total cores), with a maximum wall-clock time of 2 hours (note that this is done in PBS batches of 20 cores per machine). The `local_directory` parameter allows each worker to use its node's NVMe storage for scratch work if needed. Note that these cores may be spread across multiple machines. This process should not use more than two gigabytes of memory and does not utilize multi-threading (one process per core is optimal). How the other properties of the cluster should be configured (number of cores, number of different machines, and wall time) is difficult to determine, ~~but benchmarks are available below to get a good idea~~ (WIP). If the generation process is interrupted, it can be resumed without re-generating existing time series data. Note that the ``Client`` object is important for connecting the Jupyter instance to the Dask cluster. You can monitor the Dask dashboard using the URL indicated by the `client` object (run just `client` in a cell block to get the Jupyter GUI). Once the cluster is created, you can initialize a ``ModelOutputDatabase`` class object with the appropriate parameters. Here is an example of common parameters you may use:
97
+ ````
98
+ model_database = ModelOutputDatabase(
99
+     hf_head_dir="PATH TO HEAD DIRECTORY FOR HISTORY FILES (MULTIPLE MODEL RUNS, ONE RUN, OR INDIVIDUAL COMPONENTS)",
100
+     ts_head_dir="PATH TO HEAD DIRECTORY FOR REFLECTIVE DIRECTORY STRUCTURE AND TIME SERIES FILES",
101
+     dir_name_swaps={
102
+         "hist": "proc/tseries"
103
+     },
104
+     file_exclusions=[".once.nc", ".pop.hv.nc", ".initial_hist."]
105
+ )
106
+ ````
107
+ The ``ModelOutputDatabase`` object centralizes the API and acts as the starting point for all history-to-time-series operations. Creating the database will automatically search the input directories under `hf_head_dir` and group the history files by ensemble member, model component, timestep, and any other differences between subdirectories and naming patterns. The structures and names are arbitrary, so you can input a directory containing an ensemble of model archives or a single model component from a single run (or a random directory with history files in it, it should be able to handle it). It does not trigger any computation. `ts_head_dir` tells the class where to create a directory structure that matches the one under `hf_head_dir`. Only directories with `.nc` files will be reflected and optionally any other file names that contain the keywords specified by `file_exclusions` will be excluded. `dir_name_swaps` takes a Python dictionary with keys equal to directory names found under `hf_head_dir` that should be renamed. In the example above, all directories named `hist/` are renamed to two new directories `proc/tseries` to remain consistent with current time series generation conventions. There are many other parameters that can be specified to filter out which history files/variables are converted and how the time series files are formatted (see the section below titled *ModelOutputDatabase Parameters*). To trigger the time series generation process, call ``.run()``:
108
+ ````
109
+ model_database.run()
110
+ client.shutdown()
111
+ ````
112
+ It is good practice to shut down the cluster after the process is complete. If metadata of the history files needs to be inspected before generating time series (to verify which history files are being detected), ``.build()`` can be used to obtain metadata for all history files. This process is parallelized and is lightweight, typically taking less than a minute to run on a large-enough cluster. This function is automatically called by ``.run()`` before generating the time series datasets.
113
+
114
+ You can monitor the Dask dashboard using the URL provided by the `client` object mentioned previously. This is a good way to track progress and check for errors.
115
+
116
+ ### What to Expect
117
+
118
+ When you create ``ModelOutputDatabase``, it recursively searches the directory tree at `hf_head_dir`. If this tree is large, it may take some time. It is run in serial, so it won't show any information in the Dask dashboard, but you should see some print output. Calling `.run()` will first check whether the database has been built; if not, it will automatically call `.build()` (which you can optionally call before `.run()` to inspect metadata). The `.build()` function aggregates the metadata for *all* of the history files under `hf_head_dir`. This process is parallelized using the Dask cluster and doesn't read much data, but opens a lot of files. This can take anywhere from less than a minute to 10 minutes depending on how the filesystem is organized. Once the metadata is aggregated in parallel, it is then processed in serial to form the arguments for time series generation. This step shouldn't take more than a few minutes and all progress will be communicated through print statements. After calculating the arguments, the `generateTimeSeries` function is parallelized across the cluster with progress updated in the Dask dashboard. This is the final function call that actually performs the time series generation by concatenating and merging variable data from the history files and writing them to disk. This process can take minutes to hours depending on the size of the history file output, the number of variables exported, the configuration of the Dask cluster, and traffic on the file system.
119
+
120
+ Errors may occur due to hiccups in file-reads, but they tend to be rare. These errors are not resolved in real-time, but reported to the user so that a smaller second-pass can be performed to fix any incomplete or missing time series files. You might also notice that some tasks take longer than others or that the rate of progress sometimes slows down. This is likely due to different variable shapes/sizes and variability in disk I/O.
121
+
122
+ You may see a warning:
123
+ ````
124
+ UserWarning: Sending large graph of size ....
125
+ ````
126
+ It's a performance warning from Dask and doesn't mean anything is broken or that you did anything wrong. It might be the cause of some latency which I am hoping to eliminate in a future update, but just ignore it for now.
127
+
128
+ By default, existing files are not overwritten, but this can be configured as shown in *ModelOutputDatabase Parameters*. Immediately before the program releases the handle for a time series file, an attribute called `timeseries_process` is appended, as explained in *Additional Attributes*. This attribute is used to verify that a file was properly created (being the last thing added, it indicates the process completed successfully). If it does not exist in the file with the correct value, the program will assume the file is broken and needs to be replaced. This is to protect against unexpected interruptions such as system crashes, I/O errors, or wall-clock limits.
129
+
130
+ ## ModelOutputDatabase Parameters
131
+
132
+ The ``ModelOutputDatabase`` constructor has multiple optional parameters that can be customized to fit your use-case:
133
+ ````
134
+ class ModelOutputDatabase:
135
+     def __init__(self,
136
+                  hf_head_dir: str,
137
+                  ts_head_dir: str,
138
+                  dir_name_swaps: dict = {},
139
+                  file_exclusions: list = [],
140
+                  dir_exclusions: list = ["rest", "logs"],
141
+                  include_variables: list = None,
142
+                  exclude_variables: list = None,
143
+                  year_start: int = None,
144
+                  year_end: int = None,
145
+                  compression_level: int = None,
146
+                  variable_compression_levels: dict = None) -> None:
147
+ ````
148
+ - `hf_head_dir` : Required string/Path object, path to the head directory of the structure whose subdirectories contain history files.
149
+ - `ts_head_dir` : Required string/Path object, path to head directory where structure reflecting `hf_head_dir` will be created and time series files will be written to.
150
+
151
+ Note that for a Python class, the `self` parameter is an internal parameter skipped when creating the class object (`hf_head_dir` is the first parameter you can pass to `ModelOutputDatabase()`). You are only *required* to provide the first two parameters, `hf_head_dir` and `ts_head_dir`, but many assumptions will be made which you can also deduce from the default values in the constructor:
152
+ 1. All netCDF files under `hf_head_dir`, including all sub-directories except for "rest" and "logs", will be treated as history files.
153
+ 2. An identical directory structure to that of `hf_head_dir` will be created under `ts_head_dir` with the exception of time-step sub-directories (such as "month_1" and "day_1").
154
+ 3. The entire time series for all history files will be concatenated.
155
+ 4. All variables containing the time dimension will be concatenated and have time series files.
156
+ 5. No compression will be used (compression level 0).
157
+
158
+ These defaults can be overwritten to customize how your time series files are generated:
159
+ - `dir_name_swaps` : Optional dictionary, dictionary for swapping out keyword directory names in the structure under `hf_head_dir` (e.g. `{"hist" : "proc/tseries"}`)
160
+ - `file_exclusions` : Optional list/array, file names containing any of the keywords in this list will be excluded from the database.
161
+ - `dir_exclusions` : Optional list, directory names containing any of the keywords in this list will be excluded from the database.
162
+ - `include_variables` : Optional list, variables to include, either when creating individual time series files or when adding auxiliary variables.
163
+ - `exclude_variables` : Optional list, variables to exclude, either when creating individual time series files or when adding auxiliary variables.
164
+ - `year_start` : Optional int, starting year for time series generation, must be later than first history file timestamp to have an effect.
165
+ - `year_end` : Optional int, ending year for time series generation, must be earlier than last history file timestamp to have an effect.
166
+ - `compression_level` : Optional int, compression level to pass to netCDF4 engine when generating time series files.
167
+ - `variable_compression_levels` : Optional dictionary, compression levels to apply to specific variables (variable name is key and the compression level is the value).
168
+
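As a hedged illustration of the optional parameters listed above, the sketch below collects keyword arguments in a dictionary and passes them to the constructor. The paths, years, compression levels, and variable names (`TS`, `PRECT`) are placeholders chosen for the example, `build_database` is a hypothetical helper, and the import assumes the repository directory is on the Python path as in the earlier notebook example.

```python
# Keyword arguments follow the constructor signature documented above.
# All paths, years, levels, and variable names are placeholders.
options = {
    "hf_head_dir": "/path/to/history/files",
    "ts_head_dir": "/path/to/timeseries/output",
    "dir_name_swaps": {"hist": "proc/tseries"},
    "file_exclusions": [".once.nc", ".pop.hv.nc", ".initial_hist."],
    "include_variables": ["TS", "PRECT"],      # illustrative variable names
    "year_start": 1950,
    "year_end": 2000,
    "compression_level": 1,
    "variable_compression_levels": {"TS": 4},  # level for one specific variable
}


def build_database(opts):
    """Hypothetical helper: create the database from a dictionary of options."""
    # Imported here so the dictionary above can be inspected without the package.
    from timeseries_generation import ModelOutputDatabase
    return ModelOutputDatabase(**opts)
```

Calling `build_database(options)` and then `.run()` on the result would start generation with these settings; any parameter left out of the dictionary keeps its default.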
169
+ ## Additional Attributes
170
+ When generating a new time series file, two new attributes are created in the global attributes dictionary:
171
+ 1. ``timeseries_software_version`` - Indicates the version of this software used to produce the time series dataset. This is important for identifying if the data is subject to any bugs that may exist in previous iterations of the code.
172
+ 2. ``timeseries_process`` - If the process is interrupted, some files may exist but not contain all the data or possibly be corrupted due to early termination of the file handle. This boolean attribute is created at initialization with a ``False`` value that is only set to ``True`` immediately before the file handle closes (it is the very last operation). When re-building the database after an interruption, this flag is checked to determine whether the file should be skipped or removed and re-computed.
173
+
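The completion-flag idea behind ``timeseries_process`` can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the package's implementation: a JSON file stands in for a netCDF file and its global attributes, and the helper names `write_series` and `needs_regeneration` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_series(path, values, fail_midway=False):
    """Write a file whose completion flag is only set as the last step."""
    payload = {"timeseries_process": False, "values": []}
    path.write_text(json.dumps(payload))   # flag starts out as False
    payload["values"] = list(values)
    path.write_text(json.dumps(payload))   # data written, flag still False
    if fail_midway:
        return                             # simulate an interruption
    payload["timeseries_process"] = True   # very last operation
    path.write_text(json.dumps(payload))


def needs_regeneration(path):
    """Return True if the file is missing, unreadable, or not flagged complete."""
    if not path.exists():
        return True
    try:
        return json.loads(path.read_text()).get("timeseries_process") is not True
    except json.JSONDecodeError:
        return True


workdir = Path(tempfile.mkdtemp())
complete = workdir / "complete.json"
interrupted = workdir / "interrupted.json"
write_series(complete, [1, 2, 3])
write_series(interrupted, [1, 2, 3], fail_midway=True)
print(needs_regeneration(complete), needs_regeneration(interrupted))  # False True
```

Because the flag is the last thing written, a file that lacks it can be treated as incomplete and regenerated, while flagged files are skipped on a resumed run.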
174
+ ## Benchmarks
175
+ WIP
176
+
177
+ ## Why not use Xarray?
178
+ Xarray is an amazing tool that makes *most* big data processes run significantly faster with a lot less hassle (and it works seamlessly with Dask). Time series generation, however, has zero analytical computations involved and is almost entirely I/O bound. Data is simply read, concatenated, merged, and written. I (Cameron) tried many different approaches with Xarray, but they all had performance issues that aren't a problem in most other situations. For example, reading the metadata of *all* of the history files is quite useful for determining which history files go together and what variables are primary versus auxiliary. Reading metadata using Xarray is pretty slow at the moment, but that may be due to the fact that it adds so much more functionality on top of standard netCDF operations (which we don't really need for this problem). **Xarray calls the netcdf4-Python engine under the hood, so we just use the engine directly instead**. There are other features that could potentially make Xarray viable in the future, such as taking advantage of chunking. However, current model output is in the netCDF-3 format, which does not inherently support chunking, and I couldn't figure out how to control which workers received specific file-sized chunks using the Xarray API (which is important for synchronization in parallel, otherwise resulting in lots of latency and memory usage).
179
+
@@ -0,0 +1,13 @@
1
+ .gitignore
2
+ LICENSE
3
+ README.md
4
+ pyproject.toml
5
+ .github/workflows/release.yml
6
+ GenTS.egg-info/PKG-INFO
7
+ GenTS.egg-info/SOURCES.txt
8
+ GenTS.egg-info/dependency_links.txt
9
+ GenTS.egg-info/top_level.txt
10
+ gents/__init__.py
11
+ gents/ts_gen.py
12
+ utils/PrototypingTS.ipynb
13
+ utils/template_notebook.ipynb
@@ -0,0 +1 @@
1
+ gents
gents-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2024 Cameron Cummins
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
gents-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,179 @@
1
+ Metadata-Version: 2.1
2
+ Name: GenTS
3
+ Version: 0.1.0
4
+ Summary: A useful tool for post-processing CESM2 history files into the timeseries format.
5
+ Author-email: Cameron Cummins <cameron.cummins@utexas.edu>
6
+ Project-URL: Homepage, https://github.com/AgentOxygen/GenTS
7
+ Project-URL: Issues, https://github.com/AgentOxygen/GenTS
8
+ Classifier: Programming Language :: Python :: 3
9
+ Classifier: License :: OSI Approved :: MIT License
10
+ Classifier: Operating System :: OS Independent
11
+ Requires-Python: >=3.10.13
12
+ Description-Content-Type: text/markdown
13
+ License-File: LICENSE
14
+
15
+ # Time Series Generation
16
+
17
+ >Cameron Cummins<br>
18
+ Computational Engineer<br>
19
+ Contact: cameron.cummins@utexas.edu<br>
20
+ Webpage: [https://www.jsg.utexas.edu/student/cameron_cummins](https://www.jsg.utexas.edu/student/cameron_cummins)<br>
21
+ Affiliation: Persad Aero-Climate Lab, The University of Texas at Austin
22
+
23
+ >Adam Phillips<br>
24
+ Mentor and Advisor<br>
25
+ Contact: asphilli@ucar.edu<br>
26
+ Webpage: [https://staff.cgd.ucar.edu/asphilli/](https://staff.cgd.ucar.edu/asphilli/)<br>
27
+ Affiliation: Climate and Global Dynamics Lab, National Center for Atmospheric Research
28
+
29
+ ## *This Project is still in Development*
30
+ It works pretty well and can likely handle most cases, but some features aren't finished yet. Stay tuned for more!
31
+
32
+ ## Converting from History Files to Time Series Files
33
+ Original model output produced by CESM2 is stored by timestep, with multiple geophysical variables stored in a single netCDF3 file together. These files are referred to as **history files** as each file effectively represents a snapshot of many variables at separate moments in time. This is intuitive from the model's perspective because variables are computed together, after solving many differential equations, at each step in the time series. However, this format is cumbersome for scientists because analysis is often performed over a large number of timesteps for only a few variables. A more practical format is to store all the timesteps for single variables in separate files. Datasets stored in this manner are known as **time series files**.
34
+
35
+ ## Capabilities and Current Limitations
36
+ - [x] Functional core to convert history file to time series file in parallel with Dask
37
+ - [x] Automatic path parsing
38
+ - [x] Compatible with all model components of CESM
39
+ - [x] Compatible with large ensembles
40
+ - [x] Compatible with non-rectilinear grids
41
+ - [x] Automatic path resolving and group detection (detects different experiments, timesteps, and other sub-directories)
42
+ - [x] Customizable output directory structure that automatically reflects input directory structure
43
+ - [x] Adjustable time chunk size
44
+ - [x] Adjustable time range selection
45
+ - [x] Customizable variable selection
46
+ - [x] Adjustable file compression
47
+ - [x] Resumable process (such as exceeding wall time or encountering an error)
48
+ - [x] Documented API
49
+ - [x] Fully written in Python (no NCO, CDO, or other subprocess commands)
50
+ - [ ] (WIP) Verification tools for ensuring file integrity
51
+ - [ ] (WIP) Command line interface
52
+ - [ ] (WIP) Automatic cluster configuration/recommendation
53
+
54
+ ## Dependencies
55
+ Ensure that you have the following packages installed. If you are using Casper, activate conda environment "npl-2024b".
56
+ - `python >= 3.11.9`
57
+ - `dask >= 2024.7.0`
58
+ - `dask_jobqueue >= 0.8.5`
59
+ - `netCDF4 >= 1.7.1`
60
+ - `numpy >= 1.26.4`
61
+
62
+
63
+ ## How To Use
64
+ The process of converting history files into time series files is almost entirely I/O dependent and benefits heavily from parallelism that increases data throughput. We leverage Dask to parallelize this process: reading in history files and writing out their respective time series datasets across multiple worker processes. These functions can be run without Dask, but they will execute in serial and likely be significantly slower.
65
+
66
+ Start by cloning this repository to a directory of your choosing. This can be done by navigating to a directory and running the following command, assuming you have `git` installed:
67
+ ````
68
+ git clone https://github.com/AgentOxygen/timeseries_generation.git
69
+ ````
70
+ You can then either import `timeseries_generation.py` as a module in an interactive Jupyter Notebook (recommended) or in a custom Python script. ~~You may also execute it as a command from the terminal~~(not implemented yet).
71
+
72
+ For those that learn faster by doing than reading, there is a template notebook available called `template_notebook.ipynb`.
73
+
74
+ ### Jupyter Notebook
75
+ To use this package in an interactive Jupyter notebook, create a notebook in the repository directory and start a Dask cluster using either a LocalCluster or your HPC's job-queue. Be sure to activate an environment with the correct dependencies (this can be done either at the start of the Jupyter Server or by selecting the appropriate kernel in the bottom left bar). Here is an example of a possible cluster on Casper using the NPL 2024b conda environment and the PBS job scheduler:
76
+ ````
77
+ from timeseries_generation import ModelOutputDatabase
78
+ from dask_jobqueue import PBSCluster
79
+ from dask.distributed import Client
80
+
81
+
82
+ cluster = PBSCluster(
83
+     cores=20,
84
+     memory='40GB',
85
+     processes=20,
86
+     queue='casper',
87
+     resource_spec='select=1:ncpus=20:mem=40GB',
88
+     account='PROJECT CODE GOES HERE',
89
+     walltime='02:00:00',
90
+     local_directory="/local_scratch/"
91
+ )
92
+
93
+ cluster.scale(200)
94
+ client = Client(cluster)
95
+ ````
96
+ Here, we spin up 200 workers, each on one core with 2 GB of memory (so 200 total cores), with a maximum wall-clock time of 2 hours (note that this is done in PBS batches of 20 cores per machine). The `local_directory` parameter allows each worker to use its node's NVMe storage for scratch work if needed. Note that these cores may be spread across multiple machines. This process should not use more than two gigabytes of memory and does not utilize multi-threading (one process per core is optimal). How the other properties of the cluster should be configured (number of cores, number of different machines, and wall time) is difficult to determine, ~~but benchmarks are available below to get a good idea~~ (WIP). If the generation process is interrupted, it can be resumed without re-generating existing time series data. Note that the ``Client`` object is important for connecting the Jupyter instance to the Dask cluster. You can monitor the Dask dashboard using the URL indicated by the `client` object (run just `client` in a cell block to get the Jupyter GUI). Once the cluster is created, you can initialize a ``ModelOutputDatabase`` class object with the appropriate parameters. Here is an example of common parameters you may use:
97
+ ````
98
+ model_database = ModelOutputDatabase(
99
+ hf_head_dir="PATH TO HEAD DIRECTORY FOR HISTORY FILES (MULTIPLE MODEL RUNS, ONE RUN, OR INDIVIDUAL COMPONENTS)",
100
+ ts_head_dir="PATH TO HEAD DIRECTORY FOR REFLECTIVE DIRECTORY STRUCTURE AND TIME SERIES FILES",
101
+ dir_name_swaps={
102
+ "hist": "proc/tseries"
103
+ },
104
+ file_exclusions=[".once.nc", ".pop.hv.nc", ".initial_hist."]
105
+ )
106
+ ````
107
+ The ``ModelOutputDatabase`` object centralizes the API and acts as the starting point for all history-to-time-series operations. Creating the database automatically searches the input directories under `hf_head_dir` and groups the history files by ensemble member, model component, timestep, and any other differences between subdirectories and naming patterns. The structures and names are arbitrary, so you can input a directory containing an ensemble of model archives or a single model component from a single run (or a random directory with history files in it; it should be able to handle it). It does not trigger any computation. `ts_head_dir` tells the class where to create a directory structure that matches the one under `hf_head_dir`. Only directories containing `.nc` files will be reflected, and any files whose names contain keywords specified by `file_exclusions` will be excluded. `dir_name_swaps` takes a Python dictionary whose keys are directory names found under `hf_head_dir` that should be renamed and whose values are the new names. In the example above, every directory named `hist/` is replaced with the nested directories `proc/tseries` to remain consistent with current time series generation conventions. There are many other parameters that can be specified to filter which history files/variables are converted and how the time series files are formatted (see the section below titled *ModelOutputDatabase Parameters*). To trigger the time series generation process, call ``.run()``:
108
+ ````
109
+ model_database.run()
110
+ client.shutdown()
111
+ ````
112
+ It is good practice to shut down the cluster after the process is complete. If the metadata of the history files needs to be inspected before generating time series (to verify which history files are being detected), ``.build()`` can be used to obtain metadata for all history files. This process is parallelized and lightweight, typically taking less than a minute to run on a large-enough cluster. This function is automatically called by ``.run()`` before generating the time series datasets.
113
+
114
+ You can monitor the Dask dashboard using the URL provided by the `client` object mentioned previously. This is a good way to track progress and check for errors.
115
+
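The sizing of the PBS example above can be sanity-checked with a little arithmetic. The sketch below is only an illustration: `cluster_footprint` is a hypothetical helper, not part of this package or of Dask. It assumes that each PBS job hosts `processes` workers that share the job's cores and memory evenly.

```python
import math


def cluster_footprint(total_workers, processes_per_job, cores_per_job, memory_gb_per_job):
    """Estimate what a worker request translates to (hypothetical helper)."""
    # Each PBS job hosts `processes_per_job` workers, so round up to whole jobs.
    jobs = math.ceil(total_workers / processes_per_job)
    return {
        "jobs": jobs,
        "total_cores": jobs * cores_per_job,
        # Memory and cores of a job are assumed to be split evenly among its workers.
        "memory_gb_per_worker": memory_gb_per_job / processes_per_job,
        "threads_per_worker": cores_per_job // processes_per_job,
    }


# Values taken from the PBSCluster example: scale(200), processes=20, cores=20, memory='40GB'
footprint = cluster_footprint(200, 20, 20, 40)
print(footprint)
```

With these numbers, the request corresponds to 10 PBS jobs, 200 cores in total, and one single-threaded worker per core with 2 GB of memory each.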
116
+ ### What to Expect
117
+
118
+ When you create a ``ModelOutputDatabase``, it recursively searches the directory tree at `hf_head_dir`. If this tree is large, it may take some time. The search runs in serial, so it won't show any information in the Dask dashboard, but you should see some print output. Calling `.run()` first checks whether the database has been built. If not, it automatically calls `.build()` (which you can optionally call before `.run()` to inspect metadata). The `.build()` function aggregates the metadata for *all* of the history files under `hf_head_dir`. This process is parallelized using the Dask cluster and doesn't read much data, but it opens a lot of files. It can take anywhere from less than a minute to 10 minutes depending on how the filesystem is organized. Once the metadata is aggregated in parallel, it is processed in serial to form the arguments for time series generation. This step shouldn't take more than a few minutes, and all progress will be communicated through print statements. After the arguments are calculated, the `generateTimeSeries` function is parallelized across the cluster, with progress updated in the Dask dashboard. This is the final function call that actually performs the time series generation by concatenating and merging variable data from the history files and writing them to disk. This process can take minutes to hours depending on the size of the history file output, the number of variables exported, the configuration of the Dask cluster, and traffic on the file system.
119
+
120
+ Errors may occur due to hiccups in file reads, but they tend to be rare. These errors are not resolved in real time, but are reported to the user so that a smaller second pass can be performed to fix any incomplete or missing time series files. You might also notice that some tasks take longer than others or that the rate of progress sometimes slows down. This is likely due to different variable shapes/sizes and variability in disk I/O.
121
+
122
+ You may see a warning:
123
+ ````
124
+ UserWarning: Sending large graph of size ....
125
+ ````
126
+ It's a performance warning from Dask and doesn't mean anything is broken or that you did anything wrong. It might be the cause of some latency which I am hoping to eliminate in a future update, but just ignore it for now.
127
+
128
+ By default, existing files are not overwritten, but this can be configured as shown in *ModelOutputDatabase Parameters*. Immediately before the program releases the handle for a time series file, it sets an attribute called `timeseries_process`, as explained in *Additional Attributes*. This attribute is used to verify that a file was properly created (being the last thing written, it indicates the process completed successfully). If it does not exist in the file with the correct value, the program will assume the file is broken and needs to be replaced. This protects against unexpected interruptions such as system crashes, I/O errors, or wall-clock limits.
129
+
130
+ ## ModelOutputDatabase Parameters
131
+
132
+ The ``ModelOutputDatabase`` constructor has multiple optional parameters that can be customized to fit your use-case:
133
+ ````
134
+ class ModelOutputDatabase:
135
+ def __init__(self,
136
+ hf_head_dir: str,
137
+ ts_head_dir: str,
138
+ dir_name_swaps: dict = {},
139
+ file_exclusions: list = [],
140
+ dir_exclusions: list = ["rest", "logs"],
141
+ include_variables: list = None,
142
+ exclude_variables: list = None,
143
+ year_start: int = None,
144
+ year_end: int = None,
145
+ compression_level: int = None,
146
+ variable_compression_levels: dict = None) -> None:
147
+ ````
148
+ - `hf_head_dir` : Required string/Path object, path to the head directory of the structure whose subdirectories contain history files.
149
+ - `ts_head_dir` : Required string/Path object, path to head directory where structure reflecting `hf_head_dir` will be created and time series files will be written to.
150
+
151
+ Note that for a Python class, the `self` parameter is an internal parameter skipped when creating the class object (`hf_head_dir` is the first parameter you can pass to `ModelOutputDatabase()`). You are only *required* to provide the first two parameters, `hf_head_dir` and `ts_head_dir`, but many assumptions will be made which you can also deduce from the default values in the constructor:
152
+ 1. All netCDF files under `hf_head_dir`, including all sub-directories except for "rest" and "logs", will be treated as history files.
153
+ 2. An identical directory structure to that of `hf_head_dir` will be created under `ts_head_dir` with the exception of time-step sub-directories (such as "month_1" and "day_1").
154
+ 3. The entire time series for all history files will be concatenated.
155
+ 4. All variables containing the time dimension will be concatenated and have time series files.
156
+ 5. No compression will be used (compression level 0).
157
+
158
+ These defaults can be overwritten to customize how your time series files are generated:
159
+ - `dir_name_swaps` : Optional dictionary, dictionary for swapping out keyword directory names in the structure under `hf_head_dir` (e.g. `{"hist" : "proc/tseries"}`)
160
+ - `file_exclusions` : Optional list/array, file names containing any of the keywords in this list will be excluded from the database.
161
+ - `dir_exclusions` : Optional list, directory names containing any of the keywords in this list will be excluded from the database.
162
+ - `include_variables` : Optional list, variables to include when either creating individual time series files or adding auxiliary variables.
163
+ - `exclude_variables` : Optional list, variables to exclude when either creating individual time series files or adding auxiliary variables.
164
+ - `year_start` : Optional int, starting year for time series generation, must be later than first history file timestamp to have an effect.
165
+ - `year_end` : Optional int, ending year for time series generation, must be earlier than last history file timestamp to have an effect.
166
+ - `compression_level` : Optional int, compression level to pass to netCDF4 engine when generating time series files.
167
+ - `variable_compression_levels` : Optional dictionary, compression levels to apply to specific variables (variable name is key and the compression level is the value).
168
+
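As a rough illustration of how `dir_name_swaps` and `file_exclusions` behave, the sketch below mirrors a history-file directory under a new head directory. `mirror_path` and `is_excluded` are hypothetical helpers written for this example and are not the package's implementation. The sketch assumes swaps apply to directory names that exactly match a key, and it does not model the removal of time-step sub-directories.

```python
from pathlib import PurePosixPath


def mirror_path(hf_dir, hf_head_dir, ts_head_dir, dir_name_swaps=None):
    """Map a directory under hf_head_dir to its mirror under ts_head_dir (hypothetical helper)."""
    swaps = dir_name_swaps or {}
    # Path of the history directory relative to the head directory.
    relative = PurePosixPath(hf_dir).relative_to(hf_head_dir)
    # Assumption: a directory name is swapped only when it exactly matches a key.
    parts = [swaps.get(part, part) for part in relative.parts]
    return PurePosixPath(ts_head_dir).joinpath(*parts)


def is_excluded(file_name, file_exclusions=()):
    """Return True if the file name contains any exclusion keyword (hypothetical helper)."""
    return any(keyword in file_name for keyword in file_exclusions)


# Example paths are made up for illustration.
target = mirror_path(
    "/archive/run1/atm/hist",
    hf_head_dir="/archive",
    ts_head_dir="/timeseries",
    dir_name_swaps={"hist": "proc/tseries"},
)
print(target)
print(is_excluded("run1.pop.h.once.nc", [".once.nc", ".pop.hv.nc", ".initial_hist."]))
```

In this sketch, a value such as `proc/tseries` expands into two nested directories because `joinpath` splits it on the path separator.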
169
+ ## Additional Attributes
170
+ When generating a new time series file, two new attributes are created in the global attributes dictionary:
171
+ 1. ``timeseries_software_version`` - Indicates the version of this software used to produce the time series dataset. This is important for identifying if the data is subject to any bugs that may exist in previous iterations of the code.
172
+ 2. ``timeseries_process`` - If the process is interrupted, some files may exist but not contain all the data or possibly be corrupted due to early termination of the file handle. This boolean attribute is created at initialization with a ``False`` value that is only set to ``True`` immediately before the file handle closes (it is the very last operation). When re-building the database after an interruption, this flag is checked to determine whether the file should be skipped or removed and re-computed.
173
+
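The role of ``timeseries_process`` when resuming after an interruption can be sketched as a small decision function. This is a simplified illustration, not the package's code: `needs_regeneration` is a hypothetical name, the global attributes are represented as a plain dictionary instead of being read from a netCDF file, and the accepted "true" values are an assumption about how the flag is stored.

```python
def needs_regeneration(global_attrs, overwrite=False):
    """Decide whether a time series file should be (re)generated (hypothetical helper).

    global_attrs: dict of the file's global attributes, or None if the file does not exist.
    """
    if global_attrs is None:
        # No file on disk yet, so it has to be generated.
        return True
    if overwrite:
        # The user asked for existing files to be replaced.
        return True
    # Assumption: a completed file stores the flag as True or the string "True".
    completed = global_attrs.get("timeseries_process") in (True, "True")
    # A missing or false flag means the previous write was interrupted.
    return not completed


print(needs_regeneration({"timeseries_process": "True"}))   # complete file, skipped
print(needs_regeneration({"timeseries_software_version": "0.1.0"}))  # flag missing, redone
```

Because the flag is the last thing written, a file without it is treated the same way as a file that does not exist.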
174
+ ## Benchmarks
175
+ WIP
176
+
177
+ ## Why not use Xarray?
178
+ Xarray is an amazing tool that makes *most* big data processes run significantly faster with a lot less hassle (and it works seamlessly with Dask). Time series generation, however, has zero analytical computations involved and is almost entirely I/O bound. Data is simply read, concatenated, merged, and written. I (Cameron) tried many different approaches with Xarray, but they all had performance issues that aren't a problem in most other situations. For example, reading the metadata of *all* of the history files is quite useful for determining which history files go together and which variables are primary versus auxiliary. Reading metadata using Xarray is pretty slow at the moment, but that may be because it adds so much more functionality on top of standard netCDF operations (which we don't really need for this problem). **Xarray calls the netcdf4-Python engine under the hood, so we just use the engine directly instead**. There are other features that could potentially make Xarray viable in the future, such as taking advantage of chunking. However, current model output is in the netCDF-3 format, which does not inherently support chunking, and I couldn't figure out how to control which workers received specific file-sized chunks using the Xarray API (which is important for synchronization in parallel; without it, the result is a lot of latency and memory usage).
179
+