calkit-python 0.6.0__tar.gz → 0.7.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (59) hide show
  1. {calkit_python-0.6.0 → calkit_python-0.7.0}/PKG-INFO +27 -7
  2. {calkit_python-0.6.0 → calkit_python-0.7.0}/README.md +26 -6
  3. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/__init__.py +1 -1
  4. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/cli/main.py +9 -2
  5. calkit_python-0.7.0/calkit/core.py +202 -0
  6. calkit_python-0.7.0/calkit/magics.py +257 -0
  7. calkit_python-0.7.0/calkit/tests/test_magics.py +36 -0
  8. calkit_python-0.7.0/docs/tutorials/notebook-pipeline.md +158 -0
  9. calkit_python-0.7.0/test/pipeline.ipynb +93 -0
  10. calkit_python-0.6.0/calkit/core.py +0 -98
  11. {calkit_python-0.6.0 → calkit_python-0.7.0}/.github/FUNDING.yml +0 -0
  12. {calkit_python-0.6.0 → calkit_python-0.7.0}/.github/workflows/publish-test.yml +0 -0
  13. {calkit_python-0.6.0 → calkit_python-0.7.0}/.github/workflows/publish.yml +0 -0
  14. {calkit_python-0.6.0 → calkit_python-0.7.0}/.gitignore +0 -0
  15. {calkit_python-0.6.0 → calkit_python-0.7.0}/LICENSE +0 -0
  16. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/cli/__init__.py +0 -0
  17. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/cli/config.py +0 -0
  18. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/cli/core.py +0 -0
  19. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/cli/import_.py +0 -0
  20. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/cli/list.py +0 -0
  21. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/cli/new.py +0 -0
  22. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/cli/notebooks.py +0 -0
  23. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/cli/office.py +0 -0
  24. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/cloud.py +0 -0
  25. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/conda.py +0 -0
  26. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/config.py +0 -0
  27. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/data.py +0 -0
  28. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/docker.py +0 -0
  29. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/dvc.py +0 -0
  30. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/git.py +0 -0
  31. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/gui.py +0 -0
  32. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/jupyter.py +0 -0
  33. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/models.py +0 -0
  34. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/office.py +0 -0
  35. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/server.py +0 -0
  36. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/templates/__init__.py +0 -0
  37. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/templates/core.py +0 -0
  38. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/templates/latex/__init__.py +0 -0
  39. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/templates/latex/article/paper.tex +0 -0
  40. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/templates/latex/core.py +0 -0
  41. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/templates/latex/jfm/jfm.bst +0 -0
  42. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/templates/latex/jfm/jfm.cls +0 -0
  43. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/templates/latex/jfm/lineno-FLM.sty +0 -0
  44. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/templates/latex/jfm/paper.tex +0 -0
  45. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/templates/latex/jfm/upmath.sty +0 -0
  46. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/tests/__init__.py +0 -0
  47. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/tests/cli/__init__.py +0 -0
  48. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/tests/cli/test_list.py +0 -0
  49. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/tests/cli/test_main.py +0 -0
  50. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/tests/cli/test_new.py +0 -0
  51. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/tests/test_core.py +0 -0
  52. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/tests/test_dvc.py +0 -0
  53. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/tests/test_jupyter.py +0 -0
  54. {calkit_python-0.6.0 → calkit_python-0.7.0}/calkit/tests/test_templates.py +0 -0
  55. {calkit_python-0.6.0 → calkit_python-0.7.0}/docs/tutorials/adding-latex-pub-docker.md +0 -0
  56. {calkit_python-0.6.0 → calkit_python-0.7.0}/docs/tutorials/conda-envs.md +0 -0
  57. {calkit_python-0.6.0 → calkit_python-0.7.0}/docs/tutorials/img/run-proc.png +0 -0
  58. {calkit_python-0.6.0 → calkit_python-0.7.0}/docs/tutorials/procedures.md +0 -0
  59. {calkit_python-0.6.0 → calkit_python-0.7.0}/pyproject.toml +0 -0
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.3
2
2
  Name: calkit-python
3
- Version: 0.6.0
3
+ Version: 0.7.0
4
4
  Summary: Reproducibility simplified.
5
5
  Project-URL: Homepage, https://github.com/calkit/calkit
6
6
  Project-URL: Issues, https://github.com/calkit/calkit/issues
@@ -31,20 +31,40 @@ Description-Content-Type: text/markdown
31
31
 
32
32
  # Calkit
33
33
 
34
- [Calkit](https://calkit.io) helps simplify reproducibility,
35
- acting as a layer on top of
36
- [Git](https://git-scm.com/), [DVC](https://dvc.org/),
37
- [Docker](https://docker.com),
38
- and adds a domain-specific data model
34
+ Calkit is a lightweight framework for doing reproducible research.
35
+ It acts as a top-level layer to integrate and simplify the use of enabling
36
+ technologies such as
37
+ [Git](https://git-scm.com/),
38
+ [DVC](https://dvc.org/),
39
+ [Conda](https://docs.conda.io/en/latest/),
40
+ and [Docker](https://docker.com).
41
+ Calkit also adds a domain-specific data model
39
42
  such that all aspects of the research process can be fully described in a
40
43
  single repository and therefore easily consumed by others.
41
44
 
45
+ Our goal is to make reproducibility easier so it becomes more common.
46
+ To do this, we try to make it easy for users to follow two simple rules:
47
+
48
+ 1. **Keep everything in version control.** This includes large files like
49
+ datasets, enabled by DVC. The [Calkit cloud](https://calkit.io)
50
+ serves as a simple default DVC remote storage location for those who do not
51
+ want to manage their own infrastructure.
52
+ 2. **Generate all important artifacts with a single pipeline.** There should be
53
+ no special instructions required to reproduce a project's artifacts.
54
+ It should be as simple as calling `calkit run`.
55
+ The DVC pipeline (in a project's `dvc.yaml` file) is therefore the main
56
+ thing to "build" throughout a research project.
57
+ Calkit provides helper functionality to build pipeline stages that
58
+ keep computational environments up-to-date and label their outputs for
59
+ convenient reuse.
60
+
42
61
  ## Tutorials
43
62
 
63
+ - [Jupyter notebook as a DVC pipeline](docs/tutorials/notebook-pipeline.md)
44
64
  - [Keeping track of conda environments](docs/tutorials/conda-envs.md)
45
65
  - [Defining and executing manual procedures](docs/tutorials/procedures.md)
46
66
  - [Adding a new LaTeX-based publication with its own Docker build environment](docs/tutorials/adding-latex-pub-docker.md)
47
- - [A reproducibly workflow using Microsoft Office (Word and Excel)](https://petebachant.me/office-repro/)
67
+ - [A reproducible workflow using Microsoft Office (Word and Excel)](https://petebachant.me/office-repro/)
48
68
  - [Reproducible OpenFOAM simulations](https://petebachant.me/reproducible-openfoam/)
49
69
 
50
70
  ## Why does reproducibility matter?
@@ -1,19 +1,39 @@
1
1
  # Calkit
2
2
 
3
- [Calkit](https://calkit.io) helps simplify reproducibility,
4
- acting as a layer on top of
5
- [Git](https://git-scm.com/), [DVC](https://dvc.org/),
6
- [Docker](https://docker.com),
7
- and adds a domain-specific data model
3
+ Calkit is a lightweight framework for doing reproducible research.
4
+ It acts as a top-level layer to integrate and simplify the use of enabling
5
+ technologies such as
6
+ [Git](https://git-scm.com/),
7
+ [DVC](https://dvc.org/),
8
+ [Conda](https://docs.conda.io/en/latest/),
9
+ and [Docker](https://docker.com).
10
+ Calkit also adds a domain-specific data model
8
11
  such that all aspects of the research process can be fully described in a
9
12
  single repository and therefore easily consumed by others.
10
13
 
14
+ Our goal is to make reproducibility easier so it becomes more common.
15
+ To do this, we try to make it easy for users to follow two simple rules:
16
+
17
+ 1. **Keep everything in version control.** This includes large files like
18
+ datasets, enabled by DVC. The [Calkit cloud](https://calkit.io)
19
+ serves as a simple default DVC remote storage location for those who do not
20
+ want to manage their own infrastructure.
21
+ 2. **Generate all important artifacts with a single pipeline.** There should be
22
+ no special instructions required to reproduce a project's artifacts.
23
+ It should be as simple as calling `calkit run`.
24
+ The DVC pipeline (in a project's `dvc.yaml` file) is therefore the main
25
+ thing to "build" throughout a research project.
26
+ Calkit provides helper functionality to build pipeline stages that
27
+ keep computational environments up-to-date and label their outputs for
28
+ convenient reuse.
29
+
11
30
  ## Tutorials
12
31
 
32
+ - [Jupyter notebook as a DVC pipeline](docs/tutorials/notebook-pipeline.md)
13
33
  - [Keeping track of conda environments](docs/tutorials/conda-envs.md)
14
34
  - [Defining and executing manual procedures](docs/tutorials/procedures.md)
15
35
  - [Adding a new LaTeX-based publication with its own Docker build environment](docs/tutorials/adding-latex-pub-docker.md)
16
- - [A reproducibly workflow using Microsoft Office (Word and Excel)](https://petebachant.me/office-repro/)
36
+ - [A reproducible workflow using Microsoft Office (Word and Excel)](https://petebachant.me/office-repro/)
17
37
  - [Reproducible OpenFOAM simulations](https://petebachant.me/reproducible-openfoam/)
18
38
 
19
39
  ## Why does reproducibility matter?
@@ -1,4 +1,4 @@
1
- __version__ = "0.6.0"
1
+ __version__ = "0.7.0"
2
2
 
3
3
  from .core import *
4
4
  from . import git
@@ -404,7 +404,12 @@ def run_dvc_repro(
404
404
  f"Stage {stage_name} does not have exactly one output"
405
405
  )
406
406
  cktype = ckmeta.get("type")
407
- if cktype not in ["figure", "dataset", "publication"]:
407
+ if cktype not in [
408
+ "figure",
409
+ "dataset",
410
+ "publication",
411
+ "notebook",
412
+ ]:
408
413
  raise_error(f"Invalid Calkit output type '{cktype}'")
409
414
  objects.append(
410
415
  dict(path=outs[0]) | ckmeta | dict(stage=stage_name)
@@ -553,7 +558,9 @@ def run_in_env(
553
558
  typer.echo(f"Running command: {docker_cmd}")
554
559
  subprocess.call(docker_cmd, cwd=wdir)
555
560
  elif env["kind"] == "conda":
556
- cmd = ["conda", "run", "-n", env_name] + cmd
561
+ with open(env["path"]) as f:
562
+ conda_env = calkit.ryaml.load(f)
563
+ cmd = ["conda", "run", "-n", conda_env["name"]] + cmd
557
564
  if verbose:
558
565
  typer.echo(f"Running command: {cmd}")
559
566
  subprocess.call(cmd, cwd=wdir)
@@ -0,0 +1,202 @@
1
+ """Core functionality."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import glob
6
+ import json
7
+ import logging
8
+ import os
9
+ import pickle
10
+ from datetime import UTC, datetime
11
+ from typing import Literal
12
+
13
+ import ruamel.yaml
14
+ from git import Repo
15
+ from git.exc import InvalidGitRepositoryError
16
+
17
+ logging.basicConfig(level=logging.INFO)
18
+ logger = logging.getLogger(__package__)
19
+
20
+ ryaml = ruamel.yaml.YAML()
21
+ ryaml.indent(mapping=2, sequence=4, offset=2)
22
+ ryaml.preserve_quotes = True
23
+ ryaml.width = 70
24
+
25
+
26
def find_project_dirs(relative=False, max_depth=3) -> list[str]:
    """Find all Calkit project directories.

    Searches up to ``max_depth`` directory levels below the home directory
    (or below the current directory when ``relative`` is true) for
    ``calkit.yaml`` files, including a ``*/GitHub/*`` pattern for GitHub
    Desktop users, and keeps only directories that are valid Git repos.
    """
    start = "" if relative else os.path.expanduser("~")
    candidates = []
    for depth in range(1, max_depth + 1):
        wildcards = ["*"] * depth
        candidates += glob.glob(
            os.path.join(start, *wildcards, "calkit.yaml")
        )
        # GitHub Desktop users keep repos under a 'GitHub' folder
        candidates += glob.glob(
            os.path.join(start, "*", "GitHub", *wildcards, "calkit.yaml")
        )
    project_dirs = []
    for ck_fpath in candidates:
        project_dir = os.path.dirname(ck_fpath)
        # Only keep directories that are actually Git repositories
        try:
            Repo(project_dir)
        except InvalidGitRepositoryError:
            continue
        project_dirs.append(project_dir)
    return project_dirs
51
+
52
+
53
def load_calkit_info(
    wdir=None, process_includes: bool | str | list[str] = False
) -> dict:
    """Load Calkit project information.

    Parameters
    ----------
    wdir : str
        Working directory. Defaults to current working directory.
    process_includes: bool, string or list of strings
        Whether or not to process any '_include' keys for a given kind of
        object. If a string is passed, only process includes for that kind.
        Similarly, if a list of strings is passed, only process those kinds.
        If True, process all default kinds.

    Returns
    -------
    dict
        The project metadata, or an empty dict if no (or an empty)
        ``calkit.yaml`` file exists.
    """
    info = {}
    fpath = "calkit.yaml"
    if wdir is not None:
        fpath = os.path.join(wdir, fpath)
    if os.path.isfile(fpath):
        with open(fpath) as f:
            info = ryaml.load(f)
        # An empty YAML file loads as None; normalize to a dict so callers
        # can rely on the declared return type and membership tests below
        if info is None:
            info = {}
    # Check for any includes, i.e., entities with an _include key, for which
    # we should merge in another file
    default_includes_enabled = ["environments", "procedures"]
    if process_includes:
        if isinstance(process_includes, bool):
            includes_enabled = default_includes_enabled
        elif isinstance(process_includes, str):
            includes_enabled = [process_includes]
        elif isinstance(process_includes, list):
            includes_enabled = process_includes
        else:
            # Previously an unexpected type caused an UnboundLocalError
            raise TypeError(
                "process_includes must be a bool, string, or list of strings"
            )
        for kind in includes_enabled:
            if kind in info:
                for obj_name, obj in info[kind].items():
                    if "_include" in obj:
                        include_fpath = obj.pop("_include")
                        with open(include_fpath) as f:
                            include_data = ryaml.load(f)
                        info[kind][obj_name] |= include_data
    return info
94
+
95
+
96
+ def utcnow(remove_tz=True) -> datetime:
97
+ """Return now in UTC, optionally stripping timezone information."""
98
+ dt = datetime.now(UTC)
99
+ if remove_tz:
100
+ dt = dt.replace(tzinfo=None)
101
+ return dt
102
+
103
+
104
+ NOTEBOOK_STAGE_OUT_FORMATS = ["pickle", "parquet", "json", "yaml", "csv"]
105
+
106
+
107
def get_notebook_stage_dir(stage_name: str) -> str:
    """Return the directory holding generated assets for a notebook stage."""
    # Forward slashes are used deliberately so DVC paths stay portable
    return "/".join([".calkit", "notebook-stages", stage_name])
109
+
110
+
111
def get_notebook_stage_script_path(stage_name: str) -> str:
    """Return the path of the auto-generated script for a notebook stage."""
    stage_dir = get_notebook_stage_dir(stage_name)
    return os.path.join(stage_dir, "script.py")
113
+
114
+
115
def get_notebook_stage_out_dir(stage_name: str) -> str:
    """Return the directory where a notebook stage's outputs are stored."""
    stage_dir = get_notebook_stage_dir(stage_name)
    return os.path.join(stage_dir, "outs")
117
+
118
+
119
def get_notebook_stage_out_path(
    stage_name: str,
    out_name: str,
    fmt: Literal["pickle", "parquet", "json", "yaml", "csv"] = "pickle",
) -> str:
    """Return the file path where a stage output variable is stored.

    Raises ``ValueError`` if ``fmt`` is not a supported output format.
    """
    if fmt not in NOTEBOOK_STAGE_OUT_FORMATS:
        raise ValueError(f"Invalid output format '{fmt}'")
    filename = f"{out_name}.{fmt}"
    return os.path.join(get_notebook_stage_out_dir(stage_name), filename)
129
+
130
+
131
def load_notebook_stage_out(
    stage_name: str,
    out_name: str,
    fmt: Literal["pickle", "parquet", "json", "yaml", "csv"] = "pickle",
    engine: Literal["pandas", "polars"] | None = None,
):
    """Load a saved notebook stage output variable from disk.

    ``engine`` is only valid for the tabular formats (CSV and Parquet);
    raises ``ValueError`` on incompatible format/engine combinations.
    """
    fpath = get_notebook_stage_out_path(stage_name, out_name, fmt=fmt)
    if engine is not None and fmt in ("pickle", "json", "yaml"):
        raise ValueError(
            f"Engine '{engine}' not compatible with format '{fmt}'"
        )
    # Engine-less formats first
    if fmt == "pickle":
        with open(fpath, "rb") as f:
            return pickle.load(f)
    if fmt == "yaml":
        with open(fpath) as f:
            return ryaml.load(f)
    if fmt == "json":
        with open(fpath) as f:
            return json.load(f)
    # Only CSV/Parquet remain (fmt was validated above); dispatch on engine
    if engine == "pandas":
        import pandas as pd

        reader = pd.read_csv if fmt == "csv" else pd.read_parquet
        return reader(fpath)
    if engine == "polars":
        import polars as pl

        reader = pl.read_csv if fmt == "csv" else pl.read_parquet
        return reader(fpath)
    raise ValueError(f"Unsupported format '{fmt}' for engine '{engine}'")
168
+
169
+
170
def save_notebook_stage_out(
    obj,
    stage_name: str,
    out_name: str,
    fmt: Literal["pickle", "parquet", "json", "yaml", "csv"] = "pickle",
    engine: Literal["pandas", "polars"] | None = None,
):
    """Serialize a notebook stage output variable to disk.

    ``engine`` is only valid for the tabular formats (CSV and Parquet);
    raises ``ValueError`` on incompatible format/engine combinations.
    """
    fpath = get_notebook_stage_out_path(stage_name, out_name, fmt=fmt)
    os.makedirs(os.path.dirname(fpath), exist_ok=True)
    if engine is not None and fmt in ("pickle", "json", "yaml"):
        raise ValueError(
            f"Engine '{engine}' not compatible with format '{fmt}'"
        )
    if fmt == "pickle":
        with open(fpath, "wb") as f:
            pickle.dump(obj, f)
    elif fmt == "json":
        with open(fpath, "w") as f:
            json.dump(obj, f)
    elif fmt == "yaml":
        with open(fpath, "w") as f:
            ryaml.dump(obj, f)
    elif engine == "pandas":
        # NOTE(review): pandas to_csv writes the index by default, so a CSV
        # round-trip via load_notebook_stage_out gains an extra index
        # column — confirm this is intended
        if fmt == "csv":
            obj.to_csv(fpath)
        else:
            obj.to_parquet(fpath)
    elif engine == "polars":
        if fmt == "csv":
            obj.write_csv(fpath)
        else:
            obj.write_parquet(fpath)
    else:
        raise ValueError(f"Unsupported format '{fmt}' for engine '{engine}'")
@@ -0,0 +1,257 @@
1
+ """IPython magics."""
2
+
3
+ import ast
4
+ import os
5
+ import subprocess
6
+
7
+ from IPython.core import magic_arguments
8
+ from IPython.core.magic import Magics, cell_magic, magics_class
9
+
10
+ import calkit
11
+
12
+
13
@magics_class
class Calkit(Magics):
    """Calkit cell magics for building DVC pipeline stages from notebooks."""

    @magic_arguments.magic_arguments()
    @magic_arguments.argument(
        "-n", "--name", help="Stage name.", required=True
    )
    @magic_arguments.argument(
        "--dep",
        "-d",
        help=(
            "Declare another stage's output variable as a dependency. "
            "Should be in the format '{stage_name}:{var_name}'. "
            "Optionally, the output format and engine, if applicable, can be "
            "appended like 'my-stage:some_dict:yaml' or "
            "'my-stage:df:parquet:pandas'."
        ),
        nargs="+",
    )
    @magic_arguments.argument(
        "--out",
        "-o",
        help=(
            "Declare a variable as an output. "
            "Optionally, the output format can be specified like "
            "'my_dict:json' or both the output format and engine can be "
            "specified like 'df:parquet:polars'."
        ),
        nargs="+",
    )
    @magic_arguments.argument(
        "--dep-path",
        "-D",
        help=(
            "Declare a path as a dependency, so that if that path changes, "
            "the stage will be rerun."
        ),
        nargs="+",
    )
    @magic_arguments.argument(
        "--out-path",
        "-O",
        help=(
            "Declare an output path written to by this cell, e.g., "
            "if a figure is saved to a file."
        ),
        nargs="+",
    )
    @magic_arguments.argument(
        "--out-type",
        "-t",
        choices=["figure", "dataset"],
        help=(
            "Declare the output as a type of Calkit object. If --out-path "
            "is specified, that will be used as the object path, else its "
            "path will be set as the output variable path. "
            "Note that there must only be one output to use this option."
        ),
    )
    @magic_arguments.argument(
        "--out-title", help="Title for Calkit output object."
    )
    @magic_arguments.argument(
        "--out-desc", help="Description for Calkit output object."
    )
    @cell_magic
    def stage(self, line, cell):
        """Turn a notebook cell into a DVC pipeline stage.

        Note that all dependencies must be declared since the cell will be
        first turned into a script and then run as part of the DVC pipeline.

        Then variables will be loaded back into the user namespace state by
        loading the DVC output.
        """
        args = magic_arguments.parse_argstring(self.stage, line)
        # If an output object type is specified, make sure we only have one
        # output
        if args.out_type:
            all_outs = []
            if args.out:
                all_outs += args.out
            if args.out_path:
                # --out-path takes precedence as the object path
                all_outs = args.out_path
            if len(all_outs) != 1:
                raise ValueError(
                    "Only one output can be defined if declaring as a "
                    "Calkit object"
                )
            # Parse calkit object parameters
            out_params = {}
            if args.out_title:
                out_params["title"] = ast.literal_eval(args.out_title)
            if args.out_desc:
                out_params["description"] = ast.literal_eval(args.out_desc)
            # Ensure we have required keys
            # TODO: Use Pydantic here
            if "title" not in out_params:
                raise ValueError(
                    f"Calkit type {args.out_type} requires a title"
                )
            # Parse output path
            if args.out_path:
                out_params["path"] = args.out_path[0]
            elif args.out:
                out = args.out[0]
                out_split = out.split(":")
                kws = dict(stage_name=args.name, out_name=out_split[0])
                if len(out_split) > 1:
                    kws["fmt"] = out_split[1]
                out_path = calkit.get_notebook_stage_out_path(**kws)
                out_params["path"] = out_path
            out_params["stage"] = args.name
            # Save in calkit.yaml, replacing any existing object at this path
            ck_info = calkit.load_calkit_info()
            objs = ck_info.get(args.out_type + "s", [])
            objs = [obj for obj in objs if obj["path"] != out_params["path"]]
            objs.append(out_params)
            ck_info[args.out_type + "s"] = objs
            with open("calkit.yaml", "w") as f:
                calkit.ryaml.dump(ck_info, f)
        # First, let's write this cell out to a script, ensuring that we
        # load the important state at the top
        script_txt = "# This script was automatically generated by Calkit\n\n"
        script_txt += "import calkit\n\n"
        if args.dep:
            for d in args.dep:
                dep_split = d.split(":")
                stage = dep_split[0]
                varname = dep_split[1]
                fmt_string = ""
                eng_string = ""
                if len(dep_split) >= 3:
                    fmt_string = f", fmt='{dep_split[2]}'"
                if len(dep_split) == 4:
                    eng_string = f", engine='{dep_split[3]}'"
                script_txt += (
                    f"{varname} = calkit.load_notebook_stage_out("
                    f"stage_name='{stage}', out_name='{varname}'"
                    f"{fmt_string}{eng_string})\n\n"
                )
        script_txt += cell
        # Add lines that save our output variables to files
        if args.out:
            for out in args.out:
                fmt_string = ""
                eng_string = ""
                out_split = out.split(":")
                outvar = out_split[0]
                if len(out_split) > 1:
                    fmt_string = f", fmt='{out_split[1]}'"
                if len(out_split) == 3:
                    eng_string = f", engine='{out_split[2]}'"
                script_txt += (
                    f"calkit.save_notebook_stage_out("
                    f"{outvar}, stage_name='{args.name}', out_name='{outvar}'"
                    f"{fmt_string}{eng_string})\n"
                )
        # Save the script to a Python file
        script_fpath = calkit.get_notebook_stage_script_path(args.name)
        script_dir = os.path.dirname(script_fpath)
        os.makedirs(script_dir, exist_ok=True)
        outs_dir = calkit.get_notebook_stage_out_dir(stage_name=args.name)
        os.makedirs(outs_dir, exist_ok=True)
        with open(script_fpath, "w") as f:
            f.write(script_txt)
        # Create a DVC stage that runs the script, defining the appropriate
        # dependencies and outputs, and run it
        cmd = [
            "dvc",
            "stage",
            "add",
            "-q",
            "-n",
            args.name,
            "--run",
            "--force",
            "-d",
            script_fpath,
        ]
        if args.dep:
            for dep in args.dep:
                dep_split = dep.split(":")
                stage = dep_split[0]
                varname = dep_split[1]
                kws = dict(stage_name=stage, out_name=varname)
                if len(dep_split) > 2:
                    kws["fmt"] = dep_split[2]
                cmd += [
                    "-d",
                    calkit.get_notebook_stage_out_path(**kws),
                ]
        if args.dep_path:
            for dep in args.dep_path:
                # Pass the path as-is: subprocess uses shell=False, so
                # wrapping in literal quotes would make the quote characters
                # part of the dependency path DVC records
                cmd += ["-d", dep]
        if args.out:
            for out in args.out:
                out_split = out.split(":")
                out_name = out_split[0]
                kws = dict(stage_name=args.name, out_name=out_name)
                if len(out_split) > 1:
                    kws["fmt"] = out_split[1]
                cmd += [
                    "-o",
                    calkit.get_notebook_stage_out_path(**kws),
                ]
        if args.out_path:
            for path in args.out_path:
                cmd += ["-o", path]
        # Quotes are correct here: this string is the stage command stored
        # in dvc.yaml, which is later interpreted by a shell
        cmd.append(f"python '{script_fpath}'")
        try:
            subprocess.run(cmd, check=True, capture_output=True, text=True)
        except subprocess.CalledProcessError as e:
            print(f"Error: {e.stderr}")
            raise
        # Now let's read in and inject the outputs back into the IPython state
        if args.out:
            for out in args.out:
                out_split = out.split(":")
                out_name = out_split[0]
                kws = dict(stage_name=args.name, out_name=out_name)
                if len(out_split) > 1:
                    kws["fmt"] = out_split[1]
                if len(out_split) > 2:
                    kws["engine"] = out_split[2]
                self.shell.user_ns[out_name] = calkit.load_notebook_stage_out(
                    **kws
                )
        # If the last line of the cell has no equals signs, run that command,
        # since it's probably meant for display
        last_line = cell.strip().split("\n")[-1]
        if "=" not in last_line:
            self.shell.run_cell(last_line)
246
+
247
+
248
def load_ipython_extension(ipython):
    """Register the Calkit magics with an IPython shell.

    Any module that defines a function named `load_ipython_extension` can be
    loaded via `%load_ext module.path` or configured to be autoloaded by
    IPython at startup time.

    See https://ipython.readthedocs.io/en/stable/config/custommagics.html
    """
    # Registering the class itself (not an instance) lets IPython call the
    # default constructor on it
    ipython.register_magics(Calkit)
@@ -0,0 +1,36 @@
1
+ """Tests for ``calkit.magics``."""
2
+
3
+ import os
4
+ import shutil
5
+ import subprocess
6
+
7
+ import calkit
8
+
9
+
10
def test_stage(tmp_dir):
    """Execute the example notebook and verify the generated pipeline."""
    # Initialize fresh Git and DVC repos in the temp dir
    subprocess.check_call(["git", "init"])
    subprocess.check_call(["dvc", "init"])
    # Copy in the test notebook and execute it end-to-end
    src = os.path.join(
        os.path.dirname(__file__), "..", "..", "test", "pipeline.ipynb"
    )
    shutil.copy(src, "notebook.ipynb")
    subprocess.check_call(
        ["jupyter", "nbconvert", "--execute", "notebook.ipynb", "--to", "html"]
    )
    # The generated dvc.yaml should declare the auto-written script as a dep
    with open("dvc.yaml") as f:
        pipeline = calkit.ryaml.load(f)
    assert (
        ".calkit/notebook-stages/get-data/script.py"
        in pipeline["stages"]["get-data"]["deps"]
    )
    # The figure declared via %%stage should be recorded in calkit.yaml
    fig = calkit.load_calkit_info()["figures"][0]
    assert fig["path"] == "figures/plot.png"
    assert fig["title"] == "A plot of the data"
    assert fig["description"] == "This is a plot of the data."
    assert fig["stage"] == "plot-fig"
@@ -0,0 +1,158 @@
1
+ # Using a Jupyter Notebook as a reproducible pipeline
2
+
3
+ Jupyter Notebooks are great tools for exploration,
4
+ but they can cause real headaches when it comes to managing state,
5
+ since they can be executed out-of-order.
6
+ This can lead to bad practices like only running certain cells
7
+ since others are too expensive or failing.
8
+ This means it's very possible for a result from a notebook to be
9
+ non-reproducible.
10
+
11
+ Here we're going to show how to use Calkit to turn a Jupyter Notebook
12
+ into a DVC pipeline,
13
+ as well as label our artifacts.
14
+
15
+ The natural process would be something like:
16
+
17
+ 1. Prototype a cell by running whatever commands make sense.
18
+ 2. Convert cells that are working and valuable into pipeline
19
+ stages, and delete anything else.
20
+
21
+ We should also be using [`nbstripout`](https://github.com/kynan/nbstripout)
22
+ to strip notebook outputs before we commit to the repo,
23
+ since the important ones will be produced as part of the pipeline
24
+ and cached with DVC.
25
+
26
+ At the end of this process we should be left with a notebook that runs
27
+ very quickly after it's been run once,
28
+ and all of our important outputs will be cached and pushed to the cloud,
29
+ but kept out of our Git repo.
30
+
31
+ Alright, so let's show how to convert a notebook into a reproducible
32
+ DVC pipeline without leaving the notebook interface.
33
+
34
+ First, let's write a cell to fetch a dataset,
35
+ and let's assume this is expensive,
36
+ maybe because we had to fetch it from a database.
37
+ To simulate that expense we'll use a call to `time.sleep`.
38
+
39
+ ```python
40
+ import pandas as pd
41
+ import time
42
+
43
+ time.sleep(10)
44
+
45
+ df = pd.DataFrame({"col1": range(1000)})
46
+ df.describe()
47
+ ```
48
+
49
+ In order to convert this cell into a pipeline stage,
50
+ we'll need to load the Calkit magics in our notebook.
51
+ This only needs to be run once, so it can be at the very top:
52
+
53
+ ```python
54
+ %load_ext calkit.magics
55
+ ```
56
+
57
+ Next we simply call the `%%stage` magic with the appropriate arguments to
58
+ convert the cell into a pipeline stage and run it externally with DVC:
59
+
60
+ ```python
61
+ %%stage --name get-data --out df
62
+
63
+ import pandas as pd
64
+ import time
65
+
66
+ time.sleep(10)
67
+
68
+ df = pd.DataFrame({"col1": range(1000)})
69
+ df.describe()
70
+ ```
71
+
72
+ In the magic call, we gave the stage a name and declared an output `df`.
73
+ When we run the cell, we'll see it takes at least 10 seconds the first time,
74
+ but if we run it a second time,
75
+ it will be much faster, since our output is being fetched from the DVC cache.
76
+ If we run `calkit status`, we can see we have some new data to commit and
77
+ push to the DVC remote.
78
+ If we do that, anyone else who clones this project will be able to
79
+ pull in the cache, and the cell will run quickly for them.
80
+
81
+ ## Saving outputs in different formats
82
+
83
+ By default, our output variables will be pickled,
84
+ which is not the most portable format.
85
+ Let's instead save our DataFrame to Parquet format.
86
+ To do this, all we need to do is adjust the `--out` value to add the format
87
+ and DataFrame library
88
+ (Calkit currently supports both Pandas and Polars DataFrames).
89
+ So change the call to the magic to be:
90
+
91
+ ```python
92
+ %%stage --name get-data --out df:parquet:pandas
93
+ ```
94
+
95
+ ## Using the output of one cell as a dependency in another
96
+
97
+ Let's imagine that now we want to create a visualization of our data.
98
+ Just like if we were creating a typical DVC stage in a `dvc.yaml` file,
99
+ we can declare a cell to depend on the output of another cell with the
100
+ `--dep` option.
101
+ For example:
102
+
103
+ ```python
104
+ %%stage --name plot --dep get-data:df:parquet:pandas --out fig
105
+
106
+ fig = df.plot(backend="plotly")
107
+ fig
108
+ ```
109
+
110
+ In this case, we need to specify what DataFrame library to use to read in
111
+ this dependency.
112
+ Here we tell Calkit that it's a Parquet file to be read with Pandas.
113
+ Calkit will ensure this dependency is loaded into memory before running the
114
+ cell as part of the pipeline.
115
+
116
+ ## Declaring an output as a figure saved to a different path
117
+
118
+ In the cell above we end up pickling `fig` into the DVC cache,
119
+ which is fine if we only ever want to view the figure through the notebook
120
+ interface,
121
+ but what if we want to declare this as a figure and, e.g.,
122
+ use it in a publication?
123
+ We can add a line that saves the figure and declare an additional output path
124
+ and metadata like (note this requires `plotly` and `kaleido` to be installed):
125
+
126
+ ```python
127
+ %%stage --name plot --dep get-data:df:parquet:pandas --out fig --out-path figures/plot.png --out-type figure --out-title "A plot of the data" --out-desc "This is a plot of the data."
128
+
129
+ import os
130
+
131
+ os.makedirs("figures", exist_ok=True)
132
+
133
+ fig = df.plot(backend="plotly")
134
+ fig.write_image("figures/plot.png")
135
+ fig
136
+ ```
137
+
138
+ If we call `calkit list figures`, we'll see our figure,
139
+ and after pushing to the cloud, we'll be able to see it there as well.
140
+
141
+ Note that we could also go back and add `--out-type=dataset` to the
142
+ `get-data` cell,
143
+ which will similarly add that dataset to our project metadata
144
+ for searchability and reuse.
145
+
146
+ ## Running the pipeline outside the notebook
147
+
148
+ One cool feature about building the pipeline this way is that it actually
149
+ creates runnable stages in `dvc.yaml`,
150
+ so `calkit run` or `dvc repro` will run all the same operations that
151
+ executing the notebook would.
152
+
153
+ ## Further exploration
154
+
155
+ If you'd like to try this out or explore further,
156
+ you can view this project up on
157
+ [GitHub](https://github.com/calkit/example-notebook-pipeline)
158
+ or the [Calkit cloud](https://calkit.io/calkit/example-notebook-pipeline).
@@ -0,0 +1,93 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# Notebook as a pipeline test"
8
+ ]
9
+ },
10
+ {
11
+ "cell_type": "code",
12
+ "execution_count": null,
13
+ "metadata": {},
14
+ "outputs": [],
15
+ "source": [
16
+ "%load_ext calkit.magics"
17
+ ]
18
+ },
19
+ {
20
+ "cell_type": "code",
21
+ "execution_count": null,
22
+ "metadata": {},
23
+ "outputs": [],
24
+ "source": [
25
+ "%%stage --name get-data-pickle --out df\n",
26
+ "\n",
27
+ "import pandas as pd\n",
28
+ "\n",
29
+ "df = pd.DataFrame({\"col1\": range(1000)})\n",
30
+ "df.describe()"
31
+ ]
32
+ },
33
+ {
34
+ "cell_type": "code",
35
+ "execution_count": null,
36
+ "metadata": {},
37
+ "outputs": [],
38
+ "source": [
39
+ "%%stage --name get-data --out df:parquet:pandas\n",
40
+ "\n",
41
+ "import pandas as pd\n",
42
+ "import time\n",
43
+ "\n",
44
+ "time.sleep(10)\n",
45
+ "\n",
46
+ "df = pd.DataFrame({\"col1\": range(1000)})\n",
47
+ "df.describe()"
48
+ ]
49
+ },
50
+ {
51
+ "cell_type": "code",
52
+ "execution_count": null,
53
+ "metadata": {},
54
+ "outputs": [],
55
+ "source": [
56
+ "%%stage --name plot --dep get-data:df:parquet:pandas --out fig\n",
57
+ "\n",
58
+ "fig = df.plot(backend=\"plotly\")\n",
59
+ "fig"
60
+ ]
61
+ },
62
+ {
63
+ "cell_type": "code",
64
+ "execution_count": null,
65
+ "metadata": {},
66
+ "outputs": [],
67
+ "source": [
68
+ "%%stage --name plot-fig --dep get-data:df:parquet:pandas --out fig --out-path figures/plot.png --out-type figure --out-title \"A plot of the data\" --out-desc \"This is a plot of the data.\"\n",
69
+ "\n",
70
+ "import os\n",
71
+ "\n",
72
+ "os.makedirs(\"figures\", exist_ok=True)\n",
73
+ "\n",
74
+ "fig = df.plot(backend=\"plotly\")\n",
75
+ "fig.write_image(\"figures/plot.png\")\n",
76
+ "fig"
77
+ ]
78
+ }
79
+ ],
80
+ "metadata": {
81
+ "kernelspec": {
82
+ "display_name": "base",
83
+ "language": "python",
84
+ "name": "python3"
85
+ },
86
+ "language_info": {
87
+ "name": "python",
88
+ "version": "3.12.4"
89
+ }
90
+ },
91
+ "nbformat": 4,
92
+ "nbformat_minor": 2
93
+ }
@@ -1,98 +0,0 @@
1
- """Core functionality."""
2
-
3
- from __future__ import annotations
4
-
5
- import glob
6
- import logging
7
- import os
8
- from datetime import UTC, datetime
9
-
10
- import ruamel.yaml
11
- from git import Repo
12
- from git.exc import InvalidGitRepositoryError
13
-
14
- logging.basicConfig(level=logging.INFO)
15
- logger = logging.getLogger(__package__)
16
-
17
- ryaml = ruamel.yaml.YAML()
18
- ryaml.indent(mapping=2, sequence=4, offset=2)
19
- ryaml.preserve_quotes = True
20
- ryaml.width = 70
21
-
22
-
23
- def find_project_dirs(relative=False, max_depth=3) -> list[str]:
24
- """Find all Calkit project directories."""
25
- if relative:
26
- start = ""
27
- else:
28
- start = os.path.expanduser("~")
29
- res = []
30
- for i in range(max_depth):
31
- pattern = os.path.join(start, *["*"] * (i + 1), "calkit.yaml")
32
- res += glob.glob(pattern)
33
- # Check GitHub documents for users who use GitHub Desktop
34
- pattern = os.path.join(
35
- start, "*", "GitHub", *["*"] * (i + 1), "calkit.yaml"
36
- )
37
- res += glob.glob(pattern)
38
- final_res = []
39
- for ck_fpath in res:
40
- path = os.path.dirname(ck_fpath)
41
- # Make sure this path is a Git repo
42
- try:
43
- Repo(path)
44
- except InvalidGitRepositoryError:
45
- continue
46
- final_res.append(path)
47
- return final_res
48
-
49
-
50
- def load_calkit_info(
51
- wdir=None, process_includes: bool | str | list[str] = False
52
- ) -> dict:
53
- """Load Calkit project information.
54
-
55
- Parameters
56
- ----------
57
- wdir : str
58
- Working directory. Defaults to current working directory.
59
- process_includes: bool, string or list of strings
60
- Whether or not to process any '_include' keys for a given kind of
61
- object. If a string is passed, only process includes for that kind.
62
- Similarly, if a list of strings is passed, only process those kinds.
63
- If True, process all default kinds.
64
- """
65
- info = {}
66
- fpath = "calkit.yaml"
67
- if wdir is not None:
68
- fpath = os.path.join(wdir, fpath)
69
- if os.path.isfile(fpath):
70
- with open(fpath) as f:
71
- info = ryaml.load(f)
72
- # Check for any includes, i.e., entities with an _include key, for which
73
- # we should merge in another file
74
- default_includes_enabled = ["environments", "procedures"]
75
- if process_includes:
76
- if isinstance(process_includes, bool):
77
- includes_enabled = default_includes_enabled
78
- elif isinstance(process_includes, str):
79
- includes_enabled = [process_includes]
80
- elif isinstance(process_includes, list):
81
- includes_enabled = process_includes
82
- for kind in includes_enabled:
83
- if kind in info:
84
- for obj_name, obj in info[kind].items():
85
- if "_include" in obj:
86
- include_fpath = obj.pop("_include")
87
- with open(include_fpath) as f:
88
- include_data = ryaml.load(f)
89
- info[kind][obj_name] |= include_data
90
- return info
91
-
92
-
93
- def utcnow(remove_tz=True) -> datetime:
94
- """Return now in UTC, optionally stripping timezone information."""
95
- dt = datetime.now(UTC)
96
- if remove_tz:
97
- dt = dt.replace(tzinfo=None)
98
- return dt
File without changes
File without changes