esrf-data-compressor 0.1.2__tar.gz → 0.2.1__tar.gz

This diff shows the changes between two publicly released versions of the package, as published to a supported registry. It is provided for informational purposes only.
Files changed (31)
  1. {esrf_data_compressor-0.1.2/src/esrf_data_compressor.egg-info → esrf_data_compressor-0.2.1}/PKG-INFO +6 -4
  2. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/README.md +6 -4
  3. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/pyproject.toml +2 -2
  4. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/src/esrf_data_compressor/checker/run_check.py +7 -6
  5. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/src/esrf_data_compressor/cli.py +15 -3
  6. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/src/esrf_data_compressor/compressors/base.py +96 -17
  7. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/src/esrf_data_compressor/compressors/jp2k.py +3 -3
  8. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/src/esrf_data_compressor/tests/test_cli.py +57 -3
  9. esrf_data_compressor-0.2.1/src/esrf_data_compressor/tests/test_paths.py +36 -0
  10. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/src/esrf_data_compressor/tests/test_run_check.py +18 -0
  11. esrf_data_compressor-0.2.1/src/esrf_data_compressor/utils/paths.py +129 -0
  12. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1/src/esrf_data_compressor.egg-info}/PKG-INFO +6 -4
  13. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/src/esrf_data_compressor.egg-info/SOURCES.txt +2 -0
  14. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/LICENSE +0 -0
  15. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/setup.cfg +0 -0
  16. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/src/esrf_data_compressor/__init__.py +0 -0
  17. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/src/esrf_data_compressor/checker/ssim.py +0 -0
  18. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/src/esrf_data_compressor/compressors/__init__.py +0 -0
  19. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/src/esrf_data_compressor/finder/finder.py +0 -0
  20. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/src/esrf_data_compressor/tests/__init__.py +0 -0
  21. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/src/esrf_data_compressor/tests/test_finder.py +0 -0
  22. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/src/esrf_data_compressor/tests/test_hdf5_helpers.py +0 -0
  23. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/src/esrf_data_compressor/tests/test_jp2k.py +0 -0
  24. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/src/esrf_data_compressor/tests/test_ssim.py +0 -0
  25. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/src/esrf_data_compressor/tests/test_utils.py +0 -0
  26. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/src/esrf_data_compressor/utils/hdf5_helpers.py +0 -0
  27. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/src/esrf_data_compressor/utils/utils.py +0 -0
  28. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/src/esrf_data_compressor.egg-info/dependency_links.txt +0 -0
  29. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/src/esrf_data_compressor.egg-info/entry_points.txt +0 -0
  30. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/src/esrf_data_compressor.egg-info/requires.txt +0 -0
  31. {esrf_data_compressor-0.1.2 → esrf_data_compressor-0.2.1}/src/esrf_data_compressor.egg-info/top_level.txt +0 -0
--- esrf_data_compressor-0.1.2/src/esrf_data_compressor.egg-info/PKG-INFO
+++ esrf_data_compressor-0.2.1/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: esrf-data-compressor
-Version: 0.1.2
+Version: 0.2.1
 Summary: A library to compress ESRF data and reduce their footprint
 Author-email: ESRF <dau-pydev@esrf.fr>
 License: MIT License
@@ -79,12 +79,14 @@ Dynamic: license-file
 
 * **Parallel execution**
 
-  * Automatically factors CPU cores into worker processes × per-process threads
-  * By default, each worker runs up to 4 Blosc2 threads (or falls back to 1 thread if < 4 cores)
+  * Automatically factors CPU cores into worker processes × per-process threads
+  * By default, each worker runs up to 2 Blosc2 threads (or falls back to 1 thread if < 2 cores)
 
 * **Non-destructive workflow**
 
-  1. `compress` writes a sibling file `<basename>_<compression_method>.h5` next to each original
+  1. `compress` writes compressed files either:
+     - next to each source as `<basename>_<compression_method>.h5` (`--layout sibling`), or
+     - under a mirrored `RAW_DATA_COMPRESSED` tree using the same source file names, while copying non-compressed folders/files (`--layout mirror`, default)
  2. `check` computes SSIM (first and last frames) and writes a report
  3. `overwrite` (optional) swaps out the raw frame file (irreversible)
 
--- esrf_data_compressor-0.1.2/README.md
+++ esrf_data_compressor-0.2.1/README.md
@@ -18,12 +18,14 @@
 
 * **Parallel execution**
 
-  * Automatically factors CPU cores into worker processes × per-process threads
-  * By default, each worker runs up to 4 Blosc2 threads (or falls back to 1 thread if < 4 cores)
+  * Automatically factors CPU cores into worker processes × per-process threads
+  * By default, each worker runs up to 2 Blosc2 threads (or falls back to 1 thread if < 2 cores)
 
 * **Non-destructive workflow**
 
-  1. `compress` writes a sibling file `<basename>_<compression_method>.h5` next to each original
+  1. `compress` writes compressed files either:
+     - next to each source as `<basename>_<compression_method>.h5` (`--layout sibling`), or
+     - under a mirrored `RAW_DATA_COMPRESSED` tree using the same source file names, while copying non-compressed folders/files (`--layout mirror`, default)
  2. `check` computes SSIM (first and last frames) and writes a report
  3. `overwrite` (optional) swaps out the raw frame file (irreversible)
 
@@ -119,4 +121,4 @@ All noteworthy changes are recorded in [CHANGELOG.md](CHANGELOG.md). Version 0.1
 * Four-command CLI (`compress-hdf5 list`, `compress-hdf5 compress`, `compress-hdf5 check`, `compress-hdf5 overwrite`).
 * Parallelism with worker×thread auto-factoring.
 
-For more details, see the full history in [CHANGELOG.md](CHANGELOG.md).
+For more details, see the full history in [CHANGELOG.md](CHANGELOG.md).
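
For reference, the mirror layout swaps the `RAW_DATA` path segment for `RAW_DATA_COMPRESSED` while keeping the file name, e.g. (illustrative ESRF-style path, matching the new tests further below):

    RAW_DATA/sampleA/ds1/scan0001/f1.h5  →  RAW_DATA_COMPRESSED/sampleA/ds1/scan0001/f1.h5

whereas the sibling layout would instead write `RAW_DATA/sampleA/ds1/scan0001/f1_jp2k.h5`.
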
--- esrf_data_compressor-0.1.2/pyproject.toml
+++ esrf_data_compressor-0.2.1/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "esrf-data-compressor"
-version = "0.1.2"
+version = "0.2.1"
 authors = [{ name = "ESRF", email = "dau-pydev@esrf.fr" }]
 description = "A library to compress ESRF data and reduce their footprint"
 readme = { file = "README.md", content-type = "text/markdown" }
@@ -69,4 +69,4 @@ omit = ["*/tests/*"]
 
 [tool.isort]
 profile = "black"
-force_single_line = true
+force_single_line = true
--- esrf_data_compressor-0.1.2/src/esrf_data_compressor/checker/run_check.py
+++ esrf_data_compressor-0.2.1/src/esrf_data_compressor/checker/run_check.py
@@ -3,12 +3,15 @@ from concurrent.futures import ProcessPoolExecutor, as_completed
 from tqdm import tqdm
 
 from esrf_data_compressor.checker.ssim import compute_ssim_for_file_pair
+from esrf_data_compressor.utils.paths import get_available_cpus, resolve_compressed_path
 
 
-def run_ssim_check(raw_files: list[str], method: str, report_path: str) -> None:
+def run_ssim_check(
+    raw_files: list[str], method: str, report_path: str, layout: str = "sibling"
+) -> None:
     """
     Given a list of raw HDF5 file paths, partitions into:
-      to_check → those with a sibling <stem>_<method>.h5
+      to_check → those with an expected compressed counterpart according to `layout`
      missing → those without one
 
     Writes a report to `report_path`:
@@ -21,9 +24,7 @@ def run_ssim_check(raw_files: list[str], method: str, report_path: str) -> None:
 
     # partition
     for orig in raw_files:
-        dirname, fname = os.path.dirname(orig), os.path.basename(orig)
-        stem, _ = os.path.splitext(fname)
-        comp_path = os.path.join(dirname, f"{stem}_{method}.h5")
+        comp_path = resolve_compressed_path(orig, method, layout=layout)
         if os.path.exists(comp_path):
             to_check.append((orig, comp_path))
         else:
@@ -45,7 +46,7 @@ def run_ssim_check(raw_files: list[str], method: str, report_path: str) -> None:
         return
 
     # run SSIM in parallel
-    n_workers = min(len(to_check), os.cpu_count() or 1)
+    n_workers = min(len(to_check), get_available_cpus())
     with ProcessPoolExecutor(max_workers=n_workers) as exe:
         futures = {
             exe.submit(compute_ssim_for_file_pair, orig, comp): (orig, comp)
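
For orientation, a minimal sketch of the new signature in use (paths are illustrative; the signature and the `layout="sibling"` default come from the hunk above):

    from esrf_data_compressor.checker.run_check import run_ssim_check

    run_ssim_check(
        ["/data/visitor/e/bl/s/RAW_DATA/sample/ds/f1.h5"],  # raw HDF5 inputs
        method="jp2k",
        report_path="/tmp/ssim_report.txt",
        layout="mirror",  # look for counterparts under RAW_DATA_COMPRESSED
    )
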
--- esrf_data_compressor-0.1.2/src/esrf_data_compressor/cli.py
+++ esrf_data_compressor-0.2.1/src/esrf_data_compressor/cli.py
@@ -50,9 +50,9 @@ def do_compress(args):
         return
 
     print(
-        f"Compressing {len(files)} file(s) from '{report}' using '{args.method}' method and ratio {args.cratio} …"
+        f"Compressing {len(files)} file(s) from '{report}' using '{args.method}' method, ratio {args.cratio}, layout '{args.layout}' …"
     )
-    mgr = CompressorManager(cratio=args.cratio, method=args.method)
+    mgr = CompressorManager(cratio=args.cratio, method=args.method, layout=args.layout)
     mgr.compress_files(files)
     print("Compression complete.\n")
 
@@ -72,7 +72,7 @@ def do_check(args):
     report_path = os.path.abspath(report_fname)
 
     try:
-        run_ssim_check(files, args.method, report_path)
+        run_ssim_check(files, args.method, report_path, layout=args.layout)
     except SystemExit as e:
         exit_with_error(str(e))
 
@@ -142,6 +142,12 @@ def main():
         default="jp2k",
         help="Compression method",
     )
+    p.add_argument(
+        "--layout",
+        choices=["sibling", "mirror"],
+        default="mirror",
+        help="Output layout: sibling (next to each source) or mirror (under RAW_DATA_COMPRESSED, preserving source names).",
+    )
     p.set_defaults(func=do_compress)
 
     p = sub.add_parser("check", help="Generate SSIM report for TO COMPRESS files")
@@ -151,6 +157,12 @@ def main():
     p.add_argument(
         "--method", choices=["jp2k"], default="jp2k", help="Compression method"
     )
+    p.add_argument(
+        "--layout",
+        choices=["sibling", "mirror"],
+        default="mirror",
+        help="Location of compressed files to check.",
+    )
     p.set_defaults(func=do_check)
 
     p = sub.add_parser(
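
Both subcommands gain the same flag. With the entry point documented in the README, typical invocations would look like this (flags as defined in the hunks above; the report file is the one produced earlier in the four-command workflow):

    compress-hdf5 compress -i report.txt --cratio 10 --method jp2k --layout mirror
    compress-hdf5 check -i report.txt --method jp2k --layout mirror

Note that `--layout` defaults to `mirror`, so workflows that relied on the 0.1.2 sibling output now need an explicit `--layout sibling`.
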
--- esrf_data_compressor-0.1.2/src/esrf_data_compressor/compressors/base.py
+++ esrf_data_compressor-0.2.1/src/esrf_data_compressor/compressors/base.py
@@ -1,8 +1,15 @@
 import os
-from concurrent.futures import ProcessPoolExecutor, as_completed
+import shutil
+from pathlib import Path
+from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
 from tqdm import tqdm
 
 from esrf_data_compressor.compressors.jp2k import JP2KCompressorWrapper
+from esrf_data_compressor.utils.paths import (
+    get_available_cpus,
+    resolve_compressed_path,
+    resolve_mirror_path,
+)
 
 
 class Compressor:
@@ -18,11 +25,11 @@ class CompressorManager:
     """
     Manages parallel compression and overwrite.
 
-    Each worker process is given up to 4 Blosc2 threads (or fewer if the machine
+    Each worker process is given up to 2 Blosc2 threads (or fewer if the machine
     has fewer than 4 cores). The number of worker processes is then
     total_cores // threads_per_worker (at least 1). If the user explicitly
     passes `workers`, we cap it to `total_cores`, then recompute threads_per_worker
-    = min(4, total_cores // workers).
+    = min(2, total_cores // workers).
 
     Usage:
        mgr = CompressorManager(cratio=10, method='jp2k')
@@ -31,10 +38,14 @@ class CompressorManager:
     """
 
     def __init__(
-        self, workers: int | None = None, cratio: int = 10, method: str = "jp2k"
+        self,
+        workers: int | None = None,
+        cratio: int = 10,
+        method: str = "jp2k",
+        layout: str = "sibling",
     ):
-        total_cores = os.cpu_count() or 1
-        default_nthreads = 4 if total_cores >= 4 else 1
+        total_cores = get_available_cpus()
+        default_nthreads = 2 if total_cores >= 2 else 1
         default_workers = max(1, total_cores // default_nthreads)
 
         if workers is None:
@@ -43,12 +54,13 @@ class CompressorManager:
         else:
             w = min(workers, total_cores)
             possible = total_cores // w
-            nthreads = min(possible, 4) if possible >= 1 else 1
+            nthreads = min(possible, 2) if possible >= 1 else 1
 
         self.workers = max(1, w)
         self.nthreads = max(1, nthreads)
         self.cratio = cratio
         self.method = method
+        self.layout = layout
 
         if self.method == "jp2k":
             self.compressor = JP2KCompressorWrapper(
@@ -58,33 +70,98 @@ class CompressorManager:
             raise ValueError(f"Unsupported compression method: {self.method}")
 
         print(f"Compression method: {self.method}")
+        print(f"Output layout: {self.layout}")
         print(f"Total CPU cores: {total_cores}")
         print(f"Worker processes: {self.workers}")
         print(f"Threads per worker: {self.nthreads}")
         print(f"Total threads: {self.workers * self.nthreads}")
 
+    @staticmethod
+    def _find_raw_root(path: str) -> str | None:
+        p = Path(os.path.abspath(path))
+        parts = p.parts
+        if "RAW_DATA" not in parts:
+            return None
+        return str(Path(*parts[: parts.index("RAW_DATA") + 1]))
+
     def _compress_worker(self, ipath: str) -> tuple[str, str]:
         """
         Worker function for ProcessPoolExecutor: compress a single HDF5:
-          <ipath>.h5 → <same_dir>/<basename>_<method>.h5
+          - sibling layout: <same_dir>/<basename>_<method>.h5
+          - mirror layout: mirror RAW_DATA tree under RAW_DATA_COMPRESSED
         """
-        base, _ = os.path.splitext(ipath)
-        outp = f"{base}_{self.method}.h5"
+        outp = resolve_compressed_path(ipath, self.method, layout=self.layout)
+        os.makedirs(os.path.dirname(outp), exist_ok=True)
         self.compressor.compress_file(
             ipath, outp, cratio=self.cratio, nthreads=self.nthreads
         )
         return ipath, "success"
 
+    def _mirror_non_compressed_dataset_content(self, file_list: list[str]) -> None:
+        source_targets = {os.path.realpath(p) for p in file_list}
+        raw_roots: set[str] = set()
+        for ipath in file_list:
+            raw_root = self._find_raw_root(ipath)
+            if raw_root:
+                raw_roots.add(raw_root)
+
+        copy_tasks: list[tuple[str, str]] = []
+        for src_dir in sorted(raw_roots):
+            try:
+                dst_dir = resolve_mirror_path(src_dir)
+            except ValueError:
+                print(f"WARNING: Cannot mirror folder outside RAW_DATA: '{src_dir}'")
+                continue
+
+            for cur, dirs, files in os.walk(src_dir):
+                rel_cur = os.path.relpath(cur, src_dir)
+                target_cur = (
+                    dst_dir if rel_cur == "." else os.path.join(dst_dir, rel_cur)
+                )
+                os.makedirs(target_cur, exist_ok=True)
+
+                for dname in dirs:
+                    os.makedirs(os.path.join(target_cur, dname), exist_ok=True)
+
+                for fname in files:
+                    src_file = os.path.join(cur, fname)
+                    if os.path.realpath(src_file) in source_targets:
+                        # Do not copy raw files that will be produced by compression.
+                        continue
+                    dst_file = os.path.join(target_cur, fname)
+                    copy_tasks.append((src_file, dst_file))
+
+        if not copy_tasks:
+            return
+
+        max_workers = min(len(copy_tasks), max(1, get_available_cpus()), 8)
+        with ThreadPoolExecutor(max_workers=max_workers) as executor:
+            futures = {
+                executor.submit(shutil.copy2, s, d): (s, d) for s, d in copy_tasks
+            }
+            for fut in as_completed(futures):
+                src_file, dst_file = futures[fut]
+                try:
+                    fut.result()
+                except Exception as e:
+                    print(f"WARNING: Failed to copy '{src_file}' → '{dst_file}': {e}")
+
     def compress_files(self, file_list: list[str]) -> None:
         """
-        Compress each .h5 in file_list in parallel, producing <basename>_<method>.h5
-        next to each source file. Does not overwrite originals. At the end, prints
-        total elapsed time and data rate in MB/s.
+        Compress each .h5 in file_list in parallel.
+          - sibling layout: produce <basename>_<method>.h5 next to each source.
+          - mirror layout: write compressed files to RAW_DATA_COMPRESSED with same file names.
+        Does not overwrite originals. At the end, prints total elapsed time and data rate in MB/s.
         """
         valid = [p for p in file_list if p.lower().endswith(".h5")]
         if not valid:
             print("No valid .h5 files to compress.")
             return
+        if self.layout == "mirror":
+            print(
+                "Preparing RAW_DATA_COMPRESSED with non-compressed dataset content..."
+            )
+            self._mirror_non_compressed_dataset_content(valid)
 
         total_bytes = 0
         for f in valid:
@@ -130,8 +207,9 @@ class CompressorManager:
             if not ipath.lower().endswith(".h5"):
                 continue
 
-            base, _ = os.path.splitext(ipath)
-            compressed_path = f"{base}_{self.method}.h5"
+            compressed_path = resolve_compressed_path(
+                ipath, self.method, layout=self.layout
+            )
 
             if os.path.exists(compressed_path):
                 backup = ipath + ".bak"
@@ -184,9 +262,10 @@ class CompressorManager:
             if not ipath.lower().endswith(".h5"):
                 continue
 
-            base, _ = os.path.splitext(ipath)
             backup = ipath + ".bak"
-            method_path = f"{base}_{self.method}.h5"
+            method_path = resolve_compressed_path(
+                ipath, self.method, layout=self.layout
+            )
 
             if not os.path.exists(backup):
                 print(f"SKIP (no backup): {ipath}")
--- esrf_data_compressor-0.1.2/src/esrf_data_compressor/compressors/jp2k.py
+++ esrf_data_compressor-0.2.1/src/esrf_data_compressor/compressors/jp2k.py
@@ -54,8 +54,7 @@ class JP2KCompressor:
         )
 
     def _compress_3d(self, name: str, src_dset: h5py.Dataset, dst_grp: h5py.Group):
-        data = src_dset[()]
-        Z, Y, X = data.shape
+        Z, Y, X = src_dset.shape
 
         dst_dset = dst_grp.create_dataset(
             name,
@@ -70,7 +69,8 @@ class JP2KCompressor:
         t0 = time.perf_counter()
 
         for z in range(Z):
-            plane = data[z, :, :]
+            # Read one slice at a time to reduce peak RAM usage.
+            plane = src_dset[z, :, :]
             t1 = time.perf_counter()
             b2im = blosc2.asarray(
                 plane[np.newaxis, ...],
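
The `_compress_3d` change above streams the volume instead of materialising it with `src_dset[()]`. A minimal sketch of the same pattern (hypothetical file and dataset names):

    import h5py

    with h5py.File("frames.h5", "r") as f:
        dset = f["/entry/data"]      # assumed 3-D dataset
        Z, Y, X = dset.shape
        for z in range(Z):
            plane = dset[z, :, :]    # h5py reads only this slice from disk
            # hand `plane` to the per-frame compressor here
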
--- esrf_data_compressor-0.1.2/src/esrf_data_compressor/tests/test_cli.py
+++ esrf_data_compressor-0.2.1/src/esrf_data_compressor/tests/test_cli.py
@@ -109,7 +109,7 @@ def test_commands_with_non_empty_list(
     # Run command
     argv = [cmd, "-i", "report.txt"]
     if cmd == "compress":
-        argv += ["--cratio", "5", "--method", "jp2k"]
+        argv += ["--cratio", "5", "--method", "jp2k", "--layout", "sibling"]
     argv_runner(argv)
     out = capsys.readouterr().out
     assert msg_start in out
@@ -167,17 +167,71 @@ def test_empty_reports(argv_runner, monkeypatch, capsys, cmd, empty_msg, tmp_path
 def test_check_success_writes_report(argv_runner, monkeypatch, capsys, tmp_path):
     monkeypatch.setattr(cli, "parse_report", lambda rpt: ["f"])
 
-    def run(files, method, out):
+    def run(files, method, out, layout):
+        assert layout == "sibling"
         with open(out, "w") as f:
             f.write("ok")
 
     monkeypatch.setattr(cli, "run_ssim_check", run)
     report = tmp_path / "rpt.txt"
-    argv_runner(["check", "-i", str(report), "--method", "jp2k"])
+    argv_runner(["check", "-i", str(report), "--method", "jp2k", "--layout", "sibling"])
     out = capsys.readouterr().out
     assert "SSIM report written to" in out
 
 
+def test_compress_mirror_layout_creates_under_raw_data_compressed(
+    argv_runner, monkeypatch, tmp_path
+):
+    ds = tmp_path / "RAW_DATA" / "sampleA" / "ds1"
+    src = ds / "scan0001" / "f1.h5"
+    src.parent.mkdir(parents=True)
+    src.write_text("data")
+    base = ds / "dataset.h5"
+    base.write_text("base")
+    sample_sidecar = tmp_path / "RAW_DATA" / "sampleA" / "sample_sidecar.h5"
+    sample_sidecar.write_text("sidecar")
+    other_sample_sidecar = tmp_path / "RAW_DATA" / "sampleB" / "other_sidecar.h5"
+    other_sample_sidecar.parent.mkdir(parents=True)
+    other_sample_sidecar.write_text("other")
+    side = ds / "scan0002" / "meta.txt"
+    side.parent.mkdir(parents=True)
+    side.write_text("meta")
+    monkeypatch.setattr(cli, "parse_report", lambda rpt: [str(src)])
+    monkeypatch.setattr(
+        JP2KCompressorWrapper,
+        "compress_file",
+        lambda self, inp, out, **kw: open(out, "w").close(),
+    )
+
+    argv_runner(
+        [
+            "compress",
+            "-i",
+            "report.txt",
+            "--cratio",
+            "5",
+            "--method",
+            "jp2k",
+            "--layout",
+            "mirror",
+        ]
+    )
+
+    # The dataset base/filter file is mirrored under RAW_DATA_COMPRESSED.
+    assert (
+        tmp_path / "RAW_DATA_COMPRESSED" / "sampleA" / "ds1" / "dataset.h5"
+    ).exists()
+    # Compressed file keeps the same source name under mirrored scan path.
+    assert (
+        tmp_path / "RAW_DATA_COMPRESSED" / "sampleA" / "ds1" / "scan0001" / "f1.h5"
+    ).exists()
+    assert (
+        tmp_path / "RAW_DATA_COMPRESSED" / "sampleA" / "ds1" / "scan0002" / "meta.txt"
+    ).exists()
+    assert (tmp_path / "RAW_DATA_COMPRESSED" / "sampleA" / "sample_sidecar.h5").exists()
+    assert (tmp_path / "RAW_DATA_COMPRESSED" / "sampleB" / "other_sidecar.h5").exists()
+
+
 def test_overwrite_final_deletes_backups(argv_runner, monkeypatch, capsys, tmp_path):
     # Prepare a file and its backup
     (tmp_path / "f1.h5").write_text("current")
--- /dev/null
+++ esrf_data_compressor-0.2.1/src/esrf_data_compressor/tests/test_paths.py
@@ -0,0 +1,36 @@
+import pytest
+
+from esrf_data_compressor.utils.paths import (
+    find_dataset_base_h5,
+    resolve_compressed_path,
+    resolve_mirror_path,
+)
+
+
+def test_resolve_compressed_path_sibling():
+    p = "/data/visitor/e/bl/s/RAW_DATA/sample/ds/f1.h5"
+    out = resolve_compressed_path(p, "jp2k", layout="sibling")
+    assert out == "/data/visitor/e/bl/s/RAW_DATA/sample/ds/f1_jp2k.h5"
+
+
+def test_resolve_compressed_path_mirror():
+    p = "/data/visitor/e/bl/s/RAW_DATA/sample/ds/f1.h5"
+    out = resolve_compressed_path(p, "jp2k", layout="mirror")
+    assert out == "/data/visitor/e/bl/s/RAW_DATA_COMPRESSED/sample/ds/f1.h5"
+
+
+def test_resolve_mirror_path_requires_raw_data():
+    with pytest.raises(ValueError):
+        resolve_mirror_path("/tmp/no_raw_data_here/f1.h5")
+
+
+def test_find_dataset_base_h5(tmp_path):
+    ds = tmp_path / "RAW_DATA" / "sample" / "ds1"
+    scan = ds / "scan0001"
+    scan.mkdir(parents=True)
+    base = ds / "dataset.h5"
+    base.write_text("base")
+    src = scan / "frames.h5"
+    src.write_text("source")
+
+    assert find_dataset_base_h5(str(src)) == str(base)
--- esrf_data_compressor-0.1.2/src/esrf_data_compressor/tests/test_run_check.py
+++ esrf_data_compressor-0.2.1/src/esrf_data_compressor/tests/test_run_check.py
@@ -105,3 +105,21 @@ def test_ssim_error_handling(tmp_path, monkeypatch):
     # should include an ERROR line mentioning the exception message
     assert any("ERROR processing file pair" in line for line in lines)
     assert any("Error" in line for line in lines)
+
+
+def test_mirror_layout_finds_compressed_file(tmp_path, monkeypatch):
+    raw = tmp_path / "RAW_DATA" / "sample" / "ds" / "d3.h5"
+    comp = tmp_path / "RAW_DATA_COMPRESSED" / "sample" / "ds" / "d3.h5"
+    raw.parent.mkdir(parents=True)
+    comp.parent.mkdir(parents=True)
+    raw.write_text("r3")
+    comp.write_text("c3")
+    report = tmp_path / "report.txt"
+
+    monkeypatch.setattr(rs, "compute_ssim_for_file_pair", lambda o, c: ("d3", ["ok"]))
+
+    rs.run_ssim_check(
+        [str(raw)], method="method", report_path=str(report), layout="mirror"
+    )
+    lines = _read_report(report)
+    assert lines[2] == f"Compressed file: {comp}"
--- /dev/null
+++ esrf_data_compressor-0.2.1/src/esrf_data_compressor/utils/paths.py
@@ -0,0 +1,129 @@
+import os
+from pathlib import Path
+import re
+
+
+def _parse_slurm_cpus_env() -> int | None:
+    """
+    Return CPU count from SLURM env vars if available.
+    """
+    candidates = [
+        ("SLURM_CPUS_PER_TASK", None),
+        ("SLURM_CPUS_ON_NODE", None),
+        ("SLURM_JOB_CPUS_PER_NODE", "1"),
+        ("SLURM_TASKS_PER_NODE", None),
+    ]
+    for key, fallback in candidates:
+        val = os.environ.get(key)
+        if not val:
+            continue
+        if key == "SLURM_JOB_CPUS_PER_NODE":
+            # Formats like "32(x2)" or "32,32" or "32"
+            val = val.split(",")[0]
+            if "(x" in val:
+                val = val.split("(x", 1)[0]
+        if key == "SLURM_TASKS_PER_NODE":
+            # Often like "1" or "2(x3)"
+            if "(x" in val:
+                val = val.split("(x", 1)[0]
+        try:
+            n = int(val)
+            if n > 0:
+                return n
+        except ValueError:
+            if fallback is not None:
+                try:
+                    n = int(fallback)
+                    if n > 0:
+                        return n
+                except ValueError:
+                    pass
+    return None
+
+
+def get_available_cpus() -> int:
+    """
+    Use SLURM-provided CPU count when available; otherwise fall back to os.cpu_count().
+    """
+    slurm = _parse_slurm_cpus_env()
+    if slurm is not None:
+        return slurm
+    return os.cpu_count() or 1
+
+
+def resolve_mirror_path(
+    input_path: str,
+    *,
+    source_root: str = "RAW_DATA",
+    target_root: str = "RAW_DATA_COMPRESSED",
+) -> str:
+    """
+    Build a mirrored path under `target_root` by replacing the `source_root`
+    segment in `input_path`.
+    """
+    parts = Path(input_path).parts
+    if source_root not in parts:
+        raise ValueError(
+            f"Cannot mirror path '{input_path}': missing '{source_root}' segment."
+        )
+    idx = parts.index(source_root)
+    return str(Path(*parts[:idx], target_root, *parts[idx + 1 :]))
+
+
+def resolve_compressed_path(
+    input_path: str,
+    method: str,
+    *,
+    layout: str = "sibling",
+    source_root: str = "RAW_DATA",
+    target_root: str = "RAW_DATA_COMPRESSED",
+) -> str:
+    if layout == "sibling":
+        base_name = os.path.splitext(os.path.basename(input_path))[0]
+        compressed_name = f"{base_name}_{method}.h5"
+        return os.path.join(os.path.dirname(input_path), compressed_name)
+    if layout == "mirror":
+        # In mirror mode, compressed files keep the same file name as source.
+        return resolve_mirror_path(
+            input_path, source_root=source_root, target_root=target_root
+        )
+    raise ValueError(f"Unsupported layout: {layout}")
+
+
+def find_dataset_base_h5(
+    input_path: str,
+    *,
+    source_root: str = "RAW_DATA",
+) -> str | None:
+    """
+    Walk up from `input_path` to find the dataset directory that contains:
+      - exactly one .h5 file (the base/filter file)
+      - at least one scanXXXX subdirectory
+    Returns the absolute path to that .h5, or None when not found.
+    """
+    scan_re = re.compile(r"^scan\d{4}$", re.IGNORECASE)
+    p = Path(input_path).resolve()
+    parts = p.parts
+    if source_root not in parts:
+        return None
+
+    root_idx = parts.index(source_root)
+    cur = p.parent
+    while True:
+        if len(cur.parts) < root_idx + 1:
+            return None
+
+        try:
+            entries = list(cur.iterdir())
+        except OSError:
+            entries = []
+
+        h5_files = [e for e in entries if e.is_file() and e.suffix.lower() == ".h5"]
+        has_scan = any(e.is_dir() and scan_re.match(e.name) for e in entries)
+
+        if has_scan and len(h5_files) == 1:
+            return str(h5_files[0])
+
+        if len(cur.parts) == root_idx + 1:
+            return None
+        cur = cur.parent
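
A quick sketch of the new helpers in action (paths are illustrative and mirror the expectations in test_paths.py above):

    from esrf_data_compressor.utils.paths import (
        get_available_cpus,
        resolve_compressed_path,
        resolve_mirror_path,
    )

    p = "/data/visitor/e/bl/s/RAW_DATA/sample/ds/f1.h5"
    resolve_compressed_path(p, "jp2k", layout="sibling")
    # "/data/visitor/e/bl/s/RAW_DATA/sample/ds/f1_jp2k.h5"
    resolve_compressed_path(p, "jp2k", layout="mirror")
    # "/data/visitor/e/bl/s/RAW_DATA_COMPRESSED/sample/ds/f1.h5"
    resolve_mirror_path("/tmp/no_raw_data_here/f1.h5")  # raises ValueError
    get_available_cpus()  # SLURM_* env vars when set, else os.cpu_count()
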
--- esrf_data_compressor-0.1.2/PKG-INFO
+++ esrf_data_compressor-0.2.1/src/esrf_data_compressor.egg-info/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: esrf-data-compressor
-Version: 0.1.2
+Version: 0.2.1
 Summary: A library to compress ESRF data and reduce their footprint
 Author-email: ESRF <dau-pydev@esrf.fr>
 License: MIT License
@@ -79,12 +79,14 @@ Dynamic: license-file
 
 * **Parallel execution**
 
-  * Automatically factors CPU cores into worker processes × per-process threads
-  * By default, each worker runs up to 4 Blosc2 threads (or falls back to 1 thread if < 4 cores)
+  * Automatically factors CPU cores into worker processes × per-process threads
+  * By default, each worker runs up to 2 Blosc2 threads (or falls back to 1 thread if < 2 cores)
 
 * **Non-destructive workflow**
 
-  1. `compress` writes a sibling file `<basename>_<compression_method>.h5` next to each original
+  1. `compress` writes compressed files either:
+     - next to each source as `<basename>_<compression_method>.h5` (`--layout sibling`), or
+     - under a mirrored `RAW_DATA_COMPRESSED` tree using the same source file names, while copying non-compressed folders/files (`--layout mirror`, default)
  2. `check` computes SSIM (first and last frames) and writes a report
  3. `overwrite` (optional) swaps out the raw frame file (irreversible)
 
--- esrf_data_compressor-0.1.2/src/esrf_data_compressor.egg-info/SOURCES.txt
+++ esrf_data_compressor-0.2.1/src/esrf_data_compressor.egg-info/SOURCES.txt
@@ -20,8 +20,10 @@ src/esrf_data_compressor/tests/test_cli.py
 src/esrf_data_compressor/tests/test_finder.py
 src/esrf_data_compressor/tests/test_hdf5_helpers.py
 src/esrf_data_compressor/tests/test_jp2k.py
+src/esrf_data_compressor/tests/test_paths.py
 src/esrf_data_compressor/tests/test_run_check.py
 src/esrf_data_compressor/tests/test_ssim.py
 src/esrf_data_compressor/tests/test_utils.py
 src/esrf_data_compressor/utils/hdf5_helpers.py
+src/esrf_data_compressor/utils/paths.py
 src/esrf_data_compressor/utils/utils.py