mapillary-downloader 0.3.1.tar.gz → 0.4.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (16)
  1. {mapillary_downloader-0.3.1 → mapillary_downloader-0.4.0}/PKG-INFO +30 -18
  2. {mapillary_downloader-0.3.1 → mapillary_downloader-0.4.0}/README.md +29 -17
  3. {mapillary_downloader-0.3.1 → mapillary_downloader-0.4.0}/pyproject.toml +1 -1
  4. {mapillary_downloader-0.3.1 → mapillary_downloader-0.4.0}/src/mapillary_downloader/__main__.py +43 -10
  5. {mapillary_downloader-0.3.1 → mapillary_downloader-0.4.0}/src/mapillary_downloader/downloader.py +113 -6
  6. mapillary_downloader-0.4.0/src/mapillary_downloader/ia_check.py +33 -0
  7. {mapillary_downloader-0.3.1 → mapillary_downloader-0.4.0}/src/mapillary_downloader/ia_meta.py +12 -11
  8. {mapillary_downloader-0.3.1 → mapillary_downloader-0.4.0}/src/mapillary_downloader/logging_config.py +20 -0
  9. {mapillary_downloader-0.3.1 → mapillary_downloader-0.4.0}/src/mapillary_downloader/tar_sequences.py +34 -18
  10. {mapillary_downloader-0.3.1 → mapillary_downloader-0.4.0}/src/mapillary_downloader/worker.py +7 -0
  11. {mapillary_downloader-0.3.1 → mapillary_downloader-0.4.0}/LICENSE.md +0 -0
  12. {mapillary_downloader-0.3.1 → mapillary_downloader-0.4.0}/src/mapillary_downloader/__init__.py +0 -0
  13. {mapillary_downloader-0.3.1 → mapillary_downloader-0.4.0}/src/mapillary_downloader/client.py +0 -0
  14. {mapillary_downloader-0.3.1 → mapillary_downloader-0.4.0}/src/mapillary_downloader/exif_writer.py +0 -0
  15. {mapillary_downloader-0.3.1 → mapillary_downloader-0.4.0}/src/mapillary_downloader/utils.py +0 -0
  16. {mapillary_downloader-0.3.1 → mapillary_downloader-0.4.0}/src/mapillary_downloader/webp_converter.py +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: mapillary_downloader
- Version: 0.3.1
+ Version: 0.4.0
  Summary: Download your Mapillary data before it's gone
  Author-email: Gareth Davidson <gaz@bitplane.net>
  Requires-Python: >=3.10
@@ -47,37 +47,43 @@ First, get your Mapillary API access token from
  [the developer dashboard](https://www.mapillary.com/dashboard/developers)

  ```bash
- # Set token via environment variable
+ # Set token via environment variable (recommended)
  export MAPILLARY_TOKEN=YOUR_TOKEN
- mapillary-downloader --username SOME_USERNAME --output ./downloads
+ mapillary-downloader USERNAME1 USERNAME2 USERNAME3

  # Or pass token directly, and have it in your shell history 💩👀
- mapillary-downloader --token YOUR_TOKEN --username SOME_USERNAME --output ./downloads
+ mapillary-downloader --token YOUR_TOKEN USERNAME1 USERNAME2
+
+ # Download to specific directory
+ mapillary-downloader --output ./downloads USERNAME1
  ```

- | option | because | default |
- | ------------- | ------------------------------------- | ------------------ |
- | `--username` | Mapillary username | None (required) |
- | `--token` | Mapillary API token (or env var) | `$MAPILLARY_TOKEN` |
- | `--output` | Output directory | `./mapillary_data` |
- | `--quality` | 256, 1024, 2048 or original | `original` |
- | `--bbox` | `west,south,east,north` | `None` |
- | `--webp` | Convert to WebP (saves ~70% space) | `False` |
- | `--workers` | Number of parallel download workers | Half of CPU count |
- | `--no-tar` | Don't tar sequence directories | `False` |
+ | option | because | default |
+ | --------------- | -------------------------------------------- | ------------------ |
+ | `usernames` | One or more Mapillary usernames | (required) |
+ | `--token` | Mapillary API token (or env var) | `$MAPILLARY_TOKEN` |
+ | `--output` | Output directory | `./mapillary_data` |
+ | `--quality` | 256, 1024, 2048 or original | `original` |
+ | `--bbox` | `west,south,east,north` | `None` |
+ | `--no-webp` | Don't convert to WebP | `False` |
+ | `--workers` | Number of parallel download workers | Half of CPU count |
+ | `--no-tar` | Don't tar sequence directories | `False` |
+ | `--no-check-ia` | Don't check if exists on Internet Archive | `False` |

  The downloader will:

- * 📷 Download a user's images organized by sequence
+ * 📷 Download multiple users' images organized by sequence
  * 📜 Inject EXIF metadata (GPS coordinates, camera info, timestamps,
  compass direction)
  * 🛟 Save progress so you can safely resume if interrupted
- * 🗜️ Optionally convert to WebP to save space
+ * 🗜️ Convert to WebP by default to save ~70% disk space
  * 📦 Tar sequence directories for faster uploads
+ * 🏛️ Check Internet Archive to avoid duplicate downloads
+ * 💾 Stage downloads in cache, move atomically when complete

  ## WebP Conversion

- You'll need `cwebp` to use the `--webp` flag. So install it:
+ WebP conversion is **enabled by default** (saves ~70% disk space). You'll need the `cwebp` binary installed:

  ```bash
  # Debian/Ubuntu
@@ -87,6 +93,12 @@ sudo apt install webp
  brew install webp
  ```

+ To disable WebP conversion and keep original JPEGs, use `--no-webp`:
+
+ ```bash
+ mapillary-downloader --no-webp USERNAME
+ ```
+
  ## Sequence Tarball Creation

  By default, sequence directories are automatically tarred after download because
@@ -96,7 +108,7 @@ uploading files to IA.
  To keep individual files instead of creating tars, use the `--no-tar` flag:

  ```bash
- mapillary-downloader --username WHOEVER --no-tar
+ mapillary-downloader --no-tar USERNAME
  ```

  ## Internet Archive upload
@@ -17,37 +17,43 @@ First, get your Mapillary API access token from
  [the developer dashboard](https://www.mapillary.com/dashboard/developers)

  ```bash
- # Set token via environment variable
+ # Set token via environment variable (recommended)
  export MAPILLARY_TOKEN=YOUR_TOKEN
- mapillary-downloader --username SOME_USERNAME --output ./downloads
+ mapillary-downloader USERNAME1 USERNAME2 USERNAME3

  # Or pass token directly, and have it in your shell history 💩👀
- mapillary-downloader --token YOUR_TOKEN --username SOME_USERNAME --output ./downloads
+ mapillary-downloader --token YOUR_TOKEN USERNAME1 USERNAME2
+
+ # Download to specific directory
+ mapillary-downloader --output ./downloads USERNAME1
  ```

- | option | because | default |
- | ------------- | ------------------------------------- | ------------------ |
- | `--username` | Mapillary username | None (required) |
- | `--token` | Mapillary API token (or env var) | `$MAPILLARY_TOKEN` |
- | `--output` | Output directory | `./mapillary_data` |
- | `--quality` | 256, 1024, 2048 or original | `original` |
- | `--bbox` | `west,south,east,north` | `None` |
- | `--webp` | Convert to WebP (saves ~70% space) | `False` |
- | `--workers` | Number of parallel download workers | Half of CPU count |
- | `--no-tar` | Don't tar sequence directories | `False` |
+ | option | because | default |
+ | --------------- | -------------------------------------------- | ------------------ |
+ | `usernames` | One or more Mapillary usernames | (required) |
+ | `--token` | Mapillary API token (or env var) | `$MAPILLARY_TOKEN` |
+ | `--output` | Output directory | `./mapillary_data` |
+ | `--quality` | 256, 1024, 2048 or original | `original` |
+ | `--bbox` | `west,south,east,north` | `None` |
+ | `--no-webp` | Don't convert to WebP | `False` |
+ | `--workers` | Number of parallel download workers | Half of CPU count |
+ | `--no-tar` | Don't tar sequence directories | `False` |
+ | `--no-check-ia` | Don't check if exists on Internet Archive | `False` |

  The downloader will:

- * 📷 Download a user's images organized by sequence
+ * 📷 Download multiple users' images organized by sequence
  * 📜 Inject EXIF metadata (GPS coordinates, camera info, timestamps,
  compass direction)
  * 🛟 Save progress so you can safely resume if interrupted
- * 🗜️ Optionally convert to WebP to save space
+ * 🗜️ Convert to WebP by default to save ~70% disk space
  * 📦 Tar sequence directories for faster uploads
+ * 🏛️ Check Internet Archive to avoid duplicate downloads
+ * 💾 Stage downloads in cache, move atomically when complete

  ## WebP Conversion

- You'll need `cwebp` to use the `--webp` flag. So install it:
+ WebP conversion is **enabled by default** (saves ~70% disk space). You'll need the `cwebp` binary installed:

  ```bash
  # Debian/Ubuntu
@@ -57,6 +63,12 @@ sudo apt install webp
  brew install webp
  ```

+ To disable WebP conversion and keep original JPEGs, use `--no-webp`:
+
+ ```bash
+ mapillary-downloader --no-webp USERNAME
+ ```
+
  ## Sequence Tarball Creation

  By default, sequence directories are automatically tarred after download because
@@ -66,7 +78,7 @@ uploading files to IA.
  To keep individual files instead of creating tars, use the `--no-tar` flag:

  ```bash
- mapillary-downloader --username WHOEVER --no-tar
+ mapillary-downloader --no-tar USERNAME
  ```

  ## Internet Archive upload
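The "💾 Stage downloads in cache, move atomically when complete" bullet added to the README above boils down to the pattern sketched below. This is a minimal illustration with a hypothetical collection name and the default output directory, not the exact code; the real logic lands in `downloader.py` further down in this diff.

```python
# Sketch of the 0.4.0 staging layout (illustrative names; see downloader.py below).
import os
import shutil
from pathlib import Path

# XDG cache root, falling back to ~/.cache when XDG_CACHE_HOME is unset
cache_root = Path(os.environ.get("XDG_CACHE_HOME", Path.home() / ".cache")) / "mapillary_downloader"
collection = "mapillary-someuser-original-webp"        # hypothetical collection name

staging_dir = cache_root / collection                   # images land here while downloading
final_dir = Path("./mapillary_data") / collection       # moved here once the run completes

final_dir.parent.mkdir(parents=True, exist_ok=True)
shutil.move(str(staging_dir), str(final_dir))           # single move when everything is done
```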
@@ -1,7 +1,7 @@
  [project]
  name = "mapillary_downloader"
  description = "Download your Mapillary data before it's gone"
- version = "0.3.1"
+ version = "0.4.0"
  authors = [
  { name = "Gareth Davidson", email = "gaz@bitplane.net" }
  ]
@@ -3,6 +3,7 @@
  import argparse
  import os
  import sys
+ from importlib.metadata import version
  from mapillary_downloader.client import MapillaryClient
  from mapillary_downloader.downloader import MapillaryDownloader
  from mapillary_downloader.logging_config import setup_logging
@@ -15,12 +16,17 @@ def main():
  logger = setup_logging()

  parser = argparse.ArgumentParser(description="Download your Mapillary data before it's gone")
+ parser.add_argument(
+ "--version",
+ action="version",
+ version=f"%(prog)s {version('mapillary-downloader')}",
+ )
  parser.add_argument(
  "--token",
  default=os.environ.get("MAPILLARY_TOKEN"),
  help="Mapillary API access token (or set MAPILLARY_TOKEN env var)",
  )
- parser.add_argument("--username", required=True, help="Mapillary username")
+ parser.add_argument("usernames", nargs="+", help="Mapillary username(s) to download")
  parser.add_argument("--output", default="./mapillary_data", help="Output directory (default: ./mapillary_data)")
  parser.add_argument(
  "--quality",
@@ -30,9 +36,9 @@ def main():
  )
  parser.add_argument("--bbox", help="Bounding box: west,south,east,north")
  parser.add_argument(
- "--webp",
+ "--no-webp",
  action="store_true",
- help="Convert images to WebP format (saves ~70%% disk space, requires cwebp binary)",
+ help="Don't convert to WebP (WebP conversion is enabled by default, saves ~70%% disk space)",
  )
  parser.add_argument(
  "--workers",
@@ -45,6 +51,11 @@ def main():
  action="store_true",
  help="Don't tar sequence directories (keep individual files)",
  )
+ parser.add_argument(
+ "--no-check-ia",
+ action="store_true",
+ help="Don't check if collection exists on Internet Archive before downloading",
+ )

  args = parser.parse_args()

@@ -63,19 +74,41 @@ def main():
  logger.error("Error: bbox must be four comma-separated numbers")
  sys.exit(1)

- # Check for cwebp binary if WebP conversion is requested
- if args.webp:
+ # WebP is enabled by default, disabled with --no-webp
+ convert_webp = not args.no_webp
+
+ # Check for cwebp binary if WebP conversion is enabled
+ if convert_webp:
  if not check_cwebp_available():
- logger.error("Error: cwebp binary not found. Install webp package (e.g., apt install webp)")
+ logger.error(
+ "Error: cwebp binary not found. Install webp package (e.g., apt install webp) or use --no-webp"
+ )
  sys.exit(1)
  logger.info("WebP conversion enabled - images will be converted after download")

  try:
  client = MapillaryClient(args.token)
- downloader = MapillaryDownloader(
- client, args.output, args.username, args.quality, workers=args.workers, tar_sequences=not args.no_tar
- )
- downloader.download_user_data(bbox=bbox, convert_webp=args.webp)
+
+ # Process each username
+ for username in args.usernames:
+ logger.info("")
+ logger.info("=" * 60)
+ logger.info(f"Processing user: {username}")
+ logger.info("=" * 60)
+ logger.info("")
+
+ downloader = MapillaryDownloader(
+ client,
+ args.output,
+ username,
+ args.quality,
+ workers=args.workers,
+ tar_sequences=not args.no_tar,
+ convert_webp=convert_webp,
+ check_ia=not args.no_check_ia,
+ )
+ downloader.download_user_data(bbox=bbox, convert_webp=convert_webp)
+
  except KeyboardInterrupt:
  logger.info("\nInterrupted by user")
  sys.exit(1)
@@ -1,32 +1,65 @@
  """Main downloader logic."""

+ import gzip
  import json
  import logging
  import os
+ import shutil
  import time
  from pathlib import Path
  from concurrent.futures import ProcessPoolExecutor, as_completed
  from mapillary_downloader.utils import format_size, format_time
  from mapillary_downloader.ia_meta import generate_ia_metadata
+ from mapillary_downloader.ia_check import check_ia_exists
  from mapillary_downloader.worker import download_and_convert_image
  from mapillary_downloader.tar_sequences import tar_sequence_directories
+ from mapillary_downloader.logging_config import add_file_handler

  logger = logging.getLogger("mapillary_downloader")


+ def get_cache_dir():
+ """Get XDG cache directory for staging downloads.
+
+ Returns:
+ Path to cache directory for mapillary_downloader
+ """
+ xdg_cache = os.environ.get("XDG_CACHE_HOME")
+ if xdg_cache:
+ cache_dir = Path(xdg_cache)
+ else:
+ cache_dir = Path.home() / ".cache"
+
+ mapillary_cache = cache_dir / "mapillary_downloader"
+ mapillary_cache.mkdir(parents=True, exist_ok=True)
+ return mapillary_cache
+
+
  class MapillaryDownloader:
  """Handles downloading Mapillary data for a user."""

- def __init__(self, client, output_dir, username=None, quality=None, workers=None, tar_sequences=True):
+ def __init__(
+ self,
+ client,
+ output_dir,
+ username=None,
+ quality=None,
+ workers=None,
+ tar_sequences=True,
+ convert_webp=False,
+ check_ia=True,
+ ):
  """Initialize the downloader.

  Args:
  client: MapillaryClient instance
- output_dir: Base directory to save downloads
+ output_dir: Base directory to save downloads (final destination)
  username: Mapillary username (for collection directory)
  quality: Image quality (for collection directory)
  workers: Number of parallel workers (default: half of cpu_count)
  tar_sequences: Whether to tar sequence directories after download (default: True)
+ convert_webp: Whether to convert images to WebP (affects collection name)
+ check_ia: Whether to check if collection exists on Internet Archive (default: True)
  """
  self.client = client
  self.base_output_dir = Path(output_dir)
@@ -34,16 +67,39 @@ class MapillaryDownloader:
  self.quality = quality
  self.workers = workers if workers is not None else max(1, os.cpu_count() // 2)
  self.tar_sequences = tar_sequences
+ self.convert_webp = convert_webp
+ self.check_ia = check_ia

- # If username and quality provided, create collection directory
+ # Determine collection name
  if username and quality:
  collection_name = f"mapillary-{username}-{quality}"
- self.output_dir = self.base_output_dir / collection_name
+ if convert_webp:
+ collection_name += "-webp"
+ self.collection_name = collection_name
  else:
- self.output_dir = self.base_output_dir
+ self.collection_name = None

+ # Set up staging directory in cache
+ cache_dir = get_cache_dir()
+ if self.collection_name:
+ self.staging_dir = cache_dir / self.collection_name
+ self.final_dir = self.base_output_dir / self.collection_name
+ else:
+ self.staging_dir = cache_dir / "download"
+ self.final_dir = self.base_output_dir
+
+ # Work in staging directory during download
+ self.output_dir = self.staging_dir
  self.output_dir.mkdir(parents=True, exist_ok=True)

+ logger.info(f"Staging directory: {self.staging_dir}")
+ logger.info(f"Final destination: {self.final_dir}")
+
+ # Set up file logging for archival
+ log_file = self.output_dir / "download.log"
+ add_file_handler(log_file)
+ logger.info(f"Logging to: {log_file}")
+
  self.metadata_file = self.output_dir / "metadata.jsonl"
  self.progress_file = self.output_dir / "progress.json"
  self.downloaded = self._load_progress()
@@ -74,6 +130,18 @@ class MapillaryDownloader:
  if not self.username or not self.quality:
  raise ValueError("Username and quality must be provided during initialization")

+ # Check if collection already exists on Internet Archive
+ if self.check_ia and self.collection_name:
+ logger.info(f"Checking if {self.collection_name} exists on Internet Archive...")
+ if check_ia_exists(self.collection_name):
+ logger.info("Collection already exists on archive.org, skipping download")
+ return
+
+ # Check if collection already exists in final destination
+ if self.final_dir.exists():
+ logger.info(f"Collection already exists at {self.final_dir}, skipping download")
+ return
+
  quality_field = f"thumb_{self.quality}_url"

  logger.info(f"Downloading images for user: {self.username}")
@@ -168,9 +236,38 @@ class MapillaryDownloader:
  if self.tar_sequences:
  tar_sequence_directories(self.output_dir)

+ # Gzip metadata.jsonl to save space
+ if self.metadata_file.exists():
+ logger.info("Compressing metadata.jsonl...")
+ original_size = self.metadata_file.stat().st_size
+ gzipped_file = self.metadata_file.with_suffix(".jsonl.gz")
+
+ with open(self.metadata_file, "rb") as f_in:
+ with gzip.open(gzipped_file, "wb", compresslevel=9) as f_out:
+ shutil.copyfileobj(f_in, f_out)
+
+ compressed_size = gzipped_file.stat().st_size
+ self.metadata_file.unlink()
+
+ savings = 100 * (1 - compressed_size / original_size)
+ logger.info(
+ f"Compressed metadata: {format_size(original_size)} → {format_size(compressed_size)} "
+ f"({savings:.1f}% savings)"
+ )
+
  # Generate IA metadata
  generate_ia_metadata(self.output_dir)

+ # Move from staging to final destination
+ logger.info("Moving collection from staging to final destination...")
+ if self.final_dir.exists():
+ logger.warning(f"Destination already exists, removing: {self.final_dir}")
+ shutil.rmtree(self.final_dir)
+
+ self.final_dir.parent.mkdir(parents=True, exist_ok=True)
+ shutil.move(str(self.staging_dir), str(self.final_dir))
+ logger.info(f"Collection moved to: {self.final_dir}")
+
  def _download_images_parallel(self, images, convert_webp):
  """Download images in parallel using worker pool.

@@ -184,6 +281,7 @@ class MapillaryDownloader:
  downloaded_count = 0
  total_bytes = 0
  failed_count = 0
+ batch_start_time = time.time()

  with ProcessPoolExecutor(max_workers=self.workers) as executor:
  # Submit all tasks
@@ -209,7 +307,16 @@ class MapillaryDownloader:
  total_bytes += bytes_dl

  if downloaded_count % 10 == 0:
- logger.info(f"Downloaded: {downloaded_count}/{len(images)} ({format_size(total_bytes)})")
+ # Calculate ETA
+ elapsed = time.time() - batch_start_time
+ rate = downloaded_count / elapsed if elapsed > 0 else 0
+ remaining = len(images) - downloaded_count
+ eta_seconds = remaining / rate if rate > 0 else 0
+
+ logger.info(
+ f"Downloaded: {downloaded_count}/{len(images)} ({format_size(total_bytes)}) "
+ f"- ETA: {format_time(eta_seconds)}"
+ )
  self._save_progress()
  else:
  failed_count += 1
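Because `metadata.jsonl` is now gzipped at the end of a run, anything reading it afterwards has to go through `gzip`, as `ia_meta.py` does below. A minimal sketch of consuming the compressed metadata (the collection path is hypothetical):

```python
# Sketch: reading the gzipped metadata after a completed run (hypothetical path).
import gzip
import json
from pathlib import Path

metadata_file = Path("mapillary_data/mapillary-someuser-original-webp/metadata.jsonl.gz")

# Open in text mode ("rt") and parse one JSON object per non-empty line.
with gzip.open(metadata_file, "rt") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"{len(records)} images in collection")
```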
@@ -0,0 +1,33 @@
+ """Check if collections exist on Internet Archive."""
+
+ import logging
+ import requests
+
+ logger = logging.getLogger("mapillary_downloader")
+
+
+ def check_ia_exists(collection_name):
+ """Check if a collection exists on Internet Archive.
+
+ Args:
+ collection_name: Name of the collection (e.g., mapillary-username-original-webp)
+
+ Returns:
+ Boolean indicating if the collection exists on IA
+ """
+ # IA identifier format
+ ia_url = f"https://archive.org/metadata/{collection_name}"
+
+ try:
+ response = requests.get(ia_url, timeout=10)
+ # If we get a 200, the item exists
+ if response.status_code == 200:
+ data = response.json()
+ # Check if it's a valid item (not just metadata for non-existent item)
+ if "metadata" in data and data.get("is_dark") is not True:
+ return True
+ return False
+ except requests.RequestException as e:
+ logger.warning(f"Failed to check IA for {collection_name}: {e}")
+ # On error, assume it doesn't exist (better to download than skip)
+ return False
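A minimal usage sketch of the new check (the collection identifier is hypothetical). Note that on a network error the function returns `False`, so a failed check falls back to downloading rather than skipping:

```python
# Minimal usage sketch of the new IA existence check (hypothetical identifier).
from mapillary_downloader.ia_check import check_ia_exists

collection = "mapillary-someuser-original-webp"

if check_ia_exists(collection):
    print(f"{collection} is already on archive.org - skipping download")
else:
    print(f"{collection} not found on archive.org - downloading")
```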
@@ -1,5 +1,6 @@
  """Internet Archive metadata generation for Mapillary collections."""

+ import gzip
  import json
  import logging
  import re
@@ -14,22 +15,22 @@ def parse_collection_name(directory):
  """Parse username and quality from directory name.

  Args:
- directory: Path to collection directory (e.g., mapillary-username-original)
+ directory: Path to collection directory (e.g., mapillary-username-original or mapillary-username-original-webp)

  Returns:
  Tuple of (username, quality) or (None, None) if parsing fails
  """
- match = re.match(r"mapillary-(.+)-(256|1024|2048|original)$", Path(directory).name)
+ match = re.match(r"mapillary-(.+)-(256|1024|2048|original)(?:-webp)?$", Path(directory).name)
  if match:
  return match.group(1), match.group(2)
  return None, None


  def get_date_range(metadata_file):
- """Get first and last captured_at dates from metadata.jsonl.
+ """Get first and last captured_at dates from metadata.jsonl.gz.

  Args:
- metadata_file: Path to metadata.jsonl file
+ metadata_file: Path to metadata.jsonl.gz file

  Returns:
  Tuple of (first_date, last_date) as ISO format strings, or (None, None)
@@ -38,7 +39,7 @@ def get_date_range(metadata_file):
  return None, None

  timestamps = []
- with open(metadata_file) as f:
+ with gzip.open(metadata_file, "rt") as f:
  for line in f:
  if line.strip():
  data = json.loads(line)
@@ -59,10 +60,10 @@ def get_date_range(metadata_file):


  def count_images(metadata_file):
- """Count number of images in metadata.jsonl.
+ """Count number of images in metadata.jsonl.gz.

  Args:
- metadata_file: Path to metadata.jsonl file
+ metadata_file: Path to metadata.jsonl.gz file

  Returns:
  Number of images
@@ -71,7 +72,7 @@ def count_images(metadata_file):
  return 0

  count = 0
- with open(metadata_file) as f:
+ with gzip.open(metadata_file, "rt") as f:
  for line in f:
  if line.strip():
  count += 1
@@ -112,9 +113,9 @@ def generate_ia_metadata(collection_dir):
  logger.error(f"Could not parse username/quality from directory: {collection_dir.name}")
  return False

- metadata_file = collection_dir / "metadata.jsonl"
+ metadata_file = collection_dir / "metadata.jsonl.gz"
  if not metadata_file.exists():
- logger.error(f"metadata.jsonl not found in {collection_dir}")
+ logger.error(f"metadata.jsonl.gz not found in {collection_dir}")
  return False

  logger.info(f"Generating IA metadata for {collection_dir.name}...")
@@ -135,7 +136,7 @@ def generate_ia_metadata(collection_dir):
  write_meta_tag(
  meta_dir,
  "title",
- f"Mapillary images by {username} ({quality} quality)",
+ f"Mapillary images by {username}",
  )

  description = (
@@ -60,3 +60,23 @@ def setup_logging(level=logging.INFO):
  logger.addHandler(handler)

  return logger
+
+
+ def add_file_handler(log_file, level=logging.INFO):
+ """Add a file handler to the logger for archival.
+
+ Args:
+ log_file: Path to log file
+ level: Logging level for file handler
+ """
+ # Use plain formatter for file (no colors)
+ formatter = logging.Formatter(fmt="%(asctime)s [%(levelname)s] %(message)s", datefmt="%Y-%m-%d %H:%M:%S")
+
+ handler = logging.FileHandler(log_file, mode="a", encoding="utf-8")
+ handler.setFormatter(formatter)
+ handler.setLevel(level)
+
+ logger = logging.getLogger("mapillary_downloader")
+ logger.addHandler(handler)
+
+ return handler
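A short sketch of how the new handler pairs with `setup_logging`. Detaching the handler when a collection finishes is an assumed usage pattern for illustration, not something this diff shows the downloader doing itself:

```python
# Sketch: attach an archival log file for one collection, then detach it (assumed usage).
from mapillary_downloader.logging_config import setup_logging, add_file_handler

logger = setup_logging()
handler = add_file_handler("download.log")   # returned handler can be removed later

logger.info("this line goes to the console and to download.log")

logger.removeHandler(handler)                # detach once the collection is done
handler.close()
```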
@@ -1,8 +1,9 @@
  """Tar sequence directories for efficient Internet Archive uploads."""

  import logging
- import subprocess
+ import tarfile
  from pathlib import Path
+ from mapillary_downloader.utils import format_size

  logger = logging.getLogger("mapillary_downloader")

@@ -38,6 +39,7 @@ def tar_sequence_directories(collection_dir):

  tarred_count = 0
  total_files = 0
+ total_tar_bytes = 0

  for seq_dir in sequence_dirs:
  seq_name = seq_dir.name
@@ -58,22 +60,38 @@ def tar_sequence_directories(collection_dir):
  continue

  try:
- # Create uncompressed tar (WebP already compressed)
- # Use -C to change directory so paths in tar are relative
- # Use -- to prevent sequence IDs starting with - from being interpreted as options
- result = subprocess.run(
- ["tar", "-cf", str(tar_path), "-C", str(collection_dir), "--", seq_name],
- capture_output=True,
- text=True,
- timeout=300, # 5 minute timeout per tar
- )
-
- if result.returncode != 0:
- logger.error(f"Failed to tar {seq_name}: {result.stderr}")
+ # Create reproducible uncompressed tar (WebP already compressed)
+ # Sort files by name for deterministic ordering
+ files_to_tar = sorted([f for f in seq_dir.rglob("*") if f.is_file()], key=lambda x: x.name)
+
+ if not files_to_tar:
+ logger.warning(f"Skipping directory with no files: {seq_name}")
  continue

+ with tarfile.open(tar_path, "w") as tar:
+ for file_path in files_to_tar:
+ # Get path relative to collection_dir for tar archive
+ arcname = file_path.relative_to(collection_dir)
+
+ # Create TarInfo for reproducibility
+ tarinfo = tar.gettarinfo(str(file_path), arcname=str(arcname))
+
+ # Normalize for reproducibility across platforms
+ tarinfo.uid = 0
+ tarinfo.gid = 0
+ tarinfo.uname = ""
+ tarinfo.gname = ""
+ # mtime already set on file by worker, preserve it
+
+ # Add file to tar
+ with open(file_path, "rb") as f:
+ tar.addfile(tarinfo, f)
+
  # Verify tar was created and has size
  if tar_path.exists() and tar_path.stat().st_size > 0:
+ tar_size = tar_path.stat().st_size
+ total_tar_bytes += tar_size
+
  # Remove original directory
  for file in seq_dir.rglob("*"):
  if file.is_file():
@@ -99,14 +117,12 @@ def tar_sequence_directories(collection_dir):
  if tar_path.exists():
  tar_path.unlink()

- except subprocess.TimeoutExpired:
- logger.error(f"Timeout tarring {seq_name}")
- if tar_path.exists():
- tar_path.unlink()
  except Exception as e:
  logger.error(f"Error tarring {seq_name}: {e}")
  if tar_path.exists():
  tar_path.unlink()

- logger.info(f"Tarred {tarred_count} sequences ({total_files:,} files total)")
+ logger.info(
+ f"Tarred {tarred_count} sequences ({total_files:,} files, {format_size(total_tar_bytes)} total tar size)"
+ )
  return tarred_count, total_files
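The switch from shelling out to `tar` to Python's `tarfile`, with sorted members, zeroed ownership, and (per `worker.py` below) mtimes pinned to `captured_at`, aims at reproducible archives. A sketch of how one might spot-check that, with hypothetical paths; byte-identity also assumes file modes (umask) match between runs:

```python
# Sketch: compare tarballs from two runs over the same inputs (hypothetical paths).
import hashlib
from pathlib import Path

def sha256sum(path):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in 1 MiB chunks to keep memory use flat.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

a = sha256sum(Path("run1") / "SEQUENCE_ID.tar")
b = sha256sum(Path("run2") / "SEQUENCE_ID.tar")
print("reproducible" if a == b else "differs")
```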
@@ -1,5 +1,6 @@
  """Worker process for parallel image download and conversion."""

+ import os
  import tempfile
  from pathlib import Path
  import requests
@@ -80,6 +81,12 @@ def download_and_convert_image(image_data, output_dir, quality, convert_webp, ac
  if not webp_path:
  return (image_id, bytes_downloaded, False, "WebP conversion failed")

+ # Set file mtime to captured_at timestamp for reproducibility
+ if "captured_at" in image_data:
+ # captured_at is in milliseconds, convert to seconds
+ mtime = image_data["captured_at"] / 1000
+ os.utime(final_path, (mtime, mtime))
+
  return (image_id, bytes_downloaded, True, None)

  except Exception as e: