mapillary-downloader 0.1.3__tar.gz → 0.3.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: mapillary_downloader
3
- Version: 0.1.3
3
+ Version: 0.3.0
4
4
  Summary: Download your Mapillary data before it's gone
5
5
  Author-email: Gareth Davidson <gaz@bitplane.net>
6
6
  Requires-Python: >=3.10
@@ -34,54 +34,99 @@ Download your Mapillary data before it's gone.
34
34
 
35
35
  ## Installation
36
36
 
37
- ```bash
38
- pip install mapillary-downloader
39
- ```
40
-
41
- Or from source:
37
+ Installation is optional; you can prefix the command with `uvx` or `pipx` to
38
+ download and run it. Or if you're oldskool you can do:
42
39
 
43
40
  ```bash
44
- make install
41
+ pip install mapillary-downloader
45
42
  ```
46
43
 
47
44
  ## Usage
48
45
 
49
- First, get your Mapillary API access token from https://www.mapillary.com/dashboard/developers
46
+ First, get your Mapillary API access token from
47
+ [the developer dashboard](https://www.mapillary.com/dashboard/developers)
50
48
 
51
49
  ```bash
52
- mapillary-downloader --token YOUR_TOKEN --username YOUR_USERNAME --output ./downloads
50
+ # Set token via environment variable
51
+ export MAPILLARY_TOKEN=YOUR_TOKEN
52
+ mapillary-downloader --username SOME_USERNAME --output ./downloads
53
+
54
+ # Or pass token directly, and have it in your shell history 💩👀
55
+ mapillary-downloader --token YOUR_TOKEN --username SOME_USERNAME --output ./downloads
53
56
  ```
54
57
 
55
58
  | option | because | default |
56
59
  | ------------- | ------------------------------------- | ------------------ |
57
- | `--token` | Your Mapillary API access token | None (required) |
58
- | `--username` | Your Mapillary username | None (required) |
60
+ | `--username` | Mapillary username | None (required) |
61
+ | `--token` | Mapillary API token (or env var) | `$MAPILLARY_TOKEN` |
59
62
  | `--output` | Output directory | `./mapillary_data` |
60
63
  | `--quality` | 256, 1024, 2048 or original | `original` |
61
64
  | `--bbox` | `west,south,east,north` | `None` |
65
+ | `--webp` | Convert to WebP (saves ~70% space) | `False` |
66
+ | `--workers` | Number of parallel download workers | CPU count |
67
+ | `--no-tar` | Don't tar sequence directories | `False` |
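
The CLI only checks that `--bbox` has four comma-separated numbers; a minimal sketch of that parsing (the helper name and example coordinates are illustrative, not part of the package API):

```python
def parse_bbox(raw):
    """Parse a --bbox string into [west, south, east, north] floats."""
    parts = [float(x) for x in raw.split(",")]
    if len(parts) != 4:
        raise ValueError("bbox must be four comma-separated numbers")
    return parts

# Roughly Greater London (illustrative values)
print(parse_bbox("-0.51,51.28,0.33,51.69"))
```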
62
68
 
63
69
  The downloader will:
64
70
 
65
- * 💾 Fetch all your uploaded images from Mapillary
66
- * 📷 Download full-resolution images organized by sequence
71
+ * 📷 Download a user's images organized by sequence
67
72
  * 📜 Inject EXIF metadata (GPS coordinates, camera info, timestamps,
68
73
  compass direction)
69
74
  * 🛟 Save progress so you can safely resume if interrupted
75
+ * 🗜️ Optionally convert to WebP to save space
76
+ * 📦 Tar sequence directories for faster uploads
77
+
78
+ ## WebP Conversion
79
+
80
+ You'll need `cwebp` to use the `--webp` flag. So install it:
81
+
82
+ ```bash
83
+ # Debian/Ubuntu
84
+ sudo apt install webp
85
+
86
+ # macOS
87
+ brew install webp
88
+ ```
89
+
90
+ ## Sequence Tarball Creation
91
+
92
+ By default, sequence directories are automatically tarred after download because
93
+ if they weren't, you'd spend more time setting up upload metadata than actually
94
+ uploading files to IA.
95
+
96
+ To keep individual files instead of creating tars, use the `--no-tar` flag:
97
+
98
+ ```bash
99
+ mapillary-downloader --username WHOEVER --no-tar
100
+ ```
101
+
102
+ ## Internet Archive upload
103
+
104
+ I've written a bash tool to rip media then tag, queue, and upload to The
105
+ Internet Archive. This downloader writes its metadata in the same format, so if you copy completed
106
+ download dirs into the `4.ship` dir, they'll find their way into an
107
+ appropriately named item.
108
+
109
+ See the rip docs for details:
110
+
111
+ * [📀 rip](https://bitplane.net/dev/sh/rip)
112
+
70
113
 
71
114
  ## Development
72
115
 
73
116
  ```bash
74
117
  make dev # Setup dev environment
75
118
  make test # Run tests
76
- make coverage # Run tests with coverage
119
+ make dist # Build the distribution
120
+ make help # See other make options
77
121
  ```
78
122
 
79
123
  ## Links
80
124
 
81
125
  * [🏠 home](https://bitplane.net/dev/python/mapillary_downloader)
82
- * [📖 pydoc](https://bitplane.net/dev/python/mapillary_downloader/pydoc)
126
+ * [📖 pydoc](https://bitplane.net/dev/python/mapillary_downloader/pydoc)
83
127
  * [🐍 pypi](https://pypi.org/project/mapillary-downloader)
84
128
  * [🐱 github](https://github.com/bitplane/mapillary_downloader)
129
+ * [📀 rip](https://bitplane.net/dev/sh/rip)
85
130
 
86
131
  ## License
87
132
 
@@ -0,0 +1,108 @@
1
+ # 🗺️ Mapillary Downloader
2
+
3
+ Download your Mapillary data before it's gone.
4
+
5
+ ## Installation
6
+
7
+ Installation is optional; you can prefix the command with `uvx` or `pipx` to
8
+ download and run it. Or if you're oldskool you can do:
9
+
10
+ ```bash
11
+ pip install mapillary-downloader
12
+ ```
13
+
14
+ ## Usage
15
+
16
+ First, get your Mapillary API access token from
17
+ [the developer dashboard](https://www.mapillary.com/dashboard/developers)
18
+
19
+ ```bash
20
+ # Set token via environment variable
21
+ export MAPILLARY_TOKEN=YOUR_TOKEN
22
+ mapillary-downloader --username SOME_USERNAME --output ./downloads
23
+
24
+ # Or pass token directly, and have it in your shell history 💩👀
25
+ mapillary-downloader --token YOUR_TOKEN --username SOME_USERNAME --output ./downloads
26
+ ```
27
+
28
+ | option | because | default |
29
+ | ------------- | ------------------------------------- | ------------------ |
30
+ | `--username` | Mapillary username | None (required) |
31
+ | `--token` | Mapillary API token (or env var) | `$MAPILLARY_TOKEN` |
32
+ | `--output` | Output directory | `./mapillary_data` |
33
+ | `--quality` | 256, 1024, 2048 or original | `original` |
34
+ | `--bbox` | `west,south,east,north` | `None` |
35
+ | `--webp` | Convert to WebP (saves ~70% space) | `False` |
36
+ | `--workers` | Number of parallel download workers | CPU count |
37
+ | `--no-tar` | Don't tar sequence directories | `False` |
38
+
39
+ The downloader will:
40
+
41
+ * 📷 Download a user's images organized by sequence
42
+ * 📜 Inject EXIF metadata (GPS coordinates, camera info, timestamps,
43
+ compass direction)
44
+ * 🛟 Save progress so you can safely resume if interrupted
45
+ * 🗜️ Optionally convert to WebP to save space
46
+ * 📦 Tar sequence directories for faster uploads
47
+
48
+ ## WebP Conversion
49
+
50
+ You'll need `cwebp` to use the `--webp` flag. So install it:
51
+
52
+ ```bash
53
+ # Debian/Ubuntu
54
+ sudo apt install webp
55
+
56
+ # macOS
57
+ brew install webp
58
+ ```
59
+
60
+ ## Sequence Tarball Creation
61
+
62
+ By default, sequence directories are automatically tarred after download because
63
+ if they weren't, you'd spend more time setting up upload metadata than actually
64
+ uploading files to IA.
65
+
66
+ To keep individual files instead of creating tars, use the `--no-tar` flag:
67
+
68
+ ```bash
69
+ mapillary-downloader --username WHOEVER --no-tar
70
+ ```
71
+
72
+ ## Internet Archive upload
73
+
74
+ I've written a bash tool to rip media then tag, queue, and upload to The
75
+ Internet Archive. This downloader writes its metadata in the same format, so if you copy completed
76
+ download dirs into the `4.ship` dir, they'll find their way into an
77
+ appropriately named item.
78
+
79
+ See the rip docs for details:
80
+
81
+ * [📀 rip](https://bitplane.net/dev/sh/rip)
82
+
83
+
84
+ ## Development
85
+
86
+ ```bash
87
+ make dev # Setup dev environment
88
+ make test # Run tests
89
+ make dist # Build the distribution
90
+ make help # See other make options
91
+ ```
92
+
93
+ ## Links
94
+
95
+ * [🏠 home](https://bitplane.net/dev/python/mapillary_downloader)
96
+ * [📖 pydoc](https://bitplane.net/dev/python/mapillary_downloader/pydoc)
97
+ * [🐍 pypi](https://pypi.org/project/mapillary-downloader)
98
+ * [🐱 github](https://github.com/bitplane/mapillary_downloader)
99
+ * [📀 rip](https://bitplane.net/dev/sh/rip)
100
+
101
+ ## License
102
+
103
+ WTFPL with one additional clause
104
+
105
+ 1. Don't blame me
106
+
107
+ Do wtf you want, but don't blame me if it makes jokes about the size of your
108
+ disk drive.
@@ -1,7 +1,7 @@
1
1
  [project]
2
2
  name = "mapillary_downloader"
3
3
  description = "Download your Mapillary data before it's gone"
4
- version = "0.1.3"
4
+ version = "0.3.0"
5
5
  authors = [
6
6
  { name = "Gareth Davidson", email = "gaz@bitplane.net" }
7
7
  ]
@@ -0,0 +1,88 @@
1
+ """CLI entry point."""
2
+
3
+ import argparse
4
+ import os
5
+ import sys
6
+ from mapillary_downloader.client import MapillaryClient
7
+ from mapillary_downloader.downloader import MapillaryDownloader
8
+ from mapillary_downloader.logging_config import setup_logging
9
+ from mapillary_downloader.webp_converter import check_cwebp_available
10
+
11
+
12
+ def main():
13
+ """Main CLI entry point."""
14
+ # Set up logging
15
+ logger = setup_logging()
16
+
17
+ parser = argparse.ArgumentParser(description="Download your Mapillary data before it's gone")
18
+ parser.add_argument(
19
+ "--token",
20
+ default=os.environ.get("MAPILLARY_TOKEN"),
21
+ help="Mapillary API access token (or set MAPILLARY_TOKEN env var)",
22
+ )
23
+ parser.add_argument("--username", required=True, help="Mapillary username")
24
+ parser.add_argument("--output", default="./mapillary_data", help="Output directory (default: ./mapillary_data)")
25
+ parser.add_argument(
26
+ "--quality",
27
+ choices=["256", "1024", "2048", "original"],
28
+ default="original",
29
+ help="Image quality to download (default: original)",
30
+ )
31
+ parser.add_argument("--bbox", help="Bounding box: west,south,east,north")
32
+ parser.add_argument(
33
+ "--webp",
34
+ action="store_true",
35
+ help="Convert images to WebP format (saves ~70%% disk space, requires cwebp binary)",
36
+ )
37
+ parser.add_argument(
38
+ "--workers",
39
+ type=int,
40
+ default=None,
41
+ help="Number of parallel workers (default: number of CPU cores)",
42
+ )
43
+ parser.add_argument(
44
+ "--no-tar",
45
+ action="store_true",
46
+ help="Don't tar sequence directories (keep individual files)",
47
+ )
48
+
49
+ args = parser.parse_args()
50
+
51
+ # Check for token
52
+ if not args.token:
53
+ logger.error("Error: Mapillary API token required. Use --token or set MAPILLARY_TOKEN environment variable")
54
+ sys.exit(1)
55
+
56
+ bbox = None
57
+ if args.bbox:
58
+ try:
59
+ bbox = [float(x) for x in args.bbox.split(",")]
60
+ if len(bbox) != 4:
61
+ raise ValueError
62
+ except ValueError:
63
+ logger.error("Error: bbox must be four comma-separated numbers")
64
+ sys.exit(1)
65
+
66
+ # Check for cwebp binary if WebP conversion is requested
67
+ if args.webp:
68
+ if not check_cwebp_available():
69
+ logger.error("Error: cwebp binary not found. Install webp package (e.g., apt install webp)")
70
+ sys.exit(1)
71
+ logger.info("WebP conversion enabled - images will be converted after download")
72
+
73
+ try:
74
+ client = MapillaryClient(args.token)
75
+ downloader = MapillaryDownloader(
76
+ client, args.output, args.username, args.quality, workers=args.workers, tar_sequences=not args.no_tar
77
+ )
78
+ downloader.download_user_data(bbox=bbox, convert_webp=args.webp)
79
+ except KeyboardInterrupt:
80
+ logger.info("\nInterrupted by user")
81
+ sys.exit(1)
82
+ except Exception as e:
83
+ logger.error(f"Error: {e}")
84
+ sys.exit(1)
85
+
86
+
87
+ if __name__ == "__main__":
88
+ main()
@@ -0,0 +1,218 @@
1
+ """Main downloader logic."""
2
+
3
+ import json
4
+ import logging
5
+ import os
6
+ import time
7
+ from pathlib import Path
8
+ from concurrent.futures import ProcessPoolExecutor, as_completed
9
+ from mapillary_downloader.utils import format_size, format_time
10
+ from mapillary_downloader.ia_meta import generate_ia_metadata
11
+ from mapillary_downloader.worker import download_and_convert_image
12
+ from mapillary_downloader.tar_sequences import tar_sequence_directories
13
+
14
+ logger = logging.getLogger("mapillary_downloader")
15
+
16
+
17
+ class MapillaryDownloader:
18
+ """Handles downloading Mapillary data for a user."""
19
+
20
+ def __init__(self, client, output_dir, username=None, quality=None, workers=None, tar_sequences=True):
21
+ """Initialize the downloader.
22
+
23
+ Args:
24
+ client: MapillaryClient instance
25
+ output_dir: Base directory to save downloads
26
+ username: Mapillary username (for collection directory)
27
+ quality: Image quality (for collection directory)
28
+ workers: Number of parallel workers (default: cpu_count)
29
+ tar_sequences: Whether to tar sequence directories after download (default: True)
30
+ """
31
+ self.client = client
32
+ self.base_output_dir = Path(output_dir)
33
+ self.username = username
34
+ self.quality = quality
35
+ self.workers = workers if workers is not None else os.cpu_count()
36
+ self.tar_sequences = tar_sequences
37
+
38
+ # If username and quality provided, create collection directory
39
+ if username and quality:
40
+ collection_name = f"mapillary-{username}-{quality}"
41
+ self.output_dir = self.base_output_dir / collection_name
42
+ else:
43
+ self.output_dir = self.base_output_dir
44
+
45
+ self.output_dir.mkdir(parents=True, exist_ok=True)
46
+
47
+ self.metadata_file = self.output_dir / "metadata.jsonl"
48
+ self.progress_file = self.output_dir / "progress.json"
49
+ self.downloaded = self._load_progress()
50
+
51
+ def _load_progress(self):
52
+ """Load previously downloaded image IDs."""
53
+ if self.progress_file.exists():
54
+ with open(self.progress_file) as f:
55
+ return set(json.load(f).get("downloaded", []))
56
+ return set()
57
+
58
+ def _save_progress(self):
59
+ """Save progress to disk atomically."""
60
+ temp_file = self.progress_file.with_suffix(".json.tmp")
61
+ with open(temp_file, "w") as f:
62
+ json.dump({"downloaded": list(self.downloaded)}, f)
63
+ f.flush()
64
+ os.fsync(f.fileno())
65
+ temp_file.replace(self.progress_file)
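
The progress file above is saved with the classic write-temp, fsync, rename pattern, so a crash mid-write can never leave a corrupt file behind. A standalone sketch of the same pattern (the helper name is mine):

```python
import json
import os
import tempfile
from pathlib import Path

def save_json_atomic(path, data):
    """Write JSON to a temp file, fsync, then rename over the target."""
    path = Path(path)
    tmp = path.with_suffix(path.suffix + ".tmp")
    with open(tmp, "w") as f:
        json.dump(data, f)
        f.flush()
        os.fsync(f.fileno())  # force bytes to disk before the rename
    tmp.replace(path)  # atomic on POSIX within the same filesystem

target = Path(tempfile.mkdtemp()) / "progress.json"
save_json_atomic(target, {"downloaded": ["abc123"]})
print(json.loads(target.read_text()))  # {'downloaded': ['abc123']}
```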
66
+
67
+ def download_user_data(self, bbox=None, convert_webp=False):
68
+ """Download all images for a user.
69
+
70
+ Args:
71
+ bbox: Optional bounding box [west, south, east, north]
72
+ convert_webp: Convert images to WebP format after download
73
+ """
74
+ if not self.username or not self.quality:
75
+ raise ValueError("Username and quality must be provided during initialization")
76
+
77
+ quality_field = f"thumb_{self.quality}_url"
78
+
79
+ logger.info(f"Downloading images for user: {self.username}")
80
+ logger.info(f"Output directory: {self.output_dir}")
81
+ logger.info(f"Quality: {self.quality}")
82
+ logger.info(f"Using {self.workers} parallel workers")
83
+
84
+ processed = 0
85
+ downloaded_count = 0
86
+ skipped = 0
87
+ total_bytes = 0
88
+ failed_count = 0
89
+
90
+ start_time = time.time()
91
+
92
+ # Track which image IDs we've seen in metadata to avoid re-fetching
93
+ seen_ids = set()
94
+
95
+ # Collect images to download from existing metadata
96
+ images_to_download = []
97
+
98
+ if self.metadata_file.exists():
99
+ logger.info("Processing existing metadata file...")
100
+ with open(self.metadata_file) as f:
101
+ for line in f:
102
+ if line.strip():
103
+ image = json.loads(line)
104
+ image_id = image["id"]
105
+ seen_ids.add(image_id)
106
+ processed += 1
107
+
108
+ if image_id in self.downloaded:
109
+ skipped += 1
110
+ continue
111
+
112
+ # Queue for download
113
+ if image.get(quality_field):
114
+ images_to_download.append(image)
115
+
116
+ # Download images from existing metadata in parallel
117
+ if images_to_download:
118
+ logger.info(f"Downloading {len(images_to_download)} images from existing metadata...")
119
+ downloaded_count, total_bytes, failed_count = self._download_images_parallel(
120
+ images_to_download, convert_webp
121
+ )
122
+
123
+ # Always check API for new images (will skip duplicates via seen_ids)
124
+ logger.info("Checking for new images from API...")
125
+ new_images = []
126
+
127
+ with open(self.metadata_file, "a") as meta_f:
128
+ for image in self.client.get_user_images(self.username, bbox=bbox):
129
+ image_id = image["id"]
130
+
131
+ # Skip if we already have this in our metadata file
132
+ if image_id in seen_ids:
133
+ continue
134
+
135
+ seen_ids.add(image_id)
136
+ processed += 1
137
+
138
+ # Save new metadata
139
+ meta_f.write(json.dumps(image) + "\n")
140
+ meta_f.flush()
141
+
142
+ # Skip if already downloaded
143
+ if image_id in self.downloaded:
144
+ skipped += 1
145
+ continue
146
+
147
+ # Queue for download
148
+ if image.get(quality_field):
149
+ new_images.append(image)
150
+
151
+ # Download new images in parallel
152
+ if new_images:
153
+ logger.info(f"Downloading {len(new_images)} new images...")
154
+ new_downloaded, new_bytes, new_failed = self._download_images_parallel(new_images, convert_webp)
155
+ downloaded_count += new_downloaded
156
+ total_bytes += new_bytes
157
+ failed_count += new_failed
158
+
159
+ self._save_progress()
160
+ elapsed = time.time() - start_time
161
+ logger.info(
162
+ f"Complete! Processed {processed} images, downloaded {downloaded_count} ({format_size(total_bytes)}), "
163
+ f"skipped {skipped}, failed {failed_count}"
164
+ )
165
+ logger.info(f"Total time: {format_time(elapsed)}")
166
+
167
+ # Tar sequence directories for efficient IA uploads
168
+ if self.tar_sequences:
169
+ tar_sequence_directories(self.output_dir)
170
+
171
+ # Generate IA metadata
172
+ generate_ia_metadata(self.output_dir)
173
+
174
+ def _download_images_parallel(self, images, convert_webp):
175
+ """Download images in parallel using worker pool.
176
+
177
+ Args:
178
+ images: List of image metadata dicts
179
+ convert_webp: Whether to convert to WebP
180
+
181
+ Returns:
182
+ Tuple of (downloaded_count, total_bytes, failed_count)
183
+ """
184
+ downloaded_count = 0
185
+ total_bytes = 0
186
+ failed_count = 0
187
+
188
+ with ProcessPoolExecutor(max_workers=self.workers) as executor:
189
+ # Submit all tasks
190
+ future_to_image = {}
191
+ for image in images:
192
+ future = executor.submit(
193
+ download_and_convert_image,
194
+ image,
195
+ str(self.output_dir),
196
+ self.quality,
197
+ convert_webp,
198
+ self.client.access_token,
199
+ )
200
+ future_to_image[future] = image["id"]
201
+
202
+ # Process results as they complete
203
+ for future in as_completed(future_to_image):
204
+ image_id, bytes_dl, success, error_msg = future.result()
205
+
206
+ if success:
207
+ self.downloaded.add(image_id)
208
+ downloaded_count += 1
209
+ total_bytes += bytes_dl
210
+
211
+ if downloaded_count % 10 == 0:
212
+ logger.info(f"Downloaded: {downloaded_count}/{len(images)} ({format_size(total_bytes)})")
213
+ self._save_progress()
214
+ else:
215
+ failed_count += 1
216
+ logger.warning(f"Failed to download {image_id}: {error_msg}")
217
+
218
+ return downloaded_count, total_bytes, failed_count
@@ -0,0 +1,182 @@
1
+ """Internet Archive metadata generation for Mapillary collections."""
2
+
3
+ import json
4
+ import logging
5
+ import re
6
+ from datetime import datetime
7
+ from pathlib import Path
8
+ from importlib.metadata import version
9
+
10
+ logger = logging.getLogger("mapillary_downloader")
11
+
12
+
13
+ def parse_collection_name(directory):
14
+ """Parse username and quality from directory name.
15
+
16
+ Args:
17
+ directory: Path to collection directory (e.g., mapillary-username-original)
18
+
19
+ Returns:
20
+ Tuple of (username, quality) or (None, None) if parsing fails
21
+ """
22
+ match = re.match(r"mapillary-(.+)-(256|1024|2048|original)$", Path(directory).name)
23
+ if match:
24
+ return match.group(1), match.group(2)
25
+ return None, None
26
+
27
+
28
+ def get_date_range(metadata_file):
29
+ """Get first and last captured_at dates from metadata.jsonl.
30
+
31
+ Args:
32
+ metadata_file: Path to metadata.jsonl file
33
+
34
+ Returns:
35
+ Tuple of (first_date, last_date) as ISO format strings, or (None, None)
36
+ """
37
+ if not Path(metadata_file).exists():
38
+ return None, None
39
+
40
+ timestamps = []
41
+ with open(metadata_file) as f:
42
+ for line in f:
43
+ if line.strip():
44
+ data = json.loads(line)
45
+ if "captured_at" in data:
46
+ timestamps.append(data["captured_at"])
47
+
48
+ if not timestamps:
49
+ return None, None
50
+
51
+ # Convert from milliseconds to seconds, then to datetime
52
+ first_ts = min(timestamps) / 1000
53
+ last_ts = max(timestamps) / 1000
54
+
55
+ first_date = datetime.fromtimestamp(first_ts).strftime("%Y-%m-%d")
56
+ last_date = datetime.fromtimestamp(last_ts).strftime("%Y-%m-%d")
57
+
58
+ return first_date, last_date
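
`captured_at` values are epoch milliseconds, hence the divide-by-1000 above. A sketch of the conversion (UTC is pinned here so the result is stable across machines, whereas the function above uses local time):

```python
from datetime import datetime, timezone

def ms_to_date(ms):
    """Convert a Mapillary captured_at millisecond timestamp to YYYY-MM-DD (UTC)."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d")

print(ms_to_date(1700000000000))  # 2023-11-14
```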
59
+
60
+
61
+ def count_images(metadata_file):
62
+ """Count number of images in metadata.jsonl.
63
+
64
+ Args:
65
+ metadata_file: Path to metadata.jsonl file
66
+
67
+ Returns:
68
+ Number of images
69
+ """
70
+ if not Path(metadata_file).exists():
71
+ return 0
72
+
73
+ count = 0
74
+ with open(metadata_file) as f:
75
+ for line in f:
76
+ if line.strip():
77
+ count += 1
78
+ return count
79
+
80
+
81
+ def write_meta_tag(meta_dir, tag, values):
82
+ """Write metadata tag files in rip format.
83
+
84
+ Args:
85
+ meta_dir: Path to .meta directory
86
+ tag: Tag name
87
+ values: Single value or list of values
88
+ """
89
+ tag_dir = meta_dir / tag
90
+ tag_dir.mkdir(parents=True, exist_ok=True)
91
+
92
+ if not isinstance(values, list):
93
+ values = [values]
94
+
95
+ for idx, value in enumerate(values):
96
+ (tag_dir / str(idx)).write_text(str(value))
97
+
98
+
99
+ def generate_ia_metadata(collection_dir):
100
+ """Generate Internet Archive metadata for a Mapillary collection.
101
+
102
+ Args:
103
+ collection_dir: Path to collection directory (e.g., ./mapillary_data/mapillary-username-original)
104
+
105
+ Returns:
106
+ True if successful, False otherwise
107
+ """
108
+ collection_dir = Path(collection_dir)
109
+ username, quality = parse_collection_name(collection_dir)
110
+
111
+ if not username or not quality:
112
+ logger.error(f"Could not parse username/quality from directory: {collection_dir.name}")
113
+ return False
114
+
115
+ metadata_file = collection_dir / "metadata.jsonl"
116
+ if not metadata_file.exists():
117
+ logger.error(f"metadata.jsonl not found in {collection_dir}")
118
+ return False
119
+
120
+ logger.info(f"Generating IA metadata for {collection_dir.name}...")
121
+
122
+ # Get date range and image count
123
+ first_date, last_date = get_date_range(metadata_file)
124
+ image_count = count_images(metadata_file)
125
+
126
+ if not first_date or not last_date:
127
+ logger.warning("Could not determine date range from metadata")
128
+ first_date = last_date = "unknown"
129
+
130
+ # Create .meta directory
131
+ meta_dir = collection_dir / ".meta"
132
+ meta_dir.mkdir(exist_ok=True)
133
+
134
+ # Generate metadata tags
135
+ write_meta_tag(
136
+ meta_dir,
137
+ "title",
138
+ f"Mapillary images by {username} ({quality} quality)",
139
+ )
140
+
141
+ description = (
142
+ f"Street-level imagery from Mapillary user '{username}'. "
143
+ f"Contains {image_count:,} images captured between {first_date} and {last_date}. "
144
+ f"Images are organized by sequence ID and include EXIF metadata with GPS coordinates, "
145
+ f"camera information, and compass direction.\n\n"
146
+ f"Downloaded using mapillary_downloader (https://bitplane.net/dev/python/mapillary_downloader/). "
147
+ f"Uploaded using rip (https://bitplane.net/dev/sh/rip)."
148
+ )
149
+ write_meta_tag(meta_dir, "description", description)
150
+
151
+ # Subject tags
152
+ write_meta_tag(
153
+ meta_dir,
154
+ "subject",
155
+ ["mapillary", "street-view", "computer-vision", "geospatial", "photography"],
156
+ )
157
+
158
+ write_meta_tag(meta_dir, "creator", username)
159
+ write_meta_tag(meta_dir, "date", first_date)
160
+ write_meta_tag(meta_dir, "coverage", f"{first_date} - {last_date}")
161
+ write_meta_tag(meta_dir, "licenseurl", "https://creativecommons.org/licenses/by-sa/4.0/")
162
+ write_meta_tag(meta_dir, "mediatype", "data")
163
+ write_meta_tag(meta_dir, "collection", "opensource_media")
164
+
165
+ # Source and scanner metadata
166
+ write_meta_tag(meta_dir, "source", f"https://www.mapillary.com/app/user/{username}")
167
+
168
+ downloader_version = version("mapillary_downloader")
169
+ write_meta_tag(
170
+ meta_dir,
171
+ "scanner",
172
+ [
173
+ f"mapillary_downloader {downloader_version} https://bitplane.net/dev/python/mapillary_downloader/",
174
+ "rip https://bitplane.net/dev/sh/rip",
175
+ ],
176
+ )
177
+
178
+ # Add searchable tag for batch collection management
179
+ write_meta_tag(meta_dir, "mapillary_downloader", downloader_version)
180
+
181
+ logger.info(f"IA metadata generated in {meta_dir}")
182
+ return True
@@ -0,0 +1,112 @@
1
+ """Tar sequence directories for efficient Internet Archive uploads."""
2
+
3
+ import logging
4
+ import subprocess
5
+ from pathlib import Path
6
+
7
+ logger = logging.getLogger("mapillary_downloader")
8
+
9
+
10
+ def tar_sequence_directories(collection_dir):
11
+ """Tar all sequence directories in a collection for faster IA uploads.
12
+
13
+ Args:
14
+ collection_dir: Path to collection directory (e.g., mapillary-user-quality/)
15
+
16
+ Returns:
17
+ Tuple of (tarred_count, total_files_tarred)
18
+ """
19
+ collection_dir = Path(collection_dir)
20
+
21
+ if not collection_dir.exists():
22
+ logger.error(f"Collection directory not found: {collection_dir}")
23
+ return 0, 0
24
+
25
+ # Find all sequence directories (skip special dirs)
26
+ skip_dirs = {".meta", "__pycache__"}
27
+ sequence_dirs = []
28
+
29
+ for item in collection_dir.iterdir():
30
+ if item.is_dir() and item.name not in skip_dirs:
31
+ sequence_dirs.append(item)
32
+
33
+ if not sequence_dirs:
34
+ logger.info("No sequence directories to tar")
35
+ return 0, 0
36
+
37
+ logger.info(f"Tarring {len(sequence_dirs)} sequence directories...")
38
+
39
+ tarred_count = 0
40
+ total_files = 0
41
+
42
+ for seq_dir in sequence_dirs:
43
+ seq_name = seq_dir.name
44
+ tar_path = collection_dir / f"{seq_name}.tar"
45
+
46
+ # Handle naming collision - find next available name
47
+ counter = 1
48
+ while tar_path.exists():
49
+ counter += 1
50
+ tar_path = collection_dir / f"{seq_name}.{counter}.tar"
51
+
52
+ # Count files in sequence
53
+ files = list(seq_dir.glob("*"))
54
+ file_count = len([f for f in files if f.is_file()])
55
+
56
+ if file_count == 0:
57
+ logger.warning(f"Skipping empty directory: {seq_name}")
58
+ continue
59
+
60
+ try:
61
+ # Create uncompressed tar (WebP already compressed)
62
+ # Use -C to change directory so paths in tar are relative
63
+ # Use -- to prevent sequence IDs starting with - from being interpreted as options
64
+ result = subprocess.run(
65
+ ["tar", "-cf", str(tar_path), "-C", str(collection_dir), "--", seq_name],
66
+ capture_output=True,
67
+ text=True,
68
+ timeout=300, # 5 minute timeout per tar
69
+ )
70
+
71
+ if result.returncode != 0:
72
+ logger.error(f"Failed to tar {seq_name}: {result.stderr}")
73
+ continue
74
+
75
+ # Verify tar was created and has size
76
+ if tar_path.exists() and tar_path.stat().st_size > 0:
77
+ # Remove original directory
78
+ for file in seq_dir.rglob("*"):
79
+ if file.is_file():
80
+ file.unlink()
81
+
82
+ # Remove empty subdirs and main dir
83
+ for subdir in list(seq_dir.rglob("*")):
84
+ if subdir.is_dir():
85
+ try:
86
+ subdir.rmdir()
87
+ except OSError:
88
+ pass # Not empty yet
89
+
90
+ seq_dir.rmdir()
91
+
92
+ tarred_count += 1
93
+ total_files += file_count
94
+
95
+ if tarred_count % 10 == 0:
96
+ logger.info(f"Tarred {tarred_count}/{len(sequence_dirs)} sequences...")
97
+ else:
98
+ logger.error(f"Tar file empty or not created: {tar_path}")
99
+ if tar_path.exists():
100
+ tar_path.unlink()
101
+
102
+ except subprocess.TimeoutExpired:
103
+ logger.error(f"Timeout tarring {seq_name}")
104
+ if tar_path.exists():
105
+ tar_path.unlink()
106
+ except Exception as e:
107
+ logger.error(f"Error tarring {seq_name}: {e}")
108
+ if tar_path.exists():
109
+ tar_path.unlink()
110
+
111
+ logger.info(f"Tarred {tarred_count} sequences ({total_files:,} files total)")
112
+ return tarred_count, total_files
@@ -0,0 +1,63 @@
1
+ """WebP image conversion utilities."""
2
+
3
+ import logging
4
+ import shutil
5
+ import subprocess
6
+ from pathlib import Path
7
+
8
+ logger = logging.getLogger("mapillary_downloader")
9
+
10
+
11
+ def check_cwebp_available():
12
+ """Check if cwebp binary is available.
13
+
14
+ Returns:
15
+ bool: True if cwebp is found, False otherwise
16
+ """
17
+ return shutil.which("cwebp") is not None
18
+
19
+
20
+ def convert_to_webp(jpg_path, output_path=None, delete_original=True):
21
+ """Convert a JPG image to WebP format, preserving EXIF metadata.
22
+
23
+ Args:
24
+ jpg_path: Path to the JPG file
25
+ output_path: Optional path for the WebP output. If None, uses jpg_path with .webp extension
26
+ delete_original: Whether to delete the original JPG after conversion (default: True)
27
+
28
+ Returns:
29
+ Path object to the new WebP file, or None if conversion failed
30
+ """
31
+ jpg_path = Path(jpg_path)
32
+
33
+ if output_path is None:
34
+ webp_path = jpg_path.with_suffix(".webp")
35
+ else:
36
+ webp_path = Path(output_path)
37
+ # Ensure output directory exists
38
+ webp_path.parent.mkdir(parents=True, exist_ok=True)
39
+
40
+ try:
41
+ # Convert with cwebp, preserving all metadata
42
+ result = subprocess.run(
43
+ ["cwebp", "-metadata", "all", str(jpg_path), "-o", str(webp_path)],
44
+ capture_output=True,
45
+ text=True,
46
+ timeout=60,
47
+ )
48
+
49
+ if result.returncode != 0:
50
+ logger.error(f"cwebp conversion failed for {jpg_path}: {result.stderr}")
51
+ return None
52
+
53
+ # Delete original JPG after successful conversion if requested
54
+ if delete_original:
55
+ jpg_path.unlink()
56
+ return webp_path
57
+
58
+ except subprocess.TimeoutExpired:
59
+ logger.error(f"cwebp conversion timed out for {jpg_path}")
60
+ return None
61
+ except Exception as e:
62
+ logger.error(f"Error converting {jpg_path} to WebP: {e}")
63
+ return None
@@ -0,0 +1,102 @@
1
+ """Worker process for parallel image download and conversion."""
2
+
3
+ import tempfile
4
+ from pathlib import Path
5
+ import requests
6
+ from requests.exceptions import RequestException
7
+ import time
8
+ from mapillary_downloader.exif_writer import write_exif_to_image
9
+ from mapillary_downloader.webp_converter import convert_to_webp
10
+
11
+
12
+ def download_and_convert_image(image_data, output_dir, quality, convert_webp, access_token):
13
+ """Download and optionally convert a single image.
14
+
15
+ This function is designed to run in a worker process.
16
+
17
+ Args:
18
+ image_data: Image metadata dict from API
19
+ output_dir: Base output directory path
20
+ quality: Quality level (256, 1024, 2048, original)
21
+ convert_webp: Whether to convert to WebP
22
+ access_token: Mapillary API access token
23
+
24
+ Returns:
25
+ Tuple of (image_id, bytes_downloaded, success, error_msg)
26
+ """
27
+ image_id = image_data["id"]
28
+ quality_field = f"thumb_{quality}_url"
29
+
30
+ temp_dir = None
31
+ try:
32
+ # Get image URL
33
+ image_url = image_data.get(quality_field)
34
+ if not image_url:
35
+ return (image_id, 0, False, f"No {quality} URL")
36
+
37
+ # Determine final output directory
38
+ output_dir = Path(output_dir)
39
+ sequence_id = image_data.get("sequence")
40
+ if sequence_id:
41
+ img_dir = output_dir / sequence_id
42
+ img_dir.mkdir(parents=True, exist_ok=True)
43
+ else:
44
+ img_dir = output_dir
45
+
46
+ # If converting to WebP, use /tmp for intermediate JPEG
47
+ # Otherwise write JPEG directly to final location
48
+ if convert_webp:
49
+ temp_dir = tempfile.mkdtemp(prefix="mapillary_downloader_")
50
+ jpg_path = Path(temp_dir) / f"{image_id}.jpg"
51
+ final_path = img_dir / f"{image_id}.webp"
52
+ else:
53
+ jpg_path = img_dir / f"{image_id}.jpg"
54
+ final_path = jpg_path
55
+
56
+ # Download image
57
+ session = requests.Session()
58
+ session.headers.update({"Authorization": f"OAuth {access_token}"})
59
+
60
+ max_retries = 10
61
+ base_delay = 1.0
62
+ bytes_downloaded = 0
63
+
64
+ for attempt in range(max_retries):
65
+ try:
66
+ bytes_downloaded = 0  # reset so a retried partial download isn't counted twice
67
+ response = session.get(image_url, stream=True, timeout=60)
68
+ response.raise_for_status()
69
+ with open(jpg_path, "wb") as f:
70
+ for chunk in response.iter_content(chunk_size=8192):
71
+ f.write(chunk)
72
+ bytes_downloaded += len(chunk)
73
+ break
74
+ except RequestException as e:
75
+ if attempt == max_retries - 1:
76
+ return (image_id, 0, False, f"Download failed: {e}")
77
+
78
+ delay = base_delay * (2**attempt)
79
+ time.sleep(delay)
80
+
81
+ # Write EXIF metadata
82
+ write_exif_to_image(jpg_path, image_data)
83
+
84
+ # Convert to WebP if requested
85
+ if convert_webp:
86
+ webp_path = convert_to_webp(jpg_path, output_path=final_path, delete_original=False)
87
+ if not webp_path:
88
+ return (image_id, bytes_downloaded, False, "WebP conversion failed")
89
+
90
+ return (image_id, bytes_downloaded, True, None)
91
+
92
+ except Exception as e:
93
+ return (image_id, 0, False, str(e))
94
+ finally:
95
+ # Clean up temp directory if it was created
96
+ if temp_dir and Path(temp_dir).exists():
97
+ try:
98
+ for file in Path(temp_dir).glob("*"):
99
+ file.unlink()
100
+ Path(temp_dir).rmdir()
101
+ except Exception:
102
+ pass # Best effort cleanup
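`download_and_convert_image` returns `(image_id, bytes_downloaded, success, error_msg)` tuples so a parent process can fan work out and aggregate results without sharing state. A minimal sketch of that dispatch with `concurrent.futures` — the stub worker, the `run_pool` helper, and the `executor_cls` hook are illustrative, not the package's actual driver:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def stub_worker(image_data, output_dir, quality, convert_webp, access_token):
    # Stand-in for download_and_convert_image: pretends every image
    # downloaded 1000 bytes successfully.
    return (image_data["id"], 1000, True, None)


def run_pool(images, workers=4, executor_cls=ProcessPoolExecutor):
    """Fan images out to workers and aggregate the result tuples."""
    total_bytes = 0
    failures = []
    with executor_cls(max_workers=workers) as pool:
        futures = [
            pool.submit(stub_worker, img, "./out", "original", False, "TOKEN")
            for img in images
        ]
        for future in as_completed(futures):
            image_id, nbytes, ok, err = future.result()
            if ok:
                total_bytes += nbytes
            else:
                failures.append((image_id, err))
    return total_bytes, failures
```

Because each worker builds its own `requests.Session`, nothing crosses the process boundary beyond the picklable arguments and the returned tuple; `executor_cls` lets tests swap in a thread pool.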
@@ -1,63 +0,0 @@
1
- # 🗺️ Mapillary Downloader
2
-
3
- Download your Mapillary data before it's gone.
4
-
5
- ## Installation
6
-
7
- ```bash
8
- pip install mapillary-downloader
9
- ```
10
-
11
- Or from source:
12
-
13
- ```bash
14
- make install
15
- ```
16
-
17
- ## Usage
18
-
19
- First, get your Mapillary API access token from https://www.mapillary.com/dashboard/developers
20
-
21
- ```bash
22
- mapillary-downloader --token YOUR_TOKEN --username YOUR_USERNAME --output ./downloads
23
- ```
24
-
25
- | option | because | default |
26
- | ------------- | ------------------------------------- | ------------------ |
27
- | `--token` | Your Mapillary API access token | None (required) |
28
- | `--username` | Your Mapillary username | None (required) |
29
- | `--output` | Output directory | `./mapillary_data` |
30
- | `--quality` | 256, 1024, 2048 or original | `original` |
31
- | `--bbox` | `west,south,east,north` | `None` |
32
-
33
- The downloader will:
34
-
35
- * 💾 Fetch all your uploaded images from Mapillary
36
- * 📷 Download full-resolution images organized by sequence
37
- * 📜 Inject EXIF metadata (GPS coordinates, camera info, timestamps,
38
- compass direction)
39
- * 🛟 Save progress so you can safely resume if interrupted
40
-
41
- ## Development
42
-
43
- ```bash
44
- make dev # Setup dev environment
45
- make test # Run tests
46
- make coverage # Run tests with coverage
47
- ```
48
-
49
- ## Links
50
-
51
- * [🏠 home](https://bitplane.net/dev/python/mapillary_downloader)
52
- * [📖 pydoc](https://bitplane.net/dev/python/mapillary_downloader/pydoc)
53
- * [🐍 pypi](https://pypi.org/project/mapillary-downloader)
54
- * [🐱 github](https://github.com/bitplane/mapillary_downloader)
55
-
56
- ## License
57
-
58
- WTFPL with one additional clause
59
-
60
- 1. Don't blame me
61
-
62
- Do wtf you want, but don't blame me if it makes jokes about the size of your
63
- disk drive.
@@ -1,52 +0,0 @@
1
- """CLI entry point."""
2
-
3
- import argparse
4
- import sys
5
- from mapillary_downloader.client import MapillaryClient
6
- from mapillary_downloader.downloader import MapillaryDownloader
7
- from mapillary_downloader.logging_config import setup_logging
8
-
9
-
10
- def main():
11
- """Main CLI entry point."""
12
- # Set up logging
13
- logger = setup_logging()
14
-
15
- parser = argparse.ArgumentParser(description="Download your Mapillary data before it's gone")
16
- parser.add_argument("--token", required=True, help="Mapillary API access token")
17
- parser.add_argument("--username", required=True, help="Your Mapillary username")
18
- parser.add_argument("--output", default="./mapillary_data", help="Output directory (default: ./mapillary_data)")
19
- parser.add_argument(
20
- "--quality",
21
- choices=["256", "1024", "2048", "original"],
22
- default="original",
23
- help="Image quality to download (default: original)",
24
- )
25
- parser.add_argument("--bbox", help="Bounding box: west,south,east,north")
26
-
27
- args = parser.parse_args()
28
-
29
- bbox = None
30
- if args.bbox:
31
- try:
32
- bbox = [float(x) for x in args.bbox.split(",")]
33
- if len(bbox) != 4:
34
- raise ValueError
35
- except ValueError:
36
- logger.error("Error: bbox must be four comma-separated numbers")
37
- sys.exit(1)
38
-
39
- try:
40
- client = MapillaryClient(args.token)
41
- downloader = MapillaryDownloader(client, args.output)
42
- downloader.download_user_data(args.username, args.quality, bbox)
43
- except KeyboardInterrupt:
44
- logger.info("\nInterrupted by user")
45
- sys.exit(1)
46
- except Exception as e:
47
- logger.error(f"Error: {e}")
48
- sys.exit(1)
49
-
50
-
51
- if __name__ == "__main__":
52
- main()
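The deleted CLI above validated `--bbox` as four comma-separated floats, exiting on anything else. That check can be sketched standalone; the `parse_bbox` name and the coordinate-range guards are additions for illustration, not in the original:

```python
def parse_bbox(value):
    """Parse 'west,south,east,north' into a list of four floats."""
    try:
        parts = [float(x) for x in value.split(",")]
    except ValueError:
        raise ValueError("bbox values must be numbers")
    if len(parts) != 4:
        raise ValueError("bbox must be west,south,east,north")
    west, south, east, north = parts
    # Range guards not present in the original CLI
    if not (-180 <= west <= east <= 180):
        raise ValueError("longitudes must satisfy -180 <= west <= east <= 180")
    if not (-90 <= south <= north <= 90):
        raise ValueError("latitudes must satisfy -90 <= south <= north <= 90")
    return parts
```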
@@ -1,192 +0,0 @@
1
- """Main downloader logic."""
2
-
3
- import json
4
- import logging
5
- import os
6
- import time
7
- from pathlib import Path
8
- from collections import deque
9
- from mapillary_downloader.exif_writer import write_exif_to_image
10
- from mapillary_downloader.utils import format_size, format_time
11
-
12
- logger = logging.getLogger("mapillary_downloader")
13
-
14
-
15
- class MapillaryDownloader:
16
- """Handles downloading Mapillary data for a user."""
17
-
18
- def __init__(self, client, output_dir):
19
- """Initialize the downloader.
20
-
21
- Args:
22
- client: MapillaryClient instance
23
- output_dir: Directory to save downloads
24
- """
25
- self.client = client
26
- self.output_dir = Path(output_dir)
27
- self.output_dir.mkdir(parents=True, exist_ok=True)
28
-
29
- self.metadata_file = self.output_dir / "metadata.jsonl"
30
- self.progress_file = self.output_dir / "progress.json"
31
- self.downloaded = self._load_progress()
32
-
33
- def _load_progress(self):
34
- """Load previously downloaded image IDs."""
35
- if self.progress_file.exists():
36
- with open(self.progress_file) as f:
37
- return set(json.load(f).get("downloaded", []))
38
- return set()
39
-
40
- def _save_progress(self):
41
- """Save progress to disk atomically."""
42
- temp_file = self.progress_file.with_suffix(".json.tmp")
43
- with open(temp_file, "w") as f:
44
- json.dump({"downloaded": list(self.downloaded)}, f)
45
- f.flush()
46
- os.fsync(f.fileno())
47
- temp_file.replace(self.progress_file)
48
-
49
- def download_user_data(self, username, quality="original", bbox=None):
50
- """Download all images for a user.
51
-
52
- Args:
53
- username: Mapillary username
54
- quality: Image quality to download (256, 1024, 2048, original)
55
- bbox: Optional bounding box [west, south, east, north]
56
- """
57
- quality_field = f"thumb_{quality}_url"
58
-
59
- logger.info(f"Downloading images for user: {username}")
60
- logger.info(f"Output directory: {self.output_dir}")
61
- logger.info(f"Quality: {quality}")
62
-
63
- processed = 0
64
- downloaded_count = 0
65
- skipped = 0
66
- total_bytes = 0
67
-
68
- # Track download times for adaptive ETA (last 50 downloads)
69
- download_times = deque(maxlen=50)
70
- start_time = time.time()
71
-
72
- # Track which image IDs we've seen in metadata to avoid re-fetching
73
- seen_ids = set()
74
-
75
- # First, process any existing metadata without re-fetching from API
76
- if self.metadata_file.exists():
77
- logger.info("Processing existing metadata file...")
78
- with open(self.metadata_file) as f:
79
- for line in f:
80
- if line.strip():
81
- image = json.loads(line)
82
- image_id = image["id"]
83
- seen_ids.add(image_id)
84
- processed += 1
85
-
86
- if image_id in self.downloaded:
87
- skipped += 1
88
- continue
89
-
90
- # Download this un-downloaded image
91
- image_url = image.get(quality_field)
92
- if not image_url:
93
- logger.warning(f"No {quality} URL for image {image_id}")
94
- continue
95
-
96
- sequence_id = image.get("sequence")
97
- if sequence_id:
98
- img_dir = self.output_dir / sequence_id
99
- img_dir.mkdir(exist_ok=True)
100
- else:
101
- img_dir = self.output_dir
102
-
103
- output_path = img_dir / f"{image_id}.jpg"
104
-
105
- download_start = time.time()
106
- bytes_downloaded = self.client.download_image(image_url, output_path)
107
- if bytes_downloaded:
108
- download_time = time.time() - download_start
109
- download_times.append(download_time)
110
-
111
- write_exif_to_image(output_path, image)
112
-
113
- self.downloaded.add(image_id)
114
- downloaded_count += 1
115
- total_bytes += bytes_downloaded
116
-
117
- progress_str = (
118
- f"Processed: {processed}, Downloaded: {downloaded_count} ({format_size(total_bytes)})"
119
- )
120
- logger.info(progress_str)
121
-
122
- if downloaded_count % 10 == 0:
123
- self._save_progress()
124
-
125
- # Always check API for new images (will skip duplicates via seen_ids)
126
- logger.info("Checking for new images from API...")
127
- with open(self.metadata_file, "a") as meta_f:
128
- for image in self.client.get_user_images(username, bbox=bbox):
129
- image_id = image["id"]
130
-
131
- # Skip if we already have this in our metadata file
132
- if image_id in seen_ids:
133
- continue
134
-
135
- seen_ids.add(image_id)
136
- processed += 1
137
-
138
- # Save new metadata
139
- meta_f.write(json.dumps(image) + "\n")
140
- meta_f.flush()
141
-
142
- # Skip if already downloaded
143
- if image_id in self.downloaded:
144
- skipped += 1
145
- continue
146
-
147
- # Download image
148
- image_url = image.get(quality_field)
149
- if not image_url:
150
- logger.warning(f"No {quality} URL for image {image_id}")
151
- continue
152
-
153
- # Use sequence ID for organization
154
- sequence_id = image.get("sequence")
155
- if sequence_id:
156
- img_dir = self.output_dir / sequence_id
157
- img_dir.mkdir(exist_ok=True)
158
- else:
159
- img_dir = self.output_dir
160
-
161
- output_path = img_dir / f"{image_id}.jpg"
162
-
163
- download_start = time.time()
164
- bytes_downloaded = self.client.download_image(image_url, output_path)
165
- if bytes_downloaded:
166
- download_time = time.time() - download_start
167
- download_times.append(download_time)
168
-
169
- # Write EXIF metadata to the downloaded image
170
- write_exif_to_image(output_path, image)
171
-
172
- self.downloaded.add(image_id)
173
- downloaded_count += 1
174
- total_bytes += bytes_downloaded
175
-
176
- # Calculate progress
177
- progress_str = (
178
- f"Processed: {processed}, Downloaded: {downloaded_count} ({format_size(total_bytes)})"
179
- )
180
-
181
- logger.info(progress_str)
182
-
183
- # Save progress every 10 images
184
- if downloaded_count % 10 == 0:
185
- self._save_progress()
186
-
187
- self._save_progress()
188
- elapsed = time.time() - start_time
189
- logger.info(
190
- f"Complete! Processed {processed} images, downloaded {downloaded_count} ({format_size(total_bytes)}), skipped {skipped}"
191
- )
192
- logger.info(f"Total time: {format_time(elapsed)}")
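The removed downloader's resume support hinged on `_save_progress`: write to a temp file, flush and fsync, then rename over the real file so an interrupted run never sees half-written JSON. The same pattern as a standalone sketch (function names and file layout illustrative):

```python
import json
import os
from pathlib import Path


def save_progress(progress_file, downloaded):
    """Atomically persist the set of downloaded image IDs."""
    progress_file = Path(progress_file)
    temp_file = progress_file.with_suffix(".json.tmp")
    with open(temp_file, "w") as f:
        json.dump({"downloaded": sorted(downloaded)}, f)
        f.flush()
        os.fsync(f.fileno())  # force bytes to disk before the rename
    temp_file.replace(progress_file)  # atomic rename on POSIX


def load_progress(progress_file):
    """Load previously downloaded image IDs, or an empty set."""
    progress_file = Path(progress_file)
    if progress_file.exists():
        with open(progress_file) as f:
            return set(json.load(f).get("downloaded", []))
    return set()
```

`Path.replace` (like `os.replace`) is atomic only on the same filesystem, which is why the temp file lives next to the real one rather than in `/tmp`.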