mapillary-downloader 0.2.0__tar.gz → 0.3.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: mapillary_downloader
3
- Version: 0.2.0
3
+ Version: 0.3.1
4
4
  Summary: Download your Mapillary data before it's gone
5
5
  Author-email: Gareth Davidson <gaz@bitplane.net>
6
6
  Requires-Python: >=3.10
@@ -34,52 +34,50 @@ Download your Mapillary data before it's gone.
34
34
 
35
35
  ## Installation
36
36
 
37
- ```bash
38
- pip install mapillary-downloader
39
- ```
40
-
41
- Or from source:
37
+ Installation is optional: you can prefix the command with `uvx` or `pipx` to
38
+ download and run it. Or if you're oldskool, you can do:
42
39
 
43
40
  ```bash
44
- make install
41
+ pip install mapillary-downloader
45
42
  ```
46
43
 
47
44
  ## Usage
48
45
 
49
- First, get your Mapillary API access token from https://www.mapillary.com/dashboard/developers
46
+ First, get your Mapillary API access token from
47
+ [the developer dashboard](https://www.mapillary.com/dashboard/developers).
50
48
 
51
49
  ```bash
52
- mapillary-downloader --token YOUR_TOKEN --username YOUR_USERNAME --output ./downloads
50
+ # Set token via environment variable
51
+ export MAPILLARY_TOKEN=YOUR_TOKEN
52
+ mapillary-downloader --username SOME_USERNAME --output ./downloads
53
+
54
+ # Or pass token directly, and have it in your shell history 💩👀
55
+ mapillary-downloader --token YOUR_TOKEN --username SOME_USERNAME --output ./downloads
53
56
  ```
54
57
 
55
58
  | option | because | default |
56
59
  | ------------- | ------------------------------------- | ------------------ |
57
- | `--token` | Your Mapillary API access token | None (required) |
58
- | `--username` | Your Mapillary username | None (required) |
60
+ | `--username` | Mapillary username | None (required) |
61
+ | `--token` | Mapillary API token (or env var) | `$MAPILLARY_TOKEN` |
59
62
  | `--output` | Output directory | `./mapillary_data` |
60
63
  | `--quality` | 256, 1024, 2048 or original | `original` |
61
64
  | `--bbox` | `west,south,east,north` | `None` |
62
65
  | `--webp` | Convert to WebP (saves ~70% space) | `False` |
66
+ | `--workers` | Number of parallel download workers | Half of CPU count |
67
+ | `--no-tar` | Don't tar sequence directories | `False` |
63
68
 
64
69
  The downloader will:
65
70
 
66
- * 💾 Fetch all your uploaded images from Mapillary
67
- * 📷 Download full-resolution images organized by sequence
71
+ * 📷 Download a user's images organized by sequence
68
72
  * 📜 Inject EXIF metadata (GPS coordinates, camera info, timestamps,
69
73
  compass direction)
70
74
  * 🛟 Save progress so you can safely resume if interrupted
71
- * 🗜️ Optionally convert to WebP format for massive space savings
75
+ * 🗜️ Optionally convert to WebP to save space
76
+ * 📦 Tar sequence directories for faster uploads
72
77
 
73
78
  ## WebP Conversion
74
79
 
75
- Use the `--webp` flag to convert images to WebP format after download:
76
-
77
- ```bash
78
- mapillary-downloader --token YOUR_TOKEN --username YOUR_USERNAME --webp
79
- ```
80
-
81
- This reduces storage by approximately 70% while preserving all EXIF metadata
82
- including GPS coordinates. Requires the `cwebp` binary to be installed:
80
+ You'll need `cwebp` to use the `--webp` flag. So install it:
83
81
 
84
82
  ```bash
85
83
  # Debian/Ubuntu
@@ -89,20 +87,46 @@ sudo apt install webp
89
87
  brew install webp
90
88
  ```
91
89
 
90
+ ## Sequence Tarball Creation
91
+
92
+ By default, sequence directories are automatically tarred after download because
93
+ if they weren't, you'd spend more time setting up upload metadata than actually
94
+ uploading files to the Internet Archive (IA).
95
+
96
+ To keep individual files instead of creating tars, use the `--no-tar` flag:
97
+
98
+ ```bash
99
+ mapillary-downloader --username WHOEVER --no-tar
100
+ ```
101
+
102
+ ## Internet Archive upload
103
+
104
+ I've written a bash tool to rip media and then tag, queue, and upload it to the
105
+ Internet Archive. This downloader writes its metadata in the same format. If you copy completed
106
+ download dirs into the `4.ship` dir, they'll find their way into an
107
+ appropriately named item.
108
+
109
+ See inlay for details:
110
+
111
+ * [📀 rip](https://bitplane.net/dev/sh/rip)
112
+
113
+
92
114
  ## Development
93
115
 
94
116
  ```bash
95
117
  make dev # Setup dev environment
96
118
  make test # Run tests
97
- make coverage # Run tests with coverage
119
+ make dist # Build the distribution
120
+ make help # See other make options
98
121
  ```
99
122
 
100
123
  ## Links
101
124
 
102
125
  * [🏠 home](https://bitplane.net/dev/python/mapillary_downloader)
103
- * [📖 pydoc](https://bitplane.net/dev/python/mapillary_downloader/pydoc)
126
+ * [📖 pydoc](https://bitplane.net/dev/python/mapillary_downloader/pydoc)
104
127
  * [🐍 pypi](https://pypi.org/project/mapillary-downloader)
105
128
  * [🐱 github](https://github.com/bitplane/mapillary_downloader)
129
+ * [📀 rip](https://bitplane.net/dev/sh/rip)
106
130
 
107
131
  ## License
108
132
 
@@ -0,0 +1,108 @@
1
+ # 🗺️ Mapillary Downloader
2
+
3
+ Download your Mapillary data before it's gone.
4
+
5
+ ## Installation
6
+
7
+ Installation is optional: you can prefix the command with `uvx` or `pipx` to
8
+ download and run it. Or if you're oldskool, you can do:
9
+
10
+ ```bash
11
+ pip install mapillary-downloader
12
+ ```
13
+
14
+ ## Usage
15
+
16
+ First, get your Mapillary API access token from
17
+ [the developer dashboard](https://www.mapillary.com/dashboard/developers).
18
+
19
+ ```bash
20
+ # Set token via environment variable
21
+ export MAPILLARY_TOKEN=YOUR_TOKEN
22
+ mapillary-downloader --username SOME_USERNAME --output ./downloads
23
+
24
+ # Or pass token directly, and have it in your shell history 💩👀
25
+ mapillary-downloader --token YOUR_TOKEN --username SOME_USERNAME --output ./downloads
26
+ ```
27
+
28
+ | option | because | default |
29
+ | ------------- | ------------------------------------- | ------------------ |
30
+ | `--username` | Mapillary username | None (required) |
31
+ | `--token` | Mapillary API token (or env var) | `$MAPILLARY_TOKEN` |
32
+ | `--output` | Output directory | `./mapillary_data` |
33
+ | `--quality` | 256, 1024, 2048 or original | `original` |
34
+ | `--bbox` | `west,south,east,north` | `None` |
35
+ | `--webp` | Convert to WebP (saves ~70% space) | `False` |
36
+ | `--workers` | Number of parallel download workers | Half of CPU count |
37
+ | `--no-tar` | Don't tar sequence directories | `False` |
38
+
39
+ The downloader will:
40
+
41
+ * 📷 Download a user's images organized by sequence
42
+ * 📜 Inject EXIF metadata (GPS coordinates, camera info, timestamps,
43
+ compass direction)
44
+ * 🛟 Save progress so you can safely resume if interrupted (format sketched below)
45
+ * 🗜️ Optionally convert to WebP to save space
46
+ * 📦 Tar sequence directories for faster uploads
47
+
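For the curious, resuming is driven by a tiny JSON ledger kept next to the images. A minimal sketch of the pattern used in `downloader.py` (the collection path is illustrative):

```python
# progress.json holds the IDs already fetched; it's written to a temp file,
# flushed and fsync'd, then atomically renamed, so an interrupted run never
# leaves a half-written ledger behind.
import json
import os
from pathlib import Path

progress_file = Path("./mapillary_data/mapillary-someone-original/progress.json")

def load_downloaded():
    """Return the set of already-downloaded image IDs (empty on first run)."""
    if progress_file.exists():
        with open(progress_file) as f:
            return set(json.load(f).get("downloaded", []))
    return set()

def save_downloaded(downloaded):
    """Write progress atomically: temp file, flush + fsync, then rename."""
    tmp = progress_file.with_suffix(".json.tmp")
    with open(tmp, "w") as f:
        json.dump({"downloaded": list(downloaded)}, f)
        f.flush()
        os.fsync(f.fileno())
    tmp.replace(progress_file)
```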
48
+ ## WebP Conversion
49
+
50
+ You'll need `cwebp` to use the `--webp` flag. So install it:
51
+
52
+ ```bash
53
+ # Debian/Ubuntu
54
+ sudo apt install webp
55
+
56
+ # macOS
57
+ brew install webp
58
+ ```
59
+
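Under the hood the conversion is one `cwebp` subprocess call per image. A rough sketch of the approach in `webp_converter.py`; the exact flags here (`-metadata all`, default quality) are an assumption, not a copy of the shipped command:

```python
# Hedged sketch: convert one JPEG to WebP while keeping its EXIF block
# (GPS, camera info, heading). "-metadata all" is a real cwebp flag; the
# shipped converter may use different quality settings.
import subprocess
from pathlib import Path

def jpg_to_webp(jpg_path):
    jpg = Path(jpg_path)
    webp = jpg.with_suffix(".webp")
    result = subprocess.run(
        ["cwebp", "-metadata", "all", str(jpg), "-o", str(webp)],
        capture_output=True, text=True, timeout=60,
    )
    if result.returncode != 0:
        raise RuntimeError(f"cwebp failed for {jpg}: {result.stderr}")
    return webp
```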
60
+ ## Sequence Tarball Creation
61
+
62
+ By default, sequence directories are automatically tarred after download because
63
+ if they weren't, you'd spend more time setting up upload metadata than actually
64
+ uploading files to the Internet Archive (IA).
65
+
66
+ To keep individual files instead of creating tars, use the `--no-tar` flag:
67
+
68
+ ```bash
69
+ mapillary-downloader --username WHOEVER --no-tar
70
+ ```
71
+
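The shipped code shells out to `tar` (with `-C` for relative paths and `--` so sequence IDs starting with a dash aren't read as options; see `tar_sequences.py` further down this diff), but the core idea fits in a few lines of stdlib. A sketch, assuming one directory per sequence:

```python
# One uncompressed .tar per sequence directory; the JPEG/WebP payloads are
# already compressed, so gzip would only burn CPU for no gain.
import tarfile
from pathlib import Path

def tar_sequence(collection_dir, seq_name):
    collection = Path(collection_dir)
    tar_path = collection / f"{seq_name}.tar"
    with tarfile.open(tar_path, "w") as tf:  # "w" = plain tar, no compression
        tf.add(collection / seq_name, arcname=seq_name)
    return tar_path
```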
72
+ ## Internet Archive upload
73
+
74
+ I've written a bash tool to rip media and then tag, queue, and upload it to the
75
+ Internet Archive. This downloader writes its metadata in the same format. If you copy completed
76
+ download dirs into the `4.ship` dir, they'll find their way into an
77
+ appropriately named item.
78
+
79
+ See inlay for details:
80
+
81
+ * [📀 rip](https://bitplane.net/dev/sh/rip)
82
+
83
+
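Concretely, "the same format" is rip's one-file-per-value layout: each tag is a directory under `.meta/`, and each value lives in a numbered file. This mirrors `write_meta_tag()` in `ia_meta.py` later in this diff:

```python
# Layout: .meta/<tag>/<index>, e.g. .meta/subject/0, .meta/subject/1, ...
from pathlib import Path

def write_meta_tag(meta_dir, tag, values):
    tag_dir = Path(meta_dir) / tag
    tag_dir.mkdir(parents=True, exist_ok=True)
    if not isinstance(values, list):
        values = [values]
    for idx, value in enumerate(values):
        (tag_dir / str(idx)).write_text(str(value))

write_meta_tag(".meta", "subject", ["mapillary", "street-view"])
# -> .meta/subject/0 contains "mapillary", .meta/subject/1 "street-view"
```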
84
+ ## Development
85
+
86
+ ```bash
87
+ make dev # Setup dev environment
88
+ make test # Run tests
89
+ make dist # Build the distribution
90
+ make help # See other make options
91
+ ```
92
+
93
+ ## Links
94
+
95
+ * [🏠 home](https://bitplane.net/dev/python/mapillary_downloader)
96
+ * [📖 pydoc](https://bitplane.net/dev/python/mapillary_downloader/pydoc)
97
+ * [🐍 pypi](https://pypi.org/project/mapillary-downloader)
98
+ * [🐱 github](https://github.com/bitplane/mapillary_downloader)
99
+ * [📀 rip](https://bitplane.net/dev/sh/rip)
100
+
101
+ ## License
102
+
103
+ WTFPL with one additional clause
104
+
105
+ 1. Don't blame me
106
+
107
+ Do wtf you want, but don't blame me if it makes jokes about the size of your
108
+ disk drive.
@@ -1,7 +1,7 @@
1
1
  [project]
2
2
  name = "mapillary_downloader"
3
3
  description = "Download your Mapillary data before it's gone"
4
- version = "0.2.0"
4
+ version = "0.3.1"
5
5
  authors = [
6
6
  { name = "Gareth Davidson", email = "gaz@bitplane.net" }
7
7
  ]
@@ -1,6 +1,7 @@
1
1
  """CLI entry point."""
2
2
 
3
3
  import argparse
4
+ import os
4
5
  import sys
5
6
  from mapillary_downloader.client import MapillaryClient
6
7
  from mapillary_downloader.downloader import MapillaryDownloader
@@ -14,8 +15,12 @@ def main():
14
15
  logger = setup_logging()
15
16
 
16
17
  parser = argparse.ArgumentParser(description="Download your Mapillary data before it's gone")
17
- parser.add_argument("--token", required=True, help="Mapillary API access token")
18
- parser.add_argument("--username", required=True, help="Your Mapillary username")
18
+ parser.add_argument(
19
+ "--token",
20
+ default=os.environ.get("MAPILLARY_TOKEN"),
21
+ help="Mapillary API access token (or set MAPILLARY_TOKEN env var)",
22
+ )
23
+ parser.add_argument("--username", required=True, help="Mapillary username")
19
24
  parser.add_argument("--output", default="./mapillary_data", help="Output directory (default: ./mapillary_data)")
20
25
  parser.add_argument(
21
26
  "--quality",
@@ -29,9 +34,25 @@ def main():
29
34
  action="store_true",
30
35
  help="Convert images to WebP format (saves ~70%% disk space, requires cwebp binary)",
31
36
  )
37
+ parser.add_argument(
38
+ "--workers",
39
+ type=int,
40
+ default=None,
41
+ help="Number of parallel workers (default: half of CPU cores)",
42
+ )
43
+ parser.add_argument(
44
+ "--no-tar",
45
+ action="store_true",
46
+ help="Don't tar sequence directories (keep individual files)",
47
+ )
32
48
 
33
49
  args = parser.parse_args()
34
50
 
51
+ # Check for token
52
+ if not args.token:
53
+ logger.error("Error: Mapillary API token required. Use --token or set MAPILLARY_TOKEN environment variable")
54
+ sys.exit(1)
55
+
35
56
  bbox = None
36
57
  if args.bbox:
37
58
  try:
@@ -51,8 +72,10 @@ def main():
51
72
 
52
73
  try:
53
74
  client = MapillaryClient(args.token)
54
- downloader = MapillaryDownloader(client, args.output)
55
- downloader.download_user_data(args.username, args.quality, bbox, convert_webp=args.webp)
75
+ downloader = MapillaryDownloader(
76
+ client, args.output, args.username, args.quality, workers=args.workers, tar_sequences=not args.no_tar
77
+ )
78
+ downloader.download_user_data(bbox=bbox, convert_webp=args.webp)
56
79
  except KeyboardInterrupt:
57
80
  logger.info("\nInterrupted by user")
58
81
  sys.exit(1)
@@ -0,0 +1,218 @@
1
+ """Main downloader logic."""
2
+
3
+ import json
4
+ import logging
5
+ import os
6
+ import time
7
+ from pathlib import Path
8
+ from concurrent.futures import ProcessPoolExecutor, as_completed
9
+ from mapillary_downloader.utils import format_size, format_time
10
+ from mapillary_downloader.ia_meta import generate_ia_metadata
11
+ from mapillary_downloader.worker import download_and_convert_image
12
+ from mapillary_downloader.tar_sequences import tar_sequence_directories
13
+
14
+ logger = logging.getLogger("mapillary_downloader")
15
+
16
+
17
+ class MapillaryDownloader:
18
+ """Handles downloading Mapillary data for a user."""
19
+
20
+ def __init__(self, client, output_dir, username=None, quality=None, workers=None, tar_sequences=True):
21
+ """Initialize the downloader.
22
+
23
+ Args:
24
+ client: MapillaryClient instance
25
+ output_dir: Base directory to save downloads
26
+ username: Mapillary username (for collection directory)
27
+ quality: Image quality (for collection directory)
28
+ workers: Number of parallel workers (default: half of cpu_count)
29
+ tar_sequences: Whether to tar sequence directories after download (default: True)
30
+ """
31
+ self.client = client
32
+ self.base_output_dir = Path(output_dir)
33
+ self.username = username
34
+ self.quality = quality
35
+ self.workers = workers if workers is not None else max(1, (os.cpu_count() or 2) // 2)
36
+ self.tar_sequences = tar_sequences
37
+
38
+ # If username and quality provided, create collection directory
39
+ if username and quality:
40
+ collection_name = f"mapillary-{username}-{quality}"
41
+ self.output_dir = self.base_output_dir / collection_name
42
+ else:
43
+ self.output_dir = self.base_output_dir
44
+
45
+ self.output_dir.mkdir(parents=True, exist_ok=True)
46
+
47
+ self.metadata_file = self.output_dir / "metadata.jsonl"
48
+ self.progress_file = self.output_dir / "progress.json"
49
+ self.downloaded = self._load_progress()
50
+
51
+ def _load_progress(self):
52
+ """Load previously downloaded image IDs."""
53
+ if self.progress_file.exists():
54
+ with open(self.progress_file) as f:
55
+ return set(json.load(f).get("downloaded", []))
56
+ return set()
57
+
58
+ def _save_progress(self):
59
+ """Save progress to disk atomically."""
60
+ temp_file = self.progress_file.with_suffix(".json.tmp")
61
+ with open(temp_file, "w") as f:
62
+ json.dump({"downloaded": list(self.downloaded)}, f)
63
+ f.flush()
64
+ os.fsync(f.fileno())
65
+ temp_file.replace(self.progress_file)
66
+
67
+ def download_user_data(self, bbox=None, convert_webp=False):
68
+ """Download all images for a user.
69
+
70
+ Args:
71
+ bbox: Optional bounding box [west, south, east, north]
72
+ convert_webp: Convert images to WebP format after download
73
+ """
74
+ if not self.username or not self.quality:
75
+ raise ValueError("Username and quality must be provided during initialization")
76
+
77
+ quality_field = f"thumb_{self.quality}_url"
78
+
79
+ logger.info(f"Downloading images for user: {self.username}")
80
+ logger.info(f"Output directory: {self.output_dir}")
81
+ logger.info(f"Quality: {self.quality}")
82
+ logger.info(f"Using {self.workers} parallel workers")
83
+
84
+ processed = 0
85
+ downloaded_count = 0
86
+ skipped = 0
87
+ total_bytes = 0
88
+ failed_count = 0
89
+
90
+ start_time = time.time()
91
+
92
+ # Track which image IDs we've seen in metadata to avoid re-fetching
93
+ seen_ids = set()
94
+
95
+ # Collect images to download from existing metadata
96
+ images_to_download = []
97
+
98
+ if self.metadata_file.exists():
99
+ logger.info("Processing existing metadata file...")
100
+ with open(self.metadata_file) as f:
101
+ for line in f:
102
+ if line.strip():
103
+ image = json.loads(line)
104
+ image_id = image["id"]
105
+ seen_ids.add(image_id)
106
+ processed += 1
107
+
108
+ if image_id in self.downloaded:
109
+ skipped += 1
110
+ continue
111
+
112
+ # Queue for download
113
+ if image.get(quality_field):
114
+ images_to_download.append(image)
115
+
116
+ # Download images from existing metadata in parallel
117
+ if images_to_download:
118
+ logger.info(f"Downloading {len(images_to_download)} images from existing metadata...")
119
+ downloaded_count, total_bytes, failed_count = self._download_images_parallel(
120
+ images_to_download, convert_webp
121
+ )
122
+
123
+ # Always check API for new images (will skip duplicates via seen_ids)
124
+ logger.info("Checking for new images from API...")
125
+ new_images = []
126
+
127
+ with open(self.metadata_file, "a") as meta_f:
128
+ for image in self.client.get_user_images(self.username, bbox=bbox):
129
+ image_id = image["id"]
130
+
131
+ # Skip if we already have this in our metadata file
132
+ if image_id in seen_ids:
133
+ continue
134
+
135
+ seen_ids.add(image_id)
136
+ processed += 1
137
+
138
+ # Save new metadata
139
+ meta_f.write(json.dumps(image) + "\n")
140
+ meta_f.flush()
141
+
142
+ # Skip if already downloaded
143
+ if image_id in self.downloaded:
144
+ skipped += 1
145
+ continue
146
+
147
+ # Queue for download
148
+ if image.get(quality_field):
149
+ new_images.append(image)
150
+
151
+ # Download new images in parallel
152
+ if new_images:
153
+ logger.info(f"Downloading {len(new_images)} new images...")
154
+ new_downloaded, new_bytes, new_failed = self._download_images_parallel(new_images, convert_webp)
155
+ downloaded_count += new_downloaded
156
+ total_bytes += new_bytes
157
+ failed_count += new_failed
158
+
159
+ self._save_progress()
160
+ elapsed = time.time() - start_time
161
+ logger.info(
162
+ f"Complete! Processed {processed} images, downloaded {downloaded_count} ({format_size(total_bytes)}), "
163
+ f"skipped {skipped}, failed {failed_count}"
164
+ )
165
+ logger.info(f"Total time: {format_time(elapsed)}")
166
+
167
+ # Tar sequence directories for efficient IA uploads
168
+ if self.tar_sequences:
169
+ tar_sequence_directories(self.output_dir)
170
+
171
+ # Generate IA metadata
172
+ generate_ia_metadata(self.output_dir)
173
+
174
+ def _download_images_parallel(self, images, convert_webp):
175
+ """Download images in parallel using worker pool.
176
+
177
+ Args:
178
+ images: List of image metadata dicts
179
+ convert_webp: Whether to convert to WebP
180
+
181
+ Returns:
182
+ Tuple of (downloaded_count, total_bytes, failed_count)
183
+ """
184
+ downloaded_count = 0
185
+ total_bytes = 0
186
+ failed_count = 0
187
+
188
+ with ProcessPoolExecutor(max_workers=self.workers) as executor:
189
+ # Submit all tasks
190
+ future_to_image = {}
191
+ for image in images:
192
+ future = executor.submit(
193
+ download_and_convert_image,
194
+ image,
195
+ str(self.output_dir),
196
+ self.quality,
197
+ convert_webp,
198
+ self.client.access_token,
199
+ )
200
+ future_to_image[future] = image["id"]
201
+
202
+ # Process results as they complete
203
+ for future in as_completed(future_to_image):
204
+ image_id, bytes_dl, success, error_msg = future.result()
205
+
206
+ if success:
207
+ self.downloaded.add(image_id)
208
+ downloaded_count += 1
209
+ total_bytes += bytes_dl
210
+
211
+ if downloaded_count % 10 == 0:
212
+ logger.info(f"Downloaded: {downloaded_count}/{len(images)} ({format_size(total_bytes)})")
213
+ self._save_progress()
214
+ else:
215
+ failed_count += 1
216
+ logger.warning(f"Failed to download {image_id}: {error_msg}")
217
+
218
+ return downloaded_count, total_bytes, failed_count
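Driving the downloader from Python mirrors what `cli.py` above does; a minimal sketch (token, username, and paths are placeholders):

```python
import os
from mapillary_downloader.client import MapillaryClient
from mapillary_downloader.downloader import MapillaryDownloader

client = MapillaryClient(os.environ["MAPILLARY_TOKEN"])
downloader = MapillaryDownloader(
    client, "./mapillary_data", "some_username", "original",
    workers=4, tar_sequences=True,
)
downloader.download_user_data(bbox=None, convert_webp=True)
```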
@@ -0,0 +1,182 @@
1
+ """Internet Archive metadata generation for Mapillary collections."""
2
+
3
+ import json
4
+ import logging
5
+ import re
6
+ from datetime import datetime
7
+ from pathlib import Path
8
+ from importlib.metadata import version
9
+
10
+ logger = logging.getLogger("mapillary_downloader")
11
+
12
+
13
+ def parse_collection_name(directory):
14
+ """Parse username and quality from directory name.
15
+
16
+ Args:
17
+ directory: Path to collection directory (e.g., mapillary-username-original)
18
+
19
+ Returns:
20
+ Tuple of (username, quality) or (None, None) if parsing fails
21
+ """
22
+ match = re.match(r"mapillary-(.+)-(256|1024|2048|original)$", Path(directory).name)
23
+ if match:
24
+ return match.group(1), match.group(2)
25
+ return None, None
26
+
27
+
28
+ def get_date_range(metadata_file):
29
+ """Get first and last captured_at dates from metadata.jsonl.
30
+
31
+ Args:
32
+ metadata_file: Path to metadata.jsonl file
33
+
34
+ Returns:
35
+ Tuple of (first_date, last_date) as ISO format strings, or (None, None)
36
+ """
37
+ if not Path(metadata_file).exists():
38
+ return None, None
39
+
40
+ timestamps = []
41
+ with open(metadata_file) as f:
42
+ for line in f:
43
+ if line.strip():
44
+ data = json.loads(line)
45
+ if "captured_at" in data:
46
+ timestamps.append(data["captured_at"])
47
+
48
+ if not timestamps:
49
+ return None, None
50
+
51
+ # Convert from milliseconds to seconds, then to datetime
52
+ first_ts = min(timestamps) / 1000
53
+ last_ts = max(timestamps) / 1000
54
+
55
+ first_date = datetime.fromtimestamp(first_ts).strftime("%Y-%m-%d")
56
+ last_date = datetime.fromtimestamp(last_ts).strftime("%Y-%m-%d")
57
+
58
+ return first_date, last_date
59
+
60
+
61
+ def count_images(metadata_file):
62
+ """Count number of images in metadata.jsonl.
63
+
64
+ Args:
65
+ metadata_file: Path to metadata.jsonl file
66
+
67
+ Returns:
68
+ Number of images
69
+ """
70
+ if not Path(metadata_file).exists():
71
+ return 0
72
+
73
+ count = 0
74
+ with open(metadata_file) as f:
75
+ for line in f:
76
+ if line.strip():
77
+ count += 1
78
+ return count
79
+
80
+
81
+ def write_meta_tag(meta_dir, tag, values):
82
+ """Write metadata tag files in rip format.
83
+
84
+ Args:
85
+ meta_dir: Path to .meta directory
86
+ tag: Tag name
87
+ values: Single value or list of values
88
+ """
89
+ tag_dir = meta_dir / tag
90
+ tag_dir.mkdir(parents=True, exist_ok=True)
91
+
92
+ if not isinstance(values, list):
93
+ values = [values]
94
+
95
+ for idx, value in enumerate(values):
96
+ (tag_dir / str(idx)).write_text(str(value))
97
+
98
+
99
+ def generate_ia_metadata(collection_dir):
100
+ """Generate Internet Archive metadata for a Mapillary collection.
101
+
102
+ Args:
103
+ collection_dir: Path to collection directory (e.g., ./mapillary_data/mapillary-username-original)
104
+
105
+ Returns:
106
+ True if successful, False otherwise
107
+ """
108
+ collection_dir = Path(collection_dir)
109
+ username, quality = parse_collection_name(collection_dir)
110
+
111
+ if not username or not quality:
112
+ logger.error(f"Could not parse username/quality from directory: {collection_dir.name}")
113
+ return False
114
+
115
+ metadata_file = collection_dir / "metadata.jsonl"
116
+ if not metadata_file.exists():
117
+ logger.error(f"metadata.jsonl not found in {collection_dir}")
118
+ return False
119
+
120
+ logger.info(f"Generating IA metadata for {collection_dir.name}...")
121
+
122
+ # Get date range and image count
123
+ first_date, last_date = get_date_range(metadata_file)
124
+ image_count = count_images(metadata_file)
125
+
126
+ if not first_date or not last_date:
127
+ logger.warning("Could not determine date range from metadata")
128
+ first_date = last_date = "unknown"
129
+
130
+ # Create .meta directory
131
+ meta_dir = collection_dir / ".meta"
132
+ meta_dir.mkdir(exist_ok=True)
133
+
134
+ # Generate metadata tags
135
+ write_meta_tag(
136
+ meta_dir,
137
+ "title",
138
+ f"Mapillary images by {username} ({quality} quality)",
139
+ )
140
+
141
+ description = (
142
+ f"Street-level imagery from Mapillary user '{username}'. "
143
+ f"Contains {image_count:,} images captured between {first_date} and {last_date}. "
144
+ f"Images are organized by sequence ID and include EXIF metadata with GPS coordinates, "
145
+ f"camera information, and compass direction.\n\n"
146
+ f"Downloaded using mapillary_downloader (https://bitplane.net/dev/python/mapillary_downloader/). "
147
+ f"Uploaded using rip (https://bitplane.net/dev/sh/rip)."
148
+ )
149
+ write_meta_tag(meta_dir, "description", description)
150
+
151
+ # Subject tags
152
+ write_meta_tag(
153
+ meta_dir,
154
+ "subject",
155
+ ["mapillary", "street-view", "computer-vision", "geospatial", "photography"],
156
+ )
157
+
158
+ write_meta_tag(meta_dir, "creator", username)
159
+ write_meta_tag(meta_dir, "date", first_date)
160
+ write_meta_tag(meta_dir, "coverage", f"{first_date} - {last_date}")
161
+ write_meta_tag(meta_dir, "licenseurl", "https://creativecommons.org/licenses/by-sa/4.0/")
162
+ write_meta_tag(meta_dir, "mediatype", "data")
163
+ write_meta_tag(meta_dir, "collection", "opensource_media")
164
+
165
+ # Source and scanner metadata
166
+ write_meta_tag(meta_dir, "source", f"https://www.mapillary.com/app/user/{username}")
167
+
168
+ downloader_version = version("mapillary_downloader")
169
+ write_meta_tag(
170
+ meta_dir,
171
+ "scanner",
172
+ [
173
+ f"mapillary_downloader {downloader_version} https://bitplane.net/dev/python/mapillary_downloader/",
174
+ "rip https://bitplane.net/dev/sh/rip",
175
+ ],
176
+ )
177
+
178
+ # Add searchable tag for batch collection management
179
+ write_meta_tag(meta_dir, "mapillary_downloader", downloader_version)
180
+
181
+ logger.info(f"IA metadata generated in {meta_dir}")
182
+ return True
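The downloader calls `generate_ia_metadata()` itself at the end of a run, but it also works standalone on an existing collection directory (path illustrative):

```python
from mapillary_downloader.ia_meta import generate_ia_metadata

# Writes .meta/title/0, .meta/description/0, .meta/subject/0..4, and so on.
ok = generate_ia_metadata("./mapillary_data/mapillary-someone-original")
```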
@@ -0,0 +1,112 @@
1
+ """Tar sequence directories for efficient Internet Archive uploads."""
2
+
3
+ import logging
4
+ import subprocess
5
+ from pathlib import Path
6
+
7
+ logger = logging.getLogger("mapillary_downloader")
8
+
9
+
10
+ def tar_sequence_directories(collection_dir):
11
+ """Tar all sequence directories in a collection for faster IA uploads.
12
+
13
+ Args:
14
+ collection_dir: Path to collection directory (e.g., mapillary-user-quality/)
15
+
16
+ Returns:
17
+ Tuple of (tarred_count, total_files_tarred)
18
+ """
19
+ collection_dir = Path(collection_dir)
20
+
21
+ if not collection_dir.exists():
22
+ logger.error(f"Collection directory not found: {collection_dir}")
23
+ return 0, 0
24
+
25
+ # Find all sequence directories (skip special dirs)
26
+ skip_dirs = {".meta", "__pycache__"}
27
+ sequence_dirs = []
28
+
29
+ for item in collection_dir.iterdir():
30
+ if item.is_dir() and item.name not in skip_dirs:
31
+ sequence_dirs.append(item)
32
+
33
+ if not sequence_dirs:
34
+ logger.info("No sequence directories to tar")
35
+ return 0, 0
36
+
37
+ logger.info(f"Tarring {len(sequence_dirs)} sequence directories...")
38
+
39
+ tarred_count = 0
40
+ total_files = 0
41
+
42
+ for seq_dir in sequence_dirs:
43
+ seq_name = seq_dir.name
44
+ tar_path = collection_dir / f"{seq_name}.tar"
45
+
46
+ # Handle naming collision - find next available name
47
+ counter = 1
48
+ while tar_path.exists():
49
+ counter += 1
50
+ tar_path = collection_dir / f"{seq_name}.{counter}.tar"
51
+
52
+ # Count files in sequence
53
+ files = list(seq_dir.glob("*"))
54
+ file_count = len([f for f in files if f.is_file()])
55
+
56
+ if file_count == 0:
57
+ logger.warning(f"Skipping empty directory: {seq_name}")
58
+ continue
59
+
60
+ try:
61
+ # Create uncompressed tar (WebP already compressed)
62
+ # Use -C to change directory so paths in tar are relative
63
+ # Use -- to prevent sequence IDs starting with - from being interpreted as options
64
+ result = subprocess.run(
65
+ ["tar", "-cf", str(tar_path), "-C", str(collection_dir), "--", seq_name],
66
+ capture_output=True,
67
+ text=True,
68
+ timeout=300, # 5 minute timeout per tar
69
+ )
70
+
71
+ if result.returncode != 0:
72
+ logger.error(f"Failed to tar {seq_name}: {result.stderr}")
73
+ continue
74
+
75
+ # Verify tar was created and has size
76
+ if tar_path.exists() and tar_path.stat().st_size > 0:
77
+ # Remove original directory
78
+ for file in seq_dir.rglob("*"):
79
+ if file.is_file():
80
+ file.unlink()
81
+
82
+ # Remove empty subdirs and main dir
83
+ for subdir in list(seq_dir.rglob("*")):
84
+ if subdir.is_dir():
85
+ try:
86
+ subdir.rmdir()
87
+ except OSError:
88
+ pass # Not empty yet
89
+
90
+ seq_dir.rmdir()
91
+
92
+ tarred_count += 1
93
+ total_files += file_count
94
+
95
+ if tarred_count % 10 == 0:
96
+ logger.info(f"Tarred {tarred_count}/{len(sequence_dirs)} sequences...")
97
+ else:
98
+ logger.error(f"Tar file empty or not created: {tar_path}")
99
+ if tar_path.exists():
100
+ tar_path.unlink()
101
+
102
+ except subprocess.TimeoutExpired:
103
+ logger.error(f"Timeout tarring {seq_name}")
104
+ if tar_path.exists():
105
+ tar_path.unlink()
106
+ except Exception as e:
107
+ logger.error(f"Error tarring {seq_name}: {e}")
108
+ if tar_path.exists():
109
+ tar_path.unlink()
110
+
111
+ logger.info(f"Tarred {tarred_count} sequences ({total_files:,} files total)")
112
+ return tarred_count, total_files
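Like the metadata step, tarring can be run standalone, for example after an earlier run that used `--no-tar` (path illustrative):

```python
from mapillary_downloader.tar_sequences import tar_sequence_directories

tarred, files = tar_sequence_directories("./mapillary_data/mapillary-someone-original")
print(f"packed {tarred} sequences ({files} files)")
```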
@@ -17,17 +17,25 @@ def check_cwebp_available():
17
17
  return shutil.which("cwebp") is not None
18
18
 
19
19
 
20
- def convert_to_webp(jpg_path):
20
+ def convert_to_webp(jpg_path, output_path=None, delete_original=True):
21
21
  """Convert a JPG image to WebP format, preserving EXIF metadata.
22
22
 
23
23
  Args:
24
24
  jpg_path: Path to the JPG file
25
+ output_path: Optional path for the WebP output. If None, uses jpg_path with .webp extension
26
+ delete_original: Whether to delete the original JPG after conversion (default: True)
25
27
 
26
28
  Returns:
27
29
  Path object to the new WebP file, or None if conversion failed
28
30
  """
29
31
  jpg_path = Path(jpg_path)
30
- webp_path = jpg_path.with_suffix(".webp")
32
+
33
+ if output_path is None:
34
+ webp_path = jpg_path.with_suffix(".webp")
35
+ else:
36
+ webp_path = Path(output_path)
37
+ # Ensure output directory exists
38
+ webp_path.parent.mkdir(parents=True, exist_ok=True)
31
39
 
32
40
  try:
33
41
  # Convert with cwebp, preserving all metadata
@@ -42,8 +50,9 @@ def convert_to_webp(jpg_path):
42
50
  logger.error(f"cwebp conversion failed for {jpg_path}: {result.stderr}")
43
51
  return None
44
52
 
45
- # Delete original JPG after successful conversion
46
- jpg_path.unlink()
53
+ # Delete original JPG after successful conversion if requested
54
+ if delete_original:
55
+ jpg_path.unlink()
47
56
  return webp_path
48
57
 
49
58
  except subprocess.TimeoutExpired:
@@ -0,0 +1,95 @@
1
+ """Worker process for parallel image download and conversion."""
2
+
3
+ import tempfile
4
+ from pathlib import Path
5
+ import requests
6
+ from requests.exceptions import RequestException
7
+ from mapillary_downloader.exif_writer import write_exif_to_image
8
+ from mapillary_downloader.webp_converter import convert_to_webp
9
+
10
+
11
+ def download_and_convert_image(image_data, output_dir, quality, convert_webp, access_token):
12
+ """Download and optionally convert a single image.
13
+
14
+ This function is designed to run in a worker process.
15
+
16
+ Args:
17
+ image_data: Image metadata dict from API
18
+ output_dir: Base output directory path
19
+ quality: Quality level (256, 1024, 2048, original)
20
+ convert_webp: Whether to convert to WebP
21
+ access_token: Mapillary API access token
22
+
23
+ Returns:
24
+ Tuple of (image_id, bytes_downloaded, success, error_msg)
25
+ """
26
+ image_id = image_data["id"]
27
+ quality_field = f"thumb_{quality}_url"
28
+
29
+ temp_dir = None
30
+ try:
31
+ # Get image URL
32
+ image_url = image_data.get(quality_field)
33
+ if not image_url:
34
+ return (image_id, 0, False, f"No {quality} URL")
35
+
36
+ # Determine final output directory
37
+ output_dir = Path(output_dir)
38
+ sequence_id = image_data.get("sequence")
39
+ if sequence_id:
40
+ img_dir = output_dir / sequence_id
41
+ img_dir.mkdir(parents=True, exist_ok=True)
42
+ else:
43
+ img_dir = output_dir
44
+
45
+ # If converting to WebP, use /tmp for intermediate JPEG
46
+ # Otherwise write JPEG directly to final location
47
+ if convert_webp:
48
+ temp_dir = tempfile.mkdtemp(prefix="mapillary_downloader_")
49
+ jpg_path = Path(temp_dir) / f"{image_id}.jpg"
50
+ final_path = img_dir / f"{image_id}.webp"
51
+ else:
52
+ jpg_path = img_dir / f"{image_id}.jpg"
53
+ final_path = jpg_path
54
+
55
+ # Download image
56
+ # No retries for CDN images - they're cheap, just skip failures and move on
57
+ session = requests.Session()
58
+ session.headers.update({"Authorization": f"OAuth {access_token}"})
59
+
60
+ bytes_downloaded = 0
61
+
62
+ try:
63
+ # 60s timeout applies to connect and each read; requests can't cap total download time
64
+ response = session.get(image_url, stream=True, timeout=60)
65
+ response.raise_for_status()
66
+
67
+ with open(jpg_path, "wb") as f:
68
+ for chunk in response.iter_content(chunk_size=8192):
69
+ f.write(chunk)
70
+ bytes_downloaded += len(chunk)
71
+ except RequestException as e:
72
+ return (image_id, 0, False, f"Download failed: {e}")
73
+
74
+ # Write EXIF metadata
75
+ write_exif_to_image(jpg_path, image_data)
76
+
77
+ # Convert to WebP if requested
78
+ if convert_webp:
79
+ webp_path = convert_to_webp(jpg_path, output_path=final_path, delete_original=False)
80
+ if not webp_path:
81
+ return (image_id, bytes_downloaded, False, "WebP conversion failed")
82
+
83
+ return (image_id, bytes_downloaded, True, None)
84
+
85
+ except Exception as e:
86
+ return (image_id, 0, False, str(e))
87
+ finally:
88
+ # Clean up temp directory if it was created
89
+ if temp_dir and Path(temp_dir).exists():
90
+ try:
91
+ for file in Path(temp_dir).glob("*"):
92
+ file.unlink()
93
+ Path(temp_dir).rmdir()
94
+ except Exception:
95
+ pass # Best effort cleanup
@@ -1,84 +0,0 @@
1
- # 🗺️ Mapillary Downloader
2
-
3
- Download your Mapillary data before it's gone.
4
-
5
- ## Installation
6
-
7
- ```bash
8
- pip install mapillary-downloader
9
- ```
10
-
11
- Or from source:
12
-
13
- ```bash
14
- make install
15
- ```
16
-
17
- ## Usage
18
-
19
- First, get your Mapillary API access token from https://www.mapillary.com/dashboard/developers
20
-
21
- ```bash
22
- mapillary-downloader --token YOUR_TOKEN --username YOUR_USERNAME --output ./downloads
23
- ```
24
-
25
- | option | because | default |
26
- | ------------- | ------------------------------------- | ------------------ |
27
- | `--token` | Your Mapillary API access token | None (required) |
28
- | `--username` | Your Mapillary username | None (required) |
29
- | `--output` | Output directory | `./mapillary_data` |
30
- | `--quality` | 256, 1024, 2048 or original | `original` |
31
- | `--bbox` | `west,south,east,north` | `None` |
32
- | `--webp` | Convert to WebP (saves ~70% space) | `False` |
33
-
34
- The downloader will:
35
-
36
- * 💾 Fetch all your uploaded images from Mapillary
37
- * 📷 Download full-resolution images organized by sequence
38
- * 📜 Inject EXIF metadata (GPS coordinates, camera info, timestamps,
39
- compass direction)
40
- * 🛟 Save progress so you can safely resume if interrupted
41
- * 🗜️ Optionally convert to WebP format for massive space savings
42
-
43
- ## WebP Conversion
44
-
45
- Use the `--webp` flag to convert images to WebP format after download:
46
-
47
- ```bash
48
- mapillary-downloader --token YOUR_TOKEN --username YOUR_USERNAME --webp
49
- ```
50
-
51
- This reduces storage by approximately 70% while preserving all EXIF metadata
52
- including GPS coordinates. Requires the `cwebp` binary to be installed:
53
-
54
- ```bash
55
- # Debian/Ubuntu
56
- sudo apt install webp
57
-
58
- # macOS
59
- brew install webp
60
- ```
61
-
62
- ## Development
63
-
64
- ```bash
65
- make dev # Setup dev environment
66
- make test # Run tests
67
- make coverage # Run tests with coverage
68
- ```
69
-
70
- ## Links
71
-
72
- * [🏠 home](https://bitplane.net/dev/python/mapillary_downloader)
73
- * [📖 pydoc](https://bitplane.net/dev/python/mapillary_downloader/pydoc)
74
- * [🐍 pypi](https://pypi.org/project/mapillary-downloader)
75
- * [🐱 github](https://github.com/bitplane/mapillary_downloader)
76
-
77
- ## License
78
-
79
- WTFPL with one additional clause
80
-
81
- 1. Don't blame me
82
-
83
- Do wtf you want, but don't blame me if it makes jokes about the size of your
84
- disk drive.
@@ -1,206 +0,0 @@
1
- """Main downloader logic."""
2
-
3
- import json
4
- import logging
5
- import os
6
- import time
7
- from pathlib import Path
8
- from collections import deque
9
- from mapillary_downloader.exif_writer import write_exif_to_image
10
- from mapillary_downloader.utils import format_size, format_time
11
- from mapillary_downloader.webp_converter import convert_to_webp
12
-
13
- logger = logging.getLogger("mapillary_downloader")
14
-
15
-
16
- class MapillaryDownloader:
17
- """Handles downloading Mapillary data for a user."""
18
-
19
- def __init__(self, client, output_dir):
20
- """Initialize the downloader.
21
-
22
- Args:
23
- client: MapillaryClient instance
24
- output_dir: Directory to save downloads
25
- """
26
- self.client = client
27
- self.output_dir = Path(output_dir)
28
- self.output_dir.mkdir(parents=True, exist_ok=True)
29
-
30
- self.metadata_file = self.output_dir / "metadata.jsonl"
31
- self.progress_file = self.output_dir / "progress.json"
32
- self.downloaded = self._load_progress()
33
-
34
- def _load_progress(self):
35
- """Load previously downloaded image IDs."""
36
- if self.progress_file.exists():
37
- with open(self.progress_file) as f:
38
- return set(json.load(f).get("downloaded", []))
39
- return set()
40
-
41
- def _save_progress(self):
42
- """Save progress to disk atomically."""
43
- temp_file = self.progress_file.with_suffix(".json.tmp")
44
- with open(temp_file, "w") as f:
45
- json.dump({"downloaded": list(self.downloaded)}, f)
46
- f.flush()
47
- os.fsync(f.fileno())
48
- temp_file.replace(self.progress_file)
49
-
50
- def download_user_data(self, username, quality="original", bbox=None, convert_webp=False):
51
- """Download all images for a user.
52
-
53
- Args:
54
- username: Mapillary username
55
- quality: Image quality to download (256, 1024, 2048, original)
56
- bbox: Optional bounding box [west, south, east, north]
57
- convert_webp: Convert images to WebP format after download
58
- """
59
- quality_field = f"thumb_{quality}_url"
60
-
61
- logger.info(f"Downloading images for user: {username}")
62
- logger.info(f"Output directory: {self.output_dir}")
63
- logger.info(f"Quality: {quality}")
64
-
65
- processed = 0
66
- downloaded_count = 0
67
- skipped = 0
68
- total_bytes = 0
69
-
70
- # Track download times for adaptive ETA (last 50 downloads)
71
- download_times = deque(maxlen=50)
72
- start_time = time.time()
73
-
74
- # Track which image IDs we've seen in metadata to avoid re-fetching
75
- seen_ids = set()
76
-
77
- # First, process any existing metadata without re-fetching from API
78
- if self.metadata_file.exists():
79
- logger.info("Processing existing metadata file...")
80
- with open(self.metadata_file) as f:
81
- for line in f:
82
- if line.strip():
83
- image = json.loads(line)
84
- image_id = image["id"]
85
- seen_ids.add(image_id)
86
- processed += 1
87
-
88
- if image_id in self.downloaded:
89
- skipped += 1
90
- continue
91
-
92
- # Download this un-downloaded image
93
- image_url = image.get(quality_field)
94
- if not image_url:
95
- logger.warning(f"No {quality} URL for image {image_id}")
96
- continue
97
-
98
- sequence_id = image.get("sequence")
99
- if sequence_id:
100
- img_dir = self.output_dir / sequence_id
101
- img_dir.mkdir(exist_ok=True)
102
- else:
103
- img_dir = self.output_dir
104
-
105
- output_path = img_dir / f"{image_id}.jpg"
106
-
107
- download_start = time.time()
108
- bytes_downloaded = self.client.download_image(image_url, output_path)
109
- if bytes_downloaded:
110
- download_time = time.time() - download_start
111
- download_times.append(download_time)
112
-
113
- write_exif_to_image(output_path, image)
114
-
115
- # Convert to WebP if requested
116
- if convert_webp:
117
- webp_path = convert_to_webp(output_path)
118
- if webp_path:
119
- output_path = webp_path
120
-
121
- self.downloaded.add(image_id)
122
- downloaded_count += 1
123
- total_bytes += bytes_downloaded
124
-
125
- progress_str = (
126
- f"Processed: {processed}, Downloaded: {downloaded_count} ({format_size(total_bytes)})"
127
- )
128
- logger.info(progress_str)
129
-
130
- if downloaded_count % 10 == 0:
131
- self._save_progress()
132
-
133
- # Always check API for new images (will skip duplicates via seen_ids)
134
- logger.info("Checking for new images from API...")
135
- with open(self.metadata_file, "a") as meta_f:
136
- for image in self.client.get_user_images(username, bbox=bbox):
137
- image_id = image["id"]
138
-
139
- # Skip if we already have this in our metadata file
140
- if image_id in seen_ids:
141
- continue
142
-
143
- seen_ids.add(image_id)
144
- processed += 1
145
-
146
- # Save new metadata
147
- meta_f.write(json.dumps(image) + "\n")
148
- meta_f.flush()
149
-
150
- # Skip if already downloaded
151
- if image_id in self.downloaded:
152
- skipped += 1
153
- continue
154
-
155
- # Download image
156
- image_url = image.get(quality_field)
157
- if not image_url:
158
- logger.warning(f"No {quality} URL for image {image_id}")
159
- continue
160
-
161
- # Use sequence ID for organization
162
- sequence_id = image.get("sequence")
163
- if sequence_id:
164
- img_dir = self.output_dir / sequence_id
165
- img_dir.mkdir(exist_ok=True)
166
- else:
167
- img_dir = self.output_dir
168
-
169
- output_path = img_dir / f"{image_id}.jpg"
170
-
171
- download_start = time.time()
172
- bytes_downloaded = self.client.download_image(image_url, output_path)
173
- if bytes_downloaded:
174
- download_time = time.time() - download_start
175
- download_times.append(download_time)
176
-
177
- # Write EXIF metadata to the downloaded image
178
- write_exif_to_image(output_path, image)
179
-
180
- # Convert to WebP if requested
181
- if convert_webp:
182
- webp_path = convert_to_webp(output_path)
183
- if webp_path:
184
- output_path = webp_path
185
-
186
- self.downloaded.add(image_id)
187
- downloaded_count += 1
188
- total_bytes += bytes_downloaded
189
-
190
- # Calculate progress
191
- progress_str = (
192
- f"Processed: {processed}, Downloaded: {downloaded_count} ({format_size(total_bytes)})"
193
- )
194
-
195
- logger.info(progress_str)
196
-
197
- # Save progress every 10 images
198
- if downloaded_count % 10 == 0:
199
- self._save_progress()
200
-
201
- self._save_progress()
202
- elapsed = time.time() - start_time
203
- logger.info(
204
- f"Complete! Processed {processed} images, downloaded {downloaded_count} ({format_size(total_bytes)}), skipped {skipped}"
205
- )
206
- logger.info(f"Total time: {format_time(elapsed)}")