hfdol 0.1.15__tar.gz

hfdol-0.1.15/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) [year] [fullname]
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
hfdol-0.1.15/PKG-INFO ADDED
@@ -0,0 +1,386 @@
+ Metadata-Version: 2.4
+ Name: hfdol
+ Version: 0.1.15
+ Summary: Simple Mapping interface to HuggingFace
+ Home-page: https://github.com/thorwhalen/hfdol
+ Author: Thor Whalen
+ License: mit
+ Keywords: datasets,data science,artificial intelligence,AI
+ Platform: any
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: datasets
+ Requires-Dist: huggingface
+ Provides-Extra: testing
+ Dynamic: license-file
+
+ # hfdol
+
+ Simple Mapping interface to HuggingFace.
+
+ (Note: this package was previously published as [hf](https://pypi.org/project/hf/0.0.14/), but that name was released to Hugging Face for their own tool.)
+
+ To install: ```pip install hfdol```
+
+ You'll also need a Hugging Face token. See [more about this here](https://huggingface.co/docs/huggingface_hub/en/quick-start).
+
+
+ ## Motivation
+
+ The Python packages [`datasets`](https://github.com/huggingface/datasets) and [`huggingface_hub`](https://github.com/huggingface/huggingface_hub) provide a remarkably clean, well-documented, and comprehensive API for accessing datasets, models, spaces, and papers hosted on [Hugging Face](https://huggingface.co).
+ Yet, as elegant as these APIs are, they remain *their own language*. Every library, no matter how intuitive, inevitably carries its own conventions, abstractions, and domain-specific semantics. When working with one or two APIs, this diversity is harmless, even stimulating. But when juggling dozens or hundreds of them, the cognitive overhead accumulates.
+
+ Despite their differences, most APIs share a small set of universal primitives: *retrieve something by key, list what's available, check existence, store, update, delete*.
+ In Python, these operations are embodied by the `Mapping` interface (and, for the write operations, `MutableMapping`), the conceptual model behind dictionaries. It's a minimal, ubiquitous, and instantly recognizable abstraction.
+
+ This package offers such a `Mapping`-based façade to Hugging Face datasets and models, allowing you to browse, query, and access them as if they were simple Python dictionaries. The goal isn't to replace the original API, but to provide a thin, ergonomic layer for the most common operations, so you can spend less time remembering syntax and more time working with data.
+
+ ## Examples
+
+ This package provides four ready-to-use singleton instances, each offering a dictionary-like interface to a different type of HuggingFace resource:
+
+ ```python
+ import hfdol
+ ```
+
+ ### Working with Datasets
+
+ The `hfdol.datasets` singleton provides a `Mapping` (i.e. read-only, dictionary-like) interface to HuggingFace datasets:
+
+ #### List Local Datasets
+
+ As with dictionaries, iterating over `hfdol.datasets` yields its keys.
+ The keys are the repository ids of the datasets you've already downloaded.
+ To see which datasets are cached locally:
+
+ ```python
+ list(hfdol.datasets)  # Lists locally cached datasets
+ # ['stingning/ultrachat', 'allenai/WildChat-1M', 'google-research-datasets/go_emotions']
+ ```
+
+ #### Access Local Datasets
+
+ The values of `hfdol.datasets` are `DatasetDict` instances
+ (from Hugging Face's `datasets` package) that give you access to the dataset.
+ If the dataset is already downloaded, it is loaded from the local cache;
+ otherwise it is downloaded first, then returned (and cached locally
+ for the next time you access it).
+
+ ```python
+ data = hfdol.datasets['stingning/ultrachat']  # Loads the dataset
+ print(data)  # Shows dataset information and structure
+ ```
+
+ #### Search for Remote Datasets
+
+ `hfdol.datasets` also offers search functionality, so you can search remote
+ repositories:
+
+ ```python
+ # Search for music-related datasets
+ search_results = hfdol.datasets.search('music', gated=False)
+ print(f"search_results is a {type(search_results).__name__}")  # It's a generator
+
+ # Get the first result (a `DatasetInfo` instance containing information about the dataset)
+ result = next(search_results)
+ print(f"Dataset ID: {result.id}")
+ print(f"Description: {result.description[:80]}...")
+
+ # Download and use it directly
+ data = hfdol.datasets[result]  # You can pass the DatasetInfo object directly
+ ```
+
+ Note that `gated=False` restricts the results to datasets that you have access to.
+ For more search options, see the [HuggingFace Hub documentation](https://huggingface.co/docs/huggingface_hub/package_reference/hf_api#huggingface_hub.HfApi.list_datasets).
+
+ #### A useful recipe: Get a table of result infos
+
+ You can use this to get a dataframe of the next `n` results from a results iterable (the first `n`, if nothing has been consumed yet):
+
+ ```py
+ def table_of_results(results, n=10):
+     import itertools, operator, pandas as pd
+
+     results_table = pd.DataFrame(  # make a table with
+         map(
+             operator.attrgetter('__dict__'),  # the attribute dicts
+             itertools.islice(results, n),  # ... of the next n search results
+         )
+     )
+     return results_table
+ ```
+
+ Example:
+
+ ```py
+ results_table = table_of_results(search_results)
+ results_table
+ ```
+
+        id                         author            sha                                       ...
+     0  Genius-Society/hoyoMusic   Genius-Society    4f7e5120c0e8e26213d4bb3b52bcce76e69dfce4  ...
+     1  Genius-Society/emo163      Genius-Society    6b8c3526b66940ddaedf15602d01083d24eb370c  ...
+     2  ccmusic-database/acapella  ccmusic-database  4cb8a4d4cb58cc55f30cb8c7a180fee1b5576dc5  ...
+     3  ccmusic-database/pianos    ccmusic-database  db2b3f74c4c989b4fbda4b309e6bc925bfd8f5d1  ...
+     ...
+
+
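Because search results are a generator, `itertools.islice` consumes items as it goes, so each call to a recipe like the one above returns the *next* `n` results rather than always the first `n`. The pandas-free sketch below illustrates this with stand-in objects; `fake_search` and `rows_of_results` are hypothetical names, not part of hfdol.

```python
import itertools
from types import SimpleNamespace


def fake_search():
    """Stand-in for a search generator; yields objects carrying attributes."""
    for i in range(5):
        yield SimpleNamespace(id=f"org/dataset-{i}", author="org")


def rows_of_results(results, n=10):
    """Return the attribute dicts of the next n items of the results iterable."""
    return [vars(r) for r in itertools.islice(results, n)]


results = fake_search()
first_page = rows_of_results(results, n=2)
second_page = rows_of_results(results, n=2)  # continues where the first call stopped

print([row["id"] for row in first_page])
print([row["id"] for row in second_page])
```

To page through the same results again, a new search generator has to be created.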
+ ### Working with Models
+
+ The `hfdol.models` singleton provides the same dictionary-like interface for models:
+
+ #### Search for Models
+
+ Find models by keywords:
+
+ ```python
+ model_search_results = hfdol.models.search('embeddings', gated=False)
+ model_result = next(model_search_results)
+ print(f"Model: {model_result.id}")
+ ```
+
+ #### Download Models
+
+ Get the local path to a model (downloads if not cached):
+
+ ```python
+ model_path = hfdol.models[model_result]
+ print(f"Model downloaded to: {model_path}")
+ ```
+
+ #### List Local Models
+
+ See what models you have cached:
+
+ ```python
+ list(hfdol.models)  # Lists all locally cached models
+ ```
+
+ ### Working with Spaces
+
+ The `hfdol.spaces` singleton provides access to HuggingFace Spaces (interactive ML demos and applications):
+
+ #### Search for Spaces
+
+ Find interesting Spaces by keywords:
+
+ ```python
+ space_search_results = hfdol.spaces.search('gradio', limit=5)
+ space_result = next(space_search_results)
+ print(f"Space: {space_result.id}")
+ ```
+
+ #### Access Space Information
+
+ Get detailed information about a Space:
+
+ ```python
+ space_info = hfdol.spaces[space_result]
+ print(f"Space info: {space_info}")
+ ```
+
+ #### List Local Spaces
+
+ See what spaces you have cached locally:
+
+ ```python
+ list(hfdol.spaces)  # Lists all locally cached spaces
+ ```
+
+ ### Working with Papers
+
+ The `hfdol.papers` singleton provides access to research papers hosted on HuggingFace:
+
+ #### Search for Papers
+
+ Find research papers by topic:
+
+ ```python
+ paper_search_results = hfdol.papers.search('transformer', limit=5)
+ paper_result = next(paper_search_results)
+ print(f"Paper: {paper_result.id}")
+ ```
+
+ #### Access Paper Information
+
+ Get detailed information about a paper:
+
+ ```python
+ paper_info = hfdol.papers[paper_result]
+ print(f"Paper title: {paper_info.title}")
+ print(f"Abstract: {paper_info.summary[:100]}...")
+ ```
+
+ Note: Papers are metadata objects only; they contain information about research papers but don't have downloadable files like datasets or models.
+
+ ### Getting Repository Sizes
+
+ You can check the size of any repository before downloading it with the `get_size` function. The `repo_type` parameter is required, to avoid ambiguity when the same repository id exists under more than one type:
+
+ ```python
+ from hfdol import get_size
+
+ # Get size of a dataset (specify repo_type explicitly)
+ dataset_size = get_size('ccmusic-database/music_genre', repo_type='dataset')
+ print(f"Dataset size: {dataset_size:.2f} GiB")
+
+ # Get size of a model
+ model_size = get_size('ccmusic-database/music_genre', repo_type='model')
+ print(f"Model size: {model_size:.2f} GiB")
+
+ # Using RepoType enum for type safety
+ from hfdol.base import RepoType
+ size_with_enum = get_size('some-repo', repo_type=RepoType.DATASET)
+
+ # Get size in different units (e.g., bytes)
+ size_in_bytes = get_size('some-repo', repo_type='dataset', unit_bytes=1)
+ ```
+
+ **Pro tip**: Use the singleton instances for automatic repo_type handling:
+ ```python
+ # These automatically know their repo_type
+ dataset_size = hfdol.datasets.get_size('ccmusic-database/music_genre')
+ model_size = hfdol.models.get_size('ccmusic-database/music_genre')
+ ```
+
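Judging from the examples above, `unit_bytes` appears to be the number of bytes per reported unit, with GiB as the default. That reading is an assumption, not something the README confirms. Under that assumption, the conversion could be sketched as follows; `size_in_units` is a hypothetical helper and the file sizes are made up.

```python
GIB = 2**30  # bytes per gibibyte


def size_in_units(file_sizes_in_bytes, unit_bytes=GIB):
    """Sum file sizes and express the total in units of `unit_bytes` (hypothetical)."""
    return sum(file_sizes_in_bytes) / unit_bytes


sizes = [3 * GIB, GIB // 2]  # made-up file sizes, in bytes
print(f"{size_in_units(sizes):.2f} GiB")
print(f"{size_in_units(sizes, unit_bytes=1):.0f} bytes")
```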
+ ### Unified Interface
+
+ The beauty of this approach is that whether you're working with datasets, models, spaces, or papers, the interface remains familiar and consistent, just like working with Python dictionaries. All four singleton instances support the same core operations:
+
+ - **Dictionary-style access**: `resource = hfdol.datasets[key]`, `model_path = hfdol.models[key]`
+ - **Local listing**: `list(hfdol.datasets)`, `list(hfdol.models)`
+ - **Remote searching**: `hfdol.datasets.search(query)`, `hfdol.models.search(query)`
+ - **Existence checking**: `key in hfdol.datasets`, `key in hfdol.models`
+
+ This unified interface means you can switch between different types of HuggingFace resources without learning new APIs: it's all just dictionaries! And since they're singleton instances, they're always ready to use without any setup.
+
+
+ ## Design & Architecture
+
+ ### Design Philosophy
+
+ This package is designed as a **thin façade** over the excellent [`huggingface_hub`](https://github.com/huggingface/huggingface_hub) and [`datasets`](https://github.com/huggingface/datasets) libraries. Rather than reinventing functionality, it provides a unified `Mapping` interface that wraps the most common operations, making them feel like native Python dictionary operations.
+
+ The design balances two sometimes-competing goals:
+ 1. **Simplicity**: Keep the codebase small, readable, and maintainable
+ 2. **Single Source of Truth (SSOT)**: Minimize hardcoded knowledge about the underlying APIs
+
+ Ideally, this interface would be *entirely* auto-generated through static analysis of the wrapped packages. While we achieve this partially, practical constraints require some manual intervention, but we've minimized it as much as possible.
+
+ ### Key Architectural Patterns
+
+ #### 1. Configuration-Driven Design (SSOT)
+
+ The `repo_type_helpers` dictionary serves as the **single source of truth** for all repo-type-specific behavior:
+
+ ```python
+ repo_type_helpers = dict(
+     dataset=dict(
+         loader_func=load_dataset,
+         search_func=list_datasets,
+     ),
+     model=dict(
+         loader_func=snapshot_download,
+         search_func=list_models,
+     ),
+     # ... etc
+ )
+ ```
+
+ This declarative approach means:
+ - Adding a new repo type requires only updating this configuration
+ - No duplication of logic across different repo types
+ - Clear visibility of how each type differs
+
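As a rough illustration of how one generic class can take all of its type-specific behavior from such a table, consider the sketch below. It is not hfdol's `HfMapping`; the class `ConfiguredMapping`, the `helpers` table, and the `fake_*` functions are hypothetical stand-ins for the real loaders and search functions.

```python
from collections.abc import Mapping


# Stand-ins for real loaders; in a real table these slots would hold
# functions such as load_dataset or snapshot_download.
def fake_load_dataset(key):
    return f"<dataset {key}>"


def fake_download_model(key):
    return f"/cache/models/{key}"


def fake_search(kind):
    """Build a stand-in search function for the given repo kind."""
    def search(filter):
        return (f"{kind}/{filter}-{i}" for i in range(3))
    return search


helpers = dict(
    dataset=dict(loader_func=fake_load_dataset, search_func=fake_search("dataset")),
    model=dict(loader_func=fake_download_model, search_func=fake_search("model")),
)


class ConfiguredMapping(Mapping):
    """Generic mapping whose type-specific behavior comes from the helpers table."""

    def __init__(self, repo_type, local_keys=()):
        self._helpers = helpers[repo_type]
        self._local_keys = list(local_keys)

    def __getitem__(self, key):
        return self._helpers["loader_func"](key)

    def __iter__(self):
        return iter(self._local_keys)

    def __len__(self):
        return len(self._local_keys)

    def search(self, filter):
        return self._helpers["search_func"](filter)


datasets = ConfiguredMapping("dataset", local_keys=["org/a"])
models = ConfiguredMapping("model")
print(datasets["org/a"])
print(models["org/m"])
print(list(models.search("bert")))
```

In this sketch, supporting another repo type means adding one entry to `helpers`; the class itself does not change.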
+ #### 2. Dynamic Signature Injection
+
+ Rather than manually replicating the signatures of wrapped functions (which would violate SSOT), we use **signature extraction and injection** via the `sign_kwargs_with` decorator:
+
+ ```python
+ @sign_kwargs_with(search_func)
+ def search(self, filter, **kwargs):
+     return self.search_func(filter=filter, **kwargs)
+ ```
+
+ This means:
+ - Each `.search()` method automatically inherits the correct signature from its underlying function
+ - IDEs and type checkers see the actual parameters available
+ - When HuggingFace updates their APIs, our signatures update automatically
+ - Documentation stays accurate without manual synchronization
+
+ **Note**: The `list_papers` function required special handling (`_list_papers` wrapper) because it uses `query` instead of `filter` as its parameter name. This is the type of pragmatic compromise we make: we normalize the interface rather than exposing the inconsistency.
+
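The general technique can be sketched with the standard library's `inspect` module. The decorator below is a simplified illustration of signature injection, not the package's `sign_kwargs_with`; the names `sign_kwargs_with_sketch` and `list_things` are hypothetical.

```python
import inspect


def sign_kwargs_with_sketch(source_func):
    """Replace **kwargs in the decorated function's signature with the
    keyword parameters of source_func (hypothetical, simplified)."""
    source_params = inspect.signature(source_func).parameters.values()

    def decorator(func):
        sig = inspect.signature(func)
        # Keep everything except the catch-all **kwargs.
        kept = [p for p in sig.parameters.values() if p.kind is not p.VAR_KEYWORD]
        kept_names = {p.name for p in kept}
        # Expose the source function's parameters as keyword-only arguments.
        injected = [
            p.replace(kind=p.KEYWORD_ONLY)
            for p in source_params
            if p.name not in kept_names
            and p.kind in (p.POSITIONAL_OR_KEYWORD, p.KEYWORD_ONLY)
        ]
        func.__signature__ = sig.replace(parameters=kept + injected)
        return func

    return decorator


def list_things(filter=None, *, author=None, limit=None):
    """Stand-in for a wrapped search function."""
    return [filter, author, limit]


@sign_kwargs_with_sketch(list_things)
def search(filter, **kwargs):
    return list_things(filter=filter, **kwargs)


print(inspect.signature(search))  # (filter, *, author=None, limit=None)
print(search("music", limit=2))
```

Only the advertised signature changes here: the function body still forwards `**kwargs`, so tools that read `__signature__` (such as `help()` and `inspect.signature`) see the wrapped parameters, while static type checkers may not.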
+ #### 3. Separation of Concerns
+
+ The architecture cleanly separates:
+
+ - **Configuration** (`repo_type_helpers`): What differs between types
+ - **Base functionality** (`HfMapping`): Shared behavior for all types
+ - **Type-specific classes** (`HfDatasets`, `HfModels`, etc.): Minimal subclasses that mainly provide:
+   - Clear, discoverable class names
+   - Type-specific documentation
+   - Future extensibility points
+ - **Convenience layer** (module-level singletons): Zero-setup access for users
+
+ #### 4. Module-Level Singletons
+
+ The pre-instantiated `datasets`, `models`, `spaces`, and `papers` instances follow Python's **convenience instance pattern** (seen in `sys.stdout`, `np.random`, etc.):
+
+ ```python
+ # Ready to use immediately
+ datasets = HfDatasets()
+ models = HfModels()
+ ```
+
+ This works because these instances:
+ - Have no mutable state
+ - Require no configuration for basic use
+ - Represent logical singletons ("the datasets mapping")
+
+ #### 5. Progressive Disclosure
+
+ The API supports multiple levels of sophistication:
+
+ ```python
+ # Simplest: Use pre-configured singletons
+ data = hfdol.datasets['some/dataset']
+
+ # Advanced: Create custom instances with configuration
+ my_datasets = HfDatasets()
+
+ # Power user: Parameterized mapping for dynamic repo types
+ custom = HfMapping(RepoType.DATASET)
+ ```
+
+ ### Design Compromises
+
+ Several compromises were made for pragmatism:
+
+ 1. **Manual wrappers**: `_list_papers` normalizes the papers API to match others
+ 2. **Enum + string hybrid**: `RepoType(str, Enum)` allows both type safety and string convenience
+ 3. **Explicit repo_type in get_size**: Required parameter to avoid ambiguity when repos exist as multiple types
+ 4. **Signature injection limitations**: Works well for keyword arguments but can't handle complex overloads
+
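The enum and string hybrid from item 2 can be illustrated with a small sketch. The class `RepoTypeSketch`, its member values, and the `normalize` helper are assumptions made for illustration; the package's own `RepoType` may define more members or different values.

```python
from enum import Enum


class RepoTypeSketch(str, Enum):
    """Members are real strings, so they compare equal to their values."""
    DATASET = "dataset"
    MODEL = "model"


def normalize(repo_type):
    """Accept an enum member or its string value; raise ValueError otherwise."""
    return RepoTypeSketch(repo_type)


print(normalize("dataset") is RepoTypeSketch.DATASET)  # string input
print(normalize(RepoTypeSketch.MODEL) is RepoTypeSketch.MODEL)  # enum input
print(RepoTypeSketch.MODEL == "model")  # members behave like plain strings
```

Callers can pass either form, and an unknown string such as `"modle"` fails early with a `ValueError` instead of reaching the underlying API.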
+ ### Contributing Guidelines
+
+ When contributing to this package, please maintain these principles:
+
+ **✅ DO:**
+ - Add configuration to `repo_type_helpers` rather than creating new methods
+ - Use signature extraction (`sign_kwargs_with`) when wrapping functions with many parameters
+ - Keep `HfMapping` generic and push specialization to configuration
+ - Document *why* special cases exist (like `_list_papers`)
+ - Test against actual HuggingFace APIs to catch signature drift
+
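One possible shape for a signature-drift check is sketched below, using only `inspect`. The helper `missing_parameters` and the stand-in `wrapped_search` are hypothetical; in a real test the stand-in would be replaced by the actual upstream function being wrapped.

```python
import inspect


def missing_parameters(func, expected_names):
    """Return the expected parameter names that func no longer accepts."""
    params = inspect.signature(func).parameters
    # A function taking **kwargs accepts any keyword, so nothing is missing.
    if any(p.kind is p.VAR_KEYWORD for p in params.values()):
        return set()
    return set(expected_names) - set(params)


def wrapped_search(filter=None, author=None, limit=None):
    """Stand-in for an upstream function whose signature may change."""


print(missing_parameters(wrapped_search, {"filter", "limit"}))
print(missing_parameters(wrapped_search, {"filter", "query"}))
```

A test can then assert that the returned set is empty for every parameter name the wrapper relies on.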
+ **❌ AVOID:**
+ - Duplicating knowledge about wrapped APIs
+ - Hardcoding parameter lists or types that could be extracted
+ - Adding stateful behavior to mapping instances
+ - Creating wrapper methods that simply pass through to underlying functions
+
+ **When in doubt:**
+ - Ask "Could this be driven by configuration?"
+ - Prefer declarative patterns over imperative logic
+ - Keep the codebase small and the configuration visible
+
+ The goal is a package where 80% of the code is just wiring and configuration, and the HuggingFace packages do the actual work. This maximizes maintainability and minimizes drift as those packages evolve.