PyPI - emb-diversity - Versions diffs - 0.0.3__tar.gz - Mend

emb-diversity 0.0.3__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (66)

emb_diversity-0.0.3/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 Cantao Su, Anna Wegmann, Menan Velayuthan, Dong Nguyen, Esther Ploeger
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

emb_diversity-0.0.3/PKG-INFO ADDED

@@ -0,0 +1,386 @@
+Metadata-Version: 2.4
+Name: emb-diversity
+Version: 0.0.3
+Summary: A package for measuring diversity in text and vector data
+Author-email: Cantao Su <c.su@uu.nl>, Menan Velayuthan <m.velayuthan@uu.nl>, Esther Ploeger <e.ploeger@uu.nl>, Dong Nguyen <d.p.nguyen@uu.nl>, Anna Wegmann <a.m.wegmann@uu.nl>
+License-Expression: MIT
+Project-URL: Homepage, https://github.com/nlpsoc/Diversity-Measurement
+Project-URL: Documentation, https://nlpsoc.github.io/Diversity-Measurement/
+Project-URL: Repository, https://github.com/nlpsoc/Diversity-Measurement
+Project-URL: Bug Tracker, https://github.com/nlpsoc/Diversity-Measurement/issues
+Keywords: diversity,nlp,embeddings,emb-diversity,metrics
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Science/Research
+Classifier: Intended Audience :: Developers
+Classifier: Topic :: Scientific/Engineering
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Topic :: Text Processing :: Linguistic
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3.14
+Classifier: Operating System :: OS Independent
+Classifier: Natural Language :: English
+Requires-Python: >=3.11
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: datasets~=3.3.2
+Requires-Dist: matplotlib~=3.10.0
+Requires-Dist: networkx>=2.8.8
+Requires-Dist: numpy~=2.3.5
+Requires-Dist: pandas~=2.3.1
+Requires-Dist: scikit-learn~=1.7.1
+Requires-Dist: scipy~=1.16.0
+Requires-Dist: sentence-transformers>=2.7.0
+Requires-Dist: typer>=0.9.0
+Requires-Dist: umap-learn>=0.5.9
+Requires-Dist: vendi-score>=0.0.3
+Dynamic: license-file
+# emb-diversity
+<!-- docs-intro-start -->
+A Python package for measuring data diversity on small- to medium-sized text datasets. All measures compute diversity from embeddings, i.e., vector representations of your data. Depending on the embedding model you choose, you can measure semantic, stylistic, and other types of diversity with our package.
+This library is developed as part of the [DataDivers](https://datadivers-erc.github.io/) project.
+<!-- docs-intro-end -->
+📖 **Documentation:** <https://nlpsoc.github.io/Diversity-Measurement/>
+## Install
+<!-- docs-install-start -->
+Install the latest release from PyPI:
+```bash
+pip install emb-diversity
+```
+The first time you measure diversity, the default embedding model
+(`all-mpnet-base-v2`, ~420 MB) is downloaded from the Hugging Face Hub and
+cached locally, so later runs are fast and work offline.
+<!-- docs-install-end -->
+## Usage
+<!-- docs-quickstart-start -->
+Measuring the diversity of a dataset with our package is easy:
+```python
+from emb_diversity import measure_diversity
+# more style-diverse, more topic-uniform (music)
+texts_a = [
+    "I thoroughly enjoy the hair bands.",
+    "songs of the 80's are the best.",
+    "Hip Hop is going DOWNHILL!!!!!",
+    "rock music just makes me feel good",
+    "The 80's rocked!That generation had the best music!"
+]
+# Uses the default measures and semantic embeddings
+print(measure_diversity(texts_a))
+# -> {'graph_entropy': 6.86..., 'vendi_score': 4.12..., 'mean_pw_dist': 0.69...}
+```
+Note that the diversity value of a dataset is usually only meaningful in comparison with another dataset. In isolation, diversity values are hard to interpret: they are unbounded, sensitive to dataset size, and sensitive to the embedding space used. Let's add another corpus.
+```python
+# more style-uniform (formal), more topic-diverse
+texts_b = [
+    "I thoroughly enjoy the hair bands.",
+    "They have not caused any harm to me.",
+    "He has a very distinct walk.",
+    "It depends on what they will pay.",
+    "I would go out with the son of a preacher.",
+]
+print(measure_diversity(texts_a))
+# -> {'graph_entropy': 6.86..., 'vendi_score': 4.12..., 'mean_pw_dist': 0.69...}
+print(measure_diversity(texts_b))
+# -> {'graph_entropy': 6.91..., 'vendi_score': 4.93..., 'mean_pw_dist': 0.98...}
+```
+When a measure considers a dataset more diverse, it assigns it a higher diversity value. Here, all three default measures agree that `texts_b` is more diverse than `texts_a`. This can change when we change which diversity "axis" is considered, for example, "style" instead of "semantic".
+```python
+# Use a different diversity axis; for style diversity, AnnaWegmann/style-embeddings is the default model
+print(measure_diversity(texts_a, diversity_axis="style"))
+# -> {'graph_entropy': 6.69..., 'vendi_score': 4.17..., 'mean_pw_dist': 0.93...}
+print(measure_diversity(texts_b, diversity_axis="style"))
+# -> {'graph_entropy': 6.32..., 'vendi_score': 2.24..., 'mean_pw_dist': 0.32...}
+```
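As a rough illustration of why a higher value signals more diversity, the sketch below computes a mean pairwise cosine distance over toy 2-D vectors. It assumes that a measure such as `mean_pw_dist` averages cosine distances over all pairs of embeddings, which the README does not confirm. `cosine_distance` and `mean_pairwise_distance` are hypothetical helpers, not part of the package API.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(vectors):
    """Average the cosine distance over all unordered pairs (assumed definition)."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy stand-ins for embeddings: one tight cluster, one spread-out set.
clustered = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(mean_pairwise_distance(clustered))
print(mean_pairwise_distance(spread))
```

The spread-out set scores higher than the clustered one, which is the same direction of comparison as in the `texts_a` and `texts_b` examples.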
+You can also specify a different embedding model with a Hugging Face identifier, for example, a model trained for Dutch. Be careful to use models that were trained for the diversity axis you are interested in; otherwise, you might get inconsistent results!
+```python
+# Use a specific embedding model (here a Dutch BERT model)
+print(measure_diversity(texts_a, embedding_model="GroNLP/bert-base-dutch-cased"))
+# -> {'graph_entropy': 6.61..., 'vendi_score': 1.89..., 'mean_pw_dist': 0.20...}
+print(measure_diversity(texts_b, embedding_model="GroNLP/bert-base-dutch-cased"))
+# -> {'graph_entropy': 6.80..., 'vendi_score': 1.52..., 'mean_pw_dist': 0.11...}
+```
+You can also run specific measures; see the overview at https://nlpsoc.github.io/Diversity-Measurement/user-guide/measures.html. Use them with caution: some measures may suit your use case worse than others. We recommend testing whether your chosen measure and embedding space capture the diversity axis you are interested in.
+```python
+# Run specific measures
+print(measure_diversity(texts_a, measure=["diameter", "log_determinant"]))
+# -> {'diameter': 0.94..., 'log_determinant': -0.93...}
+print(measure_diversity(texts_b, measure=["diameter", "log_determinant"]))
+# -> {'diameter': 1.0..., 'log_determinant': -0.06...}
+```
+Note that most measures return unbounded values that cannot be compared across datasets of different sizes. Happy diversity measuring!
+<!-- docs-quickstart-end -->
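Because most values depend on dataset size, one common workaround is to compare equally sized random subsamples. The sketch below shows one way you might prepare such subsamples yourself; nothing in the README says the package does this for you, and `subsample_to_common_size` is a hypothetical helper.

```python
import random


def subsample_to_common_size(datasets, seed=0):
    """Return one random subsample per dataset, each as large as the smallest dataset."""
    if not datasets or any(len(d) == 0 for d in datasets):
        raise ValueError("need at least one dataset and no empty datasets")
    rng = random.Random(seed)  # fixed seed keeps the comparison reproducible
    target = min(len(d) for d in datasets)
    return [rng.sample(list(d), target) for d in datasets]


small = ["a", "b", "c"]
large = ["t1", "t2", "t3", "t4", "t5", "t6"]

small_sub, large_sub = subsample_to_common_size([small, large])
print(len(small_sub), len(large_sub))  # 3 3
```

Each subsample could then be passed to `measure_diversity`. Repeating the draw with several seeds and averaging the results reduces the noise of a single subsample.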
+## Table of Contents
+- [Install](#install)
+- [Usage](#usage)
+- [Development](#development)
+  - [Development setup](#development-setup)
+  - [Suggested Workflow for Collaboration](#suggested-workflow-for-collaboration)
+  - [Working with uv](#working-with-uv)
+  - [Docstring Style Guide](#docstring-style-guide)
+  - [Adding New Measures](#adding-new-measures)
+  - [Adding New Diversity Axes](#adding-new-diversity-axes)
+  - [Building and publishing a release](#building-and-publishing-a-release)
+- [Funding](#funding)
+- [Citation](#citation)
+## Development
+### Development setup
+To work on `emb-diversity` itself, install from a clone with
+[`uv`](https://docs.astral.sh/uv/getting-started/installation/):
+```bash
+git clone https://github.com/nlpsoc/Diversity-Measurement.git
+cd Diversity-Measurement
+uv sync --group dev          # runtime + dev tools (pytest, docs, ...)
+source .venv/bin/activate
+```
+Use `uv sync --no-group dev` to install only the runtime dependencies.
+### Suggested Workflow for Collaboration
+1. **Create a new branch** for your feature or bug fix:
+   ```bash
+   git checkout -b feature/my-feature
+   ```
+2. **Make your changes** in the codebase.
+3. **Run tests** to ensure everything works as expected:
+   ```bash
+   pytest
+   ```
+4. **Commit your changes** with a descriptive message:
+   ```bash
+   git add .
+   git commit -m "Add feature X"
+   ```
+5. **Push your branch** to the remote repository:
+   ```bash
+   git push origin feature/my-feature
+   ```
+6. **Create a pull request** on GitHub to merge your changes into the main branch and request a review from your team members.
+7. **Address any feedback** from the review process.
+8. Once approved, **merge your pull request** into the main branch.
+9. **Delete your branch** after merging to keep the repository clean:
+   ```bash
+   git branch -d feature/my-feature
+   git push origin --delete feature/my-feature
+   ```
+### Working with uv
+#### Adding Packages with `uv add`
+To add packages to your project, always use `uv add` rather than `uv pip install`. This ensures that your dependencies are properly managed and recorded in your `pyproject.toml`. For example:
+```bash
+uv add <package-name>
+```
+#### Adding Packages to a Dev Group
+If you need to add a package specifically to your development environment, you can add it to the `dev` group like this:
+```bash
+uv add --group dev <package-name>
+```
+#### Switching Between Dev and Standard Mode
+After you are done with testing and want to go back to standard mode, run:
+```bash
+uv sync --no-group dev
+```
+This excludes the `dev` group and keeps only your main project dependencies installed.
+#### Best Practice: Run `uv lock -U`
+Whenever you upgrade, downgrade, or otherwise change package versions, it's good practice to run:
+```bash
+uv lock -U
+```
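+This re-resolves your dependencies, allowing each package to move to the newest version that your `pyproject.toml` constraints permit, and rewrites `uv.lock` so the locked versions stay consistent.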
+### Docstring Style Guide
+This project uses **Google-style docstrings** which are automatically parsed by the Sphinx Napoleon extension.
+#### Functions and Methods
+```python
+import numpy as np
+def calculate_diversity(vectors: np.ndarray, method: str = "vendi") -> float:
+    """Calculate diversity score for a set of vectors.
+    This function computes various diversity metrics for vector representations.
+    The default method uses the Vendi Score which is based on matrix entropy.
+    References:
+        Cox, Samuel Rhys, Yunlong Wang, Ashraf Abdul, Christian von der Weth, and Brian Y. Lim. "Directed Diversity: Leveraging Language Embedding Distances for Collective Creativity in Crowd Ideation." Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, May 6, 2021, 1–35. https://doi.org/10.1145/3411764.3445782.
+    Args:
+        vectors: Array of shape (n_samples, n_features) containing the vectors.
+        method: Diversity calculation method. Options are "vendi", "entropy",
+            or "distinctness". Defaults to "vendi".
+    Returns:
+        Diversity score as a float between 0 and 1, where higher values
+        indicate greater diversity.
+    Raises:
+        ValueError: If vectors array is empty or method is not recognized.
+    Example:
+        >>> vectors = np.array([[1, 0], [0, 1], [1, 1]])
+        >>> score = calculate_diversity(vectors)
+        >>> print(f"Diversity: {score:.2f}")
+        Diversity: 0.87
+    """
+    pass
+```
+#### Key Points
+- **One-line summary**: Start with a brief summary in imperative mood ("Calculate", not "Calculates")
+- **Blank line**: After the summary, add a blank line before any detailed description
+- **References**: Add related papers
+- **Args**: Document each parameter with type information
+- **Returns**: Describe what the function returns
+- **Raises**: Document exceptions that might be raised
+- **Example**: Include usage examples when helpful
+- **Type hints**: Use type hints in function signatures AND document them in docstrings
+#### Section Headers
+Use these section headers in docstrings:
+- `References:` Related papers
+- `Args:` — Function/method parameters
+- `Returns:` — Return value description
+- `Raises:` — Exceptions that may be raised
+- `Yields:` — For generators
+- `Attributes:` — For class attributes
+- `Example:` or `Examples:` — Usage examples
+- `Note:` — Important notes
+- `Warning:` — Warnings about usage
+Further reading: [Google Style Guide](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings) · [Sphinx Napoleon docs](https://www.sphinx-doc.org/en/master/usage/extensions/napoleon.html)
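The function example above does not exercise the `Yields:` header from the list. As a small complement, here is a sketch of a Google-style docstring on a generator. `iter_batches` is a hypothetical helper invented for illustration and is not part of `emb-diversity`.

```python
from collections.abc import Iterator, Sequence


def iter_batches(texts: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Split texts into consecutive batches.

    Args:
        texts: Texts to split, in the order they should be yielded.
        batch_size: Maximum number of texts per batch. Must be positive.

    Yields:
        Lists of at most ``batch_size`` texts, preserving input order.

    Raises:
        ValueError: If ``batch_size`` is not positive.

    Example:
        >>> list(iter_batches(["a", "b", "c"], batch_size=2))
        [['a', 'b'], ['c']]
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])
```

Because this is a generator, the `ValueError` surfaces when iteration starts rather than when the function is called, which can be worth a sentence in the docstring if callers rely on early validation.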
+### Adding New Measures
+When you add a new measure to `src/emb_diversity/measures/`:
+1. Create a new file with the function decorated with `@accepts_text` and a complete docstring following the style guide above.
+2. Export it from `src/emb_diversity/__init__.py` if it should be part of the public API.
+3. Register it in `src/emb_diversity/measures_registry.py` with `measures.register("name", func)`.
+4. **Update `docs/source/user-guide/measures.md`** — add a row for the new measure in the appropriate table.
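Step 3 relies on a name-to-function registry. The README does not show how `measures` is implemented, so the following is only a minimal sketch of the general pattern. `MeasureRegistry`, `get`, and `n_unique` are hypothetical names; only the `register("name", func)` call shape is taken from the steps above.

```python
from typing import Callable, Sequence

Measure = Callable[[Sequence[Sequence[float]]], float]


class MeasureRegistry:
    """Map measure names to the functions that compute them (illustrative only)."""

    def __init__(self) -> None:
        self._measures: dict[str, Measure] = {}

    def register(self, name: str, func: Measure) -> None:
        # Refuse silent overwrites so that two measures cannot share a name.
        if name in self._measures:
            raise ValueError(f"measure {name!r} is already registered")
        self._measures[name] = func

    def get(self, name: str) -> Measure:
        try:
            return self._measures[name]
        except KeyError:
            raise KeyError(f"unknown measure {name!r}") from None


def n_unique(vectors: Sequence[Sequence[float]]) -> float:
    """Toy measure: count the distinct vectors."""
    return float(len({tuple(v) for v in vectors}))


measures = MeasureRegistry()
measures.register("n_unique", n_unique)
print(measures.get("n_unique")([[1, 0], [1, 0], [0, 1]]))  # 2.0
```

Looking measures up by name is what lets a call like `measure=["diameter", "log_determinant"]` select functions from plain strings.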
+### Adding New Diversity Axes
+Register a new axis in `src/emb_diversity/axes_registry.py`:
+```python
+from emb_diversity.axes_registry import DiversityAxis, axes
+axes.register(
+    "multilingual",
+    DiversityAxis(
+        name="multilingual",
+        default_model="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
+        description="Cross-lingual semantic diversity",
+    ),
+)
+```
+Update `docs/source/user-guide/axes.md` with the new axis.
+### Building and publishing a release
+Releases are published by CI via PyPI Trusted Publishing (no API token is
+stored), in two stages — a TestPyPI dry run, then production PyPI:
+1. **Bump `version`** in `pyproject.toml`, commit, and merge to `main`. A version
+   number can be uploaded only once, so every release needs a new number — you
+   cannot re-publish or overwrite an existing version.
+2. **Tag and push → TestPyPI.** Pushing a `v*` tag triggers `publish-testpypi.yml`
+   (it checks the tag matches the `pyproject.toml` version):
+   ```bash
+   git tag v0.0.1          # must match the version in pyproject.toml
+   git push origin v0.0.1
+   ```
+   Verify the result at <https://test.pypi.org/project/emb-diversity/>.
+3. **Create a GitHub Release → PyPI.** When the TestPyPI run looks good, create a
+   GitHub Release for the tag. That triggers `publish-pypi.yml`, which uploads to
+   real PyPI (<https://pypi.org/project/emb-diversity/>). Create the release either:
+   - **on GitHub:** go to the repository's **Releases** page (right-hand sidebar of
+     the repo, or `.../releases`) → **Draft a new release** → under *Choose a tag*
+     pick the existing tag (e.g. `v0.0.1`) → add a title and notes → **Publish
+     release**; or
+   - **with the GitHub CLI:**
+     ```bash
+     gh release create v0.0.1 --title "v0.0.1" --notes "First release"
+     ```
+   Publishing the release (not just drafting it) is what triggers the workflow.
+To build and validate **locally** before tagging (optional):
+```bash
+rm -rf dist              # clear artifacts from previous versions first
+uv build                 # -> dist/emb_diversity-<version>.{tar.gz,whl}
+uvx twine check dist/*   # validate metadata + that the README renders on PyPI
+```
+`uv build` only *adds* to `dist/`, so clear it first when building a new version —
+otherwise old artifacts linger and an upload would try (and fail) to re-publish
+them. CI doesn't need this: each run starts from a clean checkout.
+## Funding
+This work is supported by the ERC Starting Grant **DataDivers** (101162980).
+## Citation
+<!-- docs-citation-start -->
+There is no paper yet, so if you use `emb-diversity` in your work, please cite
+the software:
+```bibtex
+@misc{emb_diversity,
+  author = {Su, Cantao and Velayuthan, Menan and Ploeger, Esther and Nguyen, Dong and Wegmann, Anna},
+  title  = {emb-diversity},
+  url    = {https://github.com/nlpsoc/Diversity-Measurement},
+}
+```
+<!-- docs-citation-end -->