Microns-DataCleaner 0.1.5__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- microns_datacleaner-0.1.5/LICENSE.txt +21 -0
- microns_datacleaner-0.1.5/PKG-INFO +140 -0
- microns_datacleaner-0.1.5/README.md +118 -0
- microns_datacleaner-0.1.5/pyproject.toml +30 -0
- microns_datacleaner-0.1.5/src/microns_datacleaner/__init__.py +9 -0
- microns_datacleaner-0.1.5/src/microns_datacleaner/downloader.py +278 -0
- microns_datacleaner-0.1.5/src/microns_datacleaner/filters/__init__.py +17 -0
- microns_datacleaner-0.1.5/src/microns_datacleaner/filters/filters.py +269 -0
- microns_datacleaner-0.1.5/src/microns_datacleaner/mic_datacleaner.py +474 -0
- microns_datacleaner-0.1.5/src/microns_datacleaner/processing.py +478 -0
- microns_datacleaner-0.1.5/src/microns_datacleaner/remapper/__init__.py +37 -0
- microns_datacleaner-0.1.5/src/microns_datacleaner/remapper/remapper.py +107 -0
- microns_datacleaner-0.1.5/src/microns_datacleaner/ruff.toml +6 -0
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025 Milano Microns Colab
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,140 @@
+Metadata-Version: 2.3
+Name: Microns-DataCleaner
+Version: 0.1.5
+Summary: Provides common functions to download and process data from the MICrONS mm3 dataset.
+Keywords: neuroscience
+Author: Victor Buendia
+Author-email: victor.buendia@unibocconi.it
+Maintainer: All contributors
+Requires-Python: >=3.9
+Classifier: Development Status :: 2 - Pre-Alpha
+Classifier: Intended Audience :: Science/Research
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Requires-Dist: caveclient
+Requires-Dist: numpy
+Requires-Dist: pandas
+Requires-Dist: scipy
+Requires-Dist: standard_transform
+Requires-Dist: tqdm
+Description-Content-Type: text/markdown
+
+# MICrONS-datacleaner
+
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+
+This project contains tools to work with the [MICrONS Cortical MM3 dataset](https://www.microns-explorer.org/cortical-mm3), providing a **robust interface** to interact with the nucleus data.
+
+## Key features
+
+- **Simple interface** to download and keep organized anatomical data via CAVEclient.
+- **Query the synapse table in chunks**, avoiding common pitfalls.
+- **Easily process nucleus annotation tables**.
+- **Automatically segment** the brain volume into cortical **layers**.
+- **Tools for filtering** and constructing connectome subsets.
+- Basic interface to add functional properties, including **tuning curves and selectivity**.
+
+## Install 📥
+
+```bash
+pip install microns-datacleaner
+```
+
+## Using the package ⏩
+
+- **A few lines of code** to get a full table with each neuron's brain area, layer, cell type, proofreading information, and nucleus position:
+
+```python
+# Import the lib
+import microns_datacleaner as mic
+
+# Target version and download folder
+cleaner = mic.MicronsDataCleaner(datadir="data", version=1300)
+
+# Download the data
+cleaner.download_nucleus_data()
+
+# Process the downloaded data and segment into layers
+units, segments = cleaner.process_nucleus_data()
+```
+
+- **Filter easily!** How can we get all neurons in V1, layers L2/3 and L4, with proofread axons?
+
+```python
+from microns_datacleaner import filters as fl
+
+units_filter = fl.filter_neurons(units, layer=['L2/3', 'L4'], proofread='ax_clean', brain_area='V1')
+```
+
+- **Robustly download synapses** between a subset of pre- and postsynaptic neurons in chunks.
+
+```python
+preids = units_filter['pt_root_id']
+postids = units_filter['pt_root_id']
+cleaner.download_synapse_data(preids, postids)
+
+# Connection problems at chunk number 23? Just restart from there
+cleaner.download_synapse_data(preids, postids, start_index=23)
+```
+
+Check the docs and our tutorial notebook below to get started!
+
+
+## Docs & Tutorials 📜
+
+If this is your first time working with the MICrONS data, we recommend reading our basic tutorial (also available as a Python notebook), as well as the official documentation of the MICrONS project.
+
+If you want to contribute, please read our guidelines first. Feel free to open an issue if you find any problem.
+
+You can find full documentation of the API and functions in the [docs](https://margheritapremi.github.io/MICrONS-datacleaner).
+
+
+## Requirements
+
+### Dependencies
+
+- CAVEclient
+- Pandas
+- Numpy
+- Scipy
+- TQDM
+- standard_transform, for coordinate changes (MICrONS ecosystem)
+
+### Dev-dependencies
+
+- pdoc (to generate the docs)
+- ruff (to keep contributions in a consistent format)
+
+
+## Citation Policy 📚
+
+If you use our code, **please consider citing the associated repository,** as well as the [IARPA MICrONS Minnie Project](https://doi.org/10.60533/BOSS-2021-T0SY) and the [Microns Phase 3 NDA](https://github.com/cajal/microns_phase3_nda) repository.
+
+Our code serves as an interface for the MICrONS data. Please cite the appropriate literature for the data used, following their [Citation Policy](https://www.microns-explorer.org/citation-policy). The relevant papers may depend on the annotation tables used.
+
+Our unit table is constructed by integrating information from the following papers:
+
+1. [Functional connectomics spanning multiple areas of mouse visual cortex](https://doi.org/10.1038/s41586-025-08790-w). The MICrONS Consortium. 2025
+2. [Foundation model of neural activity predicts response to new stimulus types](https://doi.org/10.1038/s41586-025-08829-y)
+3. [Perisomatic ultrastructure efficiently classifies cells in mouse cortex](http://doi.org/10.1038/s41586-024-07765-7)
+4. [NEURD offers automated proofreading and feature extraction for connectomics](https://doi.org/)
+5. [CAVE: Connectome Annotation Versioning Engine](https://doi.org/10.1038/s41592-024-02426-z)
+
+## Acknowledgements
+
+We acknowledge funding by the NextGenerationEU, in the framework of the FAIR—Future Artificial Intelligence Research project (FAIR PE00000013—CUP B43C22000800006).
+
+
+## Generating the docs
+
+Go to the main folder of the repository and run
+
+```
+pdoc -t docs/template source/mic_datacleaner.py -o docs/html
+```
+
+The docs will be generated in the `docs/html` folder in HTML format, which can be viewed in a browser. If you need the docs for all the files, and not only the main class, use `source/*.py` instead of `source/mic_datacleaner.py` above.
@@ -0,0 +1,118 @@
+# MICrONS-datacleaner
+
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+
+This project contains tools to work with the [MICrONS Cortical MM3 dataset](https://www.microns-explorer.org/cortical-mm3), providing a **robust interface** to interact with the nucleus data.
+
+## Key features
+
+- **Simple interface** to download and keep organized anatomical data via CAVEclient.
+- **Query the synapse table in chunks**, avoiding common pitfalls.
+- **Easily process nucleus annotation tables**.
+- **Automatically segment** the brain volume into cortical **layers**.
+- **Tools for filtering** and constructing connectome subsets.
+- Basic interface to add functional properties, including **tuning curves and selectivity**.
+
+## Install 📥
+
+```bash
+pip install microns-datacleaner
+```
+
+## Using the package ⏩
+
+- **A few lines of code** to get a full table with each neuron's brain area, layer, cell type, proofreading information, and nucleus position:
+
+```python
+# Import the lib
+import microns_datacleaner as mic
+
+# Target version and download folder
+cleaner = mic.MicronsDataCleaner(datadir="data", version=1300)
+
+# Download the data
+cleaner.download_nucleus_data()
+
+# Process the downloaded data and segment into layers
+units, segments = cleaner.process_nucleus_data()
+```
+
+- **Filter easily!** How can we get all neurons in V1, layers L2/3 and L4, with proofread axons?
+
+```python
+from microns_datacleaner import filters as fl
+
+units_filter = fl.filter_neurons(units, layer=['L2/3', 'L4'], proofread='ax_clean', brain_area='V1')
+```
+
+- **Robustly download synapses** between a subset of pre- and postsynaptic neurons in chunks.
+
+```python
+preids = units_filter['pt_root_id']
+postids = units_filter['pt_root_id']
+cleaner.download_synapse_data(preids, postids)
+
+# Connection problems at chunk number 23? Just restart from there
+cleaner.download_synapse_data(preids, postids, start_index=23)
+```
+
+Check the docs and our tutorial notebook below to get started!
+
+
+## Docs & Tutorials 📜
+
+If this is your first time working with the MICrONS data, we recommend reading our basic tutorial (also available as a Python notebook), as well as the official documentation of the MICrONS project.
+
+If you want to contribute, please read our guidelines first. Feel free to open an issue if you find any problem.
+
+You can find full documentation of the API and functions in the [docs](https://margheritapremi.github.io/MICrONS-datacleaner).
+
+
+## Requirements
+
+### Dependencies
+
+- CAVEclient
+- Pandas
+- Numpy
+- Scipy
+- TQDM
+- standard_transform, for coordinate changes (MICrONS ecosystem)
+
+### Dev-dependencies
+
+- pdoc (to generate the docs)
+- ruff (to keep contributions in a consistent format)
+
+
+## Citation Policy 📚
+
+If you use our code, **please consider citing the associated repository,** as well as the [IARPA MICrONS Minnie Project](https://doi.org/10.60533/BOSS-2021-T0SY) and the [Microns Phase 3 NDA](https://github.com/cajal/microns_phase3_nda) repository.
+
+Our code serves as an interface for the MICrONS data. Please cite the appropriate literature for the data used, following their [Citation Policy](https://www.microns-explorer.org/citation-policy). The relevant papers may depend on the annotation tables used.
+
+Our unit table is constructed by integrating information from the following papers:
+
+1. [Functional connectomics spanning multiple areas of mouse visual cortex](https://doi.org/10.1038/s41586-025-08790-w). The MICrONS Consortium. 2025
+2. [Foundation model of neural activity predicts response to new stimulus types](https://doi.org/10.1038/s41586-025-08829-y)
+3. [Perisomatic ultrastructure efficiently classifies cells in mouse cortex](http://doi.org/10.1038/s41586-024-07765-7)
+4. [NEURD offers automated proofreading and feature extraction for connectomics](https://doi.org/)
+5. [CAVE: Connectome Annotation Versioning Engine](https://doi.org/10.1038/s41592-024-02426-z)
+
+## Acknowledgements
+
+We acknowledge funding by the NextGenerationEU, in the framework of the FAIR—Future Artificial Intelligence Research project (FAIR PE00000013—CUP B43C22000800006).
+
+
+## Generating the docs
+
+Go to the main folder of the repository and run
+
+```
+pdoc -t docs/template source/mic_datacleaner.py -o docs/html
+```
+
+The docs will be generated in the `docs/html` folder in HTML format, which can be viewed in a browser. If you need the docs for all the files, and not only the main class, use `source/*.py` instead of `source/mic_datacleaner.py` above.
@@ -0,0 +1,30 @@
+[build-system]
+requires = ["poetry-core>=1.0.0"]
+build-backend = "poetry.core.masonry.api"
+
+
+[project]
+name = "Microns-DataCleaner"
+version = "0.1.5"
+description = "Provides common functions to download and process data from the MICrONS mm3 dataset."
+keywords = ["neuroscience"]
+classifiers = ["Development Status :: 2 - Pre-Alpha",
+    "Intended Audience :: Science/Research",
+    "Operating System :: OS Independent",
+    "Programming Language :: Python :: 3",
+]
+requires-python = ">=3.9"
+packages = [{include = ""}]
+dynamic = ["version", "readme"]
+authors = [{name = "Victor Buendia", email = "victor.buendia@unibocconi.it"}, {name = "Margherita Premi", email = "margherita.premi@gmail.com"}]
+maintainers = [{name = "All contributors"}]
+readme = "README.md"
+
+dependencies = [
+    "numpy",
+    "standard_transform",
+    "pandas",
+    "caveclient",
+    "tqdm",
+    "scipy"
+]
@@ -0,0 +1,278 @@
+import pandas as pd
+import time
+import requests
+import os
+import logging
+from tqdm import tqdm
+
+def download_tables(client, path2download, tables2download):
+    """
+    Download all the indicated tables for further processing.
+
+    Parameters:
+    -----------
+    client: caveclient.CAVEclient
+        The CAVEclient instance used to connect to and download from the data service.
+    path2download: str
+        The local file path to the directory where the downloaded tables will be saved as CSV files.
+    tables2download: list[str]
+        A list containing the names of the tables to be downloaded.
+
+    Returns:
+    -------
+    None.
+        This function does not return any value. It saves the downloaded tables as files in the
+        specified directory.
+    """
+
+    logging.info(f"Starting download of nucleus data to {path2download}.")
+    # Ensure directory exists
+    os.makedirs(path2download, exist_ok=True)
+
+    # Download all the tables in the list
+    for table in tqdm(tables2download, "Downloading nucleus tables..."):
+        try:
+            auxtable = client.materialize.query_table(table, split_positions=True)
+            auxtable = pd.DataFrame(auxtable)
+            auxtable.to_csv(f'{path2download}/{table}.csv', index=False)
+        except Exception as e:
+            raise RuntimeError(f'Error downloading table {table}: {e}')
+
+
+def connectome_constructor(
+    client, presynaptic_set, postsynaptic_set, savefolder, neurs_per_steps=500, start_index=0, max_retries=10, delay=5, drop_synapses_duplicates=True
+):
+    """
+    Constructs a connectome subset for specified pre- and postsynaptic neurons.
+
+    This function queries the MICrONS connectomics database to extract synaptic
+    connections between a defined set of presynaptic and postsynaptic neurons.
+
+    Parameters:
+    -----------
+    client: caveclient.CAVEclient
+        The CAVEclient instance used to access MICrONS connectomics data.
+    presynaptic_set: numpy.ndarray
+        A 1D NumPy array of unique `root_ids` for the presynaptic neurons.
+    postsynaptic_set: numpy.ndarray
+        A 1D NumPy array of unique `root_ids` for the postsynaptic neurons.
+    savefolder: str
+        The path to the directory where the output files will be saved.
+    neurs_per_steps: int, optional
+        Number of postsynaptic neurons to query per batch, by default 500.
+        This parameter enables querying the database in iterative batches to
+        work around API query size limits. A value of 500 is a reliable
+        choice for a presynaptic set of approximately 8000 neurons.
+    start_index: int, optional
+        The starting batch index for the download, by default 0. If a previous
+        download was interrupted, this can be set to the index of the last
+        successfully downloaded file to resume the process.
+    max_retries: int, optional
+        The maximum number of times to retry a query if the server fails to
+        respond, by default 10.
+    delay: int, optional
+        The number of seconds to wait between retries after a failed query,
+        by default 5.
+    drop_synapses_duplicates: bool, optional
+        If True (default), all synapses between a given pair of neurons (i, j)
+        are merged into a single entry. The `synapse_size` of this entry will be
+        the sum of all individual synapse sizes. If False, each synapse is
+        saved as a separate entry.
+
+    Returns:
+    --------
+    None.
+        This function does not return any value. The resulting connection tables
+        are saved as individual CSV files in the specified `savefolder`.
+    """
+
+    # Ensure directory exists
+    os.makedirs(savefolder, exist_ok=True)
+
+    # We are querying the neurons in batches of neurs_per_steps. If neurs_per_steps is not
+    # a divisor of the postsynaptic_set size, the last iteration has fewer neurons
+    n_before_last = (postsynaptic_set.size // neurs_per_steps) * neurs_per_steps
+    n_chunks = -(-postsynaptic_set.size // neurs_per_steps)  # Ceiling division
+
+    # Time before starting the party
+    time_0 = time.time()
+
+    synapse_table = client.info.get_datastack_info()['synapse_table']
+
+    # Preset the dictionary so we do not build a large object every time
+    neurons_to_download = {'pre_pt_root_id': presynaptic_set}
+
+    # If we are not getting individual synapses, the best thing we can do is to not ask for positions, which is very heavy
+    if drop_synapses_duplicates:
+        cols_2_download = ['pre_pt_root_id', 'post_pt_root_id', 'size']
+        logging.info("Dropping synapse duplicates and excluding position data for lighter queries.")
+    else:
+        cols_2_download = ['pre_pt_root_id', 'post_pt_root_id', 'size', 'ctr_pt_position']
+    part = start_index
+
+    # Progress bar over the number of chunks to download
+    with tqdm(total=n_chunks) as progress_bar:
+        # Main loop over chunks
+        for i in range(start_index * neurs_per_steps, postsynaptic_set.size, neurs_per_steps):
+            # Inform about our progress
+            logging.debug(f'Postsynaptic neurons queried so far: {i}...')
+
+            # Try to query the API several times
+            success = False  # Flag to track if current batch succeeded
+            retry = 0
+            while retry < max_retries and not success:
+                try:
+                    # Get the postids that we will be grabbing in this query. We will get neurs_per_steps of them
+                    post_ids = postsynaptic_set[i : i + neurs_per_steps] if i < n_before_last else postsynaptic_set[i:]
+                    neurons_to_download['post_pt_root_id'] = post_ids
+                    logging.debug(f"Querying batch starting at index {i} with {len(post_ids)} neurons.")
+                    # Query the table
+                    sub_syn_df = client.materialize.query_table(
+                        synapse_table, filter_in_dict=neurons_to_download, select_columns=cols_2_download, split_positions=True
+                    )
+
+                    # Sum all repeated synapses. The final reset_index is because groupby would otherwise create a
+                    # multi-index dataframe and we want to have pre_root and post_root as columns
+                    if drop_synapses_duplicates:
+                        sub_syn_df = sub_syn_df.groupby(['pre_pt_root_id', 'post_pt_root_id']).sum().reset_index()
+
+                    sub_syn_df.to_csv(f'{savefolder}/connections_table_{part}.csv', index=False)
+                    logging.info(f"Successfully saved connections_table_{part}.csv")
+                    part += 1
+
+                    # Measure the total running time so far and estimate the remaining time
+                    elapsed_time = time.time() - time_0
+                    neurons_done = min(i + neurs_per_steps, postsynaptic_set.size)
+                    time_per_neuron = elapsed_time / neurons_done
+                    neurons_2_do = postsynaptic_set.size - neurons_done
+                    remaining_time = time_format(neurons_2_do * time_per_neuron)
+                    logging.debug(f'Estimated remaining time: {remaining_time}')
+                    success = True
+
+                    # Mark another chunk as downloaded
+                    progress_bar.update(1)
+
+                except requests.HTTPError as excep:
+                    logging.warning(f'API error on trial {retry + 1}. Retrying in {delay} seconds... Details: {excep}')
+                    print(f'API error on trial {retry + 1}. Retrying in {delay} seconds... Details: {excep}')
+                    time.sleep(delay)
+                    retry += 1
+
+                except Exception as excep:
+                    logging.error(f"An unexpected error occurred: {excep}")
+                    raise excep
+
+            if not success:
+                logging.error('Exceeded the max retries when trying to get synaptic connectivity. Aborting.')
+                raise TimeoutError('Exceeded max_retries when trying to get synaptic connectivity')
+
+
+def time_format(seconds):
+    """
+    Formats a duration in seconds into a human-readable string.
+
+    Parameters:
+    -----------
+    seconds: float
+        The total duration in seconds to be formatted.
+
+    Returns:
+    --------
+    str
+        A string representing the formatted duration.
+    """
+
+    if seconds > 3600 * 24:
+        days = int(seconds // (24 * 3600))
+        hours = int((seconds - days * 24 * 3600) // 3600)
+        return f'{days} days, {hours}h'
+    elif seconds > 3600:
+        hours = int(seconds // 3600)
+        minutes = int((seconds - hours * 3600) // 60)
+        return f'{hours}h, {minutes}min'
+    elif seconds > 60:
+        minutes = int(seconds // 60)
+        rem_sec = int(seconds - 60 * minutes)
+        return f'{minutes}min {rem_sec}s'
+    else:
+        return f'{seconds:.0f}s'
+
+
+def merge_connection_tables(savefolder, filename):
+    """
+    Merges individual connection table files into a single master file.
+
+    This function scans a specified directory for connection table files
+    (identified by the prefix 'connections_table_'), which are typically
+    generated by the `connectome_constructor` function. It then concatenates
+    them into a single pandas DataFrame and saves the result as a new CSV file.
+
+    Parameters:
+    -----------
+    savefolder: str
+        The path to the directory containing the connection table files to be merged.
+    filename: str
+        The base name for the output file. The merged table will be saved as
+        '{filename}.csv' in the `savefolder`.
+
+    Returns:
+    --------
+    None.
+        This function does not return a value. It saves the merged table to a CSV file.
+    """
+
+    # Check if the synapses folder exists
+    logging.info(f"Starting to merge connection tables into {filename}.csv")
+    synapses_path = f'{savefolder}/synapses/'
+    if not os.path.exists(synapses_path):
+        if os.path.exists(savefolder) and any('connections_table_' in f for f in os.listdir(savefolder)):
+            synapses_path = savefolder
+        else:
+            raise FileNotFoundError(f'Could not find synapses directory at {synapses_path}')
+
+    # Collect the tables to merge by checking all files in the correct folder
+    connection_files = []
+    for file in os.listdir(synapses_path):
+        file_path = os.path.join(synapses_path, file)
+        if os.path.isfile(file_path) and 'connections_table_' in file:
+            connection_files.append(file_path)
+
+    if not connection_files:
+        logging.warning('No connection tables found to merge.')
+        return
+
+    logging.info(f"Found {len(connection_files)} connection tables to merge.")
+
+    # Merge all of them
+    first_file = connection_files[0]
+    table = pd.read_csv(first_file)
+
+    for file_path in connection_files[1:]:
+        table = pd.concat([table, pd.read_csv(file_path)])
+
+    output_path = f'{savefolder}/{filename}.csv'
+    table.to_csv(output_path, index=False)
+    logging.info(f'Merged {len(connection_files)} tables into {output_path}')
+    return
+
+
+def download_functional_fits(filepath):
+    """
+    Downloads functional fit data from a static Zenodo repository.
+
+    This function retrieves a CSV file containing functional fitting data from a
+    pre-defined Zenodo URL and saves it to the specified local path.
+
+    Parameters:
+    -----------
+    filepath: str
+        The full path, including the desired filename, where the downloaded file will be stored.
+
+    Returns:
+    --------
+    None.
+        This function does not return a value. It saves the content directly to a file.
+    """
+
+    # TO DO
+    response = requests.get("URL TO OUR FILE IN ZENODO")
+
+    with open(f"{filepath}.csv", mode="wb") as file:
+        file.write(response.content)
@@ -0,0 +1,17 @@
+"""
+# Filters subpackage
+
+The filters subpackage helps to query the units and synapse tables. The tables are plain Pandas DataFrames, so it is always possible to `.query` them.
+However, it is often necessary to filter on several criteria at once, which is inconvenient, especially for the synapses.
+The `filters` package reduces the effort for the most common operations.
+As stated in the Quick Start, the three most important functions are `filter_neurons`, `filter_connections` and `synapses_by_id`.
+There are several examples in the `basic_notebook`. The API reference below contains detailed information about the arguments of these functions.
+
+> Note that the filters act only on the predefined columns of the unit table, not on custom columns added from other tables. In those cases, your best bet is to `.query` directly.
+
+Please read the API reference below for more information on individual functions.
+"""
+
+from .filters import *
+
+__all__ = ["filter_neurons", "filter_connections", "synapses_by_id", "remove_autapses", "connections_to", "connections_from"]