PyPI - graph-seeder - Versions diffs - 1.0.0.dev4__tar.gz → 1.0.0.dev5__tar.gz - Mend

graph-seeder 1.0.0.dev4__tar.gz → 1.0.0.dev5__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (33)

graph_seeder-1.0.0.dev5/PKG-INFO ADDED

@@ -0,0 +1,454 @@
+Metadata-Version: 2.4
+Name: graph-seeder
+Version: 1.0.0.dev5
+Summary: A powerful tool to extract and densify subgraphs from Knowledge Graphs via SPARQL or LMDB, with different extraction strategies.
+Requires-Python: >=3.9
+Requires-Dist: lmdb>=2.2.0
+Requires-Dist: networkx<4.0.0,>=3.2.1
+Requires-Dist: pandas<3.0.0,>=2.3.3
+Requires-Dist: rdflib>=7.6.0
+Requires-Dist: requests>=2.32.5
+Requires-Dist: rich>=15.0.0
+Requires-Dist: sparqlwrapper>=2.0.0
+Requires-Dist: urllib3>=2.6.3
+Description-Content-Type: text/markdown
+# Graph Seeder
+Graph Seeder is a highly configurable, end-to-end Python package designed to extract, densify, and analyze subgraphs
+from Knowledge Graphs (like DBpedia, Wikidata, local RDF files or LMDB hashmaps) based on seed entities.
+It can be used as a command-line tool or imported as a library in your Python projects. The package supports different
+extraction strategies, automatic densification to connect isolated components, and export formats for both the extracted
+paths and the full graph.
+**Full documentation and updates:** [Graph Seeder on PyPI](https://pypi.org/project/graph-seeder/)
+### Warning
+The full documentation has not been written yet.
+This README provides a comprehensive overview of features, installation, and usage.
+## Features
+* **Smart extraction:** Dynamically queries SPARQL endpoints, local Turtle files or LMDB hashmaps using Bidirectional
+  BFS (for paths between entities) or Radial Hop Expansion (for neighborhoods).
+* **Automatic densification:** Analyzes the extracted subgraph and automatically connects disconnected components to
+  maximize connectivity and semantic richness.
+* **Rich exports:** Outputs results in hierarchical JSON (preserving path traces) or RDF Turtle format, along with
+  detailed extraction statistics.
+* **Resilience and hub management:** Implements robust error handling, automatic retries, and intelligent detection of
+  massive hub nodes to prevent endpoint overloads and timeouts.
+## Installation
+### As a Python Package (recommended)
+You can install Graph Seeder directly from PyPI into your project's virtual environment:
+```bash
+pip install graph-seeder
+```
+### For local development
+If you want to clone the repository to modify the code locally:
+```bash
+git clone https://github.com/YourOrg/graph-seeder.git
+cd graph-seeder
+uv sync  # or: pip install -e .
+```
+## Configuration
+The project is driven by a powerful configuration engine. You can either:
+* Use built-in configuration templates (`dbpedia_default`, `wikidata_default`, `pgxlod_default`, `europeana_default` or
+  `default`)
+* Give the path to your own custom `.json` configuration file. You can use the `generate-config` command to create a
+  template file with all available parameters and their default values, which you can then modify as needed.
+You can also override any configuration parameter directly from the command line or via Python arguments, which will
+take precedence over the config file values.
+## Usage
+### 1. Via Command Line Interface (CLI)
+You can call `graph-seeder` directly from your terminal. Use the `--config` flag to specify your base configuration, and
+append any overrides as `key=value` pairs.
+```bash
+# Example 1: Using a built-in template with some overrides
+graph-seeder --config dbpedia_default input_path=data/seeds.csv output_format=json output_path=results/my_graph
+# Example 2: Using your own custom JSON configuration file
+graph-seeder --config path/to/my_custom_config.json
+# Example 3: Overriding deep parameters on the fly (takes precedence over the config)
+graph-seeder --config wikidata_default max_hops=3 batch_size=50 request_delay=2.5 type=hashmap
+```
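The override precedence described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `parse_overrides` and `merge_config`, and the use of `ast.literal_eval` to guess value types, are hypothetical and are not taken from Graph Seeder's code.

```python
import ast


def parse_overrides(args):
    """Turn ['max_hops=3', 'type=hashmap'] into a dict with typed values (hypothetical helper)."""
    overrides = {}
    for arg in args:
        key, sep, raw = arg.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {arg!r}")
        try:
            # Recognize ints, floats, booleans, None, lists, ...
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Anything else (paths, template names) stays a plain string.
            value = raw
        overrides[key] = value
    return overrides


def merge_config(base, overrides):
    """Overrides take precedence over the values of the base configuration."""
    merged = dict(base)
    merged.update(overrides)
    return merged


# Assumed base values, for illustration only.
base = {"max_hops": 2, "batch_size": 20, "type": "sparql"}
cli_args = ["max_hops=3", "request_delay=2.5", "type=hashmap"]
config = merge_config(base, parse_overrides(cli_args))
print(config)
```

Keeping parsing and merging separate means the same `merge_config` step could serve both CLI overrides and Python keyword arguments.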
+### 2. Via Python API
+You can import and use Graph Seeder directly in your Python scripts. The `SubgraphExtractor` accepts a `config` (which can
+be a built-in template name or a path to a `.json` file) and uses keyword arguments for overrides.
+```python
+from graph_seeder import SubgraphExtractor
+# You can pass a built-in template name OR a path to a custom .json file.
+# Any additional keyword arguments will override the base configuration.
+extractor = SubgraphExtractor(
+    config="wikidata_default",  # Or "path/to/my_custom_config.json"
+    input_path="data/seeds.csv",  # Override: Input file
+    output_format="json",  # Override: Output format
+    output_path="results/my_graph",  # Override: Output destination
+    batch_size=50,  # Override: SPARQL batch size
+    max_hops=3  # Override: Maximum depth limit
+)
+extractor.run()
+```
+## Configuration parameters
+Here is the complete list of parameters you can configure (either in a `.json` config file or overridden in the
+CLI/Python arguments).
+### Data
+| Parameter           | Type  | Description                                                                                                            |
+|---------------------|-------|------------------------------------------------------------------------------------------------------------------------|
+| `input_path`        | `str` | Path to the input CSV file containing seed nodes.                                                                      |
+| `output_format`     | `str` | Format of the extracted graph output (`json` or `ttl`).                                                                |
+| `output_path`       | `str` | Destination path and base filename for the extracted files.                                                            |
+| `stats_output_path` | `str` | Path to save the extraction statistics in JSON format.                                                                 |
+| `turtle_path`       | `str` | Path to a local `.ttl` file (if using local extraction instead of a SPARQL endpoint) to load a local knowledge graph.  |
+| `hashmap_path`      | `str` | Path to a local LMDB hashmap (if using local extraction instead of a SPARQL endpoint) to load a local knowledge graph. |
+### Client / SPARQL
+| Parameter         | Type    | Description                                                                                                                                            |
+|-------------------|---------|--------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `type`            | `str`   | Type of client to use for extraction (`sparql` for SPARQL endpoints, `turtle` to load a local Turtle file, or `hashmap` to load a local LMDB hashmap). |
+| `endpoint`        | `str`   | URL of the SPARQL endpoint to query (e.g., `https://dbpedia.org/sparql`).                                                                              |
+| `user_agent`      | `str`   | HTTP User-Agent header (Highly recommended for some knowledge graphs like Wikidata to avoid blocks).                                                   |
+| `request_delay`   | `float` | Delay in seconds between consecutive requests to avoid server overload.                                                                                |
+| `retry_attempts`  | `int`   | Number of times to retry a failed HTTP request.                                                                                                        |
+| `retry_delay`     | `float` | Delay in seconds before retrying a failed request.                                                                                                     |
+| `rate_limit_wait` | `float` | Time to wait in seconds when a rate limit (HTTP 429) is encountered.                                                                                   |
+| `timeout`         | `float` | Maximum time in seconds to wait for a server response.                                                                                                 |
+### Extraction settings
+| Parameter                   | Type   | Description                                                                                                                                                                                                                    |
+|-----------------------------|--------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `strategy`                  | `str`  | Graph extraction algorithm: `bfs` (paths between pairs) or `hop` (radial expansion).                                                                                                                                           |
+| `batch_size`                | `int`  | Number of entities to process in a single SPARQL query.                                                                                                                                                                        |
+| `max_hops`                  | `int`  | Maximum depth or distance from the seed nodes to explore.                                                                                                                                                                      |
+| `check_seeds_validity`      | `bool` | Verify if seed nodes have valid URIs before starting.                                                                                                                                                                          |
+| `create_all_pairs`          | `bool` | If True, generates all possible source/target pairs from a list of seeds.                                                                                                                                                      |
+| `check_hub_seeds`           | `bool` | Check the degree of seed nodes beforehand to identify massive hubs and ask the user whether to keep or exclude them.                                                                                                           |
+| `keep_hub_seeds`            | `bool` | Keep (`True`), skip (`False`), or prompt user (`None`) about massive hub seeds.                                                                                                                                                |
+| `max_neighbors_threshold`   | `int`  | Maximum number of neighbors allowed before a node is considered a massive hub.                                                                                                                                                 |
+| `hub_pagination_threshold`  | `int`  | Number of neighbors at which the extractor starts paginating queries for a node, to avoid request timeouts on nodes with many neighbors. If this parameter is not specified, no pagination is used.                            |
+| `hub_pairs_batch_size`      | `int`  | When paginating, number of pairs (node/property) to process in each batch.                                                                                                                                                     |
+| `min_triplets_per_property` | `int`  | Minimum number of triplets required per property to be kept when paginating.                                                                                                                                                   |
+### Densification
+| Parameter            | Type   | Description                                                                                                                                                                                |
+|----------------------|--------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `skip_densification` | `bool` | Skip the post-extraction step that attempts to connect isolated subgraphs.                                                                                                                 |
+| `densification_mode` | `str`  | Strategy used to pick the node to connect in each component during densification (`most_connected` selects the seed with the highest degree, `random` selects a random seed).              |
+### Graph filters
+| Parameter              | Type            | Description                                                                   |
+|------------------------|-----------------|-------------------------------------------------------------------------------|
+| `namespaces`           | `dict` / `list` | Custom namespaces (CLI format: `prefix=URI`, e.g., `ex=http://example.com/`). |
+| `include_uri_prefixes` | `list`          | Only explore nodes whose URIs start with one of these prefixes.               |
+| `exclude_uri_prefixes` | `list`          | Ignore nodes whose URIs start with any of these prefixes.                     |
+| `exclude_properties`   | `list`          | Specific properties (URIs) to completely ignore during extraction.            |
+| `exclude_nodes`        | `list`          | Specific nodes (URIs) to completely ignore during extraction.                 |
+### Debug
+| Parameter         | Type   | Description                                                                         |
+|-------------------|--------|-------------------------------------------------------------------------------------|
+| `debug_enabled`   | `bool` | Enable verbose debug-level logging in the console, used to display failed requests. |
+| `request_logging` | `bool` | Log details of all SPARQL queries.                                                  |
+## Input dataset structure
+The input must be a CSV file containing your seed entities using full URIs.
+### Path extraction (`strategy: bfs`)
+Provide two columns representing the source and target entities to connect:
+```csv
+seed,target
+http://dbpedia.org/resource/Paris,http://dbpedia.org/resource/London
+http://dbpedia.org/resource/Inria,http://dbpedia.org/resource/Computer_science
+```
+### Radial expansion (`strategy: hop`)
+Provide a single column of seeds.
+```csv
+seed
+http://dbpedia.org/resource/Inria
+http://dbpedia.org/resource/France
+http://dbpedia.org/resource/Alan_Turing
+```
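As a rough illustration of the two input layouts, the sketch below reads either form with the standard `csv` module. The function name `read_seeds` is hypothetical; only the `seed` and `target` column names come from the examples above.

```python
import csv
import io


def read_seeds(csv_file, strategy):
    """Read seed rows from an open CSV file (hypothetical helper, not the package's loader)."""
    reader = csv.DictReader(csv_file)
    if strategy == "bfs":
        # Two columns: one (source, target) pair per row.
        return [(row["seed"], row["target"]) for row in reader]
    if strategy == "hop":
        # One column: a flat list of seeds.
        return [row["seed"] for row in reader]
    raise ValueError(f"unknown strategy: {strategy!r}")


bfs_csv = io.StringIO(
    "seed,target\n"
    "http://dbpedia.org/resource/Paris,http://dbpedia.org/resource/London\n"
)
hop_csv = io.StringIO(
    "seed\n"
    "http://dbpedia.org/resource/Inria\n"
    "http://dbpedia.org/resource/France\n"
)

pairs = read_seeds(bfs_csv, "bfs")
seeds = read_seeds(hop_csv, "hop")
print(pairs)
print(seeds)
```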
+## Architecture & Extraction Pipeline
+The extraction workflow is divided into five major stages, each optimized to reduce endpoint load, improve reliability,
+and maximize graph quality.
+### 1. Pre-processing and safety checks
+Before starting the extraction, the `SubgraphExtractor` performs validation and safety checks to ensure that the
+provided seeds are valid and to identify any potential issues that could arise during extraction.
+#### Seed validation (`check_seeds_validity`)
+The extractor sends validation queries to verify that each provided seed URI actually exists within the target Knowledge
+Graph.
+Invalid or unreachable entities are then displayed in a warning message and the extraction is stopped, so users can
+correct their input before starting the full extraction.
+#### Massive hub detection (`check_hub_seeds`)
+Knowledge Graphs often contain highly connected entities such as:
+* `United States`
+* `Human`
+* `English language`
+These "super-hubs" may have millions of relationships and can easily trigger endpoint timeouts.
+To prevent this:
+1. The extractor computes the exact degree (number of neighbors) of every seed node.
+2. If a seed node exceeds `max_neighbors_threshold`, a warning is raised.
+3. The user may:
+    * Remove the seed node from the extraction.
+    * Keep the seed node and continue.
+When retained, the seed node is automatically added to a `forced_hubs` list, which forces the extractor to keep it
+during the extraction phase, even if it exceeds the `max_neighbors_threshold`.
+#### Pair generation (`create_all_pairs`)
+When `create_all_pairs=True`, the extractor converts the input list of seeds into a complete set of source-target pairs.
+For a list of **N** seeds, the number of generated pairs is:
+$$
+\frac{N(N-1)}{2}
+$$
+This lets users extract paths between all combinations of a given set of entities without manually creating the pairs
+in the input CSV file, which yields a richer and more interconnected subgraph.
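The pair count can be checked with `itertools.combinations`, which enumerates unordered pairs. This is a sketch of the idea only: the function name `all_pairs` and the de-duplication step are assumptions, and short labels stand in for full URIs.

```python
from itertools import combinations


def all_pairs(seeds):
    """Build every unordered (source, target) pair from a list of seeds (hypothetical helper)."""
    unique = list(dict.fromkeys(seeds))  # drop duplicates, keep input order
    return list(combinations(unique, 2))


seeds = ["Paris", "London", "Berlin", "Rome"]
pairs = all_pairs(seeds)
n = len(seeds)
print(len(pairs), n * (n - 1) // 2)
```

The quadratic growth is worth keeping in mind: 100 seeds already produce 4950 pairs, each of which triggers its own path search.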
+### 2. Graph exploration
+Graph Seeder maintains an in-memory `networkx.MultiGraph` acting as a local cache, to avoid redundant queries and to
+store the evolving graph structure.
+Nodes are only fetched from the SPARQL endpoint when they become part of the active exploration frontier.
+Depending on the extraction objective, one of two traversal strategies is used.
+### Bidirectional BFS (path finding)
+To discover shortest paths between a source and a target entity, Graph Seeder employs a **Bidirectional Breadth-First
+Search (BFS)**.
+Instead of exploring from only one side, the algorithm simultaneously searches from both endpoints.
+At each iteration, the algorithm compares:
+* `q_src`: source frontier size
+* `q_tgt`: target frontier size
+Only the smaller frontier is expanded, to reduce the number of SPARQL queries and memory usage.
+#### Stopping
+The search terminates immediately when the two frontiers intersect.
+However, there are two cases where the search will stop without finding a path:
+1. If either frontier exceeds `max_hops` without intersection, the search is abandoned to prevent potential infinite
+   loops.
+2. If one frontier is completely exhausted (no more nodes to explore) before intersection, the search is also stopped.
+   This happens when the source and target are in disconnected components of the graph, due to missing links or filtered
+   nodes and properties.
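The search and its three stopping conditions can be sketched on a small in-memory graph. The code below is a simplified illustration, not Graph Seeder's implementation: it works on a plain adjacency dictionary instead of querying an endpoint, it assumes that `max_hops` bounds each side separately, and it breaks frontier-size ties by expanding the shallower side. It returns a connecting path and does not guarantee the shortest one.

```python
def join_paths(parents_src, parents_tgt, meeting):
    """Stitch the source-side and target-side parent chains at the meeting node."""
    path = []
    node = meeting
    while node is not None:
        path.append(node)
        node = parents_src[node]
    path.reverse()
    node = parents_tgt[meeting]
    while node is not None:
        path.append(node)
        node = parents_tgt[node]
    return path


def bidirectional_bfs(adj, source, target, max_hops):
    """Return a path from source to target, or None if no path is found."""
    if source == target:
        return [source]
    parents_src, parents_tgt = {source: None}, {target: None}
    q_src, q_tgt = [source], [target]
    depth_src = depth_tgt = 0

    # Stop as soon as one frontier is exhausted.
    while q_src and q_tgt:
        # Expand the smaller frontier; on ties, expand the shallower side.
        expand_src = (len(q_src), depth_src) <= (len(q_tgt), depth_tgt)
        if expand_src:
            frontier, parents, others = q_src, parents_src, parents_tgt
            depth_src += 1
            depth = depth_src
        else:
            frontier, parents, others = q_tgt, parents_tgt, parents_src
            depth_tgt += 1
            depth = depth_tgt
        if depth > max_hops:
            return None  # hop limit reached without intersection

        next_frontier = []
        for node in frontier:
            for neighbor in adj.get(node, ()):
                if neighbor in parents:
                    continue
                parents[neighbor] = node
                if neighbor in others:
                    # The two searches met: terminate immediately.
                    return join_paths(parents_src, parents_tgt, neighbor)
                next_frontier.append(neighbor)
        if expand_src:
            q_src = next_frontier
        else:
            q_tgt = next_frontier
    return None


chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
print(bidirectional_bfs(chain, "A", "E", max_hops=3))
print(bidirectional_bfs(chain, "A", "E", max_hops=1))
print(bidirectional_bfs(chain, "A", "Z", max_hops=3))
```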
+### Radial Hop Expansion (neighborhood extraction)
+For neighborhood extraction (`strategy="hop"`), the graph is expanded radially around the seed entities.
+The exploration proceeds layer-by-layer until reaching `max_hops`.
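A layer-by-layer expansion can be sketched as follows, again on an in-memory adjacency dictionary rather than a live endpoint. The function name `hop_expansion` and its return value (a mapping from node to hop distance) are assumptions made for this example.

```python
def hop_expansion(adj, seeds, max_hops):
    """Return {node: hop distance from the nearest seed} for nodes within max_hops."""
    hops = {seed: 0 for seed in seeds}
    frontier = list(hops)
    for depth in range(1, max_hops + 1):
        next_frontier = []
        for node in frontier:
            for neighbor in adj.get(node, ()):
                if neighbor not in hops:
                    hops[neighbor] = depth
                    next_frontier.append(neighbor)
        if not next_frontier:
            break  # nothing new to explore
        frontier = next_frontier
    return hops


chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
reached = hop_expansion(chain, ["C"], max_hops=1)
print(reached)
```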
+### 3. Graph wrapper and SPARQL querying
+The wrapper component is responsible for all interactions with the underlying data source, whether it's a SPARQL
+endpoint, a local Turtle file, or an LMDB hashmap.
+### Wrapper design
+- **`NeighborhoodWrapper` (The Interface):** An abstract base class that contains the configuration parameters (such as
+  `max_neighbors_threshold`, URI filters, excluded properties, and the `forced_hubs` registry). It defines the core
+  methods that any concrete wrapper implementation must provide:
+    - `check_seeds_validity()`
+    - `count_neighbors()`
+    - `get_neighborhood()`
+- **`GraphWrapper`:** The concrete class that extends this interface, providing the SPARQL-specific logic, batching
+  mechanisms, and fault-tolerance required to safely explore the graph.
+### Safety checks and hub management
+Before extracting paths, `GraphWrapper` performs the safety checks and hub management steps mentioned in the
+pre-processing stage of the pipeline.
+#### Seed validation
+Using `check_seeds_validity()`, it processes input seeds in batches. If a batch validation query fails, the wrapper
+automatically applies a dichotomy split to isolate the specific problematic entity and displays it in the console for
+user correction.
+#### Hub detection
+Using `count_neighbors()`, it constructs a mapping of seed nodes to their degree (number of neighbors):
+```text
+node_uri → number_of_neighbors
+```
+This mapping is then used to identify massive seeds that exceed the `max_neighbors_threshold`. The user is prompted to
+decide whether to keep or exclude each hub seed, and the decision is stored in the `forced_hubs` registry for later
+reference.
+### Two-phase neighborhood extraction
+Then, using `get_neighborhood()`, the wrapper executes a two-phase extraction process for each node in the active
+frontier:
+#### Step 1: Property statistics retrieval
+Before pulling any actual edges, the wrapper executes a metadata query for the current batch of nodes. It retrieves
+every property connected to those nodes and their occurrence counts.
+#### Step 2: Node classification
+Based on these statistics, each node is dynamically routed into one of three execution paths:
+1. **Skipped Nodes**
+    - If a node's total neighbors exceed `max_neighbors_threshold` (and it wasn't manually forced by the user), it is
+      completely skipped.
+    - This prevents timeouts on queries containing extreme global hubs (such as *United States* or *Human*).
+2. **Safe Nodes (Standard)**
+    - If a node's degree is below the `hub_pagination_threshold`, it is considered safe.
+    - It is grouped with other safe nodes, and their entire neighborhoods are fetched in a single query.
+3. **Hub Nodes (Pagination)**
+    - If a node exceeds the `hub_pagination_threshold`, a specialized property-by-property extraction is triggered.
+    - Properties yielding fewer than `min_triplets_per_property` are ignored to focus on the most semantically relevant
+      edges.
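The routing rule can be written as a small function. This sketch is not the wrapper's actual code: the function name `classify_nodes` is hypothetical, the degrees are made-up numbers, and treating a degree exactly equal to `hub_pagination_threshold` as a hub is an assumption about the boundary case.

```python
def classify_nodes(degrees, max_neighbors_threshold,
                   hub_pagination_threshold=None, forced_hubs=frozenset()):
    """Split nodes into (skipped, safe, hubs) from a node -> degree mapping."""
    skipped, safe, hubs = [], [], []
    for node, degree in degrees.items():
        if degree > max_neighbors_threshold and node not in forced_hubs:
            skipped.append(node)  # too large and not forced by the user
        elif hub_pagination_threshold is not None and degree >= hub_pagination_threshold:
            hubs.append(node)  # fetched property by property
        else:
            safe.append(node)  # fetched together in a single query
    return skipped, safe, hubs


# Illustrative degrees only, not real counts.
degrees = {"ex:Human": 5_000_000, "ex:France": 40_000, "ex:Inria": 120, "ex:USA": 9_000_000}
skipped, safe, hubs = classify_nodes(
    degrees,
    max_neighbors_threshold=1_000_000,
    hub_pagination_threshold=10_000,
    forced_hubs={"ex:USA"},
)
print(skipped, safe, hubs)
```

With no `hub_pagination_threshold`, every node that is not skipped is treated as safe, which matches the note that pagination is disabled when the parameter is not specified.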
+### Dichotomy error handling
+SPARQL endpoints occasionally fail due to HTTP 500 errors, query timeouts or temporary server overload.
+When a query fails, Graph Seeder does not discard the operation.
+Instead, it recursively splits the input batch into two halves:
+```text
+[ A B C D E F ] (Fails)
+      ↓
+[ A B C ] (Succeeds)  +  [ D E F ] (Fails)
+                               ↓
+                         [ D ] + [ E F ] (Succeeds)
+```
+Each subset is executed independently and the process continues until either:
+- A successful query is obtained, or
+- The subset size reaches a single item.
+This allows the extractor to isolate problematic entities or properties without discarding the entire batch.
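The recursive split can be sketched with a query function that is passed in as a parameter. The names `run_with_dichotomy` and `fake_query` are hypothetical, and catching every `Exception` is a simplification; a real client would likely distinguish error types.

```python
def run_with_dichotomy(batch, run_query):
    """Run run_query on batch; on failure, split in two and retry each half.

    Returns (results, failed_items).
    """
    if not batch:
        return [], []
    try:
        return list(run_query(batch)), []
    except Exception:  # simplification: any failure triggers a split
        if len(batch) == 1:
            return [], list(batch)  # isolated the problematic item
        mid = len(batch) // 2
        left_ok, left_bad = run_with_dichotomy(batch[:mid], run_query)
        right_ok, right_bad = run_with_dichotomy(batch[mid:], run_query)
        return left_ok + right_ok, left_bad + right_bad


def fake_query(batch):
    """Stand-in for an endpoint call that fails whenever 'E' is in the batch."""
    if "E" in batch:
        raise RuntimeError("HTTP 500")
    return [f"neighbors({item})" for item in batch]


results, failed = run_with_dichotomy(list("ABCDEF"), fake_query)
print(results)
print(failed)
```

In this run only `E` ends up in `failed`, while the other five items are still processed.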
+#### Traffic control
+The underlying `SparqlClient` automatically manages endpoint throttling.
+Features include:
+* configurable request delays (`request_delay`),
+* automatic retries (`retry_attempts`),
+* retry backoff (`retry_delay`),
+* HTTP 429 rate-limit handling (`rate_limit_wait`),
+* configurable query timeouts (`timeout`).
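How these parameters might interact is sketched below. This is an assumption-heavy illustration, not the `SparqlClient` code: the `RateLimited` exception, the `call_with_retries` function, and the injectable `sleep` argument are all hypothetical, and `request_delay` and `timeout` are left out for brevity.

```python
import time


class RateLimited(Exception):
    """Hypothetical signal for an HTTP 429 response."""


def call_with_retries(send, retry_attempts=3, retry_delay=1.0,
                      rate_limit_wait=30.0, sleep=time.sleep):
    """Call send(); wait and retry on failure, waiting longer on rate limits."""
    last_error = None
    for attempt in range(retry_attempts + 1):
        try:
            return send()
        except RateLimited as err:
            last_error, wait = err, rate_limit_wait
        except Exception as err:  # simplification: retry on any other failure
            last_error, wait = err, retry_delay
        if attempt < retry_attempts:
            sleep(wait)
    raise last_error


# Simulated endpoint: one rate limit, one server error, then success.
outcomes = iter([RateLimited("429"), RuntimeError("500"), "ok"])


def flaky_send():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


waits = []  # record the waits instead of actually sleeping
result = call_with_retries(flaky_send, sleep=waits.append)
print(result, waits)
```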
+### 4. Graph densification
+After the initial extraction phase, the `GraphConnector` analyzes the resulting graph topology.
+#### Connected Component Analysis
+The graph is decomposed into its connected components. If multiple disconnected subgraphs are detected, Graph Seeder
+attempts to reconnect them automatically.
+For each disconnected component, a representative node is selected according to the chosen
+`densification_mode`:
+* `most_connected`: the seed with the highest degree (most neighbors) is selected as the representative for that
+  component.
+* `random`: a random seed is selected as the representative for that component.
+A Bidirectional BFS is then executed between representatives of disconnected components. When a connecting path is
+found, the corresponding triples are added to the graph.
+This process iterates until either all components are connected or all pairs of representatives have been exhausted
+without finding a path between some of them.
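The component analysis and the choice of representatives can be sketched in plain Python. This is an illustration under assumptions: the package itself relies on NetworkX, whereas this example uses a hand-written traversal over an adjacency dictionary, and the function names are hypothetical.

```python
import random


def connected_components(adj):
    """Return the connected components of an undirected adjacency dict as sets."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adj.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


def pick_representative(component, adj, seeds, mode="most_connected", rng=random):
    """Pick the seed that represents a component, or None if it holds no seed."""
    candidates = sorted(component & seeds)
    if not candidates:
        return None
    if mode == "most_connected":
        return max(candidates, key=lambda node: len(adj.get(node, ())))
    if mode == "random":
        return rng.choice(candidates)
    raise ValueError(f"unknown densification_mode: {mode!r}")


adj = {
    "A": {"B", "C"}, "B": {"A"}, "C": {"A"},
    "X": {"Y"}, "Y": {"X", "Z"}, "Z": {"Y"},
}
seeds = {"A", "B", "Y", "Z"}
components = connected_components(adj)
representatives = [pick_representative(c, adj, seeds) for c in components]
print(representatives)
```

A path search between the representatives (here `A` and `Y`) would then attempt to bridge the two components.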
+### 5. Graph export and statistics
+Once extraction and densification are complete, the `GraphExporter` and `GraphStatistics` modules generate the final
+outputs.
+#### Graph export
+The resulting graph can be exported either as:
+* Hierarchical JSON preserving source-target path traces.
+* RDF Turtle (`.ttl`) files.
+The full graph containing all the triples retrieved during the extraction phase is also saved in NetworkX `gpickle`
+format.
+#### Statistical report
+A complete extraction report is generated in JSON format containing metrics such as:
+* number of traversed triples,
+* number of unique triples,
+* number of unique subjects,
+* number of unique predicates,
+* number of unique objects,
+* number of unique entities,
+* number of connected components,
+* mean component size,
+* standard deviation of component sizes.
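Most of these metrics follow directly from the list of triples. The sketch below shows one way to compute them; the dictionary keys, the definition of "entities" as the union of subjects and objects, and the use of the population standard deviation are assumptions, not the format of the actual report.

```python
from statistics import mean, pstdev


def triple_stats(triples, component_sizes):
    """Compute report-style metrics from (subject, predicate, object) triples."""
    unique = set(triples)
    subjects = {s for s, _, _ in unique}
    predicates = {p for _, p, _ in unique}
    objects = {o for _, _, o in unique}
    return {
        "traversed_triples": len(triples),
        "unique_triples": len(unique),
        "unique_subjects": len(subjects),
        "unique_predicates": len(predicates),
        "unique_objects": len(objects),
        "unique_entities": len(subjects | objects),  # assumed definition
        "connected_components": len(component_sizes),
        "mean_component_size": mean(component_sizes),
        "std_component_size": pstdev(component_sizes),  # population std, assumed
    }


triples = [("a", "p", "b"), ("b", "q", "c"), ("a", "p", "b")]
stats = triple_stats(triples, component_sizes=[3, 1])
print(stats)
```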
