graph-seeder 1.0.0.dev4__tar.gz → 1.0.0.dev5__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (33)
  1. graph_seeder-1.0.0.dev5/PKG-INFO +454 -0
  2. graph_seeder-1.0.0.dev5/README.md +439 -0
  3. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/pyproject.toml +1 -1
  4. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/src/graph_seeder/GraphSeeder.py +20 -3
  5. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/src/graph_seeder/SubgraphExtractor.py +2 -2
  6. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/src/graph_seeder/configs/dbpedia_default.json +1 -1
  7. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/src/graph_seeder/configs/default.json +6 -6
  8. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/src/graph_seeder/configs/europeana_default.json +1 -1
  9. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/src/graph_seeder/configs/pgxlod_default.json +1 -1
  10. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/src/graph_seeder/configs/wikidata_default.json +1 -1
  11. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/src/graph_seeder/densification/GraphConnector.py +6 -4
  12. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/src/graph_seeder/extraction/BFS/BFS.py +8 -2
  13. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/src/graph_seeder/extraction/Hop/HopExpansion.py +14 -7
  14. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/src/graph_seeder/utils/ConsoleUI.py +1 -1
  15. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/src/graph_seeder/utils/utils.py +59 -5
  16. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/src/graph_seeder/wrapper/sparql/GraphWrapper.py +2 -4
  17. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/uv.lock +1 -1
  18. graph_seeder-1.0.0.dev4/PKG-INFO +0 -191
  19. graph_seeder-1.0.0.dev4/README.md +0 -176
  20. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/.github/workflows/publish.yml +0 -0
  21. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/.gitignore +0 -0
  22. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/requirements.txt +0 -0
  23. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/src/graph_seeder/extraction/ExtractionStrategy.py +0 -0
  24. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/src/graph_seeder/utils/Factory.py +0 -0
  25. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/src/graph_seeder/utils/GraphExporter.py +0 -0
  26. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/src/graph_seeder/utils/GraphStatistics.py +0 -0
  27. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/src/graph_seeder/utils/URIManager.py +0 -0
  28. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/src/graph_seeder/wrapper/NeighborhoodWrapper.py +0 -0
  29. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/src/graph_seeder/wrapper/hashmap/HashMapWrapper.py +0 -0
  30. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/src/graph_seeder/wrapper/sparql/BaseClient.py +0 -0
  31. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/src/graph_seeder/wrapper/sparql/SparqlQueryBuilder.py +0 -0
  32. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/src/graph_seeder/wrapper/sparql/client/SparqlClient.py +0 -0
  33. {graph_seeder-1.0.0.dev4 → graph_seeder-1.0.0.dev5}/src/graph_seeder/wrapper/sparql/client/TurtleClient.py +0 -0
@@ -0,0 +1,454 @@
1
+ Metadata-Version: 2.4
2
+ Name: graph-seeder
3
+ Version: 1.0.0.dev5
4
+ Summary: A powerful tool to extract and densify subgraphs from Knowledge Graphs via SPARQL or LMDB, with different extraction strategies.
5
+ Requires-Python: >=3.9
6
+ Requires-Dist: lmdb>=2.2.0
7
+ Requires-Dist: networkx<4.0.0,>=3.2.1
8
+ Requires-Dist: pandas<3.0.0,>=2.3.3
9
+ Requires-Dist: rdflib>=7.6.0
10
+ Requires-Dist: requests>=2.32.5
11
+ Requires-Dist: rich>=15.0.0
12
+ Requires-Dist: sparqlwrapper>=2.0.0
13
+ Requires-Dist: urllib3>=2.6.3
14
+ Description-Content-Type: text/markdown
15
+
16
+ # Graph Seeder
17
+
18
+ Graph Seeder is a highly configurable, end-to-end Python package designed to extract, densify, and analyze subgraphs
19
+ from Knowledge Graphs (like DBpedia, Wikidata, local RDF files or LMDB hashmaps) based on seed entities.
20
+
21
+ It can be used as a command-line tool or imported as a library in your Python projects. The package supports different
22
+ extraction strategies, automatic densification to connect isolated components, and export formats for both the extracted
23
+ paths and the full graph.
24
+
25
+ **Full documentation and updates:** [Graph Seeder on PyPI](https://pypi.org/project/graph-seeder/)
26
+
27
+ ### Warning
28
+
29
+ The full documentation has not been written yet.
30
+
31
+ This README provides a comprehensive overview of features, installation, and usage.
32
+
33
+ ## Features
34
+
35
+ * **Smart extraction:** Dynamically queries SPARQL endpoints, local Turtle files or LMDB hashmaps using Bidirectional
36
+ BFS (for paths between entities) or Radial Hop Expansion (for neighborhoods).
37
+ * **Automatic densification:** Analyzes the extracted subgraph and automatically connects disconnected components to
38
+ maximize connectivity and semantic richness.
39
+ * **Rich exports:** Outputs results in hierarchical JSON (preserving path traces) or RDF Turtle format, along with
40
+ detailed extraction statistics.
41
+ * **Resilience and hub management:** Implements robust error handling, automatic retries, and intelligent detection of
42
+ massive hub nodes to prevent endpoint overloads and timeouts.
43
+
44
+ ## Installation
45
+
46
+ ### As a Python Package (recommended)
47
+
48
+ You can install Graph Seeder directly from PyPI into your project's virtual environment:
49
+
50
+ ```bash
51
+ pip install graph-seeder
52
+ ```
53
+
54
+ ### For local development
55
+
56
+ If you want to clone the repository to modify the code locally:
57
+
58
+ ```bash
59
+ git clone https://github.com/YourOrg/graph-seeder.git
60
+ cd graph-seeder
61
+ uv sync # or: pip install -e .
62
+ ```
63
+
64
+ ## Configuration
65
+
66
+ The project is driven by a powerful configuration engine. You can either:
67
+
68
+ * Use built-in configuration templates (`dbpedia_default`, `wikidata_default`, `pgxlod_default`, `europeana_default` or
69
+ `default`)
70
+ * Give the path to your own custom `.json` configuration file. You can use the `generate-config` command to create a
71
+ template file with all available parameters and their default values, which you can then modify as needed.
72
+
73
+ You can also override any configuration parameter directly from the command line or via Python arguments, which will
74
+ take precedence over the config file values.
75
+
76
+ ## Usage
77
+
78
+ ### 1. Via Command Line Interface (CLI)
79
+
80
+ You can call `graph-seeder` directly from your terminal. Use the `--config` flag to specify your base configuration, and
81
+ append any overrides as key=value pairs.
82
+
83
+ ```bash
84
+ # Example 1: Using a built-in template with some overrides
85
+ graph-seeder --config dbpedia_default input_path=data/seeds.csv output_format=json output_path=results/my_graph
86
+
87
+ # Example 2: Using your own custom JSON configuration file
88
+ graph-seeder --config path/to/my_custom_config.json
89
+
90
+ # Example 3: Overriding deep parameters on the fly (takes precedence over the config)
91
+ graph-seeder --config wikidata_default max_hops=3 batch_size=50 request_delay=2.5 type=hashmap
92
+ ```
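The precedence rule for `key=value` overrides can be pictured as a plain dictionary merge. This is only a sketch of the idea: `parse_overrides` and `resolve_config` are hypothetical helpers, not part of Graph Seeder's API, and the sketch keeps override values as strings, whereas the real tool presumably converts them to the expected types.

```python
def parse_overrides(args):
    """Turn ["key=value", ...] into a dict (values are kept as strings)."""
    overrides = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {arg!r}")
        overrides[key] = value
    return overrides


def resolve_config(base, overrides):
    """Overrides win over the values loaded from the base configuration."""
    return {**base, **overrides}


base = {"max_hops": 2, "strategy": "bfs"}  # stand-in for a loaded template
config = resolve_config(base, parse_overrides(["max_hops=3", "batch_size=50"]))
```

Keys that are absent from the overrides keep their base value, so a template only needs the parameters that differ to be repeated on the command line.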
93
+
94
+ ### 2. Via Python API
95
+
96
+ You can import and use Graph Seeder directly in your Python scripts. The `SubgraphExtractor` accepts a `config` (which can
97
+ be a built-in template name or a path to a `.json` file) and uses keyword arguments for overrides.
98
+
99
+ ```python
100
+ from graph_seeder import SubgraphExtractor
101
+
102
+ # You can pass a built-in template name OR a path to a custom .json file.
103
+ # Any additional keyword arguments will override the base configuration.
104
+ extractor = SubgraphExtractor(
105
+     config="wikidata_default",  # Or "path/to/my_custom_config.json"
106
+     input_path="data/seeds.csv",  # Override: Input file
107
+     output_format="json",  # Override: Output format
108
+     output_path="results/my_graph",  # Override: Output destination
109
+     batch_size=50,  # Override: SPARQL batch size
110
+     max_hops=3  # Override: Maximum depth limit
111
+ )
112
+
113
+ extractor.run()
114
+ ```
115
+
116
+ ## Configuration reference
117
+
118
+ Here is the complete list of parameters you can configure (either in a `.json` config file or overridden in the
119
+ CLI/Python arguments).
120
+
121
+ ### Data
122
+
123
+ | Parameter | Type | Description |
124
+ |---------------------|-------|------------------------------------------------------------------------------------------------------------------------|
125
+ | `input_path` | `str` | Path to the input CSV file containing seed nodes. |
126
+ | `output_format` | `str` | Format of the extracted graph output (`json` or `ttl`). |
127
+ | `output_path` | `str` | Destination path and base filename for the extracted files. |
128
+ | `stats_output_path` | `str` | Path to save the extraction statistics in JSON format. |
129
+ | `turtle_path` | `str` | Path to a local `.ttl` file (if using local extraction instead of a SPARQL endpoint) to load a local knowledge graph. |
130
+ | `hashmap_path` | `str` | Path to a local LMDB hashmap (if using local extraction instead of a SPARQL endpoint) to load a local knowledge graph. |
131
+
132
+ ### Client / SPARQL
133
+
134
+ | Parameter | Type | Description |
135
+ |-------------------|---------|--------------------------------------------------------------------------------------------------------------------------------------------------------|
136
+ | `type` | `str` | Type of client to use for extraction (`sparql` for SPARQL endpoints, `turtle` to load a local Turtle file, or `hashmap` to load a local LMDB hashmap). |
137
+ | `endpoint` | `str` | URL of the SPARQL endpoint to query (e.g., `https://dbpedia.org/sparql`). |
138
+ | `user_agent` | `str` | HTTP User-Agent header (Highly recommended for some knowledge graphs like Wikidata to avoid blocks). |
139
+ | `request_delay` | `float` | Delay in seconds between consecutive requests to avoid server overload. |
140
+ | `retry_attempts` | `int` | Number of times to retry a failed HTTP request. |
141
+ | `retry_delay` | `float` | Delay in seconds before retrying a failed request. |
142
+ | `rate_limit_wait` | `float` | Time to wait in seconds when a rate limit (HTTP 429) is encountered. |
143
+ | `timeout` | `float` | Maximum time in seconds to wait for a server response. |
144
+
145
+ ### Extraction settings
146
+
147
+ | Parameter | Type | Description |
148
+ |-----------------------------|--------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
149
+ | `strategy` | `str` | Graph extraction algorithm: `bfs` (paths between pairs) or `hop` (radial expansion). |
150
+ | `batch_size` | `int` | Number of entities to process in a single SPARQL query. |
151
+ | `max_hops` | `int` | Maximum depth or distance from the seed nodes to explore. |
152
+ | `check_seeds_validity` | `bool` | Verify if seed nodes have valid URIs before starting. |
153
+ | `create_all_pairs` | `bool` | If True, generates all possible source/target pairs from a list of seeds. |
154
+ | `check_hub_seeds` | `bool` | Check the degree of seed nodes beforehand to identify massive hubs among them, then ask the user whether to keep or exclude them. |
155
+ | `keep_hub_seeds` | `bool` | Keep (`True`), skip (`False`), or prompt user (`None`) about massive hub seeds. |
156
+ | `max_neighbors_threshold` | `int` | Maximum number of neighbors allowed before a node is considered a massive hub. |
157
+ | `hub_pagination_threshold` | `int` | Number of neighbors at which the extractor starts paginating queries for a node, to avoid request timeouts on nodes with many neighbors. If this parameter is not specified, no pagination is used. |
158
+ | `hub_pairs_batch_size` | `int` | When paginating, number of pairs (node/property) to process in each batch. |
159
+ | `min_triplets_per_property` | `int` | Minimum number of triplets required per property to be kept when paginating. |
160
+
161
+ ### Densification
162
+
163
+ | Parameter | Type | Description |
164
+ |----------------------|--------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
165
+ | `skip_densification` | `bool` | Skip the post-extraction step that attempts to connect isolated subgraphs. |
166
+ | `densification_mode` | `str` | Strategy used to pick nodes to connect during densification (`most_connected` to choose to connect the most connected seeds in a connected component or `random` to choose a random seed). |
167
+
168
+ ### Graph filters
169
+
170
+ | Parameter | Type | Description |
171
+ |------------------------|-----------------|-------------------------------------------------------------------------------|
172
+ | `namespaces` | `dict` / `list` | Custom namespaces (CLI format: `prefix=URI`, e.g., `ex=http://example.com/`). |
173
+ | `include_uri_prefixes` | `list` | Only explore nodes whose URIs start with one of these prefixes. |
174
+ | `exclude_uri_prefixes` | `list` | Ignore nodes whose URIs start with any of these prefixes. |
175
+ | `exclude_properties` | `list` | Specific properties (URIs) to completely ignore during extraction. |
176
+ | `exclude_nodes` | `list` | Specific nodes (URIs) to completely ignore during extraction. |
177
+
178
+ ### Debug
179
+
180
+ | Parameter | Type | Description |
181
+ |-------------------|--------|-------------------------------------------------------------------------------------|
182
+ | `debug_enabled` | `bool` | Enable verbose debug-level logging in the console, used to display failed requests. |
183
+ | `request_logging` | `bool` | Log details of all SPARQL queries. |
184
+
185
+ ## Input dataset structure
186
+
187
+ The input must be a CSV file containing your seed entities using full URIs.
188
+
189
+ ### Path extraction (`strategy: bfs`)
190
+
191
+ Provide two columns representing the source and target entities to connect:
192
+
193
+ ```csv
194
+ seed,target
195
+ http://dbpedia.org/resource/Paris,http://dbpedia.org/resource/London
196
+ http://dbpedia.org/resource/Inria,http://dbpedia.org/resource/Computer_science
197
+ ```
198
+
199
+ ### Radial expansion (`strategy: hop`)
200
+
201
+ Provide a single column of seeds.
202
+
203
+ ```csv
204
+ seed
205
+ http://dbpedia.org/resource/Inria
206
+ http://dbpedia.org/resource/France
207
+ http://dbpedia.org/resource/Alan_Turing
208
+ ```
209
+
210
+ ## Architecture & Extraction Pipeline
211
+
212
+ The extraction workflow is divided into five major stages, each optimized to reduce endpoint load, improve reliability,
213
+ and maximize graph quality.
214
+
215
+ ### 1. Pre-processing and safety checks
216
+
217
+ Before starting the extraction, the `SubgraphExtractor` performs validation and safety checks to ensure that the
218
+ provided seeds are valid and to identify any potential issues that could arise during extraction.
219
+
220
+ #### Seed validation (`check_seeds_validity`)
221
+
222
+ The extractor sends validation queries to verify that each provided seed URI actually exists within the target Knowledge
223
+ Graph.
224
+
225
+ Invalid or unreachable entities are then displayed in a warning message and the extraction is stopped, so users can
226
+ correct their input before starting the full extraction.
227
+
228
+ #### Massive hub detection (`check_hub_seeds`)
229
+
230
+ Knowledge Graphs often contain highly connected entities such as:
231
+
232
+ * `United States`
233
+ * `Human`
234
+ * `English language`
235
+
236
+ These "super-hubs" may have millions of relationships and can easily trigger endpoint timeouts.
237
+
238
+ To prevent this:
239
+
240
+ 1. The extractor computes the exact degree (number of neighbors) of every seed node.
241
+ 2. If a seed node exceeds `max_neighbors_threshold`, a warning is raised.
242
+ 3. The user may:
243
+
244
+ * Remove the seed node from the extraction.
245
+ * Keep the seed node and continue.
246
+
247
+ When retained, the seed node is automatically added to a `forced_hubs` list, which forces the extractor to keep it
248
+ during the extraction phase, even if it exceeds the `max_neighbors_threshold`.
249
+
250
+ #### Pair generation (`create_all_pairs`)
251
+
252
+ When `create_all_pairs=True`, the extractor converts the input list of seeds into a complete set of source-target pairs.
253
+
254
+ For a list of **N** seeds, the number of generated pairs is:
255
+
256
+ $$
257
+ \frac{N(N-1)}{2}
258
+ $$
259
+
260
+ This allows users to easily extract paths between all combinations of a given set of entities without having to manually
261
+ create the pairs in the input CSV file, which yields a richer, more interconnected subgraph.
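The pair count can be checked with the standard library. This is a minimal sketch, assuming pairs are unordered as the formula implies; `all_pairs` is a hypothetical helper and not part of Graph Seeder's API.

```python
from itertools import combinations


def all_pairs(seeds):
    """Return every unordered (source, target) pair from a list of seeds."""
    # Drop duplicate seeds while keeping their original order.
    unique = list(dict.fromkeys(seeds))
    return list(combinations(unique, 2))


seeds = [
    "http://dbpedia.org/resource/Inria",
    "http://dbpedia.org/resource/France",
    "http://dbpedia.org/resource/Alan_Turing",
]
pairs = all_pairs(seeds)  # 3 seeds give 3 * 2 / 2 = 3 pairs
```

Because the count grows quadratically, 100 seeds already produce 4950 pairs, which is worth keeping in mind before enabling this option on a large seed list.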
262
+
263
+ ### 2. Graph exploration
264
+
265
+ Graph Seeder maintains an in-memory `networkx.MultiGraph` acting as a local cache, to avoid redundant queries and to
266
+ store the evolving graph structure.
267
+
268
+ Nodes are only fetched from the SPARQL endpoint when they become part of the active exploration frontier.
269
+
270
+ Depending on the extraction objective, one of two traversal strategies is used.
271
+
272
+ #### Bidirectional BFS (path finding)
273
+
274
+ To discover shortest paths between a source and a target entity, Graph Seeder employs a **Bidirectional Breadth-First
275
+ Search (BFS)**.
276
+
277
+ Instead of exploring from only one side, the algorithm simultaneously searches from both endpoints.
278
+
279
+ At each iteration, the algorithm compares:
280
+
281
+ * `q_src`: source frontier size
282
+ * `q_tgt`: target frontier size
283
+
284
+ Only the smaller frontier is expanded, to reduce the number of SPARQL queries and memory usage.
285
+
286
+ #### Stopping
287
+
288
+ The search terminates immediately when the two frontiers intersect.
289
+ However, there are two cases where the search will stop without finding a path:
290
+
291
+ 1. If either frontier exceeds `max_hops` without intersection, the search is abandoned to prevent potential infinite
292
+ loops.
293
+ 2. If one frontier is completely exhausted (no more nodes to explore) before intersection, the search is also stopped.
294
+ This happens when the source and target are in disconnected components of the graph, due to missing links or filtered
295
+ nodes and properties.
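The traversal and its three stopping conditions can be sketched over an in-memory adjacency dictionary. This is an illustration under stated assumptions, not the package's implementation: `neighbors` stands in for the real data source, `max_hops` is interpreted here as the total number of expanded layers, and the function returns the first connecting path it meets, which is short but not guaranteed to be the shortest.

```python
def bidirectional_bfs(neighbors, source, target, max_hops):
    """Return a path from source to target, or None if none is found.

    `neighbors` maps each node to an iterable of adjacent nodes
    (an undirected, in-memory stand-in for the queried graph).
    """
    if source == target:
        return [source]
    # Parent maps double as the visited set of each side.
    parents_src, parents_tgt = {source: None}, {target: None}
    q_src, q_tgt = [source], [target]
    hops = 0
    # Stop when a frontier is exhausted or the hop budget is used up.
    while q_src and q_tgt and hops < max_hops:
        expand_src = len(q_src) <= len(q_tgt)  # expand the smaller frontier
        frontier = q_src if expand_src else q_tgt
        parents = parents_src if expand_src else parents_tgt
        others = parents_tgt if expand_src else parents_src
        next_frontier = []
        for node in frontier:
            for nb in neighbors.get(node, ()):
                if nb in parents:
                    continue
                parents[nb] = node
                if nb in others:  # the two searches have met
                    return _join(parents_src, parents_tgt, nb)
                next_frontier.append(nb)
        if expand_src:
            q_src = next_frontier
        else:
            q_tgt = next_frontier
        hops += 1
    return None


def _join(parents_src, parents_tgt, meeting):
    """Rebuild source -> meeting -> target from the two parent maps."""
    path, node = [], meeting
    while node is not None:
        path.append(node)
        node = parents_src[node]
    path.reverse()
    node = parents_tgt[meeting]
    while node is not None:
        path.append(node)
        node = parents_tgt[node]
    return path


graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"], "X": []}
path = bidirectional_bfs(graph, "A", "D", max_hops=3)
```

In this toy graph `X` is isolated, so a search between `A` and `X` ends with an exhausted frontier rather than a hop-limit stop.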
296
+
297
+ #### Radial Hop Expansion (neighborhood extraction)
298
+
299
+ For neighborhood extraction (`strategy="hop"`), the graph is expanded radially around the seed entities.
300
+
301
+ The exploration proceeds layer-by-layer until reaching `max_hops`.
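A layer-by-layer expansion can be sketched as follows. The function name and the adjacency dictionary are assumptions made for illustration; the real extractor fetches each layer from the configured data source instead.

```python
def hop_expansion(neighbors, seeds, max_hops):
    """Map every node within `max_hops` of a seed to its hop distance."""
    depth = {seed: 0 for seed in seeds}
    frontier = list(depth)
    for hop in range(1, max_hops + 1):
        next_frontier = []
        for node in frontier:
            for nb in neighbors.get(node, ()):
                if nb not in depth:  # first visit gives the hop distance
                    depth[nb] = hop
                    next_frontier.append(nb)
        if not next_frontier:  # nothing new to explore
            break
        frontier = next_frontier
    return depth


chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
reached = hop_expansion(chain, ["A"], max_hops=2)
```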
302
+
303
+ ### 3. Graph wrapper and SPARQL querying
304
+
305
+ The wrapper component is responsible for all interactions with the underlying data source, whether it's a SPARQL
306
+ endpoint, a local Turtle file, or an LMDB hashmap.
307
+
308
+ ### Wrapper design
309
+
310
+ - **`NeighborhoodWrapper` (The Interface):** An abstract base class that contains the configuration parameters (such as
311
+ `max_neighbors_threshold`, URI filters, excluded properties, and the `forced_hubs` registry). It defines the core
312
+ methods that any concrete wrapper implementation must provide:
313
+   - `check_seeds_validity()`
314
+   - `count_neighbors()`
315
+   - `get_neighborhood()`
316
+
317
+ - **`GraphWrapper`:** The concrete class that extends this interface, providing the SPARQL-specific logic, batching
318
+ mechanisms, and fault-tolerance required to safely explore the graph.
319
+
320
+ ### Safety checks and hub management
321
+
322
+ Before extracting paths, `GraphWrapper` performs the safety checks and hub management steps mentioned in the
323
+ pre-processing stage of the pipeline.
324
+
325
+ #### Seed validation
326
+
327
+ Using `check_seeds_validity()`, it processes input seeds in batches. If a batch validation query fails, the wrapper
328
+ automatically applies a dichotomy split to isolate the specific problematic entity and displays it in the console for
329
+ user correction.
330
+
331
+ #### Hub detection
332
+
333
+ Using `count_neighbors()`, it constructs a mapping of seed nodes to their degree (number of neighbors):
334
+
335
+ ```text
336
+ node_uri → number_of_neighbors
337
+ ```
338
+
339
+ This mapping is then used to identify massive seeds that exceed the `max_neighbors_threshold`. The user is prompted to
340
+ decide whether to keep or exclude each hub seed, and the decision is stored in the `forced_hubs` registry for later
341
+ reference.
342
+
343
+ ### Two-phase neighborhood extraction
344
+
345
+ Then, using `get_neighborhood()`, the wrapper executes a two-phase extraction process for each node in the active
346
+ frontier:
347
+
348
+ #### Step 1: Property statistics retrieval
349
+
350
+ Before pulling any actual edges, the wrapper executes a metadata query for the current batch of nodes. It retrieves
351
+ every property connected to those nodes and their occurrence counts.
352
+
353
+ #### Step 2: Node classification
354
+
355
+ Based on these statistics, each node is dynamically routed into one of three execution paths:
356
+
357
+ 1. **Skipped Nodes**
358
+ - If a node's total neighbors exceed `max_neighbors_threshold` (and it wasn't manually forced by the user), it is
359
+ completely skipped.
360
+ - This prevents timeouts on queries containing extreme global hubs (such as *United States* or *Human*).
361
+
362
+ 2. **Safe Nodes (Standard)**
363
+ - If a node's degree is below the `hub_pagination_threshold`, it is considered safe.
364
+ - It is grouped with other safe nodes, and their entire neighborhoods are fetched in a single query.
365
+
366
+ 3. **Hub Nodes (Pagination)**
367
+ - If a node exceeds the `hub_pagination_threshold`, a specialized property-by-property extraction is triggered.
368
+ - Properties yielding fewer than `min_triplets_per_property` are ignored to focus on the most semantically relevant
369
+ edges.
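The routing rule can be summarized in a few lines. This is a sketch, not the wrapper's code: `classify_node` is a hypothetical function, and treating a degree equal to `hub_pagination_threshold` as a hub is an assumption based on the parameter description.

```python
def classify_node(uri, degree, max_neighbors_threshold,
                  hub_pagination_threshold=None, forced_hubs=frozenset()):
    """Return "skipped", "paginated" or "safe" for one frontier node."""
    # Extreme hubs are dropped unless the user explicitly kept them.
    if degree > max_neighbors_threshold and uri not in forced_hubs:
        return "skipped"
    # Without a pagination threshold, no node is ever paginated.
    if hub_pagination_threshold is not None and degree >= hub_pagination_threshold:
        return "paginated"
    return "safe"


route = classify_node("http://example.com/hub", 500,
                      max_neighbors_threshold=1000,
                      hub_pagination_threshold=100)
```

A forced hub whose degree is above `max_neighbors_threshold` falls through to the pagination branch in this sketch, which seems the safest way to fetch it.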
370
+
371
+ ### Dichotomy error handling
372
+
373
+ SPARQL endpoints occasionally fail due to HTTP 500 errors, query timeouts or temporary server overload.
374
+
375
+ When a query fails, Graph Seeder does not discard the operation.
376
+ Instead, it recursively divides the input batch into two halves of roughly equal size:
377
+
378
+ ```text
379
+ [ A B C D E F ] (Fails)
380
+
381
+ [ A B C ] (Succeeds) + [ D E F ] (Fails)
382
+
383
+ [ D ] + [ E F ] (Succeeds)
384
+ ```
385
+
386
+ Each subset is executed independently and the process continues until either:
387
+
388
+ - A successful query is obtained, or
389
+ - The subset size reaches a single item.
390
+
391
+ This allows the extractor to isolate problematic entities or properties without discarding the entire batch.
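The splitting logic can be sketched as a short recursive function. The names below are hypothetical, and `fake_query` simulates an endpoint that fails whenever the batch contains `D`; a real client would catch its own specific error types rather than a bare `Exception`.

```python
def run_with_dichotomy(items, run_query):
    """Run `run_query` on `items`, halving the batch whenever it fails.

    Returns (results, failed): the results of the successful sub-batches
    and the single items that still fail on their own.
    """
    if not items:
        return [], []
    try:
        return list(run_query(items)), []
    except Exception:
        if len(items) == 1:
            return [], list(items)  # problematic item isolated
        mid = len(items) // 2
        left_ok, left_bad = run_with_dichotomy(items[:mid], run_query)
        right_ok, right_bad = run_with_dichotomy(items[mid:], run_query)
        return left_ok + right_ok, left_bad + right_bad


def fake_query(batch):
    # Stand-in for a SPARQL call: any batch containing "D" fails.
    if "D" in batch:
        raise RuntimeError("simulated endpoint error")
    return [item.lower() for item in batch]


results, failed = run_with_dichotomy(list("ABCDEF"), fake_query)
```

With this input the sketch follows the same splits as the diagram: `[A B C]` and `[E F]` succeed, and `D` ends up alone in the failed list.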
392
+
393
+ #### Traffic control
394
+
395
+ The underlying `SparqlClient` automatically manages endpoint throttling.
396
+
397
+ Features include:
398
+
399
+ * configurable request delays (`request_delay`),
400
+ * automatic retries (`retry_attempts`),
401
+ * retry backoff (`retry_delay`),
402
+ * HTTP 429 rate-limit handling (`rate_limit_wait`),
403
+ * configurable query timeouts (`timeout`).
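How `retry_attempts` and `retry_delay` interact can be illustrated with a small retry loop. This is a simplified sketch, not the `SparqlClient` code: `call_with_retries` is a hypothetical helper, it assumes one initial attempt plus `retry_attempts` retries, and it leaves out rate-limit and timeout handling.

```python
import time


def call_with_retries(send, retry_attempts=3, retry_delay=1.0, sleep=time.sleep):
    """Call `send()` until it succeeds or the retries are used up."""
    for attempt in range(retry_attempts + 1):
        try:
            return send()
        except Exception:
            if attempt == retry_attempts:
                raise  # out of retries: surface the last error
            sleep(retry_delay)  # wait before the next attempt


calls, waits = [], []


def flaky():
    # Fails twice, then succeeds.
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("temporary failure")
    return "ok"


outcome = call_with_retries(flaky, retry_attempts=3, retry_delay=0.5,
                            sleep=waits.append)
```

Passing `sleep` as a parameter keeps the example instantaneous: the waits are recorded in a list instead of actually pausing.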
404
+
405
+ ### 4. Graph densification
406
+
407
+ After the initial extraction phase, the `GraphConnector` analyzes the resulting graph topology.
408
+
409
+ #### Connected Component Analysis
410
+
411
+ The graph is decomposed into its connected components. If multiple disconnected subgraphs are detected, Graph Seeder
412
+ attempts to reconnect them automatically.
413
+
414
+ For each disconnected component, a representative node is selected according to the chosen
415
+ `densification_mode`:
416
+
417
+ * `most_connected`: the seed with the highest degree (most neighbors) is selected as the representative for that
418
+ component.
419
+ * `random`: a random seed is selected as the representative for that component.
420
+
421
+ A Bidirectional BFS is then executed between representatives of disconnected components. When a connecting path is
422
+ found, the corresponding triples are added to the graph.
423
+
424
+ This process iterates until either all components are connected or all pairs of representatives have been exhausted
425
+ without finding a path between some of them.
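The component analysis and the choice of a representative can be sketched in plain Python. Both functions are hypothetical and assume a symmetric (undirected) adjacency dictionary in which every node appears as a key; the package itself works on its NetworkX graph.

```python
import random


def connected_components(adjacency):
    """Return the connected components as a list of node sets."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency.get(node, ()))
        seen |= component
        components.append(component)
    return components


def pick_representative(component, seeds, adjacency,
                        mode="most_connected", rng=random):
    """Pick one seed of the component, or None if it contains no seed."""
    candidates = sorted(component & set(seeds))
    if not candidates:
        return None
    if mode == "random":
        return rng.choice(candidates)
    # most_connected: the seed with the highest degree.
    return max(candidates, key=lambda n: len(adjacency.get(n, ())))


adjacency = {"A": ["B", "C"], "B": ["A"], "C": ["A"], "X": ["Y"], "Y": ["X"]}
components = connected_components(adjacency)
representative = pick_representative(components[0], ["A", "B"], adjacency)
```

The representatives chosen this way are the endpoints between which the connecting searches would then be run.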
426
+
427
+ ### 5. Graph export and statistics
428
+
429
+ Once extraction and densification are complete, the `GraphExporter` and `GraphStatistics` modules generate the final
430
+ outputs.
431
+
432
+ #### Graph export
433
+
434
+ The resulting graph can be exported either as:
435
+
436
+ * Hierarchical JSON preserving source-target path traces.
437
+ * RDF Turtle (`.ttl`) files.
438
+
439
+ The network graph containing all the retrieved triples during the extraction phase is also saved in a NetworkX `gpickle`
440
+ format.
441
+
442
+ #### Statistical report
443
+
444
+ A complete extraction report is generated in JSON format containing metrics such as:
445
+
446
+ * number of traversed triples,
447
+ * number of unique triples,
448
+ * number of unique subjects,
449
+ * number of unique predicates,
450
+ * number of unique objects,
451
+ * number of unique entities,
452
+ * number of connected components,
453
+ * mean component size,
454
+ * standard deviation of component sizes.
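The component-level metrics can be reproduced with the standard library. This is only a sketch: the key names are illustrative rather than the report's actual fields, and using the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def component_size_stats(components):
    """Summarize component sizes from a list of node sets."""
    sizes = [len(component) for component in components]
    return {
        "connected_components": len(sizes),
        "mean_component_size": mean(sizes) if sizes else 0.0,
        # Population standard deviation (assumption, see above).
        "std_component_size": pstdev(sizes) if sizes else 0.0,
    }


stats = component_size_stats([{"A", "B", "C"}, {"X"}])
```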