graphembed-rs 0.1.0__cp311-cp311-macosx_11_0_arm64.whl
graphembed/__init__.py
ADDED
graphembed/__init__.pyi
ADDED
@@ -0,0 +1,79 @@
|
|
1
|
+
from typing import Optional
|
2
|
+
|
3
|
+
# ---------- Embedding ----------
|
4
|
+
def embed_hope_rank(
|
5
|
+
csv: str,
|
6
|
+
symetric: bool,
|
7
|
+
target_rank: int,
|
8
|
+
nbiter: int,
|
9
|
+
output: Optional[str] = None,
|
10
|
+
) -> None: ...
|
11
|
+
def embed_hope_precision(
|
12
|
+
csv: str,
|
13
|
+
symetric: bool,
|
14
|
+
epsil: float,
|
15
|
+
maxrank: int,
|
16
|
+
blockiter: int,
|
17
|
+
output: Optional[str] = None,
|
18
|
+
) -> None: ...
|
19
|
+
def embed_sketching(
|
20
|
+
csv: str,
|
21
|
+
symetric: bool,
|
22
|
+
decay: float,
|
23
|
+
dim: int,
|
24
|
+
nbiter: int,
|
25
|
+
output: Optional[str] = None,
|
26
|
+
) -> None: ...
|
27
|
+
|
28
|
+
# ---------- Validation (returns mean AUC) ----------
|
29
|
+
def validate_hope_rank(
|
30
|
+
csv: str,
|
31
|
+
symetric: bool,
|
32
|
+
target_rank: int,
|
33
|
+
nbiter: int,
|
34
|
+
nbpass: int = 10,
|
35
|
+
skip_frac: float = 0.1,
|
36
|
+
centric: bool = False,
|
37
|
+
) -> float: ...
|
38
|
+
def validate_hope_precision(
|
39
|
+
csv: str,
|
40
|
+
symetric: bool,
|
41
|
+
epsil: float,
|
42
|
+
maxrank: int,
|
43
|
+
blockiter: int,
|
44
|
+
nbpass: int = 10,
|
45
|
+
skip_frac: float = 0.1,
|
46
|
+
centric: bool = False,
|
47
|
+
) -> float: ...
|
48
|
+
def validate_sketching(
|
49
|
+
csv: str,
|
50
|
+
symetric: bool,
|
51
|
+
decay: float,
|
52
|
+
dim: int,
|
53
|
+
nbiter: int,
|
54
|
+
nbpass: int = 10,
|
55
|
+
skip_frac: float = 0.1,
|
56
|
+
centric: bool = False,
|
57
|
+
) -> float: ...
|
58
|
+
|
59
|
+
# ---------- VCMPR (precision/recall curves) ----------
|
60
|
+
def estimate_vcmpr_hope_rank(
|
61
|
+
csv: str,
|
62
|
+
symetric: bool,
|
63
|
+
target_rank: int,
|
64
|
+
nbiter: int,
|
65
|
+
nbpass: int = 2,
|
66
|
+
topk: int = 10,
|
67
|
+
skip_frac: float = 0.1,
|
68
|
+
) -> None: ...
|
69
|
+
def estimate_vcmpr_sketching(
|
70
|
+
csv: str,
|
71
|
+
symetric: bool,
|
72
|
+
decay: float,
|
73
|
+
dim: int,
|
74
|
+
nbiter: int,
|
75
|
+
nbpass: int = 2,
|
76
|
+
topk: int = 10,
|
77
|
+
skip_frac: float = 0.1,
|
78
|
+
) -> None: ...
|
79
|
+
|
Binary file
|
graphembed/py.typed
ADDED
File without changes
|
@@ -0,0 +1,193 @@
|
|
1
|
+
Metadata-Version: 2.4
|
2
|
+
Name: graphembed_rs
|
3
|
+
Version: 0.1.0
|
4
|
+
Summary: Python bindings for the high‑performance Rust graph/network embedding library graphembed
|
5
|
+
Keywords: graph,embedding,hash
|
6
|
+
Author: Jianshu Zhao
|
7
|
+
Author-email: jeanpierre.both@gmail.com, jianshuzhao@yahoo.com
|
8
|
+
License: MIT OR Apache-2.0
|
9
|
+
Requires-Python: >=3.8
|
10
|
+
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
|
11
|
+
Project-URL: Source Code, https://github.com/jean-pierreBoth/graphembed
|
12
|
+
|
13
|
+
# Graphembed
|
14
|
+
|
15
|
+
This crate provides, as a library and an executable, embedding of directed or undirected graphs with positively weighted edges.
|
16
|
+
|
17
|
+
|
18
|
+
- For simple graphs, without data attached to nodes/labels, there are two Rust modules, **nodesketch** and **atp**. A simple executable with a validation option based on link prediction is also provided.
|
19
|
+
|
20
|
+
- To complement the embeddings, we also provide core decomposition of graphs (see the module **structure**). We try to analyze how Orkut communities are preserved through an embedding (see the Notebooks directory).
|
21
|
+
|
22
|
+
- The module **gkernel** is dedicated to graphs with discrete labels attached to nodes/edges. We use the *petgraph* crate for graph description.
|
23
|
+
The algorithm is based on an extension of the hashing strategy used in the module **nodesketch**.
|
24
|
+
In the undirected case, this module also computes a global embedding vector for the whole graph. **It is still in an early version**.
|
25
|
+
|
26
|
+
## Methods
|
27
|
+
|
28
|
+
### The embedding algorithms used in this crate are based on the following papers
|
29
|
+
|
30
|
+
- **nodesketch**
|
31
|
+
|
32
|
+
*NodeSketch: Highly-Efficient Graph Embeddings via Recursive Sketching, KDD 2019*. See [nodesketch](https://dl.acm.org/doi/10.1145/3292500.3330951)
|
33
|
+
D. Yang, P. Rosso, B. Li, P. Cudré-Mauroux.
|
34
|
+
|
35
|
+
It is based on multi-hop neighbourhood identification via locality-sensitive hashing, using the recent algorithm **probminhash**. See [arxiv](https://arxiv.org/abs/1911.00675) or [ieee-2022](https://ieeexplore.ieee.org/document/9185081).
|
36
|
+
|
37
|
+
The algorithm associates a probability distribution on neighbours of each point depending on edge weights and distance to the point.
|
38
|
+
This distribution is then hashed to build a (discrete) embedding vector consisting of node identifiers.
|
39
|
+
The distance between embedded vectors is the Jaccard distance so we get
|
40
|
+
a real distance on the embedding space for the symmetric embedding.
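
As an illustration of a distance between two discrete embedding vectors, here is a minimal sketch in Python. It assumes that both sketches have the same length and that the Jaccard similarity is estimated as the fraction of positions holding the same node identifier, which is the usual estimator for MinHash-style sketches. The function name and the sample sketches are hypothetical and are not part of the crate's API.

```python
def sketch_jaccard_distance(sketch_a, sketch_b):
    """Estimate the Jaccard distance between two equal-length sketches.

    Each sketch is a sequence of node identifiers. The fraction of
    positions holding the same identifier estimates the similarity.
    """
    if len(sketch_a) != len(sketch_b) or not sketch_a:
        raise ValueError("sketches must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return 1.0 - matches / len(sketch_a)


# Hypothetical sketches of dimension 8 for two nodes.
node_u = [3, 7, 7, 12, 5, 3, 9, 1]
node_v = [3, 7, 2, 12, 5, 8, 9, 4]
print(sketch_jaccard_distance(node_u, node_v))  # 3 mismatches out of 8 -> 0.375
```

With this estimator, identical sketches are at distance 0 and sketches that agree nowhere are at distance 1.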
|
41
|
+
|
42
|
+
An extension of the paper is also implemented to get an asymmetric embedding for directed graphs. The similarity is also based on the hash of sets (nodes going to or from a given node), but the dissimilarity is then no longer a distance (no symmetry, and some discrepancy with the triangle inequality).
|
43
|
+
|
44
|
+
**With this algorithm, the Orkut graph (3 million nodes, 100 million edges) is embedded in 5 minutes on a 24-core i9 laptop, giving an AUC of 0.95**.
|
45
|
+
|
46
|
+
|
47
|
+
- **atp**
|
48
|
+
|
49
|
+
*Asymmetric Transitivity Preserving Graph Embedding, 2016*.
|
50
|
+
M. Ou, P. Cui, J. Pei, Z. Zhang and W. Zhu. See [hope](https://dl.acm.org/doi/10.1145/2939672.2939751).
|
51
|
+
|
52
|
+
The objective is to provide an asymmetric graph embedding and to get an estimate of the precision of the embedding as a function of its dimension.
|
53
|
+
|
54
|
+
We use the Adamic-Adar matrix representation of the graph. (Note that weighting a node by the number of pairs it joins is called Resource Allocation in the Graph Kernel literature.)
|
55
|
+
The asymmetric embedding is obtained from the left and right singular vectors of the Adamic-Adar representation of the graph.
|
56
|
+
Source nodes are related to the left singular vectors and target nodes to the right ones.
|
57
|
+
The similarity measure is the dot product, so it is not a norm.
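
To make the asymmetry concrete, here is a minimal sketch in Python. It assumes each node carries one source vector and one target vector, and scores a directed edge by the dot product of the source vector of its origin with the target vector of its destination. The vectors and the `similarity` helper are hypothetical illustrations, not output of the crate.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical 2-dimensional embeddings: one source and one target vector per node.
source = {"a": [1.0, 0.0], "b": [0.0, 0.5]}
target = {"a": [0.0, 1.0], "b": [2.0, 0.0]}


def similarity(i, j):
    """Score for the directed edge i -> j."""
    return dot(source[i], target[j])


print(similarity("a", "b"))  # 2.0
print(similarity("b", "a"))  # 0.5
```

The two directions get different scores, and a node is not necessarily most similar to itself (here `similarity("a", "a")` is 0.0), which is why this measure does not behave like a distance.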
|
58
|
+
The SVD is approximated by randomization, as described in Halko-Tropp 2011 and implemented in the [annembed crate](https://crates.io/crates/annembed).
|
59
|
+
|
60
|
+
### The core decomposition algorithms
|
61
|
+
|
62
|
+
- **Density-friendly decomposition**
|
63
|
+
|
64
|
+
*Large Scale Density-friendly Graph Decomposition via Convex Programming, 2017*
|
65
|
+
M. Danisch, T.-H. Hubert Chan and M. Sozio
|
66
|
+
|
67
|
+
The decomposition of the graph into maximally dense groups of nodes is implemented and used to assess the quality of the embeddings in a structural way. See the module *validation* and the comments on the embedding of the *Orkut* graph, where we can use the community data provided with the graph to analyze the behaviour of embedded edge lengths.
|
68
|
+
|
69
|
+
In particular, it is shown that:
|
70
|
+
- Embedded edges internal to a community are consistently shorter than embedded edges crossing a block boundary.
|
71
|
+
- The transition probabilities of edges from one block to another are similar (low Kullback-Leibler divergence) in the original graph and in the embedded graph.
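
The comparison of block transition probabilities can be sketched as follows. This is a minimal Python illustration of the Kullback-Leibler divergence between two discrete distributions. The two rows of transition probabilities are hypothetical values, not results from the Orkut analysis.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf  # p puts mass where q has none
            total += pi * math.log(pi / qi)
    return total


# Hypothetical transition probabilities from one block to three blocks.
original_row = [0.7, 0.2, 0.1]
embedded_row = [0.6, 0.3, 0.1]
print(kl_divergence(original_row, embedded_row))
```

A value close to 0 means the embedded graph sends edges between blocks in nearly the same proportions as the original graph.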
|
72
|
+
|
73
|
+
See results in [orkut.md](./orkut.md) and examples directory together with a small Rust notebook in directory [Notebooks](./Notebooks/orkutrs.ipynb)
|
74
|
+
|
75
|
+
## Validation
|
76
|
+
|
77
|
+
The quality of embeddings is assessed via standard AUC with random deletion of edges. See the documentation in the *link* module and the *embed* binary.
|
78
|
+
We also give a variation based on node-centric quality assessment, as explained at [cauc](http://github.com/jean-pierreBoth/linkauc).
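
As a reminder of what the AUC measures in a link prediction task, here is a minimal sketch in Python. It assumes the standard pairwise definition: the probability that a deleted (true) edge gets a higher score than a randomly chosen non-edge, with ties counted as one half. The function name and the scores are hypothetical, and the crate's own sampling scheme may differ.

```python
def link_prediction_auc(positive_scores, negative_scores):
    """Fraction of (deleted edge, non-edge) pairs ranked correctly, ties counting 0.5."""
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical similarity scores after embedding the graph with some edges removed.
deleted_edges = [0.9, 0.8, 0.4]
non_edges = [0.5, 0.3, 0.1]
print(link_prediction_auc(deleted_edges, non_edges))  # 8 of 9 pairs ranked correctly
```

An AUC of 0.5 corresponds to random scoring, and 1.0 to every deleted edge being scored above every non-edge.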
|
79
|
+
## Some data sets
|
80
|
+
|
81
|
+
### Without labels
|
82
|
+
|
83
|
+
Small datasets are given in the Data subdirectory (with 7z compression) to run tests.
|
84
|
+
Larger datasets can be downloaded from the SNAP data collections <https://snap.stanford.edu/data>
|
85
|
+
|
86
|
+
#### Some small test graphs are provided in a Data subdirectory
|
87
|
+
|
88
|
+
- Symmetric graphs
|
89
|
+
|
90
|
+
- Les Misérables <http://konect.cc/networks/moreno_lesmis>.
|
91
|
+
This is the graph of co-occurrence of characters in Victor Hugo's novel 'Les Misérables'.
|
92
|
+
|
93
|
+
- Asymmetric graphs
|
94
|
+
|
95
|
+
- wiki-vote <https://snap.stanford.edu/data/wiki-Vote.html>
|
96
|
+
7115 nodes 103689 edges
|
97
|
+
|
98
|
+
- Cora : <http://konect.cc/networks/subelj_cora>
|
99
|
+
citation network 23166 nodes 91500 edges
|
100
|
+
|
101
|
+
#### Some larger data sets for users to download
|
102
|
+
|
103
|
+
These graphs were used in the results given below.
|
104
|
+
|
105
|
+
Beware of the possible need to convert from Windows to Linux End Of Line, see the dos2unix utility.
|
106
|
+
Some data may need to be converted from TSV to CSV format before being read by the program.
|
107
|
+
|
108
|
+
- Symmetric
|
109
|
+
|
110
|
+
- Amazon. Nodes: 334 863 Edges: 925 872 <https://snap.stanford.edu/data/amazon0601.html>
|
111
|
+
- youtube. Nodes: 1 134 890 Edges: 2 987 624 <https://snap.stanford.edu/data/com-Youtube.html>
|
112
|
+
- orkut. Nodes: 3 072 441 Edges: 117 185 083 <https://snap.stanford.edu/data/com-Orkut.html>
|
113
|
+
|
114
|
+
- Asymmetric
|
115
|
+
|
116
|
+
- twitter as tested in Hope <http://konect.cc/networks/munmun_twitter_social>
|
117
|
+
465017 nodes 834797 edges
|
118
|
+
|
119
|
+
## Some results
|
120
|
+
|
121
|
+
### Results for the *atp* and *nodesketch* modules
|
122
|
+
|
123
|
+
Embedding and link prediction evaluation for the above data sets are given in file [resultats.md](./resultats.md)
|
124
|
+
A more global analysis of the embedding with the nodesketch module is done for the orkut graph in file [orkut.md](./orkut.md)
|
125
|
+
|
126
|
+
A preliminary node-centric quality estimation is provided in the validation module (see the documentation in validation::link).
|
127
|
+
|
128
|
+
### Some qualitative comments
|
129
|
+
|
130
|
+
- For the embedding using the randomized SVD, increasing the embedding dimension is worthwhile as long as the corresponding singular values continue to decrease significantly.
|
131
|
+
|
132
|
+
- The munmun_twitter_social graph shows that treating a directed graph as an undirected graph gives significantly different results in terms of link prediction AUC.
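
The first comment above (keep adding dimensions while the spectrum still decays) can be sketched as a simple stopping rule. This is a hypothetical rule written in Python for illustration only: it keeps a dimension as long as its singular value is at least `epsil` times the largest one, up to `maxrank`. The names `epsil` and `maxrank` are borrowed from the Python stub for readability. Whether the library applies exactly this criterion is an assumption.

```python
def rank_for_precision(singular_values, epsil, maxrank):
    """Number of leading singular values kept under a relative threshold.

    singular_values must be sorted in decreasing order. A value is kept
    while it is >= epsil * largest value, and at most maxrank are kept.
    """
    if not singular_values:
        return 0
    threshold = epsil * singular_values[0]
    for k, sigma in enumerate(singular_values):
        if k >= maxrank:
            return maxrank
        if sigma < threshold:
            return k
    return min(len(singular_values), maxrank)


# Hypothetical spectrum of a graph proximity matrix.
spectrum = [10.0, 5.0, 1.0, 0.05, 0.01]
print(rank_for_precision(spectrum, epsil=0.01, maxrank=10))  # 3
print(rank_for_precision(spectrum, epsil=0.01, maxrank=2))   # 2
```

In this example the fourth singular value falls below 1% of the first, so dimensions beyond the third add little to the approximation.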
|
133
|
+
|
134
|
+
|
135
|
+
|
136
|
+
|
137
|
+
## Generalized SVD
|
138
|
+
|
139
|
+
An implementation of the Generalized SVD comes as a by-product in module [gsvd](./src/atp/gsvd.rs).
|
140
|
+
|
141
|
+
## Installation and Usage
|
142
|
+
|
143
|
+
### Installation
|
144
|
+
|
145
|
+
The crate provides features (with a default configuration), required by the *annembed* dependency, to specify which version of LAPACK to use and which SIMD implementation to choose.
|
146
|
+
- For example, compilation is done with:
|
147
|
+
*cargo build --release --features="openblas-system"* to use a dynamic link with openblas.
|
148
|
+
Choosing one feature is mandatory, in order to provide the required linear algebra library.
|
149
|
+
- On Intel, the simdeez_f feature can be used. On other CPUs, the stdsimd feature can be chosen, but it requires a compiler version >= 1.79.
|
150
|
+
|
151
|
+
### Usage
|
152
|
+
|
153
|
+
The documentation of the embed module can be generated with the standard command: cargo doc --no-deps --bin embed.
|
154
|
+
|
155
|
+
- The HOPE embedding relies on matrix computations, which limits the size of the graph to a few hundred thousand nodes.
|
156
|
+
It is intrinsically asymmetric in nature. It nevertheless gives access to the spectrum of the Adamic-Adar matrix representing the graph and
|
157
|
+
so to the required dimension to get a valid embedding in $R^{n}$.
|
158
|
+
|
159
|
+
- The Sketching embedding is much faster for large graphs, but it embeds in a space consisting of sequences of node ids equipped with the Jaccard distance. It is particularly efficient on low-degree graphs.
|
160
|
+
|
161
|
+
- The *embed* module takes embedding and possibly validation commands (link prediction task) in one directive.
|
162
|
+
The general syntax is :
|
163
|
+
|
164
|
+
embed file_description [validation_command --validation_arguments] sketching mode --embedding_arguments
|
165
|
+
for example:
|
166
|
+
|
167
|
+
For a symmetric graph we get:
|
168
|
+
|
169
|
+
- just embedding:
|
170
|
+
embed --csv ./Data/Graphs/Orkut/com-orkut.ungraph.txt --symetric sketching --decay 0.2 --dim 200 --nbiter
|
171
|
+
|
172
|
+
- embedding and validation:
|
173
|
+
|
174
|
+
embed --csv ./Data/Graphs/Orkut/com-orkut.ungraph.txt --symetric validation --nbpass 5 --skip 0.15 sketching --decay 0.2 --dim 200 --nbiter 5
|
175
|
+
|
176
|
+
For an asymmetric graph we get:
|
177
|
+
|
178
|
+
embed --csv ./Data/Graphs/asymetric.csv validation --nbpass 5 --skip 0.15 sketching --decay 0.2 --dim 200 --nbiter 5
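
The symmetric validation command above can presumably be mirrored from Python through `validate_sketching`, as declared in `graphembed/__init__.pyi`. The sketch below relies only on that stub. Passing the parameters as keyword arguments and mapping `--skip` to `skip_frac` are assumptions, and the `run_validation` helper is hypothetical. The call is skipped when the package or the data file is not available.

```python
import importlib.util
import os

# Path taken from the command-line example; adjust it to a local file.
CSV_PATH = "./Data/Graphs/Orkut/com-orkut.ungraph.txt"

# Mirrors: --symetric validation --nbpass 5 --skip 0.15 sketching --decay 0.2 --dim 200 --nbiter 5
validation_args = dict(
    csv=CSV_PATH,
    symetric=True,   # parameter name spelled as in the stub
    decay=0.2,
    dim=200,
    nbiter=5,
    nbpass=5,
    skip_frac=0.15,  # assumed to correspond to --skip
    centric=False,
)


def run_validation(args):
    """Return the mean AUC, or None when the package or the data is unavailable."""
    if importlib.util.find_spec("graphembed") is None:
        return None
    if not os.path.exists(args["csv"]):
        return None
    import graphembed

    return graphembed.validate_sketching(**args)


auc = run_validation(validation_args)
print("mean AUC:", auc)
```

According to the stub, the `validate_*` functions return a float (the mean AUC), while the `embed_*` functions return `None` and take an optional `output` argument.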
|
179
|
+
|
180
|
+
|
181
|
+
More details can be found in the docs of the embed module. Use cargo doc --no-deps --bin embed (and cargo doc --no-deps) as usual.
|
182
|
+
|
183
|
+
- The environment variable RUST_LOG gives access to some information at various levels (debug, info, error) via the **log** and **env_logger** crates.
|
184
|
+
|
185
|
+
## License
|
186
|
+
|
187
|
+
Licensed under either of
|
188
|
+
|
189
|
+
* Apache License, Version 2.0, [LICENSE-APACHE](LICENSE-APACHE) or <http://www.apache.org/licenses/LICENSE-2.0>
|
190
|
+
* MIT license [LICENSE-MIT](LICENSE-MIT) or <http://opensource.org/licenses/MIT>
|
191
|
+
|
192
|
+
at your option.
|
193
|
+
|
@@ -0,0 +1,7 @@
|
|
1
|
+
graphembed_rs-0.1.0.dist-info/METADATA,sha256=1bBA8fy75z8I6YGAsPAMoB-67zpAHW66cHywSb-hPj8,9901
|
2
|
+
graphembed_rs-0.1.0.dist-info/WHEEL,sha256=wsVBlw9xyAuHecZeOYqJ_tA7emUKfXYOn-_180uZRi4,104
|
3
|
+
graphembed/__init__.py,sha256=RCcLraveWf-myTsDQGePMYq-scNNfz-3Mv1baSbgAmM,123
|
4
|
+
graphembed/__init__.pyi,sha256=3_KBFG4g9akylo32CHlm9bZStcLwxIY2X4si21ilD3w,1626
|
5
|
+
graphembed/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
|
6
|
+
graphembed/graphembed.cpython-311-darwin.so,sha256=02KwhQh5VZBhzUc7zl-JdfWhSjb1sQ40V4KSbqSsi9o,5158016
|
7
|
+
graphembed_rs-0.1.0.dist-info/RECORD,,
|