PyDistintoX 0.1.0__tar.gz
- pydistintox-0.1.0/PKG-INFO +365 -0
- pydistintox-0.1.0/README.md +341 -0
- pydistintox-0.1.0/pyproject.toml +48 -0
- pydistintox-0.1.0/src/pydistintox/.DS_Store +0 -0
- pydistintox-0.1.0/src/pydistintox/__init__.py +51 -0
- pydistintox-0.1.0/src/pydistintox/__main__.py +4 -0
- pydistintox-0.1.0/src/pydistintox/cli.py +258 -0
- pydistintox-0.1.0/src/pydistintox/common/.gitkeep +1 -0
- pydistintox-0.1.0/src/pydistintox/common/__init__.py +0 -0
- pydistintox-0.1.0/src/pydistintox/common/config.py +155 -0
- pydistintox-0.1.0/src/pydistintox/common/utils.py +347 -0
- pydistintox-0.1.0/src/pydistintox/distinct_measures/__init__.py +0 -0
- pydistintox-0.1.0/src/pydistintox/distinct_measures/config.py +57 -0
- pydistintox-0.1.0/src/pydistintox/distinct_measures/core.py +159 -0
- pydistintox-0.1.0/src/pydistintox/distinct_measures/load_matrices.py +50 -0
- pydistintox-0.1.0/src/pydistintox/distinct_measures/measures.py +471 -0
- pydistintox-0.1.0/src/pydistintox/distinct_measures/utils.py +37 -0
- pydistintox-0.1.0/src/pydistintox/main.py +92 -0
- pydistintox-0.1.0/src/pydistintox/td_matrices/__init__.py +0 -0
- pydistintox-0.1.0/src/pydistintox/td_matrices/config.py +16 -0
- pydistintox-0.1.0/src/pydistintox/td_matrices/core.py +204 -0
- pydistintox-0.1.0/src/pydistintox/td_matrices/create_matrices.py +49 -0
- pydistintox-0.1.0/src/pydistintox/td_matrices/io_utils.py +147 -0
- pydistintox-0.1.0/src/pydistintox/td_matrices/parsing.py +88 -0
- pydistintox-0.1.0/src/pydistintox/td_matrices/tfidf_measures.py +167 -0
- pydistintox-0.1.0/src/pydistintox/td_matrices/utils.py +64 -0
- pydistintox-0.1.0/src/pydistintox/visualize/__init__.py +0 -0
- pydistintox-0.1.0/src/pydistintox/visualize/config.py +28 -0
- pydistintox-0.1.0/src/pydistintox/visualize/core.py +205 -0
- pydistintox-0.1.0/src/pydistintox/visualize/utils.py +152 -0
|
@@ -0,0 +1,365 @@
|
|
|
1
|
+
Metadata-Version: 2.3
|
|
2
|
+
Name: PyDistintoX
|
|
3
|
+
Version: 0.1.0
|
|
4
|
+
Summary: This is a reimplementation of fundamental functionalities of 'Pydistinto', a software project developed at the [TCDH](https://tcdh.uni-trier.de/).
|
|
5
|
+
Author: Leon Glüsing, Stefan Walter-Heßbrüggen
|
|
6
|
+
Author-email: Leon Glüsing <leongluesing@uni-muenster.de>
|
|
7
|
+
License: CC0-1.0
|
|
8
|
+
Requires-Dist: altair>=6.0.0
|
|
9
|
+
Requires-Dist: argparse>=1.4.0
|
|
10
|
+
Requires-Dist: gensim>=4.3.3
|
|
11
|
+
Requires-Dist: logging>=0.4.9.6
|
|
12
|
+
Requires-Dist: numpy>=1.26.4
|
|
13
|
+
Requires-Dist: pandas>=2.2.3
|
|
14
|
+
Requires-Dist: pip>=25.2
|
|
15
|
+
Requires-Dist: scikit-learn>=1.6.1
|
|
16
|
+
Requires-Dist: scipy>=1.13.1
|
|
17
|
+
Requires-Dist: spacy>=3.1
|
|
18
|
+
Maintainer: Leon Glüsing
|
|
19
|
+
Maintainer-email: Leon Glüsing <leongluesing@posteo.de>
|
|
20
|
+
Requires-Python: >=3.11.1, <3.13
|
|
21
|
+
Project-URL: Homepage of Original Project, https://www.uni-muenster.de/Wissenschaftstheorie/Forschung/PRODATPHIL/
|
|
22
|
+
Project-URL: Homepage of Author, https://leongluesing.de
|
|
23
|
+
Description-Content-Type: text/markdown
|
|
24
|
+
|
|
25
|
+
# PyDistintoX
|
|
26
|
+
|
|
27
|
+
<table style="width: 100%;">
|
|
28
|
+
<tr>
|
|
29
|
+
<td style="width: 33%; text-align: center;">
|
|
30
|
+
<img src="./docs/screenshots/overview.png" style="max-width: 100%; height: auto;"/>
|
|
31
|
+
<p><em>Overview</em></p>
|
|
32
|
+
</td>
|
|
33
|
+
<td style="width: 33%; text-align: center;">
|
|
34
|
+
<img src="./docs/screenshots/zeta_sd2.png" style="max-width: 100%; height: auto;"/>
|
|
35
|
+
<p><em>Zeta SD2</em></p>
|
|
36
|
+
</td>
|
|
37
|
+
<td style="width: 33%; text-align: center;">
|
|
38
|
+
<img src="./docs/screenshots/nfc.png" style="max-width: 100%; height: auto;"/>
|
|
39
|
+
<p><em>NFC</em></p>
|
|
40
|
+
</td>
|
|
41
|
+
</tr>
|
|
42
|
+
<tr>
|
|
43
|
+
<td colspan="3" style="text-align: center;">
|
|
44
|
+
<img src="./docs/screenshots/heatmap.png" style="max-width: 300px; height: auto;"/>
|
|
45
|
+
<p><em>Heatmap</em></p>
|
|
46
|
+
</td>
|
|
47
|
+
</tr>
|
|
48
|
+
</table>
|
|
49
|
+
|
|
50
|
+
|
|
51
|
+
|
|
52
|
+
|
|
53
|
+
**This is a reimplementation of fundamental functionalities of "Pydistinto", a software project developed at the [TCDH](https://tcdh.uni-trier.de/). The original script can be found [here](https://github.com/Zeta-and-Company/pydistinto).** The following major changes are worth mentioning:
|
|
54
|
+
|
|
55
|
+
* using Gensim and NumPy instead of Pandas for the construction of term-document matrices; Gensim can parse corpora that do not fit into RAM, NumPy/SciPy is used to process sparse matrices with lower memory requirements,
|
|
56
|
+
|
|
57
|
+
* using the built-in tf-idf functions of Gensim to add further tf-idf related measures of distinctiveness (see [below](#tf-idf-measures)).
|
|
58
|
+
|
|
59
|
+
This is a *lite* version of the original Pydistinto since we offer fewer features, e.g. no randomization of texts. However, if your corpus reaches a size that the original Pydistinto struggles to handle, it may be worth trying our solution.
|
|
60
|
+
|
|
61
|
+
PyDistintoX was developed by Leon Glüsing and Stefan Heßbrüggen-Walter. Original development was funded by the Deutsche Forschungsgemeinschaft via the project "Prodatphil: Science and Logic", project no. 537184692.
|
|
62
|
+
|
|
63
|
+
---
|
|
64
|
+
|
|
65
|
+
## Overview
|
|
66
|
+
PyDistintoX can be used in two ways:
|
|
67
|
+
- [Standalone Application (CLI)](#standalone-application): Run analyses directly from the command line.
|
|
68
|
+
- [Python library](#python-library): Import functions for custom workflows.
|
|
69
|
+
|
|
70
|
+
Furthermore, we support installation via `uv` as well as via `pip`.
|
|
71
|
+
|
|
72
|
+
|
|
73
|
+
### Prerequisites
|
|
74
|
+
Choose either:
|
|
75
|
+
| uv (recommended) | Pure Python |
|
|
76
|
+
|---|---|
|
|
77
|
+
| install [here](https://docs.astral.sh/uv/getting-started/installation/)| >= 3.11.1, < 3.13|
|
|
78
|
+
|
|
79
|
+
> **Note**: Replace `python` in the commands below with your specific version if needed.
|
|
80
|
+
|
|
81
|
+
|
|
82
|
+
### Clone the Repository
|
|
83
|
+
Open a terminal and clone the repository:
|
|
84
|
+
```bash
|
|
85
|
+
git clone https://gitlab.com/leongluesing/pydistintox.git
|
|
86
|
+
cd pydistintox
|
|
87
|
+
```
|
|
88
|
+
Alternatively, you may download the code directly and change into the project directory in a terminal.
|
|
89
|
+
|
|
90
|
+
|
|
91
|
+
## Standalone Application
|
|
92
|
+
Run analyses directly from the command line.
|
|
93
|
+
<details>
|
|
94
|
+
<summary>click to expand</summary>
|
|
95
|
+
|
|
96
|
+
### Installation
|
|
97
|
+
|
|
98
|
+
Download the spaCy [model of your choice](https://spacy.io/models). You will also need the model name later when using the [CLI](#command-line-options).
|
|
99
|
+
|
|
100
|
+
#### uv (recommended)
|
|
101
|
+
|
|
102
|
+
Create and activate a virtual environment, then install the dependencies:
|
|
103
|
+
```bash
|
|
104
|
+
uv venv
|
|
105
|
+
source .venv/bin/activate
|
|
106
|
+
uv pip install -e .
|
|
107
|
+
```
|
|
108
|
+
(On Windows, use `.\.venv\Scripts\activate` to activate the environment.)
|
|
109
|
+
|
|
110
|
+
Download the spaCy [model of your choice](https://spacy.io/models):
|
|
111
|
+
|
|
112
|
+
```bash
|
|
113
|
+
uv pip install $(spacy info <MODEL_NAME> --url)
|
|
114
|
+
```
|
|
115
|
+
|
|
116
|
+
On Windows use `uv pip install (spacy info <MODEL_NAME> --url)` instead.
|
|
117
|
+
|
|
118
|
+
|
|
119
|
+
#### pip
|
|
120
|
+
|
|
121
|
+
Create a virtual environment (if one does not already exist) and activate it:
|
|
122
|
+
|
|
123
|
+
```bash
|
|
124
|
+
python -m venv .venv
|
|
125
|
+
source .venv/bin/activate
|
|
126
|
+
```
|
|
127
|
+
|
|
128
|
+
On Windows, use `.\.venv\Scripts\activate` to activate the environment.
|
|
129
|
+
|
|
130
|
+
Now install PyDistintoX and the spaCy [model of your choice](https://spacy.io/models):
|
|
131
|
+
```bash
|
|
132
|
+
pip install -e .
|
|
133
|
+
pip install $(spacy info <MODEL_NAME> --url)
|
|
134
|
+
```
|
|
135
|
+
|
|
136
|
+
---
|
|
137
|
+
|
|
138
|
+
## Quickstart
|
|
139
|
+
```bash
|
|
140
|
+
uv run pydistintox --example
|
|
141
|
+
```
|
|
142
|
+
or, for a pip installation,
|
|
143
|
+
```bash
|
|
144
|
+
python -m pydistintox --example
|
|
145
|
+
```
|
|
146
|
+
runs the application with texts by Arthur Conan Doyle that are part of the installation. You can find them in `data/texts/example`.
|
|
147
|
+
|
|
148
|
+
|
|
149
|
+
---
|
|
150
|
+
|
|
151
|
+
## Usage
|
|
152
|
+
|
|
153
|
+
In the example data, Doyle's detective novels are compared to his other novels in order to identify words that are distinctive for this genre. You can either:
|
|
154
|
+
- place your target corpus files in `data/texts/tar` and your reference corpus files in `data/texts/ref`, **or**
|
|
155
|
+
- specify custom directories via command line options (see [Command Line Options](#command-line-options) below).
|
|
156
|
+
|
|
157
|
+
|
|
158
|
+
The application processes your texts in three steps:
|
|
159
|
+
1. **NLP Processing:** Texts are tokenized and lemmatized. If the `--save-nlp` flag is set, the results are saved as JSON in `data/interim/json`.
|
|
160
|
+
2. **Statistical Analysis:** Various distinctiveness measures are calculated.
|
|
161
|
+
3. **Results Visualization:** The results are automatically displayed in your default browser.
|
|
162
|
+
|
|
163
|
+
If you have already created the JSON files (e.g. in a previous run), you can skip the first step by setting the `--load-nlp` flag (see [Command Line Options](#command-line-options) below).
|
|
164
|
+
|
|
165
|
+
You can run the application via either:
|
|
166
|
+
| uv (recommended) | pip |
|
|
167
|
+
|-----------------------------------|------------------------------------------|
|
|
168
|
+
| `uv run pydistintox` | `python -m pydistintox` |
|
|
169
|
+
|
|
170
|
+
|
|
171
|
+
### Command Line Options
|
|
172
|
+
* `--debug` changes logging mode to verbose
|
|
173
|
+
* `--load-nlp path/to/json/dir` skips tokenization and lemmatization and loads the NLP results from the given path. Make sure `path/to/json/dir` contains the folders `tar` and `ref`. If no path is specified, the default `data/interim/json` is used.
|
|
174
|
+
* `--example` Use the example data included with the installation to explore the program.
|
|
175
|
+
* `--input-tar path/to/target/corpus` specifies the directory of the target corpus. Give the directory, not a path to a file! (Escape spaces in the path if necessary.)
|
|
176
|
+
* `--input-ref path/to/reference/corpus` specifies the directory of the reference corpus. Give the directory, not a path to a file! (Escape spaces in the path if necessary.)
|
|
177
|
+
* `--model spacy_model_name` spaCy model name in the format: `{lang}_core_{dataset}_{size}`. Example: `en_core_web_sm` (lang=en, dataset=web, size=sm). Available sizes: sm, md, lg, trf. [Find models here](https://spacy.io/models).
|
|
178
|
+
* `--raw-scores` Scores will not be scaled to the range [-1, 1].
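
The README does not spell out how the default scaling is computed. As a minimal sketch, assuming scores are divided by the largest absolute score (one common way to map signed scores into [-1, 1] while keeping sign and zero intact), the difference between scaled and raw scores could look like this. The function name `scale_to_unit_range` is hypothetical and not part of PyDistintoX.

```python
def scale_to_unit_range(scores):
    """Divide every score by the largest absolute score (assumed scaling rule)."""
    max_abs = max((abs(value) for value in scores.values()), default=0.0)
    if max_abs == 0:
        # All scores are zero (or there are none): nothing to scale.
        return {term: 0.0 for term in scores}
    return {term: value / max_abs for term, value in scores.items()}


# Made-up raw scores for illustration only.
raw = {"holmes": 4.0, "ship": -2.0, "the": 0.0}
scaled = scale_to_unit_range(raw)
print(scaled)
```

With `--raw-scores`, you would work with values like `raw` directly; without it, with values like `scaled`.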
|
|
179
|
+
|
|
180
|
+
</details>
|
|
181
|
+
|
|
182
|
+
---
|
|
183
|
+
|
|
184
|
+
|
|
185
|
+
## Python library
|
|
186
|
+
Import functions for custom workflows.
|
|
187
|
+
<details>
|
|
188
|
+
<summary>click to expand</summary>
|
|
189
|
+
|
|
190
|
+
|
|
191
|
+
### Installation
|
|
192
|
+
- **If you are not using a virtual environment yet:**
|
|
193
|
+
Create one:
|
|
194
|
+
- `uv venv`
|
|
195
|
+
- `python -m venv .venv`
|
|
196
|
+
|
|
197
|
+
Activate it with:
|
|
198
|
+
- `source .venv/bin/activate` (Linux/macOS)
|
|
199
|
+
- `.\.venv\Scripts\activate` (Windows)
|
|
200
|
+
|
|
201
|
+
Then run either:
|
|
202
|
+
| | uv (recommended) | pip |
|
|
203
|
+
|---|-----------------------------------|------------------------------------------|
|
|
204
|
+
|Install | `uv pip install .` | `pip install .` |
|
|
205
|
+
|Download Model| `uv pip install $(spacy info <MODEL_NAME> --url)` | `pip install $(spacy info <MODEL_NAME> --url)` |
|
|
206
|
+
|Check Installation|You can see the place of installation via `uv pip show pydistintox`.|You can see the place of installation via `pip show pydistintox`.|
|
|
207
|
+
|
|
208
|
+
|
|
209
|
+
### Quickstart
|
|
210
|
+
|
|
211
|
+
#### Installation
|
|
212
|
+
|
|
213
|
+
```bash
|
|
214
|
+
# Assuming you do not have an active virtual environment
|
|
215
|
+
python3.11 -m venv .venv
|
|
216
|
+
source .venv/bin/activate
|
|
217
|
+
# and you are in the root directory of the pydistintox
|
|
218
|
+
# project. You may install it via
|
|
219
|
+
pip install .
|
|
220
|
+
# You can check your installation via
|
|
221
|
+
pip show pydistintox
|
|
222
|
+
|
|
223
|
+
# download spacy model
|
|
224
|
+
pip install $(spacy info en_core_web_sm --url)
|
|
225
|
+
```
|
|
226
|
+
|
|
227
|
+
#### Usage
|
|
228
|
+
|
|
229
|
+
|
|
230
|
+
|
|
231
|
+
```python
|
|
232
|
+
from pathlib import Path
|
|
233
|
+
from pydistintox import *
|
|
234
|
+
|
|
235
|
+
config = Config(
|
|
236
|
+
skip_nlp = False,
|
|
237
|
+
save_nlp= JSON,
|
|
238
|
+
spacy_model_name='en_core_web_sm',
|
|
239
|
+
target=INPUT_TAR,
|
|
240
|
+
reference=INPUT_REF,
|
|
241
|
+
debug = True,
|
|
242
|
+
source='YOUR COMMENT HERE',
|
|
243
|
+
    measures_to_calculate = [], # list tf-idf measures here; selecting non-tf-idf measures via this option is not implemented yet
|
|
244
|
+
scaling = True,
|
|
245
|
+
)
|
|
246
|
+
non_tf_idf_measures_to_calculate = ['zeta_sd2'] # only calculate zeta_sd2
|
|
247
|
+
|
|
248
|
+
...
|
|
249
|
+
```
|
|
250
|
+
|
|
251
|
+
**For a complete example, see the script:** [demo.py](./docs/examples/demo.py)
|
|
252
|
+
|
|
253
|
+
|
|
254
|
+
### Available Import Functions
|
|
255
|
+
Key functions for programmatic use (import via `from pydistintox import ...`).
|
|
256
|
+
**Full documentation:** [FUNCTIONS.md](./docs/FUNCTIONS.md)
|
|
257
|
+
|
|
258
|
+
</details>
|
|
259
|
+
|
|
260
|
+
---
|
|
261
|
+
|
|
262
|
+
## Remove PyDistintoX
|
|
263
|
+
|
|
264
|
+
| | uv | pip |
|
|
265
|
+
|---|-----------------------------------|------------------------------------------|
|
|
266
|
+
|uninstall|`uv pip uninstall pydistintox`|`pip uninstall pydistintox`|
|
|
267
|
+
|clean cache|`uv cache clean`|`pip cache purge`|
|
|
268
|
+
|remove model|`uv pip uninstall en_core_web_sm`|`pip uninstall en_core_web_sm`|
|
|
269
|
+
|
|
270
|
+
You may also delete the virtual environment completely, which removes all dependencies and downloaded models as well: first deactivate the virtual environment (`deactivate`), then delete its folder (`rm -r .venv`, or `rmdir /s .venv` on Windows).
|
|
271
|
+
|
|
272
|
+
---
|
|
273
|
+
|
|
274
|
+
## Troubleshooting
|
|
275
|
+
|
|
276
|
+
### Installation of spaCy Model
|
|
277
|
+
|
|
278
|
+
If `command not found: spacy` appears, `pydistintox` is not yet correctly installed in your virtual environment (or the virtual environment is not active). Try
|
|
279
|
+
|
|
280
|
+
```bash
|
|
281
|
+
source .venv/bin/activate
|
|
282
|
+
pip install .  # or: uv pip install .
|
|
283
|
+
pip install $(spacy info <MODEL_NAME> --url)  # or: uv pip install $(spacy info <MODEL_NAME> --url)
|
|
284
|
+
```
|
|
285
|
+
|
|
286
|
+
If the installation of a spaCy model via `(uv) pip install <MODEL_NAME>` still fails, you may download the model by some other means and run `(uv) pip install path/to/model` instead.
|
|
287
|
+
|
|
288
|
+
---
|
|
289
|
+
|
|
290
|
+
## Technical Reference
|
|
291
|
+
|
|
292
|
+
The distinctiveness measures provided are scaled to the range of [-1, 1] by default for visualization, but this scaling does not make them statistically comparable. Each measure is based on different underlying assumptions, mathematical formulations, and distributions (e.g., frequency-based vs. dispersion-based). Direct comparison of values across measures is not meaningful, unless one explicitly accounts for the differing mathematical foundations.
|
|
293
|
+
|
|
294
|
+
### Non-TF-IDF Measures
|
|
295
|
+
|
|
296
|
+
Using the Gensim tf-idf model as a helper, `PyDistintoX` calculates absolute and binary frequencies. This amounts to tf-idf with the idf weight fixed at 1, i.e. document frequency is disregarded, so strictly speaking it is tf without idf. Based on these data, the following distinctiveness measures are calculated.
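
To make the two frequency types concrete, here is a minimal sketch in plain Python. It does not use Gensim and is not PyDistintoX's implementation; the helper names and the toy documents are made up for illustration.

```python
from collections import Counter


def absolute_frequencies(tokens):
    """Count how often each term occurs in one document."""
    return Counter(tokens)


def binary_frequencies(tokens):
    """Record only whether a term occurs in the document (1) or not (absent)."""
    return {term: 1 for term in set(tokens)}


# Toy document, already tokenized and lemmatized.
document = ["holmes", "see", "the", "clue", "and", "holmes", "smile"]

absolute = absolute_frequencies(document)
binary = binary_frequencies(document)
print(absolute["holmes"], binary["holmes"])
```

Absolute frequencies feed the frequency-based measures, while binary (presence/absence) frequencies are the natural input for dispersion-based measures such as Zeta.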
|
|
297
|
+
|
|
298
|
+
The following table is adapted from the foundational article on which this software is based: ["Evaluation of Measures of Distinctiveness", by Keli Du, Julia Dudar, and Christof Schöch](https://doi.org/10.48694/jcls.102) ([Wayback Machine](https://web.archive.org/web/20260116101348/https://jcls.io/article/id/102/)). For more information see [here](https://zeta-project.eu/de/distinktivitaetsmasse/) ([Wayback Machine](https://web.archive.org/web/20260113084856/https://zeta-project.eu/de/distinktivitaetsmasse/)).
|
|
299
|
+
|
|
300
|
+
| Name | Type of measure | References | Evaluated in | Implementation Key |
|
|
301
|
+
|-----------------------------------|----------------------|-------------------------------------------------|------------------------------------------------------------------------------|----------------------|
|
|
302
|
+
| TF-IDF | Term weighting | Luhn 1957; Spärck Jones 1972 | Salton and Buckley 1988 | – |
|
|
303
|
+
| Ratio of relative frequencies (RRF) | Frequency-based | Damerau 1993 | Gries 2010 | `rrf_dr0` |
|
|
304
|
+
| Chi-squared test (χ²) | Frequency-based | Dunning 1993 | Lijffijt et al. 2014 | `chi_square_value` |
|
|
305
|
+
| Log-likelihood ratio test (LLR) | Frequency-based | Dunning 1993 | Egbert and Biber 2019; Paquot and Bestgen 2009; Lijffijt et al. 2014 | `LLR_value` |
|
|
306
|
+
| Welch’s t-test (Welch) | Distribution-based | Welch 1947 | Paquot and Bestgen 2009; Lijffijt et al. 2014 | `welch_t_value` |
|
|
307
|
+
| Wilcoxon rank sum test (Wilcoxon) | Dispersion-based | Wilcoxon 1945; Mann and Whitney 1947 | Paquot and Bestgen 2009; Lijffijt et al. 2014 | `ranksumtest_value` |
|
|
308
|
+
| Burrows Zeta (Zeta_orig) | Dispersion-based | Burrows 2007; Craig and Kinney 2009 | Schöch 2018 | `zeta_sd0` |
|
|
309
|
+
| logarithmic Zeta (Zeta_log) | Dispersion-based | Schöch 2018 | Schöch 2018; Du et al. 2021 | `zeta_sd2` |
|
|
310
|
+
| Eta | Dispersion-based | Du et al. 2021 | Du et al. 2021 | `eta_sg0` |
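
As an illustration of a dispersion-based measure from the table above, the following sketch computes Burrows's Zeta in its commonly described form: the proportion of target documents containing a term minus the proportion of reference documents containing it. This is a simplified sketch, not the code behind the `zeta_sd0` key; the function names and toy corpora are hypothetical, and any segmentation of texts is left out.

```python
def document_proportion(term, documents):
    """Share of documents (given as sets of terms) that contain the term."""
    if not documents:
        return 0.0
    return sum(1 for doc in documents if term in doc) / len(documents)


def zeta(term, target_docs, reference_docs):
    """Document proportion in the target minus that in the reference corpus."""
    return (document_proportion(term, target_docs)
            - document_proportion(term, reference_docs))


# Toy corpora: each document is reduced to the set of terms it contains.
target = [{"holmes", "watson", "case"}, {"holmes", "clue"}]
reference = [{"ship", "sea"}, {"holmes", "sea"}]

print(zeta("holmes", target, reference))
print(zeta("sea", target, reference))
```

Because both proportions lie in [0, 1], this form of Zeta always falls in [-1, 1]: positive values mark terms that are more consistently present in the target corpus, negative values terms that are more consistently present in the reference corpus.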
|
|
311
|
+
|
|
312
|
+
|
|
313
|
+
### TF-IDF Measures
|
|
314
|
+
`PyDistintoX` uses the Gensim tf-idf model for calculating distinctiveness scores. The following SMART weighting schemes are used:
|
|
315
|
+
|
|
316
|
+
- 'nfn',
|
|
317
|
+
- 'nfc',
|
|
318
|
+
- 'bfn',
|
|
319
|
+
- 'afn',
|
|
320
|
+
- 'lfn',
|
|
321
|
+
- 'ltc'
|
|
322
|
+
|
|
323
|
+
For more information on the meaning of these parameters see [Wikipedia](https://en.wikipedia.org/wiki/SMART_Information_Retrieval_System) ([Wayback Machine](https://web.archive.org/web/20260203181537/https://en.wikipedia.org/wiki/SMART_Information_Retrieval_System)), or the following paragraph:
|
|
324
|
+
|
|
325
|
+
<details>
|
|
326
|
+
|
|
327
|
+
<summary>
|
|
328
|
+
Explanation from the Gensim Documentation
|
|
329
|
+
</summary>
|
|
330
|
+
|
|
331
|
+
From the [gensim documentation](https://radimrehurek.com/gensim/models/tfidfmodel.html#gensim.models.tfidfmodel.TfidfModel):
|
|
332
|
+
|
|
333
|
+
>**smartirs (str, optional) –**
|
|
334
|
+
>
|
|
335
|
+
>SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System, a mnemonic scheme for denoting tf-idf weighting variants in the vector space model. The mnemonic for representing a combination of weights takes the form XYZ, for example ‘ntc’, ‘bpn’ and so on, where the letters represents the term weighting of the document vector.
|
|
336
|
+
>
|
|
337
|
+
>Term frequency weighing:
|
|
338
|
+
> - b - binary,
|
|
339
|
+
> - t or n - raw,
|
|
340
|
+
> - a - augmented,
|
|
341
|
+
> - l - logarithm,
|
|
342
|
+
> - d - double logarithm,
|
|
343
|
+
> - L - log average.
|
|
344
|
+
>
|
|
345
|
+
>Document frequency weighting:
|
|
346
|
+
> - x or n - none,
|
|
347
|
+
> - f - idf,
|
|
348
|
+
> - t - zero-corrected idf,
|
|
349
|
+
> - p - probabilistic idf.
|
|
350
|
+
>
|
|
351
|
+
>Document normalization:
|
|
352
|
+
> - x or n - none,
|
|
353
|
+
> - c - cosine,
|
|
354
|
+
> - u - pivoted unique,
|
|
355
|
+
> - b - pivoted character length.
|
|
356
|
+
>
|
|
357
|
+
>Default is ‘nfc’. For more information visit SMART Information Retrieval System.
|
|
358
|
+
|
|
359
|
+
</details>
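
The three-letter schemes listed above can be read off mechanically from the letter meanings quoted from the Gensim documentation. The following small decoder is only a reading aid, not part of PyDistintoX or Gensim; the name `describe_smartirs` is hypothetical.

```python
# Letter meanings as given in the Gensim documentation quoted above.
TERM_FREQUENCY = {
    "b": "binary", "t": "raw", "n": "raw", "a": "augmented",
    "l": "logarithm", "d": "double logarithm", "L": "log average",
}
DOCUMENT_FREQUENCY = {
    "x": "none", "n": "none", "f": "idf",
    "t": "zero-corrected idf", "p": "probabilistic idf",
}
NORMALIZATION = {
    "x": "none", "n": "none", "c": "cosine",
    "u": "pivoted unique", "b": "pivoted character length",
}


def describe_smartirs(scheme):
    """Translate a three-letter SMART mnemonic into its three components."""
    if len(scheme) != 3:
        raise ValueError("a SMART scheme has exactly three letters")
    tf, df, norm = scheme
    return TERM_FREQUENCY[tf], DOCUMENT_FREQUENCY[df], NORMALIZATION[norm]


for scheme in ["nfn", "nfc", "bfn", "afn", "lfn", "ltc"]:
    print(scheme, describe_smartirs(scheme))
```

In Gensim itself, such a mnemonic is passed as the `smartirs` argument of `TfidfModel`, for example `TfidfModel(corpus, smartirs="nfc")`.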
|
|
360
|
+
|
|
361
|
+
### spaCy and Gensim
|
|
362
|
+
|
|
363
|
+
While Gensim features its own rule-based [tokenizer](https://radimrehurek.com/gensim/utils.html#gensim.utils.simple_tokenize), we provide the option to plug in spaCy models for tokenization and lemmatization. The results of this process are saved as JSON files in spaCy's own format. We then extract the lemmas from these data and recast them as Gensim input documents (basically lists of words forming sentences, which in turn are combined in a list to form a document).
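
The recasting step can be sketched as follows. The sketch assumes a JSON layout like the one produced by spaCy's `Doc.to_json()`, where `tokens` carry character offsets and a `lemma`, and `sents` carry character offsets; whether the files written by PyDistintoX match this layout exactly is an assumption, and `lemmas_by_sentence` is a hypothetical helper name.

```python
def lemmas_by_sentence(doc_json):
    """Group token lemmas into sentences using character offsets."""
    sentences = []
    for sent in doc_json["sents"]:
        lemmas = [
            token["lemma"]
            for token in doc_json["tokens"]
            if sent["start"] <= token["start"] and token["end"] <= sent["end"]
        ]
        sentences.append(lemmas)
    return sentences


# Hand-written stand-in for a spaCy JSON export of "He ran. Dogs bark."
doc_json = {
    "text": "He ran. Dogs bark.",
    "sents": [{"start": 0, "end": 7}, {"start": 8, "end": 18}],
    "tokens": [
        {"id": 0, "start": 0, "end": 2, "lemma": "he"},
        {"id": 1, "start": 3, "end": 6, "lemma": "run"},
        {"id": 2, "start": 6, "end": 7, "lemma": "."},
        {"id": 3, "start": 8, "end": 12, "lemma": "dog"},
        {"id": 4, "start": 13, "end": 17, "lemma": "bark"},
        {"id": 5, "start": 17, "end": 18, "lemma": "."},
    ],
}

document = lemmas_by_sentence(doc_json)
print(document)
```

The result is a list of sentences, each a list of lemmas, which matches the nested-list shape described above.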
|
|
364
|
+
|
|
365
|
+
|