ssdiff 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- ssdiff-0.1.0/LICENSE +21 -0
- ssdiff-0.1.0/PKG-INFO +500 -0
- ssdiff-0.1.0/README.md +460 -0
- ssdiff-0.1.0/pyproject.toml +72 -0
- ssdiff-0.1.0/setup.cfg +4 -0
- ssdiff-0.1.0/ssdiff/__init__.py +20 -0
- ssdiff-0.1.0/ssdiff/clusters.py +101 -0
- ssdiff-0.1.0/ssdiff/core.py +608 -0
- ssdiff-0.1.0/ssdiff/lexicon.py +320 -0
- ssdiff-0.1.0/ssdiff/preprocess.py +175 -0
- ssdiff-0.1.0/ssdiff/snippets.py +270 -0
- ssdiff-0.1.0/ssdiff/utils.py +157 -0
- ssdiff-0.1.0/ssdiff.egg-info/PKG-INFO +500 -0
- ssdiff-0.1.0/ssdiff.egg-info/SOURCES.txt +24 -0
- ssdiff-0.1.0/ssdiff.egg-info/dependency_links.txt +1 -0
- ssdiff-0.1.0/ssdiff.egg-info/requires.txt +17 -0
- ssdiff-0.1.0/ssdiff.egg-info/top_level.txt +1 -0
- ssdiff-0.1.0/tests/test_basic_pipeline.py +34 -0
- ssdiff-0.1.0/tests/test_imports.py +9 -0
ssdiff-0.1.0/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 Hubert Plisiecki

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
ssdiff-0.1.0/PKG-INFO
ADDED
@@ -0,0 +1,500 @@
Metadata-Version: 2.4
Name: ssdiff
Version: 0.1.0
Summary: Supervised Semantic Differential (SSD): interpretable, embedding-based analysis of concept meaning in text.
Author-email: Hubert Plisiecki <hplisiecki@gmail.com>, Paweł Lenartowicz <pawellenartowicz@europe.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/hplisiecki/Supervised-Semantic-Differential
Project-URL: Repository, https://github.com/hplisiecki/Supervised-Semantic-Differential
Project-URL: Issues, https://github.com/hplisiecki/Supervised-Semantic-Differential/issues
Project-URL: Changelog, https://github.com/hplisiecki/Supervised-Semantic-Differential/releases
Keywords: NLP,semantics,word embeddings,psychometrics,semantic differential,computational social science
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.26.4
Requires-Dist: pandas>=2.1.4
Requires-Dist: scikit-learn>=1.3.2
Requires-Dist: gensim>=4.3.2
Requires-Dist: spacy>=3.7.2
Requires-Dist: requests>=2.32.5
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: ruff>=0.5; extra == "dev"
Requires-Dist: mypy>=1.6; extra == "dev"
Requires-Dist: build>=1.0.0; extra == "dev"
Requires-Dist: twine>=4.0; extra == "dev"
Provides-Extra: gui
Requires-Dist: PySide6>=6.6; extra == "gui"
Dynamic: license-file

# Supervised Semantic Differential (SSD)

**SSD** lets you recover **interpretable semantic directions** related to specific concepts directly from open-ended text and relate them to **numeric outcomes**
(e.g., psychometric scales, judgments). It builds per-essay concept vectors from **local contexts around seed words**,
learns a **semantic gradient** that best predicts the outcome, and then provides multiple interpretability layers:

- **Nearest neighbors** of each pole (+β̂ / −β̂)
- **Clustering** of neighbors into themes
- **Text snippets**: top sentences whose local contexts align with each cluster centroid or the β̂ axis
- **Per-essay scores** (cosine alignments) for further analysis

The goal of the package is to allow psycholinguistic researchers to draw data-driven insights about
how people use language depending on their attitudes, traits, or other numeric variables of interest.

---

## Table of Contents

- [Installation](#installation)
- [Quickstart](#quickstart)
- [Core Concepts](#core-concepts)
- [Word2Vec Embeddings](#word2vec-embeddings)
- [Preprocessing (spaCy)](#preprocessing-spacy)
- [Lexicon Utilities](#lexicon-utilities)
- [Fitting SSD](#fitting-ssd)
- [Neighbors & Clustering](#neighbors--clustering)
- [Interpreting with Snippets](#interpreting-with-snippets)
- [Per-Essay SSD Scores](#per-essay-ssd-scores)
- [API Summary](#api-summary)
- [Citing & License](#citing--license)

---

## Installation

```bash
pip install ssdiff
```

Dependencies (installed automatically): `numpy`, `pandas`, `scikit-learn`, `gensim`, `spacy`, `requests`.

---

## Quickstart

Below is an end-to-end minimal example using the Polish model and an example dataset.
Adjust paths and column names to your data.

```python
from ssdiff import (
    SSD, load_embeddings, normalize_kv,
    load_spacy, load_stopwords, preprocess_texts, build_docs_from_preprocessed,
    suggest_lexicon, token_presence_stats, coverage_by_lexicon,
)

import pandas as pd

MODEL_PATH = r"NLPResources\word2vec_model.kv"
DATA_PATH = r"data\example_dataset.csv"

# 1) Load and normalize embeddings (L2 + ABTT on word space)
kv = normalize_kv(load_embeddings(MODEL_PATH), l2=True, abtt_m=1)

# 2) Load your data
df = pd.read_csv(DATA_PATH)
text_raw_col = "text_raw"              # column with raw text
y_col = "questionnaire_result"         # numeric outcome column

# 3) Preprocess (spaCy) — keep original sentences and lemmas linked
nlp = load_spacy("pl_core_news_lg")    # Polish spaCy model
stopwords = load_stopwords("pl")       # Polish stopwords
texts_raw = df[text_raw_col].fillna("").astype(str).tolist()
pre_docs = preprocess_texts(texts_raw, nlp, stopwords)

# 4) Build lemma docs for modeling and filter to non-NaN y
docs = build_docs_from_preprocessed(pre_docs)  # list[list[str]]
y = pd.to_numeric(df[y_col], errors="coerce")
mask = ~y.isna()
docs = [docs[i] for i in range(len(docs)) if mask.iat[i]]
pre_docs = [pre_docs[i] for i in range(len(pre_docs)) if mask.iat[i]]
y = y[mask].to_numpy()

# 5) Define a lexicon (tokens must match your preprocessing); see Lexicon Utilities for data-driven selection
lexicon = {"concept_keyword_1", "concept_keyword_2", "concept_keyword_3", "concept_keyword_4"}  # keywords that define your concept

# 6) Choose PCA dimensionality based on sample size
n_kept = len(docs)
PCA_K = min(10, max(3, n_kept // 10))

# 7) Fit SSD
ssd = SSD(
    kv=kv,
    docs=docs,
    y=y,
    lexicon=lexicon,
    l2_normalize_docs=True,   # normalize per-essay vectors
    N_PCA=PCA_K,
    use_unit_beta=True,       # unit β̂ for neighbors/interpretation
    windpow=3,                # context window ±3 tokens
    SIF_a=1e-3,               # SIF weighting parameter
)

# 8) Inspect regression readout
print({
    "R2": ssd.r2,
    "adj_R2": float(getattr(ssd, "r2_adj", float("nan"))),
    "F": ssd.f_stat,
    "p": ssd.f_pvalue,
    "beta_norm": ssd.beta_norm_stdCN,              # ||β|| in SD(y) per +1.0 cosine
    "delta_per_0.10_raw": ssd.delta_per_0p10_raw,
    "IQR_effect_raw": ssd.iqr_effect_raw,
    "corr_y_pred": ssd.y_corr_pred,
    "n_raw": int(getattr(ssd, "n_raw", len(docs))),
    "n_kept": int(getattr(ssd, "n_kept", len(docs))),
    "n_dropped": int(getattr(ssd, "n_dropped", 0)),
})

# 9) Neighbors
ssd.top_words(n=20, verbose=True)

# 10) Cluster themes (e.g., k selected by silhouette when k=None)
df_clusters, df_members = ssd.cluster_neighbors(topn=100, k=None, k_min=2, k_max=10, top_words=5, verbose=True)

# 11) Snippets for interpretation
snips = ssd.cluster_snippets(pre_docs=pre_docs, side="both", window_sentences=1, top_per_cluster=100)
df_pos_snip = snips["pos"]
df_neg_snip = snips["neg"]

beta_snips = ssd.beta_snippets(pre_docs=pre_docs, window_sentences=1, top_per_side=200)
df_beta_pos = beta_snips["beta_pos"]
df_beta_neg = beta_snips["beta_neg"]

# 12) Per-essay SSD scores
scores = ssd.ssd_scores(docs, include_all=True)
```

---

## Core Concepts

- **Seed lexicon**: a small set of tokens (lemmas) indicating the concept of interest (e.g., {klimat, klimatyczny, zmiana}).
- **Per-essay vector**: SIF-weighted average of context vectors around each seed occurrence (±3 tokens), then averaged across occurrences.
- **SSD fitting**: PCA on standardized doc vectors, OLS from components to standardized outcome 𝑦, then back-project to doc space to get β (the semantic gradient).
- **Interpretation**: nearest neighbors to +β̂/−β̂, clustering neighbors into themes, and showing original sentences whose local context aligns with centroids or β̂.

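To make the "per-essay vector" step concrete, here is a minimal, self-contained sketch of that construction (plain NumPy, illustrative only; `toy_vectors`, `word_freq`, `sif_weight`, and `essay_concept_vector` are made-up stand-ins, not part of the ssdiff API):

```python
import numpy as np

# Toy embedding table and corpus frequencies (stand-ins for a real KeyedVectors model).
rng = np.random.default_rng(0)
vocab = ["klimat", "zmiana", "pogoda", "strach", "nadzieja", "bardzo"]
toy_vectors = {w: rng.normal(size=50) for w in vocab}
word_freq = {"klimat": 0.001, "zmiana": 0.002, "pogoda": 0.001,
             "strach": 0.0005, "nadzieja": 0.0004, "bardzo": 0.02}

def sif_weight(word, a=1e-3):
    """Smooth inverse-frequency weight a / (a + p(word))."""
    return a / (a + word_freq.get(word, 0.0))

def essay_concept_vector(lemmas, lexicon, window=3, a=1e-3):
    """SIF-weighted average of the ±`window` tokens around each seed occurrence,
    then averaged across occurrences (the construction described in Core Concepts)."""
    occurrence_vecs = []
    for i, tok in enumerate(lemmas):
        if tok not in lexicon:
            continue
        context = lemmas[max(0, i - window): i] + lemmas[i + 1: i + 1 + window]
        vecs = [sif_weight(w, a) * toy_vectors[w] for w in context if w in toy_vectors]
        if vecs:
            occurrence_vecs.append(np.mean(vecs, axis=0))
    if not occurrence_vecs:
        return None  # no usable seed contexts: such an essay would be dropped
    v = np.mean(occurrence_vecs, axis=0)
    return v / np.linalg.norm(v)  # corresponds to l2_normalize_docs=True

doc = ["bardzo", "strach", "zmiana", "klimat", "nadzieja", "pogoda"]
x = essay_concept_vector(doc, lexicon={"klimat", "zmiana"})
print(x.shape)  # (50,)
```
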
---

## Word2Vec Embeddings

The method requires pre-trained word embeddings in the .kv, .bin, or .txt format (or a .gz compression of any of these).
To capture semantic information without frequency-based artifacts, it is recommended to apply L2 normalization
and the All-But-The-Top (ABTT) transformation to the embeddings before fitting SSD.

```python
from ssdiff import load_embeddings, normalize_kv

MODEL_PATH = r"NLPResources\word2vec_model.kv"

kv = load_embeddings(MODEL_PATH)          # load embeddings

kv = normalize_kv(kv, l2=True, abtt_m=1)  # L2 + ABTT (remove top-1 PC)
```

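For reference, the ABTT step amounts to centering the embedding matrix and removing its projections onto the top-m principal components. `normalize_kv` handles this for you; the sketch below only illustrates the idea and is not the package's implementation:

```python
import numpy as np

def abtt(embeddings: np.ndarray, m: int = 1) -> np.ndarray:
    """All-But-The-Top: center the vectors and remove their projections
    onto the top-m principal components (illustrative sketch)."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:m]                              # (m, dim) top principal directions
    return centered - centered @ top.T @ top  # remove the top-m projections

vecs = np.random.default_rng(0).normal(size=(1000, 300))
post = abtt(vecs, m=1)
print(post.shape)  # (1000, 300)
```
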
The model is not included in the package and will differ depending on your language and domain.
Look for pre-trained embeddings in your language; the more data they were trained on, the better.
Pay attention to the vocabulary coverage of your texts.

For Polish, the nkjp+wiki-lemmas-all-300-cbow-hs.txt.gz model (no. 25) from the [Polish Word2Vec model list](https://dsmodels.nlp.ipipan.waw.pl) was found to work well.

---

## Preprocessing (spaCy)

SSD uses spaCy to keep original sentences and lemmas aligned for later snippet extraction.

```python
from ssdiff import load_spacy, load_stopwords, preprocess_texts, build_docs_from_preprocessed

nlp = load_spacy("pl_core_news_lg")   # or another language model
stopwords = load_stopwords("pl")      # same stopword source across app & package

pre_docs = preprocess_texts(texts_raw, nlp, stopwords)
docs = build_docs_from_preprocessed(pre_docs)  # → list[list[str]] (lemmas without stopwords/punct)
```

Each PreprocessedDoc stores:

- **raw**: original raw text
- **sents_surface**: list[str], original sentences
- **sents_lemmas**: list[list[str]]
- **doc_lemmas**: flattened lemmas (list[str])
- **sent_char_spans**: list of (start_char, end_char) per sentence
- **token_to_sent**: index mapping lemma positions → sentence index

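Assuming these fields are exposed as plain attributes (an illustrative sketch; adjust to how your version exposes them), a quick way to sanity-check the sentence/lemma alignment on one document:

```python
# Inspect the first preprocessed document from the Quickstart pipeline.
pd0 = pre_docs[0]
print(pd0.sents_surface[0])   # first original sentence
print(pd0.sents_lemmas[0])    # its lemma list
print(pd0.doc_lemmas[:10])    # first lemmas of the flattened document

# token_to_sent maps a position in doc_lemmas back to its sentence index,
# which is what snippet extraction uses to recover the surface text.
first_lemma_sentence = pd0.token_to_sent[0]
print(pd0.sents_surface[first_lemma_sentence])
```
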
spaCy models for various languages can be found [here](https://spacy.io/models).

To install a spaCy model, run e.g.:
```bash
python -m spacy download pl_core_news_lg
```

---

## Lexicon Utilities

These helpers make lexicon selection transparent and data-driven (you can also hand-pick tokens).

### `suggest_lexicon(...)`

Ranks tokens by balanced coverage, with a mild penalty for strong correlation with the outcome:
- `cov_all`: fraction of essays containing the token (0/1 presence)
- `cov_bal`: average presence across 𝑛 quantile bins of 𝑦 (default: 4 bins)
- `corr`: Pearson correlation between 0/1 presence and standardized 𝑦
- `rank = cov_bal * (1 - min(1, |corr|/corr_cap))` (default `corr_cap=0.30`)

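As a quick worked example of the ranking formula (hypothetical numbers, not real output):

```python
# Hypothetical token: present in 35% of essays on average across y-bins,
# and its presence correlates r = 0.15 with standardized y.
cov_bal, corr, corr_cap = 0.35, 0.15, 0.30
rank = cov_bal * (1 - min(1, abs(corr) / corr_cap))
print(rank)  # 0.35 * (1 - 0.5) = 0.175; a token with |corr| >= 0.30 would rank 0.0
```
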
Accepts a DataFrame (`text_col`, `score_col`) or a `(texts, y)` tuple where texts can be raw strings or token lists.

```python
from ssdiff import suggest_lexicon

# Using a DataFrame
cands_df = suggest_lexicon(df, text_col="lemmatized", score_col="questionnaire_result", top_k=150)

# Or using a tuple (texts, y)
texts = [" ".join(doc) for doc in docs]
cands_df2 = suggest_lexicon((docs, y), top_k=150)
```

### `token_presence_stats(...)`

Per-token coverage & correlation diagnostics:
```python
from ssdiff import token_presence_stats
stats = token_presence_stats((texts, y), token="concept_keyword_1", n_bins=4, verbose=True)
print(stats)  # dict: token, docs, cov_all, cov_bal, corr, rank
```

### `coverage_by_lexicon(...)`

Summary for your chosen lexicon:
- `summary`: `docs_any`, `cov_all`, `q1`, `q4`, `corr_any`
- `q1` / `q4`: coverage within the lowest/highest 𝑦 bins (by quantile)
- `per_token_df`: stats for each token

```python
from ssdiff import coverage_by_lexicon

summary, per_tok = coverage_by_lexicon(
    (texts, y),
    lexicon={"concept_keyword_1", "concept_keyword_2", "concept_keyword_3", "concept_keyword_4"},
    n_bins=4,
    verbose=True,
)
```

---

## Fitting SSD

Instantiate `SSD` with your normalized embeddings, tokenized documents, numeric outcome, and lexicon:
```python
from ssdiff import SSD, load_embeddings, normalize_kv

kv = normalize_kv(load_embeddings(MODEL_PATH), l2=True, abtt_m=1)

PCA_K = min(10, max(3, len(docs) // 10))
ssd = SSD(
    kv=kv,
    docs=docs,
    y=y,
    lexicon={"klimat", "klimatyczny", "zmiana"},
    l2_normalize_docs=True,
    N_PCA=PCA_K,
    use_unit_beta=True,   # unit β̂ for neighbors/interpretation
    windpow=3,            # context window ±3 tokens
    SIF_a=1e-3,           # SIF weighting parameter
)

print(ssd.r2, ssd.f_stat, ssd.f_pvalue)
```
Key outputs attached to the instance:
- `beta` / `beta_unit` — semantic gradient (doc space)
- `r2`, `r2_adj`, `f_stat`, `f_pvalue` — regression fit stats
- `beta_norm_stdCN` — ||β|| in SD(y) per +1.0 cosine
- `delta_per_0p10_raw` — change in raw 𝑦 per +0.10 cosine
- `iqr_effect_raw` — IQR (of cosine) × slope, in raw 𝑦 units
- `y_corr_pred` — correlation of standardized 𝑦 with predicted values

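Given the definitions above, the raw-scale effect sizes follow from `beta_norm_stdCN` and the spread of 𝑦; a rough back-of-the-envelope check with hypothetical numbers (not output from the package):

```python
# Hypothetical fit: a +1.0 change in cosine alignment corresponds to 0.8 SD of y,
# and the outcome y has a standard deviation of 12 raw points.
beta_norm_stdCN = 0.8
sd_y = 12.0

delta_per_0p10_raw = 0.10 * beta_norm_stdCN * sd_y  # about 0.96 raw points per +0.10 cosine
iqr_cos = 0.25                                      # interquartile range of per-essay cosines
iqr_effect_raw = iqr_cos * beta_norm_stdCN * sd_y   # about 2.4 raw points across the middle 50%
print(delta_per_0p10_raw, iqr_effect_raw)
```
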
---

## Neighbors & Clustering

### Nearest neighbors

Get the top N nearest neighbors of +β̂/−β̂:

```python
top_words = ssd.top_words(n=20, verbose=True)
```

### Clustering neighbors into themes
Use `cluster_neighbors` to group the top N neighbors of +β̂/−β̂ into k clusters (k-means; Euclidean on unit vectors ≈ cosine):

```python
df_clusters, df_members = ssd.cluster_neighbors(
    topn=100,
    k=None,
    k_min=2,
    k_max=10,
    random_state=13,   # for reproducibility
    top_words=5,
    verbose=True,
)
```

Returns:
- df_clusters (one row per cluster):
  - side, cluster_rank, size, centroid_cos_beta, coherence, top_words
- df_members (one row per word):
  - side, cluster_rank, word, cos_to_centroid, cos_to_beta

The raw clusters (with all per-word cosines and internal ids) are kept internally as:
- ssd.pos_clusters_raw
- ssd.neg_clusters_raw

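A note on the distance metric used above: for unit-length vectors u and v, ‖u − v‖² = 2(1 − cos(u, v)), so k-means with Euclidean distance on L2-normalized word vectors orders pairs exactly as cosine similarity would; this identity is what the "≈ cosine" shorthand refers to.
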
---

## Interpreting with Snippets
After clustering, SSD lets you **link the abstract directions in embedding space back to actual language** by inspecting **text snippets**.
The snippet extraction:
1. Locates each **occurrence of a seed word** (from your lexicon) in the corpus.
2. Extracts a **small window of surrounding context** (±3 tokens).
3. Represents that window as a **SIF-weighted context vector** in the same embedding space as β̂ and the cluster centroids.
4. Computes the **cosine similarity** between each such local context vector and
   - a **cluster centroid** (to find passages representative of that theme), or
   - the overall **semantic gradient β̂** (to find passages aligned with the global direction).

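Step 4 is just a cosine ranking; a minimal sketch of that scoring step, assuming the seed-context window vectors have already been built (for instance as in the sketch under Core Concepts) and stacked into a matrix (`W`, `c`, and `rank_windows` are made up for the illustration):

```python
import numpy as np

def rank_windows(W: np.ndarray, c: np.ndarray, top: int = 5) -> np.ndarray:
    """Return the indices of the `top` window vectors most aligned with direction c."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    cn = c / np.linalg.norm(c)
    cosines = Wn @ cn
    return np.argsort(-cosines)[:top]  # descending cosine

rng = np.random.default_rng(0)
W = rng.normal(size=(40, 50))   # 40 seed-context windows, 50-dim embeddings
c = rng.normal(size=50)         # a cluster centroid (or β̂)
print(rank_windows(W, c))       # indices of the best-matching windows/snippets
```
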
### Snippets by cluster centroids

```python
snips = ssd.cluster_snippets(
    pre_docs=pre_docs,      # from preprocess_texts(...)
    side="both",            # "pos", "neg", or "both"
    window_sentences=1,     # [sent-1, sent, sent+1]
    top_per_cluster=100,    # keep best K per cluster
)

df_pos_snip = snips["pos"]  # columns: centroid_label, doc_id, cosine, seed, sentence_before, sentence_anchor, sentence_after, window_text_surface, ...
df_neg_snip = snips["neg"]
```
Each returned row represents a seed occurrence window, not a whole essay.
The `cosine` column is the similarity between the context vector (built around that seed occurrence) and the cluster centroid.
Surface text (`sentence_before`, `sentence_anchor`, `sentence_after`) lets you read the passage in context.

### Snippets along β̂
You can also extract windows that best illustrate the main semantic direction (rather than specific clusters):
```python
beta_snips = ssd.beta_snippets(
    pre_docs=pre_docs,
    window_sentences=1,
    seeds=ssd.lexicon,
    top_per_side=200,
)

df_beta_pos = beta_snips["beta_pos"]
df_beta_neg = beta_snips["beta_neg"]
```
Here, the cosine is taken between each seed-centered context vector and β̂ (the main semantic gradient).
Sorting by this cosine reveals which local language usages most strongly express the positive or negative pole of your concept.

---

## Per-Essay SSD Scores

The **SSD score** for each essay quantifies **how closely the text’s meaning aligns with the main semantic direction (β̂)** discovered by the model.
These scores can be used for individual-difference analyses, correlations with psychological scales, or visualization of semantic alignment across groups.

Internally, each essay is represented by a **SIF-weighted average of local context vectors** (around the lexicon seeds).
The SSD score is then computed as the **cosine similarity between that essay’s vector and β̂**.
In addition, the model’s regression weights allow you to compute the **predicted outcome** for each essay — both in standardized units and in the original scale of your dependent variable.

### How scores are computed

For each document `i`:
- `x_i` — document vector (normalized if `l2_normalize_docs=True`)
- `β̂` — unit semantic gradient in embedding space
- `cos[i] = cos(x_i, β̂)` → **semantic alignment score**
- `yhat_std[i] = x_i · β` → predicted standardized outcome
- `yhat_raw[i] = mean(y) + std(y) * yhat_std[i]` → prediction in original units

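In plain NumPy terms, the quantities above amount to the following (an illustrative sketch with made-up `X`, `beta`, and `y`; not the package's code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))              # per-essay vectors built from seed contexts
beta = rng.normal(size=50)                  # fitted semantic gradient (same space)
y = rng.normal(loc=50, scale=10, size=100)  # raw outcome; only its mean/std are used below

X = X / np.linalg.norm(X, axis=1, keepdims=True)  # corresponds to l2_normalize_docs=True
beta_unit = beta / np.linalg.norm(beta)

cos = X @ beta_unit                       # cos[i]: semantic alignment score
yhat_std = X @ beta                       # yhat_std[i]: predicted standardized outcome
yhat_raw = y.mean() + y.std() * yhat_std  # yhat_raw[i]: prediction on the original scale
print(cos[:3], yhat_raw[:3])
```
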
These are available for **all documents**, with NaNs for those that did not contain any lexicon occurrences (i.e., were dropped before fitting).

```python
scores = ssd.ssd_scores(
    docs,              # list[list[str]]
    include_all=True,  # include all docs, even those dropped due to no seed contexts
)
```

Returned columns:
- `doc_index` — original document index (0-based)
- `kept` — whether the essay had valid seed contexts (True/False)
- `cos` — cosine alignment of the essay vector with β̂
- `yhat_std` — predicted outcome (standardized units)
- `yhat_raw` — predicted outcome (original scale of your dependent variable)
- `y_true_std` — true standardized outcome (NaN for dropped docs)
- `y_true_raw` — true raw outcome (NaN for dropped docs)

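For example, assuming `ssd_scores` returns these columns as a pandas DataFrame, relating the alignment scores back to the observed outcome takes one line (illustrative usage, reusing `scores` from the Quickstart):

```python
# Keep only essays that actually contained seed contexts, then correlate
# semantic alignment with the observed outcome.
kept = scores[scores["kept"]]
print(kept["cos"].corr(kept["y_true_raw"]))  # Pearson r between alignment and raw outcome
```
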
---

## API Summary
The `ssdiff` top-level package re-exports the main objects so you can write:

```python
from ssdiff import (
    SSD,  # the analysis class (fit, neighbors, clustering, snippets, scores)
    load_embeddings, normalize_kv,
    load_spacy, load_stopwords, preprocess_texts, build_docs_from_preprocessed,
    suggest_lexicon, token_presence_stats, coverage_by_lexicon,
)
```

### `SSD` (class)

- `__init__(kv, docs, y, lexicon, *, l2_normalize_docs=True, N_PCA=20, use_unit_beta=True)`
- Attributes after fit: `beta`, `beta_unit`, `r2`, `f_stat`, `f_pvalue`, `beta_norm_stdCN`,
  `delta_per_0p10_raw`, `iqr_effect_raw`, `y_corr_pred`, `n_kept`, etc.
- Methods:
  - `nbrs(sign=+1, n=20)` → list[(word, cosine)]
  - `cluster_neighbors_sign(side="pos", topn=100, k=None, k_min=2, k_max=10, restrict_vocab=50000, random_state=13, min_cluster_size=2, top_words=10, verbose=False)` → `(df_clusters, df_members)` and stores raw clusters in `pos_clusters_raw`/`neg_clusters_raw`
  - `snippets_from_clusters(pre_docs, window_sentences=1, seeds=None, sif_a=1e-3, top_per_cluster=100)` → dict with `"pos"`/`"neg"` DataFrames
  - `snippets_along_beta(pre_docs, window_sentences=1, seeds=None, sif_a=1e-3, top_per_side=200)` → dict with `"beta_pos"`/`"beta_neg"` DataFrames
  - `ssd_scores(docs)` → numpy array of per-essay cosines

### Embeddings
- `load_embeddings(path)` → `gensim.models.KeyedVectors`
- `normalize_kv(kv, l2=True, abtt_m=0)` → new KeyedVectors with L2 + optional ABTT (“all-but-the-top”, top-m PCs removed)

### Preprocessing
- `load_spacy(model_name="pl_core_news_lg")` → spaCy nlp
- `load_stopwords(lang="pl")` → list of stopwords (remote Polish list with sensible fallback)
- `preprocess_texts(texts, nlp, stopwords)` → list of PreprocessedDoc
- `build_docs_from_preprocessed(pre_docs)` → list[list[str]] (lemmas for modeling)

### Lexicon
- `suggest_lexicon(df_or_tuple, text_col=None, score_col=None, top_k=150, min_docs=5, n_bins=4, corr_cap=0.30)` → DataFrame
- `token_presence_stats(df_or_tuple, token, n_bins=4, corr_cap=0.30, verbose=False)` → dict
- `coverage_by_lexicon(df_or_tuple, lexicon, n_bins=4, verbose=False)` → `(summary, per_token_df)`

---

## Citing & License

- License: MIT (see LICENSE).
- If you use SSD in published work, please cite the package (and the classic Semantic Differential literature that motivated the method).
- A suggested citation:

---

## Questions / Contributions
- File issues and feature requests on the repo’s Issues page.
- Pull requests welcome — especially for:
  - Robustness diagnostics and visualization helpers
  - Documentation improvements

Contact: hplisiecki@gmail.com