ssdiff 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
ssdiff-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 Hubert Plisiecki
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
ssdiff-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,500 @@
1
+ Metadata-Version: 2.4
2
+ Name: ssdiff
3
+ Version: 0.1.0
4
+ Summary: Supervised Semantic Differential (SSD): interpretable, embedding-based analysis of concept meaning in text.
5
+ Author-email: Hubert Plisiecki <hplisiecki@gmail.com>, Paweł Lenartowicz <pawellenartowicz@europe.com>
6
+ License-Expression: MIT
7
+ Project-URL: Homepage, https://github.com/hplisiecki/Supervised-Semantic-Differential
8
+ Project-URL: Repository, https://github.com/hplisiecki/Supervised-Semantic-Differential
9
+ Project-URL: Issues, https://github.com/hplisiecki/Supervised-Semantic-Differential/issues
10
+ Project-URL: Changelog, https://github.com/hplisiecki/Supervised-Semantic-Differential/releases
11
+ Keywords: NLP,semantics,word embeddings,psychometrics,semantic differential,computational social science
12
+ Classifier: Programming Language :: Python :: 3
13
+ Classifier: Programming Language :: Python :: 3 :: Only
14
+ Classifier: Programming Language :: Python :: 3.9
15
+ Classifier: Programming Language :: Python :: 3.10
16
+ Classifier: Programming Language :: Python :: 3.11
17
+ Classifier: Programming Language :: Python :: 3.12
18
+ Classifier: Operating System :: OS Independent
19
+ Classifier: Intended Audience :: Science/Research
20
+ Classifier: Topic :: Scientific/Engineering :: Information Analysis
21
+ Requires-Python: >=3.9
22
+ Description-Content-Type: text/markdown
23
+ License-File: LICENSE
24
+ Requires-Dist: numpy>=1.26.4
25
+ Requires-Dist: pandas>=2.1.4
26
+ Requires-Dist: scikit-learn>=1.3.2
27
+ Requires-Dist: gensim>=4.3.2
28
+ Requires-Dist: spacy>=3.7.2
29
+ Requires-Dist: requests>=2.32.5
30
+ Provides-Extra: dev
31
+ Requires-Dist: pytest>=7.0; extra == "dev"
32
+ Requires-Dist: pytest-cov>=4.0; extra == "dev"
33
+ Requires-Dist: ruff>=0.5; extra == "dev"
34
+ Requires-Dist: mypy>=1.6; extra == "dev"
35
+ Requires-Dist: build>=1.0.0; extra == "dev"
36
+ Requires-Dist: twine>=4.0; extra == "dev"
37
+ Provides-Extra: gui
38
+ Requires-Dist: PySide6>=6.6; extra == "gui"
39
+ Dynamic: license-file
40
+
41
+ # Supervised Semantic Differential (SSD)
42
+
43
+ **SSD** lets you recover **interpretable semantic directions** related to specific concepts directly from open-ended text and relate them to **numeric outcomes**
44
+ (e.g., psychometric scales, judgments). It builds per-essay concept vectors from **local contexts around seed words**,
45
+ learns a **semantic gradient** that best predicts the outcome, and then provides multiple interpretability layers:
46
+
47
+ - **Nearest neighbors** of each pole (+β̂ / −β̂)
48
+ - **Clustering** of neighbors into themes
49
+ - **Text snippets**: top sentences whose local contexts align with each cluster centroid or the β̂ axis
50
+ - **Per-essay scores** (cosine alignments) for further analysis
51
+
52
+ The goal of the package is to allow psycholinguistic researchers to draw data-driven insights about
53
+ how people use language depending on their attitudes, traits, or other numeric variables of interest.
54
+
55
+ ---
56
+
57
+ ## Table of Contents
58
+
59
+ - [Installation](#installation)
60
+ - [Quickstart](#quickstart)
61
+ - [Core Concepts](#core-concepts)
62
+ - [Preprocessing (spaCy)](#preprocessing-spacy)
63
+ - [Lexicon Utilities](#lexicon-utilities)
64
+ - [Fitting SSD](#fitting-ssd)
65
+ - [Neighbors & Clustering](#neighbors--clustering)
66
+ - [Interpreting with Snippets](#interpreting-with-snippets)
67
+ - [Per-Essay SSD Scores](#per-essay-ssd-scores)
68
+ - [API Summary](#api-summary)
69
+ - [Citing & License](#citing--license)
70
+
71
+ ---
72
+
73
+ ## Installation
74
+
75
+ ```bash
76
+ pip install ssdiff
77
+ ```
78
+
79
+ Dependencies (installed automatically): `numpy`, `pandas`, `scikit-learn`, `gensim`, `spacy`, `requests`.
80
+
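+ The package metadata also declares optional extras, which can be requested at install time if you need the GUI or the development tooling:
+
+ ```bash
+ # optional extras declared in the package metadata
+ pip install "ssdiff[gui]"   # GUI dependencies (PySide6)
+ pip install "ssdiff[dev]"   # pytest, pytest-cov, ruff, mypy, build, twine
+ ```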
81
+
82
+ ---
83
+
84
+ ## Quickstart
85
+
86
+ Below is an end-to-end minimal example using the Polish model and an example dataset.
87
+ Adjust paths and column names to your data.
88
+
89
+ ```python
90
+ from ssdiff import (
91
+ SSD, load_embeddings, normalize_kv,
92
+ load_spacy, load_stopwords, preprocess_texts, build_docs_from_preprocessed,
93
+ suggest_lexicon, token_presence_stats, coverage_by_lexicon,
94
+ )
95
+
96
+ import pandas as pd
97
+
98
+ MODEL_PATH = r"NLPResources\word2vec_model.kv"
99
+ DATA_PATH = r"data\example_dataset.csv"
100
+
101
+ # 1) Load and normalize embeddings (L2 + ABTT on word space)
102
+ kv = normalize_kv(load_embeddings(MODEL_PATH), l2=True, abtt_m=1)
103
+
104
+ # 2) Load your data
105
+ df = pd.read_csv(DATA_PATH)
106
+ text_raw_col = "text_raw" # column with raw text
107
+ y_col = "questionnaire_result" # numeric outcome column
108
+
109
+ # 3) Preprocess (spaCy) — keep original sentences and lemmas linked
110
+ nlp = load_spacy("pl_core_news_lg") # Polish spaCy model
111
+ stopwords = load_stopwords("pl") # Polish stopwords
112
+ texts_raw = df[text_raw_col].fillna("").astype(str).tolist()
113
+ pre_docs = preprocess_texts(texts_raw, nlp, stopwords)
114
+
115
+ # 4) Build lemma docs for modeling and filter to non-NaN y
116
+ docs = build_docs_from_preprocessed(pre_docs) # list[list[str]]
117
+ y = pd.to_numeric(df[y_col], errors="coerce")
118
+ mask = ~y.isna()
119
+ docs = [docs[i] for i in range(len(docs)) if mask.iat[i]]
120
+ pre_docs = [pre_docs[i] for i in range(len(pre_docs)) if mask.iat[i]]
121
+ y = y[mask].to_numpy()
122
+
123
+ # 5) Define a lexicon (tokens must match your preprocessing); see the Lexicon Utilities section for data-driven selection
124
+ lexicon = {"concept_keyword_1", "concept_keyword_2", "concept_keyword_3", "concept_keyword_4"} # keywords that define your concept
125
+
126
+ # 6) Choose PCA dimensionality based on sample size
127
+ n_kept = len(docs)
128
+ PCA_K = min(10, max(3, n_kept // 10))
129
+
130
+ # 7) Fit SSD
131
+ ssd = SSD(
132
+ kv=kv,
133
+ docs=docs,
134
+ y=y,
135
+ lexicon=lexicon,
136
+ l2_normalize_docs=True, # normalize per-essay vectors
137
+ N_PCA=PCA_K,
138
+ use_unit_beta=True, # unit β̂ for neighbors/interpretation
139
+ windpow = 3, # context window ±3 tokens
140
+ SIF_a = 1e-3, # SIF weighting parameter
141
+ )
142
+
143
+ # 8) Inspect regression readout
144
+ print({
145
+ "R2": ssd.r2,
146
+ "adj_R2": float(getattr(wa, "r2_adj", float("nan"))),
147
+ "F": ssd.f_stat,
148
+ "p": ssd.f_pvalue,
149
+ "beta_norm": ssd.beta_norm_stdCN, # ||β|| in SD(y) per +1.0 cosine
150
+ "delta_per_0.10_raw": ssd.delta_per_0p10_raw,
151
+ "IQR_effect_raw": ssd.iqr_effect_raw,
152
+ "corr_y_pred": ssd.y_corr_pred,
153
+ "n_raw": int(getattr(wa, "n_raw", len(docs))),
154
+ "n_kept": int(getattr(wa, "n_kept", len(docs))),
155
+ "n_dropped": int(getattr(wa, "n_dropped", 0)),
156
+ })
157
+
158
+ # 9) Neighbors
159
+ ssd.top_words(n = 20, verbose = True)
160
+
161
+ # 10) Cluster neighbors into themes (k chosen by silhouette when k=None)
162
+ df_clusters, df_members = ssd.cluster_neighbors(topn = 100, k=None, k_min = 2, k_max = 10, verbose = True, top_words = 5)
163
+
164
+ # 11) Snippets for interpretation
165
+ snips = ssd.cluster_snippets(pre_docs=pre_docs, side="both", window_sentences=1, top_per_cluster=100)
166
+ df_pos_snip = snips["pos"]
167
+ df_neg_snip = snips["neg"]
168
+
169
+ beta_snips = ssd.beta_snippets(pre_docs=pre_docs, window_sentences=1, top_per_side=200)
170
+ df_beta_pos = beta_snips["beta_pos"]
171
+ df_beta_neg = beta_snips["beta_neg"]
172
+
173
+ # 12) Per-essay SSD scores
174
+ scores = ssd.ssd_scores(docs, include_all=True)
175
+
176
+ ```
177
+ ---
178
+
179
+ ## Core Concepts
180
+
181
+ - **Seed lexicon**: a small set of tokens (lemmas) indicating the concept of interest (e.g., {klimat, klimatyczny, zmiana}).
182
+ - **Per-essay vector**: SIF-weighted average of context vectors around each seed occurrence (±3 tokens), then averaged across occurrences.
183
+ - **SSD fitting**: PCA on standardized doc vectors, OLS from the components to the standardized outcome 𝑦, then back-projection to doc space to get β (the semantic gradient); a sketch follows this list.
184
+ - **Interpretation**: nearest neighbors to +β̂/−β̂, clustering neighbors into themes, and showing original sentences whose local context aligns with centroids or β̂.
185
+
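+ To make the fitting step concrete, here is a minimal sketch of the PCA → OLS → back-projection pipeline described above (illustrative only; the `SSD` class performs these steps internally and its exact implementation may differ). It assumes `X` is the `(n_docs, dim)` matrix of per-essay vectors and `y` the numeric outcome:
+
+ ```python
+ import numpy as np
+ from sklearn.decomposition import PCA
+ from sklearn.linear_model import LinearRegression
+
+ def fit_semantic_gradient(X, y, n_pca=10):
+     # standardize doc vectors and the outcome
+     Xs = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
+     ys = (y - y.mean()) / y.std()
+     # PCA on standardized doc vectors, OLS from components to standardized y
+     pca = PCA(n_components=n_pca).fit(Xs)
+     ols = LinearRegression().fit(pca.transform(Xs), ys)
+     # back-project the component weights to doc space -> semantic gradient beta
+     beta = pca.components_.T @ ols.coef_
+     return beta / np.linalg.norm(beta)  # unit beta, used for interpretation
+ ```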
186
+ ---
187
+ ## Word2Vec Embeddings
188
+
189
+ The method requires pre-trained word embeddings in `.kv`, `.bin`, or `.txt` format (optionally `.gz`-compressed).
190
+ In order to capture only the semantic information, without frequency-based artifacts, it is recommended to apply L2 normalization
191
+ and All-But-The-Top (ABTT) transformation to the embeddings before fitting SSD.
192
+
193
+ ```python
194
+ from ssdiff import load_embeddings, normalize_kv
195
+
196
+ MODEL_PATH = r"NLPResources\word2vec_model.kv"
197
+
198
+ kv = load_embeddings(MODEL_PATH) # load embeddings
199
+
200
+ kv = normalize_kv(kv, l2=True, abtt_m=1) # L2 + ABTT (remove top-1 PC)
201
+ ```
202
+
203
+ The model is not included in the package, and will differ depending on your language and domain.
204
+ Look for pre-trained embeddings in your language; the more data they were trained on, the better.
205
+ Pay attention to the vocabulary coverage of your texts (see the quick check below).
206
+
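+ A rough coverage check (a sketch; it assumes `docs` and `kv` from the snippets above):
+
+ ```python
+ # fraction of lemma tokens that have an embedding
+ tokens = [tok for doc in docs for tok in doc]
+ covered = sum(1 for tok in tokens if tok in kv.key_to_index)
+ print(f"embedding coverage: {covered / max(1, len(tokens)):.1%}")
+ ```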
207
+ For Polish, the `nkjp+wiki-lemmas-all-300-cbow-hs.txt.gz` model (no. 25) from the [Polish Word2Vec model list](https://dsmodels.nlp.ipipan.waw.pl) was found to work well.
208
+
209
+ ---
210
+ ## Preprocessing (spaCy)
211
+
212
+ SSD uses spaCy to keep original sentences and lemmas aligned for later snippet extraction.
213
+
214
+ ```python
215
+ from ssdiff import load_spacy, load_stopwords, preprocess_texts, build_docs_from_preprocessed
216
+
217
+ nlp = load_spacy("pl_core_news_lg") # or another language model
218
+ stopwords = load_stopwords("pl") # same stopword source across app & package
219
+
220
+ pre_docs = preprocess_texts(texts_raw, nlp, stopwords)
221
+ docs = build_docs_from_preprocessed(pre_docs) # → list[list[str]] (lemmas without stopwords/punct)
222
+ ```
223
+
224
+ Each `PreprocessedDoc` stores:
225
+
226
+ - **raw**: original raw text
227
+ - **sents_surface**: list[str], original sentences
228
+ - **sents_lemmas**: list[list[str]]
229
+ - **doc_lemmas**: flattened lemmas (list[str])
230
+ - **sent_char_spans**: list of (start_char, end_char) per sentence
231
+ - **token_to_sent**: index mapping lemma positions → sentence index
232
+
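+ For example, you can inspect how lemmas map back to original sentences (a sketch, assuming the fields above are plain attributes and `token_to_sent` is indexable by lemma position):
+
+ ```python
+ doc0 = pre_docs[0]
+ print(doc0.sents_surface[0])         # first original sentence
+ print(doc0.sents_lemmas[0])          # its lemmas (stopwords/punctuation removed)
+ sent_idx = doc0.token_to_sent[0]     # sentence index of the first lemma
+ print(doc0.sents_surface[sent_idx])  # surface sentence containing that lemma
+ ```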
233
+ spaCy models for various languages can be found [here](https://spacy.io/models).
234
+
235
+ To install a spaCy model, run e.g.:
236
+ ```bash
237
+ python -m spacy download pl_core_news_lg
238
+ ```
239
+
240
+ ---
241
+ ## Lexicon Utilities
242
+
243
+ These helpers make lexicon selection transparent and data-driven (you can also hand-pick tokens).
244
+
245
+ ### `suggest_lexicon(...)`
246
+
247
+ Ranks tokens by balanced coverage, with a mild penalty for strong correlation with the outcome 𝑦:
248
+ - `cov_all`: fraction of essays containing the token (0/1 presence)
249
+ - `cov_bal`: average presence across 𝑛 quantile bins of 𝑦 (default: 4 bins)
250
+ - `corr`: Pearson correlation between 0/1 presence and standardized 𝑦
251
+ - `rank = cov_bal * (1 - min(1, |corr|/corr_cap))` (default `corr_cap=0.30`)
252
+
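+ For example (hypothetical numbers): a token with `cov_bal = 0.40` and `|corr| = 0.15` under the default `corr_cap = 0.30` gets `rank = 0.40 * (1 - 0.15/0.30) = 0.20`, while the same coverage with `|corr| >= 0.30` would be ranked 0.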
253
+ Accepts a DataFrame (`text_col`, `score_col`) or a `(texts, y)` tuple where texts can be raw strings or token lists.
254
+
255
+ ```python
256
+ from ssdiff import suggest_lexicon
257
+
258
+ # Using a DataFrame
259
+ cands_df = suggest_lexicon(df, text_col="lemmatized", score_col="questionnaire_result", top_k=150)
260
+
261
+ # Or using a tuple (texts, y)
262
+ texts = [" ".join(doc) for doc in docs]
263
+ cands_df2 = suggest_lexicon((docs, y), top_k=150)
264
+ ```
265
+ ### `token_presence_stats(...)`
266
+
267
+ Per-token coverage & correlation diagnostics:
268
+ ```python
269
+ from ssdiff import token_presence_stats
270
+ stats = token_presence_stats((texts, y), token="concept_keyword_1", n_bins=4, verbose=True)
271
+ print(stats) # dict: token, docs, cov_all, cov_bal, corr, rank
272
+ ```
273
+
274
+ ### `coverage_by_lexicon(...)`
275
+
276
+ Summary for your chosen lexicon:
277
+ - `summary`: `docs_any`, `cov_all`, `q1`, `q4`, `corr_any`
278
+ - `q1` / `q4`: coverage within the lowest/highest 𝑦 bins (by quantile)
279
+ - `per_token_df`: stats for each token
280
+
281
+ ```python
282
+ from ssdiff import coverage_by_lexicon
283
+
284
+ summary, per_tok = coverage_by_lexicon(
285
+ (texts, y),
286
+ lexicon={"concept_keyword_1", "concept_keyword_2", "concept_keyword_3", "concept_keyword_4"},
287
+ n_bins=4,
288
+ verbose=True
289
+ )
290
+ ```
291
+ ---
292
+ ## Fitting SSD
293
+
294
+ Instantiate `SSD` with your normalized embeddings, tokenized documents, numeric outcome, and lexicon:
295
+ ```python
296
+ from ssdiff import SSD, load_embeddings, normalize_kv
297
+
298
+ kv = normalize_kv(load_embeddings(MODEL_PATH), l2=True, abtt_m=1)
299
+
300
+ PCA_K = min(10, max(3, len(docs)//10))
301
+ ssd = SSD(
302
+ kv=kv,
303
+ docs=docs,
304
+ y=y,
305
+ lexicon={"klimat", "klimatyczny", "zmiana"},
306
+ l2_normalize_docs=True,
307
+ N_PCA=PCA_K,
308
+ use_unit_beta=True, # unit β̂ for neighbors/interpretation
309
+ windpow = 3, # context window ±3 tokens
310
+ SIF_a = 1e-3, # SIF weighting parameter
311
+ )
312
+
313
+ print(ssd.r2, ssd.f_stat, ssd.f_pvalue)
314
+ ```
315
+ Key outputs attached to the instance:
316
+ - `beta` / `beta_unit` — semantic gradient (doc space)
317
+ - `r2`, `r2_adj`, `f_stat`, `f_pvalue` — regression fit stats
318
+ - `beta_norm_stdCN` — ||β|| in SD(y) per +1.0 cosine
319
+ - `delta_per_0p10_raw` — change in raw 𝑦 per +0.10 cosine
320
+ - `iqr_effect_raw` — IQR of the cosine scores × slope, in raw 𝑦 units
321
+ - `y_corr_pred` — correlation of standardized 𝑦 with predicted values
322
+
323
+ ---
324
+ ## Neighbors & Clustering
325
+
326
+ ### Nearest neighbors
327
+
328
+ Get the top N nearest neighbors of +β̂/−β̂:
329
+
330
+ ```python
331
+ top_words = ssd.top_words(n = 20, verbose = True)
332
+ ```
333
+
334
+ ### Clustering neighbors into themes
335
+ Use `cluster_neighbors` to group the top N neighbors of +β̂/−β̂ into k clusters (k-means; Euclidean distance on unit vectors ≈ cosine):
336
+
337
+ ```python
338
+ df_clusters, df_members = ssd.cluster_neighbors(
+     topn=100,
+     k=None,             # pick k automatically within [k_min, k_max]
+     k_min=2,
+     k_max=10,
+     random_state=13,    # for reproducibility
+     top_words=5,
+     verbose=True,
+ )
346
+ ```
347
+
348
+ Returns:
+ - `df_clusters` (one row per cluster): `side`, `cluster_rank`, `size`, `centroid_cos_beta`, `coherence`, `top_words`
+ - `df_members` (one row per word): `side`, `cluster_rank`, `word`, `cos_to_centroid`, `cos_to_beta`
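+ For example, to review the positive-pole themes (assuming `df_clusters` is a pandas DataFrame with the columns above):
+
+ ```python
+ pos_themes = (
+     df_clusters[df_clusters["side"] == "pos"]
+     .sort_values("centroid_cos_beta", ascending=False)
+ )
+ print(pos_themes[["cluster_rank", "size", "top_words"]])
+ ```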
353
+
354
+ The raw clusters (with all per-word cosines and internal ids) are kept internally as:
355
+ - ssd.pos_clusters_raw
356
+ - ssd.neg_clusters_raw
357
+
358
+ ---
359
+ ## Interpreting with Snippets
360
+ After clustering, SSD lets you **link the abstract directions in embedding space back to actual language** by inspecting **text snippets**.
361
+ The procedure (sketched in code after this list):
362
+ 1. Locates each **occurrence of a seed word** (from your lexicon) in the corpus.
363
+ 2. Extracts a **small window of surrounding context** (±3 tokens).
364
+ 3. Represents that window as a **SIF-weighted context vector** in the same embedding space as β̂ and the cluster centroids.
365
+ 4. Computes the **cosine similarity** between each such local context vector and
366
+ - a **cluster centroid** (to find passages representative of that theme), or
367
+ - the overall **semantic gradient β̂** (to find passages aligned with the global direction).
368
+
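+ A minimal sketch of steps 2-4 (illustrative only, not the package's internal code). It assumes `kv` (normalized `KeyedVectors`), a lemma list `tokens` with a seed at position `i`, a word-frequency dict `freq` with total count `total`, and a direction vector `direction` (a cluster centroid or β̂):
+
+ ```python
+ import numpy as np
+
+ SIF_A, WINDOW = 1e-3, 3
+
+ def window_cosine(tokens, i, kv, freq, total, direction):
+     """Cosine between a SIF-weighted ±WINDOW context around position i and a direction."""
+     lo, hi = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
+     vecs, weights = [], []
+     for j in range(lo, hi):
+         if j == i or tokens[j] not in kv.key_to_index:
+             continue                        # skip the seed itself and OOV tokens
+         p = freq.get(tokens[j], 1) / total  # relative word frequency
+         vecs.append(kv[tokens[j]])
+         weights.append(SIF_A / (SIF_A + p))
+     if not vecs:
+         return None
+     v = np.average(np.asarray(vecs), axis=0, weights=weights)
+     v /= np.linalg.norm(v) + 1e-12
+     return float(v @ (direction / np.linalg.norm(direction)))
+ ```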
369
+ ### Snippets by cluster centroids
370
+
371
+ ```python
372
+ snips = ssd.cluster_snippets(
373
+ pre_docs=pre_docs, # from preprocess_texts(...)
374
+ side="both", # "pos", "neg", or "both"
375
+ window_sentences=1, # [sent-1, sent, sent+1]
376
+ top_per_cluster=100, # keep best K per cluster
377
+ )
378
+
379
+ df_pos_snip = snips["pos"] # columns: centroid_label, doc_id, cosine, seed, sentence_before, sentence_anchor, sentence_after, window_text_surface, ...
380
+ df_neg_snip = snips["neg"]
381
386
+ ```
387
+ Each returned row represents a seed occurrence window, not a whole essay.
388
+ The `cosine` column is the similarity between the context vector (built around that seed occurrence) and the cluster centroid.
389
+ Surface text (`sentence_before`, `sentence_anchor`, `sentence_after`) lets you read the passage in context.
390
+
391
+ ### Snippets along β̂
392
+ You can also extract windows that best illustrate the main semantic direction (rather than specific clusters):
393
+ ```python
394
+
395
+ beta_snips = ssd.beta_snippets(
396
+ pre_docs=pre_docs,
397
+ window_sentences=1,
398
+ seeds=ssd.lexicon,
399
+ top_per_side=200,
400
+ )
401
+
402
+ df_beta_pos = beta_snips["beta_pos"]
403
+ df_beta_neg = beta_snips["beta_neg"]
404
+ ```
405
+ Here, the cosine is taken between each seed-centered context vector and β̂ (the main semantic gradient).
406
+ Sorting by this cosine reveals which local language usages most strongly express the positive or negative pole of your concept.
407
+
408
+ ---
409
+ ## Per-Essay SSD Scores
410
+
411
+ The **SSD score** for each essay quantifies **how closely the text’s meaning aligns with the main semantic direction (β̂)** discovered by the model.
412
+ These scores can be used for individual-difference analyses, correlations with psychological scales, or visualization of semantic alignment across groups.
413
+
414
+ Internally, each essay is represented by a **SIF-weighted average of local context vectors** (around the lexicon seeds).
415
+ The SSD score is then computed as the **cosine similarity between that essay’s vector and β̂**.
416
+ In addition, the model’s regression weights allow you to compute the **predicted outcome** for each essay — both in standardized units and in the original scale of your dependent variable.
417
+
418
+
419
+ ### How scores are computed
420
+
421
+ For each document `i`:
+ - `x_i` — document vector (normalized if `l2_normalize_docs=True`)
+ - `β̂` — unit semantic gradient in embedding space
424
+ - `cos[i] = cos(x_i, β̂)` → **semantic alignment score**
425
+ - `yhat_std[i] = x_i · β` → predicted standardized outcome
426
+ - `yhat_raw[i] = mean(y) + std(y) * yhat_std[i]` → prediction in original units
427
+
428
+ These are available for **all documents**, with NaNs for those that did not contain any lexicon occurrences (i.e., were dropped before fitting).
429
+
430
+ ```python
431
+ scores = ssd.ssd_scores(
432
+ docs, # list[list[str]]
433
+ include_all=True) # include all docs, even those dropped due to no seed contexts
434
+
435
+ ```
436
+
437
+ Returned columns:
438
+ - `doc_index`: original document index (0-based)
+ - `kept`: whether the essay had valid seed contexts (True/False)
+ - `cos`: cosine alignment of the essay vector to β̂
+ - `yhat_std`: predicted outcome (standardized units)
+ - `yhat_raw`: predicted outcome (original scale of your dependent variable)
+ - `y_true_std`: true standardized outcome (NaN for dropped docs)
+ - `y_true_raw`: true raw outcome (NaN for dropped docs)
445
+
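+ For example, to relate the alignment scores back to the original dataframe (a sketch; it assumes `scores` is a pandas DataFrame with the columns above and that `docs` corresponds to the rows of `df` kept by `mask`, as in the Quickstart):
+
+ ```python
+ df_kept = df[mask].reset_index(drop=True)
+ df_kept["ssd_cos"] = scores["cos"].to_numpy()
+ # correlation between semantic alignment and the questionnaire outcome
+ print(df_kept[["questionnaire_result", "ssd_cos"]].corr())
+ ```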
446
+ ---
447
+ ## API Summary
448
+ The `ssdiff` top-level package re-exports the main objects so you can write:
449
+
450
+ ```python
451
+ from ssdiff import (
452
+ SSD, # the analysis class (fit, neighbors, clustering, snippets, scores)
453
+ load_embeddings, normalize_kv,
454
+ load_spacy, load_stopwords, preprocess_texts, build_docs_from_preprocessed,
455
+ suggest_lexicon, token_presence_stats, coverage_by_lexicon,
456
+ )
457
+ ```
458
+
459
+ ### `SSD` (class)
460
+
461
+ - `__init__(kv, docs, y, lexicon, *, l2_normalize_docs=True, N_PCA=20, use_unit_beta=True)`
462
+ - Attributes after fit: `beta`, `beta_unit`, `r2`, `f_stat`, `f_pvalue`, `beta_norm_stdCN`,
463
+ `delta_per_0p10_raw`, `iqr_effect_raw`, `y_corr_pred`, `n_kept`, etc.
464
+ - Methods:
465
+ - `nbrs(sign=+1, n=20)` → list[(word, cosine)]
466
+ - `cluster_neighbors_sign(side="pos", topn=100, k=None, k_min=2, k_max=10, restrict_vocab=50000, random_state=13, min_cluster_size=2, top_words=10, verbose=False)` → `(df_clusters, df_members)` and stores raw clusters in `pos_clusters_raw`/`neg_clusters_raw`
467
+ - `snippets_from_clusters(pre_docs, window_sentences=1, seeds=None, sif_a=1e-3, top_per_cluster=100)` → dict with `"pos"`/`"neg"` DataFrames
468
+ - `snippets_along_beta(pre_docs, window_sentences=1, seeds=None, sif_a=1e-3, top_per_side=200)` → dict with `"beta_pos"`/`"beta_neg"` DataFrames
469
+ - `ssd_scores(docs, include_all=True)` → per-essay scores (cosine alignment to β̂ plus predicted outcomes; see [Per-Essay SSD Scores](#per-essay-ssd-scores))
470
+
471
+ ### Embeddings
472
+ - `load_embeddings(path)` → `gensim.models.KeyedVectors`
473
+ - `normalize_kv(kv, l2=True, abtt_m=0)` → new KeyedVectors with L2 + optional ABTT (“all-but-the-top”, top-m PCs removed)
474
+
475
+ ### Preprocessing
476
+ - `load_spacy(model_name="pl_core_news_lg")` → spaCy nlp
477
+ - `load_stopwords(lang="pl")` → list of stopwords (remote Polish list with sensible fallback)
478
+ - `preprocess_texts(texts, nlp, stopwords)` → list of PreprocessedDoc
479
+ - `build_docs_from_preprocessed(pre_docs)` → list[list[str]] (lemmas for modeling)
480
+
481
+ ### Lexicon
482
+ - `suggest_lexicon(df_or_tuple, text_col=None, score_col=None, top_k=150, min_docs=5, n_bins=4, corr_cap=0.30)` → DataFrame
483
+ - `token_presence_stats(df_or_tuple, token, n_bins=4, corr_cap=0.30, verbose=False)` → dict
484
+ - `coverage_by_lexicon(df_or_tuple, lexicon, n_bins=4, verbose=False)` → `(summary, per_token_df)`
485
+
486
+ ---
487
+ ## Citing & License
488
+
489
+ - License: MIT (see LICENSE).
490
+ - If you use SSD in published work, please cite the package (and the classic Semantic Differential literature that motivated the method).
491
+ - A suggested citation:
492
+
493
+ ---
494
+ ## Questions / Contributions
495
+ - File issues and feature requests on the repo’s Issues page.
496
+ - Pull requests welcome — especially for:
497
+ - Robustness diagnostics and visualization helpers
498
+ - Documentation improvements
499
+
500
+ Contact: hplisiecki@gmail.com