behave-text 0.1.0__tar.gz

Files changed (14)

behave_text-0.1.0/PKG-INFO +14 -0
behave_text-0.1.0/README.md +196 -0
behave_text-0.1.0/behave_text/__init__.py +0 -0
behave_text-0.1.0/behave_text/spec/__init__.py +43 -0
behave_text-0.1.0/behave_text/spec/envelope.py +53 -0
behave_text-0.1.0/behave_text/spec/primitives.py +353 -0
behave_text-0.1.0/behave_text.egg-info/PKG-INFO +14 -0
behave_text-0.1.0/behave_text.egg-info/SOURCES.txt +12 -0
behave_text-0.1.0/behave_text.egg-info/dependency_links.txt +1 -0
behave_text-0.1.0/behave_text.egg-info/requires.txt +7 -0
behave_text-0.1.0/behave_text.egg-info/top_level.txt +1 -0
behave_text-0.1.0/pyproject.toml +33 -0
behave_text-0.1.0/setup.cfg +4 -0
behave_text-0.1.0/tests/test_primitives.py +101 -0

behave_text-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,14 @@
+Metadata-Version: 2.4
+Name: behave-text
+Version: 0.1.0
+Summary: BEHAVE-TEXT — text/messaging-domain behavioral observation registry, layered on behave-core
+Author: ANTI
+License: GPL-3.0-or-later
+Project-URL: Source, https://git.resacachile.cl/anti/BEHAVE
+Requires-Python: >=3.11
+Requires-Dist: pydantic>=2.6
+Requires-Dist: behave-core>=0.1.0
+Provides-Extra: dev
+Requires-Dist: pytest>=8; extra == "dev"
+Requires-Dist: pytest-cov; extra == "dev"
+Requires-Dist: ruff; extra == "dev"

behave_text-0.1.0/README.md ADDED

@@ -0,0 +1,196 @@
+<!-- SPDX-License-Identifier: CC-BY-SA-4.0 -->
+# behave-text
+[← repo](../README.md)
+Text/messaging-domain behavioral observation registry. Defines what can be observed
+about an actor through their written messaging activity — stylometric fingerprints,
+lexical patterns, interaction rhythms, and governance-role signals.
+BEHAVE-TEXT operates on **derived features, not raw text**. Sensors hash, aggregate,
+and classify before emitting — the raw message content never enters a BEHAVE
+observation. This is a tighter constraint than BEHAVE-SHELL because the source
+signal *is* text content; the PII risk is higher.
+The topic prefix is `actor.observation.text` (not `attacker.`) because chat groups
+include non-attacker roles — admins, buyers, sellers, bots, lurkers. The framing
+is deliberately neutral: BEHAVE-TEXT observes actors, not adversaries.
+## Install
+```bash
+pip install -e ../core/ -e .
+# development:
+pip install -e ../core/ -e ".[dev]"
+```
+## Quickstart
+```python
+from behave_text.spec import Observation, Window, TOPIC_PREFIX, event_topic_for
+obs = Observation(
+    primitive="stylometric.capitalization_habit",
+    value="lowercase",
+    confidence=0.91,
+    window=Window(start_ts=1714000000.0, end_ts=1714086400.0),
+    source="behave/text-sensor/stylometry.py",
+)
+topic = event_topic_for("stylometric.capitalization_habit")
+# → "actor.observation.text.stylometric.capitalization_habit"
+```
+## Public API (`behave_text.spec`)
+| Symbol | Description |
+|---|---|
+| `Observation` | Registry-aware subclass of `behave_core.spec.Observation`. Validates `primitive` and `value` against `PRIMITIVE_REGISTRY`. |
+| `Window` | Re-exported from `behave_core`. |
+| `ObservationValue` | Re-exported union type. |
+| `PRIMITIVE_REGISTRY` | `dict[str, ValueTypeSpec]` — the full primitive catalog (35 entries). |
+| `ValueKind` | Enum: `CATEGORICAL`, `NUMERIC`, `HASH`, `ARRAY`, `FREE_STRING`, `BOOL`. |
+| `ValueTypeSpec` | Pydantic model: kind, allowed values, bounds, notes. |
+| `is_known(primitive)` | `bool` — whether a primitive path is registered. |
+| `get(primitive)` | Returns the `ValueTypeSpec`; raises `KeyError` if unknown. |
+| `TOPIC_PREFIX` | `"actor.observation.text"` |
+| `event_topic_for(primitive)` | Returns the full event bus topic string. |
+Note: `to_event_payload` / `from_event_payload` (full round-trip helpers) are
+present in `behave-shell` but not yet implemented here — `status: planned`.
+## Primitives
+35 primitives across 6 categories.
+---
+### `stylometric.*` — Writing style fingerprints (12 primitives)
+Stylometric primitives capture the unconscious writing habits that distinguish
+one author from another. The field goes back to the Mosteller-Wallace Federalist
+Papers study (1963): function-word frequencies alone can attribute authorship
+with high accuracy in long-form English text. BEHAVE-TEXT adapts these methods
+to short-form Spanish chat, which introduces domain-specific challenges (short
+messages, informal register, code-switching, emoji). Calibration results from
+the Rutify corpus are noted inline where they affect interpretation.
+| Primitive | Kind | Description |
+|---|---|---|
+| `stylometric.punctuation_style` | hash | Canonical punctuation-pattern fingerprint hash. Captures the author's consistent punctuation tics (double spaces, comma habits, no-period endings) as a searchable signature. |
+| `stylometric.capitalization_habit` | categorical | Dominant capitalization rule. `lowercase` = no capitals. `proper` = standard sentence/title case. `random_caps` = no consistent rule. `mixed_i` = consistent lowercase 'i' mid-sentence — common in Spanish chat where the standalone-'I' habit doesn't apply but the behavior transfers. |
+| `stylometric.emoji_usage` | categorical | Rate of emoji use. `none`, `occasional`, `frequent`, `exclusive` (messages rarely without emoji). Captures tone and register. |
+| `stylometric.emoji_placement` | categorical | Emoji position relative to sentence-ending punctuation. `pre_punctuation` = 'Hola 😊.' `post_punctuation` = 'Hola. 😊' The registry also accepts `no_punctuation` and `mixed`. Individual authors are strikingly consistent in this micro-habit. |
+| `stylometric.message_length_class` | categorical | Median message length bucket: `short` 1-5 words, `medium` 6-20, `long` 21-50, `paragraph` >50. See also `message_length_variance_class` for distribution shape. |
+| `stylometric.message_length_variance_class` | categorical | Distribution shape of per-message word counts. `tight` CV<0.5 (always 1-3 words). `varied` 0.5≤CV<1.5 (normal mix). `bimodal` CV≥1.5 (mostly short with occasional rants). Two authors can share the same median length but have wildly different variance. |
+| `stylometric.linebreak_style` | categorical | Whether the author sends one complete thought per message or bursts multiple short sequential messages. `single_thought` = one complete thought per message. `multi_line` = habitual 3-5 short messages per turn. `wall_of_text` = dense blocks, rarely uses line breaks. Captures a stylistic rhythm that is hard to consciously alter. |
+| `stylometric.typo_signature` | hash | SHA-256 of the canonical persistent-typo set — the specific recurring errors the author makes consistently (e.g. always writes `tener` as `tenet`, or `porque` as `xq`). Persistent typos are strong authorship signals because they reflect keyboard-motor habits. |
+| `stylometric.function_word_distribution_top50` | hash | 64-bit SimHash over the 50 most common Spanish function-word frequency vector. Based on the Mosteller-Wallace method. **Calibration note (2026-05-02, Rutify corpus):** within-author and cross-author Hamming distance distributions overlap (within median 8 bits, cross median 10 bits) in short-message chat — this primitive alone cannot discriminate authors. Engines should weight it low and composite with character n-grams and distinctive vocabulary. Kept in v0 for calibration grids. |
+| `stylometric.function_word_distribution_top200` | hash | 64-bit SimHash over the 200 most common Spanish function words. The wider list reaches into the long tail (rare-but-individual words like `tampoco`, `aunque`, `mientras`) that carry more discriminating signal in short-message corpora. Not yet emitted by v0 prototype — populated in v0.2. |
+| `stylometric.character_ngram_simhash` | hash | 64-bit SimHash over character n-gram frequencies (default n=3), lowercased. Orthogonal to function-word distributions: captures punctuation tics, accent-stripping habits, typo patterns, and idiom fragments that survive paraphrase. Accents are preserved because accent-stripping is itself a stylistic tic. Source label declares n size (e.g. `#char3gram`). |
+| `stylometric.distinctive_vocabulary_signature` | hash | 64-bit SimHash over a TF-IDF-weighted top-K rare-word vector. Captures the author's distinctive lexicon — words they use that other authors in the same corpus do not. Complementary to function-word distributions: where `function_word_*` captures common-word style, this captures individual lexical choice. Requires the full corpus for IDF computation. Source label declares top-K and corpus tag (e.g. `#tfidf-top50`). |
+---
+### `lexical.*` — Vocabulary and linguistic patterns (8 primitives)
+Lexical primitives characterize *what* and *how* an actor writes at the word and
+sentence level. Where stylometric primitives fingerprint unconscious micro-habits,
+lexical primitives capture deliberate linguistic choices — vocabulary richness,
+how questions are formed, register.
+| Primitive | Kind | Description |
+|---|---|---|
+| `lexical.vocabulary_richness` | numeric [0,1] | Moving-Average Type-Token Ratio (MATTR) over a sliding window (default 50 tokens). Volume-independent: each window contributes its own unique/total ratio, the value is the mean. Avoids the standard TTR bias where larger corpora mechanically score lower. Source label declares window size. |
+| `lexical.slang_density` | numeric [0,1] | Rate of slang terms per message, against a locale-tuned slang corpus. |
+| `lexical.code_switching_rate` | numeric [0,1] | Language switches per N tokens (Solorio & Liu metric). A speaker who switches between Spanish and English, or Spanish and lunfardo/caló, will have a higher rate than a monolingual writer. |
+| `lexical.code_switching_matrix_language` | free_string | BCP-47 tag of the dominant (matrix) language in code-switching texts (e.g. `es-CL`, `es-AR`). The matrix language is the grammatical scaffold; embedded languages appear as inserts. |
+| `lexical.code_switching_embedded_languages` | array[free_string] | BCP-47 list of non-matrix languages observed in the actor's messages. |
+| `lexical.sentence_complexity_class` | categorical | Dominant clause structure. `simple` = single-clause. `compound` = two independent clauses joined by coordinating conjunctions (pero, y, o). `complex` = dependent clauses and subordination (aunque, porque, cuando). Reflects education level and cognitive investment. |
+| `lexical.question_formation_style` | categorical | How questions are formed. `punctuation_only` = question mark without interrogative words ('¿Cuánto?') — very common in Spanish chat. `lexical` = explicit interrogatives (¿qué, cómo, cuándo). `formal` = inverted subject-verb or formal register. |
+| `lexical.imperative_style` | categorical | How commands and requests are framed. `informal_directive` = tú/vos imperative (dame, hazlo). `formal_directive` = usted imperative (hágame el favor). `polite` = conditional/modal softening (¿podría...?). Stable per-author trait in hierarchical contexts. |
+---
+### `temporal_evolution.*` — Behavioral change over time (1 primitive)
+| Primitive | Kind | Description |
+|---|---|---|
+| `temporal_evolution.lifecycle_phase` | categorical | Auto-classified lifecycle stage from windowed within-corpus analysis. `arrival_burst` = first 24hr, first-window volume dominates (empirically validated against OxPayload's first 12 hours in Rutify). `stable_member` = low drift across the full tenure. `fluctuating_member` = tenure ≥24hr with median drift between stable and inflection thresholds — established noisy regulars (e.g. lamarabitch). `inflection_member` = long-tenure actor with a real behavioral shift in at least one window-pair. `declining_member` = monotonically decreasing per-window message counts. `unknown` = insufficient data. Window size adapts to tenure: <24hr → 2h, <7d → 12h, <30d → 1d, otherwise 7d. |
+---
+### `network.*` — Governance and role signals (2 primitives)
+Network primitives capture the actor's *structural role* in the group — inferred
+from interaction patterns rather than content — and a bot detector. These are
+heuristic composites built from other primitives; treat them as candidate signals,
+not verdicts.
+| Primitive | Kind | Description |
+|---|---|---|
+| `network.is_likely_bot` | categorical | Heuristic bot detector. `likely_bot` when `conversation_initiation_rate` ≥ 0.95 AND `attention_pattern` = `broadcast` AND `vocabulary_richness` < 0.65. Validated (2026-05-03) against SangMata_beta_bot (caught) vs 11 high-volume humans (no false positives). Low-volume bots (e.g. QuotLyBot, 9 messages) sit below the fingerprint threshold. Source label declares heuristic version (e.g. `#bot-heuristic-v1`). |
+| `network.governance_role_signal` | categorical | Heuristic role shape from interaction primitives + lifecycle. `admin_pattern` = init_rate ≥ 0.80, attention reciprocal, non-bot, non-arrival_burst. `responder_pattern` = init_rate ≤ 0.45, attention reciprocal. `bot_pattern` = matches `is_likely_bot`. `regular` = everything else above volume threshold. Empirically caught 4/4 high-volume Rutify admins, sebaImlI as responder, SangMata as bot. NOT a ground-truth admin label. |
+---
+### `interaction.*` — Messaging behavior (6 primitives)
+Interaction primitives characterize *how* the actor participates in conversations —
+timing, initiation rate, and attention patterns.
+| Primitive | Kind | Description |
+|---|---|---|
+| `interaction.response_latency_class` | categorical | How quickly the actor responds to messages directed at them. `immediate` <30s (suggests active monitoring or automation). `fast` 30s-5min. `normal` 5-60min. `slow` 1-24hr. `sporadic` = no consistent pattern. |
+| `interaction.conversation_initiation_rate` | numeric [0,1] | Thread-starting messages / total messages. High rate = the actor drives conversations. |
+| `interaction.message_burst_rate` | categorical | Whether the actor sends multiple messages per turn. `habitual` = almost always bursts (3+ messages before any reply). `single` = almost always one message per turn. Related to the `multi_line` value of `stylometric.linebreak_style`. |
+| `interaction.active_hours_class` | free_string | UTC active-hours window summary (e.g. `05:00-14:00 UTC`). Free string — the window shape varies by actor and doesn't fit a closed enum. |
+| `interaction.session_duration_class` | categorical | Typical session length: `short` <15min, `medium` 15-90min, `long` 90min-4hr, `marathon` >4hr. Shares the enum with `behave_shell`'s `temporal.session_duration`. |
+| `interaction.attention_pattern` | categorical | Reply-graph centrality shape. `broadcast` = sends to many, replies to few (one-to-many). `focused` = concentrates on a small set of interlocutors. `reciprocal` = balanced give-and-take. |
+---
+### `content.*` — Content-derived signals, EXPERIMENTAL (6 primitives)
+Content primitives are derived from message text through classifiers rather than
+structural/timing analysis. They carry the highest risk of false positives, are
+brittle to vocabulary drift, and are locale-specific. An attribution engine may
+choose to weight these at zero until field-validated against labeled data.
+| Primitive | Kind | Description |
+|---|---|---|
+| `content.role_signal` | categorical | Locale-tuned role-vocabulary classifier. Values: `admin`, `seller`, `buyer`, `lurker`, `newbie`. May be moved to a separate IOC/keyword-detection layer after Rutify testing. `EXPERIMENTAL` |
+| `content.transactional_language` | numeric [0,1] | Rate of transactional terms per message. Locale-specific; brittle to vocabulary drift. `EXPERIMENTAL` |
+| `content.opsec_awareness` | numeric [0,1] | Rate of security-conscious phrases. **HIGH FALSE-POSITIVE RISK** on casual conversation about deleting files/messages. `EXPERIMENTAL` |
+| `content.targeting_language` | array[free_string] | IOC-shaped target patterns (bank names, government portals, RUT ranges). Consider moving to a dedicated IOC layer. `EXPERIMENTAL` |
+| `content.boasting_pattern` | categorical | Success-claim frequency: `none`, `occasional`, `frequent`. Corpus-dependent regex. `EXPERIMENTAL` |
+| `content.conflict_style` | categorical | Dispute-tone classification: `aggressive`, `defusing`, `appellate`. Needs labelled training data. `EXPERIMENTAL` |
+---
+## Schema
+Machine-readable JSON Schema:
+[`json/observation.schema.json`](json/observation.schema.json)
+Regenerate after model changes:
+```bash
+python scripts/generate_schema.py
+```
+## Tests
+```bash
+pytest tests/
+```
+## Attribution recipes
+[`attribution-recipes.md`](attribution-recipes.md) — placeholder document sketching
+how an external attribution engine would consume `actor.observation.text.*` topics
+to build actor profiles (`credential_broker`, `low_skill_buyer`, `group_admin`, etc.).
+**Not populated yet** — awaiting Rutify corpus calibration. Not part of the BEHAVE spec.
+## License
+Code and schemas: [GPL-3.0-or-later](../LICENSE)
+Spec prose (this file, attribution-recipes.md): [CC-BY-SA-4.0](../LICENSE.docs)
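
The README describes its hash-kind stylometric primitives as 64-bit SimHashes that are compared by Hamming distance. The sketch below illustrates that idea for `stylometric.character_ngram_simhash`. It assumes a BLAKE2b-based per-feature hash and a hex-string output; the names `char_ngrams`, `simhash64`, and `hamming` are hypothetical, and the package's actual extractor is not part of this diff.

```python
import hashlib
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams of the lowercased text (accents are left untouched)."""
    lowered = text.lower()
    return Counter(lowered[i:i + n] for i in range(len(lowered) - n + 1))


def simhash64(weights: Counter) -> int:
    """Fold a weighted feature vector into a 64-bit SimHash."""
    totals = [0] * 64
    for feature, weight in weights.items():
        # Assumption: any stable 64-bit hash works; BLAKE2b is used here for determinism.
        digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest()
        bits = int.from_bytes(digest, "big")
        for pos in range(64):
            totals[pos] += weight if (bits >> pos) & 1 else -weight
    # A bit is set in the signature when its weighted vote is positive.
    return sum(1 << pos for pos in range(64) if totals[pos] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


sig_a = simhash64(char_ngrams("hola q tal, todo bien?"))
sig_b = simhash64(char_ngrams("hola que tal, todo bien?"))
value = f"{sig_a:016x}"  # non-empty string, which is all the HASH kind checks
distance = hamming(sig_a, sig_b)
```

The README's calibration note reports overlapping Hamming-distance medians (8 bits within-author, 10 bits cross-author) for the top-50 function-word variant, which is why a single signature of this kind is better treated as one input to a composite than as a standalone discriminator.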

behave_text-0.1.0/behave_text/__init__.py ADDED

File without changes

behave_text-0.1.0/behave_text/spec/__init__.py ADDED

@@ -0,0 +1,43 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""BEHAVE-TEXT spec — text/messaging-domain registry, layered on behave-core.
+Public API:
+    from behave_text.spec import Observation, Window, OBSERVATION_SCHEMA_VERSION
+    from behave_text.spec import PRIMITIVE_REGISTRY, ValueKind, ValueTypeSpec
+    from behave_text.spec import TOPIC_PREFIX, event_topic_for
+The ``Observation`` exported here is a registry-aware subclass of the base
+class from ``behave-core``; it validates that ``primitive`` is in the
+text registry and that ``value`` matches the registry's per-primitive spec.
+See ``spec.envelope`` (and the core envelope module) for PII discipline.
+"""
+from .envelope import OBSERVATION_SCHEMA_VERSION, Observation, ObservationValue, Window
+from .primitives import PRIMITIVE_REGISTRY, ValueKind, ValueTypeSpec, get, is_known
+# Topic namespace deliberately uses *actor* (not *attacker*) because chat-group
+# members may include observers, brokers, victims, and bystanders alongside
+# threat actors. Attribution of role is the engine's job, not BEHAVE-TEXT's.
+TOPIC_PREFIX: str = "actor.observation.text"
+def event_topic_for(primitive: str) -> str:
+    """Return the canonical bus topic for a BEHAVE-TEXT primitive."""
+    return f"{TOPIC_PREFIX}.{primitive}"
+__all__ = [
+    "OBSERVATION_SCHEMA_VERSION",
+    "Observation",
+    "ObservationValue",
+    "Window",
+    "PRIMITIVE_REGISTRY",
+    "ValueKind",
+    "ValueTypeSpec",
+    "is_known",
+    "get",
+    "TOPIC_PREFIX",
+    "event_topic_for",
+]
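
The topic helper above only builds topics. A consumer on the bus would likely need the reverse mapping as well. The sketch below restates `TOPIC_PREFIX` and `event_topic_for` from this file so that it runs without the package installed, and adds a hypothetical inverse, `primitive_from_topic`, which is not part of the published API.

```python
TOPIC_PREFIX = "actor.observation.text"


def event_topic_for(primitive: str) -> str:
    """Same behavior as the helper in behave_text.spec."""
    return f"{TOPIC_PREFIX}.{primitive}"


def primitive_from_topic(topic: str) -> str:
    """Hypothetical inverse: strip the text-domain prefix from a bus topic."""
    prefix = TOPIC_PREFIX + "."
    if not topic.startswith(prefix):
        raise ValueError(f"not a BEHAVE-TEXT topic: {topic!r}")
    return topic[len(prefix):]


topic = event_topic_for("interaction.attention_pattern")
primitive = primitive_from_topic(topic)
```

Rejecting topics outside the prefix keeps observations from other BEHAVE domains from being mistaken for text-domain primitives.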

behave_text-0.1.0/behave_text/spec/envelope.py ADDED

@@ -0,0 +1,53 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""BEHAVE-TEXT Observation envelope (registry-aware subclass).
+Mirrors BEHAVE-SHELL's pattern: structural envelope from `behave-core`,
+registry-aware validation added here against BEHAVE-TEXT's `PRIMITIVE_REGISTRY`.
+PII discipline (TIGHTER for text than for shell):
+  text-domain sensors operate on raw message bodies. They MUST hash, aggregate,
+  or categorize before constructing an Observation — never put message text
+  into the `value` or `evidence_ref` field. `evidence_ref` should point at an
+  external message-store record (e.g. a Telegram message ID), not at the text.
+"""
+from __future__ import annotations
+from pydantic import model_validator
+from behave_core.spec.envelope import (
+    OBSERVATION_SCHEMA_VERSION,
+    ObservationValue,
+    Window,
+)
+from behave_core.spec.envelope import Observation as _BaseObservation
+from .primitives import PRIMITIVE_REGISTRY
+class Observation(_BaseObservation):
+    """Text-domain Observation: base envelope + BEHAVE-TEXT registry check."""
+    @model_validator(mode="after")
+    def _validate_against_text_registry(self) -> "Observation":
+        spec = PRIMITIVE_REGISTRY.get(self.primitive)
+        if spec is None:
+            raise ValueError(
+                f"unknown primitive {self.primitive!r}; "
+                f"add it to spec/primitives.py:PRIMITIVE_REGISTRY first"
+            )
+        try:
+            spec.validate_value(self.value)
+        except ValueError as exc:
+            raise ValueError(
+                f"value invalid for primitive {self.primitive!r}: {exc}"
+            ) from None
+        return self
+__all__ = [
+    "OBSERVATION_SCHEMA_VERSION",
+    "Observation",
+    "ObservationValue",
+    "Window",
+]
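
The validator above works in two steps: look the primitive up in the registry, then check the value against that primitive's spec. The sketch below mirrors the same two steps using only the standard library, so it can be run without `pydantic` or `behave-core`. The two-entry `REGISTRY` and the `check` function are illustrative stand-ins, not the package's real objects; the allowed values and bounds are copied from the README tables.

```python
# Illustrative stand-in for PRIMITIVE_REGISTRY: (kind, constraint) per primitive.
REGISTRY = {
    "stylometric.emoji_usage": ("categorical", ("none", "occasional", "frequent", "exclusive")),
    "lexical.slang_density": ("numeric", (0.0, 1.0)),
}


def check(primitive: str, value: object) -> None:
    """Raise ValueError if the primitive is unknown or the value does not fit it."""
    spec = REGISTRY.get(primitive)
    if spec is None:
        raise ValueError(f"unknown primitive {primitive!r}")
    kind, constraint = spec
    if kind == "categorical":
        if not isinstance(value, str) or value not in constraint:
            raise ValueError(f"value {value!r} not in allowed set {constraint!r}")
    elif kind == "numeric":
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"expected numeric, got {type(value).__name__}")
        low, high = constraint
        if not low <= value <= high:
            raise ValueError(f"value {value} outside [{low}, {high}]")


def is_valid(primitive: str, value: object) -> bool:
    """Convenience wrapper that reports validity instead of raising."""
    try:
        check(primitive, value)
    except ValueError:
        return False
    return True
```

For example, `is_valid("lexical.slang_density", 0.2)` passes, while a raw message body passed as the value of `stylometric.emoji_usage` fails the allowed-set check.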

behave_text-0.1.0/behave_text/spec/primitives.py ADDED

@@ -0,0 +1,353 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""BEHAVE-TEXT primitive registry.
+Source-of-truth for what `Observation.primitive` may be in the text/messaging
+domain and what `Observation.value` must look like. Mirrors every row in the
+primitive tables of `scratchpad.md`.
+PII discipline notice (carried over from behave-core's envelope module):
+  TEXT-domain observations carry CATEGORICAL LABELS, AGGREGATE RATES, and
+  HASHES of distributions. Sensors operating on Telegram/messaging text MUST
+  NOT emit raw message content into BEHAVE-TEXT observations — only derived
+  features. The `evidence_ref` field points to the underlying message store
+  held elsewhere; never into the message body itself.
+  This is a tighter constraint than BEHAVE-SHELL's because the source signal
+  IS text content. Sensors must hash/aggregate before emitting.
+Adding a new primitive is a deliberate registry edit. Drift between this file
+and `scratchpad.md` is a bug; v0 keeps the registry hand-written so PR review
+catches drift, v0.x may auto-extract from the markdown if drift becomes a
+maintenance issue.
+Status flags appear in the `notes` field. `EXPERIMENTAL` marks primitives in
+the `content.*` layer whose detector implementations are likely brittle; an
+attribution engine may choose to weight those at zero until field-validated.
+"""
+from __future__ import annotations
+from enum import Enum
+from typing import Any, Optional
+from pydantic import BaseModel, Field
+class ValueKind(str, Enum):
+    """Discriminator for the shape an `Observation.value` must take."""
+    CATEGORICAL = "categorical"
+    NUMERIC     = "numeric"
+    HASH        = "hash"
+    ARRAY       = "array"
+    FREE_STRING = "free_string"
+    BOOL        = "bool"
+class ValueTypeSpec(BaseModel):
+    """Per-primitive value-type spec (mirrors BEHAVE-SHELL's shape)."""
+    kind: ValueKind
+    allowed: Optional[list[str]] = Field(default=None)
+    min_val: Optional[float] = Field(default=None)
+    max_val: Optional[float] = Field(default=None)
+    array_of: Optional[ValueKind] = Field(default=None)
+    notes: Optional[str] = Field(default=None)
+    def validate_value(self, value: Any) -> None:
+        if self.kind is ValueKind.CATEGORICAL:
+            if not isinstance(value, str):
+                raise ValueError(f"expected categorical string, got {type(value).__name__}")
+            if self.allowed is not None and value not in self.allowed:
+                raise ValueError(f"value {value!r} not in allowed set {self.allowed!r}")
+        elif self.kind is ValueKind.NUMERIC:
+            if isinstance(value, bool) or not isinstance(value, (int, float)):
+                raise ValueError(f"expected numeric, got {type(value).__name__}")
+            if self.min_val is not None and value < self.min_val:
+                raise ValueError(f"value {value} below min_val {self.min_val}")
+            if self.max_val is not None and value > self.max_val:
+                raise ValueError(f"value {value} above max_val {self.max_val}")
+        elif self.kind is ValueKind.HASH:
+            if not isinstance(value, str) or not value:
+                raise ValueError("expected non-empty hash string")
+        elif self.kind is ValueKind.FREE_STRING:
+            if not isinstance(value, str):
+                raise ValueError(f"expected string, got {type(value).__name__}")
+        elif self.kind is ValueKind.BOOL:
+            if not isinstance(value, bool):
+                raise ValueError(f"expected bool, got {type(value).__name__}")
+        elif self.kind is ValueKind.ARRAY:
+            if not isinstance(value, list):
+                raise ValueError(f"expected array, got {type(value).__name__}")
+            if self.array_of is None:
+                return
+            element_spec = ValueTypeSpec(kind=self.array_of)
+            for i, element in enumerate(value):
+                try:
+                    element_spec.validate_value(element)
+                except ValueError as exc:
+                    raise ValueError(f"array element [{i}]: {exc}") from None
+# ─── Convenience constructors ───────────────────────────────────────────────
+def _cat(*allowed: str, notes: Optional[str] = None) -> ValueTypeSpec:
+    return ValueTypeSpec(kind=ValueKind.CATEGORICAL, allowed=list(allowed), notes=notes)
+def _num(min_val: Optional[float] = None, max_val: Optional[float] = None, notes: Optional[str] = None) -> ValueTypeSpec:
+    return ValueTypeSpec(kind=ValueKind.NUMERIC, min_val=min_val, max_val=max_val, notes=notes)
+def _hash(notes: Optional[str] = None) -> ValueTypeSpec:
+    return ValueTypeSpec(kind=ValueKind.HASH, notes=notes)
+def _str(notes: Optional[str] = None) -> ValueTypeSpec:
+    return ValueTypeSpec(kind=ValueKind.FREE_STRING, notes=notes)
+def _array(of: ValueKind, notes: Optional[str] = None) -> ValueTypeSpec:
+    return ValueTypeSpec(kind=ValueKind.ARRAY, array_of=of, notes=notes)
+# ─── The registry ───────────────────────────────────────────────────────────
+#
+# 35 primitives across 6 layers (per the README). Mirrors scratchpad.md row-for-row.
+PRIMITIVE_REGISTRY: dict[str, ValueTypeSpec] = {
+    # ── stylometric.* (motor analog, 12) ──────────────────────────────────
+    "stylometric.punctuation_style":          _hash(notes="canonical punctuation-pattern fingerprint"),
+    "stylometric.capitalization_habit": _cat(
+        "lowercase", "proper", "random_caps", "mixed_i",
+        notes="Dominant capitalization rule the author applies. lowercase=no capitals except "
+              "after sentence breaks. proper=standard title/sentence case. random_caps=no "
+              "consistent rule. mixed_i=author consistently writes 'i' in lowercase even "
+              "mid-sentence — common in Spanish chat where 'I' is not a standalone word "
+              "but the habit transfers from the native language's lowercase 'yo'.",
+    ),
+    "stylometric.emoji_usage": _cat(
+        "none", "occasional", "frequent", "exclusive",
+        notes="Rate of emoji use per message. exclusive=messages rarely contain text without "
+              "emoji. This captures tone and register — heavy emoji use in a criminal-market "
+              "context is a distinct style trait worth preserving.",
+    ),
+    "stylometric.emoji_placement": _cat(
+        "pre_punctuation", "post_punctuation", "no_punctuation", "mixed",
+        notes="Where emojis appear relative to sentence-ending punctuation. "
+              "pre_punctuation='Hola 😊.' post_punctuation='Hola. 😊' "
+              "Individual authors are strikingly consistent in this micro-habit.",
+    ),
+    "stylometric.message_length_class": _cat(
+        "short", "medium", "long", "paragraph",
+        notes="Median message length bucket: short=1-5 words, medium=6-20 words, "
+              "long=21-50 words, paragraph=>50 words. See also "
+              "stylometric.message_length_variance_class for the distribution shape.",
+    ),
+    "stylometric.message_length_variance_class": _cat(
+        "tight", "varied", "bimodal",
+        notes="Coefficient of variation of per-message word counts. Captures "
+              "DISTRIBUTION SHAPE that message_length_class collapses by "
+              "emitting only the median bucket. Two authors can share the same "
+              "median length but have wildly different variance: `tight` (CV<0.5) "
+              "= consistent (always 1-3 words), `varied` (0.5<=CV<1.5) = normal "
+              "mix, `bimodal` (CV>=1.5) = long-tail (mostly short with occasional "
+              "rants). Added in v0.2 after Rutify calibration found median-only "
+              "bucketing discarded most of the per-author variance signal.",
+    ),
+    "stylometric.linebreak_style": _cat(
+        "single_thought", "multi_line", "wall_of_text",
+        notes="Whether the author sends one complete thought per message or breaks a single "
+              "statement into multiple sequential short messages. multi_line=habitual "
+              "message-burst style (sends 3-5 short messages in rapid succession instead "
+              "of one composed message). wall_of_text=rarely uses line breaks, sends dense "
+              "blocks. Captures a stylistic rhythm that is hard to consciously alter.",
+    ),
+    "stylometric.typo_signature":             _hash(notes="sha256 of canonical persistent-typo set"),
+    "stylometric.function_word_distribution_top50": _hash(
+        notes="64-bit simhash over the 50-most-common Spanish function-word frequency "
+              "vector. Mosteller-Wallace gold standard for English long-form authorship; "
+              "EMPIRICALLY DOMAIN-FLAWED for the Spanish chat domain: calibration (2026-05-02) "
+              "against the Rutify corpus showed that within-author and cross-author Hamming "
+              "distance distributions overlap (within median 8 bits, cross median 10 "
+              "bits) so this primitive ALONE cannot discriminate authors in chat-style "
+              "short-message corpora. Engines should weight it low until paired with "
+              "the larger top-200 variant or composited with character n-gram and "
+              "distinctive-vocabulary signatures (see siblings below). Kept in v0 for "
+              "calibration grids and documentary purposes.",
+    ),
+    "stylometric.function_word_distribution_top200": _hash(
+        notes="64-bit simhash over the 200-most-common Spanish function-word frequency "
+              "vector. The wider list reaches into the long tail (rare-but-individual "
+              "function words like `tampoco`, `aunque`, `mientras`) that carry more "
+              "discriminating signal in short-message chat domains. NOT YET EMITTED by "
+              "the v0 prototype extractor; populated when v0.2 calibration is done.",
+    ),
+    "stylometric.character_ngram_simhash": _hash(
+        notes="64-bit simhash over a frequency vector of character n-grams (default "
+              "n=3) from the author's lowercased text corpus. ORTHOGONAL to "
+              "function-word distributions: captures punctuation tics, accent-"
+              "stripping habits, typo patterns, and idiom-fragment fingerprints "
+              "that survive paraphrase. Lowercases input so that capitalization "
+              "habits — already captured by stylometric.capitalization_habit — "
+              "do not double-count. Accents PRESERVED because accent-stripping is "
+              "itself a stylistic tic worth catching. Source label declares n size "
+              "(e.g. `#char3gram`, `#char4gram`).",
+    ),
+    "stylometric.distinctive_vocabulary_signature": _hash(
+        notes="64-bit simhash over a TF-IDF-weighted top-K rare-word vector. "
+              "COMPLEMENTARY to function-word distributions: where function_word_* "
+              "captures common-word *style*, this captures the author's distinctive "
+              "*lexicon* (the words this person uses that other authors in the same "
+              "corpus do NOT). Strong against context-shift because rare words are "
+              "where authorial choice lives. Requires the chat corpus for IDF "
+              "computation, performed once per extraction. Source label declares the "
+              "top-K size and corpus tag (e.g. `#tfidf-top50`).",
+    ),
+    # ── lexical.* (cognitive analog — 8) ──────────────────────────────────
+    "lexical.vocabulary_richness":            _num(
+        min_val=0.0, max_val=1.0,
+        notes="Moving-Average Type-Token Ratio (MATTR) over a sliding window "
+              "(default 50 tokens). Volume-independent: each window contributes "
+              "its own unique/total ratio, the primitive's value is the mean. "
+              "Avoids the standard TTR bias where larger corpora mechanically "
+              "score lower. Source label declares the window size.",
+    ),
+    "lexical.slang_density":                  _num(min_val=0.0, max_val=1.0,
+                                                   notes="rate per message; locale-tuned slang corpus"),
+    "lexical.code_switching_rate":            _num(min_val=0.0, max_val=1.0,
+                                                   notes="switches per N tokens; Solorio & Liu metric"),
+    "lexical.code_switching_matrix_language": _str(notes="BCP-47 of dominant language"),
+    "lexical.code_switching_embedded_languages": _array(ValueKind.FREE_STRING,
+                                                        notes="BCP-47 list of non-matrix languages observed"),
+    "lexical.sentence_complexity_class": _cat(
+        "simple", "compound", "complex",
+        notes="Dominant clause structure. simple=single-clause messages (no conjunctions "
+              "or subordination). compound=two independent clauses joined by coordinating "
+              "conjunctions (pero, y, o, ni). complex=dependent clauses and subordination "
+              "(aunque, porque, cuando, que + verb). Reflects education level and "
+              "cognitive investment in message composition.",
+    ),
+    "lexical.question_formation_style": _cat(
+        "punctuation_only", "lexical", "formal",
+        notes="How questions are formed. punctuation_only=question mark appended without "
+              "interrogative words ('¿Vienes?' or 'Mañana?'), very common in Spanish "
+              "chat. lexical=explicit interrogatives (¿qué, cómo, cuándo, dónde). "
+              "formal=inverted subject-verb order or formal register ('¿Podría usted...'). "
+              "Captures register and education level.",
+    ),
+    "lexical.imperative_style": _cat(
+        "informal_directive", "formal_directive", "polite",
+        notes="How commands and requests are framed. informal_directive=tú/vos imperative "
+              "('dame', 'hazlo', 'mándame'). formal_directive=usted imperative "
+              "('hágame el favor', 'envíeme'). polite=conditional or modal softening "
+              "('¿podría...?', 'me gustaría...'). Stable per-author trait in criminal "
+              "market contexts where hierarchical and peer relationships are expressed "
+              "through register choice.",
+    ),
+    # ── temporal_evolution.* (lifecycle / change-over-time — 1) ───────────
+    "temporal_evolution.lifecycle_phase": _cat(
+        "arrival_burst", "stable_member", "fluctuating_member",
+        "inflection_member", "declining_member", "unknown",
+        notes="Auto-classified lifecycle stage derived from windowed within-"
+              "corpus analysis. arrival_burst: tenure < 24hr with first-window "
+              "volume dominating later windows and high inter-window drift "
+              "(empirically validated 2026-05-03 against OxPayload's first 12 "
+              "hours on Rutify). stable_member: low drift between consecutive "
+              "windows across the whole tenure. fluctuating_member (added v0.3): "
+              "tenure ≥ 24hr with median drift in [stable_max, inflection_min) "
+              "and no single window crossing inflection_min — established noisy "
+              "regulars who don't fit clean stable/inflection classes (e.g. "
+              "labelled admin lamarabitch, formerly classified unknown). "
+              "inflection_member: long-tenure actor whose drift spikes in at "
+              "least one window-pair (a real behavioral shift mid-corpus). "
+              "declining_member: monotonically decreasing per-window message "
+              "counts. unknown: insufficient windowed data for classification. "
+              "Window size adapts to tenure: <24hr → 2h windows, <7d → 12h, "
+              "<30d → 1d, otherwise 7d.",
+    ),
+    # ── network.* (governance/role-shape signals — 2, added v0.3) ─────────
+    "network.is_likely_bot": _cat(
+        "likely_bot", "not_bot", "unknown",
+        notes="Heuristic bot detector composited from existing primitives. "
+              "Classifies as likely_bot when conversation_initiation_rate ≥ 0.95 "
+              "AND attention_pattern = broadcast AND vocabulary_richness < 0.65. "
+              "Empirically validated 2026-05-03 against the tdl-labeled Rutify "
+              "bot SangMata_beta_bot (correctly caught) vs 11 high-volume humans "
+              "in the same corpus (none false-positive). NOT a verdict — engines "
+              "should treat as a candidate signal, especially since low-volume "
+              "bots (e.g. QuotLyBot with 9 messages) sit below the fingerprint "
+              "threshold and emit nothing here. Source label declares the "
+              "heuristic version (e.g. #bot-heuristic-v1).",
+    ),
+    "network.governance_role_signal": _cat(
+        "admin_pattern", "responder_pattern", "regular", "bot_pattern", "unknown",
+        notes="Heuristic role-shape composited from interaction primitives + "
+              "lifecycle_phase. admin_pattern: init_rate ≥ 0.80 AND attn = "
+              "reciprocal AND non-bot AND not arrival_burst. responder_pattern: "
+              "init_rate ≤ 0.45 AND attn = reciprocal. bot_pattern: matches "
+              "network.is_likely_bot likely_bot. regular: everything else above "
+              "the volume threshold. Empirically caught all 4 high-volume "
+              "tdl-labeled Rutify admins, sebaImlI as responder, "
+              "SangMata_beta_bot as bot, OxPayload/bopxcx as regular (their "
+              "arrival_burst lifecycle overrides the admin-shaped init_rate). "
+              "NOT a ground-truth admin label — kkaxlazer matches admin_pattern "
+              "while not formally admin, but the 2026-05-03 reply-graph cohort "
+              "analysis showed they're operationally embedded in the admin "
+              "layer (4/4 cohort signal with the top admin), so the heuristic "
+              "is doing the right thing.",
+    ),
+    # ── interaction.* (temporal analog — 6) ───────────────────────────────
+    "interaction.response_latency_class": _cat(
+        "immediate", "fast", "normal", "slow", "sporadic",
+        notes="How quickly the actor responds to messages directed at them. "
+              "immediate=<30s (suggests active monitoring or automated response). "
+              "fast=30s-5min. normal=5-60min (typical async chat). slow=1-24hr. "
+              "sporadic=no consistent response latency — appears and disappears.",
+    ),
+    "interaction.conversation_initiation_rate": _num(min_val=0.0, max_val=1.0,
+                                                     notes="thread-starting messages / total"),
+    "interaction.message_burst_rate": _cat(
+        "single", "occasional", "habitual",
+        notes="Whether the actor sends multiple messages in rapid sequence within a "
+              "conversation turn. habitual=almost always bursts (sends 3+ messages "
+              "before any reply). occasional=bursts in some turns but not most. "
+              "single=almost always one message per turn. Tied to "
+              "stylometric.linebreak_style multi_line.",
+    ),
+    "interaction.active_hours_class":         _str(notes="UTC active-hours window summary"),
+    "interaction.session_duration_class":     _cat("short", "medium", "long", "marathon",
+                                                   notes="REUSED enum from BEHAVE-SHELL temporal.session_duration"),
+    "interaction.attention_pattern":          _cat("broadcast", "focused", "reciprocal",
+                                                   notes="from reply-graph centrality"),
+    # ── content.* (operational analog — 6, EXPERIMENTAL) ──────────────────
+    "content.role_signal":                    _cat("admin", "seller", "buyer", "lurker", "newbie",
+                                                   notes="EXPERIMENTAL — locale-tuned role-vocabulary classifier; "
+                                                         "may be moved to a separate IOC/keyword-detection layer "
+                                                         "once tested against the Rutify corpus"),
+    "content.transactional_language":         _num(min_val=0.0, max_val=1.0,
+                                                   notes="EXPERIMENTAL — rate of transactional terms; "
+                                                         "locale-specific, brittle to vocabulary drift"),
+    "content.opsec_awareness":                _num(min_val=0.0, max_val=1.0,
+                                                   notes="EXPERIMENTAL — rate of security-conscious phrases; "
+                                                         "HIGH FALSE-POSITIVE RISK on casual conversation about "
+                                                         "deleting files / messages"),
+    "content.targeting_language":             _array(ValueKind.FREE_STRING,
+                                                     notes="EXPERIMENTAL — IOC-shaped target patterns "
+                                                           "(bank names, government portals, RUT ranges, etc); "
+                                                           "consider moving to dedicated IOC layer"),
+    "content.boasting_pattern":               _cat("none", "occasional", "frequent",
+                                                   notes="EXPERIMENTAL — success-claim regex; corpus-dependent"),
+    "content.conflict_style":                 _cat("aggressive", "defusing", "appellate",
+                                                   notes="EXPERIMENTAL — dispute-tone classifier; needs "
+                                                         "labelled training data"),
+}
+def is_known(primitive: str) -> bool:
+    """Return True if *primitive* is a registered primitive path."""
+    return primitive in PRIMITIVE_REGISTRY
+def get(primitive: str) -> ValueTypeSpec:
+    """Return the value-type spec for *primitive*; raise KeyError if unknown."""
+    return PRIMITIVE_REGISTRY[primitive]
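
The `lexical.vocabulary_richness` note describes MATTR only in prose. The following is a minimal Python sketch of that computation, assuming the text is already tokenized. The function name `mattr` and the fallback to plain TTR for texts shorter than one window are illustrative assumptions, not part of behave-text.

```python
def mattr(tokens: list[str], window: int = 50) -> float:
    """Mean type-token ratio over every sliding window of `window` tokens."""
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        # Assumption: with less than one full window, fall back to plain TTR.
        return len(set(tokens)) / len(tokens)

    # Token counts for the current window, updated incrementally.
    counts: dict[str, int] = {}
    for tok in tokens[:window]:
        counts[tok] = counts.get(tok, 0) + 1

    total = len(counts) / window
    n_windows = len(tokens) - window + 1

    for i in range(window, len(tokens)):
        leaving = tokens[i - window]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]
        entering = tokens[i]
        counts[entering] = counts.get(entering, 0) + 1
        total += len(counts) / window

    return total / n_windows


# A repetitive corpus: plain TTR shrinks as the text grows, MATTR does not.
corpus = ["a", "b", "c", "d"] * 100
plain_ttr = len(set(corpus)) / len(corpus)
windowed = mattr(corpus, window=4)
```

Here `plain_ttr` falls as the corpus is repeated, while every four-token window still contains four distinct tokens, which is the volume independence the note refers to.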

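The two `network.*` notes state their thresholds inline. The sketch below restates those rules as plain Python, under stated assumptions: the `ActorFeatures` container, the function names, and the order in which the rules are checked are all hypothetical, and the volume threshold that yields `unknown` is omitted because the notes do not give its value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActorFeatures:
    """Hypothetical container for the primitives the heuristics read."""
    conversation_initiation_rate: float  # interaction.conversation_initiation_rate
    attention_pattern: str               # "broadcast" | "focused" | "reciprocal"
    vocabulary_richness: float           # lexical.vocabulary_richness
    lifecycle_phase: str                 # temporal_evolution.lifecycle_phase


def is_likely_bot(f: ActorFeatures) -> str:
    # Thresholds copied from the network.is_likely_bot note.
    if (
        f.conversation_initiation_rate >= 0.95
        and f.attention_pattern == "broadcast"
        and f.vocabulary_richness < 0.65
    ):
        return "likely_bot"
    return "not_bot"


def governance_role_signal(f: ActorFeatures) -> str:
    # Assumption: the bot check runs first, then admin, then responder.
    if is_likely_bot(f) == "likely_bot":
        return "bot_pattern"
    reciprocal = f.attention_pattern == "reciprocal"
    if (
        f.conversation_initiation_rate >= 0.80
        and reciprocal
        and f.lifecycle_phase != "arrival_burst"
    ):
        return "admin_pattern"
    if f.conversation_initiation_rate <= 0.45 and reciprocal:
        return "responder_pattern"
    return "regular"


bot = ActorFeatures(0.97, "broadcast", 0.50, "stable_member")
admin = ActorFeatures(0.85, "reciprocal", 0.80, "stable_member")
newcomer = ActorFeatures(0.85, "reciprocal", 0.80, "arrival_burst")
responder = ActorFeatures(0.30, "reciprocal", 0.80, "stable_member")
```

The `newcomer` case shows the override described in the governance note: an admin-shaped initiation rate still resolves to `regular` while the lifecycle phase is `arrival_burst`.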
behave_text-0.1.0/behave_text.egg-info/PKG-INFO ADDED

@@ -0,0 +1,14 @@
+Metadata-Version: 2.4
+Name: behave-text
+Version: 0.1.0
+Summary: BEHAVE-TEXT — text/messaging-domain behavioral observation registry, layered on behave-core
+Author: ANTI
+License: GPL-3.0-or-later
+Project-URL: Source, https://git.resacachile.cl/anti/BEHAVE
+Requires-Python: >=3.11
+Requires-Dist: pydantic>=2.6
+Requires-Dist: behave-core>=0.1.0
+Provides-Extra: dev
+Requires-Dist: pytest>=8; extra == "dev"
+Requires-Dist: pytest-cov; extra == "dev"
+Requires-Dist: ruff; extra == "dev"

behave_text-0.1.0/behave_text.egg-info/SOURCES.txt ADDED

@@ -0,0 +1,12 @@
+README.md
+pyproject.toml
+behave_text/__init__.py
+behave_text.egg-info/PKG-INFO
+behave_text.egg-info/SOURCES.txt
+behave_text.egg-info/dependency_links.txt
+behave_text.egg-info/requires.txt
+behave_text.egg-info/top_level.txt
+behave_text/spec/__init__.py
+behave_text/spec/envelope.py
+behave_text/spec/primitives.py
+tests/test_primitives.py

behave_text-0.1.0/behave_text.egg-info/dependency_links.txt ADDED

@@ -0,0 +1 @@
+

behave_text-0.1.0/behave_text.egg-info/requires.txt ADDED

@@ -0,0 +1,7 @@
+pydantic>=2.6
+behave-core>=0.1.0
+[dev]
+pytest>=8
+pytest-cov
+ruff

behave_text-0.1.0/behave_text.egg-info/top_level.txt ADDED

@@ -0,0 +1 @@
+behave_text

behave_text-0.1.0/pyproject.toml ADDED

@@ -0,0 +1,33 @@
+[build-system]
+requires = ["setuptools>=68", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "behave-text"
+version = "0.1.0"
+description = "BEHAVE-TEXT — text/messaging-domain behavioral observation registry, layered on behave-core"
+requires-python = ">=3.11"
+license = { text = "GPL-3.0-or-later" }
+authors = [{ name = "ANTI" }]
+dependencies = ["pydantic>=2.6", "behave-core>=0.1.0"]
+[project.optional-dependencies]
+dev = ["pytest>=8", "pytest-cov", "ruff"]
+[project.urls]
+"Source" = "https://git.resacachile.cl/anti/BEHAVE"
+[tool.setuptools.packages.find]
+include = ["behave_text*"]
+[tool.ruff]
+line-length = 100
+target-version = "py311"
+[tool.ruff.lint]
+select = ["E", "F", "I", "B", "UP"]
+ignore = ["E501"]
+[tool.pytest.ini_options]
+testpaths = ["tests"]
+addopts = "-q --import-mode=importlib"

behave_text-0.1.0/setup.cfg ADDED

@@ -0,0 +1,4 @@
+[egg_info]
+tag_build =
+tag_date = 0

behave_text-0.1.0/tests/test_primitives.py ADDED

@@ -0,0 +1,101 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""Registry coverage tests for BEHAVE-TEXT.
+Asserts that every primitive listed in scratchpad.md's tables has exactly one
+entry in PRIMITIVE_REGISTRY. Drift-detector — failing this test means
+scratchpad.md and the registry have diverged.
+"""
+from __future__ import annotations
+import re
+from pathlib import Path
+from behave_text.spec import PRIMITIVE_REGISTRY, ValueKind
+# Primitive paths expected by scratchpad.md (hand-extracted; v0).
+EXPECTED_PRIMITIVES = {
+    # stylometric.* (motor analog — 12)
+    "stylometric.punctuation_style",
+    "stylometric.capitalization_habit",
+    "stylometric.emoji_usage",
+    "stylometric.emoji_placement",
+    "stylometric.message_length_class",
+    "stylometric.message_length_variance_class",
+    "stylometric.linebreak_style",
+    "stylometric.typo_signature",
+    "stylometric.function_word_distribution_top50",
+    "stylometric.function_word_distribution_top200",
+    "stylometric.character_ngram_simhash",
+    "stylometric.distinctive_vocabulary_signature",
+    # lexical.* (cognitive analog — 8)
+    "lexical.vocabulary_richness",
+    "lexical.slang_density",
+    "lexical.code_switching_rate",
+    "lexical.code_switching_matrix_language",
+    "lexical.code_switching_embedded_languages",
+    "lexical.sentence_complexity_class",
+    "lexical.question_formation_style",
+    "lexical.imperative_style",
+    # temporal_evolution.* (lifecycle/change-over-time — 1, added v0.2)
+    "temporal_evolution.lifecycle_phase",
+    # network.* (governance/role-shape — 2, added v0.3)
+    "network.is_likely_bot",
+    "network.governance_role_signal",
+    # interaction.* (temporal analog — 6)
+    "interaction.response_latency_class",
+    "interaction.conversation_initiation_rate",
+    "interaction.message_burst_rate",
+    "interaction.active_hours_class",
+    "interaction.session_duration_class",
+    "interaction.attention_pattern",
+    # content.* (operational analog — 6, EXPERIMENTAL)
+    "content.role_signal",
+    "content.transactional_language",
+    "content.opsec_awareness",
+    "content.targeting_language",
+    "content.boasting_pattern",
+    "content.conflict_style",
+}
+def test_registry_covers_expected_primitives_exactly():
+    registry_keys = set(PRIMITIVE_REGISTRY.keys())
+    missing = EXPECTED_PRIMITIVES - registry_keys
+    extra = registry_keys - EXPECTED_PRIMITIVES
+    assert not missing, f"registry missing: {sorted(missing)}"
+    assert not extra, f"registry has unexpected entries: {sorted(extra)}"
+def test_every_primitive_has_a_valid_spec():
+    for primitive, spec in PRIMITIVE_REGISTRY.items():
+        if spec.kind is ValueKind.CATEGORICAL:
+            assert spec.allowed, f"{primitive}: categorical must define `allowed`"
+            assert all(isinstance(v, str) for v in spec.allowed)
+        elif spec.kind is ValueKind.ARRAY:
+            assert spec.array_of is not None, f"{primitive}: array must define `array_of`"
+            assert spec.array_of is not ValueKind.ARRAY, (
+                f"{primitive}: nested arrays not supported in v0"
+            )
+def test_primitive_paths_are_dotted_lowercase():
+    pattern = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$")
+    for primitive in PRIMITIVE_REGISTRY:
+        assert pattern.match(primitive), f"malformed primitive path: {primitive!r}"
+def test_experimental_primitives_are_in_content_layer_only():
+    """`status: experimental` should be confined to content.* in v0."""
+    for primitive, spec in PRIMITIVE_REGISTRY.items():
+        if spec.notes and "EXPERIMENTAL" in spec.notes:
+            assert primitive.startswith("content."), (
+                f"{primitive}: EXPERIMENTAL flag should only appear in content.* layer in v0"
+            )
+def test_topic_namespace_uses_actor_not_attacker():
+    """The text-domain topic prefix must be `actor.*`, not `attacker.*`."""
+    from behave_text.spec import TOPIC_PREFIX, event_topic_for
+    assert TOPIC_PREFIX == "actor.observation.text"
+    assert event_topic_for("stylometric.emoji_usage") == "actor.observation.text.stylometric.emoji_usage"
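
To make concrete what `test_primitive_paths_are_dotted_lowercase` accepts and rejects, here is a small standalone sketch that reuses the same regular expression. The helper name `is_well_formed` is an assumption introduced for illustration and does not exist in the package.

```python
import re

# Same pattern as in the test: at least two dot-separated segments, each
# starting with a lowercase letter, then lowercase letters, digits or "_".
PATH_PATTERN = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$")


def is_well_formed(path: str) -> bool:
    """Hypothetical helper mirroring the check performed by the test."""
    return PATH_PATTERN.match(path) is not None


samples = [
    "stylometric.emoji_usage",     # two valid segments
    "stylometric",                 # rejected: no dot, single segment
    "Lexical.slang_density",       # rejected: uppercase first letter
    "lexical.9lives",              # rejected: segment starts with a digit
    "lexical..slang_density",      # rejected: empty segment
]
results = {s: is_well_formed(s) for s in samples}
```

One design detail worth knowing: with `re.match`, `$` also matches just before a trailing newline, so `"a.b\n"` passes this check. Using `PATH_PATTERN.fullmatch` would be the stricter choice if that ever matters.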