lda-ruby 0.3.9 → 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (54)
  1. checksums.yaml +5 -13
  2. data/CHANGELOG.md +16 -0
  3. data/Gemfile +9 -0
  4. data/README.md +126 -3
  5. data/VERSION.yml +3 -3
  6. data/docs/modernization-handoff.md +233 -0
  7. data/docs/porting-strategy.md +148 -0
  8. data/docs/precompiled-platform-policy.md +81 -0
  9. data/docs/precompiled-target-evaluation.md +67 -0
  10. data/docs/release-runbook.md +192 -0
  11. data/docs/rust-orchestration-guardrails.md +50 -0
  12. data/ext/lda-ruby/cokus.c +10 -11
  13. data/ext/lda-ruby/cokus.h +3 -3
  14. data/ext/lda-ruby/extconf.rb +10 -6
  15. data/ext/lda-ruby/lda-inference.c +23 -7
  16. data/ext/lda-ruby/utils.c +8 -0
  17. data/ext/lda-ruby-rust/Cargo.toml +12 -0
  18. data/ext/lda-ruby-rust/README.md +73 -0
  19. data/ext/lda-ruby-rust/extconf.rb +135 -0
  20. data/ext/lda-ruby-rust/include/strings.h +35 -0
  21. data/ext/lda-ruby-rust/src/lib.rs +1263 -0
  22. data/lda-ruby.gemspec +0 -0
  23. data/lib/lda-ruby/backends/base.rb +133 -0
  24. data/lib/lda-ruby/backends/native.rb +158 -0
  25. data/lib/lda-ruby/backends/pure_ruby.rb +675 -0
  26. data/lib/lda-ruby/backends/rust.rb +607 -0
  27. data/lib/lda-ruby/backends.rb +58 -0
  28. data/lib/lda-ruby/corpus/corpus.rb +17 -15
  29. data/lib/lda-ruby/corpus/data_corpus.rb +2 -2
  30. data/lib/lda-ruby/corpus/directory_corpus.rb +2 -2
  31. data/lib/lda-ruby/corpus/text_corpus.rb +2 -2
  32. data/lib/lda-ruby/document/document.rb +6 -6
  33. data/lib/lda-ruby/document/text_document.rb +5 -4
  34. data/lib/lda-ruby/rust_build_policy.rb +21 -0
  35. data/lib/lda-ruby/version.rb +5 -0
  36. data/lib/lda-ruby.rb +293 -48
  37. data/test/backend_compatibility_test.rb +146 -0
  38. data/test/backends_selection_test.rb +100 -0
  39. data/test/benchmark_scripts_test.rb +23 -0
  40. data/test/gemspec_test.rb +27 -0
  41. data/test/lda_ruby_test.rb +49 -11
  42. data/test/packaged_gem_smoke_test.rb +33 -0
  43. data/test/pure_ruby_orchestration_test.rb +109 -0
  44. data/test/release_scripts_test.rb +93 -0
  45. data/test/rust_build_policy_test.rb +23 -0
  46. data/test/rust_orchestration_test.rb +911 -0
  47. data/test/simple_pipeline_test.rb +22 -0
  48. data/test/simple_yaml.rb +1 -7
  49. data/test/test_helper.rb +5 -6
  50. metadata +54 -38
  51. data/Rakefile +0 -61
  52. data/ext/lda-ruby/Makefile +0 -181
  53. data/test/data/.gitignore +0 -2
  54. data/test/simple_test.rb +0 -26
data/docs/precompiled-platform-policy.md ADDED
@@ -0,0 +1,81 @@
1
+ # Precompiled Platform Gem Policy (Phase 5B)
2
+
3
+ This document defines the publish strategy and compatibility policy for `lda-ruby` precompiled gems.
4
+
5
+ ## Artifact Strategy
6
+
7
+ Each release version publishes a split package set:
8
+
9
+ - Source gem: `lda-ruby-<version>.gem`
10
+ - Precompiled platform gems:
11
+ - `lda-ruby-<version>-x86_64-linux.gem`
12
+ - `lda-ruby-<version>-x86_64-darwin.gem`
13
+ - `lda-ruby-<version>-arm64-darwin.gem`
14
+ - `lda-ruby-<version>-x64-mingw-ucrt.gem`
15
+ - `lda-ruby-<version>-x86_64-linux-musl.gem`
16
+
17
+ The source gem remains the universal fallback. Platform gems are additive and are expected to install without local build tools.
18
+ Precompiled artifacts are built on matching host runners (no cross-compilation in current workflow).
19
+
20
+ ## Compatibility Policy
21
+
22
+ - Supported Ruby versions: 3.2 and 3.3 (plus future versions validated by CI).
23
+ - Release-blocking precompiled targets:
24
+ - Linux `x86_64-linux`
25
+ - Linux musl `x86_64-linux-musl`
26
+ - macOS Intel `x86_64-darwin`
27
+ - macOS Apple Silicon `arm64-darwin`
28
+ - Windows `x64-mingw-ucrt`
29
+ - Other platforms:
30
+ - Install from source gem.
31
+ - Runtime remains supported through native/pure fallback paths.
32
+
33
+ Backend behavior expectations:
34
+
35
+ - Platform gem install:
36
+ - `auto` backend resolves to `rust` by default.
37
+ - `native` and `pure` overrides continue to work.
38
+ - Source gem install:
39
+ - Rust build policy is controlled by `LDA_RUBY_RUST_BUILD=auto|always|never`.
40
+ - If Rust build is skipped/unavailable, `auto` falls back to `native`, then `pure_ruby`.
41
+
42
+ ## Guardrails
43
+
44
+ Validation must pass before publish:
45
+
46
+ - `./bin/release-preflight` (source-gem checks).
47
+ - `./bin/release-precompiled-artifacts --platform <target>` for each release-blocking platform.
48
+
49
+ Release automation requirements:
50
+
51
+ - `.github/workflows/release.yml` builds source + precompiled artifacts.
52
+ - Release workflow matrix must include all release-blocking precompiled targets.
53
+ - Publish jobs push all built gems and attach checksums to GitHub releases.
54
+ - Post-publish verification job must validate RubyGems entries and GitHub release assets for the tagged version.
55
+
56
+ Continuous integration guardrail:
57
+
58
+ - `.github/workflows/ci.yml` runs `release-precompiled-artifacts` for the full release-blocking precompiled matrix (Linux, Linux musl, macOS Intel, macOS Apple Silicon, Windows) on every branch/PR.
59
+ - macOS precompiled lanes pin Homebrew `llvm@18` (falling back to `llvm` if unavailable) and export `LIBCLANG_PATH` from the selected prefix to keep bindgen stable across Homebrew formula updates.
60
+ - `.github/workflows/precompiled-candidate-evaluation.yml` is used for additional platform candidate checks.
61
+ - `.github/workflows/release.yml` dry-run validates the full release-blocking matrix before publish.
62
+
63
+ Latest release-matrix validation:
64
+
65
+ - [release dry-run 22556487788](https://github.com/ealdent/lda-ruby/actions/runs/22556487788) succeeded for Linux, Linux musl, macOS Intel, macOS Apple Silicon, and Windows targets.
66
+
67
+ ## Rollout / Expansion Rules
68
+
69
+ When adding a new precompiled platform:
70
+
71
+ 1. Add target to release workflow matrix.
72
+ 2. Add or update CI coverage for that platform family.
73
+ 3. Update this policy and the release runbook support matrix.
74
+ 4. Record feasibility evidence and rollout notes in `docs/precompiled-target-evaluation.md`.
75
+ 5. Validate a dry-run release with `workflow_dispatch` before shipping.
76
+
77
+ When deprecating a precompiled platform:
78
+
79
+ 1. Remove platform from release matrix.
80
+ 2. Keep source-gem path available unless the overall platform support policy changes.
81
+ 3. Document deprecation in `CHANGELOG.md` and release notes.
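Note (illustrative, not part of the diff): the `auto` fallback order listed under "Backend behavior expectations" can be sketched in plain Ruby. The method name `resolve_backend`, the `available` list, and the error raised for an unavailable explicit backend are assumptions made for illustration; the gem's actual selection code is not shown in this hunk.

```ruby
# Hypothetical sketch of the documented `auto` fallback order.
# Names are illustrative and are not taken from the gem's source.
FALLBACK_ORDER = %i[rust native pure_ruby].freeze

# requested: :auto or an explicit backend name
# available: backends that loaded successfully on this install
def resolve_backend(requested, available)
  if requested == :auto
    # pure_ruby needs no compiled extension, so it is the last resort.
    return FALLBACK_ORDER.find { |name| available.include?(name) } || :pure_ruby
  end

  unless available.include?(requested)
    raise ArgumentError, "backend #{requested} is not available"
  end

  requested
end

puts resolve_backend(:auto, %i[rust native pure_ruby]) # platform gem install
puts resolve_backend(:auto, %i[native pure_ruby])      # Rust build skipped
puts resolve_backend(:auto, %i[pure_ruby])             # no compiled extension
```

An explicit override simply bypasses the ordered search, which matches the policy statement that `native` and `pure` overrides continue to work on platform gem installs.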
data/docs/precompiled-target-evaluation.md ADDED
@@ -0,0 +1,67 @@
1
+ # Precompiled Target Evaluation (Priority 2)
2
+
3
+ This document tracks current feasibility for expanding precompiled gem targets beyond the Phase 5B baseline.
4
+
5
+ Current release-blocking precompiled targets:
6
+
7
+ - `x86_64-linux`
8
+ - `x86_64-darwin`
9
+ - `arm64-darwin`
10
+ - `x64-mingw-ucrt`
11
+ - `x86_64-linux-musl`
12
+
13
+ Reference implementation constraints:
14
+
15
+ - `bin/release-precompiled-artifacts` only supports host-matching platform builds (no cross-compilation).
16
+ - Release workflow currently uses matching host runners for each precompiled target.
17
+
18
+ ## Candidate: Windows (`x64-mingw-ucrt`)
19
+
20
+ Status: promoted to release-blocking after release dry-run matrix success.
21
+
22
+ Feasibility notes:
23
+
24
+ - GitHub Actions provides Windows runners, so host-matching builds are possible in principle.
25
+ - Existing release tooling is bash-first and assumes POSIX shell ergonomics throughout.
26
+ - Runtime smoke and packaged-gem checks were validated in candidate runs before promotion.
27
+ - Candidate runs:
28
+ - [run 22555475302](https://github.com/ealdent/lda-ruby/actions/runs/22555475302): failed in native extension compile (`rake compile`) with `cokus.h` macro collision and `time_t` mismatch.
29
+ - [run 22555550326](https://github.com/ealdent/lda-ruby/actions/runs/22555550326): progressed further, failed on `utils.c` `mkdir(name, mode)` mismatch (Windows `_mkdir` required).
30
+ - [run 22556009214](https://github.com/ealdent/lda-ruby/actions/runs/22556009214): Rust bindgen/toolchain parsing fixed; build then failed on Windows DLL name staging expectation.
31
+ - [run 22556129503](https://github.com/ealdent/lda-ruby/actions/runs/22556129503): Windows candidate build + artifact upload succeeded after GNU toolchain alignment, bindgen header/sysroot setup, and dual DLL name staging support.
32
+ - [run 22556206925](https://github.com/ealdent/lda-ruby/actions/runs/22556206925): Windows candidate remained green with packaged-gem runtime smoke checks enabled.
33
+ - [run 22556487788](https://github.com/ealdent/lda-ruby/actions/runs/22556487788): release workflow dry-run succeeded with `windows-x64-mingw-ucrt` included in release matrix.
34
+
35
+ Required validation to promote:
36
+
37
+ 1. Completed: release dry-run matrix validation passed.
38
+
39
+ ## Candidate: musl Linux (`x86_64-linux-musl`)
40
+
41
+ Status: promoted to release-blocking after release dry-run matrix success.
42
+
43
+ Feasibility notes:
44
+
45
+ - Current workflow uses `ubuntu-latest` (glibc), not musl.
46
+ - Current artifact script rejects cross-platform builds, so a musl artifact requires either:
47
+ - a musl-hosted builder, or
48
+ - a dedicated musl-native build container/workflow path treated as host-equivalent for packaging.
49
+ - Local validation signal (2026-03-01): Alpine container dry-run succeeded for host-matching `aarch64-linux-musl` with:
50
+ - `./bin/release-precompiled-artifacts --platform <detected-musl-platform> --skip-preflight --skip-runtime-checks`
51
+ - Candidate workflow runs (2026-03-01):
52
+ - [run 22555475302](https://github.com/ealdent/lda-ruby/actions/runs/22555475302): built `x86_64-linux-musl` successfully but artifact upload path was misconfigured.
53
+ - [run 22555550326](https://github.com/ealdent/lda-ruby/actions/runs/22555550326): musl candidate built and uploaded artifacts successfully with corrected glob path (`pkg/lda-ruby-*-linux-musl.gem*`).
54
+ - [run 22556129503](https://github.com/ealdent/lda-ruby/actions/runs/22556129503): musl candidate build + artifact upload remained green alongside the fixed Windows lane.
55
+ - [run 22556206925](https://github.com/ealdent/lda-ruby/actions/runs/22556206925): musl candidate remained green with packaged-gem runtime smoke checks enabled.
56
+ - [run 22556487788](https://github.com/ealdent/lda-ruby/actions/runs/22556487788): release workflow dry-run succeeded with `linux-musl-x86_64` included in release matrix.
57
+
58
+ Required validation to promote:
59
+
60
+ 1. Completed: release dry-run matrix validation passed.
61
+
62
+ ## Recommendation
63
+
64
+ Current expansion step is complete for Windows and musl. Any additional target should follow the same sequence:
65
+ 1. Add candidate workflow coverage.
66
+ 2. Verify candidate runtime checks.
67
+ 3. Validate one release dry-run with the new matrix lane before promotion.
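Note (illustrative, not part of the diff): the corrected upload glob quoted in the musl candidate runs can be checked with Ruby's `File.fnmatch`. The file names below are made-up examples that follow the naming scheme in the platform policy, and the `.sha256` suffix is the checksum extension mentioned in the release runbook.

```ruby
# Illustrative check of the upload glob quoted above; file names are examples.
MUSL_GLOB = "pkg/lda-ruby-*-linux-musl.gem*"

MUSL_CANDIDATES = [
  "pkg/lda-ruby-0.5.0-x86_64-linux-musl.gem",
  "pkg/lda-ruby-0.5.0-x86_64-linux-musl.gem.sha256",
  "pkg/lda-ruby-0.5.0-x86_64-linux.gem",
  "pkg/lda-ruby-0.5.0.gem"
].freeze

# Keep only the paths the glob would pick up for upload.
MUSL_MATCHES = MUSL_CANDIDATES.select { |path| File.fnmatch(MUSL_GLOB, path) }

puts MUSL_MATCHES
```

The trailing `*` is what lets one pattern cover both the gem and its checksum file while still excluding the glibc and source gems.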
data/docs/release-runbook.md ADDED
@@ -0,0 +1,192 @@
1
+ # Release Runbook (Phase 5A + 5B)
2
+
3
+ This runbook defines the maintainer workflow for shipping `lda-ruby` source and precompiled platform gem releases.
4
+
5
+ Authoritative platform/support policy is maintained in `docs/precompiled-platform-policy.md`; expansion feasibility notes live in `docs/precompiled-target-evaluation.md`.
6
+
7
+ ## Scope
8
+
9
+ - Release artifact types:
10
+ - source gem: `pkg/lda-ruby-<version>.gem`
11
+ - precompiled gems (current targets are defined in `docs/precompiled-platform-policy.md`)
12
+ - Release trigger: git tag (`vX.Y.Z`) with matching version files
13
+ - Publish targets:
14
+ - RubyGems (`gem push`)
15
+ - GitHub Releases (gem + checksum attachment)
16
+
17
+ ## Prerequisites
18
+
19
+ 1. Access:
20
+ - push/tag rights on `master`
21
+ - access to GitHub Actions environments for release approvals
22
+ - RubyGems owner access for `lda-ruby`
23
+ 2. Local tooling:
24
+ - Ruby 3.2+ with Bundler
25
+ - Rust toolchain (`cargo`) for local precompiled-gem build checks
26
+ - `libclang` available to Rust bindgen
27
+ - Docker (recommended for reproducible checks)
28
+ 3. Repository state:
29
+ - release commit merged to `master`
30
+ - clean working tree
31
+ - version files in sync
32
+
33
+ ## Required Secrets and Environments
34
+
35
+ GitHub repository secret:
36
+
37
+ - `RUBYGEMS_API_KEY`: API key with push rights for `lda-ruby` and non-interactive publish support (no OTP prompt during `gem push`).
38
+
39
+ GitHub Actions environment:
40
+
41
+ - `release`: protect this environment with required reviewer approval.
42
+ - Both publish jobs in `.github/workflows/release.yml` are bound to `release`.
43
+
44
+ ## Release Preparation
45
+
46
+ 1. Prepare and update release files:
47
+
48
+ ```bash
49
+ ./bin/release-prepare 0.4.0
50
+ ```
51
+
52
+ 2. Review changes:
53
+ - `VERSION.yml`
54
+ - `lib/lda-ruby/version.rb`
55
+ - `CHANGELOG.md`
56
+
57
+ 3. Validate full release checks locally:
58
+
59
+ ```bash
60
+ SKIP_DOCKER=1 ./bin/release-preflight
61
+ ./bin/test-packaged-gem-manifest
62
+ ```
63
+
64
+ 4. Validate local precompiled gem flow for your current host platform:
65
+
66
+ ```bash
67
+ ./bin/release-precompiled-artifacts --tag v0.4.0 --skip-preflight
68
+ ```
69
+
70
+ Note: `release-precompiled-artifacts` only supports building for the current host platform (no cross-compilation).
71
+
72
+ 5. Verify RubyGems API key behavior before tagging:
73
+
74
+ ```bash
75
+ ./bin/verify-rubygems-api-key
76
+ ```
77
+
78
+ This check intentionally attempts a duplicate push of an existing gem version. A duplicate-rejected response is expected and confirms non-interactive auth works.
79
+
80
+ 6. Commit and merge to `master`.
81
+
82
+ ## Dry-Run Path (No Publish)
83
+
84
+ Use `workflow_dispatch` with `publish=false`.
85
+
86
+ Behavior:
87
+
88
+ - runs release validation and artifact build
89
+ - uploads source + precompiled `pkg/lda-ruby-*.gem` and checksum files as workflow artifacts
90
+ - does not push to RubyGems
91
+ - does not create a GitHub release
92
+
93
+ Latest verified dry-run reference:
94
+
95
+ - date: 2026-03-02
96
+ - workflow run: `https://github.com/ealdent/lda-ruby/actions/runs/22556487788`
97
+ - dispatch parameters: `release_tag=v0.4.0`, `publish=false`
98
+ - result: success across `validate`, `build_artifacts`, and full `build_precompiled_artifacts` matrix
99
+ - verified precompiled lanes:
100
+ - `linux-x86_64`
101
+ - `linux-musl-x86_64`
102
+ - `macos-x86_64`
103
+ - `macos-arm64`
104
+ - `windows-x64-mingw-ucrt`
105
+
106
+ Optional local dry-run equivalent:
107
+
108
+ ```bash
109
+ ./bin/release-artifacts --tag v0.4.0
110
+ ./bin/release-precompiled-artifacts --tag v0.4.0 --skip-preflight
111
+ ```
112
+
113
+ Candidate expansion workflow:
114
+
115
+ - For future platform evaluation beyond current release-blocking targets, run `.github/workflows/precompiled-candidate-evaluation.yml` via `workflow_dispatch`.
116
+ - Record outcome artifacts/logs in `docs/precompiled-target-evaluation.md`.
117
+
118
+ ## Known Publish Incident (`v0.4.0`)
119
+
120
+ - date: 2026-02-25
121
+ - release runs:
122
+ - `https://github.com/ealdent/lda-ruby/actions/runs/22383716372`
123
+ - `https://github.com/ealdent/lda-ruby/actions/runs/22383849236` (attempt 1 + rerun attempt 2 + rerun attempt 3)
124
+ - result: artifact build stages passed, `publish to RubyGems` failed with OTP-required auth (`You have enabled multifactor authentication but no OTP code provided.`)
125
+ - recovery action: rotated `release` environment secret `RUBYGEMS_API_KEY` to a CI-safe key and reran run `22383849236`.
126
+ - recovery result: rerun attempt 3 succeeded; RubyGems `0.4.0` and GitHub release `v0.4.0` published.
127
+
128
+ ## Publish Path (Tag-Driven)
129
+
130
+ 1. Ensure the release commit is on `master`.
131
+ 2. Create and push the release tag:
132
+
133
+ ```bash
134
+ git checkout master
135
+ git pull --ff-only
136
+ git tag -a v0.4.0 -m "Release v0.4.0"
137
+ git push origin v0.4.0
138
+ ```
139
+
140
+ 3. Monitor `.github/workflows/release.yml`:
141
+ - `validate`
142
+ - `build_artifacts`
143
+ - `build_precompiled_artifacts` (linux + linux-musl + macOS + windows matrix)
144
+ - environment-gated `publish_rubygems`
145
+ - environment-gated `publish_github_release`
146
+ - `verify_published_artifacts`
147
+ - on failed tag-triggered `release.yml` runs, `.github/workflows/release-failure-alert.yml` opens a triage issue with failed job links
148
+ - if the same release run later succeeds (for example via rerun), the alert issue is auto-closed by `.github/workflows/release-failure-alert.yml`
149
+ 4. Approve the protected `release` environment when prompted.
150
+ 5. Confirm published outputs:
151
+ - RubyGems shows `lda-ruby` `0.4.0` source gem and platform gems
152
+ - GitHub release `v0.4.0` exists with all gem and `.sha256` attachments
153
+ - workflow job `verify_published_artifacts` succeeds
154
+
155
+ ## Rollback and Recovery
156
+
157
+ If publish fails before RubyGems push:
158
+
159
+ 1. Fix issue on `master`.
160
+ 2. Delete and recreate the tag only if the broken tag did not produce public artifacts:
161
+ - `git tag -d vX.Y.Z`
162
+ - `git push origin :refs/tags/vX.Y.Z`
163
+ 3. Re-tag and re-run release.
164
+
165
+ If RubyGems push succeeds but GitHub release fails:
166
+
167
+ 1. Re-run only the GitHub release path by re-running the workflow job after fix.
168
+ 2. Do not re-push gem for the same version.
169
+
170
+ If an incorrect gem is published:
171
+
172
+ 1. Yank from RubyGems:
173
+
174
+ ```bash
175
+ gem yank lda-ruby -v X.Y.Z
176
+ ```
177
+
178
+ 2. Publish a corrective version (for example `X.Y.(Z+1)`); do not re-use yanked version numbers.
179
+ 3. Update `CHANGELOG.md` and release notes to document the correction.
180
+
181
+ ## Troubleshooting
182
+
183
+ - `Could not find 'bundler'`: install the Bundler version pinned in `Gemfile.lock`.
184
+ - `cargo not found` in rust-enabled checks: ensure Rust toolchain is installed or run in Docker.
185
+ - `libclang` not found while building precompiled gems: install LLVM/libclang and set `LIBCLANG_PATH` if needed.
186
+ - Linux `Install Rust bindgen dependencies` can take several minutes on fresh runners due to apt package index updates and package installs.
187
+ - RubyGems publish asks for OTP (`You have enabled multi-factor authentication but no OTP code provided`): run `./bin/verify-rubygems-api-key`, then rotate `RUBYGEMS_API_KEY` to a CI-safe key if OTP is requested.
188
+ - Post-publish verification fails: run `./bin/verify-release-publish --tag vX.Y.Z` and fix missing RubyGems entries or GitHub release assets before considering the release complete.
189
+ - macOS Rust link errors (`symbol(s) not found` for Ruby APIs): ensure build path preserves `-C link-arg=-Wl,-undefined,dynamic_lookup` in `RUSTFLAGS`.
190
+ - Tag/version mismatch: run `./bin/check-version-sync --tag vX.Y.Z`.
191
+ - Artifact mismatch during release: rebuild with `./bin/release-artifacts --tag vX.Y.Z`.
192
+ - Precompiled artifact mismatch: rebuild with `./bin/release-precompiled-artifacts --tag vX.Y.Z --skip-preflight`.
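Note (illustrative, not part of the diff): the tag/version consistency that the runbook relies on (a `vX.Y.Z` tag with matching version files) can be sketched as below. The method name `tag_matches_version?` is hypothetical; the real `bin/check-version-sync` script is not shown in this diff and may check more than this.

```ruby
# Hypothetical sketch of a tag/version consistency check.
# The real bin/check-version-sync script is not part of this diff.
TAG_PATTERN = /\Av(\d+)\.(\d+)\.(\d+)\z/

# Returns true when a tag such as "v0.4.0" names the given version string.
def tag_matches_version?(tag, version)
  match = TAG_PATTERN.match(tag)
  return false unless match

  match.captures.join(".") == version
end

puts tag_matches_version?("v0.4.0", "0.4.0")
puts tag_matches_version?("v0.4.0", "0.5.0")
puts tag_matches_version?("0.4.0", "0.4.0") # missing "v" prefix
```

Anchoring the pattern with `\A` and `\z` rejects tags with stray prefixes or suffixes, which is the kind of mismatch a release gate would want to catch before any artifact is built.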
data/docs/rust-orchestration-guardrails.md ADDED
@@ -0,0 +1,50 @@
1
+ # Rust Orchestration Guardrails
2
+
3
+ This document defines the minimum parity and performance gates for deeper Rust orchestration refactors.
4
+
5
+ ## Numeric parity guardrails
6
+
7
+ Required tests:
8
+
9
+ - `bundle exec ruby -Ilib:test test/backend_compatibility_test.rb`
10
+ - `bundle exec ruby -Ilib:test test/rust_orchestration_test.rb`
11
+
12
+ Current parity expectations:
13
+
14
+ - Rust vs pure backend fixture parity remains exact within existing tolerances used by tests.
15
+ - Session-based orchestration paths (`run_em_on_session`, `run_em_on_session_with_start_seed`, `run_em_on_session_start`, `run_em_on_session_with_corpus`) must match direct non-session orchestration for equivalent settings/seeds.
16
+ - `Lda::Backends::Rust` cached-corpus EM should prefer the managed Rust session entrypoint (`run_em_on_session_with_corpus`) even when no active session id is cached locally, rather than branching in Ruby between session-only, recovery, and direct paths.
17
+ - `Lda::Backends::Rust` non-session fallback should prefer Rust start-aware orchestration (`run_em_with_start_seed`) before legacy beta-input orchestration (`run_em`).
18
+ - Direct non-session fallback should reuse the backend's cached Rust corpus snapshot rather than rebuilding corpus arrays from `@corpus` for each invocation.
19
+ - Legacy beta-input compatibility fallback should also reuse the backend's cached Rust corpus snapshot rather than rebuilding full EM corpus input in Ruby.
20
+ - Rust backend corpus/session lifecycle must not leak session count across corpus replacement.
21
+ - Missing-session recovery in managed session orchestration (`run_em_on_session_with_corpus`) must recreate a usable session and keep parity with direct orchestration.
22
+ - Managed Rust corpus orchestration (`run_em_on_session_with_corpus`) must keep parity with direct orchestration even when it falls back internally from session-backed execution to start-seeded array execution.
23
+ - Corpus reassignment through Rust session replacement lifecycle (`replace_corpus_session`) must preserve stable session count and route subsequent EM runs over updated corpus data.
24
+ - Unknown start-mode handling in seed-aware Rust orchestration must match Ruby's non-seeded fallback behavior when given the same explicit seed.
25
+
26
+ ## Benchmark guardrail
27
+
28
+ Run:
29
+
30
+ - `./bin/check-rust-benchmark`
31
+
32
+ Default benchmark policy:
33
+
34
+ - `BENCH_RUST_TO_PURE_MAX_RATIO=0.045`
35
+ - i.e., Rust mean runtime must be no more than 4.5% of pure mean runtime on the benchmark fixture/config.
36
+ - CI benchmark guardrail job enforces the same ratio with `BENCH_RUNS=1` for runtime stability.
37
+ - latest tightening evidence (2026-03-05): local Docker guardrail check with `BENCH_RUNS=3` observed Rust/Pure ratio `0.0368` (`rust=0.0758s`, `pure=2.0569s`), and prior CI streak data on `codex/rust-orchestration-phase8` (`22555725309` .. `22557953998`) observed `[0.0252, 0.0288]`, supporting a tighter `0.045` threshold with headroom.
38
+
39
+ Configurable environment knobs:
40
+
41
+ - `BENCH_RUNS` (default `5`)
42
+ - `BENCH_START` (default `seeded`)
43
+ - `BENCH_TOPICS` (default `8`)
44
+ - `BENCH_MAX_ITER` (default `20`)
45
+ - `BENCH_EM_MAX_ITER` (default `40`)
46
+ - `BENCH_RUST_TO_PURE_MAX_RATIO` (default `0.045`)
47
+
48
+ ## When to tighten thresholds
49
+
50
+ Tighten benchmark thresholds only after collecting multiple stable runs on the same host/environment and updating this document with the new target ratio.
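Note (illustrative, not part of the diff): the ratio gate described under "Default benchmark policy" amounts to dividing the mean Rust runtime by the mean pure-Ruby runtime and comparing the result with the threshold. The helper names below are hypothetical and the sample timings are the ones quoted in the tightening evidence; the real check is `./bin/check-rust-benchmark`, whose implementation is not shown here.

```ruby
# Hypothetical sketch of the ratio gate; not the real check-rust-benchmark script.
DEFAULT_MAX_RATIO = 0.045

def mean(samples)
  samples.sum / samples.size.to_f
end

# Mean Rust runtime divided by mean pure-Ruby runtime.
def benchmark_ratio(rust_runs, pure_runs)
  mean(rust_runs) / mean(pure_runs)
end

# Passes when the ratio stays at or below the configured threshold.
def guardrail_ok?(rust_runs, pure_runs, max_ratio: DEFAULT_MAX_RATIO)
  benchmark_ratio(rust_runs, pure_runs) <= max_ratio
end

# The threshold can be overridden through the documented environment knob.
max_ratio = Float(ENV.fetch("BENCH_RUST_TO_PURE_MAX_RATIO", DEFAULT_MAX_RATIO.to_s))
ratio = benchmark_ratio([0.0758], [2.0569])
puts format("ratio=%.4f threshold=%.3f", ratio, max_ratio)
```

With the quoted timings the ratio lands a little under 0.037, which leaves the headroom below `0.045` that the document cites as the reason for the tighter threshold.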
data/ext/lda-ruby/cokus.c CHANGED
@@ -45,14 +45,14 @@
45
45
 
46
46
  #include "cokus.h"
47
47
 
48
- static uint32 state[N+1]; // state vector + 1 extra to not violate ANSI C
48
+ static uint32 state[COKUS_N+1]; // state vector + 1 extra to not violate ANSI C
49
49
  static uint32 *next; // next random value is computed from here
50
50
  static int left = -1; // can *next++ this many times before reloading
51
51
 
52
52
  void seedMT(uint32 seed)
53
53
  {
54
54
  //
55
- // We initialize state[0..(N-1)] via the generator
55
+ // We initialize state[0..(COKUS_N-1)] via the generator
56
56
  //
57
57
  // x_new = (69069 * x_old) mod 2^32
58
58
  //
@@ -100,28 +100,28 @@ void seedMT(uint32 seed)
100
100
  register uint32 x = (seed | 1U) & 0xFFFFFFFFU, *s = state;
101
101
  register int j;
102
102
 
103
- for(left=0, *s++=x, j=N; --j;
103
+ for(left=0, *s++=x, j=COKUS_N; --j;
104
104
  *s++ = (x*=69069U) & 0xFFFFFFFFU);
105
105
  }
106
106
 
107
107
 
108
108
  uint32 reloadMT(void)
109
109
  {
110
- register uint32 *p0=state, *p2=state+2, *pM=state+M, s0, s1;
110
+ register uint32 *p0=state, *p2=state+2, *pM=state+COKUS_M, s0, s1;
111
111
  register int j;
112
112
 
113
113
  if(left < -1)
114
114
  seedMT(4357U);
115
115
 
116
- left=N-1, next=state+1;
116
+ left=COKUS_N-1, next=state+1;
117
117
 
118
- for(s0=state[0], s1=state[1], j=N-M+1; --j; s0=s1, s1=*p2++)
119
- *p0++ = *pM++ ^ (mixBits(s0, s1) >> 1) ^ (loBit(s1) ? K : 0U);
118
+ for(s0=state[0], s1=state[1], j=COKUS_N-COKUS_M+1; --j; s0=s1, s1=*p2++)
119
+ *p0++ = *pM++ ^ (mixBits(s0, s1) >> 1) ^ (loBit(s1) ? COKUS_K : 0U);
120
120
 
121
- for(pM=state, j=M; --j; s0=s1, s1=*p2++)
122
- *p0++ = *pM++ ^ (mixBits(s0, s1) >> 1) ^ (loBit(s1) ? K : 0U);
121
+ for(pM=state, j=COKUS_M; --j; s0=s1, s1=*p2++)
122
+ *p0++ = *pM++ ^ (mixBits(s0, s1) >> 1) ^ (loBit(s1) ? COKUS_K : 0U);
123
123
 
124
- s1=state[0], *p0 = *pM ^ (mixBits(s0, s1) >> 1) ^ (loBit(s1) ? K : 0U);
124
+ s1=state[0], *p0 = *pM ^ (mixBits(s0, s1) >> 1) ^ (loBit(s1) ? COKUS_K : 0U);
125
125
  s1 ^= (s1 >> 11);
126
126
  s1 ^= (s1 << 7) & 0x9D2C5680U;
127
127
  s1 ^= (s1 << 15) & 0xEFC60000U;
@@ -142,4 +142,3 @@ uint32 randomMT(void)
142
142
  y ^= (y >> 18);
143
143
  return(y);
144
144
  }
145
-
data/ext/lda-ruby/cokus.h CHANGED
@@ -12,9 +12,9 @@
12
12
 
13
13
  typedef unsigned long uint32;
14
14
 
15
- #define N (624) // length of state vector
16
- #define M (397) // a period parameter
17
- #define K (0x9908B0DFU) // a magic constant
15
+ #define COKUS_N (624) // length of state vector
16
+ #define COKUS_M (397) // a period parameter
17
+ #define COKUS_K (0x9908B0DFU) // a magic constant
18
18
  #define hiBit(u) ((u) & 0x80000000U) // mask all but highest bit of u
19
19
  #define loBit(u) ((u) & 0x00000001U) // mask all but lowest bit of u
20
20
  #define loBits(u) ((u) & 0x7FFFFFFFU) // mask the highest bit of u
data/ext/lda-ruby/extconf.rb CHANGED
@@ -1,9 +1,13 @@
1
- ENV["ARCHFLAGS"] = "-arch #{`uname -p` =~ /powerpc/ ? 'ppc' : 'i386'}"
1
+ # frozen_string_literal: true
2
2
 
3
- require 'mkmf'
3
+ require "mkmf"
4
4
 
5
- $CFLAGS << ' -Wall -ggdb -O0'
6
- $defs.push( "-D USE_RUBY" )
5
+ extension_name = "lda-ruby/lda"
6
+ dir_config(extension_name)
7
7
 
8
- dir_config('lda-ruby/lda')
9
- create_makefile("lda-ruby/lda")
8
+ $defs << "-DUSE_RUBY"
9
+ append_cflags("-Wall")
10
+ append_cflags("-Wextra")
11
+ append_cflags("-Wno-unused-parameter")
12
+
13
+ create_makefile(extension_name)
data/ext/lda-ruby/lda-inference.c CHANGED
@@ -435,9 +435,9 @@ void infer(char* model_root, char* save, corpus* corpus) {
435
435
  int main(int argc, char* argv[]) {
436
436
  corpus* corpus;
437
437
 
438
- long t1;
438
+ time_t t1;
439
439
  (void) time(&t1);
440
- seedMT(t1);
440
+ seedMT((uint32) t1);
441
441
  // seedMT(4357U);
442
442
 
443
443
  if (argc > 1)
@@ -614,7 +614,25 @@ void run_quiet_em(char* start, corpus* corpus) {
614
614
  * * em_convergence
615
615
  * * est_alpha
616
616
  */
617
- static VALUE wrap_set_config(VALUE self, VALUE init_alpha, VALUE num_topics, VALUE max_iter, VALUE convergence, VALUE em_max_iter, VALUE em_convergence, VALUE est_alpha) {
617
+ static VALUE wrap_set_config(int argc, VALUE* argv, VALUE self) {
618
+ VALUE init_alpha = Qnil;
619
+ VALUE num_topics = Qnil;
620
+ VALUE max_iter = Qnil;
621
+ VALUE convergence = Qnil;
622
+ VALUE em_max_iter = Qnil;
623
+ VALUE em_convergence = Qnil;
624
+ VALUE est_alpha = Qnil;
625
+
626
+ rb_check_arity(argc, 5, 7);
627
+
628
+ init_alpha = argv[0];
629
+ num_topics = argv[1];
630
+ max_iter = argv[2];
631
+ convergence = argv[3];
632
+ em_max_iter = argv[4];
633
+ em_convergence = (argc >= 6) ? argv[5] : rb_float_new(EM_CONVERGED);
634
+ est_alpha = (argc == 7) ? argv[6] : rb_int_new(ESTIMATE_ALPHA);
635
+
618
636
  INITIAL_ALPHA = NUM2DBL(init_alpha);
619
637
  NTOPICS = NUM2INT(num_topics);
620
638
  if( NTOPICS < 0 ) { rb_raise(rb_eRuntimeError, "NTOPICS must be greater than 0 - %d", NTOPICS); }
@@ -954,13 +972,11 @@ static VALUE wrap_get_model_settings(VALUE self) {
954
972
  }
955
973
 
956
974
 
957
- void Init_lda() {
975
+ void Init_lda(void) {
958
976
  corpus_loaded = FALSE;
959
977
  model_loaded = FALSE;
960
978
  VERBOSE = TRUE;
961
979
 
962
- rb_require("lda-ruby");
963
-
964
980
  rb_cLdaModule = rb_define_module("Lda");
965
981
  rb_cLda = rb_define_class_under(rb_cLdaModule, "Lda", rb_cObject);
966
982
  rb_cLdaCorpus = rb_define_class_under(rb_cLdaModule, "Corpus", rb_cObject);
@@ -977,7 +993,7 @@ void Init_lda() {
977
993
  rb_define_method(rb_cLda, "load_settings", wrap_load_settings, 1);
978
994
 
979
995
  // method to set all the config options at once
980
- rb_define_method(rb_cLda, "set_config", wrap_set_config, 5);
996
+ rb_define_method(rb_cLda, "set_config", wrap_set_config, -1);
981
997
 
982
998
  // accessor stuff for main settings
983
999
  rb_define_method(rb_cLda, "max_iter", wrap_get_max_iter, 0);
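Note (illustrative, not part of the diff): registering `set_config` with arity `-1` and calling `rb_check_arity(argc, 5, 7)` makes the last two arguments optional. Seen from Ruby, that behaves like a method with five required and two defaulted parameters. The class name `ConfigSketch` and the placeholder default values are assumptions; the C code takes its defaults from `EM_CONVERGED` and `ESTIMATE_ALPHA`, whose values are not shown in this hunk.

```ruby
# Ruby-level illustration of the new `set_config` arity (5 to 7 arguments).
# Default values are placeholders, not the extension's real constants.
EM_CONVERGED_PLACEHOLDER = 1e-4
ESTIMATE_ALPHA_PLACEHOLDER = 1

class ConfigSketch
  attr_reader :settings

  def set_config(init_alpha, num_topics, max_iter, convergence, em_max_iter,
                 em_convergence = EM_CONVERGED_PLACEHOLDER,
                 est_alpha = ESTIMATE_ALPHA_PLACEHOLDER)
    @settings = {
      init_alpha: init_alpha,
      num_topics: num_topics,
      max_iter: max_iter,
      convergence: convergence,
      em_max_iter: em_max_iter,
      em_convergence: em_convergence,
      est_alpha: est_alpha
    }
  end
end

sketch = ConfigSketch.new
sketch.set_config(0.3, 8, 20, 1e-6, 40)          # five arguments, defaults fill in
sketch.set_config(0.3, 8, 20, 1e-6, 40, 1e-5, 0) # all seven arguments

begin
  sketch.set_config(0.3, 8, 20, 1e-6)            # too few arguments
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

As with `rb_check_arity`, calls with fewer than five or more than seven arguments raise `ArgumentError`, so callers written against the old five-argument form keep working.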
data/ext/lda-ruby/utils.c CHANGED
@@ -1,5 +1,9 @@
1
1
  #include "utils.h"
2
2
 
3
+ #ifdef _WIN32
4
+ #include <direct.h>
5
+ #endif
6
+
3
7
  /*
4
8
  * given log(a) and log(b), return log(a + b)
5
9
  *
@@ -85,7 +89,11 @@ double log_gamma(double x)
85
89
 
86
90
  void make_directory(char* name)
87
91
  {
92
+ #ifdef _WIN32
93
+ _mkdir(name);
94
+ #else
88
95
  mkdir(name, S_IRUSR|S_IWUSR|S_IXUSR);
96
+ #endif
89
97
  }
90
98
 
91
99
 
data/ext/lda-ruby-rust/Cargo.toml ADDED
@@ -0,0 +1,12 @@
1
+ [package]
2
+ name = "lda_ruby_rust"
3
+ version = "0.1.0"
4
+ edition = "2021"
5
+ rust-version = "1.74"
6
+
7
+ [lib]
8
+ name = "lda_ruby_rust"
9
+ crate-type = ["cdylib"]
10
+
11
+ [dependencies]
12
+ magnus = "0.7"
data/ext/lda-ruby-rust/README.md ADDED
@@ -0,0 +1,73 @@
1
+ # Experimental Rust Extension Scaffold
2
+
3
+ This directory contains an experimental Rust extension scaffold built with `magnus`.
4
+
5
+ Current scope:
6
+
7
+ - Defines `Lda::RustBackend` module in Ruby.
8
+ - Exposes capability hooks:
9
+ - `Lda::RustBackend.available?`
10
+ - `Lda::RustBackend.abi_version`
11
+ - `Lda::RustBackend.corpus_session_count`
12
+ - `Lda::RustBackend.corpus_session_exists(session_id)`
13
+ - `Lda::RustBackend.before_em(start, num_docs, num_terms)`
14
+ - `Lda::RustBackend.topic_weights_for_word(beta, gamma, word_index, min_probability)`
15
+ - `Lda::RustBackend.accumulate_topic_term_counts(topic_term_counts, phi_d, words, counts)`
16
+ - `Lda::RustBackend.infer_document(beta, gamma_initial, words, counts, max_iter, convergence, min_probability, init_alpha)`
17
+ - `Lda::RustBackend.infer_corpus_iteration(beta, document_words, document_counts, max_iter, convergence, min_probability, init_alpha)`
18
+ - `Lda::RustBackend.normalize_topic_term_counts(topic_term_counts, min_probability)`
19
+ - `Lda::RustBackend.average_gamma_shift(previous_gamma, current_gamma)`
20
+ - `Lda::RustBackend.topic_document_probability(phi_tensor, document_counts, num_topics, min_probability)`
21
+ - `Lda::RustBackend.seeded_topic_term_probabilities(document_words, document_counts, topics, terms, min_probability)`
22
+ - `Lda::RustBackend.random_topic_term_probabilities(topics, terms, min_probability, random_seed)`
23
+ - `Lda::RustBackend.create_corpus_session(document_words, document_counts, terms)`
24
+ - `Lda::RustBackend.replace_corpus_session(session_id, document_words, document_counts, terms)`
25
+ - `Lda::RustBackend.drop_corpus_session(session_id)`
26
+ - `Lda::RustBackend.configure_corpus_session(session_id, topics, max_iter, convergence, em_max_iter, em_convergence, init_alpha, min_probability)`
27
+ - `Lda::RustBackend.run_em(initial_beta, document_words, document_counts, max_iter, convergence, em_max_iter, em_convergence, init_alpha, min_probability)`
28
+ - `Lda::RustBackend.run_em_with_start(start, document_words, document_counts, topics, terms, max_iter, convergence, em_max_iter, em_convergence, init_alpha, min_probability)`
29
+ - `Lda::RustBackend.run_em_with_start_seed(start, document_words, document_counts, topics, terms, max_iter, convergence, em_max_iter, em_convergence, init_alpha, min_probability, random_seed)`
30
+ - `Lda::RustBackend.run_em_on_session(session_id, start, topics, max_iter, convergence, em_max_iter, em_convergence, init_alpha, min_probability, random_seed)`
31
+ - `Lda::RustBackend.run_em_on_session_start(session_id, start, random_seed)`
32
+ - `Lda::RustBackend.run_em_on_session_with_start_seed(session_id, start, topics, max_iter, convergence, em_max_iter, em_convergence, init_alpha, min_probability, random_seed)`
33
+ - `Lda::RustBackend.run_em_on_session_with_corpus(session_id, document_words, document_counts, terms, start, topics, max_iter, convergence, em_max_iter, em_convergence, init_alpha, min_probability, random_seed)`
34
+
35
+ Hot-path kernels currently executed in Rust when `backend: :rust` is active:
36
+ - topic weights for a word across topics
37
+ - topic-term count accumulation from per-document `phi`
38
+ - full per-document inference loop (batched inner EM updates)
39
+ - full per-iteration corpus inference (batched document processing)
40
+ - topic-term normalization and log-probability finalization for EM beta updates
41
+ - gamma convergence shift reduction between EM iterations
42
+ - topic-document average log-probability computation
43
+ - seeded topic-term initialization
44
+ - random topic-term initialization with explicit seed control
45
+ - EM outer-loop orchestration with convergence checks (`run_em`)
46
+ - start-aware deterministic EM orchestration (`run_em_with_start` for `seeded`/`deterministic`)
47
+ - start-aware seeded and random EM orchestration with explicit seed control (`run_em_with_start_seed`)
48
+ - unified session-settings orchestration (`run_em_on_session`) that applies settings and executes EM in one call
49
+ - session-based EM orchestration against Rust-managed corpus lifecycle (`create_corpus_session` + `run_em_on_session_with_start_seed`)
50
+ - settings-aware session orchestration (`configure_corpus_session` + `run_em_on_session_start`)
51
+ - managed corpus orchestration (`run_em_on_session_with_corpus`) that can recreate missing sessions and, if session-backed execution cannot be used, falls back internally to direct start-aware execution inside Rust
52
+ - `Lda::Backends::Rust` prefers `run_em_on_session_with_corpus` whenever a cached Rust corpus snapshot is available, even if no session id is currently cached locally
53
+ - direct and legacy beta-input compatibility fallbacks both reuse the backend's cached Rust corpus snapshot instead of rebuilding corpus arrays in Ruby
54
+ - unknown EM start modes in seed-aware orchestration follow Ruby's non-seeded fallback behavior (seeded by explicit `random_seed`)
55
+
56
+ Remaining numeric LDA kernels are still provided by the pure Ruby backend and will move incrementally.
57
+
58
+ ## Local build (optional)
59
+
60
+ ```bash
61
+ cd ext/lda-ruby-rust
62
+ cargo build --release
63
+ ```
64
+
65
+ Then run Ruby with `require "lda_ruby_rust"` available on load path.
66
+
67
+ ## Install-time policy
68
+
69
+ During source gem installs, `ext/lda-ruby-rust/extconf.rb` can optionally build this extension.
70
+
71
+ - `LDA_RUBY_RUST_BUILD=auto` (default): build when `cargo` is available.
72
+ - `LDA_RUBY_RUST_BUILD=always`: require a successful Rust build or fail installation.
73
+ - `LDA_RUBY_RUST_BUILD=never`: always skip Rust build.
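Note (illustrative, not part of the diff): the three `LDA_RUBY_RUST_BUILD` modes above reduce to a small decision function. The method name `rust_build_decision`, the early failure when `cargo` is missing under `always`, and the rejection of unknown values are assumptions for illustration; the shipped logic in `ext/lda-ruby-rust/extconf.rb` and `lib/lda-ruby/rust_build_policy.rb` is not shown in this excerpt.

```ruby
# Hypothetical sketch of the install-time build policy described above.
VALID_POLICIES = %w[auto always never].freeze

# Returns :build or :skip; raises when `always` cannot be satisfied.
def rust_build_decision(policy, cargo_available)
  policy = (policy || "auto").strip.downcase
  unless VALID_POLICIES.include?(policy)
    # Assumption: unknown values are rejected rather than silently ignored.
    raise ArgumentError, "unknown LDA_RUBY_RUST_BUILD value: #{policy}"
  end

  case policy
  when "never"
    :skip
  when "always"
    # Assumption: a missing cargo is treated as an immediate failure.
    raise "cargo is required when LDA_RUBY_RUST_BUILD=always" unless cargo_available
    :build
  else
    cargo_available ? :build : :skip
  end
end

cargo_found = ENV.fetch("PATH", "").split(File::PATH_SEPARATOR).any? do |dir|
  File.executable?(File.join(dir, "cargo"))
end
puts rust_build_decision(ENV["LDA_RUBY_RUST_BUILD"], cargo_found)
```

Treating an unset variable as `auto` keeps default source-gem installs working on machines without a Rust toolchain, which is the behaviour the policy describes.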