text-metrics 0.0.1 → 1.0.0.beta1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: d2c50a6c558c741b22dd13e0c7327c0453ca619bc5dc535364c7af81e282a00d
4
- data.tar.gz: 761532ff88057efc471f1c0b1f8c1395f527d7ce1d2490e41a982e85994b22f9
3
+ metadata.gz: d045a3d3bb921b33db65e5f0ae8cb2211541732760e5a059c17ab3bf36335b86
4
+ data.tar.gz: 8ad9fb90b40800eeafe15e2bffbc48b69b226ac7f71f77429efbbec95d0be2ab
5
5
  SHA512:
6
- metadata.gz: 5327430d90c3a17bdc4822112700fcba99bb5f6fa5df7a22063b534048d87bfd59eb82b9fcd0363defa659c5e39a5e3109a005b94d55c35e23b2801bbd547532
7
- data.tar.gz: 6dde381cb4592e1158aced2f8c90bae5fd585951e1f02450da197b98c0eaa1741903d4cc0310402fb5dba2447e9908187782327c57c5855d000e8ded84239a63
6
+ metadata.gz: aff19ee3172ad857e3762964fbe13866fffa1f8a91402d138b8b50ce08324c537ae865bc02419b0177fc2d4af5d71cbfd16a4a44ca18fe843e1e90914df2cc39
7
+ data.tar.gz: dd3cb65ccd6e5f27d4715e367e409f7995dcac7e2e4b77c63f43e074e121ca32b5e584dd2b3167585f501fafe6db296c0cf0202b778a42f058283fad24e0370c
data/CHANGELOG.md CHANGED
@@ -1,3 +1,37 @@
1
1
  # Change log
2
2
 
3
3
  ## master
4
+
5
+ ## 1.0.0.beta1
6
+
7
+ First public (beta) release. The API was reworked for stability before tagging 1.0 — see
8
+ [`UPGRADING.md`](UPGRADING.md) for migration details. **Breaking changes:**
9
+
10
+ - `TextMetrics.new` now takes the text as a positional argument and returns the
11
+ language-specific analyzer directly: `TextMetrics.new("text", language: :en_us)`.
12
+ The `TextMetrics::TextMetrics` wrapper and its `#text_metrics_processor` accessor are gone.
13
+ - `language` is now a symbol (`:en_us`, `:fr`); strings are still accepted. An unknown
14
+ language raises `TextMetrics::Error` instead of a `NoMethodError`.
15
+ - `#all` was renamed to `#to_h`, and it is now the single source of truth for the metric
16
+ set — every individual reader and `#to_h` are derived from the same list, so they can't drift.
17
+ - Levenshtein moved off the analyzer to the module: `TextMetrics.distance(a, b)` (raw integer)
18
+ and `TextMetrics.similarity(a, b)` (0–100 score). This replaces
19
+ `levenshtein_distance_from(other, normalize:)`, whose return value changed meaning with the flag.
20
+ - Punctuation metrics renamed to the singular: `punctuation_count`,
21
+ `words_per_punctuation_average`, `punctuation_per_sentence_average`.
22
+ - `poly_syllabes_count` and the tokenizers are now private implementation details.
23
+ - Analyzed text is whitespace-normalized once and exposed via `#text`.
24
+ - English syllable counting now uses the CMU Pronouncing Dictionary as the source of truth
25
+ (falling back to `text-hyphen` only for out-of-vocabulary words), which is more accurate than
26
+ the previous hyphenation-only approach. English syllable counts and the readability scores
27
+ derived from them may shift slightly versus `0.x`.
28
+ - Readability scores are now computed from full-precision ratios (rounding only the final
29
+ result) instead of from the display-rounded averages — this removes errors of several points
30
+ that the intermediate rounding could introduce.
31
+ - Readability scores are no longer clamped: a Flesch score can now exceed 100 or go negative,
32
+ as the formulas define. (Previously clamped to 0–100 / 0–18 / 0–20.)
33
+ - French Flesch Reading Ease now uses the correct Kandel-Moles base constant `207` (was `206.835`).
34
+ - `flesch_kincaid_grade` now uses the standard US-grade formula for every language; the previous
35
+ unsourced French coefficients are gone (Flesch-Kincaid Grade has no validated French adaptation).
36
+ - `letters_per_word_average` and the Coleman-Liau index now count alphabetic letters only, not
37
+ digits and punctuation, per the Coleman-Liau definition.
data/README.md CHANGED
@@ -6,9 +6,23 @@
6
6
 
7
7
  Text Metrics is a Ruby library for text analysis. It is inspired from Textstat library in Python and the Ruby port of [Textstat](https://github.com/kupolak/textstat).
8
8
 
9
- In addition to basic metrics it also provides readability tests like Flesch Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog Index, Coleman-Liau Index.
9
+ In addition to basic metrics it also provides readability tests like Flesch Reading Ease, Flesch-Kincaid Grade Level, SMOG, Coleman-Liau Index and LIX. The main languages supported are English and French.
10
10
 
11
- At this point the main language supported are English and French.
11
+ ## Accurate, dictionary-based syllable counting
12
+
13
+ Readability scores (Flesch, Flesch-Kincaid, SMOG) are only as trustworthy as the syllable counts they are built on — and the hyphenation heuristics most libraries rely on get a lot of words wrong. Text Metrics counts syllables from real pronunciation dictionaries instead:
14
+
15
+ - **English** uses the [CMU Pronouncing Dictionary](https://github.com/cmusphinx/cmudict): pronunciation-derived syllable counts for ~126,000 words, falling back to hyphenation only for out-of-vocabulary words. Benchmarked against CMUdict as ground truth, the common hyphenation-only approach returns the **wrong** syllable count for **~46% of dictionary words** — Text Metrics uses the reference count directly.
16
+ - **French** uses counts derived from the [Lexique](http://www.lexique.org/) database, with a vowel heuristic patched by an exceptions list for the words it gets wrong.
17
+
18
+ This makes syllable-dependent metrics meaningfully more accurate than hyphenation-only implementations, especially on longer and less common words.
19
+
20
+ ### Formula notes
21
+
22
+ - Scores are computed from full-precision ratios and returned **unclamped** — a Flesch Reading Ease score can legitimately exceed 100 (very easy) or go below 0 (very difficult).
23
+ - **French Flesch Reading Ease** uses the Kandel-Moles (1958) adaptation: `207 − 1.015 × (words/sentences) − 73.6 × (syllables/words)`. Because its base is 207, very easy French text can score slightly above 100.
24
+ - **Flesch-Kincaid Grade** maps to a US school grade and uses the same formula for all languages (there is no validated French adaptation).
25
+ - **Coleman-Liau** counts alphabetic letters only (not digits or punctuation), per its definition.
12
26
 
13
27
  ## Features
14
28
 
@@ -47,23 +61,43 @@ gem "text-metrics", github: "plume-app/text-metrics", branch: "main"
47
61
  ## Usage
48
62
 
49
63
  ```ruby
50
- @text_analyser = TextMetrics.new(text: "This gem analyses all kind of texts.")
51
- # get all metrics at once:
52
-
53
- @text_analyser.all
54
- # { words_count: 7, characters_count: 30, sentences_count: 1, syllables_count: 9, syllables_per_word_average: 1.3, letters_per_word_average: 4.29, words_per_sentence_average: 7.0, characters_per_sentence_average: 30.0, flesch_reading_ease: 89.75, flesch_kincaid_grade: 2.5 }
55
-
56
- # or get each metric separately:
57
- @text_analyser.words_count # => 7
58
- @text_analyser.characters_count # => 30
59
- @text_analyser.sentences_count # => 1
60
- @text_analyser.syllables_count # => 9
61
- @text_analyser.syllables_per_word_average # => 1.3
62
- @text_analyser.letters_per_word_average # => 4.29
63
- @text_analyser.words_per_sentence_average # => 7.0
64
- @text_analyser.characters_per_sentence_average # => 30.0
65
- @text_analyser.flesch_reading_ease # => 89.75
66
- @text_analyser.flesch_kincaid_grade # => 2.5
64
+ metrics = TextMetrics.new("This gem analyses all kinds of text.")
65
+
66
+ # Get every metric at once as a Hash:
67
+ metrics.to_h
68
+ # {
69
+ # words_count: 7, characters_count: 30, sentences_count: 1, syllables_count: 11,
70
+ # punctuation_count: 1, syllables_per_word_average: 1.6, letters_per_word_average: 4.29,
71
+ # words_per_sentence_average: 7.0, characters_per_sentence_average: 30.0,
72
+ # words_per_punctuation_average: 7.0, punctuation_per_sentence_average: 1.0,
73
+ # flesch_reading_ease: 64.37, flesch_kincaid_grade: 6.0, lix: 21.29,
74
+ # smog_index: 0.0, coleman_liau_index: 5.2
75
+ # }
76
+
77
+ # Or read each metric on its own:
78
+ metrics.words_count # => 7
79
+ metrics.characters_count # => 30
80
+ metrics.flesch_reading_ease # => 64.37
81
+ metrics.flesch_kincaid_grade # => 6.0
82
+ ```
83
+
84
+ ### Languages
85
+
86
+ The default language is American English (`:en_us`). Pass `language:` to analyse French:
87
+
88
+ ```ruby
89
+ TextMetrics.new("Bonjour le monde.", language: :fr)
90
+ ```
91
+
92
+ An unknown language raises `TextMetrics::Error`.
93
+
94
+ ### Comparing two texts
95
+
96
+ Levenshtein comparison is between two texts, so it lives on the module itself:
97
+
98
+ ```ruby
99
+ TextMetrics.distance("kitten", "sitting") # => 3 raw edit distance
100
+ TextMetrics.similarity("kitten", "sitting") # => 57.14 0–100 score (100.0 == identical)
67
101
  ```
68
102
 
69
103
  ## Contributing
@@ -75,6 +109,8 @@ Bug reports and pull requests are welcome on GitHub at [https://github.com/plume
75
109
  This gem was inspired by [Textstat](https://github.com/kupolak/textstat) and [Textstat](https://github.com/textstat/textstat) in Python.
76
110
  This gem is generated via [`newgem` template](https://github.com/palkan/newgem) by [@palkan](https://github.com/palkan).
77
111
 
112
+ English syllable counts come from the [CMU Pronouncing Dictionary](https://github.com/cmusphinx/cmudict) (unrestricted use), with [text-hyphen](https://github.com/halostatue/text-hyphen) as a fallback. French syllable counts are derived from [Lexique](http://www.lexique.org/).
113
+
78
114
  ## License
79
115
 
80
116
  The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
data/UPGRADING.md ADDED
@@ -0,0 +1,73 @@
1
+ # Upgrading
2
+
3
+ ## To 1.0 (from the pre-release `0.x` internal API)
4
+
5
+ 1.0 is the first public release. The internal `0.x` API was reworked once, deliberately, so
6
+ the public surface can stay stable from here on. Everything below is a one-time migration.
7
+
8
+ ### Building an analyzer
9
+
10
+ `text` is now positional, and `TextMetrics.new` returns the analyzer itself — there is no
11
+ longer a wrapper object with a `text_metrics_processor` accessor.
12
+
13
+ ```ruby
14
+ # Before
15
+ analyser = TextMetrics.new(text: "Some text", language: "fr")
16
+ analyser.text_metrics_processor.words_count
17
+
18
+ # After
19
+ metrics = TextMetrics.new("Some text", language: :fr)
20
+ metrics.words_count
21
+ ```
22
+
23
+ ### Languages are symbols, and unknown ones raise
24
+
25
+ ```ruby
26
+ # Before — unknown language blew up with NoMethodError on nil
27
+ TextMetrics.new(text: "x", language: "es")
28
+
29
+ # After — symbols (strings still accepted), unknown languages raise TextMetrics::Error
30
+ TextMetrics.new("x", language: :es) # => raises TextMetrics::Error
31
+ ```
32
+
33
+ ### `#all` is now `#to_h`
34
+
35
+ ```ruby
36
+ # Before
37
+ metrics.all
38
+
39
+ # After
40
+ metrics.to_h
41
+ ```
42
+
43
+ `#to_h` and the individual metric readers are now generated from a single list
44
+ (`TextMetrics::Processors::Base::METRICS`), so they can never report different sets again.
45
+
46
+ ### Levenshtein moved to the module
47
+
48
+ The single method with a `normalize:` flag (which returned a raw distance or a 0–100 score
49
+ depending on the flag) is replaced by two intention-revealing module methods:
50
+
51
+ ```ruby
52
+ # Before
53
+ metrics.levenshtein_distance_from("other", normalize: false) # raw distance
54
+ metrics.levenshtein_distance_from("other") # 0–100 score
55
+
56
+ # After
57
+ TextMetrics.distance("text", "other") # => Integer, raw edit distance
58
+ TextMetrics.similarity("text", "other") # => Float, 0–100 score (100.0 == identical)
59
+ ```
60
+
61
+ ### Renamed metrics
62
+
63
+ The punctuation metrics dropped the plural; some helpers are now private:
64
+
65
+ | Before | After |
66
+ | ---------------------------------- | -------------------------------- |
67
+ | `punctuations_count` | `punctuation_count` |
68
+ | `words_per_punctuations_average` | `words_per_punctuation_average` |
69
+ | `punctuations_per_sentence_average`| `punctuation_per_sentence_average` |
70
+ | `poly_syllabes_count` (public) | private implementation detail |
71
+
72
+ The word/sentence/punctuation tokenizers are now private as well — use the count and average
73
+ metrics or `#to_h`.
@@ -0,0 +1,6 @@
1
+ # frozen_string_literal: true
2
+
3
+ # Bundler auto-requires the hyphenated gem name (`require "text-metrics"`) by default.
4
+ # This shim points it at the real entry point so `gem "text-metrics"` works without an
5
+ # explicit `require:` option.
6
+ require "text_metrics"