text-metrics 0.0.1 → 1.0.0.beta2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: d2c50a6c558c741b22dd13e0c7327c0453ca619bc5dc535364c7af81e282a00d
4
- data.tar.gz: 761532ff88057efc471f1c0b1f8c1395f527d7ce1d2490e41a982e85994b22f9
3
+ metadata.gz: e71911a9dd9a27cc6dcaa9eb9525cf48d73d48c5fc49526d475cb2a1f6de98c6
4
+ data.tar.gz: c686dfd9cf3cdb730b27c0329b1a0ba13748dea1de1f5730cd4283cea9885de2
5
5
  SHA512:
6
- metadata.gz: 5327430d90c3a17bdc4822112700fcba99bb5f6fa5df7a22063b534048d87bfd59eb82b9fcd0363defa659c5e39a5e3109a005b94d55c35e23b2801bbd547532
7
- data.tar.gz: 6dde381cb4592e1158aced2f8c90bae5fd585951e1f02450da197b98c0eaa1741903d4cc0310402fb5dba2447e9908187782327c57c5855d000e8ded84239a63
6
+ metadata.gz: 57de157f47ac0ab6cf79956509087371d9c9e12fbdbe6d80b7314110c1d8211e831561adb1c9476199e086bb3f6d33c55c1f3e7c39dcf144e04f0766c7e1773d
7
+ data.tar.gz: f7b5f77a6fba329093f3f55c38844ac8696f28c9d5bbfaf08750d464732bdf5a5484ebea57444a928ebe8a65e6fdec76a8ade187987df17cd419edc3c4f8555b
data/CHANGELOG.md CHANGED
@@ -1,3 +1,40 @@
1
1
  # Change log
2
2
 
3
- ## master
3
+ ## main
4
+
5
+ ## 1.0.0.beta2
6
+ - Add `gunning_fog_index` to the analyzer metrics and `#to_h` output.
7
+
8
+ ## 1.0.0.beta1
9
+
10
+ First public (beta) release. The API was reworked for stability before tagging 1.0 — see
11
+ [`UPGRADING.md`](UPGRADING.md) for migration details. **Breaking changes:**
12
+
13
+ - `TextMetrics.new` now takes the text as a positional argument and returns the
14
+ language-specific analyzer directly: `TextMetrics.new("text", language: :en_us)`.
15
+ The `TextMetrics::TextMetrics` wrapper and its `#text_metrics_processor` accessor are gone.
16
+ - `language` is now a symbol (`:en_us`, `:fr`); strings are still accepted. An unknown
17
+ language raises `TextMetrics::Error` instead of a `NoMethodError`.
18
+ - `#all` was renamed to `#to_h`. `#to_h` and every individual reader are now derived from one
19
+ shared metric list, so they can't drift.
20
+ - Levenshtein moved off the analyzer to the module: `TextMetrics.distance(a, b)` (raw integer)
21
+ and `TextMetrics.similarity(a, b)` (0–100 score). This replaces
22
+ `levenshtein_distance_from(other, normalize:)`, whose return value changed meaning with the flag.
23
+ - Punctuation metrics renamed to the singular: `punctuation_count`,
24
+ `words_per_punctuation_average`, `punctuation_per_sentence_average`.
25
+ - `poly_syllabes_count` and the tokenizers are now private implementation details.
26
+ - Analyzed text is whitespace-normalized once and exposed via `#text`.
27
+ - English syllable counting now uses the CMU Pronouncing Dictionary as the source of truth
28
+ (falling back to `text-hyphen` only for out-of-vocabulary words), which is more accurate than
29
+ the previous hyphenation-only approach. English syllable counts and the readability scores
30
+ derived from them may shift slightly versus `0.x`.
31
+ - Readability scores are now computed from full-precision ratios (rounding only the final
32
+ result) instead of from the display-rounded averages — this removes errors of several points
33
+ that the intermediate rounding could introduce.
34
+ - Readability scores are no longer clamped: a Flesch score can now exceed 100 or go negative,
35
+ as the formulas define. (Previously clamped to 0–100 / 0–18 / 0–20.)
36
+ - French Flesch Reading Ease now uses the correct Kandel-Moles base constant `207` (was `206.835`).
37
+ - `flesch_kincaid_grade` now uses the standard US-grade formula for every language; the previous
38
+ unsourced French coefficients are gone (Flesch-Kincaid Grade has no validated French adaptation).
39
+ - `letters_per_word_average` and the Coleman-Liau index now count alphabetic letters only, not
40
+ digits and punctuation, per the Coleman-Liau definition.
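The two readability changes above (full-precision ratios and unclamped scores) can be illustrated with a minimal sketch. The helper `flesch_reading_ease` below is hypothetical and is not the gem's implementation; it applies the standard English Flesch Reading Ease formula to the counts from the README sample (7 words, 1 sentence, 10 syllables).

```ruby
# Hypothetical helper for illustration only; not part of text-metrics.
# Standard English Flesch Reading Ease from two ratios.
def flesch_reading_ease(words_per_sentence, syllables_per_word)
  206.835 - (1.015 * words_per_sentence) - (84.6 * syllables_per_word)
end

words = 7
sentences = 1
syllables = 10

# Full precision: keep the exact ratios and round only the final score.
precise = flesch_reading_ease(words.fdiv(sentences), syllables.fdiv(words)).round(2)

# Display-rounded intermediates: 10 / 7 becomes 1.4 before the formula runs.
rounded = flesch_reading_ease(
  words.fdiv(sentences).round(1),
  syllables.fdiv(words).round(1)
).round(2)

puts precise # 78.87
puts rounded # 81.29

# Unclamped: a one-word, one-syllable sentence scores above 100.
puts flesch_reading_ease(1.0, 1.0).round(2) # 121.22
```

On this input the intermediate rounding alone moves the score by about 2.4 points, and the full-precision value agrees with the `flesch_reading_ease: 78.87` shown in the README sample output.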
data/README.md CHANGED
@@ -4,36 +4,51 @@
4
4
 
5
5
  # Text Metrics
6
6
 
7
- Text Metrics is a Ruby library for text analysis. It is inspired from Textstat library in Python and the Ruby port of [Textstat](https://github.com/kupolak/textstat).
7
+ Text Metrics is a Ruby library for analysing text. It was inspired by the Python Textstat library and the Ruby port of [Textstat](https://github.com/kupolak/textstat).
8
8
 
9
- In addition to basic metrics it also provides readability tests like Flesch Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog Index, Coleman-Liau Index.
9
+ It gives you the everyday counts you need — words, characters, sentences and syllables — plus readability scores such as Flesch Reading Ease, Flesch-Kincaid Grade Level, SMOG, Gunning Fog, Coleman-Liau Index and LIX. English and French are supported.
10
10
 
11
- At this point the main language supported are English and French.
11
+ ## Accurate, dictionary-based syllable counting
12
+
13
+ Readability scores such as Flesch, Flesch-Kincaid and SMOG depend heavily on syllable counts. Many libraries estimate those counts with hyphenation rules, which are fast but often wrong. Text Metrics starts from real pronunciation data instead:
14
+
15
+ - **English** uses the [CMU Pronouncing Dictionary](https://github.com/cmusphinx/cmudict), which provides pronunciation-derived syllable counts for about 126,000 words. Hyphenation is only used when a word is not in the dictionary. Against CMUdict as ground truth, a hyphenation-only approach gets the syllable count wrong for about **46% of dictionary words**.
16
+ - **French** uses counts derived from the [Lexique](http://www.lexique.org/) database, plus a vowel heuristic and an exceptions list for the words the heuristic misses.
17
+
18
+ That makes syllable-sensitive scores more reliable, especially for longer or less common words.
19
+
20
+ ### Formula notes
21
+
22
+ - Scores are computed from full-precision ratios and returned **unclamped**. A Flesch Reading Ease score can legitimately be above 100 for very easy text, or below 0 for very difficult text.
23
+ - **French Flesch Reading Ease** uses the Kandel-Moles (1958) adaptation: `207 − 1.015 × (words/sentences) − 73.6 × (syllables/words)`. Because the formula starts from 207, very easy French text can score slightly above 100.
24
+ - **Flesch-Kincaid Grade** maps to a US school grade. It uses the same formula for every language because there is no validated French adaptation.
25
+ - **Gunning Fog** treats words with three or more syllables as complex words, using the same syllable counts as SMOG.
26
+ - **Coleman-Liau** counts alphabetic letters only (not digits or punctuation), per its definition.
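As a rough sketch of two of the formula notes above, the helpers below compute the Kandel-Moles French score and a letters-only Coleman-Liau index from raw counts. Both method names and their keyword arguments are assumptions made for this example; the gem's internals may be organised differently.

```ruby
# Hypothetical helpers for illustration only; not the gem's internals.

# Kandel-Moles (1958) adaptation of Flesch Reading Ease for French.
def french_flesch_reading_ease(words:, sentences:, syllables:)
  207 - (1.015 * words.fdiv(sentences)) - (73.6 * syllables.fdiv(words))
end

# Coleman-Liau index. L and S are letters and sentences per 100 words.
def coleman_liau_index(text, words:, sentences:)
  letters = text.scan(/[[:alpha:]]/).size # alphabetic letters only
  l = letters.fdiv(words) * 100
  s = sentences.fdiv(words) * 100
  (0.0588 * l) - (0.296 * s) - 15.8
end

# A four-word sentence of one-syllable words scores above 100.
puts french_flesch_reading_ease(words: 4, sentences: 1, syllables: 4).round(2) # 129.34

# 29 letters, 7 words, 1 sentence.
puts coleman_liau_index("This gem analyses all kinds of text.", words: 7, sentences: 1).round(2) # 4.33
```

The Coleman-Liau result agrees with the `coleman_liau_index: 4.33` value in the README sample output. Word and sentence counts are passed in here because tokenization is outside the scope of the sketch.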
12
27
 
13
28
  ## Features
14
29
 
15
30
  _Basic metrics:_
16
31
 
17
- - [x] words count
18
- - [x] characters count
19
- - [x] sentences count
20
- - [x] syllables per word average
21
- - [x] letters per word average
22
- - [x] sentence length average
23
- - [x] sentence length (characters) average
32
+ - [x] word count
33
+ - [x] character count
34
+ - [x] sentence count
35
+ - [x] average syllables per word
36
+ - [x] average letters per word
37
+ - [x] average words per sentence
38
+ - [x] average characters per sentence
24
39
 
25
40
  _Readability tests:_
26
41
 
27
42
  - [x] Flesch Reading Ease
28
43
  - [x] Flesch-Kincaid Grade Level
29
44
  - [x] Smog Index
45
+ - [x] Gunning Fog Index
30
46
  - [x] Coleman-Liau Index
31
47
  - [x] Lix Index
32
- - [ ] Gunning Fog Index
33
48
 
34
49
  ## Installation
35
50
 
36
- No official release yet, but you can install it from GitHub:
51
+ Text Metrics is not published to RubyGems yet. For now, install it from GitHub:
37
52
 
38
53
  ```ruby
39
54
  # Gemfile
@@ -47,33 +62,56 @@ gem "text-metrics", github: "plume-app/text-metrics", branch: "main"
47
62
  ## Usage
48
63
 
49
64
  ```ruby
50
- @text_analyser = TextMetrics.new(text: "This gem analyses all kind of texts.")
51
- # get all metrics at once:
52
-
53
- @text_analyser.all
54
- # { words_count: 7, characters_count: 30, sentences_count: 1, syllables_count: 9, syllables_per_word_average: 1.3, letters_per_word_average: 4.29, words_per_sentence_average: 7.0, characters_per_sentence_average: 30.0, flesch_reading_ease: 89.75, flesch_kincaid_grade: 2.5 }
55
-
56
- # or get each metric separately:
57
- @text_analyser.words_count # => 7
58
- @text_analyser.characters_count # => 30
59
- @text_analyser.sentences_count # => 1
60
- @text_analyser.syllables_count # => 9
61
- @text_analyser.syllables_per_word_average # => 1.3
62
- @text_analyser.letters_per_word_average # => 4.29
63
- @text_analyser.words_per_sentence_average # => 7.0
64
- @text_analyser.characters_per_sentence_average # => 30.0
65
- @text_analyser.flesch_reading_ease # => 89.75
66
- @text_analyser.flesch_kincaid_grade # => 2.5
65
+ metrics = TextMetrics.new("This gem analyses all kinds of text.")
66
+
67
+ # Get every metric at once:
68
+ metrics.to_h
69
+ # {
70
+ # words_count: 7, characters_count: 30, sentences_count: 1, syllables_count: 10,
71
+ # punctuation_count: 1, syllables_per_word_average: 1.4, letters_per_word_average: 4.14,
72
+ # words_per_sentence_average: 7.0, characters_per_sentence_average: 30.0,
73
+ # words_per_punctuation_average: 7.0, punctuation_per_sentence_average: 1.0,
74
+ # flesch_reading_ease: 78.87, flesch_kincaid_grade: 4.0, lix: 21.29,
75
+ # smog_index: 0.0, gunning_fog_index: 8.5, coleman_liau_index: 4.33
76
+ # }
77
+
78
+ # Or ask for a single metric:
79
+ metrics.words_count # => 7
80
+ metrics.characters_count # => 30
81
+ metrics.flesch_reading_ease # => 78.87
82
+ metrics.flesch_kincaid_grade # => 4.0
83
+ metrics.gunning_fog_index # => 8.5
84
+ ```
85
+
86
+ ### Languages
87
+
88
+ American English (`:en_us`) is the default. To analyse French text, pass `language: :fr`:
89
+
90
+ ```ruby
91
+ TextMetrics.new("Bonjour le monde.", language: :fr)
92
+ ```
93
+
94
+ Unsupported languages raise `TextMetrics::Error`.
95
+
96
+ ### Comparing two texts
97
+
98
+ Levenshtein distance compares two strings, so it is exposed on the `TextMetrics` module:
99
+
100
+ ```ruby
101
+ TextMetrics.distance("kitten", "sitting") # => 3 raw edit distance
102
+ TextMetrics.similarity("kitten", "sitting") # => 57.14 0–100 score (100.0 == identical)
67
103
  ```
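For readers unfamiliar with the two numbers above, here is a minimal sketch of how an edit distance and a 0–100 similarity score can relate. The methods `edit_distance` and `similarity_score` are hypothetical stand-ins, and normalizing by the longer string's length is an assumption; it happens to reproduce the `57.14` shown for `"kitten"` and `"sitting"`, but the gem may compute its score differently.

```ruby
# Hypothetical stand-ins for illustration; not the gem's implementation.

# Classic Levenshtein distance using two rolling rows.
def edit_distance(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = (char_a == char_b) ? 0 : 1
      current << [previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost].min
    end
    previous = current
  end
  previous.last
end

# Assumption: the score is normalized by the longer string's length.
def similarity_score(a, b)
  longest = [a.length, b.length].max
  return 100.0 if longest.zero?

  ((1 - edit_distance(a, b).fdiv(longest)) * 100).round(2)
end

puts edit_distance("kitten", "sitting")    # 3
puts similarity_score("kitten", "sitting") # 57.14
```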
68
104
 
69
105
  ## Contributing
70
106
 
71
- Bug reports and pull requests are welcome on GitHub at [https://github.com/plume-app/text-metrics](https://github.com/plume-app/text-metrics).
107
+ Bug reports and pull requests are welcome on [GitHub](https://github.com/plume-app/text-metrics).
72
108
 
73
109
  ## Credits
74
110
 
75
- This gem was inspired by [Textstat](https://github.com/kupolak/textstat) and [Textstat](https://github.com/textstat/textstat) in Python.
76
- This gem is generated via [`newgem` template](https://github.com/palkan/newgem) by [@palkan](https://github.com/palkan).
111
+ This gem was inspired by [Textstat](https://github.com/textstat/textstat) in Python and the Ruby [Textstat](https://github.com/kupolak/textstat) port.
112
+ It was generated from the [`newgem` template](https://github.com/palkan/newgem) by [@palkan](https://github.com/palkan).
113
+
114
+ English syllable counts come from the [CMU Pronouncing Dictionary](https://github.com/cmusphinx/cmudict) (unrestricted use), with [text-hyphen](https://github.com/halostatue/text-hyphen) as a fallback. French syllable counts are derived from [Lexique](http://www.lexique.org/).
77
115
 
78
116
  ## License
79
117
 
data/UPGRADING.md ADDED
@@ -0,0 +1,73 @@
1
+ # Upgrading
2
+
3
+ ## To 1.0 (from the pre-release `0.x` internal API)
4
+
5
+ 1.0 is the first public release. The internal `0.x` API was reworked once, deliberately, so
6
+ the public surface can stay stable from here on. Everything below is a one-time migration.
7
+
8
+ ### Building an analyzer
9
+
10
+ `text` is now positional, and `TextMetrics.new` returns the analyzer itself — there is no
11
+ longer a wrapper object with a `text_metrics_processor` accessor.
12
+
+ ```ruby
+ # Before
+ analyser = TextMetrics.new(text: "Some text", language: "fr")
+ analyser.text_metrics_processor.words_count
+
+ # After
+ metrics = TextMetrics.new("Some text", language: :fr)
+ metrics.words_count
+ ```
22
+
23
+ ### Languages are symbols, and unknown ones raise
24
+
+ ```ruby
+ # Before: an unknown language raised NoMethodError on nil
+ TextMetrics.new(text: "x", language: "es")
+
+ # After: symbols (strings still accepted); unknown languages raise TextMetrics::Error
+ TextMetrics.new("x", language: :es) # => raises TextMetrics::Error
+ ```
32
+
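One way to get this behaviour is to normalize the argument to a symbol and fail fast with a dedicated error class. The sketch below is an assumption about the general pattern, not the gem's code: the `LanguageLookup` module, its `SUPPORTED` list and its `resolve` method are invented for illustration.

```ruby
# Hypothetical module for illustration; names are not taken from text-metrics.
module LanguageLookup
  class Error < StandardError; end

  SUPPORTED = %i[en_us fr].freeze

  # Accepts a symbol or a string and returns the canonical symbol.
  def self.resolve(language)
    key = language.to_s.to_sym
    return key if SUPPORTED.include?(key)

    raise Error, "unsupported language: #{language.inspect}"
  end
end

puts LanguageLookup.resolve(:fr).inspect  # :fr
puts LanguageLookup.resolve("fr").inspect # :fr

begin
  LanguageLookup.resolve(:es)
rescue LanguageLookup::Error => e
  puts e.message # unsupported language: :es
end
```

Raising a library-specific error at construction time gives callers one class to rescue, instead of a `NoMethodError` surfacing later from a `nil` lookup.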
33
+ ### `#all` is now `#to_h`
34
+
+ ```ruby
+ # Before
+ metrics.all
+
+ # After
+ metrics.to_h
+ ```
42
+
43
+ `#to_h` and the individual metric readers are now generated from a single list
44
+ (`TextMetrics::Processors::Base::METRICS`), so they can never report different sets again.
45
+
46
+ ### Levenshtein moved to the module
47
+
48
+ The single method with a `normalize:` flag (which returned a raw distance or a 0–100 score
49
+ depending on the flag) is replaced by two intention-revealing module methods:
50
+
+ ```ruby
+ # Before
+ metrics.levenshtein_distance_from("other", normalize: false) # raw distance
+ metrics.levenshtein_distance_from("other") # 0–100 score
+
+ # After
+ TextMetrics.distance("text", "other") # => Integer, raw edit distance
+ TextMetrics.similarity("text", "other") # => Float, 0–100 score (100.0 == identical)
+ ```
60
+
61
+ ### Renamed metrics
62
+
63
+ The punctuation metrics dropped the plural; some helpers are now private:
64
+
+ | Before                              | After                              |
+ | ----------------------------------- | ---------------------------------- |
+ | `punctuations_count`                | `punctuation_count`                |
+ | `words_per_punctuations_average`    | `words_per_punctuation_average`    |
+ | `punctuations_per_sentence_average` | `punctuation_per_sentence_average` |
+ | `poly_syllabes_count` (public)      | private implementation detail      |
71
+
72
+ The word/sentence/punctuation tokenizers are now private as well — use the count and average
73
+ metrics or `#to_h`.
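The "single list" idea described in the `#to_h` section can be sketched as follows. This is an illustration of the pattern only: `TinyAnalyzer`, its three metrics and their deliberately simplistic counting rules are hypothetical and do not reproduce `TextMetrics::Processors::Base`.

```ruby
# Hypothetical analyzer for illustration; not the gem's processor class.
class TinyAnalyzer
  # One list defines both the readers and the keys of #to_h.
  METRICS = {
    words_count: ->(text) { text.split.size },
    characters_count: ->(text) { text.delete(" ").length },
    sentences_count: ->(text) { text.scan(/[.!?]+/).size }
  }.freeze

  attr_reader :text

  def initialize(text)
    @text = text.strip.gsub(/\s+/, " ") # whitespace-normalized once
  end

  # Generate one reader per metric from the list.
  METRICS.each do |name, rule|
    define_method(name) { rule.call(text) }
  end

  # Built from the same list, so it cannot report a different set.
  def to_h
    METRICS.keys.to_h { |name| [name, public_send(name)] }
  end
end

metrics = TinyAnalyzer.new("This  gem analyses all kinds of text.")
puts metrics.words_count  # 7
puts metrics.to_h.inspect
```

Adding a metric means adding one entry to `METRICS`; the reader and the `#to_h` key appear together, which is the property the upgrade note describes.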
@@ -0,0 +1,6 @@
+ # frozen_string_literal: true
+
+ # Bundler auto-requires the hyphenated gem name (`require "text-metrics"`) by default.
+ # This shim points it at the real entry point so `gem "text-metrics"` works without an
+ # explicit `require:` option.
+ require "text_metrics"