text-metrics 0.0.1 → 1.0.0.beta2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (18)

checksums.yaml +4 -4
data/CHANGELOG.md +38 -1
data/README.md +70 -32
data/UPGRADING.md +73 -0
data/lib/text-metrics.rb +6 -0
data/lib/text_metrics/dictionaries/english_word_syllable_database.txt +126052 -0
data/lib/text_metrics/levenshtein.rb +46 -0
data/lib/text_metrics/processors/american_english.rb +38 -10
data/lib/text_metrics/processors/base.rb +117 -126
data/lib/text_metrics/processors/french.rb +32 -14
data/lib/text_metrics/version.rb +1 -1
data/lib/text_metrics.rb +28 -25
metadata +12 -14
data/lib/text_metrics/dictionnaries/en_us.txt +0 -2945
data/lib/text_metrics/dictionnaries/fr.txt +0 -1462
data/lib/text_metrics/dictionnaries/french_word_syllable_database.yml +0 -125345
data/lib/text_metrics/dictionnaries/lexique-383.csv +0 -142695
data/lib/text_metrics/{dictionnaries → dictionaries}/french_word_syllable_exceptions.yml +0 -0

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: d2c50a6c558c741b22dd13e0c7327c0453ca619bc5dc535364c7af81e282a00d
-  data.tar.gz: 761532ff88057efc471f1c0b1f8c1395f527d7ce1d2490e41a982e85994b22f9
+  metadata.gz: e71911a9dd9a27cc6dcaa9eb9525cf48d73d48c5fc49526d475cb2a1f6de98c6
+  data.tar.gz: c686dfd9cf3cdb730b27c0329b1a0ba13748dea1de1f5730cd4283cea9885de2
 SHA512:
-  metadata.gz: 5327430d90c3a17bdc4822112700fcba99bb5f6fa5df7a22063b534048d87bfd59eb82b9fcd0363defa659c5e39a5e3109a005b94d55c35e23b2801bbd547532
-  data.tar.gz: 6dde381cb4592e1158aced2f8c90bae5fd585951e1f02450da197b98c0eaa1741903d4cc0310402fb5dba2447e9908187782327c57c5855d000e8ded84239a63
+  metadata.gz: 57de157f47ac0ab6cf79956509087371d9c9e12fbdbe6d80b7314110c1d8211e831561adb1c9476199e086bb3f6d33c55c1f3e7c39dcf144e04f0766c7e1773d
+  data.tar.gz: f7b5f77a6fba329093f3f55c38844ac8696f28c9d5bbfaf08750d464732bdf5a5484ebea57444a928ebe8a65e6fdec76a8ade187987df17cd419edc3c4f8555b

data/CHANGELOG.md CHANGED

@@ -1,3 +1,40 @@
 # Change log
-## master
+## main
+## 1.0.0.beta2
+- Add `gunning_fog_index` to the analyzer metrics and `#to_h` output.
+## 1.0.0.beta1
+First public (beta) release. The API was reworked for stability before tagging 1.0 — see
+[`UPGRADING.md`](UPGRADING.md) for migration details. **Breaking changes:**
+- `TextMetrics.new` now takes the text as a positional argument and returns the
+  language-specific analyzer directly: `TextMetrics.new("text", language: :en_us)`.
+  The `TextMetrics::TextMetrics` wrapper and its `#text_metrics_processor` accessor are gone.
+- `language` is now a symbol (`:en_us`, `:fr`); strings are still accepted. An unknown
+  language raises `TextMetrics::Error` instead of a `NoMethodError`.
+- `#all` was renamed to `#to_h`, and it is now the single source of truth for the metric
+  set — every individual reader and `#to_h` are derived from the same list, so they can't drift.
+- Levenshtein moved off the analyzer to the module: `TextMetrics.distance(a, b)` (raw integer)
+  and `TextMetrics.similarity(a, b)` (0–100 score). This replaces
+  `levenshtein_distance_from(other, normalize:)`, whose return value changed meaning with the flag.
+- Punctuation metrics renamed to the singular: `punctuation_count`,
+  `words_per_punctuation_average`, `punctuation_per_sentence_average`.
+- `poly_syllabes_count` and the tokenizers are now private implementation details.
+- Analyzed text is whitespace-normalized once and exposed via `#text`.
+- English syllable counting now uses the CMU Pronouncing Dictionary as the source of truth
+  (falling back to `text-hyphen` only for out-of-vocabulary words), which is more accurate than
+  the previous hyphenation-only approach. English syllable counts and the readability scores
+  derived from them may shift slightly versus `0.x`.
+- Readability scores are now computed from full-precision ratios (rounding only the final
+  result) instead of from the display-rounded averages — this removes errors of several points
+  that the intermediate rounding could introduce.
+- Readability scores are no longer clamped: a Flesch score can now exceed 100 or go negative,
+  as the formulas define. (Previously clamped to 0–100 / 0–18 / 0–20.)
+- French Flesch Reading Ease now uses the correct Kandel-Moles base constant `207` (was `206.835`).
+- `flesch_kincaid_grade` now uses the standard US-grade formula for every language; the previous
+  unsourced French coefficients are gone (Flesch-Kincaid Grade has no validated French adaptation).
+- `letters_per_word_average` and the Coleman-Liau index now count alphabetic letters only, not
+  digits and punctuation, per the Coleman-Liau definition.
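
The precision and clamping entries above can be checked by hand against the README example in this diff (7 words, 1 sentence; 9 syllables in `0.0.1`, 10 in `1.0.0.beta2`). The sketch below is not the gem's code: the method names are hypothetical, the English constants are the standard Flesch ones, the French constants are the Kandel-Moles ones quoted in the README, and rounding the averages to one decimal is an assumption inferred from the displayed values (`1.3`, `1.4`).

```ruby
# frozen_string_literal: true

# Hypothetical helpers for illustration; not part of the text-metrics API.

# Standard Flesch Reading Ease, computed from full-precision ratios and
# rounded only at the end. No clamping is applied.
def flesch_full_precision(words:, sentences:, syllables:)
  words_per_sentence = words.to_f / sentences
  syllables_per_word = syllables.to_f / words
  (206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word).round(2)
end

# Same formula, but fed the display-rounded averages
# (assumed here to be rounded to one decimal place).
def flesch_from_rounded_averages(words:, sentences:, syllables:)
  words_per_sentence = (words.to_f / sentences).round(1)
  syllables_per_word = (syllables.to_f / words).round(1)
  (206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word).round(2)
end

# Kandel-Moles adaptation for French, as quoted in the README.
def flesch_french(words:, sentences:, syllables:)
  (207 - 1.015 * (words.to_f / sentences) - 73.6 * (syllables.to_f / words)).round(2)
end

puts flesch_from_rounded_averages(words: 7, sentences: 1, syllables: 9)  # 89.75
puts flesch_full_precision(words: 7, sentences: 1, syllables: 10)        # 78.87
puts flesch_from_rounded_averages(words: 7, sentences: 1, syllables: 10) # 81.29
```

With 10 syllables, rounding the average first moves the score from 78.87 to 81.29, which is the kind of gap the changelog describes. The `89.75` shown in the old README is what the formula gives for the rounded `1.3` average. Because nothing is clamped, a two-word French sentence of monosyllables scores well above 100, and a long polysyllabic English sentence goes negative.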

data/README.md CHANGED

@@ -4,36 +4,51 @@
 # Text Metrics
-Text Metrics is a Ruby library for text analysis. It is inspired from Textstat library in Python and the Ruby port of [Textstat](https://github.com/kupolak/textstat).
+Text Metrics is a Ruby library for analysing text. It was inspired by the Python Textstat library and the Ruby port of [Textstat](https://github.com/kupolak/textstat).
-In addition to basic metrics it also provides readability tests like Flesch Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog Index, Coleman-Liau Index.
+It gives you the everyday counts you need — words, characters, sentences and syllables — plus readability scores such as Flesch Reading Ease, Flesch-Kincaid Grade Level, SMOG, Gunning Fog, Coleman-Liau Index and LIX. English and French are supported.
-At this point the main language supported are English and French.
+## Accurate, dictionary-based syllable counting
+Readability scores such as Flesch, Flesch-Kincaid and SMOG depend heavily on syllable counts. Many libraries estimate those counts with hyphenation rules, which are fast but often wrong. Text Metrics starts from real pronunciation data instead:
+- **English** uses the [CMU Pronouncing Dictionary](https://github.com/cmusphinx/cmudict), which provides pronunciation-derived syllable counts for about 126,000 words. Hyphenation is only used when a word is not in the dictionary. Against CMUdict as ground truth, a hyphenation-only approach gets the syllable count wrong for about **46% of dictionary words**.
+- **French** uses counts derived from the [Lexique](http://www.lexique.org/) database, plus a vowel heuristic and an exceptions list for the words the heuristic misses.
+That makes syllable-sensitive scores more reliable, especially for longer or less common words.
+### Formula notes
+- Scores are computed from full-precision ratios and returned **unclamped**. A Flesch Reading Ease score can legitimately be above 100 for very easy text, or below 0 for very difficult text.
+- **French Flesch Reading Ease** uses the Kandel-Moles (1958) adaptation: `207 − 1.015 × (words/sentences) − 73.6 × (syllables/words)`. Because the formula starts from 207, very easy French text can score slightly above 100.
+- **Flesch-Kincaid Grade** maps to a US school grade. It uses the same formula for every language because there is no validated French adaptation.
+- **Gunning Fog** uses words with three or more syllables as complex words, matching the same syllable counts used by SMOG.
+- **Coleman-Liau** counts alphabetic letters only (not digits or punctuation), per its definition.
 ## Features
 _Basic metrics:_
-- [x] words count
-- [x] characters count
-- [x] sentences count
-- [x] syllables per word average
-- [x] letters per word average
-- [x] sentence length average
-- [x] sentence length (characters) average
+- [x] word count
+- [x] character count
+- [x] sentence count
+- [x] average syllables per word
+- [x] average letters per word
+- [x] average words per sentence
+- [x] average characters per sentence
 _Readability tests:_
 - [x] Flesch Reading Ease
 - [x] Flesch-Kincaid Grade Level
 - [x] Smog Index
+- [x] Gunning Fog Index
 - [x] Coleman-Liau Index
 - [x] Lix Index
-- [ ] Gunning Fog Index
 ## Installation
-No official release yet, but you can install it from GitHub:
+Text Metrics is not published to RubyGems yet. For now, install it from GitHub:
 ```ruby
 # Gemfile
@@ -47,33 +62,56 @@ gem "text-metrics", github: "plume-app/text-metrics", branch: "main"
 ## Usage
 ```ruby
-@text_analyser = TextMetrics.new(text: "This gem analyses all kind of texts.")
-# get all metrics at once:
-@text_analyser.all
-# { words_count: 7, characters_count: 30, sentences_count: 1, syllables_count: 9, syllables_per_word_average: 1.3, letters_per_word_average: 4.29, words_per_sentence_average: 7.0, characters_per_sentence_average: 30.0, flesch_reading_ease: 89.75, flesch_kincaid_grade: 2.5 }
-# or get each metric separately:
-@text_analyser.words_count # => 7
-@text_analyser.characters_count # => 30
-@text_analyser.sentences_count # => 1
-@text_analyser.syllables_count # => 9
-@text_analyser.syllables_per_word_average # => 1.3
-@text_analyser.letters_per_word_average # => 4.29
-@text_analyser.words_per_sentence_average # => 7.0
-@text_analyser.characters_per_sentence_average # => 30.0
-@text_analyser.flesch_reading_ease # => 89.75
-@text_analyser.flesch_kincaid_grade # => 2.5
+metrics = TextMetrics.new("This gem analyses all kinds of text.")
+# Get every metric at once:
+metrics.to_h
+# {
+#   words_count: 7, characters_count: 30, sentences_count: 1, syllables_count: 10,
+#   punctuation_count: 1, syllables_per_word_average: 1.4, letters_per_word_average: 4.14,
+#   words_per_sentence_average: 7.0, characters_per_sentence_average: 30.0,
+#   words_per_punctuation_average: 7.0, punctuation_per_sentence_average: 1.0,
+#   flesch_reading_ease: 78.87, flesch_kincaid_grade: 4.0, lix: 21.29,
+#   smog_index: 0.0, gunning_fog_index: 8.5, coleman_liau_index: 4.33
+# }
+# Or ask for a single metric:
+metrics.words_count                     # => 7
+metrics.characters_count                # => 30
+metrics.flesch_reading_ease             # => 78.87
+metrics.flesch_kincaid_grade            # => 4.0
+metrics.gunning_fog_index               # => 8.5
+```
+### Languages
+American English (`:en_us`) is the default. To analyse French text, pass `language: :fr`:
+```ruby
+TextMetrics.new("Bonjour le monde.", language: :fr)
+```
+Unsupported languages raise `TextMetrics::Error`.
+### Comparing two texts
+Levenshtein distance compares two strings, so it is exposed on the `TextMetrics` module:
+```ruby
+TextMetrics.distance("kitten", "sitting")   # => 3      raw edit distance
+TextMetrics.similarity("kitten", "sitting") # => 57.14  0–100 score (100.0 == identical)
 ```
 ## Contributing
-Bug reports and pull requests are welcome on GitHub at [https://github.com/plume-app/text-metrics](https://github.com/plume-app/text-metrics).
+Bug reports and pull requests are welcome on [GitHub](https://github.com/plume-app/text-metrics).
 ## Credits
-This gem was inspired by [Textstat](https://github.com/kupolak/textstat) and [Textstat](https://github.com/textstat/textstat) in Python.
-This gem is generated via [`newgem` template](https://github.com/palkan/newgem) by [@palkan](https://github.com/palkan).
+This gem was inspired by [Textstat](https://github.com/textstat/textstat) in Python and the Ruby [Textstat](https://github.com/kupolak/textstat) port.
+It was generated from the [`newgem` template](https://github.com/palkan/newgem) by [@palkan](https://github.com/palkan).
+English syllable counts come from the [CMU Pronouncing Dictionary](https://github.com/cmusphinx/cmudict) (unrestricted use), with [text-hyphen](https://github.com/halostatue/text-hyphen) as a fallback. French syllable counts are derived from [Lexique](http://www.lexique.org/).
 ## License
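
Several values in the new README `#to_h` output can be reproduced from the raw counts it lists. The sketch below uses the textbook formulas for Flesch-Kincaid Grade, Coleman-Liau and LIX. It is not taken from the gem's source, and the method names are hypothetical. The letter count (29) is derived from `letters_per_word_average: 4.14` over 7 words, and the single long word (more than six letters: "analyses") is counted by hand.

```ruby
# frozen_string_literal: true

# Hypothetical helpers for illustration; not part of the text-metrics API.

# Flesch-Kincaid Grade Level (standard US-grade formula).
def flesch_kincaid_grade(words:, sentences:, syllables:)
  (0.39 * (words.to_f / sentences) + 11.8 * (syllables.to_f / words) - 15.59).round(2)
end

# Coleman-Liau index: L is letters per 100 words, S is sentences per 100 words.
# Only alphabetic letters are counted, matching the changelog entry.
def coleman_liau_index(letters:, words:, sentences:)
  letters_per_100_words = letters.to_f / words * 100
  sentences_per_100_words = sentences.to_f / words * 100
  (0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8).round(2)
end

# LIX: average sentence length plus the percentage of long words
# (words with more than six letters).
def lix(words:, sentences:, long_words:)
  (words.to_f / sentences + 100.0 * long_words / words).round(2)
end

# Counts for "This gem analyses all kinds of text."
puts flesch_kincaid_grade(words: 7, sentences: 1, syllables: 10) # 4.0
puts coleman_liau_index(letters: 29, words: 7, sentences: 1)     # 4.33
puts lix(words: 7, sentences: 1, long_words: 1)                  # 21.29
```

These match the `flesch_kincaid_grade`, `coleman_liau_index` and `lix` values in the README output above, which is consistent with the gem computing from unrounded ratios.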

data/UPGRADING.md ADDED

@@ -0,0 +1,73 @@
+# Upgrading
+## To 1.0 (from the pre-release `0.x` internal API)
+1.0 is the first public release. The internal `0.x` API was reworked once, deliberately, so
+the public surface can stay stable from here on. Everything below is a one-time migration.
+### Building an analyzer
+`text` is now positional, and `TextMetrics.new` returns the analyzer itself — there is no
+longer a wrapper object with a `text_metrics_processor` accessor.
+```ruby
+# Before
+analyser = TextMetrics.new(text: "Some text", language: "fr")
+analyser.text_metrics_processor.words_count
+# After
+metrics = TextMetrics.new("Some text", language: :fr)
+metrics.words_count
+```
+### Languages are symbols, and unknown ones raise
+```ruby
+# Before — unknown language blew up with NoMethodError on nil
+TextMetrics.new(text: "x", language: "es")
+# After — symbols (strings still accepted), unknown languages raise TextMetrics::Error
+TextMetrics.new("x", language: :es) # => raises TextMetrics::Error
+```
+### `#all` is now `#to_h`
+```ruby
+# Before
+metrics.all
+# After
+metrics.to_h
+```
+`#to_h` and the individual metric readers are now generated from a single list
+(`TextMetrics::Processors::Base::METRICS`), so they can never report different sets again.
+### Levenshtein moved to the module
+The single method with a `normalize:` flag (which returned a raw distance or a 0–100 score
+depending on the flag) is replaced by two intention-revealing module methods:
+```ruby
+# Before
+metrics.levenshtein_distance_from("other", normalize: false) # raw distance
+metrics.levenshtein_distance_from("other")                   # 0–100 score
+# After
+TextMetrics.distance("text", "other")   # => Integer, raw edit distance
+TextMetrics.similarity("text", "other") # => Float, 0–100 score (100.0 == identical)
+```
+### Renamed metrics
+The punctuation metrics dropped the plural; some helpers are now private:
+| Before                             | After                            |
+| ---------------------------------- | -------------------------------- |
+| `punctuations_count`               | `punctuation_count`              |
+| `words_per_punctuations_average`   | `words_per_punctuation_average`  |
+| `punctuations_per_sentence_average` | `punctuation_per_sentence_average` |
+| `poly_syllabes_count` (public)     | private implementation detail    |
+The word/sentence/punctuation tokenizers are now private as well — use the count and average
+metrics or `#to_h`.
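
For the Levenshtein split described in this file, the sketch below shows one way a raw distance and a 0–100 similarity score can relate. The contents of `lib/text_metrics/levenshtein.rb` are not shown in this diff, so this is an assumption, not the gem's implementation: the method names are hypothetical, and the normalization (one minus distance over the longer length) is chosen only because it reproduces the README's `57.14` for "kitten" and "sitting".

```ruby
# frozen_string_literal: true

# Hypothetical helpers for illustration; not the gem's levenshtein.rb.

# Classic dynamic-programming Levenshtein distance, keeping only two rows.
def edit_distance(a, b)
  return b.length if a.empty?
  return a.length if b.empty?

  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,       # deletion
        current[j - 1] + 1,    # insertion
        previous[j - 1] + cost # substitution (or match)
      ].min
    end
    previous = current
  end
  previous.last
end

# Assumed normalization: 100.0 means identical, 0.0 means nothing in common.
def similarity_score(a, b)
  longest = [a.length, b.length].max
  return 100.0 if longest.zero?

  ((1 - edit_distance(a, b).to_f / longest) * 100).round(2)
end

puts edit_distance("kitten", "sitting")    # 3
puts similarity_score("kitten", "sitting") # 57.14
```

Keeping the two results in separate methods means each return value has one meaning, which is the problem the changelog attributes to the old `normalize:` flag.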

data/lib/text-metrics.rb ADDED

@@ -0,0 +1,6 @@
+# frozen_string_literal: true
+# Bundler auto-requires the hyphenated gem name (`require "text-metrics"`) by default.
+# This shim points it at the real entry point so `gem "text-metrics"` works without an
+# explicit `require:` option.
+require "text_metrics"