text-metrics 1.0.0.beta1 → 1.0.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: d045a3d3bb921b33db65e5f0ae8cb2211541732760e5a059c17ab3bf36335b86
- data.tar.gz: 8ad9fb90b40800eeafe15e2bffbc48b69b226ac7f71f77429efbbec95d0be2ab
+ metadata.gz: 0aa8550e92948515979c3ddf605d9a591992f7fd855097c3a942bd4bdaa2e490
+ data.tar.gz: 540fd11f846a899d8a0266e53e303c879c5655fe18d674165b5a7a480332b245
  SHA512:
- metadata.gz: aff19ee3172ad857e3762964fbe13866fffa1f8a91402d138b8b50ce08324c537ae865bc02419b0177fc2d4af5d71cbfd16a4a44ca18fe843e1e90914df2cc39
- data.tar.gz: dd3cb65ccd6e5f27d4715e367e409f7995dcac7e2e4b77c63f43e074e121ca32b5e584dd2b3167585f501fafe6db296c0cf0202b778a42f058283fad24e0370c
+ metadata.gz: ca14e801a86eee04c3633b53a5eb77d403308fe6698a111664c6eac82409977d7e697019611ed1e21dcb43f4ec7cb275048d2f9fad6e279d9613a0a73b41f982
+ data.tar.gz: 47218fd86626d8656217ee13cdd8d67f2eb2950a88ee18aed23bd42d73e2c4d5a1f20ff44b2650f7fb308b0dd4184682bfb5891f35f5fe9680e53878713f4ff8
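The digests above can be checked locally. A `.gem` file is a tar archive that contains `metadata.gz`, `data.tar.gz` and `checksums.yaml.gz`, so after unpacking it the two members can be hashed and compared with the values listed in this hunk. The following is a minimal sketch using Ruby's standard `Digest` library; the helper name `file_digests` is hypothetical, and the paths are whatever you pass on the command line.

```ruby
require "digest"

# Hypothetical helper: returns the SHA256 and SHA512 hex digests of one file,
# keyed the same way as the sections of checksums.yaml.
def file_digests(path)
  {
    "SHA256" => Digest::SHA256.file(path).hexdigest,
    "SHA512" => Digest::SHA512.file(path).hexdigest
  }
end

# Usage (paths are assumptions): ruby digests.rb metadata.gz data.tar.gz
ARGV.each do |path|
  file_digests(path).each do |algorithm, hex|
    puts "#{algorithm} #{File.basename(path)}: #{hex}"
  end
end
```

If the printed values differ from the `+` lines above, the unpacked files are not the ones published as 1.0.0.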
data/CHANGELOG.md CHANGED
@@ -1,6 +1,13 @@
  # Change log
 
- ## master
+ ## main
+
+ ## 1.0.0
+ - First public release.
+ - See `UPGRADING.md` for migration details from the pre-release `0.x` API.
+
+ ## 1.0.0.beta2
+ - Add `gunning_fog_index` to the analyzer metrics and `#to_h` output.
 
  ## 1.0.0.beta1
 
data/README.md CHANGED
@@ -4,54 +4,57 @@
 
  # Text Metrics
 
- Text Metrics is a Ruby library for text analysis. It is inspired from Textstat library in Python and the Ruby port of [Textstat](https://github.com/kupolak/textstat).
+ Text Metrics is a Ruby library for analysing text. It was inspired by the Python Textstat library and the Ruby port of [Textstat](https://github.com/kupolak/textstat).
 
- In addition to basic metrics it also provides readability tests like Flesch Reading Ease, Flesch-Kincaid Grade Level, SMOG, Coleman-Liau Index and LIX. The main languages supported are English and French.
+ It gives you the everyday counts you need — words, characters, sentences and syllables — plus readability scores such as Flesch Reading Ease, Flesch-Kincaid Grade Level, SMOG, Gunning Fog, Coleman-Liau Index and LIX. English and French are supported.
+
+ It is battle-tested in production on millions of student writings at [Plume](https://plume-app.co).
 
  ## Accurate, dictionary-based syllable counting
 
- Readability scores (Flesch, Flesch-Kincaid, SMOG) are only as trustworthy as the syllable counts they are built on and the hyphenation heuristics most libraries rely on get a lot of words wrong. Text Metrics counts syllables from real pronunciation dictionaries instead:
+ Readability scores such as Flesch, Flesch-Kincaid and SMOG depend heavily on syllable counts. Many libraries estimate those counts with hyphenation rules, which are fast but often wrong. Text Metrics starts from real pronunciation data instead:
 
- - **English** uses the [CMU Pronouncing Dictionary](https://github.com/cmusphinx/cmudict): pronunciation-derived syllable counts for ~126,000 words, falling back to hyphenation only for out-of-vocabulary words. Benchmarked against CMUdict as ground truth, the common hyphenation-only approach returns the **wrong** syllable count for **~46% of dictionary words** — Text Metrics uses the reference count directly.
- - **French** uses counts derived from the [Lexique](http://www.lexique.org/) database, with a vowel heuristic patched by an exceptions list for the words it gets wrong.
+ - **English** uses the [CMU Pronouncing Dictionary](https://github.com/cmusphinx/cmudict), which provides pronunciation-derived syllable counts for about 126,000 words. Hyphenation is only used when a word is not in the dictionary. Against CMUdict as ground truth, a hyphenation-only approach gets the syllable count wrong for about **46% of dictionary words**.
+ - **French** uses counts derived from the [Lexique](http://www.lexique.org/) database, plus a vowel heuristic and an exceptions list for the words the heuristic misses.
 
- This makes syllable-dependent metrics meaningfully more accurate than hyphenation-only implementations, especially on longer and less common words.
+ That makes syllable-sensitive scores more reliable, especially for longer or less common words.
 
  ### Formula notes
 
- - Scores are computed from full-precision ratios and returned **unclamped** a Flesch Reading Ease score can legitimately exceed 100 (very easy) or go below 0 (very difficult).
- - **French Flesch Reading Ease** uses the Kandel-Moles (1958) adaptation: `207 − 1.015 × (words/sentences) − 73.6 × (syllables/words)`. Because its base is 207, very easy French text can score slightly above 100.
- - **Flesch-Kincaid Grade** maps to a US school grade and uses the same formula for all languages (there is no validated French adaptation).
+ - Scores are computed from full-precision ratios and returned **unclamped**. A Flesch Reading Ease score can legitimately be above 100 for very easy text, or below 0 for very difficult text.
+ - **French Flesch Reading Ease** uses the Kandel-Moles (1958) adaptation: `207 − 1.015 × (words/sentences) − 73.6 × (syllables/words)`. Because the formula starts from 207, very easy French text can score slightly above 100.
+ - **Flesch-Kincaid Grade** maps to a US school grade. It uses the same formula for every language because there is no validated French adaptation.
+ - **Gunning Fog** uses words with three or more syllables as complex words, matching the same syllable counts used by SMOG.
  - **Coleman-Liau** counts alphabetic letters only (not digits or punctuation), per its definition.
 
  ## Features
 
  _Basic metrics:_
 
- - [x] words count
- - [x] characters count
- - [x] sentences count
- - [x] syllables per word average
- - [x] letters per word average
- - [x] sentence length average
- - [x] sentence length (characters) average
+ - [x] word count
+ - [x] character count
+ - [x] sentence count
+ - [x] average syllables per word
+ - [x] average letters per word
+ - [x] average words per sentence
+ - [x] average characters per sentence
 
  _Readability tests:_
 
  - [x] Flesch Reading Ease
  - [x] Flesch-Kincaid Grade Level
  - [x] Smog Index
+ - [x] Gunning Fog Index
  - [x] Coleman-Liau Index
  - [x] Lix Index
- - [ ] Gunning Fog Index
 
  ## Installation
 
- No official release yet, but you can install it from GitHub:
+ Text Metrics is not published to RubyGems yet. For now, install it from GitHub:
 
  ```ruby
  # Gemfile
- gem "text-metrics", github: "plume-app/text-metrics", branch: "main"
+ gem "text-metrics", "~> 1.0.0"
  ```
 
  ### Supported Ruby versions
@@ -63,37 +66,38 @@ gem "text-metrics", github: "plume-app/text-metrics", branch: "main"
  ```ruby
  metrics = TextMetrics.new("This gem analyses all kinds of text.")
 
- # Get every metric at once as a Hash:
+ # Get every metric at once:
  metrics.to_h
  # {
- # words_count: 7, characters_count: 30, sentences_count: 1, syllables_count: 11,
- # punctuation_count: 1, syllables_per_word_average: 1.6, letters_per_word_average: 4.29,
+ # words_count: 7, characters_count: 30, sentences_count: 1, syllables_count: 10,
+ # punctuation_count: 1, syllables_per_word_average: 1.4, letters_per_word_average: 4.14,
  # words_per_sentence_average: 7.0, characters_per_sentence_average: 30.0,
  # words_per_punctuation_average: 7.0, punctuation_per_sentence_average: 1.0,
- # flesch_reading_ease: 64.37, flesch_kincaid_grade: 6.0, lix: 21.29,
- # smog_index: 0.0, coleman_liau_index: 5.2
+ # flesch_reading_ease: 78.87, flesch_kincaid_grade: 4.0, lix: 21.29,
+ # smog_index: 0.0, gunning_fog_index: 8.5, coleman_liau_index: 4.33
  # }
 
- # Or read each metric on its own:
+ # Or ask for a single metric:
  metrics.words_count # => 7
  metrics.characters_count # => 30
- metrics.flesch_reading_ease # => 64.37
- metrics.flesch_kincaid_grade # => 6.0
+ metrics.flesch_reading_ease # => 78.87
+ metrics.flesch_kincaid_grade # => 4.0
+ metrics.gunning_fog_index # => 8.5
  ```
 
  ### Languages
 
- The default language is American English (`:en_us`). Pass `language:` to analyse French:
+ American English (`:en_us`) is the default. To analyse French text, pass `language: :fr`:
 
  ```ruby
  TextMetrics.new("Bonjour le monde.", language: :fr)
  ```
 
- An unknown language raises `TextMetrics::Error`.
+ Unsupported languages raise `TextMetrics::Error`.
 
  ### Comparing two texts
 
- Levenshtein comparison is between two texts, so it lives on the module itself:
+ Levenshtein distance compares two strings, so it is exposed on the `TextMetrics` module:
 
  ```ruby
  TextMetrics.distance("kitten", "sitting") # => 3 raw edit distance
@@ -102,12 +106,12 @@ TextMetrics.similarity("kitten", "sitting") # => 57.14 0–100 score (100.0 ==
 
  ## Contributing
 
- Bug reports and pull requests are welcome on GitHub at [https://github.com/plume-app/text-metrics](https://github.com/plume-app/text-metrics).
+ Bug reports and pull requests are welcome on [GitHub](https://github.com/plume-app/text-metrics).
 
  ## Credits
 
- This gem was inspired by [Textstat](https://github.com/kupolak/textstat) and [Textstat](https://github.com/textstat/textstat) in Python.
- This gem is generated via [`newgem` template](https://github.com/palkan/newgem) by [@palkan](https://github.com/palkan).
+ This gem was inspired by [Textstat](https://github.com/textstat/textstat) in Python and the Ruby [Textstat](https://github.com/kupolak/textstat) port.
+ It was generated from the [`newgem` template](https://github.com/palkan/newgem) by [@palkan](https://github.com/palkan).
 
  English syllable counts come from the [CMU Pronouncing Dictionary](https://github.com/cmusphinx/cmudict) (unrestricted use), with [text-hyphen](https://github.com/halostatue/text-hyphen) as a fallback. French syllable counts are derived from [Lexique](http://www.lexique.org/).
 
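The formula notes in the README diff above can be checked by hand. The sketch below is an illustration only, not the gem's code: the module and method names are hypothetical, the French coefficients are the Kandel-Moles values quoted in the README, and the English coefficients (206.835, 1.015, 84.6) are the standard Flesch Reading Ease constants, which the README does not quote and which the gem is only assumed to use.

```ruby
# Hypothetical helper module; only the formulas come from the README or
# from the standard Flesch definition.
module ReadabilitySketch
  module_function

  # Standard English Flesch Reading Ease (coefficients are an assumption
  # about the gem, taken from the usual definition). Returned unclamped.
  def flesch_reading_ease_en(words:, sentences:, syllables:)
    206.835 - 1.015 * (words.to_f / sentences) - 84.6 * (syllables.to_f / words)
  end

  # Kandel-Moles (1958) French adaptation, as quoted in the README.
  # Returned unclamped, so very easy text can exceed 100.
  def flesch_reading_ease_fr(words:, sentences:, syllables:)
    207.0 - 1.015 * (words.to_f / sentences) - 73.6 * (syllables.to_f / words)
  end
end

# Counts from the README example: 7 words, 1 sentence, 10 syllables.
score = ReadabilitySketch.flesch_reading_ease_en(words: 7, sentences: 1, syllables: 10)
puts score.round(2) # => 78.87
```

With the new syllable count of 10, the English formula reproduces the `flesch_reading_ease: 78.87` value shown in the updated README example. A three-word sentence of one-syllable words scores above 100 under both formulas, which is the unclamped behaviour the notes describe.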
@@ -25,6 +25,7 @@ module TextMetrics
    flesch_kincaid_grade
    lix
    smog_index
+   gunning_fog_index
    coleman_liau_index
  ].freeze
 
@@ -120,6 +121,12 @@ module TextMetrics
    0.0
  end
 
+ def gunning_fog_index
+   return 0.0 if words_count.zero?
+
+   (0.4 * (average_words_per_sentence + 100.0 * count_polysyllabic_words / words_count)).round(1)
+ end
+
  def coleman_liau_index
    return 0.0 if words_count.zero?
 
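The new `gunning_fog_index` method above relies on helpers that this diff does not show (`average_words_per_sentence`, `count_polysyllabic_words`). As a minimal standalone sketch of the same arithmetic, the function below takes the raw counts as arguments. Its name and keyword parameters are hypothetical, and the extra guard against zero sentences is an addition of this sketch, not something the diff shows.

```ruby
# Standalone sketch of the Gunning Fog arithmetic used in the hunk above.
# complex_words is the number of words with three or more syllables.
def gunning_fog_sketch(words:, sentences:, complex_words:)
  # The zero-sentence guard is an assumption added for this sketch.
  return 0.0 if words.zero? || sentences.zero?

  average_sentence_length = words.to_f / sentences
  percent_complex = 100.0 * complex_words / words
  (0.4 * (average_sentence_length + percent_complex)).round(1)
end

# README example: 7 words, 1 sentence, 1 word of three or more syllables.
puts gunning_fog_sketch(words: 7, sentences: 1, complex_words: 1) # => 8.5
```

These counts reproduce the `gunning_fog_index: 8.5` value in the README example, assuming "analyses" is the only word counted as complex.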
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
 
  module TextMetrics # :nodoc:
-   VERSION = "1.0.0.beta1"
+   VERSION = "1.0.0"
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: text-metrics
  version: !ruby/object:Gem::Version
-   version: 1.0.0.beta1
+   version: 1.0.0
  platform: ruby
  authors:
  - Adrien POLY