text-metrics 1.0.0.beta1 → 1.0.0.beta2
- checksums.yaml +4 -4
- data/CHANGELOG.md +4 -1
- data/README.md +34 -32
- data/lib/text_metrics/processors/base.rb +7 -0
- data/lib/text_metrics/version.rb +1 -1
- metadata +1 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz:
- data.tar.gz:
+ metadata.gz: e71911a9dd9a27cc6dcaa9eb9525cf48d73d48c5fc49526d475cb2a1f6de98c6
+ data.tar.gz: c686dfd9cf3cdb730b27c0329b1a0ba13748dea1de1f5730cd4283cea9885de2
  SHA512:
- metadata.gz:
- data.tar.gz:
+ metadata.gz: 57de157f47ac0ab6cf79956509087371d9c9e12fbdbe6d80b7314110c1d8211e831561adb1c9476199e086bb3f6d33c55c1f3e7c39dcf144e04f0766c7e1773d
+ data.tar.gz: f7b5f77a6fba329093f3f55c38844ac8696f28c9d5bbfaf08750d464732bdf5a5484ebea57444a928ebe8a65e6fdec76a8ade187987df17cd419edc3c4f8555b
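The new digests above can be checked against a downloaded copy of the gem. A minimal sketch, assuming `metadata.gz` and `data.tar.gz` have already been extracted from the `.gem` archive and read into memory as binary strings; the helper name `digest_mismatches` is hypothetical and not part of RubyGems:

```ruby
require "digest"

# Hypothetical helper: returns the names of the files whose SHA256 digest
# differs from the expected hex string.
#   files:    { "metadata.gz" => "<binary contents>", ... }
#   expected: { "metadata.gz" => "<hex digest from checksums.yaml>", ... }
def digest_mismatches(files, expected)
  files.reject { |name, bytes| Digest::SHA256.hexdigest(bytes) == expected[name] }.keys
end
```

An empty result means every supplied file matches its recorded digest. The same shape works for the SHA512 entries by swapping in `Digest::SHA512`.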
data/CHANGELOG.md
CHANGED
data/README.md
CHANGED
@@ -4,50 +4,51 @@

  # Text Metrics

- Text Metrics is a Ruby library for text
+ Text Metrics is a Ruby library for analysing text. It was inspired by the Python Textstat library and the Ruby port of [Textstat](https://github.com/kupolak/textstat).

-
+ It gives you the everyday counts you need — words, characters, sentences and syllables — plus readability scores such as Flesch Reading Ease, Flesch-Kincaid Grade Level, SMOG, Gunning Fog, Coleman-Liau Index and LIX. English and French are supported.

  ## Accurate, dictionary-based syllable counting

- Readability scores
+ Readability scores such as Flesch, Flesch-Kincaid and SMOG depend heavily on syllable counts. Many libraries estimate those counts with hyphenation rules, which are fast but often wrong. Text Metrics starts from real pronunciation data instead:

- - **English** uses the [CMU Pronouncing Dictionary](https://github.com/cmusphinx/cmudict)
- - **French** uses counts derived from the [Lexique](http://www.lexique.org/) database,
+ - **English** uses the [CMU Pronouncing Dictionary](https://github.com/cmusphinx/cmudict), which provides pronunciation-derived syllable counts for about 126,000 words. Hyphenation is only used when a word is not in the dictionary. Against CMUdict as ground truth, a hyphenation-only approach gets the syllable count wrong for about **46% of dictionary words**.
+ - **French** uses counts derived from the [Lexique](http://www.lexique.org/) database, plus a vowel heuristic and an exceptions list for the words the heuristic misses.

-
+ That makes syllable-sensitive scores more reliable, especially for longer or less common words.

  ### Formula notes

- - Scores are computed from full-precision ratios and returned **unclamped
- - **French Flesch Reading Ease** uses the Kandel-Moles (1958) adaptation: `207 − 1.015 × (words/sentences) − 73.6 × (syllables/words)`. Because
- - **Flesch-Kincaid Grade** maps to a US school grade
+ - Scores are computed from full-precision ratios and returned **unclamped**. A Flesch Reading Ease score can legitimately be above 100 for very easy text, or below 0 for very difficult text.
+ - **French Flesch Reading Ease** uses the Kandel-Moles (1958) adaptation: `207 − 1.015 × (words/sentences) − 73.6 × (syllables/words)`. Because the formula starts from 207, very easy French text can score slightly above 100.
+ - **Flesch-Kincaid Grade** maps to a US school grade. It uses the same formula for every language because there is no validated French adaptation.
+ - **Gunning Fog** uses words with three or more syllables as complex words, matching the same syllable counts used by SMOG.
  - **Coleman-Liau** counts alphabetic letters only (not digits or punctuation), per its definition.

  ## Features

  _Basic metrics:_

- - [x]
- - [x]
- - [x]
- - [x] syllables per word
- - [x] letters per word
- - [x]
- - [x]
+ - [x] word count
+ - [x] character count
+ - [x] sentence count
+ - [x] average syllables per word
+ - [x] average letters per word
+ - [x] average words per sentence
+ - [x] average characters per sentence

  _Readability tests:_

  - [x] Flesch Reading Ease
  - [x] Flesch-Kincaid Grade Level
  - [x] Smog Index
+ - [x] Gunning Fog Index
  - [x] Coleman-Liau Index
  - [x] Lix Index
- - [ ] Gunning Fog Index

  ## Installation

-
+ Text Metrics is not published to RubyGems yet. For now, install it from GitHub:

  ```ruby
  # Gemfile
@@ -63,37 +64,38 @@ gem "text-metrics", github: "plume-app/text-metrics", branch: "main"
  ```ruby
  metrics = TextMetrics.new("This gem analyses all kinds of text.")

- # Get every metric at once
+ # Get every metric at once:
  metrics.to_h
  # {
- # words_count: 7, characters_count: 30, sentences_count: 1, syllables_count:
- # punctuation_count: 1, syllables_per_word_average: 1.
+ # words_count: 7, characters_count: 30, sentences_count: 1, syllables_count: 10,
+ # punctuation_count: 1, syllables_per_word_average: 1.4, letters_per_word_average: 4.14,
  # words_per_sentence_average: 7.0, characters_per_sentence_average: 30.0,
  # words_per_punctuation_average: 7.0, punctuation_per_sentence_average: 1.0,
- # flesch_reading_ease:
- # smog_index: 0.0, coleman_liau_index:
+ # flesch_reading_ease: 78.87, flesch_kincaid_grade: 4.0, lix: 21.29,
+ # smog_index: 0.0, gunning_fog_index: 8.5, coleman_liau_index: 4.33
  # }

- # Or
+ # Or ask for a single metric:
  metrics.words_count # => 7
  metrics.characters_count # => 30
- metrics.flesch_reading_ease # =>
- metrics.flesch_kincaid_grade # =>
+ metrics.flesch_reading_ease # => 78.87
+ metrics.flesch_kincaid_grade # => 4.0
+ metrics.gunning_fog_index # => 8.5
  ```

  ### Languages

-
+ American English (`:en_us`) is the default. To analyse French text, pass `language: :fr`:

  ```ruby
  TextMetrics.new("Bonjour le monde.", language: :fr)
  ```

-
+ Unsupported languages raise `TextMetrics::Error`.

  ### Comparing two texts

- Levenshtein
+ Levenshtein distance compares two strings, so it is exposed on the `TextMetrics` module:

  ```ruby
  TextMetrics.distance("kitten", "sitting") # => 3 raw edit distance
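The two module-level comparison calls in this part of the README can be illustrated with a standalone sketch. This is not the gem's implementation: `edit_distance` and `similarity_score` are hypothetical names, and the similarity formula (one minus distance over the longer length, times 100) is an assumption, chosen because it reproduces the 57.14 figure the README shows for this pair.

```ruby
# Classic dynamic-programming Levenshtein distance, keeping one row at a time.
def edit_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    curr = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      # deletion, insertion, substitution
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Assumed normalisation: 100.0 means the strings are identical.
def similarity_score(a, b)
  longest = [a.length, b.length].max
  return 100.0 if longest.zero?

  ((1.0 - edit_distance(a, b).to_f / longest) * 100).round(2)
end

edit_distance("kitten", "sitting")    # => 3
similarity_score("kitten", "sitting") # => 57.14
```

Keeping only the previous row holds memory at the length of the second string rather than the full matrix.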
@@ -102,12 +104,12 @@ TextMetrics.similarity("kitten", "sitting") # => 57.14 0–100 score (100.0 ==

  ## Contributing

- Bug reports and pull requests are welcome on GitHub
+ Bug reports and pull requests are welcome on [GitHub](https://github.com/plume-app/text-metrics).

  ## Credits

- This gem was inspired by [Textstat](https://github.com/
-
+ This gem was inspired by [Textstat](https://github.com/textstat/textstat) in Python and the Ruby [Textstat](https://github.com/kupolak/textstat) port.
+ It was generated from the [`newgem` template](https://github.com/palkan/newgem) by [@palkan](https://github.com/palkan).

  English syllable counts come from the [CMU Pronouncing Dictionary](https://github.com/cmusphinx/cmudict) (unrestricted use), with [text-hyphen](https://github.com/halostatue/text-hyphen) as a fallback. French syllable counts are derived from [Lexique](http://www.lexique.org/).

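The README's formula notes on Flesch Reading Ease can be checked by hand. The sketch below is not the gem's code, and the constant and method names are hypothetical. It assumes the standard English coefficients (206.835, 1.015, 84.6) alongside the Kandel-Moles coefficients quoted in the README, and returns the score unclamped, as the notes describe.

```ruby
# [intercept, words-per-sentence weight, syllables-per-word weight]
FLESCH_COEFFICIENTS = {
  en_us: [206.835, 1.015, 84.6],
  fr: [207.0, 1.015, 73.6] # Kandel-Moles adaptation, as quoted in the README
}.freeze

def flesch_reading_ease(words:, sentences:, syllables:, language: :en_us)
  return 0.0 if words.zero? || sentences.zero?

  intercept, sentence_weight, syllable_weight = FLESCH_COEFFICIENTS.fetch(language)
  score = intercept -
    sentence_weight * (words.to_f / sentences) -
    syllable_weight * (syllables.to_f / words)
  score.round(2) # no clamping to 0..100
end

# Counts from the README sample: 7 words, 1 sentence, 10 syllables.
flesch_reading_ease(words: 7, sentences: 1, syllables: 10) # => 78.87
```

With these counts the English result agrees with the `flesch_reading_ease: 78.87` value in the README's `to_h` output.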
data/lib/text_metrics/processors/base.rb
CHANGED
@@ -25,6 +25,7 @@ module TextMetrics
    flesch_kincaid_grade
    lix
    smog_index
+   gunning_fog_index
    coleman_liau_index
  ].freeze

@@ -120,6 +121,12 @@ module TextMetrics
    0.0
  end

+ def gunning_fog_index
+   return 0.0 if words_count.zero?
+
+   (0.4 * (average_words_per_sentence + 100.0 * count_polysyllabic_words / words_count)).round(1)
+ end
+
  def coleman_liau_index
    return 0.0 if words_count.zero?

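The added method follows the usual Gunning Fog formula, 0.4 × (words per sentence + 100 × complex words / words). A standalone sketch of the same arithmetic, assuming the counts are already known: in the gem they come from `average_words_per_sentence` and `count_polysyllabic_words`, whose definitions are not shown in this diff, and `gunning_fog` is a hypothetical name.

```ruby
def gunning_fog(words:, sentences:, polysyllabic_words:)
  return 0.0 if words.zero? || sentences.zero?

  words_per_sentence = words.to_f / sentences
  complex_percentage = 100.0 * polysyllabic_words / words
  (0.4 * (words_per_sentence + complex_percentage)).round(1)
end

# README sample: 7 words in 1 sentence, assuming "analyses" is the only
# word with three or more syllables.
gunning_fog(words: 7, sentences: 1, polysyllabic_words: 1) # => 8.5
```

Under that assumption the result agrees with the `gunning_fog_index: 8.5` value in the README's `to_h` output.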
data/lib/text_metrics/version.rb
CHANGED