text-metrics 0.0.1 → 1.0.0.beta1
- checksums.yaml +4 -4
- data/CHANGELOG.md +34 -0
- data/README.md +55 -19
- data/UPGRADING.md +73 -0
- data/lib/text-metrics.rb +6 -0
- data/lib/text_metrics/dictionaries/english_word_syllable_database.txt +126052 -0
- data/lib/text_metrics/levenshtein.rb +46 -0
- data/lib/text_metrics/processors/american_english.rb +38 -10
- data/lib/text_metrics/processors/base.rb +110 -126
- data/lib/text_metrics/processors/french.rb +32 -14
- data/lib/text_metrics/version.rb +1 -1
- data/lib/text_metrics.rb +28 -25
- metadata +12 -14
- data/lib/text_metrics/dictionnaries/en_us.txt +0 -2945
- data/lib/text_metrics/dictionnaries/fr.txt +0 -1462
- data/lib/text_metrics/dictionnaries/french_word_syllable_database.yml +0 -125345
- data/lib/text_metrics/dictionnaries/lexique-383.csv +0 -142695
- /data/lib/text_metrics/{dictionnaries → dictionaries}/french_word_syllable_exceptions.yml +0 -0
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-metadata.gz:
-data.tar.gz:
+metadata.gz: d045a3d3bb921b33db65e5f0ae8cb2211541732760e5a059c17ab3bf36335b86
+data.tar.gz: 8ad9fb90b40800eeafe15e2bffbc48b69b226ac7f71f77429efbbec95d0be2ab
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: aff19ee3172ad857e3762964fbe13866fffa1f8a91402d138b8b50ce08324c537ae865bc02419b0177fc2d4af5d71cbfd16a4a44ca18fe843e1e90914df2cc39
+data.tar.gz: dd3cb65ccd6e5f27d4715e367e409f7995dcac7e2e4b77c63f43e074e121ca32b5e584dd2b3167585f501fafe6db296c0cf0202b778a42f058283fad24e0370c
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,37 @@
 # Change log
 
 ## master
+
+## 1.0.0.beta1
+
+First public (beta) release. The API was reworked for stability before tagging 1.0 — see
+[`UPGRADING.md`](UPGRADING.md) for migration details. **Breaking changes:**
+
+- `TextMetrics.new` now takes the text as a positional argument and returns the
+language-specific analyzer directly: `TextMetrics.new("text", language: :en_us)`.
+The `TextMetrics::TextMetrics` wrapper and its `#text_metrics_processor` accessor are gone.
+- `language` is now a symbol (`:en_us`, `:fr`); strings are still accepted. An unknown
+language raises `TextMetrics::Error` instead of a `NoMethodError`.
+- `#all` was renamed to `#to_h`, and it is now the single source of truth for the metric
+set — every individual reader and `#to_h` are derived from the same list, so they can't drift.
+- Levenshtein moved off the analyzer to the module: `TextMetrics.distance(a, b)` (raw integer)
+and `TextMetrics.similarity(a, b)` (0–100 score). This replaces
+`levenshtein_distance_from(other, normalize:)`, whose return value changed meaning with the flag.
+- Punctuation metrics renamed to the singular: `punctuation_count`,
+`words_per_punctuation_average`, `punctuation_per_sentence_average`.
+- `poly_syllabes_count` and the tokenizers are now private implementation details.
+- Analyzed text is whitespace-normalized once and exposed via `#text`.
+- English syllable counting now uses the CMU Pronouncing Dictionary as the source of truth
+(falling back to `text-hyphen` only for out-of-vocabulary words), which is more accurate than
+the previous hyphenation-only approach. English syllable counts and the readability scores
+derived from them may shift slightly versus `0.x`.
+- Readability scores are now computed from full-precision ratios (rounding only the final
+result) instead of from the display-rounded averages — this removes errors of several points
+that the intermediate rounding could introduce.
+- Readability scores are no longer clamped: a Flesch score can now exceed 100 or go negative,
+as the formulas define. (Previously clamped to 0–100 / 0–18 / 0–20.)
+- French Flesch Reading Ease now uses the correct Kandel-Moles base constant `207` (was `206.835`).
+- `flesch_kincaid_grade` now uses the standard US-grade formula for every language; the previous
+unsourced French coefficients are gone (Flesch-Kincaid Grade has no validated French adaptation).
+- `letters_per_word_average` and the Coleman-Liau index now count alphabetic letters only, not
+digits and punctuation, per the Coleman-Liau definition.
data/README.md
CHANGED
@@ -6,9 +6,23 @@
 
 Text Metrics is a Ruby library for text analysis. It is inspired from Textstat library in Python and the Ruby port of [Textstat](https://github.com/kupolak/textstat).
 
-In addition to basic metrics it also provides readability tests like Flesch Reading Ease, Flesch-Kincaid Grade Level,
+In addition to basic metrics it also provides readability tests like Flesch Reading Ease, Flesch-Kincaid Grade Level, SMOG, Coleman-Liau Index and LIX. The main languages supported are English and French.
 
-
+## Accurate, dictionary-based syllable counting
+
+Readability scores (Flesch, Flesch-Kincaid, SMOG) are only as trustworthy as the syllable counts they are built on — and the hyphenation heuristics most libraries rely on get a lot of words wrong. Text Metrics counts syllables from real pronunciation dictionaries instead:
+
+- **English** uses the [CMU Pronouncing Dictionary](https://github.com/cmusphinx/cmudict): pronunciation-derived syllable counts for ~126,000 words, falling back to hyphenation only for out-of-vocabulary words. Benchmarked against CMUdict as ground truth, the common hyphenation-only approach returns the **wrong** syllable count for **~46% of dictionary words** — Text Metrics uses the reference count directly.
+- **French** uses counts derived from the [Lexique](http://www.lexique.org/) database, with a vowel heuristic patched by an exceptions list for the words it gets wrong.
+
+This makes syllable-dependent metrics meaningfully more accurate than hyphenation-only implementations, especially on longer and less common words.
+
+### Formula notes
+
+- Scores are computed from full-precision ratios and returned **unclamped** — a Flesch Reading Ease score can legitimately exceed 100 (very easy) or go below 0 (very difficult).
+- **French Flesch Reading Ease** uses the Kandel-Moles (1958) adaptation: `207 − 1.015 × (words/sentences) − 73.6 × (syllables/words)`. Because its base is 207, very easy French text can score slightly above 100.
+- **Flesch-Kincaid Grade** maps to a US school grade and uses the same formula for all languages (there is no validated French adaptation).
+- **Coleman-Liau** counts alphabetic letters only (not digits or punctuation), per its definition.
 
 ## Features
 
@@ -47,23 +61,43 @@ gem "text-metrics", github: "plume-app/text-metrics", branch: "main"
 ## Usage
 
 ```ruby
-
-
-
-
-# {
-
-#
-
-
-
-
-
-
-
-
-
-
+metrics = TextMetrics.new("This gem analyses all kinds of text.")
+
+# Get every metric at once as a Hash:
+metrics.to_h
+# {
+# words_count: 7, characters_count: 30, sentences_count: 1, syllables_count: 11,
+# punctuation_count: 1, syllables_per_word_average: 1.6, letters_per_word_average: 4.29,
+# words_per_sentence_average: 7.0, characters_per_sentence_average: 30.0,
+# words_per_punctuation_average: 7.0, punctuation_per_sentence_average: 1.0,
+# flesch_reading_ease: 64.37, flesch_kincaid_grade: 6.0, lix: 21.29,
+# smog_index: 0.0, coleman_liau_index: 5.2
+# }
+
+# Or read each metric on its own:
+metrics.words_count # => 7
+metrics.characters_count # => 30
+metrics.flesch_reading_ease # => 64.37
+metrics.flesch_kincaid_grade # => 6.0
+```
+
+### Languages
+
+The default language is American English (`:en_us`). Pass `language:` to analyse French:
+
+```ruby
+TextMetrics.new("Bonjour le monde.", language: :fr)
+```
+
+An unknown language raises `TextMetrics::Error`.
+
+### Comparing two texts
+
+Levenshtein comparison is between two texts, so it lives on the module itself:
+
+```ruby
+TextMetrics.distance("kitten", "sitting") # => 3 raw edit distance
+TextMetrics.similarity("kitten", "sitting") # => 57.14 0–100 score (100.0 == identical)
 ```
 
 ## Contributing
@@ -75,6 +109,8 @@ Bug reports and pull requests are welcome on GitHub at [https://github.com/plume
 This gem was inspired by [Textstat](https://github.com/kupolak/textstat) and [Textstat](https://github.com/textstat/textstat) in Python.
 This gem is generated via [`newgem` template](https://github.com/palkan/newgem) by [@palkan](https://github.com/palkan).
 
+English syllable counts come from the [CMU Pronouncing Dictionary](https://github.com/cmusphinx/cmudict) (unrestricted use), with [text-hyphen](https://github.com/halostatue/text-hyphen) as a fallback. French syllable counts are derived from [Lexique](http://www.lexique.org/).
+
 ## License
 
 The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
data/UPGRADING.md
ADDED
@@ -0,0 +1,73 @@
+# Upgrading
+
+## To 1.0 (from the pre-release `0.x` internal API)
+
+1.0 is the first public release. The internal `0.x` API was reworked once, deliberately, so
+the public surface can stay stable from here on. Everything below is a one-time migration.
+
+### Building an analyzer
+
+`text` is now positional, and `TextMetrics.new` returns the analyzer itself — there is no
+longer a wrapper object with a `text_metrics_processor` accessor.
+
+```ruby
+# Before
+analyser = TextMetrics.new(text: "Some text", language: "fr")
+analyser.text_metrics_processor.words_count
+
+# After
+metrics = TextMetrics.new("Some text", language: :fr)
+metrics.words_count
+```
+
+### Languages are symbols, and unknown ones raise
+
+```ruby
+# Before — unknown language blew up with NoMethodError on nil
+TextMetrics.new(text: "x", language: "es")
+
+# After — symbols (strings still accepted), unknown languages raise TextMetrics::Error
+TextMetrics.new("x", language: :es) # => raises TextMetrics::Error
+```
+
+### `#all` is now `#to_h`
+
+```ruby
+# Before
+metrics.all
+
+# After
+metrics.to_h
+```
+
+`#to_h` and the individual metric readers are now generated from a single list
+(`TextMetrics::Processors::Base::METRICS`), so they can never report different sets again.
+
+### Levenshtein moved to the module
+
+The single method with a `normalize:` flag (which returned a raw distance or a 0–100 score
+depending on the flag) is replaced by two intention-revealing module methods:
+
+```ruby
+# Before
+metrics.levenshtein_distance_from("other", normalize: false) # raw distance
+metrics.levenshtein_distance_from("other") # 0–100 score
+
+# After
+TextMetrics.distance("text", "other") # => Integer, raw edit distance
+TextMetrics.similarity("text", "other") # => Float, 0–100 score (100.0 == identical)
+```
+
+### Renamed metrics
+
+The punctuation metrics dropped the plural; some helpers are now private:
+
+| Before | After |
+| ---------------------------------- | -------------------------------- |
+| `punctuations_count` | `punctuation_count` |
+| `words_per_punctuations_average` | `words_per_punctuation_average` |
+| `punctuations_per_sentence_average`| `punctuation_per_sentence_average` |
+| `poly_syllabes_count` (public) | private implementation detail |
+
+The word/sentence/punctuation tokenizers are now private as well — use the count and average
+metrics or `#to_h`.
data/lib/text-metrics.rb
ADDED