tinycus 1.0.0 → 1.0.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/LICENSE +1 -0
- data/README.md +90 -0
- metadata +12 -6
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 321cf855e7c31e2a3d7cae41f0554543dbe27eb2c1d2a59cc2070c6c77645a73
+  data.tar.gz: cb663b8bbf7617886c4866b92ad9a76c6ef676bf885da265ce305a37dd1da20a
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: '028fb9942a10556132e68af6f12aef6799e8aa531ae88150a045161f541ec800b37b48f82a4501f4d9aefd7dd6fd5b5022633ec03e02c70d92083cdac84ed677'
+  data.tar.gz: f964fb644a236ba1cc54236688fb72a211a0fbfa9d5aae127d8352f1040df24eaeed587cf19ebb281fdc4daf6d13c6af208db85bafc25bf28cdeae938ae3cc7d
data/LICENSE
ADDED
@@ -0,0 +1 @@
+GPL v3, (c) 2023 Benjamin Crowell
data/README.md
ADDED
@@ -0,0 +1,90 @@
+Tinycus
+=======
+
+This is a ruby library to do some string functions efficiently that
+would otherwise be slow or require a huge footprint. For example,
+it can remove accents from strings, or alphabetize strings in polytonic
+Greek.
+
+The current implementation is about 2-3 times faster for these tasks
+than what I initially came up with naively. The footprint is about 1000
+times smaller than that of the ICU library (30 Mb), which also doesn't
+have bindings for ruby. The name Tinycus is meant to evoke "tiny
+ICU." Tinycus supports polytonic Greek, which GNU libc doesn't.
+
+If you're using Tinycus and have comments or suggestions, please
+[contact me](http://lightandmatter.com/area4author.html).
+
+Installation
+--------
+
+### On linux, using make
+
+    sudo make install
+    make test
+
+### Using rubygems
+
+    gem install tinycus
+
+Use
+---
+
+Examples:
+
+    require './tinycus.rb';
+    puts Tinycus::Tr.remove_accents_from_euro('ἄγε, vámonos',n:true)
+    --> αγε, vamonos
+    puts Tinycus.sort_greek("Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος οὐλομένην".split(/\s+/)).join(' ')
+    --> ἄειδε, Ἀχιλῆος θεά, Μῆνιν οὐλομένην Πηληϊάδεω
+
+All input strings are expected to be utf-8 normalized to NFC form, and
+all returned values are also in this encoding. Many functions have an
+optional argument n which defaults to false. If you set n:true, as in
+the first example above, then your inputs will be normalized to NFC
+for you. This is safer but slower. Since the whole point of the
+library is speed, the library is set up to make it convenient for you
+if you simply massage all strings into the required form at the time
+when they're created or read in, then do all your manipulations. If
+your inputs to Tinycus are not NFC normalized, and you don't do
+n:true, then the results will be incorrect. If your inputs are in some
+other encoding such as ISO-8859-1, then the library may either give
+incorrect results or raise an exception.
+
+When there is an error in a constructor, the object that is created has
+a .err property that is an error message. If there is no error, then
+the .err is set to nil.
+
+Real-world sources of polytonic Greek text are usually incredibly messy,
+containing all kinds of weird crap that someone typed on a keyboard
+and looks OK by eye, but is actually wrong and not suitable for
+machine processing. This type of stuff is legal unicode, but it's
+the wrong way of representing the word. For instance, I've seen the vowel
+that's supposed to look like ά might be written with two accents on the same
+character: both an accented alpha unicode
+character and, superimposed on that, a combining accent. It looks OK on the screen
+because the two marks are on top of each other. I've collected a large number of these
+awfulnesses in the wild. The function Cleanup.clean_up_grotty_greek()
+is meant to correct them all. It's slow. It has a bunch of options.
+There are also various more fine-tuned or specialized functions, such as
+Cleanup.standardize_greek_punctuation().
+
+Beta code conversion
+--------------------
+Beta code is an obsolete way of encoding Greek characters: https://en.wikipedia.org/wiki/Beta_Code .
+Tinycus can handle conversion of a subset of beta code using the functions Tinycus.greek_unicode_to_beta_code
+and Tinycus.greek_beta_code_to_unicode. There are other libraries that can do this, such as the
+ruby library https://github.com/perseids-tools/beta-code-rb as well as standalone
+software such as the debian package unibetacode. I only rolled my own implementation because
+it seemed pretty easy to do, and I wanted to be sure that it would generate utf-8 encoded
+according to modern standards and in a way that would be compatible with the rest of Tinycus.
+
+There is also a function Cleanup.clean_up_greek_beta_code that is meant to clean up stray
+beta code in documents that were supposed to have been converted to unicode but still have
+some beta code lingering in them.
+
+Performance
+-----
+
+See comments at the top of scripts/benchmark.rb for some notes on algorithms
+I tried and their performance.
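Reviewer note on the README added above: it asks callers to hand Tinycus UTF-8 strings already normalized to NFC, and it describes input where an accented alpha carries a second, superimposed combining accent. The sketch below illustrates both points with Ruby core String methods only. It does not call Tinycus and is not how Cleanup.clean_up_grotty_greek() works. The helper names to_nfc_utf8 and collapse_doubled_marks are hypothetical, and the doubled-mark rule is a naive assumption that would also collapse any legitimately repeated mark.

```ruby
# Sketch only: plain Ruby, no Tinycus calls. Helper names are hypothetical.

# Convert a string to UTF-8 and normalize it to NFC once, at the time it is
# created or read in. Pass source_encoding when the bytes are not UTF-8.
def to_nfc_utf8(str, source_encoding: nil)
  str = str.dup.force_encoding(source_encoding) if source_encoding
  str.encode('UTF-8').unicode_normalize(:nfc)
end

# Naive assumption: an identical combining mark repeated on one letter is a
# typing accident. Decompose, drop the repeats, then recompose to NFC.
def collapse_doubled_marks(str)
  decomposed = str.unicode_normalize(:nfd)
  decomposed.gsub(/(\p{Mn})\1+/, '\1').unicode_normalize(:nfc)
end

composed_alpha   = "\u03AC"        # single code point: alpha with tonos
decomposed_alpha = "\u03B1\u0301"  # alpha followed by a combining acute
doubled_alpha    = "\u03AC\u0301"  # accented alpha plus a second acute

# Decomposed input becomes the single precomposed code point under NFC.
puts to_nfc_utf8(decomposed_alpha) == composed_alpha

# ISO-8859-1 bytes are transcoded to UTF-8 before normalization.
latin1_bytes = "v\xE1monos".b
puts to_nfc_utf8(latin1_bytes, source_encoding: 'ISO-8859-1') == "v\u00E1monos"

# NFC alone leaves the doubled accent in place; the extra step removes it.
puts doubled_alpha.unicode_normalize(:nfc) == doubled_alpha
puts collapse_doubled_marks(doubled_alpha) == composed_alpha
```

Normalizing once at the input boundary matches the README's advice to massage strings into the required form when they are read in, so that later calls can leave n at its default of false.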
metadata
CHANGED
@@ -1,29 +1,35 @@
 --- !ruby/object:Gem::Specification
 name: tinycus
 version: !ruby/object:Gem::Version
-  version: 1.0.
+  version: 1.0.2
 platform: ruby
 authors:
 - Benjamin Crowell
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2024-01-
+date: 2024-01-15 00:00:00.000000000 Z
 dependencies: []
-description:
-
-
+description: "This is a ruby library to do some string functions efficiently that\nwould
+  otherwise be slow or require a huge footprint. For example,\nit can remove accents
+  from strings, or alphabetize strings in polytonic Greek. \n"
 email:
 executables: []
 extensions: []
-extra_rdoc_files:
+extra_rdoc_files:
+- README.md
 files:
+- LICENSE
+- README.md
 - tinycus.rb
 homepage: https://bitbucket.org/ben-crowell/tinycus
 licenses:
 - GPL-3.0-only
 metadata:
   contact_uri: http://lightandmatter.com/area4author.html
+  documentation_uri: https://bitbucket.org/ben-crowell/tinycus
+  homepage_uri: https://bitbucket.org/ben-crowell/tinycus
+  source_code_uri: https://bitbucket.org/ben-crowell/tinycus
 post_install_message:
 rdoc_options: []
 require_paths: