tinycus 1.0.0 → 1.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (4)
  1. checksums.yaml +4 -4
  2. data/LICENSE +1 -0
  3. data/README.md +90 -0
  4. metadata +12 -6
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: ff64b058ca64944f6b13a0fad6ed92d49b08da8611ca454ae37100a3d94a753f
- data.tar.gz: 10c0ef5a67e9324f32c41668a8eb3104e6508c4bbf0bbef2471c5c1f6631fffe
+ metadata.gz: 321cf855e7c31e2a3d7cae41f0554543dbe27eb2c1d2a59cc2070c6c77645a73
+ data.tar.gz: cb663b8bbf7617886c4866b92ad9a76c6ef676bf885da265ce305a37dd1da20a
  SHA512:
- metadata.gz: a9348bc09c79487faf8406077e6ec981ef9bd94a709c49f5383ec8ad591235cd1dd9bdc370143d58eee2bb9b76bcf55dec1dc188e0340772fe59961235e58158
- data.tar.gz: 6a09f32fd84c9fd866ab13871722efd4cd02814e614c9022b3f0296ea5821e2493fd955b317afc1af8f2c93a2741eac707cf79f6699ffab212008db9aebda506
+ metadata.gz: '028fb9942a10556132e68af6f12aef6799e8aa531ae88150a045161f541ec800b37b48f82a4501f4d9aefd7dd6fd5b5022633ec03e02c70d92083cdac84ed677'
+ data.tar.gz: f964fb644a236ba1cc54236688fb72a211a0fbfa9d5aae127d8352f1040df24eaeed587cf19ebb281fdc4daf6d13c6af208db85bafc25bf28cdeae938ae3cc7d
data/LICENSE ADDED
@@ -0,0 +1 @@
+ GPL v3, (c) 2023 Benjamin Crowell
data/README.md ADDED
@@ -0,0 +1,90 @@
+ Tinycus
+ =======
+
+ This is a ruby library to do some string functions efficiently that
+ would otherwise be slow or require a huge footprint. For example,
+ it can remove accents from strings, or alphabetize strings in polytonic
+ Greek.
+
+ The current implementation is about 2-3 times faster for these tasks
+ than what I initially came up with naively. The footprint is about 1000
+ times smaller than that of the ICU library (30 Mb), which also doesn't
+ have bindings for ruby. The name Tinycus is meant to evoke "tiny
+ ICU." Tinycus supports polytonic Greek, which GNU libc doesn't.
+
+ If you're using Tinycus and have comments or suggestions, please
+ [contact me](http://lightandmatter.com/area4author.html).
+
+ Installation
+ --------
+
+ ### On linux, using make
+
+ sudo make install
+ make test
+
+ ### Using rubygems
+
+ gem install tinycus
+
+ Use
+ ---
+
+ Examples:
+
+ require './tinycus.rb';
+ puts Tinycus::Tr.remove_accents_from_euro('ἄγε, vámonos',n:true)
+ --> αγε, vamonos
+ puts Tinycus.sort_greek("Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος οὐλομένην".split(/\s+/)).join(' ')
+ --> ἄειδε, Ἀχιλῆος θεά, Μῆνιν οὐλομένην Πηληϊάδεω
+
+ All input strings are expected to be utf-8 normalized to NFC form, and
+ all returned values are also in this encoding. Many functions have an
+ optional argument n which defaults to false. If you set n:true, as in
+ the first example above, then your inputs will be normalized to NFC
+ for you. This is safer but slower. Since the whole point of the
+ library is speed, the library is set up to make it convenient for you
+ if you simply massage all strings into the required form at the time
+ when they're created or read in, then do all your manipulations. If
+ your inputs to Tinycus are not NFC normalized, and you don't do
+ n:true, then the results will be incorrect. If your inputs are in some
+ other encoding such as ISO-8859-1, then the library may either give
+ incorrect results or raise an exception.
+
+ When there is an error in a constructor, the object that is created has
+ a .err property that is an error message. If there is no error, then
+ the .err is set to nil.
+
+ Real-world sources of polytonic Greek text are usually incredibly messy,
+ containing all kinds of weird crap that someone typed on a keyboard
+ and looks OK by eye, but is actually wrong and not suitable for
+ machine processing. This type of stuff is legal unicode, but it's
+ the wrong way of representing the word. For instance, I've seen the vowel
+ that's supposed to look like ά might be written with two accents on the same
+ character: both an accented alpha unicode
+ character and, superimposed on that, a combining accent. It looks OK on the screen
+ because the two marks are on top of each other. I've collected a large number of these
+ awfulnesses in the wild. The function Cleanup.clean_up_grotty_greek()
+ is meant to correct them all. It's slow. It has a bunch of options.
+ There are also various more fine-tuned or specialized functions, such as
+ Cleanup.standardize_greek_punctuation().
+
+ Beta code conversion
+ --------------------
+ Beta code is an obsolete way of encoding Greek characters: https://en.wikipedia.org/wiki/Beta_Code .
+ Tinycus can handle conversion of a subset of beta code using the functions Tinycus.greek_unicode_to_beta_code
+ and Tinycus.greek_beta_code_to_unicode. There are other libraries that can do this, such as the
+ ruby library https://github.com/perseids-tools/beta-code-rb as well as standalone
+ software such as the debian package unibetacode. I only rolled my own implementation because
+ it seemed pretty easy to do, and I wanted to be sure that it would generate utf-8 encoded
+ according to modern standards and in a way that would be compatible with the rest of Tinycus.
+
+ There is also a function Cleanup.clean_up_greek_beta_code that is meant to clean up stray
+ beta code in documents that were supposed to have been converted to unicode but still have
+ some beta code lingering in them.
+
+ Performance
+ -----
+
+ See comments at the top of scripts/benchmark.rb for some notes on algorithms
+ I tried and their performance.
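Note (not part of the diff): the NFC requirement and the "two accents on the same character" problem described in the added README can be illustrated with Ruby's standard `String#unicode_normalize` alone. The sketch below does not use Tinycus and does not claim to reproduce how Tinycus works internally; `naive_remove_accents` is a hypothetical helper written only for this illustration, and it is presumably slower than the library's own approach.

```ruby
# Standard-library sketch only; none of this is Tinycus code.

# Hypothetical helper: decompose to NFD, delete combining marks
# (Unicode general category Mn), then recompose to NFC.
def naive_remove_accents(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, '').unicode_normalize(:nfc)
end

# Alpha followed by a combining acute accent: two code points that
# NFC normalization merges into a single precomposed character.
decomposed = "\u03B1\u0301"
composed   = decomposed.unicode_normalize(:nfc)
puts composed.length   # 1

# The messy case from the README: an already-accented alpha with a
# second, superimposed combining acute. NFC cannot merge the extra
# accent, so the string stays two code points long and needs a
# dedicated cleanup step.
doubled    = "\u03AC\u0301"
normalized = doubled.unicode_normalize(:nfc)
puts normalized.length # 2

stripped = naive_remove_accents('ἄγε, vámonos')
puts stripped          # αγε, vamonos
```

The second case shows why normalization on input is necessary but not sufficient for text collected in the wild: the doubled accent is legal Unicode and survives NFC unchanged.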
metadata CHANGED
@@ -1,29 +1,35 @@
  --- !ruby/object:Gem::Specification
  name: tinycus
  version: !ruby/object:Gem::Version
- version: 1.0.0
+ version: 1.0.2
  platform: ruby
  authors:
  - Benjamin Crowell
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2024-01-14 00:00:00.000000000 Z
+ date: 2024-01-15 00:00:00.000000000 Z
  dependencies: []
- description: String manipulation for languages including polytonic Greek. High performance,
- small footprint, pure ruby. Typical uses would be removing accents from ancient
- Greek words, or alphabetizing words in ancient Greek.
+ description: "This is a ruby library to do some string functions efficiently that\nwould
+ otherwise be slow or require a huge footprint. For example,\nit can remove accents
+ from strings, or alphabetize strings in polytonic Greek. \n"
  email:
  executables: []
  extensions: []
- extra_rdoc_files: []
+ extra_rdoc_files:
+ - README.md
  files:
+ - LICENSE
+ - README.md
  - tinycus.rb
  homepage: https://bitbucket.org/ben-crowell/tinycus
  licenses:
  - GPL-3.0-only
  metadata:
  contact_uri: http://lightandmatter.com/area4author.html
+ documentation_uri: https://bitbucket.org/ben-crowell/tinycus
+ homepage_uri: https://bitbucket.org/ben-crowell/tinycus
+ source_code_uri: https://bitbucket.org/ben-crowell/tinycus
  post_install_message:
  rdoc_options: []
  require_paths: