RubyGems - anystyle - Versions diffs - 1.3.0 → 1.3.1 - Mend

anystyle 1.3.0 → 1.3.1


Files changed (5)

checksums.yaml +4 -4
data/README.md +74 -12
data/lib/anystyle/normalizer/volume.rb +1 -1
data/lib/anystyle/version.rb +1 -1
metadata +2 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 4a8c6471369e8969b190536c6f597742e6917197ce26009b1f7c7dcd1ec32168
-  data.tar.gz: 1f9af1e1337c47fda651b40fd19903d7474fe860b679bcafd832b932fc075947
+  metadata.gz: '082895ba5f17e070ac5c85afe9849b27097497622ac25f81eacbba205af1fec8'
+  data.tar.gz: b3f288e9cb22ce9a4dc74970cabf866f5349c60d52b4e8b1c6624b6fd6dfbd1a
 SHA512:
-  metadata.gz: 504a133a3cefedeb24fc9c165e00c9f59f6dfee14b78fa0206b9eadb2d060e236b8e99fe211d94d8df1de9d9c93efc8d3149c931827c5df83bfa56524539ce55
-  data.tar.gz: 70a9e1c156fe996980610755971eb4feb899afd2a57e20cbd95c664618326385b88ef8c733268922977a7252b2fff47771ba69892f37eecc94227545c4e956b3
+  metadata.gz: 47ea4876f749891b2f305456404e585dcaee27690caa80091836f4b81ee191e548a7a799db001bfdcda58ed9784485416e8abd379b3d712804054cf78a5b4417
+  data.tar.gz: b7ce8f5116f2ca211ab5f42ae345aa112954ae895aaa0668ef93218c66b4696bfeba9da1f513fa28669b3dfb668033cf94c7857f1450a8320e685d5d16f50d37

data/README.md CHANGED

@@ -20,6 +20,35 @@ Using AnyStyle CLI
 See [anystyle-cli](https://github.com/inukshuk/anystyle-cli) for more details.
+Using AnyStyle in Ruby
+----------------------
+Install the `anystyle` gem.
+    $ [sudo] gem install anystyle
+Once installed, you can use the static Parser and Finder instances
+by calling the `AnyStyle.parse` or `AnyStyle.find` methods. For example:
+```ruby
+require 'anystyle'
+pp AnyStyle.parse 'Derrida, J. (1967). L’écriture et la différence (1 éd.). Paris: Éditions du Seuil.'
+#-> [{
+#  :author=>[{:family=>"Derrida", :given=>"J."}],
+#  :date=>["1967"],
+#  :title=>["L’écriture et la différence"],
+#  :edition=>["1"],
+#  :location=>["Paris"],
+#  :publisher=>["Éditions du Seuil"],
+#  :language=>"fr",
+#  :scripts=>["Common", "Latin"],
+#  :type=>"book"
+#}]
+```
+Alternatively, you can create your own `AnyStyle::Parser` or
+`AnyStyle::Finder` with custom options.
 Web Application and Web Service
 -------------------------------
@@ -30,20 +59,53 @@ Please note that the web service is currently based on the legacy
 [0.x branch](https://github.com/inukshuk/anystyle/tree/0.x).
-Using AnyStyle in Ruby
-----------------------
-    $ [sudo] gem install anystyle
-Reference Parsing
------------------
-Document Parsing
-----------------
 Training
 --------
+You can train custom Finder and Parser models. To do this, you need
+to prepare your own data sets for training. You can create your own
+data from scratch or build on AnyStyle's default sets. The default
+parser model is based on the
+[core](https://github.com/inukshuk/anystyle/blob/master/res/parser/core.xml)
+data set; the default finder model source data is not publicly
+available in its entirety, but you can find a number of tagged
+documents
+[here](https://github.com/inukshuk/anystyle/blob/master/res/finder).
+When you have compiled a data set for training, you will be ready
+to create your own model:
+    $ anystyle train training-data.xml custom.mod
+This will save your new model as `custom.mod`. To use your model
+instead of AnyStyle's default, use the `-P` or `--parser-model` flag
+and, respectively, `-F` or `--finder-model` to use a custom Finder
+model. For instance, the command below would parse all references
+in `bib.txt` using the custom model we just trained and print the
+result to STDOUT using the JSON output format:
+    $ anystyle -P custom.mod -f json parse bib.txt -
+When training your own models, it is good practice to check the
+quality using a second data set. For example, using AnyStyle's own
+[gold](https://github.com/inukshuk/anystyle/blob/master/res/parser/gold.xml)
+data set (a large, manually curated data set) we could check our
+custom model like this:
+    $ anystyle -P x.mod check ./res/parser/gold.xml
+    Checking gold.xml.................   1 seq  0.06%   3 tok  0.01%  3s
+This command will print the sequence and token error rates; in
+the case of AnyStyle a the number of sequence errors is the number
+of references which were tagged differently by the parser than they
+were in the input; the number of token errors is the total number of
+words across all the references which were tagged differently. In the
+example above, we got one reference wrong (out of 1700 at the time);
+but even this one reference was mostly tagged correctly, because only
+a total of 3 words were tagged differently.
+When working with training data, it is a good idea to use the
+`Wapiti::Dataset` API in Ruby: it supports all the standard set
+operators and makes it very easy to combine or compare data sets.
 Dictionary Adapters
 -------------------
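
Note on the README hunk above: the sample `check` output reports `1 seq  0.06%`, and the added text says the gold set held about 1700 references at the time. The sketch below only confirms that those two figures agree, assuming the percentage is plain errors divided by total. The `error_rate` helper is hypothetical and is not part of AnyStyle.

```ruby
# Hypothetical helper (not an AnyStyle API): error rate as a percentage,
# rounded to two decimals, assuming rate = errors / total * 100.
def error_rate(errors, total)
  (errors.to_f / total * 100).round(2)
end

# 1 wrongly tagged reference out of roughly 1700 (figures from the README text).
puts error_rate(1, 1700) # prints 0.06
```

The token rate (`3 tok  0.01%`) cannot be checked the same way, because the diff does not state the total token count.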

data/lib/anystyle/normalizer/volume.rb CHANGED

@@ -38,7 +38,7 @@ module AnyStyle
               .sub(/<\/?(italic|i|strong|b|span|div)>/, '')
               .sub(/^[\p{P}\s]+/, '')
               .sub(/^[Vv]ol(ume)?[\p{P}\s]+/, '')
-              .sub(/\p{P}$/, '')
+              .sub(/[\p{P}\p{Z}\p{C}]+$/, '')
           end
         end
       end
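
Note on the hunk above: the old pattern removed a single punctuation character, and only when it was the last character. The new pattern removes a whole trailing run of punctuation, separators (including spaces) and control characters. The sketch below applies both patterns, copied from the diff, to a few made-up volume strings. The sample inputs and constant names are illustrative assumptions and do not come from the gem or its test suite.

```ruby
# Trailing-cleanup patterns copied from the hunk above.
OLD_TRAILING = /\p{P}$/                # 1.3.0: one trailing punctuation character
NEW_TRAILING = /[\p{P}\p{Z}\p{C}]+$/   # 1.3.1: a run of punctuation, separators, control chars

# Illustrative inputs (assumed, not taken from AnyStyle's data):
# trailing dot, dot plus space, two punctuation marks, no-break space.
samples = ["12.", "12. ", "12.,", "12\u00A0"]

samples.each do |value|
  old_result = value.sub(OLD_TRAILING, '')
  new_result = value.sub(NEW_TRAILING, '')
  puts "#{value.inspect}: old=#{old_result.inspect} new=#{new_result.inspect}"
end
```

With the old pattern, `"12. "` is left unchanged because its last character is a space, and `"12.,"` loses only the comma. The new pattern reduces all four samples to `"12"`.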

data/lib/anystyle/version.rb CHANGED

@@ -1,3 +1,3 @@
 module AnyStyle
-  VERSION = '1.3.0'.freeze
+  VERSION = '1.3.1'.freeze
 end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: anystyle
 version: !ruby/object:Gem::Version
-  version: 1.3.0
+  version: 1.3.1
 platform: ruby
 authors:
 - Sylvester Keil
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2018-09-18 00:00:00.000000000 Z
+date: 2018-09-21 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bibtex-ruby