anystyle 1.3.0 → 1.3.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 4a8c6471369e8969b190536c6f597742e6917197ce26009b1f7c7dcd1ec32168
- data.tar.gz: 1f9af1e1337c47fda651b40fd19903d7474fe860b679bcafd832b932fc075947
+ metadata.gz: '082895ba5f17e070ac5c85afe9849b27097497622ac25f81eacbba205af1fec8'
+ data.tar.gz: b3f288e9cb22ce9a4dc74970cabf866f5349c60d52b4e8b1c6624b6fd6dfbd1a
  SHA512:
- metadata.gz: 504a133a3cefedeb24fc9c165e00c9f59f6dfee14b78fa0206b9eadb2d060e236b8e99fe211d94d8df1de9d9c93efc8d3149c931827c5df83bfa56524539ce55
- data.tar.gz: 70a9e1c156fe996980610755971eb4feb899afd2a57e20cbd95c664618326385b88ef8c733268922977a7252b2fff47771ba69892f37eecc94227545c4e956b3
+ metadata.gz: 47ea4876f749891b2f305456404e585dcaee27690caa80091836f4b81ee191e548a7a799db001bfdcda58ed9784485416e8abd379b3d712804054cf78a5b4417
+ data.tar.gz: b7ce8f5116f2ca211ab5f42ae345aa112954ae895aaa0668ef93218c66b4696bfeba9da1f513fa28669b3dfb668033cf94c7857f1450a8320e685d5d16f50d37
data/README.md CHANGED
@@ -20,6 +20,35 @@ Using AnyStyle CLI
 
  See [anystyle-cli](https://github.com/inukshuk/anystyle-cli) for more details.
 
+ Using AnyStyle in Ruby
+ ----------------------
+ Install the `anystyle` gem.
+
+ $ [sudo] gem install anystyle
+
+ Once installed, you can use the static Parser and Finder instances
+ by calling the `AnyStyle.parse` or `AnyStyle.find` methods. For example:
+
+ ```ruby
+ require 'anystyle'
+
+ pp AnyStyle.parse 'Derrida, J. (1967). L’écriture et la différence (1 éd.). Paris: Éditions du Seuil.'
+ #-> [{
+ # :author=>[{:family=>"Derrida", :given=>"J."}],
+ # :date=>["1967"],
+ # :title=>["L’écriture et la différence"],
+ # :edition=>["1"],
+ # :location=>["Paris"],
+ # :publisher=>["Éditions du Seuil"],
+ # :language=>"fr",
+ # :scripts=>["Common", "Latin"],
+ # :type=>"book"
+ #}]
+ ```
+
+ Alternatively, you can create your own `AnyStyle::Parser` or
+ `AnyStyle::Finder` with custom options.
+
 
  Web Application and Web Service
  -------------------------------
@@ -30,20 +59,53 @@ Please note that the web service is currently based on the legacy
  [0.x branch](https://github.com/inukshuk/anystyle/tree/0.x).
 
 
- Using AnyStyle in Ruby
- ----------------------
-
- $ [sudo] gem install anystyle
-
-
- Reference Parsing
- -----------------
-
- Document Parsing
- ----------------
-
  Training
  --------
+ You can train custom Finder and Parser models. To do this, you need
+ to prepare your own data sets for training. You can create your own
+ data from scratch or build on AnyStyle's default sets. The default
+ parser model is based on the
+ [core](https://github.com/inukshuk/anystyle/blob/master/res/parser/core.xml)
+ data set; the default finder model source data is not publicly
+ available in its entirety, but you can find a number of tagged
+ documents
+ [here](https://github.com/inukshuk/anystyle/blob/master/res/finder).
+
+ When you have compiled a data set for training, you will be ready
+ to create your own model:
+
+ $ anystyle train training-data.xml custom.mod
+
+ This will save your new model as `custom.mod`. To use your model
+ instead of AnyStyle's default, use the `-P` or `--parser-model` flag
+ and, respectively, `-F` or `--finder-model` to use a custom Finder
+ model. For instance, the command below would parse all references
+ in `bib.txt` using the custom model we just trained and print the
+ result to STDOUT using the JSON output format:
+
+ $ anystyle -P custom.mod -f json parse bib.txt -
+
+ When training your own models, it is good practice to check the
+ quality using a second data set. For example, using AnyStyle's own
+ [gold](https://github.com/inukshuk/anystyle/blob/master/res/parser/gold.xml)
+ data set (a large, manually curated data set) we could check our
+ custom model like this:
+
+ $ anystyle -P x.mod check ./res/parser/gold.xml
+ Checking gold.xml................. 1 seq 0.06% 3 tok 0.01% 3s
+
+ This command will print the sequence and token error rates; in
+ the case of AnyStyle, the number of sequence errors is the number
+ of references which were tagged differently by the parser than they
+ were in the input; the number of token errors is the total number of
+ words across all the references which were tagged differently. In the
+ example above, we got one reference wrong (out of 1700 at the time);
+ but even this one reference was mostly tagged correctly, because only
+ a total of 3 words were tagged differently.
+
+ When working with training data, it is a good idea to use the
+ `Wapiti::Dataset` API in Ruby: it supports all the standard set
+ operators and makes it very easy to combine or compare data sets.
 
  Dictionary Adapters
  -------------------
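
The sequence and token error rates described in the new Training section of the README can be illustrated with a small plain-Ruby sketch. This is not AnyStyle's or Wapiti's implementation: the `error_rates` helper and the sample labels are hypothetical, and the code only restates the definitions given in the README text (a sequence error is a reference with at least one differently tagged token; token errors are counted across all references).

```ruby
# Hypothetical helper, not part of the anystyle gem.
# `expected` and `actual` are arrays of label sequences, one per reference,
# where each sequence holds one label per token.
def error_rates(expected, actual)
  token_total = expected.sum(&:size)
  token_errors = 0
  sequence_errors = 0

  expected.zip(actual).each do |gold_labels, predicted_labels|
    # Count the tokens whose predicted label differs from the gold label.
    mismatches = gold_labels.zip(predicted_labels).count { |gold, pred| gold != pred }
    token_errors += mismatches
    # A reference counts as one sequence error if any of its tokens differ.
    sequence_errors += 1 if mismatches > 0
  end

  {
    sequence_errors: sequence_errors,
    sequence_rate: 100.0 * sequence_errors / expected.size,
    token_errors: token_errors,
    token_rate: 100.0 * token_errors / token_total
  }
end

# Made-up sample data: two references with four tokens each.
gold = [
  [:author, :author, :date, :title],
  [:author, :title, :title, :publisher]
]
predicted = [
  [:author, :author, :date, :title],
  [:author, :title, :publisher, :publisher]
]

rates = error_rates(gold, predicted)
puts rates
```

By the same definition, one wrong reference out of 1700 gives 100 × 1 / 1700 ≈ 0.06%, which matches the `1 seq 0.06%` figure in the README's sample output.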
@@ -38,7 +38,7 @@ module AnyStyle
  .sub(/<\/?(italic|i|strong|b|span|div)>/, '')
  .sub(/^[\p{P}\s]+/, '')
  .sub(/^[Vv]ol(ume)?[\p{P}\s]+/, '')
- .sub(/\p{P}$/, '')
+ .sub(/[\p{P}\p{Z}\p{C}]+$/, '')
  end
  end
  end
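
To make the effect of the changed pattern in the hunk above concrete, the following sketch applies the removed and the added regular expression to a few made-up values. The constant names and sample strings are assumptions for illustration only; the two patterns are copied from the `-` and `+` lines. The old pattern removes a single punctuation character at the end of a line, while the new one removes a whole trailing run of punctuation, separator, and control/other characters.

```ruby
# Patterns copied from the removed (-) and added (+) lines of the hunk above.
OLD_TRAILING = /\p{P}$/
NEW_TRAILING = /[\p{P}\p{Z}\p{C}]+$/

# Hypothetical sample values; "\u00A0" is a no-break space (category Zs).
samples = ["12.", "12.,", "12. ", "12\u00A0"]

samples.each do |value|
  old_result = value.sub(OLD_TRAILING, '')
  new_result = value.sub(NEW_TRAILING, '')
  # Print each input next to what the old and new patterns leave behind.
  puts format('%-10s old=%-10s new=%s',
              value.inspect, old_result.inspect, new_result.inspect)
end
```

With the old pattern, a value such as `"12.,"` keeps its first trailing period and `"12. "` is left untouched because the final character is a space; the new pattern reduces both to `"12"`.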
@@ -1,3 +1,3 @@
  module AnyStyle
- VERSION = '1.3.0'.freeze
+ VERSION = '1.3.1'.freeze
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: anystyle
  version: !ruby/object:Gem::Version
- version: 1.3.0
+ version: 1.3.1
  platform: ruby
  authors:
  - Sylvester Keil
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2018-09-18 00:00:00.000000000 Z
+ date: 2018-09-21 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: bibtex-ruby