anystyle 1.3.0 → 1.3.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 4a8c6471369e8969b190536c6f597742e6917197ce26009b1f7c7dcd1ec32168
- data.tar.gz: 1f9af1e1337c47fda651b40fd19903d7474fe860b679bcafd832b932fc075947
+ metadata.gz: '082895ba5f17e070ac5c85afe9849b27097497622ac25f81eacbba205af1fec8'
+ data.tar.gz: b3f288e9cb22ce9a4dc74970cabf866f5349c60d52b4e8b1c6624b6fd6dfbd1a
  SHA512:
- metadata.gz: 504a133a3cefedeb24fc9c165e00c9f59f6dfee14b78fa0206b9eadb2d060e236b8e99fe211d94d8df1de9d9c93efc8d3149c931827c5df83bfa56524539ce55
- data.tar.gz: 70a9e1c156fe996980610755971eb4feb899afd2a57e20cbd95c664618326385b88ef8c733268922977a7252b2fff47771ba69892f37eecc94227545c4e956b3
+ metadata.gz: 47ea4876f749891b2f305456404e585dcaee27690caa80091836f4b81ee191e548a7a799db001bfdcda58ed9784485416e8abd379b3d712804054cf78a5b4417
+ data.tar.gz: b7ce8f5116f2ca211ab5f42ae345aa112954ae895aaa0668ef93218c66b4696bfeba9da1f513fa28669b3dfb668033cf94c7857f1450a8320e685d5d16f50d37
data/README.md CHANGED
@@ -20,6 +20,35 @@ Using AnyStyle CLI
 
  See [anystyle-cli](https://github.com/inukshuk/anystyle-cli) for more details.
 
+ Using AnyStyle in Ruby
+ ----------------------
+ Install the `anystyle` gem.
+
+ $ [sudo] gem install anystyle
+
+ Once installed, you can use the static Parser and Finder instances
+ by calling the `AnyStyle.parse` or `AnyStyle.find` methods. For example:
+
+ ```ruby
+ require 'anystyle'
+
+ pp AnyStyle.parse 'Derrida, J. (1967). L’écriture et la différence (1 éd.). Paris: Éditions du Seuil.'
+ #-> [{
+ # :author=>[{:family=>"Derrida", :given=>"J."}],
+ # :date=>["1967"],
+ # :title=>["L’écriture et la différence"],
+ # :edition=>["1"],
+ # :location=>["Paris"],
+ # :publisher=>["Éditions du Seuil"],
+ # :language=>"fr",
+ # :scripts=>["Common", "Latin"],
+ # :type=>"book"
+ #}]
+ ```
+
+ Alternatively, you can create your own `AnyStyle::Parser` or
+ `AnyStyle::Finder` with custom options.
+
 
  Web Application and Web Service
  -------------------------------
@@ -30,20 +59,53 @@ Please note that the web service is currently based on the legacy
  [0.x branch](https://github.com/inukshuk/anystyle/tree/0.x).
 
 
- Using AnyStyle in Ruby
- ----------------------
-
- $ [sudo] gem install anystyle
-
-
- Reference Parsing
- -----------------
-
- Document Parsing
- ----------------
-
  Training
  --------
+ You can train custom Finder and Parser models. To do this, you need
+ to prepare your own data sets for training. You can create your own
+ data from scratch or build on AnyStyle's default sets. The default
+ parser model is based on the
+ [core](https://github.com/inukshuk/anystyle/blob/master/res/parser/core.xml)
+ data set; the default finder model source data is not publicly
+ available in its entirety, but you can find a number of tagged
+ documents
+ [here](https://github.com/inukshuk/anystyle/blob/master/res/finder).
+
+ When you have compiled a data set for training, you will be ready
+ to create your own model:
+
+ $ anystyle train training-data.xml custom.mod
+
+ This will save your new model as `custom.mod`. To use your model
+ instead of AnyStyle's default, use the `-P` or `--parser-model` flag
+ and, respectively, `-F` or `--finder-model` to use a custom Finder
+ model. For instance, the command below would parse all references
+ in `bib.txt` using the custom model we just trained and print the
+ result to STDOUT using the JSON output format:
+
+ $ anystyle -P custom.mod -f json parse bib.txt -
+
+ When training your own models, it is good practice to check the
+ quality using a second data set. For example, using AnyStyle's own
+ [gold](https://github.com/inukshuk/anystyle/blob/master/res/parser/gold.xml)
+ data set (a large, manually curated data set) we could check our
+ custom model like this:
+
+ $ anystyle -P x.mod check ./res/parser/gold.xml
+ Checking gold.xml................. 1 seq 0.06% 3 tok 0.01% 3s
+
+ This command will print the sequence and token error rates; in
+ the case of AnyStyle a the number of sequence errors is the number
+ of references which were tagged differently by the parser than they
+ were in the input; the number of token errors is the total number of
+ words across all the references which were tagged differently. In the
+ example above, we got one reference wrong (out of 1700 at the time);
+ but even this one reference was mostly tagged correctly, because only
+ a total of 3 words were tagged differently.
+
+ When working with training data, it is a good idea to use the
+ `Wapiti::Dataset` API in Ruby: it supports all the standard set
+ operators and makes it very easy to combine or compare data sets.
 
  Dictionary Adapters
  -------------------
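Note on the Training text added in the README hunk above: the sequence and token error rates it describes can be illustrated with a small, self-contained sketch. This is not AnyStyle's or Wapiti's implementation. The helper `error_rates` and the sample labels are hypothetical, and the sketch assumes both data sets tokenize each reference identically, so labels can be compared position by position.

```ruby
# Hypothetical helper, not part of AnyStyle: compares gold labels with
# predicted labels, given one array of labels per reference.
def error_rates(gold, predicted)
  seq_total = gold.length
  tok_total = gold.sum(&:length)
  seq_errors = 0
  tok_errors = 0

  gold.zip(predicted).each do |gold_labels, predicted_labels|
    # Count the tokens whose label differs in this reference.
    wrong = gold_labels.zip(predicted_labels).count { |expected, actual| expected != actual }
    tok_errors += wrong
    # A reference counts as one sequence error if any token differs.
    seq_errors += 1 if wrong > 0
  end

  {
    seq_errors: seq_errors,
    seq_rate: (100.0 * seq_errors / seq_total).round(2),
    tok_errors: tok_errors,
    tok_rate: (100.0 * tok_errors / tok_total).round(2)
  }
end

# Made-up sample data: two references, one mis-tagged token in the second.
gold = [
  %w[author date title title publisher],
  %w[author title title date]
]
predicted = [
  %w[author date title title publisher],
  %w[author title date date]
]

p error_rates(gold, predicted)
```

With the same arithmetic, one wrong reference out of roughly 1700 gives `(100.0 * 1 / 1700).round(2)`, which is `0.06`, consistent with the `1 seq 0.06%` shown in the README's sample output.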
@@ -38,7 +38,7 @@ module AnyStyle
  .sub(/<\/?(italic|i|strong|b|span|div)>/, '')
  .sub(/^[\p{P}\s]+/, '')
  .sub(/^[Vv]ol(ume)?[\p{P}\s]+/, '')
- .sub(/\p{P}$/, '')
+ .sub(/[\p{P}\p{Z}\p{C}]+$/, '')
  end
  end
  end
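Note on the hunk above: the only behavioral change is the trailing-cleanup pattern. A minimal sketch of the difference, using the two patterns copied from the diff, follows. The constant and method names are hypothetical and the sample strings are made up; the surrounding normalizer code is not reproduced here.

```ruby
# Patterns copied from the hunk above; the names are hypothetical.
OLD_TRAILING = /\p{P}$/                  # 1.3.0: one trailing punctuation character
NEW_TRAILING = /[\p{P}\p{Z}\p{C}]+$/     # 1.3.1: a run of punctuation, separators, or control characters

def strip_old(value)
  value.sub(OLD_TRAILING, '')
end

def strip_new(value)
  value.sub(NEW_TRAILING, '')
end

# Made-up sample values with different kinds of trailing debris.
samples = ["Nature.", "Nature.,", "Nature. ", "Nature.\u00A0"]

samples.each do |sample|
  puts format('%-14s old=%-14s new=%s',
              sample.inspect, strip_old(sample).inspect, strip_new(sample).inspect)
end
```

The old pattern removes at most one punctuation character, and only when it is the very last character, so a value ending in a space or a no-break space is left untouched. The new pattern removes the whole trailing run, so all four samples reduce to `"Nature"`.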
@@ -1,3 +1,3 @@
  module AnyStyle
- VERSION = '1.3.0'.freeze
+ VERSION = '1.3.1'.freeze
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: anystyle
  version: !ruby/object:Gem::Version
- version: 1.3.0
+ version: 1.3.1
  platform: ruby
  authors:
  - Sylvester Keil
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2018-09-18 00:00:00.000000000 Z
+ date: 2018-09-21 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: bibtex-ruby