anystyle 1.3.11 → 1.3.14

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: dcb19c0b21f5cd2ba5409b22ff1105e5186981871c22cf6862633cdaa0150a61
-  data.tar.gz: b0fcd484f3f87784816bd4e3dc55e059329602f4800242362531cf25c9db7adc
+  metadata.gz: 1bf14266569ad0eb5e812e68d0289780e50ee2cc765faeb1a5fb48b37b1ba4b7
+  data.tar.gz: be96508a8b939c6450342a763cfb469b87d170dab80380b4a82beb7d02b4e0c4
 SHA512:
-  metadata.gz: f0df92b83ca46a7464c94c737dd035914d2d67c3269a7ca1730225511f0a57e4fe3530f1cedf8d1a63894650946bb953c178de9f2fd584411d72aa5f62b31290
-  data.tar.gz: 975dddfcf906495bc0c65b26d816e372bfc9fbf0331f1b08350cb9fc0befcff3754968f620501682f43f0985885d8e4319017458ba28b4092ab34038fe7043f0
+  metadata.gz: a8a2c6f4f0d997b657b9817b8bf2a4bdf62db4e9f53130ed9923de571e9cb6a264e554dd78c4ebaf7180b424734632d526fd46d3accba07e0cd766fb37689fae
+  data.tar.gz: d2ddd8f136ac10a0ddd29b9522fc7779c1ea92c1d441601c801726fa4e304d104de588b38c230fa53102e59e23e2ce0b48b79f4c4bd932c2cb0e3a3f56ccd43a
data/README.md CHANGED
@@ -57,6 +57,9 @@ AnyStyle is available as web application at [anystyle.io](https://anystyle.io).
 The web application [is open source](https://github.com/inukshuk/anystyle.io)
 and you can also host yourself!
 
+Improving results for your data
+=================================
+
 Training
 --------
 You can train custom Finder and Parser models. To do this, you need
@@ -105,6 +108,47 @@ When working with training data, it is a good idea to use the
 `Wapiti::Dataset` API in Ruby: it supports all the standard set
 operators and makes it very easy to combine or compare data sets.
 
+Natural Languages used in AnyStyle
+----------------------------------
+
+As mentioned above, the
+[core](https://github.com/inukshuk/anystyle/blob/master/res/parser/core.xml)
+dataset contains the manually marked-up references that are used as the
+basis for the default AnyStyle parsing model. If the references you are
+trying to parse include many non-English documents, the distribution of
+natural languages in this corpus is relevant (detected using [cld](https://github.com/jtoy/cld)).
+
+| Language                | n   |
+|-------------------------|-----|
+| ENGLISH                 | 965 |
+| FRENCH                  | 54  |
+| GERMAN                  | 26  |
+| ITALIAN                 | 11  |
+| Others                  | 9   |
+|                         |     |
+| Not reliably determined | 449 |
+| (but mainly English)    |     |
+
+(These data are based on AnyStyle version 1.3.13)
+
+There is a strong prevalence of English-language documents with the
+conventions used in English-language bibliographies, with some
+representation of other European languages. The languages used reflect
+those used in scientific publishing as well as the maintainers'
+competencies. If you are working with many documents in languages other
+than English, you might consider training the model with some examples
+in the relevant languages.
+
+AnyStyle should work with references written in any Latin script
+(including most European languages, languages such as Indonesian and
+Malaysian, as well as romanised Arabic, Chinese and Japanese). It should
+also support languages written with non-Latin alphabets (such as
+Russian), although no examples of these appear in the default training
+sets. Languages written in syllabaries or complex symbols which do not
+use white space to separate tokens are not compatible with AnyStyle's
+approach: this includes Chinese, Japanese, Arabic as well as many Indian
+languages.
+
 Dictionary Adapters
 -------------------
 During the statistical analysis of reference strings, AnyStyle relies
@@ -142,6 +186,8 @@ and configure AnyStyle to use the Redis adapter:
 AnyStyle::Dictionary::Redis.defaults[:host] = 'localhost'
 AnyStyle::Dictionary::Redis.defaults[:port] = 6379
 
+About AnyStyle
+==============
 Contributing
 ------------
 The AnyStyle source code is
@@ -13,7 +13,12 @@ module AnyStyle
 when :url
   doi = doi_extract(value) if value =~ /doi\.org\//i
   append item, :doi, doi unless doi.nil?
-  URI.extract(value)
+  urls = URI.extract(value, %w(http https ftp ftps))
+  if urls.empty?
+    value
+  else
+    urls
+  end
 when :doi
   doi_extract(value) || value
 else
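
The `:url` change above can be tried in isolation. The following is a minimal sketch, not the gem's code: `extract_urls` is a hypothetical helper name, and only the two lines taken from the hunk (`URI.extract` with a scheme list, then the fallback to the raw value) reflect the diff. Note that Ruby marks `URI.extract` as obsolete and warns about it in verbose mode.

```ruby
require 'uri'

# Hypothetical helper mirroring the changed branch: keep only URLs whose
# scheme is in the list, and fall back to the raw value if none are found.
def extract_urls(value)
  urls = URI.extract(value, %w(http https ftp ftps))
  urls.empty? ? value : urls
end

with_scheme    = extract_urls('Available at https://example.org/paper.pdf')
without_scheme = extract_urls('www.example.org/paper')

p with_scheme     # an Array containing the https URL
p without_scheme  # the unchanged input String, since no scheme matched
```

With the scheme list in place, a value without a recognised scheme is returned unchanged instead of as an empty array, which appears to be the point of the fallback.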
@@ -4,6 +4,7 @@ module AnyStyle
 @keys = [:volume, :pages, :date]
 
 VOLNUM_RX = '(\p{Lu}?\d+|[IVXLCDM]+)'
+
 def normalize(item, **opts)
   map_values(item, [:volume]) do |_, volume|
     volume = StringUtils.strip_html volume
@@ -15,7 +16,7 @@ module AnyStyle
 end
 
 case volume
-when /(?:^|\s)#{VOLNUM_RX}\s?\(([^)]+)\)(\s?\d+\p{Pd}\d+)?/
+when /(?:^|\s)#{VOLNUM_RX}\s?\(([^)]+)\)[;:,]?(?:pp?.?)?(\s?\d+\p{Pd}\d+)?/
   volume = $1
   append item, :issue, $2
   append item, :pages, $3.strip unless $3.nil?
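
To see what the extended pattern changes, the two regular expressions from this hunk can be compared side by side. This is a sketch under assumptions: `OLD_RX`, `NEW_RX` and `split_volume` are illustrative names that do not appear in the gem, and the sample strings are made up.

```ruby
VOLNUM_RX = '(\p{Lu}?\d+|[IVXLCDM]+)'

# The two patterns from the hunk, copied verbatim.
OLD_RX = /(?:^|\s)#{VOLNUM_RX}\s?\(([^)]+)\)(\s?\d+\p{Pd}\d+)?/
NEW_RX = /(?:^|\s)#{VOLNUM_RX}\s?\(([^)]+)\)[;:,]?(?:pp?.?)?(\s?\d+\p{Pd}\d+)?/

# Hypothetical helper: return the three capture groups as a Hash,
# or nil if the pattern does not match at all.
def split_volume(rx, volume)
  m = rx.match(volume)
  return nil if m.nil?
  { volume: m[1], issue: m[2], pages: m[3]&.strip }
end

old_result = split_volume(OLD_RX, '12(3):45-67')
new_result = split_volume(NEW_RX, '12(3):45-67')
pp_result  = split_volume(NEW_RX, '12(3),pp. 45-67')

p old_result  # pages are not captured: the ':' blocks the page group
p new_result  # pages are captured after the ':' separator
p pp_result   # a 'pp.' prefix directly after the separator is skipped
```

The `.` in `pp?.?` is unescaped in the hunk, so it matches any single character rather than only a literal full stop. A space between the separator and `pp.` (as in `12(3), pp. 45-67`) still leaves the pages uncaptured.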
@@ -6,7 +6,7 @@ module AnyStyle
 attr_reader :defaults, :formats
 
 def load(path)
-  new :model => path
+  new model: path
 end
 
 # Returns a default parser instance
@@ -18,7 +18,8 @@ module AnyStyle
 attr_reader :model, :options, :features, :normalizers, :mtime
 
 def initialize(options = {})
-  @options = self.class.defaults.merge(options)
+  def_opts = self.class.defaults || Parser.defaults
+  @options = def_opts.merge(options)
   load_model
 end
 
@@ -28,7 +29,7 @@ module AnyStyle
   @model.options.update_attributes options
   @mtime = File.mtime(file)
 else
-  @model = Wapiti::Model.new(options.reject { |k,_| k == :model })
+  @model = Wapiti::Model.new(options.reject { |k, _| k == :model })
   @model.path = options[:model]
   @mtime = Time.now
 end
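
One plausible motivation for the `initialize` change in the `-18,7 +18,8` hunk is subclassing: `defaults` is exposed through a class-level `attr_reader`, and class-level instance variables are not inherited, so a subclass that sets no defaults of its own would see `nil`. The sketch below illustrates that Ruby behaviour with simplified stand-in classes. The `Sketch` module, `CustomParser`, and the option values are assumptions for illustration and are not taken from the gem.

```ruby
module Sketch
  # Simplified stand-in for a parser class with class-level defaults.
  class Parser
    @defaults = { model: 'parser.mod' }.freeze

    class << self
      attr_reader :defaults
    end

    attr_reader :options

    def initialize(options = {})
      # Same fallback as in the hunk: use the subclass defaults if set,
      # otherwise the defaults of the base Parser class.
      def_opts = self.class.defaults || Parser.defaults
      @options = def_opts.merge(options)
    end
  end

  # A subclass that does not define its own @defaults.
  class CustomParser < Parser; end
end

p Sketch::CustomParser.defaults                    # nil: not inherited
p Sketch::CustomParser.new(pattern: 'x').options   # merged with Parser defaults
```

Without the `|| Parser.defaults` fallback, the subclass constructor in this sketch would call `merge` on `nil` and raise a `NoMethodError`.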