anystyle 1.3.13 → 1.3.14

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: '09ec60a796bf13a493c0e91f4df8ad0295d70dde460da704119dfcbbd654a09c'
- data.tar.gz: e17787d8d6a2007b9f815e64c32092f305fdf92e80ec3914da1a72318ec049ee
+ metadata.gz: 1bf14266569ad0eb5e812e68d0289780e50ee2cc765faeb1a5fb48b37b1ba4b7
+ data.tar.gz: be96508a8b939c6450342a763cfb469b87d170dab80380b4a82beb7d02b4e0c4
  SHA512:
- metadata.gz: ab5a9c7bfeb5b65c0b6345d69a11bcae62ea8df785f3eb212c1e8d345e2882ac44da53098523cc3e07828ca0b0da8557de133e1e95ac21f35fe3f5d9ca405079
- data.tar.gz: 408f6b7f3810e31984c3e8dbbca1831bf98a1197b443d233ff831b824af5bd8d4249935eaa4e3fc02bc0b8d15e7c4d85256e82253c4c8ec60500b8c269d70ff3
+ metadata.gz: a8a2c6f4f0d997b657b9817b8bf2a4bdf62db4e9f53130ed9923de571e9cb6a264e554dd78c4ebaf7180b424734632d526fd46d3accba07e0cd766fb37689fae
+ data.tar.gz: d2ddd8f136ac10a0ddd29b9522fc7779c1ea92c1d441601c801726fa4e304d104de588b38c230fa53102e59e23e2ce0b48b79f4c4bd932c2cb0e3a3f56ccd43a
data/README.md CHANGED
@@ -57,6 +57,9 @@ AnyStyle is available as web application at [anystyle.io](https://anystyle.io).
  The web application [is open source](https://github.com/inukshuk/anystyle.io)
  and you can also host yourself!
  
+ Improving results for your data
+ =================================
+
  Training
  --------
  You can train custom Finder and Parser models. To do this, you need
@@ -105,6 +108,47 @@ When working with training data, it is a good idea to use the
  `Wapiti::Dataset` API in Ruby: it supports all the standard set
  operators and makes it very easy to combine or compare data sets.
  
+ Natural Languages used in AnyStyle
+ ----------------------------------
+
+ As mentioned above, the
+ [core](https://github.com/inukshuk/anystyle/blob/master/res/parser/core.xml)
+ dataset contains the manually marked-up references that are used as the
+ basis for the default AnyStyle parsing model. If the references you are
+ trying to parse include many non-English documents, the distribution of
+ natural languages in this corpus is relevant (detected using [cld](https://github.com/jtoy/cld)).
+
+ | Language                | n   |
+ |-------------------------|-----|
+ | ENGLISH                 | 965 |
+ | FRENCH                  | 54  |
+ | GERMAN                  | 26  |
+ | ITALIAN                 | 11  |
+ | Others                  | 9   |
+ |                         |     |
+ | Not reliably determined | 449 |
+ | (but mainly English)    |     |
+
+ (These data are based on AnyStyle version 1.3.13)
+
+ There is a strong prevalence of English-language documents with the
+ conventions used in English-language bibliographies, with some
+ representation of other European languages. The languages used reflect
+ those used in scientific publishing as well as the maintainers'
+ competencies. If you are working with many documents in languages other
+ than English, you might consider training the model with some examples
+ in the relevant languages.
+
+ AnyStyle should work with references written in any Latin script
+ (including most European languages, languages such as Indonesian and
+ Malaysian, as well as romanised Arabic, Chinese and Japanese). It should
+ also support languages written with non-Latin alphabets (such as
+ Russian), although no examples of these appear in the default training
+ sets. Languages written in syllabaries or complex symbols which do not
+ use white space to separate tokens are not compatible with AnyStyle's
+ approach: this includes Chinese, Japanese, Arabic as well as many Indian
+ languages.
+
  Dictionary Adapters
  -------------------
  During the statistical analysis of reference strings, AnyStyle relies
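The language counts in the table added in this hunk can be turned into shares of the corpus, which makes the "strong prevalence of English" concrete. The following is a minimal Ruby sketch: the counts are copied from the table, while the variable names (`counts`, `total`, `shares`) are chosen here for illustration only and are not part of AnyStyle.

```ruby
# Reference counts per detected language, copied from the README table
# (stated there to be based on AnyStyle version 1.3.13).
counts = {
  'ENGLISH' => 965,
  'FRENCH' => 54,
  'GERMAN' => 26,
  'ITALIAN' => 11,
  'Others' => 9,
  'Not reliably determined' => 449
}

# Total number of references covered by the table.
total = counts.values.sum

# Share of each category as a percentage of all references.
shares = counts.transform_values { |n| 100.0 * n / total }

shares.each do |language, share|
  puts format('%-24s %5.1f%%', language, share)
end
```

Rows where the language could not be reliably determined are kept as their own category here rather than being added to the English count.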
@@ -142,6 +186,8 @@ and configure AnyStyle to use the Redis adapter:
142
186
  AnyStyle::Dictionary::Redis.defaults[:host] = 'localhost'
143
187
  AnyStyle::Dictionary::Redis.defaults[:port] = 6379
144
188
 
189
+ About AnyStyle
190
+ ==============
145
191
  Contributing
146
192
  ------------
147
193
  The AnyStyle source code is
@@ -13,7 +13,12 @@ module AnyStyle
  when :url
    doi = doi_extract(value) if value =~ /doi\.org\//i
    append item, :doi, doi unless doi.nil?
-   URI.extract(value)
+   urls = URI.extract(value, %w(http https ftp ftps))
+   if urls.empty?
+     value
+   else
+     urls
+   end
  when :doi
    doi_extract(value) || value
  else
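The behaviour of the changed `:url` branch can be tried in isolation. The sketch below is not AnyStyle's code: `extract_urls` is a hypothetical helper that only mirrors the added lines, restricting `URI.extract` to the listed schemes and falling back to the raw value when nothing matches. Note that Ruby marks `URI.extract` as obsolete and warns about it in verbose mode.

```ruby
require 'uri'

# Hypothetical helper mirroring the changed branch: keep only URLs with one
# of the listed schemes, and fall back to the raw value when none is found.
def extract_urls(value)
  urls = URI.extract(value, %w(http https ftp ftps))
  urls.empty? ? value : urls
end

# A value containing a URL with a known scheme yields an Array of URLs.
with_scheme = extract_urls('Available at https://example.org/paper and elsewhere')

# A value without a recognised scheme is returned unchanged as a String.
without_scheme = extract_urls('www.example.org/paper')

p with_scheme
p without_scheme
```

As in the hunk, the return type differs between the two cases (an Array of matches versus the original String), so callers need to handle both.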
@@ -18,7 +18,8 @@ module AnyStyle
  attr_reader :model, :options, :features, :normalizers, :mtime

  def initialize(options = {})
-   @options = self.class.defaults.merge(options)
+   def_opts = self.class.defaults || Parser.defaults
+   @options = def_opts.merge(options)
    load_model
  end
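One situation in which a fallback like `self.class.defaults || Parser.defaults` matters is subclassing: in Ruby, an instance variable set on a class object is not inherited by its subclasses, so a reader for it returns `nil` there. The sketch below illustrates that general pattern with hypothetical classes (`BaseParser`, `CustomParser`) and made-up option keys. It is an assumption about the motivation for the change, not AnyStyle's actual implementation.

```ruby
# Hypothetical stand-in for a parser class with class-level defaults.
class BaseParser
  class << self
    # Reads the class-level instance variable @defaults; subclasses that do
    # not set their own @defaults get nil from this reader.
    attr_reader :defaults
  end

  # Made-up option keys, for illustration only.
  @defaults = { model: 'base.mod', threads: 4 }.freeze

  attr_reader :options

  def initialize(options = {})
    # Same shape as the changed lines: fall back to the base defaults when
    # the concrete class has none of its own.
    def_opts = self.class.defaults || BaseParser.defaults
    @options = def_opts.merge(options)
  end
end

# A subclass that does not define its own defaults.
class CustomParser < BaseParser
end

parser = CustomParser.new(threads: 1)
p CustomParser.defaults
p parser.options
```

Without the fallback, `CustomParser.new` would raise a `NoMethodError` because `merge` would be called on `nil`.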