anystyle 1.3.13 → 1.4.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: '09ec60a796bf13a493c0e91f4df8ad0295d70dde460da704119dfcbbd654a09c'
- data.tar.gz: e17787d8d6a2007b9f815e64c32092f305fdf92e80ec3914da1a72318ec049ee
+ metadata.gz: b7d5642b2c453c00a91748d6e447504c825b0097095cda159d92f2c6d82a11c5
+ data.tar.gz: c29fa4ee58a017404b205418bcf1d45f7111d3005e881f5d869d73572d0cd5e6
  SHA512:
- metadata.gz: ab5a9c7bfeb5b65c0b6345d69a11bcae62ea8df785f3eb212c1e8d345e2882ac44da53098523cc3e07828ca0b0da8557de133e1e95ac21f35fe3f5d9ca405079
- data.tar.gz: 408f6b7f3810e31984c3e8dbbca1831bf98a1197b443d233ff831b824af5bd8d4249935eaa4e3fc02bc0b8d15e7c4d85256e82253c4c8ec60500b8c269d70ff3
+ metadata.gz: 44f8226f33b22aa53b743ce56db3203a1c1bb3e4dd749ca746a0c46563656f36a4da585eebb32f0b8062701f6719e5c0aed9ee5f9fa533a1f49aacdef2d79504
+ data.tar.gz: dcaf21ed636bc50a7fa982e220942b0d752f13714e1a8cda038d1d7b324b7da9c78569903460c10454c5740d191e1abc19254c24222cd2a411880c78fe3933ea
data/HISTORY.md CHANGED
@@ -1,3 +1,9 @@
+ 1.4.0 / 2023-01-06
+ ==================
+ * Removed deprecated string taint checking (@bbonamin).
+ * `AnyStyle::Parser#parse` will no longer automatically open local files.
+   Please call `Wapiti::Dataset.open` explicitly if you relied on this.
+
  1.3.6 / 2019-12-02
  ==================
  * Updated parser model.
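
Callers that relied on the old behaviour could handle the migration note above along these lines. This is only a sketch: `local_file?` and `parse_references` are hypothetical helper names, not part of AnyStyle, and the length limit simply mirrors the check that this release removes from `Parser#parse`. It also assumes that `parse` continues to accept a `Wapiti::Dataset`.

```ruby
# Hypothetical helper, not part of AnyStyle: reproduces the file check that
# Parser#parse performed before 1.4.0 (without the removed taint test).
def local_file?(input)
  input.is_a?(String) && input.length < 1024 && File.exist?(input)
end

# Hypothetical call site; assumes the anystyle gem (and Wapiti) are loaded.
# Paths are opened explicitly, everything else is passed through unchanged.
def parse_references(parser, input, **opts)
  input = Wapiti::Dataset.open(input, **opts) if local_file?(input)
  parser.parse(input)
end
```

Keeping the file check at the call site makes it explicit which strings are treated as paths, which is the point of the change.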
data/LICENSE CHANGED
@@ -1,5 +1,5 @@
  AnyStyle
- Copyright 2011-2020 Sylvester Keil. All rights reserved.
+ Copyright 2011-2023 Sylvester Keil. All rights reserved.
  
  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions are met:
data/README.md CHANGED
@@ -57,6 +57,9 @@ AnyStyle is available as web application at [anystyle.io](https://anystyle.io).
  The web application [is open source](https://github.com/inukshuk/anystyle.io)
  and you can also host yourself!
  
+ Improving results for your data
+ =================================
+
  Training
  --------
  You can train custom Finder and Parser models. To do this, you need
@@ -105,6 +108,47 @@ When working with training data, it is a good idea to use the
  `Wapiti::Dataset` API in Ruby: it supports all the standard set
  operators and makes it very easy to combine or compare data sets.
  
+ Natural Languages used in AnyStyle
+ ----------------------------------
+
+ As mentioned above, the
+ [core](https://github.com/inukshuk/anystyle/blob/master/res/parser/core.xml)
+ dataset contains the manually marked-up references that are used as the
+ basis for the default AnyStyle parsing model. If the references you are
+ trying to parse include many non-English documents, the distribution of
+ natural languages in this corpus is relevant (detected using [cld](https://github.com/jtoy/cld)).
+
+ | Language | n |
+ |-------------------------|-----|
+ | ENGLISH | 965 |
+ | FRENCH | 54 |
+ | GERMAN | 26 |
+ | ITALIAN | 11 |
+ | Others | 9 |
+ | | |
+ | Not reliably determined | 449 |
+ | (but mainly English) | |
+
+ (These data are based on AnyStyle version 1.3.13)
+
+ There is a strong prevalence of English-language documents with the
+ conventions used in English-language bibliographies, with some
+ representation of other European languages. The languages used reflect
+ those used in scientific publishing as well as the maintainers'
+ competencies. If you are working with many documents in languages other
+ than English, you might consider training the model with some examples
+ in the relevant languages.
+
+ AnyStyle should work with references written in any Latin script
+ (including most European languages, languages such as Indonesian and
+ Malaysian, as well as romanised Arabic, Chinese and Japanese). It should
+ also support languages written with non-Latin alphabets (such as
+ Russian), although no examples of these appear in the default training
+ sets. Languages written in syllabaries or complex symbols which do not
+ use white space to separate tokens are not compatible with AnyStyle's
+ approach: this includes Chinese, Japanese, Arabic as well as many Indian
+ languages.
+
  Dictionary Adapters
  -------------------
  During the statistical analysis of reference strings, AnyStyle relies
@@ -142,6 +186,8 @@ and configure AnyStyle to use the Redis adapter:
  AnyStyle::Dictionary::Redis.defaults[:host] = 'localhost'
  AnyStyle::Dictionary::Redis.defaults[:port] = 6379
  
+ About AnyStyle
+ ==============
  Contributing
  ------------
  The AnyStyle source code is
@@ -167,7 +213,7 @@ to join us! Over the years our main contributors have been:
  
  License
  -------
- Copyright 2011-2020 Sylvester Keil. All rights reserved.
+ Copyright 2011-2023 Sylvester Keil. All rights reserved.
  
  AnyStyle is distributed under a BSD-style license.
  See LICENSE for details.
@@ -10,7 +10,7 @@ module AnyStyle
  end
  
  def open
- if File.exists?(options[:path])
+ if File.exist?(options[:path])
  @db = ::Marshal.load(File.open(options[:path]))
  else
  @db = {}
@@ -18,8 +18,6 @@ module AnyStyle
  end
  
  def open(path, format: File.extname(path), tagged: false, **opts)
- raise ArgumentError,
- "cannot open tainted path: '#{path}'" if path.tainted?
  raise ArgumentError,
  "document not found: '#{path}'" unless File.exist?(path)
  
@@ -8,7 +8,7 @@ module AnyStyle
  compact: true,
  threads: 4,
  format: :references,
- training_data: Dir[File.join(RES, 'finder', '*.ttx')].map(&:untaint),
+ training_data: Dir[File.join(RES, 'finder', '*.ttx')],
  layout: true,
  pdftotext: 'pdftotext',
  pdfinfo: 'pdfinfo'
@@ -13,7 +13,12 @@ module AnyStyle
  when :url
  doi = doi_extract(value) if value =~ /doi\.org\//i
  append item, :doi, doi unless doi.nil?
- URI.extract(value)
+ urls = URI.extract(value, %w(http https ftp ftps))
+ if urls.empty?
+ value
+ else
+ urls
+ end
  when :doi
  doi_extract(value) || value
  else
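
The effect of the scheme list in the hunk above can be tried in isolation. The following is a minimal sketch, assuming a hypothetical stand-alone helper `normalize_url` that is not AnyStyle's actual method. Note that `URI.extract` is marked obsolete in Ruby's standard library, although it is what the code above calls.

```ruby
require 'uri'

# Schemes accepted by the changed :url branch.
SCHEMES = %w(http https ftp ftps).freeze

# Hypothetical stand-alone version of the :url branch: return the URLs
# found in the value, or the value itself when none is found.
def normalize_url(value)
  urls = URI.extract(value, SCHEMES)
  urls.empty? ? value : urls
end
```

With the fallback, a value that contains no URL with one of the listed schemes is kept as it is and is not replaced by an empty list.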
@@ -18,7 +18,8 @@ module AnyStyle
  attr_reader :model, :options, :features, :normalizers, :mtime
  
  def initialize(options = {})
- @options = self.class.defaults.merge(options)
+ def_opts = self.class.defaults || Parser.defaults
+ @options = def_opts.merge(options)
  load_model
  end
  
@@ -85,12 +86,6 @@ module AnyStyle
  expand input
  when Wapiti::Sequence
  expand Wapiti::Dataset.new([input])
- when String
- if !input.tainted? && input.length < 1024 && File.exists?(input)
- expand Wapiti::Dataset.open(input, **opts)
- else
- expand Wapiti::Dataset.parse(input, **opts)
- end
  else
  expand Wapiti::Dataset.parse(input, **opts)
  end