anystyle 1.3.13 → 1.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: '09ec60a796bf13a493c0e91f4df8ad0295d70dde460da704119dfcbbd654a09c'
- data.tar.gz: e17787d8d6a2007b9f815e64c32092f305fdf92e80ec3914da1a72318ec049ee
+ metadata.gz: b7d5642b2c453c00a91748d6e447504c825b0097095cda159d92f2c6d82a11c5
+ data.tar.gz: c29fa4ee58a017404b205418bcf1d45f7111d3005e881f5d869d73572d0cd5e6
  SHA512:
- metadata.gz: ab5a9c7bfeb5b65c0b6345d69a11bcae62ea8df785f3eb212c1e8d345e2882ac44da53098523cc3e07828ca0b0da8557de133e1e95ac21f35fe3f5d9ca405079
- data.tar.gz: 408f6b7f3810e31984c3e8dbbca1831bf98a1197b443d233ff831b824af5bd8d4249935eaa4e3fc02bc0b8d15e7c4d85256e82253c4c8ec60500b8c269d70ff3
+ metadata.gz: 44f8226f33b22aa53b743ce56db3203a1c1bb3e4dd749ca746a0c46563656f36a4da585eebb32f0b8062701f6719e5c0aed9ee5f9fa533a1f49aacdef2d79504
+ data.tar.gz: dcaf21ed636bc50a7fa982e220942b0d752f13714e1a8cda038d1d7b324b7da9c78569903460c10454c5740d191e1abc19254c24222cd2a411880c78fe3933ea
data/HISTORY.md CHANGED
@@ -1,3 +1,9 @@
+ 1.4.0 / 2023-01-06
+ ==================
+ * Removed deprecated string taint checking (@bbonamin).
+ * `AnyStyle::Parser#parse` will no longer automatically open local files.
+   Please call `Wapiti::Dataset.open` explicitly if you relied on this.
+
  1.3.6 / 2019-12-02
  ==================
  * Updated parser model.
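
The 1.4.0 entry above moves the "is this string a file path or reference text?" decision to the caller. The following is a minimal Ruby sketch of such a caller-side check. It assumes the same heuristic as the String branch this release removes (a short string that names an existing file). `local_file?` and `dataset_source` are hypothetical helpers for illustration and are not part of AnyStyle or Wapiti.

```ruby
# Hypothetical helper (not part of AnyStyle): mirrors the length and
# existence checks of the String branch removed in 1.4.0.
def local_file?(input)
  input.is_a?(String) && input.length < 1024 && File.exist?(input)
end

# Hypothetical helper: tags the input so the caller can decide whether to
# open it as a dataset file or hand it to the parser as reference text.
def dataset_source(input)
  local_file?(input) ? [:file, input] : [:text, input]
end
```

With AnyStyle loaded, a `[:file, path]` result would be passed to `Wapiti::Dataset.open(path)` before calling `AnyStyle::Parser#parse`, as the changelog recommends. A `[:text, string]` result can go to `parse` unchanged.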
data/LICENSE CHANGED
@@ -1,5 +1,5 @@
  AnyStyle
- Copyright 2011-2020 Sylvester Keil. All rights reserved.
+ Copyright 2011-2023 Sylvester Keil. All rights reserved.
  
  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions are met:
data/README.md CHANGED
@@ -57,6 +57,9 @@ AnyStyle is available as web application at [anystyle.io](https://anystyle.io).
  The web application [is open source](https://github.com/inukshuk/anystyle.io)
  and you can also host yourself!
  
+ Improving results for your data
+ =================================
+
  Training
  --------
  You can train custom Finder and Parser models. To do this, you need
@@ -105,6 +108,47 @@ When working with training data, it is a good idea to use the
  `Wapiti::Dataset` API in Ruby: it supports all the standard set
  operators and makes it very easy to combine or compare data sets.
  
+ Natural Languages used in AnyStyle
+ ----------------------------------
+
+ As mentioned above, the
+ [core](https://github.com/inukshuk/anystyle/blob/master/res/parser/core.xml)
+ dataset contains the manually marked-up references that are used as the
+ basis for the default AnyStyle parsing model. If the references you are
+ trying to parse include many non-English documents, the distribution of
+ natural languages in this corpus is relevant (detected using [cld](https://github.com/jtoy/cld)).
+
+ | Language | n |
+ |-------------------------|-----|
+ | ENGLISH | 965 |
+ | FRENCH | 54 |
+ | GERMAN | 26 |
+ | ITALIAN | 11 |
+ | Others | 9 |
+ | | |
+ | Not reliably determined | 449 |
+ | (but mainly English) | |
+
+ (These data are based on AnyStyle version 1.3.13)
+
+ There is a strong prevalence of English-language documents with the
+ conventions used in English-language bibliographies, with some
+ representation of other European languages. The languages used reflect
+ those used in scientific publishing as well as the maintainers'
+ competencies. If you are working with many documents in languages other
+ than English, you might consider training the model with some examples
+ in the relevant languages.
+
+ AnyStyle should work with references written in any Latin script
+ (including most European languages, languages such as Indonesian and
+ Malaysian, as well as romanised Arabic, Chinese and Japanese). It should
+ also support languages written with non-Latin alphabets (such as
+ Russian), although no examples of these appear in the default training
+ sets. Languages written in syllabaries or complex symbols which do not
+ use white space to separate tokens are not compatible with AnyStyle's
+ approach: this includes Chinese, Japanese, Arabic as well as many Indian
+ languages.
+
  Dictionary Adapters
  -------------------
  During the statistical analysis of reference strings, AnyStyle relies
@@ -142,6 +186,8 @@ and configure AnyStyle to use the Redis adapter:
  AnyStyle::Dictionary::Redis.defaults[:host] = 'localhost'
  AnyStyle::Dictionary::Redis.defaults[:port] = 6379
  
+ About AnyStyle
+ ==============
  Contributing
  ------------
  The AnyStyle source code is
@@ -167,7 +213,7 @@ to join us! Over the years our main contributors have been:
  
  License
  -------
- Copyright 2011-2020 Sylvester Keil. All rights reserved.
+ Copyright 2011-2023 Sylvester Keil. All rights reserved.
  
  AnyStyle is distributed under a BSD-style license.
  See LICENSE for details.
@@ -10,7 +10,7 @@ module AnyStyle
  end
  
  def open
- if File.exists?(options[:path])
+ if File.exist?(options[:path])
  @db = ::Marshal.load(File.open(options[:path]))
  else
  @db = {}
@@ -18,8 +18,6 @@ module AnyStyle
  end
  
  def open(path, format: File.extname(path), tagged: false, **opts)
- raise ArgumentError,
- "cannot open tainted path: '#{path}'" if path.tainted?
  raise ArgumentError,
  "document not found: '#{path}'" unless File.exist?(path)
  
@@ -8,7 +8,7 @@ module AnyStyle
  compact: true,
  threads: 4,
  format: :references,
- training_data: Dir[File.join(RES, 'finder', '*.ttx')].map(&:untaint),
+ training_data: Dir[File.join(RES, 'finder', '*.ttx')],
  layout: true,
  pdftotext: 'pdftotext',
  pdfinfo: 'pdfinfo'
@@ -13,7 +13,12 @@ module AnyStyle
  when :url
  doi = doi_extract(value) if value =~ /doi\.org\//i
  append item, :doi, doi unless doi.nil?
- URI.extract(value)
+ urls = URI.extract(value, %w(http https ftp ftps))
+ if urls.empty?
+ value
+ else
+ urls
+ end
  when :doi
  doi_extract(value) || value
  else
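
The hunk above restricts `URI.extract` to an explicit scheme list and falls back to the raw value when nothing matches. The following is a minimal standalone sketch of that behaviour using Ruby's standard `uri` library. `extract_urls` is a hypothetical name chosen for illustration and is not AnyStyle's own method.

```ruby
require 'uri'

# Hypothetical standalone version of the :url branch: returns the array of
# http(s)/ftp(s) URLs found in value, or value itself when none are found.
def extract_urls(value)
  urls = URI.extract(value, %w(http https ftp ftps))
  urls.empty? ? value : urls
end
```

Without a scheme list, `URI.extract` accepts any `scheme:` prefix, so strings that merely contain a colon-separated token can be reported as URLs. With the list, such input is passed through unchanged instead of being replaced by a spurious match or an empty array.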
@@ -18,7 +18,8 @@ module AnyStyle
  attr_reader :model, :options, :features, :normalizers, :mtime
  
  def initialize(options = {})
- @options = self.class.defaults.merge(options)
+ def_opts = self.class.defaults || Parser.defaults
+ @options = def_opts.merge(options)
  load_model
  end
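
The fallback to `Parser.defaults` above matters when a subclass does not define its own defaults. The following is a minimal sketch of why. It assumes `defaults` is stored in a class-level instance variable, which subclasses do not inherit. `BaseParser` and `CustomParser` are hypothetical stand-ins, not AnyStyle classes.

```ruby
# Hypothetical stand-in for a parser class whose defaults live in a
# class-level instance variable (an assumption about the layout).
class BaseParser
  @defaults = { model: 'parser.mod' }

  class << self
    attr_reader :defaults
  end

  attr_reader :options

  def initialize(options = {})
    # A subclass has no @defaults of its own, so fall back to the base class.
    def_opts = self.class.defaults || BaseParser.defaults
    @options = def_opts.merge(options)
  end
end

# Subclass that does not set @defaults; CustomParser.defaults is nil.
class CustomParser < BaseParser
end
```

Without the `||` fallback, `CustomParser.new` would call `merge` on `nil` and raise `NoMethodError`.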
 
@@ -85,12 +86,6 @@ module AnyStyle
  expand input
  when Wapiti::Sequence
  expand Wapiti::Dataset.new([input])
- when String
- if !input.tainted? && input.length < 1024 && File.exists?(input)
- expand Wapiti::Dataset.open(input, **opts)
- else
- expand Wapiti::Dataset.parse(input, **opts)
- end
  else
  expand Wapiti::Dataset.parse(input, **opts)
  end