anystyle 1.3.13 → 1.4.0
- checksums.yaml +4 -4
- data/HISTORY.md +6 -0
- data/LICENSE +1 -1
- data/README.md +47 -1
- data/lib/anystyle/dictionary/marshal.rb +1 -1
- data/lib/anystyle/document.rb +0 -2
- data/lib/anystyle/finder.rb +1 -1
- data/lib/anystyle/normalizer/locator.rb +6 -1
- data/lib/anystyle/parser.rb +2 -7
- data/lib/anystyle/support/parser.mod +14758 -15558
- data/lib/anystyle/support.rb +2 -2
- data/lib/anystyle/utils.rb +0 -3
- data/lib/anystyle/version.rb +1 -1
- data/res/parser/core.xml +35 -36
- data/res/parser/gold.xml +0 -27
- metadata +9 -15
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: b7d5642b2c453c00a91748d6e447504c825b0097095cda159d92f2c6d82a11c5
+  data.tar.gz: c29fa4ee58a017404b205418bcf1d45f7111d3005e881f5d869d73572d0cd5e6
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 44f8226f33b22aa53b743ce56db3203a1c1bb3e4dd749ca746a0c46563656f36a4da585eebb32f0b8062701f6719e5c0aed9ee5f9fa533a1f49aacdef2d79504
+  data.tar.gz: dcaf21ed636bc50a7fa982e220942b0d752f13714e1a8cda038d1d7b324b7da9c78569903460c10454c5740d191e1abc19254c24222cd2a411880c78fe3933ea
data/HISTORY.md
CHANGED
@@ -1,3 +1,9 @@
+1.4.0 / 2023-01-06
+==================
+* Removed deprectate string taint checking (@bbonamin).
+* `AnyStyle::Parser#parse` will no longer automatically open local files.
+Please call `Wapiti::Dataset.open` explicitly if you relied on this.
+
 1.3.6 / 2019-12-02
 ==================
 * Updated parser model.
data/LICENSE
CHANGED
@@ -1,5 +1,5 @@
 AnyStyle
-Copyright 2011-
+Copyright 2011-2023 Sylvester Keil. All rights reserved.
 
 Redistribution and use in source and binary forms, with or without
 modification, are permitted provided that the following conditions are met:
data/README.md
CHANGED
@@ -57,6 +57,9 @@ AnyStyle is available as web application at [anystyle.io](https://anystyle.io).
 The web application [is open source](https://github.com/inukshuk/anystyle.io)
 and you can also host yourself!
 
+Improving results for your data
+=================================
+
 Training
 --------
 You can train custom Finder and Parser models. To do this, you need
@@ -105,6 +108,47 @@ When working with training data, it is a good idea to use the
 `Wapiti::Dataset` API in Ruby: it supports all the standard set
 operators and makes it very easy to combine or compare data sets.
 
+Natural Languages used in AnyStyle
+----------------------------------
+
+As mentioned above, the
+[core](https://github.com/inukshuk/anystyle/blob/master/res/parser/core.xml)
+dataset contains the manually marked-up references that are used as the
+basis for the default AnyStyle parsing model. If the references you are
+trying to parse include many non-English documents, the distribution of
+natural languages in this corpus is relevant (detected using [cld](https://github.com/jtoy/cld)).
+
+| Language                | n   |
+|-------------------------|-----|
+| ENGLISH                 | 965 |
+| FRENCH                  | 54  |
+| GERMAN                  | 26  |
+| ITALIAN                 | 11  |
+| Others                  | 9   |
+|                         |     |
+| Not reliably determined | 449 |
+| (but mainly English)    |     |
+
+(These data are based on AnyStyle version 1.3.13)
+
+There is a strong prevalence of English-language documents with the
+conventions used in English-language bibliographies, with some
+representation of other European languages. The languages used reflect
+those used in scientific publishing as well as the maintainers'
+competencies. If you are working with many documents in languages other
+than English, you might consider training the model with some examples
+in the relevant languages.
+
+AnyStyle should work with references written in any Latin script
+(including most European languages, languages such as Indonesian and
+Malaysian, as well as romanised Arabic, Chinese and Japanese). It should
+also support languages written with non-Latin alphabets (such as
+Russian), although no examples of these appear in the default training
+sets. Languages written in syllabaries or complex symbols which do not
+use white space to separate tokens are not compatible with AnyStyle's
+approach: this includes Chinese, Japanese, Arabic as well as many Indian
+languages.
+
 Dictionary Adapters
 -------------------
 During the statistical analysis of reference strings, AnyStyle relies
@@ -142,6 +186,8 @@ and configure AnyStyle to use the Redis adapter:
 AnyStyle::Dictionary::Redis.defaults[:host] = 'localhost'
 AnyStyle::Dictionary::Redis.defaults[:port] = 6379
 
+About AnyStyle
+==============
 Contributing
 ------------
 The AnyStyle source code is
@@ -167,7 +213,7 @@ to join us! Over the years our main contributors have been:
 
 License
 -------
-Copyright 2011-
+Copyright 2011-2023 Sylvester Keil. All rights reserved.
 
 AnyStyle is distributed under a BSD-style license.
 See LICENSE for details.
data/lib/anystyle/document.rb
CHANGED
@@ -18,8 +18,6 @@ module AnyStyle
 end
 
 def open(path, format: File.extname(path), tagged: false, **opts)
-  raise ArgumentError,
-    "cannot open tainted path: '#{path}'" if path.tainted?
   raise ArgumentError,
     "document not found: '#{path}'" unless File.exist?(path)
 
data/lib/anystyle/finder.rb
CHANGED
@@ -8,7 +8,7 @@ module AnyStyle
 compact: true,
 threads: 4,
 format: :references,
-training_data: Dir[File.join(RES, 'finder', '*.ttx')]
+training_data: Dir[File.join(RES, 'finder', '*.ttx')],
 layout: true,
 pdftotext: 'pdftotext',
 pdfinfo: 'pdfinfo'
data/lib/anystyle/normalizer/locator.rb
CHANGED
@@ -13,7 +13,12 @@ module AnyStyle
 when :url
   doi = doi_extract(value) if value =~ /doi\.org\//i
   append item, :doi, doi unless doi.nil?
-  URI.extract(value)
+  urls = URI.extract(value, %w(http https ftp ftps))
+  if urls.empty?
+    value
+  else
+    urls
+  end
 when :doi
   doi_extract(value) || value
 else
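The `when :url` change above can be tried in isolation. The following is a minimal sketch, not the AnyStyle normalizer itself: the helper name `extract_urls` and the constant `SCHEMES` are hypothetical and introduced only for illustration. It relies on `URI.extract` from Ruby's standard library, which Ruby's documentation marks as obsolete but which the diffed code also calls.

```ruby
require 'uri'

# Hypothetical constant for this sketch: the scheme list used in the diff.
SCHEMES = %w(http https ftp ftps).freeze

# Hypothetical helper mirroring the new `when :url` branch.
def extract_urls(value)
  # Only URIs that start with one of the listed schemes are extracted.
  urls = URI.extract(value, SCHEMES)
  # Fall back to the raw value instead of returning an empty list.
  urls.empty? ? value : urls
end

p extract_urls('Available at https://example.org/paper.pdf')
p extract_urls('www.example.org/paper.pdf')
```

With this shape, a value that carries no recognizable scheme is passed through unchanged rather than being replaced by an empty array, which appears to be the point of the added `if urls.empty?` branch.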
data/lib/anystyle/parser.rb
CHANGED
@@ -18,7 +18,8 @@ module AnyStyle
 attr_reader :model, :options, :features, :normalizers, :mtime
 
 def initialize(options = {})
-
+  def_opts = self.class.defaults || Parser.defaults
+  @options = def_opts.merge(options)
   load_model
 end
 
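The fallback `self.class.defaults || Parser.defaults` in the hunk above is useful when defaults live in a class-level instance variable, because such variables are not inherited by subclasses. The sketch below shows that pattern under that assumption. The class names `BaseParser`, `CustomParser` and `TunedParser` and the option values are hypothetical stand-ins, not AnyStyle's classes, and the diff does not confirm that this is how AnyStyle stores its defaults.

```ruby
# Hypothetical stand-in for a parser class with class-level defaults.
class BaseParser
  @defaults = { model: 'parser.mod', threads: 4 }.freeze

  class << self
    # Reads a class-level instance variable, which subclasses do NOT inherit.
    attr_reader :defaults
  end

  attr_reader :options

  def initialize(options = {})
    # Use the subclass defaults if present, otherwise the base defaults.
    def_opts = self.class.defaults || BaseParser.defaults
    @options = def_opts.merge(options)
  end
end

# Subclass without its own defaults: `CustomParser.defaults` is nil.
class CustomParser < BaseParser
end

# Subclass that defines its own defaults.
class TunedParser < BaseParser
  @defaults = { model: 'tuned.mod', threads: 2 }.freeze
end

p CustomParser.new(threads: 8).options
p TunedParser.new.options
```

Without the `||` fallback, `CustomParser.new` would call `merge` on `nil` and raise a `NoMethodError` in this sketch.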
@@ -85,12 +86,6 @@ module AnyStyle
   expand input
 when Wapiti::Sequence
   expand Wapiti::Dataset.new([input])
-when String
-  if !input.tainted? && input.length < 1024 && File.exists?(input)
-    expand Wapiti::Dataset.open(input, **opts)
-  else
-    expand Wapiti::Dataset.parse(input, **opts)
-  end
 else
   expand Wapiti::Dataset.parse(input, **opts)
 end