anystyle 1.3.13 → 1.4.0
- checksums.yaml +4 -4
- data/HISTORY.md +6 -0
- data/LICENSE +1 -1
- data/README.md +47 -1
- data/lib/anystyle/dictionary/marshal.rb +1 -1
- data/lib/anystyle/document.rb +0 -2
- data/lib/anystyle/finder.rb +1 -1
- data/lib/anystyle/normalizer/locator.rb +6 -1
- data/lib/anystyle/parser.rb +2 -7
- data/lib/anystyle/support/parser.mod +14758 -15558
- data/lib/anystyle/support.rb +2 -2
- data/lib/anystyle/utils.rb +0 -3
- data/lib/anystyle/version.rb +1 -1
- data/res/parser/core.xml +35 -36
- data/res/parser/gold.xml +0 -27
- metadata +9 -15
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: b7d5642b2c453c00a91748d6e447504c825b0097095cda159d92f2c6d82a11c5
+  data.tar.gz: c29fa4ee58a017404b205418bcf1d45f7111d3005e881f5d869d73572d0cd5e6
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 44f8226f33b22aa53b743ce56db3203a1c1bb3e4dd749ca746a0c46563656f36a4da585eebb32f0b8062701f6719e5c0aed9ee5f9fa533a1f49aacdef2d79504
+  data.tar.gz: dcaf21ed636bc50a7fa982e220942b0d752f13714e1a8cda038d1d7b324b7da9c78569903460c10454c5740d191e1abc19254c24222cd2a411880c78fe3933ea
data/HISTORY.md
CHANGED
@@ -1,3 +1,9 @@
+1.4.0 / 2023-01-06
+==================
+* Removed deprecated string taint checking (@bbonamin).
+* `AnyStyle::Parser#parse` will no longer automatically open local files.
+  Please call `Wapiti::Dataset.open` explicitly if you relied on this.
+
 1.3.6 / 2019-12-02
 ==================
 * Updated parser model.
data/LICENSE
CHANGED
@@ -1,5 +1,5 @@
 AnyStyle
-Copyright 2011-
+Copyright 2011-2023 Sylvester Keil. All rights reserved.

 Redistribution and use in source and binary forms, with or without
 modification, are permitted provided that the following conditions are met:
data/README.md
CHANGED
@@ -57,6 +57,9 @@ AnyStyle is available as web application at [anystyle.io](https://anystyle.io).
 The web application [is open source](https://github.com/inukshuk/anystyle.io)
 and you can also host yourself!

+Improving results for your data
+=================================
+
 Training
 --------
 You can train custom Finder and Parser models. To do this, you need
@@ -105,6 +108,47 @@ When working with training data, it is a good idea to use the
 `Wapiti::Dataset` API in Ruby: it supports all the standard set
 operators and makes it very easy to combine or compare data sets.

+Natural Languages used in AnyStyle
+----------------------------------
+
+As mentioned above, the
+[core](https://github.com/inukshuk/anystyle/blob/master/res/parser/core.xml)
+dataset contains the manually marked-up references that are used as the
+basis for the default AnyStyle parsing model. If the references you are
+trying to parse include many non-English documents, the distribution of
+natural languages in this corpus is relevant (detected using [cld](https://github.com/jtoy/cld)).
+
+| Language                | n   |
+|-------------------------|-----|
+| ENGLISH                 | 965 |
+| FRENCH                  | 54  |
+| GERMAN                  | 26  |
+| ITALIAN                 | 11  |
+| Others                  | 9   |
+|                         |     |
+| Not reliably determined | 449 |
+| (but mainly English)    |     |
+
+(These data are based on AnyStyle version 1.3.13)
+
+There is a strong prevalence of English-language documents with the
+conventions used in English-language bibliographies, with some
+representation of other European languages. The languages used reflect
+those used in scientific publishing as well as the maintainers'
+competencies. If you are working with many documents in languages other
+than English, you might consider training the model with some examples
+in the relevant languages.
+
+AnyStyle should work with references written in any Latin script
+(including most European languages, languages such as Indonesian and
+Malaysian, as well as romanised Arabic, Chinese and Japanese). It should
+also support languages written with non-Latin alphabets (such as
+Russian), although no examples of these appear in the default training
+sets. Languages written in syllabaries or complex symbols which do not
+use white space to separate tokens are not compatible with AnyStyle's
+approach: this includes Chinese, Japanese, Arabic as well as many Indian
+languages.
+
 Dictionary Adapters
 -------------------
 During the statistical analysis of reference strings, AnyStyle relies
@@ -142,6 +186,8 @@ and configure AnyStyle to use the Redis adapter:
 AnyStyle::Dictionary::Redis.defaults[:host] = 'localhost'
 AnyStyle::Dictionary::Redis.defaults[:port] = 6379

+About AnyStyle
+==============
 Contributing
 ------------
 The AnyStyle source code is
@@ -167,7 +213,7 @@ to join us! Over the years our main contributors have been:

 License
 -------
-Copyright 2011-
+Copyright 2011-2023 Sylvester Keil. All rights reserved.

 AnyStyle is distributed under a BSD-style license.
 See LICENSE for details.
data/lib/anystyle/document.rb
CHANGED
@@ -18,8 +18,6 @@ module AnyStyle
       end

       def open(path, format: File.extname(path), tagged: false, **opts)
-        raise ArgumentError,
-          "cannot open tainted path: '#{path}'" if path.tainted?
         raise ArgumentError,
           "document not found: '#{path}'" unless File.exist?(path)

data/lib/anystyle/finder.rb
CHANGED
@@ -8,7 +8,7 @@ module AnyStyle
       compact: true,
       threads: 4,
       format: :references,
-      training_data: Dir[File.join(RES, 'finder', '*.ttx')]
+      training_data: Dir[File.join(RES, 'finder', '*.ttx')],
       layout: true,
       pdftotext: 'pdftotext',
       pdfinfo: 'pdfinfo'
data/lib/anystyle/normalizer/locator.rb
CHANGED
@@ -13,7 +13,12 @@ module AnyStyle
           when :url
             doi = doi_extract(value) if value =~ /doi\.org\//i
             append item, :doi, doi unless doi.nil?
-            URI.extract(value)
+            urls = URI.extract(value, %w(http https ftp ftps))
+            if urls.empty?
+              value
+            else
+              urls
+            end
           when :doi
             doi_extract(value) || value
           else
data/lib/anystyle/parser.rb
CHANGED
@@ -18,7 +18,8 @@ module AnyStyle
     attr_reader :model, :options, :features, :normalizers, :mtime

     def initialize(options = {})
-
+      def_opts = self.class.defaults || Parser.defaults
+      @options = def_opts.merge(options)
       load_model
     end

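
The two added lines make a subclass without its own defaults fall back to `Parser.defaults`. A minimal sketch of that pattern follows, assuming `defaults` is a class-level reader that returns `nil` for a subclass that never sets it. The classes and option values here are hypothetical stand-ins, not AnyStyle's own code.

```ruby
# Hypothetical stand-in for the parser class; only the option handling is shown.
class Parser
  class << self
    # Reads a class-level instance variable, which subclasses do not inherit.
    attr_reader :defaults
  end

  @defaults = { model: 'parser.mod', threads: 4 }.freeze

  attr_reader :options

  def initialize(options = {})
    # Same two lines as in the diff: prefer the subclass defaults, else Parser's.
    def_opts = self.class.defaults || Parser.defaults
    @options = def_opts.merge(options)
  end
end

# Hypothetical subclass that defines no defaults of its own.
class CustomParser < Parser; end

p CustomParser.defaults
p CustomParser.new(threads: 1).options
```

Without the `|| Parser.defaults` fallback, `CustomParser.new` in this sketch would call `merge` on `nil` and raise.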
@@ -85,12 +86,6 @@ module AnyStyle
         expand input
       when Wapiti::Sequence
         expand Wapiti::Dataset.new([input])
-      when String
-        if !input.tainted? && input.length < 1024 && File.exists?(input)
-          expand Wapiti::Dataset.open(input, **opts)
-        else
-          expand Wapiti::Dataset.parse(input, **opts)
-        end
       else
         expand Wapiti::Dataset.parse(input, **opts)
       end
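
For callers, the practical effect of removing the `String` branch is that a string is always treated as reference text, even when it names an existing file. The sketch below contrasts the two behaviours with hypothetical helper functions that only report which `Wapiti::Dataset` method would be chosen. It does not load AnyStyle or Wapiti, and the taint check from the old code is omitted because `tainted?` no longer exists in current Ruby.

```ruby
require 'tempfile'

# Hypothetical stand-in for the 1.3.x String branch: a short string naming
# an existing file was opened; anything else was parsed as reference text.
def legacy_dispatch(input)
  if input.length < 1024 && File.exist?(input)
    :open
  else
    :parse
  end
end

# Hypothetical stand-in for 1.4.0: strings are always parsed as text.
def current_dispatch(_input)
  :parse
end

Tempfile.create('refs') do |file|
  p legacy_dispatch(file.path)   # path to a file that exists
  p current_dispatch(file.path)  # same path under the new behaviour
end
p legacy_dispatch('Some reference text, 2020.')
```

Code that relied on the old behaviour would, per the HISTORY entry, call `Wapiti::Dataset.open` on the path explicitly, presumably passing the resulting dataset to `parse`.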