anystyle 1.3.13 → 1.3.14
- checksums.yaml +4 -4
- data/README.md +46 -0
- data/lib/anystyle/normalizer/locator.rb +6 -1
- data/lib/anystyle/parser.rb +2 -1
- data/lib/anystyle/support/parser.mod +14758 -15558
- data/lib/anystyle/version.rb +1 -1
- data/res/parser/core.xml +35 -36
- data/res/parser/gold.xml +0 -27
- metadata +7 -13
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 1bf14266569ad0eb5e812e68d0289780e50ee2cc765faeb1a5fb48b37b1ba4b7
+  data.tar.gz: be96508a8b939c6450342a763cfb469b87d170dab80380b4a82beb7d02b4e0c4
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: a8a2c6f4f0d997b657b9817b8bf2a4bdf62db4e9f53130ed9923de571e9cb6a264e554dd78c4ebaf7180b424734632d526fd46d3accba07e0cd766fb37689fae
+  data.tar.gz: d2ddd8f136ac10a0ddd29b9522fc7779c1ea92c1d441601c801726fa4e304d104de588b38c230fa53102e59e23e2ce0b48b79f4c4bd932c2cb0e3a3f56ccd43a
data/README.md
CHANGED
@@ -57,6 +57,9 @@ AnyStyle is available as web application at [anystyle.io](https://anystyle.io).
 The web application [is open source](https://github.com/inukshuk/anystyle.io)
 and you can also host yourself!
 
+Improving results for your data
+=================================
+
 Training
 --------
 You can train custom Finder and Parser models. To do this, you need
@@ -105,6 +108,47 @@ When working with training data, it is a good idea to use the
 `Wapiti::Dataset` API in Ruby: it supports all the standard set
 operators and makes it very easy to combine or compare data sets.
 
+Natural Languages used in AnyStyle
+----------------------------------
+
+As mentioned above, the
+[core](https://github.com/inukshuk/anystyle/blob/master/res/parser/core.xml)
+dataset contains the manually marked-up references that are used as the
+basis for the default AnyStyle parsing model. If the references you are
+trying to parse include many non-English documents, the distribution of
+natural languages in this corpus is relevant (detected using [cld](https://github.com/jtoy/cld)).
+
+| Language                | n   |
+|-------------------------|-----|
+| ENGLISH                 | 965 |
+| FRENCH                  | 54  |
+| GERMAN                  | 26  |
+| ITALIAN                 | 11  |
+| Others                  | 9   |
+|                         |     |
+| Not reliably determined | 449 |
+| (but mainly English)    |     |
+
+(These data are based on AnyStyle version 1.3.13)
+
+There is a strong prevalence of English-language documents with the
+conventions used in English-language bibliographies, with some
+representation of other European languages. The languages used reflect
+those used in scientific publishing as well as the maintainers'
+competencies. If you are working with many documents in languages other
+than English, you might consider training the model with some examples
+in the relevant languages.
+
+AnyStyle should work with references written in any Latin script
+(including most European languages, languages such as Indonesian and
+Malaysian, as well as romanised Arabic, Chinese and Japanese). It should
+also support languages written with non-Latin alphabets (such as
+Russian), although no examples of these appear in the default training
+sets. Languages written in syllabaries or complex symbols which do not
+use white space to separate tokens are not compatible with AnyStyle's
+approach: this includes Chinese, Japanese, Arabic as well as many Indian
+languages.
+
 Dictionary Adapters
 -------------------
 During the statistical analysis of reference strings, AnyStyle relies
@@ -142,6 +186,8 @@ and configure AnyStyle to use the Redis adapter:
 AnyStyle::Dictionary::Redis.defaults[:host] = 'localhost'
 AnyStyle::Dictionary::Redis.defaults[:port] = 6379
 
+About AnyStyle
+==============
 Contributing
 ------------
 The AnyStyle source code is
data/lib/anystyle/normalizer/locator.rb
CHANGED
@@ -13,7 +13,12 @@ module AnyStyle
           when :url
             doi = doi_extract(value) if value =~ /doi\.org\//i
             append item, :doi, doi unless doi.nil?
-            URI.extract(value)
+            urls = URI.extract(value, %w(http https ftp ftps))
+            if urls.empty?
+              value
+            else
+              urls
+            end
           when :doi
             doi_extract(value) || value
           else
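The locator change above restricts URL extraction to a fixed list of schemes and falls back to the raw value when nothing matches. The following is a minimal sketch of that behavior in isolation, assuming only Ruby's standard `uri` library. The method name `extract_urls` and the constant `SCHEMES` are hypothetical and not part of AnyStyle; the sample strings are made up for illustration.

```ruby
require 'uri'

# Scheme list copied from the diff above; the constant name is hypothetical.
SCHEMES = %w(http https ftp ftps).freeze

# Hypothetical helper mirroring the new :url branch: return the URLs found
# in the string, or the unchanged string when no URL with one of the
# listed schemes is present.
def extract_urls(value)
  urls = URI.extract(value, SCHEMES)
  urls.empty? ? value : urls
end

# A value containing a URL with a listed scheme yields an array of URLs.
with_url = extract_urls('Available at https://example.org/paper.pdf (open access)')

# A value without any listed scheme is passed through unchanged.
without_url = extract_urls('www.example.org/paper')
```

Note that the return type differs between the two cases (array versus string), as in the diff itself. `URI.extract` is marked obsolete in Ruby's documentation and may print a warning when verbose mode is enabled.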
data/lib/anystyle/parser.rb
CHANGED
@@ -18,7 +18,8 @@ module AnyStyle
     attr_reader :model, :options, :features, :normalizers, :mtime
 
     def initialize(options = {})
-
+      def_opts = self.class.defaults || Parser.defaults
+      @options = def_opts.merge(options)
       load_model
     end
 
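The two added lines in `initialize` look up the defaults on the receiver's own class first and fall back to `Parser.defaults` otherwise. One situation where such a fallback matters is a subclass whose class-level `defaults` is `nil`; that this is the motivating case is an assumption, since the diff does not state it. The sketch below uses hypothetical classes (`SketchParser`, `CustomParser`) and made-up option values, not AnyStyle's real ones.

```ruby
# Hypothetical stand-in for a parser class that keeps its defaults in a
# class-level instance variable.
class SketchParser
  class << self
    attr_reader :defaults
  end

  # Made-up option values for illustration only.
  @defaults = { model: 'parser.mod', threads: 4 }.freeze

  attr_reader :options

  def initialize(options = {})
    # Same pattern as the diff: prefer the receiver's class defaults,
    # fall back to the base class when they are nil.
    def_opts = self.class.defaults || SketchParser.defaults
    @options = def_opts.merge(options)
  end
end

# Class-level instance variables are not inherited, so a subclass that
# does not set @defaults itself returns nil from .defaults.
class CustomParser < SketchParser
end

custom = CustomParser.new(threads: 2)
```

Without the fallback, `CustomParser.new` would call `merge` on `nil` and raise a `NoMethodError`. With it, caller-supplied options are merged over the base class defaults.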