commonmeta-ruby 3.2.11 → 3.2.13
- checksums.yaml +4 -4
- data/Gemfile.lock +1 -1
- data/bin/commonmeta +1 -1
- data/lib/commonmeta/cli.rb +8 -2
- data/lib/commonmeta/readers/json_feed_reader.rb +4 -4
- data/lib/commonmeta/utils.rb +4 -0
- data/lib/commonmeta/version.rb +1 -1
- data/spec/cli_spec.rb +9 -0
- data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/json_feed/json_feed_by_blog.yml +1791 -0
- data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/json_feed/json_feed_not_indexed.yml +2155 -0
- data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/json_feed/json_feed_unregistered.yml +2010 -0
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/by_blog_id.yml +1791 -0
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/not_indexed_posts.yml +2155 -0
- data/spec/readers/json_feed_reader_spec.rb +7 -3
- metadata +7 -2
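The new spec fixtures are VCR cassettes: recorded HTTP responses from the Rogue Scholar API that let the specs run offline. As a minimal sketch (not the commonmeta-ruby reader implementation), here is how such a cassette can be parsed with only the Ruby standard library. The embedded cassette is abridged from the recorded `https://rogue-scholar.org/api/blogs/tyfqw20` response in the diff below, keeping just a few of its fields.

```ruby
require "yaml"
require "json"

# Abridged VCR cassette with the same structure as the full recording
# in the diff below (http_interactions -> response -> body -> string).
cassette = <<~YAML
  ---
  http_interactions:
  - request:
      method: get
      uri: https://rogue-scholar.org/api/blogs/tyfqw20
    response:
      status:
        code: 200
        message: OK
      body:
        encoding: UTF-8
        string: '{"id":"tyfqw20","title":"iPhylo","prefix":"10.59350","items":[{"id":"https://doi.org/10.59350/btdk4-42879"}]}'
YAML

# The recorded JSON feed is stored as a YAML string scalar; parse it back out.
interaction = YAML.safe_load(cassette)["http_interactions"].first
blog = JSON.parse(interaction["response"]["body"]["string"])

puts blog["id"]                  # prints "tyfqw20"
puts blog["prefix"]              # prints "10.59350"
puts blog["items"].first["id"]   # prints the first post's DOI URL
```

Note that long response bodies are wrapped by the YAML serializer across many indented continuation lines, which is why the recorded JSON in the hunk below spans hundreds of diff lines.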
@@ -0,0 +1,1791 @@
+---
+http_interactions:
+- request:
+    method: get
+    uri: https://rogue-scholar.org/api/blogs/tyfqw20
+    body:
+      encoding: UTF-8
+      string: ''
+    headers:
+      Connection:
+      - close
+      Host:
+      - rogue-scholar.org
+      User-Agent:
+      - http.rb/5.1.1
+  response:
+    status:
+      code: 200
+      message: OK
+    headers:
+      Age:
+      - '0'
+      Cache-Control:
+      - public, max-age=0, must-revalidate
+      Content-Length:
+      - '162607'
+      Content-Type:
+      - application/json; charset=utf-8
+      Date:
+      - Thu, 15 Jun 2023 20:47:10 GMT
+      Etag:
+      - '"6w7me0q1i23h72"'
+      Server:
+      - Vercel
+      Strict-Transport-Security:
+      - max-age=63072000
+      X-Matched-Path:
+      - "/api/blogs/[slug]"
+      X-Vercel-Cache:
+      - MISS
+      X-Vercel-Id:
+      - fra1::iad1::95rf9-1686862029064-f08aa4b0d0a5
+      Connection:
+      - close
+    body:
+      encoding: UTF-8
+      string: '{"id":"tyfqw20","title":"iPhylo","description":"Rants, raves (and occasionally
+        considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For
+        more ranty and less considered opinions, see my <a href=\"https://twitter.com/rdmpage\">Twitter
+        feed</a>.<br>ISSN 2051-8188. Written content on this site is licensed under
+        a <a href=\"https://creativecommons.org/licenses/by/4.0/\">Creative Commons
+        Attribution 4.0 International license</a>.","language":"en","favicon":null,"feed_url":"https://iphylo.blogspot.com/feeds/posts/default?alt=rss","feed_format":"application/rss+xml","home_page_url":"https://iphylo.blogspot.com/","indexed_at":"2023-02-06","modified_at":"2023-05-31T17:26:00+00:00","license":"https://creativecommons.org/licenses/by/4.0/legalcode","generator":"Blogger","category":"Natural
+        Sciences","backlog":true,"prefix":"10.59350","items":[{"id":"https://doi.org/10.59350/btdk4-42879","uuid":"3e1278f6-e7c0-43e1-bb54-6829e1344c0d","url":"https://iphylo.blogspot.com/2022/09/the-ideal-taxonomic-journal.html","title":"The
+        ideal taxonomic journal","summary":"This is just some random notes on an “ideal”
+        taxonomic journal, inspired in part by some recent discussions on “turbo-taxonomy”
+        (e.g., https://doi.org/10.3897/zookeys.1087.76720 and https://doi.org/10.1186/1742-9994-10-15),
+        and also examples such as the Australian Journal of Taxonomy https://doi.org/10.54102/ajt.qxi3r
+        which seems well-intentioned but limited. XML One approach is to have highly
+        structured text that embeds detailed markup, and ideally a tool that generates
+        markup in XML. This is...","date_published":"2022-09-29T14:00:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
+        (Roderic Page)"}],"image":null,"content_html":"<p>This is just some random
+        notes on an “ideal” taxonomic journal, inspired in part by some recent discussions
+        on “turbo-taxonomy” (e.g., <a href=\"https://doi.org/10.3897/zookeys.1087.76720\">https://doi.org/10.3897/zookeys.1087.76720</a>
+        and <a href=\"https://doi.org/10.1186/1742-9994-10-15\">https://doi.org/10.1186/1742-9994-10-15</a>),
+        and also examples such as the Australian Journal of Taxonomy <a href=\"https://doi.org/10.54102/ajt.qxi3r\">https://doi.org/10.54102/ajt.qxi3r</a>
+        which seems well-intentioned but limited.</p>\n<h2 id=\"xml\">XML</h2>\n<p>One
+        approach is to have highly structured text that embeds detailed markup, and
+        ideally a tool that generates markup in XML. This is the approach taken by
+        Pensoft. There is an inevitable trade-off between the burden on authors of
+        marking up text versus making the paper machine readable. In some ways this
+        seems misplaced effort given that there is little evidence that publications
+        by themselves have much value (see <a href=\"https://iphylo.blogspot.com/2021/12/the-business-of-extracting-knowledge.html\">The
+        Business of Extracting Knowledge from Academic Publications</a>). “Value”
+        in this case means as a source of data or factual statements that we can compute
+        over. Human-readable text is not a good way to convey this sort of information.</p>\n<p>It’s
+        also interesting that many editing tools are going in the opposite direction,
+        for example there are minimalist tools using <a href=\"https://en.wikipedia.org/wiki/Markdown\">Markdown</a>
+        where the goal is to <em>get out of the author’s way</em>, rather than impose
+        a way of writing. Text is written by humans for humans, so the tools should
+        be human-friendly.</p>\n<p>The idea of publishing using XML is attractive
+        in that it gives you XML that can be archived by, say, PubMed Central, but
+        other than that the value seems limited. A cursory glance at download stats
+        for journals that provide PDF and XML downloads, such as <em>PLoS One</em>
+        and <em>ZooKeys</em>, PDF is by far the more popular format. So arguably there
+        is little value in providing XML. Those who have tried to use JATS-XML as
+        an authoring tool have not had a happy time: <a href=\"https://doi.org/10.7557/15.5517\">How
+        we tried to JATS XML</a>. However, there are various tools to help with the
+        process, such as <a href=\"https://github.com/Vitaliy-1/docxToJats\">docxToJats</a>,<br>\ntexture,
+        and <a href=\"https://github.com/elifesciences/jats-xml-to-pdf\">jats-xml-to-pdf</a>
+        if this is the route one wants to take.</p>\n<h2 id=\"automating-writing-manuscripts\">Automating
+        writing manuscripts</h2>\n<p>The dream, of course, is to have a tool where
+        you store all your taxonomic data (literature, specimens, characters, images,
+        sequences, media files, etc.) and at the click of a button generate a paper.
+        Certainly some of this can be automated, much nomenclatural and specimen information
+        could be converted to human-readable text. Ideally this computer-generated
+        text would not be edited (otherwise it could get out of sync with the underlying
+        data). The text should be <a href=\"https://en.wikipedia.org/wiki/Transclusion\">transcluded</a>.
+        As an aside, one way to do this would be to include things such as lists of
+        material examined as images rather than text while the manuscript is being
+        edited. In the same way that you (probably) wouldn’t edit a photograph within
+        your text editor, you shouldn’t be editing data. When the manuscript is published
+        the data-generated portions can then be output as text.</p>\n<p>Of course
+        all of this assumes that we have taxonomic data in a database (or some other
+        storage format, including plain text and Mark-down, e.g. <a href=\"https://iphylo.blogspot.com/2022/04/obsidian-markdown-and-taxonomic-trees.html\">Obsidian,
+        markdown, and taxonomic trees</a>) that can generate outputs in the various
+        formats that we need.</p>\n<h2 id=\"archiving-data-and-images\">Archiving
+        data and images</h2>\n<p>One of the really nice things that <a href=\"http://plazi.org\">Plazi</a>
+        do is have a pipeline that sends taxonomic descriptions and images to Zenodo,
+        and similar data to GBIF. Any taxonomic journal should be able to do this.
+        Indeed, arguably each taxonomic treatment within the paper should be linked
+        to the Zenodo DOI at the time of publication. Indeed, we could imagine ultimately
+        having treatments as transclusions within the larger manuscript. Alternatively
+        we could store the treatments as parts of the larger article (rather like
+        chapters in a book), each with a CrossRef DOI. I’m still sceptical about whether
+        these treatments are as important as we make out, see <a href=\"https://iphylo.blogspot.com/2022/09/does-anyone-cite-taxonomic-treatments.html\">Does
+        anyone cite taxonomic treatments?</a>. But having machine-readable taxonomic
+        data archived and accessible is a good thing. Uploading the same data to GBIF
+        makes much of that data immediately accessible. Now that GBIF offers <a href=\"https://www.gbif.org/composition/3kQFinjwHbCGZeLb5OhwN2/gbif-hosted-portals\">hosted
+        portals</a> there is the possibility of having custom interfaces to data from
+        a particular journal.</p>\n<h2 id=\"name-and-identifier-registration\">Name
+        and identifier registration</h2>\n<p>We would also want automatic registration
+        of new taxonomic names, for which there are pipelines (see “A common registration-to-publication
+        automated pipeline for nomenclatural acts for higher plants (International
+        Plant Names Index, IPNI), fungi (Index Fungorum, MycoBank) and animals (ZooBank)”
+        <a href=\"https://doi.org/10.3897/zookeys.550.9551\">https://doi.org/10.3897/zookeys.550.9551</a>).
+        These pipelines do not seem to be documented in much detail, and the data
+        formats differ across registration agencies (e.g., IPNI and ZooBank). For
+        example, ZooBank seems to require TaxPub XML.</p>\n<p>Registration of names
+        and identifiers, especially across multiple registration agencies (ZooBank,
+        CrossRef, DataCite, etc.) requires some coordination, especially when one
+        registration agency requires identifiers from another.</p>\n<h2 id=\"summary\">Summary</h2>\n<p>If
+        data is key, then the taxonomic paper itself becomes something of a wrapper
+        around that data. It still serves the function of being human-readable, providing
+        broader context for the work, and as an archive that conforms to currently
+        accepted ways to publish taxonomic names. But in some ways it is the last
+        interesting part of the process.</p>\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/37y2z-gre70","uuid":"f3629c86-06e0-42c0-844a-266b03a91ef1","url":"https://iphylo.blogspot.com/2023/05/ten-years-and-million-links.html","title":"Ten
+        years and a million links","summary":"As trailed on a Twitter thread last
+        week I’ve been working on a manuscript describing the efforts to map taxonomic
+        names to their original descriptions in the taxonomic literature. Putting
+        together a manuscript on linking taxonomic names to the primary literature,
+        basically “um, what, exactly, have you been doing all these years?”. TL;DR
+        Across fungi, plants, and animals approx 1.3 million names have been linked
+        to a persistent identifier for a publication.— Roderic Page (@rdmpage) May
+        25,...","date_published":"2023-05-31T17:26:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
+        (Roderic Page)"}],"image":null,"content_html":"<p>As trailed on a Twitter
+        thread last week I’ve been working on a manuscript describing the efforts
+        to map taxonomic names to their original descriptions in the taxonomic literature.</p>\n<blockquote
+        class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">Putting together a manuscript
+        on linking taxonomic names to the primary literature, basically “um, what,
+        exactly, have you been doing all these years?”. TL;DR Across fungi, plants,
+        and animals approx 1.3 million names have been linked to a persistent identifier
+        for a publication.</p>— Roderic Page (@rdmpage) <a href=\"https://twitter.com/rdmpage/status/1661714128413573120?ref_src=twsrc%5Etfw\">May
+        25, 2023</a></blockquote> \n<p>The preprint is on bioRxiv <a href=\"https://doi.org/10.1101/2023.05.29.542697\">doi:10.1101/2023.05.29.542697</a></p>\n<blockquote>\n<p>A
+        major gap in the biodiversity knowledge graph is a connection between taxonomic
+        names and the taxonomic literature. While both names and publications often
+        have persistent identifiers (PIDs), such as Life Science Identifiers (LSIDs)
+        or Digital Object Identifiers (DOIs), LSIDs for names are rarely linked to
+        DOIs for publications. This article describes efforts to make those connections
+        across three large taxonomic databases: Index Fungorum, International Plant
+        Names Index (IPNI), and the Index of Organism Names (ION). Over a million
+        names have been matched to DOIs or other persistent identifiers for taxonomic
+        publications. This represents approximately 36% of names for which publication
+        data is available. The mappings between LSIDs and publication PIDs are made
+        available through ChecklistBank. Applications of this mapping are discussed,
+        including a web app to locate the citation of a taxonomic name, and a knowledge
+        graph that uses data on researcher’s ORCID ids to connect taxonomic names
+        and publications to authors of those names.</p>\n</blockquote>\n<p>Much of
+        the work has been linking taxa to names, which still has huge gaps. There
+        are also interesting differences in coverage between plants, animals, and
+        fungi (see preprint for details).</p>\n\n<div class=\"separator\" style=\"clear:
+        both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdWsSQhqi1DErXMIHm28g37-fiALNIsI5eQZmvoX_Fe03ZSwtKHbYt-LCsCCAUop0AGcwy_w7NpIjylVH1hNrM9oW-6j9e6tHASha49TTqFvDg2_tEx3r74RRFsjUo4M_Qat8NmKaZSChOt2hI3LsMjTVLrEVirEckU-9Ei7ug-7OHQlR4LA/s2276/animals-coverage.png\"
+        style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
+        border=\"0\" width=\"320\" data-original-height=\"2276\" data-original-width=\"2276\"
+        src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdWsSQhqi1DErXMIHm28g37-fiALNIsI5eQZmvoX_Fe03ZSwtKHbYt-LCsCCAUop0AGcwy_w7NpIjylVH1hNrM9oW-6j9e6tHASha49TTqFvDg2_tEx3r74RRFsjUo4M_Qat8NmKaZSChOt2hI3LsMjTVLrEVirEckU-9Ei7ug-7OHQlR4LA/s320/animals-coverage.png\"/></a></div><div
+        class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdyxlVJ-oyMCNPmHtHWjSxdxMSJvgzdWRGRF6Ad4dk7ab7gGDpuKdKmS9XhROkopw361ylfsTd1ZkwkF6BN0JlWNnVLCKY1AfryCfWKHkgPQM7u-0SELW9j8RlQIflb6ibaV64gwW7oJrEvOGECvR51F8EW8cRg-1usW-GBM5ymObj7zlObQ/s2276/fungi-coverage.png\"
+        style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
+        border=\"0\" width=\"320\" data-original-height=\"2276\" data-original-width=\"2276\"
+        src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdyxlVJ-oyMCNPmHtHWjSxdxMSJvgzdWRGRF6Ad4dk7ab7gGDpuKdKmS9XhROkopw361ylfsTd1ZkwkF6BN0JlWNnVLCKY1AfryCfWKHkgPQM7u-0SELW9j8RlQIflb6ibaV64gwW7oJrEvOGECvR51F8EW8cRg-1usW-GBM5ymObj7zlObQ/s320/fungi-coverage.png\"/></a></div><div
+        class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgf0YBuvNSXWAJTfQ1jk4XSocMzCYHP7t6IPUqhjQ3mftgM_850igWaD2copgNH6Xk6T62xBU641wvwOvXgCCDY3m2xC_gaILXO9RGx8H3Gpy5OOncsLb9smpT2LIgtYOExVBVdDRWqA0AZ8-mQjWL7dL5TiG7MqVu8spT8ACoGOPR_T36hRA/s2276/plants-coverage.png\"
+        style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
+        border=\"0\" width=\"320\" data-original-height=\"2276\" data-original-width=\"2276\"
+        src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgf0YBuvNSXWAJTfQ1jk4XSocMzCYHP7t6IPUqhjQ3mftgM_850igWaD2copgNH6Xk6T62xBU641wvwOvXgCCDY3m2xC_gaILXO9RGx8H3Gpy5OOncsLb9smpT2LIgtYOExVBVdDRWqA0AZ8-mQjWL7dL5TiG7MqVu8spT8ACoGOPR_T36hRA/s320/plants-coverage.png\"/></a></div>\n\n\nThere
+        is also a simple app to demonstrate these links, see <a href=\"https://species-cite.herokuapp.com\">https://species-cite.herokuapp.com</a>.\n\n\n\n<blockquote>\n<p>Written
+        with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/92rdb-5fe58","uuid":"d33d4f49-b281-4997-9eb9-dbad1e52d9bd","url":"https://iphylo.blogspot.com/2022/09/local-global-identifiers-for.html","title":"Local
+        global identifiers for decentralised wikis","summary":"I''ve been thinking
+        a bit about how one could use a Markdown wiki-like tool such as Obsidian to
+        work with taxonomic data (see earlier posts Obsidian, markdown, and taxonomic
+        trees and Personal knowledge graphs: Obsidian, Roam, Wikidata, and Xanadu).
+        One \"gotcha\" would be how to name pages. If we treat the database as entirely
+        local, then the page names don''t matter, but what if we envisage sharing
+        the database, or merging it with others (for example, if we divided a taxon
+        up into chunks, and...","date_published":"2022-09-08T16:09:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
+        (Roderic Page)"}],"image":null,"content_html":"<p>I''ve been thinking a bit
+        about how one could use a Markdown wiki-like tool such as Obsidian to work
+        with taxonomic data (see earlier posts <a href=\"https://iphylo.blogspot.com/2022/04/obsidian-markdown-and-taxonomic-trees.html\">Obsidian,
+        markdown, and taxonomic trees</a> and <a href=\"https://iphylo.blogspot.com/2020/08/personal-knowledge-graphs-obsidian-roam.html\">Personal
+        knowledge graphs: Obsidian, Roam, Wikidata, and Xanadu</a>).</p>\n\n<p>One
+        \"gotcha\" would be how to name pages. If we treat the database as entirely
+        local, then the page names don''t matter, but what if we envisage sharing
+        the database, or merging it with others (for example, if we divided a taxon
+        up into chunks, and different people worked on those different chunks)? </p>\n\n<p>This
+        is the attraction of globally unique identifiers. You and I can independently
+        work on the same thing, such as data linked to scientific paper, safe in the
+        knowledge that if we both use the DOI for that paper we can easily combine
+        what we''ve done. But global identifiers can also be a pain, especially if
+        we need to use a service to look them up (\"is there a DOI for this paper?\",
+        \"what is the LSID for this taxonomic name?\").</p>\n\n<p>Life would be easier
+        if we could generate identifiers \"locally\", but had some assurance that
+        they would be globally unique, and that anyone else generating an identifier
+        for the same thing would arrive at the same identifier (this eliminates things
+        such as <a href=\"https://en.wikipedia.org/wiki/Universally_unique_identifier\">UUIDs</a>
+        which are intentionally designed to prvent people genrrating the same identifier).
+        One approach is \"content addressing\" (see, e.g. <a href=\"https://web.archive.org/web/20210514054054/https://bentrask.com/notes/content-addressing.html\">Principles
+        of Content Addressing</a> - dead link but in the Wayabck Machine, see also
+        <a href=\"https://github.com/btrask/stronglink\">btrask/stronglink</a>). For
+        example, we can generate a cryptographic hash of a file (such as a PDF) and
+        use that as the identifier.</p>\n\n<p>Now the problem is that we have globally
+        unique, but ugly and unfriendly identifiers (such as \"6c98136eba9084ea9a5fc0b7693fed8648014505\").
+        What we need are nice, easy to use identifiers we can use as page names. <a
+        href=\"https://species.wikimedia.org/wiki/Main_Page\">Wikispecies</a> serves
+        as a possible role model, where taxon names serve as page names, as do simplified
+        citations (e.g., authors and years). This model runs into the problem that
+        taxon names aren''t unique, nor are author + year combinations. In Wikispecies
+        this is resolved by having a centralised database where it''s first come,
+        first served. If there is a name clash you have to create a new name for your
+        page. This works, but what if you have multiple databases un by different
+        people? How do we ensure the identifiers are the same?</p>\n\n<p>Then I remembered
+        Roger Hyam''s flight of fantasy over a decade ago: <a href=\"http://www.hyam.net/blog/archives/1007\">SpeciesIndex.org
+        – an impractical, practical solution</a>. He proposed the following rules
+        to generate a unique URI for a taxonomic name:\n\n<ul>\n <li>The URI must
+        start with \"http://speciesindex.org\" followed by one or more of the following
+        separated by slashes.</li>\n\n <li>First word of name. Must only contain
+        letters. Must not be the same as one of the names of the nomenclatural codes
+        (icbn or iczn). Optional but highly recommended.</li> \n\n <li>Second word
+        of name. Must only contain letters and not be a nomenclatural code name. Optional.</li>
+        \n\n <li>Third word of name. Must only contain letters and not be a nomenclatural
+        code name. Optional.</li> \n\n <li>Year of publication. Must be an integer
+        greater than 1650 and equal to or less than the current year. If this is an
+        ICZN name then this should be the year the species (epithet) was published
+        as is commonly cited after the name. If this is an ICBN name at species or
+        below then it is the date of the combination. Optional. Recommended for zoological
+        names if known. Not recommended for botanical names unless there is a known
+        problem with homonyms in use by non-taxonomists.</li>\n \n<li>Nomenclatural
+        code governing the name of the taxon. Currently this must be either ''icbn''
+        or ''iczn''. This may be omitted if the code is unknown or not relevant. Other
+        codes may be added to this list.</li> \n <li>Qualifier This must be a Version
+        4 RFC-4122 UUID. Optional. Used to generate a new independent identifier for
+        a taxon for which the conventional name is unknown or does not exist or to
+        indicate a particular taxon concept that bears the embedded name.</li>\n\n <li>The
+        whole speciesindex.org URI string should be considered case\nsensitive. Everything
+        should be lower case apart from the first letter of words that are specified
+        as having upper case in their relevant codes e.g. names at and above the rank
+        of genus.</li>\n</ul>\n</p>\n\n<p>Roger is basically arging that while names
+        aren''t unique (i.e., we have homonyms such as <i>Abronia</i>) they are pretty
+        close to being so, and with a few tweaks we can come up with a unique representation.
+        Another way to think about this if we had a database of all taxonomics, we
+        could construct a <a href=\"https://en.wikipedia.org/wiki/Trie\">trie</a>
+        and for each name find the shortest set of name parts (genus, species, etc),
+        year, and code that gave us a unique string for that name. In many cases the
+        species name may be all we need, in other cases we may need to add year and/or
+        nomenclatural code to arrive at a unique string. \n\n</p>\n\n<p>What about
+        bibliographic references? Well many of us will have databases (e.g., Endnote,
+        Mendeley, Zotero, etc.) which generate \"cite keys\". These are typically
+        short, memorable identifiers for a reference that are unique within that database.
+        There is an interesting discussion on the <a href=\"https://discourse.jabref.org/t/universal-citekey-generator/2441/2\">JabRef
+        forum</a> regarding a \"Universal Citekey Generator\", and source code is
+        available <a href=\"https://github.com/cparnot/universal-citekey-js\">cparnot/universal-citekey-js</a>.
+        I''ve yet to explore this in detail, but it looks a promising way to generate
+        unique identifiers from basic metadata (echos of more elaborate schemes such
+        as <a href=\"https://en.wikipedia.org/wiki/Serial_Item_and_Contribution_Identifier\">SICIs</a>).
+        For example,\n\n<blockquote>Senna AR, Guedes UN, Andrade LF, Pereira-Filho
+        GH. 2021. A new species of amphipod Pariphinotus Kunkel, 1910 (Amphipoda:
+        Phliantidae) from Southwestern Atlantic. Zool Stud 60:57. doi:10.6620/ZS.2021.60-57.</blockquote>\n\nbecomes
+        \"Senna:2021ck\". So if two people have the same, core, metadata for a paper
+        they can generate the same key.</p>\n\n<p>Hence it seems with a few conventions
+        (and maybe some simple tools to support them) we could have decentralised
+        wiki-like tools that used the same identifiers for the same things, and yet
+        those identfiiers were short and human-friendly.</p>","tags":["citekey","identfiiers","markdown","obsidian","Roger
+        Hyam"],"language":"en","references":null},{"id":"https://doi.org/10.59350/j77nc-e8x98","uuid":"c6b101f4-bfbc-4d01-921d-805c43c85757","url":"https://iphylo.blogspot.com/2022/08/linking-taxonomic-names-to-literature.html","title":"Linking
+        taxonomic names to the literature","summary":"Just some thoughts as I work
+        through some datasets linking taxonomic names to the literature. In the diagram
+        above I''ve tried to capture the different situatios I encounter. Much of
+        the work I''ve done on this has focussed on case 1 in the diagram: I want
+        to link a taxonomic name to an identifier for the work in which that name
+        was published. In practise this means linking names to DOIs. This has the
+        advantage of linking to a citable indentifier, raising questions such as whether
+        citations...","date_published":"2022-08-22T17:19:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
+        (Roderic Page)"}],"image":null,"content_html":"Just some thoughts as I work
+        through some datasets linking taxonomic names to the literature.\n\n<div class=\"separator\"
+        style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5ZkbnumEWKZpf0isei_hUJucFlOOK08-SKnJknD8B4qhAX36u-vMQRnZdhRJK5rPb7HNYcnqB7qE4agbeStqyzMkWHrzUj14gPkz2ohmbOVg8P_nHo0hM94PD1wH3SPJsaLAumN8vih3ch9pjH2RaVWZBLwwhGhNu0FS1m5z6j5xt2NeZ4w/s2140/linking%20to%20names144.jpg\"
+        style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
+        border=\"0\" height=\"600\" data-original-height=\"2140\" data-original-width=\"1604\"
+        src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5ZkbnumEWKZpf0isei_hUJucFlOOK08-SKnJknD8B4qhAX36u-vMQRnZdhRJK5rPb7HNYcnqB7qE4agbeStqyzMkWHrzUj14gPkz2ohmbOVg8P_nHo0hM94PD1wH3SPJsaLAumN8vih3ch9pjH2RaVWZBLwwhGhNu0FS1m5z6j5xt2NeZ4w/s600/linking%20to%20names144.jpg\"/></a></div>\n\n<p>In
+        the diagram above I''ve tried to capture the different situatios I encounter.
+        Much of the work I''ve done on this has focussed on case 1 in the diagram:
+        I want to link a taxonomic name to an identifier for the work in which that
+        name was published. In practise this means linking names to DOIs. This has
+        the advantage of linking to a citable indentifier, raising questions such
+        as whether citations of taxonmic papers by taxonomic databases could become
+        part of a <a href=\"https://iphylo.blogspot.com/2022/08/papers-citing-data-that-cite-papers.html\">taxonomist''s
+        Google Scholar profile</a>.</p>\n\n<p>In many taxonomic databases full work-level
+        citations are not the norm, instead taxonomists cite one or more pages within
+        a work that are relevant to a taxonomic name. These \"microcitations\" (what
+        the U.S. legal profession refer to as \"point citations\" or \"pincites\", see
+        <a href=\"https://rasmussen.libanswers.com/faq/283203\">What are pincites,
+        pinpoints, or jump legal references?</a>) require some work to map to the
+        work itself (which is typically the thing that has a citatble identifier such
+        as a DOI).</p>\n\n<p>Microcitations (case 2 in the diagram above) can be quite
+        complex. Some might simply mention a single page, but others might list a
+        series of (not necessarily contiguous) pages, as well as figures, plates etc.
+        Converting these to citable identifiers can be tricky, especially as in most
+        cases we don''t have page-level identifiers. The Biodiversity Heritage Library
+        (BHL) does have URLs for each scanned page, and we have a standard for referring
+        to pages in a PDF (<code>page=<pageNum></code>, see <a href=\"https://datatracker.ietf.org/doc/html/rfc8118\">RFC
+        8118</a>). But how do we refer to a set of pages? Do we pick the first page?
+        Do we try and represent a set of pages, and if so, how?</p>\n\n<p>Another
+        issue with page-level identifiers is that not everything on a given page may
+        be relevant to the taxonomic name. In case 2 above I''ve shaded in the parts
+        of the pages and figure that refer to the taxonomic name. An example where
+        this can be problematic is the recent test case I created for BHL where a
+        page image was included for the taxonomic name <a href=\"https://www.gbif.org/species/195763322\"><i>Aphrophora
+        impressa</i></a>. The image includes the species description and a illustration,
+        as well as text that relates to other species.</p>\n\n<div class=\"separator\"
+        style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1SAv1XiVBPHXeMLPfdsnTh4Sj4m0AQyqjXM0faXyvbNhBXFDl7SawjFIkMo3qz4RQhDEyZXkKAh5nF4gtfHbVXA6cK3NJ46CuIWECC6HmJKtTjoZ0M3QQXvLY_X-2-RecjBqEy68M0cEr0ph3l6KY51kA9BvGt9d4id314P71PBitpWMATg/s3467/https---www.biodiversitylibrary.org-pageimage-29138463.jpeg\"
+        style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
+        border=\"0\" height=\"400\" data-original-height=\"3467\" data-original-width=\"2106\"
+        src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1SAv1XiVBPHXeMLPfdsnTh4Sj4m0AQyqjXM0faXyvbNhBXFDl7SawjFIkMo3qz4RQhDEyZXkKAh5nF4gtfHbVXA6cK3NJ46CuIWECC6HmJKtTjoZ0M3QQXvLY_X-2-RecjBqEy68M0cEr0ph3l6KY51kA9BvGt9d4id314P71PBitpWMATg/s400/https---www.biodiversitylibrary.org-pageimage-29138463.jpeg\"/></a></div>\n\n<p>Given
+        that not everything on a page need be relevant, we could extract just the
+        relevant blocks of text and illustrations (e.g., paragraphs of text, panels
+        within a figure, etc.) and treat that set of elements as the thing to cite.
+        This is, of course, what <a href=\"http://plazi.org\">Plazi</a> are doing.
+        The set of extracted blocks is glued together as a \"treatment\", assigned
+        an identifier (often a DOI), and treated as a citable unit. It would be interesting
+        to see to what extent these treatments are actually cited, for example, do
+        subsequent revisions that cite work that include treatments cite those treatments,
+        or just the work itself? Put another way, are we creating <a href=\"https://iphylo.blogspot.com/2012/09/decoding-nature-encode-ipad-app-omg-it.html\">\"threads\"</a>
+        between taxonomic revisions?</p>\n\n<p>One reason for these notes is that
+        I''m exploring uploading taxonomic name - literature links to <a href=\"https://www.checklistbank.org\">ChecklistBank</a>
+        and case 1 above is easy, as is case 3 (if we have treatment-level identifiers).
+        But case 2 is problematic because we are linking to a set of things that may
+        not have an identifier, which means a decision has to be made about which
+        page to link to, and how to refer to that page.</p>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/w18j9-v7j10","uuid":"d811172e-7798-403c-a83d-3d5317a9657e","url":"https://iphylo.blogspot.com/2022/08/papers-citing-data-that-cite-papers.html","title":"Papers
+        citing data that cite papers: CrossRef, DataCite, and the Catalogue of Life","summary":"Quick
+        notes to self following on from a conversation about linking taxonomic names
+        to the literature. Is there a way to turn those links into countable citations
+        (even if just one per database) for Google Scholar?— Wayne Maddison
+        (@WayneMaddison) August 3, 2022 There are different sorts of citation: Paper
+        cites another paper Paper cites a dataset Dataset cites a paper Citation
+        type (1) is largely a solved problem (although there are issues of the ownership
|
350
|
+
and use of this...","date_published":"2022-08-03T11:33:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
|
351
|
+
(Roderic Page)"}],"image":null,"content_html":"Quick notes to self following
|
352
|
+
on from a conversation about linking taxonomic names to the literature.\n\n<blockquote
|
353
|
+
class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">Is there a way to turn
|
354
|
+
those links into countable citations (even if just one per database) for Google
|
355
|
+
Scholar?</p>— Wayne Maddison (@WayneMaddison) <a href=\"https://twitter.com/WayneMaddison/status/1554644747406348288?ref_src=twsrc%5Etfw\">August
|
356
|
+
3, 2022</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
|
357
|
+
charset=\"utf-8\"></script>\n\nThere are different sorts of citation:\n\n<ol>\n <li>Paper
|
358
|
+
cites another paper</li>\n <li>Paper cites a dataset</li>\n <li>Dataset
|
359
|
+
cites a paper</li>\n</ol>\n\nCitation type (1) is largely a solved problem
|
360
|
+
(although there are issues of the ownership and use of this data, see e.g.
|
361
|
+
<a href=\"https://iphylo.blogspot.com/2020/07/zootaxa-has-no-impact-factor.html\">Zootaxa
|
362
|
+
has no impact factor</a>.\n\nCitation type (2) is becoming more widespread
|
363
|
+
(but not perfect as GBIF''s <a href=\"https://twitter.com/search?q=%23citethedoi&src=typed_query\">#citethedoi</a>
|
364
|
+
campaign demonstrates. But the idea is well accepted and there are guides
|
365
|
+
to how to do it, e.g.:\n\n<blockquote>\nCousijn, H., Kenall, A., Ganley, E.
|
366
|
+
et al. A data citation roadmap for scientific publishers. Sci Data 5, 180259
|
367
|
+
(2018). <a href=\"https://doi.org/10.1038/sdata.2018.259\">https://doi.org/10.1038/sdata.2018.259</a>\n</blockquote>\n\nHowever,
|
368
|
+
things do get problematic because most (but not all) DOIs for publications
|
369
|
+
are managed by CrossRef, which has an extensive citation database linking
|
370
|
+
papers to other paopers. Most datasets have DataCite DOIs, and DataCite manages
|
371
|
+
its own citations links, but as far as I''m aware these two systems don''t
|
372
|
+
really taklk to each other.\n\nCitation type (3) is the case where a database
|
373
|
+
is largely based on the literature, which applies to taxonomy. Taxonomic databases
|
374
|
+
are essentially collections of literature that have opinions on taxa, and
|
375
|
+
the database may simply compile those (e.g., a nomenclator), or come to some
|
376
|
+
view on the applicability of each name. In an ideal would, each reference
|
377
|
+
included in a taxonomic database would gain a citation, which would help better
|
378
|
+
reflect the value of that work (a long standing bone of contention for taxonomists).\n\nIt
|
379
|
+
would be interesting to explore these issues further. CrossRef and DataCite
|
380
|
+
do share <a href=\"https://www.crossref.org/services/event-data/\">Event Data</a>
|
381
|
+
(see also <a href=\"https://support.datacite.org/docs/eventdata-guide\">DataCite
|
382
|
+
Event Data</a>). Can this track citations of papers by a dataset?\n \n \nMy
|
383
|
+
take on Wayne''s question:\n\n<blockquote>\n Is there a way to turn those
|
384
|
+
links into countable citations (even if just one per database) for Google
|
385
|
+
Scholar?\n</blockquote>\n\nis that he''s is after type 3 citations, which
|
386
|
+
I don''t think we have a way to handle just yet (but I''d need to look at
|
387
|
+
Event Data a bit more). Google Scholar is a black box, and the academic coimmunity''s
|
388
|
+
reliance on it for metrics is troubling. But it would be interetsing to try
|
389
|
+
and figure out if there is a way to get Google Scholar to index the citations
|
390
|
+
of taxonomic papers by databases. For instance, the <a href=\"https://www.catalogueoflife.org/\">Catalogue
|
391
|
+
of Life</a> has an ISSN <a href=\"https://portal.issn.org/resource/ISSN/2405-884X\">2405-884X</a>
|
392
|
+
so it can be treated as a publication. At the moment its web pages have lots
|
393
|
+
of identifiers for people managing data and their organisations (lots of <a
|
394
|
+
href=\"https://orcid.org\">ORCIDs</a> and <a href=\"https://ror.org\">RORs</a>,
|
395
|
+
and DOIs for individual datasets (e.g., <a href=\"https://www.checklistbank.org/dataset/9828/about\">checklistbank.org</a>)
|
396
|
+
but precious little in the way of DOIs for publications (or, indeed, ORCIDs
|
397
|
+
for taxonomists). What would it take for taxonomic publications in the Catalogue
|
398
|
+
of Life to be treated as first class citations?","tags":["Catalogue of Life","citation","CrossRef","DataCite","DOI"],"language":"en","references":null},{"id":"https://doi.org/10.59350/ws094-1w310","uuid":"6bed78ec-0029-4096-b1c3-48a55a9fdb3b","url":"https://iphylo.blogspot.com/2023/04/chatgpt-of-course.html","title":"ChatGPT,
|
399
|
+
of course","summary":"I haven’t blogged for a while, work and other reasons
have meant I’ve not had much time to think, and mostly I blog to help me think.
ChatGPT is obviously a big thing at the moment, and once we get past the moral
panic (“students can pass exams using AI!”) there are a lot of interesting
possibilities to explore. Inspired by essays such as How Q&A systems based
on large language models (eg GPT4) will change things if they become the dominant
search paradigm — 9 implications for libraries...","date_published":"2023-04-03T12:52:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<p>I haven’t blogged for a
while, work and <a href=\"https://iphylo.blogspot.com/2023/03/dugald-stuart-page-1936-2022.html\">other
reasons</a> have meant I’ve not had much time to think, and mostly I blog
to help me think.</p>\n<p>ChatGPT is obviously a big thing at the moment,
and once we get past the moral panic (“students can pass exams using AI!”)
there are a lot of interesting possibilities to explore. Inspired by essays
such as <a href=\"https://medium.com/@aarontay/how-q-a-systems-based-on-large-language-models-eg-gpt4-will-change-things-if-they-become-the-norm-c7cf62736ba\">How
Q&amp;A systems based on large language models (eg GPT4) will change things
if they become the dominant search paradigm — 9 implications for libraries</a>
and <a href=\"https://about.sourcegraph.com/blog/cheating-is-all-you-need\">Cheating
is All You Need</a>, as well as <a href=\"https://paul-graham-gpt.vercel.app/\">Paul
Graham GPT</a>,
I thought I’d try a few things and see where this goes.</p>\n<p>ChatGPT can
do some surprising things.</p>\n<h4 id=\"parse-bibliographic-data\">Parse
bibliographic data</h4>\n<p>I spend a LOT of time working with bibliographic
data, trying to parse it into structured data. ChatGPT can do this:</p>\n\n<div
class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2bA-Xk6D2wfexP0cs-U3AQa534hNW7J06I0avM82WD3dZSiRvXyHFScv3Hb8QfKKp1fk4GU4olhv1BOq8Pqu4IJgptlBap3xcLI1bFJwB3EOmHKxqf4iy2exJy4vwZ2n4I0U0JVui4Xmhvdy3mTbMSXCRbmCglNFULi5oQEhtPVG4gLJjdw/s924/Screenshot%202023-04-03%20at%2012.59.30.png\"
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
border=\"0\" height=\"400\" data-original-height=\"924\" data-original-width=\"738\"
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2bA-Xk6D2wfexP0cs-U3AQa534hNW7J06I0avM82WD3dZSiRvXyHFScv3Hb8QfKKp1fk4GU4olhv1BOq8Pqu4IJgptlBap3xcLI1bFJwB3EOmHKxqf4iy2exJy4vwZ2n4I0U0JVui4Xmhvdy3mTbMSXCRbmCglNFULi5oQEhtPVG4gLJjdw/s400/Screenshot%202023-04-03%20at%2012.59.30.png\"/></a></div>\n\n<p>Note
that it does more than simply parse the strings; it expands journal abbreviations
such as “J. Malay Brch. R. Asiat. Soc.” to the full name “Journal of the Malayan
Branch of the Royal Asiatic Society”. So we can get clean, parsed data in
a range of formats.</p>\n<h4 id=\"parse-specimens\">Parse specimens</h4>\n<p>Based
on the success with parsing bibliographic strings I wondered how well it could
handle citations of specimens (“material examined”). Elsewhere I’ve been
critical of Plazi’s ability to do this, see <a href=\"https://iphylo.blogspot.com/2021/10/problems-with-plazi-parsing-how.html\">Problems
with Plazi parsing: how reliable are automated methods for extracting specimens
from the literature?</a>.</p>\n<p>For example, given this specimen record
on p. 130 of <a href=\"https://doi.org/10.5852/ejt.2021.775.1553\">doi:10.5852/ejt.2021.775.1553</a>:</p>\n<blockquote>\n<p>LAOS
• Kammoune Province, Bunghona Market, 7 km N of Xe Bangfai River;<br>\n17.13674°
N, 104.98591° E; E. Jeratthitikul, K. Wisittikoson, A. Fanka, N. Wutthituntisil
and P. Prasankok leg.; sold by local people;<br>\nMUMNH-UNI2831.</p>\n</blockquote>\n<p>ChatGPT
extracted a plausible Darwin Core record:</p>\n\n<div class=\"separator\"
style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIwVYuPXl47t1uSpJ3xpbsIT60UUWeKTKSBpIQgD8L2gkj_fuNf4sZKR76JgsfVYYHPs9G9VoM-caUhG1P3irMyB6erSYwYI5LDlLkxdrAFWdbY3ExDs34nMVOF0tq-NBtsDKMZsSJWjhJaGt9v3dtAqWvRyVMNZ46eG2sRhJTICTkFghKtw/s901/Screenshot%202023-04-03%20at%2013.30.54.png\"
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
border=\"0\" height=\"400\" data-original-height=\"901\" data-original-width=\"764\"
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIwVYuPXl47t1uSpJ3xpbsIT60UUWeKTKSBpIQgD8L2gkj_fuNf4sZKR76JgsfVYYHPs9G9VoM-caUhG1P3irMyB6erSYwYI5LDlLkxdrAFWdbY3ExDs34nMVOF0tq-NBtsDKMZsSJWjhJaGt9v3dtAqWvRyVMNZ46eG2sRhJTICTkFghKtw/s400/Screenshot%202023-04-03%20at%2013.30.54.png\"/></a></div>\n\n<p>I’ve
been experimenting with parsing specimen records using the same machine learning
approach for bibliographic data (e.g., <a href=\"https://iphylo.blogspot.com/2021/07/citation-parsing-released.html\">Citation
parsing tool released</a>); perhaps it’s time to rethink that idea.</p>\n<h4
id=\"correct-ocr-errors\">Correct OCR errors</h4>\n<p>OCR errors are a well-known
limitation of the Biodiversity Heritage Library (BHL); see for example
<a href=\"https://blog.biodiversitylibrary.org/2022/07/ocr-improvements-early-analysis.html\">OCR
Improvements: An Early Analysis</a>. Given that ChatGPT is built on top of
a large language model it seems obvious to see how it handles OCR errors.
I tried the following snippet of text from <a href=\"https://www.biodiversitylibrary.org/page/4312227\">BHL
page 4312227</a>:</p>\n<blockquote>\n<p>i BREVIORA No. 185</p>\n<p>Jiiaii
Jr., 16, who always join him and Hedriek in th(> search for<br>\nfrogs.
Iloraeio Mayorga, research assistant to tlie author, has also<br>\nbeen (wtreniely
useful, not only in field work, but also in the<br>\ntedious curatorial and
laboratory work that usually fol- lows. Dr.<br>\nJohn Randall kindly took
tlie photouraplis of E. hedricki, and the<br>\nGaliiiaues family of San Juan
made their El Verde house available as<br>\na (MMiter of activities for the
colleetiuff party. To all these<br>\njx’opic, the autlioi- feels (iee])ly
indebted.</p>\n<p>Elet’tiierodactylus iiKnRrcKi s]). n.</p>\n<p>Type. Museum
of Coniparative Zoology Xo. ;}()!)()8, c^ from El Verde,<br>\nwest flank
of El Yuiuinc Puerto Hico. I.IOO ft. Coll. J. A. Kivero,<br>\n11 Aug-. I!)(i2.</p>\n</blockquote>\n<p>ChatGPT
returns this:</p>\n<blockquote>\n<p>i BREVIORA No. 185</p>\n<p>Juan Jr., 16,
who always joins him and Hedrick in the search for<br>\nfrogs. Horacio Mayorga,
research assistant to the author, has also<br>\nbeen extremely useful, not
only in field work, but also in the tedious<br>\ncuratorial and laboratory
work that usually follows. Dr. John Randall<br>\nkindly took the photographs
of E. hedricki, and the Galinanes family<br>\nof San Juan made their El Verde
house available as a center of<br>\nactivities for the collecting party. To
all these people, the author<br>\nfeels deeply indebted.</p>\n<p>Eleutherodactylus
hedricki sp. n.</p>\n<p>Type. Museum of Comparative Zoology No. 30908, c?
from El Verde, west<br>\nflank of El Yunque, Puerto Rico. 1100 ft. Coll. J.
A. Rivero, 11 Aug.<br>\n1962.</p>\n</blockquote>\n\n<div class=\"separator\"
style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibZLxXSTdt64SMhfqS8NCcLOxbVaFmkhMQx0yEUWcm696EL6k2_VgxZFEDT-SEqOChmwDggoQLkuU6DpcOF_d-nkQRKzQLt65WKfkFlIzDRV9LVrDqmu-D0OZl1p5Vva4QUdvDb3JpeKPPDXZ-BcZqq8V4PyGepxlnwtt7-q4Sg6_aHD58gw/s859/Screenshot%202023-04-03%20at%2013.07.53.png\"
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
border=\"0\" width=\"400\" data-original-height=\"510\" data-original-width=\"859\"
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibZLxXSTdt64SMhfqS8NCcLOxbVaFmkhMQx0yEUWcm696EL6k2_VgxZFEDT-SEqOChmwDggoQLkuU6DpcOF_d-nkQRKzQLt65WKfkFlIzDRV9LVrDqmu-D0OZl1p5Vva4QUdvDb3JpeKPPDXZ-BcZqq8V4PyGepxlnwtt7-q4Sg6_aHD58gw/s400/Screenshot%202023-04-03%20at%2013.07.53.png\"/></a></div>\n\n<p>Comparing
this to the scanned image, ChatGPT does pretty well; for example, the gobbledegook
“Elet’tiierodactylus iiKnRrcKi” is correctly translated as “Eleutherodactylus
hedricki”. Running all of BHL through ChatGPT probably isn’t feasible, but
one could imagine targeted cleaning of key papers.</p>\n<h4 id=\"summary\">Summary</h4>\n<p>These
small experiments are fairly trivial, but they are the sort of tedious tasks
that would otherwise require significant programming (or other resources)
to solve. But ChatGPT can do rather more, as I hope to discuss in the next
post.</p>\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/7esgr-61v1","uuid":"96fa91d5-459c-482f-aa38-dda6e0a30e20","url":"https://iphylo.blogspot.com/2022/01/large-graph-viewer-experiments.html","title":"Large
graph viewer experiments","summary":"I keep returning to the problem of viewing
large graphs and trees, which means my hard drive has accumulated lots of
failed prototypes. Inspired by some recent discussions on comparing taxonomic
classifications I decided to package one of these (wildly incomplete) prototypes
up so that I can document the idea and put the code somewhere safe. Very cool,
thanks for sharing this-- the tree diff is similar to what J Rees has been
cooking up lately with his 'cl diff' tool. I'll tag...","date_published":"2022-01-02T11:25:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<p>I keep returning to the
problem of viewing large graphs and trees, which means my hard drive has accumulated
lots of failed prototypes. Inspired by some recent discussions on comparing
taxonomic classifications I decided to package one of these (wildly incomplete)
prototypes up so that I can document the idea and put the code somewhere safe.</p>\n\n<blockquote
class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">Very cool, thanks for sharing
this-- the tree diff is similar to what J Rees has been cooking up lately
with his 'cl diff' tool. I'll tag <a href=\"https://twitter.com/beckettws?ref_src=twsrc%5Etfw\">@beckettws</a>
in here too so he can see potential crossover. The goal is autogenerate diffs
like this as 1st step to mapping taxo name-to concept</p>— Nate Upham
(@n8_upham) <a href=\"https://twitter.com/n8_upham/status/1475834371131289608?ref_src=twsrc%5Etfw\">December
28, 2021</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
charset=\"utf-8\"></script>\n\n<h2>Google Maps-like viewer</h2>\n\n<div class=\"separator\"
style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/a/AVvXsEiGDVVKGmwRi0gn-kEtpxBCUEdK-WCLBSpmkALAUDHGeTPkTSICSYVgnsj5N7zUeUfQALfFFHJJCsfeFvRULKbmqxLz51rW5hp_11dVXh-FHrnlRA7RJTA7I82l7sERF5jAjlah0LyEheVayO9nAfHTGZDuw5rnCe9iEO3dHQmA8_5AFIlJJg=s500\"
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
border=\"0\" width=\"400\" data-original-height=\"448\" data-original-width=\"500\"
src=\"https://blogger.googleusercontent.com/img/a/AVvXsEiGDVVKGmwRi0gn-kEtpxBCUEdK-WCLBSpmkALAUDHGeTPkTSICSYVgnsj5N7zUeUfQALfFFHJJCsfeFvRULKbmqxLz51rW5hp_11dVXh-FHrnlRA7RJTA7I82l7sERF5jAjlah0LyEheVayO9nAfHTGZDuw5rnCe9iEO3dHQmA8_5AFIlJJg=s400\"/></a></div>\n\n<p>I''ve
created a simple viewer that uses a tiled map viewer (like Google Maps) to
display a large graph. The idea is to draw the entire graph scaled to a 256
x 256 pixel tile. The graph is stored in a database that supports geospatial
queries, which means the queries to retrieve the individual tiles needed to
display the graph at different levels of resolution are simply bounding box
queries to a database. I realise that this description is cryptic at best.
The GitHub repository <a href=\"https://github.com/rdmpage/gml-viewer\">https://github.com/rdmpage/gml-viewer</a>
has more details and the code itself. There''s a lot to do, especially adding
support for labels(!), which presents some interesting challenges (<a href=\"https://en.wikipedia.org/wiki/Level_of_detail_(computer_graphics)\">levels
of detail</a> and <a href=\"https://en.wikipedia.org/wiki/Cartographic_generalization\">generalization</a>).
The code doesn''t do any layout of the graph itself; instead I''ve used the
<a href=\"https://www.yworks.com/products/yed\">yEd</a> tool to compute the
x,y coordinates of the graph.</p>\n\n<p>Since this exercise was inspired by
a discussion of the <a href=\"https://www.mammaldiversity.org\">ASM Mammal
Diversity Database</a>, the graph I''ve used for the demonstration above is
the ASM classification of extant mammals. I guess I need to solve the labelling
issue fairly quickly!</p>","tags":["Google Maps","graph","Mammal Species of
the World","mammals","taxonomy"],"language":"en","references":null},{"id":"https://doi.org/10.59350/m48f7-c2128","uuid":"8aea47e4-f227-45f4-b37b-0454a8a7a3ff","url":"https://iphylo.blogspot.com/2023/04/chatgpt-semantic-search-and-knowledge.html","title":"ChatGPT,
semantic search, and knowledge graphs","summary":"One thing about ChatGPT
is it has opened my eyes to some concepts I was dimly aware of but am only
now beginning to fully appreciate. ChatGPT enables you to ask it questions, but
the answers depend on what ChatGPT “knows”. As several people have noted,
what would be even better is to be able to run ChatGPT on your own content.
Indeed, ChatGPT itself now supports this using plugins. Paul Graham GPT However,
it’s still useful to see how to add ChatGPT functionality to your own content
from...","date_published":"2023-04-03T15:30:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<p>One thing about ChatGPT
is it has opened my eyes to some concepts I was dimly aware of but am only
now beginning to fully appreciate. ChatGPT enables you to ask it questions, but
the answers depend on what ChatGPT “knows”. As several people have noted,
what would be even better is to be able to run ChatGPT on your own content.
Indeed, ChatGPT itself now supports this using <a href=\"https://openai.com/blog/chatgpt-plugins\">plugins</a>.</p>\n<h4
id=\"paul-graham-gpt\">Paul Graham GPT</h4>\n<p>However, it’s still useful
to see how to add ChatGPT functionality to your own content from scratch.
A nice example of this is <a href=\"https://paul-graham-gpt.vercel.app/\">Paul
Graham GPT</a> by <a href=\"https://twitter.com/mckaywrigley\">Mckay Wrigley</a>.
Mckay Wrigley took essays by Paul Graham (a well-known venture capitalist)
and built a question and answer tool very like ChatGPT.</p>\n<iframe width=\"560\"
height=\"315\" src=\"https://www.youtube.com/embed/ii1jcLg-eIQ\" title=\"YouTube
video player\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write;
encrypted-media; gyroscope; picture-in-picture; web-share\" allowfullscreen></iframe>\n<p>Because
you can send a block of text to ChatGPT (as part of the prompt) you can get
ChatGPT to summarise or transform that information, or answer questions based
on that information. But there is a limit to how much information you can
pack into a prompt. You can’t put all of Paul Graham’s essays into a prompt,
for example. So a solution is to do some preprocessing. For example, given
a question such as “How do I start a startup?” we could first find the essays
that are most relevant to this question, then use them to create a prompt
for ChatGPT. A quick and dirty way to do this is simply to do a text search over
the essays and take the top hits. But we aren’t searching for words, we are
searching for answers to a question. The essay with the best answer might
not include the phrase “How do I start a startup?”.</p>\n<h4 id=\"semantic-search\">Semantic
search</h4>\n<p>Enter <a href=\"https://en.wikipedia.org/wiki/Semantic_search\">Semantic
search</a>. The key concept behind semantic search is that we are looking
for documents with similar meaning, not just similarity of text. One approach
to this is to represent documents by “embeddings”, that is, a vector of numbers
that encapsulate features of the document. Documents with similar vectors
are potentially related. In semantic search we take the query (e.g., “How
do I start a startup?”), compute its embedding, then search among the documents
for those with similar embeddings.</p>\n<p>To create Paul Graham GPT, Mckay
Wrigley did the following. First he sent each essay to the OpenAI API underlying
ChatGPT, and in return he got the embedding for that essay (a vector of 1536
numbers). Each embedding was stored in a database (Mckay uses Postgres with
<a href=\"https://github.com/pgvector/pgvector\">pgvector</a>). When a user
enters a query such as “How do I start a startup?” that query is also sent
to the OpenAI API to retrieve its embedding vector. Then we query the database
of embeddings for Paul Graham’s essays and take the top five hits. These hits
are, one hopes, the most likely to contain relevant answers. The original
question and the most similar essays are then bundled up and sent to ChatGPT,
which then synthesises an answer. See his <a href=\"https://github.com/mckaywrigley/paul-graham-gpt\">GitHub
repo</a> for more details. Note that we are still using ChatGPT, but on a
set of documents it doesn’t already have.</p>\n<h4 id=\"knowledge-graphs\">Knowledge
graphs</h4>\n<p>I’m a fan of knowledge graphs, but they are not terribly easy
to use. For example, I built a knowledge graph of Australian animals <a href=\"https://ozymandias-demo.herokuapp.com\">Ozymandias</a>
that contains a wealth of information on taxa, publications, and people, wrapped
up in a web site. If you want to learn more you need to figure out how to
write queries in SPARQL, which is not fun. Maybe we could use ChatGPT to write
the SPARQL queries for us, but it would be much more fun to simply ask
natural language queries (e.g., “who are the experts on Australian ants?”).
I made some naïve notes on these ideas in <a href=\"https://iphylo.blogspot.com/2015/09/possible-project-natural-language.html\">Possible
project: natural language queries, or answering “how many species are there?”</a>
and <a href=\"https://iphylo.blogspot.com/2019/05/ozymandias-meets-wikipedia-with-notes.html\">Ozymandias
meets Wikipedia, with notes on natural language generation</a>.</p>\n<p>Of
course, this is a well-known problem. Tools such as <a href=\"http://rdf2vec.org\">RDF2vec</a>
can take RDF from a knowledge graph and create embeddings which could in turn
be used to support semantic search. But it seems to me that we could simplify
this process a bit by making use of ChatGPT.</p>\n<p>Firstly we would generate
natural language statements from the knowledge graph (e.g., “species x belongs
to genus y and was described in z”, “this paper on ants was authored by x”,
etc.) that cover the basic questions we expect people to ask. We then get
embeddings for these (e.g., using OpenAI). We then have an interface where
people can ask a question (“is species x a valid species?”, “who has published
on ants”, etc.), we get the embedding for that question, retrieve natural
language statements that are the closest in embedding “space”, package everything
up and ask ChatGPT to summarise the answer.</p>\n<p>The trick, of course,
is to figure out how to generate natural language statements from the knowledge
graph (which amounts to deciding what paths to traverse in the knowledge graph,
and how to write those paths in something approximating English). We also
want to know something about the sorts of questions people are likely to ask
so that we have a reasonable chance of having the answers (for example, are
people going to ask about individual species, or questions about summary statistics
such as numbers of species in a genus, etc.).</p>\n<p>What makes this attractive
is that it seems a straightforward way to go from a largely academic exercise
(build a knowledge graph) to something potentially useful (a question and
answer machine). Imagine if something like the defunct BBC wildlife site (see
<a href=\"https://iphylo.blogspot.com/2017/12/blue-planet-ii-bbc-and-semantic-web.html\">Blue
Planet II, the BBC, and the Semantic Web: a tale of lessons forgotten and
opportunities lost</a>), revived <a href=\"https://aspiring-look.glitch.me\">here</a>,
had a question and answer interface where we could ask questions rather than
passively browse.</p>\n<h4 id=\"summary\">Summary</h4>\n<p>I have so much
more to learn, and need to think about ways to incorporate semantic search
and ChatGPT-like tools into knowledge graphs.</p>\n<blockquote>\n<p>Written
with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/rfxj3-x6739","uuid":"6a4d5c44-f4a9-4d40-a32c-a4d5e512c55a","url":"https://iphylo.blogspot.com/2022/05/thoughts-on-treebase-dying.html","title":"Thoughts
on TreeBASE dying(?)","summary":"@rvosa is Naturalis no longer hosting Treebase?
|
625
|
+
https://t.co/MBRgcxaBmR— Hilmar Lapp (@hlapp) May 10, 2022 So it looks
|
626
|
+
like TreeBASE is in trouble, it''s legacy Java code a victim of security issues.
|
627
|
+
Perhaps this is a chance to rethink TreeBASE, assuming that a repository of
|
628
|
+
published phylogenies is still considered a worthwhile thing to have (and
|
629
|
+
I think that question is open). Here''s what I think could be done. The data
|
630
|
+
(individual studies with trees and data) are packaged into...","date_published":"2022-05-11T16:53:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
|
631
|
+
(Roderic Page)"}],"image":null,"content_html":"<blockquote class=\"twitter-tweet\"><p
|
632
|
+
lang=\"en\" dir=\"ltr\"><a href=\"https://twitter.com/rvosa?ref_src=twsrc%5Etfw\">@rvosa</a>
|
633
|
+
is Naturalis no longer hosting Treebase? <a href=\"https://t.co/MBRgcxaBmR\">https://t.co/MBRgcxaBmR</a></p>—
|
634
|
+
Hilmar Lapp (@hlapp) <a href=\"https://twitter.com/hlapp/status/1524166490798309381?ref_src=twsrc%5Etfw\">May
|
635
|
+
10, 2022</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
|
636
|
+
charset=\"utf-8\"></script>\n\n<p>So it looks like <a href=\"http://treebase.org\">TreeBASE</a>
|
637
|
+
is in trouble, it''s legacy Java code a victim of security issues. Perhaps
|
638
|
+
this is a chance to rethink TreeBASE, assuming that a repository of published
|
639
|
+
phylogenies is still considered a worthwhile thing to have (and I think that
|
640
|
+
question is open).</p>\n\n<p>Here''s what I think could be done.</p>\n\n<ol>\n<li>\nThe
|
641
|
+
data (individual studies with trees and data) are packaged into whatever format
is easiest (NEXUS, XML, JSON) and uploaded to a repository such as <a href=\"https://zenodo.org\">Zenodo</a>
for long term storage. They get DOIs for citability. This becomes the default
storage for TreeBASE.\n</li>\n<li>\nThe data is transformed into JSON and
indexed using Elasticsearch. A simple web interface is placed on top so that
people can easily find trees (never a strong point of the original TreeBASE).
Trees are displayed natively on the web using SVG. The number one goal is
for people to be able to find trees, view them, and download them.\n</li>\n<li>\nTo
add data to TreeBASE the easiest way would be for people to upload them direct
to Zenodo and tag them \"treebase\". A bot then grabs a feed of these datasets
and adds them to the search engine in (1) above. As time allows, add an interface
where people upload data directly, it gets curated, then deposited in Zenodo.
This presupposes that there are people available to do curation. Maybe have
\"stars\" for the level of curation so that users know whether anyone has
checked the data.\n</li>\n</ol>\n\n<p>There''s lots of details to tweak, for
example how many of the existing URLs for studies are preserved (some URL
mapping), and what about the API? And I''m unclear about the relationship
with <a href=\"https://datadryad.org\">Dryad</a>.</p>\n\n<p>My sense is that
the TreeBASE code is very much of its time (10-15 years ago), a monolithic
block of code with SQL, Java, etc. If one was starting from scratch today
I don''t think this would be the obvious solution. Things have trended towards
being simpler, with lots of building blocks now available in the cloud. Need
a search engine? Just spin up a container in the cloud and you have one. More
and more functionality can be devolved elsewhere.</p>\n\n<p>Another other
issue is how to support TreeBASE. It has essentially been a volunteer effort
to date, with little or no funding. One reason I think having Zenodo as a
storage engine is that it takes care of long term sustainability of the data.</p>\n\n<p>I
realise that this is all wild arm waving, but maybe now is the time to reinvent
TreeBASE?</p>\n\n<h2>Updates</h2>\n\n<p>It''s been a while since I''ve paid
a lot of attention to phylogenetic databases, and it shows. There is a file-based
storage system for phylogenies <a href=\"https://github.com/OpenTreeOfLife/phylesystem-1\">phylesystem</a>
(see \"Phylesystem: a git-based data store for community-curated phylogenetic
estimates\" <a href=\"https://doi.org/10.1093/bioinformatics/btv276\">https://doi.org/10.1093/bioinformatics/btv276</a>)
that is sort of what I had in mind, although long term persistence is based
on GitHub rather than a repository such as Zenodo. Phylesystem uses a truly
horrible-looking JSON transformation of <a href=\"http://nexml.github.io\">NeXML</a>
(NeXML itself is ugly), and TreeBASE also supports NeXML, so some form of
NeXML or a JSON transformation seems the obvious storage format. It will probably
need some cleaning and simplification if it is to be indexed easily. Looking
back over the long history of TreeBASE and phylogenetic databases I''m struck
by how much complexity has been introduced over time. I think the tech has
gotten in the way sometimes (which might just be another way of saying that
I''m not smart enough to make sense of it all.</p>\n\n<p>So we could imagine
a search engine that covers both TreeBASE and <a href=\"https://tree.opentreeoflife.org/curator\">Open
Tree of Life studies</a>.</p>\n\n<p>Basic metadata-based searches would be
straightforward, and we could have a user interface that highlights the trees
(I think TreeBASE''s biggest search rival is a Google image search). The harder
problem is searching by tree structure, for which there is an interesting
literature without any decent implementations that I''m aware of (as I said,
I''ve been out of this field a while).</p>\n\n<p>So my instinct is we could
go a long way with simply indexing JSON (CouchDB or Elasticsearch), then need
to think a bit more cleverly about higher taxon and tree based searching.
I''ve always thought that one killer query would be not so much \"show me
all the trees for my taxon\" but \"show me a synthesis of the trees for my
taxon\". Imagine a supertree of recent studies that we could use as a summary
of our current knowledge, or a visualisation that summarises where there are
conflicts among the trees.</p>\n\n<h3>Relevant code and sites</h3>\n\n<ul>\n<li><a
href=\"https://github.com/rdmpage/cdaotools\">CDAO Tools</a>, see \"CDAO-Store:
Ontology-driven Data Integration for Phylogenetic Analysis\" <a href=\"https://doi.org/10.1186/1471-2105-12-98\">https://doi.org/10.1186/1471-2105-12-98</a></li>\n<li><a
href=\"https://github.com/NESCent/phylocommons\">PhyloCommons</a></li>\n</ul>","tags":["phylogeny","TreeBASE"],"language":"en","references":null},{"id":"https://doi.org/10.59350/jzvs4-r9559","uuid":"23fa1dd8-5c6b-4aa9-9cad-c6f6b14ae9e0","url":"https://iphylo.blogspot.com/2021/08/json-ld-in-wild-examples-of-how.html","title":"JSON-LD
in the wild: examples of how structured data is represented on the web","summary":"I''ve
created a GitHub repository so that I can keep track of the examples of JSON-LD
that I''ve seen being actively used, for example embedded in web sites, or
accessed using an API. The repository is https://github.com/rdmpage/wild-json-ld.
The list is by no means exhaustive, I hope to add more examples as I come
across them. One reason for doing this is to learn what others are doing.
For example, after looking at SciGraph''s JSON-LD I now see how an ordered
list can be modelled in RDF in...","date_published":"2021-08-27T13:20:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<p>I''ve created a GitHub repository
so that I can keep track of the examples of JSON-LD that I''ve seen being
actively used, for example embedded in web sites, or accessed using an API.
The repository is <a href=\"https://github.com/rdmpage/wild-json-ld\">https://github.com/rdmpage/wild-json-ld</a>.
The list is by no means exhaustive, I hope to add more examples as I come
across them.</p>\n\n<p>One reason for doing this is to learn what others are
doing. For example, after looking at SciGraph''s JSON-LD I now see how an
ordered list can be modelled in RDF in such a way that the list of authors
in a JSON-LD document for, say a scientific paper, is correct. By default
RDF has no notion of ordered lists, so if you do a SPARQL query to get the
authors of a paper, the order of the authors returned in the query will be
arbitrary. There are various ways to try and tackle this. In my Ozymandias
knowledge graph I used \"roles\" to represent order (see <a href=\"https://doi.org/10.7717/peerj.6739/fig-2\">Figure
2</a> in the Ozymandias paper). I then used properties of the role to order
the list of authors.</p>\n\n<p>Another approach is to use rdf:lists (see <a
href=\"http://www.snee.com/bobdc.blog/2014/04/rdf-lists-and-sparql.html\">RDF
lists and SPARQL</a> and <a href=\"https://stackoverflow.com/questions/17523804/is-it-possible-to-get-the-position-of-an-element-in-an-rdf-collection-in-sparql/17530689#17530689\">Is
it possible to get the position of an element in an RDF Collection in SPARQL?</a>
for an introduction to lists). SciGraph uses this approach. The value for
schema:author is not an author, but a blank node (bnode), and this bnode has
two predicates, rdf:first and rdf:rest. One points to an author, the other
points to another bnode. This pattern repeats until we encounter a value of
rdf:nil for rdf:rest.</p>\n\n<div class=\"separator\" style=\"clear: both;\"><a
href=\"https://1.bp.blogspot.com/-AESgWub1ZLQ/YSjoeo6O41I/AAAAAAAAgwg/5Edm7ZmuwL8NwxCcBvTqbI7js5nYmgggwCLcBGAsYHQ/s629/list.png\"
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
border=\"0\" height=\"320\" data-original-height=\"629\" data-original-width=\"401\"
src=\"https://1.bp.blogspot.com/-AESgWub1ZLQ/YSjoeo6O41I/AAAAAAAAgwg/5Edm7ZmuwL8NwxCcBvTqbI7js5nYmgggwCLcBGAsYHQ/s320/list.png\"/></a></div>\n\n<p>This
introduces some complexity, but the benefit is that the JSON-LD version of
the RDF will have the authors in the correct order, and hence any client that
is using JSON will be able to treat the array of authors as ordered. Without
some means of ordering the client could not make this assumption, hence the
first author in the list might not actually be the first author of the paper.</p>","tags":["JSON-LD","RDF"],"language":"en","references":null},{"id":"https://doi.org/10.59350/zc4qc-77616","uuid":"30c78d9d-2e50-49db-9f4f-b3baa060387b","url":"https://iphylo.blogspot.com/2022/09/does-anyone-cite-taxonomic-treatments.html","title":"Does
anyone cite taxonomic treatments?","summary":"Taxonomic treatments have come
up in various discussions I''m involved in, and I''m curious as to whether
they are actually being used, in particular, whether they are actually being
cited. Consider the following quote: The taxa are described in taxonomic treatments,
well defined sections of scientific publications (Catapano 2019). They include
a nomenclatural section and one or more sections including descriptions, material
citations referring to studied specimens, or notes ecology and...","date_published":"2022-09-01T16:49:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
both;\"><a href=\"https://zenodo.org/record/5731100/thumb100\" style=\"display:
block; padding: 1em 0; text-align: center; clear: right; float: right;\"><img
alt=\"\" border=\"0\" height=\"128\" data-original-height=\"106\" data-original-width=\"100\"
src=\"https://zenodo.org/record/5731100/thumb250\"/></a></div>\nTaxonomic
treatments have come up in various discussions I''m involved in, and I''m
curious as to whether they are actually being used, in particular, whether
they are actually being cited. Consider the following quote:\n\n<blockquote>\nThe
taxa are described in taxonomic treatments, well defined sections of scientific
publications (Catapano 2019). They include a nomenclatural section and one
or more sections including descriptions, material citations referring to studied
specimens, or notes ecology and behavior. In case the treatment does not describe
a new discovered taxon, previous treatments are cited in the form of treatment
citations. This citation can refer to a previous treatment and add additional
data, or it can be a statement synonymizing the taxon with another taxon.
This allows building a citation network, and ultimately is a constituent part
of the catalogue of life. - Taxonomic Treatments as Open FAIR Digital Objects
<a href=\"https://doi.org/10.3897/rio.8.e93709\">https://doi.org/10.3897/rio.8.e93709</a>\n</blockquote>\n\n<p>\n
\"Traditional\" academic citation is from article to article. For example,
consider these two papers:\n\n<blockquote>\nLi Y, Li S, Lin Y (2021) Taxonomic
study on fourteen symphytognathid species from Asia (Araneae, Symphytognathidae).
ZooKeys 1072: 1-47. https://doi.org/10.3897/zookeys.1072.67935\n</blockquote>\n\n<blockquote>\nMiller
J, Griswold C, Yin C (2009) The symphytognathoid spiders of the Gaoligongshan,
Yunnan, China (Araneae: Araneoidea): Systematics and diversity of micro-orbweavers.
ZooKeys 11: 9-195. https://doi.org/10.3897/zookeys.11.160\n</blockquote>\n</p>\n\n<p>Li
et al. 2021 cites Miller et al. 2009 (although Pensoft seems to have broken
the citation such that it does appear correctly either on their web page or
in CrossRef).</p>\n\n<p>So, we have this link: [article]10.3897/zookeys.1072.67935
--cites--> [article]10.3897/zookeys.11.160. One article cites another.</p>\n\n<p>In
their 2021 paper Li et al. discuss <i>Patu jidanweishi</i> Miller, Griswold
& Yin, 2009:\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPavMHXqQNX1ls_zXo9kIcMLHPxc7ZpV9NCSof5wLumrg3ovPoi6nYKzZsINuqFtYoEvW1QrerePD-MEf2DJaYUXlT37d81x3L6ILls7u229rg0_Nc0uUmgW-ICzr6MI_QCZfgQbYGTxuofu-fuPVoygbCnm3vQVYOhLDLtp1EtQ9jRZHDvw/s1040/Screenshot%202022-09-01%20at%2017.12.27.png\"
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
border=\"0\" width=\"400\" data-original-height=\"314\" data-original-width=\"1040\"
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPavMHXqQNX1ls_zXo9kIcMLHPxc7ZpV9NCSof5wLumrg3ovPoi6nYKzZsINuqFtYoEvW1QrerePD-MEf2DJaYUXlT37d81x3L6ILls7u229rg0_Nc0uUmgW-ICzr6MI_QCZfgQbYGTxuofu-fuPVoygbCnm3vQVYOhLDLtp1EtQ9jRZHDvw/s400/Screenshot%202022-09-01%20at%2017.12.27.png\"/></a></div>\n\n<p>There
is a treatment for the original description of <i>Patu jidanweishi</i> at
<a href=\"https://doi.org/10.5281/zenodo.3792232\">https://doi.org/10.5281/zenodo.3792232</a>,
which was created by Plazi with a time stamp \"2020-05-06T04:59:53.278684+00:00\".
The original publication date was 2009, the treatments are being added retrospectively.</p>\n\n<p>In
an ideal world my expectation would be that Li et al. 2021 would have cited
the treatment, instead of just providing the text string \"Patu jidanweishi
Miller, Griswold & Yin, 2009: 64, figs 65A–E, 66A, B, 67A–D, 68A–F, 69A–F,
70A–F and 71A–F (♂♀).\" Isn''t the expectation under the treatment model that
we would have seen this relationship:</p>\n\n<p>[article]10.3897/zookeys.1072.67935
--cites--> [treatment]https://doi.org/10.5281/zenodo.3792232</p>\n\n<p>Furthermore,
if it is the case that \"[i]n case the treatment does not describe a new discovered
taxon, previous treatments are cited in the form of treatment citations\"
then we should also see a citation between treatments, in other words Li et
al.''s 2021 treatment of <i>Patu jidanweishi</i> (which doesn''t seem to have
a DOI but is available on Plazi'' web site as <a href=\"https://tb.plazi.org/GgServer/html/1CD9FEC313A35240938EC58ABB858E74\">https://tb.plazi.org/GgServer/html/1CD9FEC313A35240938EC58ABB858E74</a>)
should also cite the original treatment? It doesn''t - but it does cite the
Miller et al. paper.</p>\n\n<p>So in this example we don''t see articles citing
treatments, nor do we see treatments citing treatments. Playing Devil''s advocate,
why then do we have treatments? Does''t the lack of citations suggest that
- despite some taxonomists saying this is the unit that matters - they actually
don''t. If we pay attention to what people do rather than what they say they
do, they cite articles.</p>\n\n<p>Now, there are all sorts of reasons why
we don''t see [article] -> [treatment] citations, or [treatment] -> [treatment]
citations. Treatments are being added after the fact by Plazi, not by the
authors of the original work. And in many cases the treatments that could
be cited haven''t appeared until after that potentially citing work was published.
In the example above the Miller et al. paper dates from 2009, but the treatment
extracted only went online in 2020. And while there is a long standing culture
of citing publications (ideally using DOIs) there isn''t an equivalent culture
of citing treatments (beyond the simple text strings).</p>\n\n<p>Obviously
this is but one example. I''d need to do some exploration of the citation
graph to get a better sense of citations patterns, perhaps using <a href=\"https://www.crossref.org/documentation/event-data/\">CrossRef''s
event data</a>. But my sense is that taxonomists don''t cite treatments.</p>\n\n<p>I''m
guessing Plazi would respond by saying treatments are cited, for example (indirectly)
in GBIF downloads. This is true, although arguably people aren''t citing the
treatment, they''re citing specimen data in those treatments, and that specimen
data could be extracted at the level of articles rather than treatments. In
other words, it''s not the treatments themselves that people are citing.</p>\n\n<p>To
be clear, I think there is value in being able to identify those \"well defined
sections\" of a publication that deal with a given taxon (i.e., treatments),
but it''s not clear to me that these are actually the citable units people
might hope them to be. Likewise, journals such as <i>ZooKeys</i> have DOIs
for individual figures. Does anyone actually cite those?</p>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/en7e9-5s882","uuid":"20b9d31e-513f-496b-b399-4215306e1588","url":"https://iphylo.blogspot.com/2022/04/obsidian-markdown-and-taxonomic-trees.html","title":"Obsidian,
markdown, and taxonomic trees","summary":"Returning to the subject of personal
knowledge graphs Kyle Scheer has an interesting repository of Markdown files
that describe academic disciplines at https://github.com/kyletscheer/academic-disciplines
(see his blog post for more background). If you add these files to Obsidian
you get a nice visualisation of a taxonomy of academic disciplines. The applications
of this to biological taxonomy seem obvious, especially as a tool like Obsidian
enables all sorts of interesting links to be added...","date_published":"2022-04-07T21:07:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<p>Returning to the subject
of <a href=\"https://iphylo.blogspot.com/2020/08/personal-knowledge-graphs-obsidian-roam.html\">personal
knowledge graphs</a> Kyle Scheer has an interesting repository of Markdown
files that describe academic disciplines at <a href=\"https://github.com/kyletscheer/academic-disciplines\">https://github.com/kyletscheer/academic-disciplines</a>
(see <a href=\"https://kyletscheer.medium.com/on-creating-a-tree-of-knowledge-f099c1028bf6\">his
blog post</a> for more background).</p>\n\n<p>If you add these files to <a
href=\"https://obsidian.md/\">Obsidian</a> you get a nice visualisation of
a taxonomy of academic disciplines. The applications of this to biological
taxonomy seem obvious, especially as a tool like Obsidian enables all sorts
of interesting links to be added (e.g., we could add links to the taxonomic
research behind each node in the taxonomic tree, the people doing that research,
etc. - although that would mean we''d no longer have a simple tree).</p>\n\n<p>The
more I look at these sort of simple Markdown-based tools the more I wonder
whether we could make more use of them to create simple but persistent databases.
Text files seem the most stable, long-lived digital format around, maybe this
would be a way to minimise the inevitable obsolescence of database and server
software. Time for some experiments I feel... can we take a taxonomic group,
such as mammals, and create a richly connected database purely in Markdown?</p>\n\n<div
class=\"separator\" style=\"clear: both; text-align: center;\"><iframe allowfullscreen=''allowfullscreen''
webkitallowfullscreen=''webkitallowfullscreen'' mozallowfullscreen=''mozallowfullscreen''
width=''400'' height=''322'' src=''https://www.blogger.com/video.g?token=AD6v5dy3Sa_SY_MJCZYYCT-bAGe9QD1z_V0tkE0qM5FaQJfAEgGOoHtYPATsNNbBvTEh_tHOZ83nMGzpYRg''
class=''b-hbp-video b-uploaded'' frameborder=''0''></iframe></div>","tags":["markdown","obsidian"],"language":"en","references":null},{"id":"https://doi.org/10.59350/m7gb7-d7c49","uuid":"7d814863-43b5-4faf-a475-da8de5efd3ef","url":"https://iphylo.blogspot.com/2022/02/duplicate-dois-again.html","title":"Duplicate
DOIs (again)","summary":"This blog post provides some background to a recent
tweet where I expressed my frustration about the duplication of DOIs for the
same article. I''m going to document the details here. The DOI that alerted
me to this problem is https://doi.org/10.2307/2436688 which is for the article
Snyder, W. C., & Hansen, H. N. (1940). THE SPECIES CONCEPT IN FUSARIUM. American
Journal of Botany, 27(2), 64–67. This article is hosted by JSTOR at https://www.jstor.org/stable/2436688
which displays the DOI...","date_published":"2022-02-08T15:06:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<p>This blog post provides
some background to a <a href=\"https://twitter.com/rdmpage/status/1491023036199600132\">recent
tweet</a> where I expressed my frustration about the duplication of DOIs for
the same article. I''m going to document the details here.</p>\n\n<p>The DOI
that alerted me to this problem is <a href=\"https://doi.org/10.2307/2436688\">https://doi.org/10.2307/2436688</a>
which is for the article</p>\n\n<blockquote>\nSnyder, W. C., & Hansen, H.
N. (1940). THE SPECIES CONCEPT IN FUSARIUM. American Journal of Botany, 27(2),
64–67.\n</blockquote>\n\n<p>This article is hosted by JSTOR at <a href=\"https://www.jstor.org/stable/2436688\">https://www.jstor.org/stable/2436688</a>
which displays the DOI <a href=\"https://doi.org/10.2307/2436688\">https://doi.org/10.2307/2436688</a>
.</p>\n\n<p>This same article is also hosted by Wiley at <a href=\"https://bsapubs.onlinelibrary.wiley.com/doi/abs/10.1002/j.1537-2197.1940.tb14217.x\">https://bsapubs.onlinelibrary.wiley.com/doi/abs/10.1002/j.1537-2197.1940.tb14217.x</a>
with the DOI <a href=\"https://doi.org/10.1002/j.1537-2197.1940.tb14217.x\">https://doi.org/10.1002/j.1537-2197.1940.tb14217.x</a>.</p>\n\n<h2>Expected
behaviour</h2>\n\n<p>What should happen is if Wiley is going to be the publisher
of this content (taking over from JSTOR), the DOI <b>10.2307/2436688</b> should
be redirected to the Wiley page, and the Wiley page displays this DOI (i.e.,
<b>10.2307/2436688</b>). If I want to get metadata for this DOI, I should
be able to use CrossRef''s API to retrieve that metadata, e.g. <a href=\"https://api.crossref.org/v1/works/10.2307/2436688\">https://api.crossref.org/v1/works/10.2307/2436688</a>
should return metadata for the article.</p>\n\n<h2>What actually happens</h2>\n\n<p>Wiley
display the same article on their web site with the DOI <b>10.1002/j.1537-2197.1940.tb14217.x</b>.
They have minted a new DOI for the same article! The original JSTOR DOI now
resolves to the Wiley page (you can see this using the <a href=\"https://hdl.handle.net\">Handle
Resolver</a>), which is what is supposed to happen. However, Wiley should
have reused the original DOI rather than mint their own.</p>\n\n<p>Furthermore,
while the original DOI still resolves in a web browser, I can''t retrieve
metadata about that DOI from CrossRef, so any attempt to build upon that DOI
fails. However, I can retrieve metadata for the Wiley DOI, i.e. <a href=\"https://api.crossref.org/v1/works/10.1002/j.1537-2197.1940.tb14217.x\">https://api.crossref.org/v1/works/10.1002/j.1537-2197.1940.tb14217.x</a>
works, but <a href=\"https://api.crossref.org/v1/works/10.2307/2436688\">https://api.crossref.org/v1/works/10.2307/2436688</a>
doesn''t.</p>\n\n<h2>Why does this matter?</h2>\n\n<p>For anyone using DOIs
as stable links to the literature the persistence of DOIs is something you
should be able to rely upon, both for people clicking on links in web browsers
and developers getting metadata from those DOIs. The whole rationale of the
DOI system is a single, globally unique identifier for each article, and that
these DOIs persist even when the publisher of the content changes. If this
property doesn''t hold, then why would a developer such as myself invest effort
in linking using DOIs?</p>\n\n<p>Just for the record, I think CrossRef is
great and is a hugely important part of the scholarly landscape. There are
lots of things that I do that would be nearly impossible without CrossRef
and its tools. But cases like this where we get massive duplication of DOIs
when a publishers takes over an existing journal fundamentally breaks the
underlying model of stable, persistent identifiers.</p>","tags":["CrossRef","DOI","duplicates"],"language":"en","references":null},{"id":"https://doi.org/10.59350/d3dc0-7an69","uuid":"545c177f-cea5-4b79-b554-3ccae9c789d7","url":"https://iphylo.blogspot.com/2021/10/reflections-on-macroscope-tool-for-21st.html","title":"Reflections
on \"The Macroscope\" - a tool for the 21st Century?","summary":"This is a
guest post by Tony Rees. It would be difficult to encounter a scientist, or
anyone interested in science, who is not familiar with the microscope, a tool
for making objects visible that are otherwise too small to be properly seen
by the unaided eye, or to reveal otherwise invisible fine detail in larger
objects. A select few with a particular interest in microscopy may also have
encountered the Wild-Leica \"Macroscope\", a specialised type of benchtop
microscope optimised for...","date_published":"2021-10-07T12:38:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<p><img src=\"https://lh3.googleusercontent.com/-A99btr6ERMs/Vl1Wvjp2OtI/AAAAAAAAEFI/7bKdRjNG5w0/ytNkVT2U.jpg?imgmax=800\"
alt=\"YtNkVT2U\" title=\"ytNkVT2U.jpg\" border=\"0\" width=\"128\" height=\"128\"
style=\"float:right;\" /> This is a guest post by <a href=\"https://about.me/TonyRees\">Tony
Rees</a>.</p>\n\n<p>It would be difficult to encounter a scientist, or anyone
interested in science, who is not familiar with the microscope, a tool for
making objects visible that are otherwise too small to be properly seen by
the unaided eye, or to reveal otherwise invisible fine detail in larger objects.
A select few with a particular interest in microscopy may also have encountered
the Wild-Leica \"Macroscope\", a specialised type of benchtop microscope optimised
for low-power macro-photography. However in this overview I discuss the \"Macroscope\"
in a different sense, which is that of the antithesis to the microscope: namely
a method for visualizing subjects too large to be encompassed by a single
field of vision, such as the Earth or some subset of its phenomena (the biosphere,
for example), or conceptually, the universe.</p>\n\n<p><div class=\"separator\"
style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-xXxH9aJn1tY/YV7vnKGQWPI/AAAAAAAAgyQ/7CSJJ663Ry4IXtav54nLcLiI0kda5_L7ACLcBGAsYHQ/s500/2020045672.jpg\"
style=\"display: block; padding: 1em 0; text-align: center; clear: right;
float: right;\"><img alt=\"\" border=\"0\" height=\"320\" data-original-height=\"500\"
data-original-width=\"303\" src=\"https://1.bp.blogspot.com/-xXxH9aJn1tY/YV7vnKGQWPI/AAAAAAAAgyQ/7CSJJ663Ry4IXtav54nLcLiI0kda5_L7ACLcBGAsYHQ/s320/2020045672.jpg\"/></a></div>My
introduction to the term was via addresses given by Jesse Ausubel in the formative
years of the 2001-2010 <a href=\"http://www.coml.org\">Census of Marine Life</a>,
for which he was a key proponent. In Ausubel''s view, the Census would perform
the function of a macroscope, permitting a view of everything that lives in
the global ocean (or at least, that subset which could realistically be sampled
in the time frame available) as opposed to more limited subsets available
via previous data collection efforts. My view (which could, of course, be
wrong) was that his thinking had been informed by a work entitled \"Le macroscope,
vers une vision globale\" published in 1975 by the French thinker Joël de
Rosnay, who had expressed such a concept as being globally applicable in many
fields, including the physical and natural worlds but also extending to human
society, the growth of cities, and more. Yet again, some ecologists may also
have encountered the term, sometimes in the guise of \"Odum''s macroscope\",
as an approach for obtaining \"big picture\" analyses of macroecological processes
suitable for mathematical modelling, typically by elimination of fine detail
so that only the larger patterns remain, as initially advocated by Howard
T. Odum in his 1971 book \"Environment, Power, and Society\".</p>\n\n<p>From
the standpoint of the 21st century, it seems that we are closer to achieving
a \"macroscope\" (or possibly, multiple such tools) than ever before, based
on the availability of existing and continuing new data streams, improved
technology for data assembly and storage, and advanced ways to query and combine
these large streams of data to produce new visualizations, data products,
and analytical findings. I devote the remainder of this article to examples
where either particular workers have employed \"macroscope\" terminology to
describe their activities, or where potentially equivalent actions are taking
place without the explicit \"macroscope\" association, but are equally worthy
of consideration. To save space here, references cited here (most or all)
can be found via a Wikipedia article entitled \"<a href=\"https://en.wikipedia.org/wiki/Macroscope_(science_concept)\">Macroscope
(science concept)</a>\" that I authored on the subject around a year ago,
and have continued to add to on occasion as new thoughts or information come
to hand (see <a href=\"https://en.wikipedia.org/w/index.php?title=Macroscope_(science_concept)&offset=&limit=500&action=history\">edit
history for the article</a>).</p>\n\n<p>First, one can ask, what constitutes
a macroscope, in the present context? In the Wikipedia article I point to
a book \"Big Data - Related Technologies, Challenges and Future Prospects\"
by Chen <em>et al.</em> (2014) (<a href=\"https://doi.org/10.1007/978-3-319-06245-7\">doi:10.1007/978-3-319-06245-7</a>),
in which the \"value chain of big data\" is characterised as divisible into
four phases, namely data generation, data acquisition (aka data assembly),
data storage, and data analysis. To my mind, data generation (which others
may term acquisition, differently from the usage by Chen <em>et al.</em>)
is obviously the first step, but does not in itself constitute the macroscope,
except in rare cases - such as Landsat imagery, perhaps - where on its own,
a single co-ordinated data stream is sufficient to meet the need for a particular
type of \"global view\". A variant of this might be a coordinated data collection
program - such as that of the ten year Census of Marine Life - which might
produce the data required for the desired global view; but again, in reality,
such data are collected in a series of discrete chunks, in many and often
disparate data formats, and must be \"wrangled\" into a more coherent whole
before any meaningful \"macroscope\" functionality becomes available.</p>\n\n<p>Here
we come to what, in my view, constitutes the heart of the \"macroscope\":
an intelligently organized (i.e. indexable and searchable), coherent data
store or repository (where \"data\" may include imagery and other non-numeric
data forms, but much else besides). Taking the Census of Marine Life example,
the data repository for that project''s data (plus other available sources
as inputs) is the <a href=\"https://obis.org\">Ocean Biodiversity Information
System</a> or OBIS (previously the Ocean Biogeographic Information System),
which according to this view forms the \"macroscope\" for which the Census
data is a feed. (For non habitat-specific biodiversity data, <a href=\"https://www.gbif.org\">GBIF</a>
is an equivalent, and more extensive, operation). Other planetary scale \"macroscopes\",
by this definition (which may or may not have an explicit geographic, i.e.
spatial, component) would include inventories of biological taxa such as the
<a href=\"https://www.catalogueoflife.org\">Catalogue of Life</a> and so on,
all the way back to the pioneering compendia published by Linnaeus in the
eighteenth century; while for cartography and topographic imagery, the current
\"blockbuster\" of <a href=\"http://earth.google.com\">Google Earth</a> and
its predecessors also come well into public consciousness.</p>\n\n<p>In the
view of some workers and/or operations, both of these phases are precursors
to the real \"work\" of the macroscope which is to reveal previously unseen
portions of the \"big picture\" by means either of the availability of large,
synoptic datasets, or fusion between different data streams to produce novel
insights. Companies such as IBM and Microsoft have used phraseology such as:</p>\n\n<blockquote>By
2022 we will use machine-learning algorithms and software to help us organize
information about the physical world, helping bring the vast and complex data
gathered by billions of devices within the range of our vision and understanding.
We call this a \"macroscope\" – but unlike the microscope to see the very
small, or the telescope that can see far away, it is a system of software
and algorithms to bring all of Earth''s complex data together to analyze it
by space and time for meaning.\" (IBM)</blockquote>\n\n<blockquote>As the
Earth becomes increasingly instrumented with low-cost, high-bandwidth sensors,
we will gain a better understanding of our environment via a virtual, distributed
whole-Earth \"macroscope\"... Massive-scale data analytics will enable real-time
tracking of disease and targeted responses to potential pandemics. Our virtual
\"macroscope\" can now be used on ourselves, as well as on our planet.\" (Microsoft)
(references available via the Wikipedia article cited above).</blockquote>\n\n<p>Whether
or not the analytical capabilities described here are viewed as being an integral
part of the \"macroscope\" concept, or are maybe an add-on, is ultimately
a question of semantics and perhaps, personal opinion. Continuing the Census
of Marine Life/OBIS example, OBIS offers some (arguably rather basic) visualization
and summary tools, but also makes its data available for download to users
wishing to analyse it further according to their own particular interests;
using OBIS data in this manner, Mark Costello et al. in 2017 were able to
demarcate a finite number of data-supported marine biogeographic realms for
the first time (Costello et al. 2017: Nature Communications. 8: 1057. <a href=\"https://doi.org/10.1038/s41467-017-01121-2\">doi:10.1038/s41467-017-01121-2</a>),
a project which I was able to assist in a small way in an advisory capacity.
In a case such as this, perhaps the final function of the macroscope, namely
data visualization and analysis, was outsourced to the authors'' own research
institution. Similarly at an earlier phase, \"data aggregation\" can also
be virtual rather than actual, i.e. avoiding using a single physical system
to hold all the data, enabled by open web mapping standards WMS (web map service)
and WFS (web feature service) to access a set of distributed data stores,
e.g. as implemented on the portal for the <a href=\"https://portal.aodn.org.au/\">Australian
Ocean Data Network</a>.</p>\n\n<p>So, as we pass through the third decade
of the twenty-first century, what developments await us in the \"macroscope\"
area? In the biodiversity space, one can reasonably presume that the existing
\"macroscopic\" data assembly projects such as OBIS and GBIF will continue,
and hopefully slowly fill current gaps in their coverage - although in the
marine area, strategic new data collection exercises may be required (Census
2020, or 2025, anyone?), while (again hopefully), the Catalogue of Life will
continue its progress towards a \"complete\" species inventory for the biosphere.
The Landsat project, with imagery dating back to 1972, continues with the
launch of its latest satellite Landsat 9 just this year (21 September 2021)
with a planned mission duration for the next 5 years, so the \"macroscope\"
functionality of that project seems set to continue for the medium term at
least. Meanwhile the ongoing development of sensor networks, both on land
and in the ocean, offers an exciting new method of \"instrumenting the earth\"
to obtain much more real time data than has ever been available in the past,
offering scope for many more, use case-specific \"macroscopes\" to be constructed
that can fuse (e.g.) satellite imagery with much more that is happening at
a local level.</p>\n\n<p>So, the \"macroscope\" concept appears to be alive
and well, even though the nomenclature can change from time to time (IBM''s
\"Macroscope\", foreshadowed in 2017, became the \"IBM Pairs Geoscope\" on
implementation, and is now simply the \"Geospatial Analytics component within
the IBM Environmental Intelligence Suite\" according to available IBM publicity
materials). In reality this illustrates a new dichotomy: even if \"everyone\"
in principle has access to huge quantities of publicly available data, maybe
only a few well funded entities now have the computational ability to make
sense of it, and can charge clients a good fee for their services...</p>\n\n<p>I
present this account partly to give a brief picture of \"macroscope\" concepts
today and in the past, for those who may be interested, and partly to present
a few personal views which would be out of scope in a \"neutral point of view\"
article such as is required on Wikipedia; also to see if readers of this blog
would like to contribute further to discussion of any of the concepts traversed
herein.</p>","tags":["guest post","macroscope"],"language":"en","references":null},{"id":"https://doi.org/10.59350/2b1j9-qmw12","uuid":"37538c38-66e6-4ac4-ab5c-679684622ade","url":"https://iphylo.blogspot.com/2022/05/round-trip-from-identifiers-to.html","title":"Round
trip from identifiers to citations and back again","summary":"Note to self
(basically rewriting last year''s Finding citations of specimens). Bibliographic
data supports going from identifier to citation string and back again, so
we can do a \"round trip.\" 1. Given a DOI we can get structured data with
a simple HTTP fetch, then use a tool such as citation.js to convert that data
into a human-readable string in a variety of formats. Identifier ⟶ Structured
data ⟶ Human readable string 10.7717/peerj-cs.214 HTTP with...","date_published":"2022-05-27T16:34:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<p>Note to self (basically
rewriting last year''s <a href=\"https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html\">Finding
citations of specimens</a>).</p>\n\n<p>Bibliographic data supports going from
identifier to citation string and back again, so we can do a \"round trip.\"</p>\n\n<h2>1.</h2>\n\n<p>Given
a DOI we can get structured data with a simple HTTP fetch, then use a tool
such as <a href=\"https://citation.js.org\">citation.js</a> to convert that
data\n</th>\n<th>\n⟶\n</th>\n<th>\nHuman readable string\n</th>\n</tr>\n<tr>\n<td>\n10.7717/peerj-cs.214\n</td>\n<td>\nHTTP
data\n</th>\n<th>\n⟶\n</th>\n<th>\nHuman readable string\n</th>\n<tr>\n<tr>\n<td>\n10.7717/peerj-cs.214\n</td>\n<td>\nHTTP
with content-negotiation\n</td>\n<td>\nCSL-JSON\n</td>\n<td>\nCSL templates\n</td>\n<td
width=\"25%\">\nWillighagen, L. G. (2019). Citation.js: a format-independent,
modular bibliography tool for the browser and command line. PeerJ Computer
Science, 5, e214. https://doi.org/10.7717/peerj-cs.214\n</td>\n</tr>\n</table>\n\n<h2>2.</h2>\n\n<p>Going
in the reverse direction (string to identifier) is a little more challenging.
In the \"old days\" a typical strategy was to attempt to parse the citation
string into structured data (see <a href=\"https://anystyle.io\">AnyStyle</a>
for a nice example of this), then we could extract a tuple of (journal, volume,
starting page) and use that to query CrossRef to find if there was an article
with that tuple, which gave us the DOI.</p>\n\n<table>\n<tr>\n<th>\nIdentifier\n</th>\n<th>\n⟵\n</th>\n<th>\nStructured
data\n</th>\n<th>\n⟵\n</th>\n<th>\nHuman readable string\n</th>\n</tr>\n<tr>\n<td>\n10.7717/peerj-cs.214\n</td>\n<td>\nOpenURL
query\n</td>\n<td>\njournal, volume, start page\n</td>\n<td>\nCitation parser
\n</td>\n<td width=\"25%\">\nWillighagen, L. G. (2019). Citation.js: a format-independent,
modular bibliography tool for the browser and command line. PeerJ Computer
Science, 5, e214. https://doi.org/10.7717/peerj-cs.214\n</td>\n</tr>\n</table>\n\n<h2>3.</h2>\n\n<p>Another
strategy is to take all the citations strings for each DOI, index those in
a search engine, then just use a simple search to find the best match to your
citation string, and hence the DOI. This is what <a href=\"https://search.crossref.org\">https://search.crossref.org</a>
does.</p>\n\n<table>\n<tr>\n<th>\nIdentifier\n</th>\n<th>\n⟵\n</th>\n<th>\nHuman
readable string\n</th>\n</tr>\n<tr>\n<td>\n10.7717/peerj-cs.214\n</td>\n<td>\nsearch\n</td>\n<td
width=\"50%\">\nWillighagen, L. G. (2019). Citation.js: a format-independent,
modular bibliography tool for the browser and command line. PeerJ Computer
Science, 5, e214. https://doi.org/10.7717/peerj-cs.214\n</td>\n</tr>\n</table>\n\n<p>At
the moment my work on material citations (i.e., lists of specimens in taxonomic
papers) is focussing on 1 (generating citations from specimen data in GBIF)
and 2 (parsing citations into structured data).</p>","tags":["citation","GBIF","material
examined","specimen codes"],"language":"en","references":null},{"id":"https://doi.org/10.59350/3s376-6bm21","uuid":"62e7b438-67a3-44ac-a66d-3f5c278c949e","url":"https://iphylo.blogspot.com/2022/02/deduplicating-bibliographic-data.html","title":"Deduplicating
bibliographic data","summary":"There are several instances where I have a
collection of references that I want to deduplicate and merge. For example,
in Zootaxa has no impact factor I describe a dataset of the literature cited
by articles in the journal Zootaxa. This data is available on Figshare (https://doi.org/10.6084/m9.figshare.c.5054372.v4),
as is the equivalent dataset for Phytotaxa (https://doi.org/10.6084/m9.figshare.c.5525901.v1).
Given that the same articles may be cited many times, these datasets have
lots of...","date_published":"2022-02-03T15:09:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<p>There are several instances
where I have a collection of references that I want to deduplicate and merge.
For example, in <a href=\"https://iphylo.blogspot.com/2020/07/zootaxa-has-no-impact-factor.html\">Zootaxa
has no impact factor</a> I describe a dataset of the literature cited by articles
in the journal <i>Zootaxa</i>. This data is available on Figshare (<a href=\"https://doi.org/10.6084/m9.figshare.c.5054372.v4\">https://doi.org/10.6084/m9.figshare.c.5054372.v4</a>),
as is the equivalent dataset for <i>Phytotaxa</i> (<a href=\"https://doi.org/10.6084/m9.figshare.c.5525901.v1\">https://doi.org/10.6084/m9.figshare.c.5525901.v1</a>).
Given that the same articles may be cited many times, these datasets have
lots of duplicates. Similarly, articles in <a href=\"https://species.wikimedia.org\">Wikispecies</a>
often have extensive lists of references cited, and the same reference may
appear on multiple pages (for an initial attempt to extract these references
see <a href=\"https://doi.org/10.5281/zenodo.5801661\">https://doi.org/10.5281/zenodo.5801661</a>
and <a href=\"https://github.com/rdmpage/wikispecies-parser\">https://github.com/rdmpage/wikispecies-parser</a>).</p>\n\n<p>There
are several reasons I want to merge these references. If I want to build a
citation graph for <i>Zootaxa</i> or <i>Phytotaxa</i> I need to merge references
that are the same so that I can accurately count citations. I am also interested
in harvesting the metadata to help find those articles in the <a href=\"https://www.biodiversitylibrary.org\">Biodiversity
Heritage Library</a> (BHL), and the literature cited section of scientific
articles is a potential goldmine of bibliographic metadata, as is Wikispecies.</p>\n\n<p>After
various experiments and false starts I''ve created a repository <a href=\"https://github.com/rdmpage/bib-dedup\">https://github.com/rdmpage/bib-dedup</a>
to host a series of PHP scripts to deduplicate bibliographic data. I''ve
settled on using CSL-JSON as the format for bibliographic data. Because deduplication
relies on comparing pairs of references, the standard format for most of the
scripts is a JSON array containing a pair of CSL-JSON objects to compare.
Below are the steps the code takes.</p>\n\n<h2>Generating pairs to compare</h2>\n\n<p>The
first step is to take a list of references and generate the pairs that will
be compared. I started with this approach as I wanted to explore machine learning
and wanted a simple format for training data, such as an array of two CSL-JSON
objects and an integer flag representing whether the two references were the
same or different.</p>\n\n<p>There are various ways to generate CSL-JSON for
a reference. I use a tool I wrote (see <a href=\"https://iphylo.blogspot.com/2021/07/citation-parsing-released.html\">Citation
parsing tool released</a>) that has a simple API where you parse one or more
references and it returns that reference as structured data in CSL-JSON.</p>\n\n<p>Attempting
to do all possible pairwise comparisons rapidly gets impractical as the number
of references increases, so we need some way to restrict the number of comparisons
we make. One approach I''ve explored is the “sorted neighbourhood method”
where we sort the references (for example by their title) then move a sliding
window down the list of references, comparing all references within that window.
This greatly reduces the number of pairwise comparisons. So the first step
is to sort the references, then run a sliding window over them, output all
the pairs in each window (ignoring pairwise comparisons already made in
a previous window). Other methods of \"blocking\" could also be used, such
as only including references in a particular year, or a particular journal.</p>\n\n<p>So,
the output of this step is a set of JSON arrays, each with a pair of references
in CSL-JSON format. Each array is stored on a single line in the same file
in <a href=\"https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON_2\">line-delimited
JSON</a> (JSONL).</p>\n\n<h2>Comparing pairs</h2>\n\n<p>The next step is to
compare each pair of references and decide whether they are a match or not.
Initially I explored a machine learning approach used in the following paper:</p>\n\n<blockquote>\nWilson
DR. 2011. Beyond probabilistic record linkage: Using neural networks and complex
features to improve genealogical record linkage. In: The 2011 International
Joint Conference on Neural Networks. 9–14. DOI: <a href=\"https://doi.org/10.1109/IJCNN.2011.6033192\">10.1109/IJCNN.2011.6033192</a>\n</blockquote>\n\n<p>Initial
experiments using <a href=\"https://github.com/jtet/Perceptron\">https://github.com/jtet/Perceptron</a>
were promising and I want to play with this further, but I decided to skip
this for now and just use simple string comparison. So for each CSL-JSON object
I generate a citation string in the same format using CiteProc, then compute
the <a href=\"https://en.wikipedia.org/wiki/Levenshtein_distance\">Levenshtein
distance</a> between the two strings. By normalising this distance by the
length of the two strings being compared I can use an arbitrary threshold
to decide if the references are the same or not.</p>\n\n<h2>Clustering</h2>\n\n<p>For
this step we read the JSONL file produced above and record whether the two
references are a match or not. Assuming each reference has a unique identifier
(it need only be unique within the file) then we can use those identifiers to
record the clusters each reference belongs to. I do this using a <a href=\"https://en.wikipedia.org/wiki/Disjoint-set_data_structure\">Disjoint-set
data structure</a>. For each reference start with a graph where each node
represents a reference, and each node has a pointer to a parent node. Initially
the reference is its own parent. A simple implementation is to have an array
indexed by reference identifiers, where the value of each cell in the array
is the node''s parent.</p>\n\n<p>As we discover pairs we update the parents
of the nodes to reflect this, such that once all the comparisons are done
we have one or more sets of clusters corresponding to the references that
we think are the same. Another way to think of this is that we are getting
the components of a graph where each node is a reference and each pair of references
that match are connected by an edge.</p>\n\n<p>In the code I''m using I write
this graph in <a href=\"https://en.wikipedia.org/wiki/Trivial_Graph_Format\">Trivial
Graph Format</a> (TGF) which can be visualised using a tools such as <a href=\"https://www.yworks.com/products/yed\">yEd</a>.</p>\n\n<h2>Merging</h2>\n\n<p>Now
that we have a graph representing the sets of references that we think are
the same we need to merge them. This is where things get interesting as the
references are similar (by definition) but may differ in some details. The
paper below describes a simple Bayesian approach for merging records:</p>\n\n<blockquote>\nCouncill
IG, Li H, Zhuang Z, Debnath S, Bolelli L, Lee WC, Sivasubramaniam A, Giles
CL. 2006. Learning Metadata from the Evidence in an On-line Citation Matching
Scheme. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital
Libraries. JCDL ’06. New York, NY, USA: ACM, 276–285. DOI: <a href=\"https://doi.org/10.1145/1141753.1141817\">10.1145/1141753.1141817</a>.\n</blockquote>\n\n<p>So
the next step is to read the graph with the clusters, generate the sets of
bibliographic references that correspond to each cluster, then use the method
described in Councill et al. to produce a single bibliographic record for
that cluster. These records could then be used to, say, locate the corresponding
article in BHL, or populate Wikidata with missing references.</p>\n\n<p>Obviously
there is always the potential for errors, such as trying to merge references
that are not the same. As a quick and dirty check I flag as dubious any cluster
where the page numbers vary among members of the cluster. More sophisticated
checks are possible, especially if I go down the ML route (i.e., I would have
evidence for the probability that the same reference can disagree on some
aspects of metadata).</p>\n\n<h2>Summary</h2>\n\n<p>At this stage the code
is working well enough for me to play with and explore some example datasets.
The focus is on structured bibliographic metadata, but I may simplify things
and have a version that handles simple string matching, for example to cluster
together different abbreviations of the same journal name.</p>","tags":["data
cleaning","deduplication","Phytotaxa","Wikispecies","Zootaxa"],"language":"en","references":null},{"id":"https://doi.org/10.59350/ndtkv-6ve80","uuid":"e8e95aaf-bacb-4b5a-bf91-54e903526ab2","url":"https://iphylo.blogspot.com/2021/11/revisiting-rss-to-monitor-latests.html","title":"Revisiting
RSS to monitor the latest taxonomic research","summary":"Over a decade ago
RSS (RDF Site Summary or Really Simple Syndication) was attracting a lot of
interest as a way to integrate data across various websites. Many science
publishers would provide a list of their latest articles in XML in one of
three flavours of RSS (RDF, RSS, Atom). This led to tools such as uBioRSS
[1] and my own e-Biosphere Challenge: visualising biodiversity digitisation
in real time. It was a time of enthusiasm for aggregating lots of data, such
as the ill-fated PLoS...","date_published":"2021-11-23T20:53:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
both;\"><a href=\"https://1.bp.blogspot.com/-dsij6_nhdsc/SgHYU5MCwsI/AAAAAAAAAe8/9KN6Gm87sj0PCuJG-crvZoMbL8MJvusegCPcBGAYYCw/s257/feedicon.png\"
style=\"display: block; padding: 1em 0; text-align: center; clear: right;
float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"257\"
data-original-width=\"257\" src=\"https://1.bp.blogspot.com/-dsij6_nhdsc/SgHYU5MCwsI/AAAAAAAAAe8/9KN6Gm87sj0PCuJG-crvZoMbL8MJvusegCPcBGAYYCw/s200/feedicon.png\"/></a></div>\n<p>Over
a decade ago <a href=\"https://en.wikipedia.org/wiki/RSS\">RSS</a> (RDF Site
Summary or Really Simple Syndication) was attracting a lot of interest as
a way to integrate data across various websites. Many science publishers would
provide a list of their latest articles in XML in one of three flavours of
RSS (RDF, RSS, Atom). This led to tools such as <a href=\"http://ubio.org/rss/\">uBioRSS</a>
[<a href=\"#Leary2007\">1</a>] and my own <a href=\"https://iphylo.blogspot.com/2009/05/e-biosphere-challenge-visualising.html\">e-Biosphere
Challenge: visualising biodiversity digitisation in real time</a>. It was
a time of enthusiasm for aggregating lots of data, such as the <a href=\"https://iphylo.blogspot.com/2013/07/the-demise-of-plos-biodiversity-hub.html\">ill-fated</a>
PLoS Biodiversity Hub [<a href=\"#Mindell2011\">2</a>].</p>\n\n<p>Since I
seem to be condemned to revisit old ideas rather than come up with anything
new, I''ve been looking at providing a tool like the now defunct uBioRSS.
The idea is to harvest RSS feeds from journals (with an emphasis on taxonomic
and systematic journals), aggregate the results, and make them browsable by
taxon and geography. Here''s a sneak peek:</p>\n\n<div class=\"separator\"
style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-toYkBpS81tE/YZ1VY1DU1FI/AAAAAAAAg4E/yFMM4Xc3AEE8BjCj0jKO0sLtT9ZI-3k8ACLcBGAsYHQ/s1032/Screenshot%2B2021-11-23%2Bat%2B20.00.06.png\"
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
border=\"0\" width=\"400\" data-original-height=\"952\" data-original-width=\"1032\"
src=\"https://1.bp.blogspot.com/-toYkBpS81tE/YZ1VY1DU1FI/AAAAAAAAg4E/yFMM4Xc3AEE8BjCj0jKO0sLtT9ZI-3k8ACLcBGAsYHQ/s400/Screenshot%2B2021-11-23%2Bat%2B20.00.06.png\"/></a></div>\n\n<p>What
seems like a straightforward task quickly became a bit of a challenge. Not
all journals have RSS feeds (they seem to have become less widely supported
over time) so I need to think of alternative ways to get lists of recent articles.
These lists also need to be processed in various ways. There are three versions
of RSS, each with their own idiosyncrasies, so I need to standardise things
like dates. I also want to augment them with things like DOIs (often missing
from RSS feeds) and thumbnails for the articles (often available on publisher
websites but not the feeds). Then I need to index the content by taxon and
geography. For taxa I use a version of Patrick Leary''s \"taxonfinder\" (see
<a href=\"https://right-frill.glitch.me\">https://right-frill.glitch.me</a>)
to find names, then the <a href=\"https://index.globalnames.org\">Global Names
Index</a> to assign names found to the GBIF taxonomic hierarchy.</p>\n\n<p>Indexing
by geography proved harder. Typically <a href=\"https://en.wikipedia.org/wiki/Toponym_resolution#Geoparsing\">geoparsing</a>
involves taking a body of text and doing the following:\n<ul><li>Using named-entity
recognition <a href=\"https://en.wikipedia.org/wiki/Named-entity_recognition\">NER</a>
to identify named entities in the text (e.g., place names, people names, etc.).</li>\n<li>Using
a gazetteer of geographic names <a href=\"http://www.geonames.org\">GeoNames</a>
to try and match the place names found by NER.</li>\n</ul></p>\n\n<p>An example
of such a parser is the <a href=\"https://www.ltg.ed.ac.uk/software/geoparser/\">Edinburgh
Geoparser</a>. Typically geoparsing software can be large and tricky to install,
especially if you are looking to make your installation publicly accessible.
Geoparsing services seem to have a short half-life (e.g., <a href=\"https://geoparser.io\">Geoparser.io</a>),
perhaps because they are so useful they quickly get swamped by users.</p>\n\n<p>Bearing
this in mind, the approach I’ve taken here is to create a very simple geoparser
that is focussed on fairly large areas, especially those relevant to biodiversity,
and is aimed at geoparsing text such as abstracts of scientific papers. I''ve
created a small database of places by harvesting data from Wikidata, then
I use the \"flash text\" algorithm [<a href=\"#Singh2017\">3</a>] to find
geographic places. This approach uses a <a href=\"https://en.wikipedia.org/wiki/Trie\">trie</a>
to store the place names. All I do is walk through the text seeing whether
the current word matches a place name (or the start of one) in the trie, then
moving on. This is very quick and seems to work quite well.</p>\n\n<p>Given
that I need to aggregate data from a lot of sources, apply various transformations
to that data, then merge it, there are a lot of moving parts. I started playing
with a \"NoCode\" platform for creating workflows, in this case <a href=\"https://n8n.io\">n8n</a>
(in many ways reminiscent of the now defunct <a href=\"https://en.wikipedia.org/wiki/Yahoo!_Pipes\">Yahoo
Pipes</a>). This was quite fun for a while, but after lots of experimentation
I moved back to writing code to aggregate the data into a CouchDB database.
CouchDB is one of the NoSQL databases that I really like as it has a great
interface, and makes queries very easy to do once you get your head around
how it works.</p>\n\n<p>So the end result of this is \"BioRSS\" <a href=\"https://biorss.herokuapp.com\">https://biorss.herokuapp.com</a>.
The interface comprises a stream of articles listed from newest to oldest,
with a treemap and a geographic map on the left. You can use these to filter
the articles by taxonomic group and/or country. For example the screen shot
is showing arthropods from China (in this case from a month or two ago in
the journal <i>ZooKeys</i>). As much fun as the interface has been to construct,
|
1283
|
+
in many ways I don''t really want to spend time making an interface. For each
|
1284
|
+
combination of taxon and country I provide a RSS feed so if you have a favour
|
1285
|
+
feed reader you can grab the feed and view it there. As BioRSS updates the
|
1286
|
+
data your feed reader should automatically update the feed. This means that
|
1287
|
+
you can have a feed that monitors, say, new papers on spiders in China.</p>\n\n<p>In
|
1288
|
+
the spirit of \"release early and release often\" this is an early version
|
1289
|
+
of this app. I need to add a lot more feeds, back date them to bring in older
|
1290
|
+
content, and I also want to make use of aggregators such as PubMed, CrossRef,
|
1291
|
+
and Google Scholar. The existence of these tools is, I suspect, one reason
|
1292
|
+
why RSS feeds are less common than they used to be.</p>\n\n<p>So, if this
|
1293
|
+
sounds useful please take it for a spin at <a href=\"https://biorss.herokuapp.com\">https://biorss.herokuapp.com</a>.
|
1294
|
+
Feedback is welcome, especially suggestions for journals to harvest and add
|
1295
|
+
to the news feed. Ultimately I''d like to have sufficient coverage of the
|
1296
|
+
taxonomic literature so that BioRSS becomes a place where we can go to find
|
1297
|
+
the latest papers on any taxon of interest.</p>\n\n<h2>References</h2>\n\n<blockquote>\n<a
|
1298
|
+
name=\"Leary2007\">1.</a> Patrick R. Leary, David P. Remsen, Catherine N.
|
1299
|
+
Norton, David J. Patterson, Indra Neil Sarkar, uBioRSS: Tracking taxonomic
|
1300
|
+
literature using RSS, Bioinformatics, Volume 23, Issue 11, June 2007, Pages
|
1301
|
+
1434–1436, <a href=\"https://doi.org/10.1093/bioinformatics/btm109\">https://doi.org/10.1093/bioinformatics/btm109</a>\n</blockquote>\n\n<blockquote><a
|
1302
|
+
name=\"Mindell2011\">2.</a> Mindell, D. P., Fisher, B. L., Roopnarine, P.,
|
1303
|
+
Eisen, J., Mace, G. M., Page, R. D. M., & Pyle, R. L. (2011). Aggregating,
|
1304
|
+
Tagging and Integrating Biodiversity Research. PLoS ONE, 6(8), e19491. <a
|
1305
|
+
href=\"https://doi.org/10.1371/journal.pone.0019491\">doi:10.1371/journal.pone.0019491</a>\n</blockquote>\n\n<blockquote><a
|
1306
|
+
name=\"Singh2017\">3.</a> Singh, V. (2017). Replace or Retrieve Keywords In
|
1307
|
+
Documents at Scale. CoRR, abs/1711.00046. <a href=\"http://arxiv.org/abs/1711.00046\">http://arxiv.org/abs/1711.00046</a>\n\n</blockquote>","tags":["geocoding","NoCode","RSS"],"language":"en","references":[{"doi":"https://doi.org/10.1093/bioinformatics/btm109","key":"ref1"},{"doi":"https://doi.org/10.1371/journal.pone.0019491","key":"ref2"},{"key":"ref3","url":"http://arxiv.org/abs/1711.00046"}]},{"id":"https://doi.org/10.59350/gf1dw-n1v47","uuid":"a41163e0-9c9a-41e0-a141-f772663f2f32","url":"https://iphylo.blogspot.com/2023/03/dugald-stuart-page-1936-2022.html","title":"Dugald
|
1308
|
+
Stuart Page 1936-2022","summary":"My dad died last weekend. Below is a notice
|
1309
|
+
in today''s New Zealand Herald. I''m in New Zealand for his funeral. Don''t
|
1310
|
+
really have the words for this right now.","date_published":"2023-03-14T03:00:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
|
1311
|
+
(Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
|
1312
|
+
both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZweukxntl7R5jnk3knVFVrqZ5RxC7mPZBV4gKeDIglbFzs2O442nbxqs8t8jV2tLqCU24K6gS32jW-Pe8q3O_5JR1Ms3qW1aQAZ877cKkFfcUydqUba9HsgNlX-zS9Ne92eLxRGS8F-lStTecJw2oalp3u58Yoc0oM7CUin5LKPeFIJ7Rzg/s3454/_DSC5106.jpg\"
|
1313
|
+
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
1314
|
+
border=\"0\" width=\"400\" data-original-height=\"2582\" data-original-width=\"3454\"
|
1315
|
+
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZweukxntl7R5jnk3knVFVrqZ5RxC7mPZBV4gKeDIglbFzs2O442nbxqs8t8jV2tLqCU24K6gS32jW-Pe8q3O_5JR1Ms3qW1aQAZ877cKkFfcUydqUba9HsgNlX-zS9Ne92eLxRGS8F-lStTecJw2oalp3u58Yoc0oM7CUin5LKPeFIJ7Rzg/s400/_DSC5106.jpg\"/></a></div>\n\nMy
|
1316
|
+
dad died last weekend. Below is a notice in today''s New Zealand Herald. I''m
|
1317
|
+
in New Zealand for his funeral. Don''t really have the words for this right
|
1318
|
+
now.\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRUTOFF1VWHCl8dg3FQuaWy5LM7aX8IivdRpTtzgrdQTEymsA5bLTZE3cSQf1WQIP3XrC46JsLScP8BxTK9C5a-B1i51yg8WGSJD0heJVaoDLnerv0lD1o3qloDjqEuuyfX4wagHB5YYBmjWnGeVQvyYVngvDDf9eM6pmMtZ7x94Y4jSVrug/s3640/IMG_2870.jpeg\"
|
1319
|
+
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
1320
|
+
border=\"0\" height=\"320\" data-original-height=\"3640\" data-original-width=\"1391\"
|
1321
|
+
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRUTOFF1VWHCl8dg3FQuaWy5LM7aX8IivdRpTtzgrdQTEymsA5bLTZE3cSQf1WQIP3XrC46JsLScP8BxTK9C5a-B1i51yg8WGSJD0heJVaoDLnerv0lD1o3qloDjqEuuyfX4wagHB5YYBmjWnGeVQvyYVngvDDf9eM6pmMtZ7x94Y4jSVrug/s320/IMG_2870.jpeg\"/></a></div>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/c79vq-7rr11","uuid":"3cb94422-5506-4e24-a41c-a250bb521ee0","url":"https://iphylo.blogspot.com/2021/12/graphql-for-wikidata-wikicite.html","title":"GraphQL
|
1322
|
+
for WikiData (WikiCite)","summary":"I''ve released a very crude GraphQL endpoint
|
1323
|
+
for WikiData. More precisely, the endpoint is for a subset of the entities
|
1324
|
+
that are of interest to WikiCite, such as scholarly articles, people, and
|
1325
|
+
journals. There is a crude demo at https://wikicite-graphql.herokuapp.com.
|
1326
|
+
The endpoint itself is at https://wikicite-graphql.herokuapp.com/gql.php.
|
1327
|
+
There are various ways to interact with the endpoint, personally I like the
|
1328
|
+
Altair GraphQL Client by Samuel Imolorhe. As I''ve mentioned earlier it''s
|
1329
|
+
taken...","date_published":"2021-12-20T13:16:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
|
1330
|
+
(Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
|
1331
|
+
both;\"><a href=\"https://blogger.googleusercontent.com/img/a/AVvXsEh7WW8TlfN2u3xe_E-sH5IK6AWYnAoWaKrP2b32UawUeqMpPlq6ZFk5BtJVZsMmCNh5j3QRsTj5H0Ee55RRGntnc9yjj_mNB8KHmH2dzocCgyLS2VFOxsBji6u4Ey6qxlDAnT-zrsBpnDcTchbhgt1x0Sf7RkmIMkS1y4-_3KCQian-SeIF-g=s1000\"
|
1332
|
+
style=\"display: block; padding: 1em 0; text-align: center; clear: right;
|
1333
|
+
float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"1000\"
|
1334
|
+
data-original-width=\"1000\" src=\"https://blogger.googleusercontent.com/img/a/AVvXsEh7WW8TlfN2u3xe_E-sH5IK6AWYnAoWaKrP2b32UawUeqMpPlq6ZFk5BtJVZsMmCNh5j3QRsTj5H0Ee55RRGntnc9yjj_mNB8KHmH2dzocCgyLS2VFOxsBji6u4Ey6qxlDAnT-zrsBpnDcTchbhgt1x0Sf7RkmIMkS1y4-_3KCQian-SeIF-g=s200\"/></a></div><p>I''ve
|
1335
|
+
released a very crude GraphQL endpoint for WikiData. More precisely, the endpoint
|
1336
|
+
is for a subset of the entities that are of interest to WikiCite, such as
|
1337
|
+
scholarly articles, people, and journals. There is a crude demo at <a href=\"https://wikicite-graphql.herokuapp.com\">https://wikicite-graphql.herokuapp.com</a>.
|
1338
|
+
The endpoint itself is at <a href=\"https://wikicite-graphql.herokuapp.com/gql.php\">https://wikicite-graphql.herokuapp.com/gql.php</a>.
|
1339
|
+
There are various ways to interact with the endpoint, personally I like the
|
1340
|
+
<a href=\"https://altair.sirmuel.design\">Altair GraphQL Client</a> by <a
|
1341
|
+
href=\"https://github.com/imolorhe\">Samuel Imolorhe</a>.</p>\n\n<p>As I''ve
|
1342
|
+
<a href=\"https://iphylo.blogspot.com/2021/04/it-been-while.html\">mentioned
|
1343
|
+
earlier</a> it''s taken me a while to see the point of GraphQL. But it is
|
1344
|
+
clear it is gaining traction in the biodiversity world (see for example the
|
1345
|
+
<a href=\"https://dev.gbif.org/hosted-portals.html\">GBIF Hosted Portals</a>)
|
1346
|
+
so it''s worth exploring. My take on GraphQL is that it is a way to create
|
1347
|
+
a self-describing API that someone developing a web site can use without them
|
1348
|
+
having to bury themselves in the gory details of how data is internally modelled.
|
1349
|
+
For example, WikiData''s query interface uses SPARQL, a powerful language
|
1350
|
+
that has a steep learning curve (in part because of the administrative overhead
|
1351
|
+
brought by RDF namespaces, etc.). In my previous SPARQL-based projects such
|
1352
|
+
as <a href=\"https://ozymandias-demo.herokuapp.com\">Ozymandias</a> and <a
|
1353
|
+
href=\"http://alec-demo.herokuapp.com\">ALEC</a> I have either returned SPARQL
|
1354
|
+
results directly (Ozymandias) or formatted SPARQL results as <a href=\"https://schema.org/DataFeed\">schema.org
|
1355
|
+
DataFeeds</a> (equivalent to RSS feeds) (ALEC). Both approaches work, but
|
1356
|
+
they are project-specific and if anyone else tried to build based on these
|
1357
|
+
projects they might struggle for figure out what was going on. I certainly
|
1358
|
+
struggle, and I wrote them!</p>\n\n<p>So it seems worthwhile to explore this
|
1359
|
+
approach a little further and see if I can develop a GraphQL interface that
|
1360
|
+
can be used to build the sort of rich apps that I want to see. The demo I''ve
|
1361
|
+
created uses SPARQL under the hood to provide responses to the GraphQL queries.
|
1362
|
+
So in this sense it''s not replacing SPARQL, it''s simply providing a (hopefully)
|
1363
|
+
simpler overlay on top of SPARQL so that we can retrieve the data we want
|
1364
|
+
without having to learn the intricacies of SPARQL, nor how Wikidata models
|
1365
|
+
publications and people.</p>","tags":["GraphQL","SPARQL","WikiCite","Wikidata"],"language":"en","references":null},{"id":"https://doi.org/10.59350/ymc6x-rx659","uuid":"0807f515-f31d-4e2c-9e6f-78c3a9668b9d","url":"https://iphylo.blogspot.com/2022/09/dna-barcoding-as-intergenerational.html","title":"DNA
|
1366
|
+
barcoding as intergenerational transfer of taxonomic knowledge","summary":"I
|
1367
|
+
tweeted about this but want to bookmark it for later as well. The paper “A
|
1368
|
+
molecular-based identification resource for the arthropods of Finland” doi:10.1111/1755-0998.13510
|
1369
|
+
contains the following: …the annotated barcode records assembled by FinBOL
|
1370
|
+
participants represent a tremendous intergenerational transfer of taxonomic
|
1371
|
+
knowledge … the time contributed by current taxonomists in identifying and
|
1372
|
+
contributing voucher specimens represents a great gift to future generations
|
1373
|
+
who will benefit...","date_published":"2022-09-14T10:12:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
|
1374
|
+
(Roderic Page)"}],"image":null,"content_html":"<p>I <a href=\"https://twitter.com/rdmpage/status/1569738844416638981?s=21&t=9OVXuoUEwZtQt-Ldzlutfw\">tweeted
|
1375
|
+
about this</a> but want to bookmark it for later as well. The paper “A molecular-based
|
1376
|
+
identification resource for the arthropods of Finland” <a href=\"https://doi.org/10.1111/1755-0998.13510\">doi:10.1111/1755-0998.13510</a>
|
1377
|
+
contains the following:</p>\n<blockquote>\n<p>…the annotated barcode records
|
1378
|
+
assembled by FinBOL participants represent a tremendous <mark>intergenerational
|
1379
|
+
transfer of taxonomic knowledge</mark> … the time contributed by current taxonomists
|
1380
|
+
in identifying and contributing voucher specimens represents a great gift
|
1381
|
+
to future generations who will benefit from their expertise when they are
|
1382
|
+
no longer able to process new material.</p>\n</blockquote>\n<p>I think this
|
1383
|
+
is a very clever way to characterise the project. In an age of machine learning
|
1384
|
+
this may be commonest way to share knowledge , namely as expert-labelled training
|
1385
|
+
data used to build tools for others. Of course, this means the expertise itself
|
1386
|
+
may be lost, which has implications for updating the models if the data isn’t
|
1387
|
+
complete. But it speaks to Charles Godfrey’s theme of <a href=\"https://biostor.org/reference/250587\">“Taxonomy
|
1388
|
+
as information science”</a>.</p>\n<p>Note that the knowledge is also transformed
|
1389
|
+
in the sense that the underlying expertise of interpreting morphology, ecology,
|
1390
|
+
behaviour, genomics, and the past literature is not what is being passed on.
|
1391
|
+
Instead it is probabilities that a DNA sequence belongs to a particular taxon.</p>\n<p>This
|
1392
|
+
feels is different to, say iNaturalist, where there is a machine learning
|
1393
|
+
model to identify images. In that case, the model is built on something the
|
1394
|
+
community itself has created, and continues to create. Yes, the underlying
|
1395
|
+
idea is that same: “experts” have labelled the data, a model is trained, the
|
1396
|
+
model is used. But the benefits of the <a href=\"https://www.inaturalist.org\">iNaturalist</a>
|
1397
|
+
model are immediately applicable to the people whose data built the model.
|
1398
|
+
In the case of barcoding, because the technology itself is still not in the
|
1399
|
+
hands of many (relative to, say, digital imaging), the benefits are perhaps
|
1400
|
+
less tangible. Obviously researchers working with environmental DNA will find
|
1401
|
+
it very useful, but broader impact may await the arrival of citizen science
|
1402
|
+
DNA barcoding.</p>\n<p>The other consideration is whether the barcoding helps
|
1403
|
+
taxonomists. Is it to be used to help prioritise future work (“we are getting
|
1404
|
+
lots of unknown sequences in these taxa, lets do some taxonomy there”), or
|
1405
|
+
is it simply capturing the knowledge of a generation that won’t be replaced:</p>\n<blockquote>\n<p>The
|
1406
|
+
need to capture such knowledge is essential because there are, for example,
|
1407
|
+
no young Finnish taxonomists who can critically identify species in many key
|
1408
|
+
groups of ar- thropods (e.g., aphids, chewing lice, chalcid wasps, gall midges,
|
1409
|
+
most mite lineages).</p>\n</blockquote>\n<p>The cycle of collect data, test
|
1410
|
+
and refine model, collect more data, rinse and repeat that happens with iNaturalist
|
1411
|
+
creates a feedback loop. It’s not clear that a similar cycle exists for DNA
|
1412
|
+
barcoding.</p>\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/enxas-arj18","uuid":"ab5a6e04-d55e-4901-8269-9eea65ce7178","url":"https://iphylo.blogspot.com/2022/08/can-we-use-citation-graph-to-measure.html","title":"Can
|
1413
|
+
we use the citation graph to measure the quality of a taxonomic database?","summary":"More
|
1414
|
+
arm-waving notes on taxonomic databases. I''ve started to add data to ChecklistBank
|
1415
|
+
and this has got me thinking about the issue of data quality. When you add
|
1416
|
+
data to ChecklistBank you are asked to give a measure of confidence based
|
1417
|
+
on the Catalogue of Life Checklist Confidence system of one - five stars:
|
1418
|
+
★ - ★★★★★. I''m scepetical about the notion of confidence or \"trust\" when
|
1419
|
+
it is reduced to a star system (see also Can you trust EOL?). I could literally
|
1420
|
+
pick any number of stars, there''s...","date_published":"2022-08-24T14:33:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
|
1421
|
+
(Roderic Page)"}],"image":null,"content_html":"<p>More arm-waving notes on
|
1422
|
+
taxonomic databases. I''ve started to add data to <a href=\"https://www.checklistbank.org\">ChecklistBank</a>
|
1423
|
+
and this has got me thinking about the issue of data quality. When you add
|
1424
|
+
data to ChecklistBank you are asked to give a measure of confidence based
|
1425
|
+
on the <a href=\"https://www.catalogueoflife.org/about/glossary.html#checklist-confidence\">Catalogue
|
1426
|
+
of Life Checklist Confidence</a> system of one - five stars: ★ - ★★★★★. I''m
|
1427
|
+
scepetical about the notion of confidence or \"trust\" when it is reduced
|
1428
|
+
to a star system (see also <a href=\"https://iphylo.blogspot.com/2012/06/can-you-trust-eol.html\">Can
|
1429
|
+
you trust EOL?</a>). I could literally pick any number of stars, there''s
|
1430
|
+
no way to measure what number of stars is appropriate. This feeds into my
|
1431
|
+
biggest reservation about the <a href=\"https://www.catalogueoflife.org\">Catalogue
|
1432
|
+
of Life</a>, it''s almost entirely authority based, not evidence based. That
|
1433
|
+
is, rather than give us evidence for why a particular taxon is valid, we are
|
1434
|
+
(mostly) just given a list of taxa are asked to accept those as gospel, based
|
1435
|
+
on assertions by one or more authorities. I''m not necessarly doubting the
|
1436
|
+
knowledge of those making these lists, it''s just that I think we need to
|
1437
|
+
do better than \"these are the accepted taxa because I say so\" implict in
|
1438
|
+
the Catalogue of Life.\n</p>\n\n<p>So, is there any way we could objectively
|
1439
|
+
measure the quality of a particular taxonomic checklist? Since I have a long
|
1440
|
+
standing interest in link the primary taxonomic litertaure to names in databases
|
1441
|
+
(since that''s where the evidence is), I keep wondering whether measures based
|
1442
|
+
on that literture could be developed. \n</p>\n<p>\nI recently revisited the
|
1443
|
+
fascinating (and quite old) literature on rates of synonymy:\n</p>\n<blockquote>\nGaston
|
1444
|
+
Kevin J. and Mound Laurence A. 1993 Taxonomy, hypothesis testing and the
|
1445
|
+
biodiversity crisisProc. R. Soc. Lond. B.251139–142\n<a href=\"http://doi.org/10.1098/rspb.1993.0020\">http://doi.org/10.1098/rspb.1993.0020</a>\n</blockquote>\n \n<blockquote>\n Andrew
|
1446
|
+
R. Solow, Laurence A. Mound, Kevin J. Gaston, Estimating the Rate of Synonymy,
|
1447
|
+
Systematic Biology, Volume 44, Issue 1, March 1995, Pages 93–96, <a href=\"https://doi.org/10.1093/sysbio/44.1.93\">https://doi.org/10.1093/sysbio/44.1.93</a>\n</blockquote>\n\n</p>\n\n<p>\nA
|
1448
|
+
key point these papers make is that the observed rate of synonymy is quite
|
1449
|
+
high (that is, many \"new species\" end up being merged with already known
|
1450
|
+
species), and that because it can take time to discover that a species is
|
1451
|
+
a synonym the actual rate may be even higher. In other words, in diagrams
|
1452
|
+
like the one reproduced below, the reason the proportion of synonyms declines
|
1453
|
+
the nearer we get to the present day (this paper came out in 1995) is not
|
1454
|
+
because are are creating fewer synonyms but because we''ve not yet had time
|
1455
|
+
to do the work to uncover the remaining synonyms.\n</p>\n \n<div class=\"separator\"
|
1456
|
+
style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhgwvBcDUD_IJZyAa0gaqXB--ogJeVooV19I635N_w2OuztDFsy-ZPXEFuW5S1NUxxUEHmN8pbdO9MPljIi5v0A2355kYotL1fKqdewP9kmWu9TfwtIJ3jD04SQjeF3SWK-yMVAx-rNc6TO43GIwmftPk6IghOrcmur6SoHe06ws7fEFAvxA/s621/Screenshot%202022-08-24%20at%2014.59.47.png\"
|
1457
|
+
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
1458
|
+
border=\"0\" width=\"400\" data-original-height=\"404\" data-original-width=\"621\"
|
1459
|
+
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhgwvBcDUD_IJZyAa0gaqXB--ogJeVooV19I635N_w2OuztDFsy-ZPXEFuW5S1NUxxUEHmN8pbdO9MPljIi5v0A2355kYotL1fKqdewP9kmWu9TfwtIJ3jD04SQjeF3SWK-yMVAx-rNc6TO43GIwmftPk6IghOrcmur6SoHe06ws7fEFAvxA/s400/Screenshot%202022-08-24%20at%2014.59.47.png\"/></a></div>\n\n<p>Put
|
1460
|
+
another way, these papers are arguing that real work of taxonomy is revision,
|
1461
|
+
not species discovery, especially since it''s not uncommon for > 50% of species
|
1462
|
+
in a taxon to end up being synonymised. Indeed, if a taxoomic group has few
|
1463
|
+
synonyms then these authors would argue that''s a sign of neglect. More revisionary
|
1464
|
+
work would likely uncover additional synonyms. So, what we need is a way to
|
1465
|
+
measure the amount of research on a taxonomic group. It occurs to me that
|
1466
|
+
we could use the citation graph as a way to tackle this. Lets imagine we have
|
1467
|
+
a set of taxa (say a family) and we have all the papers that described new
|
1468
|
+
species or undertook revisions (or both). The extensiveness of that work could
|
1469
|
+
be measured by the citation graph. For example, build the citation graph for
|
1470
|
+
those papers. How many original species decsriptions are not cited? Those
|
1471
|
+
species have been potentially neglected. How many large-scale revisions have
|
1472
|
+
there been (as measured by the numbers of taxonomic papers those revisions
|
1473
|
+
cite)? There are some interesting approaches to quantifying this, such as
|
1474
|
+
using <a href=\"https://en.wikipedia.org/wiki/HITS_algorithm\">hubs and authorities</a>.</p>\n \n \n<p>I''m
|
1475
|
+
aware that taxonomists have not had the happiest relationship with citations:\n \n<blockquote>\nPinto
|
1476
|
+
ÂP, Mejdalani G, Mounce R, Silveira LF, Marinoni L, Rafael JA. Are publications
|
1477
|
+
on zoological taxonomy under attack? R Soc Open Sci. 2021 Feb 10;8(2):201617.
|
1478
|
+
<a href=\"https://doi.org/10.1098/rsos.201617\">doi: 10.1098/rsos.201617</a>.
|
1479
|
+
PMID: 33972859; PMCID: PMC8074659.\n</blockquote>\n\nStill, I think there
|
1480
|
+
is an intriguing possibility here. For this approach to work, we need to have
|
1481
|
+
linked taxonomic names to publications, and have citation data for those publications.
|
1482
|
+
This is happening on various platforms. Wikidata, for example, is becoming
|
1483
|
+
a repository of the taxonomic literature, some of it with citation links.\n\n<blockquote>\nPage
|
1484
|
+
RDM. 2022. Wikidata and the bibliography of life. PeerJ 10:e13712 <a href=\"https://doi.org/10.7717/peerj.13712\">https://doi.org/10.7717/peerj.13712</a>\n</blockquote>\n\nTime
|
1485
|
+
for some experiments.\n</p>","tags":["Bibliography of Life","citation","synonymy","taxonomic
|
1486
|
+
databases"],"language":"en","references":null},{"id":"https://doi.org/10.59350/cbzgz-p8428","uuid":"a93134aa-8b33-4dc7-8cd4-76cdf64732f4","url":"https://iphylo.blogspot.com/2023/04/library-interfaces-knowledge-graphs-and.html","title":"Library
|
1487
|
+
interfaces, knowledge graphs, and Miller columns","summary":"Some quick notes
|
1488
|
+
on interface ideas for digital libraries and/or knowledge graphs. Recently
|
1489
|
+
there’s been something of an explosion in bibliographic tools to explore the
|
1490
|
+
literature. Examples include: Elicit which uses AI to search for and summarise
|
1491
|
+
papers _scite which uses AI to do sentiment analysis on citations (does paper
|
1492
|
+
A cite paper B favourably or not?) ResearchRabbit which uses lists, networks,
|
1493
|
+
and timelines to discover related research Scispace which navigates connections
|
1494
|
+
between...","date_published":"2023-04-25T13:01:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
|
1495
|
+
(Roderic Page)"}],"image":null,"content_html":"<p>Some quick notes on interface
|
1496
|
+
ideas for digital libraries and/or knowledge graphs.</p>\n<p>Recently there’s
|
1497
|
+
been something of an explosion in bibliographic tools to explore the literature.
|
1498
|
+
Examples include:</p>\n<ul>\n<li><a href=\"https://elicit.org\">Elicit</a>
|
1499
|
+
which uses AI to search for and summarise papers</li>\n<li><a href=\"https://scite.ai\">_scite</a>
|
1500
|
+
which uses AI to do sentiment analysis on citations (does paper A cite paper
|
1501
|
+
B favourably or not?)</li>\n<li><a href=\"https://www.researchrabbit.ai\">ResearchRabbit</a>
|
1502
|
+
which uses lists, networks, and timelines to discover related research</li>\n<li><a
|
1503
|
+
href=\"https://typeset.io\">Scispace</a> which navigates connections between
|
1504
|
+
papers, authors, topics, etc., and provides AI summaries.</li>\n</ul>\n<p>As
|
1505
|
+
an aside, I think these (and similar tools) are a great example of how bibliographic
|
1506
|
+
data such as abstracts, the citation graph and - to a lesser extent - full
|
1507
|
+
text - have become commodities. That is, what was once proprietary information
|
1508
|
+
is now free to anyone, which in turns means a whole ecosystem of new tools
|
1509
|
+
can emerge. If I was clever I’d be building a <a href=\"https://en.wikipedia.org/wiki/Wardley_map\">Wardley
|
1510
|
+
map</a> to explore this. Note that a decade or so ago reference managers like
|
1511
|
+
<a href=\"https://www.zotero.org\">Zotero</a> were made possible by publishers
|
1512
|
+
exposing basic bibliographic data on their articles. As we move to <a href=\"https://i4oc.org\">open
|
1513
|
+
citations</a> we are seeing the next generation of tools.</p>\n<p>Back to
|
1514
|
+
my main topic. As usual, rather than focus on what these tools do I’m more
|
1515
|
+
interested in how they <strong>look</strong>. I have history here, when the
|
1516
|
+
iPad came out I was intrigued by the possibilities it offered for displaying
|
1517
|
+
academic articles, as discussed <a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad.html\">here</a>,
|
1518
|
+
<a href=\"https://iphylo.blogspot.com/2010/09/viewing-scientific-articles-on-ipad.html\">here</a>,
|
1519
|
+
<a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad_24.html\">here</a>,
|
1520
|
+
<a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad_3052.html\">here</a>,
|
1521
|
+
and <a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad_31.html\">here</a>.
|
1522
|
+
ResearchRabbit looks like this:</p>\n<div style=\"padding:86.91% 0 0 0;position:relative;\"><iframe
|
1523
|
+
src=\"https://player.vimeo.com/video/820871442?h=23b05b0dae&badge=0&autopause=0&player_id=0&app_id=58479\"
|
1524
|
+
frameborder=\"0\" allow=\"autoplay; fullscreen; picture-in-picture\" allowfullscreen
|
1525
|
+
style=\"position:absolute;top:0;left:0;width:100%;height:100%;\" title=\"ResearchRabbit\"></iframe></div><script
|
1526
|
+
src=\"https://player.vimeo.com/api/player.js\"></script>\n<p>Scispace’s <a
|
1527
|
+
href=\"https://typeset.io/explore/journals/parassitologia-1ieodjwe\">“trace”
|
1528
|
+
view</a> looks like this:</p>\n<div style=\"padding:84.55% 0 0 0;position:relative;\"><iframe
|
1529
|
+
src=\"https://player.vimeo.com/video/820871348?h=2db7b661ef&badge=0&autopause=0&player_id=0&app_id=58479\"
|
1530
|
+
frameborder=\"0\" allow=\"autoplay; fullscreen; picture-in-picture\" allowfullscreen
|
1531
|
+
style=\"position:absolute;top:0;left:0;width:100%;height:100%;\" title=\"Scispace
|
1532
|
+
screencast\"></iframe></div><script src=\"https://player.vimeo.com/api/player.js\"></script>\n<p>What
|
1533
|
+
is interesting about both is that they display content from left to right
|
1534
|
+
in vertical columns, rather than the more common horizontal rows. This sort
|
1535
|
+
of display is sometimes called <a href=\"https://en.wikipedia.org/wiki/Miller_columns\">Miller
|
1536
|
+
columns</a> or a <a href=\"https://web.archive.org/web/20210726134921/http://designinginterfaces.com/firstedition/index.php?page=Cascading_Lists\">cascading
|
1537
|
+
list</a>.</p>\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBPnV9fRBcvm-BX5PjzfG5Cff9PerCLsTW8d5ZbsL6b41t7ypD7ovmcgfTf3b4b34mbq8NM4sfwOHkgEq32FLYnD497RFQD4HQmYmh5Eveu1zWdDVyKyDtPyE98QoxTaOEnLA5kK0fnl3dOOEgUvtVKlTZ8bt1gj2v_8tDRWl9f50ybyei3A/s1024/GNUstep-liveCD.png\"
|
1538
|
+
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
1539
|
+
border=\"0\" width=\"400\" data-original-height=\"768\" data-original-width=\"1024\"
|
1540
|
+
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBPnV9fRBcvm-BX5PjzfG5Cff9PerCLsTW8d5ZbsL6b41t7ypD7ovmcgfTf3b4b34mbq8NM4sfwOHkgEq32FLYnD497RFQD4HQmYmh5Eveu1zWdDVyKyDtPyE98QoxTaOEnLA5kK0fnl3dOOEgUvtVKlTZ8bt1gj2v_8tDRWl9f50ybyei3A/s400/GNUstep-liveCD.png\"/></a></div>\n\n<p>By
|
1541
|
+
Gürkan Sengün (talk) - Own work, Public Domain, <a href=\"https://commons.wikimedia.org/w/index.php?curid=594715\">https://commons.wikimedia.org/w/index.php?curid=594715</a></p>\n<p>I’ve
|
1542
|
+
always found displaying a knowledge graph to be a challenge, as discussed
|
1543
|
+
<a href=\"https://iphylo.blogspot.com/2019/07/notes-on-collections-knowledge-graphs.html\">elsewhere
|
1544
|
+
on this blog</a> and in my paper on <a href=\"https://peerj.com/articles/6739/#p-29\">Ozymandias</a>.
|
1545
|
+
Miller columns enable one to drill down in increasing depth, but it doesn’t
|
1546
|
+
need to be a tree, it can be a path within a network. What I like about ResearchRabbit
|
1547
|
+
and the original Scispace interface is that they present the current item
|
1548
|
+
together with a list of possible connections (e.g., authors, citations) that
|
1549
|
+
you can drill down on. Clicking on these will result in a new column being
|
1550
|
+
appended to the right, with a view (typically a list) of the next candidates
|
1551
|
+
to visit. In graph terms, these are adjacent nodes to the original item. The
|
1552
|
+
clickable badges on each item can be thought of as sets of edges that have
|
1553
|
+
the same label (e.g., “authored by”, “cites”, “funded”, “is about”, etc.).
|
1554
|
+
Each of these nodes itself becomes a starting point for further exploration.
|
1555
|
+
Note that the original starting point isn’t privileged, other than being the
|
1556
|
+
starting point. That is, each time we drill down we are seeing the same type
|
1557
|
+
of information displayed in the same way. Note also that the navigation can
be though of as a <strong>card</strong> for a node, with <strong>buttons</strong>
grouping the adjacent nodes. When we click on an individual button, it expands
into a <strong>list</strong> in the next column. This can be thought of as
a preview for each adjacent node. Clicking on an element in the list generates
a new card (we are viewing a single node) and we get another set of buttons
corresponding to the adjacent nodes.</p>\n<p>One important behaviour in a
Miller column interface is that the current path can be pruned at any point.
If we go back (i.e., scroll to the left) and click on another tab on an item,
everything downstream of that item (i.e., to the right) gets deleted and replaced
by a new set of nodes. This could make retrieving a particular history of
browsing a bit tricky, but encourages exploration. Both Scispace and ResearchRabbit have
the ability to add items to a collection, so you can keep track of things
you discover.</p>\n<p>Lots of food for thought, I’m assuming that there is
some user interface/experience research on Miller columns. One thing to remember
is that Miller columns are most often associated with trees, but in this case
we are exploring a network. That means that potentially there is no limit
to the number of columns being generated as we wander through the graph. It
will be interesting to think about what the average depth is likely to be,
in other words, how deep down the rabbit hole will be go?</p>\n\n<h3>Update</h3>\n<p>Should
add link to David Regev''s explorations of <a href=\"https://medium.com/david-regev-on-ux/flow-browser-b730daf0f717\">Flow
Browser</a>.\n\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":["cards","flow","Knowledge
Graph","Miller column","RabbitResearch"],"language":"en","references":null},{"id":"https://doi.org/10.59350/t6fb9-4fn44","uuid":"8bc3fea6-cb86-4344-8dad-f312fbf58041","url":"https://iphylo.blogspot.com/2021/12/the-business-of-extracting-knowledge.html","title":"The
Business of Extracting Knowledge from Academic Publications","summary":"Markus
Strasser (@mkstra write a fascinating article entitled \"The Business of Extracting
Knowledge from Academic Publications\". I spent months working on domain-specific
search engines and knowledge discovery apps for biomedicine and eventually
figured that synthesizing &quot;insights&quot; or building knowledge graphs
by machine-reading the academic literature (papers) is *barely useful* :https://t.co/eciOg30Odc—
Markus Strasser (@mkstra) December 7, 2021 His TL;DR: TL;DR: I...","date_published":"2021-12-11T00:01:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<p>Markus Strasser (<a href=\"https://twitter.com/mkstra\">@mkstra</a>
write a fascinating article entitled <a href=\"https://markusstrasser.org/extracting-knowledge-from-literature/\">\"The
Business of Extracting Knowledge from Academic Publications\"</a>.</p>\n\n<blockquote
class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">I spent months working
on domain-specific search engines and knowledge discovery apps for biomedicine
and eventually figured that synthesizing &quot;insights&quot; or building
knowledge graphs by machine-reading the academic literature (papers) is *barely
useful* :<a href=\"https://t.co/eciOg30Odc\">https://t.co/eciOg30Odc</a></p>—
Markus Strasser (@mkstra) <a href=\"https://twitter.com/mkstra/status/1468334482113523716?ref_src=twsrc%5Etfw\">December
7, 2021</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
charset=\"utf-8\"></script>\n\n<p>His TL;DR:</p>\n\n<p><blockquote>\nTL;DR:
I worked on biomedical literature search, discovery and recommender web applications
for many months and concluded that extracting, structuring or synthesizing
\"insights\" from academic publications (papers) or building knowledge bases
from a domain corpus of literature has negligible value in industry.</p>\n\n<p>Close
to nothing of what makes science actually work is published as text on the
web.\n</blockquote></p>\n\n<p>After recounting the many problems of knowledge
extraction - including a swipe at nanopubs which \"are ... dead in my view
(without admitting it)\" - he concludes:</p>\n\n<p><blockquote>\nI’ve been
flirting with this entire cluster of ideas including open source web annotation,
semantic search and semantic web, public knowledge graphs, nano-publications,
knowledge maps, interoperable protocols and structured data, serendipitous
discovery apps, knowledge organization, communal sense making and academic
literature/publishing toolchains for a few years on and off ... nothing of
it will go anywhere.</p>\n\n<p>Don’t take that as a challenge. Take it as
a red flag and run. Run towards better problems.\n</blockquote></p>\n\n<p>Well
worth a read, and much food for thought.</p>","tags":["ai","business model","text
mining"],"language":"en","references":null},{"id":"https://doi.org/10.59350/463yw-pbj26","uuid":"dc829ab3-f0f1-40a4-b16d-a36dc0e34166","url":"https://iphylo.blogspot.com/2022/12/david-remsen.html","title":"David
Remsen","summary":"I heard yesterday from Martin Kalfatovic (BHL) that David
Remsen has died. Very sad news. It''s starting to feel like iPhylo might end
up being a list of obituaries of people working on biodiversity informatics
(e.g., Scott Federhen). I spent several happy visits at MBL at Woods Hole
talking to Dave at the height of the uBio project, which really kickstarted
large scale indexing of taxonomic names, and the use of taxonomic name finding
tools to index the literature. His work on uBio with David...","date_published":"2022-12-16T17:54:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<p>I heard yesterday from Martin
Kalfatovic (BHL) that David Remsen has died. Very sad news. It''s starting
to feel like iPhylo might end up being a list of obituaries of people working
on biodiversity informatics (e.g., <a href=\"https://iphylo.blogspot.com/2016/05/scott-federhen-rip.html\">Scott
Federhen</a>).</p>\n\n<p>I spent several happy visits at MBL at Woods Hole
talking to Dave at the height of the uBio project, which really kickstarted
large scale indexing of taxonomic names, and the use of taxonomic name finding
tools to index the literature. His work on uBio with David (\"Paddy\") Patterson
led to the <a href=\"https://eol.org\">Encyclopedia of Life</a> (EOL).</p>\n\n<p>A
number of the things I''m currently working on are things Dave started. For
example, I recently uploaded a version of his dataset for Nomenclator Zoologicus[1]
to <a href=\"https://www.checklistbank.org/dataset/126539/about\">ChecklistBank</a>
where I''m working on augmenting that original dataset by adding links to
the taxonomic literature. My <a href=\"https://biorss.herokuapp.com/?feed=Y291bnRyeT1XT1JMRCZwYXRoPSU1QiUyMkJJT1RBJTIyJTVE\">BioRSS
project</a> is essentially an attempt to revive uBioRSS[2] (see <a href=\"https://iphylo.blogspot.com/2021/11/revisiting-rss-to-monitor-latests.html\">Revisiting
RSS to monitor the latest taxonomic research</a>).</p>\n\n<p>I have fond memories
of those visits to Woods Hole. A very sad day indeed.</p>\n\n<p><b>Update:</b>
The David Remsen Memorial Fund has been set up on <a href=\"https://www.gofundme.com/f/david-remsen-memorial-fund\">GoFundMe</a>.</p>\n\n<p>1.
Remsen, D. P., Norton, C., & Patterson, D. J. (2006). Taxonomic Informatics
Tools for the Electronic Nomenclator Zoologicus. The Biological Bulletin,
210(1), 18–24. https://doi.org/10.2307/4134533</p>\n\n<p>2. Patrick R. Leary,
David P. Remsen, Catherine N. Norton, David J. Patterson, Indra Neil Sarkar,
uBioRSS: Tracking taxonomic literature using RSS, Bioinformatics, Volume 23,
Issue 11, June 2007, Pages 1434–1436, https://doi.org/10.1093/bioinformatics/btm109</p>","tags":["David
Remsen","obituary","uBio"],"language":"en","references":null},{"id":"https://doi.org/10.59350/pmhat-5ky65","uuid":"5891c709-d139-440f-bacb-06244424587a","url":"https://iphylo.blogspot.com/2021/10/problems-with-plazi-parsing-how.html","title":"Problems
with Plazi parsing: how reliable are automated methods for extracting specimens
from the literature?","summary":"The Plazi project has become one of the major
contributors to GBIF with some 36,000 datasets yielding some 500,000 occurrences
(see Plazi''s GBIF page for details). These occurrences are extracted from
taxonomic publication using automated methods. New data is published almost
daily (see latest treatments). The map below shows the geographic distribution
of material citations provided to GBIF by Plazi, which gives you a sense of
the size of the dataset. By any metric Plazi represents a...","date_published":"2021-10-25T11:10:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
both;\"><a href=\"https://1.bp.blogspot.com/-oiqSkA53FI4/YXaRAUpRgFI/AAAAAAAAgzY/PbPEOh_VlhIE0aaqqhVQPLAOE5pRULwJACLcBGAsYHQ/s240/Rf7UoXTw_400x400.jpg\"
style=\"display: block; padding: 1em 0; text-align: center; clear: right;
float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"240\"
data-original-width=\"240\" src=\"https://1.bp.blogspot.com/-oiqSkA53FI4/YXaRAUpRgFI/AAAAAAAAgzY/PbPEOh_VlhIE0aaqqhVQPLAOE5pRULwJACLcBGAsYHQ/s200/Rf7UoXTw_400x400.jpg\"/></a></div><p>The
<a href=\"http://plazi.org\">Plazi</a> project has become one of the major
contributors to GBIF with some 36,000 datasets yielding some 500,000 occurrences
(see <a href=\"https://www.gbif.org/publisher/7ce8aef0-9e92-11dc-8738-b8a03c50a862\">Plazi''s
GBIF page</a> for details). These occurrences are extracted from taxonomic
publication using automated methods. New data is published almost daily (see
<a href=\"https://tb.plazi.org/GgServer/static/newToday.html\">latest treatments</a>).
The map below shows the geographic distribution of material citations provided
to GBIF by Plazi, which gives you a sense of the size of the dataset.</p>\n\n<div
class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-DCJ4HR8eej8/YXaRQnz22bI/AAAAAAAAgz4/AgRcree6jVgjtQL2ch7IXgtb_Xtx7fkngCPcBGAYYCw/s1030/Screenshot%2B2021-10-24%2Bat%2B20.43.23.png\"
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
border=\"0\" width=\"400\" data-original-height=\"514\" data-original-width=\"1030\"
src=\"https://1.bp.blogspot.com/-DCJ4HR8eej8/YXaRQnz22bI/AAAAAAAAgz4/AgRcree6jVgjtQL2ch7IXgtb_Xtx7fkngCPcBGAYYCw/s400/Screenshot%2B2021-10-24%2Bat%2B20.43.23.png\"/></a></div>\n\n<p>By
any metric Plazi represents a considerable achievement. But often when I browse
individual records on Plazi I find records that seem clearly incorrect. Text
mining the literature is a challenging problem, but at the moment Plazi seems
something of a \"black box\". PDFs go in, the content is mined, and data comes
up to be displayed on the Plazi web site and uploaded to GBIF. Nowhere does
there seem to be an evaluation of how accurate this text mining actually is.
Anecdotally it seems to work well in some cases, but in others it produces
what can only be described as bogus records.</p>\n\n<h2>Finding errors</h2>\n\n<p>A
treatment in Plazi is a block of text (and sometimes illustrations) that refers
to a single taxon. Often that text will include a description of the taxon,
and list one or more specimens that have been examined. These lists of specimens
(\"material citations\") are one of the key bits of information that Plaza
extracts from a treatment as these citations get fed into GBIF as occurrences.</p>\n\n<p>To
help explore treatments I''ve constructed a simple web site that takes the
Plazi identifier for a treatment and displays that treatment with the material
citations highlighted. For example, for the Plazi treatment <a href=\"https://tb.plazi.org/GgServer/html/03B5A943FFBB6F02FE27EC94FABEEAE7\">03B5A943FFBB6F02FE27EC94FABEEAE7</a>
you can view the marked up version at <a href=\"https://plazi-tester.herokuapp.com/?uri=622F7788-F0A4-449D-814A-5B49CD20B228\">https://plazi-tester.herokuapp.com/?uri=622F7788-F0A4-449D-814A-5B49CD20B228</a>.
Below is an example of a material citation with its component parts tagged:</p>\n\n<div
class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-NIGuo9BggdA/YXaRQQrv0QI/AAAAAAAAgz4/SZDcA1jZSN47JMRTWDwSMRpHUShrCeOdgCPcBGAYYCw/s693/Screenshot%2B2021-10-24%2Bat%2B20.59.56.png\"
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
border=\"0\" width=\"400\" data-original-height=\"94\" data-original-width=\"693\"
src=\"https://1.bp.blogspot.com/-NIGuo9BggdA/YXaRQQrv0QI/AAAAAAAAgz4/SZDcA1jZSN47JMRTWDwSMRpHUShrCeOdgCPcBGAYYCw/s400/Screenshot%2B2021-10-24%2Bat%2B20.59.56.png\"/></a></div>\n\n<p>This
is an example where Plazi has successfully parsed the specimen. But I keep
coming across cases where specimens have not been parsed correctly, resulting
in issues such as single specimens being split into multiple records (e.g., <a
href=\"https://plazi-tester.herokuapp.com/?uri=5244B05EFFC8E20F7BC32056C178F496\">https://plazi-tester.herokuapp.com/?uri=5244B05EFFC8E20F7BC32056C178F496</a>),
geographical coordinates being misinterpreted (e.g., <a href=\"https://plazi-tester.herokuapp.com/?uri=0D228E6AFFC2FFEFFF4DE8118C4EE6B9\">https://plazi-tester.herokuapp.com/?uri=0D228E6AFFC2FFEFFF4DE8118C4EE6B9</a>),
or collector''s initials being confused with codes for natural history collections
(e.g., <a href=\"https://plazi-tester.herokuapp.com/?uri=252C87918B362C05FF20F8C5BFCB3D4E\">https://plazi-tester.herokuapp.com/?uri=252C87918B362C05FF20F8C5BFCB3D4E</a>).</p>\n\n<p>Parsing
specimens is a hard problem so it''s not unexpected to find errors. But they
do seem common enough to be easily found, which raises the question of just
what percentage of these material citations are correct? How much of the
data Plazi feeds to GBIF is correct? How would we know?</p>\n\n<h2>Systemic
problems</h2>\n\n<p>Some of the errors I''ve found concern the interpretation
of the parsed data. For example, it is striking that despite including marine
taxa <b>no</b> Plazi record has a value for depth below sea level (see <a
href=\"https://www.gbif.org/occurrence/map?depth=0,9999&publishing_org=7ce8aef0-9e92-11dc-8738-b8a03c50a862&advanced=1\">GBIF
search on depth range 0-9999 for Plazi</a>). But <a href=\"https://www.gbif.org/occurrence/map?elevation=0,9999&publishing_org=7ce8aef0-9e92-11dc-8738-b8a03c50a862&advanced=1\">many
records do have an elevation</a>, including records from marine environments.
Any record that has a depth value is interpreted by Plazi as being elevation,
so we have aerial crustacea and fish.</p>\n\n<h3>Map of Plazi records with
depth 0-9999m</h3>\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-GD4pPtPCxVc/YXaRQ9bdn1I/AAAAAAAAgz8/A9YsypSvHfwWKAjDxSdeFVUkou88LGItACPcBGAYYCw/s673/Screenshot%2B2021-10-25%2Bat%2B12.03.51.png\"
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
border=\"0\" width=\"400\" data-original-height=\"258\" data-original-width=\"673\"
src=\"https://1.bp.blogspot.com/-GD4pPtPCxVc/YXaRQ9bdn1I/AAAAAAAAgz8/A9YsypSvHfwWKAjDxSdeFVUkou88LGItACPcBGAYYCw/s320/Screenshot%2B2021-10-25%2Bat%2B12.03.51.png\"/></a></div>\n\n<h3>Map
of Plazi records with elevation 0-9999m </h3>\n<div class=\"separator\" style=\"clear:
both;\"><a href=\"https://1.bp.blogspot.com/-BReSHtXTOkA/YXaRRKFW7dI/AAAAAAAAg0A/-FcBkMwyswIp0siWGVX3MNMANs7UkZFtwCPcBGAYYCw/s675/Screenshot%2B2021-10-25%2Bat%2B12.04.06.png\"
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
border=\"0\" width=\"400\" data-original-height=\"256\" data-original-width=\"675\"
src=\"https://1.bp.blogspot.com/-BReSHtXTOkA/YXaRRKFW7dI/AAAAAAAAg0A/-FcBkMwyswIp0siWGVX3MNMANs7UkZFtwCPcBGAYYCw/s320/Screenshot%2B2021-10-25%2Bat%2B12.04.06.png\"/></a></div>\n\n<p>Anecdotally
I''ve also noticed that Plazi seems to do well on zoological data, especially
journals like <i>Zootaxa</i>, but it often struggles with botanical specimens.
Botanists tend to cite specimens rather differently to zoologists (botanists
emphasise collector numbers rather than specimen codes). Hence data quality
in Plazi is likely to taxonomic biased.</p>\n\n<p>Plazi is <a href=\"https://github.com/plazi/community/issues\">using
GitHub to track issues with treatments</a> so feedback on erroneous records
is possible, but this seems inadequate to the task. There are tens of thousands
of data sets, with more being released daily, and hundreds of thousands of
occurrences, and relying on GitHub issues devolves the responsibility for
error checking onto the data users. I don''t have a measure of how many records
in Plazi have problems, but because I suspect it is a significant fraction
because for any given day''s output I can typically find errors.</p>\n\n<h2>What
to do?</h2>\n\n<p>Faced with a process that generates noisy data there are
several of things we could do:</p>\n\n<ol>\n<li>Have tools to detect and flag
errors made in generating the data.</li>\n<li>Have the data generator give
estimates the confidence of its results.</li>\n<li>Improve the data generator.</li>\n</ol>\n\n<p>I
think a comparison with the problem of parsing bibliographic references might
be instructive here. There is a long history of people developing tools to
parse references (<a href=\"https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html\">I''ve
even had a go</a>). State-of-the art tools such as <a href=\"https://anystyle.io\">AnyStyle</a>
feature machine learning, and are tested against <a href=\"https://github.com/inukshuk/anystyle/blob/master/res/parser/core.xml\">human
curated datasets</a> of tagged bibliographic records. This means we can evaluate
the performance of a method (how well does it retrieve the same results as
human experts?) and also improve the method by expanding the corpus of training
data. Some of these tools can provide a measures of how confident they are
when classifying a string as, say, a person''s name, which means we could
flag potential issues for anyone wanting to use that record.</p>\n\n<p>We
don''t have equivalent tools for parsing specimens in the literature, and
hence have no easy way to quantify how good existing methods are, nor do we
have a public corpus of material citations that we can use as training data.
I <a href=\"https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html\">blogged
about this</a> a few months ago and was considering using Plazi as a source
of marked up specimen data to use for training. However based on what I''ve
looked at so far Plazi''s data would need to be carefully scrutinised before
it could be used as training data.</p>\n\n<p>Going forward, I think it would
be desirable to have a set of records that can be used to benchmark specimen
parsers, and ideally have the parsers themselves available as web services
so that anyone can evaluate them. Even better would be a way to contribute
to the training data so that these tools improve over time.</p>\n\n<p>Plazi''s
data extraction tools are mostly desktop-based, that is, you need to download
software to use their methods. However, there are experimental web services
available as well. I''ve created a simple wrapper around the material citation
parser, you can try it at <a href=\"https://plazi-tester.herokuapp.com/parser.php\">https://plazi-tester.herokuapp.com/parser.php</a>.
It takes a single material citation and returns a version with elements such
as specimen code and collector name tagged in different colours.</p>\n\n<h2>Summary</h2>\n\n<p>Text
mining the taxonomic literature is clearly a gold mine of data, but at the
same time it is potentially fraught as we try and extract structured data
from semi-structured text. Plazi has demonstrated that it is possible to extract
a lot of data from the literature, but at the same time the quality of that
data seems highly variable. Even minor issues in parsing text can have big
implications for data quality (e.g., marine organisms apparently living above
sea level). Historically in biodiversity informatics we have favoured data
quantity over data quality. Quantity has an obvious metric, and has milestones
we can celebrate (e.g., <a href=\"GBIF at 1 billion - what''s next?\">one
billion specimens</a>). There aren''t really any equivalent metrics for data
quality.</p>\n\n<p>Adding new types of data can sometimes initially result
in a new set of quality issues (e.g., <a href=\"https://iphylo.blogspot.com/2019/12/gbif-metagenomics-and-metacrap.html\">GBIF
metagenomics and metacrap</a>) that take time to resolve. In the case of Plazi,
I think it would be worthwhile to quantify just how many records have errors,
and develop benchmarks that we can use to test methods for extracting specimen
data from text. If we don''t do this then there will remain uncertainty as
to how much trust we can place in data mined from the taxonomic literature.</p>\n\n<h2>Update</h2>\n\nPlazi
has responded, see <a href=\"http://plazi.org/posts/2021/10/liberation-first-step-toward-quality/\">Liberating
material citations as a first step to more better data</a>. My reading of
their repsonse is that it essentially just reiterates Plazi''s approach and
doesn''t tackle the underlying issue: their method for extracting material
citations is error prone, and many of those errors end up in GBIF.","tags":["data
quality","parsing","Plazi","specimen","text mining"],"language":"en","references":null}]}'
recorded_at: Thu, 15 Jun 2023 20:47:10 GMT
recorded_with: VCR 6.1.0