commonmeta-ruby 3.2.11 → 3.2.12

@@ -0,0 +1,1791 @@
+---
+http_interactions:
+- request:
+    method: get
+    uri: https://rogue-scholar.org/api/blogs/tyfqw20
+    body:
+      encoding: UTF-8
+      string: ''
+    headers:
+      Connection:
+      - close
+      Host:
+      - rogue-scholar.org
+      User-Agent:
+      - http.rb/5.1.1
+  response:
+    status:
+      code: 200
+      message: OK
+    headers:
+      Age:
+      - '0'
+      Cache-Control:
+      - public, max-age=0, must-revalidate
+      Content-Length:
+      - '162607'
+      Content-Type:
+      - application/json; charset=utf-8
+      Date:
+      - Thu, 15 Jun 2023 20:39:01 GMT
+      Etag:
+      - '"6w7me0q1i23h72"'
+      Server:
+      - Vercel
+      Strict-Transport-Security:
+      - max-age=63072000
+      X-Matched-Path:
+      - "/api/blogs/[slug]"
+      X-Vercel-Cache:
+      - MISS
+      X-Vercel-Id:
+      - fra1::iad1::r8zhw-1686861540247-e74748738960
+      Connection:
+      - close
+    body:
+      encoding: UTF-8
+      string: '{"id":"tyfqw20","title":"iPhylo","description":"Rants, raves (and occasionally
+        considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For
+        more ranty and less considered opinions, see my <a href=\"https://twitter.com/rdmpage\">Twitter
+        feed</a>.<br>ISSN 2051-8188. Written content on this site is licensed under
+        a <a href=\"https://creativecommons.org/licenses/by/4.0/\">Creative Commons
+        Attribution 4.0 International license</a>.","language":"en","favicon":null,"feed_url":"https://iphylo.blogspot.com/feeds/posts/default?alt=rss","feed_format":"application/rss+xml","home_page_url":"https://iphylo.blogspot.com/","indexed_at":"2023-02-06","modified_at":"2023-05-31T17:26:00+00:00","license":"https://creativecommons.org/licenses/by/4.0/legalcode","generator":"Blogger","category":"Natural
+        Sciences","backlog":true,"prefix":"10.59350","items":[{"id":"https://doi.org/10.59350/btdk4-42879","uuid":"3e1278f6-e7c0-43e1-bb54-6829e1344c0d","url":"https://iphylo.blogspot.com/2022/09/the-ideal-taxonomic-journal.html","title":"The
+        ideal taxonomic journal","summary":"This is just some random notes on an “ideal”
+        taxonomic journal, inspired in part by some recent discussions on “turbo-taxonomy”
+        (e.g., https://doi.org/10.3897/zookeys.1087.76720 and https://doi.org/10.1186/1742-9994-10-15),
+        and also examples such as the Australian Journal of Taxonomy https://doi.org/10.54102/ajt.qxi3r
+        which seems well-intentioned but limited. XML One approach is to have highly
+        structured text that embeds detailed markup, and ideally a tool that generates
+        markup in XML. This is...","date_published":"2022-09-29T14:00:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
+        (Roderic Page)"}],"image":null,"content_html":"<p>This is just some random
+        notes on an “ideal” taxonomic journal, inspired in part by some recent discussions
+        on “turbo-taxonomy” (e.g., <a href=\"https://doi.org/10.3897/zookeys.1087.76720\">https://doi.org/10.3897/zookeys.1087.76720</a>
+        and <a href=\"https://doi.org/10.1186/1742-9994-10-15\">https://doi.org/10.1186/1742-9994-10-15</a>),
+        and also examples such as the Australian Journal of Taxonomy <a href=\"https://doi.org/10.54102/ajt.qxi3r\">https://doi.org/10.54102/ajt.qxi3r</a>
+        which seems well-intentioned but limited.</p>\n<h2 id=\"xml\">XML</h2>\n<p>One
+        approach is to have highly structured text that embeds detailed markup, and
+        ideally a tool that generates markup in XML. This is the approach taken by
+        Pensoft. There is an inevitable trade-off between the burden on authors of
+        marking up text versus making the paper machine readable. In some ways this
+        seems misplaced effort given that there is little evidence that publications
+        by themselves have much value (see <a href=\"https://iphylo.blogspot.com/2021/12/the-business-of-extracting-knowledge.html\">The
+        Business of Extracting Knowledge from Academic Publications</a>). “Value”
+        in this case means as a source of data or factual statements that we can compute
+        over. Human-readable text is not a good way to convey this sort of information.</p>\n<p>It’s
+        also interesting that many editing tools are going in the opposite direction,
+        for example there are minimalist tools using <a href=\"https://en.wikipedia.org/wiki/Markdown\">Markdown</a>
+        where the goal is to <em>get out of the author’s way</em>, rather than impose
+        a way of writing. Text is written by humans for humans, so the tools should
+        be human-friendly.</p>\n<p>The idea of publishing using XML is attractive
+        in that it gives you XML that can be archived by, say, PubMed Central, but
+        other than that the value seems limited. A cursory glance at download stats
+        for journals that provide PDF and XML downloads, such as <em>PLoS One</em>
+        and <em>ZooKeys</em>, PDF is by far the more popular format. So arguably there
+        is little value in providing XML. Those who have tried to use JATS-XML as
+        an authoring tool have not had a happy time: <a href=\"https://doi.org/10.7557/15.5517\">How
+        we tried to JATS XML</a>. However, there are various tools to help with the
+        process, such as <a href=\"https://github.com/Vitaliy-1/docxToJats\">docxToJats</a>,<br>\ntexture,
+        and <a href=\"https://github.com/elifesciences/jats-xml-to-pdf\">jats-xml-to-pdf</a>
+        if this is the route one wants to take.</p>\n<h2 id=\"automating-writing-manuscripts\">Automating
+        writing manuscripts</h2>\n<p>The dream, of course, is to have a tool where
+        you store all your taxonomic data (literature, specimens, characters, images,
+        sequences, media files, etc.) and at the click of a button generate a paper.
+        Certainly some of this can be automated, much nomenclatural and specimen information
+        could be converted to human-readable text. Ideally this computer-generated
+        text would not be edited (otherwise it could get out of sync with the underlying
+        data). The text should be <a href=\"https://en.wikipedia.org/wiki/Transclusion\">transcluded</a>.
+        As an aside, one way to do this would be to include things such as lists of
+        material examined as images rather than text while the manuscript is being
+        edited. In the same way that you (probably) wouldn’t edit a photograph within
+        your text editor, you shouldn’t be editing data. When the manuscript is published
+        the data-generated portions can then be output as text.</p>\n<p>Of course
+        all of this assumes that we have taxonomic data in a database (or some other
+        storage format, including plain text and Mark-down, e.g. <a href=\"https://iphylo.blogspot.com/2022/04/obsidian-markdown-and-taxonomic-trees.html\">Obsidian,
+        markdown, and taxonomic trees</a>) that can generate outputs in the various
+        formats that we need.</p>\n<h2 id=\"archiving-data-and-images\">Archiving
+        data and images</h2>\n<p>One of the really nice things that <a href=\"http://plazi.org\">Plazi</a>
+        do is have a pipeline that sends taxonomic descriptions and images to Zenodo,
+        and similar data to GBIF. Any taxonomic journal should be able to do this.
+        Indeed, arguably each taxonomic treatment within the paper should be linked
+        to the Zenodo DOI at the time of publication. Indeed, we could imagine ultimately
+        having treatments as transclusions within the larger manuscript. Alternatively
+        we could store the treatments as parts of the larger article (rather like
+        chapters in a book), each with a CrossRef DOI. I’m still sceptical about whether
+        these treatments are as important as we make out, see <a href=\"https://iphylo.blogspot.com/2022/09/does-anyone-cite-taxonomic-treatments.html\">Does
+        anyone cite taxonomic treatments?</a>. But having machine-readable taxonomic
+        data archived and accessible is a good thing. Uploading the same data to GBIF
+        makes much of that data immediately accessible. Now that GBIF offers <a href=\"https://www.gbif.org/composition/3kQFinjwHbCGZeLb5OhwN2/gbif-hosted-portals\">hosted
+        portals</a> there is the possibility of having custom interfaces to data from
+        a particular journal.</p>\n<h2 id=\"name-and-identifier-registration\">Name
+        and identifier registration</h2>\n<p>We would also want automatic registration
+        of new taxonomic names, for which there are pipelines (see “A common registration-to-publication
+        automated pipeline for nomenclatural acts for higher plants (International
+        Plant Names Index, IPNI), fungi (Index Fungorum, MycoBank) and animals (ZooBank)”
+        <a href=\"https://doi.org/10.3897/zookeys.550.9551\">https://doi.org/10.3897/zookeys.550.9551</a>).
+        These pipelines do not seem to be documented in much detail, and the data
+        formats differ across registration agencies (e.g., IPNI and ZooBank). For
+        example, ZooBank seems to require TaxPub XML.</p>\n<p>Registration of names
+        and identifiers, especially across multiple registration agencies (ZooBank,
+        CrossRef, DataCite, etc.) requires some coordination, especially when one
+        registration agency requires identifiers from another.</p>\n<h2 id=\"summary\">Summary</h2>\n<p>If
+        data is key, then the taxonomic paper itself becomes something of a wrapper
+        around that data. It still serves the function of being human-readable, providing
+        broader context for the work, and as an archive that conforms to currently
+        accepted ways to publish taxonomic names. But in some ways it is the last
+        interesting part of the process.</p>\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/37y2z-gre70","uuid":"f3629c86-06e0-42c0-844a-266b03a91ef1","url":"https://iphylo.blogspot.com/2023/05/ten-years-and-million-links.html","title":"Ten
+        years and a million links","summary":"As trailed on a Twitter thread last
+        week I’ve been working on a manuscript describing the efforts to map taxonomic
+        names to their original descriptions in the taxonomic literature. Putting
+        together a manuscript on linking taxonomic names to the primary literature,
+        basically “um, what, exactly, have you been doing all these years?”. TL;DR
+        Across fungi, plants, and animals approx 1.3 million names have been linked
+        to a persistent identifier for a publication.— Roderic Page (@rdmpage) May
+        25,...","date_published":"2023-05-31T17:26:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
+        (Roderic Page)"}],"image":null,"content_html":"<p>As trailed on a Twitter
+        thread last week I’ve been working on a manuscript describing the efforts
+        to map taxonomic names to their original descriptions in the taxonomic literature.</p>\n<blockquote
+        class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">Putting together a manuscript
+        on linking taxonomic names to the primary literature, basically “um, what,
+        exactly, have you been doing all these years?”. TL;DR Across fungi, plants,
+        and animals approx 1.3 million names have been linked to a persistent identifier
+        for a publication.</p>— Roderic Page (@rdmpage) <a href=\"https://twitter.com/rdmpage/status/1661714128413573120?ref_src=twsrc%5Etfw\">May
+        25, 2023</a></blockquote> \n<p>The preprint is on bioRxiv <a href=\"https://doi.org/10.1101/2023.05.29.542697\">doi:10.1101/2023.05.29.542697</a></p>\n<blockquote>\n<p>A
+        major gap in the biodiversity knowledge graph is a connection between taxonomic
+        names and the taxonomic literature. While both names and publications often
+        have persistent identifiers (PIDs), such as Life Science Identifiers (LSIDs)
+        or Digital Object Identifiers (DOIs), LSIDs for names are rarely linked to
+        DOIs for publications. This article describes efforts to make those connections
+        across three large taxonomic databases: Index Fungorum, International Plant
+        Names Index (IPNI), and the Index of Organism Names (ION). Over a million
+        names have been matched to DOIs or other persistent identifiers for taxonomic
+        publications. This represents approximately 36% of names for which publication
+        data is available. The mappings between LSIDs and publication PIDs are made
+        available through ChecklistBank. Applications of this mapping are discussed,
+        including a web app to locate the citation of a taxonomic name, and a knowledge
+        graph that uses data on researcher’s ORCID ids to connect taxonomic names
+        and publications to authors of those names.</p>\n</blockquote>\n<p>Much of
+        the work has been linking taxa to names, which still has huge gaps. There
+        are also interesting differences in coverage between plants, animals, and
+        fungi (see preprint for details).</p>\n\n<div class=\"separator\" style=\"clear:
+        both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdWsSQhqi1DErXMIHm28g37-fiALNIsI5eQZmvoX_Fe03ZSwtKHbYt-LCsCCAUop0AGcwy_w7NpIjylVH1hNrM9oW-6j9e6tHASha49TTqFvDg2_tEx3r74RRFsjUo4M_Qat8NmKaZSChOt2hI3LsMjTVLrEVirEckU-9Ei7ug-7OHQlR4LA/s2276/animals-coverage.png\"
+        style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
+        border=\"0\" width=\"320\" data-original-height=\"2276\" data-original-width=\"2276\"
+        src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdWsSQhqi1DErXMIHm28g37-fiALNIsI5eQZmvoX_Fe03ZSwtKHbYt-LCsCCAUop0AGcwy_w7NpIjylVH1hNrM9oW-6j9e6tHASha49TTqFvDg2_tEx3r74RRFsjUo4M_Qat8NmKaZSChOt2hI3LsMjTVLrEVirEckU-9Ei7ug-7OHQlR4LA/s320/animals-coverage.png\"/></a></div><div
+        class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdyxlVJ-oyMCNPmHtHWjSxdxMSJvgzdWRGRF6Ad4dk7ab7gGDpuKdKmS9XhROkopw361ylfsTd1ZkwkF6BN0JlWNnVLCKY1AfryCfWKHkgPQM7u-0SELW9j8RlQIflb6ibaV64gwW7oJrEvOGECvR51F8EW8cRg-1usW-GBM5ymObj7zlObQ/s2276/fungi-coverage.png\"
+        style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
+        border=\"0\" width=\"320\" data-original-height=\"2276\" data-original-width=\"2276\"
+        src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdyxlVJ-oyMCNPmHtHWjSxdxMSJvgzdWRGRF6Ad4dk7ab7gGDpuKdKmS9XhROkopw361ylfsTd1ZkwkF6BN0JlWNnVLCKY1AfryCfWKHkgPQM7u-0SELW9j8RlQIflb6ibaV64gwW7oJrEvOGECvR51F8EW8cRg-1usW-GBM5ymObj7zlObQ/s320/fungi-coverage.png\"/></a></div><div
+        class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgf0YBuvNSXWAJTfQ1jk4XSocMzCYHP7t6IPUqhjQ3mftgM_850igWaD2copgNH6Xk6T62xBU641wvwOvXgCCDY3m2xC_gaILXO9RGx8H3Gpy5OOncsLb9smpT2LIgtYOExVBVdDRWqA0AZ8-mQjWL7dL5TiG7MqVu8spT8ACoGOPR_T36hRA/s2276/plants-coverage.png\"
+        style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
+        border=\"0\" width=\"320\" data-original-height=\"2276\" data-original-width=\"2276\"
+        src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgf0YBuvNSXWAJTfQ1jk4XSocMzCYHP7t6IPUqhjQ3mftgM_850igWaD2copgNH6Xk6T62xBU641wvwOvXgCCDY3m2xC_gaILXO9RGx8H3Gpy5OOncsLb9smpT2LIgtYOExVBVdDRWqA0AZ8-mQjWL7dL5TiG7MqVu8spT8ACoGOPR_T36hRA/s320/plants-coverage.png\"/></a></div>\n\n\nThere
+        is also a simple app to demonstrate these links, see <a href=\"https://species-cite.herokuapp.com\">https://species-cite.herokuapp.com</a>.\n\n\n\n<blockquote>\n<p>Written
+        with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/92rdb-5fe58","uuid":"d33d4f49-b281-4997-9eb9-dbad1e52d9bd","url":"https://iphylo.blogspot.com/2022/09/local-global-identifiers-for.html","title":"Local
+        global identifiers for decentralised wikis","summary":"I''ve been thinking
+        a bit about how one could use a Markdown wiki-like tool such as Obsidian to
+        work with taxonomic data (see earlier posts Obsidian, markdown, and taxonomic
+        trees and Personal knowledge graphs: Obsidian, Roam, Wikidata, and Xanadu).
+        One \"gotcha\" would be how to name pages. If we treat the database as entirely
+        local, then the page names don''t matter, but what if we envisage sharing
+        the database, or merging it with others (for example, if we divided a taxon
+        up into chunks, and...","date_published":"2022-09-08T16:09:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
+        (Roderic Page)"}],"image":null,"content_html":"<p>I''ve been thinking a bit
+        about how one could use a Markdown wiki-like tool such as Obsidian to work
+        with taxonomic data (see earlier posts <a href=\"https://iphylo.blogspot.com/2022/04/obsidian-markdown-and-taxonomic-trees.html\">Obsidian,
+        markdown, and taxonomic trees</a> and <a href=\"https://iphylo.blogspot.com/2020/08/personal-knowledge-graphs-obsidian-roam.html\">Personal
+        knowledge graphs: Obsidian, Roam, Wikidata, and Xanadu</a>).</p>\n\n<p>One
+        \"gotcha\" would be how to name pages. If we treat the database as entirely
+        local, then the page names don''t matter, but what if we envisage sharing
+        the database, or merging it with others (for example, if we divided a taxon
+        up into chunks, and different people worked on those different chunks)? </p>\n\n<p>This
+        is the attraction of globally unique identifiers. You and I can independently
+        work on the same thing, such as data linked to scientific paper, safe in the
+        knowledge that if we both use the DOI for that paper we can easily combine
+        what we''ve done. But global identifiers can also be a pain, especially if
+        we need to use a service to look them up (\"is there a DOI for this paper?\",
+        \"what is the LSID for this taxonomic name?\").</p>\n\n<p>Life would be easier
+        if we could generate identifiers \"locally\", but had some assurance that
+        they would be globally unique, and that anyone else generating an identifier
+        for the same thing would arrive at the same identifier (this eliminates things
+        such as <a href=\"https://en.wikipedia.org/wiki/Universally_unique_identifier\">UUIDs</a>
+        which are intentionally designed to prvent people genrrating the same identifier).
+        One approach is \"content addressing\" (see, e.g. <a href=\"https://web.archive.org/web/20210514054054/https://bentrask.com/notes/content-addressing.html\">Principles
+        of Content Addressing</a> - dead link but in the Wayabck Machine, see also
+        <a href=\"https://github.com/btrask/stronglink\">btrask/stronglink</a>). For
+        example, we can generate a cryptographic hash of a file (such as a PDF) and
+        use that as the identifier.</p>\n\n<p>Now the problem is that we have globally
+        unique, but ugly and unfriendly identifiers (such as \"6c98136eba9084ea9a5fc0b7693fed8648014505\").
+        What we need are nice, easy to use identifiers we can use as page names. <a
+        href=\"https://species.wikimedia.org/wiki/Main_Page\">Wikispecies</a> serves
+        as a possible role model, where taxon names serve as page names, as do simplified
+        citations (e.g., authors and years). This model runs into the problem that
+        taxon names aren''t unique, nor are author + year combinations. In Wikispecies
+        this is resolved by having a centralised database where it''s first come,
+        first served. If there is a name clash you have to create a new name for your
+        page. This works, but what if you have multiple databases un by different
+        people? How do we ensure the identifiers are the same?</p>\n\n<p>Then I remembered
+        Roger Hyam''s flight of fantasy over a decade ago: <a href=\"http://www.hyam.net/blog/archives/1007\">SpeciesIndex.org
+        – an impractical, practical solution</a>. He proposed the following rules
+        to generate a unique URI for a taxonomic name:\n\n<ul>\n <li>The URI must
+        start with \"http://speciesindex.org\" followed by one or more of the following
+        separated by slashes.</li>\n\n <li>First word of name. Must only contain
+        letters. Must not be the same as one of the names of the nomenclatural codes
+        (icbn or iczn). Optional but highly recommended.</li> \n\n <li>Second word
+        of name. Must only contain letters and not be a nomenclatural code name. Optional.</li>
+        \n\n <li>Third word of name. Must only contain letters and not be a nomenclatural
+        code name. Optional.</li> \n\n <li>Year of publication. Must be an integer
+        greater than 1650 and equal to or less than the current year. If this is an
+        ICZN name then this should be the year the species (epithet) was published
+        as is commonly cited after the name. If this is an ICBN name at species or
+        below then it is the date of the combination. Optional. Recommended for zoological
+        names if known. Not recommended for botanical names unless there is a known
+        problem with homonyms in use by non-taxonomists.</li>\n \n<li>Nomenclatural
+        code governing the name of the taxon. Currently this must be either ''icbn''
+        or ''iczn''. This may be omitted if the code is unknown or not relevant. Other
+        codes may be added to this list.</li> \n <li>Qualifier This must be a Version
+        4 RFC-4122 UUID. Optional. Used to generate a new independent identifier for
+        a taxon for which the conventional name is unknown or does not exist or to
+        indicate a particular taxon concept that bears the embedded name.</li>\n\n <li>The
+        whole speciesindex.org URI string should be considered case\nsensitive. Everything
+        should be lower case apart from the first letter of words that are specified
+        as having upper case in their relevant codes e.g. names at and above the rank
+        of genus.</li>\n</ul>\n</p>\n\n<p>Roger is basically arging that while names
+        aren''t unique (i.e., we have homonyms such as <i>Abronia</i>) they are pretty
+        close to being so, and with a few tweaks we can come up with a unique representation.
+        Another way to think about this if we had a database of all taxonomics, we
+        could construct a <a href=\"https://en.wikipedia.org/wiki/Trie\">trie</a>
+        and for each name find the shortest set of name parts (genus, species, etc),
+        year, and code that gave us a unique string for that name. In many cases the
+        species name may be all we need, in other cases we may need to add year and/or
+        nomenclatural code to arrive at a unique string. \n\n</p>\n\n<p>What about
+        bibliographic references? Well many of us will have databases (e.g., Endnote,
+        Mendeley, Zotero, etc.) which generate \"cite keys\". These are typically
+        short, memorable identifiers for a reference that are unique within that database.
+        There is an interesting discussion on the <a href=\"https://discourse.jabref.org/t/universal-citekey-generator/2441/2\">JabRef
+        forum</a> regarding a \"Universal Citekey Generator\", and source code is
+        available <a href=\"https://github.com/cparnot/universal-citekey-js\">cparnot/universal-citekey-js</a>.
+        I''ve yet to explore this in detail, but it looks a promising way to generate
+        unique identifiers from basic metadata (echos of more elaborate schemes such
+        as <a href=\"https://en.wikipedia.org/wiki/Serial_Item_and_Contribution_Identifier\">SICIs</a>).
+        For example,\n\n<blockquote>Senna AR, Guedes UN, Andrade LF, Pereira-Filho
+        GH. 2021. A new species of amphipod Pariphinotus Kunkel, 1910 (Amphipoda:
+        Phliantidae) from Southwestern Atlantic. Zool Stud 60:57. doi:10.6620/ZS.2021.60-57.</blockquote>\n\nbecomes
+        \"Senna:2021ck\". So if two people have the same, core, metadata for a paper
+        they can generate the same key.</p>\n\n<p>Hence it seems with a few conventions
+        (and maybe some simple tools to support them) we could have decentralised
+        wiki-like tools that used the same identifiers for the same things, and yet
+        those identfiiers were short and human-friendly.</p>","tags":["citekey","identfiiers","markdown","obsidian","Roger
+        Hyam"],"language":"en","references":null},{"id":"https://doi.org/10.59350/j77nc-e8x98","uuid":"c6b101f4-bfbc-4d01-921d-805c43c85757","url":"https://iphylo.blogspot.com/2022/08/linking-taxonomic-names-to-literature.html","title":"Linking
+        taxonomic names to the literature","summary":"Just some thoughts as I work
+        through some datasets linking taxonomic names to the literature. In the diagram
+        above I''ve tried to capture the different situatios I encounter. Much of
+        the work I''ve done on this has focussed on case 1 in the diagram: I want
+        to link a taxonomic name to an identifier for the work in which that name
+        was published. In practise this means linking names to DOIs. This has the
+        advantage of linking to a citable indentifier, raising questions such as whether
+        citations...","date_published":"2022-08-22T17:19:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
+        (Roderic Page)"}],"image":null,"content_html":"Just some thoughts as I work
+        through some datasets linking taxonomic names to the literature.\n\n<div class=\"separator\"
+        style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5ZkbnumEWKZpf0isei_hUJucFlOOK08-SKnJknD8B4qhAX36u-vMQRnZdhRJK5rPb7HNYcnqB7qE4agbeStqyzMkWHrzUj14gPkz2ohmbOVg8P_nHo0hM94PD1wH3SPJsaLAumN8vih3ch9pjH2RaVWZBLwwhGhNu0FS1m5z6j5xt2NeZ4w/s2140/linking%20to%20names144.jpg\"
+        style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
+        border=\"0\" height=\"600\" data-original-height=\"2140\" data-original-width=\"1604\"
+        src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5ZkbnumEWKZpf0isei_hUJucFlOOK08-SKnJknD8B4qhAX36u-vMQRnZdhRJK5rPb7HNYcnqB7qE4agbeStqyzMkWHrzUj14gPkz2ohmbOVg8P_nHo0hM94PD1wH3SPJsaLAumN8vih3ch9pjH2RaVWZBLwwhGhNu0FS1m5z6j5xt2NeZ4w/s600/linking%20to%20names144.jpg\"/></a></div>\n\n<p>In
+        the diagram above I''ve tried to capture the different situatios I encounter.
+        Much of the work I''ve done on this has focussed on case 1 in the diagram:
+        I want to link a taxonomic name to an identifier for the work in which that
+        name was published. In practise this means linking names to DOIs. This has
+        the advantage of linking to a citable indentifier, raising questions such
+        as whether citations of taxonmic papers by taxonomic databases could become
+        part of a <a href=\"https://iphylo.blogspot.com/2022/08/papers-citing-data-that-cite-papers.html\">taxonomist''s
+        Google Scholar profile</a>.</p>\n\n<p>In many taxonomic databases full work-level
+        citations are not the norm, instead taxonomists cite one or more pages within
+        a work that are relevant to a taxonomic name. These \"microcitations\" (what
+        the U.S. legal profession refer to as \"point citations\" or \"pincites\", see
+        <a href=\"https://rasmussen.libanswers.com/faq/283203\">What are pincites,
+        pinpoints, or jump legal references?</a>) require some work to map to the
+        work itself (which is typically the thing that has a citatble identifier such
+        as a DOI).</p>\n\n<p>Microcitations (case 2 in the diagram above) can be quite
+        complex. Some might simply mention a single page, but others might list a
+        series of (not necessarily contiguous) pages, as well as figures, plates etc.
+        Converting these to citable identifiers can be tricky, especially as in most
+        cases we don''t have page-level identifiers. The Biodiversity Heritage Library
+        (BHL) does have URLs for each scanned page, and we have a standard for referring
+        to pages in a PDF (<code>page=&lt;pageNum&gt;</code>, see <a href=\"https://datatracker.ietf.org/doc/html/rfc8118\">RFC
+        8118</a>). But how do we refer to a set of pages? Do we pick the first page?
+        Do we try and represent a set of pages, and if so, how?</p>\n\n<p>Another
+        issue with page-level identifiers is that not everything on a given page may
+        be relevant to the taxonomic name. In case 2 above I''ve shaded in the parts
+        of the pages and figure that refer to the taxonomic name. An example where
+        this can be problematic is the recent test case I created for BHL where a
+        page image was included for the taxonomic name <a href=\"https://www.gbif.org/species/195763322\"><i>Aphrophora
+        impressa</i></a>. The image includes the species description and a illustration,
+        as well as text that relates to other species.</p>\n\n<div class=\"separator\"
+        style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1SAv1XiVBPHXeMLPfdsnTh4Sj4m0AQyqjXM0faXyvbNhBXFDl7SawjFIkMo3qz4RQhDEyZXkKAh5nF4gtfHbVXA6cK3NJ46CuIWECC6HmJKtTjoZ0M3QQXvLY_X-2-RecjBqEy68M0cEr0ph3l6KY51kA9BvGt9d4id314P71PBitpWMATg/s3467/https---www.biodiversitylibrary.org-pageimage-29138463.jpeg\"
+        style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
+        border=\"0\" height=\"400\" data-original-height=\"3467\" data-original-width=\"2106\"
+        src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1SAv1XiVBPHXeMLPfdsnTh4Sj4m0AQyqjXM0faXyvbNhBXFDl7SawjFIkMo3qz4RQhDEyZXkKAh5nF4gtfHbVXA6cK3NJ46CuIWECC6HmJKtTjoZ0M3QQXvLY_X-2-RecjBqEy68M0cEr0ph3l6KY51kA9BvGt9d4id314P71PBitpWMATg/s400/https---www.biodiversitylibrary.org-pageimage-29138463.jpeg\"/></a></div>\n\n<p>Given
+        that not everything on a page need be relevant, we could extract just the
+        relevant blocks of text and illustrations (e.g., paragraphs of text, panels
+        within a figure, etc.) and treat that set of elements as the thing to cite.
+        This is, of course, what <a href=\"http://plazi.org\">Plazi</a> are doing.
+        The set of extracted blocks is glued together as a \"treatment\", assigned
+        an identifier (often a DOI), and treated as a citable unit. It would be interesting
+        to see to what extent these treatments are actually cited, for example, do
+        subsequent revisions that cite work that include treatments cite those treatments,
+        or just the work itself? Put another way, are we creating <a href=\"https://iphylo.blogspot.com/2012/09/decoding-nature-encode-ipad-app-omg-it.html\">\"threads\"</a>
+        between taxonomic revisions?</p>\n\n<p>One reason for these notes is that
+        I''m exploring uploading taxonomic name - literature links to <a href=\"https://www.checklistbank.org\">ChecklistBank</a>
+        and case 1 above is easy, as is case 3 (if we have treatment-level identifiers).
+        But case 2 is problematic because we are linking to a set of things that may
+        not have an identifier, which means a decision has to be made about which
+        page to link to, and how to refer to that page.</p>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/w18j9-v7j10","uuid":"d811172e-7798-403c-a83d-3d5317a9657e","url":"https://iphylo.blogspot.com/2022/08/papers-citing-data-that-cite-papers.html","title":"Papers
343
+ citing data that cite papers: CrossRef, DataCite, and the Catalogue of Life","summary":"Quick
344
+ notes to self following on from a conversation about linking taxonomic names
345
+ to the literature. Is there a way to turn those links into countable citations
346
+ (even if just one per database) for Google Scholar?&mdash; Wayne Maddison
347
+ (@WayneMaddison) August 3, 2022 There are different sorts of citation: Paper
348
+ cites another paper Paper cites a dataset Dataset cites a paper Citation
349
+ type (1) is largely a solved problem (although there are issues of the ownership
350
+ and use of this...","date_published":"2022-08-03T11:33:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
351
+ (Roderic Page)"}],"image":null,"content_html":"Quick notes to self following
352
+ on from a conversation about linking taxonomic names to the literature.\n\n<blockquote
353
+ class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">Is there a way to turn
354
+ those links into countable citations (even if just one per database) for Google
355
+ Scholar?</p>&mdash; Wayne Maddison (@WayneMaddison) <a href=\"https://twitter.com/WayneMaddison/status/1554644747406348288?ref_src=twsrc%5Etfw\">August
356
+ 3, 2022</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
357
+ charset=\"utf-8\"></script>\n\nThere are different sorts of citation:\n\n<ol>\n <li>Paper
358
+ cites another paper</li>\n <li>Paper cites a dataset</li>\n <li>Dataset
359
+ cites a paper</li>\n</ol>\n\nCitation type (1) is largely a solved problem
360
+ (although there are issues of the ownership and use of this data, see e.g.
361
+ <a href=\"https://iphylo.blogspot.com/2020/07/zootaxa-has-no-impact-factor.html\">Zootaxa
362
+ has no impact factor</a>).\n\nCitation type (2) is becoming more widespread
363
+ (but not perfect as GBIF''s <a href=\"https://twitter.com/search?q=%23citethedoi&src=typed_query\">#citethedoi</a>
364
+ campaign demonstrates). But the idea is well accepted and there are guides
365
+ to how to do it, e.g.:\n\n<blockquote>\nCousijn, H., Kenall, A., Ganley, E.
366
+ et al. A data citation roadmap for scientific publishers. Sci Data 5, 180259
367
+ (2018). <a href=\"https://doi.org/10.1038/sdata.2018.259\">https://doi.org/10.1038/sdata.2018.259</a>\n</blockquote>\n\nHowever,
368
+ things do get problematic because most (but not all) DOIs for publications
369
+ are managed by CrossRef, which has an extensive citation database linking
370
+ papers to other papers. Most datasets have DataCite DOIs, and DataCite manages
371
+ its own citations links, but as far as I''m aware these two systems don''t
372
+ really talk to each other.\n\nCitation type (3) is the case where a database
373
+ is largely based on the literature, which applies to taxonomy. Taxonomic databases
374
+ are essentially collections of literature that have opinions on taxa, and
375
+ the database may simply compile those (e.g., a nomenclator), or come to some
376
+ view on the applicability of each name. In an ideal world, each reference
377
+ included in a taxonomic database would gain a citation, which would help better
378
+ reflect the value of that work (a long standing bone of contention for taxonomists).\n\nIt
379
+ would be interesting to explore these issues further. CrossRef and DataCite
380
+ do share <a href=\"https://www.crossref.org/services/event-data/\">Event Data</a>
381
+ (see also <a href=\"https://support.datacite.org/docs/eventdata-guide\">DataCite
382
+ Event Data</a>). Can this track citations of papers by a dataset?\n \n \nMy
383
+ take on Wayne''s question:\n\n<blockquote>\n Is there a way to turn those
384
+ links into countable citations (even if just one per database) for Google
385
+ Scholar?\n</blockquote>\n\nis that he''s is after type 3 citations, which
386
+ I don''t think we have a way to handle just yet (but I''d need to look at
387
+ Event Data a bit more). Google Scholar is a black box, and the academic community''s
388
+ reliance on it for metrics is troubling. But it would be interesting to try
389
+ and figure out if there is a way to get Google Scholar to index the citations
390
+ of taxonomic papers by databases. For instance, the <a href=\"https://www.catalogueoflife.org/\">Catalogue
391
+ of Life</a> has an ISSN <a href=\"https://portal.issn.org/resource/ISSN/2405-884X\">2405-884X</a>
392
+ so it can be treated as a publication. At the moment its web pages have lots
393
+ of identifiers for people managing data and their organisations (lots of <a
394
+ href=\"https://orcid.org\">ORCIDs</a> and <a href=\"https://ror.org\">RORs</a>),
395
+ and DOIs for individual datasets (e.g., <a href=\"https://www.checklistbank.org/dataset/9828/about\">checklistbank.org</a>)
396
+ but precious little in the way of DOIs for publications (or, indeed, ORCIDs
397
+ for taxonomists). What would it take for taxonomic publications in the Catalogue
398
+ of Life to be treated as first class citations?","tags":["Catalogue of Life","citation","CrossRef","DataCite","DOI"],"language":"en","references":null},{"id":"https://doi.org/10.59350/ws094-1w310","uuid":"6bed78ec-0029-4096-b1c3-48a55a9fdb3b","url":"https://iphylo.blogspot.com/2023/04/chatgpt-of-course.html","title":"ChatGPT,
399
+ of course","summary":"I haven’t blogged for a while, work and other reasons
400
+ have meant I’ve not had much time to think, and mostly I blog to help me think.
401
+ ChatGPT is obviously a big thing at the moment, and once we get past the moral
402
+ panic (“students can pass exams using AI!”) there are a lot of interesting
403
+ possibilities to explore. Inspired by essays such as How Q&amp;A systems based
404
+ on large language models (eg GPT4) will change things if they become the dominant
405
+ search paradigm — 9 implications for libraries...","date_published":"2023-04-03T12:52:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
406
+ (Roderic Page)"}],"image":null,"content_html":"<p>I haven’t blogged for a
407
+ while, work and <a href=\"https://iphylo.blogspot.com/2023/03/dugald-stuart-page-1936-2022.html\">other
408
+ reasons</a> have meant I’ve not had much time to think, and mostly I blog
409
+ to help me think.</p>\n<p>ChatGPT is obviously a big thing at the moment,
410
+ and once we get past the moral panic (“students can pass exams using AI!”)
411
+ there are a lot of interesting possibilities to explore. Inspired by essays
412
+ such as <a href=\"https://medium.com/@aarontay/how-q-a-systems-based-on-large-language-models-eg-gpt4-will-change-things-if-they-become-the-norm-c7cf62736ba\">How
413
+ Q&amp;A systems based on large language models (eg GPT4) will change things
414
+ if they become the dominant search paradigm — 9 implications for libraries</a>
415
+ and <a href=\"https://about.sourcegraph.com/blog/cheating-is-all-you-need\">Cheating
416
+ is All You Need</a>, as well as <a href=\"https://paul-graham-gpt.vercel.app/\">Paul
417
+ Graham GPT</a>
418
+ I thought I’d try a few things and see where this goes.</p>\n<p>ChatGPT can
419
+ do some surprising things.</p>\n<h4 id=\"parse-bibliographic-data\">Parse
420
+ bibliographic data</h4>\n<p>I spend a LOT of time working with bibliographic
421
+ data, trying to parse it into structured data. ChatGPT can do this:</p>\n\n<div
422
+ class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2bA-Xk6D2wfexP0cs-U3AQa534hNW7J06I0avM82WD3dZSiRvXyHFScv3Hb8QfKKp1fk4GU4olhv1BOq8Pqu4IJgptlBap3xcLI1bFJwB3EOmHKxqf4iy2exJy4vwZ2n4I0U0JVui4Xmhvdy3mTbMSXCRbmCglNFULi5oQEhtPVG4gLJjdw/s924/Screenshot%202023-04-03%20at%2012.59.30.png\"
423
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
424
+ border=\"0\" height=\"400\" data-original-height=\"924\" data-original-width=\"738\"
425
+ src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2bA-Xk6D2wfexP0cs-U3AQa534hNW7J06I0avM82WD3dZSiRvXyHFScv3Hb8QfKKp1fk4GU4olhv1BOq8Pqu4IJgptlBap3xcLI1bFJwB3EOmHKxqf4iy2exJy4vwZ2n4I0U0JVui4Xmhvdy3mTbMSXCRbmCglNFULi5oQEhtPVG4gLJjdw/s400/Screenshot%202023-04-03%20at%2012.59.30.png\"/></a></div>\n\n<p>Note
426
+ that it does more than simply parse the strings, it expands journal abbreviations
427
+ such as “J. Malay Brch. R. Asiat. Soc.” to the full name “Journal of the Malayan
428
+ Branch of the Royal Asiatic Society”. So we can get clean, parsed data in
429
+ a range of formats.</p>\n<h4 id=\"parse-specimens\">Parse specimens</h4>\n<p>Based
430
+ on the success with parsing bibliographic strings I wondered how well it could
431
+ handle specimen citations (“material examined”). Elsewhere I’ve been
432
+ critical of Plazi’s ability to do this, see <a href=\"https://iphylo.blogspot.com/2021/10/problems-with-plazi-parsing-how.html\">Problems
433
+ with Plazi parsing: how reliable are automated methods for extracting specimens
434
+ from the literature?</a>.</p>\n<p>For example, given this specimen record
435
+ on p. 130 of <a href=\"https://doi.org/10.5852/ejt.2021.775.1553\">doi:10.5852/ejt.2021.775.1553</a></p>\n<blockquote>\n<p>LAOS
436
+ • Kammoune Province, Bunghona Market, 7 km N of Xe Bangfai River;<br>\n17.13674°
437
+ N, 104.98591° E; E. Jeratthitikul, K. Wisittikoson, A. Fanka, N. Wutthituntisil
438
+ and P. Prasankok leg.; sold by local people;<br>\nMUMNH-UNI2831.</p>\n</blockquote>\n<p>ChatGPT
439
+ extracted a plausible Darwin Core record:</p>\n\n<div class=\"separator\"
440
+ style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIwVYuPXl47t1uSpJ3xpbsIT60UUWeKTKSBpIQgD8L2gkj_fuNf4sZKR76JgsfVYYHPs9G9VoM-caUhG1P3irMyB6erSYwYI5LDlLkxdrAFWdbY3ExDs34nMVOF0tq-NBtsDKMZsSJWjhJaGt9v3dtAqWvRyVMNZ46eG2sRhJTICTkFghKtw/s901/Screenshot%202023-04-03%20at%2013.30.54.png\"
441
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
442
+ border=\"0\" height=\"400\" data-original-height=\"901\" data-original-width=\"764\"
443
+ src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIwVYuPXl47t1uSpJ3xpbsIT60UUWeKTKSBpIQgD8L2gkj_fuNf4sZKR76JgsfVYYHPs9G9VoM-caUhG1P3irMyB6erSYwYI5LDlLkxdrAFWdbY3ExDs34nMVOF0tq-NBtsDKMZsSJWjhJaGt9v3dtAqWvRyVMNZ46eG2sRhJTICTkFghKtw/s400/Screenshot%202023-04-03%20at%2013.30.54.png\"/></a></div>\n\n<p>I’ve
444
+ been experimenting with parsing specimen records using the same machine learning
445
+ approach for bibliographic data (e.g., <a href=\"https://iphylo.blogspot.com/2021/07/citation-parsing-released.html\">Citation
446
+ parsing tool released</a>), perhaps it’s time to rethink that idea.</p>\n<h4
447
+ id=\"correct-ocr-errors\">Correct OCR errors</h4>\n<p>OCR errors are a well
448
+ known limitation of the Biodiversity Heritage Library (BHL), see for example
449
+ <a href=\"https://blog.biodiversitylibrary.org/2022/07/ocr-improvements-early-analysis.html\">OCR
450
+ Improvements: An Early Analysis</a>. Given that ChatGPT is built on top of
451
+ a large language model it seems obvious to see how it handles OCR errors.
452
+ I tried the following snippet of text from <a href=\"https://www.biodiversitylibrary.org/page/4312227\">BHL
453
+ page 4312227</a>:</p>\n<blockquote>\n<p>i BREVIORA No. 185</p>\n<p>Jiiaii
454
+ Jr., 16, who always join him and Hedriek in th(&gt; search for<br>\nfrogs.
455
+ Iloraeio Mayorga, research assistant to tlie author, has also<br>\nbeen (wtreniely
456
+ useful, not only in field work, but also in the<br>\ntedious curatorial and
457
+ laboratory work that usually fol- lows. Dr.<br>\nJohn Randall kindly took
458
+ tlie photouraplis of E. hedricki, and the<br>\nGaliiiaues family of San Juan
459
+ made their El Verde house available as<br>\na (MMiter of activities for the
460
+ colleetiuff party. To all these<br>\njx’opic, the autlioi- feels (iee])ly
461
+ indebted.</p>\n<p>Elet’tiierodactylus iiKnRrcKi s]). n.</p>\n<p>Type. Museum
462
+ of Coniparative Zoology Xo. ;}()!)()8, c^ from El Verde,<br>\nwest flank
463
+ of El Yuiuinc Puerto Hico. I.IOO ft. Coll. J. A. Kivero,<br>\n11 Aug-. I!)(i2.</p>\n</blockquote>\n<p>ChatGPT
464
+ returns this:</p>\n<blockquote>\n<p>i BREVIORA No. 185</p>\n<p>Juan Jr., 16,
465
+ who always joins him and Hedrick in the search for<br>\nfrogs. Horacio Mayorga,
466
+ research assistant to the author, has also<br>\nbeen extremely useful, not
467
+ only in field work, but also in the tedious<br>\ncuratorial and laboratory
468
+ work that usually follows. Dr. John Randall<br>\nkindly took the photographs
469
+ of E. hedricki, and the Galinanes family<br>\nof San Juan made their El Verde
470
+ house available as a center of<br>\nactivities for the collecting party. To
471
+ all these people, the author<br>\nfeels deeply indebted.</p>\n<p>Eleutherodactylus
472
+ hedricki sp. n.</p>\n<p>Type. Museum of Comparative Zoology No. 30908, c?
473
+ from El Verde, west<br>\nflank of El Yunque, Puerto Rico. 1100 ft. Coll. J.
474
+ A. Rivero, 11 Aug.<br>\n1962.</p>\n</blockquote>\n\n<div class=\"separator\"
475
+ style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibZLxXSTdt64SMhfqS8NCcLOxbVaFmkhMQx0yEUWcm696EL6k2_VgxZFEDT-SEqOChmwDggoQLkuU6DpcOF_d-nkQRKzQLt65WKfkFlIzDRV9LVrDqmu-D0OZl1p5Vva4QUdvDb3JpeKPPDXZ-BcZqq8V4PyGepxlnwtt7-q4Sg6_aHD58gw/s859/Screenshot%202023-04-03%20at%2013.07.53.png\"
476
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
477
+ border=\"0\" width=\"400\" data-original-height=\"510\" data-original-width=\"859\"
478
+ src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibZLxXSTdt64SMhfqS8NCcLOxbVaFmkhMQx0yEUWcm696EL6k2_VgxZFEDT-SEqOChmwDggoQLkuU6DpcOF_d-nkQRKzQLt65WKfkFlIzDRV9LVrDqmu-D0OZl1p5Vva4QUdvDb3JpeKPPDXZ-BcZqq8V4PyGepxlnwtt7-q4Sg6_aHD58gw/s400/Screenshot%202023-04-03%20at%2013.07.53.png\"/></a></div>\n\n<p>Comparing
479
+ this to the scanned image, ChatGPT does pretty well, for example the gobbledegook
480
+ “Elet’tiierodactylus iiKnRrcKi” is correctly translated as “Eleutherodactylus
481
+ hedricki”. Running all of BHL through ChatGPT probably isn’t feasible, but
482
+ one could imagine targeted cleaning of key papers.</p>\n<h4 id=\"summary\">Summary</h4>\n<p>These
483
+ small experiments are fairly trivial, but they are the sort of tedious tasks
484
+ that would otherwise require significant programming (or other resources)
485
+ to solve. But ChatGPT can do rather more, as I hope to discuss in the next
486
+ post.</p>\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/7esgr-61v1","uuid":"96fa91d5-459c-482f-aa38-dda6e0a30e20","url":"https://iphylo.blogspot.com/2022/01/large-graph-viewer-experiments.html","title":"Large
487
+ graph viewer experiments","summary":"I keep returning to the problem of viewing
488
+ large graphs and trees, which means my hard drive has accumulated lots of
489
+ failed prototypes. Inspired by some recent discussions on comparing taxonomic
490
+ classifications I decided to package one of these (wildly incomplete) prototypes
491
+ up so that I can document the idea and put the code somewhere safe. Very cool,
492
+ thanks for sharing this-- the tree diff is similar to what J Rees has been
493
+ cooking up lately with his &#39;cl diff&#39; tool. I&#39;ll tag...","date_published":"2022-01-02T11:25:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
494
+ (Roderic Page)"}],"image":null,"content_html":"<p>I keep returning to the
495
+ problem of viewing large graphs and trees, which means my hard drive has accumulated
496
+ lots of failed prototypes. Inspired by some recent discussions on comparing
497
+ taxonomic classifications I decided to package one of these (wildly incomplete)
498
+ prototypes up so that I can document the idea and put the code somewhere safe.</p>\n\n<blockquote
499
+ class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">Very cool, thanks for sharing
500
+ this-- the tree diff is similar to what J Rees has been cooking up lately
501
+ with his &#39;cl diff&#39; tool. I&#39;ll tag <a href=\"https://twitter.com/beckettws?ref_src=twsrc%5Etfw\">@beckettws</a>
502
+ in here too so he can see potential crossover. The goal is autogenerate diffs
503
+ like this as 1st step to mapping taxo name-to concept</p>&mdash; Nate Upham
504
+ (@n8_upham) <a href=\"https://twitter.com/n8_upham/status/1475834371131289608?ref_src=twsrc%5Etfw\">December
505
+ 28, 2021</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
506
+ charset=\"utf-8\"></script>\n\n<h2>Google Maps-like viewer</h2>\n\n<div class=\"separator\"
507
+ style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/a/AVvXsEiGDVVKGmwRi0gn-kEtpxBCUEdK-WCLBSpmkALAUDHGeTPkTSICSYVgnsj5N7zUeUfQALfFFHJJCsfeFvRULKbmqxLz51rW5hp_11dVXh-FHrnlRA7RJTA7I82l7sERF5jAjlah0LyEheVayO9nAfHTGZDuw5rnCe9iEO3dHQmA8_5AFIlJJg=s500\"
508
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
509
+ border=\"0\" width=\"400\" data-original-height=\"448\" data-original-width=\"500\"
510
+ src=\"https://blogger.googleusercontent.com/img/a/AVvXsEiGDVVKGmwRi0gn-kEtpxBCUEdK-WCLBSpmkALAUDHGeTPkTSICSYVgnsj5N7zUeUfQALfFFHJJCsfeFvRULKbmqxLz51rW5hp_11dVXh-FHrnlRA7RJTA7I82l7sERF5jAjlah0LyEheVayO9nAfHTGZDuw5rnCe9iEO3dHQmA8_5AFIlJJg=s400\"/></a></div>\n\n<p>I''ve
511
+ created a simple viewer that uses a tiled map viewer (like Google Maps) to
512
+ display a large graph. The idea is to draw the entire graph scaled to a 256
513
+ x 256 pixel tile. The graph is stored in a database that supports geospatial
514
+ queries, which means the queries to retrieve the individual tiles needed to
515
+ display the graph at different levels of resolution are simply bounding box
516
+ queries to a database. I realise that this description is cryptic at best.
517
+ The GitHub repository <a href=\"https://github.com/rdmpage/gml-viewer\">https://github.com/rdmpage/gml-viewer</a>
518
+ has more details and the code itself. There''s a lot to do, especially adding
519
+ support for labels(!) which presents some interesting challenges (<a href=\"https://en.wikipedia.org/wiki/Level_of_detail_(computer_graphics)\">levels
520
+ of detail</a> and <a href=\"https://en.wikipedia.org/wiki/Cartographic_generalization\">generalization</a>).
521
+ The code doesn''t do any layout of the graph itself, instead I''ve used the
522
+ <a href=\"https://www.yworks.com/products/yed\">yEd</a> tool to compute the
523
+ x,y coordinates of the graph.</p>\n\n<p>Since this exercise was inspired by
524
+ a discussion of the <a href=\"https://www.mammaldiversity.org\">ASM Mammal
525
+ Diversity Database</a>, the graph I''ve used for the demonstration above is
526
+ the ASM classification of extant mammals. I guess I need to solve the labelling
527
+ issue fairly quickly!</p>","tags":["Google Maps","graph","Mammal Species of
528
+ the World","mammals","taxonomy"],"language":"en","references":null},{"id":"https://doi.org/10.59350/m48f7-c2128","uuid":"8aea47e4-f227-45f4-b37b-0454a8a7a3ff","url":"https://iphylo.blogspot.com/2023/04/chatgpt-semantic-search-and-knowledge.html","title":"ChatGPT,
529
+ semantic search, and knowledge graphs","summary":"One thing about ChatGPT
530
+ is it has opened my eyes to some concepts I was dimly aware of but am only
531
+ now beginning to fully appreciate. ChatGPT enables you to ask it questions, but
532
+ the answers depend on what ChatGPT “knows”. As several people have noted,
533
+ what would be even better is to be able to run ChatGPT on your own content.
534
+ Indeed, ChatGPT itself now supports this using plugins. Paul Graham GPT However,
535
+ it’s still useful to see how to add ChatGPT functionality to your own content
536
+ from...","date_published":"2023-04-03T15:30:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
537
+ (Roderic Page)"}],"image":null,"content_html":"<p>One thing about ChatGPT
538
+ is it has opened my eyes to some concepts I was dimly aware of but am only
539
+ now beginning to fully appreciate. ChatGPT enables you to ask it questions, but
540
+ the answers depend on what ChatGPT “knows”. As several people have noted,
541
+ what would be even better is to be able to run ChatGPT on your own content.
542
+ Indeed, ChatGPT itself now supports this using <a href=\"https://openai.com/blog/chatgpt-plugins\">plugins</a>.</p>\n<h4
543
+ id=\"paul-graham-gpt\">Paul Graham GPT</h4>\n<p>However, it’s still useful
544
+ to see how to add ChatGPT functionality to your own content from scratch.
545
+ A nice example of this is <a href=\"https://paul-graham-gpt.vercel.app/\">Paul
546
+ Graham GPT</a> by <a href=\"https://twitter.com/mckaywrigley\">Mckay Wrigley</a>.
547
+ Mckay Wrigley took essays by Paul Graham (a well known venture capitalist)
548
+ and built a question and answer tool very like ChatGPT.</p>\n<iframe width=\"560\"
549
+ height=\"315\" src=\"https://www.youtube.com/embed/ii1jcLg-eIQ\" title=\"YouTube
550
+ video player\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write;
551
+ encrypted-media; gyroscope; picture-in-picture; web-share\" allowfullscreen></iframe>\n<p>Because
552
+ you can send a block of text to ChatGPT (as part of the prompt) you can get
553
+ ChatGPT to summarise or transform that information, or answer questions based
554
+ on that information. But there is a limit to how much information you can
555
+ pack into a prompt. You can’t put all of Paul Graham’s essays into a prompt
556
+ for example. So a solution is to do some preprocessing. For example, given
557
+ a question such as “How do I start a startup?” we could first find the essays
558
+ that are most relevant to this question, then use them to create a prompt
559
+ for ChatGPT. A quick and dirty way to do this is simply to do a text search over
560
+ the essays and take the top hits. But we aren’t searching for words, we are
561
+ searching for answers to a question. The essay with the best answer might
562
+ not include the phrase “How do I start a startup?”.</p>\n<h4 id=\"semantic-search\">Semantic
563
+ search</h4>\n<p>Enter <a href=\"https://en.wikipedia.org/wiki/Semantic_search\">Semantic
564
+ search</a>. The key concept behind semantic search is that we are looking
565
+ for documents with similar meaning, not just similarity of text. One approach
566
+ to this is to represent documents by “embeddings”, that is, a vector of numbers
567
+ that encapsulate features of the document. Documents with similar vectors
568
+ are potentially related. In semantic search we take the query (e.g., “How
569
+ do I start a startup?”), compute its embedding, then search among the documents
570
+ for those with similar embeddings.</p>\n<p>To create Paul Graham GPT Mckay
571
+ Wrigley did the following. First he sent each essay to the OpenAI API underlying
572
+ ChatGPT, and in return he got the embedding for that essay (a vector of 1536
573
+ numbers). Each embedding was stored in a database (Mckay uses Postgres with
574
+ <a href=\"https://github.com/pgvector/pgvector\">pgvector</a>). When a user
575
+ enters a query such as “How do I start a startup?” that query is also sent
576
+ to the OpenAI API to retrieve its embedding vector. Then we query the database
577
+ of embeddings for Paul Graham’s essays and take the top five hits. These hits
578
+ are, one hopes, the most likely to contain relevant answers. The original
579
+ question and the most similar essays are then bundled up and sent to ChatGPT
580
+ which then synthesises an answer. See his <a href=\"https://github.com/mckaywrigley/paul-graham-gpt\">GitHub
581
+ repo</a> for more details. Note that we are still using ChatGPT, but on a
582
+ set of documents it doesn’t already have.</p>\n<h4 id=\"knowledge-graphs\">Knowledge
583
+ graphs</h4>\n<p>I’m a fan of knowledge graphs, but they are not terribly easy
584
+ to use. For example, I built a knowledge graph of Australian animals <a href=\"https://ozymandias-demo.herokuapp.com\">Ozymandias</a>
585
+ that contains a wealth of information on taxa, publications, and people, wrapped
586
+ up in a web site. If you want to learn more you need to figure out how to
587
+ write queries in SPARQL, which is not fun. Maybe we could use ChatGPT to write
588
+ the SPARQL queries for us, but it would be much more fun to be simply ask
589
+ natural language queries (e.g., “who are the experts on Australian ants?”).
590
+ I made some naïve notes on these ideas <a href=\"https://iphylo.blogspot.com/2015/09/possible-project-natural-language.html\">Possible
591
+ project: natural language queries, or answering “how many species are there?”</a>
592
+ and <a href=\"https://iphylo.blogspot.com/2019/05/ozymandias-meets-wikipedia-with-notes.html\">Ozymandias
593
+ meets Wikipedia, with notes on natural language generation</a>.</p>\n<p>Of
594
+ course, this is a well known problem. Tools such as <a href=\"http://rdf2vec.org\">RDF2vec</a>
595
+ can take RDF from a knowledge graph and create embeddings which could in turn
596
+ be used to support semantic search. But it seems to me that we could simplify
597
+ this process a bit by making use of ChatGPT.</p>\n<p>Firstly we would generate
598
+ natural language statements from the knowledge graph (e.g., “species x belongs
599
+ to genus y and was described in z”, “this paper on ants was authored by x”,
600
+ etc.) that cover the basic questions we expect people to ask. We then get
601
+ embeddings for these (e.g., using OpenAI). We then have an interface where
602
+ people can ask a question (“is species x a valid species?”, “who has published
603
+ on ants”, etc.), we get the embedding for that question, retrieve natural
604
+ language statements that are closest in embedding “space”, package everything
605
+ up and ask ChatGPT to summarise the answer.</p>\n<p>The trick, of course,
606
+ is to figure out how to generate natural language statements from the knowledge
607
+ graph (which amounts to deciding what paths to traverse in the knowledge graph,
608
+ and how to write those paths as something approximating English). We also
609
+ want to know something about the sorts of questions people are likely to ask
610
+ so that we have a reasonable chance of having the answers (for example, are
611
+ people going to ask about individual species, or questions about summary statistics
612
+ such as numbers of species in a genus, etc.).</p>\n<p>What makes this attractive
613
+ is that it seems a straightforward way to go from a largely academic exercise
614
+ (build a knowledge graph) to something potentially useful (a question and
615
+ answer machine). Imagine if something like the defunct BBC wildlife site (see
616
+ <a href=\"https://iphylo.blogspot.com/2017/12/blue-planet-ii-bbc-and-semantic-web.html\">Blue
617
+ Planet II, the BBC, and the Semantic Web: a tale of lessons forgotten and
618
+ opportunities lost</a>) revived <a href=\"https://aspiring-look.glitch.me\">here</a>
619
+ had a question and answer interface where we could ask questions rather than
620
+ passively browse.</p>\n<h4 id=\"summary\">Summary</h4>\n<p>I have so much
621
+ more to learn, and need to think about ways to incorporate semantic search
622
+ and ChatGPT-like tools into knowledge graphs.</p>\n<blockquote>\n<p>Written
623
+ with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/rfxj3-x6739","uuid":"6a4d5c44-f4a9-4d40-a32c-a4d5e512c55a","url":"https://iphylo.blogspot.com/2022/05/thoughts-on-treebase-dying.html","title":"Thoughts
624
+ on TreeBASE dying(?)","summary":"@rvosa is Naturalis no longer hosting Treebase?
625
+ https://t.co/MBRgcxaBmR&mdash; Hilmar Lapp (@hlapp) May 10, 2022 So it looks
626
+ like TreeBASE is in trouble, its legacy Java code a victim of security issues.
627
+ Perhaps this is a chance to rethink TreeBASE, assuming that a repository of
628
+ published phylogenies is still considered a worthwhile thing to have (and
629
+ I think that question is open). Here''s what I think could be done. The data
630
+ (individual studies with trees and data) are packaged into...","date_published":"2022-05-11T16:53:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
631
+ (Roderic Page)"}],"image":null,"content_html":"<blockquote class=\"twitter-tweet\"><p
632
+ lang=\"en\" dir=\"ltr\"><a href=\"https://twitter.com/rvosa?ref_src=twsrc%5Etfw\">@rvosa</a>
633
+ is Naturalis no longer hosting Treebase? <a href=\"https://t.co/MBRgcxaBmR\">https://t.co/MBRgcxaBmR</a></p>&mdash;
634
+ Hilmar Lapp (@hlapp) <a href=\"https://twitter.com/hlapp/status/1524166490798309381?ref_src=twsrc%5Etfw\">May
635
+ 10, 2022</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
636
+ charset=\"utf-8\"></script>\n\n<p>So it looks like <a href=\"http://treebase.org\">TreeBASE</a>
637
+ is in trouble, its legacy Java code a victim of security issues. Perhaps
638
+ this is a chance to rethink TreeBASE, assuming that a repository of published
639
+ phylogenies is still considered a worthwhile thing to have (and I think that
640
+ question is open).</p>\n\n<p>Here''s what I think could be done.</p>\n\n<ol>\n<li>\nThe
641
+ data (individual studies with trees and data) are packaged into whatever format
642
+ is easiest (NEXUS, XML, JSON) and uploaded to a repository such as <a href=\"https://zenodo.org\">Zenodo</a>
643
+ for long term storage. They get DOIs for citability. This becomes the default
644
+ storage for TreeBASE.\n</li>\n<li>\nThe data is transformed into JSON and
645
+ indexed using Elasticsearch. A simple web interface is placed on top so that
646
+ people can easily find trees (never a strong point of the original TreeBASE).
647
+ Trees are displayed natively on the web using SVG. The number one goal is
648
+ for people to be able to find trees, view them, and download them.\n</li>\n<li>\nTo
649
+ add data to TreeBASE the easiest way would be for people to upload them direct
650
+ to Zenodo and tag them \"treebase\". A bot then grabs a feed of these datasets
651
+ and adds them to the search engine in (1) above. As time allows, add an interface
652
+ where people upload data directly, it gets curated, then deposited in Zenodo.
653
+ This presupposes that there are people available to do curation. Maybe have
654
+ \"stars\" for the level of curation so that users know whether anyone has
655
+ checked the data.\n</li>\n</ol>\n\n<p>There''s lots of details to tweak, for
656
+ example how many of the existing URLs for studies are preserved (some URL
657
+ mapping), and what about the API? And I''m unclear about the relationship
658
+ with <a href=\"https://datadryad.org\">Dryad</a>.</p>\n\n<p>My sense is that
659
+ the TreeBASE code is very much of its time (10-15 years ago), a monolithic
660
+ block of code with SQL, Java, etc. If one was starting from scratch today
661
+ I don''t think this would be the obvious solution. Things have trended towards
662
+ being simpler, with lots of building blocks now available in the cloud. Need
663
+ a search engine? Just spin up a container in the cloud and you have one. More
664
+ and more functionality can be devolved elsewhere.</p>\n\n<p>Another
665
+ issue is how to support TreeBASE. It has essentially been a volunteer effort
666
+ to date, with little or no funding. One reason I think having Zenodo as a
667
+ storage engine is that it takes care of long term sustainability of the data.</p>\n\n<p>I
668
+ realise that this is all wild arm waving, but maybe now is the time to reinvent
669
+ TreeBASE?</p>\n\n<h2>Updates</h2>\n\n<p>It''s been a while since I''ve paid
670
+ a lot of attention to phylogenetic databases, and it shows. There is a file-based
671
+ storage system for phylogenies <a href=\"https://github.com/OpenTreeOfLife/phylesystem-1\">phylesystem</a>
672
+ (see \"Phylesystem: a git-based data store for community-curated phylogenetic
673
+ estimates\" <a href=\"https://doi.org/10.1093/bioinformatics/btv276\">https://doi.org/10.1093/bioinformatics/btv276</a>)
674
+ that is sort of what I had in mind, although long term persistence is based
675
+ on GitHub rather than a repository such as Zenodo. Phylesystem uses a truly
676
+ horrible-looking JSON transformation of <a href=\"http://nexml.github.io\">NeXML</a>
677
+ (NeXML itself is ugly), and TreeBASE also supports NeXML, so some form of
678
+ NeXML or a JSON transformation seems the obvious storage format. It will probably
+ need some cleaning and simplification if it is to be indexed easily. Looking
+ back over the long history of TreeBASE and phylogenetic databases I''m struck
+ by how much complexity has been introduced over time. I think the tech has
+ gotten in the way sometimes (which might just be another way of saying that
+ I''m not smart enough to make sense of it all).</p>\n\n<p>So we could imagine
+ a search engine that covers both TreeBASE and <a href=\"https://tree.opentreeoflife.org/curator\">Open
+ Tree of Life studies</a>.</p>\n\n<p>Basic metadata-based searches would be
+ straightforward, and we could have a user interface that highlights the trees
+ (I think TreeBASE''s biggest search rival is a Google image search). The harder
+ problem is searching by tree structure, for which there is an interesting
+ literature without any decent implementations that I''m aware of (as I said,
+ I''ve been out of this field a while).</p>\n\n<p>So my instinct is we could
+ go a long way with simply indexing JSON (CouchDB or Elasticsearch), then need
+ to think a bit more cleverly about higher taxon and tree based searching.
+ I''ve always thought that one killer query would be not so much \"show me
+ all the trees for my taxon\" but \"show me a synthesis of the trees for my
+ taxon\". Imagine a supertree of recent studies that we could use as a summary
+ of our current knowledge, or a visualisation that summarises where there are
+ conflicts among the trees.</p>\n\n<h3>Relevant code and sites</h3>\n\n<ul>\n<li><a
+ href=\"https://github.com/rdmpage/cdaotools\">CDAO Tools</a>, see \"CDAO-Store:
+ Ontology-driven Data Integration for Phylogenetic Analysis\" <a href=\"https://doi.org/10.1186/1471-2105-12-98\">https://doi.org/10.1186/1471-2105-12-98</a></li>\n<li><a
+ href=\"https://github.com/NESCent/phylocommons\">PhyloCommons</a></li>\n</ul>","tags":["phylogeny","TreeBASE"],"language":"en","references":null},{"id":"https://doi.org/10.59350/jzvs4-r9559","uuid":"23fa1dd8-5c6b-4aa9-9cad-c6f6b14ae9e0","url":"https://iphylo.blogspot.com/2021/08/json-ld-in-wild-examples-of-how.html","title":"JSON-LD
+ in the wild: examples of how structured data is represented on the web","summary":"I''ve
+ created a GitHub repository so that I can keep track of the examples of JSON-LD
+ that I''ve seen being actively used, for example embedded in web sites, or
+ accessed using an API. The repository is https://github.com/rdmpage/wild-json-ld.
+ The list is by no means exhaustive; I hope to add more examples as I come
+ across them. One reason for doing this is to learn what others are doing.
+ For example, after looking at SciGraph''s JSON-LD I now see how an ordered
+ list can be modelled in RDF in...","date_published":"2021-08-27T13:20:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
+ (Roderic Page)"}],"image":null,"content_html":"<p>I''ve created a GitHub repository
+ so that I can keep track of the examples of JSON-LD that I''ve seen being
+ actively used, for example embedded in web sites, or accessed using an API.
+ The repository is <a href=\"https://github.com/rdmpage/wild-json-ld\">https://github.com/rdmpage/wild-json-ld</a>.
+ The list is by no means exhaustive; I hope to add more examples as I come
+ across them.</p>\n\n<p>One reason for doing this is to learn what others are
+ doing. For example, after looking at SciGraph''s JSON-LD I now see how an
+ ordered list can be modelled in RDF in such a way that the list of authors
+ in a JSON-LD document for, say a scientific paper, is correct. By default
+ RDF has no notion of ordered lists, so if you do a SPARQL query to get the
+ authors of a paper, the order of the authors returned in the query will be
+ arbitrary. There are various ways to try and tackle this. In my Ozymandias
+ knowledge graph I used \"roles\" to represent order (see <a href=\"https://doi.org/10.7717/peerj.6739/fig-2\">Figure
+ 2</a> in the Ozymandias paper). I then used properties of the role to order
+ the list of authors.</p>\n\n<p>Another approach is to use rdf:lists (see <a
+ href=\"http://www.snee.com/bobdc.blog/2014/04/rdf-lists-and-sparql.html\">RDF
+ lists and SPARQL</a> and <a href=\"https://stackoverflow.com/questions/17523804/is-it-possible-to-get-the-position-of-an-element-in-an-rdf-collection-in-sparql/17530689#17530689\">Is
+ it possible to get the position of an element in an RDF Collection in SPARQL?</a>
+ for an introduction to lists). SciGraph uses this approach. The value for
+ schema:author is not an author, but a blank node (bnode), and this bnode has
+ two predicates, rdf:first and rdf:rest. One points to an author, the other
+ points to another bnode. This pattern repeats until we encounter a value of
+ rdf:nil for rdf:rest.</p>\n\n<div class=\"separator\" style=\"clear: both;\"><a
+ href=\"https://1.bp.blogspot.com/-AESgWub1ZLQ/YSjoeo6O41I/AAAAAAAAgwg/5Edm7ZmuwL8NwxCcBvTqbI7js5nYmgggwCLcBGAsYHQ/s629/list.png\"
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
+ border=\"0\" height=\"320\" data-original-height=\"629\" data-original-width=\"401\"
+ src=\"https://1.bp.blogspot.com/-AESgWub1ZLQ/YSjoeo6O41I/AAAAAAAAgwg/5Edm7ZmuwL8NwxCcBvTqbI7js5nYmgggwCLcBGAsYHQ/s320/list.png\"/></a></div>\n\n<p>This
+ introduces some complexity, but the benefit is that the JSON-LD version of
+ the RDF will have the authors in the correct order, and hence any client that
+ is using JSON will be able to treat the array of authors as ordered. Without
+ some means of ordering the client could not make this assumption, hence the
+ first author in the list might not actually be the first author of the paper.</p>","tags":["JSON-LD","RDF"],"language":"en","references":null},{"id":"https://doi.org/10.59350/zc4qc-77616","uuid":"30c78d9d-2e50-49db-9f4f-b3baa060387b","url":"https://iphylo.blogspot.com/2022/09/does-anyone-cite-taxonomic-treatments.html","title":"Does
+ anyone cite taxonomic treatments?","summary":"Taxonomic treatments have come
+ up in various discussions I''m involved in, and I''m curious as to whether
+ they are actually being used, in particular, whether they are actually being
+ cited. Consider the following quote: The taxa are described in taxonomic treatments,
+ well defined sections of scientific publications (Catapano 2019). They include
+ a nomenclatural section and one or more sections including descriptions, material
+ citations referring to studied specimens, or notes ecology and...","date_published":"2022-09-01T16:49:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
+ (Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
+ both;\"><a href=\"https://zenodo.org/record/5731100/thumb100\" style=\"display:
+ block; padding: 1em 0; text-align: center; clear: right; float: right;\"><img
+ alt=\"\" border=\"0\" height=\"128\" data-original-height=\"106\" data-original-width=\"100\"
+ src=\"https://zenodo.org/record/5731100/thumb250\"/></a></div>\nTaxonomic
+ treatments have come up in various discussions I''m involved in, and I''m
+ curious as to whether they are actually being used, in particular, whether
+ they are actually being cited. Consider the following quote:\n\n<blockquote>\nThe
+ taxa are described in taxonomic treatments, well defined sections of scientific
+ publications (Catapano 2019). They include a nomenclatural section and one
+ or more sections including descriptions, material citations referring to studied
+ specimens, or notes ecology and behavior. In case the treatment does not describe
+ a new discovered taxon, previous treatments are cited in the form of treatment
+ citations. This citation can refer to a previous treatment and add additional
+ data, or it can be a statement synonymizing the taxon with another taxon.
+ This allows building a citation network, and ultimately is a constituent part
+ of the catalogue of life. - Taxonomic Treatments as Open FAIR Digital Objects
+ <a href=\"https://doi.org/10.3897/rio.8.e93709\">https://doi.org/10.3897/rio.8.e93709</a>\n</blockquote>\n\n<p>\n
+ \"Traditional\" academic citation is from article to article. For example,
+ consider these two papers:\n\n<blockquote>\nLi Y, Li S, Lin Y (2021) Taxonomic
+ study on fourteen symphytognathid species from Asia (Araneae, Symphytognathidae).
+ ZooKeys 1072: 1-47. https://doi.org/10.3897/zookeys.1072.67935\n</blockquote>\n\n<blockquote>\nMiller
+ J, Griswold C, Yin C (2009) The symphytognathoid spiders of the Gaoligongshan,
+ Yunnan, China (Araneae: Araneoidea): Systematics and diversity of micro-orbweavers.
+ ZooKeys 11: 9-195. https://doi.org/10.3897/zookeys.11.160\n</blockquote>\n</p>\n\n<p>Li
+ et al. 2021 cites Miller et al. 2009 (although Pensoft seems to have broken
+ the citation such that it does not appear correctly either on their web page or
+ in CrossRef).</p>\n\n<p>So, we have this link: [article]10.3897/zookeys.1072.67935
+ --cites--> [article]10.3897/zookeys.11.160. One article cites another.</p>\n\n<p>In
+ their 2021 paper Li et al. discuss <i>Patu jidanweishi</i> Miller, Griswold
+ & Yin, 2009:\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPavMHXqQNX1ls_zXo9kIcMLHPxc7ZpV9NCSof5wLumrg3ovPoi6nYKzZsINuqFtYoEvW1QrerePD-MEf2DJaYUXlT37d81x3L6ILls7u229rg0_Nc0uUmgW-ICzr6MI_QCZfgQbYGTxuofu-fuPVoygbCnm3vQVYOhLDLtp1EtQ9jRZHDvw/s1040/Screenshot%202022-09-01%20at%2017.12.27.png\"
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
+ border=\"0\" width=\"400\" data-original-height=\"314\" data-original-width=\"1040\"
+ src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPavMHXqQNX1ls_zXo9kIcMLHPxc7ZpV9NCSof5wLumrg3ovPoi6nYKzZsINuqFtYoEvW1QrerePD-MEf2DJaYUXlT37d81x3L6ILls7u229rg0_Nc0uUmgW-ICzr6MI_QCZfgQbYGTxuofu-fuPVoygbCnm3vQVYOhLDLtp1EtQ9jRZHDvw/s400/Screenshot%202022-09-01%20at%2017.12.27.png\"/></a></div>\n\n<p>There
+ is a treatment for the original description of <i>Patu jidanweishi</i> at
+ <a href=\"https://doi.org/10.5281/zenodo.3792232\">https://doi.org/10.5281/zenodo.3792232</a>,
+ which was created by Plazi with a time stamp \"2020-05-06T04:59:53.278684+00:00\".
+ The original publication date was 2009, the treatments are being added retrospectively.</p>\n\n<p>In
+ an ideal world my expectation would be that Li et al. 2021 would have cited
+ the treatment, instead of just providing the text string \"Patu jidanweishi
+ Miller, Griswold & Yin, 2009: 64, figs 65A–E, 66A, B, 67A–D, 68A–F, 69A–F,
+ 70A–F and 71A–F (♂♀).\" Isn''t the expectation under the treatment model that
+ we would have seen this relationship:</p>\n\n<p>[article]10.3897/zookeys.1072.67935
+ --cites--> [treatment]https://doi.org/10.5281/zenodo.3792232</p>\n\n<p>Furthermore,
+ if it is the case that \"[i]n case the treatment does not describe a new discovered
+ taxon, previous treatments are cited in the form of treatment citations\"
+ then we should also see a citation between treatments, in other words Li et
+ al.''s 2021 treatment of <i>Patu jidanweishi</i> (which doesn''t seem to have
+ a DOI but is available on Plazi''s web site as <a href=\"https://tb.plazi.org/GgServer/html/1CD9FEC313A35240938EC58ABB858E74\">https://tb.plazi.org/GgServer/html/1CD9FEC313A35240938EC58ABB858E74</a>)
+ should also cite the original treatment? It doesn''t - but it does cite the
+ Miller et al. paper.</p>\n\n<p>So in this example we don''t see articles citing
+ treatments, nor do we see treatments citing treatments. Playing Devil''s advocate,
+ why then do we have treatments? Doesn''t the lack of citations suggest that
+ - despite some taxonomists saying this is the unit that matters - they actually
+ don''t? If we pay attention to what people do rather than what they say they
+ do, they cite articles.</p>\n\n<p>Now, there are all sorts of reasons why
+ we don''t see [article] -> [treatment] citations, or [treatment] -> [treatment]
+ citations. Treatments are being added after the fact by Plazi, not by the
+ authors of the original work. And in many cases the treatments that could
+ be cited haven''t appeared until after that potentially citing work was published.
+ In the example above the Miller et al. paper dates from 2009, but the treatment
+ extracted only went online in 2020. And while there is a long standing culture
+ of citing publications (ideally using DOIs) there isn''t an equivalent culture
+ of citing treatments (beyond the simple text strings).</p>\n\n<p>Obviously
+ this is but one example. I''d need to do some exploration of the citation
+ graph to get a better sense of citation patterns, perhaps using <a href=\"https://www.crossref.org/documentation/event-data/\">CrossRef''s
+ event data</a>. But my sense is that taxonomists don''t cite treatments.</p>\n\n<p>I''m
+ guessing Plazi would respond by saying treatments are cited, for example (indirectly)
+ in GBIF downloads. This is true, although arguably people aren''t citing the
+ treatment, they''re citing specimen data in those treatments, and that specimen
+ data could be extracted at the level of articles rather than treatments. In
+ other words, it''s not the treatments themselves that people are citing.</p>\n\n<p>To
+ be clear, I think there is value in being able to identify those \"well defined
+ sections\" of a publication that deal with a given taxon (i.e., treatments),
+ but it''s not clear to me that these are actually the citable units people
+ might hope them to be. Likewise, journals such as <i>ZooKeys</i> have DOIs
+ for individual figures. Does anyone actually cite those?</p>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/en7e9-5s882","uuid":"20b9d31e-513f-496b-b399-4215306e1588","url":"https://iphylo.blogspot.com/2022/04/obsidian-markdown-and-taxonomic-trees.html","title":"Obsidian,
+ markdown, and taxonomic trees","summary":"Returning to the subject of personal
+ knowledge graphs Kyle Scheer has an interesting repository of Markdown files
+ that describe academic disciplines at https://github.com/kyletscheer/academic-disciplines
+ (see his blog post for more background). If you add these files to Obsidian
+ you get a nice visualisation of a taxonomy of academic disciplines. The applications
+ of this to biological taxonomy seem obvious, especially as a tool like Obsidian
+ enables all sorts of interesting links to be added...","date_published":"2022-04-07T21:07:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
+ (Roderic Page)"}],"image":null,"content_html":"<p>Returning to the subject
+ of <a href=\"https://iphylo.blogspot.com/2020/08/personal-knowledge-graphs-obsidian-roam.html\">personal
+ knowledge graphs</a> Kyle Scheer has an interesting repository of Markdown
+ files that describe academic disciplines at <a href=\"https://github.com/kyletscheer/academic-disciplines\">https://github.com/kyletscheer/academic-disciplines</a>
+ (see <a href=\"https://kyletscheer.medium.com/on-creating-a-tree-of-knowledge-f099c1028bf6\">his
+ blog post</a> for more background).</p>\n\n<p>If you add these files to <a
+ href=\"https://obsidian.md/\">Obsidian</a> you get a nice visualisation of
+ a taxonomy of academic disciplines. The applications of this to biological
+ taxonomy seem obvious, especially as a tool like Obsidian enables all sorts
+ of interesting links to be added (e.g., we could add links to the taxonomic
+ research behind each node in the taxonomic tree, the people doing that research,
+ etc. - although that would mean we''d no longer have a simple tree).</p>\n\n<p>The
+ more I look at these sort of simple Markdown-based tools the more I wonder
+ whether we could make more use of them to create simple but persistent databases.
+ Text files seem the most stable, long-lived digital format around, maybe this
+ would be a way to minimise the inevitable obsolescence of database and server
+ software. Time for some experiments I feel... can we take a taxonomic group,
+ such as mammals, and create a richly connected database purely in Markdown?</p>\n\n<div
+ class=\"separator\" style=\"clear: both; text-align: center;\"><iframe allowfullscreen=''allowfullscreen''
+ webkitallowfullscreen=''webkitallowfullscreen'' mozallowfullscreen=''mozallowfullscreen''
+ width=''400'' height=''322'' src=''https://www.blogger.com/video.g?token=AD6v5dy3Sa_SY_MJCZYYCT-bAGe9QD1z_V0tkE0qM5FaQJfAEgGOoHtYPATsNNbBvTEh_tHOZ83nMGzpYRg''
+ class=''b-hbp-video b-uploaded'' frameborder=''0''></iframe></div>","tags":["markdown","obsidian"],"language":"en","references":null},{"id":"https://doi.org/10.59350/m7gb7-d7c49","uuid":"7d814863-43b5-4faf-a475-da8de5efd3ef","url":"https://iphylo.blogspot.com/2022/02/duplicate-dois-again.html","title":"Duplicate
+ DOIs (again)","summary":"This blog post provides some background to a recent
+ tweet where I expressed my frustration about the duplication of DOIs for the
+ same article. I''m going to document the details here. The DOI that alerted
+ me to this problem is https://doi.org/10.2307/2436688 which is for the article
+ Snyder, W. C., & Hansen, H. N. (1940). THE SPECIES CONCEPT IN FUSARIUM. American
+ Journal of Botany, 27(2), 64–67. This article is hosted by JSTOR at https://www.jstor.org/stable/2436688
+ which displays the DOI...","date_published":"2022-02-08T15:06:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
+ (Roderic Page)"}],"image":null,"content_html":"<p>This blog post provides
+ some background to a <a href=\"https://twitter.com/rdmpage/status/1491023036199600132\">recent
+ tweet</a> where I expressed my frustration about the duplication of DOIs for
+ the same article. I''m going to document the details here.</p>\n\n<p>The DOI
+ that alerted me to this problem is <a href=\"https://doi.org/10.2307/2436688\">https://doi.org/10.2307/2436688</a>
+ which is for the article</p>\n\n<blockquote>\nSnyder, W. C., & Hansen, H.
+ N. (1940). THE SPECIES CONCEPT IN FUSARIUM. American Journal of Botany, 27(2),
+ 64–67.\n</blockquote>\n\n<p>This article is hosted by JSTOR at <a href=\"https://www.jstor.org/stable/2436688\">https://www.jstor.org/stable/2436688</a>
+ which displays the DOI <a href=\"https://doi.org/10.2307/2436688\">https://doi.org/10.2307/2436688</a>
+ .</p>\n\n<p>This same article is also hosted by Wiley at <a href=\"https://bsapubs.onlinelibrary.wiley.com/doi/abs/10.1002/j.1537-2197.1940.tb14217.x\">https://bsapubs.onlinelibrary.wiley.com/doi/abs/10.1002/j.1537-2197.1940.tb14217.x</a>
+ with the DOI <a href=\"https://doi.org/10.1002/j.1537-2197.1940.tb14217.x\">https://doi.org/10.1002/j.1537-2197.1940.tb14217.x</a>.</p>\n\n<h2>Expected
+ behaviour</h2>\n\n<p>What should happen is if Wiley is going to be the publisher
+ of this content (taking over from JSTOR), the DOI <b>10.2307/2436688</b> should
+ be redirected to the Wiley page, and the Wiley page displays this DOI (i.e.,
+ <b>10.2307/2436688</b>). If I want to get metadata for this DOI, I should
+ be able to use CrossRef''s API to retrieve that metadata, e.g. <a href=\"https://api.crossref.org/v1/works/10.2307/2436688\">https://api.crossref.org/v1/works/10.2307/2436688</a>
+ should return metadata for the article.</p>\n\n<h2>What actually happens</h2>\n\n<p>Wiley
+ display the same article on their web site with the DOI <b>10.1002/j.1537-2197.1940.tb14217.x</b>.
+ They have minted a new DOI for the same article! The original JSTOR DOI now
+ resolves to the Wiley page (you can see this using the <a href=\"https://hdl.handle.net\">Handle
+ Resolver</a>), which is what is supposed to happen. However, Wiley should
+ have reused the original DOI rather than mint their own.</p>\n\n<p>Furthermore,
+ while the original DOI still resolves in a web browser, I can''t retrieve
+ metadata about that DOI from CrossRef, so any attempt to build upon that DOI
+ fails. However, I can retrieve metadata for the Wiley DOI, i.e. <a href=\"https://api.crossref.org/v1/works/10.1002/j.1537-2197.1940.tb14217.x\">https://api.crossref.org/v1/works/10.1002/j.1537-2197.1940.tb14217.x</a>
+ works, but <a href=\"https://api.crossref.org/v1/works/10.2307/2436688\">https://api.crossref.org/v1/works/10.2307/2436688</a>
+ doesn''t.</p>\n\n<h2>Why does this matter?</h2>\n\n<p>For anyone using DOIs
+ as stable links to the literature the persistence of DOIs is something you
+ should be able to rely upon, both for people clicking on links in web browsers
+ and developers getting metadata from those DOIs. The whole rationale of the
+ DOI system is a single, globally unique identifier for each article, and that
+ these DOIs persist even when the publisher of the content changes. If this
+ property doesn''t hold, then why would a developer such as myself invest effort
+ in linking using DOIs?</p>\n\n<p>Just for the record, I think CrossRef is
+ great and is a hugely important part of the scholarly landscape. There are
+ lots of things that I do that would be nearly impossible without CrossRef
+ and its tools. But cases like this where we get massive duplication of DOIs
+ when a publisher takes over an existing journal fundamentally breaks the
+ underlying model of stable, persistent identifiers.</p>","tags":["CrossRef","DOI","duplicates"],"language":"en","references":null},{"id":"https://doi.org/10.59350/d3dc0-7an69","uuid":"545c177f-cea5-4b79-b554-3ccae9c789d7","url":"https://iphylo.blogspot.com/2021/10/reflections-on-macroscope-tool-for-21st.html","title":"Reflections
+ on \"The Macroscope\" - a tool for the 21st Century?","summary":"This is a
+ guest post by Tony Rees. It would be difficult to encounter a scientist, or
+ anyone interested in science, who is not familiar with the microscope, a tool
+ for making objects visible that are otherwise too small to be properly seen
+ by the unaided eye, or to reveal otherwise invisible fine detail in larger
+ objects. A select few with a particular interest in microscopy may also have
+ encountered the Wild-Leica \"Macroscope\", a specialised type of benchtop
+ microscope optimised for...","date_published":"2021-10-07T12:38:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
+ (Roderic Page)"}],"image":null,"content_html":"<p><img src=\"https://lh3.googleusercontent.com/-A99btr6ERMs/Vl1Wvjp2OtI/AAAAAAAAEFI/7bKdRjNG5w0/ytNkVT2U.jpg?imgmax=800\"
+ alt=\"YtNkVT2U\" title=\"ytNkVT2U.jpg\" border=\"0\" width=\"128\" height=\"128\"
+ style=\"float:right;\" /> This is a guest post by <a href=\"https://about.me/TonyRees\">Tony
+ Rees</a>.</p>\n\n<p>It would be difficult to encounter a scientist, or anyone
+ interested in science, who is not familiar with the microscope, a tool for
+ making objects visible that are otherwise too small to be properly seen by
+ the unaided eye, or to reveal otherwise invisible fine detail in larger objects.
+ A select few with a particular interest in microscopy may also have encountered
+ the Wild-Leica \"Macroscope\", a specialised type of benchtop microscope optimised
+ for low-power macro-photography. However in this overview I discuss the \"Macroscope\"
+ in a different sense, which is that of the antithesis to the microscope: namely
+ a method for visualizing subjects too large to be encompassed by a single
+ field of vision, such as the Earth or some subset of its phenomena (the biosphere,
+ for example), or conceptually, the universe.</p>\n\n<p><div class=\"separator\"
+ style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-xXxH9aJn1tY/YV7vnKGQWPI/AAAAAAAAgyQ/7CSJJ663Ry4IXtav54nLcLiI0kda5_L7ACLcBGAsYHQ/s500/2020045672.jpg\"
+ style=\"display: block; padding: 1em 0; text-align: center; clear: right;
+ float: right;\"><img alt=\"\" border=\"0\" height=\"320\" data-original-height=\"500\"
+ data-original-width=\"303\" src=\"https://1.bp.blogspot.com/-xXxH9aJn1tY/YV7vnKGQWPI/AAAAAAAAgyQ/7CSJJ663Ry4IXtav54nLcLiI0kda5_L7ACLcBGAsYHQ/s320/2020045672.jpg\"/></a></div>My
+ introduction to the term was via addresses given by Jesse Ausubel in the formative
+ years of the 2001-2010 <a href=\"http://www.coml.org\">Census of Marine Life</a>,
+ for which he was a key proponent. In Ausubel''s view, the Census would perform
+ the function of a macroscope, permitting a view of everything that lives in
+ the global ocean (or at least, that subset which could realistically be sampled
+ in the time frame available) as opposed to more limited subsets available
+ via previous data collection efforts. My view (which could, of course, be
+ wrong) was that his thinking had been informed by a work entitled \"Le macroscope,
+ vers une vision globale\" published in 1975 by the French thinker Joël de
+ Rosnay, who had expressed such a concept as being globally applicable in many
+ fields, including the physical and natural worlds but also extending to human
+ society, the growth of cities, and more. Yet again, some ecologists may also
+ have encountered the term, sometimes in the guise of \"Odum''s macroscope\",
+ as an approach for obtaining \"big picture\" analyses of macroecological processes
+ suitable for mathematical modelling, typically by elimination of fine detail
+ so that only the larger patterns remain, as initially advocated by Howard
+ T. Odum in his 1971 book \"Environment, Power, and Society\".</p>\n\n<p>From
+ the standpoint of the 21st century, it seems that we are closer to achieving
+ a \"macroscope\" (or possibly, multiple such tools) than ever before, based
+ on the availability of existing and continuing new data streams, improved
+ technology for data assembly and storage, and advanced ways to query and combine
+ these large streams of data to produce new visualizations, data products,
+ and analytical findings. I devote the remainder of this article to examples
+ where either particular workers have employed \"macroscope\" terminology to
+ describe their activities, or where potentially equivalent actions are taking
+ place without the explicit \"macroscope\" association, but are equally worthy
+ of consideration. To save space here, references cited here (most or all)
+ can be found via a Wikipedia article entitled \"<a href=\"https://en.wikipedia.org/wiki/Macroscope_(science_concept)\">Macroscope
+ (science concept)</a>\" that I authored on the subject around a year ago,
+ and have continued to add to on occasion as new thoughts or information come
+ to hand (see <a href=\"https://en.wikipedia.org/w/index.php?title=Macroscope_(science_concept)&offset=&limit=500&action=history\">edit
+ history for the article</a>).</p>\n\n<p>First, one can ask, what constitutes
+ a macroscope, in the present context? In the Wikipedia article I point to
+ a book \"Big Data - Related Technologies, Challenges and Future Prospects\"
+ by Chen <em>et al.</em> (2014) (<a href=\"https://doi.org/10.1007/978-3-319-06245-7\">doi:10.1007/978-3-319-06245-7</a>),
+ in which the \"value chain of big data\" is characterised as divisible into
+ four phases, namely data generation, data acquisition (aka data assembly),
+ data storage, and data analysis. To my mind, data generation (which others
+ may term acquisition, differently from the usage by Chen <em>et al.</em>)
+ is obviously the first step, but does not in itself constitute the macroscope,
+ except in rare cases - such as Landsat imagery, perhaps - where on its own,
+ a single co-ordinated data stream is sufficient to meet the need for a particular
+ type of \"global view\". A variant of this might be a coordinated data collection
+ program - such as that of the ten year Census of Marine Life - which might
+ produce the data required for the desired global view; but again, in reality,
+ such data are collected in a series of discrete chunks, in many and often
+ disparate data formats, and must be \"wrangled\" into a more coherent whole
+ before any meaningful \"macroscope\" functionality becomes available.</p>\n\n<p>Here
+ we come to what, in my view, constitutes the heart of the \"macroscope\":
+ an intelligently organized (i.e. indexable and searchable), coherent data
+ store or repository (where \"data\" may include imagery and other non numeric
+ data forms, but much else besides). Taking the Census of Marine Life example,
+ the data repository for that project''s data (plus other available sources
+ as inputs) is the <a href=\"https://obis.org\">Ocean Biodiversity Information
+ System</a> or OBIS (previously the Ocean Biogeographic Information System),
+ which according to this view forms the \"macroscope\" for which the Census
+ data is a feed. (For non habitat-specific biodiversity data, <a href=\"https://www.gbif.org\">GBIF</a>
+ is an equivalent, and more extensive, operation). Other planetary scale \"macroscopes\",
+ by this definition (which may or may not have an explicit geographic, i.e.
+ spatial, component) would include inventories of biological taxa such as the
+ <a href=\"https://www.catalogueoflife.org\">Catalogue of Life</a> and so on,
+ all the way back to the pioneering compendia published by Linnaeus in the
+ eighteenth century; while for cartography and topographic imagery, the current
+ \"blockbuster\" of <a href=\"http://earth.google.com\">Google Earth</a> and
+ its predecessors also come well into public consciousness.</p>\n\n<p>In the
+ view of some workers and/or operations, both of these phases are precursors
+ to the real \"work\" of the macroscope which is to reveal previously unseen
+ portions of the \"big picture\" by means either of the availability of large,
+ synoptic datasets, or fusion between different data streams to produce novel
+ insights. Companies such as IBM and Microsoft have used phraseology such as:</p>\n\n<blockquote>By
+ 2022 we will use machine-learning algorithms and software to help us organize
+ information about the physical world, helping bring the vast and complex data
+ gathered by billions of devices within the range of our vision and understanding.
+ We call this a \"macroscope\" – but unlike the microscope to see the very
+ small, or the telescope that can see far away, it is a system of software
+ and algorithms to bring all of Earth''s complex data together to analyze it
+ by space and time for meaning.\" (IBM)</blockquote>\n\n<blockquote>As the
+ Earth becomes increasingly instrumented with low-cost, high-bandwidth sensors,
+ we will gain a better understanding of our environment via a virtual, distributed
+ whole-Earth \"macroscope\"... Massive-scale data analytics will enable real-time
+ tracking of disease and targeted responses to potential pandemics. Our virtual
+ \"macroscope\" can now be used on ourselves, as well as on our planet.\" (Microsoft)
+ (references available via the Wikipedia article cited above).</blockquote>\n\n<p>Whether
+ or not the analytical capabilities described here are viewed as being an integral
+ part of the \"macroscope\" concept, or maybe an add-on, is ultimately
+ a question of semantics and perhaps, personal opinion. Continuing the Census
+ of Marine Life/OBIS example, OBIS offers some (arguably rather basic) visualization
+ and summary tools, but also makes its data available for download to users
+ wishing to analyse it further according to their own particular interests;
+ using OBIS data in this manner, Mark Costello et al. in 2017 were able to
+ demarcate a finite number of data-supported marine biogeographic realms for
+ the first time (Costello et al. 2017: Nature Communications. 8: 1057. <a href=\"https://doi.org/10.1038/s41467-017-01121-2\">doi:10.1038/s41467-017-01121-2</a>),
+ a project which I was able to assist in a small way in an advisory capacity.
+ In a case such as this, perhaps the final function of the macroscope, namely
+ data visualization and analysis, was outsourced to the authors'' own research
+ institution. Similarly at an earlier phase, \"data aggregation\" can also
+ be virtual rather than actual, i.e. avoiding using a single physical system
+ to hold all the data, enabled by open web mapping standards WMS (web map service)
+ and WFS (web feature service) to access a set of distributed data stores,
+ e.g. as implemented on the portal for the <a href=\"https://portal.aodn.org.au/\">Australian
+ Ocean Data Network</a>.</p>\n\n<p>So, as we pass through the third decade
+ of the twenty first century, what developments await us in the \"macroscope\"
+ area? In the biodiversity space, one can reasonably presume that the existing
+ \"macroscopic\" data assembly projects such as OBIS and GBIF will continue,
+ and hopefully slowly fill current gaps in their coverage - although in the
+ marine area, strategic new data collection exercises may be required (Census
+ 2020, or 2025, anyone?), while (again hopefully), the Catalogue of Life will
+ continue its progress towards a \"complete\" species inventory for the biosphere.
+ The Landsat project, with imagery dating back to 1972, continues with the
+ launch of its latest satellite Landsat 9 just this year (21 September 2021)
+ with a planned mission duration for the next 5 years, so the \"macroscope\"
+ functionality of that project seems set to continue for the medium term at
+ least. Meanwhile the ongoing development of sensor networks, both on land
+ and in the ocean, offers an exciting new method of \"instrumenting the earth\"
+ to obtain much more real time data than has ever been available in the past,
+ offering scope for many more, use case-specific \"macroscopes\" to be constructed
+ that can fuse (e.g.) satellite imagery with much more that is happening at
+ a local level.</p>\n\n<p>So, the \"macroscope\" concept appears to be alive
+ and well, even though the nomenclature can change from time to time (IBM''s
+ \"Macroscope\", foreshadowed in 2017, became the \"IBM Pairs Geoscope\" on
+ implementation, and is now simply the \"Geospatial Analytics component within
+ the IBM Environmental Intelligence Suite\" according to available IBM publicity
+ materials). In reality this illustrates a new dichotomy: even if \"everyone\"
+ in principle has access to huge quantities of publicly available data, maybe
+ only a few well-funded entities now have the computational ability to make
+ sense of it, and can charge clients a good fee for their services...</p>\n\n<p>I
+ present this account partly to give a brief picture of \"macroscope\" concepts
+ today and in the past, for those who may be interested, and partly to present
+ a few personal views which would be out of scope in a \"neutral point of view\"
+ article such as is required on Wikipedia; also to see if readers of this blog
+ would like to contribute further to discussion of any of the concepts traversed
+ herein.</p>","tags":["guest post","macroscope"],"language":"en","references":null},{"id":"https://doi.org/10.59350/2b1j9-qmw12","uuid":"37538c38-66e6-4ac4-ab5c-679684622ade","url":"https://iphylo.blogspot.com/2022/05/round-trip-from-identifiers-to.html","title":"Round
+ trip from identifiers to citations and back again","summary":"Note to self
+ (basically rewriting last year''s Finding citations of specimens). Bibliographic
+ data supports going from identifier to citation string and back again, so
+ we can do a \"round trip.\" 1. Given a DOI we can get structured data with
+ a simple HTTP fetch, then use a tool such as citation.js to convert that data
+ into a human-readable string in a variety of formats. Identifier ⟶ Structured
+ data ⟶ Human readable string 10.7717/peerj-cs.214 HTTP with...","date_published":"2022-05-27T16:34:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
+ (Roderic Page)"}],"image":null,"content_html":"<p>Note to self (basically
+ rewriting last year''s <a href=\"https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html\">Finding
+ citations of specimens</a>).</p>\n\n<p>Bibliographic data supports going from
+ identifier to citation string and back again, so we can do a \"round trip.\"</p>\n\n<h2>1.</h2>\n\n<p>Given
+ a DOI we can get structured data with a simple HTTP fetch, then use a tool
+ such as <a href=\"https://citation.js.org\">citation.js</a> to convert that
+ data into a human-readable string in a variety of formats.</p>\n\n<table>\n<tr>\n<th>\nIdentifier\n</th>\n<th>\n⟶\n</th>\n<th>\nStructured
+ data\n</th>\n<th>\n⟶\n</th>\n<th>\nHuman readable string\n</th>\n<tr>\n<tr>\n<td>\n10.7717/peerj-cs.214\n</td>\n<td>\nHTTP
+ with content-negotiation\n</td>\n<td>\nCSL-JSON\n</td>\n<td>\nCSL templates\n</td>\n<td
+ width=\"25%\">\nWillighagen, L. G. (2019). Citation.js: a format-independent,
+ modular bibliography tool for the browser and command line. PeerJ Computer
+ Science, 5, e214. https://doi.org/10.7717/peerj-cs.214\n</td>\n</tr>\n</table>\n\n<h2>2.</h2>\n\n<p>Going
+ in the reverse direction (string to identifier) is a little more challenging.
+ In the \"old days\" a typical strategy was to attempt to parse the citation
+ string into structured data (see <a href=\"https://anystyle.io\">AnyStyle</a>
+ for a nice example of this), then we could extract a tuple of (journal, volume,
+ starting page) and use that to query CrossRef to find if there was an article
+ with that tuple, which gave us the DOI.</p>\n\n<table>\n<tr>\n<th>\nIdentifier\n</th>\n<th>\n⟵\n</th>\n<th>\nStructured
+ data\n</th>\n<th>\n⟵\n</th>\n<th>\nHuman readable string\n</th>\n<tr>\n<tr>\n<td>\n10.7717/peerj-cs.214\n</td>\n<td>\nOpenURL
+ query\n</td>\n<td>\njournal, volume, start page\n</td>\n<td>\nCitation parser
+ \n</td>\n<td width=\"25%\">\nWillighagen, L. G. (2019). Citation.js: a format-independent,
+ modular bibliography tool for the browser and command line. PeerJ Computer
+ Science, 5, e214. https://doi.org/10.7717/peerj-cs.214\n</td>\n</tr>\n</table>\n\n<h2>3.</h2>\n\n<p>Another
+ strategy is to take all the citations strings for each DOI, index those in
+ a search engine, then just use a simple search to find the best match to your
+ citation string, and hence the DOI. This is what <a href=\"https://search.crossref.org\">https://search.crossref.org</a>
+ does.</p>\n\n<table>\n<tr>\n<th>\nIdentifier\n</th>\n<th>\n⟵\n</th>\n<th>\nHuman
+ readable string\n</th>\n<tr>\n<tr>\n<td>\n10.7717/peerj-cs.214\n</td>\n<td>\nsearch\n</td>\n<td
+ width=\"50%\">\nWillighagen, L. G. (2019). Citation.js: a format-independent,
+ modular bibliography tool for the browser and command line. PeerJ Computer
+ Science, 5, e214. https://doi.org/10.7717/peerj-cs.214\n</td>\n</tr>\n</table>\n\n<p>At
+ the moment my work on material citations (i.e., lists of specimens in taxonomic
+ papers) is focussing on 1 (generating citations from specimen data in GBIF)
+ and 2 (parsing citations into structured data).</p>","tags":["citation","GBIF","material
+ examined","specimen codes"],"language":"en","references":null},{"id":"https://doi.org/10.59350/3s376-6bm21","uuid":"62e7b438-67a3-44ac-a66d-3f5c278c949e","url":"https://iphylo.blogspot.com/2022/02/deduplicating-bibliographic-data.html","title":"Deduplicating
+ bibliographic data","summary":"There are several instances where I have a
+ collection of references that I want to deduplicate and merge. For example,
+ in Zootaxa has no impact factor I describe a dataset of the literature cited
+ by articles in the journal Zootaxa. This data is available on Figshare (https://doi.org/10.6084/m9.figshare.c.5054372.v4),
+ as is the equivalent dataset for Phytotaxa (https://doi.org/10.6084/m9.figshare.c.5525901.v1).
+ Given that the same articles may be cited many times, these datasets have
+ lots of...","date_published":"2022-02-03T15:09:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
+ (Roderic Page)"}],"image":null,"content_html":"<p>There are several instances
+ where I have a collection of references that I want to deduplicate and merge.
+ For example, in <a href=\"https://iphylo.blogspot.com/2020/07/zootaxa-has-no-impact-factor.html\">Zootaxa
+ has no impact factor</a> I describe a dataset of the literature cited by articles
+ in the journal <i>Zootaxa</i>. This data is available on Figshare (<a href=\"https://doi.org/10.6084/m9.figshare.c.5054372.v4\">https://doi.org/10.6084/m9.figshare.c.5054372.v4</a>),
+ as is the equivalent dataset for <i>Phytotaxa</i> (<a href=\"https://doi.org/10.6084/m9.figshare.c.5525901.v1\">https://doi.org/10.6084/m9.figshare.c.5525901.v1</a>).
+ Given that the same articles may be cited many times, these datasets have
+ lots of duplicates. Similarly, articles in <a href=\"https://species.wikimedia.org\">Wikispecies</a>
+ often have extensive lists of references cited, and the same reference may
+ appear on multiple pages (for an initial attempt to extract these references
+ see <a href=\"https://doi.org/10.5281/zenodo.5801661\">https://doi.org/10.5281/zenodo.5801661</a>
+ and <a href=\"https://github.com/rdmpage/wikispecies-parser\">https://github.com/rdmpage/wikispecies-parser</a>).</p>\n\n<p>There
+ are several reasons I want to merge these references. If I want to build a
+ citation graph for <i>Zootaxa</i> or <i>Phytotaxa</i> I need to merge references
+ that are the same so that I can accurately count citations. I am also interested
+ in harvesting the metadata to help find those articles in the <a href=\"https://www.biodiversitylibrary.org\">Biodiversity
+ Heritage Library</a> (BHL), and the literature cited section of scientific
+ articles is a potential goldmine of bibliographic metadata, as is Wikispecies.</p>\n\n<p>After
+ various experiments and false starts I''ve created a repository <a href=\"https://github.com/rdmpage/bib-dedup\">https://github.com/rdmpage/bib-dedup</a>
+ to host a series of PHP scripts to deduplicate bibliographics data. I''ve
+ settled on using CSL-JSON as the format for bibliographic data. Because deduplication
+ relies on comparing pairs of references, the standard format for most of the
+ scripts is a JSON array containing a pair of CSL-JSON objects to compare.
+ Below are the steps the code takes.</p>\n\n<h2>Generating pairs to compare</h2>\n\n<p>The
+ first step is to take a list of references and generate the pairs that will
+ be compared. I started with this approach as I wanted to explore machine learning
+ and wanted a simple format for training data, such as an array of two CSL-JSON
+ objects and an integer flag representing whether the two references were the
+ same or different.</p>\n\n<p>There are various ways to generate CSL-JSON for
+ a reference. I use a tool I wrote (see <a href=\"https://iphylo.blogspot.com/2021/07/citation-parsing-released.html\">Citation
+ parsing tool released</a>) that has a simple API where you parse one or more
+ references and it returns that reference as structured data in CSL-JSON.</p>\n\n<p>Attempting
+ to do all possible pairwise comparisons rapidly gets impractical as the number
+ of references increases, so we need some way to restrict the number of comparisons
+ we make. One approach I''ve explored is the “sorted neighbourhood method”
+ where we sort the references (for example by their title) then move a sliding
+ window down the list of references, comparing all references within that window.
+ This greatly reduces the number of pairwise comparisons. So the first step
+ is to sort the references, then run a sliding window over them, output all
+ the pairs in each window (ignoring pairwise comparisons already made in
+ a previous window). Other methods of \"blocking\" could also be used, such
+ as only including references in a particular year, or a particular journal.</p>\n\n<p>So,
+ the output of this step is a set of JSON arrays, each with a pair of references
+ in CSL-JSON format. Each array is stored on a single line in the same file
+ in <a href=\"https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON_2\">line-delimited
+ JSON</a> (JSONL).</p>\n\n<h2>Comparing pairs</h2>\n\n<p>The next step is to
+ compare each pair of references and decide whether they are a match or not.
+ Initially I explored a machine learning approach used in the following paper:</p>\n\n<blockquote>\nWilson
+ DR. 2011. Beyond probabilistic record linkage: Using neural networks and complex
+ features to improve genealogical record linkage. In: The 2011 International
+ Joint Conference on Neural Networks. 9–14. DOI: <a href=\"https://doi.org/10.1109/IJCNN.2011.6033192\">10.1109/IJCNN.2011.6033192</a>\n</blockquote>\n\n<p>Initial
+ experiments using <a href=\"https://github.com/jtet/Perceptron\">https://github.com/jtet/Perceptron</a>
+ were promising and I want to play with this further, but I decided to skip
+ this for now and just use simple string comparison. So for each CSL-JSON object
+ I generate a citation string in the same format using CiteProc, then compute
+ the <a href=\"https://en.wikipedia.org/wiki/Levenshtein_distance\">Levenshtein
+ distance</a> between the two strings. By normalising this distance by the
+ length of the two strings being compared I can use an arbitrary threshold
+ to decide if the references are the same or not.</p>\n\n<h2>Clustering</h2>\n\n<p>For
+ this step we read the JSONL file produced above and record whether the two
+ references are a match or not. Assuming each reference has a unique identifier
+ (it needs only be unique within the file) then we can use those identifiers to
+ record the clusters each reference belongs to. I do this using a <a href=\"https://en.wikipedia.org/wiki/Disjoint-set_data_structure\">Disjoint-set
+ data structure</a>. For each reference start with a graph where each node
+ represents a reference, and each node has a pointer to a parent node. Initially
+ the reference is its own parent. A simple implementation is to have an array
+ indexed by reference identifiers, where the value of each cell in the array
+ is the node''s parent.</p>\n\n<p>As we discover pairs we update the parents
+ of the nodes to reflect this, such that once all the comparisons are done
+ we have one or more sets of clusters corresponding to the references that
+ we think are the same. Another way to think of this is that we are getting
+ the components of a graph where each node is a reference and pairs of references
+ that match are connected by an edge.</p>\n\n<p>In the code I''m using I write
+ this graph in <a href=\"https://en.wikipedia.org/wiki/Trivial_Graph_Format\">Trivial
+ Graph Format</a> (TGF) which can be visualised using a tool such as <a href=\"https://www.yworks.com/products/yed\">yEd</a>.</p>\n\n<h2>Merging</h2>\n\n<p>Now
+ that we have a graph representing the sets of references that we think are
+ the same we need to merge them. This is where things get interesting as the
+ references are similar (by definition) but may differ in some details. The
+ paper below describes a simple Bayesian approach for merging records:</p>\n\n<blockquote>\nCouncill
+ IG, Li H, Zhuang Z, Debnath S, Bolelli L, Lee WC, Sivasubramaniam A, Giles
+ CL. 2006. Learning Metadata from the Evidence in an On-line Citation Matching
+ Scheme. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital
+ Libraries. JCDL ’06. New York, NY, USA: ACM, 276–285. DOI: <a href=\"https://doi.org/10.1145/1141753.1141817\">10.1145/1141753.1141817</a>.\n</blockquote>\n\n<p>So
+ the next step is to read the graph with the clusters, generate the sets of
+ bibliographic references that correspond to each cluster, then use the method
+ described in Councill et al. to produce a single bibliographic record for
+ that cluster. These records could then be used to, say, locate the corresponding
+ article in BHL, or populate Wikidata with missing references.</p>\n\n<p>Obviously
+ there is always the potential for errors, such as trying to merge references
+ that are not the same. As a quick and dirty check I flag as dubious any cluster
+ where the page numbers vary among members of the cluster. More sophisticated
+ checks are possible, especially if I go down the ML route (i.e., I would have
+ evidence for the probability that the same reference can disagree on some
+ aspects of metadata).</p>\n\n<h2>Summary</h2>\n\n<p>At this stage the code
+ is working well enough for me to play with and explore some example datasets.
+ The focus is on structured bibliographic metadata, but I may simplify things
+ and have a version that handles simple string matching, for example to cluster
+ together different abbreviations of the same journal name.</p>","tags":["data
+ cleaning","deduplication","Phytotaxa","Wikispecies","Zootaxa"],"language":"en","references":null},{"id":"https://doi.org/10.59350/ndtkv-6ve80","uuid":"e8e95aaf-bacb-4b5a-bf91-54e903526ab2","url":"https://iphylo.blogspot.com/2021/11/revisiting-rss-to-monitor-latests.html","title":"Revisiting
+ RSS to monitor the latest taxonomic research","summary":"Over a decade ago
+ RSS (RDF Site Summary or Really Simple Syndication) was attracting a lot of
+ interest as a way to integrate data across various websites. Many science
+ publishers would provide a list of their latest articles in XML in one of
+ three flavours of RSS (RDF, RSS, Atom). This led to tools such as uBioRSS
+ [1] and my own e-Biosphere Challenge: visualising biodiversity digitisation
+ in real time. It was a time of enthusiasm for aggregating lots of data, such
+ as the ill-fated PLoS...","date_published":"2021-11-23T20:53:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
+ (Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
+ both;\"><a href=\"https://1.bp.blogspot.com/-dsij6_nhdsc/SgHYU5MCwsI/AAAAAAAAAe8/9KN6Gm87sj0PCuJG-crvZoMbL8MJvusegCPcBGAYYCw/s257/feedicon.png\"
+ style=\"display: block; padding: 1em 0; text-align: center; clear: right;
+ float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"257\"
+ data-original-width=\"257\" src=\"https://1.bp.blogspot.com/-dsij6_nhdsc/SgHYU5MCwsI/AAAAAAAAAe8/9KN6Gm87sj0PCuJG-crvZoMbL8MJvusegCPcBGAYYCw/s200/feedicon.png\"/></a></div>\n<p>Over
+ a decade ago <a href=\"https://en.wikipedia.org/wiki/RSS\">RSS</a> (RDF Site
+ Summary or Really Simple Syndication) was attracting a lot of interest as
+ a way to integrate data across various websites. Many science publishers would
+ provide a list of their latest articles in XML in one of three flavours of
+ RSS (RDF, RSS, Atom). This led to tools such as <a href=\"http://ubio.org/rss/\">uBioRSS</a>
+ [<a href=\"#Leary2007\">1</a>] and my own <a href=\"https://iphylo.blogspot.com/2009/05/e-biosphere-challenge-visualising.html\">e-Biosphere
+ Challenge: visualising biodiversity digitisation in real time</a>. It was
+ a time of enthusiasm for aggregating lots of data, such as the <a href=\"https://iphylo.blogspot.com/2013/07/the-demise-of-plos-biodiversity-hub.html\">ill-fated</a>
+ PLoS Biodiversity Hub [<a href=\"#Mindell2011\">2</a>].</p>\n\n<p>Since I
+ seem to be condemned to revisit old ideas rather than come up with anything
+ new, I''ve been looking at providing a tool like the now defunct uBioRSS.
+ The idea is to harvest RSS feeds from journals (with an emphasis on taxonomic
+ and systematic journals), aggregate the results, and make them browsable by
+ taxon and geography. Here''s a sneak peek:</p>\n\n<div class=\"separator\"
+ style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-toYkBpS81tE/YZ1VY1DU1FI/AAAAAAAAg4E/yFMM4Xc3AEE8BjCj0jKO0sLtT9ZI-3k8ACLcBGAsYHQ/s1032/Screenshot%2B2021-11-23%2Bat%2B20.00.06.png\"
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
+ border=\"0\" width=\"400\" data-original-height=\"952\" data-original-width=\"1032\"
+ src=\"https://1.bp.blogspot.com/-toYkBpS81tE/YZ1VY1DU1FI/AAAAAAAAg4E/yFMM4Xc3AEE8BjCj0jKO0sLtT9ZI-3k8ACLcBGAsYHQ/s400/Screenshot%2B2021-11-23%2Bat%2B20.00.06.png\"/></a></div>\n\n<p>What
+ seems like a straightforward task quickly became a bit of a challenge. Not
+ all journals have RSS feeds (they seem to have become less widely supported
+ over time) so I need to think of alternative ways to get lists of recent articles.
+ These lists also need to be processed in various ways. There are three versions
+ of RSS, each with their own idiosyncrasies, so I need to standardise things
+ like dates. I also want to augment them with things like DOIs (often missing
+ from RSS feeds) and thumbnails for the articles (often available on publisher
+ websites but not the feeds). Then I need to index the content by taxon and
+ geography. For taxa I use a version of Patrick Leary''s \"taxonfinder\" (see
+ <a href=\"https://right-frill.glitch.me\">https://right-frill.glitch.me</a>)
+ to find names, then the <a href=\"https://index.globalnames.org\">Global Names
+ Index</a> to assign names found to the GBIF taxonomic hierarchy.</p>\n\n<p>Indexing
+ by geography proved harder. Typically <a href=\"https://en.wikipedia.org/wiki/Toponym_resolution#Geoparsing\">geoparsing</a>
+ involves taking a body of text and doing the following:\n<ul><li>Using named-entity
+ recognition <a href=\"https://en.wikipedia.org/wiki/Named-entity_recognition\">NER</a>
+ to identity named entities in the text (e.g., place names, people names, etc.).</li>\n<li>Using
+ a gazetteer of geographic names <a href=\"http://www.geonames.org\">GeoNames</a>
+ to try and match the place names found by NER.</li>\n</ul></p>\n\n<p>An example
+ of such a parser is the <a href=\"https://www.ltg.ed.ac.uk/software/geoparser/\">Edinburgh
+ Geoparser</a>. Typically geoparsing software can be large and tricky to install,
+ especially if you are looking to make your installation publicly accessible.
+ Geoparsing services seem to have a short half-life (e.g., <a href=\"https://geoparser.io\">Geoparser.io</a>),
+ perhaps because they are so useful they quickly get swamped by users.</p>\n\n<p>Bearing
+ this in mind, the approach I’ve taken here is to create a very simple geoparser
+ that is focussed on fairly large areas, especially those relevant to biodiversity,
+ and is aimed at geoparsing text such as abstracts of scientific papers. I''ve
+ created a small database of places by harvesting data from Wikidata, then
+ I use the \"flash text\" algorithm [<a href=\"#Singh2017\">3</a>] to find
+ geographic places. This approach uses a <a href=\"https://en.wikipedia.org/wiki/Trie\">trie</a>
+ to store the place names. All I do is walk through the text seeing whether
+ the current word matches a place name (or the start of one) in the trie, then
+ moving on. This is very quick and seems to work quite well.</p>\n\n<p>Given
+ that I need to aggregate data from a lot of sources, apply various transformations
+ to that data, then merge it, there are a lot of moving parts. I started playing
+ with a \"NoCode\" platform for creating workflows, in this case <a href=\"https://n8n.io\">n8n</a>
+ (in many ways reminiscent of the now defunct <a href=\"https://en.wikipedia.org/wiki/Yahoo!_Pipes\">Yahoo
+ Pipes</a>). This was quite fun for a while, but after lots of experimentation
+ I moved back to writing code to aggregate the data into a CouchDB database.
+ CouchDB is one of the NoSQL databases that I really like as it has a great
+ interface, and makes queries very easy to do once you get your head around
+ how it works.</p>\n\n<p>So the end result of this is \"BioRSS\" <a href=\"https://biorss.herokuapp.com\">https://biorss.herokuapp.com</a>.
+ The interface comprises a stream of articles listed from newest to oldest,
+ with a treemap and a geographic map on the left. You can use these to filter
+ the articles by taxonomic group and/or country. For example the screen shot
+ is showing arthropods from China (in this case from a month or two ago in
+ the journal <i>ZooKeys</i>). As much fun as the interface has been to construct,
+ in many ways I don''t really want to spend time making an interface. For each
+ combination of taxon and country I provide an RSS feed so if you have a favourite
+ feed reader you can grab the feed and view it there. As BioRSS updates the
+ data your feed reader should automatically update the feed. This means that
+ you can have a feed that monitors, say, new papers on spiders in China.</p>\n\n<p>In
+ the spirit of \"release early and release often\" this is an early version
+ of this app. I need to add a lot more feeds, backdate them to bring in older
+ content, and I also want to make use of aggregators such as PubMed, CrossRef,
+ and Google Scholar. The existence of these tools is, I suspect, one reason
+ why RSS feeds are less common than they used to be.</p>\n\n<p>So, if this
+ sounds useful please take it for a spin at <a href=\"https://biorss.herokuapp.com\">https://biorss.herokuapp.com</a>.
+ Feedback is welcome, especially suggestions for journals to harvest and add
+ to the news feed. Ultimately I''d like to have sufficient coverage of the
+ taxonomic literature so that BioRSS becomes a place where we can go to find
+ the latest papers on any taxon of interest.</p>\n\n<h2>References</h2>\n\n<blockquote>\n<a
+ name=\"Leary2007\">1.</a> Patrick R. Leary, David P. Remsen, Catherine N.
+ Norton, David J. Patterson, Indra Neil Sarkar, uBioRSS: Tracking taxonomic
+ literature using RSS, Bioinformatics, Volume 23, Issue 11, June 2007, Pages
+ 1434–1436, <a href=\"https://doi.org/10.1093/bioinformatics/btm109\">https://doi.org/10.1093/bioinformatics/btm109</a>\n</blockquote>\n\n<blockquote><a
+ name=\"Mindell2011\">2.</a> Mindell, D. P., Fisher, B. L., Roopnarine, P.,
+ Eisen, J., Mace, G. M., Page, R. D. M., & Pyle, R. L. (2011). Aggregating,
+ Tagging and Integrating Biodiversity Research. PLoS ONE, 6(8), e19491. <a
+ href=\"https://doi.org/10.1371/journal.pone.0019491\">doi:10.1371/journal.pone.0019491</a>\n</blockquote>\n\n<blockquote><a
+ name=\"Singh2017\">3.</a> Singh, V. (2017). Replace or Retrieve Keywords In
+ Documents at Scale. CoRR, abs/1711.00046. <a href=\"http://arxiv.org/abs/1711.00046\">http://arxiv.org/abs/1711.00046</a>\n\n</blockquote>","tags":["geocoding","NoCode","RSS"],"language":"en","references":[{"doi":"https://doi.org/10.1093/bioinformatics/btm109","key":"ref1"},{"doi":"https://doi.org/10.1371/journal.pone.0019491","key":"ref2"},{"key":"ref3","url":"http://arxiv.org/abs/1711.00046"}]},{"id":"https://doi.org/10.59350/gf1dw-n1v47","uuid":"a41163e0-9c9a-41e0-a141-f772663f2f32","url":"https://iphylo.blogspot.com/2023/03/dugald-stuart-page-1936-2022.html","title":"Dugald
+ Stuart Page 1936-2022","summary":"My dad died last weekend. Below is a notice
+ in today''s New Zealand Herald. I''m in New Zealand for his funeral. Don''t
+ really have the words for this right now.","date_published":"2023-03-14T03:00:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
+ (Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
+ both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZweukxntl7R5jnk3knVFVrqZ5RxC7mPZBV4gKeDIglbFzs2O442nbxqs8t8jV2tLqCU24K6gS32jW-Pe8q3O_5JR1Ms3qW1aQAZ877cKkFfcUydqUba9HsgNlX-zS9Ne92eLxRGS8F-lStTecJw2oalp3u58Yoc0oM7CUin5LKPeFIJ7Rzg/s3454/_DSC5106.jpg\"
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
+ border=\"0\" width=\"400\" data-original-height=\"2582\" data-original-width=\"3454\"
+ src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZweukxntl7R5jnk3knVFVrqZ5RxC7mPZBV4gKeDIglbFzs2O442nbxqs8t8jV2tLqCU24K6gS32jW-Pe8q3O_5JR1Ms3qW1aQAZ877cKkFfcUydqUba9HsgNlX-zS9Ne92eLxRGS8F-lStTecJw2oalp3u58Yoc0oM7CUin5LKPeFIJ7Rzg/s400/_DSC5106.jpg\"/></a></div>\n\nMy
+ dad died last weekend. Below is a notice in today''s New Zealand Herald. I''m
+ in New Zealand for his funeral. Don''t really have the words for this right
+ now.\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRUTOFF1VWHCl8dg3FQuaWy5LM7aX8IivdRpTtzgrdQTEymsA5bLTZE3cSQf1WQIP3XrC46JsLScP8BxTK9C5a-B1i51yg8WGSJD0heJVaoDLnerv0lD1o3qloDjqEuuyfX4wagHB5YYBmjWnGeVQvyYVngvDDf9eM6pmMtZ7x94Y4jSVrug/s3640/IMG_2870.jpeg\"
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
+ border=\"0\" height=\"320\" data-original-height=\"3640\" data-original-width=\"1391\"
+ src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRUTOFF1VWHCl8dg3FQuaWy5LM7aX8IivdRpTtzgrdQTEymsA5bLTZE3cSQf1WQIP3XrC46JsLScP8BxTK9C5a-B1i51yg8WGSJD0heJVaoDLnerv0lD1o3qloDjqEuuyfX4wagHB5YYBmjWnGeVQvyYVngvDDf9eM6pmMtZ7x94Y4jSVrug/s320/IMG_2870.jpeg\"/></a></div>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/c79vq-7rr11","uuid":"3cb94422-5506-4e24-a41c-a250bb521ee0","url":"https://iphylo.blogspot.com/2021/12/graphql-for-wikidata-wikicite.html","title":"GraphQL
+ for WikiData (WikiCite)","summary":"I''ve released a very crude GraphQL endpoint
+ for WikiData. More precisely, the endpoint is for a subset of the entities
+ that are of interest to WikiCite, such as scholarly articles, people, and
+ journals. There is a crude demo at https://wikicite-graphql.herokuapp.com.
+ The endpoint itself is at https://wikicite-graphql.herokuapp.com/gql.php.
+ There are various ways to interact with the endpoint, personally I like the
+ Altair GraphQL Client by Samuel Imolorhe. As I''ve mentioned earlier it''s
1329
+ taken...","date_published":"2021-12-20T13:16:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
1330
+ (Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
1331
+ both;\"><a href=\"https://blogger.googleusercontent.com/img/a/AVvXsEh7WW8TlfN2u3xe_E-sH5IK6AWYnAoWaKrP2b32UawUeqMpPlq6ZFk5BtJVZsMmCNh5j3QRsTj5H0Ee55RRGntnc9yjj_mNB8KHmH2dzocCgyLS2VFOxsBji6u4Ey6qxlDAnT-zrsBpnDcTchbhgt1x0Sf7RkmIMkS1y4-_3KCQian-SeIF-g=s1000\"
1332
+ style=\"display: block; padding: 1em 0; text-align: center; clear: right;
1333
+ float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"1000\"
1334
+ data-original-width=\"1000\" src=\"https://blogger.googleusercontent.com/img/a/AVvXsEh7WW8TlfN2u3xe_E-sH5IK6AWYnAoWaKrP2b32UawUeqMpPlq6ZFk5BtJVZsMmCNh5j3QRsTj5H0Ee55RRGntnc9yjj_mNB8KHmH2dzocCgyLS2VFOxsBji6u4Ey6qxlDAnT-zrsBpnDcTchbhgt1x0Sf7RkmIMkS1y4-_3KCQian-SeIF-g=s200\"/></a></div><p>I''ve
1335
+ released a very crude GraphQL endpoint for WikiData. More precisely, the endpoint
1336
+ is for a subset of the entities that are of interest to WikiCite, such as
1337
+ scholarly articles, people, and journals. There is a crude demo at <a href=\"https://wikicite-graphql.herokuapp.com\">https://wikicite-graphql.herokuapp.com</a>.
1338
+ The endpoint itself is at <a href=\"https://wikicite-graphql.herokuapp.com/gql.php\">https://wikicite-graphql.herokuapp.com/gql.php</a>.
1339
+ There are various ways to interact with the endpoint, personally I like the
1340
+ <a href=\"https://altair.sirmuel.design\">Altair GraphQL Client</a> by <a
1341
+ href=\"https://github.com/imolorhe\">Samuel Imolorhe</a>.</p>\n\n<p>As I''ve
1342
+ <a href=\"https://iphylo.blogspot.com/2021/04/it-been-while.html\">mentioned
1343
+ earlier</a> it''s taken me a while to see the point of GraphQL. But it is
1344
+ clear it is gaining traction in the biodiversity world (see for example the
1345
+ <a href=\"https://dev.gbif.org/hosted-portals.html\">GBIF Hosted Portals</a>)
1346
+ so it''s worth exploring. My take on GraphQL is that it is a way to create
1347
+ a self-describing API that someone developing a web site can use without them
1348
+ having to bury themselves in the gory details of how data is internally modelled.
1349
+ For example, WikiData''s query interface uses SPARQL, a powerful language
1350
+ that has a steep learning curve (in part because of the administrative overhead
1351
+ brought by RDF namespaces, etc.). In my previous SPARQL-based projects such
1352
+ as <a href=\"https://ozymandias-demo.herokuapp.com\">Ozymandias</a> and <a
1353
+ href=\"http://alec-demo.herokuapp.com\">ALEC</a> I have either returned SPARQL
1354
+ results directly (Ozymandias) or formatted SPARQL results as <a href=\"https://schema.org/DataFeed\">schema.org
1355
+ DataFeeds</a> (equivalent to RSS feeds) (ALEC). Both approaches work, but
1356
+ they are project-specific and if anyone else tried to build based on these
1357
+ projects they might struggle to figure out what was going on. I certainly
1358
+ struggle, and I wrote them!</p>\n\n<p>So it seems worthwhile to explore this
1359
+ approach a little further and see if I can develop a GraphQL interface that
1360
+ can be used to build the sort of rich apps that I want to see. The demo I''ve
1361
+ created uses SPARQL under the hood to provide responses to the GraphQL queries.
1362
+ So in this sense it''s not replacing SPARQL, it''s simply providing a (hopefully)
1363
+ simpler overlay on top of SPARQL so that we can retrieve the data we want
1364
+ without having to learn the intricacies of SPARQL, nor how Wikidata models
1365
+ publications and people.</p>","tags":["GraphQL","SPARQL","WikiCite","Wikidata"],"language":"en","references":null},{"id":"https://doi.org/10.59350/ymc6x-rx659","uuid":"0807f515-f31d-4e2c-9e6f-78c3a9668b9d","url":"https://iphylo.blogspot.com/2022/09/dna-barcoding-as-intergenerational.html","title":"DNA
1366
+ barcoding as intergenerational transfer of taxonomic knowledge","summary":"I
1367
+ tweeted about this but want to bookmark it for later as well. The paper “A
1368
+ molecular-based identification resource for the arthropods of Finland” doi:10.1111/1755-0998.13510
1369
+ contains the following: …the annotated barcode records assembled by FinBOL
1370
+ participants represent a tremendous intergenerational transfer of taxonomic
1371
+ knowledge … the time contributed by current taxonomists in identifying and
1372
+ contributing voucher specimens represents a great gift to future generations
1373
+ who will benefit...","date_published":"2022-09-14T10:12:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
1374
+ (Roderic Page)"}],"image":null,"content_html":"<p>I <a href=\"https://twitter.com/rdmpage/status/1569738844416638981?s=21&amp;t=9OVXuoUEwZtQt-Ldzlutfw\">tweeted
1375
+ about this</a> but want to bookmark it for later as well. The paper “A molecular-based
1376
+ identification resource for the arthropods of Finland” <a href=\"https://doi.org/10.1111/1755-0998.13510\">doi:10.1111/1755-0998.13510</a>
1377
+ contains the following:</p>\n<blockquote>\n<p>…the annotated barcode records
1378
+ assembled by FinBOL participants represent a tremendous <mark>intergenerational
1379
+ transfer of taxonomic knowledge</mark> … the time contributed by current taxonomists
1380
+ in identifying and contributing voucher specimens represents a great gift
1381
+ to future generations who will benefit from their expertise when they are
1382
+ no longer able to process new material.</p>\n</blockquote>\n<p>I think this
1383
+ is a very clever way to characterise the project. In an age of machine learning
1384
+ this may be the commonest way to share knowledge, namely as expert-labelled training
1385
+ data used to build tools for others. Of course, this means the expertise itself
1386
+ may be lost, which has implications for updating the models if the data isn’t
1387
+ complete. But it speaks to Charles Godfray’s theme of <a href=\"https://biostor.org/reference/250587\">“Taxonomy
1388
+ as information science”</a>.</p>\n<p>Note that the knowledge is also transformed
1389
+ in the sense that the underlying expertise of interpreting morphology, ecology,
1390
+ behaviour, genomics, and the past literature is not what is being passed on.
1391
+ Instead it is probabilities that a DNA sequence belongs to a particular taxon.</p>\n<p>This
1392
+ feels different to, say, iNaturalist, where there is a machine learning
1393
+ model to identify images. In that case, the model is built on something the
1394
+ community itself has created, and continues to create. Yes, the underlying
1395
+ idea is the same: “experts” have labelled the data, a model is trained, the
1396
+ model is used. But the benefits of the <a href=\"https://www.inaturalist.org\">iNaturalist</a>
1397
+ model are immediately applicable to the people whose data built the model.
1398
+ In the case of barcoding, because the technology itself is still not in the
1399
+ hands of many (relative to, say, digital imaging), the benefits are perhaps
1400
+ less tangible. Obviously researchers working with environmental DNA will find
1401
+ it very useful, but broader impact may await the arrival of citizen science
1402
+ DNA barcoding.</p>\n<p>The other consideration is whether the barcoding helps
1403
+ taxonomists. Is it to be used to help prioritise future work (“we are getting
1404
+ lots of unknown sequences in these taxa, let''s do some taxonomy there”), or
1405
+ is it simply capturing the knowledge of a generation that won’t be replaced:</p>\n<blockquote>\n<p>The
1406
+ need to capture such knowledge is essential because there are, for example,
1407
+ no young Finnish taxonomists who can critically identify species in many key
1408
+ groups of arthropods (e.g., aphids, chewing lice, chalcid wasps, gall midges,
1409
+ most mite lineages).</p>\n</blockquote>\n<p>The cycle of collect data, test
1410
+ and refine model, collect more data, rinse and repeat that happens with iNaturalist
1411
+ creates a feedback loop. It’s not clear that a similar cycle exists for DNA
1412
+ barcoding.</p>\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/enxas-arj18","uuid":"ab5a6e04-d55e-4901-8269-9eea65ce7178","url":"https://iphylo.blogspot.com/2022/08/can-we-use-citation-graph-to-measure.html","title":"Can
1413
+ we use the citation graph to measure the quality of a taxonomic database?","summary":"More
1414
+ arm-waving notes on taxonomic databases. I''ve started to add data to ChecklistBank
1415
+ and this has got me thinking about the issue of data quality. When you add
1416
+ data to ChecklistBank you are asked to give a measure of confidence based
1417
+ on the Catalogue of Life Checklist Confidence system of one - five stars:
1418
+ ★ - ★★★★★. I''m scepetical about the notion of confidence or \"trust\" when
1419
+ it is reduced to a star system (see also Can you trust EOL?). I could literally
1420
+ pick any number of stars, there''s...","date_published":"2022-08-24T14:33:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
1421
+ (Roderic Page)"}],"image":null,"content_html":"<p>More arm-waving notes on
1422
+ taxonomic databases. I''ve started to add data to <a href=\"https://www.checklistbank.org\">ChecklistBank</a>
1423
+ and this has got me thinking about the issue of data quality. When you add
1424
+ data to ChecklistBank you are asked to give a measure of confidence based
1425
+ on the <a href=\"https://www.catalogueoflife.org/about/glossary.html#checklist-confidence\">Catalogue
1426
+ of Life Checklist Confidence</a> system of one - five stars: ★ - ★★★★★. I''m
1427
+ sceptical about the notion of confidence or \"trust\" when it is reduced
1428
+ to a star system (see also <a href=\"https://iphylo.blogspot.com/2012/06/can-you-trust-eol.html\">Can
1429
+ you trust EOL?</a>). I could literally pick any number of stars, there''s
1430
+ no way to measure what number of stars is appropriate. This feeds into my
1431
+ biggest reservation about the <a href=\"https://www.catalogueoflife.org\">Catalogue
1432
+ of Life</a>, it''s almost entirely authority based, not evidence based. That
1433
+ is, rather than give us evidence for why a particular taxon is valid, we are
1434
+ (mostly) just given a list of taxa and asked to accept those as gospel, based
1435
+ on assertions by one or more authorities. I''m not necessarily doubting the
1436
+ knowledge of those making these lists, it''s just that I think we need to
1437
+ do better than \"these are the accepted taxa because I say so\" implicit in
1438
+ the Catalogue of Life.\n</p>\n\n<p>So, is there any way we could objectively
1439
+ measure the quality of a particular taxonomic checklist? Since I have a long
1440
+ standing interest in linking the primary taxonomic literature to names in databases
1441
+ (since that''s where the evidence is), I keep wondering whether measures based
1442
+ on that literature could be developed. \n</p>\n<p>\nI recently revisited the
1443
+ fascinating (and quite old) literature on rates of synonymy:\n</p>\n<blockquote>\nGaston,
1444
+ Kevin J. and Mound, Laurence A. 1993. Taxonomy, hypothesis testing and the
1445
+ biodiversity crisis. Proc. R. Soc. Lond. B. 251: 139–142.\n<a href=\"http://doi.org/10.1098/rspb.1993.0020\">http://doi.org/10.1098/rspb.1993.0020</a>\n</blockquote>\n \n<blockquote>\n Andrew
1446
+ R. Solow, Laurence A. Mound, Kevin J. Gaston, Estimating the Rate of Synonymy,
1447
+ Systematic Biology, Volume 44, Issue 1, March 1995, Pages 93–96, <a href=\"https://doi.org/10.1093/sysbio/44.1.93\">https://doi.org/10.1093/sysbio/44.1.93</a>\n</blockquote>\n\n</p>\n\n<p>\nA
1448
+ key point these papers make is that the observed rate of synonymy is quite
1449
+ high (that is, many \"new species\" end up being merged with already known
1450
+ species), and that because it can take time to discover that a species is
1451
+ a synonym the actual rate may be even higher. In other words, in diagrams
1452
+ like the one reproduced below, the reason the proportion of synonyms declines
1453
+ the nearer we get to the present day (this paper came out in 1995) is not
1454
+ because we are creating fewer synonyms but because we''ve not yet had time
1455
+ to do the work to uncover the remaining synonyms.\n</p>\n \n<div class=\"separator\"
1456
+ style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhgwvBcDUD_IJZyAa0gaqXB--ogJeVooV19I635N_w2OuztDFsy-ZPXEFuW5S1NUxxUEHmN8pbdO9MPljIi5v0A2355kYotL1fKqdewP9kmWu9TfwtIJ3jD04SQjeF3SWK-yMVAx-rNc6TO43GIwmftPk6IghOrcmur6SoHe06ws7fEFAvxA/s621/Screenshot%202022-08-24%20at%2014.59.47.png\"
1457
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
1458
+ border=\"0\" width=\"400\" data-original-height=\"404\" data-original-width=\"621\"
1459
+ src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhgwvBcDUD_IJZyAa0gaqXB--ogJeVooV19I635N_w2OuztDFsy-ZPXEFuW5S1NUxxUEHmN8pbdO9MPljIi5v0A2355kYotL1fKqdewP9kmWu9TfwtIJ3jD04SQjeF3SWK-yMVAx-rNc6TO43GIwmftPk6IghOrcmur6SoHe06ws7fEFAvxA/s400/Screenshot%202022-08-24%20at%2014.59.47.png\"/></a></div>\n\n<p>Put
1460
+ another way, these papers are arguing that real work of taxonomy is revision,
1461
+ not species discovery, especially since it''s not uncommon for > 50% of species
1462
+ in a taxon to end up being synonymised. Indeed, if a taxonomic group has few
1463
+ synonyms then these authors would argue that''s a sign of neglect. More revisionary
1464
+ work would likely uncover additional synonyms. So, what we need is a way to
1465
+ measure the amount of research on a taxonomic group. It occurs to me that
1466
+ we could use the citation graph as a way to tackle this. Lets imagine we have
1467
+ a set of taxa (say a family) and we have all the papers that described new
1468
+ species or undertook revisions (or both). The extensiveness of that work could
1469
+ be measured by the citation graph. For example, build the citation graph for
1470
+ those papers. How many original species descriptions are not cited? Those
1471
+ species have been potentially neglected. How many large-scale revisions have
1472
+ there been (as measured by the numbers of taxonomic papers those revisions
1473
+ cite)? There are some interesting approaches to quantifying this, such as
1474
+ using <a href=\"https://en.wikipedia.org/wiki/HITS_algorithm\">hubs and authorities</a>.</p>\n \n \n<p>I''m
1475
+ aware that taxonomists have not had the happiest relationship with citations:\n \n<blockquote>\nPinto
1476
+ ÂP, Mejdalani G, Mounce R, Silveira LF, Marinoni L, Rafael JA. Are publications
1477
+ on zoological taxonomy under attack? R Soc Open Sci. 2021 Feb 10;8(2):201617.
1478
+ <a href=\"https://doi.org/10.1098/rsos.201617\">doi: 10.1098/rsos.201617</a>.
1479
+ PMID: 33972859; PMCID: PMC8074659.\n</blockquote>\n\nStill, I think there
1480
+ is an intriguing possibility here. For this approach to work, we need to have
1481
+ linked taxonomic names to publications, and have citation data for those publications.
1482
+ This is happening on various platforms. Wikidata, for example, is becoming
1483
+ a repository of the taxonomic literature, some of it with citation links.\n\n<blockquote>\nPage
1484
+ RDM. 2022. Wikidata and the bibliography of life. PeerJ 10:e13712 <a href=\"https://doi.org/10.7717/peerj.13712\">https://doi.org/10.7717/peerj.13712</a>\n</blockquote>\n\nTime
1485
+ for some experiments.\n</p>","tags":["Bibliography of Life","citation","synonymy","taxonomic
1486
+ databases"],"language":"en","references":null},{"id":"https://doi.org/10.59350/cbzgz-p8428","uuid":"a93134aa-8b33-4dc7-8cd4-76cdf64732f4","url":"https://iphylo.blogspot.com/2023/04/library-interfaces-knowledge-graphs-and.html","title":"Library
1487
+ interfaces, knowledge graphs, and Miller columns","summary":"Some quick notes
1488
+ on interface ideas for digital libraries and/or knowledge graphs. Recently
1489
+ there’s been something of an explosion in bibliographic tools to explore the
1490
+ literature. Examples include: Elicit which uses AI to search for and summarise
1491
+ papers _scite which uses AI to do sentiment analysis on citations (does paper
1492
+ A cite paper B favourably or not?) ResearchRabbit which uses lists, networks,
1493
+ and timelines to discover related research Scispace which navigates connections
1494
+ between...","date_published":"2023-04-25T13:01:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
1495
+ (Roderic Page)"}],"image":null,"content_html":"<p>Some quick notes on interface
1496
+ ideas for digital libraries and/or knowledge graphs.</p>\n<p>Recently there’s
1497
+ been something of an explosion in bibliographic tools to explore the literature.
1498
+ Examples include:</p>\n<ul>\n<li><a href=\"https://elicit.org\">Elicit</a>
1499
+ which uses AI to search for and summarise papers</li>\n<li><a href=\"https://scite.ai\">_scite</a>
1500
+ which uses AI to do sentiment analysis on citations (does paper A cite paper
1501
+ B favourably or not?)</li>\n<li><a href=\"https://www.researchrabbit.ai\">ResearchRabbit</a>
1502
+ which uses lists, networks, and timelines to discover related research</li>\n<li><a
1503
+ href=\"https://typeset.io\">Scispace</a> which navigates connections between
1504
+ papers, authors, topics, etc., and provides AI summaries.</li>\n</ul>\n<p>As
1505
+ an aside, I think these (and similar tools) are a great example of how bibliographic
1506
+ data such as abstracts, the citation graph and - to a lesser extent - full
1507
+ text - have become commodities. That is, what was once proprietary information
1508
+ is now free to anyone, which in turns means a whole ecosystem of new tools
1509
+ can emerge. If I was clever I’d be building a <a href=\"https://en.wikipedia.org/wiki/Wardley_map\">Wardley
1510
+ map</a> to explore this. Note that a decade or so ago reference managers like
1511
+ <a href=\"https://www.zotero.org\">Zotero</a> were made possible by publishers
1512
+ exposing basic bibliographic data on their articles. As we move to <a href=\"https://i4oc.org\">open
1513
+ citations</a> we are seeing the next generation of tools.</p>\n<p>Back to
1514
+ my main topic. As usual, rather than focus on what these tools do I’m more
1515
+ interested in how they <strong>look</strong>. I have history here, when the
1516
+ iPad came out I was intrigued by the possibilities it offered for displaying
1517
+ academic articles, as discussed <a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad.html\">here</a>,
1518
+ <a href=\"https://iphylo.blogspot.com/2010/09/viewing-scientific-articles-on-ipad.html\">here</a>,
1519
+ <a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad_24.html\">here</a>,
1520
+ <a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad_3052.html\">here</a>,
1521
+ and <a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad_31.html\">here</a>.
1522
+ ResearchRabbit looks like this:</p>\n<div style=\"padding:86.91% 0 0 0;position:relative;\"><iframe
1523
+ src=\"https://player.vimeo.com/video/820871442?h=23b05b0dae&amp;badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479\"
1524
+ frameborder=\"0\" allow=\"autoplay; fullscreen; picture-in-picture\" allowfullscreen
1525
+ style=\"position:absolute;top:0;left:0;width:100%;height:100%;\" title=\"ResearchRabbit\"></iframe></div><script
1526
+ src=\"https://player.vimeo.com/api/player.js\"></script>\n<p>Scispace’s <a
1527
+ href=\"https://typeset.io/explore/journals/parassitologia-1ieodjwe\">“trace”
1528
+ view</a> looks like this:</p>\n<div style=\"padding:84.55% 0 0 0;position:relative;\"><iframe
1529
+ src=\"https://player.vimeo.com/video/820871348?h=2db7b661ef&amp;badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479\"
1530
+ frameborder=\"0\" allow=\"autoplay; fullscreen; picture-in-picture\" allowfullscreen
1531
+ style=\"position:absolute;top:0;left:0;width:100%;height:100%;\" title=\"Scispace
1532
+ screencast\"></iframe></div><script src=\"https://player.vimeo.com/api/player.js\"></script>\n<p>What
1533
+ is interesting about both is that they display content from left to right
1534
+ in vertical columns, rather than the more common horizontal rows. This sort
1535
+ of display is sometimes called <a href=\"https://en.wikipedia.org/wiki/Miller_columns\">Miller
1536
+ columns</a> or a <a href=\"https://web.archive.org/web/20210726134921/http://designinginterfaces.com/firstedition/index.php?page=Cascading_Lists\">cascading
1537
+ list</a>.</p>\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBPnV9fRBcvm-BX5PjzfG5Cff9PerCLsTW8d5ZbsL6b41t7ypD7ovmcgfTf3b4b34mbq8NM4sfwOHkgEq32FLYnD497RFQD4HQmYmh5Eveu1zWdDVyKyDtPyE98QoxTaOEnLA5kK0fnl3dOOEgUvtVKlTZ8bt1gj2v_8tDRWl9f50ybyei3A/s1024/GNUstep-liveCD.png\"
1538
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
1539
+ border=\"0\" width=\"400\" data-original-height=\"768\" data-original-width=\"1024\"
1540
+ src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBPnV9fRBcvm-BX5PjzfG5Cff9PerCLsTW8d5ZbsL6b41t7ypD7ovmcgfTf3b4b34mbq8NM4sfwOHkgEq32FLYnD497RFQD4HQmYmh5Eveu1zWdDVyKyDtPyE98QoxTaOEnLA5kK0fnl3dOOEgUvtVKlTZ8bt1gj2v_8tDRWl9f50ybyei3A/s400/GNUstep-liveCD.png\"/></a></div>\n\n<p>By
1541
+ Gürkan Sengün (talk) - Own work, Public Domain, <a href=\"https://commons.wikimedia.org/w/index.php?curid=594715\">https://commons.wikimedia.org/w/index.php?curid=594715</a></p>\n<p>I’ve
1542
+ always found displaying a knowledge graph to be a challenge, as discussed
1543
+ <a href=\"https://iphylo.blogspot.com/2019/07/notes-on-collections-knowledge-graphs.html\">elsewhere
1544
+ on this blog</a> and in my paper on <a href=\"https://peerj.com/articles/6739/#p-29\">Ozymandias</a>.
1545
+ Miller columns enable one to drill down in increasing depth, but it doesn’t
1546
+ need to be a tree, it can be a path within a network. What I like about ResearchRabbit
1547
+ and the original Scispace interface is that they present the current item
1548
+ together with a list of possible connections (e.g., authors, citations) that
1549
+ you can drill down on. Clicking on these will result in a new column being
1550
+ appended to the right, with a view (typically a list) of the next candidates
1551
+ to visit. In graph terms, these are adjacent nodes to the original item. The
1552
+ clickable badges on each item can be thought of as sets of edges that have
1553
+ the same label (e.g., “authored by”, “cites”, “funded”, “is about”, etc.).
1554
+ Each of these nodes itself becomes a starting point for further exploration.
1555
+ Note that the original starting point isn’t privileged, other than being the
1556
+ starting point. That is, each time we drill down we are seeing the same type
1557
+ of information displayed in the same way. Note also that the navigation can
1558
+ be thought of as a <strong>card</strong> for a node, with <strong>buttons</strong>
1559
+ grouping the adjacent nodes. When we click on an individual button, it expands
1560
+ into a <strong>list</strong> in the next column. This can be thought of as
1561
+ a preview for each adjacent node. Clicking on an element in the list generates
1562
+ a new card (we are viewing a single node) and we get another set of buttons
1563
+ corresponding to the adjacent nodes.</p>\n<p>One important behaviour in a
1564
+ Miller column interface is that the current path can be pruned at any point.
1565
+ If we go back (i.e., scroll to the left) and click on another tab on an item,
1566
+ everything downstream of that item (i.e., to the right) gets deleted and replaced
1567
+ by a new set of nodes. This could make retrieving a particular history of
1568
+ browsing a bit tricky, but encourages exploration. Both Scispace and ResearchRabbit have
1569
+ the ability to add items to a collection, so you can keep track of things
1570
+ you discover.</p>\n<p>Lots of food for thought, I’m assuming that there is
1571
+ some user interface/experience research on Miller columns. One thing to remember
1572
+ is that Miller columns are most often associated with trees, but in this case
1573
+ we are exploring a network. That means that potentially there is no limit
1574
+ to the number of columns being generated as we wander through the graph. It
1575
+ will be interesting to think about what the average depth is likely to be,
1576
+ in other words, how deep down the rabbit hole will we go?</p>\n\n<h3>Update</h3>\n<p>Should
1577
+ add a link to David Regev''s explorations of <a href=\"https://medium.com/david-regev-on-ux/flow-browser-b730daf0f717\">Flow
1578
+ Browser</a>.\n\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":["cards","flow","Knowledge
1579
+ Graph","Miller column","RabbitResearch"],"language":"en","references":null},{"id":"https://doi.org/10.59350/t6fb9-4fn44","uuid":"8bc3fea6-cb86-4344-8dad-f312fbf58041","url":"https://iphylo.blogspot.com/2021/12/the-business-of-extracting-knowledge.html","title":"The
1580
+ Business of Extracting Knowledge from Academic Publications","summary":"Markus
1581
+ Strasser (@mkstra) wrote a fascinating article entitled \"The Business of Extracting
1582
+ Knowledge from Academic Publications\". I spent months working on domain-specific
1583
+ search engines and knowledge discovery apps for biomedicine and eventually
1584
+ figured that synthesizing &quot;insights&quot; or building knowledge graphs
1585
+ by machine-reading the academic literature (papers) is *barely useful* :https://t.co/eciOg30Odc&mdash;
1586
+ Markus Strasser (@mkstra) December 7, 2021 His TL;DR: TL;DR: I...","date_published":"2021-12-11T00:01:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
1587
+ (Roderic Page)"}],"image":null,"content_html":"<p>Markus Strasser (<a href=\"https://twitter.com/mkstra\">@mkstra</a>
1588
+ wrote a fascinating article entitled <a href=\"https://markusstrasser.org/extracting-knowledge-from-literature/\">\"The
1589
+ Business of Extracting Knowledge from Academic Publications\"</a>.</p>\n\n<blockquote
1590
+ class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">I spent months working
1591
+ on domain-specific search engines and knowledge discovery apps for biomedicine
1592
+ and eventually figured that synthesizing &quot;insights&quot; or building
1593
+ knowledge graphs by machine-reading the academic literature (papers) is *barely
1594
+ useful* :<a href=\"https://t.co/eciOg30Odc\">https://t.co/eciOg30Odc</a></p>&mdash;
1595
+ Markus Strasser (@mkstra) <a href=\"https://twitter.com/mkstra/status/1468334482113523716?ref_src=twsrc%5Etfw\">December
1596
+ 7, 2021</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
1597
+ charset=\"utf-8\"></script>\n\n<p>His TL;DR:</p>\n\n<p><blockquote>\nTL;DR:
1598
+ I worked on biomedical literature search, discovery and recommender web applications
1599
+ for many months and concluded that extracting, structuring or synthesizing
1600
+ \"insights\" from academic publications (papers) or building knowledge bases
1601
+ from a domain corpus of literature has negligible value in industry.</p>\n\n<p>Close
1602
+ to nothing of what makes science actually work is published as text on the
1603
+ web.\n</blockquote></p>\n\n<p>After recounting the many problems of knowledge
1604
+ extraction - including a swipe at nanopubs which \"are ... dead in my view
1605
+ (without admitting it)\" - he concludes:</p>\n\n<p><blockquote>\nI’ve been
1606
+ flirting with this entire cluster of ideas including open source web annotation,
1607
+ semantic search and semantic web, public knowledge graphs, nano-publications,
1608
+ knowledge maps, interoperable protocols and structured data, serendipitous
1609
+ discovery apps, knowledge organization, communal sense making and academic
1610
+ literature/publishing toolchains for a few years on and off ... nothing of
1611
+ it will go anywhere.</p>\n\n<p>Don’t take that as a challenge. Take it as
1612
+ a red flag and run. Run towards better problems.\n</blockquote></p>\n\n<p>Well
1613
+ worth a read, and much food for thought.</p>","tags":["ai","business model","text
1614
+ mining"],"language":"en","references":null},{"id":"https://doi.org/10.59350/463yw-pbj26","uuid":"dc829ab3-f0f1-40a4-b16d-a36dc0e34166","url":"https://iphylo.blogspot.com/2022/12/david-remsen.html","title":"David
1615
+ Remsen","summary":"I heard yesterday from Martin Kalfatovic (BHL) that David
1616
+ Remsen has died. Very sad news. It''s starting to feel like iPhylo might end
1617
+ up being a list of obituaries of people working on biodiversity informatics
1618
+ (e.g., Scott Federhen). I spent several happy visits at MBL at Woods Hole
1619
+ talking to Dave at the height of the uBio project, which really kickstarted
1620
+ large scale indexing of taxonomic names, and the use of taxonomic name finding
1621
+ tools to index the literature. His work on uBio with David...","date_published":"2022-12-16T17:54:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
1622
+ (Roderic Page)"}],"image":null,"content_html":"<p>I heard yesterday from Martin
1623
+ Kalfatovic (BHL) that David Remsen has died. Very sad news. It''s starting
1624
+ to feel like iPhylo might end up being a list of obituaries of people working
1625
+ on biodiversity informatics (e.g., <a href=\"https://iphylo.blogspot.com/2016/05/scott-federhen-rip.html\">Scott
1626
+ Federhen</a>).</p>\n\n<p>I spent several happy visits at MBL at Woods Hole
1627
+ talking to Dave at the height of the uBio project, which really kickstarted
1628
+ large scale indexing of taxonomic names, and the use of taxonomic name finding
1629
+ tools to index the literature. His work on uBio with David (\"Paddy\") Patterson
1630
+ led to the <a href=\"https://eol.org\">Encyclopedia of Life</a> (EOL).</p>\n\n<p>A
1631
+ number of the things I''m currently working on are things Dave started. For
1632
+ example, I recently uploaded a version of his dataset for Nomenclator Zoologicus[1]
1633
+ to <a href=\"https://www.checklistbank.org/dataset/126539/about\">ChecklistBank</a>
1634
+ where I''m working on augmenting that original dataset by adding links to
1635
+ the taxonomic literature. My <a href=\"https://biorss.herokuapp.com/?feed=Y291bnRyeT1XT1JMRCZwYXRoPSU1QiUyMkJJT1RBJTIyJTVE\">BioRSS
1636
+ project</a> is essentially an attempt to revive uBioRSS[2] (see <a href=\"https://iphylo.blogspot.com/2021/11/revisiting-rss-to-monitor-latests.html\">Revisiting
1637
+ RSS to monitor the latest taxonomic research</a>).</p>\n\n<p>I have fond memories
1638
+ of those visits to Woods Hole. A very sad day indeed.</p>\n\n<p><b>Update:</b>
1639
+ The David Remsen Memorial Fund has been set up on <a href=\"https://www.gofundme.com/f/david-remsen-memorial-fund\">GoFundMe</a>.</p>\n\n<p>1.
1640
+ Remsen, D. P., Norton, C., & Patterson, D. J. (2006). Taxonomic Informatics
1641
+ Tools for the Electronic Nomenclator Zoologicus. The Biological Bulletin,
1642
+ 210(1), 18–24. https://doi.org/10.2307/4134533</p>\n\n<p>2. Patrick R. Leary,
1643
+ David P. Remsen, Catherine N. Norton, David J. Patterson, Indra Neil Sarkar,
1644
+ uBioRSS: Tracking taxonomic literature using RSS, Bioinformatics, Volume 23,
1645
+ Issue 11, June 2007, Pages 1434–1436, https://doi.org/10.1093/bioinformatics/btm109</p>","tags":["David
1646
+ Remsen","obituary","uBio"],"language":"en","references":null},{"id":"https://doi.org/10.59350/pmhat-5ky65","uuid":"5891c709-d139-440f-bacb-06244424587a","url":"https://iphylo.blogspot.com/2021/10/problems-with-plazi-parsing-how.html","title":"Problems
+ with Plazi parsing: how reliable are automated methods for extracting specimens
+ from the literature?","summary":"The Plazi project has become one of the major
+ contributors to GBIF with some 36,000 datasets yielding some 500,000 occurrences
+ (see Plazi''s GBIF page for details). These occurrences are extracted from
+ taxonomic publications using automated methods. New data is published almost
+ daily (see latest treatments). The map below shows the geographic distribution
+ of material citations provided to GBIF by Plazi, which gives you a sense of
+ the size of the dataset. By any metric Plazi represents a...","date_published":"2021-10-25T11:10:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
+ (Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
+ both;\"><a href=\"https://1.bp.blogspot.com/-oiqSkA53FI4/YXaRAUpRgFI/AAAAAAAAgzY/PbPEOh_VlhIE0aaqqhVQPLAOE5pRULwJACLcBGAsYHQ/s240/Rf7UoXTw_400x400.jpg\"
+ style=\"display: block; padding: 1em 0; text-align: center; clear: right;
+ float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"240\"
+ data-original-width=\"240\" src=\"https://1.bp.blogspot.com/-oiqSkA53FI4/YXaRAUpRgFI/AAAAAAAAgzY/PbPEOh_VlhIE0aaqqhVQPLAOE5pRULwJACLcBGAsYHQ/s200/Rf7UoXTw_400x400.jpg\"/></a></div><p>The
+ <a href=\"http://plazi.org\">Plazi</a> project has become one of the major
+ contributors to GBIF with some 36,000 datasets yielding some 500,000 occurrences
+ (see <a href=\"https://www.gbif.org/publisher/7ce8aef0-9e92-11dc-8738-b8a03c50a862\">Plazi''s
+ GBIF page</a> for details). These occurrences are extracted from taxonomic
+ publications using automated methods. New data is published almost daily (see
+ <a href=\"https://tb.plazi.org/GgServer/static/newToday.html\">latest treatments</a>).
+ The map below shows the geographic distribution of material citations provided
+ to GBIF by Plazi, which gives you a sense of the size of the dataset.</p>\n\n<div
+ class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-DCJ4HR8eej8/YXaRQnz22bI/AAAAAAAAgz4/AgRcree6jVgjtQL2ch7IXgtb_Xtx7fkngCPcBGAYYCw/s1030/Screenshot%2B2021-10-24%2Bat%2B20.43.23.png\"
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
+ border=\"0\" width=\"400\" data-original-height=\"514\" data-original-width=\"1030\"
+ src=\"https://1.bp.blogspot.com/-DCJ4HR8eej8/YXaRQnz22bI/AAAAAAAAgz4/AgRcree6jVgjtQL2ch7IXgtb_Xtx7fkngCPcBGAYYCw/s400/Screenshot%2B2021-10-24%2Bat%2B20.43.23.png\"/></a></div>\n\n<p>By
+ any metric Plazi represents a considerable achievement. But often when I browse
+ individual records on Plazi I find records that seem clearly incorrect. Text
+ mining the literature is a challenging problem, but at the moment Plazi seems
+ something of a \"black box\". PDFs go in, the content is mined, and data comes
+ up to be displayed on the Plazi web site and uploaded to GBIF. Nowhere does
+ there seem to be an evaluation of how accurate this text mining actually is.
+ Anecdotally it seems to work well in some cases, but in others it produces
+ what can only be described as bogus records.</p>\n\n<h2>Finding errors</h2>\n\n<p>A
+ treatment in Plazi is a block of text (and sometimes illustrations) that refers
+ to a single taxon. Often that text will include a description of the taxon,
+ and list one or more specimens that have been examined. These lists of specimens
+ (\"material citations\") are one of the key bits of information that Plazi
+ extracts from a treatment as these citations get fed into GBIF as occurrences.</p>\n\n<p>To
+ help explore treatments I''ve constructed a simple web site that takes the
+ Plazi identifier for a treatment and displays that treatment with the material
+ citations highlighted. For example, for the Plazi treatment <a href=\"https://tb.plazi.org/GgServer/html/03B5A943FFBB6F02FE27EC94FABEEAE7\">03B5A943FFBB6F02FE27EC94FABEEAE7</a>
+ you can view the marked up version at <a href=\"https://plazi-tester.herokuapp.com/?uri=622F7788-F0A4-449D-814A-5B49CD20B228\">https://plazi-tester.herokuapp.com/?uri=622F7788-F0A4-449D-814A-5B49CD20B228</a>.
+ Below is an example of a material citation with its component parts tagged:</p>\n\n<div
+ class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-NIGuo9BggdA/YXaRQQrv0QI/AAAAAAAAgz4/SZDcA1jZSN47JMRTWDwSMRpHUShrCeOdgCPcBGAYYCw/s693/Screenshot%2B2021-10-24%2Bat%2B20.59.56.png\"
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
+ border=\"0\" width=\"400\" data-original-height=\"94\" data-original-width=\"693\"
+ src=\"https://1.bp.blogspot.com/-NIGuo9BggdA/YXaRQQrv0QI/AAAAAAAAgz4/SZDcA1jZSN47JMRTWDwSMRpHUShrCeOdgCPcBGAYYCw/s400/Screenshot%2B2021-10-24%2Bat%2B20.59.56.png\"/></a></div>\n\n<p>This
+ is an example where Plazi has successfully parsed the specimen. But I keep
+ coming across cases where specimens have not been parsed correctly, resulting
+ in issues such as single specimens being split into multiple records (e.g., <a
+ href=\"https://plazi-tester.herokuapp.com/?uri=5244B05EFFC8E20F7BC32056C178F496\">https://plazi-tester.herokuapp.com/?uri=5244B05EFFC8E20F7BC32056C178F496</a>),
+ geographical coordinates being misinterpreted (e.g., <a href=\"https://plazi-tester.herokuapp.com/?uri=0D228E6AFFC2FFEFFF4DE8118C4EE6B9\">https://plazi-tester.herokuapp.com/?uri=0D228E6AFFC2FFEFFF4DE8118C4EE6B9</a>),
+ or collector''s initials being confused with codes for natural history collections
+ (e.g., <a href=\"https://plazi-tester.herokuapp.com/?uri=252C87918B362C05FF20F8C5BFCB3D4E\">https://plazi-tester.herokuapp.com/?uri=252C87918B362C05FF20F8C5BFCB3D4E</a>).</p>\n\n<p>Parsing
+ specimens is a hard problem so it''s not unexpected to find errors. But they
+ do seem common enough to be easily found, which raises the question of just
+ what percentage of these material citations are correct? How much of the
+ data Plazi feeds to GBIF is correct? How would we know?</p>\n\n<h2>Systemic
+ problems</h2>\n\n<p>Some of the errors I''ve found concern the interpretation
+ of the parsed data. For example, it is striking that despite including marine
+ taxa <b>no</b> Plazi record has a value for depth below sea level (see <a
+ href=\"https://www.gbif.org/occurrence/map?depth=0,9999&publishing_org=7ce8aef0-9e92-11dc-8738-b8a03c50a862&advanced=1\">GBIF
+ search on depth range 0-9999 for Plazi</a>). But <a href=\"https://www.gbif.org/occurrence/map?elevation=0,9999&publishing_org=7ce8aef0-9e92-11dc-8738-b8a03c50a862&advanced=1\">many
+ records do have an elevation</a>, including records from marine environments.
+ Any record that has a depth value is interpreted by Plazi as being elevation,
+ so we have aerial crustacea and fish.</p>\n\n<h3>Map of Plazi records with
+ depth 0-9999m</h3>\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-GD4pPtPCxVc/YXaRQ9bdn1I/AAAAAAAAgz8/A9YsypSvHfwWKAjDxSdeFVUkou88LGItACPcBGAYYCw/s673/Screenshot%2B2021-10-25%2Bat%2B12.03.51.png\"
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
+ border=\"0\" width=\"400\" data-original-height=\"258\" data-original-width=\"673\"
+ src=\"https://1.bp.blogspot.com/-GD4pPtPCxVc/YXaRQ9bdn1I/AAAAAAAAgz8/A9YsypSvHfwWKAjDxSdeFVUkou88LGItACPcBGAYYCw/s320/Screenshot%2B2021-10-25%2Bat%2B12.03.51.png\"/></a></div>\n\n<h3>Map
+ of Plazi records with elevation 0-9999m </h3>\n<div class=\"separator\" style=\"clear:
+ both;\"><a href=\"https://1.bp.blogspot.com/-BReSHtXTOkA/YXaRRKFW7dI/AAAAAAAAg0A/-FcBkMwyswIp0siWGVX3MNMANs7UkZFtwCPcBGAYYCw/s675/Screenshot%2B2021-10-25%2Bat%2B12.04.06.png\"
+ style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
+ border=\"0\" width=\"400\" data-original-height=\"256\" data-original-width=\"675\"
+ src=\"https://1.bp.blogspot.com/-BReSHtXTOkA/YXaRRKFW7dI/AAAAAAAAg0A/-FcBkMwyswIp0siWGVX3MNMANs7UkZFtwCPcBGAYYCw/s320/Screenshot%2B2021-10-25%2Bat%2B12.04.06.png\"/></a></div>\n\n<p>Anecdotally
+ I''ve also noticed that Plazi seems to do well on zoological data, especially
+ journals like <i>Zootaxa</i>, but it often struggles with botanical specimens.
+ Botanists tend to cite specimens rather differently to zoologists (botanists
+ emphasise collector numbers rather than specimen codes). Hence data quality
+ in Plazi is likely to be taxonomically biased.</p>\n\n<p>Plazi is <a href=\"https://github.com/plazi/community/issues\">using
+ GitHub to track issues with treatments</a> so feedback on erroneous records
+ is possible, but this seems inadequate to the task. There are tens of thousands
+ of data sets, with more being released daily, and hundreds of thousands of
+ occurrences, and relying on GitHub issues devolves the responsibility for
+ error checking onto the data users. I don''t have a measure of how many records
+ in Plazi have problems, but I suspect it is a significant fraction,
+ because for any given day''s output I can typically find errors.</p>\n\n<h2>What
+ to do?</h2>\n\n<p>Faced with a process that generates noisy data there are
+ several things we could do:</p>\n\n<ol>\n<li>Have tools to detect and flag
+ errors made in generating the data.</li>\n<li>Have the data generator give
+ estimates of the confidence of its results.</li>\n<li>Improve the data generator.</li>\n</ol>\n\n<p>I
+ think a comparison with the problem of parsing bibliographic references might
+ be instructive here. There is a long history of people developing tools to
+ parse references (<a href=\"https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html\">I''ve
+ even had a go</a>). State-of-the art tools such as <a href=\"https://anystyle.io\">AnyStyle</a>
+ feature machine learning, and are tested against <a href=\"https://github.com/inukshuk/anystyle/blob/master/res/parser/core.xml\">human
+ curated datasets</a> of tagged bibliographic records. This means we can evaluate
+ the performance of a method (how well does it retrieve the same results as
+ human experts?) and also improve the method by expanding the corpus of training
+ data. Some of these tools can provide a measure of how confident they are
+ when classifying a string as, say, a person''s name, which means we could
+ flag potential issues for anyone wanting to use that record.</p>\n\n<p>We
+ don''t have equivalent tools for parsing specimens in the literature, and
+ hence have no easy way to quantify how good existing methods are, nor do we
+ have a public corpus of material citations that we can use as training data.
+ I <a href=\"https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html\">blogged
+ about this</a> a few months ago and was considering using Plazi as a source
+ of marked up specimen data to use for training. However based on what I''ve
+ looked at so far Plazi''s data would need to be carefully scrutinised before
+ it could be used as training data.</p>\n\n<p>Going forward, I think it would
+ be desirable to have a set of records that can be used to benchmark specimen
+ parsers, and ideally have the parsers themselves available as web services
+ so that anyone can evaluate them. Even better would be a way to contribute
+ to the training data so that these tools improve over time.</p>\n\n<p>Plazi''s
+ data extraction tools are mostly desktop-based, that is, you need to download
+ software to use their methods. However, there are experimental web services
+ available as well. I''ve created a simple wrapper around the material citation
+ parser, you can try it at <a href=\"https://plazi-tester.herokuapp.com/parser.php\">https://plazi-tester.herokuapp.com/parser.php</a>.
+ It takes a single material citation and returns a version with elements such
+ as specimen code and collector name tagged in different colours.</p>\n\n<h2>Summary</h2>\n\n<p>Text
+ mining the taxonomic literature is clearly a gold mine of data, but at the
+ same time it is potentially fraught as we try and extract structured data
+ from semi-structured text. Plazi has demonstrated that it is possible to extract
+ a lot of data from the literature, but at the same time the quality of that
+ data seems highly variable. Even minor issues in parsing text can have big
+ implications for data quality (e.g., marine organisms apparently living above
+ sea level). Historically in biodiversity informatics we have favoured data
+ quantity over data quality. Quantity has an obvious metric, and has milestones
+ we can celebrate (e.g., <a href=\"GBIF at 1 billion - what''s next?\">one
+ billion specimens</a>). There aren''t really any equivalent metrics for data
+ quality.</p>\n\n<p>Adding new types of data can sometimes initially result
+ in a new set of quality issues (e.g., <a href=\"https://iphylo.blogspot.com/2019/12/gbif-metagenomics-and-metacrap.html\">GBIF
+ metagenomics and metacrap</a>) that take time to resolve. In the case of Plazi,
+ I think it would be worthwhile to quantify just how many records have errors,
+ and develop benchmarks that we can use to test methods for extracting specimen
+ data from text. If we don''t do this then there will remain uncertainty as
+ to how much trust we can place in data mined from the taxonomic literature.</p>\n\n<h2>Update</h2>\n\nPlazi
+ has responded, see <a href=\"http://plazi.org/posts/2021/10/liberation-first-step-toward-quality/\">Liberating
+ material citations as a first step to more better data</a>. My reading of
+ their response is that it essentially just reiterates Plazi''s approach and
+ doesn''t tackle the underlying issue: their method for extracting material
+ citations is error prone, and many of those errors end up in GBIF.","tags":["data
+ quality","parsing","Plazi","specimen","text mining"],"language":"en","references":null}]}'
+ recorded_at: Thu, 15 Jun 2023 20:39:01 GMT
+ recorded_with: VCR 6.1.0