commonmeta-ruby 3.2.11 → 3.2.13
- checksums.yaml +4 -4
- data/Gemfile.lock +1 -1
- data/bin/commonmeta +1 -1
- data/lib/commonmeta/cli.rb +8 -2
- data/lib/commonmeta/readers/json_feed_reader.rb +4 -4
- data/lib/commonmeta/utils.rb +4 -0
- data/lib/commonmeta/version.rb +1 -1
- data/spec/cli_spec.rb +9 -0
- data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/json_feed/json_feed_by_blog.yml +1791 -0
- data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/json_feed/json_feed_not_indexed.yml +2155 -0
- data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/json_feed/json_feed_unregistered.yml +2010 -0
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/by_blog_id.yml +1791 -0
- data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/not_indexed_posts.yml +2155 -0
- data/spec/readers/json_feed_reader_spec.rb +7 -3
- metadata +7 -2
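The new spec fixtures are VCR cassettes: recorded HTTP responses from the Rogue Scholar API that let the specs run offline. As a minimal sketch (not the commonmeta-ruby reader implementation), here is how such a cassette can be parsed with only the Ruby standard library. The embedded cassette is abridged from the recorded `https://rogue-scholar.org/api/blogs/tyfqw20` response in the diff below, keeping just a few of its fields.

```ruby
require "yaml"
require "json"

# Abridged VCR cassette with the same structure as the full recording
# in the diff below (http_interactions -> response -> body -> string).
cassette = <<~YAML
  ---
  http_interactions:
  - request:
      method: get
      uri: https://rogue-scholar.org/api/blogs/tyfqw20
    response:
      status:
        code: 200
        message: OK
      body:
        encoding: UTF-8
        string: '{"id":"tyfqw20","title":"iPhylo","prefix":"10.59350","items":[{"id":"https://doi.org/10.59350/btdk4-42879"}]}'
YAML

# The recorded JSON feed is stored as a YAML string scalar; parse it back out.
interaction = YAML.safe_load(cassette)["http_interactions"].first
blog = JSON.parse(interaction["response"]["body"]["string"])

puts blog["id"]                  # prints "tyfqw20"
puts blog["prefix"]              # prints "10.59350"
puts blog["items"].first["id"]   # prints the first post's DOI URL
```

Note that long response bodies are wrapped by the YAML serializer across many indented continuation lines, which is why the recorded JSON in the hunk below spans hundreds of diff lines.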
@@ -0,0 +1,1791 @@
+---
+http_interactions:
+- request:
+    method: get
+    uri: https://rogue-scholar.org/api/blogs/tyfqw20
+    body:
+      encoding: UTF-8
+      string: ''
+    headers:
+      Connection:
+      - close
+      Host:
+      - rogue-scholar.org
+      User-Agent:
+      - http.rb/5.1.1
+  response:
+    status:
+      code: 200
+      message: OK
+    headers:
+      Age:
+      - '0'
+      Cache-Control:
+      - public, max-age=0, must-revalidate
+      Content-Length:
+      - '162607'
+      Content-Type:
+      - application/json; charset=utf-8
+      Date:
+      - Thu, 15 Jun 2023 20:47:10 GMT
+      Etag:
+      - '"6w7me0q1i23h72"'
+      Server:
+      - Vercel
+      Strict-Transport-Security:
+      - max-age=63072000
+      X-Matched-Path:
+      - "/api/blogs/[slug]"
+      X-Vercel-Cache:
+      - MISS
+      X-Vercel-Id:
+      - fra1::iad1::95rf9-1686862029064-f08aa4b0d0a5
+      Connection:
+      - close
+    body:
+      encoding: UTF-8
+      string: '{"id":"tyfqw20","title":"iPhylo","description":"Rants, raves (and occasionally
+        considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For
+        more ranty and less considered opinions, see my <a href=\"https://twitter.com/rdmpage\">Twitter
+        feed</a>.<br>ISSN 2051-8188. Written content on this site is licensed under
+        a <a href=\"https://creativecommons.org/licenses/by/4.0/\">Creative Commons
+        Attribution 4.0 International license</a>.","language":"en","favicon":null,"feed_url":"https://iphylo.blogspot.com/feeds/posts/default?alt=rss","feed_format":"application/rss+xml","home_page_url":"https://iphylo.blogspot.com/","indexed_at":"2023-02-06","modified_at":"2023-05-31T17:26:00+00:00","license":"https://creativecommons.org/licenses/by/4.0/legalcode","generator":"Blogger","category":"Natural
+        Sciences","backlog":true,"prefix":"10.59350","items":[{"id":"https://doi.org/10.59350/btdk4-42879","uuid":"3e1278f6-e7c0-43e1-bb54-6829e1344c0d","url":"https://iphylo.blogspot.com/2022/09/the-ideal-taxonomic-journal.html","title":"The
+        ideal taxonomic journal","summary":"This is just some random notes on an “ideal”
+        taxonomic journal, inspired in part by some recent discussions on “turbo-taxonomy”
+        (e.g., https://doi.org/10.3897/zookeys.1087.76720 and https://doi.org/10.1186/1742-9994-10-15),
+        and also examples such as the Australian Journal of Taxonomy https://doi.org/10.54102/ajt.qxi3r
+        which seems well-intentioned but limited. XML One approach is to have highly
+        structured text that embeds detailed markup, and ideally a tool that generates
+        markup in XML. This is...","date_published":"2022-09-29T14:00:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
+        (Roderic Page)"}],"image":null,"content_html":"<p>This is just some random
+        notes on an “ideal” taxonomic journal, inspired in part by some recent discussions
+        on “turbo-taxonomy” (e.g., <a href=\"https://doi.org/10.3897/zookeys.1087.76720\">https://doi.org/10.3897/zookeys.1087.76720</a>
+        and <a href=\"https://doi.org/10.1186/1742-9994-10-15\">https://doi.org/10.1186/1742-9994-10-15</a>),
+        and also examples such as the Australian Journal of Taxonomy <a href=\"https://doi.org/10.54102/ajt.qxi3r\">https://doi.org/10.54102/ajt.qxi3r</a>
+        which seems well-intentioned but limited.</p>\n<h2 id=\"xml\">XML</h2>\n<p>One
+        approach is to have highly structured text that embeds detailed markup, and
+        ideally a tool that generates markup in XML. This is the approach taken by
+        Pensoft. There is an inevitable trade-off between the burden on authors of
+        marking up text versus making the paper machine readable. In some ways this
+        seems misplaced effort given that there is little evidence that publications
+        by themselves have much value (see <a href=\"https://iphylo.blogspot.com/2021/12/the-business-of-extracting-knowledge.html\">The
+        Business of Extracting Knowledge from Academic Publications</a>). “Value”
+        in this case means as a source of data or factual statements that we can compute
+        over. Human-readable text is not a good way to convey this sort of information.</p>\n<p>It’s
+        also interesting that many editing tools are going in the opposite direction,
+        for example there are minimalist tools using <a href=\"https://en.wikipedia.org/wiki/Markdown\">Markdown</a>
+        where the goal is to <em>get out of the author’s way</em>, rather than impose
+        a way of writing. Text is written by humans for humans, so the tools should
+        be human-friendly.</p>\n<p>The idea of publishing using XML is attractive
+        in that it gives you XML that can be archived by, say, PubMed Central, but
+        other than that the value seems limited. A cursory glance at download stats
+        for journals that provide PDF and XML downloads, such as <em>PLoS One</em>
+        and <em>ZooKeys</em>, PDF is by far the more popular format. So arguably there
+        is little value in providing XML. Those who have tried to use JATS-XML as
+        an authoring tool have not had a happy time: <a href=\"https://doi.org/10.7557/15.5517\">How
+        we tried to JATS XML</a>. However, there are various tools to help with the
+        process, such as <a href=\"https://github.com/Vitaliy-1/docxToJats\">docxToJats</a>,<br>\ntexture,
+        and <a href=\"https://github.com/elifesciences/jats-xml-to-pdf\">jats-xml-to-pdf</a>
+        if this is the route one wants to take.</p>\n<h2 id=\"automating-writing-manuscripts\">Automating
+        writing manuscripts</h2>\n<p>The dream, of course, is to have a tool where
+        you store all your taxonomic data (literature, specimens, characters, images,
+        sequences, media files, etc.) and at the click of a button generate a paper.
+        Certainly some of this can be automated, much nomenclatural and specimen information
+        could be converted to human-readable text. Ideally this computer-generated
+        text would not be edited (otherwise it could get out of sync with the underlying
+        data). The text should be <a href=\"https://en.wikipedia.org/wiki/Transclusion\">transcluded</a>.
+        As an aside, one way to do this would be to include things such as lists of
+        material examined as images rather than text while the manuscript is being
+        edited. In the same way that you (probably) wouldn’t edit a photograph within
+        your text editor, you shouldn’t be editing data. When the manuscript is published
+        the data-generated portions can then be output as text.</p>\n<p>Of course
+        all of this assumes that we have taxonomic data in a database (or some other
+        storage format, including plain text and Mark-down, e.g. <a href=\"https://iphylo.blogspot.com/2022/04/obsidian-markdown-and-taxonomic-trees.html\">Obsidian,
+        markdown, and taxonomic trees</a>) that can generate outputs in the various
+        formats that we need.</p>\n<h2 id=\"archiving-data-and-images\">Archiving
+        data and images</h2>\n<p>One of the really nice things that <a href=\"http://plazi.org\">Plazi</a>
+        do is have a pipeline that sends taxonomic descriptions and images to Zenodo,
+        and similar data to GBIF. Any taxonomic journal should be able to do this.
+        Indeed, arguably each taxonomic treatment within the paper should be linked
+        to the Zenodo DOI at the time of publication. Indeed, we could imagine ultimately
+        having treatments as transclusions within the larger manuscript. Alternatively
+        we could store the treatments as parts of the larger article (rather like
+        chapters in a book), each with a CrossRef DOI. I’m still sceptical about whether
+        these treatments are as important as we make out, see <a href=\"https://iphylo.blogspot.com/2022/09/does-anyone-cite-taxonomic-treatments.html\">Does
+        anyone cite taxonomic treatments?</a>. But having machine-readable taxonomic
+        data archived and accessible is a good thing. Uploading the same data to GBIF
+        makes much of that data immediately accessible. Now that GBIF offers <a href=\"https://www.gbif.org/composition/3kQFinjwHbCGZeLb5OhwN2/gbif-hosted-portals\">hosted
+        portals</a> there is the possibility of having custom interfaces to data from
+        a particular journal.</p>\n<h2 id=\"name-and-identifier-registration\">Name
+        and identifier registration</h2>\n<p>We would also want automatic registration
+        of new taxonomic names, for which there are pipelines (see “A common registration-to-publication
+        automated pipeline for nomenclatural acts for higher plants (International
+        Plant Names Index, IPNI), fungi (Index Fungorum, MycoBank) and animals (ZooBank)”
+        <a href=\"https://doi.org/10.3897/zookeys.550.9551\">https://doi.org/10.3897/zookeys.550.9551</a>).
+        These pipelines do not seem to be documented in much detail, and the data
+        formats differ across registration agencies (e.g., IPNI and ZooBank). For
+        example, ZooBank seems to require TaxPub XML.</p>\n<p>Registration of names
+        and identifiers, especially across multiple registration agencies (ZooBank,
+        CrossRef, DataCite, etc.) requires some coordination, especially when one
+        registration agency requires identifiers from another.</p>\n<h2 id=\"summary\">Summary</h2>\n<p>If
+        data is key, then the taxonomic paper itself becomes something of a wrapper
+        around that data. It still serves the function of being human-readable, providing
+        broader context for the work, and as an archive that conforms to currently
+        accepted ways to publish taxonomic names. But in some ways it is the last
+        interesting part of the process.</p>\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/37y2z-gre70","uuid":"f3629c86-06e0-42c0-844a-266b03a91ef1","url":"https://iphylo.blogspot.com/2023/05/ten-years-and-million-links.html","title":"Ten
+        years and a million links","summary":"As trailed on a Twitter thread last
+        week I’ve been working on a manuscript describing the efforts to map taxonomic
+        names to their original descriptions in the taxonomic literature. Putting
+        together a manuscript on linking taxonomic names to the primary literature,
+        basically “um, what, exactly, have you been doing all these years?”. TL;DR
+        Across fungi, plants, and animals approx 1.3 million names have been linked
+        to a persistent identifier for a publication.— Roderic Page (@rdmpage) May
+        25,...","date_published":"2023-05-31T17:26:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
+        (Roderic Page)"}],"image":null,"content_html":"<p>As trailed on a Twitter
+        thread last week I’ve been working on a manuscript describing the efforts
+        to map taxonomic names to their original descriptions in the taxonomic literature.</p>\n<blockquote
+        class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">Putting together a manuscript
+        on linking taxonomic names to the primary literature, basically “um, what,
+        exactly, have you been doing all these years?”. TL;DR Across fungi, plants,
+        and animals approx 1.3 million names have been linked to a persistent identifier
+        for a publication.</p>— Roderic Page (@rdmpage) <a href=\"https://twitter.com/rdmpage/status/1661714128413573120?ref_src=twsrc%5Etfw\">May
+        25, 2023</a></blockquote> \n<p>The preprint is on bioRxiv <a href=\"https://doi.org/10.1101/2023.05.29.542697\">doi:10.1101/2023.05.29.542697</a></p>\n<blockquote>\n<p>A
+        major gap in the biodiversity knowledge graph is a connection between taxonomic
+        names and the taxonomic literature. While both names and publications often
+        have persistent identifiers (PIDs), such as Life Science Identifiers (LSIDs)
+        or Digital Object Identifiers (DOIs), LSIDs for names are rarely linked to
+        DOIs for publications. This article describes efforts to make those connections
+        across three large taxonomic databases: Index Fungorum, International Plant
+        Names Index (IPNI), and the Index of Organism Names (ION). Over a million
+        names have been matched to DOIs or other persistent identifiers for taxonomic
+        publications. This represents approximately 36% of names for which publication
+        data is available. The mappings between LSIDs and publication PIDs are made
+        available through ChecklistBank. Applications of this mapping are discussed,
+        including a web app to locate the citation of a taxonomic name, and a knowledge
+        graph that uses data on researcher’s ORCID ids to connect taxonomic names
+        and publications to authors of those names.</p>\n</blockquote>\n<p>Much of
+        the work has been linking taxa to names, which still has huge gaps. There
+        are also interesting differences in coverage between plants, animals, and
+        fungi (see preprint for details).</p>\n\n<div class=\"separator\" style=\"clear:
+        both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdWsSQhqi1DErXMIHm28g37-fiALNIsI5eQZmvoX_Fe03ZSwtKHbYt-LCsCCAUop0AGcwy_w7NpIjylVH1hNrM9oW-6j9e6tHASha49TTqFvDg2_tEx3r74RRFsjUo4M_Qat8NmKaZSChOt2hI3LsMjTVLrEVirEckU-9Ei7ug-7OHQlR4LA/s2276/animals-coverage.png\"
+        style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
+        border=\"0\" width=\"320\" data-original-height=\"2276\" data-original-width=\"2276\"
+        src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdWsSQhqi1DErXMIHm28g37-fiALNIsI5eQZmvoX_Fe03ZSwtKHbYt-LCsCCAUop0AGcwy_w7NpIjylVH1hNrM9oW-6j9e6tHASha49TTqFvDg2_tEx3r74RRFsjUo4M_Qat8NmKaZSChOt2hI3LsMjTVLrEVirEckU-9Ei7ug-7OHQlR4LA/s320/animals-coverage.png\"/></a></div><div
+        class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdyxlVJ-oyMCNPmHtHWjSxdxMSJvgzdWRGRF6Ad4dk7ab7gGDpuKdKmS9XhROkopw361ylfsTd1ZkwkF6BN0JlWNnVLCKY1AfryCfWKHkgPQM7u-0SELW9j8RlQIflb6ibaV64gwW7oJrEvOGECvR51F8EW8cRg-1usW-GBM5ymObj7zlObQ/s2276/fungi-coverage.png\"
+        style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
+        border=\"0\" width=\"320\" data-original-height=\"2276\" data-original-width=\"2276\"
+        src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdyxlVJ-oyMCNPmHtHWjSxdxMSJvgzdWRGRF6Ad4dk7ab7gGDpuKdKmS9XhROkopw361ylfsTd1ZkwkF6BN0JlWNnVLCKY1AfryCfWKHkgPQM7u-0SELW9j8RlQIflb6ibaV64gwW7oJrEvOGECvR51F8EW8cRg-1usW-GBM5ymObj7zlObQ/s320/fungi-coverage.png\"/></a></div><div
+        class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgf0YBuvNSXWAJTfQ1jk4XSocMzCYHP7t6IPUqhjQ3mftgM_850igWaD2copgNH6Xk6T62xBU641wvwOvXgCCDY3m2xC_gaILXO9RGx8H3Gpy5OOncsLb9smpT2LIgtYOExVBVdDRWqA0AZ8-mQjWL7dL5TiG7MqVu8spT8ACoGOPR_T36hRA/s2276/plants-coverage.png\"
+        style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
+        border=\"0\" width=\"320\" data-original-height=\"2276\" data-original-width=\"2276\"
+        src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgf0YBuvNSXWAJTfQ1jk4XSocMzCYHP7t6IPUqhjQ3mftgM_850igWaD2copgNH6Xk6T62xBU641wvwOvXgCCDY3m2xC_gaILXO9RGx8H3Gpy5OOncsLb9smpT2LIgtYOExVBVdDRWqA0AZ8-mQjWL7dL5TiG7MqVu8spT8ACoGOPR_T36hRA/s320/plants-coverage.png\"/></a></div>\n\n\nThere
+        is also a simple app to demonstrate these links, see <a href=\"https://species-cite.herokuapp.com\">https://species-cite.herokuapp.com</a>.\n\n\n\n<blockquote>\n<p>Written
+        with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/92rdb-5fe58","uuid":"d33d4f49-b281-4997-9eb9-dbad1e52d9bd","url":"https://iphylo.blogspot.com/2022/09/local-global-identifiers-for.html","title":"Local
+        global identifiers for decentralised wikis","summary":"I''ve been thinking
+        a bit about how one could use a Markdown wiki-like tool such as Obsidian to
+        work with taxonomic data (see earlier posts Obsidian, markdown, and taxonomic
+        trees and Personal knowledge graphs: Obsidian, Roam, Wikidata, and Xanadu).
+        One \"gotcha\" would be how to name pages. If we treat the database as entirely
+        local, then the page names don''t matter, but what if we envisage sharing
+        the database, or merging it with others (for example, if we divided a taxon
+        up into chunks, and...","date_published":"2022-09-08T16:09:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
+        (Roderic Page)"}],"image":null,"content_html":"<p>I''ve been thinking a bit
+        about how one could use a Markdown wiki-like tool such as Obsidian to work
+        with taxonomic data (see earlier posts <a href=\"https://iphylo.blogspot.com/2022/04/obsidian-markdown-and-taxonomic-trees.html\">Obsidian,
+        markdown, and taxonomic trees</a> and <a href=\"https://iphylo.blogspot.com/2020/08/personal-knowledge-graphs-obsidian-roam.html\">Personal
+        knowledge graphs: Obsidian, Roam, Wikidata, and Xanadu</a>).</p>\n\n<p>One
+        \"gotcha\" would be how to name pages. If we treat the database as entirely
+        local, then the page names don''t matter, but what if we envisage sharing
+        the database, or merging it with others (for example, if we divided a taxon
+        up into chunks, and different people worked on those different chunks)? </p>\n\n<p>This
+        is the attraction of globally unique identifiers. You and I can independently
+        work on the same thing, such as data linked to scientific paper, safe in the
+        knowledge that if we both use the DOI for that paper we can easily combine
+        what we''ve done. But global identifiers can also be a pain, especially if
+        we need to use a service to look them up (\"is there a DOI for this paper?\",
+        \"what is the LSID for this taxonomic name?\").</p>\n\n<p>Life would be easier
+        if we could generate identifiers \"locally\", but had some assurance that
+        they would be globally unique, and that anyone else generating an identifier
+        for the same thing would arrive at the same identifier (this eliminates things
+        such as <a href=\"https://en.wikipedia.org/wiki/Universally_unique_identifier\">UUIDs</a>
+        which are intentionally designed to prvent people genrrating the same identifier).
+        One approach is \"content addressing\" (see, e.g. <a href=\"https://web.archive.org/web/20210514054054/https://bentrask.com/notes/content-addressing.html\">Principles
+        of Content Addressing</a> - dead link but in the Wayabck Machine, see also
+        <a href=\"https://github.com/btrask/stronglink\">btrask/stronglink</a>). For
+        example, we can generate a cryptographic hash of a file (such as a PDF) and
+        use that as the identifier.</p>\n\n<p>Now the problem is that we have globally
+        unique, but ugly and unfriendly identifiers (such as \"6c98136eba9084ea9a5fc0b7693fed8648014505\").
+        What we need are nice, easy to use identifiers we can use as page names. <a
+        href=\"https://species.wikimedia.org/wiki/Main_Page\">Wikispecies</a> serves
+        as a possible role model, where taxon names serve as page names, as do simplified
+        citations (e.g., authors and years). This model runs into the problem that
+        taxon names aren''t unique, nor are author + year combinations. In Wikispecies
+        this is resolved by having a centralised database where it''s first come,
+        first served. If there is a name clash you have to create a new name for your
+        page. This works, but what if you have multiple databases un by different
+        people? How do we ensure the identifiers are the same?</p>\n\n<p>Then I remembered
+        Roger Hyam''s flight of fantasy over a decade ago: <a href=\"http://www.hyam.net/blog/archives/1007\">SpeciesIndex.org
+        – an impractical, practical solution</a>. He proposed the following rules
+        to generate a unique URI for a taxonomic name:\n\n<ul>\n <li>The URI must
+        start with \"http://speciesindex.org\" followed by one or more of the following
+        separated by slashes.</li>\n\n <li>First word of name. Must only contain
+        letters. Must not be the same as one of the names of the nomenclatural codes
+        (icbn or iczn). Optional but highly recommended.</li> \n\n <li>Second word
+        of name. Must only contain letters and not be a nomenclatural code name. Optional.</li>
+        \n\n <li>Third word of name. Must only contain letters and not be a nomenclatural
+        code name. Optional.</li> \n\n <li>Year of publication. Must be an integer
+        greater than 1650 and equal to or less than the current year. If this is an
+        ICZN name then this should be the year the species (epithet) was published
+        as is commonly cited after the name. If this is an ICBN name at species or
+        below then it is the date of the combination. Optional. Recommended for zoological
+        names if known. Not recommended for botanical names unless there is a known
+        problem with homonyms in use by non-taxonomists.</li>\n \n<li>Nomenclatural
+        code governing the name of the taxon. Currently this must be either ''icbn''
+        or ''iczn''. This may be omitted if the code is unknown or not relevant. Other
+        codes may be added to this list.</li> \n <li>Qualifier This must be a Version
+        4 RFC-4122 UUID. Optional. Used to generate a new independent identifier for
+        a taxon for which the conventional name is unknown or does not exist or to
+        indicate a particular taxon concept that bears the embedded name.</li>\n\n <li>The
+        whole speciesindex.org URI string should be considered case\nsensitive. Everything
+        should be lower case apart from the first letter of words that are specified
+        as having upper case in their relevant codes e.g. names at and above the rank
+        of genus.</li>\n</ul>\n</p>\n\n<p>Roger is basically arging that while names
+        aren''t unique (i.e., we have homonyms such as <i>Abronia</i>) they are pretty
+        close to being so, and with a few tweaks we can come up with a unique representation.
+        Another way to think about this if we had a database of all taxonomics, we
+        could construct a <a href=\"https://en.wikipedia.org/wiki/Trie\">trie</a>
+        and for each name find the shortest set of name parts (genus, species, etc),
+        year, and code that gave us a unique string for that name. In many cases the
+        species name may be all we need, in other cases we may need to add year and/or
+        nomenclatural code to arrive at a unique string. \n\n</p>\n\n<p>What about
+        bibliographic references? Well many of us will have databases (e.g., Endnote,
+        Mendeley, Zotero, etc.) which generate \"cite keys\". These are typically
+        short, memorable identifiers for a reference that are unique within that database.
+        There is an interesting discussion on the <a href=\"https://discourse.jabref.org/t/universal-citekey-generator/2441/2\">JabRef
+        forum</a> regarding a \"Universal Citekey Generator\", and source code is
+        available <a href=\"https://github.com/cparnot/universal-citekey-js\">cparnot/universal-citekey-js</a>.
+        I''ve yet to explore this in detail, but it looks a promising way to generate
+        unique identifiers from basic metadata (echos of more elaborate schemes such
+        as <a href=\"https://en.wikipedia.org/wiki/Serial_Item_and_Contribution_Identifier\">SICIs</a>).
+        For example,\n\n<blockquote>Senna AR, Guedes UN, Andrade LF, Pereira-Filho
+        GH. 2021. A new species of amphipod Pariphinotus Kunkel, 1910 (Amphipoda:
+        Phliantidae) from Southwestern Atlantic. Zool Stud 60:57. doi:10.6620/ZS.2021.60-57.</blockquote>\n\nbecomes
+        \"Senna:2021ck\". So if two people have the same, core, metadata for a paper
+        they can generate the same key.</p>\n\n<p>Hence it seems with a few conventions
+        (and maybe some simple tools to support them) we could have decentralised
+        wiki-like tools that used the same identifiers for the same things, and yet
+        those identfiiers were short and human-friendly.</p>","tags":["citekey","identfiiers","markdown","obsidian","Roger
+        Hyam"],"language":"en","references":null},{"id":"https://doi.org/10.59350/j77nc-e8x98","uuid":"c6b101f4-bfbc-4d01-921d-805c43c85757","url":"https://iphylo.blogspot.com/2022/08/linking-taxonomic-names-to-literature.html","title":"Linking
+        taxonomic names to the literature","summary":"Just some thoughts as I work
+        through some datasets linking taxonomic names to the literature. In the diagram
+        above I''ve tried to capture the different situatios I encounter. Much of
+        the work I''ve done on this has focussed on case 1 in the diagram: I want
+        to link a taxonomic name to an identifier for the work in which that name
+        was published. In practise this means linking names to DOIs. This has the
+        advantage of linking to a citable indentifier, raising questions such as whether
+        citations...","date_published":"2022-08-22T17:19:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
+        (Roderic Page)"}],"image":null,"content_html":"Just some thoughts as I work
+        through some datasets linking taxonomic names to the literature.\n\n<div class=\"separator\"
+        style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5ZkbnumEWKZpf0isei_hUJucFlOOK08-SKnJknD8B4qhAX36u-vMQRnZdhRJK5rPb7HNYcnqB7qE4agbeStqyzMkWHrzUj14gPkz2ohmbOVg8P_nHo0hM94PD1wH3SPJsaLAumN8vih3ch9pjH2RaVWZBLwwhGhNu0FS1m5z6j5xt2NeZ4w/s2140/linking%20to%20names144.jpg\"
+        style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
+        border=\"0\" height=\"600\" data-original-height=\"2140\" data-original-width=\"1604\"
+        src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5ZkbnumEWKZpf0isei_hUJucFlOOK08-SKnJknD8B4qhAX36u-vMQRnZdhRJK5rPb7HNYcnqB7qE4agbeStqyzMkWHrzUj14gPkz2ohmbOVg8P_nHo0hM94PD1wH3SPJsaLAumN8vih3ch9pjH2RaVWZBLwwhGhNu0FS1m5z6j5xt2NeZ4w/s600/linking%20to%20names144.jpg\"/></a></div>\n\n<p>In
+        the diagram above I''ve tried to capture the different situatios I encounter.
+        Much of the work I''ve done on this has focussed on case 1 in the diagram:
+        I want to link a taxonomic name to an identifier for the work in which that
+        name was published. In practise this means linking names to DOIs. This has
+        the advantage of linking to a citable indentifier, raising questions such
+        as whether citations of taxonmic papers by taxonomic databases could become
+        part of a <a href=\"https://iphylo.blogspot.com/2022/08/papers-citing-data-that-cite-papers.html\">taxonomist''s
+        Google Scholar profile</a>.</p>\n\n<p>In many taxonomic databases full work-level
+        citations are not the norm, instead taxonomists cite one or more pages within
+        a work that are relevant to a taxonomic name. These \"microcitations\" (what
+        the U.S. legal profession refer to as \"point citations\" or \"pincites\", see
+        <a href=\"https://rasmussen.libanswers.com/faq/283203\">What are pincites,
+        pinpoints, or jump legal references?</a>) require some work to map to the
+        work itself (which is typically the thing that has a citatble identifier such
+        as a DOI).</p>\n\n<p>Microcitations (case 2 in the diagram above) can be quite
+        complex. Some might simply mention a single page, but others might list a
+        series of (not necessarily contiguous) pages, as well as figures, plates etc.
+        Converting these to citable identifiers can be tricky, especially as in most
+        cases we don''t have page-level identifiers. The Biodiversity Heritage Library
+        (BHL) does have URLs for each scanned page, and we have a standard for referring
+        to pages in a PDF (<code>page=<pageNum></code>, see <a href=\"https://datatracker.ietf.org/doc/html/rfc8118\">RFC
+        8118</a>). But how do we refer to a set of pages? Do we pick the first page?
+        Do we try and represent a set of pages, and if so, how?</p>\n\n<p>Another
+        issue with page-level identifiers is that not everything on a given page may
+        be relevant to the taxonomic name. In case 2 above I''ve shaded in the parts
+        of the pages and figure that refer to the taxonomic name. An example where
+        this can be problematic is the recent test case I created for BHL where a
+        page image was included for the taxonomic name <a href=\"https://www.gbif.org/species/195763322\"><i>Aphrophora
+        impressa</i></a>. The image includes the species description and a illustration,
+        as well as text that relates to other species.</p>\n\n<div class=\"separator\"
+        style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1SAv1XiVBPHXeMLPfdsnTh4Sj4m0AQyqjXM0faXyvbNhBXFDl7SawjFIkMo3qz4RQhDEyZXkKAh5nF4gtfHbVXA6cK3NJ46CuIWECC6HmJKtTjoZ0M3QQXvLY_X-2-RecjBqEy68M0cEr0ph3l6KY51kA9BvGt9d4id314P71PBitpWMATg/s3467/https---www.biodiversitylibrary.org-pageimage-29138463.jpeg\"
+        style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
+        border=\"0\" height=\"400\" data-original-height=\"3467\" data-original-width=\"2106\"
+        src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1SAv1XiVBPHXeMLPfdsnTh4Sj4m0AQyqjXM0faXyvbNhBXFDl7SawjFIkMo3qz4RQhDEyZXkKAh5nF4gtfHbVXA6cK3NJ46CuIWECC6HmJKtTjoZ0M3QQXvLY_X-2-RecjBqEy68M0cEr0ph3l6KY51kA9BvGt9d4id314P71PBitpWMATg/s400/https---www.biodiversitylibrary.org-pageimage-29138463.jpeg\"/></a></div>\n\n<p>Given
+        that not everything on a page need be relevant, we could extract just the
+        relevant blocks of text and illustrations (e.g., paragraphs of text, panels
+        within a figure, etc.) and treat that set of elements as the thing to cite.
+        This is, of course, what <a href=\"http://plazi.org\">Plazi</a> are doing.
+        The set of extracted blocks is glued together as a \"treatment\", assigned
+        an identifier (often a DOI), and treated as a citable unit. It would be interesting
+        to see to what extent these treatments are actually cited, for example, do
+        subsequent revisions that cite work that include treatments cite those treatments,
+        or just the work itself? Put another way, are we creating <a href=\"https://iphylo.blogspot.com/2012/09/decoding-nature-encode-ipad-app-omg-it.html\">\"threads\"</a>
+        between taxonomic revisions?</p>\n\n<p>One reason for these notes is that
+        I''m exploring uploading taxonomic name - literature links to <a href=\"https://www.checklistbank.org\">ChecklistBank</a>
+        and case 1 above is easy, as is case 3 (if we have treatment-level identifiers).
+        But case 2 is problematic because we are linking to a set of things that may
+        not have an identifier, which means a decision has to be made about which
+        page to link to, and how to refer to that page.</p>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/w18j9-v7j10","uuid":"d811172e-7798-403c-a83d-3d5317a9657e","url":"https://iphylo.blogspot.com/2022/08/papers-citing-data-that-cite-papers.html","title":"Papers
+        citing data that cite papers: CrossRef, DataCite, and the Catalogue of Life","summary":"Quick
+        notes to self following on from a conversation about linking taxonomic names
+        to the literature. Is there a way to turn those links into countable citations
+        (even if just one per database) for Google Scholar?— Wayne Maddison
+        (@WayneMaddison) August 3, 2022 There are different sorts of citation: Paper
+        cites another paper Paper cites a dataset Dataset cites a paper Citation
+        type (1) is largely a solved problem (although there are issues of the ownership
|
350
|
+
and use of this...","date_published":"2022-08-03T11:33:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
|
351
|
+
(Roderic Page)"}],"image":null,"content_html":"Quick notes to self following
|
352
|
+
on from a conversation about linking taxonomic names to the literature.\n\n<blockquote
|
353
|
+
class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">Is there a way to turn
|
354
|
+
those links into countable citations (even if just one per database) for Google
|
355
|
+
Scholar?</p>— Wayne Maddison (@WayneMaddison) <a href=\"https://twitter.com/WayneMaddison/status/1554644747406348288?ref_src=twsrc%5Etfw\">August
|
356
|
+
3, 2022</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
|
357
|
+
charset=\"utf-8\"></script>\n\nThere are different sorts of citation:\n\n<ol>\n <li>Paper
|
358
|
+
cites another paper</li>\n <li>Paper cites a dataset</li>\n <li>Dataset
|
359
|
+
cites a paper</li>\n</ol>\n\nCitation type (1) is largely a solved problem
|
360
|
+
(although there are issues of the ownership and use of this data, see e.g.
|
361
|
+
<a href=\"https://iphylo.blogspot.com/2020/07/zootaxa-has-no-impact-factor.html\">Zootaxa
|
362
|
+
has no impact factor</a>.\n\nCitation type (2) is becoming more widespread
|
363
|
+
(but not perfect as GBIF''s <a href=\"https://twitter.com/search?q=%23citethedoi&src=typed_query\">#citethedoi</a>
|
364
|
+
campaign demonstrates. But the idea is well accepted and there are guides
|
365
|
+
to how to do it, e.g.:\n\n<blockquote>\nCousijn, H., Kenall, A., Ganley, E.
|
366
|
+
et al. A data citation roadmap for scientific publishers. Sci Data 5, 180259
|
367
|
+
(2018). <a href=\"https://doi.org/10.1038/sdata.2018.259\">https://doi.org/10.1038/sdata.2018.259</a>\n</blockquote>\n\nHowever,
|
368
|
+
things do get problematic because most (but not all) DOIs for publications
|
369
|
+
are managed by CrossRef, which has an extensive citation database linking
|
370
|
+
papers to other paopers. Most datasets have DataCite DOIs, and DataCite manages
|
371
|
+
its own citations links, but as far as I''m aware these two systems don''t
|
372
|
+
really taklk to each other.\n\nCitation type (3) is the case where a database
|
373
|
+
is largely based on the literature, which applies to taxonomy. Taxonomic databases
|
374
|
+
are essentially collections of literature that have opinions on taxa, and
|
375
|
+
the database may simply compile those (e.g., a nomenclator), or come to some
|
376
|
+
view on the applicability of each name. In an ideal would, each reference
|
377
|
+
included in a taxonomic database would gain a citation, which would help better
|
378
|
+
reflect the value of that work (a long standing bone of contention for taxonomists).\n\nIt
|
379
|
+
would be interesting to explore these issues further. CrossRef and DataCite
|
380
|
+
do share <a href=\"https://www.crossref.org/services/event-data/\">Event Data</a>
|
381
|
+
(see also <a href=\"https://support.datacite.org/docs/eventdata-guide\">DataCite
|
382
|
+
Event Data</a>). Can this track citations of papers by a dataset?\n \n \nMy
|
383
|
+
take on Wayne''s question:\n\n<blockquote>\n Is there a way to turn those
|
384
|
+
links into countable citations (even if just one per database) for Google
|
385
|
+
Scholar?\n</blockquote>\n\nis that he''s is after type 3 citations, which
|
386
|
+
I don''t think we have a way to handle just yet (but I''d need to look at
|
387
|
+
Event Data a bit more). Google Scholar is a black box, and the academic coimmunity''s
|
388
|
+
reliance on it for metrics is troubling. But it would be interetsing to try
|
389
|
+
and figure out if there is a way to get Google Scholar to index the citations
|
390
|
+
of taxonomic papers by databases. For instance, the <a href=\"https://www.catalogueoflife.org/\">Catalogue
|
391
|
+
of Life</a> has an ISSN <a href=\"https://portal.issn.org/resource/ISSN/2405-884X\">2405-884X</a>
|
392
|
+
so it can be treated as a publication. At the moment its web pages have lots
|
393
|
+
of identifiers for people managing data and their organisations (lots of <a
|
394
|
+
href=\"https://orcid.org\">ORCIDs</a> and <a href=\"https://ror.org\">RORs</a>,
|
395
|
+
and DOIs for individual datasets (e.g., <a href=\"https://www.checklistbank.org/dataset/9828/about\">checklistbank.org</a>)
|
396
|
+
but precious little in the way of DOIs for publications (or, indeed, ORCIDs
|
397
|
+
for taxonomists). What would it take for taxonomic publications in the Catalogue
|
398
|
+
of Life to be treated as first class citations?","tags":["Catalogue of Life","citation","CrossRef","DataCite","DOI"],"language":"en","references":null},{"id":"https://doi.org/10.59350/ws094-1w310","uuid":"6bed78ec-0029-4096-b1c3-48a55a9fdb3b","url":"https://iphylo.blogspot.com/2023/04/chatgpt-of-course.html","title":"ChatGPT,
|
399
|
+
of course","summary":"I haven’t blogged for a while, work and other reasons
have meant I’ve not had much time to think, and mostly I blog to help me think.
ChatGPT is obviously a big thing at the moment, and once we get past the moral
panic (“students can pass exams using AI!”) there are a lot of interesting
possibilities to explore. Inspired by essays such as How Q&A systems based
on large language models (eg GPT4) will change things if they become the dominant
search paradigm — 9 implications for libraries...","date_published":"2023-04-03T12:52:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<p>I haven’t blogged for a
while, work and <a href=\"https://iphylo.blogspot.com/2023/03/dugald-stuart-page-1936-2022.html\">other
reasons</a> have meant I’ve not had much time to think, and mostly I blog
to help me think.</p>\n<p>ChatGPT is obviously a big thing at the moment,
and once we get past the moral panic (“students can pass exams using AI!”)
there are a lot of interesting possibilities to explore. Inspired by essays
such as <a href=\"https://medium.com/@aarontay/how-q-a-systems-based-on-large-language-models-eg-gpt4-will-change-things-if-they-become-the-norm-c7cf62736ba\">How
Q&amp;A systems based on large language models (eg GPT4) will change things
if they become the dominant search paradigm — 9 implications for libraries</a>
and <a href=\"https://about.sourcegraph.com/blog/cheating-is-all-you-need\">Cheating
is All You Need</a>, as well as <a href=\"https://paul-graham-gpt.vercel.app/\">Paul
Graham GPT</a>,
I thought I’d try a few things and see where this goes.</p>\n<p>ChatGPT can
do some surprising things.</p>\n<h4 id=\"parse-bibliographic-data\">Parse
bibliographic data</h4>\n<p>I spend a LOT of time working with bibliographic
data, trying to parse it into structured data. ChatGPT can do this:</p>\n\n<div
class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2bA-Xk6D2wfexP0cs-U3AQa534hNW7J06I0avM82WD3dZSiRvXyHFScv3Hb8QfKKp1fk4GU4olhv1BOq8Pqu4IJgptlBap3xcLI1bFJwB3EOmHKxqf4iy2exJy4vwZ2n4I0U0JVui4Xmhvdy3mTbMSXCRbmCglNFULi5oQEhtPVG4gLJjdw/s924/Screenshot%202023-04-03%20at%2012.59.30.png\"
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
border=\"0\" height=\"400\" data-original-height=\"924\" data-original-width=\"738\"
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2bA-Xk6D2wfexP0cs-U3AQa534hNW7J06I0avM82WD3dZSiRvXyHFScv3Hb8QfKKp1fk4GU4olhv1BOq8Pqu4IJgptlBap3xcLI1bFJwB3EOmHKxqf4iy2exJy4vwZ2n4I0U0JVui4Xmhvdy3mTbMSXCRbmCglNFULi5oQEhtPVG4gLJjdw/s400/Screenshot%202023-04-03%20at%2012.59.30.png\"/></a></div>\n\n<p>Note
that it does more than simply parse the strings; it expands journal abbreviations
such as “J. Malay Brch. R. Asiat. Soc.” to the full name “Journal of the Malayan
Branch of the Royal Asiatic Society”. So we can get clean, parsed data in
a range of formats.</p>\n<h4 id=\"parse-specimens\">Parse specimens</h4>\n<p>Based
on the success with parsing bibliographic strings I wondered how well it could
handle citations of specimens (“material examined”). Elsewhere I’ve been
critical of Plazi’s ability to do this, see <a href=\"https://iphylo.blogspot.com/2021/10/problems-with-plazi-parsing-how.html\">Problems
with Plazi parsing: how reliable are automated methods for extracting specimens
from the literature?</a>.</p>\n<p>For example, given this specimen record
on p. 130 of <a href=\"https://doi.org/10.5852/ejt.2021.775.1553\">doi:10.5852/ejt.2021.775.1553</a>:</p>\n<blockquote>\n<p>LAOS
• Kammoune Province, Bunghona Market, 7 km N of Xe Bangfai River;<br>\n17.13674°
N, 104.98591° E; E. Jeratthitikul, K. Wisittikoson, A. Fanka, N. Wutthituntisil
and P. Prasankok leg.; sold by local people;<br>\nMUMNH-UNI2831.</p>\n</blockquote>\n<p>ChatGPT
extracted a plausible Darwin Core record:</p>\n\n<div class=\"separator\"
style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIwVYuPXl47t1uSpJ3xpbsIT60UUWeKTKSBpIQgD8L2gkj_fuNf4sZKR76JgsfVYYHPs9G9VoM-caUhG1P3irMyB6erSYwYI5LDlLkxdrAFWdbY3ExDs34nMVOF0tq-NBtsDKMZsSJWjhJaGt9v3dtAqWvRyVMNZ46eG2sRhJTICTkFghKtw/s901/Screenshot%202023-04-03%20at%2013.30.54.png\"
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
border=\"0\" height=\"400\" data-original-height=\"901\" data-original-width=\"764\"
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIwVYuPXl47t1uSpJ3xpbsIT60UUWeKTKSBpIQgD8L2gkj_fuNf4sZKR76JgsfVYYHPs9G9VoM-caUhG1P3irMyB6erSYwYI5LDlLkxdrAFWdbY3ExDs34nMVOF0tq-NBtsDKMZsSJWjhJaGt9v3dtAqWvRyVMNZ46eG2sRhJTICTkFghKtw/s400/Screenshot%202023-04-03%20at%2013.30.54.png\"/></a></div>\n\n<p>I’ve
been experimenting with parsing specimen records using the same machine learning
approach for bibliographic data (e.g., <a href=\"https://iphylo.blogspot.com/2021/07/citation-parsing-released.html\">Citation
parsing tool released</a>); perhaps it’s time to rethink that idea.</p>\n<h4
id=\"correct-ocr-errors\">Correct OCR errors</h4>\n<p>OCR errors are a well-known
limitation of the Biodiversity Heritage Library (BHL); see for example
<a href=\"https://blog.biodiversitylibrary.org/2022/07/ocr-improvements-early-analysis.html\">OCR
Improvements: An Early Analysis</a>. Given that ChatGPT is built on top of
a large language model it seems obvious to see how it handles OCR errors.
I tried the following snippet of text from <a href=\"https://www.biodiversitylibrary.org/page/4312227\">BHL
page 4312227</a>:</p>\n<blockquote>\n<p>i BREVIORA No. 185</p>\n<p>Jiiaii
Jr., 16, who always join him and Hedriek in th(> search for<br>\nfrogs.
Iloraeio Mayorga, research assistant to tlie author, has also<br>\nbeen (wtreniely
useful, not only in field work, but also in the<br>\ntedious curatorial and
laboratory work that usually fol- lows. Dr.<br>\nJohn Randall kindly took
tlie photouraplis of E. hedricki, and the<br>\nGaliiiaues family of San Juan
made their El Verde house available as<br>\na (MMiter of activities for the
colleetiuff party. To all these<br>\njx’opic, the autlioi- feels (iee])ly
indebted.</p>\n<p>Elet’tiierodactylus iiKnRrcKi s]). n.</p>\n<p>Type. Museum
of Coniparative Zoology Xo. ;}()!)()8, c^ from El Verde,<br>\nwest flank
of El Yuiuinc Puerto Hico. I.IOO ft. Coll. J. A. Kivero,<br>\n11 Aug-. I!)(i2.</p>\n</blockquote>\n<p>ChatGPT
returns this:</p>\n<blockquote>\n<p>i BREVIORA No. 185</p>\n<p>Juan Jr., 16,
who always joins him and Hedrick in the search for<br>\nfrogs. Horacio Mayorga,
research assistant to the author, has also<br>\nbeen extremely useful, not
only in field work, but also in the tedious<br>\ncuratorial and laboratory
work that usually follows. Dr. John Randall<br>\nkindly took the photographs
of E. hedricki, and the Galinanes family<br>\nof San Juan made their El Verde
house available as a center of<br>\nactivities for the collecting party. To
all these people, the author<br>\nfeels deeply indebted.</p>\n<p>Eleutherodactylus
hedricki sp. n.</p>\n<p>Type. Museum of Comparative Zoology No. 30908, c?
from El Verde, west<br>\nflank of El Yunque, Puerto Rico. 1100 ft. Coll. J.
A. Rivero, 11 Aug.<br>\n1962.</p>\n</blockquote>\n\n<div class=\"separator\"
style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibZLxXSTdt64SMhfqS8NCcLOxbVaFmkhMQx0yEUWcm696EL6k2_VgxZFEDT-SEqOChmwDggoQLkuU6DpcOF_d-nkQRKzQLt65WKfkFlIzDRV9LVrDqmu-D0OZl1p5Vva4QUdvDb3JpeKPPDXZ-BcZqq8V4PyGepxlnwtt7-q4Sg6_aHD58gw/s859/Screenshot%202023-04-03%20at%2013.07.53.png\"
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
border=\"0\" width=\"400\" data-original-height=\"510\" data-original-width=\"859\"
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibZLxXSTdt64SMhfqS8NCcLOxbVaFmkhMQx0yEUWcm696EL6k2_VgxZFEDT-SEqOChmwDggoQLkuU6DpcOF_d-nkQRKzQLt65WKfkFlIzDRV9LVrDqmu-D0OZl1p5Vva4QUdvDb3JpeKPPDXZ-BcZqq8V4PyGepxlnwtt7-q4Sg6_aHD58gw/s400/Screenshot%202023-04-03%20at%2013.07.53.png\"/></a></div>\n\n<p>Comparing
this to the scanned image, ChatGPT does pretty well; for example, the gobbledegook
“Elet’tiierodactylus iiKnRrcKi” is correctly translated as “Eleutherodactylus
hedricki”. Running all of BHL through ChatGPT probably isn’t feasible, but
one could imagine targeted cleaning of key papers.</p>\n<h4 id=\"summary\">Summary</h4>\n<p>These
small experiments are fairly trivial, but they are the sort of tedious tasks
that would otherwise require significant programming (or other resources)
to solve. But ChatGPT can do rather more, as I hope to discuss in the next
post.</p>\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/7esgr-61v1","uuid":"96fa91d5-459c-482f-aa38-dda6e0a30e20","url":"https://iphylo.blogspot.com/2022/01/large-graph-viewer-experiments.html","title":"Large
graph viewer experiments","summary":"I keep returning to the problem of viewing
large graphs and trees, which means my hard drive has accumulated lots of
failed prototypes. Inspired by some recent discussions on comparing taxonomic
classifications I decided to package one of these (wildly incomplete) prototypes
up so that I can document the idea and put the code somewhere safe. Very cool,
thanks for sharing this-- the tree diff is similar to what J Rees has been
cooking up lately with his 'cl diff' tool. I'll tag...","date_published":"2022-01-02T11:25:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<p>I keep returning to the
problem of viewing large graphs and trees, which means my hard drive has accumulated
lots of failed prototypes. Inspired by some recent discussions on comparing
taxonomic classifications I decided to package one of these (wildly incomplete)
prototypes up so that I can document the idea and put the code somewhere safe.</p>\n\n<blockquote
class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">Very cool, thanks for sharing
this-- the tree diff is similar to what J Rees has been cooking up lately
with his 'cl diff' tool. I'll tag <a href=\"https://twitter.com/beckettws?ref_src=twsrc%5Etfw\">@beckettws</a>
in here too so he can see potential crossover. The goal is autogenerate diffs
like this as 1st step to mapping taxo name-to concept</p>— Nate Upham
(@n8_upham) <a href=\"https://twitter.com/n8_upham/status/1475834371131289608?ref_src=twsrc%5Etfw\">December
28, 2021</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
charset=\"utf-8\"></script>\n\n<h2>Google Maps-like viewer</h2>\n\n<div class=\"separator\"
style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/a/AVvXsEiGDVVKGmwRi0gn-kEtpxBCUEdK-WCLBSpmkALAUDHGeTPkTSICSYVgnsj5N7zUeUfQALfFFHJJCsfeFvRULKbmqxLz51rW5hp_11dVXh-FHrnlRA7RJTA7I82l7sERF5jAjlah0LyEheVayO9nAfHTGZDuw5rnCe9iEO3dHQmA8_5AFIlJJg=s500\"
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
border=\"0\" width=\"400\" data-original-height=\"448\" data-original-width=\"500\"
src=\"https://blogger.googleusercontent.com/img/a/AVvXsEiGDVVKGmwRi0gn-kEtpxBCUEdK-WCLBSpmkALAUDHGeTPkTSICSYVgnsj5N7zUeUfQALfFFHJJCsfeFvRULKbmqxLz51rW5hp_11dVXh-FHrnlRA7RJTA7I82l7sERF5jAjlah0LyEheVayO9nAfHTGZDuw5rnCe9iEO3dHQmA8_5AFIlJJg=s400\"/></a></div>\n\n<p>I''ve
created a simple viewer that uses a tiled map viewer (like Google Maps) to
display a large graph. The idea is to draw the entire graph scaled to a 256
x 256 pixel tile. The graph is stored in a database that supports geospatial
queries, which means the queries to retrieve the individual tiles needed to
display the graph at different levels of resolution are simply bounding box
queries to a database. I realise that this description is cryptic at best.
The GitHub repository <a href=\"https://github.com/rdmpage/gml-viewer\">https://github.com/rdmpage/gml-viewer</a>
has more details and the code itself. There''s a lot to do, especially adding
support for labels(!), which presents some interesting challenges (<a href=\"https://en.wikipedia.org/wiki/Level_of_detail_(computer_graphics)\">levels
of detail</a> and <a href=\"https://en.wikipedia.org/wiki/Cartographic_generalization\">generalization</a>).
The code doesn''t do any layout of the graph itself; instead I''ve used the
<a href=\"https://www.yworks.com/products/yed\">yEd</a> tool to compute the
x,y coordinates of the graph.</p>\n\n<p>Since this exercise was inspired by
a discussion of the <a href=\"https://www.mammaldiversity.org\">ASM Mammal
Diversity Database</a>, the graph I''ve used for the demonstration above is
the ASM classification of extant mammals. I guess I need to solve the labelling
issue fairly quickly!</p>","tags":["Google Maps","graph","Mammal Species of
the World","mammals","taxonomy"],"language":"en","references":null},{"id":"https://doi.org/10.59350/m48f7-c2128","uuid":"8aea47e4-f227-45f4-b37b-0454a8a7a3ff","url":"https://iphylo.blogspot.com/2023/04/chatgpt-semantic-search-and-knowledge.html","title":"ChatGPT,
semantic search, and knowledge graphs","summary":"One thing about ChatGPT
is it has opened my eyes to some concepts I was dimly aware of but am only
now beginning to fully appreciate. ChatGPT enables you to ask it questions, but
the answers depend on what ChatGPT “knows”. As several people have noted,
what would be even better is to be able to run ChatGPT on your own content.
Indeed, ChatGPT itself now supports this using plugins. Paul Graham GPT However,
it’s still useful to see how to add ChatGPT functionality to your own content
from...","date_published":"2023-04-03T15:30:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<p>One thing about ChatGPT
is it has opened my eyes to some concepts I was dimly aware of but am only
now beginning to fully appreciate. ChatGPT enables you to ask it questions, but
the answers depend on what ChatGPT “knows”. As several people have noted,
what would be even better is to be able to run ChatGPT on your own content.
Indeed, ChatGPT itself now supports this using <a href=\"https://openai.com/blog/chatgpt-plugins\">plugins</a>.</p>\n<h4
id=\"paul-graham-gpt\">Paul Graham GPT</h4>\n<p>However, it’s still useful
to see how to add ChatGPT functionality to your own content from scratch.
A nice example of this is <a href=\"https://paul-graham-gpt.vercel.app/\">Paul
Graham GPT</a> by <a href=\"https://twitter.com/mckaywrigley\">Mckay Wrigley</a>.
Mckay Wrigley took essays by Paul Graham (a well-known venture capitalist)
and built a question and answer tool very like ChatGPT.</p>\n<iframe width=\"560\"
height=\"315\" src=\"https://www.youtube.com/embed/ii1jcLg-eIQ\" title=\"YouTube
video player\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write;
encrypted-media; gyroscope; picture-in-picture; web-share\" allowfullscreen></iframe>\n<p>Because
you can send a block of text to ChatGPT (as part of the prompt) you can get
ChatGPT to summarise or transform that information, or answer questions based
on that information. But there is a limit to how much information you can
pack into a prompt. You can’t put all of Paul Graham’s essays into a prompt,
for example. So a solution is to do some preprocessing. For example, given
a question such as “How do I start a startup?” we could first find the essays
that are most relevant to this question, then use them to create a prompt
for ChatGPT. A quick and dirty way to do this is simply to do a text search over
the essays and take the top hits. But we aren’t searching for words, we are
searching for answers to a question. The essay with the best answer might
not include the phrase “How do I start a startup?”.</p>\n<h4 id=\"semantic-search\">Semantic
search</h4>\n<p>Enter <a href=\"https://en.wikipedia.org/wiki/Semantic_search\">Semantic
search</a>. The key concept behind semantic search is that we are looking
for documents with similar meaning, not just similarity of text. One approach
to this is to represent documents by “embeddings”, that is, a vector of numbers
that encapsulate features of the document. Documents with similar vectors
are potentially related. In semantic search we take the query (e.g., “How
do I start a startup?”), compute its embedding, then search among the documents
for those with similar embeddings.</p>\n<p>To create Paul Graham GPT, Mckay
Wrigley did the following. First he sent each essay to the OpenAI API underlying
ChatGPT, and in return he got the embedding for that essay (a vector of 1536
numbers). Each embedding was stored in a database (Mckay uses Postgres with
<a href=\"https://github.com/pgvector/pgvector\">pgvector</a>). When a user
enters a query such as “How do I start a startup?” that query is also sent
to the OpenAI API to retrieve its embedding vector. Then we query the database
of embeddings for Paul Graham’s essays and take the top five hits. These hits
are, one hopes, the most likely to contain relevant answers. The original
question and the most similar essays are then bundled up and sent to ChatGPT,
which then synthesises an answer. See his <a href=\"https://github.com/mckaywrigley/paul-graham-gpt\">GitHub
repo</a> for more details. Note that we are still using ChatGPT, but on a
set of documents it doesn’t already have.</p>\n<h4 id=\"knowledge-graphs\">Knowledge
graphs</h4>\n<p>I’m a fan of knowledge graphs, but they are not terribly easy
to use. For example, I built a knowledge graph of Australian animals <a href=\"https://ozymandias-demo.herokuapp.com\">Ozymandias</a>
that contains a wealth of information on taxa, publications, and people, wrapped
up in a web site. If you want to learn more you need to figure out how to
write queries in SPARQL, which is not fun. Maybe we could use ChatGPT to write
the SPARQL queries for us, but it would be much more fun to simply ask
natural language queries (e.g., “who are the experts on Australian ants?”).
I made some naïve notes on these ideas in <a href=\"https://iphylo.blogspot.com/2015/09/possible-project-natural-language.html\">Possible
project: natural language queries, or answering “how many species are there?”</a>
and <a href=\"https://iphylo.blogspot.com/2019/05/ozymandias-meets-wikipedia-with-notes.html\">Ozymandias
meets Wikipedia, with notes on natural language generation</a>.</p>\n<p>Of
course, this is a well-known problem. Tools such as <a href=\"http://rdf2vec.org\">RDF2vec</a>
can take RDF from a knowledge graph and create embeddings which could in turn
be used to support semantic search. But it seems to me that we could simplify
this process a bit by making use of ChatGPT.</p>\n<p>Firstly we would generate
natural language statements from the knowledge graph (e.g., “species x belongs
to genus y and was described in z”, “this paper on ants was authored by x”,
etc.) that cover the basic questions we expect people to ask. We then get
embeddings for these (e.g., using OpenAI). We then have an interface where
people can ask a question (“is species x a valid species?”, “who has published
on ants”, etc.), we get the embedding for that question, retrieve natural
language statements that are the closest in embedding “space”, package everything
up and ask ChatGPT to summarise the answer.</p>\n<p>The trick, of course,
is to figure out how to generate natural language statements from the knowledge
graph (which amounts to deciding what paths to traverse in the knowledge graph,
and how to write those paths in something approximating English). We also
want to know something about the sorts of questions people are likely to ask
so that we have a reasonable chance of having the answers (for example, are
people going to ask about individual species, or questions about summary statistics
such as numbers of species in a genus, etc.).</p>\n<p>What makes this attractive
is that it seems a straightforward way to go from a largely academic exercise
(build a knowledge graph) to something potentially useful (a question and
answer machine). Imagine if something like the defunct BBC wildlife site (see
<a href=\"https://iphylo.blogspot.com/2017/12/blue-planet-ii-bbc-and-semantic-web.html\">Blue
Planet II, the BBC, and the Semantic Web: a tale of lessons forgotten and
opportunities lost</a>), revived <a href=\"https://aspiring-look.glitch.me\">here</a>,
had a question and answer interface where we could ask questions rather than
passively browse.</p>\n<h4 id=\"summary\">Summary</h4>\n<p>I have so much
more to learn, and need to think about ways to incorporate semantic search
and ChatGPT-like tools into knowledge graphs.</p>\n<blockquote>\n<p>Written
with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/rfxj3-x6739","uuid":"6a4d5c44-f4a9-4d40-a32c-a4d5e512c55a","url":"https://iphylo.blogspot.com/2022/05/thoughts-on-treebase-dying.html","title":"Thoughts
on TreeBASE dying(?)","summary":"@rvosa is Naturalis no longer hosting Treebase?
|
625
|
+
https://t.co/MBRgcxaBmR— Hilmar Lapp (@hlapp) May 10, 2022 So it looks
|
626
|
+
like TreeBASE is in trouble, it''s legacy Java code a victim of security issues.
|
627
|
+
Perhaps this is a chance to rethink TreeBASE, assuming that a repository of
|
628
|
+
published phylogenies is still considered a worthwhile thing to have (and
|
629
|
+
I think that question is open). Here''s what I think could be done. The data
|
630
|
+
(individual studies with trees and data) are packaged into...","date_published":"2022-05-11T16:53:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
|
631
|
+
(Roderic Page)"}],"image":null,"content_html":"<blockquote class=\"twitter-tweet\"><p
|
632
|
+
lang=\"en\" dir=\"ltr\"><a href=\"https://twitter.com/rvosa?ref_src=twsrc%5Etfw\">@rvosa</a>
|
633
|
+
is Naturalis no longer hosting Treebase? <a href=\"https://t.co/MBRgcxaBmR\">https://t.co/MBRgcxaBmR</a></p>—
|
634
|
+
Hilmar Lapp (@hlapp) <a href=\"https://twitter.com/hlapp/status/1524166490798309381?ref_src=twsrc%5Etfw\">May
|
635
|
+
10, 2022</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
|
636
|
+
charset=\"utf-8\"></script>\n\n<p>So it looks like <a href=\"http://treebase.org\">TreeBASE</a>
|
637
|
+
is in trouble, it''s legacy Java code a victim of security issues. Perhaps
|
638
|
+
this is a chance to rethink TreeBASE, assuming that a repository of published
|
639
|
+
phylogenies is still considered a worthwhile thing to have (and I think that
|
640
|
+
question is open).</p>\n\n<p>Here''s what I think could be done.</p>\n\n<ol>\n<li>\nThe
|
641
|
+
data (individual studies with trees and data) are packaged into whatever format
is easiest (NEXUS, XML, JSON) and uploaded to a repository such as <a href=\"https://zenodo.org\">Zenodo</a>
for long term storage. They get DOIs for citability. This becomes the default
storage for TreeBASE.\n</li>\n<li>\nThe data is transformed into JSON and
indexed using Elasticsearch. A simple web interface is placed on top so that
people can easily find trees (never a strong point of the original TreeBASE).
Trees are displayed natively on the web using SVG. The number one goal is
for people to be able to find trees, view them, and download them.\n</li>\n<li>\nTo
add data to TreeBASE the easiest way would be for people to upload them direct
to Zenodo and tag them \"treebase\". A bot then grabs a feed of these datasets
and adds them to the search engine in (1) above. As time allows, add an interface
where people upload data directly, it gets curated, then deposited in Zenodo.
This presupposes that there are people available to do curation. Maybe have
\"stars\" for the level of curation so that users know whether anyone has
checked the data.\n</li>\n</ol>\n\n<p>There''s lots of details to tweak, for
example how many of the existing URLs for studies are preserved (some URL
mapping), and what about the API? And I''m unclear about the relationship
with <a href=\"https://datadryad.org\">Dryad</a>.</p>\n\n<p>My sense is that
the TreeBASE code is very much of its time (10-15 years ago), a monolithic
block of code with SQL, Java, etc. If one was starting from scratch today
I don''t think this would be the obvious solution. Things have trended towards
being simpler, with lots of building blocks now available in the cloud. Need
a search engine? Just spin up a container in the cloud and you have one. More
and more functionality can be devolved elsewhere.</p>\n\n<p>Another other
issue is how to support TreeBASE. It has essentially been a volunteer effort
to date, with little or no funding. One reason I think having Zenodo as a
storage engine is that it takes care of long term sustainability of the data.</p>\n\n<p>I
realise that this is all wild arm waving, but maybe now is the time to reinvent
TreeBASE?</p>\n\n<h2>Updates</h2>\n\n<p>It''s been a while since I''ve paid
a lot of attention to phylogenetic databases, and it shows. There is a file-based
storage system for phylogenies <a href=\"https://github.com/OpenTreeOfLife/phylesystem-1\">phylesystem</a>
(see \"Phylesystem: a git-based data store for community-curated phylogenetic
estimates\" <a href=\"https://doi.org/10.1093/bioinformatics/btv276\">https://doi.org/10.1093/bioinformatics/btv276</a>)
that is sort of what I had in mind, although long term persistence is based
on GitHub rather than a repository such as Zenodo. Phylesystem uses a truly
horrible-looking JSON transformation of <a href=\"http://nexml.github.io\">NeXML</a>
(NeXML itself is ugly), and TreeBASE also supports NeXML, so some form of
NeXML or a JSON transformation seems the obvious storage format. It will probably
need some cleaning and simplification if it is to be indexed easily. Looking
back over the long history of TreeBASE and phylogenetic databases I''m struck
by how much complexity has been introduced over time. I think the tech has
gotten in the way sometimes (which might just be another way of saying that
I''m not smart enough to make sense of it all.</p>\n\n<p>So we could imagine
a search engine that covers both TreeBASE and <a href=\"https://tree.opentreeoflife.org/curator\">Open
Tree of Life studies</a>.</p>\n\n<p>Basic metadata-based searches would be
straightforward, and we could have a user interface that highlights the trees
(I think TreeBASE''s biggest search rival is a Google image search). The harder
problem is searching by tree structure, for which there is an interesting
literature without any decent implementations that I''m aware of (as I said,
I''ve been out of this field a while).</p>\n\n<p>So my instinct is we could
go a long way with simply indexing JSON (CouchDB or Elasticsearch), then need
to think a bit more cleverly about higher taxon and tree based searching.
I''ve always thought that one killer query would be not so much \"show me
all the trees for my taxon\" but \"show me a synthesis of the trees for my
taxon\". Imagine a supertree of recent studies that we could use as a summary
of our current knowledge, or a visualisation that summarises where there are
conflicts among the trees.</p>\n\n<h3>Relevant code and sites</h3>\n\n<ul>\n<li><a
href=\"https://github.com/rdmpage/cdaotools\">CDAO Tools</a>, see \"CDAO-Store:
Ontology-driven Data Integration for Phylogenetic Analysis\" <a href=\"https://doi.org/10.1186/1471-2105-12-98\">https://doi.org/10.1186/1471-2105-12-98</a></li>\n<li><a
href=\"https://github.com/NESCent/phylocommons\">PhyloCommons</a></li>\n</ul>","tags":["phylogeny","TreeBASE"],"language":"en","references":null},{"id":"https://doi.org/10.59350/jzvs4-r9559","uuid":"23fa1dd8-5c6b-4aa9-9cad-c6f6b14ae9e0","url":"https://iphylo.blogspot.com/2021/08/json-ld-in-wild-examples-of-how.html","title":"JSON-LD
in the wild: examples of how structured data is represented on the web","summary":"I''ve
created a GitHub repository so that I can keep track of the examples of JSON-LD
that I''ve seen being actively used, for example embedded in web sites, or
accessed using an API. The repository is https://github.com/rdmpage/wild-json-ld.
The list is by no means exhaustive, I hope to add more examples as I come
across them. One reason for doing this is to learn what others are doing.
For example, after looking at SciGraph''s JSON-LD I now see how an ordered
list can be modelled in RDF in...","date_published":"2021-08-27T13:20:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<p>I''ve created a GitHub repository
so that I can keep track of the examples of JSON-LD that I''ve seen being
actively used, for example embedded in web sites, or accessed using an API.
The repository is <a href=\"https://github.com/rdmpage/wild-json-ld\">https://github.com/rdmpage/wild-json-ld</a>.
The list is by no means exhaustive, I hope to add more examples as I come
across them.</p>\n\n<p>One reason for doing this is to learn what others are
doing. For example, after looking at SciGraph''s JSON-LD I now see how an
ordered list can be modelled in RDF in such a way that the list of authors
in a JSON-LD document for, say a scientific paper, is correct. By default
RDF has no notion of ordered lists, so if you do a SPARQL query to get the
authors of a paper, the order of the authors returned in the query will be
arbitrary. There are various ways to try and tackle this. In my Ozymandias
knowledge graph I used \"roles\" to represent order (see <a href=\"https://doi.org/10.7717/peerj.6739/fig-2\">Figure
2</a> in the Ozymandias paper). I then used properties of the role to order
the list of authors.</p>\n\n<p>Another approach is to use rdf:lists (see <a
href=\"http://www.snee.com/bobdc.blog/2014/04/rdf-lists-and-sparql.html\">RDF
lists and SPARQL</a> and <a href=\"https://stackoverflow.com/questions/17523804/is-it-possible-to-get-the-position-of-an-element-in-an-rdf-collection-in-sparql/17530689#17530689\">Is
it possible to get the position of an element in an RDF Collection in SPARQL?</a>
for an introduction to lists). SciGraph uses this approach. The value for
schema:author is not an author, but a blank node (bnode), and this bnode has
two predicates, rdf:first and rdf:rest. One points to an author, the other
points to another bnode. This pattern repeats until we encounter a value of
rdf:nil for rdf:rest.</p>\n\n<div class=\"separator\" style=\"clear: both;\"><a
href=\"https://1.bp.blogspot.com/-AESgWub1ZLQ/YSjoeo6O41I/AAAAAAAAgwg/5Edm7ZmuwL8NwxCcBvTqbI7js5nYmgggwCLcBGAsYHQ/s629/list.png\"
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
border=\"0\" height=\"320\" data-original-height=\"629\" data-original-width=\"401\"
src=\"https://1.bp.blogspot.com/-AESgWub1ZLQ/YSjoeo6O41I/AAAAAAAAgwg/5Edm7ZmuwL8NwxCcBvTqbI7js5nYmgggwCLcBGAsYHQ/s320/list.png\"/></a></div>\n\n<p>This
introduces some complexity, but the benefit is that the JSON-LD version of
the RDF will have the authors in the correct order, and hence any client that
is using JSON will be able to treat the array of authors as ordered. Without
some means of ordering the client could not make this assumption, hence the
first author in the list might not actually be the first author of the paper.</p>","tags":["JSON-LD","RDF"],"language":"en","references":null},{"id":"https://doi.org/10.59350/zc4qc-77616","uuid":"30c78d9d-2e50-49db-9f4f-b3baa060387b","url":"https://iphylo.blogspot.com/2022/09/does-anyone-cite-taxonomic-treatments.html","title":"Does
anyone cite taxonomic treatments?","summary":"Taxonomic treatments have come
up in various discussions I''m involved in, and I''m curious as to whether
they are actually being used, in particular, whether they are actually being
cited. Consider the following quote: The taxa are described in taxonomic treatments,
well defined sections of scientific publications (Catapano 2019). They include
a nomenclatural section and one or more sections including descriptions, material
citations referring to studied specimens, or notes ecology and...","date_published":"2022-09-01T16:49:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
both;\"><a href=\"https://zenodo.org/record/5731100/thumb100\" style=\"display:
block; padding: 1em 0; text-align: center; clear: right; float: right;\"><img
alt=\"\" border=\"0\" height=\"128\" data-original-height=\"106\" data-original-width=\"100\"
src=\"https://zenodo.org/record/5731100/thumb250\"/></a></div>\nTaxonomic
treatments have come up in various discussions I''m involved in, and I''m
curious as to whether they are actually being used, in particular, whether
they are actually being cited. Consider the following quote:\n\n<blockquote>\nThe
taxa are described in taxonomic treatments, well defined sections of scientific
publications (Catapano 2019). They include a nomenclatural section and one
or more sections including descriptions, material citations referring to studied
specimens, or notes ecology and behavior. In case the treatment does not describe
a new discovered taxon, previous treatments are cited in the form of treatment
citations. This citation can refer to a previous treatment and add additional
data, or it can be a statement synonymizing the taxon with another taxon.
This allows building a citation network, and ultimately is a constituent part
of the catalogue of life. - Taxonomic Treatments as Open FAIR Digital Objects
<a href=\"https://doi.org/10.3897/rio.8.e93709\">https://doi.org/10.3897/rio.8.e93709</a>\n</blockquote>\n\n<p>\n
\"Traditional\" academic citation is from article to article. For example,
consider these two papers:\n\n<blockquote>\nLi Y, Li S, Lin Y (2021) Taxonomic
study on fourteen symphytognathid species from Asia (Araneae, Symphytognathidae).
ZooKeys 1072: 1-47. https://doi.org/10.3897/zookeys.1072.67935\n</blockquote>\n\n<blockquote>\nMiller
J, Griswold C, Yin C (2009) The symphytognathoid spiders of the Gaoligongshan,
Yunnan, China (Araneae: Araneoidea): Systematics and diversity of micro-orbweavers.
ZooKeys 11: 9-195. https://doi.org/10.3897/zookeys.11.160\n</blockquote>\n</p>\n\n<p>Li
et al. 2021 cites Miller et al. 2009 (although Pensoft seems to have broken
the citation such that it does appear correctly either on their web page or
in CrossRef).</p>\n\n<p>So, we have this link: [article]10.3897/zookeys.1072.67935
--cites--> [article]10.3897/zookeys.11.160. One article cites another.</p>\n\n<p>In
their 2021 paper Li et al. discuss <i>Patu jidanweishi</i> Miller, Griswold
& Yin, 2009:\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPavMHXqQNX1ls_zXo9kIcMLHPxc7ZpV9NCSof5wLumrg3ovPoi6nYKzZsINuqFtYoEvW1QrerePD-MEf2DJaYUXlT37d81x3L6ILls7u229rg0_Nc0uUmgW-ICzr6MI_QCZfgQbYGTxuofu-fuPVoygbCnm3vQVYOhLDLtp1EtQ9jRZHDvw/s1040/Screenshot%202022-09-01%20at%2017.12.27.png\"
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
border=\"0\" width=\"400\" data-original-height=\"314\" data-original-width=\"1040\"
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPavMHXqQNX1ls_zXo9kIcMLHPxc7ZpV9NCSof5wLumrg3ovPoi6nYKzZsINuqFtYoEvW1QrerePD-MEf2DJaYUXlT37d81x3L6ILls7u229rg0_Nc0uUmgW-ICzr6MI_QCZfgQbYGTxuofu-fuPVoygbCnm3vQVYOhLDLtp1EtQ9jRZHDvw/s400/Screenshot%202022-09-01%20at%2017.12.27.png\"/></a></div>\n\n<p>There
is a treatment for the original description of <i>Patu jidanweishi</i> at
<a href=\"https://doi.org/10.5281/zenodo.3792232\">https://doi.org/10.5281/zenodo.3792232</a>,
which was created by Plazi with a time stamp \"2020-05-06T04:59:53.278684+00:00\".
The original publication date was 2009, the treatments are being added retrospectively.</p>\n\n<p>In
an ideal world my expectation would be that Li et al. 2021 would have cited
the treatment, instead of just providing the text string \"Patu jidanweishi
Miller, Griswold & Yin, 2009: 64, figs 65A–E, 66A, B, 67A–D, 68A–F, 69A–F,
70A–F and 71A–F (♂♀).\" Isn''t the expectation under the treatment model that
we would have seen this relationship:</p>\n\n<p>[article]10.3897/zookeys.1072.67935
--cites--> [treatment]https://doi.org/10.5281/zenodo.3792232</p>\n\n<p>Furthermore,
if it is the case that \"[i]n case the treatment does not describe a new discovered
taxon, previous treatments are cited in the form of treatment citations\"
then we should also see a citation between treatments, in other words Li et
al.''s 2021 treatment of <i>Patu jidanweishi</i> (which doesn''t seem to have
a DOI but is available on Plazi'' web site as <a href=\"https://tb.plazi.org/GgServer/html/1CD9FEC313A35240938EC58ABB858E74\">https://tb.plazi.org/GgServer/html/1CD9FEC313A35240938EC58ABB858E74</a>)
should also cite the original treatment? It doesn''t - but it does cite the
Miller et al. paper.</p>\n\n<p>So in this example we don''t see articles citing
treatments, nor do we see treatments citing treatments. Playing Devil''s advocate,
why then do we have treatments? Does''t the lack of citations suggest that
- despite some taxonomists saying this is the unit that matters - they actually
don''t. If we pay attention to what people do rather than what they say they
do, they cite articles.</p>\n\n<p>Now, there are all sorts of reasons why
we don''t see [article] -> [treatment] citations, or [treatment] -> [treatment]
citations. Treatments are being added after the fact by Plazi, not by the
authors of the original work. And in many cases the treatments that could
be cited haven''t appeared until after that potentially citing work was published.
In the example above the Miller et al. paper dates from 2009, but the treatment
extracted only went online in 2020. And while there is a long standing culture
of citing publications (ideally using DOIs) there isn''t an equivalent culture
of citing treatments (beyond the simple text strings).</p>\n\n<p>Obviously
this is but one example. I''d need to do some exploration of the citation
graph to get a better sense of citations patterns, perhaps using <a href=\"https://www.crossref.org/documentation/event-data/\">CrossRef''s
event data</a>. But my sense is that taxonomists don''t cite treatments.</p>\n\n<p>I''m
guessing Plazi would respond by saying treatments are cited, for example (indirectly)
in GBIF downloads. This is true, although arguably people aren''t citing the
treatment, they''re citing specimen data in those treatments, and that specimen
data could be extracted at the level of articles rather than treatments. In
other words, it''s not the treatments themselves that people are citing.</p>\n\n<p>To
be clear, I think there is value in being able to identify those \"well defined
sections\" of a publication that deal with a given taxon (i.e., treatments),
but it''s not clear to me that these are actually the citable units people
might hope them to be. Likewise, journals such as <i>ZooKeys</i> have DOIs
for individual figures. Does anyone actually cite those?</p>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/en7e9-5s882","uuid":"20b9d31e-513f-496b-b399-4215306e1588","url":"https://iphylo.blogspot.com/2022/04/obsidian-markdown-and-taxonomic-trees.html","title":"Obsidian,
markdown, and taxonomic trees","summary":"Returning to the subject of personal
knowledge graphs Kyle Scheer has an interesting repository of Markdown files
that describe academic disciplines at https://github.com/kyletscheer/academic-disciplines
(see his blog post for more background). If you add these files to Obsidian
you get a nice visualisation of a taxonomy of academic disciplines. The applications
of this to biological taxonomy seem obvious, especially as a tool like Obsidian
enables all sorts of interesting links to be added...","date_published":"2022-04-07T21:07:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<p>Returning to the subject
of <a href=\"https://iphylo.blogspot.com/2020/08/personal-knowledge-graphs-obsidian-roam.html\">personal
knowledge graphs</a> Kyle Scheer has an interesting repository of Markdown
files that describe academic disciplines at <a href=\"https://github.com/kyletscheer/academic-disciplines\">https://github.com/kyletscheer/academic-disciplines</a>
(see <a href=\"https://kyletscheer.medium.com/on-creating-a-tree-of-knowledge-f099c1028bf6\">his
blog post</a> for more background).</p>\n\n<p>If you add these files to <a
href=\"https://obsidian.md/\">Obsidian</a> you get a nice visualisation of
a taxonomy of academic disciplines. The applications of this to biological
taxonomy seem obvious, especially as a tool like Obsidian enables all sorts
of interesting links to be added (e.g., we could add links to the taxonomic
research behind each node in the taxonomic tree, the people doing that research,
etc. - although that would mean we''d no longer have a simple tree).</p>\n\n<p>The
more I look at these sort of simple Markdown-based tools the more I wonder
whether we could make more use of them to create simple but persistent databases.
Text files seem the most stable, long-lived digital format around, maybe this
would be a way to minimise the inevitable obsolescence of database and server
software. Time for some experiments I feel... can we take a taxonomic group,
such as mammals, and create a richly connected database purely in Markdown?</p>\n\n<div
class=\"separator\" style=\"clear: both; text-align: center;\"><iframe allowfullscreen=''allowfullscreen''
webkitallowfullscreen=''webkitallowfullscreen'' mozallowfullscreen=''mozallowfullscreen''
width=''400'' height=''322'' src=''https://www.blogger.com/video.g?token=AD6v5dy3Sa_SY_MJCZYYCT-bAGe9QD1z_V0tkE0qM5FaQJfAEgGOoHtYPATsNNbBvTEh_tHOZ83nMGzpYRg''
class=''b-hbp-video b-uploaded'' frameborder=''0''></iframe></div>","tags":["markdown","obsidian"],"language":"en","references":null},{"id":"https://doi.org/10.59350/m7gb7-d7c49","uuid":"7d814863-43b5-4faf-a475-da8de5efd3ef","url":"https://iphylo.blogspot.com/2022/02/duplicate-dois-again.html","title":"Duplicate
DOIs (again)","summary":"This blog post provides some background to a recent
tweet where I expressed my frustration about the duplication of DOIs for the
same article. I''m going to document the details here. The DOI that alerted
me to this problem is https://doi.org/10.2307/2436688 which is for the article
Snyder, W. C., & Hansen, H. N. (1940). THE SPECIES CONCEPT IN FUSARIUM. American
Journal of Botany, 27(2), 64–67. This article is hosted by JSTOR at https://www.jstor.org/stable/2436688
which displays the DOI...","date_published":"2022-02-08T15:06:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<p>This blog post provides
some background to a <a href=\"https://twitter.com/rdmpage/status/1491023036199600132\">recent
tweet</a> where I expressed my frustration about the duplication of DOIs for
the same article. I''m going to document the details here.</p>\n\n<p>The DOI
that alerted me to this problem is <a href=\"https://doi.org/10.2307/2436688\">https://doi.org/10.2307/2436688</a>
which is for the article</p>\n\n<blockquote>\nSnyder, W. C., & Hansen, H.
N. (1940). THE SPECIES CONCEPT IN FUSARIUM. American Journal of Botany, 27(2),
64–67.\n</blockquote>\n\n<p>This article is hosted by JSTOR at <a href=\"https://www.jstor.org/stable/2436688\">https://www.jstor.org/stable/2436688</a>
which displays the DOI <a href=\"https://doi.org/10.2307/2436688\">https://doi.org/10.2307/2436688</a>
.</p>\n\n<p>This same article is also hosted by Wiley at <a href=\"https://bsapubs.onlinelibrary.wiley.com/doi/abs/10.1002/j.1537-2197.1940.tb14217.x\">https://bsapubs.onlinelibrary.wiley.com/doi/abs/10.1002/j.1537-2197.1940.tb14217.x</a>
with the DOI <a href=\"https://doi.org/10.1002/j.1537-2197.1940.tb14217.x\">https://doi.org/10.1002/j.1537-2197.1940.tb14217.x</a>.</p>\n\n<h2>Expected
behaviour</h2>\n\n<p>What should happen is if Wiley is going to be the publisher
of this content (taking over from JSTOR), the DOI <b>10.2307/2436688</b> should
be redirected to the Wiley page, and the Wiley page displays this DOI (i.e.,
<b>10.2307/2436688</b>). If I want to get metadata for this DOI, I should
be able to use CrossRef''s API to retrieve that metadata, e.g. <a href=\"https://api.crossref.org/v1/works/10.2307/2436688\">https://api.crossref.org/v1/works/10.2307/2436688</a>
should return metadata for the article.</p>\n\n<h2>What actually happens</h2>\n\n<p>Wiley
display the same article on their web site with the DOI <b>10.1002/j.1537-2197.1940.tb14217.x</b>.
They have minted a new DOI for the same article! The original JSTOR DOI now
resolves to the Wiley page (you can see this using the <a href=\"https://hdl.handle.net\">Handle
Resolver</a>), which is what is supposed to happen. However, Wiley should
have reused the original DOI rather than mint their own.</p>\n\n<p>Furthermore,
while the original DOI still resolves in a web browser, I can''t retrieve
metadata about that DOI from CrossRef, so any attempt to build upon that DOI
fails. However, I can retrieve metadata for the Wiley DOI, i.e. <a href=\"https://api.crossref.org/v1/works/10.1002/j.1537-2197.1940.tb14217.x\">https://api.crossref.org/v1/works/10.1002/j.1537-2197.1940.tb14217.x</a>
works, but <a href=\"https://api.crossref.org/v1/works/10.2307/2436688\">https://api.crossref.org/v1/works/10.2307/2436688</a>
doesn''t.</p>\n\n<h2>Why does this matter?</h2>\n\n<p>For anyone using DOIs
as stable links to the literature the persistence of DOIs is something you
should be able to rely upon, both for people clicking on links in web browsers
and developers getting metadata from those DOIs. The whole rationale of the
DOI system is a single, globally unique identifier for each article, and that
these DOIs persist even when the publisher of the content changes. If this
property doesn''t hold, then why would a developer such as myself invest effort
in linking using DOIs?</p>\n\n<p>Just for the record, I think CrossRef is
great and is a hugely important part of the scholarly landscape. There are
lots of things that I do that would be nearly impossible without CrossRef
and its tools. But cases like this where we get massive duplication of DOIs
when a publishers takes over an existing journal fundamentally breaks the
underlying model of stable, persistent identifiers.</p>","tags":["CrossRef","DOI","duplicates"],"language":"en","references":null},{"id":"https://doi.org/10.59350/d3dc0-7an69","uuid":"545c177f-cea5-4b79-b554-3ccae9c789d7","url":"https://iphylo.blogspot.com/2021/10/reflections-on-macroscope-tool-for-21st.html","title":"Reflections
on \"The Macroscope\" - a tool for the 21st Century?","summary":"This is a
guest post by Tony Rees. It would be difficult to encounter a scientist, or
anyone interested in science, who is not familiar with the microscope, a tool
for making objects visible that are otherwise too small to be properly seen
by the unaided eye, or to reveal otherwise invisible fine detail in larger
objects. A select few with a particular interest in microscopy may also have
encountered the Wild-Leica \"Macroscope\", a specialised type of benchtop
microscope optimised for...","date_published":"2021-10-07T12:38:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<p><img src=\"https://lh3.googleusercontent.com/-A99btr6ERMs/Vl1Wvjp2OtI/AAAAAAAAEFI/7bKdRjNG5w0/ytNkVT2U.jpg?imgmax=800\"
alt=\"YtNkVT2U\" title=\"ytNkVT2U.jpg\" border=\"0\" width=\"128\" height=\"128\"
style=\"float:right;\" /> This is a guest post by <a href=\"https://about.me/TonyRees\">Tony
Rees</a>.</p>\n\n<p>It would be difficult to encounter a scientist, or anyone
interested in science, who is not familiar with the microscope, a tool for
making objects visible that are otherwise too small to be properly seen by
the unaided eye, or to reveal otherwise invisible fine detail in larger objects.
A select few with a particular interest in microscopy may also have encountered
the Wild-Leica \"Macroscope\", a specialised type of benchtop microscope optimised
for low-power macro-photography. However in this overview I discuss the \"Macroscope\"
in a different sense, which is that of the antithesis to the microscope: namely
a method for visualizing subjects too large to be encompassed by a single
field of vision, such as the Earth or some subset of its phenomena (the biosphere,
for example), or conceptually, the universe.</p>\n\n<p><div class=\"separator\"
style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-xXxH9aJn1tY/YV7vnKGQWPI/AAAAAAAAgyQ/7CSJJ663Ry4IXtav54nLcLiI0kda5_L7ACLcBGAsYHQ/s500/2020045672.jpg\"
style=\"display: block; padding: 1em 0; text-align: center; clear: right;
float: right;\"><img alt=\"\" border=\"0\" height=\"320\" data-original-height=\"500\"
data-original-width=\"303\" src=\"https://1.bp.blogspot.com/-xXxH9aJn1tY/YV7vnKGQWPI/AAAAAAAAgyQ/7CSJJ663Ry4IXtav54nLcLiI0kda5_L7ACLcBGAsYHQ/s320/2020045672.jpg\"/></a></div>My
introduction to the term was via addresses given by Jesse Ausubel in the formative
years of the 2001-2010 <a href=\"http://www.coml.org\">Census of Marine Life</a>,
for which he was a key proponent. In Ausubel''s view, the Census would perform
the function of a macroscope, permitting a view of everything that lives in
the global ocean (or at least, that subset which could realistically be sampled
in the time frame available) as opposed to more limited subsets available
via previous data collection efforts. My view (which could, of course, be
wrong) was that his thinking had been informed by a work entitled \"Le macroscope,
vers une vision globale\" published in 1975 by the French thinker Joël de
Rosnay, who had expressed such a concept as being globally applicable in many
fields, including the physical and natural worlds but also extending to human
society, the growth of cities, and more. Yet again, some ecologists may also
have encountered the term, sometimes in the guise of \"Odum''s macroscope\",
as an approach for obtaining \"big picture\" analyses of macroecological processes
suitable for mathematical modelling, typically by elimination of fine detail
so that only the larger patterns remain, as initially advocated by Howard
T. Odum in his 1971 book \"Environment, Power, and Society\".</p>\n\n<p>From
the standpoint of the 21st century, it seems that we are closer to achieving
a \"macroscope\" (or possibly, multiple such tools) than ever before, based
on the availability of existing and continuing new data streams, improved
technology for data assembly and storage, and advanced ways to query and combine
these large streams of data to produce new visualizations, data products,
and analytical findings. I devote the remainder of this article to examples
where either particular workers have employed \"macroscope\" terminology to
describe their activities, or where potentially equivalent actions are taking
place without the explicit \"macroscope\" association, but are equally worthy
of consideration. To save space here, references cited here (most or all)
can be found via a Wikipedia article entitled \"<a href=\"https://en.wikipedia.org/wiki/Macroscope_(science_concept)\">Macroscope
(science concept)</a>\" that I authored on the subject around a year ago,
and have continued to add to on occasion as new thoughts or information come
to hand (see <a href=\"https://en.wikipedia.org/w/index.php?title=Macroscope_(science_concept)&offset=&limit=500&action=history\">edit
history for the article</a>).</p>\n\n<p>First, one can ask, what constitutes
a macroscope, in the present context? In the Wikipedia article I point to
a book \"Big Data - Related Technologies, Challenges and Future Prospects\"
by Chen <em>et al.</em> (2014) (<a href=\"https://doi.org/10.1007/978-3-319-06245-7\">doi:10.1007/978-3-319-06245-7</a>),
in which the \"value chain of big data\" is characterised as divisible into
four phases, namely data generation, data acquisition (aka data assembly),
data storage, and data analysis. To my mind, data generation (which others
may term acquisition, differently from the usage by Chen <em>et al.</em>)
is obviously the first step, but does not in itself constitute the macroscope,
except in rare cases - such as Landsat imagery, perhaps - where on its own,
a single co-ordinated data stream is sufficient to meet the need for a particular
type of \"global view\". A variant of this might be a coordinated data collection
program - such as that of the ten year Census of Marine Life - which might
produce the data required for the desired global view; but again, in reality,
such data are collected in a series of discrete chunks, in many and often
disparate data formats, and must be \"wrangled\" into a more coherent whole
before any meaningful \"macroscope\" functionality becomes available.</p>\n\n<p>Here
we come to what, in my view, constitutes the heart of the \"macroscope\":
an intelligently organized (i.e. indexable and searchable), coherent data
store or repository (where \"data\" may include imagery and other non-numeric
data forms, but much else besides). Taking the Census of Marine Life example,
the data repository for that project''s data (plus other available sources
as inputs) is the <a href=\"https://obis.org\">Ocean Biodiversity Information
System</a> or OBIS (previously the Ocean Biogeographic Information System),
which according to this view forms the \"macroscope\" for which the Census
data is a feed. (For non habitat-specific biodiversity data, <a href=\"https://www.gbif.org\">GBIF</a>
is an equivalent, and more extensive, operation). Other planetary scale \"macroscopes\",
by this definition (which may or may not have an explicit geographic, i.e.
spatial, component) would include inventories of biological taxa such as the
<a href=\"https://www.catalogueoflife.org\">Catalogue of Life</a> and so on,
all the way back to the pioneering compendia published by Linnaeus in the
eighteenth century; while for cartography and topographic imagery, the current
\"blockbuster\" of <a href=\"http://earth.google.com\">Google Earth</a> and
its predecessors also come well into public consciousness.</p>\n\n<p>In the
view of some workers and/or operations, both of these phases are precursors
to the real \"work\" of the macroscope which is to reveal previously unseen
portions of the \"big picture\" by means either of the availability of large,
synoptic datasets, or fusion between different data streams to produce novel
insights. Companies such as IBM and Microsoft have used phraseology such as:</p>\n\n<blockquote>By
2022 we will use machine-learning algorithms and software to help us organize
information about the physical world, helping bring the vast and complex data
gathered by billions of devices within the range of our vision and understanding.
We call this a \"macroscope\" – but unlike the microscope to see the very
small, or the telescope that can see far away, it is a system of software
and algorithms to bring all of Earth''s complex data together to analyze it
by space and time for meaning.\" (IBM)</blockquote>\n\n<blockquote>As the
Earth becomes increasingly instrumented with low-cost, high-bandwidth sensors,
we will gain a better understanding of our environment via a virtual, distributed
whole-Earth \"macroscope\"... Massive-scale data analytics will enable real-time
tracking of disease and targeted responses to potential pandemics. Our virtual
\"macroscope\" can now be used on ourselves, as well as on our planet.\" (Microsoft)
(references available via the Wikipedia article cited above).</blockquote>\n\n<p>Whether
or not the analytical capabilities described here are viewed as being an integral
part of the \"macroscope\" concept, or are maybe an add-on, is ultimately
a question of semantics and perhaps, personal opinion. Continuing the Census
of Marine Life/OBIS example, OBIS offers some (arguably rather basic) visualization
and summary tools, but also makes its data available for download to users
wishing to analyse it further according to their own particular interests;
using OBIS data in this manner, Mark Costello et al. in 2017 were able to
demarcate a finite number of data-supported marine biogeographic realms for
the first time (Costello et al. 2017: Nature Communications. 8: 1057. <a href=\"https://doi.org/10.1038/s41467-017-01121-2\">doi:10.1038/s41467-017-01121-2</a>),
a project which I was able to assist in a small way in an advisory capacity.
In a case such as this, perhaps the final function of the macroscope, namely
data visualization and analysis, was outsourced to the authors'' own research
institution. Similarly at an earlier phase, \"data aggregation\" can also
be virtual rather than actual, i.e. avoiding using a single physical system
to hold all the data, enabled by open web mapping standards WMS (web map service)
and WFS (web feature service) to access a set of distributed data stores,
e.g. as implemented on the portal for the <a href=\"https://portal.aodn.org.au/\">Australian
Ocean Data Network</a>.</p>\n\n<p>So, as we pass through the third decade
of the twenty-first century, what developments await us in the \"macroscope\"
area? In the biodiversity space, one can reasonably presume that the existing
\"macroscopic\" data assembly projects such as OBIS and GBIF will continue,
and hopefully slowly fill current gaps in their coverage - although in the
marine area, strategic new data collection exercises may be required (Census
2020, or 2025, anyone?), while (again hopefully), the Catalogue of Life will
continue its progress towards a \"complete\" species inventory for the biosphere.
The Landsat project, with imagery dating back to 1972, continues with the
launch of its latest satellite Landsat 9 just this year (21 September 2021)
with a planned mission duration for the next 5 years, so the \"macroscope\"
functionality of that project seems set to continue for the medium term at
least. Meanwhile the ongoing development of sensor networks, both on land
and in the ocean, offers an exciting new method of \"instrumenting the earth\"
to obtain much more real time data than has ever been available in the past,
offering scope for many more, use case-specific \"macroscopes\" to be constructed
that can fuse (e.g.) satellite imagery with much more that is happening at
a local level.</p>\n\n<p>So, the \"macroscope\" concept appears to be alive
and well, even though the nomenclature can change from time to time (IBM''s
\"Macroscope\", foreshadowed in 2017, became the \"IBM Pairs Geoscope\" on
implementation, and is now simply the \"Geospatial Analytics component within
the IBM Environmental Intelligence Suite\" according to available IBM publicity
materials). In reality this illustrates a new dichotomy: even if \"everyone\"
in principle has access to huge quantities of publicly available data, maybe
only a few well funded entities now have the computational ability to make
sense of it, and can charge clients a good fee for their services...</p>\n\n<p>I
present this account partly to give a brief picture of \"macroscope\" concepts
today and in the past, for those who may be interested, and partly to present
a few personal views which would be out of scope in a \"neutral point of view\"
article such as is required on Wikipedia; also to see if readers of this blog
would like to contribute further to discussion of any of the concepts traversed
herein.</p>","tags":["guest post","macroscope"],"language":"en","references":null},{"id":"https://doi.org/10.59350/2b1j9-qmw12","uuid":"37538c38-66e6-4ac4-ab5c-679684622ade","url":"https://iphylo.blogspot.com/2022/05/round-trip-from-identifiers-to.html","title":"Round
trip from identifiers to citations and back again","summary":"Note to self
(basically rewriting last year''s Finding citations of specimens). Bibliographic
data supports going from identifier to citation string and back again, so
we can do a \"round trip.\" 1. Given a DOI we can get structured data with
a simple HTTP fetch, then use a tool such as citation.js to convert that data
into a human-readable string in a variety of formats. Identifier ⟶ Structured
data ⟶ Human readable string 10.7717/peerj-cs.214 HTTP with...","date_published":"2022-05-27T16:34:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<p>Note to self (basically
rewriting last year''s <a href=\"https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html\">Finding
citations of specimens</a>).</p>\n\n<p>Bibliographic data supports going from
identifier to citation string and back again, so we can do a \"round trip.\"</p>\n\n<h2>1.</h2>\n\n<p>Given
a DOI we can get structured data with a simple HTTP fetch, then use a tool
such as <a href=\"https://citation.js.org\">citation.js</a> to convert that
data\n</th>\n<th>\n⟶\n</th>\n<th>\nHuman readable string\n</th>\n</tr>\n<tr>\n<td>\n10.7717/peerj-cs.214\n</td>\n<td>\nHTTP
data\n</th>\n<th>\n⟶\n</th>\n<th>\nHuman readable string\n</th>\n<tr>\n<tr>\n<td>\n10.7717/peerj-cs.214\n</td>\n<td>\nHTTP
with content-negotiation\n</td>\n<td>\nCSL-JSON\n</td>\n<td>\nCSL templates\n</td>\n<td
width=\"25%\">\nWillighagen, L. G. (2019). Citation.js: a format-independent,
modular bibliography tool for the browser and command line. PeerJ Computer
Science, 5, e214. https://doi.org/10.7717/peerj-cs.214\n</td>\n</tr>\n</table>\n\n<h2>2.</h2>\n\n<p>Going
in the reverse direction (string to identifier) is a little more challenging.
In the \"old days\" a typical strategy was to attempt to parse the citation
string into structured data (see <a href=\"https://anystyle.io\">AnyStyle</a>
for a nice example of this), then we could extract a tuple of (journal, volume,
starting page) and use that to query CrossRef to find if there was an article
with that tuple, which gave us the DOI.</p>\n\n<table>\n<tr>\n<th>\nIdentifier\n</th>\n<th>\n⟵\n</th>\n<th>\nStructured
data\n</th>\n<th>\n⟵\n</th>\n<th>\nHuman readable string\n</th>\n</tr>\n<tr>\n<td>\n10.7717/peerj-cs.214\n</td>\n<td>\nOpenURL
query\n</td>\n<td>\njournal, volume, start page\n</td>\n<td>\nCitation parser
\n</td>\n<td width=\"25%\">\nWillighagen, L. G. (2019). Citation.js: a format-independent,
modular bibliography tool for the browser and command line. PeerJ Computer
Science, 5, e214. https://doi.org/10.7717/peerj-cs.214\n</td>\n</tr>\n</table>\n\n<h2>3.</h2>\n\n<p>Another
strategy is to take all the citations strings for each DOI, index those in
a search engine, then just use a simple search to find the best match to your
citation string, and hence the DOI. This is what <a href=\"https://search.crossref.org\">https://search.crossref.org</a>
does.</p>\n\n<table>\n<tr>\n<th>\nIdentifier\n</th>\n<th>\n⟵\n</th>\n<th>\nHuman
readable string\n</th>\n</tr>\n<tr>\n<td>\n10.7717/peerj-cs.214\n</td>\n<td>\nsearch\n</td>\n<td
width=\"50%\">\nWillighagen, L. G. (2019). Citation.js: a format-independent,
modular bibliography tool for the browser and command line. PeerJ Computer
Science, 5, e214. https://doi.org/10.7717/peerj-cs.214\n</td>\n</tr>\n</table>\n\n<p>At
the moment my work on material citations (i.e., lists of specimens in taxonomic
papers) is focussing on 1 (generating citations from specimen data in GBIF)
and 2 (parsing citations into structured data).</p>","tags":["citation","GBIF","material
examined","specimen codes"],"language":"en","references":null},{"id":"https://doi.org/10.59350/3s376-6bm21","uuid":"62e7b438-67a3-44ac-a66d-3f5c278c949e","url":"https://iphylo.blogspot.com/2022/02/deduplicating-bibliographic-data.html","title":"Deduplicating
bibliographic data","summary":"There are several instances where I have a
collection of references that I want to deduplicate and merge. For example,
in Zootaxa has no impact factor I describe a dataset of the literature cited
by articles in the journal Zootaxa. This data is available on Figshare (https://doi.org/10.6084/m9.figshare.c.5054372.v4),
as is the equivalent dataset for Phytotaxa (https://doi.org/10.6084/m9.figshare.c.5525901.v1).
Given that the same articles may be cited many times, these datasets have
lots of...","date_published":"2022-02-03T15:09:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<p>There are several instances
where I have a collection of references that I want to deduplicate and merge.
For example, in <a href=\"https://iphylo.blogspot.com/2020/07/zootaxa-has-no-impact-factor.html\">Zootaxa
has no impact factor</a> I describe a dataset of the literature cited by articles
in the journal <i>Zootaxa</i>. This data is available on Figshare (<a href=\"https://doi.org/10.6084/m9.figshare.c.5054372.v4\">https://doi.org/10.6084/m9.figshare.c.5054372.v4</a>),
as is the equivalent dataset for <i>Phytotaxa</i> (<a href=\"https://doi.org/10.6084/m9.figshare.c.5525901.v1\">https://doi.org/10.6084/m9.figshare.c.5525901.v1</a>).
Given that the same articles may be cited many times, these datasets have
lots of duplicates. Similarly, articles in <a href=\"https://species.wikimedia.org\">Wikispecies</a>
often have extensive lists of references cited, and the same reference may
appear on multiple pages (for an initial attempt to extract these references
see <a href=\"https://doi.org/10.5281/zenodo.5801661\">https://doi.org/10.5281/zenodo.5801661</a>
and <a href=\"https://github.com/rdmpage/wikispecies-parser\">https://github.com/rdmpage/wikispecies-parser</a>).</p>\n\n<p>There
are several reasons I want to merge these references. If I want to build a
citation graph for <i>Zootaxa</i> or <i>Phytotaxa</i> I need to merge references
that are the same so that I can accurately count citations. I am also interested
in harvesting the metadata to help find those articles in the <a href=\"https://www.biodiversitylibrary.org\">Biodiversity
Heritage Library</a> (BHL), and the literature cited section of scientific
articles is a potential goldmine of bibliographic metadata, as is Wikispecies.</p>\n\n<p>After
various experiments and false starts I''ve created a repository <a href=\"https://github.com/rdmpage/bib-dedup\">https://github.com/rdmpage/bib-dedup</a>
to host a series of PHP scripts to deduplicate bibliographic data. I''ve
settled on using CSL-JSON as the format for bibliographic data. Because deduplication
relies on comparing pairs of references, the standard format for most of the
scripts is a JSON array containing a pair of CSL-JSON objects to compare.
Below are the steps the code takes.</p>\n\n<h2>Generating pairs to compare</h2>\n\n<p>The
first step is to take a list of references and generate the pairs that will
be compared. I started with this approach as I wanted to explore machine learning
and wanted a simple format for training data, such as an array of two CSL-JSON
objects and an integer flag representing whether the two references were the
same or different.</p>\n\n<p>There are various ways to generate CSL-JSON for
a reference. I use a tool I wrote (see <a href=\"https://iphylo.blogspot.com/2021/07/citation-parsing-released.html\">Citation
parsing tool released</a>) that has a simple API where you parse one or more
references and it returns that reference as structured data in CSL-JSON.</p>\n\n<p>Attempting
to do all possible pairwise comparisons rapidly gets impractical as the number
of references increases, so we need some way to restrict the number of comparisons
we make. One approach I''ve explored is the “sorted neighbourhood method”
where we sort the references (for example by their title) then move a sliding
window down the list of references, comparing all references within that window.
This greatly reduces the number of pairwise comparisons. So the first step
is to sort the references, then run a sliding window over them, output all
the pairs in each window (ignoring pairwise comparisons already made in
a previous window). Other methods of \"blocking\" could also be used, such
as only including references in a particular year, or a particular journal.</p>\n\n<p>So,
the output of this step is a set of JSON arrays, each with a pair of references
in CSL-JSON format. Each array is stored on a single line in the same file
in <a href=\"https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON_2\">line-delimited
JSON</a> (JSONL).</p>\n\n<h2>Comparing pairs</h2>\n\n<p>The next step is to
compare each pair of references and decide whether they are a match or not.
Initially I explored a machine learning approach used in the following paper:</p>\n\n<blockquote>\nWilson
DR. 2011. Beyond probabilistic record linkage: Using neural networks and complex
features to improve genealogical record linkage. In: The 2011 International
Joint Conference on Neural Networks. 9–14. DOI: <a href=\"https://doi.org/10.1109/IJCNN.2011.6033192\">10.1109/IJCNN.2011.6033192</a>\n</blockquote>\n\n<p>Initial
experiments using <a href=\"https://github.com/jtet/Perceptron\">https://github.com/jtet/Perceptron</a>
were promising and I want to play with this further, but I decided to skip
this for now and just use simple string comparison. So for each CSL-JSON object
I generate a citation string in the same format using CiteProc, then compute
the <a href=\"https://en.wikipedia.org/wiki/Levenshtein_distance\">Levenshtein
distance</a> between the two strings. By normalising this distance by the
length of the two strings being compared I can use an arbitrary threshold
to decide if the references are the same or not.</p>\n\n<h2>Clustering</h2>\n\n<p>For
this step we read the JSONL file produced above and record whether the two
references are a match or not. Assuming each reference has a unique identifier
(it need only be unique within the file) then we can use those identifiers to
record the clusters each reference belongs to. I do this using a <a href=\"https://en.wikipedia.org/wiki/Disjoint-set_data_structure\">Disjoint-set
data structure</a>. For each reference start with a graph where each node
represents a reference, and each node has a pointer to a parent node. Initially
the reference is its own parent. A simple implementation is to have an array
indexed by reference identifiers, where the value of each cell in the array
is the node''s parent.</p>\n\n<p>As we discover pairs we update the parents
of the nodes to reflect this, such that once all the comparisons are done
we have one or more sets of clusters corresponding to the references that
we think are the same. Another way to think of this is that we are getting
the components of a graph where each node is a reference and each pair of references
that match are connected by an edge.</p>\n\n<p>In the code I''m using I write
this graph in <a href=\"https://en.wikipedia.org/wiki/Trivial_Graph_Format\">Trivial
Graph Format</a> (TGF) which can be visualised using a tools such as <a href=\"https://www.yworks.com/products/yed\">yEd</a>.</p>\n\n<h2>Merging</h2>\n\n<p>Now
that we have a graph representing the sets of references that we think are
the same we need to merge them. This is where things get interesting as the
references are similar (by definition) but may differ in some details. The
paper below describes a simple Bayesian approach for merging records:</p>\n\n<blockquote>\nCouncill
IG, Li H, Zhuang Z, Debnath S, Bolelli L, Lee WC, Sivasubramaniam A, Giles
CL. 2006. Learning Metadata from the Evidence in an On-line Citation Matching
Scheme. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital
Libraries. JCDL ’06. New York, NY, USA: ACM, 276–285. DOI: <a href=\"https://doi.org/10.1145/1141753.1141817\">10.1145/1141753.1141817</a>.\n</blockquote>\n\n<p>So
the next step is to read the graph with the clusters, generate the sets of
bibliographic references that correspond to each cluster, then use the method
described in Councill et al. to produce a single bibliographic record for
that cluster. These records could then be used to, say, locate the corresponding
article in BHL, or populate Wikidata with missing references.</p>\n\n<p>Obviously
there is always the potential for errors, such as trying to merge references
that are not the same. As a quick and dirty check I flag as dubious any cluster
where the page numbers vary among members of the cluster. More sophisticated
checks are possible, especially if I go down the ML route (i.e., I would have
evidence for the probability that the same reference can disagree on some
aspects of metadata).</p>\n\n<h2>Summary</h2>\n\n<p>At this stage the code
is working well enough for me to play with and explore some example datasets.
The focus is on structured bibliographic metadata, but I may simplify things
and have a version that handles simple string matching, for example to cluster
together different abbreviations of the same journal name.</p>","tags":["data
cleaning","deduplication","Phytotaxa","Wikispecies","Zootaxa"],"language":"en","references":null},{"id":"https://doi.org/10.59350/ndtkv-6ve80","uuid":"e8e95aaf-bacb-4b5a-bf91-54e903526ab2","url":"https://iphylo.blogspot.com/2021/11/revisiting-rss-to-monitor-latests.html","title":"Revisiting
RSS to monitor the latest taxonomic research","summary":"Over a decade ago
RSS (RDF Site Summary or Really Simple Syndication) was attracting a lot of
interest as a way to integrate data across various websites. Many science
publishers would provide a list of their latest articles in XML in one of
three flavours of RSS (RDF, RSS, Atom). This led to tools such as uBioRSS
[1] and my own e-Biosphere Challenge: visualising biodiversity digitisation
in real time. It was a time of enthusiasm for aggregating lots of data, such
as the ill-fated PLoS...","date_published":"2021-11-23T20:53:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
both;\"><a href=\"https://1.bp.blogspot.com/-dsij6_nhdsc/SgHYU5MCwsI/AAAAAAAAAe8/9KN6Gm87sj0PCuJG-crvZoMbL8MJvusegCPcBGAYYCw/s257/feedicon.png\"
style=\"display: block; padding: 1em 0; text-align: center; clear: right;
float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"257\"
data-original-width=\"257\" src=\"https://1.bp.blogspot.com/-dsij6_nhdsc/SgHYU5MCwsI/AAAAAAAAAe8/9KN6Gm87sj0PCuJG-crvZoMbL8MJvusegCPcBGAYYCw/s200/feedicon.png\"/></a></div>\n<p>Over
a decade ago <a href=\"https://en.wikipedia.org/wiki/RSS\">RSS</a> (RDF Site
Summary or Really Simple Syndication) was attracting a lot of interest as
a way to integrate data across various websites. Many science publishers would
provide a list of their latest articles in XML in one of three flavours of
RSS (RDF, RSS, Atom). This led to tools such as <a href=\"http://ubio.org/rss/\">uBioRSS</a>
[<a href=\"#Leary2007\">1</a>] and my own <a href=\"https://iphylo.blogspot.com/2009/05/e-biosphere-challenge-visualising.html\">e-Biosphere
Challenge: visualising biodiversity digitisation in real time</a>. It was
a time of enthusiasm for aggregating lots of data, such as the <a href=\"https://iphylo.blogspot.com/2013/07/the-demise-of-plos-biodiversity-hub.html\">ill-fated</a>
PLoS Biodiversity Hub [<a href=\"#Mindell2011\">2</a>].</p>\n\n<p>Since I
seem to be condemned to revisit old ideas rather than come up with anything
new, I''ve been looking at providing a tool like the now defunct uBioRSS.
The idea is to harvest RSS feeds from journals (with an emphasis on taxonomic
and systematic journals), aggregate the results, and make them browsable by
taxon and geography. Here''s a sneak peek:</p>\n\n<div class=\"separator\"
style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-toYkBpS81tE/YZ1VY1DU1FI/AAAAAAAAg4E/yFMM4Xc3AEE8BjCj0jKO0sLtT9ZI-3k8ACLcBGAsYHQ/s1032/Screenshot%2B2021-11-23%2Bat%2B20.00.06.png\"
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
border=\"0\" width=\"400\" data-original-height=\"952\" data-original-width=\"1032\"
src=\"https://1.bp.blogspot.com/-toYkBpS81tE/YZ1VY1DU1FI/AAAAAAAAg4E/yFMM4Xc3AEE8BjCj0jKO0sLtT9ZI-3k8ACLcBGAsYHQ/s400/Screenshot%2B2021-11-23%2Bat%2B20.00.06.png\"/></a></div>\n\n<p>What
seems like a straightforward task quickly became a bit of a challenge. Not
all journals have RSS feeds (they seem to have become less widely supported
over time) so I need to think of alternative ways to get lists of recent articles.
These lists also need to be processed in various ways. There are three versions
of RSS, each with their own idiosyncrasies, so I need to standardise things
like dates. I also want to augment them with things like DOIs (often missing
from RSS feeds) and thumbnails for the articles (often available on publisher
websites but not the feeds). Then I need to index the content by taxon and
geography. For taxa I use a version of Patrick Leary''s \"taxonfinder\" (see
<a href=\"https://right-frill.glitch.me\">https://right-frill.glitch.me</a>)
to find names, then the <a href=\"https://index.globalnames.org\">Global Names
Index</a> to assign names found to the GBIF taxonomic hierarchy.</p>\n\n<p>Indexing
by geography proved harder. Typically <a href=\"https://en.wikipedia.org/wiki/Toponym_resolution#Geoparsing\">geoparsing</a>
involves taking a body of text and doing the following:\n<ul><li>Using named-entity
recognition <a href=\"https://en.wikipedia.org/wiki/Named-entity_recognition\">NER</a>
to identify named entities in the text (e.g., place names, people names, etc.).</li>\n<li>Using
a gazetteer of geographic names <a href=\"http://www.geonames.org\">GeoNames</a>
to try and match the place names found by NER.</li>\n</ul></p>\n\n<p>An example
of such a parser is the <a href=\"https://www.ltg.ed.ac.uk/software/geoparser/\">Edinburgh
Geoparser</a>. Typically geoparsing software can be large and tricky to install,
especially if you are looking to make your installation publicly accessible.
Geoparsing services seem to have a short half-life (e.g., <a href=\"https://geoparser.io\">Geoparser.io</a>),
perhaps because they are so useful they quickly get swamped by users.</p>\n\n<p>Bearing
this in mind, the approach I’ve taken here is to create a very simple geoparser
that is focussed on fairly large areas, especially those relevant to biodiversity,
and is aimed at geoparsing text such as abstracts of scientific papers. I''ve
created a small database of places by harvesting data from Wikidata, then
I use the \"flash text\" algorithm [<a href=\"#Singh2017\">3</a>] to find
geographic places. This approach uses a <a href=\"https://en.wikipedia.org/wiki/Trie\">trie</a>
to store the place names. All I do is walk through the text seeing whether
the current word matches a place name (or the start of one) in the trie, then
moving on. This is very quick and seems to work quite well.</p>\n\n<p>Given
that I need to aggregate data from a lot of sources, apply various transformations
to that data, then merge it, there are a lot of moving parts. I started playing
with a \"NoCode\" platform for creating workflows, in this case <a href=\"https://n8n.io\">n8n</a>
(in many ways reminiscent of the now defunct <a href=\"https://en.wikipedia.org/wiki/Yahoo!_Pipes\">Yahoo
Pipes</a>). This was quite fun for a while, but after lots of experimentation
I moved back to writing code to aggregate the data into a CouchDB database.
CouchDB is one of the NoSQL databases that I really like as it has a great
interface, and makes queries very easy to do once you get your head around
how it works.</p>\n\n<p>So the end result of this is \"BioRSS\" <a href=\"https://biorss.herokuapp.com\">https://biorss.herokuapp.com</a>.
The interface comprises a stream of articles listed from newest to oldest,
with a treemap and a geographic map on the left. You can use these to filter
the articles by taxonomic group and/or country. For example the screen shot
is showing arthropods from China (in this case from a month or two ago in
the journal <i>ZooKeys</i>). As much fun as the interface has been to construct,
|
1283
|
+
in many ways I don''t really want to spend time making an interface. For each
|
1284
|
+
combination of taxon and country I provide a RSS feed so if you have a favour
|
1285
|
+
feed reader you can grab the feed and view it there. As BioRSS updates the
|
1286
|
+
data your feed reader should automatically update the feed. This means that
|
1287
|
+
you can have a feed that monitors, say, new papers on spiders in China.</p>\n\n<p>In
|
1288
|
+
the spirit of \"release early and release often\" this is an early version
|
1289
|
+
of this app. I need to add a lot more feeds, back date them to bring in older
|
1290
|
+
content, and I also want to make use of aggregators such as PubMed, CrossRef,
|
1291
|
+
and Google Scholar. The existence of these tools is, I suspect, one reason
|
1292
|
+
why RSS feeds are less common than they used to be.</p>\n\n<p>So, if this
|
1293
|
+
sounds useful please take it for a spin at <a href=\"https://biorss.herokuapp.com\">https://biorss.herokuapp.com</a>.
|
1294
|
+
Feedback is welcome, especially suggestions for journals to harvest and add
|
1295
|
+
to the news feed. Ultimately I''d like to have sufficient coverage of the
|
1296
|
+
taxonomic literature so that BioRSS becomes a place where we can go to find
|
1297
|
+
the latest papers on any taxon of interest.</p>\n\n<h2>References</h2>\n\n<blockquote>\n<a
|
1298
|
+
name=\"Leary2007\">1.</a> Patrick R. Leary, David P. Remsen, Catherine N.
|
1299
|
+
Norton, David J. Patterson, Indra Neil Sarkar, uBioRSS: Tracking taxonomic
|
1300
|
+
literature using RSS, Bioinformatics, Volume 23, Issue 11, June 2007, Pages
|
1301
|
+
1434–1436, <a href=\"https://doi.org/10.1093/bioinformatics/btm109\">https://doi.org/10.1093/bioinformatics/btm109</a>\n</blockquote>\n\n<blockquote><a
|
1302
|
+
name=\"Mindell2011\">2.</a> Mindell, D. P., Fisher, B. L., Roopnarine, P.,
|
1303
|
+
Eisen, J., Mace, G. M., Page, R. D. M., & Pyle, R. L. (2011). Aggregating,
|
1304
|
+
Tagging and Integrating Biodiversity Research. PLoS ONE, 6(8), e19491. <a
|
1305
|
+
href=\"https://doi.org/10.1371/journal.pone.0019491\">doi:10.1371/journal.pone.0019491</a>\n</blockquote>\n\n<blockquote><a
|
1306
|
+
name=\"Singh2017\">3.</a> Singh, V. (2017). Replace or Retrieve Keywords In
|
1307
|
+
Documents at Scale. CoRR, abs/1711.00046. <a href=\"http://arxiv.org/abs/1711.00046\">http://arxiv.org/abs/1711.00046</a>\n\n</blockquote>","tags":["geocoding","NoCode","RSS"],"language":"en","references":[{"doi":"https://doi.org/10.1093/bioinformatics/btm109","key":"ref1"},{"doi":"https://doi.org/10.1371/journal.pone.0019491","key":"ref2"},{"key":"ref3","url":"http://arxiv.org/abs/1711.00046"}]},{"id":"https://doi.org/10.59350/gf1dw-n1v47","uuid":"a41163e0-9c9a-41e0-a141-f772663f2f32","url":"https://iphylo.blogspot.com/2023/03/dugald-stuart-page-1936-2022.html","title":"Dugald
|
1308
|
+
Stuart Page 1936-2022","summary":"My dad died last weekend. Below is a notice
|
1309
|
+
in today''s New Zealand Herald. I''m in New Zealand for his funeral. Don''t
|
1310
|
+
really have the words for this right now.","date_published":"2023-03-14T03:00:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
|
1311
|
+
(Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
|
1312
|
+
both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZweukxntl7R5jnk3knVFVrqZ5RxC7mPZBV4gKeDIglbFzs2O442nbxqs8t8jV2tLqCU24K6gS32jW-Pe8q3O_5JR1Ms3qW1aQAZ877cKkFfcUydqUba9HsgNlX-zS9Ne92eLxRGS8F-lStTecJw2oalp3u58Yoc0oM7CUin5LKPeFIJ7Rzg/s3454/_DSC5106.jpg\"
|
1313
|
+
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
1314
|
+
border=\"0\" width=\"400\" data-original-height=\"2582\" data-original-width=\"3454\"
|
1315
|
+
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZweukxntl7R5jnk3knVFVrqZ5RxC7mPZBV4gKeDIglbFzs2O442nbxqs8t8jV2tLqCU24K6gS32jW-Pe8q3O_5JR1Ms3qW1aQAZ877cKkFfcUydqUba9HsgNlX-zS9Ne92eLxRGS8F-lStTecJw2oalp3u58Yoc0oM7CUin5LKPeFIJ7Rzg/s400/_DSC5106.jpg\"/></a></div>\n\nMy
|
1316
|
+
dad died last weekend. Below is a notice in today''s New Zealand Herald. I''m
|
1317
|
+
in New Zealand for his funeral. Don''t really have the words for this right
|
1318
|
+
now.\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRUTOFF1VWHCl8dg3FQuaWy5LM7aX8IivdRpTtzgrdQTEymsA5bLTZE3cSQf1WQIP3XrC46JsLScP8BxTK9C5a-B1i51yg8WGSJD0heJVaoDLnerv0lD1o3qloDjqEuuyfX4wagHB5YYBmjWnGeVQvyYVngvDDf9eM6pmMtZ7x94Y4jSVrug/s3640/IMG_2870.jpeg\"
|
1319
|
+
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
1320
|
+
border=\"0\" height=\"320\" data-original-height=\"3640\" data-original-width=\"1391\"
|
1321
|
+
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRUTOFF1VWHCl8dg3FQuaWy5LM7aX8IivdRpTtzgrdQTEymsA5bLTZE3cSQf1WQIP3XrC46JsLScP8BxTK9C5a-B1i51yg8WGSJD0heJVaoDLnerv0lD1o3qloDjqEuuyfX4wagHB5YYBmjWnGeVQvyYVngvDDf9eM6pmMtZ7x94Y4jSVrug/s320/IMG_2870.jpeg\"/></a></div>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/c79vq-7rr11","uuid":"3cb94422-5506-4e24-a41c-a250bb521ee0","url":"https://iphylo.blogspot.com/2021/12/graphql-for-wikidata-wikicite.html","title":"GraphQL
|
1322
|
+
for WikiData (WikiCite)","summary":"I''ve released a very crude GraphQL endpoint
|
1323
|
+
for WikiData. More precisely, the endpoint is for a subset of the entities
|
1324
|
+
that are of interest to WikiCite, such as scholarly articles, people, and
|
1325
|
+
journals. There is a crude demo at https://wikicite-graphql.herokuapp.com.
|
1326
|
+
The endpoint itself is at https://wikicite-graphql.herokuapp.com/gql.php.
|
1327
|
+
There are various ways to interact with the endpoint, personally I like the
|
1328
|
+
Altair GraphQL Client by Samuel Imolorhe. As I''ve mentioned earlier it''s
|
1329
|
+
taken...","date_published":"2021-12-20T13:16:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
|
1330
|
+
(Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
|
1331
|
+
both;\"><a href=\"https://blogger.googleusercontent.com/img/a/AVvXsEh7WW8TlfN2u3xe_E-sH5IK6AWYnAoWaKrP2b32UawUeqMpPlq6ZFk5BtJVZsMmCNh5j3QRsTj5H0Ee55RRGntnc9yjj_mNB8KHmH2dzocCgyLS2VFOxsBji6u4Ey6qxlDAnT-zrsBpnDcTchbhgt1x0Sf7RkmIMkS1y4-_3KCQian-SeIF-g=s1000\"
|
1332
|
+
style=\"display: block; padding: 1em 0; text-align: center; clear: right;
|
1333
|
+
float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"1000\"
|
1334
|
+
data-original-width=\"1000\" src=\"https://blogger.googleusercontent.com/img/a/AVvXsEh7WW8TlfN2u3xe_E-sH5IK6AWYnAoWaKrP2b32UawUeqMpPlq6ZFk5BtJVZsMmCNh5j3QRsTj5H0Ee55RRGntnc9yjj_mNB8KHmH2dzocCgyLS2VFOxsBji6u4Ey6qxlDAnT-zrsBpnDcTchbhgt1x0Sf7RkmIMkS1y4-_3KCQian-SeIF-g=s200\"/></a></div><p>I''ve
|
1335
|
+
released a very crude GraphQL endpoint for WikiData. More precisely, the endpoint
|
1336
|
+
is for a subset of the entities that are of interest to WikiCite, such as
|
1337
|
+
scholarly articles, people, and journals. There is a crude demo at <a href=\"https://wikicite-graphql.herokuapp.com\">https://wikicite-graphql.herokuapp.com</a>.
|
1338
|
+
The endpoint itself is at <a href=\"https://wikicite-graphql.herokuapp.com/gql.php\">https://wikicite-graphql.herokuapp.com/gql.php</a>.
|
1339
|
+
There are various ways to interact with the endpoint, personally I like the
|
1340
|
+
<a href=\"https://altair.sirmuel.design\">Altair GraphQL Client</a> by <a
|
1341
|
+
href=\"https://github.com/imolorhe\">Samuel Imolorhe</a>.</p>\n\n<p>As I''ve
|
1342
|
+
<a href=\"https://iphylo.blogspot.com/2021/04/it-been-while.html\">mentioned
|
1343
|
+
earlier</a> it''s taken me a while to see the point of GraphQL. But it is
|
1344
|
+
clear it is gaining traction in the biodiversity world (see for example the
|
1345
|
+
<a href=\"https://dev.gbif.org/hosted-portals.html\">GBIF Hosted Portals</a>)
|
1346
|
+
so it''s worth exploring. My take on GraphQL is that it is a way to create
|
1347
|
+
a self-describing API that someone developing a web site can use without them
|
1348
|
+
having to bury themselves in the gory details of how data is internally modelled.
|
1349
|
+
For example, WikiData''s query interface uses SPARQL, a powerful language
|
1350
|
+
that has a steep learning curve (in part because of the administrative overhead
|
1351
|
+
brought by RDF namespaces, etc.). In my previous SPARQL-based projects such
|
1352
|
+
as <a href=\"https://ozymandias-demo.herokuapp.com\">Ozymandias</a> and <a
|
1353
|
+
href=\"http://alec-demo.herokuapp.com\">ALEC</a> I have either returned SPARQL
|
1354
|
+
results directly (Ozymandias) or formatted SPARQL results as <a href=\"https://schema.org/DataFeed\">schema.org
|
1355
|
+
DataFeeds</a> (equivalent to RSS feeds) (ALEC). Both approaches work, but
|
1356
|
+
they are project-specific and if anyone else tried to build based on these
|
1357
|
+
projects they might struggle for figure out what was going on. I certainly
|
1358
|
+
struggle, and I wrote them!</p>\n\n<p>So it seems worthwhile to explore this
|
1359
|
+
approach a little further and see if I can develop a GraphQL interface that
|
1360
|
+
can be used to build the sort of rich apps that I want to see. The demo I''ve
|
1361
|
+
created uses SPARQL under the hood to provide responses to the GraphQL queries.
|
1362
|
+
So in this sense it''s not replacing SPARQL, it''s simply providing a (hopefully)
|
1363
|
+
simpler overlay on top of SPARQL so that we can retrieve the data we want
|
1364
|
+
without having to learn the intricacies of SPARQL, nor how Wikidata models
|
1365
|
+
publications and people.</p>","tags":["GraphQL","SPARQL","WikiCite","Wikidata"],"language":"en","references":null},{"id":"https://doi.org/10.59350/ymc6x-rx659","uuid":"0807f515-f31d-4e2c-9e6f-78c3a9668b9d","url":"https://iphylo.blogspot.com/2022/09/dna-barcoding-as-intergenerational.html","title":"DNA
|
1366
|
+
barcoding as intergenerational transfer of taxonomic knowledge","summary":"I
|
1367
|
+
tweeted about this but want to bookmark it for later as well. The paper “A
|
1368
|
+
molecular-based identification resource for the arthropods of Finland” doi:10.1111/1755-0998.13510
|
1369
|
+
contains the following: …the annotated barcode records assembled by FinBOL
|
1370
|
+
participants represent a tremendous intergenerational transfer of taxonomic
|
1371
|
+
knowledge … the time contributed by current taxonomists in identifying and
|
1372
|
+
contributing voucher specimens represents a great gift to future generations
|
1373
|
+
who will benefit...","date_published":"2022-09-14T10:12:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
|
1374
|
+
(Roderic Page)"}],"image":null,"content_html":"<p>I <a href=\"https://twitter.com/rdmpage/status/1569738844416638981?s=21&t=9OVXuoUEwZtQt-Ldzlutfw\">tweeted
|
1375
|
+
about this</a> but want to bookmark it for later as well. The paper “A molecular-based
|
1376
|
+
identification resource for the arthropods of Finland” <a href=\"https://doi.org/10.1111/1755-0998.13510\">doi:10.1111/1755-0998.13510</a>
|
1377
|
+
contains the following:</p>\n<blockquote>\n<p>…the annotated barcode records
|
1378
|
+
assembled by FinBOL participants represent a tremendous <mark>intergenerational
|
1379
|
+
transfer of taxonomic knowledge</mark> … the time contributed by current taxonomists
|
1380
|
+
in identifying and contributing voucher specimens represents a great gift
|
1381
|
+
to future generations who will benefit from their expertise when they are
|
1382
|
+
no longer able to process new material.</p>\n</blockquote>\n<p>I think this
|
1383
|
+
is a very clever way to characterise the project. In an age of machine learning
|
1384
|
+
this may be commonest way to share knowledge , namely as expert-labelled training
|
1385
|
+
data used to build tools for others. Of course, this means the expertise itself
|
1386
|
+
may be lost, which has implications for updating the models if the data isn’t
|
1387
|
+
complete. But it speaks to Charles Godfrey’s theme of <a href=\"https://biostor.org/reference/250587\">“Taxonomy
|
1388
|
+
as information science”</a>.</p>\n<p>Note that the knowledge is also transformed
|
1389
|
+
in the sense that the underlying expertise of interpreting morphology, ecology,
|
1390
|
+
behaviour, genomics, and the past literature is not what is being passed on.
|
1391
|
+
Instead it is probabilities that a DNA sequence belongs to a particular taxon.</p>\n<p>This
|
1392
|
+
feels is different to, say iNaturalist, where there is a machine learning
|
1393
|
+
model to identify images. In that case, the model is built on something the
|
1394
|
+
community itself has created, and continues to create. Yes, the underlying
|
1395
|
+
idea is that same: “experts” have labelled the data, a model is trained, the
|
1396
|
+
model is used. But the benefits of the <a href=\"https://www.inaturalist.org\">iNaturalist</a>
|
1397
|
+
model are immediately applicable to the people whose data built the model.
|
1398
|
+
In the case of barcoding, because the technology itself is still not in the
|
1399
|
+
hands of many (relative to, say, digital imaging), the benefits are perhaps
|
1400
|
+
less tangible. Obviously researchers working with environmental DNA will find
|
1401
|
+
it very useful, but broader impact may await the arrival of citizen science
|
1402
|
+
DNA barcoding.</p>\n<p>The other consideration is whether the barcoding helps
|
1403
|
+
taxonomists. Is it to be used to help prioritise future work (“we are getting
|
1404
|
+
lots of unknown sequences in these taxa, lets do some taxonomy there”), or
|
1405
|
+
is it simply capturing the knowledge of a generation that won’t be replaced:</p>\n<blockquote>\n<p>The
|
1406
|
+
need to capture such knowledge is essential because there are, for example,
|
1407
|
+
no young Finnish taxonomists who can critically identify species in many key
|
1408
|
+
groups of ar- thropods (e.g., aphids, chewing lice, chalcid wasps, gall midges,
|
1409
|
+
most mite lineages).</p>\n</blockquote>\n<p>The cycle of collect data, test
|
1410
|
+
and refine model, collect more data, rinse and repeat that happens with iNaturalist
|
1411
|
+
creates a feedback loop. It’s not clear that a similar cycle exists for DNA
|
1412
|
+
barcoding.</p>\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":null},{"id":"https://doi.org/10.59350/enxas-arj18","uuid":"ab5a6e04-d55e-4901-8269-9eea65ce7178","url":"https://iphylo.blogspot.com/2022/08/can-we-use-citation-graph-to-measure.html","title":"Can
|
1413
|
+
we use the citation graph to measure the quality of a taxonomic database?","summary":"More
|
1414
|
+
arm-waving notes on taxonomic databases. I''ve started to add data to ChecklistBank
|
1415
|
+
and this has got me thinking about the issue of data quality. When you add
|
1416
|
+
data to ChecklistBank you are asked to give a measure of confidence based
|
1417
|
+
on the Catalogue of Life Checklist Confidence system of one - five stars:
|
1418
|
+
★ - ★★★★★. I''m scepetical about the notion of confidence or \"trust\" when
|
1419
|
+
it is reduced to a star system (see also Can you trust EOL?). I could literally
|
1420
|
+
pick any number of stars, there''s...","date_published":"2022-08-24T14:33:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
|
1421
|
+
(Roderic Page)"}],"image":null,"content_html":"<p>More arm-waving notes on
|
1422
|
+
taxonomic databases. I''ve started to add data to <a href=\"https://www.checklistbank.org\">ChecklistBank</a>
|
1423
|
+
and this has got me thinking about the issue of data quality. When you add
|
1424
|
+
data to ChecklistBank you are asked to give a measure of confidence based
|
1425
|
+
on the <a href=\"https://www.catalogueoflife.org/about/glossary.html#checklist-confidence\">Catalogue
|
1426
|
+
of Life Checklist Confidence</a> system of one - five stars: ★ - ★★★★★. I''m
|
1427
|
+
scepetical about the notion of confidence or \"trust\" when it is reduced
|
1428
|
+
to a star system (see also <a href=\"https://iphylo.blogspot.com/2012/06/can-you-trust-eol.html\">Can
|
1429
|
+
you trust EOL?</a>). I could literally pick any number of stars, there''s
|
1430
|
+
no way to measure what number of stars is appropriate. This feeds into my
|
1431
|
+
biggest reservation about the <a href=\"https://www.catalogueoflife.org\">Catalogue
|
1432
|
+
of Life</a>, it''s almost entirely authority based, not evidence based. That
|
1433
|
+
is, rather than give us evidence for why a particular taxon is valid, we are
|
1434
|
+
(mostly) just given a list of taxa are asked to accept those as gospel, based
|
1435
|
+
on assertions by one or more authorities. I''m not necessarly doubting the
|
1436
|
+
knowledge of those making these lists, it''s just that I think we need to
|
1437
|
+
do better than \"these are the accepted taxa because I say so\" implict in
|
1438
|
+
the Catalogue of Life.\n</p>\n\n<p>So, is there any way we could objectively
|
1439
|
+
measure the quality of a particular taxonomic checklist? Since I have a long
|
1440
|
+
standing interest in link the primary taxonomic litertaure to names in databases
|
1441
|
+
(since that''s where the evidence is), I keep wondering whether measures based
|
1442
|
+
on that literture could be developed. \n</p>\n<p>\nI recently revisited the
|
1443
|
+
fascinating (and quite old) literature on rates of synonymy:\n</p>\n<blockquote>\nGaston
|
1444
|
+
Kevin J. and Mound Laurence A. 1993 Taxonomy, hypothesis testing and the
|
1445
|
+
biodiversity crisisProc. R. Soc. Lond. B.251139–142\n<a href=\"http://doi.org/10.1098/rspb.1993.0020\">http://doi.org/10.1098/rspb.1993.0020</a>\n</blockquote>\n \n<blockquote>\n Andrew
|
1446
|
+
R. Solow, Laurence A. Mound, Kevin J. Gaston, Estimating the Rate of Synonymy,
|
1447
|
+
Systematic Biology, Volume 44, Issue 1, March 1995, Pages 93–96, <a href=\"https://doi.org/10.1093/sysbio/44.1.93\">https://doi.org/10.1093/sysbio/44.1.93</a>\n</blockquote>\n\n</p>\n\n<p>\nA
|
1448
|
+
key point these papers make is that the observed rate of synonymy is quite
|
1449
|
+
high (that is, many \"new species\" end up being merged with already known
|
1450
|
+
species), and that because it can take time to discover that a species is
|
1451
|
+
a synonym the actual rate may be even higher. In other words, in diagrams
|
1452
|
+
like the one reproduced below, the reason the proportion of synonyms declines
|
1453
|
+
the nearer we get to the present day (this paper came out in 1995) is not
|
1454
|
+
because are are creating fewer synonyms but because we''ve not yet had time
|
1455
|
+
to do the work to uncover the remaining synonyms.\n</p>\n \n<div class=\"separator\"
|
1456
|
+
style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhgwvBcDUD_IJZyAa0gaqXB--ogJeVooV19I635N_w2OuztDFsy-ZPXEFuW5S1NUxxUEHmN8pbdO9MPljIi5v0A2355kYotL1fKqdewP9kmWu9TfwtIJ3jD04SQjeF3SWK-yMVAx-rNc6TO43GIwmftPk6IghOrcmur6SoHe06ws7fEFAvxA/s621/Screenshot%202022-08-24%20at%2014.59.47.png\"
|
1457
|
+
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
1458
|
+
border=\"0\" width=\"400\" data-original-height=\"404\" data-original-width=\"621\"
|
1459
|
+
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhgwvBcDUD_IJZyAa0gaqXB--ogJeVooV19I635N_w2OuztDFsy-ZPXEFuW5S1NUxxUEHmN8pbdO9MPljIi5v0A2355kYotL1fKqdewP9kmWu9TfwtIJ3jD04SQjeF3SWK-yMVAx-rNc6TO43GIwmftPk6IghOrcmur6SoHe06ws7fEFAvxA/s400/Screenshot%202022-08-24%20at%2014.59.47.png\"/></a></div>\n\n<p>Put
|
1460
|
+
another way, these papers are arguing that real work of taxonomy is revision,
|
1461
|
+
not species discovery, especially since it''s not uncommon for > 50% of species
|
1462
|
+
in a taxon to end up being synonymised. Indeed, if a taxoomic group has few
|
1463
|
+
synonyms then these authors would argue that''s a sign of neglect. More revisionary
|
1464
|
+
work would likely uncover additional synonyms. So, what we need is a way to
|
1465
|
+
measure the amount of research on a taxonomic group. It occurs to me that
|
1466
|
+
we could use the citation graph as a way to tackle this. Lets imagine we have
|
1467
|
+
a set of taxa (say a family) and we have all the papers that described new
|
1468
|
+
species or undertook revisions (or both). The extensiveness of that work could
|
1469
|
+
be measured by the citation graph. For example, build the citation graph for
|
1470
|
+
those papers. How many original species decsriptions are not cited? Those
|
1471
|
+
species have been potentially neglected. How many large-scale revisions have
|
1472
|
+
there been (as measured by the numbers of taxonomic papers those revisions
|
1473
|
+
cite)? There are some interesting approaches to quantifying this, such as
|
1474
|
+
using <a href=\"https://en.wikipedia.org/wiki/HITS_algorithm\">hubs and authorities</a>.</p>\n \n \n<p>I''m
|
1475
|
+
aware that taxonomists have not had the happiest relationship with citations:\n \n<blockquote>\nPinto
|
1476
|
+
ÂP, Mejdalani G, Mounce R, Silveira LF, Marinoni L, Rafael JA. Are publications
|
1477
|
+
on zoological taxonomy under attack? R Soc Open Sci. 2021 Feb 10;8(2):201617.
|
1478
|
+
<a href=\"https://doi.org/10.1098/rsos.201617\">doi: 10.1098/rsos.201617</a>.
|
1479
|
+
PMID: 33972859; PMCID: PMC8074659.\n</blockquote>\n\nStill, I think there
|
1480
|
+
is an intriguing possibility here. For this approach to work, we need to have
|
1481
|
+
linked taxonomic names to publications, and have citation data for those publications.
|
1482
|
+
This is happening on various platforms. Wikidata, for example, is becoming
|
1483
|
+
a repository of the taxonomic literature, some of it with citation links.\n\n<blockquote>\nPage
|
1484
|
+
RDM. 2022. Wikidata and the bibliography of life. PeerJ 10:e13712 <a href=\"https://doi.org/10.7717/peerj.13712\">https://doi.org/10.7717/peerj.13712</a>\n</blockquote>\n\nTime
|
1485
|
+
for some experiments.\n</p>","tags":["Bibliography of Life","citation","synonymy","taxonomic
|
1486
|
+
databases"],"language":"en","references":null},{"id":"https://doi.org/10.59350/cbzgz-p8428","uuid":"a93134aa-8b33-4dc7-8cd4-76cdf64732f4","url":"https://iphylo.blogspot.com/2023/04/library-interfaces-knowledge-graphs-and.html","title":"Library
|
1487
|
+
interfaces, knowledge graphs, and Miller columns","summary":"Some quick notes
|
1488
|
+
on interface ideas for digital libraries and/or knowledge graphs. Recently
|
1489
|
+
there’s been something of an explosion in bibliographic tools to explore the
|
1490
|
+
literature. Examples include: Elicit which uses AI to search for and summarise
|
1491
|
+
papers _scite which uses AI to do sentiment analysis on citations (does paper
|
1492
|
+
A cite paper B favourably or not?) ResearchRabbit which uses lists, networks,
|
1493
|
+
and timelines to discover related research Scispace which navigates connections
|
1494
|
+
between...","date_published":"2023-04-25T13:01:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
|
1495
|
+
(Roderic Page)"}],"image":null,"content_html":"<p>Some quick notes on interface
|
1496
|
+
ideas for digital libraries and/or knowledge graphs.</p>\n<p>Recently there’s
|
1497
|
+
been something of an explosion in bibliographic tools to explore the literature.
|
1498
|
+
Examples include:</p>\n<ul>\n<li><a href=\"https://elicit.org\">Elicit</a>
|
1499
|
+
which uses AI to search for and summarise papers</li>\n<li><a href=\"https://scite.ai\">_scite</a>
|
1500
|
+
which uses AI to do sentiment analysis on citations (does paper A cite paper
|
1501
|
+
B favourably or not?)</li>\n<li><a href=\"https://www.researchrabbit.ai\">ResearchRabbit</a>
|
1502
|
+
which uses lists, networks, and timelines to discover related research</li>\n<li><a
|
1503
|
+
href=\"https://typeset.io\">Scispace</a> which navigates connections between
|
1504
|
+
papers, authors, topics, etc., and provides AI summaries.</li>\n</ul>\n<p>As
|
1505
|
+
an aside, I think these (and similar tools) are a great example of how bibliographic
|
1506
|
+
data such as abstracts, the citation graph and - to a lesser extent - full
|
1507
|
+
text - have become commodities. That is, what was once proprietary information
|
1508
|
+
is now free to anyone, which in turns means a whole ecosystem of new tools
|
1509
|
+
can emerge. If I was clever I’d be building a <a href=\"https://en.wikipedia.org/wiki/Wardley_map\">Wardley
|
1510
|
+
map</a> to explore this. Note that a decade or so ago reference managers like
|
1511
|
+
<a href=\"https://www.zotero.org\">Zotero</a> were made possible by publishers
|
1512
|
+
exposing basic bibliographic data on their articles. As we move to <a href=\"https://i4oc.org\">open
|
1513
|
+
citations</a> we are seeing the next generation of tools.</p>\n<p>Back to
|
1514
|
+
my main topic. As usual, rather than focus on what these tools do I’m more
|
1515
|
+
interested in how they <strong>look</strong>. I have history here, when the
|
1516
|
+
iPad came out I was intrigued by the possibilities it offered for displaying
|
1517
|
+
academic articles, as discussed <a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad.html\">here</a>,
|
1518
|
+
<a href=\"https://iphylo.blogspot.com/2010/09/viewing-scientific-articles-on-ipad.html\">here</a>,
|
1519
|
+
<a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad_24.html\">here</a>,
|
1520
|
+
<a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad_3052.html\">here</a>,
|
1521
|
+
and <a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad_31.html\">here</a>.
|
1522
|
+
ResearchRabbit looks like this:</p>\n<div style=\"padding:86.91% 0 0 0;position:relative;\"><iframe
|
1523
|
+
src=\"https://player.vimeo.com/video/820871442?h=23b05b0dae&badge=0&autopause=0&player_id=0&app_id=58479\"
|
1524
|
+
frameborder=\"0\" allow=\"autoplay; fullscreen; picture-in-picture\" allowfullscreen
|
1525
|
+
style=\"position:absolute;top:0;left:0;width:100%;height:100%;\" title=\"ResearchRabbit\"></iframe></div><script
|
1526
|
+
src=\"https://player.vimeo.com/api/player.js\"></script>\n<p>Scispace’s <a
|
1527
|
+
href=\"https://typeset.io/explore/journals/parassitologia-1ieodjwe\">“trace”
|
1528
|
+
view</a> looks like this:</p>\n<div style=\"padding:84.55% 0 0 0;position:relative;\"><iframe
|
1529
|
+
src=\"https://player.vimeo.com/video/820871348?h=2db7b661ef&badge=0&autopause=0&player_id=0&app_id=58479\"
|
1530
|
+
frameborder=\"0\" allow=\"autoplay; fullscreen; picture-in-picture\" allowfullscreen
|
1531
|
+
style=\"position:absolute;top:0;left:0;width:100%;height:100%;\" title=\"Scispace
|
1532
|
+
screencast\"></iframe></div><script src=\"https://player.vimeo.com/api/player.js\"></script>\n<p>What
|
1533
|
+
is interesting about both is that they display content from left to right
|
1534
|
+
in vertical columns, rather than the more common horizontal rows. This sort
|
1535
|
+
of display is sometimes called <a href=\"https://en.wikipedia.org/wiki/Miller_columns\">Miller
|
1536
|
+
columns</a> or a <a href=\"https://web.archive.org/web/20210726134921/http://designinginterfaces.com/firstedition/index.php?page=Cascading_Lists\">cascading
|
1537
|
+
list</a>.</p>\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBPnV9fRBcvm-BX5PjzfG5Cff9PerCLsTW8d5ZbsL6b41t7ypD7ovmcgfTf3b4b34mbq8NM4sfwOHkgEq32FLYnD497RFQD4HQmYmh5Eveu1zWdDVyKyDtPyE98QoxTaOEnLA5kK0fnl3dOOEgUvtVKlTZ8bt1gj2v_8tDRWl9f50ybyei3A/s1024/GNUstep-liveCD.png\"
|
1538
|
+
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
|
1539
|
+
border=\"0\" width=\"400\" data-original-height=\"768\" data-original-width=\"1024\"
|
1540
|
+
src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBPnV9fRBcvm-BX5PjzfG5Cff9PerCLsTW8d5ZbsL6b41t7ypD7ovmcgfTf3b4b34mbq8NM4sfwOHkgEq32FLYnD497RFQD4HQmYmh5Eveu1zWdDVyKyDtPyE98QoxTaOEnLA5kK0fnl3dOOEgUvtVKlTZ8bt1gj2v_8tDRWl9f50ybyei3A/s400/GNUstep-liveCD.png\"/></a></div>\n\n<p>By
|
1541
|
+
Gürkan Sengün (talk) - Own work, Public Domain, <a href=\"https://commons.wikimedia.org/w/index.php?curid=594715\">https://commons.wikimedia.org/w/index.php?curid=594715</a></p>\n<p>I’ve
|
1542
|
+
always found displaying a knowledge graph to be a challenge, as discussed
|
1543
|
+
<a href=\"https://iphylo.blogspot.com/2019/07/notes-on-collections-knowledge-graphs.html\">elsewhere
|
1544
|
+
on this blog</a> and in my paper on <a href=\"https://peerj.com/articles/6739/#p-29\">Ozymandias</a>.
|
1545
|
+
Miller columns enable one to drill down in increasing depth, but it doesn’t
|
1546
|
+
need to be a tree, it can be a path within a network. What I like about ResearchRabbit
|
1547
|
+
and the original Scispace interface is that they present the current item
|
1548
|
+
together with a list of possible connections (e.g., authors, citations) that
|
1549
|
+
you can drill down on. Clicking on these will result in a new column being
|
1550
|
+
appended to the right, with a view (typically a list) of the next candidates
|
1551
|
+
to visit. In graph terms, these are adjacent nodes to the original item. The
|
1552
|
+
clickable badges on each item can be thought of as sets of edges that have
|
1553
|
+
the same label (e.g., “authored by”, “cites”, “funded”, “is about”, etc.).
|
1554
|
+
Each of these nodes itself becomes a starting point for further exploration.
|
1555
|
+
Note that the original starting point isn’t privileged, other than being the
|
1556
|
+
starting point. That is, each time we drill down we are seeing the same type
|
1557
|
+
of information displayed in the same way. Note also that the navigation can
be though of as a <strong>card</strong> for a node, with <strong>buttons</strong>
grouping the adjacent nodes. When we click on an individual button, it expands
into a <strong>list</strong> in the next column. This can be thought of as
a preview for each adjacent node. Clicking on an element in the list generates
a new card (we are viewing a single node) and we get another set of buttons
corresponding to the adjacent nodes.</p>\n<p>One important behaviour in a
Miller column interface is that the current path can be pruned at any point.
If we go back (i.e., scroll to the left) and click on another tab on an item,
everything downstream of that item (i.e., to the right) gets deleted and replaced
by a new set of nodes. This could make retrieving a particular history of
browsing a bit tricky, but encourages exploration. Both Scispace and ResearchRabbit have
the ability to add items to a collection, so you can keep track of things
you discover.</p>\n<p>Lots of food for thought, I’m assuming that there is
some user interface/experience research on Miller columns. One thing to remember
is that Miller columns are most often associated with trees, but in this case
we are exploring a network. That means that potentially there is no limit
to the number of columns being generated as we wander through the graph. It
will be interesting to think about what the average depth is likely to be,
in other words, how deep down the rabbit hole will be go?</p>\n\n<h3>Update</h3>\n<p>Should
add link to David Regev''s explorations of <a href=\"https://medium.com/david-regev-on-ux/flow-browser-b730daf0f717\">Flow
Browser</a>.\n\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":["cards","flow","Knowledge
Graph","Miller column","RabbitResearch"],"language":"en","references":null},{"id":"https://doi.org/10.59350/t6fb9-4fn44","uuid":"8bc3fea6-cb86-4344-8dad-f312fbf58041","url":"https://iphylo.blogspot.com/2021/12/the-business-of-extracting-knowledge.html","title":"The
Business of Extracting Knowledge from Academic Publications","summary":"Markus
Strasser (@mkstra write a fascinating article entitled \"The Business of Extracting
Knowledge from Academic Publications\". I spent months working on domain-specific
search engines and knowledge discovery apps for biomedicine and eventually
figured that synthesizing &quot;insights&quot; or building knowledge graphs
by machine-reading the academic literature (papers) is *barely useful* :https://t.co/eciOg30Odc—
Markus Strasser (@mkstra) December 7, 2021 His TL;DR: TL;DR: I...","date_published":"2021-12-11T00:01:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<p>Markus Strasser (<a href=\"https://twitter.com/mkstra\">@mkstra</a>
write a fascinating article entitled <a href=\"https://markusstrasser.org/extracting-knowledge-from-literature/\">\"The
Business of Extracting Knowledge from Academic Publications\"</a>.</p>\n\n<blockquote
class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">I spent months working
on domain-specific search engines and knowledge discovery apps for biomedicine
and eventually figured that synthesizing &quot;insights&quot; or building
knowledge graphs by machine-reading the academic literature (papers) is *barely
useful* :<a href=\"https://t.co/eciOg30Odc\">https://t.co/eciOg30Odc</a></p>—
Markus Strasser (@mkstra) <a href=\"https://twitter.com/mkstra/status/1468334482113523716?ref_src=twsrc%5Etfw\">December
7, 2021</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
charset=\"utf-8\"></script>\n\n<p>His TL;DR:</p>\n\n<p><blockquote>\nTL;DR:
I worked on biomedical literature search, discovery and recommender web applications
for many months and concluded that extracting, structuring or synthesizing
\"insights\" from academic publications (papers) or building knowledge bases
from a domain corpus of literature has negligible value in industry.</p>\n\n<p>Close
to nothing of what makes science actually work is published as text on the
web.\n</blockquote></p>\n\n<p>After recounting the many problems of knowledge
extraction - including a swipe at nanopubs which \"are ... dead in my view
(without admitting it)\" - he concludes:</p>\n\n<p><blockquote>\nI’ve been
flirting with this entire cluster of ideas including open source web annotation,
semantic search and semantic web, public knowledge graphs, nano-publications,
knowledge maps, interoperable protocols and structured data, serendipitous
discovery apps, knowledge organization, communal sense making and academic
literature/publishing toolchains for a few years on and off ... nothing of
it will go anywhere.</p>\n\n<p>Don’t take that as a challenge. Take it as
a red flag and run. Run towards better problems.\n</blockquote></p>\n\n<p>Well
worth a read, and much food for thought.</p>","tags":["ai","business model","text
mining"],"language":"en","references":null},{"id":"https://doi.org/10.59350/463yw-pbj26","uuid":"dc829ab3-f0f1-40a4-b16d-a36dc0e34166","url":"https://iphylo.blogspot.com/2022/12/david-remsen.html","title":"David
Remsen","summary":"I heard yesterday from Martin Kalfatovic (BHL) that David
Remsen has died. Very sad news. It''s starting to feel like iPhylo might end
up being a list of obituaries of people working on biodiversity informatics
(e.g., Scott Federhen). I spent several happy visits at MBL at Woods Hole
talking to Dave at the height of the uBio project, which really kickstarted
large scale indexing of taxonomic names, and the use of taxonomic name finding
tools to index the literature. His work on uBio with David...","date_published":"2022-12-16T17:54:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<p>I heard yesterday from Martin
Kalfatovic (BHL) that David Remsen has died. Very sad news. It''s starting
to feel like iPhylo might end up being a list of obituaries of people working
on biodiversity informatics (e.g., <a href=\"https://iphylo.blogspot.com/2016/05/scott-federhen-rip.html\">Scott
Federhen</a>).</p>\n\n<p>I spent several happy visits at MBL at Woods Hole
talking to Dave at the height of the uBio project, which really kickstarted
large scale indexing of taxonomic names, and the use of taxonomic name finding
tools to index the literature. His work on uBio with David (\"Paddy\") Patterson
led to the <a href=\"https://eol.org\">Encyclopedia of Life</a> (EOL).</p>\n\n<p>A
number of the things I''m currently working on are things Dave started. For
example, I recently uploaded a version of his dataset for Nomenclator Zoologicus[1]
to <a href=\"https://www.checklistbank.org/dataset/126539/about\">ChecklistBank</a>
where I''m working on augmenting that original dataset by adding links to
the taxonomic literature. My <a href=\"https://biorss.herokuapp.com/?feed=Y291bnRyeT1XT1JMRCZwYXRoPSU1QiUyMkJJT1RBJTIyJTVE\">BioRSS
project</a> is essentially an attempt to revive uBioRSS[2] (see <a href=\"https://iphylo.blogspot.com/2021/11/revisiting-rss-to-monitor-latests.html\">Revisiting
RSS to monitor the latest taxonomic research</a>).</p>\n\n<p>I have fond memories
of those visits to Woods Hole. A very sad day indeed.</p>\n\n<p><b>Update:</b>
The David Remsen Memorial Fund has been set up on <a href=\"https://www.gofundme.com/f/david-remsen-memorial-fund\">GoFundMe</a>.</p>\n\n<p>1.
Remsen, D. P., Norton, C., & Patterson, D. J. (2006). Taxonomic Informatics
Tools for the Electronic Nomenclator Zoologicus. The Biological Bulletin,
210(1), 18–24. https://doi.org/10.2307/4134533</p>\n\n<p>2. Patrick R. Leary,
David P. Remsen, Catherine N. Norton, David J. Patterson, Indra Neil Sarkar,
uBioRSS: Tracking taxonomic literature using RSS, Bioinformatics, Volume 23,
Issue 11, June 2007, Pages 1434–1436, https://doi.org/10.1093/bioinformatics/btm109</p>","tags":["David
Remsen","obituary","uBio"],"language":"en","references":null},{"id":"https://doi.org/10.59350/pmhat-5ky65","uuid":"5891c709-d139-440f-bacb-06244424587a","url":"https://iphylo.blogspot.com/2021/10/problems-with-plazi-parsing-how.html","title":"Problems
with Plazi parsing: how reliable are automated methods for extracting specimens
from the literature?","summary":"The Plazi project has become one of the major
contributors to GBIF with some 36,000 datasets yielding some 500,000 occurrences
(see Plazi''s GBIF page for details). These occurrences are extracted from
taxonomic publication using automated methods. New data is published almost
daily (see latest treatments). The map below shows the geographic distribution
of material citations provided to GBIF by Plazi, which gives you a sense of
the size of the dataset. By any metric Plazi represents a...","date_published":"2021-10-25T11:10:00Z","date_modified":null,"authors":[{"url":null,"name":"noreply@blogger.com
(Roderic Page)"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
both;\"><a href=\"https://1.bp.blogspot.com/-oiqSkA53FI4/YXaRAUpRgFI/AAAAAAAAgzY/PbPEOh_VlhIE0aaqqhVQPLAOE5pRULwJACLcBGAsYHQ/s240/Rf7UoXTw_400x400.jpg\"
style=\"display: block; padding: 1em 0; text-align: center; clear: right;
float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"240\"
data-original-width=\"240\" src=\"https://1.bp.blogspot.com/-oiqSkA53FI4/YXaRAUpRgFI/AAAAAAAAgzY/PbPEOh_VlhIE0aaqqhVQPLAOE5pRULwJACLcBGAsYHQ/s200/Rf7UoXTw_400x400.jpg\"/></a></div><p>The
<a href=\"http://plazi.org\">Plazi</a> project has become one of the major
contributors to GBIF with some 36,000 datasets yielding some 500,000 occurrences
(see <a href=\"https://www.gbif.org/publisher/7ce8aef0-9e92-11dc-8738-b8a03c50a862\">Plazi''s
GBIF page</a> for details). These occurrences are extracted from taxonomic
publication using automated methods. New data is published almost daily (see
<a href=\"https://tb.plazi.org/GgServer/static/newToday.html\">latest treatments</a>).
The map below shows the geographic distribution of material citations provided
to GBIF by Plazi, which gives you a sense of the size of the dataset.</p>\n\n<div
class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-DCJ4HR8eej8/YXaRQnz22bI/AAAAAAAAgz4/AgRcree6jVgjtQL2ch7IXgtb_Xtx7fkngCPcBGAYYCw/s1030/Screenshot%2B2021-10-24%2Bat%2B20.43.23.png\"
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
border=\"0\" width=\"400\" data-original-height=\"514\" data-original-width=\"1030\"
src=\"https://1.bp.blogspot.com/-DCJ4HR8eej8/YXaRQnz22bI/AAAAAAAAgz4/AgRcree6jVgjtQL2ch7IXgtb_Xtx7fkngCPcBGAYYCw/s400/Screenshot%2B2021-10-24%2Bat%2B20.43.23.png\"/></a></div>\n\n<p>By
any metric Plazi represents a considerable achievement. But often when I browse
individual records on Plazi I find records that seem clearly incorrect. Text
mining the literature is a challenging problem, but at the moment Plazi seems
something of a \"black box\". PDFs go in, the content is mined, and data comes
up to be displayed on the Plazi web site and uploaded to GBIF. Nowhere does
there seem to be an evaluation of how accurate this text mining actually is.
Anecdotally it seems to work well in some cases, but in others it produces
what can only be described as bogus records.</p>\n\n<h2>Finding errors</h2>\n\n<p>A
treatment in Plazi is a block of text (and sometimes illustrations) that refers
to a single taxon. Often that text will include a description of the taxon,
and list one or more specimens that have been examined. These lists of specimens
(\"material citations\") are one of the key bits of information that Plaza
extracts from a treatment as these citations get fed into GBIF as occurrences.</p>\n\n<p>To
help explore treatments I''ve constructed a simple web site that takes the
Plazi identifier for a treatment and displays that treatment with the material
citations highlighted. For example, for the Plazi treatment <a href=\"https://tb.plazi.org/GgServer/html/03B5A943FFBB6F02FE27EC94FABEEAE7\">03B5A943FFBB6F02FE27EC94FABEEAE7</a>
you can view the marked up version at <a href=\"https://plazi-tester.herokuapp.com/?uri=622F7788-F0A4-449D-814A-5B49CD20B228\">https://plazi-tester.herokuapp.com/?uri=622F7788-F0A4-449D-814A-5B49CD20B228</a>.
Below is an example of a material citation with its component parts tagged:</p>\n\n<div
class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-NIGuo9BggdA/YXaRQQrv0QI/AAAAAAAAgz4/SZDcA1jZSN47JMRTWDwSMRpHUShrCeOdgCPcBGAYYCw/s693/Screenshot%2B2021-10-24%2Bat%2B20.59.56.png\"
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
border=\"0\" width=\"400\" data-original-height=\"94\" data-original-width=\"693\"
src=\"https://1.bp.blogspot.com/-NIGuo9BggdA/YXaRQQrv0QI/AAAAAAAAgz4/SZDcA1jZSN47JMRTWDwSMRpHUShrCeOdgCPcBGAYYCw/s400/Screenshot%2B2021-10-24%2Bat%2B20.59.56.png\"/></a></div>\n\n<p>This
is an example where Plazi has successfully parsed the specimen. But I keep
coming across cases where specimens have not been parsed correctly, resulting
in issues such as single specimens being split into multiple records (e.g., <a
href=\"https://plazi-tester.herokuapp.com/?uri=5244B05EFFC8E20F7BC32056C178F496\">https://plazi-tester.herokuapp.com/?uri=5244B05EFFC8E20F7BC32056C178F496</a>),
geographical coordinates being misinterpreted (e.g., <a href=\"https://plazi-tester.herokuapp.com/?uri=0D228E6AFFC2FFEFFF4DE8118C4EE6B9\">https://plazi-tester.herokuapp.com/?uri=0D228E6AFFC2FFEFFF4DE8118C4EE6B9</a>),
or collector''s initials being confused with codes for natural history collections
(e.g., <a href=\"https://plazi-tester.herokuapp.com/?uri=252C87918B362C05FF20F8C5BFCB3D4E\">https://plazi-tester.herokuapp.com/?uri=252C87918B362C05FF20F8C5BFCB3D4E</a>).</p>\n\n<p>Parsing
specimens is a hard problem so it''s not unexpected to find errors. But they
do seem common enough to be easily found, which raises the question of just
what percentage of these material citations are correct? How much of the
data Plazi feeds to GBIF is correct? How would we know?</p>\n\n<h2>Systemic
problems</h2>\n\n<p>Some of the errors I''ve found concern the interpretation
of the parsed data. For example, it is striking that despite including marine
taxa <b>no</b> Plazi record has a value for depth below sea level (see <a
href=\"https://www.gbif.org/occurrence/map?depth=0,9999&publishing_org=7ce8aef0-9e92-11dc-8738-b8a03c50a862&advanced=1\">GBIF
search on depth range 0-9999 for Plazi</a>). But <a href=\"https://www.gbif.org/occurrence/map?elevation=0,9999&publishing_org=7ce8aef0-9e92-11dc-8738-b8a03c50a862&advanced=1\">many
records do have an elevation</a>, including records from marine environments.
Any record that has a depth value is interpreted by Plazi as being elevation,
so we have aerial crustacea and fish.</p>\n\n<h3>Map of Plazi records with
depth 0-9999m</h3>\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-GD4pPtPCxVc/YXaRQ9bdn1I/AAAAAAAAgz8/A9YsypSvHfwWKAjDxSdeFVUkou88LGItACPcBGAYYCw/s673/Screenshot%2B2021-10-25%2Bat%2B12.03.51.png\"
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
border=\"0\" width=\"400\" data-original-height=\"258\" data-original-width=\"673\"
src=\"https://1.bp.blogspot.com/-GD4pPtPCxVc/YXaRQ9bdn1I/AAAAAAAAgz8/A9YsypSvHfwWKAjDxSdeFVUkou88LGItACPcBGAYYCw/s320/Screenshot%2B2021-10-25%2Bat%2B12.03.51.png\"/></a></div>\n\n<h3>Map
of Plazi records with elevation 0-9999m </h3>\n<div class=\"separator\" style=\"clear:
both;\"><a href=\"https://1.bp.blogspot.com/-BReSHtXTOkA/YXaRRKFW7dI/AAAAAAAAg0A/-FcBkMwyswIp0siWGVX3MNMANs7UkZFtwCPcBGAYYCw/s675/Screenshot%2B2021-10-25%2Bat%2B12.04.06.png\"
style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
border=\"0\" width=\"400\" data-original-height=\"256\" data-original-width=\"675\"
src=\"https://1.bp.blogspot.com/-BReSHtXTOkA/YXaRRKFW7dI/AAAAAAAAg0A/-FcBkMwyswIp0siWGVX3MNMANs7UkZFtwCPcBGAYYCw/s320/Screenshot%2B2021-10-25%2Bat%2B12.04.06.png\"/></a></div>\n\n<p>Anecdotally
I''ve also noticed that Plazi seems to do well on zoological data, especially
journals like <i>Zootaxa</i>, but it often struggles with botanical specimens.
Botanists tend to cite specimens rather differently to zoologists (botanists
emphasise collector numbers rather than specimen codes). Hence data quality
in Plazi is likely to taxonomic biased.</p>\n\n<p>Plazi is <a href=\"https://github.com/plazi/community/issues\">using
GitHub to track issues with treatments</a> so feedback on erroneous records
is possible, but this seems inadequate to the task. There are tens of thousands
of data sets, with more being released daily, and hundreds of thousands of
occurrences, and relying on GitHub issues devolves the responsibility for
error checking onto the data users. I don''t have a measure of how many records
in Plazi have problems, but because I suspect it is a significant fraction
because for any given day''s output I can typically find errors.</p>\n\n<h2>What
to do?</h2>\n\n<p>Faced with a process that generates noisy data there are
several of things we could do:</p>\n\n<ol>\n<li>Have tools to detect and flag
errors made in generating the data.</li>\n<li>Have the data generator give
estimates the confidence of its results.</li>\n<li>Improve the data generator.</li>\n</ol>\n\n<p>I
think a comparison with the problem of parsing bibliographic references might
be instructive here. There is a long history of people developing tools to
parse references (<a href=\"https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html\">I''ve
even had a go</a>). State-of-the art tools such as <a href=\"https://anystyle.io\">AnyStyle</a>
feature machine learning, and are tested against <a href=\"https://github.com/inukshuk/anystyle/blob/master/res/parser/core.xml\">human
curated datasets</a> of tagged bibliographic records. This means we can evaluate
the performance of a method (how well does it retrieve the same results as
human experts?) and also improve the method by expanding the corpus of training
data. Some of these tools can provide a measures of how confident they are
when classifying a string as, say, a person''s name, which means we could
flag potential issues for anyone wanting to use that record.</p>\n\n<p>We
don''t have equivalent tools for parsing specimens in the literature, and
hence have no easy way to quantify how good existing methods are, nor do we
have a public corpus of material citations that we can use as training data.
I <a href=\"https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html\">blogged
about this</a> a few months ago and was considering using Plazi as a source
of marked up specimen data to use for training. However based on what I''ve
looked at so far Plazi''s data would need to be carefully scrutinised before
it could be used as training data.</p>\n\n<p>Going forward, I think it would
be desirable to have a set of records that can be used to benchmark specimen
parsers, and ideally have the parsers themselves available as web services
so that anyone can evaluate them. Even better would be a way to contribute
to the training data so that these tools improve over time.</p>\n\n<p>Plazi''s
data extraction tools are mostly desktop-based, that is, you need to download
software to use their methods. However, there are experimental web services
available as well. I''ve created a simple wrapper around the material citation
parser, you can try it at <a href=\"https://plazi-tester.herokuapp.com/parser.php\">https://plazi-tester.herokuapp.com/parser.php</a>.
It takes a single material citation and returns a version with elements such
as specimen code and collector name tagged in different colours.</p>\n\n<h2>Summary</h2>\n\n<p>Text
mining the taxonomic literature is clearly a gold mine of data, but at the
same time it is potentially fraught as we try and extract structured data
from semi-structured text. Plazi has demonstrated that it is possible to extract
a lot of data from the literature, but at the same time the quality of that
data seems highly variable. Even minor issues in parsing text can have big
implications for data quality (e.g., marine organisms apparently living above
sea level). Historically in biodiversity informatics we have favoured data
quantity over data quality. Quantity has an obvious metric, and has milestones
we can celebrate (e.g., <a href=\"GBIF at 1 billion - what''s next?\">one
billion specimens</a>). There aren''t really any equivalent metrics for data
quality.</p>\n\n<p>Adding new types of data can sometimes initially result
in a new set of quality issues (e.g., <a href=\"https://iphylo.blogspot.com/2019/12/gbif-metagenomics-and-metacrap.html\">GBIF
metagenomics and metacrap</a>) that take time to resolve. In the case of Plazi,
I think it would be worthwhile to quantify just how many records have errors,
and develop benchmarks that we can use to test methods for extracting specimen
data from text. If we don''t do this then there will remain uncertainty as
to how much trust we can place in data mined from the taxonomic literature.</p>\n\n<h2>Update</h2>\n\nPlazi
has responded, see <a href=\"http://plazi.org/posts/2021/10/liberation-first-step-toward-quality/\">Liberating
material citations as a first step to more better data</a>. My reading of
their repsonse is that it essentially just reiterates Plazi''s approach and
doesn''t tackle the underlying issue: their method for extracting material
citations is error prone, and many of those errors end up in GBIF.","tags":["data
quality","parsing","Plazi","specimen","text mining"],"language":"en","references":null}]}'
recorded_at: Thu, 15 Jun 2023 20:47:10 GMT
recorded_with: VCR 6.1.0