commonmeta-ruby 3.5.4 → 3.6

Files changed (275)
  1. checksums.yaml +4 -4
  2. data/Gemfile.lock +16 -5
  3. data/bin/commonmeta +1 -1
  4. data/lib/commonmeta/readers/crossref_xml_reader.rb +1 -1
  5. data/lib/commonmeta/utils.rb +6 -6
  6. data/lib/commonmeta/version.rb +1 -1
  7. data/spec/cli_spec.rb +13 -8
  8. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_file/crossref/default.yml +13 -13
  9. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_file/crossref/to_bibtex.yml +13 -13
  10. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_file/crossref/to_crossref_xml.yml +25 -25
  11. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_file/crossref/to_datacite.yml +13 -13
  12. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_file/crossref/to_schema_org.yml +13 -13
  13. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_file/crossref_xml/default.yml +7 -7
  14. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_file/crossref_xml/to_bibtex.yml +7 -7
  15. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_file/crossref_xml/to_crossref_xml.yml +7 -59
  16. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_file/crossref_xml/to_datacite.yml +7 -7
  17. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_file/crossref_xml/to_schema_org.yml +7 -7
  18. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_from_id/crossref/default.yml +24 -24
  19. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_from_id/crossref/to_bibtex.yml +24 -24
  20. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_from_id/crossref/to_citation.yml +24 -24
  21. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_from_id/crossref/to_crossref_xml.yml +24 -24
  22. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_from_id/crossref/to_datacite.yml +24 -24
  23. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_from_id/crossref/to_jats.yml +24 -24
  24. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_from_id/crossref/to_schema_org.yml +24 -24
  25. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_from_id/datacite/default.yml +16 -16
  26. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_from_id/datacite/to_bibtex.yml +16 -16
  27. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_from_id/datacite/to_citation.yml +16 -16
  28. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_from_id/datacite/to_datacite.yml +16 -16
  29. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_from_id/datacite/to_jats.yml +16 -16
  30. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_from_id/datacite/to_schema_org.yml +16 -16
  31. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_from_id/schema_org/default.yml +479 -946
  32. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_from_id/schema_org/to_crossref_xml.yml +957 -1891
  33. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_from_id/schema_org/to_datacite.yml +479 -946
  34. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_from_id/schema_org/to_schema_org.yml +481 -950
  35. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/encode/by_blog.yml +5540 -968
  36. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/encode/by_blog_unknown_blog_id.yml +22 -29
  37. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/encode/by_id.yml +25 -39
  38. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/encode/by_id_unknown_uuid.yml +18 -28
  39. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/find_from_format_by_id/crossref.yml +7 -7
  40. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/find_from_format_by_id/datacite.yml +7 -7
  41. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/find_from_format_by_id/jalc.yml +7 -7
  42. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/find_from_format_by_id/kisti.yml +7 -7
  43. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/find_from_format_by_id/medra.yml +7 -7
  44. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/find_from_format_by_id/op.yml +7 -7
  45. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/json_feed/json_feed_blog_id.yml +46 -0
  46. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/json_feed/json_feed_by_blog.yml +5578 -246
  47. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/json_feed/json_feed_not_indexed.yml +13 -2201
  48. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/json_feed/json_feed_unregistered.yml +176 -72
  49. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/change_metadata_as_datacite_xml/with_data_citation.yml +16 -16
  50. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/doi_registration_agency/crossref.yml +6 -6
  51. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/doi_registration_agency/datacite.yml +6 -6
  52. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/doi_registration_agency/jalc.yml +6 -6
  53. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/doi_registration_agency/kisti.yml +6 -6
  54. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/doi_registration_agency/medra.yml +6 -6
  55. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/doi_registration_agency/not_found.yml +6 -6
  56. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/doi_registration_agency/op.yml +6 -6
  57. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/find_from_format_by_ID/crossref.yml +6 -6
  58. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/find_from_format_by_ID/crossref_doi_not_url.yml +6 -6
  59. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/find_from_format_by_ID/datacite.yml +6 -6
  60. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/find_from_format_by_ID/datacite_doi_http.yml +6 -6
  61. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/find_from_format_by_ID/unknown_DOI_registration_agency.yml +6 -6
  62. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_blog_id_for_json_feed_item_id/by_blog_post_id.yml +27 -105
  63. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_blog_id_for_json_feed_item_id/not_found.yml +20 -27
  64. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_cff_metadata/cff-converter-python.yml +51 -25
  65. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_cff_metadata/ruby-cff.yml +12 -12
  66. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_cff_metadata/ruby-cff_repository_url.yml +9 -9
  67. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_codemeta_metadata/maremma.yml +10 -10
  68. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_codemeta_metadata/metadata_reports.yml +11 -11
  69. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/DOI_with_ORCID_ID.yml +78 -78
  70. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/DOI_with_SICI_DOI.yml +76 -76
  71. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/DOI_with_data_citation.yml +35 -35
  72. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/JaLC.yml +162 -162
  73. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/KISTI.yml +131 -131
  74. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/OP.yml +75 -75
  75. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/affiliation_is_space.yml +76 -76
  76. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/another_book.yml +113 -113
  77. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/another_book_chapter.yml +74 -74
  78. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/article_id_as_page_number.yml +77 -77
  79. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/author_literal.yml +84 -84
  80. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/book.yml +77 -77
  81. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/book_chapter.yml +75 -75
  82. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/book_chapter_with_RDF_for_container.yml +73 -73
  83. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/book_oup.yml +72 -72
  84. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/component.yml +94 -94
  85. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/dataset.yml +104 -104
  86. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/dataset_usda.yml +136 -136
  87. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/date_in_future.yml +80 -80
  88. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/dissertation.yml +103 -103
  89. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/empty_given_name.yml +75 -75
  90. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/invalid_date.yml +77 -77
  91. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/journal_article.yml +76 -76
  92. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/journal_article_original_language_title.yml +73 -73
  93. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/journal_article_with.yml +128 -210
  94. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/journal_article_with_RDF_for_container.yml +74 -74
  95. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/journal_article_with_funding.yml +76 -76
  96. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/journal_issue.yml +72 -72
  97. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/mEDRA.yml +72 -72
  98. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/markup.yml +81 -81
  99. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/missing_contributor.yml +71 -71
  100. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/multiple_issn.yml +75 -75
  101. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/multiple_titles.yml +71 -71
  102. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/multiple_titles_with_missing.yml +573 -573
  103. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/not_found_error.yml +65 -65
  104. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/peer_review.yml +77 -77
  105. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/posted_content.yml +74 -74
  106. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/posted_content_copernicus.yml +76 -76
  107. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/report_osti.yml +120 -120
  108. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/vor_with_url.yml +78 -78
  109. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/yet_another_book.yml +74 -74
  110. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/yet_another_book_chapter.yml +73 -73
  111. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_raw/journal_article.yml +59 -59
  112. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_datacite_metadata/SoftwareSourceCode.yml +4 -4
  113. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_datacite_metadata/dissertation.yml +13 -13
  114. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_datacite_metadata/funding_references.yml +15 -15
  115. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_datacite_metadata/subject_scheme.yml +120 -120
  116. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_doi_prefix_for_blog/by_blog_id.yml +5540 -555
  117. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_doi_prefix_for_blog/by_blog_post_id.yml +31 -42
  118. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_doi_prefix_for_blog/by_blog_post_id_specific_prefix.yml +25 -39
  119. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/by_blog_id.yml +5540 -247
  120. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/not_indexed_posts.yml +14 -26
  121. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed/unregistered_posts.yml +176 -72
  122. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/archived_wordpress_post.yml +27 -95
  123. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/blog_post_with_non-url_id.yml +28 -106
  124. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/blogger_post.yml +21 -65
  125. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/ghost_post_with_author_name_suffix.yml +20 -208
  126. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/ghost_post_with_doi.yml +26 -97
  127. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/ghost_post_with_institutional_author.yml +24 -55
  128. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/ghost_post_with_organizational_author.yml +27 -70
  129. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/ghost_post_with_related_identifiers.yml +41 -143
  130. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/ghost_post_with_related_identifiers_and_funding.yml +54 -132
  131. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/ghost_post_with_related_identifiers_and_link_to_peer-reviewed_article.yml +304 -818
  132. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/ghost_post_without_doi.yml +24 -169
  133. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/jekyll_post.yml +24 -63
  134. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/jekyll_post_with_anonymous_author.yml +25 -40
  135. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/substack_post_with_broken_reference.yml +278 -591
  136. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/syldavia_gazette_post_with_references.yml +59 -101
  137. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/upstream_post_with_references.yml +135 -331
  138. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/wordpress_post.yml +24 -134
  139. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/wordpress_post_with_many_references.yml +578 -2967
  140. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/wordpress_post_with_references.yml +44 -205
  141. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_json_feed_item_metadata/wordpress_post_with_tracking_code_on_url.yml +26 -160
  142. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_one_author/affiliation_is_space.yml +21 -21
  143. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_one_author/has_familyName.yml +15 -15
  144. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_one_author/has_name_in_display-order_with_ORCID.yml +13 -13
  145. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_one_author/name_with_affiliation_crossref.yml +16 -16
  146. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_one_author/only_familyName_and_givenName.yml +66 -61
  147. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_schema_org_metadata/BlogPosting.yml +145 -146
  148. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_schema_org_metadata/BlogPosting_with_new_DOI.yml +149 -150
  149. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_schema_org_metadata/get_schema_org_metadata_front_matter/BlogPosting.yml +114 -115
  150. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_schema_org_metadata/harvard_dataverse.yml +300 -289
  151. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_schema_org_metadata/pangaea.yml +66 -61
  152. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_schema_org_metadata/upstream_blog.yml +64 -57
  153. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_schema_org_metadata/zenodo.yml +27 -24
  154. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/handle_input/DOI_RA_not_Crossref_or_DataCite.yml +6 -6
  155. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/handle_input/unknown_DOI_prefix.yml +6 -6
  156. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/json_schema_errors/is_valid.yml +16 -16
  157. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/BlogPosting.yml +10 -10
  158. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/Dataset.yml +10 -10
  159. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/authors_with_affiliations.yml +16 -16
  160. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/climate_data.yml +10 -10
  161. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/from_schema_org.yml +145 -146
  162. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/keywords_subject_scheme.yml +8 -8
  163. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/maremma.yml +12 -12
  164. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/text.yml +8 -8
  165. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/with_data_citation.yml +16 -16
  166. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_bibtex/with_pages.yml +16 -16
  167. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_cff/Collection_of_Jupyter_notebooks.yml +13 -13
  168. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_cff/SoftwareSourceCode_Zenodo.yml +13 -13
  169. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_cff/SoftwareSourceCode_also_Zenodo.yml +8 -8
  170. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_cff/ruby-cff.yml +10 -10
  171. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_citation/Dataset.yml +10 -10
  172. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_citation/Journal_article.yml +16 -16
  173. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_citation/Journal_article_vancouver_style.yml +21 -21
  174. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_citation/Missing_author.yml +15 -15
  175. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_citation/interactive_resource_without_dates.yml +8 -8
  176. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_citation/software_w/version.yml +8 -8
  177. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_codemeta/SoftwareSourceCode_DataCite.yml +8 -8
  178. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_codemeta/SoftwareSourceCode_DataCite_check_codemeta_v2.yml +8 -8
  179. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_commonmeta/with_data_citation.yml +12 -12
  180. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/another_schema_org_from_front-matter.yml +32 -32
  181. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/journal_article.yml +5 -5
  182. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/journal_article_from_datacite.yml +5 -5
  183. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/journal_article_plos.yml +16 -16
  184. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/json_feed_item_from_rogue_scholar_with_anonymous_author.yml +25 -40
  185. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/json_feed_item_from_rogue_scholar_with_doi.yml +24 -134
  186. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/json_feed_item_from_rogue_scholar_with_organizational_author.yml +27 -70
  187. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/json_feed_item_from_rogue_scholar_with_relations.yml +41 -143
  188. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/json_feed_item_from_rogue_scholar_with_relations_and_funding.yml +55 -133
  189. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/json_feed_item_from_upstream_blog.yml +21 -224
  190. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/json_feed_item_with_references.yml +134 -330
  191. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/posted_content.yml +19 -19
  192. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/schema_org_from_another_science_blog.yml +9 -9
  193. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/schema_org_from_front_matter.yml +92 -91
  194. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/schema_org_from_upstream_blog.yml +6 -6
  195. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/Another_dataset.yml +8 -8
  196. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/BlogPosting.yml +10 -10
  197. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/BlogPosting_schema_org.yml +146 -147
  198. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/Dataset.yml +10 -10
  199. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/container_title.yml +16 -21
  200. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/interactive_resource_without_dates.yml +8 -8
  201. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/journal_article.yml +16 -16
  202. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/keywords_subject_scheme.yml +8 -8
  203. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/maremma.yml +10 -10
  204. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/missing_creator.yml +15 -15
  205. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/multiple_abstracts.yml +8 -8
  206. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/organization_author.yml +22 -22
  207. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/software.yml +8 -8
  208. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/software_w/version.yml +8 -8
  209. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/with_only_first_page.yml +16 -16
  210. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csl/with_pages.yml +16 -16
  211. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csv/climate_data.yml +10 -10
  212. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csv/maremma.yml +10 -10
  213. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csv/text.yml +8 -8
  214. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csv/with_data_citation.yml +16 -16
  215. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_csv/with_pages.yml +16 -16
  216. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_datacite/dissertation.yml +20 -20
  217. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_datacite/from_schema_org.yml +146 -147
  218. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_datacite/journal_article.yml +22 -22
  219. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_datacite/maremma.yml +10 -10
  220. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_datacite/with_ORCID_ID.yml +16 -16
  221. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_datacite/with_data_citation.yml +16 -16
  222. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/Dataset_in_schema_4_0.yml +10 -10
  223. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/Text_pass-thru.yml +8 -8
  224. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/book_chapter.yml +15 -15
  225. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/from_schema_org.yml +146 -147
  226. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/interactive_resource_without_dates.yml +8 -8
  227. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/maremma.yml +10 -10
  228. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/with_ORCID_ID.yml +16 -16
  229. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/with_data_citation.yml +16 -16
  230. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_jats_xml/with_editor.yml +17 -17
  231. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_ris/BlogPosting.yml +10 -10
  232. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_ris/BlogPosting_schema_org.yml +145 -146
  233. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_ris/Dataset.yml +10 -10
  234. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_ris/alternate_name.yml +8 -8
  235. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_ris/journal_article.yml +9 -9
  236. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_ris/keywords_with_subject_scheme.yml +8 -8
  237. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_ris/maremma.yml +10 -10
  238. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_ris/with_pages.yml +10 -10
  239. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/Another_Schema_org_JSON.yml +10 -10
  240. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/Funding.yml +13 -13
  241. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/Funding_OpenAIRE.yml +13 -13
  242. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/Schema_org_JSON.yml +8 -8
  243. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/Schema_org_JSON_Cyark.yml +13 -13
  244. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/alternate_identifiers.yml +13 -13
  245. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/data_catalog.yml +13 -13
  246. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/geo_location_box.yml +13 -13
  247. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/interactive_resource_without_dates.yml +13 -13
  248. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/journal_article.yml +16 -16
  249. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/maremma_schema_org_JSON.yml +11 -11
  250. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/series_information.yml +17 -16
  251. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/subject_scheme.yml +15 -15
  252. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_schema_org/subject_scheme_multiple_keywords.yml +13 -13
  253. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_turtle/BlogPosting.yml +10 -10
  254. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_turtle/Dataset.yml +10 -10
  255. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_turtle/journal_article.yml +16 -16
  256. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_turtle/with_pages.yml +16 -16
  257. data/spec/readers/cff_reader_spec.rb +2 -20
  258. data/spec/readers/crossref_reader_spec.rb +10 -16
  259. data/spec/readers/crossref_xml_reader_spec.rb +61 -64
  260. data/spec/readers/json_feed_reader_spec.rb +56 -56
  261. data/spec/readers/schema_org_reader_spec.rb +1 -1
  262. data/spec/utils_spec.rb +1 -1
  263. data/spec/writers/crossref_xml_writer_spec.rb +9 -8
  264. data/spec/writers/csv_writer_spec.rb +1 -1
  265. data/spec/writers/ris_writer_spec.rb +2 -2
  266. metadata +3 -11
  267. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/convert_file/crossref_xml/to_crossref_xml_refresh.yml +0 -107
  268. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/doi_prefix/doi_prefix_by_blog.yml +0 -997
  269. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/doi_prefix/doi_prefix_by_uuid.yml +0 -256
  270. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/encode/by_uuid.yml +0 -256
  271. data/spec/fixtures/vcr_cassettes/Commonmeta_CLI/encode/by_uuid_unknown_uuid.yml +0 -49
  272. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_crossref_metadata/missing_creator.yml +0 -307
  273. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_doi_prefix_for_blog/by_blog_post_uuid.yml +0 -136
  274. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/get_doi_prefix_for_blog/by_blog_post_uuid_specific_prefix.yml +0 -256
  275. data/spec/fixtures/vcr_cassettes/Commonmeta_Metadata/write_metadata_as_crossref/book_oup.yml +0 -107
@@ -1,997 +0,0 @@
1
- ---
2
- http_interactions:
3
- - request:
4
- method: get
5
- uri: https://rogue-scholar.org/api/blogs/tyfqw20
6
- body:
7
- encoding: UTF-8
8
- string: ''
9
- headers:
10
- Connection:
11
- - close
12
- Host:
13
- - rogue-scholar.org
14
- User-Agent:
15
- - http.rb/5.1.1
16
- response:
17
- status:
18
- code: 200
19
- message: OK
20
- headers:
21
- Age:
22
- - '0'
23
- Cache-Control:
24
- - public, max-age=0, must-revalidate
25
- Content-Length:
26
- - '88356'
27
- Content-Type:
28
- - application/json; charset=utf-8
29
- Date:
30
- - Sun, 18 Jun 2023 06:01:20 GMT
31
- Etag:
32
- - '"xxw11pz9vb1w14"'
33
- Server:
34
- - Vercel
35
- Strict-Transport-Security:
36
- - max-age=63072000
37
- X-Matched-Path:
38
- - "/api/blogs/[slug]"
39
- X-Vercel-Cache:
40
- - MISS
41
- X-Vercel-Id:
42
- - fra1::iad1::fqtvx-1687068074466-8dd4935835aa
43
- Connection:
44
- - close
45
- body:
46
- encoding: UTF-8
47
- string: '{"id":"tyfqw20","title":"iPhylo","description":"Rants, raves (and occasionally
48
- considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For
49
- more ranty and less considered opinions, see my <a href=\"https://twitter.com/rdmpage\">Twitter
50
- feed</a>.<br>ISSN 2051-8188. Written content on this site is licensed under
51
- a <a href=\"https://creativecommons.org/licenses/by/4.0/\">Creative Commons
52
- Attribution 4.0 International license</a>.","language":"en","favicon":null,"feed_url":"https://iphylo.blogspot.com/feeds/posts/default","feed_format":"application/atom+xml","home_page_url":"https://iphylo.blogspot.com/","indexed_at":"2023-02-06","modified_at":"2023-05-31T17:26:00+00:00","license":"https://creativecommons.org/licenses/by/4.0/legalcode","generator":"Blogger
53
- 7.00","category":"Natural Sciences","backlog":true,"prefix":"10.59350","items":[{"id":"https://doi.org/10.59350/en7e9-5s882","uuid":"20b9d31e-513f-496b-b399-4215306e1588","url":"https://iphylo.blogspot.com/2022/04/obsidian-markdown-and-taxonomic-trees.html","title":"Obsidian,
- markdown, and taxonomic trees","summary":"Returning to the subject of personal
- knowledge graphs Kyle Scheer has an interesting repository of Markdown files
- that describe academic disciplines at https://github.com/kyletscheer/academic-disciplines
- (see his blog post for more background). If you add these files to Obsidian
- you get a nice visualisation of a taxonomy of academic disciplines. The applications
- of this to biological taxonomy seem obvious, especially as a tool like Obsidian
- enables all sorts of interesting links to be added...","date_published":"2022-04-07T21:07:00Z","date_modified":"2022-04-07T21:15:34Z","date_indexed":"1909-06-16T09:41:45+00:00","authors":[{"url":null,"name":"Roderic
- Page"}],"image":null,"content_html":"<p>Returning to the subject of <a href=\"https://iphylo.blogspot.com/2020/08/personal-knowledge-graphs-obsidian-roam.html\">personal
- knowledge graphs</a> Kyle Scheer has an interesting repository of Markdown
- files that describe academic disciplines at <a href=\"https://github.com/kyletscheer/academic-disciplines\">https://github.com/kyletscheer/academic-disciplines</a>
- (see <a href=\"https://kyletscheer.medium.com/on-creating-a-tree-of-knowledge-f099c1028bf6\">his
- blog post</a> for more background).</p>\n\n<p>If you add these files to <a
- href=\"https://obsidian.md/\">Obsidian</a> you get a nice visualisation of
- a taxonomy of academic disciplines. The applications of this to biological
- taxonomy seem obvious, especially as a tool like Obsidian enables all sorts
- of interesting links to be added (e.g., we could add links to the taxonomic
- research behind each node in the taxonomic tree, the people doing that research,
- etc. - although that would mean we''d no longer have a simple tree).</p>\n\n<p>The
- more I look at these sort of simple Markdown-based tools the more I wonder
- whether we could make more use of them to create simple but persistent databases.
- Text files seem the most stable, long-lived digital format around, maybe this
- would be a way to minimise the inevitable obsolescence of database and server
- software. Time for some experiments I feel... can we take a taxonomic group,
- such as mammals, and create a richly connected database purely in Markdown?</p>\n\n<div
- class=\"separator\" style=\"clear: both; text-align: center;\"><iframe allowfullscreen=''allowfullscreen''
- webkitallowfullscreen=''webkitallowfullscreen'' mozallowfullscreen=''mozallowfullscreen''
- width=''400'' height=''322'' src=''https://www.blogger.com/video.g?token=AD6v5dxZtweOTJTdg6aqvICq_tKF0la1QZuDAEpwPPCVQKtG5vjB-DzuQv-ApL8JnpyZ1FffYtWo6ymizNQ''
- class=''b-hbp-video b-uploaded'' frameborder=''0''></iframe></div>","tags":["markdown","obsidian"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/m48f7-c2128","uuid":"8aea47e4-f227-45f4-b37b-0454a8a7a3ff","url":"https://iphylo.blogspot.com/2023/04/chatgpt-semantic-search-and-knowledge.html","title":"ChatGPT,
- semantic search, and knowledge graphs","summary":"One thing about ChatGPT
- is it has opened my eyes to some concepts I was dimly aware of but am only
- now beginning to fully appreciate. ChatGPT enables you ask it questions, but
- the answers depend on what ChatGPT “knows”. As several people have noted,
- what would be even better is to be able to run ChatGPT on your own content.
- Indeed, ChatGPT itself now supports this using plugins. Paul Graham GPT However,
- it’s still useful to see how to add ChatGPT functionality to your own content
- from...","date_published":"2023-04-03T15:30:00Z","date_modified":"2023-04-03T15:32:04Z","date_indexed":"1909-06-16T09:02:34+00:00","authors":[{"url":null,"name":"Roderic
- Page"}],"image":null,"content_html":"<p>One thing about ChatGPT is it has
- opened my eyes to some concepts I was dimly aware of but am only now beginning
- to fully appreciate. ChatGPT enables you ask it questions, but the answers
- depend on what ChatGPT “knows”. As several people have noted, what would be
- even better is to be able to run ChatGPT on your own content. Indeed, ChatGPT
- itself now supports this using <a href=\"https://openai.com/blog/chatgpt-plugins\">plugins</a>.</p>\n<h4
- id=\"paul-graham-gpt\">Paul Graham GPT</h4>\n<p>However, it’s still useful
- to see how to add ChatGPT functionality to your own content from scratch.
- A nice example of this is <a href=\"https://paul-graham-gpt.vercel.app/\">Paul
- Graham GPT</a> by <a href=\"https://twitter.com/mckaywrigley\">Mckay Wrigley</a>.
- Mckay Wrigley took essays by Paul Graham (a well known venture capitalist)
- and built a question and answer tool very like ChatGPT.</p>\n<iframe width=\"560\"
- height=\"315\" src=\"https://www.youtube.com/embed/ii1jcLg-eIQ\" title=\"YouTube
- video player\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write;
- encrypted-media; gyroscope; picture-in-picture; web-share\" allowfullscreen></iframe>\n<p>Because
- you can send a block of text to ChatGPT (as part of the prompt) you can get
- ChatGPT to summarise or transform that information, or answer questions based
- on that information. But there is a limit to how much information you can
- pack into a prompt. You can’t put all of Paul Graham’s essays into a prompt
- for example. So a solution is to do some preprocessing. For example, given
- a question such as “How do I start a startup?” we could first find the essays
- that are most relevant to this question, then use them to create a prompt
- for ChatGPT. A quick and dirty way to do this is simply do a text search over
- the essays and take the top hits. But we aren’t searching for words, we are
- searching for answers to a question. The essay with the best answer might
- not include the phrase “How do I start a startup?”.</p>\n<h4 id=\"semantic-search\">Semantic
- search</h4>\n<p>Enter <a href=\"https://en.wikipedia.org/wiki/Semantic_search\">Semantic
- search</a>. The key concept behind semantic search is that we are looking
- for documents with similar meaning, not just similarity of text. One approach
- to this is to represent documents by “embeddings”, that is, a vector of numbers
- that encapsulate features of the document. Documents with similar vectors
- are potentially related. In semantic search we take the query (e.g., “How
- do I start a startup?”), compute its embedding, then search among the documents
- for those with similar embeddings.</p>\n<p>To create Paul Graham GPT Mckay
- Wrigley did the following. First he sent each essay to the OpenAI API underlying
- ChatGPT, and in return he got the embedding for that essay (a vector of 1536
- numbers). Each embedding was stored in a database (Mckay uses Postgres with
- <a href=\"https://github.com/pgvector/pgvector\">pgvector</a>). When a user
- enters a query such as “How do I start a startup?” that query is also sent
- to the OpenAI API to retrieve its embedding vector. Then we query the database
- of embeddings for Paul Graham’s essays and take the top five hits. These hits
- are, one hopes, the most likely to contain relevant answers. The original
- question and the most similar essays are then bundled up and sent to ChatGPT
- which then synthesises an answer. See his <a href=\"https://github.com/mckaywrigley/paul-graham-gpt\">GitHub
- repo</a> for more details. Note that we are still using ChatGPT, but on a
- set of documents it doesn’t already have.</p>\n<h4 id=\"knowledge-graphs\">Knowledge
- graphs</h4>\n<p>I’m a fan of knowledge graphs, but they are not terribly easy
- to use. For example, I built a knowledge graph of Australian animals <a href=\"https://ozymandias-demo.herokuapp.com\">Ozymandias</a>
- that contains a wealth of information on taxa, publications, and people, wrapped
- up in a web site. If you want to learn more you need to figure out how to
- write queries in SPARQL, which is not fun. Maybe we could use ChatGPT to write
- the SPARQL queries for us, but it would be much more fun to be simply ask
- natural language queries (e.g., “who are the experts on Australian ants?”).
- I made some naïve notes on these ideas <a href=\"https://iphylo.blogspot.com/2015/09/possible-project-natural-language.html\">Possible
- project: natural language queries, or answering “how many species are there?”</a>
- and <a href=\"https://iphylo.blogspot.com/2019/05/ozymandias-meets-wikipedia-with-notes.html\">Ozymandias
- meets Wikipedia, with notes on natural language generation</a>.</p>\n<p>Of
- course, this is a well known problem. Tools such as <a href=\"http://rdf2vec.org\">RDF2vec</a>
- can take RDF from a knowledge graph and create embeddings which could in tern
- be used to support semantic search. But it seems to me that we could simply
- this process a bit by making use of ChatGPT.</p>\n<p>Firstly we would generate
- natural language statements from the knowledge graph (e.g., “species x belongs
- to genus y and was described in z”, “this paper on ants was authored by x”,
- etc.) that cover the basic questions we expect people to ask. We then get
- embeddings for these (e.g., using OpenAI). We then have an interface where
- people can ask a question (“is species x a valid species?”, “who has published
- on ants”, etc.), we get the embedding for that question, retrieve natural
- language statements that the closest in embedding “space”, package everything
- up and ask ChatGPT to summarise the answer.</p>\n<p>The trick, of course,
- is to figure out how t generate natural language statements from the knowledge
- graph (which amounts to deciding what paths to traverse in the knowledge graph,
- and how to write those paths is something approximating English). We also
- want to know something about the sorts of questions people are likely to ask
- so that we have a reasonable chance of having the answers (for example, are
- people going to ask about individual species, or questions about summary statistics
- such as numbers of species in a genus, etc.).</p>\n<p>What makes this attractive
- is that it seems a straightforward way to go from a largely academic exercise
- (build a knowledge graph) to something potentially useful (a question and
- answer machine). Imagine if something like the defunct BBC wildlife site (see
- <a href=\"https://iphylo.blogspot.com/2017/12/blue-planet-ii-bbc-and-semantic-web.html\">Blue
- Planet II, the BBC, and the Semantic Web: a tale of lessons forgotten and
- opportunities lost</a>) revived <a href=\"https://aspiring-look.glitch.me\">here</a>
- had a question and answer interface where we could ask questions rather than
- passively browse.</p>\n<h4 id=\"summary\">Summary</h4>\n<p>I have so much
- more to learn, and need to think about ways to incorporate semantic search
- and ChatGPT-like tools into knowledge graphs.</p>\n<blockquote>\n<p>Written
- with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":[]},{"id":"https://doi.org/10.59350/zc4qc-77616","uuid":"30c78d9d-2e50-49db-9f4f-b3baa060387b","url":"https://iphylo.blogspot.com/2022/09/does-anyone-cite-taxonomic-treatments.html","title":"Does
- anyone cite taxonomic treatments?","summary":"Taxonomic treatments have come
- up in various discussions I''m involved in, and I''m curious as to whether
- they are actually being used, in particular, whether they are actually being
- cited. Consider the following quote: The taxa are described in taxonomic treatments,
- well defined sections of scientific publications (Catapano 2019). They include
- a nomenclatural section and one or more sections including descriptions, material
- citations referring to studied specimens, or notes ecology and...","date_published":"2022-09-01T16:49:00Z","date_modified":"2022-09-01T16:49:51Z","date_indexed":"1909-06-16T09:31:50+00:00","authors":[{"url":null,"name":"Roderic
- Page"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
- both;\"><a href=\"https://zenodo.org/record/5731100/thumb100\" style=\"display:
- block; padding: 1em 0; text-align: center; clear: right; float: right;\"><img
- alt=\"\" border=\"0\" height=\"128\" data-original-height=\"106\" data-original-width=\"100\"
- src=\"https://zenodo.org/record/5731100/thumb250\"/></a></div>\nTaxonomic
- treatments have come up in various discussions I''m involved in, and I''m
- curious as to whether they are actually being used, in particular, whether
- they are actually being cited. Consider the following quote:\n\n<blockquote>\nThe
- taxa are described in taxonomic treatments, well defined sections of scientific
- publications (Catapano 2019). They include a nomenclatural section and one
- or more sections including descriptions, material citations referring to studied
- specimens, or notes ecology and behavior. In case the treatment does not describe
- a new discovered taxon, previous treatments are cited in the form of treatment
- citations. This citation can refer to a previous treatment and add additional
- data, or it can be a statement synonymizing the taxon with another taxon.
- This allows building a citation network, and ultimately is a constituent part
- of the catalogue of life. - Taxonomic Treatments as Open FAIR Digital Objects
- <a href=\"https://doi.org/10.3897/rio.8.e93709\">https://doi.org/10.3897/rio.8.e93709</a>\n</blockquote>\n\n<p>\n
- \"Traditional\" academic citation is from article to article. For example,
- consider these two papers:\n\n<blockquote>\nLi Y, Li S, Lin Y (2021) Taxonomic
- study on fourteen symphytognathid species from Asia (Araneae, Symphytognathidae).
- ZooKeys 1072: 1-47. https://doi.org/10.3897/zookeys.1072.67935\n</blockquote>\n\n<blockquote>\nMiller
- J, Griswold C, Yin C (2009) The symphytognathoid spiders of the Gaoligongshan,
- Yunnan, China (Araneae: Araneoidea): Systematics and diversity of micro-orbweavers.
- ZooKeys 11: 9-195. https://doi.org/10.3897/zookeys.11.160\n</blockquote>\n</p>\n\n<p>Li
- et al. 2021 cites Miller et al. 2009 (although Pensoft seems to have broken
- the citation such that it does appear correctly either on their web page or
- in CrossRef).</p>\n\n<p>So, we have this link: [article]10.3897/zookeys.1072.67935
- --cites--> [article]10.3897/zookeys.11.160. One article cites another.</p>\n\n<p>In
- their 2021 paper Li et al. discuss <i>Patu jidanweishi</i> Miller, Griswold
- & Yin, 2009:\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPavMHXqQNX1ls_zXo9kIcMLHPxc7ZpV9NCSof5wLumrg3ovPoi6nYKzZsINuqFtYoEvW1QrerePD-MEf2DJaYUXlT37d81x3L6ILls7u229rg0_Nc0uUmgW-ICzr6MI_QCZfgQbYGTxuofu-fuPVoygbCnm3vQVYOhLDLtp1EtQ9jRZHDvw/s1040/Screenshot%202022-09-01%20at%2017.12.27.png\"
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
- border=\"0\" width=\"400\" data-original-height=\"314\" data-original-width=\"1040\"
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPavMHXqQNX1ls_zXo9kIcMLHPxc7ZpV9NCSof5wLumrg3ovPoi6nYKzZsINuqFtYoEvW1QrerePD-MEf2DJaYUXlT37d81x3L6ILls7u229rg0_Nc0uUmgW-ICzr6MI_QCZfgQbYGTxuofu-fuPVoygbCnm3vQVYOhLDLtp1EtQ9jRZHDvw/s400/Screenshot%202022-09-01%20at%2017.12.27.png\"/></a></div>\n\n<p>There
- is a treatment for the original description of <i>Patu jidanweishi</i> at
- <a href=\"https://doi.org/10.5281/zenodo.3792232\">https://doi.org/10.5281/zenodo.3792232</a>,
- which was created by Plazi with a time stamp \"2020-05-06T04:59:53.278684+00:00\".
- The original publication date was 2009, the treatments are being added retrospectively.</p>\n\n<p>In
- an ideal world my expectation would be that Li et al. 2021 would have cited
- the treatment, instead of just providing the text string \"Patu jidanweishi
- Miller, Griswold & Yin, 2009: 64, figs 65A–E, 66A, B, 67A–D, 68A–F, 69A–F,
- 70A–F and 71A–F (♂♀).\" Isn''t the expectation under the treatment model that
- we would have seen this relationship:</p>\n\n<p>[article]10.3897/zookeys.1072.67935
- --cites--> [treatment]https://doi.org/10.5281/zenodo.3792232</p>\n\n<p>Furthermore,
- if it is the case that \"[i]n case the treatment does not describe a new discovered
- taxon, previous treatments are cited in the form of treatment citations\"
- then we should also see a citation between treatments, in other words Li et
- al.''s 2021 treatment of <i>Patu jidanweishi</i> (which doesn''t seem to have
- a DOI but is available on Plazi'' web site as <a href=\"https://tb.plazi.org/GgServer/html/1CD9FEC313A35240938EC58ABB858E74\">https://tb.plazi.org/GgServer/html/1CD9FEC313A35240938EC58ABB858E74</a>)
- should also cite the original treatment? It doesn''t - but it does cite the
- Miller et al. paper.</p>\n\n<p>So in this example we don''t see articles citing
- treatments, nor do we see treatments citing treatments. Playing Devil''s advocate,
- why then do we have treatments? Does''t the lack of citations suggest that
- - despite some taxonomists saying this is the unit that matters - they actually
- don''t. If we pay attention to what people do rather than what they say they
- do, they cite articles.</p>\n\n<p>Now, there are all sorts of reasons why
- we don''t see [article] -> [treatment] citations, or [treatment] -> [treatment]
- citations. Treatments are being added after the fact by Plazi, not by the
- authors of the original work. And in many cases the treatments that could
- be cited haven''t appeared until after that potentially citing work was published.
- In the example above the Miller et al. paper dates from 2009, but the treatment
- extracted only went online in 2020. And while there is a long standing culture
- of citing publications (ideally using DOIs) there isn''t an equivalent culture
- of citing treatments (beyond the simple text strings).</p>\n\n<p>Obviously
- this is but one example. I''d need to do some exploration of the citation
- graph to get a better sense of citations patterns, perhaps using <a href=\"https://www.crossref.org/documentation/event-data/\">CrossRef''s
- event data</a>. But my sense is that taxonomists don''t cite treatments.</p>\n\n<p>I''m
- guessing Plazi would respond by saying treatments are cited, for example (indirectly)
- in GBIF downloads. This is true, although arguably people aren''t citing the
- treatment, they''re citing specimen data in those treatments, and that specimen
- data could be extracted at the level of articles rather than treatments. In
- other words, it''s not the treatments themselves that people are citing.</p>\n\n<p>To
- be clear, I think there is value in being able to identify those \"well defined
- sections\" of a publication that deal with a given taxon (i.e., treatments),
- but it''s not clear to me that these are actually the citable units people
- might hope them to be. Likewise, journals such as <i>ZooKeys</i> have DOIs
- for individual figures. Does anyone actually cite those?</p>","tags":[],"language":"en","references":[]},{"id":"https://doi.org/10.59350/pmhat-5ky65","uuid":"5891c709-d139-440f-bacb-06244424587a","url":"https://iphylo.blogspot.com/2021/10/problems-with-plazi-parsing-how.html","title":"Problems
- with Plazi parsing: how reliable are automated methods for extracting specimens
- from the literature?","summary":"The Plazi project has become one of the major
- contributors to GBIF with some 36,000 datasets yielding some 500,000 occurrences
- (see Plazi''s GBIF page for details). These occurrences are extracted from
- taxonomic publication using automated methods. New data is published almost
- daily (see latest treatments). The map below shows the geographic distribution
- of material citations provided to GBIF by Plazi, which gives you a sense of
- the size of the dataset. By any metric Plazi represents a...","date_published":"2021-10-25T11:10:00Z","date_modified":"2021-10-28T16:08:18Z","date_indexed":"1970-01-01T00:00:00+00:00","authors":[{"url":null,"name":"Roderic
- Page"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
- both;\"><a href=\"https://1.bp.blogspot.com/-oiqSkA53FI4/YXaRAUpRgFI/AAAAAAAAgzY/PbPEOh_VlhIE0aaqqhVQPLAOE5pRULwJACLcBGAsYHQ/s240/Rf7UoXTw_400x400.jpg\"
- style=\"display: block; padding: 1em 0; text-align: center; clear: right;
- float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"240\"
- data-original-width=\"240\" src=\"https://1.bp.blogspot.com/-oiqSkA53FI4/YXaRAUpRgFI/AAAAAAAAgzY/PbPEOh_VlhIE0aaqqhVQPLAOE5pRULwJACLcBGAsYHQ/s200/Rf7UoXTw_400x400.jpg\"/></a></div><p>The
- <a href=\"http://plazi.org\">Plazi</a> project has become one of the major
- contributors to GBIF with some 36,000 datasets yielding some 500,000 occurrences
- (see <a href=\"https://www.gbif.org/publisher/7ce8aef0-9e92-11dc-8738-b8a03c50a862\">Plazi''s
- GBIF page</a> for details). These occurrences are extracted from taxonomic
- publication using automated methods. New data is published almost daily (see
- <a href=\"https://tb.plazi.org/GgServer/static/newToday.html\">latest treatments</a>).
- The map below shows the geographic distribution of material citations provided
- to GBIF by Plazi, which gives you a sense of the size of the dataset.</p>\n\n<div
- class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-DCJ4HR8eej8/YXaRQnz22bI/AAAAAAAAgz4/AgRcree6jVgjtQL2ch7IXgtb_Xtx7fkngCPcBGAYYCw/s1030/Screenshot%2B2021-10-24%2Bat%2B20.43.23.png\"
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
- border=\"0\" width=\"400\" data-original-height=\"514\" data-original-width=\"1030\"
- src=\"https://1.bp.blogspot.com/-DCJ4HR8eej8/YXaRQnz22bI/AAAAAAAAgz4/AgRcree6jVgjtQL2ch7IXgtb_Xtx7fkngCPcBGAYYCw/s400/Screenshot%2B2021-10-24%2Bat%2B20.43.23.png\"/></a></div>\n\n<p>By
- any metric Plazi represents a considerable achievement. But often when I browse
- individual records on Plazi I find records that seem clearly incorrect. Text
- mining the literature is a challenging problem, but at the moment Plazi seems
- something of a \"black box\". PDFs go in, the content is mined, and data comes
- up to be displayed on the Plazi web site and uploaded to GBIF. Nowhere does
- there seem to be an evaluation of how accurate this text mining actually is.
- Anecdotally it seems to work well in some cases, but in others it produces
- what can only be described as bogus records.</p>\n\n<h2>Finding errors</h2>\n\n<p>A
- treatment in Plazi is a block of text (and sometimes illustrations) that refers
- to a single taxon. Often that text will include a description of the taxon,
- and list one or more specimens that have been examined. These lists of specimens
- (\"material citations\") are one of the key bits of information that Plaza
- extracts from a treatment as these citations get fed into GBIF as occurrences.</p>\n\n<p>To
- help explore treatments I''ve constructed a simple web site that takes the
- Plazi identifier for a treatment and displays that treatment with the material
- citations highlighted. For example, for the Plazi treatment <a href=\"https://tb.plazi.org/GgServer/html/03B5A943FFBB6F02FE27EC94FABEEAE7\">03B5A943FFBB6F02FE27EC94FABEEAE7</a>
- you can view the marked up version at <a href=\"https://plazi-tester.herokuapp.com/?uri=622F7788-F0A4-449D-814A-5B49CD20B228\">https://plazi-tester.herokuapp.com/?uri=622F7788-F0A4-449D-814A-5B49CD20B228</a>.
- Below is an example of a material citation with its component parts tagged:</p>\n\n<div
- class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-NIGuo9BggdA/YXaRQQrv0QI/AAAAAAAAgz4/SZDcA1jZSN47JMRTWDwSMRpHUShrCeOdgCPcBGAYYCw/s693/Screenshot%2B2021-10-24%2Bat%2B20.59.56.png\"
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
- border=\"0\" width=\"400\" data-original-height=\"94\" data-original-width=\"693\"
- src=\"https://1.bp.blogspot.com/-NIGuo9BggdA/YXaRQQrv0QI/AAAAAAAAgz4/SZDcA1jZSN47JMRTWDwSMRpHUShrCeOdgCPcBGAYYCw/s400/Screenshot%2B2021-10-24%2Bat%2B20.59.56.png\"/></a></div>\n\n<p>This
- is an example where Plazi has successfully parsed the specimen. But I keep
- coming across cases where specimens have not been parsed correctly, resulting
- in issues such as single specimens being split into multiple records (e.g., <a
- href=\"https://plazi-tester.herokuapp.com/?uri=5244B05EFFC8E20F7BC32056C178F496\">https://plazi-tester.herokuapp.com/?uri=5244B05EFFC8E20F7BC32056C178F496</a>),
- geographical coordinates being misinterpreted (e.g., <a href=\"https://plazi-tester.herokuapp.com/?uri=0D228E6AFFC2FFEFFF4DE8118C4EE6B9\">https://plazi-tester.herokuapp.com/?uri=0D228E6AFFC2FFEFFF4DE8118C4EE6B9</a>),
- or collector''s initials being confused with codes for natural history collections
- (e.g., <a href=\"https://plazi-tester.herokuapp.com/?uri=252C87918B362C05FF20F8C5BFCB3D4E\">https://plazi-tester.herokuapp.com/?uri=252C87918B362C05FF20F8C5BFCB3D4E</a>).</p>\n\n<p>Parsing
- specimens is a hard problem so it''s not unexpected to find errors. But they
- do seem common enough to be easily found, which raises the question of just
- what percentage of these material citations are correct? How much of the
- data Plazi feeds to GBIF is correct? How would we know?</p>\n\n<h2>Systemic
- problems</h2>\n\n<p>Some of the errors I''ve found concern the interpretation
- of the parsed data. For example, it is striking that despite including marine
- taxa <b>no</b> Plazi record has a value for depth below sea level (see <a
- href=\"https://www.gbif.org/occurrence/map?depth=0,9999&publishing_org=7ce8aef0-9e92-11dc-8738-b8a03c50a862&advanced=1\">GBIF
- search on depth range 0-9999 for Plazi</a>). But <a href=\"https://www.gbif.org/occurrence/map?elevation=0,9999&publishing_org=7ce8aef0-9e92-11dc-8738-b8a03c50a862&advanced=1\">many
- records do have an elevation</a>, including records from marine environments.
- Any record that has a depth value is interpreted by Plazi as being elevation,
- so we have aerial crustacea and fish.</p>\n\n<h3>Map of Plazi records with
- depth 0-9999m</h3>\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-GD4pPtPCxVc/YXaRQ9bdn1I/AAAAAAAAgz8/A9YsypSvHfwWKAjDxSdeFVUkou88LGItACPcBGAYYCw/s673/Screenshot%2B2021-10-25%2Bat%2B12.03.51.png\"
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
- border=\"0\" width=\"400\" data-original-height=\"258\" data-original-width=\"673\"
- src=\"https://1.bp.blogspot.com/-GD4pPtPCxVc/YXaRQ9bdn1I/AAAAAAAAgz8/A9YsypSvHfwWKAjDxSdeFVUkou88LGItACPcBGAYYCw/s320/Screenshot%2B2021-10-25%2Bat%2B12.03.51.png\"/></a></div>\n\n<h3>Map
- of Plazi records with elevation 0-9999m </h3>\n<div class=\"separator\" style=\"clear:
- both;\"><a href=\"https://1.bp.blogspot.com/-BReSHtXTOkA/YXaRRKFW7dI/AAAAAAAAg0A/-FcBkMwyswIp0siWGVX3MNMANs7UkZFtwCPcBGAYYCw/s675/Screenshot%2B2021-10-25%2Bat%2B12.04.06.png\"
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
- border=\"0\" width=\"400\" data-original-height=\"256\" data-original-width=\"675\"
- src=\"https://1.bp.blogspot.com/-BReSHtXTOkA/YXaRRKFW7dI/AAAAAAAAg0A/-FcBkMwyswIp0siWGVX3MNMANs7UkZFtwCPcBGAYYCw/s320/Screenshot%2B2021-10-25%2Bat%2B12.04.06.png\"/></a></div>\n\n<p>Anecdotally
- I''ve also noticed that Plazi seems to do well on zoological data, especially
- journals like <i>Zootaxa</i>, but it often struggles with botanical specimens.
- Botanists tend to cite specimens rather differently to zoologists (botanists
- emphasise collector numbers rather than specimen codes). Hence data quality
- in Plazi is likely to taxonomic biased.</p>\n\n<p>Plazi is <a href=\"https://github.com/plazi/community/issues\">using
- GitHub to track issues with treatments</a> so feedback on erroneous records
- is possible, but this seems inadequate to the task. There are tens of thousands
- of data sets, with more being released daily, and hundreds of thousands of
- occurrences, and relying on GitHub issues devolves the responsibility for
- error checking onto the data users. I don''t have a measure of how many records
- in Plazi have problems, but because I suspect it is a significant fraction
- because for any given day''s output I can typically find errors.</p>\n\n<h2>What
- to do?</h2>\n\n<p>Faced with a process that generates noisy data there are
- several of things we could do:</p>\n\n<ol>\n<li>Have tools to detect and flag
- errors made in generating the data.</li>\n<li>Have the data generator give
- estimates the confidence of its results.</li>\n<li>Improve the data generator.</li>\n</ol>\n\n<p>I
- think a comparison with the problem of parsing bibliographic references might
- be instructive here. There is a long history of people developing tools to
- parse references (<a href=\"https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html\">I''ve
- even had a go</a>). State-of-the art tools such as <a href=\"https://anystyle.io\">AnyStyle</a>
- feature machine learning, and are tested against <a href=\"https://github.com/inukshuk/anystyle/blob/master/res/parser/core.xml\">human
- curated datasets</a> of tagged bibliographic records. This means we can evaluate
- the performance of a method (how well does it retrieve the same results as
- human experts?) and also improve the method by expanding the corpus of training
- data. Some of these tools can provide a measures of how confident they are
- when classifying a string as, say, a person''s name, which means we could
- flag potential issues for anyone wanting to use that record.</p>\n\n<p>We
- don''t have equivalent tools for parsing specimens in the literature, and
- hence have no easy way to quantify how good existing methods are, nor do we
- have a public corpus of material citations that we can use as training data.
- I <a href=\"https://iphylo.blogspot.com/2021/05/finding-citations-of-specimens.html\">blogged
- about this</a> a few months ago and was considering using Plazi as a source
- of marked up specimen data to use for training. However based on what I''ve
- looked at so far Plazi''s data would need to be carefully scrutinised before
- it could be used as training data.</p>\n\n<p>Going forward, I think it would
- be desirable to have a set of records that can be used to benchmark specimen
- parsers, and ideally have the parsers themselves available as web services
- so that anyone can evaluate them. Even better would be a way to contribute
- to the training data so that these tools improve over time.</p>\n\n<p>Plazi''s
- data extraction tools are mostly desktop-based, that is, you need to download
- software to use their methods. However, there are experimental web services
- available as well. I''ve created a simple wrapper around the material citation
- parser, you can try it at <a href=\"https://plazi-tester.herokuapp.com/parser.php\">https://plazi-tester.herokuapp.com/parser.php</a>.
379
- It takes a single material citation and returns a version with elements such
380
- as specimen code and collector name tagged in different colours.</p>\n\n<h2>Summary</h2>\n\n<p>Text
381
- mining the taxonomic literature is clearly a gold mine of data, but at the
382
- same time it is potentially fraught as we try and extract structured data
383
- from semi-structured text. Plazi has demonstrated that it is possible to extract
384
- a lot of data from the literature, but at the same time the quality of that
385
- data seems highly variable. Even minor issues in parsing text can have big
386
- implications for data quality (e.g., marine organisms apparently living above
387
- sea level). Historically in biodiversity informatics we have favoured data
388
- quantity over data quality. Quantity has an obvious metric, and has milestones
389
- we can celebrate (e.g., <a href=\"GBIF at 1 billion - what''s next?\">one
390
- billion specimens</a>). There aren''t really any equivalent metrics for data
391
- quality.</p>\n\n<p>Adding new types of data can sometimes initially result
392
- in a new set of quality issues (e.g., <a href=\"https://iphylo.blogspot.com/2019/12/gbif-metagenomics-and-metacrap.html\">GBIF
393
- metagenomics and metacrap</a>) that take time to resolve. In the case of Plazi,
394
- I think it would be worthwhile to quantify just how many records have errors,
395
- and develop benchmarks that we can use to test methods for extracting specimen
396
- data from text. If we don''t do this then there will remain uncertainty as
397
- to how much trust we can place in data mined from the taxonomic literature.</p>\n\n<h2>Update</h2>\n\nPlazi
398
- has responded, see <a href=\"http://plazi.org/posts/2021/10/liberation-first-step-toward-quality/\">Liberating
399
- material citations as a first step to more better data</a>. My reading of
400
- their response is that it essentially just reiterates Plazi''s approach and
401
- doesn''t tackle the underlying issue: their method for extracting material
402
- citations is error prone, and many of those errors end up in GBIF.","tags":["data
403
- quality","parsing","Plazi","specimen","text mining"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/j77nc-e8x98","uuid":"c6b101f4-bfbc-4d01-921d-805c43c85757","url":"https://iphylo.blogspot.com/2022/08/linking-taxonomic-names-to-literature.html","title":"Linking
404
- taxonomic names to the literature","summary":"Just some thoughts as I work
405
- through some datasets linking taxonomic names to the literature. In the diagram
406
- above I''ve tried to capture the different situations I encounter. Much of
407
- the work I''ve done on this has focussed on case 1 in the diagram: I want
408
- to link a taxonomic name to an identifier for the work in which that name
409
- was published. In practice this means linking names to DOIs. This has the
410
- advantage of linking to a citable identifier, raising questions such as whether
411
- citations...","date_published":"2022-08-22T17:19:00Z","date_modified":"2022-08-22T17:19:08Z","date_indexed":"1909-06-16T08:21:41+00:00","authors":[{"url":null,"name":"Roderic
412
- Page"}],"image":null,"content_html":"Just some thoughts as I work through
413
- some datasets linking taxonomic names to the literature.\n\n<div class=\"separator\"
414
- style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5ZkbnumEWKZpf0isei_hUJucFlOOK08-SKnJknD8B4qhAX36u-vMQRnZdhRJK5rPb7HNYcnqB7qE4agbeStqyzMkWHrzUj14gPkz2ohmbOVg8P_nHo0hM94PD1wH3SPJsaLAumN8vih3ch9pjH2RaVWZBLwwhGhNu0FS1m5z6j5xt2NeZ4w/s2140/linking%20to%20names144.jpg\"
415
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
416
- border=\"0\" height=\"600\" data-original-height=\"2140\" data-original-width=\"1604\"
417
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5ZkbnumEWKZpf0isei_hUJucFlOOK08-SKnJknD8B4qhAX36u-vMQRnZdhRJK5rPb7HNYcnqB7qE4agbeStqyzMkWHrzUj14gPkz2ohmbOVg8P_nHo0hM94PD1wH3SPJsaLAumN8vih3ch9pjH2RaVWZBLwwhGhNu0FS1m5z6j5xt2NeZ4w/s600/linking%20to%20names144.jpg\"/></a></div>\n\n<p>In
418
- the diagram above I''ve tried to capture the different situations I encounter.
419
- Much of the work I''ve done on this has focussed on case 1 in the diagram:
420
- I want to link a taxonomic name to an identifier for the work in which that
421
- name was published. In practice this means linking names to DOIs. This has
422
- the advantage of linking to a citable identifier, raising questions such
423
- as whether citations of taxonomic papers by taxonomic databases could become
424
- part of a <a href=\"https://iphylo.blogspot.com/2022/08/papers-citing-data-that-cite-papers.html\">taxonomist''s
425
- Google Scholar profile</a>.</p>\n\n<p>In many taxonomic databases full work-level
426
- citations are not the norm, instead taxonomists cite one or more pages within
427
- a work that are relevant to a taxonomic name. These \"microcitations\" (what
428
- the U.S. legal profession refer to as \"point citations\" or \"pincites\", see
429
- <a href=\"https://rasmussen.libanswers.com/faq/283203\">What are pincites,
430
- pinpoints, or jump legal references?</a>) require some work to map to the
431
- work itself (which is typically the thing that has a citable identifier such
432
- as a DOI).</p>\n\n<p>Microcitations (case 2 in the diagram above) can be quite
433
- complex. Some might simply mention a single page, but others might list a
434
- series of (not necessarily contiguous) pages, as well as figures, plates etc.
435
- Converting these to citable identifiers can be tricky, especially as in most
436
- cases we don''t have page-level identifiers. The Biodiversity Heritage Library
437
- (BHL) does have URLs for each scanned page, and we have a standard for referring
438
- to pages in a PDF (<code>page=&lt;pageNum&gt;</code>, see <a href=\"https://datatracker.ietf.org/doc/html/rfc8118\">RFC
439
- 8118</a>). But how do we refer to a set of pages? Do we pick the first page?
440
- Do we try and represent a set of pages, and if so, how?</p>\n\n<p>Another
441
- issue with page-level identifiers is that not everything on a given page may
442
- be relevant to the taxonomic name. In case 2 above I''ve shaded in the parts
443
- of the pages and figure that refer to the taxonomic name. An example where
444
- this can be problematic is the recent test case I created for BHL where a
445
- page image was included for the taxonomic name <a href=\"https://www.gbif.org/species/195763322\"><i>Aphrophora
446
- impressa</i></a>. The image includes the species description and an illustration,
447
- as well as text that relates to other species.</p>\n\n<div class=\"separator\"
448
- style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1SAv1XiVBPHXeMLPfdsnTh4Sj4m0AQyqjXM0faXyvbNhBXFDl7SawjFIkMo3qz4RQhDEyZXkKAh5nF4gtfHbVXA6cK3NJ46CuIWECC6HmJKtTjoZ0M3QQXvLY_X-2-RecjBqEy68M0cEr0ph3l6KY51kA9BvGt9d4id314P71PBitpWMATg/s3467/https---www.biodiversitylibrary.org-pageimage-29138463.jpeg\"
449
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
450
- border=\"0\" height=\"400\" data-original-height=\"3467\" data-original-width=\"2106\"
451
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1SAv1XiVBPHXeMLPfdsnTh4Sj4m0AQyqjXM0faXyvbNhBXFDl7SawjFIkMo3qz4RQhDEyZXkKAh5nF4gtfHbVXA6cK3NJ46CuIWECC6HmJKtTjoZ0M3QQXvLY_X-2-RecjBqEy68M0cEr0ph3l6KY51kA9BvGt9d4id314P71PBitpWMATg/s400/https---www.biodiversitylibrary.org-pageimage-29138463.jpeg\"/></a></div>\n\n<p>Given
452
- that not everything on a page need be relevant, we could extract just the
453
- relevant blocks of text and illustrations (e.g., paragraphs of text, panels
454
- within a figure, etc.) and treat that set of elements as the thing to cite.
455
- This is, of course, what <a href=\"http://plazi.org\">Plazi</a> are doing.
456
- The set of extracted blocks is glued together as a \"treatment\", assigned
457
- an identifier (often a DOI), and treated as a citable unit. It would be interesting
458
- to see to what extent these treatments are actually cited, for example, do
459
- subsequent revisions that cite work that includes treatments cite those treatments,
460
- or just the work itself? Put another way, are we creating <a href=\"https://iphylo.blogspot.com/2012/09/decoding-nature-encode-ipad-app-omg-it.html\">\"threads\"</a>
461
- between taxonomic revisions?</p>\n\n<p>One reason for these notes is that
462
- I''m exploring uploading taxonomic name - literature links to <a href=\"https://www.checklistbank.org\">ChecklistBank</a>
463
- and case 1 above is easy, as is case 3 (if we have treatment-level identifiers).
464
- But case 2 is problematic because we are linking to a set of things that may
465
- not have an identifier, which means a decision has to be made about which
466
- page to link to, and how to refer to that page.</p>","tags":[],"language":"en","references":[]},{"id":"https://doi.org/10.59350/ymc6x-rx659","uuid":"0807f515-f31d-4e2c-9e6f-78c3a9668b9d","url":"https://iphylo.blogspot.com/2022/09/dna-barcoding-as-intergenerational.html","title":"DNA
467
- barcoding as intergenerational transfer of taxonomic knowledge","summary":"I
468
- tweeted about this but want to bookmark it for later as well. The paper “A
469
- molecular-based identification resource for the arthropods of Finland” doi:10.1111/1755-0998.13510
470
- contains the following: …the annotated barcode records assembled by FinBOL
471
- participants represent a tremendous intergenerational transfer of taxonomic
472
- knowledge … the time contributed by current taxonomists in identifying and
473
- contributing voucher specimens represents a great gift to future generations
474
- who will benefit...","date_published":"2022-09-14T10:12:00Z","date_modified":"2022-09-29T13:57:30Z","date_indexed":"1909-06-16T11:02:21+00:00","authors":[{"url":null,"name":"Roderic
475
- Page"}],"image":null,"content_html":"<p>I <a href=\"https://twitter.com/rdmpage/status/1569738844416638981?s=21&amp;t=9OVXuoUEwZtQt-Ldzlutfw\">tweeted
476
- about this</a> but want to bookmark it for later as well. The paper “A molecular-based
477
- identification resource for the arthropods of Finland” <a href=\"https://doi.org/10.1111/1755-0998.13510\">doi:10.1111/1755-0998.13510</a>
478
- contains the following:</p>\n<blockquote>\n<p>…the annotated barcode records
479
- assembled by FinBOL participants represent a tremendous <mark>intergenerational
480
- transfer of taxonomic knowledge</mark> … the time contributed by current taxonomists
481
- in identifying and contributing voucher specimens represents a great gift
482
- to future generations who will benefit from their expertise when they are
483
- no longer able to process new material.</p>\n</blockquote>\n<p>I think this
484
- is a very clever way to characterise the project. In an age of machine learning
485
- this may be the commonest way to share knowledge, namely as expert-labelled training
486
- data used to build tools for others. Of course, this means the expertise itself
487
- may be lost, which has implications for updating the models if the data isn’t
488
- complete. But it speaks to Charles Godfrey’s theme of <a href=\"https://biostor.org/reference/250587\">“Taxonomy
489
- as information science”</a>.</p>\n<p>Note that the knowledge is also transformed
490
- in the sense that the underlying expertise of interpreting morphology, ecology,
491
- behaviour, genomics, and the past literature is not what is being passed on.
492
- Instead it is probabilities that a DNA sequence belongs to a particular taxon.</p>\n<p>This
493
- feels different to, say, iNaturalist, where there is a machine learning
494
- model to identify images. In that case, the model is built on something the
495
- community itself has created, and continues to create. Yes, the underlying
496
- idea is the same: “experts” have labelled the data, a model is trained, the
497
- model is used. But the benefits of the <a href=\"https://www.inaturalist.org\">iNaturalist</a>
498
- model are immediately applicable to the people whose data built the model.
499
- In the case of barcoding, because the technology itself is still not in the
500
- hands of many (relative to, say, digital imaging), the benefits are perhaps
501
- less tangible. Obviously researchers working with environmental DNA will find
502
- it very useful, but broader impact may await the arrival of citizen science
503
- DNA barcoding.</p>\n<p>The other consideration is whether the barcoding helps
504
- taxonomists. Is it to be used to help prioritise future work (“we are getting
505
- lots of unknown sequences in these taxa, let’s do some taxonomy there”), or
506
- is it simply capturing the knowledge of a generation that won’t be replaced:</p>\n<blockquote>\n<p>The
507
- need to capture such knowledge is essential because there are, for example,
508
- no young Finnish taxonomists who can critically identify species in many key
509
- groups of arthropods (e.g., aphids, chewing lice, chalcid wasps, gall midges,
510
- most mite lineages).</p>\n</blockquote>\n<p>The cycle of collect data, test
511
- and refine model, collect more data, rinse and repeat that happens with iNaturalist
512
- creates a feedback loop. It’s not clear that a similar cycle exists for DNA
513
- barcoding.</p>\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":[],"language":"en","references":[]},{"id":"https://doi.org/10.59350/d3dc0-7an69","uuid":"545c177f-cea5-4b79-b554-3ccae9c789d7","url":"https://iphylo.blogspot.com/2021/10/reflections-on-macroscope-tool-for-21st.html","title":"Reflections
514
- on \"The Macroscope\" - a tool for the 21st Century?","summary":"This is a
515
- guest post by Tony Rees. It would be difficult to encounter a scientist, or
516
- anyone interested in science, who is not familiar with the microscope, a tool
517
- for making objects visible that are otherwise too small to be properly seen
518
- by the unaided eye, or to reveal otherwise invisible fine detail in larger
519
- objects. A select few with a particular interest in microscopy may also have
520
- encountered the Wild-Leica \"Macroscope\", a specialised type of benchtop
521
- microscope optimised for...","date_published":"2021-10-07T12:38:00Z","date_modified":"2021-10-08T10:26:22Z","date_indexed":"1909-06-16T10:02:25+00:00","authors":[{"url":null,"name":"Roderic
522
- Page"}],"image":null,"content_html":"<p><img src=\"https://lh3.googleusercontent.com/-A99btr6ERMs/Vl1Wvjp2OtI/AAAAAAAAEFI/7bKdRjNG5w0/ytNkVT2U.jpg?imgmax=800\"
523
- alt=\"YtNkVT2U\" title=\"ytNkVT2U.jpg\" border=\"0\" width=\"128\" height=\"128\"
524
- style=\"float:right;\" /> This is a guest post by <a href=\"https://about.me/TonyRees\">Tony
525
- Rees</a>.</p>\n\n<p>It would be difficult to encounter a scientist, or anyone
526
- interested in science, who is not familiar with the microscope, a tool for
527
- making objects visible that are otherwise too small to be properly seen by
528
- the unaided eye, or to reveal otherwise invisible fine detail in larger objects.
529
- A select few with a particular interest in microscopy may also have encountered
530
- the Wild-Leica \"Macroscope\", a specialised type of benchtop microscope optimised
531
- for low-power macro-photography. However in this overview I discuss the \"Macroscope\"
532
- in a different sense, which is that of the antithesis to the microscope: namely
533
- a method for visualizing subjects too large to be encompassed by a single
534
- field of vision, such as the Earth or some subset of its phenomena (the biosphere,
535
- for example), or conceptually, the universe.</p>\n\n<p><div class=\"separator\"
536
- style=\"clear: both;\"><a href=\"https://1.bp.blogspot.com/-xXxH9aJn1tY/YV7vnKGQWPI/AAAAAAAAgyQ/7CSJJ663Ry4IXtav54nLcLiI0kda5_L7ACLcBGAsYHQ/s500/2020045672.jpg\"
537
- style=\"display: block; padding: 1em 0; text-align: center; clear: right;
538
- float: right;\"><img alt=\"\" border=\"0\" height=\"320\" data-original-height=\"500\"
539
- data-original-width=\"303\" src=\"https://1.bp.blogspot.com/-xXxH9aJn1tY/YV7vnKGQWPI/AAAAAAAAgyQ/7CSJJ663Ry4IXtav54nLcLiI0kda5_L7ACLcBGAsYHQ/s320/2020045672.jpg\"/></a></div>My
540
- introduction to the term was via addresses given by Jesse Ausubel in the formative
541
- years of the 2001-2010 <a href=\"http://www.coml.org\">Census of Marine Life</a>,
542
- for which he was a key proponent. In Ausubel''s view, the Census would perform
543
- the function of a macroscope, permitting a view of everything that lives in
544
- the global ocean (or at least, that subset which could realistically be sampled
545
- in the time frame available) as opposed to more limited subsets available
546
- via previous data collection efforts. My view (which could, of course, be
547
- wrong) was that his thinking had been informed by a work entitled \"Le macroscope,
548
- vers une vision globale\" published in 1975 by the French thinker Joël de
549
- Rosnay, who had expressed such a concept as being globally applicable in many
550
- fields, including the physical and natural worlds but also extending to human
551
- society, the growth of cities, and more. Yet again, some ecologists may also
552
- have encountered the term, sometimes in the guise of \"Odum''s macroscope\",
553
- as an approach for obtaining \"big picture\" analyses of macroecological processes
554
- suitable for mathematical modelling, typically by elimination of fine detail
555
- so that only the larger patterns remain, as initially advocated by Howard
556
- T. Odum in his 1971 book \"Environment, Power, and Society\".</p>\n\n<p>From
557
- the standpoint of the 21st century, it seems that we are closer to achieving
558
- a \"macroscope\" (or possibly, multiple such tools) than ever before, based
559
- on the availability of existing and continuing new data streams, improved
560
- technology for data assembly and storage, and advanced ways to query and combine
561
- these large streams of data to produce new visualizations, data products,
562
- and analytical findings. I devote the remainder of this article to examples
563
- where either particular workers have employed \"macroscope\" terminology to
564
- describe their activities, or where potentially equivalent actions are taking
565
- place without the explicit \"macroscope\" association, but are equally worthy
566
- of consideration. To save space, references cited here (most or all)
567
- can be found via a Wikipedia article entitled \"<a href=\"https://en.wikipedia.org/wiki/Macroscope_(science_concept)\">Macroscope
568
- (science concept)</a>\" that I authored on the subject around a year ago,
569
- and have continued to add to on occasion as new thoughts or information come
570
- to hand (see <a href=\"https://en.wikipedia.org/w/index.php?title=Macroscope_(science_concept)&offset=&limit=500&action=history\">edit
571
- history for the article</a>).</p>\n\n<p>First, one can ask, what constitutes
572
- a macroscope, in the present context? In the Wikipedia article I point to
573
- a book \"Big Data - Related Technologies, Challenges and Future Prospects\"
574
- by Chen <em>et al.</em> (2014) (<a href=\"https://doi.org/10.1007/978-3-319-06245-7\">doi:10.1007/978-3-319-06245-7</a>),
575
- in which the \"value chain of big data\" is characterised as divisible into
576
- four phases, namely data generation, data acquisition (aka data assembly),
577
- data storage, and data analysis. To my mind, data generation (which others
578
- may term acquisition, differently from the usage by Chen <em>et al.</em>)
579
- is obviously the first step, but does not in itself constitute the macroscope,
580
- except in rare cases - such as Landsat imagery, perhaps - where on its own,
581
- a single co-ordinated data stream is sufficient to meet the need for a particular
582
- type of \"global view\". A variant of this might be a coordinated data collection
583
- program - such as that of the ten year Census of Marine Life - which might
584
- produce the data required for the desired global view; but again, in reality,
585
- such data are collected in a series of discrete chunks, in many and often
586
- disparate data formats, and must be \"wrangled\" into a more coherent whole
587
- before any meaningful \"macroscope\" functionality becomes available.</p>\n\n<p>Here
588
- we come to what, in my view, constitutes the heart of the \"macroscope\":
589
- an intelligently organized (i.e. indexable and searchable), coherent data
590
- store or repository (where \"data\" may include imagery and other non numeric
591
- data forms, but much else besides). Taking the Census of Marine Life example,
592
- the data repository for that project''s data (plus other available sources
593
- as inputs) is the <a href=\"https://obis.org\">Ocean Biodiversity Information
594
- System</a> or OBIS (previously the Ocean Biogeographic Information System),
595
- which according to this view forms the \"macroscope\" for which the Census
596
- data is a feed. (For non habitat-specific biodiversity data, <a href=\"https://www.gbif.org\">GBIF</a>
597
- is an equivalent, and more extensive, operation). Other planetary scale \"macroscopes\",
598
- by this definition (which may or may not have an explicit geographic, i.e.
599
- spatial, component) would include inventories of biological taxa such as the
600
- <a href=\"https://www.catalogueoflife.org\">Catalogue of Life</a> and so on,
601
- all the way back to the pioneering compendia published by Linnaeus in the
602
- eighteenth century; while for cartography and topographic imagery, the current
603
- \"blockbuster\" of <a href=\"http://earth.google.com\">Google Earth</a> and
604
- its predecessors also come well into public consciousness.</p>\n\n<p>In the
605
- view of some workers and/or operations, both of these phases are precursors
606
- to the real \"work\" of the macroscope which is to reveal previously unseen
607
- portions of the \"big picture\" by means either of the availability of large,
608
- synoptic datasets, or fusion between different data streams to produce novel
609
- insights. Companies such as IBM and Microsoft have used phraseology such as:</p>\n\n<blockquote>By
610
- 2022 we will use machine-learning algorithms and software to help us organize
611
- information about the physical world, helping bring the vast and complex data
612
- gathered by billions of devices within the range of our vision and understanding.
613
- We call this a \"macroscope\" – but unlike the microscope to see the very
614
- small, or the telescope that can see far away, it is a system of software
615
- and algorithms to bring all of Earth''s complex data together to analyze it
616
- by space and time for meaning.\" (IBM)</blockquote>\n\n<blockquote>As the
617
- Earth becomes increasingly instrumented with low-cost, high-bandwidth sensors,
618
- we will gain a better understanding of our environment via a virtual, distributed
619
- whole-Earth \"macroscope\"... Massive-scale data analytics will enable real-time
620
- tracking of disease and targeted responses to potential pandemics. Our virtual
621
- \"macroscope\" can now be used on ourselves, as well as on our planet.\" (Microsoft)
622
- (references available via the Wikipedia article cited above).</blockquote>\n\n<p>Whether
623
- or not the analytical capabilities described here are viewed as being an integral
624
- part of the \"macroscope\" concept, or are maybe an add-on, is ultimately
625
- a question of semantics and perhaps, personal opinion. Continuing the Census
626
- of Marine Life/OBIS example, OBIS offers some (arguably rather basic) visualization
627
- and summary tools, but also makes its data available for download to users
628
- wishing to analyse it further according to their own particular interests;
629
- using OBIS data in this manner, Mark Costello et al. in 2017 were able to
630
- demarcate a finite number of data-supported marine biogeographic realms for
631
- the first time (Costello et al. 2017: Nature Communications. 8: 1057. <a href=\"https://doi.org/10.1038/s41467-017-01121-2\">doi:10.1038/s41467-017-01121-2</a>),
632
- a project which I was able to assist in a small way in an advisory capacity.
633
- In a case such as this, perhaps the final function of the macroscope, namely
634
- data visualization and analysis, was outsourced to the authors'' own research
635
- institution. Similarly at an earlier phase, \"data aggregation\" can also
636
- be virtual rather than actual, i.e. avoiding using a single physical system
637
- to hold all the data, enabled by open web mapping standards WMS (web map service)
638
- and WFS (web feature service) to access a set of distributed data stores,
639
- e.g. as implemented on the portal for the <a href=\"https://portal.aodn.org.au/\">Australian
640
- Ocean Data Network</a>.</p>\n\n<p>So, as we pass through the third decade
641
- of the twenty first century, what developments await us in the \"macroscope\"
642
- area? In the biodiversity space, one can reasonably presume that the existing
643
- \"macroscopic\" data assembly projects such as OBIS and GBIF will continue,
644
- and hopefully slowly fill current gaps in their coverage - although in the
645
- marine area, strategic new data collection exercises may be required (Census
646
- 2020, or 2025, anyone?), while (again hopefully), the Catalogue of Life will
647
- continue its progress towards a \"complete\" species inventory for the biosphere.
648
- The Landsat project, with imagery dating back to 1972, continues with the
649
- launch of its latest satellite Landsat 9 just this year (21 September 2021)
650
- with a planned mission duration for the next 5 years, so the \"macroscope\"
651
- functionality of that project seems set to continue for the medium term at
652
- least. Meanwhile the ongoing development of sensor networks, both on land
653
- and in the ocean, offers an exciting new method of \"instrumenting the earth\"
654
- to obtain much more real time data than has ever been available in the past,
655
- offering scope for many more, use case-specific \"macroscopes\" to be constructed
656
- that can fuse (e.g.) satellite imagery with much more that is happening at
657
- a local level.</p>\n\n<p>So, the \"macroscope\" concept appears to be alive
658
- and well, even though the nomenclature can change from time to time (IBM''s
659
- \"Macroscope\", foreshadowed in 2017, became the \"IBM Pairs Geoscope\" on
660
- implementation, and is now simply the \"Geospatial Analytics component within
661
- the IBM Environmental Intelligence Suite\" according to available IBM publicity
662
- materials). In reality this illustrates a new dichotomy: even if \"everyone\"
663
- in principle has access to huge quantities of publicly available data, maybe
664
- only a few well funded entities now have the computational ability to make
665
- sense of it, and can charge clients a good fee for their services...</p>\n\n<p>I
666
- present this account partly to give a brief picture of \"macroscope\" concepts
667
- today and in the past, for those who may be interested, and partly to present
668
- a few personal views which would be out of scope in a \"neutral point of view\"
669
- article such as is required on Wikipedia; also to see if readers of this blog
670
- would like to contribute further to discussion of any of the concepts traversed
671
- herein.</p>","tags":["guest post","macroscope"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/gf1dw-n1v47","uuid":"a41163e0-9c9a-41e0-a141-f772663f2f32","url":"https://iphylo.blogspot.com/2023/03/dugald-stuart-page-1936-2022.html","title":"Dugald
672
- Stuart Page 1936-2022","summary":"My dad died last weekend. Below is a notice
673
- in today''s New Zealand Herald. I''m in New Zealand for his funeral. Don''t
674
- really have the words for this right now.","date_published":"2023-03-14T03:00:00Z","date_modified":"2023-03-22T07:25:56Z","date_indexed":"1909-06-16T10:41:55+00:00","authors":[{"url":null,"name":"Roderic
675
- Page"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
676
- both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZweukxntl7R5jnk3knVFVrqZ5RxC7mPZBV4gKeDIglbFzs2O442nbxqs8t8jV2tLqCU24K6gS32jW-Pe8q3O_5JR1Ms3qW1aQAZ877cKkFfcUydqUba9HsgNlX-zS9Ne92eLxRGS8F-lStTecJw2oalp3u58Yoc0oM7CUin5LKPeFIJ7Rzg/s3454/_DSC5106.jpg\"
677
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
678
- border=\"0\" width=\"400\" data-original-height=\"2582\" data-original-width=\"3454\"
679
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZweukxntl7R5jnk3knVFVrqZ5RxC7mPZBV4gKeDIglbFzs2O442nbxqs8t8jV2tLqCU24K6gS32jW-Pe8q3O_5JR1Ms3qW1aQAZ877cKkFfcUydqUba9HsgNlX-zS9Ne92eLxRGS8F-lStTecJw2oalp3u58Yoc0oM7CUin5LKPeFIJ7Rzg/s400/_DSC5106.jpg\"/></a></div>\n\nMy
680
- dad died last weekend. Below is a notice in today''s New Zealand Herald. I''m
681
- in New Zealand for his funeral. Don''t really have the words for this right
682
- now.\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRUTOFF1VWHCl8dg3FQuaWy5LM7aX8IivdRpTtzgrdQTEymsA5bLTZE3cSQf1WQIP3XrC46JsLScP8BxTK9C5a-B1i51yg8WGSJD0heJVaoDLnerv0lD1o3qloDjqEuuyfX4wagHB5YYBmjWnGeVQvyYVngvDDf9eM6pmMtZ7x94Y4jSVrug/s3640/IMG_2870.jpeg\"
683
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
684
- border=\"0\" height=\"320\" data-original-height=\"3640\" data-original-width=\"1391\"
685
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRUTOFF1VWHCl8dg3FQuaWy5LM7aX8IivdRpTtzgrdQTEymsA5bLTZE3cSQf1WQIP3XrC46JsLScP8BxTK9C5a-B1i51yg8WGSJD0heJVaoDLnerv0lD1o3qloDjqEuuyfX4wagHB5YYBmjWnGeVQvyYVngvDDf9eM6pmMtZ7x94Y4jSVrug/s320/IMG_2870.jpeg\"/></a></div>","tags":[],"language":"en","references":[]},{"id":"https://doi.org/10.59350/cbzgz-p8428","uuid":"a93134aa-8b33-4dc7-8cd4-76cdf64732f4","url":"https://iphylo.blogspot.com/2023/04/library-interfaces-knowledge-graphs-and.html","title":"Library
686
- interfaces, knowledge graphs, and Miller columns","summary":"Some quick notes
687
- on interface ideas for digital libraries and/or knowledge graphs. Recently
688
- there’s been something of an explosion in bibliographic tools to explore the
689
- literature. Examples include: Elicit which uses AI to search for and summarise
690
- papers _scite which uses AI to do sentiment analysis on citations (does paper
691
- A cite paper B favourably or not?) ResearchRabbit which uses lists, networks,
692
- and timelines to discover related research Scispace which navigates connections
693
- between...","date_published":"2023-04-25T13:01:00Z","date_modified":"2023-04-27T14:51:08Z","date_indexed":"1909-06-16T11:25:14+00:00","authors":[{"url":null,"name":"Roderic
694
- Page"}],"image":null,"content_html":"<p>Some quick notes on interface ideas
695
- for digital libraries and/or knowledge graphs.</p>\n<p>Recently there’s been
696
- something of an explosion in bibliographic tools to explore the literature.
697
- Examples include:</p>\n<ul>\n<li><a href=\"https://elicit.org\">Elicit</a>
698
- which uses AI to search for and summarise papers</li>\n<li><a href=\"https://scite.ai\">_scite</a>
699
- which uses AI to do sentiment analysis on citations (does paper A cite paper
700
- B favourably or not?)</li>\n<li><a href=\"https://www.researchrabbit.ai\">ResearchRabbit</a>
701
- which uses lists, networks, and timelines to discover related research</li>\n<li><a
702
- href=\"https://typeset.io\">Scispace</a> which navigates connections between
703
- papers, authors, topics, etc., and provides AI summaries.</li>\n</ul>\n<p>As
704
- an aside, I think these (and similar tools) are a great example of how bibliographic
705
- data such as abstracts, the citation graph and - to a lesser extent - full
706
- text - have become commodities. That is, what was once proprietary information
707
- is now free to anyone, which in turns means a whole ecosystem of new tools
708
- can emerge. If I was clever I’d be building a <a href=\"https://en.wikipedia.org/wiki/Wardley_map\">Wardley
709
- map</a> to explore this. Note that a decade or so ago reference managers like
710
- <a href=\"https://www.zotero.org\">Zotero</a> were made possible by publishers
711
- exposing basic bibliographic data on their articles. As we move to <a href=\"https://i4oc.org\">open
712
- citations</a> we are seeing the next generation of tools.</p>\n<p>Back to
713
- my main topic. As usual, rather than focus on what these tools do I’m more
714
- interested in how they <strong>look</strong>. I have history here, when the
715
- iPad came out I was intrigued by the possibilities it offered for displaying
716
- academic articles, as discussed <a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad.html\">here</a>,
717
- <a href=\"https://iphylo.blogspot.com/2010/09/viewing-scientific-articles-on-ipad.html\">here</a>,
718
- <a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad_24.html\">here</a>,
719
- <a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad_3052.html\">here</a>,
720
- and <a href=\"https://iphylo.blogspot.com/2010/08/viewing-scientific-articles-on-ipad_31.html\">here</a>.
721
- ResearchRabbit looks like this:</p>\n<div style=\"padding:86.91% 0 0 0;position:relative;\"><iframe
722
- src=\"https://player.vimeo.com/video/820871442?h=23b05b0dae&amp;badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479\"
723
- frameborder=\"0\" allow=\"autoplay; fullscreen; picture-in-picture\" allowfullscreen
724
- style=\"position:absolute;top:0;left:0;width:100%;height:100%;\" title=\"ResearchRabbit\"></iframe></div><script
725
- src=\"https://player.vimeo.com/api/player.js\"></script>\n<p>Scispace’s <a
726
- href=\"https://typeset.io/explore/journals/parassitologia-1ieodjwe\">“trace”
727
- view</a> looks like this:</p>\n<div style=\"padding:84.55% 0 0 0;position:relative;\"><iframe
728
- src=\"https://player.vimeo.com/video/820871348?h=2db7b661ef&amp;badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479\"
729
- frameborder=\"0\" allow=\"autoplay; fullscreen; picture-in-picture\" allowfullscreen
730
- style=\"position:absolute;top:0;left:0;width:100%;height:100%;\" title=\"Scispace
731
- screencast\"></iframe></div><script src=\"https://player.vimeo.com/api/player.js\"></script>\n<p>What
732
- is interesting about both is that they display content from left to right
733
- in vertical columns, rather than the more common horizontal rows. This sort
734
- of display is sometimes called <a href=\"https://en.wikipedia.org/wiki/Miller_columns\">Miller
735
- columns</a> or a <a href=\"https://web.archive.org/web/20210726134921/http://designinginterfaces.com/firstedition/index.php?page=Cascading_Lists\">cascading
736
- list</a>.</p>\n\n<div class=\"separator\" style=\"clear: both;\"><a href=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBPnV9fRBcvm-BX5PjzfG5Cff9PerCLsTW8d5ZbsL6b41t7ypD7ovmcgfTf3b4b34mbq8NM4sfwOHkgEq32FLYnD497RFQD4HQmYmh5Eveu1zWdDVyKyDtPyE98QoxTaOEnLA5kK0fnl3dOOEgUvtVKlTZ8bt1gj2v_8tDRWl9f50ybyei3A/s1024/GNUstep-liveCD.png\"
737
- style=\"display: block; padding: 1em 0; text-align: center; \"><img alt=\"\"
738
- border=\"0\" width=\"400\" data-original-height=\"768\" data-original-width=\"1024\"
739
- src=\"https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBPnV9fRBcvm-BX5PjzfG5Cff9PerCLsTW8d5ZbsL6b41t7ypD7ovmcgfTf3b4b34mbq8NM4sfwOHkgEq32FLYnD497RFQD4HQmYmh5Eveu1zWdDVyKyDtPyE98QoxTaOEnLA5kK0fnl3dOOEgUvtVKlTZ8bt1gj2v_8tDRWl9f50ybyei3A/s400/GNUstep-liveCD.png\"/></a></div>\n\n<p>By
740
- Gürkan Sengün (talk) - Own work, Public Domain, <a href=\"https://commons.wikimedia.org/w/index.php?curid=594715\">https://commons.wikimedia.org/w/index.php?curid=594715</a></p>\n<p>I’ve
741
- always found displaying a knowledge graph to be a challenge, as discussed
742
- <a href=\"https://iphylo.blogspot.com/2019/07/notes-on-collections-knowledge-graphs.html\">elsewhere
743
- on this blog</a> and in my paper on <a href=\"https://peerj.com/articles/6739/#p-29\">Ozymandias</a>.
744
- Miller columns enable one to drill down in increasing depth, but it doesn’t
745
- need to be a tree, it can be a path within a network. What I like about ResearchRabbit
746
- and the original Scispace interface is that they present the current item
747
- together with a list of possible connections (e.g., authors, citations) that
748
- you can drill down on. Clicking on these will result in a new column being
749
- appended to the right, with a view (typically a list) of the next candidates
750
- to visit. In graph terms, these are adjacent nodes to the original item. The
751
- clickable badges on each item can be thought of as sets of edges that have
752
- the same label (e.g., “authored by”, “cites”, “funded”, “is about”, etc.).
753
- Each of these nodes itself becomes a starting point for further exploration.
754
- Note that the original starting point isn’t privileged, other than being the
755
- starting point. That is, each time we drill down we are seeing the same type
756
- of information displayed in the same way. Note also that the navigation can
757
- be though of as a <strong>card</strong> for a node, with <strong>buttons</strong>
758
- grouping the adjacent nodes. When we click on an individual button, it expands
759
- into a <strong>list</strong> in the next column. This can be thought of as
760
- a preview for each adjacent node. Clicking on an element in the list generates
761
- a new card (we are viewing a single node) and we get another set of buttons
762
- corresponding to the adjacent nodes.</p>\n<p>One important behaviour in a
763
- Miller column interface is that the current path can be pruned at any point.
764
- If we go back (i.e., scroll to the left) and click on another tab on an item,
765
- everything downstream of that item (i.e., to the right) gets deleted and replaced
766
- by a new set of nodes. This could make retrieving a particular history of
767
- browsing a bit tricky, but encourages exploration. Both Scispace and ResearchRabbit have
768
- the ability to add items to a collection, so you can keep track of things
769
- you discover.</p>\n<p>Lots of food for thought, I’m assuming that there is
770
- some user interface/experience research on Miller columns. One thing to remember
771
- is that Miller columns are most often associated with trees, but in this case
772
- we are exploring a network. That means that potentially there is no limit
773
- to the number of columns being generated as we wander through the graph. It
774
- will be interesting to think about what the average depth is likely to be,
775
- in other words, how deep down the rabbit hole will be go?</p>\n\n<h3>Update</h3>\n<p>Should
776
- add link to David Regev''s explorations of <a href=\"https://medium.com/david-regev-on-ux/flow-browser-b730daf0f717\">Flow
777
- Browser</a>.\n\n<blockquote>\n<p>Written with <a href=\"https://stackedit.io/\">StackEdit</a>.</p>\n</blockquote>","tags":["cards","flow","Knowledge
778
- Graph","Miller column","RabbitResearch"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/t6fb9-4fn44","uuid":"8bc3fea6-cb86-4344-8dad-f312fbf58041","url":"https://iphylo.blogspot.com/2021/12/the-business-of-extracting-knowledge.html","title":"The
779
- Business of Extracting Knowledge from Academic Publications","summary":"Markus
780
- Strasser (@mkstra write a fascinating article entitled \"The Business of Extracting
781
- Knowledge from Academic Publications\". I spent months working on domain-specific
782
- search engines and knowledge discovery apps for biomedicine and eventually
783
- figured that synthesizing \"insights\" or building knowledge graphs by machine-reading
784
- the academic literature (papers) is *barely useful* :https://t.co/eciOg30Odc—
785
- Markus Strasser (@mkstra) December 7, 2021 His TL;DR: TL;DR: I worked on biomedical...","date_published":"2021-12-11T00:01:00Z","date_modified":"2021-12-11T00:01:21Z","date_indexed":"1909-06-16T11:32:09+00:00","authors":[{"url":null,"name":"Roderic
786
- Page"}],"image":null,"content_html":"<p>Markus Strasser (<a href=\"https://twitter.com/mkstra\">@mkstra</a>
787
- write a fascinating article entitled <a href=\"https://markusstrasser.org/extracting-knowledge-from-literature/\">\"The
788
- Business of Extracting Knowledge from Academic Publications\"</a>.</p>\n\n<blockquote
789
- class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">I spent months working
790
- on domain-specific search engines and knowledge discovery apps for biomedicine
791
- and eventually figured that synthesizing &quot;insights&quot; or building
792
- knowledge graphs by machine-reading the academic literature (papers) is *barely
793
- useful* :<a href=\"https://t.co/eciOg30Odc\">https://t.co/eciOg30Odc</a></p>&mdash;
794
- Markus Strasser (@mkstra) <a href=\"https://twitter.com/mkstra/status/1468334482113523716?ref_src=twsrc%5Etfw\">December
795
- 7, 2021</a></blockquote> <script async src=\"https://platform.twitter.com/widgets.js\"
796
- charset=\"utf-8\"></script>\n\n<p>His TL;DR:</p>\n\n<p><blockquote>\nTL;DR:
797
- I worked on biomedical literature search, discovery and recommender web applications
798
- for many months and concluded that extracting, structuring or synthesizing
799
- \"insights\" from academic publications (papers) or building knowledge bases
800
- from a domain corpus of literature has negligible value in industry.</p>\n\n<p>Close
801
- to nothing of what makes science actually work is published as text on the
802
- web.\n</blockquote></p>\n\n<p>After recounting the many problems of knowledge
803
- extraction - including a swipe at nanopubs which \"are ... dead in my view
804
- (without admitting it)\" - he concludes:</p>\n\n<p><blockquote>\nI’ve been
805
- flirting with this entire cluster of ideas including open source web annotation,
806
- semantic search and semantic web, public knowledge graphs, nano-publications,
807
- knowledge maps, interoperable protocols and structured data, serendipitous
808
- discovery apps, knowledge organization, communal sense making and academic
809
- literature/publishing toolchains for a few years on and off ... nothing of
810
- it will go anywhere.</p>\n\n<p>Don’t take that as a challenge. Take it as
811
- a red flag and run. Run towards better problems.\n</blockquote></p>\n\n<p>Well
812
- worth a read, and much food for thought.</p>","tags":["ai","business model","text
813
- mining"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/463yw-pbj26","uuid":"dc829ab3-f0f1-40a4-b16d-a36dc0e34166","url":"https://iphylo.blogspot.com/2022/12/david-remsen.html","title":"David
814
- Remsen","summary":"I heard yesterday from Martin Kalfatovic (BHL) that David
815
- Remsen has died. Very sad news. It''s starting to feel like iPhylo might end
816
- up being a list of obituaries of people working on biodiversity informatics
817
- (e.g., Scott Federhen). I spent several happy visits at MBL at Woods Hole
818
- talking to Dave at the height of the uBio project, which really kickstarted
819
- large scale indexing of taxonomic names, and the use of taxonomic name finding
820
- tools to index the literature. His work on uBio with David...","date_published":"2022-12-16T17:54:00Z","date_modified":"2022-12-17T08:12:23Z","date_indexed":"1909-06-16T11:41:39+00:00","authors":[{"url":null,"name":"Roderic
821
- Page"}],"image":null,"content_html":"<p>I heard yesterday from Martin Kalfatovic
822
- (BHL) that David Remsen has died. Very sad news. It''s starting to feel like
823
- iPhylo might end up being a list of obituaries of people working on biodiversity
824
- informatics (e.g., <a href=\"https://iphylo.blogspot.com/2016/05/scott-federhen-rip.html\">Scott
825
- Federhen</a>).</p>\n\n<p>I spent several happy visits at MBL at Woods Hole
826
- talking to Dave at the height of the uBio project, which really kickstarted
827
- large scale indexing of taxonomic names, and the use of taxonomic name finding
828
- tools to index the literature. His work on uBio with David (\"Paddy\") Patterson
829
- led to the <a href=\"https://eol.org\">Encyclopedia of Life</a> (EOL).</p>\n\n<p>A
830
- number of the things I''m currently working on are things Dave started. For
831
- example, I recently uploaded a version of his dataset for Nomenclator Zoologicus[1]
832
- to <a href=\"https://www.checklistbank.org/dataset/126539/about\">ChecklistBank</a>
833
- where I''m working on augmenting that original dataset by adding links to
834
- the taxonomic literature. My <a href=\"https://biorss.herokuapp.com/?feed=Y291bnRyeT1XT1JMRCZwYXRoPSU1QiUyMkJJT1RBJTIyJTVE\">BioRSS
835
- project</a> is essentially an attempt to revive uBioRSS[2] (see <a href=\"https://iphylo.blogspot.com/2021/11/revisiting-rss-to-monitor-latests.html\">Revisiting
836
- RSS to monitor the latest taxonomic research</a>).</p>\n\n<p>I have fond memories
837
- of those visits to Woods Hole. A very sad day indeed.</p>\n\n<p><b>Update:</b>
838
- The David Remsen Memorial Fund has been set up on <a href=\"https://www.gofundme.com/f/david-remsen-memorial-fund\">GoFundMe</a>.</p>\n\n<p>1.
839
- Remsen, D. P., Norton, C., & Patterson, D. J. (2006). Taxonomic Informatics
840
- Tools for the Electronic Nomenclator Zoologicus. The Biological Bulletin,
841
- 210(1), 18–24. https://doi.org/10.2307/4134533</p>\n\n<p>2. Patrick R. Leary,
842
- David P. Remsen, Catherine N. Norton, David J. Patterson, Indra Neil Sarkar,
843
- uBioRSS: Tracking taxonomic literature using RSS, Bioinformatics, Volume 23,
844
- Issue 11, June 2007, Pages 1434–1436, https://doi.org/10.1093/bioinformatics/btm109</p>","tags":["David
845
- Remsen","obituary","uBio"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/3s376-6bm21","uuid":"62e7b438-67a3-44ac-a66d-3f5c278c949e","url":"https://iphylo.blogspot.com/2022/02/deduplicating-bibliographic-data.html","title":"Deduplicating
846
- bibliographic data","summary":"There are several instances where I have a
847
- collection of references that I want to deduplicate and merge. For example,
848
- in Zootaxa has no impact factor I describe a dataset of the literature cited
849
- by articles in the journal Zootaxa. This data is available on Figshare (https://doi.org/10.6084/m9.figshare.c.5054372.v4),
850
- as is the equivalent dataset for Phytotaxa (https://doi.org/10.6084/m9.figshare.c.5525901.v1).
851
- Given that the same articles may be cited many times, these datasets have
852
- lots of...","date_published":"2022-02-03T15:09:00Z","date_modified":"2022-02-03T15:11:29Z","date_indexed":"1909-06-16T10:22:30+00:00","authors":[{"url":null,"name":"Roderic
853
- Page"}],"image":null,"content_html":"<p>There are several instances where
854
- I have a collection of references that I want to deduplicate and merge. For
855
- example, in <a href=\"https://iphylo.blogspot.com/2020/07/zootaxa-has-no-impact-factor.html\">Zootaxa
856
- has no impact factor</a> I describe a dataset of the literature cited by articles
857
- in the journal <i>Zootaxa</i>. This data is available on Figshare (<a href=\"https://doi.org/10.6084/m9.figshare.c.5054372.v4\">https://doi.org/10.6084/m9.figshare.c.5054372.v4</a>),
858
- as is the equivalent dataset for <i>Phytotaxa</i> (<a href=\"https://doi.org/10.6084/m9.figshare.c.5525901.v1\">https://doi.org/10.6084/m9.figshare.c.5525901.v1</a>).
859
- Given that the same articles may be cited many times, these datasets have
860
- lots of duplicates. Similarly, articles in <a href=\"https://species.wikimedia.org\">Wikispecies</a>
861
- often have extensive lists of references cited, and the same reference may
862
- appear on multiple pages (for an initial attempt to extract these references
863
- see <a href=\"https://doi.org/10.5281/zenodo.5801661\">https://doi.org/10.5281/zenodo.5801661</a>
864
- and <a href=\"https://github.com/rdmpage/wikispecies-parser\">https://github.com/rdmpage/wikispecies-parser</a>).</p>\n\n<p>There
865
- are several reasons I want to merge these references. If I want to build a
866
- citation graph for <i>Zootaxa</i> or <i>Phytotaxa</i> I need to merge references
867
- that are the same so that I can accurate count citations. I am also interested
868
- in harvesting the metadata to help find those articles in the <a href=\"https://www.biodiversitylibrary.org\">Biodiversity
869
- Heritage Library</a> (BHL), and the literature cited section of scientific
870
- articles is a potential goldmine of bibliographic metadata, as is Wikispecies.</p>\n\n<p>After
871
- various experiments and false starts I''ve created a repository <a href=\"https://github.com/rdmpage/bib-dedup\">https://github.com/rdmpage/bib-dedup</a>
872
- to host a series of PHP scripts to deduplicate bibliographics data. I''ve
873
- settled on using CSL-JSON as the format for bibliographic data. Because deduplication
874
- relies on comparing pairs of references, the standard format for most of the
875
- scripts is a JSON array containing a pair of CSL-JSON objects to compare.
876
- Below are the steps the code takes.</p>\n\n<h2>Generating pairs to compare</h2>\n\n<p>The
877
- first step is to take a list of references and generate the pairs that will
878
- be compared. I started with this approach as I wanted to explore machine learning
879
- and wanted a simple format for training data, such as an array of two CSL-JSON
880
- objects and an integer flag representing whether the two references were the
881
- same of different.</p>\n\n<p>There are various ways to generate CSL-JSON for
882
- a reference. I use a tool I wrote (see <a href=\"https://iphylo.blogspot.com/2021/07/citation-parsing-released.html\">Citation
883
- parsing tool released</a>) that has a simple API where you parse one or more
884
- references and it returns that reference as structured data in CSL-JSON.</p>\n\n<p>Attempting
885
- to do all possible pairwise comparisons rapidly gets impractical as the number
886
- of references increases, so we need some way to restrict the number of comparisons
887
- we make. One approach I''ve explored is the “sorted neighbourhood method”
888
- where we sort the references 9for example by their title) then move a sliding
889
- window down the list of references, comparing all references within that window.
890
- This greatly reduces the number of pairwise comparisons. So the first step
891
- is to sort the references, then run a sliding window over them, output all
892
- the pairs in each window (ignoring in pairwise comparisons already made in
893
- a previous window). Other methods of \"blocking\" could also be used, such
894
- as only including references in a particular year, or a particular journal.</p>\n\n<p>So,
895
- the output of this step is a set of JSON arrays, each with a pair of references
896
- in CSL-JSON format. Each array is stored on a single line in the same file
897
- in <a href=\"https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON_2\">line-delimited
898
- JSON</a> (JSONL).</p>\n\n<h2>Comparing pairs</h2>\n\n<p>The next step is to
899
- compare each pair of references and decide whether they are a match or not.
900
- Initially I explored a machine learning approach used in the following paper:</p>\n\n<blockquote>\nWilson
901
- DR. 2011. Beyond probabilistic record linkage: Using neural networks and complex
902
- features to improve genealogical record linkage. In: The 2011 International
903
- Joint Conference on Neural Networks. 9–14. DOI: <a href=\"https://doi.org/10.1109/IJCNN.2011.6033192\">10.1109/IJCNN.2011.6033192</a>\n</blockquote>\n\n<p>Initial
904
- experiments using <a href=\"https://github.com/jtet/Perceptron\">https://github.com/jtet/Perceptron</a>
905
- were promising and I want to play with this further, but I deciding to skip
906
- this for now and just use simple string comparison. So for each CSL-JSON object
907
- I generate a citation string in the same format using CiteProc, then compute
908
- the <a href=\"https://en.wikipedia.org/wiki/Levenshtein_distance\">Levenshtein
909
- distance</a> between the two strings. By normalising this distance by the
910
- length of the two strings being compared I can use an arbitrary threshold
911
- to decide if the references are the same or not.</p>\n\n<h2>Clustering</h2>\n\n<p>For
912
- this step we read the JSONL file produced above and record whether the two
913
- references are a match or not. Assuming each reference has a unique identifier
914
- (needs only be unique within the file) then we can use those identifier to
915
- record the clusters each reference belongs to. I do this using a <a href=\"https://en.wikipedia.org/wiki/Disjoint-set_data_structure\">Disjoint-set
916
- data structure</a>. For each reference start with a graph where each node
917
- represents a reference, and each node has a pointer to a parent node. Initially
918
- the reference is its own parent. A simple implementation is to have an array
919
- index by reference identifiers and where the value of each cell in the array
920
- is the node''s parent.</p>\n\n<p>As we discover pairs we update the parents
921
- of the nodes to reflect this, such that once all the comparisons are done
922
- we have a one or more sets of clusters corresponding to the references that
923
- we think are the same. Another way to think of this is that we are getting
924
- the components of a graph where each node is a reference and pair of references
925
- that match are connected by an edge.</p>\n\n<p>In the code I''m using I write
926
- this graph in <a href=\"https://en.wikipedia.org/wiki/Trivial_Graph_Format\">Trivial
927
- Graph Format</a> (TGF) which can be visualised using a tools such as <a href=\"https://www.yworks.com/products/yed\">yEd</a>.</p>\n\n<h2>Merging</h2>\n\n<p>Now
928
- that we have a graph representing the sets of references that we think are
929
- the same we need to merge them. This is where things get interesting as the
930
- references are similar (by definition) but may differ in some details. The
931
- paper below describes a simple Bayesian approach for merging records:</p>\n\n<blockquote>\nCouncill
932
- IG, Li H, Zhuang Z, Debnath S, Bolelli L, Lee WC, Sivasubramaniam A, Giles
933
- CL. 2006. Learning Metadata from the Evidence in an On-line Citation Matching
934
- Scheme. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital
935
- Libraries. JCDL ’06. New York, NY, USA: ACM, 276–285. DOI: <a href=\"https://doi.org/10.1145/1141753.1141817\">10.1145/1141753.1141817</a>.\n</blockquote>\n\n<p>So
936
- the next step is to read the graph with the clusters, generate the sets of
937
- bibliographic references that correspond to each cluster, then use the method
938
- described in Councill et al. to produce a single bibliographic record for
939
- that cluster. These records could then be used to, say locate the corresponding
940
- article in BHL, or populate Wikidata with missing references.</p>\n\n<p>Obviously
941
- there is always the potential for errors, such as trying to merge references
942
- that are not the same. As a quick and dirty check I flag as dubious any cluster
943
- where the page numbers vary among members of the cluster. More sophisticated
944
- checks are possible, especially if I go down the ML route (i.e., I would have
945
- evidence for the probability that the same reference can disagree on some
946
- aspects of metadata).</p>\n\n<h2>Summary</h2>\n\n<p>At this stage the code
947
- is working well enough for me to play with and explore some example datasets.
948
- The focus is on structured bibliographic metadata, but I may simplify things
949
- and have a version that handles simple string matching, for example to cluster
950
- together different abbreviations of the same journal name.</p>","tags":["data
951
- cleaning","deduplication","Phytotaxa","Wikispecies","Zootaxa"],"language":"en","references":[]},{"id":"https://doi.org/10.59350/c79vq-7rr11","uuid":"3cb94422-5506-4e24-a41c-a250bb521ee0","url":"https://iphylo.blogspot.com/2021/12/graphql-for-wikidata-wikicite.html","title":"GraphQL
952
- for WikiData (WikiCite)","summary":"I''ve released a very crude GraphQL endpoint
953
- for WikiData. More precisely, the endpoint is for a subset of the entities
954
- that are of interest to WikiCite, such as scholarly articles, people, and
955
- journals. There is a crude demo at https://wikicite-graphql.herokuapp.com.
956
- The endpoint itself is at https://wikicite-graphql.herokuapp.com/gql.php.
957
- There are various ways to interact with the endpoint, personally I like the
958
- Altair GraphQL Client by Samuel Imolorhe. As I''ve mentioned earlier it''s
959
- taken...","date_published":"2021-12-20T13:16:00Z","date_modified":"2021-12-20T13:20:05Z","date_indexed":"1909-06-16T10:52:00+00:00","authors":[{"url":null,"name":"Roderic
960
- Page"}],"image":null,"content_html":"<div class=\"separator\" style=\"clear:
961
- both;\"><a href=\"https://blogger.googleusercontent.com/img/a/AVvXsEh7WW8TlfN2u3xe_E-sH5IK6AWYnAoWaKrP2b32UawUeqMpPlq6ZFk5BtJVZsMmCNh5j3QRsTj5H0Ee55RRGntnc9yjj_mNB8KHmH2dzocCgyLS2VFOxsBji6u4Ey6qxlDAnT-zrsBpnDcTchbhgt1x0Sf7RkmIMkS1y4-_3KCQian-SeIF-g=s1000\"
962
- style=\"display: block; padding: 1em 0; text-align: center; clear: right;
963
- float: right;\"><img alt=\"\" border=\"0\" width=\"128\" data-original-height=\"1000\"
964
- data-original-width=\"1000\" src=\"https://blogger.googleusercontent.com/img/a/AVvXsEh7WW8TlfN2u3xe_E-sH5IK6AWYnAoWaKrP2b32UawUeqMpPlq6ZFk5BtJVZsMmCNh5j3QRsTj5H0Ee55RRGntnc9yjj_mNB8KHmH2dzocCgyLS2VFOxsBji6u4Ey6qxlDAnT-zrsBpnDcTchbhgt1x0Sf7RkmIMkS1y4-_3KCQian-SeIF-g=s200\"/></a></div><p>I''ve
965
- released a very crude GraphQL endpoint for WikiData. More precisely, the endpoint
966
- is for a subset of the entities that are of interest to WikiCite, such as
967
- scholarly articles, people, and journals. There is a crude demo at <a href=\"https://wikicite-graphql.herokuapp.com\">https://wikicite-graphql.herokuapp.com</a>.
968
- The endpoint itself is at <a href=\"https://wikicite-graphql.herokuapp.com/gql.php\">https://wikicite-graphql.herokuapp.com/gql.php</a>.
969
- There are various ways to interact with the endpoint, personally I like the
970
- <a href=\"https://altair.sirmuel.design\">Altair GraphQL Client</a> by <a
971
- href=\"https://github.com/imolorhe\">Samuel Imolorhe</a>.</p>\n\n<p>As I''ve
972
- <a href=\"https://iphylo.blogspot.com/2021/04/it-been-while.html\">mentioned
973
- earlier</a> it''s taken me a while to see the point of GraphQL. But it is
974
- clear it is gaining traction in the biodiversity world (see for example the
975
- <a href=\"https://dev.gbif.org/hosted-portals.html\">GBIF Hosted Portals</a>)
976
- so it''s worth exploring. My take on GraphQL is that it is a way to create
977
- a self-describing API that someone developing a web site can use without them
978
- having to bury themselves in the gory details of how data is internally modelled.
979
- For example, WikiData''s query interface uses SPARQL, a powerful language
980
- that has a steep learning curve (in part because of the administrative overhead
981
- brought by RDF namespaces, etc.). In my previous SPARQL-based projects such
982
- as <a href=\"https://ozymandias-demo.herokuapp.com\">Ozymandias</a> and <a
983
- href=\"http://alec-demo.herokuapp.com\">ALEC</a> I have either returned SPARQL
984
- results directly (Ozymandias) or formatted SPARQL results as <a href=\"https://schema.org/DataFeed\">schema.org
985
- DataFeeds</a> (equivalent to RSS feeds) (ALEC). Both approaches work, but
986
- they are project-specific and if anyone else tried to build based on these
987
- projects they might struggle for figure out what was going on. I certainly
988
- struggle, and I wrote them!</p>\n\n<p>So it seems worthwhile to explore this
989
- approach a little further and see if I can develop a GraphQL interface that
990
- can be used to build the sort of rich apps that I want to see. The demo I''ve
991
- created uses SPARQL under the hood to provide responses to the GraphQL queries.
992
- So in this sense it''s not replacing SPARQL, it''s simply providing a (hopefully)
993
- simpler overlay on top of SPARQL so that we can retrieve the data we want
994
- without having to learn the intricacies of SPARQL, nor how Wikidata models
995
- publications and people.</p>","tags":["GraphQL","SPARQL","WikiCite","Wikidata"],"language":"en","references":[]}]}'
996
- recorded_at: Sun, 18 Jun 2023 06:01:20 GMT
997
- recorded_with: VCR 6.1.0