biblicit 1.0 → 2.0.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (406) hide show
  1. data/.gitmodules +3 -0
  2. data/Gemfile +1 -1
  3. data/README.md +125 -30
  4. data/Rakefile +22 -0
  5. data/biblicit.gemspec +9 -7
  6. data/lib/biblicit/cb2bib.rb +10 -11
  7. data/lib/biblicit/citeseer.rb +14 -26
  8. data/lib/biblicit/extractor.rb +40 -19
  9. data/lib/biblicit/parscit.rb +38 -0
  10. data/parscit/.gitignore +8 -0
  11. data/parscit/CHANGELOG +125 -0
  12. data/parscit/COPYING +674 -0
  13. data/parscit/COPYING.LESSER +165 -0
  14. data/parscit/INSTALL +105 -0
  15. data/parscit/README +97 -0
  16. data/{perl/ParsCit/README.TXT → parscit/USAGE} +25 -15
  17. data/parscit/bin/archtest.pl +31 -0
  18. data/parscit/bin/citeExtract.pl +562 -0
  19. data/parscit/bin/conlleval.pl +315 -0
  20. data/parscit/bin/headExtract.pl +40 -0
  21. data/parscit/bin/parsHed/convert2TokenLevel.pl +138 -0
  22. data/parscit/bin/parsHed/keywordGen.pl +308 -0
  23. data/parscit/bin/parsHed/parseXmlHeader.pl +141 -0
  24. data/parscit/bin/parsHed/redo.parsHed.pl +198 -0
  25. data/parscit/bin/parsHed/tr2crfpp_parsHed.pl +521 -0
  26. data/parscit/bin/parseRefStrings.pl +102 -0
  27. data/parscit/bin/phOutput2xml.pl +223 -0
  28. data/parscit/bin/redo.parsCit.pl +105 -0
  29. data/parscit/bin/sectExtract.pl +149 -0
  30. data/parscit/bin/sectLabel/README +110 -0
  31. data/parscit/bin/sectLabel/README.txt +110 -0
  32. data/parscit/bin/sectLabel/genericSect/crossValidation.rb +98 -0
  33. data/parscit/bin/sectLabel/genericSect/extractFeature.rb +104 -0
  34. data/parscit/bin/sectLabel/genericSectExtract.rb +53 -0
  35. data/parscit/bin/sectLabel/getStructureInfo.pl +156 -0
  36. data/parscit/bin/sectLabel/processOmniXML.pl +1427 -0
  37. data/parscit/bin/sectLabel/processOmniXML_new.pl +1025 -0
  38. data/parscit/bin/sectLabel/processOmniXMLv2.pl +1529 -0
  39. data/parscit/bin/sectLabel/processOmniXMLv3.pl +964 -0
  40. data/parscit/bin/sectLabel/redo.sectLabel.pl +219 -0
  41. data/parscit/bin/sectLabel/simplifyOmniXML.pl +382 -0
  42. data/parscit/bin/sectLabel/single2multi.pl +190 -0
  43. data/parscit/bin/sectLabel/tr2crfpp.pl +158 -0
  44. data/parscit/bin/tr2crfpp.pl +260 -0
  45. data/parscit/bin/xml2train.pl +193 -0
  46. data/parscit/lib/CSXUtil/SafeText.pm +130 -0
  47. data/parscit/lib/Omni/Config.pm +93 -0
  48. data/parscit/lib/Omni/Omnicell.pm +263 -0
  49. data/parscit/lib/Omni/Omnicol.pm +292 -0
  50. data/parscit/lib/Omni/Omnidd.pm +328 -0
  51. data/parscit/lib/Omni/Omnidoc.pm +153 -0
  52. data/parscit/lib/Omni/Omniframe.pm +223 -0
  53. data/parscit/lib/Omni/Omniline.pm +423 -0
  54. data/parscit/lib/Omni/Omnipage.pm +282 -0
  55. data/parscit/lib/Omni/Omnipara.pm +232 -0
  56. data/parscit/lib/Omni/Omnirun.pm +303 -0
  57. data/parscit/lib/Omni/Omnitable.pm +336 -0
  58. data/parscit/lib/Omni/Omniword.pm +162 -0
  59. data/parscit/lib/Omni/Traversal.pm +313 -0
  60. data/parscit/lib/ParsCit/.PostProcess.pm.swp +0 -0
  61. data/parscit/lib/ParsCit/Citation.pm +737 -0
  62. data/parscit/lib/ParsCit/CitationContext.pm +220 -0
  63. data/parscit/lib/ParsCit/Config.pm +35 -0
  64. data/parscit/lib/ParsCit/Controller.pm +653 -0
  65. data/parscit/lib/ParsCit/PostProcess.pm +505 -0
  66. data/parscit/lib/ParsCit/PreProcess.pm +1041 -0
  67. data/parscit/lib/ParsCit/Tr2crfpp.pm +1195 -0
  68. data/parscit/lib/ParsHed/Config.pm +49 -0
  69. data/parscit/lib/ParsHed/Controller.pm +143 -0
  70. data/parscit/lib/ParsHed/PostProcess.pm +322 -0
  71. data/parscit/lib/ParsHed/Tr2crfpp.pm +448 -0
  72. data/{perl/ParsCit/lib/ParsCit/Tr2crfpp.pm → parscit/lib/ParsHed/Tr2crfpp_token.pm} +22 -21
  73. data/parscit/lib/SectLabel/AAMatching.pm +1949 -0
  74. data/parscit/lib/SectLabel/Config.pm +88 -0
  75. data/parscit/lib/SectLabel/Controller.pm +332 -0
  76. data/parscit/lib/SectLabel/PostProcess.pm +425 -0
  77. data/parscit/lib/SectLabel/PreProcess.pm +116 -0
  78. data/parscit/lib/SectLabel/Tr2crfpp.pm +1246 -0
  79. data/parscit/resources/parsCit.model +0 -0
  80. data/parscit/resources/parsCit.split.model +0 -0
  81. data/{perl/ParsCit → parscit}/resources/parsCitDict.txt +205 -0
  82. data/parscit/resources/parsHed/bigram +10 -0
  83. data/parscit/resources/parsHed/keywords +10 -0
  84. data/parscit/resources/parsHed/parsHed.model +0 -0
  85. data/parscit/resources/parsHed/parsHed.template +178 -0
  86. data/parscit/resources/sectLabel/affiliation.model +0 -0
  87. data/parscit/resources/sectLabel/author.model +0 -0
  88. data/parscit/resources/sectLabel/funcWord +320 -0
  89. data/parscit/resources/sectLabel/genericSect.model +0 -0
  90. data/parscit/resources/sectLabel/sectLabel.config +42 -0
  91. data/parscit/resources/sectLabel/sectLabel.configXml +42 -0
  92. data/parscit/resources/sectLabel/sectLabel.model +0 -0
  93. data/sh/convert_to_text.sh +20 -0
  94. data/spec/biblicit/extractor_spec.rb +121 -0
  95. data/spec/fixtures/Review_of_Michael_Tyes_Consciousness_Revisited.docx +0 -0
  96. data/spec/fixtures/critical-infrastructures.ps +63951 -0
  97. data/spec/fixtures/txt/E06-1050.txt +867 -0
  98. data/spec/fixtures/txt/sample1.txt +902 -0
  99. data/spec/fixtures/txt/sample2.txt +394 -0
  100. data/spec/spec_helper.rb +3 -0
  101. data/{perl → svm-header-parse}/HeaderParseService/lib/HeaderParse/API/Function.pm +2 -20
  102. data/{perl → svm-header-parse}/HeaderParseService/lib/HeaderParse/API/MultiClassChunking.pm +0 -7
  103. data/{perl → svm-header-parse}/HeaderParseService/lib/HeaderParse/API/Parser.pm +0 -2
  104. data/{perl → svm-header-parse}/HeaderParseService/lib/HeaderParse/API/ParserMethods.pm +0 -7
  105. data/{perl → svm-header-parse}/HeaderParseService/lib/HeaderParse/Config/API_Config.pm +6 -1
  106. data/svm-header-parse/HeaderParseService/tmp/.gitignore +4 -0
  107. data/svm-header-parse/extract.pl +75 -0
  108. metadata +351 -317
  109. data/perl/DocFilter/lib/DocFilter/Config.pm +0 -35
  110. data/perl/DocFilter/lib/DocFilter/Filter.pm +0 -51
  111. data/perl/FileConversionService/README.TXT +0 -11
  112. data/perl/FileConversionService/converters/PDFBox/pdfbox-app-1.7.1.jar +0 -0
  113. data/perl/FileConversionService/lib/CSXUtil/SafeText.pm +0 -140
  114. data/perl/FileConversionService/lib/FileConverter/CheckSum.pm +0 -77
  115. data/perl/FileConversionService/lib/FileConverter/Compression.pm +0 -137
  116. data/perl/FileConversionService/lib/FileConverter/Config.pm +0 -57
  117. data/perl/FileConversionService/lib/FileConverter/Controller.pm +0 -191
  118. data/perl/FileConversionService/lib/FileConverter/JODConverter.pm +0 -61
  119. data/perl/FileConversionService/lib/FileConverter/PDFBox.pm +0 -69
  120. data/perl/FileConversionService/lib/FileConverter/PSConverter.pm +0 -69
  121. data/perl/FileConversionService/lib/FileConverter/PSToText.pm +0 -88
  122. data/perl/FileConversionService/lib/FileConverter/Prescript.pm +0 -68
  123. data/perl/FileConversionService/lib/FileConverter/TET.pm +0 -75
  124. data/perl/FileConversionService/lib/FileConverter/Utils.pm +0 -130
  125. data/perl/HeaderParseService/lib/CSXUtil/SafeText.pm +0 -140
  126. data/perl/HeaderParseService/resources/data/EbizHeaders.txt +0 -24330
  127. data/perl/HeaderParseService/resources/data/EbizHeaders.txt.parsed +0 -27506
  128. data/perl/HeaderParseService/resources/data/EbizHeaders.txt.parsed.old +0 -26495
  129. data/perl/HeaderParseService/resources/data/tagged_headers.txt +0 -40668
  130. data/perl/HeaderParseService/resources/data/test_header.txt +0 -31
  131. data/perl/HeaderParseService/resources/data/test_header.txt.parsed +0 -31
  132. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test1 +0 -23
  133. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test10 +0 -23
  134. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test11 +0 -23
  135. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test12 +0 -23
  136. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test13 +0 -23
  137. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test14 +0 -23
  138. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test15 +0 -23
  139. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test2 +0 -23
  140. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test3 +0 -23
  141. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test4 +0 -23
  142. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test5 +0 -23
  143. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test6 +0 -23
  144. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test7 +0 -23
  145. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test8 +0 -23
  146. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test9 +0 -23
  147. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test1 +0 -23
  148. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test10 +0 -23
  149. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test11 +0 -23
  150. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test12 +0 -23
  151. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test13 +0 -23
  152. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test14 +0 -23
  153. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test15 +0 -23
  154. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test2 +0 -23
  155. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test3 +0 -23
  156. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test4 +0 -23
  157. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test5 +0 -23
  158. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test6 +0 -23
  159. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test7 +0 -23
  160. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test8 +0 -23
  161. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test9 +0 -23
  162. data/perl/ParsCit/crfpp/traindata/parsCit.template +0 -60
  163. data/perl/ParsCit/crfpp/traindata/parsCit.train.data +0 -12104
  164. data/perl/ParsCit/crfpp/traindata/tagged_references.txt +0 -500
  165. data/perl/ParsCit/lib/CSXUtil/SafeText.pm +0 -140
  166. data/perl/ParsCit/lib/ParsCit/Citation.pm +0 -462
  167. data/perl/ParsCit/lib/ParsCit/CitationContext.pm +0 -132
  168. data/perl/ParsCit/lib/ParsCit/Config.pm +0 -46
  169. data/perl/ParsCit/lib/ParsCit/Controller.pm +0 -306
  170. data/perl/ParsCit/lib/ParsCit/PostProcess.pm +0 -367
  171. data/perl/ParsCit/lib/ParsCit/PreProcess.pm +0 -333
  172. data/perl/ParsCit/resources/parsCit.model +0 -0
  173. data/perl/extract.pl +0 -199
  174. data/spec/biblicit/cb2bib_spec.rb +0 -48
  175. data/spec/biblicit/citeseer_spec.rb +0 -40
  176. /data/{perl → svm-header-parse}/HeaderParseService/README.TXT +0 -0
  177. /data/{perl/DocFilter → svm-header-parse/HeaderParseService}/lib/CSXUtil/SafeText.pm +0 -0
  178. /data/{perl → svm-header-parse}/HeaderParseService/lib/HeaderParse/API/AssembleXMLMetadata.pm +0 -0
  179. /data/{perl → svm-header-parse}/HeaderParseService/lib/HeaderParse/API/LoadInformation.pm +0 -0
  180. /data/{perl → svm-header-parse}/HeaderParseService/lib/HeaderParse/API/NamePatternMatch.pm +0 -0
  181. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/50states +0 -0
  182. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/AddrTopWords.txt +0 -0
  183. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/AffiTopWords.txt +0 -0
  184. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/AffiTopWordsAll.txt +0 -0
  185. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/ChineseSurNames.txt +0 -0
  186. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/Csurnames.bin +0 -0
  187. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/Csurnames_spec.bin +0 -0
  188. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/DomainSuffixes.txt +0 -0
  189. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/LabeledHeader +0 -0
  190. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/README +0 -0
  191. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/TrainMulClassLines +0 -0
  192. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/TrainMulClassLines1 +0 -0
  193. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/abstract.txt +0 -0
  194. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/abstractTopWords +0 -0
  195. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/addr.txt +0 -0
  196. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/affi.txt +0 -0
  197. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/affis.bin +0 -0
  198. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/all_namewords_spec.bin +0 -0
  199. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/allnamewords.bin +0 -0
  200. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/cities_US.txt +0 -0
  201. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/cities_world.txt +0 -0
  202. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/city.txt +0 -0
  203. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/cityname.txt +0 -0
  204. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/country_abbr.txt +0 -0
  205. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/countryname.txt +0 -0
  206. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/dateTopWords +0 -0
  207. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/degree.txt +0 -0
  208. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/email.txt +0 -0
  209. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/excludeWords.txt +0 -0
  210. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/female-names +0 -0
  211. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/firstNames.txt +0 -0
  212. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/firstnames.bin +0 -0
  213. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/firstnames_spec.bin +0 -0
  214. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/intro.txt +0 -0
  215. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/keyword.txt +0 -0
  216. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/keywordTopWords +0 -0
  217. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/male-names +0 -0
  218. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/middleNames.txt +0 -0
  219. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/month.txt +0 -0
  220. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/mul +0 -0
  221. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/mul.label +0 -0
  222. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/mul.label.old +0 -0
  223. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/mul.processed +0 -0
  224. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/mulAuthor +0 -0
  225. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/mulClassStat +0 -0
  226. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/nickname.txt +0 -0
  227. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/nicknames.bin +0 -0
  228. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/note.txt +0 -0
  229. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/page.txt +0 -0
  230. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/phone.txt +0 -0
  231. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/postcode.txt +0 -0
  232. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/pubnum.txt +0 -0
  233. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/statename.bin +0 -0
  234. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/statename.txt +0 -0
  235. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/states_and_abbreviations.txt +0 -0
  236. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/stopwords +0 -0
  237. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/stopwords.bin +0 -0
  238. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/surNames.txt +0 -0
  239. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/surnames.bin +0 -0
  240. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/surnames_spec.bin +0 -0
  241. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/A.html +0 -0
  242. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/B.html +0 -0
  243. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/C.html +0 -0
  244. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/D.html +0 -0
  245. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/E.html +0 -0
  246. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/F.html +0 -0
  247. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/G.html +0 -0
  248. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/H.html +0 -0
  249. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/I.html +0 -0
  250. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/J.html +0 -0
  251. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/K.html +0 -0
  252. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/L.html +0 -0
  253. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/M.html +0 -0
  254. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/N.html +0 -0
  255. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/O.html +0 -0
  256. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/P.html +0 -0
  257. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/Q.html +0 -0
  258. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/R.html +0 -0
  259. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/S.html +0 -0
  260. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/T.html +0 -0
  261. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/U.html +0 -0
  262. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/V.html +0 -0
  263. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/W.html +0 -0
  264. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/WCSelect.gif +0 -0
  265. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/X.html +0 -0
  266. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/Y.html +0 -0
  267. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/Z.html +0 -0
  268. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ae.html +0 -0
  269. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/am.html +0 -0
  270. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ar.html +0 -0
  271. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/at.html +0 -0
  272. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/au.html +0 -0
  273. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/bd.html +0 -0
  274. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/be.html +0 -0
  275. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/bg.html +0 -0
  276. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/bh.html +0 -0
  277. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/blueribbon.gif +0 -0
  278. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/bm.html +0 -0
  279. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/bn.html +0 -0
  280. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/br.html +0 -0
  281. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ca.html +0 -0
  282. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ch.html +0 -0
  283. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/cl.html +0 -0
  284. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/cn.html +0 -0
  285. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/co.html +0 -0
  286. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/cr.html +0 -0
  287. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/cy.html +0 -0
  288. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/cz.html +0 -0
  289. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/de.html +0 -0
  290. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/dean-mainlink.jpg +0 -0
  291. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/dk.html +0 -0
  292. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ec.html +0 -0
  293. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ee.html +0 -0
  294. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/eg.html +0 -0
  295. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/es.html +0 -0
  296. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/et.html +0 -0
  297. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/faq.html +0 -0
  298. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/fi.html +0 -0
  299. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/fj.html +0 -0
  300. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/fo.html +0 -0
  301. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/fr.html +0 -0
  302. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/geog.html +0 -0
  303. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/gr.html +0 -0
  304. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/gu.html +0 -0
  305. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/hk.html +0 -0
  306. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/hr.html +0 -0
  307. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/hu.html +0 -0
  308. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/id.html +0 -0
  309. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ie.html +0 -0
  310. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/il.html +0 -0
  311. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/in.html +0 -0
  312. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/is.html +0 -0
  313. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/it.html +0 -0
  314. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/jm.html +0 -0
  315. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/jo.html +0 -0
  316. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/jp.html +0 -0
  317. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/kaplan.gif +0 -0
  318. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/kr.html +0 -0
  319. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/kw.html +0 -0
  320. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/lb.html +0 -0
  321. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/linkbw2.gif +0 -0
  322. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/lk.html +0 -0
  323. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/lt.html +0 -0
  324. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/lu.html +0 -0
  325. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/lv.html +0 -0
  326. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ma.html +0 -0
  327. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/maczynski.gif +0 -0
  328. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/mirror.tar +0 -0
  329. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/mk.html +0 -0
  330. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/mo.html +0 -0
  331. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/mseawdm.gif +0 -0
  332. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/mt.html +0 -0
  333. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/mx.html +0 -0
  334. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/my.html +0 -0
  335. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ni.html +0 -0
  336. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/nl.html +0 -0
  337. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/no.html +0 -0
  338. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/nz.html +0 -0
  339. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/pa.html +0 -0
  340. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/pe.html +0 -0
  341. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ph.html +0 -0
  342. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/pl.html +0 -0
  343. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/pointcom.gif +0 -0
  344. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/pr.html +0 -0
  345. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ps.html +0 -0
  346. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/pt.html +0 -0
  347. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/recognition.html +0 -0
  348. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/results.html +0 -0
  349. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ro.html +0 -0
  350. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ru.html +0 -0
  351. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/sd.html +0 -0
  352. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/se.html +0 -0
  353. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/sg.html +0 -0
  354. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/si.html +0 -0
  355. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/sk.html +0 -0
  356. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/th.html +0 -0
  357. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/tr.html +0 -0
  358. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/tw.html +0 -0
  359. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ua.html +0 -0
  360. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/uk.html +0 -0
  361. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/univ-full.html +0 -0
  362. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/univ.html +0 -0
  363. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/uy.html +0 -0
  364. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ve.html +0 -0
  365. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/yu.html +0 -0
  366. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/za.html +0 -0
  367. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/zm.html +0 -0
  368. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list.txt +0 -0
  369. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/url.txt +0 -0
  370. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/webTopWords +0 -0
  371. /data/{perl → svm-header-parse}/HeaderParseService/resources/database/words +0 -0
  372. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/10ContextModelfold1 +0 -0
  373. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/10Modelfold1 +0 -0
  374. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/11ContextModelfold1 +0 -0
  375. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/11Modelfold1 +0 -0
  376. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/12ContextModelfold1 +0 -0
  377. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/12Modelfold1 +0 -0
  378. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/13ContextModelfold1 +0 -0
  379. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/13Modelfold1 +0 -0
  380. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/14ContextModelfold1 +0 -0
  381. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/14Modelfold1 +0 -0
  382. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/15ContextModelfold1 +0 -0
  383. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/15Modelfold1 +0 -0
  384. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/1ContextModelfold1 +0 -0
  385. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/1Modelfold1 +0 -0
  386. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/2ContextModelfold1 +0 -0
  387. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/2Modelfold1 +0 -0
  388. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/3ContextModelfold1 +0 -0
  389. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/3Modelfold1 +0 -0
  390. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/4ContextModelfold1 +0 -0
  391. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/4Modelfold1 +0 -0
  392. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/5ContextModelfold1 +0 -0
  393. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/5Modelfold1 +0 -0
  394. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/6ContextModelfold1 +0 -0
  395. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/6Modelfold1 +0 -0
  396. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/7ContextModelfold1 +0 -0
  397. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/7Modelfold1 +0 -0
  398. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/8ContextModelfold1 +0 -0
  399. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/8Modelfold1 +0 -0
  400. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/9ContextModelfold1 +0 -0
  401. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/9Modelfold1 +0 -0
  402. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/NameSpaceModel +0 -0
  403. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/NameSpaceTrainF +0 -0
  404. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/WrapperBaseFeaDict +0 -0
  405. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/WrapperContextFeaDict +0 -0
  406. /data/{perl → svm-header-parse}/HeaderParseService/resources/models/WrapperSpaceAuthorFeaDict +0 -0
data/.gitmodules ADDED
@@ -0,0 +1,3 @@
1
+ [submodule "parscit"]
2
+ path = parscit
3
+ url = git@github.com:academia-edu/ParsCit.git
data/Gemfile CHANGED
@@ -1,6 +1,6 @@
1
1
  # encoding: UTF-8
2
2
 
3
- source :rubygems
3
+ source 'https://rubygems.org'
4
4
 
5
5
  # Specify your gem's dependencies in the gemspec
6
6
  gemspec
data/README.md CHANGED
@@ -3,33 +3,130 @@ biblicit
3
3
 
4
4
  Extract citations from PDFs.
5
5
 
6
- ## Usage
6
+ Note: The version is 2.x, but really should be 0.2.x.
7
+
8
+
9
+ # Usage
7
10
 
8
11
  ```ruby
9
- # Extract metadata from a file using the code from CiteSeerX
10
- Biblicit.extract(file: "myfile.pdf", tool: :citeseer)
12
+ # Extract metadata from a file using default tools and settings
13
+ result = Biblicit::Extractor.extract(content: "a string containing the content of a PDF file")
14
+
15
+ # Extract metadata from a file using all available tools
16
+ result = Biblicit::Extractor.extract(file: "myfile.pdf", tools: [:citeseer, :parshed, :cb2bib], remote: true, token: false)
11
17
 
12
- # Extract metadata from the contents of a PDF using cb2bib
13
- Biblicit.extract(contents: IO.read("myfile.pdf"), tool: :cb2bib, remote: true)
18
+ # See reference information for "myfile.pdf"
19
+ result[:citeseer][:title]
20
+ result[:parshed][:title]
21
+ result[:citeseer][:authors]
22
+ # etc
14
23
  ```
15
24
 
16
- ## Algorithms
25
+
26
+ # Algorithms
17
27
 
18
28
  ### CiteSeer (default)
19
29
 
20
30
  Wrapper around Perl code extracted from [CiteSeerX](http://citeseer.ist.psu.edu/).
21
31
 
22
- Uses [Apache PDFBox](http://pdfbox.apache.org/) to extract text from the PDF, uses a model trained with the [svm-light](http://svmlight.joachims.org/) Support Vector Machine library to extract citation data for the PDF itself, and then uses [ParsCit](http://aye.comp.nus.edu.sg/parsCit/)'s model trained with the [CRF++](http://code.google.com/p/crfpp/) Conditional Random Fields library to parse citations from the PDF's bibliography, if any.
32
+ Uses a model trained with the [svm-light](http://svmlight.joachims.org/) Support Vector Machine library.
33
+
34
+ ### ParsCit (default)
35
+
36
+ Wrapper around Perl & Ruby code from [ParsCit](http://aye.comp.nus.edu.sg/parsCit/), which is included as a Git submodule.
37
+
38
+ Uses a model trained with the [CRF++](http://code.google.com/p/crfpp/) Conditional Random Fields library.
23
39
 
24
40
  ### cb2Bib
25
41
 
26
42
  Wrapper around [cb2Bib](http://www.molspaces.com/cb2bib/) in command-line mode.
27
43
 
28
- Uses pdf2text from [Xpdf](http://www.foolabs.com/xpdf/download.html) to extract text from the PDF, uses an apparently less-sophisticated parsing algorithm than the CiteSeerX code to parse metadata, but then, if :remote=true, scrapes one of a large number of journal or public repository websites for a structured version of the citation data.
44
+ Uses an apparently less-sophisticated parsing algorithm than the others to parse metadata, but then, if :remote=true, scrapes one of a large number of journal or public repository websites for a structured version of the citation data. Warning: sometimes it finds the wrong work!
45
+
46
+
47
+ # Requirements
48
+
49
+ There are a lot, but you may not need all of them, depending on your use case.
50
+
51
+
52
+ ## Required to support various input file formats
53
+
54
+ Different tools are used for different input file formats.
55
+
56
+ #### PDF - [Poppler](http://poppler.freedesktop.org/)
57
+
58
+ This provides `pdftotext`. You could install `xpdf` instead.
59
+
60
+ ##### From source
61
+
62
+ Requires fontconfig.
63
+
64
+ wget http://poppler.freedesktop.org/poppler-0.22.1.tar.gz
65
+ tar -xzf poppler-0.22.1.tar.gz
66
+ cd poppler-0.22.1
67
+ ./configure
68
+ make
69
+ sudo make install
70
+
71
+ ##### On Debian/Ubuntu
72
+
73
+ sudo apt-get install poppler-utils
74
+
75
+ ##### On OS X with Homebrew
76
+
77
+ brew install poppler
78
+
79
+ #### Postscript - [Ghostscript](http://www.ghostscript.com/)
80
+
81
+ This provides `ps2ascii`.
82
+
83
+ ##### From source
84
+
85
+ wget http://downloads.ghostscript.com/public/ghostscript-9.06.tar.gz
86
+ tar -xzf ghostscript-9.06.tar.gz
87
+ cd ghostscript-9.06
88
+ make
89
+ sudo make install
90
+
91
+ ##### On Debian/Ubuntu
92
+
93
+ sudo apt-get install ghostscript
94
+
95
+ ##### On OS X with Homebrew
96
+
97
+ brew install ghostscript
98
+
99
+ #### Other (e.g. docx) - [AbiWord](http://www.abisource.com/)
100
+
101
+ This provides `abiword`.
102
+
103
+ ##### On Debian/Ubuntu
104
+
105
+ sudo apt-get install abiword
106
+
107
+ ##### On OS X
108
+
109
+ As of writing, you're out of luck, because AbiWord doesn't compile on recent versions of OS X. According to their website, however, this is being actively worked on.
110
+
111
+
112
+ ## Required to use either the ParsCit or CiteSeer algorithms
113
+
114
+ #### Perl modules
115
+
116
+ More than these might be required; this is what I had to add to my default installation.
29
117
 
30
- ## Requirements
118
+ ##### From CPAN
119
+
120
+ sudo cpan install Digest::SHA1
121
+ sudo cpan install String::Approx
122
+ sudo cpan install XML::Writer::String
123
+ sudo cpan install XML::Twig
124
+
125
+ ## Required to use the ParsCit algorithm
31
126
 
32
- ### CRF++
127
+ #### CRF++
128
+
129
+ You can specify where you have installed CRF++ by setting the CRFPP_HOME environment variable.
33
130
 
34
131
  ##### From source
35
132
 
@@ -44,15 +141,19 @@ Uses pdf2text from [Xpdf](http://www.foolabs.com/xpdf/download.html) to extract
44
141
 
45
142
  sudo apt-add-repository 'deb http://cl.naist.jp/~eric-n/ubuntu-nlp oneiric all'
46
143
  sudo apt-get update
47
- sudo apt-get install libcrf++
144
+ sudo apt-get install libcrf++ crf++
48
145
 
49
146
  ##### On OS X with Homebrew
50
147
 
51
148
  brew install crf++
52
149
 
53
- ### svm-light
150
+ ## Required to use the CiteSeer algorithm
151
+
152
+ #### svm-light
54
153
 
55
- The included model requires version 5, not the current version.
154
+ Required for header extraction (reference information for the input work itself).
155
+
156
+ The included model requires version 5, not the current version. You can specify where you have installed svm-light by setting the SVM_LIGHT_HOME environment variable.
56
157
 
57
158
  ##### From source
58
159
 
@@ -61,22 +162,12 @@ The included model requires version 5, not the current version.
61
162
  wget http://download.joachims.org/svm_light/v5.00/svm_light.tar.gz
62
163
  tar -xzf svm_light.tar.gz
63
164
  make
64
- sudo ln -s $(readlink -f "$(dirname svm_classify)/$(basename svm_classify)") /usr/bin/svm_classify5
65
- sudo ln -s $(readlink -f "$(dirname svm_learn)/$(basename svm_learn)") /usr/bin/svm_learn5
165
+ echo "export SVM_LIGHT_HOME=`pwd`" >> ~/.profile # or .bashrc or whatever
166
+ source ~/.profile
66
167
 
67
- Note: On OS X you'll need to use greadlink instead of readlink if you have coreutils installed, or another workaround for the absence of `-f`.
168
+ ## Required to use the cb2bib algorithm
68
169
 
69
- ### Perl modules
70
-
71
- ##### From CPAN
72
-
73
- sudo cpan install DBI
74
- sudo cpan install Digest::SHA1
75
- sudo cpan install Log::Log4perl
76
- sudo cpan install Log::Dispatch
77
- sudo cpan install String::Approx
78
-
79
- ### cb2bib
170
+ #### cb2Bib
80
171
 
81
172
  ##### From source (Linux)
82
173
 
@@ -105,15 +196,19 @@ Requires Qt & X11, unfortunately, and still requires a hack to work on recent ve
105
196
 
106
197
  sudo apt-get install cb2bib
107
198
 
108
- ### Other
199
+
200
+ ## Other
201
+
202
+ (I'm not currently sure what this was required for; TODO figure it out!)
109
203
 
110
204
  ##### On Debian/Ubuntu
111
205
 
112
206
  sudo apt-get install libicu-dev
113
207
 
114
- ## Copying
115
208
 
116
- Copyright Academia.edu or the original author(s).
209
+ # Copying
210
+
211
+ Copyright Academia.edu or the original author(s) - see documentation in the included parscit and svm-header-parse directories.
117
212
 
118
213
  Apache licensed (see LICENSE.TXT).
119
214
 
data/Rakefile CHANGED
@@ -6,3 +6,25 @@ require 'rspec/core/rake_task'
6
6
  require 'biblicit'
7
7
 
8
8
  RSpec::Core::RakeTask.new :spec
9
+
10
+ desc "Tag #{Bundler::GemHelper.new.send(:version_tag)}, build and push to gemfury"
11
+ task :release_internal do |t|
12
+ require 'gemfury'
13
+
14
+ class ReleaseInternalGem < Bundler::GemHelper
15
+ def release_gem
16
+ guard_clean
17
+ built_gem_path = build_gem
18
+ if Bundler::VERSION =~ /1\.3\.\d/
19
+ tag_version { git_push } unless already_tagged?
20
+ else
21
+ guard_already_tagged
22
+ tag_version { git_push }
23
+ end
24
+ `fury push #{built_gem_path}`
25
+ Bundler.ui.confirm "Pushed #{name} #{version} to gemfury"
26
+ end
27
+ end
28
+
29
+ ReleaseInternalGem.new.release_gem
30
+ end
data/biblicit.gemspec CHANGED
@@ -5,14 +5,15 @@ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
5
5
 
6
6
  Gem::Specification.new do |gem|
7
7
  gem.name = "biblicit"
8
- gem.version = "1.0"
8
+ gem.version = "2.0.3"
9
9
  gem.authors = ["David Judd"]
10
10
  gem.email = ["david@academia.edu"]
11
- gem.description = %q{Extract citations from PDFs.}
12
11
  gem.summary = %q{Extract citations from PDFs.}
13
12
  gem.homepage = "http://github.com/academia-edu/biblicit"
14
13
 
15
- gem.files = `git ls-files`.split("\n")
14
+ gem.files =
15
+ `git ls-files`.split("\n") +
16
+ `cd parscit && git ls-files`.split("\n").map{ |f| "parscit/#{f}" }
16
17
  gem.executables = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
17
18
  gem.test_files = gem.files.grep(%r{^spec/})
18
19
  gem.require_paths = ["lib"]
@@ -25,9 +26,10 @@ Gem::Specification.new do |gem|
25
26
  gem.add_development_dependency 'pry'
26
27
  gem.add_development_dependency 'pry-debugger'
27
28
 
28
- gem.requirements << 'java'
29
- gem.requirements << 'perl'
30
- gem.requirements << 'for the :citeseer algorithm, the CRF++ Conditional Random Fields library (try "which crf_test")'
31
- gem.requirements << 'for the :cb2bib algorithm, the cb2bib citation extraction tool'
29
+ gem.requirements << 'For PDFs, Poppler or XPDF (try "which pdftotext")'
30
+ gem.requirements << 'For Postscript files, Ghostscript (try "which ps2ascii")'
31
+ gem.requirements << 'For word processor files, AbiWord (try "which abiword")'
32
+ gem.requirements << 'For the :citeseer algorithm, Perl, CPAN, CRF++ (try "which crf_test"), and svm-light 5.0, aliased to svm_classify5 (try "svm_classify -h")'
33
+ gem.requirements << 'For the :cb2bib algorithm, cb2Bib (try "which cb2bib")'
32
34
 
33
35
  end
@@ -4,30 +4,28 @@ require 'tempfile'
4
4
 
5
5
  module Cb2Bib
6
6
 
7
- def self.extract(file, opts)
8
- ParseOperation.new(file, opts)
7
+ def self.extract(file, opts={})
8
+ ParseOperation.new(file, opts).result
9
9
  end
10
10
 
11
11
  class ParseOperation
12
12
 
13
+ attr_reader :result
14
+
13
15
  def initialize(file, opts)
14
16
  extract_from_file(file, opts[:remote] || false, opts[:sloppy] || true)
15
17
  end
16
18
 
17
- def header
18
- @result
19
- end
20
-
21
19
  private
22
20
 
23
- def extract_from_file(pdf, remote=false, sloppy=true)
21
+ def extract_from_file(doc, remote=false, sloppy=true)
24
22
  bib = Tempfile.new(['out','.bib'])
25
23
  conf = Tempfile.new(['cb2bib','.conf']) # we'll put our custom configuration here, and then cb2bib will fill in the rest with its defaults
26
24
 
27
25
  begin
28
26
  conf.write(cb2bib_config(remote))
29
27
  conf.open # not clear why we have to do this, but otherwise cb2bib doesn't read it
30
- `cb2bib #{sloppy ? '--sloppy' : ''} --doc2bib #{pdf.shellescape} #{bib.path} --conf #{conf.path}`
28
+ `cb2bib #{sloppy ? '--sloppy' : ''} --txt2bib #{doc.path} #{bib.path} --conf #{conf.path}`
31
29
  bibtext = bib.read
32
30
  ensure
33
31
  conf.close!
@@ -45,12 +43,13 @@ module Cb2Bib
45
43
  end
46
44
  end
47
45
  end
48
-
49
- @result[:valid] = !@result[:title].blank?
50
46
  end
51
47
 
52
48
  def cb2bib_config(remote)
53
- "[cb2Bib]\nAutomaticQuery=#{!!remote}"
49
+ """
50
+ [cb2Bib]
51
+ AutomaticQuery=#{!!remote}
52
+ """
54
53
  end
55
54
 
56
55
  def cleaned_field(field)
@@ -6,48 +6,36 @@ require 'nokogiri'
6
6
 
7
7
  module CiteSeer
8
8
 
9
- PERL_DIR = "#{File.dirname(__FILE__)}/../../perl"
9
+ PERL_DIR = "#{File.dirname(__FILE__)}/../../svm-header-parse"
10
10
 
11
- def self.extract(in_file, opts)
12
- ParseOperation.new(in_file)
11
+ def self.extract(in_file, opts={})
12
+ ParseOperation.new(in_file).result
13
13
  end
14
14
 
15
15
  class ParseOperation
16
16
 
17
+ attr_reader :result
18
+
17
19
  def initialize(in_file)
18
20
  Dir.mktmpdir do |out_dir|
19
- `#{PERL_DIR}/extract.pl #{in_file.shellescape} #{out_dir}`
20
- @header_xml = IO.read("#{out_dir}/out.header")
21
- @citations_xml = IO.read("#{out_dir}/out.parscit")
21
+ `#{PERL_DIR}/extract.pl #{in_file.path} #{out_dir}`
22
+ output = IO.read("#{out_dir}/out.header")
23
+ xml = Nokogiri::XML output
24
+ @result = parse(xml)
22
25
  end
23
26
  end
24
27
 
25
- def header
26
- @header ||= get_header
27
- end
28
-
29
- def citations
30
- @citations ||= get_citations
31
- end
32
-
33
28
  private
34
29
 
35
- def get_header
36
- parsed = Nokogiri::XML @header_xml
37
-
30
+ def parse(xml)
38
31
  {
39
- title: parsed.css('title').text,
40
- authors: parsed.css('author > name').map { |n| n.text },
41
- abstract: parsed.css('abstract').text,
42
- valid: parsed.css('validHeader').first.text == '1',
32
+ title: xml.css('title').text,
33
+ authors: xml.css('author > name').map { |n| n.text },
34
+ abstract: xml.css('abstract').text,
35
+ valid: xml.css('validHeader').first.text == '1',
43
36
  }
44
37
  end
45
38
 
46
- def get_citations
47
- # TODO
48
- []
49
- end
50
-
51
39
  end
52
40
 
53
41
  end
@@ -2,36 +2,57 @@
2
2
 
3
3
  require 'biblicit/cb2bib'
4
4
  require 'biblicit/citeseer'
5
+ require 'biblicit/parscit'
5
6
 
6
7
  require 'tempfile'
7
8
 
8
9
  module Biblicit
9
10
 
10
- def self.extract(opts)
11
- if (content = opts.delete(:content))
12
- Tempfile.open(['in','.pdf']) do |pdf|
13
- pdf.write(content)
14
- extract_from_file pdf.path, opts
11
+ SH_DIR = "#{File.dirname(__FILE__)}/../../sh"
12
+
13
+ module Extractor
14
+
15
+ def self.extract(opts)
16
+ if (content = opts.delete(:content))
17
+ Tempfile.open('in') do |in_file|
18
+ in_file.binmode
19
+ in_file.write(content)
20
+ extract_from_file in_file.path, opts
21
+ end
22
+ elsif (file = opts.delete(:file))
23
+ extract_from_file file, opts
24
+ else
25
+ raise 'Either file or content is required'
15
26
  end
16
- elsif (file = opts.delete(:file))
17
- extract_from_file file, opts
18
- else
19
- raise 'Either file or content is required'
20
27
  end
21
- end
22
28
 
23
- private
29
+ private
30
+
31
+ def self.extract_from_file(file, opts)
32
+ file = File.realpath(file)
33
+ tools = opts.delete(:tools) || [:parshed, :citeseer]
34
+
35
+ result = {}
36
+
37
+ Tempfile.open(['in','.txt']) do |in_txt|
38
+ `#{SH_DIR}/convert_to_text.sh #{file.shellescape} #{in_txt.path}`
24
39
 
25
- def self.extract_from_file(file, opts)
26
- tool = opts.delete(:tool) || :citeseer
40
+ if tools.include?(:parshed)
41
+ result.merge!( parshed: ParsCit.extract(in_txt, opts) )
42
+ end
27
43
 
28
- if tool == :citeseer
29
- CiteSeer.extract(file, opts)
30
- elsif tool == :cb2bib
31
- Cb2Bib.extract(file, opts)
32
- else
33
- raise "Unknown tool #{tool}"
44
+ if tools.include?(:citeseer)
45
+ result.merge!( citeseer: CiteSeer.extract(in_txt, opts) )
46
+ end
47
+
48
+ if tools.include?(:cb2bib)
49
+ result.merge!( cb2bib: Cb2Bib.extract(in_txt, opts) )
50
+ end
51
+ end
52
+
53
+ result
34
54
  end
55
+
35
56
  end
36
57
 
37
58
  end
@@ -0,0 +1,38 @@
1
+ # encoding: UTF-8
2
+
3
+ require 'tmpdir'
4
+ require 'shellwords'
5
+ require 'nokogiri'
6
+
7
+ module ParsCit
8
+
9
+ PERL_DIR = "#{File.dirname(__FILE__)}/../../parscit"
10
+
11
+ def self.extract(in_file, opts={})
12
+ ParseOperation.new(in_file, opts).result
13
+ end
14
+
15
+ class ParseOperation
16
+
17
+ attr_reader :result
18
+
19
+ def initialize(in_txt, opts={})
20
+ ENV['CRFPP_HOME'] ||= "#{File.dirname(`which crf_test`)}/../"
21
+ output = `#{PERL_DIR}/bin/citeExtract.pl -q -m extract_all #{in_txt.path}`
22
+ @result = parse(Nokogiri::XML output)
23
+ end
24
+
25
+ private
26
+
27
+ def parse(xml)
28
+ parsed = xml.css("algorithm[name=ParsHed]")
29
+ {
30
+ title: parsed.css('title').text.gsub(/\s+/,' ').strip,
31
+ authors: parsed.css('author').map { |a| a.text.gsub(/\s+/,' ').strip },
32
+ abstract: parsed.css('abstract').text
33
+ }
34
+ end
35
+
36
+ end
37
+
38
+ end
@@ -0,0 +1,8 @@
1
+ doc/*.zip
2
+ doc/*.tgz
3
+ lib/cgiLog.txt
4
+ *~
5
+ thang/*
6
+ tmp/*
7
+ archive/*
8
+ doc/.htaccess
data/parscit/CHANGELOG ADDED
@@ -0,0 +1,125 @@
1
+ 110505 (done by Huy)
2
+ - New features: Sectlabel
3
+ - New sectlabel xml model with more training data (resources/sectLabel/sectLabel.modelXml.v2)
4
+
5
+ - Major changes: Sectlabel
6
+ - Addded Omni (lib/Omni/Omnitable.pm lib/Omni/Omnicell.pm lib/Omni/Omnidd.pm lib/Omni/Omniframe.pm lib/Omni/Traversal.pm)
7
+ - ParsCit now uses the reference output from Sectlabel as its input
8
+
9
+ - Minor changes: Sectlabel and ParsCit
10
+ - Cleanup all temporary files in /tmp properly
11
+
12
+ - Minor changes: bug fixes
13
+ - Biblioscript: patch from Tim Brody
14
+
15
+ 110121 (done by Huy)
16
+ - Major changes: Omni classes for Omnipage XML process using XML::Twig and XML::Writer instead of regular expression
17
+ - Added Omni (lib/Omni/Config.pm lib/Omni/Omnicol.pm lib/Omni/Omnidoc.pm lib/Omni/Omniline.pm lib/Omni/Omnipage.pm lib/Omni/Omnipara.pm lib/Omni/Omnirun.pm lib/Omni/Omniword.pm)
18
+ - Use the new Omni classes in Parscit module
19
+
20
+ - Major changes: improve Parscit performance when there's no reference marker
21
+ - Added new crf++ XML model, template and train data (resources/parsCit.split.model crfpp/traindata/parsCit/parsCit.split.template crfpp/traindata/parsCit/parsCit.split.train.data)
22
+ - Use the new Parscit model when there's no reference marker
23
+
24
+ - Minor changes: bug fixes
25
+ - Parscit post-process: stripPunctuation function removes the semi-colon in XML special characters, e.g. &amp; Thanks to Qasemizadeh, Behrang for reporting these issues.
26
+ - Parscit controller: terminate properly when no reference is found (normBodyText size 1 != posArray size 0)
27
+ - Parscit post-process: fix the volume number truncation, e.g. vol 5(1) becomes vol 5; Thanks to Lennart Borgman for reporting these issues
28
+
29
+ - Minor changes: Parscit
30
+ - Add bin/xml2train.pl: extract reference text and XML information from Omnipage and save it into Parscit train file's format
31
+
32
+ 100901 (done by Thang)
33
+ - Incorporate BiblioScrip (http://github.com/mromanello/BiblioScript) and BibUtils (http://www.scripps.edu/~cdputnam/software/bibutils/)
34
+
35
+ 100401e (done by Min on 100725)
36
+ - Minor changes to paths and to make it work again from wing.nus directory
37
+ (moved from forecite, due to restructuring of WING server)
38
+
39
+ 100401d
40
+ - Minor changes to documentation and ParsHed library updating.
41
+
42
+ 100401c
43
+ - Minor changes for correcting errors with punctuation and XML
44
+ entities in reference string parsing. Reported by Cheong Chi Hong
45
+ and Mario Lipinski. Fixed by Minh-Thang Luong.
46
+
47
+ 100401b
48
+ - Minor changes (bug fixes) to section labeler model.
49
+
50
+ 100401 (done by Thang)
51
+ - Major Change: Added SectLabel module (due to Minh-Thang Luong and Thuy Dung Nguyen)
52
+ - Added Iconip training data from Cheong Chi Hong
53
+ - Updated default model to include Iconip data
54
+ - Updated CGI demo to call new SectLabel module as well
55
+ - Updated documentation
56
+ - Corrected small regexp error in lib/ParsCit/PreProcess.pm
57
+ - Corrected small problem with training data in mixed-humanities
58
+
59
+ - Added SectLabel (bin/sectExtract.pl, bin/sectExtract/, resource/sectExtract, lib/SectExtract)
60
+ - Modified bin/citeExtract.pl, lib/ParsCit/PostProcess.pm, lib/ParsHed/PostProcess.pm to combine ParsCit, ParsHed, SectLabel, and standardize XML output
61
+ - Added test/ for testing purpose with 12 samples documents and standard-output of citeExtract.pl in 5 modes (citations, header, section, meta, and all) using both txt and XML inputs
62
+ - Added SectLabel annotated data doc/sectLabel.tagged.txt and doc/sectLabelXml.tagged.txt (40 documents fully annotated)
63
+ - Added in crfpp/traindata CRF++ feature files (sectLabel.train.dataXml, sectLabel.train.data) and templates (sectLabel.templateXml, sectLabel.template)
64
+
65
+ - Incorporated works by Emma
66
+ - Added GenericSect code (bin/sectLabel/genericSectExtract.rb, crfpp/traindata/genericSect.train.data, bin/sectLabel/genericSect/) into SectLabel
67
+ - Added GenericSect annotated data doc/genericSect.tagged.txt (211 documents with headers annotated)
68
+ - Added in crfpp/traindata CRF++ feature file (genericSect.train.data) and template (genericSect.template)
69
+
70
+ 090625b
71
+ - Updated documentation only. no change to executables
72
+ - Released on 30 September 2009
73
+
74
+ 090625 (due to Minh-Thang Luong)
75
+ - Standardized and improved ParsHed model with line-level classification instead of token-level as previously.
76
+ - Add a post-processing module for ParsHed to normalize field data, e.g. authors, email, etc.
77
+ - Detailed changes are reflected as follows:
78
+ * Added resources/parsHed - all parsHed-related including models (old models in resources/parsHed/archive), template file, and top frequent keyword files.
79
+ * Added lib/ParsHed - similar architect as lib/ParsCit to modularizes and faciliates line-level training in ParsHed.
80
+ lib/ParsHed/Tr2crf.pm: line-level CRF feature extractor
81
+ lib/ParsHed/PostProcess.pm: post-processing of field data
82
+ * Added bin/parsHed - all parsHed-related scripts (redo.parsHed.pl, and tr2crffpp_parsHed.pl).
83
+ Includes parseXmlHeader.pl and convert2TokenLevel.pl used by redo.parsHed.pl to convert output from line to token-level.
84
+ * Updated bin/headExtract.pl - to use the new model lib/ParsHed, as well as old model (with -tokenLevel flag).
85
+ * Bug fixes in Citation.pm and Preprocess.pm (see doc/v090625-Artemy-issues.txt). Thanks to Artemy Kolchinsky for reporting these issues.
86
+ * Reordered the CHANGELOG into descending chronological order.
87
+ 090625 (due to Min-Yen KAN)
88
+ - Deprecated and unified ParsHedClient.rb into ParsCitClient.rb
89
+ - Deprecated and unified ParsHedServer.rb into ParsCitServer.rb
90
+ - Added wsdl/forecite.wsdl which describes the ParsCit portion of the services
91
+ - Added bin/ParsCitClientWSDL.rb which demonstrates the use of forecite.wsdl
92
+
93
+ 090316:
94
+ - Adds ParsHed module, updates to:
95
+ * resources/ - parsHed.*.model model files for binaries
96
+ * doc/ - svm_headerparse.tagged.txt (from CiteSeer; 935 headers)
97
+ * bin/ - headExtract.pl, phOutput2xml.pl, redo.parsCit.pl, redo.parsHed.pl
98
+ * crfpp/traindata/ - parsHed.train.data (converted from svm_headerparse.tagged.txt)
99
+ Changes doc/index.html, doc/parsCit.cgi to handle parsHed call (not
100
+ yet entirely integrated, still separate module).
101
+
102
+ 081201:
103
+ - Bug fixes from Scienstein.org team
104
+ * Added context positions
105
+ * Handle reference patterns such as [1-5]
106
+ * Handles context references within same window
107
+ * See doc/v081201-sciensteinEmail.txt for detailed notes
108
+ - Updated ParsCit.cgi to update for context position output
109
+
110
+ 080917:
111
+ - Added new training data from Matteo Romanello
112
+ - Fixed Preprocess.pm bug (thanks to Dain Kaplan)
113
+ - Upgraded CRF++ model to 0.51 (now bundled with CRF++ source just in case it is no longer available on sourceforge)
114
+ - Added bin/redo.pl script to retrain model
115
+
116
+ 080402:
117
+ - Added re-tagged data from FLUX-CiM
118
+ - Added conlleval.pl evaluation script
119
+ - Added output2xml.pl transformation script
120
+ - Corrected warning in parseRefString.pl (thanks to Ayeh Bandeh-Ahmadi)
121
+
122
+ 080310:
123
+ - First released version to Peter Weiland
124
+ - Web services working for wing machines at NUS
125
+