biblicit 1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (322)
  1. data/.gitignore +3 -0
  2. data/.rspec +1 -0
  3. data/Gemfile +6 -0
  4. data/LICENSE.TXT +176 -0
  5. data/README.md +120 -0
  6. data/Rakefile +8 -0
  7. data/biblicit.gemspec +33 -0
  8. data/lib/biblicit/cb2bib.rb +83 -0
  9. data/lib/biblicit/citeseer.rb +53 -0
  10. data/lib/biblicit/extractor.rb +37 -0
  11. data/lib/biblicit.rb +6 -0
  12. data/perl/DocFilter/lib/CSXUtil/SafeText.pm +140 -0
  13. data/perl/DocFilter/lib/DocFilter/Config.pm +35 -0
  14. data/perl/DocFilter/lib/DocFilter/Filter.pm +51 -0
  15. data/perl/FileConversionService/README.TXT +11 -0
  16. data/perl/FileConversionService/converters/PDFBox/pdfbox-app-1.7.1.jar +0 -0
  17. data/perl/FileConversionService/lib/CSXUtil/SafeText.pm +140 -0
  18. data/perl/FileConversionService/lib/FileConverter/CheckSum.pm +77 -0
  19. data/perl/FileConversionService/lib/FileConverter/Compression.pm +137 -0
  20. data/perl/FileConversionService/lib/FileConverter/Config.pm +57 -0
  21. data/perl/FileConversionService/lib/FileConverter/Controller.pm +191 -0
  22. data/perl/FileConversionService/lib/FileConverter/JODConverter.pm +61 -0
  23. data/perl/FileConversionService/lib/FileConverter/PDFBox.pm +69 -0
  24. data/perl/FileConversionService/lib/FileConverter/PSConverter.pm +69 -0
  25. data/perl/FileConversionService/lib/FileConverter/PSToText.pm +88 -0
  26. data/perl/FileConversionService/lib/FileConverter/Prescript.pm +68 -0
  27. data/perl/FileConversionService/lib/FileConverter/TET.pm +75 -0
  28. data/perl/FileConversionService/lib/FileConverter/Utils.pm +130 -0
  29. data/perl/HeaderParseService/README.TXT +80 -0
  30. data/perl/HeaderParseService/lib/CSXUtil/SafeText.pm +140 -0
  31. data/perl/HeaderParseService/lib/HeaderParse/API/AssembleXMLMetadata.pm +968 -0
  32. data/perl/HeaderParseService/lib/HeaderParse/API/Function.pm +2016 -0
  33. data/perl/HeaderParseService/lib/HeaderParse/API/LoadInformation.pm +444 -0
  34. data/perl/HeaderParseService/lib/HeaderParse/API/MultiClassChunking.pm +409 -0
  35. data/perl/HeaderParseService/lib/HeaderParse/API/NamePatternMatch.pm +537 -0
  36. data/perl/HeaderParseService/lib/HeaderParse/API/Parser.pm +68 -0
  37. data/perl/HeaderParseService/lib/HeaderParse/API/ParserMethods.pm +1880 -0
  38. data/perl/HeaderParseService/lib/HeaderParse/Config/API_Config.pm +46 -0
  39. data/perl/HeaderParseService/resources/data/EbizHeaders.txt +24330 -0
  40. data/perl/HeaderParseService/resources/data/EbizHeaders.txt.parsed +27506 -0
  41. data/perl/HeaderParseService/resources/data/EbizHeaders.txt.parsed.old +26495 -0
  42. data/perl/HeaderParseService/resources/data/tagged_headers.txt +40668 -0
  43. data/perl/HeaderParseService/resources/data/test_header.txt +31 -0
  44. data/perl/HeaderParseService/resources/data/test_header.txt.parsed +31 -0
  45. data/perl/HeaderParseService/resources/database/50states +60 -0
  46. data/perl/HeaderParseService/resources/database/AddrTopWords.txt +17 -0
  47. data/perl/HeaderParseService/resources/database/AffiTopWords.txt +35 -0
  48. data/perl/HeaderParseService/resources/database/AffiTopWordsAll.txt +533 -0
  49. data/perl/HeaderParseService/resources/database/ChineseSurNames.txt +276 -0
  50. data/perl/HeaderParseService/resources/database/Csurnames.bin +0 -0
  51. data/perl/HeaderParseService/resources/database/Csurnames_spec.bin +0 -0
  52. data/perl/HeaderParseService/resources/database/DomainSuffixes.txt +242 -0
  53. data/perl/HeaderParseService/resources/database/LabeledHeader +18 -0
  54. data/perl/HeaderParseService/resources/database/README +2 -0
  55. data/perl/HeaderParseService/resources/database/TrainMulClassLines +254 -0
  56. data/perl/HeaderParseService/resources/database/TrainMulClassLines1 +510 -0
  57. data/perl/HeaderParseService/resources/database/abstract.txt +1 -0
  58. data/perl/HeaderParseService/resources/database/abstractTopWords +9 -0
  59. data/perl/HeaderParseService/resources/database/addr.txt +28 -0
  60. data/perl/HeaderParseService/resources/database/affi.txt +34 -0
  61. data/perl/HeaderParseService/resources/database/affis.bin +0 -0
  62. data/perl/HeaderParseService/resources/database/all_namewords_spec.bin +0 -0
  63. data/perl/HeaderParseService/resources/database/allnamewords.bin +0 -0
  64. data/perl/HeaderParseService/resources/database/cities_US.txt +4512 -0
  65. data/perl/HeaderParseService/resources/database/cities_world.txt +4463 -0
  66. data/perl/HeaderParseService/resources/database/city.txt +3150 -0
  67. data/perl/HeaderParseService/resources/database/cityname.txt +3151 -0
  68. data/perl/HeaderParseService/resources/database/country_abbr.txt +243 -0
  69. data/perl/HeaderParseService/resources/database/countryname.txt +262 -0
  70. data/perl/HeaderParseService/resources/database/dateTopWords +30 -0
  71. data/perl/HeaderParseService/resources/database/degree.txt +67 -0
  72. data/perl/HeaderParseService/resources/database/email.txt +3 -0
  73. data/perl/HeaderParseService/resources/database/excludeWords.txt +40 -0
  74. data/perl/HeaderParseService/resources/database/female-names +4960 -0
  75. data/perl/HeaderParseService/resources/database/firstNames.txt +8448 -0
  76. data/perl/HeaderParseService/resources/database/firstnames.bin +0 -0
  77. data/perl/HeaderParseService/resources/database/firstnames_spec.bin +0 -0
  78. data/perl/HeaderParseService/resources/database/intro.txt +2 -0
  79. data/perl/HeaderParseService/resources/database/keyword.txt +5 -0
  80. data/perl/HeaderParseService/resources/database/keywordTopWords +7 -0
  81. data/perl/HeaderParseService/resources/database/male-names +3906 -0
  82. data/perl/HeaderParseService/resources/database/middleNames.txt +2 -0
  83. data/perl/HeaderParseService/resources/database/month.txt +35 -0
  84. data/perl/HeaderParseService/resources/database/mul +868 -0
  85. data/perl/HeaderParseService/resources/database/mul.label +869 -0
  86. data/perl/HeaderParseService/resources/database/mul.label.old +869 -0
  87. data/perl/HeaderParseService/resources/database/mul.processed +762 -0
  88. data/perl/HeaderParseService/resources/database/mulAuthor +619 -0
  89. data/perl/HeaderParseService/resources/database/mulClassStat +45 -0
  90. data/perl/HeaderParseService/resources/database/nickname.txt +58 -0
  91. data/perl/HeaderParseService/resources/database/nicknames.bin +0 -0
  92. data/perl/HeaderParseService/resources/database/note.txt +121 -0
  93. data/perl/HeaderParseService/resources/database/page.txt +1 -0
  94. data/perl/HeaderParseService/resources/database/phone.txt +9 -0
  95. data/perl/HeaderParseService/resources/database/postcode.txt +54 -0
  96. data/perl/HeaderParseService/resources/database/pubnum.txt +45 -0
  97. data/perl/HeaderParseService/resources/database/statename.bin +0 -0
  98. data/perl/HeaderParseService/resources/database/statename.txt +73 -0
  99. data/perl/HeaderParseService/resources/database/states_and_abbreviations.txt +118 -0
  100. data/perl/HeaderParseService/resources/database/stopwords +438 -0
  101. data/perl/HeaderParseService/resources/database/stopwords.bin +0 -0
  102. data/perl/HeaderParseService/resources/database/surNames.txt +19613 -0
  103. data/perl/HeaderParseService/resources/database/surnames.bin +0 -0
  104. data/perl/HeaderParseService/resources/database/surnames_spec.bin +0 -0
  105. data/perl/HeaderParseService/resources/database/university_list/A.html +167 -0
  106. data/perl/HeaderParseService/resources/database/university_list/B.html +161 -0
  107. data/perl/HeaderParseService/resources/database/university_list/C.html +288 -0
  108. data/perl/HeaderParseService/resources/database/university_list/D.html +115 -0
  109. data/perl/HeaderParseService/resources/database/university_list/E.html +147 -0
  110. data/perl/HeaderParseService/resources/database/university_list/F.html +112 -0
  111. data/perl/HeaderParseService/resources/database/university_list/G.html +115 -0
  112. data/perl/HeaderParseService/resources/database/university_list/H.html +140 -0
  113. data/perl/HeaderParseService/resources/database/university_list/I.html +138 -0
  114. data/perl/HeaderParseService/resources/database/university_list/J.html +82 -0
  115. data/perl/HeaderParseService/resources/database/university_list/K.html +115 -0
  116. data/perl/HeaderParseService/resources/database/university_list/L.html +131 -0
  117. data/perl/HeaderParseService/resources/database/university_list/M.html +201 -0
  118. data/perl/HeaderParseService/resources/database/university_list/N.html +204 -0
  119. data/perl/HeaderParseService/resources/database/university_list/O.html +89 -0
  120. data/perl/HeaderParseService/resources/database/university_list/P.html +125 -0
  121. data/perl/HeaderParseService/resources/database/university_list/Q.html +49 -0
  122. data/perl/HeaderParseService/resources/database/university_list/R.html +126 -0
  123. data/perl/HeaderParseService/resources/database/university_list/S.html +296 -0
  124. data/perl/HeaderParseService/resources/database/university_list/T.html +156 -0
  125. data/perl/HeaderParseService/resources/database/university_list/U.html +800 -0
  126. data/perl/HeaderParseService/resources/database/university_list/V.html +75 -0
  127. data/perl/HeaderParseService/resources/database/university_list/W.html +144 -0
  128. data/perl/HeaderParseService/resources/database/university_list/WCSelect.gif +0 -0
  129. data/perl/HeaderParseService/resources/database/university_list/X.html +44 -0
  130. data/perl/HeaderParseService/resources/database/university_list/Y.html +53 -0
  131. data/perl/HeaderParseService/resources/database/university_list/Z.html +43 -0
  132. data/perl/HeaderParseService/resources/database/university_list/ae.html +31 -0
  133. data/perl/HeaderParseService/resources/database/university_list/am.html +30 -0
  134. data/perl/HeaderParseService/resources/database/university_list/ar.html +35 -0
  135. data/perl/HeaderParseService/resources/database/university_list/at.html +43 -0
  136. data/perl/HeaderParseService/resources/database/university_list/au.html +82 -0
  137. data/perl/HeaderParseService/resources/database/university_list/bd.html +28 -0
  138. data/perl/HeaderParseService/resources/database/university_list/be.html +41 -0
  139. data/perl/HeaderParseService/resources/database/university_list/bg.html +28 -0
  140. data/perl/HeaderParseService/resources/database/university_list/bh.html +28 -0
  141. data/perl/HeaderParseService/resources/database/university_list/blueribbon.gif +0 -0
  142. data/perl/HeaderParseService/resources/database/university_list/bm.html +28 -0
  143. data/perl/HeaderParseService/resources/database/university_list/bn.html +28 -0
  144. data/perl/HeaderParseService/resources/database/university_list/br.html +66 -0
  145. data/perl/HeaderParseService/resources/database/university_list/ca.html +174 -0
  146. data/perl/HeaderParseService/resources/database/university_list/ch.html +52 -0
  147. data/perl/HeaderParseService/resources/database/university_list/cl.html +40 -0
  148. data/perl/HeaderParseService/resources/database/university_list/cn.html +87 -0
  149. data/perl/HeaderParseService/resources/database/university_list/co.html +39 -0
  150. data/perl/HeaderParseService/resources/database/university_list/cr.html +34 -0
  151. data/perl/HeaderParseService/resources/database/university_list/cy.html +34 -0
  152. data/perl/HeaderParseService/resources/database/university_list/cz.html +44 -0
  153. data/perl/HeaderParseService/resources/database/university_list/de.html +128 -0
  154. data/perl/HeaderParseService/resources/database/university_list/dean-mainlink.jpg +0 -0
  155. data/perl/HeaderParseService/resources/database/university_list/dk.html +42 -0
  156. data/perl/HeaderParseService/resources/database/university_list/ec.html +31 -0
  157. data/perl/HeaderParseService/resources/database/university_list/ee.html +30 -0
  158. data/perl/HeaderParseService/resources/database/university_list/eg.html +29 -0
  159. data/perl/HeaderParseService/resources/database/university_list/es.html +68 -0
  160. data/perl/HeaderParseService/resources/database/university_list/et.html +28 -0
  161. data/perl/HeaderParseService/resources/database/university_list/faq.html +147 -0
  162. data/perl/HeaderParseService/resources/database/university_list/fi.html +49 -0
  163. data/perl/HeaderParseService/resources/database/university_list/fj.html +28 -0
  164. data/perl/HeaderParseService/resources/database/university_list/fo.html +28 -0
  165. data/perl/HeaderParseService/resources/database/university_list/fr.html +106 -0
  166. data/perl/HeaderParseService/resources/database/university_list/geog.html +150 -0
  167. data/perl/HeaderParseService/resources/database/university_list/gr.html +38 -0
  168. data/perl/HeaderParseService/resources/database/university_list/gu.html +28 -0
  169. data/perl/HeaderParseService/resources/database/university_list/hk.html +34 -0
  170. data/perl/HeaderParseService/resources/database/university_list/hr.html +28 -0
  171. data/perl/HeaderParseService/resources/database/university_list/hu.html +46 -0
  172. data/perl/HeaderParseService/resources/database/university_list/id.html +29 -0
  173. data/perl/HeaderParseService/resources/database/university_list/ie.html +49 -0
  174. data/perl/HeaderParseService/resources/database/university_list/il.html +35 -0
  175. data/perl/HeaderParseService/resources/database/university_list/in.html +109 -0
  176. data/perl/HeaderParseService/resources/database/university_list/is.html +32 -0
  177. data/perl/HeaderParseService/resources/database/university_list/it.html +75 -0
  178. data/perl/HeaderParseService/resources/database/university_list/jm.html +28 -0
  179. data/perl/HeaderParseService/resources/database/university_list/jo.html +28 -0
  180. data/perl/HeaderParseService/resources/database/university_list/jp.html +155 -0
  181. data/perl/HeaderParseService/resources/database/university_list/kaplan.gif +0 -0
  182. data/perl/HeaderParseService/resources/database/university_list/kr.html +65 -0
  183. data/perl/HeaderParseService/resources/database/university_list/kw.html +28 -0
  184. data/perl/HeaderParseService/resources/database/university_list/lb.html +28 -0
  185. data/perl/HeaderParseService/resources/database/university_list/linkbw2.gif +0 -0
  186. data/perl/HeaderParseService/resources/database/university_list/lk.html +30 -0
  187. data/perl/HeaderParseService/resources/database/university_list/lt.html +31 -0
  188. data/perl/HeaderParseService/resources/database/university_list/lu.html +34 -0
  189. data/perl/HeaderParseService/resources/database/university_list/lv.html +30 -0
  190. data/perl/HeaderParseService/resources/database/university_list/ma.html +28 -0
  191. data/perl/HeaderParseService/resources/database/university_list/maczynski.gif +0 -0
  192. data/perl/HeaderParseService/resources/database/university_list/mirror.tar +0 -0
  193. data/perl/HeaderParseService/resources/database/university_list/mk.html +29 -0
  194. data/perl/HeaderParseService/resources/database/university_list/mo.html +29 -0
  195. data/perl/HeaderParseService/resources/database/university_list/mseawdm.gif +0 -0
  196. data/perl/HeaderParseService/resources/database/university_list/mt.html +28 -0
  197. data/perl/HeaderParseService/resources/database/university_list/mx.html +68 -0
  198. data/perl/HeaderParseService/resources/database/university_list/my.html +39 -0
  199. data/perl/HeaderParseService/resources/database/university_list/ni.html +28 -0
  200. data/perl/HeaderParseService/resources/database/university_list/nl.html +51 -0
  201. data/perl/HeaderParseService/resources/database/university_list/no.html +56 -0
  202. data/perl/HeaderParseService/resources/database/university_list/nz.html +41 -0
  203. data/perl/HeaderParseService/resources/database/university_list/pa.html +31 -0
  204. data/perl/HeaderParseService/resources/database/university_list/pe.html +40 -0
  205. data/perl/HeaderParseService/resources/database/university_list/ph.html +41 -0
  206. data/perl/HeaderParseService/resources/database/university_list/pl.html +51 -0
  207. data/perl/HeaderParseService/resources/database/university_list/pointcom.gif +0 -0
  208. data/perl/HeaderParseService/resources/database/university_list/pr.html +31 -0
  209. data/perl/HeaderParseService/resources/database/university_list/ps.html +28 -0
  210. data/perl/HeaderParseService/resources/database/university_list/pt.html +45 -0
  211. data/perl/HeaderParseService/resources/database/university_list/recognition.html +69 -0
  212. data/perl/HeaderParseService/resources/database/university_list/results.html +71 -0
  213. data/perl/HeaderParseService/resources/database/university_list/ro.html +38 -0
  214. data/perl/HeaderParseService/resources/database/university_list/ru.html +48 -0
  215. data/perl/HeaderParseService/resources/database/university_list/sd.html +28 -0
  216. data/perl/HeaderParseService/resources/database/university_list/se.html +57 -0
  217. data/perl/HeaderParseService/resources/database/university_list/sg.html +33 -0
  218. data/perl/HeaderParseService/resources/database/university_list/si.html +30 -0
  219. data/perl/HeaderParseService/resources/database/university_list/sk.html +35 -0
  220. data/perl/HeaderParseService/resources/database/university_list/th.html +45 -0
  221. data/perl/HeaderParseService/resources/database/university_list/tr.html +44 -0
  222. data/perl/HeaderParseService/resources/database/university_list/tw.html +76 -0
  223. data/perl/HeaderParseService/resources/database/university_list/ua.html +29 -0
  224. data/perl/HeaderParseService/resources/database/university_list/uk.html +168 -0
  225. data/perl/HeaderParseService/resources/database/university_list/univ-full.html +3166 -0
  226. data/perl/HeaderParseService/resources/database/university_list/univ.html +122 -0
  227. data/perl/HeaderParseService/resources/database/university_list/uy.html +31 -0
  228. data/perl/HeaderParseService/resources/database/university_list/ve.html +34 -0
  229. data/perl/HeaderParseService/resources/database/university_list/yu.html +28 -0
  230. data/perl/HeaderParseService/resources/database/university_list/za.html +46 -0
  231. data/perl/HeaderParseService/resources/database/university_list/zm.html +28 -0
  232. data/perl/HeaderParseService/resources/database/university_list.txt +3025 -0
  233. data/perl/HeaderParseService/resources/database/url.txt +1 -0
  234. data/perl/HeaderParseService/resources/database/webTopWords +225 -0
  235. data/perl/HeaderParseService/resources/database/words +45402 -0
  236. data/perl/HeaderParseService/resources/models/10ContextModelfold1 +369 -0
  237. data/perl/HeaderParseService/resources/models/10Modelfold1 +376 -0
  238. data/perl/HeaderParseService/resources/models/11ContextModelfold1 +400 -0
  239. data/perl/HeaderParseService/resources/models/11Modelfold1 +526 -0
  240. data/perl/HeaderParseService/resources/models/12ContextModelfold1 +510 -0
  241. data/perl/HeaderParseService/resources/models/12Modelfold1 +423 -0
  242. data/perl/HeaderParseService/resources/models/13ContextModelfold1 +364 -0
  243. data/perl/HeaderParseService/resources/models/13Modelfold1 +677 -0
  244. data/perl/HeaderParseService/resources/models/14ContextModelfold1 +459 -0
  245. data/perl/HeaderParseService/resources/models/14Modelfold1 +325 -0
  246. data/perl/HeaderParseService/resources/models/15ContextModelfold1 +340 -0
  247. data/perl/HeaderParseService/resources/models/15Modelfold1 +390 -0
  248. data/perl/HeaderParseService/resources/models/1ContextModelfold1 +668 -0
  249. data/perl/HeaderParseService/resources/models/1Modelfold1 +1147 -0
  250. data/perl/HeaderParseService/resources/models/2ContextModelfold1 +755 -0
  251. data/perl/HeaderParseService/resources/models/2Modelfold1 +796 -0
  252. data/perl/HeaderParseService/resources/models/3ContextModelfold1 +1299 -0
  253. data/perl/HeaderParseService/resources/models/3Modelfold1 +1360 -0
  254. data/perl/HeaderParseService/resources/models/4ContextModelfold1 +1062 -0
  255. data/perl/HeaderParseService/resources/models/4Modelfold1 +993 -0
  256. data/perl/HeaderParseService/resources/models/5ContextModelfold1 +1339 -0
  257. data/perl/HeaderParseService/resources/models/5Modelfold1 +2098 -0
  258. data/perl/HeaderParseService/resources/models/6ContextModelfold1 +888 -0
  259. data/perl/HeaderParseService/resources/models/6Modelfold1 +620 -0
  260. data/perl/HeaderParseService/resources/models/7ContextModelfold1 +257 -0
  261. data/perl/HeaderParseService/resources/models/7Modelfold1 +228 -0
  262. data/perl/HeaderParseService/resources/models/8ContextModelfold1 +677 -0
  263. data/perl/HeaderParseService/resources/models/8Modelfold1 +1871 -0
  264. data/perl/HeaderParseService/resources/models/9ContextModelfold1 +198 -0
  265. data/perl/HeaderParseService/resources/models/9Modelfold1 +170 -0
  266. data/perl/HeaderParseService/resources/models/NameSpaceModel +181 -0
  267. data/perl/HeaderParseService/resources/models/NameSpaceTrainF +347 -0
  268. data/perl/HeaderParseService/resources/models/WrapperBaseFeaDict +13460 -0
  269. data/perl/HeaderParseService/resources/models/WrapperContextFeaDict +14045 -0
  270. data/perl/HeaderParseService/resources/models/WrapperSpaceAuthorFeaDict +510 -0
  271. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test1 +23 -0
  272. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test10 +23 -0
  273. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test11 +23 -0
  274. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test12 +23 -0
  275. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test13 +23 -0
  276. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test14 +23 -0
  277. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test15 +23 -0
  278. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test2 +23 -0
  279. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test3 +23 -0
  280. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test4 +23 -0
  281. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test5 +23 -0
  282. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test6 +23 -0
  283. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test7 +23 -0
  284. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test8 +23 -0
  285. data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test9 +23 -0
  286. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test1 +23 -0
  287. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test10 +23 -0
  288. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test11 +23 -0
  289. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test12 +23 -0
  290. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test13 +23 -0
  291. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test14 +23 -0
  292. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test15 +23 -0
  293. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test2 +23 -0
  294. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test3 +23 -0
  295. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test4 +23 -0
  296. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test5 +23 -0
  297. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test6 +23 -0
  298. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test7 +23 -0
  299. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test8 +23 -0
  300. data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test9 +23 -0
  301. data/perl/ParsCit/README.TXT +82 -0
  302. data/perl/ParsCit/crfpp/traindata/parsCit.template +60 -0
  303. data/perl/ParsCit/crfpp/traindata/parsCit.train.data +12104 -0
  304. data/perl/ParsCit/crfpp/traindata/tagged_references.txt +500 -0
  305. data/perl/ParsCit/lib/CSXUtil/SafeText.pm +140 -0
  306. data/perl/ParsCit/lib/ParsCit/Citation.pm +462 -0
  307. data/perl/ParsCit/lib/ParsCit/CitationContext.pm +132 -0
  308. data/perl/ParsCit/lib/ParsCit/Config.pm +46 -0
  309. data/perl/ParsCit/lib/ParsCit/Controller.pm +306 -0
  310. data/perl/ParsCit/lib/ParsCit/PostProcess.pm +367 -0
  311. data/perl/ParsCit/lib/ParsCit/PreProcess.pm +333 -0
  312. data/perl/ParsCit/lib/ParsCit/Tr2crfpp.pm +331 -0
  313. data/perl/ParsCit/resources/parsCit.model +0 -0
  314. data/perl/ParsCit/resources/parsCitDict.txt +148783 -0
  315. data/perl/extract.pl +199 -0
  316. data/spec/biblicit/cb2bib_spec.rb +48 -0
  317. data/spec/biblicit/citeseer_spec.rb +40 -0
  318. data/spec/fixtures/pdf/10.1.1.109.4049.pdf +0 -0
  319. data/spec/fixtures/pdf/Bagnoli Watts TAR 2010.pdf +0 -0
  320. data/spec/fixtures/pdf/ICINCO_2010.pdf +0 -0
  321. data/spec/spec_helper.rb +3 -0
  322. metadata +474 -0
@@ -0,0 +1,61 @@
+ #
+ # Copyright 2007 Penn State University
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ # http://www.apache.org/licenses/LICENSE-2.0
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ #
+ package FileConverter::JODConverter;
+ #
+ # Wrapper to execute the JODConverter command-line tool for converting
+ # doc, rtf to PDF files.
+ #
+ # Juan Pablo Fernandez Ramirez, 10/05/07
+ #
+ use strict;
+ use FileConverter::Config;
+ use FileConverter::Utils;
+
+ my $JODConverterLoc = $FileConverter::Config::JODConverterPath;
+
+ ##
+ # Execute the JODConverter utility.
+ ##
+ sub convertFile {
+     my ($filePath, $rTrace, $rCheckSums) = @_;
+     my ($status, $msg) = (1, "");
+
+     if (FileConverter::Utils::checkProcess("soffice") == 0) {
+         return (0, "Open Office Service is not running");
+     }
+
+     my $pdfFilePath = FileConverter::Utils::changeExtension($filePath, "pdf");
+     my @commandArgs = ("java", "-jar", $JODConverterLoc, $filePath,
+                        $pdfFilePath);
+     system(@commandArgs);
+
+     if ($? == -1) {
+         return (0, "Failed to execute JODConverter: $!");
+     } elsif ($? & 127) {
+         return (0, "Java died with signal ".($? & 127));
+     }
+
+     my $code = $?>>8;
+     if ($code == 0) {
+         push @$rTrace, "JODConverter";
+
+         my $sha1 = FileConverter::CheckSum->new();
+         $sha1->digest($filePath);
+         push @$rCheckSums, $sha1;
+
+         return ($status, $msg, $pdfFilePath, $rTrace, $rCheckSums);
+     } else {
+         return (0, "Error executing JODConverter (code $code): $!");
+     }
+ } # convertFile
+ 1;
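The wrapper above inspects `$?` in three steps after `system`: `-1` means the command could not be started, the low seven bits carry a terminating signal, and the high byte carries the exit code. A minimal sketch of that decoding as a standalone helper follows. The name `decode_status` is hypothetical and not part of FileConverter. Note that `$!` is only meaningful in the `-1` case, so appending it to a non-zero exit code message may show an unrelated error.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of FileConverter): turn a raw wait status,
# as left in $? by system(), into a (kind, value) pair.
sub decode_status {
    my ($raw) = @_;
    return ('failed_to_start', undef) if $raw == -1;  # child never ran; $! holds the reason
    return ('signal', $raw & 127)     if $raw & 127;  # low 7 bits: terminating signal
    return ('exit', $raw >> 8);                       # high byte: exit code
}

# Usage: run the current perl interpreter ($^X) as a stand-in command.
system($^X, '-e', 'exit 3');
my ($kind, $value) = decode_status($?);
print "$kind $value\n";   # prints "exit 3"
```

Keeping the decoding in one place would let each converter wrapper build its own message from the same `(kind, value)` pair.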
@@ -0,0 +1,69 @@
+ #
+ # Copyright 2007 Penn State University
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ # http://www.apache.org/licenses/LICENSE-2.0
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ #
+ package FileConverter::PDFBox;
+ #
+ # Wrapper to call the PDFBox ExtractText command-line tool
+ # for extracting text from PDF files. It's recommended to
+ # use TET instead, if TET is available.
+ #
+ # Isaac Councill, 09/06/07
+ #
+ use strict;
+ use FileConverter::Config;
+ use FileConverter::Utils;
+
+ my $PDFBoxLoc = $FileConverter::Config::PDFBoxLocation;
+
+ ##
+ # Execute the PDFBox utility.
+ ##
+ sub extractText {
+     my ($filePath, $rTrace, $rCheckSums) = @_;
+     my ($status, $msg) = (1, "");
+
+     if (FileConverter::Utils::checkExtension($filePath, "pdf") <= 0) {
+         return (0, "Unexpected file extension at ".
+                 __FILE__." line ".__LINE__);
+     }
+
+     my $textFilePath =
+         FileConverter::Utils::changeExtension($filePath, "txt");
+     my @commandArgs = ("java", "-jar", $PDFBoxLoc,
+                        "ExtractText", "-encoding",
+                        "utf8", $filePath, $textFilePath);
+
+     system(@commandArgs);
+
+     if ($? == -1) {
+         return (0, "Failed to execute PDFBox: $!");
+     } elsif ($? & 127) {
+         return (0, "Java died with signal ".($? & 127));
+     }
+
+     my $code = $?>>8;
+     if ($code == 0) {
+         push @$rTrace, "PDFBox";
+
+         my $sha1 = FileConverter::CheckSum->new();
+         $sha1->digest($filePath);
+         push @$rCheckSums, $sha1;
+
+         return ($status, $msg, $textFilePath, $rTrace, $rCheckSums);
+     } else {
+         return (0, "Error executing PDFBox (code $code): $!");
+     }
+
+ } # extractText
+
+
+ 1;
@@ -0,0 +1,69 @@
+ #
+ # Copyright 2007 Penn State University
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ # http://www.apache.org/licenses/LICENSE-2.0
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ #
+ package FileConverter::PSConverter;
+ #
+ # Wrapper to execute the ps2pdf command-line tool for converting
+ # ps to PDF files.
+ #
+ # Juan Pablo Fernandez Ramirez, 10/08/07
+ #
+ use strict;
+ use FileConverter::Config;
+ use FileConverter::Utils;
+
+ my $timeout = 20;
+
+ ##
+ # Execute the converter utility.
+ ##
+ sub convertFile {
+     my ($filePath, $rTrace, $rCheckSums) = @_;
+     my ($status, $msg) = (1, "");
+
+     my $pdfFilePath = FileConverter::Utils::changeExtension($filePath, "pdf");
+     my @commandArgs = ("ps2pdf13", $filePath, $pdfFilePath);
+     my $child;
+     eval {
+         local $SIG{'ALRM'} = sub { die "alarm\n" };
+         alarm $timeout;
+         $child = system(@commandArgs);
+         alarm 0;
+     };
+
+     if ($@) {
+         if ($@ eq "alarm\n") {
+             if (defined $child) { kill 9, $child; }
+             return (0, "ps2pdf timeout");
+         }
+     }
+
+     if ($? == -1) {
+         return (0, "Failed to execute ps2pdf: $!");
+     } elsif ($? & 127) {
+         return (0, "ps2pdf died with signal ".($? & 127));
+     }
+
+     my $code = $?>>8;
+     if ($code == 0) {
+         push @$rTrace, "ps2pdf";
+
+         my $sha1 = FileConverter::CheckSum->new();
+         $sha1->digest($filePath);
+         push @$rCheckSums, $sha1;
+
+         return ($status, $msg, $pdfFilePath, $rTrace, $rCheckSums);
+     } else {
+         return (0, "Error executing ps2pdf (code $code): $!");
+     }
+ } # convertFile
+ 1;
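One thing to note when reading the timeout logic above: Perl's `system` returns the wait status of the command, not a process ID, so `kill 9, $child` cannot target the converter process. The following is a minimal sketch, assuming a Unix-like system, of one way to enforce a timeout while holding the real child PID, using `fork`, `exec` and `waitpid`. The helper name `run_with_timeout` is hypothetical and not part of FileConverter.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper (not part of FileConverter): run @cmd and give up
# after $timeout seconds. Returns the raw wait status, or undef on timeout.
sub run_with_timeout {
    my ($timeout, @cmd) = @_;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: replace this process with the command.
        exec(@cmd) or POSIX::_exit(127);   # only reached if exec itself fails
    }

    my $status;
    my $ok = eval {
        local $SIG{ALRM} = sub { die "alarm\n" };
        alarm $timeout;
        waitpid($pid, 0);
        $status = $?;
        alarm 0;
        1;
    };
    alarm 0;
    return $status if $ok;

    die $@ unless $@ eq "alarm\n";   # re-throw anything that is not our timeout
    kill 'KILL', $pid;               # we hold the real PID, so this reaches the child
    waitpid($pid, 0);                # reap it to avoid leaving a zombie
    return;
}

# Usage: the current perl interpreter ($^X) stands in for a converter command.
my $status = run_with_timeout(5, $^X, '-e', 'exit 0');
print defined $status ? "finished with status $status\n" : "timed out\n";
```

The returned status can be decoded exactly as the wrapper already does, with `$status & 127` for a signal and `$status >> 8` for the exit code.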
@@ -0,0 +1,88 @@
1
+ #
2
+ # Copyright 2007 Penn State University
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ # http://www.apache.org/licenses/LICENSE-2.0
7
+ # Unless required by applicable law or agreed to in writing, software
8
+ # distributed under the License is distributed on an "AS IS" BASIS,
9
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
10
+ # See the License for the specific language governing permissions and
11
+ # limitations under the License.
12
+ #
13
+ package FileConverter::PSToText;
14
+ #
15
+ # Wrapper to execute the ps2ascii command-line tool for converting
16
+ # ps to text files.
17
+ #
18
+ # Isaac, 10/08/07
19
+ #
20
+ use strict;
21
+ use FileConverter::Config;
22
+ use FileConverter::Utils;
23
+ use Encode;
24
+
25
+ my $timeout = 20;
26
+
27
+ ##
28
+ # Execute the converter utility.
29
+ ##
30
+ sub extractText {
31
+ my ($filePath, $rTrace, $rCheckSums) = @_;
32
+ my ($status, $msg) = (1, "");
33
+
34
+ my $txtFilePath = FileConverter::Utils::changeExtension($filePath, "txt");
35
+
36
+ my @commandArgs = ("ps2ascii", $filePath, $txtFilePath);
37
+ my $child;
38
+ eval {
39
+ local $SIG{'ALRM'} = sub { die "alarm\n" };
40
+ alarm $timeout;
41
+ $child = system(@commandArgs);
42
+ alarm 0;
43
+ };
44
+
45
+ if ($@) {
46
+ if ($@ eq "alarm\n") {
47
+ if (defined $child) { kill 9, $child; }
48
+ return (0, "ps2ascii timeout");
49
+ }
50
+ }
51
+
52
+ if ($? == -1) {
53
+ return (0, "Failed to execute ps2ascii: $!");
54
+ } elsif ($? & 127) {
55
+ return (0, "ps2ascii died with signal ".($? & 127));
56
+ }
57
+
58
+ my $code = $?>>8;
59
+ if ($code == 0) {
60
+ push @$rTrace, "ps2ascii";
61
+ ascii2utf8($txtFilePath);
62
+
63
+ my $sha1 = FileConverter::CheckSum->new();
64
+ $sha1->digest($filePath);
65
+ push @$rCheckSums, $sha1;
66
+
67
+ return ($status, $msg, $txtFilePath, $rTrace, $rCheckSums);
68
+ } else {
69
+ return (0, "Error executing ps2ascii (code $code): $!");
70
+ }
71
+ } # extractText
72
+
73
+ sub ascii2utf8 {
74
+ my $fn = shift;
75
+
76
+ open(IN, "<$fn") or die $!;
77
+ my $text;
78
+ {
79
+ local $/ = undef;
80
+ $text = <IN>;
81
+ }
82
+ close IN;
83
+ $text = Encode::decode_utf8($text);
84
+ open(OUT, ">:utf8", $fn) or die $!;
85
+ print OUT $text;
86
+ close OUT;
87
+ }
88
+ 1;
@@ -0,0 +1,68 @@
1
+ #
2
+ # Copyright 2007 Penn State University
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ # http://www.apache.org/licenses/LICENSE-2.0
7
+ # Unless required by applicable law or agreed to in writing, software
8
+ # distributed under the License is distributed on an "AS IS" BASIS,
9
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
10
+ # See the License for the specific language governing permissions and
11
+ # limitations under the License.
12
+ #
13
+ package FileConverter::Prescript;
14
+ #
15
+ # Wrapper to execute the Prescript command-line tool for extracting
16
+ # text from PS files.
17
+ #
18
+ # Juan Pablo Fernandez R., 10/31/07
19
+ #
20
+ use strict;
21
+ use FileConverter::Config;
22
+ use FileConverter::Utils;
23
+ use FileConverter::CheckSum;
24
+
25
+ my $PrescriptPath = $FileConverter::Config::PrescriptPath;
26
+
27
+ ##
28
+ # Execute the Prescript utility.
29
+ ##
30
+ sub extractText {
31
+ my ($filePath, $rTrace, $rCheckSums) = @_;
32
+ my ($status, $msg) = (1, "");
33
+
34
+ if (FileConverter::Utils::checkExtension($filePath, "ps") <= 0) {
35
+ return (0, "Unexpected file extension at ". __FILE__." line ".__LINE__);
36
+ }
37
+
38
+ my $textFilePath = FileConverter::Utils::changeExtension($filePath, "txt");
39
+ my @commandArgs = ($PrescriptPath, "plain", $filePath, $textFilePath);
40
+
41
+ system(@commandArgs);
42
+
43
+ if ($? == -1) {
44
+ return (0, "Failed to execute Prescript: $!");
45
+ } elsif ($? & 127) {
46
+ return (0, "Prescript died with signal ".($? & 127));
47
+ }
48
+
49
+ my $code = $?>>8;
50
+ if (($code == 0) || ($code == 1)) {
51
+ if ($code == 1) {
52
+ print STDERR "Prescript completed with errors: $filePath\n";
53
+ }
54
+
55
+ push @$rTrace, "PSLIB Prescript";
56
+
57
+ my $sha1 = new FileConverter::CheckSum();
58
+ $sha1->digest($filePath);
59
+ push @$rCheckSums, $sha1;
60
+
61
+ return ($status, $msg, $textFilePath, $rTrace, $rCheckSums);
62
+
63
+ } else {
64
+ return (0, "Error executing Prescript (code $code): $!");
65
+ }
66
+ } # extractText
67
+
68
+ 1;
@@ -0,0 +1,75 @@
1
+ #
2
+ # Copyright 2007 Penn State University
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ # http://www.apache.org/licenses/LICENSE-2.0
7
+ # Unless required by applicable law or agreed to in writing, software
8
+ # distributed under the License is distributed on an "AS IS" BASIS,
9
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
10
+ # See the License for the specific language governing permissions and
11
+ # limitations under the License.
12
+ #
13
+ package FileConverter::TET;
14
+ #
15
+ # Wrapper to execute the TET command-line tool for extracting
16
+ # text from PDF files.
17
+ #
18
+ # Isaac Councill, 09/06/07
19
+ #
20
+ use strict;
21
+ use FileConverter::Config;
22
+ use FileConverter::Utils;
23
+ use FileConverter::CheckSum;
24
+
25
+ my $TETPath = $FileConverter::Config::TETPath;
26
+ my $TETLicensePath = $FileConverter::Config::TETLicensePath;
27
+
28
+ $ENV{'PDFLIBLICENSEFILE'} = $TETLicensePath;
29
+
30
+ ##
31
+ # Execute the TET utility.
32
+ ##
33
+ sub extractText {
34
+ my ($filePath, $rTrace, $rCheckSums) = @_;
35
+ my ($status, $msg) = (1, "");
36
+
37
+ if (FileConverter::Utils::checkExtension($filePath, "pdf") <= 0) {
38
+ return (0, "Unexpected file extension at ".
39
+ __FILE__." line ".__LINE__);
40
+ }
41
+
42
+ my $textFilePath =
43
+ FileConverter::Utils::changeExtension($filePath, "txt");
44
+ my @commandArgs = ($TETPath, "-o", $textFilePath, $filePath);
45
+
46
+ system(@commandArgs);
47
+
48
+ if ($? == -1) {
49
+ return (0, "Failed to execute TET: $!");
50
+ } elsif ($? & 127) {
51
+ return (0, "TET died with signal ".($? & 127));
52
+ }
53
+
54
+ my $code = $?>>8;
55
+ if (($code == 0) || ($code == 1)) {
56
+ if ($code == 1) {
57
+ print STDERR "TET completed with errors: $filePath\n";
58
+ }
59
+
60
+ push @$rTrace, "PDFLib TET";
61
+
62
+ my $sha1 = new FileConverter::CheckSum();
63
+ $sha1->digest($filePath);
64
+ push @$rCheckSums, $sha1;
65
+
66
+ return ($status, $msg, $textFilePath, $rTrace, $rCheckSums);
67
+
68
+ } else {
69
+ return (0, "Error executing TET (code $code): $!");
70
+ }
71
+
72
+ } # extractText
73
+
74
+
75
+ 1;
@@ -0,0 +1,130 @@
1
+ #
2
+ # Copyright 2007 Penn State University
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ # http://www.apache.org/licenses/LICENSE-2.0
7
+ # Unless required by applicable law or agreed to in writing, software
8
+ # distributed under the License is distributed on an "AS IS" BASIS,
9
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
10
+ # See the License for the specific language governing permissions and
11
+ # limitations under the License.
12
+ #
13
+ package FileConverter::Utils;
14
+ #
15
+ # Container for subroutines that may be shared across multiple
16
+ # FileConverter modules.
17
+ #
18
+ # Isaac Councill, 09/06/07
19
+ #
20
+ use strict;
21
+ use Encode;
22
+
23
+ ##
24
+ # Returns the file extension of a file name, if there is one.
25
+ ##
26
+ sub getExtension {
27
+ my ($fn) = @_;
28
+ if ($fn =~ m/^.*\.(.*)$/) {
29
+ return $1;
30
+ }
31
+ return undef;
32
+
33
+ } # getExtension
34
+
35
+
36
+ ##
37
+ # Strips off the last extension of the file name.
38
+ ##
39
+ sub stripExtension {
40
+ my ($fn) = @_;
41
+ $fn =~ s/^(.*)\..*$/$1/;
42
+ return $fn;
43
+
44
+ } # stripExtension
45
+
46
+
47
+ ##
48
+ ##
49
+ # Routine for checking that a filename ends with an expected
50
+ # extension. Returns 1 if it does, 0 if not.
51
+ ##
52
+ sub checkExtension {
53
+ my ($fn, $ext) = @_;
54
+ if ($fn =~ m/^.*\.(.*)$/) {
55
+ if ($1 =~ m/$ext/i) {
56
+ return 1;
57
+ }
58
+ }
59
+ return 0;
60
+
61
+ } # checkExtension
62
+
63
+
64
+ ##
65
+ # Simple routine for changing the extension of a file.
66
+ # Example: $newFileName = changeExtension($oldFileName, "txt");
67
+ ##
68
+ sub changeExtension {
69
+ my ($fn, $ext) = @_;
70
+ unless ($fn =~ s/^(.*)\..*$/$1\.$ext/) {
71
+ $fn .= ".$ext";
72
+ }
73
+ return $fn;
74
+
75
+ } # changeExtension
76
+
77
+
78
+ ##
79
+ # Returns the directory part of a file path.
80
+ ##
81
+ sub getDirectory {
82
+ my ($filePath) = @_;
83
+ if ($filePath =~ m/^(.*)\/.*$/) {
84
+ return $1;
85
+ } else {
86
+ return $filePath;
87
+ }
88
+
89
+ } # getDirectory
90
+
91
+ ##
92
+ ##
93
+ # Routine for checking if a process is running or not
94
+ # Returns 1 if it is running, 0 if not.
95
+ ##
96
+ sub checkProcess {
97
+ my ($process) = @_;
98
+ my $cmd = "ps -ef | grep " . $process . " | grep -v grep";
99
+ my $result = `$cmd`;
100
+ if ($result eq '') {
101
+ return 0;
102
+ }
103
+ else {
104
+ return 1;
105
+ }
106
+ } # checkProcess
107
+
108
+
109
+ ##
110
+ # Convert a file of the specified encoding to UTF-8.
111
+ ##
112
+ sub convertToUTF8 {
113
+ my ($fn, $encoding) = @_;
114
+ my $octets;
115
+ open (FILE, "<$fn") or die "could not open file $fn: $!";
116
+ binmode FILE, ":bytes";
117
+ {
118
+ local $/ = undef;
119
+ $octets = <FILE>;
120
+ }
121
+ close FILE;
122
+
123
+ Encode::from_to($octets, $encoding, "utf8");
124
+ open (FILE, ">:utf8", "$fn") or die "could not open file $fn: $!";
125
+ print FILE Encode::decode_utf8($octets);
126
+ close FILE;
127
+
128
+ }
129
+
130
+ 1;
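One detail of `checkExtension` above is worth illustrating: the captured extension is tested with an unanchored match (`m/$ext/i`), so an extension that merely contains the expected one also passes; for example, "eps" matches "ps". The sketch below shows a stricter comparison, assuming an exact, case-insensitive match is what callers want. The name `check_extension_exact` is hypothetical and not part of FileConverter::Utils.

```perl
use strict;
use warnings;

# Hypothetical stricter variant of checkExtension: the whole extension
# must equal the expected one, ignoring case.
sub check_extension_exact {
    my ($fn, $ext) = @_;

    # Capture the text after the last dot of the final path component.
    if ($fn =~ m/\.([^.\/]*)$/) {
        return lc($1) eq lc($ext) ? 1 : 0;
    }
    return 0;
}

print check_extension_exact("paper.ps",   "ps"),  "\n";   # 1
print check_extension_exact("figure.eps", "ps"),  "\n";   # 0
print check_extension_exact("paper.PDF",  "pdf"), "\n";   # 1
```

Comparing with `eq` also avoids interpreting `$ext` as a regular expression, which matters if an extension ever contains regex metacharacters.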
@@ -0,0 +1,80 @@
1
+ SVMHeaderParse README
2
+ IGC
3
+
4
+ SVMHeaderParse is a utility for extracting header information
5
+ from research papers based on Support Vector Machines and
6
+ heuristic regularization. For details on the algorithm,
7
+ please see the following paper:
8
+
9
+ Hui Han, C. Lee Giles, Eren Manavoglu, Hongyuan Zha,
10
+ Zhenyue Zhang, Edward A. Fox. "Automatic Document Metadata
11
+ Extraction Using Support Vector Machines", in Proceedings
12
+ of ACM/IEEE Joint Conference on Digital Libraries
13
+ (JCDL 2003): 37-48, 2003.
14
+
15
+ Installation:
16
+
17
+ In order to use the SVMHeaderParse web service you will need
18
+ the following modules in your perl library:
19
+
20
+ Log::Log4perl
21
+ Log::Dispatch
22
+
23
+ Edit lib/HeaderParse/Config/API_Config.pm to provide values
24
+ appropriate for your environment. Also edit wsdl/SVMHeaderParse.wsdl
25
+ to reflect any changes to your service URL.
26
+
27
+ Command Line Usage:
28
+
29
+ Command line utilities are provided for extracting header information
30
+ from text documents. By default, text files are expected to be encoded
31
+ in UTF-8, but the expected encoding can be adjusted using perl
32
+ command line switches. To run SVMHeaderParse on a single document,
33
+ execute the following command:
34
+
35
+ extractHeader.pl textfile [outfile]
36
+
37
+ If "outfile" is specified, the XML output will be written to that
38
+ file; otherwise, the XML will be printed to STDOUT.
39
+
40
+ There is also a web service interface available, using the
41
+ SOAP::Lite perl module. To start the service, just execute:
42
+
43
+ headerparse-service.pl
44
+
45
+ By default, the service will start on port 40000, but this is
46
+ configurable in the HeaderParse::Config::API_Config library module.
47
+ A WSDL file is provided with the distribution that outlines the
48
+ message details expected by the SVMHeaderParse service. If the
49
+ service port is changed, the WSDL file must also be modified to
50
+ reflect that change. Expected parameters in the input message
51
+ are "filePath" (a path to the text file to parse) and "repositoryID".
52
+ The SVMHeaderParse service is designed for deployment in an
53
+ environment where text files may be located on file systems mounted
54
+ from arbitrary machines on the network. Thus, "repositoryID" provides
55
+ a means to map a given shared file system to its mount point.
56
+ Repository mappings are configurable in the API_Config module. The
57
+ "filePath" parameter provides a path to the text file relative to
58
+ the repository mount point. The local file system may be specified
59
+ using the reserved repository ID "LOCAL". In that case, an absolute
60
+ path to the text file may be specified.
61
+
62
+ A perl client is also provided that demonstrates how to use the
63
+ service. Execute the client with the following command:
64
+
65
+ headerparse-client.pl filePath repositoryID
66
+
67
+ If the call is successful, the XML output will be printed to STDOUT.
68
+
69
+ API:
70
+
71
+ The SVMHeaderParse libraries may be used directly from external perl
72
+ applications. The interface module is HeaderParse::API::Parser. If
73
+ XML output is desired, use the
74
+
75
+ HeaderParse::API::Parser::_parseHeader($filePath, $jobID)
76
+
77
+ subroutine. If the SVMHeaderParse library is used from external
78
+ Perl applications, remember to use the "-CSD" perl option for
79
+ global Unicode stream support (or otherwise handle encoding) or
80
+ risk string corruption.
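As a sketch of the "otherwise handle encoding" route mentioned above, a calling script can declare UTF-8 layers with the `open` pragma instead of passing `-CSD`. This is an illustration under assumptions, not the README's prescribed method: the pragma's default layers are lexically scoped, so they cover only handles opened in the calling file, not files opened inside the library's own modules. For those, `-CSD` (or the equivalent `PERL_UNICODE=SD` environment setting) remains the option the README names. The sample string and temporary file are illustrative only.

```perl
use strict;
use warnings;
use open qw(:std :utf8);   # UTF-8 for STDIN/STDOUT/STDERR and for opens in this file
use File::Temp qw(tempfile);

# Create a scratch file, then reopen it here so the pragma's layers apply.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# "\x{fc}" is U+00FC (u with diaeresis): one character, two bytes in UTF-8.
open(my $out, '>', $path) or die "could not open $path: $!";
print {$out} "Z\x{fc}rich\n";
close $out;

open(my $in, '<', $path) or die "could not open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

# The text round-trips as 6 characters even though it occupies 8 bytes on disk.
print length($line), " characters, ", -s $path, " bytes\n";
```

Without a decoding layer on the input handle, the same line would be read back as 7 separate bytes, which is the kind of string corruption the README warns about.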