biblicit 1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/.gitignore +3 -0
- data/.rspec +1 -0
- data/Gemfile +6 -0
- data/LICENSE.TXT +176 -0
- data/README.md +120 -0
- data/Rakefile +8 -0
- data/biblicit.gemspec +33 -0
- data/lib/biblicit/cb2bib.rb +83 -0
- data/lib/biblicit/citeseer.rb +53 -0
- data/lib/biblicit/extractor.rb +37 -0
- data/lib/biblicit.rb +6 -0
- data/perl/DocFilter/lib/CSXUtil/SafeText.pm +140 -0
- data/perl/DocFilter/lib/DocFilter/Config.pm +35 -0
- data/perl/DocFilter/lib/DocFilter/Filter.pm +51 -0
- data/perl/FileConversionService/README.TXT +11 -0
- data/perl/FileConversionService/converters/PDFBox/pdfbox-app-1.7.1.jar +0 -0
- data/perl/FileConversionService/lib/CSXUtil/SafeText.pm +140 -0
- data/perl/FileConversionService/lib/FileConverter/CheckSum.pm +77 -0
- data/perl/FileConversionService/lib/FileConverter/Compression.pm +137 -0
- data/perl/FileConversionService/lib/FileConverter/Config.pm +57 -0
- data/perl/FileConversionService/lib/FileConverter/Controller.pm +191 -0
- data/perl/FileConversionService/lib/FileConverter/JODConverter.pm +61 -0
- data/perl/FileConversionService/lib/FileConverter/PDFBox.pm +69 -0
- data/perl/FileConversionService/lib/FileConverter/PSConverter.pm +69 -0
- data/perl/FileConversionService/lib/FileConverter/PSToText.pm +88 -0
- data/perl/FileConversionService/lib/FileConverter/Prescript.pm +68 -0
- data/perl/FileConversionService/lib/FileConverter/TET.pm +75 -0
- data/perl/FileConversionService/lib/FileConverter/Utils.pm +130 -0
- data/perl/HeaderParseService/README.TXT +80 -0
- data/perl/HeaderParseService/lib/CSXUtil/SafeText.pm +140 -0
- data/perl/HeaderParseService/lib/HeaderParse/API/AssembleXMLMetadata.pm +968 -0
- data/perl/HeaderParseService/lib/HeaderParse/API/Function.pm +2016 -0
- data/perl/HeaderParseService/lib/HeaderParse/API/LoadInformation.pm +444 -0
- data/perl/HeaderParseService/lib/HeaderParse/API/MultiClassChunking.pm +409 -0
- data/perl/HeaderParseService/lib/HeaderParse/API/NamePatternMatch.pm +537 -0
- data/perl/HeaderParseService/lib/HeaderParse/API/Parser.pm +68 -0
- data/perl/HeaderParseService/lib/HeaderParse/API/ParserMethods.pm +1880 -0
- data/perl/HeaderParseService/lib/HeaderParse/Config/API_Config.pm +46 -0
- data/perl/HeaderParseService/resources/data/EbizHeaders.txt +24330 -0
- data/perl/HeaderParseService/resources/data/EbizHeaders.txt.parsed +27506 -0
- data/perl/HeaderParseService/resources/data/EbizHeaders.txt.parsed.old +26495 -0
- data/perl/HeaderParseService/resources/data/tagged_headers.txt +40668 -0
- data/perl/HeaderParseService/resources/data/test_header.txt +31 -0
- data/perl/HeaderParseService/resources/data/test_header.txt.parsed +31 -0
- data/perl/HeaderParseService/resources/database/50states +60 -0
- data/perl/HeaderParseService/resources/database/AddrTopWords.txt +17 -0
- data/perl/HeaderParseService/resources/database/AffiTopWords.txt +35 -0
- data/perl/HeaderParseService/resources/database/AffiTopWordsAll.txt +533 -0
- data/perl/HeaderParseService/resources/database/ChineseSurNames.txt +276 -0
- data/perl/HeaderParseService/resources/database/Csurnames.bin +0 -0
- data/perl/HeaderParseService/resources/database/Csurnames_spec.bin +0 -0
- data/perl/HeaderParseService/resources/database/DomainSuffixes.txt +242 -0
- data/perl/HeaderParseService/resources/database/LabeledHeader +18 -0
- data/perl/HeaderParseService/resources/database/README +2 -0
- data/perl/HeaderParseService/resources/database/TrainMulClassLines +254 -0
- data/perl/HeaderParseService/resources/database/TrainMulClassLines1 +510 -0
- data/perl/HeaderParseService/resources/database/abstract.txt +1 -0
- data/perl/HeaderParseService/resources/database/abstractTopWords +9 -0
- data/perl/HeaderParseService/resources/database/addr.txt +28 -0
- data/perl/HeaderParseService/resources/database/affi.txt +34 -0
- data/perl/HeaderParseService/resources/database/affis.bin +0 -0
- data/perl/HeaderParseService/resources/database/all_namewords_spec.bin +0 -0
- data/perl/HeaderParseService/resources/database/allnamewords.bin +0 -0
- data/perl/HeaderParseService/resources/database/cities_US.txt +4512 -0
- data/perl/HeaderParseService/resources/database/cities_world.txt +4463 -0
- data/perl/HeaderParseService/resources/database/city.txt +3150 -0
- data/perl/HeaderParseService/resources/database/cityname.txt +3151 -0
- data/perl/HeaderParseService/resources/database/country_abbr.txt +243 -0
- data/perl/HeaderParseService/resources/database/countryname.txt +262 -0
- data/perl/HeaderParseService/resources/database/dateTopWords +30 -0
- data/perl/HeaderParseService/resources/database/degree.txt +67 -0
- data/perl/HeaderParseService/resources/database/email.txt +3 -0
- data/perl/HeaderParseService/resources/database/excludeWords.txt +40 -0
- data/perl/HeaderParseService/resources/database/female-names +4960 -0
- data/perl/HeaderParseService/resources/database/firstNames.txt +8448 -0
- data/perl/HeaderParseService/resources/database/firstnames.bin +0 -0
- data/perl/HeaderParseService/resources/database/firstnames_spec.bin +0 -0
- data/perl/HeaderParseService/resources/database/intro.txt +2 -0
- data/perl/HeaderParseService/resources/database/keyword.txt +5 -0
- data/perl/HeaderParseService/resources/database/keywordTopWords +7 -0
- data/perl/HeaderParseService/resources/database/male-names +3906 -0
- data/perl/HeaderParseService/resources/database/middleNames.txt +2 -0
- data/perl/HeaderParseService/resources/database/month.txt +35 -0
- data/perl/HeaderParseService/resources/database/mul +868 -0
- data/perl/HeaderParseService/resources/database/mul.label +869 -0
- data/perl/HeaderParseService/resources/database/mul.label.old +869 -0
- data/perl/HeaderParseService/resources/database/mul.processed +762 -0
- data/perl/HeaderParseService/resources/database/mulAuthor +619 -0
- data/perl/HeaderParseService/resources/database/mulClassStat +45 -0
- data/perl/HeaderParseService/resources/database/nickname.txt +58 -0
- data/perl/HeaderParseService/resources/database/nicknames.bin +0 -0
- data/perl/HeaderParseService/resources/database/note.txt +121 -0
- data/perl/HeaderParseService/resources/database/page.txt +1 -0
- data/perl/HeaderParseService/resources/database/phone.txt +9 -0
- data/perl/HeaderParseService/resources/database/postcode.txt +54 -0
- data/perl/HeaderParseService/resources/database/pubnum.txt +45 -0
- data/perl/HeaderParseService/resources/database/statename.bin +0 -0
- data/perl/HeaderParseService/resources/database/statename.txt +73 -0
- data/perl/HeaderParseService/resources/database/states_and_abbreviations.txt +118 -0
- data/perl/HeaderParseService/resources/database/stopwords +438 -0
- data/perl/HeaderParseService/resources/database/stopwords.bin +0 -0
- data/perl/HeaderParseService/resources/database/surNames.txt +19613 -0
- data/perl/HeaderParseService/resources/database/surnames.bin +0 -0
- data/perl/HeaderParseService/resources/database/surnames_spec.bin +0 -0
- data/perl/HeaderParseService/resources/database/university_list/A.html +167 -0
- data/perl/HeaderParseService/resources/database/university_list/B.html +161 -0
- data/perl/HeaderParseService/resources/database/university_list/C.html +288 -0
- data/perl/HeaderParseService/resources/database/university_list/D.html +115 -0
- data/perl/HeaderParseService/resources/database/university_list/E.html +147 -0
- data/perl/HeaderParseService/resources/database/university_list/F.html +112 -0
- data/perl/HeaderParseService/resources/database/university_list/G.html +115 -0
- data/perl/HeaderParseService/resources/database/university_list/H.html +140 -0
- data/perl/HeaderParseService/resources/database/university_list/I.html +138 -0
- data/perl/HeaderParseService/resources/database/university_list/J.html +82 -0
- data/perl/HeaderParseService/resources/database/university_list/K.html +115 -0
- data/perl/HeaderParseService/resources/database/university_list/L.html +131 -0
- data/perl/HeaderParseService/resources/database/university_list/M.html +201 -0
- data/perl/HeaderParseService/resources/database/university_list/N.html +204 -0
- data/perl/HeaderParseService/resources/database/university_list/O.html +89 -0
- data/perl/HeaderParseService/resources/database/university_list/P.html +125 -0
- data/perl/HeaderParseService/resources/database/university_list/Q.html +49 -0
- data/perl/HeaderParseService/resources/database/university_list/R.html +126 -0
- data/perl/HeaderParseService/resources/database/university_list/S.html +296 -0
- data/perl/HeaderParseService/resources/database/university_list/T.html +156 -0
- data/perl/HeaderParseService/resources/database/university_list/U.html +800 -0
- data/perl/HeaderParseService/resources/database/university_list/V.html +75 -0
- data/perl/HeaderParseService/resources/database/university_list/W.html +144 -0
- data/perl/HeaderParseService/resources/database/university_list/WCSelect.gif +0 -0
- data/perl/HeaderParseService/resources/database/university_list/X.html +44 -0
- data/perl/HeaderParseService/resources/database/university_list/Y.html +53 -0
- data/perl/HeaderParseService/resources/database/university_list/Z.html +43 -0
- data/perl/HeaderParseService/resources/database/university_list/ae.html +31 -0
- data/perl/HeaderParseService/resources/database/university_list/am.html +30 -0
- data/perl/HeaderParseService/resources/database/university_list/ar.html +35 -0
- data/perl/HeaderParseService/resources/database/university_list/at.html +43 -0
- data/perl/HeaderParseService/resources/database/university_list/au.html +82 -0
- data/perl/HeaderParseService/resources/database/university_list/bd.html +28 -0
- data/perl/HeaderParseService/resources/database/university_list/be.html +41 -0
- data/perl/HeaderParseService/resources/database/university_list/bg.html +28 -0
- data/perl/HeaderParseService/resources/database/university_list/bh.html +28 -0
- data/perl/HeaderParseService/resources/database/university_list/blueribbon.gif +0 -0
- data/perl/HeaderParseService/resources/database/university_list/bm.html +28 -0
- data/perl/HeaderParseService/resources/database/university_list/bn.html +28 -0
- data/perl/HeaderParseService/resources/database/university_list/br.html +66 -0
- data/perl/HeaderParseService/resources/database/university_list/ca.html +174 -0
- data/perl/HeaderParseService/resources/database/university_list/ch.html +52 -0
- data/perl/HeaderParseService/resources/database/university_list/cl.html +40 -0
- data/perl/HeaderParseService/resources/database/university_list/cn.html +87 -0
- data/perl/HeaderParseService/resources/database/university_list/co.html +39 -0
- data/perl/HeaderParseService/resources/database/university_list/cr.html +34 -0
- data/perl/HeaderParseService/resources/database/university_list/cy.html +34 -0
- data/perl/HeaderParseService/resources/database/university_list/cz.html +44 -0
- data/perl/HeaderParseService/resources/database/university_list/de.html +128 -0
- data/perl/HeaderParseService/resources/database/university_list/dean-mainlink.jpg +0 -0
- data/perl/HeaderParseService/resources/database/university_list/dk.html +42 -0
- data/perl/HeaderParseService/resources/database/university_list/ec.html +31 -0
- data/perl/HeaderParseService/resources/database/university_list/ee.html +30 -0
- data/perl/HeaderParseService/resources/database/university_list/eg.html +29 -0
- data/perl/HeaderParseService/resources/database/university_list/es.html +68 -0
- data/perl/HeaderParseService/resources/database/university_list/et.html +28 -0
- data/perl/HeaderParseService/resources/database/university_list/faq.html +147 -0
- data/perl/HeaderParseService/resources/database/university_list/fi.html +49 -0
- data/perl/HeaderParseService/resources/database/university_list/fj.html +28 -0
- data/perl/HeaderParseService/resources/database/university_list/fo.html +28 -0
- data/perl/HeaderParseService/resources/database/university_list/fr.html +106 -0
- data/perl/HeaderParseService/resources/database/university_list/geog.html +150 -0
- data/perl/HeaderParseService/resources/database/university_list/gr.html +38 -0
- data/perl/HeaderParseService/resources/database/university_list/gu.html +28 -0
- data/perl/HeaderParseService/resources/database/university_list/hk.html +34 -0
- data/perl/HeaderParseService/resources/database/university_list/hr.html +28 -0
- data/perl/HeaderParseService/resources/database/university_list/hu.html +46 -0
- data/perl/HeaderParseService/resources/database/university_list/id.html +29 -0
- data/perl/HeaderParseService/resources/database/university_list/ie.html +49 -0
- data/perl/HeaderParseService/resources/database/university_list/il.html +35 -0
- data/perl/HeaderParseService/resources/database/university_list/in.html +109 -0
- data/perl/HeaderParseService/resources/database/university_list/is.html +32 -0
- data/perl/HeaderParseService/resources/database/university_list/it.html +75 -0
- data/perl/HeaderParseService/resources/database/university_list/jm.html +28 -0
- data/perl/HeaderParseService/resources/database/university_list/jo.html +28 -0
- data/perl/HeaderParseService/resources/database/university_list/jp.html +155 -0
- data/perl/HeaderParseService/resources/database/university_list/kaplan.gif +0 -0
- data/perl/HeaderParseService/resources/database/university_list/kr.html +65 -0
- data/perl/HeaderParseService/resources/database/university_list/kw.html +28 -0
- data/perl/HeaderParseService/resources/database/university_list/lb.html +28 -0
- data/perl/HeaderParseService/resources/database/university_list/linkbw2.gif +0 -0
- data/perl/HeaderParseService/resources/database/university_list/lk.html +30 -0
- data/perl/HeaderParseService/resources/database/university_list/lt.html +31 -0
- data/perl/HeaderParseService/resources/database/university_list/lu.html +34 -0
- data/perl/HeaderParseService/resources/database/university_list/lv.html +30 -0
- data/perl/HeaderParseService/resources/database/university_list/ma.html +28 -0
- data/perl/HeaderParseService/resources/database/university_list/maczynski.gif +0 -0
- data/perl/HeaderParseService/resources/database/university_list/mirror.tar +0 -0
- data/perl/HeaderParseService/resources/database/university_list/mk.html +29 -0
- data/perl/HeaderParseService/resources/database/university_list/mo.html +29 -0
- data/perl/HeaderParseService/resources/database/university_list/mseawdm.gif +0 -0
- data/perl/HeaderParseService/resources/database/university_list/mt.html +28 -0
- data/perl/HeaderParseService/resources/database/university_list/mx.html +68 -0
- data/perl/HeaderParseService/resources/database/university_list/my.html +39 -0
- data/perl/HeaderParseService/resources/database/university_list/ni.html +28 -0
- data/perl/HeaderParseService/resources/database/university_list/nl.html +51 -0
- data/perl/HeaderParseService/resources/database/university_list/no.html +56 -0
- data/perl/HeaderParseService/resources/database/university_list/nz.html +41 -0
- data/perl/HeaderParseService/resources/database/university_list/pa.html +31 -0
- data/perl/HeaderParseService/resources/database/university_list/pe.html +40 -0
- data/perl/HeaderParseService/resources/database/university_list/ph.html +41 -0
- data/perl/HeaderParseService/resources/database/university_list/pl.html +51 -0
- data/perl/HeaderParseService/resources/database/university_list/pointcom.gif +0 -0
- data/perl/HeaderParseService/resources/database/university_list/pr.html +31 -0
- data/perl/HeaderParseService/resources/database/university_list/ps.html +28 -0
- data/perl/HeaderParseService/resources/database/university_list/pt.html +45 -0
- data/perl/HeaderParseService/resources/database/university_list/recognition.html +69 -0
- data/perl/HeaderParseService/resources/database/university_list/results.html +71 -0
- data/perl/HeaderParseService/resources/database/university_list/ro.html +38 -0
- data/perl/HeaderParseService/resources/database/university_list/ru.html +48 -0
- data/perl/HeaderParseService/resources/database/university_list/sd.html +28 -0
- data/perl/HeaderParseService/resources/database/university_list/se.html +57 -0
- data/perl/HeaderParseService/resources/database/university_list/sg.html +33 -0
- data/perl/HeaderParseService/resources/database/university_list/si.html +30 -0
- data/perl/HeaderParseService/resources/database/university_list/sk.html +35 -0
- data/perl/HeaderParseService/resources/database/university_list/th.html +45 -0
- data/perl/HeaderParseService/resources/database/university_list/tr.html +44 -0
- data/perl/HeaderParseService/resources/database/university_list/tw.html +76 -0
- data/perl/HeaderParseService/resources/database/university_list/ua.html +29 -0
- data/perl/HeaderParseService/resources/database/university_list/uk.html +168 -0
- data/perl/HeaderParseService/resources/database/university_list/univ-full.html +3166 -0
- data/perl/HeaderParseService/resources/database/university_list/univ.html +122 -0
- data/perl/HeaderParseService/resources/database/university_list/uy.html +31 -0
- data/perl/HeaderParseService/resources/database/university_list/ve.html +34 -0
- data/perl/HeaderParseService/resources/database/university_list/yu.html +28 -0
- data/perl/HeaderParseService/resources/database/university_list/za.html +46 -0
- data/perl/HeaderParseService/resources/database/university_list/zm.html +28 -0
- data/perl/HeaderParseService/resources/database/university_list.txt +3025 -0
- data/perl/HeaderParseService/resources/database/url.txt +1 -0
- data/perl/HeaderParseService/resources/database/webTopWords +225 -0
- data/perl/HeaderParseService/resources/database/words +45402 -0
- data/perl/HeaderParseService/resources/models/10ContextModelfold1 +369 -0
- data/perl/HeaderParseService/resources/models/10Modelfold1 +376 -0
- data/perl/HeaderParseService/resources/models/11ContextModelfold1 +400 -0
- data/perl/HeaderParseService/resources/models/11Modelfold1 +526 -0
- data/perl/HeaderParseService/resources/models/12ContextModelfold1 +510 -0
- data/perl/HeaderParseService/resources/models/12Modelfold1 +423 -0
- data/perl/HeaderParseService/resources/models/13ContextModelfold1 +364 -0
- data/perl/HeaderParseService/resources/models/13Modelfold1 +677 -0
- data/perl/HeaderParseService/resources/models/14ContextModelfold1 +459 -0
- data/perl/HeaderParseService/resources/models/14Modelfold1 +325 -0
- data/perl/HeaderParseService/resources/models/15ContextModelfold1 +340 -0
- data/perl/HeaderParseService/resources/models/15Modelfold1 +390 -0
- data/perl/HeaderParseService/resources/models/1ContextModelfold1 +668 -0
- data/perl/HeaderParseService/resources/models/1Modelfold1 +1147 -0
- data/perl/HeaderParseService/resources/models/2ContextModelfold1 +755 -0
- data/perl/HeaderParseService/resources/models/2Modelfold1 +796 -0
- data/perl/HeaderParseService/resources/models/3ContextModelfold1 +1299 -0
- data/perl/HeaderParseService/resources/models/3Modelfold1 +1360 -0
- data/perl/HeaderParseService/resources/models/4ContextModelfold1 +1062 -0
- data/perl/HeaderParseService/resources/models/4Modelfold1 +993 -0
- data/perl/HeaderParseService/resources/models/5ContextModelfold1 +1339 -0
- data/perl/HeaderParseService/resources/models/5Modelfold1 +2098 -0
- data/perl/HeaderParseService/resources/models/6ContextModelfold1 +888 -0
- data/perl/HeaderParseService/resources/models/6Modelfold1 +620 -0
- data/perl/HeaderParseService/resources/models/7ContextModelfold1 +257 -0
- data/perl/HeaderParseService/resources/models/7Modelfold1 +228 -0
- data/perl/HeaderParseService/resources/models/8ContextModelfold1 +677 -0
- data/perl/HeaderParseService/resources/models/8Modelfold1 +1871 -0
- data/perl/HeaderParseService/resources/models/9ContextModelfold1 +198 -0
- data/perl/HeaderParseService/resources/models/9Modelfold1 +170 -0
- data/perl/HeaderParseService/resources/models/NameSpaceModel +181 -0
- data/perl/HeaderParseService/resources/models/NameSpaceTrainF +347 -0
- data/perl/HeaderParseService/resources/models/WrapperBaseFeaDict +13460 -0
- data/perl/HeaderParseService/resources/models/WrapperContextFeaDict +14045 -0
- data/perl/HeaderParseService/resources/models/WrapperSpaceAuthorFeaDict +510 -0
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test1 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test10 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test11 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test12 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test13 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test14 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test15 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test2 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test3 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test4 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test5 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test6 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test7 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test8 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test9 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test1 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test10 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test11 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test12 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test13 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test14 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test15 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test2 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test3 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test4 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test5 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test6 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test7 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test8 +23 -0
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test9 +23 -0
- data/perl/ParsCit/README.TXT +82 -0
- data/perl/ParsCit/crfpp/traindata/parsCit.template +60 -0
- data/perl/ParsCit/crfpp/traindata/parsCit.train.data +12104 -0
- data/perl/ParsCit/crfpp/traindata/tagged_references.txt +500 -0
- data/perl/ParsCit/lib/CSXUtil/SafeText.pm +140 -0
- data/perl/ParsCit/lib/ParsCit/Citation.pm +462 -0
- data/perl/ParsCit/lib/ParsCit/CitationContext.pm +132 -0
- data/perl/ParsCit/lib/ParsCit/Config.pm +46 -0
- data/perl/ParsCit/lib/ParsCit/Controller.pm +306 -0
- data/perl/ParsCit/lib/ParsCit/PostProcess.pm +367 -0
- data/perl/ParsCit/lib/ParsCit/PreProcess.pm +333 -0
- data/perl/ParsCit/lib/ParsCit/Tr2crfpp.pm +331 -0
- data/perl/ParsCit/resources/parsCit.model +0 -0
- data/perl/ParsCit/resources/parsCitDict.txt +148783 -0
- data/perl/extract.pl +199 -0
- data/spec/biblicit/cb2bib_spec.rb +48 -0
- data/spec/biblicit/citeseer_spec.rb +40 -0
- data/spec/fixtures/pdf/10.1.1.109.4049.pdf +0 -0
- data/spec/fixtures/pdf/Bagnoli Watts TAR 2010.pdf +0 -0
- data/spec/fixtures/pdf/ICINCO_2010.pdf +0 -0
- data/spec/spec_helper.rb +3 -0
- metadata +474 -0
@@ -0,0 +1,61 @@
+#
+# Copyright 2007 Penn State University
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+package FileConverter::JODConverter;
+#
+# Wrapper to execute the JODConverter command-line tool for converting
+# doc, rtf to PDF files.
+#
+# Juan Pablo Fernandez Ramirez, 10/05/07
+#
+use strict;
+use FileConverter::Config;
+use FileConverter::Utils;
+
+my $JODConverterLoc = $FileConverter::Config::JODConverterPath;
+
+##
+# Execute the JODConverter utility.
+##
+sub convertFile {
+    my ($filePath, $rTrace, $rCheckSums) = @_;
+    my ($status, $msg) = (1, "");
+
+    if (FileConverter::Utils::checkProcess("soffice") == 0) {
+        return (0, "Open Office Service is not running");
+    }
+
+    my $pdfFilePath = FileConverter::Utils::changeExtension($filePath, "pdf");
+    my @commandArgs = ("java", "-jar", $JODConverterLoc, $filePath,
+                       $pdfFilePath);
+    system(@commandArgs);
+
+    if ($? == -1) {
+        return (0, "Failed to execute JODConverter: $!");
+    } elsif ($? & 127) {
+        return (0, "Java died with signal ".($? & 127));
+    }
+
+    my $code = $?>>8;
+    if ($code == 0) {
+        push @$rTrace, "JODConverter";
+
+        my $sha1 = FileConverter::CheckSum->new();
+        $sha1->digest($filePath);
+        push @$rCheckSums, $sha1;
+
+        return ($status, $msg, $pdfFilePath, $rTrace, $rCheckSums);
+    } else {
+        return (0, "Error executing JODConverter (code $code): $!");
+    }
+} # convertFile
+1;
@@ -0,0 +1,69 @@
+#
+# Copyright 2007 Penn State University
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+package FileConverter::PDFBox;
+#
+# Wrapper to call the PDFBox ExtractText command-line tool
+# for extracting text from PDF files. It's recommended to
+# use TET instead, if TET is available.
+#
+# Isaac Councill, 09/06/07
+#
+use strict;
+use FileConverter::Config;
+use FileConverter::Utils;
+
+my $PDFBoxLoc = $FileConverter::Config::PDFBoxLocation;
+
+##
+# Execute the PDFBox utility.
+##
+sub extractText {
+    my ($filePath, $rTrace, $rCheckSums) = @_;
+    my ($status, $msg) = (1, "");
+
+    if (FileConverter::Utils::checkExtension($filePath, "pdf") <= 0) {
+        return (0, "Unexpected file extension at ".
+                __FILE__." line ".__LINE__);
+    }
+
+    my $textFilePath =
+        FileConverter::Utils::changeExtension($filePath, "txt");
+    my @commandArgs = ("java", "-jar", $PDFBoxLoc,
+                       "ExtractText", "-encoding",
+                       "utf8", $filePath, $textFilePath);
+
+    system(@commandArgs);
+
+    if ($? == -1) {
+        return (0, "Failed to execute PDFBox: $!");
+    } elsif ($? & 127) {
+        return (0, "Java died with signal ".($? & 127));
+    }
+
+    my $code = $?>>8;
+    if ($code == 0) {
+        push @$rTrace, "PDFBox";
+
+        my $sha1 = FileConverter::CheckSum->new();
+        $sha1->digest($filePath);
+        push @$rCheckSums, $sha1;
+
+        return ($status, $msg, $textFilePath, $rTrace, $rCheckSums);
+    } else {
+        return (0, "Error executing PDFBox (code $code): $!");
+    }
+
+} # extractText
+
+
+1;
@@ -0,0 +1,69 @@
+#
+# Copyright 2007 Penn State University
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+package FileConverter::PSConverter;
+#
+# Wrapper to execute the ps2pdf command-line tool for converting
+# ps to PDF files.
+#
+# Juan Pablo Fernandez Ramirez, 10/08/07
+#
+use strict;
+use FileConverter::Config;
+use FileConverter::Utils;
+
+my $timeout = 20;
+
+##
+# Execute the converter utility.
+##
+sub convertFile {
+    my ($filePath, $rTrace, $rCheckSums) = @_;
+    my ($status, $msg) = (1, "");
+
+    my $pdfFilePath = FileConverter::Utils::changeExtension($filePath, "pdf");
+    my @commandArgs = ("ps2pdf13", $filePath, $pdfFilePath);
+    my $child;
+    eval {
+        local $SIG{'ALRM'} = sub { die "alarm\n" };
+        alarm $timeout;
+        $child = system(@commandArgs);
+        alarm 0;
+    };
+
+    if ($@) {
+        if ($@ eq "alarm\n") {
+            if (defined $child) { kill 9, $child; }
+            return (0, "ps2pdf timeout");
+        }
+    }
+
+    if ($? == -1) {
+        return (0, "Failed to execute ps2pdf: $!");
+    } elsif ($? & 127) {
+        return (0, "ps2pdf died with signal ".($? & 127));
+    }
+
+    my $code = $?>>8;
+    if ($code == 0) {
+        push @$rTrace, "ps2pdf";
+
+        my $sha1 = FileConverter::CheckSum->new();
+        $sha1->digest($filePath);
+        push @$rCheckSums, $sha1;
+
+        return ($status, $msg, $pdfFilePath, $rTrace, $rCheckSums);
+    } else {
+        return (0, "Error executing ps2pdf (code $code): $!");
+    }
+} # convertFile
+1;
@@ -0,0 +1,88 @@
+#
+# Copyright 2007 Penn State University
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+package FileConverter::PSToText;
+#
+# Wrapper to execute the ps2ascii command-line tool for converting
+# ps to text files.
+#
+# Isaac, 10/08/07
+#
+use strict;
+use FileConverter::Config;
+use FileConverter::Utils;
+use Encode;
+
+my $timeout = 20;
+
+##
+# Execute the converter utility.
+##
+sub extractText {
+    my ($filePath, $rTrace, $rCheckSums) = @_;
+    my ($status, $msg) = (1, "");
+
+    my $txtFilePath = FileConverter::Utils::changeExtension($filePath, "txt");
+
+    my @commandArgs = ("ps2ascii", $filePath, $txtFilePath);
+    my $child;
+    eval {
+        local $SIG{'ALRM'} = sub { die "alarm\n" };
+        alarm $timeout;
+        $child = system(@commandArgs);
+        alarm 0;
+    };
+
+    if ($@) {
+        if ($@ eq "alarm\n") {
+            if (defined $child) { kill 9, $child; }
+            return (0, "ps2ascii timeout");
+        }
+    }
+
+    if ($? == -1) {
+        return (0, "Failed to execute ps2ascii: $!");
+    } elsif ($? & 127) {
+        return (0, "ps2ascii died with signal ".($? & 127));
+    }
+
+    my $code = $?>>8;
+    if ($code == 0) {
+        push @$rTrace, "ps2ascii";
+        ascii2utf8($txtFilePath);
+
+        my $sha1 = FileConverter::CheckSum->new();
+        $sha1->digest($filePath);
+        push @$rCheckSums, $sha1;
+
+        return ($status, $msg, $txtFilePath, $rTrace, $rCheckSums);
+    } else {
+        return (0, "Error executing ps2ascii (code $code): $!");
+    }
+} # extractText
+
+sub ascii2utf8 {
+    my $fn = shift;
+
+    open(IN, "<$fn") or die $!;
+    my $text;
+    {
+        local $/ = undef;
+        $text = <IN>;
+    }
+    close IN;
+    $text = Encode::decode_utf8($text);
+    open(OUT, ">:utf8", $fn) or die $!;
+    print OUT $text;
+    close OUT;
+}
+1;
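The two wrappers above guard an external command with alarm() around system(). Because system() returns the command's wait status rather than a process ID, and does not return at all when the alarm handler dies first, the "kill 9, $child" branch has no PID to act on. The following is a minimal sketch of one alternative, assuming a fork/exec approach; the helper name run_with_timeout and its return messages are hypothetical and are not part of FileConverter.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper (an assumption, not FileConverter code): run @cmd with
# a timeout in seconds and return (status, message) in the wrappers' style.
sub run_with_timeout {
    my ($timeout, @cmd) = @_;

    my $pid = fork();
    return (0, "fork failed: $!") unless defined $pid;

    if ($pid == 0) {
        # Child: replace this process with the command, bypassing the shell.
        exec { $cmd[0] } @cmd or POSIX::_exit(127);
    }

    # Parent: wait for the child, but give up once the alarm fires.
    eval {
        local $SIG{ALRM} = sub { die "alarm\n" };
        alarm $timeout;
        waitpid($pid, 0);
        alarm 0;
    };

    if ($@) {
        die $@ unless $@ eq "alarm\n";   # rethrow unrelated errors
        kill 'KILL', $pid;               # the PID is known here
        waitpid($pid, 0);                # reap the killed child
        return (0, "timeout");
    }

    return (0, "died with signal " . ($? & 127)) if $? & 127;

    my $code = $? >> 8;
    return $code == 0 ? (1, "") : (0, "exit code $code");
}
```

A call such as run_with_timeout(20, "ps2ascii", $filePath, $txtFilePath) would then stand in for the eval block; whether that fits the surrounding service is a judgment for the maintainers.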
@@ -0,0 +1,68 @@
+#
+# Copyright 2007 Penn State University
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+package FileConverter::Prescript;
+#
+# Wrapper to execute the Prescript command-line tool for extracting
+# text from PS files.
+#
+# Juan Pablo Fernandez R., 10/31/07
+#
+use strict;
+use FileConverter::Config;
+use FileConverter::Utils;
+use FileConverter::CheckSum;
+
+my $PrescriptPath = $FileConverter::Config::PrescriptPath;
+
+##
+# Execute the Prescript utility.
+##
+sub extractText {
+    my ($filePath, $rTrace, $rCheckSums) = @_;
+    my ($status, $msg) = (1, "");
+
+    if (FileConverter::Utils::checkExtension($filePath, "ps") <= 0) {
+        return (0, "Unexpected file extension at ". __FILE__." line ".__LINE__);
+    }
+
+    my $textFilePath = FileConverter::Utils::changeExtension($filePath, "txt");
+    my @commandArgs = ($PrescriptPath, "plain", $filePath, $textFilePath);
+
+    system(@commandArgs);
+
+    if ($? == -1) {
+        return (0, "Failed to execute Prescript: $!");
+    } elsif ($? & 127) {
+        return (0, "Prescript died with signal ".($? & 127));
+    }
+
+    my $code = $?>>8;
+    if (($code == 0) || ($code == 1)) {
+        if ($code == 1) {
+            print STDERR "Prescript completed with errors: $filePath\n";
+        }
+
+        push @$rTrace, "PSLIB Prescript";
+
+        my $sha1 = new FileConverter::CheckSum();
+        $sha1->digest($filePath);
+        push @$rCheckSums, $sha1;
+
+        return ($status, $msg, $textFilePath, $rTrace, $rCheckSums);
+
+    } else {
+        return (0, "Error executing Prescript (code $code): $!");
+    }
+} # extractText
+
+1;
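Each wrapper decodes $? the same way after system(): -1 means the command could not be started, the low seven bits carry a terminating signal, and the high byte carries the exit code. As an illustration only, the sketch below gathers that logic into one function; describe_status is a hypothetical name, not something the package provides.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not FileConverter code): describe a raw
# wait status as found in $? after system().
sub describe_status {
    my ($status) = @_;

    # -1 means the command could not be started at all.
    return "failed to execute" if $status == -1;

    # The low seven bits hold the number of the signal that ended the process.
    return "died with signal " . ($status & 127) if $status & 127;

    # Otherwise the exit code sits in the high byte.
    return "exited with code " . ($status >> 8);
}
```

For instance, a raw status of 256 corresponds to exit code 1, which the Prescript wrapper treats as "completed with errors" rather than as a failure.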
@@ -0,0 +1,75 @@
+#
+# Copyright 2007 Penn State University
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+package FileConverter::TET;
+#
+# Wrapper to execute the TET command-line tool for extracting
+# text from PDF files.
+#
+# Isaac Councill, 09/06/07
+#
+use strict;
+use FileConverter::Config;
+use FileConverter::Utils;
+use FileConverter::CheckSum;
+
+my $TETPath = $FileConverter::Config::TETPath;
+my $TETLicensePath = $FileConverter::Config::TETLicensePath;
+
+$ENV{'PDFLIBLICENSEFILE'} = $TETLicensePath;
+
+##
+# Execute the TET utility.
+##
+sub extractText {
+    my ($filePath, $rTrace, $rCheckSums) = @_;
+    my ($status, $msg) = (1, "");
+
+    if (FileConverter::Utils::checkExtension($filePath, "pdf") <= 0) {
+        return (0, "Unexpected file extension at ".
+                __FILE__." line ".__LINE__);
+    }
+
+    my $textFilePath =
+        FileConverter::Utils::changeExtension($filePath, "txt");
+    my @commandArgs = ($TETPath, "-o", $textFilePath, $filePath);
+
+    system(@commandArgs);
+
+    if ($? == -1) {
+        return (0, "Failed to execute TET: $!");
+    } elsif ($? & 127) {
+        return (0, "TET died with signal ".($? & 127));
+    }
+
+    my $code = $?>>8;
+    if (($code == 0) || ($code == 1)) {
+        if ($code == 1) {
+            print STDERR "TET completed with errors: $filePath\n";
+        }
+
+        push @$rTrace, "PDFLib TET";
+
+        my $sha1 = new FileConverter::CheckSum();
+        $sha1->digest($filePath);
+        push @$rCheckSums, $sha1;
+
+        return ($status, $msg, $textFilePath, $rTrace, $rCheckSums);
+
+    } else {
+        return (0, "Error executing TET (code $code): $!");
+    }
+
+} # extractText
+
+
+1;
@@ -0,0 +1,130 @@
+#
+# Copyright 2007 Penn State University
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+package FileConverter::Utils;
+#
+# Container for subroutines that may be shared across multiple
+# FileConverter modules.
+#
+# Isaac Councill, 09/06/07
+#
+use strict;
+use Encode;
+
+##
+# Returns the file extension of a file name, if there is one.
+##
+sub getExtension {
+    my ($fn) = @_;
+    if ($fn =~ m/^.*\.(.*)$/) {
+        return $1;
+    }
+    return undef;
+
+} # getExtension
+
+
+##
+# Strips off the last extension of the file name.
+##
+sub stripExtension {
+    my ($fn) = @_;
+    $fn =~ s/^(.*)\..*$/$1/;
+    return $fn;
+
+} # stripExtension
+
+
+##
+##
+# Routine for checking that a filename ends with an expected
+# extension. Returns 1 if it does, 0 if not.
+##
+sub checkExtension {
+    my ($fn, $ext) = @_;
+    if ($fn =~ m/^.*\.(.*)$/) {
+        if ($1 =~ m/$ext/i) {
+            return 1;
+        }
+    }
+    return 0;
+
+} # checkExtension
+
+
+##
+# Simple routine for changing the extension of a file.
+# Example: $newFileName = changeExtension($oldFileName, "txt");
+##
+sub changeExtension {
+    my ($fn, $ext) = @_;
+    unless ($fn =~ s/^(.*)\..*$/$1\.$ext/) {
+        $fn .= ".$ext";
+    }
+    return $fn;
+
+} # changeExtension
+
+
+##
+# Returns the directory part of a file path.
+##
+sub getDirectory {
+    my ($filePath) = @_;
+    if ($filePath =~ m/^(.*)\/.*$/) {
+        return $1;
+    } else {
+        return $filePath;
+    }
+
+} # getDirectory
+
+##
+##
+# Routine for checking if a process is running or not.
+# Returns 1 if it is running, 0 if not.
+##
+sub checkProcess {
+    my ($process) = @_;
+    my $cmd = "ps -ef | grep " . $process . " | grep -v grep";
+    my $result = `$cmd`;
+    if ($result eq '') {
+        return 0;
+    }
+    else {
+        return 1;
+    }
+} # checkProcess
+
+
+##
+# Convert a file of the specified encoding to UTF-8
+##
+sub convertToUTF8 {
+    my ($fn, $encoding) = @_;
+    my $octets;
+    open (FILE, "<$fn") or die "could not open file $fn: $!";
+    binmode FILE, ":bytes";
+    {
+        local $/ = undef;
+        $octets = <FILE>;
+    }
+    close FILE;
+
+    Encode::from_to($octets, $encoding, "utf8");
+    open (FILE, ">:utf8", "$fn") or die "could not open file $fn: $!";
+    print FILE Encode::decode_utf8($octets);
+    close FILE;
+
+}
+
+1;
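checkExtension above matches the expected extension anywhere inside the captured suffix, so a name such as "figure.eps" passes a check for "ps". Where an exact match is wanted, a stricter test could look like the sketch below; has_extension is a hypothetical name used for illustration and is not part of FileConverter::Utils.

```perl
use strict;
use warnings;

# Hypothetical stricter variant (an assumption, not FileConverter code):
# returns 1 only when the final extension equals $ext, ignoring case.
sub has_extension {
    my ($fn, $ext) = @_;

    # \Q...\E quotes regex metacharacters in $ext; the trailing $ anchors the
    # match to the end of the name, and the literal dot requires the extension
    # to start right after the last dot.
    return ($fn =~ m/\.\Q$ext\E$/i) ? 1 : 0;
}
```

With this variant, "paper.PS" is accepted for "ps", while "figure.eps" and "paper.ps.gz" are rejected.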
@@ -0,0 +1,80 @@
+SVMHeaderParse README
+IGC
+
+SVMHeaderParse is a utility for extracting header information
+from research papers based on Support Vector Machines and
+heuristic regularization. For details on the algorithm,
+please see the following paper:
+
+Hui Han, C. Lee Giles, Eren Manavoglu, Hongyuan Zha,
+Zhenyue Zhang, Edward A. Fox. "Automatic Document Metadata
+Extraction Using Support Vector Machines", in Proceedings
+of ACM/IEEE Joint Conference on Digital Libraries
+(JCDL 2003): 37-48, 2003.
+
+Installation:
+
+In order to use the SVMHeaderParse web service you will need
+the following modules in your perl library:
+
+Log::Log4perl
+Log::Dispatch
+
+Edit lib/HeaderParse/Config/API_Config.pm to provide values
+appropriate for your environment. Also edit wsdl/SVMHeaderParse.wsdl
+to reflect any changes to your service URL.
+
+Command Line Usage:
+
+Command line utilities are provided for extracting header information
+from text documents. By default, text files are expected to be encoded
+in UTF-8, but the expected encoding can be adjusted using perl
+command line switches. To run SVMHeaderParse on a single document,
+execute the following command:
+
+extractHeader.pl textfile [outfile]
+
+If "outfile" is specified, the XML output will be written to that
+file; otherwise, the XML will be printed to STDOUT.
+
+There is also a web service interface available, using the
+SOAP::Lite perl module. To start the service, just execute:
+
+headerparse-service.pl
+
+By default, the service will start on port 40000, but this is
+configurable in the HeaderParse::Config::API_Config library module.
+A WSDL file is provided with the distribution that outlines the
+message details expected by the SVMHeaderParse service. If the
+service port is changed, the WSDL file must also be modified to
+reflect that change. Expected parameters in the input message
+are "filePath" (a path to the text file to parse) and "repositoryID".
+The SVMHeaderParse service is designed for deployment in an
+environment where text files may be located on file systems mounted
+from arbitrary machines on the network. Thus, "repositoryID" provides
+a means to map a given shared file system to its mount point.
+Repository mappings are configurable in the API_Config module. The
+"filePath" parameter provides a path to the text file relative to
+the repository mount point. The local file system may be specified
+using the reserved repository ID "LOCAL". In that case, an absolute
+path to the text file may be specified.
+
+A perl client is also provided that demonstrates how to use the
+service. Execute the client with the following command:
+
+headerparse-client.pl filePath repositoryID
+
+If the call is successful, the XML output will be printed to STDOUT.
+
+API:
+
+The SVMHeaderParse libraries may be used directly from external perl
+applications. The interface module is HeaderParse::API::Parser. If
+XML output is desired, use the
+
+HeaderParse::API::Parser::_parseHeader($filePath, $jobID)
+
+subroutine. If the SVMHeaderParse library is used from external
+Perl applications, remember to use the "-CSD" perl option for
+global unicode stream support (or otherwise handle encoding) or
+risk string corruption.
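The README's closing note allows callers to "otherwise handle encoding" instead of running perl with "-CSD". One way to do that, shown here only as a sketch, is to put an explicit encoding layer on each filehandle. The helper name read_text and its default encoding are assumptions made for illustration and are not part of SVMHeaderParse.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not SVMHeaderParse code): read a whole
# text file into a character string, decoding it from $encoding.
sub read_text {
    my ($path, $encoding) = @_;
    $encoding ||= 'UTF-8';   # assumed default, matching the README's default

    # The :encoding() layer decodes bytes to characters while reading.
    open(my $fh, "<:encoding($encoding)", $path)
        or die "could not open file $path: $!";

    local $/ = undef;        # slurp mode: read the whole file at once
    my $text = <$fh>;
    close $fh;

    return $text;
}
```

Output streams can be treated the same way, for example with binmode(STDOUT, ":encoding(UTF-8)") before printing decoded text.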