biblicit 2.0.5 → 2.0.6

@@ -0,0 +1,692 @@
1
+ <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 3.0//EN">
2
+
3
+ <HTML><HEAD>
4
+ <META HTTP-EQUIV="Keywords" NAME="Keywords" CONTENT="citations, citation parser, references, reference parser, bibliography parser, reference string parser, bibtex, citation, citation extraction, logical structure, document logical structure">
5
+ <LINK REL="stylesheet" HREF="parsCit.css"><TITLE>ParsCit: An open-source CRF Reference String and Logical Document Structure Parsing Package</TITLE>
6
+ <script type="text/javascript">
7
+ function toggleLayer( whichLayer ) {
8
+ var elem, vis;
9
+ if( document.getElementById ) // this is the way the standards work
10
+ elem = document.getElementById( whichLayer );
11
+ else if( document.all ) // this is the way old msie versions work
12
+ elem = document.all[whichLayer];
13
+ else if( document.layers ) // this is the way nn4 works
14
+ elem = document.layers[whichLayer];
15
+ vis = elem.style;
16
+ // if the style.display value is blank we try to figure it out here
17
+ if(vis.display==''&&elem.offsetWidth!=undefined&&elem.offsetHeight!=undefined)
18
+ vis.display = (elem.offsetWidth!=0&&elem.offsetHeight!=0)?'block':'none';
19
+ vis.display = (vis.display==''||vis.display=='block')?'none':'block';
20
+ }
21
+ </script>
22
+ </HEAD><BODY BGCOLOR="#FFFFFF">
23
+
24
+ <div id="leftcontent">
25
+ [&nbsp;<A href="http://wing.comp.nus.edu.sg/">WING homepage</A>&nbsp;]<br/>
26
+ [&nbsp;<A href="http://wing.comp.nus.edu.sg/portal/web-services.html">WING web services</A>&nbsp;]<br/>
27
+ <BR/>
28
+ [&nbsp;<A href="#d">Download</a>&nbsp;]<br/>
29
+ [&nbsp;<A href="#ws">Web Service</a>&nbsp;]<br/>
30
+ [&nbsp;<A href="#wd">Web Demo</a>&nbsp;]<br/>
31
+ [&nbsp;<A href="#p">Publications</a>&nbsp;]<br/>
32
+ [&nbsp;<A href="#gsiso">Input and Output</a>&nbsp;]<br/>
33
+ [&nbsp;<A href="#gm">Group Members</a>&nbsp;]<br/>
34
+ [&nbsp;<A href="#faq">FAQ</a>&nbsp;]<br/>
35
+ [&nbsp;<A href="#t">Troubleshooting</a>&nbsp;]<br/>
36
+ <br/>
37
+ <script type="text/javascript"
38
+ src="http://feedjit.com/serve/?vv=932&tft=3&dd=0&wid=0804fba767d140cd&pid=0&proid=0&bc=FFFFFF&tc=000000&brd1=012B6B&lnk=135D9E&hc=FFFFFF&hfc=2853A8&btn=C99700&ww=200&wne=10&wh=WING+Live+Traffic+Feed&hl=0&hlnks=0&hfce=0&srefs=1&hbars=0"></script><noscript><a
39
+ href="http://feedjit.com/">Feedjit Live Blog Stats</a></noscript>
40
+ </div>
41
+
42
+ <div id="centercontent">
43
+ <IMG ALIGN="LEFT" SRC="parsCit.png" WIDTH="200px" ALT="Picture of ParsCit Swami">
44
+
45
+ <CENTER><H1>ParsCit: An open-source CRF Reference String and Logical Document Structure Parsing Package</H1></CENTER>
46
+
47
+ <P>This is the home page of the ParsCit project, which performs two
48
+ tasks: 1) reference string parsing, sometimes also called citation
49
+ parsing or citation extraction, and 2) logical structure parsing of
50
+ scientific documents. It is architected as a supervised machine
51
+ learning procedure that uses Conditional Random Fields as its learning
52
+ mechanism. You can download the code below, parse strings online, or
53
+ send batch jobs to our web service. The code contains the
54
+ training data, feature generator and shell scripts to connect the
55
+ system to a web service (used on this web site).</P>
56
+
57
+ <P>Some definitions (thanks to Robert Dale for Citations and Reference
58
+ Strings):
59
+
60
+ <DL>
61
+
62
+ <DT>Reference String:</DT><DD>A text string in the bibliography or
63
+ reference section of a work, usually at the end of the document that
64
+ refers to a unique document. Usually occurs with other reference
65
+ strings that point to other documents. May also appear as
66
+ footnotes.</DD>
67
+
68
+ <DT>Citation:</DT><DD>A text string (usually explicit) in the
69
+ document body that points to a corresponding reference string at the
70
+ end of the document. Several citations may co-refer to a single
71
+ reference string.</DD>
72
+
73
+ <DT>Document Logical Structure:</DT><DD> A hierarchy of logical
74
+ components, for example, titles, authors, affiliations, abstracts,
75
+ sections, etc., according to (Mao, Rosenfeld &amp;
76
+ Kanungo, 2003). Our logical structure is more comprehensive,
77
+ comprising not only header metadata and references, but also the
78
+ logical structure of the internals of the document -- sections,
79
+ subsections, figures, tables, equations, footnotes and
80
+ captions. </DD>
81
+
82
+ </DL>
83
+
84
+ <P>This project deals with the problem of parsing the reference
85
+ strings and parsing the logical structure of a document. The first
86
+ task is handled by a module with the project namesake, ParsCit, and
87
+ the second task by a separate module SectLabel.
88
+ </P>
89
+
90
+ <br clear="all"/>
91
+ <!-- License ---------------------------------------------------------------------- -->
92
+ <A NAME="l"></A><H2>License</H2>
93
+
94
+ <P>This software is licensed under the <A
95
+ HREF="http://www.gnu.org/copyleft/lesser.html">GNU Lesser General Public
95
+ License</A> (LGPL), which means you are free to use it for any
97
+ purpose, including embedding in commercial products. </P>
98
+
99
+ <br clear="all" />
100
+ <!-- Download ---------------------------------------------------------------------- -->
101
+ <A NAME="d"></A><H2>Download</H2>
102
+
103
+ <P>You can download the open-source code for ParsCit here. The source requires you to re-compile the CRFPP source code
104
+ and assumes that perl is installed on your system and can be invoked
105
+ using <CODE>perl</CODE> (must be in your path).
106
+ </P>
107
+
108
+ <ul>
109
+
110
+ <li> Current version <A HREF="parscit-110505b.zip">110505b</A>: Added XML::Twig for XML processing. ParsCit now uses input provided by SectLabel. See <A HREF="CHANGELOG.txt"> CHANGELOG.txt </A>.<BR/>
111
+ The (partially ported) <A HREF="parscit-110505b-win.zip">Windows</A> version is here (provided by Yumichika). See the <A HREF="CHANGES%20FOR%20WINDOWS.txt">CHANGES FOR WINDOWS.txt</A>
112
+ <BR/>
113
+ <BR/>
114
+ We have also pushed a copy of the ParsCit current distribution into <A HREF="http://www.github.com/knmnyn/parscit">GitHub:knmnyn/parscit</A>.
115
+ The Windows version has also been pushed to <A HREF="http://www.github.com/wing-nus/parscit">GitHub:wing-nus/parscit</A>.
116
+
117
+ While we'll strive to keep the GitHub version as updated as possible, the versions on this page will remain the most authoritative for major updates.
118
+ <BR/>
119
+ <li> Other versions: <BR/>
120
+ <A HREF="parscit-101101.zip">101101</A>: Incorporated <A HREF="http://github.com/mromanello/BiblioScript">BiblioScript</A> and <A HREF=http://www.scripps.edu/~cdputnam/software/bibutils>BibUtils</A> software. See CHANGELOG.txt; <BR/>
121
+ <A HREF="parscit-100401.zip">100401d</A>: Added SectLabel (logical structure parsing) software from the NUS team, and Iconip training data from Cheong Chi Hong for ParsCit with new ParsCit model retrained. See CHANGELOG.txt; <BR/>
122
+ <A HREF="parscit-090625.zip">090625b</A>: Added documentation for complete re-installation. Improved ParsHed with an added line-level CRF model and a post-processing module by the NUS team; added a WSDL file and client for the service at NUS; minor bug fixes for ParsCit. See CHANGELOG.txt; <BR/>
123
+ <A HREF="parscit-090316.zip">090316</A>: Incorporation of ParsHed (header parsing) software from the NUS team. See CHANGELOG.txt; <BR/>
124
+ <A HREF="parscit-081201.zip">081201</A>: Bug fixes and incorporation of byte position offset from the Scienstein.org team. See CHANGELOG.txt; <BR/>
125
+ <A HREF="parscit-080917.zip">080917</A>: Minor changes (improved models and multilingual support), see CHANGELOG.txt; <BR/>
126
+ <A HREF="parscit-080402.zip">080402</A>: First public release. Comes with precompiled linux binaries for CRF++; <BR/>
127
+ <A HREF="parscit-080310.tgz">080310</A>: Beta release.
128
+
129
+ <li><A HREF="http://crfpp.sourceforge.net">CRF++</A>: A conditional random fields toolkit that you may need to install, if the compiled one does not work for you. We recommend that you use version 0.51. </ul>
130
+
131
+ <!-- Web Service ---------------------------------------------------------------------- -->
132
+ <A name="ws"></a><H2>Web Service</h2>
133
+
134
+ <P>More NLP services are now being made available on the web.
135
+ Following this trend, you can send your plain text citations to us via
136
+ our web service. We will parse these for you free of charge (as and
137
+ when time and processing power allows, these processes are done with
138
+ lower priority).</P>
139
+
140
+ <P CLASS="red">N.B. We keep logs of what's parsed in these demos, to
141
+ improve the accuracy and productivity of ParsCit. If you'd like these
142
+ to be kept private or you find you use this service a lot, why not
143
+ install a local copy of ParsCit for yourself? If you do, please
144
+ let us know where you are so we can acknowledge you here and re-direct
145
+ some traffic your way.
146
+ </P>
147
+
148
+ <UL>
149
+ <LI> <A HREF="wing.nus.wsdl">Download the WSDL file</A> for the service at NUS.
150
+ <LI> <A HREF="ParsCitClientWSDL.rb">Download the sample ruby client
151
+ that uses the WSDL file</A> to dynamically generate the ParsCit web
152
+ service call to the NUS server. Edit the file to see how to
153
+ execute it.
154
+ <LI> <A HREF="ParsCitClient.rb">Download sample ruby client code</a>
155
+ for the ParsCit web service at the NUS server. To execute,
156
+ just point it at a local
157
+ text file that represents the text dump of a scholarly article
158
+ (such as one produced by a PDF to text converter):
159
+ <CODE>
160
+ ./ParsCitClient.rb ~/public_html/samples/E06-1050.txt
161
+ </CODE>
162
+ <LI><FORM METHOD="post" ACTION="parsCit.cgi"><INPUT TYPE="HIDDEN"
163
+ NAME="ping" VALUE="ping"><INPUT TYPE="SUBMIT" VALUE="Check"> whether
164
+ the web service is up.
165
+ </FORM>
166
+ </UL>
167
+
168
+ <!-- Web demo ----------------------------------------------------------------------- -->
169
+ <A name="wd"></a><H2>Web-based Demonstration</H2>
170
+
171
+ <P CLASS="red">N.B.: We keep logs of what's parsed in these demos, to
172
+ improve the accuracy and productivity of ParsCit. If you'd like these
173
+ to be kept private, why not install a local copy of ParsCit for
174
+ yourself?</P>
175
+
176
+ <P>You can also run ParsCit directly in your browser. The form below
177
+ submits your text input (after suitable cleaning) to the ParsCit
178
+ service to parse the input file or strings. <FONT COLOR="red">
179
+ Note that if system load gets high, your demo call may not be executed. If you want to run this program in batch, please download your own copy.</FONT>
180
+ </P>
181
+
182
+ <P><B>Demo #1: Parsing the header, logical structure and/or reference strings (and citation contexts) from a text file</B></P>
183
+
184
+ <DIV STYLE="background-color:#D0D0FF; padding: 1em">
185
+ <FORM ENCTYPE="multipart/form-data" METHOD="post" ACTION="parsCit.cgi">
186
+ <P>NB - this demo does not handle PDF input at this time. You can use another web service or software to convert PDFs to text. </P>
187
+ <P style="font-size:small;"><I>Internal key (if applicable):</I> <INPUT TYPE="password" SIZE="4" NAME="key"></P>
188
+ <INPUT TYPE="text" SIZE="80" NAME="demo" value="1" style="display:none;">
189
+ <P>Input Method 1) Enter a URL to a file on the web (e.g., <A HREF="http://wing.comp.nus.edu.sg/~wing.nus/samples/E06-1050.txt">http://wing.comp.nus.edu.sg/~wing.nus/samples/E06-1050.txt</A> or <A HREF="http://wing.comp.nus.edu.sg/~wing.nus/samples/W06-0102.txt">http://wing.comp.nus.edu.sg/~wing.nus/samples/W06-0102.txt</A>).<BR/>
190
+ <INPUT TYPE="text" SIZE="80" NAME="urlfile">
191
+ </P>
192
+
193
+ <P>Input Method 2) Upload a .txt file (ASCII; UTF-8)<BR/>
194
+ <INPUT TYPE="FILE" NAME="datafile">
195
+ </P>
196
+
197
+ <P>Input Method 3) Paste the whole file here:
198
+ <br/>
199
+ <TEXTAREA ROWS="4" COLS="80" NAME="textfile">
200
+ </TEXTAREA>
201
+ </P>
202
+ <P>Parse the document using the following options
203
+ <SELECT NAME="ParsCitOptions">
204
+ <OPTION SELECTED VALUE="5">all</OPTION>
205
+ <OPTION VALUE="1">citations</OPTION>
206
+ <OPTION VALUE="2">header</OPTION>
207
+ <OPTION VALUE="4">section</OPTION>
208
+ </SELECT>
209
+ </P>
210
+
211
+ <P>Citation export formats
212
+ <INPUT TYPE=CHECKBOX NAME="ads1">ADS
213
+ <INPUT TYPE=CHECKBOX NAME="bib1" CHECKED>BIB
214
+ <INPUT TYPE=CHECKBOX NAME="end1">EndNote
215
+ <INPUT TYPE=CHECKBOX NAME="isi1">ISI
216
+ <INPUT TYPE=CHECKBOX NAME="ris1">RIS
217
+ <INPUT TYPE=CHECKBOX NAME="wordbib1">WordBib
218
+ </P>
219
+
220
+
221
+ <br/><CENTER><INPUT TYPE="SUBMIT" VALUE="Parse this file!"></CENTER>
222
+ </FORM>
223
+ </DIV>
224
+
225
+ <P><B>Demo #2: As above but using XML input (XML must conform to Omnipage output). This demo is slow so please be patient.</B></P>
226
+ <DIV STYLE="background-color:#D0D0FF; padding: 1em">
227
+ <FORM ENCTYPE="multipart/form-data" METHOD="post" ACTION="parsCit.cgi">
228
+ <INPUT TYPE="text" SIZE="80" NAME="demo" value="2" style="display:none;">
229
+ <P style="font-size:small;"><I>Internal key (if applicable):</I> <INPUT TYPE="password" SIZE="4" NAME="key"></P>
230
+ <P>Input Method 1) Enter a URL to a file on the web (e.g., <A HREF="http://wing.comp.nus.edu.sg/~wing.nus/samples/E06-1050.xml">http://wing.comp.nus.edu.sg/~wing.nus/samples/E06-1050.xml</A> or <A HREF="http://wing.comp.nus.edu.sg/~wing.nus/samples/W06-0102.xml">http://wing.comp.nus.edu.sg/~wing.nus/samples/W06-0102.xml</A>).<BR/>
231
+ <INPUT TYPE="text" SIZE="80" NAME="urlfile">
232
+ </P>
233
+
234
+ <P>Input Method 2) Upload a .xml file (ASCII; UTF-8)<BR/>
235
+ <INPUT TYPE="FILE" NAME="datafile">
236
+ </P>
237
+
238
+ <P>Input Method 3) Paste the whole .xml file here:
239
+ <br/>
240
+ <TEXTAREA ROWS="4" COLS="80" NAME="textfile">
241
+ </TEXTAREA>
242
+ </P>
243
+
244
+ <P>Input Method 4) Upload your own .pdf file (less than 50 pages &amp; smaller than 10MB):
245
+ <br/>
246
+ <INPUT TYPE="FILE" NAME="pdffile">
247
+ </P>
248
+
249
+ <P>Parse the document using the following options
250
+ <SELECT NAME="ParsCitOptions">
251
+ <OPTION SELECTED VALUE="5">all</OPTION>
252
+ <OPTION VALUE="1">citations</OPTION>
253
+ <OPTION VALUE="2">header</OPTION>
254
+ <OPTION VALUE="4">section</OPTION>
255
+ </SELECT>
256
+ </P>
257
+ <P>Citation export formats
258
+ <INPUT TYPE=CHECKBOX NAME="ads2">ADS
259
+ <INPUT TYPE=CHECKBOX NAME="bib2" CHECKED>BIB
260
+ <INPUT TYPE=CHECKBOX NAME="end2">EndNote
261
+ <INPUT TYPE=CHECKBOX NAME="isi2">ISI
262
+ <INPUT TYPE=CHECKBOX NAME="ris2">RIS
263
+ <INPUT TYPE=CHECKBOX NAME="wordbib2">WordBib
264
+ </P>
265
+
266
+ <br/><CENTER><INPUT TYPE="SUBMIT" VALUE="Parse this file!"></CENTER>
267
+ </FORM>
268
+ </DIV>
269
+
270
+ <!--
271
+ <P><B>Demo #2b: OCR a PDF file using Omnipage (less than 50 pages & smaller than 10MB).</B></P>
272
+ <DIV STYLE="background-color:D0D0FF; padding: 1em">
273
+ <FORM ENCTYPE="multipart/form-data" METHOD="post" ACTION="upload.cgi">
274
+ <P style="font-size:small;"><I>Internal key (if applicable):</I> <INPUT TYPE="password" SIZE="4" NAME="key"></P>
275
+ <P>File to OCR (PDF only): <INPUT TYPE="FILE" NAME="content"></P>
276
+ <br/><CENTER><INPUT TYPE="SUBMIT" VALUE="OCR this file!"></CENTER>
277
+ </FORM>
278
+ </DIV>
279
+ -->
280
+
281
+ <P><B>Demo #3: Parsing individual reference strings only (just <CODE>extract_citations</CODE>)</B></P>
282
+ <DIV STYLE="background-color:#D0D0FF; padding: 1em">
283
+ <FORM ENCTYPE="multipart/form-data" METHOD="post" ACTION="parsCit.cgi">
284
+ <INPUT TYPE="text" SIZE="80" NAME="demo" value="3" style="display:none;">
285
+ <P style="font-size:small;"><I>Internal key (if applicable):</I> <INPUT TYPE="password" SIZE="4" NAME="key"></P>
286
+ <P>Input Method 1) Enter a URL to a file on the web in the correct format (each line should be a separate citation; e.g., <A
287
+ HREF="http://wing.comp.nus.edu.sg/~wing.nus/samples/E06-1050.cite">http://wing.comp.nus.edu.sg/~wing.nus/samples/E06-1050.cite</A> or <A
288
+ HREF="http://wing.comp.nus.edu.sg/~wing.nus/samples/W06-0102.cite">http://wing.comp.nus.edu.sg/~wing.nus/samples/W06-0102.cite</A>).
289
+ <INPUT TYPE="text" SIZE="80" NAME="urllines">
290
+ </P>
291
+
292
+ <P>Input Method 2) Upload a file (again, each line should be a separate citation)<BR/>
293
+ <INPUT TYPE="FILE" NAME="datalines">
294
+ </P>
295
+
296
+ <P>Input Method 3) Enter a list of plain text citations (again, one per line):<BR/>
297
+ <TEXTAREA ROWS="4" COLS="80" NAME="textlines">Isaac G. Councill, C. Lee Giles, Min-Yen Kan. (2008) ParsCit: An open-source CRF reference string parsing package. To appear in the proceedings of the Language Resources and Evaluation Conference (LREC 08), Marrakesh, Morocco, May.
298
+ </TEXTAREA>
299
+ </P>
300
+
301
+ <P>Citation export formats
302
+ <INPUT TYPE=CHECKBOX NAME="ads3">ADS
303
+ <INPUT TYPE=CHECKBOX NAME="bib3" CHECKED>BIB
304
+ <INPUT TYPE=CHECKBOX NAME="end3">EndNote
305
+ <INPUT TYPE=CHECKBOX NAME="isi3">ISI
306
+ <INPUT TYPE=CHECKBOX NAME="ris3">RIS
307
+ <INPUT TYPE=CHECKBOX NAME="wordbib3">WordBib
308
+ </P>
309
+
310
+ <br/><CENTER><INPUT TYPE="SUBMIT" VALUE="Parse these lines!"></CENTER>
311
+ </FORM>
312
+ </DIV>
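+ <P>Demo #3 expects one reference string per line, but reference lists copied from a text dump often wrap over several lines. The following is a minimal sketch of one way to flatten such a list before submitting it. It is not part of the ParsCit distribution: the function name <CODE>refs_to_lines</CODE> is hypothetical, and it assumes that consecutive references in your file are separated by at least one completely empty line.</P>

```shell
# Join each blank-line-separated reference into a single line.
# Assumption: references are separated by one or more empty lines
# (lines containing only spaces do not count as separators).
refs_to_lines() {
    # RS = "" switches awk to paragraph mode: each record is one
    # block of text delimited by empty lines.
    awk 'BEGIN { RS = "" }
         { gsub(/[ \t]*\n[ \t]*/, " "); print }' "$@"
}

# Usage (file names are placeholders):
#   refs_to_lines wrapped_refs.txt > refs.cite
```

+ <P>The output can then be pasted into Input Method 3 or uploaded through Input Method 2.</P>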
313
+
314
+ <!-- Publications ---------------------------------------------------------------------- -->
315
+ <A name="p"></a><H2>Publications</H2>
316
+ <P><B>Journal Papers:</B>
317
+ <UL>
318
+ <LI> Minh-Thang Luong, Thuy Dung Nguyen and Min-Yen Kan (forthcoming)
319
+ <U>Logical Structure Recovery in Scholarly Articles with Rich
320
+ Document Features</U>. Forthcoming in the International
321
+ Journal of Digital Library Systems. <BR/>
322
+ [ <A HREF="ijdls-SectLabel.pdf">pre-print .pdf</A> ]
323
+ </UL>
324
+
325
+ <P><B>International Refereed Conference Publications:</B>
326
+ <UL>
327
+ <LI> Isaac G. Councill, C. Lee Giles, Min-Yen Kan. (2008)
328
+ <U>ParsCit: An open-source CRF reference string parsing
329
+ package</U>. In Proceedings of the Language Resources and
330
+ Evaluation Conference (LREC 08), Marrakesh, Morocco, May.
331
+ <BR/> [ <A HREF="lrec08/lrec08.pdf">.pdf</A> ]
332
+ [ <A HREF="lrec08b.png">Poster (.png)</A> ]
333
+ </UL>
334
+
335
+ <P><B>Others:</B>
336
+ <UL>
337
+ <LI> Yong Kiat Ng. (2004) <U>Citation Parsing Using Maximum Entropy
338
+ and Repairs</U>. Undergraduate thesis. National University of
339
+ Singapore. <BR/>
340
+ [ <A HREF="yongKiatNgThesis.pdf">.pdf</A> ]
341
+ </UL>
342
+
343
+ <!-- Output ---------------------------------------------------------------------- -->
344
+ <A name="gsiso"></a><H2>Gold Standard Input and Sample Output</H2>
345
+
346
+ <UL>
347
+ <LI>Chunk tagged data for <A HREF="cora.tagged.txt">Cora</A>, <A
348
+ HREF="citeseerx.tagged.txt">CiteSeer<SUP>X</SUP></A>, <A
349
+ HREF="flux-cim-cs.tagged.txt">FLUX-CiM</A> and humanities (<A
350
+ HREF="it-humanities.tagged.txt">Italian</A>, <A
351
+ HREF="en-humanities.tagged.txt">English</A>, and <A
352
+ HREF="mixed-humanities.tagged.txt">mixed language</A>) datasets
353
+ (suitable for ParsCit training). For FLUX-CiM data, please try
354
+ the original hosting site maintained by Eli Cortez. Credits to
355
+ Matteo Romanello for contributing the humanities datasets.
356
+ <LI> <A HREF="iconip.tagged.txt">Chunk tagged data for some ICONIP
357
+ papers</A>. Contributed by Cheong Chi Hung.
358
+ <LI>Results of running the v080917 version of ParsCit on FLUX-CiM's
359
+ dataset for [ <A HREF="flux-cim-cs.out.xml">300 computer science
360
+ references</A> ] [ <A HREF="flux-cim-med.out.xml">2000 medical
361
+ references</A> ] [ <A HREF="cora.out.xml">on the CORA dataset</A>
362
+ ]. Note that these results are considered cheating as the current
363
+ version has been trained on this data.
364
+ <LI> Tagged section data for the SectLabel module. <BR/> [ <A
365
+ HREF="sectLabelXML.tagged.txt">XML Format</A> ] [ <A
366
+ HREF="sectLabel.tagged.txt">Plain Text Format</A> ]<BR/>
367
+ [ <A HREF="genericSect.tagged.txt">GenericSect training data</A> ]
368
+ </UL>
369
+
370
+ <!-- Group Members ---------------------------------------------------------------------- -->
371
+ <A name="gm"></a><H2>Group Members</H2>
372
+
373
+ <UL>
374
+ <LI> <A HREF="http://www.comp.nus.edu.sg/~kanmy">Min-Yen Kan</A> - Project leader, NUS
375
+ <LI> <A HREF="http://www.personal.psu.edu/igc2/">Isaac G. Councill</A>, The Pennsylvania State University
376
+ <LI> <A HREF="http://clgiles.ist.psu.edu/">C. Lee Giles</A>, The Pennsylvania State University
377
+ <LI> <A HREF="http://wing.comp.nus.edu.sg/~lmthang">Minh-Thang Luong</A> - Research Assistant (alumnus), NUS
378
+ <LI> Yong Kiat Ng - Final year undergraduate student (graduated, 2004), NUS
379
+ <LI> Thuy Dung Nguyen - Research Assistant (alumnus), NUS
380
+ <LI> Huy Nhat Hoang Do - Research Assistant, NUS
381
+ </UL>
382
+
383
+ <!-- FAQ ---------------------------------------------------------------------- -->
384
+ <A name="faq"></a><H2>FAQ</H2>
385
+ <DL>
386
+ <DT>What platforms does ParsCit work on?</DT>
387
+ <DD>ParsCit works on all major platforms: Windows, Linux and MacOS.
388
+ The installation requires ruby and perl and the CRF++ embedded
389
+ package also requires standard UNIX utilities like sed. You
390
+ should have a working knowledge of UNIX and some experience in
391
+ installing UNIX tools. Due to our time constraints, we may not be
392
+ able to answer your particular problems with installation. Do let us
393
+ know if there was something important that you had to do to get
394
+ your particular download and installation working; we'll
395
+ incorporate it into the Troubleshooting section below.</DD>
396
+ <DT>What is the difference between SectLabel and the previous ParsHed?</DT>
397
+ <DD>SectLabel is a newly-developed module that further extends
398
+ ParsHed in functionality. It not only classifies header metadata,
399
+ but analyzes full documents to output the logical structure of
400
+ the internals of the document -- sections, subsections, figures,
401
+ tables, equations, footnotes and captions. <BR/> For compatibility
402
+ issues, the ParsHed module is still retained in our source code
403
+ and command line options. </DD>
404
+
405
+ <DT>How do I retrain ParsCit for a different language? I saw code in
406
+ lib/ParsCit/PreProcess' to find the beginning of the bibliography
407
+ section, and changed that but it doesn't work.</DT>
408
+ <DD>The current version does not depend on those regular expressions
409
+ anymore; they are for previous versions (e.g., v101101). ParsCit
410
+ now first labels each line using the SectLabel module and
411
+ discovers which lines to parse references for based on the first
412
+ step's output. You need to retrain SectLabel for this, by
413
+ providing labeled data about what class of line each line in your
414
+ training data is. It's also possible to "downgrade" the current
415
+ version to go back to use the rule-based method for identifying
416
+ the reference section.</DD>
417
+ <DT>What is the "genericHeader" in the output of SectLabel? What is
418
+ the difference between "genericSect.tagged" and "SectLabel.tagged"?</DT>
419
+ <DD>Generic headers, such as introduction, methodology, and
420
+ evaluation, represent generic purposes of different sections in a
421
+ scholarly article. We map all section names to generic ones
422
+ (e.g., "5. Text Features" to "Methodology"). This promotes
423
+ comparative viewing of sections with identical purpose across
424
+ articles. For the second question, actually, Generic section is
425
+ a component of SectLabel. It is responsible for classifying the
426
+ section headers of a paper into the generic categories such as
427
+ Introduction, Methodology, Result, etc. For details refer to our
428
+ IJDLS journal paper.
429
+ </DD>
430
+ <DT>Why is there an option to input file in XML format? Which DTD
431
+ should it follow?</DT>
432
+ <DD>SectLabel is a robust logical document structure inference
433
+ system. It can exploit rich input (features produced by OCR
434
+ software, such as font or spatial information) to boost recognition
435
+ performance, but it can still perform inference on
436
+ impoverished input (plain text) with degraded
437
+ performance. Currently, the XML input must be in the XML output
438
+ format of Nuance OmniPage (version 16), and hence
439
+ should follow OmniPage's DTD. Note: The ParsCit team is not
440
+ affiliated with Nuance in any way nor does it endorse
441
+ OmniPage.</DD>
442
+ <DT> I need to run ParsCit but I can't get well-formed text from my
443
+ PDF documents. Can you help?</DT>
444
+ <DD> No, we cannot help you with this. We don't perform OCR or text
445
+ extraction from PDF documents. You will have to find your own
446
+ source for doing the extraction or conversion. We've found
447
+ Omnipage useful in our own project work (hence the possibility of
448
+ XML input), but we don't endorse any product.</DD>
449
+ <DT> The OmniPage XML doesn't seem to be well-formed. Is that OK?</DT>
450
+ <DD> Yes. The sample "XML" files provided in the links (for Demo 2) are
451
+ actual outputs for a sequence of XML pages (one XML file per
452
+ page). If you use OmniPage to save an XML file for input to
453
+ ParsCit, make sure to save individual pages as separate files,
454
+ then concatenate them to send to ParsCit. You may want to
455
+ download the sample links for inspection (as they are
456
+ concatenations of several XML files, your browser will likely
457
+ complain about them not being well-formed).</DD>
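+ <DD>A minimal sketch of the concatenation step, assuming OmniPage has already saved one XML file per page. The function name <CODE>concat_pages</CODE> and all file names are hypothetical, and the sketch assumes that pages should be joined in reading order:

```shell
# Concatenate per-page OmniPage XML files into one input file.
# First argument: output file; remaining arguments: page files.
concat_pages() {
    out=$1
    shift
    # cat joins the files in the order given, so list the pages
    # in reading order (a shell glob would sort page-10 before page-2).
    cat "$@" > "$out"
}

# Usage (file names are placeholders):
#   concat_pages whole-paper.xml page-1.xml page-2.xml page-3.xml
```

+ The result is deliberately not a single well-formed XML document, which matches the sample files linked from Demo 2.</DD>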
458
+ <DT> I ran Demos 1 and 2 with the default "all" settings, but sections
459
+ don't seem to be detected.</DT>
460
+ <DD> There's no problem. The demo just hides the SectLabel output
461
+ by default. Click "Show SectLabel output" to reveal it.</DD>
462
+ <DT> I ran ParsCit using the OmniPage XML output, but encountered malformed UTF8 character errors.</DT>
463
+ <DD> OmniPage normally outputs XML results in UTF-16 format; converting the file to UTF-8 solves the problem, as follows: <BR/>
464
+ <I>&nbsp; &nbsp; &nbsp; iconv --from-code UTF-16 --to-code UTF-8 omnipageOutput.xml > newOmnipageOutput.xml</I>
465
+ </DD>
466
+ </DL>
467
+
468
+ <!-- Troubleshooting ---------------------------------------------------------------------- -->
469
+ <A name="t"></a><H2>Troubleshooting</H2>
470
+
471
+ <P> A list of common problems with ParsCit. If you find problems,
472
+ email the lead developer at &lt;kanmy@comp.nus.edu.sg&gt;. Please use
473
+ the subject "[ParsCit]" to ensure that it reaches our attention. If
474
+ you have hand-corrected tagged data that you don't mind providing us,
475
+ we can use that to further improve ParsCit's extracting capabilities.
476
+ Nevertheless, there are problems with the output occasionally. Below
477
+ are some common problems people have encountered.
478
+
479
+ <DL>
480
+ <DT>ParsCit v110505 seems to be a lot slower when used on Omnipage
481
+ output than the previous versions, why?</DT>
482
+ <DD>You are correct. We are now using XML::Twig to do the XML
483
+ processing correctly, rather than do it ad-hoc ourselves, but this
484
+ requires constructing an exhaustive DOM tree for the Omnipage input.
485
+ This is the timesink that you are experiencing.</DD>
486
+ <DT>I'm running ParsCit on Windows but I can't get it to work, even
487
+ after installing a perl interpreter. Specifically, the
488
+ citeExtract.pl program dies complaining that it Can't open
489
+ "/tmp/...." at line 175. </DT>
490
+ <DD>ParsCit hasn't been fully tested on Windows at NUS, so we can't
491
+ vouch for whether it will run correctly. In this specific error
492
+ case, the "/tmp/" directory (a standard place for temporary files in
493
+ UNIX systems) is normally not available in Windows, and may generate
494
+ problems. You may need to change the code and/or create an
495
+ appropriate directory for ParsCit to generate such files.</DD>
496
+ <DT>I tried downloading and running ParsCit but I get complaints
497
+ about /bin/sed and crf not being found. Help?</DT>
498
+ <DD>Please read the INSTALL.txt directions. You need to recompile
499
+ CRF++ for your platform. The paths included with the install are
500
+ for our version; you need to recompile to have the paths point
501
+ correctly.</DD>
502
+ <DT>When running citeExtract.pl I get some errors complaining about
503
+ the wrong ELF class of the binaries. How can I fix this?</DT>
504
+ <DD>This seems to be a problem with the compiled executables of
505
+ CRF++ bundled with the software. Follow the INSTALL instructions
506
+ but after step 1 do:
507
+ <P>
508
+ <CODE>$ cp -Rf * ../../.libs<BR/>
509
+ $ cp crf_learn ../../.libs/lt-crf_learn<BR/>
510
+ $ cp crf_test ../../.libs/lt-crf_test<BR/>
511
+ </CODE></DD>
512
+ <DT>I'm trying to install parscit v110505 using the instructions in the install file, and when I get to the point where you're supposed to recompile CRF, it exits with an error:<BR/>
513
+
514
+ <PRE>In file included from node.h:13:0,
515
+ from node.cpp:9:
516
+ path.h:26:52: error: 'size_t' has not been declared
517
+ make[1]: *** [node.lo] Error 1
518
+ make[1]: Leaving directory `/home/agarnett/parscit/crfpp/CRF++-0.51'
519
+ make: *** [all] Error 2</PRE><BR/>
520
+ The install file mentions that this may fail the first time; unfortunately for me, it keeps failing. Any help?</DT>
521
+ <DD>The error comes from the CRF++ package (not from ParsCit). There are two ways to fix it:<BR/>
522
+ 1. Add the line <CODE>#include &lt;iostream&gt;</CODE> to node.cpp and compile CRF++ again, or;<BR/>
523
+ 2. Go to <A HREF="http://crfpp.googlecode.com/svn/trunk/doc/index.html">http://crfpp.googlecode.com/svn/trunk/doc/index.html</A> and download the latest version. The instructions are the same. Hope this helps.</DD>
524
+ <DT>Issue numbers don't get extracted.</DT>
525
+ <DD><SPAN CLASS="red">This issue should be fixed as of the v110505
526
+ release.</SPAN> There is now some heuristic postprocessing code to
527
+ take care of breaking single or multiple tokens for issues and
528
+ volumes. </DD>
529
+ <DT>Separation of author names and publishing year fails</DT>
530
+ <DD> In some reference data with non-standard sequences of
531
+ first names and family names, e.g.
532
+ <pre>
533
+ Baltes, Paul, Ursula Staudinger, Ulmann Lindenberger (1999): Lifespan
534
+ psychology: theory and application of intellectual functioning; in:
535
+ Annual Review of Psychology, 50, 471-507
536
+ </pre>
537
+ ParsCit's post processing step may not detect and deal with these
538
+ problems reliably. We're working to fix these too. </DD>
539
+ <DT>I passed ParsCit plain text output but in another, non-English
540
+ language. I didn't get good results or I got empty results. Can
541
+ you help? </DT>
542
+ <DD>Aside from English, ParsCit can handle Italian and German to a
543
+ limited extent, thanks to the multilingual training data.
544
+ However, the demo web interface uploads non-ASCII (e.g., UTF-8 or
545
+ UTF-16 data) as binary data and fails to execute ParsCit.
546
+ However, if you download a copy of ParsCit, the libraries do work
547
+ on such data. Here's a <A
548
+ HREF="humanities.test.out.xml">sample</A>. We'd love to help make
549
+ a more universal model that can accommodate reference strings in
550
+ other languages. If you're willing to help contribute ground
551
+ truth data, we'd love to hear from you!</DD>
552
+
553
+ <DT>How about retraining ParsCit for another language/domain?</DT>
554
+ <DD>You can put your supervised exemplar data into the same format
555
+ as tagged_references.txt found in crfpp/traindata/. Once you have
556
+ this file you can generate the appropriate model for ParsCit, by
557
+ using three commands (assumes you are in the crfpp/traindata
558
+ directory):
559
+ <P>
560
+ <CODE>$ ../../bin/tr2crfpp.pl tagged_references.txt > parsCit.train.data
+ <BR/>
+ $ ../crf_learn parsCit.template parsCit.train.data model
+ <BR/>
+ $ mv model ../../resources/parsCit.model
+ </CODE>
+ <P>The first command creates the input feature file that crfpp uses
+ from the training data. The second creates the model using the
+ crf_learn command. You can then move the model file to the
+ resources/ subdirectory where it can be utilized. To replace the
+ default model that comes with ParsCit, just execute the final
+ command. </DD>
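+ <DD>The three steps above can be sketched as one small shell function,
+ which may be handy when retraining repeatedly. This is only a sketch:
+ the helper names <CODE>backup_model</CODE> and <CODE>retrain</CODE>
+ and the <CODE>.bak</CODE> backup copy are illustrative assumptions,
+ not part of ParsCit, and the function assumes it is called from the
+ crfpp/traindata directory.

```shell
# Sketch only: helper names and the .bak backup are assumptions,
# not part of the ParsCit distribution.

backup_model() {
    # Copy the model file to <path>.bak if it exists; otherwise do nothing.
    model_path=$1
    if [ -f "$model_path" ]; then
        cp "$model_path" "$model_path.bak"
    fi
}

retrain() {
    # $1: training file in the same format as tagged_references.txt.
    # Must be called from the crfpp/traindata directory.
    tagged=$1
    backup_model ../../resources/parsCit.model &&
    ../../bin/tr2crfpp.pl "$tagged" > parsCit.train.data &&
    ../crf_learn parsCit.template parsCit.train.data model &&
    mv model ../../resources/parsCit.model
}
```

+ After sourcing these definitions, <CODE>retrain
+ tagged_references.txt</CODE> runs the three commands in order and
+ stops at the first one that fails, so a failed training run does not
+ overwrite the existing model. </DD>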
+ <DT> Can I retrain the package for a different set of tags if I
+ change the tagset in the training data?</DT>
+ <DD> Yes, you should be able to change the tagset to suit your
+ dataset. You can add, remove, and change tags as you
+ wish. You need to retrain the parser after creating your
+ tagged data. For more details on the training process, see the
+ CRF++ documentation, which is on the web at SourceForge.
+ </DD>
+ <DT>When retraining I get a "bad_alloc" error. What gives?</DT>
+ <DD>We're not entirely sure of the cause. CRF training is quite memory
+ intensive, and a large number of training data tuples may
+ cause the embedded CRF++ package to fail. You can try with less
+ training data, or try training on a machine with a larger amount
+ of RAM. </DD>
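+ <DD>One way to try the first suggestion is sketched below: train on a
+ subset of the references and, optionally, raise the CRF++ feature
+ frequency cutoff (<CODE>crf_learn -f</CODE>) so that rare features are
+ dropped. The helper name <CODE>subsample_refs</CODE>, the file name
+ tagged_small.txt, and the values 500 and 3 are illustrative
+ assumptions. The sketch also assumes that the training file holds one
+ tagged reference per line.

```shell
# Print the first N lines of a tagged training file.
# Assumption: the file holds one tagged reference per line.
subsample_refs() {
    count=$1
    tagged=$2
    head -n "$count" "$tagged"
}

# Example use from crfpp/traindata (file name and values are illustrative):
#   subsample_refs 500 tagged_references.txt > tagged_small.txt
#   ../../bin/tr2crfpp.pl tagged_small.txt > parsCit.train.data
#   ../crf_learn -f 3 parsCit.template parsCit.train.data model
```

+ Whether this avoids the error depends on your data and machine, so
+ treat the numbers as starting points rather than recommendations. </DD>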
+ <DT>Does the web service actually work? I can't seem to run it.</DT>
+ <DD>Occasionally our school's networking staff changes the firewall
+ settings, so the port for our group's web services may be blocked
+ (port 4000 on host wing.comp.nus.edu.sg). If you find you can't
+ reach our services (they time out), please let us know. </DD>
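+ <DD>Before writing to us, you may want to confirm that the port is
+ unreachable from your network. The check below is a minimal sketch,
+ not part of ParsCit: the helper name <CODE>port_open</CODE> and the
+ five-second default are assumptions, and it relies on bash (for its
+ /dev/tcp redirection) and the coreutils <CODE>timeout</CODE> command
+ being installed.

```shell
# Succeed if a TCP connection to host:port opens within the time limit.
# Assumptions: bash (for /dev/tcp) and coreutils timeout are installed.
port_open() {
    host=$1
    port=$2
    secs=${3:-5}
    timeout "$secs" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Example:
#   port_open wing.comp.nus.edu.sg 4000 && echo reachable || echo blocked
```

+ A failure only shows that the port could not be reached from where
+ you ran the check; it does not say whether the block is on our side
+ or yours. </DD>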
+ <DT>I get funny errors with crf_test not being useful. How do I
+ fix this?</DT> <DD>The updated README.txt file in the 090625b
+ distribution fixes this. Basically you need to recompile CRF++
+ 0.51 and place the libraries and the executables in the proper
+ place. See the README for details.</DD>
+
+ </DL>
+
+ <!-- Kudos ---------------------------------------------------------------------- -->
+ <H2>Kudos</H2>
+
+ <p>ParsCit owes its continued maintenance and support to its user
+ base. Here we'd like to thank them for their help.</p>
+
+ <P>Thanks to David Judd who reconfigured how CRF++ is located with
+ respect to the main code. Thanks to Alex Garnett for spotting more
+ problems with CRF dependencies. Thanks to George E. Raptis and Eric
+ Tran for the port to Windows. Thanks to Zhu Ying-Bo
+ (yumichika@163.com) from the Language Computing and Web Mining Group,
+ Institute of Computer Science and Technology of Peking University for
+ the partial port to Windows. Thanks to Yustus Oktian for questions
+ about training for another language. Thanks to Madhur Kapoor for
+ asking questions about PDF conversion. Thanks to Behrang Qasemizadeh
+ for reporting problems with truncation of XML entities in XML output
+ (v110505). Thanks to Tim Brody for his BiblioScript patch. Thanks to
+ David Jurgens for suggesting that we remove temporary files after runs
+ (v110505). Thanks to Nikolay Nikolov for suggesting the conversion of
+ OmniPage XML results from UTF-16 to UTF-8 to avoid encoding
+ problems. Thanks to Matteo Romanello for the suggestion and permission
+ to incorporate BiblioScript software (v101101). Many thanks to Kris
+ Jack for pointing out problems with the ELF binaries and an
+ appropriate fix. Thanks to Cheong Chi Hong for fixing problems with
+ Preprocess.pm (v100401) and contributing the ICONIP training data and
+ XML entity problems in reference string parsing (v100401). Thanks to
+ Priya Venkateshan for pointing out sudo/root installation
+ possibilities (v100401). Thanks to Mario Lipinski for reporting
+ punctuation stripping problems in reference string parsing (v100401).
+ Thanks to Artemy Kolchinsky for fixes in Preprocess.pm
+ (v090625). Thanks to Matteo Romanello for the humanities training
+ datasets. Thanks to Dain Kaplan for helping us fix the Preprocess.pm
+ bug. Thanks to Ayeh Bandeh-Ahmadi for correcting the warning in
+ parseRefString.pl. Thanks to Nick Friedrich and J&ouml;ran Beel of
+ scienstein.org for all fixes in the v081201 version of ParsCit. Also
+ thanks to Madian Khabsa for indicating problems with installation on
+ MacOS.</p>
+
+ <P>ParsCit is used by many projects worldwide, not just in
+ experimental, research, and academic settings, but in commercial
+ enterprises as well. <A HREF="http://www.mendeley.com/">Mendeley</A>
+ is using ParsCit to parse references from contributed papers, as is
+ the <A HREF="http://citec.repec.org/">Citations in Economics
+ (CitEc)</A> project.
+
+
+ <!-- Related Links ---------------------------------------------------------------------- -->
+ <H2>Related Links</H2>
+
+ <P>Other open-source citation parsers:
+
+ <UL>
+ <LI> <A
+ HREF="http://freecite.library.brown.edu/welcome">FreeCite</A>:
+ supported by the Mellon Foundation and Brown University. Written in
+ Ruby on Rails, with the same CRF++ backend.
+ <LI> A <A
+ HREF="http://purl.net/net/egh/hmm-citation-extractor/">Hidden Markov
+ Model Citation Extractor</A>: written by Erik Hetzner of the
+ California Digital Library.
+ </UL>
+
+ <P> Other related links. Contact Min below to get your other related
+ software listed here. Thanks!
+
+ <UL>
+ <LI> Perhaps you're interested in open source code for libraries?
+ If so try the <A
+ HREF="http://dewey.library.nd.edu/mailing-lists/code4lib/">CODE4LIB
+ mailing list</A>.
+
+ <LI> <A
+ HREF="https://wiki.birncommunity.org:8443/display/NEWBIRNCC/LATISI+-+Literature+Annotation+Tool+from+the+Information+Sciences+Institute">LATISI
+ - Literature Annotation Tool from the Information Sciences
+ Institute</A>. A related project from ISI, using MBL instead of CRF.
+ <LI> <A HREF="http://www.scienstein.org">Scienstein.org</A>: A
+ recommendation system for papers.
+ <LI> PdfBox: An open-source package for extracting text information
+ from PDF files. Does not deal with problems with custom font
+ encodings.
+ </UL>
+
+ <HR>
+ <H5><ADDRESS><A HREF="http://www.comp.nus.edu.sg/~kanmy">Min-Yen Kan</A> &lt;<A HREF="mailto:kanmy@comp.nus.edu.sg">kanmy@comp.nus.edu.sg</A>&gt;</ADDRESS>
+ Created on: Fri Dec 24 01:48:05 SGT 2004
+ <!-- hhmts start -->
+ | Version: 1.0
+
+ | Last modified:
+ Mon Mar 4 14:23:46 SGT 2013
+ <!-- hhmts end -->
+ </H5>
+ </div>
+ </BODY> </HTML>