simple_bioc 0.0.1

data/xml/BioC.dtd ADDED
@@ -0,0 +1,146 @@
+ <!-- Combination DTD that will work with any document so far. -->
+
+ <!--
+
+ Some believe XML is easily read by humans and that should be
+ supported by clearly formatting the elements. In the long run,
+ this is distracting. While the only meaningful spaces are in text
+ elements and the other spaces can be ignored, current tools add no
+ additional space. Formatters and editors may be used to make the
+ XML file appear more readable.
+
+ The possible variety of annotations that one might want to produce
+ or use is nearly countless. There is no guarantee that these are
+ organized in the nice nested structure required for XML
+ elements. Even if they were, it would be nice to more easily
+ ignore unwanted annotations. So annotations are recorded in a
+ stand-off manner, external to the annotated text. The exceptions
+ are passages and sentences because of their fundamental place in
+ text.
+
+ The text is expected to be encoded in Unicode, specifically
+ utf-8. This is one of the encodings required to be implemented by
+ XML tools, is portable between big-endian and little-endian
+ machines and is a superset of 7-bit ASCII. Code points beyond 127
+ may be expressed directly in utf-8 or indirectly using numeric
+ entities. Since many tools today still only directly process
+ ASCII characters, conversion should be available and
+ standardized. Offsets should be in 8 bit code units (bytes) for
+ easier processing by naive programs.
+
+ Nothing final. Just current thoughts.
+
+ collection: Group of documents, usually from a larger corpus. If
+ a group of documents is from several corpora, use several
+ collections.
+
+ source: Name of the source corpus from which the documents were selected.
+
+ date: Date documents extracted from original source. Can be as
+ simple as yyyymmdd or an ISO timestamp.
+
+ key: Separate file describing the types used and any other useful
+ information about the data in the file. For example, if a file
+ includes part-of-speech tags, this file should describe the
+ part-of-speech tags used.
+
+ infon: key-value pairs. Can record essentially arbitrary
+ information. "type" will be a particular common key in the major
+ sub elements below. For PubMed references, passage "type" might
+ signal "title" or "abstract". For annotations, it might indicate
+ "noun phrase", "gene", or "disease". In the programming language
+ data structures, infons are typically represented as a map from
+ strings to strings. This means keys should be unique within each
+ parent element.
+
+ document: A document in the collection. A single, complete
+ stand-alone document as described by its parent source.
+
+ id: Typically, the id of the document in the parent
+ source. Should at least be unique in the collection.
+
+ passage: One portion of the document. For now PubMed documents
+ have a title and an abstract. Structured abstracts could have
+ additional passages. For a full text document, passages could be
+ sections such as Introduction, Materials and Methods, or
+ Conclusion. Another option would be paragraphs. Passages impose a
+ linear structure on the document. Further structure in the
+ document can be implied by the infon["type"] value.
+
+ offset: Where the passage occurs in the parent document. Depending
+ on the source corpus, this might be a very relevant number. They
+ should be sequential and identify a passage's position in
+ the document. Since PubMed is extracted from an XML file, the
+ title has an offset of zero, while the abstract is assumed to
+ begin after the title and one space.
+
+ text: The original text of the passage.
+
+ sentence: One sentence of the passage.
+
+ offset: A document offset to where the sentence begins in the
+ passage. This value is the sum of the passage offset and the local
+ offset within the passage.
+
+ text: The original text of the sentence.
+
+ annotation: Stand-off annotation.
+
+ id: Used to refer to this annotation in relations.
+
+ location: Location of the annotated text. Multiple locations
+ indicate a multi-span annotation.
+
+ offset: Document offset to where the annotated text begins in
+ the passage or sentence. The value is the sum of the passage or
+ sentence offset and the local offset within the passage or
+ sentence.
+
+ length: Length of the annotated text. While unlikely, this could
+ be zero to describe an annotation that belongs between two
+ characters.
+
+ text: Unless something else is defined, one would expect the
+ annotated text. The length is redundant in this case. Other uses
+ for this text could be the normalized ID for a gene in a gene
+ database.
+
+ relation: Relationship between multiple annotations.
+
+ id: Used to refer to this relation in other relationships.
+
+ refid: Id of an annotated object or other relation.
+
+ role: Describes how the referenced annotated object or other
+ relation participates in the current relationship. Has a default
+ value so can be left out if there is no meaningful value.
+
+ -->
+
+ <!ELEMENT collection ( source, date, key, infon*, document+ ) >
+ <!ELEMENT source (#PCDATA)>
+ <!ELEMENT date (#PCDATA)>
+ <!ELEMENT key (#PCDATA)>
+ <!ELEMENT infon (#PCDATA)>
+ <!ATTLIST infon key CDATA #REQUIRED >
+
+ <!ELEMENT document ( id, infon*, passage+, relation* ) >
+ <!ELEMENT id (#PCDATA)>
+
+ <!ELEMENT passage ( infon*, offset, ( ( text?, annotation* ) | sentence* ), relation* ) >
+ <!ELEMENT offset (#PCDATA)>
+ <!ELEMENT text (#PCDATA)>
+
+ <!ELEMENT sentence ( infon*, offset, text?, annotation*, relation* ) >
+
+ <!ELEMENT annotation ( infon*, location*, text ) >
+ <!ATTLIST annotation id CDATA #IMPLIED >
+ <!ELEMENT location EMPTY>
+ <!ATTLIST location offset CDATA #REQUIRED >
+ <!ATTLIST location length CDATA #REQUIRED >
+
+ <!ELEMENT relation ( infon*, node* ) >
+ <!ATTLIST relation id CDATA #IMPLIED >
+ <!ELEMENT node EMPTY>
+ <!ATTLIST node refid CDATA #REQUIRED >
+ <!ATTLIST node role CDATA "" >
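
The DTD comment asks for offsets in 8-bit code units (bytes) of the utf-8 text, and defines a sentence or annotation offset as the containing offset plus the local offset. The following is a minimal Python sketch of that arithmetic, not part of the gem; the helper names `byte_offset`, `slice_by_bytes`, and `document_offset` are hypothetical and chosen only for illustration.

```python
def byte_offset(text: str, char_index: int) -> int:
    """Return the utf-8 byte offset of the character at char_index."""
    return len(text[:char_index].encode("utf-8"))


def slice_by_bytes(text: str, offset: int, length: int) -> str:
    """Recover a span from a byte offset and a byte length.

    Assumes offset and length fall on character boundaries; otherwise
    decoding raises UnicodeDecodeError.
    """
    data = text.encode("utf-8")
    return data[offset:offset + length].decode("utf-8")


def document_offset(parent_offset: int, parent_text: str, local_char_index: int) -> int:
    """Document offset = parent (passage or sentence) offset + local byte offset."""
    return parent_offset + byte_offset(parent_text, local_char_index)


# Character and byte offsets diverge as soon as a non-ASCII character appears.
text = "α-catenin binds β-catenin"
char_index = text.index("binds")       # character position
start = byte_offset(text, char_index)  # byte position, one larger because of "α"
print(char_index, start, slice_by_bytes(text, start, len("binds")))
```

For pure ASCII text, such as the sample abstract in this diff, byte and character offsets coincide, so the distinction only matters once code points beyond 127 are present.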
@@ -0,0 +1,492 @@
+ <?xml version="1.0" encoding="utf-8"?>
+ <!DOCTYPE collection SYSTEM "BioC.dtd">
+ <collection>
+ <source>PubMed</source>
+ <date>20130316</date>
+ <key>PMID-8557975-simplified-sentences-tokens.key</key>
+ <document>
+ <id>8557975</id>
+ <passage>
+ <infon key="type">abstract</infon>
+ <offset>0</offset>
+ <sentence>
+ <infon key="type">original sentence</infon>
+ <offset>70</offset>
+ <annotation id="t0">
+ <infon key="type">token</infon>
+ <location offset="70" length="6"></location>
+ <text>Active</text>
+ </annotation>
+ <annotation id="t1">
+ <infon key="type">token</infon>
+ <location offset="77" length="5"></location>
+ <text>Raf-1</text>
+ </annotation>
+ <annotation id="t2">
+ <infon key="type">token</infon>
+ <location offset="83" length="14"></location>
+ <text>phosphorylates</text>
+ </annotation>
+ <annotation id="t3">
+ <infon key="type">token</infon>
+ <location offset="98" length="3"></location>
+ <text>and</text>
+ </annotation>
+ <annotation id="t4">
+ <infon key="type">token</infon>
+ <location offset="102" length="9"></location>
+ <text>activates</text>
+ </annotation>
+ <annotation id="t5">
+ <infon key="type">token</infon>
+ <location offset="112" length="3"></location>
+ <text>the</text>
+ </annotation>
+ <annotation id="t6">
+ <infon key="type">token</infon>
+ <location offset="116" length="17"></location>
+ <text>mitogen-activated</text>
+ </annotation>
+ <annotation id="t7">
+ <infon key="type">token</infon>
+ <location offset="134" length="7"></location>
+ <text>protein</text>
+ </annotation>
+ <annotation id="t8">
+ <infon key="type">token</infon>
+ <location offset="142" length="1"></location>
+ <text>(</text>
+ </annotation>
+ <annotation id="t9">
+ <infon key="type">token</infon>
+ <location offset="143" length="3"></location>
+ <text>MAP</text>
+ </annotation>
+ <annotation id="t10">
+ <infon key="type">token</infon>
+ <location offset="146" length="1"></location>
+ <text>)</text>
+ </annotation>
+ <annotation id="t11">
+ <infon key="type">token</infon>
+ <location offset="148" length="20"></location>
+ <text>kinase/extracellular</text>
+ </annotation>
+ <annotation id="t12">
+ <infon key="type">token</infon>
+ <location offset="169" length="16"></location>
+ <text>signal-regulated</text>
+ </annotation>
+ <annotation id="t13">
+ <infon key="type">token</infon>
+ <location offset="186" length="6"></location>
+ <text>kinase</text>
+ </annotation>
+ <annotation id="t14">
+ <infon key="type">token</infon>
+ <location offset="193" length="6"></location>
+ <text>kinase</text>
+ </annotation>
+ <annotation id="t15">
+ <infon key="type">token</infon>
+ <location offset="200" length="1"></location>
+ <text>1</text>
+ </annotation>
+ <annotation id="t16">
+ <infon key="type">token</infon>
+ <location offset="202" length="1"></location>
+ <text>(</text>
+ </annotation>
+ <annotation id="t17">
+ <infon key="type">token</infon>
+ <location offset="203" length="4"></location>
+ <text>MEK1</text>
+ </annotation>
+ <annotation id="t18">
+ <infon key="type">token</infon>
+ <location offset="207" length="1"></location>
+ <text>)</text>
+ </annotation>
+ <annotation id="t19">
+ <infon key="type">token</infon>
+ <location offset="208" length="1"></location>
+ <text>,</text>
+ </annotation>
+ <annotation id="t20">
+ <infon key="type">token</infon>
+ <location offset="210" length="5"></location>
+ <text>which</text>
+ </annotation>
+ <annotation id="t21">
+ <infon key="type">token</infon>
+ <location offset="216" length="2"></location>
+ <text>in</text>
+ </annotation>
+ <annotation id="t22">
+ <infon key="type">token</infon>
+ <location offset="219" length="4"></location>
+ <text>turn</text>
+ </annotation>
+ <annotation id="t23">
+ <infon key="type">token</infon>
+ <location offset="224" length="14"></location>
+ <text>phosphorylates</text>
+ </annotation>
+ <annotation id="t24">
+ <infon key="type">token</infon>
+ <location offset="239" length="3"></location>
+ <text>and</text>
+ </annotation>
+ <annotation id="t25">
+ <infon key="type">token</infon>
+ <location offset="243" length="9"></location>
+ <text>activates</text>
+ </annotation>
+ <annotation id="t26">
+ <infon key="type">token</infon>
+ <location offset="253" length="3"></location>
+ <text>the</text>
+ </annotation>
+ <annotation id="t27">
+ <infon key="type">token</infon>
+ <location offset="257" length="3"></location>
+ <text>MAP</text>
+ </annotation>
+ <annotation id="t28">
+ <infon key="type">token</infon>
+ <location offset="261" length="21"></location>
+ <text>kinases/extracellular</text>
+ </annotation>
+ <annotation id="t29">
+ <infon key="type">token</infon>
+ <location offset="283" length="6"></location>
+ <text>signal</text>
+ </annotation>
+ <annotation id="t30">
+ <infon key="type">token</infon>
+ <location offset="290" length="9"></location>
+ <text>regulated</text>
+ </annotation>
+ <annotation id="t31">
+ <infon key="type">token</infon>
+ <location offset="300" length="7"></location>
+ <text>kinases</text>
+ </annotation>
+ <annotation id="t32">
+ <infon key="type">token</infon>
+ <location offset="307" length="1"></location>
+ <text>,</text>
+ </annotation>
+ <annotation id="t33">
+ <infon key="type">token</infon>
+ <location offset="309" length="4"></location>
+ <text>ERK1</text>
+ </annotation>
+ <annotation id="t34">
+ <infon key="type">token</infon>
+ <location offset="314" length="3"></location>
+ <text>and</text>
+ </annotation>
+ <annotation id="t35">
+ <infon key="type">token</infon>
+ <location offset="318" length="4"></location>
+ <text>ERK2</text>
+ </annotation>
+ <annotation id="t36">
+ <infon key="type">token</infon>
+ <location offset="322" length="1"></location>
+ <text>.</text>
+ </annotation>
+ </sentence>
+ <sentence>
+ <infon key="type">simplified sentence</infon>
+ <offset>325</offset>
+ <annotation id="t37">
+ <infon key="type">token</infon>
+ <location offset="325" length="6"></location>
+ <text>Active</text>
+ </annotation>
+ <annotation id="t38">
+ <infon key="type">token</infon>
+ <location offset="332" length="5"></location>
+ <text>Raf-1</text>
+ </annotation>
+ <annotation id="t39">
+ <infon key="type">token</infon>
+ <location offset="338" length="14"></location>
+ <text>phosphorylates</text>
+ </annotation>
+ <annotation id="t40">
+ <infon key="type">token</infon>
+ <location offset="353" length="4"></location>
+ <text>MEK1</text>
+ </annotation>
+ <annotation id="t41">
+ <infon key="type">token</infon>
+ <location offset="357" length="1"></location>
+ <text>.</text>
+ </annotation>
+ </sentence>
+ <sentence>
+ <infon key="type">simplified sentence</infon>
+ <offset>360</offset>
+ <annotation id="t42">
+ <infon key="type">token</infon>
+ <location offset="360" length="6"></location>
+ <text>Active</text>
+ </annotation>
+ <annotation id="t43">
+ <infon key="type">token</infon>
+ <location offset="367" length="5"></location>
+ <text>Raf-1</text>
+ </annotation>
+ <annotation id="t44">
+ <infon key="type">token</infon>
+ <location offset="373" length="9"></location>
+ <text>activates</text>
+ </annotation>
+ <annotation id="t45">
+ <infon key="type">token</infon>
+ <location offset="383" length="4"></location>
+ <text>MEK1</text>
+ </annotation>
+ <annotation id="t46">
+ <infon key="type">token</infon>
+ <location offset="387" length="1"></location>
+ <text>.</text>
+ </annotation>
+ </sentence>
+ <sentence>
+ <infon key="type">simplified sentence</infon>
+ <offset>390</offset>
+ <annotation id="t47">
+ <infon key="type">token</infon>
+ <location offset="390" length="4"></location>
+ <text>MEK1</text>
+ </annotation>
+ <annotation id="t48">
+ <infon key="type">token</infon>
+ <location offset="395" length="2"></location>
+ <text>in</text>
+ </annotation>
+ <annotation id="t49">
+ <infon key="type">token</infon>
+ <location offset="398" length="4"></location>
+ <text>turn</text>
+ </annotation>
+ <annotation id="t50">
+ <infon key="type">token</infon>
+ <location offset="403" length="14"></location>
+ <text>phosphorylates</text>
+ </annotation>
+ <annotation id="t51">
+ <infon key="type">token</infon>
+ <location offset="418" length="4"></location>
+ <text>ERK1</text>
+ </annotation>
+ <annotation id="t52">
+ <infon key="type">token</infon>
+ <location offset="422" length="1"></location>
+ <text>.</text>
+ </annotation>
+ </sentence>
+ <sentence>
+ <infon key="type">simplified sentence</infon>
+ <offset>425</offset>
+ <annotation id="t53">
+ <infon key="type">token</infon>
+ <location offset="425" length="4"></location>
+ <text>MEK1</text>
+ </annotation>
+ <annotation id="t54">
+ <infon key="type">token</infon>
+ <location offset="430" length="2"></location>
+ <text>in</text>
+ </annotation>
+ <annotation id="t55">
+ <infon key="type">token</infon>
+ <location offset="433" length="4"></location>
+ <text>turn</text>
+ </annotation>
+ <annotation id="t56">
+ <infon key="type">token</infon>
+ <location offset="438" length="14"></location>
+ <text>phosphorylates</text>
+ </annotation>
+ <annotation id="t57">
+ <infon key="type">token</infon>
+ <location offset="453" length="4"></location>
+ <text>ERK2</text>
+ </annotation>
+ <annotation id="t58">
+ <infon key="type">token</infon>
+ <location offset="457" length="1"></location>
+ <text>.</text>
+ </annotation>
+ </sentence>
+ <sentence>
+ <infon key="type">simplified sentence</infon>
+ <offset>460</offset>
+ <annotation id="t59">
+ <infon key="type">token</infon>
+ <location offset="460" length="4"></location>
+ <text>MEK1</text>
+ </annotation>
+ <annotation id="t60">
+ <infon key="type">token</infon>
+ <location offset="465" length="2"></location>
+ <text>in</text>
+ </annotation>
+ <annotation id="t61">
+ <infon key="type">token</infon>
+ <location offset="468" length="4"></location>
+ <text>turn</text>
+ </annotation>
+ <annotation id="t62">
+ <infon key="type">token</infon>
+ <location offset="473" length="9"></location>
+ <text>activates</text>
+ </annotation>
+ <annotation id="t63">
+ <infon key="type">token</infon>
+ <location offset="483" length="4"></location>
+ <text>ERK1</text>
+ </annotation>
+ <annotation id="t64">
+ <infon key="type">token</infon>
+ <location offset="487" length="1"></location>
+ <text>.</text>
+ </annotation>
+ </sentence>
+ <sentence>
+ <infon key="type">simplified sentence</infon>
+ <offset>489</offset>
+ <annotation id="t65">
+ <infon key="type">token</infon>
+ <location offset="489" length="4"></location>
+ <text>MEK1</text>
+ </annotation>
+ <annotation id="t66">
+ <infon key="type">token</infon>
+ <location offset="494" length="2"></location>
+ <text>in</text>
+ </annotation>
+ <annotation id="t67">
+ <infon key="type">token</infon>
+ <location offset="497" length="4"></location>
+ <text>turn</text>
+ </annotation>
+ <annotation id="t68">
+ <infon key="type">token</infon>
+ <location offset="502" length="9"></location>
+ <text>activates</text>
+ </annotation>
+ <annotation id="t69">
+ <infon key="type">token</infon>
+ <location offset="512" length="4"></location>
+ <text>ERK2</text>
+ </annotation>
+ <annotation id="t70">
+ <infon key="type">token</infon>
+ <location offset="516" length="1"></location>
+ <text>.</text>
+ </annotation>
+ </sentence>
+ <!-- equ -->
+ <!-- Active -->
+ <relation id="r0">
+ <infon key="type">equ</infon>
+ <node refid="t0" role="original"></node>
+ <node refid="t37" role="simplified"></node>
+ <node refid="t42" role="simplified"></node>
+ </relation>
+ <!-- RAF-1 -->
+ <relation id="r1">
+ <infon key="type">equ</infon>
+ <node refid="t1" role="original"></node>
+ <node refid="t38" role="simplified"></node>
+ <node refid="t43" role="simplified"></node>
+ </relation>
+ <!-- phosphorylates -->
+ <relation id="r2">
+ <infon key="type">equ</infon>
+ <node refid="t2" role="original"></node>
+ <node refid="t39" role="simplified"></node>
+ </relation>
+ <!-- MEK1 -->
+ <relation id="r3">
+ <infon key="type">equ</infon>
+ <node refid="t17" role="original"></node>
+ <node refid="t40" role="simplified"></node>
+ <node refid="t45" role="simplified"></node>
+ <node refid="t47" role="simplified"></node>
+ <node refid="t53" role="simplified"></node>
+ <node refid="t59" role="simplified"></node>
+ <node refid="t65" role="simplified"></node>
+ </relation>
+ <!-- . -->
+ <relation id="r4">
+ <infon key="type">equ</infon>
+ <node refid="t36" role="original"></node>
+ <node refid="t41" role="simplified"></node>
+ <node refid="t46" role="simplified"></node>
+ <node refid="t52" role="simplified"></node>
+ <node refid="t58" role="simplified"></node>
+ <node refid="t64" role="simplified"></node>
+ <node refid="t70" role="simplified"></node>
+ </relation>
+ <!-- activates -->
+ <relation id="r5">
+ <infon key="type">equ</infon>
+ <node refid="t4" role="original"></node>
+ <node refid="t44" role="simplified"></node>
+ </relation>
+ <!-- in -->
+ <relation id="r6">
+ <infon key="type">equ</infon>
+ <node refid="t21" role="original"></node>
+ <node refid="t48" role="simplified"></node>
+ <node refid="t54" role="simplified"></node>
+ <node refid="t60" role="simplified"></node>
+ <node refid="t66" role="simplified"></node>
+ </relation>
+ <!-- turn -->
+ <relation id="r7">
+ <infon key="type">equ</infon>
+ <node refid="t22" role="original"></node>
+ <node refid="t49" role="simplified"></node>
+ <node refid="t55" role="simplified"></node>
+ <node refid="t61" role="simplified"></node>
+ <node refid="t67" role="simplified"></node>
+ </relation>
+ <!-- phosphorylates -->
+ <relation id="r8">
+ <infon key="type">equ</infon>
+ <node refid="t23" role="original"></node>
+ <node refid="t50" role="simplified"></node>
+ <node refid="t56" role="simplified"></node>
+ </relation>
+ <!-- ERK1 -->
+ <relation id="r9">
+ <infon key="type">equ</infon>
+ <node refid="t33" role="original"></node>
+ <node refid="t51" role="simplified"></node>
+ <node refid="t63" role="simplified"></node>
+ </relation>
+ <!-- ERK2 -->
+ <relation id="r10">
+ <infon key="type">equ</infon>
+ <node refid="t35" role="original"></node>
+ <node refid="t57" role="simplified"></node>
+ <node refid="t69" role="simplified"></node>
+ </relation>
+ <!-- activates -->
+ <relation id="r11">
+ <infon key="type">equ</infon>
+ <node refid="t25" role="original"></node>
+ <node refid="t62" role="simplified"></node>
+ <node refid="t68" role="simplified"></node>
+ </relation>
+ </passage>
+ </document>
+ </collection>
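
The sample file above ties tokens of the original sentence to tokens of the simplified sentences through `relation` elements whose `node` children carry a `refid` and a `role`. As a rough illustration of how a consumer might read this, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. It is not the gem's API; the function names `infons`, `resolve_relations`, and `length_mismatches` are hypothetical, the embedded `SAMPLE` is a trimmed copy of the file above, and the sketch assumes every `refid` points at an annotation (the DTD also allows it to point at another relation).

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample file: one original token, one simplified token.
SAMPLE = """<collection>
  <source>PubMed</source>
  <date>20130316</date>
  <key>PMID-8557975-simplified-sentences-tokens.key</key>
  <document>
    <id>8557975</id>
    <passage>
      <infon key="type">abstract</infon>
      <offset>0</offset>
      <sentence>
        <infon key="type">original sentence</infon>
        <offset>70</offset>
        <annotation id="t0">
          <infon key="type">token</infon>
          <location offset="70" length="6"></location>
          <text>Active</text>
        </annotation>
      </sentence>
      <sentence>
        <infon key="type">simplified sentence</infon>
        <offset>325</offset>
        <annotation id="t37">
          <infon key="type">token</infon>
          <location offset="325" length="6"></location>
          <text>Active</text>
        </annotation>
      </sentence>
      <relation id="r0">
        <infon key="type">equ</infon>
        <node refid="t0" role="original"></node>
        <node refid="t37" role="simplified"></node>
      </relation>
    </passage>
  </document>
</collection>"""


def infons(element):
    """Collect the direct <infon> children into a string-to-string map."""
    return {i.get("key"): (i.text or "") for i in element.findall("infon")}


def resolve_relations(passage):
    """Map each relation id to (role, annotated text) pairs for its nodes."""
    annotations = {a.get("id"): a for a in passage.iter("annotation")}
    resolved = {}
    for rel in passage.findall("relation"):
        resolved[rel.get("id")] = [
            # role defaults to "" in the DTD, so mirror that here
            (n.get("role", ""), annotations[n.get("refid")].findtext("text", ""))
            for n in rel.findall("node")
        ]
    return resolved


def length_mismatches(passage):
    """Ids of single-location annotations whose length differs from the
    utf-8 byte length of their text."""
    bad = []
    for a in passage.iter("annotation"):
        locations = a.findall("location")
        if len(locations) != 1:
            continue  # multi-span or location-free annotations are skipped
        expected = len(a.findtext("text", "").encode("utf-8"))
        if int(locations[0].get("length")) != expected:
            bad.append(a.get("id"))
    return bad


passage = ET.fromstring(SAMPLE).find("document/passage")
print(infons(passage))
print(resolve_relations(passage))
print(length_mismatches(passage))
```

Reading infons into a dictionary follows the DTD comment's note that keys should be unique within each parent element; if a file repeats a key, the last value wins in this sketch.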