simple_bioc 0.0.1

data/xml/pos.key ADDED
@@ -0,0 +1,49 @@
+ pos key
+
+ collection: 10 random PubMed documents with ASCII text split into
+ sentences and tokens by the MedPost tokenizer
+
+ Original source sentence.xml
+
+ source: PubMed
+
+ date: yyyymmdd. Date documents downloaded from PubMed
+
+ document: Title and possibly abstract from a PubMed reference
+
+ id: PubMed id
+
+ passage: Either title or abstract
+
+ infon["type"]: "title" or "abstract"
+
+ offset: The original Unicode byte offsets were not updated after
+ the ASCII conversion.
+
+ PubMed is extracted from an XML file, so literal offsets
+ would not be useful. Title has an offset of zero, while
+ the abstract is assumed to begin after the title and one
+ space. These offsets at least sequence the abstract after
+ the title.
+
+ sentence: One sentence of the passage as determined by the
+ MedPost sentence splitter
+
+ offset: A document offset to where the sentence begins in the
+ passage. Sum of the passage offset and the local offset
+ within the passage.
+
+ annotation: tokens in the sentence with their part-of-speech.
+ the annotations are of "type" "token"
+
+ infon["POS"]: The Penn Treebank part of speech tag as determined
+ by the MedPost biomedical part-of-speech tagger
+
+ location: offset: A document offset to where the annotated text
+ begins in the sentence. Sum of the sentence
+ offset and the local offset within the
+ sentence.
+
+ length: The length of the token.
+
+ text: ASCII text of the token.
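
The offset rules in the key can be illustrated with a small sketch. It is not part of simple_bioc and does not use its API; it relies only on Python's standard library. The sample XML, the document id, the token ids, and the helper names (`abstract_offset`, `document_offset`, `tokens`) are all invented for illustration, and the element layout assumes the usual BioC nesting of collection, document, passage, sentence, and annotation.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the key above; every value here is made up.
SAMPLE = """<collection>
  <source>PubMed</source>
  <date>20120101</date>
  <key>pos.key</key>
  <document>
    <id>0000001</id>
    <passage>
      <infon key="type">title</infon>
      <offset>0</offset>
      <sentence>
        <offset>0</offset>
        <annotation id="T0">
          <infon key="type">token</infon>
          <infon key="POS">NN</infon>
          <location offset="0" length="4"/>
          <text>Gene</text>
        </annotation>
        <annotation id="T1">
          <infon key="type">token</infon>
          <infon key="POS">NN</infon>
          <location offset="5" length="8"/>
          <text>function</text>
        </annotation>
      </sentence>
    </passage>
    <passage>
      <infon key="type">abstract</infon>
      <offset>14</offset>
      <sentence>
        <offset>14</offset>
        <annotation id="T2">
          <infon key="type">token</infon>
          <infon key="POS">PRP</infon>
          <location offset="14" length="2"/>
          <text>We</text>
        </annotation>
      </sentence>
    </passage>
  </document>
</collection>"""


def abstract_offset(title):
    """The abstract is assumed to begin after the title and one space."""
    return len(title) + 1


def document_offset(parent_offset, local_offset):
    """A document offset is the parent's offset plus the local offset."""
    return parent_offset + local_offset


def tokens(xml_text):
    """Return (passage type, POS tag, offset, length, text) for each token."""
    root = ET.fromstring(xml_text)
    rows = []
    for passage in root.iter("passage"):
        passage_type = passage.findtext("infon[@key='type']")
        for annotation in passage.iter("annotation"):
            location = annotation.find("location")
            rows.append((
                passage_type,
                annotation.findtext("infon[@key='POS']"),
                int(location.get("offset")),
                int(location.get("length")),
                annotation.findtext("text"),
            ))
    return rows


if __name__ == "__main__":
    for row in tokens(SAMPLE):
        print(row)
```

In the sample, the title "Gene function" is 13 characters long, so the abstract passage is given offset 14, and the token "function" sits at sentence offset 0 plus local offset 5.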