shalmaneser 1.2.0.rc1 → 1.2.0.rc2

Files changed (30)
  1. checksums.yaml +4 -4
  2. data/README.md +26 -8
  3. data/doc/SB_README +57 -0
  4. data/doc/exp_files_description.txt +160 -0
  5. data/doc/fred.pdf +0 -0
  6. data/doc/index.md +120 -0
  7. data/doc/salsa_tool.pdf +0 -0
  8. data/doc/salsatigerxml.pdf +0 -0
  9. data/doc/shal_doc.pdf +0 -0
  10. data/doc/shal_lrec.pdf +0 -0
  11. data/lib/ext/maxent/Classify.class +0 -0
  12. data/lib/ext/maxent/Train.class +0 -0
  13. data/lib/frprep/TreetaggerInterface.rb +4 -4
  14. data/lib/shalmaneser/version.rb +1 -1
  15. metadata +41 -48
  16. data/test/frprep/test_opt_parser.rb +0 -94
  17. data/test/functional/functional_test_helper.rb +0 -40
  18. data/test/functional/sample_experiment_files/fred_test.salsa.erb +0 -122
  19. data/test/functional/sample_experiment_files/fred_train.salsa.erb +0 -135
  20. data/test/functional/sample_experiment_files/prp_test.salsa.erb +0 -138
  21. data/test/functional/sample_experiment_files/prp_test.salsa.fred.standalone.erb +0 -120
  22. data/test/functional/sample_experiment_files/prp_test.salsa.rosy.standalone.erb +0 -120
  23. data/test/functional/sample_experiment_files/prp_train.salsa.erb +0 -138
  24. data/test/functional/sample_experiment_files/prp_train.salsa.fred.standalone.erb +0 -138
  25. data/test/functional/sample_experiment_files/prp_train.salsa.rosy.standalone.erb +0 -138
  26. data/test/functional/sample_experiment_files/rosy_test.salsa.erb +0 -257
  27. data/test/functional/sample_experiment_files/rosy_train.salsa.erb +0 -259
  28. data/test/functional/test_fred.rb +0 -47
  29. data/test/functional/test_frprep.rb +0 -52
  30. data/test/functional/test_rosy.rb +0 -40
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 83f5f0ca7cc27a632cb46deef7c093df649c61e1
- data.tar.gz: dbc9a29186421206de7bf9b0138f05f89228fad6
+ metadata.gz: 0ca81bfeedc7c124833b61ae3facccf6eb86cecc
+ data.tar.gz: 38ada6d619d55ede1e291a7c1534b1633b08e363
  SHA512:
- metadata.gz: 8a87f1e74b16082cba8d2ab49eb33289e8db23f5bdf3cdd4f294901c8119c8bff1239ec870032871d6d2cf69efbaba500058a47827df92be707aba3ab36ab30a
- data.tar.gz: be1f6b6f3e4aa0b20f26437f30c579faf68f03f7c474cb78e28cb1263ef4ab9397ab4d52fbdffa4ac7ceb50a2d3f44cb4200303a7f14b2bdd0cb06fbfae68f0f
+ metadata.gz: 863962998b66640c61e54a29ceb1a21821ff6f45d20329a22c4fce5284f2877e2e4b4f7ee88fd1cdf37b6e90d68d8b1d1acaac41c1225fe858ce4262f9b638da
+ data.tar.gz: 78334e6fa48815e1fa1dbc502759cb4632655a8bde2b0810ac2afe53c288bc983d5368e50812d576a4604c31acfff8921dc4f1b9707d569a60e0be4facdfd03e
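
To verify a downloaded copy of the gem against these digests, a minimal sketch (a ``.gem`` file is a plain tar archive containing ``metadata.gz`` and ``data.tar.gz``; older RubyGems versions may additionally need ``--pre`` to fetch a prerelease):

    # Fetch the gem and unpack the two inner archives named in checksums.yaml.
    gem fetch shalmaneser -v 1.2.0.rc2
    tar -xf shalmaneser-1.2.0.rc2.gem metadata.gz data.tar.gz
    # Compare the output against the SHA512 entries shown above.
    sha512sum metadata.gz data.tar.gz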
data/README.md CHANGED
@@ -1,17 +1,18 @@
  # [SHALMANESER - a SHALlow seMANtic parSER](http://www.coli.uni-saarland.de/projects/salsa/shal/)
 
 
- [RubyGems](http://rubygems.org/gems/shalmaneser) | [RTT Project Page](http://bu.chsta.be/projects/shalmaneser/) |
+ [RubyGems](http://rubygems.org/gems/shalmaneser) | [Shalmaneser Project Page](http://bu.chsta.be/projects/shalmaneser/) |
  [Source Code](https://github.com/arbox/shalmaneser) | [Bug Tracker](https://github.com/arbox/shalmaneser/issues)
 
  [<img src="https://badge.fury.io/rb/shalmaneser.png" alt="Gem Version" />](http://badge.fury.io/rb/shalmaneser)
  [<img src="https://travis-ci.org/arbox/shalmaneser.png" alt="Build Status" />](https://travis-ci.org/arbox/shalmaneser)
  [<img src="https://codeclimate.com/github/arbox/shalmaneser.png" alt="Code Climate" />](https://codeclimate.com/github/arbox/shalmaneser)
  [<img alt="Bitdeli Badge" src="https://d2weczhvl823v0.cloudfront.net/arbox/shalmaneser/trend.png" />](https://bitdeli.com/free)
+ [![Dependency Status](https://gemnasium.com/arbox/shalmaneser.png)](https://gemnasium.com/arbox/shalmaneser)
 
  ## Description
 
- Please be careful, the whole thing is under construction!
+ Please be careful, the whole thing is under construction! Shalmaneser is not intended to run on Windows systems. For now it has been tested only on Linux.
 
  Shalmaneser is a supervised learning toolbox for shallow semantic parsing, i.e. the automatic assignment of semantic classes and roles to text. The system was developed for Frame Semantics; thus we use Frame Semantics terminology and call the classes frames and the roles frame elements. However, the architecture is reasonably general, and with a certain amount of adaption, Shalmaneser should be usable for other paradigms (e.g., PropBank roles) as well. Shalmaneser caters both for end users, and for researchers.
 
@@ -20,13 +21,13 @@ For end users, we provide a simple end user mode which can simply apply the pre-
  ## Origin
  You can find original versions of Shalmaneser up to ``1.1`` on the [SALSA](http://www.coli.uni-saarland.de/projects/salsa/shal/) project page.
 
- ## Literature
+ ## Publications on Shalmaneser
 
- K. Erk and S. Padó: Shalmaneser - a flexible toolbox for semantic role assignment. Proceedings of LREC 2006, Genoa, Italy. [Click here for details](http://www.nlpado.de/~sebastian/pub/papers/lrec06_erk.pdf).
+ - K. Erk and S. Padó: Shalmaneser - a flexible toolbox for semantic role assignment. Proceedings of LREC 2006, Genoa, Italy. [Click here for details](http://www.nlpado.de/~sebastian/pub/papers/lrec06_erk.pdf).
 
  ## Documentation
 
- The project documentation can be found in our [doc](doc/index.md) folder.
+ The project documentation can be found in our [doc](https://github.com/arbox/shalmaneser/blob/1.2/doc/index.md) folder.
 
  ## Development
 
@@ -40,10 +41,27 @@ We are working now on two branches:
 
  ## Installation
 
- See the installation instructions in the [doc](doc/index.md#installation) folder.
+ See the installation instructions in the [doc](https://github.com/arbox/shalmaneser/blob/1.2/doc/index.md#installation) folder.
 
- ### Machine Learning Systems
+ ### Tokenizers
+
+ - [Ucto](http://ilk.uvt.nl/ucto/)
+
+ ### POS Taggers
+
+ - [TreeTagger](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/)
 
- - http://sourceforge.net/projects/maxent/files/Maxent/2.4.0/
+ ### Lemmatizers
 
+ - [TreeTagger](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/)
+
+ ### Parsers
+
+ - [BerkeleyParser](https://code.google.com/p/berkeleyparser/downloads/list)
+ - [Stanford Parser](http://nlp.stanford.edu/software/lex-parser.shtml)
+ - [Collins Parser](http://www.cs.columbia.edu/~mcollins/code.html)
+
+ ### Machine Learning Systems
 
+ - [OpenNLP MaxEnt](http://sourceforge.net/projects/maxent/files/Maxent/2.4.0/)
+ - [Mallet](http://mallet.cs.umass.edu/index.php)
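
For reference, the prerelease described by this changeset installs directly from RubyGems; a minimal sketch:

    # Install the latest prerelease ...
    gem install shalmaneser --pre
    # ... or pin the exact release candidate.
    gem install shalmaneser -v 1.2.0.rc2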
data/doc/SB_README ADDED
@@ -0,0 +1,57 @@
+ # Before running the programs you should make sure that all components
+ # needed by shalmaneser are installed and that all paths in the
+ # configuration files/code are adapted accordingly
+ # (maybe iterate over all files and grep for "rehbein" to find hard-
+ # coded paths; have a look at all configuration files in SampleExperimentFiles.salsa)
+
+
+ # Directories
+
+ # program_de    -> ruby source code and additional stuff for the German
+ #                  version of shalmaneser
+ # program_de/SampleExperimentFiles.salsa
+ #               -> configuration files for shalmaneser
+ # input         -> includes test data in plain text format
+ # output        -> all temporary files and output files, including the
+ #                  classifiers
+ #
+ # directory output:
+ # prp_test       -> output of frprep.rb (parsed/tagged/lemmatised data)
+ # preprocessed   -> output of frprep.rb (data converted to SalsaTiGerXML)
+ # exp_fred_salsa -> temp files/output of fred.rb (classifiers, features, ...)
+ # exp_fred/output/stxml/ -> output of fred.rb (SalsaTigerXML file with
+ #                           frames)
+ # exp_rosy_salsa -> temp files/output of rosy.rb (classifiers, features, ...)
+ # exp_rosy_salsa/output -> output of rosy.rb
+
+ # Set some variables
+ # => adapt to your program paths
+ DIR=/proj/llx/Annotation/experiments/test/shalmaneser
+ EXP=$DIR/program_de/SampleExperimentFiles.salsa
+
+ export CLASSPATH=/proj/llx/Software/MachineLearning/maxent-2.4.0/lib/trove.jar:/proj/llx/Software/MachineLearning/maxent-2.4.0/output/maxent-2.4.0.jar:/proj/llx/Annotation/experiments/sfischer_bachelor/shalmaneser/program/tools/maxent
+
+
+ # change to shalmaneser directory
+ cd $DIR/program_de
+
+ # Preprocessing
+ # (result: parsed file in SalsaTiGerXML format
+ #  when running on SalsaTiGerXML data: gold frames/roles included
+ #  when running on plain text: without frames/roles)
+
+ ruby frprep.rb -e $EXP/prp_test.salsa
+
+
+ # Frame assignment with fred
+ ruby fred.rb -t featurize -e $EXP/fred_test.salsa -d test
+
+ ruby fred.rb -t test -e $EXP/fred_test.salsa
+
+
+ # Role assignment with rosy
+ ruby rosy.rb -t featurize -e $EXP/rosy.salsa -d test
+
+ ruby rosy.rb -t test -e $EXP/rosy.salsa
+
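
The script above hard-codes site-specific paths under ``/proj/llx``. A minimal sketch of how the variables might be adapted to a local installation; every path below is a placeholder, not part of the release:

    # Placeholder paths -- point DIR at your own checkout and CLASSPATH at your MaxEnt install.
    DIR=$HOME/shalmaneser-experiments
    EXP=$DIR/program_de/SampleExperimentFiles.salsa
    export CLASSPATH=/opt/maxent-2.4.0/lib/trove.jar:/opt/maxent-2.4.0/output/maxent-2.4.0.jar
    cd "$DIR/program_de"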
data/doc/exp_files_description.txt ADDED
@@ -0,0 +1,160 @@
+ = FrPrep
+ prep_experiment_ID => "string", # experiment identifier
+ frprep_directory => "string",   # dir for frprep internal data
+ # information about the dataset
+ language => "string", # en, de
+ origin => "string",   # FrameNet, Salsa, or nothing
+ format => "string",   # Plain, SalsaTab, FNXml, FNCorpusXml, SalsaTigerXML
+ encoding => "string", # utf8, iso, hex, or nothing
+ # directories
+ directory_input => "string",        # dir with input data
+ directory_preprocessed => "string", # dir with output Salsa/Tiger XML data
+ directory_parserout => "string",    # dir with parser output for the parser named below
+
+ # syntactic processing
+ pos_tagger => "string",      # name of POS tagger
+ lemmatizer => "string",      # name of lemmatizer
+ parser => "string",          # name of parser
+ pos_tagger_path => "string", # path to POS tagger
+ lemmatizer_path => "string", # path to lemmatizer
+ parser_path => "string",     # path to parser
+ parser_max_sent_num => "integer", # max number of sentences per parser input file
+ parser_max_sent_len => "integer", # max sentence length the parser handles
+
+ do_parse => "bool",     # use parser?
+ do_lemmatize => "bool", # use lemmatizer?
+ do_postag => "bool",    # use POS tagger?
+
+ # output format: if tabformat_output == true,
+ # output in Tab format rather than Salsa/Tiger XML
+ # (this will not work if do_parse == true)
+ tabformat_output => "bool",
+
+ # syntactic repairs, dependent on existing semantic role annotation
+ fe_syn_repair => "bool", # map words to constituents for FEs: idealize?
+ fe_rel_repair => "bool", # FEs: include non-included relative clauses into FEs
+
+ = Fred
+ experiment_ID => "string", # experiment ID
+ enduser_mode => "bool",    # work in enduser mode? (disallowing many things)
+
+ preproc_descr_file_train => "string", # path to preprocessing files
+ preproc_descr_file_test => "string",
+ directory_output => "string", # path to Salsa/Tiger XML output directory
+
+ verbose => "bool",                    # print diagnostic messages?
+ apply_to_all_known_targets => "bool", # apply to all known targets rather than the ones with a frame?
+
+ fred_directory => "string", # directory for internal info
+ classifier_dir => "string", # write classifiers here
+
+ classifier => "list", # classifiers
+
+ dbtype => "string", # "mysql" or "sqlite"
+
+ host => "string",   # DB access: sqlite only
+ user => "string",
+ passwd => "string",
+ dbname => "string",
+
+ # featurization info
+ feature => "list",              # which features to use for the classifier?
+ binary_classifiers => "bool",   # make binary rather than n-ary classifiers?
+ negsense => "string",           # binary classifier: negative sense is..?
+ numerical_features => "string", # do what with numerical features?
+
+ # what to do with items that have multiple senses?
+ # 'binarize': binary classifiers, and consider positive
+ #             if the sense is among the gold senses
+ # 'join'    : make one joint sense
+ # 'repeat'  : make multiple occurrences of the item, one sense per occ
+ # 'keep'    : keep as separate labels
+ #
+ # multilabel: consider as assigned all labels
+ # above a certain confidence threshold?
+ handle_multilabel => "string",
+ assignment_confidence_threshold => "float",
+
+ # single-sentence context?
+ single_sent_context => "bool",
+
+ # noncontiguous input? then we need access to a larger corpus
+ noncontiguous_input => "bool",
+ larger_corpus_dir => "string",
+ larger_corpus_format => "string",
+ larger_corpus_encoding => "string"
+
+ [ # variables
+   "train",
+   "exp_ID"
+ ]
+
+ = Rosy
+ # features
+ feature => "list",
+ classifier => "list",
+
+ verbose => "bool",
+ enduser_mode => "bool",
+
+ experiment_ID => "string",
+
+ directory_input_train => "string",
+ directory_input_test => "string",
+ directory_output => "string",
+
+ preproc_descr_file_train => "string",
+ preproc_descr_file_test => "string",
+ external_descr_file => "string",
+
+ dbtype => "string", # "mysql" or "sqlite"
+
+ host => "string",   # DB access: sqlite only
+ user => "string",
+ passwd => "string",
+ dbname => "string",
+
+ data_dir => "string",  # for external use
+ rosy_dir => "pattern", # for internal use only, set by rosy.rb
+
+ classifier_dir => "string", # if present, special directory for classifiers
+
+ classif_column_name => "string",
+ main_table_name => "pattern",
+ test_table_name => "pattern",
+
+ eval_file => "pattern",
+ log_file => "pattern",
+ failed_file => "pattern",
+ classifier_file => "pattern",
+ classifier_output_file => "pattern",
+ noval => "string",
+
+
+ split_nones => "bool",
+ print_eval_log => "bool",
+ assume_argrec_perfect => "bool",
+ xwise_argrec => "string",
+ xwise_arglab => "string",
+ xwise_onestep => "string",
+
+ fe_syn_repair => "bool", # map words to constituents for FEs: idealize?
+ fe_rel_repair => "bool", # FEs: include non-included relative clauses into FEs
+
+ prune => "string", # pruning prior to argrec?
+
+ ["exp_ID", "test_ID", "split_ID", "feature_name", "classif", "step",
+  "group", "dataset", "mode"] # variables
+
+ = External Config Data
+
+ directory => "string", # features
+
+ experiment_id => "string",
+
+ gfmap_restrict_to_downpath => "bool",
+ gfmap_restrict_pathlen => "integer",
+ gfmap_remove_gf => "list"
+
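
These options end up in the plain-text experiment files that frprep, fred and rosy receive via ``-e``. A purely illustrative frprep sketch follows: the keys are taken from the list above, but the ``key = value`` syntax should be checked against the sample experiment files shipped with the release, and every value is a placeholder:

    # Hypothetical preprocessing experiment file; all IDs and paths are placeholders.
    prep_experiment_ID = prp_test
    frprep_directory = /home/me/shalm/frprep_data
    language = en
    format = Plain
    directory_input = /home/me/shalm/input
    directory_preprocessed = /home/me/shalm/preprocessed
    do_postag = true
    do_lemmatize = true
    do_parse = false
    pos_tagger = treetagger
    pos_tagger_path = /opt/treetagger
    lemmatizer = treetagger
    lemmatizer_path = /opt/treetagger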
data/doc/fred.pdf ADDED
Binary file
data/doc/index.md ADDED
@@ -0,0 +1,120 @@
+ # Shalmaneser Documentation Index
+
+ ## Prerequisites
+
+ You need the following items installed on your system:
+ - [Ruby](https://www.ruby-lang.org/en/downloads/), at least version ``1.8.7`` (please note that version ``1.8.7`` is deprecated; future Shalmaneser incarnations will run only under Ruby ``1.9.x`` or later)
+ - a MySQL database server; your database must be large enough to hold the test data (in end user mode) plus any training data (for training new models in manual mode), e.g. training on the complete FrameNet 1.2 dataset requires about 1.5 GB of free space
+ - if you don't want to train classifiers from your own data, you need to download suitable classifiers for the available configurations from our homepage (see the links below)
+ - preprocessing tools for your language, at least the ones required for the use of pre-trained classifiers. Currently Shalmaneser provides interfaces for the following systems:
+ <table>
+ <tr>
+ <th>System</th><th>Version</th>
+ </tr>
+ <tr>
+ <td>TreeTagger</td><td>README from 09.04.96</td>
+ </tr>
+ <tr>
+ <td>Collins Parser</td><td>1.0</td>
+ </tr>
+ <tr>
+ <td>Berkeley Parser</td><td>latest</td>
+ </tr>
+ <tr>
+ <td>Stanford Parser</td><td>latest</td>
+ </tr>
+ </table>
+
+ - at least one machine learning system. Currently Shalmaneser provides interfaces for the following systems:
+ <table>
+ <tr>
+ <th>System</th><th>Version</th>
+ </tr>
+ <tr>
+ <td>OpenNLP MaxEnt</td><td>2.4.0</td>
+ </tr>
+ <tr>
+ <td>TiMBL</td><td>Timbl5</td>
+ </tr>
+ <tr>
+ <td>Mallet</td><td>Mallet 0.4</td>
+ </tr>
+ </table>
+
+ Note: Please make sure you run the system in a terminal with Unicode encoding (``export LANG=en_US.UTF-8``).
+
+ ## Setting up Shalmaneser on your system
+
+ ### MySQL Database
+
+ You need an instance of MySQL Server running on your system. Possibly such a server is already available at your site, either locally or on a remote machine. If not, please install one (e.g. on Debian based systems):
+
+     $ sudo aptitude install mysql-server mysql-client
+
+ During the installation you'll be prompted for the root password.
+
+ Log in to the MySQL management console:
+
+     $ mysql -u root -p
+
+ You will be asked for the ``root`` password. The following commands assume a local installation of MySQL.
+
+ Create a new user for Shalmaneser (or use an existing one if it complies with your security policy):
+
+     mysql> CREATE USER 'shalm'@'localhost' IDENTIFIED BY 'shalmpassword';
+
+ Feel free to change the username and the password.
+
+ Create at least one database for Shalmaneser (it is convenient to use several databases to reuse experiment results):
+
+     mysql> CREATE DATABASE shalmaneser;
+
+ Give your new user rights on the new database and (for older MySQL versions) flush the privileges:
+
+     mysql> GRANT ALL PRIVILEGES ON shalmaneser.* TO 'shalm'@'localhost';
+     mysql> FLUSH PRIVILEGES; # Not needed on newer systems.
+
+ The ``username``, the ``password`` and the ``database name`` are essential for the experiment file declarations.
+
+ ### TreeTagger
+ Download the TreeTagger archive from the official [site](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) by Helmut Schmid and uncompress it to your favorite location, preserving the initial directory structure. The path to the root directory is essential for the experiment file declarations. Shalmaneser expects the following directory structure:
+
+     TreeTaggerRootDirectory/
+     |_ bin/
+     |  |_ tree-tagger
+     |_ lib/
+     |  |_ english.par
+     |  |_ german.par
+     |_ cmd/
+        |_ filter-german-tags
+
+ If you cannot name the binary or the model (the ``.par`` file) as given above, please set the following environment variables: ``SHALM_TREETAGGER_BIN`` and ``SHALM_TREETAGGER_MODEL``.
+
+ Please do not use Unicode models for TreeTagger for now! We'll change this dependency in the future.
+
+ ### Berkeley Parser
+ Download the Berkeley Parser archive from the official [site](https://code.google.com/p/berkeleyparser/downloads/list) at Google Code and uncompress it to your favorite location. The path to the root directory is essential for the experiment file declarations. Shalmaneser expects the following directory structure:
+
+     BerkeleyRootDirectory/
+     |_ berkeleyParser.jar
+     |_ grammar.gr
+
+ If you cannot name the binary and/or the model as given above, please set the following environment variables: ``SHALM_BERKELEY_BIN`` and ``SHALM_BERKELEY_MODEL``.
+
+
+ ### Stanford Parser
+
+ Download the Stanford Parser archive from the official [site](http://nlp.stanford.edu/software/lex-parser.shtml) and uncompress it to your favorite location. The path to the root directory is essential for the experiment file declarations. Shalmaneser expects the following directory structure:
+
+     StanfordRootDirectory/
+     |_ stanford_parser.jar
+     |_ stanford_parser-x.y.z-models.jar
+
+ ### OpenNLP MaxEnt
+ Download the MaxEnt archive from the official [site](http://sourceforge.net/projects/maxent/files/Maxent/2.4.0/) on SourceForge and uncompress it to your favorite location. Set ``JAVA_HOME`` if it isn't set on your system. Run ``build.sh`` in the MaxEnt root directory.
+
+ The path to the root directory is essential for the experiment file declarations. Shalmaneser expects the following directory structure:
+
+     MaxEntRootDirectory/
+     |_ output/
+     |_ classes/
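
Several sections above mention environment variables and ``JAVA_HOME``. A minimal shell sketch collecting them in one place; every path is a placeholder for your own installation locations:

    # Placeholder locations -- adjust to wherever you unpacked each tool.
    export SHALM_TREETAGGER_BIN=/opt/treetagger/bin/tree-tagger
    export SHALM_TREETAGGER_MODEL=/opt/treetagger/lib/english.par
    export SHALM_BERKELEY_BIN=/opt/berkeleyparser/berkeleyParser.jar
    export SHALM_BERKELEY_MODEL=/opt/berkeleyparser/grammar.gr
    export JAVA_HOME=/usr/lib/jvm/default-java
    export LANG=en_US.UTF-8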
data/doc/salsa_tool.pdf ADDED
Binary file
data/doc/salsatigerxml.pdf ADDED
Binary file
data/doc/shal_doc.pdf ADDED
Binary file
data/doc/shal_lrec.pdf ADDED
Binary file
data/lib/ext/maxent/Classify.class CHANGED
Binary file
data/lib/ext/maxent/Train.class CHANGED
Binary file
data/lib/frprep/TreetaggerInterface.rb CHANGED
@@ -117,13 +117,13 @@ class TreetaggerInterface < SynInterfaceTab
    include TreetaggerModule
 
    ###
-   def TreetaggerInterface.system()
-     return "treetagger"
+   def self.system
+     'treetagger'
    end
 
    ###
-   def TreetaggerInterface.service()
-     return "lemmatizer"
+   def self.service
+     'lemmatizer'
    end
 
    ###