RubyGems - opener-opinion-detector-base - Versions diffs - 2.0.1 → 2.1.2 - Mend

opener-opinion-detector-base 2.0.1 → 2.1.2

Files changed (67)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 9c8aef27fcd7c10ed7176b0a73bec253436c9efb
-  data.tar.gz: 161e62461eded780c02c261e4f980f78ba0577e6
+  metadata.gz: 8cab19f98d9ee9c6ae4938a3be0cebb666126e44
+  data.tar.gz: d83ded3deb19fe5b7cead1ba079cb6bf585c9593
 SHA512:
-  metadata.gz: 5723e548b534b9a743646de5e08adb3d59715b77bd931f381082ca3ee8924db9ed433f7655af6a9d10cef402350c9f60c4ef89afec5c2b87c90e6504a5cdb01d
-  data.tar.gz: 3f9e5386c7fb1400d4204c232e1498b44af8958ef576cd51441b25f707f65cc331785a3b6bff86b8e36f0e854feef817b270cd0d76727eb4a95a7346d40ca0dc
+  metadata.gz: bb3d6a9b3d9d6fd3a3fa496b6a2de948398d250e49540dd89bd24d1efe98e9837e7bcb5dfef3b973c5fb069f4425dca2cf2c01d66bfdb148884d8af20a0c8792
+  data.tar.gz: ac1f7d71ec3160f279b0e2c8954ea1ee5bf29f56630d7e249a16c46c3498a091a511df97f89606d1bbaece012e011aff0bdd30bd50b84643e0993d8f9598f68a

data/core/python-scripts/README.md CHANGED

@@ -2,14 +2,13 @@
 ##Introduction##
 Opinion miner based on machine learning that can be trained using a list of
 KAF/NAF files. It is important to notice that the opinion miner module will not call
 to any external module to obtain features. It will read all the features from the input KAF/NAF file,
 so you have to make sure that your input file contains all the required information in advance (tokens,
-terms, polarities, constituents, entitiess, dependencies...)
+terms, polarities, constituents, entitiess, dependencies...).
-The task is divided into 2 steps
+The task is general divided into 2 steps
 * Detection of opinion entities (holder, target and expression): using
 Conditional Random Fields
 * Opinion entity linking (expression<-target and expression-<holder): using
@@ -79,6 +78,82 @@ of CRFsuite and SVMLight. This file will be passed to the main script to detect
 cat my_file.kaf | classify_kaf_naf_file.py your_config_file.cfg
 ````
+There are two basic functionalities:
+* Training: from a corpus of opinion annotated files, induce and learn the models for detecting opinions
+* Classification: using the previous models, find and extract opinions in new text files.
+We provide models already trained and evaluated on hotel, news, attractions and restaurants domains for all the languages covered
+by the OpeNER project. Most of the users will just focus on this classification step, using the models that we provide. Some others
+will need to retrain the system to adapt it to a new domain or language. In the next sections we will introduce these 2 differents
+usages of the opinion miner deluxe
+##Classification##
+In this case you have the models already trained (either you trained them yourself or got the ones we provide) and you want just to detect
+the opinions in a new file. The input format of your file needs to be valid KAF format. The script that perfoms the classification is the script
+`classify_kaf_naf_file.py`. You can get information about the available parameters by running the script with the parameter -h.
+```shell
+classify_kaf_naf_file.py -h
+usage: classify_kaf_naf_file.py [-h]
+                                (-m MODEL_FOLDER | -d DOMAIN | -show-models)
+                                [-keep-opinions] [-no-time]
+Detect opinion triples in a KAF/NAF file
+optional arguments:
+  -h, --help       show this help message and exit
+  -m MODEL_FOLDER  Folder storing the trained models
+  -d DOMAIN        The domain where the models were trained
+  -show-models     Show the models available and finish
+  -keep-opinions   Keep the opinions from the input (by default will be deleted)
+  -no-time         No include time in timestamp (for testing)
+```
+The script reads the input KAF file from the standard input and will write the output KAF into the standard output. The main parameter is the model that
+will be used. There are two ways of specifyng this parameter:
+* By using the -m FOLDER option, by means of which we can specify that we would like to use exactly the folder stored in the path FOLDER
+* By using the -d DOMAIN option, where DOMAIN is the domain where the model that we want to use was trained.
+We can get which are the models available by running:
+```shell
+classify_kaf_naf_file.py -show-models
+#########################
+Models available
+#########################
+  Model 0
+    Lang: en
+    Domain: hotel
+    Folder: final_models/en/hotel_cfg1
+    Desc: Trained with config1 in the last version of hotel annotations
+  Model 1
+    Lang: en
+    Domain: news
+    Folder: final_models/en/news_cfg1
+    Desc: Trained with config1 using only the sentences annotated with news
+....
+....
+```
+You can train as use as many models as you want. You will need the file `models.cfg` which contains the metadata about which models
+are available and how to refer to them (the domain). This is an example of the content of this file:
+```shell
+#LANG|domain|pathtomodel|description
+en|hotel|final_models/en/hotel_cfg1|Trained with config1 in the last version of hotel annotations
+en|news|final_models/en/news_cfg1|Trained with config1 using only the sentences annotated with news
+nl|hotel|final_models/nl/hotel_cfg1|Trained with config1 in the last version of hotel annotations
+nl|news|final_models/nl/news_cfg1|Trained with config1 using only the sentences annotated with news
+```
+So in each line a model is specified and represented using 4 fields, the language, the domain identifier (which will be used later to refer to this model),
+the path to the folder and a text with a description. The language for the KAF file will be read directly from the KAF header, and considering this model
+and the domain id provided to the script, the proper model will be loaded and used.
+So if you want to tag a file with Dutch text called input.nl.kaf with the models trained on hotel reviews, and store the result on the file output.nl.kaf you just
+should call to the program as:
+```shell
+cat input.nl.kaf | python classify_kaf_naf_file.py -d hotel > output.nl.kaf
+```
 ##Training your own models##
 You will need first to install all the requirementes given and then follow these steps:
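
The README hunk above describes `models.cfg` as a pipe-separated file with four fields per line (language, domain identifier, path to the model folder, description), where the language read from the KAF header and the `-d DOMAIN` value together select a model. The lookup can be sketched as follows. This is a standalone Python 3 illustration under those stated assumptions, not the script's own `obtain_predefined_model`; the names `ModelEntry`, `parse_models_cfg` and `select_model` are hypothetical.

```python
from typing import List, NamedTuple, Optional


class ModelEntry(NamedTuple):
    lang: str
    domain: str
    folder: str
    description: str


def parse_models_cfg(text: str) -> List[ModelEntry]:
    """Parse models.cfg content: one model per line, fields separated by '|'."""
    entries = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Blank lines and '#' comment lines (such as the header) are skipped.
        if not line or line.startswith("#"):
            continue
        fields = line.split("|", 3)
        if len(fields) != 4:
            continue  # Assumption of this sketch: ignore malformed lines.
        entries.append(ModelEntry(*(field.strip() for field in fields)))
    return entries


def select_model(entries: List[ModelEntry], lang: str, domain: str) -> Optional[str]:
    """Return the folder of the first entry matching language and domain."""
    for entry in entries:
        if entry.lang == lang and entry.domain == domain:
            return entry.folder
    return None


SAMPLE = """#LANG|domain|pathtomodel|description
en|hotel|final_models/en/hotel_cfg1|Trained with config1 in the last version of hotel annotations
nl|hotel|final_models/nl/hotel_cfg1|Trained with config1 in the last version of hotel annotations
"""
```

With the sample above, `select_model(parse_models_cfg(SAMPLE), "nl", "hotel")` picks the Dutch hotel model folder, which corresponds to the `-d hotel` call on a Dutch input file shown in the README.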

data/core/python-scripts/classify_kaf_naf_file.py CHANGED

@@ -8,7 +8,7 @@ this_folder = os.path.dirname(os.path.realpath(__file__))
 # This updates the load path to ensure that the local site-packages directory
 # can be used to load packages (e.g. a locally installed copy of lxml).
-sys.path.append(os.path.join(this_folder, '../site-packages/pre_build'))
+sys.path.append(os.path.join(this_folder, '../site-packages/pre_install'))
 import csv
 from tempfile import NamedTemporaryFile
@@ -20,17 +20,16 @@ import argparse
 from scripts import lexicons as lexicons_manager
 from scripts.config_manager import Cconfig_manager
 from scripts.extract_features import extract_features_from_kaf_naf_file
-from scripts.crfutils import extract_features_to_crf
+from scripts.crfutils import extract_features_to_crf
 from scripts.link_entities_distance import link_entities_distance
 from scripts.relation_classifier import link_entities_svm
-from KafNafParser import *
-from VUA_pylib import *
+from KafNafParserPy import *
 DEBUG=0
 my_config_manager = Cconfig_manager()
-__this_folder = os.getcwd()
+__this_folder = os.path.dirname(os.path.realpath(__file__))
 separator = '\t'
 __desc = 'Deluxe opinion miner (CRF+SVM)'
 __last_edited = '10jan2014'
@@ -59,7 +58,7 @@ def match_crfsuite_out(crfout,list_token_ids):
             if inside:
                 matches.append((current,current_type))
                 current = []
-                inside = False
+                inside = False
         else:
             if line=='O':
                 if inside:
@@ -73,8 +72,8 @@ def match_crfsuite_out(crfout,list_token_ids):
                     if inside:
                         matches.append((current,current_type))
                     current = [list_token_ids[num_token]]
-                    inside = True
-                    current_type = value
+                    inside = True
+                    current_type = value
                 elif my_type == 'I':
                     if inside:
                         current.append(list_token_ids[num_token])
@@ -92,42 +91,42 @@ def match_crfsuite_out(crfout,list_token_ids):
 def extract_features(kaf_naf_obj):
     feat_file_desc = NamedTemporaryFile(delete=False)
     feat_file_desc.close()
     out_file = feat_file_desc.name
     err_file = out_file+'.log'
     expressions_lexicon = None
     targets_lexicon = None
     if my_config_manager.get_use_training_lexicons():
         expression_lexicon_filename = my_config_manager.get_expression_lexicon_filename()
         target_lexicon_filename = my_config_manager.get_target_lexicon_filename()
         expressions_lexicon = lexicons_manager.load_lexicon(expression_lexicon_filename)
         targets_lexicon =lexicons_manager.load_lexicon(target_lexicon_filename)
     #def extract_features_from_kaf_naf_file(knaf_obj,out_file=None,log_file=None,include_class=True,accepted_opinions=None, exp_lex= None):
     labels, separator,polarities_skipped = extract_features_from_kaf_naf_file(kaf_naf_obj,out_file,err_file,include_class=False, exp_lex=expressions_lexicon,tar_lex=targets_lexicon)
     return out_file, err_file
 def convert_to_crf(input_file,templates):
     out_desc = NamedTemporaryFile(delete=False)
     out_desc.close()
     out_crf = out_desc.name
     ##Load description of features
     path_feat_desc = my_config_manager.get_feature_desc_filename()
     fic = open(path_feat_desc)
     fields = fic.read().strip()
     fic.close()
     ####
     extract_features_to_crf(input_file,out_crf,fields,separator,templates,possible_classes=None)
     return out_crf
 def run_crfsuite_tag(input_file,model_file):
     crfsuite = my_config_manager.get_crfsuite_binary()
     cmd = [crfsuite]
@@ -150,8 +149,8 @@ def run_crfsuite_tag(input_file,model_file):
 def detect_expressions(tab_feat_file,list_token_ids):
     #1) Convert to the correct CRF
-    templates = my_config_manager.get_templates_expr()
+    templates = my_config_manager.get_templates_expr()
     crf_exp_file = convert_to_crf(tab_feat_file,templates)
     logging.debug('File with crf format for EXPRESSIONS '+crf_exp_file)
     if DEBUG:
@@ -161,10 +160,10 @@ def detect_expressions(tab_feat_file,list_token_ids):
         print>>sys.stderr,f.read()
         f.close()
         print>>sys.stderr,'#'*50
     model_file = my_config_manager.get_filename_model_expression()
     output_crf,error_crf = run_crfsuite_tag(crf_exp_file,model_file)
     logging.debug('Expressions crf error: '+error_crf)
     matches_exp = match_crfsuite_out(output_crf, list_token_ids)
     if DEBUG:
@@ -175,19 +174,19 @@ def detect_expressions(tab_feat_file,list_token_ids):
         print>>sys.stderr,'MATCHES:',str(matches_exp)
         print>>sys.stderr,'TEMP FILE:',crf_exp_file
         print>>sys.stderr,'#'*50
     logging.debug('Detector expressions out: '+str(matches_exp))
     os.remove(crf_exp_file)
     return matches_exp
 def detect_targets(tab_feat_file, list_token_ids):
     templates_target =  my_config_manager.get_templates_target()
     crf_target_file = convert_to_crf(tab_feat_file,templates_target)
     logging.debug('File with crf format for TARGETS '+crf_target_file)
     if DEBUG:
@@ -197,13 +196,13 @@ def detect_targets(tab_feat_file, list_token_ids):
         print>>sys.stderr,f.read()
         f.close()
         print>>sys.stderr,'#'*50
     model_target_file = my_config_manager.get_filename_model_target()
     out_crf_target,error_crf = run_crfsuite_tag(crf_target_file, model_target_file)
     logging.debug('TARGETS crf error: '+error_crf)
     matches_tar = match_crfsuite_out(out_crf_target, list_token_ids)
     if DEBUG:
         print>>sys.stderr,'#'*50
         print>>sys.stderr,'CRF output for TARGETS'
@@ -211,18 +210,18 @@ def detect_targets(tab_feat_file, list_token_ids):
         print>>sys.stderr,'List token ids:',str(list_token_ids)
         print>>sys.stderr,'MATCHES:',str(matches_tar)
         print>>sys.stderr,'#'*50
     logging.debug('Detector targets out: '+str(matches_tar))
     os.remove(crf_target_file)
     return matches_tar
 def detect_holders(tab_feat_file, list_token_ids):
     templates_holder = my_config_manager.get_templates_holder()
     crf_holder_file = convert_to_crf(tab_feat_file,templates_holder)
     logging.debug('File with crf format for HOLDERS '+crf_holder_file)
     if DEBUG:
@@ -232,7 +231,7 @@ def detect_holders(tab_feat_file, list_token_ids):
         print>>sys.stderr,f.read()
         f.close()
         print>>sys.stderr,'#'*50
     model_holder_file = my_config_manager.get_filename_model_holder()
     out_crf_holder,error_crf = run_crfsuite_tag(crf_holder_file, model_holder_file)
     logging.debug('HOLDERS crf error: '+error_crf)
@@ -246,12 +245,12 @@ def detect_holders(tab_feat_file, list_token_ids):
         print>>sys.stderr,'List token ids:',str(list_token_ids)
         print>>sys.stderr,'MATCHES:',str(matches_holder)
         print>>sys.stderr,'#'*50
     logging.debug('Detector HOLDERS out: '+str(matches_holder))
     os.remove(crf_holder_file)
     return matches_holder
@@ -267,19 +266,19 @@ def map_tokens_to_terms(list_tokens,knaf_obj):
                     terms_for_token[tokid] = [termid]
                 else:
                     terms_for_token[tokid].append(termid)
     ret = set()
     for my_id in list_tokens:
         term_ids = terms_for_token[my_id]
         ret |= set(term_ids)
     return sorted(list(ret))
 def add_opinions_to_knaf(triples,knaf_obj,text_for_tid,ids_used, map_to_terms=True,include_polarity_strength=True):
     num_opinion =  0
     for type_exp, span_exp, span_tar, span_hol in triples:
-        #Map tokens to terms
+        #Map tokens to terms
         if map_to_terms:
             span_exp_terms = map_tokens_to_terms(span_exp,kaf_obj)
             span_tar_terms = map_tokens_to_terms(span_tar,kaf_obj)
@@ -288,16 +287,16 @@ def add_opinions_to_knaf(triples,knaf_obj,text_for_tid,ids_used, map_to_terms=Tr
             span_hol_terms = span_hol
             span_tar_terms = span_tar
             span_exp_terms = span_exp
         ##Creating holder
         span_hol = Cspan()
         span_hol.create_from_ids(span_hol_terms)
         my_hol = Cholder()
         my_hol.set_span(span_hol)
         hol_text = ' '.join(text_for_tid[tid] for tid in span_hol_terms)
         my_hol.set_comment(hol_text)
         #Creating target
         span_tar = Cspan()
         span_tar.create_from_ids(span_tar_terms)
@@ -318,7 +317,7 @@ def add_opinions_to_knaf(triples,knaf_obj,text_for_tid,ids_used, map_to_terms=Tr
         exp_text = ' '.join(text_for_tid[tid] for tid in span_exp_terms)
         my_exp.set_comment(exp_text)
         #########################
         #To get the first possible ID not already used
         new_id = None
         while True:
@@ -332,32 +331,33 @@ def add_opinions_to_knaf(triples,knaf_obj,text_for_tid,ids_used, map_to_terms=Tr
         new_opinion.set_id(new_id)
         if len(span_hol_terms) != 0:    #To avoid empty holders
             new_opinion.set_holder(my_hol)
         if len(span_tar_terms) != 0:    #To avoid empty targets
             new_opinion.set_target(my_tar)
         new_opinion.set_expression(my_exp)
         knaf_obj.add_opinion(new_opinion)
 ##
 # Input_file_stream can be a filename of a stream
 # Opoutfile_trasm can be a filename of a stream
 #Config file must be a string filename
 def tag_file_with_opinions(input_file_stream, output_file_stream,model_folder,kaf_obj=None, remove_existing_opinions=True,include_polarity_strength=True,timestamp=True):
     config_filename = os.path.join(model_folder)
     if not os.path.exists(config_filename):
         print>>sys.stderr,'Config file not found on:',config_filename
         sys.exit(-1)
     my_config_manager.set_current_folder(__this_folder)
     my_config_manager.set_config(config_filename)
     if kaf_obj is not None:
         knaf_obj = kaf_obj
     else:
         knaf_obj = KafNafParser(input_file_stream)
     #Create a temporary file
     out_feat_file, err_feat_file = extract_features(knaf_obj)
     if DEBUG:
@@ -367,7 +367,7 @@ def tag_file_with_opinions(input_file_stream, output_file_stream,model_folder,ka
         print>>sys.stderr,f.read()
         f.close()
         print>>sys.stderr,'#'*50
     #get all the tokens in order
     list_token_ids = []
     text_for_wid = {}
@@ -378,67 +378,67 @@ def tag_file_with_opinions(input_file_stream, output_file_stream,model_folder,ka
         s_id = token_obj.get_sent()
         w_id = token_obj.get_id()
         text_for_wid[w_id] = token
         list_token_ids.append(w_id)
         sentence_for_token[w_id] = s_id
     for term in knaf_obj.get_terms():
         tid = term.get_id()
         toks = [text_for_wid.get(wid,'') for wid in term.get_span().get_span_ids()]
         text_for_tid[tid] = ' '.join(toks)
     expressions = detect_expressions(out_feat_file,list_token_ids)
     targets = detect_targets(out_feat_file, list_token_ids)
     holders = detect_holders(out_feat_file, list_token_ids)
     os.remove(out_feat_file)
     os.remove(err_feat_file)
     if DEBUG:
         print>>sys.stderr,"Expressions detected:"
         for e in expressions:
-            print>>sys.stderr,'\t',e, ' '.join([text_for_wid[wid] for wid in e[0] ])
+            print>>sys.stderr,'\t',e, ' '.join([text_for_wid[wid] for wid in e[0] ])
         print>>sys.stderr
         print>>sys.stderr,'Targets detected'
         for t in targets:
-            print>>sys.stderr,'\t',t, ' '.join([text_for_wid[wid] for wid in t[0] ])
+            print>>sys.stderr,'\t',t, ' '.join([text_for_wid[wid] for wid in t[0] ])
         print>>sys.stderr
         print>>sys.stderr,'Holders',holders
         for h in holders:
-            print>>sys.stderr,'\t',h, ' '.join([text_for_wid[wid] for wid in h[0] ])
+            print>>sys.stderr,'\t',h, ' '.join([text_for_wid[wid] for wid in h[0] ])
         print>>sys.stderr
     # Entity linker based on distances
     ####triples = link_entities_distance(expressions,targets,holders,sentence_for_token)
     triples = link_entities_svm(expressions, targets, holders, knaf_obj, my_config_manager)
     ids_used = set()
     if remove_existing_opinions:
         knaf_obj.remove_opinion_layer()
     else:
         for opi in knaf_obj.get_opinions():
             ids_used.add(opi.get_id())
-    add_opinions_to_knaf(triples, knaf_obj,text_for_tid,ids_used, map_to_terms=False,include_polarity_strength=include_polarity_strength)
+    add_opinions_to_knaf(triples, knaf_obj,text_for_tid,ids_used, map_to_terms=False,include_polarity_strength=include_polarity_strength)
     #Adding linguistic processor
     my_lp = Clp()
     my_lp.set_name(__desc)
     my_lp.set_version(__last_edited+'_'+__version)
     if timestamp:
-      my_lp.set_timestamp()   ##Set to the current date and time
+        my_lp.set_timestamp()   ##Set to the current date and time
     else:
-      my_lp.set_timestamp('*')
+        my_lp.set_timestamp('*')
     knaf_obj.add_linguistic_processor('opinions',my_lp)
     knaf_obj.dump(output_file_stream)
 def obtain_predefined_model(lang,domain,just_show=False):
     #This function will read the models from the file models.cfg and will return
@@ -451,7 +451,7 @@ def obtain_predefined_model(lang,domain,just_show=False):
         print '#'*25
         print 'Models available'
         print '#'*25
     nm = 0
     for line in fic:
         if line[0]!='#':
@@ -471,15 +471,15 @@ def obtain_predefined_model(lang,domain,just_show=False):
     if just_show:
          print '#'*25
     return use_this_model
 if __name__ == '__main__':
     argument_parser = argparse.ArgumentParser(description='Detect opinion triples in a KAF/NAF file')
     group = argument_parser.add_mutually_exclusive_group(required=True)
     group.add_argument('-m',dest='model_folder',help='Folder storing the trained models')
     group.add_argument('-d', dest='domain',help='The domain where the models were trained')
     group.add_argument('-show-models', dest='show_models', action='store_true',help='Show the models available and finish')
     argument_parser.add_argument('-keep-opinions',dest='keep_opinions',action='store_true',help='Keep the opinions from the input (by default will be deleted)')
     argument_parser.add_argument('-no-time',dest='timestamp',action='store_false',help='No include time in timestamp (for testing)')
     arguments = argument_parser.parse_args()
@@ -487,7 +487,7 @@ if __name__ == '__main__':
     if arguments.show_models:
         obtain_predefined_model(None,None,just_show=True)
         sys.exit(0)
     knaf_obj = KafNafParser(sys.stdin)
     model_folder = None
     if arguments.model_folder is not None:
@@ -496,12 +496,12 @@ if __name__ == '__main__':
         #Obtain the language
         lang = knaf_obj.get_language()
         model_folder = obtain_predefined_model(lang,arguments.domain)
     tag_file_with_opinions(None, sys.stdout,model_folder,kaf_obj=knaf_obj,remove_existing_opinions=(not arguments.keep_opinions),timestamp=arguments.timestamp)
     sys.exit(0)
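
One behavioural change in this file is that `__this_folder` is now derived from the script's own location instead of `os.getcwd()`, and that value is what gets passed to `my_config_manager.set_current_folder`. A plausible reading, stated here as an assumption, is that relative paths should no longer depend on the directory the script is invoked from. The difference can be sketched in standalone Python 3; the helper names `script_directory` and `resolve_from` are hypothetical and do not appear in the gem:

```python
import os


def script_directory(script_path: str) -> str:
    """Directory containing the script, with symlinks resolved.

    Mirrors os.path.dirname(os.path.realpath(__file__)) from the diff,
    but takes the path as an argument so it can be called from anywhere.
    """
    return os.path.dirname(os.path.realpath(script_path))


def resolve_from(base_dir: str, relative_path: str) -> str:
    """Join a relative path onto a base directory and normalise it."""
    return os.path.normpath(os.path.join(base_dir, relative_path))
```

For example, `resolve_from(script_directory(__file__), '../site-packages/pre_install')` yields the same location regardless of the caller's working directory, whereas a path built from `os.getcwd()` changes with every `cd`.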