
wikipedia-vandalism_detection 0.1.0-java

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (245)

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: bf756c5448798deaecad9dff7f1158124f1665eae7f65e6e3cd1c018dcb4b273
+  data.tar.gz: ec45e4a4a402eb9dadada7570f094cd5be294634da3e31ce28603bd48666e74c
+SHA512:
+  metadata.gz: a72ec32117e19bbac2764eb01022f608c4eb91121e6d552c1a05a230b559a5279e51fe8e7970b48667d6450ebb0b23fc36338ade74bb47d729018fbdb4b39868
+  data.tar.gz: 8eb0fb8fe4d2e0ed681543cf0a76dd9a806253cf8e43ce2dd224137ad0970d1f7e9f84caf2b1fd22f289d3553414e449799d5f50c010d435ac2d6a3d5afa4a93
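
The digests in `checksums.yaml` cover the `metadata.gz` and `data.tar.gz` members of the `.gem` archive. A minimal sketch of how such an entry could be checked by hand, assuming the member bytes are already extracted; the helper name `sha256_matches?` and the sample document are hypothetical, and the sample digest is the SHA-256 of an empty string rather than a value from this package:

```ruby
require 'digest'
require 'yaml'

# Returns true if the SHA-256 digest of `bytes` equals `expected_hex`.
# (Hypothetical helper, not part of RubyGems or of this gem.)
def sha256_matches?(bytes, expected_hex)
  Digest::SHA256.hexdigest(bytes) == expected_hex.strip.downcase
end

# A checksums.yaml-shaped document; the digest belongs to the empty string.
sample = <<~YAML
  ---
  SHA256:
    metadata.gz: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
YAML

checksums = YAML.safe_load(sample)
expected = checksums.fetch('SHA256').fetch('metadata.gz')

puts sha256_matches?('', expected)
```

For a file on disk, `Digest::SHA256.file(path).hexdigest` computes the same digest without reading the whole file into a string first.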

data/.gitignore ADDED

@@ -0,0 +1,19 @@
+*.gem
+*.rbc
+.bundle
+.config
+.yardoc
+Gemfile.lock
+InstalledFiles
+_yardoc
+coverage
+doc/
+lib/bundler/man
+pkg
+rdoc
+spec/reports
+spec/resources/build
+test/tmp
+test/version_tmp
+tmp
+/config/*.yml

data/.rspec ADDED

@@ -0,0 +1,2 @@
+--force-color
+--order rand

data/.rubocop.yml ADDED

@@ -0,0 +1,35 @@
+AllCops:
+  TargetRubyVersion: 2.4
+  Exclude:
+    - 'bin/**/*'
+    - '*.gemspec'
+    - 'Gemfile'
+    - 'Gemfile.lock'
+Style/Copyright:
+  Enabled: false
+Style/Documentation:
+  Enabled: false
+Metrics/LineLength:
+  Max: 80
+  Exclude:
+    - '**/*_spec.rb'
+    - 'spec/factories/*.rb'
+Layout/MultilineMethodCallIndentation:
+  EnforcedStyle: indented
+Style/FrozenStringLiteralComment:
+  Enabled: false
+Metrics/ModuleLength:
+  Exclude:
+    - '**/*_spec.rb'
+    - 'spec/factories/*.rb'
+Metrics/BlockLength:
+  Exclude:
+    - '**/*_spec.rb'
+    - 'spec/factories/*.rb'

data/.travis.yml ADDED

@@ -0,0 +1,6 @@
+language: ruby
+rvm:
+  - jruby-9.1.0.0
+  - jruby-9.2.0.0
+  - jruby-head

data/Gemfile ADDED

@@ -0,0 +1,4 @@
+source 'https://rubygems.org'
+# Specify your gem's dependencies in wikipedia-vandalism_detection.gemspec
+gemspec

data/LICENSE.txt ADDED

@@ -0,0 +1,4 @@
+Copyright (c) 2014-2018 Paul Götze
+This software is licensed under the GPL v3.
+For further information and the full license text see: http://www.gnu.org/licenses/gpl-3.0.en.html

data/README.md ADDED

@@ -0,0 +1,288 @@
+# Wikipedia Vandalism Detection
+Vandalism detection on the Wikipedia history with JRuby v9.1.0.0+.
+The Wikipedia Vandalism Detection Gem uses the Weka Machine-Learning Library
+via the [weka](https://github.com/paulgoetze/weka-jruby) gem.
+[![Gem Version](https://badge.fury.io/rb/wikipedia-vandalism_detection.svg)](http://badge.fury.io/rb/wikipedia-vandalism_detection)
+[![Build Status](https://travis-ci.org/paulgoetze/wikipedia-vandalism-detection.png?branch=develop)](https://travis-ci.org/paulgoetze/wikipedia-vandalism-detection)
+## What you can do with it
+* parsing Wikipedia history pages to get edits and revisions
+* creating training and test ARFF files from
+the [WVC-PAN-10](https://www.uni-weimar.de/en/media/chairs/computer-science-and-media/webis/corpora/corpus-pan-wvc-10) and
+the [WVC-PAN-11](https://www.uni-weimar.de/en/media/chairs/computer-science-and-media/webis/corpora/corpus-pan-wvc-11)
+(See also http://pan.webis.de under category Wikipedia Vandalism Detection: [CLEF 2010](http://pan.webis.de/clef10/pan10-web/wikipedia-vandalism-detection) & [CLEF 2011](http://pan.webis.de/clef11/pan11-web/wikipedia-vandalism-detection))
+* calculating vandalism features for a Wikipedia page (XML) from the history dump
+* creating and evaluating a classifier with the created training ARFF file
+* classifying new instances of Wikipedia edits as 'regular' or 'vandalism'
+## Installation
+Add this line to your application's Gemfile:
+    gem 'wikipedia-vandalism_detection'
+And then execute:
+    $ bundle
+Or install it yourself as:
+    $ gem install wikipedia-vandalism_detection
+## Usage
+    require 'wikipedia/vandalism_detection'
+### Configuration
+To configure the system put a `wikipedia-vandalism-detection.yml` file in the
+`config/` or `lib/config/` directory.
+You can configure:
+A) the training and test corpora directories and essential input and output files
+```YAML
+corpora:
+  base_directory: /home/user/corpora
+  training:
+    base_directory: training
+    annotations_file: annotations.csv
+    edits_file: edits.csv
+    revisions_directory: revisions
+  test:
+    base_directory: test
+    edits_file: edits.csv
+    revisions_directory: revisions
+output:
+  base_directory: /home/user/output_path
+  training:
+    arff_file: training.arff
+    index_file: training_index.yml
+  test:
+    arff_file: test.arff
+    index_file: test_index.yml
+```
+Evaluation outputs are saved under the output base directory path.
+B) the features used by the feature calculator
+```YAML
+features:
+  - anonymity
+  - biased frequency
+  - character sequence
+  - ...
+```
+C) the classifier type and its options and the number of cross validation splits
+for the classifier evaluation
+```YAML
+classifier:
+  type: Trees::RandomForest         # Weka classifier class
+  options: -I 10 -K 0.5             # same as for Weka, for further classifier options see Weka-dev documentation
+  cross-validation-fold: 5          # default is 10
+  training-data-options: balanced   # default is unbalanced
+```
+`training-data-options` is used to resample the training dataset:
+* `unbalanced` is the default value and uses the original dataset
+* `balanced` uses random undersampling of the majority class
+* `oversampled` uses SMOTE oversampling (with percentage `-p`) and random undersampling (with minority/majority class balance `-u`)
+Examples:
+```YAML
+# 200% SMOTE oversampling with 300% random undersampling
+training-data-options: oversampled -p 200 -u true 300
+# default 100% SMOTE oversampling with 300% random undersampling
+training-data-options: oversampled -u true 300
+# 200% SMOTE oversampling with default full (100% minority/majority class balance)
+# random undersampling
+training-data-options: oversampled -p 200
+# default 100% SMOTE oversampling without undersampling
+training-data-options: oversampled -u false
+```
+Instead of `true` you can also use `t`, `y` and `yes`, as well as their upper-case counterparts.
+### Examples
+**Create training and test ARFF file from configured corpus:**
+```ruby
+training_dataset = Wikipedia::VandalismDetection::TrainingDataset.build
+test_dataset = Wikipedia::VandalismDetection::TestDataset.build
+```
+While creating the training and test datasets, a corpus file index is written for each
+to its configured `index_file`.
+To run the corpus file index creation manually use:
+```ruby
+Wikipedia::VandalismDetection::TrainingDataset.create_file_index!
+Wikipedia::VandalismDetection::TestDataset.create_file_index!
+```
+**Parse a Wikipedia page content:**
+Namespaces are currently not supported while parsing a page,
+so the `<page>...</page>` tags must not be nested inside a namespaced XML tag.
+```ruby
+xml = File.read('wikipedia_page.xml')
+parser = Wikipedia::VandalismDetection::PageParser.new
+page = parser.parse(xml)
+# Work with revisions and edits from the page
+page.revisions.each do |revision|
+  puts revision.id
+  puts revision.parent_id
+end
+page.edits.each do |edit|
+  puts edit.new_revision.id
+  puts edit.old_revision.id
+end
+```
+**Use a classifier of configured type:**
+Create the classifier:
+```ruby
+classifier = Wikipedia::VandalismDetection::Classifier.new
+```
+Evaluation of the classifier against the configured training corpus:
+```ruby
+# classifier.classifier_instance returns the weka classifier instance
+evaluation = classifier.classifier_instance.cross_validate(folds: 10)
+puts evaluation.class_details
+```
+Classify a new edit:
+```ruby
+# Classification of a Wikipedia Edit or a feature set
+# 'edit' is a Wikipedia::VandalismDetection::Edit; it can be built manually or by
+# parsing a Wikipedia page content and getting its edits
+# The returned confidence is a value between 0.0 and 1.0, where 0.0 means 'regular' and 1.0 means 'vandalism'
+confidence = classifier.classify(edit)
+feature_calculator = Wikipedia::VandalismDetection::FeatureCalculator.new
+features = feature_calculator.calculate_features_for(edit)
+confidence = classifier.classify(features)
+```
+Evaluate test corpus classification:
+```ruby
+evaluator = classifier.evaluator
+# or create a new evaluator
+evaluator = Wikipedia::VandalismDetection::Evaluator.new(classifier)
+performance_data = evaluator.evaluate_testcorpus_classification # default sample_count = 100
+performance_data = evaluator.evaluate_testcorpus_classification(sample_count: 200)
+# the following attributes can be used for further computations
+recall_values = performance_data[:recalls]           # recall values for e.g. x-values of PRC or y-values of ROC
+precision_values = performance_data[:precisions]     # precision values for e.g. y-values of PRC
+fp_rate_values = performance_data[:fp_rates]         # false positive rate values for e.g. x-values of ROC
+area_under_curve_pr = performance_data[:pr_auc]      # computed from the precision and recall values
+area_under_curve_ro = performance_data[:roc_auc]     # computed from the recall and fp-rate values
+total_recall = performance_data[:total_recall]       # precision and recall values with maximum area (rectangle area)
+total_precision = performance_data[:total_precision]
+```
+Get each feature's predictive value for analysis:
+```ruby
+evaluator = classifier.evaluator
+# or create a new evaluator
+evaluator = Wikipedia::VandalismDetection::Evaluator.new(classifier)
+analysis_data = evaluator.feature_analysis # default sample_count = 100
+analysis_data = evaluator.feature_analysis(sample_count: 1000)
+```
+This returns a hash with the configured feature names as keys and the per-threshold hashes as values.
+```ruby
+{
+  feature_name_1:
+    {
+      0.0 => {fp:… , fn:… , tp:… , tn:… },
+      …,
+      1.0 => {fp:… , fn:… , tp:… , tn:… }
+    },
+  …,
+  feature_name_n:
+    {
+      0.0 => {fp:… , fn:… , tp:… , tn:… },
+      …,
+      1.0 => {fp:… , fn:… , tp:… , tn:… }
+    },
+}
+```
+**Creating new Features:**
+You can define your own new Feature classes and use them by configuration in the config.yml.
+Make sure to define the Feature class inside of the `Wikipedia::VandalismDetection::Features` module
+and to implement the `calculate` method
+(also refer to the `Wikipedia::VandalismDetection::Features::Base` class definition).
+```ruby
+module Wikipedia
+  module VandalismDetection
+    module Features
+      class MyNewFeature < Base
+        def calculate(edit)
+          super # ensures raising an error if 'edit' is not an Edit.
+          # ...your implementation
+        end
+      end
+    end
+  end
+end
+```
+When creating new Feature classes, follow this naming convention:
+the feature's name in the config.yml is the feature class name *downcased, with words separated by spaces or dashes*.
+E.g.:
+```YAML
+    features:
+      - my new feature
+      - my-new-feature
+```
+both search for a Feature class with the name `MyNewFeature`.
+## Contributing
+1. Fork it ( http://github.com/paulgoetze/wikipedia-vandalism_detection/fork )
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Commit your changes (`git commit -am 'Add some feature'`)
+4. Push to the branch (`git push origin my-new-feature`)
+5. Create new Pull Request
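
The README's naming convention for custom features (config name `my new feature` or `my-new-feature` resolving to the class `MyNewFeature`) can be illustrated with a small sketch. This is not the gem's lookup code; `feature_class_name` and `feature_config_name` are hypothetical helpers that only mirror the documented mapping:

```ruby
# Turns a config entry such as 'my new feature' or 'my-new-feature'
# into the CamelCase class name the README says is searched for.
# (Hypothetical helper; the gem's own resolution logic may differ.)
def feature_class_name(config_name)
  config_name.strip.split(/[\s-]+/).map(&:capitalize).join
end

# Inverse direction: derives the space-separated config entry
# from a CamelCase class name.
def feature_config_name(class_name)
  class_name.gsub(/([a-z\d])([A-Z])/, '\1 \2').downcase
end

puts feature_class_name('my new feature')
puts feature_class_name('my-new-feature')
puts feature_config_name('MyNewFeature')
```

Both config spellings produce the same class name here, which matches the README's statement that the two entries search for the same Feature class.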

data/Rakefile ADDED

@@ -0,0 +1,11 @@
+require 'bundler/gem_tasks'
+require 'rspec/core/rake_task'
+RSpec::Core::RakeTask.new(:spec)
+task default: :spec
+desc 'Start an irb session with the gem loaded'
+task :irb do
+  sh 'irb -I ./lib -r wikipedia/vandalism_detection'
+end

data/config/wikipedia-vandalism-detection.yml.example ADDED

@@ -0,0 +1,103 @@
+# Configuring the training and test corpora directories and essential input and output files.
+# As corpora the WVC-PAN-10 and WVC-PAN-11 can be used (see http://webis.de/ under Research -> Corpora).
+corpora:
+  base_directory: /home/user/corpora
+  training:
+    base_directory: training
+    annotations_file: annotations.csv
+    edits_file: edits.csv
+    revisions_directory: revisions
+  test:
+    base_directory: test
+    edits_file: edits.csv
+    revisions_directory: revisions
+output:
+  base_directory: /home/user/output_path
+  training:
+    arff_file: training.arff
+    index_file: training_index.yml
+  test:
+    arff_file: test.arff
+    index_file: test_index.yml
+# Configuring the used features.
+# See
+features:
+  - anonymity
+  - anonymity previous
+  - all wordlists frequency
+  - all wordlists impact
+  - article size
+  - bad frequency
+  - bad impact
+  - biased frequency
+  - biased impact
+  - blanking
+  - character sequence
+  - character diversity
+  - comment length
+  - comment biased frequency
+  - comment pronoun frequency
+  - comment vulgarism frequency
+  - compressibility
+  - copyedit
+  - digit ratio
+  - edits per user
+  - emoticons frequency
+  - emoticons impact
+  - inserted size
+  - inserted words
+  - inserted character distribution
+  - inserted external links
+  - inserted internal links
+  - longest word
+  - markup frequency
+  - markup impact
+  - non-alphanumeric ratio
+  - personal life
+  - pronoun frequency
+  - pronoun impact
+  - removed size
+  - removed words
+  - removed all wordlists frequency
+  - removed bad frequency
+  - removed biased frequency
+  - removed character distribution
+  - removed emoticons frequency
+  - removed markup frequency
+  - removed pronoun frequency
+  - removed sex frequency
+  - removed vulgarism frequency
+  - replacement similarity
+  - reverted
+  - revisions character distribution
+  - sex frequency
+  - sex impact
+  - same editor
+  - size increment
+  - size ratio
+  - term frequency
+  - time interval
+  - time of day
+  - upper case ratio
+  - upper case words ratio
+  - upper to lower case ratio
+  - vulgarism frequency
+  - vulgarism impact
+  - weekday
+  - words increment
+# Configuring the used classifier
+classifier:
+  type: Trees::RandomForest         # Weka classifier class
+  options: -I 10 -K 0.5             # same as for Weka, for further classifier options see the Weka-dev documentation
+  cross-validation-fold: 5          # default is 10
+  training-data-options: balanced   # default is unbalanced
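
The `training-data-options` value follows the small option syntax documented in the README (`unbalanced`, `balanced`, or `oversampled` with optional `-p` and `-u`). A minimal sketch of how such a string could be split into settings, assuming the defaults the README states (100% SMOTE, full 100% undersampling); `parse_training_data_options` and the hash keys are hypothetical and not the gem's API:

```ruby
# Values the README lists as equivalent to `true` (upper case included).
TRUTHY = %w[true t yes y].freeze

# Parses a `training-data-options` string into a plain hash.
# (Hypothetical helper; a flag without its value raises a TypeError.)
def parse_training_data_options(value)
  tokens = value.to_s.split
  mode = tokens.shift || 'unbalanced' # README default
  options = { mode: mode }
  return options unless mode == 'oversampled'

  options[:smote_percentage] = 100         # default for -p
  options[:undersampling] = true           # default: undersample
  options[:undersampling_percentage] = 100 # default class balance

  until tokens.empty?
    case tokens.shift
    when '-p'
      options[:smote_percentage] = Integer(tokens.shift)
    when '-u'
      options[:undersampling] = TRUTHY.include?(tokens.shift.to_s.downcase)
      # The percentage after -u is optional.
      if tokens.first.to_s.match?(/\A\d+\z/)
        options[:undersampling_percentage] = Integer(tokens.shift)
      end
    end
  end

  options
end

p parse_training_data_options('oversampled -p 200 -u true 300')
p parse_training_data_options('balanced')
```

Unknown tokens are silently skipped in this sketch; a stricter parser would probably reject them.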

data/lib/java/SMOTE.jar ADDED

Binary file

data/lib/java/balancedRandomForest.jar ADDED

Binary file

data/lib/java/diffutils-1.3.0.jar ADDED

Binary file

data/lib/java/oneClassClassifier.jar ADDED

Binary file

data/lib/java/realAdaBoost.jar ADDED

Binary file

data/lib/java/swc-engine-1.1.0-jar-with-dependencies.jar ADDED

Binary file

data/lib/java/sweble-wikitext-extractor.jar ADDED

Binary file

data/lib/weka/classifiers/meta/one_class_classifier.rb ADDED

@@ -0,0 +1,21 @@
+require 'weka'
+require 'weka/class_builder'
+module Weka
+  module Classifiers
+    module Meta
+      require 'java/oneClassClassifier.jar'
+      include ClassBuilder
+      # One class classifier by C. Hempstalk (cite: http://dl.acm.org/citation.cfm?id=1431987)
+      # Jar can be downloaded at: http://sourceforge.net/projects/weka/files/weka-packages/oneClassClassifier1.0.4.zip
+      build_class :OneClassClassifier
+      class OneClassClassifier
+        def self.type
+          'Meta::OneClassClassifier'
+        end
+      end
+    end
+  end
+end

data/lib/weka/classifiers/meta/real_ada_boost.rb ADDED

@@ -0,0 +1,15 @@
+require 'weka'
+require 'weka/class_builder'
+module Weka
+  module Classifiers
+    module Meta
+      require 'java/realAdaBoost.jar'
+      include ClassBuilder
+      # Real ada boost classifier, see: http://www.stanford.edu/~hastie/Papers/AdditiveLogisticRegression/alr.pdf
+      # Jar can be downloaded at: http://prdownloads.sourceforge.net/weka/realAdaBoost1.0.1.zip?download
+      build_class :RealAdaBoost
+    end
+  end
+end

data/lib/weka/classifiers/trees/balanced_random_forest.rb ADDED

@@ -0,0 +1,16 @@
+require 'weka'
+require 'weka/class_builder'
+module Weka
+  module Classifiers
+    module Trees
+      require 'java/balancedRandomForest.jar'
+      include ClassBuilder
+      # balanced RandomForest classifier,
+      # Modified from https://github.com/jdurbin/durbinlib/blob/master/src/durbin/weka/BalancedRandomForest.java
+      # and https://github.com/jdurbin/durbinlib/blob/master/src/durbin/weka/BalancedRandomTree.java
+      build_class :BalancedRandomForest
+    end
+  end
+end

data/lib/weka/filters/supervised/instance/smote.rb ADDED

@@ -0,0 +1,15 @@
+require 'weka'
+require 'weka/class_builder'
+module Weka
+  module Filters
+    module Supervised
+      module Instance
+        require 'java/SMOTE.jar'
+        include ClassBuilder
+        build_class :SMOTE
+      end
+    end
+  end
+end

data/lib/wikipedia/vandalism_detection/algorithms/kullback_leibler_divergence.rb ADDED

@@ -0,0 +1,103 @@
+require 'wikipedia/vandalism_detection/features/base'
+module Wikipedia
+  module VandalismDetection
+    module Algorithms
+      class KullbackLeiblerDivergence
+        ALLOWED_ERROR = 9e-6
+        # Returns the Symmetric Kullback-Leibler divergence with simple back-off
+        # of the given text's character distribution. For implementation details
+        # see: https://web.archive.org/web/20130508191111/http://staff.science.uva.nl/~tsagias/?p=185.
+        def of(text_a, text_b)
+          text_a = cleanup_text(text_a)
+          text_b = cleanup_text(text_b)
+          unless text_a.match(/[[:alnum:]]/) && text_b.match(/[[:alnum:]]/)
+            return Features::MISSING_VALUE
+          end
+          distribution_a = character_distribution(text_a)
+          distribution_b = character_distribution(text_b)
+          sum_a = distribution_a.values.inject(0, :+)
+          sum_b = distribution_b.values.inject(0, :+)
+          character_diff = distribution_b.keys - distribution_a.keys
+          epsilon = [
+            distribution_a.values.min / sum_a,
+            distribution_b.values.min / sum_b
+          ].min * 0.001
+          gamma = 1 - character_diff.size * epsilon
+          check_integrity(distribution_a, sum_a)
+          check_integrity(distribution_b, sum_b)
+          divergence = 0.0
+          distribution_a.each do |character, distribution|
+            prob_a = distribution / sum_a
+            character_distribution = distribution_b[character]
+            prob_b =
+              if character_distribution
+                gamma * (character_distribution / sum_b)
+              else
+                epsilon
+              end
+            divergence += (prob_a - prob_b) * Math.log(prob_a / prob_b)
+          end
+          divergence
+        end
+        private
+        # Removes invalid utf-8 characters
+        def cleanup_text(text)
+          text.encode(
+            'UTF-8',
+            'binary',
+            invalid: :replace,
+            undef: :replace,
+            replace: ''
+          )
+        end
+        # Returns a hash representing each character's distribution
+        def character_distribution(text)
+          distribution = {}
+          return distribution if text.empty?
+          characters = text.downcase.scan(/[[:alnum:]]/)
+          characters.each do |character|
+            if distribution.key?(character.to_sym)
+              distribution[character.to_sym] += 1
+            else
+              distribution[character.to_sym] = 1
+            end
+          end
+          Hash[distribution.map do |key, value|
+            [key, value / characters.count.to_f]
+          end]
+        end
+        # Checks if values sum up to 1.0, raises an error if they don't.
+        def check_integrity(distribution, sum)
+          total = distribution.values
+            .inject(0) { |result, value| result + (value / sum) }
+          difference = (1.0 - total).abs
+          return if difference <= ALLOWED_ERROR
+          raise(Exception, 'Text distribution does not sum up to 1.0')
+        end
+      end
+    end
+  end
+end
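
A standalone sketch of the calculation in `KullbackLeiblerDivergence#of`, for readers who want to try it without the gem. It mirrors the epsilon back-off and gamma scaling from the file above, but it is a simplification under stated assumptions: it uses plain strings as keys, returns `nil` where the gem returns `Features::MISSING_VALUE`, and skips the encoding cleanup and the integrity check. The names `char_distribution` and `symmetric_kl` are hypothetical.

```ruby
# Relative frequency of each alphanumeric character, case-insensitive.
def char_distribution(text)
  chars = text.downcase.scan(/[[:alnum:]]/)
  counts = chars.each_with_object(Hash.new(0)) { |char, acc| acc[char] += 1 }
  counts.transform_values { |count| count / chars.size.to_f }
end

# Symmetric KL divergence with simple back-off, following the structure
# of the file above (hypothetical standalone version, not the gem's API).
def symmetric_kl(text_a, text_b)
  dist_a = char_distribution(text_a)
  dist_b = char_distribution(text_b)
  return nil if dist_a.empty? || dist_b.empty?

  # Back-off probability for characters of A that never occur in B.
  epsilon = [dist_a.values.min, dist_b.values.min].min * 0.001
  # Scaling factor, computed as in the source from characters only in B.
  gamma = 1 - (dist_b.keys - dist_a.keys).size * epsilon

  dist_a.sum(0.0) do |char, prob_a|
    prob_b = dist_b.key?(char) ? gamma * dist_b[char] : epsilon
    (prob_a - prob_b) * Math.log(prob_a / prob_b)
  end
end

puts symmetric_kl('abc', 'abc')
puts symmetric_kl('aaa', 'bbb')
```

Texts with identical character distributions yield a divergence of zero, while texts sharing no characters yield a large positive value dominated by the epsilon back-off term.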