RubyGems - ai-nlp - Versions diffs - 0.1.0 - Mend

ai-nlp 0.1.0


Files changed (26)

checksums.yaml +7 -0
data/MIT-LICENSE +20 -0
data/README.md +114 -0
data/Rakefile +38 -0
data/app/assets/config/ai_nlp_manifest.js +2 -0
data/app/assets/javascripts/ai/nlp/application.js +13 -0
data/app/assets/stylesheets/ai/nlp/application.css +15 -0
data/app/controllers/ai/nlp/application_controller.rb +12 -0
data/app/helpers/ai/nlp/application_helper.rb +11 -0
data/app/jobs/ai/nlp/application_job.rb +11 -0
data/app/mailers/ai/nlp/application_mailer.rb +13 -0
data/app/models/ai/nlp/application_record.rb +12 -0
data/app/models/ai/nlp/language.rb +23 -0
data/app/views/layouts/ai/nlp/application.html.erb +14 -0
data/config/routes.rb +4 -0
data/db/migrate/20170907142959_create_ia_taln_languages.rb +12 -0
data/lib/ai/nlp.rb +12 -0
data/lib/ai/nlp/engine.rb +12 -0
data/lib/ai/nlp/languages.rb +95 -0
data/lib/ai/nlp/n_gram/hasher.rb +118 -0
data/lib/ai/nlp/n_gram/n_gram.rb +29 -0
data/lib/ai/nlp/stem/fr.md +178 -0
data/lib/ai/nlp/stem/stem.rb +0 -0
data/lib/ai/nlp/version.rb +7 -0
data/lib/tasks/ai/nlp_tasks.rake +5 -0
metadata +112 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: c966d762ba275e10544c946f454d9dad7334edb6
+  data.tar.gz: 10cab1a81834a14716bb4c187b14af1ec54c27f5
+SHA512:
+  metadata.gz: a6ec561914777c59f0e8d44ed4ade7cde89ff409ba1bae3efca1816161fbb01ee82bf102ed4a4c0530eb500ba8737c07f00ae6752862b58f21586e68fa0b00b5
+  data.tar.gz: cf2080d53cae13af26a85d9b9bd6961f1ce703d749f2d5ede844bed9adc363bfd60b582d5f3309e514a6309c98c0b4ccb1bae05eea663780d8213681e30c4da8

data/MIT-LICENSE ADDED

@@ -0,0 +1,20 @@
+Copyright 2017 Alain ANDRE
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,114 @@
+[![pipeline status](https://gitlab.com/al1_andre/ai-nlp/badges/master/build.svg)](https://gitlab.com/al1_andre/ai-nlp/commits/master)
+[![Coverage report](https://gitlab.com/al1_andre/ai-nlp/badges/master/coverage.svg?job=rspec)](http://al1_andre.gitlab.io/ai-nlp)
+[![Gem Version](https://badge.fury.io/rb/ai-nlp.svg)](https://badge.fury.io/rb/ai-nlp)
+# Purpose
+I keep hearing more and more about **artificial intelligence** and the astonishing advances of the **GAFA** and other **NATU** companies in this field, and it interests me a great deal. I want to **learn** more, but although the Internet is full of information, you still have to know how to look for it. There are many articles in **English**, few in **French**, and the concepts are not easy at first glance. So I want to do my bit by **sharing** what I learn. This is a non-profit project, open to everyone under the MIT license, whose sole aim is **learning** through **experience**. The exercises directory is [here](exercices/).
+# Description
+> The term artificial intelligence usually covers a set of notions inspired by human cognition or the biological brain, intended to assist or stand in for the individual in processing massive amounts of information.
+In January **2017**, the **Secretary of State for Industry, Digital Affairs and Innovation** and the **Secretary of State for Higher Education and Research** launched the **#franceIA** initiative, whose goal was to study the opportunities offered by **AI**. The study is structured around **three pillars**, broken down into **ten key themes** handled by ten **working groups** and seven sub-groups. Their summary report is available [here](https://www.economie.gouv.fr/files/files/PDF/2017/Rapport_synthese_France_IA_.pdf).
+Artificial intelligence covers the following topics:
+ - Problem solving
+ - [Knowledge representation](https://fr.wikipedia.org/wiki/Repr%C3%A9sentation_des_connaissances)
+ - [Planning](https://fr.wikipedia.org/wiki/Planification_(intelligence_artificielle))
+ - [Learning](https://fr.wikipedia.org/wiki/Apprentissage_automatique)
+ - [Understanding human language (NLP/TALN)](https://fr.wikipedia.org/wiki/Traitement_automatique_du_langage_naturel)
+ - [Machine perception](https://fr.wikipedia.org/wiki/Machine_perception)
+ - [Robotics](https://fr.wikipedia.org/wiki/Robotique)
+ - [Affective computing](https://fr.wikipedia.org/wiki/Informatique_affective)
+ - [Creativity](https://en.wikipedia.org/wiki/Computational_creativity)
+ - [Self-awareness](https://fr.wikipedia.org/wiki/Intelligence_artificielle#Intelligence_artificielle_forte)
+## Why a tutorial in French?
+Let's be **honest**: the **universal** language of **technology** today is well and truly **English**. My code is in English, and so are its comments. That is a **must** if you want to **share** with as many people as possible; as a strong advocate of **open source**, I believe sharing code is what makes it stronger and **safer**.
+Although the **French** are fairly **well represented** in **A**rtificial **I**ntelligence, as the government report notes, **articles** in French on the Internet are old, sometimes hollow, often too **complicated**, and above all **never sexy**! To ordinary people, AI seems **untouchable**, **stratospheric**, even something to be avoided at all costs! Yet a major national **issue** is at stake. We **missed** the move to the **Internet** because we thought we were the **masters of the world** with our **Minitel**; this time we must not miss the boat!
+While French **infrastructure** lags far behind (and is almost **inaccessible**), our **engineers** **sell** themselves well, just not at home! When we study AI, talk about it, or write articles about it, we do so in **English**! Yet the stakes of AI include the **economy**, **health** and **industry**, all of which have strong **human** dimensions; if AIs are **fed** only with **data** and **concepts** in **English**, we can forget about the **French way of thinking**, which nonetheless has historically enjoyed a good reputation.
+And AI will quickly **become** a decision-**support** tool in **business** and in **society**. That is **already** the case for your Internet **searches**, your driving **routes**, the management of **your data** in the **cloud**, and so on. When **you** solve a captcha to prove that you are a **human**, do you really think **Google** is not **using** you to **train** an AI? If **Skynet** rings a bell, I will let you grasp the **importance** of AI **independence**, or at the very least of an **allied** AI (French or European).
+# My idea
+As the 2017 **France IA summary report** points out, France sorely **lacks** approachable **tutorials** and courses on AI; my **idea** is to **create** one that is **reproducible**, **accessible**, **fun** and **enjoyable**.
+If my years of **experience** as a developer have taught me one thing, it is not to **reinvent** the wheel! If you have an idea, someone in the world has very probably **already had it**, and it is better to **reuse** their work. The **difficulty** lies in finding it and having the **rights** to it. For example, **Facebook** provides tools for playing with its AI [wit](https://wit.ai/), which lets you, among other things, build chatbots, but the **AI** and its **data** sit **on their servers**. In other words, while **you train** your bot, you are mostly training **theirs**. As far as I know, only [OpenAI](https://openai.com/), co-founded by [Elon Musk](https://fr.wikipedia.org/wiki/Elon_Musk), aims to be universal and to serve **humanity** rather than business and **profit**. Profit is fine, but it should only be a **result**, not a **goal**!
+So I plan to start from **concepts**, look for code that more or less handles them, and try to rebuild an AI that is **available** to **everyone** (available as the [ai-nlp](https://rubygems.org/gems/ai-nlp) gem).
+I have to choose the type of AI I want to create. I am a **humanities person**, and most of the courses and information I find on AI are based on **mathematics**, so I decided to focus my AI on the following:
+ - [Learning](https://fr.wikipedia.org/wiki/Apprentissage_automatique)
+ - [Understanding human language (NLP/TALN)](https://fr.wikipedia.org/wiki/Traitement_automatique_du_langage_naturel)
+This AI must therefore **learn** to **understand** **human text**. A web interface must **display** **graphically** the body of **knowledge** it acquires, and contain an area where the human can **communicate** with the AI and an area in which it **replies**.
+# Technologies and concepts
+At the very beginning, my AI will not **understand** anything at all! A **learning** system will have to be put in place!
+But what is learning? It means **analyzing**, **grouping** and **storing** information so that it can be **exploited**.
+## Analysis
+> Semantic analysis is the set of techniques for analyzing the meaning of words and sentences; it is most often used as a preliminary step to natural language processing.
+If I want the AI to understand human language, I first have to work out how to do the following:
+ - Determine the language (see [N-gram](https://fr.wikipedia.org/wiki/N-gramme))
+ - Determine the [syntax](https://fr.wikipedia.org/wiki/Analyse_syntaxique)
+ - Split the sentence into its [lexical units](https://fr.wikipedia.org/wiki/Analyse_lexicale) (tokens)
+### The programming language
+I am a big fan of [Ruby](https://www.ruby-lang.org/fr/), a language whose philosophy is based on the [object paradigm](https://fr.wikipedia.org/wiki/Smalltalk); it is therefore reflective and dynamically typed. I think it can benefit my project thanks to its [reflection](https://fr.wikipedia.org/wiki/R%C3%A9flexion_(informatique)): a Ruby program can examine, and possibly modify, its own high-level internal structures while it runs!
+In my opinion, there is a great opportunity to **exploit** here for everything to do with [lemmas](https://fr.wikipedia.org/wiki/Lemme_(linguistique)).
+## Grouping
+> Clustering: software divides a heterogeneous set of data into subgroups so that the data considered most similar end up together in a homogeneous group while data considered different fall into other, distinct groups; the aim is to allow organized knowledge extraction from the data.
+Once my sentence elements are correctly **identified**, I have to manage to **group** them. They can be grouped by meaning (homonyms, synonyms, roots, prefixes, suffixes, lemmas) or by type (noun, verb, adjective, etc.).
+This grouping is therefore **crucial**: it is what will let me **store** my data in the right **place**. I must be able to do this grouping in an **unsupervised** way in some cases and a **supervised** way in others.
+### Unsupervised
+I can thus automatically group and classify the elements I have split out, using my database. But if, for example, no data allows me to do an unsupervised grouping, I have to interact with the user and ask for help classifying the element.
+### Supervised
+The AI must be able to interact with the user and ask for help classifying the elements it has split out.
+The user must be able to specify:
+ - the type
+ - the meaning
+ - the origin (lemma)
+## Storage
+Storage must allow the data to be retrieved in a way that makes it easy to exploit.
+## Exploitation
+All this is wonderful, but now that the AI can **split** information and **group** it, it needs to be able to **exploit** it. The advantage of a **search engine** such as **QWANT** or **Google** is that it knows its only **action** is to **supply**, on **demand**, the **data** it has collected and grouped; it does not really need to **understand** what it stores.
+A more advanced AI such as **SIRI**, on the other hand, is expected to **respond** to various **actions**, such as **adding** an event to the calendar, **giving** **tomorrow's** weather or **finding** the **way** to a shop **near** our position. It therefore has to **understand** the **type of request** being made.
+# References
+ - [Rapport France-IA](https://www.economie.gouv.fr/France-IA-intelligence-artificielle#L%27IA)
+## Deep Learning
+ - [wiki](https://fr.wikipedia.org/wiki/Apprentissage_profond)
+ - [Le monde](http://www.lemonde.fr/pixels/article/2015/07/24/comment-le-deep-learning-revolutionne-l-intelligence-artificielle_4695929_4408996.html)
+ - [ruby introduction](https://speakerdeck.com/geoffreylitt/deep-learning-an-introduction-for-ruby-developers)
+## NLP
+ - [TAL](http://blog.onyme.com/traitement-automatique-des-langues-tal-intelligence-artificielle-ia-analyse-semantique-et-clusterings/)
+ - [Lemmatization](http://blog.onyme.com/lemmatisation-et-racinisation-en-francais-flexion-lemme-et-racine-dun-mot/)
+ - [wiki](https://en.wikipedia.org/wiki/Natural_language_processing)
+ - [Sentence structure in French](http://la-conjugaison.nouvelobs.com/regles/grammaire/les-structures-de-phrases-167.php)
+ - [lemmatizer](https://github.com/yohasebe/lemmatizer)
+## Programmes/API
+ - [Facebook wit](https://wit.ai/)
+ - [Deeptext](https://humanoides.fr/deeptext-la-nouvelle-intelligence-artificielle-de-facebook-pour-decortiquer-la-semantique/)
+## Ruby
+ - [AI4R](https://github.com/SergioFierens/ai4r/)
+ - [rubynlp](http://rubynlp.org/)

data/Rakefile ADDED

@@ -0,0 +1,38 @@
+# frozen_string_literal: true
+begin
+  require "bundler/setup"
+rescue LoadError
+  puts "You must `gem install bundler` and `bundle install` to run rake tasks"
+end
+require "rdoc/task"
+RDoc::Task.new(:rdoc) do |rdoc|
+  rdoc.rdoc_dir = "rdoc"
+  rdoc.title    = "Ai::Nlp"
+  rdoc.options << "--line-numbers"
+  rdoc.rdoc_files.include("README.md")
+  rdoc.rdoc_files.include("lib/**/*.rb")
+end
+APP_RAKEFILE = File.expand_path("../test/dummy/Rakefile", __FILE__)
+load "rails/tasks/engine.rake"
+load "rails/tasks/statistics.rake"
+require "bundler/gem_tasks"
+require "rake/testtask"
+Rake::TestTask.new(:test) do |t|
+  t.libs << "test"
+  t.pattern = "test/**/*_test.rb"
+  t.verbose = false
+end
+task default: :test

data/app/assets/config/ai_nlp_manifest.js ADDED

@@ -0,0 +1,2 @@
+//= link_directory ../javascripts/ai/nlp .js
+//= link_directory ../stylesheets/ai/nlp .css

data/app/assets/javascripts/ai/nlp/application.js ADDED

@@ -0,0 +1,13 @@
+// This is a manifest file that'll be compiled into application.js, which will include all the files
+// listed below.
+//
+// Any JavaScript/Coffee file within this directory, lib/assets/javascripts, vendor/assets/javascripts,
+// or any plugin's vendor/assets/javascripts directory can be referenced here using a relative path.
+//
+// It's not advisable to add code directly here, but if you do, it'll appear at the bottom of the
+// compiled file. JavaScript code in this file should be added after the last require_* statement.
+//
+// Read Sprockets README (https://github.com/rails/sprockets#sprockets-directives) for details
+// about supported directives.
+//
+//= require_tree .

data/app/assets/stylesheets/ai/nlp/application.css ADDED

@@ -0,0 +1,15 @@
+/*
+ * This is a manifest file that'll be compiled into application.css, which will include all the files
+ * listed below.
+ *
+ * Any CSS and SCSS file within this directory, lib/assets/stylesheets, vendor/assets/stylesheets,
+ * or any plugin's vendor/assets/stylesheets directory can be referenced here using a relative path.
+ *
+ * You're free to add application-wide styles to this file and they'll appear at the bottom of the
+ * compiled file so the styles you add here take precedence over styles defined in any other CSS/SCSS
+ * files in this directory. Styles in this file should be added after the last require_* statement.
+ * It is generally better to create a new file per style scope.
+ *
+ *= require_tree .
+ *= require_self
+ */

data/app/controllers/ai/nlp/application_controller.rb ADDED

@@ -0,0 +1,12 @@
+# frozen_string_literal: true
+# Module containing artificial intelligence tools
+module Ai
+  # Module containing automatic natural language processing tools
+  module Nlp
+    # Main controller
+    class ApplicationController < ActionController::Base
+      protect_from_forgery with: :exception
+    end
+  end
+end

data/app/helpers/ai/nlp/application_helper.rb ADDED

@@ -0,0 +1,11 @@
+# frozen_string_literal: true
+# Module containing artificial intelligence tools
+module Ai
+  # Module containing automatic natural language processing tools
+  module Nlp
+    # Main Helper
+    module ApplicationHelper
+    end
+  end
+end

data/app/jobs/ai/nlp/application_job.rb ADDED

@@ -0,0 +1,11 @@
+# frozen_string_literal: true
+# Module containing artificial intelligence tools
+module Ai
+  # Module containing automatic natural language processing tools
+  module Nlp
+    # Main Job
+    class ApplicationJob < ActiveJob::Base
+    end
+  end
+end

data/app/mailers/ai/nlp/application_mailer.rb ADDED

@@ -0,0 +1,13 @@
+# frozen_string_literal: true
+# Module containing artificial intelligence tools
+module Ai
+  # Module containing automatic natural language processing tools
+  module Nlp
+    # Main mailer
+    class ApplicationMailer < ActionMailer::Base
+      default from: "from@example.com"
+      layout "mailer"
+    end
+  end
+end

data/app/models/ai/nlp/application_record.rb ADDED

@@ -0,0 +1,12 @@
+# frozen_string_literal: true
+# Module containing artificial intelligence tools
+module Ai
+  # Module containing automatic natural language processing tools
+  module Nlp
+    # Parent class
+    class ApplicationRecord < ActiveRecord::Base
+      self.abstract_class = true
+    end
+  end
+end

data/app/models/ai/nlp/language.rb ADDED

@@ -0,0 +1,23 @@
+# frozen_string_literal: true
+# Module containing artificial intelligence tools
+module Ai
+  # Module containing automatic natural language processing tools
+  module Nlp
+    # Manages a language
+    class Language < ApplicationRecord
+      validates_presence_of :name
+      serialize :map, Hash
+      ##
+      # Add n-grams to the language
+      # @param array given_array The array of [gram, frequency]
+      def add_grams(given_array)
+        given_array.to_h.each do |key, freq|
+          map[key] = map[key] ? map[key] + freq : freq
+        end
+        save
+      end
+    end
+  end
+end
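
Reviewer note: `add_grams` folds incoming n-gram counts into the stored `map`. A minimal sketch of that merge on plain hashes, without ActiveRecord, might look like the following. The name `merge_grams` is a hypothetical helper for illustration and is not part of the gem.

```ruby
# Hypothetical helper (not part of the gem): merges [gram, frequency] pairs
# into a copy of an existing frequency map, summing counts for known grams.
def merge_grams(map, pairs)
  pairs.each_with_object(map.dup) do |(gram, freq), merged|
    merged[gram] = merged.fetch(gram, 0) + freq
  end
end

known = { "_l" => 3, "le" => 2 }
merged = merge_grams(known, [["le", 4], ["e_", 1]])
puts merged["le"] # the two counts for "le" are added together
```

Working on a copy keeps the original map untouched, which makes the merge easy to check in isolation before persisting anything.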

data/app/views/layouts/ai/nlp/application.html.erb ADDED

@@ -0,0 +1,14 @@
+<!DOCTYPE html>
+<html>
+<head>
+  <title>Ia taln</title>
+  <%= stylesheet_link_tag    "ai/nlp/application", media: "all" %>
+  <%= javascript_include_tag "ai/nlp/application" %>
+  <%= csrf_meta_tags %>
+</head>
+<body>
+<%= yield %>
+</body>
+</html>

data/config/routes.rb ADDED

@@ -0,0 +1,4 @@
+# frozen_string_literal: true
+Ai::Nlp::Engine.routes.draw do
+end

data/db/migrate/20170907142959_create_ia_taln_languages.rb ADDED

@@ -0,0 +1,12 @@
+# frozen_string_literal: true
+class CreateIaTalnLanguages < ActiveRecord::Migration[5.1]
+  def change
+    create_table :ai_nlp_languages do |t|
+      t.string :name
+      t.text :map
+      t.timestamps
+    end
+  end
+end

data/lib/ai/nlp.rb ADDED

@@ -0,0 +1,12 @@
+# frozen_string_literal: true
+require "ai/nlp/engine"
+require "ai/nlp/n_gram/n_gram"
+require "ai/nlp/languages"
+# Module containing artificial intelligence tools
+module Ai
+  # Module containing automatic natural language processing tools
+  module Nlp
+  end
+end

data/lib/ai/nlp/engine.rb ADDED

@@ -0,0 +1,12 @@
+# frozen_string_literal: true
+# Module containing artificial intelligence tools
+module Ai
+  # Module containing automatic natural language processing tools
+  module Nlp
+    # Engine and isolation namespace
+    class Engine < ::Rails::Engine
+      isolate_namespace Ai::Nlp
+    end
+  end
+end

data/lib/ai/nlp/languages.rb ADDED

@@ -0,0 +1,95 @@
+# encoding: utf-8
+# frozen_string_literal: true
+require "ai/nlp/n_gram/n_gram"
+# Module containing artificial intelligence tools
+module Ai
+  # Module containing automatic natural language processing tools
+  module Nlp
+    # Class to handle multiple languages
+    class Languages
+      ##
+      # Initialisation
+      def initialize
+        @n_gram = NGram.new
+      end
+      ##
+      # Returns the currently known languages
+      # @return An array of Language
+      def all
+        @languages = Language.all
+      end
+      ##
+      # Offers among the available languages the closest one to the datasets
+      # @param string input The data set.
+      def guess(input)
+        all
+        return [] if @languages.empty?
+        hash = @languages.map { |language| [language, score(input, language)] }.to_h
+        sort(hash)
+      end
+      ##
+      # Create a new language.
+      # @param string name The language name.
+      # @param string input The initial data set.
+      # @return The created language.
+      def create_one(name, input)
+        map = @n_gram.calculate(input).to_h
+        Language.create(name: name, map: map)
+      end
+      private
+        ##
+        # Sort the language hash
+        # @param hash hash The language hash
+        # @return the sorted list of languages
+        def sort(hash)
+          sorted_languages = @languages.sort_by { |language| hash[language] }
+          reject(sorted_languages, hash)
+        end
+        def reject(sorted_languages, hash)
+          sorted_languages.reject { |language| hash[language].zero? }
+        end
+        ##
+        # Compare a string of characters against a language based on, at most,
+        # the 400 most commonly used groups of letters.
+        # @param string input The data set to compare
+        # @param Language language The Language to compare to
+        def score(input, language)
+          input_gram = @n_gram.calculate(input)
+          ngram = language.map
+          calculate_point([input_gram.size, 400].min, ngram, input_gram)
+        end
+        ##
+        # Calculates the score of the input against a language n-gram map
+        # @return the score (points)
+        def calculate_point(max_compare, ngram, input_gram)
+          point = 0
+          (0...max_compare).each do |pos|
+            position = input_gram[pos]
+            next unless position
+            point = add_frequency(ngram[position[0]], pos, point)
+          end
+          point
+        end
+        ##
+        # Add frequency if needed
+        # @param integer input_gram_freq The input gram frequency
+        # @param integer pos The position in the max_compare
+        # @param integer point The current calculated points
+        def add_frequency(input_gram_freq, pos, point)
+          point += (input_gram_freq - pos).abs if input_gram_freq
+          point
+        end
+    end
+  end
+end
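
Reviewer note: the scoring in `Languages` can be tried without Rails. The sketch below is an approximation of the arithmetic as it reads in this file: for each of the first 400 input grams, it adds the absolute difference between the stored frequency and the gram's position. The method names, the `limit` keyword and the toy data are assumptions made for illustration. The file does not say whether a lower or higher total means a closer match, so the sketch simply keeps the ascending sort and the rejection of zero scores used there.

```ruby
# profile: Hash of gram => frequency (a stored language map)
# input_grams: Array of [gram, frequency], sorted by descending frequency
def score(profile, input_grams, limit: 400)
  points = 0
  input_grams.first(limit).each_with_index do |(gram, _freq), rank|
    known = profile[gram]
    points += (known - rank).abs if known
  end
  points
end

# Returns language names ordered by ascending score, dropping zero scores.
def guess(profiles, input_grams)
  scores = profiles.transform_values { |profile| score(profile, input_grams) }
  scores.reject { |_name, points| points.zero? }
        .sort_by { |_name, points| points }
        .map(&:first)
end

profiles = {
  "fr" => { "le" => 5, "_l" => 3 },
  "en" => { "th" => 7 }
}
input_grams = [["le", 2], ["_l", 1], ["zz", 1]]
puts guess(profiles, input_grams).inspect
```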

data/lib/ai/nlp/n_gram/hasher.rb ADDED

@@ -0,0 +1,118 @@
+# encoding: utf-8
+# frozen_string_literal: true
+require "sanitize"
+require "cgi"
+require "unicode"
+require "byebug"
+# Module containing artificial intelligence tools
+module Ai
+  # Module containing automatic natural language processing tools
+  module Nlp
+    # Class managing an n-gram hash
+    class Hasher
+      ##
+      # Initialisation
+      # @param string input The string to treat
+      def initialize(input)
+        @input = input
+        @hash = {}
+        clean
+      end
+      ##
+      # Calculates n-gram frequencies for the dataset
+      # @return Frequencies of ngram or sorted array
+      def calculate
+        @input.split(/[\d\s\[\]]/).each do |word|
+          calculate_word_gram("_#{word}_")
+        end
+        drop_unwanted_keys
+        @hash.sort { |one, other| other[1] <=> one[1] }
+      end
+      private
+        ##
+        # Enriched hash representing the n-gram of a word
+        # @param string word The word to calculate
+        def calculate_word_gram(word)
+          length = word.size
+          (0..length).each do |letter_position|
+            parameters = { letter_position: letter_position, word: word, length: length }
+            calculate_letter_gram(parameters)
+            length -= 1
+          end
+        end
+        ##
+        # Deletes a key if it is empty (a zero-length group of letters)
+        def drop_unwanted_keys
+          @hash.each_key do |key|
+            @hash.delete(key) if key.size <= 0
+          end
+        end
+        ##
+        # Stores the mono-gram, bi-gram and tri-gram in the hash
+        # @param hash parameters The list of necessary parameters :
+        #  - letter_position The position of the letter to be processed
+        #  - word The word treated
+        #  - length Current word size
+        def calculate_letter_gram(parameters)
+          (1..3).each do |nth|
+            letters = parameters[:word][parameters[:letter_position], nth]
+            next unless letters
+            init_key(letters)
+            @hash[letters] += 1 if parameters[:length] > (nth - 1)
+          end
+        end
+        ##
+        # Initialize key if necessary
+        # @param string letters The group of letters
+        def init_key(letters)
+          @hash[letters] ||= 0
+        end
+        ##
+        # Cleans the string passed as argument
+        def clean
+          safe_clean
+          specific_clean
+          clean_latin
+          @input = @input.strip.split(" ").join(" ")
+        end
+        ##
+        # Removes the ASCII letters a-z when letters outside a-z number at least half as many.
+        def clean_latin
+          latin = @input.scan(/[a-z]/)
+          nonlatin = @input.scan(/[\p{L}&&[^a-z]]/)
+          nonlatin_ratio = nonlatin.size / (latin.size * 1.0)
+          return if nonlatin_ratio < 0.5
+          @input.gsub!(/[a-zA-Z]/, "") if !latin.empty? && !nonlatin.empty?
+        end
+        ##
+        # Removes polluting web addresses, mails and characters
+        def specific_clean
+          uri_regex = %r/(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/
+          @input.gsub!(uri_regex, "")
+          # Remove mails
+          @input.gsub!(/[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}/, "")
+          # Replace polluting non-alphabetical characters, punctuation included, with a space
+          @input.gsub!(%r/[\*\^><!\"#\$%&\'\(\)\*\+:;,\._\/=\?@\{\}\[\]|\-\n\r0-9]/, " ")
+        end
+        ##
+        # Cleaning via existing tools
+        def safe_clean
+          @input = Sanitize.clean(@input)
+          @input = CGI.unescapeHTML(@input)
+          @input = Unicode.downcase(@input)
+        end
+    end
+  end
+end
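
Reviewer note: the counting performed by `Hasher#calculate` pads each word with underscores and tallies its 1-, 2- and 3-letter groups. A compact standalone sketch of that idea is shown below. It assumes the input is already cleaned, splits on any non-letter instead of the gem's exact regular expression, and uses the hypothetical name `count_grams`.

```ruby
# Hypothetical standalone sketch; not the gem's API.
# Returns [gram, count] pairs, most frequent first (ties in alphabetical order).
def count_grams(text, max_n: 3)
  counts = Hash.new(0)
  text.downcase.split(/[^\p{L}]+/).reject(&:empty?).each do |word|
    padded = "_#{word}_" # underscores mark the word boundaries
    (1..max_n).each do |n|
      padded.chars.each_cons(n) { |slice| counts[slice.join] += 1 }
    end
  end
  counts.sort_by { |gram, freq| [-freq, gram] }
end

count_grams("le chat").first(5).each { |gram, freq| puts "#{gram} #{freq}" }
```

`each_cons` only yields full-length windows, which plays the same role as the remaining-length check in `calculate_letter_gram`.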

data/lib/ai/nlp/n_gram/n_gram.rb ADDED

@@ -0,0 +1,29 @@
+# encoding: utf-8
+# frozen_string_literal: true
+require "ai/nlp/n_gram/hasher"
+# Module containing artificial intelligence tools
+module Ai
+  # Module containing automatic natural language processing tools
+  module Nlp
+    # Class for calculating n-grams, storing and exploiting them
+    class NGram
+      ##
+      # Cuts the data set into a grouping of letters
+      # @param string input The dataset
+      def hash(input)
+        calculate(input).map { |letters, _gram| letters }
+      end
+      ##
+      # Calculates the n-gram frequencies for the data set passed as an argument
+      # @param string input The dataset
+      # @return Frequencies of ngram or sorted array
+      def calculate(input)
+        hash = Hasher.new(input)
+        hash.calculate
+      end
+    end
+  end
+end

data/lib/ai/nlp/stem/fr.md ADDED

@@ -0,0 +1,178 @@
+The stemming algorithm
+In French the verb endings **ent** and **ons** cannot be removed without unacceptable overstemming.
+The **ons** form is rarer, but **ent** forms are quite common, and will appear regularly throughout a stemmed vocabulary.
+Letters in French include the following accented forms,
+    â   à   ç   ë   é   ê   è   ï   î   ô   û   ù
+The following letters are vowels:
+    a   e   i   o   u   y   â   à   ë   é   ê   è   ï   î   ô   û   ù
+Assume the word is in lower case. Then put into upper case u or i preceded and followed by a vowel, and y preceded or followed by a vowel. u after q is also put into upper case. For example,
+    jouer 		-> 		joUer
+    ennuie 		-> 		ennuIe
+    yeux 		-> 		Yeux
+    quand 		-> 		qUand
+(The upper case forms are not then classed as vowels — see note on vowel marking.)
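
Reviewer note: the marking rule above can be sketched in a few lines of Ruby. This follows the prose literally, testing neighbours on the original lower-case word, so it may differ from a reference stemmer on edge cases. The method names `french_vowel?` and `mark_vowels` are assumptions made for illustration.

```ruby
# Vowel test from the list above; upper-case marks (U, I, Y) are not vowels.
def french_vowel?(char)
  !char.nil? && "aeiouyâàëéêèïîôûù".include?(char)
end

# Upper-cases u/i between two vowels, y next to a vowel, and u after q.
def mark_vowels(word)
  chars = word.chars
  chars.each_with_index.map do |char, i|
    prev_char = i.zero? ? nil : chars[i - 1]
    before = french_vowel?(prev_char)
    after = french_vowel?(chars[i + 1])
    case char
    when "u" then (before && after) || prev_char == "q" ? "U" : char
    when "i" then before && after ? "I" : char
    when "y" then before || after ? "Y" : char
    else char
    end
  end.join
end

%w[jouer ennuie yeux quand].each { |word| puts mark_vowels(word) }
```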
+If the word begins with two vowels, RV is the region after the third letter, otherwise the region after the first vowel not at the beginning of the word, or the end of the word if these positions cannot be found. (Exceptionally, par, col or tap, at the beginning of a word is also taken to define RV as the region to their right.)
+For example,
+    a i m e r     a d o r e r     v o l e r    t a p i s
+         |...|         |.....|       |.....|        |...|
+R1 is the region after the first non-vowel following a vowel, or the end of the word if there is no such non-vowel. R2 is the region after the first non-vowel following a vowel in R1, or the end of the word if there is no such non-vowel. (See note on R1 and R2.)
+For example:
+    f a m e u s e m e n t
+         |......R1.......|
+               |...R2....|
+Note that R1 can contain RV (adorer), and RV can contain R1 (voler).
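
Reviewer note: the RV, R1 and R2 definitions above translate fairly directly into index arithmetic. The sketch below is one reading of those definitions; the method names are hypothetical and the input is assumed to be lower case with vowels already marked.

```ruby
def french_vowel?(char)
  !char.nil? && "aeiouyâàëéêèïîôûù".include?(char)
end

# Index just past the first non-vowel that follows a vowel, searching from `from`.
def after_vowel_then_non_vowel(word, from)
  ((from + 1)...word.length).each do |i|
    return i + 1 if french_vowel?(word[i - 1]) && !french_vowel?(word[i])
  end
  word.length
end

# Start index of RV, including the par/col/tap exception.
def rv_start(word)
  return 3 if word.start_with?("par", "col", "tap")
  return [3, word.length].min if french_vowel?(word[0]) && french_vowel?(word[1])
  (1...word.length).each { |i| return i + 1 if french_vowel?(word[i]) }
  word.length
end

def regions(word)
  r1 = after_vowel_then_non_vowel(word, 0)
  r2 = after_vowel_then_non_vowel(word, r1)
  { rv: word[rv_start(word)..-1], r1: word[r1..-1], r2: word[r2..-1] }
end

puts regions("fameusement").inspect
```

For `fameusement` this gives `eusement` for R1 and `ement` for R2, matching the diagram above.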
+Below, ‘delete if in R2’ means that a found suffix should be removed if it lies entirely in R2, but not if it overlaps R2 and the rest of the word. ‘delete if in R1 and preceded by X’ means that X itself does not have to come in R1, while ‘delete if preceded by X in R1’ means that X, like the suffix, must be entirely in R1.
+Start with step 1
+Step 1: Standard suffix removal
+    Search for the longest among the following suffixes, and perform the action indicated.
+    ance   iqUe   isme   able   iste   eux   ances   iqUes   ismes   ables   istes
+        delete if in R2
+    atrice   ateur   ation   atrices   ateurs   ations
+        delete if in R2
+        if preceded by ic, delete if in R2, else replace by iqU
+    logie   logies
+        replace with log if in R2
+    usion   ution   usions   utions
+        replace with u if in R2
+    ence   ences
+        replace with ent if in R2
+    ement   ements
+        delete if in RV
+        if preceded by iv, delete if in R2 (and if further preceded by at, delete if in R2), otherwise,
+        if preceded by eus, delete if in R2, else replace by eux if in R1, otherwise,
+        if preceded by abl or iqU, delete if in R2, otherwise,
+        if preceded by ièr or Ièr, replace by i if in RV
+    ité   ités
+        delete if in R2
+        if preceded by abil, delete if in R2, else replace by abl, otherwise,
+        if preceded by ic, delete if in R2, else replace by iqU, otherwise,
+        if preceded by iv, delete if in R2
+    if   ive   ifs   ives
+        delete if in R2
+        if preceded by at, delete if in R2 (and if further preceded by ic, delete if in R2, else replace by iqU)
+    eaux
+        replace with eau
+    aux
+        replace with al if in R1
+    euse   euses
+        delete if in R2, else replace by eux if in R1
+    issement   issements
+        delete if in R1 and preceded by a non-vowel
+    amment
+        replace with ant if in RV
+    emment
+        replace with ent if in RV
+    ment   ments
+        delete if preceded by a vowel in RV
+In steps 2a and 2b all tests are confined to the RV region.
+Do step 2a if either no ending was removed by step 1, or one of the endings amment, emment, ment, ments was found.
+Step 2a: Verb suffixes beginning i
+    Search for the longest among the following suffixes and if found, delete if preceded by a non-vowel.
+        îmes   ît   îtes   i   ie   ies   ir   ira   irai   iraIent   irais   irait   iras   irent   irez   iriez   irions   irons   iront   is   issaIent   issais   issait   issant   issante   issantes   issants   isse   issent   isses   issez   issiez   issions   issons   it
+    (Note that the non-vowel itself must also be in RV.)
+Do step 2b if step 2a was done, but failed to remove a suffix.
+Step 2b: Other verb suffixes
+    Search for the longest among the following suffixes, and perform the action indicated.
+    ions
+        delete if in R2
+    é   ée   ées   és   èrent   er   era   erai   eraIent   erais   erait   eras   erez   eriez   erions   erons   eront   ez   iez
+        delete
+    âmes   ât   âtes   a   ai   aIent   ais   ait   ant   ante   antes   ants   as   asse   assent   asses   assiez   assions
+        delete
+        if preceded by e, delete
+    (Note that the e that may be deleted in this last step must also be in RV.)
+If the last step to be obeyed — either step 1, 2a or 2b — altered the word, do step 3
+Step 3
+    Replace final Y with i or final ç with c
+Alternatively, if the last step to be obeyed did not alter the word, do step 4
+Step 4: Residual suffix
+    If the word ends in s not preceded by a, i, o, u, è or s, delete the s.
+    In the rest of step 4, all tests are confined to the RV region.
+    Search for the longest among the following suffixes, and perform the action indicated.
+    ion
+        delete if in R2 and preceded by s or t
+    ier   ière   Ier   Ière
+        replace with i
+    e
+        delete
+    ë
+        if preceded by gu, delete
+    (So note that ion is removed only when it is in R2 — as well as being in RV — and preceded by s or t which must be in RV.)
+Always do steps 5 and 6.
+Step 5: Undouble
+    If the word ends in enn, onn, ett, ell or eill, delete the last letter.
+Step 6: Un-accent
+    If the word ends in é or è followed by at least one non-vowel, remove the accent from the e.
+And finally:
+    Turn any remaining I, U and Y letters in the word back into lower case.
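
The vowel marking and the RV, R1 and R2 regions described in fr.md can be sketched in Ruby as follows. This is a minimal illustration, not code from the gem (stem.rb is empty in this version): the module name `FrenchStemRegions` and all of its methods are hypothetical. One simplifying assumption is that neighbours are tested against the original lower-case word in a single pass, which may differ from a reference stemmer on words with several adjacent candidate letters. Regions are returned as start indexes into the marked word.

```ruby
# frozen_string_literal: true

# Hypothetical helper module; none of these names come from the ai-nlp gem.
module FrenchStemRegions
  # Lower-case vowels only, so the marked forms I, U and Y are non-vowels.
  VOWELS = "aeiouyâàëéêèïîôûù"

  module_function

  def vowel?(char)
    !char.nil? && VOWELS.include?(char)
  end

  # Upper-cases u/i between two vowels, y next to a vowel, and u after q.
  # Assumption: neighbours are checked against the unmarked input word.
  def mark_vowels(word)
    chars = word.chars
    marked = chars.each_with_index.map do |c, i|
      prev = i.zero? ? nil : chars[i - 1]
      nxt = chars[i + 1]
      case c
      when "u" then (prev == "q" || (vowel?(prev) && vowel?(nxt))) ? "U" : c
      when "i" then vowel?(prev) && vowel?(nxt) ? "I" : c
      when "y" then vowel?(prev) || vowel?(nxt) ? "Y" : c
      else c
      end
    end
    marked.join
  end

  # Start index of RV in an already marked word.
  def rv_start(word)
    return [3, word.length].min if vowel?(word[0]) && vowel?(word[1])
    return 3 if word.start_with?("par", "col", "tap")

    idx = (1...word.length).find { |i| vowel?(word[i]) }
    idx ? idx + 1 : word.length
  end

  # Index just past the first non-vowel that follows a vowel, searching
  # from `from`; the word length if there is no such non-vowel.
  def after_vowel_then_non_vowel(word, from)
    (from...word.length - 1).each do |i|
      return i + 2 if vowel?(word[i]) && !vowel?(word[i + 1])
    end
    word.length
  end

  # Start indexes of the three regions for an already marked word.
  def regions(word)
    r1 = after_vowel_then_non_vowel(word, 0)
    r2 = after_vowel_then_non_vowel(word, r1)
    { rv: rv_start(word), r1: r1, r2: r2 }
  end
end

word = FrenchStemRegions.mark_vowels("fameusement")
starts = FrenchStemRegions.regions(word)
puts word[starts[:r1]..-1] # prints "eusement"
puts word[starts[:r2]..-1] # prints "ement"
```

With start indexes in hand, a test such as "delete if in R2" reduces to checking that the suffix begins at or after `starts[:r2]`.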

data/lib/ai/nlp/stem/stem.rb ADDED

File without changes

data/lib/ai/nlp/version.rb ADDED

@@ -0,0 +1,7 @@
+# frozen_string_literal: true
+module Ai
+  module Nlp
+    VERSION = "0.1.0".freeze
+  end
+end

data/lib/tasks/ai/nlp_tasks.rake ADDED

@@ -0,0 +1,5 @@
+# frozen_string_literal: true
+# desc "Explaining what the task does"
+# task :ai_nlp do
+#   # Task goes here
+# end

metadata ADDED

@@ -0,0 +1,112 @@
+--- !ruby/object:Gem::Specification
+name: ai-nlp
+version: !ruby/object:Gem::Version
+  version: 0.1.0
+platform: ruby
+authors:
+- Alain ANDRE
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2017-09-11 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: rails
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 5.1.3
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 5.1.3
+- !ruby/object:Gem::Dependency
+  name: sanitize
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '4.5'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '4.5'
+- !ruby/object:Gem::Dependency
+  name: unicode
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.4.4.4
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.4.4.4
+description: |2
+    This gem contains a grouping of ruby tools related to Artificial Intelligence
+    and Automatic Natural Language Processing
+email:
+- dev@alain-andre.fr
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- MIT-LICENSE
+- README.md
+- Rakefile
+- app/assets/config/ai_nlp_manifest.js
+- app/assets/javascripts/ai/nlp/application.js
+- app/assets/stylesheets/ai/nlp/application.css
+- app/controllers/ai/nlp/application_controller.rb
+- app/helpers/ai/nlp/application_helper.rb
+- app/jobs/ai/nlp/application_job.rb
+- app/mailers/ai/nlp/application_mailer.rb
+- app/models/ai/nlp/application_record.rb
+- app/models/ai/nlp/language.rb
+- app/views/layouts/ai/nlp/application.html.erb
+- config/routes.rb
+- db/migrate/20170907142959_create_ia_taln_languages.rb
+- lib/ai/nlp.rb
+- lib/ai/nlp/engine.rb
+- lib/ai/nlp/languages.rb
+- lib/ai/nlp/n_gram/hasher.rb
+- lib/ai/nlp/n_gram/n_gram.rb
+- lib/ai/nlp/stem/fr.md
+- lib/ai/nlp/stem/stem.rb
+- lib/ai/nlp/version.rb
+- lib/tasks/ai/nlp_tasks.rake
+homepage: https://gitlab.com/al1_andre/ai-nlp
+licenses:
+- MIT
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '2.3'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 2.5.1
+signing_key:
+specification_version: 4
+summary: Artificial Intelligence and Automatic Natural Language Processing
+test_files: []
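
The three runtime dependencies in the metadata all use the pessimistic `~>` operator, which allows only the last listed version segment to increase. A short sketch with RubyGems' own `Gem::Requirement` and `Gem::Version` classes shows which versions each constraint accepts. The requirement strings are copied from the metadata; the candidate version numbers are arbitrary probes chosen for illustration, and the constant `REQUIREMENTS` and helper `allowed?` are hypothetical names.

```ruby
require "rubygems"

# Requirement strings copied from the gemspec metadata above.
REQUIREMENTS = {
  "rails"    => Gem::Requirement.new("~> 5.1.3"),
  "sanitize" => Gem::Requirement.new("~> 4.5"),
  "unicode"  => Gem::Requirement.new("~> 0.4.4.4")
}.freeze

# Hypothetical helper: does the named dependency accept this version?
def allowed?(name, version)
  REQUIREMENTS.fetch(name).satisfied_by?(Gem::Version.new(version))
end

puts allowed?("rails", "5.1.7")     # true:  >= 5.1.3 and < 5.2
puts allowed?("rails", "5.2.0")     # false
puts allowed?("sanitize", "4.6.6")  # true:  >= 4.5 and < 5.0
puts allowed?("sanitize", "5.0.0")  # false
puts allowed?("unicode", "0.4.4.5") # true:  >= 0.4.4.4 and < 0.4.5
puts allowed?("unicode", "0.4.5")   # false
```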