gtfs_stops_clustering 0.1.4 → 0.1.6

Files changed (11)

checksums.yaml +4 -4
data/.rubocop.yml +9 -0
data/CHANGELOG.md +6 -2
data/README.md +73 -8
data/gtfs_stops_clustering.gemspec +0 -4
data/lib/gtfs_stops_clustering/dbscan.rb +19 -51
data/lib/gtfs_stops_clustering/redis_geodata.rb +1 -0
data/lib/gtfs_stops_clustering/utils.rb +41 -0
data/lib/gtfs_stops_clustering/version.rb +1 -1
data/lib/gtfs_stops_clustering.rb +5 -7
metadata +3 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: bd1d3d49ce47faac98cf22d674adab3d37d0afc1a0b7b55ecba36fe5cde3bac3
-  data.tar.gz: 3b6572e53a4268c2e8def8dd80e8605768e425dd05570827822b80ce0448a100
+  metadata.gz: 5babddb5a5a80c3afcdd55dca8b93738633c3e7420392683240665c8c375d4c9
+  data.tar.gz: c056da474b4bc656e83f6ba76c14e7ea7868225b1075818447da74eb68bc52a6
 SHA512:
-  metadata.gz: a7f35b9d4c35f638b5baac0fa384f50e44db4ef1a20cfc119c74c184b1d900c1207704c1503094fb85c63ea84c9acf283c4cb33e8e7efa7ae13aa5d18e5d5d67
-  data.tar.gz: 6b5c325393774c7f926891445c09f82eda11fbdb1b5528942e31284f2c5c946ceada56aabe82911a8167956a933ffce1ec17fb599a71ace3853e24dcbec059a9
+  metadata.gz: 2ad48d54a269da348ffde78fbddb40697b7ed268b66e2ae5f22cae4fd834edccea9e360eb65fecd7a2fb3c22c5f3dfca0e297e2e7adbd61eb75ac106b4b181ef
+  data.tar.gz: 6c553ba442c11e604fb7740be93c19ff4ea013e3b5fa67c58745c4c704246eace595e147d20792ba5c36b54e640821a3f83688530fdb101d2ed214b47c8f7bce

data/.rubocop.yml CHANGED

@@ -11,3 +11,12 @@ Style/StringLiteralsInInterpolation:
 Layout/LineLength:
   Max: 140
+Style/FrozenStringLiteralComment:
+  Enabled: false
+Metrics/MethodLength:
+  Max: 25
+Metrics/AbcSize:
+  Max: 20

data/CHANGELOG.md CHANGED

@@ -1,5 +1,9 @@
-## [Unreleased]
+## 0.1.6
-## [0.1.0] - 2023-12-06
+## [0.1.6] - 2023-12-19
+- Clean Redis stops data after performing the clustering algorithm
+## [0.1.5] - 2023-12-10
 - Initial release

data/README.md CHANGED

@@ -1,24 +1,89 @@
-# GtfsStopsClustering
+# GTFS Stops clustering
+[![Gem Version](https://badge.fury.io/rb/gtfs_stops_clustering.svg)](https://badge.fury.io/rb/gtfs_stops_clustering)
-TODO: Delete this and the text below, and describe your gem
+GTFS Stops Clustering is a Ruby Gem designed to read [GTFS](https://gtfs.org) (General Transit Feed Specification) stops data and create clusters based on the following parameters:
-Welcome to your new gem! In this directory, you'll find the files you need to be able to package up your Ruby library into a gem. Put your Ruby code in the file `lib/gtfs_stops_clustering`. To experiment with that code, run `bin/console` for an interactive prompt.
+- `GTFS paths` [Required]: array of GTFS zip file paths whose stops will be combined in the clustering algorithm
+- `Epsilon` [Required]: the maximum distance (in km) between two stops for them to be considered neighbors of one another (e.g. 0.01, 0.5, 2)
+- `Min Points` [Required]: the minimum number of neighbors a point needs to have to be considered a core point (e.g. 3, 5, 10)
+- `Names Similarity` [Optional]: besides geographical proximity, the algorithm also considers the similarity between stop names, using a string similarity measure. This lets stops with similar names fall within the same cluster. Accepted values range from 0 to 1: the closer the value is to 1, the more similar two stop names must be for the stops to be placed in the same cluster. The default value is 1; to create clusters based only on stop positions, set this to 0.
+- `Stop config file` (CSV file path) [Optional]: a file for cases where stop names need to be altered or mapped to different names before running the clustering algorithm. Each entry consists of two columns:
+**stop_name**: the original name of the stop that requires modification or mapping to another name. **cluster_name**: the name to which the original stop name should be changed or mapped during the clustering process.
+It uses the density-based [DBSCAN](https://en.wikipedia.org/wiki/DBSCAN) algorithm to perform clustering. I based my core algorithm on the gem [Dbscan](https://github.com/matiasinsaurralde/dbscan).
+### Stops config file example
+Here is an example of a stops_config CSV file:
+```csv
+stop_name,cluster_name
+Stop Name To Be Changed,Actual Name
+Amargosa Valley (Demo),Amargosa Valley
+E Main St / S Irving St (Demo),E Main St / S Irving St
+```
+When this CSV file is passed to the clustering algorithm, **Amargosa Valley (Demo)** will be renamed **Amargosa Valley**, and so on for all the entries provided. I implemented this feature because the GTFS feed I was working with shipped with badly named stops (typos).
+## Requirements
+It is essential to have a **Redis server instance running locally (on default port 6379)** because the algorithm leverages Redis geospatial queries for efficient spatial operations.
+The Redis server is utilized to optimize geospatial queries, allowing the clustering algorithm to efficiently process proximity-related computations required during the clustering process.
+Please ensure that a Redis server is installed and running on your local machine to utilize the gem effectively.
 ## Installation
-TODO: Replace `UPDATE_WITH_YOUR_GEM_NAME_PRIOR_TO_RELEASE_TO_RUBYGEMS_ORG` with your gem name right after releasing it to RubyGems.org. Please do not do it earlier due to security reasons. Alternatively, replace this section with instructions to install your gem from git if you don't plan to release to RubyGems.org.
+Add this line to your application's Gemfile:
-Install the gem and add to the application's Gemfile by executing:
+```ruby
+gem 'gtfs_stops_clustering', '~> 0.1.5'
+```
+Then run the following command:
-    $ bundle add UPDATE_WITH_YOUR_GEM_NAME_PRIOR_TO_RELEASE_TO_RUBYGEMS_ORG
+```bash
+$ bundle install
+```
 If bundler is not being used to manage dependencies, install the gem by executing:
-    $ gem install UPDATE_WITH_YOUR_GEM_NAME_PRIOR_TO_RELEASE_TO_RUBYGEMS_ORG
+```bash
+$ gem install gtfs_stops_clustering
+```
 ## Usage
-TODO: Write usage instructions here
+```ruby
+require 'gtfs_stops_clustering'
+include GtfsStopsClustering
+gtfs_paths = ["path/to/gtfs/zip"]
+clusters = build_clusters(gtfs_paths, 0.3, 1, 0.85)
+clusters.each do |index, cluster|
+  puts index
+  cluster.each do |stop|
+    puts stop.inspect
+  end
+end
+```
+The output below refers to the GTFS file located in `test/fixtures/sample-feed-2.zip` (the sample feed provided by Google, modified slightly to create "clusterable" stops, since the original stops were all too far apart to be clustered). The optional `stops config` parameter is omitted here.
+```
+-1
+{:stop_id=>"4", :stop_code=>nil, :cluster_name=>nil, :cluster_pos=>[], :stop_name=>"Stagecoach Hotel & Casino (Demo)", :stop_lat=>"36.915682", :stop_lon=>"-116.751677", :parent_station=>nil}
+{:stop_id=>"6", :stop_code=>nil, :cluster_name=>nil, :cluster_pos=>[], :stop_name=>"Alone stop (sad)", :stop_lat=>"36.914944", :stop_lon=>"-116.761472", :parent_station=>nil}
+{:stop_id=>"8", :stop_code=>nil, :cluster_name=>nil, :cluster_pos=>[], :stop_name=>"E Main St / S Irving St (Demo)", :stop_lat=>"36.905697", :stop_lon=>"-116.76218", :parent_station=>nil}
+{:stop_id=>"9", :stop_code=>nil, :cluster_name=>nil, :cluster_pos=>[], :stop_name=>"Amargosa Valley (Demo)", :stop_lat=>"36.641496", :stop_lon=>"-116.40094", :parent_station=>nil}
+0
+{:stop_id=>"1", :stop_code=>nil, :cluster_name=>"Awesome Stop Name", :cluster_pos=>[36.425286, -117.133156], :stop_name=>"Awesome stop name 1", :stop_lat=>"36.425288", :stop_lon=>"-117.133162", :parent_station=>nil}
+{:stop_id=>"5", :stop_code=>nil, :cluster_name=>"Awesome Stop Name", :cluster_pos=>[36.425286, -117.133156], :stop_name=>"Awesome stop name 2", :stop_lat=>"36.425284", :stop_lon=>"-117.133150", :parent_station=>nil}
+1
+{:stop_id=>"2", :stop_code=>nil, :cluster_name=>"Nye County Airport", :cluster_pos=>[36.868429, -116.78467699999999], :stop_name=>"Nye County Airport A1", :stop_lat=>"36.868446", :stop_lon=>"-116.784582", :parent_station=>nil}
+{:stop_id=>"3", :stop_code=>nil, :cluster_name=>"Nye County Airport", :cluster_pos=>[36.868429, -116.78467699999999], :stop_name=>"Nye County Airport A2", :stop_lat=>"36.868417", :stop_lon=>"-116.784352", :parent_station=>nil}
+{:stop_id=>"7", :stop_code=>nil, :cluster_name=>"Nye County Airport", :cluster_pos=>[36.868429, -116.78467699999999], :stop_name=>"Nye County Airport A5", :stop_lat=>"36.868424", :stop_lon=>"-116.785097", :parent_station=>nil}
+```
 ## Development
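
The README output above suggests that `build_clusters` returns a hash keyed by cluster index, with unclustered stops under `-1`. The following is a minimal sketch of consuming such a hash. It assumes that shape and uses hand-written sample data (trimmed to three fields) rather than calling the gem, so it runs without Redis.

```ruby
# Hypothetical data shaped like the README output (fields trimmed).
clusters = {
  -1 => [{ stop_id: "4", cluster_name: nil, cluster_pos: [] }],
  0 => [
    { stop_id: "1", cluster_name: "Awesome Stop Name", cluster_pos: [36.425286, -117.133156] },
    { stop_id: "5", cluster_name: "Awesome Stop Name", cluster_pos: [36.425286, -117.133156] }
  ]
}

# Stops that were not assigned to any cluster live under key -1.
noise = clusters.fetch(-1, [])
grouped = clusters.reject { |index, _stops| index == -1 }

# One summary row per cluster: shared name, shared position, member ids.
summary = grouped.map do |index, stops|
  {
    index: index,
    name: stops.first[:cluster_name],
    pos: stops.first[:cluster_pos],
    stop_ids: stops.map { |stop| stop[:stop_id] }
  }
end

puts "#{noise.size} unclustered stop(s)"
summary.each { |row| puts row.inspect }
```

Every stop in a cluster carries the same `cluster_name` and `cluster_pos` in the sample output, which is why the sketch reads them from the first member only.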

data/gtfs_stops_clustering.gemspec CHANGED

@@ -32,10 +32,6 @@ Gem::Specification.new do |spec|
   spec.executables = spec.files.grep(%r{\Aexe/}) { |f| File.basename(f) }
   spec.require_paths = ["lib"]
-  # spec.files = ["lib/gtfs_stops_clustering.rb", "lib/gtfs_stops_clustering/data_import.rb", "lib/gtfs_stops_clustering/dbscan.rb",
-  #               "lib/gtfs_stops_clustering/redis_geodata.rb", "lib/gtfs_stops_clustering/version.rb",
-  #               "lib/gtfs_stops_clustering/input_consistency_checks.rb"]
   spec.add_runtime_dependency "csv", "~> 3.2", ">= 3.2.8"
   spec.add_runtime_dependency "distance_measures", "~> 0.0.6"
   spec.add_runtime_dependency "geocoder", "~> 1.8", ">= 1.8.2"

data/lib/gtfs_stops_clustering/dbscan.rb CHANGED

@@ -4,6 +4,7 @@ require "distance_measures"
 require "text"
 require "geocoder"
 require_relative "redis_geodata"
+require_relative "utils"
 # Array class
 class Array
@@ -50,31 +51,26 @@ module DBSCAN
         if neighbors.size >= options[:min_points]
           current_cluster += 1
-          point.cluster = current_cluster
-          cluster = [point].push(add_connected(neighbors, current_cluster))
-          clusters[current_cluster] = cluster.flatten
-          # Get Cluster Name
-          labels = clusters[current_cluster].map { |e| e.label.capitalize }
-          cluster_name = find_cluster_name(labels)
-          # Get Cluster Position
-          cluster_pos = find_cluster_position(clusters[current_cluster])
-          clusters[current_cluster].each do |e|
-            e.cluster_name = cluster_name
-            e.cluster_pos = cluster_pos
-          end
+          create_cluster(current_cluster, point, neighbors)
+          update_cluster_info(current_cluster)
         else
           clusters[-1].push(point)
         end
       end
     end
-    def results
-      hash = {}
-      @clusters.dup.each { |cluster_index, value| hash[cluster_index] = value.flatten.map(&:items) unless value.flatten.empty? }
-      hash
+    def create_cluster(cluster_index, point, neighbors)
+      point.cluster = cluster_index
+      cluster = [point].push(add_connected(neighbors, cluster_index))
+      @clusters[cluster_index] = cluster.flatten
+    end
+    def update_cluster_info(cluster_index)
+      labels = @clusters[cluster_index].map { |e| e.label.capitalize }
+      @clusters[cluster_index].each do |e|
+        e.cluster_name = Utils.find_cluster_name(labels)
+        e.cluster_pos = Utils.find_cluster_position(clusters[cluster_index])
+      end
     end
     def labeled_results
@@ -103,16 +99,10 @@ module DBSCAN
       neighbors = []
       geosearch_results = geosearch(point.items[1], point.items[0])
       geosearch_results.each do |neighbor_pos|
-        coordinates = neighbor_pos.split(",")
-        neighbor = @points.find do |elem|
-          elem.items[0] == coordinates[1] &&
-            elem.items[1] == coordinates[0]
-        end
+        neighbor = Utils.find_inmediate_neighbor(neighbor_pos, @points)
         next unless neighbor
-        string_distance = Text::Levenshtein.distance(point.label.downcase, neighbor.label.downcase)
-        similarity = 1 - string_distance.to_f / [point.label.length, point.label.length].max
-        neighbors.push(neighbor) if similarity > options[:similarity]
+        neighbors.push(neighbor) if Utils.string_similarity(point.label.downcase, neighbor.label.downcase) > options[:similarity]
       end
       neighbors
     end
@@ -139,30 +129,8 @@ module DBSCAN
       cluster_points
     end
-    def find_cluster_name(labels)
-      words = labels.map { |label| label.strip.split }
-      common_title = ""
-      # Loop through each word index starting from the first
-      (0...words.first.length).each do |i|
-        words_at_index = words.map { |word_list| word_list[i] }
-        break unless words_at_index.uniq.length == 1
-        common_title += " #{words_at_index.first.capitalize}"
-      end
-      common_title.strip! ? common_title : labels.first
-    end
-    def find_cluster_position(cluster)
-      total_lat = cluster.map { |e| e.items[0].to_f }.sum
-      total_lon = cluster.map { |e| e.items[1].to_f }.sum
-      avg_lat = total_lat / cluster.size
-      avg_lon = total_lon / cluster.size
-      [avg_lat, avg_lon]
-    end
   end
   # Point class
   class Point
     attr_accessor :items, :cluster, :visited, :label, :cluster_name, :cluster_pos
@@ -182,7 +150,7 @@ module DBSCAN
     end
   end
-  def DBSCAN(* args)
+  def dbscan(* args)
     clusterer = Clusterer.new(*args)
     clusterer.labeled_results
   end
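
For readers unfamiliar with the control flow being refactored in `dbscan.rb`, here is a minimal, self-contained DBSCAN sketch. It is not the gem's implementation. It assumes brute-force haversine distances instead of Redis geosearch, ignores name similarity, and counts neighbors excluding the point itself. The names `haversine_km` and `dbscan_sketch` are hypothetical.

```ruby
# Great-circle distance in km between two [lat, lon] pairs (haversine formula).
def haversine_km(a, b)
  rad = Math::PI / 180
  dlat = (b[0] - a[0]) * rad
  dlon = (b[1] - a[1]) * rad
  h = Math.sin(dlat / 2)**2 +
      Math.cos(a[0] * rad) * Math.cos(b[0] * rad) * Math.sin(dlon / 2)**2
  2 * 6371.0 * Math.asin(Math.sqrt(h))
end

# Returns { cluster_index => [point, ...] }, with noise under -1.
def dbscan_sketch(points, epsilon:, min_points:)
  labels = Array.new(points.size) # nil = unvisited, -1 = noise
  neighbors_of = lambda do |i|
    points.each_index.select { |j| j != i && haversine_km(points[i], points[j]) <= epsilon }
  end

  current = -1
  points.each_index do |i|
    next unless labels[i].nil?

    neighbors = neighbors_of.call(i)
    if neighbors.size < min_points
      labels[i] = -1 # not a core point: noise unless a later cluster claims it
      next
    end

    current += 1
    labels[i] = current
    queue = neighbors.dup
    until queue.empty?
      j = queue.shift
      labels[j] = current if labels[j] == -1 # former noise becomes a border point
      next unless labels[j].nil?

      labels[j] = current
      expansion = neighbors_of.call(j)
      queue.concat(expansion) if expansion.size >= min_points # only core points expand
    end
  end

  points.each_index.group_by { |i| labels[i] }
        .transform_values { |indexes| indexes.map { |i| points[i] } }
end

stops = [[36.425288, -117.133162], [36.425284, -117.133150], [36.641496, -116.40094]]
p dbscan_sketch(stops, epsilon: 0.3, min_points: 1)
```

The brute-force neighbor search is quadratic in the number of stops, which is the cost a geospatial index such as Redis is meant to avoid.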

data/lib/gtfs_stops_clustering/redis_geodata.rb CHANGED

@@ -22,6 +22,7 @@ module RedisGeodata
       @key = "stops"
       @epsilon = epsilon
       geoadd
+      ObjectSpace.define_finalizer(self, proc { @redis.del(@key) })
     end
     def geoadd
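
The hunk above registers a finalizer so that the Redis key is deleted once the object is finalized. The sketch below shows one common way to write that pattern; it is an illustration, not the gem's code. `FakeRedis` and `GeoStore` are hypothetical stand-ins so the example runs without a Redis server. The finalizer proc is built in a class method so it does not close over `self`; Ruby's documentation for `ObjectSpace.define_finalizer` notes that a finalizer referencing its own object is not run on garbage collection, only at interpreter exit.

```ruby
# Hypothetical stand-in for a Redis client: it only records deleted keys.
class FakeRedis
  attr_reader :deleted

  def initialize
    @deleted = []
  end

  def del(key)
    @deleted << key
  end
end

# Hypothetical store that cleans up its key when it is finalized.
class GeoStore
  def initialize(redis, key)
    @redis = redis
    @key = key
    # The proc comes from a class method, so it captures only redis and key.
    ObjectSpace.define_finalizer(self, self.class.cleanup(redis, key))
  end

  def self.cleanup(redis, key)
    proc { redis.del(key) }
  end
end

redis = FakeRedis.new
GeoStore.new(redis, "stops")
# When the finalizer actually runs is up to the garbage collector.
```

If deterministic cleanup matters, an explicit method that deletes the key (called in an `ensure` block) is a simpler alternative to relying on finalizers.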

data/lib/gtfs_stops_clustering/utils.rb ADDED

@@ -0,0 +1,41 @@
+# lib/utils.rb
+# Utils class
+class Utils
+  def self.find_cluster_name(labels)
+    words = labels.map { |label| label.strip.split }
+    common_title = ""
+    # Loop through each word index starting from the first
+    (0...words.first.length).each do |i|
+      words_at_index = words.map { |word_list| word_list[i] }
+      break unless words_at_index.uniq.length == 1
+      common_title += " #{words_at_index.first.capitalize}"
+    end
+    common_title.strip! ? common_title : labels.first
+  end
+  def self.find_cluster_position(cluster)
+    total_lat = cluster.map { |e| e.items[0].to_f }.sum
+    total_lon = cluster.map { |e| e.items[1].to_f }.sum
+    avg_lat = total_lat / cluster.size
+    avg_lon = total_lon / cluster.size
+    [avg_lat, avg_lon]
+  end
+  def self.string_similarity(str1, str2)
+    string_distance = Text::Levenshtein.distance(str1.downcase, str2.downcase)
+    1 - string_distance.to_f / [str1.length, str2.length].max
+  end
+  def self.find_inmediate_neighbor(neighbor_pos, points)
+    coordinates_split = neighbor_pos.split(",")
+    points.find do |elem|
+      elem.items[0] == coordinates_split[1] &&
+        elem.items[1] == coordinates_split[0]
+    end
+  end
+end
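
To make the helpers in `utils.rb` concrete, here is a worked example on the "Nye County Airport" stops from the README. It is a sketch under assumptions: `levenshtein` is a plain-Ruby stand-in for `Text::Levenshtein.distance` so that no gem is required, and `common_prefix_name` and `mean_position` are hypothetical re-statements of `find_cluster_name` and `find_cluster_position` operating on plain arrays.

```ruby
# Plain-Ruby Levenshtein distance (stand-in for Text::Levenshtein.distance).
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    curr = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Similarity in [0, 1]: 1 minus the edit distance over the longer length.
def string_similarity(str1, str2)
  1 - levenshtein(str1.downcase, str2.downcase).to_f / [str1.length, str2.length].max
end

# Leading words shared by every label; falls back to the first label.
def common_prefix_name(labels)
  words = labels.map { |label| label.strip.split }
  prefix = words.first.take_while.with_index do |word, i|
    words.all? { |list| list[i] == word }
  end
  prefix.empty? ? labels.first : prefix.map(&:capitalize).join(" ")
end

# Arithmetic mean of [lat, lon] pairs.
def mean_position(coords)
  [coords.sum { |c| c[0] } / coords.size, coords.sum { |c| c[1] } / coords.size]
end

names = ["Nye County Airport A1", "Nye County Airport A2", "Nye County Airport A5"]
coords = [[36.868446, -116.784582], [36.868417, -116.784352], [36.868424, -116.785097]]

puts string_similarity(names[0], names[1]) # one edit over 21 characters
puts common_prefix_name(names)
p mean_position(coords)
```

Two of these names differ by a single character out of 21, giving a similarity of about 0.95, which clears the 0.85 threshold used in the README usage example.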

data/lib/gtfs_stops_clustering/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module GtfsStopsClustering
-  VERSION = "0.1.4"
+  VERSION = "0.1.6"
 end

data/lib/gtfs_stops_clustering.rb CHANGED

@@ -37,19 +37,17 @@ module GtfsStopsClustering
     def create_stops_merged
       gtfs_stops = []
       @gtfs_paths.each do |gtfs_path|
-        begin
-          gtfs = GTFS::Source.build(gtfs_path)
-          gtfs_stops << gtfs.stops
-        rescue GTFS::InvalidSourceException => e
-          raise IOError "Error occurred while building GTFS from #{gtfs_path}: #{e.message}"
-        end
+        gtfs = GTFS::Source.build(gtfs_path)
+        gtfs_stops << gtfs.stops
+      rescue GTFS::InvalidSourceException => e
+        raise IOError "Error occurred while building GTFS from #{gtfs_path}: #{e.message}"
       end
       gtfs_stops.flatten
     end
     def clusterize_stops
       data = import_stops_data(@gtfs_stops, @stops_config_path)
-      @clusters = DBSCAN(data[:stops_data],
+      @clusters = dbscan(data[:stops_data],
                          data[:stops_redis_geodata],
                          epsilon: @epsilon,
                          min_points: @min_points,
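
The first hunk above replaces an explicit `begin`/`end` with a `rescue` attached directly to the `do` block, a form available since Ruby 2.5. The sketch below isolates that pattern. `InvalidSource`, `load_stops`, and `merge_stops` are hypothetical stand-ins for `GTFS::InvalidSourceException`, `GTFS::Source.build`, and `create_stops_merged`, so the example runs without the gtfs gem. This sketch re-raises with the `raise Class, message` form, which separates the exception class and the message with a comma.

```ruby
# Hypothetical error and loader standing in for the GTFS library.
class InvalidSource < StandardError; end

def load_stops(path)
  raise InvalidSource, "not a zip file" unless path.end_with?(".zip")

  ["stop from #{path}"]
end

def merge_stops(paths)
  stops = []
  paths.each do |path|
    stops << load_stops(path)
  rescue InvalidSource => e
    # The rescue clause belongs to the do block, so no begin/end is needed.
    raise IOError, "Error occurred while building GTFS from #{path}: #{e.message}"
  end
  stops.flatten
end

p merge_stops(["a.zip", "b.zip"])
```

Because the `rescue` covers the whole block body, any `InvalidSource` raised for one path is converted to an `IOError` that names that path.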

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: gtfs_stops_clustering
 version: !ruby/object:Gem::Version
-  version: 0.1.4
+  version: 0.1.6
 platform: ruby
 authors:
 - Pietro Visconti
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2023-12-08 00:00:00.000000000 Z
+date: 2023-12-19 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: csv
@@ -166,6 +166,7 @@ files:
 - lib/gtfs_stops_clustering/dbscan.rb
 - lib/gtfs_stops_clustering/input_consistency_checks.rb
 - lib/gtfs_stops_clustering/redis_geodata.rb
+- lib/gtfs_stops_clustering/utils.rb
 - lib/gtfs_stops_clustering/version.rb
 - lib/stops_corner_cases.txt
 - sig/gtfs_stops_clustering.rbs