gtfs_stops_clustering 0.1.4 → 0.1.6
Sign up to get free protection for your applications and to get access to all the features.
- checksums.yaml +4 -4
- data/.rubocop.yml +9 -0
- data/CHANGELOG.md +6 -2
- data/README.md +73 -8
- data/gtfs_stops_clustering.gemspec +0 -4
- data/lib/gtfs_stops_clustering/dbscan.rb +19 -51
- data/lib/gtfs_stops_clustering/redis_geodata.rb +1 -0
- data/lib/gtfs_stops_clustering/utils.rb +41 -0
- data/lib/gtfs_stops_clustering/version.rb +1 -1
- data/lib/gtfs_stops_clustering.rb +5 -7
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 5babddb5a5a80c3afcdd55dca8b93738633c3e7420392683240665c8c375d4c9
|
4
|
+
data.tar.gz: c056da474b4bc656e83f6ba76c14e7ea7868225b1075818447da74eb68bc52a6
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 2ad48d54a269da348ffde78fbddb40697b7ed268b66e2ae5f22cae4fd834edccea9e360eb65fecd7a2fb3c22c5f3dfca0e297e2e7adbd61eb75ac106b4b181ef
|
7
|
+
data.tar.gz: 6c553ba442c11e604fb7740be93c19ff4ea013e3b5fa67c58745c4c704246eace595e147d20792ba5c36b54e640821a3f83688530fdb101d2ed214b47c8f7bce
|
data/.rubocop.yml
CHANGED
data/CHANGELOG.md
CHANGED
data/README.md
CHANGED
@@ -1,24 +1,89 @@
|
|
1
|
-
#
|
1
|
+
# GTFS Stops clustering
|
2
|
+
[![Gem Version](https://badge.fury.io/rb/gtfs_stops_clustering.svg)](https://badge.fury.io/rb/gtfs_stops_clustering)
|
2
3
|
|
3
|
-
|
4
|
+
GTFS Stops Clustering is a Ruby Gem designed to read [GTFS](https://gtfs.org) (General Transit Feed Specification) stops data and create clusters based on the following parameters:
|
4
5
|
|
5
|
-
|
6
|
+
- `GTFS paths` [Required]: array of gtfs zip files paths whose stops will be combined in the clustering algorithm
|
7
|
+
- `Epsilon` [Required]: the maximum distance (in km) between 2 stops for them to be considered neighbors of one another (e.g.: 0.01, 0.5, 2 etc.)
|
8
|
+
- `Min Points` [Required]: the minimum number of neighbors a point needs to have to be considered a core point (e.g.: 3, 5, 10 etc.)
|
9
|
+
- `Names Similarity` [Optional]: Besides geographical proximity, the algorithm also considers the similarity between stop names using techniques like string similarity measures. This enhances the clustering by including stops with similar names within the same cluster (e.g.: all values between 0 and 1. The more the value is in proximity of 1, the more similar the stop names need to be considered points of the same cluster). The default value is 1, so if you want to create clusters based only on stop positions, leave this to 0.
|
10
|
+
- `Stop config file` (CSV file path) [Optional]: This file is specifically designed to handle certain cases where stop names need to be altered or mapped to different names before running the clustering algorithm. Each entry consists of two columns:
|
11
|
+
**stop_name**: This column contains the original name of the stop that requires modification or mapping to another name. **cluster_name**: This column specifies the name to which the original stop name should be changed or mapped during the clustering process.
|
12
|
+
|
13
|
+
It utilizes the [DBSCAN](https://en.wikipedia.org/wiki/DBSCAN) Density-Based algorithm to perform clustering. I based my core algorithm on the gem [Dbscan](https://github.com/matiasinsaurralde/dbscan)
|
14
|
+
|
15
|
+
### Stops config file example
|
16
|
+
|
17
|
+
Here is an example of a stops_config CSV file:
|
18
|
+
|
19
|
+
```csv
|
20
|
+
stop_name,cluster_name
|
21
|
+
Stop Name To Be Changed,Actual Name
|
22
|
+
Amargosa Valley (Demo),Amargosa Valley
|
23
|
+
E Main St / S Irving St (Demo),E Main St / S Irving St
|
24
|
+
```
|
25
|
+
|
26
|
+
In this case, passing this CSV file to the clustering algorithm, **Amargosa Valley (Demo)** will be renamed **Amargarosa Valley**, and so on for all the entries provided. The reason why I needed to implement this feature is simply because I was dealing with bad stops names (typo) provided by default within the GTFS I was working on.
|
27
|
+
|
28
|
+
## Requirements
|
29
|
+
|
30
|
+
It is essential to have a **Redis server instance running locally (on default port 6379)** because the algorithm leverages Redis geospatial queries for efficient spatial operations.
|
31
|
+
The Redis server is utilized to optimize geospatial queries, allowing the clustering algorithm to efficiently process proximity-related computations required during the clustering process.
|
32
|
+
Please ensure that a Redis server is installed and running on your local machine to utilize the gem effectively.
|
6
33
|
|
7
34
|
## Installation
|
8
35
|
|
9
|
-
|
36
|
+
Add this line to your application's Gemfile:
|
10
37
|
|
11
|
-
|
38
|
+
```ruby
|
39
|
+
gem 'gtfs_stops_clustering', '~> 0.1.5'
|
40
|
+
```
|
41
|
+
And run the following command
|
12
42
|
|
13
|
-
|
43
|
+
```bash
|
44
|
+
$ bundle install
|
45
|
+
```
|
14
46
|
|
15
47
|
If bundler is not being used to manage dependencies, install the gem by executing:
|
16
48
|
|
17
|
-
|
49
|
+
```bash
|
50
|
+
$ gem install gtfs_stops_clustering
|
51
|
+
```
|
18
52
|
|
19
53
|
## Usage
|
20
54
|
|
21
|
-
|
55
|
+
```ruby
|
56
|
+
require 'gtfs_stops_clustering'
|
57
|
+
include GtfsStopsClustering
|
58
|
+
|
59
|
+
gtfs_paths = ["path/to/gtfs/zip"]
|
60
|
+
|
61
|
+
clusters = build_clusters(gtfs_paths, 0.3, 1, 0.85)
|
62
|
+
|
63
|
+
clusters.each do |index, cluster|
|
64
|
+
puts index
|
65
|
+
cluster.each do |stop|
|
66
|
+
puts stop.inspect
|
67
|
+
end
|
68
|
+
end
|
69
|
+
```
|
70
|
+
|
71
|
+
In this case, I'm showing the output referred to the GTFS file located in `test/fixtures/sample-feed-2.zip` (which is the sample-feed provided by Google, but changed a bit in order to create "clusterable" stops since they all were too far to be clustered). In this case I omitted the optional parameter `stops config`
|
72
|
+
|
73
|
+
```
|
74
|
+
-1
|
75
|
+
{:stop_id=>"4", :stop_code=>nil, :cluster_name=>nil, :cluster_pos=>[], :stop_name=>"Stagecoach Hotel & Casino (Demo)", :stop_lat=>"36.915682", :stop_lon=>"-116.751677", :parent_station=>nil}
|
76
|
+
{:stop_id=>"6", :stop_code=>nil, :cluster_name=>nil, :cluster_pos=>[], :stop_name=>"Alone stop (sad)", :stop_lat=>"36.914944", :stop_lon=>"-116.761472", :parent_station=>nil}
|
77
|
+
{:stop_id=>"8", :stop_code=>nil, :cluster_name=>nil, :cluster_pos=>[], :stop_name=>"E Main St / S Irving St (Demo)", :stop_lat=>"36.905697", :stop_lon=>"-116.76218", :parent_station=>nil}
|
78
|
+
{:stop_id=>"9", :stop_code=>nil, :cluster_name=>nil, :cluster_pos=>[], :stop_name=>"Amargosa Valley (Demo)", :stop_lat=>"36.641496", :stop_lon=>"-116.40094", :parent_station=>nil}
|
79
|
+
0
|
80
|
+
{:stop_id=>"1", :stop_code=>nil, :cluster_name=>"Awesome Stop Name", :cluster_pos=>[36.425286, -117.133156], :stop_name=>"Awesome stop name 1", :stop_lat=>"36.425288", :stop_lon=>"-117.133162", :parent_station=>nil}
|
81
|
+
{:stop_id=>"5", :stop_code=>nil, :cluster_name=>"Awesome Stop Name", :cluster_pos=>[36.425286, -117.133156], :stop_name=>"Awesome stop name 2", :stop_lat=>"36.425284", :stop_lon=>"-117.133150", :parent_station=>nil}
|
82
|
+
1
|
83
|
+
{:stop_id=>"2", :stop_code=>nil, :cluster_name=>"Nye County Airport", :cluster_pos=>[36.868429, -116.78467699999999], :stop_name=>"Nye County Airport A1", :stop_lat=>"36.868446", :stop_lon=>"-116.784582", :parent_station=>nil}
|
84
|
+
{:stop_id=>"3", :stop_code=>nil, :cluster_name=>"Nye County Airport", :cluster_pos=>[36.868429, -116.78467699999999], :stop_name=>"Nye County Airport A2", :stop_lat=>"36.868417", :stop_lon=>"-116.784352", :parent_station=>nil}
|
85
|
+
{:stop_id=>"7", :stop_code=>nil, :cluster_name=>"Nye County Airport", :cluster_pos=>[36.868429, -116.78467699999999], :stop_name=>"Nye County Airport A5", :stop_lat=>"36.868424", :stop_lon=>"-116.785097", :parent_station=>nil}
|
86
|
+
```
|
22
87
|
|
23
88
|
## Development
|
24
89
|
|
@@ -32,10 +32,6 @@ Gem::Specification.new do |spec|
|
|
32
32
|
spec.executables = spec.files.grep(%r{\Aexe/}) { |f| File.basename(f) }
|
33
33
|
spec.require_paths = ["lib"]
|
34
34
|
|
35
|
-
# spec.files = ["lib/gtfs_stops_clustering.rb", "lib/gtfs_stops_clustering/data_import.rb", "lib/gtfs_stops_clustering/dbscan.rb",
|
36
|
-
# "lib/gtfs_stops_clustering/redis_geodata.rb", "lib/gtfs_stops_clustering/version.rb",
|
37
|
-
# "lib/gtfs_stops_clustering/input_consistency_checks.rb"]
|
38
|
-
|
39
35
|
spec.add_runtime_dependency "csv", "~> 3.2", ">= 3.2.8"
|
40
36
|
spec.add_runtime_dependency "distance_measures", "~> 0.0.6"
|
41
37
|
spec.add_runtime_dependency "geocoder", "~> 1.8", ">= 1.8.2"
|
@@ -4,6 +4,7 @@ require "distance_measures"
|
|
4
4
|
require "text"
|
5
5
|
require "geocoder"
|
6
6
|
require_relative "redis_geodata"
|
7
|
+
require_relative "utils"
|
7
8
|
|
8
9
|
# Array class
|
9
10
|
class Array
|
@@ -50,31 +51,26 @@ module DBSCAN
|
|
50
51
|
|
51
52
|
if neighbors.size >= options[:min_points]
|
52
53
|
current_cluster += 1
|
53
|
-
point
|
54
|
-
|
55
|
-
clusters[current_cluster] = cluster.flatten
|
56
|
-
|
57
|
-
# Get Cluster Name
|
58
|
-
labels = clusters[current_cluster].map { |e| e.label.capitalize }
|
59
|
-
cluster_name = find_cluster_name(labels)
|
60
|
-
|
61
|
-
# Get Cluster Position
|
62
|
-
cluster_pos = find_cluster_position(clusters[current_cluster])
|
63
|
-
|
64
|
-
clusters[current_cluster].each do |e|
|
65
|
-
e.cluster_name = cluster_name
|
66
|
-
e.cluster_pos = cluster_pos
|
67
|
-
end
|
54
|
+
create_cluster(current_cluster, point, neighbors)
|
55
|
+
update_cluster_info(current_cluster)
|
68
56
|
else
|
69
57
|
clusters[-1].push(point)
|
70
58
|
end
|
71
59
|
end
|
72
60
|
end
|
73
61
|
|
74
|
-
def
|
75
|
-
|
76
|
-
|
77
|
-
|
62
|
+
def create_cluster(cluster_index, point, neighbors)
|
63
|
+
point.cluster = cluster_index
|
64
|
+
cluster = [point].push(add_connected(neighbors, cluster_index))
|
65
|
+
@clusters[cluster_index] = cluster.flatten
|
66
|
+
end
|
67
|
+
|
68
|
+
def update_cluster_info(cluster_index)
|
69
|
+
labels = @clusters[cluster_index].map { |e| e.label.capitalize }
|
70
|
+
@clusters[cluster_index].each do |e|
|
71
|
+
e.cluster_name = Utils.find_cluster_name(labels)
|
72
|
+
e.cluster_pos = Utils.find_cluster_position(clusters[cluster_index])
|
73
|
+
end
|
78
74
|
end
|
79
75
|
|
80
76
|
def labeled_results
|
@@ -103,16 +99,10 @@ module DBSCAN
|
|
103
99
|
neighbors = []
|
104
100
|
geosearch_results = geosearch(point.items[1], point.items[0])
|
105
101
|
geosearch_results.each do |neighbor_pos|
|
106
|
-
|
107
|
-
neighbor = @points.find do |elem|
|
108
|
-
elem.items[0] == coordinates[1] &&
|
109
|
-
elem.items[1] == coordinates[0]
|
110
|
-
end
|
102
|
+
neighbor = Utils.find_inmediate_neighbor(neighbor_pos, @points)
|
111
103
|
next unless neighbor
|
112
104
|
|
113
|
-
|
114
|
-
similarity = 1 - string_distance.to_f / [point.label.length, point.label.length].max
|
115
|
-
neighbors.push(neighbor) if similarity > options[:similarity]
|
105
|
+
neighbors.push(neighbor) if Utils.string_similarity(point.label.downcase, neighbor.label.downcase) > options[:similarity]
|
116
106
|
end
|
117
107
|
neighbors
|
118
108
|
end
|
@@ -139,30 +129,8 @@ module DBSCAN
|
|
139
129
|
|
140
130
|
cluster_points
|
141
131
|
end
|
142
|
-
|
143
|
-
def find_cluster_name(labels)
|
144
|
-
words = labels.map { |label| label.strip.split }
|
145
|
-
common_title = ""
|
146
|
-
|
147
|
-
# Loop through each word index starting from the first
|
148
|
-
(0...words.first.length).each do |i|
|
149
|
-
words_at_index = words.map { |word_list| word_list[i] }
|
150
|
-
|
151
|
-
break unless words_at_index.uniq.length == 1
|
152
|
-
|
153
|
-
common_title += " #{words_at_index.first.capitalize}"
|
154
|
-
end
|
155
|
-
|
156
|
-
common_title.strip! ? common_title : labels.first
|
157
|
-
end
|
158
|
-
def find_cluster_position(cluster)
|
159
|
-
total_lat = cluster.map { |e| e.items[0].to_f }.sum
|
160
|
-
total_lon = cluster.map { |e| e.items[1].to_f }.sum
|
161
|
-
avg_lat = total_lat / cluster.size
|
162
|
-
avg_lon = total_lon / cluster.size
|
163
|
-
[avg_lat, avg_lon]
|
164
|
-
end
|
165
132
|
end
|
133
|
+
|
166
134
|
# Point class
|
167
135
|
class Point
|
168
136
|
attr_accessor :items, :cluster, :visited, :label, :cluster_name, :cluster_pos
|
@@ -182,7 +150,7 @@ module DBSCAN
|
|
182
150
|
end
|
183
151
|
end
|
184
152
|
|
185
|
-
def
|
153
|
+
def dbscan(* args)
|
186
154
|
clusterer = Clusterer.new(*args)
|
187
155
|
clusterer.labeled_results
|
188
156
|
end
|
@@ -0,0 +1,41 @@
|
|
1
|
+
# lib/utils.rb
|
2
|
+
|
3
|
+
# Utils class
|
4
|
+
class Utils
|
5
|
+
def self.find_cluster_name(labels)
|
6
|
+
words = labels.map { |label| label.strip.split }
|
7
|
+
common_title = ""
|
8
|
+
|
9
|
+
# Loop through each word index starting from the first
|
10
|
+
(0...words.first.length).each do |i|
|
11
|
+
words_at_index = words.map { |word_list| word_list[i] }
|
12
|
+
|
13
|
+
break unless words_at_index.uniq.length == 1
|
14
|
+
|
15
|
+
common_title += " #{words_at_index.first.capitalize}"
|
16
|
+
end
|
17
|
+
|
18
|
+
common_title.strip! ? common_title : labels.first
|
19
|
+
end
|
20
|
+
|
21
|
+
def self.find_cluster_position(cluster)
|
22
|
+
total_lat = cluster.map { |e| e.items[0].to_f }.sum
|
23
|
+
total_lon = cluster.map { |e| e.items[1].to_f }.sum
|
24
|
+
avg_lat = total_lat / cluster.size
|
25
|
+
avg_lon = total_lon / cluster.size
|
26
|
+
[avg_lat, avg_lon]
|
27
|
+
end
|
28
|
+
|
29
|
+
def self.string_similarity(str1, str2)
|
30
|
+
string_distance = Text::Levenshtein.distance(str1.downcase, str2.downcase)
|
31
|
+
1 - string_distance.to_f / [str1.length, str2.length].max
|
32
|
+
end
|
33
|
+
|
34
|
+
def self.find_inmediate_neighbor(neighbor_pos, points)
|
35
|
+
coordinates_split = neighbor_pos.split(",")
|
36
|
+
points.find do |elem|
|
37
|
+
elem.items[0] == coordinates_split[1] &&
|
38
|
+
elem.items[1] == coordinates_split[0]
|
39
|
+
end
|
40
|
+
end
|
41
|
+
end
|
@@ -37,19 +37,17 @@ module GtfsStopsClustering
|
|
37
37
|
def create_stops_merged
|
38
38
|
gtfs_stops = []
|
39
39
|
@gtfs_paths.each do |gtfs_path|
|
40
|
-
|
41
|
-
|
42
|
-
|
43
|
-
|
44
|
-
raise IOError "Error occurred while building GTFS from #{gtfs_path}: #{e.message}"
|
45
|
-
end
|
40
|
+
gtfs = GTFS::Source.build(gtfs_path)
|
41
|
+
gtfs_stops << gtfs.stops
|
42
|
+
rescue GTFS::InvalidSourceException => e
|
43
|
+
raise IOError "Error occurred while building GTFS from #{gtfs_path}: #{e.message}"
|
46
44
|
end
|
47
45
|
gtfs_stops.flatten
|
48
46
|
end
|
49
47
|
|
50
48
|
def clusterize_stops
|
51
49
|
data = import_stops_data(@gtfs_stops, @stops_config_path)
|
52
|
-
@clusters =
|
50
|
+
@clusters = dbscan(data[:stops_data],
|
53
51
|
data[:stops_redis_geodata],
|
54
52
|
epsilon: @epsilon,
|
55
53
|
min_points: @min_points,
|
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: gtfs_stops_clustering
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.1.
|
4
|
+
version: 0.1.6
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Pietro Visconti
|
8
8
|
autorequire:
|
9
9
|
bindir: exe
|
10
10
|
cert_chain: []
|
11
|
-
date: 2023-12-
|
11
|
+
date: 2023-12-19 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: csv
|
@@ -166,6 +166,7 @@ files:
|
|
166
166
|
- lib/gtfs_stops_clustering/dbscan.rb
|
167
167
|
- lib/gtfs_stops_clustering/input_consistency_checks.rb
|
168
168
|
- lib/gtfs_stops_clustering/redis_geodata.rb
|
169
|
+
- lib/gtfs_stops_clustering/utils.rb
|
169
170
|
- lib/gtfs_stops_clustering/version.rb
|
170
171
|
- lib/stops_corner_cases.txt
|
171
172
|
- sig/gtfs_stops_clustering.rbs
|