gtfs_df 0.1.1 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: d799dcd02b51045c948123e3b6bc64c41b5505279a583d98350377e6d06b4205
-  data.tar.gz: 57ff89fd953842d68a5d4c7424a91b54484415a0b093bbe0d0b0411b025c6ef4
+  metadata.gz: 128899724d4613f0170fa601cb0c4cdc9f88a51b2fcd6c06108e7cc9acf9201c
+  data.tar.gz: 14c30a678bc9a623233c6631be1c7d497fff6bae4b30abd88406c297bc76190b
 SHA512:
-  metadata.gz: 13b027c13cfec0a3493662cb048dd20f1b44aac778eb0e2d124ad873ae3649b1622758550399d9e457065401bfa04dbbb71e5e36da467599f8aac9675b2431ad
-  data.tar.gz: 9d248e0fe06e69af6a5d86114e77e10583dcbf9956858411cc6ed2cc2d03ebf4ce2047d3e8bbc1eb5eb5530a2e544bf8cc21c4d0184fe331558008fce64a055a
+  metadata.gz: b7e3d1ac85953b995b82d0abbb9f872cf5307564ba67949320d607190ef5a7ff7f8e34df241cc584d7b36fc8db9c4159a7bf74ef1ce6d6868e4a3155784a9ec9
+  data.tar.gz: a1067ff3a0912b3eb56e3348e707c777563b50f440084a5b8c83b2d3da6ebf8a8a7ed49093a756d8871f1d3d849906daa844a0a66c04e56f97b03bfe2a106d79
data/.conform.yaml CHANGED
@@ -7,7 +7,7 @@ policies:
   case: lower
   invalidLastCharacters: .
   body:
-    required: true
+    required: false
   dco: false
   spellcheck:
     locale: US
data/CHANGELOG.md CHANGED
@@ -1,3 +1,20 @@
+## [Unreleased]
+
+## [0.3.0] - 2025-12-04
+
+### Added
+
+- keep parent stations linked to used stops
+
+### Fixed
+
+- handle null values
+- update lock on version bump
+
+### Maintenance
+
+- reuse load_from_dir logic in reader
+- clean up unused method + better comments
 ## [0.1.0] - 2025-11-10
 
 - Initial release
data/README.md CHANGED
@@ -8,18 +8,16 @@ This project was created to bring the power of [partridge] to ruby.
 
 ## Installation
 
-TODO: Replace `UPDATE_WITH_YOUR_GEM_NAME_IMMEDIATELY_AFTER_RELEASE_TO_RUBYGEMS_ORG` with your gem name right after releasing it to RubyGems.org. Please do not do it earlier due to security reasons. Alternatively, replace this section with instructions to install your gem from git if you don't plan to release to RubyGems.org.
-
 Install the gem and add to the application's Gemfile by executing:
 
 ```bash
-bundle add UPDATE_WITH_YOUR_GEM_NAME_IMMEDIATELY_AFTER_RELEASE_TO_RUBYGEMS_ORG
+bundle add gtfs_df
 ```
 
 If bundler is not being used to manage dependencies, install the gem by executing:
 
 ```bash
-gem install UPDATE_WITH_YOUR_GEM_NAME_IMMEDIATELY_AFTER_RELEASE_TO_RUBYGEMS_ORG
+gem install gtfs_df
 ```
 
 ## Usage
@@ -32,6 +30,9 @@ require 'gtfs_df'
 # Load from a zip file
 feed = GtfsDf::Reader.load_from_zip('path/to/gtfs.zip')
 
+# Or, load from a directory
+feed = GtfsDf::Reader.load_from_dir('path/to/gtfs_dir')
+
 # Access dataframes for each GTFS file
 puts feed.agency.head
 puts feed.routes.head
@@ -85,11 +86,25 @@ After checking out the repo, run `bin/setup` to install dependencies. Then, run
 
 To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and the created tag, and push the `.gem` file to [rubygems.org](https://rubygems.org).
 
+## Release process
+
+1. `bin/bump-version`
+
+   - Bump the version in `lib/gtfs_df/version.rb`
+   - Update the `CHANGELOG.md` using the git log since the last version
+   - Create and push a new release branch with those changes
+   - Create a PR for that release
+
+2. `bin/create-tag`
+
+   Creates and pushes the git tag for the release. That triggers the GitHub Action `.github/workflows/publish.yml` to publish to RubyGems.
+
 ## TODO
 
 - [ ] Time parsing
-  Just like partridge, we should parse Time as seconds since midnight. There's a draft in `lib/gtfs_df/utils.rb` but it's not used anywhere.
-  I haven't figured out how to properly implement with Polars.
+
+  Just like partridge, we should parse Time as seconds since midnight. There's a draft in `lib/gtfs_df/utils.rb` but it's not used anywhere.
+  I haven't figured out how to properly implement that with Polars.
 
 ## Contributing
 
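`GtfsDf::Feed#filter` (shown later in the `feed.rb` hunk) takes a view hash mapping file names to column filters, where each filter value is an array (membership), a callable (predicate), or a scalar (equality). A minimal plain-Ruby sketch of that three-way dispatch, using hypothetical hash rows instead of Polars dataframes:

```ruby
# Hypothetical in-memory rows standing in for a Polars DataFrame
rows = [
  {"route_id" => "1", "agency_id" => "A"},
  {"route_id" => "2", "agency_id" => "B"},
  {"route_id" => "3", "agency_id" => "A"}
]

# Mirrors the value dispatch used by Feed#filter:
# Array => membership, callable => predicate, anything else => equality
def apply_filter(rows, col, val)
  rows.select do |row|
    if val.is_a?(Array)
      val.include?(row[col])
    elsif val.respond_to?(:call)
      val.call(row[col])
    else
      row[col] == val
    end
  end
end

by_equality = apply_filter(rows, "agency_id", "A")
by_membership = apply_filter(rows, "route_id", %w[1 2])
by_predicate = apply_filter(rows, "route_id", ->(v) { v > "1" })
```

In the gem itself, each branch wraps the value in a `Polars.col(col)` expression; the control flow is the same.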
@@ -1,7 +1,7 @@
 PATH
   remote: ../..
   specs:
-    gtfs_df (0.1.0)
+    gtfs_df (0.1.1)
       networkx (~> 0.4)
       polars-df (~> 0.22)
       rubyzip (~> 2.3)
@@ -40,12 +40,19 @@ end
 
 agency_ids.each do |agency_id|
   Whirly.start do
-    Whirly.status = "-> #{agency_id} filtering..."
     output_path = File.join(output_dir, "#{agency_id}.zip")
+
+    start_time = Time.now
+
+    Whirly.status = "-> #{agency_id} filtering..."
     filtered_feed = feed.filter("agency" => {"agency_id" => agency_id})
+
     Whirly.status = "-> #{agency_id} writing..."
     GtfsDf::Writer.write_to_zip(filtered_feed, output_path)
-    Whirly.status = "-> #{agency_id}"
+
+    elapsed = Time.now - start_time
+
+    Whirly.status = "-> #{agency_id}.zip (#{elapsed.round(2)}s)"
   end
 end
 
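The script change above adds per-agency wall-clock timing to the status line. The timing pattern, stripped of Whirly and GTFS specifics (method name and label are illustrative):

```ruby
# Generic sketch: time a block and format an elapsed-seconds status string,
# as the script does with Time.now and elapsed.round(2)
def timed_status(label)
  start_time = Time.now
  yield
  elapsed = Time.now - start_time
  "-> #{label} (#{elapsed.round(2)}s)"
end

status = timed_status("SFMTA.zip") { sleep 0.01 }
```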
@@ -10,7 +10,11 @@ module GtfsDf
       if input.is_a?(Polars::DataFrame)
         input
       elsif input.is_a?(String)
-        Polars.read_csv(input, dtypes: self.class::SCHEMA)
+        # We need to account for extra columns due to: https://github.com/ankane/ruby-polars/issues/125
+        all_columns = Polars.scan_csv(input).columns
+        default_schema = all_columns.map { |c| [c, Polars::String] }.to_h
+        dtypes = default_schema.merge(self.class::SCHEMA)
+        Polars.read_csv(input, null_values: [""], dtypes:)
       elsif input.is_a?(Array)
         head, *body = input
         df_input = body.each_with_object({}) do |row, acc|
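The hunk above scans the CSV header, defaults every discovered column to String, and overlays the known schema, so feeds with unexpected extra columns no longer break `read_csv`. The hash logic in isolation, with hypothetical column names and dtype symbols standing in for Polars types:

```ruby
# Hypothetical columns scanned from a CSV header; "vendor_note" is an
# extra column absent from the known schema
all_columns = %w[stop_id stop_name vendor_note]

# Stand-in for self.class::SCHEMA
schema = {"stop_id" => :str, "stop_name" => :str}

# Default everything to a string dtype, then overlay known dtypes
default_schema = all_columns.map { |c| [c, :str] }.to_h
dtypes = default_schema.merge(schema)
```

Because `merge` keeps every key from the scanned header, the extra column gets a dtype instead of causing a schema mismatch.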
data/lib/gtfs_df/feed.rb CHANGED
@@ -36,7 +36,7 @@ module GtfsDf
       booking_rules
     ].freeze
 
-    attr_reader(*GTFS_FILES)
+    attr_reader(*GTFS_FILES, :graph)
 
     # Initialize with a hash of DataFrames
     REQUIRED_GTFS_FILES = %w[agency stops routes trips stop_times].freeze
@@ -53,6 +53,8 @@ module GtfsDf
         end.join(", ")}"
       end
 
+      @graph = GtfsDf::Graph.build
+
       GTFS_FILES.each do |file|
         df = data[file]
         schema_class_name = file.split("_").map(&:capitalize).join
@@ -68,85 +70,123 @@ module GtfsDf
       end
     end
 
-    # Load from a directory of GTFS CSV files
-    def self.load_from_dir(dir)
-      data = {}
-      GTFS_FILES.each do |file|
-        path = File.join(dir, "#{file}.txt")
-        next unless File.exist?(path)
-
-        schema_class_name = file.split("_").map(&:capitalize).join
-
-        data[file] = GtfsDf::Schema.const_get(schema_class_name).new(path)
-      end
-      new(data)
-    end
-
     # Filter the feed using a view hash
     # Example view: { 'routes' => { 'route_id' => '123' }, 'trips' => { 'service_id' => 'A' } }
     def filter(view)
       filtered = {}
-      graph = GtfsDf::Graph.build
-      # Step 1: Apply view filters
+
       GTFS_FILES.each do |file|
         df = send(file)
         next unless df
 
-        filters = view[file]
-        if filters && !filters.empty?
-          filters.each do |col, val|
-            df = if val.is_a?(Array)
-              df.filter(Polars.col(col).is_in(val))
-            elsif val.respond_to?(:call)
-              df.filter(val.call(Polars.col(col)))
-            else
-              df.filter(Polars.col(col).eq(val))
-            end
-          end
-        end
         filtered[file] = df
       end
-      # Step 2: Cascade filters following the directed edges
-      # An edge from parent->child means: filter child based on valid parent IDs
-      changed = true
-      while changed
-        changed = false
-        GTFS_FILES.each do |parent_file|
-          parent_df = filtered[parent_file]
-          next unless parent_df && parent_df.height > 0
-
-          # For each outgoing edge from parent_file to child_file
-          graph.adj[parent_file]&.each do |child_file, attrs|
-            child_df = filtered[child_file]
-            next unless child_df && child_df.height > 0
-
-            attrs[:dependencies].each do |dep|
-              parent_col = dep[parent_file]
-              child_col = dep[child_file]
-
-              next unless parent_col && child_col &&
-                parent_df.columns.include?(parent_col) && child_df.columns.include?(child_col)
-
-              # Get valid values from parent
-              valid_values = parent_df[parent_col].to_a.uniq.compact
-              next if valid_values.empty?
-
-              # Filter child to only include rows that reference valid parent values
-              before = child_df.height
-              child_df = child_df.filter(Polars.col(child_col).is_in(valid_values))
-
-              if child_df.height < before
-                filtered[child_file] = child_df
-                changed = true
-              end
-            end
-          end
+
+      # Trips are the atomic unit of GTFS; we generate a new view based on the
+      # set of trips that would be included by each individual filter, then
+      # cascade changes from that view in order to retain referential integrity
+      trip_ids = nil
+
+      view.each do |file, filters|
+        new_filtered = filter!(file, filters, filtered.dup)
+        trip_ids = if trip_ids.nil?
+          new_filtered["trips"]["trip_id"]
+        else
+          trip_ids & new_filtered["trips"]["trip_id"]
         end
       end
 
+      if trip_ids
+        filtered = filter!("trips", {"trip_id" => trip_ids.to_a}, filtered)
+      end
+
       # Remove files that are empty, but keep required files even if empty
-      filtered.delete_if { |file, df| (!df || df.height == 0) && !REQUIRED_GTFS_FILES.include?(file) }
+      filtered.delete_if do |file, df|
+        is_required_file = REQUIRED_GTFS_FILES.include?(file) ||
+          file == "calendar" && !filtered["calendar_dates"] ||
+          file == "calendar_dates" && !filtered["calendar"]
+
+        (!df || df.height == 0) && !is_required_file
+      end
       self.class.new(filtered)
     end
+
+    private
+
+    def filter!(file, filters, filtered)
+      unless filters.empty?
+        df = filtered[file]
+
+        filters.each do |col, val|
+          df = if val.is_a?(Array)
+            df.filter(Polars.col(col).is_in(val))
+          elsif val.respond_to?(:call)
+            df.filter(val.call(Polars.col(col)))
+          else
+            df.filter(Polars.col(col).eq(val))
+          end
+        end
+
+        filtered[file] = df
+
+        prune!(file, filtered)
+      end
+
+      filtered
+    end
+
+    # Traverses the graph to prune unreferenced entities from child dataframes
+    # based on parent relationships. See GtfsDf::Graph::STOP_NODES
+    def prune!(root, filtered)
+      graph.each_bfs_edge(root) do |parent_node_id, child_node_id|
+        parent_node = Graph::NODES[parent_node_id]
+        child_node = Graph::NODES[child_node_id]
+        parent_df = filtered[parent_node.fetch(:file)]
+        next unless parent_df
+
+        child_df = filtered[child_node.fetch(:file)]
+        # Certain nodes are pre-filtered because they reference only
+        # a piece of the dataframe
+        filter_attrs = child_node[:filter_attrs]
+        if filter_attrs && child_df.columns.include?(filter_attrs.fetch(:filter_col))
+          filter = filter_attrs.fetch(:filter)
+          # Temporarily remove rows that do not match node filter criteria to process them
+          # separately (e.g., when filtering stops, parent stations that should be preserved
+          # regardless of direct references)
+          saved_vals = child_df.filter(filter.is_not)
+          child_df = child_df.filter(filter)
+        end
+        next unless child_df && child_df.height > 0
+
+        attrs = graph.get_edge_data(parent_node_id, child_node_id)
+
+        attrs[:dependencies].each do |dep|
+          parent_col = dep[parent_node_id]
+          child_col = dep[child_node_id]
+
+          next unless parent_col && child_col &&
+            parent_df.columns.include?(parent_col) && child_df.columns.include?(child_col)
+
+          # Get valid values from the parent
+          valid_values = parent_df[parent_col].to_a.uniq.compact
+
+          # Filter the child to only include rows that reference valid parent values
+          before = child_df.height
+          child_df = child_df.filter(
+            Polars.col(child_col).is_in(valid_values)
+          )
+          changed = child_df.height < before
+
+          # If we removed a part of the child_df earlier, concat it back on
+          if saved_vals
+            child_df = Polars.concat([child_df, saved_vals], how: "vertical")
+          end
+
+          if changed
+            filtered[child_node.fetch(:file)] = child_df
+          end
+        end
+      end
+    end
   end
 end
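The rewritten `filter` applies each view entry independently against a copy of the feed and intersects the resulting `trip_id` sets, so the final feed keeps only trips that survive every filter. The intersection logic with plain arrays (trip IDs are illustrative):

```ruby
# trip_id sets produced by applying each view filter independently (hypothetical)
per_filter_trip_ids = [
  %w[t1 t2 t3],  # e.g. trips surviving a routes filter
  %w[t2 t3 t4]   # e.g. trips surviving a calendar filter
]

# Seed with the first set, then intersect with each subsequent one,
# mirroring the trip_ids accumulation in Feed#filter
trip_ids = nil
per_filter_trip_ids.each do |ids|
  trip_ids = trip_ids.nil? ? ids : trip_ids & ids
end
```

The gem then re-filters the whole feed once by that final `trip_id` set, which is what lets the pruning cascade run from a single, consistent root.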
data/lib/gtfs_df/graph.rb CHANGED
@@ -2,17 +2,50 @@
 
 module GtfsDf
   class Graph
+    FILES = %w[
+      agency routes trips stop_times calendar calendar_dates shapes transfers frequencies fare_attributes fare_rules
+      fare_leg_join_rules fare_transfer_rules areas networks route_networks location_groups location_group_stops booking_rules
+      stop_areas fare_leg_rules
+    ]
+
+    STANDARD_FILE_NODES = FILES.map do |file|
+      [file, {id: file, file: file, filter: nil}]
+    end.to_h.freeze
+
+    # Separate node definitions for stops and parent stations to handle the self-referential
+    # relationship in stops.txt where stops reference parent stations via the parent_station column.
+    # This allows filtering to preserve parent stations when their child stops are referenced.
+    STOP_NODES = {
+      "stops" => {
+        id: "stops",
+        file: "stops",
+        filter_attrs: {
+          filter_col: "location_type",
+          filter: Polars.col("location_type").is_in(
+            Schema::EnumValues::STOP_LOCATION_TYPES.map(&:first)
+          ) | Polars.col("location_type").is_null
+        }
+      },
+      "parent_stations" => {
+        id: "parent_stations",
+        file: "stops",
+        filter_attrs: {
+          filter_col: "location_type",
+          filter: Polars.col("location_type").is_in(
+            Schema::EnumValues::STATION_LOCATION_TYPES.map(&:first)
+          ) & Polars.col("location_type").is_not_null
+        }
+      }
+    }.freeze
+
+    NODES = STANDARD_FILE_NODES.merge(STOP_NODES).freeze
+
     # Returns a directed graph of GTFS file dependencies
     def self.build
-      g = NetworkX::DiGraph.new
-      # Nodes: GTFS files
-      files = %w[
-        agency routes trips stop_times stops calendar calendar_dates shapes transfers frequencies fare_attributes fare_rules
-        fare_leg_join_rules fare_transfer_rules areas networks route_networks location_groups location_group_stops booking_rules
-      ]
-      files.each { |f| g.add_node(f) }
+      g = NetworkX::Graph.new
+      NODES.keys.each { |node| g.add_node(node) }
 
-      # Edges: dependencies
+      # TODO: Add fare_rules -> stops + test
       edges = [
         ["agency", "routes", {dependencies: [
           {"agency" => "agency_id", "routes" => "agency_id"}
@@ -33,6 +66,10 @@ module GtfsDf
         ["stop_times", "stops", {dependencies: [
           {"stop_times" => "stop_id", "stops" => "stop_id"}
         ]}],
+        # Self-referential edge: stops can reference parent stations (location_type=1)
+        ["stops", "parent_stations", {dependencies: [
+          {"stops" => "parent_station", "parent_stations" => "stop_id"}
+        ]}],
         ["stops", "transfers", {dependencies: [
           {"stops" => "stop_id", "transfers" => "from_stop_id"},
           {"stops" => "stop_id", "transfers" => "to_stop_id"}
@@ -116,9 +153,6 @@ module GtfsDf
         ["booking_rules", "stop_times", {dependencies: [
           {"booking_rules" => "booking_rule_id", "stop_times" => "pickup_booking_rule_id"},
           {"booking_rules" => "booking_rule_id", "stop_times" => "drop_off_booking_rule_id"}
-        ]}],
-        ["stops", "booking_rules", {dependencies: [
-          {"stops" => "stop_id", "booking_rules" => "stop_id"}
         ]}]
       ]
 
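`Feed#prune!` walks this dependency graph via `graph.each_bfs_edge(root)`, visiting edges breadth-first from the filtered file so each parent's surviving IDs constrain its children before grandchildren are touched. A minimal sketch of such a traversal over a directed adjacency hash (node names illustrative, and this is not the networkx gem's API):

```ruby
require "set"

# Tiny adjacency map standing in for the dependency graph
ADJ = {
  "trips" => %w[stop_times],
  "stop_times" => %w[stops],
  "stops" => %w[parent_stations]
}.freeze

# Yield each (parent, child) edge in breadth-first order from root
def each_bfs_edge(adj, root)
  visited = Set[root]
  queue = [root]
  until queue.empty?
    parent = queue.shift
    (adj[parent] || []).each do |child|
      yield parent, child
      next if visited.include?(child)
      visited << child
      queue << child
    end
  end
end

edges = []
each_bfs_edge(ADJ, "trips") { |p, c| edges << [p, c] }
```

The BFS order is what makes a single pass sufficient here, replacing the old `while changed` fixed-point loop.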
@@ -4,25 +4,39 @@ module GtfsDf
   class Reader
     # Loads a GTFS zip file and returns a Feed
     def self.load_from_zip(zip_path)
-      data = {}
+      data = nil
+
       Dir.mktmpdir do |tmpdir|
         Zip::File.open(zip_path) do |zip_file|
           zip_file.each do |entry|
             next unless entry.file?
+            out_path = File.join(tmpdir, entry.name)
+            entry.extract(out_path)
+          end
+        end
 
-            GtfsDf::Feed::GTFS_FILES.each do |file|
-              next unless entry.name == "#{file}.txt"
+        data = load_from_dir(tmpdir)
+      end
 
-              out_path = File.join(tmpdir, entry.name)
-              entry.extract(out_path)
-              schema_class_name = file.split("_").map(&:capitalize).join
+      data
+    end
 
-              data[file] = GtfsDf::Schema.const_get(schema_class_name).new(out_path).df
-            end
-          end
-        end
+    # Loads a GTFS dir and returns a Feed
+    def self.load_from_dir(dir_path)
+      data = {}
+      GtfsDf::Feed::GTFS_FILES.each do |gtfs_file|
+        path = File.join(dir_path, "#{gtfs_file}.txt")
+        next unless File.exist?(path)
+
+        data[gtfs_file] = data_frame(gtfs_file, path)
       end
+
       GtfsDf::Feed.new(data)
     end
+
+    private_class_method def self.data_frame(gtfs_file, path)
+      schema_class_name = gtfs_file.split("_").map(&:capitalize).join
+      GtfsDf::Schema.const_get(schema_class_name).new(path).df
+    end
   end
 end
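The extracted `data_frame` helper derives the schema class constant from a GTFS file name by camel-casing its underscore-separated parts. That mapping in isolation:

```ruby
# "stop_times" -> "StopTimes", which the reader then resolves via
# GtfsDf::Schema.const_get
def schema_class_name(gtfs_file)
  gtfs_file.split("_").map(&:capitalize).join
end
```

This is the same derivation `Feed#initialize` uses, so sharing it through `load_from_dir` keeps zip and directory loading consistent.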
@@ -82,13 +82,16 @@ module GtfsDf
 
       # stops.txt
       # location_type: Type of location
-      LOCATION_TYPE = [
+      STOP_LOCATION_TYPES = [
         ["0", "Stop or platform"],
-        ["1", "Station"],
         ["2", "Entrance/Exit"],
         ["3", "Generic Node"],
         ["4", "Boarding Area"]
       ]
+      STATION_LOCATION_TYPES = [
+        ["1", "Station"]
+      ]
+      LOCATION_TYPE = STOP_LOCATION_TYPES + STATION_LOCATION_TYPES
 
       # wheelchair_boarding: Indicates wheelchair boarding possibility
       WHEELCHAIR_BOARDING = [
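Splitting `LOCATION_TYPE` lets the graph treat stations (`location_type=1`) as a node distinct from platform-level stops. A plain-Ruby sketch of partitioning stops.txt rows the way the `stops`/`parent_stations` node filters do, where a nil `location_type` counts as a stop (rows are hypothetical):

```ruby
STOP_LOCATION_TYPES = [
  ["0", "Stop or platform"],
  ["2", "Entrance/Exit"],
  ["3", "Generic Node"],
  ["4", "Boarding Area"]
]
STATION_LOCATION_TYPES = [["1", "Station"]]

stop_codes = STOP_LOCATION_TYPES.map(&:first)

# Hypothetical stops.txt rows
rows = [
  {"stop_id" => "p1", "location_type" => "0"},
  {"stop_id" => "st1", "location_type" => "1"},
  {"stop_id" => "p2", "location_type" => nil}
]

# Stops: code in STOP_LOCATION_TYPES or nil; stations: everything else
stops, stations = rows.partition do |r|
  r["location_type"].nil? || stop_codes.include?(r["location_type"])
end
```

In the gem the same split is expressed as Polars expressions (`is_in` combined with `is_null`) in `Graph::STOP_NODES`.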
@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 
 module GtfsDf
-  VERSION = "0.1.1"
+  VERSION = "0.3.0"
 end
metadata CHANGED
@@ -1,13 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: gtfs_df
 version: !ruby/object:Gem::Version
-  version: 0.1.1
+  version: 0.3.0
 platform: ruby
 authors:
 - David Mejorado
 bindir: exe
 cert_chain: []
-date: 1980-01-01 00:00:00.000000000 Z
+date: 1980-01-02 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: networkx