gtfs_df 0.7.0 → 0.8.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +24 -0
- data/README.md +43 -13
- data/examples/split-by-agency/Gemfile.lock +9 -7
- data/lib/gtfs_df/base_gtfs_table.rb +4 -0
- data/lib/gtfs_df/feed.rb +18 -5
- data/lib/gtfs_df/graph.rb +18 -8
- data/lib/gtfs_df/reader.rb +1 -2
- data/lib/gtfs_df/version.rb +1 -1
- metadata +10 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 27b47041852a25bf7c06f1553a1138f0f1e91fdbfa4619535374c749a561be97
+  data.tar.gz: e474c040a3fed6cefc3a36163d90133a3f44be3a91d0701854e36b7ba078fee5
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: a6a306fd0e248619dbe518b3a573ca86c740bca0bcddeb86e014160aeabdd01b42345509db8b65b469d32bebd3767b1f34bd50ad99e0a602f7066a4859c346bc
+  data.tar.gz: dd02fc068a03da457beaa377441dd1fbe8f70c779bde47dfee7911b513743827a5fb0ac708ef793dec3bb4bab45110b54a690898cc2f6c19d6f9b77d85963f42
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,26 @@
+## [0.8.0] - 2026-01-09
+
+### 🐛 Bug Fixes
+
+- Ignore extra newlines when parsing csv
+- Bump minimum ruby version to 3.2.0
+- Fix fare_attributes filtering
+- Fix exceptions edge case
+- Replace dynamic graph traversal with bidirectional graph option
+
+### 📚 Documentation
+
+- Document dev environment
+- Clarify the actions made by the bump-version script
+- Update example transitive dependencies
+
+### ⚙️ Miscellaneous Tasks
+
+- Reduce the test run frequency
+- Update dependabot schedule
+- Consolidate test fixtures
+- Add test for additional fares case
+- Update readme
 ## [0.7.0] - 2025-12-30
 
 ### 🚀 Features
@@ -22,6 +45,7 @@
 ### ⚙️ Miscellaneous Tasks
 
 - Include the util helpers in the console and the test spec
+- Bump version to 0.7.0
 ## [0.6.2] - 2025-12-15
 
 ### 🐛 Bug Fixes
data/README.md
CHANGED
@@ -35,6 +35,9 @@ feed = GtfsDf::Reader.load_from_zip('path/to/gtfs.zip')
 # Or, load from a directory
 feed = GtfsDf::Reader.load_from_dir('path/to/gtfs_dir')
 
+# Parse times as seconds since midnight instead of string
+feed = GtfsDf::Reader.load_from_dir('path/to/gtfs_dir', parse_times: true)
+
 # Access dataframes for each GTFS file
 puts feed.agency.head
 puts feed.routes.head
@@ -71,11 +74,25 @@ When you filter by a field, the library automatically:
 
 For example, filtering by `agency_id` will automatically filter routes, trips, stop_times, and stops to only include data for that agency.
 
+By default gtfs_df treats trips as the atomic unit of GTFS. Therefore, if we
+filter to one stop referenced by TripA, we will preserve _all stops_ referenced
+by TripA.
+
+To avoid this behavior, you can pass the `filter_only_children` param. In this case, only the children of the specified filter will be pruned and trip integrity will not be maintained. In the below example, stop 1 and related stop_times will be pruned.
+
+```ruby
+filtered_feed = feed.filter({ 'stop' => { 'stop_id' => ['1'] } }, filter_only_children: true)
+```
+
+
 ### Writing filtered feeds
 
 ```ruby
 # Write to a new zip file
 GtfsDf::Writer.write_to_zip(filtered_feed, 'output/filtered_gtfs.zip')
+
+# Write to a directory
+GtfsDf::Writer.write_to_dir(filtered_feed, 'output/filtered_gtfs')
 ```
 
 ### Example: Split feed by agency
@@ -84,30 +101,41 @@ See [examples/split-by-agency](examples/split-by-agency) for a complete example
 
 ## Development
 
-
+### Environment
+
+This project manages its development environment with nix, specifically [devenv].
+
+After checking out the repo:
 
-
+- Install devenv: https://devenv.sh/getting-started/
+
+- To enable the environment you can either:
+- Use [direnv] to enable the environment as soon as you enter the project's path.
+- Enable it only when you needed by running: `devenv shell`
+
+- Run `bin/setup` to install the gem dependencies.
+
+### Tests
+
+Run `rake spec` to run the tests.
+
+### REPL
+
+You can also run `bin/console` for an interactive prompt that will allow you to experiment.
 
 ## Release process
 
 1. `bin/bump-version`
 
--
--
--
--
+- Bumps the version in `lib/gtfs_df/version.rb`
+- Updates the `CHANGELOG.md` using the git log since the last version
+- Creates and push a new release branch with those changes
+- Creates a PR for that release
 
 2. `bin/create-tag`
 
 Creates and pushes the git tag for the release. That will trigger the GitHub action: `.github/workflows/publish.yml` to publish to RubyGems.
 
-## TODO
-
-- [ ] Time parsing
-
-Just like partridge, we should parse Time as seconds since midnight. There's a draft in `lib/gtfs_df/utils.rb` but it's not used anywhere.
-I haven't figured out how to properly implement that with Polars.
-
 ## Contributing
 
 Bug reports and pull requests are welcome on GitHub at https://github.com/davidmh/ruby-gtfs_df.
@@ -120,3 +148,5 @@ The gem is available as open source under the terms of the [MIT License](https:/
 [Polars]: https://pola.rs/
 [ruby-polars]: https://github.com/ankane/ruby-polars
 [partridge]: https://github.com/remix/partridge
+[devenv]: https://devenv.sh
+[direnv]: https://direnv.net
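The README change documents `parse_times: true` as returning times in seconds since midnight. As a rough illustration of that conversion only (this is not the gem's implementation, and the helper name `seconds_since_midnight` is hypothetical), a plain-Ruby sketch could look like the following. GTFS allows hour values above 24 for service that runs past midnight, so the sketch does not wrap around.

```ruby
# Hypothetical helper, not part of gtfs_df: converts a GTFS "HH:MM:SS"
# string into the number of seconds since midnight.
def seconds_since_midnight(time_string)
  return nil if time_string.nil? || time_string.strip.empty?

  # Integer(part, 10) forces base 10 so "08" is not read as invalid octal.
  hours, minutes, seconds = time_string.strip.split(":").map { |part| Integer(part, 10) }
  hours * 3600 + minutes * 60 + seconds
end

seconds_since_midnight("08:30:00") # 30600
seconds_since_midnight("25:10:00") # 90600, a trip running past midnight
```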
data/examples/split-by-agency/Gemfile.lock
CHANGED
@@ -1,20 +1,21 @@
 PATH
   remote: ../..
   specs:
-    gtfs_df (0.
+    gtfs_df (0.8.0)
       networkx (~> 0.4)
       polars-df (~> 0.22)
-      rubyzip (
+      rubyzip (>= 2.3, < 4.0)
 
 GEM
   remote: https://gem.coop/
   specs:
-    bigdecimal (
+    bigdecimal (4.0.1)
+    json (2.18.0)
     matrix (0.4.3)
     networkx (0.4.0)
       matrix (~> 0.4)
       rb_heap (~> 1.0)
-    optparse (0.8.
+    optparse (0.8.1)
     polars-df (0.23.0-aarch64-linux)
       bigdecimal
     polars-df (0.23.0-aarch64-linux-musl)
@@ -28,11 +29,12 @@ GEM
     polars-df (0.23.0-x86_64-linux-musl)
       bigdecimal
     rb_heap (1.1.0)
-    rubyzip (2.
+    rubyzip (3.2.2)
     unicode-display_width (3.2.0)
       unicode-emoji (~> 4.1)
-    unicode-emoji (4.
-    whirly (0.
+    unicode-emoji (4.2.0)
+    whirly (0.4.0)
+      json
       unicode-display_width (>= 1.1)
 
 PLATFORMS
data/lib/gtfs_df/base_gtfs_table.rb
CHANGED
@@ -15,6 +15,10 @@ module GtfsDf
       df = Polars.read_csv(input, infer_schema_length: 0, encoding: "utf8-lossy")
         .rename(->(col) { col.strip })
 
+      # Strip out empty lines. Unfortunately read_csv does not support the drop_empty_rows
+      # option right now.
+      df = df.filter(Polars.all_horizontal(Polars.all.is_null).is_not)
+
       dtypes = self.class::SCHEMA.slice(*df.columns)
       df
         .with_columns(dtypes.keys.map do |col|
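The filter added in this hunk keeps a row unless every column in it is null. The same predicate can be sketched without Polars over plain Ruby hashes. This is an analogue for illustration, not the Polars call itself; the name `drop_empty_rows` and the sample rows are made up.

```ruby
# Hypothetical plain-Ruby analogue of the Polars filter above:
# keep a row unless all of its values are nil.
def drop_empty_rows(rows)
  rows.reject { |row| row.values.all?(&:nil?) }
end

rows = [
  {"stop_id" => "1", "stop_name" => "Main St"},
  {"stop_id" => nil, "stop_name" => nil}, # what a blank CSV line parses into
  {"stop_id" => "2", "stop_name" => nil}  # partially filled rows are kept
]

cleaned = drop_empty_rows(rows)
```

Only the row with every value missing is dropped. A row with a single populated column survives, which matches the "all columns null" condition in the diff.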
data/lib/gtfs_df/feed.rb
CHANGED
@@ -172,17 +172,17 @@ module GtfsDf
       filtered
     end
 
-    # Traverses the
+    # Traverses the graph to prune unreferenced entities from child dataframes
     # based on parent relationships. See GtfsDf::Graph::STOP_NODES
     def prune!(root, filtered, filter_only_children: false)
       seen_edges = Set.new
-
+      rerooted_graph = Graph.build(bidirectional: !filter_only_children)
 
       queue = [root]
 
       while queue.length > 0
         parent_node_id = queue.shift
-
+        rerooted_graph.adj[parent_node_id].each do |child_node_id, attrs|
           edge = edge_id(parent_node_id, child_node_id)
 
           next if seen_edges.include?(edge)
@@ -209,6 +209,13 @@ module GtfsDf
 
           queue << child_node_id
 
+          # If the edge is weak (e.g. reverse edge of an optional relationship),
+          # we traverse to ensure connectivity but do NOT apply the filter.
+          if attrs[:type] == :weak
+            # puts "Skipping weak filter: #{edge}"
+            next
+          end
+
           attrs[:dependencies].each do |dep|
             parent_col = dep[parent_node_id]
             child_col = dep[child_node_id]
@@ -220,6 +227,13 @@ module GtfsDf
             # Get valid values from parent
             valid_values = parent_df[parent_col].to_a.uniq.compact
 
+            # Annoying special case to make sure that if we have a calendar with exceptions,
+            # the calendar_dates file doesn't end up pruning other files
+            if parent_node_id == "calendar_dates" && parent_col == "service_id" &&
+                filtered["calendar"]
+              valid_values = (valid_values + calendar["service_id"].to_a).uniq
+            end
+
             # Filter child to only include rows that reference valid parent values
             before = child_df.height
             filter = Polars.col(child_col).is_in(valid_values)
@@ -243,8 +257,7 @@ module GtfsDf
     end
 
     def edge_id(parent, child)
-
-      [parent, child].sort.join("-")
+      [parent, child].join("-")
     end
   end
 end
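A plausible reading of the `edge_id` change: once the graph can hold both a forward and a reverse edge between two tables, a sorted identifier makes the second direction look already visited. The sketch below illustrates that effect in isolation. The method names `sorted_edge_id` and `directed_edge_id` are illustrative only and do not exist in the gem.

```ruby
require "set"

# Old behaviour: both directions collapse into the same identifier.
def sorted_edge_id(parent, child)
  [parent, child].sort.join("-")
end

# New behaviour: direction is part of the identifier.
def directed_edge_id(parent, child)
  [parent, child].join("-")
end

seen_sorted = Set.new
seen_directed = Set.new

[["routes", "trips"], ["trips", "routes"]].each do |parent, child|
  seen_sorted << sorted_edge_id(parent, child)
  seen_directed << directed_edge_id(parent, child)
end

seen_sorted.size   # 1, the reverse edge would be skipped as "seen"
seen_directed.size # 2, each direction is tracked separately
```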
data/lib/gtfs_df/graph.rb
CHANGED
@@ -41,7 +41,7 @@ module GtfsDf
     NODES = STANDARD_FILE_NODES.merge(STOP_NODES).freeze
 
     # Returns a directed graph of GTFS file dependencies
-    def self.build
+    def self.build(bidirectional: false)
       g = NetworkX::DiGraph.new
       NODES.keys.each { |node| g.add_node(node) }
 
@@ -53,15 +53,15 @@ module GtfsDf
         ]}],
         ["agency", "fare_attributes", {dependencies: [
           {"fare_attributes" => "agency_id",
-           "agency" => "agency_id"}
-        ]}],
+           "agency" => "agency_id", :allow_null => true}
+        ], optional: true}],
         ["fare_attributes", "fare_rules", {dependencies: [
           {"fare_attributes" => "fare_id",
            "fare_rules" => "fare_id"}
         ]}],
         ["routes", "fare_rules", {dependencies: [
           {"fare_rules" => "route_id", "routes" => "route_id", :allow_null => true}
-        ]}],
+        ], optional: true}],
         ["routes", "trips", {dependencies: [
           {"routes" => "route_id", "trips" => "route_id"}
         ]}],
@@ -73,12 +73,12 @@ module GtfsDf
         ]}],
         # Self-referential edge: stops can reference parent stations (location_type=1)
         ["parent_stations", "stops", {dependencies: [
-          {"stops" => "parent_station", "parent_stations" => "stop_id"}
+          {"stops" => "parent_station", "parent_stations" => "stop_id", :allow_null => true}
         ]}],
         ["stops", "transfers", {dependencies: [
           {"stops" => "stop_id", "transfers" => "from_stop_id"},
           {"stops" => "stop_id", "transfers" => "to_stop_id"}
-        ]}],
+        ], optional: true}],
         ["calendar", "trips", {dependencies: [
           {"trips" => "service_id", "calendar" => "service_id"}
         ]}],
@@ -86,11 +86,11 @@ module GtfsDf
           {"trips" => "service_id", "calendar_dates" => "service_id"}
         ]}],
         ["shapes", "trips", {dependencies: [
-          {"trips" => "shape_id", "shapes" => "shape_id"}
+          {"trips" => "shape_id", "shapes" => "shape_id", :allow_null => true}
         ]}],
         ["trips", "frequencies", {dependencies: [
           {"trips" => "trip_id", "frequencies" => "trip_id"}
-        ]}],
+        ], optional: true}],
 
         # --- GTFS Extensions ---
         ["stops", "fare_leg_join_rules",
@@ -163,6 +163,16 @@ module GtfsDf
 
       edges.each do |from, to, attrs|
         g.add_edge(from, to, **attrs)
+        if bidirectional
+          # When adding the reverse edge, if the relationship is optional (child is not required),
+          # mark the reverse edge as weak. This prevents empty child tables (e.g. fare_rules)
+          # from filtering parent tables (e.g. routes) into emptiness.
+          reverse_attrs = attrs.dup
+          if attrs[:optional]
+            reverse_attrs[:type] = :weak
+          end
+          g.add_edge(to, from, **reverse_attrs)
+        end
       end
       g
     end
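The reverse-edge logic in the last hunk can be tried out without NetworkX. The sketch below swaps `NetworkX::DiGraph` for a nested Hash and uses a made-up two-edge subset. `build_adjacency` is a hypothetical stand-in for `Graph.build`, not the gem's API.

```ruby
# Simplified stand-in for Graph.build: a nested Hash plays the role of
# the directed graph, and the edge list is an illustrative subset.
def build_adjacency(edges, bidirectional: false)
  adj = Hash.new { |hash, key| hash[key] = {} }
  edges.each do |from, to, attrs|
    adj[from][to] = attrs
    next unless bidirectional

    # Reverse edges of optional relationships are tagged :weak so a
    # traversal can follow them without filtering the parent table.
    reverse_attrs = attrs.dup
    reverse_attrs[:type] = :weak if attrs[:optional]
    adj[to][from] = reverse_attrs
  end
  adj
end

edges = [
  ["routes", "trips", {}],
  ["routes", "fare_rules", {optional: true}]
]

adj = build_adjacency(edges, bidirectional: true)
adj["trips"]["routes"][:type]      # nil, a regular reverse edge
adj["fare_rules"]["routes"][:type] # :weak, reverse of an optional edge
```

With `bidirectional: false` only the forward edges exist, which corresponds to the `filter_only_children: true` path in `prune!`.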
data/lib/gtfs_df/reader.rb
CHANGED
data/lib/gtfs_df/version.rb
CHANGED
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: gtfs_df
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.8.0
 platform: ruby
 authors:
 - David Mejorado
@@ -41,16 +41,22 @@ dependencies:
   name: rubyzip
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - ">="
       - !ruby/object:Gem::Version
         version: '2.3'
+    - - "<"
+      - !ruby/object:Gem::Version
+        version: '4.0'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - ">="
       - !ruby/object:Gem::Version
         version: '2.3'
+    - - "<"
+      - !ruby/object:Gem::Version
+        version: '4.0'
 description: 'A Ruby gem to load, filter, and manipulate GTFS (General Transit Feed
   Specification) feeds using DataFrames powered by Polars. Supports cascading filters
   that maintain referential integrity across related tables. NOTE: This gem is not
@@ -137,7 +143,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      version: 3.
+      version: 3.2.0
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
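The new rubyzip constraint in the metadata (`>= 2.3`, `< 4.0`) can be checked against concrete versions with RubyGems' own `Gem::Requirement`. This is a small verification sketch, not something shipped in the gem. The tested versions are examples, with 3.2.2 being the version locked in the example Gemfile.lock.

```ruby
require "rubygems"

# The rubyzip requirement as it appears in the 0.8.0 metadata.
rubyzip_requirement = Gem::Requirement.new(">= 2.3", "< 4.0")

rubyzip_requirement.satisfied_by?(Gem::Version.new("2.3.0")) # true
rubyzip_requirement.satisfied_by?(Gem::Version.new("3.2.2")) # true
rubyzip_requirement.satisfied_by?(Gem::Version.new("4.0.0")) # false
```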