embulk-filter-row 0.3.3 → 0.4.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
-   metadata.gz: d1a5e3ca3f112b41b011aeedc0ac215ba78c70fd
-   data.tar.gz: 8ca385ba7785b9ed44d077dc274727e621a2c380
+   metadata.gz: 04f141506053778ed4ff092894f87c9f22acca3d
+   data.tar.gz: 8f7435bc29f4c0e97e44141771a4ce608edb429a
  SHA512:
-   metadata.gz: 7cfb6083eacafa6e1c235b49663569b1b9f23524dc455b947fb7da10d22039823d4f4a114dfc84e0e9432b76e9a9c27bb19b070e4b595e2b7a2dbe13d1e09cb8
-   data.tar.gz: b9f9da54d28074537d2bd655ba1d23acb67bb62bb88b7968ad7f2a384a648cb800f1fafb0aee920e725e7d35169ae751b89d56ca406516630cd784aa0098f358
+   metadata.gz: 5f27041d19e78633e4eedac7426d8d96e2180e047e0793f7c37b55929ba1f8da95f10966ac4643cd0dd690fe6e63ca8c82b95b0afaac64d2d96a89440f97bcf8
+   data.tar.gz: c12e1cca2eed39fa039180237808bc898e415af8a03b889c04ebe010eadcbbfd84cd82d4aaa62d18f2d17f9d71cb2382055e056c2ef30feebeceaf4ed53cc181
data/CHANGELOG.md CHANGED
@@ -1,3 +1,10 @@
+ # 0.4.0 (2017-05-18)
+
+ Enhancements
+
+ * Use joni library for REGEXP operator to improve performance
+   * Note that this may introduce trivial incompatibility changes
+
  # 0.3.3 (2016-08-09)

  Enhancements
@@ -19,8 +26,7 @@ Fxies:

  Changes:

- * Change identifier syntax from `[a-zA-Z$][a-zA-z0-9\.\-_]*` to `[a-zA-Z_][a-zA-z0-9_]*`
-   * Allow starting _. Disallow starting $. Disallow -. Disallow \.
+ * Change identifier syntax from `[a-zA-Z$][a-zA-z0-9\.\-_]*` to `[a-zA-Z_][a-zA-z0-9_]*` to allow a leading _, disallow a leading $, disallow -, and disallow . (dot).

  # 0.3.0 (2016-08-06)

data/README.md CHANGED
@@ -6,93 +6,18 @@ A filter plugin for Embulk to filter out rows

  ## Configuration

- * **where**: Select only rows that match the conditions written in SQL-like syntax. See [SQL-like Syntax](#sql-like-syntax)
-
- The following options are **deprecated**, and will be removed someday.
-
- * **condition**: AND or OR (string, default: AND).
- * **conditions**: select only rows that match the conditions.
-   * **column**: column name (string, required)
-   * **operator**: operator (string, optional, default: ==)
-     * boolean operator
-       * ==
-       * !=
-     * numeric operator (long, double, Timestamp)
-       * ==
-       * !=
-       * >
-       * >=
-       * <=
-       * <
-     * string operator
-       * ==
-       * !=
-       * start_with (or startsWith)
-       * end_with (or endsWith)
-       * include (or contains)
-     * unary operator
-       * "IS NULL"
-       * "IS NOT NULL"
-   * **argument**: argument for the operation (string, required for non-unary operators)
-   * **not**: not (boolean, optional, default: false)
-   * **format**: special option for timestamp columns; specifies the format of the timestamp argument. The parsed argument is compared with the column value as a Timestamp object (string, default is `%Y-%m-%d %H:%M:%S.%N %z`)
-   * **timezone**: special option for timestamp columns; specifies the timezone of the timestamp argument (string, default is `UTC`)
-
- NOTE: column type is automatically retrieved from input data (inputSchema)
+ Versions >= 0.3.0 have a `where` option that supports SQL-like syntax.

- ## Example (AND)
-
- **Deprecated**
-
- ```yaml
- filters:
-   - type: row
-     condition: AND
-     conditions:
-       - {column: foo, operator: "IS NOT NULL"}
-       - {column: id, operator: ">=", argument: 10}
-       - {column: id, operator: "<", argument: 20}
-       - {column: name, operator: "include", argument: foo, not: true}
-       - {column: time, operator: "==", argument: "2015-07-13", format: "%Y-%m-%d"}
- ```
-
- ## Example (OR)
+ * **where**: Select only rows that match the conditions written in SQL-like syntax. See [SQL-like Syntax](#sql-like-syntax)

- **Deprecated**
+ ## Example

  ```yaml
  filters:
    - type: row
-     condition: OR
-     conditions:
-       - {column: a, operator: "IS NOT NULL"}
-       - {column: b, operator: "IS NOT NULL"}
+     where: column1 = 'str'
  ```

- ## Example (AND of OR)
-
- **Deprecated**
-
- You can express a condition such as `(A OR B) AND (C OR D)` by combining multiple filters like
-
- ```yaml
- filters:
-   - type: row
-     condition: OR
-     conditions:
-       - {column: a, operator: "IS NOT NULL"}
-       - {column: b, operator: "IS NOT NULL"}
-   - type: row
-     condition: OR
-     conditions:
-       - {column: c, operator: "IS NOT NULL"}
-       - {column: d, operator: "IS NOT NULL"}
- ```
-
- ## Example (WHERE)
-
- Versions >= 0.3.0 support SQL-like syntax like
-
  ```yaml
  filters:
    - type: row
@@ -145,6 +70,8 @@ Characters surrounded by `'` such as `'foo'` is considered as a string literal

  ### Timestamp Literal

+ NOTE: Since version 0.3.3, the `TIMESTAMP` keyword can be omitted when comparing with a `timestamp` identifier (column).
+
  `TIMESTAMP ( NumberLiteral | StringLiteral )` such as `TIMESTAMP 1470433087.747123` or `TIMESTAMP '2016-08-06 06:38:07.747123 +0900'` is considered as a timestamp literal

  Number is an epoch time since 1970-01-01 UTC with nanosecond resolution.
@@ -160,49 +87,47 @@ String is a timestamp string which matches with one of following format:

  The time zone for formats without `%z` is UTC, and the time resolution is microsecond (a limitation of Embulk's TimestampParser).

- Since version 0.3.3, the `TIMESTAMP` keyword can be omitted when comparing with a `timestamp` identifier.
-
  ### Json Literal

  Not supported yet

  ### Identifier Literal

- Characters matching the regular expression `[a-zA-Z_][a-zA-z0-9_]*`, such as `foobar`, and characters surrounded by `"`, such as `"foo\"bar"`, are considered an identifier literal, that is, an Embulk column name.
+ Characters matching the regular expression `[a-zA-Z_][a-zA-z0-9_]*`, such as `foobar`, and characters surrounded by `"`, such as `"foo-bar"`, `"foo.bar"`, and `"foo\"bar"`, are considered an identifier literal, that is, an Embulk column name.

  ## Operators

  ### Boolean Operator

- * =
- * !=
+ * `=`
+ * `!=`

  ### Number Operator (Long and Double)

- * =
- * !=
- * >
- * >=
- * <=
- * <
+ * `=`
+ * `!=`
+ * `>`
+ * `>=`
+ * `<=`
+ * `<`

  ### String Operator

- * =
- * !=
- * START_WITH
- * END_WITH
- * INCLUDE
- * REGEXP
+ * `=`
+ * `!=`
+ * `START_WITH`
+ * `END_WITH`
+ * `INCLUDE`
+ * `REGEXP`

  ### Timestamp Operator

- * =
- * !=
- * >
- * >=
- * <=
- * <
+ * `=`
+ * `!=`
+ * `>`
+ * `>=`
+ * `<=`
+ * `<`

  ### Json Operator

@@ -214,9 +139,101 @@ Not supported yet
  * "xxx IS NOT NULL"
  * "NOT xxx"

+ ## Old Configuration
+
+ Versions >= 0.3.0 have a `where` option that supports SQL-like syntax. I recommend using it.
+
+ The following options are **deprecated**, and **will be removed someday**.
+
+ * **condition**: AND or OR (string, default: AND).
+ * **conditions**: select only rows that match the conditions.
+   * **column**: column name (string, required)
+   * **operator**: operator (string, optional, default: ==)
+     * boolean operator
+       * `==`
+       * `!=`
+     * numeric operator (long, double, Timestamp)
+       * `==`
+       * `!=`
+       * `>`
+       * `>=`
+       * `<=`
+       * `<`
+     * string operator
+       * `==`
+       * `!=`
+       * `start_with` (or `startsWith`)
+       * `end_with` (or `endsWith`)
+       * `include` (or `contains`)
+     * unary operator
+       * `IS NULL`
+       * `IS NOT NULL`
+   * **argument**: argument for the operation (string, required for non-unary operators)
+   * **not**: not (boolean, optional, default: false)
+   * **format**: special option for timestamp columns; specifies the format of the timestamp argument. The parsed argument is compared with the column value as a Timestamp object (string, default is `%Y-%m-%d %H:%M:%S.%N %z`)
+   * **timezone**: special option for timestamp columns; specifies the timezone of the timestamp argument (string, default is `UTC`)
+
+ NOTE: column type is automatically retrieved from input data (inputSchema)
+
+ ## Example (AND)
+
+ **Deprecated**
+
+ ```yaml
+ filters:
+   - type: row
+     condition: AND
+     conditions:
+       - {column: foo, operator: "IS NOT NULL"}
+       - {column: id, operator: ">=", argument: 10}
+       - {column: id, operator: "<", argument: 20}
+       - {column: name, operator: "include", argument: foo, not: true}
+       - {column: time, operator: "==", argument: "2015-07-13", format: "%Y-%m-%d"}
+ ```
+
+ ## Example (OR)
+
+ **Deprecated**
+
+ ```yaml
+ filters:
+   - type: row
+     condition: OR
+     conditions:
+       - {column: a, operator: "IS NOT NULL"}
+       - {column: b, operator: "IS NOT NULL"}
+ ```
+
+ ## Example (AND of OR)
+
+ **Deprecated**
+
+ You can express a condition such as `(A OR B) AND (C OR D)` by combining multiple filters like
+
+ ```yaml
+ filters:
+   - type: row
+     condition: OR
+     conditions:
+       - {column: a, operator: "IS NOT NULL"}
+       - {column: b, operator: "IS NOT NULL"}
+   - type: row
+     condition: OR
+     conditions:
+       - {column: c, operator: "IS NOT NULL"}
+       - {column: d, operator: "IS NOT NULL"}
+ ```
+
+ ## Comparisons
+
+ * [embulk-filter-calcite](https://github.com/muga/embulk-filter-calcite)
+   * embulk-filter-calcite is a pretty nice plugin that lets us write SQL queries to filter Embulk records.
+   * However, based on [my benchmark (Japanese)](http://qiita.com/sonots/items/a70482d29862de87624d), embulk-filter-calcite was 1471 times slower than embulk-filter-row in terms of the string `=` operator. Use `embulk-filter-row` (this plugin) when you need performance.
+
  ## ToDo

  * Support filtering by values of `type: json` with JSONPath
+ * Support IN operator

  ## ChangeLog

data/build.gradle CHANGED
@@ -15,7 +15,7 @@ configurations {
      provided
  }

- version = "0.3.3"
+ version = "0.4.0"
  sourceCompatibility = 1.7
  targetCompatibility = 1.7

@@ -23,6 +23,9 @@ dependencies {
      compile "org.embulk:embulk-core:0.8.+"
      provided "org.embulk:embulk-core:0.8.+"

+     compile "org.jruby.joni:joni:2.1.11"
+     compile "org.jruby.jcodings:jcodings:1.0.18"
+
      testCompile "junit:junit:4.+"
      testCompile "org.embulk:embulk-core:0.8.+:tests"
      testCompile "org.embulk:embulk-standards:0.8.+"
@@ -0,0 +1,24 @@
+ in:
+   type: file
+   path_prefix: example/regexp_multibyte.csv
+   parser:
+     type: csv
+     charset: UTF-8
+     newline: CRLF
+     null_string: "NULL"
+     skip_header_lines: 1
+     comment_line_marker: '#'
+     columns:
+       - {name: time, type: timestamp, format: "%Y-%m-%d"}
+       - {name: foo, type: string}
+       - {name: bar, type: string}
+       - {name: flag, type: boolean}
+       - {name: id, type: long}
+       - {name: name, type: string}
+       - {name: json, type: json}
+       - {name: score, type: double}
+ filters:
+   - type: row
+     where: name REGEXP '.*チ'
+ out:
+   type: stdout
@@ -4,8 +4,12 @@ import org.embulk.config.ConfigException;
  import org.embulk.spi.PageReader;
  import org.embulk.spi.time.Timestamp;

- import java.util.regex.Matcher;
- import java.util.regex.Pattern;
+ import org.jcodings.specific.UTF8Encoding;
+ import org.joni.Matcher;
+ import org.joni.Option;
+ import org.joni.Regex;
+
+ import java.nio.charset.StandardCharsets;

  // Operation Node of AST (Abstract Syntax Tree)
  public abstract class ParserExp extends ParserNode
  public abstract class ParserExp extends ParserNode
@@ -282,12 +286,15 @@ class StringOpExp extends BinaryOpExp

  class RegexpOpExp extends BinaryOpExp
  {
-     Pattern pattern;
+     Regex regex;

      public RegexpOpExp(ParserLiteral left, ParserLiteral right, int operator)
      {
          super(left, right, operator);
-         this.pattern = Pattern.compile(((StringLiteral)right).val);
+
+         byte[] pattern = (((StringLiteral)right).val).getBytes(StandardCharsets.UTF_8);
+         this.regex = new Regex(pattern, 0, pattern.length, Option.NONE, UTF8Encoding.INSTANCE);
+
          if (! left.isString()) {
              throw new ConfigException(String.format("\"%s\" is not a String column", ((IdentifierLiteral)left).name));
          }
  }
@@ -300,9 +307,10 @@ class RegexpOpExp extends BinaryOpExp

      public boolean eval(PageReader pageReader)
      {
-         String l = left.getString(pageReader);
-         Matcher m = pattern.matcher(l);
-         return m.find();
+         byte[] l = left.getString(pageReader).getBytes(StandardCharsets.UTF_8);
+         Matcher matcher = regex.matcher(l);
+         int result = matcher.search(0, l.length, Option.DEFAULT);
+         return result != -1;
      }
  }

 
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: embulk-filter-row
  version: !ruby/object:Gem::Version
-   version: 0.3.3
+   version: 0.4.0
  platform: ruby
  authors:
  - Naotoshi Seo
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2016-08-09 00:00:00.000000000 Z
+ date: 2017-05-18 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: bundler
@@ -56,6 +56,8 @@ files:
  - example/example.csv
  - example/example.yml
  - example/or.yml
+ - example/regexp_multibyte.csv
+ - example/regexp_multibyte.yml
  - example/where.yml
  - gradle/wrapper/gradle-wrapper.jar
  - gradle/wrapper/gradle-wrapper.properties
@@ -94,7 +96,9 @@ files:
  - src/test/java/org/embulk/filter/row/condition/TestTimestampCondition.java
  - src/test/java/org/embulk/filter/row/where/TestParser.java
  - src/test/java/org/embulk/filter/row/where/TestYylex.java
- - classpath/embulk-filter-row-0.3.3.jar
+ - classpath/embulk-filter-row-0.4.0.jar
+ - classpath/jcodings-1.0.18.jar
+ - classpath/joni-2.1.11.jar
  homepage: https://github.com/sonots/embulk-filter-row
  licenses:
  - MIT