search_syntax 0.1.1 → 0.1.2
- checksums.yaml +4 -4
- data/README.md +3 -27
- data/docs/language-design.md +206 -0
- data/docs/terminology.md +6 -0
- data/lib/search_syntax/ransack.rb +1 -1
- data/lib/search_syntax/ransack_transformer.rb +10 -4
- data/lib/search_syntax/version.rb +1 -1
- metadata +4 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 91e00dd529f00af5e2173c3ed60e1ed545d2bc82c0f238007ede027bf7f093bc
|
4
|
+
data.tar.gz: e242ae7ed2975a1509952e51c933761dcb71dd0d819232713fe7436d9294b0fa
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: fb43c64c866318647311befe6f9e0e23dbbbcacd696dfeab04d0331003b03139410bb41ab84dcfe905b02d5bcee80947c06bc1b0f57c922465eef67cf77a19ce
|
7
|
+
data.tar.gz: 7a18aef3d8d1040110ae26da446d1373a45cd2a856d11551a3d58fdc0ccb3b6e8c6c747ea12776d637dcd18ed484e41c96d6bf237897df3e0d110ecf88847f04
|
data/README.md
CHANGED
@@ -14,31 +14,7 @@ So far parser only supports bare strings, **quoted strings** (`"some string"`) a
|
|
14
14
|
|
15
15
|
Parser **doesn't** support negation (`not`/`-`), boolean operations (`and`/`&`/`or`/`|`) and grouping (`(a | b)`).
|
16
16
|
|
17
|
-
This probably will change as soon as I understand how to add those "advanced" features without making it less user-friendly
|
18
|
-
|
19
|
-
## Challenge
|
20
|
-
|
21
|
-
The main challenge is to come up with a query language intuitive enough for non-techy people to use, yet powerful enough to expose all advanced features.
|
22
|
-
|
23
|
-
There are different types of search, and they require different features:
|
24
|
-
|
25
|
-
```mermaid
|
26
|
-
graph LR
|
27
|
-
Search --> Parametric --> op1[param = 1, param > 2, etc.]
|
28
|
-
Search --> s1[Text: single term]
|
29
|
-
s1 --> op2[Phonetic similarity: names, emails, words with alternate spellings, etc.]
|
30
|
-
s1 --> op3[Orthographic similarity: drug names, biological species, typos in proper nouns, etc.]
|
31
|
-
s1 --> op4[Pattern match: logs, match by part of word, etc.]
|
32
|
-
Search --> s2[Text: multiple terms]
|
33
|
-
s2 --> op5[Full-text search: text in natural language]
|
34
|
-
s2 --> op6[Single term search with boolean operations: AND, OR, NOT, grouping]
|
35
|
-
```
|
36
|
-
|
37
|
-
**Note**: No, full-text search is not a universal solution for all types of text search. It is designed for searching natural-language texts. But this subject deserves a separate article.
|
38
|
-
|
39
|
-
**Parametric search** aka faceted search - [filter by structured data](https://en.wikipedia.org/wiki/Faceted_search).
|
40
|
-
|
41
|
-
**Approximate search** aka fuzzy search aka approximate string matching - [is the technique of finding strings that match a pattern approximately (rather than exactly)](https://en.wikipedia.org/wiki/Approximate_string_matching).
|
17
|
+
This probably will change as soon as I understand how to add those "advanced" features without making it less user-friendly for non-techy people. See [Language design](docs/language-design.md) for explanations.
|
42
18
|
|
43
19
|
## Installation
|
44
20
|
|
@@ -92,7 +68,7 @@ $ bin/tt lib/search_syntax/search_syntax_grammar.tt
|
|
92
68
|
|
93
69
|
## Contributing
|
94
70
|
|
95
|
-
Bug reports and pull requests are welcome on GitHub at https://github.com/
|
71
|
+
Bug reports and pull requests are welcome on GitHub at https://github.com/stereobooster/search_syntax. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/stereobooster/search_syntax/blob/master/CODE_OF_CONDUCT.md).
|
96
72
|
|
97
73
|
## License
|
98
74
|
|
@@ -100,4 +76,4 @@ The gem is available as open source under the terms of the [MIT License](https:/
|
|
100
76
|
|
101
77
|
## Code of Conduct
|
102
78
|
|
103
|
-
Everyone interacting in the SearchSyntax project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/
|
79
|
+
Everyone interacting in the SearchSyntax project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/stereobooster/search_syntax/blob/master/CODE_OF_CONDUCT.md).
|
data/docs/language-design.md
ADDED
@@ -0,0 +1,206 @@
|
|
1
|
+
# Language design
|
2
|
+
|
3
|
+
Ideas and things to take into account when designing a query language.
|
4
|
+
|
5
|
+
## Challenge
|
6
|
+
|
7
|
+
The main challenge is to come up with a query language intuitive enough for non-techy people to use, yet powerful enough to expose all advanced features.
|
8
|
+
|
9
|
+
There are different types of search, and they require different features:
|
10
|
+
|
11
|
+
```mermaid
|
12
|
+
graph LR
|
13
|
+
Search --> Parametric --> op1[param = 1, param > 2, etc.]
|
14
|
+
Search --> s1[Text: single term]
|
15
|
+
s1 --> op2[Phonetic similarity: names, emails, words with alternate spellings, etc.]
|
16
|
+
s1 --> op3[Orthographic similarity: drug names, biological species, typos in proper nouns, etc.]
|
17
|
+
s1 --> op4[Pattern match: logs, match by part of word, etc.]
|
18
|
+
Search --> s2[Text: multiple terms]
|
19
|
+
s2 --> op5[Full-text search: text in natural language]
|
20
|
+
s2 --> op6[Single term search with boolean operations: AND, OR, NOT, grouping]
|
21
|
+
```
|
22
|
+
|
23
|
+
**Note**: No, full-text search is not a universal solution for all types of text search. It is designed for searching natural-language texts. But this subject deserves a separate article.
|
24
|
+
|
25
|
+
## Rich syntax vs intuitiveness
|
26
|
+
|
27
|
+
Implementing a language (parser) is trivial. The problem is that the more advanced the language is (the more capabilities it has), the less intuitive it becomes.
|
28
|
+
|
29
|
+
It is less intuitive, because you need to deal with:
|
30
|
+
|
31
|
+
- precedence (which one has priority: OR or AND?)
|
32
|
+
- syntax errors, like missing closing bracket
|
33
|
+
- the need to escape special chars
|
34
|
+
- the fact that people may not be aware of an operator's special meaning
|
35
|
+
- for example, it is hard to search Google for CSS properties that start with "-", because the minus is interpreted as "NOT". You need to quote them in order to find them
|
36
|
+
|
37
|
+
which may be counterintuitive without a syntax checker.
|
38
|
+
|
39
|
+
## Operators
|
40
|
+
|
41
|
+
**Important**: this is not a comparison of features, but rather a comparison of syntax.
|
42
|
+
|
43
|
+
| | Meilisearch | Solr | Sphinx | MySQL FT boolean | PostgreSQL FT | GitHub syntax | SQL | Google |
|
44
|
+
| -------------------------------- | ------------------------- | --------- | ----------- | ---------------- | ------------- | ----------------------- | ------------------- | --------- |
|
45
|
+
| **Boolean operators** apply to   | parametric                | both      | text        | text             | text          | both                    | parametric          | both?     |
|
46
|
+
| default operator | | OR | AND | OR | | AND | | and? |
|
47
|
+
| not | NOT | NOT / ! | - / ! | - (kind of) | ! / NOT / - | NOT / - | NOT | - |
|
48
|
+
| and | AND | AND / && | no operator | + (kind of) | & / AND | no operator | AND | and? |
|
49
|
+
| or | OR | OR / \|\| | \| | no operator | \| / OR | | OR | \| / or |
|
50
|
+
| grouping | () | () | () | () | () | | () | () |
|
51
|
+
| **Text search** | | | | | | | | |
|
52
|
+
| phrase search | "" | "" | "" | "" | <-> / "" | "" | LIKE "%phrase%" | "" |
|
53
|
+
| proximity phrase search | | ""~N | ""~N | ""@N | < N > | | | AROUND(N) |
|
54
|
+
| priority modifier | | ^N, ^= | ^N | > / < / ~ | :N | | | |
|
55
|
+
| prefix search | | \*, ? | | \* | :\* | | %, \_ | \*, \_ |
|
56
|
+
| required | | + | | + | | | | + |
|
57
|
+
| prohibited | | - | | - | | | | - |
|
58
|
+
| **Parametric search** | | | | | | | | |
|
59
|
+
| parameter specifier | param (in separate query) | param: | @param | | | param: | param | param: |
|
60
|
+
| comparison | >, >=, <, <=, =, != | implied = | implied = | | | >, >=, <, <=, implied = | >, >=, <, <=, =, != | |
|
61
|
+
| range                            | n TO n                    | [n TO n]  |             |                  |               | n..n                    | >= AND <=           | n..n      |
|
62
|
+
| other | IN [], EXISTS | | | | | | IN [], NOT NULL... | |
|
63
|
+
|
64
|
+
- [ransack](https://activerecord-hackery.github.io/ransack/getting-started/search-matches/)
|
65
|
+
- [MySQL Full-Text Search](https://dev.mysql.com/doc/refman/8.0/en/fulltext-boolean.html)
|
66
|
+
- [PostgreSQL Full-Text Search](https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES)
|
67
|
+
- [Meilisearch](https://docs.meilisearch.com/learn/advanced/filtering_and_faceted_search.html#using-filters)
|
68
|
+
- [Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html)
|
69
|
+
- [Solr](https://solr.apache.org/guide/6_6/the-standard-query-parser.html)
|
70
|
+
- [Lucene](https://lucene.apache.org/core/2_9_4/queryparsersyntax.html) ([Lucene vs Solr](https://www.lucenetutorial.com/lucene-vs-solr.html))
|
71
|
+
- [Sphinx](https://sphinxsearch.com/docs/current/extended-syntax.html)
|
72
|
+
|
73
|
+
## Text + parametric search
|
74
|
+
|
75
|
+
- Two separate fields - one for text and one for parametric search
|
76
|
+
- a separate field for the parametric query would allow autocomplete for parameters
|
77
|
+
- One field
|
78
|
+
- differentiate params with specific marker like `:` or `@`
|
79
|
+
- having the marker in prefix position would allow autocomplete for parameters
|
80
|
+
- differentiate by predefined list of keywords
|
81
|
+
- differentiate by one specific marker, for example `text query (params query)`
|
82
|
+
- this would allow autocomplete for parameters and restrict boolean operations to parametric search
|
83
|
+
- One field with "mode" e.g. ability to switch from text search to text + parametric search
|
84
|
+
- One field with projectional editing
|
85
|
+
- this would allow autocomplete for parameters
|
86
|
+
- this would prevent syntax errors and make it possible to show semantic errors (like an unknown param)
|
87
|
+
- for text we can use "free" editing and for params we can use projectional editing, which can be triggered by a specific key that does not itself end up in the input
|
88
|
+
- https://react-mentions.vercel.app/
|
89
|
+
- https://www.npmjs.com/package/react-tag-input
|
90
|
+
- this would require [CST](https://www.cse.chalmers.se/edu/year/2011/course/TIN321/lectures/proglang-02.html) rather than AST
|
91
|
+
- Return parsed query
|
92
|
+
- If the query is parsed incorrectly, it will most likely return no results. On that screen we can show the "parsed" query. For this we need a "printer" that converts the CST to HTML
|
93
|
+
|
94
|
+
## No one-size-fits-all
|
95
|
+
|
96
|
+
The initial idea was to implement a universal language. But the more I thought about this task, the more I realized there is no one-size-fits-all solution. So instead I can design a language with all possible features and the ability to turn them on and off.
|
97
|
+
|
98
|
+
| feature | input | output | enabled |
|
99
|
+
| -------------------- | ------------ | --------------------------------------------- | ------- |
|
100
|
+
| parametric search | param:1 | param = 1 | always |
|
101
|
+
| | param:>1 | param > 1 | ? |
|
102
|
+
| | param:">1" | param = ">1" | always |
|
103
|
+
| phrase quotation | "a b" | containing "a b" | always |
|
104
|
+
| escape quote | "\"" | containing "\"" | always |
|
105
|
+
| negation | -param:1 | NOT (param = 1) | |
|
106
|
+
| | param:!=1 | param != 1 | ? |
|
107
|
+
| | -a | not containing a | |
|
108
|
+
| | - a | shall we support space between minus? | |
|
109
|
+
| | -"a b" | not containing "a b" | |
|
110
|
+
| | not a | not containing a | |
|
111
|
+
| | not "a b" | not containing "a b" | |
|
112
|
+
| | not"a b" | shall we support absence of space? | |
|
113
|
+
| | not param:1 | not (param = 1) | |
|
114
|
+
| | not -a | Error? | |
|
115
|
+
| grouping | (a b) | doesn't make sense without boolean operations | |
|
116
|
+
| or | | I will assume that default operator is "and" | |
|
117
|
+
| | a\|b | containing a or b | |
|
118
|
+
| | a \| b | containing a or b | |
|
119
|
+
| | a\|"b c" | containing a or "b c" | |
|
120
|
+
| | a or"b c" | shall we support absence of space? | |
|
121
|
+
| | a\|b c | containing (a or b) and c | |
|
122
|
+
| | (a\|b) | containing a or b | |
|
123
|
+
| | (a\|b) c | containing (a or b) and c | |
|
124
|
+
| | a or b | containing a or b | |
|
125
|
+
| | -(a\|b) | not containing (a or b) | |
|
126
|
+
| | -a\|b | not containing a or containing b | |
|
127
|
+
| | not a\|b | not containing a or containing b | |
|
128
|
+
| | not (a or b) | not containing (a or b) | |
|
129
|
+
| | not(a or b) | shall we support absence of space? | |
|
130
|
+
| | a:1 \| b:1 | a = 1 or b = 1 | |
|
131
|
+
| | a:1 or b:1 | a = 1 or b = 1 | |
|
132
|
+
| | a:1\|2 | a = 1 or a = 2. a IN [1, 2] | |
|
133
|
+
| | a:(1 \| 2) | a = 1 or a = 2. a IN [1, 2] | |
|
134
|
+
| and | | I will assume that default operator is "and" | |
|
135
|
+
| | a b | containing a and b | always |
|
136
|
+
| | (a b) | containing a and b | |
|
137
|
+
| | a:1 b:1 | a = 1 and b = 1 | |
|
138
|
+
| | a and b | containing a and b | |
|
139
|
+
| | a:1 and b:1 | a = 1 and b = 1 | |
|
140
|
+
| escape special chars | \\\| | containing "\|" | never |
|
141
|
+
| | "\|" | containing "\|" | always |
|
142
|
+
|
143
|
+
## Operator Precedence
|
144
|
+
|
145
|
+
- [Microsoft Transact-SQL operator precedence](https://learn.microsoft.com/en-us/sql/t-sql/language-elements/operator-precedence-transact-sql?view=sql-server-2017)
|
146
|
+
- [Oracle MySQL 8.0 operator precedence](https://dev.mysql.com/doc/refman/8.0/en/operator-precedence.html)
|
147
|
+
- [Oracle 10g condition precedence](https://docs.oracle.com/cd/B19306_01/server.102/b14200/conditions001.htm#i1034834)
|
148
|
+
- [PostgreSQL operator precedence](https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-PRECEDENCE)
|
149
|
+
- [SQL as understood by SQLite](https://www.sqlite.org/lang_expr.html)
|
150
|
+
|
151
|
+
If we assume that "AND" is the default boolean operator, then it should have the lowest precedence.
|
152
|
+
|
153
|
+
Option 1:
|
154
|
+
|
155
|
+
| Operator | Associativity | Position |
|
156
|
+
| ------------- | ------------- | -------- |
|
157
|
+
| NOT / - | right | prefix |
|
158
|
+
| OR / \| | left | infix |
|
159
|
+
| AND / (space) | left | infix |
|
160
|
+
|
161
|
+
Option 2:
|
162
|
+
|
163
|
+
| Operator | Associativity | Position |
|
164
|
+
| -------- | ------------- | -------- |
|
165
|
+
| NOT / - | right | prefix |
|
166
|
+
| AND | left | infix |
|
167
|
+
| OR / \| | left | infix |
|
168
|
+
| (space) | left | infix |
|
169
|
+
|
170
|
+
## Strange cases
|
171
|
+
|
172
|
+
- `"a b`
|
173
|
+
- containing `"a` and `b`
|
174
|
+
- containing `a b` e.g. as `"a b"`
|
175
|
+
- `(a b`
|
176
|
+
- containing `(a` and `b`
|
177
|
+
- containing `a` and `b` e.g. as `(a b)`
|
178
|
+
- `not -a`
|
179
|
+
- containing `a`
|
180
|
+
- not containing `-a`
|
181
|
+
- `not not not`
|
182
|
+
- empty query
|
183
|
+
- containing `not`
|
184
|
+
- `|||`
|
185
|
+
- empty query
|
186
|
+
- containing `|||`
|
187
|
+
- `""`
|
188
|
+
- empty query
|
189
|
+
- `param:<>1`
|
190
|
+
- `param = "<>1"`
|
191
|
+
- `param: 1`
|
192
|
+
- containing `param:` and `1`
|
193
|
+
- `param = 1`
|
194
|
+
- `()`
|
195
|
+
- containing `()`
|
196
|
+
- empty query
|
197
|
+
- `--a`
|
198
|
+
- containing `a`
|
199
|
+
- not containing `-a`
|
200
|
+
- containing `--a`
|
201
|
+
- `a -`
|
202
|
+
- containing `a`
|
203
|
+
- containing `a` and `-`
|
204
|
+
- `or a`
|
205
|
+
- containing `a`
|
206
|
+
- containing `or` and `a`
|
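The "Option 1" precedence table in docs/language-design.md above (NOT binds tightest, then OR, then the implicit AND) can be illustrated with a small recursive-descent parser. This is only a sketch under stated assumptions: the class name `PrecedenceSketch`, the tokenizer regular expression, and the nested-array output are hypothetical and are not part of the gem, which uses a Treetop grammar and does not support these operators yet. For brevity the sketch handles only `-`, `|`, parentheses and bare words, and it does not allow `-` inside a word.

```ruby
# Hypothetical illustration of "Option 1": NOT > OR > AND (implicit, by space).
# Not part of search_syntax; the gem's real parser is generated by Treetop.
class PrecedenceSketch
  def initialize(query)
    # Tokens: parentheses, "|", "-", or a run of other non-space characters.
    @tokens = query.scan(/[()|]|-|[^\s()|\-]+/)
  end

  def parse
    node = parse_and
    raise ArgumentError, "unexpected #{@tokens.first}" unless @tokens.empty?
    node
  end

  private

  # AND / (space): lowest precedence, left associative.
  def parse_and
    node = parse_or
    node = [:and, node, parse_or] while @tokens.any? && @tokens.first != ")"
    node
  end

  # OR / |: binds tighter than the implicit AND, left associative.
  def parse_or
    node = parse_not
    while @tokens.first == "|"
      @tokens.shift
      node = [:or, node, parse_not]
    end
    node
  end

  # NOT / -: prefix operator, right associative, highest precedence.
  def parse_not
    if @tokens.first == "-"
      @tokens.shift
      [:not, parse_not]
    else
      parse_atom
    end
  end

  # A bare word or a parenthesized group.
  def parse_atom
    token = @tokens.shift
    raise ArgumentError, "unexpected end of query" if token.nil?
    raise ArgumentError, "unexpected #{token}" if token == ")" || token == "|"
    return token unless token == "("

    node = parse_and
    raise ArgumentError, "missing closing bracket" unless @tokens.shift == ")"
    node
  end
end

p PrecedenceSketch.new("a|b c").parse  # => [:and, [:or, "a", "b"], "c"]
p PrecedenceSketch.new("-a|b").parse   # => [:or, [:not, "a"], "b"]
p PrecedenceSketch.new("-(a|b)").parse # => [:not, [:or, "a", "b"]]
```

These results agree with the rows `a|b c`, `-a|b` and `-(a|b)` in the feature table. A query such as `(a b` raises an `ArgumentError` here, which is only one of the two interpretations listed under "Strange cases".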
data/docs/terminology.md
ADDED
@@ -0,0 +1,6 @@
|
|
1
|
+
# Terminology
|
2
|
+
|
3
|
+
**Parametric search** aka faceted search aka filters - [filter by structured data](https://en.wikipedia.org/wiki/Faceted_search).
|
4
|
+
|
5
|
+
**Approximate search** aka fuzzy search aka approximate string matching - [is the technique of finding strings that match a pattern approximately (rather than exactly)](https://en.wikipedia.org/wiki/Approximate_string_matching).
|
6
|
+
|
data/lib/search_syntax/ransack.rb
CHANGED
@@ -5,7 +5,7 @@ require_relative "ransack_transformer"
|
|
5
5
|
module SearchSyntax
|
6
6
|
class Ransack
|
7
7
|
# text - symbol. Idea for the future: it can be callback to allow to manipulate query for full-text search
|
8
|
-
# params - array of strings
|
8
|
+
# params - array of strings; or hash to rename params
|
9
9
|
# sort - string. nil - to disable parsing sort param
|
10
10
|
def initialize(text:, params:, sort: nil)
|
11
11
|
@transformer = RansackTransformer.new(text: text, params: params, sort: sort)
|
data/lib/search_syntax/ransack_transformer.rb
CHANGED
@@ -17,9 +17,15 @@ module SearchSyntax
|
|
17
17
|
|
18
18
|
def initialize(text:, params:, sort: nil)
|
19
19
|
@text = text
|
20
|
+
if params.is_a?(Array)
|
21
|
+
params = params.to_h { |i| [i.to_s, i] }
|
22
|
+
elsif params.is_a?(Hash)
|
23
|
+
params = params.map { |k, v| [k.to_s, v] }.to_h
|
24
|
+
end
|
20
25
|
@params = params
|
26
|
+
@allowed_params = params.keys
|
21
27
|
@sort = sort
|
22
|
-
@spell_checker = DidYouMean::SpellChecker.new(dictionary: @
|
28
|
+
@spell_checker = DidYouMean::SpellChecker.new(dictionary: @allowed_params)
|
23
29
|
end
|
24
30
|
|
25
31
|
def transform_sort_param(value)
|
@@ -36,16 +42,16 @@ module SearchSyntax
|
|
36
42
|
errors = []
|
37
43
|
result = {}
|
38
44
|
|
39
|
-
if @
|
45
|
+
if @allowed_params.length > 0
|
40
46
|
ast = ast.filter do |node|
|
41
47
|
if node[:type] != :param
|
42
48
|
true
|
43
49
|
elsif node[:name] == @sort
|
44
50
|
result[:s] = transform_sort_param(node[:value])
|
45
51
|
false
|
46
|
-
elsif @
|
52
|
+
elsif @allowed_params.include?(node[:name])
|
47
53
|
predicate = PREDICATES[node[:predicate]] || :eq
|
48
|
-
key = "#{node[:name]}_#{predicate}".to_sym
|
54
|
+
key = "#{@params[node[:name]]}_#{predicate}".to_sym
|
49
55
|
if !result.key?(key)
|
50
56
|
result[key] = node[:value]
|
51
57
|
else
|
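The change to `RansackTransformer#initialize` above normalizes `params` into a hash with string keys, so that an array keeps each name as-is while a hash renames the user-facing name to another attribute. The following standalone sketch mirrors that logic outside the class. The helper names `normalize_params` and `ransack_key` are hypothetical, the example attribute names are made up, and the `ArgumentError` for other input types is an assumption (the gem's code has no such branch).

```ruby
# Hypothetical helper mirroring the normalization added in 0.1.2.
# Array: every name maps to itself. Hash: keys are stringified, values kept.
def normalize_params(params)
  case params
  when Array
    params.to_h { |name| [name.to_s, name] }
  when Hash
    params.to_h { |key, value| [key.to_s, value] }
  else
    # Assumption for this sketch only; the gem does not raise here.
    raise ArgumentError, "params must be an Array or a Hash"
  end
end

# Hypothetical helper: builds the key the same way the diff does,
# "#{@params[node[:name]]}_#{predicate}".to_sym, with :eq as the default.
def ransack_key(params, name, predicate = :eq)
  "#{params.fetch(name)}_#{predicate}".to_sym
end

from_array = normalize_params(%w[title year])
from_hash  = normalize_params({ author: "author_name" })

p from_array                          # => {"title"=>"title", "year"=>"year"}
p from_hash                           # => {"author"=>"author_name"}
p ransack_key(from_array, "year", :gt) # => :year_gt
p ransack_key(from_hash, "author")     # => :author_name_eq
```

In the gem itself, unknown names are filtered with `@allowed_params.include?(node[:name])` before the key is built, whereas `fetch` in this sketch would raise a `KeyError`.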
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: search_syntax
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.1.
|
4
|
+
version: 0.1.2
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- stereobooster
|
8
8
|
autorequire:
|
9
9
|
bindir: exe
|
10
10
|
cert_chain: []
|
11
|
-
date: 2022-10-
|
11
|
+
date: 2022-10-20 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: treetop
|
@@ -39,6 +39,8 @@ files:
|
|
39
39
|
- LICENSE.txt
|
40
40
|
- README.md
|
41
41
|
- Rakefile
|
42
|
+
- docs/language-design.md
|
43
|
+
- docs/terminology.md
|
42
44
|
- lib/search_syntax.rb
|
43
45
|
- lib/search_syntax/errors.rb
|
44
46
|
- lib/search_syntax/parser.rb
|