search_syntax 0.1.1 → 0.1.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 30784d95f477eb834cab186fbe04f26903a9430548fd242e7cc06c13b12c1773
- data.tar.gz: fdcd44997040e377e480e46b951d20cda1b679ae05786197720db76346f1b103
+ metadata.gz: 27ad7918f5ef1cfd2595f64bac7f7140ad11b189a1d360f01b007c11447c18b4
+ data.tar.gz: 0eec14f49542a2c57218433eb19c36955b0e1e04c431ed1d0b99b45207c28fa8
  SHA512:
- metadata.gz: 0f42fb1cf2803d73d188b93b959f3d4065c5ebc9090c657bffdd42e355a44f86ac90e489f7a0aa3b089601352e2b14ca2e4504bb780697d848daadb5652fbd0b
- data.tar.gz: 4675d285eee0b6b688ba15accfc17890c431cb21bf1a30b9d569516b3109c627d0e5526c874077a86d984531799f25ae404ddcf6d6a87f87ba08bfaa26c23dda
+ metadata.gz: e8b1768a2b3de1ac000f4b90ab82e2322bcefb7d6abe1853f74dc4143ba55edf67efcda43b26444a264ac617c0cbc974c54961c78c61c98ae291678c1fcba837
+ data.tar.gz: ab73fc00bda9e55b8f0a9ebe11f75d7aa591447d519e4e064abfe480bb52ccb274087431160ea1783e1a5983ee8ef52deff262df6ee9d7d59992ffb2fd37c1ae
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
  PATH
  remote: .
  specs:
- search_syntax (0.1.1)
+ search_syntax (0.1.3)
  treetop (~> 1.6)

  GEM
data/README.md CHANGED
@@ -14,31 +14,7 @@ So far parser only supports bare strings, **quoted strings** (`"some string"`) a

  Parser **doesn't** support negation (`not`/`-`), boolean operations (`and`/`&`/`or`/`|`) and grouping (`(a | b)`).

- This probably will change as soon as I understand how to add those "advanced" features without making it less user-friendly to non-techy people. See "Challenge" section for explantions.
-
- ## Challenge
-
- Main challenge is to come up with query language intuitive enough that non-techy people can use, but powerfull enough to expose all advanced features.
-
- There are different types of search, they require different features:
-
- ```mermaid
- graph LR
- Search --> Parametric --> op1[param = 1, param > 2, etc.]
- Search --> s1[Text: single term]
- s1 --> op2[Phonetic similarity: names, emails, words with alternate spellings, etc.]
- s1 --> op3[Ortographic similarity: drug names, biological species, typos in proper nouns, etc.]
- s1 --> op4[Pattern match: logs, match by part of word, etc.]
- Search --> s2[Text: multiple terms]
- s2 --> op5[Full-text search: text in natural language]
- s2 --> op6[Single term search with boolean operations: AND, OR, NOT, grouping]
- ```
-
- **Note**: No. Full-text search is not an universal solution for all types of text search. It is designed to search in natural language texts. But this subject deserves a separate article.
-
- **Parametric search** aka faceted search - [filter by strctured data](https://en.wikipedia.org/wiki/Faceted_search).
-
- **Aproximate search** aka fuzzy search aka approximate string matching - [is the technique of finding strings that match a pattern approximately (rather than exactly)](https://en.wikipedia.org/wiki/Approximate_string_matching).
+ This probably will change as soon as I understand how to add those "advanced" features without making it less user-friendly for non-techy people. See [Language design](docs/language-design.md) for explanations.

  ## Installation

@@ -92,7 +68,7 @@ $ bin/tt lib/search_syntax/search_syntax_grammar.tt

  ## Contributing

- Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/search_syntax. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/[USERNAME]/search_syntax/blob/master/CODE_OF_CONDUCT.md).
+ Bug reports and pull requests are welcome on GitHub at https://github.com/stereobooster/search_syntax. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/stereobooster/search_syntax/blob/master/CODE_OF_CONDUCT.md).

  ## License

@@ -100,4 +76,4 @@ The gem is available as open source under the terms of the [MIT License](https:/

  ## Code of Conduct

- Everyone interacting in the SearchSyntax project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/[USERNAME]/search_syntax/blob/master/CODE_OF_CONDUCT.md).
+ Everyone interacting in the SearchSyntax project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/stereobooster/search_syntax/blob/master/CODE_OF_CONDUCT.md).
@@ -0,0 +1,121 @@
# Approximate string matching algorithms

## Preprocessing

```mermaid
graph LR;
String --> Split --> 1[Sequence of letters]
Split --> 2[Set of n-grams]
Split --> 3[Set of n-grams with frequency]

String --> Tokenizer --> 4[Sequence of words]
Tokenizer --> 5[Set of words]
Tokenizer --> 6[Set of words with frequency]

1 --> Sequence
4 --> Sequence
2 --> Set
5 --> Set
3 --> swc[Set with frequency]
6 --> swc
```

- **Tokenizer** is language dependent, so the algorithm would need to know the language upfront or be able to detect it.
- **Sequence** is required for "edit distance" algorithms, because they need to know positions.
- In terms of implementation, a **set** can be implemented as a hash table, e.g. `{a: true, b: true}` (`aba`)
- Then a **set with frequency** can be implemented as a hash where the value is a frequency: `{a: 2, b: 1}` (`aba`)
- There are also **skip-grams**, which are not shown here

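In Ruby (the language of this gem) the set representations above can be sketched like this; `bigrams`, `to_set` and `to_frequency` are helper names chosen here for illustration:

```ruby
# Split a string into bigrams (n-grams with n = 2).
def bigrams(str)
  (0..str.length - 2).map { |i| str[i, 2] }
end

# Set: a hash table with `true` values.
def to_set(grams)
  grams.to_h { |g| [g, true] }
end

# Set with frequency: a hash table counting occurrences.
def to_frequency(grams)
  grams.tally
end

to_set(bigrams("aba"))        # => {"ab"=>true, "ba"=>true}
to_frequency(bigrams("abab")) # => {"ab"=>2, "ba"=>1}
```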
There can be more steps in this process, for example:

- Converting strings to lower case
- Normalizing the alphabet, for example `β` can be converted to `ss` or `è` can be converted to `e`
- The string can be padded before splitting into n-grams, which adds "weight" to the start or the end of the word
- The tokenizer can have a lot of dictionary-based operations after splitting, for example:
  - Removing **stop-words**, i.e. words which are very common in the language and don't add much information, like `the`, `in`, etc.
  - Fixing common spelling errors
  - **Stemming** words, i.e. converting them into canonical form, for example `birds` will be converted to `bird`, etc.
  - Replacing words with more common synonyms, or converting acronyms to the full version

## Measures

Measures can be separated into the following categories:

1. similarity/dissimilarity
   - similarity - a higher value means closer strings
   - dissimilarity - a lower value means closer strings
   - distance - a dissimilarity which has metric properties
2. ranking/relevance
   - if a measure returns only `true` and `false` it can be used as relevance, but not as a ranking function
   - similarity can be used as ranking if ordered descending
   - dissimilarity can be used as ranking if ordered ascending
3. by type of expected input
   - sequence
   - set
   - set with frequency
4. normalized/not normalized
   - a measure is normalized if its values are in the range 0..1
   - normalized similarity can be converted to dissimilarity using the formula `dis(x, y) = 1 - sim(x, y)`
5. by type of assumed error
   - phonetic (if words sound similar). Good for words that sound the same but have different spellings, like `Claire`, `Clare`
   - orthographic (if words look similar). Good for detecting typos and errors

| Category     | Measure                                     | Input data          | Type                   | Metric        | Normalized                             |
| ------------ | ------------------------------------------- | ------------------- | ---------------------- | ------------- | -------------------------------------- |
| Phonetic     | Phonetic hashing (Soundex, Metaphone, etc.) | sequence of letters | similarity (relevance) |               | Yes                                    |
| Orthographic | Levenshtein distance                        | sequence            | dissimilarity          | Yes           | `l(x, y) / max(len(x), len(y))`, NED   |
|              | Damerau-Levenshtein distance                | sequence            | dissimilarity          |               |                                        |
|              | Hamming distance                            | sequence            | dissimilarity          | Yes           |                                        |
|              | Jaro distance                               | sequence            | dissimilarity          |               |                                        |
|              | Jaro–Winkler distance                       | sequence            | dissimilarity          |               |                                        |
|              | Longest common subsequence (LCS)            | sequence            | similarity             | ?             | `len(lcs(x, y)) / max(len(x), len(y))` |
|              | Jaccard index                               | set                 | similarity             | `1 - j(x, y)` | Yes                                    |
|              | Dice coefficient                            | set                 | similarity             | ?             | Yes                                    |
|              | Cosine similarity                           | set with frequency  | similarity             | ?             | Yes                                    |

**TODO**: this is not a full list of measures

## Algorithms

1. Some measures have more than one algorithm to calculate them
2. Algorithms differ in computational and space complexity

| Measure                          | Algorithm      | Computational complexity | Space complexity | Comment                                                                                                                                                |
| -------------------------------- | -------------- | ------------------------ | ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Levenshtein distance             | Wagner-Fischer | O(mn)                    | O(mn)            | Wagner, Robert A., and Michael J. Fischer. "The string-to-string correction problem." Journal of the ACM 21.1 (1974): 168-173                           |
|                                  | Myers          |                          |                  | Myers, Gene. "A fast bit-vector algorithm for approximate string matching based on dynamic programming." Journal of the ACM (JACM) 46.3 (1999): 395-415 |
| Longest common subsequence (LCS) | Larsen         | O(log(m)log(n))          | O(mn²)           | "Length of Maximal Common Subsequences", K.S. Larsen                                                                                                   |
|                                  | Hunt–Szymanski |                          |                  | https://imada.sdu.dk/~rolf/Edu/DM823/E16/HuntSzymanski.pdf                                                                                             |

**TODO**: this is not a full list of algorithms

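The Wagner-Fischer entry in the table can be sketched in Ruby as follows (a minimal, unoptimized unit-cost version, not the gem's code):

```ruby
# Wagner-Fischer: Levenshtein distance via dynamic programming.
# d[i][j] = distance between the first i chars of x and the first j chars of y.
def levenshtein(x, y)
  d = Array.new(x.length + 1) { |i| Array.new(y.length + 1) { |j| i.zero? ? j : (j.zero? ? i : 0) } }
  (1..x.length).each do |i|
    (1..y.length).each do |j|
      cost = x[i - 1] == y[j - 1] ? 0 : 1 # substitution cost
      d[i][j] = [
        d[i - 1][j] + 1,        # deletion
        d[i][j - 1] + 1,        # insertion
        d[i - 1][j - 1] + cost  # substitution (or match)
      ].min
    end
  end
  d[x.length][y.length]
end

levenshtein("kitten", "sitting") # => 3
```

Both the time and space complexity are O(mn), matching the table; the O(mn) space can be reduced to O(min(m, n)) by keeping only two rows of the matrix.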
## Indexes

All the algorithms above assume two input strings. So if we need to search through a database, we would have to go row by row, comparing each value in the DB to the query, then choose all relevant rows and sort them by rank.

This would be slow, so to overcome it we can preprocess the data and produce a data structure more suitable for the given algorithm, speeding up retrieval at the cost of making inserts and updates a bit slower. In the context of databases this data structure is called an **index**.

For example, we can implement the following indexes:

| Algorithm         | Index                                     | Example                              |
| ----------------- | ----------------------------------------- | ------------------------------------ |
| Phonetic hashing  | B-tree with hashed values                 |                                      |
| set of n-grams    | inverted index with trigrams              | PostgreSQL trigram index             |
| sequence of words | inverted index with words (and positions) | PostgreSQL and MySQL full-text index |

- Those indexes won't help much with edit distance algorithm(s), because edit distance is still quite expensive to compute
- There is the so-called BK-tree index, which works in this case, but it is not well suited for databases; it is more suitable for a fixed dictionary, e.g. for spelling correction

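A toy inverted trigram index can be sketched like this; the `TrigramIndex` class and the candidate-counting query strategy are illustrative assumptions, not how PostgreSQL implements it:

```ruby
# Inverted index: trigram => list of row ids that contain it.
class TrigramIndex
  def initialize
    @postings = Hash.new { |h, k| h[k] = [] }
  end

  def trigrams(str)
    padded = "  #{str.downcase} " # pad so word boundaries produce trigrams
    (0..padded.length - 3).map { |i| padded[i, 3] }.uniq
  end

  def insert(id, value)
    trigrams(value).each { |t| @postings[t] << id }
  end

  # Row ids ordered by the number of trigrams shared with the query.
  def search(query)
    counts = Hash.new(0)
    trigrams(query).each { |t| @postings[t].each { |id| counts[id] += 1 } }
    counts.sort_by { |_id, n| -n }.map(&:first)
  end
end

index = TrigramIndex.new
index.insert(1, "Claire")
index.insert(2, "Clare")
index.insert(3, "Robert")
index.search("Clair") # => [1, 2] - Robert shares no trigrams, so it is skipped
```

The point of the structure: at query time only rows that share at least one trigram with the query are touched, instead of scanning the whole table.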
## Ranking

Theoretically it is possible to use measures for ranking, but the problem is that those functions only take two strings into account. This won't work well for big texts in the DB and small queries, because all measures will be indistinguishable (either very small or very big).

For this case there are better ranking functions, such as:

- TF-IDF
- BM25
- DFR similarity
- IB similarity
- LM Dirichlet similarity
- LM Jelinek Mercer similarity
- etc.
@@ -0,0 +1,172 @@
# Approximate string matching

a.k.a. string proximity search, error-tolerant search, string similarity search, fuzzy string searching, error-tolerant pattern matching.

## Definitions

### Edit distance

**Edit distance** of two strings `dist(x, y)` is the minimal number of edit operations needed to transform the first string (`x`) into the second (`y`).

Different edit distances can have different **edit operations**: deletion, insertion, substitution, transposition.

|                              | deletion | insertion | substitution | transposition |
| ---------------------------- | -------- | --------- | ------------ | ------------- |
| Levenshtein distance         | +        | +         | +            |               |
| Damerau-Levenshtein distance | +        | +         | +            | +             |
| Hamming distance             |          |           | +            |               |
| Longest common subsequence   | +        | +         |              |               |

Distance is a metric. A metric is a function with the following properties:

1. `dist(x, x) = 0` identity
2. `dist(x, y) >= 0` non-negativity
3. `dist(x, y) = dist(y, x)` symmetry
4. `dist(x, y) <= dist(x, z) + dist(z, y)` triangle inequality

Examples of metrics:

- Euclidean distance
- Manhattan distance
- Hamming distance

Different edit operations can have different costs. But if we want to preserve metric properties, deletion and insertion must have the same cost.

```
I N T E - N T I O N
- E X E C U T I O N
d s s   i s
```

- If each operation has a cost of 1, the distance between these is 5.
- If substitutions cost 2 (Levenshtein), the distance between these is 8.

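A quick way to check these two numbers is to parameterize the substitution cost in the dynamic-programming recurrence (a sketch; `edit_distance` is a name chosen here for illustration):

```ruby
# Edit distance with a configurable substitution cost.
# With sub_cost = 1 every operation costs 1; with sub_cost = 2 a
# substitution counts as a deletion plus an insertion.
def edit_distance(x, y, sub_cost: 1)
  d = Array.new(x.length + 1) { |i| Array.new(y.length + 1) { |j| i.zero? ? j : (j.zero? ? i : 0) } }
  (1..x.length).each do |i|
    (1..y.length).each do |j|
      cost = x[i - 1] == y[j - 1] ? 0 : sub_cost
      d[i][j] = [d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost].min
    end
  end
  d[x.length][y.length]
end

edit_distance("INTENTION", "EXECUTION")              # => 5
edit_distance("INTENTION", "EXECUTION", sub_cost: 2) # => 8
```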
In order to determine the minimal number of edit operations we need an optimal sequence alignment:

```
H A N D     H A N D - - - -     H A N D -
A N D I     - - - - A N D I     - A N D I
```

Some distances have more than one algorithm to compute them, for example [Levenshtein distance](https://ceptord.net/20200815-Comparison.html):

- **Wagner-Fischer**. Wagner, Robert A., and Michael J. Fischer. "The string-to-string correction problem." Journal of the ACM 21.1 (1974): 168-173.
- **Myers**. Myers, Gene. "A fast bit-vector algorithm for approximate string matching based on dynamic programming." Journal of the ACM (JACM) 46.3 (1999): 395-415.

Distance-like measures which do not hold metric properties are called **dissimilarities**.

Distance-like measures require sequential data, i.e. they interpret the string as an array of tokens (letters or words).

### String similarity

**Similarity** is a measure of how similar two strings are. For distance, the lower the value, the closer the two strings; for similarity, the higher the value, the closer the two strings.

Similarity is not a metric. Edit distance can be converted to similarity, for example:

```
sim(x, y) = 1 - dist(x, y) / max(len(x), len(y))
```

On the other hand, similarity doesn't have to rely on edit distance. For example, the simplest (but probably not the most effective) measure can be the number of shared symbols in the strings (aka unigrams). Trigrams, though, would give quite a good estimation.

```
dice(x, y) = 2 * len(common(x, y)) / (len(x) + len(y))
```

Normalized similarity is in the range `[0, 1]`, where 0 means the strings are completely different and 1 means the strings are equal.

Similarity often doesn't need alignment, so it is computationally much cheaper to calculate. It can be used to pre-filter a set of strings before applying more expensive edit distance algorithms.

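The Dice formula above, applied to bigram sets, can be sketched like this (`bigrams` is a helper defined here for illustration):

```ruby
# Dice coefficient over bigram sets: 2|x ∩ y| / (|x| + |y|)
def bigrams(str)
  (0..str.length - 2).map { |i| str[i, 2] }.uniq
end

def dice(x, y)
  bx = bigrams(x)
  by = bigrams(y)
  return 0.0 if bx.empty? || by.empty?
  2.0 * (bx & by).length / (bx.length + by.length)
end

dice("night", "nacht") # => 0.25 (only "ht" is shared)
dice("night", "night") # => 1.0
```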
### String is ...

A string can be treated as a sequence of letters, as a sequence of words, or it can be converted to n-grams (unigrams, bigrams, trigrams, etc.)

Classically, edit distance is applied to letters: `abc` vs `acb` - one transposition. But it can also be applied to words: `abc def` vs `def abc` - one transposition.

If we want to work with words, we need a tokenizer, which is language dependent. The most primitive tokenizer for western languages splits the string by all non-alphanumeric characters: `Hello, Joe!` turns into `Hello`, `Joe`. But this approach is quite limited and doesn't work for words like `O'neil`, `aren't`, `T.V.`, `B-52`, compound proper nouns, etc. It also doesn't work for CJK (Chinese, Japanese, Korean).

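The primitive split-based tokenizer and its limitations can be demonstrated in a couple of lines (a sketch, not the gem's tokenizer):

```ruby
# Primitive tokenizer: split on runs of non-alphanumeric characters.
def tokenize(str)
  str.split(/[^[:alnum:]]+/).reject(&:empty?)
end

tokenize("Hello, Joe!") # => ["Hello", "Joe"]
tokenize("aren't")      # => ["aren", "t"] - apostrophes break words
tokenize("B-52")        # => ["B", "52"]  - so do hyphens
```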
### n-grams

[n-grams](https://en.wikipedia.org/wiki/N-gram) is the set of all substrings of length `n` contained in a given string. For example, the bigrams of `abc` are `ab` and `bc`.

n-grams can be padded: we can add special symbol(s) before and after the string to increase the number of n-grams. For example, `bigram-1ab` for `abc` are `#a`, `ab`, `bc`, `c#`.

Sometimes padded n-grams give a much better similarity measure. For example, it has been empirically shown that `trigram-2b` gives much better results for matching drug names with errors. See [Similarity as a risk factor in drug-name confusion errors, B Lambert et al., 1999](https://www.researchgate.net/profile/Sanjay-Gandhi-3/publication/12701019_Similarity_as_a_risk_factor_in_drug-name_confusion_errors_the_look-alike_orthographic_and_sound-alike_phonetic_model/links/0deec51e6f14b979c1000000/Similarity-as-a-risk-factor-in-drug-name-confusion-errors-the-look-alike-orthographic-and-sound-alike-phonetic-model.pdf).

n-grams are language independent (unlike tokenization). n-grams can be used to create indexes in a database; for example, PostgreSQL has a trigram index.

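Padded n-gram generation can be sketched as follows; the per-side padding amount and the `#` pad symbol are choices made here for illustration:

```ruby
# All substrings of length n, optionally padded with "#" on both sides.
def ngrams(str, n, pad: 0)
  padded = "#" * pad + str + "#" * pad
  (0..padded.length - n).map { |i| padded[i, n] }
end

ngrams("abc", 2)         # => ["ab", "bc"]
ngrams("abc", 2, pad: 1) # => ["#a", "ab", "bc", "c#"]
ngrams("abc", 3, pad: 2) # => ["##a", "#ab", "abc", "bc#", "c##"]
```

Note how padding makes the first and last letters participate in more n-grams, which is the "weight" effect mentioned in the other document.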
### Overlap coefficient

The overlap coefficient is a type of similarity which works for sets. In the context of strings it can be applied to n-grams or to a set of tokens (words).

| Name                            | Formula                                             | Comment                                                                                             |
| ------------------------------- | --------------------------------------------------- | --------------------------------------------------------------------------------------------------- |
| Jaccard index                   | \|x ∩ y\| / \|x ∪ y\|                               | Grove Karl Gilbert in 1884, Paul Jaccard, Tanimoto                                                   |
| Dice coefficient                | 2\|x ∩ y\| / (\|x\| + \|y\|)                        | Lee Raymond Dice in 1945, Thorvald Sørensen in 1948. The same as F1 (?)                              |
| Tversky index                   | \|x ∩ y\| / (\|x ∩ y\| + a\|x \\ y\| + b\|y \\ x\|) | Amos Tversky in 1977. If a=b=1, the same as Jaccard index. If a=b=0.5, the same as Dice coefficient  |
| Szymkiewicz–Simpson coefficient | \|x ∩ y\| / min(\|x\|, \|y\|)                       | Sometimes called the overlap coefficient                                                             |

+ ### Relevance and ranking
110
+
111
+ > Relevance is the degree to which something is related or useful to what is happening or being talked about:
112
+ >
113
+ > -- https://dictionary.cambridge.org/dictionary/english/relevance
114
+
115
+ > Relevance is the art of ranking content for a search based on how much that content satisfies the needs of the user and the business. The devil is completely in the details.
116
+ >
117
+ > -- https://livebook.manning.com/book/relevant-search/chapter-1/8
118
+
119
+ Sometimes relevance and ranking considered to be separate steps.
120
+
121
+ Relevance is a binary function which returns `true` or `false` e.g. if document (row in a table) is relevant to the search or not.
122
+
123
+ Ranking is a function which assigns some score to each relevant result in order to bring more relevant results in the top of the list.
124
+
125
+ Relevance and ranking are connected. Sometimes it can be calculated in one step, for example, calculate similarity - similarity would be ranking function, and relevance would be similarity more than some threshold. Sometimes it can be two different steps and two different algorithms.
126
+
### Phonetic indexing

> Index - a list (as of bibliographical information or citations to a body of literature) arranged usually in alphabetical order of some specified datum (such as author, subject, or keyword)
>
> -- https://www.merriam-webster.com/dictionary/index

A phonetic algorithm produces some kind of hash. If the hashes of two different words are the same, we can assume that those words sound the same.

Phonetic algorithms are language dependent and are typically used to compare people's names, which can have different spellings, for example `Claire` and `Clare`.

A phonetic algorithm returns a hash, and two hashes can be compared - this gives us a relevance function. For a ranking function we can use, for example, edit distance or similarity.

| Name                    | Language                                                            | Comment                                                                               |
| ----------------------- | ------------------------------------------------------------------- | ------------------------------------------------------------------------------------- |
| Soundex                 | En                                                                  | Robert C. Russell and Margaret King Odell around 1918                                 |
| Cologne phonetics       | En (optimized to match the German language)                         | Hans Joachim Postel in 1969                                                           |
| NYSIIS                  | En                                                                  | Sometimes called Reverse Soundex. 1970                                                |
| Match rating approach   | En                                                                  | Western Airlines in 1977                                                              |
| Daitch–Mokotoff Soundex | En                                                                  | Adds support for Germanic and Slavic surnames. Gary Mokotoff and Randy Daitch in 1985 |
| Metaphone               | En                                                                  | Lawrence Philips in 1990                                                              |
| Double metaphone        | En (of Slavic, Germanic, Celtic, Greek, Chinese, and other origins) | Lawrence Philips in 2000                                                              |
| Metaphone 3             | En                                                                  | Lawrence Philips in 2009                                                              |
| Caverphone              | En (optimized for accents present in parts of New Zealand)          | David Hood in 2002                                                                    |
| Beider–Morse            | En                                                                  | Improvement over Daitch–Mokotoff Soundex. 2008                                        |

Other variations of Soundex:

- ONCA - The Oxford Name Compression Algorithm
- Phonex
- SoundD

Algorithms for other languages:

- [French Phonetic Algorithms](https://yomguithereal.github.io/talisman/phonetics/french)
- [German Phonetic Algorithms](https://yomguithereal.github.io/talisman/phonetics/german)

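To make the hashing idea concrete, here is a deliberately simplified Soundex sketch: it skips the full vowel-separation and h/w rules of the real algorithm, but is enough to show that `Claire` and `Clare` collide:

```ruby
# Simplified Soundex: keep the first letter, encode the remaining
# consonants as digits, drop adjacent repeats, pad/truncate to 4 chars.
# NOTE: real Soundex also resets the previous code on vowels and has an
# h/w rule; this sketch omits both.
SOUNDEX_CODES = {
  "bfpv" => "1", "cgjkqsxz" => "2", "dt" => "3",
  "l" => "4", "mn" => "5", "r" => "6"
}.freeze

def soundex_code(letter)
  SOUNDEX_CODES.each { |letters, digit| return digit if letters.include?(letter) }
  nil # vowels and h, w, y are not coded
end

def soundex(word)
  letters = word.downcase.scan(/[a-z]/)
  result = letters.first.upcase
  prev = soundex_code(letters.first)
  letters[1..].each do |l|
    c = soundex_code(l)
    result += c if c && c != prev
    prev = c if c
  end
  result.ljust(4, "0")[0, 4]
end

soundex("Claire") # => "C460"
soundex("Clare")  # => "C460"
soundex("Robert") # => "R163"
```

A B-tree index over these hashes then turns "sounds like" into an ordinary equality lookup.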
## Types of search

| Type of search | Is precise? | Name             | Intention                        | Example of data                                             |
| -------------- | ----------- | ---------------- | -------------------------------- | ----------------------------------------------------------- |
| text           | precise     | Substring search | starts/ends with..., contains... | logs, match by part of word, etc.                           |
|                |             | Regexp           | contains pattern                 | logs, match by part of word, etc.                           |
|                | approximate | Phonetic         | sounds like                      | names, emails, words with alternative spelling, etc.        |
|                |             | Orthographic     | looks like                       | drug names, biological species, typos in proper nouns, etc. |
|                |             | Full-text        | relevant to                      | texts in natural language                                   |
| parametric     | precise     | Filter           | filter rows by parameters        | structured data, like RDBMS tables                          |
@@ -0,0 +1,221 @@
# Language design

Ideas and things to take into account when designing the query language

## Challenge

The main challenge is to come up with a query language intuitive enough for non-techy people to use, but powerful enough to expose all advanced features.

There are different types of search, and they require different features:

```mermaid
graph LR
Search --> Parametric --> op1[param = 1, param > 2, etc.]
Search --> s1[Text: single term]
s1 --> op2[Phonetic similarity: names, emails, words with alternate spellings, etc.]
s1 --> op3[Orthographic similarity: drug names, biological species, typos in proper nouns, etc.]
s1 --> op4[Pattern match: logs, match by part of word, etc.]
Search --> s2[Text: multiple terms]
s2 --> op5[Full-text search: text in natural language]
s2 --> op6[Single term search with boolean operations: AND, OR, NOT, grouping]
```

**Note**: No, full-text search is not a universal solution for all types of text search. It is designed for searching natural language texts. But this subject deserves a separate article.

## Rich syntax vs intuitiveness

Implementing a language (parser) is trivial. The problem is that the more advanced the language (the more capabilities it has), the less intuitive it is.

It is less intuitive because you need to deal with:

- precedence (which one has priority: OR or AND?)
- syntax errors, like a missing closing bracket
- the need to escape special chars
- the fact that people may not be aware of the special meaning of an operator
  - for example, it is hard to find CSS properties in Google which start with "-", because the minus is interpreted as "NOT". So you need to quote them in order to search for them

all of which may be counterintuitive without a syntax checker.

## Operators

**Important**: this is not a comparison of features, but rather a comparison of syntax.

|                                  | Meilisearch               | Solr      | Sphinx      | MySQL FT boolean | PostgreSQL FT | GitHub syntax           | SQL                 | Google    |
| -------------------------------- | ------------------------- | --------- | ----------- | ---------------- | ------------- | ----------------------- | ------------------- | --------- |
| **Boolean operators** applies to | parametric                | both      | text        | text             | text          | both                    | parametric          | both?     |
| default operator                 |                           | OR        | AND         | OR               |               | AND                     |                     | and?      |
| not                              | NOT                       | NOT / !   | - / !       | - (kind of)      | ! / NOT / -   | NOT / -                 | NOT                 | -         |
| and                              | AND                       | AND / &&  | no operator | + (kind of)      | & / AND       | no operator             | AND                 | and?      |
| or                               | OR                        | OR / \|\| | \|          | no operator      | \| / OR       |                         | OR                  | \| / or   |
| grouping                         | ()                        | ()        | ()          | ()               | ()            |                         | ()                  | ()        |
| **Text search**                  |                           |           |             |                  |               |                         |                     |           |
| phrase search                    | ""                        | ""        | ""          | ""               | <-> / ""      | ""                      | LIKE "%phrase%"     | ""        |
| proximity phrase search          |                           | ""~N      | ""~N        | ""@N             | < N >         |                         |                     | AROUND(N) |
| priority modifier                |                           | ^N, ^=    | ^N          | > / < / ~        | :N            |                         |                     |           |
| prefix search                    |                           | \*, ?     |             | \*               | :\*           |                         | %, \_               | \*, \_    |
| required                         |                           | +         |             | +                |               |                         |                     | +         |
| prohibited                       |                           | -         |             | -                |               |                         |                     | -         |
| **Parametric search**            |                           |           |             |                  |               |                         |                     |           |
| parameter specifier              | param (in separate query) | param:    | @param      |                  |               | param:                  | param               | param:    |
| comparison                       | >, >=, <, <=, =, !=       | implied = | implied =   |                  |               | >, >=, <, <=, implied = | >, >=, <, <=, =, != |           |
| range                            | n TO n                    | [n TO n]  |             |                  |               | n..n                    | <= AND >=           | n..n      |
| other                            | IN [], EXISTS             |           |             |                  |               |                         | IN [], NOT NULL...  |           |

- [ransack](https://activerecord-hackery.github.io/ransack/getting-started/search-matches/)
- [MySQL Full-Text Search](https://dev.mysql.com/doc/refman/8.0/en/fulltext-boolean.html)
- [PostgreSQL Full-Text Search](https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES)
- [Meilisearch](https://docs.meilisearch.com/learn/advanced/filtering_and_faceted_search.html#using-filters)
- [Solr](https://solr.apache.org/guide/6_6/the-standard-query-parser.html)
- [Lucene](https://lucene.apache.org/core/2_9_4/queryparsersyntax.html) ([Lucene vs Solr](https://www.lucenetutorial.com/lucene-vs-solr.html))
- [Sphinx](https://sphinxsearch.com/docs/current/extended-syntax.html)

+ ### Other search engines
73
+
74
+ - [Manticore Search](https://github.com/manticoresoftware/manticoresearch) is an open-source database that was created in 2017 as a continuation of Sphinx Search engine.
75
+ - [RediSearch](https://github.com/RediSearch/RediSearch)
76
+ - [Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html) and it's alternatives:
77
+ - [sonic](https://github.com/valeriansaliou/sonic)
78
+ - [typesense](https://github.com/typesense/typesense)
79
+ - [zinc](https://github.com/zinclabs/zinc)
80
+ - [Toshi](https://github.com/toshi-search/Toshi)
81
+ - [phalanx](https://github.com/mosuka/phalanx)
82
+ - [pisa](https://github.com/pisa-engine/pisa)
83
+
## Text + parametric search

- Two separate fields - one for text and one for parametric search
  - a separate field for the parametric query would allow autocomplete for parameters
- One field
  - differentiate params with a specific marker like `:` or `@`
    - having the marker in prefix position would allow autocomplete for parameters
  - differentiate by a predefined list of keywords
  - differentiate by one specific marker, for example `text query (params query)`
    - this would allow autocomplete for parameters and boolean operations for parametric search only
- One field with a "mode", i.e. the ability to switch from text search to text + parametric search
- One field with projectional editing
  - this would allow autocomplete for parameters
  - this would allow preventing syntax errors and showing semantic errors (like an unknown param)
  - for text we can use "free" editing, and for params we can use projectional editing which can be triggered by a specific key but will not get into the input
    - https://react-mentions.vercel.app/
    - https://www.npmjs.com/package/react-tag-input
  - this would require a [CST](https://www.cse.chalmers.se/edu/year/2011/course/TIN321/lectures/proglang-02.html) rather than an AST
- Return the parsed query
  - If the query is parsed incorrectly, it will most likely return no results. On this screen we can show the "parsed" query. For this we need a "printer" which converts the CST to HTML

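The `:`-marker option can be sketched in a few lines of Ruby (a hypothetical helper, not the gem's parser): split the query into plain text terms and `param:value` pairs:

```ruby
# Split a query into text terms and param:value filters,
# using ":" as the parameter marker.
def split_query(query)
  terms = []
  params = {}
  query.split(/\s+/).each do |token|
    if token =~ /\A(\w+):(.+)\z/
      params[Regexp.last_match(1)] = Regexp.last_match(2)
    else
      terms << token
    end
  end
  [terms, params]
end

split_query("red shoes size:42 brand:nike")
# => [["red", "shoes"], {"size"=>"42", "brand"=>"nike"}]
```

Even this toy version shows why the prefix marker helps autocomplete: as soon as the user types `size:`, the UI knows a parameter name is being entered.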
## No one-size-fits-all

The initial idea was to implement a universal language. But the more I think about this task, the more I realize there is no one-size-fits-all solution. So instead I can design a language with all possible features, with the ability to turn them on and off.

| feature              | input        | output                                        | enabled |
| -------------------- | ------------ | --------------------------------------------- | ------- |
| parametric search    | param:1      | param = 1                                     | always  |
|                      | param:>1     | param > 1                                     | ?       |
|                      | param:">1"   | param = ">1"                                  | always  |
| phrase quotation     | "a b"        | containing "a b"                              | always  |
| escape quote         | "\""         | containing "\""                               | always  |
| negation             | -param:1     | NOT (param = 1)                               |         |
|                      | param:!=1    | param != 1                                    | ?       |
|                      | -a           | not containing a                              |         |
|                      | - a          | shall we support a space after the minus?     |         |
|                      | -"a b"       | not containing "a b"                          |         |
|                      | not a        | not containing a                              |         |
|                      | not "a b"    | not containing "a b"                          |         |
|                      | not"a b"     | shall we support the absence of a space?      |         |
|                      | not param:1  | not (param = 1)                               |         |
|                      | not -a       | Error?                                        |         |
| grouping             | (a b)        | doesn't make sense without boolean operations |         |
| or                   |              | I will assume that default operator is "and"  |         |
|                      | a\|b         | containing a or b                             |         |
|                      | a \| b       | containing a or b                             |         |
|                      | a\|"b c"     | containing a or "b c"                         |         |
|                      | a or"b c"    | shall we support the absence of a space?      |         |
|                      | a\|b c       | containing (a or b) and c                     |         |
|                      | (a\|b)       | containing a or b                             |         |
|                      | (a\|b) c     | containing (a or b) and c                     |         |
|                      | a or b       | containing a or b                             |         |
|                      | -(a\|b)      | not containing (a or b)                       |         |
|                      | -a\|b        | not containing a or containing b              |         |
|                      | not a\|b     | not containing a or containing b              |         |
|                      | not (a or b) | not containing (a or b)                       |         |
|                      | not(a or b)  | shall we support the absence of a space?      |         |
|                      | a:1 \| b:1   | a = 1 or b = 1                                |         |
|                      | a:1 or b:1   | a = 1 or b = 1                                |         |
|                      | a:1\|2       | a = 1 or a = 2. a IN [1, 2]                   |         |
|                      | a:(1 \| 2)   | a = 1 or a = 2. a IN [1, 2]                   |         |
| and                  |              | I will assume that default operator is "and"  |         |
|                      | a b          | containing a and b                            | always  |
|                      | (a b)        | containing a and b                            |         |
|                      | a:1 b:1      | a = 1 and b = 1                               |         |
|                      | a and b      | containing a and b                            |         |
|                      | a:1 and b:1  | a = 1 and b = 1                               |         |
| escape special chars | \\\|         | containing "\|"                               | never   |
|                      | "\|"         | containing "\|"                               | always  |

## Operator Precedence

- [Microsoft Transact-SQL operator precedence](https://learn.microsoft.com/en-us/sql/t-sql/language-elements/operator-precedence-transact-sql?view=sql-server-2017)
- [Oracle MySQL 9 operator precedence](https://dev.mysql.com/doc/refman/8.0/en/operator-precedence.html)
- [Oracle 10g condition precedence](https://docs.oracle.com/cd/B19306_01/server.102/b14200/conditions001.htm#i1034834)
- [PostgreSQL operator precedence](https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-PRECEDENCE)
- [SQL as understood by SQLite](https://www.sqlite.org/lang_expr.html)

If we assume that "AND" is the default boolean operator, then it should have the lowest precedence.

Option 1:

| Operator      | Associativity | Position |
| ------------- | ------------- | -------- |
| NOT / -       | right         | prefix   |
| OR / \|       | left          | infix    |
| AND / (space) | left          | infix    |

Option 2:

| Operator | Associativity | Position |
| -------- | ------------- | -------- |
| NOT / -  | right         | prefix   |
| AND      | left          | infix    |
| OR / \|  | left          | infix    |
| (space)  | left          | infix    |

181
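The practical difference between the two precedence options shows up in queries mixing `or` and `and`: Option 1 parses `a or b and c` as `(a or b) and c`, Option 2 as `a or (b and c)`. A minimal precedence-climbing sketch (hypothetical, not the gem's parser) that takes the precedence table as a parameter:

```ruby
# Minimal precedence-climbing parser, parameterized by a precedence table,
# to compare the two options. Hypothetical sketch for illustration only;
# it handles "not"/"or"/"and" keywords but not "-", "|", or parens.
def parse(tokens, prec, min_bp = 0)
  lhs = tokens.shift
  lhs = [:not, parse(tokens, prec, prec["not"])] if lhs == "not"
  # keep consuming infix operators whose precedence is high enough
  while (op = tokens.first) && prec.key?(op) && prec[op] >= min_bp
    tokens.shift
    lhs = [op.to_sym, lhs, parse(tokens, prec, prec[op] + 1)] # +1 => left-assoc
  end
  lhs
end

option1 = {"not" => 3, "or" => 2, "and" => 1} # OR binds tighter than AND
option2 = {"not" => 3, "and" => 2, "or" => 1} # AND binds tighter than OR

parse("a or b and c".split, option1) # => [:and, [:or, "a", "b"], "c"]
parse("a or b and c".split, option2) # => [:or, "a", [:and, "b", "c"]]
```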
+ ## Strange cases
+
+ - `"a b`
+   - containing `"a` and `b`
+   - containing `a b`, e.g. as `"a b"`
+ - `(a b`
+   - containing `(a` and `b`
+   - containing `a` and `b`, e.g. as `(a b)`
+ - `not -a`
+   - containing `a`
+   - not containing `-a`
+ - `not not not`
+   - empty query
+   - containing `not`
+ - `|||`
+   - empty query
+   - containing `|||`
+ - `""`
+   - empty query
+ - `param:<>1`
+   - `param = "<>1"`
+ - `param: 1`
+   - containing `param:` and `1`
+   - `param = 1`
+ - `()`
+   - containing `()`
+   - empty query
+ - `--a`
+   - containing `a`
+   - not containing `-a`
+   - containing `--a`
+ - `a -`
+   - containing `a`
+   - containing `a` and `-`
+ - `or a`
+   - containing `a`
+   - containing `or` and `a`
+
+ ## Similar languages
+
+ - [REST Query Language](https://github.com/jirutka/rsql-parser)
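The `ransack_transformer.rb` changes in this diff build Ransack matcher keys via `PREDICATES[node[:predicate]] || :eq`. A self-contained sketch of that mapping is below; the `PREDICATES` table shown here is an assumption for illustration, since the gem's actual table is not part of this diff:

```ruby
# Sketch: turn a parsed param node (e.g. from `price:>1`) into a Ransack
# matcher key, mirroring the `PREDICATES[node[:predicate]] || :eq` fallback
# in name_with_predicate. This PREDICATES table is hypothetical.
PREDICATES = {">" => :gt, ">=" => :gteq, "<" => :lt, "<=" => :lteq, "!=" => :not_eq}

def ransack_key(node)
  predicate = PREDICATES[node[:predicate]] || :eq # unknown predicate => equality
  "#{node[:name]}_#{predicate}".to_sym
end

ransack_key({name: "price", predicate: ">"}) # => :price_gt
ransack_key({name: "title", predicate: nil}) # => :title_eq
```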
@@ -0,0 +1,6 @@
+ # Terminology
+
+ **Parametric search** aka faceted search aka filters - [filter by structured data](https://en.wikipedia.org/wiki/Faceted_search).
+
+ **Approximate search** aka fuzzy search aka approximate string matching - [the technique of finding strings that match a pattern approximately (rather than exactly)](https://en.wikipedia.org/wiki/Approximate_string_matching).
+
@@ -5,7 +5,7 @@ require_relative "ransack_transformer"
  module SearchSyntax
  class Ransack
  # text - symbol. Idea for the future: it could be a callback, to allow manipulating the query for full-text search
- # params - array of strings
+ # params - array of strings; or a hash to rename params
  # sort - string. nil - to disable parsing of the sort param
  def initialize(text:, params:, sort: nil)
  @transformer = RansackTransformer.new(text: text, params: params, sort: sort)
@@ -17,9 +17,19 @@ module SearchSyntax

  def initialize(text:, params:, sort: nil)
  @text = text
+ if params.is_a?(Array)
+ params = params.to_h { |i| [i.to_s.delete(":"), i] }
+ elsif params.is_a?(Hash)
+ params = params.map do |k, v|
+ k = k.to_s
+ skip_predicate = k.include?(":")
+ [k.delete(":"), v + (skip_predicate ? ":" : "")]
+ end.to_h
+ end
  @params = params
+ @allowed_params = params.keys
  @sort = sort
- @spell_checker = DidYouMean::SpellChecker.new(dictionary: @params)
+ @spell_checker = DidYouMean::SpellChecker.new(dictionary: @allowed_params)
  end

  def transform_sort_param(value)
36
46
  errors = []
37
47
  result = {}
38
48
 
39
- if @params.is_a?(Array)
49
+ if @allowed_params.length > 0
40
50
  ast = ast.filter do |node|
41
51
  if node[:type] != :param
42
52
  true
43
53
  elsif node[:name] == @sort
44
54
  result[:s] = transform_sort_param(node[:value])
45
55
  false
46
- elsif @params.include?(node[:name])
47
- predicate = PREDICATES[node[:predicate]] || :eq
48
- key = "#{node[:name]}_#{predicate}".to_sym
56
+ elsif @allowed_params.include?(node[:name])
57
+ key = name_with_predicate(node)
49
58
  if !result.key?(key)
50
59
  result[key] = node[:value]
51
60
  else
@@ -77,5 +86,17 @@ module SearchSyntax

  [result, errors]
  end
+
+ private
+
+ def name_with_predicate(node)
+ name = @params[node[:name]]
+ if name.include?(":")
+ name.delete(":")
+ else
+ predicate = PREDICATES[node[:predicate]] || :eq
+ "#{name}_#{predicate}"
+ end.to_sym
+ end
  end
  end
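The `initialize` change above normalizes `params`: an array becomes an identity hash, while a hash renames query params, with a `:` in a key marking that the mapped value should be used as-is (no predicate suffix appended later). A standalone sketch of that normalization, extracted from the diff for illustration:

```ruby
# Standalone sketch of the params normalization added in this diff,
# mirroring RansackTransformer#initialize. String keys/values assumed.
def normalize_params(params)
  if params.is_a?(Array)
    # array => identity hash, ":" stripped from the lookup key
    params.to_h { |i| [i.to_s.delete(":"), i] }
  elsif params.is_a?(Hash)
    params.map do |k, v|
      k = k.to_s
      skip_predicate = k.include?(":") # ":" marks "skip predicate suffix"
      [k.delete(":"), v + (skip_predicate ? ":" : "")]
    end.to_h
  end
end

normalize_params(["a", "b"])    # => {"a" => "a", "b" => "b"}
normalize_params({"a:" => "x"}) # => {"a" => "x:"}
```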
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  module SearchSyntax
- VERSION = "0.1.1"
+ VERSION = "0.1.3"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: search_syntax
  version: !ruby/object:Gem::Version
- version: 0.1.1
+ version: 0.1.3
  platform: ruby
  authors:
  - stereobooster
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2022-10-18 00:00:00.000000000 Z
+ date: 2022-10-24 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: treetop
@@ -39,6 +39,10 @@ files:
  - LICENSE.txt
  - README.md
  - Rakefile
+ - docs/approximate-string-matching-algorithms.md
+ - docs/approximate-string-matching.md
+ - docs/language-design.md
+ - docs/terminology.md
  - lib/search_syntax.rb
  - lib/search_syntax/errors.rb
  - lib/search_syntax/parser.rb