queryparser 1.0.0

Files changed (6)
  1. data/COPYING +340 -0
  2. data/COPYRIGHT +18 -0
  3. data/README +14 -0
  4. data/Rakefile +33 -0
  5. data/lib/queryparser.rb +714 -0
  6. metadata +61 -0
data/COPYING ADDED
@@ -0,0 +1,340 @@
1
+ GNU GENERAL PUBLIC LICENSE
2
+ Version 2, June 1991
3
+
4
+ Copyright (C) 1989, 1991 Free Software Foundation, Inc.
5
+ 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
6
+ Everyone is permitted to copy and distribute verbatim copies
7
+ of this license document, but changing it is not allowed.
8
+
9
+ Preamble
10
+
11
+ The licenses for most software are designed to take away your
12
+ freedom to share and change it. By contrast, the GNU General Public
13
+ License is intended to guarantee your freedom to share and change free
14
+ software--to make sure the software is free for all its users. This
15
+ General Public License applies to most of the Free Software
16
+ Foundation's software and to any other program whose authors commit to
17
+ using it. (Some other Free Software Foundation software is covered by
18
+ the GNU Library General Public License instead.) You can apply it to
19
+ your programs, too.
20
+
21
+ When we speak of free software, we are referring to freedom, not
22
+ price. Our General Public Licenses are designed to make sure that you
23
+ have the freedom to distribute copies of free software (and charge for
24
+ this service if you wish), that you receive source code or can get it
25
+ if you want it, that you can change the software or use pieces of it
26
+ in new free programs; and that you know you can do these things.
27
+
28
+ To protect your rights, we need to make restrictions that forbid
29
+ anyone to deny you these rights or to ask you to surrender the rights.
30
+ These restrictions translate to certain responsibilities for you if you
31
+ distribute copies of the software, or if you modify it.
32
+
33
+ For example, if you distribute copies of such a program, whether
34
+ gratis or for a fee, you must give the recipients all the rights that
35
+ you have. You must make sure that they, too, receive or can get the
36
+ source code. And you must show them these terms so they know their
37
+ rights.
38
+
39
+ We protect your rights with two steps: (1) copyright the software, and
40
+ (2) offer you this license which gives you legal permission to copy,
41
+ distribute and/or modify the software.
42
+
43
+ Also, for each author's protection and ours, we want to make certain
44
+ that everyone understands that there is no warranty for this free
45
+ software. If the software is modified by someone else and passed on, we
46
+ want its recipients to know that what they have is not the original, so
47
+ that any problems introduced by others will not reflect on the original
48
+ authors' reputations.
49
+
50
+ Finally, any free program is threatened constantly by software
51
+ patents. We wish to avoid the danger that redistributors of a free
52
+ program will individually obtain patent licenses, in effect making the
53
+ program proprietary. To prevent this, we have made it clear that any
54
+ patent must be licensed for everyone's free use or not licensed at all.
55
+
56
+ The precise terms and conditions for copying, distribution and
57
+ modification follow.
58
+
59
+ GNU GENERAL PUBLIC LICENSE
60
+ TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
61
+
62
+ 0. This License applies to any program or other work which contains
63
+ a notice placed by the copyright holder saying it may be distributed
64
+ under the terms of this General Public License. The "Program", below,
65
+ refers to any such program or work, and a "work based on the Program"
66
+ means either the Program or any derivative work under copyright law:
67
+ that is to say, a work containing the Program or a portion of it,
68
+ either verbatim or with modifications and/or translated into another
69
+ language. (Hereinafter, translation is included without limitation in
70
+ the term "modification".) Each licensee is addressed as "you".
71
+
72
+ Activities other than copying, distribution and modification are not
73
+ covered by this License; they are outside its scope. The act of
74
+ running the Program is not restricted, and the output from the Program
75
+ is covered only if its contents constitute a work based on the
76
+ Program (independent of having been made by running the Program).
77
+ Whether that is true depends on what the Program does.
78
+
79
+ 1. You may copy and distribute verbatim copies of the Program's
80
+ source code as you receive it, in any medium, provided that you
81
+ conspicuously and appropriately publish on each copy an appropriate
82
+ copyright notice and disclaimer of warranty; keep intact all the
83
+ notices that refer to this License and to the absence of any warranty;
84
+ and give any other recipients of the Program a copy of this License
85
+ along with the Program.
86
+
87
+ You may charge a fee for the physical act of transferring a copy, and
88
+ you may at your option offer warranty protection in exchange for a fee.
89
+
90
+ 2. You may modify your copy or copies of the Program or any portion
91
+ of it, thus forming a work based on the Program, and copy and
92
+ distribute such modifications or work under the terms of Section 1
93
+ above, provided that you also meet all of these conditions:
94
+
95
+ a) You must cause the modified files to carry prominent notices
96
+ stating that you changed the files and the date of any change.
97
+
98
+ b) You must cause any work that you distribute or publish, that in
99
+ whole or in part contains or is derived from the Program or any
100
+ part thereof, to be licensed as a whole at no charge to all third
101
+ parties under the terms of this License.
102
+
103
+ c) If the modified program normally reads commands interactively
104
+ when run, you must cause it, when started running for such
105
+ interactive use in the most ordinary way, to print or display an
106
+ announcement including an appropriate copyright notice and a
107
+ notice that there is no warranty (or else, saying that you provide
108
+ a warranty) and that users may redistribute the program under
109
+ these conditions, and telling the user how to view a copy of this
110
+ License. (Exception: if the Program itself is interactive but
111
+ does not normally print such an announcement, your work based on
112
+ the Program is not required to print an announcement.)
113
+
114
+ These requirements apply to the modified work as a whole. If
115
+ identifiable sections of that work are not derived from the Program,
116
+ and can be reasonably considered independent and separate works in
117
+ themselves, then this License, and its terms, do not apply to those
118
+ sections when you distribute them as separate works. But when you
119
+ distribute the same sections as part of a whole which is a work based
120
+ on the Program, the distribution of the whole must be on the terms of
121
+ this License, whose permissions for other licensees extend to the
122
+ entire whole, and thus to each and every part regardless of who wrote it.
123
+
124
+ Thus, it is not the intent of this section to claim rights or contest
125
+ your rights to work written entirely by you; rather, the intent is to
126
+ exercise the right to control the distribution of derivative or
127
+ collective works based on the Program.
128
+
129
+ In addition, mere aggregation of another work not based on the Program
130
+ with the Program (or with a work based on the Program) on a volume of
131
+ a storage or distribution medium does not bring the other work under
132
+ the scope of this License.
133
+
134
+ 3. You may copy and distribute the Program (or a work based on it,
135
+ under Section 2) in object code or executable form under the terms of
136
+ Sections 1 and 2 above provided that you also do one of the following:
137
+
138
+ a) Accompany it with the complete corresponding machine-readable
139
+ source code, which must be distributed under the terms of Sections
140
+ 1 and 2 above on a medium customarily used for software interchange; or,
141
+
142
+ b) Accompany it with a written offer, valid for at least three
143
+ years, to give any third party, for a charge no more than your
144
+ cost of physically performing source distribution, a complete
145
+ machine-readable copy of the corresponding source code, to be
146
+ distributed under the terms of Sections 1 and 2 above on a medium
147
+ customarily used for software interchange; or,
148
+
149
+ c) Accompany it with the information you received as to the offer
150
+ to distribute corresponding source code. (This alternative is
151
+ allowed only for noncommercial distribution and only if you
152
+ received the program in object code or executable form with such
153
+ an offer, in accord with Subsection b above.)
154
+
155
+ The source code for a work means the preferred form of the work for
156
+ making modifications to it. For an executable work, complete source
157
+ code means all the source code for all modules it contains, plus any
158
+ associated interface definition files, plus the scripts used to
159
+ control compilation and installation of the executable. However, as a
160
+ special exception, the source code distributed need not include
161
+ anything that is normally distributed (in either source or binary
162
+ form) with the major components (compiler, kernel, and so on) of the
163
+ operating system on which the executable runs, unless that component
164
+ itself accompanies the executable.
165
+
166
+ If distribution of executable or object code is made by offering
167
+ access to copy from a designated place, then offering equivalent
168
+ access to copy the source code from the same place counts as
169
+ distribution of the source code, even though third parties are not
170
+ compelled to copy the source along with the object code.
171
+
172
+ 4. You may not copy, modify, sublicense, or distribute the Program
173
+ except as expressly provided under this License. Any attempt
174
+ otherwise to copy, modify, sublicense or distribute the Program is
175
+ void, and will automatically terminate your rights under this License.
176
+ However, parties who have received copies, or rights, from you under
177
+ this License will not have their licenses terminated so long as such
178
+ parties remain in full compliance.
179
+
180
+ 5. You are not required to accept this License, since you have not
181
+ signed it. However, nothing else grants you permission to modify or
182
+ distribute the Program or its derivative works. These actions are
183
+ prohibited by law if you do not accept this License. Therefore, by
184
+ modifying or distributing the Program (or any work based on the
185
+ Program), you indicate your acceptance of this License to do so, and
186
+ all its terms and conditions for copying, distributing or modifying
187
+ the Program or works based on it.
188
+
189
+ 6. Each time you redistribute the Program (or any work based on the
190
+ Program), the recipient automatically receives a license from the
191
+ original licensor to copy, distribute or modify the Program subject to
192
+ these terms and conditions. You may not impose any further
193
+ restrictions on the recipients' exercise of the rights granted herein.
194
+ You are not responsible for enforcing compliance by third parties to
195
+ this License.
196
+
197
+ 7. If, as a consequence of a court judgment or allegation of patent
198
+ infringement or for any other reason (not limited to patent issues),
199
+ conditions are imposed on you (whether by court order, agreement or
200
+ otherwise) that contradict the conditions of this License, they do not
201
+ excuse you from the conditions of this License. If you cannot
202
+ distribute so as to satisfy simultaneously your obligations under this
203
+ License and any other pertinent obligations, then as a consequence you
204
+ may not distribute the Program at all. For example, if a patent
205
+ license would not permit royalty-free redistribution of the Program by
206
+ all those who receive copies directly or indirectly through you, then
207
+ the only way you could satisfy both it and this License would be to
208
+ refrain entirely from distribution of the Program.
209
+
210
+ If any portion of this section is held invalid or unenforceable under
211
+ any particular circumstance, the balance of the section is intended to
212
+ apply and the section as a whole is intended to apply in other
213
+ circumstances.
214
+
215
+ It is not the purpose of this section to induce you to infringe any
216
+ patents or other property right claims or to contest validity of any
217
+ such claims; this section has the sole purpose of protecting the
218
+ integrity of the free software distribution system, which is
219
+ implemented by public license practices. Many people have made
220
+ generous contributions to the wide range of software distributed
221
+ through that system in reliance on consistent application of that
222
+ system; it is up to the author/donor to decide if he or she is willing
223
+ to distribute software through any other system and a licensee cannot
224
+ impose that choice.
225
+
226
+ This section is intended to make thoroughly clear what is believed to
227
+ be a consequence of the rest of this License.
228
+
229
+ 8. If the distribution and/or use of the Program is restricted in
230
+ certain countries either by patents or by copyrighted interfaces, the
231
+ original copyright holder who places the Program under this License
232
+ may add an explicit geographical distribution limitation excluding
233
+ those countries, so that distribution is permitted only in or among
234
+ countries not thus excluded. In such case, this License incorporates
235
+ the limitation as if written in the body of this License.
236
+
237
+ 9. The Free Software Foundation may publish revised and/or new versions
238
+ of the General Public License from time to time. Such new versions will
239
+ be similar in spirit to the present version, but may differ in detail to
240
+ address new problems or concerns.
241
+
242
+ Each version is given a distinguishing version number. If the Program
243
+ specifies a version number of this License which applies to it and "any
244
+ later version", you have the option of following the terms and conditions
245
+ either of that version or of any later version published by the Free
246
+ Software Foundation. If the Program does not specify a version number of
247
+ this License, you may choose any version ever published by the Free Software
248
+ Foundation.
249
+
250
+ 10. If you wish to incorporate parts of the Program into other free
251
+ programs whose distribution conditions are different, write to the author
252
+ to ask for permission. For software which is copyrighted by the Free
253
+ Software Foundation, write to the Free Software Foundation; we sometimes
254
+ make exceptions for this. Our decision will be guided by the two goals
255
+ of preserving the free status of all derivatives of our free software and
256
+ of promoting the sharing and reuse of software generally.
257
+
258
+ NO WARRANTY
259
+
260
+ 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
261
+ FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
262
+ OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
263
+ PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
264
+ OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
265
+ MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
266
+ TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
267
+ PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
268
+ REPAIR OR CORRECTION.
269
+
270
+ 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
271
+ WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
272
+ REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
273
+ INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
274
+ OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
275
+ TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
276
+ YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
277
+ PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
278
+ POSSIBILITY OF SUCH DAMAGES.
279
+
280
+ END OF TERMS AND CONDITIONS
281
+
282
+ How to Apply These Terms to Your New Programs
283
+
284
+ If you develop a new program, and you want it to be of the greatest
285
+ possible use to the public, the best way to achieve this is to make it
286
+ free software which everyone can redistribute and change under these terms.
287
+
288
+ To do so, attach the following notices to the program. It is safest
289
+ to attach them to the start of each source file to most effectively
290
+ convey the exclusion of warranty; and each file should have at least
291
+ the "copyright" line and a pointer to where the full notice is found.
292
+
293
+ <one line to give the program's name and a brief idea of what it does.>
294
+ Copyright (C) <year> <name of author>
295
+
296
+ This program is free software; you can redistribute it and/or modify
297
+ it under the terms of the GNU General Public License as published by
298
+ the Free Software Foundation; either version 2 of the License, or
299
+ (at your option) any later version.
300
+
301
+ This program is distributed in the hope that it will be useful,
302
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
303
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
304
+ GNU General Public License for more details.
305
+
306
+ You should have received a copy of the GNU General Public License
307
+ along with this program; if not, write to the Free Software
308
+ Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
309
+
310
+
311
+ Also add information on how to contact you by electronic and paper mail.
312
+
313
+ If the program is interactive, make it output a short notice like this
314
+ when it starts in an interactive mode:
315
+
316
+ Gnomovision version 69, Copyright (C) year name of author
317
+ Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
318
+ This is free software, and you are welcome to redistribute it
319
+ under certain conditions; type `show c' for details.
320
+
321
+ The hypothetical commands `show w' and `show c' should show the appropriate
322
+ parts of the General Public License. Of course, the commands you use may
323
+ be called something other than `show w' and `show c'; they could even be
324
+ mouse-clicks or menu items--whatever suits your program.
325
+
326
+ You should also get your employer (if you work as a programmer) or your
327
+ school, if any, to sign a "copyright disclaimer" for the program, if
328
+ necessary. Here is a sample; alter the names:
329
+
330
+ Yoyodyne, Inc., hereby disclaims all copyright interest in the program
331
+ `Gnomovision' (which makes passes at compilers) written by James Hacker.
332
+
333
+ <signature of Ty Coon>, 1 April 1989
334
+ Ty Coon, President of Vice
335
+
336
+ This General Public License does not permit incorporating your program into
337
+ proprietary programs. If your program is a subroutine library, you may
338
+ consider it more useful to permit linking proprietary applications with the
339
+ library. If this is what you want to do, use the GNU Library General
340
+ Public License instead of this License.
data/COPYRIGHT ADDED
@@ -0,0 +1,18 @@
1
+
2
+ QueryParser - Parse a plain language query into Lucene syntax
3
+ Copyright (C) 2008 Peter Hickman
4
+
5
+ This program is free software; you can redistribute it and/or modify
6
+ it under the terms of the GNU General Public License as published by
7
+ the Free Software Foundation; either version 2 of the License, or
8
+ (at your option) any later version.
9
+
10
+ This program is distributed in the hope that it will be useful,
11
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13
+ GNU General Public License for more details.
14
+
15
+ You should have received a copy of the GNU General Public License
16
+ along with this program; if not, write to the Free Software
17
+ Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
18
+
data/README ADDED
@@ -0,0 +1,14 @@
1
+
2
+ QueryParser - Parse a plain language query into Lucene syntax
3
+ Copyright (C) 2008 Peter Hickman
4
+
5
+ Parse a plain language query in the format
6
+
7
+ apple not banana
8
+
9
+ apple or banana and fig or date
10
+
11
+ and change them into a format suitable for use with Lucene or Solr.
12
+ The plain text query can include braces to group things and the Lucene
13
+ query can include both a similarity value and field boosting
14
+
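The README mentions a similarity value and field boosting, and the doc comments in lib/queryparser.rb show output such as `title:apple~0.6^10`. As a rough sketch of how those suffixes compose into a single Lucene term, assuming a hypothetical helper named `lucene_term` that is not part of the gem:

```ruby
# Hypothetical helper for illustration only; it is not part of the queryparser gem.
# similarity is a suffix such as '~0.6', boost a suffix such as '^10'.
# A nil suffix interpolates as an empty string, so it simply disappears.
def lucene_term(field, term, similarity = nil, boost = nil)
  "#{field}:#{term}#{similarity}#{boost}"
end

puts lucene_term('content', 'apple')
puts lucene_term('title', 'apple', '~0.6', '^10')
```

The similarity suffix comes before the boost, matching the order in which the gem joins them (`[@similarity, v].join('')`).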
data/Rakefile ADDED
@@ -0,0 +1,33 @@
1
+ # -*- ruby -*-
2
+
3
+ require 'rubygems'
4
+ require 'rake/gempackagetask'
5
+
6
+ $:.push 'lib'
7
+ require 'queryparser'
8
+
9
+ PKG_NAME = 'queryparser'
10
+ PKG_VERSION = QueryParser::VERSION
11
+
12
+ spec = Gem::Specification.new do |s|
13
+ s.name = PKG_NAME
14
+ s.version = PKG_VERSION
15
+ s.summary = 'Parse a natural language query into lucene query syntax'
16
+
17
+ s.files = FileList['README', 'COPY*', 'Rakefile', 'lib/**/*.rb']
18
+ s.test_files = FileList['test/*.rb']
19
+
20
+ s.has_rdoc = true
21
+ s.rdoc_options << '--title' << 'QueryParser' << '--charset' << 'utf-8'
22
+ s.extra_rdoc_files = FileList['README', 'COPYING']
23
+
24
+ s.author = 'Peter Hickman'
25
+ s.email = 'peterhi@ntlworld.com'
26
+
27
+ s.homepage = 'queryparser.rubyforge.org'
28
+ s.rubyforge_project = 'queryparser'
29
+ end
30
+
31
+ Rake::GemPackageTask.new(spec) do |pkg|
32
+ pkg.need_tar = true
33
+ end
data/lib/queryparser.rb ADDED
@@ -0,0 +1,714 @@
1
+ # Takes a query in plain English and turns it into a string
2
+ # suitable for passing to Lucene or Solr.
3
+ #
4
+ # Assuming a Lucene / Solr database that has the body of the
5
+ # data in the +content+ field with the entry heading in a
6
+ # +title+ field, sub headings in a +subheading+ field
7
+ #
8
+ # p = QueryParser.new('content')
9
+ # l = p.parse("apple")
10
+ # => "content:apple"
11
+ #
12
+ # l = p.parse("apple and banana")
13
+ # => "+(+content:apple +content:banana)"
14
+ #
15
+ # l = p.parse('apple not banana or cherry')
16
+ # => "+((+content:apple -content:banana) content:cherry)"
17
+ #
18
+ # Here we boost the score of those queries that also match the
19
+ # title field of the document
20
+ #
21
+ # p = QueryParser.new("content", nil, 'title' => '^10')
22
+ # l = p.parse("apple")
23
+ # => "content:apple title:apple^10"
24
+ #
25
+ # Now with an extra boosting for subheadings
26
+ #
27
+ # p = QueryParser.new("content", nil, 'title' => '^10', 'subheading' => '^5')
28
+ # l = p.parse("apple")
29
+ # => "content:apple title:apple^10 subheading:apple^5"
30
+ #
31
+ # We can also change the similarity of the match. In Lucene terms
32
+ # a similarity of 1.0 will mean that 'banana' will only match 'banana'.
33
+ # However a similarity of 0.6 (entered as ~0.6) will allow 'banana' to
34
+ # match 'canada' which is only two letters different. The default similarity
35
+ # in Lucene is 0.5.
36
+ #
37
+ # p = QueryParser.new("content", '~0.6', 'title' => '^10')
38
+ # l = p.parse("apple not banana")
39
+ # => "+(+content:apple~0.6 -content:banana~0.6) title:apple~0.6^10"
40
+
41
+ class QueryParser
42
+ VERSION = '1.0.0'
43
+
44
+ def initialize(field, similarity = nil, boosts = {})
45
+ @field = field
46
+ @similarity = similarity
47
+ @boosts = boosts
48
+ end
49
+
50
+ # Takes a plain English query and converts it into a string
51
+ # that can be fed into Lucene or Solr. It will apply the
52
+ # similarity and boostings set in the constructor.
53
+ def parse(text)
54
+ a = tokenise(text)
55
+ b = expand(a)
56
+ check_braces(b)
57
+ has_content(b)
58
+
59
+ c = add_implicit_and(b)
60
+
61
+ d = maketree(c)
62
+ if d.class != Array then
63
+ d = [d]
64
+ end
65
+
66
+ f = process_not(d)
67
+ g = process_and_or(f, 'and')
68
+ h = process_and_or(g, 'or')
69
+
70
+ # Wrap everything in an and
71
+ s = QueryParser::And.new
72
+ s.add(h)
73
+
74
+ t = reduce(s)
75
+
76
+ b = QueryParser::Or.new
77
+ b.add(t.boostable())
78
+
79
+ a = Array.new
80
+ x = t.lucene(@field, @similarity)
81
+ if x[0].chr == '(' then
82
+ x = "+#{x}"
83
+ end
84
+ a << x
85
+
86
+ @boosts.each_pair do |k, v|
87
+ x = [@similarity, v].join('')
88
+ a << b.lucene(k,x)
89
+ end
90
+
91
+ return a.join(' ')
92
+ end
93
+
94
+ private
95
+
96
+ # Split the string into tokens based on whitespace unless it is
97
+ # enclosed in ' or ". Initially we classify everything as either
98
+ # a term or quoted.
99
+ #
100
+ # The input is a text string and the output a flat list of terms
101
+ def tokenise(text)
102
+ r = Array.new()
103
+
104
+ delimiter = ''
105
+ token = ''
106
+
107
+ text.split("").each do |char|
108
+ if delimiter == '' then
109
+ if char == '"' or char == "'" then
110
+ token = remove_punctuation(token)
111
+ r << QueryParser::Term.new(token) if token != ''
112
+ delimiter = char.dup
113
+ token = char.dup
114
+ elsif char == " " then
115
+ token = remove_punctuation(token)
116
+ r << QueryParser::Term.new(token) if token != ''
117
+ token = ''
118
+ else
119
+ token << char.dup
120
+ end
121
+ elsif delimiter == char then
122
+ token << char.dup
123
+ token = remove_punctuation(token)
124
+ r << QueryParser::Term.new(token) if token != ''
125
+ token = ''
126
+ delimiter = ''
127
+ else
128
+ token << char.dup
129
+ end
130
+ end
131
+
132
+ token = remove_punctuation(token)
133
+ r << QueryParser::Term.new(token) if token != ''
134
+
135
+ return r
136
+ end
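The quote-aware splitting performed by `tokenise` can be sketched in isolation. This is a simplified standalone version that works on plain strings: the name `split_tokens` is hypothetical, it returns strings rather than `QueryParser::Term` objects, and it skips the punctuation clean-up the gem applies.

```ruby
# Hypothetical standalone helper, not part of the gem.
# Splits on spaces, but keeps anything between matching ' or " as one token.
def split_tokens(text)
  tokens = []
  token = ''.dup      # unfrozen buffer for the token being built
  delimiter = nil     # the quote character we are inside, if any

  text.each_char do |char|
    if delimiter.nil?
      if char == '"' || char == "'"
        tokens << token unless token.empty?
        delimiter = char
        token = char.dup
      elsif char == ' '
        tokens << token unless token.empty?
        token = ''.dup
      else
        token << char
      end
    else
      token << char
      if char == delimiter
        # Closing quote: the quoted token is complete
        tokens << token
        token = ''.dup
        delimiter = nil
      end
    end
  end

  tokens << token unless token.empty?
  tokens
end

p split_tokens(%q(apple "green banana" fig))
```

An unterminated quote is not treated as an error here; whatever has been collected is emitted as the final token, which mirrors the trailing flush at the end of `tokenise`.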
137
+
138
+ # Terms keep only letters, digits, ( and ). Every other character is replaced by a space
139
+ def remove_punctuation(a)
140
+ if a == '' then
141
+ return a
142
+ end
143
+
144
+ first = a[0].chr
145
+ last = a[-1].chr
146
+
147
+ quoted = false
148
+ if first == '"' or first == "'" then
149
+ if first == last then
150
+ quoted = true
151
+ end
152
+ end
153
+
154
+ b = a.gsub(/[^[:alnum:]()]/,' ')
155
+ c = b.gsub(/\s+/, ' ').strip
156
+
157
+ if quoted then
158
+ return ['"', c, '"'].join('')
159
+ else
160
+ return c
161
+ end
162
+ end
163
+
164
+ # If any terms have '(' or ')' in them then split them apart and re-tokenise
165
+ #
166
+ # The input is a list of terms, the output is a (possibly longer) list of terms
167
+ def expand(a)
168
+ r = Array.new
169
+
170
+ a.each do |i|
171
+ if i.type == 'term' and (i.data.index("(") or i.data.index(")")) then
172
+ x = i.data.gsub("(", " ( ").gsub(")", " ) ")
173
+ r << tokenise(x)
174
+ else
175
+ r << i
176
+ end
177
+ end
178
+
179
+ return r.flatten
180
+ end
181
+
182
+ # Create nested lists around the 'open' and 'close' ops
183
+ #
184
+ # The input is a list of terms, the output is a list of terms and lists of the same
185
+ def maketree(a)
186
+ r = Array.new
187
+
188
+ while x = a.shift do
189
+ case x.type
190
+ when "open"
191
+ y = maketree(a)
192
+ if y.size == 1 then
193
+ r << y[0]
194
+ elsif y.size > 1 then
195
+ r << y
196
+ end
197
+ when "close"
198
+ return r
199
+ else
200
+ r << x
201
+ end
202
+ end
203
+
204
+ if r.size == 1 then
205
+ return r[0]
206
+ else
207
+ return r
208
+ end
209
+ end
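The recursive nesting that `maketree` performs can be illustrated with plain strings. This is a minimal sketch under assumptions: the name `nest` is hypothetical, tokens are bare strings rather than `Term` objects, and unlike the gem it does not collapse single-element groups.

```ruby
# Hypothetical standalone helper, not part of the gem.
# Consumes a flat token list and returns nested arrays, one per ( ... ) group.
# Note that it empties the array passed in, as maketree does with its input.
def nest(tokens)
  result = []

  while (token = tokens.shift)
    case token
    when '('
      result << nest(tokens)   # recurse until the matching ')'
    when ')'
      return result            # close the current group
    else
      result << token
    end
  end

  result
end

p nest(%w[apple and ( banana or cherry )])
```

Because the recursion simply returns at each `)`, it relies on the braces having been checked for balance beforehand, which is why `parse` calls `check_braces` first.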
210
+
211
+ # Add the implicit 'and' after a term that is not itself an op
212
+ #
213
+ # The input is a list of terms and lists of same, the output is a (possibly longer) list of terms
214
+ def add_implicit_and(a)
215
+ r = Array.new
216
+
217
+ a.each do |i|
218
+ if r.size > 0 then
219
+ if previous_type(r.last) then
220
+ if current_type(i) then
221
+ r << QueryParser::Term.new('and')
222
+ end
223
+ else
224
+ if not current_type(i) then
225
+ raise QueryParser::Exceptions::MalformedQuery
226
+ end
227
+ end
228
+ end
229
+
230
+ r << i
231
+ end
232
+
233
+ if r.last.type == 'op' then
234
+ raise QueryParser::Exceptions::MalformedQuery
235
+ end
236
+
237
+ return r
238
+ end
239
+
240
+ # All these behave the same for adding an 'and'
241
+ def previous_type(i)
242
+ return (i.type == 'term' or i.type == 'close')
243
+ end
244
+
245
+ def current_type(i)
246
+ return (i.type == 'term' or i.type == 'open' or i.data == 'not')
247
+ end
248
+
249
+ # The not picks up the term to its right
250
+ #
251
+ # The 'Not' op terms in the list are converted into Not objects
252
+ def process_not(a)
253
+ r = Array.new
254
+
255
+ # So we can handle a 'not not not apple' and the like
256
+ b = a.reverse
257
+
258
+ b.each do |i|
259
+ if i.class == Array then
260
+ i = process_not(i)
261
+ end
262
+
263
+ if i.class == QueryParser::Term and i.type == 'op' and i.data == 'not' then
264
+ if r.size == 0 then
265
+ raise QueryParser::Exceptions::MalformedQuery
266
+ else
267
+ x = QueryParser::Not.new(r.pop)
268
+ r << x
269
+ end
270
+ else
271
+ r << i
272
+ end
273
+ end
274
+
275
+ return r.reverse
276
+ end
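Scanning the list in reverse is what lets a chain such as `not not apple` bind correctly: each `not` wraps whatever was most recently collected. A minimal sketch of that idea on plain strings, where the name `bind_not` and the `[:not, operand]` pair are illustrative stand-ins for `QueryParser::Not`:

```ruby
# Hypothetical standalone helper, not part of the gem.
# Walks the tokens right to left so each 'not' wraps the item to its right.
def bind_not(tokens)
  result = []

  tokens.reverse_each do |token|
    if token == 'not'
      raise ArgumentError, 'nothing follows "not"' if result.empty?
      result << [:not, result.pop]
    else
      result << token
    end
  end

  result.reverse
end

p bind_not(%w[apple and not not banana])
```

A trailing `not` has nothing to wrap, so the sketch raises `ArgumentError` where the gem raises `MalformedQuery`.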
277
+
278
+ # Find all the 'and' and 'or' op terms and convert them into And and Or objects
279
+ def process_and_or(a, type)
280
+ # make sure that it is in an array
281
+ if a.class != Array then
282
+ a = [a]
283
+ end
284
+
285
+ r = Array.new
286
+
287
+ has_op = false
288
+ s = nil
289
+
290
+ a.each do |i|
291
+ # First recurse into each element
292
+ if i.class == Array then
293
+ x = process_and_or(i, type)
294
+ if x.class == Array and x.size == 1 then
295
+ i = x.first
296
+ else
297
+ i = x
298
+ end
299
+ elsif i.class == QueryParser::Not then
300
+ x = process_and_or(i.contents, type)
301
+ if x.class == Array and x.size == 1 then
302
+ x = x.first
303
+ end
304
+ i = QueryParser::Not.new(x)
305
+ elsif i.class == QueryParser::And then
306
+ x = process_and_or(i.contents, type)
307
+ i = QueryParser::And.new()
308
+ i.add(x)
309
+ elsif i.class == QueryParser::Or then
310
+ x = process_and_or(i.contents, type)
311
+ i = QueryParser::Or.new()
312
+ i.add(x)
313
+ end
314
+
315
+ if has_op == true then
316
+ s.add(i)
317
+ r << s
318
+ s = nil
319
+ has_op = false
320
+ elsif i.class == QueryParser::Term and i.type == 'op' and i.data == type then
321
+ has_op = true
322
+ if i.data == 'and' then
323
+ s = QueryParser::And.new
324
+ else
325
+ s = QueryParser::Or.new
326
+ end
327
+
328
+ if r.size == 0 then
329
+ raise QueryParser::Exceptions::MalformedQuery
330
+ else
331
+ s.add(r.pop)
332
+ end
333
+ else
334
+ r << i
335
+ end
336
+ end
337
+
338
+ if r.size == 1 then
339
+ return r[0]
340
+ else
341
+ return r
342
+ end
343
+ end
344
+
345
+ # Reduce the sets down
346
+ def reduce(a)
347
+ process = true
348
+
349
+ while process do
350
+ a = a.reduce
351
+ if a.reduced? == false then
352
+ process = false
353
+ end
354
+ end
355
+
356
+ return a
357
+ end
358
+
359
+ # Check that the "(" and ")" are balanced
360
+ def check_braces(a)
361
+ counter = 0
362
+
363
+ a.each do |i|
364
+ if i.type == 'open' then
365
+ counter += 1
366
+ elsif i.type == 'close' then
367
+ counter -= 1
368
+ if counter < 0 then
369
+ raise QueryParser::Exceptions::UnbalancedBraces
370
+ end
371
+ end
372
+ end
373
+
374
+ if counter != 0 then
375
+ raise QueryParser::Exceptions::UnbalancedBraces
376
+ end
377
+ end
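The counter technique in `check_braces` can be shown on its own. This sketch assumes plain string tokens and a hypothetical predicate `balanced?` that returns a boolean instead of raising `UnbalancedBraces`:

```ruby
# Hypothetical standalone helper, not part of the gem.
# Returns true when every '(' has a later matching ')'.
def balanced?(tokens)
  depth = 0

  tokens.each do |token|
    depth += 1 if token == '('
    depth -= 1 if token == ')'
    return false if depth < 0   # a ')' appeared before its '('
  end

  depth.zero?                   # any leftover '(' means unbalanced
end

p balanced?(%w[( apple or banana )])
p balanced?(%w[) apple (])
```

Checking for a negative depth inside the loop matters: `) apple (` has equal numbers of each brace and would pass a count-only comparison.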
378
+
379
+ def has_content(a)
380
+ counter = 0
381
+
382
+ a.each do |i|
383
+ if i.type == 'term' then
384
+ counter += 1
385
+ end
386
+ end
387
+
388
+ if counter == 0 then
389
+ raise QueryParser::Exceptions::EmptyQuery
390
+ end
391
+ end
392
+ end
393
+
394
+ # The custom exceptions that may be thrown if there is some
395
+ # problem with the query.
396
+ module QueryParser::Exceptions
397
+ # This exception will be thrown if the query is generally
398
+ # malformed such as <tt>"apple and and banana"</tt> (too many
399
+ # <tt>and</tt>s), <tt>"apple not"</tt> (no term after the +not+)
400
+ # or <tt>"and apple"</tt> (no term before the +and+) and the like
401
+ class MalformedQuery < Exception
402
+ end
403
+
404
+ # This exception will be thrown if the query contains
405
+ # unbalanced braces
406
+ class UnbalancedBraces < Exception
407
+ end
408
+
409
+ # This exception will be thrown if the supplied query string is
410
+ # empty after removing the +and+, +or+, +not+, ( and )
411
+ class EmptyQuery < Exception
412
+ end
413
+ end
414
+
415
+ # A basic search term. The input query is tokenised into
416
+ # terms which are then manipulated to create the query tree.
417
+ #
418
+ # Generally you should not need to handle this class unless
419
+ # you are changing how the parser works.
420
+ class QueryParser::Term
421
+ # Takes the token from the user's query and classifies it:
422
+ #
423
+ # open:: The opening ( used to indicate the start of a parenthesised part of the query.
424
+ # close:: The closing ) used to indicate the end of a parenthesised part of the query.
425
+ # and:: The term indicating conjunction (stored with type +op+)
426
+ # or:: The term indicating disjunction (stored with type +op+)
427
+ # not:: The term indicating negation (stored with type +op+)
428
+ # term:: None of the above. A term to find.
429
+ def initialize(data)
430
+ @type = 'term'
431
+ @data = data
432
+ @was_reduced = false
433
+
434
+ if @data == nil then
435
+ @data = ''
436
+ else
437
+ case @data.downcase
438
+ when "("
439
+ @type = "open"
440
+ when ")"
441
+ @type = "close"
442
+ when "and", "or", "not"
443
+ @type = 'op'
444
+ @data = @data.downcase
445
+ end
446
+ end
447
+ end
448
+
449
+ attr_reader :type, :data
450
+
451
+ # Display the Term, useful for debugging and testing
452
+ # the Term class in isolation
453
+ def inspect
454
+ "#{@type}:#{@data}"
455
+ end
456
+
457
+ # Convert a term into a string usable in a Lucene query
458
+ # with an optional similarity
459
+ def lucene(field, suffix = nil)
460
+ "#{field}:#{@data}#{suffix}"
461
+ end
462
+
463
+ # Even though a term cannot, itself, be reduced the
464
+ # process will call this method on everything that
465
+ # is in the query. So we need to have this.
466
+ def reduce
467
+ @was_reduced = false
468
+ return self
469
+ end
470
+
471
+ # Return true if the previous call to #reduce did
472
+ # actually reduce the term. Again this is a method
473
+ # universal to all parts of the query and so we
474
+ # have to have it. But see #set_reduced to see why
475
+ # it can actually return true.
476
+ def reduced?
477
+ @was_reduced
478
+ end
479
+
480
+ # If the term was the only member of an +and+, +or+ or
481
+ # double (or any multiple of two) +not+ then it will replace
482
+ # the +and+, +or+ or +not+ in the query and therefore
483
+ # the original term has reduced and this, the replacement
484
+ # term, needs to indicate that fact. This allows us to
485
+ # flag that.
486
+ def set_reduced
487
+ @was_reduced = true
488
+ end
489
+
490
+ # The query can be traversed to return the terms
491
+ # that are considered *boostable*. In the following
492
+ # +apple+ will be considered positive and returned
493
+ # but +banana+ will not:
494
+ #
495
+ # apple not banana
496
+ #
497
+ # Terms that are boostable can be used to improve
498
+ # the document's relevance / position in the results list.
499
+ def boostable(negative = false)
500
+ if negative == true then
501
+ return nil
502
+ else
503
+ return self
504
+ end
505
+ end
506
+ end
507
+
508
+ # The base class for the +and+ and +or+ sets.
509
+ #
510
+ # Generally you should not need to handle this class unless
511
+ # you are changing how the parser works.
512
+ class QueryParser::Set
513
+ def initialize
514
+ @data = Array.new
515
+ @was_reduced = false
516
+ end
517
+
518
+ # Add a list of +terms+, +nots+ and other +sets+
519
+ # to the list of things that are in this part
520
+ # of the query.
521
+ #
522
+ # Can handle a list of items or just a single one.
523
+ def add(*data)
524
+ data.each do |i|
525
+ if i.class == Array then
526
+ i.each {|j| add(j)}
527
+ else
528
+ @data << i
529
+ end
530
+ end
531
+ end
532
+
533
+ # Returns all the data held by this set
534
+ def contents
535
+ @data
536
+ end
537
+
538
+ # Display the set, useful for debugging and testing
539
+ def inspect
540
+ r = Array.new
541
+ @data.each {|i|r << i.inspect}
542
+ "<#{self.inspect_class} #{r.join(' ')}>"
543
+ end
544
+
545
+ # Convert a set into a string usable in a Lucene query
546
+ # with an optional similarity that needs to be passed
547
+ # to the Terms
548
+ def lucene(field, similarity = nil)
549
+ r = Array.new
550
+ @data.each do |i|
551
+ x = ''
552
+ if self.class == QueryParser::And and i.class != QueryParser::Not then
553
+ x = '+'
554
+ end
555
+ x << i.lucene(field, similarity)
556
+ r << x
557
+ end
558
+
559
+ if r.size == 1 then
560
+ "#{r[0]}"
561
+ else
562
+ "(#{r.join(' ')})"
563
+ end
564
+ end
565
+
566
+ # A Set contained within a Set should fold the contents of the inner set
567
+ # into itself. Otherwise reduce the contents of the Set individually and
568
+ # set the flag if the contents reduced
569
+ def reduce
570
+ r = Array.new
571
+ @was_reduced = false
572
+
573
+ @data.each do |i|
574
+ if self.class == i.class then
575
+ @was_reduced = true
576
+ i.contents.each do |c|
577
+ r << c.reduce
578
+ end
579
+ else
580
+ x = i.reduce
581
+ if x.reduced? then
582
+ @was_reduced = true
583
+ end
584
+ r << x
585
+ end
586
+ end
587
+
588
+ if r.size == 1 then
589
+ @was_reduced = true
590
+ r[0].set_reduced
591
+ return r.first
592
+ else
593
+ @data = r
594
+ return self
595
+ end
596
+ end
597
+
598
+ # Did calling #reduce on this set actually reduce it?
599
+ def reduced?
600
+ @was_reduced
601
+ end
602
+
603
+ # Force the reduced flag to true
604
+ def set_reduced
605
+ @was_reduced = true
606
+ end
607
+
608
+ # Return all the boostable terms that are held
609
+ # in the set. Thus for
610
+ #
611
+ # tom and dick and harry
612
+ #
613
+ # The terms +tom+, +dick+ and +harry+ are all considered
614
+ # boostable. However in
615
+ #
616
+ # tom and dick and not harry
617
+ #
618
+ # Only the terms +tom+ and +dick+ are considered boostable
619
+ def boostable(negative = false)
620
+ r = Array.new
621
+
622
+ @data.each do |i|
623
+ x = i.boostable(negative)
624
+ if x != nil then
625
+ r << x
626
+ end
627
+ end
628
+
629
+ return r.flatten
630
+ end
631
+ end
632
+
633
+ # A subclass just to distinguish the +and+ from the +or+
634
+ class QueryParser::And < QueryParser::Set
635
+ def inspect_class
636
+ "AND"
637
+ end
638
+ end
639
+
640
+ # A subclass just to distinguish the +and+ from the +or+
641
+ class QueryParser::Or < QueryParser::Set
642
+ def inspect_class
643
+ "OR"
644
+ end
645
+ end
646
+
647
+ # Something to handle the +not+ term in a query
648
+ #
649
+ # Generally you should not need to handle this class unless
650
+ # you are changing how the parser works.
651
+ class QueryParser::Not
652
+ # +not+ handles a single term and so it is added in
653
+ # initialisation rather than with an add method.
654
+ def initialize(data)
655
+ @data = data
656
+ @was_reduced = false
657
+ end
658
+
659
+ # Returns the data held by the +not+
660
+ def contents
661
+ @data
662
+ end
663
+
664
+ # Display the +not+, useful for debugging
665
+ def inspect
666
+ "<NOT #{@data.inspect}>"
667
+ end
668
+
669
+ # Convert a +not+ into a string usable in a Lucene query
670
+ # passing the similarity on to the term contained by
671
+ # the +not+
672
+ def lucene(field, similarity = nil)
673
+ "-#{@data.lucene(field, similarity)}"
674
+ end
675
+
676
+ # Double negatives are eliminated; otherwise the contents are reduced in place
677
+ def reduce
678
+ if @data.class == QueryParser::Not then
679
+ @was_reduced = true
680
+ x = @data.contents.reduce
681
+ x.set_reduced
682
+ return x
683
+ else
684
+ @data = @data.reduce
685
+ @was_reduced = @data.reduced?
686
+ return self
687
+ end
688
+ end
689
+
690
+ # Were the contents reduced?
691
+ def reduced?
692
+ @was_reduced
693
+ end
694
+
695
+ # Sets the reduced flag to true
696
+ def set_reduced
697
+ @was_reduced = true
698
+ end
699
+
700
+ # Return all the boostable terms that are held
701
+ # in the +not+. Thus for
702
+ #
703
+ # tom and dick and harry
704
+ #
705
+ # The terms +tom+, +dick+ and +harry+ are all considered
706
+ # boostable. However in
707
+ #
708
+ # tom and dick and not harry
709
+ #
710
+ # Only the terms +tom+ and +dick+ are considered boostable
711
+ def boostable(negative = false)
712
+ @data.boostable(!negative)
713
+ end
714
+ end
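The reduce, lucene and boostable methods added in lib/queryparser.rb can be hard to follow from the diff alone, so here is a minimal stand-alone sketch of how they appear to interact. It is a condensed restatement, not the gem's code: the SketchQuery module and the shared Node base class are hypothetical names introduced for illustration (the gem repeats reduced? and set_reduced in each class and stores a term's text in @data rather than contents), and the query tree is built by hand instead of being parsed from a string.

```ruby
# Stand-alone sketch. SketchQuery and Node are hypothetical names, not gem API.
module SketchQuery
  class Node
    attr_reader :contents
    def initialize(contents)
      @contents = contents
      @was_reduced = false
    end
    def reduced?; @was_reduced; end
    def set_reduced; @was_reduced = true; end
  end

  class Term < Node
    def lucene(field, suffix = nil); "#{field}:#{contents}#{suffix}"; end
    def boostable(negative = false); negative ? nil : self; end
    def reduce # a bare term never reduces; it only clears its flag
      @was_reduced = false
      self
    end
  end

  class Set < Node
    def initialize(*items); super(items); end
    def lucene(field, similarity = nil)
      parts = contents.map do |i|
        # Only And marks its non-negated members as required.
        (is_a?(And) && !i.is_a?(Not) ? '+' : '') + i.lucene(field, similarity)
      end
      parts.size == 1 ? parts[0] : "(#{parts.join(' ')})"
    end
    def boostable(negative = false)
      contents.map { |i| i.boostable(negative) }.compact.flatten
    end
    def reduce
      @was_reduced = false
      result = []
      contents.each do |i|
        if i.class == self.class # fold a nested set of the same kind
          @was_reduced = true
          i.contents.each { |c| result << c.reduce }
        else
          x = i.reduce
          @was_reduced = true if x.reduced?
          result << x
        end
      end
      if result.size == 1 # a one-member set collapses to that member
        @was_reduced = true
        result[0].set_reduced
        return result[0]
      end
      @contents = result
      self
    end
  end
  class And < Set; end
  class Or < Set; end

  class Not < Node
    def lucene(field, similarity = nil); "-#{contents.lucene(field, similarity)}"; end
    def boostable(negative = false); contents.boostable(!negative); end
    def reduce
      if contents.is_a?(Not) # double negative: return the inner node
        @was_reduced = true
        x = contents.contents.reduce
        x.set_reduced
        return x
      end
      @contents = contents.reduce
      @was_reduced = contents.reduced?
      self
    end
  end

  # Keep reducing until a pass changes nothing.
  def self.reduce(node)
    loop do
      node = node.reduce
      break unless node.reduced?
    end
    node
  end
end

term = ->(word) { SketchQuery::Term.new(word) }
tree = SketchQuery::And.new(
  term['apple'],
  SketchQuery::And.new(term['banana'], SketchQuery::Not.new(SketchQuery::Not.new(term['cherry']))),
  SketchQuery::Not.new(term['date'])
)
reduced = SketchQuery.reduce(tree)
lucene_query = reduced.lucene('text')
boost_words = reduced.boostable.map(&:contents)
puts lucene_query
puts boost_words.inspect
```

Under these assumptions the first pass folds the nested And into its parent and removes the double negative around cherry, and the second pass finds nothing left to change, so the loop stops. The script prints (+text:apple +text:banana +text:cherry -text:date) followed by ["apple", "banana", "cherry"]: date is excluded from the boostable list because it sits under a single not.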
metadata ADDED
@@ -0,0 +1,61 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: queryparser
3
+ version: !ruby/object:Gem::Version
4
+ version: 1.0.0
5
+ platform: ruby
6
+ authors:
7
+ - Peter Hickman
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+
12
+ date: 2008-12-21 00:00:00 +00:00
13
+ default_executable:
14
+ dependencies: []
15
+
16
+ description:
17
+ email: peterhi@ntlworld.com
18
+ executables: []
19
+
20
+ extensions: []
21
+
22
+ extra_rdoc_files:
23
+ - README
24
+ - COPYING
25
+ files:
26
+ - README
27
+ - COPYING
28
+ - COPYRIGHT
29
+ - Rakefile
30
+ - lib/queryparser.rb
31
+ has_rdoc: true
32
+ homepage: queryparser.rubyforge.org
33
+ post_install_message:
34
+ rdoc_options:
35
+ - --title
36
+ - QueryParser
37
+ - --charset
38
+ - utf-8
39
+ require_paths:
40
+ - lib
41
+ required_ruby_version: !ruby/object:Gem::Requirement
42
+ requirements:
43
+ - - ">="
44
+ - !ruby/object:Gem::Version
45
+ version: "0"
46
+ version:
47
+ required_rubygems_version: !ruby/object:Gem::Requirement
48
+ requirements:
49
+ - - ">="
50
+ - !ruby/object:Gem::Version
51
+ version: "0"
52
+ version:
53
+ requirements: []
54
+
55
+ rubyforge_project: queryparser
56
+ rubygems_version: 1.3.1
57
+ signing_key:
58
+ specification_version: 2
59
+ summary: Parse a natural language query into lucene query syntax
60
+ test_files: []
61
+