queryparser 1.0.0

Files changed (6)
  1. data/COPYING +340 -0
  2. data/COPYRIGHT +18 -0
  3. data/README +14 -0
  4. data/Rakefile +33 -0
  5. data/lib/queryparser.rb +714 -0
  6. metadata +61 -0
data/COPYING ADDED
@@ -0,0 +1,340 @@
1
+ GNU GENERAL PUBLIC LICENSE
2
+ Version 2, June 1991
3
+
4
+ Copyright (C) 1989, 1991 Free Software Foundation, Inc.
5
+ 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
6
+ Everyone is permitted to copy and distribute verbatim copies
7
+ of this license document, but changing it is not allowed.
8
+
9
+ Preamble
10
+
11
+ The licenses for most software are designed to take away your
12
+ freedom to share and change it. By contrast, the GNU General Public
13
+ License is intended to guarantee your freedom to share and change free
14
+ software--to make sure the software is free for all its users. This
15
+ General Public License applies to most of the Free Software
16
+ Foundation's software and to any other program whose authors commit to
17
+ using it. (Some other Free Software Foundation software is covered by
18
+ the GNU Library General Public License instead.) You can apply it to
19
+ your programs, too.
20
+
21
+ When we speak of free software, we are referring to freedom, not
22
+ price. Our General Public Licenses are designed to make sure that you
23
+ have the freedom to distribute copies of free software (and charge for
24
+ this service if you wish), that you receive source code or can get it
25
+ if you want it, that you can change the software or use pieces of it
26
+ in new free programs; and that you know you can do these things.
27
+
28
+ To protect your rights, we need to make restrictions that forbid
29
+ anyone to deny you these rights or to ask you to surrender the rights.
30
+ These restrictions translate to certain responsibilities for you if you
31
+ distribute copies of the software, or if you modify it.
32
+
33
+ For example, if you distribute copies of such a program, whether
34
+ gratis or for a fee, you must give the recipients all the rights that
35
+ you have. You must make sure that they, too, receive or can get the
36
+ source code. And you must show them these terms so they know their
37
+ rights.
38
+
39
+ We protect your rights with two steps: (1) copyright the software, and
40
+ (2) offer you this license which gives you legal permission to copy,
41
+ distribute and/or modify the software.
42
+
43
+ Also, for each author's protection and ours, we want to make certain
44
+ that everyone understands that there is no warranty for this free
45
+ software. If the software is modified by someone else and passed on, we
46
+ want its recipients to know that what they have is not the original, so
47
+ that any problems introduced by others will not reflect on the original
48
+ authors' reputations.
49
+
50
+ Finally, any free program is threatened constantly by software
51
+ patents. We wish to avoid the danger that redistributors of a free
52
+ program will individually obtain patent licenses, in effect making the
53
+ program proprietary. To prevent this, we have made it clear that any
54
+ patent must be licensed for everyone's free use or not licensed at all.
55
+
56
+ The precise terms and conditions for copying, distribution and
57
+ modification follow.
58
+
59
+ GNU GENERAL PUBLIC LICENSE
60
+ TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
61
+
62
+ 0. This License applies to any program or other work which contains
63
+ a notice placed by the copyright holder saying it may be distributed
64
+ under the terms of this General Public License. The "Program", below,
65
+ refers to any such program or work, and a "work based on the Program"
66
+ means either the Program or any derivative work under copyright law:
67
+ that is to say, a work containing the Program or a portion of it,
68
+ either verbatim or with modifications and/or translated into another
69
+ language. (Hereinafter, translation is included without limitation in
70
+ the term "modification".) Each licensee is addressed as "you".
71
+
72
+ Activities other than copying, distribution and modification are not
73
+ covered by this License; they are outside its scope. The act of
74
+ running the Program is not restricted, and the output from the Program
75
+ is covered only if its contents constitute a work based on the
76
+ Program (independent of having been made by running the Program).
77
+ Whether that is true depends on what the Program does.
78
+
79
+ 1. You may copy and distribute verbatim copies of the Program's
80
+ source code as you receive it, in any medium, provided that you
81
+ conspicuously and appropriately publish on each copy an appropriate
82
+ copyright notice and disclaimer of warranty; keep intact all the
83
+ notices that refer to this License and to the absence of any warranty;
84
+ and give any other recipients of the Program a copy of this License
85
+ along with the Program.
86
+
87
+ You may charge a fee for the physical act of transferring a copy, and
88
+ you may at your option offer warranty protection in exchange for a fee.
89
+
90
+ 2. You may modify your copy or copies of the Program or any portion
91
+ of it, thus forming a work based on the Program, and copy and
92
+ distribute such modifications or work under the terms of Section 1
93
+ above, provided that you also meet all of these conditions:
94
+
95
+ a) You must cause the modified files to carry prominent notices
96
+ stating that you changed the files and the date of any change.
97
+
98
+ b) You must cause any work that you distribute or publish, that in
99
+ whole or in part contains or is derived from the Program or any
100
+ part thereof, to be licensed as a whole at no charge to all third
101
+ parties under the terms of this License.
102
+
103
+ c) If the modified program normally reads commands interactively
104
+ when run, you must cause it, when started running for such
105
+ interactive use in the most ordinary way, to print or display an
106
+ announcement including an appropriate copyright notice and a
107
+ notice that there is no warranty (or else, saying that you provide
108
+ a warranty) and that users may redistribute the program under
109
+ these conditions, and telling the user how to view a copy of this
110
+ License. (Exception: if the Program itself is interactive but
111
+ does not normally print such an announcement, your work based on
112
+ the Program is not required to print an announcement.)
113
+
114
+ These requirements apply to the modified work as a whole. If
115
+ identifiable sections of that work are not derived from the Program,
116
+ and can be reasonably considered independent and separate works in
117
+ themselves, then this License, and its terms, do not apply to those
118
+ sections when you distribute them as separate works. But when you
119
+ distribute the same sections as part of a whole which is a work based
120
+ on the Program, the distribution of the whole must be on the terms of
121
+ this License, whose permissions for other licensees extend to the
122
+ entire whole, and thus to each and every part regardless of who wrote it.
123
+
124
+ Thus, it is not the intent of this section to claim rights or contest
125
+ your rights to work written entirely by you; rather, the intent is to
126
+ exercise the right to control the distribution of derivative or
127
+ collective works based on the Program.
128
+
129
+ In addition, mere aggregation of another work not based on the Program
130
+ with the Program (or with a work based on the Program) on a volume of
131
+ a storage or distribution medium does not bring the other work under
132
+ the scope of this License.
133
+
134
+ 3. You may copy and distribute the Program (or a work based on it,
135
+ under Section 2) in object code or executable form under the terms of
136
+ Sections 1 and 2 above provided that you also do one of the following:
137
+
138
+ a) Accompany it with the complete corresponding machine-readable
139
+ source code, which must be distributed under the terms of Sections
140
+ 1 and 2 above on a medium customarily used for software interchange; or,
141
+
142
+ b) Accompany it with a written offer, valid for at least three
143
+ years, to give any third party, for a charge no more than your
144
+ cost of physically performing source distribution, a complete
145
+ machine-readable copy of the corresponding source code, to be
146
+ distributed under the terms of Sections 1 and 2 above on a medium
147
+ customarily used for software interchange; or,
148
+
149
+ c) Accompany it with the information you received as to the offer
150
+ to distribute corresponding source code. (This alternative is
151
+ allowed only for noncommercial distribution and only if you
152
+ received the program in object code or executable form with such
153
+ an offer, in accord with Subsection b above.)
154
+
155
+ The source code for a work means the preferred form of the work for
156
+ making modifications to it. For an executable work, complete source
157
+ code means all the source code for all modules it contains, plus any
158
+ associated interface definition files, plus the scripts used to
159
+ control compilation and installation of the executable. However, as a
160
+ special exception, the source code distributed need not include
161
+ anything that is normally distributed (in either source or binary
162
+ form) with the major components (compiler, kernel, and so on) of the
163
+ operating system on which the executable runs, unless that component
164
+ itself accompanies the executable.
165
+
166
+ If distribution of executable or object code is made by offering
167
+ access to copy from a designated place, then offering equivalent
168
+ access to copy the source code from the same place counts as
169
+ distribution of the source code, even though third parties are not
170
+ compelled to copy the source along with the object code.
171
+
172
+ 4. You may not copy, modify, sublicense, or distribute the Program
173
+ except as expressly provided under this License. Any attempt
174
+ otherwise to copy, modify, sublicense or distribute the Program is
175
+ void, and will automatically terminate your rights under this License.
176
+ However, parties who have received copies, or rights, from you under
177
+ this License will not have their licenses terminated so long as such
178
+ parties remain in full compliance.
179
+
180
+ 5. You are not required to accept this License, since you have not
181
+ signed it. However, nothing else grants you permission to modify or
182
+ distribute the Program or its derivative works. These actions are
183
+ prohibited by law if you do not accept this License. Therefore, by
184
+ modifying or distributing the Program (or any work based on the
185
+ Program), you indicate your acceptance of this License to do so, and
186
+ all its terms and conditions for copying, distributing or modifying
187
+ the Program or works based on it.
188
+
189
+ 6. Each time you redistribute the Program (or any work based on the
190
+ Program), the recipient automatically receives a license from the
191
+ original licensor to copy, distribute or modify the Program subject to
192
+ these terms and conditions. You may not impose any further
193
+ restrictions on the recipients' exercise of the rights granted herein.
194
+ You are not responsible for enforcing compliance by third parties to
195
+ this License.
196
+
197
+ 7. If, as a consequence of a court judgment or allegation of patent
198
+ infringement or for any other reason (not limited to patent issues),
199
+ conditions are imposed on you (whether by court order, agreement or
200
+ otherwise) that contradict the conditions of this License, they do not
201
+ excuse you from the conditions of this License. If you cannot
202
+ distribute so as to satisfy simultaneously your obligations under this
203
+ License and any other pertinent obligations, then as a consequence you
204
+ may not distribute the Program at all. For example, if a patent
205
+ license would not permit royalty-free redistribution of the Program by
206
+ all those who receive copies directly or indirectly through you, then
207
+ the only way you could satisfy both it and this License would be to
208
+ refrain entirely from distribution of the Program.
209
+
210
+ If any portion of this section is held invalid or unenforceable under
211
+ any particular circumstance, the balance of the section is intended to
212
+ apply and the section as a whole is intended to apply in other
213
+ circumstances.
214
+
215
+ It is not the purpose of this section to induce you to infringe any
216
+ patents or other property right claims or to contest validity of any
217
+ such claims; this section has the sole purpose of protecting the
218
+ integrity of the free software distribution system, which is
219
+ implemented by public license practices. Many people have made
220
+ generous contributions to the wide range of software distributed
221
+ through that system in reliance on consistent application of that
222
+ system; it is up to the author/donor to decide if he or she is willing
223
+ to distribute software through any other system and a licensee cannot
224
+ impose that choice.
225
+
226
+ This section is intended to make thoroughly clear what is believed to
227
+ be a consequence of the rest of this License.
228
+
229
+ 8. If the distribution and/or use of the Program is restricted in
230
+ certain countries either by patents or by copyrighted interfaces, the
231
+ original copyright holder who places the Program under this License
232
+ may add an explicit geographical distribution limitation excluding
233
+ those countries, so that distribution is permitted only in or among
234
+ countries not thus excluded. In such case, this License incorporates
235
+ the limitation as if written in the body of this License.
236
+
237
+ 9. The Free Software Foundation may publish revised and/or new versions
238
+ of the General Public License from time to time. Such new versions will
239
+ be similar in spirit to the present version, but may differ in detail to
240
+ address new problems or concerns.
241
+
242
+ Each version is given a distinguishing version number. If the Program
243
+ specifies a version number of this License which applies to it and "any
244
+ later version", you have the option of following the terms and conditions
245
+ either of that version or of any later version published by the Free
246
+ Software Foundation. If the Program does not specify a version number of
247
+ this License, you may choose any version ever published by the Free Software
248
+ Foundation.
249
+
250
+ 10. If you wish to incorporate parts of the Program into other free
251
+ programs whose distribution conditions are different, write to the author
252
+ to ask for permission. For software which is copyrighted by the Free
253
+ Software Foundation, write to the Free Software Foundation; we sometimes
254
+ make exceptions for this. Our decision will be guided by the two goals
255
+ of preserving the free status of all derivatives of our free software and
256
+ of promoting the sharing and reuse of software generally.
257
+
258
+ NO WARRANTY
259
+
260
+ 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
261
+ FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
262
+ OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
263
+ PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
264
+ OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
265
+ MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
266
+ TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
267
+ PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
268
+ REPAIR OR CORRECTION.
269
+
270
+ 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
271
+ WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
272
+ REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
273
+ INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
274
+ OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
275
+ TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
276
+ YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
277
+ PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
278
+ POSSIBILITY OF SUCH DAMAGES.
279
+
280
+ END OF TERMS AND CONDITIONS
281
+
282
+ How to Apply These Terms to Your New Programs
283
+
284
+ If you develop a new program, and you want it to be of the greatest
285
+ possible use to the public, the best way to achieve this is to make it
286
+ free software which everyone can redistribute and change under these terms.
287
+
288
+ To do so, attach the following notices to the program. It is safest
289
+ to attach them to the start of each source file to most effectively
290
+ convey the exclusion of warranty; and each file should have at least
291
+ the "copyright" line and a pointer to where the full notice is found.
292
+
293
+ <one line to give the program's name and a brief idea of what it does.>
294
+ Copyright (C) <year> <name of author>
295
+
296
+ This program is free software; you can redistribute it and/or modify
297
+ it under the terms of the GNU General Public License as published by
298
+ the Free Software Foundation; either version 2 of the License, or
299
+ (at your option) any later version.
300
+
301
+ This program is distributed in the hope that it will be useful,
302
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
303
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
304
+ GNU General Public License for more details.
305
+
306
+ You should have received a copy of the GNU General Public License
307
+ along with this program; if not, write to the Free Software
308
+ Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
309
+
310
+
311
+ Also add information on how to contact you by electronic and paper mail.
312
+
313
+ If the program is interactive, make it output a short notice like this
314
+ when it starts in an interactive mode:
315
+
316
+ Gnomovision version 69, Copyright (C) year name of author
317
+ Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
318
+ This is free software, and you are welcome to redistribute it
319
+ under certain conditions; type `show c' for details.
320
+
321
+ The hypothetical commands `show w' and `show c' should show the appropriate
322
+ parts of the General Public License. Of course, the commands you use may
323
+ be called something other than `show w' and `show c'; they could even be
324
+ mouse-clicks or menu items--whatever suits your program.
325
+
326
+ You should also get your employer (if you work as a programmer) or your
327
+ school, if any, to sign a "copyright disclaimer" for the program, if
328
+ necessary. Here is a sample; alter the names:
329
+
330
+ Yoyodyne, Inc., hereby disclaims all copyright interest in the program
331
+ `Gnomovision' (which makes passes at compilers) written by James Hacker.
332
+
333
+ <signature of Ty Coon>, 1 April 1989
334
+ Ty Coon, President of Vice
335
+
336
+ This General Public License does not permit incorporating your program into
337
+ proprietary programs. If your program is a subroutine library, you may
338
+ consider it more useful to permit linking proprietary applications with the
339
+ library. If this is what you want to do, use the GNU Library General
340
+ Public License instead of this License.
data/COPYRIGHT ADDED
@@ -0,0 +1,18 @@
1
+
2
+ QueryParser - Parse a plain language query into Lucene syntax
3
+ Copyright (C) 2008 Peter Hickman
4
+
5
+ This program is free software; you can redistribute it and/or modify
6
+ it under the terms of the GNU General Public License as published by
7
+ the Free Software Foundation; either version 2 of the License, or
8
+ (at your option) any later version.
9
+
10
+ This program is distributed in the hope that it will be useful,
11
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13
+ GNU General Public License for more details.
14
+
15
+ You should have received a copy of the GNU General Public License
16
+ along with this program; if not, write to the Free Software
17
+ Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
18
+
data/README ADDED
@@ -0,0 +1,14 @@
1
+
2
+ QueryParser - Parse a plain language query into Lucene syntax
3
+ Copyright (C) 2008 Peter Hickman
4
+
5
+ Parse a plain language query in the format
6
+
7
+ apple not banana
8
+
9
+ apple or banana and fig or date
10
+
11
+ and change them into a format suitable for use with Lucene or Solr.
12
+ The plain text query can include braces to group things and the Lucene
13
+ query can include both a similarity value and field boosting
14
+
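The README's second example mixes `and` and `or` without braces. The parser source in this diff processes `not` first, then `and`, then `or`, so `and` groups more tightly than `or`. The following is only a rough sketch of that grouping, assuming a hypothetical helper named `group_by_precedence` that is not part of the gem and that ignores `not`, quotes and braces:

```ruby
# Hypothetical helper for illustration; it is not part of the queryparser gem.
# It ignores 'not', quotes and braces, and only shows how 'and' groups
# more tightly than 'or'.
def group_by_precedence(query)
  # Split on 'or' first, so each piece is a run of and-ed terms.
  query.split(/\s+or\s+/i).map do |piece|
    # Whitespace between terms is an implicit 'and', so drop explicit ones.
    piece.split.reject { |token| token.downcase == 'and' }
  end
end

p group_by_precedence('apple or banana and fig or date')
# => [["apple"], ["banana", "fig"], ["date"]]
```

Each inner array is a group of terms that must all match; the outer array lists the alternatives.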
data/Rakefile ADDED
@@ -0,0 +1,33 @@
1
+ # -*- ruby -*-
2
+
3
+ require 'rubygems'
4
+ require 'rake/gempackagetask'
5
+
6
+ $:.push 'lib'
7
+ require 'queryparser'
8
+
9
+ PKG_NAME = 'queryparser'
10
+ PKG_VERSION = QueryParser::VERSION
11
+
12
+ spec = Gem::Specification.new do |s|
13
+ s.name = PKG_NAME
14
+ s.version = PKG_VERSION
15
+ s.summary = 'Parse a natural language query into lucene query syntax'
16
+
17
+ s.files = FileList['README', 'COPY*', 'Rakefile', 'lib/**/*.rb']
18
+ s.test_files = FileList['test/*.rb']
19
+
20
+ s.has_rdoc = true
21
+ s.rdoc_options << '--title' << 'QueryParser' << '--charset' << 'utf-8'
22
+ s.extra_rdoc_files = FileList['README', 'COPYING']
23
+
24
+ s.author = 'Peter Hickman'
25
+ s.email = 'peterhi@ntlworld.com'
26
+
27
+ s.homepage = 'queryparser.rubyforge.org'
28
+ s.rubyforge_project = 'queryparser'
29
+ end
30
+
31
+ Rake::GemPackageTask.new(spec) do |pkg|
32
+ pkg.need_tar = true
33
+ end
data/lib/queryparser.rb ADDED
@@ -0,0 +1,714 @@
1
+ # Takes a query in plain english and turns it into a string
2
+ # suitable for passing to Lucene or Solr.
3
+ #
4
+ # Assuming a Lucene / Solr database that has the body of the
5
+ # data in the +content+ field with the entry heading in a
6
+ # +title+ field, sub headings in a +subheading+ field
7
+ #
8
+ # p = QueryParser.new('content')
9
+ # l = p.parse("apple")
10
+ # => "content:apple"
11
+ #
12
+ # l = p.parse("apple and banana")
13
+ # => "+(+content:apple +content:banana)"
14
+ #
15
+ # l = p.parse('apple not banana or cherry')
16
+ # => "+((+content:apple -content:banana) content:cherry)"
17
+ #
18
+ # Here we boost the score of those queries that also match the
19
+ # title field of the document
20
+ #
21
+ # p = QueryParser.new("content", nil, 'title' => '^10')
22
+ # l = p.parse("apple")
23
+ # => "content:apple title:apple^10"
24
+ #
25
+ # Now with an extra boosting for subheadings
26
+ #
27
+ # p = QueryParser.new("content", nil, 'title' => '^10', 'subheading' => '^5')
28
+ # l = p.parse("apple")
29
+ # => "content:apple title:apple^10 subheading:apple^5"
30
+ #
31
+ # We can also change the similarity of the match. In Lucene terms
32
+ # a similarity of 1.0 will mean that 'banana' will only match 'banana'.
33
+ # However a similarity of 0.6 (entered as ~0.6) will allow 'banana' to
34
+ # match 'canada', which differs by only two letters. The default similarity
35
+ # for fuzzy matches in the classic Lucene query syntax is 0.5.
36
+ #
37
+ # p = QueryParser.new("content", '~0.6', 'title' => '^10')
38
+ # l = p.parse("apple not banana")
39
+ # => "+(+content:apple~0.6 -content:banana~0.6) title:apple~0.6^10"
40
+
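The examples in the comment above show a similarity value (such as `~0.6`) and a boost (such as `^10`) being appended to each `field:term` pair. The sketch below illustrates that concatenation only. `field_term` is a hypothetical helper, not a method of the gem, and it assumes the suffixes are simply joined in the order similarity then boost, as the `[@similarity, v].join('')` line in `parse` suggests:

```ruby
# Hypothetical helper for illustration; it is not part of the queryparser gem.
def field_term(field, term, similarity = nil, boost = nil)
  # Array#join renders nil as an empty string, so absent parts vanish.
  suffix = [similarity, boost].join
  "#{field}:#{term}#{suffix}"
end

puts field_term('content', 'apple')               # content:apple
puts field_term('title', 'apple', nil, '^10')     # title:apple^10
puts field_term('title', 'apple', '~0.6', '^10')  # title:apple~0.6^10
```

Passing `nil` for the similarity, as the constructor examples do, therefore leaves the boost suffix on its own.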
41
+ class QueryParser
42
+ VERSION = '1.0.0'
43
+
44
+ def initialize(field, similarity = nil, boosts = {})
45
+ @field = field
46
+ @similarity = similarity
47
+ @boosts = boosts
48
+ end
49
+
50
+ # Takes a plain english query and converts it into a string
51
+ # that can be fed into Lucene or Solr. It will apply the
52
+ # similarity and boostings set in the constructor.
53
+ def parse(text)
54
+ a = tokenise(text)
55
+ b = expand(a)
56
+ check_braces(b)
57
+ has_content(b)
58
+
59
+ c = add_implicit_and(b)
60
+
61
+ d = maketree(c)
62
+ if d.class != Array then
63
+ d = [d]
64
+ end
65
+
66
+ f = process_not(d)
67
+ g = process_and_or(f, 'and')
68
+ h = process_and_or(g, 'or')
69
+
70
+ # Wrap everything in an and
71
+ s = QueryParser::And.new
72
+ s.add(h)
73
+
74
+ t = reduce(s)
75
+
76
+ b = QueryParser::Or.new
77
+ b.add(t.boostable())
78
+
79
+ a = Array.new
80
+ x = t.lucene(@field, @similarity)
81
+ if x[0].chr == '(' then
82
+ x = "+#{x}"
83
+ end
84
+ a << x
85
+
86
+ @boosts.each_pair do |k, v|
87
+ x = [@similarity, v].join('')
88
+ a << b.lucene(k,x)
89
+ end
90
+
91
+ return a.join(' ')
92
+ end
93
+
94
+ private
95
+
96
+ # Split the string into tokens based on whitespace unless it is
97
+ # enclosed in ' or ". Initially we classify everything as either
98
+ # a term or quoted.
99
+ #
100
+ # The input is a text string and the output a flat list of terms
101
+ def tokenise(text)
102
+ r = Array.new()
103
+
104
+ delimiter = ''
105
+ token = ''
106
+
107
+ text.split("").each do |char|
108
+ if delimiter == '' then
109
+ if char == '"' or char == "'" then
110
+ token = remove_punctuation(token)
111
+ r << QueryParser::Term.new(token) if token != ''
112
+ delimiter = char.dup
113
+ token = char.dup
114
+ elsif char == " " then
115
+ token = remove_punctuation(token)
116
+ r << QueryParser::Term.new(token) if token != ''
117
+ token = ''
118
+ else
119
+ token << char.dup
120
+ end
121
+ elsif delimiter == char then
122
+ token << char.dup
123
+ token = remove_punctuation(token)
124
+ r << QueryParser::Term.new(token) if token != ''
125
+ token = ''
126
+ delimiter = ''
127
+ else
128
+ token << char.dup
129
+ end
130
+ end
131
+
132
+ token = remove_punctuation(token)
133
+ r << QueryParser::Term.new(token) if token != ''
134
+
135
+ return r
136
+ end
137
+
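For comparison, the quote-aware splitting done by `tokenise` can be approximated with a single regular expression. This is a simplified sketch and not the author's character loop: `split_query` is a hypothetical name, it returns plain strings instead of `QueryParser::Term` objects, it does no punctuation stripping, and it treats a quote that opens in the middle of a word differently from the loop above.

```ruby
# Hypothetical helper for illustration; it is not part of the queryparser gem.
def split_query(text)
  # Match a double-quoted run, a single-quoted run, or a run of
  # non-whitespace characters, in that order of preference.
  text.scan(/"[^"]*"|'[^']*'|\S+/)
end

p split_query(%q{apple "red banana" 'green fig' date})
# => ["apple", "\"red banana\"", "'green fig'", "date"]
```

Quoted phrases stay together as one token, which is the property the rest of the parser relies on.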
138
+ # All our terms will be a-z, A-Z, 0-9, ( and ). Everything else is replaced by spaces
139
+ def remove_punctuation(a)
140
+ if a == '' then
141
+ return a
142
+ end
143
+
144
+ first = a[0].chr
145
+ last = a[-1].chr
146
+
147
+ quoted = false
148
+ if first == '"' or first == "'" then
149
+ if first == last then
150
+ quoted = true
151
+ end
152
+ end
153
+
154
+ b = a.gsub(/[^[:alnum:]()]/,' ')
155
+ c = b.gsub(/\s+/, ' ').strip
156
+
157
+ if quoted then
158
+ return ['"', c, '"'].join('')
159
+ else
160
+ return c
161
+ end
162
+ end
163
+
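The two `gsub` calls in `remove_punctuation` can be tried in isolation. The sketch below assumes a hypothetical helper named `strip_punctuation` that leaves out the quote handling and keeps only the character filtering and whitespace squeezing:

```ruby
# Hypothetical helper for illustration; it is not part of the queryparser gem.
def strip_punctuation(token)
  # Replace anything that is not a letter, digit or brace with a space,
  # then collapse runs of whitespace and trim both ends.
  token.gsub(/[^[:alnum:]()]/, ' ').gsub(/\s+/, ' ').strip
end

p strip_punctuation('re-use, (fig)!')  # => "re use (fig)"
p strip_punctuation('...')             # => ""
```

A token made up only of punctuation collapses to an empty string, which is why `tokenise` checks `token != ''` before creating a term.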
164
+ # If any terms contain '(' or ')' then split them apart and re-tokenise
165
+ #
166
+ # The input is a list of terms, the output is a (possibly longer) list of terms
167
+ def expand(a)
168
+ r = Array.new
169
+
170
+ a.each do |i|
171
+ if i.type == 'term' and (i.data.index("(") or i.data.index(")")) then
172
+ x = i.data.gsub("(", " ( ").gsub(")", " ) ")
173
+ r << tokenise(x)
174
+ else
175
+ r << i
176
+ end
177
+ end
178
+
179
+ return r.flatten
180
+ end
181
+
182
+ # Create nested lists around the 'open' and 'close' ops
183
+ #
184
+ # The input is a list of terms, the output is a list of terms and lists of the same
185
+ def maketree(a)
186
+ r = Array.new
187
+
188
+ while x = a.shift do
189
+ case x.type
190
+ when "open"
191
+ y = maketree(a)
192
+ if y.size == 1 then
193
+ r << y[0]
194
+ elsif y.size > 1 then
195
+ r << y
196
+ end
197
+ when "close"
198
+ return r
199
+ else
200
+ r << x
201
+ end
202
+ end
203
+
204
+ if r.size == 1 then
205
+ return r[0]
206
+ else
207
+ return r
208
+ end
209
+ end
210
+
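The recursion in `maketree` is easier to follow on plain strings. The sketch below uses a hypothetical helper named `nest`. Unlike `maketree`, it does not collapse single-element groups or drop empty ones, and it assumes the braces are already balanced:

```ruby
# Hypothetical helper for illustration; it is not part of the queryparser gem.
# Note that it consumes the array it is given, just as maketree does.
def nest(tokens)
  result = []
  while (token = tokens.shift)
    case token
    when '(' then result << nest(tokens) # recurse until the matching ')'
    when ')' then return result          # close the current group
    else result << token
    end
  end
  result
end

p nest(%w[apple ( banana or fig ) date])
# => ["apple", ["banana", "or", "fig"], "date"]
```

Because every recursive call shifts tokens off the same array, the caller resumes exactly after the closing brace.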
211
+ # Add the implicit 'and' after a term that is not itself an op
212
+ #
213
+ # The input is a list of terms and lists of same, the output is a (possibly longer) list of terms
214
+ def add_implicit_and(a)
215
+ r = Array.new
216
+
217
+ a.each do |i|
218
+ if r.size > 0 then
219
+ if previous_type(r.last) then
220
+ if current_type(i) then
221
+ r << QueryParser::Term.new('and')
222
+ end
223
+ else
224
+ if not current_type(i) then
225
+ raise QueryParser::Exceptions::MalformedQuery
226
+ end
227
+ end
228
+ end
229
+
230
+ r << i
231
+ end
232
+
233
+ if r.last.type == 'op' then
234
+ raise QueryParser::Exceptions::MalformedQuery
235
+ end
236
+
237
+ return r
238
+ end
239
+
240
+ # All these behave the same for adding an 'and'
241
+ def previous_type(i)
242
+ return (i.type == 'term' or i.type == 'close')
243
+ end
244
+
245
+ def current_type(i)
246
+ return (i.type == 'term' or i.type == 'open' or i.data == 'not')
247
+ end
248
+
249
+ # The not picks up the term to its right
250
+ #
251
+ # The 'Not' op terms in the list are converted into Not objects
252
+ def process_not(a)
253
+ r = Array.new
254
+
255
+ # So we can handle a 'not not not apple' and the like
256
+ b = a.reverse
257
+
258
+ b.each do |i|
259
+ if i.class == Array then
260
+ i = process_not(i)
261
+ end
262
+
263
+ if i.class == QueryParser::Term and i.type == 'op' and i.data == 'not' then
264
+ if r.size == 0 then
265
+ raise QueryParser::Exceptions::MalformedQuery
266
+ else
267
+ x = QueryParser::Not.new(r.pop)
268
+ r << x
269
+ end
270
+ else
271
+ r << i
272
+ end
273
+ end
274
+
275
+ return r.reverse
276
+ end
277
+
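Walking the list in reverse is what lets `process_not` handle chains such as `not not apple`: by the time a `not` is reached, the item to its right has already been built. A minimal sketch of the same idea on plain strings, assuming a hypothetical helper named `attach_nots` that uses tagged arrays in place of `QueryParser::Not` objects and `ArgumentError` in place of the gem's own exception:

```ruby
# Hypothetical helper for illustration; it is not part of the queryparser gem.
def attach_nots(tokens)
  result = []
  tokens.reverse_each do |token|
    if token == 'not'
      # Nothing collected yet means there is no term to the right.
      raise ArgumentError, "'not' needs a term to its right" if result.empty?
      result << [:not, result.pop]
    else
      result << token
    end
  end
  result.reverse
end

p attach_nots(%w[apple not not banana])
# => ["apple", [:not, [:not, "banana"]]]
```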
278
+ # Find all the 'and' and 'or' op terms and convert them into And and Or objects
279
+ def process_and_or(a, type)
280
+ # make sure that it is in an array
281
+ if a.class != Array then
282
+ a = [a]
283
+ end
284
+
285
+ r = Array.new
286
+
287
+ has_op = false
288
+ s = nil
289
+
290
+ a.each do |i|
291
+ # First recurse into each element
292
+ if i.class == Array then
293
+ x = process_and_or(i, type)
294
+ if x.class == Array and x.size == 1 then
295
+ i = x.first
296
+ else
297
+ i = x
298
+ end
299
+ elsif i.class == QueryParser::Not then
300
+ x = process_and_or(i.contents, type)
301
+ if x.class == Array and x.size == 1 then
302
+ x = x.first
303
+ end
304
+ i = QueryParser::Not.new(x)
305
+ elsif i.class == QueryParser::And then
306
+ x = process_and_or(i.contents, type)
307
+ i = QueryParser::And.new()
308
+ i.add(x)
309
+ elsif i.class == QueryParser::Or then
310
+ x = process_and_or(i.contents, type)
311
+ i = QueryParser::Or.new()
312
+ i.add(x)
313
+ end
314
+
315
+ if has_op == true then
316
+ s.add(i)
317
+ r << s
318
+ s = nil
319
+ has_op = false
320
+ elsif i.class == QueryParser::Term and i.type == 'op' and i.data == type then
321
+ has_op = true
322
+ if i.data == 'and' then
323
+ s = QueryParser::And.new
324
+ else
325
+ s = QueryParser::Or.new
326
+ end
327
+
328
+ if r.size == 0 then
329
+ raise QueryParser::Exceptions::MalformedQuery
330
+ else
331
+ s.add(r.pop)
332
+ end
333
+ else
334
+ r << i
335
+ end
336
+ end
337
+
338
+ if r.size == 1 then
339
+ return r[0]
340
+ else
341
+ return r
342
+ end
343
+ end
344
+
345
+ # Reduce the sets down
346
+ def reduce(a)
347
+ process = true
348
+
349
+ while process do
350
+ a = a.reduce
351
+ if a.reduced? == false then
352
+ process = false
353
+ end
354
+ end
355
+
356
+ return a
357
+ end
358
+
359
+ # Check that the "(" and ")" are balanced
360
+ def check_braces(a)
361
+ counter = 0
362
+
363
+ a.each do |i|
364
+ if i.type == 'open' then
365
+ counter += 1
366
+ elsif i.type == 'close' then
367
+ counter -= 1
368
+ if counter < 0 then
369
+ raise QueryParser::Exceptions::UnbalancedBraces
370
+ end
371
+ end
372
+ end
373
+
374
+ if counter != 0 then
375
+ raise QueryParser::Exceptions::UnbalancedBraces
376
+ end
377
+ end
378
+
379
+ def has_content(a)
380
+ counter = 0
381
+
382
+ a.each do |i|
383
+ if i.type == 'term' then
384
+ counter += 1
385
+ end
386
+ end
387
+
388
+ if counter == 0 then
389
+ raise QueryParser::Exceptions::EmptyQuery
390
+ end
391
+ end
392
+ end
393
+
394
+ # The custom exceptions that may be thrown if there is some
395
+ # problem with the query.
396
+ module QueryParser::Exceptions
397
+ # This exception will be thrown if the query is generally
398
+ # malformed such as <tt>"apple and and banana"</tt> (too many
399
+ # <tt>and</tt>s), <tt>"apple not"</tt> (no term after the +not+)
400
+ # or <tt>"and apple"</tt> (no term before the +and+) and the like
401
+ class MalformedQuery < Exception
402
+ end
403
+
404
+ # This exception will be thrown if the query contains
405
+ # unbalanced braces
406
+ class UnbalancedBraces < Exception
407
+ end
408
+
409
+ # This exception will be thrown if the supplied query string is
410
+ # empty after removing the +and+, +or+, +not+, ( and )
411
+ class EmptyQuery < Exception
412
+ end
413
+ end
414
+
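A note on rescuing these exceptions: because they inherit from Exception rather than StandardError, a bare `rescue` (or `rescue StandardError`) will not catch them, so callers have to name the class explicitly. The sketch below illustrates that with stand-in names. `SketchExceptions` and `parse_stub` are hypothetical and are not part of the gem.

```ruby
# Stand-in namespace (hypothetical); it mirrors the superclass used in the diff.
module SketchExceptions
  class MalformedQuery < Exception; end
end

# Hypothetical helper standing in for a parser call that rejects the query.
def parse_stub(query)
  raise SketchExceptions::MalformedQuery, "cannot parse: #{query}"
end

def try_rescue
  parse_stub('and apple')
  :no_error
rescue StandardError
  # Not reached: MalformedQuery is not a StandardError.
  :caught_as_standard_error
rescue SketchExceptions::MalformedQuery
  # Reached: the class has to be named explicitly.
  :caught_by_name
end

RESCUE_RESULT = try_rescue
puts RESCUE_RESULT # prints "caught_by_name"
```

Code that wraps the parser in a generic `rescue` should therefore list the three exception classes by name.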
415
+ # A basic search term. The input query is tokenised into
416
+ # terms which are then manipulated to create the query tree.
417
+ #
418
+ # Generally you should not need to handle this class unless
419
+ # you are changing how the parser works.
420
+ class QueryParser::Term
421
+ # Takes the token from the user's query and classifies it:
422
+ #
423
+ # open:: The opening ( used to indicate the start of a parenthesised part of the query.
424
+ # close:: The closing ) used to indicate the end of a parenthesised part of the query.
425
+ # and:: The term indicating conjunction (stored with type +op+)
426
+ # or:: The term indicating disjunction (stored with type +op+)
427
+ # not:: The term indicating negation (stored with type +op+)
428
+ # term:: None of the above. A term to find.
429
+ def initialize(data)
430
+ @type = 'term'
431
+ @data = data
432
+ @was_reduced = false
433
+
434
+ if @data == nil then
435
+ @data = ''
436
+ else
437
+ case @data.downcase
438
+ when "("
439
+ @type = "open"
440
+ when ")"
441
+ @type = "close"
442
+ when "and", "or", "not"
443
+ @type = 'op'
444
+ @data = @data.downcase
445
+ end
446
+ end
447
+ end
448
+
449
+ attr_reader :type, :data
450
+
451
+ # Display the Term, useful for debugging and testing
452
+ # the Term class in isolation
453
+ def inspect
454
+ "#{@type}:#{@data}"
455
+ end
456
+
457
+ # Convert a term into a string usable in a Lucene query
458
+ # with an optional similarity
459
+ def lucene(field, suffix = nil)
460
+ "#{field}:#{@data}#{suffix}"
461
+ end
462
+
463
+ # Even though a term cannot, itself, be reduced the
464
+ # process will call this method on everything that
465
+ # is in the query. So we need to have this.
466
+ def reduce
467
+ @was_reduced = false
468
+ return self
469
+ end
470
+
471
+ # Return true if the previous call to #reduce did
472
+ # actually reduce the term. Again this is a method
473
+ # universal to all parts of the query and so we
474
+ # have to have it. But see #set_reduced to see why
475
+ # it can actually return true.
476
+ def reduced?
477
+ @was_reduced
478
+ end
479
+
480
+ # If the term was the only member of an +and+, +or+ or
481
+ # double (or any multiple of two) +not+ then it will replace
482
+ # the +and+, +or+ or +not+ in the query and therefore
483
+ # the original term has reduced and this, the replacement
484
+ # term, needs to indicate that fact. This allows us to
485
+ # flag that.
486
+ def set_reduced
487
+ @was_reduced = true
488
+ end
489
+
490
+ # The query can be traversed to return the terms
491
+ # that are considered *boostable*. In the following
492
+ # +apple+ will be considered positive and returned
493
+ # but +banana+ will not:
494
+ #
495
+ # apple not banana
496
+ #
497
+ # Terms that are boostable can be used to improve
498
+ # the document's relevance / position in the results list.
499
+ def boostable(negative = false)
500
+ if negative == true then
501
+ return nil
502
+ else
503
+ return self
504
+ end
505
+ end
506
+ end
507
+
508
+ # The base class for the +and+ and +or+ sets.
509
+ #
510
+ # Generally you should not need to handle this class unless
511
+ # you are changing how the parser works.
512
+ class QueryParser::Set
513
+ def initialize
514
+ @data = Array.new
515
+ @was_reduced = false
516
+ end
517
+
518
+ # Add a list of +terms+, +nots+ and other +sets+
519
+ # to the list of things that are in this part
520
+ # of the query.
521
+ #
522
+ # Can handle a list of items or just a single one.
523
+ def add(*data)
524
+ data.each do |i|
525
+ if i.class == Array then
526
+ i.each {|j| add(j)}
527
+ else
528
+ @data << i
529
+ end
530
+ end
531
+ end
532
+
533
+ # Returns all the data held by this set
534
+ def contents
535
+ @data
536
+ end
537
+
538
+ # Display the set, useful for debugging and testing
539
+ def inspect
540
+ r = Array.new
541
+ @data.each {|i|r << i.inspect}
542
+ "<#{self.inspect_class} #{r.join(' ')}>"
543
+ end
544
+
545
+ # Convert a set into a string usable in a Lucene query
546
+ # with an optional similarity that needs to be passed
547
+ # to the Terms
548
+ def lucene(field, similarity = nil)
549
+ r = Array.new
550
+ @data.each do |i|
551
+ x = ''
552
+ if self.class == QueryParser::And and i.class != QueryParser::Not then
553
+ x = '+'
554
+ end
555
+ x << i.lucene(field, similarity)
556
+ r << x
557
+ end
558
+
559
+ if r.size == 1 then
560
+ "#{r[0]}"
561
+ else
562
+ "(#{r.join(' ')})"
563
+ end
564
+ end
565
+
566
+ # A Set contained within a Set should fold the contents of the inner set
567
+ # into itself. Otherwise reduce the contents of the Set individually and
568
+ # set the flag if the contents reduced
569
+ def reduce
570
+ r = Array.new
571
+ @was_reduced = false
572
+
573
+ @data.each do |i|
574
+ if self.class == i.class then
575
+ @was_reduced = true
576
+ i.contents.each do |c|
577
+ r << c.reduce
578
+ end
579
+ else
580
+ x = i.reduce
581
+ if x.reduced? then
582
+ @was_reduced = true
583
+ end
584
+ r << x
585
+ end
586
+ end
587
+
588
+ if r.size == 1 then
589
+ @was_reduced = true
590
+ r[0].set_reduced
591
+ return r.first
592
+ else
593
+ @data = r
594
+ return self
595
+ end
596
+ end
597
+
598
+ # Did calling #reduce on this set actually reduce it
599
+ def reduced?
600
+ @was_reduced
601
+ end
602
+
603
+ # Force the reduced flag to true
604
+ def set_reduced
605
+ @was_reduced = true
606
+ end
607
+
608
+ # Return all the boostable terms that are held
609
+ # in the set. Thus for
610
+ #
611
+ # tom and dick and harry
612
+ #
613
+ # The terms +tom+, +dick+ and +harry+ are all considered
614
+ # boostable. However in
615
+ #
616
+ # tom and dick and not harry
617
+ #
618
+ # Only the terms +tom+ and +dick+ are considered boostable
619
+ def boostable(negative = false)
620
+ r = Array.new
621
+
622
+ @data.each do |i|
623
+ x = i.boostable(negative)
624
+ if x != nil then
625
+ r << x
626
+ end
627
+ end
628
+
629
+ return r.flatten
630
+ end
631
+ end
632
+
633
+ # A subclass just to distinguish the +and+ from the +or+
634
+ class QueryParser::And < QueryParser::Set
635
+ def inspect_class
636
+ "AND"
637
+ end
638
+ end
639
+
640
+ # A subclass just to distinguish the +and+ from the +or+
641
+ class QueryParser::Or < QueryParser::Set
642
+ def inspect_class
643
+ "OR"
644
+ end
645
+ end
646
+
647
+ # Something to handle the +not+ term in a query
648
+ #
649
+ # Generally you should not need to handle this class unless
650
+ # you are changing how the parser works.
651
+ class QueryParser::Not
652
+ # +not+ handles a single term and so it is added in
653
+ # initialisation rather than with an add method.
654
+ def initialize(data)
655
+ @data = data
656
+ @was_reduced = false
657
+ end
658
+
659
+ # Returns the data held by the +not+
660
+ def contents
661
+ @data
662
+ end
663
+
664
+ # Display the +not+, useful for debugging
665
+ def inspect
666
+ "<NOT #{@data.inspect}>"
667
+ end
668
+
669
+ # Convert a +not+ into a string usable in a Lucene query
670
+ # passing the similarity on to the term contained by
671
+ # the +not+
672
+ def lucene(field, similarity = nil)
673
+ "-#{@data.lucene(field, similarity)}"
674
+ end
675
+
676
+ # Double negatives are eliminated; otherwise the contents are reduced
677
+ def reduce
678
+ if @data.class == QueryParser::Not then
679
+ @was_reduced = true
680
+ x = @data.contents.reduce
681
+ x.set_reduced
682
+ return x
683
+ else
684
+ @data = @data.reduce
685
+ @was_reduced = @data.reduced?
686
+ return self
687
+ end
688
+ end
689
+
690
+ # Were the contents reduced?
691
+ def reduced?
692
+ @was_reduced
693
+ end
694
+
695
+ # Sets the reduced flag to true
696
+ def set_reduced
697
+ @was_reduced = true
698
+ end
699
+
700
+ # Return all the boostable terms that are held
701
+ # in the +not+. Thus for
702
+ #
703
+ # tom and dick and harry
704
+ #
705
+ # The terms +tom+, +dick+ and +harry+ are all considered
706
+ # boostable. However in
707
+ #
708
+ # tom and dick and not harry
709
+ #
710
+ # Only the terms +tom+ and +dick+ are considered boostable
711
+ def boostable(negative = false)
712
+ @data.boostable(!negative)
713
+ end
714
+ end
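To make the `lucene` and `boostable` methods above easier to follow, here is a minimal, self-contained sketch of how a small query tree might render to a Lucene string and report its boostable terms. The classes `SketchTerm`, `SketchNot` and `SketchAnd` are hypothetical stand-ins that copy only the relevant logic from the diff. They are not the gem's classes, and the tree is built by hand rather than by the parser.

```ruby
# Hypothetical stand-in for QueryParser::Term (lucene and boostable only).
class SketchTerm
  attr_reader :data

  def initialize(data)
    @data = data
  end

  # "field:term" followed by an optional suffix such as a similarity.
  def lucene(field, suffix = nil)
    "#{field}:#{@data}#{suffix}"
  end

  # A term is boostable unless it sits under an odd number of nots.
  def boostable(negative = false)
    negative ? nil : self
  end
end

# Hypothetical stand-in for QueryParser::Not.
class SketchNot
  def initialize(data)
    @data = data
  end

  def lucene(field, suffix = nil)
    "-#{@data.lucene(field, suffix)}"
  end

  # Flip the flag on the way down the tree.
  def boostable(negative = false)
    @data.boostable(!negative)
  end
end

# Hypothetical stand-in for QueryParser::And.
class SketchAnd
  def initialize(*items)
    @items = items
  end

  # Required items get a "+" prefix; negated items already carry "-".
  def lucene(field, suffix = nil)
    parts = @items.map do |item|
      prefix = item.is_a?(SketchNot) ? '' : '+'
      prefix + item.lucene(field, suffix)
    end
    parts.size == 1 ? parts[0] : "(#{parts.join(' ')})"
  end

  def boostable(negative = false)
    @items.map { |item| item.boostable(negative) }.compact.flatten
  end
end

# Hand-built tree for: tom and dick and not harry
tree = SketchAnd.new(
  SketchTerm.new('tom'),
  SketchTerm.new('dick'),
  SketchNot.new(SketchTerm.new('harry'))
)

LUCENE_QUERY = tree.lucene('text', '~0.8')
BOOST_TERMS = tree.boostable.map(&:data)

puts LUCENE_QUERY        # prints "(+text:tom~0.8 +text:dick~0.8 -text:harry~0.8)"
puts BOOST_TERMS.inspect # prints ["tom", "dick"]
```

The field name `text` and the suffix `~0.8` are arbitrary values chosen for the illustration.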
metadata ADDED
@@ -0,0 +1,61 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: queryparser
3
+ version: !ruby/object:Gem::Version
4
+ version: 1.0.0
5
+ platform: ruby
6
+ authors:
7
+ - Peter Hickman
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+
12
+ date: 2008-12-21 00:00:00 +00:00
13
+ default_executable:
14
+ dependencies: []
15
+
16
+ description:
17
+ email: peterhi@ntlworld.com
18
+ executables: []
19
+
20
+ extensions: []
21
+
22
+ extra_rdoc_files:
23
+ - README
24
+ - COPYING
25
+ files:
26
+ - README
27
+ - COPYING
28
+ - COPYRIGHT
29
+ - Rakefile
30
+ - lib/queryparser.rb
31
+ has_rdoc: true
32
+ homepage: queryparser.rubyforge.org
33
+ post_install_message:
34
+ rdoc_options:
35
+ - --title
36
+ - QueryParser
37
+ - --charset
38
+ - utf-8
39
+ require_paths:
40
+ - lib
41
+ required_ruby_version: !ruby/object:Gem::Requirement
42
+ requirements:
43
+ - - ">="
44
+ - !ruby/object:Gem::Version
45
+ version: "0"
46
+ version:
47
+ required_rubygems_version: !ruby/object:Gem::Requirement
48
+ requirements:
49
+ - - ">="
50
+ - !ruby/object:Gem::Version
51
+ version: "0"
52
+ version:
53
+ requirements: []
54
+
55
+ rubyforge_project: queryparser
56
+ rubygems_version: 1.3.1
57
+ signing_key:
58
+ specification_version: 2
59
+ summary: Parse a natural language query into lucene query syntax
60
+ test_files: []
61
+