retreval 0.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/CHANGELOG +7 -0
- data/README.md +321 -0
- data/TODO +5 -0
- data/bin/retreval +5 -0
- data/example/gold_standard.yml +48 -0
- data/example/query_results.yml +23 -0
- data/lib/retreval/gold_standard.rb +424 -0
- data/lib/retreval/options.rb +66 -0
- data/lib/retreval/query_result.rb +511 -0
- data/lib/retreval/runner.rb +121 -0
- data/output_avg_precision.yml +2 -0
- data/output_statistics.yml +82 -0
- data/retreval.gemspec +16 -0
- data/test/test_gold_standard.rb +111 -0
- data/test/test_query_result.rb +166 -0
- metadata +390 -0
data/CHANGELOG
ADDED
data/README.md
ADDED
@@ -0,0 +1,321 @@
README
======

This is a simple API to evaluate information retrieval results. It allows you to load ranked and unranked query results and calculate various evaluation metrics (precision, recall, MAP, kappa) against a previously loaded gold standard.

Start this program from the command line with:

    retreval -l <gold-standard-file> -q <query-results> -f <format> -o <output-prefix>

The options are outlined when you pass no arguments and just call

    retreval

You will find further information in the RDoc documentation and the HOWTO section below.

If you want to see an example, use this command:

    retreval -l example/gold_standard.yml -q example/query_results.yml -f yaml -v


INSTALLATION
============

You can manually download the sources and build the Gem from there by `cd`ing to the folder where this README is saved and calling

    gem build retreval.gemspec

This will create a Gem file such as `retreval-0.1.gem`, which you just have to install:

    gem install retreval-0.1.gem

And you're done.


HOWTO
=====

This API supports the following evaluation tasks:

- Loading a gold standard that takes a set of documents, queries and corresponding judgements of relevancy (i.e. "Is this document relevant for this query?")
- Calculation of the _kappa measure_ for the given gold standard
- Loading ranked or unranked query results for a certain query
- Calculation of _precision_ and _recall_ for each result
- Calculation of the _F-measure_ for weighting precision and recall
- Calculation of _mean average precision_ for multiple query results
- Calculation of the _11-point precision_ and _average precision_ for ranked query results
- Printing of summary tables and results

Typically, you will want to use this Gem either standalone or within another application's context.

Standalone Usage
================

Call parameters
---------------

After installing the Gem (see INSTALLATION), you can always call `retreval` from the command line. The typical call is:

    retreval -l <gold-standard-file> -q <query-results> -f <format> -o <output-prefix>

Where you have to define the following options:

- `gold-standard-file` is a file in a specified format that includes all the judgements
- `query-results` is a file in a specified format that includes all the query results in a single file
- `format` is the format that the files will use (either "yaml" or "plain")
- `output-prefix` is the prefix of the output files that will be created

Formats
-------

Right now, we focus on the formats you can use to load data into the API. Currently, we support YAML files that must adhere to a special syntax. So, in order to load a gold standard, we need a file in the following format:

* "query" denotes the query
* "documents" are the documents judged for this query, each with:
    * "id" the ID of the document (e.g. its filename)
    * "judgements" an array of judgements, each one with:
        * "relevant" a boolean value of the judgement (relevant or not)
        * "user" an optional identifier of the user

Example file, with one query, two documents, and one judgement:

    - query: 12th air force germany 1957
      documents:
      - id: g5701s.ict21311
        judgements: []

      - id: g5701s.ict21313
        judgements:
        - relevant: false
          user: 2

So, when calling the program, specify the format as `yaml`.

For the query results, a similar format is used. Note that it is necessary to specify whether the result sets are ranked or not, as this will heavily influence the calculations. You can specify a score for each document, meaning the score that your retrieval algorithm assigned to it, but this is not necessary: the documents are always ranked in the order of their appearance, regardless of their score. Thus, in the following example, the document ending in "07" is ranked first and the one ending in "25" last, regardless of the scores.
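The gold-standard YAML above maps directly onto nested hashes and arrays when loaded with Ruby's standard `yaml` library. A minimal sketch of what a loader sees (this illustrates the file structure, not the gem's internal loading code):

```ruby
require "yaml"

# The gold-standard format from above, loaded into plain Ruby objects.
doc = YAML.load(<<~GOLD)
  - query: 12th air force germany 1957
    documents:
    - id: g5701s.ict21311
      judgements: []
    - id: g5701s.ict21313
      judgements:
      - relevant: false
        user: 2
GOLD

entry = doc.first
entry["query"]           # the query string
entry["documents"]       # an array of document hashes
entry["documents"][1]["judgements"][0]["relevant"]  # a judgement's boolean
```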
    ---
    query: 12th air force germany 1957
    ranked: true
    documents:
    - score: 0.44034874
      document: g5701s.ict21307
    - score: 0.44034874
      document: g5701s.ict21309
    - score: 0.44034874
      document: g5701s.ict21311
    - score: 0.44034874
      document: g5701s.ict21313
    - score: 0.44034874
      document: g5701s.ict21315
    - score: 0.44034874
      document: g5701s.ict21317
    - score: 0.44034874
      document: g5701s.ict21319
    - score: 0.44034874
      document: g5701s.ict21321
    - score: 0.44034874
      document: g5701s.ict21323
    - score: 0.44034874
      document: g5701s.ict21325
    ---
    query: 1612
    ranked: true
    documents:
    - score: 1.0174774
      document: g3290.np000144
    - score: 0.763108
      document: g3201b.ct000726
    - score: 0.763108
      document: g3400.ct000886
    - score: 0.6359234
      document: g3201s.ct000130
    ---

**Note**: You can also use the `plain` format, which will load the gold standard in a different way (but not the results):

    my_query    my_document_1   false
    my_query    my_document_2   true

Note that each query/document/relevancy triple is separated by tabulators. You can also add the user's ID in the fourth column if necessary.
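The plain format amounts to one tab-separated judgement per line. A minimal sketch of how such a line can be parsed (a hypothetical helper for illustration, not the gem's actual loader):

```ruby
# Minimal sketch of parsing one line of the plain gold-standard format.
# Each line holds: query <TAB> document <TAB> relevant [<TAB> user]
# (hypothetical helper, not part of the retreval API)
def parse_plain_line(line)
  query, document, relevant, user = line.chomp.split("\t")
  {
    :query    => query,
    :document => document,
    :relevant => relevant == "true",  # the file stores booleans as text
    :user     => user                 # nil when the fourth column is absent
  }
end

judgement = parse_plain_line("my_query\tmy_document_1\tfalse\n")
```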
Running the evaluation
----------------------

After you have specified the input files and the format, you can run the program. If needed, the `-v` switch will turn on verbose messages, such as information on how many judgements, documents and users there are, but this shouldn't be necessary.

The program will first load the gold standard and then calculate the statistics for each result set. The output files are automatically created and contain a YAML representation of the results.

Calculations may take a while, depending on the number of judgements and documents. With a thousand judgements, allow a few seconds for each result set.

Interpreting the output files
-----------------------------

Two output files will be created:

- `output_avg_precision.yml`
- `output_statistics.yml`

The first lists the average precision for each query in the query results file. The second lists all supported statistics for each query in the query results file.

For example, for a ranked evaluation, the first two entries of such a query result statistic look like this:
    ---
    12th air force germany 1957:
    - :precision: 0.0
      :recall: 0.0
      :false_negatives: 1
      :false_positives: 1
      :true_negatives: 2516
      :true_positives: 0
      :document: g5701s.ict21313
      :relevant: false
    - :precision: 0.0
      :recall: 0.0
      :false_negatives: 1
      :false_positives: 2
      :true_negatives: 2515
      :true_positives: 0
      :document: g5701s.ict21317
      :relevant: false

You can see the precision and recall at that specific rank, as well as the counts for the contingency table (true/false positives/negatives). The document identifier is also given.
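The precision and recall values in these entries follow the standard contingency-table definitions. A small sketch, using the counts from the second entry above, shows the relationship (illustration only, not the gem's implementation):

```ruby
# Precision and recall from contingency-table counts, as in the output above.
# Returns 0.0 when the denominator would be zero.
def precision(tp, fp)
  tp + fp == 0 ? 0.0 : tp / (tp + fp).to_f
end

def recall(tp, fn)
  tp + fn == 0 ? 0.0 : tp / (tp + fn).to_f
end

# Counts from the second entry of the example output:
tp, fp, fn = 0, 2, 1
precision(tp, fp)  # matches :precision in the output (0.0)
recall(tp, fn)     # matches :recall in the output (0.0)
```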
API Usage
=========

Using this API in another Ruby application is probably the more common use case. All you have to do is include the Gem in your Ruby or Ruby on Rails application. For details about the available methods, please refer to the API documentation generated by RDoc.

**Important**: This implementation uses the document ID, the query and the user ID as the primary keys for matching objects. This means that your documents and queries are identified by strings, so the strings should be sanitized first.

Loading the Gold Standard
-------------------------

Once you have loaded the Gem, you will probably start by creating a new gold standard.

    gold_standard = GoldStandard.new

Then, you can load judgements into this standard, either from a file or manually:

    gold_standard.load_from_yaml_file "my-file.yml"
    gold_standard.add_judgement :document => doc_id, :query => query_string, :relevant => boolean, :user => John

There is a nice shortcut for the `add_judgement` method. Both lines are essentially the same:

    gold_standard.add_judgement :document => doc_id, :query => query_string, :relevant => boolean, :user => John
    gold_standard << :document => doc_id, :query => query_string, :relevant => boolean, :user => John

Note the usage of typical Rails-style hashes for better readability (this Gem was developed for use in a Rails webapp).

Now that you have loaded the gold standard, you can do things like:

    gold_standard.contains_judgement? :document => "a document", :query => "the query"
    gold_standard.relevant? :document => "a document", :query => "the query"
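Because matching happens purely on these string keys, a light normalization step before adding judgements or results can prevent accidental mismatches between "the same" document or query written slightly differently. This is a hypothetical helper, not part of the retreval API:

```ruby
# Hypothetical sanitizer for document/query strings used as keys.
# Trims surrounding whitespace and collapses internal runs so that
# "the  query " and "the query" refer to the same gold-standard entry.
def sanitize_key(str)
  str.to_s.strip.gsub(/\s+/, " ")
end

sanitize_key("  the   query ")  # => "the query"
```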
Loading the Query Results
-------------------------

Now we want to create a new `QueryResultSet`. A query result set can contain more than one result, which is what we normally want. It is important that you specify the gold standard it belongs to.

    query_result_set = QueryResultSet.new :gold_standard => gold_standard

Just like the gold standard, a query result set can be read from a file:

    query_result_set.load_from_yaml_file "my-results-file.yml"

Alternatively, you can load the query results one by one. To do this, you have to create the results (either ranked or unranked) and then add documents:

    my_result = RankedQueryResult.new :query => "the query"
    my_result.add_document :document => "test_document 1", :score => 13
    my_result.add_document :document => "test_document 2", :score => 11
    my_result.add_document :document => "test_document 3", :score => 3

This result would be ranked and contain three documents. Documents can have a score, but this is optional. You can also create an Array of documents first and add them all at once:

    documents = Array.new
    documents << ResultDocument.new :id => "test_document 1", :score => 20
    documents << ResultDocument.new :id => "test_document 2", :score => 21
    my_result = RankedQueryResult.new :query => "the query", :documents => documents

The same applies to `UnrankedQueryResult`s. The order of ranked documents is the same as the order in which they were added to the result.

The `QueryResultSet` will now contain all the results. They are stored in an array called `query_results`, which you can access. So, to iterate over each result, you might want to use the following code:

    query_result_set.query_results.each_with_index do |result, index|
      # ...
    end

Or, more simply:

    for result in query_result_set.query_results
      # ...
    end
Calculating statistics
----------------------

Now to the interesting part: calculating statistics. As mentioned before, there is a conceptual difference between ranked and unranked results. Unranked results are much easier to calculate and thus take less CPU time.

Whether unranked or ranked, you can get the most important statistics by just calling the `statistics` method.

    statistics = my_result.statistics

In the simple case of an unranked result, you will receive a hash with the following information:

* `precision` - the precision of the results
* `recall` - the recall of the results
* `false_negatives` - number of not retrieved but relevant items
* `false_positives` - number of retrieved but nonrelevant items
* `true_negatives` - number of not retrieved and nonrelevant items
* `true_positives` - number of retrieved and relevant items

In the case of a ranked result, you will receive an Array of _n_ such Hashes, one per document. Each Hash gives you the information at a certain rank; e.g. the following two lines return the recall at the fourth rank.

    statistics = my_ranked_result.statistics
    statistics[3][:recall]

In addition to the information mentioned above, you also get for each rank:

* `document` - the ID of the document that was returned at this rank
* `relevant` - whether the document was relevant or not
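The per-rank statistics are also what the average precision reported in `output_avg_precision.yml` is built from: it averages the precision at exactly those ranks where a relevant document was returned. A minimal sketch of that definition, working from a plain per-rank relevance list rather than the gem's internals:

```ruby
# Average precision for one ranked result, sketched from a plain
# relevance list (true/false per rank). Illustrates the definition;
# the gem computes this internally from its statistics.
def average_precision(relevance)
  hits = 0
  precisions = []
  relevance.each_with_index do |relevant, rank|
    if relevant
      hits += 1
      precisions << hits / (rank + 1).to_f  # precision at this rank
    end
  end
  return 0.0 if precisions.empty?
  precisions.inject(:+) / precisions.size
end

average_precision([true, false, true, false])  # => (1/1 + 2/3) / 2
```

Mean average precision (MAP) is then simply the mean of this value over all queries in the result set.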
Calculating statistics with missing judgements
----------------------------------------------

Sometimes, you don't have judgements for all document/query pairs in the gold standard. If this happens, the results will be cleaned up first: every document in the results that doesn't have a judgement will be removed temporarily.

As an example, take the following results:

* A
* B
* C
* D

Our gold standard only contains judgements for A and C. The results will be cleaned up first, leading to:

* A
* C

With this approach, we can still provide meaningful results (for precision and recall).
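This cleanup amounts to a simple filter over the result list. A sketch of the idea (not the gem's actual code), assuming only the `contains_judgement?` interface described above:

```ruby
# Sketch of the cleanup step: keep only documents for which the gold
# standard has a judgement. `gold_standard` is assumed to respond to
# contains_judgement? as described in the API section.
def cleaned_results(documents, query, gold_standard)
  documents.select do |doc|
    gold_standard.contains_judgement? :document => doc, :query => query
  end
end
```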
Other statistics
----------------

There are several other statistics that can be calculated, for example the **F-measure**. The F-measure weighs precision and recall and takes one parameter, either "alpha" or "beta". Get the F-measure like so:

    my_result.f_measure :beta => 1

If you specify neither alpha nor beta, we will assume that beta = 1.

Another interesting measure is **Cohen's kappa**, which tells us about the inter-assessor agreement. Get the kappa statistic like this:

    gold_standard.kappa

This will calculate the average kappa over each pairwise combination of users in the gold standard.

For ranked results, one might also want to calculate an **11-point precision**. Just call the following:

    my_ranked_result.eleven_point_precision

This will return a Hash indexed by the 11 recall levels from 0 to 1 (in steps of 0.1), with the corresponding precision at each recall level.
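The beta-parameterized F-measure follows the standard definition F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where beta = 1 is the harmonic mean of precision and recall. A minimal sketch of the formula (illustration only, not the gem's `f_measure` implementation):

```ruby
# Standard F-measure from precision and recall, parameterized by beta.
# beta = 1 yields the harmonic mean of precision and recall.
def f_measure(precision, recall, beta = 1.0)
  return 0.0 if precision.zero? && recall.zero?
  b2 = beta**2
  (1 + b2) * precision * recall / (b2 * precision + recall)
end

f_measure(0.5, 0.5)  # => 0.5
```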
data/TODO
ADDED
data/bin/retreval
ADDED
@@ -0,0 +1,48 @@
- query: Example Query
  documents:
  - id: ict21307
    judgements:
    - relevant: true
      user: Bob
  - id: ict21309
    judgements:
    - relevant: false
      user: Bob
  - id: ict21311
    judgements:
    - relevant: false
      user: Bob
  - id: ict21313
    judgements:
    - relevant: false
      user: Bob
  - id: ict21315
    judgements:
    - relevant: true
      user: Bob
    - relevant: true
      user: John
  - id: ict21317
    judgements:
    - relevant: false
      user: Bob
    - relevant: false
      user: John
  - id: ict21319
    judgements:
    - relevant: false
      user: Bob
    - relevant: false
      user: John
  - id: ict21321
    judgements:
    - relevant: false
      user: John
  - id: ict21323
    judgements:
    - relevant: true
      user: John
  - id: ict21325
    judgements:
    - relevant: true
      user: John
@@ -0,0 +1,23 @@
- query: Example Query
  ranked: true
  documents:
  - score: 0.24921744
    id: ict21307
  - score: 0.1623808
    id: ict21309
  - score: 0.13997056
    id: ict21311
  - score: 0.12525019
    id: ict21313
  - score: 0.11482056
    id: ict21315
  - score: 0.1131133
    id: ict21317
  - score: 0.09897413
    id: ict21319
  - score: 0.09897413
    id: ict21321
  - score: 0.09742848
    id: ict21323
  - score: 0.09742848
    id: ict21325
@@ -0,0 +1,424 @@
module Retreval

  # A gold standard is composed of several judgements for the
  # cartesian product of documents and queries
  class GoldStandard

    attr_reader :documents, :judgements, :queries, :users

    # Creates a new gold standard. One can optionally construct the gold
    # standard with triples given. This would be a hash like:
    #   triples = {
    #     :document => "Document ID",
    #     :query => "Some query",
    #     :relevant => "true"
    #   }
    #
    # Called via:
    #   GoldStandard.new :triples => an_array_of_triples
    def initialize(args = {})
      @documents = Hash.new
      @queries = Array.new
      @judgements = Array.new
      @users = Hash.new

      # one can also construct a gold standard with everything already loaded
      unless args[:triples].nil?
        args[:triples].each do |triple|
          add_judgement(triple)
        end
      end
    end

    # Parses a YAML file adhering to the following generic standard:
    #
    # * "query" denotes the query
    # * "documents" these are the documents judged for this query
    # * "id" the ID of the document (e.g. its filename, etc.)
    # * "judgements" an array of judgements, each one with:
    # * "relevant" a boolean value of the judgement (relevant or not)
    # * "user" an optional identifier of the user
    #
    # Example file:
    #   - query: 12th air force germany 1957
    #     documents:
    #     - id: g5701s.ict21311
    #       judgements: []
    #
    #     - id: g5701s.ict21313
    #       judgements:
    #       - relevant: false
    #         user: 2
    def load_from_yaml_file(file)
      begin
        ydoc = YAML.load(File.open(file, "r"))
        ydoc.each do |entry|

          # The query is first in the hierarchy
          query = entry["query"]

          # Every query contains several documents
          documents = entry["documents"]
          documents.each do |doc|

            document = doc["id"]

            # If there are no judgements, still register the document/query pair
            if doc["judgements"].empty?
              add_judgement :document => document, :query => query, :relevant => nil, :user => nil
            else
              doc["judgements"].each do |judgement|
                relevant = judgement["relevant"]
                user = judgement["user"]

                add_judgement :document => document, :query => query, :relevant => relevant, :user => user
              end
            end

          end
        end

      rescue Exception => e
        raise "Error while parsing the YAML document: " + e.message
      end
    end

    # Parses a plaintext file adhering to the following standard:
    # Every line of text should include a triple that designates the judgement.
    # The symbols should be separated by a tabulator.
    # E.g.
    #   my_query  my_document_1  false
    #   my_query  my_document_2  true
    #
    # You can also add the user's ID in the fourth column.
    def load_from_plaintext_file(file)
      begin
        File.open(file).each do |line|
          line.chomp!
          info = line.split("\t")
          if info.length == 3
            add_judgement :query => info[0], :document => info[1], :relevant => info[2]
          elsif info.length == 4
            add_judgement :query => info[0], :document => info[1], :relevant => info[2], :user => info[3]
          end
        end
      rescue Exception => e
        raise "Error while parsing the document: " + e.message
      end
    end

    # Adds a judgement (document, query, relevancy) to the gold standard.
    # All of those are strings in the public interface.
    # The user ID is an optional parameter that can be used to measure kappa later.
    # Call this with:
    #   add_judgement :document => doc_id, :query => query_string, :relevant => boolean, :user => John
    def add_judgement(args)
      document_id = args[:document]
      query_string = args[:query]
      relevant = args[:relevant]
      user_id = args[:user]

      unless document_id.nil? or query_string.nil?
        document = Document.new :id => document_id
        query = Query.new :querystring => query_string

        # If the user exists, load it, otherwise create a new one
        if @users.has_key?(user_id)
          user = @users[user_id]
        else
          user = User.new :id => user_id unless user_id.nil?
        end

        # If there is no judgement for this combination, just add the document/query pair
        if relevant.nil?
          # TODO: improve efficiency by introducing hashes
          @documents[document_id] = document
          @queries << query unless @queries.include?(query)
          return
        end

        if user_id.nil?
          judgement = Judgement.new :document => document, :query => query, :relevant => relevant
        else
          judgement = Judgement.new :document => document, :query => query, :relevant => relevant, :user => user

          user.add_judgement(judgement)
          @users[user_id] = user
        end

        @documents[document_id] = document
        @queries << query unless @queries.include?(query)
        @judgements << judgement
      else
        # TODO: raise an ArgumentError here instead of a generic RuntimeError
        raise "Need at least a Document and a Query for creating the new entry."
      end

    end

    # This is essentially the same as adding a Judgement, so we can use this operator too.
    def <<(args)
      self.add_judgement args
    end

    # Returns true if a Document is relevant for a Query, according to this GoldStandard.
    # Called by:
    #   relevant? :document => "document ID", :query => "query"
    def relevant?(args)
      query = Query.new :querystring => args[:query]
      document = Document.new :id => args[:document]

      relevant_count = 0
      nonrelevant_count = 0

      # TODO: looks quite inefficient. Would a hash with document/query pairs as keys help?
      @judgements.each do |judgement|
        if judgement.document == document and judgement.query == query
          judgement.relevant ? relevant_count += 1 : nonrelevant_count += 1
        end
      end

      # If we didn't find any judgements, just leave it as false
      if relevant_count == 0 and nonrelevant_count == 0
        false
      else
        relevant_count >= nonrelevant_count
      end
    end

    # Returns true if this GoldStandard contains a Judgement for this Query / Document pair
    # This is called by:
    #   contains_judgement? :document => "the document ID", :query => "the query"
    def contains_judgement?(args)
      query = Query.new :querystring => args[:query]
      document = Document.new :id => args[:document]

      # TODO: a hash could improve performance here as well
      @judgements.each { |judgement| return true if judgement.document == document and judgement.query == query }

      false
    end

    # Returns true if this GoldStandard contains this Document
    # Called by:
    #   contains_document? :id => "document ID"
    def contains_document?(args)
      document_id = args[:id]
      @documents.key? document_id
    end

    # Returns true if this GoldStandard contains this Query string
    # Called by:
    #   contains_query? :querystring => "the query"
    def contains_query?(args)
      querystring = args[:querystring]
      query = Query.new :querystring => querystring
      @queries.include? query
    end

    # Returns true if this GoldStandard contains this User
    # Called by:
    #   contains_user? :id => "John Doe"
    def contains_user?(args)
      user_id = args[:id]
      @users.key? user_id
    end

    # Calculates and returns the kappa measure for this GoldStandard. It shows
    # to which degree the judges agree in their decisions.
    # See: http://nlp.stanford.edu/IR-book/html/htmledition/assessing-relevance-1.html
    def kappa

      # FIXME: This isn't very pretty, maybe there's a more ruby-esque way to do this?
      sum = 0
      count = 0

      # A repeated_combination yields all the pairwise combinations of
      # users to generate the pairwise kappa statistic. Elements are also
      # paired with themselves, so we need to skip those.
      @users.values.repeated_combination(2) do |combination|
        user1, user2 = combination[0], combination[1]
        unless user1 == user2
          kappa = pairwise_kappa(user1, user2)
          unless kappa.nil?
            puts "Kappa for User #{user1.id} and #{user2.id}: #{kappa}" if $verbose
            sum += kappa
            count += 1
          end
        end
      end

      # Without at least one pair of users, there is no kappa to average
      return nil if count.zero?

      @kappa = sum / count.to_f
      puts "Average pairwise kappa: #{@kappa}" if $verbose
      return @kappa
    end

    private

    # Calculates the pairwise kappa statistic for two users.
    # The two user objects need at least one Judgement in common.
    # Note that the kappa statistic is not really meaningful when there are
    # too few judgements in common!
    def pairwise_kappa(user1, user2)

      user1_judgements = user1.judgements.reject { |judgement| not user2.judgements.include?(judgement) }
      user2_judgements = user2.judgements.reject { |judgement| not user1.judgements.include?(judgement) }

      total_count = user1_judgements.count

      unless user1_judgements.empty? or user2_judgements.empty?

        positive_agreements = 0    # => when both judges agree positively (relevant)
        negative_agreements = 0    # => when both judges agree negatively (nonrelevant)
        negative_disagreements = 0 # => when the second judge disagrees by using "nonrelevant"
        positive_disagreements = 0 # => when the second judge disagrees by using "relevant"

        for i in 0..(user1_judgements.count - 1)
          if user1_judgements[i].relevant == true
            if user2_judgements[i].relevant == true
              positive_agreements += 1
            else
              negative_disagreements += 1
            end
          elsif user1_judgements[i].relevant == false
            if user2_judgements[i].relevant == false
              negative_agreements += 1
            else
              positive_disagreements += 1
            end
          end
        end

        # The proportion the judges agreed:
        p_agreed = (positive_agreements + negative_agreements) / total_count.to_f

        # The pooled marginals:
        p_nonrelevant = (positive_disagreements + negative_agreements * 2 + negative_disagreements) / (total_count.to_f * 2)
        # This one is the opposite of P(nonrelevant):
        # p_relevant = (positive_agreements * 2 + negative_disagreements + positive_disagreements) / (total_count.to_f * 2)
        p_relevant = 1 - p_nonrelevant

        # The probability that the judges agreed by chance
        p_agreement_by_chance = p_nonrelevant ** 2 + p_relevant ** 2

        # Finally, the pairwise kappa value.
        # If there would be a division by zero, we avoid it and return 0 right away
        if p_agreed - p_agreement_by_chance == 0
          return 0
        # In any other case, the kappa value is correct and we can return it
        else
          kappa = (p_agreed - p_agreement_by_chance) / (1 - p_agreement_by_chance)
          return kappa
        end
      end

      # If there are no common judgements, there is no kappa value to calculate
      return nil
    end

  end


  # A Query is effectively a string that is used as its ID.
  class Query

    attr_reader :querystring

    # Compares two Query objects according to their query string
    def ==(query)
      query.querystring == self.querystring
    end

    # Creates a new Query object with a specified string
    def initialize(args)
      @querystring = args[:querystring].to_s
      raise "Can not construct a Query with an empty query string" if @querystring.empty?
    end

  end

  # A Document is a generic resource that is identified by its ID (which could be anything).
  class Document

    attr_reader :id

    # Compares two Document objects according to their id
    def ==(document)
      document.id == self.id
    end

    # Creates a new Document object with a specified id
    def initialize(args)
      @id = args[:id].to_s
      raise "Can not construct a Document with an empty identifier" if @id.empty?
    end

  end

  # A Judgement references one query and one document as being relevant to each other or not.
  # It also keeps track of the User who created the Judgement, if necessary.
  class Judgement

    attr_reader :relevant, :document, :query, :user

    # Creates a new Judgement that belongs to a Query, a Document, and optionally to a User
    # Called by (note the usage of IDs, not objects):
    #   Judgement.new :document => my_doc_id, :user => my_user_id, :query => query_string, :relevant => true
    def initialize(args)
      @relevant = args[:relevant]
      @document = args[:document]
      @query = args[:query]
      @user = args[:user]
    end

    # A Judgement is considered equal to another when both are for the same Query and Document.
    # This comparison happens regardless of the user, so it is easier to generate "unique" Judgements
    # or calculate the kappa measure.
    def ==(judgement)
      self.document == judgement.document and self.query == judgement.query
    end

  end

  # A User is optional for a Judgement. Users are identified by their ID, which could be anything.
  class User

    attr_reader :id, :judgements

    # Compares two User objects according to their id
    def ==(user)
      user.id == self.id
    end

    # Creates a new User object with a specified id
    def initialize(args)
      @id = args[:id]
      @judgements = Array.new
      raise "Can not construct a User with an empty identifier" if @id.nil?
    end

    # Adds a reference to a Judgement to this User object, since this makes it
    # easier to calculate kappa later. Some users have multiple judgements for
    # the same Document/Query pair, which isn't really helpful. We therefore eliminate
    # duplicates.
    def add_judgement(judgement)
      @judgements << judgement unless @judgements.include?(judgement)
    end

  end
end