biblicit 1.0 → 2.0.3
Sign up to get free protection for your applications and to get access to all the features.
- data/.gitmodules +3 -0
- data/Gemfile +1 -1
- data/README.md +125 -30
- data/Rakefile +22 -0
- data/biblicit.gemspec +9 -7
- data/lib/biblicit/cb2bib.rb +10 -11
- data/lib/biblicit/citeseer.rb +14 -26
- data/lib/biblicit/extractor.rb +40 -19
- data/lib/biblicit/parscit.rb +38 -0
- data/parscit/.gitignore +8 -0
- data/parscit/CHANGELOG +125 -0
- data/parscit/COPYING +674 -0
- data/parscit/COPYING.LESSER +165 -0
- data/parscit/INSTALL +105 -0
- data/parscit/README +97 -0
- data/{perl/ParsCit/README.TXT → parscit/USAGE} +25 -15
- data/parscit/bin/archtest.pl +31 -0
- data/parscit/bin/citeExtract.pl +562 -0
- data/parscit/bin/conlleval.pl +315 -0
- data/parscit/bin/headExtract.pl +40 -0
- data/parscit/bin/parsHed/convert2TokenLevel.pl +138 -0
- data/parscit/bin/parsHed/keywordGen.pl +308 -0
- data/parscit/bin/parsHed/parseXmlHeader.pl +141 -0
- data/parscit/bin/parsHed/redo.parsHed.pl +198 -0
- data/parscit/bin/parsHed/tr2crfpp_parsHed.pl +521 -0
- data/parscit/bin/parseRefStrings.pl +102 -0
- data/parscit/bin/phOutput2xml.pl +223 -0
- data/parscit/bin/redo.parsCit.pl +105 -0
- data/parscit/bin/sectExtract.pl +149 -0
- data/parscit/bin/sectLabel/README +110 -0
- data/parscit/bin/sectLabel/README.txt +110 -0
- data/parscit/bin/sectLabel/genericSect/crossValidation.rb +98 -0
- data/parscit/bin/sectLabel/genericSect/extractFeature.rb +104 -0
- data/parscit/bin/sectLabel/genericSectExtract.rb +53 -0
- data/parscit/bin/sectLabel/getStructureInfo.pl +156 -0
- data/parscit/bin/sectLabel/processOmniXML.pl +1427 -0
- data/parscit/bin/sectLabel/processOmniXML_new.pl +1025 -0
- data/parscit/bin/sectLabel/processOmniXMLv2.pl +1529 -0
- data/parscit/bin/sectLabel/processOmniXMLv3.pl +964 -0
- data/parscit/bin/sectLabel/redo.sectLabel.pl +219 -0
- data/parscit/bin/sectLabel/simplifyOmniXML.pl +382 -0
- data/parscit/bin/sectLabel/single2multi.pl +190 -0
- data/parscit/bin/sectLabel/tr2crfpp.pl +158 -0
- data/parscit/bin/tr2crfpp.pl +260 -0
- data/parscit/bin/xml2train.pl +193 -0
- data/parscit/lib/CSXUtil/SafeText.pm +130 -0
- data/parscit/lib/Omni/Config.pm +93 -0
- data/parscit/lib/Omni/Omnicell.pm +263 -0
- data/parscit/lib/Omni/Omnicol.pm +292 -0
- data/parscit/lib/Omni/Omnidd.pm +328 -0
- data/parscit/lib/Omni/Omnidoc.pm +153 -0
- data/parscit/lib/Omni/Omniframe.pm +223 -0
- data/parscit/lib/Omni/Omniline.pm +423 -0
- data/parscit/lib/Omni/Omnipage.pm +282 -0
- data/parscit/lib/Omni/Omnipara.pm +232 -0
- data/parscit/lib/Omni/Omnirun.pm +303 -0
- data/parscit/lib/Omni/Omnitable.pm +336 -0
- data/parscit/lib/Omni/Omniword.pm +162 -0
- data/parscit/lib/Omni/Traversal.pm +313 -0
- data/parscit/lib/ParsCit/.PostProcess.pm.swp +0 -0
- data/parscit/lib/ParsCit/Citation.pm +737 -0
- data/parscit/lib/ParsCit/CitationContext.pm +220 -0
- data/parscit/lib/ParsCit/Config.pm +35 -0
- data/parscit/lib/ParsCit/Controller.pm +653 -0
- data/parscit/lib/ParsCit/PostProcess.pm +505 -0
- data/parscit/lib/ParsCit/PreProcess.pm +1041 -0
- data/parscit/lib/ParsCit/Tr2crfpp.pm +1195 -0
- data/parscit/lib/ParsHed/Config.pm +49 -0
- data/parscit/lib/ParsHed/Controller.pm +143 -0
- data/parscit/lib/ParsHed/PostProcess.pm +322 -0
- data/parscit/lib/ParsHed/Tr2crfpp.pm +448 -0
- data/{perl/ParsCit/lib/ParsCit/Tr2crfpp.pm → parscit/lib/ParsHed/Tr2crfpp_token.pm} +22 -21
- data/parscit/lib/SectLabel/AAMatching.pm +1949 -0
- data/parscit/lib/SectLabel/Config.pm +88 -0
- data/parscit/lib/SectLabel/Controller.pm +332 -0
- data/parscit/lib/SectLabel/PostProcess.pm +425 -0
- data/parscit/lib/SectLabel/PreProcess.pm +116 -0
- data/parscit/lib/SectLabel/Tr2crfpp.pm +1246 -0
- data/parscit/resources/parsCit.model +0 -0
- data/parscit/resources/parsCit.split.model +0 -0
- data/{perl/ParsCit → parscit}/resources/parsCitDict.txt +205 -0
- data/parscit/resources/parsHed/bigram +10 -0
- data/parscit/resources/parsHed/keywords +10 -0
- data/parscit/resources/parsHed/parsHed.model +0 -0
- data/parscit/resources/parsHed/parsHed.template +178 -0
- data/parscit/resources/sectLabel/affiliation.model +0 -0
- data/parscit/resources/sectLabel/author.model +0 -0
- data/parscit/resources/sectLabel/funcWord +320 -0
- data/parscit/resources/sectLabel/genericSect.model +0 -0
- data/parscit/resources/sectLabel/sectLabel.config +42 -0
- data/parscit/resources/sectLabel/sectLabel.configXml +42 -0
- data/parscit/resources/sectLabel/sectLabel.model +0 -0
- data/sh/convert_to_text.sh +20 -0
- data/spec/biblicit/extractor_spec.rb +121 -0
- data/spec/fixtures/Review_of_Michael_Tyes_Consciousness_Revisited.docx +0 -0
- data/spec/fixtures/critical-infrastructures.ps +63951 -0
- data/spec/fixtures/txt/E06-1050.txt +867 -0
- data/spec/fixtures/txt/sample1.txt +902 -0
- data/spec/fixtures/txt/sample2.txt +394 -0
- data/spec/spec_helper.rb +3 -0
- data/{perl → svm-header-parse}/HeaderParseService/lib/HeaderParse/API/Function.pm +2 -20
- data/{perl → svm-header-parse}/HeaderParseService/lib/HeaderParse/API/MultiClassChunking.pm +0 -7
- data/{perl → svm-header-parse}/HeaderParseService/lib/HeaderParse/API/Parser.pm +0 -2
- data/{perl → svm-header-parse}/HeaderParseService/lib/HeaderParse/API/ParserMethods.pm +0 -7
- data/{perl → svm-header-parse}/HeaderParseService/lib/HeaderParse/Config/API_Config.pm +6 -1
- data/svm-header-parse/HeaderParseService/tmp/.gitignore +4 -0
- data/svm-header-parse/extract.pl +75 -0
- metadata +351 -317
- data/perl/DocFilter/lib/DocFilter/Config.pm +0 -35
- data/perl/DocFilter/lib/DocFilter/Filter.pm +0 -51
- data/perl/FileConversionService/README.TXT +0 -11
- data/perl/FileConversionService/converters/PDFBox/pdfbox-app-1.7.1.jar +0 -0
- data/perl/FileConversionService/lib/CSXUtil/SafeText.pm +0 -140
- data/perl/FileConversionService/lib/FileConverter/CheckSum.pm +0 -77
- data/perl/FileConversionService/lib/FileConverter/Compression.pm +0 -137
- data/perl/FileConversionService/lib/FileConverter/Config.pm +0 -57
- data/perl/FileConversionService/lib/FileConverter/Controller.pm +0 -191
- data/perl/FileConversionService/lib/FileConverter/JODConverter.pm +0 -61
- data/perl/FileConversionService/lib/FileConverter/PDFBox.pm +0 -69
- data/perl/FileConversionService/lib/FileConverter/PSConverter.pm +0 -69
- data/perl/FileConversionService/lib/FileConverter/PSToText.pm +0 -88
- data/perl/FileConversionService/lib/FileConverter/Prescript.pm +0 -68
- data/perl/FileConversionService/lib/FileConverter/TET.pm +0 -75
- data/perl/FileConversionService/lib/FileConverter/Utils.pm +0 -130
- data/perl/HeaderParseService/lib/CSXUtil/SafeText.pm +0 -140
- data/perl/HeaderParseService/resources/data/EbizHeaders.txt +0 -24330
- data/perl/HeaderParseService/resources/data/EbizHeaders.txt.parsed +0 -27506
- data/perl/HeaderParseService/resources/data/EbizHeaders.txt.parsed.old +0 -26495
- data/perl/HeaderParseService/resources/data/tagged_headers.txt +0 -40668
- data/perl/HeaderParseService/resources/data/test_header.txt +0 -31
- data/perl/HeaderParseService/resources/data/test_header.txt.parsed +0 -31
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test1 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test10 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test11 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test12 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test13 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test14 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test15 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test2 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test3 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test4 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test5 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test6 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test7 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test8 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_1156237246.08016_test9 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test1 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test10 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test11 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test12 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test13 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test14 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test15 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test2 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test3 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test4 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test5 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test6 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test7 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test8 +0 -23
- data/perl/HeaderParseService/tmp/tmpVec_914027525.276114_test9 +0 -23
- data/perl/ParsCit/crfpp/traindata/parsCit.template +0 -60
- data/perl/ParsCit/crfpp/traindata/parsCit.train.data +0 -12104
- data/perl/ParsCit/crfpp/traindata/tagged_references.txt +0 -500
- data/perl/ParsCit/lib/CSXUtil/SafeText.pm +0 -140
- data/perl/ParsCit/lib/ParsCit/Citation.pm +0 -462
- data/perl/ParsCit/lib/ParsCit/CitationContext.pm +0 -132
- data/perl/ParsCit/lib/ParsCit/Config.pm +0 -46
- data/perl/ParsCit/lib/ParsCit/Controller.pm +0 -306
- data/perl/ParsCit/lib/ParsCit/PostProcess.pm +0 -367
- data/perl/ParsCit/lib/ParsCit/PreProcess.pm +0 -333
- data/perl/ParsCit/resources/parsCit.model +0 -0
- data/perl/extract.pl +0 -199
- data/spec/biblicit/cb2bib_spec.rb +0 -48
- data/spec/biblicit/citeseer_spec.rb +0 -40
- /data/{perl → svm-header-parse}/HeaderParseService/README.TXT +0 -0
- /data/{perl/DocFilter → svm-header-parse/HeaderParseService}/lib/CSXUtil/SafeText.pm +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/lib/HeaderParse/API/AssembleXMLMetadata.pm +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/lib/HeaderParse/API/LoadInformation.pm +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/lib/HeaderParse/API/NamePatternMatch.pm +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/50states +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/AddrTopWords.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/AffiTopWords.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/AffiTopWordsAll.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/ChineseSurNames.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/Csurnames.bin +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/Csurnames_spec.bin +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/DomainSuffixes.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/LabeledHeader +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/README +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/TrainMulClassLines +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/TrainMulClassLines1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/abstract.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/abstractTopWords +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/addr.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/affi.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/affis.bin +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/all_namewords_spec.bin +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/allnamewords.bin +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/cities_US.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/cities_world.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/city.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/cityname.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/country_abbr.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/countryname.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/dateTopWords +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/degree.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/email.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/excludeWords.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/female-names +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/firstNames.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/firstnames.bin +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/firstnames_spec.bin +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/intro.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/keyword.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/keywordTopWords +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/male-names +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/middleNames.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/month.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/mul +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/mul.label +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/mul.label.old +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/mul.processed +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/mulAuthor +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/mulClassStat +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/nickname.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/nicknames.bin +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/note.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/page.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/phone.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/postcode.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/pubnum.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/statename.bin +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/statename.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/states_and_abbreviations.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/stopwords +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/stopwords.bin +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/surNames.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/surnames.bin +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/surnames_spec.bin +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/A.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/B.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/C.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/D.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/E.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/F.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/G.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/H.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/I.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/J.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/K.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/L.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/M.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/N.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/O.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/P.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/Q.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/R.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/S.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/T.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/U.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/V.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/W.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/WCSelect.gif +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/X.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/Y.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/Z.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ae.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/am.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ar.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/at.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/au.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/bd.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/be.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/bg.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/bh.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/blueribbon.gif +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/bm.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/bn.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/br.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ca.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ch.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/cl.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/cn.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/co.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/cr.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/cy.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/cz.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/de.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/dean-mainlink.jpg +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/dk.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ec.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ee.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/eg.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/es.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/et.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/faq.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/fi.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/fj.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/fo.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/fr.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/geog.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/gr.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/gu.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/hk.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/hr.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/hu.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/id.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ie.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/il.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/in.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/is.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/it.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/jm.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/jo.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/jp.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/kaplan.gif +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/kr.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/kw.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/lb.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/linkbw2.gif +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/lk.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/lt.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/lu.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/lv.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ma.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/maczynski.gif +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/mirror.tar +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/mk.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/mo.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/mseawdm.gif +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/mt.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/mx.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/my.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ni.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/nl.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/no.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/nz.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/pa.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/pe.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ph.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/pl.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/pointcom.gif +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/pr.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ps.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/pt.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/recognition.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/results.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ro.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ru.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/sd.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/se.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/sg.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/si.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/sk.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/th.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/tr.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/tw.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ua.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/uk.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/univ-full.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/univ.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/uy.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/ve.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/yu.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/za.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list/zm.html +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/university_list.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/url.txt +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/webTopWords +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/database/words +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/10ContextModelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/10Modelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/11ContextModelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/11Modelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/12ContextModelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/12Modelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/13ContextModelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/13Modelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/14ContextModelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/14Modelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/15ContextModelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/15Modelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/1ContextModelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/1Modelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/2ContextModelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/2Modelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/3ContextModelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/3Modelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/4ContextModelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/4Modelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/5ContextModelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/5Modelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/6ContextModelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/6Modelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/7ContextModelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/7Modelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/8ContextModelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/8Modelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/9ContextModelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/9Modelfold1 +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/NameSpaceModel +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/NameSpaceTrainF +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/WrapperBaseFeaDict +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/WrapperContextFeaDict +0 -0
- /data/{perl → svm-header-parse}/HeaderParseService/resources/models/WrapperSpaceAuthorFeaDict +0 -0
@@ -0,0 +1,867 @@
|
|
1
|
+
A Probabilistic Answer Type Model
|
2
|
+
Christopher Pinchak
|
3
|
+
Department of Computing Science
|
4
|
+
University of Alberta
|
5
|
+
Edmonton, Alberta, Canada
|
6
|
+
pinchak@cs.ualberta.ca
|
7
|
+
Dekang Lin
|
8
|
+
Google, Inc.
|
9
|
+
1600 Amphitheatre Parkway
|
10
|
+
Mountain View, CA
|
11
|
+
lindek@google.com
|
12
|
+
Abstract
|
13
|
+
All questions are implicitly associated
|
14
|
+
with an expected answer type. Unlike
|
15
|
+
previous approaches that require a prede-
|
16
|
+
fined set of question types, we present
|
17
|
+
a method for dynamically constructing
|
18
|
+
a probability-based answer type model
|
19
|
+
for each different question. Our model
|
20
|
+
evaluates the appropriateness of a poten-
|
21
|
+
tial answer by the probability that it fits
|
22
|
+
into the question contexts. Evaluation
|
23
|
+
is performed against manual and semi-
|
24
|
+
automatic methods using a fixed set of an-
|
25
|
+
swer labels. Results show our approach to
|
26
|
+
be superior for those questions classified
|
27
|
+
as having a miscellaneous answer type.
|
28
|
+
1 Introduction
|
29
|
+
Given a question, people are usually able to form
|
30
|
+
an expectation about the type of the answer, even
|
31
|
+
if they do not know the actual answer. An accu-
|
32
|
+
rate expectation of the answer type makes it much
|
33
|
+
easier to select the answer from a sentence that
|
34
|
+
contains the query words. Consider the question
|
35
|
+
“What is the capital of Norway?” We would ex-
|
36
|
+
pect the answer to be a city and could filter out
|
37
|
+
most of the words in the following sentence:
|
38
|
+
The landed aristocracy was virtually crushed
|
39
|
+
by Hakon V, who reigned from 1299 to 1319,
|
40
|
+
and Oslo became the capital of Norway, re-
|
41
|
+
placing Bergen as the principal city of the
|
42
|
+
kingdom.
|
43
|
+
The goal of answer typing is to determine
|
44
|
+
whether a word’s semantic type is appropriate as
|
45
|
+
an answer for a question. Many previous ap-
|
46
|
+
proaches to answer typing, e.g., (Ittycheriah et al.,
|
47
|
+
2001; Li and Roth, 2002; Krishnan et al., 2005),
|
48
|
+
employ a predefined set of answer types and use
|
49
|
+
supervised learning or manually constructed rules
|
50
|
+
to classify a question according to expected an-
|
51
|
+
swer type. A disadvantage of this approach is that
|
52
|
+
there will always be questions whose answers do
|
53
|
+
not belong to any of the predefined types.
|
54
|
+
Consider the question: “What are tourist attrac-
|
55
|
+
tions in Reims?” The answer may be many things:
|
56
|
+
a church, a historic residence, a park, a famous
|
57
|
+
intersection, a statue, etc. A common method to
|
58
|
+
deal with this problem is to define a catch-all class.
|
59
|
+
This class, however, tends not to be as effective as
|
60
|
+
other answer types.
|
61
|
+
Another disadvantage of predefined answer
|
62
|
+
types is with regard to granularity. If the types
|
63
|
+
are too specific, they are more difficult to tag. If
|
64
|
+
they are too general, too many candidates may be
|
65
|
+
identified as having the appropriate type.
|
66
|
+
In contrast to previous approaches that use a su-
|
67
|
+
pervised classifier to categorize questions into a
|
68
|
+
predefined set of types, we propose an unsuper-
|
69
|
+
vised method to dynamically construct a proba-
|
70
|
+
bilistic answer type model for each question. Such
|
71
|
+
a model can be used to evaluate whether or not
|
72
|
+
a word fits into the question context. For exam-
|
73
|
+
ple, given the question “What are tourist attrac-
|
74
|
+
tions in Reims?”, we would expect the appropriate
|
75
|
+
answers to fit into the context “X is a tourist attrac-
|
76
|
+
tion.” From a corpus, we can find the words that
|
77
|
+
appeared in this context, such as:
|
78
|
+
A-Ama Temple, Aborigine, addition, Anak
|
79
|
+
Krakatau, archipelago, area, baseball,
|
80
|
+
Bletchley Park, brewery, cabaret, Cairo,
|
81
|
+
Cape Town, capital, center, ...
|
82
|
+
Using the frequency counts of these words in
|
83
|
+
the context, we construct a probabilistic model
|
84
|
+
to compute P(in(w, Γ)|w), the probability for a
|
85
|
+
word w to occur in a set of contexts Γ, given an
|
86
|
+
occurrence of w. The parameters in this model are
|
87
|
+
obtained from a large, automatically parsed, un-
|
88
|
+
labeled corpus. By asking whether a word would
|
89
|
+
occur in a particular context extracted from a ques-
|
90
|
+
393
|
91
|
+
tion, we avoid explicitly specifying a list of pos-
|
92
|
+
sible answer types. This has the added benefit
|
93
|
+
of being easily adapted to different domains and
|
94
|
+
corpora in which a list of explicit possible answer
|
95
|
+
types may be difficult to enumerate and/or identify
|
96
|
+
within the text.
|
97
|
+
The remainder of this paper is organized as fol-
|
98
|
+
lows. Section 2 discusses the work related to an-
|
99
|
+
swer typing. Section 3 discusses some of the key
|
100
|
+
concepts employed by our probabilistic model, in-
|
101
|
+
cluding word clusters and the contexts of a ques-
|
102
|
+
tion and a word. Section 4 presents our probabilis-
|
103
|
+
tic model for answer typing. Section 5 compares
|
104
|
+
the performance of our model with that of an or-
|
105
|
+
acle and a semi-automatic system performing the
|
106
|
+
same task. Finally, the concluding remarks are
|
107
|
+
made in Section 6.
|
108
|
+
2 Related Work
|
109
|
+
Light et al. (2001) performed an analysis of the
|
110
|
+
effect of multiple answer type occurrences in a
|
111
|
+
sentence. When multiple words of the same type
|
112
|
+
appear in a sentence, answer typing with fixed
|
113
|
+
types must assign each the same score. Light et
|
114
|
+
al. found that even with perfect answer sentence
|
115
|
+
identification, question typing, and semantic tag-
|
116
|
+
ging, a system could only achieve 59% accuracy
|
117
|
+
over the TREC-9 questions when using their set of
|
118
|
+
24 non-overlapping answer types. By computing
|
119
|
+
the probability of an answer candidate occurring
|
120
|
+
in the question contexts directly, we avoid having
|
121
|
+
multiple candidates with the same level of appro-
|
122
|
+
priateness as answers.
|
123
|
+
There have been a variety of approaches to de-
|
124
|
+
termine the answer types, which are also known
|
125
|
+
as Qtargets (Echihabi et al., 2003). Most previous
|
126
|
+
approaches classify the answer type of a question
|
127
|
+
as one of a set of predefined types.
|
128
|
+
Many systems construct the classification rules
|
129
|
+
manually (Cui et al., 2004; Greenwood, 2004;
|
130
|
+
Hermjakob, 2001). The rules are usually triggered
|
131
|
+
by the presence of certain words in the question.
|
132
|
+
For example, if a question contains “author” then
|
133
|
+
the expected answer type is Person.
|
134
|
+
The number of answer types as well as the num-
|
135
|
+
ber of rules can vary a great deal. For example,
|
136
|
+
(Hermjakob, 2001) used 276 rules for 122 answer
|
137
|
+
types. Greenwood (2004), on the other hand, used
|
138
|
+
46 answer types with unspecified number of rules.
|
139
|
+
The classification rules can also be acquired
|
140
|
+
with supervised learning. Ittycheriah, et al. (2001)
|
141
|
+
describe a maximum entropy based question clas-
|
142
|
+
sification scheme to classify each question as hav-
|
143
|
+
ing one of the MUC answer types. In a similar ex-
|
144
|
+
periment, Li & Roth (2002) train a question clas-
|
145
|
+
sifier based on a modified version of SNoW using
|
146
|
+
a richer set of answer types than Ittycheriah et al.
|
147
|
+
The LCC system (Harabagiu et al., 2003) com-
|
148
|
+
bines fixed types with a novel loop-back strategy.
|
149
|
+
In the event that a question cannot be classified as
|
150
|
+
one of the fixed entity types or semantic concepts
|
151
|
+
derived from WordNet (Fellbaum, 1998), the an-
|
152
|
+
swer type model backs off to a logic prover that
|
153
|
+
uses axioms derived from WordNet, along with
|
154
|
+
logic rules, to justify phrases as answers. Thus, the
|
155
|
+
LCC system is able to avoid the use of a miscel-
|
156
|
+
laneous type that often exhibits poor performance.
|
157
|
+
However, the logic prover must have sufficient ev-
|
158
|
+
idence to link the question to the answer, and gen-
|
159
|
+
eral knowledge must be encoded as axioms into
|
160
|
+
the system. In contrast, our answer type model
|
161
|
+
derives all of its information automatically from
|
162
|
+
unannotated text.
|
163
|
+
Answer types are often used as filters. It was
|
164
|
+
noted in (Radev et al., 2002) that a wrong guess
|
165
|
+
about the answer type reduces the chance for the
|
166
|
+
system to answer the question correctly by as
|
167
|
+
much as 17 times. The approach presented here
|
168
|
+
is less brittle. Even if the correct candidate does
|
169
|
+
not have the highest likelihood according to the
|
170
|
+
model, it may still be selected when the answer
|
171
|
+
extraction module takes into account other factors
|
172
|
+
such as the proximity to the matched keywords.
|
173
|
+
Furthermore, a probabilistic model makes it eas-
|
174
|
+
ier to integrate the answer type scores with scores
|
175
|
+
computed by other components in a question an-
|
176
|
+
swering system in a principled fashion.
|
177
|
+
3 Resources
|
178
|
+
Before introducing our model, we first describe
|
179
|
+
the resources used in the model.
|
180
|
+
3.1 Word Clusters
|
181
|
+
Natural language data is extremely sparse. Word
|
182
|
+
clusters are a way of coping with data sparseness
|
183
|
+
by abstracting a given word to a class of related
|
184
|
+
words. Clusters, as used by our probabilistic an-
|
185
|
+
swer typing system, play a role similar to that of
|
186
|
+
named entity types. Many methods exist for clus-
|
187
|
+
tering, e.g., (Brown et al., 1990; Cutting et al.,
|
188
|
+
1992; Pereira et al., 1993; Karypis et al., 1999).
|
189
|
+
We used the Clustering By Committee (CBC)
|
190
|
+
394
|
191
|
+
Table 1: Words and their clusters
|
192
|
+
Word Clusters
|
193
|
+
suite software, network, wireless, ...
|
194
|
+
rooms, bathrooms, restrooms, ...
|
195
|
+
meeting room, conference room, ...
|
196
|
+
ghost rabbit, squirrel, duck, elephant, frog, ...
|
197
|
+
goblins, ghosts, vampires, ghouls, ...
|
198
|
+
punk, reggae, folk, pop, hip-hop, ...
|
199
|
+
huge, larger, vast, significant, ...
|
200
|
+
coming-of-age, true-life, ...
|
201
|
+
clouds, cloud, fog, haze, mist, ...
|
202
|
+
algorithm (Pantel and Lin, 2002) on a 10 GB En-
|
203
|
+
glish text corpus to obtain 3607 clusters. The fol-
|
204
|
+
lowing is an example cluster generated by CBC:
|
205
|
+
tension, anger, anxiety, tensions, frustration,
|
206
|
+
resentment, uncertainty, confusion, conflict,
|
207
|
+
discontent, insecurity, controversy, unease,
|
208
|
+
bitterness, dispute, disagreement, nervous-
|
209
|
+
ness, sadness, despair, animosity, hostility,
|
210
|
+
outrage, discord, pessimism, anguish, ...
|
211
|
+
In the clustering generated by CBC, a word may
|
212
|
+
belong to multiple clusters. The clusters to which
|
213
|
+
a word belongs often represent the senses of the
|
214
|
+
word. Table 1 shows two example words and their
|
215
|
+
clusters.
|
216
|
+
3.2 Contexts
|
217
|
+
The context in which a word appears often im-
|
218
|
+
poses constraints on the semantic type of the word.
|
219
|
+
This basic idea has been exploited by many pro-
|
220
|
+
posals for distributional similarity and clustering,
|
221
|
+
e.g., (Church and Hanks, 1989; Lin, 1998; Pereira
|
222
|
+
et al., 1993).
|
223
|
+
Similar to Lin and Pantel (2001), we define
|
224
|
+
the contexts of a word to be the undirected paths
|
225
|
+
in dependency trees involving that word at either
|
226
|
+
the beginning or the end. The following diagram
|
227
|
+
shows an example dependency tree:
|
228
|
+
Which city hosted the 1988 Winter Olympics?
|
229
|
+
det subj
|
230
|
+
obj
|
231
|
+
NN
|
232
|
+
NN
|
233
|
+
det
|
234
|
+
The links in the tree represent dependency rela-
|
235
|
+
tionships. The direction of a link is from the head
|
236
|
+
to the modifier in the relationship. Labels associ-
|
237
|
+
ated with the links represent types of relations.
|
238
|
+
In a context, the word itself is replaced with a
|
239
|
+
variable X. We say a word is the filler of a context
|
240
|
+
if it replaces X. For example, the contexts for the
|
241
|
+
word “Olympics” in the above sentence include
|
242
|
+
the following paths:
|
243
|
+
Context of “Olympics” Explanation
|
244
|
+
X Winter
|
245
|
+
NN
|
246
|
+
Winter X
|
247
|
+
X 1988
|
248
|
+
NN
|
249
|
+
1988 X
|
250
|
+
X host
|
251
|
+
obj
|
252
|
+
host X
|
253
|
+
X host
|
254
|
+
obj
|
255
|
+
city
|
256
|
+
subj
|
257
|
+
city hosted X
|
258
|
+
In these paths, words are reduced to their root
|
259
|
+
forms and proper names are reduced to their entity
|
260
|
+
tags (we used MUC7 named entity tags).
|
261
|
+
Paths allow us to balance the specificity of con-
|
262
|
+
texts and the sparseness of data. Longer paths typ-
|
263
|
+
ically impose stricter constraints on the slot fillers.
|
264
|
+
However, they tend to have fewer occurrences,
|
265
|
+
making them more prone to errors arising from
|
266
|
+
data sparseness. We have restricted the path length
|
267
|
+
to two (involving at most three words) and require
|
268
|
+
the two ends of the path to be nouns.
|
269
|
+
We parsed the AQUAINT corpus (3GB) with
|
270
|
+
Minipar (Lin, 2001) and collected the frequency
|
271
|
+
counts of words appearing in various contexts.
|
272
|
+
Parsing and database construction is performed
|
273
|
+
off-line as the database is identical for all ques-
|
274
|
+
tions. We extracted 527,768 contexts that ap-
|
275
|
+
peared at least 25 times in the corpus. An example
|
276
|
+
context and its fillers are shown in Figure 1.
|
277
|
+
X host Olympics
|
278
|
+
subj obj
|
279
|
+
Africa 2 grant 1 readiness 2
|
280
|
+
AP 1 he 2 Rio de Janeiro 1
|
281
|
+
Argentina 1 homeland 3 Rome 1
|
282
|
+
Athens 16 IOC 1 Salt Lake City 2
|
283
|
+
Atlanta 3 Iran 2 school 1
|
284
|
+
Bangkok 1 Jakarta 1 S. Africa 1
|
285
|
+
. .. . .. . . .
|
286
|
+
decades 1 president 2 Zakopane 4
|
287
|
+
facility 1 Pusan 1
|
288
|
+
government 1 race 1
|
289
|
+
Figure 1: An example context and its fillers
|
290
|
+
3.2.1 Question Contexts
|
291
|
+
To build a probabilistic model for answer typ-
|
292
|
+
ing, we extract a set of contexts, called question
|
293
|
+
contexts, from a question. An answer is expected
|
294
|
+
to be a plausible filler of the question contexts.
|
295
|
+
Question contexts are extracted from a question
|
296
|
+
with two rules. First, if the wh-word in a ques-
|
297
|
+
tion has a trace in the parse tree, the question con-
|
298
|
+
texts are the contexts of the trace. For example, the
|
299
|
+
395
|
300
|
+
question “What do most tourists visit in Reims?”
|
301
|
+
is parsed as:
|
302
|
+
Whati
|
303
|
+
do most tourists visit ei
|
304
|
+
in Reims?
|
305
|
+
det
|
306
|
+
i
|
307
|
+
subj
|
308
|
+
det
|
309
|
+
obj
|
310
|
+
in
|
311
|
+
The symbol ei is the trace of whati. Minipar
|
312
|
+
generates the trace to indicate that the word what
|
313
|
+
is the object of visit in the deep structure of the
|
314
|
+
sentence. The following question contexts are ex-
|
315
|
+
tracted from the above question:
|
316
|
+
Context Explanation
|
317
|
+
X visit tourist
|
318
|
+
obj subj
|
319
|
+
tourist visits X
|
320
|
+
X visit Reims
|
321
|
+
obj in
|
322
|
+
visit X in Reims
|
323
|
+
The second rule deals with situations where
|
324
|
+
the wh-word is a determiner, as in the question
|
325
|
+
“Which city hosted the 1988 Winter Olympics?”
|
326
|
+
(the parse tree for which is shown in section 3.2).
|
327
|
+
In such cases, the question contexts consist of a
|
328
|
+
single context involving the noun that is modified
|
329
|
+
by the determiner. The context for the above sen-
|
330
|
+
tence is X city
|
331
|
+
subj
|
332
|
+
, corresponding to the sentence
|
333
|
+
“X is a city.” This context is used because the
|
334
|
+
question explicitly states that the desired answer is
|
335
|
+
a city. The context overrides the other contexts be-
|
336
|
+
cause the question explicitly states the desired an-
|
337
|
+
swer type. Experimental results have shown that
|
338
|
+
using this context in conjunction with other con-
|
339
|
+
texts extracted from the question produces lower
|
340
|
+
performance than using this context alone.
|
341
|
+
In the event that a context extracted from a ques-
|
342
|
+
tion is not found in the database, we shorten the
|
343
|
+
context in one of two ways. We start by replac-
|
344
|
+
ing the word at the end of the path with a wildcard
|
345
|
+
that matches any word. If this fails to yield en-
|
346
|
+
tries in the context database, we shorten the con-
|
347
|
+
text to length one and replace the end word with
|
348
|
+
automatically determined similar words instead of
|
349
|
+
a wildcard.
|
350
|
+
3.2.2 Candidate Contexts
|
351
|
+
Candidate contexts are very similar in form to
|
352
|
+
question contexts, save for one important differ-
|
353
|
+
ence. Candidate contexts are extracted from the
|
354
|
+
parse trees of the answer candidates rather than the
|
355
|
+
question. In natural language, some words may
|
356
|
+
be polysemous. For example, Washington may re-
|
357
|
+
fer to a person, a city, or a state. The occurrences
|
358
|
+
of Washington in “Washington’s descendants” and
|
359
|
+
“suburban Washington” should not be given the
|
360
|
+
same score when the question is seeking a loca-
|
361
|
+
tion. Given that the sense of a word is largely de-
|
362
|
+
termined by its local context (Choueka and Lusig-
|
363
|
+
nan, 1985), candidate contexts allow the model to
|
364
|
+
take into account the candidate answers’ senses
|
365
|
+
implicitly.
|
366
|
+
4 Probabilistic Model
|
367
|
+
The goal of an answer typing model is to evalu-
|
368
|
+
ate the appropriateness of a candidate word as an
|
369
|
+
answer to the question. If we assume that a set
|
370
|
+
of answer candidates is provided to our model by
|
371
|
+
some means (e.g., words comprising documents
|
372
|
+
extracted by an information retrieval engine), we
|
373
|
+
wish to compute the value P(in(w, ΓQ)|w). That
|
374
|
+
is, the appropriateness of a candidate answer w is
|
375
|
+
proportional to the probability that it will occur in
|
376
|
+
the question contexts ΓQ extracted from the ques-
|
377
|
+
tion.
|
378
|
+
To mitigate data sparseness, we can introduce
|
379
|
+
a hidden variable C that represents the clusters to
|
380
|
+
which the candidate answer may belong. As a can-
|
381
|
+
didate may belong to multiple clusters, we obtain:
|
382
|
+
P(in(w, ΓQ)|w) =
|
383
|
+
X
|
384
|
+
C
|
385
|
+
P(in(w, ΓQ), C|w) (1)
|
386
|
+
=
|
387
|
+
X
|
388
|
+
C
|
389
|
+
P(C|w)P(in(w, ΓQ)|C, w) (2)
|
390
|
+
Given that a word appears, we assume that it has
|
391
|
+
the same probability to appear in a context as all
|
392
|
+
other words in the same cluster. Therefore:
|
393
|
+
P(in(w, ΓQ)|C, w) ≈ P(in(C, ΓQ)|C) (3)
|
394
|
+
We can now rewrite the equation in (2) as:
|
395
|
+
P(in(w, ΓQ)|w) ≈
|
396
|
+
X
|
397
|
+
C
|
398
|
+
P(C|w)P(in(C, ΓQ)|C) (4)
|
399
|
+
This equation splits our model into two parts:
|
400
|
+
one models which clusters a word belongs to and
|
401
|
+
the other models how appropriate a cluster is to
|
402
|
+
the question contexts. When ΓQ consists of multi-
|
403
|
+
ple contexts, we make the naïve Bayes assumption
|
404
|
+
that each individual context γQ ∈ ΓQ is indepen-
|
405
|
+
dent of all other contexts given the cluster C.
|
406
|
+
P(in(w, ΓQ)|w) ≈
|
407
|
+
X
|
408
|
+
C
|
409
|
+
P(C|w)
|
410
|
+
Y
|
411
|
+
γQ∈ΓQ
|
412
|
+
P(in(C, γQ)|C) (5)
|
413
|
+
Equation (5) needs the parameters P(C|w) and
|
414
|
+
P(in(C, γQ)|C), neither of which are directly
|
415
|
+
available from the context-filler database. We will
|
416
|
+
discuss the estimation of these parameters in Sec-
|
417
|
+
tion 4.2.
|
418
|
+
396
|
419
|
+
4.1 Using Candidate Contexts
|
420
|
+
The previous model assigns the same likelihood to
|
421
|
+
every instance of a given word. As we noted in
|
422
|
+
section 3.2.2, a word may be polysemous. To take
|
423
|
+
into account a word’s context, we can instead com-
|
424
|
+
pute P(in(w, ΓQ)|w, in(w, Γw)), where Γw is the
|
425
|
+
set of contexts for the candidate word w in a re-
|
426
|
+
trieved passage.
|
427
|
+
By introducing word clusters as intermediate
|
428
|
+
variables as before and making a similar assump-
|
429
|
+
tion as in equation (3), we obtain:
|
430
|
+
P(in(w, ΓQ)|w, in(w, Γw))
|
431
|
+
=
|
432
|
+
X
|
433
|
+
C
|
434
|
+
P(in(w, ΓQ), C|w, in(w, Γw)) (6)
|
435
|
+
≈
|
436
|
+
X
|
437
|
+
C
|
438
|
+
P(C|w, in(w, Γw))P(in(C, ΓQ)|C) (7)
|
439
|
+
Like equation (4), equation (7) partitions the
|
440
|
+
model into two parts. Unlike P(C|w) in equation
|
441
|
+
(4), the probability of the cluster is now based on
|
442
|
+
the particular occurrence of the word in the candi-
|
443
|
+
date contexts. It can be estimated by:
|
444
|
+
P(C|w, in(w, Γw))
|
445
|
+
=
|
446
|
+
P(in(w, Γw)|w, C)P(w, C)
|
447
|
+
P(in(w, Γw)|w)P(w)
|
448
|
+
(8)
|
449
|
+
≈
|
450
|
+
Y
|
451
|
+
γw∈Γw
|
452
|
+
P(in(w, γw)|w, C)
|
453
|
+
Y
|
454
|
+
γw∈Γw
|
455
|
+
P(in(w, γw)|w)
|
456
|
+
× P(C|w) (9)
|
457
|
+
=
|
458
|
+
Y
|
459
|
+
γw∈Γw
|
460
|
+
„
|
461
|
+
P(C|w, in(w, γw))
|
462
|
+
P(C|w)
|
463
|
+
«
|
464
|
+
× P(C|w) (10)
|
465
|
+
4.2 Estimating Parameters
|
466
|
+
Our probabilistic model requires the parameters
|
467
|
+
P(C|w), P(C|w, in(w, γ)), and P(in(C, γ)|C),
|
468
|
+
where w is a word, C is a cluster that w belongs to,
|
469
|
+
and γ is a question or candidate context. This sec-
|
470
|
+
tion explains how these parameters are estimated
|
471
|
+
without using labeled data.
|
472
|
+
The context-filler database described in Sec-
|
473
|
+
tion 3.2 provides the joint and marginal fre-
|
474
|
+
quency counts of contexts and words (|in(γ, w)|,
|
475
|
+
|in(∗, γ)| and |in(w, ∗)|). These counts al-
|
476
|
+
low us to compute the probabilities P(in(w, γ)),
|
477
|
+
P(in(w, ∗)), and P(in(∗, γ)). We can also com-
|
478
|
+
pute P(in(w, γ)|w), which is smoothed with add-
|
479
|
+
one smoothing (see equation (11) in Figure 2).
|
480
|
+
The estimation of P(C|w) presents a challenge.
|
481
|
+
We have no corpus from which we can directly
|
482
|
+
measure P(C|w) because word instances are not
|
483
|
+
labeled with their clusters.
|
484
|
+
P(in(w, γ)|w) =
|
485
|
+
|in(w, γ)| + P(in(∗, γ))
|
486
|
+
|in(w, ∗)| + 1
|
487
|
+
(11)
|
488
|
+
Pu(C|w) =
|
489
|
+
(
|
490
|
+
1
|
491
|
+
|{C |w∈C }|
|
492
|
+
if w ∈ C,
|
493
|
+
0 otherwise
|
494
|
+
(12)
|
495
|
+
P(C|w) =
|
496
|
+
X
|
497
|
+
w ∈S(w)
|
498
|
+
sim(w, w ) × Pu(C|w )
|
499
|
+
X
|
500
|
+
{C |w∈C },
|
501
|
+
w ∈S(w)
|
502
|
+
sim(w, w ) × Pu(C |w )
|
503
|
+
(13)
|
504
|
+
P(in(C, γ)|C) =
|
505
|
+
X
|
506
|
+
w ∈C
|
507
|
+
P(C|w ) × |in(w , γ)| + P(in(∗, γ))
|
508
|
+
X
|
509
|
+
w ∈C
|
510
|
+
P(C|w ) × |in(w , ∗)| + 1
|
511
|
+
(14)
|
512
|
+
Figure 2: Probability estimation
|
513
|
+
We use the average weighted “guesses” of the
|
514
|
+
top similar words of w to compute P(C|w) (see
|
515
|
+
equation 13). The intuition is that if w and w
|
516
|
+
are similar words, P(C|w ) and P(C|w) tend
|
517
|
+
to have similar values. Since we do not know
|
518
|
+
P(C|w ) either, we substitute it with uniform dis-
|
519
|
+
tribution Pu(C|w ) as in equation (12) of Fig-
|
520
|
+
ure 2. Although Pu(C|w ) is a very crude guess,
|
521
|
+
the weighted average of a set of such guesses can
|
522
|
+
often be quite accurate.
|
523
|
+
The similarities between words are obtained as
|
524
|
+
a byproduct of the CBC algorithm. For each word,
|
525
|
+
we use S(w) to denote the top-n most similar
|
526
|
+
words (n=50 in our experiments) and sim(w, w )
|
527
|
+
to denote the similarity between words w and w .
|
528
|
+
The following is a sample similar word list for the
|
529
|
+
word suit:
|
530
|
+
S(suit) = {lawsuit 0.49, suits 0.47, com-
|
531
|
+
plaint 0.29, lawsuits 0.27, jacket 0.25, coun-
|
532
|
+
tersuit 0.24, counterclaim 0.24, pants 0.24,
|
533
|
+
trousers 0.22, shirt 0.21, slacks 0.21, case
|
534
|
+
0.21, pantsuit 0.21, shirts 0.20, sweater 0.20,
|
535
|
+
coat 0.20, ...}
|
536
|
+
The estimation for P(C|w, in(w, γw)) is sim-
|
537
|
+
ilar to that of P(C|w) except that instead of all
|
538
|
+
w ∈ S(w), we instead use {w |w ∈ S(w) ∧
|
539
|
+
in(w , γw)}. By only looking at a particular con-
|
540
|
+
text γw, we may obtain a different distribution over
|
541
|
+
C than P(C|w) specifies. In the event that the data
|
542
|
+
are too sparse to estimate P(C|w, in(w, γw)), we
|
543
|
+
fall back to using P(C|w).
|
544
|
+
P(in(C, γ)|C) is computed in (14) by assum-
|
545
|
+
ing each instance of w contains a fractional in-
|
546
|
+
stance of C and the fractional count is P(C|w).
|
547
|
+
Again, add-one smoothing is used.
|
548
|
+
397
|
549
|
+
System Median % Top 1% Top 5% Top 10% Top 50%
|
550
|
+
Oracle 0.7% 89 (57%) 123 (79%) 131 (85%) 154 (99%)
|
551
|
+
Frequency 7.7% 31 (20%) 67 (44%) 86 (56%) 112 (73%)
|
552
|
+
Our model 1.2% 71 (46%) 106 (69%) 119 (77%) 146 (95%)
|
553
|
+
no cand. contexts 2.2% 58 (38%) 102 (66%) 113 (73%) 145 (94%)
|
554
|
+
ANNIE 4.0% 54 (35%) 79 (51%) 93 (60%) 123 (80%)
|
555
|
+
Table 2: Summary of Results
|
556
|
+
5 Experimental Setup & Results
|
557
|
+
We evaluate our answer typing system by using
|
558
|
+
it to filter the contents of documents retrieved by
|
559
|
+
the information retrieval portion of a question an-
|
560
|
+
swering system. Each answer candidate in the set
|
561
|
+
of documents is scored by the answer typing sys-
|
562
|
+
tem and the list is sorted in descending order of
|
563
|
+
score. We treat the system as a filter and observe
|
564
|
+
the proportion of candidates that must be accepted
|
565
|
+
by the filter so that at least one correct answer is
|
566
|
+
accepted. A model that allows a low percentage
|
567
|
+
of candidates to pass while still allowing at least
|
568
|
+
one correct answer through is favorable to a model
|
569
|
+
in which a high number of candidates must pass.
|
570
|
+
This represents an intrinsic rather than extrinsic
|
571
|
+
evaluation (Mollá and Hutchinson, 2003) that we
|
572
|
+
believe illustrates the usefulness of our model.
|
573
|
+
The evaluation data consist of 154 questions
|
574
|
+
from the TREC-2003 QA Track (Voorhees, 2003)
|
575
|
+
satisfying the following criteria, along with the top
|
576
|
+
10 documents returned for each question as iden-
|
577
|
+
tified by NIST using the PRISE1 search engine.
|
578
|
+
• the question begins with What, Which, or
|
579
|
+
Who. We restricted the evaluation to such ques-
|
580
|
+
tions because our system is designed to deal
|
581
|
+
with questions whose answer types are often
|
582
|
+
semantically open-ended noun phrases.
|
583
|
+
• There exists an entry for the question in the an-
|
584
|
+
swer patterns provided by Ken Litkowski2.
|
585
|
+
• One of the top-10 documents returned by
|
586
|
+
PRISE contains a correct answer.
|
587
|
+
We compare the performance of our prob-
|
588
|
+
abilistic model with that of two other sys-
|
589
|
+
tems. Both comparison systems make use of a
|
590
|
+
small, predefined set of manually-assigned MUC-
|
591
|
+
7 named-entity types (location, person, organiza-
|
592
|
+
tion, cardinal, percent, date, time, duration, mea-
|
593
|
+
sure, money) augmented with thing-name (proper
|
594
|
+
1
|
595
|
+
www.itl.nist.gov/iad/894.02/works/papers/zp2/zp2.html
|
596
|
+
2
|
597
|
+
trec.nist.gov/data/qa/2003 qadata/03QA.tasks/t12.pats.txt
|
598
|
+
names of inanimate objects) and miscellaneous
|
599
|
+
(a catch-all answer type of all other candidates).
|
600
|
+
Some examples of thing-name are Guinness Book
|
601
|
+
of World Records, Thriller, Mars Pathfinder, and
|
602
|
+
Grey Cup. Examples of miscellaneous answers are
|
603
|
+
copper, oil, red, and iris.
|
604
|
+
The difference between the comparison systems is
|
605
|
+
with respect to how entity types are assigned to the
|
606
|
+
words in the candidate documents. We make use
|
607
|
+
of the ANNIE (Maynard et al., 2002) named entity
|
608
|
+
recognition system, along with a manually assigned
|
609
|
+
“oracle” strategy, to assign types to candidate an-
|
610
|
+
swers. In each case, the score for a candidate is
|
611
|
+
either 1 if it is tagged as the same type as the ques-
|
612
|
+
tion or 0 otherwise. With this scoring scheme pro-
|
613
|
+
ducing a sorted list we can compute the probability
|
614
|
+
of the first correct answer appearing at rank R = k
|
615
|
+
as follows:
|
616
|
+
P(R = k) =
|
617
|
+
k−2Y
|
618
|
+
i=0
|
619
|
+
„
|
620
|
+
t − c − i
|
621
|
+
t − i
|
622
|
+
«
|
623
|
+
c
|
624
|
+
t − k + 1
|
625
|
+
(15)
|
626
|
+
where t is the number of unique candidate answers
|
627
|
+
that are of the appropriate type and c is the number
|
628
|
+
of unique candidate answers that are correct.
|
629
|
+
Using the probabilities in equation (15), we
|
630
|
+
compute the expected rank, E(R), of the first cor-
|
631
|
+
rect answer of a given question in the system as:
|
632
|
+
E(R) =
|
633
|
+
t−c+1X
|
634
|
+
k=1
|
635
|
+
kP(R = k) (16)
|
636
|
+
Answer candidates are the set of ANNIE-
|
637
|
+
identified tokens with stop words and punctuation
|
638
|
+
removed. This yields between 900 and 8000 can-
|
639
|
+
didates for each question, depending on the top 10
|
640
|
+
documents returned by PRISE. The oracle system
|
641
|
+
represents an upper bound on using the predefined
|
642
|
+
set of answer types. The ANNIE system repre-
|
643
|
+
sents a more realistic expectation of performance.
|
644
|
+
The median percentage of candidates that are
|
645
|
+
accepted by a filter over the questions of our eval-
|
646
|
+
uation data provides one measure of performance
|
647
|
+
and is preferred to the average because of the ef-
|
648
|
+
fect of large values on the average. In QA, a sys-
|
649
|
+
tem accepting 60% of the candidates is not signif-
|
650
|
+
icantly better or worse than one accepting 100%,
|
651
|
+
398
|
652
|
+
System Measure
|
653
|
+
Question Type
|
654
|
+
All Location Person Organization Thing-Name Misc Other
|
655
|
+
(154) (57) (17) (19) (17) (37) (7)
|
656
|
+
Our model
|
657
|
+
Median 1.2% 0.8% 2.0% 1.3% 3.7% 3.5% 12.2%
|
658
|
+
Top 1% 71 34 6 9 7 13 2
|
659
|
+
Top 5% 106 53 11 11 10 19 2
|
660
|
+
Top 10% 119 55 12 17 10 22 3
|
661
|
+
Top 50% 146 56 16 18 17 34 5
|
662
|
+
Oracle
|
663
|
+
Median 0.7% 0.4% 1.0% 0.3% 0.4% 16.0% 0.3%
|
664
|
+
Top 1% 89 44 8 16 14 1 6
|
665
|
+
Top 5% 123 57 17 19 17 6 7
|
666
|
+
Top 10% 131 57 17 19 17 14 7
|
667
|
+
Top 50% 154 57 17 19 17 37 7
|
668
|
+
ANNIE
|
669
|
+
Median 4.0% 0.6% 1.4% 6.1% 100% 16.7% 50.0%
|
670
|
+
Top 1% 54 39 5 7 0 0 3
|
671
|
+
Top 5% 79 53 12 9 0 2 3
|
672
|
+
Top 10% 93 54 13 11 0 12 3
|
673
|
+
Top 50% 123 56 16 15 5 28 3
|
674
|
+
Table 3: Detailed breakdown of performance
|
675
|
+
but the effect on average is quite high. Another
|
676
|
+
measure is to observe the number of questions
|
677
|
+
with at least one correct answer in the top N% for
|
678
|
+
various values of N. By examining the number of
|
679
|
+
correct answers found in the top N% we can better
|
680
|
+
understand what an effective cutoff would be.
|
681
|
+
The overall results of our comparison can be
|
682
|
+
found in Table 2. We have added the results of
|
683
|
+
a system that scores candidates based on their fre-
|
684
|
+
quency within the document as a comparison with
|
685
|
+
a simple, yet effective, strategy. The second col-
|
686
|
+
umn is the median percentage of where the highest
|
687
|
+
scored correct answer appears in the sorted candi-
|
688
|
+
date list. Low percentage values mean the answer
|
689
|
+
is usually found high in the sorted list. The re-
|
690
|
+
maining columns list the number of questions that
|
691
|
+
have a correct answer somewhere in the top N%
|
692
|
+
of their sorted lists. This is meant to show the ef-
|
693
|
+
fects of imposing a strict cutoff prior to running
|
694
|
+
the answer type model.
|
695
|
+
The oracle system performs best, as it bene-
|
696
|
+
fits from both manual question classification and
|
697
|
+
manual entity tagging. If entity assignment is
|
698
|
+
performed by an automatic system (as it is for
|
699
|
+
ANNIE), the performance drops noticeably. Our
|
700
|
+
probabilistic model performs better than ANNIE
|
701
|
+
and achieves approximately 2/3 of the perfor-
|
702
|
+
mance of the oracle system. Table 2 also shows
|
703
|
+
that the use of candidate contexts increases the
|
704
|
+
performance of our answer type model.
|
705
|
+
Table 3 shows the performance of the oracle
|
706
|
+
system, our model, and the ANNIE system broken
|
707
|
+
down by manually-assigned answer types. Due
|
708
|
+
to insufficient numbers of questions, the cardinal,
|
709
|
+
percent, time, duration, measure, and money types
|
710
|
+
are combined into an “Other” category. When
|
711
|
+
compared with the oracle system, our model per-
|
712
|
+
forms worse overall for questions of all types ex-
|
713
|
+
cept for those seeking miscellaneous answers. For
|
714
|
+
miscellaneous questions, the oracle identifies all
|
715
|
+
tokens that do not belong to one of the other
|
716
|
+
known categories as possible answers. For all
|
717
|
+
questions of non-miscellaneous type, only a small
|
718
|
+
subset of the candidates are marked appropriate.
|
719
|
+
In particular, our model performs worse than the
|
720
|
+
oracle for questions seeking persons and thing-
|
721
|
+
names. Person questions often seek rare person
|
722
|
+
names, which occur in few contexts and are diffi-
|
723
|
+
cult to reliably cluster. Thing-name questions are
|
724
|
+
easy for a human to identify but difficult for an au-
|
725
|
+
tomatic system to identify. Thing-names are a di-
|
726
|
+
verse category and are not strongly associated with
|
727
|
+
any identifying contexts.
|
728
|
+
Our model outperforms the ANNIE system in
|
729
|
+
general, and for questions seeking organizations,
|
730
|
+
thing-names, and miscellaneous targets in partic-
|
731
|
+
ular. ANNIE may have low coverage on organi-
|
732
|
+
zation names, resulting in reduced performance.
|
733
|
+
Like the oracle, ANNIE treats all candidates not
|
734
|
+
assigned one of the categories as appropriate for
|
735
|
+
miscellaneous questions. Because ANNIE cannot
|
736
|
+
identify thing-names, they are treated as miscella-
|
737
|
+
neous. ANNIE shows low performance on thing-
|
738
|
+
names because words incorrectly assigned types
|
739
|
+
are sorted to the bottom of the list for miscella-
|
740
|
+
neous and thing-name questions. If a correct an-
|
741
|
+
swer is incorrectly assigned a type it will be sorted
|
742
|
+
near the bottom, resulting in a poor score.
|
743
|
+
399
|
744
|
+
6 Conclusions
|
745
|
+
We have presented an unsupervised probabilistic
|
746
|
+
answer type model. Our model uses contexts de-
|
747
|
+
rived from the question and the candidate answer
|
748
|
+
to calculate the appropriateness of a candidate an-
|
749
|
+
swer. Statistics gathered from a large corpus of
|
750
|
+
text are used in the calculation, and the model is
|
751
|
+
constructed to exploit these statistics without be-
|
752
|
+
ing overly specific or overly general.
|
753
|
+
The method presented here avoids the use of an
|
754
|
+
explicit list of answer types. Explicit answer types
|
755
|
+
can exhibit poor performance, especially for those
|
756
|
+
questions not fitting one of the types. They must
|
757
|
+
also be redefined when either the domain or corpus
|
758
|
+
substantially changes. By avoiding their use, our
|
759
|
+
answer typing method may be easier to adapt to
|
760
|
+
different corpora and question answering domains
|
761
|
+
(such as bioinformatics).
|
762
|
+
In addition to operating as a stand-alone answer
|
763
|
+
typing component, our system can be combined
|
764
|
+
with other existing answer typing strategies, es-
|
765
|
+
pecially in situations in which a catch-all answer
|
766
|
+
type is used. Our experimental results show that
|
767
|
+
our probabilistic model outperforms the oracle and
|
768
|
+
a system using automatic named entity recognition
|
769
|
+
under such circumstances. The performance of
|
770
|
+
our model is better than that of the semi-automatic
|
771
|
+
system, which is a better indication of the expected
|
772
|
+
performance of a comparable real-world answer
|
773
|
+
typing system.
|
774
|
+
Acknowledgments
|
775
|
+
The authors would like to thank the anonymous re-
|
776
|
+
viewers for their helpful comments on improving
|
777
|
+
the paper. The first author is supported by the Nat-
|
778
|
+
ural Sciences and Engineering Research Council
|
779
|
+
of Canada, the Alberta Ingenuity Fund, and the Al-
|
780
|
+
berta Informatics Circle of Research Excellence.
|
781
|
+
References
|
782
|
+
P.F. Brown, V.J. Della Pietra, P.V. deSouza, J.C. Lai, and R.L.
|
783
|
+
Mercer. 1990. Class-based n-gram Models of Natural
|
784
|
+
Language. Computational Linguistics, 16(2):79–85.
|
785
|
+
Y. Choueka and S. Lusignan. 1985. Disambiguation by Short
|
786
|
+
Contexts. Computer and the Humanities, 19:147–157.
|
787
|
+
K. Church and P. Hanks. 1989. Word Association Norms,
|
788
|
+
Mutual Information, and Lexicography. In Proceedings
|
789
|
+
of ACL-89, pages 76–83, Vancouver, British Columbia,
|
790
|
+
Canada.
|
791
|
+
H. Cui, K. Li, R. Sun, T-S. Chua, and M-Y. Kan. 2004. Na-
|
792
|
+
tional University of Singapore at the TREC-13 Question
|
793
|
+
Answering Main Task. In Notebook of TREC 2004, pages
|
794
|
+
34–42, Gaithersburg, Maryland.
|
795
|
+
D.R. Cutting, D. Karger, J. Pedersen, and J.W. Tukey. 1992.
|
796
|
+
Scatter/Gather: A Cluster-based Approach to Browsing
|
797
|
+
Large Document Collections. In Proceedings of SIGIR-
|
798
|
+
92, pages 318–329, Copenhagen, Denmark.
|
799
|
+
A. Echihabi, U. Hermjakob, E. Hovy, D. Marcu, E. Melz,
|
800
|
+
and D. Ravichandran. 2003. Multiple-Engine Question
|
801
|
+
Answering in TextMap. In Proceedings of TREC 2003,
|
802
|
+
pages 772–781, Gaithersburg, Maryland.
|
803
|
+
C. Fellbaum. 1998. WordNet: An Electronic Lexical
|
804
|
+
Database. MIT Press, Cambridge, Massachusetts.
|
805
|
+
M.A. Greenwood. 2004. AnswerFinder: Question Answer-
|
806
|
+
ing from your Desktop. In Proceedings of the Seventh
|
807
|
+
Annual Colloquium for the UK Special Interest Group
|
808
|
+
for Computational Linguistics (CLUK ’04), University of
|
809
|
+
Birmingham, UK.
|
810
|
+
S. Harabagiu, D. Moldovan, C. Clark, M. Bowden,
|
811
|
+
J. Williams, and J. Bensley. 2003. Answer Mining by
|
812
|
+
Combining Extraction Techniques with Abductive Rea-
|
813
|
+
soning. In Proceedings of TREC 2003, pages 375–382,
|
814
|
+
Gaithersburg, Maryland.
|
815
|
+
U. Hermjakob. 2001. Parsing and Question Classification for
|
816
|
+
Question Answering. In Proceedings of the ACL Work-
|
817
|
+
shop on Open-Domain Question Answering, Toulouse,
|
818
|
+
France.
|
819
|
+
A. Ittycheriah, M. Franz, W-J. Zhu, and A. Ratnaparkhi.
|
820
|
+
2001. Question Answering Using Maximum Entropy
|
821
|
+
Components. In Proceedings of NAACL 2001, Pittsburgh,
|
822
|
+
Pennsylvania.
|
823
|
+
G. Karypis, E.-H. Han, and V. Kumar. 1999. Chameleon: A
|
824
|
+
Hierarchical Clustering Algorithm using Dynamic Model-
|
825
|
+
ing. IEEE Computer: Special Issue on Data Analysis and
|
826
|
+
Mining, 32(8):68–75.
|
827
|
+
V. Krishnan, S. Das, and S. Chakrabarti. 2005. Enhanced
|
828
|
+
Answer Type Inference from Questions using Sequential
|
829
|
+
Models. In Proceedings of HLT/EMNLP 2005, pages
|
830
|
+
315–322, Vancouver, British Columbia, Canada.
|
831
|
+
X. Li and D. Roth. 2002. Learning Question Classifiers.
|
832
|
+
In Proceedings of COLING 2002, pages 556–562, Taipei,
|
833
|
+
Taiwan.
|
834
|
+
M. Light, G. Mann, E. Riloff, and E. Breck. 2001. Analyses
|
835
|
+
for Elucidating Current Question Answering Technology.
|
836
|
+
Natural Language Engineering, 7(4):325–342.
|
837
|
+
D. Lin and P. Pantel. 2001. Discovery of Inference Rules
|
838
|
+
for Question Answering. Natural Language Engineering,
|
839
|
+
7(4):343–360.
|
840
|
+
D. Lin. 1998. Automatic Retrieval and Clustering of Similar
|
841
|
+
Words. In Proceedings of COLING-ACL 1998, Montreal,
|
842
|
+
Québec, Canada.
|
843
|
+
D. Lin. 2001. Language and Text Analysis Tools. In Pro-
|
844
|
+
ceedings of HLT 2001, pages 222–227, San Diego, Cali-
|
845
|
+
fornia.
|
846
|
+
D. Maynard, V. Tablan, H. Cunningham, C. Ursu, H. Sag-
|
847
|
+
gion, K. Bontcheva, and Y. Wilks. 2002. Architectural
|
848
|
+
Elements of Language Engineering Robustness. Natural
|
849
|
+
Language Engineering, 8(2/3):257–274.
|
850
|
+
D. Mollá and B. Hutchinson. 2003. Intrinsic versus Extrinsic
|
851
|
+
Evaluations of Parsing Systems. In Proceedings of EACL
|
852
|
+
Workshop on Evaluation Initiatives in Natural Language
|
853
|
+
Processing, pages 43–50, Budapest, Hungary.
|
854
|
+
P. Pantel and D. Lin. 2002. Document Clustering with Com-
|
855
|
+
mittees. In Proceedings of SIGIR 2002, pages 199–206,
|
856
|
+
Tampere, Finland.
|
857
|
+
F. Pereira, N. Tishby, and L. Lee. 1993. Distributional Clus-
|
858
|
+
tering of English Words. In Proceedings of ACL 1993,
|
859
|
+
pages 183–190.
|
860
|
+
D. Radev, W. Fan, H. Qi, H. Wu, and A. Grewal. 2002. Prob-
|
861
|
+
abilistic Question Answering on the Web. In Proceedings
|
862
|
+
of the Eleventh International World Wide Web Conference.
|
863
|
+
E.M. Voorhees. 2003. Overview of the TREC 2003 Ques-
|
864
|
+
tion Answering Track. In Proceedings of TREC 2003,
|
865
|
+
Gaithersburg, Maryland.
|
866
|
+
400
|
867
|
+
|