
Parameter Sensitivity in the Probabilistic Model for Ad-hoc Retrieval
Ben He
Department of Computing Science
University of Glasgow
Glasgow, The United Kingdom
ben@dcs.gla.ac.uk
Iadh Ounis
Department of Computing Science
University of Glasgow
Glasgow, The United Kingdom

ABSTRACT
The term frequency normalisation parameter sensitivity is an important issue in the probabilistic model for Information Retrieval. A high parameter sensitivity indicates that a slight change of the parameter value may considerably affect the retrieval performance. Therefore, a weighting model with a high parameter sensitivity is not robust enough to provide a consistent retrieval performance across different collections and queries. In this paper, we suggest that the parameter sensitivity is due to the fact that the query term weights are not adequate to differentiate informative query terms from non-informative ones. We show that query term reweighing, which is part of the relevance feedback process, can be successfully used to reduce the parameter sensitivity. Experiments on five Text REtrieval Conference (TREC) collections show that the parameter sensitivity decreases remarkably when query terms are reweighed.
Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Retrieval models
General Terms: Performance, Experimentation
Keywords: Query term reweighing, Relevance feedback, Parameter sensitivity
1. INTRODUCTION
In Information Retrieval (IR), ranking the retrieved documents in decreasing order of relevance is a crucial issue. A recent survey on the query logs of real Web search engine users concluded that users rarely look beyond the top returned documents [16]. Therefore, it is important to rank the highly relevant documents at the top of the retrieved list. Usually, the document ranking is based on a weighting model. In particular, most weighting models apply a term frequency (tf) normalisation method to normalise term frequency, i.e. the number of occurrences of the query term in the document.
Various tf normalisation methods have been proposed in the literature, e.g. the pivoted normalisation [24] in the
vector space model [23], the normalisation method of the BM25 weighting model [22], and normalisation 2 [1] and normalisation 3 [1, 14] in the Divergence from Randomness (DFR) framework [1]. All the aforementioned normalisation methods normalise term frequency according to document length, i.e. the number of tokens in the document. Each of these normalisation methods involves the use of a parameter. The setting of these parameter values usually has an important impact on the retrieval performance of an IR system¹ [5, 14, 15]. In particular, if the retrieval performance of a weighting model is sensitive to a slight change of its parameter value, the weighting model may not be robust enough to provide consistent retrieval performance. This is referred to as the parameter sensitivity issue.

¹ For instance, training a retrieval system using the PL2 weighting model [1] on the TREC 10 ad-hoc queries gives an MAP of 0.2397 on the TREC 9 queries, compared to an MAP of 0.2174 using the default (c = 1) setting. This difference is statistically significant (p <= 0.0009).
In a practical IR context, parameter sensitivity is a very important issue. Since relevance assessment and training data are not always available in a practical environment, it is crucial to ensure that the parameter value used provides a robust retrieval performance. The parameter sensitivity issue has been previously studied in the context of the language modelling approach [18, 26, 27]. In addition, several weighting models that are less sensitive than the classical ones were generated in an axiomatic approach based on parameter constraints [8, 9]. Nevertheless, little work has been done to actually reduce the parameter sensitivity of a weighting model. In this paper, we base our study on the classical BM25 probabilistic model, and the PL2 model of the Divergence from Randomness (DFR) probabilistic framework. These two models have been shown to be effective in various previous TREC experiments [7].
The main contributions of this paper are two-fold. First, we provide a better understanding and explanation of the parameter sensitivity. We argue that parameter sensitivity is caused by the existence of non-informative terms in the query. Second, we show that parameter sensitivity can be reduced by applying query term reweighing, which is part of the relevance feedback process.
The rest of this paper is organised as follows. Section 2 introduces the related work, including the BM25 and PL2 models, and previous research on the parameter sensitivity issue. Section 3 provides an explanation for the manifestation of the parameter sensitivity and suggests applying an
appropriate query term reweighing to reduce the parameter sensitivity. Section 4 describes the experimental setting and methodology, and Section 5 provides analysis and discussion of the experimental results. Finally, Section 6 concludes the paper and suggests possible future research directions.
2. RELATED WORK
In this section, we introduce the BM25 and PL2 models in Section 2.1, and briefly describe previous research on the parameter sensitivity issue in Section 2.2.
2.1 The BM25 and PL2 Probabilistic Weighting Models
As one of the most established weighting models, Okapi’s BM25 computes the relevance score of a document d for a query Q by the following formula [22]:
score(d, Q) = \sum_{t \in Q} w^{(1)} \cdot \frac{(k_1 + 1) \cdot tfn}{k_1 + tfn} \cdot qtw \quad (1)

where w^{(1)} is the idf factor, which is given by:

w^{(1)} = \log_2 \frac{N - N_t + 0.5}{N_t + 0.5}

where N is the number of documents in the whole collection, and N_t is the document frequency of term t, i.e. the number of documents containing t. qtw is the query term weight, which is given by:

qtw = \frac{(k_3 + 1) \cdot qtf}{k_3 + qtf}

where qtf is the number of occurrences of the given term in the query. k_3 is a parameter; its default setting is k_3 = 1000 [22]. tfn is the normalised term frequency of the given term t. k_1 is a parameter; its default setting is k_1 = 1.2 [22].

The tf normalisation component of the BM25 formula is:

tfn = \frac{tf}{(1 - b) + b \cdot \frac{l}{avg\_l}} \quad (2)

where l and avg_l are the document length and the average document length in the collection, respectively. tf is the term frequency in the document. The document length refers to the number of tokens in a document. b is a parameter; the default setting is b = 0.75 [22]. Singhal et al.'s pivoted normalisation, for normalising the tf · idf weight in the context of the vector space model [24], can be seen as a generalisation of the above BM25 tf normalisation component.

PL2 is one of the Divergence from Randomness (DFR) document weighting models [3]. The idea of the DFR models is to infer the importance of a query term in a document by measuring the divergence of the term's distribution in the document from its distribution in the whole collection, which is assumed to be random. In the PL2 model, this random distribution is modelled by an approximation to the Poisson distribution, with the use of the Laplace succession for normalising the relevance score. Using the PL2 model, the relevance score of a document d for a query Q is given by:

score(d, Q) = \sum_{t \in Q} qtw \cdot \frac{1}{tfn + 1} \left( tfn \cdot \log_2 \frac{tfn}{\lambda} + (\lambda - tfn) \cdot \log_2 e + 0.5 \cdot \log_2(2\pi \cdot tfn) \right) \quad (3)

where \lambda is the mean and variance of a Poisson distribution, given by \lambda = F/N. F is the frequency of the query term in the collection and N is the number of documents in the collection. The query term weight qtw is given by qtf/qtf_{max}, where qtf is the query term frequency and qtf_{max} is the maximum query term frequency among the query terms.

The normalised term frequency tfn is given by the so-called normalisation 2:

tfn = tf \cdot \log_2 \left( 1 + c \cdot \frac{avg\_l}{l} \right), \quad (c > 0) \quad (4)

where l is the document length and avg_l is the average document length in the whole collection. tf is the original term frequency. c is the parameter of normalisation 2. Its default setting is c = 7 for short queries and c = 1 for long queries [1].
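To make the two scoring functions concrete, the following is a minimal Python sketch of the per-term score contributions of BM25 (Equations (1) and (2)) and PL2 (Equations (3) and (4)). The function names, argument order, and the assumption that collection statistics (N, N_t, F, document lengths) are available from an index are ours; this is an illustration, not the implementation used in the paper's experiments.

```python
import math

def bm25_term_score(tf, qtf, N, N_t, doc_len, avg_doc_len,
                    k1=1.2, k3=1000.0, b=0.75):
    """Per-term BM25 score contribution (Equations (1) and (2))."""
    # tf normalisation component (Equation (2))
    tfn = tf / ((1 - b) + b * (doc_len / avg_doc_len))
    # idf factor w^(1)
    w1 = math.log2((N - N_t + 0.5) / (N_t + 0.5))
    # query term weight
    qtw = ((k3 + 1) * qtf) / (k3 + qtf)
    return w1 * ((k1 + 1) * tfn) / (k1 + tfn) * qtw

def pl2_term_score(tf, qtf, qtf_max, F, N, doc_len, avg_doc_len, c=1.0):
    """Per-term PL2 score contribution (Equations (3) and (4))."""
    # normalisation 2 (Equation (4))
    tfn = tf * math.log2(1 + c * (avg_doc_len / doc_len))
    lam = F / N          # mean and variance of the Poisson distribution
    qtw = qtf / qtf_max  # query term weight
    return (qtw / (tfn + 1)) * (
        tfn * math.log2(tfn / lam)
        + (lam - tfn) * math.log2(math.e)
        + 0.5 * math.log2(2 * math.pi * tfn)
    )

# Toy usage: one query term in a 500-token document (made-up statistics).
print(bm25_term_score(tf=3, qtf=1, N=100000, N_t=1200,
                      doc_len=500, avg_doc_len=400))
print(pl2_term_score(tf=3, qtf=1, qtf_max=1, F=5000, N=100000,
                     doc_len=500, avg_doc_len=400))
```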
2.2 Previous Work on Parameter Sensitivity
The parameter sensitivity issue has drawn the attention of several previous research efforts. Zhai & Lafferty addressed the parameter sensitivity issue of the smoothing technique for language modelling. They found that query length, i.e. the number of unique terms in a query, has a considerable impact on the parameter sensitivity. In particular, the parameters of the smoothing methods are very sensitive for long queries [26, 27]. Similar findings were also observed for the BM25 and PL2 weighting models [13]. Moreover, Fang et al. generated some weighting models that are less sensitive than current ones, using their axiomatic approach based on a set of parameter constraints [8, 9].
A quantitative analysis of the parameter sensitivity was conducted by Metzler in [18]. Two measures, namely Entropy (H) and Spread (S), were proposed to define the parameter sensitivity. The Entropy (H) measure is given as follows:

H = -\int_X P(opt, x) \log_2 P(opt, x) \, dx \quad (5)

where P(opt, x) is the probability of the parameter value x being the optimal one. In [18], P(opt, x) is computed using Bayes's rule.
Spread measures the flatness of a posterior distribution over a set of parameter values. It is given as follows:
S = m(max, X) - m(min, X) \quad (6)
where m(max, X) (resp. m(min, X)) is the maximum (resp. minimum) posterior over a set X of parameter values. For example, if the retrieval performance evaluation measure used is the mean average precision (MAP), m(max, X) is the maximum MAP, and m(min, X) is the minimum one. In practice, the lower Spread or Entropy is, the lower parameter sensitivity an IR model has. In addition, a low
Spread is preferred over a low Entropy in order to ensure a low parameter sensitivity [18].
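As an illustration, the following sketch computes Spread directly from Equation (6), together with a discretised Entropy over a grid of parameter values. Note that turning the measured MAP values into the posterior P(opt, x) by simple normalisation is a simplifying assumption of ours; [18] estimates this probability using Bayes's rule.

```python
import math

def spread(map_values):
    """S = m(max, X) - m(min, X), Equation (6)."""
    return max(map_values) - min(map_values)

def entropy(map_values):
    """Discretised Equation (5); P(opt, x) is approximated here by
    normalising the MAP values, a simplification of [18]."""
    total = sum(map_values)
    probs = [m / total for m in map_values]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Example: toy MAP values measured at five settings of a parameter.
maps = [0.21, 0.23, 0.24, 0.235, 0.22]
print("Spread: %.4f  Entropy: %.4f" % (spread(maps), entropy(maps)))
```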
The above-mentioned research either addressed or quantitatively analysed the parameter sensitivity issue. Nevertheless, little work has actually been done to reduce the parameter sensitivity of a weighting model. In the next section, we explain the parameter sensitivity issue, and show how to apply query term reweighing to reduce the parameter sensitivity of the probabilistic model.
3. QUERY TERM WEIGHTS AND PARAMETER SENSITIVITY
In this section, we provide an explanation for the parameter sensitivity issue. We suggest that the parameter sensitivity is caused by inadequate query term weighting. We also propose to reduce the parameter sensitivity by reweighing query terms.
As mentioned in the previous section, query length has an important impact on the parameter sensitivity of tf normalisation. In particular, the parameter setting tends to be more sensitive for long queries than for short queries [26, 27].
One of the characteristics differentiating long queries from short queries is that long queries contain many more non-informative query terms than short queries. As a consequence, for a long query, it is necessary to address the difference in informativeness among query terms in the weighting models. For example, in the BM25 and PL2 weighting models (see Equations (1) and (3)), a query term weight (qtw) measure is employed to represent the relative informativeness among query terms. Such a query term weight measure is adequate for short queries, because a short query usually consists of highly informative query terms. When the query gets longer, it is “contaminated” by non-informative query terms. Although the query term weight measure is meant to reflect the informativeness of a query term, it accounts only for the query term frequency, and it is thus not adequate to differentiate informative query terms from non-informative ones. In this case, a tf normalisation parameter providing a “harsh” normalisation is needed to neutralise the effect of non-informative query terms on the document ranking. We explain the notion of “harsh” normalisation as follows.
Harter [10] and Amati [1] suggested that document length and term frequency have a linear relationship. Such a linear relationship can be indicated by the linear correlation between these two variables. He & Ounis suggested that the purpose of tf normalisation is to adjust the linear dependence between document length and term frequency [14]. They also showed that document length and term frequency are positively correlated. However, when tf normalisation is applied, the correlation between these two variables decreases until it reaches a large negative value [14]. In Section 5, we will show that the optimised parameter value for short queries gives a small negative correlation, while that for long queries gives a relatively large negative correlation. In IR weighting models, e.g. BM25 and PL2, the relevance score usually increases with term frequency. Hence, a large negative correlation indicates that the contribution of each occurrence of a query term to the document ranking decreases rapidly as document length increases. Therefore, the smaller this correlation value is, the harsher the
tf normalisation process is. For long queries, the retrieval performance decreases radically if the parameter value used does not provide a harsh enough normalisation, which leads to a notable parameter sensitivity. One way of dealing with parameter sensitivity is to use the query term weights to address the difference in informativeness among query terms. However, in current probabilistic IR models, the query term weights depend only on the occurrences of query terms in the query, which is usually not adequate.
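To illustrate the notion of “harsh” normalisation discussed above, the following toy computation (our own, with made-up numbers, not data from the paper) shows how the correlation between document length and normalised term frequency under normalisation 2 (Equation (4)) changes with the parameter c.

```python
import math
from statistics import correlation  # requires Python 3.10+

# Toy data: raw term frequency grows with document length, so the two
# variables are positively correlated before normalisation is applied.
doc_lens = [100, 200, 400, 800, 1600]
tfs = [2, 3, 5, 8, 13]
avg_len = sum(doc_lens) / len(doc_lens)

for c in (0.5, 1.0, 7.0):
    # normalisation 2 (Equation (4))
    tfns = [tf * math.log2(1 + c * avg_len / l)
            for tf, l in zip(tfs, doc_lens)]
    print("c = %.1f  corr(length, tfn) = %+.3f"
          % (c, correlation(doc_lens, tfns)))
```

With this toy data, the correlation falls as c decreases, turning clearly negative for small c: a smaller c penalises occurrences in long documents more heavily, i.e. it normalises more harshly. This is consistent with the default settings reported above, where long queries, which need harsher normalisation, use c = 1 rather than c = 7.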
The above explanation suggests that the parameter sensitivity issue is due to the fact that the query term weights cannot adequately reflect the informativeness of each query term. To address this problem, we hypothesise that if we reweigh the query terms to achieve an adequate query term weighting, the parameter sensitivity can be reduced. A query term reweighing process takes into account each query term's distribution in one or a set of assumed highly relevant document(s), returned by the first-pass retrieval. The query terms are reweighed accordingly. Query term reweighing is usually considered as part of the so-called blind relevance feedback technique [4]; a sketch of the general process is given below. In the next sections, we conduct experiments to test the effect of query term reweighing on reducing the parameter sensitivity.
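The sketch below illustrates the general shape of such a reweighing step: run a first-pass retrieval, treat the top-ranked documents as assumed relevant, and reweigh each query term by how much its distribution in that feedback set diverges from its distribution in the whole collection. The informativeness statistic used here is a simple placeholder of our own; the actual term weighting model used in our experiments is introduced in Section 4.2.

```python
import math

def reweigh_query_terms(query_terms, feedback_docs,
                        coll_term_freq, coll_tokens):
    """Return {term: weight}, normalised so the most informative
    term gets weight 1 (mirroring qtf/qtf_max in PL2).

    feedback_docs: top-ranked documents from the first-pass
    retrieval, each given as a list of tokens (assumed relevant).
    coll_term_freq: term -> frequency in the whole collection.
    coll_tokens: total number of tokens in the collection.
    """
    fb_tokens = sum(len(doc) for doc in feedback_docs)
    weights = {}
    for t in query_terms:
        tf_fb = sum(doc.count(t) for doc in feedback_docs)
        # expected occurrences in the feedback set if t were
        # distributed as randomly as in the whole collection
        expected = coll_term_freq.get(t, 0) / coll_tokens * fb_tokens
        # placeholder informativeness: divergence of the observed
        # from the expected occurrences (not the paper's model)
        weights[t] = max(tf_fb * math.log2((tf_fb + 1) / (expected + 1)),
                         0.0)
    w_max = max(weights.values()) or 1.0
    return {t: w / w_max for t, w in weights.items()}
```

Under this placeholder statistic, a term that is frequent in the feedback documents but rare in the collection receives a high weight, while a non-informative term drifts towards zero, which is exactly the differentiation that the original qtw, based on query term frequency alone, lacks.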
4. EXPERIMENTAL SETTING AND METHODOLOGY
The purpose of our experiments is to examine the impact of query term reweighing on the parameter sensitivity. In particular, as per Section 3, we expect the use of query term reweighing to reduce the parameter sensitivity. We introduce the experimental setting in Section 4.1, the term weighting model used for query term reweighing in Section 4.2, and the experimental methodology in Section 4.3.
4.1 Experimental Setting
We experiment on five standard TREC collections. The five collections used are the disk1&2 and disk4&5 (minus the Congressional Record on disk4) of the classical TREC collections², and the WT2G [12], WT10G [11] and .GOV2 [6] Web collections³. The test queries used are the TREC topics that are numbered from 51 to 200 for disk1&2, from 301 to 450 and from 601 to 700 for disk4&5, from 401 to 450 for WT2G, from 451 to 550 for WT10G, and from 701 to 850 for .GOV2, respectively (see Table 1). All the test topics used are ad-hoc ones, which require finding as many relevant documents as possible [25].

Each TREC topic consists of three fields, i.e. title, description and narrative. We experiment with two types of queries with respect to the use of different topic fields. These two types of queries are:

Title-only (T) queries: Only the title topic field is used.

Full (TDN) queries: All the three topic fields (title, description and narrative) are used.

² Related information of disk1&2 and disk4&5 of the TREC collections can be found from the following URL: http://trec.nist.gov/data/docs_eng.html
³ Related information of these three TREC Web collections can be found from the following URL: http://ir.dcs.gla.ac.uk/test_collections/