{"title": "Lazy Learning Meets the Recursive Least Squares Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 375, "page_last": 381, "abstract": null, "full_text": "Lazy Learning Meets \n\nthe Recursive Least Squares Algorithm \n\nMauro Birattari, Gianluca Bontempi, and Hugues Bersini \n\nIridia - Universite Libre de Bruxelles \n\nBruxelles, Belgium \n\n{mbiro, gbonte, bersini} @ulb.ac.be \n\nAbstract \n\nLazy learning is a memory-based technique that, once a query is re(cid:173)\nceived, extracts a prediction interpolating locally the neighboring exam(cid:173)\nples of the query which are considered relevant according to a distance \nmeasure. In this paper we propose a data-driven method to select on a \nquery-by-query basis the optimal number of neighbors to be considered \nfor each prediction. As an efficient way to identify and validate local \nmodels, the recursive least squares algorithm is introduced in the con(cid:173)\ntext of local approximation and lazy learning. Furthermore, beside the \nwinner-takes-all strategy for model selection, a local combination of the \nmost promising models is explored. The method proposed is tested on \nsix different datasets and compared with a state-of-the-art approach. \n\n1 \n\nIntroduction \n\nLazy learning (Aha, 1997) postpones all the computation until an explicit request for a \nprediction is received. The request is fulfilled interpolating locally the examples consid(cid:173)\nered relevant according to a distance measure. Each prediction requires therefore a local \nmodeling procedure that can be seen as composed of a structural and of a parametric iden(cid:173)\ntification. The parametric identification consists in the optimization of the parameters of \nthe local approximator. 
On the other hand, structural identification involves, among other things, the selection of a family of local approximators, the selection of a metric to evaluate which examples are more relevant, and the selection of the bandwidth, which indicates the size of the region in which the data are correctly modeled by members of the chosen family of approximators. For a comprehensive tutorial on local learning and for further references see Atkeson et al. (1997). \n\nAs far as the problem of bandwidth selection is concerned, different approaches exist. The choice of the bandwidth may be performed either based on some a priori assumption or on the data themselves. A further sub-classification of data-driven approaches is of interest here. On the one hand, a constant bandwidth may be used; in this case it is set by a global optimization that minimizes an error criterion over the available dataset. On the other hand, the bandwidth may be selected locally and tailored for each query point. \n\nIn the present work, we propose a method that belongs to the latter class of local data-driven approaches. Assuming a given fixed metric and local linear approximators, the method we introduce selects the bandwidth on a query-by-query basis by means of a local leave-one-out cross-validation. The problem of bandwidth selection is reduced to the selection of the number k of neighboring examples which are given a non-zero weight in the local modeling procedure. Each time a prediction is required for a specific query point, a set of local models is identified, each including a different number of neighbors. The generalization ability of each model is then assessed through a local cross-validation procedure. Finally, a prediction is obtained either combining or selecting the different models on the basis of some statistic of their cross-validation errors. 
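The query-by-query procedure just outlined can be illustrated with a brute-force sketch (a minimal example, not the recursive implementation developed in this paper; the constant local models, the Euclidean metric, the candidate range of k, and the synthetic data are assumptions of the illustration):

```python
import numpy as np

def lazy_predict(X, y, x_q, k_max=20):
    """Predict y at the query x_q: for each candidate number of
    neighbors k, fit a constant local model (the mean of the k
    nearest targets), score it by leave-one-out error, and return
    the prediction of the best-scoring model (winner-takes-all)."""
    order = np.argsort(np.linalg.norm(X - x_q, axis=1))
    best_err, best_pred = np.inf, None
    for k in range(2, min(k_max, len(y)) + 1):
        yk = y[order[:k]]                 # targets of the k nearest neighbors
        # For the mean, the leave-one-out residuals have the closed
        # form (y_i - mean) * k / (k - 1): no refitting is needed.
        loo = (yk - yk.mean()) * k / (k - 1)
        mse = np.mean(loo ** 2)
        if mse < best_err:
            best_err, best_pred = mse, yk.mean()
    return best_pred

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)
print(lazy_predict(X, y, np.array([0.5])))   # close to sin(1.5)
```

A loop of this kind refits from scratch for every k; the contribution of the paper is precisely to replace this per-k refitting with an incremental recursive update.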
\n\nThe main reason to favor a query-by-query bandwidth selection is that it allows better adaptation to the local characteristics of the problem at hand. Moreover, this approach is able to handle directly the case in which the database is updated on-line (Bontempi et al., 1997). On the other hand, a globally optimized bandwidth approach would, in principle, require the global optimization to be repeated each time the distribution of the examples changes. \n\nThe major contribution of the paper consists in the adoption of the recursive least squares algorithm in the context of lazy learning. This is an appealing and efficient solution to the intrinsically incremental problem of identifying and validating a sequence of local linear models centered in the query point, each including a growing number of neighbors. It is worth noticing here that a leave-one-out cross-validation of each model considered does not involve any significant computational overload, since it is obtained through the PRESS statistic (Myers, 1990), which simply uses partial results returned by the recursive least squares algorithm. Schaal and Atkeson (1998) already used the recursive least squares algorithm for the incremental update of a set of local models. In the present paper, we use for the first time this algorithm in a query-by-query perspective as an effective way to explore the neighborhood of each query point. \n\nAs a second contribution, we propose a comparison, on a local scale, between a competitive and a cooperative approach to model selection. On the problem of extracting a final prediction from a set of alternatives, we compared a winner-takes-all strategy with a strategy based on the combination of estimators (Wolpert, 1992). \n\nIn Section 5 an experimental analysis of the recursive algorithm for local identification and validation is presented. 
The algorithm proposed, used in conjunction with different strategies for model selection or combination, is compared experimentally with Cubist, the rule-based tool developed by Ross Quinlan for generating piecewise-linear models. \n\n2 Local Weighted Regression \n\nGiven two variables x ∈ R^m and y ∈ R, let us consider the mapping f: R^m → R, known only through a set of n examples {(x_i, y_i)}_{i=1}^n obtained as follows: \n\ny_i = f(x_i) + ε_i,   (1) \n\nwhere, for all i, ε_i is a random variable such that E[ε_i] = 0 and E[ε_i ε_j] = 0 for all j ≠ i, and such that E[ε_i^m] = μ_m(x_i) for all m ≥ 2, where μ_m(·) is the unknown m-th moment of the distribution of ε_i and is defined as a function of x_i. In particular, for m = 2, the last of the above mentioned properties implies that no assumption of global homoscedasticity is made. \n\nThe problem of local regression can be stated as the problem of estimating the value that the regression function f(x) = E[y|x] assumes for a specific query point x, using information pertaining only to a neighborhood of x. \n\nGiven a query point x_q, and under the hypothesis of local homoscedasticity of ε_i, the parameter β of a local linear approximation of f(·) in a neighborhood of x_q can be obtained by solving the local polynomial regression: \n\nβ̂ = arg min_β Σ_{i=1}^n K(d(x_i, x_q)/h) (y_i − x_i' β)^2,   (2) \n\nwhere, given a metric on the space R^m, d(x_i, x_q) is the distance from the query point to the i-th example, K(·) is a weight function, h is the bandwidth, and where a constant value 1 has been appended to each vector x_i in order to consider a constant term in the regression. 
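As a concrete illustration, a weighted regression of this form can be solved numerically as follows (a minimal sketch; the tricube weight function and the synthetic data are assumptions of the example, and scaling rows by square-root weights simply turns the weighted criterion into an ordinary least squares problem):

```python
import numpy as np

def local_linear_fit(X, y, x_q, h):
    """Fit a locally weighted linear model around the query x_q."""
    d = np.linalg.norm(X - x_q, axis=1)            # d(x_i, x_q), Euclidean
    u = d / h
    w = np.where(u <= 1, (1 - u**3)**3, 0.0)       # tricube weight K(d/h)
    Xc = np.hstack([X, np.ones((len(X), 1))])      # append the constant 1
    # min_beta sum_i w_i (y_i - x_i' beta)^2  ==  OLS on sqrt(w)-scaled rows
    sw = np.sqrt(w)[:, None]
    beta, *_ = np.linalg.lstsq(sw * Xc, sw.ravel() * y, rcond=None)
    return beta

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, 1))
y = 2.0 * X[:, 0] + 0.5 + 0.05 * rng.standard_normal(300)
beta = local_linear_fit(X, y, np.array([0.4]), h=0.2)
print(beta)        # slope near 2.0, intercept near 0.5
```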
\n\nIn matrix notation, the solution of the above stated weighted least squares problem is given by: \n\nβ̂ = (X'W'WX)^{-1} X'W'Wy = (Z'Z)^{-1} Z'v = PZ'v,   (3) \n\nwhere X is a matrix whose i-th row is x_i', y is a vector whose i-th element is y_i, W is a diagonal matrix whose i-th diagonal element is w_ii = sqrt(K(d(x_i, x_q)/h)), Z = WX, v = Wy, and the matrix X'W'WX = Z'Z is assumed to be non-singular so that its inverse P = (Z'Z)^{-1} is defined. \n\nOnce the local linear polynomial approximation is obtained, a prediction of y_q = f(x_q) is finally given by: \n\nŷ_q = x_q' β̂.   (4) \n\nMoreover, exploiting the linearity of the local approximator, a leave-one-out cross-validation estimate of the error variance E[(y_q − ŷ_q)^2] can be obtained without any significant overload. In fact, using the PRESS statistic (Myers, 1990), it is possible to calculate the error e_j^{cv} = y_j − x_j' β̂_{−j} without explicitly identifying the parameters β̂_{−j} from the examples available with the j-th removed. The formulation of the PRESS statistic for the case at hand is the following: \n\ne_j^{cv} = y_j − x_j' β̂_{−j} = (y_j − x_j' PZ'v) / (1 − h_jj) = (y_j − x_j' β̂) / (1 − z_j' P z_j),   (5) \n\nwhere z_j' is the j-th row of Z and therefore z_j = w_jj x_j, and where h_jj is the j-th diagonal element of the Hat matrix H = ZPZ' = Z(Z'Z)^{-1} Z'. \n\n3 Recursive Local Regression \n\nIn what follows, for the sake of simplicity, we will focus on linear approximators. An extension to generic polynomial approximators of any degree is straightforward. We will also assume that a metric on the space R^m is given. All the attention will thus be centered on the problem of bandwidth selection. 
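The PRESS shortcut of Eq. 5 can be verified numerically against explicit leave-one-out refits (a minimal sketch with unit weights, i.e. W = I, so that Z = X and v = y; the synthetic data are an assumption of the example):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
Z = np.hstack([rng.standard_normal((n, 3)), np.ones((n, 1))])   # design with constant term
v = Z @ np.array([1.0, -2.0, 0.5, 0.3]) + 0.1 * rng.standard_normal(n)

P = np.linalg.inv(Z.T @ Z)                       # P = (Z'Z)^{-1}
beta = P @ Z.T @ v                               # Eq. 3
H = Z @ P @ Z.T                                  # hat matrix H = ZPZ'

# PRESS (Eq. 5): leave-one-out residuals from the full fit alone.
e_press = (v - Z @ beta) / (1 - np.diag(H))

# Brute force: refit n times, each time with one example removed.
e_loo = np.empty(n)
for j in range(n):
    mask = np.arange(n) != j
    b_j, *_ = np.linalg.lstsq(Z[mask], v[mask], rcond=None)
    e_loo[j] = v[j] - Z[j] @ b_j

print(np.max(np.abs(e_press - e_loo)))           # agreement up to rounding error
```

The identity is exact: the two residual vectors differ only by floating-point rounding.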
\nIf as a weight function K(·) the indicator function \n\nK(d(x_i, x_q)/h) = 1 if d(x_i, x_q) ≤ h, and 0 otherwise,   (6) \n\nis adopted, the optimization of the parameter h can be conveniently reduced to the optimization of the number k of neighbors to which a unitary weight is assigned in the local regression evaluation. In other words, we reduce the problem of bandwidth selection to a search in the space of h(k) = d(x(k), x_q), where x(k) is the k-th nearest neighbor of the query point. \n\nThe main advantage deriving from the adoption of the weight function defined in Eq. 6 is that, simply by updating the parameter β̂(k) of the model identified using the k nearest neighbors, it is straightforward and inexpensive to obtain β̂(k + 1). In fact, performing a step of the standard recursive least squares algorithm (Bierman, 1977), we have: \n\nP(k + 1) = P(k) − [P(k) x(k + 1) x'(k + 1) P(k)] / [1 + x'(k + 1) P(k) x(k + 1)] \nγ(k + 1) = P(k + 1) x(k + 1) \ne(k + 1) = y(k + 1) − x'(k + 1) β̂(k) \nβ̂(k + 1) = β̂(k) + γ(k + 1) e(k + 1)   (7) \n\nwhere P(k) = (Z'Z)^{-1} when h = h(k), and where x(k + 1) is the (k + 1)-th nearest neighbor of the query point. \n\nMoreover, once the matrix P(k + 1) is available, the leave-one-out cross-validation errors can be directly calculated without the need of any further model identification: \n\ne_j^{cv}(k + 1) = [y_j − x_j' β̂(k + 1)] / [1 − x_j' P(k + 1) x_j].   (8) \n\nIt will be useful in the following to define, for each value of k, the [k × 1] vector e^{cv}(k) that contains all the leave-one-out errors associated to the model β̂(k). \n\nOnce an initialization β̂(0) = β̄ and P(0) = P̄ is given, Eq. 7 and Eq. 
8 recursively evaluate, for different values of k, a local approximation of the regression function f(·), a prediction of the value of the regression function in the query point, and the vector of leave-one-out errors from which it is possible to extract an estimate of the variance of the prediction error. Notice that β̄ is an a priori estimate of the parameter and P̄ is the covariance matrix that reflects the reliability of β̄ (Bierman, 1977). For a non-reliable initialization, the following is usually adopted: P̄ = λI, with λ large and where I is the identity matrix. \n\n4 Local Model Selection and Combination \n\nThe recursive algorithm described by Eq. 7 and Eq. 8 returns, for a given query point x_q, a set of predictions ŷ_q(k) = x_q' β̂(k), together with a set of associated leave-one-out error vectors e^{cv}(k). \n\nFrom the information available, a final prediction ŷ_q of the value of the regression function can be obtained in different ways. Two main paradigms deserve to be considered: the first is based on the selection of the best approximator according to a given criterion, while the second returns a prediction as a combination of more local models. \n\nIf the selection paradigm, frequently called winner-takes-all, is adopted, the most natural way to extract a final prediction ŷ_q consists in comparing the predictions obtained for each value of k on the basis of the classical mean square error criterion: \n\nŷ_q = ŷ_q(k̂),   with k̂ = arg min_k MSE(k) = arg min_k [Σ_{i=1}^k w_i (e_i^{cv}(k))^2] / [Σ_{i=1}^k w_i],   (9) \n\nTable 1: A summary of the characteristics of the data sets considered. 
\nDataset                Housing    Cpu   Prices   Mpg   Servo   Ozone \nNumber of examples         506    209      159   392     167     330 \nNumber of regressors        13      6       16     7       8       8 \n\nwhere w_i are weights that can be conveniently used to discount each error according to the distance from the query point to the point to which the error corresponds (Atkeson et al., 1997). \n\nAs an alternative to the winner-takes-all paradigm, we also explored the effectiveness of local combinations of estimates (Wolpert, 1992). Adopting also in this case the mean square error criterion, the final prediction of the value y_q is obtained as a weighted average of the best b models, where b is a parameter of the algorithm. Suppose the predictions ŷ_q(k) and the error vectors e^{cv}(k) have been ordered, creating a sequence of integers {k_i} so that MSE(k_i) ≤ MSE(k_j) for all i < j. The prediction of y_q is given by \n\nŷ_q = [Σ_{i=1}^b ζ_i ŷ_q(k_i)] / [Σ_{i=1}^b ζ_i],   (10) \n\nwhere the weights are the inverse of the mean square errors: ζ_i = 1/MSE(k_i). This is an example of the generalized ensemble method (Perrone & Cooper, 1993). \n\n5 Experiments and Results \n\nThe experimental evaluation of the incremental local identification and validation algorithm was performed on six datasets. The first five, described by Quinlan (1993), were obtained from the UCI Repository of machine learning databases (Merz & Murphy, 1998), while the last one was provided by Leo Breiman. A summary of the characteristics of each dataset is presented in Table 1. \n\nThe methods compared adopt the recursive identification and validation algorithm, combined with different strategies for model selection or combination. We also considered two approaches in which k is selected globally: \n\nlb1: Local bandwidth selection for linear local models. 
The number of neighbors is selected on a query-by-query basis and the prediction returned is the one of the best model according to the mean square error criterion. \n\nlb0: Local bandwidth selection for constant local models. The algorithm for constant models is derived directly from the recursive method described in Eq. 7 and Eq. 8. The best model is selected according to the mean square error criterion. \n\nlbC: Local combination of estimators. This is an example of the method described in Eq. 10. On the datasets proposed, for each query the best 2 linear local models and the best 2 constant models are combined. \n\ngb1: Global bandwidth selection for linear local models. The value of k is obtained minimizing the prediction error in 20-fold cross-validation on the dataset available. This value is then used for all the query points. \n\ngb0: Global bandwidth selection for constant local models. As in gb1, the value of k is optimized globally and kept constant for all the queries. \n\nTable 2: Mean absolute error on unseen cases. \n\nMethod    Housing     Cpu   Prices    Mpg   Servo   Ozone \nlb1          3.52   28.38     1509   2.21    0.48    1.94 \nlb0          3.33   31.54     1627   2.60    0.32    1.97 \nlbC          3.31   26.79     1488   2.12    0.29    1.83 \ngb1          3.46   28.69     1492   2.30    0.52    1.92 \ngb0          3.19   32.19     1639   2.59    0.34    1.99 \nCubist       3.15   28.37     1331   2.17    0.36    1.90 \n\nTable 3: Relative error (%) on unseen cases. 
\n\nMethod    Housing     Cpu   Prices     Mpg   Servo   Ozone \nlb1         35.25   28.66    15.87    9.20   12.63   12.65 \nlb0         31.11   22.04    22.19   20.37   18.06   12.64 \nlbC         30.28   19.72    17.62    9.29   12.35   11.82 \ngb1         32.58   30.46    15.95    9.93   13.47   12.83 \ngb0         28.21   24.30    22.29   21.43   17.99   13.48 \nCubist      26.59   18.53    11.67   12.71   16.02   12.57 \n\nAs far as the metric is concerned, we adopted a global Euclidean metric based on the relative influence (relevance) of the regressors (Friedman, 1994). We are confident that the adoption of a local metric could improve the performance of our lazy learning method. \n\nThe results of the methods introduced are compared with those we obtained, in the same experimental settings, with Cubist, the rule-based tool developed by Quinlan for generating piecewise-linear models. Each approach was tested on each dataset using the same 10-fold cross-validation strategy. Each dataset was divided randomly into 10 groups of nearly equal size. In turn, each of these groups was used as a testing set while the remaining ones together provided the examples. Thus all the methods performed a prediction on the same unseen cases, using for each of them the same set of examples. In Table 2 we present the results obtained by all the methods, averaged on the 10 cross-validation groups. Since the methods were compared on the same examples in exactly the same conditions, the sensitive one-tailed paired test of significance can be used. In what follows, by \"significantly better\" we mean better at least at a 5% significance level. \n\nThe first consideration about the results concerns the local combination of estimators. According to Table 2, the method lbC performs on average always better than the winner-takes-all linear and constant. 
On two datasets lbC is significantly better than both lb1 and lb0; and on three datasets it is significantly better than one of the two, and better on average than the other. \n\nThe second consideration is about the comparison between our query-by-query bandwidth selection and a global optimization of the number of neighbors: on average lb1 and lb0 perform better than their counterparts gb1 and gb0. On two datasets lb1 is significantly better than gb1, while it is about the same on the other four. On one dataset lb0 is significantly better than gb0. \n\nAs far as the comparison with Cubist is concerned, the recursive lazy identification and validation proposed obtains results comparable with those obtained by the state-of-the-art method implemented in Cubist. On the six datasets, lbC performs one time significantly better than Cubist, and one time significantly worse. \n\nThe second index of performance we investigated is the relative error, defined as the mean square error on unseen cases, normalized by the variance of the test set. The relative errors are presented in Table 3 and show a similar picture to Table 2, although the mean square errors considered here penalize larger absolute errors. \n\n6 Conclusion and Future Work \n\nThe experimental results confirm that the recursive least squares algorithm can be effectively used in a local context. Despite the trivial metric adopted, the local combination of estimators, identified and validated recursively, proved able to compete with a state-of-the-art approach. \n\nFuture work will focus on the problem of local metric selection. Moreover, we will explore more sophisticated ways to combine local estimators and we will extend this work to polynomial approximators of higher degree. 
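For completeness, the recursion of Eq. 7 together with the leave-one-out errors of Eq. 8 can be sketched as follows (a minimal illustration; the synthetic data, the value λ = 10^6 for the non-informative initialization, and the restriction to the errors of the final model are assumptions of the example, not part of the experiments above):

```python
import numpy as np

def rls_neighborhood(Xn, yn, x_q, lam=1e6):
    """Grow a local linear model over the (already distance-ordered)
    neighbors of x_q, one recursive least squares step per neighbor,
    returning the prediction at x_q for every k and the leave-one-out
    errors of the final model."""
    d = Xn.shape[1]
    P = lam * np.eye(d)                  # non-informative P(0) = lambda * I
    beta = np.zeros(d)                   # beta(0) = 0
    preds = []
    for k in range(len(yn)):
        x, y = Xn[k], yn[k]
        Px = P @ x
        P = P - np.outer(Px, Px) / (1.0 + x @ Px)    # P(k+1), Eq. 7
        beta = beta + (P @ x) * (y - x @ beta)       # beta(k+1), Eq. 7
        preds.append(x_q @ beta)
    h = np.einsum('ij,jk,ik->i', Xn, P, Xn)          # x_j' P x_j
    loo = (yn - Xn @ beta) / (1 - h)                 # Eq. 8 at the final k
    return preds, loo

rng = np.random.default_rng(3)
Xn = np.hstack([rng.uniform(-1, 1, (50, 2)), np.ones((50, 1))])  # constant term appended
yn = Xn @ np.array([1.5, -0.7, 0.2]) + 0.05 * rng.standard_normal(50)
preds, loo = rls_neighborhood(Xn, yn, np.array([0.0, 0.0, 1.0]))
print(preds[-1])     # near the true value 0.2 at the query point
```

Each added neighbor costs one rank-one update of P, so the whole sequence of models β̂(1), …, β̂(n) and their cross-validation errors comes at roughly the cost of a single batch fit.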
\n\nAcknowledgments \n\nThe work of Mauro Birattari was supported by the FIRST program of the Region Wallonne, Belgium. The work of Gianluca Bontempi was supported by the European Union TMR Grant FMBICT960692. The authors thank Ross Quinlan and gratefully acknowledge using his software Cubist. For more details on Cubist see http://www.rulequest.com. We also thank Leo Breiman for the dataset ozone and the UCI Repository for the other datasets used in this paper. \n\nReferences \n\nAha D. W. 1997. Editorial. Artificial Intelligence Review, 11(1-5), 1-6. Special Issue on Lazy Learning. \n\nAtkeson C. G., Moore A. W. & Schaal S. 1997. Locally weighted learning. Artificial Intelligence Review, 11(1-5), 11-73. \n\nBierman G. J. 1977. Factorization Methods for Discrete Sequential Estimation. New York, NY: Academic Press. \n\nBontempi G., Birattari M. & Bersini H. 1997. Lazy learning for local modeling and control design. International Journal of Control. Accepted for publication. \n\nFriedman J. H. 1994. Flexible metric nearest neighbor classification. Tech. rept. Department of Statistics, Stanford University. \n\nMerz C. J. & Murphy P. M. 1998. UCI Repository of machine learning databases. \n\nMyers R. H. 1990. Classical and Modern Regression with Applications. Boston, MA: PWS-KENT. \n\nPerrone M. P. & Cooper L. N. 1993. When networks disagree: Ensemble methods for hybrid neural networks. Pages 126-142 of Mammone R. J. (ed), Artificial Neural Networks for Speech and Vision. Chapman and Hall. \n\nQuinlan J. R. 1993. Combining instance-based and model-based learning. Pages 236-243 of Machine Learning. Proceedings of the Tenth International Conference. Morgan Kaufmann. \n\nSchaal S. & Atkeson C. G. 1998. Constructive incremental learning from only local information. Neural Computation, 10(8), 2047-2084. \n\nWolpert D. 1992. Stacked Generalization. Neural Networks, 5, 241-259. 
", "award": [], "sourceid": 1507, "authors": [{"given_name": "Mauro", "family_name": "Birattari", "institution": null}, {"given_name": "Gianluca", "family_name": "Bontempi", "institution": null}, {"given_name": "Hugues", "family_name": "Bersini", "institution": null}]}