{"title": "Fast Rates for Regularized Objectives", "book": "Advances in Neural Information Processing Systems", "page_first": 1545, "page_last": 1552, "abstract": "We show that the empirical minimizer of a stochastic strongly convex objective, where the stochastic component is linear, converges to the population minimizer with rate $O(1/n)$. The result applies, in particular, to the SVM objective. Thus, we get a rate of $O(1/n)$ on the convergence of the SVM objective to its infinite data limit. We demonstrate how this is essential for obtaining tight oracle inequalities for SVMs. The results extend also to strong convexity with respect to other $\\ellnorm_p$ norms, and so also to objectives regularized using other norms.", "full_text": "Fast Rates for Regularized Objectives\n\nKarthik Sridharan, Nathan Srebro, Shai Shalev-Shwartz\n\nToyota Technological Institute \u2014 Chicago\n\nAbstract\n\nWe study convergence properties of empirical minimization of a stochastic\nstrongly convex objective, where the stochastic component is linear. We show\nthat the value attained by the empirical minimizer converges to the optimal value\nwith rate 1/n. The result applies, in particular, to the SVM objective. Thus, we\nobtain a rate of 1/n on the convergence of the SVM objective (with \ufb01xed regular-\nization parameter) to its in\ufb01nite data limit. We demonstrate how this is essential\nfor obtaining certain type of oracle inequalities for SVMs. The results extend\nalso to approximate minimization as well as to strong convexity with respect to an\narbitrary norm, and so also to objectives regularized using other (cid:96)p norms.\n\n1 Introduction\n\nWe consider the problem of (approximately) minimizing a stochastic objective\n\nF (w) = E\u03b8 [f(w; \u03b8)]\n\n(1)\nwhere the optimization is with respect to w \u2208 W, based on an i.i.d. sample \u03b81, . . . , \u03b8n. We focus\non problems where f(w; \u03b8) has a generalized linear form:\n\nf(w; \u03b8) = (cid:96)((cid:104)w, \u03c6(\u03b8)(cid:105), \u03b8) + r(w) .\n\n(2)\nThe relevant special case is regularized linear prediction, where \u03b8 = (x, y), (cid:96)((cid:104)w, \u03c6(x)(cid:105), y) is the\nloss of predicting (cid:104)w, \u03c6(x)(cid:105) when the true target is y, and r(w) is a regularizer.\nIt is well known that when the domain W and the mapping \u03c6(\u00b7) are bounded, and the function\n(cid:96)(z; \u03b8) is Lipschitz continuous in z, the empirical averages\n\nconverge uniformly to their expectations F (w) with rate(cid:112)1/n. This justi\ufb01es using the empirical\n\n\u02c6F (w) = \u02c6E [f(w; \u03b8)] = 1\n\nf(w; \u03b8i)\n\n(3)\n\ni=1\n\nn(cid:88)\n\nn\n\n\u02c6w = arg min\nw\u2208W\n\n\u02c6F (w),\n\nF (w(cid:63)) = min\nw\u2208W\n\nF (w)\n\nand we can then establish convergence of F ( \u02c6w) to the population optimum\n\nminimizer\n\nwith a rate of(cid:112)1/n.\n\n(4)\n\n(5)\n\nRecently, Hazan et al [1] studied an online analogue to this problem, and established that if\nf(w; \u03b8) is strongly convex in w, the average online regret diminishes with a much faster rate,\nnamely (log n)/n. The function f(w; \u03b8) becomes strongly convex when, for example, we have\nr(w) = \u03bb\nIn this paper we present an analogous \u201cfast rate\u201d for empirical minimization of a strongly convex\nstochastic objective. In fact, we do not need to assume that we perform the empirical minimization\n\n2 (cid:107)w(cid:107)2 as in SVMs and other regularized learning settings.\n\n1\n\n\fexactly: we provide uniform (over all w \u2208 W) guarantees on the population sub-optimality F (w)\u2212\nF (w(cid:63)) in terms of the empirical sub-optimality \u02c6F (w)\u2212 \u02c6F ( \u02c6w) with a rate of 1/n. This is a stronger\ntype of result than what can be obtained with an online-to-batch conversion, as it applies to any\npossible solution w, and not only to some speci\ufb01c algorithmically de\ufb01ned solution. For example, it\ncan be used to analyze the performance of approximate minimizers obtained through approximate\noptimization techniques. Speci\ufb01cally, consider f(w; \u03b8) as in (2), where (cid:96)(z; \u03b8) is convex and L-\nLipschitz in z, the norm of \u03c6(\u03b8) is bounded by B, and r is \u03bb-strongly convex. We show that for any\na > 0 and \u03b4 > 0, with probability at least 1 \u2212 \u03b4, for all w (of arbitrary magnitude):\n\n(cid:18)\n(1 + 1/a) L2B2(log(1/\u03b4))\n\n(cid:19)\n\n\u03bbn\n\nF (w) \u2212 F (w(cid:63)) \u2264 (1 + a)( \u02c6F (w) \u2212 \u02c6F ( \u02c6w)) + O\n\n.\n\n(6)\n\nWe emphasize that here and throughout the paper the big-O notation hides only \ufb01xed numeric con-\nstants.\nIt might not be surprising that requiring strong convexity yields a rate of 1/n. Indeed, the connection\nbetween strong convexity, variance bounds, and rates of 1/n, is well known. However, it is interest-\ning to note the generality of the result here, and the simplicity of the conditions. In particular, we do\nnot require any \u201clow noise\u201d conditions, nor that the loss function is strongly convex (it need only be\nweakly convex).\nIn particular, (6) applies, under no additional conditions, to the SVM objective. We therefore obtain\nconvergence with a rate of 1/n for the SVM objective. This 1/n rate on the SVM objective is\nalways valid, and does not depend on any low-noise conditions or on speci\ufb01c properties of the\n\u221a\nkernel function. Such a \u201cfast\u201d rate might seem surprising at a \ufb01rst glance to the reader familiar with\n\u221a\nthe 1/\nn rate on the expected loss of the SVM optimum. There is no contradiction here\u2014what we\nestablish is that although the loss might converge at a rate of 1/\nn, the SVM objective (regularized\nloss) always converges at a rate of 1/n.\n\u221a\nIn fact, in Section 3 we see how a rate of 1/n on the objective corresponds to a rate of 1/\nn on the\nloss. Speci\ufb01cally, we perform an oracle analysis of the optimum of the SVM objective (rather than\nof empirical minimization subject to a norm constraint, as in other oracle analyses of regularized\nlinear learning), based on the existence of some (unknown) low-norm, low-error predictor w.\nStrong convexity is a concept that depends on a choice of norm. We state our results in a general\nform, for any choice of norm (cid:107)\u00b7(cid:107). Strong convexity of r(w) must hold with respect to the chosen\nnorm (cid:107)\u00b7(cid:107), and the data \u03c6(\u03b8) must be bounded with respect to the dual norm (cid:107)\u00b7(cid:107)\u2217, i.e. we must\nhave (cid:107)\u03c6(\u03b8)(cid:107)\u2217 \u2264 B. This allows us to apply our results also to more general forms of regularizers,\np, for p < 1 \u2264 2 (see Corollary 2). However,\nincluding squared (cid:96)p norm regularizers, r(w) = \u03bb\nthe reader may choose to read the paper always thinking of the norm (cid:107)w(cid:107), and so also its dual norm\n(cid:107)w(cid:107)\u2217, as the standard (cid:96)2-norm.\n\n2 (cid:107)w(cid:107)2\n\n2 Main Result\nWe consider a generalized linear function f : W \u00d7 \u0398 \u2192 R, that can be written as in (2), de\ufb01ned\nover a closed convex subset W of a Banach space equipped with norm (cid:107)\u00b7(cid:107).\nLipschitz continuity and boundedness We require that the mapping \u03c6(\u00b7) is bounded by B,\ni.e. (cid:107)\u03c6(\u03b8)(cid:107)\u2217 \u2264 B, and that the function (cid:96)(z; \u03b8) is L-Lipschitz in z \u2208 R for every \u03b8.\nStrong Convexity We require that F (w) is \u03bb-strongly convex w.r.t. the norm (cid:107)w(cid:107). That is, for all\nw1, w2 \u2208 W and \u03b1 \u2208 [0, 1] we have:\n\nF (\u03b1w1 + (1 \u2212 \u03b1)w2) \u2264 \u03b1F (w1) + (1 \u2212 \u03b1)F (w2) \u2212 \u03bb\n\n2 \u03b1(1 \u2212 \u03b1)(cid:107)w1 \u2212 w2(cid:107)2 .\n\nRecalling that w(cid:63) = arg minw F (w), this ensures (see for example [2, Lemma 13]):\n\n(7)\nWe require only that the expectation F (w) = E [f(w; \u03b8)] is strongly convex. Of course, requiring\nthat f(w; \u03b8) is \u03bb-strongly convex for all \u03b8 (with respect to w) is enough to ensure the condition.\n\nF (w) \u2265 F (w(cid:63)) + \u03bb\n\n2 (cid:107)w \u2212 w(cid:63)(cid:107)2\n\n2\n\n\fIn particular, for a generalized linear function of the form (2) it is enough to require that (cid:96)(z; y) is\nconvex in z and that r(w) is \u03bb-strongly convex (w.r.t. the norm (cid:107)w(cid:107)).\nWe now provide a faster convergence rate using the above conditions.\nTheorem 1. Let W be a closed convex subset of a Banach space with norm (cid:107)\u00b7(cid:107) and dual norm (cid:107)\u00b7(cid:107)\u2217\nand consider f(w; \u03b8) = (cid:96)((cid:104)w, \u03c6(\u03b8)(cid:105); \u03b8) + r(w) satisfying the Lipschitz continuity, boundedness,\nand strong convexity requirements with parameters B, L, and \u03bb. Let w(cid:63), \u02c6w, F (w) and \u02c6F (w) be as\nde\ufb01ned in (1)-(5). Then, for any \u03b4 > 0 and any a > 0, with probability at least 1 \u2212 \u03b4 over a sample\nof size n, we have that for all w \u2208 W:\n\n(where [x]+ = max(x, 0))\n\nF (w) \u2212 F (w(cid:63)) \u2264 (1 + a)[ \u02c6F (w) \u2212 \u02c6F (w(cid:63))]+ +\n\u2264 (1 + a)( \u02c6F (w) \u2212 \u02c6F ( \u02c6w)) +\n\n8 (1 + 1\n\na)L2B2(32 + log(1/\u03b4))\n\n\u03bbn\n\n8 (1 + 1\n\na)L2B2(32 + log(1/\u03b4))\n\n\u03bbn\n\n.\n\np, which are (p\u22121)\u03bb-\nIt is particularly interesting to consider regularizers of the form r(w) = \u03bb\nstrongly convex w.r.t. the corresponding (cid:96)p-norm [2]. Applying Theorem 1 to this case yields the\nfollowing bound:\nCorollary 2. Consider an (cid:96)p norm and its dual (cid:96)q, with 1 < p \u2264 2, 1\np = 1, and the objective\nf(w; \u03b8) = (cid:96)((cid:104)w, \u03c6(\u03b8)(cid:105); \u03b8) + \u03bb\np, where (cid:107)\u03c6(\u03b8)(cid:107)q \u2264 B and (cid:96)(z; y) is convex and L-Lipschitz\nin z. The domain is the entire Banach space W = (cid:96)p. Then, for any \u03b4 > 0 and any a > 0,\nwith probability at least 1 \u2212 \u03b4 over a sample of size n, we have that for all w \u2208 W = (cid:96)p (of any\nmagnitude):\n\n2 (cid:107)w(cid:107)2\n\n2 (cid:107)w(cid:107)2\n\nq + 1\n\nF (w) \u2212 F (w(cid:63)) \u2264 (1 + a)( \u02c6F (w) \u2212 \u02c6F ( \u02c6w)) + O\n\n(cid:18)(1 + 1\n\n(cid:19)\n\n.\n\na)L2B2 log(1/\u03b4)\n(p \u2212 1)\u03bbn\n\nCorollary 2 allows us to analyze the rate of convergence of the regularized risk for (cid:96)p-regularized\nlinear learning. That is, training by minimizing the empirical average of:\n\np\n\n(cid:107)w(cid:107)2\n\nf(w; x, y) = (cid:96)((cid:104)w, x(cid:105), y) + \u03bb\n2\n\n(8)\nwhere (cid:96)(z, y) is some convex loss function and (cid:107)x(cid:107)q \u2264 B. For example, in SVMs we use the (cid:96)2\nnorm, and so bound (cid:107)x(cid:107)2 \u2264 B, and the hinge loss (cid:96)(z, y) = [1 \u2212 yz]+, which is 1-Lipschitz. What\nwe obtain is a bound on how quickly we can minimize the expectation F (w) = E [(cid:96)((cid:104)w, x(cid:105), y)] +\n2 (cid:107)w(cid:107)2\np, i.e. the regularized empirical loss, or in other words, how quickly do we converge to the\n\u03bb\nin\ufb01nite-data optimum of the objective.\nWe see, then, that the SVM objective converges to its optimum value at a fast rate of 1/n, without\nany special assumptions. This still doesn\u2019t mean that the expected loss L( \u02c6w) = E [(cid:96)((cid:104) \u02c6w, x(cid:105), y)]\nconverges at this rate. This behavior is empirically demonstrated on the left plot of Figure 1. For\neach data set size we plot the excess expected loss L( \u02c6w) \u2212 L(w(cid:63)) and the sub-optimality of the\n2 (cid:107) \u02c6w(cid:107)2). Although the\nregularized expected loss F ( \u02c6w) \u2212 F (w(cid:63)) (recall that F ( \u02c6w) = L( \u02c6w) + \u03bb\nregularized expected loss converges to its in\ufb01nite data limit, i.e. to the population minimizer, with\n\nrate roughly 1/n, the expected loss L( \u02c6w) converges at a slower rate of roughly(cid:112)1/n.\n\nStudying the convergence rate of the SVM objective allows us to better understand and appreci-\nate analysis of computational optimization approaches for this objective, as well as obtain oracle\ninequalities on the generalization loss of \u02c6w, as we do in the following Section.\nBefore moving on, we brie\ufb02y provide an example of applying Theorem 1 with respect to the (cid:96)1-\nnorm. The bound in Corollary 2 diverges when p \u2192 1 and the Corollary is not applicable for\n(cid:96)1 regularization. This is because (cid:107)w(cid:107)2\n1 is not strongly convex w.r.t. the (cid:96)1-norm. An example of a\nregularizer that is strongly convex with respect to the (cid:96)1 norm is the (unnormalized) entropy regu-\nw-strongly convex w.r.t. (cid:107)w(cid:107)1, as\nlong as (cid:107)w(cid:107)1 \u2264 Bw (see [2]), yielding:\ni=1 |wi| log(|wi|), where\n(cid:107)\u03c6(\u03b8)(cid:107)\u221e \u2264 B and (cid:96)(z; y) is convex and L-Lipschitz in z. Take the domain to be the (cid:96)1 ball\n\nlarizer [3]: r(w) =(cid:80)d\nCorollary 3. Consider a function f(w; \u03b8) = (cid:96)((cid:104)w, \u03c6(\u03b8)(cid:105); \u03b8) + (cid:80)d\n\ni=1 |wi| log(|wi|). This regularizer is 1/B2\n\n3\n\n\fminw L(wo), relative to the overall optimal wo = arg minw L(w), with \u03bbn =(cid:112)300/n. Both plots are on a\n\nFigure 1: Left: Excess expected loss L( \u02c6w) \u2212 L(w(cid:63)) and sub-optimality of the regularized expected loss\nF ( \u02c6w) \u2212 F (w(cid:63)) as a function of training set size, for a \ufb01xed \u03bb = 0.8. Right: Excess expected loss L( \u02c6w\u03bb) \u2212\nlogarithmic scale and refer to a synthetic example with x uniform over [\u22121.5, 1.5]300, and y = sign x1 when\n|x1| > 1 but uniform otherwise.\n\nW = {w \u2208 Rd : (cid:107)w(cid:107)1 \u2264 Bw}. Then, for any \u03b4 > 0 and any a > 0, with probability at least 1 \u2212 \u03b4\nover a sample of size n, we have that for all w \u2208 W:\n\n(cid:18)(1 + 1\n\n(cid:19)\n\n.\n\na)L2B2B2\n\nw log(1/\u03b4)\n\n\u03bbn\n\nF (w) \u2212 F (w(cid:63)) \u2264 (1 + a)( \u02c6F (w) \u2212 \u02c6F ( \u02c6w)) + O\n\n3 Oracle Inequalities for SVMs\n\nIn this Section we apply the results from previous Section to obtain an oracle inequality on the\nexpected loss L(w) = E [(cid:96)((cid:104)w, x(cid:105), y)] of an approximate minimizer of the SVM training objective\n\u02c6F\u03bb(w) = \u02c6E [f\u03bb(w)] where\n\n(cid:107)w(cid:107)2 ,\n\nf\u03bb(w; x, y) = (cid:96)((cid:104)w, x(cid:105), y) + \u03bb\n2\n\n(9)\nand (cid:96)(z, y) is the hinge-loss, or any other 1-Lipschitz loss function. As before we denote B =\nsupx (cid:107)x(cid:107) (all norms in this Section are (cid:96)2 norms).\nWe assume, as an oracle assumption, that there exists a good predictor wo with low norm (cid:107)wo(cid:107)\nand which attains low expected loss L(wo). Consider an optimization algorithm for \u02c6F\u03bb(w) that is\nguaranteed to \ufb01nd \u02dcw such that \u02c6F\u03bb( \u02dcw) \u2264 min \u02c6F\u03bb(w) + \u0001opt. Using the results of Section 2, we can\ntranslate this approximate optimality of the empirical objective to an approximate optimality of the\nexpected objective F\u03bb(w) = E [f\u03bb(w)]. Speci\ufb01cally, applying Corollary 2 with a = 1 we have that\nwith probability at least 1 \u2212 \u03b4:\n\n(cid:19)\n\n(cid:18) B2 log(1/\u03b4)\n(cid:18) B2 log(1/\u03b4)\n(cid:19)\n\n\u03bbn\n\n.\n\n\u03bbn\n\nF\u03bb( \u02dcw) \u2212 F\u03bb(w(cid:63)) \u2264 2\u0001opt + O\n\nOptimizing to within \u0001opt = O( B2\n\n\u03bbn ) is then enough to ensure\n\nF\u03bb( \u02dcw) \u2212 F\u03bb(w(cid:63)) = O\n\n.\n\n(10)\n\n(11)\n\nIn order to translate this to a bound on the expected loss L( \u02dcw) we consider the following decompo-\nsition:\n\nL( \u02dcw) = L(wo) + (F\u03bb( \u02dcw) \u2212 F\u03bb(w(cid:63))) + (F\u03bb(w(cid:63)) \u2212 F\u03bb(wo)) + \u03bb\n\n2 (cid:107)wo(cid:107)2 \u2212 \u03bb\n\n2 (cid:107) \u02dcw(cid:107)2\n\n(cid:18) B2 log(1/\u03b4)\n\n(cid:19)\n\n\u03bbn\n\n\u2264 L(wo) + O\n\n+ 0 + \u03bb\n\n2 (cid:107)wo(cid:107)2\n\n4\n\n(12)\n\n10310410\u2212210\u22121n Suboptimality of ObjectiveExcess Expected Loss10310410\u2212210\u22121n\fwhere we used the bound (11) to bound the second term, the optimality of w(cid:63) to ensure the third\nterm is non-positive, and we also dropped the last, non-positive, term.\nThis might seem like a rate of 1/n on the generalization error, but we need to choose \u03bb so as to\nbalance the second and third terms. The optimal choice for \u03bb is\n\n\u03bb(n) = c\n\n,\n\n(13)\n\nB(cid:112)log(1/\u03b4)\n(cid:107)wo(cid:107)\u221a\n\nn\n\nfor some constant c. We can now formally state our oracle inequality, which is obtained by substi-\ntuting (13) into (12):\nCorollary 4. Consider an SVM-type objective as in (9). For any wo and any \u03b4 > 0, with probability\nat least 1\u2212\u03b4 over a sample of size n, we have that for all \u02dcw s.t. \u02c6F\u03bb(n)( \u02dcw) \u2264 min \u02c6F\u03bb(n)(w)+O( B2\n\u03bbn ),\nwhere \u03bb(n) chosen as in (13), the following holds:\n\n\uf8eb\uf8ed(cid:115)\n\nL( \u02dcw) \u2264 L(wo) + O\n\nB2 (cid:107)wo(cid:107)2 log(1/\u03b4)\n\nn\n\n\uf8f6\uf8f8\n\nCorollary 4 is demonstrated empirically on the right plot of Figure 1.\nThe way we set \u03bb(n) in Corollary 4 depends on (cid:107)wo(cid:107). However, using\n\n\u03bb(n) = B(cid:112)log(1/\u03b4)\n\n\u221a\nn\n\n(14)\n\nwe obtain:\nCorollary 5. Consider an SVM-type objective as in (9) with \u03bb(n) set as in (14). For any \u03b4 > 0,\nwith probability at least 1 \u2212 \u03b4 over a sample of size n, we have that for all \u02dcw s.t. \u02c6F\u03bb(n)( \u02dcw) \u2264\nmin \u02c6F\u03bb(n)(w) + O( B2\n\n\u03bbn ), the following holds:\n\n\uf8eb\uf8edL(wo) + O\n\n\uf8eb\uf8ed(cid:115)\n\nL( \u02dcw) \u2264 inf\n\nwo\n\nB2((cid:107)wo(cid:107)4 + 1) log(1/\u03b4)\n\nn\n\n\uf8f6\uf8f8\uf8f6\uf8f8\n\nThe price we pay here is that the bound of Corollary 5 is larger by a factor of (cid:107)wo(cid:107) relative to the\n\nbound of Corollary 4. Nevertheless, this bound allows us to converge with a rate of(cid:112)1/n to the\n\nexpected loss of any \ufb01xed predictor.\nIt is interesting to repeat the analysis of this Section using the more standard result:\n\n(cid:32)(cid:114)\n\n(cid:33)\n\nF\u03bb(w) \u2212 F\u03bb(w(cid:63)) \u2264 \u02c6F\u03bb(w) \u2212 \u02c6F\u03bb(w(cid:63)) + O\n\nfor (cid:107)w(cid:107) \u2264 Bw where we ignore the dependence on \u03b4. Setting Bw =(cid:112)2/\u03bb, as this is a bound on\n\nthe norm of both the empirical and population optimums, and using (15) instead of Corollary 2 in\nour analysis yields the oracle inequality:\n\nwB2\nB2\nn\n\n(15)\n\n\uf8eb\uf8ed(cid:32)\n\n(cid:33)1/3\uf8f6\uf8f8\n\nL( \u02dcw) \u2264 L(wo) + O\n\nB2 (cid:107)wo(cid:107)2 log(1/\u03b4)\n\nn\n\n(16)\n\nThe oracle analysis studied here is very simple\u2014our oracle assumption involves only a single\npredictor wo, and we make no assumptions about the kernel or the noise. We note that a more\n\u221a\nsophisticated analysis has been carried out by Steinwart et al [4], who showed that rates faster than\n1/\nn are possible under certain conditions on noise and complexity of kernel class. In Steinwart\u2019s\net al analyses the estimation rates (i.e. rates for expected regularized risk) are given in terms of\n2(cid:107)w(cid:63)(cid:107)2 + L(w(cid:63)) \u2212 L\u2217 where L\u2217 is the Bayes risk. In our re-\nthe approximation error quantity \u03bb\nsult we consider the estimation rate for regularized objective independent of the approximation error.\n\n5\n\n\f4 Proof of Main Result\n\n(cid:110)\n\nGr =\n\nTo prove Theorem 1 we use techniques of reweighing and peeling following Bartlett et al [5].\nFor each w, we de\ufb01ne gw(\u03b8) = f(w; \u03b8) \u2212 f(w(cid:63); \u03b8), and so our goal is to bound the expectation of\ngw in terms of its empirical average. We denote by G = {gw|w \u2208 W}.\nSince our desired bound is not exactly uniform, and we would like to pay different attention to func-\ntions depending on their expected sub-optimality, we will instead consider the following reweighted\nclass. For any r > 0 de\ufb01ne\n\n4k(w) : w \u2208 W, k(w) = min{k(cid:48) \u2208 Z+ : E [gw] \u2264 r4k(cid:48)}(cid:111)\n\n(17)\nw \u2208 Gr is just a scaled version of\nwhere Z+ is the set of non-negative integers. In other words, gr\ngw \u2208 G and the scaling factor ensures that E [gr\nWe will begin by bounding the variation between expected and empirical average values of gr \u2208 Gr.\nThis is typically done in terms of the complexity of the class Gr. However, we will instead use the\ncomplexity of a slightly different class of functions, which ignores the non-random (i.e. non-data-\ndependent) regularization terms r(w). De\ufb01ne:\n\n4k(w) : w \u2208 W, k(w) = min{k(cid:48) \u2208 Z+ : E [gw] \u2264 r4k(cid:48)}(cid:111)\n\nw = hw\nhr\n\nw] \u2264 r.\n\nw = gw\ngr\n\nHr =\n\n(cid:110)\n\n(18)\n\nwhere\n\nhw(\u03b8) = gw(\u03b8) \u2212 (r(w) \u2212 r(w(cid:63))) = (cid:96)((cid:104)w, \u03c6(\u03b8)(cid:105); \u03b8) \u2212 (cid:96)((cid:104)w\u2217, \u03c6(\u03b8)(cid:105); \u03b8).\n\nw(\u03b8) is the data dependent component of gr\nw] = E [hr\n\n(19)\nw, dropping the (scaled) regularization terms.\nThat is, hr\nw]\u2212 \u02c6E [hr\nWith this de\ufb01nition we have E [gr\nw] (the regularization terms on the left\nhand side cancel out), and so it is enough to bound the deviation of the empirical means in Hr. This\ncan be done in terms of the Rademacher Complexity of the class, R(Hr) [6, Theorem 5]: For any\n(cid:32)\n\u03b4 > 0, with probability at least 1 \u2212 \u03b4,\n\nw]\u2212 \u02c6E [gr\n\n(cid:33)(cid:113) log 1/\u03b4\n\n2n .\n\n(20)\n\nj=0\n\n6\n\nE [hr] \u2212 \u02c6E [hr] \u2264 2R(Hr) +\n\nsup\nhr\u2208Hr\n\n|hr(\u03b8)|\n\nsup\n\nhr\u2208Hr,\u03b8\n\nWe will now proceed to bounding the two terms on the right hand side:\nLemma 6.\n\nsup\n\n|hr(\u03b8)| \u2264 LB(cid:112)2r / \u03bb\n\nhr\u2208Hr,\u03b8\n\nProof. From the de\ufb01nition of hr\nbound (cid:107)\u03c6(\u03b8)(cid:107)\u2217 \u2264 B, we have for all w, \u03b8:\n\nw given in (18)\u2013(19), the Lipschitz continuity of (cid:96)(\u00b7; \u03b8), and the\n\nw(\u03b8)| \u2264 |hw(\u03b8)|\n|hr\n\n4k(w) \u2264 LB (cid:107)w \u2212 w(cid:63)(cid:107) /4k(w)\n\n(cid:107)w \u2212 w(cid:63)(cid:107) \u2264(cid:113) 2\n\n(21)\nWe now use the strong convexity of F (w), and in particular eq. (7), as well as the de\ufb01nitions of gw\nand k(w), and \ufb01nally note that 4k(w) \u2265 1, to get:\n\u03bb(F (w) \u2212 F (w(cid:63))) =\nSubstituting (22) in (21) yields the desired bound.\nLemma 7. R(Hr) \u2264 2L B\nProof. We will use the following generic bound on the Rademacher complexity of linear function-\nals [7, Theorem 1]: for any t(w) which is \u03bb-strongly convex (w.r.t a norm with dual norm (cid:107)\u00b7(cid:107)\u2217),\n\n\u03bb4k(w)r \u2264(cid:113) 2\n\nE [gw] \u2264(cid:113) 2\n\n(cid:113) 2r\n\n(cid:113) 2\n\n\u03bb16k(w)r\n\n(22)\n\n\u03bbn\n\n\u03bb\n\nR({\u03c6 (cid:55)\u2192 (cid:104)w, \u03c6(cid:105) | t(w) \u2264 a}) \u2264 (sup(cid:107)\u03c6(cid:107)\u2217)\n\n(23)\nFor each a > 0, de\ufb01ne H(a) = {hw : w \u2208 W, E [gw] \u2264 a}. First note that E [gw] = F (w) \u2212\nF (w(cid:63)) is \u03bb-strongly convex. Using (23) and the Lipschitz composition property we therefore have\nR(H(a)) \u2264 LB\n\n\u03bbn .\n\n(cid:113) 2a\n\n(cid:113) 2a\nj=04\u2212jH(r4j)(cid:1) \u2264\nR(Hr) = R(cid:0)\u222a\u221e\n\n\u03bbn. Now:\n\n\u221e(cid:88)\n\n4\u2212jR(H(4rj)) \u2264 LB\n\n4\u2212j/2 = 2LB\n\n(cid:113) 2r\n\n\u03bbn\n\n(cid:113) 2r\n\n\u221e(cid:88)\n\n\u03bbn\n\nj=0\n\n\fWe now proceed to bounding E [gw] = F (w)\u2212 F (w(cid:63)) and thus proving Theorem 1. For any r > 0,\nwith probability at least 1 \u2212 \u03b4 we have:\n\nE [gw] \u2212 \u02c6E [gw] = 4k(w)(E [gr\n\n(cid:113) 1\n\n\u221a\n\u03bbn(4\n\nw] \u2212 \u02c6E [gr\n\n2+(cid:112)log(1/\u03b4)) \u2264 2LB\n\nw]) = 4k(w)(E [hr\n\n(cid:113) 32+log(1/\u03b4)\n\nwhere D = LB\n6 and 7 into (20). We now consider two possible cases: k(w) = 0 and k(w) > 0.\nThe case k(w) = 0 corresponds to functions with an expected value close to optimal: E [gw] \u2264 r,\ni.e. F (w) \u2264 F (w(cid:63)) + r. In this case (24) becomes:\n\nis obtained by substituting Lemmas\n\n\u03bbn\n\nw] \u2212 \u02c6E [hr\n\nw]) \u2264 4k(w)\u221a\n\nrD\n\n(24)\n\nE [gw] \u2264 \u02c6E [gw] +\n\n\u221a\nrD\n\n(25)\n\nWe now turn to functions for which k(w) > 0, i.e. with expected values further away from optimal.\nIn this case, the de\ufb01nition of k(w) ensures 4k(w)\u22121r < E [gw] and substituting this into (24) we\nhave E [gw] \u2212 \u02c6E [gw] \u2264 4\n\n\u221a\nE [gw]\n\nr\n\nrD. Rearranging terms yields:\nE [gw] \u2264\n\n\u02c6E [gw]\n\n1\n1\u22124D/\n\n\u221a\nr\n\n(26)\nr \u2265 1), we always\n\u221a\n\n1\n1\u22124D/\n\n(27)\n\nCombining the two cases (25) and (26) (and requiring r \u2265 (4D)2 so that\nhave:\n\n(cid:104)\u02c6E [gw]\n\n(cid:105)\n\n\u221a\nrD\n\n+\n\n+\n\nE [gw] \u2264\n\n1\n1\u22124D/\n\n\u221a\nr\n\nSetting r = (1 + 1\n\na)2(4D)2 yields the bound in Theorem 1.\n\n5 Comparison with Previous \u201cFast Rate\u201d Guarantees\n\n\u221a\nRates faster than 1/\nn for estimation have been previously explored under various conditions,\nwhere strong convexity has played a signi\ufb01cant role. Lee et al [8] showed faster rates for\nsquared loss, exploiting the strong convexity of this loss function, but only under \ufb01nite pseudo-\ndimensionality assumption, which do not hold in SVM-like settings. Bousquet [9] provided similar\n\u221a\nguarantees when the spectrum of the kernel matrix (covariance of the data) is exponentially decay-\ning. Tsybakov [10] introduced a margin condition under which rates faster than 1/\nn are shown\npossible. It is also possible to ensure rates of 1/n by relying on low noise conditions [9, 11], but\nhere we make no such assumption.\nMost methods for deriving fast rates \ufb01rst bound the variance of the functions in the class by some\n\u221a\nmonotone function of their expectations. Then, using methods as in Bartlett et al [5], one can\nget bounds that have a localized complexity term and additional terms of order faster than 1/\nn.\nHowever, it is important to note that the localized complexity term typically dominates the rate and\nstill needs to be controlled. For example, Bartlett et al [12] show that strict convexity of the loss\nfunction implies a variance bound, and provide a general result that can enable obtaining faster rates\n\u221a\nas long as the complexity term is low. For instance, for classes with \ufb01nite VC dimension V , the\nresulting rate is n\u2212(V +2)/(2V +2), which indeed is better than 1/\nn but is not quite 1/n. Thus we\nsee that even for a strictly convex loss function, such as the squared loss, additional conditions are\nnecessary in order to obtain \u201cfast\u201d rates.\nIn this work we show that strong convexity not only implies a variance bound but in fact can be used\nto bound the localized complexity. An important distinction is that we require strong convexity of\nthe function F (w) with respect to the norm (cid:107)w(cid:107). This is rather different than requiring the loss\nfunction z (cid:55)\u2192 (cid:96)(z, y) be strongly convex on the reals. In particular, the loss of a linear predictor,\nw (cid:55)\u2192 (cid:96)((cid:104)w, x(cid:105), y) can never be strongly convex in a multi-dimensional space, even if (cid:96) is strongly\nconvex, since it is \ufb02at in directions orthogonal to x.\nAs mentioned, f(w; x, y) = (cid:96)((cid:104)w, x(cid:105), y) can never be strongly convex in a high-dimensional space.\nHowever, we actually only require the strong convexity of the expected loss F (w).\nIf the loss\nfunction (cid:96)(z, y) is \u03bb-strongly convex in z, and the eigenvalues of the covariance of x are bounded\naway from zero, strong convexity of F (w) can be ensured.\nIn particular, F (w) would be c\u03bb-\nstrongly-convex, where c is the minimal eigenvalue of the COV[x]. This enables us to use Theorem\n\n7\n\n\f1 to obtain rates of 1/n on the expected loss itself. However, we cannot expect the eigenvalues to be\nbounded away from zero in very high dimensional spaces, limiting the applicability of the result of\nlow-dimensional spaces were, as discussed above, other results also apply.\nAn interesting observation about our proof technique is that the only concentration inequality we\ninvoked was McDiarmid\u2019s Inequality (in [6, Theorem 5] to obtain (20)\u2014a bound on the deviations\nin terms of the Rademacher complexity). This was possible because we could make a localization\nargument for the (cid:96)\u221e norm of the functions in our function class in terms of their expectation.\n\n6 Summary\n\nWe believe this is the \ufb01rst demonstration that, without any additional requirements, the SVM objec-\ntive converges to its in\ufb01nite data limit with a rate of O(1/n). This improves the previous results that\nconsidered the SVM objective only under special additional conditions. The results extends also to\nother regularized objectives.\nAlthough the quantity that is ultimately of interest to us is the expected loss, and not the regularized\nexpected loss, it is still important to understand the statistical behavior of the regularized expected\nloss. This is the quantity that we actually optimize, track, and often provide bounds on (e.g. in ap-\nproximate or stochastic optimization approaches). A better understanding of its behavior can allow\nus to both theoretically explore the behavior of regularized learning methods, to better understand\nempirical behavior observed in practice, and to appreciate guarantees of stochastic optimization ap-\nproaches for such regularized objectives. As we saw in Section 3, deriving such fast rates is also\nessential for obtaining simple and general oracle inequalities, that also helps us guide our choice of\nregularization parameters.\n\nReferences\n[1] E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimiza-\n\ntion. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2006.\n\n[2] S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew\n\nUniversity, 2007.\n\n[3] T. Zhang. Covering number bounds of certain regularized linear function classes. J. Mach. Learn. Res.,\n\n2:527\u2013550, 2002.\n\n[4] I. Steinwart, D. Hush, and C. Scovel. A new concentration result for regularized risk minimizers. High-\n\ndimensional Probability IV, in IMS Lecture Notes, 51:260\u2013275, 2006.\n\n[5] P. L. Bartlett, O. Bousquet, and S. Mendelson. Localized rademacher complexities. In COLT \u201902: Pro-\nceedings of the 15th Annual Conference on Computational Learning Theory, pages 44\u201358, London, UK,\n2002. Springer-Verlag.\n\n[6] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In O. Bousquet,\nU.v. Luxburg, and G. R\u00a8atsch, editors, Advanced Lectures in Machine Learning, pages 169\u2013207. Springer,\n2004.\n\n[7] S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin\n\nbounds, and regularization. In NIPS, 2008.\n\n[8] W. S. Lee, P. L. Bartlett, and R. C. Williamson. The importance of convexity in learning with squared\n\nloss. In Computational Learing Theory, pages 140\u2013146, 1996.\n\n[9] O. Bousquet. Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of\n\nLearning Algorithms. PhD thesis, Ecole Polytechnique, 2002.\n\n[10] A. Tsybakov. Optimal aggregation of classi\ufb01ers in statistical learning. Annals of Statistics, 32:135\u2013166,\n\n2004.\n\n[11] I. Steinwart and C. Scovel. Fast rates for support vector machines using gaussian kernels. ANNALS OF\n\nSTATISTICS, 35:575, 2007.\n\n[12] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classi\ufb01cation, and risk bounds. Journal of the\n\nAmerican Statistical Association, 101:138\u2013156, March 2006.\n\n8\n\n\f", "award": [], "sourceid": 632, "authors": [{"given_name": "Karthik", "family_name": "Sridharan", "institution": null}, {"given_name": "Shai", "family_name": "Shalev-shwartz", "institution": null}, {"given_name": "Nathan", "family_name": "Srebro", "institution": null}]}