{"title": "Linear Concepts and Hidden Variables: An Empirical Study", "book": "Advances in Neural Information Processing Systems", "page_first": 500, "page_last": 506, "abstract": "", "full_text": "Linear concepts and hidden variables: \n\nAn empirical study \n\nAdam J. Grove \n\nNEC Research Institute \n4 Independence Way \nPrinceton NJ 08540 \n\ngrove@research.nj.nec.com \n\nDan Rothe \n\nDepartment of Computer Science \n\nUniversity of Illinois at Urbana-Champaign \n\n1304 W. Springfield Ave. Urbana 61801 \n\ndanr@cs.uiuc.edu \n\nAbstract \n\nSome learning techniques for classification tasks work indirectly, by first trying \nto fit a full probabilistic model to the observed data. Whether this is a good idea \nor not depends on the robustness with respect to deviations from the postulated \nmodel. We study this question experimentally in a restricted, yet non-trivial and \ninteresting case: we consider a conditionally independent attribute (CIA) model \nwhich postulates a single binary-valued hidden variable z on which all other \nattributes (i.e., the target and the observables) depend. In this model, finding the \nmost likely value of anyone variable (given known values for the others) reduces \nto testing a linear function of the observed values. \n\nWe learn CIA with two techniques: the standard EM algorithm, and a new \nalgorithm we develop based on covariances. We compare these, in a controlled \nfashion, against an algorithm (a version of Winnow) that attempts to find a good \nlinear classifier directly. Our conclusions help delimit the fragility of using the \nCIA model for classification: once the data departs from this model, performance \nquickly degrades and drops below that of the directly-learned linear classifier. \n\n1 Introduction \nWe consider the classic task of predicting a binary (0/1) target variable zo, based on \nthe values of some n other binary variables ZI \u2022\u2022\u2022 Zft,. 
We can distinguish between two styles of learning approach for such tasks. Parametric algorithms postulate some form of probabilistic model underlying the data, and try to fit the model's parameters. To classify an example we can compute the conditional probability distribution for x_0 given the values of the known variables, and then predict the most probable value. Non-parametric algorithms do not assume that the training data has a particular form. They instead search directly in the space of possible classification functions, attempting to find one with small error on the training set of examples. \nAn important advantage of parametric approaches is that the induced model can be used to support a wide range of inferences, aside from the specified classification task. On the other hand, to postulate a particular form of probabilistic model can be a very strong assumption. \n\n*Partly supported by ONR grant N00014-96-1-0550 while visiting Harvard University. \n\nSo it is important to understand how robust such methods are when the real world deviates from the assumed model. \nIn this paper, we report on some experiments that test this issue. We consider the specific case of n + 1 conditionally independent attributes x_i together with a single unobserved variable z, also assumed to be binary valued, on which the x_i depend (henceforth, the binary CIA model); see Section 2. In fact, such models are plausible in many domains (for instance, in some language interpretation tasks; see [GR96]). We fit the parameters of the CIA model using the well-known expectation-maximization (EM) technique [DLR77], and also with a new algorithm we have developed based on estimating covariances; see Section 4. In the nonparametric case, we simply search for a good linear separator. 
This is because the optimal predictors for the binary CIA model (i.e., for predicting one variable given known values for the rest) are also linear. This means that our comparison is \"fair\" in the sense that neither strategy can choose from classifiers with more expressive power than the other. As a representative of the non-parametric class of algorithms, we use the Winnow algorithm of [Lit88], with some modifications (see Section 6). Winnow works directly to find a \"good\" linear separator. It is guaranteed to find a perfect separator if one exists, and empirically seems to be fairly successful even when there is no perfect separator [GR96, Blu97]. It is also very fast. \nOur experimental methodology is to first generate synthetic data from a true CIA model and test performance; we then study various deviations from the model. There are various interesting issues involved in constructing good experiments, including the desirability of controlling the inherent \"difficulty\" of learning a model. Since we cannot characterize the entire space, we consider here only deviations in which the data is drawn from a CIA model in which the hidden variable can take more than two values. (Note that the optimal classifier for x_0 is generally not linear in this case.) \nOur observations are not qualitatively surprising. CIA does well when the assumed model is correct, but performance degrades when the world departs from the model. But as we discuss, we found it surprising how fragile this model can sometimes be, when compared against algorithms such as Winnow. This is even though the data is not linearly separable either, and so one might expect the direct learning techniques to degrade in performance as well. But it seems that Winnow and related approaches are far less fragile. 
Thus the main contribution of this work is that our results shed light on the specific tradeoff between fitting parameters to a probabilistic model, versus direct search for a good classifier. Specifically, they illustrate the dangers of predicting using a model that is even \"slightly\" simpler than the distribution actually generating the data, vs. the relative robustness of directly searching for a good predictor. This would seem to be an important practical issue, and highlights the need for some better theoretical understanding of the notion of \"robustness\". \n\n2 Conditionally Independent Attributes \nThroughout we assume that each example is a binary vector x ∈ {0,1}^(n+1), and that each example is generated independently at random according to some unknown distribution on {0,1}^(n+1). We use X_i to denote the i'th attribute, considered as a random variable, and x_i to denote a value for X_i. In the conditionally independent attribute (CIA) model, examples are generated as follows. We postulate a \"hidden\" variable Z with k values, which takes value z, for 0 ≤ z < k, with probability α_z ≥ 0. Since we must have ∑_{z=0}^{k-1} α_z = 1, there are k − 1 independent parameters. Having randomly chosen a value z for the hidden variable, we choose the value x_i for each observable X_i: the value is 1 with probability p_i^(z), and 0 otherwise. Here p_i^(z) ∈ [0,1]. The attributes' values are chosen independently of each other, although z remains fixed. Note that there are thus (n+1)k probability parameters p_i^(z). In the following, let P denote the set of all (n+1)k + k − 1 parameters in the model. From this point, and until Section 7, we always assume that k = 2 and in this case, to simplify notation, we write α_1 as α, α_0 (= 1 − α) as α', p_i^1 as p_i and p_i^0 as q_i. 
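For concreteness, the k = 2 generative process just described can be sketched in Python as follows. This is a minimal illustrative sketch; the function name and interface are ours, not the paper's:

```python
import random

def sample_cia(alpha, p, q, s, seed=0):
    # Draw s examples from a binary (k = 2) CIA model.
    #   alpha : probability that the hidden variable z is 1
    #   p[i]  : probability that attribute x_i is 1 given z = 1
    #   q[i]  : probability that attribute x_i is 1 given z = 0
    # Returns a list of s binary vectors of length n + 1 (x_0 is the target).
    rng = random.Random(seed)
    data = []
    for _ in range(s):
        z = rng.random() < alpha          # choose the hidden value once per example
        probs = p if z else q
        # given z, the attribute values are chosen independently of each other
        data.append([int(rng.random() < pi) for pi in probs])
    return data
```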
\n\n3 The Expectation-Maximization algorithm (EM) \nOne traditional unsupervised approach to learning the parameters of this model is to find the maximum-likelihood parameters of the distribution given the data. That is, we attempt to find the set of parameters that maximizes the probability of the data observed. \nFinding the maximum-likelihood parameterization analytically appears to be a difficult problem, even in this rather simple setting. However, a practical approach is to use the well-known Expectation-Maximization algorithm (EM) [DLR77], which is an iterative approach that always converges to a local maximum of the likelihood function. In our setting, the procedure is as follows. We simply begin with a randomly chosen parameterization P, and then we iterate until (apparent) convergence:1 \n\nExpectation: For all x^i, compute u_i = Pr_P(x^i ∧ z = 1) and v_i = Pr_P(x^i ∧ z = 0). \nMaximization: Reestimate P as follows (writing U = ∑_i u_i and V = ∑_i v_i): \n\nα ← ∑_i u_i / (U + V),  p_j ← ∑_{i: x_j^i = 1} u_i / U,  q_j ← ∑_{i: x_j^i = 1} v_i / V. \n\nAfter convergence has been detected, all we know is that we are near a local maximum of the likelihood function. Thus it is prudent to repeat the process with many different restarts. (All our experiments were extremely conservative concerning the stopping criteria at each iteration, and in the number of iterations we tried.) But in practice, we are never sure that the true optimum has been located. \n4 Covariances-Based approach \nPartly in response to the concern just expressed, we also developed another heuristic technique for learning P. The algorithm, which we call COV, is based on measuring the covariance between pairs of attributes. Since we do not see z, attributes will appear to be correlated. 
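The EM iteration of Section 3 can be sketched concretely as follows. This sketch forms the averages using the standard per-example posteriors r_i = u_i / (u_i + v_i); the naming and interface are ours:

```python
def em_step(data, alpha, p, q, eps=1e-12):
    # One EM iteration for the binary CIA model.
    #   data : list of binary vectors, each of length n + 1
    #   alpha, p, q : current parameter estimates (k = 2 notation of Section 2)
    r = []
    for x in data:
        u, v = alpha, 1.0 - alpha
        for j, xj in enumerate(x):
            # u is proportional to Pr(x and z = 1), v to Pr(x and z = 0)
            u *= p[j] if xj else (1.0 - p[j])
            v *= q[j] if xj else (1.0 - q[j])
        r.append(u / (u + v + eps))       # posterior responsibility for z = 1
    R = sum(r)
    n1 = len(data[0])
    alpha_new = R / len(data)
    p_new = [sum(ri * x[j] for ri, x in zip(r, data)) / (R + eps)
             for j in range(n1)]
    q_new = [sum((1.0 - ri) * x[j] for ri, x in zip(r, data)) / (len(data) - R + eps)
             for j in range(n1)]
    return alpha_new, p_new, q_new
```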
In fact, if the CIA model is correct, it is easy to show that the covariance between X_i and X_j (defined as γ_{i,j} = μ_{i,j} − μ_i μ_j, where μ_i, μ_j, μ_{i,j} are the expectations of X_i, X_j, and (X_i X_j), respectively) will be γ_{i,j} = α α' d_i d_j, where d_i denotes p_i − q_i. We also know that the expected value of X_i is μ_i = α p_i + α' q_i. Furthermore, we will be able to get very accurate estimates of μ_i just by observing the proportion of samples in which x_i is 1. Thus, if we could estimate both α and d_i it would be trivial to solve for estimates of p_i and q_i. \nTo estimate d_i, suppose we have computed all the pairwise covariances using the data; we use γ̂_{i,j} to denote our estimate of γ_{i,j}. For any distinct j, k ≠ i we clearly have α α' d_i^2 = |γ_{i,j} γ_{i,k} / γ_{j,k}|, so we could estimate d_i using this equation. A better estimate would be to consider all pairs j, k and average the individual estimates. However, not all individual estimates are equally good. It can be shown that the smaller γ_{j,k} is, the less reliable we should expect the estimate to be (and in the limit, where X_j and X_k are perfectly uncorrelated, we get no valid estimate at all). This suggests that we use a weighted average, with the weights proportional to |γ_{j,k}|. Using these weights leads us to the following equation for α α' d_i^2, which, after simplification, is: \n\nα α' d_i^2 = ∑_{j,k: j≠k, j≠i, k≠i} |γ_{i,j} γ_{i,k}| / ∑_{j,k: j≠k, j≠i, k≠i} |γ_{j,k}| = [ (∑_{j≠i} |γ_{i,j}|)^2 − ∑_{j≠i} γ_{i,j}^2 ] / [ ∑_{j,k: j≠k} |γ_{j,k}| − 2 ∑_{j≠i} |γ_{i,j}| ]. \n\nBy substituting the estimates γ̂_{i,j} we get an estimate for α α' d_i^2. This estimate can be computed in linear time, except for the determination of ∑_{j,k: j≠k} |γ_{j,k}|, which, although quadratic, does not depend on i and so can be computed once and for all. Thus it takes O(n^2) time in total to estimate α α' d_i^2 for all i. \nIt remains only to estimate α and the signs of the d_i's. Briefly, to determine the signs we first stipulate that d_0 is positive. 
(Because we never see z, one sign can be chosen at random.) \n\n1The maximization phase works as though we were estimating parameters by taking averages based on weighted labeled data (i.e., in which we see z). If x^i is a sample point, these fictional data points are (x^i, z = 1) with weight u_i/U and (x^i, z = 0) with weight v_i/V. \n\nIn principle, then, the sign of d_i will then be equal to the sign of γ_{0,i}, which we have an estimate for. In practice, this can be statistically unreliable for small sample sizes and so we use a more involved \"voting\" procedure (details omitted here). Finally, we estimate α. We have found no better method of doing this than to simply search for the optimal value, using likelihood as the search criterion. However, this is only a 1-dimensional search and it turns out to be quite efficient in practice. \n\n5 Linear Separators and CIA \n\nGiven a fully parameterized CIA model, we may be interested in predicting the value of one variable, say X_0, given known values for the remaining variables. One can show that in fact the optimal prediction region is given by a linear separator in the other variables, although we omit details of this derivation here.2 This suggests an obvious learning strategy: simply try to find the line which minimizes this loss on the training set. Unfortunately, in general the task of finding a linear separator that minimizes disagreements on a collection of examples is known to be NP-hard [HS92]. So instead we use an algorithm called Winnow that is known to produce good results when a linear separator exists, as well as under certain more relaxed assumptions [Lit91], and appears to be quite effective in practice. 
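Returning to the COV algorithm of Section 4, the weighted covariance estimate of α α' d_i^2 can be sketched as follows. Sign recovery and the one-dimensional search for α are omitted, and the naming is ours:

```python
def estimate_aad2(data):
    # Estimate alpha * alpha' * d_i^2 for every attribute i from the empirical
    # pairwise covariances g[i][j], averaging the individual estimates
    # |g[i][j] * g[i][k] / g[j][k]| with weights proportional to |g[j][k]|.
    s, n1 = len(data), len(data[0])
    mean = [sum(x[j] for x in data) / s for j in range(n1)]
    g = [[sum(x[i] * x[j] for x in data) / s - mean[i] * mean[j]
          for j in range(n1)] for i in range(n1)]
    # quadratic part, computed once for all i
    total = sum(abs(g[j][k]) for j in range(n1) for k in range(n1) if j != k)
    est = []
    for i in range(n1):
        row = sum(abs(g[i][j]) for j in range(n1) if j != i)
        num = row * row - sum(g[i][j] ** 2 for j in range(n1) if j != i)
        den = total - 2.0 * row
        est.append(num / den if den > 0 else 0.0)
    return est
```

On data drawn from a binary CIA model, each entry of the returned list approaches α α' d_i^2 as the sample grows.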
\n\n6 Learning using a Winnow-based algorithm \n\nThe basic version of the Winnow algorithm [Lit88] keeps an n-dimensional vector w = (w_1, ..., w_n) of positive weights (i.e., w_i is the weight associated with the ith feature), which it updates whenever a mistake is made. Initially, the weight vector is typically set to assign equal positive weight to all features. The algorithm has 3 parameters: a promotion parameter α > 1, a demotion parameter 0 < β < 1, and a threshold θ. For a given instance (x_1, ..., x_n) the algorithm predicts that x_0 = 1 iff ∑_{i=1}^n w_i x_i > θ. If the algorithm predicts 0 and the label (i.e., x_0) is 1 (a positive example) then the weights which correspond to active attributes (x_i = 1) are promoted: the weight w_i is replaced by a larger weight α · w_i. Conversely, if the algorithm predicts 1 and the received label is 0, then the weights which correspond to active features are demoted by the factor β. We allow for negative weights as follows. Given an example (x_1, ..., x_n), we rewrite it as an example over 2n variables (y_1, y_2, ..., y_{2n}) where y_i = x_i and y_{n+i} = 1 − x_i. We then apply Winnow just as above to learn 2n (positive) weights. If w_i^+ is the weight associated with x_i and w_i^− is the weight associated with y_{n+i} (i.e., 1 − x_i), then the prediction rule is simply to compare ∑_{i=1}^n (w_i^+ x_i + w_i^− (1 − x_i)) with the threshold. \nIn the experiments described here we have made two significant modifications to the basic algorithm. To reduce variance, our final classifier is a weighted average of several classifiers; each is trained using a subsample from the training set, and its weight is based on how well it was doing on that sample. Second, we biased the algorithm so as to look for \"thick\" classifiers. To understand this, consider the case in which the data is perfectly linearly separable. 
Then there will generally be many linear concepts that separate the training data we actually see. Among these, it seems plausible that we have a better chance of doing well on the unseen test data if we choose a linear concept that separates the positive and negative training examples as \"widely\" as possible. The idea of having a wide separation is less clear when there is no perfect separator, but we can still appeal to the basic intuition. To bias the search towards \"thick\" separators, we change Winnow's training rule somewhat. We now have a new margin parameter τ. As before, we always update when our current hypothesis makes a mistake, but now we also update if |∑_{i=1}^n w_i x_i − θ| is less than τ, even if the prediction is correct. In our experiments, we found that performance when using this version of Winnow is better than that of the basic algorithm, so in this paper we present results for the former. \n\n2A derivation for the slightly different case, of predicting z, can be found in [MP69]. \n\n7 Experimental Methodology \nAside from the choice of algorithm used, the number of attributes n, and the sample size s, our experiments also differed in two other dimensions. These are the type of process generating the data (we will be interested in various deviations from CIA), and the \"difficulty\" of the problem. These features are determined by the data model we use (i.e., the distribution over {0,1}^(n+1) used to generate data sets). \nOur first experiments consider the case where the data really is drawn from a binary CIA distribution. We associate with any such distribution a \"difficulty\" parameter B, which is the accuracy with which one could predict the value of z if one actually knew the correct model. (Of course, even with knowledge of the correct model we should not expect 100% accuracy.) 
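The difficulty B of a known binary CIA model, as just defined, can be estimated by Monte Carlo: sample examples, predict the hidden z by its posterior under the true parameters, and count agreements. A sketch, with naming of ours:

```python
import random

def estimate_difficulty(alpha, p, q, trials=20000, seed=0):
    # Monte Carlo estimate of B: the accuracy of the Bayes-optimal
    # prediction of the hidden z from the observed attributes, when
    # the true parameters (alpha, p, q) of the binary CIA model are known.
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        z = rng.random() < alpha
        probs = p if z else q
        x = [int(rng.random() < pi) for pi in probs]
        # unnormalized posterior weights of z = 1 versus z = 0 given x
        u, v = alpha, 1.0 - alpha
        for j, xj in enumerate(x):
            u *= p[j] if xj else (1.0 - p[j])
            v *= q[j] if xj else (1.0 - q[j])
        correct += int((u > v) == z)
    return correct / trials
```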
The ability to control B allows us to select and study models with different qualitative characteristics. In particular, this has allowed us to concentrate most of our experiments on fairly \"hard\" instances,3 and to more meaningfully compare trials with differing numbers of attributes. We denote by CIA(n, 2, b) the class of all data models which are binary CIA distributions over n variables with difficulty b.4 The next family of data models we used are also CIA models, but now using more than two values for the hidden variable. We denote the family using k values as CIA(n, k, b), where n and b are as before. When k > 2 there are more complex correlation patterns between the X_i than when k = 2. Furthermore, the optimal predictor is not necessarily linear. The specific results we discuss in the next section concentrate on this case. \nGiven any set of parameters, including a particular class of data models, our experiments are designed with the goal of good statistical accuracy. We repeatedly (typically 100 to 300 times) choose a data model at random from the chosen class, choose a sample of the appropriate size from this model, and then run all our algorithms. Each algorithm produces a (linear) hypothesis. We measure the success rate S_alg (i.e., the proportion of times a hypothesis makes the correct prediction of x_0) by drawing yet more random samples from the data model being used. In the test phase we always draw enough new samples so that the confidence interval for S_alg, for the results on a single model, has width at most ±1%. \nWe use the S_alg values to construct a normalized measure of performance (denoted T) as follows. Let S_best be the best possible accuracy attainable for predicting x_0 (i.e., the accuracy achieved by the actual model generating the data). 
Let S_const denote the performance of the best possible constant prediction rule (i.e., the rule that predicts the most likely a priori value for x_0). Note that S_const and S_best can vary from model to model. For each model we compute (S_alg − S_const) / (S_best − S_const), and our normalized statistic T is the average of these values. It can be thought of as measuring the percentage of the possible predictive power, over a plausible baseline, that an algorithm achieves. \n\n8 Results \nWe only report on a small, but representative, selection of our experiments in any detail. For instance, although we have considered many values of n ranging from 10 to 500, here we show six graphs giving the learning curves for CIA(n, k, 0.90) for n = 10, 75, and for k = 2, 3, 5; as noted, we display the T statistic. The error bars show the standard error,5 providing a rough indication of accuracy. Not surprisingly, when the data model is binary \n\n3Note that if one simply chooses parameters of a CIA model independently at random, without examining the difficulty of the model or adjusting for n, one will get many trivial problems, in which it is easy to predict z with nearly 100% accuracy, and thus predict optimally for X_0. \n\n4It is nontrivial to efficiently select random models from this class. Briefly, our scheme is to choose each parameter in a CIA model independently from a symmetric beta distribution. Thus, the model parameters will have expected value 0.5. We choose the parameter of the beta distribution (which determines concentration about 0.5) so that the average B value of the models thus generated equals b. Finally, we use rejection sampling to find CIA models with B values that are exactly b ± 1%. \n\n5Computed as the observed standard deviation, divided by the square root of the number of trials. 
\n\nCIA, the EM algorithm does extremely well, learning significantly (if not overwhelmingly) faster than Winnow. But as we depart from the binary CIA assumption, the performance of EM quickly degrades. \n\n[Figure 1: CIA(10,2,0.9); Figure 2: CIA(75,2,0.9); Figure 3: CIA(10,3,0.9); Figure 4: CIA(75,3,0.9); Figure 5: CIA(10,5,0.9); Figure 6: CIA(75,5,0.9). Each figure plots the normalized statistic T against the number of training examples for EM, COV, and Winnow, with standard-error bars.] \n\nWhen k = 3, performance is, on the whole, very similar for Winnow and EM. But when k = 5 Winnow is already superior to EM; significantly and uniformly so for n = 10. 
For fixed k the difference seems to become somewhat less dramatic as n increases; in Figure 6 (for n = 75) Winnow is less obviously dominant, and in fact is not better than EM until the sample size has reached 100. (But when s ≤ n, meaning that we have fewer samples than attributes, the performance is uniformly dismal anyway.) \nShould we attribute this degradation to the binary CIA assumption, or to EM itself? This question is our reason for also considering the covariance algorithm. We see that the results for COV are generally similar to EM's, supporting our belief that the phenomena we see are properties inherent to the model rather than to the specific algorithm being used. Similarly (the results are omitted) we have tried several other algorithms that try to find good linear separators directly, including the classic Perceptron algorithm [MP69]; our version of Winnow was the best on the experiments we tried and thus we conjecture that its performance is (somewhat) indicative of what is possible for any such approach. \nAs the comparison between n = 10 and n = 75 illustrates, there is little qualitative difference between the phenomena observed as the number of attributes increases. Nevertheless, as n grows it does seem that Winnow needs more examples before its performance surpasses that of the other algorithms (for any fixed k). As already noted, this may be due simply to the very \"noisy\" nature of the region s ≤ n. We also have reasons to believe that this is partially an artifact of the way we select models. \nAs previously noted, we also experimented with varying \"difficulty\" (B) levels. Although we omit the corresponding figures, the main difference is that Winnow is a little faster in surpassing EM when the data deviates from the assumed model; when the data model really is binary CIA, EM converges even faster to optimal performance. 
\nThese patterns were confirmed when we tried to compare the approaches on real data. We have used data that originates from a problem in which assuming a hidden \"context\" variable seems somewhat plausible. The data is taken from the context-sensitive spelling correction domain. We used one data set from those that were used in [GR96]. For example, given sentences in which the word passed or past appears, the task is to determine, for each such occurrence, which of the two it should be. This task may be modeled by thinking of the \"context\" as a hidden variable in our sense. Yet when we tried to learn in this case under the CIA model, with a binary-valued hidden variable, the results were no better than just predicting the most likely classification (around 70%). Winnow, in contrast, performed extremely well and exceeds 95% on this task. We hesitate to read much into our limited real-data experiments, other than to note that so far they are consistent with the more careful experiments on synthetic data. \n9 Conclusion \nBy restricting to a binary hidden variable, we have been able to consider a \"fair\" comparison between probabilistic model construction and more traditional algorithms that directly learn a classification, at least in the sense that both have the same expressive power. Our conclusions concerning the fragility of the former should not be surprising, but we believe that given the importance of the problem it is valuable to have some idea of the true significance of the effect. As we have indicated, in many real-world cases where a model of the sort we have considered here seems plausible, it is impossible to nail down more specific characterizations of the probabilistic model. Our results exhibit how important it is to use the correct model, and how sensitive the results are to deviations from it, when attempting to learn using model construction. 
The purpose of this paper is not to advocate that in practice one should use either Winnow or binary CIA in exactly the form considered here. A richer probabilistic model should be used, along with a model selection phase. However, studying the problem in a restricted and controlled environment is crucial so as to understand the nature and significance of this fundamental problem. \nReferences \n[Blu97] A. Blum. Empirical support for winnow and weighted majority based algorithms: results on a calendar scheduling domain. Machine Learning, 26:1-19, 1997. \n\n[DLR77] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1-38, 1977. \n\n[GR96] A. R. Golding and D. Roth. Applying Winnow to context-sensitive spelling correction. In Proc. 13th International Conference on Machine Learning (ML'96), pages 182-190, 1996. \n\n[HS92] K. Höffgen and H. Simon. Robust trainability of single neurons. In Proc. 5th Annu. Workshop on Comput. Learning Theory, pages 428-439, New York, New York, 1992. ACM Press. \n\n[Lit88] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318, 1988. \n\n[Lit91] N. Littlestone. Redundant noisy attributes, attribute errors, and linear threshold learning using Winnow. In Proc. 4th Annu. Workshop on Comput. Learning Theory, pages 147-156, San Mateo, CA, 1991. Morgan Kaufmann. \n\n[MP69] M. L. Minsky and S. A. Papert. Perceptrons. MIT Press, Cambridge, MA, 1969. \n", "award": [], "sourceid": 1473, "authors": [{"given_name": "Adam", "family_name": "Grove", "institution": null}, {"given_name": "Dan", "family_name": "Roth", "institution": null}]}