{"title": "A Massively Parallel Digital Learning Processor", "book": "Advances in Neural Information Processing Systems", "page_first": 529, "page_last": 536, "abstract": "We present a new, massively parallel architecture for accelerating machine learning algorithms, based on arrays of variable-resolution arithmetic vector processing elements (VPE). Groups of VPEs operate in SIMD (single instruction multiple data) mode, and each group is connected to an independent memory bank. In this way memory bandwidth scales with the number of VPE, and the main data flows are local, keeping power dissipation low. With 256 VPEs, implemented on two FPGA (field programmable gate array) chips, we obtain a sustained speed of 19 GMACS (billion multiply-accumulate per sec.) for SVM training, and 86 GMACS for SVM classification. This performance is more than an order of magnitude higher than that of any FPGA implementation reported so far. The speed on one FPGA is similar to the fastest speeds published on a Graphics Processor for the MNIST problem, despite a clock rate of the FPGA that is six times lower. High performance at low clock rates makes this massively parallel architecture particularly attractive for embedded applications, where low power dissipation is critical. 
Tests with Convolutional Neural Networks and other learning algorithms are under way now.", "full_text": "A Massively Parallel Digital Learning Processor \n\nHans Peter Graf (hpg@nec-labs.com), Srihari Cadambi (cadambi@nec-labs.com), Igor Durdanovic (igord@nec-labs.com), Venkata Jakkula (jakkula@nec-labs.com), Murugan Sankaradass (murugs@nec-labs.com), Eric Cosatto (cosatto@nec-labs.com), Srimat Chakradhar (chak@nec-labs.com) \nNEC Laboratories America, 4 Independence Way, Suite 200, Princeton, NJ 07738, USA \n\nAbstract \n\nWe present a new, massively parallel architecture for accelerating machine learning algorithms, based on arrays of vector processing elements (VPEs) with variable-resolution arithmetic. Groups of VPEs operate in SIMD (single instruction multiple data) mode, and each group is connected to an independent memory bank. The memory bandwidth thus scales with the number of VPEs, while the main data flows are local, keeping power dissipation low. With 256 VPEs, implemented on two FPGA (field programmable gate array) chips, we obtain a sustained speed of 19 GMACS (billion multiply-accumulate operations per second) for SVM training, and 86 GMACS for SVM classification. This performance is more than an order of magnitude higher than that of any FPGA implementation reported so far. The speed on one FPGA is similar to the fastest speeds published on a Graphics Processor for the MNIST problem, despite a clock rate that is an order of magnitude lower. Tests with Convolutional Neural Networks show similar compute performance. This massively parallel architecture is particularly attractive for embedded applications, where low power dissipation is critical. \n\n1 Introduction \n\nMachine learning demands higher and higher compute performance, but serial processors are not improving much anymore, at least not as quickly as they used to.
Mainstream processor development is moving to multi-core systems, using shared memory technology to hide the parallel nature of the processors. But shared memory technology does not scale to hundreds or thousands of cores. In order to reach such levels of parallelization, alternative approaches have to be developed. Massively parallel general-purpose computers have had limited success so far, because of the difficulty of programming these machines, and they remain a niche market, mostly in high-performance computing. Yet processors specialized for certain application domains, such as graphics processors or routing processors (e.g. the Nvidia Quadro FX 5600 graphics processor or the Cisco CRS-1 routing processor), have been parallelized to several hundred cores and are successful mass products. They improve performance over general-purpose processors by focusing on a few key algorithmic elements, yet still maintain enough flexibility that they can be programmed for a variety of applications. We explore in this paper whether a similar approach can lead to efficient machine learning processors. \n\nSeveral processors optimized for machine learning, in particular for neural networks, were developed during the 1980s and 90s. Examples are the Synapse-1 architecture [1] and the Connectionist Network Supercomputer, CNS1 [2]. Recently there has been less activity in this field, but some accelerators are sold today for specific applications, such as the Axeon [3] processor for powertrain control of cars. Besides digital processors, a large number of analog circuits were built, emulating neural network structures. Extremely high performance with low power dissipation is achievable, see e.g. [4][5], but these networks have little flexibility. SVM implementations on FPGAs have been demonstrated in recent years [6-8], yet reached only low compute performance.
All machine learning processors have had only limited success so far, indicating how difficult it is to find a good combination of performance, flexibility, price and ease of use. An important consideration is that many applications of machine learning, such as video analysis, data mining, or personalization of services, show the most promise in embedded systems. Embedded learning requires high compute performance while dissipating little power, a combination that is difficult to achieve and so far has required application-specific ICs (ASICs). Our aim is to develop architectures that meet the requirements for embedded learning, but are programmable and therefore can be used in a wide range of applications. \n\nWith the goal of analyzing different architectures, we designed a development and testing environment where the parallel computation is mapped onto FPGAs. Initially this system was intended only for experimentation, but its performance is so high that this platform is useful in its own right as an accelerator for high-performance systems. While the experiments shown here emphasize high performance, the architecture has been designed from the start for low power dissipation. The main features for achieving this goal are: low-resolution arithmetic, keeping the main data flow local, low operating frequencies, and a modular design, so that unused parts can be powered down dynamically. All results shown here are from the test platform; migration to low-power FPGAs or chip designs will be done at a later stage. \n\n2 Algorithms - Arithmetic - Architecture \n\nFor a substantial improvement over a general-purpose processor, the algorithms, the arithmetic units, as well as the architecture have to be optimized simultaneously. This is not just an exercise in hardware design; algorithms and their software implementations have to be developed concurrently. Most machine learning algorithms have not been developed with parallelization in mind. Therefore, we first need to find good parallel versions, identify their performance bottlenecks, and then extract common computational patterns that can be mapped into accelerator hardware. \n\n2.1 Algorithms \n\nCharacteristic of machine learning is that large amounts of data need to be processed, often with predictable data access patterns and no dependency between operations over large segments of the computation. This is why data-parallelization can often provide good accelerations on multi-core chips, clusters of machines, or even on loosely coupled networks of machines. Using MapReduce, speedups linear in the number of processors have been reported in [9] for several machine learning algorithms. Up to 16 cores were tested, and simulations indicate good scaling to more processors in some cases. \n\nMany algorithms, such as KNN, K-means clustering, LVQ, and Neural Networks, can be reduced to forms where the computation is dominated by vector-matrix multiplications, which are easily parallelizable. For Convolutional Neural Networks (CNN) the data flow can be complex, yet the core of the computation is a convolution, an operation which has been studied extensively for parallel implementations. For Support Vector Machines (SVM), several parallel algorithms have been described, but most saturate quickly for more than 16 processors. Scaling to larger numbers of processors has been demonstrated, applying MapReduce on a graphics processor with 128 cores [10]. Another implementation on a cluster of 48 dual-core machines (with 384 MMX units) [11] scales even super-linearly and, according to simulations, scales to thousands of cores. \n\nBased on this analysis it is clear that vector-matrix and matrix-matrix multiplications for large vector dimensionalities and large numbers of vectors must be handled efficiently.
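To make the dominance of the vector-matrix product concrete, the evaluation of one kernel-matrix column, the core step of SVM training discussed below, can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name, the RBF kernel choice, the value of gamma, and the toy problem sizes are assumptions, not the authors' implementation.

```python
import numpy as np

def rbf_kernel_column(X, x_i, gamma=0.02):
    """One column of the RBF kernel matrix: K[:, i] = exp(-gamma * ||x - x_i||^2).

    The dominant cost is the dot products X @ x_i, a vector-matrix
    multiplication that parallelizes trivially across the rows of X.
    """
    dots = X @ x_i                           # n dot products of dimension d
    sq_norms = np.einsum('ij,ij->i', X, X)   # row-wise squared norms
    dist_sq = sq_norms - 2.0 * dots + x_i @ x_i
    return np.exp(-gamma * dist_sq)

# toy data: n = 1000 vectors of dimension d = 64
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 64))
col = rbf_kernel_column(X, X[3])
```

The same structure covers the polynomial and tanh kernels mentioned later, since all three reduce to the dot products before the scalar non-linearity.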
Yet this alone is not sufficient, since data access patterns vary greatly between algorithms. We analyze this here in more detail for SVM and CNN. These algorithms were chosen because they are widely used for industrial applications and cover a broad range of computation, I/O, and memory requirements. \n\nThe characteristics of SVM training are summarized in Table 1. We use an approach similar to the one described in [11] to split different parts of the computation between a host CPU and the FPGA accelerator. For large dimensions d of the vectors, the calculation of the columns of the kernel matrix dominates by far. This is needed to update the gradients, and in the present implementation only this part is mapped onto the FPGA. If the dimensionality d is smaller than around 100, operations 2 and 5 can become bottlenecks and should also be mapped onto the accelerator. A challenge is that for each kernel computation a new data vector has to be loaded into the processor, leading to very high I/O requirements. We consider here dimensions of 10 - 10^4 and numbers of training data of 10^5 - 10^7, easily resulting in Gigabytes that need to be transferred to the processors at each iteration. \n\n# | Operation | Computation | IO | Unit \n1 | Initialize all \u03b1x, Gx | 2n | 2n | CPU \n- | Do | | | \n2 | Find working set \u03b1i, \u03b1j | I * 2n | I * 2n | CPU \n3 | Update \u03b1i, \u03b1j | I * 10 | I * 2 | CPU \n4 | Get 2 columns of kernel matrix | I * 2nd | I * (2d + 2dn) | FPGA \n5 | Update gradients Gx | I * n | I * n | CPU \n6 | While not converged | | | \n\nTable 1: Compute and IO requirements of each step for SVM training (SMO algorithm). n: number of training data; d: dimension of the vectors; G: gradients; \u03b1: support vector factors; I: number of iterations. The last column indicates whether the execution happens on the host CPU or the accelerator FPGA.
It is assumed that the kernel computation requires a dot product between vectors (e.g. rbf, polynomial, tanh kernels). \n\nNeural network algorithms are essentially sequences of vector-matrix multiplications, but networks with special connectivity patterns, such as convolutional networks, have very different IO characteristics than fully connected networks. Table 2 shows the computation and IO requirements for scanning several convolution kernels over one input plane. A full network requires multiples of these operations for one layer, with nonlinearities between layers. We map all operations onto the FPGA accelerator, since intermediate results are re-used right away. The most significant difference between SVM and CNN is the Compute/IO ratio: SVM: ~ 1; CNN: ~ L*k^2 > 100. Therefore the requirements for these two algorithms are very different, and handling both cases efficiently is quite a challenge for an architecture design. \n\n# | Operation | Computation | IO | Unit \n1 | Load L kernels | | L * k^2 | FPGA \n2 | For all input pixels | | | FPGA \n3 | Shift in new pixel | | n * m | FPGA \n4 | Multiply kernels | n * m * L * k^2 | | FPGA \n5 | Shift out result | | n * m | FPGA \n\nTable 2: Compute and IO requirements for the CNN computation (forward pass), where L kernels of size k*k are scanned simultaneously over an input plane of size n*m. This is representative for implementations with kernel unrolling (kernel pixels processed in parallel). Internal shifts, computation of the non-linearity, and border effects are not shown. \n\n2.2 Arithmetic \n\nHardware can be built much more compactly and run with lower power dissipation if it uses fixed-point instead of floating-point operations. Fortunately, many learning algorithms tolerate a low resolution in most of the computations. This has been investigated extensively for neural networks [12][13], but less so for other learning algorithms.
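One way to probe this tolerance in software is to quantize the factors of a dot product to a fixed-point grid and compare the result against the float reference. The sketch below (round-to-nearest, saturating; the function name and the choice of 12 fractional bits in a 16-bit word are illustrative assumptions) is not the authors' simulator, only a minimal stand-in:

```python
import numpy as np

def to_fixed(x, frac_bits, word_bits=16):
    """Round to a signed fixed-point grid with `frac_bits` fractional bits,
    saturating to the range representable in `word_bits`-bit words."""
    scale = 1 << frac_bits
    q = np.rint(x * scale)
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    return np.clip(q, lo, hi) / scale

rng = np.random.default_rng(1)
a = rng.uniform(-1, 1, 784)
b = rng.uniform(-1, 1, 784)

exact = float(a @ b)
# quantize the factors to 16-bit fixed point; accumulate in float,
# standing in for the wide hardware accumulator
approx = float(to_fixed(a, 12) @ to_fixed(b, 12))
err = abs(approx - exact)
```

With 784-dimensional inputs the quantization errors largely average out in the accumulation, which is the effect the following paragraphs explain.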
Learning from data is inherently a noisy process, because we see only a sparse sampling of the true probability distributions. A different type of noise is introduced in gradient descent algorithms, when only a few training data are used at a time to move the optimization forward iteratively. This noise is particularly pronounced for stochastic gradient descent. There is no point in representing noisy variables with high resolution, and it is therefore a property inherent to many algorithms that low-resolution computation can be used. \n\nIt is important not to confuse this tolerance to low resolution with the resolution required to avoid numeric instabilities. Some of the computations have to be performed with a high resolution, in particular for variables that are updated incrementally. They maintain the state of the optimization and may change in very small steps. But usually by far the largest part of the computation can be executed at a low resolution. Key is that the hardware is flexible enough to take advantage of reduced resolution while handling high resolution where necessary. \n\nProblem | Obj. f. (float) | # SV (float) | F-score (float) | Obj. f. (16 bit) | # SV (16 bit) | F-score (16 bit) | F-sc. (4b in) \nAdult | 31,930.77 | 11,486 | 77.58 | 31,930.1 | 11,490 | 77.63 | NA \nForest | 653,170.7 | 49,333 | 98.29 | 652,758 | 49,299 | 98.28 | NA \nMNIST | 4,960.13 | 6,172 | 99.12 | 4,959.64 | 6,166 | 99.11 | 99.11 \nNORB | 1,243.71 | 3,077 | 93.34 | 1,244.76 | 3,154 | 93.26 | 92.78 \n\nTable 3: Comparison of the results of SVM training when the kernels are represented with floating point numbers (32 or 64 bits) and with 16 bit fixed point. The last column shows the results when the resolution of the training data is reduced from 8 bit to 4 bit. For NORB this reduces the accuracy; all other differences in accuracy are not significant. All are two-class problems: Adult: n=32,562, d=122; Forest: n=522,000, d=54 (2 against the rest); MNIST: n=60,000, d=784 (odd\u2013even); NORB: n=48,560, d=5,184. \n\nWe developed a simulator that allows running the training algorithms with various resolutions in each of the variables. A few examples for SVM training are shown in Table 3. Reducing the resolution of the kernel values from double or float to 16 bit fixed point representations does not affect the accuracy for any of the problems. Therefore all the multiplications in the dot products for the kernel computation can be done at low resolution (4\u201316 bit in the factors), but the accumulator needs sufficient resolution to avoid overflow/underflow (48 bit). Once the calculation of the kernel value is completed, it can be reduced to 16 bit. A low resolution of 16 bit is also tolerable for the \u03b1 values, but a high resolution is required for the gradients (double). For Neural Networks, including CNN, several studies have confirmed that states and gradients can be kept at low resolutions (<16 bit), but the weights must be maintained at a high resolution (float) (see e.g. [12]). In our own evaluations 24 bits in the weights tend to be sufficient. Once the network is trained, low resolutions can be used for the weights as well in classification (<16 bit). \n\n2.3 Architecture \n\nFigure 1: Left: Schematic of the architecture with the main data flows; on one FPGA, 128 VPEs are configured into four SIMD groups; L-S: load-store units. Right: Picture of an FPGA board; in our experiments one or two of them are used, connected via PCI bus to a host CPU. \n\nBased on the analysis above, it is clear that the architecture must be optimized for processing massive amounts of data with relatively low precision. Most of the time, data access patterns are predictable and data are processed in blocks that can be stored contiguously.
This type of computation is well suited for vector processing, and simple vector processing elements (VPEs) with fixed-point arithmetic can handle the operations. Since typically large blocks of data are processed with the same operation, groups of VPEs can work in SIMD (single instruction multiple data) mode. Algorithms must then be segmented to map the high-volume, low-precision parts onto the vector accelerators and the parts requiring high-precision arithmetic onto the CPU. \n\nThe most important design decision is the organization of the memory. Most memory accesses are done in large blocks, so that the data can be streamed, making complex caching unnecessary. This is fortunate, since the amounts of data to be loaded onto the processor are so large that conventional caching strategies would be overwhelmed anyway. Because the blocks tend to be large, a high data bandwidth is crucial, but the latency for starting a block transfer is less critical. Therefore we can use regular DDR memories and still get high IO rates. This led to the design shown schematically in Figure 1, where independent memory banks are connected via separate IO ports for each group of 32 VPEs. \n\nBy connecting several of the units shown in Figure 1 to a CPU, this architecture scales to larger numbers of VPEs. Parallel data IO and parallel memory access scale simultaneously with the number of parallel cores, and we therefore refer to this as the P3 (P-cube) architecture. Notice also that the main data flow is only local, between a group of VPEs and its own memory block. Avoiding movement of data over long distances is crucial for low power dissipation. How far this architecture can reasonably scale with one CPU depends on the algorithms, the amount of data and the vector dimensionality (see below). A few hundred VPEs per CPU have provided good accelerations in all our tests, and much higher numbers are possible with multi-core CPUs and faster CPU-FPGA connections.
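The locality argument can be illustrated with a toy model of the memory organization: if the rows of a data matrix are striped across independent banks, each SIMD group computes its partial result from its own bank only, and the partials merely need to be concatenated at the end. The sketch below (names and sizes are illustrative, not the hardware layout) checks that the banked computation matches the monolithic one:

```python
import numpy as np

def banked_matvec(X_banks, w):
    """Matrix-vector product where each SIMD group reads rows only from
    its own memory bank; partial results are concatenated at the end."""
    return np.concatenate([bank @ w for bank in X_banks])

rng = np.random.default_rng(2)
X = rng.standard_normal((128, 64))
w = rng.standard_normal(64)

# four groups of VPEs, each with an independent bank of 32 rows
banks = np.array_split(X, 4)
assert np.allclose(banked_matvec(banks, w), X @ w)
```

Because each term of the product touches only one bank, aggregate memory bandwidth grows with the number of groups, which is the scaling property the P3 design exploits.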
\n\n3 Implementation of the P3 Architecture \n\nThis architecture fits surprisingly well onto some of the recent FPGA chips that are available with several hundred Digital Signal Processing (DSP) units and over 1,000 IO pins for data transfers. The boards used here each contain one Xilinx Virtex 5 LX330T-2 FPGA, coupled to 4 independent DDR2 SDRAM banks with a total of 1 GB, and 2 independent 4 MB SSRAM memory banks (commercial boards from AlphaData). One FPGA chip contains 192 DSP units with a maximum speed of 550 MHz, which corresponds to a theoretical compute performance of 105.6 GMACS (18 bit and 25 bit operands). There is a total of 14 Mbit of on-chip memory, and the chip incorporates 960 pins for data IO. Due to routing overhead, not all DSP units can be used, and the actual clock frequencies tend to be considerably lower than what is advertised for such chips (typically 230 MHz or less for our designs). Nevertheless, we obtain high performance because we can use a large number of DSP units for executing the main computation. \n\nThe main architecture features are: \n\n\u2022 Parallel Processing (on one chip): 128 VPEs (hardware DSP units) are divided into 4 blocks of 32, each group controlled by one sequencer with a vector instruction set. \n\u2022 Custom Precision: Data are represented with 1 to 16 bit resolution. Higher resolutions are possible by operating multiple DSP units as one processor. \n\u2022 Overlapping Computation and Communication: CPU-FPGA communication is overlapped with the FPGA computation. \n\u2022 Overlapping Memory Operations with Computation: All loads and stores from the FPGA to off-chip memory are performed concurrently with computations. \n\u2022 High Off-chip Memory Bandwidth: 6 independent data ports, each 32 bits wide, access banked memories concurrently (12 GB/s per chip).
\n\u2022 Streaming Data Flow, Simple Access Patterns: Load/store units are tailored for streaming input and output data, and for simple, bursty access patterns. Caching is done under application control with dual-port memory on chip. \n\u2022 Load/store with (de)compression: To increase the effective IO bandwidth, the load/store units provide compression and decompression in hardware. \n\nFigure 2 shows the configuration of the VPEs for the vector dot-product computation used for SVM training and classification. For training, the main computation is the calculation of one column of the kernel matrix. One vector is pre-fetched and stored in on-chip memory. All other vectors are streamed in from off-chip memory banks 1-4. Since this is a regular and predictable access pattern, we can utilize burst mode, achieving a throughput of close to one memory word per cycle. But the speed is nevertheless IO bound. When several vectors can be stored on-chip, as is the case for classification, then the speed becomes compute bound. \n\nFigure 2: Architecture for vector dot-product computation. The left side shows a high-level schematic with the main data flow. The data are streamed from memory banks 1-4 to the VPE arrays, while memory banks 5 and 6 alternately receive results or stream them back to the host. The right side shows how a group of VPEs is pipelined to improve clock speed. \n\nThe operation for SVM training on the FPGA corresponds to a vector-matrix multiplication and the one for classification to a matrix-matrix multiplication. Therefore the configuration of Figure 2 is useful for many other algorithms as well, where operations with large vectors and matrices are needed, such as Neural Networks. We implemented a specialized configuration for Convolutional Neural Networks, for more efficiency and lower power dissipation. The VPEs are daisy-chained and operate as a systolic array. In this way we can take advantage of the high computation-to-IO ratio (Table 2) to reduce the data transfers from memory. \n\n4 Evaluations \n\nWe evaluated SVM training and classification with the NORB and MNIST problems, the latter with up to 2 million training samples (data from [11]). Both are benchmarks with vectors of high dimensionality, representative for applications in image and video analysis. The computation is split between CPU and FPGA as indicated by Table 1. The DDR2 memory banks are clocked at 230 MHz, providing double that rate for data transfers. The data may be compressed to save IO bandwidth. On the FPGA they are first decompressed and then distributed to the VPEs. In our case, a 32 bit word contains eight 4-bit vector components. Four 32 bit words are needed to feed all 32 VPEs of a group; therefore clocking the VPEs faster than 115 MHz does not improve performance. A VPE executes a multiply plus add operation in one clock cycle, resulting in a theoretical maximum of 14.7 GMACS per chip. The sustained compute rate is lower, about 9.4 GMACS, due to overhead (see Table 4). The computation on the host CPU overlaps with that on the FPGA, and has no effect on the speed in the experiments shown here. For classification the VPEs can be clocked higher, at 230 MHz. By using 4-bit operands we can execute 2 multiply-accumulates simultaneously on one DSP unit, resulting in a speed that is more than four times higher, a sustained 43.0 GMACS, limited by the number and speed of the VPEs. Adding a second FPGA card doubles the speed, showing little saturation effect yet, but for more FPGAs per CPU there will be saturation (see Fig. 3). The compute speed in GMACS obtained for NORB is almost identical.
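The 4-bit packing mentioned above (eight 4-bit components per 32-bit word) can be sketched as follows. This illustrates only the data layout, not the hardware decompression unit, and it treats the components as unsigned nibbles, which is an assumption:

```python
def pack_nibbles(components):
    """Pack eight 4-bit unsigned components (values 0..15) into one 32-bit
    word, with component 0 in the least significant nibble."""
    assert len(components) == 8 and all(0 <= c < 16 for c in components)
    word = 0
    for i, c in enumerate(components):
        word |= c << (4 * i)
    return word

def unpack_nibbles(word):
    """Inverse of pack_nibbles: recover the eight 4-bit components."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]

w = pack_nibbles([1, 2, 3, 4, 5, 6, 7, 8])
assert unpack_nibbles(w) == [1, 2, 3, 4, 5, 6, 7, 8]
```

Packing is what makes four 32-bit memory words per cycle sufficient to feed all 32 VPEs of a group with 4-bit operands.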
\n\n# | Iterations | CPU time | speed | CPU+MMX time | speed | CPU+FPGA time | speed | CPU+2 FPGA time | speed \n60k | 8,000 | 754 s | 0.5 | 240 s | 1.57 | 40 s | 9.42 | 21 s | 17.9 \n2M | 266,900 | -- | -- | 531,534 s | 1.58 | 88,589 s | 9.48 | 48,723 s | 17.2 \n\nTable 4: Training times and average compute speed for SVM training. Systems tested: CPU, Opteron, 2.2 GHz; CPU using MMX; CPU with one FPGA board; CPU with two FPGA boards. Results are shown for training set sizes of 60k and 2M samples. Compute speed is in GMACS (kernel computations only). Training algorithm: SMO with second-order working set selection. \n\nParallelizations of SVM training have been reported recently for a GPU [10] and for a cluster [11], both using the MNIST data. In [10] different stopping bounds were used than here and in [11]. Nevertheless, a comparison of the compute performance is possible, because based on the number of iterations we can compute the average GMACS for the kernel computations. As can be seen in Table 5, a single FPGA is similar in speed to a GPU with 128 stream processors, despite a clock rate that is about 5.5 times lower for I/O and 11 times lower for the VPEs. The cluster with 384 MMX units is about 6 times faster than one FPGA with 128 VPEs, but dissipates about two orders of magnitude more electric power. For the FPGA this calculation includes only the computation of the kernel values, while the part on the CPU is neglected. This is justified for this study, because the rest of the calculations can be mapped onto the FPGA as well and will increase the power dissipation only minimally.
\n\nProcessor | Number of cores | Clock speed | Operand type | Power dissipation | Average compute speed \nCPU (Opteron) | 1 | 2.2 GHz | float | 40 W | 0.5 GMACS \nGPU (from [10]) | 128 | 1.35 GHz | float | 80 W | 7.4 GMACS \nCluster (from [11]) | 384 | 1.6 GHz | byte | > 1 kW | 54 GMACS \nFPGA | 128 | 0.12 GHz | 4-bit nibble | 9 W | 9.4 GMACS \n\nTable 5: Comparison of performance for SVM training (MNIST data). GPU: Nvidia 8800 GTX. Cluster: 48 dual-core CPUs (Athlon), 384 MMX units. The GPU trained with 60k samples ([10], table 2, second order); the cluster trained with 2 million samples. \n\nFigure 3: Acceleration of SVM training as a function of the number of VPEs. MNIST: n=2,000,000, d=784; NORB: n=48,560, d=5,184. The points for 128 and 256 VPEs are experimental; the higher ones are simulations. Curves MNIST, NORB: multiple FPGAs are attached to one CPU. Curve MNIST C: each FPGA is attached to a separate host CPU. \n\nScaling of the acceleration with the number of VPEs is shown in Figure 3. The reference speed is that of one FPGA attached to a CPU. The evaluation has been done experimentally for 128 and 256 VPEs, and beyond that with a simulator. The onset of saturation depends on the dimensionality of the vectors, but to a much lesser extent on the number of training vectors (up to the limit of the memory on the FPGA card). MNIST saturates for more than two FPGAs, because then the CPU and FPGA computation times become comparable. For the larger vectors of NORB (d=5,184) this saturation starts to be noticeable only for more than 4 FPGAs. Alternatively, a system can be scaled by grouping multiple CPUs, each with one attached FPGA accelerator. Then the scaling follows a linear or even super-linear acceleration (MNIST C) to several thousand VPEs. If the CPUs are working in a cluster arrangement, the scaling is similar to the one described in [11].
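The saturation behavior visible in Figure 3 can be captured by a toy overlap model: the FPGA part of an iteration shrinks with the number of boards while the CPU part stays fixed, and since the two overlap, the iteration time is roughly the maximum of the two. This is a back-of-the-envelope sketch with illustrative numbers, not the authors' simulator:

```python
def speedup(num_fpga, t_cpu, t_fpga_one):
    """Speedup over one FPGA board when the FPGA work is split across
    `num_fpga` boards, the CPU part stays serial, and CPU and FPGA
    computations fully overlap (iteration time = max of the two parts)."""
    t_one = max(t_cpu, t_fpga_one)
    t_k = max(t_cpu, t_fpga_one / num_fpga)
    return t_one / t_k

# illustrative: the FPGA part is 10x the CPU part with one board,
# so the acceleration is capped at 10x regardless of board count
curve = {k: speedup(k, t_cpu=1.0, t_fpga_one=10.0) for k in (1, 2, 4, 8, 16)}
```

Larger vector dimensionality increases the FPGA share of the work, which raises the ceiling; this matches the later onset of saturation for NORB compared to MNIST.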
\nFor convolutional neural networks, the architecture of Figure 2 is modified to allow a block of VPEs to operate as a systolic array. In this way convolutions can be implemented with minimal data movements. In addition to the convolution, sub-sampling and non-linear functions, plus the logistics to handle multiple layers with arbitrary numbers of kernels in each layer, are also done on the FPGA. Four separate blocks of such convolvers are packed onto one FPGA, using 100 VPEs. Clocked at 115 MHz, this architecture provides a maximum of 11.5 GMACS. Including all the overhead, the sustained speed is about 10 GMACS. \n\n5 Conclusions \n\nBy systematically exploiting characteristic properties of machine learning algorithms, we developed a new massively parallel processor architecture that is very efficient and can be scaled to thousands of processing elements. The implementation demonstrated here is more than an order of magnitude higher in performance than previous FPGA implementations of SVMs or CNNs. For the MNIST problem it is comparable to the fastest GPU implementations reported so far. These results underline the importance of flexibility over raw compute speed for massively parallel systems. The flexibility of the FPGA allows more efficient routing and packing of the data and the use of computations with the lowest resolution an algorithm permits. The results of Table 5 indicate the potential of this architecture for low-power operation in embedded applications. \n\nReferences \n\n[1] Ramacher, et al. (1995) Synapse-1: A high-speed general purpose parallel neurocomputer system. In Proc. 9th Intl. Symposium on Parallel Processing (IPPS'95), pp. 774-781. \n[2] Asanovic, K., Beck, Feldman, J., Morgan, N. & Wawrzynek, J. (1994) A Supercomputer for Neural Computation. In Proc. IEEE Intl. Joint Conference on Neural Networks, pp. 5-9, Orlando, Florida.
\n[3] Neil, P., (2005) Combining hardware with a powerful automotive MCU for powertrain applications. In \nIndustrial Embedded Resource Guide, p. 88. \n[4] Korekado, et al. (2003) A Convolutional Neural Network VLSI for Image Recognition Using \nMerged/Mixed Analog-Digital Architecture, in Proc. 7th KES 2003, Oxford, pp 169-176. \n[5] Murasaki, M., Arima, Y. & Shinohara, H. (1993) A 20 Tera-CPS Analog Neural Network Board. In Proc. \nInt. Joint Conf. Neural Networks, pp. 3027 \u2013 3030. \n[6] Pedersen, R., Schoeberl, M. (2006), An Embedded Support Vector Machine, WISE 2006. \n[7] Dey, S., Kedia, M. Agarwal, N., Basu, A., Embedded Support Vector Machine: Architectural \nEnhancements and Evaluation, in Proc 20th Int. Conf. VLSI Design. \n[8] Anguita, D., Boni, A., Ridella, S., (2003) A Digital Architecture for Support Vector Machines: Theory, \nAlgorithm, and FPGA Implementation, IEEE Trans. Neural Networks, 14/5, pp.993-1009. \n[9] Chu, C., Kim, S., Lin, Y., Yu, Y., Bradski, G., Ng, A. & Olukotun, K. (2007) Map-Reduce for Machine \nLearning on Multicore, Advances in Neural Information Processing Systems 19, MIT Press. \n [10] Catanzaro, B., Sundaram, N., & Keutzer, K. (2008) Fast Support Vector Machine Training and \nClassification on Graphics Processors, Proc. 25th Int. Conf. Machine Learning, pp 104-111. \n[11] Durdanovic, I., Cosatto, E. & Graf, H. (2007) Large Scale Parallel SVM Implementation. In L. Bottou, \nO. Chapelle, D. DeCoste, J. Weston (eds.), Large Scale Kernel Machines, pp. 105-138, MIT Press. \n[12] Simard, P & Graf, H. (1994) Backpropagation without Multiplication. In J. Cowan, G. Tesauro, J. \nAlspector, (eds.), Neural Information Processing Systems 6, pp. 232 \u2013 239, Morgan Kaufmann. \n[13] Savich, A., Moussa, M., Areibi, S., (2007) The Impact of Arithmetic Representation on Implementing \nMLP-BP on FPGAs: A Study, IEEE Trans. Neural Networks, 18/1, pp. 240-252. 
\n\n\f", "award": [], "sourceid": 187, "authors": [{"given_name": "Hans", "family_name": "Graf", "institution": null}, {"given_name": "Srihari", "family_name": "Cadambi", "institution": null}, {"given_name": "Venkata", "family_name": "Jakkula", "institution": null}, {"given_name": "Murugan", "family_name": "Sankaradass", "institution": null}, {"given_name": "Eric", "family_name": "Cosatto", "institution": null}, {"given_name": "Srimat", "family_name": "Chakradhar", "institution": null}, {"given_name": "Igor", "family_name": "Dourdanovic", "institution": null}]}