RRAM Compute-In-Memory Hardware For Edge Intelligence

Reconfigurable NeuRRAM-CIM architecture

A NeuRRAM chip consists of 48 CIM cores that can perform computation in parallel. A core can be selectively turned off through power gating when not actively used, whereas the model weights are retained by the non-volatile RRAM devices. Central to each core is a TNSA consisting of 256 × 256 RRAM cells and 256 CMOS neuron circuits that implement analogue-to-digital converters (ADCs) and activation functions. Additional peripheral circuits along the edge provide inference control and manage RRAM programming.

The TNSA architecture is designed to offer flexible control of dataflow directions, which is crucial for enabling diverse model architectures with different dataflow patterns. For instance, in CNNs that are commonly applied to vision-related tasks, data flows in a single direction through layers to generate data representations at different abstraction levels; in LSTMs that are used to process temporal data such as audio signals, data travel recurrently through the same layer for multiple time steps; in probabilistic graphical models such as a restricted Boltzmann machine (RBM), probabilistic sampling is performed back and forth between layers until the network converges to a high-probability state. Besides inference, the error back-propagation during gradient-descent training of multiple AI models requires reversing the direction of dataflow through the network.

However, conventional RRAM-CIM architectures are limited to performing MVMs in a single direction by hardwiring the rows and columns of the RRAM crossbar array to dedicated circuits on the periphery that drive inputs and measure outputs. Some studies implement reconfigurable dataflow directions by adding extra hardware, which incurs substantial energy, latency and area penalties (Extended Data Fig. 2): executing bidirectional (forwards and backwards) dataflow requires either duplicating power-hungry and area-hungry ADCs at both ends of the RRAM array11,34 or dedicating a large area to routing both rows and columns of the array to shared data converters15; the recurrent connections require writing the outputs to a buffer memory outside of the RRAM array, and reading them back for the next time-step computation.

The TNSA architecture realizes dynamic dataflow reconfigurability with little overhead. Whereas in conventional designs, CMOS peripheral circuits such as ADCs connect at only one end of the RRAM array, the TNSA architecture physically interleaves the RRAM weights and the CMOS neuron circuits, and connects them along the length of both rows and columns. As shown in Fig. 2e, a TNSA consists of 16 × 16 such interleaved corelets that are connected by shared bit-lines (BLs) and word-lines (WLs) along the horizontal direction and source-lines (SLs) along the vertical direction. Each corelet encloses 16 × 16 RRAM devices and one neuron circuit. The neuron connects to 1 BL and 1 SL out of the 16 BLs and the 16 SLs that pass through the corelet, and is responsible for integrating inputs from all the 256 RRAMs connecting to the same BL or SL. Sixteen of these RRAMs are within the same corelet as the neuron, and the other 240 are within the other 15 corelets along the same row or column. Specifically, Fig. 2f shows that the neuron within corelet (i, j) connects to the (16i + j)th BL and the (16j + i)th SL. Such a configuration ensures that each BL or SL connects uniquely to a neuron, without duplicating neurons at both ends of the array, thus saving area and energy.
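
To make this wiring rule concrete, the short sketch below (our illustration, not code from the paper) enumerates the connection rule from Fig. 2f for a 16 × 16 corelet array and checks that every BL and every SL is claimed by exactly one neuron.

```python
# Illustrative model of the TNSA corelet wiring rule described above:
# the neuron in corelet (i, j) connects to BL index 16*i + j and
# SL index 16*j + i.

def neuron_connections(n=16):
    """Return {corelet: (bl_index, sl_index)} for an n x n corelet array."""
    return {(i, j): (n * i + j, n * j + i) for i in range(n) for j in range(n)}

conns = neuron_connections()
bls = sorted(bl for bl, _ in conns.values())
sls = sorted(sl for _, sl in conns.values())

# Each of the 256 BLs and 256 SLs is connected to exactly one neuron,
# so no ADC/neuron needs to be duplicated at both ends of the array.
assert bls == list(range(256)) and sls == list(range(256))
print("every BL and SL maps to a unique neuron")
```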

Circuit Design and Silicon Prototypes for Compute-in-Memory for Deep Learning Inference Engine


Moreover, a neuron uses its BL and SL switches for both its input and output: it not only receives the analogue MVM output coming from BL or SL through the switches but also sends the converted digital results to peripheral registers through the same switches. By configuring which switch to use during the input and output stages of the neuron, we can realize various MVM dataflow directions. Figure 2g shows the forwards, backwards and recurrent MVMs enabled by the TNSA. To implement forwards MVM (BL to SL), during the input stage, input pulses are applied to the BLs through the BL drivers, get weighted by the RRAMs and enter the neuron through its SL switch; during the output stage, the neuron sends the converted digital outputs to SL registers through its SL switch; to implement recurrent MVM (BL to BL), the neuron instead receives input through its SL switch and sends the digital output back to the BL registers through its BL switch.

Weights of most AI models take both positive and negative values. We encode each weight as the difference in conductance between two RRAM cells on adjacent rows along the same column (Fig. 2h). The forwards MVM is performed using a differential input scheme, where BL drivers send input voltage pulses with opposite polarities to adjacent BLs. The backwards MVM is performed using a differential output scheme, where we digitally subtract the outputs from neurons connected to adjacent BLs after the neurons finish their analogue-to-digital conversions.

To maximize throughput of AI inference on 48 CIM cores, we implement a broad selection of weight-mapping strategies that allow us to exploit both model parallelism and data parallelism (Fig. 2a) through multi-core parallel MVMs. Using a CNN as an example, to maximize data parallelism, we duplicate the weights of the most computationally intensive layers (early convolutional layers) to multiple cores for parallel inference on multiple data; to maximize model parallelism, we map different convolutional layers to different cores and perform parallel inference in a pipelined fashion. Meanwhile, we divide the layers whose weight dimensions exceed the RRAM array size into multiple segments and assign them to multiple cores for parallel execution. A more detailed description of the weight-mapping strategies is provided in Methods. The intermediate data buffers and partial-sum accumulators are implemented by a field-programmable gate array (FPGA) integrated on the same board as the NeuRRAM chip. Although these digital peripheral modules are not the focus of this study, they will eventually need to be integrated within the same chip in production-ready RRAM-CIM hardware.

Efficient voltage-mode neuron circuit

Figure 1d and Extended Data Table 1 show that the NeuRRAM chip achieves 1.6-times to 2.3-times lower EDP and 7-times to 13-times higher computational density (measured by throughput per million RRAMs) at various MVM input and output bit-precisions than previous state-of-the-art RRAM-based CIM chips, despite being fabricated at an older technology node17,18,19,20,21,22,23,24,25,26,27,36. The reported energy and delay are measured for performing an MVM with a 256 × 256 weight matrix. It is noted that these numbers and those reported in previous RRAM-CIM work represent the peak energy efficiency achieved when the array utilization is 100%, and do not account for the energy spent on intermediate data transfer. Network-on-chip and program scheduling need to be carefully designed to achieve good end-to-end application-level energy efficiency.

Key to NeuRRAM’s EDP improvement is a novel in-memory MVM output-sensing scheme. The conventional approach is to use voltage as input and measure the output current as the result, based on Ohm’s law (Fig. 3a). Such a current-mode-sensing scheme cannot fully exploit the high-parallelism nature of CIM. First, simultaneously turning on multiple rows leads to a large array current. Sinking the large current requires the peripheral circuits to use large transistors, whose area needs to be amortized by time-multiplexing between multiple columns, which limits ‘column parallelism’. Second, MVM results produced by different neural-network layers have drastically different dynamic ranges. Optimizing ADCs across such a wide dynamic range is difficult. To equalize the dynamic range, designs typically activate a fraction of the input wires every cycle to compute a partial sum, and thus require multiple cycles to complete an MVM, which limits ‘row parallelism’.

NeuRRAM improves computation parallelism and energy efficiency by virtue of a neuron circuit implementing a voltage-mode sensing scheme. The neuron performs analogue-to-digital conversion of the MVM outputs by directly sensing the settled open-circuit voltage on the BL or SL line capacitance39 (Fig. 3b): voltage inputs are driven on the BLs whereas the SLs are kept floating, or vice versa, depending on the MVM direction. WLs are activated to start the MVM operation. The voltage on the output line settles to the weighted average of the voltages driven on the input lines, where the weights are the RRAM conductances. Upon deactivating the WLs, the output is sampled by transferring the charge on the output line to the neuron sampling capacitor (Csample in Fig. 3d). The neuron then accumulates this charge onto an integration capacitor (Cinteg) for subsequent analogue-to-digital conversion.

Such voltage-mode sensing obviates the need for power-hungry and area-hungry peripheral circuits to sink large current while clamping voltage, improving energy and area efficiency and eliminating output time-multiplexing. Meanwhile, the weight normalization owing to the conductance weighting in the voltage output (Fig. 3c) results in an automatic output dynamic range normalization for different weight matrices. Therefore, MVMs with different weight dimensions can all be completed within a single cycle, which significantly improves computational throughput. To eliminate the normalization factor from the final results, we pre-compute its value and multiply it back to the digital outputs from the ADC.
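
As a rough numerical illustration of voltage-mode sensing and the normalization factor just described (a simplified model that ignores parasitics, ADC quantization and device non-idealities; the values are placeholders), the settled output voltage on each line is the conductance-weighted average of the driven input voltages, and the known per-column normalization factor can be multiplied back digitally:

```python
import numpy as np

# Simplified numerical model of voltage-mode sensing (illustrative only):
# each output line settles to the conductance-weighted average of the
# driven input voltages, V_j = sum_i(V_i * G_ij) / sum_i(G_ij).

rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 40e-6, size=(256, 256))    # RRAM conductances (S)
v_in = rng.choice([-0.1, 0.0, 0.1], size=256)    # input voltages relative to Vref

v_out = (v_in @ G) / G.sum(axis=0)               # settled open-circuit voltages

# The division by sum_i(G_ij) automatically normalizes the output dynamic
# range for any weight matrix; because the factor is known per column, it
# can be pre-computed and multiplied back into the digitized outputs to
# recover the raw MVM result.
mvm = v_out * G.sum(axis=0)
assert np.allclose(mvm, v_in @ G)
```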

Neuromorphic NeuRRAM Chip AI Developed, Performs Computations in Memory Without Network Connectivity

Our voltage-mode neuron supports MVM with 1-bit to 8-bit inputs and 1-bit to 10-bit outputs. The multi-bit input is realized in a bit-serial fashion where charge is sampled and integrated onto Cinteg for 2^(n−1) cycles for the nth least significant bit (LSB) (Fig. 3e). For MVM inputs greater than 4 bits, we break the bit sequence into two segments, compute the MVM for each segment separately and digitally perform a shift-and-add to obtain the final results (Fig. 3f). Such a two-phase input scheme improves energy efficiency and overcomes voltage headroom clipping at high input precisions.

The multi-bit output is generated through a binary search process (Fig. 3g). Every cycle, neurons add or subtract CsampleVdecr amount of charge from Cinteg, where Vdecr is a bias voltage shared by all neurons. Neurons then compare the total charge on Cinteg with a fixed threshold voltage Vref to generate a 1-bit output. From the most significant bit (MSB) to the least significant bit (LSB), Vdecr is halved every cycle. Compared with other ADC architectures that implement a binary search, our ADC scheme eliminates the residue amplifier of an algorithmic ADC, and does not require an individual DAC for each ADC to generate reference voltages like a successive approximation register (SAR) ADC40. Instead, our ADC scheme allows sharing a single digital-to-analogue converter (DAC) across all neurons to amortize the DAC area, leading to a more compact design. The multi-bit MVM is validated by comparing ideal and measured results, as shown in Fig. 3h and Extended Data Fig. 5. More details on the multi-bit input and output implementation can be found in Methods.

The neuron can also be reconfigured to directly implement Rectified Linear Unit (ReLU), sigmoid or tanh activations when needed. In addition, it supports probabilistic sampling for stochastic activation functions by injecting pseudo-random noise generated by a linear-feedback shift register (LFSR) block into the neuron integrator. All the neuron circuit operations are performed by dynamically configuring a single amplifier in the neuron as either an integrator or a comparator during different phases of operation, as detailed in Methods. This results in a more compact design than other work that merges the ADC and neuron activation functions within the same module12,13. Although most existing CIM designs time-multiplex ADCs across multiple rows and columns to amortize the ADC area, the compactness of our neuron circuit allows us to dedicate a neuron to each pair of BL and SL, and to tightly interleave the neurons with the RRAM devices within the TNSA architecture.

Hardware-algorithm co-optimizations

The innovations on the chip architecture and circuit design bring superior efficiency and reconfigurability to NeuRRAM. To complete the story, we must ensure that AI inference accuracy can be preserved under various circuit and device non-idealities3,41. We developed a set of hardware-algorithm co-optimization techniques that allow NeuRRAM to deliver software-comparable accuracy across diverse AI applications. Importantly, all the AI benchmark results presented in this paper are obtained entirely from hardware measurements on complete datasets. Most previous efforts (with a few exceptions8,17) have reported benchmark results using a mixture of hardware characterization and software simulation, for example, emulating the array-level MVM process in software using measured device characteristics3,5,21,24; such an approach often fails to model the complete set of non-idealities existing in realistic hardware. As shown in Fig. 4a, these non-idealities may include (1) voltage drop on input wires (Rwire), (2) voltage drop on RRAM array drivers (Rdriver), (3) voltage drop on crossbar wires (for example, BL resistance RBL), (4) limited RRAM programming resolution, (5) RRAM conductance relaxation41, (6) capacitive coupling from simultaneously switching array wires, and (7) limited ADC resolution and dynamic range. Our experiments show that omitting certain non-idealities in simulation leads to over-optimistic predictions of inference accuracy. For example, the third and the fourth bars in Fig. 5a show a 2.32% accuracy difference between simulation and measurement for CIFAR-10 classification19, even though the simulation accounts for non-idealities (5) and (7), which are the ones previous studies most often modelled5,21.

Our hardware-algorithm co-optimization approach includes three main techniques: (1) model-driven chip calibration, (2) noise-resilient neural-network training and analogue weight programming, and (3) chip-in-the-loop progressive model fine-tuning. Model-driven chip calibration uses the real model weights and input data to optimize chip operating conditions such as input voltage pulse amplitude, and records any ADC offsets for subsequent cancellation during inference. Ideally, the MVM output voltage dynamic range should fully utilize the ADC input swing to minimize discretization error. However, without calibration, the MVM output dynamic range varies with network layers even with the weight normalization effect of the voltage-mode sensing. To calibrate MVM to the optimal dynamic range, for each network layer, we use a subset of training-set data as calibration input to search for the best operating conditions (Fig. 4b). Extended Data Fig. 6 shows that different calibration input distributions lead to different output distributions. To ensure that the calibration data can closely emulate the distribution seen at test time, it is therefore crucial to use training-set data as opposed to randomly generated data during calibration. It is noted that when performing MVM on multiple cores in parallel, those shared bias voltages cannot be optimized for each core separately, which might lead to sub-optimal operating conditions and additional accuracy loss (detailed in Methods).

Weier Wan's PhD Defense @ Stanford -- RRAM Compute-In-Memory Hardware For Edge Intelligence

Stochastic non-idealities such as RRAM conductance relaxation and read noises degrade the signal-to-noise ratio (SNR) of the computation, leading to an inference accuracy drop. Some previous work obtained a higher SNR by limiting each RRAM cell to store a single bit, and encoding higher-precision weights using multiple cells9,10,16. Such an approach lowers the weight memory density. Accompanying that approach, the neural network is trained with weights quantized to the corresponding precision. In contrast, we utilize the intrinsic analogue programmability of RRAM42 to directly store high-precision weights and train the neural networks to tolerate the lower SNR. Instead of training with quantized weights, which is equivalent to injecting uniform noise into weights, we train the model with high-precision weights while injecting noise with the distribution measured from RRAM devices. RRAMs on NeuRRAM are characterized to have a Gaussian-distributed conductance spread, caused primarily by conductance relaxation. Therefore, we inject a Gaussian noise into weights during training, similar to a previous study21. Figure 5a shows that the technique significantly improves the model’s immunity to noise, from a CIFAR-10 classification accuracy of 25.34% without noise injection to 85.99% with noise injection. After the training, we program the non-quantized weights to RRAM analogue conductances using an iterative write–verify technique, described in Methods. This technique enables NeuRRAM to achieve an inference accuracy equivalent to models trained with 4-bit weights across various applications, while encoding each weight using only two RRAM cells, which is two-times denser than previous studies that require one RRAM cell per bit.
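
The essence of the noise-injection step can be illustrated with a minimal sketch (our illustration in NumPy, not the authors' training code; the noise fraction is a tunable hyperparameter, with roughly 10% matching the measured relaxation spread and higher values used during training):

```python
import numpy as np

# Minimal sketch of training-time weight-noise injection: during each
# forward pass, add Gaussian noise whose standard deviation is a fraction
# of the layer's maximum absolute weight, emulating RRAM conductance
# relaxation and read noise.

def noisy_forward(x, W, noise_frac=0.2, rng=np.random.default_rng()):
    """Forward pass of a fully connected layer with injected weight noise."""
    sigma = noise_frac * np.abs(W).max()
    W_noisy = W + rng.normal(0.0, sigma, size=W.shape)
    return x @ W_noisy

# At deployment the clean (non-quantized) weights are programmed to the
# RRAM conductances, and the physical devices supply the "noise" that the
# model has already learned to tolerate.
```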

By applying the above two techniques, we can already measure inference accuracy comparable to or better than software models with 4-bit weights on Google speech command recognition, MNIST image recovery and MNIST classification (Fig. 1e). For deeper neural networks, we found that the errors caused by non-idealities that have nonlinear effects on MVM outputs, such as voltage drops, can accumulate through layers and become more difficult to mitigate. In addition, multi-core parallel MVM leads to large instantaneous current, further exacerbating non-idealities such as the voltage drop on input wires ((1) in Fig. 4a). As a result, when performing multi-core parallel inference on a deep CNN, ResNet-2043, the measured accuracy on CIFAR-10 classification (83.67%) is still 3.36% lower than that of a 4-bit-weight software model (87.03%).

To bridge this accuracy gap, we introduce a chip-in-the-loop progressive fine-tuning technique. Chip-in-the-loop training mitigates the impact of non-idealities by measuring training error directly on the chip44. Previous work has shown that fine-tuning the final layers using the back-propagated gradients calculated from hardware-measured outputs helped improve accuracy5. We find this technique to be of limited effectiveness in countering those nonlinear non-idealities. Such a technique also requires re-programming RRAM devices, which consumes additional energy. Our chip-in-the-loop progressive fine-tuning overcomes nonlinear model errors by exploiting the intrinsic nonlinear universal approximation capacity of the deep neural network45, and furthermore eliminates the need for weight re-programming. Figure 4d illustrates the fine-tuning procedure. We progressively program the weights one layer at a time onto the chip. After programming a layer, we perform inference using the training-set data on the chip up to that layer, and use the measured outputs to fine-tune the remaining layers that are still training in software. In the next time step, we program and measure the next layer on the chip. We repeat this process until all the layers are programmed. During the process, the non-idealities of the programmed layers can be progressively compensated by the remaining layers through training. Figure 5b shows the efficacy of this progressive fine-tuning technique. From left to right, each data point represents a new layer programmed onto the chip. The accuracy at each layer is evaluated by using the chip-measured outputs from that layer as inputs to the remaining layers in software. The cumulative CIFAR-10 test-set inference accuracy is improved by 1.99% using this technique. Extended Data Fig. 8a further illustrates the extent to which fine-tuning recovers the training-set accuracy loss at each layer, demonstrating the effectiveness of the approach in bridging the accuracy gap between software and hardware measurements.
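
The procedure can be summarized in a short sketch (our paraphrase of Fig. 4d; program_layer, chip_forward and finetune_rest are hypothetical placeholders for programming a layer onto the RRAM arrays, running on-chip inference up to a given layer and training the remaining layers in software):

```python
# Sketch of chip-in-the-loop progressive fine-tuning (our paraphrase of
# the procedure in Fig. 4d; the three callables are hypothetical stand-ins
# for chip programming, on-chip inference and software training).

def progressive_finetune(layers, train_inputs, train_labels,
                         program_layer, chip_forward, finetune_rest):
    for k in range(len(layers)):
        program_layer(layers[k])                       # write layer k onto RRAM
        # Measure training-set activations on the chip up to layer k, so the
        # programmed layers' non-idealities are reflected in the data.
        measured = [chip_forward(x, upto=k) for x in train_inputs]
        # Fine-tune only the layers still held in software; they learn to
        # compensate the errors of all layers already on the chip.
        finetune_rest(layers[k + 1:], measured, train_labels)
```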

ML on the Edge for Industry 4.0 with Arm Ethos N-78 Neural Processing Unit | Embedded World 2021


Using the techniques described above, we achieve inference accuracy comparable to software models trained with 4-bit weights across all the measured AI benchmark tasks. Figure 1e shows that we achieve a 0.98% error rate on MNIST handwritten digit recognition using a 7-layer CNN, a 14.34% error rate on CIFAR-10 object classification using ResNet-20, a 15.34% error rate on Google speech command recognition using a 4-cell LSTM, and a 70% reduction in L2 image-reconstruction error compared with the original noisy images on MNIST image recovery using an RBM. Some of these numbers do not yet reach the accuracies achieved by full-precision digital implementations. The accuracy gap mainly comes from low-precision (≤4-bit) quantization of inputs and activations, especially in the most sensitive input and output layers46. For instance, Extended Data Fig. 8b presents an ablation study showing that quantizing input images to 4-bit alone results in a 2.7% accuracy drop for CIFAR-10 classification. By contrast, the input layer accounts for only 1.08% of the compute and 0.16% of the weights of a ResNet-20 model, so these sensitive layers can be off-loaded to higher-precision digital compute units with little overhead. In addition, applying more advanced quantization techniques and optimizing training procedures such as data augmentation and regularization should further improve the accuracy of both quantized software models and hardware-measured results.

Table 1 summarizes the key features of each demonstrated model. Most of the essential neural-network layers and operations are implemented on the chip, including all the convolutional, fully connected and recurrent layers, neuron activation functions, batch normalization and the stochastic sampling process. Other operations such as average pooling and element-wise multiplications are implemented on an FPGA integrated on the same board as NeuRRAM (Extended Data Fig. 11a). Each of the models is implemented by allocating the weights to multiple cores on a single NeuRRAM chip. We developed a software toolchain to allow easy deployment of AI models on the chip47. The implementation details are described in Methods. Fundamentally, each of the selected benchmarks represents a general class of common edge AI tasks: visual recognition, speech processing and image de-noising. These results demonstrate the versatility of the TNSA architecture and the wide applicability of the hardware-algorithm co-optimization techniques.

The NeuRRAM chip simultaneously improves efficiency, flexibility and accuracy over existing RRAM-CIM hardware by innovating across the entire hierarchy of the design, from a TNSA architecture enabling reconfigurable dataflow direction, to an energy- and area-efficient voltage-mode neuron circuit, and to a series of algorithm-hardware co-optimization techniques. These techniques can be more generally applied to other non-volatile resistive memory technologies such as phase-change memory8,17,21,23,24, magnetoresistive RAM48 and ferroelectric field-effect transistors49. Going forwards, we expect NeuRRAM’s peak energy efficiency (EDP) to improve by another two to three orders of magnitude while supporting bigger AI models when scaling from 130-nm to 7-nm CMOS and RRAM technologies (detailed in Methods). Multi-core architecture design with network-on-chip that realizes efficient and versatile data transfers and inter-array pipelining is likely to be the next major challenge for RRAM-CIM37,38, which needs to be addressed by further cross-layer co-optimization. As resistive memory continues to scale towards offering tera-bits of on-chip memory50, such a co-optimization approach will equip CIM hardware on the edge with sufficient performance, efficiency and versatility to perform complex AI tasks that can only be done on the cloud today.

Methods
Core block diagram and operating modes

Extended Data Fig. 1 shows the block diagram of a single CIM core. To support versatile MVM directions, most of the design is symmetrical in the row (BLs and WLs) and column (SLs) directions. The row and column register files store the inputs and outputs of MVMs, and can be written externally, by either a Serial Peripheral Interface (SPI) or a random-access interface that uses an 8-bit address decoder to select one register entry, or internally by the neurons. The SL peripheral circuits contain an LFSR block used to generate the pseudo-random sequences used for probabilistic sampling. It is implemented by two LFSR chains propagating in opposite directions. The registers of the two chains are XORed to generate spatially uncorrelated random numbers51. The controller block receives commands and generates control waveforms for the BL/WL/SL peripheral logic and for the neurons. It contains a delay-line-based pulse generator with tunable pulse width from 1 ns to 10 ns. It also implements the clock-gating and power-gating logic used to turn off the core in idle mode. Each WL, BL and SL of the TNSA is driven by a driver consisting of multiple pass gates that supply different voltages. On the basis of the values stored in the register files and the control signals issued by the controller, the WL/BL/SL logic decides the state of each pass gate.

The core has three main operating modes: a weight-programming mode, a neuron-testing mode and an MVM mode (Extended Data Fig. 1). In the weight-programming mode, individual RRAM cells are selected for read and write. To select a single cell, the registers at the corresponding row and column are programmed to ‘1’ through random access with the help of the row and column decoder, whereas the other registers are reset to ‘0’. The WL/BL/SL logic turns on the corresponding driver pass gates to apply a set/reset/read voltage on the selected cell. In the neuron-testing mode, the WLs are kept at ground voltage (GND). Neurons receive inputs directly from BL or SL drivers through their BL or SL switch, bypassing RRAM devices. This allows us to characterize the neurons independently from the RRAM array. In the MVM mode, each input BL and SL is driven to Vref − Vread, Vref + Vread or Vref depending on the registers’ value at that row or column. If the MVM is in the BL-to-SL direction, we activate the WLs that are within the input vector length while keeping the rest at GND; if the MVM is in the SL-to-BL direction, we activate all the WLs. After neurons finish analogue-to-digital conversion, the pass gates from BLs and SLs to the registers are turned on to allow neuron-state readout.

Device fabrication

RRAM arrays in NeuRRAM are in a one-transistor–one-resistor (1T1R) configuration, where each RRAM device is stacked on top of, and connected in series with, a selector n-type metal-oxide-semiconductor (NMOS) transistor that cuts off the sneak path and provides current compliance during RRAM programming and reading. The selector NMOS, the CMOS peripheral circuits and the bottom four back-end-of-line interconnect metal layers are fabricated in a standard 130-nm foundry process. Owing to the higher voltage required for RRAM forming and programming, the selector NMOS and the peripheral circuits that directly interface with the RRAM arrays use thick-oxide input/output (I/O) transistors rated for 5-V operation. All the other CMOS circuits in the neurons, digital logic, registers and so on use core transistors rated for 1.8-V operation.

The RRAM device is sandwiched between the metal-4 and metal-5 layers, as shown in Fig. 2c. After the foundry completes the fabrication of the CMOS and the bottom four metal layers, we use a laboratory process to finish the fabrication of the RRAM devices, the metal-5 interconnect, and the top metal pad and passivation layers. The RRAM device stack consists of a titanium nitride (TiN) bottom-electrode layer, a hafnium oxide (HfOx) switching layer, a tantalum oxide (TaOx) thermal-enhancement layer52 and a TiN top-electrode layer. These layers are deposited sequentially, followed by a lithography step to pattern the lateral structure of the device array.

RRAM write–verify programming and conductance relaxation

Each neural-network weight is encoded by the differential conductance between two RRAM cells on adjacent rows along the same column. The first RRAM cell encodes positive weight, and is programmed to a low conductance state (gmin) if the weight is negative; the second cell encodes negative weight, and is programmed to gmin if the weight is positive. Mathematically, the conductances of the two cells are max(gmax·W/wmax, gmin) and max(−gmax·W/wmax, gmin), respectively, where gmax and gmin are the maximum and minimum conductance of the RRAMs, wmax is the maximum absolute value of weights, and W is the unquantized high-precision weight.
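
The mapping can be written out directly (an illustrative sketch using the gmax and gmin values quoted later in this section; conductances are in siemens):

```python
import numpy as np

# Differential-pair weight encoding described above: each weight W maps to
# two conductances g_plus and g_minus on adjacent rows of the same column,
# floored at g_min, so that the effective weight is g_plus - g_minus.

def encode_weight(W, w_max, g_max=40e-6, g_min=1e-6):
    g_plus = np.maximum(g_max * W / w_max, g_min)
    g_minus = np.maximum(-g_max * W / w_max, g_min)
    return g_plus, g_minus

W = np.array([0.6, -0.25, 0.0])
gp, gm = encode_weight(W, w_max=1.0)
print(gp - gm)   # approx. [2.3e-05, -9e-06, 0.0] S: tracks W up to the 1 uS floor
```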

To program an RRAM cell to its target conductance, we use an incremental-pulse write–verify technique42. Extended Data Fig. 3a,b illustrates the procedure. We start by measuring the initial conductance of the cell. If the value is below the target conductance, we apply a weak set pulse aiming to slightly increase the cell conductance. Then we read the cell again. If the value is still below the target, we apply another set pulse with the amplitude incremented by a small amount. We repeat such set–read cycles until the cell conductance is within an acceptance range of the target value or overshoots to the other side of the target. In the latter case, we reverse the pulse polarity to reset, and repeat the same procedure as with set. During the set/reset pulse train, the cell conductance is likely to bounce up and down multiple times until it eventually enters the acceptance range or reaches a time-out limit.
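
The loop can be sketched as follows (our pseudocode-style rendering of the procedure; read_conductance, apply_set and apply_reset are hypothetical stand-ins for the chip's read and programming operations, and restarting the pulse amplitude after a polarity reversal is our assumption):

```python
# Sketch of the incremental-pulse write-verify loop described above.
# The three callables are hypothetical stand-ins for the actual chip
# operations; conductances are in microsiemens.

def write_verify(target, read_conductance, apply_set, apply_reset,
                 tol=1.0, v_set0=1.2, v_reset0=1.5, v_step=0.1,
                 max_reversals=30):
    v_set, v_reset = v_set0, v_reset0
    reversals = 0
    below = read_conductance() < target          # current pulse polarity
    while reversals <= max_reversals:
        g = read_conductance()
        if abs(g - target) <= tol:
            return True                          # within the acceptance range
        if (g < target) != below:                # overshot: reverse polarity
            below = not below
            reversals += 1
            v_set, v_reset = v_set0, v_reset0    # restart amplitudes (assumption)
        if g < target:
            apply_set(v_set)                     # nudge conductance up
            v_set += v_step
        else:
            apply_reset(v_reset)                 # nudge conductance down
            v_reset += v_step
    return False                                 # timed out
```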

There are a few trade-offs in selecting programming conditions. (1) A smaller acceptance range and a higher time-out limit improve programming precision, but require a longer time. (2) A higher gmax improves the SNR during inference, but leads to higher energy consumption and more programming failures for cells that cannot reach high conductance. In our experiments, we set the initial set pulse voltage to 1.2 V and the reset pulse voltage to 1.5 V, both with an increment of 0.1 V and a pulse width of 1 μs. An RRAM read takes 1–10 μs, depending on the conductance. The acceptance range is ±1 μS around the target conductance. The time-out limit is 30 set–reset polarity reversals. We used gmin = 1 μS for all the models, and gmax = 40 μS for CNNs and gmax = 30 μS for LSTMs and RBMs. With such settings, 99% of the RRAM cells can be programmed to within the acceptance range before the time-out limit. On average, each cell requires 8.52 set/reset pulses. In the current implementation, the speed of the write–verify process is limited by the external control of the DAC and ADC. If everything were integrated into a single chip, the write–verify would take on average 56 µs per cell. Having multiple copies of the DAC and ADC to perform write–verify on multiple cells in parallel would further improve RRAM programming throughput, at the cost of more chip area.

Besides the longer programming time, another reason to not use an overly small write–verify acceptance range is RRAM conductance relaxation. RRAM conductance changes over time after programming. Most of the change happens within a short time window (less than 1 s) immediately following the programming, after which the change becomes much slower, as shown in Extended Data Fig. 3d. The abrupt initial change is called ‘conductance relaxation’ in the literature41. Its statistics follow a Gaussian distribution at all conductance states except when the conductance is close to gmin. Extended Data Fig. 3c,d shows the conductance relaxation measured across the whole gmin-to-gmax conductance range. We found that the loss of programming precision owing to conductance relaxation is much higher than that caused by the write–verify acceptance range. The average standard deviation across all levels of initial conductance is about 2.8 μS. The maximum standard deviation is about 4 μS, which is close to 10% of gmax.

To mitigate the relaxation, we use an iterative programming technique. We iterate over the RRAM array multiple times. In each iteration, we measure all the cells and re-program those whose conductance has drifted outside the acceptance range. Extended Data Fig. 3e shows that the standard deviation becomes smaller with more programming iterations. After 3 iterations, the standard deviation drops to about 2 μS, a 29% decrease compared with the initial value. We use 3 iterations in all our neural-network demonstrations and perform inference at least 30 min after the programming, such that the measured inference accuracy accounts for these conductance relaxation effects. By combining the iterative programming with our hardware-aware model training approach, the impact of relaxation can be largely mitigated.

Processing-In-Memory for Efficient AI Inference at the Edge

Implementation of MVM with multi-bit inputs and outputs

The neuron and the peripheral circuits support MVM at configurable input and output bit-precisions. An MVM operation consists of an initialization phase, an input phase and an output phase. Extended Data Fig. 4 illustrates the neuron circuit operation. During the initialization phase (Extended Data Fig. 4a), all BLs and SLs are precharged to Vref. The sampling capacitors Csample of the neurons are also precharged to Vref, whereas the integration capacitors Cinteg are discharged.

During the input phase, each input wire (either BL or SL depending on the MVM direction) is driven to one of three voltage levels, Vref − Vread, Vref and Vref + Vread, through three pass gates, as shown in Fig. 3b. During forwards MVM, under differential-row weight mapping, each input is applied to a pair of adjacent BLs. The two BLs are driven to opposite voltages with respect to Vref. That is, when the input is 0, both wires are driven to Vref; when the input is +1, the two wires are driven to Vref + Vread and Vref − Vread; and when the input is −1, to Vref − Vread and Vref + Vread. During backwards MVM, each input is applied to a single SL. The difference operation is performed digitally after the neurons finish their analogue-to-digital conversions.
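
A tiny sketch of this ternary encoding onto a differential BL pair (illustrative only; Vref = 0.5 V and Vread = 0.1 V are placeholder values, not chip settings):

```python
# Illustrative encoding of a ternary input onto a differential BL pair
# during forwards MVM: the two adjacent BLs are driven to opposite
# offsets around Vref.

def drive_bl_pair(x, v_ref=0.5, v_read=0.1):
    """Map input x in {-1, 0, +1} to the (BL+, BL-) drive voltages in volts."""
    return (v_ref + x * v_read, v_ref - x * v_read)

print(drive_bl_pair(+1))  # (0.6, 0.4)
print(drive_bl_pair(0))   # (0.5, 0.5)
print(drive_bl_pair(-1))  # (0.4, 0.6)
```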

After biasing the input wires, we pulse those WLs that have inputs for 10 ns, while keeping the output wires floating. Once the voltages of the output wires settle to Vj = Σi Vi·Gij / Σi Gij, where Gij is the conductance of the RRAM at the ith row and the jth column, we turn off the WLs to stop all current flow. We then sample the charge remaining on the output-wire parasitic capacitance onto Csample located within the neurons, followed by integrating the charge onto Cinteg, as shown in Extended Data Fig. 4b. The sampling pulse is 10 ns (limited by the 100-MHz external clock from the FPGA); the integration pulse is 240 ns, limited by the large integration capacitor (104 fF), which was chosen conservatively to ensure functional correctness and to allow testing different neuron operating conditions.

The multi-bit input digital-to-analogue conversion is performed in a bit-serial fashion. For the nth LSB, we apply a single pulse to the input wires, followed by sampling and integrating charge from the output wires onto Cinteg for 2^(n−1) cycles. At the end of the multi-bit input phase, the complete analogue MVM output is stored as charge on Cinteg. For example, as shown in Fig. 3e, when the input vectors are 4-bit signed integers with 1 sign-bit and 3 magnitude-bits, we first send pulses corresponding to the first (least significant) magnitude-bit to the input wires, followed by sampling and integrating for one cycle. For the second and the third magnitude-bits, we again apply one pulse to the input wires for each bit, followed by sampling and integrating for two cycles and four cycles, respectively. In general, for n-bit signed integer inputs, we need a total of n − 1 input pulses and 2^(n−1) − 1 sampling and integration cycles.
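
The pulse and cycle counts follow directly from this scheme, as in the small sketch below (our illustration):

```python
# Illustrative count of input pulses and sample-and-integrate cycles for
# the bit-serial input scheme: the nth least significant magnitude-bit is
# integrated for 2**(n-1) cycles.

def bit_serial_cost(n_bits_signed):
    magnitude_bits = n_bits_signed - 1        # one bit is the sign
    pulses = magnitude_bits                   # one input pulse per magnitude-bit
    cycles = sum(2 ** (n - 1) for n in range(1, magnitude_bits + 1))
    return pulses, cycles                     # cycles == 2**(n_bits_signed - 1) - 1

print(bit_serial_cost(4))  # (3, 7): a 4-bit signed input needs 3 pulses, 7 cycles
```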

Such a multi-bit input scheme becomes inefficient for high-input bit-precision owing to the exponentially increasing sampling and integration cycles. Moreover, headroom clipping becomes an issue as charge integrated at Cinteg saturates with more integration cycles. The headroom clipping can be overcome by using lower Vread, but at the cost of a lower SNR, so the overall MVM accuracy might not improve when using higher-precision inputs. For instance, Extended Data Fig. 5a,c shows the measured root-mean-square error (r.m.s.e.) of the MVM results. Quantizing inputs to 6-bit (r.m.s.e. = 0.581) does not improve the MVM accuracy compared with 4-bit (r.m.s.e. = 0.582), owing to the lower SNR.

To solve both issues, we use a two-phase input scheme for inputs of more than 4 bits. Figure 3f illustrates the process. To perform MVM with 6-bit inputs and 8-bit outputs, we divide the inputs into two segments, the first containing the three MSBs and the second containing the three LSBs. We then perform the MVM, including the output analogue-to-digital conversion, for each segment separately. For the MSBs, the neurons (ADCs) are configured to output 8 bits; for the LSBs, the neurons output 5 bits. The final results are obtained by shifting and adding the two outputs in the digital domain. Extended Data Fig. 5d shows that the scheme lowers the MVM r.m.s.e. from 0.581 to 0.519. Extended Data Fig. 12c–e further shows that such a two-phase scheme both extends the input bit-precision range and improves the energy efficiency.
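
The digital shift-and-add recombination is the standard one, shown here on a scalar toy example (our illustration; on the chip the two partial results come from separate MVM passes):

```python
# Two-phase input scheme, toy scalar version: split a 6-bit input into a
# 3-bit MSB segment and a 3-bit LSB segment, "MVM" each segment against a
# weight, then recombine digitally with a shift-and-add.

w = 9                      # stand-in for a column of weights
x = 0b101101               # 45, a 6-bit unsigned input
msb, lsb = x >> 3, x & 0b111

partial_msb = msb * w      # result of the MVM pass with the three MSBs
partial_lsb = lsb * w      # result of the MVM pass with the three LSBs

# The MSB-segment result carries 2**3 times the weight of the LSB one.
assert (partial_msb << 3) + partial_lsb == x * w
```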

Finally, during the output phase, the analogue-to-digital conversion is again performed in a bit-serial fashion through a binary search process. First, to generate the sign-bit of the outputs, we disconnect the feedback loop of the amplifier to turn the integrator into a comparator (Extended Data Fig. 4c). We drive the right side of Cinteg to Vref. If the integrated charge is positive, the comparator output will be GND, and the supply voltage VDD otherwise. The comparator output is then inverted, latched and read out to the BL or SL via the neuron’s BL or SL switch before being written into the peripheral BL or SL registers.

To generate k magnitude-bits, we add or subtract charge from Cinteg (Extended Data Fig. 4d), followed by comparison and readout for k cycles. From MSB to LSB, the amount of charge added or subtracted is halved every cycle. Whether to add or to subtract is automatically determined by the comparison result stored in the latch from the previous cycle. Figure 3g illustrates such a process. A sign-bit of ‘1’ is first generated and latched in the first cycle, representing a positive output. To generate the most significant magnitude-bit, the latch turns on the path from Vdecr− = Vref − Vdecr to Csample. The charge sampled by Csample is then integrated on Cinteg by turning on the negative feedback loop of the amplifier, resulting in CsampleVdecr amount of charge being subtracted from Cinteg. In this example, CsampleVdecr is greater than the original amount of charge on Cinteg, so the total charge becomes negative, and the comparator generates a ‘0’ output. To generate the second magnitude-bit, Vdecr is reduced by half. This time, the latch turns on the path from Vdecr+ = Vref + 1/2Vdecr to Csample. As the total charge on Cinteg after integration is still negative, the comparator outputs a ‘0’ again in this cycle. We repeat this process until the least significant magnitude-bit is generated. It is noted that if the initial sign-bit is ‘0’, all subsequent magnitude-bits are inverted before readout.
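
A behavioural model of this conversion is sketched below (our illustration; the charge is normalized so that the first decrement equals half of full scale, and three magnitude-bits are generated):

```python
# Behavioural sketch of the binary-search output conversion described
# above: a sign comparison, then per cycle a halved charge decrement whose
# polarity is chosen by the previous comparison, then another comparison.

def neuron_adc(q, magnitude_bits=3):
    """Convert an integrated charge q in (-1, 1) to [sign, b1, b2, ...]."""
    sign = 1 if q > 0 else 0                 # first comparison gives the sign-bit
    bits = [sign]
    decr = 0.5                               # Csample*Vdecr, halved every cycle
    for _ in range(magnitude_bits):
        q += -decr if q > 0 else decr        # latch decides subtract or add
        bits.append(1 if q > 0 else 0)
        decr /= 2
    if sign == 0:                            # negative outputs: invert magnitudes
        bits[1:] = [1 - b for b in bits[1:]]
    return bits

sign, *mag = neuron_adc(0.3)
value = sum(b * 0.5 ** (k + 1) for k, b in enumerate(mag))
print(mag, value)   # [0, 1, 0] 0.25 -> the 3-bit truncation of |0.3|
```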

Such an output conversion scheme is similar to an algorithmic ADC or a SAR ADC in the sense that a binary search is performed over n cycles for an n-bit output. The difference is that an algorithmic ADC uses a residue amplifier, and a SAR ADC requires a multi-bit DAC for each ADC, whereas our scheme needs no residue amplifier and uses a single DAC that outputs 2 × (n − 1) different Vdecr+ and Vdecr− levels, shared by all neurons (ADCs). As a result, our scheme enables a more compact design by time-multiplexing an amplifier for integration and comparison, eliminating the residue amplifier, and amortizing the DAC area across all neurons in a CIM core. For CIM designs that use a dense memory array, such a compact design allows each ADC to be time-multiplexed across fewer rows and columns, thus improving throughput.

To summarize, both the configurable MVM input and output bit-precisions and various neuron activation functions are implemented using different combinations of the four basic operations: sampling, integration, comparison and charge decrement. Importantly, all the four operations are realized by a single amplifier configured in different feedback modes. As a result, the design realizes versatility and compactness at the same time.

A Ternary-weight Compute-in-Memory RRAM Macro

Multi-core parallel MVM

NeuRRAM supports performing MVMs in parallel on multiple CIM cores. Multi-core MVM brings additional challenges to computational accuracy, because certain hardware non-idealities that do not manifest in single-core MVM become more severe with more cores. They include voltage drop on input wires, core-to-core variation and supply-voltage instability. Voltage drop on input wires (non-ideality (1) in Fig. 4a) is caused by the large current drawn simultaneously from a shared voltage source by multiple cores. It makes the equivalent weights stored in each core vary with the applied inputs, and therefore has a nonlinear, input-dependent effect on MVM outputs. Moreover, as different cores are at different distances from the shared voltage source, they experience different amounts of voltage drop. Therefore, we cannot optimize the read-voltage amplitude separately for each core to make its MVM output occupy exactly the full neuron input dynamic range.

These non-idealities together degrade the multi-core MVM accuracy. Extended Data Fig. 5e,f shows that when performing convolution in parallel on the 3 cores, outputs of convolutional layer 15 are measured to have a higher r.m.s.e. of 0.383 compared with 0.318 obtained by performing convolution sequentially on the 3 cores. In our ResNet-20 experiment, we performed 2-core parallel MVMs for convolutions within block 1 (Extended Data Fig. 9a), and 3-core parallel MVMs for convolutions within blocks 2 and 3.

The voltage-drop issue can be partially alleviated by making the wires that carry large instantaneous current as low resistance as possible, and by employing a power delivery network with more optimized topology. But the issue will persist and become worse as more cores are used. Therefore, our experiments aim to study the efficacy of algorithm-hardware co-optimization techniques in mitigating the issue. Also, it is noted that for a full-chip implementation, additional modules such as intermediate result buffers, partial-sum accumulators and network-on-chip will need to be integrated to manage inter-core data transfers. Program scheduling should also be carefully optimized to minimize buffer size and energy spent at intermediate data movement. Although there are studies on such full-chip architecture and scheduling37,38,53, they are outside the scope of this study.

Noise-resilient neural-network training

During noise-resilient neural-network training, we inject noise into the weights of all fully connected and convolutional layers during the forwards pass of neural-network training to emulate the effects of RRAM conductance relaxation and read noise. The distribution of the injected noise is obtained by RRAM characterization. We used the iterative write–verify technique to program RRAM cells into different initial conductance states and measured their conductance relaxation after 30 min. Extended Data Fig. 3d shows that the measured conductance relaxation has an absolute mean of <1 μS (gmin) at all conductance states. The highest standard deviation is 3.87 μS, about 10% of gmax (40 μS), found at an initial conductance state of about 12 μS. Therefore, to simulate such conductance relaxation behaviour during inference, we inject Gaussian noise with a zero mean and a standard deviation equal to 10% of the maximum weight of a layer.

We train models with different levels of noise injection from 0% to 40%, and select the model that achieves the highest inference accuracy at 10% noise level for on-chip deployment. We find that injecting a higher noise during training than testing improves models’ noise resiliency. Extended Data Fig. 7a–c shows that the best test-time accuracy in the presence of 10% weight noise is obtained with 20% training-time noise injection for CIFAR-10 image classification, 15% for Google voice command classification and 35% for RBM-based image reconstruction.

For CIFAR-10, the better initial accuracy obtained by the model trained with 5% noise is most likely due to the regularization effect of noise injection. A similar phenomenon has been reported in neural-network quantization literature where a model trained with quantization occasionally outperforms a full-precision model54,55. In our experiments, we did not apply additional regularization on top of noise injection for models trained without noise, which might result in sub-optimal accuracy.

For the RBM, Extended Data Fig. 7d further shows how the reconstruction error decreases with the number of Gibbs sampling steps for models trained with different noise levels. In general, models trained with higher noise levels converge faster during inference. The model trained with 20% noise reaches the lowest error at the end of 100 Gibbs sampling steps.

Extended Data Fig. 7e shows the effect of noise injection on weight distribution. Without noise injection, the weights have a Gaussian distribution. The neural-network outputs heavily depend on a small fraction of large weights, and thus become vulnerable to noise injection. With noise injection, the weights distribute more uniformly, making the model more noise resilient.

To efficiently implement the models on NeuRRAM, inputs to all convolutional and fully connected layers are quantized to 4-bit or below. The input bit-precisions of all the models are summarized in Table 1. We perform the quantized training using the parameterized clipping activation technique46. The accuracies of some of our quantized models are lower than those of state-of-the-art quantized models because we apply <4-bit quantization to the most sensitive input and output layers of the neural networks, which has been reported to cause large accuracy degradation and is thus often excluded from low-precision quantization46,54. To obtain better accuracy for quantized models, one can use higher precision for the sensitive input and output layers, apply more advanced quantization techniques, and use more optimized data preprocessing, data augmentation and regularization techniques during training. However, the focus of this work is to achieve comparable inference accuracy on hardware and in software while keeping all these variables the same, rather than to obtain state-of-the-art inference accuracy on all the tasks. The aforementioned quantization and training techniques would be equally beneficial for both our software baselines and our hardware measurements.

Chip-in-the-loop progressive fine-tuning

During the progressive chip-in-the-loop fine-tuning, we use the chip-measured intermediate outputs from a layer to fine-tune the weights of the remaining layers. Importantly, to fairly evaluate the efficacy of the technique, we do not use the test-set data (for either training or selecting checkpoint) during the entire process of fine-tuning. To avoid over-fitting to a small fraction of data, measurements should be performed on the entire training-set data. We reduce the learning rate to 1/100 of the initial learning rate used for training the baseline model, and fine-tune for 30 epochs, although we observed that the accuracy generally plateaus within the first 10 epochs. The same weight noise injection and input quantization are applied during the fine-tuning.

Marco Rios - Running efficiently CNNs on the Edge thanks to Hybrid SRAM-RRAM in-Memory Computing

Implementations of CNNs, LSTMs and RBMs

We use CNN models for the CIFAR-10 and MNIST image classification tasks. The CIFAR-10 dataset consists of 50,000 training images and 10,000 testing images belonging to 10 object classes. We perform image classification using the ResNet-2043, which contains 21 convolutional layers and 1 fully connected layer (Extended Data Fig. 9a), with batch normalizations and ReLU activations between the layers. The model is trained using the Keras framework. We quantize the input of all convolutional and fully connected layers to a 3-bit unsigned fixed-point format except for the first convolutional layer, where we quantize the input image to 4-bit because the inference accuracy is more sensitive to the input quantization. For the MNIST handwritten digits classification, we use a seven-layer CNN consisting of six convolutional layers and one fully connected layer, and use max-pooling between layers to down-sample feature map sizes. The inputs to all the layers, including the input image, are quantized to a 3-bit unsigned fixed-point format.

All the parameters of the CNNs are implemented on a single NeuRRAM chip including those of the convolutional layers, the fully connected layers and the batch normalization. Other operations such as partial-sum accumulation and average pooling are implemented on an FPGA integrated on the same board as the NeuRRAM. These operations amount to only a small fraction of the total computation and integrating their implementation in digital CMOS would incur negligible overhead; the FPGA implementation was chosen to provide greater flexibility during test and development.

Extended Data Fig. 9 illustrates the process of mapping a convolutional layer onto the chip. To implement the weights of a four-dimensional convolutional layer with dimensions H (height), W (width), I (number of input channels) and O (number of output channels) on two-dimensional RRAM arrays, we flatten the first three dimensions into a one-dimensional vector, and append the bias term of each output channel to each vector. If the range of the bias values is B times the weight range, we evenly divide the bias values and implement them using B rows. Furthermore, we merge the batch-normalization parameters into the convolutional weights and biases after training (Extended Data Fig. 9b), and program the merged Wʹ and bʹ onto the RRAM arrays such that no explicit batch normalization needs to be performed during inference.
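
The batch-normalization folding mentioned above follows the standard identity, sketched below (a generic sketch, not the authors' code; W is assumed flattened to shape (O, HWI)):

```python
import numpy as np

# Generic batch-norm folding into the preceding convolution:
# y = gamma * (W x + b - mu) / sqrt(var + eps) + beta
#   = W' x + b'  with  W' = W * gamma / sqrt(var + eps)
#                      b' = (b - mu) * gamma / sqrt(var + eps) + beta

def fold_batchnorm(W, b, gamma, beta, mu, var, eps=1e-5):
    """W: (O, HWI) flattened conv weights; all other arguments: shape (O,)."""
    scale = gamma / np.sqrt(var + eps)      # one scale factor per output channel
    W_folded = W * scale[:, None]
    b_folded = (b - mu) * scale + beta
    return W_folded, b_folded
```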

Under the differential-row weight-mapping scheme, the parameters of a convolutional layer are converted into a conductance matrix of size (2(HWI + B), O). If the conductance matrix fits into a single core, an input vector is applied to 2(HWI + B) rows and broadcast to O columns in a single cycle. HWIO multiply–accumulate (MAC) operations are performed in parallel. Most ResNet-20 convolutional layers have a conductance matrix height of 2(HWI + B) that is greater than the RRAM array length of 256. We therefore split them vertically into multiple segments, and map the segments either onto different cores that are accessed in parallel, or onto different columns within a core that are accessed sequentially. The details of the weight-mapping strategies are described in the next section.

The Google speech command dataset consists of 65,000 1-s-long audio recordings of voice commands, such as ‘yes’, ‘up’, ‘on’, ‘stop’ and so on, spoken by thousands of different people. The commands are categorized into 12 classes. Extended Data Fig. 9d illustrates the model architecture. We use the Mel-frequency cepstral coefficient encoding approach to encode every 40-ms piece of audio into a length-40 vector. With a hop length of 20 ms, we have a time series of 50 steps for each 1-s recording.

We build a model that contains four parallel LSTM cells. Each cell has a hidden state of length 112. The final classification is based on summation of outputs from the four cells. Compared with a single-cell model, the 4-cell model reduces the classification error (of an unquantized model) from 10.13% to 9.28% by leveraging additional cores on the NeuRRAM chip. Within a cell, in each time step, we compute the values of four LSTM gates (input, activation, forget and output) based on the inputs from the current step and hidden states from the previous step. We then perform element-wise operations between the four gates to compute the new hidden-state value. The final logit outputs are calculated based on the hidden states of the final time step.

Each LSTM cell has 3 weight matrices that are implemented on the chip: an input-to-hidden-state matrix with size 40 × 448, a hidden-state-to-hidden-state matrix with size 112 × 448 and a hidden-state-to-logits matrix with size 112 × 12. The element-wise operations are implemented on the FPGA. The model is trained using the PyTorch framework. The inputs to all the MVMs are quantized to 4-bit signed fixed-point formats. All the remaining operations are quantized to 8-bit.

An RBM is a type of generative probabilistic graphical model. Instead of being trained to perform discriminative tasks such as classification, it learns the statistical structure of the data itself. Extended Data Fig. 9e shows the architecture of our image-recovery RBM. The model consists of 794 fully connected visible neurons (corresponding to 784 image pixels plus 10 one-hot encoded class labels) and 120 hidden neurons. We train the RBM using the contrastive divergence learning procedure in software.

During inference, we send 3-bit images with partially corrupted or blocked pixels to the model running on a NeuRRAM chip. The model then performs back-and-forth MVMs and Gibbs sampling between the visible and hidden neurons for ten cycles. In each cycle, neurons sample binary states h and v from the MVM outputs based on the probability distributions p(hj = 1 | v) = σ(aj + Σi vi wij) and p(vi = 1 | h) = σ(bi + Σj hj wij), where σ is the sigmoid function, aj is the bias of hidden neuron hj and bi is the bias of visible neuron vi. After sampling, we reset the uncorrupted pixels (visible neurons) to the original pixel values. The final inference performance is evaluated by computing the average L2-reconstruction error between the original image and the recovered image. Extended Data Fig. 10 shows some examples of the measured image recovery.
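
One Gibbs sampling cycle can be sketched as follows (our illustration in NumPy of the sampling step above, with a the hidden biases and b the visible biases; clamping of the uncorrupted pixels is applied between cycles as described):

```python
import numpy as np

# Illustrative one-step Gibbs sampling cycle for the image-recovery RBM.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_cycle(v, W, a, b, rng):
    """v: visible states (794,); W: (794, 120); a: hidden biases; b: visible biases."""
    p_h = sigmoid(a + v @ W)                              # p(h_j = 1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(b + W @ h)                              # p(v_i = 1 | h)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h

# On the chip, the visible-to-hidden MVM runs in the SL-to-BL direction and
# the hidden-to-visible MVM in the BL-to-SL direction of the same arrays.
```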

When mapping the 794 × 120 weight matrix to multiple cores of the chip, we try to make the MVM output dynamic range of each core relatively consistent, so that the recovery performance does not rely excessively on the computational accuracy of any single core. To achieve this, we assign adjacent pixels (visible neurons) to different cores such that every core sees a down-sampled version of the whole image, as shown in Extended Data Fig. 9f. Utilizing the bidirectional MVM functionality of the TNSA, the visible-to-hidden neuron MVM is performed in the SL-to-BL direction in each core; the hidden-to-visible neuron MVM is performed in the BL-to-SL direction.

Crossbar Demonstration of Artificial Intelligence Edge Computing Acceleration with ReRAM

Weight-mapping strategy onto multiple CIM cores

To implement an AI model on a NeuRRAM chip, we convert the weights, biases and other relevant parameters (for example, batch normalization) of each model layer into a single two-dimensional conductance matrix as described in the previous section. If the height or the width of a matrix exceeds the RRAM array size of a single CIM core (256 × 256), we split the matrix into multiple smaller conductance matrices, each with a maximum height and width of 256.
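
A small sketch of this tiling step (our illustration; the example shape corresponds to a 3 × 3 × 16 convolution with 32 output channels under differential-row mapping with B = 1):

```python
import numpy as np

# Illustrative tiling of a conductance matrix onto 256 x 256 CIM cores:
# a matrix larger than one core is split into tiles of at most 256 rows
# and 256 columns.

def tile_matrix(M, core=256):
    tiles = {}
    for r in range(0, M.shape[0], core):
        for c in range(0, M.shape[1], core):
            tiles[(r // core, c // core)] = M[r:r + core, c:c + core]
    return tiles

# Example: a 3 x 3 x 16 convolution with 32 output channels and B = 1
# gives a conductance matrix of size (2 * (3*3*16 + 1), 32) = (290, 32).
M = np.zeros((2 * (3 * 3 * 16 + 1), 32))
print({k: v.shape for k, v in tile_matrix(M).items()})
# {(0, 0): (256, 32), (1, 0): (34, 32)} -> split vertically across two cores
```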

We consider three factors when mapping these conductance matrices onto the 48 cores: resource utilization, computational load balancing and voltage drop. The top priority is to ensure that all the conductance matrices of a model are mapped onto a single chip, so that no re-programming is needed during inference. If the total number of conductance matrices does not exceed 48, we can map each matrix onto a single core (case (1) in Fig. 2a) or onto multiple cores. There are two scenarios in which we map a single matrix onto multiple cores. (1) When a model has different computational intensities, defined as the amount of computation per weight, for different layers (for example, CNNs often have a higher computational intensity in earlier layers owing to their larger feature map dimensions), we duplicate the more computationally intensive matrices to multiple cores and operate them in parallel to increase throughput and balance the computational loads across layers (case (2) in Fig. 2a). (2) Some models have ‘wide’ conductance matrices (output dimension >128), such as our image-recovery RBM. If the entire matrix were mapped onto a single core, each input driver would need to supply a large current to its connected RRAMs, resulting in a significant voltage drop on the driver that deteriorates inference accuracy. Therefore, when there are spare cores, we can split the matrix vertically into multiple segments and map them onto different cores to mitigate the voltage drop.

By contrast, if a model has more than 48 conductance matrices, we need to merge some matrices so that they all fit onto a single chip. Smaller matrices are merged diagonally so that they can be accessed in parallel (case (3) in Fig. 2a); larger matrices are merged horizontally and accessed by time-multiplexing the input rows (case (4) in Fig. 2a). When selecting matrices to merge, we avoid those that fall into the two categories described in the previous paragraph: (1) matrices with high computational intensity (for example, the early layers of ResNet-20), to minimize the impact on throughput; and (2) matrices with a ‘wide’ output dimension (for example, the late layers of ResNet-20 have a large number of output channels), to avoid a large voltage drop. For instance, in our ResNet-20 implementation, among a total of 61 conductance matrices (Extended Data Fig. 9a: 1 from the input layer, 12 from block 1, 17 from block 2, 28 from block 3, 2 from shortcut layers and 1 from the final dense layer), we map each conductance matrix in blocks 1 and 3 onto its own core and merge the remaining matrices to occupy the 8 remaining cores.
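The diagonal merge in case (3) amounts to placing small matrices block-diagonally within one core-sized array, so each occupies disjoint input rows and output columns and can be accessed in parallel. A hedged NumPy sketch of that placement (not the authors' mapping software) follows; off-diagonal cells are left at zero, corresponding to unformed or high-resistance RRAM cells:

```python
import numpy as np

def merge_diagonally(matrices, core_size=256):
    """Pack small conductance matrices block-diagonally into one core-sized array."""
    total_rows = sum(m.shape[0] for m in matrices)
    total_cols = sum(m.shape[1] for m in matrices)
    assert total_rows <= core_size and total_cols <= core_size, "must fit one core"
    merged = np.zeros((total_rows, total_cols))
    r = c = 0
    for m in matrices:
        merged[r:r + m.shape[0], c:c + m.shape[1]] = m
        r += m.shape[0]
        c += m.shape[1]
    return merged

# Example: two small matrices share one core and can be driven simultaneously
merged = merge_diagonally([np.ones((60, 40)), np.ones((100, 80))])
print(merged.shape)  # -> (160, 120)
```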

Table 1 summarizes core usage for all the models. Note that for partially occupied cores, unused RRAM cells are either left unformed or programmed to a high-resistance state, and the WLs of unused rows are not activated during inference, so these cells do not consume additional energy during inference.

The NeuRRAM chip simultaneously improves efficiency, flexibility and accuracy over existing RRAM-CIM hardware by innovating across the entire design hierarchy, from a TNSA architecture enabling reconfigurable dataflow directions, to an energy- and area-efficient voltage-mode neuron circuit, to a series of algorithm-hardware co-optimization techniques. These techniques can be applied more generally to other non-volatile resistive memory technologies such as phase-change memory8,17,21,23,24, magnetoresistive RAM48 and ferroelectric field-effect transistors49. Going forward, we expect NeuRRAM's peak energy efficiency, measured by energy-delay product (EDP), to improve by another two to three orders of magnitude while supporting bigger AI models when scaling from 130-nm to 7-nm CMOS and RRAM technologies (detailed in Methods). Multi-core architecture design with a network-on-chip that realizes efficient and versatile data transfers and inter-array pipelining is likely to be the next major challenge for RRAM-CIM37,38, and will need to be addressed by further cross-layer co-optimization. As resistive memory continues to scale towards offering terabits of on-chip memory50, such a co-optimization approach will equip CIM hardware at the edge with sufficient performance, efficiency and versatility to perform complex AI tasks that today can be done only in the cloud.

The core has three main operating modes: a weight-programming mode, a neuron-testing mode and an MVM mode (Extended Data Fig. 1). In the weight-programming mode, individual RRAM cells are selected for read and write. To select a single cell, the registers at the corresponding row and column are programmed to ‘1’ through random access with the help of the row and column decoder, whereas the other registers are reset to ‘0’. The WL/BL/SL logic turns on the corresponding driver pass gates to apply a set/reset/read voltage on the selected cell. In the neuron-testing mode, the WLs are kept at ground voltage (GND). Neurons receive inputs directly from BL or SL drivers through their BL or SL switch, bypassing RRAM devices. This allows us to characterize the neurons independently from the RRAM array. In the MVM mode, each input BL and SL is driven to Vref − Vread, Vref + Vread or Vref depending on the registers’ value at that row or column. If the MVM is in the BL-to-SL direction, we activate the WLs that are within the input vector length while keeping the rest at GND; if the MVM is in the SL-to-BL direction, we activate all the WLs. After neurons finish analogue-to-digital conversion, the pass gates from BLs and SLs to the registers are turned on to allow neuron-state readout.
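To make the MVM-mode input drive concrete, the sketch below maps a ternary input vector onto the three drive voltages mentioned above. The specific Vref and Vread values, and the assignment of +1 and −1 to the higher or lower voltage, are illustrative assumptions rather than the chip's documented parameters:

```python
V_REF = 0.9    # placeholder reference voltage in volts (assumption)
V_READ = 0.1   # placeholder read amplitude in volts (assumption)

def encode_mvm_inputs(inputs):
    """Map each input element in {-1, 0, +1} to the BL/SL drive voltage used in
    MVM mode: Vref - Vread, Vref, or Vref + Vread, respectively (assumed polarity)."""
    table = {-1: V_REF - V_READ, 0: V_REF, +1: V_REF + V_READ}
    return [table[int(x)] for x in inputs]

print([round(v, 3) for v in encode_mvm_inputs([+1, 0, -1, +1])])
# -> [1.0, 0.9, 0.8, 1.0]
```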

The NeuRRAM chip is not only twice as energy efficient as state-of-the-art compute-in-memory chips, it is also versatile and delivers results that are just as accurate as conventional digital chips.

NeuRRAM chip is twice as energy efficient and could bring the power of AI into tiny edge devices

Stanford engineers created a more efficient and flexible AI chip, which could bring the power of AI into tiny edge devices

AI-powered edge computing is already pervasive in our lives. Devices like drones, smart wearables and industrial IoT sensors are equipped with AI-enabled chips so that computing can occur at the “edge” of the internet, where the data originates. This allows real-time processing and helps keep data private.

However, AI functionalities on these tiny edge devices are limited by the energy provided by a battery. Therefore, improving energy efficiency is crucial. In today’s AI chips, data processing and data storage happen at separate places – a compute unit and a memory unit. The frequent data movement between these units consumes most of the energy during AI processing, so reducing the data movement is the key to addressing the energy issue.

Stanford University engineers have come up with a potential solution: a novel resistive random-access memory (RRAM) chip that does the AI processing within the memory itself, thereby eliminating the separation between the compute and memory units. Their “compute-in-memory” (CIM) chip, called NeuRRAM, is about the size of a fingertip and does more work with limited battery power than current chips can.

“Having those calculations done on the chip instead of sending information to and from the cloud could enable faster, more secure, cheaper, and more scalable AI going into the future, and give more people access to AI power,” said H.-S Philip Wong, the Willard R. and Inez Kerr Bell Professor in the School of Engineering.

“The data movement issue is similar to spending eight hours in commute for a two-hour workday,” added Weier Wan, a recent Stanford graduate who led this project. “With our chip, we are showing a technology to tackle this challenge.”

They presented NeuRRAM in a recent article in the journal Nature. While compute-in-memory has been around for decades, this chip is the first to actually demonstrate a broad range of AI applications on hardware, rather than through simulation alone.

Seminar in Advances in Computing-SRAM based In-Memory Computing for Energy-Efficient AI Systems

Putting computing power on the device

To overcome the data movement bottleneck, researchers implemented what is known as compute-in-memory (CIM), a novel chip architecture that performs AI computation directly within memory rather than in separate computing units. The memory technology that NeuRRAM uses is resistive random-access memory (RRAM), a type of non-volatile memory (memory that retains data even when power is off) that has emerged in commercial products. RRAM can store large AI models in a small area footprint and consumes very little power, making it well suited for small, low-power edge devices.

Even though the concept of CIM chips is well established, and the idea of implementing AI computing in RRAM isn’t new, “this is one of the first instances to integrate a lot of memory right onto the neural network chip and present all benchmark results through hardware measurements,” said Wong, who is a co-senior author of the Nature paper.

The architecture of NeuRRAM allows the chip to perform analog in-memory computation at low power and in a compact-area footprint. It was designed in collaboration with the lab of Gert Cauwenberghs at the University of California, San Diego, who pioneered low-power neuromorphic hardware design. The architecture also enables reconfigurability in dataflow directions, supports various AI workload mapping strategies, and can work with different kinds of AI algorithms – all without sacrificing AI computation accuracy.

To show the accuracy of NeuRRAM’s AI abilities, the team tested how it functioned on different tasks. They found that it is 99% accurate on handwritten-digit recognition using the MNIST dataset, 85.7% accurate on image classification using the CIFAR-10 dataset, 84.7% accurate on Google speech command recognition, and showed a 70% reduction in image-reconstruction error on a Bayesian image-recovery task.

“Efficiency, versatility, and accuracy are all important aspects for broader adoption of the technology,” said Wan. “But to realize them all at once is not simple. Co-optimizing the full stack from hardware to software is the key.”

“Such full-stack co-design is made possible with an international team of researchers with diverse expertise,” added Wong.

Fueling edge computations of the future

Right now, NeuRRAM is a physical proof-of-concept but needs more development before it’s ready to be translated into actual edge devices.

But this combined efficiency, accuracy, and ability to do different tasks showcases the chip’s potential. “Maybe today it is used to do simple AI tasks such as keyword spotting or human detection, but tomorrow it could enable a whole different user experience. Imagine real-time video analytics combined with speech recognition all within a tiny device,” said Wan. “To realize this, we need to continue improving the design and scaling RRAM to more advanced technology nodes.”

“This work opens up several avenues of future research on RRAM device engineering, and programming models and neural network design for compute-in-memory, to make this technology scalable and usable by software developers”, said Priyanka Raina, assistant professor of electrical engineering and a co-author of the paper.

If successful, RRAM compute-in-memory chips like NeuRRAM have almost unlimited potential. They could be embedded in crop fields to do real-time AI calculations for adjusting irrigation systems to current soil conditions. Or they could turn augmented reality glasses from clunky headsets with limited functionality to something more akin to Tony Stark’s viewscreen in the Iron Man and Avengers movies (without intergalactic or multiverse threats – one can hope).

If mass produced, these chips would be cheap enough, adaptable enough, and low power enough that they could be used to advance technologies already improving our lives, said Wong, like in medical devices that allow home health monitoring.

They can be used to solve global societal challenges as well: AI-enabled sensors would play a role in tracking and addressing climate change. “By having these kinds of smart electronics that can be placed almost anywhere, you can monitor the changing world and be part of the solution,” Wong said. “These chips could be used to solve all kinds of problems from climate change to food security.”

The NeuRRAM chip is the first compute-in-memory chip to demonstrate a wide range of AI applications at a fraction of the energy consumed by other platforms while maintaining equivalent accuracy

The NeuRRAM neuromorphic chip was developed by an international team of researchers co-led by UC San Diego engineers.

An international team of researchers has designed and built a chip that runs computations directly in memory and can run a wide variety of AI applications, all at a fraction of the energy consumed by general-purpose AI computing platforms.

The NeuRRAM neuromorphic chip brings AI a step closer to running on a broad range of edge devices, disconnected from the cloud, where they can perform sophisticated cognitive tasks anywhere and anytime without relying on a network connection to a centralized server. Applications abound in every corner of the world and every facet of our lives, ranging from smart watches and VR headsets to smart earbuds, smart sensors in factories and rovers for space exploration.

The NeuRRAM chip is not only twice as energy efficient as the state-of-the-art “compute-in-memory” chips, an innovative class of hybrid chips that runs computations in memory, it also delivers results that are just as accurate as conventional digital chips. Conventional AI platforms are a lot bulkier and typically are constrained to using large data servers operating in the cloud.

In addition, the NeuRRAM chip is highly versatile and supports many different neural network models and architectures. As a result, the chip can be used for many different applications, including image recognition and reconstruction as well as voice recognition.

“The conventional wisdom is that the higher efficiency of compute-in-memory is at the cost of versatility, but our NeuRRAM chip obtains efficiency while not sacrificing versatility,” said Weier Wan, the paper’s first corresponding author and a recent Ph.D. graduate of Stanford University who worked on the chip while at UC San Diego, where he was co-advised by Gert Cauwenberghs in the Department of Bioengineering.

The research team, co-led by bioengineers at the University of California San Diego, presents their results in the Aug. 17 issue of Nature.

Processing-in-Memory Course: Lecture 14: Analyzing&Mitigating ML Inference Bottlenecks - Spring 2022

Currently, AI computing is both power hungry and computationally expensive. Most AI applications on edge devices involve moving data from the devices to the cloud, where the AI processes and analyzes it. Then the results are moved back to the device. That’s because most edge devices are battery-powered and as a result only have a limited amount of power that can be dedicated to computing.

By reducing power consumption needed for AI inference at the edge, this NeuRRAM chip could lead to more robust, smarter and accessible edge devices and smarter manufacturing. It could also lead to better data privacy as the transfer of data from devices to the cloud comes with increased security risks.

On AI chips, moving data from memory to computing units is one major bottleneck.

“It’s the equivalent of doing an eight-hour commute for a two-hour work day,” Wan said.

To solve this data transfer issue, researchers used what is known as resistive random-access memory, a type of non-volatile memory that allows for computation directly within memory rather than in separate computing units. RRAM and other emerging memory technologies used as synapse arrays for neuromorphic computing were pioneered in the lab of Philip Wong, Wan’s advisor at Stanford and a main contributor to this work. Computation with RRAM chips is not necessarily new, but generally it leads to a decrease in the accuracy of the computations performed on the chip and a lack of flexibility in the chip’s architecture.

"Compute-in-memory has been common practice in neuromorphic engineering since it was introduced more than 30 years ago,” Cauwenberghs said. “What is new with NeuRRAM is that the extreme efficiency now goes together with great flexibility for diverse AI applications with almost no loss in accuracy over standard digital general-purpose compute platforms."

A carefully crafted methodology was key to the work with multiple levels of “co-optimization” across the abstraction layers of hardware and software, from the design of the chip to its configuration to run various AI tasks. In addition, the team made sure to account for various constraints that span from memory device physics to circuits and network architecture.

“This chip now provides us with a platform to address these problems across the stack, from devices and circuits to algorithms,” said Siddharth Joshi, an assistant professor of computer science and engineering at the University of Notre Dame, who started working on the project as a Ph.D. student and postdoctoral researcher in Cauwenberghs’ lab at UC San Diego.

Chip performance

Researchers measured the chip’s energy efficiency using a metric known as the energy-delay product, or EDP. EDP combines the amount of energy consumed for each operation and the amount of time it takes to complete the operation. By this measure, the NeuRRAM chip achieves 1.6 to 2.3 times lower EDP (lower is better) and 7 to 13 times higher computational density than state-of-the-art chips.
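As a quick illustration of the metric (with made-up numbers, not measured NeuRRAM figures), the EDP is simply the energy per operation multiplied by the time per operation, so halving either quantity halves the EDP:

```python
def energy_delay_product(energy_per_op_joules, latency_seconds):
    """Energy-delay product: combines energy per operation and time per operation."""
    return energy_per_op_joules * latency_seconds

# Hypothetical comparison: a chip using half the energy at the same latency
baseline = energy_delay_product(2e-12, 50e-9)   # 2 pJ per op, 50 ns
improved = energy_delay_product(1e-12, 50e-9)   # 1 pJ per op, 50 ns
print(baseline / improved)  # -> 2.0, i.e. 2x lower EDP for the improved chip
```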

Researchers ran various AI tasks on the chip. It achieved 99% accuracy on a handwritten digit recognition task; 85.7% on an image classification task; and 84.7% on a Google speech command recognition task. In addition, the chip also achieved a 70% reduction in image-reconstruction error on an image-recovery task. These results are comparable to existing digital chips that perform computation under the same bit-precision, but with drastic savings in energy.

Researchers point out that one key contribution of the paper is that all the results featured were obtained directly on the hardware. In many previous works on compute-in-memory chips, AI benchmark results were often obtained partly through software simulation.

Next steps include improving architectures and circuits and scaling the design to more advanced technology nodes. Researchers also plan to tackle other applications, such as spiking neural networks.

“We can do better at the device level, improve circuit design to implement additional features and address diverse applications with our dynamic NeuRRAM platform,” said Rajkumar Kubendran, an assistant professor at the University of Pittsburgh, who started work on the project while a Ph.D. student in Cauwenberghs’ research group at UC San Diego.

In addition, Wan is a founding member of a startup that works on productizing the compute-in-memory technology. “As a researcher and an engineer, my ambition is to bring research innovations from labs into practical use,” Wan said.

 Intelligence on Silicon: From Deep Neural Network Accelerators to Brain-Mimicking AI-SoCs

New architecture

The key to NeuRRAM’s energy efficiency is an innovative method of sensing output in memory. Conventional approaches use voltage as the input and measure current as the result, but this leads to more complex and more power-hungry circuits. In NeuRRAM, the team engineered a neuron circuit that senses voltage and performs analog-to-digital conversion in an energy-efficient manner. This voltage-mode sensing can activate all the rows and all the columns of an RRAM array in a single computing cycle, allowing higher parallelism.

In the NeuRRAM architecture, CMOS neuron circuits are physically interleaved with the RRAM weights. This differs from conventional designs, where CMOS circuits are typically placed on the periphery of the RRAM array. The neuron’s connections with the RRAM array can be configured to serve as either the input or the output of the neuron. This allows neural network inference in various dataflow directions without incurring overheads in area or power consumption, which in turn makes the architecture easier to reconfigure.

To make sure that the accuracy of the AI computations can be preserved across various neural network architectures, researchers developed a set of hardware-algorithm co-optimization techniques. The techniques were verified on various neural networks, including convolutional neural networks, long short-term memory networks and restricted Boltzmann machines.

As a neuromorphic AI chip, NeuRRAM performs parallel distributed processing across 48 neurosynaptic cores. To simultaneously achieve high versatility and high efficiency, NeuRRAM supports data parallelism by mapping a layer of a neural network model onto multiple cores for parallel inference on multiple data. It also offers model parallelism by mapping different layers of a model onto different cores and performing inference in a pipelined fashion.

An international research team

The work is the result of an international team of researchers.

The UC San Diego team designed the CMOS circuits that implement the neural functions interfacing with the RRAM arrays, supporting the synaptic functions in the chip’s architecture with high efficiency and versatility. Wan, working closely with the entire team, implemented the design, characterized the chip, trained the AI models and ran the experiments. Wan also developed a software toolchain that maps AI applications onto the chip.

The RRAM synapse array and its operating conditions were extensively characterized and optimized at Stanford University.

The RRAM array was fabricated and integrated onto CMOS at Tsinghua University.

The team at Notre Dame contributed to the design and architecture of the chip as well as to the subsequent machine-learning model design and training.

The research started as part of the National Science Foundation-funded Expeditions in Computing project on Visual Cortex on Silicon at Penn State University, with continued funding support from the Office of Naval Research Science of AI program, the Semiconductor Research Corporation and DARPA JUMP program, and Western Digital Corporation.

A compute-in-memory chip based on resistive random-access memory

Published open-access in Nature, August 17, 2022.

Weier Wan, Rajkumar Kubendran, Stephen Deiss, Siddharth Joshi, Gert Cauwenberghs, University of California San Diego

Weier Wan, S. Burc Eryilmaz, Priyanka Raina, H-S Philip Wong, Stanford University

Clemens Schaefer, Siddharth Joshi, University of Notre Dame

Rajkumar Kubendran, University of Pittsburgh

Wenqiang Zhang, Dabin Wu, He Qian, Bin Gao, Huaqiang Wu, Tsinghua University


Corresponding authors: Wan, Gao, Joshi, Wu, Wong and Cauwenberghs


More Information:

https://www.nature.com/articles/s41586-022-04992-8

https://www.synopsys.com/designware-ip/technical-bulletin/the-dna-of-an-ai-soc-dwtb_q318.html

https://www.synopsys.com/designware-ip/ip-market-segments/artificial-intelligence.html#memory

https://innovationtoronto.com/2022/08/neurram-chip-is-twice-as-energy-efficient-and-could-bring-the-power-of-ai-into-tiny-edge-devices/

https://www.eurekalert.org/multimedia/946477

https://today.ucsd.edu/story/Nature_bioengineering_2022

https://www.eenewseurope.com/en/48-core-neuromorphic-ai-chip-uses-resistive-memory/

https://cacm.acm.org/news/263914-a-neuromorphic-chip-for-ai-on-the-edge/fulltext


https://www.quantamagazine.org/a-brain-inspired-chip-can-run-ai-with-far-less-energy-20221110/









