Implementation of Artificial Neural Networks at the Edge
Posted 07/05/2017 by Hussein Osman
Imagine private security systems that can differentiate between an intruder and your neighbor’s dog, smart TVs that can scan the room and automatically turn off when no one is present, and cameras that can perform forensic analysis and identify suspicious behavior before a crime occurs. The applications for deploying artificial neural networks at the edge are endless. Coming up with ideas is easy, but getting to the implementation is not that simple. How can designers bring the advantages of artificial intelligence (AI), neural networks and machine learning to resource-constrained, power-optimized network edge devices?
Part of the answer lies in doing the training in the cloud, using traditional data center techniques, and then downloading the weights and parameters to devices at the edge of the network for “inference”. Neural networks learn progressively to perform tasks by considering examples provided during “training”; they then apply what they have learned to new data during “inference”. Running inference at the network edge promises to minimize latency in decision-making and reduce network congestion, as well as improve personal security and privacy, since captured data is not continuously sent to the cloud.
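As a rough illustration of that split, here is a minimal Python/NumPy sketch of the workflow: weights learned in the cloud are exported to a file, and the edge device loads them and runs only the forward pass. The file name, layer sizes, and two-layer network are illustrative assumptions, not details of any particular deployment.

```python
import numpy as np

# --- In the cloud: after training, export the learned weights (hypothetical file name) ---
# 'weights' stands in for parameters produced by any training framework.
weights = {"fc1": np.random.randn(64, 784).astype(np.float32),
           "fc2": np.random.randn(10, 64).astype(np.float32)}
np.savez("trained_weights.npz", **weights)   # this file is what gets downloaded to the edge device

# --- On the edge device: load the frozen weights and run inference only ---
params = np.load("trained_weights.npz")

def relu(x):
    return np.maximum(x, 0)

def infer(image_flat):
    """Forward pass only -- no training happens on the device."""
    h = relu(params["fc1"] @ image_flat)
    logits = params["fc2"] @ h
    return int(np.argmax(logits))

print(infer(np.random.rand(784).astype(np.float32)))  # prints a predicted class index
```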
Another possible solution lies in the use of on-device AI with Binarized Neural Networks (BNNs). BNNs eliminate multiplication and division by using 1-bit values instead of larger numbers at runtime, which allows convolutions to be computed using only addition. Since multipliers are among the most area- and power-hungry components in a digital system, replacing them with adders offers significant power and cost savings.
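To make that concrete, the Python sketch below computes a small “valid” convolution in which every weight is +1 or -1, so each multiply-accumulate collapses into an add or a subtract. The array sizes are arbitrary and the nested loops are written for clarity rather than speed; this shows the arithmetic idea, not the hardware implementation.

```python
import numpy as np

def binary_conv2d(activations, weight_signs):
    """'Valid' 2-D convolution where every weight is +1 or -1,
    so each multiply-accumulate reduces to an add or a subtract."""
    H, W = activations.shape
    kH, kW = weight_signs.shape
    out = np.zeros((H - kH + 1, W - kW + 1), dtype=activations.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            acc = 0
            for di in range(kH):
                for dj in range(kW):
                    a = activations[i + di, j + dj]
                    # weight +1 -> add, weight -1 -> subtract: no multiplier needed
                    acc = acc + a if weight_signs[di, dj] > 0 else acc - a
            out[i, j] = acc
    return out

acts = np.random.randint(0, 256, (6, 6)).astype(np.int32)   # example 8-bit activations
wts  = np.random.choice([-1, 1], (3, 3)).astype(np.int32)   # binarized 3x3 kernel
print(binary_conv2d(acts, wts))
```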
However, if developers of AI-based solutions at the edge of the network want to avoid relying on computational resources in the cloud, they need to find new ways to make their designs more power- and resource-efficient. One step in that direction is the emergence of a new class of neural networks called Tiny Binarized Neural Networks, or TinBiNNs, which can compute any neural network with binary weights and 8-bit activations.
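The sketch below illustrates the arithmetic such a network relies on, assuming a hypothetical encoding in which a stored weight bit of 1 means +1 and 0 means -1: a binary-weight dot product over 8-bit activations reduces to additions and subtractions accumulated in a wider register.

```python
import numpy as np

def binary_weight_dot(activations_u8, weight_bits):
    """Dot product of 8-bit activations with binary weights,
    where bit 1 encodes +1 and bit 0 encodes -1.
    Uses only additions and subtractions."""
    a = activations_u8.astype(np.int32)      # widen to avoid overflow in the accumulator
    pos = a[weight_bits == 1].sum()          # sum of activations whose weight is +1
    total = a.sum()
    return 2 * pos - total                   # equals sum(+a) over +1 weights minus sum(a) over -1 weights

acts = np.random.randint(0, 256, 128, dtype=np.uint8)   # example 8-bit activations
bits = np.random.randint(0, 2, 128, dtype=np.uint8)     # 1-bit weights (1 -> +1, 0 -> -1)

# Cross-check against the multiply-based reference
signs = np.where(bits == 1, 1, -1)
assert binary_weight_dot(acts, bits) == int((acts.astype(np.int32) * signs).sum())
print(binary_weight_dot(acts, bits))
```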
Recently, developers at VectorBlox Computing and Lattice Semiconductor collaborated on a tiny, lightweight binarized neural network overlay. Together, we built a demonstration system that uses 5,036 of the 5,280 LUTs on a Lattice iCE40 UltraPlus 5K FPGA. It also uses 4 of the device’s 8 DSP blocks, as well as all 120 kbits of its block RAM and all 128 kbits of its single-port RAM. We implemented a binarized neural network operator in hardware to serve as an ALU in the ORCA soft RISC-V processor. Additionally, the processor was augmented with a custom set of vector instructions to accelerate other compute-intensive steps, such as the MaxPool and activation functions. To rapidly prototype and test this design, the team relied on the iCE40 UltraPlus Mobile Development Platform.
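For readers unfamiliar with those steps, the short Python sketch below shows what 2x2 max pooling and a ReLU-style activation compute in software; it is purely illustrative and does not reflect the ORCA vector instruction set or the overlay’s actual implementation.

```python
import numpy as np

def maxpool2x2(x):
    """2x2, stride-2 max pooling -- one of the compute-intensive steps
    that dedicated vector instructions can accelerate."""
    H, W = x.shape
    return x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

def relu(x):
    """Simple activation function: elementwise maximum with zero."""
    return np.maximum(x, 0)

fmap = np.random.randint(-128, 128, (8, 8)).astype(np.int32)  # example feature map
print(relu(maxpool2x2(fmap)))
```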
The system delivers promising results, achieving an error rate of less than 0.4% at one second per frame. Although a high-power Intel i7 quad-core can perform the same computation in as little as 1.5 ms, the iCE40 UltraPlus system is far simpler and consumes only 4.5 mW.
In the meantime, the debate continues: is the best place to run inference in the cloud, in the network, or at the edge on the device itself? The VectorBlox/Lattice project makes a strong argument for on-device AI. In this case, building AI into an FPGA with a RISC-V processor not only cuts power consumption but also accelerates response time. In addition, keeping processing local improves security and privacy. For more details on the iCE40 UltraPlus machine learning demo, visit our website.