CNN Plus Accelerator IP Core

AI Acceleration Using Low Power FPGAs

The Lattice Semiconductor CNN Plus Accelerator IP Core is a calculation engine for deep neural networks with fixed-point weights. It computes complete neural network layers, including convolution, pooling, batch normalization, and fully connected layers, by executing sequence code, including weight values, that the Lattice sensAI™ Neural Network Compiler generates. The engine is optimized for convolutional neural networks, so it can be used for vision-based applications such as classification or object detection and tracking. The IP core does not require an extra processor; it can perform all required calculations by itself.

The CNN Plus Accelerator IP Core offers three implementation types:

  • Compact CNN: has low resource utilization, which makes it suitable for small FPGA devices.
  • Optimized CNN: can perform four convolution calculations in parallel, which makes it suitable for high-speed applications.
  • Extended CNN: offers the same features as Optimized CNN, plus support for max pooling and unpooling with max argument.
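To illustrate what "max pooling/unpooling with max argument" means, here is a minimal software sketch of the general technique in Python. It is not the IP core's implementation. The function names, the non-overlapping k × k window, and the flat-index encoding of the max argument are all assumptions made for illustration.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def max_pool_with_argmax(x: Matrix, k: int = 2) -> Tuple[Matrix, Matrix]:
    """k x k max pooling with stride k (assumed layout).

    Returns the pooled map and, for each output, the flat index
    (row * width + col) of the input element that produced the maximum.
    """
    rows, cols = len(x), len(x[0])
    pooled, argmax = [], []
    for r in range(0, rows - k + 1, k):
        pooled_row, arg_row = [], []
        for c in range(0, cols - k + 1, k):
            # Scan the window; on ties the first position found is kept.
            best_val, best_idx = x[r][c], r * cols + c
            for dr in range(k):
                for dc in range(k):
                    v = x[r + dr][c + dc]
                    if v > best_val:
                        best_val, best_idx = v, (r + dr) * cols + (c + dc)
            pooled_row.append(best_val)
            arg_row.append(best_idx)
        pooled.append(pooled_row)
        argmax.append(arg_row)
    return pooled, argmax


def max_unpool(pooled: Matrix, argmax: Matrix, rows: int, cols: int) -> Matrix:
    """Scatter each pooled value back to its recorded position; zeros elsewhere."""
    out = [[0] * cols for _ in range(rows)]
    for pooled_row, arg_row in zip(pooled, argmax):
        for v, idx in zip(pooled_row, arg_row):
            out[idx // cols][idx % cols] = v
    return out


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 9, 2],
    [7, 6, 3, 4],
]
pooled, argmax = max_pool_with_argmax(feature_map)
restored = max_unpool(pooled, argmax, 4, 4)
```

The stored max arguments are what let unpooling place each value back at its original location, which is why this feature needs extra storage compared with plain max pooling.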

Customized Convolutional Neural Network (CNN) IP – CNN Plus IP is a flexible accelerator IP that simplifies the implementation of ultra-low-power AI by leveraging the parallel processing capabilities, distributed memory, and DSP resources of Lattice FPGAs.

Configurable Modes of Use – Three modes are available: COMPACT (low performance, smallest footprint), OPTIMIZED (higher performance in a resource-optimized footprint), and HIGH performance mode (highest performance, largest footprint).

Easy to Implement – Models trained using common machine learning frameworks such as TensorFlow can be compiled using the Lattice Neural Network Compiler tool and implemented in hardware using the CNN Plus Accelerator IP.

Features

  • Three selectable implementation types: Compact CNN, Optimized CNN, Extended CNN
  • Selectable AXI4 or FIFO interface
  • Support for convolution, max pooling, global average pooling, batch normalization, and fully connected layers
  • Configurable activation bit width (16-bit or 8-bit)
  • Configurable number of memory blocks to trade off resource usage against performance
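The choice between 16-bit and 8-bit activations is a precision-versus-resource trade-off. The Python sketch below shows generic signed fixed-point quantization to make that concrete. The number format (signed, with the chosen split of fractional bits) and the helper names `quantize` and `dequantize` are assumptions for illustration only; this page does not specify the format the IP core uses.

```python
def quantize(value: float, total_bits: int, frac_bits: int) -> int:
    """Round to the nearest signed fixed-point code and saturate to total_bits.

    Assumed format: two's-complement integer code with frac_bits fractional bits.
    """
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    code = round(value * (1 << frac_bits))
    return max(lo, min(hi, code))


def dequantize(code: int, frac_bits: int) -> float:
    """Convert a fixed-point code back to a real value."""
    return code / (1 << frac_bits)


activation = 0.7
# Hypothetical formats: 8-bit with 5 fractional bits, 16-bit with 10 fractional bits.
approx_8 = dequantize(quantize(activation, 8, 5), 5)
approx_16 = dequantize(quantize(activation, 16, 10), 10)
print(f"8-bit: {approx_8}, 16-bit: {approx_16}")
```

In this sketch the 16-bit code reproduces the value more closely, while the 8-bit code needs half the storage per activation.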

Block Diagram

Functional Block Diagram of CNN Plus Accelerator (Compact CNN Type)

Resource Utilization

LFCPNX-100-9BBG484I
| Configuration³ | clk_i Fmax (MHz)² | aclk_i Fmax (MHz)² | Registers | LUTs | LRAMs⁴ | EBRs | Logical DSP MULT9 | Logical DSP MULT18 | Logical DSP REG18 | Logical DSP PREADD9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Default | 121.448 | 127.081 | 2566 | 3626 | 2 | 13 | 13 | 1 | 13 | 13 |
| Scratch Pad Memory Size=2K, Others=Default | 119.104 | 125.078 | 2578 | 3639 | 2 | 15 | 13 | 1 | 13 | 13 |
| Scratch Pad Memory Size=4K, Others=Default | 116.469 | 124.270 | 2582 | 3650 | 2 | 19 | 13 | 1 | 13 | 13 |
| Scratch Pad Memory Size=8K, Others=Default | 122.160 | 130.022 | 2590 | 3657 | 2 | 27 | 13 | 1 | 13 | 13 |
| Scratch Pad Memory Mode=OCTA, Others=Default | 120.685 | 127.275 | 2710 | 3844 | 2 | 15 | 13 | 1 | 13 | 13 |
| Scratch Pad Memory Mode=OCTA, Scratch Pad Memory Size=2K, Others=Default | 123.183 | 129.149 | 2714 | 3840 | 2 | 19 | 13 | 1 | 13 | 13 |
| Scratch Pad Memory Mode=OCTA, Scratch Pad Memory Size=4K, Others=Default | 122.085 | 123.047 | 2722 | 3858 | 2 | 27 | 13 | 1 | 13 | 13 |
| Memory Type=SINGLE_LRAM, Others=Default | 130.005 | 130.463 | 2565 | 3608 | 1 | 13 | 13 | 1 | 13 | 13 |
| Memory Type=QUAD_LRAM, Others=Default | 116.564 | 122.011 | 2573 | 3677 | 4 | 13 | 13 | 1 | 13 | 13 |
| Machine Learning Type=OPTIMIZED_CNN, Others=Default | 113.869 | 129.601 | 5470 | 7226 | 2 | 17 | 48 | 4 | 48 | 48 |
| Machine Learning Type=OPTIMIZED_CNN, Scratch Pad Memory Size=2K, Others=Default | 112.803 | 123.686 | 5475 | 7240 | 2 | 21 | 48 | 4 | 48 | 48 |
| Machine Learning Type=OPTIMIZED_CNN, Scratch Pad Memory Size=4K, Others=Default | 114.863 | 124.409 | 5486 | 7265 | 2 | 29 | 48 | 4 | 48 | 48 |
| Machine Learning Type=OPTIMIZED_CNN, Scratch Pad Memory Size=8K, Others=Default | 113.353 | 122.234 | 5490 | 7279 | 2 | 45 | 48 | 4 | 48 | 48 |
| Machine Learning Type=OPTIMIZED_CNN, Line Buffer Size=1024, Others=Default | 115.473 | 126.662 | 5475 | 7250 | 2 | 21 | 48 | 4 | 48 | 48 |
| Machine Learning Type=OPTIMIZED_CNN, Line Buffer Size=2048, Others=Default | 118.161 | 122.609 | 5477 | 7266 | 2 | 30 | 48 | 4 | | |
| Machine Learning Type=OPTIMIZED_CNN, Scratch Pad Memory Size=8K, Maximum Burst Length=256, Others=Default | 119.446 | 126.263 | 5491 | 7279 | 2 | 45 | 48 | 4 | 48 | 48 |
| Machine Learning Type=OPTIMIZED_CNN, Convolution Engine=DUAL_CONV, Others=Default | 113.225 | 124.657 | 7394 | 9585 | 2 | 21 | 84 | 4 | 84 | 84 |
| Machine Learning Type=OPTIMIZED_CNN, Convolution Engine=QUAD_CONV, Others=Default | 117.233 | 124.008 | 11023 | 14216 | 2 | 29 | 156 | 4 | 156 | 156 |
| Machine Learning Type=OPTIMIZED_CNN, LRAM Enable Output Register=Checked, Others=Default | 147.842 | 131.027 | 5475 | 7252 | 2 | 17 | 48 | 4 | 48 | 48 |
| Embedded Mode=Checked, Others=Default | 127.259 | N/A | 1737 | 2474 | 2 | 5 | 13 | 1 | 13 | 13 |
| Embedded Mode=Checked, Machine Learning Type=OPTIMIZED_CNN, Others=Default | 120.715 | N/A | 4673 | 6109 | 2 | 9 | 48 | 4 | 48 | 48 |
| Embedded Mode=Checked, Line Buffer Size=1024, Machine Learning Type=OPTIMIZED_CNN, Others=Default | 114.116 | N/A | 4677 | 6112 | 2 | 13 | 48 | 4 | 48 | 48 |
| Embedded Mode=Checked, Machine Learning Type=OPTIMIZED_CNN, Convolution Engine=DUAL_CONV, Others=Default | 118.991 | N/A | 6596 | 8463 | 2 | 13 | 84 | 4 | 84 | 84 |
| Machine Learning Type=EXTENDED_CNN, Others=Default | 134.120 | 124.750 | 5798 | 8049 | 2 | 20 | 48 | 4 | 48 | 48 |
| Machine Learning Type=EXTENDED_CNN, Scratch Pad Memory Size=2K, Others=Default | 126.406 | 130.497 | 5803 | 8051 | 2 | 24 | 48 | 4 | 48 | 48 |
| Machine Learning Type=EXTENDED_CNN, Scratch Pad Memory Size=4K, Others=Default | 150.466 | 114.403 | 5811 | 8059 | 2 | 32 | 48 | 4 | 48 | 48 |
| Machine Learning Type=EXTENDED_CNN, Scratch Pad Memory Size=8K, Others=Default | 132.732 | 127.926 | 5815 | 8041 | 2 | 48 | 48 | 4 | 48 | 48 |
| Machine Learning Type=EXTENDED_CNN, Maximum Argument Size=8192, Others=Default | 143.947 | 128.287 | 5798 | 8051 | 2 | 22 | 48 | 4 | 48 | 48 |
| Machine Learning Type=EXTENDED_CNN, Convolution Engine=DUAL_CONV, Others=Default | 144.071 | 127.665 | 7722 | 10404 | 2 | 24 | 84 | 4 | 84 | 84 |
| Machine Learning Type=EXTENDED_CNN, Convolution Engine=QUAD_CONV, Others=Default | 146.585 | 135.208 | 11352 | 15002 | 2 | 32 | 156 | 4 | 156 | 156 |
| Embedded Mode=Checked, Machine Learning Type=EXTENDED_CNN, Others=Default | 134.88 | N/A | 5000 | 6929 | 2 | 12 | 48 | 4 | 48 | 48 |

Notes:
1. Performance may vary when using a different software version or targeting a different device density or speed grade.
2. Fmax is generated when the FPGA design only contains the CNN Plus Accelerator IP Core. These values may be reduced when user logic is added to the FPGA design.
3. The K value in “Scratch Pad Memory Size=*K” is equivalent to 1024 entries × 2 bytes. For example, 4K is equal to 8 kB of scratch pad memory.
4. The OPTIMIZED_CNN implementation uses considerably more EBRs because it duplicates the EBRs in the convolution scratch storage to enable parallel processing. In addition, some duplicated submodules have their own EBRs: CONV_EU (1 EBR per unit) and POOL (1 EBR shared by 2 units).
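The size conversion in note 3 can be written out as a short calculation. The Python sketch below uses a hypothetical helper name, `scratch_pad_bytes`, and only the figures stated in the note (1024 entries per K, 2 bytes per entry).

```python
ENTRIES_PER_K = 1024   # from note 3: one "K" is 1024 entries
BYTES_PER_ENTRY = 2    # from note 3: each entry is 2 bytes


def scratch_pad_bytes(size_k: int) -> int:
    """Return the scratch pad capacity in bytes for a 'Scratch Pad Memory Size=<size_k>K' setting."""
    return size_k * ENTRIES_PER_K * BYTES_PER_ENTRY


# Sizes that appear in the resource utilization table.
for size_k in (2, 4, 8):
    print(f"{size_k}K -> {scratch_pad_bytes(size_k) // 1024} kB")
```

For the 4K setting this gives 8192 bytes, which matches the 8 kB example in the note.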

Ordering Information

| Device Family | Part Number (Multi-site Perpetual) | Part Number (Single Seat Annual) |
|---|---|---|
| CrossLink-NX | CNNPLUS-ACCEL-CNX-UT | CNNPLUS-ACCEL-CNX-US |
| CertusPro-NX | CNNPLUS-ACCEL-CPNX-UT | CNNPLUS-ACCEL-CPNX-US |
| Certus-NX | CNNPLUS-ACCEL-CTNX-UT | CNNPLUS-ACCEL-CTNX-US |

To download a full evaluation version of this IP, go to the IP Server in Lattice Radiant. This IP core supports Lattice’s IP hardware evaluation capability, which lets you generate the IP core and operate it in hardware for a limited time (approximately four hours) without an IP license.

To find out how to purchase the CNN Plus Accelerator IP core, please contact your local Lattice Sales Office.

Documentation

Quick Reference
| Title | Number | Version | Date | Format | Size |
|---|---|---|---|---|---|
| CNN Plus Accelerator IP Core - User Guide | FPGA-IPUG-02115 | 1.5 | 12/5/2023 | PDF | 853.8 KB |