Xilinx DPU for Convolutional Neural Network v1.2 Product Manual

DPU for Convolutional Neural Network v1.2
DPU IP Product Guide
PG338 (v1.2) March 26, 2019

Revision History

The following table shows the revision history for this document.

Section | Revision Summary
03/26/2019 Version 1.2
Build the PetaLinux Project | Updated description.
Build the Demo | Updated figure.
Demo Execution | Updated code.
03/08/2019 Version 1.1
Table 6: Reg_dpu_base_addr | Updated description.
Figure 10: DPU Configuration | Updated figure.
Build the PetaLinux Project | Updated code.
Build the Demo | Updated description.
03/05/2019 Version 1.1
Chapter 6: Example Design | Added chapter regarding the DPU targeted reference design.
02/28/2019 Version 1.0
Initial release | N/A
DPU IP Product Guide www.xilinx.com 2 PG338 (v1.2) March 26, 2019

Table of Contents

Revision History
IP Facts
    Introduction
Chapter 1: Overview
    Introduction
    Development Tools
    Example System with DPU
    DNNDK
    Licensing and Ordering Information
Chapter 2: Product Specification
    Hardware Architecture
    DSP with Enhanced Utilization (DPU_EU)
    Register Space
    Interrupts
Chapter 3: DPU Configuration
    Introduction
    Configuration Options
    DPU Performance on Different Devices
    Performance of Different Models
    I/O Bandwidth Requirements
Chapter 4: Clocking and Resets
    Introduction
    Clock Domain
    Reference Clock Generation
    Reset
Chapter 5: Development Flow
    Customizing and Generating the Core in MPSoC
Chapter 6: Example Design
    Introduction
    Hardware Design Flow
    Software Design Flow
Appendix A: Legal Notices
    References
    Please Read: Important Legal Notices

IP Facts

Introduction

The Xilinx® Deep Learning Processor Unit (DPU) is a configurable engine dedicated to convolutional neural networks. The computing parallelism can be configured according to the selected device and application. The DPU includes a set of efficiently optimized instructions and supports most convolutional neural networks, such as VGG, ResNet, GoogLeNet, YOLO, SSD, MobileNet, and FPN.

Features

• One slave AXI interface for accessing configuration and status registers.
• One master interface for accessing instructions.
• Supports configurable AXI master interface with 64 or 128 bits for accessing data.
• Supports individual configuration of each channel.
• Supports optional interrupt request generation.
• Some highlights of DPU functionality include:
  o Configurable hardware architecture includes: B512, B800, B1024, B1152, B1600, B2304, B3136, and B4096
  o Configurable core number up to three
  o Convolution and deconvolution
  o Max pooling
  o ReLU and Leaky ReLU
  o Concat
  o Elementwise
  o Dilation
  o Reorg
  o Fully connected layer
  o Batch Normalization
  o Split

DPU IP Facts Table

Core Specifics
Supported Device Family | Zynq®-7000 SoC and UltraScale+™ MPSoC Family
Supported User Interfaces | Memory-mapped AXI interfaces
Resources | See Chapter 3: DPU Configuration
Provided with Core
Design Files | Encrypted RTL
Example Design | Verilog
Constraint File | Xilinx Design Constraints (XDC)
Supported S/W Driver | Included in PetaLinux
Tested Design Flows
Design Entry | Vivado® Design Suite
Simulation | N/A
Synthesis | Vivado Synthesis
Support
Provided by Xilinx at the Xilinx Support web page
Notes:
1. Linux OS and driver support information is available from the DPU TRD or DNNDK.
2. If the requirement is on Zynq-7000 SoC, contact your local FAE.
3. For the supported versions of the tools, see the Vivado Design Suite User Guide: Release Notes, Installation, and Licensing (UG973).
Chapter 1: Overview

Introduction

The Xilinx® Deep Learning Processor Unit (DPU) is a programmable engine dedicated to convolutional neural networks. The unit contains a register configuration module, a data controller module, and a convolution computing module. The DPU has a specialized instruction set, which enables it to work efficiently for many convolutional neural networks. Networks deployed on the DPU include VGG, ResNet, GoogLeNet, YOLO, SSD, MobileNet, and FPN.
The DPU IP can be integrated as a block in the programmable logic (PL) of the selected Zynq®-7000 SoC and Zynq UltraScale+™ MPSoC devices with direct connections to the processing system (PS). To use the DPU, you should prepare the instructions and input image data at the specific memory address that the DPU can access. The DPU operation also requires the application processing unit (APU) to service interrupts to coordinate data transfer.
The top-level block diagram of DPU is shown in Figure 1.
Figure 1: Top-Level Block Diagram

Development Tools

Use the Xilinx Vivado Design Suite to integrate the DPU into your own project. Vivado Design Suite 2018.2 or later is recommended. Earlier versions of Vivado can also be supported on request; contact your sales representative.

Device Resources

The DPU logic resource is optimized and scalable across Xilinx UltraScale+ MPSoC and Zynq-7000 devices. For the detailed resource utilization, refer to Chapter 3: DPU Configuration.

How to Run DPU

The DPU operation depends on the driver which is included in the Xilinx Deep Neural Network Development Kit (DNNDK) toolchain.
You can download the free developer resources from the Xilinx website:
https://www.xilinx.com/products/design-tools/ai-inference/ai-developer-hub.html#edge
For an essential guide on how to run the DPU with the DNNDK tools, including how to install the DPU driver and dependent libraries, refer to the DNNDK User Guide (UG1327). The basic development flow is shown in the following figure. First, use Vivado to generate the bitstream. Then, download the bitstream to the target board and install the DPU driver.
Figure 2: Basic Development Flow

Example System with DPU

The figure below shows an example system block diagram with the Xilinx UltraScale+ MPSoC using a camera input. DPU is integrated into the system through AXI interconnect to perform deep learning inference tasks such as image classification, object detection, and semantic segmentation.
Figure 3: Example System with Integrated DPU

DNNDK

Deep Neural Network Development Kit (DNNDK) is a full-stack deep learning toolchain for inference with the DPU.
As shown in Figure 4, DNNDK is composed of Deep Compression Tool (DECENT), Deep Neural Network Compiler (DNNC), Neural Network Runtime (N2Cube), and DPU Profiler.
Figure 4: DNNDK Toolchain
The instructions of DPU are generated offline with DNNDK. Figure 5 illustrates the hierarchy of executing deep learning applications on the target hardware platform with DPU.
Figure 5: Application Execution Hierarchy

Licensing and Ordering Information

This IP module is provided at no additional cost under the terms of the Xilinx End User License.
Information about this and other IP modules is available at the Xilinx Intellectual Property page. For information on pricing and availability of other Xilinx IP modules and tools, contact your local Xilinx sales representative.
Chapter 2: Product Specification

Hardware Architecture

The detailed hardware architecture of DPU is shown in Figure 6. After start-up, DPU fetches instructions from the off-chip memory and parses instructions to operate the computing engine. The instructions are generated by the DNNDK compiler where substantial optimizations have been performed.
To improve efficiency, the abundant on-chip memory in Xilinx® devices is used to buffer the input, intermediate, and output data. The data is reused as much as possible to reduce the memory bandwidth. A deeply pipelined design is used for the computing engine. Like other accelerators, the computational arrays (PE) take full advantage of the fine-grained building blocks in Xilinx devices, which include multipliers, adders, and accumulators.
Figure 6: DPU Hardware Architecture

DSP with Enhanced Utilization (DPU_EU)

In the previous DPU version, the general logic and DSP slices work in the same clock domain, though technically the latter can run at a higher frequency. To enhance the utilization of DSP slices in DPU, the advanced DPU_EU version was designed.
The EU in “DPU_EU” means enhanced utilization of DSP slices. The DSP Double Data Rate (DDR) technique is used to improve the performance achieved with the device. Therefore, two input clocks are needed for the DPU: one for the general logic and the other for the DSP slices. The difference between DPU and DPU_EU is shown in Figure 7.
All references to DPU in this document refer to DPU_EU, unless otherwise specified.
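The relationship between the two DPU_EU input clocks can be sketched as a small helper. This is only an illustration: it relies on the signal description of dpu_2x_clk, which states that its frequency is two times that of m_axi_dpu_aclk. The function name dpu_eu_clocks and the 300 MHz value are assumptions made for the example, not settings taken from this guide.

```python
def dpu_eu_clocks(general_logic_mhz: float) -> dict:
    """Return the two DPU_EU input clock frequencies in MHz.

    m_axi_dpu_aclk drives the general logic; dpu_2x_clk drives the DSP
    slices and runs at twice that frequency.
    """
    if general_logic_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    return {
        "m_axi_dpu_aclk": general_logic_mhz,
        "dpu_2x_clk": 2 * general_logic_mhz,
    }


# 300 MHz is an arbitrary illustrative value, not a recommended setting.
for name, mhz in dpu_eu_clocks(300.0).items():
    print(f"{name}: {mhz:.1f} MHz")
```

A check of this kind may be useful when planning the clock generator, because the DSP clock has to meet timing at twice the general logic frequency.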

Port Descriptions

The DPU top-level interfaces are shown in the following figure.
Figure 7: Difference between DPU and DPU_EU
Figure 8: DPU_EU IP Port
The DPU I/O signals are listed and described in Table 1.
Table 1: DPU Signal Description
Signal Name | Interface Type | Width | I/O | Description
S_AXI | Memory mapped AXI slave interface | 32 | I/O | 32-bit Memory mapped AXI interface for registers.
s_axi_aclk | Clock | 1 | I | AXI clock input for S_AXI
s_axi_aresetn | Reset | 1 | I | Active-Low reset for S_AXI
dpu_2x_clk | Clock | 1 | I | Input clock used for DSP unit in DPU. The frequency is two times that of m_axi_dpu_aclk.
dpu_2x_resetn | Reset | 1 | I | Active-Low reset for DSP unit
m_axi_dpu_aclk | Clock | 1 | I | Input clock used for DPU general logic.
m_axi_dpu_aresetn | Reset | 1 | I | Active-Low reset for DPU general logic
DPUx_M_AXI_INSTR | Memory mapped AXI master interface | 32 | I/O | 32-bit Memory mapped AXI interface for instruction of DPU.
DPUx_M_AXI_DATA0 | Memory mapped AXI master interface | 128 | I/O | 128-bit Memory mapped AXI interface for DPU data fetch.
DPUx_M_AXI_DATA1 | Memory mapped AXI master interface | 128 | I/O | 128-bit Memory mapped AXI interface for DPU data fetch.
dpu_interrupt | Interrupt | 1~3 | O | Active-High interrupt output from DPU. The data width is decided by the DPU number.
Notes:
1. If only input ports are needed, you can edit the ports in the block diagram and declare the port at interface level.
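The configuration limits stated in this guide (eight architectures, up to three cores, a 64-bit or 128-bit data interface, and an interrupt output whose width follows the number of DPU cores) can be captured in a small validation sketch. The DpuConfig class and its methods are hypothetical names introduced for illustration and are not part of any Xilinx tool. Expanding the x in DPUx_M_AXI_* to a zero-based core index is likewise an assumption.

```python
from dataclasses import dataclass

# Values taken from the feature list and the signal table of this guide.
ARCHITECTURES = ("B512", "B800", "B1024", "B1152",
                 "B1600", "B2304", "B3136", "B4096")
DATA_WIDTHS = (64, 128)
MAX_CORES = 3


@dataclass(frozen=True)
class DpuConfig:
    """Hypothetical container for the options named in this guide."""
    architecture: str
    cores: int
    data_width: int

    def __post_init__(self):
        if self.architecture not in ARCHITECTURES:
            raise ValueError(f"unknown architecture: {self.architecture}")
        if not 1 <= self.cores <= MAX_CORES:
            raise ValueError("core number must be between 1 and 3")
        if self.data_width not in DATA_WIDTHS:
            raise ValueError("data width must be 64 or 128 bits")

    def interrupt_width(self) -> int:
        # One interrupt bit per DPU core (the 1~3 width in the signal table).
        return self.cores

    def master_ports(self) -> list:
        # Assumption: the "x" in DPUx_M_AXI_* is the zero-based core index.
        ports = []
        for core in range(self.cores):
            ports += [f"DPU{core}_M_AXI_INSTR",
                      f"DPU{core}_M_AXI_DATA0",
                      f"DPU{core}_M_AXI_DATA1"]
        return ports


config = DpuConfig(architecture="B4096", cores=2, data_width=128)
print(config.interrupt_width(), config.master_ports())
```

Rejecting invalid combinations at construction time keeps the rest of a build script free of repeated range checks.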

Register Space

The DPU IP implements registers in the programmable logic. Table 2 shows the DPU IP registers. These registers are accessible from the host CPU through the S_AXI interface.
Reg_dpu_reset
The reg_dpu_reset register controls the resets of all DPU cores integrated in the DPU IP. The lower three bits of this register control the reset of up to three DPU cores respectively. All the reset signals are active-High. The details of reg_dpu_reset are shown in Table 2.
Table 2: Reg_dpu_reset
Register | Address Offset | Width | Type | Description
Reg_dpu_reset | 0x004 | 32 | R/W | [0] – reset of DPU core 0; [1] – reset of DPU core 1; [2] – reset of DPU core 2
Reg_dpu_isr
The reg_dpu_isr register represents the interrupt status of all DPU cores integrated in the DPU IP. The lower three bits of this register show the interrupt status of up to three DPU cores respectively. The details of reg_dpu_isr are shown in Table 3.
Table 3: Reg_dpu_isr
Register | Address Offset | Width | Type | Description
Reg_dpu_isr | 0x608 | 32 | R | [0] – interrupt status of DPU core 0; [1] – interrupt status of DPU core 1; [2] – interrupt status of DPU core 2
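The bit layout of the two registers can be illustrated with a short sketch that builds a reg_dpu_reset value and decodes a reg_dpu_isr value. The offsets 0x004 and 0x608 and the one-bit-per-core layout come from the register descriptions; the helper names reset_mask and pending_cores are hypothetical, and the sketch only computes values without touching real hardware or the DPU driver.

```python
REG_DPU_RESET = 0x004  # R/W, bits [2:0]: active-High reset of cores 0 to 2
REG_DPU_ISR = 0x608    # R,   bits [2:0]: interrupt status of cores 0 to 2
MAX_CORES = 3


def reset_mask(cores) -> int:
    """Build the reg_dpu_reset value that resets the given cores."""
    mask = 0
    for core in cores:
        if not 0 <= core < MAX_CORES:
            raise ValueError(f"core index out of range: {core}")
        mask |= 1 << core
    return mask


def pending_cores(isr_value: int) -> list:
    """Return the core indices whose interrupt status bit is set."""
    return [core for core in range(MAX_CORES) if isr_value & (1 << core)]


print(hex(reset_mask([0, 2])))  # 0x5
print(pending_cores(0b110))     # [1, 2]
```

On a real system, a driver would write the mask at offset REG_DPU_RESET and read the status at offset REG_DPU_ISR through the S_AXI interface.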