Sundance FC100 User Manual

Sundance Multiprocessor Technology Limited
User Manual
Form : QCF42 Date : 6 July 2006
User Manual for FC100
Unit / Module Description: IEEE-754 Floating-point FPGA IP Core
Unit / Module Number: IEEE-754 Floating-point FPGA IP Core
Document Issue Number: 1.0.0
Issue Date: 25/11/2008
Original Author: SM
Sundance Multiprocessor Technology Ltd, Chiltern House,
Waterside, Chesham, Bucks. HP5 1PS.
This document is the property of Sundance and may not be copied
nor communicated to a third party without prior written
permission.
© Sundance Multiprocessor Technology Limited 2006
User Manual FC100 Page 1 of 12 Last Edited: 25/11/2008 15:00:00
Revision History
Issue Changes Made Date Initials
1.0 First release 25/11/08 SM
Table of Contents
1 Introduction
2 Related Documents
2.1 Referenced Documents
3 Acronyms, Abbreviations and Definitions
3.1 Acronyms and Abbreviations
4 Functional Description
4.1 Mathematical equations
4.2 Algorithm
4.3 Pipelined FFT core
4.3.1 Data format
4.3.2 FFT block diagram
4.3.3 Description
4.3.4 Core modifications
4.3.5 Parameters and ports definition
5 Critical signal descriptions
6 Core assumptions
7 Resources usage and performance
8 Testbench and MATLAB program
9 Ordering Information

1 Introduction

The Fast Fourier Transform (FFT) is an efficient algorithm for computing the Discrete Fourier Transform (DFT). This Intellectual Property (IP) core is designed to offer very fast transform times while maintaining floating-point accuracy at all computational stages.
Sundance’s core is the fastest and the most efficient available in the FPGA world. It also saves memory resources compared to other floating-point cores available on the market.
Features:
This FFT IP core targets the following devices:
Xilinx FPGA devices
o Virtex-II, Virtex-II Pro, Spartan-3, Virtex-4 and Virtex-5
Radix-2 Fast Fourier Transform (FFT) with pipelined butterfly rank structure
IEEE-754 floating-point data
o Uses Xilinx Coregen math operators
o Customizable precision, speed, and size
o Any-width fixed-point builds also available
Run-time selectable length N = 32 to 2^m, m = 5 to 26
o 32, 64, 128, 256, 512, 1024, …, 64M points
Run-time selectable Forward/Inverse transform mode
Continuous processing at speeds up to Fmax (see Table 1)
o Data rate of 250 Msps in a Virtex-5 FPGA device
Natural-order inputs and outputs
Includes C/C++ bit-accurate model and data generator
o Model also usable from MATLAB
Includes Verilog or VHDL testbench and run scripts for simulation purposes and specific performance characterization
Applications:
The Pipelined Floating Point FFT IP Core is useful in high performance embedded computing (HPEC) applications which require continuous digital signal processing (DSP) at high sample rates. Floating point FFT hardware acceleration or co-processing is often a goal of scientific algorithms used in High Performance Computing (HPC). End applications and markets include radar, sonar, spectral analysis, telecommunications and image processing.

2 Related Documents

2.1 Referenced Documents

N/A

3 Acronyms, Abbreviations and Definitions

3.1 Acronyms and Abbreviations

A list of all acronyms

4 Functional Description

4.1 Mathematical equations

The Discrete Fourier Transform (DFT) of length N (N = 2^m) computes the sampled Fourier transform of a discrete-time sequence of N evenly spaced points.
The forward DFT with N points of a sequence x(n) can be written as follows:
\[ X(k) = \sum_{n=0}^{N-1} x(n) \, e^{-j 2\pi nk / N}, \qquad k = 0, \ldots, N-1 \]
The inverse DFT is given by the following equation:
\[ x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k) \, e^{j 2\pi nk / N}, \qquad n = 0, \ldots, N-1 \]
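The two definitions can be checked numerically with a direct O(N^2) evaluation. The sketch below is a plain software illustration, not the core's implementation; it assumes the usual sign convention (negative exponent for the forward transform, 1/N scaling on the inverse), and the function names `dft` and `idft` are chosen here for illustration only.

```python
import cmath

def dft(x):
    """Forward DFT: X(k) = sum over n of x(n) * exp(-j*2*pi*n*k/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse DFT: x(n) = (1/N) * sum over k of X(k) * exp(+j*2*pi*n*k/N)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * n * k / N) for k in range(N)) / N
            for n in range(N)]

if __name__ == "__main__":
    x = [1, 2, 3, 4, 0, 0, 0, 0]
    X = dft(x)
    x_back = idft(X)
    # X[0] is the sum of all input samples; idft(dft(x)) recovers x
    print("X[0] =", X[0])
    print("max round-trip error =", max(abs(a - b) for a, b in zip(x, x_back)))
```

Applying `idft` to the output of `dft` returns the original sequence to within rounding error, which is the property the Forward/Inverse mode pair relies on.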

4.2 Algorithm

The pipelined floating-point FFT IP core uses a modular radix-2 Fast Fourier Transform (FFT) architecture to compute discrete Fourier transforms (DFTs) on data frames or continuous data streams, at sample rates up to the maximum clock frequency.
This efficient structure employs a single butterfly and a single delay feedback path per rank, which keeps localized memory usage low. True IEEE-754 floating-point data is maintained throughout, supporting a large dynamic range without requiring complicated fixed-point analysis. The standard pipelined IP core is easily scalable to any Xilinx device and customisable to suit many FFT applications.
This FFT core is designed for transforms of 32 points or more, up to 64M points. External memory, such as QDR/QDR2 SRAM, ZBT RAM or DDR/DDR2/DDR3 SDRAM, is best suited to transforms larger than 16384 points. For shorter transforms, the memory banks can likely be implemented inside the FPGA, depending on which device is used.
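To give a feel for why long transforms push storage off-chip, the sketch below computes the size of a single frame of N = 2^m complex samples, assuming 64 bits per sample (two single-precision values, as on the data ports). It only sizes one frame; the core's actual internal memory requirement is not stated in this manual, and the helper name `frame_bytes` is hypothetical.

```python
BYTES_PER_SAMPLE = 8  # assumption: one 64-bit complex word (32-bit real + 32-bit imaginary)

def frame_bytes(m):
    """Bytes needed to hold one frame of 2**m complex single-precision samples."""
    return (1 << m) * BYTES_PER_SAMPLE

if __name__ == "__main__":
    for m in (5, 10, 14, 26):
        print("m = %2d: N = %8d points, %d bytes per frame" % (m, 1 << m, frame_bytes(m)))
```

Under this assumption a 16384-point frame occupies 128 KiB, while a 64M-point frame occupies 512 MiB.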

4.3 Pipelined FFT core

4.3.1 Data format

This core is compliant with the IEEE-754 standard.

4.3.2 FFT block diagram

4.3.3 Description

Frame: the frame blocks use control signalling to delimit discrete data frames per the selected transform length.
Bit-reverse: the bit-reverse block converts natural-order inputs to bit-reversed order as required by the FFT engine.
Pipelined ranks: the pipelined rank blocks daisy-chain the FFT processing from input to output. Each rank is optimized to contain the proper radix-2 butterfly math elements, twiddle factor ROMs, and local datapath memories for efficient continuous processing.
Variable length select: the variable length select block multiplexes the rank outputs for variable transform length support.
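The data flow through the bit-reverse block and the chain of radix-2 ranks can be mirrored in software. The sketch below is a textbook decimation-in-time FFT in double precision, offered as a rough behavioural model only: it is not the core's RTL, does not reproduce its streaming delay-feedback structure or single-precision rounding, and the function names are chosen here for illustration.

```python
import cmath

def bit_reverse_indices(m):
    """For each i in 0 .. 2**m - 1, return i with its m-bit pattern reversed."""
    return [int(format(i, "0{}b".format(m))[::-1], 2) for i in range(1 << m)]

def fft_radix2_dit(x, inverse=False):
    """Radix-2 DIT FFT: bit-reverse the input, then apply m butterfly ranks."""
    N = len(x)
    m = N.bit_length() - 1
    if N < 2 or (1 << m) != N:
        raise ValueError("length must be a power of two, at least 2")
    # Bit-reverse stage: natural-order input -> bit-reversed order
    data = [complex(x[i]) for i in bit_reverse_indices(m)]
    sign = 1.0 if inverse else -1.0
    # One loop iteration per rank; rank r combines blocks of 2**r points
    for rank in range(1, m + 1):
        span = 1 << rank
        half = span >> 1
        for start in range(0, N, span):
            for j in range(half):
                w = cmath.exp(sign * 2j * cmath.pi * j / span)  # twiddle factor
                a = data[start + j]
                b = data[start + j + half] * w
                data[start + j] = a + b          # butterfly upper output
                data[start + j + half] = a - b   # butterfly lower output
    if inverse:
        data = [v / N for v in data]
    return data

if __name__ == "__main__":
    print(bit_reverse_indices(3))
    print(fft_radix2_dit([1, 0, 0, 0, 0, 0, 0, 0]))
```

A transform of length 2^m uses exactly m ranks, which is why the hardware can support shorter lengths by tapping the output of an earlier rank.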

4.3.4 Core modifications

The standard IP Core is available in netlist or parameterized source code and supports the following:
Netlist builds for any Xilinx FPGA device
o FFT length and speed depend on chip resources and speed grade
Per-transform length selectable in powers of 2
o from 32 to 2^m points, where m = 5 to 26
Per-transform mode selectable between Forward and Inverse FFT
Static length and mode configuration
o Pipeline must be clear before changing these configuration settings
IEEE-754 single-precision floating-point math operators using Xilinx Coregen
o full DSP usage/maximum latency floating_point_v4_0 cores
Decimation-in-time (DIT) algorithm with internal bit-reversal
o providing natural-order data inputs and outputs
Potential customized deliveries from Dillon Engineering include:
Fixed single length of 2^m
o for a slight logic savings over run-time selectable length
Fixed Forward or Inverse mode
o for a slight logic savings over run-time selectable mode
Pipelined configuration settings
o allows dynamic mode and/or length switching on back-to-back transforms
Bit-reversal stage removed for a slight logic savings and elimination of a BlockRAM FIFO and associated latency
o Note: data must then be input in bit-reversed order to provide natural-order outputs
Decimation-in-frequency (DIF) build option, which inputs data in natural order and outputs data in bit-reversed order
Any Xilinx Floating Point operator adjustments to precision and latencies, with logic parameter settings to match. Xilinx Floating Point operators are built separately with Coregen, providing RTL source and .ngc netlists. Thus all trade-offs between speed, number of pipeline stages, DSP48/Mult macro usage, double- or custom-precision float, etc., can be supported.
Any-width fixed-point math operators in lieu of floating point. Options for various scaling, rounding and saturation modes, all matched bit-accurate with the C/C++ model.

4.3.5 Parameters and ports definition

The core's I/O signals are not fixed to specific device pins, which provides flexibility for interfacing with user logic. All I/O signals are described below:
Signal      Direction  Description
CLK         Input      Clock input. Single source used for all I/O and internal clocking.
RST_N       Input      Active-low asynchronous reset. Resets all control logic.
DIR         Input      Transform mode select. 0 = Forward FFT, 1 = Inverse FFT.
SEL[3:0]    Input      Transform length select. Valid range is from 4'd5 (indicating a transform length of 32) up to the maximum length supported by the build (e.g. 4'd10 for a transform length of 1024). The number of SEL bits depends on the maximum length.
SYNC_IN     Input      Input sync strobe. Tells the core to begin processing the input data (A) on the following clock cycle.
A[63:0]     Input      Input data. Complex data of the form R + iQ, where R is contained in bits 63:32 and Q is contained in bits 31:0, each a single-precision floating-point number.
SYNC_OUT    Output     Output sync strobe. Indicates that the core is sending processed output data (X) beginning on the following clock cycle.
X[63:0]     Output     Output data. Complex data of the form R + iQ, where R is contained in bits 63:32 and Q is contained in bits 31:0, each a single-precision floating-point number.
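When preparing test vectors in software, the 64-bit data word and the SEL value can be derived as sketched below. This is an illustrative helper set, not part of the delivered model: the function names are hypothetical, and the only facts taken from the port descriptions are the bit positions (R in 63:32, Q in 31:0), the single-precision encoding, and SEL being the base-2 logarithm of the transform length.

```python
import struct

def pack_sample(re, im):
    """Pack a complex sample into a 64-bit word: R in bits 63:32, Q in bits 31:0."""
    re_bits, = struct.unpack(">I", struct.pack(">f", re))  # IEEE-754 single bits of R
    im_bits, = struct.unpack(">I", struct.pack(">f", im))  # IEEE-754 single bits of Q
    return (re_bits << 32) | im_bits

def unpack_sample(word):
    """Split a 64-bit word back into its (R, Q) single-precision values."""
    re, = struct.unpack(">f", struct.pack(">I", (word >> 32) & 0xFFFFFFFF))
    im, = struct.unpack(">f", struct.pack(">I", word & 0xFFFFFFFF))
    return re, im

def sel_for_length(n):
    """Return the SEL value m for a transform length n = 2**m (n >= 32)."""
    m = n.bit_length() - 1
    if n < 32 or (1 << m) != n:
        raise ValueError("length must be a power of two, 32 or larger")
    return m

if __name__ == "__main__":
    word = pack_sample(1.0, -2.0)
    print(hex(word))            # 0x3f800000c0000000
    print(unpack_sample(word))  # (1.0, -2.0)
    print(sel_for_length(1024)) # 10
```

Values that are not exactly representable in single precision are rounded when packed, so a round trip through `pack_sample` and `unpack_sample` returns the nearest single-precision value rather than the original double.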

5 Critical signal descriptions

All interface and internal operation of the core is synchronous to CLK. Simple SYNC strobes are used on the input and output interfaces to signal that data is valid on the following clock cycle. An active SYNC coinciding with the last data point thus indicates back-to-back transforms. A SYNC_IN strobe that is active while the core is already inputting data is ignored. Tying SYNC_IN active signals the core to perform continuous transforms, and SYNC_OUT strobes as normal to frame the output data.
Figure 1: Interface input timing, 1K-length back-to-back transforms
Figure 2: Interface output timing, 1K-length back-to-back transforms
The DIR and SEL configuration inputs are by default selectable per-transform, but must be stable starting with SYNC_IN active and must not be changed until the transformed data has been completely output from the core (i.e. 2^m clocks after the corresponding SYNC_OUT).
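The input framing rule (SYNC_IN one clock before the first sample, and coinciding with the last sample of a frame when another frame follows immediately) can be written out as a cycle-by-cycle stimulus list. The sketch below is a hypothetical testbench helper under that reading of the text; the tuple layout and the name `input_schedule` are assumptions made for illustration.

```python
def input_schedule(n_points, n_frames):
    """Return a list of (cycle, sync_in, sample_index) for back-to-back frames.

    sample_index is None on cycles where no input data is presented.
    """
    # Cycle 0: SYNC_IN is asserted; the first sample follows on the next cycle
    schedule = [(0, 1, None)]
    cycle = 1
    for frame in range(n_frames):
        for n in range(n_points):
            last_point = (n == n_points - 1)
            more_frames = (frame < n_frames - 1)
            # SYNC_IN coincides with the last sample when another frame follows
            sync = 1 if (last_point and more_frames) else 0
            schedule.append((cycle, sync, n))
            cycle += 1
    return schedule

if __name__ == "__main__":
    for entry in input_schedule(4, 2):
        print(entry)
```

A 4-point frame is used in the usage example only to keep the printout short; the core itself supports lengths of 32 points and above.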

6 Core assumptions

Following SYNC_IN, the initial transform has a start-up latency that depends on the bit-reversal stage, the floating-point core latencies and the length of the transform. The core provides continuous processing at steady state, though the SYNC_IN to SYNC_OUT latency may vary slightly due to internal pipeline alignment.
The standard core with a transform length of 1024 has a start-up latency of around 2300 clock cycles, or 9.2 µs at a 250 MHz clock rate.
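Latencies for other lengths 2^m approximately follow the formula:
\[ \text{Latency (in clock cycles)} \approx (2 \times 2^m) + \{ m \times [ (2 \times \text{fp\_add\_delay}) + \text{fp\_mult\_delay} ] \} \]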
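As a worked example, the sketch below evaluates the approximate start-up latency, assuming the formula reads latency ≈ (2 × 2^m) + m × [(2 × fp_add_delay) + fp_mult_delay]. The operator delays used in the example (11 and 6 clock cycles) are placeholder values chosen for illustration, not figures from this manual; the real values depend on how the Coregen floating-point operators are configured.

```python
def startup_latency_cycles(m, fp_add_delay, fp_mult_delay):
    """Approximate start-up latency in clock cycles for a 2**m-point transform."""
    return 2 * (1 << m) + m * (2 * fp_add_delay + fp_mult_delay)

def cycles_to_us(cycles, clock_mhz):
    """Convert a cycle count to microseconds at the given clock frequency in MHz."""
    return cycles / clock_mhz

if __name__ == "__main__":
    # Hypothetical operator latencies, for illustration only
    cycles = startup_latency_cycles(m=10, fp_add_delay=11, fp_mult_delay=6)
    print(cycles, "cycles,", cycles_to_us(cycles, 250), "us at 250 MHz")
```

With these assumed delays a 1024-point transform evaluates to 2328 cycles, about 9.3 µs at 250 MHz, which is in line with the figure of around 2300 cycles quoted for the standard core.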

7 Resources usage and performance

Table 1: Resource usage and performance

FPGA device                    FFT length  Fmax (MHz)  Slice FF (1)  Slice LUT (1)  BRAM  MULT/DSP48E
Spartan®-3A XC3SD3400A-5       256         150         23,867        30,143         16    96
SMT348-SX55 XC4VSX55-12        1,024       200         21,585        26,079         26    352
SMT351T-SX50 XC5VSX50T-3       1,024       250         20,562        20,333         19    176
SMT700-SX95T XC5VSX95T-2       16,384      200         27,799        29,163         109   256
SMT702-LX110T XC5VLX110T-3     1,024       250         27,403        33,581         19    32
SMT702-LX110T XC5VLX110T-3     8,192       250         36,592        44,954         61    44

Notes:
1) Actual slice count dependent on percentage of unrelated logic; see the Mapping Report File for details.
2) Assuming all core I/Os and clocks are routed off-chip.
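The Fmax figures translate directly into steady-state throughput if one complex sample is accepted per clock cycle, as the continuous-processing description implies. The sketch below works through that arithmetic; the helper names are hypothetical, and start-up latency is deliberately ignored.

```python
def frame_time_us(fmax_mhz, length):
    """Time to stream one frame of `length` samples at one sample per clock, in microseconds."""
    return length / fmax_mhz

def transforms_per_second(fmax_mhz, length):
    """Steady-state back-to-back transform rate at one sample per clock."""
    return fmax_mhz * 1e6 / length

if __name__ == "__main__":
    # (FFT length, Fmax in MHz) pairs taken from the table rows
    for length, fmax in [(256, 150), (1024, 200), (1024, 250), (16384, 200), (8192, 250)]:
        print("N = %5d at %3d MHz: %.3f us per frame, %.1f transforms/s"
              % (length, fmax, frame_time_us(fmax, length), transforms_per_second(fmax, length)))
```

For instance, a 1024-point transform at 250 MHz streams one frame every 4.096 µs once the pipeline is full.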

8 Testbench and MATLAB program

The core is verified to be bit-accurate with the C/C++ data model across all supported lengths, modes, throughputs and data formats, using a rigorous simulation suite of directed and random data. The model itself is evaluated in terms of signal-to-quantization-noise ratio (SQNR) against a double-precision floating-point software FFT implementation.
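An SQNR comparison of this kind can be sketched as the ratio of reference signal energy to error energy, expressed in dB. The code below is an illustration of the metric only, not the delivered C/C++ model or MATLAB program: the function names are hypothetical, and the "test" data is simply the double-precision reference rounded to single precision.

```python
import math
import struct

def to_float32(v):
    """Round a Python float (double) to the nearest IEEE-754 single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]

def sqnr_db(reference, test):
    """SQNR in dB: 10*log10(sum |ref|^2 / sum |ref - test|^2)."""
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - t) ** 2 for r, t in zip(reference, test))
    if noise == 0:
        return math.inf  # identical data: no quantization noise
    return 10 * math.log10(signal / noise)

if __name__ == "__main__":
    # Double-precision reference samples and their single-precision roundings
    reference = [complex(math.cos(0.1 * n), math.sin(0.1 * n)) for n in range(64)]
    test = [complex(to_float32(z.real), to_float32(z.imag)) for z in reference]
    print("SQNR = %.1f dB" % sqnr_db(reference, test))
```

In a real evaluation, `reference` would be the output of a double-precision software FFT and `test` the output of the bit-accurate model or the simulated core for the same input frame.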

9 Ordering Information

This product is available directly from Sundance Multiprocessor Technology and from its preferred suppliers. Please contact us for pricing and additional information about this product using the contact information on the front page of this datasheet.
Sundance Multiprocessor Technology also offers other FFT IP cores:
UltraLong FFTs (up to 64M points, fixed or floating point)
Parallel Butterfly FFTs (continuous FFTs at multiple points per clock cycle)
Full Parallel FFTs (extremely fast rates, up to 25 Gsps)
2D FFTs (two-dimensional transform for image processing)
Mixed Radix FFTs (for non-power-of-2 FFT lengths)