Technical notes on using Analog Devices DSPs, processors and development tools
Contact our technical support at dsp.support@analog.com and at dsptools.support@analog.com
Or visit our on-line resources http://www.analog.com/ee-notes and http://www.analog.com/processors
Writing Efficient Floating-Point FFTs for ADSP-TS201 TigerSHARC® Processors
Contributed by Boris Lerner Rev 2 – March 4, 2004
Introduction
So, you want to write efficient code for the
ADSP-TS201 TigerSHARC® processor? Or,
maybe, you have come across the optimized
example floating-point FFT for this processor
and would like to understand how it works and
what the author had in mind when writing it.
This application note tries to answer both
questions by going through that FFT example
and all its levels of optimization in detail. This
example can be followed in developing other
algorithms and code optimized for the ADSP-TS201S processor.
Generally, most algorithms have several levels of
optimization, all of which are discussed in detail
in this note. The first and most straightforward
level of optimization is paralleling of
instructions, as the processor architecture will
allow. This is simple and boring. The second
level of optimization is loop unrolling and
software pipelining to achieve maximum
parallelism and to avoid pipeline stalls. Although
more complex than the simple parallelism of
level one, this can be done in prescribed steps
without good understanding of the algorithm and,
thus, requires little ingenuity. The third level is
to restructure the math of the algorithm to still
produce valid results, but so that the new
restructured algorithm fits the processor’s
architecture better. Being able to do this requires
a thorough understanding of the algorithm and,
unlike software pipelining, there are no
prescribed steps that lead to the optimal solution.
This is where most of the fun in writing
optimized code lies.

Copyright 2004, Analog Devices, Inc. All rights reserved. Analog Devices assumes no responsibility for customer product design or the use or application of
customers’ products or for any infringements of patents or rights of others which may result from Analog Devices assistance. All trademarks and logos are property
of their respective holders. Information furnished by Analog Devices Applications and Development Tools Engineers is believed to be accurate and reliable, however
no responsibility is assumed by Analog Devices regarding technical accuracy and topicality of the content provided in Analog Devices’ Engineer-to-Engineer Notes.
In practical applications it is often unnecessary to
go through all of these levels. When all of the
levels are required, it is always best to do these
levels of optimization in reverse order. By the
time the code is fully pipelined, it is too late to
try to change the fundamental underlying
algorithm. Thus, a programmer would have to
think about the algorithm structure first and
organize the code accordingly. Then, levels two
and one (paralleling, unrolling, and pipelining)
are usually done at the same time.
The code that this note refers to is supplied by
Analog Devices in a form that allows it to be
called as either a real or a complex FFT; the last
calling parameter of the function defines whether
the real or the complex version is performed. The real N-point FFT
is obtained from the complex N/2-point FFT with
an additional special stage at the end. This note is
concerned with code optimization more than the
technicalities of the special stage, so it discusses
the algorithm for the complex FFT portion of the
code only. The last special stage of the real FFT
is discussed in detail in the comments of the
code.
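The real-from-complex trick mentioned above is the standard "split" recombination. As a sanity check of the idea (not of the note's actual assembly code), the sketch below computes an N-point real FFT from an N/2-point complex FFT in Python; the naive DFT standing in for the half-size complex FFT, and all function names, are assumptions for brevity:

```python
import cmath

def naive_dft(x):
    """Reference N-point complex DFT (O(N^2), for checking only)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

def real_fft_via_half_complex(x):
    """N-point real FFT from an N/2-point complex FFT plus a recombine
    stage (the standard split; the note's last special stage computes
    the same quantity)."""
    n = len(x)
    h = n // 2
    z = [complex(x[2 * m], x[2 * m + 1]) for m in range(h)]  # pack even/odd pairs
    g = naive_dft(z)                                         # N/2-point complex FFT
    out = []
    for k in range(h):
        gk = g[k]
        gc = g[(h - k) % h].conjugate()
        even = 0.5 * (gk + gc)       # FFT of the even-indexed samples
        odd = -0.5j * (gk - gc)      # FFT of the odd-indexed samples
        out.append(even + cmath.exp(-2j * cmath.pi * k / n) * odd)
    return out  # F(0) .. F(N/2-1)
```

The remaining bins F(N/2) .. F(N-1) follow from the conjugate symmetry of a real signal's spectrum.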
Standard Radix-2 FFT Algorithm
Figure 1 shows a standard 16-point radix-2 FFT
implementation, after the input has been
bit-reversed. Traditionally, in this algorithm, stages
1 and 2 are combined together with the required
bit reversal into a single optimized loop (since
these two stages require no multiplies, only adds
and subtracts). Each of the remaining stages is
usually done by combining the butterflies that
share the same twiddle factors together into
groups (so the twiddles have to be fetched only
once for each group). Un-optimized assembly
source code for a TigerSHARC processor
implementing this algorithm is shown in Listing
1. This, with a few tricks that are irrelevant to
this discussion, is the way that the 32-bit
floating-point FFT code was written when it was
targeted to an ADSP-TS101 processor. The
benchmarks (in core clock cycles) for this
algorithm, including bit reversal, running on an
ADSP-TS101 and ADSP-TS201, are shown in
Table 1. Note that since the ADSP-TS101 has
less memory per memory block than an ADSP-TS201, larger point-size benchmarks do not
apply to the ADSP-TS101. Clearly, as long as
the data fits into the ADSP-TS201 cache, it is
efficient. Once the data becomes too large for the
cache, this FFT implementation becomes
extremely inefficient – the cycle count increases
from optimal by a factor of five.
Figure 1. Standard Structure of the 16-Point FFT
Writing Efficient Floating-Point FFTs for ADSP-TS201 TigerSHARC® Processors (EE-218) Page 2 of 16
N       ADSP-TS101   ADSP-TS201          ADSP-TS201
                     (input not in cache) (input in cache)
256     2172         2641                2218
512     4582         5533                4649
1024    9872         12170               9992
2048    21338        26610               22173
4096    46244        197272              NA
8192    99886        444628              NA
16384   215224       987730              NA
32768   NA           2133220             NA
65536   NA           4720010             NA
Table 1. Core Clock Cycles for N-point Complex FFT
Optimizing the Structure of the FFT
for ADSP-TS201 Processors
To be able to re-structure the algorithm to
perform optimally on ADSP-TS201, we have to
understand why the performance of large FFTs
using the conventional FFT structure is so poor.
ADSP-TS201 memory is optimized for
sequential reads. Cache is designed to help with
algorithms where the reads are not sequential. In
the conventional FFT algorithm, the butterfly
stride doubles at each stage, so the reads are
non-sequential and, with each new stage, the cache is
less and less likely to get a hit – the reads are all
over the place. The solution lies in re-arranging a
stage’s output to ensure that the next stage’s
reads are sequential. The structure of the
algorithm implementation is shown in Figure 2.
Figure 2. Reorganized Structure of the 16-Point FFT
It is simple enough to trace this diagram by hand
to see that it is simply a re-ordering of the
diagram in Figure 1. Amazingly enough, the final
output is in correct order. This can be easily
proven for general N = 2^k, where N is the number
of points in the FFT. Note that the re-ordering is
given by the following formula:

    f(n) = n/2,            if n is even
    f(n) = (n-1)/2 + N/2,  if n is odd

Thus, if n is even, it is shifted right and if n is
odd, it is shifted right and its most significant bit
is set. This is, of course, equivalent to the
operation of right 1-bit rotation, which after
K = log2(N) steps returns the original n back.
Thus, the output after K stages is in correct order
again.
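A few lines of Python can confirm the rotation property for a 16-point example (a sketch; f, N, and K name the quantities in the formula above):

```python
# The re-ordering f is a right rotation by one bit of the log2(N)-bit
# index, so applying it log2(N) times returns every index home.
N = 16
K = N.bit_length() - 1   # log2(N) stages

def f(n):
    return n // 2 if n % 2 == 0 else (n - 1) // 2 + N // 2

for n in range(N):
    m = n
    for _ in range(K):
        m = f(m)
    assert m == n        # back in natural order after K stages
```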
Great! We have our new structure. It has
sequential reads and we are lucky enough that
the output is in the correct order. This should be
much more efficient. Right? Let’s write the code
for it! Well, before we spend a lot of time writing
the code, we should ensure that all of the DSP
operations that we are about to do actually fit
into our processor architecture efficiently. There
may be no reason to optimize data movements if
the underlying math suffers.
The first obvious point to notice is that this
structure cannot be done in-place due to its
re-ordering. The stages will have to ping-pong their
input/output buffers. This should not be a
problem. The ADSP-TS201 processor has a lot
of memory on board, but should memory
optimization be required (and input does not
have to be preserved), we can use the input as
one of the two ping-pong buffers.
Next, we note that a traditional FFT combines
butterflies that share twiddles into the same
group to save twiddle fetch cycles. Amazingly,
the twiddles of the structure in Figure 2 line up
linearly – one group at a time. We are lucky
again!
Now, what would a butterfly of this new
structure consist of? Table 2 lists the operations
necessary to perform a single complex butterfly.
Since the ADSP-TS201 is a SIMD processor
(i.e., it can double all the computes), we will
write the steps outlined in Listing 1 in SIMD
fashion, so that two adjacent butterflies are
computed in parallel, one in the X-Compute
block and the other one in the Y-Compute block.
Let us analyze the DSP operations in more detail.
F1, F2, F3, and F4 fetch a total of four 32-bit
words, which on ADSP-TS201 can be done in a
single quad fetch into X-Compute block
registers. To be able to supply the SIMD machine
with data, we would also have to perform a
second butterfly quad fetch into the Y-Compute
block registers. Then, M1, M2, M3, M4, A1, and
A2 will perform SIMD operations for both
butterflies.
The ADSP-TS201 supports a single add/subtract
instruction, so A3 and A4 can be combined into a
single operation (which is, of course, performed
SIMD on both butterflies at once) and similarly
A5 and A6 can be combined, as well.
Mnemonic  Operation
F1        Fetch Real(Input1) of the Butterfly
F2        Fetch Imag(Input1) of the Butterfly
F3        Fetch Real(Input2) of the Butterfly
F4        Fetch Imag(Input2) of the Butterfly
M1        Real(Input2) * Real(twiddle)
M2        Imag(Input2) * Imag(twiddle)
M3        Real(Input2) * Imag(twiddle)
M4        Imag(Input2) * Real(twiddle)
A1        M1 - M2 = Real(Input2*twiddle)
A2        M3 + M4 = Imag(Input2*twiddle)
A3        Real(Input1) + A1 = Real(Output1)
A4        Real(Input1) - A1 = Real(Output2)
A5        Imag(Input1) + A2 = Imag(Output1)
A6        Imag(Input1) - A2 = Imag(Output2)
S1        Store Real(Output1)
S2        Store Imag(Output1)
S3        Store Real(Output2)
S4        Store Imag(Output2)
Table 2. Single Butterfly Done Linearly – Logical Implementation
Now we run into a problem: S1, S2, S3 and S4
cannot be performed in the same cycle, since S3
and S4 are destined to another place in memory
due to our output re-ordering. Instead, we can
store S1 and S2 for both butterflies in one cycle
(lucky again – these are adjacent!) and S3 and S4
for both butterflies in the next cycle. So far, so
good – the new set of operations is summarized
in Table 3.
Mnemonic Operation
F1 Fetch Input1,2 of the Butterfly1
F2 Fetch Input1,2 of the Butterfly2
M1 Real(Input2) * Real(twiddle)
M2 Imag(Input2) * Imag(twiddle)
M3 Real(Input2) * Imag(twiddle)
M4 Imag(Input2) * Real(twiddle)
A1 M1–M2 = Real(Input2*twiddle)
A2 M3+M4 = Imag(Input2*twiddle)
A3 Real(Input1) +/- A1 = Real(Output1,2)
A4 Imag(Input1) +/- A2 = Imag(Output1,2)
S1 Store(Output1, both Butterflies)
S2 Store(Output2, both Butterflies)
Table 3. Single Butterfly Done Linearly – Actual
ADSP-TS20x Implementation
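The operation breakdown in Tables 2 and 3 can be checked with plain scalar arithmetic. The sketch below (hypothetical helper names, Python floats standing in for compute block registers) performs one butterfly exactly as the M1–M4/A1–A2 steps describe, with the final add/subtract pairs mirroring the combined add/subtract instructions:

```python
def butterfly(in1, in2, tw):
    """One radix-2 butterfly via the Table 3 operation breakdown."""
    m1 = in2.real * tw.real          # M1
    m2 = in2.imag * tw.imag          # M2
    m3 = in2.real * tw.imag          # M3
    m4 = in2.imag * tw.real          # M4
    a1 = m1 - m2                     # A1 = Real(in2 * tw)
    a2 = m3 + m4                     # A2 = Imag(in2 * tw)
    # A3/A4: each add/subtract pair produces both outputs at once
    out1 = complex(in1.real + a1, in1.imag + a2)
    out2 = complex(in1.real - a1, in1.imag - a2)
    return out1, out2
```

The result matches the direct complex form in1 ± in2*tw.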
Each operation in Table 3 is a single-cycle
operation on the ADSP-TS201 processor. There are a
total of 2 fetches, 4 multiplies, 4 ALU, and 2
store instructions. Since the ADSP-TS201 allows
fetches/stores to be paralleled with multiplies and
ALUs in a single cycle, loop unrolling,
pipelining, and paralleling should yield a 4-cycle
execution of these two SIMD butterflies (and we
are still efficient in the memory usage!). At this
point, we can now be reasonably certain that the
above will yield efficient code and we can start
developing it. However, careful observation at
this point can help us optimize this structure even
further. Note that we are only using a total of 4
fetches and stores from a single memory block,
say, by using JALU pointer registers. In parallel
we can do 3 more fetches/stores/KALU
operations without losing any cycles (actually,
we can do 4 of them, but we need one
reserved slot in one of the instruction lines for
the loop jump back).
Thus, the old rule of fetching twiddles only once
per group of butterflies that shares them is no
longer necessary – the twiddle fetches come free!
And, since the structure of the arrows of Figure 2
is identical at every stage, we may be able to
reduce the FFT from the usual three nested loops
to only two, provided that we can find a way to
correctly fetch the twiddles at each stage
(twiddles are the only thing that distinguishes the
stages of Figure 2). Figure 2 shows how the
twiddles must be fetched at each stage: 1st stage –
all are W^0. 2nd stage – half are W^0, the next half
are W^(N/4). 3rd stage – one quarter are W^0, the next
quarter are W^(N/8), the next quarter are W^(2N/8), and
the last quarter are W^(3N/8). And so on… If we
keep a virtual twiddle pointer offset, increment it
to the next sequential twiddle every butterfly, but
AND it with a mask before actually using it in
the twiddle fetch, we achieve precisely this order
of twiddle fetch. Moreover, this rule is the same
for every stage, except that the mask at every
stage must be shifted down by one bit (i.e., each
stage requires twice as fine a resolution of the
twiddles as the previous stage). Here, our unused
KALU operations come in very handy. To
implement this twiddle fetch, we need to
increment the virtual offset, mask it and do a
twiddle fetch every butterfly… Oh, no! We are in
SIMD (i.e. we are doing two butterflies together)
and we do not have the 6 available instruction
slots for this! But luck saves us again. We can
easily notice that all stages except the last share
the twiddles between the SIMD pair of
butterflies – so, for these stages, we need only to
do the twiddle fetch once per SIMD pair of the
butterflies! And the three cycles are precisely
what we need to do this. Unfortunately, in the
last stage, every butterfly has its own unique
twiddle; but in the last stage, we do not have to
mask – just step the pointer to the next twiddle
every time! It will have to be written separately,
but it will optimize completely as well. Table 4
summarizes the latest structure’s steps. Three
new KALU operations (K1, K2 and K3) have
been added to Table 3. Time to write the code?
Well, no – let us figure out how to pipeline it
first.
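Before pipelining, the whole restructured algorithm – sequential pair reads, scattered writes per the re-ordering formula, and the masked virtual twiddle offset – can be modeled in a few lines of Python. This is a behavioral sketch (all names are mine, not the shipped code's), meant only to show that the structure of Figure 2 together with the mask rule really computes the FFT:

```python
import cmath

def bit_reverse(x):
    """Return x permuted into bit-reversed index order."""
    n = len(x)
    bits = n.bit_length() - 1
    return [x[int(format(i, "0%db" % bits)[::-1], 2)] for i in range(n)]

def fft_restructured(x):
    n = len(x)
    half = n // 2
    stages = n.bit_length() - 1
    w = [cmath.exp(-2j * cmath.pi * j / n) for j in range(half)]  # twiddle table
    x = bit_reverse(x)
    mask = 0                        # 1st stage: every butterfly uses W^0
    for _ in range(stages):
        y = [0j] * n
        virtual = 0                 # virtual twiddle pointer offset
        for m in range(half):
            t = w[virtual & mask]   # AND with the stage's mask before use
            a, b = x[2 * m], x[2 * m + 1] * t   # sequential pair read
            y[m] = a + b                        # scattered writes per the
            y[m + half] = a - b                 # re-ordering formula
            virtual += 1            # step to the next sequential twiddle
        x = y
        mask = (mask >> 1) | (half >> 1)  # shift the mask down by one bit
    return x
```

Note that the mask update is exactly the "shift the mask down by one bit" rule: for N = 16 it runs 0b000, 0b100, 0b110, 0b111 over the four stages, and in the last stage every butterfly gets its own twiddle.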
Mnemonic  Operation
F1        Fetch Input1,2 of the Butterfly1
F2        Fetch Input1,2 of the Butterfly2
K1        Mask the virtual twiddle offset
K2        Fetch the twiddle factors
K3        Increment the virtual twiddle offset
M1        Real(Input2) * Real(twiddle)
M2        Imag(Input2) * Imag(twiddle)
M3        Real(Input2) * Imag(twiddle)
M4        Imag(Input2) * Real(twiddle)
A1        M1 - M2 = Real(Input2*twiddle)
A2        M3 + M4 = Imag(Input2*twiddle)
A3        Real(Input1) +/- A1 = Real(Output1,2)
A4        Imag(Input1) +/- A2 = Imag(Output1,2)
S1        Store(Output1, both Butterflies)
S2        Store(Output2, both Butterflies)
Table 4. Single Butterfly Done Linearly – Modified ADSP-TS20x Implementation

Pipelining of the Algorithm

Figure 3 shows the algorithm’s operations from
Table 4 with arrows showing the dependencies.
The arrows of the dependencies indicate that the
result of the operation at the start of the arrow is
used by the operation at the end of that arrow
and, thus, must be completed first to ensure
correct data. Some arrows have a stall associated
with them. This means that if the operation at the
start of the arrow is immediately followed by the
operation at the end of that arrow, the result will
be correct, but code execution will produce a
stall. Thus, to fully optimize the code, operations
at the ends of arrows with stalls must be kept
more than one instruction line apart.

Figure 3. Reorganized Structure’s Dependencies
A quick observation of the dependencies in
Figure 3 is sufficient to analyze the level of
pipelining and the number of compute block
registers needed to do it.
Full pipelining, as mentioned earlier, would give
a 4-cycle SIMD pair of butterflies. Thus, an
operation whose result is consumed several cycles
later must keep its output alive across
[cycles/4]+1 iterations of the 4-cycle loop,
multiplying the number of registers it needs
accordingly. Here, [x] denotes the integer part of
the number x. We can therefore determine the
number of compute block registers needed, as
shown in Table 5. Note that A3 and A4 require
twice as many output registers as M1, M2, M3,
M4, A1 and A2, since A3 and A4 are
add/subtracts.

State  Dependency To States   Max Dep Cycles  Compute Block Registers Needed
K1     K2                     1               0
K2     M1,M2,M3,M4            5               4*([5/4]+1)=8
K3     K1                     1               0
F1     M1,M2,M3,M4,A1,A2      10              4*([10/4]+1)=16
F2     M1,M2,M3,M4,A1,A2      10              4*([10/4]+1)=16
M1     A1                     2               2*([2/4]+1)=2
M2     A1                     2               2*([2/4]+1)=2
M3     A2                     2               2*([2/4]+1)=2
M4     A2                     2               2*([2/4]+1)=2
A1     A3,A4                  2               2*([2/4]+1)=2
A2     A3,A4                  2               2*([2/4]+1)=2
A3     S1,S2                  1               4*([1/4]+1)=4
A4     S1,S2                  1               4*([1/4]+1)=4
S1     none                   0               0
S2     none                   0               0
Total Regs: 60
Table 5. Number of Compute Block Registers Required to Pipeline the Butterflies

The resulting requirement to fully pipeline this
code is 60 compute block registers, out of the 64
total – just barely made it!

We pipeline this fully symbolically, using the
mnemonics of Table 4 and Figure 3. The
pipelining is shown in Table 6, in which “+” in
the operation indicates the operation that
corresponds to the next set of the butterflies and
“-” corresponds to the operation in the previous
set of the butterflies.

Cycle  JALU    KALU    MAC     ALU
1      F1      K1      M4--    A3---
2      F2      K2      M2-     A4---
3      S1---           M3-     A2--
4      S2---   K3      M1      A1-
5      F1+     K1+     M4-     A3--
6      F2+     K2+     M2      A4--
7      S1--            M3      A2-
8      S2--    K3+     M1+     A1
9      F1++    K1++    M4      A3-
10     F2++    K2++    M2+     A4-
11     S1-             M3+     A2
12     S2-     K3++    M1++    A1+
13     F1+++   K1+++   M4+     A3
14     F2+++   K2+++   M2++    A4
15     S1              M3++    A2+
16     S2      K3+++   M1+++   A1++
Table 6. Pipelined Butterflies
All instructions are paralleled, there are no stalls,
and there is a place to put the jump to the top of
the loop (actually, four places, because the
pipeline is 4 pairs of butterflies deep – each
iteration of the loop in Table 6 does 4 pairs of
butterflies).
The Code
Now, writing the code is trivial. The ADSP-TS201 is so flexible that it takes all the challenge
right out of it. Just follow the pipeline of Table 6
and the code is done. The resulting code for the
stages other than last is shown in Listing 2.
Outside of this inner loop is a stage loop that
ping-pongs input/output buffers and shifts the
twiddle modifier mask. Pretty simple!
Additional optimization is done by breaking the
first two stages away from the main body of the
code and doing them separately – they do not
really require a complex multiply and can be done
faster. Bit reversal is incorporated into these
first two stages as well. Now, for the bottom line
– how much did the cycle count improve? In
Table 7 we repeat Table 1 with additional
columns for the benchmarks for the new
algorithm. The cycle count for the larger-than-cache FFTs improved by a factor greater than 3!
Moreover, the cycle count for FFTs that fit into
the cache is better than it was on the original
ADSP-TS101 processor, which had no cache or
memory latency issues of any kind. The reason
for this is that the new architecture allows the
code to be written in two nested loops instead of
three and, thus, has significantly less overhead.
This code, ported to the ADSP-TS101, improves
its benchmarks, too – as shown in Table 7.
       ADSP-TS101  ADSP-TS101  ADSP-TS201      ADSP-TS201      ADSP-TS201   ADSP-TS201
       Old         New         Old structure,  New structure,  Old structure, New structure,
N      structure   structure   input not in    input not in    input in     input in
                               cache           cache           cache        cache
256    2172        1958        2641            2402            2218         1963
512    4582        4276        5533            5192            4649         4283
1024   9872        9410        12170           11662           9992         9419
2048   21338       20688       26610           25316           22173        20699
4096   46244       45278       197272          69924           NA           NA
8192   99886       98540       444628          147628          NA           NA
16384  215224      213243      987730          313292          NA           NA
32768  NA          NA          2133220         662614          NA           NA
65536  NA          NA          4720010         1397544         NA           NA
Table 7. Core Clock Cycles for N-point Complex FFT, New versus Old Structure

Usage Rules

The C-callable complex FFT routine is called as

FFT32( &(input), &(ping_pong_buffer1),
&(ping_pong_buffer2), &(output), N, F);

where

input -> FFT input buffer,
output -> FFT output buffer,
ping_pong_bufferx are the ping pong buffers,
N = Number of complex points,
F = 0 if FFT is real and 1 if FFT is complex.

As mentioned earlier, due to data re-ordering,
stages cannot be done in-place and have to
ping-pong. Thus, ping_pong_buffer1 and
ping_pong_buffer2 have to be two distinct
buffers. However, depending on the routine
user’s requirements, some memory optimization is
possible. Ping_pong_buffer1 can be made the
same as input if input does not need to be
preserved. Also, if Log2(N) is even, output can be
made the same as ping_pong_buffer2 and if
Log2(N) is odd, output can be made the same as
ping_pong_buffer1. Below are two examples of
the routine usage with minimal use of memory:

FFT32( &(input), &(input),
&(output), &(output), 1024, 1);

FFT32( &(input), &(input),
&(ping_pong_buffer2), &(input), 2048, 1);

To eliminate memory block access conflicts,
input must reside in a different memory block
than ping_pong_buffer2, and twiddle factors must
reside in a different memory block than the
ping-pong buffers. Of course, all code must reside in a
block that is different from all the data buffers, as
well. Ping-pong buffers can share a memory
block, however – there is no instruction that
accesses both ping-pong buffers in the same
cycle.
Appendix
Complete Source Code of the Optimized FFT
/* fft32.asm
Prelim rev. October 19, 2003 - BL
Rev. 1.0 - added real inputs case - PM
This is the assembly routine for the complex radix-2 C-callable FFT on the TigerSHARC
family of DSPs.
I. Description of Calling.
1. Inputs:
j4 -> input (ping-pong buffer 1)
j5 -> ping-pong buffer 1
j6 -> ping-pong buffer 2
j7 -> output
j27+0x18 -> N = Number of points
j27+0x19 -> REAL or COMPLEX
2. C-Calling Example:
fft32(&(input), &(ping_pong_buffer1), &(ping_pong_buffer2), &(output), N, COMPLEX);
3. Limitations:
a. All buffers must be aligned on a memory boundary which is a multiple of 4.
b. N must be between 32 and MAX_FFT_SIZE.
c. If memory space savings are required and input does not have to be
preserved, ping_pong_buffer1 can be the same buffer as input.
d. If memory space savings are required, output can be the same buffer
as ping_pong_buffer2 if the number of FFT stages is even (i.e.
Log2(N) is even) or the same as ping_pong_buffer1 if the number of
FFT stages is odd (i.e. Log2(N) is odd).
4. MAX_FFT_SIZE can be selected via #define. Larger values allow for more choices
of N, but the twiddles will occupy more memory.
5. This C-callable function can process up to 64K points on the TS201
(16K points on the TS101), because the C environment itself occupies memory.
Therefore, if more input points are necessary, assembly-language development
may become a must. On the TS201, a block of memory is 128K words long, so the
maximum N is 128K real points or 64K complex points. The TS101 contains
only 2 blocks of data memory of 64K words each, and 4 buffers must be
accommodated. Therefore, the maximum N is 32K real words or 16K complex words.
II. Description of the FFT algorithm.
1. The input data is treated as complex interleaved N-point.
2. Due to re-ordering, no stage can be done in-place.
3. The bit reversal and the first two stages are combined into
a single loop. This loop takes data from input and stores it
in the ping-pong buffer1.
4. Each subsequent stage ping-pongs the data between the two ping-pong
buffers. The last stage uses FFT output buffer for its output.
5. Although the FFT is designed to be called with any point size
N <= MAX_FFT_SIZE by subsampling the twiddle factors, for ADSP-TS20x
processors, the best cycle optimization is achieved when MAX_FFT_SIZE=N.
For ADSP-TS101 all choices of MAX_FFT_SIZE are equally optimal.
III. Description of the REAL FFT algorithm.
1. The input data is treated as complex interleaved N/2-point. The N/2 point complex
FFT will be computed first. Thus, N is halved, now number of points = N/2.
2. Details and source code of the N/2 point complex FFT are in II above.
3. Real re-combine:
Here the complex N/2-point FFT computed in the previous steps is recombined to
produce the N-point real FFT. If G is the complex FFT and F is the real FFT,
the formula for F is given by (indices of G taken modulo N/2):
    F(k) = (G(k) + conj(G(N/2-k)))/2 - (j/2)*exp(-j*2*pi*k/N)*(G(k) - conj(G(N/2-k)))
This is very efficient on the TigerSHARC architecture due to the add/subtract
instruction.
IV. For all additional details regarding this algorithm and code, see EE-218
application note, available from the ADI web site.
*/
//************************************ Includes **********************************