Technical notes on using Analog Devices DSPs, processors and development tools
Contact our technical support at dsp.support@analog.com and at dsptools.support@analog.com
Or visit our on-line resources http://www.analog.com/ee-notes and http://www.analog.com/processors
Writing Efficient Floating-Point FFTs for ADSP-TS201 TigerSHARC®
Processors
Contributed by Boris Lerner Rev 2 – March 4, 2004
Introduction
So, you want to write efficient code for the
ADSP-TS201 TigerSHARC® processor? Or,
maybe, you have come across the optimized
example floating-point FFT for this processor
and would like to understand how it works and
what the author had in mind when writing it.
This application note tries to answer both
questions by going through that FFT example
and all its levels of optimization in detail. This
example can be followed in developing other
algorithms and code optimized for the ADSP-TS201S processor.
Generally, most algorithms have several levels of
optimization, all of which are discussed in detail
in this note. The first and most straightforward
level of optimization is paralleling of
instructions, as far as the processor architecture
allows. This is simple and boring. The second
level of optimization is loop unrolling and
software pipelining to achieve maximum
parallelism and to avoid pipeline stalls. Although
more complex than the simple parallelism of
level one, this can be done in prescribed steps
without a good understanding of the algorithm
and, thus, requires little ingenuity. The third level is
to restructure the math of the algorithm to still
produce valid results, but so that the new
restructured algorithm fits the processor’s
architecture better. Being able to do this requires
a thorough understanding of the algorithm and,
unlike software pipelining, there are no
prescribed steps that lead to the optimal solution.
This is where most of the fun in writing
optimized code lies.

Copyright 2004, Analog Devices, Inc. All rights reserved. Analog Devices assumes no responsibility for customer product design or the use or application of
customers' products or for any infringements of patents or rights of others which may result from Analog Devices assistance. All trademarks and logos are property
of their respective holders. Information furnished by Analog Devices Applications and Development Tools Engineers is believed to be accurate and reliable; however,
no responsibility is assumed by Analog Devices regarding technical accuracy and topicality of the content provided in Analog Devices' Engineer-to-Engineer Notes.
In practical applications it is often unnecessary to
go through all of these levels. When all of the
levels are required, it is always best to do these
levels of optimization in reverse order. By the
time the code is fully pipelined, it is too late to
try to change the fundamental underlying
algorithm. Thus, a programmer should think
about the algorithm structure first and
organize the code accordingly. Then, levels two
and one (paralleling, unrolling, and pipelining)
are usually done at the same time.
The code that this note refers to is supplied by
Analog Devices in a form that allows it to be
called as either a real or a complex FFT, with the
last calling parameter of the function selecting
between the two. The real N-point FFT
is obtained from the complex N/2-point FFT with
an additional special stage at the end. This note is
concerned with code optimization more than the
technicalities of the special stage, so it discusses
the algorithm for the complex FFT portion of the
code only. The last special stage of the real FFT
is discussed in detail in the comments of the
code.
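The split stage itself is standard textbook material. As a hedged sketch of the underlying math only (plain Python with a naive DFT standing in for the optimized half-size FFT; this is not the shipped routine), the even/odd packing trick looks like this:

```python
import cmath

def dft(x):
    # Naive complex DFT, used here only as a stand-in reference.
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def real_fft_via_half_size(x):
    """N-point FFT of a real sequence via one complex N/2-point FFT
    plus a final split stage (the 'additional special stage')."""
    N = len(x)
    # Pack even/odd real samples into a length-N/2 complex sequence.
    z = [complex(x[2 * n], x[2 * n + 1]) for n in range(N // 2)]
    Z = dft(z)                      # the complex N/2-point FFT
    Z.append(Z[0])                  # Z[N/2] wraps around to Z[0]
    X = []
    for k in range(N // 2 + 1):
        even = (Z[k] + Z[N // 2 - k].conjugate()) / 2    # FFT of even samples
        odd = (Z[k] - Z[N // 2 - k].conjugate()) / (2j)  # FFT of odd samples
        X.append(even + cmath.exp(-2j * cmath.pi * k / N) * odd)
    return X    # bins 0..N/2; the remaining bins follow by conjugate symmetry
```

Bins N/2+1 through N-1 need not be computed, since for real input X[N-k] is the conjugate of X[k].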
Standard Radix-2 FFT Algorithm
Figure 1 shows a standard 16-point radix-2 FFT
implementation, after the input has been bit-reversed. Traditionally, in this algorithm, stages
1 and 2 are combined together with the required
bit reversing into a single optimized loop (since
these two stages require no multiplies, only adds
and subtracts). Each of the remaining stages is
usually done by combining the butterflies that
share the same twiddle factors together into
groups (so the twiddles have to be fetched only
once for each group). Un-optimized assembly
source code for a TigerSHARC processor
implementing this algorithm is shown in Listing
1. This, with a few tricks that are irrelevant to
this discussion, is the way that the 32-bit
floating-point FFT code was written when it was
targeted to an ADSP-TS101 processor. The
benchmarks (in core clock cycles) for this
algorithm, including bit reversal, running on an
ADSP-TS101 and ADSP-TS201, are shown in
Table 1. Note that since the ADSP-TS101 has
less memory per memory block than an ADSP-TS201, larger point size benchmarks do not
apply to the ADSP-TS101. Clearly, as long as
the data fits into the ADSP-TS201 cache, it is
efficient. Once the data becomes too large for the
cache, this FFT implementation becomes
extremely inefficient – the cycle count increases
from optimal by a factor of five.
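As a reference model of this conventional structure (illustrative Python, not the assembly of Listing 1), the bit reversal and the twiddle-group loops can be sketched as:

```python
import cmath

def bit_reverse(x):
    # Reorder the input by bit-reversed index, as done before stage 1.
    N = len(x)
    bits = N.bit_length() - 1
    return [x[int(format(n, f'0{bits}b')[::-1], 2)] for n in range(N)]

def radix2_fft(x):
    """Conventional radix-2 DIT FFT over bit-reversed input.

    Butterflies sharing a twiddle factor are walked together in the
    inner loop, so each twiddle is fetched once per group, as in the
    conventional code described above."""
    a = bit_reverse(x)
    N = len(a)
    span = 1
    while span < N:
        for j in range(span):                     # one twiddle per group
            w = cmath.exp(-2j * cmath.pi * j / (2 * span))
            for base in range(j, N, 2 * span):    # butterflies sharing w
                t = w * a[base + span]
                a[base + span] = a[base] - t
                a[base] = a[base] + t
        span *= 2
    return a
```

Note the doubling stride `2 * span`: this is exactly the non-sequential access pattern that hurts large-FFT performance, as discussed in the next section.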
Figure 1. Standard Structure of the 16-Point FFT
Writing Efficient Floating-Point FFTs for ADSP-TS201 TigerSHARC® Processors (EE-218) Page 2 of 16
8192     99886    444628     NA
16384    215224   987730     NA
32768    NA       2133220    NA
65536    NA       4720010    NA

Table 1. Core Clock Cycles for N-point Complex FFT
Optimizing the Structure of the FFT
for ADSP-TS201 Processors
To be able to re-structure the algorithm to
perform optimally on the ADSP-TS201, we have to
understand why the performance of large FFTs
using the conventional FFT structure is so poor.
ADSP-TS201 memory is optimized for
sequential reads. Cache is designed to help with
algorithms where the reads are not sequential. In
the conventional FFT algorithm, the butterfly
stride doubles at each stage, so the reads are
non-sequential and, with each new stage, the cache is
less and less likely to hit – the reads are all
over the place. The solution lies in re-arranging a
stage’s output to ensure that the next stage’s
reads are sequential. The structure of the
algorithm implementation is shown in Figure 2.
Figure 2. Reorganized Structure of the 16-Point FFT
It is simple enough to trace this diagram by hand
to see that it is simply a re-ordering of the
diagram in Figure 1. Amazingly enough, the final
output is in correct order. This can be easily
proven for general N = 2^K, where N is the number
of points in the FFT. Note that the re-ordering is
given by the following formula:

    f(n) = n/2,            if n is even
    f(n) = (n-1)/2 + N/2,  if n is odd

Thus, if n is even, it is shifted right, and if n is
odd, it is shifted right and its most significant bit
is set. This is, of course, equivalent to the
operation of right 1-bit rotation, which after
K = log2(N) steps returns the original n back.
Thus, the output after K stages is in correct order
again.
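This rotation property is easy to check numerically; the following snippet assumes N = 16 (K = 4) purely for illustration:

```python
def f(n, N):
    # The stage re-ordering: shift right; odd n also sets the top bit.
    return n // 2 if n % 2 == 0 else (n - 1) // 2 + N // 2

N, K = 16, 4   # N = 2**K
for n in range(N):
    # f is exactly a right rotation of n's K-bit representation.
    assert f(n, N) == ((n >> 1) | ((n & 1) << (K - 1)))

# Applying the re-ordering once per stage, K stages in total,
# rotates each index all the way around and restores natural order.
order = list(range(N))
for _ in range(K):
    order = [f(n, N) for n in order]
assert order == list(range(N))
```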
Great! We have our new structure. It has
sequential reads and we are lucky enough that
the output is in the correct order. This should be
much more efficient. Right? Let’s write the code
for it! Well, before we spend a lot of time writing
the code, we should ensure that all of the DSP
operations that we are about to do actually fit
into our processor architecture efficiently. There
may be no reason to optimize data movements if
the underlying math suffers.
The first obvious point to notice is that this
structure cannot be done in-place due to its re-ordering. The stages will have to ping-pong their
input/output buffers. This should not be a
problem. The ADSP-TS201 processor has a lot
of memory on board, but should memory
optimization be required (and the input does not
have to be preserved), we can use the input as
one of the two ping-pong buffers.
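Putting the sequential-read butterflies, the scatter writes, and the ping-pong buffers together, the data flow of Figure 2 can be modeled as below. This is a reconstruction from the diagram (the twiddle indexing in particular is inferred, not taken from the shipped code):

```python
import cmath

def reorganized_fft(x):
    """Constant-geometry FFT sketch: every stage reads adjacent pairs
    (sequential reads) and scatters its outputs to m and m + N/2,
    ping-ponging between two buffers. Input must be bit-reversed;
    the output comes out in natural order after K stages."""
    N = len(x)
    K = N.bit_length() - 1
    src, dst = list(x), [0j] * N
    for s in range(1, K + 1):
        for m in range(N // 2):
            # Twiddles line up one group at a time: consecutive
            # butterflies share w until the group boundary.
            e = (m >> (K - s)) << (K - s)
            w = cmath.exp(-2j * cmath.pi * e / N)
            a, b = src[2 * m], src[2 * m + 1]   # sequential pair read
            t = w * b
            dst[m] = a + t                      # scattered pair write
            dst[m + N // 2] = a - t
        src, dst = dst, src                     # ping-pong the buffers
    return src
```

Every read in the inner loop is from consecutive addresses, which is the whole point of the restructuring.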
Next, we note that a traditional FFT combines
butterflies that share twiddles into the same
group to save twiddle fetch cycles. Amazingly,
the twiddles of the structure in Figure 2 line up
linearly – one group at a time. We are lucky
again!
Now, what would a butterfly of this new
structure consist of? Table 2 lists the operations
necessary to perform a single complex butterfly.
Since the ADSP-TS201 is a SIMD processor
(i.e., it can double all the computes), we will
write the steps outlined in Listing 1 in SIMD
fashion, so that two adjacent butterflies are
computed in parallel, one in the X-Compute
block and the other one in the Y-Compute block.
Let us analyze the DSP operations in more detail.
F1, F2, K2 and F4 fetch a total of four 32-bit
words, which on ADSP-TS201 can be done in a
single quad fetch into X-Compute block
registers. To be able to supply the SIMD machine
with data, we would also have to perform a
second butterfly quad fetch into the Y-Compute
block registers. Then, M1, M2, M3, M4, A1, and
A2 will perform SIMD operations for both
butterflies.
The ADSP-TS201 supports a single add/subtract
instruction, so A3 and A4 can be combined into a
single operation (which is, of course, performed
SIMD on both butterflies at once) and similarly
A5 and A6 can be combined, as well.
Mnemonic   Operation
F1         Fetch Real(Input1) of the Butterfly
F2         Fetch Imag(Input1) of the Butterfly
K2         Fetch Real(Input2) of the Butterfly

Table 2. Single Butterfly Done Linearly – Logical
Implementation
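The visible rows of Table 2, together with the M1–M4 and A1–A6 pairings described in the text, suggest the following scalar decomposition of one butterfly; the exact mnemonic-to-operation mapping beyond the listed rows is an inference, not a transcription of the full table:

```python
def butterfly(in1, in2, w):
    """One complex butterfly, out1 = in1 + w*in2 and out2 = in1 - w*in2,
    spelled out as the scalar steps the note's mnemonics stand for."""
    m1 = w.real * in2.real            # M1..M4: the four multiplies
    m2 = w.imag * in2.imag
    m3 = w.real * in2.imag
    m4 = w.imag * in2.real
    a1 = m1 - m2                      # A1: Real(w * in2)
    a2 = m3 + m4                      # A2: Imag(w * in2)
    # The A3/A4 and A5/A6 pairs map onto single add/subtract
    # instructions, performed SIMD on two butterflies at once.
    re_sum, re_diff = in1.real + a1, in1.real - a1
    im_sum, im_diff = in1.imag + a2, in1.imag - a2
    return complex(re_sum, im_sum), complex(re_diff, im_diff)
```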
Now we run into a problem: S1, S2, S3 and S4
cannot be performed in the same cycle, since S3
and S4 are destined to another place in memory
due to our output re-ordering. Instead, we can
store S1 and S2 for both butterflies in one cycle
(lucky again – these are adjacent!) and S3 and S4
for both butterflies in the next cycle. So far, so
good – the new set of operations is summarized
in Table 3.