Technical notes on using Analog Devices DSPs, processors and development tools
Contact our technical support at dsp.support@analog.com and at dsptools.support@analog.com
Or vi sit our o n-li ne r esou rces htt p:/ /www.analog.com/ee-notes and http://www.analog.com/processors
Parallel Implementation of Fixed-Point FFTs on TigerSHARC® Processors
Contributed by Boris Lerner Rev 1 – February 3, 2005
Introduction
The advance of modern highly paralleled
processors, such as the Analog Devices
TigerSHARC® family of processors, requires
finding efficient ways to parallel
implementations of many standard algorithms.
This applications note not only explains how the
fastest 16-bit FFT implementation on the
TigerSHARC works, but also provides guidance
about algorithm development, so you can apply
similar techniques to other algorithms.
Generally, most algorithms have several levels of
optimization, which are discussed in detail in this
note. The first and most straightforward level of
optimization is the paralleling of instructions, as
permitted by the processor's architecture. This is
simple and boring. The second level of
optimization is loop unrolling and software
pipelining to achieve maximum parallelism and
to avoid pipeline stalls. Although more complex
than the simple parallelism of level one, this can
be done in prescribed steps without a firm
understanding of the algorithm and, thus,
requires little ingenuity. The third level is
restructuring the math of the algorithm to still
produce valid results, but fit the processor’s
architecture better. This requires a thorough
understanding of the algorithm and, unlike
software pipelining, there are no prescribed steps
that lead to the optimal solution. This is where
most of the fun in writing optimized code lies.
In practical applications, it is often unnecessary
to go through all of these levels. When all of the
levels are required, it is best to perform these
levels of optimization in reverse order. By the
time the code is fully pipelined, it is too late to
try to change the fundamental underlying
algorithm. Thus, a programmer would have to
think about the algorithm structure first and
organize the code accordingly. Then, levels one
and two (paralleling, unrolling, and pipelining)
are usually done at the same time.
The code to which this note refers is supplied by
Analog Devices. A 256-point FFT is used as the
specific example, but the mathematics and ideas
apply equally to other sizes (no smaller than 16
points).
As we will see, the restructured algorithm breaks
down the FFT into much smaller parts that can
then be paralleled. In the case of the 256-point
FFT (its code listing is at the end of this
applications note), the FFT is split into 16 FFTs
of 16 points each and each, 16-point FFT is done
in radix-4 fashion (i.e., each has only two
stages). If we were to do a 512=point FFT, we
would have to do 16 FFTs of 32 points each
(and, also, 32 FFTs of 16 points each), each 32point FFT would have the first two stages done
in radix-4 and the last stage in radix-2. These
differences imply that it would be difficult to
write the code that is FFT size-generic. Although
the implemented algorithm is generic and applies
equally well to all sizes, the code is not, and it
must be hand-tuned to each point size to be able
to take full advantage of its optimization.
Copyright 2005, Analog Devices, Inc. All rights reserved. Analog Devices assumes no responsibility for customer product design or the use or application of
customers’ products or for any infringements of patents or rights of others which may result from Analog Devices assistance. All trademarks and logos are property
of their respective holders. Information furnished by Analog Devices Applications and Development Tools Engineers is believed to be accurate and reliable, however
no responsibility is assumed by Analog Devices regarding technical accuracy and topicality of the content provided in Analog Devices’ Engineer-to-Engineer Notes.
a
With all this in mind, let us dive into the
fascinating world of fixed-point FFTs in the land
of the TigerSHARC.
Standard Radix-2 FFT Algorithm
Figure 1 shows a standard 16-point radix-2 FFT
implementation, after the input has been bitreversed. Traditionally, in this algorithm, stages
1 and 2 are combined with the required bit
reversing into a single optimized loop (since
these two stages require no multiplies, only adds
and subtracts). Each of the remaining stages is
usually done by combining the butterflies that
share the same twiddle factors together into
groups (so the twiddles need only to be fetched
once for each group).
Figure 1. Standard Structure of the 16-Point FFT
Since TigerSHARC processors offer vectorized
16-bit processing on packed data, we would like
to parallel this algorithm into at least as many
parallel processes as the TigerSHARC can
handle. An add/subtract instruction of the
TigerSHARC (which is instrumental in
Parallel Implementation of Fixed-Point FFTs on TigerSHARC® Processors (EE-263) Page 2 of 12
computing a fundamental butterfly) can be
paralleled to be performed on eight 16-bit values
per cycle (four in each compute block of the
TigerSHARC processor). Since data is complex,
this equates to four add/subtracts of data per
cycle. Thus, we would like to break the FFT into
+
=
a
at least four parallel processes. Looking at the
diagram of Figure 1, it is clear that we can do
this by simply combining the data into blocks,
four points at a time, i.e.:
These groups have no interdependencies and will
parallel very nicely for the first two stages of the
FFT. After that we are in trouble and the
parallelism is gone. At this point, however, we
could re-arrange the data into different blocks to
ensure that the rest of the way the new blocks do
not crosstalk to each other and, thus, can be
paralleled. A careful examination shows that the
required re-arrangement is an operation of
interleaving (or de-interleaving), with new
blocks given by:
1st block = {x(0), x(2), x(1), x(3)}
comes from this is that N must be a perfect
square. As it turns out, we can dispose of this
requirement, but that will be discussed later. At
this time we are concerned with the 256-point
FFT and, as luck would have it,
2
16256 =.
So, which shall we parallel, the rows or the
columns? The answer lies in the TigerSHARC
processor’s vector architecture. When the
TigerSHARC processor fetches data from
memory, it fetches it in chunks of 128 bits at a
time (i.e., four 16-bit complex data points) and
packs it into quad or paired (for SIMD fetches)
registers. Then it vectorizes processing across the
register quad or pair. Thus, it is the columns of
the matrix that we want to parallel (i.e., we
would like to structure our math so that all the
columns of the matrix are independent from one
another).
Now that we know what we would like the math
to give us, it is time to do this rigorously in the
language of mathematics.
Another way to look at these new blocks is as the
4x4 matrix transpose (where blocks define the
matrix rows). Of course, there is a significant
side effect—after the data re-arrangement the last
stages will parallel, but will not produce data in
the correct order. Maybe we can compensate for
this by starting with an order other than the bitreverse we had started with, but let us leave this
detail for the rigorous mathematical analysis that
comes later.
At this time, the analysis of the 16 point FFT
seems to suggest that, in general, given an N
point FFT, we would like to view it in two
dimensions as a
NN × matrix of data and
parallel-process the rows or columns, then
transpose the matrix and parallel-process the
rows or columns again. Another requirement that
Mathematics of the Algorithm
The following notation will be used:
N = Number of points in the original FFT (256 in
our example),
NM =,
∧
x will stand for the Discrete Fourier Transform
(henceforth abbreviated as DFT) of
Now, given signal x,
−
2
ink
−
1
∧
M
m
N
=
0
k
−
2
inm
π
−
1
=
0
M
N
=
l
N
−
1
0
−=−
101
MmM
∑∑∑
−
2
=
0
l
inl
π
M
=+
where:
mMlxlx
(1)
m
)(:)(
x.
+−
)(2
mMlin
ππ
N
)()()(
emMlxekxnx
−
2
inm
π
−
1
M
∑∑∑
=
0
m
∧
N
m
=+==
)()(
nxeemMlxe
Parallel Implementation of Fixed-Point FFTs on TigerSHARC® Processors (EE-263) Page 3 of 12
∧
≤
and
x is this function’s M–point DFT. Now, we
m
view the output index n as arranged in an M x M
matrix (i.e., 1,0 ,−<
−
∧
M
∑
=
m
−
2
π
ism
itm
N
∧
m
−
1
M
∑
m
=
0
since
−
2
π
M
x , being an M–point DFT, is periodic
+=MtstMsn) Thus,
π
+−
)(2
1
0
∧
m
mtMsi
N
)(
txee
∧
m
)()(
tMsxetMsx
=+=+
with period = M. Thus,
ism
−
π
2
M
−
∑
m
1
=
0
*
M
t
∧
∧
*
)( )()(
sxmxetMsx
==+
t
, (2)
where:
itm
−
2
π
*
=
t
∧
*
x is this function’s M–point DFT.
and
t
∧
N
m
)(:)(
txemx
(3)
a
L
xxxx
)0()0()0()0(
⎡
⎢
⎢
⎢
L
L
xxxx
⎢
⎢
⎢
⎣
3.
We now compute parallel FFTs on columns
L
xxxx
(as mentioned before, TigerSHARC
processors do this very efficiently) obtaining:
⎡
⎢
⎢
⎢
⎢
L
L
L
xxxx
xxxx
⎢
⎢
⎢
⎣
4.
We point-wise multiply this by matrix
2≤≤−
itm
π
⎡
256
e
⎢
⎣
L
xxxx
⎤
⎥
⎦
15,0
mt
⎤
15210
⎥
xxxx
)1()1()1()1(
15210
⎥
⎥
)2()2()2()2(
15210
⎥
MMMMM
⎥
⎥
)15()15()15()15(
15210
⎦
∧∧∧∧
⎤
)0()0()0()0(
15210
⎥
∧∧∧∧
⎥
)1()1()1()1(
xxxx
15210
⎥
∧∧∧∧
15210
∧∧∧∧
15210
⎥
)2()2()2()2(
⎥
MMMMM
⎥
⎥
)15()15()15()15(
⎦
Implementation of the Algorithm
Equations (1), (2), and (3) show how to compute
the DFT of x using the following steps (here we
go back to our specific example of N=256, M=16):
1.
Arrange the 256 points of the input data x(n)
linearly, but think of it as a 16x16 matrix:
⎡
⎢
⎢
⎢
⎢
⎢
⎢
⎣
2.
Using equation (1), re-write as:
L
L
L
L
)15()2()1()0(
xxxx
⎤
⎥
)31()18()17()16(
xxxx
⎥
⎥
)47()34()33()32(
xxxx
⎥
MMMMM
⎥
⎥
)255()242()241()240(
xxxx
⎦
to obtain
−
∧
⎡
⎢
0
∧
⎢
⎢
0
⎢
∧
⎢
0
⎢
⎢
∧
⎢
0
⎣
02
256
−
02
256
−
02
256
−
02
256
−
∧
1
∧
∧
1
∧
1
02
256
−
12
256
1
−
22
256
−
152
256
−
∧
2
∧
2
∧
2
∧
2
02
256
−
22
256
−
42
256
−
302
256
L
L
L
L
∧
15
∧
15
∧
15
∧
15
which, according to equation (3) is precisely
*
⎡
0
⎢
*
1
⎢
*
⎢
2
⎢
*
0
*
1
*
2
*
0
*
1
*
2
⎢
*
⎢
15
⎣
Now we would like to compute the 16-point
5.
FFTs of
*
15
t
*
15
)(*mx
, but these are arranged to be
L
L
L
L
*
xxxx
xxxx
xxxx
xxxx
⎤
)15()2()1()0(
0
⎥
*
)15()2()1()0(
1
⎥
*
⎥
)15()2()1()0(
2
⎥
MMMMM
⎥
*
⎥
)15()2()1()0(
15
⎦
paralleled in rows instead of columns. Thus,
we have to transpose to obtain
−
iiii
ππππ
02
⎤
256
)0()0()0()0(
exexexex
⎥
−
152
iiii
ππππ
⎥
256
)1()1()1()1(
exexexex
⎥
−
302
iiii
ππππ
⎥
256
)2()2()2()2(
exexexex
⎥
⎥
MMMMM
⎥
−
2252
iiii
ππππ
⎥
256
)15()15()15()15(
exexexex
⎦
Parallel Implementation of Fixed-Point FFTs on TigerSHARC® Processors (EE-263) Page 4 of 12
Loading...
+ 8 hidden pages
You need points to download manuals.
1 point = 1 manual.
You can buy points or you can get point for every manual you upload.