Technical notes on using Analog Devices DSPs, processors and development tools
Contact our technical support at dsp.support@analog.com and at dsptools.support@analog.com
Or visit our on-line resources http://www.analog.com/ee-notes and http://www.analog.com/processors
Writing Efficient Floating-Point FFTs for ADSP-TS201 TigerSHARC® Processors
Contributed by Boris Lerner Rev 2 – March 4, 2004
Introduction
So, you want to write efficient code for the
ADSP-TS201 TigerSHARC® processor? Or,
maybe, you have come across the optimized
example floating-point FFT for this processor
and would like to understand how it works and
what the author had in mind when writing it.
This application note tries to answer both
questions by going through that FFT example
and all its levels of optimization in detail. This
example can be followed in developing other
algorithms and code optimized for the ADSP-TS201S processor.
Generally, most algorithms have several levels of
optimization, all of which are discussed in detail
in this note. The first and most straightforward
level of optimization is paralleling of
instructions, as the processor architecture will
allow. This is simple and boring. The second
level of optimization is loop unrolling and
software pipelining to achieve maximum
parallelism and to avoid pipeline stalls. Although
more complex than the simple parallelism of
level one, this can be done in prescribed steps
without good understanding of the algorithm and,
thus, requires little ingenuity. The third level is
to restructure the math of the algorithm to still
produce valid results, but so that the new
restructured algorithm fits the processor’s
architecture better. Being able to do this requires
a thorough understanding of the algorithm and,
unlike software pipelining, there are no
prescribed steps that lead to the optimal solution.
This is where most of the fun in writing
optimized code lies.

Copyright 2004, Analog Devices, Inc. All rights reserved. Analog Devices assumes no responsibility for customer product design or the use or application of
customers’ products or for any infringements of patents or rights of others which may result from Analog Devices assistance. All trademarks and logos are property
of their respective holders. Information furnished by Analog Devices Applications and Development Tools Engineers is believed to be accurate and reliable, however
no responsibility is assumed by Analog Devices regarding technical accuracy and topicality of the content provided in Analog Devices’ Engineer-to-Engineer Notes.
In practical applications it is often unnecessary to
go through all of these levels. When all of the
levels are required, it is always best to do these
levels of optimization in reverse order. By the
time the code is fully pipelined, it is too late to
try to change the fundamental underlying
algorithm. Thus, a programmer would have to
think about the algorithm structure first and
organize the code accordingly. Then, levels two
and one (paralleling, unrolling, and pipelining)
are usually done at the same time.
The code that this note refers to is supplied by
Analog Devices in a form that allows it to be
called as either a real or a complex FFT; the last
calling parameter of the function defines whether
the real or the complex version is performed. The real N-point FFT
is obtained from the complex N/2-point FFT with
an additional special stage at the end. This note is
concerned with code optimization more than the
technicalities of the special stage, so it discusses
the algorithm for the complex FFT portion of the
code only. The last special stage of the real FFT
is discussed in detail in the comments of the
code.
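The real-from-complex trick mentioned above is the standard "split" recombination. As a sanity check of the idea (not of the note's actual assembly code), the sketch below computes an N-point real FFT from an N/2-point complex FFT in Python; the naive DFT standing in for the half-size complex FFT, and all function names, are assumptions for brevity:

```python
import cmath

def naive_dft(x):
    """Reference N-point complex DFT (O(N^2), for checking only)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

def real_fft_via_half_complex(x):
    """N-point real FFT from an N/2-point complex FFT plus a recombine
    stage (the standard split; the note's last special stage computes
    the same quantity)."""
    n = len(x)
    h = n // 2
    z = [complex(x[2 * m], x[2 * m + 1]) for m in range(h)]  # pack even/odd pairs
    g = naive_dft(z)                                         # N/2-point complex FFT
    out = []
    for k in range(h):
        gk = g[k]
        gc = g[(h - k) % h].conjugate()
        even = 0.5 * (gk + gc)       # FFT of the even-indexed samples
        odd = -0.5j * (gk - gc)      # FFT of the odd-indexed samples
        out.append(even + cmath.exp(-2j * cmath.pi * k / n) * odd)
    return out  # F(0) .. F(N/2-1)
```

The remaining bins F(N/2) .. F(N-1) follow from the conjugate symmetry of a real signal's spectrum.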
Standard Radix-2 FFT Algorithm
Figure 1 shows a standard 16-point radix-2 FFT
implementation, after the input has been
bit-reversed. Traditionally, in this algorithm, stages
1 and 2 are combined together with the required
bit reversal into a single optimized loop (since
these two stages require no multiplies, only adds
and subtracts). Each of the remaining stages is
usually done by combining the butterflies that
share the same twiddle factors together into
groups (so the twiddles have to be fetched only
once for each group). Un-optimized assembly
source code for a TigerSHARC processor
implementing this algorithm is shown in Listing
1. This, with a few tricks that are irrelevant to
this discussion, is the way that the 32-bit
floating-point FFT code was written when it was
targeted to an ADSP-TS101 processor. The
benchmarks (in core clock cycles) for this
algorithm, including bit reversal, running on an
ADSP-TS101 and ADSP-TS201, are shown in
Table 1. Note that since the ADSP-TS101 has
less memory per memory block than an ADSP-TS201, larger point-size benchmarks do not
apply to the ADSP-TS101. Clearly, as long as
the data fits into the ADSP-TS201 cache, it is
efficient. Once the data becomes too large for the
cache, this FFT implementation becomes
extremely inefficient – the cycle count increases
from optimal by a factor of five.
Figure 1. Standard Structure of the 16-Point FFT
Writing Efficient Floating-Point FFTs for ADSP-TS201 TigerSHARC® Processors (EE-218) Page 2 of 16
N       ADSP-TS101   ADSP-TS201          ADSP-TS201
                     (input not in cache) (input in cache)
256     2172         2641                2218
512     4582         5533                4649
1024    9872         12170               9992
2048    21338        26610               22173
4096    46244        197272              NA
8192    99886        444628              NA
16384   215224       987730              NA
32768   NA           2133220             NA
65536   NA           4720010             NA
Table 1. Core Clock Cycles for N-point Complex FFT
Optimizing the Structure of the FFT
for ADSP-TS201 Processors
To be able to re-structure the algorithm to
perform optimally on ADSP-TS201, we have to
understand why the performance of large FFTs
using the conventional FFT structure is so poor.
ADSP-TS201 memory is optimized for
sequential reads. Cache is designed to help with
algorithms where the reads are not sequential. In
the conventional FFT algorithm, the butterfly
stride doubles at each stage, so the reads are
non-sequential and, with each new stage, the cache is
less and less likely to get a hit – the reads are all
over the place. The solution lies in re-arranging a
stage’s output to ensure that the next stage’s
reads are sequential. The structure of the
algorithm implementation is shown in Figure 2.
Figure 2. Reorganized Structure of the 16-Point FFT
It is simple enough to trace this diagram by hand
to see that it is simply a re-ordering of the
diagram in Figure 1. Amazingly enough, the final
output is in correct order. This can be easily
proven for general N = 2^k, where N is the number
of points in the FFT. Note that the re-ordering is
given by the following formula:

    f(n) = n/2,            if n is even
    f(n) = (n-1)/2 + N/2,  if n is odd

Thus, if n is even, it is shifted right and if n is
odd, it is shifted right and its most significant bit
is set. This is, of course, equivalent to the
operation of right 1-bit rotation, which after
K = log2(N) steps returns the original n back.
Thus, the output after K stages is in correct order
again.
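A few lines of Python can confirm the rotation property for a 16-point example (a sketch; f, N, and K name the quantities in the formula above):

```python
# The re-ordering f is a right rotation by one bit of the log2(N)-bit
# index, so applying it log2(N) times returns every index home.
N = 16
K = N.bit_length() - 1   # log2(N) stages

def f(n):
    return n // 2 if n % 2 == 0 else (n - 1) // 2 + N // 2

for n in range(N):
    m = n
    for _ in range(K):
        m = f(m)
    assert m == n        # back in natural order after K stages
```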
Great! We have our new structure. It has
sequential reads and we are lucky enough that
the output is in the correct order. This should be
much more efficient. Right? Let’s write the code
for it! Well, before we spend a lot of time writing
the code, we should ensure that all of the DSP
operations that we are about to do actually fit
into our processor architecture efficiently. There
may be no reason to optimize data movements if
the underlying math suffers.
The first obvious point to notice is that this
structure cannot be done in-place due to its
re-ordering. The stages will have to ping-pong their
input/output buffers. This should not be a
problem. The ADSP-TS201 processor has a lot
of memory on board, but should memory
optimization be required (and input does not
have to be preserved), we can use the input as
one of the two ping-pong buffers.
Next, we note that a traditional FFT combines
butterflies that share twiddles into the same
group to save twiddle fetch cycles. Amazingly,
the twiddles of the structure in Figure 2 line up
linearly – one group at a time. We are lucky
again!
Now, what would a butterfly of this new
structure consist of? Table 2 lists the operations
necessary to perform a single complex butterfly.
Since the ADSP-TS201 is a SIMD processor
(i.e., it can double all the computes), we will
write the steps outlined in Listing 1 in SIMD
fashion, so that two adjacent butterflies are
computed in parallel, one in the X-Compute
block and the other one in the Y-Compute block.
Let us analyze the DSP operations in more detail.
F1, F2, F3, and F4 fetch a total of four 32-bit
words, which on ADSP-TS201 can be done in a
single quad fetch into X-Compute block
registers. To be able to supply the SIMD machine
with data, we would also have to perform a
second butterfly quad fetch into the Y-Compute
block registers. Then, M1, M2, M3, M4, A1, and
A2 will perform SIMD operations for both
butterflies.
The ADSP-TS201 supports a single add/subtract
instruction, so A3 and A4 can be combined into a
single operation (which is, of course, performed
SIMD on both butterflies at once) and similarly
A5 and A6 can be combined, as well.
Mnemonic  Operation
F1        Fetch Real(Input1) of the Butterfly
F2        Fetch Imag(Input1) of the Butterfly
F3        Fetch Real(Input2) of the Butterfly
F4        Fetch Imag(Input2) of the Butterfly
M1        Real(Input2) * Real(twiddle)
M2        Imag(Input2) * Imag(twiddle)
M3        Real(Input2) * Imag(twiddle)
M4        Imag(Input2) * Real(twiddle)
A1        M1 - M2 = Real(Input2*twiddle)
A2        M3 + M4 = Imag(Input2*twiddle)
A3        Real(Input1) + A1 = Real(Output1)
A4        Real(Input1) - A1 = Real(Output2)
A5        Imag(Input1) + A2 = Imag(Output1)
A6        Imag(Input1) - A2 = Imag(Output2)
S1        Store Real(Output1)
S2        Store Imag(Output1)
S3        Store Real(Output2)
S4        Store Imag(Output2)
Table 2. Single Butterfly Done Linearly – Logical Implementation
Now we run into a problem: S1, S2, S3 and S4
cannot be performed in the same cycle, since S3
and S4 are destined to another place in memory
due to our output re-ordering. Instead, we can
store S1 and S2 for both butterflies in one cycle
(lucky again – these are adjacent!) and S3 and S4
for both butterflies in the next cycle. So far, so
good – the new set of operations is summarized
in Table 3.
Mnemonic Operation
F1 Fetch Input1,2 of the Butterfly1
F2 Fetch Input1,2 of the Butterfly2
M1 Real(Input2) * Real(twiddle)
M2 Imag(Input2) * Imag(twiddle)
M3 Real(Input2) * Imag(twiddle)
M4 Imag(Input2) * Real(twiddle)
A1 M1–M2 = Real(Input2*twiddle)
A2 M3+M4 = Imag(Input2*twiddle)
A3 Real(Input1) +/- A1 = Real(Output1,2)
A4 Imag(Input1) +/- A2 = Imag(Output1,2)
S1 Store(Output1, both Butterflies)
S2 Store(Output2, both Butterflies)
Table 3. Single Butterfly Done Linearly – Actual
ADSP-TS20x Implementation
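The operation breakdown in Tables 2 and 3 can be checked with plain scalar arithmetic. The sketch below (hypothetical helper names, Python floats standing in for compute block registers) performs one butterfly exactly as the M1–M4/A1–A2 steps describe, with the final add/subtract pairs mirroring the combined add/subtract instructions:

```python
def butterfly(in1, in2, tw):
    """One radix-2 butterfly via the Table 3 operation breakdown."""
    m1 = in2.real * tw.real          # M1
    m2 = in2.imag * tw.imag          # M2
    m3 = in2.real * tw.imag          # M3
    m4 = in2.imag * tw.real          # M4
    a1 = m1 - m2                     # A1 = Real(in2 * tw)
    a2 = m3 + m4                     # A2 = Imag(in2 * tw)
    # A3/A4: each add/subtract pair produces both outputs at once
    out1 = complex(in1.real + a1, in1.imag + a2)
    out2 = complex(in1.real - a1, in1.imag - a2)
    return out1, out2
```

The result matches the direct complex form in1 ± in2*tw.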
Each operation in Table 3 is a single-cycle
operation on the ADSP-TS201 processor. There are a
total of 2 fetches, 4 multiplies, 4 ALU, and 2
store instructions. Since the ADSP-TS201 allows
fetches/stores to be paralleled with multiplies and
ALUs in a single cycle, loop unrolling,
pipelining, and paralleling should yield a 4-cycle
execution of these two SIMD butterflies (and we
are still efficient in the memory usage!). At this
point, we can now be reasonably certain that the
above will yield efficient code and we can start
developing it. However, careful observation at
this point can help us optimize this structure even
further. Note that we are only using a total of 4
fetches and stores from a single memory block,
say, by using JALU pointer registers. In parallel
we can do 3 more fetches/stores/KALU
operations without losing any cycles (actually,
we can do 4 of them, but we need one
reserved slot in one of the instruction lines for
the loop jump back).
Thus, the old rule of fetching twiddles only once
per group of butterflies that shares them is no
longer necessary – the twiddle fetches come free!
And, since the structure of the arrows of Figure 2
is identical at every stage, we may be able to
reduce the FFT from the usual three nested loops
to only two, provided that we can find a way to
correctly fetch the twiddles at each stage
(twiddles are the only thing that distinguishes the
stages of Figure 2). Figure 2 shows how the
twiddles must be fetched at each stage: 1st stage –
all are W^0. 2nd stage – half are W^0, the next half
are W^(N/4). 3rd stage – one quarter are W^0, the next
quarter are W^(N/8), the next quarter are W^(2N/8), and
the last quarter are W^(3N/8). And so on… If we
keep a virtual twiddle pointer offset, increment it
to the next sequential twiddle every butterfly, but
AND it with a mask before actually using it in
the twiddle fetch, we achieve precisely this order
of twiddle fetch. Moreover, this rule is the same
for every stage, except that the mask at every
stage must be shifted down by one bit (i.e., each
stage requires twice as fine a resolution of the
twiddles as the previous stage). Here, our unused
KALU operations come in very handy. To
implement this twiddle fetch, we need to
increment the virtual offset, mask it and do a
twiddle fetch every butterfly… Oh, no! We are in
SIMD (i.e. we are doing two butterflies together)
and we do not have the 6 available instruction
slots for this! But luck saves us again. We can
easily notice that all stages except the last share
the twiddles between the SIMD pair of
butterflies – so, for these stages, we need only to
do the twiddle fetch once per SIMD pair of the
butterflies! And the three cycles are precisely
what we need to do this. Unfortunately, in the
last stage, every butterfly has its own unique
twiddle; but in the last stage, we do not have to
mask – just step the pointer to the next twiddle
every time! It will have to be written separately,
but it will optimize completely as well. Table 4
summarizes the latest structure’s steps. Three
new KALU operations (K1, K2 and K3) have
been added to Table 3. Time to write the code?
Well, no – let us figure out how to pipeline it
first.
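Before pipelining, the whole restructured algorithm – sequential pair reads, scattered writes per the re-ordering formula, and the masked virtual twiddle offset – can be modeled in a few lines of Python. This is a behavioral sketch (all names are mine, not the shipped code's), meant only to show that the structure of Figure 2 together with the mask rule really computes the FFT:

```python
import cmath

def bit_reverse(x):
    """Return x permuted into bit-reversed index order."""
    n = len(x)
    bits = n.bit_length() - 1
    return [x[int(format(i, "0%db" % bits)[::-1], 2)] for i in range(n)]

def fft_restructured(x):
    n = len(x)
    half = n // 2
    stages = n.bit_length() - 1
    w = [cmath.exp(-2j * cmath.pi * j / n) for j in range(half)]  # twiddle table
    x = bit_reverse(x)
    mask = 0                        # 1st stage: every butterfly uses W^0
    for _ in range(stages):
        y = [0j] * n
        virtual = 0                 # virtual twiddle pointer offset
        for m in range(half):
            t = w[virtual & mask]   # AND with the stage's mask before use
            a, b = x[2 * m], x[2 * m + 1] * t   # sequential pair read
            y[m] = a + b                        # scattered writes per the
            y[m + half] = a - b                 # re-ordering formula
            virtual += 1            # step to the next sequential twiddle
        x = y
        mask = (mask >> 1) | (half >> 1)  # shift the mask down by one bit
    return x
```

Note that the mask update is exactly the "shift the mask down by one bit" rule: for N = 16 it runs 0b000, 0b100, 0b110, 0b111 over the four stages, and in the last stage every butterfly gets its own twiddle.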
Mnemonic  Operation
F1        Fetch Input1,2 of the Butterfly1
F2        Fetch Input1,2 of the Butterfly2
K1        Mask the virtual twiddle offset
K2        Fetch the twiddle factors
K3        Increment the virtual twiddle offset
M1        Real(Input2) * Real(twiddle)
M2        Imag(Input2) * Imag(twiddle)
M3        Real(Input2) * Imag(twiddle)
M4        Imag(Input2) * Real(twiddle)
A1        M1 - M2 = Real(Input2*twiddle)
A2        M3 + M4 = Imag(Input2*twiddle)
A3        Real(Input1) +/- A1 = Real(Output1,2)
A4        Imag(Input1) +/- A2 = Imag(Output1,2)
S1        Store(Output1, both Butterflies)
S2        Store(Output2, both Butterflies)
Table 4. Single Butterfly Done Linearly – Modified ADSP-TS20x Implementation

Pipelining of the Algorithm

Figure 3 shows the algorithm’s operations from
Table 4 with arrows showing the dependencies.
The arrows of the dependencies indicate that the
result of the operation at the start of the arrow is
used by the operation at the end of that arrow
and, thus, must be completed first to ensure
correct data. Some arrows have a stall associated
with them. This means that if the operation at the
start of the arrow is immediately followed by the
operation at the end of that arrow, the result will
be correct, but code execution will produce a
stall. Thus, to fully optimize the code, operations
at the ends of arrows with stalls must be kept
more than one instruction line apart.

Figure 3. Reorganized Structure’s Dependencies
A quick observation of the dependencies in
Figure 3 is sufficient to analyze the level of
pipelining and the number of compute block
registers needed to do it.
Full pipelining, as mentioned earlier, would give
a 4-cycle SIMD pair of butterflies. Thus, an
operation whose result is consumed several cycles
later must keep its output alive across
[cycles/4]+1 iterations of the 4-cycle loop,
multiplying the number of registers it needs
accordingly. Here, [x] denotes the integer part of
the number x. We can therefore determine the
number of compute block registers needed, as
shown in Table 5. Note that A3 and A4 require
twice as many output registers as M1, M2, M3,
M4, A1 and A2, since A3 and A4 are
add/subtracts.

State  Dependency To States   Max Dep Cycles  Compute Block Registers Needed
K1     K2                     1               0
K2     M1,M2,M3,M4            5               4*([5/4]+1)=8
K3     K1                     1               0
F1     M1,M2,M3,M4,A1,A2      10              4*([10/4]+1)=16
F2     M1,M2,M3,M4,A1,A2      10              4*([10/4]+1)=16
M1     A1                     2               2*([2/4]+1)=2
M2     A1                     2               2*([2/4]+1)=2
M3     A2                     2               2*([2/4]+1)=2
M4     A2                     2               2*([2/4]+1)=2
A1     A3,A4                  2               2*([2/4]+1)=2
A2     A3,A4                  2               2*([2/4]+1)=2
A3     S1,S2                  1               4*([1/4]+1)=4
A4     S1,S2                  1               4*([1/4]+1)=4
S1     none                   0               0
S2     none                   0               0
Total Regs: 60
Table 5. Number of Compute Block Registers Required to Pipeline the Butterflies

The resulting requirement to fully pipeline this
code is 60 compute block registers, out of the 64
total – just barely made it!

We pipeline this fully symbolically, using the
mnemonics of Table 4 and Figure 3. The
pipelining is shown in Table 6, in which “+” in
the operation indicates the operation that
corresponds to the next set of the butterflies and
“-” corresponds to the operation in the previous
set of the butterflies.

Cycle  JALU    KALU    MAC     ALU
1      F1      K1      M4--    A3---
2      F2      K2      M2-     A4---
3      S1---           M3-     A2--
4      S2---   K3      M1      A1-
5      F1+     K1+     M4-     A3--
6      F2+     K2+     M2      A4--
7      S1--            M3      A2-
8      S2--    K3+     M1+     A1
9      F1++    K1++    M4      A3-
10     F2++    K2++    M2+     A4-
11     S1-             M3+     A2
12     S2-     K3++    M1++    A1+
13     F1+++   K1+++   M4+     A3
14     F2+++   K2+++   M2++    A4
15     S1              M3++    A2+
16     S2      K3+++   M1+++   A1++
Table 6. Pipelined Butterflies
All instructions are paralleled, there are no stalls,
and there is a place to put the jump to the top of
the loop (actually, four places, because the
pipeline is 4 pairs of butterflies deep – each
iteration of the loop in Table 6 does 4 pairs of
butterflies).
The Code
Now, writing the code is trivial. The ADSP-TS201 is so flexible that it takes all the challenge
right out of it. Just follow the pipeline of Table 6
and the code is done. The resulting code for the
stages other than last is shown in Listing 2.
Outside of this inner loop is a stage loop that
ping-pongs input/output buffers and shifts the
twiddle modifier mask. Pretty simple!
Additional optimization is done by breaking the
first two stages away from the main body of the
code and doing them separately – they do not
really require a complex multiply and can be done
faster. Bit reversal is incorporated into these
first two stages as well. Now, for the bottom line
– how much did the cycle count improve? In
Table 7 we repeat Table 1 with additional
columns for the benchmarks for the new
algorithm. The cycle count for the larger-than-cache FFTs improved by a factor greater than 3!
Moreover, the cycle count for FFTs that fit into
the cache is better than it was on the original
ADSP-TS101 processor, which had no cache or
memory latency issues of any kind. The reason
for this is that the new architecture allows the
code to be written in two nested loops instead of
three and, thus, has significantly less overhead.
This code, ported to the ADSP-TS101, improves
its benchmarks, too – as shown in Table 7.
       ADSP-TS101  ADSP-TS101  ADSP-TS201      ADSP-TS201      ADSP-TS201   ADSP-TS201
       Old         New         Old structure,  New structure,  Old structure, New structure,
N      structure   structure   input not in    input not in    input in     input in
                               cache           cache           cache        cache
256    2172        1958        2641            2402            2218         1963
512    4582        4276        5533            5192            4649         4283
1024   9872        9410        12170           11662           9992         9419
2048   21338       20688       26610           25316           22173        20699
4096   46244       45278       197272          69924           NA           NA
8192   99886       98540       444628          147628          NA           NA
16384  215224      213243      987730          313292          NA           NA
32768  NA          NA          2133220         662614          NA           NA
65536  NA          NA          4720010         1397544         NA           NA
Table 7. Core Clock Cycles for N-point Complex FFT, New versus Old Structure

Usage Rules

The C-callable complex FFT routine is called as

FFT32( &(input), &(ping_pong_buffer1),
&(ping_pong_buffer2), &(output), N, F);

where

input -> FFT input buffer,
output -> FFT output buffer,
ping_pong_bufferx are the ping pong buffers,
N = Number of complex points,
F = 0 if FFT is real and 1 if FFT is complex.

As mentioned earlier, due to data re-ordering,
stages cannot be done in-place and have to
ping-pong. Thus, ping_pong_buffer1 and
ping_pong_buffer2 have to be two distinct
buffers. However, depending on the routine
user’s requirements, some memory optimization is
possible. Ping_pong_buffer1 can be made the
same as input if input does not need to be
preserved. Also, if Log2(N) is even, output can be
made the same as ping_pong_buffer2 and if
Log2(N) is odd, output can be made the same as
ping_pong_buffer1. Below are two examples of
the routine usage with minimal use of memory:

FFT32( &(input), &(input),
&(output), &(output), 1024, 1);

FFT32( &(input), &(input),
&(ping_pong_buffer2), &(input), 2048, 1);

To eliminate memory block access conflicts,
input must reside in a different memory block
than ping_pong_buffer2, and twiddle factors must
reside in a different memory block than the
ping-pong buffers. Of course, all code must reside in a
block that is different from all the data buffers, as
well. Ping-pong buffers can share a memory
block, however – there is no instruction that
accesses both ping-pong buffers in the same
cycle.
Appendix
Complete Source Code of the Optimized FFT
/* fft32.asm
Prelim rev. October 19, 2003 - BL
Rev. 1.0 - added real inputs case - PM
This is the assembly routine for the complex radix-2 C-callable FFT on the TigerSHARC
family of DSPs.
I. Description of Calling.
1. Inputs:
j4 -> input (ping-pong buffer 1)
j5 -> ping-pong buffer 1
j6 -> ping-pong buffer 2
j7 -> output
j27+0x18 -> N = Number of points
j27+0x19 -> REAL or COMPLEX
2. C-Calling Example:
fft32(&(input), &(ping_pong_buffer1), &(ping_pong_buffer2), &(output), N, COMPLEX);
3. Limitations:
a. All buffers must be aligned on a memory boundary which is a multiple of 4.
b. N must be between 32 and MAX_FFT_SIZE.
c. If memory space savings are required and input does not have to be
preserved, ping_pong_buffer1 can be the same buffer as input.
d. If memory space savings are required, output can be the same buffer
as ping_pong_buffer2 if the number of FFT stages is even (i.e.
Log2(N) is even) or the same as ping_pong_buffer1 if the number of
FFT stages is odd (i.e. Log2(N) is odd).
4. MAX_FFT_SIZE can be selected via #define. Larger values allow for more choices
of N, but the twiddles will occupy more memory.
5. This C-callable function can process up to 64K points on the TS201
(16K points on the TS101), because the C environment itself occupies memory.
Therefore, if more input points are necessary, assembly-language development
may become a must. On the TS201, a block of memory is 128K words long, so the
maximum N is 128K real points or 64K complex points. The TS101 contains
only 2 blocks of data memory of 64K words each, and 4 buffers must be
accommodated. Therefore, the maximum N is 32K real words or 16K complex words.
II. Description of the FFT algorithm.
1. The input data is treated as complex interleaved N-point.
2. Due to re-ordering, no stage can be done in-place.
3. The bit reversal and the first two stages are combined into
a single loop. This loop takes data from input and stores it
in the ping-pong buffer1.
4. Each subsequent stage ping-pongs the data between the two ping-pong
buffers. The last stage uses FFT output buffer for its output.
5. Although the FFT is designed to be called with any point size
N <= MAX_FFT_SIZE by subsampling the twiddle factors, for ADSP-TS20x
processors, the best cycle optimization is achieved when MAX_FFT_SIZE=N.
For ADSP-TS101 all choices of MAX_FFT_SIZE are equally optimal.
III. Description of the REAL FFT algorithm.
1. The input data is treated as complex interleaved N/2-point. The N/2 point complex
FFT will be computed first. Thus, N is halved, now number of points = N/2.
2. Details and source code of the N/2 point complex FFT are in II above.
3. Real re-combine:
Here the complex N/2-point FFT computed in the previous steps is recombined to
produce the N-point real FFT. If G is the complex FFT and F is the real FFT,
the formula for F is given by (indices of G taken modulo N/2):
    F(k) = (G(k) + conj(G(N/2-k)))/2 - (j/2)*exp(-j*2*pi*k/N)*(G(k) - conj(G(N/2-k)))
This is very efficient on the TigerSHARC architecture due to the add/subtract
instruction.
IV. For all additional details regarding this algorithm and code, see EE-218
application note, available from the ADI web site.
*/
//************************************ Includes **********************************