Analog Devices EE-147 Application Notes

a Engineer To Engineer Note EE-147
Technical Notes on using Analog Devices’ DSP components and development tools
Phone: (800) ANALOG-D, FAX: (781) 461-3010, EMAIL: dsp.support@analog.com, FTP: ftp.analog.com, WEB: www.analog.com/dsp
Tuning C Source Code for the TigerSHARC® DSP Compiler
DSP Tools Compiler Group, Analog Devices Inc., Revision 3, 26 Nov 2001
This document provides some guidelines for obtaining the best code execution performance from the TigerSHARC® DSP family’s C/C++ compiler using VisualDSP++™ release 2.0.

Use the Optimizer

There is a vast difference in the performance of C code that has been compiled optimized and non­optimized. In some cases optimized code can run ten or twenty times faster. Optimization should always be attempted before measuring performance or shipping code as product. Note that the default setting is for non-optimized compilation, the non-optimized default being there to assist programmers in diagnosing problems with their initial coding.
The optimizer in the TigerSHARC DSP compiler is designed to generate efficiently executing code from C that has been written in a straightforward manner. The basic strategy for tuning a program is to present the algorithm in a way that gives the optimizer excellent visibility of the operations and data, hence the greatest freedom to safely manipulate the code. Note that future releases will enhance the optimizer, and expressing algorithms simply will provide the best path for reaping the benefits of such enhancements.

Use the Statistical Profiler

Tuning source begins with an understanding of what areas of the application are the hot spots. Statistical profiling provided in VisualDSP++ is an excellent means for finding those hot spots. If
the application is unfamiliar to you, compile it with diagnostics and run it unoptimized. This will give you results that connect directly to the C source. You will obtain a more accurate view of your application if you build a fully optimized application and obtain statistics that relate directly to the assembly code. The only problem may be in relating assembly lines to the original source. Do not strip out function names when linking. If you have the function names then you can scroll the assembly window to locate the hot spots. In very complicated code you can locate the exact source lines by counting the loops.
Note: The compiler optimizer may have moved code around.

Data Types

The compiler directly supports six scalar data types.
int 32-bit signed integer unsigned 32-bit unsigned integer long long 64-bit signed integer unsigned long long 64-bit unsigned integer float 32-bit floating point long double 64-bit floating point
The standard C types, char, short and long, in their signed and unsigned forms are also supported for compatibility, but they are all implemented as 32­bit integers.
double is supported and in the default
mode implemented as a 32-bit floating-point value.
long double arithmetic operations are implemented
by library routines and consequently are far slower than operations on
float. This data type should
only be used where the algorithm requires the
long double’s greater range and precision, and
performance is not at a premium.
Copyright 2001, Analog Devices, Inc. All rights reserved. Analog Devices assumes no responsibility for customer product design or the use or a pplication of customers’ products or for any infringements of patents or rights of others which may result from Analog Devices assistance. All trademarks and logos are property of their respective holders. Information furnished by Analog Devices Applications and Development Tools Engineers is believed to be accurate and reliable, however no responsibility is assumed by Analog Devices regarding the technical accuracy of the content provided in all Analog Devices’ Engineer-to-Engineer Notes.
Most operations on long long are supported directly by the hardware, but multiplication and division are not. This type is often useful for bit manipulation algorithms because of the number of bits that can be processed in each operation.

Avoid Division in Loops

The hardware does not provide direct support for 32-bit integer and floating point division, so the division and modulus operations on
int and float
are also expensive. The compiler will convert an integer division by a power of two to a shift operation if it actually knows the value of the divisor. General rule: Do not divide inside a loop.

Using 16-bit and 8-bit Data Types

Other data types supported by the hardware such as short vectors of 16-bit integers, 8-bit integers and 16-bit fixed-point complex are not directly supported by the compiler, but access to the instructions are available through intrinsic functions.
Appendix B gives a comprehensive list of intrinsic functions recognized by the compiler.

Indexed Arrays vs. Pointers

C allows you to program data accesses from an array in two ways: either by indexing off an invariant base pointer or by incrementing a pointer.
These two versions of vector addition illustrate the two styles:
Indexed Array:
void va_ind(int a[], int b[], int out[], int n) {
int i; for(i=0;i<n;++i)
out[i] = a[i] + b[i];
}
Pointers:
void va_ptr(int a[], int b[], int out[], int n) {
int i, *pout = p, *pa = a, *pb = b; for(i=0;i<n;++i)
*pout++ = *pa++ + *pb++;
}
It should be noted that 16-bit operations are often more efficient than 32-bit ones. However, the compiler will not generate a 16-bit operation where you wrote a 32-bit one, even if it is obvious to you that the 16-bit operation would generate the same result. To get 16-bit operations you need to use an intrinsic function.
This code snippet shows scalar product written with 4x16 short vector intrinsics. It illustrates the recommended style for code written with intrinsic functions.
typedef long long int4x16; #define add(x,y) __builtin_add_4x16(x,y) #define mult(x,y) __builtin_mult_i4x16(x,y) #define sum(x) __builtin_sum_4x16(x)
int sp4x16(int4x16 a[], int4x16 b[], int n) {
int i; int4x16 sum4 = 0; for (i = 0; i < n/4; ++i)
sum4 = add(sum4, mult(a[i], b[i]));
return sum(sum4);
}
The style should not make any difference to the generated code, but sometimes it does. Often one version of an algorithm will do better than the other but it is not always the same style that is better; the generated code is affected by the surrounding code, which is why there may be differences. The pointer style introduces additional variables that compete with the surrounding code for resources during the optimizer’s analysis. Array accesses, on the other hand, must be transformed to pointers by the compiler and sometimes it does not do the job as well as you could do by hand.
The best strategy is to start with array notation. If this looks unsatisfactory try using pointers. Outside the important loops use the indexed style, as it is easier to understand.
EE-147 Page 2
Technical Notes on using Analog Devices’ DSP components and development tools
Phone: (800) ANALOG-D, FAX: (781)461-3010, EMAIL: dsp.support@analog.com, FTP: ftp.analog.com, WEB: www.analog.com/dsp

Use the –ipa Switch

To ensure the best performance the optimizer often needs to know things that can only be determined by looking outside the routine it is working on. In particular it helps to know the alignment and value of pointer parameters and the value of loop bounds.
The –ipa compiler switch enables inter-procedural analysis which makes this information available.
This may be switched on from the IDDE by checking the Interprocedural Optimization box in the Compile tab of the Project Options dialogue selected from the project menu.
two values and will not consider it to have a constant value.
Bad: (IPA cannot see val is a constant)
#include <stdio.h>
static int val;
void init() {
val=3;
}
void func() {
printf("val %d",val);
}
int main() {
init(); func();
}
When this switch is used the compiler may be called again from the link phase to recompile the program using additional information obtained during previous compilations.
Note: Because it only operates at link time, the effects of –ipa will not be seen if you compile with the –S switch. To see the assembler file put –save-temps in the Additional Options text box in the Compile tab of the Project Options dialogue and look at the .s file produced after your program has been built.
Much of the following advice assumes that the –ipa switch is being used.

Initialize Constants Statically

Inter-procedural analysis will also identify variables that only have one value and replace them with constants that can enable better optimization. For this to happen a variable must have only a single value throughout the program. If the variable is statically initialized to zero, as all global variables are by default, and assigned to at one other point in the program the analysis sees
Good: (IPA knows val is 3)
#include <stdio.h>
static int val = 3;
void init() { }
void func() {
printf("val %d",val);
}
int main() {
init(); func();
}

Loop Guidelines

Appendix A gives an overview of how the
optimizer transforms a loop to generate highly efficient code. It describes the “loop unrolling” optimization technique.

Do not unroll loops yourself

Not only does loop unrolling make the program harder to read but it also prevents optimization. The compiler must be able to unroll the loop itself in order to use both compute blocks automatically.
In this example the first version of the loop runs two and a half times faster than the second, in cases where inter-procedural analysis can
EE-147 Page 3
Technical Notes on using Analog Devices’ DSP components and development tools
Phone: (800) ANALOG-D, FAX: (781)461-3010, EMAIL: dsp.support@analog.com, FTP: ftp.analog.com, WEB: www.analog.com/dsp
determine that the initial values of a, b, and c are quad-word aligned and
n is a multiple of four.
Good: (Compiler unrolls and uses both compute blocks)
void va1(int a[], int b[], int c[], int n) {
int i; for(i=0;i<n;++i) {
c[i] = b[i] + a[i];
}
}
Bad: (Compiler leaves on a single compute block)
Good: (a reduction)
for(i=0;i<n;++i)
x=x+a[i] * b[i];
In the first case, the scalar dependency is the subtraction operation: the variable x changes in a manner that will give different results if the iterations are performed out of order. In contrast, in the second case, the properties of the addition operator mean that the compiler can perform the operations in any order, and still get the same result.
void va2(int a[], int b[], int c[], int n) {
int xa, xb, xc, ya, yb, yc; int i; for(i=0;i<n;i+=2) {
xb = b[i]; yb = b[i+1]; xa = a[i]; ya = a[i+1]; xc=xa+xb;yc=ya+yb; c[i] = xc; c[i+1] = yc;
}
}

Avoid loop carried dependencies

A loop-carried dependency is where computations in a given iteration of a loop cannot be completed without knowing the values calculated in earlier iterations. When a loop has such dependencies the compiler cannot overlap loop iterations.
Some dependencies are caused by scalar variables that are used before they are defined in a single iteration.
Bad: (scalar dependency)

Do not rotate loops by hand

Loops in DSP code are often “rotated” by hand, attempting to do loads and stores from earlier and future iterations at the same time as computation from the current iteration. This technique introduces loop-carried dependencies that prevent the compiler from rearranging the code effectively. It is better to give the compiler a “normalized” version, and leave the rotating to the compiler.
Bad: (rotated)
float ss(float *a, float *b, int n) {
float ta, tb , sum = 0.0f; inti=0; ta = a[i]; tb = b[i]; for(i=1;i<n;i++) {
sum += ta + tb;
ta = a[i]; tb = b[i]; } sum += ta + tb; return sum;
}
for(i=0;i<n;++i)
x=a[i]-x;
An optimizer can reorder iterations in the presence of the class of scalar dependencies known as reductions. These are loops that reduce a vector of values to a scalar using an associative
By rotating the loop, the scalar variables ta and tb have been added, and they have introduced loop­carried dependencies which prevent the compiler from issuing iterations in parallel. The optimizer is capable of doing this kind of loop rotation itself.
and commutative operator. The most common example is multiply and accumulate.
EE-147 Page 4
Technical Notes on using Analog Devices’ DSP components and development tools
Phone: (800) ANALOG-D, FAX: (781)461-3010, EMAIL: dsp.support@analog.com, FTP: ftp.analog.com, WEB: www.analog.com/dsp
Loading...
+ 7 hidden pages