Tuning C Source Code for the TigerSHARC®
DSP Compiler
DSP Tools Compiler Group, Analog Devices Inc., Revision 3, 26 Nov
2001
This document provides some guidelines for
obtaining the best code execution performance
from the TigerSHARC® DSP family’s C/C++
compiler using VisualDSP++™ release 2.0.
Use the Optimizer
There is a vast difference in the performance of C
code that has been compiled optimized and nonoptimized. In some cases optimized code can run
ten or twenty times faster. Optimization should
always be attempted before measuring
performance or shipping code as product. Note
that the default setting is for non-optimized
compilation, the non-optimized default being
there to assist programmers in diagnosing
problems with their initial coding.
The optimizer in the TigerSHARC DSP compiler
is designed to generate efficiently executing code
from C that has been written in a straightforward
manner. The basic strategy for tuning a program
is to present the algorithm in a way that gives the
optimizer excellent visibility of the operations and
data, hence the greatest freedom to safely
manipulate the code. Note that future releases will
enhance the optimizer, and expressing algorithms
simply will provide the best path for reaping the
benefits of such enhancements.
Use the Statistical Profiler
Tuning source begins with an understanding of
what areas of the application are the hot spots.
Statistical profiling provided in VisualDSP++ is
an excellent means for finding those hot spots. If
the application is unfamiliar to you, compile it
with diagnostics and run it unoptimized. This will
give you results that connect directly to the C
source. You will obtain a more accurate view of
your application if you build a fully optimized
application and obtain statistics that relate directly
to the assembly code. The only problem may be
in relating assembly lines to the original source.
Do not strip out function names when linking. If
you have the function names then you can scroll
the assembly window to locate the hot spots. In
very complicated code you can locate the exact
source lines by counting the loops.
Note: The compiler optimizer may have moved
code around.
Data Types
The compiler directly supports six scalar data
types.
int32-bit signed integer
unsigned32-bit unsigned integer
long long64-bit signed integer
unsigned long long64-bit unsigned integer
float32-bit floating point
long double64-bit floating point
The standard C types, char, short and long, in their
signed and unsigned forms are also supported for
compatibility, but they are all implemented as 32bit integers.
double is supported and in the default
mode implemented as a 32-bit floating-point
value.
long double arithmetic operations are implemented
by library routines and consequently are far slower
than operations on
float. This data type should
only be used where the algorithm requires the
long double’s greater range and precision, and
performance is not at a premium.
Copyright 2001, Analog Devices, Inc. All rights reserved. Analog Devices assumes no responsibility for customer product design or the use or a pplication of customers’ products or for any
infringements of patents or rights of others which may result from Analog Devices assistance. All trademarks and logos are property of their respective holders. Information furnished by
Analog Devices Applications and Development Tools Engineers is believed to be accurate and reliable, however no responsibility is assumed by Analog Devices regarding the technical
accuracy of the content provided in all Analog Devices’ Engineer-to-Engineer Notes.
Most operations on long long are supported
directly by the hardware, but multiplication and
division are not. This type is often useful for bit
manipulation algorithms because of the number of
bits that can be processed in each operation.
Avoid Division in Loops
The hardware does not provide direct support for
32-bit integer and floating point division, so the
division and modulus operations on
int and float
are also expensive. The compiler will convert an
integer division by a power of two to a shift
operation if it actually knows the value of the
divisor. General rule: Do not divide inside a loop.
Using 16-bit and 8-bit Data Types
Other data types supported by the hardware such
as short vectors of 16-bit integers, 8-bit integers
and 16-bit fixed-point complex are not directly
supported by the compiler, but access to the
instructions are available through intrinsic
functions.
Appendix B gives a comprehensive list of
intrinsic functions recognized by the compiler.
Indexed Arrays vs. Pointers
C allows you to program data accesses from an
array in two ways: either by indexing off an
invariant base pointer or by incrementing a
pointer.
These two versions of vector addition illustrate
the two styles:
Indexed Array:
void va_ind(int a[], int b[], int out[],
int n) {
int i;
for(i=0;i<n;++i)
out[i] = a[i] + b[i];
}
Pointers:
void va_ptr(int a[], int b[], int out[],
int n) {
int i, *pout = p, *pa = a, *pb = b;
for(i=0;i<n;++i)
*pout++ = *pa++ + *pb++;
}
It should be noted that 16-bit operations are often
more efficient than 32-bit ones. However, the
compiler will not generate a 16-bit operation
where you wrote a 32-bit one, even if it is obvious
to you that the 16-bit operation would generate the
same result. To get 16-bit operations you need to
use an intrinsic function.
This code snippet shows scalar product written
with 4x16 short vector intrinsics. It illustrates the
recommended style for code written with intrinsic
functions.
typedef long long int4x16;
#define add(x,y) __builtin_add_4x16(x,y)
#define mult(x,y) __builtin_mult_i4x16(x,y)
#define sum(x) __builtin_sum_4x16(x)
int sp4x16(int4x16 a[], int4x16 b[], int n)
{
int i;
int4x16 sum4 = 0;
for (i = 0; i < n/4; ++i)
sum4 = add(sum4, mult(a[i], b[i]));
return sum(sum4);
}
The style should not make any difference to the
generated code, but sometimes it does. Often one
version of an algorithm will do better than the
other but it is not always the same style that is
better; the generated code is affected by the
surrounding code, which is why there may be
differences. The pointer style introduces
additional variables that compete with the
surrounding code for resources during the
optimizer’s analysis. Array accesses, on the other
hand, must be transformed to pointers by the
compiler and sometimes it does not do the job as
well as you could do by hand.
The best strategy is to start with array notation. If
this looks unsatisfactory try using pointers.
Outside the important loops use the indexed style,
as it is easier to understand.
EE-147Page 2
Technical Notes on using Analog Devices’ DSP components and development tools
To ensure the best performance the optimizer
often needs to know things that can only be
determined by looking outside the routine it is
working on. In particular it helps to know the
alignment and value of pointer parameters and the
value of loop bounds.
The –ipa compiler switch enables inter-procedural
analysis which makes this information available.
This may be switched on from the IDDE by
checking the Interprocedural Optimization box in
the Compile tab of the Project Options dialogue
selected from the project menu.
two values and will not consider it to have a
constant value.
Bad: (IPA cannot see val is a constant)
#include <stdio.h>
static int val;
void init() {
val=3;
}
void func() {
printf("val %d",val);
}
int main() {
init();
func();
}
When this switch is used the compiler may be
called again from the link phase to recompile the
program using additional information obtained
during previous compilations.
Note: Because it only operates at link time, the
effects of –ipa will not be seen if you compile
with the –S switch. To see the assembler file put
–save-temps in the Additional Options text box in
the Compile tab of the Project Options dialogue
and look at the .s file produced after your program
has been built.
Much of the following advice assumes that the
–ipa switch is being used.
Initialize Constants Statically
Inter-procedural analysis will also identify
variables that only have one value and replace
them with constants that can enable better
optimization. For this to happen a variable must
have only a single value throughout the program.
If the variable is statically initialized to zero, as all
global variables are by default, and assigned to at
one other point in the program the analysis sees
Good: (IPA knows val is 3)
#include <stdio.h>
static int val = 3;
void init() {
}
void func() {
printf("val %d",val);
}
int main() {
init();
func();
}
Loop Guidelines
Appendix A gives an overview of how the
optimizer transforms a loop to generate highly
efficient code. It describes the “loop unrolling”
optimization technique.
Do not unroll loops yourself
Not only does loop unrolling make the program
harder to read but it also prevents optimization.
The compiler must be able to unroll the loop itself
in order to use both compute blocks automatically.
In this example the first version of the loop runs
two and a half times faster than the second, in
cases where inter-procedural analysis can
EE-147Page 3
Technical Notes on using Analog Devices’ DSP components and development tools
determine that the initial values of a, b, and c are
quad-word aligned and
n is a multiple of four.
Good: (Compiler unrolls and uses both
compute blocks)
void va1(int a[], int b[], int c[], int n)
{
int i;
for(i=0;i<n;++i) {
c[i] = b[i] + a[i];
}
}
Bad: (Compiler leaves on a single compute
block)
Good: (a reduction)
for(i=0;i<n;++i)
x=x+a[i] * b[i];
In the first case, the scalar dependency is the
subtraction operation: the variable x changes in a
manner that will give different results if the
iterations are performed out of order. In contrast,
in the second case, the properties of the addition
operator mean that the compiler can perform the
operations in any order, and still get the same
result.
void va2(int a[], int b[], int c[], int n)
{
int xa, xb, xc, ya, yb, yc;
int i;
for(i=0;i<n;i+=2) {
xb = b[i]; yb = b[i+1];
xa = a[i]; ya = a[i+1];
xc=xa+xb;yc=ya+yb;
c[i] = xc; c[i+1] = yc;
}
}
Avoid loop carried dependencies
A loop-carried dependency is where computations
in a given iteration of a loop cannot be completed
without knowing the values calculated in earlier
iterations. When a loop has such dependencies the
compiler cannot overlap loop iterations.
Some dependencies are caused by scalar variables
that are used before they are defined in a single
iteration.
Bad: (scalar dependency)
Do not rotate loops by hand
Loops in DSP code are often “rotated” by hand,
attempting to do loads and stores from earlier and
future iterations at the same time as computation
from the current iteration. This technique
introduces loop-carried dependencies that prevent
the compiler from rearranging the code
effectively. It is better to give the compiler a
“normalized” version, and leave the rotating to the
compiler.
Bad: (rotated)
float ss(float *a, float *b, int n) {
float ta, tb , sum = 0.0f;
inti=0;
ta = a[i]; tb = b[i];
for(i=1;i<n;i++) {
sum += ta + tb;
ta = a[i]; tb = b[i];
}
sum += ta + tb;
return sum;
}
for(i=0;i<n;++i)
x=a[i]-x;
An optimizer can reorder iterations in the
presence of the class of scalar dependencies
known as reductions. These are loops that reduce
a vector of values to a scalar using an associative
By rotating the loop, the scalar variables ta and tb
have been added, and they have introduced loopcarried dependencies which prevent the compiler
from issuing iterations in parallel. The optimizer
is capable of doing this kind of loop rotation
itself.
and commutative operator. The most common
example is multiply and accumulate.
EE-147Page 4
Technical Notes on using Analog Devices’ DSP components and development tools