Technical Notes on using Analog Devices' DSP components and development tools
Contact our technical support by phone: (800) ANALOG-D or e-mail: dsp.support@analog.com
Or visit our on-line resources http://www.analog.com/dsp and http://www.analog.com/dsp/EZAnswer
Tuning C Source Code for the Blackfin® Processor Compiler
Contributed by DSP Tools Compiler Group May 26, 2003
Introduction
This document provides some guidelines for
obtaining the best code execution performance
from the Blackfin® processor family’s C/C++
compiler using VisualDSP++™ release 2.0.
Use the optimizer
There is a vast difference in performance between
C code compiled with optimization and C code
compiled without it. In some cases optimized code
can run ten or twenty times faster. Optimization
should always be attempted before measuring
performance or shipping code as product. Note
that the default setting is non-optimized
compilation, which is intended to assist
programmers in diagnosing problems with their
initial coding.
The optimizer in the Blackfin processor compiler
is designed to generate efficiently-executing
code from C which has been written in a
straightforward manner. The basic strategy for
tuning a program is to present the algorithm in a
way that gives the optimizer excellent visibility
of the operations and data, hence the greatest
freedom to safely manipulate the code. Note that
future releases will enhance the optimizer, and
expressing algorithms simply will provide the
best path for reaping the benefits of such
enhancements.
Use the Statistical Profiler
Tuning source begins with an understanding of
what areas of the application are the hot spots.
Statistical profiling provided in VisualDSP++ is
an excellent means for finding those hot spots.
If the application is unfamiliar to you, compile it
with diagnostics and run it unoptimized. This
will give you results that connect directly to the
C source. You will obtain a more accurate view
of your application if you build a fully optimized
application and obtain statistics that relate
directly to the assembly code. The only problem
may be in relating assembly lines to the original
source. Do not strip out function names when
linking. If you have the function names then you
can scroll the assembly window to locate the hot
spots. In very complicated code you can locate
the exact source lines by counting the loops –
unless they are unrolled. Look at the line
numbers in the .s file. Note that the compiler
optimizer may have moved code around.
Data Types
char
unsigned char
short
unsigned short
int
unsigned int
long
unsigned long
8-bit signed integer
8-bit unsigned integer
16-bit signed integer
16-bit unsigned integer
32-bit signed integer
32-bit unsigned integer
32-bit signed integer
32-bit unsigned integer
Table 1: Fixed-Point Data Types (Native Arithmetic)
Copyright 2003, Analog Devices, Inc. All rights reserved. Analog Devices assumes no responsibility for customer product design or the use or application of
customers’ products or for any infringements of patents or rights of others which may result from Analog Devices assistance. All trademarks and logos are property
of their respective holders. Information furnished by Analog Devices Applications and Development Tools Engineers is believed to be accurate and reliable, however
no responsibility is assumed by Analog Devices regarding technical accuracy and topicality of the content provided in Analog Devices’ Engineer-to-Engineer Notes.
The compiler directly supports ten scalar data
types as shown in Table 1 and Table 2. double is
equivalent to float on Blackfin processors, since
64-bit values are not supported directly on the
hardware.
float             32-bit floating point
double            32-bit floating point
Table 2: Floating-Point Data Types (Emulated Arithmetic)
Fractional data types can be represented as either
short or int. Manipulation of these types is best
done by using intrinsics, which will be described
in a subsequent section.
Avoid Float/Double Arithmetic
Floating-point arithmetic operations are
implemented by library routines and,
consequently, are far slower than integer
operations. An arithmetic floating-point
operation inside a loop will prevent the optimizer
from using a hardware loop.
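As a rough illustration of avoiding emulated floating point, a gain such as 0.75 can be held as a 16-bit fraction and applied with integer arithmetic only. The sketch below is not taken from this note: the names scale_q15 and scale_vec_q15 and the Q15 scaling (gain times 32768) are hypothetical, the result is not saturated, and the code assumes that a right shift of a negative int is an arithmetic shift, which is implementation-defined in C. It does not use the fractional intrinsics.

```c
/* Hypothetical helper: multiply one 16-bit sample by a Q15 gain
   (gain_q15 = gain * 32768, e.g. 24576 for 0.75). */
short scale_q15(short x, short gain_q15)
{
    /* 16 x 16 -> 32-bit product, then drop the 15 fraction bits.
       Assumes >> on a negative int is an arithmetic shift. */
    int product = (int)x * (int)gain_q15;
    return (short)(product >> 15);
}

/* Apply the same gain to n samples using integer operations only.
   The multiply is written inline so the loop body contains no call. */
void scale_vec_q15(const short in[], short out[], int n, short gain_q15)
{
    int i;
    for (i = 0; i < n; ++i)
        out[i] = (short)(((int)in[i] * (int)gain_q15) >> 15);
}
```

For example, scale_vec_q15(in, out, n, 24576) would scale each sample by 0.75 without calling a floating-point library routine.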
Avoid Integer Division in Loops
The hardware does not provide direct support for
32-bit integer division, so the division and
modulus operations on int variables are multi-cycle
operations. The compiler will convert an
integer division by a power of two to a right-shift
operation if the value of the divisor is known.
If the compiler has to issue a full division
operation, it will issue a call to a library function.
In addition to being a multi-cycle operation, this
will prevent the optimizer from using a hardware
loop for any loops around the division.
Whenever possible, do not use divide or modulus
operators inside a loop.
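One common way to follow this advice, sketched here under assumptions, is to replace a modulus used for index wrap-around with a compare and reset. The function names ring_sum_mod and ring_sum_wrap are hypothetical, both assume 0 <= start < len, and whether the second form actually yields a hardware loop depends on the compiler and the surrounding code.

```c
/* Sum n samples from a circular buffer of length len, starting at
   index start. The modulus runs on every iteration. */
int ring_sum_mod(const short *buf, int len, int start, int n)
{
    int i, sum = 0;
    for (i = 0; i < n; ++i)
        sum += buf[(start + i) % len];
    return sum;
}

/* Same result without a divide or modulus in the loop body:
   the index wraps with a compare and reset. */
int ring_sum_wrap(const short *buf, int len, int start, int n)
{
    int i, idx = start, sum = 0;
    for (i = 0; i < n; ++i) {
        sum += buf[idx];
        if (++idx == len)
            idx = 0;
    }
    return sum;
}
```

When len is known to be a power of two, the wrap can also be written as a mask, (start + i) & (len - 1), which likewise avoids a division.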
Indexed Arrays versus Pointers
C allows you to program data accesses from an
array in two ways: either by indexing from an
invariant base pointer or by incrementing a
pointer. These two versions of the vector
addition illustrate the two styles:
void va_ind( short a[], short b[],
             short out[], int n)
{
    int i;
    for (i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
Listing 1: Indexed Arrays
void va_ptr( short a[], short b[],
             short out[], int n)
{
    int i;
    short *pout = out, *pa = a, *pb = b;
    for (i = 0; i < n; ++i)
        *pout++ = *pa++ + *pb++;
}
Listing 2: Pointers
Intuition might suggest that the chosen
style should not make any difference to the
generated code, but sometimes it does. Often,
one version of an algorithm will generate better
optimized code than the other, but it is not
always the same style that is better; the generated
code is affected by the surrounding code, which
is why there may be differences. The pointer
style introduces additional variables that compete
with the surrounding code for resources during
the optimizer’s analysis. Array accesses, on the
other hand, must be transformed to pointers by
the compiler, and sometimes it does not do the
job as well as you could do by hand.
The best strategy is to start with array notation. If
the generated code looks unsatisfactory, try using
pointers. Outside the important loops, use the
indexed style, because it is easier to understand.
Use the -ipa Switch
Tuning C Source Code for the Blackfin® Processor Compiler (EE-149) Page 2 of 10
To ensure the best performance, the optimizer
often needs to know things that can only be
determined by looking outside the routine which
it is working on. In particular, it helps to know
the alignment and value of pointer parameters
and the value of loop bounds. The -ipa compiler
switch enables inter-procedural analysis (IPA),
which makes this information available. This
may be switched on from the IDDE by checking
the Interprocedural Optimization box in the Compile
tab of the Project Options dialogue selected from
the Project menu.
When this switch is used the compiler may be
called again from the link phase to recompile the
program using additional information obtained
during previous compilations.
Because it only operates at link time, the
effects of -ipa will not be seen if you
compile with the -S switch. To see the
assembler file, put -save-temps in the
Additional Options text box in the Compile
tab of the Project Options dialogue and
look at the .s file produced after your
program has been built.
Much of the following advice assumes that the
-ipa switch is being used.
Initialize Constants Statically
Inter-procedural analysis will also identify
variables that only have one value and replace
them with constants, which can enable better
optimization. For this to happen, a variable must
have a single value throughout the program.
#include <stdio.h>
static int val = 3;   // initialized once
void init() {
}
void func() {
    printf("val %d", val);
}
int main() {
    init();
    func();
    return 0;
}
Listing 3: Optimal (IPA knows val is 3)
If the variable is statically initialized to zero, as
all global variables are by default, and is
subsequently assigned to some other value at
another point in the program, then the analysis
sees two values and will not consider the variable
to have a constant value.
#include <stdio.h>
static int val;       // initialized to zero
void init() {
    val = 3;          // re-assigned
}
void func() {
    printf("val %d", val);
}
int main() {
    init();
    func();
    return 0;
}
Listing 4: Non-optimal (IPA cannot see that val is a constant)
Word-align Your Data
To make most efficient use of the hardware, it
must be kept fed with data. In many algorithms,
the balance of data accesses to computations is
such that, to keep the hardware fully utilized,
data must be fetched with 32-bit loads.
Although the Blackfin architecture supports byte
addressing, the hardware requires that references
to memory be naturally aligned. Thus, 16-bit
references must be at even address locations, and
32-bit at word-aligned addresses. So, for the
most efficient code to be generated, you should
ensure that data are word-aligned.
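During host-side testing, a small check like the following can confirm that a buffer starts on a 32-bit boundary. This is only a sketch: is_word_aligned is a hypothetical helper, not part of the compiler or its libraries, and it assumes the toolchain in use provides <stdint.h> with uintptr_t.

```c
#include <stdint.h>

/* Hypothetical helper: returns nonzero when p lies on a 4-byte
   (word) boundary, i.e. its two low address bits are zero. */
int is_word_aligned(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}
```

An address that is only 16-bit aligned, such as the second element of a word-aligned short array, would fail this check.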
The compiler helps establish the alignment of
array data. The stack frames are kept word-aligned.
Top-level arrays are allocated at word-aligned
addresses, regardless of their data types.
If you write programs that only pass the address
of the first element of an array as a parameter
and loops that process input arrays an element at
a time, starting at element zero, then
inter-procedural analysis should be able to establish
that the alignment is suitable for 32-bit accesses.
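The calling pattern described above might look like the following sketch. All names (samples, sum_samples, sum_all, sum_tail) and the size N are hypothetical, and what the analysis actually concludes depends on the compiler release; the second caller is included only to show the kind of call that passes an interior address.

```c
#define N 64

/* File-scope (top-level) array, which the text above says is
   allocated at a word-aligned address. */
static short samples[N];

/* Processes the input one element at a time, starting at element zero. */
int sum_samples(const short a[], int n)
{
    int i, sum = 0;
    for (i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}

/* Passes the address of the first element and a constant bound:
   the pattern the text describes as suitable for the analysis. */
int sum_all(void)
{
    return sum_samples(samples, N);
}

/* Passes an interior address. &samples[1] is only 16-bit aligned, so
   a caller like this may prevent the analysis from assuming that the
   parameter is always word-aligned. */
int sum_tail(void)
{
    return sum_samples(&samples[1], N - 1);
}
```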
Where the inner loop processes a single row of a
multi-dimensional array, be certain that each row