Analog Devices EE149 Application Notes

Engineer To Engineer Note EE-149
Technical Notes on using Analog Devices' DSP components and development tools
Contact our technical support by phone: (800) ANALOG-D or e-mail: dsp.support@analog.com. Or visit our on-line resources http://www.analog.com/dsp and http://www.analog.com/dsp/EZAnswer
Tuning C Source Code for the Blackfin® Processor Compiler
Contributed by DSP Tools Compiler Group May 26, 2003

Introduction

This document provides some guidelines for obtaining the best code execution performance from the Blackfin® processor family’s C/C++ compiler using VisualDSP++™ release 2.0.

Use the optimizer

There is a vast difference in performance between C code compiled with optimization and C code compiled without it. In some cases optimized code can run ten or twenty times faster. Always enable optimization before measuring performance or shipping code as a product. Note that the default setting is non-optimized compilation, which is intended to help programmers diagnose problems in their initial coding.
The optimizer in the Blackfin processor compiler is designed to generate efficiently executing code from C that has been written in a straightforward manner. The basic strategy for tuning a program is to present the algorithm in a way that gives the optimizer excellent visibility of the operations and data, and hence the greatest freedom to safely manipulate the code. Note that future releases will enhance the optimizer, and expressing algorithms simply will provide the best path for reaping the benefits of such enhancements.

Use the Statistical Profiler

Tuning source begins with an understanding of what areas of the application are the hot spots. Statistical profiling provided in VisualDSP++ is an excellent means for finding those hot spots.
If the application is unfamiliar to you, compile it with diagnostics and run it unoptimized. This will give you results that connect directly to the C source. However, you will obtain a more accurate view of your application if you build a fully optimized application and obtain statistics that relate directly to the assembly code. The only problem may be in relating assembly lines to the original source. Do not strip out function names when linking. If you have the function names, you can scroll the assembly window to locate the hot spots. In very complicated code you can locate the exact source lines by counting the loops, unless they are unrolled. Look at the line numbers in the .s file. Note that the compiler optimizer may have moved code around.

Data Types

Type              Description
char              8-bit signed integer
unsigned char     8-bit unsigned integer
short             16-bit signed integer
unsigned short    16-bit unsigned integer
int               32-bit signed integer
unsigned int      32-bit unsigned integer
long              32-bit signed integer
unsigned long     32-bit unsigned integer
Table 1: Fixed-Point Data Types (Native Arithmetic)
Copyright 2003, Analog Devices, Inc. All rights reserved. Analog Devices assumes no responsibility for customer product design or the use or application of customers’ products or for any infringements of patents or rights of others which may result from Analog Devices assistance. All trademarks and logos are property of their respective holders. Information furnished by Analog Devices Applications and Development Tools Engineers is believed to be accurate and reliable, however no responsibility is assumed by Analog Devices regarding technical accuracy and topicality of the content provided in Analog Devices’ Engineer-to-Engineer Notes.
The compiler directly supports ten scalar data types as shown in Table 1 and Table 2. double is equivalent to float on Blackfin processors, since 64-bit values are not supported directly on the hardware.
Type              Description
float             32-bit floating point
double            32-bit floating point
Table 2: Floating-Point Data Types (Emulated Arithmetic)
Fractional data types can be represented as either short or int. Manipulation of these types is best done by using intrinsics, which will be described in a subsequent section.

Avoid Float/Double Arithmetic

Floating-point arithmetic operations are implemented by library routines and, consequently, are far slower than integer operations. An arithmetic floating-point operation inside a loop will prevent the optimizer from using a hardware loop.
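As an illustration of the cost being described, the sketch below scales a 16-bit sample by 0.75 in two ways. The function names and the choice of a Q15 constant are assumptions made for this example only; they do not come from the compiler or its libraries, and fractional intrinsics may be a better fit for real fractional arithmetic.

```c
#include <stdint.h>

/* Floating-point version: with emulated arithmetic, the conversions and
   the multiply are each carried out by library routines. */
int16_t scale_float(int16_t x)
{
    return (int16_t)(x * 0.75f);
}

/* Integer version: 0.75 is held as the Q15 constant 24576
   (0.75 * 32768), so only a multiply and a shift are needed.
   Hypothetical helper, shown for non-negative inputs. */
int16_t scale_q15(int16_t x)
{
    return (int16_t)(((int32_t)x * 24576) >> 15);
}
```

Both functions return 750 for an input of 1000. For negative inputs the two can differ by one, because the float conversion truncates toward zero while the result of right-shifting a negative value is implementation-defined in C.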

Avoid Integer Division in Loops

The hardware does not provide direct support for 32-bit integer division, so the division and modulus operations on int variables are multi-cycle operations. The compiler will convert an integer division by a power of two to a right-shift operation if the value of the divisor is known.
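A minimal sketch of the distinction, using hypothetical function names: in the first function the divisor is a constant power of two that the compiler can see, and in the second it is only known at run time. Unsigned operands are used here to keep the example simple, since signed division by a power of two rounds toward zero and so needs a small correction in addition to the shift.

```c
#include <stdint.h>

/* Divisor is a visible power of two: a candidate for a right shift by 3. */
uint32_t div_by_8(uint32_t x)
{
    return x / 8u;
}

/* Divisor is unknown at compile time: a full division is required. */
uint32_t div_by_n(uint32_t x, uint32_t n)
{
    return x / n;
}
```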
If the compiler has to issue a full division operation, it will issue a call to a library function. In addition to being a multi-cycle operation, this will prevent the optimizer from using a hardware loop for any loops around the division. Whenever possible, do not use divide or modulus operators inside a loop.
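One common source of a modulus inside a loop is circular-buffer indexing. The sketch below, which assumes 0 <= start < len and uses hypothetical names and demo data, shows the same sum written with a modulus on every iteration and with an explicit wrap-around test instead.

```c
/* Hypothetical demo data for the usage note below. */
static const short demo_ring[4] = { 1, 2, 3, 4 };

/* Modulus in the loop body: a full division on every iteration
   when len is not a known power of two. */
int sum_ring_mod(const short *buf, int len, int start, int n)
{
    int sum = 0;
    int i;
    for (i = 0; i < n; ++i)
        sum += buf[(start + i) % len];
    return sum;
}

/* Same result with the index wrapped by a comparison instead. */
int sum_ring_wrap(const short *buf, int len, int start, int n)
{
    int sum = 0;
    int idx = start;
    int i;
    for (i = 0; i < n; ++i) {
        sum += buf[idx];
        if (++idx == len)
            idx = 0;
    }
    return sum;
}
```

With demo_ring, a start index of 2 and six reads, both versions visit elements 2, 3, 0, 1, 2, 3 and return 17.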

Indexed Arrays versus Pointers

C allows you to program data accesses from an array in two ways: either by indexing from an invariant base pointer or by incrementing a pointer. These two versions of the vector addition illustrate the two styles:
void va_ind(short a[], short b[], short out[], int n)
{
    int i;
    for (i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
Listing 1: Indexed Arrays
void va_ptr(short a[], short b[], short out[], int n)
{
    int i;
    short *pout = out, *pa = a, *pb = b;
    for (i = 0; i < n; ++i)
        *pout++ = *pa++ + *pb++;
}
Listing 2: Pointers
Common thought might indicate that the chosen style should not make any difference to the generated code, but sometimes it does. Often, one version of an algorithm will generate better optimized code than the other, but it is not always the same style that is better; the generated code is affected by the surrounding code, which is why there may be differences. The pointer style introduces additional variables that compete with the surrounding code for resources during the optimizer’s analysis. Array accesses, on the other hand, must be transformed to pointers by the compiler, and sometimes it does not do the job as well as you could do by hand.
The best strategy is to start with array notation. If this looks unsatisfactory, try using pointers. Outside the important loops, use the indexed style, because it is easier to understand.

Use the -ipa Switch

To ensure the best performance, the optimizer often needs to know things that can only be determined by looking outside the routine which it is working on. In particular, it helps to know the alignment and value of pointer parameters and the value of loop bounds. The -ipa compiler switch enables inter-procedural analysis (IPA), which makes this information available. This may be switched on from the IDDE by checking the Interprocedural Optimization box in the Compile tab of the Project Options dialogue selected from the Project menu.
When this switch is used the compiler may be called again from the link phase to recompile the program using additional information obtained during previous compilations.
Because it only operates at link time, the effects of -ipa will not be seen if you compile with the -S switch. To see the assembler file put -save-temps in the Additional Options text box in the Compile tab of the Project Options dialogue and look at the .s file produced after your program has been built.
Much of the following advice assumes that the -ipa switch is being used.
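To make the kind of information involved more concrete, here is a small sketch with hypothetical names. Looking at halve() alone, nothing is known about where v points or how large n is. Looking at the whole program, the only call passes the start of a global array and a constant count, which is the sort of fact inter-procedural analysis is described as collecting.

```c
#define N_SAMPLES 64

/* Global array; its address and size are fixed for the whole program. */
short samples[N_SAMPLES];

/* In isolation: unknown pointer value, unknown alignment, unknown bound. */
void halve(short *v, int n)
{
    int i;
    for (i = 0; i < n; ++i)
        v[i] = (short)(v[i] >> 1);
}

/* The only call site: v is always samples and n is always 64. */
void process(void)
{
    halve(samples, N_SAMPLES);
}
```

Whether the compiler actually exploits these facts for a given program is something to confirm in the generated .s file rather than assume.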

Initialize Constants Statically

Inter-procedural analysis will also identify variables that only have one value and replace them with constants, which can enable better optimization. For this to happen, a variable must have a single value throughout the program.
#include <stdio.h>
static int val = 3;  // initialized once
void init()
{
}
void func()
{
    printf("val %d",val);
}
int main()
{
    init();
    func();
}
Listing 3: Optimal (IPA knows val is 3)
If the variable is statically initialized to zero, as all global variables are by default, and is subsequently assigned to some other value at another point in the program, then the analysis sees two values and will not consider the variable to have a constant value.
#include <stdio.h>
static int val;  // initialized to zero
void init()
{
    val = 3;  // re-assigned
}
void func()
{
    printf("val %d",val);
}
int main()
{
    init();
    func();
}
Listing 4: Non-optimal (IPA cannot see that val is a constant)

Word-align Your Data

To make the most efficient use of the hardware, you must keep it fed with data. In many algorithms, the balance of data accesses to computations is such that, to keep the hardware fully utilized, data must be fetched with 32-bit loads.
Although the Blackfin architecture supports byte addressing, the hardware requires that references to memory be naturally aligned. Thus, 16-bit references must be at even address locations, and 32-bit at word-aligned addresses. So, for the most efficient code to be generated, you should ensure that data are word-aligned.
The compiler helps establish the alignment of array data. The stack frames are kept word-aligned. Top-level arrays are allocated at word-aligned addresses, regardless of their data types.
If you write programs that only pass the address of the first element of an array as a parameter and loops that process input arrays an element at a time, starting at element zero, then inter-procedural analysis should be able to establish that the alignment is suitable for 32-bit accesses.
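The sketch below contrasts the two calling patterns, using hypothetical names and data. The first caller passes the base of a top-level array, so every call starts at a word-aligned address. The second passes the address of element 1 of a short array, which is two bytes past the base and therefore only 16-bit aligned.

```c
#define N_COEFFS 8

/* Top-level array: allocated at a word-aligned address per the text above. */
short coeffs[N_COEFFS] = { 1, 2, 3, 4, 5, 6, 7, 8 };

/* Processes the input one element at a time, starting at element zero. */
long sum_all(const short *v, int n)
{
    long sum = 0;
    int i;
    for (i = 0; i < n; ++i)
        sum += v[i];
    return sum;
}

/* Base address passed: alignment for 32-bit accesses can be established. */
long sum_from_base(void)
{
    return sum_all(coeffs, N_COEFFS);
}

/* Offset address passed: v is no longer guaranteed to be word-aligned. */
long sum_from_offset(void)
{
    return sum_all(&coeffs[1], N_COEFFS - 1);
}
```

Both callers are valid C and return the expected sums (36 and 35). The difference lies only in what can be concluded about the alignment of v inside sum_all() once both call sites exist in the program.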
Where the inner loop processes a single row of a multi-dimensional array, be certain that each row