Texas Instruments Incorporated and its subsidiaries (TI) reserve the right to make corrections,
modifications, enhancements, improvements, and other changes to its products and services at any
time and to discontinue any product or service without notice. Customers should obtain the latest
relevant information before placing orders and should verify that such information is current and
complete. All products are sold subject to TI’s terms and conditions of sale supplied at the time of order
acknowledgment.
TI warrants performance of its hardware products to the specifications applicable at the time of sale
in accordance with TI’s standard warranty. Testing and other quality control techniques are used to the
extent TI deems necessary to support this warranty. Except where mandated by government
requirements, testing of all parameters of each product is not necessarily performed.
TI assumes no liability for applications assistance or customer product design. Customers are
responsible for their products and applications using TI components. To minimize the risks associated
with customer products and applications, customers should provide adequate design and operating
safeguards.
TI does not warrant or represent that any license, either express or implied, is granted under any TI
patent right, copyright, mask work right, or other TI intellectual property right relating to any
combination, machine, or process in which TI products or services are used. Information published by
TI regarding third-party products or services does not constitute a license from TI to use such products
or services or a warranty or endorsement thereof. Use of such information may require a license from
a third party under the patents or other intellectual property of the third party, or a license from TI under
the patents or other intellectual property of TI.
Reproduction of information in TI data books or data sheets is permissible only if reproduction is without
alteration and is accompanied by all associated warranties, conditions, limitations, and notices.
Reproduction of this information with alteration is an unfair and deceptive business practice. TI is not
responsible or liable for such altered documentation.
Resale of TI products or services with statements different from or beyond the parameters stated by
TI for that product or service voids all express and any implied warranties for the associated TI product
or service and is an unfair and deceptive business practice. TI is not responsible or liable for any such
statements.
Following are URLs where you can obtain information on other Texas Instruments products and
application solutions:
ProductsApplications
Amplifiersamplifier.ti.comAudiowww.ti.com/audio
Data Convertersdataconverter.ti.comAutomotivewww.ti.com/automotive
DSPdsp.ti.comBroadbandwww.ti.com/broadband
Interfaceinterface.ti.comDigital Controlwww.ti.com/digitalcontrol
Logiclogic.ti.comMilitarywww.ti.com/military
Power Mgmtpower.ti.comOptical Networkingwww.ti.com/opticalnetwork
Microcontrollersmicrocontroller.ti.comSecuritywww.ti.com/security
Telephonywww.ti.com/telephony
Video & Imagingwww.ti.com/video
Wirelesswww.ti.com/wireless
This document describes the C64x+ digital signal processor little-endian
(DSP) Library, or DSPLIB for short.
Notational Conventions
This document uses the following conventions:
- Hexadecimal numbers are shown with the suffix h. For example, the
- Registers in this document are shown in figures and described in tables.
- Macro names are written in uppercase text; function names are written in
Preface
Read This First
following number is 40 hexadecimal (decimal 64): 40h.
lowercase.
J Each register figure shows a rectangle divded into fields that repre-
sent the fields of the register . Each field is labeled with its bit name, its
beginning and ending bit numbers above, and its read/write properties
below. A legend explains the notation used for the properties.
J Reserved bits i n a register figure designate a bit that is used for future
device expansion.
Related Documentation From Texas Instruments
The following books describe the C6000™ devices and related support tools.
Copies of these documents are available on the Internet at www.ti.com
Enter the literature number in the search box provided at www.ti.com.
SPRU732 — TMS320C64x/C64x+ DSP CPU and Instruction Set
Reference Guide. Describes the CPU architecture, pipeline, instruction
set, and interrupts for the TMS320C64x and TMS320C64x+ digital
signal processors (DSPs) of the TMS320C6000 DSP family. The
C64x/C64x+ DSP generation comprises fixed-point devices in the
C6000 DSP platform. The C64x+ DSP is an enhancement of the C64x
DSP with added functionality and an expanded instruction set.
. Tip:
iRead This First
Trademarks
Trademarks
SPRAA84 — TMS320C64x to TMS320C64+ CPU Migration Guide.
Describes migrating from the Texas Instruments TMS320C64x digital
signal processor (DSP) to the TMS320C64x+ DSP. The objective of this
document is to indicate dif ferences between the two cores. Functionality
in the devices that is identical is not included.
C6000, TMS320C64x+, TMS320C64x, C64x are trademarks of Texas
Instruments.
Provides a brief introduction to the TI C64x+ DSPLIBs, shows the organization of the routines
contained in the libraries, and lists the features and benefits of the DSPLIBs.
This chapter provides a brief introduction to the TI C64x+ DSP Libraries
(DSPLIB), shows the organization of the routines contained in the library, and
lists the features and benefits of the DSPLIB.
The TI C64x+ DSPLIB is an optimized DSP Function Library for C
programmers using devices that include the C64x+ megamodule. It includes
many C-callable, assembly-optimized, general-purpose signal-processing
routines. These routines are typically used in computationally intensive
real-time applications where optimal execution speed is critical. By using
these routines, you can achieve execution speeds considerably faster than
equivalent code written in standard ANSI C language. In addition, by providing
ready-to-use DSP functions, TI DSPLIB can significantly shorten your DSP
application development time.
The TI DSPLIB includes commonly used DSP routines. Source code is
provided that allows you to modify functions to match your specific needs.
The routines contained in the library are organized into the following seven
different functional categories:
- Adaptive filtering
J DSP_firlms2
- Correlation
J DSP_autocor
J DSP_autocor_rA8
- FFT
J DSP_fft16x16
J DSP_fft16x16_imre
J DSP_fft16x16r
J DSP_fft16x32
J DSP_fft32x32
J DSP_fft32x32s
J DSP_ifft16x16
J DSP_ifft16x16_imre
J DSP_ifft16x32
J DSP_ifft32x32
J DSP_fft16x16t (obolete, use DSP_fft16x16)
J DSP_bitrev_cplx (obsolete, use DSP_fft16x16)
J DSP_radix 2 (obsolete, use DSP_fft16x16)
J DSP_r4fft (obsolete, use DSP_fft16x16)
J DSP_fft (obsolete, use DSP_fft16x16)
You should read the README.txt file for specific details of the release.
The DSPLIB is provided in the file dsp64plus.zip. The file must be unzipped to
provide the following directory structure:
dsp
|
+−−README.txtTop−level README file
|
+−−docslibrary documentation
|
+−−examplesCCS project examples
|
|−−includeRequired include files
|
|−−liblibrary and source archives
|
|−−supportfft twiddle generation functions
|
Please install the contents of the lib directory in the default directory indicated
by your C_DIR environment. If you choose not to install the contents in the
default directory, update the C_DIR environment variable, for example, by
adding the following line in autoexec.bat file:
SET C_DIR=<install_dir>/lib;<install_dir>/include;%C_DIR%
Unless specifically noted, DSPLIB operates on Q.15-fractional data type
elements. Appendix A presents an overview of Fractional Q formats.
2.2.1.2DSPLIB Arguments
TI DSPLIB functions typically operate over vector operands for greater
efficiency. Even though these routines can be used to process short arrays, or
even scalars (unless a minimum size requirement is noted), they will be slower
for those cases.
- Vector stride is always equal to 1: Vector operands are composed of vector
elements held in consecutive memory locations (vector stride equal to 1).
- Complex elements are assumed to be stored in consecutive memory
locations with Real data followed by Imaginary data.
- In-place computation is not allowed, unless specifically noted: Source
operand cannot be equal to destination operand.
2-3Installing and Using DSPLIB
Using DSPLIB
Using DSPLIB
2.2.2Calling a DSPLIB Function From C
In addition to correctly installing the DSPLIB software, follow these steps to
include a DSPLIB function in the code:
- Include the function header file corresponding to the DSPLIB function
- Link the code with dsp64plus.lib
- Use a correct linker command file for the platform used.
The examples in the DSP\Examples folder show how to use the DSPLIB in a
Code Composer Studio C envirionment.
2.2.3Calling a DSP Function From Assembly
The C64x+ DSPLIB functions were written to be used from C. Calling the
functions from assembly language source code is possible as long as the
calling function conforms to the Texas Instruments C64x+ C compiler calling
conventions. For more information, see Section 8 (Runtime Environment) of
TMS320C6000 Optimizing C Compiler User’s Guide (SPRU187).
2.2.4DSPLIB Testing − Allowable Error
DSPLIB is tested under the Code Composer Studio environment against a
reference C implementation. You can expect identical results between
Reference C implementation and its Assembly implementation when using
test routines that focus on fixed-point type results. The test routines that deal
with floating points typically allow an error margin of 0.000001 when
comparing the results of reference C code and DSPLIB assembly code.
2.2.5DSPLIB Overflow and Scaling Issues
The DSPLIB functions implement the same functionality of the reference C
code. You must conform to the range requirements specified in the API
function, and in addition, restrict the input range so that the outputs do not
overflow.
In FFT functions, twiddle factors are generated with a fixed scale factor; i.e.,
32767(=2
DSP_fft32x32s, 2147483647(=2
Twiddle factors cannot be scaled further to not scale input data. Because
DSP_fft16x16r and DSP_f ft32x32s perform scaling by 2 at each radix-4 stage,
the input data must be scaled by 2
overflow. I n all other FFT functions, the input data must be scaled by 2
because no scaling is done by the functions.
15−1
) for all 16-bit FFT functions, 1073741823(=2
31−1
) for all other 32-bit FFT functions.
(log2(nx)−cei[log4(nx)−1])
to completely prevent
30−1
) for
(log2(nx))
2-4
2.2.6Interrupt Behavior of DSPLIB Functions
All of the functions in this library are designed to be used in systems with
interrupts. Thus, it is not necessary to disable interrupts when calling any of
these functions. The functions in the library will disable interrupts as needed to
protect the execution of code in tight loops and so on. Library functions have
three categories:
- Fully-interruptible: These functions do not disable interrupts. Interrupts
are blocked by at most 5 to 10 cycles at a time (not counting stalls) by
branch delay slots.
- Partially-interruptible: These functions disable interrupts for long
periods of time, with small windows of interruptibility. Examples include a
function with a nested loop, where the inner loop is non-interruptible and
the outer loop permits interrupts between executions of the inner loop.
- Non-interruptible: These functions disable interrupts for nearly their
entire duration. Interrupts may happen for a short time during the setup
and exit sequence.
How to Rebuild DSPLIB
Note that all three function categories tolerate interrupts. That is, an interrupt
can occur at any time without affecting the function correctness. The
interruptibility of the function only determines how long the kernel might delay
the processing of the interrupt.
2.3How to Rebuild DSPLIB
If you would like to rebuild DSPLIB (for example, because you modified the
source file contained in the archive), you will have to use the mk6x utility as
follows:
mk6x dsp64plus.src −mv64plus −l dsp64plus.lib
2-5Installing and Using DSPLIB
2-6
Chapter 3
DSPLIB Function Tables
This chapter provides tables containing all DSPLIB functions, a brief
description of each, and a page reference for more detailed information.
3.4Differences Between the C64x and C64x+ DSPLIBs3-8
3-1
Arguments and Conventions Used
3.1Arguments and Conventions Used
The following convention has been used when describing the arguments for
each individual function:
Table 3−1. Argument Conventions
ArgumentDescription
x,yArgument reflecting input data vector
rArgument reflecting output data vector
nx,ny,nrArguments reflecting the size of vectors x,y, and r, respectively. For
functions in the case nx = ny = nr, only nx has been used across.
hArgument reflecting filter coefficient vector (filter routines only)
nhArgument reflecting the size of vector h
w
Some C64x+ functions have additional restrictions due to optimization using
new features such as higher multiply throughput. While these new functions
perform better, they can also lead to problems if not carefully used. For
example, DSP_autocor_rA8 is faster than DSP_autocor, but the output buffer
must be aligned to an 8−byte boundary. Therefore, the new functions are
named with any additional restrictions. Three types of restrictions are specified
to a pointer: minimum buffer size (M), buffer alignment (A), and the number of
elements in the buffer to be a multiple of an integer (X).The following
convention has been used when describing the arguments for each individual
function:
A kernel function foo with two parameters, m and n, with the following
restrictions:
m −> Minimum buffer size = 8, buffer alignment = double word, buffer
needs to be a multiple of 8 elements
n −> Minimum buffer size = 32, buffer alignment = word , buffer needs to be
a multiple of 16 elements
This function would be named: foo_mM8A8X8_nM32A4X16.
3-2
3.2DSPLIB Functions
The routines included in the DSP library are organized into eight functional
categories and listed below in alphabetical order.
- Adaptive filtering
- Correlation
- FFT
- Filtering and convolution
- Math
- Matrix functions
- Miscellaneous
- Obsolete functions
DSPLIB Functions
3-3DSPLIB Function Tables
DSPLIB Function Tables
3.3DSPLIB Function Tables
Table 3−2. Adaptive Filtering
FunctionsDescriptionPage
long DSP_firlms2(short *h, short *x, short b, int nh)LMS FIR4-2
Table 3−3. Correlation
FunctionsDescriptionPage
void DSP_autocor(short *r,short *x, int nx, int nr)Autocorrelation4-4
void DSP_autocor_rA8(short *r,short *x, int nx, int nr)Autocorrelation ( r[] must be
double word aligned)
4-4
Table 3−4. FFT
FunctionsDescriptionPage
void DSP_fft16x16(short *w, int nx, short *x, short *y)Complex out of place, Forward
FFT mixed radix with digit
reversal. Input/Output data in
Re/Im order.
void DSP_fft16x16_imre(short *w, int nx, short *x, short
*y)
void DSP_fft16x16r(int nx, short *x, short *w, unsigned
char *brev, short *y, int radix, int offset, int n_max)
void DSP_fft16x32(short *w, int nx, int *x, int *y)Extended precision, mixed radix
Complex out of place, Forward
FFT mixed radix with digit
reversal. Input/Output data in
Im/Re order.
Cache-optimized mixed radix FFT
with scaling and rounding, digit
reversal, out of place. Input and
output: 16 bits, Twiddle factor: 16
bits.
FFT, rounding, digit reversal, out
of place. Input and output: 32 bits,
Twiddle factor: 16 bits.
4-8
4-11
4-14
4-24
void DSP_fft32x32(int *w, int nx, int *x, int *y)Extended precision, mixed radix
FFT, rounding, digit reversal, out
of place. Input and output: 32 bits,
Twiddle factor: 32 bits.
void DSP_fft32x32s(int *w, int nx, int *x, int *y)
3-4
Extended precision, mixed radix
FFT, digit reversal, out of place.,
with scaling and rounding. Input
and output: 32 bits, Twiddle
factor: 32 bits.
4-26
4-28
Table 3−4. FFT (Continued)
FunctionsPageDescription
DSPLIB Function Tables
void DSP_ifft16x16(short *w, int nx, short *x, short *y)Complex out of place, Inverse
FFT mixed radix with digit
reversal. Input/Output data in
Re/Im order.
void DSP_ifft16x16_imre(short *w, int nx, short *x, short
*y)
void DSP_ifft16x32(short *w, int nx, int *x, int *y)Extended precision, mixed radix
void DSP_ifft32x32(int *w, int nx, int *x, int *y)
Complex out of place, Inverse
FFT mixed radix with digit
reversal. Input/Output data in
Re/Im order.
IFFT, rounding, digit reversal, out
of place. Input and output: 32 bits,
Twiddle factor: 16 bits.
Extended precision, mixed radix
IFFT, digit reversal, out of place,
with scaling and rounding. Input
and output: 32 bits, Twiddle
factor: 32 bits.
4-28
4-28
4-34
4-36
Table 3−5. Filtering and Convolution
FunctionsDescriptionPage
void DSP_fir_cplx (short *x, short *h, short *r, int nh, int
nx)
void DSP_fir_cplx_hM4X4 (short *x, short *h, short *r, int
nh, int nx)
void DSP_fir_gen (short *x, short *h, short *r, int nh, int nr) FIR Filter (any nh)4-42
void DSP_fir_gen_hM17_rA8X8 (short *x, short *h, short
*r, int nh, int nr)
void DSP_fir_r4 (short *x, short *h, short *r, int nh, int nr)FIR Filter (nh is a multiple of 4)4-46
void DSP_fir_r8 (short *x, short *h, short *r, int nh, int nr)FIR Filter (nh is a multiple of 8)4-50
void DSP_fir_r8_hM16_rM8A8X8 (short *x, short *h, short
*r, int nh, int nr)
void DSP_fir_sym (short *x, short *h, short *r, int nh, int nr,
int s)
Complex FIR Filter (nh is a
multiple of 2)
Complex FIR Filter (nh is a
multiple of 4)
FIR Filter (r[] must be double
word aligned, nr must be multiple
of 8)
FIR Filter (r[] must be double
word aligned, nr is a multiple of 8)
Symmetric FIR Filter (nh is a
multiple of 8)
4-38
4-38
4-42
4-50
4-52
3-5DSPLIB Function Tables
DSPLIB Function Tables
Table 3−5. Filtering and Convolution (Continued)
FunctionsPageDescription
void DSP_iir(short *r1, short *x, short *r2, short *h2, short
*h1, int nr)
void DSP_iirlat(short *x, int nx, short *k, int nk, int *b,
short *r)
IIR with 5 Coefficients4-54
All−pole IIR Lattice Filter4-56
Table 3−6. Math
FunctionsDescriptionPage
int DSP_dotp_sqr(int G, short *x, short *y, int *r, int nx)Vector Dot Product and Square4-58
int DSP_dotprod(short *x, short *y, int nx)Vector Dot Product4-60
short DSP_maxval (short *x, int nx)Maximum Value of a Vector4-62
int DSP_maxidx (short *x, int nx)Index of the Maximum Element of
a Vector
short DSP_minval (short *x, int nx)Minimum Value of a Vector4-65
void DSP_mul32(int *x, int *y, int *r, short nx)32-bit Vector Multiply4-66
void DSP_neg32(int *x, int *r, short nx)32-bit Vector Negate4-68
void DSP_recip16 (short *x, short *rfrac, short *rexp, short
nx)
16-bit Reciprocal4-69
4-63
int DSP_vecsumsq (short *x, int nx)Sum of Squares4-71
void DSP_w_vec(short *x, short *y, short m, short *r, short
nr)
Weighted V ector Sum4-72
Table 3−7. Matrix
FunctionsDescriptionPage
void DSP_mat_mul(short *x, int r1, int c1, short *y, int c2,
short *r, int qs)
void DSP_mat_trans(short *x, short rows, short columns,
short *r)
3-6
Matrix Multiplication4-73
Matrix Transpose4-75
DSPLIB Function Tables
Table 3−8. Miscellaneous
FunctionsDescriptionPage
short DSP_bexp(int *x, short nx)Max Exponent of a Vector (for
scaling)
void DSP_blk_eswap16(void *x, void *r, int nx)Endian-swap a block of 16-bit
values
void DSP_blk_eswap32(void *x, void *r, int nx)Endian-swap a block of 32-bit
values
void DSP_blk_eswap64(void *x, void *r, int nx)Endian-swap a block of 64-bit
values
void DSP_blk_move(short *x, short *r, int nx)Move a Block of Memory4-84
void DSP_fltoq15 (float *x,short *r, short nx)Float to Q15 Conversion4-85
int DSP_minerror (short *GSP0_TABLE,short *errCoefs,
int *savePtr_ret)
void DSP_q15tofl (short *x, float *r, short nx)
Minimum Energy Error Search4-87
Q15 to Float Conversion4-89
4-76
4-78
4-80
4-82
Table 3−9. Obsolete Functions
FunctionsDescriptionPage
void DSP_bitrev_cplx (int *x, short *index, int nx)Use DSP_fft16x16() instead4-88
void DSP_radix2 (int nx, short *x, short *w)Use DSP_fft16x16() instead4-91
void DSP_r4fft (int nx, short *x, short *w)Use DSP_fft16x16() instead4-93
void DSP_fft(short *w, int nx, short *x, short *y)Use DSP_fft16x16() instead4-96
void DSP_fft16x16t(short *w, int nx, short *x, short *y)
Use DSP_fft16x16() instead4-107
3-7DSPLIB Function Tables
Differences Between the C64x and C64x+ DSPLIBs
3.4Differences Between the C64x and C64x+ DSPLIBs
The C64x+ DSPLIB was developed by optimizing some of the functions of the
C64x DSPLIB to take advantage of the C64x+ architecture.
Table 3−10 shows the optimized functions for the C64x+ DSPLIB.
There are two optimization types:
- SPLOOP conversion: Optimized code uses SPLOOP to provide
interruptibility and decrease power consumption. The new C64x+
instructions do not increase algorithm performance, and thus, are not
used.
- Kernel redesign, SPLOOP: Kernel of algorithm rewritten to take
advantage of the new C64x+ instructions and of the SPLOOP feature.
Table 3−10. Functions Optimized in the C64x+ DSPLIB
Any functions which were not optimized for the C64x+ have the same
performance as on the C64x.
3-10
Chapter 4
DSPLIB Reference
This chapter provides a list of the functions within the DSP library (DSPLIB)
organized into functional categories. The functions within each category are
listed in alphabetical order and include arguments, descriptions, algorithms,
benchmarks, and special requirements.
Function long DSP_firlms2(short * restrict h, const short * restrict x, short b, int nh)
Argumentsh[nh]Coefficient Array
x[nh+1]Input Array
bError from previous FIR
nhNumber of coefficients. Must be multiple of 4.
return longReturn value
DescriptionThe Least Mean Square Adaptive Filter computes an update of all nh
coefficients by adding the weighted error times the inputs to the original
coefficients. The input array includes the last nh inputs followed by a new
single sample input. The coefficient array includes nh coefficients.
AlgorithmThis is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
long DSP_firlms2(short h[ ],short x[ ], short b,
int nh)
{
int i;
long r = 0;
for (i = 0; i < nh; i++) {
h[i] += (x[i] * b) >> 15;
r += x[i + 1] * h[i];
}
return r;
}
Special Requirements
- This routine assumes 16-bit input and output.
- The number of coefficients nh must be a multiple of 4.
4-2
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interrupt-tolerant but not interruptible.
- The loop is unrolled 4 times.
BenchmarksCycles3 * nh/4 + 17
Codesize148 bytes
DSP_firlms2
4-3 C64x+ DSPLIB Reference
DSP_autocor
AutoCorrelation
DSP_autocor
4.2Correlation
Function void DSP_autocor(short * restrict r, const short * restrict x, int nx, int nr)
Argumentsr[nr]Output array
x[nx+nr]Input array. Must be double-word aligned.
nxLength of autocorrelation. Must be a multiple of 8.
nrNumber of lags. Must be a multiple of 4.
DescriptionThis routine accepts an input array of length nx + nr and performs nr
autocorrelations each of length nx producing nr output results. This is typically
used in VSELP code.
AlgorithmThis is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
Special Requirements
void DSP_autocor(short r[ ],short x[ ], int nx, int nr)
{
int i,k,sum;
for (i = 0; i < nr; i++){
sum = 0;
for (k = nr; k < nx+nr; k++)
sum += x[k] * x[k−i];
r[i] = (sum >> 15);
}
}
- nx must be a multiple of 8.
- nr must be a multiple of 4.
- x[ ] must be double-word aligned.
4-4
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interrupt-tolerant but not interruptible.
- The inner loop is unrolled 8 times.
- The outer loop is unrolled 4 times.
- The outer loop is conditionally executed in parallel with the inner loop. This
allows for a zero overhead outer loop.
BenchmarksCyclesnx<40: 6*nr*nr/4 + 20
nx>=40: nx*nr/8 + 2*nr + 20
Codesize304 bytes
DSP_autocor
4-5 C64x+ DSPLIB Reference
DSP_autocor_rA8
AutoCorrelation
DSP_autocor_rA8
Function void DSP_autocor_rA8(short * restrict r, const short * restrict x, int nx, int nr)
Argumentsr[nr]Output array, Must be double word aligned.
x[nx+nr]Input array. Must be double-word aligned.
nxLength of autocorrelation. Must be a multiple of 8.
nrNumber of lags. Must be a multiple of 4.
DescriptionThis routine accepts an input array of length nx + nr and performs nr
autocorrelations each of length nx producing nr output results. This is typically
used in VSELP code.
AlgorithmThis is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_autocor(short r[ ],short x[ ], int nx, int nr)
{
int i,k,sum;
for (i = 0; i < nr; i++){
sum = 0;
for (k = nr; k < nx+nr; k++)
sum += x[k] * x[k−i];
r[i] = (sum >> 15);
}
}
Special Requirements
Implementation Notes
4-6
- nx must be a multiple of 8.
- nr must be a multiple of 4.
- x[ ] must be double-word aligned.
- r[ ] must be double-word aligned.
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interruptible.
- The inner loop is unrolled 8 times.
- The outer loop is unrolled 4 times.
BenchmarksCyclesnx<40: 6*nr+ 20
nx>=40: nx*nr/8 + 2*nr + 20
Codesize304 bytes
DSP_autocor_rA8
4-7 C64x+ DSPLIB Reference
DSP_fft16x16
Complex Forward Mixed Radix 16 x 16-bit FFT
DSP_fft16x16
4.3FFT
Function void DSP_fft16x16(const short * restrict w, int nx, short * restrict x, short *
restrict y)
Argumentsw[2*nx]Pointer to complex Q.15 FFT coefficients.
nxLength of FFT in complex samples. Must be power of 2 or 4
, and 16 ≤ nx ≤ 32768.
x[2*nx]Pointer to complex 16-bit data input.
y[2*nx]Pointer to complex 16-bit data output.
DescriptionThis routine computes a complex forward mixed radix FFT with rounding and
digit reversal. Input data x[ ], output data y[ ], and coefficients w[ ] are 16-bit.
The output is returned in the separate array y[ ] in normal order. Each complex
value is stored with interleaved real and imaginary parts. The code uses a
special ordering of FFT coefficients (also called twiddle factors) and memory
accesses to improve performance in the presence of cache.
AlgorithmAll stages are radix-4 except the last one, which can be radix-2 or radix-4,
depending on the size of the FFT. All stages except the last one scale by two
the stage output data.
Special Requirements
- In-place computation is not allowed.
- The size of the FFT, nx, must be power of 2 or 4, and 16 ≤ nx ≤ 32768.
- The arrays for the complex input data x[ ], complex output data y[ ], and
twiddle factors w[ ] must be double-word aligned.
- The input and output data are complex, with the real/imaginary
components stored in adjacent locations in the array. The real
components are stored at even array indices, and the imaginary
components are stored at odd array indices. All data are in short precision
or Q.15 format.
4-8
Implementation Notes
DSP_fft16x16
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interruptible.
The routine uses log
(nx) − 1 stages of radix-4 transform and performs either
4
a radix-2 or radix-4 transform on the last stage depending on nx. If nx is a
power of 4, then this last stage is also a radix-4 transform, otherwise it is a
radix-2 transform. The conventional Cooley Tukey FFT is written using three
loops. The outermost loop “k” cycles through the stages. There are log N to
the base 4 stages in all. The loop “j” cycles through the groups of butterflies
with different twiddle factors, and loop “i” reuses the twiddle factors for the
different butterflies within a stage. Note the following:
Butterflies With Common
Stage
Groups
Twiddle Factors
Groups*Butterflies
1N/41N/4
2N/164N/4
........
logN
1N/4N/4
The following statements can be made based on above observations:
1) Inner loop “i0” iterates a variable number of times. In particular, the number
of iterations quadruples every time from 1..N/4. Hence, software pipelining
a loop that iterates a variable number of times is not profitable.
2) Outer loop “j” iterates a variable number of times as well. However, the
number of iterations is quartered every time from N/4 ..1. Hence, the
behavior in (a) and (b) are exactly opposite to each other.
3) If the two loops “i” and “j” are coalesced together then they will iterate for
a fixed number of times, namely N/4. This allows us to combine the “i” and
“j” loops into one loop. Optimized implementations will make use of this
fact.
In addition,, the Cooley Tukey FFT accesses three twiddle factors per iteration
of the inner loop, as the butterflies that reuse twiddle factors are lumped
together. This leads to accessing the twiddle factor array at three points, each
separated by “ie”. Note that “ie” is initially 1, and is quadrupled with every
iteration. Therefore, these three twiddle factors are not even contiguous in the
array.
4-9 C64x+ DSPLIB Reference
DSP_fft16x16
To vectorize the FFT, it is desirable to access the twiddle factor array using
double word wide loads and fetch the twiddle factors needed. To do this, a
modified twiddle factor array is created, in which the factors WN/4, WN/2,
W3N/4 are arranged to be contiguous. This eliminates the separation between
twiddle factors within a butterfly. However, this implies that we maintain a
redundant version of the twiddle factor array as the loop is traversed from one
stage to another. Hence, the size of the twiddle factor array increases as
compared to the normal Cooley T ukey FFT. The modified twiddle factor array
is of size “2 * N” where the conventional Cooley Tukey FFT is of size “3N/4”
where N is the number of complex points to be transformed. The routine that
generates the modified twiddle factor array was presented earlier. With the
above transformation of the FFT, both the input data and the twiddle factor
array can be accessed using double-word wide loads to enable packed data
processing.
The final stage is optimized to remove the multiplication as w0 = 1. This stage
also performs digit reversal on the data, so the final output is in natural order.
In addition, if the number of points to be transformed is a power of 2, the final
stage applies a radix-2 pass instead of a radix-4. In any case, the outputs are
returned in normal order.
The code performs the bulk of the computation in place. However, because
digit-reversal cannot be performed in-place, the final result is written to a
separate array, y[].
BenchmarksCycles(6 * nx/8 + 19) * ceil[log
Codesize864 bytes
(nx) − 1] + 8*nx/8 + 30
4
4-10
DSP_fft16x16_imre
Complex Forward Mixed Radix 16 x 16-bit FFT, With Im/Re Order
DSP_fft16x16_imre
Function void DSP_fft16x16_imre(const short * restrict w, int nx, short * restrict x, short
* restrict y)
Argumentsw[2*nx]Pointer to complex Q.15 FFT coefficients.
nxLength of FFT in complex samples. Must be power of 2 or 4
, and 16 ≤ nx ≤ 32768.
x[2*nx]Pointer to complex 16-bit data input.
y[2*nx]Pointer to complex 16-bit data output.
DescriptionThis routine computes a complex forward mixed radix FFT with truncation and
digit reversal. Input data x[ ], output data y[ ], and coefficients w[ ] are 16-bit.
The output is returned in the separate array y[ ] in normal order. Each complex
value is stored with interleaved imaginary and real parts. The code uses a
special ordering of FFT coefficients (also called twiddle factors) and memory
accesses to improve performance in the presence of cache.
AlgorithmAll stages are radix-4 except the last one, which can be radix-2 or radix-4,
depending on the size of the FFT. All stages except the last one scale by two
the stage output data.
Special Requirements
- In-place computation is not allowed.
- The size of the FFT, nx, must be power of 2 or 4, and 16 ≤ nx ≤ 32768.
- The arrays for the complex input data x[ ], complex output data y[ ], and
twiddle factors w[ ] must be double-word aligned.
- The input and output data are complex, with the imaginary/real
components stored in adjacent locations in the array. The imaginary
components are stored at even array indices, and the real components are
stored at odd array indices. All data are in short precision or Q.15 format.
Implementation Notes
- Bank Conflicts: no conflicts occur.
- Interruptibility: The code is interruptible.
4-11 C64x+ DSPLIB Reference
DSP_fft16x16_imre
The routine uses log4(nx) − 1 stages of radix-4 transform and performs either
a radix-2 or radix-4 transform on the last stage depending on nx. If nx is a
power of 4, then this last stage is also a radix-4 transform, otherwise it is a
radix-2 transform. The conventional Cooley Tukey FFT is written using three
loops. The outermost loop “k” cycles through the stages. There are log N to
the base 4 stages in all. The loop “j” cycles through the groups of butterflies
with different twiddle factors, and loop “i” reuses the twiddle factors for the
different butterflies within a stage. Note the following:
Butterflies With Common
Stage
1N/41N/4
2N/164N/4
........
Groups
Twiddle Factors
Groups*Butterflies
logN
The following statements can be made based on above observations:
1) Inner loop “i0” iterates a variable number of times. In particular, the number
of iterations quadruples every time from 1..N/4. Hence, software pipelining
a loop that iterates a variable number of times is not profitable.
2) Outer loop “j” iterates a variable number of times as well. However, the
number of iterations is quartered every time from N/4 ..1. Hence, the
behavior in (a) and (b) are exactly opposite to each other.
3) If the two loops “i” and “j” are coalesced together then they will iterate for
a fixed number of times, namely N/4. This allows us to combine the “i” and
“j” loops into one loop. Optimized implementations will make use of this
fact.
In addition, the Cooley Tukey FFT accesses three twiddle factors per iteration
of the inner loop, as the butterflies that reuse twiddle factors are lumped
together. This leads to accessing the twiddle factor array at three points, each
separated by “ie”. Note that “ie” is initially 1, and is quadrupled with every
iteration. Therefore these three twiddle factors are not even contiguous in the
array.
1N/4N/4
4-12
DSP_fft16x16_imre
To vectorize the FFT, it is desirable to access twiddle factor array using double
word wide loads and fetch the twiddle factors needed. To do this, a modified
twiddle factor array is created, in which the factors WN/4, WN/2, W3N/4 are
arranged to be contiguous. This eliminates the separation between twiddle
factors within a butterfly. However, this implies that we maintain a redundant
version of the twiddle factor array as the loop is traversed from one stage to
another. Hence, the size of the twiddle factor array increases as compared to
the normal Cooley Tukey FFT. The modified twiddle factor array is of size
“2 * N”, where the conventional Cooley Tukey FFT is of size “3N/4”, where N
is the number of complex points to be transformed. The routine that generates
the modified twiddle factor array was presented earlier. With the above
transformation of the FFT, both the input data and the twiddle factor array can
be accessed using double-word wide loads to enable packed data processing.
The final stage is optimized to remove the multiplication as w0 = 1. This stage
also performs digit reversal on the data, so the final output is in natural order.
In addition, if the number of points to be transformed is a power of 2, the final
stage applies a DSP_radix2 pass instead of a radix 4. In any case, the outputs
are returned in normal order.
The code performs the bulk of the computation in place. However, because
digit-reversal cannot be performed in-place, the final result is written to a
separate array, y[].
BenchmarksCycles(6 * nx/8 + 19) * ceil[log
Codesize864 bytes
(nx) − 1] + 8*nx/8 + 30
4
4-13 C64x+ DSPLIB Reference
DSP_fft16x16r
Complex Forward Mixed Radix 16 x 16-bit FFT With Rounding
DSP_fft16x16r
Function void DSP_fft16x16r(int nx, short * restrict x, const short * restrict w, const un-
signed char * restrict brev, short * restrict y, int radix, int offset, int nmax)
ArgumentsnxLength of FFT in complex samples. Must be power of 2 or 4, and
≤16384
x[2*nx]Pointer to complex 16-bit data input
w[2*nx]Pointer to complex FFT coefficients
brev[64]Pointer to bit reverse table containing 64 entries. Only required for
C code. Use NULL for assembly code since BITR instruction
is used instead.
y[2*nx]Pointer to complex 16-bit data output
radixSmallest FFT butterfly used in computation used for
decomposing FFT into sub-FFTs. See notes.
offsetIndex in complex samples of sub-FFT from start of main FFT.
nmaxSize of main FFT in complex samples.
DescriptionThis routine implements a cache-optimized complex forward mixed radix FFT
with scaling, rounding and digit reversal. Input data x[ ], output data y[ ], and
coefficients w[ ] are 16-bit. The output is returned in the separate array y[ ] in
normal order. Each complex value is stored as interleaved 16-bit real and
imaginary parts. The code uses a special ordering of FFT coefficients (also
called twiddle factors).
This redundant set of twiddle factors is size 2*N short samples. As pointed out
in subsequent sections, dividing these twiddle factors by 2 will give an effective
divide by 4 at each stage to guarantee no overflow. The function is accurate
to about 68dB of signal to noise ratio to the DFT function as follows.
4-14
DSP_fft16x16r
void dft(int n, short x[], short y[])
{
int k,i, index;
const double PI = 3.14159654;
short * p_x;
double arg, fx_0, fx_1, fy_0, fy_1, co, si;
for(k = 0; k<n; k++)
{
p_x = x;
fy_0 = 0;
fy_1 = 0;
for(i=0; i<n; i++)
{
fx_0 = (double)p_x[0];
fx_1 = (double)p_x[1];
p_x += 2;
index = (i*k) % n;
arg = 2*PI*index/n;
co = cos(arg);
si = −sin(arg);
fy_0 += ((fx_0 * co) − (fx_1 * si));
fy_1 += ((fx_1 * co) + (fx_0 * si));
}
y[2*k] = (short)2*fy_0/sqrt(n);
y[2*k+1] = (short)2*fy_1/sqrt(n);
}
}
Scaling by 2 (i.e., >>1) takes place at each radix-4 stage except the last one.
A radix-4 stage could give a maximum bit-growth of 2 bits, which would require
scaling by 4 . To completely prevent overflow, the input data must be scaled by
(BT−BS)
2
scales by the functions) = ceil[log
, where BT (total number of bit growth) = log2(nx) and BS (number of
(nx)−1]. All shifts are rounded to reduce
4
truncation noise power by 3dB.
4-15 C64x+ DSPLIB Reference
DSP_fft16x16r
The function takes the twiddle factors and input data, and calculates the FFT
producing the frequency domain data in the y[ ] array. As the FFT allows every
input point to affect every output point, which causes cache thrashing in a
cache based system. This is mitigated by allowing the main FFT of size N to
be divided into several steps, allowing as much data reuse as possible. For
example, see the following function:
Notice how the first FFT function is called on the entire 1K data set. It covers
the first pass of the FFT until the butterfly size is 256.
The following 4 FFTs do 256-point FFTs 25% of the size. These continue down
to the end when the butterfly is of size 4. They use an index to the main twiddle
factor array of 0.75*2*N. This is because the twiddle factor array is composed
of successively decimated versions of the main array.
4-16
N not equal to a power of 4 can be used; i.e. 512. In this case, the following
would be needed to decompose the FFT:
The twiddle factor array is composed of log4(N) sets of twiddle factors, (3/4)*N,
(3/16)*N, (3/64)*N, etc. The index into this array for each stage of the FFT is
calculated by summing these indices up appropriately. For multiple FFTs, they
can share the same table by calling the small FFTs from further down in the
twiddle factor array, in the same way as the decomposition works for more data
reuse.
Thus, the above decomposition can be summarized for a general N, radix “rad”
as follows.
As discussed previously, N can be either a power of 4 or 2. If N is a power of
4, then rad = 4, and if N is a power of 2 and not a power of 4, then rad = 2. “rad”
controls how many stages of decomposition are performed. It also determines
whether a radix4 or DSP_radix2 decomposition should be performed at the
last stage. Hence, when “rad” is set to “N/4”, the first stage of the transform
alone is performed and the code exits. To complete the FFT, four other calls
are required to perform N/4 size FFTs. In fact, the ordering of these 4 FFTs
amongst themselves does not matter and, thus, from a cache perspective, it
helps to go through the remaining 4 FFTs in exactly the opposite order to the
first. This is illustrated as follows:
- Complex input data x[ ], twiddle factors w[ ], and output array y[ ] must be
double-word aligned.
- Real values are stored in even word, imaginary in odd.
- All data are in short precision or Q.15 format. Allowed input dynamic range
is 16 − (log
- Output results are returned in normal order.
- The FFT coefficients (twiddle factors) are generated using the program
(nx)−ceil[log4(nx)−1]).
2
tw_fft16x16 provided in the directory ‘support\fft’. The scale factor must be
32767.5. The input data must be scaled by 2
(log2(nx)−ceil[log4(nx)−1])
to
completely prevent overflow.
Implementation Notes
DSP_fft16x16r
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interruptible.
- The routine uses log
either a radix-2 or radix-4 transform on the last stage depending on nx. If
nx is a power of 4, then this last stage is also a radix-4 transform, otherwise
it is a radix-2 transform.
- A special sequence of coefficients used as generated above produces the
FFT. This collapses the inner 2 loops in the traditional Burrus and Parks
implementation.
- The revised FFT uses a redundant sequence of twiddle factors to allow a
linear access through the data. This linear access enables data and
instruction level parallelism.
- The butterfly is bit reversed; i.e. the inner 2 points of the butterfly are
crossed over. This makes the data come out in bit reversed rather than in
radix 4 digit reversed order. This simplifies the last pass of the loop. The
BITR instruction does the bit reversal out of place.
BenchmarksCyclesceil[log
Codesize640 bytes
(nx) − 1 stages of radix-4 transform and performs
4
(nx) − 1] * (8 * nx/8 + 24) + 5.25 * nx/4 + 31
4
4-23 C64x+ DSPLIB Reference
DSP_fft16x32
Complex Forward Mixed Radix 16 x 32-bit FFT With Rounding
DSP_fft16x32
Function void DSP_fft16x32(const short * restrict w, int nx, int * restrict x, int * restrict y)
Argumentsw[2*nx]Pointer to complex Q.15 FFT coefficients.
nxLength of FFT in complex samples. Must be power of 2 or 4,
and 16 ≤ nx ≤ 32768.
x[2*nx]Pointer to complex 32-bit data input.
y[2*nx]Pointer to complex 32-bit data output.
DescriptionThis routine computes an extended precision complex forward mixed radix
FFT with rounding and digit reversal. Input data x[ ] and output data y[ ] are
32-bit, coefficients w[ ] are 16-bit. The output is returned in the separate array
y[ ] in normal order. Each complex value is stored with interleaved real and
imaginary parts. The code uses a special ordering of FFT coefficients (also
called twiddle factors) and memory accesses to improve performance in the
presence of cache. The C code to generate the twiddle factors is the same as
the one used for the DSP_fft16x16r routine.
AlgorithmThe C equivalent of the assembly code without restrictions is similar to the one
shown for the DSP_fft16x16t routine. For further details, see the source code
of the C version of this function, which is provided with this library. Note that
the assembly code is hand optimized and restrictions may apply.
Special Requirements
- In-place computation is not allowed.
- The size of the FFT, nx, must be a power of 4 or 2 and greater than or equal
to 16 and less than 32768.
- The arrays for the complex input data x[ ], complex output data y[ ], and
twiddle factors w[ ] must be double-word aligned.
- The input and output data are complex, with the real/imaginary
components stored in adjacent locations in the array. The real
components are stored at even array indices, and the imaginary
components are stored at odd array indices.
- The FFT coefficients (twiddle factors) are generated using the program
tw_fft16x32 provided in the directory ‘support\fft’. The scale factor must be
32767.5. No scaling is done with the function; thus, the input data must be
log2(nx)
scaled by 2
to completely prevent overflow.
4-24
Implementation Notes
DSP_fft16x32
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interruptible.
- The routine uses log
(nx) − 1 stages of radix-4 transform and performs
4
either a radix-2 or radix-4 transform on the last stage depending on nx. If
nx is a power of 4, then this last stage is also a radix-4 transform, otherwise
it is a radix-2 transform.
- See the fft16x16t implementation notes, as similar ideas are used.
BenchmarksCycles(10.25 * nx/8 + 10) * ceil[log
Codesize1056 bytes
(nx) − 1] + 6 * nx/4 + 81
4
4-25 C64x+ DSPLIB Reference
DSP_fft32x32
Complex Forward Mixed Radix 32 x 32-bit FFT With Rounding
DSP_fft32x32
Function void DSP_fft32x32(const int * restrict w, int nx, int * restrict x, int * restrict y)
Argumentsw[2*nx]Pointer to complex 32-bit FFT coefficients.
nxLength of FFT in complex samples. Must be power of 2 or 4,
and 16 ≤ nx ≤ 32768.
x[2*nx]Pointer to complex 32-bit data input.
y[2*nx]Pointer to complex 32-bit data output.
DescriptionThis routine computes an extended precision complex forward mixed radix
FFT with rounding and digit reversal. Input data x[ ], output data y[ ], and
coefficients w[ ] are 32-bit. The output is returned in the separate array y[ ] in
normal order. Each complex value is stored with interleaved real and
imaginary parts. The code uses a special ordering of FFT coefficients (also
called twiddle factors) and memory accesses to improve performance in the
presence of cache. The C code to generate the twiddle factors is similar to the
one used for the DSP_fft16x16r routine, except that the factors are maintained
at 32-bit precision.
AlgorithmThe C equivalent of the assembly code without restrictions is similar to the one
shown for the DSP_fft16x16t routine. For further details, see the source code
of the C version of this function, which is provided with this library. Note that
the assembly code is hand optimized and restrictions may apply.
Special Requirements
- In-place computation is not allowed.
- The size of the FFT, nx, must be a power of 4 or 2 and greater than or equal
to 16 and less than 32768.
- The arrays for the complex input data x[ ], complex output data y[ ], and
twiddle factors w[ ] must be double-word aligned.
- The input and output data are complex, with the real/imaginary
components stored in adjacent locations in the array. The real
components are stored at even array indices, and the imaginary
components are stored at odd array indices.
- The FFT coefficients (twiddle factors) are generated using the program
tw_fft32x32 provided in the directory ‘support\fft’. The scale factor must be
2147483647.5. No scaling is done with the function; thus, the input data
log2(nx)
must be scaled by 2
to completely prevent overflow.
4-26
Implementation Notes
DSP_fft32x32
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interruptible.
- The routine uses log
(nx) − 1 stages of radix-4 transform and performs
4
either a radix-2 or radix-4 transform on the last stage depending on nx. If
nx is a power of 4, then this last stage is also a radix-4 transform, otherwise
it is a radix-2 transform.
- See the fft16x16t implementation notes, as similar ideas are used.
BenchmarksCycles(12 * nx/8 + 12) * ceil[log
Codesize1056 bytes
(nx) − 1] + 6 * nx/4 + 79
4
4-27 C64x+ DSPLIB Reference
DSP_fft32x32s
Complex Forward Mixed Radix 32 x 32-bit FFT With Scaling
DSP_fft32x32s
Function void DSP_fft32x32s(const int * restrict w, int nx, int * restrict x, int * restrict y)
Argumentsw[2*nx]Pointer to complex 32-bit FFT coefficients.
nxLength of FFT in complex samples. Must be power of 2 or 4,
and 16 ≤ nx ≤ 32768.
x[2*nx]Pointer to complex 32-bit data input.
y[2*nx]Pointer to complex 32-bit data output.
DescriptionThis routine computes an extended precision complex forward mixed radix
FFT with scaling, rounding and digit reversal. Input data x[ ], output data y[ ],
and coefficients w[ ] are 32-bit. The output is returned in the separate array y[ ]
in normal order. Each complex value is stored with interleaved real and
imaginary parts. The code uses a special ordering of FFT coefficients (also
called twiddle factors) and memory accesses to improve performance in the
presence of cache. The C code to generate the twiddle factors is the same one
used for the DSP_fft32x32 routine.
Scaling by 2 (i.e., >>1) takes place at each radix-4 stage except for the last
one. A radix-4 stage can add a maximum of 2 bits, which would require scaling
by 4 to completely prevent overflow. Thus, the input data must be scaled by
log2(nx)−ceil[log4(nx)−1])
2
.
AlgorithmThe C equivalent of the assembly code without restrictions is similar to the one
shown for the fft16x16t routine. For further details, see the source code of the
C version of this function, which is provided with this library. Note that the
assembly code is hand optimized and restrictions may apply.
Special Requirements
- In-place computation is not allowed.
- The size of the FFT, nx, must be a power of 4 or 2 and greater than or equal
to 16 and less than 32768.
- The arrays for the complex input data x[ ], complex output data y[ ], and
twiddle factors w[ ] must be double-word aligned.
- The input and output data are complex, with the real/imaginary
components stored in adjacent locations in the array. The real
components are stored at even array indices, and the imaginary
components are stored at odd array indices.
4-28
Implementation Notes
DSP_fft32x32s
-
The FFT coefficients (twiddle factors) are generated using the program
tw_fft32x32 provided in the directory ‘support\fft’. The scale factor must be
1073741823.5. The input data must be scaled by 2
])
to completely prevent overflow.
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interruptible.
- Scaling is performed at each stage by shifting the results right by 1,
preventing overflow.
(log2(nx) − ceil[ log4(nx)−1
- The routine uses log
(nx) − 1 stages of radix-4 transform and performs
4
either a radix-2 or radix-4 transform on the last stage depending on nx. If
nx is a power of 4, then this last stage is also a radix-4 transform, otherwise
it is a radix-2 transform.
- See the fft16x16t implementation notes, as similar ideas are used.
BenchmarksCycles(13 * nx/8 + 36) * ceil[log
Codesize928 bytes
(nx) − 1] + 6 * nx/4 + 36
4
4-29 C64x+ DSPLIB Reference
DSP_ifft16x16
Complex Inverse Mixed Radix 16 x 16-bit FFT With Rounding
DSP_ifft16x16
Function void DSP_ifft16x16(const short * restrict w, int nx, short * restrict x, short *
restrict y)
Argumentsw[2*nx]Pointer to complex Q.15 FFT coefficients.
nxLength of FFT in complex samples. Must be power of 2 or 4,
and 16 ≤ nx ≤ 32768.
x[2*nx]Pointer to complex 16-bit data input.
y[2*nx]Pointer to complex 16-bit data output.
DescriptionThis routine computes a complex inverse mixed radix IFFT with rounding and
digit reversal. Input data x[ ], output data y[ ], and coefficients w[ ] are 16-bit.
The output is returned in the separate array y[ ] in normal order. Each complex
value is stored with interleaved real and imaginary parts. The code uses a
special ordering of IFFT coefficients (also called twiddle factors) and memory
accesses to improve performance in the presence of cache.
The fft16x16 can be used to perform IFFT, by first conjugating the input,
performing the FFT, and conjugating again. This allows fft16x16 to perform the
IFFT as well. However, if the double conjugation needs to be avoided, then this
routine uses the same twiddle factors as the FFT and performs an IFFT. The
change in the sign of the twiddle factors is adjusted for in the routine. Hence,
this routine uses the same twiddle factors as the fft16x16 routine.
AlgorithmThe C equivalent of the assembly code without restrictions is similar to the one
of the fft16x16 routine. For further details, see the source code of the C version
of this function which is provided with this library.
Special Requirements
- In-place computation is not allowed.
- The size of the FFT, nx, must be a power of 4 or 2 and greater than or equal
to 16 and less than 32768.
- The arrays for the complex input data x[ ], complex output data y[ ], and
twiddle factors w[ ] must be double-word aligned.
- The input and output data are complex, with the real/imaginary
components stored in adjacent locations in the array. The real
components are stored at even array indices, and the imaginary
components are stored at odd array indices.
- Scaling by two is performed after each radix-4 stage except the last one.
4-30
Implementation Notes
DSP_ifft16x16
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interruptible.
- The routine uses log
(nx) − 1 stages of radix-4 transform and performs
4
either a radix-2 or radix-4 transform on the last stage depending on nx. If
nx is a power of 4, then this last stage is also a radix-4 transform, otherwise
it is a radix-2 transform.
- See the fft16x16 implementation notes, as similar ideas are used.
BenchmarksCycles(6 * nx/8 + 19) * ceil[log
Codesize864 bytes
(nx) − 1] + 8 * nx/8 + 30
4
4-31 C64x+ DSPLIB Reference
DSP_ifft16x16_imre
Complex Inverse Mixed Radix 16 x 16-bit FFT With Im/Re Order
DSP_ifft16x16_imre
Function void DSP_ifft16x16_imre(const short * restrict w, int nx, short * restrict x, short
* restrict y)
Argumentsw[2*nx]Pointer to complex Q.15 FFT coefficients.
nxLength of FFT in complex samples. Must be power of 2 or 4,
and 16 ≤ nx ≤ 32768.
x[2*nx]Pointer to complex data input.
y[2*nx]Pointer to complex data output.
DescriptionThis routine computes a complex inverse mixed radix IFFT with rounding and
digit reversal. Input data x[ ], output data y[ ], and coefficients w[ ] are 16-bit.
The output is returned in the separate array y[ ] in normal order. Each complex
value is stored with interleaved imaginary and real parts. The code uses a
special ordering of IFFT coefficients (also called twiddle factors) and memory
accesses to improve performance in the presence of cache.
The fft16x16_imre can be used to perform IFFT, by first conjugating the input,
performing the FFT, and conjugating again. This allows fft16x16_imre to
perform the IFFT as well. However, if the double conjugation needs to be
avoided, then this routine uses the same twiddle factors as the FFT and
performs an IFFT. The change in the sign of the twiddle factors is adjusted for
in the routine. Hence, this routine uses the same twiddle factors as the
fft16x16_imre routine.
AlgorithmThe C equivalent of the assembly code without restrictions is similar to the one
of the ifft16x16 routine. For further details, see the source code of the C version
of this function which is provided with this library.
Special Requirements
- In-place computation is not allowed.
- The size of the FFT, nx, must be a power of 4 or 2 and greater than or equal
to 16 and less than 32768.
- The arrays for the complex input data x[ ], complex output data y[ ], and
twiddle factors w[ ] must be double-word aligned.
- The input and output data are complex, with the imaginary/real
components stored in adjacent locations in the array. The imaginary
components are stored at even array indices, and the real components are
stored at odd array indices.
- Scaling by two is performed after each radix-4 stage except the last one.
4-32
Implementation Notes
DSP_ifft16x16_imre
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interruptible.
- The routine uses log
(nx) − 1 stages of radix-4 transform and performs
4
either a radix-2 or radix-4 transform on the last stage depending on nx. If
nx is a power of 4, then this last stage is also a radix-4 transform, otherwise
it is a radix-2 transform.
- See the fft16x16 implementation notes, as similar ideas are used.
BenchmarksCycles(6 * nx/8 + 19) * ceil[log
Codesize864 bytes
(nx) − 1] + 8 * nx/8 + 30
4
4-33 C64x+ DSPLIB Reference
DSP_ifft16x32
Complex Inverse Mixed Radix 16 x 32-bit FFT With Rounding
DSP_ifft16x32
Function void DSP_ifft16x32(const short * restrict w, int nx, int * restrict x, int * restrict
y)
Argumentsw[2*nx]Pointer to complex Q.15 FFT coefficients.
nxLength of FFT in complex samples. Must be power of 2 or 4,
and 16 ≤ nx ≤ 32768.
x[2*nx]Pointer to complex 32-bit data input.
y[2*nx]Pointer to complex 32-bit data output.
DescriptionThis routine computes an extended precision complex inverse mixed radix
FFT with rounding and digit reversal. Input data x[ ] and output data y[ ] are
32-bit, coefficients w[ ] are 16-bit. The output is returned in the separate array
y[ ] in normal order. Each complex value is stored with interleaved real and
imaginary parts. The code uses a special ordering of FFT coefficients (also
called twiddle factors) and memory accesses to improve performance in the
presence of cache.
fft16x32 can be reused to perform IFFT, by first conjugating the input,
performing the FFT, and conjugating again. This allows fft16x32 to perform the
IFFT as well. However, if the double conjugation needs to be avoided, then this
routine uses the same twiddle factors as the FFT and performs an IFFT. The
change in the sign of the twiddle factors is adjusted for in the routine. Hence,
this routine uses the same twiddle factors as the fft16x32 routine.
AlgorithmThe C equivalent of the assembly code without restrictions is similar to the one
shown for the fft16x16t routine. For further details, see the source code of the
C version of this function which is provided with this library. Note that the
assembly code is hand optimized and restrictions may apply.
Special Requirements
- In-place computation is not allowed.
- The size of the FFT, nx, must be a power of 4 or 2 and greater than or equal
to 16 and less than 32768.
- The arrays for the complex input data x[ ], complex output data y[ ], and
twiddle factors w[ ] must be double-word aligned.
- The input and output data are complex, with the real/imaginary
components stored in adjacent locations in the array. The real
components are stored at even array indices, and the imaginary
components are stored at odd array indices.
4-34
Implementation Notes
DSP_ifft16x32
-
The FFT coefficients (twiddle factors) are generated using the program
tw_fft16x32 provided in the directory ‘support\fft’. The scale factor must be
32767.5. No scaling is done with the function; thus the input data must be
scaled by 2
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interruptible.
log2(nx)
to completely prevent overflow.
- The routine uses log
(nx) − 1 stages of radix-4 transform and performs
4
either a radix-2 or radix-4 transform on the last stage depending on nx. If
nx is a power of 4, then this last stage is also a radix-4 transform, otherwise
it is a radix-2 transform.
- See the fft16x16t implementation notes, as similar ideas are used.
BenchmarksCycles(12.5 * nx/8 + 30) * ceil[log
Codesize864 bytes
(nx) − 1] + 6 * nx/4 + 32
4
4-35 C64x+ DSPLIB Reference
DSP_ifft32x32
Complex Inverse Mixed Radix 32 x 32-bit FFT With Rounding
DSP_ifft32x32
Function void DSP_ifft32x32(const int * restrict w, int nx, int * restrict x, int * restrict y)
Argumentsw[2*nx]Pointer to complex 32-bit FFT coefficients.
nxLength of FFT in complex samples. Must be power of 2 or 4, and
16 ≤ nx ≤ 32768.
x[2*nx]Pointer to complex 32-bit data input.
y[2*nx]Pointer to complex 32-bit data output.
DescriptionThis routine computes an extended precision complex inverse mixed radix
FFT with rounding and digit reversal. Input data x[ ], output data y[ ], and
coefficients w[ ] are 32-bit. The output is returned in the separate array y[ ] in
normal order. Each complex value is stored with interleaved real and
imaginary parts. The code uses a special ordering of FFT coefficients (also
called twiddle factors) and memory accesses to improve performance in the
presence of cache.
fft32x32 can be reused to perform IFFT, by first conjugating the input,
performing the FFT, and conjugating again. This allows fft32x32 to perform the
IFFT as well. However, if the double conjugation needs to be avoided, then this
routine uses the same twiddle factors as the FFT and performs an IFFT. The
change in the sign of the twiddle factors is adjusted for in the routine. Hence,
this routine uses the same twiddle factors as the fft32x32 routine.
AlgorithmThe C equivalent of the assembly code without restrictions is similar to the one
shown for the fft16x16t routine. For further details, see the source code of the
C version of this function which is provided with this library. Note that the
assembly code is hand optimized and restrictions may apply.
Special Requirements
- In-place computation is not allowed.
- The size of the IFFT, nx, must be a power of 4 or 2 and greater than or equal
to 16 and less than 32768.
- The arrays for the complex input data x[ ], complex output data y[ ], and
twiddle factors w[ ] must be double-word aligned.
- The input and output data are complex, with the real/imaginary
components stored in adjacent locations in the array. The real
components are stored at even array indices, and the imaginary
components are stored at odd array indices.
4-36
Implementation Notes
DSP_ifft32x32
-
The FFT coefficients (twiddle factors) are generated using the program
tw_fft32x32 provided in the directory ‘support\fft’. The scale factor must be
2147483647.5. No scaling is done with the function; thus the input data
must be scaled by 2
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interruptible.
log2(nx)
to completely prevent overflow.
- The routine uses log
(nx) − 1 stages of radix-4 transform and performs
4
either a radix-2 or radix-4 transform on the last stage depending on nx. If
nx is a power of 4, then this last stage is also a radix-4 transform, otherwise
it is a radix-2 transform.
- See the fft16x16t implementation notes, as similar ideas are used.
BenchmarksCycles(13*nx/8 + 28) * ceil(log
Codesize960 bytes
(nx) − 1) + 6 * nx/4 + 39
4
4-37 C64x+ DSPLIB Reference
DSP_fir_cplx
Complex FIR Filter
DSP_fir_cplx
4.4Filtering and Convolution
Function void DSP_fir_cplx (const short * restrict x, const short * restrict h, short * restrict
r, int nh, int nr)
Argumentsx[2*(nr+nh−1)] Complex input data. x must point to x[2*(nh−1)].
h[2*nh]Complex coefficients (in normal order).
r[2*nr]Complex output data.
nhNumber of complex coefficients. Must be a multiple of 2.
nrNumber of complex output samples. Must be a multiple of 4.
DescriptionThis function implements the FIR filter for complex input data. The filter has
nr output samples and nh coefficients. Each array consists of an even and odd
term with even terms representing the real part and the odd terms the
imaginary part of the element. The pointer to input array x must point to the
(nh)th complex sample; i.e., element 2*(nh−1), upon entry to the function. The
coefficients are expected in normal order.
AlgorithmThis is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_fir_cplx(short *x, short *h, short *r,short nh, short
nr)
{
short i,j;
int imag, real;
for (i = 0; i < 2*nr; i += 2){
imag = 0;
real = 0;
for (j = 0; j < 2*nh; j += 2){
real += h[j] * x[i−j] − h[j+1] * x[i+1−j];
imag += h[j] * x[i+1−j] + h[j+1] * x[i−j];
}
r[i] = (real >> 15);
r[i+1] = (imag >> 15);
}
}
4-38
Special Requirements
- The number of coefficients nh must be a multiple of 2.
- The number of output samples nr must be a multiple of 4.
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interrupt-tolerant but not interruptible.
- The outer loop is unrolled 4 times while the inner loop is not unrolled.
- Both inner and outer loops are collapsed in one loop.
- ADDAH and SUBAH are used along with PACKH2 to perform
accumulation, shift, and data packing.
- Collapsed one stage of epilog and prolog each.
BenchmarksCyclesnr * nh/2 + 7
Codesize448 bytes
DSP_fir_cplx
4-39 C64x+ DSPLIB Reference
DSP_fir_cplx_hM4X4
Complex FIR Filter
DSP_fir_cplx_hM4X4
Function void DSP_fir_cplx _hM4X4(const short * restrict x, const short * restrict h, short
* restrict r, int nh, int nr)
Argumentsx[2*(nr+nh−1)] Complex input data. x must point to x[2*(nh−1)].
h[2*nh]Complex coefficients (in normal order).
r[2*nr]Complex output data.
nhNumber of complex coefficients. Must be a multiple of 4.
nrNumber of complex output samples. Must be a multiple of 4.
DescriptionThis function implements the FIR filter for complex input data. The filter has
nr output samples and nh coefficients. Each array consists of an even and odd
term with even terms representing the real part and the odd terms the
imaginary part of the element. The pointer to input array x must point to the
(nh)th complex sample; i.e., element 2*(nh−1), upon entry to the function. The
coefficients are expected in normal order.
AlgorithmThis is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_fir_cplx(short *x, short *h, short *r,short nh, short
nr)
{
short i,j;
int imag, real;
for (i = 0; i < 2*nr; i += 2){
imag = 0;
real = 0;
for (j = 0; j < 2*nh; j += 2){
real += h[j] * x[i−j] − h[j+1] * x[i+1−j];
imag += h[j] * x[i+1−j] + h[j+1] * x[i−j];
}
r[i] = (real >> 15);
r[i+1] = (imag >> 15);
}
}
4-40
Special Requirements
- The number of coefficients nh must be larger or equal to 4 and a multiple
of 4.
- The number of output samples nr must be a multiple of 4.
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is fully interruptible.
- The outer loop is unrolled 4 times while the inner loop is not unrolled.
- Both inner and outer loops are collapsed in one loop.
- ADDAH and SUBAH are used along with PACKH2 to perform
accumulation, shift and data packing.
- Collapsed one stage of epilog and prolog each.
BenchmarksCyclesnr * nh*9/16 + 40
Codesize384 bytes
DSP_fir_cplx_hM4X4
4-41 C64x+ DSPLIB Reference
DSP_fir_gen
FIR Filter
DSP_fir_gen
Function void DSP_fir_gen (const short * restrict x, const short * restrict h, short * restrict
r, int nh, int nr)
Argumentsx[nr+nh−1] Pointer to input array of size nr + nh − 1.
h[nh]Pointer to coefficient array of size nh (coefficients must be in
reverse order).
r[nr]Pointer to output array of size nr. Must be word aligned.
nhNumber of coefficients. Must be ≥5.
nrNumber of samples to calculate. Must be a multiple of 4.
DescriptionComputes a real FIR filter (direct-form) using coefficients stored in vector h[ ].
The real data input is stored in vector x[ ]. The filter output result is stored in
vector r[ ]. It operates on 16-bit data with a 32-bit accumulate. The filter
calculates nr output samples using nh coefficients.
AlgorithmThis is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_fir_gen(short *x, short *h, short *r, int nh, int nr)
{
int i, j, sum;
for (j = 0; j < nr; j++) {
sum = 0;
for (i = 0; i < nh; i++)
sum += x[i + j] * h[i];
r[j] = sum >> 15;
}
}
4-42
Special Requirements
Implementation Notes
DSP_fir_gen
- The number of coefficients, nh, must be greater than or equal to 5.
Coefficients must be in reverse order.
- The number of outputs computed, nr, must be a multiple of 4 and greater
than or equal to 4.
- Array r[ ] must be word aligned.
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interrupt-tolerant but not interruptible.
- Load double-word instruction is used to simultaneously load four values
in a single clock cycle.
- The inner loop is unrolled four times and will always compute a multiple
of 4 of nh and nr. If nh is not a multiple of 4, the code will fill in zeros to make
nh a multiple of 4.
- This code yields best performance when the ratio of outer loop to inner
loop is less than or equal to 4.
BenchmarksCycles: Not available
Codesize: Not available
4-43 C64x+ DSPLIB Reference
DSP_fir_gen_hM17_rA8X8
FIR Filter
DSP_fir_gen_hM17_rA8X8
Function void DSP_fir_gen_hM17_rA8X8 (const short * restrict x, const short * restrict
h, short * restrict r, int nh, int nr)
Argumentsx[nr+nh−1] Pointer to input array of size nr + nh − 1.
h[nh]Pointer to coefficient array of size nh (coefficients must be in
reverse order).
r[nr]Pointer to output array of size nr. Must be double word
aligned.
nhNumber of coefficients. Must be ≥17.
nrNumber of samples to calculate. Must be a multiple of 8.
DescriptionComputes a real FIR filter (direct-form) using coefficients stored in vector h[ ].
The real data input is stored in vector x[ ]. The filter output result is stored in
vector r[ ]. It operates on 16-bit data with a 32-bit accumulate. The filter
calculates nr output samples using nh coefficients.
AlgorithmThis is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_fir_gen(short *x, short *h, short *r, int nh, int nr)
{
int i, j, sum;
for (j = 0; j < nr; j++) {
sum = 0;
for (i = 0; i < nh; i++)
sum += x[i + j] * h[i];
r[j] = sum >> 15;
}
}
4-44
Special Requirements
Implementation Notes
DSP_fir_gen_hM17_rA8X8
- The number of coefficients, nh, must be greater than or equal to 17.
Coefficients must be in reverse order.
- The number of outputs computed, nr, must be a multiple of 8 and greater
than or equal to 8.
- Array r[ ] must be word aligned.
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is fully interruptible.
- Load double-word instruction is used to simultaneously load four values
in a single clock cycle.
- The inner loop is unrolled four times and will always compute a multiple
of 4 of nh and nr. If nh is not a multiple of 4, the code will fill in zeros to make
nh a multiple of 4.
- This code yields best performance when the ratio of outer loop to inner
loop is less than or equal to 4.
BenchmarksCycles: 3*ceil(nh/4)*nr/4+39
Codesize: 416 bytes
4-45 C64x+ DSPLIB Reference
DSP_fir_r4
FIR Filter (when the number of coefficients is a multiple of 4)
DSP_fir_r4
Function void DSP_fir_r4 (const short * restrict x, const short * restrict h, short * restrict
r, int nh, int nr)
Argumentsx[nr+nh−1] Pointer to input array of size nr + nh – 1.
h[nh]Pointer to coefficient array of size nh (coefficients must be in
reverse order).
r[nr]Pointer to output array of size nr.
nhNumber of coefficients. Must be multiple of 4 and ≥8.
nrNumber of samples to calculate. Must be multiple of 4.
DescriptionComputes a real FIR filter (direct-form) using coefficients stored in vector h[ ].
The real data input is stored in vector x[ ]. The filter output result is stored in
vector r[ ]. This FIR operates on 16-bit data with a 32-bit accumulate. The filter
calculates nr output samples using nh coefficients.
AlgorithmThis is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_fir_r4(short *x, short *h, short *r, int nh, int nr)
{
int i, j, sum;
for (j = 0; j < nr; j++) {
sum = 0;
for (i = 0; i < nh; i++)
sum += x[i + j] * h[i];
r[j] = sum >> 15;
}
}
4-46
Special Requirements
- The number of coefficients, nh, must be a multiple of 4 and greater than
or equal to 8. Coefficients must be in reverse order.
- The number of outputs computed, nr, must be a multiple of 4 and greater
than or equal to 4.
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interrupt-tolerant but not interruptible.
- The load double-word instruction is used to simultaneously load four
values in a single clock cycle.
- The inner loop is unrolled four times and will always compute a multiple
of 4 output samples.
BenchmarksCycles(8 + nh) * nr/4 + 9
Codesize308 bytes
DSP_fir_r4
4-47 C64x+ DSPLIB Reference
DSP_fir_r8
FIR Filter (when the number of coefficients is a multiple of 8)
DSP_fir_r8
Function void DSP_fir_r8_hM16_rM8A8X8 (short *x, short *h, short *r, int nh, int nr)
Argumentsx[nr+nh−1]Pointer to input array of size nr + nh – 1.
h[nh]Pointer to coefficient array of size nh (coefficients must be in
reverse order).
r[nr]Pointer to output array of size nr. Must be word aligned.
nhNumber of coefficients. Must be multiple of 8, ≥ 8.
nrNumber of samples to calculate. Must be multiple of 4.
DescriptionComputes a real FIR filter (direct-form) using coefficients stored in vector h[ ].
The real data input is stored in vector x[ ]. The filter output result is stored in
vector r[ ]. This FIR operates on 16-bit data with a 32-bit accumulate. The filter
calculates nr output samples using nh coefficients.
AlgorithmThis is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
Special Requirements
void DSP_fir_r8 (short *x, short *h, short *r, int nh, int nr)
{
int i, j, sum;
for (j = 0; j < nr; j++) {
sum = 0;
for (i = 0; i < nh; i++)
sum += x[i + j] * h[i];
r[j] = sum >> 15;
}
}
- The number of coefficients, nh, must be a multiple of 8 and greater than
or equal to 8. Coefficients must be in reverse order.
- The number of outputs computed, nr, must be a multiple of 4 and greater
than or equal to 4.
- Array r[ ] must be word aligned.
4-48
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interruptible.
- The load double-word instruction is used to simultaneously load four
values in a single clock cycle.
- The inner loop is unrolled 4 times and will always compute a multiple of
4 output samples.
- The outer loop is conditionally executed in parallel with the inner loop. This
allows for a zero overhead outer loop.
BenchmarksCyclesnh*nr/4 + 17
Codesize336 bytes
DSP_fir_r8
4-49 C64x+ DSPLIB Reference
DSP_fir_r8_hM16_rM8A8X8
FIR Filter (the number of coefficients is a multiple of 8)
DSP_fir_r8_hM16_rM8A8X8
Function void DSP_fir_r8_hM16_rM8A8X8 (short *x, short *h, short *r, int nh, int nr)
Argumentsx[nr+nh−1]Pointer to input array of size nr + nh – 1.
h[nh]Pointer to coefficient array of size nh (coefficients must be in
reverse order).
r[nr]Pointer to output array of size nr. Must be double word
aligned.
nhNumber of coefficients. Must be multiple of 8, ≥ 16.
nrNumber of samples to calculate. Must be multiple of 8, ≥.8.
DescriptionComputes a real FIR filter (direct-form) using coefficients stored in vector h[ ].
The real data input is stored in vector x[ ]. The filter output result is stored in
vector r[ ]. This FIR operates on 16-bit data with a 32-bit accumulate. The filter
calculates nr output samples using nh coefficients.
AlgorithmThis is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_fir_r8 (short *x, short *h, short *r, int nh, int nr)
{
int i, j, sum;
for (j = 0; j < nr; j++) {
sum = 0;
for (i = 0; i < nh; i++)
sum += x[i + j] * h[i];
r[j] = sum >> 15;
}
}
4-50
Special Requirements
Implementation Notes
DSP_fir_r8_hM16_rM8A8X8
- The number of coefficients, nh, must be a multiple of 8 and greater than
or equal to 16. Coefficients must be in reverse order.
- The number of outputs computed, nr, must be a multiple of 8 and greater
than or equal to 8.
- Array r[ ] must be double word aligned.
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interruptible.
- The load double-word instruction is used to simultaneously load four
values in a single clock cycle.
- The inner loop is unrolled 4 times and will always compute a multiple of
4 output samples.
- The outer loop is conditionally executed in parallel with the inner loop. This
allows for a zero overhead outer loop.
BenchmarksCyclesWhen nh>32, nh*nr/8+22
Otherwise, 32*nr/8+22
Codesize640 bytes
4-51 C64x+ DSPLIB Reference
DSP_fir_sym
Symmetric FIR Filter
DSP_fir_sym
Function void DSP_fir_sym (const short * restrict x, const short * restrict h, short * re -
strict r, int nh, int nr, int s)
Argumentsx[nr+2*nh]Pointer to input array of size nr + 2*nh. Must be double-word
aligned.
h[nh+1]Pointer to coefficient array of size nh + 1. Coefficients are in
normal order and only half (nh+1 out of 2*nh+1) are required.
Must be double-word aligned.
r[nr]Pointer to output array of size nr. Must be word aligned.
nhNumber of coefficients. Must be multiple of 8. The number of
original symmetric coefficients is 2*nh+1.
nrNumber of samples to calculate. Must be multiple of 4.
sNumber of insignificant digits to truncate; e.g., 15 for Q.15
input data and coefficients.
DescriptionThis function applies a symmetric filter to the input samples. The filter tap array
h[] provides ‘nh+1’ total filter taps. The filter tap at h[nh] forms the center point
of the filter. The taps at h[nh − 1] through h[0] form a symmetric filter about this
central tap. The effective filter length is thus 2*nh+1 taps.
The filter is performed on 16-bit data with 16-bit coefficients, accumulating
intermediate results to 40-bit precision. The accumulator is rounded and
truncated according to the value provided in ‘s’. This allows a variety of
Q-points to be used.
AlgorithmThis is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_fir_sym(short *x, short *h, short *r, int nh, int nr,
int s)
- nh must be a multiple of 8. The number of original symmetric coefficients
is 2*nh+1. Only half (nh+1) are required.
- nr must be a multiple of 4.
- x[ ] and h[ ] must be double-word aligned.
- r[ ] must be word aligned.
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interruptible.
- The load double-word instruction is used to simultaneously load four
values in a single clock cycle.
- The inner loop is unrolled eight times.
BenchmarksCycles(10 * nh/8 + 15) * nr/4 + 26
Codesize664 bytes
4-53 C64x+ DSPLIB Reference
DSP_iir
IIR With 5 Coefficients
DSP_iir
Function void DSP_iir (short * restrict r1, const short * restrict x, short * restrict r2, const
short * restrict h2, const short * restrict h1, int nr)
Argumentsr1[nr+4] Output array (used in actual computation. First four elements
must have the previous outputs.)
x[nr+4]Input array
r2[nr]Output array (stored)
h2[5]Moving-average filter coefficients
h1[5]Auto-regressive filter coefficients. h1[0] is not used.
nrNumber of output samples. Must be ≥ 8.
DescriptionThe IIR performs an auto-regressive moving-average (ARMA) filter with 4
auto-regressive filter coefficients and 5 moving-average filter coefficients for
nr output samples.
AlgorithmThis is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_iir(short *r1, short *x, short *r2, short *h2,
short *h1, int nr)
{
int j,i;
int sum;
for (i=0; i<nr; i++){
sum = h2[0] * x[4+i];
for (j = 1; j <= 4; j++)
sum += h2[j]*x[4+i−j]−h1[j]*r1[4+i−j];
r1[4+i] = (sum >> 15);
r2[i] = r1[4+i];
}
}
4-54
Special Requirements
- nr is greater than or equal to 8.
- Input data array x[ ] contains nr + 4 input samples to produce nr output
samples.
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interrupt-tolerant but not interruptible.
- Output array r1[ ] contains nr + 4 locations, r2[ ] contains nr locations for
storing nr output samples. The output samples are stored with an offset
of 4 into the r1[ ] array.
- The inner loop that iterated through the filter coefficients is completely
unrolled.
BenchmarksCycles4 * nr + 21
Codesize276 bytes
DSP_iir
4-55 C64x+ DSPLIB Reference
DSP_iirlat
All-Pole IIR Lattice Filter
DSP_iirlat
Function void DSP_iirlat(const short * restrict x, int nx, const short * restrict k, int nk, int
* restrict b, short * restrict r)
Argumentsx[nx]Input vector (16-bit).
nxLength of input vector.
k[nk]Reflection coefficients in Q.15 format.
nkNumber of reflection coefficients/lattice stages. Must be >=4.
Make multiple of 2 to avoid bank conflicts.
b[nk+1]Delay line elements from previous call. Should be initialized to
all zeros prior to the first call.
r[nx]Output vector (16-bit).
DescriptionThis routine implements a real all-pole IIR filter in lattice structure (AR lattice).
The filter consists of nk lattice stages. Each stage requires one reflection
coefficient k and one delay element b. The routine takes an input vector x[] and
returns the filter output in r[]. Prior to the first call of the routine, the delay
elements in b[] should be set to zero. The input data may have to be pre-scaled
to avoid overflow or achieve better SNR. The reflections coefficients lie in the
range −1.0 < k < 1.0. The order of the coefficients is such that k[nk−1]
corresponds to the first lattice stage after the input and k[0] corresponds to the
last stage.
AlgorithmThis is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void iirlat(short *x, int nx, short *k, int nk, int *b,
short *r)
{
int rt; /* output */
int i, j;
for (j=0; j<nx; j++)
{
rt = x[j] << 15;
for (i = nk − 1; i >= 0; i−−)
{
4-56
Special Requirements
Implementation Notes
DSP_iirlat
rt = rt − (short)(b[i] >> 15) * k[i];
b[i + 1] = b[i] + (short)(rt >> 15) * k[i];
}
b[0] = rt;
r[j] = rt >> 15;
}
}
- nk must be >= 4.
- No special alignment requirements
- See Bank Conflicts for avoiding bank conflicts
- Bank Conflicts: nk should be a multiple of 2, otherwise bank conflicts
occur.
- Interruptibility: The code is interrupt-tolerant but not interruptible.
- Prolog and epilog of the inner loop are partially collapsed and overlapped
to reduce outer loop overhead.
BenchmarksCycles(2 * nk + 7) * nx + 9(without bank conflicts)
Codesize352 bytes
4-57 C64x+ DSPLIB Reference
DSP_dotp_sqr
Vector Dot Product and Square
DSP_dotp_sqr
4.5Math
Function int DSP_dotp_sqr(int G, const short * restrict x, const short * restrict y, int *
restrict r, int nx)
ArgumentsGCalculated value of G (used in the VSELP coder).
x[nx]First vector array
y[nx]Second vector array
rResult of vector dot product of x and y.
nxNumber of elements. Must be multiple of 4, and ≥12.
return intNew value of G.
DescriptionThis routine performs an nx element dot product of x[ ] and y[ ] and stores it
in r. It also squares each element of y[ ] and accumulates it in G. G is passed
back to the calling function in register A4. This computation of G is used in the
VSELP coder.
AlgorithmThis is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
int DSP_dotp_sqr (int G,short *x,short *y,int *r,
int nx)
{
short *y2;
short *endPtr2;
y2 = x;
for (endPtr2 = y2 + nx; y2 < endPtr2; y2++){
*r += *y * *y2;
G += *y * *y;
y++;
}
return(G);
}
4-58
Special Requirements nx must be a multiple of 4 and greater than or equal to 12.
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interrupt-tolerant but not interruptible.
BenchmarksCyclesnx/2 + 21
Codesize128
DSP_dotp_sqr
4-59 C64x+ DSPLIB Reference
DSP_dotprod
Vector Dot Product
DSP_dotprod
Function int DSP_dotprod(const short * restrict x, const short * restrict y, int nx)
Argumentsx[nx]First vector array. Must be double-word aligned.
y[nx]Second vector array. Must be double word-aligned.
nx Number of elements of vector. Must be multiple of 4.
return intDot product of x and y.
DescriptionThis routine takes two vectors and calculates their dot product. The inputs are
16-bit short data and the output is a 32-bit number.
AlgorithmThis is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
int DSP_dotprod(short x[ ],short y[ ], int nx)
{
int sum;
int i;
sum = 0;
for(i=0; i<nx; i++){
sum += (x[i] * y[i]);
}
return (sum);
}
Special Requirements
- The input length must be a multiple of 4.
- The input data x[ ] and y[ ] are stored on double-word aligned boundaries.
- To avoid bank conflicts, the input arrays x[ ] and y[ ] must be offset by 4
half-words (8 bytes).
4-60
Implementation Notes
- Bank Conflicts: No bank conflicts occur if the input arrays x[ ] and y[ ] are
offset by 4 half-words (8 bytes).
- Interruptibility: The code is fully interruptible.
- The code is unrolled 4 times to enable full memory and multiplier
bandwidth to be utilized.
- Interrupts are masked by branch delay slots only.
- Prolog collapsing has been performed to reduce codesize.
BenchmarksCyclesnx / 4 + 14
Codesize64 bytes
DSP_dotprod
4-61 C64x+ DSPLIB Reference
DSP_maxval
Maximum Value of Vector
DSP_maxval
Function short DSP_maxval (const short *x, int nx)
Argumentsx[nx]Pointer to input vector of size nx.
nxLength of input data vector. Must be multiple of 8 and ≥32.
return shortMaximum value of a vector.
DescriptionThis routine finds the element with maximum value in the input vector and
returns that value.
AlgorithmThis is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
short DSP_maxval(short x[ ], int nx)
{
int i, max;
max = −32768;
for (i = 0; i < nx; i++)
if (x[i] > max)
max = x[i];
return max;
}
Special Requirements nx is a multiple of 8 and greater than or equal to 32.
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interrupt-tolerant but not interruptible.
BenchmarksCyclesnx / 4 + 10
Codesize116 bytes
4-62
DSP_maxidx
Index of Maximum Element of Vector
DSP_maxidx
Function int DSP_maxidx (const short *x, int nx)
Argumentsx[nx]Pointer to input vector of size nx. Must be double-word aligned.
nxLength of input data vector. Must be multiple of 16 and ≥ 48.
return intIndex for vector element with maximum value.
DescriptionThis routine finds the max value of a vector and returns the index of that value.
The input array is treated as 16 separate columns that are interleaved
throughout the array. If values in different columns are equal to the maximum
value, then the element in the leftmost column is returned. If two values within
a column are equal to the maximum, then the one with the lower index is
returned. Column takes precedence over index.
AlgorithmThis is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
int DSP_maxidx(short x[ ], int nx)
{
int max, index, i;
max = −32768;
for (i = 0; i < nx; i++)
if (x[i] > max) {
max = x[i];
index = i;
}
return index;
}
Special Requirements
- nx must be a multiple of 16 and greater than or equal to 48.
- The input vector x[ ] must be double-word aligned.
4-63 C64x+ DSPLIB Reference
DSP_maxidx
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interrupt-tolerant but not interruptible.
- The code is unrolled 16 times to enable the full bandwidth of LDDW and
MAX2 instructions to be utilized. This splits the search into 16 sub-ranges.
The global maximum is then found from the list of maximums of the
sub-ranges. Then, using this offset from the sub-ranges, the global
maximum and the index of it are found using a simple match. For common
maximums in multiple ranges, the index will be different to the above C
code.
- This code requires 40 bytes of stack space for a temporary buffer.
BenchmarksCycles5 * nx / 16 + 42
Codesize388 bytes
4-64
DSP_minval
Minimum Value of Vector
DSP_minval
Function short DSP_minval (const short *x, int nx)
Argumentsx [nx]Pointer to input vector of size nx.
nxLength of input data vector. Must be multiple of 4 and ≥20.
return shortMaximum value of a vector.
DescriptionThis routine finds the minimum value of a vector and returns the value.
AlgorithmThis is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
short DSP_minval(short x[ ], int nx)
{
int i, min;
min = 32767;
for (i = 0; i < nx; i++)
if (x[i] < min)
min = x[i];
return min;
}
Special Requirements nx is a multiple of 4 and greater than or equal to 20.
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interrupt-tolerant but not interruptible.
- The input data is loaded using double word wide loads, and the MIN2
instruction is used to get to the minimum.
BenchmarksCyclesnx / 4 +10
Codesize116 bytes
4-65 C64x+ DSPLIB Reference
DSP_mul32
32-Bit Vector Multiply
DSP_mul32
Function void DSP_mul32(const int * restrict x, const int * restrict y, int * restrict r, short
nx)
Argumentsx[nx]Pointer to input data vector 1 of size nx. Must be double-word
aligned.
y[nx]Pointer to input data vector 2 of size nx. Must be double-word
aligned.
r[nx]Pointer to output data vector of size nx. Must be double-word
aligned.
nx Number of elements in input and output vectors. Must be multiple
of 8 and ≥16.
DescriptionThe function performs a Q.31 x Q.31 multiply and returns the upper 32 bits of
the result. The result of the intermediate multiplies are accumulated into a
40-bit long register pair, as there could be potential overflow. The contribution
of the multiplication of the two lower 16-bit halves are not considered. The
output is in Q.30 format. Results are accurate to least significant bit.
AlgorithmIn the comments below, X and Y are the two input values. Xhigh and Xlow
represent the upper and lower 16 bits of X. This is the C equivalent of the
assembly code without restrictions. Note that the assembly code is hand
optimized and restrictions may apply.
void DSP_mul32(const int *x, const int *y, int *r,
short nx)
{
shorti;
inta,b,c,d,e;
for(i=nx;i>0;i−−)
{
a=*(x++);
b=*(y++);
c=_mpyluhs(a,b); /* Xlow*Yhigh */
d=_mpyhslu(a,b); /* Xhigh*Ylow */
e=_mpyh(a,b); /* Xhigh*Yhigh */
d+=c; /* Xhigh*Ylow+Xlow*Yhigh */
d=d>>16; /* (Xhigh*Ylow+Xlow*Yhigh)>>16 */
4-66
e+=d; /* Xhigh*Yhigh + */
/* (Xhigh*Ylow+Xlow*Yhigh)>>16 */
*(r++)=e;
}
}
Special Requirements
- nx must be a multiple of 8 and greater than or equal to 16.
- Input and output vectors must be double-word aligned.
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interrupt-tolerant but not interruptible.
- The MPYHI instruction is used to perform 16 x 32 multiplies to form 48-bit
intermediate results.
BenchmarksCycles9 * nx/8 + 18
Codesize512 bytes
DSP_mul32
4-67 C64x+ DSPLIB Reference
DSP_neg32
32-Bit Vector Negate
DSP_neg32
Function void DSP_neg32(int *x, int *r, short nx)
Argumentsx[nx]Pointer to input data vector 1 of size nx with 32-bit elements.
Must be double-word aligned.
r[nx]Pointer to output data vector of size nx with 32-bit elements.
Must be double-word aligned.
nxNumber of elements of input and output vectors. Must be a
multiple of 4 and ≥8.
DescriptionThis function negates the elements of a vector (32-bit elements). The input and
output arrays must not be overlapped except for where the input and output
pointers are exactly equal.
AlgorithmThis is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_neg32(int *x, int *r, short nx)
{
short i;
for(i=nx; i>0; i−−)
*(r++)=−*(x++);
}
Special Requirements
- nx must be a multiple of 4 and greater than or equal to 8.
- The arrays x[ ] and r[ ] must be double-word aligned.
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interrupt-tolerant but not interruptible.
- The loop is unrolled twice and pipelined.
BenchmarksCyclesnx/2 + 19
Codesize124 bytes
4-68
DSP_recip16
16-Bit Reciprocal
DSP_recip16
Function void DSP_recip16 (short *x, short *rfrac, short *rexp, short nx)
Argumentsx[nx]Pointer to Q.15 input data vector of size nx.
rfrac[nx]Pointer to Q.15 output data vector for fractional values.
rexp[nx]Pointer to output data vector for exponent values.
nxNumber of elements of input and output vectors.
DescriptionThis routine returns the fractional and exponential portion of the reciprocal of
an array x[ ] of Q.15 numbers. The fractional portion rfrac is returned in Q.15
format. Since the reciprocal is always greater than 1, it returns an exponent
such that:
(rfrac[i] * 2rexp[i]) = true reciprocal
The output is accurate up to the least significant bit of rfrac, but note that this
bit could carry over and change rexp. For a reciprocal of 0, the procedure will
return a fractional part of 7FFFh and an exponent of 16.
AlgorithmThis is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_recip16(short *x, short *rfrac, short *rexp, short
nx)
{
int i,j,a,b;
short neg, normal;
for(i=nx; i>0; i−−)
{
a=*(x++);
if(a<0)/* take absolute value */
{
a=−a;
neg=1;
}
else neg=0;
normal=_norm(a);/* normalize number */
a=a<<normal;
4-69 C64x+ DSPLIB Reference
DSP_recip16
}
Special Requirements None
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interruptible.
*(rexp++)=normal−15;/* store exponent */
b=0x80000000;/* dividend = 1 */
for(j=15;j>0;j−−)
b=_subc(b,a);/* divide */
b=b&0x7FFF;/* clear remainder
/* (clear upper half) */
if(neg) b=−b;/* if originally
/* negative, negate */
*(rfrac++)=b;/* store fraction */
}
- The conditional subtract instruction, SUBC, is used for division. SUBC is
used once for every bit of quotient needed (15).
BenchmarksCycles8 * nx + 14
Codesize196 bytes
4-70
DSP_vecsumsq
Sum of Squares
DSP_vecsumsq
Function int DSP_vecsumsq (const short *x, int nx)
Argumentsx[nx]Input vector
nxNumber of elements in x. Must be multiple of 4 and ≥8.
return intSum of the squares
DescriptionThis routine returns the sum of squares of the elements contained in the vector
x[ ].
AlgorithmThis is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
int DSP_vecsumsq(short x[ ], int nx)
{
int i, sum=0;
for(i=0; i<nx; i++)
{
sum += x[i]*x[i];
}
return(sum);
}
Special Requirements nx must be a multiple of 4 and greater than or equal to 32.
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interrupt-tolerant but not interruptible.
- The code is unrolled 4 times to enable full memory and multiplier
bandwidth to be utilized.
BenchmarksCyclesnx/4 + 11
Codesize188 bytes
4-71 C64x+ DSPLIB Reference
DSP_w_vec
Weighted Vector Sum
DSP_w_vec
Function void DSP_w_vec(const short * restrict x, const short * restrict y, short m, short
* restrict r, short nr)
Argumentsx[nr]Vector being weighted. Must be double-word aligned.
y[nr]Summation vector. Must be double-word aligned.
mWeighting factor
r[nr]Output vector
nrDimensions of the vectors. Must be multiple of 8 and ≥8.
DescriptionThis routine is used to obtain the weighted vector sum. Both the inputs and
output are 16-bit numbers.
AlgorithmThis is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_w_vec(short x[ ],short y[ ],short m,
short r[ ],short nr)
{
short i;
for (i=0; i<nr; i++) {
r[i] = ((m * x[i]) >> 15) + y[i];
}
}
Special Requirements
- nr must be a multiple of 8 and ≥ 8.
- Vectors x[ ] and y[ ] must be double-word aligned.
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
- Interruptibility: The code is interrupt-tolerant but not interruptible.
- Input is loaded in double-words.
- Use of packed data processing to sustain throughput.
BenchmarksCycles3 * nr/8 + 18
Codesize144 bytes
4-72
Loading...
+ hidden pages
You need points to download manuals.
1 point = 1 manual.
You can buy points or you can get point for every manual you upload.