Engineer-to-Engineer Note EE-211
a
Technical notes on using Analog Devices DSPs, processors and development tools
Contact our technical support at dsp.support@analog.com and at dsptools.support@analog.com
Or vi sit our o n-li ne r esou rces htt p:/ /www.analog.com/ee-notes and http://www.analog.com/processors
16-bit FIR Filters on ADSP-TS20x TigerSHARC® Processors
Contributed by Klas Brink and Rickard Fahlquist Rev 1 – January 13, 2004
Introduction
This document presents two assembly code
implementations with memory of direct-form
FIRs, capable of handling complex input and
output and complex or real coefficients with 16bit integer precision. These implementations
present methods of achieving high performance
while conserving memory.
General
x[i]
Figure 1. Direct-form FIR.
x[i-1]
Z-1
h[0] h[1] h[M-3] h[M-2] h[M-1]
Z-1
+ + + +
x[i-M+3]
A C pseudo-code description of the same FIR is
given below, where ** represents complex
multiplication.
Z-1
x[i-M+2]
Z-1
x[i-M+1]
y[i]
A mathematical representation in direct form of a
FIR filter is given below.
−
10M
[] [ ] [ ]
∑
=
k
⋅−=
khkixiy
i 0, 1, …, N-1
N Number of samples
M Number of filter coefficients
y[i] Output sample number i
x[i-k] Input sample number i-k
h[k] Filter coefficient number k
Equation 1. Direct-form FIR
This equation is the result from the vector inner
product between the filter coefficient vector h
and the (time) order-reversed input data vector x.
This is also known as the convolution between h
and x.
Figure 1 presents the same equation graphically.
for(i=0; i< N; i++){
y[i] = 0;
Ncoeffs = i < (M-1) ? i : (M-1);
for(k=0; k<=Ncoeffs; k++){
y[i] = y[i] + x[i-k] ** h[k];
}
}
Listing 1. C pseudo-code algorithm of direct form FIR.
Parallelism in TigerSHARC Processors
ADSP-TS20x TigerSHARC® processors are
highly parallel computing devices that have three
distinct types of parallelism:
• Latency-2 computational pipeline
• Multiple compute units
• Wide memory structure
These three forms of parallelism complement
each other, and all three must be exploited to
achieve a high level of efficiency. The
computation rate and memory bandwidth in this
machine are balanced in such a way that failing
Copyright 2004, Analog Devices, Inc. All rights reserved. Analog Devices assumes no responsibility for customer product design or the use or application of
customers’ products or for any infringements of patents or rights of others which may result from Analog Devices assistance. All trademarks and logos are property
of their respective holders. Information furnished by Analog Devices Applications and Development Tools Engineers is believed to be accurate and reliable, however
no responsibility is assumed by Analog Devices regarding technical accuracy and topicality of the content provided in Analog Devices’ Engineer-to-Engineer Notes.
a
to pay attention to one of the three components
of parallelism may result in sub-optimal
performance.
16-Bit Integer FIR with Complex
Taps, Input and Output Data
Introduction
ADSP-TS20x TigerSHARC processors support
two complex multiplications (one in each
compute block) per core clock cycle along with
simultaneous data transfers. Filter calculations
like Equation 1 can, of course, be implemented
in a straightforward sequential fashion using one
(inner) loop for the summation, each iteration
performing a multiplication between an (old)
input sample and a coefficient, and adding that to
the output of the last iteration, and another
(outer) loop going through the same procedure
over and over again to produce the output
samples. However, using the parallel features
and high internal bandwidth of ADSP-TS20x
TigerSHARC processors, achieves higher
performance.
Pipelining and Parallel Resources Utilization
The FIR representations show that most data
used to compute output y[i] are the same as the
ones used to compute y[i+1]. The same applies
for y[i+1] when it comes to y[i+2] and so on. The
‘outer’ loop in the C pseudo-code performs all
the necessary steps to calculate all the output
samples. Unrolling this outer loop gains three
things:
1. Data can be reused between calculations of
different output samples.
2. MAC operations can be parallelized.
3. The effects of loop overhead are reduced.
Reducing loop overhead is not to be neglected as
this decreases the time required to perform the
conditional branching necessary for looping,
thereby increasing the percentage of time
available to perform the actual calculations.
Unrolling the outer loop four times yields an
algorithm described by the following C pseudocode:
for(i=0; i< N; i+=4){
y[i] = 0;
y[i+1] = 0;
y[i+2] = 0;
y[i+3] = 0;
Ncoeffs = i < (M-1) ? i : (M-1);
for(k=0; k<=Ncoeffs; k++){
y[i] = y[i] + x[i-k] **h[k];
y[i+1] = y[i+1]+ x[i+1-k]**h[k];
y[i+2] = y[i+2]+ x[i+2-k]**h[k];
y[i+3] = y[i+3]+ x[i+3-k]**h[k];
}
}
Listing 2. Outer loop unrolled 4 times.
What we have done so far is reuse the
coefficients. By also unrolling the ‘inner’ loop,
we achieve reuse of input data as well. Unrolling
of the inner loop by four gives the C pseudo-code
in Listing 3.
for(i=0; i< N; i+=4){
y[i] = 0;
y[i+1] = 0;
y[i+2] = 0;
y[i+3] = 0;
Ncoeffs = i < (M-1) ? i : (M-1);
for(k=0; k<=Ncoeffs; k+=4){
y[i] = y[i] + x[i-k] **h[k];
y[i] = y[i] + x[i-1-k]**h[k+1];
y[i] = y[i] + x[i-2-k]**h[k+2];
y[i] = y[i] + x[i-3-k]**h[k+3];
y[i+1] = y[i+1]+ x[i+1-k]**h[k];
y[i+1] = y[i+1]+ x[i-k] **h[k+1];
y[i+1] = y[i+1]+ x[i-1-k]**h[k+2];
y[i+1] = y[i+1]+ x[i-2-k]**h[k+3];
y[i+2] = y[i+2]+ x[i+2-k]**h[k];
y[i+2] = y[i+2]+ x[i+1-k]**h[k+1];
y[i+2] = y[i+2]+ x[i-k] **h[k+2];
y[i+2] = y[i+2]+ x[i-1-k]**h[k+3];
y[i+3] = y[i+3]+ x[i+3-k]**h[k];
y[i+3] = y[i+3]+ x[i+2-k]**h[k+1];
y[i+3] = y[i+3]+ x[i+1-k]**h[k+2];
y[i+3] = y[i+3]+ x[i-k] **h[k+3];
}
}
Listing 3. Outer and inner loops unrolled 4 times.
16-bit FIR Filters on ADSP-TS20x TigerSHARC® Processors (EE-211) Page 2 of 10
a
We now have a high level of data reuse and a
possibility to parallelize calculations. What is not
so obvious in the C pseudo-code is that we also
have the possibility to do pipelining (i.e., fetch
data concurrent with performing the
calculations).
Data Partitioning
Complex 16-bit data is represented in ADSPTS20x TigerSHARC processors by a 32-bit word
as shown in Figure 2.
31 16 15 0
Imaginary Real
Figure 2. Complex 16-bit data representation.
Input and Coefficient Buffer Structure
The input samples and coefficients are stored in
memory as described in Figure 3.
Input
x[3] x[2] x[1] x[0]
x[7] x[6] x[5] x[4]
31
0
Address
0xHHHH
0xHHHH + 4
word aligned storage is used when writing the
output samples to memory, and j4 keeps track of
the current position to be written.
Output
y[3] y[2] y[1] y[0]
y[7] y[6] y[5] y[4]
…
j4
Address
0 31
0xKKKK
0xKKKK + 4
Figure 4. Output storage in memory.
Delay Line Structure
The filter has memory in which it stores a history
of the last M input samples used by the filter.
This history is called the delay line. The samples
from the delay line are quad-word loaded, and
Figure 5 shows how they are stored.
Delay line Address
Circular
uffe
{
x[i-3] x[i-2]x[i-1]
0 31
x[i-M] x[i-M+1] x[i-M+2]x[i-M+3]
x[i-M+4] x[i-M+5] x[i-M+6]x[i-M+7]
…
x[i-4]
0xLLLL – M + 4
0xLLLL – M + 8
j0
0xLLLL
…
j5
Coefficients
h[3] h[2] h[1] h[0]
h[7] h[6] h[5] h[4]
…
k1
Figure 3. Input and coefficient storage in memory.
31
0
0xGGGG
0xGGGG + 4
The addresses are quad-word aligned. This type
of data storage enables quad-word loading,
which is used for the input samples. Quad-word
loading is used also for the coefficients. J5 points
to the position from where we are currently
loading input samples, and k1 points to the
current coefficient loading position.
Output Buffer Structure
The two compute blocks (CBX and CBY) each
calculate two output samples every outer loop
iteration. CBX produces samples y[i+3] and
y[i+1], and CBY produces y[i+2] and y[i]. Quad-
Figure 5. Delay line in memory.
The delay line is implemented as a circular
buffer with j0 as a pointer to the current
position/index in the buffer.
Delay line Address
Circular
uffe
{
x[i-3] x[i-2]x[i-1]
0 31
x[i] x[i+1] x[i+2]x[i+3]
x[i-M+4] x[i-M+5] x[i-M+6]x[i-M+7]
…
x[i-4]
j0
0xLLLL
0xLLLL – M + 8
0xLLLL
M + 4
Figure 6. Delay line after update.
Old samples are read from the delay line starting
at the position indicated by j0 and counting
backwards, wrapping at the circular buffer
boundaries. Assuming that Figure 5 shows the
16-bit FIR Filters on ADSP-TS20x TigerSHARC® Processors (EE-211) Page 3 of 10